*By Insu Song, Bryan Anselme, Purnedu Mandal and John Vong*

*Abstract: Short- and medium-term prediction of stock prices has long been an important problem in financial analysis. Various approaches have been used, including statistical analysis, fundamental analysis, and, more recently, machine learning and data mining techniques. However, most existing algorithms do not incorporate all available market information. Using more informative and relevant data should make prediction results better reflect market reality. It should also reduce the prediction inaccuracy caused by randomness in stock prices, because predictions would rely on a trend rather than on the variation of a single stock price. For instance, some stock prices are correlated with or dependent on each other and on market mood. In this paper, we review existing techniques for stock price and time series prediction, as well as classification and clustering methods. Based on the literature analysis, we propose a method for incorporating related stock trend information: clustering related companies using machine learning approaches. We report preliminary analysis results using monthly adjusted closing prices of 100 companies collected over a 15-month period.*

Keywords – Clustering, financial analyses, stock price prediction, prediction, classification

# 1.0 Introduction

Today most of the global monetary mass is invested in financial markets, in coupons, debt financing, raw materials, futures, or stocks. Optimizing investment strategies in these markets has become one of today’s most important research topics. The most common methodologies are: portfolio management, which aims to reduce investment risk by diversifying the range of investments [1]; arbitrage, which detects price anomalies in order to obtain a risk-free profit; pricing, which calculates the real value of stocks; and finally standard trading, which has two major types, the fundamental [2] and the technical/quantitative [3] analysis approaches. The technical/quantitative analysis approaches use mathematical tools to predict trends, discover patterns for machine-based trading, and predict short- and medium-term trends. The fundamental analysis approaches use micro- and macroeconomic indicators, news, and the financial data of companies to predict medium- and long-term trends for the stocks concerned.

Recent advances in data mining technologies, such as clustering [4], ANN (Artificial Neural Network) [5], SVM (Support Vector Machines) [6], decision trees [7, 8], and rough set theory [9], have opened up new approaches that allow analysts to incorporate more relevant information [10]. These technologies allow analysts to consider much larger amounts of data and to build prediction models automatically using computers with less training [10].

In this paper, we review existing techniques of financial analysis and machine learning approaches in order to identify new opportunities. In addition, we propose a new method for predicting stock prices more accurately using clustering approaches. This proposal is based on the fact that many stock prices are correlated, and awareness of those correlations can allow us to improve previous models. It allows us to take more data into account and to diversify the sources of that data in order to reduce the inaccuracy of predictions. In fact, using only the historical data of a single stock price is sometimes very risky and can lead to poor forecasts.

The rest of the paper is organized as follows. In the next section, we review existing techniques of financial analysis and machine learning approaches, and then report on the opportunities identified for researchers. In Section 3, we propose a new clustering method that determines the optimal number of clusters of related companies. In Section 4, we report the analysis results, and we conclude the paper with remarks in Section 5.

# 2.0 Review of Existing Financial Analysis Methods

## 2.1 Data Analysis for Market Prediction

### 2.1.1 Financial Analyses

Fundamental analyses use a company’s financial and operational information to predict the future financial state of the company. This information includes R&D resource allocations, growth margin, and other financial statements [2]. Analysts often compare this information with that of other companies in the same sector to assess the relative values of companies in the sector. The information is also used to assess the future financial trends of the sector as a whole. It has an impact on medium- to long-term predictions, but not on short-term predictions, because fundamental analysis does not really take historical stock prices into account. Today, this type of analysis is very common. However, it requires good economic and accounting knowledge, and trained analysts are expensive to acquire.

### 2.1.2 Technical and Quantitative Analyses

Technical and quantitative financial analyses use historical market prices and indexes to predict future stock prices. These methods are often used for short-term prediction to decide whether to buy, hold, or sell stocks [3]. Technical analysis employs the tools of geometry and pattern recognition. Quantitative finance, on the other hand, employs the tools of mathematical analysis, such as probability and statistics. These techniques are still popular because quantitative analysis is easily automated and is now mostly performed by computers, while pattern recognition remains an area where humans can still compete with machines. These methods mostly rely on time series analysis.

### 2.1.3 Time Series Model

One of the popular methods of predicting the short-term future financial values of stocks and companies is time series analysis of historical financial data. The Moving Average (MA) of the financial data (e.g., stock prices) is commonly used in time series modeling [3]. Autoregressive (AR) modeling is widely used as the basic tool [6] [11] for building prediction models. An AR model represents the dependency of each term on previous terms in time series data. The random noise term in AR models represents the randomness in stock price movements [6]. AR is often combined with other models to create new models, such as the AutoRegressive Moving-Average (ARMA) model [6]. The ARMA model was later extended [12] [13] to create more complex models, such as ARIMA (autoregressive integrated moving average) and ARFIMA (autoregressive fractionally integrated moving average), which have been shown to provide good prediction results. Recently, ANN (Artificial Neural Network) has also been used to generate prediction models automatically without relying on mathematical models such as AR [14]. The advantage of these latest methods is that they require less historical data and that the prediction model can be improved incrementally.
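As a rough illustration of the two basic building blocks mentioned above, the following sketch computes a simple moving average and fits a first-order autoregressive model, x_t = c + phi * x_{t-1} + noise, by ordinary least squares. The function names, the choice of order 1, and the price values are assumptions made for this example; they are not taken from the cited works.

```python
from statistics import mean

def moving_average(values, window):
    """Simple moving average: one value per full window of `window` points."""
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

def fit_ar1(values):
    """Least-squares fit of x_t = c + phi * x_{t-1} + noise."""
    x_prev, x_next = values[:-1], values[1:]
    mean_prev, mean_next = mean(x_prev), mean(x_next)
    variance = sum((a - mean_prev) ** 2 for a in x_prev)
    covariance = sum((a - mean_prev) * (b - mean_next)
                     for a, b in zip(x_prev, x_next))
    phi = covariance / variance
    c = mean_next - phi * mean_prev
    return c, phi

def forecast_ar1(values, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * values[-1]

# Made-up monthly prices, for illustration only.
prices = [10.0, 10.4, 10.1, 10.8, 11.0, 10.7, 11.3, 11.6]
ma3 = moving_average(prices, 3)
c, phi = fit_ar1(prices)
next_price = forecast_ar1(prices, c, phi)
```

Higher-order AR models and ARMA/ARIMA variants follow the same idea with more lagged terms; in practice they are usually fitted with a statistics library rather than by hand.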

## 2.2 Clustering Approaches

We review modern data mining approaches, starting with clustering. Clustering, as the name suggests, groups similar objects based on some similarity measure. Its objective is to discover hidden patterns automatically from abundant data. It is an unsupervised learning approach because, unlike classification or prediction approaches, it does not require training datasets. Clustering is also often used to detect interesting outliers, remove noise, and explore data. One popular algorithm is K-means, which was published in the late 1970s [15]. It takes K, the number of clusters to be discovered, and partitions the data into K groups of similar objects. Its computational complexity is linear in the size of the data, but it can only discover convex-shaped clusters, is affected by outliers and noise, and can only handle numerical data.

To overcome these limitations, various other clustering algorithms have been proposed. K-prototype [16] extends K-means so that it can handle different types of data. Density-based spatial clustering of applications with noise (DBSCAN) [4] can determine the number of clusters automatically and find arbitrarily shaped clusters, but it takes two other parameters (neighborhood distance and minimum number of neighbors), which must be supplied by analysts. Agglomerative hierarchical clustering (AGNES) and DIvisive ANAlysis Clustering (DIANA) [17] output a hierarchy of clusters, in which individual objects are clusters of their own at the bottom of the hierarchy and all objects form one group at the top. Analysts can then determine the number of clusters afterwards. All of these approaches share one drawback: they are query dependent. STatistical INformation Grid (STING) [18] overcomes this limitation by summarizing the entire dataset using a grid. Each cell in the grid contains a statistical summary of the objects falling in that cell. However, it can only detect rectangular clusters.
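To make the K-means idea concrete, here is a minimal sketch of the standard assign-and-update iteration (Lloyd-style), which is not necessarily the exact variant described in [15]. The function names, the random initialization from the data points, and the toy points are assumptions chosen for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Partition `points` into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the iteration has converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

# Two well-separated toy groups, for illustration only.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(toy_points, k=2)
```

The sketch shows why K must be supplied up front and why the method favors compact, convex groups: every point is simply attached to its nearest mean.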

## 2.3 Problems of Existing Approaches and Opportunities

As the literature described above shows, there are many approaches to financial analysis. They use a wide range of tools, such as ANN, clustering, decision trees, and rough sets, and deal with many different types of data. We list here some of the main problems of existing financial prediction approaches:

- There is a staggering number of different approaches and algorithms. They vary in whether the algorithm applies to a wide range of companies, sectors, or financial markets; in the number of variables taken as inputs; in the prediction horizon, from a few seconds ahead to months; and in the pretreatment applied to the data.
- A perfectly accurate prediction is impossible because of the inherent randomness in stock prices.
- Most existing methods rely on a small subset of the available information and focus on optimizing models for a select few attributes. The approaches fall into two classes: fundamental analysis approaches [2], which focus on account reviews and macro/micro-economic figures; and technical/quantitative analysis approaches [3], which focus on stock prices. For example, well-known prediction approaches mostly rely on moving averages of stock prices, such as the AR (autoregressive) model [6] [11], ARMA [6], ARIMA and ARFIMA [12] [13], and time series analysis methods using ANN (Artificial Neural Network) [14].
- New approaches based on data mining techniques automate some of the tasks of analysts with many more parameters and more processing power [8], but few use large amounts of data to establish a prediction model.

Given that a single stock price is subject to a lot of randomness, relying on it alone can be considered as using poor-quality data or as not having a sufficient amount of data for prediction tasks. From the limitations of existing approaches listed above, we see a great opportunity for researchers to develop better prediction models by using clustering approaches. Clustering can help analysts use the vast amount of information available on the Internet to group stocks that are highly correlated, and then either predict groups of related stocks or spread information across related stocks to fill the information gaps of individual stocks.
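As a small illustration of what "highly correlated" can mean in practice, the sketch below computes the Pearson correlation between the return series of each pair of stocks and lists the pairs above a threshold. The ticker symbols, return values, and the 0.8 threshold are hypothetical; the source does not prescribe a particular correlation measure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlated_pairs(returns, threshold=0.8):
    """Pairs of symbols whose return series correlate at or above threshold."""
    symbols = sorted(returns)
    return [
        (a, b)
        for i, a in enumerate(symbols)
        for b in symbols[i + 1:]
        if pearson(returns[a], returns[b]) >= threshold
    ]

# Hypothetical monthly returns, for illustration only.
returns = {
    "AAA": [0.02, -0.01, 0.03, 0.01, -0.02],
    "BBB": [0.04, -0.02, 0.06, 0.02, -0.04],  # moves together with AAA
    "CCC": [-0.01, 0.02, -0.03, 0.00, 0.01],  # moves against AAA
}
pairs = correlated_pairs(returns)
```

Pairwise thresholds like this do not scale well or produce clean groups, which is one motivation for using a clustering algorithm instead.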

This clustering-based approach will result in three improvements. The first is an improvement to the portfolio elaboration methodology used in [1]. The second is an improvement to current prediction tools, achieved by comparing the forecasted stock price with the average evolution of stock prices in the cluster to which the stock belongs. The third is the automatic construction of a prediction model: by studying the relations and influences between the different clusters of companies, a classifier can be built to predict whether a stock price will go up or down.

# 3.0 Methodology

In this section, we present a method of determining an optimal number of clusters of related companies. One of the challenges of clustering is determining the ideal number of groups of companies. AGNES outputs a hierarchy of clusters for analysts to investigate, but does not provide a good measure for determining the optimal number of clusters. Therefore, we calculate a clustering criterion at each level of the hierarchy of clusters and analyze the dynamics of changes in the cluster configuration. This is measured by taking the first derivative of the clustering criterion. We use SSE (sum of squared errors) as the clustering criterion. Given that SSE contains quadratic components of the attributes of the companies being analyzed, SSE will decrease exponentially and gradually as the number of clusters increases and the size of the clusters shrinks. However, when an optimal configuration is found in the process, SSE will change more dramatically. These dramatic changes can be observed in the first derivative of the SSE curve.
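One way to write the criterion down, assuming that each cluster center is the mean of its members and that Euclidean distance is used (the text does not state either explicitly), is given below. The symbols $C_k$, $\mu_k$, and $x_i$ are notation introduced here for illustration, and because K is discrete the "first derivative" is taken as a first difference; the sign and indexing convention is likewise an assumption.

```latex
\mathrm{SSE}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i

\Delta \mathrm{SSE}(K) = \mathrm{SSE}(K) - \mathrm{SSE}(K+1)
```

Under this convention, $\Delta \mathrm{SSE}(K)$ is the increase in SSE caused by the merge that reduces the hierarchy from $K+1$ to $K$ clusters, so an unusually large value marks a merge that joined comparatively dissimilar groups.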

To test the proposed method, we collected monthly adjusted closing prices of 100 companies over a 15-month period. First, we downloaded all company symbol data from the NASDAQ website. We then downloaded financial data of 3,000 companies from the Yahoo finance web service. Among the 3,000 companies, we selected 165 companies, as most of the companies had various missing attribute values. The result is a set of historical monthly adjusted closing prices of the companies over the 15-month period (from Feb. 2014 to Apr. 2015). The data set was then preprocessed as follows. First, we computed the monthly return rates. We then normalized the monthly return rates so that all variables would have the same importance for the clustering task. The proposed method was applied to this data.
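The two preprocessing steps can be sketched as follows. The text does not specify the normalization scheme, so this example assumes a z-score applied to each month (column) across companies; the function names and price values are hypothetical.

```python
from statistics import mean, pstdev

def monthly_returns(prices):
    """Simple return rate between consecutive monthly prices."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]

def normalize_columns(rows):
    """Z-score each column (one column per month) across all rows (companies).

    Assumption: z-scoring is one possible way to give every variable the
    same importance; a column with zero spread is mapped to 0.0.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, (m, s) in zip(row, stats)]
        for row in rows
    ]

# Made-up adjusted closing prices for three hypothetical companies.
prices_by_company = {
    "AAA": [100.0, 110.0, 99.0, 105.0],
    "BBB": [50.0, 51.0, 49.0, 52.0],
    "CCC": [20.0, 19.0, 21.0, 21.5],
}
return_rows = [monthly_returns(p) for p in prices_by_company.values()]
features = normalize_columns(return_rows)
```

Each row of `features` is then one company's vector of normalized monthly returns, which is the kind of input a clustering step would consume.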

# 4.0 Experimental Results

For the clustering process, we used AGNES with average linkage. We chose this algorithm because AGNES outputs a hierarchy of clusters (multiple levels of clusters). At each level of the hierarchy, we calculate the SSE of that step. Figure 1 shows the plot of SSE over the number of clusters. In the figure, we see that the SSE increases exponentially and gradually as the number of clusters decreases. This is because the clusters grow bigger, with more points in them, and those points lie farther and farther from the cluster centers. That explains the exponential growth of the SSE.
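The procedure described above can be sketched in a few lines: start with every point as its own cluster, repeatedly merge the two clusters with the smallest average pairwise distance, and record the SSE after each merge. This is a naive illustration under assumed Euclidean distance and mean-based cluster centers, not the authors' implementation, and the toy points are made up.

```python
from math import sqrt

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        center = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(squared_distance(p, center) for p in members)
    return total

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters."""
    return sum(sqrt(squared_distance(p, q)) for p in a for q in b) / (len(a) * len(b))

def agnes_sse(points):
    """Bottom-up merging; returns {number of clusters: SSE at that level}."""
    clusters = [[p] for p in points]
    sse_by_k = {len(clusters): sse(clusters)}
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Pick the pair of clusters that is closest under average linkage.
        i, j = min(pairs, key=lambda ij: average_linkage(clusters[ij[0]],
                                                         clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        sse_by_k[len(clusters)] = sse(clusters)
    return sse_by_k

# Two tight pairs far apart, for illustration only.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
sse_by_k = agnes_sse(toy_points)
```

On this toy data the SSE stays small while the tight pairs are merged and jumps sharply at the final merge, which is the kind of behavior the SSE curve is meant to expose.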

Next, we calculate the first derivative of the SSE values and use its plot to identify optimal values of K (the number of clusters). Figure 2 shows the plot of the first derivative. In Figure 2, we see several high spikes, which indicate sudden changes in cluster configuration and therefore greater information gains at those levels. We use these spikes to identify optimal Ks. Table 1 shows the clustering configurations of the optimal Ks.
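A possible way to automate the spike detection is sketched below: take the first difference of SSE over consecutive values of K and flag local maxima. The local-maximum rule, the function names, and the SSE values are assumptions for illustration; the text identifies spikes from the plot and does not state whether a spike is attributed to K or to K+1.

```python
def first_difference(sse_by_k):
    """Difference SSE(K) - SSE(K+1) for every K whose successor is present.

    This is the growth in SSE caused by the merge that leaves K clusters.
    """
    return {
        k: sse_by_k[k] - sse_by_k[k + 1]
        for k in sorted(sse_by_k)
        if k + 1 in sse_by_k
    }

def local_peaks(diff_by_k):
    """Values of K whose difference exceeds both neighbouring differences.

    Assumes the keys are consecutive integers; endpoints are never flagged.
    """
    ks = sorted(diff_by_k)
    return [
        k
        for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if diff_by_k[k] > diff_by_k[prev] and diff_by_k[k] > diff_by_k[nxt]
    ]

# Made-up SSE values per number of clusters, for illustration only.
example_sse = {1: 50.0, 2: 20.0, 3: 18.0, 4: 8.0, 5: 7.0, 6: 6.5}
differences = first_difference(example_sse)
candidate_ks = local_peaks(differences)
```

A stricter rule, for example requiring a peak to exceed its neighbours by some factor, would flag fewer candidates; the right threshold depends on how noisy the SSE curve is.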

Figure 1. SSE (sum of squared errors) over K/100, where K is the number of clusters, running from 100 down to 1 because AGNES is a bottom-up hierarchical clustering approach.

Figure 2. First derivative of SSE over K/100. The sharp peaks are used to identify optimum Ks.

Table 1: Configuration of optimum number of clusters

| Optimal K (number of clusters) | Number of points in each cluster |
|---|---|
| 39 | 4 1 2 2 4 1 1 1 1 3 1 4 1 3 1 1 1 1 2 1 1 1 2 1 10 1 5 6 7 2 5 2 6 4 1 1 2 1 6 |
| 35 | 1 1 2 1 5 4 1 5 1 2 1 1 1 3 9 2 5 1 1 2 6 8 1 1 3 2 1 6 3 1 5 8 3 1 2 |
| 27 | 1 3 5 5 1 1 9 1 12 1 1 1 1 2 18 1 4 1 3 9 1 2 2 1 3 2 9 |
| 17 | 1 1 25 2 2 5 21 3 8 1 1 10 12 1 4 1 2 |
| 6 | 36 9 9 1 1 44 |
| 5 | 1 1 1 78 19 |
| 3 | 81 3 16 |

# 5.0 Conclusion

By reviewing traditional and modern approaches to financial prediction, we have identified problems of existing approaches as well as opportunities. Instead of focusing on small, well-known pieces of information and trying to optimize existing models, we could start to find methods of incorporating the ever-growing data available on the Internet with the help of automated data processing and data mining approaches. We proposed a method of grouping similar companies using clustering approaches. We should note that the clustering is meant to use all relevant information available on the Internet, not just a select few well-known parameters, such as industry sectors. The immediate benefits of this approach are twofold: (a) information about some companies in a group can be applied to other companies in the same group, removing the need to collect information for all companies; (b) prediction results for companies in a group can be applied to other companies in the group. Finally, we proposed a novel method of determining the optimal number of clusters for the AGNES clustering algorithm applied to companies' monthly return rates.

# References

[1] P. Paranjape-Voditel and U. Deshpande, “A stock market portfolio recommender system based on association rule mining,” *Applied Soft Computing*, vol. 13, pp. 1055-1063, 2013.

[2] B. Lev and S. R. Thiagarajan, “Fundamental information analysis,” *Journal of Accounting Research*, pp. 190-215, 1993.

[3] A. W. Lo, H. Mamaysky, and J. Wang, “Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation,” National Bureau of Economic Research, 2000.

[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in *KDD*, 1996, pp. 226-231.

[5] L. Xi, H. Muzhou, M. H. Lee, J. Li, D. Wei, H. Hai, *et al.*, “A new constructive neural network method for noise processing and its application on stock market prediction,” *Applied Soft Computing*, vol. 15, pp. 57-66, 2014.

[6] Z. Zhang, M. Li, and R. Bai, “An integrated model for stock price prediction based on SVM and ARMA,” *Journal of Theoretical & Applied Information Technology*, vol. 44, 2012.

[7] T.-S. Chang, “A comparative study of artificial neural networks, and decision trees for digital game content stocks price prediction,” *Expert Systems with Applications*, vol. 38, pp. 14846-14851, 2011.

[8] J. Patel, S. Shah, P. Thakkar, and K. Kotecha, “Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques,” *Expert Systems with Applications*, vol. 42, pp. 259-268, 2015.

[9] C.-H. Cheng, T.-L. Chen, and L.-Y. Wei, “A hybrid model based on rough sets theory and genetic algorithms for stock price forecasting,” *Information Sciences*, vol. 180, pp. 1610-1629, 2010.

[10] R. Dass, “Data mining in banking and finance: a note for bankers,” *Indian Institute of*, 2006.

[11] G. S. Atsalakis and K. P. Valavanis, “Surveying stock market forecasting techniques–Part II: Soft computing methods,” *Expert Systems with Applications*, vol. 36, pp. 5932-5941, 2009.

[12] A. A. Karia, I. Bujang, and I. Ahmad, “Fractionally integrated ARMA for crude palm oil prices prediction: case of potentially overdifference,” *Journal of Applied Statistics*, vol. 40, pp. 2735-2748, 2013.

[13] G. C. Aye, M. Balcilar, R. Gupta, N. Kilimani, A. Nakumuryango, and S. Redford, “Predicting BRICS Stock Returns Using ARFIMA Models,” 2012.

[14] M. Khashei and M. Bijari, “An artificial neural network (p, d, q) model for timeseries forecasting,” *Expert Systems with Applications*, vol. 37, pp. 479-489, 2010.

[15] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” *Applied Statistics*, pp. 100-108, 1979.

[16] D.-T. Pham, M. M. Suarez-Alvarez, and Y. I. Prostov, “Random search with k-prototypes algorithm for clustering mixed datasets,” in *Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences*, 2011, pp. 2387-2403.

[17] A. T. Y. Musetti, “Clustering methods for financial time series,” *Swiss Federal Institute of Technology*, 2012.

[18] W. Wang, J. Yang, and R. Muntz, “STING: A statistical information grid approach to spatial data mining,” in *VLDB*, 1997, pp. 186-195.