Characteristics of Principal Components in Stock Price Correlation

The following methods are used to analyze correlations among stock returns. 1) The meaningful part of the correlation is obtained by applying random matrix theory to the equal-time cross-correlation matrix of asset returns. 2) Null-model randomness is implemented via rotational random shuffling. 3) Principal component analysis and Helmholtz-Hodge decomposition are used to extract leading and lagging relationships among assets from the complex correlation matrix constructed from the Hilbert-transformed data set of asset returns. These methods are applied to price data for 445 assets from the S&P 500 from 2010 to 2019 (2,510 business days). Additional analysis and discussion clarify key aspects of leading and lagging relationships among business sectors in the market. Numerical investigation of this dataset suggests that leading and lagging relationships among business sectors may depend on gross market conditions.


INTRODUCTION
The analysis of big data can reveal novel aspects of nature and society. However, data often contain noise, making it necessary to distinguish the signal from the noise. Principal component analysis (PCA), independent component analysis, machine learning, and other techniques have been applied to extract the meaningful components of various datasets. About 20 years ago, random matrix theory (RMT) was introduced to distinguish the meaningful components of a dataset from the noise. [1,2] developed a "null-hypothesis" test based on RMT. In particular, they compared the properties of an empirical equal-time cross-correlation matrix to those of a random matrix and considered deviations from the random matrix case to suggest the presence of meaningful information. They compared the distribution of eigenvalues of this empirical cross-correlation matrix with the Marčenko-Pastur distribution [3], which is theoretically derived from so-called random Wishart matrices. They considered the eigenvector corresponding to the largest eigenvalue to represent the "market" itself. They also compared the distributions of the components of eigenvectors with the Porter-Thomas distribution [4], finding that the eigenvector corresponding to the largest eigenvalue differed remarkably from the Porter-Thomas distribution. [5] confirmed the findings of [1,2]: the meaningful part represents a market mode and group structures, such as industry categories and stocks with large market capitalization. [6] applied RMT to the equal-time cross-correlation matrix of assets listed on the first division of the Tokyo Stock Exchange (TSE). [7] clarified the structure of the meaningful part of the equal-time cross-correlation matrix of assets listed on the New York Stock Exchange (NYSE).
[8] investigated the empirical equal-time cross-correlation matrix of stock price fluctuations on the National Stock Exchange of India, finding that this emerging market exhibited strong correlations in the movements of stock prices compared to developed markets such as the NYSE. [9] analyzed the empirical equal-time cross-correlation matrix of stock price fluctuations on the Tehran stock exchange and in the Dow Jones Industrial Average (DJIA), showing that the DJIA is more sensitive to global perturbations. [10] investigated the structures of networks constructed from principal components of the empirical equal-time cross-correlation matrices of stock price fluctuations on the Tehran stock exchange and in the DJIA. [11] constructed an autocorrelation matrix of a time series and analyzed it based on the random-matrix theory approach and fractional Gaussian noises.
[5] constructed a "filtered" cross-correlation matrix from eigenvalues and eigenvectors outside the random matrix bound and applied this cross-correlation matrix to portfolio optimization [12]. Their results show that the predicted risk was much closer to the realized risk than in traditional portfolio optimization. [13] applied this portfolio optimization method to the stocks listed on the first division of the TSE and showed that the performance of the portfolio constructed by this method was usually better than that of a market index such as TOPIX. [14] extended this portfolio optimization method to a case involving short sales of stocks.
RMT is a powerful method for distinguishing meaningful components from noise in financial time-series data. The null hypothesis of randomness in this method assumes randomness in both cross-correlation and autocorrelation. However, the autocorrelation of stock returns cannot be considered random (for example, see [15]). Thus, a method is needed that preserves autocorrelation but randomizes cross-correlation. [16,17] developed such a method, referred to as rotational random shuffling (RRS). In RRS, empirical time-series data are shuffled rotationally in the time direction with a periodic boundary condition imposed. Therefore, equal-time cross-correlation matrices constructed from RRS time series preserve almost all the autocorrelation information of each time series while randomizing cross-correlation. By comparing the distribution of eigenvalues of this RRS cross-correlation matrix with that of the empirical cross-correlation matrix, meaningful components and noise can be successfully distinguished.
It is natural to consider the application of RMT to different-time cross-correlation matrices. [18] introduced so-called complex Hilbert principal component analysis (CHPCA), in which the cross-correlation matrix is defined in the complex space. The components of the eigenvectors of the complex cross-correlation matrix are distributed in the complex plane, allowing lead-lag relationships between components to be recognized from the difference in angle between them. [19] applied CHPCA to a time-series data set for 483 assets representing the S&P 500 from 2008 to 2011 (1,009 business days) and constructed a correlation network in which pairs of assets with phase differences below a certain threshold were weighted based on correlation strength. [20] explored data from 1990 to 2012 for foreign exchanges and stock markets in 48 countries using CHPCA and extracted a significant lead-lag relationship between the markets. [21] applied CHPCA to time-series data for assets listed on the NYSE from 2005 to 2014 and clarified lead-lag relationships among stocks, investment trusts, real estate investment trusts (REITs), and exchange traded funds (ETFs). [22,23] applied CHPCA to the early warning indicators of financial crises proposed by the Bank of Japan and explored changes in lead-lag relationships between indices before and after financial crises.
When applying CHPCA to time-series data, we need to explicitly extract the lead-lag relationship between the time series. [24,25] and [26] applied the Helmholtz-Hodge decomposition (HHD) to extract circular and gradient flows in a complex network. [27] applied CHPCA and HHD to monthly time series of 57 US macroeconomic indicators and five trade/money indexes, confirming statistically significant co-movements among these time series and identifying noteworthy economic events. [28] summarized CHPCA, RRS, and HHD and applied these methods to economic time-series data.
The purpose of the present paper is twofold. The first is to introduce recently developed methods for analyzing stock return correlations. The second is to highlight a novel aspect of leading and lagging relations of business sectors in the market. In Section 2, log returns of stock prices are defined, and an empirical equal-time cross-correlation matrix is constructed for 445 assets from the S&P 500 from 2010 to 2019 (2,510 business days). The eigenvalues and eigenvectors of this cross-correlation matrix are then calculated, and RMT and RRS are applied to distinguish the meaningful part from the noise. Furthermore, it is shown that the eigenvector corresponding to the largest eigenvalue represents the market mode and that the remaining meaningful components represent group modes. In Section 3, the dataset is investigated using CHPCA, RRS, and HHD, and lead-lag relationships among assets are discussed. In Section 4, an application of CHPCA to portfolio theory is sketched. Section 5 is devoted to summary and discussion.

APPLICATION OF RMT AND RRS
In this section, the equal-time cross-correlation matrix is defined. RMT is then applied to distinguish the meaningful components from the noise components. After that, RRS is introduced as a second way to distinguish the meaningful components from the noise components.

Equal-Time Cross-Correlation Matrix
This paper investigates data for 445 assets from the S&P 500 for 2010-2019 (2,510 business days). Denoting the opening price of stock $n$ on day $t$ as $o_n(t)$ and the closing price of stock $n$ on day $t$ as $c_n(t)$, the daily log return of stock $n$ on day $t$ is defined as
$$r_n(t) = \ln c_n(t) - \ln o_n(t), \tag{1}$$
where $\ln$ represents the natural logarithm. Here, $n = 1, 2, \ldots, N = 445$, and $t = 1, 2, \ldots, T = 2510$. For each stock $n$, the time average of $r_n(t)$ is denoted as $\langle r_n \rangle$, and the standard deviation of $r_n(t)$ is denoted as $\sigma_n$. These are defined by
$$\langle r_n \rangle = \frac{1}{T}\sum_{t=1}^{T} r_n(t), \qquad \sigma_n^2 = \frac{1}{T}\sum_{t=1}^{T}\left[r_n(t) - \langle r_n \rangle\right]^2. \tag{2}$$
The normalized log return of asset $n$ is denoted as $w_n(t)$ and defined by
$$w_n(t) = \frac{r_n(t) - \langle r_n \rangle}{\sigma_n}. \tag{3}$$
Thus, a component of the equal-time cross-correlation matrix is defined by
$$C_{mn} = \frac{1}{T}\sum_{t=1}^{T} w_m(t)\, w_n(t). \tag{4}$$
The left panel of Figure 1 depicts the equal-time cross-correlation matrix. In this figure, the shade indicates the strength of the correlation. White corresponds to $C_{nn} = 1$, with darker shades representing weaker correlations, and yet darker shades representing negative correlations. The darkest shade corresponds to $C_{mn} = -0.515641$. Because the stocks are arranged in order of industry code, the block pattern seen in the figure roughly corresponds to a grouping by industry. The right panel of Figure 1 shows the distribution of components of the equal-time cross-correlation matrix. This figure shows that nearly all correlations are positive. Furthermore, the right tail of the distribution is thicker than the left tail.
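The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the function names and the synthetic open and close prices are assumptions made here, and the population standard deviation (division by $T$) is assumed so that the diagonal of the matrix equals one.

```python
import numpy as np

def normalized_log_returns(open_px, close_px):
    """Open-to-close log returns, standardized per asset.

    Rows are assets, columns are days (array layout assumed for this sketch).
    """
    r = np.log(close_px) - np.log(open_px)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)  # population std (1/T), assumed here
    return (r - mean) / std

def equal_time_correlation(w):
    """C_mn = (1/T) * sum_t w_m(t) w_n(t)."""
    n_days = w.shape[1]
    return w @ w.T / n_days

# Synthetic prices, used only to exercise the functions.
rng = np.random.default_rng(0)
n_assets, n_days = 5, 250
open_px = 100.0 + rng.random((n_assets, n_days))
close_px = open_px * np.exp(0.01 * rng.standard_normal((n_assets, n_days)))

w = normalized_log_returns(open_px, close_px)
C = equal_time_correlation(w)
```

With this normalization, `C` coincides with the ordinary Pearson correlation matrix of the returns, which is why its diagonal is exactly one.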

Application of RMT
Calculation of the eigenvalues $\lambda_R$ of this cross-correlation matrix produces Figure 2. Here, the subscript $R$ represents the eigenvalue ranking. The left panel of Figure 2 shows the distribution of eigenvalues. The largest eigenvalue is $\lambda_1 = 143.516$, and the smallest eigenvalue is $\lambda_{445} = 0.0638128$. The right panel of Figure 2 shows the distribution in the range of small eigenvalues. The solid line is the probability density function of the so-called Marčenko-Pastur distribution, which is derived from RMT in the limit $N \to \infty$ and $T \to \infty$ with $Q = N/T$ fixed:
$$\rho(\lambda) = \left(1 - \frac{1}{Q}\right)_+ \delta(\lambda) + \frac{\sqrt{(\lambda_+ - \lambda)_+\,(\lambda - \lambda_-)_+}}{2\pi Q \lambda}, \tag{5}$$
where $(x)_+ = \max(0, x)$; $\delta(x)$ denotes Dirac's delta function; and $\lambda_\pm$ is defined by
$$\lambda_\pm = \left(1 \pm \sqrt{Q}\right)^2. \tag{6}$$
In this paper, $\lambda_+ = 2.01941$ denotes the upper bound of the eigenvalue $\lambda$, and $\lambda_- = 0.335172$ denotes the lower bound of $\lambda$.
In the RMT-based extraction of the meaningful part of the correlation structure, empirical eigenvalues larger than $\lambda_+$ signify the meaningful part. In particular, in the cross-correlation matrix of stock returns, the largest eigenvalue corresponds to the market mode, and the remaining meaningful part corresponds to group modes, such as industry sectors. In this analysis, it was found that $\lambda_1 > \lambda_2 > \cdots > \lambda_{17} > \lambda_+$, so 17 meaningful components were retained.
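As a quick numerical check of the bounds quoted above, the Marčenko-Pastur edges can be computed directly from $N$ and $T$. The sketch below uses only the standard library; the function names are chosen here for illustration, and the short eigenvalue list in the usage line is made up.

```python
import math

def marchenko_pastur_bounds(n_assets, n_days):
    """Return (lambda_minus, lambda_plus) for Q = N / T."""
    q = n_assets / n_days
    root = math.sqrt(q)
    return (1.0 - root) ** 2, (1.0 + root) ** 2

def count_meaningful(eigenvalues, upper_bound):
    """Number of eigenvalues that lie above the random-matrix upper bound."""
    return sum(1 for lam in eigenvalues if lam > upper_bound)

lam_minus, lam_plus = marchenko_pastur_bounds(445, 2510)
print(lam_minus, lam_plus)

# Hypothetical eigenvalues, only to show how the count is taken.
n_kept = count_meaningful([143.516, 2.5, 2.0, 0.06], lam_plus)
```

For $N = 445$ and $T = 2510$ the two edges agree with the values of $\lambda_-$ and $\lambda_+$ given in the text.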
In traditional PCA, Monte Carlo simulations and so-called scree graphs are used to extract meaningful components. In this approach, the time series of each stock is randomly shuffled before the equal-time cross-correlation matrix is constructed. This manipulation breaks both the autocorrelation and the cross-correlation, so it rests on a concept similar to the application of RMT. If we construct the equal-time cross-correlation matrix from those randomly shuffled time series, we obtain the histogram shown in the left panel of Figure 3. The solid line in this figure corresponds to the Marčenko-Pastur distribution given by Eq. 5. From this figure, we can recognize the equivalence between the traditional PCA and the application of RMT. The right panel of Figure 3 shows the scree graph. In this figure, the abscissa corresponds to the eigenvalue rankings and the ordinate corresponds to the magnitude of the eigenvalues. The curve with error bars depicts the eigenvalue distribution of the randomly shuffled cross-correlation matrix. The thin line with filled circles depicts the distribution of eigenvalues of the empirical equal-time cross-correlation matrix. The meaningful part can be obtained by comparing these two distributions. If we denote the upper bound of the eigenvalues derived from the randomly shuffled cross-correlation matrix as $\lambda_{\max}$, we obtain $\lambda_1 > \lambda_2 > \cdots > \lambda_{19} > \lambda_{\max} = 1.7947$. Hence, there are 19 meaningful components in the dataset.
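The random-shuffling test can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the procedure used for the figures: the one-factor data, the number of trials, and the choice of the largest shuffled eigenvalue over all trials as $\lambda_{\max}$ are all choices made here for illustration.

```python
import numpy as np

def standardize(x):
    """Zero mean and unit (population) standard deviation for every row."""
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

def shuffled_max_eigenvalues(w, n_trials=20, seed=0):
    """Largest eigenvalue of the correlation matrix after shuffling each row in time."""
    rng = np.random.default_rng(seed)
    n_days = w.shape[1]
    maxima = np.empty(n_trials)
    for trial in range(n_trials):
        shuffled = rng.permuted(w, axis=1)  # independent permutation of every row
        corr = shuffled @ shuffled.T / n_days
        maxima[trial] = np.linalg.eigvalsh(corr)[-1]
    return maxima

# Toy data: 20 series driven by one common factor plus independent noise.
rng = np.random.default_rng(42)
n_assets, n_days = 20, 500
common = rng.standard_normal(n_days)
w = standardize(0.7 * common + rng.standard_normal((n_assets, n_days)))

empirical = np.linalg.eigvalsh(w @ w.T / n_days)[::-1]  # descending order
lambda_max = shuffled_max_eigenvalues(w).max()          # null-model upper bound (assumed rule)
n_meaningful = int(np.sum(empirical > lambda_max))
```

In this toy example only the eigenvalue of the common factor exceeds the shuffled bound, so one component is retained.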

Application of the RRS
As stated above, constructing a randomly shuffled cross-correlation matrix breaks both the autocorrelation and the cross-correlation. However, it has been reported that stock prices exhibit autocorrelation. Thus, we need a method that preserves the autocorrelation but randomizes the cross-correlation. [16,17] developed a method referred to as RRS. In RRS, we shuffle the empirical time-series data rotationally in the time direction and impose the periodic boundary condition:
$$w_n(t) \to w_n\big(\mathrm{mod}(t + \tau - 1,\, T) + 1\big).$$
Here, $\tau \in [0, T - 1]$ is a (pseudo-)random integer that is different for each $n$. For example, if $\tau = 1537$ for stock 1, $\tau = 2128$ for stock 2, ..., $\tau = 138$ for stock $N$, the time series of normalized log returns are given by
$$\{w_1(1538), w_1(1539), \ldots, w_1(2510), w_1(1), w_1(2), \ldots, w_1(1537)\},$$
$$\{w_2(2129), w_2(2130), \ldots, w_2(2510), w_2(1), w_2(2), \ldots, w_2(2128)\},$$
$$\vdots$$
$$\{w_N(139), w_N(140), \ldots, w_N(2510), w_N(1), w_N(2), \ldots, w_N(138)\}.$$
From such rotationally randomly shuffled time series, the cross-correlation matrix can be constructed and its eigenvalues calculated. An example is shown in the histogram in the left panel of Figure 4. The solid line in this figure corresponds to the Marčenko-Pastur distribution given by Eq. 5. This figure shows that the distribution of eigenvalues is almost the same as the Marčenko-Pastur distribution based on RMT, except in the large eigenvalue range.
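A rotation with a periodic boundary is exactly what `numpy.roll` does, so RRS can be sketched as follows. The function names are illustrative assumptions; the rotation direction is chosen so that a shift $\tau$ moves $w_n(\tau + 1)$ to the first position, matching the example above.

```python
import numpy as np

def rotate_rows(w, taus):
    """Rotate row n to the left by taus[n] with a periodic boundary."""
    return np.stack([np.roll(row, -int(tau)) for row, tau in zip(w, taus)])

def rotational_random_shuffle(w, rng):
    """Draw one random shift in [0, T - 1] per row and rotate each row by it."""
    n_assets, n_days = w.shape
    taus = rng.integers(0, n_days, size=n_assets)
    return rotate_rows(w, taus), taus

# Deterministic check on a short series: tau = 3 starts the row at its 4th entry.
demo = rotate_rows(np.arange(1, 11)[None, :], [3])[0]

# Random example on synthetic data.
rng = np.random.default_rng(7)
w = rng.standard_normal((4, 50))
w_rrs, taus = rotational_random_shuffle(w, rng)
```

Because each row is only rotated, its circular autocorrelation at every lag is unchanged, while the relative timing between different rows is randomized.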
The right panel of Figure 4 shows the scree graph. In this figure, the abscissa corresponds to eigenvalue rankings, and the ordinate corresponds to eigenvalue magnitude. The curve with error bars depicts the eigenvalue distribution of the RRS cross-correlation matrix. The thin line with filled circles depicts the distribution of eigenvalues of the empirical equal-time cross-correlation matrix. Again, if the upper bound of eigenvalues derived from the RRS cross-correlation matrix is denoted as $\lambda_{\max}$, then $\lambda_1 > \lambda_2 > \cdots > \lambda_{19} > \lambda_{\max} = 1.7947$ is obtained. Hence, 19 meaningful components are retained. Although the numbers of meaningful components in RMT and RRS are equal, this result is a coincidence specific to the data set at hand.
Figure 5 shows the distribution of components of the top 20 eigenvectors, $v_1, \ldots, v_{20}$. The thin vertical lines in these figures separate business sectors. RMT suggests that the components of each eigenvector follow the Porter-Thomas distribution, a zero-mean Gaussian whose variance is fixed by the normalization of the eigenvector. The first eigenvector $v_1$ consists of components of similar magnitude and is referred to as the market mode. In the second eigenvector, there is a negative peak in the rightmost sector, which corresponds to the utility sector. In the third eigenvector, there is a negative peak in the left sector, which corresponds to the bank sector. In the fourth eigenvector, there is a positive peak in the middle sector, which corresponds to the oil and gas equipment and service sector. In the fifth eigenvector, there is a negative peak in the right middle sector, which corresponds to the REIT sector. The panels from the sixth eigenvector to the 20th eigenvector have peaks in some sectors containing a small number of assets. However, it is sometimes difficult to extract the meaning of each principal component.
Thus, the correlation matrix was split into three parts, namely the market mode, the group modes, and the noise:
$$C = \lambda_1 v_1 v_1^{T} + \sum_{R=2}^{19} \lambda_R v_R v_R^{T} + \sum_{R=20}^{N} \lambda_R v_R v_R^{T}.$$
It is important to understand why the largest eigenvalue and the corresponding eigenvector are referred to as representing the market mode. The market index on day $t$ is denoted as $w_M(t)$ and defined by the scalar product of $w(t)$ and the first eigenvector $v_1$:
$$w_M(t) = w(t) \cdot v_1 = \sum_{n=1}^{N} v_{1,n}\, w_n(t),$$
i.e., an average of the returns weighted by the components of the first eigenvector. On the other hand, the S&P 500 is used to characterize the entire market. The normalized log return on day $t$ from open to close of the S&P 500 is denoted as $w_{SP}(t)$. Figure 6 shows the scatter plot of $w_M(t)$ vs. $w_{SP}(t)$. This figure shows that $w_M(t)$ and $w_{SP}(t)$ exhibit a strong, positive correlation. The dashed line in this figure shows a linear function with the slope given by Pearson's correlation coefficient $\rho = 0.852$ and with the intercept equal to 0. This correlation coefficient is almost the same as that obtained by [5].
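The construction of the market index can be sketched on synthetic one-factor data. Everything below is an illustrative assumption: the data are simulated, and the equal-weighted average of the normalized returns stands in for an external index such as the S&P 500, which is not available in this toy setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n_assets, n_days = 30, 1000

# Simulated returns: one common factor plus idiosyncratic noise (assumption).
factor = rng.standard_normal(n_days)
raw = 0.8 * factor + rng.standard_normal((n_assets, n_days))
w = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

C = w @ w.T / n_days
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
v1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
if v1.sum() < 0:                      # fix the arbitrary overall sign
    v1 = -v1

w_market = v1 @ w                     # w_M(t) = sum_n v_{1,n} w_n(t)
w_index = w.mean(axis=0)              # stand-in for an external market index
rho = np.corrcoef(w_market, w_index)[0, 1]
```

The variance of `w_market` equals the largest eigenvalue by construction, and in this toy setting `rho` is close to one because the only common structure is the single factor.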

APPLICATION OF CHPCA AND HHD
In this section, the complex correlation matrix is defined. RRS is then applied to distinguish the meaningful components from the noise components, and CHPCA is introduced. After that, HHD is presented in order to clarify the lead-lag relationships among assets.

Complex Correlation Matrix
A simple definition of different-time correlation is given by $\mathrm{Corr}[w_m(t), w_n(t + \Delta t)]$ $(\Delta t = 1, \ldots, T - 1)$. However, if $N$ and $T$ are extremely large, a huge number of combinations must be investigated. Therefore, a complex correlation matrix is introduced to overcome this problem. We consider the Fourier expansion of the daily log returns of asset $n$, represented by
$$r_n(t) = \sum_{k}\left[a_n(\omega_k)\cos(\omega_k t) + b_n(\omega_k)\sin(\omega_k t)\right], \tag{7}$$
where $\omega_k = 2\pi k/T \ge 0$, and $a_n(\omega_k)$ and $b_n(\omega_k)$ are the Fourier coefficients. The Hilbert transform of $r_n(t)$ is given by
$$\mathcal{H}[r_n(t)] = \sum_{k}\left[b_n(\omega_k)\cos(\omega_k t) - a_n(\omega_k)\sin(\omega_k t)\right]. \tag{8}$$
We define a complex log return $\tilde{r}_n(t)$ as
$$\tilde{r}_n(t) = r_n(t) + i\,\mathcal{H}[r_n(t)] = \sum_{k}\left[a_n(\omega_k) + i\,b_n(\omega_k)\right]e^{-i\omega_k t}, \tag{9}$$
where $i$ denotes the imaginary unit defined by $i^2 = -1$. With this sign convention, each Fourier component of $\tilde{r}_n(t)$ rotates clockwise in the complex plane.
Frontiers in Physics | www.frontiersin.org April 2021 | Volume 9 | Article 602944
Souma
Characteristics of Principal Components
For each asset $n$, we define a time average $\langle \tilde{r}_n \rangle$ and a standard deviation $\tilde{\sigma}_n$ as follows:
$$\langle \tilde{r}_n \rangle = \frac{1}{T}\sum_{t=1}^{T}\tilde{r}_n(t), \qquad \tilde{\sigma}_n^2 = \frac{1}{T}\sum_{t=1}^{T}\left|\tilde{r}_n(t) - \langle \tilde{r}_n \rangle\right|^2. \tag{10}$$
We define the normalized complex log return $\tilde{w}_n(t)$ as
$$\tilde{w}_n(t) = \frac{\tilde{r}_n(t) - \langle \tilde{r}_n \rangle}{\tilde{\sigma}_n}. \tag{11}$$
Thus, the time average of $\tilde{w}_n(t)$ is zero, and its standard deviation is one. Each component of the complex correlation matrix is defined by
$$\tilde{C}_{mn} = \frac{1}{T}\sum_{t=1}^{T}\tilde{w}_m(t)\,\tilde{w}_n^{\dagger}(t). \tag{12}$$
Herein, $\dagger$ represents the transposed complex conjugate. The elements of the complex correlation matrix are distributed on the complex plane, as shown in the upper left panel of Figure 7. The lower left panel of Figure 7 shows the distribution of the real parts of the elements of the complex correlation matrix. This distribution is almost the same as for the case of the equal-time cross-correlation matrix shown in the right panel of Figure 1. The upper right panel of Figure 7 shows the distribution of the imaginary parts of the elements of the complex correlation matrix. This panel shows a symmetrical distribution.
Figure 8 is obtained by calculating the eigenvalues $\lambda_R$ of the complex correlation matrix. As in Section 2.2, the subscript $R$ again represents the eigenvalue ranking. The left panel of Figure 8 shows the distribution of the logarithms of the eigenvalues. The largest eigenvalue is $\lambda_1 = 143.71$, and the smallest eigenvalue is $\lambda_{445} = 0.0442842$. The right panel of Figure 8 shows the distribution in the small eigenvalue region. The solid line is the Marčenko-Pastur distribution given by Eq. 5 with $Q = 2N/T$.
Figure 9 shows the scree graph. In this figure, the abscissa corresponds to the eigenvalue rankings and the ordinate corresponds to eigenvalue magnitudes. The curve with error bars shows the eigenvalue distribution of the RRS complex correlation matrix. The thin line with filled circles depicts the distribution of eigenvalues of the empirical complex correlation matrix. If we again denote the upper bound for eigenvalues derived from the RRS complex correlation matrix as $\lambda_{\max}$, we obtain $\lambda_1 > \lambda_2 > \cdots > \lambda_{16} > \lambda_{\max} = 2.18894$. Hence, 16 meaningful components are retained for this dataset.
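The complex correlation matrix can be sketched with NumPy's FFT. This is a minimal illustration under stated assumptions: the function names are chosen here, the discrete analytic signal is built with the usual one-sided spectrum, and its complex conjugate is taken so that each Fourier mode rotates as $e^{-i\omega t}$, which is the sign convention assumed in this sketch for consistency with the clockwise-leading rule of the next subsection.

```python
import numpy as np

def complex_series(r):
    """Return r + i*H[r] along the last axis, with modes rotating as exp(-i*omega*t)."""
    n_days = r.shape[-1]
    spectrum = np.fft.fft(r, axis=-1)
    weights = np.zeros(n_days)
    weights[0] = 1.0
    if n_days % 2 == 0:
        weights[n_days // 2] = 1.0
        weights[1:n_days // 2] = 2.0
    else:
        weights[1:(n_days + 1) // 2] = 2.0
    # Standard analytic signal (modes ~ exp(+i*omega*t)); conjugate flips the rotation.
    analytic = np.fft.ifft(spectrum * weights, axis=-1)
    return np.conj(analytic)

def complex_correlation(r):
    """Complex correlation matrix of the rows of r (rows = assets, columns = days)."""
    z = complex_series(r)
    z = z - z.mean(axis=1, keepdims=True)
    z = z / z.std(axis=1, keepdims=True)  # np.std uses |.|^2 for complex input
    return z @ z.conj().T / z.shape[1]

# Check on a pure cosine, then build a matrix from synthetic returns.
n_days = 256
t = np.arange(n_days)
omega = 2.0 * np.pi * 3 / n_days
tone = complex_series(np.cos(omega * t))

rng = np.random.default_rng(3)
r = rng.standard_normal((4, n_days))
C_tilde = complex_correlation(r)
```

The resulting matrix is Hermitian with unit diagonal, so its eigenvalues are real and can be compared with a null model in the same way as in the real case.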
Figure 10 shows the distribution of each component for the top 16 eigenvectors $v_1, \ldots, v_{16}$ in the complex plane. In this case, the Porter-Thomas distribution, which is the null hypothesis of randomness, is a Gaussian distribution of the real and imaginary parts of the components in the complex plane.

Complex Hilbert Principal Component Analysis
In the complex plane, we regard the clockwise direction from the positive real axis as corresponding to leading components, whereas the counterclockwise direction from the positive real axis corresponds to lagging components. The components of the first eigenvector $v_1$ are distributed along the positive real axis. This means that the phase difference, i.e., the difference between leading and lagging, is small for the first eigenvector. Thus, we refer to the first eigenmode as the market mode. On the other hand, the components of the 2nd to 16th eigenvectors are distributed over a wide region in the complex plane. This behavior suggests group structure.
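The link between phase and lead-lag can be checked on a toy example. The sketch below is illustrative only: it assumes three synthetic complex series that are delayed copies of one clockwise-rotating phasor $e^{-i\omega t}$, with the correlation matrix formed as $\tilde{w}\tilde{w}^{\dagger}/T$; the function name and the delays are chosen here.

```python
import numpy as np

def leading_eigenvector_phases(C_tilde):
    """Phases of the top eigenvector's components, measured relative to component 0."""
    eigvals, eigvecs = np.linalg.eigh(C_tilde)  # Hermitian input, ascending eigenvalues
    v = eigvecs[:, -1]
    return np.angle(v / v[0])

n_days, k = 400, 5
omega = 2.0 * np.pi * k / n_days
t = np.arange(n_days)
delays = np.array([0.0, 2.0, 4.0])  # series 1 and 2 lag series 0 by 2 and 4 days

# Unit-modulus phasors already have zero mean and unit standard deviation.
w_tilde = np.exp(-1j * omega * (t[None, :] - delays[:, None]))
C_tilde = w_tilde @ w_tilde.conj().T / n_days

phases = leading_eigenvector_phases(C_tilde)
order_from_leader = np.argsort(phases)  # most clockwise (leading) first
```

In this example the lagging series appear at positive (counterclockwise) angles proportional to their delays, so sorting by phase recovers the order from leader to laggard.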

Helmholtz-Hodge Decomposition
We decompose the complex correlation matrix into the meaningful part and the noise part as
$$\tilde{C} = \sum_{R=1}^{16}\lambda_R v_R v_R^{\dagger} + \sum_{R=17}^{N}\lambda_R v_R v_R^{\dagger},$$
where $\dagger$ represents the conjugate transpose of a vector. The left panel of Figure 11 shows the meaningful part of the complex correlation matrix. Introducing a lower bound for the magnitudes of the elements of the meaningful part of the complex correlation matrix produces the right panel of Figure 11. The components of the real matrix $F$ are the absolute values of the components of this constrained meaningful correlation matrix.
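The split into a meaningful part and a noise part is a truncated spectral sum, which can be sketched as follows. The function name, the synthetic Hermitian matrix, and the choice of two retained components are assumptions made here for illustration.

```python
import numpy as np

def split_by_rank(C, n_keep):
    """Split a Hermitian matrix into the top n_keep eigenmodes and the remainder."""
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    def partial_sum(idx):
        V = eigvecs[:, idx]
        return (V * eigvals[idx]) @ V.conj().T  # sum_R lambda_R v_R v_R^dagger

    return partial_sum(slice(0, n_keep)), partial_sum(slice(n_keep, None))

# Synthetic complex data, used only to obtain a Hermitian test matrix.
rng = np.random.default_rng(5)
z = rng.standard_normal((6, 200)) + 1j * rng.standard_normal((6, 200))
C = z @ z.conj().T / 200

meaningful, noise = split_by_rank(C, n_keep=2)
```

The two parts add up to the original matrix, and the meaningful part has rank equal to the number of retained components.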
Here, $F$ is considered the weighted adjacency matrix. The components of this matrix can then be written as
$$F_{mn} = F^{(c)}_{mn} + F^{(g)}_{mn},$$
where $F^{(c)}_{mn}$ corresponds to the circular flow in the network (Eq. 16). On the other hand, $F^{(g)}_{mn}$ corresponds to the gradient flow in the network (Eq. 17), which is expressed in terms of the Helmholtz-Hodge potential $\phi_m$. By using Eq. 16, Eq. 17 can be rewritten as an equation for the potential (Eq. 18). By solving Eq. 18, we obtain the Helmholtz-Hodge potential shown in Figure 12. In this figure, the leading components show a ...
The average values $\langle \phi \rangle$ of the Helmholtz-Hodge potential for some major sectors are shown in Table 1. This table shows that the semiconductor industry is the most strongly leading, while the drug manufacturing industry is the most strongly lagging. On the other hand, [28] explored 483 assets from the S&P 500 for the four years from 2008 to 2011 (1,009 business days). That study found that the financial sector is the most strongly leading, while the telecommunications and service sector is the most strongly lagging. Therefore, we suspect that the lead-lag structure depends on the gross market conditions of the period investigated. However, verifying this conjecture is a problem for future study.
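To illustrate the idea of separating gradient and circular flows, the sketch below applies a generic Helmholtz-Hodge decomposition to a small directed network. It is not necessarily the construction used in this paper: the net flow $A - A^{T}$, the symmetric weights $A + A^{T}$, the gradient flow taken as weight times potential difference, and the zero-mean gauge for the potential are all assumptions of this sketch.

```python
import numpy as np

def hodge_potential(A):
    """Generic HHD of a directed weighted network A (A[m, n] = flow from m to n)."""
    A = np.asarray(A, dtype=float)
    F = A - A.T                         # net flow between nodes (assumed definition)
    W = A + A.T                         # symmetric edge weights (assumed definition)
    L = np.diag(W.sum(axis=1)) - W      # weighted graph Laplacian
    # Circular flow must be divergence-free, which gives L @ phi = row sums of F.
    phi, *_ = np.linalg.lstsq(L, F.sum(axis=1), rcond=None)
    phi = phi - phi.mean()              # fix the additive constant
    F_grad = W * (phi[:, None] - phi[None, :])
    F_circ = F - F_grad
    return phi, F_grad, F_circ

# A pure chain 0 -> 1 -> 2 is all gradient flow.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
phi_chain, grad_chain, circ_chain = hodge_potential(chain)

# A pure cycle 0 -> 1 -> 2 -> 0 is all circular flow.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
phi_cycle, grad_cycle, circ_cycle = hodge_potential(cycle)
```

In this sketch the upstream node of the chain receives the highest potential, while every node of the cycle has zero potential because no node is upstream of the others.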

APPLICATION OF CHPCA TO THE PORTFOLIO THEORY: A SKETCH
As a problem for future study, we consider the application of CHPCA to construct a portfolio by following Markowitz's portfolio theory [12]. We represent the fraction of wealth invested in asset $n$ as $\xi_n$. If we denote the number of assets as $K$, $\xi_n$ is normalized by
$$\sum_{n=1}^{K}\xi_n = 1. \tag{19}$$
By using the complex log return of each asset $\tilde{r}_n$ defined by Eq. 9, we define the complex log return of the portfolio $\tilde{r}_P$ as
$$\tilde{r}_P = \sum_{n=1}^{K}\xi_n \tilde{r}_n. \tag{20}$$
However, the portfolio return must be a real number, so we need to impose the constraint that the imaginary part of $\tilde{r}_P$ vanishes (Eq. 21). The risk of the portfolio is defined by the variance of $\tilde{r}_P$ (Eq. 22). Here again, the risk must be a real number, so we need to impose the constraint that the imaginary part of this variance vanishes (Eq. 23). Therefore, under the conditions given in Eqs 19, 21, and 23, a portfolio can be created that minimizes risk for an assumed return.

CONCLUSION
An analysis of price data for 445 assets from the S&P 500 from 2010 to 2019 (2,510 business days) provided the basis for an exploration of recent developments in distinguishing the meaningful part from the noise part in correlation structures in big data. Application of RMT to the equal-time cross-correlation matrix was found to be a useful method for obtaining the meaningful components of the correlation structure. However, the null hypothesis of randomness underlying RMT destroys both the real autocorrelation and the real cross-correlation in the data. In order to preserve autocorrelation, we introduced RRS. In the case of this paper, the number of meaningful components for RMT and for RRS happened to be equal. We also introduced CHPCA for investigating different-time cross-correlations. By using both CHPCA and HHD, we clarified the lead-lag relationships for some major business sectors.
[Figure caption fragment: ... 16. In each figure, the abscissa represents $\Re(v_{R,n})$, and the ordinate represents $\Im(v_{R,n})$. Here, $R$ is the eigenvalue ranking.]

AUTHOR CONTRIBUTIONS
WS wrote this paper by himself.

ACKNOWLEDGMENTS
The author would like to thank Hiroshi Iyetomi, Hideaki Aoyama, Yoshi Fujiwara, Yuichi Ikeda, Hiroshi Yoshikawa, and Irena Vodenska for useful discussions.