An Algorithm for the Analysis of Temporally Structured Multidimensional Measurements

Analysis of multichannel recordings acquired with contemporary imaging or electrophysiological methods in neuroscience is often difficult due to the high dimensionality of the data and the low signal-to-noise ratio. We developed a method that addresses both problems by utilizing prior information about the temporal structure of the signal and the noise. This information is expressed mathematically in terms of sets of correlation matrices, a versatile approach that allows the treatment of a large class of signal and noise sources, including non-stationary sources or correlated signal and noise sources. We present a mathematical analysis of the algorithm, as well as application to an artificial dataset, and show that the algorithm is tolerant to inaccurate assumptions about the temporal structure of the data. We suggest that the algorithm, which we name temporally structured component analysis, can be highly beneficial to various multichannel measurement techniques, such as fMRI or optical imaging.

measurement at time t, z(t), is approximated by a linear sum of basis vectors ψ 1 , ψ 2 ,…called "principal components", i.e., The principal components ψ i are selected such that they maximize an objective function defi ned as the power of the associated time course of the coeffi cients α i (t). Moreover, the n components with maximal power provide the best linear approximation of the data using n components in a welldefi ned mathematical sense. For neuroscientifi c measurements, which are typically redundant, a small number of components suffi ce for approximating the data, and the dimensionality can be dramatically reduced.
While highly successful in dimensionality reduction, PCA often performs poorly in terms of separating the signal from the noise. It is rarely the case that a particular principal component can be associated solely with the signal, and it is much more common that the principal components are "non-interpretable" and refl ect useless mixtures of signal and noise. As a result, PCA is often disregarded as a rigorous method for data analysis. We observe that both the success of PCA in dimensionality reduction and its failure in separating the signal from the noise follow from the PCA objective function. While maximizing the power guarantees an optimal low dimensional representation, this objective function is completely "blind" to the fact that data combines a signal, which is of interest, and a noise, which should be suppressed. Moreover, the PCA construction is ignorant to the fact that the data constitutes a multidimensional time-series, in which the samples, i.e., measurements at different points in time, are correlated.
In this contribution we propose an algorithm suitable for the analysis of multichannel measurements. We refer to the algorithm as "temporally structured components analysis" (TSCA). The algorithm adheres to the PCA framework by seeking components with high power. However, in order to avoid the mixing of the signal and the noise, it seeks components that maximize the power originating from the signal while minimizing power arising from the noise. The main challenge here is to tell apart the signal from the noise.

INTRODUCTION
Contemporary methods in neuroscience research, ranging from multielectrode recordings, through optical imaging, to fMRI, EEG and MEG allow the collection of large volumes of multichannel data in a single experiment. Such data can be highly informative about the underlying neuronal machinery, as it can reveal its rich and complex dynamics, as well as its spatial structure. However, this data is inherently diffi cult to analyze due to two fundamental issues (Mitra and Pesaran, 1999). First, there are problems associated with the nature of high dimensional structures, such as the diffi culty to visualize them or to fi t them with statistical models. Second, the signal-to-noise ratio is often very low. A typical example for the latter problem is when the signal, which is identifi ed with neuronal activity, is masked by strong non-neuronal sources, which are considered noise. In this contribution however we deal with the signal-to-noise ratio problem from a broader perspective: we view the signal as any aspect of the data that is of interest in a particular context; everything else is considered noise.
A general approach for analyzing multichannel data is to reduce its dimensionality. Since in many neuroscientifi c measurements the different channels are correlated, the data is redundant, and its dimensionality can be reduced without causing a signifi cant loss of information. Moreover, one can seek a low dimensional representation that retains the signal's structure but suppresses the noise. If such a representation is attainable, the new representation not only exhibits an improved signal-to-noise ratio but is also easier to analyze, merely due to its reduced dimensionality.
Out of the numerous dimensionality reduction methods, one of the simplest and most popular is principal component analysis (PCA; e.g., Jolliffe, 2002) 1 . In PCA, the multichannel We address this problem by making use of the fact that the signal and the noise typically have different temporal structures, and that prior information about their temporal characteristics is often at hand. For example, neuronal responses evoked by a stimulus can be characterized by their expected timing within the recording, which is set by the stimulation protocol, and by quantities such as the neural delay, which can be estimated. Another example is the approximately periodic artifacts originating from heartbeat pulsation. These artifacts can be characterized by their fundamental frequency, which can be measured independently or estimated from the data. In both cases, the information required for the characterization of the signal and the noise is either known a-priori, or can be estimated.
In the following sections we show how the prior temporal information can be systematically incorporated in a PCA-like framework. The main building block of the mathematical construction is the expression of the prior information in terms of correlation matrices, i.e., matrices of second order moments of the signal and the noise. This defi nition is very versatile, and allows the treatment of a very broad class of signal or noise sources, as well as dealing with statistical dependencies between the signal and the noise.
Our goal in this contribution is to present the algorithm and the theoretical construction underlying it. To this end, we fi rst provide the mathematical derivation of the algorithm for a simplifi ed case that captures the essence of the algorithm. Then, we discuss some results for the general case (full analysis is presented in Section 1 and 2 of the Supplementary Material). We next present simulation results to validate the algorithm, and to demonstrate some of its additional properties, such as robustness to errors in the assumed temporal structure of the data. We fi nish with a discussion, focusing on how the algorithm can be applied in practical cases, and on comparison with other algorithms.

ANALYSIS OF A SIMPLIFIED CASE
We consider the data to be a matrix Z with dimensions P times T, where P is the number of channels and T is the number of time samples. The matrix Z is stochastic (i.e., Z is a matrix-valued random variable), and constitutes a multidimensional timeseries. A basic operation in the analysis is to transform Z into a one-dimensional time-series by projecting it onto a P-dimensional unit vector ψ, an operation written as ψ T Z. One measure of the importance or dominance of a particular direction ψ is the expected power of ψ T Z, a quantity we write as s E Z T 2 2 ( ) (|| || ), ψ ψ = Z where E(·) denotes expectation taken over realizations of Z. The approach taken by PCA is to fi nd components ψ such that s Z 2 ( ) ψ is maximized. The solution to this maximization problem is given by eigenvectors of the theoretical covariance matrix, which is given, up to a scaling, by E(ZZ T ). In practice, the theoretical covariance matrix is replaced by the empirical covariance matrix, estimated from the observations. We further assume the matrix Z to be the sum of a signal X and a noise Y, i.e., Z = X + Y. Since PCA maximizes s Z 2 ( ) ψ , it is not guaranteed, in general, that the power of the projection of the signal on ψ, i.e., s X 2 ( ) ψ , is high. Our approach is to replace the maximization of the PCA objective function with either the maximization of s X 2 ( ) ψ , the minimization of s Y 2 ( ) ψ , or an interplay between these two objectives. Clearly, neither the maximization of s X 2 ( ) ψ nor the minimization of s Y 2 ( ) ψ can be done directly, because neither X nor Y are observed; only their sum Z is known.
We will develop a solution to this problem by studying a simplifi ed but instructive case. We assume the signal to be a multidimensional process with a fi xed spatial structure whose amplitude fl uctuates according to a stochastic process. Thus, we express X as X = u X a X where the spatial component u X is a fi xed, length P, column vector, and a X is a stochastic process, represented by a length T row vector (note that throughout the manuscript column vectors are used for space, and row vectors are used for time). We choose the noise to be of a similar form Y = u Y a Y , with u Y being fi xed and a Y stochastic. No particular assumptions about the probability distribution of a X are made, but for the noise we assume that it is statistically independent of the signal and has zero mean in all its entries, i.e., E(a Y ) = 0. Our approach for maximizing s X 2 ( ) ψ assumes that some prior knowledge about the temporal structure of the signal and the noise is available. Specifi cally, we require that the autocorrelation matrices of the signal and the noise are known, up to scaling. We denote these matrices by C X and C Y , and defi ne them respectively by E(a X T a X ) and E(a Y T a Y ). These matrices are of size T × T. For C X (or C Y ), the entry at row t, column t′, measures the correlation between the value of the signal (or the noise) at times t and t′.
An example of such data is given in Figure 1A. The signal X has a fi xed spatial component u X , which we chose, rather arbitrarily, to be an image of a circle in a square frame (left top). The random vector a X , which defi nes the temporal evolution of the signal, was selected to mimic the time course of a stimulus-triggered response. The response starts at times 0,100,…,900. It has a constant exponential shape, but variable amplitude. A particular realization of a X is shown on the right. The noise Y was constructed with a sinusoidal-gratings spatial component (u Y , Figure 1A, left bottom). Its time course was defi ned by an autoregressive process of order 5, for which we show a particular realization on the right. The observed data Z is constructed summing X and Y, so every frame is a linear combination of the circle and the gratings ( Figure 1B). The autocorrelation matrices for the signal and the noise processes are shown in Figure 1C (left: C X , right: C Y ). Intuitively, analysis of the signal requires the recovery of the circle image u X from the data Z. This intuition is met by the mathematical formulation presented above, because s X 2 ( ) ψ is maximized when ψ = u X . In order to (indirectly) maximize s X 2 ( ) ψ or minimize s Y 2 ( ) ψ based on the data Z, we will fi rst try to estimate these quantities for a fi xed ψ. To this end, we consider a quadratic estimator of the form: where Q is an arbitrary symmetric T × T matrix. Our goal now is to constrain Q such that m Q (ψ) will enable us to estimate s X 2 ( ), ψ s Y 2 ( ), ψ or some combination of them. Thus, we calculate the expectation (taken over realizations) of m Q (ψ), and express it in terms of s X 2 ( ) ψ and s Y 2 ( ) ψ . We fi nd that: 100 with g(t) = Θ(t)e −t/25 , Θ(·) being the Heaviside step function, and r l denoting random variables independently drawn from a Gaussian distribution N(1, 0.0625). The time course for the noise is an AR(5) process: , with ε(t) independently drawn from a Gaussian distribution N(0, 0.0008). The standard deviation of ε(t) was selected such that the power of a Y matches approximately the power of a X . A particular realization of these processes is shown on the right. The data at time t is constructed by a linear sum: a X (t)u X + a Y (t)u Y . (B) Examples of data frames. where A,B tr B A T = ( ) is the matrix inner product. Eq. 2 implies that m Q (ψ) is an unbiased estimator for a weighted sum of s X 2 ( ) ψ and s Y 2 ( ) ψ that takes the form: Moreover, Eqs. 2 and 3 together imply that if we know C X and C Y up to a scaling factor, then expressions of the form given by Eq. 3 can be estimated for any γ X and γ Y , by fi nding Q such that: Expressions of the form in Eq. 3 are useful objective functions. For example, choosing γ X = 1, γ Y = 0, produces an estimator of s X 2 ( ) ψ . A direct calculation shows that this objective function is maximized for ψ = u X . Thus, the maximization of m(ψ) with γ X = 1, γ Y = 0, is a strategy for recovering u X from the data. Another choice which is of interest is γ X = 0 and γ Y = −1. When this expression is maximized, s Y 2 ( ) ψ is minimized, and more generally, a tradeoff between maximizing s X 2 ( ) ψ and minimizing s Y 2 ( ) ψ follows when selecting γ X = 1 and γ Y < 0. Later, we address how the value of γ Y affects the solution of the optimization problem, but for now, we restrict neither γ X nor γ Y .
Eq. 4 constitute a linear system with two equations and (T + 1)T/2 variables (the number of independent entries in the symmetric matrix Q). Unless C X and C Y are the same up to a scaling factor and therefore are linearly dependent, an infi nite number of solutions exists for any selection of γ X and γ Y . This degeneracy can be utilized to minimize the variance of the estimator m Q (ψ). In general, this variance cannot be expressed only in terms of the correlation matrices, as it depends on moments of Z of order higher than 2. So as not to require knowledge of higher order moments, we use an upper bound obtained by applying the Cauchy-Schwarz inequality to the variance (Eq. A12, see Section 1 of the Supplementary Material). This bound holds for any Q satisfying Eq. 4, and is given by: where ||Q|| 2 = tr(Q T Q) is the squared Frobenius norm. Since the bound depends on Q only through the factor ||Q|| 2 , a reasonable strategy for minimizing the estimation variance is simply to minimize ||Q|| 2 . Given C X , C Y , γ X , and γ Y , the solution for Eq. 4 with minimal ||Q|| 2 is a linear combination of C X and C Y given by: Having derived an expression for Q, we can estimate m(ψ) by m Q (ψ), and therefore we can treat m(ψ) as an objective function to be maximized over all ψ's. The solution to this maximization problem is analogous to the PCA solution. Here, it is given by eigenvectors of ZQZ T , and the eigenvalue of a particular eigenvector ψ i is equal to the estimate of m(ψ i ). If we sort the eigenvectors according to their eigenvalues in a descending order, the fi rst eigenvector obtains the global maximum of the objective function; the second eigenvector obtains the maximum in the subspace orthogonal to the fi rst eigenvector; and so forth.
Based on the analysis above, we can summarize the basic algorithm. First, we select an objective function m(ψ) of the form in Eq. 3 by choosing γ X and γ Y . Using prior knowledge of the autocorrelation of the signal C X and the noise C Y we construct the estimator for m(ψ) by calculating Q according to Eq. 6. We then diagonalize ZQZ T and sort the eigenvectors according to their eigenvalues. Those with highest eigenvalues maximize the objective function.
We demonstrate this procedure on the data in Figure 1. We selected the objective function by choosing γ X = 1, γ Y = 0, so the objective function is simply s X 2 ( ) ψ . Then, Q was calculated using the autocorrelation matrices in Figure 1C. Applying the algorithm for the realization of Z in Figure 1A, we diagonalized ZQZ T . The fi rst eigenvector ψ 1 is shown in Figure 1D, top panel. It is highly similar to the spatial component of the signal ( , . ). 〈 〉> ψ 1 0 99 u X The algorithm therefore successfully recovered the spatial component of the signal, as desired by this choice of objective function. In this simplifi ed example all columns of Z belong to the two dimensional linear subspace spanned by u X and u Y . The rank of Z is therefore 2, and there is only one more non-zero eigenvalue. The corresponding eigenvector is shown at the bottom of Figure 1D. Note that if we were interested in the spatial component of the noise, we could still use TSCA by switching the roles of the signal and the noise. Applying the algorithm this way produces the eigenvectors in Figure 1E. Now, the fi rst eigenvector (top) corresponds to the noise spatial component ( , . ).
For comparison, we applied PCA on the same data. This resulted in the usual mixing of the signal and noise ( Figure 1F; ). This phenomenon can be understood using the formulation developed above: PCA is obtained by selecting Q to be the identity matrix. Plugging Q = I in Eq. 2, and observing that I , C tr C and I , C tr C we see that the quantity maximized in this case is s s X 2 Y 2 ( ) ( ), ψ ψ + which cannot be expected to appropriately separate the signal from the noise.

THE GENERAL CASE
The analysis presented above makes the rather strong assumption that both the signal and the noise have one, fi xed, spatial component, and that they are uncorrelated. Clearly, these assumptions are unrealistic for complex biological data. However, using mathematical principles similar to the ones presented above, we can generalize the algorithm such that the restrictive assumptions are replaced by weaker ones. Detailed discussion is presented in Section 1 and 2 of the Supplementary Material. Here, we briefl y sketch the generalized formulation and summarize the main results. We clearly state the underlying assumptions, which defi ne the algorithms' scope of applicability.
In the general case, we assume that the data Z originates from N X signal sources X X 1 ,..., N X and N Y noise sources Y Y 1 ,..., N Y , which combine additively. Each of the sources is an arbitrary matrix valued random variable. The objective function is generalized and takes the form: m s s As before, the free parameters of the objective function γ X and γ Y control a tradeoff between maximizing signal power, now given by ∑ = j j 1 N X 2 X s ( ), ψ and minimizing noise power, i.e., ∑ = The estimation of the objective function is obtained by using an unbiased estimator m Q (ψ) of the form given by Eq. 1, with Q being a linear combination of correlation matrices (Eqs. A13 and A14, Section 1 of the Supplementary Material). The maximization of m Q (ψ) is solved by diagonalizing ZQZ T , and the leading eigenvectors of ZQZ T maximize the objective function, as in the simplifi ed case.
The main difference between the simplifi ed and general cases is the defi nition of the correlation matrices. Whereas in the simplifi ed case there was one correlation matrix for the signal and one correlation matrix for the noise, in the general case there are three sets of matrices: (1) Signal correlation matrices, representing the temporal correlations within each of the signal sources; (2) Noise correlation matrices representing the temporal correlations within each of the noise sources; and (3) Interaction correlation matrices, accounting for potential correlations between the different signal and noise sources. In order to defi ne these matrices, we consider a decomposition of each of the signal and noise sources: X j = U Xj A Xj and Y j = U Yj A Yj . Here, each matrix U Xj , U Yj is a fi xed matrix of size P by R X j or P by R Y j , respectively, and each of the matrices A Xj , A Yj is a matrix valued random variable of compatible size. The number of columns in U Xj , U Yj , i.e., R X j or R Y j , is arbitrary, although having a decomposition with small R X j and R Y j simplifi es the actual calculation of the correlation matrices. Note that there is always an infi nite number of such decompositions for each source, as U Xj or U Yj can be defi ned as any invertible matrix, and A Xj , A Yj as the inverse of that matrix multiplied by X j or Y j . However, the precise choice of U Xj , A Xj , U Yj and A Yj is not important because all choices lead to a valid matrix Q. Moreover, all choices of U Xj , U Yj such that their columns are linearly independent are equivalent, and lead to precisely the same matrix Q. Using these decompositions the defi nition of the signal, noise, and interaction correlation matrices is, respectively, the set of symmetrized expected outer products between any two rows of (1) the same matrix in the set {A Xj }, (2) the same matrix in the set {A Yj }, and (3) different matrices in the set {A Xj } ∪ {A Yj } (see Eq. A3, Section 1 of the Supplementary Material). In order to construct Q however, it is suffi cient to know only arbitrary bases for the linear span of the three set of matrices (see Eq. A10, Section 1 of the Supplementary Material). In particular, it follows that similarly to the simplifi ed case, it is enough to know the correlation matrices only up to a scaling factor. Note also that the defi nitions of the correlation matrices reduce to the defi nitions of the simplifi ed case whenever the conditions for the simplifi ed case are satisfi ed.
Overall, we make only two assumptions about the processes generating the data in the general case: additivity of the different sources, and a technical condition we refer to as "contrastability" of the signal and the noise. The latter condition requires the temporal structure of the signal and the noise to be suffi ciently different, such that objective functions m(ψ) that contrast the signal and noise, i.e., objective functions with γ X ≠ γ Y , can be estimated. Mathematically it requires the linear independence between the linear spaces spanned by the signal, noise, and interaction correlation matrices. While additivity is a rather strong assumption, we expect contrastability not to be very restrictive except for special cases. An example of contrastability violation is when a signal component and a noise component have identical autocorrelation matrices, up to scaling. Another example is when the sum of dimensions of the subspaces spanned by the signal, noise, and interaction correlation matrices exceeds the dimension of the space of real symmetric matrices T(T + 1)/2. This case refl ects that the correlation structure of the sources is too complex compared with the temporal length of the data.
In addition to the two assumptions about the processes generating the data, we also assume that suffi cient amount of data is available. This is required because the theoretical analysis relies on the diagonalization of E(ZQZ T ), whereas in practice this matrix has to be approximated with the fi nite amount of observed data. Such an approximation is valid if several independent realizations of Z are available, e.g., when measurements from several trials are available. In this case, all realizations can be used simultaneously (see section "Maximization of the objective function" Section 1 of the Supplementary Material), and the formal demand is that the number of realizations is large enough. Alternatively, if there is only one realization of Z, the processes producing the data have to be ergodic, and T has to be large enough.
We view the assumptions mentioned above to be either standard or weak. In contrast to many algorithms, we did not assume the sources to be Gaussian, stationary, space-time separable, independent, or orthogonal in any sense. This generality however comes with a price: we do make the strong assumption that the correlation matrices are known a-priori, or at least can be estimated. Whether or not this is a reasonable assumption is application specifi c, and depends on the characteristics of the sources.
A limiting factor in this context is the number of correlation matrices: although any source has a well-defi ned set of correlation matrices and thus can in principle be dealt with by the algorithm, it is unlikely that a very large number of correlation matrices can be known a-priori or reliably estimated. Generally, the number of correlation matrices associated with any particular source can be as high as the order of the number of channels P squared. We suggest however, that sources with a simple enough temporal structure will be associated with a set of correlation matrices of a reasonable size. This suggestion is supported by the common observation that the number of signifi cant principal components in standard PCA applied to datasets of the type considered here is limited. When this holds, any source, e.g., a signal source X j , is likely to be well approximated by X j = U Xj A Xj with U Xj having a small number of columns R X j . Thus, in this case, X j will contribute at most the order of R X j 2 , and not P 2 , to the total number of correlation matrices. Furthermore, in some cases the number of correlation matrices originating from a particular source can be smaller, even if R X j or R Y j is large. For example, a noise source Y j with independent zero-mean white noise at each of its entries requires a decomposition with R P Y j = . Nonetheless, there is only one correlation matrix associated with such a source: the identity matrix.
Although estimation of the correlation matrices can be diffi cult in some cases, they need not be known to perfect precision. Some robustness follows from the mathematical construction because the algorithm uses only continuous operations, so a suffi ciently small perturbation of the correlation matrices leads only to a small perturbation of the algorithm's output. Below, we use simulations performed on an artifi cial dataset to further demonstrate the algorithm's robustness to perturbations of the correlation matrices, as well some of its other properties.

SIMULATIONS
We constructed an artifi cial dataset as the sum of 10 sources. Each source i admitted a decomposition u i a i , with a fi xed u i and stochastic a i , i.e., Z a i i i = ∑ =0 9 u . We used 10 different defi nitions for what is the signal and what is the noise in the data. Each time, we defi ned one of the sources to be the signal X (e.g., X = u 0 a 0 ), and all other sources to be nine noise sources Y 1 ,..,Y 9 (e.g., Y 1 = u 1 a 1 ,…,Y 9 = u 9 a 9 ). As discussed above, having a single spatial component u i for each source is not a requirement of the algorithm, but this particular case simplifi es the performance analysis presented below, as each source has a single correlation matrix associated with it. On the other hand, we did introduce complexities into the dataset by constructing the vectors u i and a i such that they exhibited various spatial and temporal correlations. The spatial components u i were P = 900 length vectors, constructed by a 30 × 30 pixels rendering of the digits 0 to 9, normalized to have zero mean and unit norm (Figure 2A). There are marked correlations between different components, ranging from 0.22 (for u 0 and u 7 ) to 0.89 (for u 6 and u 8 ) with the mean correlation, u u i j , , i ≠ j, being 0.57 (std: 0.18). The temporal components of each source a i were constructed as realizations of stochastic processes of different types. Processes 0 to 4 are stationary and include a white noise process (process 0), autoregressive processes with broad (1) and concentrated (3) spectra, and an harmonic process (4). Process 2 was constructed by smoothing realizations of process 1, making processes 1 and 2 statistically dependent and correlated. Processes 5 to 9 were non-stationary and modeled different types of stimulus-triggered responses, including responses of different shapes (exponential: 6,7,9, α-function: 5,8), a response to a non-periodic stimulation (8), a response with a stochastic decay time-constant (9) and a response with variable sign (5). Two non-stationary processes were coupled by taking process 7 to be a delayed version of process 6. Defi ning equations for the 10 processes are given in Section 3 of the Supplementary Material. Exemplar realizations of the processes are shown in Figure 2A. In Figure 2B we show a realization of the observed data Z at 8 points in time. This fi gure demonstrates that individual digits were diffi cult to recognize by visual inspection of Z at specifi c times. Figure 2C shows the fi rst 10 principal components obtained with standard PCA for the same data. As expected, these components do not correspond to individual digits.
In Figure 3A, we present an example of the results obtained by applying the algorithm to the artifi cial data. In this example, we used one realization of the data with length T = 1000. The objective function here was defi ned by taking γ X = 1, γ Y = 0. We ran the algorithm 10 times on the same realization, using the 10 defi nitions of the signal and the noise. Figure 3A shows ψ 1 ( ) i , i = 0,…9, the leading eigenvector obtained when the signal was defi ned as the i-th source. These eigenvectors are highly similar to the spatial component of

Blumenfeld
Multidimensional data analysis the source defi ned as signal, i.e., ψ 1 ( ) i i ≈ u , and illustrate the success of the algorithm in recovering the signal's spatial components in this particular case.
A more rigorous and quantitative study of the performance of the algorithm applied on the artifi cial data is presented in Figures 3B-F. In Figure 3B we studied the effect of the temporal length of the data T on the performance of the algorithm by running the algorithm as in Figure 3A for 10000 realizations of Z and for several values of T. Similarity between ψ 1 ( ) i and the corresponding spatial component u i was quantifi ed by ρ ψ u equivalent here to the Pearson correlation coeffi cient due to normalization of the vectors; the subscript s is used here to indicate that the correlation is spatial. Figure 3B (left) shows the values of ρ s averaged over the 10000 realizations of Z, plotted against T for each of the 10 processes. As expected, increasing T improved the performance of the algorithm. For T ≥ 1000, all 10 processes had an average value of ρ s greater than 0.9, but even for T = 250 the average ρ s was greater than 0.9 for most process and greater than 0.8 for all processes. For a very large T, the average ρ s approached 1. This result is in fact guaranteed by the mathematical formulation, and implies a near prefect match between ψ 1 ( ) i and u i for every realization of Z. This behavior can be also seen in Figure 3B (right), which shows that the standard deviation of ρ s decays to zero as T increases. We further observed that the standard deviation of ρ s was of the order of 1 − ρ s . Since this behavior was found in all subsequent simulations, standard deviations are not shown in subsequent panels. (E) Mean values of ρ s (left) and ρ t (right) as a function of Δt, perturbation to simulation timing assumed when constructing correlation matrices involving processes 6 and 7. (F) Mean values of ρ s (left) and ρ t (right) as a function of Δτ, perturbation to decay time constant assumed when constructing correlation matrices involving processes 6 and 7. (G) Comparison of mean values of ρ s (left) and ρ t (right) in three cases: when the algorithm had access to all necessary correlation matrices (*), when crosscorrelation matrices were assumed null (**), and when noise was assumed white (***). In all panels, the shaded area corresponds to a performance measure higher than 0.9.
For further investigations we fi xed T to the intermediate value of 1000 and repeated similar simulations while varying other parameters. In Figure 3C we studied the effect of γ X and γ Y on the performance. Since multiplying both γ X and γ Y by a factor does not affect the resulting eigenvectors, we fi xed γ X = 1 and varied only γ Y . In Figure 3C (left) we plotted the average value of ρ s as a function of γ Y . Decreasing γ Y reduced ρ s for all processes, albeit at different rates. However, there was an advantage to decreasing γ Y : the projection of the data on the fi rst eigenvector ψ 1 ( ) i Z T tended to be more correlated with the temporal component of the signal source a i . This behavior is quantifi ed in Figure 3C (right), where we plotted the average value of ρ t , the Pearson correlation between ψ 1 ( ) i Z T and a i , against γ Y ; the subscript t is used here to indicate that the correlation is temporal. The average ρ t increased when γ Y was decreased, and reached values close to 1 for γ Y = −4. Comparing the left and right panels of Figure 3C, we observe that γ Y offers a tradeoff between the quality of spatial ( 1 ( ) ψ i ) and temporal ( 1 ( )T ψ i Z) outputs of the algorithm. In general however, it is not guaranteed that ρ t converges to 1 for large negative γ Y . This occurs only when there exist directions with non-zero signal power that are orthogonal to all directions with non-zero noise power. In this example one such direction indeed existed for each signal/noise defi nition: it was the direction in the 10 dimensional linear subspace spanned by all u i that was orthogonal to the 9 dimensional linear subspace spanned by the vectors u i designated as noise.
As discussed above, we view the requirement for knowing the correlation matrices to be the most restrictive assumption we make. In generating Figures 3B,C we used the true correlation matrices, obtained from our knowledge of the generating processes. Clearly, parameters of the generating processes are unknown for real datasets, and the algorithm must use estimates of the correlation matrices. We used the artifi cial data to test the performance of the algorithm when provided with correlation matrices that were perturbed away from their true values. In view of the results in Figure 3C, we quantifi ed the algorithm's performance using both ρ s and ρ t , with ρ s evaluated for γ X = 1, γ Y = 0, and ρ t evaluated for γ X = 1, γ Y = −4. In Figure 3D we show the results of perturbing the correlation matrices by adding to them a random symmetric matrix with independent, zero-mean, normally-distributed entries. All correlation matrices were perturbed independently. The scaling of the perturbing matrix was determined such that the ratio between its Frobenius norm and the Frobenius norm of the correlation matrix was fi xed to a predefi ned value η. In Figure 3D (left) we show the average value of ρ s for different values of η (the average was taken over realizations of the both the data and the perturbations). For most processes, ρ s remained above 0.9 even when η was 4, i.e., when the amplitude of the perturbation was four times the amplitude of the true correlation matrices. For η ≤ 1/2 the perturbation had only a marginal effect for all processes. Performance in terms of ρ t was somewhat less resistant to the perturbation, with a sharp decrease in ρ t when the perturbation and the true correlation matrices had similar amplitudes (η = 1). Still, most processes were almost unaffected by perturbations with η ≤ 1/2.
Although the analysis in Figure 3D validates the algorithm's robustness with respect to random perturbations of the correlation matrices, the perturbation scheme we used might not capture the type of errors in estimated correlation matrices that are likely to occur in practical applications, as it is unlikely that these errors will be random. Thus, in Figure 3E we studied another type of perturbation by replacing the true correlation matrices related to processes 6 and 7 with matrices calculated based on the true generating processes but with perturbed parameters. Specifi cally, the response onset times were delayed by Δt with respect to the true onset times. Marked reduction in performance was observed for perturbations of size greater than Δt = 5 time samples, which is 10% of the inter stimulus interval (50 time samples) and 20% of the exponential decay time constant (25 time samples). We note that the reduction in performance for large Δt was limited to process 6 and 7, even for Δt as high as 25. A different perturbation of the correlation matrices related to processes 6 and 7 is presented in Figure 3F.
Here, we manipulated the exponential decay time constant used in the calculation of the correlation matrices by increasing it by Δτ. In this case, performance for all processes was either unaffected or affected very moderately by the perturbation, even when the decay time constant was assumed twice its true value (Δτ = 25). Based on Figures 3E,F, we conclude that the algorithm always showed some degree of robustness with respect to the perturbation of the generating process parameters, but with different sensitivity for different parameters. Similar behavior was observed for perturbation of the parameters of other processes (data not shown).
In Figure 3G, we show that in some cases the algorithm can retain good performance even if some of the correlation matrices involved are completely unknown. Specifi cally, we tested the performance of the algorithm when (a) only the autocorrelation matrices but not the cross correlation matrices were assumed to be known and (b) when only the signal's autocorrelation matrix was known and the noise was assumed to be white. These two approaches can be viewed as the result of heuristically replacing unknown auto-and cross-correlation matrices with generic matrices, i.e., the identity and zero matrices, respectively. Omission of the cross correlation matrices led to a decrease in performance that was limited to the two pairs of coupled processes (1,2 and 6,7). Still, both ρ s and ρ t reached reasonable values of 0.8 or higher for these processes. The white noise approach led to a more pronounced decrease in performance. However, this reduction was again limited mainly to the coupled processes, with an only moderate decrease in performance for the other processes. This result suggests that the white noise approach, which reduces signifi cantly the number of correlation matrices that have to be known or estimated, can be useful for certain datasets. The fact that the algorithm performed well for most processes despite the strong deviation between the true and putative correlation matrices is a further indication of the algorithm's robustness.

DISCUSSION
We presented an algorithm for the analysis of temporally structured multidimensional data. Our approach in designing the algorithm was to allow it to benefi t as much as possible from prior knowledge about the temporal structure of the data. We proposed a framework in which this knowledge can be expressed mathematically, and an algorithm for using it to extract spatial structures and fi ner temporal details from the data.
A principal feature of the algorithm is the characterization of the temporal structure of the underlying sources through correlation matrices. This approach allows the algorithm to deal with a very the Fourier transform was chosen because t takes discrete values). In order to construct the correlation matrix for stationary sources of this type it is therefore suffi cient to have an estimate of the power spectrum of the fl uctuations, up to scaling. We suggest that such estimation is possible in some cases. For example, approximately periodic processes, such as those related to heartbeat or respiration, have highly concentrated spectra with most energy concentrated around a fundamental frequency and possibly around higher order harmonics. Estimation of the spectrum in this case reduces to the estimation of the fundamental frequency, and the relative contribution of different harmonics. We also note that in the case where the signal's spectrum is assumed to take a constant value within a narrow band around a particular frequency, there is an interesting relation to the theory of multitaper spectral estimation (Thomson, 1982;Percival and Walden, 1993). The estimator m Q (ψ) can be then viewed as a weighted sum of an estimator for the signal power and an estimator for the noise power, where the signal estimator is a multitaper spectral estimator constructed with all Slepian tapers scaled by their eigenvalues.
As a fi nal example of processes that can be dealt with by the algorithm we consider a simple traveling wave of the form [X] pt = sin (Kp − ft − φ), where K is the spatial frequency, p is the channel index that corresponds in this case to some location on a line, f is the temporal frequency, and φ is the random phase. This example is qualitatively different from the previous examples because a traveling wave cannot be described as fl uctuations of a fi xed spatial pattern, and therefore it cannot be dealt with by the simplifi ed algorithm. Calculating the correlation matrices, we fi nd that depending on the distribution of φ, either a single correlation matrix is required, [C] tt′ = cos(f(t − t′)), or three correlation matrices are required, [C 1 ] tt′ = cos(f(t − t′)), [C 2 ] tt′ = cos(f(t + t′)), and [C 3 ] tt′ = sin(f(t + t′)) (the latter case occurs whenever E(e 2iφ ) ≠ 0). In this case, all the information required to construct the correlation matrices is the wave's temporal frequency, and a specifi c characteristic of the distribution of the phase.
Though different, the three examples of source types given above illustrate a common scheme that can be used to construct the correlation matrices in practical applications. First, theoretical correlation matrices are calculated based on an idealized generative process or a "toy model" for a source, e.g., the equation for the synchronized stimulus-triggered response given above. Then, the actual matrices are calculated, either by estimating the remaining free parameters of the correlation matrices, e.g., the shape of the response function, or simply by plugging in known values for these parameters, e.g., stimulation times. What limits the method in this respect is thus the ability to postulate appropriate generative equations for the model, calculate the theoretical correlation matrices, and estimate the free parameters.
Regardless of the exact way by which the correlation matrices are constructed, they can be viewed as "weak models" of the data in the sense that they describe its structure not by specifying a full probability density function but only through its second order moments. The method can be therefore placed somewhere between parametric and non-parametric methods. Non-parametric methods such as PCA incorporate the assumption that the temporal structure of the data is unknown. Other methods such as independent component analysis (ICA; e.g., Hyvärinen and Oja, 2000) usually make very weak broad class of signal and noise sources. Specifi cally, we have shown that if the necessary correlation matrices are known, the algorithm is applicable whenever the following assumptions hold: (1) the signal and the noise combine additively; (2) the "contrastability" condition is satisfi ed, which, loosely speaking, means that the signal and the noise are suffi ciently different; and (3) there is suffi cient amount of data. Thus, common assumptions such as independence of the signal and noise or stationarity are not required (these cases are respectively dealt with by the appropriate interaction correlation matrices and non-Toeplitz correlation matrices); the algorithm can in principle be applied to arbitrarily complex signal or noise sources. Still, this versatility critically depends on the knowledge of the necessary correlation matrices. Below we demonstrate how prior temporal knowledge can be used to construct the correlation matrices for several prototypical types of sources.
Processes 5-8 in Figures 2 and 3 represent a specifi c type of stimulus-triggered response, where the temporal structure of the response is equal for all channels, up to scaling. Equivalently, a response of this type can be described as the synchronous emergence over all channels of a particular spatial pattern, i.e., an "activation map". We can consider a general form for such a signal, [X] pt = Σ l ς p r l g(t − t l ), with p and t denoting respectively channel and time indices, ς p denoting the scaling of channel p, t l denoting the time of stimulation l, g(·) being the temporal profi le of the response, and r l being independent and identically distributed random variables representing the amplitudes of the responses. When calculating the necessary correlations matrices we fi nd that there is a single correlation matrix associated with sources of this class; this correlation matrix is given by where c v is the coeffi cient of variation of any of the variables r l , and δ is Kronecker's delta. Thus, the temporal information required for constructing the correlation matrix is the time of stimulation, the coeffi cient of variation of the response amplitude, and the temporal profi le of the activation, which has to be known only up to scaling. In practical applications, these quantities are either known or can be estimated: the time of stimulation is known as it is set by the experimental protocol; the activation function can be estimated based on a preliminary analysis or on some canonical form (as it is commonly done for fMRI data, where the activation function is the hemodynamic response function); and the coeffi cient of variation, which measures the regularity of the response, can be estimated based on a preliminary analysis or on theoretical grounds. Another class of sources is represented by processes 0-4. These sources are stationary fl uctuations of a fi xed spatial pattern; the amplitude of these fl uctuations is determined by a zero-mean stationary process a(t) : [X] pt = ς p a(t). In this case, the single correlation matrix associated with the process is a Toeplitz matrix whose value at row t and column t′ is given by the value of the autocovariance sequence of a(t) at lag t − t′. In turn, the autocovariance sequence can be expressed as the Fourier transform of the power spectrum of the process a(t), a quantity denote by S(f), with f being the frequency in radians per sample. The correlation matrix can thus be expressed as: C Sf f (this particular form of assumptions about the temporal structure of the data, e.g., that the signal and noise are statistically independent. These methods can be highly useful in many cases and offer the best solution one can hope for in situations where indeed temporal information is unavailable. However, many times this is not the case. Information about the frequencies of different signals or about points in time where they emerge is often at hand. Moreover, some of the temporal aspects of the data are in fact defi ned by the experimental protocol, and therefore can be controlled by the experimentalist, e.g., by setting the stimulation frequency. On the other hand, parametric methods such as Bayesian methods make the opposite extreme assumption, by requiring a complete statistical description of the data. Under such strong assumptions, Bayesian methods are optimal. In practice, the true probability function generating the data is rarely known and may be very diffi cult to estimate. This limits the applicability of parametric methods in many neuroscience contexts. The present method offers a compromise between these two approaches. In this respect, it is similar in spirit to the common practice of fi ltering the data at a specifi c frequency band or applying other Fourier-theoretical tools. In these cases, frequencydomain information is used without an explicit specifi cation of a full probability function. However, such procedures assume stationarity, at least locally, and ignore any information about the signal's temporal phase. In addition, Fourier based methods are often applied independently for each channel, and therefore ignore the multidimensionality of the data. One spectral method that is inherently multidimensional is "space frequency SVD" (Mitra and Pesaran, 1999). The mathematical formulation of this method, which involves multitaper spectral estimators, is somewhat similar to the formulation presented above for the application of the algorithm on stationary data. However, whereas space frequency SVD assumes the different signals to be stationary, with well-localized non-overlapping spectra, our approach relaxes these assumptions and is therefore applicable to a wider class of problems.
The idea of exploiting temporal structures in the data has been incorporated into both PCA and ICA frameworks before (Molgedey and Schuster, 1994;Ziehe and Müller, 1998;Sornborger et al., 2005;de Cheveigné and Simon, 2007), albeit for more specifi c setups. Out of these algorithms, we fi nd constrained ICA (cICA; Rajapakse, 2000, 2005;James and Gibson, 2003) to be most similar to TSCA in several aspects. The cICA algorithm fi nds independent components in the data that satisfy user-defi ned temporal constraints. This algorithm thus elegantly incorporates prior knowledge about the temporal structure of the data into the analysis procedure, and is similar in this respect to our algorithm. There are, however, several fundamental differences between our algorithm and the family of ICA methods, and in particular cICA. Specifi cally, our algorithm does not require the data to be a mixture of a fi nite number of statistically-independent signals, which is the basic ICA assumption. This assumption is restrictive in many practical cases, because it doesn't allow any correlations between the signal and the noise. Moreover, for signals such as the traveling wave considered above, no ICA decomposition is feasible. In contrast, our algorithm readily deals with statistical dependencies in the data, as demonstrated by Figures 2 and 3. Another difference between ICA and our algorithm is that ICA typically requires the number of components constituting the data to be known a-priori, whereas TSCA is more fl exible as this number is not explicitly required. Rather, the number of signal autocorrelation matrices supplied to the algorithm serves only as a lower bound for the number of signal components. An additional difference between ICA algorithms and TSCA is that ICA typically allows only one component to be Gaussian. No such restriction is imposed in our case. This is demonstrated in Figures 2  and 3, in which processes 0-4 are Gaussian.
The cICA algorithm, while in rationale very similar to our algorithm, uses a mathematical construction very different from ours. In particular, our objective function is quadratic, so its maximization requires the solution of a linear system, given by a matrix diagonalization problem. Standard linear algebra methods are applicable and, unlike cICA, a search for locally optimal solutions is not necessary. An additional difference is that TSCA not only takes into account the temporal structure of the signal, through the signal correlation matrices, but also that of the noise, through the noise correlation matrices. More importantly, although cICA is defi ned using a very general theoretical framework that allows for a wide range of temporal constraints, practical applications of cICA and the development of stable numerical schemes for solving the cICA problem have been mainly restricted to constraining the time course of the components to be correlated with an a-priori known reference signal. The reference signal, whether derived from an independent measurement, a preliminary analysis, or theoretical considerations, requires the knowledge of the approximate behavior of the time course on a single trial basis. Thus, reference cICA addresses only a subclass of problems that are addressable with TSCA, which requires only knowledge of long term second order statistics.
While parallels between ICA methods and our algorithm can be drawn, we fi nd TSCA to be most similar to PCA, as our objective function is a direct generalization of the PCA objective function. The utility of PCA for data of the type we considered here is often questioned on the merit that it produces non-interpretable components. We endorse this observation, but point out that the typical failure of PCA stems not from the uselessness of the PCA strategy, i.e., the extraction of components with high power, but from the fact that it fails to distinguish between components of interest (signal) and nuisance components (noise). Our algorithm tackles this problem directly, and while maintaining the basic PCA principles, uses an objective function that contrasts the signal to the noise. To estimate the objective function we take into account the temporal structure of the data, which is ignored in naïve PCA applications. The output of the algorithm, is analogous to the PCA output, and is given by spatial components ranked by their score on the objective function and associated with temporal projections. PCA itself is often described as a rotation of the (spatial) coordinates followed by a projection on the leading coordinates. The same description holds here. The main difference is that the rotation performed here yields projections that have some prescribed temporal properties. Finally, there is an even stronger relation between PCA and TSCA for the particular choice of an objective function with γ X = 1, γ Y = 0. As it follows from the analysis in Section 1 of the Supplementary Material, the operation of TSCA in this case is nothing but PCA performed on the signal alone, with the noise completely ignored. January 2010 | Volume 3 | Article 28 | 11

Blumenfeld
Multidimensional data analysis could be construed as a potential confl ict of interest. From a computational perspective, the difference between PCA and the present algorithm amounts to the diagonalization of a matrix of the form ZQZ T instead of ZZ T . A related difference is that the objective function used here is not necessarily positive; correspondingly, neither Q nor ZQZ T are necessarily positive semi defi nite. As a result, the eigenvalues of ZQZ T can be either positive or negative. Throughout the paper we have focused on components with the largest eigenvalues, which estimate leading components of the signal. However, components with highly negative eigenvalues can be of interest as well in some cases. Consider for example the case where the signal and the noise are spatially orthogonal, i.e., all directions with positive signal power are orthogonal to all directions with positive noise power. If an objective functions with γ X = 1, γ Y < 0 is selected, the trailing TSCA components, which achieve highly negative scores on the objective function, are the dominant components of the noise, regardless of the precise value of γ Y . In more realistic cases, where the signal and noise are only approximately spatially orthogonal, we can still expect trailing components to exhibit some resemblance to dominant noise components. This can be seen in Figure 1D, where the spatial components of the signal and noise are approximately orthogonal ( , . ), 〈 〉≈ u u X Y 0 14 and the trailing TSCA eigenvector ψ 2 is similar to the noise spatial component u Y . In Figure 1E, where the roles of the signal and noise are switched, ψ 2 resembles u X . This observation gives rise to an interesting interpretation of TSCA for cases where the signal and noise are approximately orthogonal. The algorithm can be viewed as splitting the data space into two orthogonal subspaces. One space, spanned by the leading eigenvectors with positive eigenvalues, corresponds to the signal, while the other one, spanned by the trailing eigenvectors with negative eigenvalues, corresponds to the noise. We note that the dimensionality of these subspaces is related to the eigenspectrum of Q: by Sylvester's law of inertia (Ostrowski, 1960) the number of positive (or negative) eigenvalues of Q sets an upper bound for the dimension of the signal (or noise) subspace.
More generally, the algorithm puts forward a unifi ed framework by which many types of signals can be characterized and studied in a systematic manner. Because of its generality and versatility, we propose that this method can be highly useful for exploratory analysis. In an exploratory context the construction of the weak models, i.e., the estimation of the correlation matrices, can be based on hypotheses rather than on well-established knowledge. In other words, if one can make a good guess about the temporal structure of the data and formulate it through appropriate correlation matrices, the algorithm can be applied to fi nd both spatial components and a more detailed temporal structure. Our studies show that the algorithm always displays at least some degree of robustness to perturbations of the correlation matrices (Figure 3). Thus, the algorithm exhibits some tolerance to errors in the hypotheses under which the correlation matrices are constructed. This robustness suggests an interesting possibility of learning the correlation matrices from the data, through an iterative scheme. Such an adaptive algorithm can be highly benefi cial, as it might make use of even more limited knowledge about the temporal structure of the data. However, transforming this idea into a rigorous algorithm is beyond the scope of the current contribution and it is left for future research.

ACKNOWLEDGMENTS
I deeply thank Misha Tsodyks for ongoing assistance and support, Omri Barak for valuable discussions and critical reading of the manuscript, and an anonymous reviewer for numerous helpful suggestions. I also thank Alex Loebel, Son Preminger and Alik Mokeichev for critical reading of the manuscript in various stages of its preparation.