CUR Decompositions, Similarity Matrices, and Subspace Clustering

A general framework for solving the subspace clustering problem using the CUR decomposition is presented. The CUR decomposition provides a natural way to construct similarity matrices for data that come from a union of unknown subspaces U = ∪_{i=1}^M S_i. The similarity matrices thus constructed give the exact clustering in the noise-free case. A simple adaptation of the technique also allows clustering of noisy data. Two known methods for subspace clustering can be derived from the CUR technique. Experiments on synthetic and real data are presented to test the method.


I. INTRODUCTION
We present here two tales: one about the so-called CUR decomposition (or sometimes skeleton decomposition), and another about the subspace clustering problem. It turns out that there is a strong connection between the two subjects, in that the CUR decomposition provides a general framework for the similarity matrix methods used to solve the subspace clustering problem, while also giving a natural link between these methods and other minimization problems related to subspace clustering.
The CUR decomposition is remarkable in its simplicity as well as its beauty: one can decompose a given matrix A into the product of three matrices, A = CU^†R, where C is a subset of columns of A and R is a subset of rows of A (see Theorem 1 for a precise statement). The primary uses of the CUR decomposition to date are in the field of scientific computing. In particular, it has been used as a low-rank approximation method that is more faithful to the data structure than other factorizations [1], [2], as an approximation to the singular value decomposition [3], [4], [5], and it has also provided efficient algorithms to compute with and store massive matrices in memory. In the sequel, it will be shown that this decomposition is the source of some well-known methods for solving the subspace clustering problem, while also adding to the construction of many similarity matrices based on the data.
The subspace clustering problem may be stated as follows: suppose that some collected data vectors in K^m (with m large, and K being either R or C) come from a union of linear subspaces (often low-dimensional) of K^m, which will be denoted by U = ∪_{i=1}^M S_i. However, one does not know a priori what the subspaces are, or even how many of them there are. Consequently, one desires to determine the number of subspaces represented by the data, the dimension of each subspace, a basis for each subspace, and finally to cluster the data: the data {w_j}_{j=1}^n ⊂ U are not ordered in any particular way, and so clustering the data means determining which data belong to the same subspace.
There are indeed physical systems which do fit into the model just described; two particular examples are motion tracking and facial recognition. For example, the Yale Face Database B [6] contains images of faces, each taken with 64 different illumination patterns. Given a particular subject i, there are 64 images of their face illuminated differently, and each image represents a vector lying approximately in a low-dimensional linear subspace, S_i, of the higher dimensional space R^{307,200} (based on the size of the greyscale images). It has been experimentally shown that images of a given subject approximately lie in a subspace S_i having dimension 9 [7]. Consequently, a data matrix obtained from facial images under different illuminations has columns which lie in a union of low-dimensional subspaces, and one would desire an algorithm which can sort, or cluster, the data, thus recognizing which faces are the same.
Many methods in the subspace clustering literature begin by finding a similarity matrix for a given set of data, i.e., a square matrix whose entries are nonzero precisely when the corresponding data vectors lie in the same subspace, S_i, of U. The present article is concerned with a certain matrix factorization method, the CUR decomposition, which provides a quite general framework for finding a similarity matrix for data that fits the subspace model above. It will be demonstrated that the CUR decomposition indeed produces similarity matrices for subspace data. Moreover, this decomposition provides a bridge between matrix factorization methods and minimization problem methods such as Low-Rank Representation [20], [21].

A. Paper Contributions
• In this work, we show that the CUR decomposition gives rise to similarity matrices for clustering data that comes from a union of independent subspaces. Specifically, given the data matrix W = [w_1 · · · w_n], any CUR decomposition W = CU^†R can be used to construct a similarity matrix for W. In particular, if Y = U^†R and Q is the element-wise binary or absolute value version of Y^*Y, then Ξ_W = Q^{d_max} is a similarity matrix for W; i.e., Ξ_W(i, j) ≠ 0 if the columns w_i and w_j of W come from the same subspace, and Ξ_W(i, j) = 0 if the columns w_i and w_j of W come from different subspaces.
• This paper extends our previous framework for finding similarity matrices for clustering data that comes from a union of independent subspaces. In [29], we showed that any factorization W = BP, where the columns of B come from U and form a basis for the column space of W, can be used to produce a similarity matrix Ξ_W. This work shows that we do not need to limit the factorization of W to bases, but may extend it to frames.
• Starting from the CUR decomposition framework, we demonstrate that some well-known methods utilized in subspace clustering follow as special cases, or are tied directly to the CUR decomposition; these methods include the Shape Interaction Matrix [30], [31] and Low-Rank Representation [20], [21].
• Experiments on synthetic and real data (specifically, the Hopkins155 motion dataset) are performed to justify the proposed theoretical framework. It is demonstrated that using an average of several CUR decompositions to find similarity matrices for a data matrix W outperforms some known methods in the literature.

B. Layout
The rest of the paper develops as follows: a brief section on preliminaries is followed by the statement and discussion of the most general exact CUR decomposition. Section IV contains the statements of the main results of the paper, while Section V relates the general framework that CUR provides to other methods for solving the subspace clustering problem. The proofs of the main theorems are enumerated in Section VI, whereupon the paper concludes with some numerical experiments and a discussion of future work.
II. PRELIMINARIES

A. Notation

Throughout, K denotes either R or C, A^* denotes the conjugate transpose of a matrix A, and A^† denotes its Moore–Penrose pseudoinverse. For these and other notions, see Section 5 of [32].
Also of utility to our analysis is the fact that a rank r matrix has a so-called skinny SVD of the form A = U_r Σ_r V_r^*, where U_r comprises the first r left singular vectors of A, V_r comprises the first r right singular vectors, and Σ_r = diag(σ_1, . . . , σ_r) ∈ K^{r×r}. Note that in the case that rank(A) > r, the skinny SVD is simply a low-rank approximation of A.
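For concreteness, the skinny SVD is immediate to compute with numpy; a minimal sketch (the function name is our own):

```python
import numpy as np

def skinny_svd(A, r):
    """Return U_r, Sigma_r, V_r so that U_r @ Sigma_r @ V_r.conj().T is the
    rank-r skinny SVD of A (a low-rank approximation when rank(A) > r)."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vh[:r].conj().T
```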
Definition 1 (Independent Subspaces). Non-trivial subspaces {S_i ⊂ K^m}_{i=1}^M are called independent if their dimensions satisfy dim(S_1 + · · · + S_M) = ∑_{i=1}^M dim(S_i). This definition is equivalent to the property that any set of non-zero vectors {w_1, . . . , w_M} such that w_i ∈ S_i, i = 1, . . . , M, is linearly independent.
Definition 2 (Generic Data). Let S be a linear subspace of K^m with dimension d. A set of data W drawn from S is said to be generic if (i) |W| > d, and (ii) every d vectors from W form a basis for S.
Note this definition is equivalent to the frame-theoretic description that the columns of W form a frame for S with spark d + 1 (see [33], [34]). It is also sometimes said that the data W is in general position.
Definition 3 (Similarity Matrix). Let W = [w_1 · · · w_n] ∈ K^{m×n} have columns drawn from a union of subspaces U = ∪_{i=1}^M S_i. We say Ξ_W is a similarity matrix for W if and only if (i) Ξ_W is symmetric, and (ii) Ξ_W(i, j) ≠ 0 if and only if w_i and w_j come from the same subspace.
Finally, if A ∈ K^{m×n}, we define its absolute value version via abs(A)(i, j) = |A(i, j)|, and its binary version via bin(A)(i, j) = 1 if A(i, j) ≠ 0 and bin(A)(i, j) = 0 if A(i, j) = 0.
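In floating-point arithmetic exact zeros become near-zeros, so a sketch of these element-wise operations needs a small tolerance (the tolerance is our addition, not part of the definition):

```python
import numpy as np

def abs_version(A):
    # abs(A)(i, j) = |A(i, j)|
    return np.abs(A)

def bin_version(A, tol=1e-10):
    # bin(A)(i, j) = 1 if A(i, j) != 0 (up to tol), else 0
    return (np.abs(A) > tol).astype(A.real.dtype)
```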

B. Assumptions
In the rest of what follows, we will assume that U = ∪_{i=1}^M S_i is a nonlinear set consisting of the union of non-trivial, independent, linear subspaces {S_i}_{i=1}^M of K^m, with corresponding dimensions {d_i}_{i=1}^M, and d_max := max_{1≤i≤M} d_i. We will assume that the data matrix W = [w_1 · · · w_n] ∈ K^{m×n} has column vectors drawn from U, and that the data drawn from each subspace S_i is generic for that subspace.

III. CUR DECOMPOSITION
Our first tale is the remarkable CUR matrix decomposition, also known as the skeleton decomposition [35], [36], whose proof can be obtained by basic linear algebra.

Theorem 1. Suppose A ∈ K^{m×n} has rank r. Let I ⊂ {1, . . . , m} and J ⊂ {1, . . . , n} with |I| = s and |J| = k, and let C be the m × k matrix whose columns are the columns of A indexed by J. Let R be the s × n matrix whose rows are the rows of A indexed by I. Let U be the s × k sub-matrix of A whose entries are indexed by I × J. If rank(U) = r, then A = CU^†R.

Proof. Since U has rank r, rank(C) = r. Thus the columns of C form a frame for the column space of A, and we have A = CX for some (not necessarily unique) k × n matrix X. Let P_I be an s × m row selection matrix such that R = P_I A; then we have R = P_I A = P_I CX. Note also that U = P_I C, so the last equation can be written as R = UX. Since rank(R) = r, any solution to R = UX is also a solution to A = CX.
Thus the conclusion of the theorem follows upon observing that Y = U^†R is a solution to R = UX; thus A = CU^†R. Note that the assumption on the rank of U implies that k, s ≥ r in the theorem above. While this theorem is quite general, it should be noted that in some special cases it reduces to a much simpler decomposition, a fact that is recorded in the following corollaries. The proof of each corollary follows from the fact that the pseudoinverse U^† takes those particular forms whenever the columns or rows of U are linearly independent ([32, p. 257], for example).
Corollary 1. Let A, C, U, and R be as in Theorem 1 with C ∈ K^{m×r}; in particular, the columns of C are linearly independent. Then A = C(U^*U)^{-1}U^*R.

Corollary 2. Let A, C, U, and R be as in Theorem 1 with R ∈ K^{r×n}; in particular, the rows of R are linearly independent. Then A = CU^*(UU^*)^{-1}R.

Corollary 3. Let A, C, U, and R be as in Theorem 1 with U ∈ K^{r×r}; in particular, U is invertible. Then A = CU^{-1}R.

In most sources, the decomposition of Corollary 3 is what is called the skeleton or CUR decomposition [37], though the case when k = s > r has been treated in [38]. The statement of Theorem 1 is the most general version of the exact CUR decomposition.
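Theorem 1 is easy to verify numerically; the following sketch is illustrative (the sizes and the random choice of I and J are arbitrary, and one simply resamples in the rare event that rank(U) < r):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 5

# A random m x n matrix of rank r.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Choose k = 10 columns (J) and s = 12 rows (I); Theorem 1 requires rank(U) = r.
J = rng.choice(n, size=10, replace=False)
I = rng.choice(m, size=12, replace=False)
C, R = A[:, J], A[I, :]
U = A[np.ix_(I, J)]
assert np.linalg.matrix_rank(U) == r  # holds almost surely for random selections

# Exact reconstruction A = C U^+ R, up to round-off.
print(np.linalg.norm(A - C @ np.linalg.pinv(U) @ R))
```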
The precise history of the CUR decomposition is somewhat difficult to discern. Many articles cite Gantmacher [39], though the authors have been unable to find the term skeleton decomposition therein.
Perhaps the modern starting point of interest in this decomposition is the work of Goreinov, Tyrtyshnikov, and Zamarashkin [35], [37]. They begin with the CUR decomposition as in Corollary 3, and study the error ‖A − CU^{-1}R‖_2 in the case that A has rank larger than r, whereby the decomposition CU^{-1}R is only approximate. Additionally, they allow more flexibility in the choice of U, since computing the inverse directly may be computationally difficult. See also [40], [41], [42].
More recently, there has been renewed interest in this decomposition. In particular, Drineas, Kannan, and Mahoney [5] provide two randomized algorithms which compute an approximate CUR factorization of a given matrix A. Moreover, they provide error bounds based upon the probabilistic method for choosing C and R from A. It should also be noted that their middle matrix is not U^† as in Theorem 1.
Moreover, Mahoney and Drineas [1] give another CUR algorithm based on a way of selecting columns which provides nearly optimal error bounds for ‖A − CUR‖_F: in the sense that the best rank r approximation to a matrix A in the Frobenius norm is its skinny SVD of rank r, denoted A_r, they obtain error bounds of the form ‖A − CUR‖_F ≤ (2 + ε)‖A − A_r‖_F. They also note that the CUR decomposition should be favored in analyzing real data that is low-dimensional, because the matrices C and R maintain the structure of the data, and the CUR decomposition thus admits a viable interpretation of the data, as opposed to attempting to interpret singular vectors of the data matrix, which are generally linear combinations of the data. See also [43].

IV. SUBSPACE CLUSTERING VIA CUR DECOMPOSITION
Our second tale is one of the utility of the CUR decomposition in the similarity matrix framework for solving the subspace segmentation problem discussed above. Prior works have typically focused on CUR as a low-rank matrix approximation method which has low cost and also remains more faithful to the initial data than the singular value decomposition. This perspective is quite useful, but here we demonstrate what appears to be the first application in which CUR is responsible for an overarching framework, namely subspace clustering.
As mentioned in the introduction, one approach to clustering subspace data is to find a similarity matrix from which one can simply read off the clusters, at least when the data exactly fits the model and is considered to have no noise. The following theorem provides a way to find many similarity matrices for a given data matrix W, all stemming from different CUR decompositions (recall that a matrix has very many CUR decompositions depending on which columns and rows are selected).

Theorem 2. Let W = [w_1 · · · w_n] ∈ K^{m×n} be a rank r matrix whose columns are drawn from U. Let W be factorized as W = CU^†R, where C ∈ K^{m×k}, R ∈ K^{s×n}, and U ∈ K^{s×k} are as in Theorem 1, and let Y = U^†R and Q be either the binary or absolute value version of Y^*Y. Then Ξ_W = Q^{d_max} is a similarity matrix for W.

The key ingredient in the proof of Theorem 2 is the fact that the matrix Y = U^†R, which generates the similarity matrix, has a block diagonal structure due to the independent subspace structure of U; this fact is captured in the following theorem.
Theorem 3. Let W, C, U, and R be as in Theorem 2. If Y = U^†R, then there exists a permutation matrix P such that YP = diag(Y_1, . . . , Y_M) is block diagonal, where each Y_i is a matrix of size k_i × n_i, n_i is the number of columns in W from subspace S_i, and k_i is the number of columns in C from S_i. Moreover, WP has the form [W_1 . . . W_M], where the columns of W_i come from the subspace S_i.
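For noise-free data, the construction of Theorem 2 can be sketched in a few lines of numpy (the function name and the index sets I, J are illustrative; d_max is the maximal subspace dimension):

```python
import numpy as np

def cur_similarity(W, I, J, d_max):
    """Xi_W = Q^{d_max}, where Q = abs(Y^* Y) and Y = U^+ R, as in Theorem 2."""
    R = W[I, :]                      # s x n row restriction
    U = W[np.ix_(I, J)]              # s x k submatrix; needs rank(U) = rank(W)
    Y = np.linalg.pinv(U) @ R
    Q = np.abs(Y.conj().T @ Y)       # absolute value version of Y^* Y
    return np.linalg.matrix_power(Q, d_max)
```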
Perhaps the most important consequence of Theorem 2 is that the shape interaction matrix is a special case of the general framework of the CUR decomposition.
Corollary 4. Let W = [w_1 · · · w_n] ∈ K^{m×n} be a rank r matrix whose columns are drawn from U. Let W be factorized as W = WW^†W, and let Q be either the binary or absolute value version of W^†W. Then Ξ_W = Q^{d_max} is a similarity matrix for W. Moreover, if the skinny singular value decomposition of W is U_r Σ_r V_r^*, then W^†W = V_r V_r^*.

Corollary 5. Let W ∈ K^{m×n} be a rank r matrix whose columns are drawn from U. Choose C = W, and R to be any rank r row restriction of W. Then W = WR^†R by Theorem 1. Moreover, R^†R = W^†W = V_r V_r^*, where V_r is as in Corollary 4.
The proofs of the above facts will be given in Section VI.

V. RELATION TO KNOWN METHODS

A. Shape Interaction Matrix
In their pioneering work on factorization methods for motion tracking [30], Costeira and Kanade introduced the Shape Interaction Matrix (SIM). Given a data matrix W whose skinny SVD is U_r Σ_r V_r^*, SIM(W) is defined to be V_r V_r^*. Following their work, the SIM has found wide utility in theory and in practice. Their observation was that the SIM often provides a similarity matrix for data coming from independent subspaces. It should be noted that in [29] it was shown that there exist data matrices W such that V_r V_r^* is not a similarity matrix for W; however, it was also noted there that SIM(W) is almost surely a similarity matrix.
Note though that there is an easy way around this due to Theorem 2: if Q = abs(V_r V_r^*) or bin(V_r V_r^*), then Q^{d_max} is a similarity matrix for W. This also provides some theoretical justification for the idea of the Robust Shape Interaction Matrix in [31], where the authors took (entrywise) powers of SIM(W) to get a better similarity matrix for noisy data.
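A sketch of this powered SIM (the function name is ours; by Corollary 4 it agrees with the CUR construction when C = W and R = W):

```python
import numpy as np

def powered_sim(W, r, d_max):
    """Q^{d_max} with Q = abs(V_r V_r^*), which Corollary 4 guarantees is a
    similarity matrix for noise-free data from independent subspaces."""
    _, _, Vh = np.linalg.svd(W, full_matrices=False)
    Vr = Vh[:r].conj().T
    Q = np.abs(Vr @ Vr.conj().T)
    return np.linalg.matrix_power(Q, d_max)
```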

B. Low-Rank Representation Algorithm
Another class of methods for solving the subspace clustering problem arises from solving some sort of minimization problem. It has been noted that in many cases such methods are intimately related to certain matrix factorization methods [28], [51].
One particular instance of a minimization based algorithm is the Low-Rank Representation (LRR) algorithm of Liu, Lin, and Yu [20], [21]. As a starting point, the authors consider the following rank minimization problem:

min_Z rank(Z) subject to W = AZ, (V.1)

where A is a dictionary that linearly spans W.
Note that there is indeed something to minimize here: if A = W, then Z = I_{n×n} satisfies the constraint, and evidently rank(Z) = n; however, if rank(W) = r, then Z = W^†W is a solution to W = WZ, and it is easily shown that rank(W^†W) = r. Note further that any Z satisfying W = WZ must have rank at least r, and so we have the following.
Proposition 1. Let W be a rank r data matrix whose columns are drawn from U. Then W^†W is a solution to the minimization problem

min_Z rank(Z) subject to W = WZ.

Note that in general, solving the minimization problem (V.1) is NP-hard, since it is equivalent to minimizing ‖σ(Z)‖_0, where σ(Z) is the vector of singular values of Z and ‖·‖_0 counts the number of nonzero entries of a vector. Additionally, the solution to (V.1) is generally not unique, so typically the rank function is replaced with some norm to produce a convex optimization problem. Based upon intuition from the compressed sensing literature, it is natural to replace ‖σ(Z)‖_0 by ‖σ(Z)‖_1, which is the definition of the nuclear norm, denoted ‖Z‖_* (also called the trace norm, Ky Fan norm, or Schatten 1-norm). In particular, in [20] the following problem was considered:

min_Z ‖Z‖_* subject to W = AZ. (V.2)

Solving this minimization problem applied to subspace clustering is dubbed Low-Rank Representation by the authors of [20].
Let us now specialize these problems to the case when the dictionary is chosen to be the whole data matrix, in which case we have

min_Z ‖Z‖_* subject to W = WZ. (V.3)

It was shown in [21], [51] that the SIM defined in Section V-A is the unique solution to problem (V.3):

Theorem 4 ([51], Theorem 3.1). Let W ∈ K^{m×n} be a rank r matrix whose columns are drawn from U, and let W = U_r Σ_r V_r^* be its skinny SVD. Then V_r V_r^* is the unique solution to (V.3).
For clarity and completeness of exposition, we supply a simpler proof of Theorem 4 here than appears in [51].
Proof. Suppose that W = UΣV^* is the full SVD of W. Since W has rank r, we may write U = [U_r Ũ_r] and V = [V_r Ṽ_r], where U_r Σ_r V_r^* is the skinny SVD of W. If W = WZ, then W(I − Z) = 0, so every column of I − Z lies in the null space of W, which is spanned by the columns of Ṽ_r. Consequently, I − Z = Ṽ_r X for some X, and thus Z = I − Ṽ_r X. Recall that the nuclear norm is unchanged by multiplication on the left or right by unitary matrices, whereby we see that

‖Z‖_* = ‖V^*Z‖_* = ‖[V_r^*(I − Ṽ_r X); Ṽ_r^*(I − Ṽ_r X)]‖_* = ‖[V_r^*; Ṽ_r^* − X]‖_*,

using V_r^*Ṽ_r = 0 and Ṽ_r^*Ṽ_r = I. Due to this block structure, we have ‖[V_r^*; B]‖_* ≥ ‖V_r^*‖_*, with equality if and only if B = 0 (for example, via the same proof as [52, Lemma 1], or Lemma 7.2 of [21]).
It follows that ‖Z‖_* > ‖V_r^*‖_* whenever B = Ṽ_r^* − X ≠ 0. Taking X = Ṽ_r^* gives Z = I − Ṽ_r Ṽ_r^* = V_r V_r^*. Hence Z = V_r V_r^* is the unique minimizer of problem (V.3).
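The conclusion of Theorem 4 is easy to sanity-check numerically; a small sketch comparing the nuclear norms of two feasible solutions of (V.3):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 20))  # rank 4

Z = np.linalg.pinv(W) @ W                 # W^+ W = V_r V_r^*
print(np.linalg.norm(W - W @ Z))          # feasibility: ~ 0

nuc = lambda M: np.linalg.norm(M, ord='nuc')
print(nuc(Z), nuc(np.eye(20)))            # ~4.0 vs 20.0: the identity is feasible
                                          # but far from nuclear-norm optimal
```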
Corollary 6. Let W be as in Theorem 4, and let W = WR^†R be a factorization of W as in Theorem 1. Then R^†R = V_r V_r^* is the unique solution to the minimization problem (V.3).
Let us note that the unique minimizer of problem (V.2) is known for general dictionaries as the following result of Liu, Lin, Yan, Sun, Yu, and Ma demonstrates.
Theorem 5 ([21], Theorem 4.1). Suppose that A is a dictionary that linearly spans W. Then the unique minimizer of problem (V.2) is Z = A^†W.

The following corollary is thus immediate from the CUR decomposition.

Corollary 7. If W has CUR decomposition W = CC^†W (where R = W, hence U = C, in Theorem 1), then C^†W is the unique solution to min_Z ‖Z‖_* subject to W = CZ.

The above theorems and corollaries provide a theoretical bridge between the shape interaction matrix, the CUR decomposition, and Low-Rank Representation. In particular, as observed in [51], it is quite interesting that while LRR stems from sparsity considerations à la compressed sensing, its solution in the noise-free case in fact comes from matrix factorization.

C. Basis Framework of [29]
As a final note, the CUR framework proposed here gives a broader vantage point for obtaining similarity matrices than that of [29]. Consider the following, which is the main result therein:

Theorem 6 ([29], Theorem 2). Let W ∈ K^{m×n} be a rank r matrix whose columns are drawn from U.
Suppose W = BP, where the columns of B form a basis for the column space of W and the columns of B lie in U (but are not necessarily columns of W). If Q = abs(P^*P) or Q = bin(P^*P), then Ξ_W = Q^{d_max} is a similarity matrix for W.
Let us pause to note that, at first glance, Theorem 6 is on the one hand more specific than Theorem 2, since the matrix B must be a basis for the span of W, whereas C may have some redundancy. On the other hand, Theorem 6 seems more general in that the columns of B need only come from U, and are not forced to be columns of W as are the columns of C. However, one need only notice that if W = BP as in the theorem above, then defining W̃ = [W B] gives rise to the CUR decomposition W̃ = BB^†W̃. But clustering the columns of W̃ via Theorem 2 automatically gives the clusters of W.

VI. PROOFS
Here we enumerate the proofs of the results in Section IV, beginning with some lemmata.

A. Some Useful Lemmata
The first lemma follows immediately from the definition of independent subspaces.
Lemma 1. Suppose that U = [U_1 . . . U_M], where each U_i is a submatrix whose columns come from independent subspaces of K^m. Then we may write U = BZ, where B = [B_1 . . . B_M], Z is block diagonal, and the columns of B_i form a basis for the column space of U_i.
The next lemma is a special case of [29, Lemma 1].
Lemma 2. Suppose that U ∈ K^{m×n} has columns which are generic for the subspace S of K^m from which they are drawn. Suppose P ∈ K^{r×m} is a row selection matrix such that rank(PU) = rank(U). Then the columns of PU are generic.
Lemma 3. Suppose that U = [U_1 . . . U_M], where each U_i is a submatrix whose columns come from independent subspaces S_i, i = 1, . . . , M, of K^m, and that the columns of U_i are generic for S_i. Suppose that P ∈ K^{r×m} with r ≥ rank(U) is a row selection matrix such that rank(PU) = rank(U). Then the subspaces P(S_i), i = 1, . . . , M, are independent.

Proof. Let S = S_1 + · · · + S_M, let d_i = dim(S_i), and let d = dim(S). From the hypothesis, we have that rank(PU_i) = d_i and rank(PU) = d = ∑_{i=1}^M d_i. Therefore, there are d linearly independent vectors for P(S) in the columns of PU. Moreover, since PU = [PU_1 . . . PU_M], there exist d_i linearly independent vectors from the columns of PU_i for P(S_i). Thus dim P(S) = d = ∑_i d_i = ∑_i dim P(S_i), and the conclusion of the lemma follows.

The next few facts which will be needed come from basic graph theory. Suppose G = (V, E) is a finite, undirected, weighted graph with vertices V = {v_1, . . . , v_k} and edges E. The geodesic distance between two vertices v_i, v_j ∈ V is the length (i.e., number of edges) of the shortest path connecting v_i and v_j, and the diameter of the graph G is the maximum of the pairwise geodesic distances of the vertices.
To each weighted graph is associated an adjacency matrix A such that A(i, j) is nonzero if there is an edge between the vertices v_i and v_j, and 0 if not. We call a graph positively weighted if A(i, j) ≥ 0 for all i and j. From these facts, we have the following lemma.
Lemma 4. Let G be a finite, undirected, positively weighted graph with diameter r, and let A be its adjacency matrix. Then A^r(i, j) > 0 for all i, j. In particular, A^r is the adjacency matrix of a fully connected graph.

The following lemma connects the graph theoretic considerations with the subspace model described in the opening.
Lemma 5. Let V = {p_1, . . . , p_N} be a set of generic vectors that represent data from a subspace S of dimension r, with N > r ≥ 1. Let Q be as in Theorem 2 for the case U = S. Finally, let G be the graph whose vertices are the p_i and whose edges are those p_i p_j such that Q(i, j) > 0. Then G is connected and has diameter at most r. Moreover, Q^r is the adjacency matrix of a fully connected graph.
Proof. See [29, Lemmas 2 and 3] for the proof of the first part. The moreover statement follows directly from Lemma 4.

B. Proof of Theorem 3
Without loss of generality, we assume that W is such that W = [W_1 . . . W_M], where W_i is an m × n_i matrix for each i = 1, . . . , M, with ∑_{i=1}^M n_i = n, and the columns of W_i come from the subspace S_i.
Under this assumption, we will show that Y is a block diagonal matrix.
Let P be the row selection matrix such that PW = R; in particular, note that P maps K^m to K^s, and that because of the structure of W, we may write R = [R_1 . . . R_M], where the columns of R_i belong to the subspace S̃_i := P(S_i). Note also that the columns of each R_i are generic for the subspace S̃_i on account of Lemma 2, and that the subspaces S̃_i are independent by Lemma 3. Additionally, since U consists of certain columns of R, and rank(U) = rank(R) = rank(W) by assumption, we may write U = [U_1 . . . U_M], where the columns of U_i are in S̃_i.
Next, recall from the proof of the CUR decomposition that Y = U^†R is a solution to R = UX; thus R = UY. Suppose that r is a column of R_1, and let y = [y_1 y_2 . . . y_M]^* be the corresponding column of Y, so that r = Uy. Then we have that r = ∑_{j=1}^M U_j y_j, and since r ∈ S̃_1 while U_j y_j ∈ S̃_j, the independence of the subspaces S̃_i implies that U_1 y_1 = r and U_i y_i = 0 for every i = 2, . . . , M.
On the other hand, note that ỹ = [y_1 0 . . . 0]^* is another solution, i.e., r = Uỹ. Recalling that U^†r is the minimal-norm solution to the problem r = Ux, we must have ‖y‖_2 ≤ ‖ỹ‖_2. Since ‖y‖_2^2 = ‖ỹ‖_2^2 + ∑_{i=2}^M ‖y_i‖_2^2, this forces y_i = 0 for i = 2, . . . , M. Consequently, y = ỹ, and it follows that Y is block diagonal by applying the same argument to the columns of R_i, i = 2, . . . , M.
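The minimal-norm property of the pseudoinverse invoked above is easily observed numerically; a small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.standard_normal((5, 8))        # underdetermined: U y = r has many solutions
r = U @ rng.standard_normal(8)

y = np.linalg.pinv(U) @ r              # minimal-norm solution
_, _, Vh = np.linalg.svd(U)
y2 = y + Vh[-1]                        # add a null-space vector: another solution
print(np.linalg.norm(U @ y - r), np.linalg.norm(U @ y2 - r))  # both ~ 0
print(np.linalg.norm(y) < np.linalg.norm(y2))                 # True
```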

C. Proof of Theorem 2
Without loss of generality, on account of Theorem 3, we may assume that Y is block diagonal as above. We first demonstrate that each block Y_i has rank d_i = dim(S_i) and has columns which are generic. Since R_i = U_i Y_i and rank(R_i) = rank(U_i) = d_i, we have rank(Y_i) ≥ d_i, since the rank of a product is at most the minimum of the ranks. On the other hand, since Y_i = U_i^†R_i, we have rank(Y_i) ≤ rank(U_i) = d_i, whence rank(Y_i) = d_i. To see that the columns of each Y_i are generic, let y_1, . . . , y_{d_i} be d_i columns of Y_i and suppose there are constants c_1, . . . , c_{d_i} such that ∑_{j=1}^{d_i} c_j y_j = 0. Applying U_i yields ∑_{j=1}^{d_i} c_j r_j = 0, where r_j, j = 1, . . . , d_i, are the corresponding columns of R_i. Since the columns of R_i are generic by Lemma 2, it follows that c_j = 0 for all j, whence the columns of Y_i are generic.
Finally, since Y is block diagonal, Q is block diagonal with blocks Q_i, i = 1, . . . , M, and we may consider each block as the adjacency matrix of a graph as prescribed in Lemma 5. Thus, from the conclusion there, each block gives a connected graph with diameter at most d_i, and so Q^{d_max} will give rise to a graph with M distinct fully connected components, where the graph arising from Q_i corresponds to the data in W drawn from S_i. Thus Q^{d_max} is indeed a similarity matrix for W.

D. Proofs of Corollaries
Proof of Corollary 4. That Ξ_W is a similarity matrix follows directly from Theorem 2. To see the moreover statement, recall that W^† = V_r Σ_r^{-1} U_r^*, so that W^†W = V_r Σ_r^{-1} U_r^* U_r Σ_r V_r^* = V_r V_r^*.

Proof of Corollary 5. By Lemma 1, we may write W = BZ, where Z is block diagonal and B = [B_1 . . . B_M], with the columns of B_i being a basis for S_i. Let P be the row-selection matrix which gives R, i.e., R = PW. Then R = PBZ. The columns of B are linearly independent (and likewise for the columns of PB, by Lemma 3), whence W^† = Z^†B^† and R^† = Z^†(PB)^†. Moreover, linear independence of the columns also implies that B^†B and (PB)^†PB are both identity matrices of the appropriate size, whereby

R^†R = Z^†(PB)^†PBZ = Z^†Z = Z^†B^†BZ = W^†W,

which completes the proof (note that the final identity W^†W = V_r V_r^* follows immediately from Corollary 4).

VII. EXPERIMENTAL RESULTS
In this section, the use of the CUR decomposition in practice is investigated by first considering its performance on synthetic data, whereupon the initial findings are subsequently verified using the motion segmentation dataset known as Hopkins155 [53]. In motion segmentation, one begins with a video and some sort of algorithm which extracts features on moving objects and tracks the positions of those features over time. At the end of the video, one obtains a data matrix of size 2F × N, where F is the number of frames in the video and N is the number of features tracked. Each column corresponds to the trajectory of a fixed feature, i.e., is of the form (x_1, y_1, . . . , x_F, y_F), where (x_i, y_i) is the position of the feature at frame i, 1 ≤ i ≤ F. Even though these trajectories are vectors in a high dimensional ambient space, it is known that the trajectories of all features belonging to the same rigid body lie in a subspace of dimension 4 [54]. Consequently, motion segmentation is a suitable practical problem to tackle in order to verify the validity of the proposed approach.
The Hopkins155 motion dataset contains 155 videos which can be broken down into several categories: checkerboard sequences, where moving objects are overlaid with a checkerboard pattern to obtain many feature points on each moving object; traffic sequences; and articulated motion (such as people walking), where the moving body contains joints of some kind, making the 4-dimensional subspace assumption on the trajectories incorrect. Associated with each video is a data matrix giving the trajectories of all features on the moving objects (these features are fixed in advance for easy comparison). Consequently, the data matrix is unique for each video, and the ground truth for clustering is known a priori, thus allowing calculation of the clustering error, which is simply the percentage of incorrectly clustered feature points.
In the synthetic experiments, similarity matrices generated using CUR and SIM are clustered using the Spectral Clustering algorithm [22], and both methods are compared in terms of clustering performance. In the case of clustering real motion segmentation data from the Hopkins155 dataset (where the ground truth of clusters is known), our clustering performance is compared to known results in the literature [55].
As expected, in the case where the synthetic data contains no noise, SIM and CUR yield the same clustering result. In the presence of noise, it makes sense to use the flexibility of the CUR decomposition and take an average of several different such decompositions (for further explanation, see the next subsection). In this case, we will see that averaging several CUR-derived similarity matrices yields results better than or comparable to those of the SIM.

A. Simulations using Synthetic Data
A set of simulations is designed using synthetic data. In order for the results to be comparable to those of the motion segmentation case presented in the following section, the data is constructed in a similar fashion. Particularly, in the synthetic experiments, data comes from a union of independent 4-dimensional subspaces of R^300; this corresponds to a feature being tracked for 5 seconds in a 30 fps video stream. Two cases similar to the ones in the Hopkins155 dataset are investigated for increasing levels of noise. In both cases, the data is randomly drawn from the unit ball of a 4-dimensional subspace. In the first case, data comes from a union of 2 subspaces, whereas in the second case, the number of subspaces is 3. In both cases, the level of noise on W is gradually increased from the initial noiseless state to the maximum noise level. The entries of the noise are i.i.d. N(0, σ²) random variables (i.e., with zero mean and variance σ²), where the standard deviation takes the values σ ∈ {0.000, 0.001, 0.010, 0.030, 0.050}.
In each case, for each noise level, 100 data matrices are randomly generated. Once each data matrix W is formed, a similarity matrix Ξ_W is generated using CUR and SIM. These similarity matrices are fed to Spectral Clustering, and the results are evaluated based on clustering error percentages.
Ξ_W for SIM is found using the skinny SVD of W with the expected rank of the data matrix (i.e., 8 and 12 for Case 1 and Case 2, respectively), while that for CUR is found by using all columns and the expected rank number of rows of W; therefore, the matrix Y of Theorem 2 is R^†R. Rows are chosen uniformly at random from W, and it is ensured that R has the expected rank before proceeding.
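A sketch of how such synthetic data may be generated (the function name, the number of points per subspace, and the default parameters are our own choices):

```python
import numpy as np

def synthetic_union(M=2, d=4, ambient=300, per_sub=40, sigma=0.01, seed=3):
    """Data drawn from the unit balls of M random d-dimensional subspaces of
    R^ambient (independent almost surely, since M*d << ambient), plus noise."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(M):
        B = np.linalg.qr(rng.standard_normal((ambient, d)))[0]  # orthonormal basis
        coef = rng.standard_normal((d, per_sub))
        coef /= np.maximum(np.linalg.norm(coef, axis=0), 1.0)   # into the unit ball
        blocks.append(B @ coef)
    W = np.hstack(blocks)
    return W + sigma * rng.standard_normal(W.shape)
```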
Results for Case 1 and Case 2 are given in Figure 1. The results indicate that CUR performs exactly as SIM in the noiseless case (as expected via Corollary 5) and degrades gradually as the level of noise increases.
Given the random nature of the CUR decomposition, a simple improvement in CUR performance is achieved by calculating the similarity matrix Ξ_W not from a single CUR decomposition, but from n different CUR decompositions of the same data matrix. For each CUR decomposition, a similarity matrix Ξ_W^i is found, and the final similarity matrix for W is given by taking the entrywise median of the family {Ξ_W^1, . . . , Ξ_W^n}. If the same experiments are repeated with n = 25, significant improvements are achieved, as shown in Figure 2.

B. Clustering Real Motion Data

As mentioned, clustering performance using CUR decompositions is tested on the Hopkins155 dataset. Here, Ξ_W is calculated as in Case 2 of the synthetic data section; i.e., Ξ_W for CUR is still the aggregate result of n = 25 different similarity matrices as explained above. It turns out that for real motion data, CUR yields better overall results than SIM (see Tables I–III). This is reasonable given the flexibility of the CUR decomposition: finding several similarity matrices and averaging them has the effect of averaging out some of the inherent noise in the data. The purpose of this work is not to fine-tune CUR's performance on the Hopkins155 dataset; nonetheless, the results using this very simple method are already better than some of those in the literature.

It should be noted that, after thorough experimentation on noisy data, using a CUR decomposition which takes all columns and exactly the expected rank number of rows exhibits the best performance; that is, a decomposition of the form W = WR^†R performs better on average than one of the form W = CU^†R. The fact that choosing more columns performs better when the matrix W is noisy makes sense, in that any representation of the form W = CX is a representation of W in terms of the frame vectors of C. Consequently, choosing more columns in the matrix C means that we are adding redundancy to the frame, and it is well known that one of the advantages of frame representations is that redundancy provides greater robustness to noise. Additionally, we noticed experimentally that choosing exactly r rows in the decomposition WR^†R exhibits the best performance; it is not yet clear why this is the case.
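A sketch of the aggregation scheme just described (function name ours); the resulting similarity matrix can then be clustered with, e.g., scikit-learn's SpectralClustering with affinity='precomputed':

```python
import numpy as np

def averaged_cur_similarity(W, r, d_max, n_trials=25, seed=0):
    """Entrywise median of n_trials similarity matrices built from W = W R^+ R,
    each with exactly r rows of W chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    m = W.shape[0]
    sims = []
    while len(sims) < n_trials:
        I = rng.choice(m, size=r, replace=False)
        R = W[I, :]
        if np.linalg.matrix_rank(R) < r:      # re-draw until R has the expected rank
            continue
        Y = np.linalg.pinv(R) @ R             # Y of Theorem 2 with C = W, U = R
        Q = np.abs(Y.conj().T @ Y)
        sims.append(np.linalg.matrix_power(Q, d_max))
    return np.median(np.stack(sims), axis=0)

# e.g.: labels = sklearn.cluster.SpectralClustering(
#            n_clusters=M, affinity='precomputed').fit_predict(Xi_W)
```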
As a final remark, let us note that the CUR algorithm for finding similarity matrices for subspace data discussed here can be quite fast. Indeed, it took only 57 seconds to cluster all 155 data matrices in the Hopkins155 dataset, i.e., approximately 0.37 seconds per sequence to perform the clustering via averaged CUR similarity matrices plus spectral clustering. This is faster than most of the algorithms tested in [53], which is another advantage given its good performance.

VIII. CONCLUDING REMARKS
The motivation of this work was truly the realization that the exact CUR decomposition of Theorem 1 can be used for the subspace clustering problem. We demonstrated that, on top of its utility in randomized linear algebra, CUR enjoys a prominent place atop the landscape of solutions to the subspace clustering problem. CUR provides a theoretical umbrella under which sits the known shape interaction matrix, but it also provides a bridge to other solution methods inspired by compressed sensing, i.e., those involving the solution of a minimization problem. Moreover, we believe that the utility of CUR for clustering and other applications will only increase in the future. Below, we provide some reasons for the practical utility of CUR decompositions, particularly related to data analysis and clustering, as well as some future directions.

Benefits of CUR:
• From a theoretical standpoint, the CUR decomposition of a matrix utilizes a frame structure rather than a basis structure to factorize the matrix, and therefore enjoys a level of flexibility beyond something like the SVD. This fact should provide utility in applications.
• Additionally, a CUR decomposition remains faithful to the structure of the data. For example, if the given data is sparse, then both C and R will be sparse, even if U^† is not in general. In contrast, taking the SVD of a sparse matrix yields dense matrices U and V in general.
• Often, in obtaining real data, many entries may be missing or extremely corrupted. In motion tracking, for example, it could be that some of the features are obscured from view for several frames. Consequently, some form of matrix completion may be necessary. On the other hand, a look at the CUR decomposition reveals that whole rows of a data matrix can be missing, as long as we can still choose enough rows such that the resulting matrix R has the same rank as W.

Future Directions
• The best way to use the CUR decomposition to cluster subspace data remains to be seen. The naive method tested here only gives a baseline, and further testing needs to be done. As commented above, CUR's flexibility is a distinct advantage over SVD-based methods; consequently, some of the tools used to make SIM more robust will no doubt lead to similar improvements for CUR.
• Another direction is to combine the CUR technique with sparse methods to construct algorithms that are strongly robust to noise and that allow clustering when the data points are not drawn from a union of independent subspaces.

Figure 2. Extensive testing shows no real improvement for larger values of n, so this value was settled on empirically.

TABLE I: % classification errors for sequences with two motions.

TABLE II: % classification errors for sequences with three motions.

TABLE III: % classification errors for all sequences.