Subject clustering by IF-PCA and several recent methods

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE, but still underperforms compared to IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).


Introduction
We are interested in the problem of high-dimensional clustering, or subject clustering. Suppose we have a group of n subjects (e.g., patients or cells) measured on the same set of p features (e.g., genes). The subjects come from K different classes or groups (e.g., a normal group and a diseased group), but unfortunately, the class labels are unknown. In such a case, we say the data are unlabeled. For 1 ≤ i ≤ n, denote the class label of subject i by Y_i and the p-dimensional measured feature vector of subject i by X_i. Note that Y_i takes values in {1, 2, . . ., K}. The class labels are unknown, and the goal is to predict them using the measured features X_1, X_2, . . ., X_n.
High-dimensional clustering is an unsupervised learning problem. It is especially interesting in the Big Data era: although the volume of available scientific data grows rapidly, a significant fraction of the data are unlabeled. In some cases, it is simply hard to label each individual sample (e.g., action unit recognition [47]). In other cases, labeling each individual sample is not hard, but due to the large sample size, it takes a huge amount of time and effort to label the whole data set (e.g., ImageNet [7]). In yet other instances (e.g., cancer diagnosis), we may have a preliminary opinion on how to label the data, but we are unsure of the labels' accuracy, so we would like a second, preferably independent, opinion. In all these cases, we seek an effective and user-friendly clustering method.
In recent years, the area of high-dimensional clustering has witnessed exciting advancements in several directions. First, many new types of data sets (e.g., single-cell data) have emerged and become increasingly accessible. Second, remarkable successes have been achieved in nonlinear modeling for high-dimensional data, and several Unsupervised Deep Learning (UDL) approaches have been proposed [13], including but not limited to the Variational Auto-Encoder (VAE) and the Generative Adversarial Network (GAN). Last but not least, several clustering methods for single-cell data (e.g., Seurat [39] and SC3 [29]) have been proposed and become popular.
In this paper, we are primarily interested in Influential-Feature Principal Component Analysis (IF-PCA), a clustering algorithm proposed by [25]. As in many recent works in high-dimensional data analysis (e.g., [2], [37]), we assume that
• out of all p measured features, only a small fraction of them are relevant to the clustering decision.
IF-PCA is easy to use and has no tuning parameters. It is conceptually simple and, on a high level, consists of the following two steps.
• IF-step. A feature selection step that selects the small fraction of measured features which we believe to be influential or significant for the clustering decision.
• Clustering step. A clustering step in which PCA (as a spectral clustering approach) is applied to all retained features.
Instead of viewing IF-PCA as a specific clustering algorithm, we can view it as a generic two-step clustering approach: for each of the two steps, we can choose methods that may vary from occasion to occasion in order to best suit the nature of the data. We anticipate that IF-PCA will adapt and develop over time as new data sets and tasks emerge. [25] compared IF-PCA to a number of clustering algorithms (including the classical kmeans [35], kmeans++ [3], SpectralGem [31], hierarchical clustering [20], and sparse PCA [52]) using 10 microarray data sets. They found that IF-PCA was competitive in clustering accuracy. Later, [24] developed a theoretical framework for clustering and showed that IF-PCA is optimal in the Rare/Weak signal model, a frequently used model in high-dimensional data analysis ([9], [10]).
These appealing properties of IF-PCA motivate a revisit of this method. Specifically, we are interested in the two questions listed below.
• There are many recent clustering algorithms specifically designed for single-cell data, such as Seurat [39], SC3 [29], RaceID [16], ACTIONet [36], Monocle3 [42], and SINCERA [17]. Also, many UDL algorithms have been proposed and become well-known in recent years. An interesting question is how IF-PCA compares with these popular algorithms. [25] only examined IF-PCA on gene microarray data. Single-cell RNA-seq data are similar to gene microarray data in some aspects but also have some distinct characteristics (e.g., single-cell RNA-sequencing provides an unbiased view of all transcripts and is therefore reliable for accurately measuring gene expression level changes [51]). How IF-PCA compares to other popular methods for subject clustering with single-cell data is an intriguing question.
• The PCA employed in the clustering step of IF-PCA is a linear method. Although we believe that the associations between class labels and measured features may be nonlinear, the significance of the nonlinear effects is unclear. To investigate this, we may consider a variant of IF-PCA in which PCA is replaced by a nonlinear UDL method in the clustering step. An interesting question is how this variant compares to IF-PCA and to standard UDL methods (which have no IF-step); the comparison helps us understand how significant the nonlinear effects are.
To answer these questions, first, we propose a new approach, IF-VAE, which combines the main idea of IF-PCA with the Variational Auto-Encoder (VAE) [28], one of the most popular Unsupervised Deep Learning approaches in the recent literature. Second, we compare IF-VAE with several methods, including VAE, IF-PCA, SpectralGem [31], and the classical kmeans, using the 10 microarray data sets in [25]. We find that
• Somewhat surprisingly, VAE underperforms most other methods, including the classical kmeans.
• IF-VAE, which combines VAE with the IF-step of IF-PCA, significantly outperforms VAE.
• The performance of IF-PCA and IF-VAE is comparable for approximately half of the data sets, whereas IF-VAE significantly underperforms IF-PCA for the remaining half of the data sets.
These results suggest that the idea of combining the IF-step of IF-PCA with VAE is valuable. We also compare these methods, together with Seurat and SC3, on 8 single-cell RNA-seq data sets. We find that
• IF-VAE continues to underperform other methods on the 8 single-cell data sets, but, similar to the above, the unsatisfactory performance is largely attributable to the VAE step and not the IF-step.
• IF-PCA outperforms SC3 slightly and outperforms Seurat more significantly.
At the same time, we note that
• Seurat has four tuning parameters; it is the method with the shortest execution time.
• The idea of SC3 is quite similar to that of IF-PCA, except that SC3 has a "consensus voting" step that aggregates the strengths of many clustering results. With consensus voting, SC3 may empirically perform more satisfactorily, but it is also more complex internally. Regarding the computational cost, it runs much slower than IF-PCA due to the consensus voting step.
Moreover, IF-PCA is conceptually simple and permits fine-grained analysis. In Section 4, we develop a theoretical framework and show that IF-PCA achieves the optimal phase transition in a Rare/Weak signal setting. In particular, we show that in the region of interest (where successful subject clustering is possible):
• if the signals are less sparse, the signals may be individually weak. In this case, PCA is optimal (and IF-PCA reduces to PCA if we choose the IF-step properly).
• if the signals are more sparse, the signals need to be relatively strong (so that successful clustering is possible). In this case, feature selection is necessary and IF-PCA is optimal, whereas PCA may be non-optimal because it does not use a feature selection step.
In comparison, other popular methods are difficult to analyze theoretically; hence, their optimality is unclear. We note that hard-to-analyze methods will also be hard to improve in the future.
In conclusion, IF-PCA is quite competitive compared to the recently popular subject clustering methods, both for gene microarray data and for single-cell data. It is worthwhile to study IF-PCA both theoretically and in (a variety of) applications. IF-VAE is a significant improvement over VAE, but it is still inferior to other prevalent methods in this area (the underperformance is largely due to the VAE step, not the IF-step). It is desirable to further improve IF-VAE (especially the VAE step) to make it more competitive.

Models and methods
As before, suppose we have measurements on the same set of p features for n samples. Denote the data matrix by X ∈ R^{n,p}, and write
X = [X_1, X_2, . . ., X_n]′,
where X_i ∈ R^p denotes the measured feature vector for sample i, 1 ≤ i ≤ n. From time to time, we may want to normalize the data matrix before we implement any approaches. For 1 ≤ j ≤ p, let X̄(j) and σ̂(j) be the empirical mean and standard deviation associated with feature j (column j of X), respectively. We normalize each column of X and denote the resulting matrix by W, where
W(i, j) = [X(i, j) − X̄(j)]/σ̂(j), 1 ≤ i ≤ n, 1 ≤ j ≤ p. (2.2)
Below in Section 2.1, we introduce two models for X; then in Sections 2.2-2.6, we describe the clustering methods considered in this paper, some of which (e.g., IF-VAE, IF-VAE(X), IF-PCA(X)) are new.
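The normalization in (2.2) can be sketched in a few lines of Python (a minimal illustration, not the paper's code; we use the population standard deviation, and which variant the paper uses is a minor detail):

```python
import numpy as np

def normalize_columns(X):
    """Column-wise standardization, as in (2.2): subtract the empirical
    mean and divide by the empirical standard deviation of each feature."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))
W = normalize_columns(X)             # each column now has mean 0, sd 1
```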

Two models
A reasonable model is as follows. We encode the class label Y_i as a K-dimensional vector π_i, where π_i = e_k if and only if sample i belongs to class k, and e_k is the k-th standard Euclidean basis vector of R^K. We assume
X_i = M π_i + Z_i, where M = [µ_1, µ_2, . . ., µ_K], (2.3)
and µ_k ∈ R^p is the mean vector for class k. We assume the noise vectors Z_1, Z_2, . . ., Z_n are independent with mean zero. (2.4) Let Π = [π_1, π_2, . . ., π_n]′ be the matrix of encoded class labels. We can rewrite (2.3) as X = ΠM′ + Z. Also, it is reasonable to assume that out of the many measured features, only a small fraction of them are useful in the clustering decision. Therefore, letting µ̄ = (1/K) Σ_{k=1}^K µ_k, we assume
µ_1, µ_2, . . ., µ_K are linearly independent and µ_k − µ̄ is sparse for each 1 ≤ k ≤ K. (2.5)
It follows that the n × p signal matrix E[X] has rank K.
Recall that W is the normalized data matrix. Similar to (2.5), we may decompose W as the sum of a signal matrix and a noise matrix. But due to the normalization, the rank of the signal matrix is reduced to (K − 1).
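The rank reduction caused by the normalization can be checked numerically. Below is a small, self-contained simulation (with made-up dimensions and sparse class means, in the spirit of (2.3)-(2.5)) showing that the noiseless signal matrix has rank K, while its column-centered version has rank K − 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 120, 50, 3
# sparse class means: only the first 5 coordinates differ across classes
M = np.zeros((K, p))
M[:, :5] = rng.normal(size=(K, 5))
labels = rng.integers(0, K, size=n)

signal = M[labels]                       # noiseless E[X]: one mean per row
centered = signal - signal.mean(axis=0)  # column-centering, as in (2.2)

rank = np.linalg.matrix_rank(signal)     # K
rank_c = np.linalg.matrix_rank(centered) # K - 1
```

The drop by one occurs because, after centering, the (weighted) class means sum to zero, so they span a (K − 1)-dimensional subspace.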
In Model (2.3)-(2.5), E[X_i] = Mπ_i, which is a linear function of the encoded class label vector π_i. For this reason, we may view Model (2.3)-(2.5) as a linear model. In many modern applications, linear models may be inadequate, and we may prefer to use a nonlinear model. The recent idea of neural network modeling provides a wide class of nonlinear models, which may be useful for our setting. As an alternative to Model (2.3)-(2.5), we may consider a neural network model as follows. In this model, we assume
E[X_i] = f(π_i, θ), (2.6)
where f(x, θ) belongs to a class of nonlinear functions. For example, we may assume f(x, θ) belongs to the class of functions (without loss of generality, x always includes a constant feature)
f(x, θ) = s_L(A_L s_{L−1}(A_{L−1} · · · s_1(A_1 x) · · ·)), θ = (A_1, A_2, . . ., A_L),
where A_1, A_2, . . ., A_L are matrices of certain sizes and s_1, s_2, . . ., s_L are some nonlinear functions. Similar to Model (2.3)-(2.5), we can impose some sparsity conditions on Model (2.6). See [13] for example.

The PCA clustering approach and the SpectralGem
Principal Component Analysis (PCA) is a classical spectral clustering approach, which is especially appropriate for linear models like that in (2.3)-(2.5) when the relevant features are non-sparse (see below for a discussion of the case where the relevant features are sparse). The PCA clustering approach consists of two simple steps. Input: data matrix X and number of clusters K. Output: predicted class label vector Ŷ = (Ŷ_1, Ŷ_2, . . ., Ŷ_n)′.
• Obtain the n × K matrix H = [η_1, . . ., η_K], where η_k is the k-th left singular vector of X (associated with the k-th largest singular value of X).
• Cluster the n rows of H into K groups by applying the classical kmeans, assuming there are ≤ K classes. Let Ŷ_i be the estimated class label of subject i. Output Ŷ_1, . . ., Ŷ_n.
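The two steps above can be sketched as follows (a minimal illustration; a plain Lloyd's algorithm with random restarts stands in for the classical kmeans, and the toy data are made up):

```python
import numpy as np

def kmeans(Z, K, n_restart=10, n_iter=50, seed=0):
    """Lloyd's algorithm with random restarts (stand-in for classical kmeans)."""
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_restart):
        centers = Z[rng.choice(len(Z), size=K, replace=False)].copy()
        for _ in range(n_iter):
            lab = ((Z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for k in range(K):
                if np.any(lab == k):
                    centers[k] = Z[lab == k].mean(0)
        cost = ((Z - centers[lab]) ** 2).sum()
        if cost < best_cost:
            best, best_cost = lab, cost
    return best

def pca_cluster(X, K):
    """PCA clustering: kmeans on the rows of H, the first K left
    singular vectors of X."""
    U = np.linalg.svd(X, full_matrices=False)[0]
    return kmeans(U[:, :K], K)

# toy data: two classes with sparse, non-overlapping mean vectors
rng = np.random.default_rng(2)
n, p, K = 60, 40, 2
y = np.repeat([0, 1], n // 2)
M = np.zeros((K, p))
M[0, :3] = 6.0                        # class-0 mean (sparse)
M[1, 3:6] = 6.0                       # class-1 mean (sparse)
X = rng.normal(size=(n, p)) + M[y]
yhat = pca_cluster(X, K)
err = min((yhat != y).sum(), (yhat == y).sum())  # errors up to label swap
```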
From time to time, we may choose to apply the PCA clustering approach to the normalized data matrix W. As explained before, we can similarly write W as the sum of a "signal" matrix and a "noise" matrix as in (2.5), but due to the normalization, the rank of the "signal" matrix under Model (2.3) is reduced from K to (K − 1). In such a case, we replace the n × K matrix H by the n × (K − 1) matrix Ξ = [ξ_1, . . ., ξ_{K−1}], where similarly ξ_k is the k-th left singular vector of W.
The PCA clustering approach has many modern variants, including but not limited to SpectralGem [31] and SCORE [22, 26]. In this paper, we consider SpectralGem but skip the discussion of SCORE (SCORE was motivated by unsupervised learning in network and text data and shown to be effective on those types of data; it is unclear if SCORE is also effective for genetic and genomic data). Instead of applying PCA clustering to the data matrix X (or W) directly, SpectralGem constructs an n × n symmetric matrix M, where M(i, j) can be viewed as a similarity metric between subject i and subject j. The remaining part of the algorithm has many small steps, but the essence is to apply the PCA clustering approach to the Laplacian-normalized graph induced by M.
The PCA spectral clustering approach is based on two important assumptions.
• The signal matrix E[X] is a linear function of class labels.
• It is hard to exploit sparsity in the data: either the data are non-sparse (such as in the classical setting of p ≪ n) or how to exploit sparsity is unclear.
In many modern settings, these assumptions are not satisfied: the relationship between the signal matrix E[X] and the class labels may be nonlinear, and it is highly desirable to exploit sparsity by adding a feature selection step before conducting PCA clustering. In such cases, we need an alternative approach. Below, we address the non-linearity by VAE and the feature selection by IF-PCA, respectively.

The Variational AutoEncoder (VAE) and VAE(X) clustering approaches
Given an n × p data matrix X and an integer d ≤ rank(X), the essence of the PCA spectral clustering approach is to obtain a rank-d approximation of X using the Singular Value Decomposition (SVD),
X ≈ Σ_{k=1}^d σ_k u_k v_k′.
Here σ_k is the k-th largest singular value of X, and u_k and v_k are the corresponding left and right singular vectors of X, respectively. The Variational AutoEncoder (VAE) can be viewed as an extension of SVD, which obtains a rank-d approximation of X by training a neural network. The classical SVD is a linear method, but the neural network approach can be highly nonlinear. VAE was first introduced by [28] and has been successfully applied to many application areas (e.g., image processing [38], computer vision [14], and text mining [40]). VAE consists of an encoder, a decoder, and a loss function. Given a data matrix X ∈ R^{n,p}, the encoder embeds X into a matrix Z ∈ R^{n,d} (usually d ≪ p), and the decoder maps Z back to the original data space and outputs a matrix X̂ ∈ R^{n,p}, which can be viewed as a rank-d approximation of X. Different from the classical SVD, X̂ is obtained in a nonlinear fashion, by minimizing an objective that measures the information loss between X and X̂.

• (Dimension reduction by VAE). Train the VAE and use the trained encoder to obtain an n × d matrix Z.
• (Clustering). Cluster all n subjects into K classes by applying kmeans to the rows of Z. Let Ŷ be the predicted label vector.
Except for using a nonlinear approach to dimension reduction, VAE is similar to the PCA approach in clustering. We can apply VAE either to the normalized data matrix W or to the unnormalized data matrix X. We call these versions VAE(W) and VAE(X), respectively. Since it is unnecessary to keep both notations in full, we write VAE(W) as VAE for short (and, to avoid confusion, we still write VAE(X) as VAE(X)).
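To make the encoder/decoder/loss structure concrete, here is a minimal numpy sketch of a single VAE forward pass and its loss (negative ELBO), with untrained, randomly drawn weights. The layer sizes and the Gaussian reconstruction term are simplifying assumptions for illustration; the experiments in Section 3 use a trained VAE with a sigmoid decoder.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d, h = 8, 20, 3, 16     # subjects, features, latent dim, hidden width

# untrained one-hidden-layer encoder/decoder weights (hypothetical sizes)
We1 = 0.1 * rng.normal(size=(p, h))
We_mu = 0.1 * rng.normal(size=(h, d))
We_lv = 0.1 * rng.normal(size=(h, d))
Wd1 = 0.1 * rng.normal(size=(d, h))
Wd2 = 0.1 * rng.normal(size=(h, p))

relu = lambda t: np.maximum(t, 0.0)

X = rng.normal(size=(n, p))
# encoder: map X to the mean and log-variance of q(z | x)
Henc = relu(X @ We1)
mu, logvar = Henc @ We_mu, Henc @ We_lv
# reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=(n, d))
# decoder: map z back to the feature space
Xhat = relu(z @ Wd1) @ Wd2
# negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
recon = ((X - Xhat) ** 2).sum(axis=1).mean()
kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(axis=1).mean()
loss = recon + kl
```

Training amounts to minimizing this loss over the weight matrices by stochastic gradient descent; the rows of Z (here, mu) are then fed to kmeans.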

The orthodox IF-PCA and its variant IF-PCA(X)
For many genomic and genetic data sets, Model (2.3)-(2.5) is already a reasonable model. We recall that under this model the normalized data matrix can be approximately written as
W ≈ Q + Z,
where Z is a noise matrix (whose entries are approximately standard normal) and the signal matrix Q is sparse, in the sense that only a small fraction of the columns of Q have a large ℓ2-norm; the ℓ2-norms of the other columns are small or 0. In such a setting, it is appropriate to conduct feature selection, which removes a large amount of noise while keeping most nonzero columns of Q.
Such observations motivate the (orthodox) IF-PCA. IF-PCA was first proposed in [25] and shown to have appealing clustering results on 10 gene microarray data sets. In [24], it was shown that IF-PCA is optimal in high-dimensional clustering. IF-PCA contains an IF-step and a PCA step, and the IF-step contains two important components, which we now introduce.
The first component of the IF-step is the use of the Kolmogorov-Smirnov (KS) test for feature selection. Suppose we have n (univariate) samples z_1, z_2, . . ., z_n from a cumulative distribution function (CDF) denoted by F. Introduce the empirical CDF
F_n(t) = (1/n) Σ_{i=1}^n 1{z_i ≤ t}. (2.7)
The KS testing score is then
ψ_n = √n · sup_t |F_n(t) − F(t)|.
In the IF-PCA below, we take F to be the theoretical CDF of (z_i − z̄)/σ̂ when z_i are iid N(0, 1), 1 ≤ i ≤ n, where z̄ and σ̂ are the empirical mean and standard deviation of z_1, z_2, . . ., z_n, respectively.
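The KS scoring of one feature can be sketched as follows (a self-contained numpy illustration on made-up data; the null correction that IF-PCA additionally applies to these scores is omitted here):

```python
import numpy as np
from math import erf, sqrt

def phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def ks_score(z):
    """sqrt(n) times the KS distance between the empirical CDF of the
    standardized sample and the N(0,1) CDF."""
    n = len(z)
    z = np.sort((z - z.mean()) / z.std())
    F = np.array([phi(t) for t in z])
    i = np.arange(1, n + 1)
    # sup-distance, checking the empirical CDF before and after each jump
    dist = np.maximum(np.abs(i / n - F), np.abs((i - 1) / n - F)).max()
    return sqrt(n) * dist

rng = np.random.default_rng(4)
n = 500
null_feature = rng.normal(size=n)    # consistent with the null CDF
bimodal_feature = np.where(rng.random(n) < 0.5, -3.0, 3.0) + rng.normal(size=n)
```

A feature whose standardized values deviate from normality (e.g., the bimodal one, which suggests two classes) receives a much larger score than a null feature.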
The second component of the IF-step is the Higher Criticism Threshold (HCT). Higher Criticism was initially introduced by [9] (see also [10, 18, 21, 44]) as a method for global testing, and it has recently been applied to genetic data (e.g., [4]). HCT adapts Higher Criticism to a data-driven threshold choice [25]. It takes as input p marginal p-values, one for each feature, and outputs a threshold for feature selection. Suppose we have p-values π_1, π_2, . . ., π_p. We sort them in ascending order: π_(1) ≤ π_(2) ≤ · · · ≤ π_(p). Define the feature-wise HC score HC_{p,j}, a normalized comparison of π_(j) with its expected value j/p under the null, and let
ĵ = argmax_{ {j: π_(j) > log(p)/p, j < p/2} } {HC_{p,j}}. (2.9)
The threshold for feature selection is then the ĵ-th smallest p-value, π_(ĵ). IF-PCA runs as follows.
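The HCT computation can be sketched as follows. We use the classical Donoho-Jin HC score √p(j/p − π_(j))/√(π_(j)(1 − π_(j))) here as a stand-in; the exact denominator used in [25] is a closely related variant, and the p-values are made up:

```python
import numpy as np

def hct_num_features(pvals):
    """Number of features retained by HCT (classical HC score version)."""
    p = len(pvals)
    pi = np.sort(np.clip(pvals, 1.0 / p ** 2, 1.0 - 1.0 / p ** 2))
    j = np.arange(1, p + 1)
    hc = np.sqrt(p) * (j / p - pi) / np.sqrt(pi * (1.0 - pi))
    ok = (pi > np.log(p) / p) & (j < p / 2)   # the constraint in (2.9)
    return int(j[ok][np.argmax(hc[ok])])

rng = np.random.default_rng(5)
signal_p = rng.random(50) ** 6       # 50 features with small p-values
null_p = rng.random(950)             # 950 null features
jhat = hct_num_features(np.concatenate([signal_p, null_p]))
```

Features ranked among the ĵ smallest p-values are retained; everything else is discarded before the PCA step.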
• (IF-step). (a) For each feature j, 1 ≤ j ≤ p, compute the KS score of column j of W and convert it to a p-value; then apply the HCT to the p p-values and retain only those features whose p-values do not exceed the threshold.
• (Clustering-step). Let W^IF be the n × m sub-matrix of W consisting of the columns of W corresponding to the retained features only (m is the number of retained features in (a)).
For any 1 ≤ k ≤ min{m, n}, let ξ^IF_k be the left singular vector of W^IF corresponding to the k-th largest singular value of W^IF. Write Ξ^IF = [ξ^IF_1, . . ., ξ^IF_{K−1}]. Cluster all n subjects by applying kmeans to the n rows of Ξ^IF, assuming there are K clusters. Let Ŷ = (Ŷ_1, Ŷ_2, . . ., Ŷ_n)′ be the predicted class labels.
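The benefit of the IF-step for the subsequent PCA step can be seen in a small simulation. Below, an oracle selection (keeping exactly the true signal columns) stands in for the KS/HCT-based IF-step, and the dimensions and signal strengths are made up; the point is that with many noise features, the leading singular vector of the full matrix is noise-dominated, whereas after selection it recovers the classes:

```python
import numpy as np

def kmeans_1d(v, seed=0):
    """Two-center Lloyd's algorithm on a single singular vector
    (enough for this K = 2 demo)."""
    rng = np.random.default_rng(seed)
    c = rng.choice(v, size=2, replace=False)
    for _ in range(50):
        lab = (np.abs(v - c[0]) > np.abs(v - c[1])).astype(int)
        for k in (0, 1):
            if np.any(lab == k):
                c[k] = v[lab == k].mean()
    return lab

rng = np.random.default_rng(6)
n, p, s = 100, 2000, 30              # only 30 of 2000 features carry signal
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :s] += np.outer(2 * y - 1, 0.75 * np.ones(s))   # weak sparse signal
W = (X - X.mean(axis=0)) / X.std(axis=0)

def cluster_err(W_used):
    xi = np.linalg.svd(W_used, full_matrices=False)[0][:, 0]
    lab = kmeans_1d(xi)
    return min((lab != y).sum(), (lab == y).sum())

err_all = cluster_err(W)             # PCA on all 2000 features
err_if = cluster_err(W[:, :s])       # PCA after (oracle) feature selection
```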
In the IF-step, the normalization ψ*_j = [φ_n(w_j) − µ*]/σ* is called Efron's null correction [11], a simple idea that has proved to be both necessary and effective for analyzing genomic and genetic data [23]. We remark that although IF-PCA is motivated by the linear model in (2.5), it is not tied to (2.5) and is broadly applicable. In fact, the algorithm does not require any knowledge of Model (2.3)-(2.5).
In the (orthodox) IF-PCA, we apply both the IF-step and the clustering-step to the normalized data matrix W. Seemingly, for the IF-step, applying the algorithm to W instead of the unnormalized data matrix X is preferred. However, for the clustering-step, whether we should apply the algorithm to W or to X remains unclear. We propose a small variant of IF-PCA by applying the IF-step and the clustering-step to W and X, respectively.
• (IF-step). Apply exactly the same IF-step to W as in the (orthodox) IF-PCA above.
• (Clustering-step). Let X^IF be the n × m sub-matrix of X consisting of the columns of X corresponding to the features retained in the IF-step. For any 1 ≤ k ≤ min{m, n}, let η^IF_k be the left singular vector of X^IF corresponding to the k-th largest singular value of X^IF. Write H^IF = [η^IF_1, . . ., η^IF_K]. Cluster all n subjects by applying kmeans to the n rows of H^IF, assuming there are K clusters. Let Ŷ = (Ŷ_1, Ŷ_2, . . ., Ŷ_n)′ be the predicted class labels.
To differentiate this variant from the (orthodox) IF-PCA (which we simply call IF-PCA below), we call it IF-PCA(X). See Table 1 in Section 2.7. The new variant was never proposed or studied before; it outperforms the (orthodox) IF-PCA on several data sets (e.g., see Section 3).

IF-VAE and IF-VAE(X)
Near the end of Section 2.2, we mentioned that the classical PCA has two disadvantages: it does not exploit sparsity in the feature vectors, and it does not account for possible nonlinear relationships between the signal matrix and the class labels. The two preceding subsections address these issues separately. To address both at once, we propose IF-VAE, which uses the same IF-step as in IF-PCA and then applies VAE in the clustering-step. In the clustering-step, we apply VAE to the normalized data matrix W. Similarly as in Section 2.4, if we apply VAE to the un-normalized data matrix X instead, we obtain a variant of IF-VAE, which we denote by IF-VAE(X). See Table 1 in Section 2.7.

Seurat and SC3
We now introduce Seurat and SC3, two recent algorithms that are especially popular for subject clustering with single-cell RNA-seq data. We discuss them separately.
Seurat was proposed in [39]. On a high level, Seurat is quite similar to IF-PCA, and we can view it as having only two main steps: a feature selection step and a clustering step. But different from IF-PCA, Seurat uses a different feature selection step and a much more complicated clustering step (which combines several methods, including PCA, the k-nearest neighborhood algorithm, and modularity optimization). Seurat needs 4 tuning parameters: m, N, k_0, δ, where m is the number of selected features in the feature selection step, and N, k_0, δ are for the clustering step, corresponding to the PCA part, the k-nearest neighborhood algorithm part, and the modularity optimization part, respectively.
• (IF-step). Select the m most variable features. Obtain the n × m post-selection data matrix.
• (Clustering-step). Normalize the post-selection data matrix and obtain the first N left singular vectors. For each pair of subjects, compute how many neighbors they share with each other (for each subject, we only count the k_0 nearest neighbors), and use the results to construct a shared nearest neighborhood (SNN) graph. Cluster the subjects by applying a modularity optimization algorithm to the SNN graph, where a resolution parameter δ is needed.
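The SNN construction can be sketched as follows (a simplified illustration on made-up data; Seurat's actual edge weighting and pruning differ, and the subsequent modularity optimization is not shown):

```python
import numpy as np

def snn_graph(Z, k0):
    """Shared-nearest-neighbour graph: entry (i, j) counts how many of the
    k0 nearest neighbours subjects i and j have in common."""
    n = len(Z)
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                 # a subject is not its own neighbour
    nbrs = [set(row) for row in np.argsort(d, axis=1)[:, :k0]]
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(nbrs[i] & nbrs[j])
    return S

rng = np.random.default_rng(7)
# two well-separated blobs of 20 subjects each
Z = np.vstack([rng.normal(0, 0.5, (20, 3)), rng.normal(5, 0.5, (20, 3))])
S = snn_graph(Z, k0=5)
within = S[:20, :20].mean()   # edges inside a blob
across = S[:20, 20:].mean()   # edges between blobs
```

For well-separated groups, within-group pairs share many neighbors while cross-group pairs share none, which is what the modularity optimization then exploits.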
An apparent limitation of Seurat is that it needs 4 tuning parameters. Following the recommendations of [19], we may take (N, k_0) = (50, 20), but it remains unclear how to select (m, δ).

SC3 was first presented by [29]. To be consistent with many other methods we discuss in this paper, we may view SC3 as containing two main steps: a gene filtering step and a clustering step. Similar to Seurat, the clustering step of SC3 is much more complicated than that of IF-PCA; the main idea is to apply PCA many times (each time with a different number of leading singular vectors) and use the results to construct a consensus matrix. We then cluster all subjects into K groups by applying the classical hierarchical clustering method to the consensus matrix. SC3 uses one tuning parameter x_0 in the gene filtering step, and two tuning parameters d_0 and k_0 in the clustering-step, corresponding to the PCA part and the hierarchical clustering part, respectively.
• (Gene filtering-step). Remove genes/transcripts that are either expressed (expression value more than 2) in fewer than x_0% of cells or expressed (expression value more than 0) in at least (100 − x_0)% of cells. This step may remove a significant fraction of features, and we consider it to be more like a feature selection step than a preprocessing step.
• (Clustering-step). First, take a log-transformation of the post-filtering data matrix and construct an n × n matrix M, where M(i, j) is a distance (e.g., Euclidean, Pearson-based, or Spearman-based) between subjects i and j. Second, let H = [η_1, . . ., η_{d_0}], where η_k is the k-th singular vector of M (or, alternatively, of the normalized graph Laplacian matrix of M). Third, for d = 1, 2, . . ., d_0, cluster all n subjects into K classes by applying kmeans to the rows of the n × d sub-matrix of H consisting of the first d columns, and use the results to build a consensus matrix using the Cluster-based Similarity Partitioning Algorithm (CSPA) [41]. Finally, cluster the subjects by applying the classical hierarchical clustering to the consensus matrix with k_0 levels of hierarchy.
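The consensus idea in the clustering-step can be sketched as follows (a simplified stand-in on made-up toy data: we average co-clustering indicators instead of running the full CSPA, and use a basic Lloyd's kmeans):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def tiny_kmeans(Z, K, rng, n_iter=30):
    """Basic Lloyd's algorithm."""
    centers = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    for _ in range(n_iter):
        lab = ((Z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(lab == k):
                centers[k] = Z[lab == k].mean(0)
    return lab

def consensus_cluster(M, K, d0=5, seed=0):
    """kmeans on the first d singular vectors of M for d = 1..d0, then
    average the co-clustering indicators into a consensus matrix and cut
    it with hierarchical clustering."""
    rng = np.random.default_rng(seed)
    n = len(M)
    U = np.linalg.svd(M)[0]
    C = np.zeros((n, n))
    for d in range(1, d0 + 1):
        lab = tiny_kmeans(U[:, :d], K, rng)
        C += lab[:, None] == lab[None, :]
    C /= d0
    link = linkage(squareform(1.0 - C, checks=False), method="average")
    return fcluster(link, K, criterion="maxclust") - 1

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(4, 1, (20, 10))])
M = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))  # Euclidean
labels = consensus_cluster(M, K=2)
y = np.repeat([0, 1], 20)
err = min((labels != y).sum(), (labels == y).sum())
```

Averaging over several values of d makes the final partition less sensitive to any single (possibly poor) choice of the number of singular vectors, which is the point of the consensus step.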
Following the recommendation of [29], we set (x_0, d_0) = (6, 15) and take k_0 to be the true number of clusters K. Such a tuning parameter choice may work effectively in some cases, but for more general cases, we may (as partially mentioned in [29]) need more complicated tuning. In summary, on a high level, we can view both Seurat and SC3 as two-stage algorithms, which consist of a feature selection step and a clustering step, just as in IF-PCA. However, these methods use more complicated clustering steps, where the key is combining many different clustering results to reach a consensus; note that the Shared Nearest Neighborhood (SNN) graph in Seurat can be viewed as a type of consensus matrix. Such additional miles taken by Seurat and SC3 may help reduce the clustering error rates, but they also make the algorithms conceptually more complex, computationally more expensive, and theoretically more difficult to analyze.

A brief summary of all the methods
We have introduced about 10 different methods, some of which (e.g., IF-PCA(X), IF-VAE, IF-VAE(X)) were never proposed before. Among these methods, VAE is a popular unsupervised deep learning approach, Seurat and SC3 are especially popular for clustering with single-cell data, and IF-PCA is a conceptually simple method which was previously shown to be effective for clustering with gene microarray data. Note that some of the methods are conceptually similar to each other with some small differences (though it is unclear how different their empirical performances are). For example, many of these methods are two-stage methods, containing an IF-step and a clustering-step. In the IF-step, we usually use the normalized data matrix W. In the clustering-step, we may use either W or the un-normalized data matrix X. To summarize all these methods and especially to clarify the small differences between similar methods, we have prepared a table below; see Table 1.

Table 1: A summary of all methods discussed in this section. This table clarifies the small differences between similar methods. Take the column IF-PCA(X) for example: "W" on row 2 means that the IF-step of this method is applied to the normalized data matrix W defined in (2.2), and "X" on row 3 means the clustering-step is applied to the un-normalized data matrix X (NA: not applicable).

Results
Our study consists of two parts. In Section 3.1, we compare IF-VAE with several other methods using 10 microarray data sets. In Section 3.2, we compare IF-VAE with several other methods, including the popular approaches Seurat and SC3, using 8 single-cell data sets. In all these data sets, the class labels are given. However, we do not use the class labels in any of the clustering approaches; we only use them when we evaluate the error rates. The code for the numerical results in this section can be found at https://github.com/ZhengTracyKe/IFPCA. The 10 microarray data sets can be downloaded at https://data.mendeley.com/datasets/cdsz2ddv3t, and the 8 single-cell RNA-seq data sets can be downloaded at https://data.mendeley.com/drafts/nv2x6kf5rd.

Comparison of clustering approaches with 10 microarray data sets
Table 2 tabulates the 10 gene microarray data sets (alphabetically) studied in [25]. Here, Data sets 1, 3, 4, 7, 8, and 9 were analyzed and cleaned in [8]; Data sets 2, 6, and 10 were analyzed and grouped into two classes in [49], among which Data set 10 was cleaned by [25] in the same way as by [8]; Data set 5 is from [15]. First, we compare the IF-VAE approach introduced in Section 2.5 with four existing clustering methods: (1) the classical kmeans; (2) Spectral-GEM (SpecGem) [30], which is essentially classical PCA combined with a Laplacian normalization; (3) the orthodox IF-PCA [25], which adds a feature selection step prior to spectral clustering (see Section 2.4 for details); (4) the VAE approach, which uses VAE for dimension reduction and then runs kmeans clustering (see Section 2.3 for details). Among these methods, SpecGem and VAE involve dimension reduction, while IF-PCA and IF-VAE use both dimension reduction and feature selection. For IF-PCA, VAE, and IF-VAE, we can apply the PCA step or the VAE step to either the original data matrix X or the normalized data matrix W. The version of IF-PCA associated with X is called IF-PCA(X), and the version associated with W is still called IF-PCA; similar rules apply to VAE and IF-VAE. Counting these variants, we have a total of 8 different algorithms.
Table 3 shows the numbers of clustering errors (i.e., the number of incorrectly clustered samples, subject to a permutation of the K clusters) of these methods. The results of SpecGem and IF-PCA are copied from [25]. We implemented kmeans using the Python library sklearn, wrote Matlab code for IF-PCA(X), and wrote Python code for the remaining four methods. The IF-step of IF-VAE needs no tuning. In the VAE-step of IF-VAE, we fix the latent dimension at d = 25 and use a traditional architecture in which both the encoder and the decoder have one hidden layer; the encoder uses the ReLU activation and the decoder uses the sigmoid activation; when training the encoder and decoder, we use mini-batch stochastic gradient descent with 50 batches, 100 epochs, and a learning rate of 0.0005. The same neural network architecture and tuning parameters are applied to VAE. We note that the outputs of these methods may have randomness due to the initialization in the kmeans step or in the VAE step. For VAE, IF-VAE, and IF-VAE(X), we repeat the algorithm 10 times and report the average clustering error. For kmeans, we repeat it 5 times (because the results are more stable); for IF-PCA(X), we repeat it 20 times. We use the clustering errors to rank all 8 methods for each data set; in the presence of ties, we assign ranks in such a way that the total rank sum is 36 (e.g., if two methods have the smallest error rate, we rank both of them as 1.5 and rank the second best method as 3; other cases are similar). The average rank of a method is a metric of its overall performance across multiple data sets.
Besides ranks, we also compute regrets: for each data set, the regret of a method is defined to be r = (e − e_min)/(e_max − e_min), where e is the clustering error of this method, and e_max and e_min are the respective maximum and minimum clustering errors among all the methods. The average regret also measures the overall performance of a method (the smaller, the better). There are several notable observations. First, somewhat surprisingly, the simple and tuning-free method, IF-PCA, has the best overall performance. It has the lowest average rank among all 8 methods and achieves the smallest number of clustering errors in 4 out of 10 data sets. We recall that the key idea of IF-PCA is to add a tuning-free feature selection step prior to dimension reduction. The results in Table 3 confirm that this idea is highly effective on microarray data and hard to surpass by other methods. Second, VAE (either on W or on X), which combines kmeans with nonlinear dimension reduction, significantly improves over kmeans on some "difficult" data sets, such as BreastCancer, ColonCancer, and SuCancer. However, on "easy" data sets such as Leukemia and Lymphoma, VAE significantly underperforms kmeans. This suggests that nonlinear dimension reduction is useful mainly on "difficult" data sets. Third, IF-VAE (either on W or on X) improves over VAE on the majority of data sets. On some data sets, such as LungCancer(1), the error rate of IF-VAE is much lower than that of VAE. This observation confirms that the IF-step plays a key role in reducing the clustering errors. [25] made a similar observation by combining the IF-step with linear dimension reduction by PCA. Our results suggest that the IF-step continues to be effective when it is combined with nonlinear dimension reduction by VAE. Last, IF-VAE(X) achieves the lowest error rate in 3 out of 10 data sets, and it has the second lowest average rank among all 8 methods. Compared with IF-PCA (the method with the lowest average rank), IF-VAE(X) has an
advantage in 3 data sets (BreastCancer, SRBCT, and SuCancer) but has a similar or worse performance in the other data sets. These two methods share the same IF-step; hence, the results imply that the nonlinear dimension reduction by VAE has an advantage over the linear dimension reduction by PCA only on "difficult" data sets.
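The regret defined above is simply a min-max rescaling of the clustering error within each data set; a one-line sketch (the error counts are hypothetical):

```python
import numpy as np

def regrets(errors):
    """Regret r = (e - e_min) / (e_max - e_min) for each method on one data set."""
    e = np.asarray(errors, dtype=float)
    return (e - e.min()) / (e.max() - e.min())

# Hypothetical clustering errors of 8 methods on a single data set:
# the best method(s) get regret 0 and the worst gets regret 1.
print(regrets([12, 7, 7, 30, 15, 22, 9, 41]))
```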
Next, we study IF-VAE(X) more carefully on the LungCancer(1) data set. Recall that the IF-step ranks all the features using KS statistics and selects the number of retained features by a tuning-free procedure. We use the same feature ranking but manually change the number of retained features: for each m, we select the m top-ranked features, perform VAE on the unnormalized data matrix X restricted to these m features, and report the average number of clustering errors over 5 repetitions of VAE. Figure 1 displays the number of clustering errors as a function of m. An interesting observation is that, as m increases, the clustering error first decreases and then increases (for a good visualization, Figure 1 only shows the results for m between 1 and 0.1p; we also tried larger values of m and found that the number of clustering errors continued to increase; in particular, the number of errors increased quickly when m > 4000). A possible explanation is as follows: when m is too small, some influential features are missed, resulting in weak signals in the VAE step; when m is too large, too many non-influential features are selected, resulting in large noise in the VAE step. There is a sweet spot between 200 and 400, and the tuning-free procedure in the IF-step selects m = 251. Figure 1 explains why the IF-step benefits the subsequent VAE step. A similar phenomenon was discovered in [25], but for PCA instead of VAE.

Remark 1 (Comparison with other clustering methods for microarray): [25] reported the clustering errors of several classical methods on these 10 microarray data sets. We only include kmeans and SpecGem in Table 3, because kmeans is the most widely-used generic clustering method and SpecGem is specially designed for microarray data. Table 4 shows the clustering errors of other methods reported in [25], including kmeans++ (a variant of kmeans with a particular initialization) and hierarchical clustering. It suggests that these methods significantly underperform IF-PCA.
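The ranking-then-truncation scheme behind the sweep over m can be sketched as follows. This is an illustrative reconstruction, not the paper's code: we rank features by a KS statistic of each normalized column against the N(0, 1) null (via scipy's kstest; the actual IF-step also applies Efron's empirical-null correction) and keep the top m columns. The planted signal below is deliberately exaggerated so the example is clear-cut.

```python
import numpy as np
from scipy.stats import kstest

def ks_rank_features(W):
    """Rank features by the KS statistic of each (normalized) column vs. N(0,1).

    A stand-in for the IF-step ranking in IF-PCA; the real procedure also
    applies Efron's empirical-null correction before ranking.
    """
    p = W.shape[1]
    ks = np.array([kstest(W[:, j], "norm").statistic for j in range(p)])
    return np.argsort(-ks)  # feature indices, most "influential" first

def top_m_submatrix(X, order, m):
    """Restrict the (unnormalized) data matrix X to the m top-ranked features."""
    return X[:, order[:m]]

rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.normal(size=(n, p))
X[:100, :5] += 20.0                 # plant 5 influential features (exaggerated)
W = (X - X.mean(0)) / X.std(0)      # column normalization
order = ks_rank_features(W)
print(sorted(order[:5].tolist()))   # the planted features should dominate the top ranks
```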

Comparison of clustering approaches on 8 single-cell RNA-seq data sets
Table 5 tabulates 8 single-cell RNA-seq data sets. The data were downloaded from the Hemberg Group at the Sanger Institute (https://hemberg-lab.github.io/scRNA.seq.datasets), which hosts scRNA-seq data sets from Human and Mouse. Among them, we selected the 8 data sets that have a sample size between 100 and 2,000 and can be successfully downloaded and pre-processed using the code provided by the Hemberg Group under the column 'Scripts'. The data sets Camp1, Camp2, Darmanis, Li, and Patel come from Human, and the data sets Deng, Goolam, and Grun come from Mouse. Each data matrix contains the log-counts of the RNA-seq reads of different genes (features) in different cells (samples). The cell types are used as the true cluster labels to evaluate the performances of the clustering methods. We first pre-processed all the data using the code provided by the Hemberg Group; then, features (genes) with fractions of non-zero entries < 5% were filtered out. The resulting dimensions of all data sets are shown in Table 5. We compare IF-VAE with three other existing methods: (1) the orthodox IF-PCA [25], (2) Seurat [39], and (3) SC3 [29]. The orthodox IF-PCA was proposed for subject clustering on microarray data; this is the first time the method is applied to single-cell data. Seurat and SC3 are two popular methods for clustering single-cell RNA-seq data (see Section 2.6 for details). As discussed in Section 2.6, Seurat and SC3 implicitly use some feature selection ideas and some dimension reduction ideas, but they are much more complicated than IF-PCA and have several tuning parameters. Seurat has 4 tuning parameters: m is the number of selected features, N is the number of principal components in use, k_0 is the number of clusters in k-nearest neighbors, and δ is a 'resolution' parameter. We fix (m, N, k_0) = (1000, 50, 20) for all data sets (the values of (N, k_0) are the default ones; the default value of m is 2000, but we found that m = 1000 gives the same results on the 8 data sets and is faster to 
compute). We choose a separate value of δ for each data set such that the resulting number of clusters from modularity optimization is exactly K (details can be found in [45]). Seurat is implemented via the R package Seurat [19]. SC3 has 3 tuning parameters: x_0% is a threshold of the cell fraction used in the gene filtering step, d_0 is the number of eigenvectors in use, and k_0 is the level of hierarchy in the hierarchical clustering step. We fix (x_0, d_0) = (10, 15) and set k_0 as the number of true clusters in each data set. SC3 is implemented via the R package SC3 [29]. We observed that SC3 outputs an NA value on the Patel data set, because the gene filtering step removes all of the genes. To resolve this issue, we introduced a variant of SC3 that skips the gene filtering step. This variant is called SC3(NGF), where NGF stands for 'no gene filtering'. Seurat, SC3, and SC3(NGF) can only be applied to the unnormalized data matrix X. These methods also have randomness in the output, but the standard deviation of the clustering error is quite small; hence, we only run 1 repetition for each of them. The implementations of IF-PCA, IF-PCA(X), IF-VAE, and IF-VAE(X) are the same as in Section 3.1.
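The 5% gene filter described above is a simple column screen on the log-count matrix; a numpy sketch (illustrative, not the Hemberg Group scripts):

```python
import numpy as np

def filter_rare_genes(X, min_nonzero_frac=0.05):
    """Drop genes (columns) whose fraction of non-zero entries is below the cutoff.

    X: n x p matrix of log-counts (cells x genes).
    Returns the filtered matrix and the boolean mask of kept genes.
    """
    frac_nonzero = (X != 0).mean(axis=0)
    keep = frac_nonzero >= min_nonzero_frac
    return X[:, keep], keep

# Toy example: 4 cells, 3 genes; gene 2 is non-zero in 0 of 4 cells.
X = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 0.0],
              [1.5, 0.0, 0.0]])
X_filt, keep = filter_rare_genes(X)
print(keep)  # [ True  True False]
```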

# Dataset
Table 6 contains the clustering accuracies (number of correctly clustered cells divided by the total number of cells) of the different methods. For each data set, we rank all 6 methods (excluding SC3) by their clustering accuracies (the higher the accuracy, the lower the rank). SC3 is excluded from the rank calculation because it outputs NA on the Patel data set, where the gene filtering step removes all genes. Instead, we include SC3(NGF), 
a version of SC3 that resolves this issue on Patel and has better performance in most other data sets; this, if anything, favors SC3 in the comparison. For each data set, we also compute the regret of each method (defined as in Section 3.1). Similarly, we exclude SC3 but include SC3(NGF) in the regret calculation. Each method has a rank and a regret on each data set. The last 4 rows of Table 6 show the mean and standard deviation of the 8 ranks of each method, as well as the mean and standard deviation of the 8 regrets of each method. We make a few comments. First, if we measure the overall performance on the 8 data sets using the average rank, then IF-PCA(X) and SC3(NGF) are the best. If we use the average regret as the performance metric, then IF-PCA(X) is the best method. Second, a closer look at SC3(NGF) and IF-PCA(X) suggests that their performances have different patterns. SC3(NGF) is ranked 1 on some data sets (e.g., Camp2, Darmanis) but is ranked poorly on some others (e.g., Goolam, Grun). In contrast, IF-PCA(X) is ranked 2 on almost all data sets. Consequently, IF-PCA(X) has a smaller rank standard deviation, even though the two methods have the same average rank. One possible explanation is that SC3 is a complicated method with several tuning parameters: for some data sets, the current tuning parameters are appropriate, so SC3 achieves extremely good accuracy; for other data sets, the current tuning parameters are probably inappropriate, resulting in unsatisfactory performance. In comparison, IF-PCA is a simple and tuning-free method and has more stable performance across multiple data sets. Third, IF-VAE(X) is uniformly better than IF-VAE; hence, we recommend applying IF-VAE to the unnormalized data matrix instead of the normalized one. Last, IF-VAE(X) significantly improves IF-PCA(X) on Deng and Grun. This suggests that the nonlinear dimension reduction by VAE is potentially useful on these two data sets. In the other data sets, 
IF-VAE(X) either under-performs IF-PCA(X) or performs similarly.
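Clustering accuracy "up to a permutation of the K clusters", as used in Table 6, can be computed by maximizing the matched count over one-to-one label matchings, e.g., with the Hungarian algorithm (scipy's linear_sum_assignment); a sketch with made-up labels:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Fraction of correctly clustered samples, maximized over label permutations."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    K = max(y_true.max(), y_pred.max()) + 1
    # confusion[t, q] = number of samples with true label t and predicted label q
    confusion = np.zeros((K, K), dtype=int)
    for t, q in zip(y_true, y_pred):
        confusion[t, q] += 1
    row, col = linear_sum_assignment(-confusion)  # best one-to-one label matching
    return confusion[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]  # a pure relabeling of y_true
print(clustering_accuracy(y_true, y_pred))  # 1.0
```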
In terms of computational costs, Seurat is the fastest, and IF-PCA is the second fastest. VAE and SC3 are more time-consuming: the main cost of VAE arises from training the neural network, and the main cost of SC3 arises from computing the n × n similarity matrix among subjects. For a direct comparison, we report the running times of the different methods on the Camp1 data set (n = 777 and p = 13111). IF-PCA is implemented in Matlab and takes about 1.7 minutes. VAE and IF-VAE are implemented in Python, where the VAE steps are conducted using the Python library keras. The running time of VAE is 2.7 minutes, and the running time of IF-VAE is 1.4 minutes. SC3 is implemented via the package SC3 of Bioconductor in R, and it takes 3 minutes. Seurat is implemented using the R package Seurat and takes only 6 seconds.
Remark 2 (Using ARI as the performance metric): The adjusted Rand index (ARI) is another commonly-used metric of clustering performance. In Table 7, we report the ARI of different methods and recalculate the ranks and regrets. The results are quite similar to those in Table 6.

Table 7: The values of the adjusted Rand index (ARI) for the same data sets and methods as in Table 6. Similarly, the average rank and regret of SC3 are denoted as NA, for it generated an NA on the Patel data set.
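ARI is available in sklearn; unlike raw accuracy, it is already invariant to label permutations and adjusted for chance (the labelings below are made-up):

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
# A pure relabeling scores a perfect 1.0 without any explicit permutation matching.
print(adjusted_rand_score(y_true, [2, 2, 0, 0, 1, 1]))  # 1.0
# An unrelated labeling scores near 0 (negative values mean worse than chance).
print(adjusted_rand_score(y_true, [0, 1, 2, 0, 1, 2]))
```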
Remark 3 (Comparison with RaceID): Besides Seurat and SC3, there are many other clustering methods for single-cell data (e.g., see [50] for a survey). RaceID [16] is a recent method. It runs an initial clustering, followed by an outlier identification; the outlier identification is based on a background model of combined technical and biological variability in single-cell RNA-seq measurements. We compare IF-PCA(X) and IF-VAE(X) with RaceID (we used the R package RaceID and set all tuning parameters to the default values in this package). We observe that IF-PCA(X) and IF-VAE(X) outperform RaceID on most data sets. One possible reason is that the outlier identification step in RaceID is probably more suitable for applications with a large number of cells (e.g., tens of thousands of cells).

Table 9 compares IF-Seurat and IF-SC3(NGF) (introduced in Remark 4) with their original versions. For Seurat, the IF-step improves the clustering accuracies on Camp1, Darmanis, and Patel, yields similar performances on Deng, Goolam, Grun, and Li, and deteriorates the performance significantly on Camp2. For SC3, the IF-step sometimes yields a significant improvement (e.g., Camp1) and sometimes a significant deterioration (e.g., Deng). It is an interesting theoretical question when the current IF-step is suitable to combine with clustering methods other than PCA.

We are interested in several intertwined questions.
• When is the IF-step of IF-PCA really necessary? Since IF-PCA reduces to classical PCA when we omit the IF-step, an equivalent question is when IF-PCA really has an advantage over PCA.
• When is IF-PCA optimal in a minimax decision framework?
To facilitate the analysis, we consider a high-dimensional clustering setting with K = 2 classes. We assume the two classes are equally likely, so the class labels Y_1, ..., Y_n are iid with P(Y_i = 1) = P(Y_i = −1) = 1/2 (the extension to the case where we replace the Bernoulli parameter 1/2 by a δ ∈ (0, 1) is comparably straightforward). We also assume that the p-dimensional data vectors X_i are standardized, so that for a contrast mean vector µ ∈ R^p,

X_i = Y_i µ + Z_i, with Z_i iid ∼ N(0, I_p), 1 ≤ i ≤ n,

where I_p stands for the p × p identity matrix. As before, write X = [X_1, X_2, ..., X_n]′. For any 1 ≤ j ≤ p, we call feature j an "influential feature" or "useful feature" if µ(j) ≠ 0, and a "noise" or "useless feature" otherwise. We adopt a Rare/Weak (RW) model setting where, for fixed parameters 0 < θ, β, α < 1 (ν_a stands for the point mass at a),

µ(j) iid ∼ (1 − ε_p) ν_0 + (ε_p/2) ν_{−τ_p} + (ε_p/2) ν_{τ_p}, with ε_p = p^{−β} and n = p^θ.

In this model:
• (Signals are Sparse/Rare). The fraction of influential features is p^{−β}, which → 0 rapidly as p → ∞.
• (Signals are individually Weak). The signal strength τ_p of each influential feature may be much smaller than n^{−1/4}; since the signals are individually weak, it is non-trivial to separate the useful features from the useless ones.
• (No free lunch). Summing X either across rows (samples) or across columns (features) would not provide any useful information for clustering decisions, since the labels are balanced and the signals are two-sided.
Such a model is frequently used to study the fundamental limits and phase transitions associated with a high-dimensional statistical decision problem (e.g., classification, clustering, global testing). Despite its seeming simplicity, the RW model is actually very delicate to study, for it models a setting where the signals (i.e., useful features) are both rare and weak. See [9,10,18,21,44,48] for examples.
Compared with the model in [25] (which only considers one-sided signals, where all nonzero µ(j) are positive), our model allows two-sided signals and is therefore different. In particular, in our model, summing X either across rows or columns would not provide any useful information for clustering decisions. As a result, the phase transition we derive below is different from that in [25].
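A hedged simulation of the two-sided Rare/Weak model as we read it from this section (labels Y_i = ±1 equally likely; µ(j) = 0 with probability 1 − ε and ±τ with probability ε/2 each; X_i = Y_i µ + Z_i with N(0, I_p) noise); the parameter values below are illustrative:

```python
import numpy as np

def simulate_rw(n, p, eps, tau, seed=0):
    """Draw (X, Y, mu) from the two-sided Rare/Weak clustering model."""
    rng = np.random.default_rng(seed)
    Y = rng.choice([-1, 1], size=n)  # balanced class labels
    # mu(j) = 0 w.p. 1-eps, and +/- tau w.p. eps/2 each (two-sided signals)
    mu = rng.choice([0.0, tau, -tau], size=p, p=[1 - eps, eps / 2, eps / 2])
    Z = rng.normal(size=(n, p))      # N(0, I_p) noise
    X = np.outer(Y, mu) + Z
    return X, Y, mu

X, Y, mu = simulate_rw(n=200, p=1000, eps=0.05, tau=1.0)
print(X.shape, (mu != 0).mean())  # fraction of influential features is ~ eps
```

Because the labels are balanced and the signals are two-sided, the row and column sums of the signal part Yµ′ are centered at zero, which is the "no free lunch" property noted above.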
Consider a clustering procedure and let Ŷ ∈ R^n be the predicted class label vector. Note that for any 1 ≤ i ≤ n, both Y_i (the true class label) and Ŷ_i take values in {−1, 1}. Let Π be the set of all possible permutations on {−1, 1}. We measure the performance of Ŷ by the Hamming error rate

Hamm_p(Ŷ) = min_{π ∈ Π} (1/n) Σ_{i=1}^n P(Ŷ_i ≠ π(Y_i)),

where the probability measure is with respect to the randomness of (µ, Y, Z).

A slightly simplified version of PCA and IF-PCA

For Model (4.1)-(4.4), the simplified PCA runs as follows.
• Obtain the first singular vector of X and denote it by ξ (we are misusing the notation a little bit here, as this ξ is simpler than the ξ in Section 2.2).
• Cluster by letting Ŷ_i = sgn(ξ_i), 1 ≤ i ≤ n.
To differentiate it from the PCA in Section 2.2, we call this approach the slightly simplified PCA. Also, to use IF-PCA for Model (4.1)-(4.4), we introduce the normalized χ²-testing score for feature j:

ψ_j = [Σ_{i=1}^n X_i(j)² − n] / √(2n).

By elementary statistics, ψ_j is approximately N(0, 1) when feature j is useless. Fix a threshold t*_p = √(2 log p). The simplified IF-PCA runs as follows.
• (IF-step). Select feature j if and only if ψ_j ≥ t*_p.
• (Clustering-step). Let Ŝ = {1 ≤ j ≤ p : ψ_j ≥ t*_p}, and let X_Ŝ be the post-selection data matrix (the sub-matrix of X consisting of the columns in Ŝ). Let ξ* ∈ R^n be the first singular vector of X_Ŝ, and cluster by letting Ŷ_i = sgn(ξ*_i), 1 ≤ i ≤ n.
Similarly, to differentiate it from the IF-PCA in Section 2.4, we call this the slightly simplified IF-PCA.
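A runnable sketch of the slightly simplified PCA and IF-PCA: χ²-based scores ψ_j, a universal threshold t* = √(2 log p) (our reading of the threshold in this section, as ψ_j is approximately N(0, 1) under the null), the first singular vector, and sign clustering. This is an illustration under a planted-signal simulation, not the authors' code.

```python
import numpy as np

def simplified_pca(X):
    """Cluster by the sign of the first left singular vector of X."""
    xi = np.linalg.svd(X, full_matrices=False)[0][:, 0]
    return np.sign(xi)

def simplified_if_pca(X):
    """IF-step by normalized chi-square scores, then simplified PCA on kept columns."""
    n, p = X.shape
    psi = (np.sum(X ** 2, axis=0) - n) / np.sqrt(2 * n)  # ~ N(0,1) for a useless feature
    t_star = np.sqrt(2 * np.log(p))                      # universal threshold
    S = psi >= t_star
    if not S.any():                                      # fall back to all features
        S = np.ones(p, dtype=bool)
    return simplified_pca(X[:, S])

rng = np.random.default_rng(1)
n, p = 200, 1000
Y = rng.choice([-1, 1], size=n)
mu = np.zeros(p)
mu[:20] = 0.8                                            # 20 planted influential features
X = np.outer(Y, mu) + rng.normal(size=(n, p))
Y_hat = simplified_if_pca(X)
err = min(np.mean(Y_hat != Y), np.mean(Y_hat != -Y))     # Hamming error, up to sign flip
print(err)  # small, since the planted signal here is fairly strong
```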

The computational lower bound (CLB)
We first discuss the computational lower bound (CLB). The notion of CLB is an extension of the classical information lower bound (e.g., the Cramér-Rao lower bound). In comparison:
• The classical information lower bound usually claims that a certain goal is not achievable by any method (including methods that are computationally NP-hard).
• The computational lower bound usually claims that a certain goal is not achievable by any method with a polynomial computational time.
From a computational perspective, we strongly prefer algorithms with a polynomial computation time. Therefore, compared with the classical information lower bound, the CLB is practically more relevant. Let s_p = pε_p. Note that in our model, the number of signals is Binomial(p, ε_p), which concentrates at s_p. Recall that in our calibrations, n = p^θ and s_p = p^{1−β}, and the strength of the individual signals is τ_p. Introduce the critical signal strength τ_p*. We have the following theorem (Theorem 4.1): when τ_p falls below the critical strength, any "computable" clustering procedure (meaning one with a polynomial computational time) fails, in that its error rate is approximately the same as that of random guessing. The proof of Theorem 4.1 is long but similar to that of [24, Theorem 1.1], so we omit it.
Next, we study the performance of classical PCA and IF-PCA. Before we do that, we present a lemma on classical PCA in Section 4.3. We state the lemma in a setting that is more general than Model (4.1)-(4.4), but we will come back to Model (4.1)-(4.4) in Section 4.4.
-The left is the less sparse case, where the number of useful features satisfies s_p ≫ √p. For any fixed (α, β) in this region, the Hamming error rate of PCA is o(1), so PCA achieves the optimal phase transition. Also, in this case, the signals are individually too weak and feature selection is infeasible. Therefore, in the IF-step, the best we can do is to select all features, so IF-PCA reduces to PCA.
-The right is the more sparse case, where the number of useful features satisfies s_p ≪ √p. For any fixed (α, β) in this region, the Hamming error rate of IF-PCA is o(1), so IF-PCA achieves the optimal phase transition. Also, in this case, the signals are individually strong enough and feature selection is desirable. Therefore, IF-PCA and PCA are significantly different.
See Figure 2 for details. In the left part of the Region of Possibility (β < 1/2), feature selection is infeasible, PCA is optimal, and IF-PCA reduces to PCA with an appropriate threshold. In the right part (β > 1/2), it is desirable to conduct feature selection, and IF-PCA is optimal; however, PCA is non-optimal for parameters in the shaded green region.

Discussions
IF-PCA is a simple and tuning-free approach to unsupervised clustering of high-dimensional data.
The main idea of IF-PCA is a proper combination of feature selection and dimension reduction by PCA. In this paper, we make several contributions. First, we extend IF-PCA to IF-VAE by replacing PCA with the variational auto-encoder (VAE), a popular unsupervised deep learning algorithm. Second, we study the theoretical properties of IF-PCA in a simple clustering model and derive the phase transitions. Our results reveal how the feature sparsity and the feature strength affect the performance of IF-PCA, and they explain why IF-PCA can significantly improve on classical PCA. Third, we investigate the performances of IF-PCA and IF-VAE in two applications, subject clustering with gene microarray data and cell clustering with single-cell RNA-seq data, and compare them with other popular methods. We discover that IF-PCA performs quite well in both applications. Its success on microarray data was reported in [25], but it had never been applied to single-cell data. To use IF-PCA on single-cell data, we recommend a mild modification of the original procedure, called IF-PCA(X), which performs the PCA step on the unnormalized data matrix X instead of the normalized data matrix W. On the 8 single-cell RNA-seq data sets considered in this paper, IF-PCA(X) has the second best accuracy in almost all the data sets, showing a stable performance across multiple data sets. We think IF-PCA has great potential for single-cell clustering, for the method is simple, transparent, and tuning-free. Although the current IF-PCA(X) still underperforms the state-of-the-art methods (e.g., SC3) on some data sets, it is hopeful that a variant of IF-PCA (say, by borrowing the consensus voting in SC3 or replacing PCA with some other embedding method [5,34]) can outperform them.
We also find that unsupervised deep learning algorithms do not immediately yield improvements over classical methods on the microarray data and the single-cell data. IF-VAE underperforms IF-PCA in most data sets; there are only a few data sets in which IF-VAE slightly improves on IF-PCA. The reason can be either that nonlinear dimension reduction has no significant advantage over linear dimension reduction in these data sets, or that IF-VAE is not optimally tuned. How to tune deep learning algorithms in unsupervised settings is an interesting future research direction. Moreover, the theory of VAE remains largely unknown [13]. A theoretical investigation of VAE requires an understanding of both the deep neural network structures and the variational inference procedure. We also leave this to future work.
The framework of IF-PCA only assumes feature sparsity and no other particular structure on the features. It is possible that the features are grouped [6] or have tree structures [32]. How to adapt IF-PCA to such settings is an interesting open research direction.
In the real data analysis, we assume that the number of clusters, K, is given. When K is unknown, how to estimate K is a problem of independent interest. One approach is to use the scree plot. For example, [27] proposed a method that first computes a threshold from the bulk eigenvalues in the scree plot and then applies this threshold to the top eigenvalues to estimate K. Another approach is based on global testing: given a candidate K, we may first apply a clustering method with this K and then apply the global testing methods in [24] to test whether each estimated cluster has no sub-clusters; K is set as the smallest candidate such that the global null hypothesis is accepted in all estimated clusters. In general, estimating K is a problem independent of clustering. It is interesting to investigate which estimators of K work best for gene microarray data and single-cell RNA-seq data, which we leave to future work.

In Sections 2.3-2.4, we have seen that VAE aims to exploit nonlinear relationships, while IF-PCA aims to exploit sparsity. We may combine VAE with the IF-step of IF-PCA for a simultaneous exploitation of sparsity and nonlinearity. To this end, we propose a new algorithm called IF-VAE. IF-VAE contains an IF-step and a clustering-step, and runs as follows. Input: the normalized data matrix W = [w_1, w_2, ..., w_p] = [W_1, W_2, ..., W_n]′, the number of classes K, and the dimension of the latent space in VAE (denoted by d). Output: the predicted class label vector Ŷ = (Ŷ_1, Ŷ_2, ..., Ŷ_n).
• (IF-step). Run the same IF-step as in Section 2.4, and let W^IF = [W^IF_1, ..., W^IF_n]′ ∈ R^{n×m} be the matrix consisting of the retained features only (as in the IF-step of IF-PCA, m is the number of retained features).
• (Clustering-step). Apply VAE to W^IF ∈ R^{n×m} and obtain an n × d matrix Z^IF, which can be viewed as an estimate of the low-dimensional representation of W^IF. Cluster the n samples into K clusters by applying classical k-means to Z^IF, assuming there are K classes. Let Ŷ be the predicted label vector.
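The IF-VAE pipeline above, as a skeleton; the vae_embed below is a deliberately trivial placeholder (a rank-d linear SVD embedding) standing in for the keras VAE described in Section 3.1, so only the plumbing (mask, embed, k-means) is the real structure:

```python
import numpy as np
from sklearn.cluster import KMeans

def vae_embed(W_if, d):
    """Placeholder for the VAE encoder: a rank-d linear embedding via SVD.

    In IF-VAE proper, this is the d-dimensional latent representation from a
    trained variational auto-encoder, not an SVD.
    """
    U, s, _ = np.linalg.svd(W_if, full_matrices=False)
    return U[:, :d] * s[:d]

def if_vae(W, K, d, if_select):
    """IF-VAE skeleton: IF-step (given as a boolean feature mask), embed, k-means."""
    W_if = W[:, if_select]                 # keep the m retained features
    Z_if = vae_embed(W_if, d)              # n x d latent representation
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Z_if)
    return km.labels_

rng = np.random.default_rng(2)
n, p, K, d = 100, 500, 2, 5
labels = rng.integers(0, K, size=n)
W = rng.normal(size=(n, p))
W[:, :30] += 3.0 * labels[:, None]         # two well-separated clusters
mask = np.zeros(p, dtype=bool)
mask[:30] = True                           # pretend the IF-step retained these features
Y_hat = if_vae(W, K, d, mask)
```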

Figure 1 :
Figure 1: Clustering errors of IF-VAE(X) as a function of the number of selected features in the IF step (data set: LungCancer(1); y-axis: number of clustering errors; x-axis: number of selected features).

Remark 4 (
Combining the IF-step with Seurat and SC3): We investigate whether the IF-step of IF-PCA can be used to conduct feature selection for other clustering methods. To this end, we introduce IF-Seurat and IF-SC3(NGF), in which Seurat and SC3(NGF) are applied, respectively, to the post-selection unnormalized data matrix from the IF-step of IF-PCA; see Table 9.

Figure 2 :
Figure 2: Phase transition for PCA and IF-PCA (θ = 0.6). The (three-segment) solid green line is α = α*(β, θ), which separates the whole region into the Region of Impossibility (top) and the Region of Possibility (bottom). In the left part of the Region of Possibility (β < 1/2), feature selection is infeasible, PCA is optimal, and IF-PCA reduces to PCA with an appropriate threshold. In the right part (β > 1/2), it is desirable to conduct feature selection, and IF-PCA is optimal; however, PCA is non-optimal for parameters in the shaded green region.

Table 2 :
The 10 gene microarray data sets analyzed in Section 3.1 (n: number of subjects; p: number of genes; K: number of clusters).

Table 3 :
Comparison of the clustering errors of different methods on the 10 microarray data sets in Table 2. IF-PCA has the smallest average rank and average regret (boldface) and is regarded as the best on average.

Table 4 :
The clustering errors of kmeans++ and hierarchical clustering on the 10 microarray data sets (the clustering errors of IF-PCA are listed for reference).

Table 5 :
Single-cell RNA-seq data sets investigated in this paper.(n: number of cells; p: number of genes; K: number of cell types)

Table 6 :
Comparison of the clustering accuracies of different methods on the 8 single-cell RNA-seq data sets in Table 5. The result for SC3 on Patel is NA, because all genes are removed in the gene filtering step; for this reason, we exclude SC3 when calculating the rank and the regret and instead include SC3(NGF), a variant of SC3 that skips the gene filtering step ('NGF' stands for no gene filtering) and performs better than the original SC3. IF-PCA(X) is regarded as the best on average: it has the smallest average regret (boldface) and average rank (boldface). Note also that the standard deviation (SD) of its rank is only about 50% of that of SC3(NGF).

Table 8 :
Comparison of the clustering accuracies of IF-PCA(X), IF-VAE(X) and RaceID.

Table 9 :
Combinations of the IF-step with Seurat and SC3(NGF): IF-Seurat compared with Seurat, and IF-SC3(NGF) compared with SC3(NGF).

Phase transition for PCA and IF-PCA

Compared with VAE, Seurat, and SC3, an advantage of IF-PCA is that it is conceptually much simpler and thus comparably easier to analyze. In this section, we present some theoretical results and show that IF-PCA is optimal in a Rare/Weak signal setting.
To facilitate the analysis for Model (4.1)-(4.4), we consider a slightly more idealized version of PCA and IF-PCA, where the main changes are: (a) we skip the normalization step (as we assume the model is for data that are already normalized); (b) we replace the feature selection by Kolmogorov-Smirnov statistics in IF-PCA with feature selection by χ² statistics; (c) we remove Efron's correction in IF-PCA (Efron's correction is especially useful for analyzing gene microarray data, but is not necessary for the current model); and (d) we skip the Higher Criticism threshold (HCT) choice (the study of HCT is quite relevant to our model, but it is technically very long, so we skip it). Note also that the rank of the signal matrix Yµ′ is 1 in Model (4.1)-(4.4), so in both PCA and the clustering step of IF-PCA, we should apply k-means clustering to the first singular vector of X only. Despite these simplifications, the essence of the original PCA and IF-PCA is retained. See below for a more detailed description of the (simplified) PCA and IF-PCA.