SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis

Cancer is an umbrella term that covers a range of disorders, from fast-growing, lethal malignancies to indolent lesions with low or delayed potential for progression to death. Treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) its scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) it can integrate data types with different numbers of patients, (iii) it can analyze un-matched data of different types, and (iv) it offers users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com.
The R package will be deposited to CRAN as part of our PINSPlus software suite.


INTRODUCTION
Since cancer is a heterogeneous disease, the correct identification of cancer subtypes is essential for accurate prognosis and improved treatment. With the advancement of high-throughput platforms, subtyping methods have shifted toward multi-omics integration in order to differentiate between subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). Vast amounts of molecular data have accumulated in public repositories, including The Cancer Genome Atlas (TCGA) (1), the Genomic Data Commons Data Portal (GDC) (2), the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (3), and UK Biobank (4). This demands powerful yet fast analysis methods to leverage large multi-omics datasets for more accurate subtype discovery.
Current approaches for multi-omics integration and cancer subtyping can be grouped into four categories based on their integration strategy. The first strategy is to concatenate different types of data into a single matrix and then partition the patients using the concatenated data. For example, users can normalize and concatenate multiple data types (e.g., mRNA, methylation, miRNA, etc.) into one single matrix and then apply well-known methods developed for single-omics analysis, such as ConsensusClusterPlus (5), to determine the subtypes. Such approaches are simple and computationally efficient. However, they do not account for data heterogeneity, e.g., different data types might have different scales and dimensions, and might require different normalization procedures.
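As a simple illustration of this concatenation strategy, the sketch below (Python/NumPy for illustration only; the methods cited above are R packages, and all data here is synthetic) standardizes two omics matrices feature-wise before joining them, so that differing scales do not dominate the clustering:

```python
import numpy as np

rng = np.random.default_rng(0)
mrna = rng.standard_normal((50, 200)) * 10 + 5   # 50 patients x 200 genes, large scale
meth = rng.random((50, 300))                     # 50 patients x 300 probes, values in [0, 1]

def zscore(M):
    # standardize each feature (column) to zero mean and unit variance
    return (M - M.mean(axis=0)) / M.std(axis=0)

# concatenate the normalized data types into one matrix for single-omics clustering
concat = np.hstack([zscore(mrna), zscore(meth)])  # 50 patients x 500 features
```

A single-omics method such as ConsensusClusterPlus would then be applied to the concatenated matrix; as noted above, this ignores the remaining heterogeneity between data types.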
The second strategy is to model the multi-omics data as a mixture of statistical models. Methods in this category include LRACluster (6), rMKL-LPP (7), iClusterPlus (8), iClusterBayes (9), OTRIMLE (10), SBC (11), BCC (12), MID (13), JIVE (14), MCIA (15), moCluster (16), and sMBPLS (17). These methods typically maximize a joint likelihood function to determine the model parameters and the subtypes. Though statistically sound, these methods need to estimate a large number of parameters that often lead to overfitting and high computational complexity. Therefore, an added step of gene filtering or data transformation is often applied before the statistical analysis.
The third strategy is to project all data types into a joint latent space. A common technique used for this strategy is nonnegative matrix factorization. Methods in this category include MvNMF (18), MultiNMF (19), IntNMF (20), iNMF (21), and jointNMF (22). Another method, MCCA (23), performs correlation analysis and then concatenates the correlation matrices into one single matrix. After projecting the data onto a joint space, cluster analysis is performed to determine the final subtypes. Similar to the second strategy, methods in this category often have excessive computational complexity and cannot be applied at the whole-genome scale. Therefore, gene filtering is a necessary step in the data processing.
The fourth strategy is also called the similarity-based strategy. Methods in this category include SNF (24), PSDF (25), PFA (26), IS-Kmeans (27), NEMO (28), PINS (29,30), SCFA (31), and CIMLR (32). These methods first compute a pair-wise connectivity matrix for each data type that represents the similarity/connectivity between patients. The connectivity matrices are then fused into a single similarity matrix that can be used for the final clustering. Although powerful, the similarity matrix requires quadratic memory space, which is problematic as the number of samples increases. As we will demonstrate in our analysis, these methods cannot analyze data with tens of thousands of samples.
Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for cancer subtyping and big data analysis. This method offers important advantages over existing software: (i) it allows researchers to analyze hundreds of thousands of samples in minutes, (ii) it can integrate data types with different numbers of patients, (iii) it can integrate and analyze un-matched data of different types, and (iv) its web application offers a convenient data analysis pipeline. We also improve the efficiency of our ensemble-based perturbation clustering to support analysis on machines with memory constraints. Our extensive analysis on 37 TCGA and two METABRIC datasets shows that SMRT is more accurate than state-of-the-art subtyping methods in identifying subtypes with significantly different survival profiles. In addition, our simulations with big data show that SMRT is fast and many-fold more scalable than existing methods. Specifically, SMRT is able to analyze hundreds of thousands of samples in minutes.

The SMRT Pipeline
The overall workflow of SMRT is presented in Figure 1. This workflow offers two different analysis pipelines: one for data of moderate size and one for big data. In the first case, given a multi-omics dataset of moderate size (e.g., fewer than 2,000 samples), SMRT performs subtyping as follows. It first projects each data type onto a lower-dimensional space using randomized singular value decomposition (RSVD) and then performs a perturbation clustering (PINS) (29,30) to determine the subtypes within each data level. It also builds a pair-wise connectivity matrix for each data type that represents the connectivity between patients (see Supplementary Section 5 for the differences between SMRT and PINS). Next, the method combines the connectivity matrices into a single similarity matrix and then determines the final subtypes using an ensemble of multiple similarity-based methods. In the second case, when the data has more than 2,000 samples, SMRT splits the data into two different sets of patients: a sampled set and a propagated set. It performs the subtyping on the sampled set and then assigns the patients from the propagated set to the identified subtypes. Note that the number 2,000 is chosen to balance the accuracy and time complexity of the method. This moderate number of samples allows SMRT to perform a fast and accurate analysis in limited memory (see Supplementary Section 3). Our simulation studies show that the results do not change when we vary this number. However, users are free to change this parameter when using the R package. Below is the description of each of these analysis modules.

Dimension Reduction Using Randomized Singular Value Decomposition
The goal of this step is to project the multi-omics data into a lower-dimensional space using randomized singular value decomposition (RSVD). For data with hundreds of thousands of dimensions (e.g., Illumina 450k), this step substantially reduces the required computational power while maintaining the clustering accuracy. Let us denote X ∈ R^(n×m) as the input matrix, where n is the number of samples/patients, and m is the number of genes/features. Briefly, the RSVD method starts by generating a random projection matrix P ∈ R^(m×r) from a standard normal distribution, where r ≪ m. It then projects X onto the column space of P to get a matrix Z = XP. Due to the random projection, the columns of Z approximately span the dominant column space of X. Next, we obtain an orthonormal matrix Q from Z using QR decomposition, where Q has the same size as Z, namely n × r. The method then projects X into a smaller space to get a matrix Y ∈ R^(r×m) such that Y = Q^T X, and computes the singular value decomposition (SVD) of Y as Y = USV* using the traditional SVD method (33). At most r singular vectors are kept, so the size of U is r × r and the size of V* is r × m (equivalently, V is m × r). Finally, the low-rank rotated data of the original matrix X can be computed as X′ = XV ∈ R^(n×r).
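The steps above can be sketched in a few lines of NumPy (an illustration only; SMRT itself is implemented in R, and the function name `rsvd` here is hypothetical, not the package's API):

```python
import numpy as np

def rsvd(X, r, seed=0):
    """Randomized SVD sketch for an n x m matrix X, following the steps in the text."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    P = rng.standard_normal((m, r))   # random projection matrix, r << m
    Z = X @ P                         # n x r sketch of the column space of X
    Q, _ = np.linalg.qr(Z)            # orthonormal basis Q, n x r
    Y = Q.T @ X                       # r x m projection of X
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)  # SVD of the small matrix Y
    V = Vt.T                          # m x r right singular vectors
    X_rot = X @ V                     # n x r low-rank rotated data
    return X_rot, V, S

# toy check: a rank-2 matrix is captured by r = 5 (oversampled) random projections
X = np.outer(np.arange(1, 101.0), np.ones(500)) + \
    np.outer(np.sin(np.arange(100.0)), np.arange(500.0))
X_rot, V, S = rsvd(X, r=5)
```

Because the toy matrix has rank 2 and r = 5 oversamples it, `X_rot @ V.T` reconstructs `X` almost exactly, which is the sense in which the rotation preserves the dominant structure.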
In practice, RSVD is faster and requires less memory than the traditional SVD. To further speed up our approach, we implement a parallel version of RSVD that can efficiently utilize multiple cores available in modern processors. Note that when the input data is large (e.g., more than 2,000 samples), we do not perform RSVD on the whole input. Instead, we split the data into two sets of patients: a sampled set and a propagated set. We first perform RSVD on the sampled set, and then project the original data matrix (both sampled and propagated set) to the subspace of the sampled set by multiplying it with the rotation matrix obtained from the RSVD for the sampled set. This implementation allows us to perform SVD in at most a few seconds, even for datasets with hundreds of thousands of samples and features.
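The sampled/propagated projection can be illustrated as follows (a NumPy sketch on synthetic data; plain SVD stands in for RSVD on this small example, and the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
full = rng.standard_normal((5000, 300))              # all patients x features
sample_idx = rng.choice(5000, size=2000, replace=False)
sampled = full[sample_idx]                           # the sampled set

# decompose the sampled set only (plain SVD as a stand-in for RSVD here)
r = 20
_, _, Vt = np.linalg.svd(sampled, full_matrices=False)
V = Vt[:r].T                                         # rotation matrix, 300 x r

# project *all* patients (sampled and propagated) with the same rotation,
# so both sets live in the subspace learned from the sampled set
full_low = full @ V                                  # 5000 x r
```

The expensive decomposition is performed only on the 2,000 sampled rows; the remaining patients are mapped into the same subspace by a single matrix multiplication.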
The output of this module is multiple matrices, one per data type. In each matrix, the rows represent patients while the columns represent principal components (PCs). These matrices serve as input to the next module, perturbation clustering, which computes the perturbed connectivity matrices and determines the subtypes; it is described in the next section.

Subtype Discovery Using One Data Type
Given a single data type, SMRT utilizes our previously developed perturbation clustering (PINS) (29,30) to partition the data.
Briefly, we perturb the data (by adding Gaussian noise) and repeatedly partition the patients (using k-means by default). For each partitioning, we build a pair-wise connectivity matrix of 0's and 1's in which 1 means that the two patients belong to the same cluster, and 0 otherwise. By perturbing and clustering the data multiple times, we obtain multiple connectivity matrices that represent how stable the connectivity between patient pairs is. Finally, we choose the partitioning that is the most stable against data perturbation. This algorithm automatically determines the number of clusters and the patient subgroups.
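A toy version of this procedure might look like the following NumPy sketch. It uses a simplified stability score (the mean absolute difference between the original and perturbed connectivity matrices) rather than the exact PINS criterion, and a minimal k-means stand-in for a library implementation:

```python
import numpy as np

def kmeans(X, k, iters=30):
    # minimal Lloyd's k-means with deterministic farthest-point initialization
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def connectivity(labels):
    # n x n matrix: 1 where two samples share a cluster, 0 otherwise
    return (labels[:, None] == labels[None, :]).astype(float)

def perturbation_stability(X, k, n_perturb=10, noise=0.5, seed=0):
    # lower score = partitioning is more stable under Gaussian perturbation
    rng = np.random.default_rng(seed)
    base = connectivity(kmeans(X, k))
    diffs = [np.abs(connectivity(kmeans(X + rng.standard_normal(X.shape) * noise, k))
                    - base).mean()
             for _ in range(n_perturb)]
    return float(np.mean(diffs))

# two well-separated blobs: k = 2 should be markedly more stable than k = 3 or 4
rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((40, 2)) + 8, rng.standard_normal((40, 2)) - 8])
scores = {k: perturbation_stability(X, k) for k in (2, 3, 4)}
best_k = min(scores, key=scores.get)
```

The most stable number of clusters is selected automatically, mirroring how the perturbation clustering chooses the partitioning that best resists noise.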
When the number of samples is large, the perturbation clustering becomes slow and memory-inefficient. The perturbation clustering algorithm relies on the pair-wise connectivity matrix of size n × n for clustering (n is the number of patients), so its time and space complexity (running time and memory usage) increase quadratically with the number of samples. Therefore, when the number of samples is large (by default, when n > 2,000), we perform a sub-sampling process over the original data to obtain a subset of 2,000 patients/samples. Next, we transform the data into a lower-dimensional space and use the perturbation clustering to partition these patients. After this step, each of the 2,000 patients has a subtype. Let us refer to this selected set of 2,000 patients as the sampled set. The next step is to determine the subtypes for the rest of the patients, called the propagated set. For this purpose, we use a fast k-nearest neighbor (FKNN) search algorithm (34,35) to assign each patient from the propagated set to one of the subtypes in the sampled set. Briefly, the FKNN method calculates the distance between the new patient and the k nearest patients in the sampled set. Next, the FKNN method classifies the new patient using vote counting (i.e., it chooses the subtype with the most patients among the k neighbors). By default, k is determined using the Elbow method on the sampled set with 5-fold cross-validation. The sampled set is divided randomly into five equal folds. In each round, the combination of four folds is used as the training set, and the remaining fold is used as the validation set for the KNN algorithm, with k ranging from 5 to a maximum of 50. The k that yields the lowest average classification error rate is used as the optimal k. However, users are also free to modify the value of this parameter.
Supplementary Section 6 provides more details on the performance of using the Elbow method versus using a fixed number of k.
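The sub-sampling-and-propagation step can be sketched as below. This is an illustrative NumPy version only: it selects k by the lowest mean cross-validation error rather than the Elbow heuristic, and uses a brute-force neighbor search in place of a fast FKNN implementation:

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k):
    # classify each query point by majority vote among its k nearest training points
    d = ((query_X[:, None] - train_X[None]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in train_y[nn]])

def choose_k_cv(X, y, ks=range(5, 51, 5), folds=5, seed=0):
    # pick k over a grid from 5 to 50 using 5-fold cross-validation
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(X)), folds)
    errs = {}
    for k in ks:
        fold_err = []
        for f in range(folds):
            val = splits[f]
            trn = np.concatenate([splits[g] for g in range(folds) if g != f])
            pred = knn_predict(X[trn], y[trn], X[val], k)
            fold_err.append((pred != y[val]).mean())
        errs[k] = np.mean(fold_err)
    return min(errs, key=errs.get)

# toy sampled set with known subtypes, plus a propagated set to classify
rng = np.random.default_rng(7)
sampled = np.vstack([rng.standard_normal((60, 3)) + 5, rng.standard_normal((60, 3)) - 5])
labels = np.repeat([0, 1], 60)
propagated = np.vstack([rng.standard_normal((10, 3)) + 5, rng.standard_normal((10, 3)) - 5])

k = choose_k_cv(sampled, labels)
assigned = knn_predict(sampled, labels, propagated, k)
```

Each propagated patient inherits the subtype held by the majority of its k nearest neighbors in the sampled set.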
One note of caution is that the number of dimensions of the data can be high, thus slowing down the distance calculation and neighbor finding. Therefore, instead of calculating the distance between patients in the original space, we calculate the distance between patients in the principal component (PC) space of the sampled set. As described above, we project the original data matrix (both the sampled and propagated sets) to the subspace of the sampled set by multiplying it with the rotation matrix obtained from the RSVD of the sampled set. After this transformation, the pair-wise distance between patients is calculated in the new space with a much lower number of dimensions.

Subtype Discovery Using Multi-Omics Data
When the number of samples is small (by default, when n ≤ 2,000), we utilize an ensemble strategy to partition the patients. The method first clusters each data type (using the algorithm described in Section 2.3) and constructs the perturbed connectivity matrices. It then merges the connectivity matrices of all data types into a single similarity matrix that represents the similarity between patients across all data types by averaging the connectivity values for each pair of samples. Next, to cluster the similarity matrix, it uses several similarity-based algorithms, including hierarchical clustering, partitioning around medoids (36), and dynamic tree cut (37), and then chooses the partitioning that agrees the most with the partitionings of the individual data types. This ensemble strategy ensures that the identified subtypes are consistent across all data types and are robust against the choice of clustering algorithms.
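A toy version of this merging-and-ensemble step: connectivity matrices are averaged into a similarity matrix, and candidate partitions (hard-coded stand-ins for the hierarchical/PAM/dynamic-tree-cut results) are scored by their average pairwise agreement (Rand index) with the per-data-type partitions:

```python
import numpy as np

def connectivity(labels):
    # 1 where two patients share a cluster, 0 otherwise
    return (labels[:, None] == labels[None, :]).astype(float)

def rand_index(a, b):
    # fraction of patient pairs on which two partitions agree
    ca, cb = connectivity(a), connectivity(b)
    off = ~np.eye(len(a), dtype=bool)
    return (ca[off] == cb[off]).mean()

# per-data-type partitions of the same six patients (toy labels)
mrna  = np.array([0, 0, 0, 1, 1, 1])
meth  = np.array([0, 0, 1, 1, 1, 1])
mirna = np.array([0, 0, 0, 1, 1, 1])
types = [mrna, meth, mirna]

# merge the connectivity matrices by averaging the value for each patient pair
similarity = np.mean([connectivity(t) for t in types], axis=0)

# candidate ensemble partitions (stand-ins for the similarity-based algorithms)
candidates = [np.array([0, 0, 0, 1, 1, 1]),
              np.array([0, 0, 1, 1, 2, 2])]

# choose the candidate that agrees most with the individual data types
scores = [np.mean([rand_index(c, t) for t in types]) for c in candidates]
final = candidates[int(np.argmax(scores))]
```

The first candidate matches two of the three data types exactly, so it wins the agreement vote and becomes the ensemble partitioning.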
When the number of samples is large (by default, when n > 2,000), we perform a sub-sampling and classification procedure similar to the algorithm described in Section 2.3. The difference here is that multiple data types are involved. First, we randomly select 2,000 samples/patients and then apply the multi-omics algorithm described above to partition the selected samples. We refer to this selected set of 2,000 patients as the sampled set and the remaining patients as the propagated set. The next task is to determine the subtypes of patients in the propagated set. Given a patient in the propagated set, we perform the FKNN procedure for each data type to obtain the probability that the patient belongs to each subtype, using the labels obtained from the nearest neighbors. The final probabilities are calculated by averaging the probabilities across all data types. Finally, we assign the patient to the subtype with the highest probability. This strategy is also applied when integrating multi-omics data in which the data types have different numbers of samples. In that case, the sampled set is the set of patients (by default, at most 2,000) that have data in all data types, and the remaining patients form the propagated set.

FIGURE 1 | The overall workflow of SMRT. (A) Analysis pipeline for data of moderate size. First, SMRT projects each data type to a lower-dimensional space using randomized singular value decomposition (RSVD). Next, it performs a perturbation clustering to determine the subtypes and to build a pair-wise patient connectivity matrix for each data type. Finally, it merges the connectivity matrices into a single similarity matrix and then determines the final subtypes using a cluster ensemble. The output is the clustering results for each data type, as well as the results after the multi-omics data integration. (B) Analysis pipeline for big data. SMRT first splits the data into two different sets: a sampled set and a propagated set. The method first determines the subtypes using the sampled set and then assigns the patients from the propagated set to the subtypes identified using the sampled set. The sampled data is partitioned using the pipeline described in (A); subtype assignments for samples in the propagated set are determined by averaging the probabilities from all k-NN models. (C) An example of the subtypes discovered by the SMRT web service for the KIRC dataset. The left panel shows a preview of the uploaded data. The middle panel shows the visualization of the discovered subtypes and export functions. The right panel shows patient connectivity matrices for each data type.
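The per-data-type probability averaging can be sketched as follows (illustrative NumPy on synthetic data; `knn_proba` is a hypothetical helper using brute-force search in place of FKNN):

```python
import numpy as np

def knn_proba(train_X, train_y, query, k, n_classes):
    # per-subtype probabilities from the k nearest neighbours' labels
    d = ((train_X - query) ** 2).sum(axis=1)
    nn = np.argsort(d)[:k]
    return np.bincount(train_y[nn], minlength=n_classes) / k

# one new patient observed in two data types; feature spaces differ per type
rng = np.random.default_rng(3)
mrna_X = np.vstack([rng.standard_normal((30, 4)) + 4, rng.standard_normal((30, 4)) - 4])
meth_X = np.vstack([rng.standard_normal((30, 6)) + 4, rng.standard_normal((30, 6)) - 4])
labels = np.repeat([0, 1], 30)                 # subtypes of the sampled set

patient = {"mrna": np.full(4, 4.0), "meth": np.full(6, 4.0)}  # near subtype 0

p_mrna = knn_proba(mrna_X, labels, patient["mrna"], k=5, n_classes=2)
p_meth = knn_proba(meth_X, labels, patient["meth"], k=5, n_classes=2)
final_proba = (p_mrna + p_meth) / 2            # average across data types
subtype = int(np.argmax(final_proba))          # assign the most probable subtype
```

Because each data type votes with a probability vector, a type in which the patient sits near a subtype boundary is naturally down-weighted by more confident types.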

The SMRT Web Interface
The web application is publicly available at http://SMRT.tinnguyen-lab.com. The website is built using the R Shiny framework (38). Shiny is an R package that allows developers to build an interactive web interface directly in the R programming language. The web interface forwards data and requests from users to the SMRT method to perform data integration and clustering. Because of the efficiency of the SMRT method, the website is able to return results in minutes, even for datasets with hundreds of thousands of samples.
Analysis using the web application is simple and straightforward. Users can either upload expression data in .csv files or a single .rds file using the upload function on the left panel. Each data type is presented as a matrix in which rows represent samples and columns represent genes/features. SMRT can automatically determine the number of subtypes. It does not require any extra configuration or parameters to perform the analysis. See Supplementary Section 4 and Figures S6, S7 for a more detailed description of the web application.

RESULTS
To assess the performance of SMRT, we perform an extensive analysis using 39 cancer datasets and simulated data. First, we demonstrate that SMRT is able to identify cancer subtypes with significantly different survival profiles. Second, we provide an in-depth analysis of a Glioma dataset. Finally, we illustrate the scalability of SMRT by analyzing simulated datasets with hundreds of thousands of samples. We also provide a comparative analysis between subtypes discovered by SMRT and those of the PAM50 classifier on three breast cancer datasets (TCGA-BRCA, METABRIC_Discovery, and METABRIC_Validation) in Supplementary Section 7.

Experimental Studies Using 39 Cancer Datasets
In this article, we analyze 37 TCGA and 2 METABRIC datasets. For TCGA datasets, we downloaded the matched mRNA, DNA methylation, and miRNA expression data from the TCGA data portal. For the METABRIC datasets, we were able to obtain matched mRNA and copy number variation data from the European Genome-Phenome Archive. We also downloaded clinical data and survival information of each patient, which will be used to assess the performance of the subtyping methods. Supplementary Tables 1, 2 provide more details of the datasets.
Using each method, we partition the patients in each dataset and then assess the survival difference of the discovered patient groups using Cox regression (39). Overall survival data is used for the TCGA datasets, and disease-free survival data is used for the METABRIC datasets. Table 1 shows the Cox p-values obtained for each dataset and method (see Supplementary Section 9, Figures S10-S17 for the Kaplan-Meier survival curves of each dataset). There are seven datasets in which no method is able to identify subtypes with significant Cox p-values. For the remaining 32 datasets, SMRT has significant p-values in 28 datasets, whereas NEMO has significant p-values in 19 datasets and all other methods have significant p-values in 15 datasets or fewer. SMRT has the most significant p-value in 12 of those 28 datasets, while SNF, CIMLR, NEMO, moCluster, iClusterBayes, LRACluster, MCCA, and IntNMF have the most significant p-value in 0, 3, 8, 4, 2, 0, 1, and 2 datasets, respectively. Figure 2 shows the distributions of the Cox p-values on the -log10 scale. Overall, the median -log10 p-value of SMRT is close to 2 (i.e., a median p-value of 0.01), whereas the median -log10 p-value of the second-best method (NEMO) is close to 1 (i.e., a median p-value of 0.1). A Wilcoxon test also confirms that the p-values of SMRT are significantly smaller than the p-values obtained from the other methods (p = 0.0002, one-tailed Wilcoxon test).
The running time of each method is shown in Table 2. The first 39 rows show the running time of each method on each dataset, while the last row shows the average running time. On average, SMRT, SNF, NEMO, and MCCA are fast and able to finish each analysis in less than a minute. The remaining methods are slower, especially iClusterBayes and IntNMF, even though their analysis is limited to only the 2,000 most varied genes.
To reveal the contribution of each data type, we used SMRT to partition the patients using each data type independently. Next, we calculated the Cox p-values obtained from each data type and compared them with those obtained from subtyping the multi-omics data. Figure 3 shows the distribution of -log10 p-values of the subtypes obtained from each data type for the 37 TCGA datasets. The p-values obtained from multi-omics data are substantially more significant than those obtained from individual data types. The median p-value obtained from multi-omics data is close to 0.01 (-log10 value close to 2) while the median p-values of the individual data types are higher than 0.1 (-log10 values close to 1). This demonstrates that SMRT is able to exploit the complementary information available in each data type to determine subtypes with significant survival differences. Supplementary Section 10 and Table S15 provide more details on the contribution of individual data types in each dataset.
Next, we investigated the association between the discovered subtypes and clinical variables. We performed our analysis on gender, age, cancer stage, and tumor grade, which are available for at least 15 datasets. We perform the following analyses: (1) Fisher's exact test to assess the significance of the association between gender (male and female) and the discovered subtypes; (2) ANOVA to assess the age difference between discovered subtypes; and finally (3) calculation of the agreement between the discovered subtypes and known cancer stages and tumor grades using Normalized Mutual Information (NMI). The distributions of -log10 p-values for gender and age are shown in Supplementary Figure S8 (see Supplementary Tables 11-12 for the exact p-values). With the exception of NEMO and iClusterBayes, the clustering methods do not generally yield differences in gender or age in their clusterings. For gender, iClusterBayes has significant p-values in 17 out of 31 datasets.

Note for Table 1: Cells highlighted in yellow have significant Cox p-values at the threshold of 5%. Cells highlighted in green have the most significant Cox p-value in their respective rows. No method was able to yield subtypes with significantly different survival in 7 datasets (shown in red font). SMRT yields subtypes with significantly different survival profiles in 28 out of the 39 datasets. In 12 of those datasets, SMRT also yields p-values more significant than any of those provided by the other eight methods.
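For reference, the NMI between a discovered partition and a known clinical grouping can be computed as below (a from-scratch NumPy sketch using the arithmetic-mean normalization; library implementations may offer other normalization variants):

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two labelings (arithmetic-mean norm)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    cont = np.zeros((a.max() + 1, b.max() + 1))   # contingency table of label pairs
    for i, j in zip(a, b):
        cont[i, j] += 1
    p = cont / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)         # marginal distributions
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / (pa[:, None] * pb[None, :])[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum() # entropy of each labeling
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / ((ha + hb) / 2) if ha + hb > 0 else 1.0

subtypes = [0, 0, 0, 1, 1, 1, 2, 2]
stages   = [0, 0, 0, 1, 1, 1, 2, 2]   # identical grouping, so NMI = 1
random_g = [0, 1, 0, 1, 0, 1, 0, 1]   # unrelated grouping, so NMI is well below 1
```

An NMI of 1 indicates that the discovered subtypes reproduce the clinical grouping exactly (up to relabeling), while values near 0 indicate little agreement.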

Case Study of the GBMLGG Dataset
Here we perform an in-depth analysis of the GBMLGG (Glioma) dataset. Figure 4A shows the Kaplan-Meier survival analysis of the discovered subtypes. For this dataset, SMRT discovers three subtypes, one of which (group 2) has a very low survival rate: at year 3, the survival probability of patients in this group is only 26%, whereas that number for the patients in the other two subtypes (groups 1 and 3) is 84%. We also perform a variant analysis for the dataset in order to find mutations that occur frequently in the short-term-survival patient group (group 2) but not in the long-term-survival patient groups (groups 1 and 3), and vice versa. Figure 4B shows the mutations of each group, in which each point is a gene and its coordinates represent the number of patients that carry that mutation in the corresponding group. In principle, we want to investigate the mutated genes in the top left or bottom right of the figure. In this figure, we can easily identify four marker genes associated with GBMLGG disease: IDH1, TP53, PTEN, and EGFR. Among those, IDH1 mutation (bottom right) is known as a factor driving Low Grade Glioma (LGG) and has been used in the WHO classification system (40) to distinguish IDH-mutant from IDH-wildtype tumors, with the latter having a worse prognosis. On the other hand, EGFR mutation is not common in LGG but in GBM (Glioblastoma) (41), which has a very low survival rate (42). The amplification of EGFR can cause the mutation of the PTEN gene (43), which is a tumor suppressor gene (44). Interestingly, no patient in the long-term-survival group has a PTEN mutation. The occurrence of EGFR mutations may be another factor leading to the low survival rate of patients in the short-term-survival group.
We further conduct pathway analysis on the discovered subtypes using the Consensus Pathway Analysis platform (45) with the FGSEA method (46) and the KEGG pathway database. Supplementary Figure S4 shows the pathways that are significant at a threshold of 0.5%. In this network, each node is a pathway and there is an edge between two pathways if they have common genes. As shown in the figure, the Glioma pathway is significantly impacted. Other pathways that have common components with the Glioma pathway, including the MAPK signaling pathway, ErbB signaling pathway, Calcium signaling pathway, and Pathways in cancer, are also significantly impacted. This confirms that the subtypes discovered by SMRT have significant differences in the activity of Glioma- and cancer-related pathways. Supplementary Section 2 and Figures S1-S4 provide a more detailed analysis of this dataset.

Scalability of the Subtyping Methods
In order to assess the scalability of the nine subtyping methods, we generate a number of simulated datasets with a fixed number of genes/features (5,000) and varying numbers of samples (from 1,000 to 100,000). Each generated dataset contains three classes of samples, each with a different set of up-regulated genes. The true class information was used a posteriori to assess the accuracy of each clustering method. The memory of our server is limited to 376 GB. Figure 5 shows the running time of the methods with varying numbers of samples. The running time of SNF, CIMLR, NEMO, and moCluster grows at least quadratically with the sample size. These methods are not able to analyze datasets with more than 30,000 samples (they run out of memory, produce errors, or take more than 24 hours to analyze a single dataset). MCCA and LRACluster are able to analyze datasets with 50,000 samples but fail on larger datasets. Only SMRT is able to analyze all large datasets, including those with 100,000 samples. SMRT is much faster than the other methods and can analyze datasets with 100,000 samples in three minutes. See Supplementary Section 3, Figure S5, and Tables 4, 5 for details on the simulation and results.
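The simulation design can be sketched as follows (illustrative NumPy with smaller sizes; the block size and effect size here are hypothetical choices, not the exact parameters of the study):

```python
import numpy as np

def simulate(n_samples=300, n_genes=1000, n_classes=3, shift=2.0, seed=0):
    # background Gaussian noise plus a class-specific block of up-regulated genes
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_classes, size=n_samples)   # ground-truth classes
    X = rng.standard_normal((n_samples, n_genes))
    block = n_genes // (n_classes * 2)                 # genes up-regulated per class
    for c in range(n_classes):
        X[labels == c, c * block:(c + 1) * block] += shift
    return X, labels

X, y = simulate()
```

The true labels `y` are hidden from the clustering methods and used only afterwards to score the recovered partitions, matching the a posteriori evaluation described above.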

CONCLUSION
In this article, we introduced SMRT, a fast yet accurate method for data integration and subtype discovery. In an extensive analysis using 39 cancer datasets, we showed that SMRT outperformed other state-of-the-art methods in discovering novel subtypes with significantly different survival profiles. We also demonstrated that the method could accurately partition hundreds of thousands of samples in minutes with low memory requirements. At the same time, the provided web application will be extremely useful for life scientists who lack computational background or resources. Although the software was developed for the purpose of cancer subtyping, researchers in other fields can use the web application and R package for unsupervised learning and data integration.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: http://smrt.tinnguyen-lab.com/.

AUTHOR CONTRIBUTIONS
HN and TN conceived of and designed the approach. HN, DT, and BT implemented the method in R, performed the data analysis and computational experiments. MR, AC, and SDa helped with data preparation and some data analysis. HN, DT, SD, and TN wrote the manuscript. All authors reviewed and approved the manuscript.