Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data

Discovering cancer subtypes is useful for guiding clinical treatment of multiple cancers. Progressive profile technologies for tissue have accumulated diverse types of data. Based on these types of expression data, various computational methods have been proposed to predict cancer subtypes. It is crucial to study how to better integrate these multiple profiles of data. In this paper, we collect multiple profiles of data for five cancers on The Cancer Genome Atlas (TCGA). Then, we construct three similarity kernels for all patients of the same cancer by gene expression, miRNA expression and isoform expression data. We also propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), in order to integrate three similarity kernels into one combined kernel. Finally, we make use of spectral clustering on the integrated kernel to predict cancer subtypes. In the experimental results, the P-values from the Cox regression model and survival curve analysis can be used to evaluate the performance of predicted subtypes on three datasets. Our kernel fusion method, SKF, has outstanding performance compared with single kernel and other multiple kernel fusion strategies. It demonstrates that our method can accurately identify more accurate subtypes on various kinds of cancers. Our cancer subtype prediction method can identify essential genes and biomarkers for disease diagnosis and prognosis, and we also discuss the possible side effects of therapies and treatment.


INTRODUCTION
Cancer is a heterogeneous disease caused by chemical, physical, or genetic factors (Mager, 2006;Liu and Chu, 2014). The development of high-throughput genome analysis techniques on the research of cancer subtypes plays an important role in the analysis and clinical treatment of various kinds of cancers (Kruijf et al., 2013;Prat et al., 2015;Thanki et al., 2017). In recent years, much expression data, including genomes, transcriptome and epigenomes, has accumulated and been stored in various databases. The Cancer Genome Atlas (TCGA) (Katarzyna et al., 2015) is a largescale project including over 34 cancers and 15 expression data sets. We can conveniently obtain genome-scale molecular data, which contributes to the development of computational methods for discovering cancer subtypes.
Until now, massive computational methods were proposed to discover cancer subtypes. Some methods are based on single expression data, including gene expression data (Nguyen and Rocke, 2002;Brunet et al., 2004;Finnegan and Carey, 2007;Teschendorff et al., 2007) and copy number (Wong et al., 2012) and DNA methylation . Gao and Church (2005) employed sparse non-negative matrix factorization (SNMF) and gene expression data to identify subtypes of three cancers. Also, various kinds of expression data (Wei et al., 2017(Wei et al., , 2018a and several types of similarity strategies (Zeng et al., 2016;Ding et al., 2017a,b;Pan et al., 2017Pan et al., , 2018Guo F. et al., 2018;Song et al., 2018) can be applied in many other biological prediction problems.
Generally, we desire a comprehensive view of one disease with a cohort of patients. We cannot analyze just one kind of data, but must separately abstract information from different types of data (Xu et al., 2017). Therefore, many methods improve the robustness of clustering by focusing on data processing (Ren et al., 2015). Wang et al. (2014) proposed the Similarity Network Fusion (SNF) approach for accurately clustering caner subtypes. This method first collects three types of genome-wide data including gene, methylation and miRNA expression. Then, it constructs the networks of samples (e.g., patients) by using three types of expression data, and fuses these networks into one network by using SNF representing the full spectrum of underlying data. Finally, it employs spectral clustering on an integrated network to predict caner subtypes. Ma and Zhang (2017) developed an improved SNF, Affinity Network Fusion (ANF), to integrate multiple similarity networks. Xu et al. (2016) proposed Weighted Similarity Network Fusion (WSNF) to identify cancer subtypes. This method constructs similarity of patients by integrating associations between miRNA, mRNA, and transcription factors. It is applied to two cancer types to demonstrate performance.
Furthermore, the effective models of clustering that we usually use have strong data sensitivity, such as k-means and hierarchical clustering. Today, many clustering methods have been developed to identify cancer subtypes. Le et al. (2016) developed the SRF algorithm, which identifies subtypes by combining mutational and expression information. It diffuses mutation information over an interaction network on the basis of each sample and eliminates scale differences by applying a rank-based transformation based on mutation and expression data. Then, rank matrix factorization is used to jointly factorize the transformed data into a number of ranked factors, and the subtypes are defined as the combination of ranked factors. This method obtains excellent performance, but some of the patients cannot be identified. Shen et al. (2009) proposed the iCluster method, which is based on the Gaussian latent variable model, to discover caner subtypes. This method was tested on breast cancer and lung cancer by using copy number and gene expression data types. Speicher and Pfeifer (2015) pointed out that iCluster has high computational complexity and proposed a dimensionality reduction method to integrate multiple similarity kernels. This method is evaluated by using five cancer types. Ge et al. (2017) developed the Scluster method, which integrates different types of data and maps them into an effective low-dimensional subspace. First, Scluster uses adaptive sparse reduced-rank regression (S-rrr) to map the original data into the principal subspaces. Next, a fused patient-by-patient network is abstracted for these subgroups by a scaled exponential similarity kernel method. It can then obtain the cancer subtypes by spectral clustering.
In this paper, we first collect multiple profile data on The Cancer Genome Atlas (TCGA), including five cancers (lung cancer, kidney cancer, stomach cancer, breast cancer, and colon cancer) and their three types of expression data (gene expression, isoform expression, and miRNA expression). Then, we construct three similarity kernels for all patients of the same cancer by using the three types of expression data. We then propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), in order to integrate three similarity kernels into one combined kernel. Compared with SNF, SKF not only keeps the original information of each type of similarity kernel, but also gets rid of the noise in the integrated kernel. Finally, we make use of spectral clustering on the integrated kernel to predict cancer subtypes. To test the effectiveness and robustness of this novel approach, P-values from a Cox regression model and survival curve analysis can be used to evaluate the performance of our method on cancer subtype prediction. We compare the integrated kernel with the single kernel and other fusion methods, and also analyze the survival curve of the clinical data.

MATERIALS AND METHODS
In this paper, we first extract five cancer datasets from The Cancer Genome Atlas (TCGA). For a particular cancer, we construct three patient similarity kernels by using the expression data. Then, we combine these similarity kernels into one similarity kernel by using Similarity Kernel Fusion (SKF). Finally, we employ spectral clustering on the integrated kernel to divide all patients into multiple clusters. The flowchart of our method is shown in Figure 1.

Dataset
We collect five cancer datasets from the TCGA website, including stomach cancer, lung cancer, kidney cancer, breast cancer, and colon cancer. For each cancer, we extract three kinds of expression data respectively, including gene expression, miRNA expression, and isoform level. Our dataset is denoted as Dataset No.1 in this paper. In addition, we employ anther dataset to evaluate the performance of our method. The second dataset is provided in Wang et al. (2014), which includes lung cancer, kidney cancer, breast cancer, colon cancer, and glioblastoma multiforme (GBM). For each tumor, gene expression, methylation expression, and miRNA expression from TCGA are used to analyze cancer subtypes. We denote this dataset as Dataset No.2. Since genes could be categorized into multiple groups, we selected 18222 coding genes from

Similarity Kernel Construction
A special expression dataset is denoted as E ∈ R n×m , where m is the number of expression factors and n is the number of patients. We first normalize E by using Equation (1).
where x is an element of E, x ′ is corresponding elements of E after standardization, X is the mean of E and S is standard deviation of E. And, we denote normalized expression data as E ′ .
Based on the processed expression data E ′ , we construct similarity kernel K ∈ R m×m for patients. Here, the similarity between two patients is defined as Equation (2) Zhao et al., 2018a,b).
where K i,j is the similarity between i-th patient and j-th patient, e i ∈ R n×1 and e j ∈ R n×1 is i-th column and j-th column of E ′ , respectively.
Finally, we get three similarity kernels for a special disease, including similarity kernel K 1 ∈ R m×m by using gene expression, similarity kernel K 2 ∈ R m×m by using miRNA expression, and similarity kernel K 3 ∈ R m×m by using isoform expression.

Similarity Kernel Fusion
We constructed three similarity kernels for patients in the above section. We propose Similarity Kernel Fusion (SKF) to combine these kernels into one kernel K * ∈ R m×m . First, we construct two kernels P ∈ R m×m and S ∈ R m×m for each similarity kernel by using Equations (3, 4), where P is a normalized kernel and S is a sparse kernel that eliminates weak similarity.
where S satisfies m k=1 S(i, j) = 1; N i is a set of all neighbors of the i-th patient, including itself.
Second, we discover more information by using multiple iterations as Equation (5).
where P t l (l = 1, 2, 3) is the status of the l-th kernel after t iterations, α is a coefficient and satisfies α ∈ [0, 1], P 0 r (r = 1, 2, 3) represents the initial status of P r . After t + 1 iterations, the overall kernel can be computed as Equation (6).
Finally, based on the integrated kernel, we construct a weight matrix to eliminate noise in the integrated kernel as Equation (7).
where N i is a set of all neighbors of the i-th patient, including itself, and N j is a set of all neighbors of the j-th patient, including itself.
The final similarity kernel can be obtained as Equation (8).
where K * is the final integrated similarity kernel by using SKF.

Mining Subtypes Using Spectral Clustering
In this section, we employ spectral clustering (Ng et al., 2001) on the integrated similarity kernel to divide all patients into multiple clusters. Many previous studies, including CSPRV , Scluster (Ge et al., 2017), and SNF (Wang et al., 2014), have constructed similarity kernels for patients and used spectral clustering to discover cancer subtypes. These methods have achieved excellent performance by using spectral clustering. Additionally, Luxburg (2007) have pointed out that spectral clustering is effective in capturing the global structure of the graph. Therefore, we use spectral clustering to identify cancer subtypes. Then, we will introduce the processes of spectral clustering in detail. We define a matrix Y ∈ {0, 1} k×n to represent the result of a cluster, where Y(i, j) = 1 if patient p j belongs to ith cluster, otherwise Y(i, j) = 0. We also use Equation (9) as the optimal question to solve Y.
D is a diagonal matrix whose diagonal element is the sum of the row elements of K * .

RESULTS
In this section, we discuss the performance of our method in a variety of ways. First, we introduce an evaluation criteria and a verification method that are used to evaluate the performance significance of the cancer subtype predictions. Second, we analyze the performance of SKF with different parameters α on Dataset No.1 . Third, we discuss the performance of SKF on the three datasets. Fourth, we compare SKF with two other fusion methods on the three datasets. Finally, we analyze the survival probability curves of the predicted subtypes for four cancers.

Evaluation Criteria and Verification Method
In this paper, we employ the P-value from the Cox regression model to evaluate the performance of our method, where a lower P-value indicates higher significance for performance. When the P-value is less than 0.05, it is of significance to the performance of the model. When the P-value is less than 0.01, the performance of the model is highly significant. Here, we use 0.05 as the threshold for significance. The meaning of the P-value is significance in the difference of survival profiles between cancer subtypes. Moreover, we also use survival analysis to evaluate the performance of the clustering results. The survival curve represents the change in survival rate over time, and it is a monotone decreasing curve without any fluctuation. In the survival curve, we can find that different subtypes have different survival rates. We can analyze some subtypes that have a higher risk of death.

Parameter Selection for SKF
Particularly, α is an important parameter in the process of SKF. A lower α value represents keeping more initial information in the integrated kernel. A higher α value represents keeping more information after multiple iterations. In the three datasets, we take α from 0 to 1 with a step of 0.1 to find the optimal α for the five cancers. Results are shown in Figure 2, with the X axis representing the α value and the Y axis representing the − log 10 (P value ). A lower P-value is represented by a higher value of − log 10 (P value ). In Figure 2, the P-value maintains clear fluctuation in the range between 0 and 1. It demonstrates that SKF is sensitive to changes in α. We get the optimal P-value when α is equal to 1 for the four cancers except lung cancer on Dataset No.3. From the results of Dataset No.2, we can see that keeping more initial information is necessary for many of the datasets.

Performance of SKF in Difference Datasets
In this paper, we obtain Dataset No.1 from TCGA. For a specific disease, we extract all 60483 gene expression data points on Dataset No.1 . We employ the three datasets to evaluate the performance of SKF. For each dataset, we compare the performance of SKF with single kernel by using the optimal number of clusters. In Table 2, we can see that SKF achieves outstanding performance compared with single kernel in 12 cases. We also find that the same kernels with different numbers of clusters have different P-values. Therefore, we need to adjust the number of clusters to obtain optimal clustering results.

Comparing With Other Fusion Methods
Several multiple kernel fusion strategies have been developed, including similarity network fusion (SNF) (Wang et al., 2014) and unsupervised multiple kernel learning (UMKL) (Mariette and Villavialaneix, 2018). We compared the performance of SKF with these two strategies to find better subtypes for a particular cancer. We tested the three strategies on the three datasets to compare the performance of different fusion methods. All results are found in the Supplementary Table 1. The graphical results are shown in Figure 3, with the X axis representing the number of clusters and the Y axis representing the value of − log 10 (P value ). The blue lines represent the change of SKF, the red lines represent the change of SNF, the green lines represent the change of UMKL and the black dashed lines show the Pvalue equal to 0.05. In Figure 3, we find that SKF achieved a remarkable level of performance for the clustering of breast and colon cancer subtypes in the three datasets. Additionally, SKF achieved better performance than other kernel fusion strategies for the clustering of lung cancer subtypes in Datasets No.1 and No.3. We also found that SNF performed well for the clustering of kidney cancer subtypes in the three datasets and UMKL reached the best level of performance for the clustering of lung cancer subtypes in Dataset No.2 and stomach cancer subtypes in Dataset No.1. It demonstrates that SKF obtained a significant level performance for discovering subtypes of a particular cancer, and also that the cluster results can be used for guiding clinical treatment.

Survival Analysis
In this paper, we analyzed the performance of SKF based on six cancers, including breast, lung, kidney, colon, stomach, and GBM cancers. However, since the P-values for the clustering of kidney and GBM cancer subtypes were larger than 0.05, we showed survival probability curves for the four other cancers. We analyzed these cancer subtypes by using Dataset No.3 . In Figure 4, we find that subtype 3 for stomach cancer has a higher death rate. These patients with subtype 3 need more attention to be paid to them. The average survival time of subtype 2 for colon cancer is longer than the other subtypes. Similarly, subtype 3 for other cancers tends to be more aggressive than other subtypes. We also found that the average survival time for breast cancer and lung cancer are longer than for stomach and colon cancer. It demonstrates that the cluster results of SKF can be used to guide clinical treatment.

CONCLUSIONS
In this paper, we proposed an accurate model for predicting cancer subtypes. First, we extracted a novel dataset with three expression data types (gene expression, miRNA expression, and isoform expression) and five cancers (breast, lung, kidney, colon, and stomach cancers) from the TCGA website. Second, we constructed three similarity kernels by using the three types of expression data for each cancer. Then, we proposed Similarity Kernel Fusion (SKF) to integrate the three kernels into one combined kernel. Finally, we used spectral clustering on integrated kernel to discover cancer subtypes. We used an evaluation criteria (P-value) and a verification method (survival analysis) to evaluate the performance of SKF for the discovery of cancer subtypes. We compared SKF with single kernel and two kernel fusion strategies (SNF and UMKL) in three datasets. Results showed that SKF obtains a significant level of performance on P-value, and the survival curve of the subtypes was consistent with the clinical data. It demonstrates that SKF is an accurate computational tool for guiding clinical treatment.
Our method also has some limitations that require some attention. Since spectral clustering is a widely used and accepted cluster method, we are attempting to find an improved method to discover cancer subtypes more accurately. We will consider various machine learning methods and constructing kernel methods to predict cancer subtypes (Zeng et al., 2017;Ding et al., 2018;Zhang et al., 2018a,b,c;Zou et al., 2018). We also consider the potential possibility of developing computational models for cancer subtype identification based on microRNA information (Chen and Huang, 2017;Chen et al., , 2018aHu et al., 2018).

DATA AVAILABILITY STATEMENT
The results and codes for this study can be found at the following address: https://github.com/guofei-tju/Cancer-subtypes.