Comparison of the classifiers based on mRNA, microRNA and lncRNA expression and DNA methylation profiles for the tumor origin detection

Background Tumor tissue origin detection is of great importance in determining the appropriate course of treatment for cancer patients. Classifiers based on gene expression and DNA methylation profiles have been shown to predict the primary tumor site feasibly and reliably. However, few studies have compared the performance of classifiers built on these different profiles. Methods Using gene expression and DNA methylation profiles from The Cancer Genome Atlas (TCGA) project, eight machine learning methods were employed for tumor tissue origin detection. We then compared the predictive performance obtained with DNA methylation, mRNA, microRNA (miRNA) and long non-coding RNA (lncRNA) expression profiles. A statistical method was introduced to select the most informative CpG sites. Results We found that LASSO was the most predictive model across the various profiles. Further analyses indicated that the results derived from DNA methylation (overall accuracy: 97.77%) were better than those derived from mRNA expression (overall accuracy: 88.01%), miRNA expression (overall accuracy: 91.03%) and lncRNA expression (overall accuracy: 95.7%). Our results suggest that an overall accuracy >90% can be achieved using only 1,000 methylated CpG sites for prediction. Conclusion In this work, we comprehensively evaluated the performance of classifiers based on different profiles for tumor origin detection. Our findings demonstrate the effectiveness of DNA methylation as a biomarker for tracing tumor tissue origin using LASSO and neural network.


Introduction
Metastatic cancer of unknown primary (CUP) origin accounts for about 3%-5% of all cancer diagnoses (Pimiento et al., 2007). Patients with CUP often have a poor prognosis because of late diagnosis and, even worse, the tumor tissue origin of some patients may be misclassified. Despite the development of diagnostic workups, these show relatively little benefit (Hainsworth and Greco, 1993; Oien, 2009). It is therefore necessary to find new strategies to improve diagnostic certainty, and the ability to identify tumor tissue origin holds great promise for improving prognosis and treatment selection.
Molecular characterization is increasingly used for cancer therapy and offers great potential for tumor diagnosis (Tothill et al., 2005; Wang et al., 2015). Cancer classification based on expression profiles has been proposed as a clinical application for tumor tissue origin detection (Ramaswamy et al., 2001; Bloom et al., 2004; Staub et al., 2010). The rationale for using '-omics' data to define the site of origin of CUP is that tumors from different sites of origin have specific expression profiles (Pimiento et al., 2007; Sotiriou and Piccart, 2007; Xu et al., 2016; Zheng et al., 2018). More importantly, gene expression profiling enables the measurement of expression levels of thousands of genes in a single experiment. For example, an mRNA-based classifier was used to determine CUP origin and achieved an accuracy of 89% (Tothill et al., 2005). A 92-gene qRT-PCR assay has been developed to detect the site of origin of metastatic tumors (Ma et al., 2006). MicroRNAs regulate gene expression and show marked tissue specificity (Lagos-Quintana et al., 2002; Babak et al., 2004; Lu et al., 2005; Yang et al., 2017). The expression profiles of microRNAs have been determined in paraffin-embedded samples, and machine learning based classifiers achieved competitive performance (Rosenfeld et al., 2008; Varadhachary et al., 2011). DNA methylation is an epigenetic mechanism used by cells to control gene expression, which can fix genes in the "off" position (Ehrlich, 2002; Paz et al., 2003; Schubeler, 2015). Extensive DNA methylation perturbations have been widely explored in human cancer research (Moran et al., 2016; Hao et al., 2017; Kang et al., 2017; Shen et al., 2017; Stieglitz et al., 2017). These works suggested that DNA methylation might offer an additional route to tumor tissue origin detection.
To comprehensively evaluate the potential and limitations of utilizing different profiles, we performed tumor tissue origin detection using eight machine learning classification models (random forest, support vector machine, K-nearest neighbor, decision tree, linear discriminant analysis, LASSO, artificial neural network and naïve Bayesian classifier) and compared their predictive performance. The results reinforce the potential of DNA methylation as a biomarker for tumor tissue origin detection.

Data collection
Cancer gene expression (mRNA, miRNA and lncRNA) and DNA methylation profiles generated by The Cancer Genome Atlas (TCGA) project were downloaded via the cBioPortal for Cancer Genomics (Cerami et al., 2012). For TCGA gene expression and DNA methylation data, only level one data was employed in our analysis. RNA-SeqV2 was used, which takes transcript length into account and is suggested to provide more accurate results. This work only contains data from solid tumors, and data quality control was conducted. Each cancer type was required to have a sufficient number of samples (>100). The DNA methylation profiles were measured on the Infinium HumanMethylation450 platform, and we removed CpG sites with missing values in more than 30% of samples. The remaining missing values were imputed using the K-nearest neighbor method. We adopted the same cohort for each dataset, and a total of 6,738 tumor samples with mRNA, miRNA, lncRNA and DNA methylation profiles were collected, spanning 20 cancer types.
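The two methylation preprocessing steps described above (dropping CpG sites with more than 30% missing values, then K-nearest neighbor imputation) can be illustrated with a minimal pure-Python sketch. The function names, the use of None for missing values, the value k = 5 and the distance definition are assumptions made for illustration; the text does not state which implementation or settings were used.

```python
import math


def filter_sparse_sites(matrix, max_missing_fraction=0.30):
    """Return indices of CpG sites (columns) to keep.

    `matrix` is a list of samples, each a list of beta values, with None
    marking a missing value. A site is kept when its fraction of missing
    samples does not exceed `max_missing_fraction`.
    """
    n_samples = len(matrix)
    kept = []
    for j in range(len(matrix[0])):
        missing = sum(1 for row in matrix if row[j] is None)
        if missing / n_samples <= max_missing_fraction:
            kept.append(j)
    return kept


def knn_impute(matrix, k=5):
    """Fill each missing value with the mean of that site in the k nearest samples.

    Distance between two samples (an assumption of this sketch) is the
    root-mean-square difference over the sites observed in both.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [row[:] for row in matrix]  # leave the input untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples with an observed value at site j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
    return imputed
```

In practice a library imputer would normally be preferred; the sketch only makes the order of operations explicit: filter sparse sites first, then impute what remains.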

Classifiers construction
In this work, we employed eight machine learning classifiers for tumor tissue origin detection (Rosenfeld et al., 2008; Moran et al., 2016; Hao et al., 2017; Soh et al., 2017; Tang et al., 2017). These methods differ in their underlying methodology, and detailed descriptions of the models appear below. All models were implemented using Python packages (v3.9).
Random forest (RF) is an ensemble learning algorithm for classification that is built on a multitude of decision trees. Each tree in the forest is built from a sample set drawn from the training set with replacement. Each feature used to split an internal node in a decision tree is picked from a random subset of the entire feature set. We used the soft voting strategy, i.e., the probabilities assigned to each class are calculated by averaging the outputs of the decision trees. The number of trees was set to 200.
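The soft voting rule can be made concrete with a small sketch: each tree reports a probability per class, the forest averages them, and the class with the largest mean probability is predicted. The function name and the toy inputs are illustrative assumptions, not the authors' code.

```python
def soft_vote(tree_probabilities):
    """Average per-class probabilities over trees.

    `tree_probabilities` holds one probability vector per tree.
    Returns (averaged probabilities, index of the predicted class).
    """
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [
        sum(tree[c] for tree in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted
```

With three trees voting [0.6, 0.4], [0.2, 0.8] and [0.4, 0.6], the averaged probabilities are about [0.4, 0.6], so the second class is predicted even though one tree preferred the first.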
Support vector machine (SVM) classifier finds a hyperplane that separates two classes by maximizing the distance between the hyperplane and the support vectors, which are defined as the samples closest to the hyperplane. For classes that are not linearly separable, SVM can classify the samples by mapping the points into a higher-dimensional space. In this work, the linear kernel was used. For multi-class classification, we implemented the 'One-vs-Rest' (OvR) approach.
K-nearest neighbor (KNN) is a non-parametric method for classification. KNN simply stores all the samples in the training set. Each time a test sample is given, KNN calculates the distances between the sample and all the data points in the training set. The test sample is assigned to the class most common among its K nearest neighbors. Here, we used Euclidean distance and set K to 5.
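A minimal sketch of this procedure, using Euclidean distance and K = 5 as stated above, is given below. The function name and the toy data layout are assumptions for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Sort (distance, label) pairs so the closest samples come first.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction scans the whole training set, the cost per query grows with the number of stored samples, which is one reason dimensionality reduction is useful before applying KNN to methylation data.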
Decision trees (DT) are tree-like models used for classification. Each internal node within a decision tree represents a classification rule and each leaf node represents a class label. We built Classification and Regression Trees (CART), which choose features by minimizing the Gini index at each node.
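The Gini index that CART minimizes can be sketched as follows. The helper names are hypothetical, and both functions assume non-empty label lists.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )
```

A split that separates the classes perfectly has a weighted Gini index of 0, whereas a split that leaves both children evenly mixed between two classes scores 0.5; CART picks the feature and threshold with the lowest value.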
Least Absolute Shrinkage and Selection Operator (LASSO) is a linear classification model that uses an L1-regularization strategy in parameter estimation. The probabilities of each class are calculated via the logistic function. To avoid overfitting, the L1 norm of the coefficients was added to the loss function, and the coefficients were estimated by minimizing this loss function.
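For reference, the L1-penalized logistic loss described here can be written in a standard form. The notation is an assumption of this sketch (N training samples, feature vector x_n, binary label y_n for one class against the rest, intercept \beta_0, coefficients \beta, regularization strength \lambda); the text does not state the value of \lambda or how the multi-class case was handled.

```latex
p(y_n = 1 \mid x_n) = \frac{1}{1 + \exp\!\left(-(\beta_0 + x_n^{\top}\beta)\right)}

(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
  \; -\frac{1}{N}\sum_{n=1}^{N}\Big[\, y_n \log p(y_n = 1 \mid x_n)
  + (1 - y_n)\log\big(1 - p(y_n = 1 \mid x_n)\big) \Big]
  + \lambda \lVert \beta \rVert_1
```

The L1 term drives many coefficients exactly to zero, so the fitted model also acts as a feature selector, which is useful when the number of features far exceeds the number of samples.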
Neural network (NN) is based on a collection of connected nodes called neurons. Each neuron receives the input signals of other neurons through weighted connections and produces an output through an activation function. If the weighted sum of the input signals exceeds a cutoff, the neuron is activated and outputs a non-zero value. Here we used the rectified linear unit (ReLU) as the activation function. We used a multi-layer feed-forward neural network, in which every neuron is connected to the neurons of the next layer and neurons within a layer are not connected with each other. The network was trained using the error backpropagation (BP) algorithm.
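The forward pass of such a feed-forward network with ReLU activations can be sketched in a few lines. The layer layout (a list of weight and bias pairs) and the choice to leave the final layer without ReLU are assumptions of this sketch; the text does not give the number or size of the layers, and training by backpropagation is not shown.

```python
def relu(values):
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the incoming weights of neuron j."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate `inputs` through `layers`, a list of (weights, biases) pairs.

    ReLU is applied after every layer except the last, which returns raw scores.
    """
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = relu(activations)
    return activations
```

For a classifier, the raw scores of the last layer would typically be passed through a softmax to obtain class probabilities.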
Naïve Bayesian classifier (NBC) is a classifier based on Bayes' theorem and the assumption of independence among all features. Assuming that all features are independent given the class, their joint distribution equals the product of the marginal distributions. The probabilities assigned to each class are calculated through Bayes' theorem, and the predicted class is the class with the largest probability value.
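In symbols, and using notation assumed here for illustration (class c, features x_1, ..., x_p), the independence assumption and the resulting decision rule take the standard form:

```latex
P(c \mid x_1, \dots, x_p) \;\propto\; P(c)\,\prod_{j=1}^{p} P(x_j \mid c),
\qquad
\hat{c} = \arg\max_{c}\; P(c)\,\prod_{j=1}^{p} P(x_j \mid c)
```

The form of each per-feature distribution P(x_j | c), for example Gaussian for continuous expression or methylation values, is a modelling choice that the text does not specify.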
Linear discriminant analysis (LDA) is a linear model used in classification. Given a dataset with two classes, LDA projects all the sample points onto a line, trying to maximize the distance between the centers of the two classes and minimize the dispersion of points within the same class. The covariance matrix is used to measure the dispersion within a class. Here, we took the 'One-vs-Rest' (OvR) strategy to construct a multi-class classifier.
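The two-class objective described above is commonly expressed as the Fisher criterion. The symbols are assumptions of this sketch: w is the projection direction, \mu_1 and \mu_2 are the class means, and \Sigma_1 and \Sigma_2 are the class covariance matrices.

```latex
J(w) = \frac{\left( w^{\top}(\mu_1 - \mu_2) \right)^{2}}
            {w^{\top}(\Sigma_1 + \Sigma_2)\, w},
\qquad
w^{*} \;\propto\; (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)
```

The numerator grows as the projected class centers move apart and the denominator grows with the within-class dispersion, so maximizing J(w) balances the two goals named in the text.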

Performance evaluation
We compared the predictive performance of these models by their overall accuracy. Overall accuracy measures how often a machine learning model correctly predicts the outcome. We calculated the overall accuracy by dividing the number of correct predictions by the total number of predictions. To further evaluate our models, 5-fold cross-validation was performed. Briefly, we randomly divided the data into five sets of approximately equal size, and used four of the five sets as the training set and the remaining set as the testing set to identify the positives and negatives. We considered precision and recall for a specific cancer type i:

$$\mathrm{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{recall}_i = \frac{TP_i}{TP_i + FN_i}$$

where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for cancer type i.
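The evaluation quantities named in this section can be sketched in plain Python. The function names and the fixed random seed are illustrative assumptions; precision and recall follow their standard definitions (true positives over predicted positives, and true positives over actual positives).

```python
import random


def overall_accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, cancer_type):
    """Precision and recall for one cancer type, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cancer_type and p == cancer_type)
    fp = sum(1 for t, p in pairs if t != cancer_type and p == cancer_type)
    fn = sum(1 for t, p in pairs if t == cancer_type and p != cancer_type)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]
```

Each fold serves once as the testing set while the other four are used for training, so every sample is predicted exactly once across the five rounds.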

Dimensionality reduction
Due to the high dimensionality of DNA methylation profiles, a dimensionality reduction step is necessary before classifier construction. Principal component analysis (PCA) is a statistical procedure that reduces the dimensionality of a dataset with a large number of interrelated variables by creating a new set of variables called principal components. The projection of the data with the greatest variance lies on the first coordinate (w1), the second greatest variance on the second coordinate (w2), and so on. The principal components were selected based on the cumulative percentage of total variation. We selected the number of principal components that together explain more than 95% of the variance.
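The selection rule (keep the smallest number of leading components whose cumulative share of variance exceeds 95%) can be sketched as below. The function name is hypothetical, and the input is assumed to be the per-component variances already sorted in decreasing order, as a PCA routine would return them.

```python
def components_for_variance(explained_variances, threshold=0.95):
    """Smallest number of leading components whose cumulative variance share
    exceeds `threshold`. `explained_variances` must be sorted in decreasing order.
    """
    total = sum(explained_variances)
    cumulative = 0.0
    for k, variance in enumerate(explained_variances, start=1):
        cumulative += variance
        if cumulative / total > threshold:
            return k
    return len(explained_variances)
```

For variances of 6.0, 3.0, 0.8 and 0.2, the cumulative shares are 60%, 90% and 98%, so three components are retained.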

Feature selection using DNA methylation
First, we identified tissue-specific DNA methylated sites to reduce the considerable redundancy of the original data. We tested the differential methylation (β values) of CpGs for each cancer type compared with the other cancers using Student's t-test with a threshold of FDR < 0.01. Next, a recently proposed feature selection method, named Maximum-F-statistic-Maximum-Distance (MFMD), was employed to further detect the tissue-specific CpG sites. Briefly, we used analysis of variance (ANOVA) to compare the DNA methylation levels among cancer types. In ANOVA, the F-statistic is the ratio of the variance among the group means to the variance within the samples, and it is used here to measure the difference among cancers. Euclidean distance (ED) was used to measure the data redundancy. The MFMD criterion combines these two quantities, where the variables w_s (0 < w_s ≤ 1) and w_d (0 < w_d ≤ 1) are the weights of the F-statistic and the distance, respectively. We ranked the CpG sites according to their MFMD values. The final feature set should combine the highest F-statistic values with the largest ED values (that is, the least redundancy), in line with the maximum-distance criterion. Then, the top-ranked CpG sites were selected as features to construct classifiers and evaluate the classification accuracy. The top-ranked CpG sites with the highest accuracy were selected as the final features.
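The two ingredients of the MFMD score, the one-way ANOVA F-statistic of a CpG site across cancer types and the Euclidean distance between the methylation profiles of two CpG sites, can be sketched as follows. The function names are hypothetical, and the way the two terms are weighted by w_s and w_d is deliberately not reproduced because the exact formula is not given in the text.

```python
import math


def f_statistic(groups):
    """One-way ANOVA F-statistic for a single CpG site.

    `groups` holds one list of beta values per cancer type. F is the
    between-group mean square divided by the within-group mean square.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    ) / (k - 1)
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (n - k)
    return between / within


def euclidean_distance(site_a, site_b):
    """Euclidean distance between two CpG sites' profiles across the same samples."""
    return math.dist(site_a, site_b)
```

A site with a large F-statistic separates the cancer types well, and a large distance to the already selected sites indicates that it carries information they do not, so both quantities favor informative, non-redundant markers.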

Results
Comparison of the classifiers using mRNA, miRNA, lncRNA and DNA methylation profiles for the tumor tissue origin prediction.
Gene expression profiles (mRNA, miRNA and lncRNA) and DNA methylation profiles were obtained from the TCGA cohort. After a strict review of these four types of datasets, tumor samples spanning 20 cancer types were collected (Table 1). The samples were randomly divided into two equal parts (a training cohort and a testing cohort). For the gene expression profiles, the expression values (FPKM values) of all genes were used, and a total of 12,692 mRNAs, 1,240 miRNAs and 5,642 lncRNAs were included. For the DNA methylation profiles, we adopted a feature selection step to select tissue-specific CpG methylation because of the high dimensionality. A total of 120,106 differentially methylated CpG sites were detected, which were distributed across the entire human genome. Then, the optimal number of principal components was determined using PCA (cumulative percentage of total variation >95%). As an outcome of the dimensionality reduction process, the machine learning models were developed using 2,974 components.
We used eight machine learning algorithms to train classifiers (see Materials and Methods). Figure 1 summarizes the overall accuracy of each classifier. A comparison of the results showed that most of the classifiers achieved good performance (>80%), among which LASSO was the most predictive model with the highest overall accuracy (Table 2). Consistent with previous work (Lu et al., 2005; Ma et al., 2006; Li et al., 2007; Elias et al., 2017), the expression-based classifiers achieved competitive performance, with overall accuracies of 88.01% (mRNA-based) and 91.03% (miRNA-based). We demonstrated for the first time that lncRNA-based profiles also achieve competitive performance (overall accuracy 95.7%). Our work indicated that DNA methylation-based classifiers (overall accuracy 97.77%) perform better than the gene expression-based classifiers.
Since the overall accuracy cannot tell us how well each cancer type is classified, a 5-fold cross-validation was performed in the testing dataset. The results indicated that the classifiers do not classify all cancer types equally well (Table 2). The precision and recall values are nevertheless generally high. With the exception of esophageal carcinoma (precision: 85.19%, recall: 77.06%) and stomach adenocarcinoma (precision: 86.27%, recall: 83.54%), all other cancer types have precision and recall values larger than 90% in the testing set. Notably, the precision and recall values reach 100% for pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, thyroid carcinoma and prostate adenocarcinoma.

Performance of classifiers using small number of CpG markers
We next attempted to determine whether tumor tissue origin can be predicted using a small number of CpG markers. Selecting truly tumor tissue-specific features is important for constructing classifiers that perform well at predicting tumor origin sites. To this end, we used the Maximum-F-statistic-Maximum-Distance (MFMD) method to measure the tumor tissue specificity and redundancy of CpG sites. The feature candidates were ranked by MFMD score, and the top-ranked features were used to construct classifiers and evaluate the classification accuracy. Figure 2 summarizes the performance of the classifiers as a function of the number of CpG markers. The overall accuracy of the classifiers increased sharply at the initial stage, when the number of CpG sites is small, and the gains diminished as additional CpG sites were included. With 1,000 CpG sites (Supplementary Table S1), an overall accuracy >90% was achieved and the overall accuracy of the classifier started to level off. This result indicated that competitive performance can be achieved using a small number of CpG markers. We further examined the genomic locations of these CpG sites and found that most of them are located in introns (45.7%) and promoters (21.4%).

Discussion
Gene expression and DNA methylation profiles have become the basis for diagnosis and prognosis prediction, and are important for the detection of tumor tissue origin (Moran et al., 2016; Hao et al., 2017; Rani et al., 2017). With the emergence of high-throughput technologies, a large amount of data has been generated, which provides a great opportunity to improve the prediction of tumor tissue origin. The goal of this work was to explore the potential and limitations of utilizing different profiles as a cancer diagnostic tool.
Gene expression signatures of mRNA and microRNA expression levels have been used for tumor tissue origin detection (Tothill et al., 2005; Sotiriou and Piccart, 2007; Rosenfeld et al., 2008; Elias et al., 2017). DNA methylation, microRNA and lncRNA are important classes of regulatory mechanisms and are central to numerous biological processes. The comparison of eight benchmark machine learning based classifiers demonstrated that the LASSO model is the best choice, reaching an overall accuracy of >90% in 20 cancer types. Because using a large number of features may cause some degree of overfitting of the classifier, we divided the TCGA data into a training set and a testing set. The 5-fold cross-validation further indicated that our prediction has high precision and recall values for each cancer type. Comparison of the performance of classifiers based on different profiles is necessary. The classifiers based on DNA methylation, mRNA, miRNA and lncRNA profiles all demonstrated promising results in predicting tumor tissue origin. Here, we tried to quantify the performance of mRNA-based, miRNA-based, lncRNA-based and DNA methylation-based profiles in identifying cancer type. DNA methylation-based classifiers performed better than mRNA-, microRNA- and lncRNA-based classifiers. We also demonstrated for the first time that the expression profiles of lncRNAs can accurately identify tumor tissue origin. However, because archival formalin-fixed paraffin-embedded (FFPE) samples are an important source of tumor material, the application of lncRNA-based classifiers is limited by the instability of lncRNA in FFPE samples. DNA methylation has several features that make it an attractive diagnostic biomarker. First, DNA methylation shows better stability and can largely maintain its methylated status in archival FFPE samples. Second, DNA methylation shows marked tissue specificity and plays a key role in embryonic development. In this regard, differentially methylated CpG sites would be enriched for tissue-specific markers and would provide a starting point for the development of tumor tissue origin classifiers.

Conclusion
Taken together, we have shown that the LASSO classifier can efficiently predict tumor tissue origin based on DNA methylation profiles. Moreover, the performance of DNA methylation-based classifiers is better than that of gene expression-based classifiers. Our results demonstrate the effectiveness of DNA methylation profiles as biomarkers for the prediction of tumor tissue origin.

FIGURE 1
Comparison of the performance of different classifiers. Performance is estimated based on overall accuracy derived from multi-class classification tasks.

TABLE 1
Cancer types and their respective sample size.

TABLE 2
Precision and recall of each of the 20 cancer types using the LASSO model.