ORIGINAL RESEARCH article

Front. Genet., 20 November 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.585029

A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations

  • 1. College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China

  • 2. College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, China

  • 3. Geneis (Beijing) Co., Ltd., Beijing, China

  • 4. School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou, China

Abstract

The discovery of cancer of unknown primary (CUP) is of great significance in designing more effective treatments and improving the diagnostic efficiency in cancer patients. In the study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after feature engineering and model evaluation. Based on a copy number variation data consisting of 4,566 training cases and 1,262 independent validation cases, an XGBoost classifier is applied to 10 types of cancer. Extremely randomized tree (Extra tree) is used for dimension reduction so that fewer variables replace the original high-dimensional variables. Features with top 300 weights are selected and principal component analysis is applied to eliminate noise. We find that XGBoost classifier achieves the highest overall accuracy of 0.8913 in the 10-fold cross-validation for training samples and 0.7421 on independent validation datasets for predicting tumor tissue of origin. Furthermore, by contrasting various performance indices, such as precision and recall rate, the experimental results show that XGBoost classifier significantly improves the classification performance of various tumors with less prediction error, as compared to other classifiers, such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and Adaboost. Our method can infer tissue of origin for the 10 cancer types with acceptable accuracy in both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of specific cancer.

Introduction

Recent advances in molecular biology, e.g., genomics, proteomics, and metabolics, have resulted in a more accurate and specific prediction of tumor response to treatment, as well as trends in metastasis recurrence and prognosis. However, traditional detection methods, e.g., clinical, impact, and pathological examination, can only determine 50–80% of patients of metastasis cancer, while the remaining 20–50% of patients still cannot be determined (Chen et al., 2017) as more effective methods of diagnosis are required. In the metastasis of cancer, tumor cells are carried from the primary site to lymphatics, blood vessels, or other sites to continue to grow and form the same type of tumor. Biochemical biopsy of micrometastasis may lead to partial diagnosis and chaos due to the instability of biochemical indicators. Identifying the type and origin of cancer is important to determine the most appropriate treatment for cancer patients. In practice, errors in the uncertainties will become bigger, resulting in the error of diagnosis. The molecular expression profile of tumor cells in the metastatic focus is more similar to that in the primary site but different from that in the metastatic site, suggesting that we can trace the tumor origin according to the molecular expression profile of tumor cells in the metastatic site.

Many studies have attempted to use cancer biomarkers to predict the locations of primary tumors in CUP such as gene expression, miRNA and DNA methylation (Talantov et al., 2006; Clavell et al., 2008; Staub et al., 2010; Søkilde et al., 2014; Tang et al., 2017; Grewal et al., 2019). Gene expression patterns in tumors were the most widely used biomarkers for tumor classification and have achieved higher accuracy with machine learning algorithms. For instance, the support vector machines (SVMs) have been utilized as a multiclass classifier for the expression levels of 16,063 genes, achieving 78% classification accuracy (Ramaswamy et al., 2001). Seventy-nine optimal gene markers were selected and subsequently used as features for training of SVM classifier, achieving 89% classification accuracy with 13 classes (Tothill et al., 2005). SVM-RFE approach has been applied to select 154 top genes and classified the 22 common tumor types on the pan-cancer transcriptome database, obtaining an overall accuracy of 96.5% for training set and 97.1% accuracy for independent test set consisting of 9,626 primary tumors (Xu et al., 2016). Other machine learning classifiers have been also carried out to identify tissues of origin. For example, K-nearest neighbors (KNN) classifier has been used for 39 cancers including 92 genes, achieving 84% accuracy in cross-validation and 82% in the independent set of 112 samples (Ma et al., 2006). Neural networks have been applied to complementary DNA (cDNA) and oligonucleotide data consisting of training sets and test sets independently, achieving a mean accuracy of 83% (Bloom et al., 2004). Random forest classifiers have been utilized on publicly available somatic mutation data in the COSMIC database to train using leave-one-out cross-validation and achieved over 80% accuracy (Marquard et al., 2016). Random forest has been also used to establish classifiers for 38 kinds of tumors to methylation data and obtained 87% accuracy in the test set. A least absolute shrinkage and selection operator has been proposed using cross-validated algorithm, which achieved an overall accuracy of 85% (Søkilde et al., 2014). The artificial bee colony (ABC) and the particle swarm optimization have been carried out on the brain lower grade glioma data. The highest classification accuracy was 99.1% by the ABC algorithm (Bhowmick et al., 2019). Although these methods have achieved promising results, they had the disadvantages of lower classification accuracy on independent sets with more features.

Copy number variations (CNVs) usually refer to the individual difference in continuous DNA fragments with length ranging from several thousand to several trillion base pairs in genome. It can affect the individual phenotype by changing the number of gene copies, thus affecting the expression dose, disrupting the coding region structure of gene, changing the position or length of gene regulatory sequence, exposing recessive mutation, etc. (Redon et al., 2006; Poduri et al., 2013). So far, few studies have explored the effects of copy number variations on tissue-of-origin by machine learning algorithms. Zhang et al. (2015) have executed a comprehensive genome-wide analysis of 23,082 CNVs in 3,480 cancer patients, with six cancer types executed and 19 discriminative genes for tumor classification selected by minimum redundancy and minimum relevancy (mRMR) as well as incremental feature selection (IFS) methods. The overall prediction accuracy was about 75% in 10-fold cross-validation (Zhang et al., 2015).

In this research, an efficient multiclass model of XGBoost is established to assist in the identification of tumor origin for the entire data group from 10 types of tumors based on copy number variations. Extremely randomized tree (Extra tree) is used to design an algorithm to select fewer variables replacing the original high-dimensional variables for training. Principal component analysis (PCA) is subsequently applied to feature extraction for eliminating noise. Our method is compared to algorithms, such as SVM, KNN, Bayes, and Adaboost in terms of its multiclass classification performance. Experimental results indicated that XGBoost classifier can achieve an overall higher accuracy of more than 0.8913 for all training dataset and 0.7421 for independent dataset with 300 features, significantly improving the classification performance of other classifiers. Refer to Figure 1 for the flow chart of our method.

FIGURE 1

This paper is organized as follows. A detailed description of CNV datasets are given first, then feature selection method and XGBoost classifier are introduced in section “Materials and Methods.” In section “Result and Discussion,” five classifiers including SVM, KNN, Bayes, Adaboost, and XGBoost are used to assess the prediction performance. The conclusions and discussion of the work are given in section “Conclusion.”

Materials and Methods

Data Preparation

The study focuses on the copy number value of 10 tumor datasets extracted from The Cancer Genome Atlas (TCGA). The copy number signal is produced by Affymetrix SNP 6.0 arrays for the set of samples in each TCGA study. These datasets are from primary solid tumor samples released by MSKCC in 2013, which are available from TCGA. The original data consist of 9,743 samples in total1. Each sample has 24,174 genes with discrete copy number value denoted as “−2,” “−1,” “0,” “1,” “2,” where “−2” is homozygous deletion, “−1” is heterozygous loss, “0” is diploid, “1” is one copy gain, and “2” is high-level amplification or multiple-copy gain. To evaluate the performance of CNVs_origin, we perform the experiments on another copy number datasets, which can be downloaded from http://gdac.broadinstitute.org/ released by TCGA in 2016 (Liang et al., 2017, 2020).

We carry out data preprocessing on all datasets. In the first step, genes with missing values in the samples are deleted. Then, genes contained in both MSKCC and TCGA datasets are selected. Finally, the TCGA samples that exist in MSKCC datasets are removed. After preprocessing, the MSKCC datasets have 4,566 samples and 19,895 genes, which are used as training datasets. The TCGA datasets have 1,262 samples and 19,895 genes as independent datasets. The latter includes 112 BLCA samples, 235 BRCA samples, 170 COADREAD samples, 28 GBM samples, 216 HNSC samples, 41 KIRC samples, 120 LUAD samples, 212 LUSC samples, 22 OV samples, and 106 UCEC samples. For details of all tissue samples of 10 cancers including tumor status and sample sizes after filter, refer to Table 1.

TABLE 1

Primary siteAbbreviationSample numbers of training datasetSample numbers of independent dataset
Bladder urothelial carcinomaBLCA135112
Breast invasive carcinomaBRCA847235
Colorectal adenocarcinomaCOADREAD575170
Glioblastoma multiformeGBM56328
Head and neck squamous cell carcinomaHNSC306216
Kidney renal clear cell carcinomaKIRC49041
Lung adenocarcinomaLUAD356120
Lung squamous cell carcinomaLUSC289212
Ovarian serous cystadenocarcinomaOV56222
Uterine corpus endometrial carcinomaUCEC443106

Description of the datasets.

Feature Selection

Each sample contains 19,895 variables whose large number may cause overfitting. As such, we carry out feature selection to improve the generalization ability of classifiers and reduce the time to train the classifier. In the research, Extra tree (Geurts et al., 2006; Geurts and Louppe, 2011) is used for feature selection.

A Wrapper feature selection method based on Extra tree is proposed. Similar idea has been proposed by some researchers utilizing random forest method (Genuer et al., 2010; Verikas et al., 2011). First, variable importance measure is calculated by Extra tree. The variable importance measure of a feature is defined as the average reduction in the classification accuracy after slight random disturbance and before disturbance of the feature out-of-bag (OOB) (Verikas et al., 2011). Then, the sequence backward search is utilized to rank all features according to their importance measures. The feature with the least score is removed from the feature set. The process is iterated until the end of the algorithm. The feature set with the highest classification accuracy and minimum number of variables is selected as the final feature set.

The steps in feature selection using Extra tree are given as follows. Refer to Figure 2 for the flow chart depicting exact process of feature selection.

FIGURE 2

Input: Original dataset with all features and samples.

Output: All features with their ranks and corresponding scores from the most important to the least.

Step 1. The original dataset is randomly divided into k equal parts. We suppose it contained m features. k-1 of them are used to train an Extra tree, and the remaining one is used as test dataset. Therefore, we have k extreme random trees R1,R2,⋯,Rk in total.

Step 2. For each Extra tree, a sequential backward selection method is used to sort the m features from the most important to the least. We delete the lowest ranking feature and recorded the average classification accuracy of Ri.

Step 3. m iterations are carried out in Step 2, and Ri with the highest classification accuracy is selected as the classification accuracy of the ith random tree, and its feature set is obtained.

Step 4. For the k random trees, the extreme random tree with the highest classification accuracy is used as the final feature selection.

After feature selection, we obtain and rank all features according to their scores from the most important to the least; then, we use principal component analysis (PCA) (Martinez and Kak, 2001; Yang et al., 2004) to perform feature extraction for further analysis.

XGBoost Classifier

XGBoost classifier is a gradient boosting method incorporating the regression tree (Chen and Guestrin, 2016). It uses the combination of weak learners to create a single strong learner. The idea is to continuously add tree and continuously perform feature splitting to grow a tree.

Defining the function as follows.

in which is the predicted label of ith sample, xi represents the ith sample, T represents the total number of trees, fk represents the kth tree model, and q represents the structure of each tree that maps an example to the corresponding leaf index. The objective function of XGBoost classifier is defined as , where . There are two terms in the objective function. The first term is the loss function measuring the difference between the predicted value and the real value. The second term is the regularization term. T and w refer to the number of leaf nodes and the weight of leaf nodes, respectively. γ controls the number of leaf nodes, and λ is used to prevent overfitting. Each time when a tree is added, it automatically learns a new function to fit the residuals arising from the last prediction. If we obtain k trees after training, it is necessary to sum the scores corresponding to each tree to get the predicted value of a sample. In the research, we choose XGBoost classifier, the maximum depth of which is nine, learning rate γ is 0.1, and λ is 0.3 by 10-fold cross-validation as the model for numerical experiments.

Results and Discussion

CNVs Performance on Training Datasets

For the training dataset, first, we choose the proper classifier and feature numbers using 10-fold cross-validation with respective overall prediction accuracy. Figure 3 shows the curves of five classifiers changing with dimensions of the training samples. It is found that XGBoost classifier achieves the best result compared with four other classifiers. The overall accuracy changes from 0.8499 to 0.8999 vs. the feature number from 100 to 1,000. The best prediction accuracy is 0.8999 for 700, 800 features of XGBoost classifier. The overall accuracy is 0.8913 using 300 or 400 genes as features for the training datasets. The top 300 features with relatively high between-group variance are more likely to contribute to cancer classification, so we select the top 300 features for training and testing finally.

FIGURE 3

In this research, the metrics to assess the efficacy of our model included true positives, true negatives, false positives, and false negatives. We also used recall rate, precision, F1 score, and overall prediction accuracy of each cancer for assessing the multiclass predictive performance (Zhao and Zhang, 2020). The recall rate intuitively represents the ability of the classifier to correctly identify all positive cases. Precision is defined as the proportion of the true positives out of all the positive results (both true and false positives). F1 score is the harmonic mean of precision and recall. Accuracy is defined as the ratio (true positive + true negative)/(total number of cases) and calculated for the entire cohort.

Table 2 lists the classification accuracy for each cancer with XGBoost classifier on training datasets via 10-fold cross-validation. Further based on the values of the recall and precision obtained, we plot a receiver operating characteristic curve (ROC) to describe the performance of classification models. The ROCs of these datasets are shown in Figure 4. We are interested in the area under an ROC curve, denoted by AUC, which is another commonly used evaluation criterion. In these experiments, our proposed model achieves fine results. The minimum AUC is 0.84, and microaverage AUC is 0.96.

TABLE 2

DatasetPrecision
BLCA0.9611
BRCA0.7852
COADREAD0.8716
GBM0.9101
HNSC0.9378
KIRC0.9776
LUAD0.8928
LUSC0.9453
OV0.8052
UCEC0.7988

Classification accuracy for each cancer with XGBoost classifier on training datasets via 10-fold cross-validation.

FIGURE 4

Performance Comparison on Independent Datasets

Furthermore, other classifiers, e.g., KNN, SVM, Bayes, and Adaboost, are used to compare with our model with the same benchmark datasets using the same 300 features. In our experiment, we compare the performance of the algorithm to other classical classifiers with the same benchmark datasets. We set the parameters of XGBoost classifier with the learning_rate γ = 0.1, max_depth = 9, andλ = 0.3. The number of categories in KNN is chosen to be equal to five. The parameter n_estimators for Adaboost classifier are 200. For SVM, linear kernel function is chosen with an optional constant C = 0.01.

Table 3 gives the performance for independent validation datasets. Clearly, XGBoost classifier performs much better than other classifiers in recall and F1 score on seven cancer datasets including COADREAD, GBM, HNSE, KIRC, LUAD, LUSC, and OV. For other datasets, XGBoost achieves 0.8771 prediction accuracy on BLCA cancer, far higher than that on KNN. For the BRCA dataset, XGBoost achieves an accuracy of 0.7034. For UCEC dataset, XGBoost achieves 0.4864, which is only lower than the Adaboost. For BLCA dataset, the best prediction accuracy comes from XGBoost, with a value of 0.8771.

TABLE 3

CancerPredictorPrecisionRecallF1 score
BLCAXGBoost0.87710.44640.5917
KNN0.59840.67850.6359
Bayes0.32320.28570.3033
Adaboost0.80000.17850.2919
SVM0.55550.62500.5882
BRCAXGBoost0.70340.78720.7429
KNN0.77770.41700.5429
Bayes0.31530.14890.2023
Adaboost0.41920.62970.5034
SVM0.77770.53610.6347
COADREADXGBoost0.82240.81760.8200
KNN0.52650.75880.6216
Bayes0.12500.10000.1574
Adaboost0.68910.60000.6415
SVM0.79860.48610.7324
GBMXGBoost0.66660.85710.7500
KNN0.51350.67850.5846
Bayes0.12500.03570.0555
Adaboost0.54830.60710.5762
SVM0.61760.75000.6774
HNSEXGBoost0.83690.71290.7700
KNN0.63710.66670.6515
Bayes0.32970.28700.3069
Adaboost0.53630.54620.5412
SVM0.67740.48610.5660
KIRCXGBoost0.73580.95120.8297
KNN0.41370.87800.5625
Bayes0.07230.92680.1342
Adaboost0.63410.63410.6341
SVM0.62710.90240.7400
LUADXGBoost0.55260.35000.4285
KNN0.35350.29160.3196
Bayes0.12120.03330.0522
Adaboost0.23410.30830.2661
SVM0.48690.46660.4765
LUSCXGBoost0.83100.85840.8445
KNN0.74860.67450.7096
Bayes0.50000.08490.1450
Adaboost0.72270.34430.4664
SVM0.71420.66030.6862
OVXGBoost0.36840.95450.5316
KNN0.30900.44150.7727
Bayes0.04080.36360.0733
Adaboost0.21420.40900.2812
SVM0.35080.90900.5063
UCECXGBoost0.48640.66980.5669
KNN0.49270.32070.3885
Bayes0.05000.00940.0158
Adaboost0.49650.76410.5702
SVM0.37850.67920.5062

Comparison with other algorithms on independent datasets.

The bold values in table are the scores of the predictors with the best score in a specific metric for a given type of cancer.

Furthermore, we compare the overall accuracy with the classifiers on all 10 cancer datasets. It is obvious that XGBoost achieves the best results whether on the training datasets or the independent validation datasets. The overall accuracy is 0.7421 and 0.8913, respectively, which is 17 and 22% higher than that of SVM, higher than KNN by 22%, and higher than Adaboost by 37 and 33%. Refer to Figure 5 for the comparison of the overall accuracy between different classifiers based on the training datasets and independent validation datasets.

FIGURE 5

Conclusion

In this paper, we propose a cancer type classifier that exploited the copy number variations data of the tumor samples. The copy number variations data contain much noise and is of high dimensionality, so we utilize Extra tree for dimensionality reduction and principal component analysis to reduce noise. Subsequently, XGBoost classifier is applied to 10 types of cancer. Our method achieves the best accuracy regardless of the training datasets or the independent invalidation dataset by selecting 300 features compared with other four classifiers, such as KNN, DNN, SVM, and Adaboost. XGboost classifier adds regular terms to the cost function for controlling the complexity of the model, which contains the number of leaf nodes in the tree and the score sum of squares on each leaf node. From the perspective of bias variance, the regular term reduces the variance of the classifier and simplifies the learned classifier (Chen and Guestrin, 2016), so it can better prevent overfitting in training and its performance is the best compared to the other four classifiers. The results may provide a useful revelation to determine the machine learning model and to understand actual pathological conditions. Furthermore, our results are confirmed by the literatures, with 6 out of top 10 features associated with specific cancers including NKIRAS1 (Gerashchenko et al., 2010; Khodyrev et al., 2012), CDKN2B (Clavell et al., 2008), YTHDC2 (Wittliff et al., 2015; Yu et al., 2020), MECOM (Choi et al., 2017), RBFOX1 (Sengupta et al., 2013), and RAB6B (Kou et al., 2015). The corresponding molecular functions and related cancer of six genes are given in Table 4.

TABLE 4

IDGeneMolecular functionGO annotationRelated cancer
28512NKIRAS1Intracellular signal transductionGO:0035556LUAD KIRC
1030CDKN2BPositive regulation of transforming growth factor beta receptor signaling pathwayGO:0030511HNSC
64848YTHDC2Positive regulation by host of viral genome replicationGO:0044829COADREAD BRCA
2122MECOMNegative regulation of JNK cascadeGO:0046329COADREAD
54715RBFOX1Regulation of alternative mRNA splicing, via spliceosomeGO:0000381COADREAD
51560RAB6BIntra-Golgi vesicle-mediated transportGO:0006891COADREAD

Top six genes and corresponding molecular function.

NKIRAS1 regulates the nuclear factor (NF)-kappa B activity by encoding a Ras-like protein. In addition, it is known that the copy number of NKIRAS1 is usually lower in RCC, and this gene is downregulated in malignant renal tumors.

By forming a complex with CDK4 or CDK6, CDKN2B encodes a cyclin-dependent kinase inhibitor. It is found that transforming growth factor (TGF) beta can drastically induce the expression of this gene, suggesting that CDKN2B might play roles in TGF-beta-related functions. In addition, the aberrant methylation of p15 (INK4B), a protein encoded by CDKN2B, can silence TRβI. Thus, CDKN2B and other tumor suppressor genes, such as CDKN2A, might be used as biomarkers for early detection in HNSC patient.

RBFOX1 is a member of the Fox-1 family of RNA-binding proteins, which regulates tissue-specific alternative splicing in metazoa. RBFOX1 mutations are found in COADREAD cell lines, and the loss of RBFOX1 may account for the anomalous splicing activity associated with COADREAD.

Function enrichment analysis is performed on the 300 genes. Specifically, all genes have been used as the enrichment background in Figure 6 (Shannon et al., 2003; Zhou et al., 2019). Terms with a p <0.01 and an enrichment factor >1.5 are identified and grouped into clusters. Log10(p) is the p-value in log base 10. All genes mainly participated in four primary biological pathways to be specific: adhesion via plasma–membrane adhesion molecules (4.61%), cell morphogenesis involved in differentiation (7.45%), positive regulation of establishment of protein localization (4.96%), and cell growth (4.96%).

FIGURE 6

Limitations

The traditional clinical pathology experiment has its limitation. Our model recovers the original signal under the condition of oversampling, and the additional signal samples generated by the cancer sample need to be taken into account. Due to the limited data, only 10 types of tumors are selected in this manuscript for the construction of the classifier, and further studies can incorporate more types of tumors into the cancer of unknown primary model in the future work.

Statements

Data availability statement

All datasets generated for this study are included in the article/Supplementary Material.

Author contributions

YZ conceived the experiments. TF carried out the experiments. SW designed the algorithm. RD analyzed the data. BW contributed reagents, materials, and analysis tools. JY designed the experiments. JS wrote the manuscript. All authors contributed to the article and approved the submitted version.

Funding

The research was supported by Grant Nos. 61572522, 61503224, and 61773245 from the National Natural Science Foundation of China, National Statistical Science Research Project of China in 2019 (No. 2019LY49), and the Natural Science Foundation of Hunan Province, China (Grant No. 2018JJ2461).

Conflict of interest

BW, RD, and JY were employed by the company Genies (Beijing) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.585029/full#supplementary-material

References

  • 1

    BhowmickS. S.BhattacharjeeD.RatoL. (2019). Identification of tissue-specific tumor biomarker using different optimization algorithms. Genes Genomics41, 431443. 10.1007/s13258-018-0773-2

  • 2

    BloomG.YangI. V.BoulwareD.KwongK. Y.CoppolaD.EschrichS.et al (2004). Multi-platform, multi-site, microarray-based human tumor classification.Am. J. Pathol.164116.

  • 3

    ChenF.ZhangY.BosséD.LalaniA.-K. A.HakimiA. A.HsiehJ. J.et al (2017). Pan-urologic cancer genomic subtypes that transcend tissue of origin.Nat. Commun.8:199.

  • 4

    ChenT. Q.GuestrinC. (2016). “XGBoost: a scalable tree boosting system,” in Proceedings of the Acm Sigkdd International Conference on Knowledge Discovery & Data Mining, New York, NY.

  • 5

    ChoiE. J.KimM. S.SongS. Y.YooN. J.LeeS. H. (2017). Intratumoral heterogeneity of frameshift mutations inmecomgene is frequent in colorectal cancers with high microsatellite instability.Pathol. Oncol. Res.23145149. 10.1007/s12253-016-0112-3

  • 6

    ClavellJ.BernabeR.BaezA. (2008). Epigenetic silencing of cdkn2a, p14(arf) and cdkn2b and trβi in head and neck cancers.Cancer Res.681216.

  • 7

    GenuerR.MichelV.EgerE.ThirionB. (2010). “Random forests based feature selection for decoding fmri data,” in Proceedings of the IEEE International Symposium on Biomedical Imaging, Paris.

  • 8

    GerashchenkoG. V.BogatyrovaO. O.RudenkoE. E.KondratovA. G.KashubaV.I (2010). Genetic and epigenetic changes of NKIRAS1 gene in human renal cell carcinomas.Exper. Oncol.327175.

  • 9

    GeurtsP.ErnstD.WehenkelL. (2006). Extremely randomized trees.Mach. Learn.63342. 10.1007/s10994-006-6226-1

  • 10

    GeurtsP.LouppeG. (2011). Learning to rank with extremely randomized trees.J. Mach. Learn. Res.144961.

  • 11

    GrewalJ. K.Tessier-CloutierB.JonesM.GakkharS.MaY.MooreR.et al (2019). Application of a neural network whole transcriptome-based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Netw. Open2:e192597. 10.1001/jamanetworkopen.2019.2597

  • 12

    KhodyrevD. S.ProninaI. V.RykovS. V.BeresnevaE. V.FreedmanM. V. (2012). Involvement of methylation of group of mirna genes in regulation of expression ofrar-beta2 and nkiras1 target genes in lung cancer.Mol. Biol.46693704. 10.1134/s002689331205007x

  • 13

    KouY.QiaoL.WangQ. (2015). RETRACTED ARTICLE: Identification of core miRNA based on small RNA-seq and RNA-seq for colorectal cancer by bioinformatics.Tumor Biol.3622492255. 10.1007/s13277-014-2832-x

  • 14

    LiangY.QiuK.LiaoB.ZhuW.HuangX.LiL.et al (2017). Seeksv: an accurate tool for somatic structural variation and virus integration detection.Bioinformatics33184191. 10.1093/bioinformatics/btw591

  • 15

    LiangY.WangH.YangJ.LiX.DaiC.ShaoP.et al (2020). A Deep learning framework to predict tumor tissue-of-origin based on copy number alteration.Front. Bioeng. Biotechnol.82:701. 10.3389/fbioe.2020.00701

  • 16

    MaX. J.PatelR.WangX.SalungaR.ErlanderM. (2006). Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch. Pathol. Lab. Med.130, 465473. 10.1043/1543-2165(2006)130[465:MCOHCU]2.0.CO;2

  • 17

    MarquardA. M.BirkbakN. J.ThomasC. E.FaveroF.KrzystanekM.LefebvreC.et al (2016). Tumortracer: a method to identify the tissue of origin from the somatic mutations of a tumor specimen.BMC Med. Genom.8:58. 10.1186/s12920-015-0130-0

  • 18

    MartinezA. M.KakA. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence23, 228233. 10.1109/34.908974

  • 19

    PoduriA.EvronyG. D.CaiX.WalshC. A. (2013). Somatic mutation, genomic variation, and neurological disease.Science341:1237758. 10.1126/science.1237758

  • 20

    RamaswamyS.TamayoP.RifkinR.MukherjeeS.GolubT. R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA98, 1514915154. 10.1073/pnas.211566398

  • 21

    RedonR.IshikawaS.FitchK.FeukL.PerryG. (2006). Global variation in copy number in the human genome.Nature444444454.

  • 22

    SenguptaN.YauC.SakthianandeswarenA.MouradovD.GibbsP.SuraweeraN. (2013). Analysis of colorectal cancers in british bangladeshi identifies early onset, frequent mucinous histotype and a high prevalence of rbfox1deletion.Mol. Cancer1216. 10.1186/1476-4598-12-1

  • 23

    ShannonP.MarkielA.OzierO.BaligaN. S.WangJ. T.RamageD.et al (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks.Genome Res.1124982504. 10.1101/gr.1239303

  • 24

    SøkildeR.VincentM.MøllerA. K.HansenA.HøibyP. E.BlondalT.et al (2014). Efficient identification of miRNAs for classification of tumor origin. J. Mol. Diagn.16, 106115. 10.1016/j.jmoldx.2013.10.001

  • 25

    StaubE.BuhrH. J.GroneJ. (2010). Predicting the site of origin of tumors by a gene expression signature derived from normal tissues. Oncogene29, 44854492. 10.1038/onc.2010.196

  • 26

    TalantovD.BadenJ.JatkoeT.HahnK.YuJ.RajpurohitY.et al (2006). A quantitative reverse transcriptase-polymerase chain reaction assay to identify metastatic carcinoma tissue of origin. J. Mol. Diagnost.8, 320329. 10.2353/jmoldx.2006.050136

  • 27

    TangW.WanS.YangZ.TeschendorffA. E.ZouQ. (2017). Tumor origin detection with tissue-specific miRNA and DNA methylation markers. Bioinformatics34, 398406. 10.1093/bioinformatics/btx622

  • 28

    TothillR. W.KowalczykA.BoussioutasA.HavivI.WaringP.RischinD.et al (2005). Development of an expression-based site of origin diagnostic designed for clinical application to cancer of unknown primary.Cancer Res.65229235.

  • 29

    VerikasA.GelzinisA.BacauskieneM. (2011). Mining data with random forests: a survey and results of new tests.Pat. Recogn.44330349. 10.1016/j.patcog.2010.08.011

  • 30

    WittliffJ. L.AndresS. A.AlatoumM. A.BickettK. E.BrockG. N. (2015). Interaction between smoking history and gene expression levels impacts survival of breast carcinoma patients.Cancer Res.75311.

  • 31

    XuQ.ChenJ.NiS.TanC.XuM.DongL.et al (2016). Pan-cancer transcriptome analysis reveals a gene expression signature for the identification of tumor tissue origin.Mod. Pathol.29546556. 10.1038/modpathol.2016.60

  • 32

    YangJ.ZhangD. D.FrangiA. F.YangJ. Y. (2004). Two-dimensional pca: a new approach to appearance-based face representation and recognition.IEEE Trans. Pattern Analys. Mach. Intellig.26131137. 10.1109/tpami.2004.1261097

  • 33

    YuT.XuY. Y.SunT.HuangD.CaoF.GaoX.et al (2020). YTHDC2 promotes the apoptosis of colorectal cancer cells through the p38MAPK signaling pathway.Research Square [Preprint], 10.21203/rs.3.rs-27416/v1

  • 34

    ZhangN.WangM.ZhangP.HuangT. (2015). Classification of cancers based on copy number variation landscapes.Biochim. Biophys. Acta Gen. Subj.186027502755. 10.1016/j.bbagen.2016.06.003

  • 35

    ZhaoQ. F.ZhangY. L. (2020). Ensemble method of feature selection and reverse construction of gene logical network based on information entropy.Intern. J. Pattern Recogn. Artif. Intellig.34:2059004. 10.1142/s0218001420590041

  • 36

    ZhouY. Y.ZhouB.PacheL.ChangM.KhodabakshshiA. H.TanaseichukO.et al (2019). Metascape provides a biologist-oriented resource for the analysis of systems-level datasets.Nat. Commun.10:1523.

Summary

Keywords

tissue-of-origin, copy number variations, multiclass, XGBoost, extremely randomized tree, principal component analysis

Citation

Zhang Y, Feng T, Wang S, Dong R, Yang J, Su J and Wang B (2020) A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations. Front. Genet. 11:585029. doi: 10.3389/fgene.2020.585029

Received

19 July 2020

Accepted

05 October 2020

Published

20 November 2020

Volume

11 - 2020

Edited by

Quan Zou, University of Electronic Science and Technology of China, China

Reviewed by

Pan Zheng, University of Canterbury, New Zealand; Martín Gutiérrez, Diego Portales University, Chile

Updates

Copyright

*Correspondence: Shudong Wang, Jialiang Yang, Jionglong Su,

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics