AUTHOR=Liu Haiyan , Qiu Chun , Wang Bo , Bing Pingping , Tian Geng , Zhang Xueliang , Ma Jun , He Bingsheng , Yang Jialiang TITLE=Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-of-Origin JOURNAL=Frontiers in Cell and Developmental Biology VOLUME=Volume 9 - 2021 YEAR=2021 URL=https://www.frontiersin.org/journals/cell-and-developmental-biology/articles/10.3389/fcell.2021.619330 DOI=10.3389/fcell.2021.619330 ISSN=2296-634X ABSTRACT=Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose primary tumor site cannot be identified. CUP accounts for approximately 5% of cancer cases in the U.S. and usually has an unfavorable prognosis, making it a serious threat to public health. Traditional methods for identifying the tissue-of-origin (TOO) of CUP, such as immunohistochemistry, can handle only around 20% of CUP patients. In recent years, a growing number of studies have suggested that the problem can be addressed by combining machine learning techniques with large-scale biomedical data covering multiple types of biomarkers, including epigenetic (such as DNA methylation), genetic, and gene expression profiles. However, no systematic comparison has established which biomarker performs best. In addition, it might be possible to further improve inference accuracy by integrating multiple types of biomarkers. In this study, we systematically compared the performance of three types of biomarkers (DNA methylation, gene expression profile, and somatic mutation) and their combinations in inferring the TOO of CUP patients. First, we downloaded the gene expression profile, somatic mutation, and DNA methylation data of 7,224 tumor samples across 21 common cancer types from the ICGC Data Portal and generated seven different feature matrices from the possible combinations of the three data types. Second, we performed feature selection using the Pearson correlation method. 
The selected features of each matrix were used to build an XGBoost multi-label classification model to infer the cancer tissue-of-origin; XGBoost is an algorithm that has proven effective in a few previous studies. The performance of each biomarker and each combination was compared using 10-fold cross-validation. Our results showed that TOO tracing accuracy was highest with the gene expression profile, followed by DNA methylation, while somatic mutation performed the worst. We also found that simply combining multiple biomarkers does little to improve prediction accuracy.
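
The first two steps of the abstract can be illustrated with a minimal sketch. It shows why three biomarker types yield seven feature matrices (every non-empty subset of three items) and one plausible way to rank features by Pearson correlation. Everything here is an assumption for illustration: the short biomarker names, the helper names `biomarker_combinations`, `pearson`, and `select_features`, and in particular the scoring rule (strongest absolute correlation with any one-vs-rest class indicator). The abstract does not state how the correlation was computed against the cancer-type labels, so this should not be read as the authors' procedure.

```python
from itertools import combinations
from math import sqrt

# Hypothetical short names for the three biomarker types in the abstract.
BIOMARKERS = ("methylation", "expression", "mutation")


def biomarker_combinations(names=BIOMARKERS):
    """Return every non-empty subset of the biomarker types, smallest first."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant, since the
    coefficient is undefined in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(rows, labels, top_k):
    """Return the indices of the top_k feature columns.

    Assumed scoring rule (not taken from the paper): each column is
    scored by its largest absolute Pearson correlation with any
    one-vs-rest class indicator vector.
    """
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        best = max(
            abs(pearson(column, [1.0 if lab == c else 0.0 for lab in labels]))
            for c in classes
        )
        scores.append((best, j))
    # Highest score first; ties broken by column index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:top_k]]


if __name__ == "__main__":
    # Each line names the data types concatenated into one feature matrix.
    for combo in biomarker_combinations():
        print("+".join(combo))

    # Toy data: only the first column separates the two made-up classes.
    toy_rows = [[0.1, 5.0, 1.0], [0.2, 5.1, 0.0], [0.9, 5.0, 1.0], [0.8, 5.1, 0.0]]
    toy_labels = ["lung", "lung", "liver", "liver"]
    print(select_features(toy_rows, toy_labels, top_k=1))
```

In a full pipeline, the columns retained for each of the seven matrices would then be passed to a classifier and scored with 10-fold cross-validation, as the abstract describes; that step is omitted here because it depends on third-party libraries.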