ORIGINAL RESEARCH article
Finding Colon Cancer- and Colorectal Cancer-Related Microbes Based on Microbe–Disease Association Prediction
- 1The Cancer Hospital of Jia Mu Si, Jiamusi, China
- 2Oncological Surgery, The Central Hospital of Jia Mu Si, Jiamusi, China
- 3Department of Thoracic Surgery, The Cancer Hospital of Jia Mu Si, Jiamusi, China
- 4Medical Oncology, The Cancer Hospital of Jia Mu Si, Jiamusi, China
- 5Geneis Beijing Co., Ltd., Beijing, China
- 6Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- 7Department of Laboratory Medicine, Yantaishan Hospital of Yantai City, Yantai, China
Microbes are closely associated with the formation and development of diseases. The identification of the potential associations between microbes and diseases can boost the understanding of various complex diseases. Wet experiments applied to microbe–disease association (MDA) identification are costly and time-consuming. In this manuscript, we developed a novel computational model, NLLMDA, to find unobserved MDAs, especially for colon cancer and colorectal carcinoma. NLLMDA integrated negative MDA selection, linear neighborhood similarity, label propagation, information integration, and known biological data. The Gaussian association profile (GAP) similarity of microbes and GAPs similarity and symptom similarity of diseases were firstly computed. Secondly, linear neighborhood method was then applied to the above computed similarity matrices to obtain more stable performance. Thirdly, negative MDA samples were selected, and the label propagation algorithm was used to score for microbe–disease pairs. The final association probabilities can be computed based on the information integration method. NLLMDA was compared with the other five classical MDA methods and obtained the highest area under the curve (AUC) value of 0.9031 and 0.9335 on cross-validations of diseases and microbe–disease pairs. The results suggest that NLLMDA was an effective prediction method. More importantly, we found that Acidobacteriaceae may have a close link with colon cancer and Tannerella may densely associate with colorectal carcinoma.
Microbes are the most widespread microscopic organisms and affect many key biological processes including metabolic function and immune function (Qu et al., 2019; Sachdeva et al., 2019). There are many microbes in the human tissues, for example, skin (Fredricks, 2001), gut (Grenham et al., 2011), and lung (Cole, 1989). Normal microbial flora help the host health (Peng et al., 2018; Langella and Martín, 2019). Beneficial microbes, such as probiotics, synbiotics, and biotherapeutic agents, are effective therapeutic clues when normal microflora are disrupted (McFarland, 2000; Langella and Martín, 2019). However, the body easily gets sick when a microbial community is not balanced. Therefore, there are close associations between microbes and human diseases (Consortium, 2012; Peng et al., 2018).
Microorganisms have dense linkages with various diseases including infectious diseases and non-infectious diseases (Findley et al., 2013; Chen et al., 2017; Abu-Ali et al., 2018; Liu et al., 2019; Huang et al., 2020). For example, there is a close association between colorectal cancer and gut microbes (Heavey and Rowland, 2004; Belcheva et al., 2014). There was evidence that the changes in composition of the intestinal microbiota could induce human type 2 diabetes (Larsen et al., 2010). Toxins generated by microbes, such as Streptococcus and Staphylococcus aureus, could induce or even worsen inflammatory skin diseases (Belcheva et al., 2014). Thus, identifying the associations between microbes and diseases not only helps to characterize the pathogenesis of diseases but also provides new clues for the diagnosis and treatment of diseases (Peng et al., 2018). Although several validated microbe–disease associations (MDAs) have been reported in the Human MDA Database (HMDAD) dataset, there remains far from enough. Experimental methods to uncover new associations between two biological entities (for example, MDAs) are costly and time-consuming (Peng et al., 2017a, 2020b). Therefore, it is imperative to identify the possible disease-related microbes based on the computational models.
Based on the assumption that similar microbes tend to associate with similar diseases, computational methods are developed to predict MDAs. Ma et al. (2016) obtained the reported MDAs from documents and constructed the HMDAD. According to the computed microbe similarity, disease similarity, and known MDAs, various computational models are designed to find the associations between microbes and diseases. Chen et al. (2017) exploited the first MDA prediction method (KATZHMDA) based on the KATZ technique. Several MDA prediction models are then developed to discover the possible MDAs, for example, recommendation model based on neighbor information and MDA graph (NGRHMDA) (Huang et al., 2017), network consistency projection method (NCPHMDA) (Bao et al., 2017), network topological similarity method (NTSHMDA) (Luo and Long, 2018), adaptive boosting method (Peng et al., 2018), bi-direction similarity integration propagation method (Zhang et al., 2018), binary matrix completion method (BMCMDA), matrix decomposition method (Qu et al., 2019), and matrix factorization method combing credible negative MDA selection (Peng et al., 2020a). The above models obtained better performance for MDA prediction. Especially, the RNMFMDA method provided by Peng et al. (2020b) significantly improved MDA prediction through credible negative MDA selection based on positive-unlabeled learning (Peng et al., 2017b) and the matrix factorization with neighborhood regularization method. As such, RNMFMDA is one of the state-of-the-art MDA identification methods.
According to the recent report by EUROCARE, colon cancer and colorectal cancer demonstrated a minimal but significant increasing trend in the 5-year survival rate across the years by approximately 4–6%. More importantly, colon cancer (Terziæ et al., 2010; Ahmed, 2020) is the third most frequently diagnosed cancer in the United States. The disease is increasingly being certified now-a-days, even at an early or advanced stage. Colorectal cancer is now the fourth most widespread diagnosed cancer and the second most common cause of cancer death in the United States. Siegel et al. (2020) predicted that about 147,950 cases will be diagnosed with colorectal cancer and 53,200 will die from the cancer, including 17,930 individuals and 3,640 deaths in persons with age less than 50 years in 2020. Research studies suggest that colon cancer and colorectal cancer evolve in close associations with microbes (Garrett, 2019).
Therefore, in this manuscript, inspired by the neighborhood information method provided by Liu et al. (2020) and Peng et al. (2020a) and the neighbor propagation algorithm provided by Zhang et al. (2018), we developed an MDA prediction framework by integrating negative MDA selection, linear neighborhood similarity, label propagation, and information integration to find microbes associated with colon cancer and colorectal cancer. Firstly, microbe similarity matrix and disease similarity matrix were computed based on their Gaussian association profile (GAP) and symptom features. Secondly, the linear neighborhood similarity of microbes and diseases was calculated based on their neighborhood information, respectively. Thirdly, negative MDAs were selected according to the positive-unlabeled learning algorithm provided by Peng et al. (2020a). Fourthly, a label propagation method was designed to score all unknown microbe–disease pairs, and the scores were integrated based on the information integration method. Finally, NLLMDA was used to find the possible microbes related to colon cancer and colorectal cancer.
Materials and Equipment
We downloaded MDAs from the HMDAD (Ma et al., 2016). The HMDAD contains 483 MDAs from 292 microbes and 39 diseases, and finally, 450 MDAs remain after preprocessing. Assume that the ith microbe and the jth disease are denoted as mi and di, respectively. The associations between n microbes and m diseases are represented as a binary matrix Y(n = m) where
The elements with the values of 1 in Y are MDA data and taken as positive samples. The zero entities in Y are unknown microbe–disease pairs and taken as unlabeled samples. The microbe and disease similarity matrices are represented as SM ∈ ℝn=n and SD ∈ ℝm=m, respectively.
Microbe GAP Similarity
Assume that the GAP A(m(i)) of a microbe mi can be denoted as the ith row of the MDA matrix Y. For two microbes mi and mj, their GAP similarity can be defined as:
where denotes the normalized kernel bandwidth with parameter . The microbe similarity SM(n = n) can be computed based on Eq. (2).
Disease GAP Similarity
Assume that the GAP A(d(i)) of a disease di can be denoted as the jth column of the MDA matrix Y. For two diseases di and dj, their GAP similarity can be defined as:
where denotes the normalized kernel bandwidth with parameter .
Disease Symptom Similarity
The disease symptom similarity matrix Ss can be computed according to the method provided by Zhou et al. (2020).
The final disease similarity matrix SD(m = m) can be defined based on the above two similarity measurements:
where the parameter γ is used to measure the importance between the two similarity measurements.
Negative MDA Selection
High-quality negative MDA samples help to improve MDA prediction performance. Peng et al. (2020b) designed a reliable negative MDA selection method based on positive-unlabeled learning and random walk with restart. The method significantly outperformed other MDA prediction methods and is one of the state-of-the-art negative sample selection methods. In this manuscript, we used the negative MDA extraction method provided by Peng et al. (2020b) to select reliable negative MDA samples.
Linear Neighborhood Similarity
In association prediction area, Gaussian similarity is usually applied to evaluate similarity according to features of data points. However, the measurement is not robust to data points connecting different classes. Therefore, we assumed that each point can be reconstructed based on the linear combination of its neighborhoods and designed a linear neighborhood similarity measurement method to obtain more powerful similarity.
Suppose that Xi represents the feature vector of the ith microbe. We minimize the following objective function:
where Xij denotes the jth neighbor of Xi, N(Xi) represents the set of K nearest neighbors of Xi, and wiij evaluates the reconstructive contribution of Xi to Xij. Let Gijik = (Xi−Xij)T(Xi−Xik) and θi be rewritten as:
We then introduced L2 norm of the weight wi to avoid over-fitting based on Tikhonov regularization. The final linear neighborhood similarity can be described as:
where α is a weight used to balance the importance of the weight and the regularization terms.
We can solve Eqs. (5) and (7) to compute linear neighborhood weights and regularization linear neighborhood weights of s neighbors based on standard quadratic programming. When Xj∉N(Xi), wii = 0. For each microbe or disease, the weights of its neighbors can be applied to represent their similarities. Thus, microbe (or disease) similarity can be computed by their linear neighborhood similarity and regularized by their linear neighborhood similarity.
In this study, we used a label propagation algorithm to find unobserved MDAs based on known MDAs, the computed microbe similarity and disease similarity. We first took microbes (or diseases) as nodes and the similarity weight wij as the edge from node i and node j and constructed a directed graph. The known MDAs were denoted as labels, which were propagated in the microbe graph. In each propagation, the labeled nodes were updated by integrating label information from their neighborhoods with the rate of β and keeping its initial label with the rate of 1−β.
Let represent the prediction association scores of ith disease at time t, where ytij denotes the propensities of disease dj associated with microbe mi. The label propagation process can be defined as:
where denotes the association profile of disease di, and will converge to:
where Yi is the final MDA score matrix based on disease di, and the predicted entire MDA matrix can be written as:
Similarly, we can conduct label propagation based on microbes.
According to different features of microbes and diseases, we can compute different microbe–microbe similarities and disease–disease similarities. Different similarities produce different models and prediction results. Ensemble learning has been validated to be a powerful tool for dealing with high-dimensional and complex data. In this study, we considered diverse features of microbes and diseases and designed a linear combination technique to integrate different results. We assigned different weights to each model and integrated the predicted association scores as follows:
where S = 3 denotes the number of different models, Ykij denotes the predicted association scores for microbe–disease pair (mi, dj) by the kth model, ωk denotes the weights of the kth model, and Zij denotes the integrated association prediction score of microbe–disease pair (mi, dj). The flowchart is shown in Figure 1, where LNS and LP denote linear neighborhood similarity and label propagation.
Experimental Settings and Evaluation Metrics
We conducted 100 trials of 5-fold cross-validation, and an average performance was calculated to decrease the prediction bias. Three different cross-validations were conducted as follows:
• 5-fold cross-validation 1 (CV1) on microbes: random rows (microbes) in MDA matrix were masked for testing.
• 5-fold cross-validation 2 (CV2) on diseases: random columns (diseases) in MDA matrix were masked for testing.
• 5-fold cross-validation 3 (CV3) on microbe–disease pairs: random entries (microbe–disease pairs) in MDA matrix were masked for testing.
Under CV1, 80% of rows in Y were used as training set in each round. Under CV2, 80% of columns of Y were used as training set. Under CV3, 80% of entries in Y were used as training set. We defined new microbes (or diseases) as the microbes (or diseases) without any associated diseases (or microbes). The three cross-validations refer to MDA identification for new microbes, diseases, and microbe–disease pairs, respectively.
We conducted the grid search to find the optimal combination of parameters and found that NLLMDA obtained the best performance when , , γ=0.7, α=0.7, and β=0.1. Sensitivity, specificity, accuracy, and area under the curve (AUC) were applied to evaluate the performance of our proposed NLLMDA method. AUC is the area under Receiver Operating Characteristic (ROC) curve, and the remaining are defined as follows.
where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively.
Performance Comparison of Six MDA Prediction Methods
We compared the proposed NLLMDA method with other five MDA identification models, that is, KATZHMDA (Chen et al., 2017), LRLSHMDA (Wang et al., 2017), NGRHMDA (Huang et al., 2017), NTSHMDA (Luo and Long, 2018), and MDLPHMDA (Qu et al., 2019). The five MDA prediction methods separately used the KATZ measurement, Laplacian regularized least squares, neighbor and graph-based recommendation, network topological similarity, and matrix decomposition and label propagation. Tables 1–3 list the performance of these six methods. The best values in each column were denoted in boldface in Tables 1–3. Because we took all unlabeled microbe–disease pairs as negative MDA samples when computing specificity and accuracy, the two measurements are almost the same when accurate to four decimal places on three cross-validations.
Table 1 shows the sensitivity, specificity, accuracy, and AUC values obtained from KATZHMDA, LRLSHMDA, NGRHMDA, NTSHMDA, MDLPHMDA, and NLLMDA under CV1. From Table 1, we can find that all six MDA prediction methods did not obtain better sensitivity, specificity, accuracy, and AUC under CV1. We thought that it may be resulted in by different structures of data.
Table 2 lists the performance of the six MDA prediction models under CV2. In the cross-validation experiment, NLLMDA computed the best sensitivity and AUC. Especially, NLLMDA outperformed 4.69, 20.42, 9.32, 56.45, and 16.14% compared with KATZHMDA, LRLSHMDA, NTSHMDA, NGRHMDA, and MDLPHMDA, respectively, in terms of sensitivity. NLLMDA outperformed 4.09, 10.46, 8.18, 8.94, and 9.45% compared with the above five methods in terms of AUC. AUC is a more important evaluation metric than the other three metrics. Therefore, NLLMDA obtained better performance and was more appropriate to find associated microbes for a new disease.
Table 3 shows the predictive results from the proposed NLLMDA method and other five MDA identification methods under CV3. The sensitivity and AUC values of NLLMDA significantly outperformed the other five MDA identification methods. Especially, NLLMDA outperformed 7.84, 11.09, 4.68, 53.07, and 7.77% compared with KATZHMDA, LRLSHMDA, NTSHMDA, NGRHMDA, and MDLPHMDA, respectively, in terms of sensitivity. NLLMDA outperformed 8.18, 5.80, 4.70, 3.32, and 4.25% compared with the above five methods in terms of AUC. AUC is a more important evaluation metric than the other three measurements. Therefore, NLLMDA outperformed the other five MDA prediction models and is an effective MDA prediction method. Figures 2–4 show the AUC values obtained by all six MDA prediction models under three cross-validations.
We further analyzed the performance of NLLMDA by two cases. We intend to find the possible microbes associated with colon cancer and colorectal cancer. Although a rare population of undifferentiated cells is closely associated with tumor formation and maintenance, this has not still been found for colon cancer. In addition, colorectal carcinoma has a dense association with specific eating patterns affecting the gut microbiota (Garrett, 2019). The gastrointestinal tract is closely populated with microorganisms. Therefore, we predicted the top 20 microbes associated with the two cancers. The results are shown in Tables 4, 5.
Table 4 shows the predicted top 20 microbes associated with colon cancer. The 20 associations are not included in the known MDAs in the HMDAD. There are 8 MDAs validated by recent documents among the 20 MDAs. That is, 40% MDAs have been validated by publications. More importantly, Acidobacteriaceae are able to grow on various sugars or polysaccharides, and some Acidobacteriaceae use amino acids as carbon sources. They grow with a slow speed and grow better under nutrient-limiting conditions. They have been validated to associate with irritable bowel syndrome in the HMDAD (Saulnier et al., 2011). We found that Acidobacteriaceae may associate with colon cancer with the highest linkage probability.
Similarly, Table 5 lists the predicted top 20 microbes associated with colorectal carcinoma. The 20 MDAs are not included in the HMDAD. Among the 20 MDAs, 15 MDAs were reported by related publications. That is, 75% MDAs have been confirmed by documents. In addition, Tannerella forsythia is one bacterial pathogen related to human periodontitis, which is a polymicrobial inflammatory disease in tooth-surrounding tissues. It is closely associated with periodontitis, liver cirrhosis, atherosclerosis, and esophageal adenocarcinoma (Jorth et al., 2014; Qin et al., 2014; Bale et al., 2017; Sharma, 2020). The results showed that Tannerella may densely link with colorectal carcinoma.
Microbes are commonly distributed in various species and show important role in many biological processes. Many human diseases, for example, intestinal diseases, involved microorganisms. Therefore, finding the potential associations between microbes and diseases can boost the understanding of the pathogenic mechanisms of diseases and its drug research and development.
Traditional experimental methods used for MDA identification are costly and time-consuming. Computational models were designed to uncover new MDAs. However, the prediction performance of computational methods further needs improvement. Therefore, NLLMDA was exploited to find MDA candidates based on negative MDA selection, linear neighborhood similarity, label propagation, information integration, and known biological data. Experimental results showed that NLLMDA obtained better prediction performance. After that, we further analyzed two cases about colon cancer and colorectal carcinoma. We found the top 20 microbes associated with the above two diseases and need to further experimental confirmation.
The proposed NLLMDA methods can obtain better predictive performance. It may be the following characteristics. Firstly, it selected credible negative MDA samples. Secondly, it used linear neighborhood similarity to consider neighborhood information. Thirdly, it conducted information integration based on the prediction results by the computed three similarity scores.
In the future, we will firstly integrate more biological features related to microbes and diseases to more completely reflect the biological information of the two entities. Secondly, we will design more robust algorithms to extract high-quality negative MDA samples. Finally, we will exploit more effective models, such as deep learning, to improve MDA prediction accuracy.
Data Availability Statement
All datasets generated for this study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.
JC conceived, designed, and managed the study. YC, CS, and HMS proposed the computational models. YC wrote the manuscript. XS and BJ revised the original draft. HJS and MS discussed the computational models and gave the conclusion. All authors read and approved the final manuscript.
Conflict of Interest
XS and BJ were employed by the company Genesis Beijing Co., Ltd.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank all authors of the cited references.
Abu-Ali, G. S., Mehta, R. S., Lloyd-Price, J., Mallick, H., Branck, T., Ivey, K. L., et al. (2018). Metatranscriptome of human faecal microbial communities in a cohort of adult men. Nat. Microbiol. 3:356. doi: 10.1038/s41564-017-0084-4
Bale, B. F., Doneen, A. L., and Vigerust, D. J. (2017). High-risk periodontal pathogens contribute to the pathogenesis of atherosclerosis. J. Postgrad. Med. 93, 215–220. doi: 10.1136/postgradmedj-2016-134279
Belcheva, A., Irrazabal, T., Robertson, S. J., Streutker, C., Maughan, H., Rubino, S., et al. (2014). Gut microbial metabolism drives transformation of MSH2-deficient colon epithelial cells. Cell 158, 288–299.
Chen, X., Huang, Y. A., You, Z. H., Yan, G. Y., and Wang, X. S. (2017). A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33, 733–739. doi: 10.1093/bioinformatics/btw715
Huang, Y., Yuan, K., Tang, M., Yue, J. M., Bao, L. J., Wu, S., et al. (2020). Melatonin inhibiting the survival of human gastric cancer cells under ER stress involving autophagy and Ras-Raf-MAPK signalling. J. Cell. Mol. Med. 25, 1480–1492. doi: 10.1111/jcmm.16237
Huang, Y.-A., You, Z.-H., Chen, X., Huang, Z.-A., Zhang, S., and Yan, G.-Y. (2017). Prediction of microbe-disease association from the integration of neighbor and graph with collaborative recommendation model. J. Transl. Med. 15:209. doi: 10.1186/s12967-017-1304-7
Jorth, P., Turner, K. H., Gumus, P., Nizam, N., Buduneli, N., and Whiteley, M. (2014). Metatranscriptomics of the human oral microbiome during health and disease. mBio 5:e01012–14. doi: 10.1128/mBio.01012-14
Larsen, N., Vogensen, F. K., Van Den Berg, F. W., Nielsen, D. S., Andreasen, A. S., Pedersen, B. K., et al. (2010). Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults. PLoS One 5:e9085. doi: 10.1371/journal.pone.0009085
Liu, H., Han, M., Li, S. C., Tan, G., Sun, S., Hu, Z., et al. (2019). Resilience of human gut microbial communities for the long stay with multiple dietary shifts. Gut 68, 2254–2255. doi: 10.1136/gutjnl-2018-317298
Liu, H., Ren, G.-F., Chen, H.-Y., Liu, Q., Yang, Y.-J., and Zhao, Q. (2020). Predicting lncRNA-miRNA interactions based on logistic matrix factorization with neighborhood regularized. Knowl. Based Syst. 191:105261. doi: 10.1016/j.knosys.2019.105261
Luo, J., and Long, Y. (2018). NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE ACM Trans. Comput. Biol. Bioinform. 17, 1341–1351. doi: 10.1109/TCBB.2018.2883
Peng, L., Shen, L., Liao, L., Liu, G., and Zhou, L. (2020a). RNMFMDA: A microbe-disease association identification method based on reliable negative sample selection and logistic matrix factorization with neighborhood regularization[J]. Front. Microbiol. 11:592430. doi: 10.3389/fmicb.2020.592430
Peng, L., Zhou, L., Chen, X., and Piao, X. (2020b). A computational study of potential miRNA-disease association inference based on ensemble learning and kernel ridge regression. Front. Bioeng. Biotech. 8:40. doi: 10.3389/fbioe.2020.00040
Qu, J., Zhao, Y., and Yin, J. (2019). Identification and analysis of human microbe- disease associations by matrix decomposition and label propagation. Front. Microbiol. 10:291. doi: 10.3389/fmicb.2019.00291
Saulnier, D. M., Riehle, K., Mistretta, T. A., Diaz, M. A., Mandal, D., Raza, S., et al. (2011). Gastrointestinal microbiome signatures of pediatric patients with irritable bowel syndrome. Gastroenterology 141, 1782–1791. doi: 10.1053/j.gastro.2011.06.072
Zhang, W., Yang, W., Lu, X., Huang, F., and Luo, F. (2018). The bi-direction similarity integration method for predicting microbe-disease associations. IEEE Access 6, 38052–38061. doi: 10.1109/ACCESS.2018.2851751
Keywords: microbe–disease association, negative sample selection, linear neighborhood similarity, label propagation, information integration, colon cancer, colorectal carcinoma
Citation: Chen Y, Sun H, Sun M, Shi C, Sun H, Shi X, Ji B and Cui J (2021) Finding Colon Cancer- and Colorectal Cancer-Related Microbes Based on Microbe–Disease Association Prediction. Front. Microbiol. 12:650056. doi: 10.3389/fmicb.2021.650056
Received: 06 January 2021; Accepted: 09 February 2021;
Published: 16 March 2021.
Edited by:Qi Zhao, University of Science and Technology Liaoning, China
Reviewed by:Jing Sun, George Washington University, United States
Lihong Peng, Hunan University of Technology, China
Copyright © 2021 Chen, Sun, Sun, Shi, Sun, Shi, Ji and Cui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jinpeng Cui, email@example.com