Finding Colon Cancer- and Colorectal Cancer-Related Microbes Based on Microbe–Disease Association Prediction

Microbes are closely associated with the formation and development of diseases. Identifying potential associations between microbes and diseases can improve the understanding of various complex diseases. Wet-lab experiments for microbe–disease association (MDA) identification are costly and time-consuming. In this manuscript, we developed a novel computational model, NLLMDA, to find unobserved MDAs, especially for colon cancer and colorectal carcinoma. NLLMDA integrates negative MDA selection, linear neighborhood similarity, label propagation, information integration, and known biological data. First, the Gaussian association profile (GAP) similarity of microbes and the GAP similarity and symptom similarity of diseases were computed. Second, the linear neighborhood method was applied to the computed similarity matrices to obtain more stable performance. Third, negative MDA samples were selected, and the label propagation algorithm was used to score microbe–disease pairs. The final association probabilities were computed with the information integration method. NLLMDA was compared with five other classical MDA methods and obtained the highest area under the curve (AUC) values, 0.9031 and 0.9335, on cross-validations of diseases and microbe–disease pairs, respectively. The results suggest that NLLMDA is an effective prediction method. More importantly, we found that Acidobacteriaceae may be closely linked with colon cancer and that Tannerella may be closely associated with colorectal carcinoma.


INTRODUCTION
Microbes are the most widespread microscopic organisms and affect many key biological processes, including metabolic function and immune function (Qu et al., 2019; Sachdeva et al., 2019). Many microbes live in human tissues, for example, the skin (Fredricks, 2001), gut (Grenham et al., 2011), and lung (Cole, 1989). Normal microbial flora help maintain host health (Peng et al., 2018; Langella and Martín, 2019). Beneficial microbes, such as probiotics, synbiotics, and biotherapeutic agents, are effective therapeutic options when normal microflora are disrupted (McFarland, 2000; Langella and Martín, 2019). However, the body easily gets sick when a microbial community is not balanced. Therefore, there are close associations between microbes and human diseases (Consortium, 2012; Peng et al., 2018).
Microorganisms are closely linked with various diseases, including infectious and non-infectious diseases (Findley et al., 2013; Chen et al., 2017; Abu-Ali et al., 2018; Liu et al., 2019; Huang et al., 2020). For example, there is a close association between colorectal cancer and gut microbes (Heavey and Rowland, 2004; Belcheva et al., 2014). There is evidence that changes in the composition of the intestinal microbiota could induce human type 2 diabetes (Larsen et al., 2010). Toxins generated by microbes, such as Streptococcus and Staphylococcus aureus, could induce or even worsen inflammatory skin diseases (Belcheva et al., 2014). Thus, identifying the associations between microbes and diseases not only helps to characterize the pathogenesis of diseases but also provides new clues for their diagnosis and treatment (Peng et al., 2018). Although several validated microbe-disease associations (MDAs) have been reported in the Human MDA Database (HMDAD), they remain far from sufficient. Experimental methods to uncover new associations between two biological entities (for example, MDAs) are costly and time-consuming (Peng et al., 2017a, 2020b). Therefore, it is imperative to identify possible disease-related microbes with computational models. Based on the assumption that similar microbes tend to associate with similar diseases, computational methods have been developed to predict MDAs. Ma et al. (2016) collected reported MDAs from the literature and constructed the HMDAD. Using the computed microbe similarity, disease similarity, and known MDAs, various computational models have been designed to find associations between microbes and diseases. Chen et al. (2017) developed the first MDA prediction method (KATZHMDA) based on the KATZ technique.
Several MDA prediction models were then developed to discover possible MDAs, for example, a recommendation model based on neighbor information and the MDA graph (NGRHMDA), a network consistency projection method (NCPHMDA) (Bao et al., 2017), a network topological similarity method (NTSHMDA) (Luo and Long, 2018), an adaptive boosting method (Peng et al., 2018), a bidirectional similarity integration propagation method (Zhang et al., 2018), a binary matrix completion method (BMCMDA), a matrix decomposition method (Qu et al., 2019), and a matrix factorization method combining credible negative MDA selection (Peng et al., 2020a). These models improved MDA prediction performance. In particular, the RNMFMDA method provided by Peng et al. (2020b) significantly improved MDA prediction through credible negative MDA selection based on positive-unlabeled learning (Peng et al., 2017b) and matrix factorization with neighborhood regularization. As such, RNMFMDA is one of the state-of-the-art MDA identification methods.
According to the recent report by EUROCARE, colon cancer and colorectal cancer showed a small but significant increase of approximately 4-6% in the 5-year survival rate over the years. More importantly, colon cancer (Terziae et al., 2010; Ahmed, 2020) is the third most frequently diagnosed cancer in the United States. The disease is now increasingly being diagnosed, at both early and advanced stages. Colorectal cancer is now the fourth most commonly diagnosed cancer and the second most common cause of cancer death in the United States. Siegel et al. (2020) estimated that in 2020 about 147,950 individuals would be diagnosed with colorectal cancer and 53,200 would die from it, including 17,930 cases and 3,640 deaths in persons younger than 50 years. Research studies suggest that colon cancer and colorectal cancer evolve in close association with microbes (Garrett, 2019).
Therefore, in this manuscript, inspired by the neighborhood information method provided by Liu et al. (2020) and Peng et al. (2020a) and the neighbor propagation algorithm provided by Zhang et al. (2018), we developed an MDA prediction framework by integrating negative MDA selection, linear neighborhood similarity, label propagation, and information integration to find microbes associated with colon cancer and colorectal cancer. Firstly, microbe similarity matrix and disease similarity matrix were computed based on their Gaussian association profile (GAP) and symptom features. Secondly, the linear neighborhood similarity of microbes and diseases was calculated based on their neighborhood information, respectively. Thirdly, negative MDAs were selected according to the positive-unlabeled learning algorithm provided by Peng et al. (2020a). Fourthly, a label propagation method was designed to score all unknown microbe-disease pairs, and the scores were integrated based on the information integration method. Finally, NLLMDA was used to find the possible microbes related to colon cancer and colorectal cancer.

MATERIALS AND EQUIPMENT
We downloaded MDAs from the HMDAD (Ma et al., 2016). The HMDAD contains 483 MDAs between 292 microbes and 39 diseases; 450 MDAs remain after preprocessing. Assume that the $i$th microbe and the $j$th disease are denoted as $m_i$ and $d_j$, respectively. The associations between $n$ microbes and $m$ diseases are represented as a binary matrix $Y \in \mathbb{R}^{n \times m}$, where
$$y_{ij} = \begin{cases} 1, & \text{if microbe } m_i \text{ is associated with disease } d_j \\ 0, & \text{otherwise} \end{cases}$$
The elements with the value of 1 in $Y$ are known MDAs and are taken as positive samples. The zero entries in $Y$ are unknown microbe-disease pairs and are taken as unlabeled samples. The microbe and disease similarity matrices are represented as $S_M \in \mathbb{R}^{n \times n}$ and $S_D \in \mathbb{R}^{m \times m}$, respectively.

Microbe GAP Similarity
Assume that the GAP $A(m_i)$ of a microbe $m_i$ is the $i$th row of the MDA matrix $Y$. For two microbes $m_i$ and $m_j$, their GAP similarity can be defined as:
$$S_M(m_i, m_j) = \exp\left(-\gamma_m \lVert A(m_i) - A(m_j) \rVert^2\right) \tag{2}$$
where $\gamma_m = \gamma'_m \big/ \left(\frac{1}{n}\sum_{k=1}^{n} \lVert A(m_k) \rVert^2\right)$ denotes the normalized kernel bandwidth with parameter $\gamma'_m$. The microbe similarity matrix $S_M \in \mathbb{R}^{n \times n}$ can be computed based on Eq. (2).

Disease GAP Similarity
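Assume that the GAP $A(d_i)$ of a disease $d_i$ is the $i$th column of the MDA matrix $Y$. For two diseases $d_i$ and $d_j$, their GAP similarity can be defined as:
$$S_G(d_i, d_j) = \exp\left(-\gamma_d \lVert A(d_i) - A(d_j) \rVert^2\right)$$
where $\gamma_d = \gamma'_d \big/ \left(\frac{1}{m}\sum_{k=1}^{m} \lVert A(d_k) \rVert^2\right)$ denotes the normalized kernel bandwidth with parameter $\gamma'_d$.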
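The two GAP similarities can be illustrated with a minimal pure-Python sketch. The function names `gap_similarity` and `transpose` and the toy matrix are hypothetical, and the sketch assumes that the disease bandwidth is normalized in the same way as the microbe bandwidth.

```python
import math

def gap_similarity(profiles, gamma_prime=1.0):
    """Gaussian association profile similarity between the rows of `profiles`."""
    n = len(profiles)
    # The mean squared norm of the profiles normalizes the kernel bandwidth.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim

def transpose(matrix):
    """Swap rows and columns so that disease profiles become rows."""
    return [list(col) for col in zip(*matrix)]

# Toy association matrix Y: 3 microbes (rows) x 2 diseases (columns).
Y = [[1, 0],
     [1, 0],
     [0, 1]]
S_M = gap_similarity(Y)             # microbe similarity from the rows of Y
S_G = gap_similarity(transpose(Y))  # disease GAP similarity from the columns of Y
```

Microbes with identical association profiles (the first two rows here) receive a similarity of 1, and the similarity decays as the profiles diverge.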

Disease Symptom Similarity
The disease symptom similarity matrix $S_s$ can be computed according to the method provided by Zhou et al. (2020).
The final disease similarity matrix $S_D \in \mathbb{R}^{m \times m}$ is defined as a weighted combination of the above two similarity measurements, where the parameter $\gamma$ controls the relative importance of the two measurements.
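One plausible reading of this combination is a convex mixture of the two matrices. The sketch below assumes that form and, as a further assumption, that γ weights the GAP similarity and 1 − γ weights the symptom similarity; the function name and the toy values are hypothetical.

```python
def combine_similarities(s_gap, s_symptom, gamma=0.7):
    """Assumed convex combination: gamma * GAP + (1 - gamma) * symptom."""
    return [[gamma * g + (1.0 - gamma) * s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(s_gap, s_symptom)]

# Toy 2 x 2 disease similarity matrices (illustrative values only).
S_gap = [[1.0, 0.2], [0.2, 1.0]]
S_sym = [[1.0, 0.6], [0.6, 1.0]]
S_D = combine_similarities(S_gap, S_sym, gamma=0.7)
```

With this form, γ = 1 would use the GAP similarity alone and γ = 0 the symptom similarity alone.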

Negative MDA Selection
High-quality negative MDA samples help to improve MDA prediction performance. Peng et al. (2020b) designed a reliable negative MDA selection method based on positive-unlabeled learning and random walk with restart. The method significantly outperformed other MDA prediction methods and is one of the state-of-the-art negative sample selection methods.
In this manuscript, we used the negative MDA extraction method provided by Peng et al. (2020b) to select reliable negative MDA samples.

Linear Neighborhood Similarity
In association prediction, Gaussian similarity is usually applied to evaluate similarity according to the features of data points. However, this measurement is not robust to data points connecting different classes. Therefore, we assumed that each point can be reconstructed as a linear combination of its neighbors and designed a linear neighborhood similarity measurement to obtain a more robust similarity.
Suppose that $X_i$ represents the feature vector of the $i$th microbe. We minimize the following objective function:
$$\theta_i = \Big\lVert X_i - \sum_{j:\, X_{i_j} \in N(X_i)} w_{i i_j} X_{i_j} \Big\rVert^2 \quad \text{s.t.} \quad \sum_{j:\, X_{i_j} \in N(X_i)} w_{i i_j} = 1,\; w_{i i_j} \geq 0 \tag{5}$$
where $X_{i_j}$ denotes the $j$th neighbor of $X_i$, $N(X_i)$ represents the set of $K$ nearest neighbors of $X_i$, and $w_{i i_j}$ evaluates the reconstructive contribution of $X_{i_j}$ to $X_i$. With the local Gram matrix $G^i_{jk} = (X_i - X_{i_j})^{\top}(X_i - X_{i_k})$, $\theta_i$ can be rewritten as:
$$\theta_i = \sum_{j,k} w_{i i_j} G^i_{jk} w_{i i_k} = w_i^{\top} G^i w_i \tag{6}$$
We then introduced the $L_2$ norm of the weight vector $w_i$ to avoid over-fitting based on Tikhonov regularization. The final linear neighborhood similarity can be described as:
$$w_i = \arg\min_{w_i} \; w_i^{\top} G^i w_i + \alpha \lVert w_i \rVert^2 \quad \text{s.t.} \quad \sum_{j} w_{i i_j} = 1,\; w_{i i_j} \geq 0 \tag{7}$$
where $\alpha$ is a weight used to balance the importance of the reconstruction and regularization terms. We can solve Eqs. (5) and (7) with standard quadratic programming to compute the linear neighborhood weights and the regularized linear neighborhood weights of the neighbors of $X_i$. When $X_j \notin N(X_i)$, $w_{ij} = 0$. For each microbe or disease, the weights of its neighbors can be applied to represent their similarities. Thus, microbe (or disease) similarity can be computed by the linear neighborhood similarity and the regularized linear neighborhood similarity.
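The regularized reconstruction problem can be sketched for a single point as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the usual linear neighborhood constraints (weights non-negative and summing to one) and swaps the standard quadratic programming solver for a simple projected gradient descent so that the example needs no external library. The names `project_to_simplex` and `neighbor_weights` are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the shift from the last index that stays positive
    return [max(x - theta, 0.0) for x in v]

def neighbor_weights(X, i, K, alpha, steps=2000):
    """Weights that reconstruct X[i] from its K nearest neighbors (alpha > 0)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: sq_dist(X[i], X[j]))
    nbrs = others[:K]
    k = len(nbrs)
    diffs = [[p - q for p, q in zip(X[i], X[j])] for j in nbrs]
    # Local Gram matrix plus the Tikhonov term: H = G + alpha * I.
    H = [[sum(a * b for a, b in zip(diffs[r], diffs[c])) + (alpha if r == c else 0.0)
          for c in range(k)] for r in range(k)]
    # Step size from a crude Lipschitz bound (twice the largest absolute row sum).
    lr = 1.0 / (2.0 * max(sum(abs(h) for h in row) for row in H))
    w = [1.0 / k] * k
    for _ in range(steps):
        grad = [2.0 * sum(H[r][c] * w[c] for c in range(k)) for r in range(k)]
        w = project_to_simplex([w_r - lr * g for w_r, g in zip(w, grad)])
    return dict(zip(nbrs, w))

# Toy feature vectors: point 0 has two equally close neighbors (1 and 2).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights = neighbor_weights(X, i=0, K=2, alpha=0.7)
```

Points outside the K nearest neighbors do not appear in the returned dictionary, which corresponds to a weight of zero in the similarity matrix.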

Label Propagation
In this study, we used a label propagation algorithm to find unobserved MDAs based on known MDAs and the computed microbe and disease similarities. We first took microbes (or diseases) as nodes and the similarity weight $w_{ij}$ as the edge from node $i$ to node $j$ and constructed a directed graph. The known MDAs were denoted as labels, which were propagated in the microbe graph. In each propagation step, each node updated its label by absorbing label information from its neighbors at the rate of $\beta$ and retaining its initial label at the rate of $1 - \beta$.
Let $Y_i^t = (y_{1i}^t, y_{2i}^t, \ldots, y_{ni}^t)^{\top}$ represent the predicted association scores of the $i$th disease at time $t$, where $y_{ij}^t$ denotes the propensity of disease $d_j$ to be associated with microbe $m_i$. The label propagation process can be defined as:
$$Y_i^{t+1} = \beta W Y_i^t + (1 - \beta) Y_i^0$$
where $W = (w_{ij})$ is the microbe similarity weight matrix and $Y_i^0$ denotes the association profile of disease $d_i$, and $Y_i^t$ will converge to:
$$Y_i = (1 - \beta)(I - \beta W)^{-1} Y_i^0$$
where $Y_i$ is the final MDA score vector based on disease $d_i$, and the predicted entire MDA matrix can be written as:
$$Y^{*} = (1 - \beta)(I - \beta W)^{-1} Y$$
Similarly, we can conduct label propagation based on diseases.
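The propagation step described above can be sketched as a fixed-point iteration. This is a minimal illustration that assumes each row of the weight matrix sums to one, as linear neighborhood weights do; the function name `propagate_labels` and the toy graph are hypothetical.

```python
def propagate_labels(W, Y0, beta=0.1, tol=1e-10, max_iter=1000):
    """Iterate Y <- beta * W Y + (1 - beta) * Y0 until the largest change is below tol."""
    n, m = len(Y0), len(Y0[0])
    Y = [row[:] for row in Y0]
    for _ in range(max_iter):
        new_Y = [[beta * sum(W[i][k] * Y[k][j] for k in range(n))
                  + (1.0 - beta) * Y0[i][j]
                  for j in range(m)] for i in range(n)]
        change = max(abs(new_Y[i][j] - Y[i][j]) for i in range(n) for j in range(m))
        Y = new_Y
        if change < tol:
            break
    return Y

# Toy microbe graph with 3 nodes; each row of W sums to 1.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
# Known associations: microbe 0 with disease 0, microbe 2 with disease 1.
Y0 = [[1.0, 0.0],
      [0.0, 0.0],
      [0.0, 1.0]]
scores = propagate_labels(W, Y0, beta=0.1)
```

Because β < 1 and the rows of W sum to one, each iteration shrinks the error by a factor of at most β, so the loop reaches the same limit as the closed-form solution without a matrix inverse.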

Information Integration
According to different features of microbes and diseases, we can compute different microbe-microbe similarities and disease-disease similarities. Different similarities produce different models and prediction results. Ensemble learning has been validated to be a powerful tool for dealing with high-dimensional and complex data. In this study, we considered diverse features of microbes and diseases and designed a linear combination technique to integrate different results. We assigned different weights to each model and integrated the predicted association scores as follows:
$$Z_{ij} = \sum_{k=1}^{S} \omega_k Y_{ij}^{k}$$
where $S = 3$ denotes the number of different models, $Y_{ij}^{k}$ denotes the predicted association score for microbe-disease pair $(m_i, d_j)$ by the $k$th model, $\omega_k$ denotes the weight of the $k$th model, and $Z_{ij}$ denotes the integrated association prediction score of microbe-disease pair $(m_i, d_j)$. The flowchart is shown in Figure 1, where LNS and LP denote linear neighborhood similarity and label propagation.
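The linear combination of model outputs can be sketched in a few lines. The weights and score matrices below are hypothetical illustrative values, since the model weights are not reported here, and `integrate_scores` is an assumed name.

```python
def integrate_scores(score_matrices, weights):
    """Weighted sum of the score matrices produced by the individual models."""
    rows, cols = len(score_matrices[0]), len(score_matrices[0][0])
    return [[sum(w * Yk[i][j] for w, Yk in zip(weights, score_matrices))
             for j in range(cols)] for i in range(rows)]

# Three toy models scoring one microbe against two diseases.
Y1 = [[0.9, 0.1]]
Y2 = [[0.6, 0.2]]
Y3 = [[0.3, 0.0]]
omega = [0.5, 0.3, 0.2]  # hypothetical model weights
Z = integrate_scores([Y1, Y2, Y3], omega)
```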

Experimental Settings and Evaluation Metrics
We conducted 100 trials of 5-fold cross-validation and averaged the performance to decrease the prediction bias. Three different cross-validations were conducted as follows:
• 5-fold cross-validation 1 (CV1) on microbes: random rows (microbes) in the MDA matrix were masked for testing.
• 5-fold cross-validation 2 (CV2) on diseases: random columns (diseases) in the MDA matrix were masked for testing.
• 5-fold cross-validation 3 (CV3) on microbe-disease pairs: random entries (microbe-disease pairs) in the MDA matrix were masked for testing.
Under CV1, 80% of rows in Y were used as training set in each round. Under CV2, 80% of columns of Y were used as training set. Under CV3, 80% of entries in Y were used as training set. We defined new microbes (or diseases) as the microbes (or diseases) without any associated diseases (or microbes). The three cross-validations refer to MDA identification for new microbes, diseases, and microbe-disease pairs, respectively.
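The three masking schemes can be sketched with one generator. This is an illustration under assumptions: the function name `five_fold_masks` is hypothetical, masked cells are simply reset to zero in the training matrix, and CV3 is read as splitting all entries of Y rather than only the known associations.

```python
import random

def five_fold_masks(Y, mode, seed=0):
    """Yield (train_matrix, test_cells) for CV1 (rows), CV2 (columns) or CV3 (entries)."""
    n, m = len(Y), len(Y[0])
    if mode == "CV1":
        units = [[(i, j) for j in range(m)] for i in range(n)]
    elif mode == "CV2":
        units = [[(i, j) for i in range(n)] for j in range(m)]
    elif mode == "CV3":
        units = [[(i, j)] for i in range(n) for j in range(m)]
    else:
        raise ValueError("mode must be 'CV1', 'CV2' or 'CV3'")
    rng = random.Random(seed)
    rng.shuffle(units)
    for fold in range(5):
        # Every fifth unit, starting at `fold`, forms the test set of this fold.
        test_cells = [cell for unit in units[fold::5] for cell in unit]
        train = [row[:] for row in Y]
        for i, j in test_cells:
            train[i][j] = 0  # masked entries are hidden from training
        yield train, test_cells

# Toy matrix with 10 microbes and 5 diseases.
Y = [[1] * 5 for _ in range(10)]
folds = list(five_fold_masks(Y, "CV1"))
```

Under CV1 a masked microbe has an all-zero row in the training matrix, which matches the definition of a new microbe given above; CV2 does the same for diseases.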
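We conducted a grid search to find the optimal combination of parameters and found that NLLMDA obtained the best performance when $\gamma'_m = 1$, $\gamma'_d = 1$, $\gamma = 0.7$, $\alpha = 0.7$, and $\beta = 0.1$. Sensitivity, specificity, accuracy, and area under the curve (AUC) were used to evaluate the performance. The first three are defined as:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}, \quad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively.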
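The four metrics can be computed from labels and predicted scores as in the sketch below. The decision threshold is a hypothetical choice for illustration, and the AUC is computed with the rank-based definition (the probability that a positive pair scores higher than a negative pair), which is one standard way to obtain it.

```python
def confusion_metrics(labels, scores, threshold):
    """Sensitivity, specificity and accuracy at a given score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy test set: two known associations and three unlabeled pairs treated as negatives.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
metrics = confusion_metrics(labels, scores, threshold=0.45)
area = auc(labels, scores)
```

Unlike the three threshold-based metrics, the AUC depends only on how the pairs are ranked, so it does not require choosing a cutoff.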

Performance Comparison of Six MDA Prediction Methods
We compared the proposed NLLMDA method with five other MDA identification models: KATZHMDA, LRLSHMDA, NGRHMDA, NTSHMDA (Luo and Long, 2018), and MDLPHMDA (Qu et al., 2019). These five methods use, respectively, the KATZ measurement, Laplacian regularized least squares, neighbor- and graph-based recommendation, network topological similarity, and matrix decomposition and label propagation. Tables 1-3 list the performance of the six methods, with the best value in each column in boldface. Because we took all unlabeled microbe-disease pairs as negative MDA samples when computing specificity and accuracy, the two measurements are almost identical to four decimal places under all three cross-validations.
Table 1 shows the sensitivity, specificity, accuracy, and AUC values obtained by KATZHMDA, LRLSHMDA, NGRHMDA, NTSHMDA, MDLPHMDA, and NLLMDA under CV1. None of the six MDA prediction methods achieved good sensitivity, specificity, accuracy, or AUC under CV1. We suspect that this results from differences in the structure of the data.
Table 2 lists the performance of the six MDA prediction models under CV2. In this cross-validation experiment, NLLMDA achieved the best sensitivity and AUC. In particular, its sensitivity exceeded that of KATZHMDA, LRLSHMDA, NTSHMDA, NGRHMDA, and MDLPHMDA by 4.69, 20.42, 9.32, 56.45, and 16.14%, respectively, and its AUC exceeded that of the same five methods by 4.09, 10.46, 8.18, 8.94, and 9.45%, respectively. AUC is a more important evaluation metric than the other three metrics. Therefore, NLLMDA obtained better performance and is more appropriate for finding microbes associated with a new disease.
Frontiers in Microbiology | www.frontiersin.org
Table 3 shows the predictive results of the proposed NLLMDA method and the other five MDA identification methods under CV3. The sensitivity and AUC of NLLMDA were significantly higher than those of the other five methods. In particular, its sensitivity exceeded that of KATZHMDA, LRLSHMDA, NTSHMDA, NGRHMDA, and MDLPHMDA by 7.84, 11.09, 4.68, 53.07, and 7.77%, respectively, and its AUC exceeded that of the same five methods by 8.18, 5.80, 4.70, 3.32, and 4.25%, respectively. Because AUC is a more important evaluation metric than the other three measurements, NLLMDA outperformed the other five MDA prediction models and is an effective MDA prediction method. Figures 2-4 show the AUC values obtained by all six MDA prediction models under the three cross-validations.

Case Study
We further analyzed the performance of NLLMDA with two case studies, aiming to find possible microbes associated with colon cancer and colorectal cancer. Although a rare population of undifferentiated cells is closely associated with tumor formation and maintenance, this has not yet been found for colon cancer. In addition, colorectal carcinoma is closely associated with specific eating patterns that affect the gut microbiota (Garrett, 2019). The gastrointestinal tract is densely populated with microorganisms. Therefore, we predicted the top 20 microbes associated with each of the two cancers. The results are shown in Tables 4 and 5. Table 4 shows the predicted top 20 microbes associated with colon cancer. The 20 associations are not included in the known [...] (Saulnier et al., 2011). We found that Acidobacteriaceae may be associated with colon cancer with the highest linkage probability. Similarly, Table 5 lists the predicted top 20 microbes associated with colorectal carcinoma. The 20 MDAs are not included in the HMDAD. Among the 20 MDAs, 15 were reported by related publications. That is, 75% of the MDAs have been confirmed in the literature. In addition, Tannerella forsythia is a bacterial pathogen related to human periodontitis, which is a polymicrobial inflammatory disease of the tooth-surrounding tissues. It is closely associated with periodontitis, liver cirrhosis, atherosclerosis, and esophageal adenocarcinoma (Jorth et al., 2014; Qin et al., 2014; Bale et al., 2017; Sharma, 2020). The results showed that Tannerella may be closely linked with colorectal carcinoma.
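Selecting candidates for a case study amounts to ranking the microbes that have no known association with the disease by their integrated score. A minimal sketch follows; the function name, the placeholder microbe names, and the toy scores are hypothetical and are not taken from Tables 4 and 5.

```python
def top_candidates(Z, Y, disease_index, microbe_names, k=20):
    """Rank microbes without a known association to the disease by predicted score."""
    unknown = [(Z[i][disease_index], microbe_names[i])
               for i in range(len(Y)) if Y[i][disease_index] == 0]
    unknown.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in unknown[:k]]

# Toy data: three microbes, one disease; microbe_A is already a known association.
Z = [[0.9], [0.3], [0.7]]
Y = [[1], [0], [0]]
names = ["microbe_A", "microbe_B", "microbe_C"]
ranking = top_candidates(Z, Y, disease_index=0, microbe_names=names, k=2)
```

Known associations are excluded before ranking, so every returned pair is a new prediction that can then be checked against the literature.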

DISCUSSION
Microbes are widely distributed in various species and play important roles in many biological processes. Many human diseases, for example, intestinal diseases, involve microorganisms. Therefore, finding potential associations between microbes and diseases can improve the understanding of the pathogenic mechanisms of diseases and support drug research and development.
Traditional experimental methods used for MDA identification are costly and time-consuming. Computational models were designed to uncover new MDAs. However, the prediction performance of computational methods still needs improvement. Therefore, NLLMDA was developed to find MDA candidates based on negative MDA selection, linear neighborhood similarity, label propagation, information integration, and known biological data. Experimental results showed that NLLMDA obtained better sensitivity and AUC than the five compared methods under CV2 and CV3. We then analyzed two cases, colon cancer and colorectal carcinoma. We found the top 20 microbes associated with each of the two diseases; these predictions need further experimental confirmation.
The better predictive performance of NLLMDA may be attributed to the following characteristics. First, it selects credible negative MDA samples. Second, it uses linear neighborhood similarity to take neighborhood information into account. Third, it integrates the prediction results obtained from the three computed similarity scores.
In the future, we will first integrate more biological features related to microbes and diseases to reflect the biological information of the two entities more completely. Second, we will design more robust algorithms to extract high-quality negative MDA samples. Finally, we will explore more effective models, such as deep learning, to improve MDA prediction accuracy.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
JC conceived, designed, and managed the study. YC, CS, and HMS proposed the computational models. YC wrote the manuscript. XS and BJ revised the original draft. HJS and MS discussed the computational models and gave the conclusion. All authors read and approved the final manuscript.