Inferring Disease-Associated Microbes Based on Multi-Data Integration and Network Consistency Projection

Many microbes in the human body play a vital role in cell physiology. In recent years, accumulating evidence has indicated that microbes are closely related to many complex human diseases. In-depth investigation of disease-associated microbes can contribute to understanding the pathogenesis of diseases and thus provide novel strategies for their treatment, diagnosis, and prevention. To date, many computational models have been proposed for predicting microbe–disease associations using available similarity networks. However, these similarity networks are not effectively fused. In this study, we propose a novel computational model based on multi-data integration and network consistency projection for Human Microbe–Disease Associations Prediction (HMDA-Pred), which fuses multiple similarity networks by a linear network fusion method. HMDA-Pred yielded AUC values of 0.9589 and 0.9361 ± 0.0037 in leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-fold CV), respectively. Furthermore, in case studies, 10, 8, and 10 of the top 10 predicted microbes for asthma, colon cancer, and inflammatory bowel disease, respectively, were confirmed by the literature.


INTRODUCTION
Microbes are ubiquitous in our living environment and occupy nearly all habitats, including humans and animals (Kouzuma et al., 2015). According to the existing literature, the microbes in the human body are mainly classified into fungi, archaea, bacteria, protozoa, and viruses (Methé et al., 2012; Sommer and Bäckhed, 2013). A growing number of studies have shown that most of these microbes are friendly to human beings and play a significant role in the physiological processes of the human body, such as regulating gastrointestinal development, providing protection against pathogens, and enhancing metabolic capability (Ventura et al., 2009). Specifically, the overwhelming majority of microbes in an adult inhabit the gastrointestinal tract, where they not only synthesize essential vitamins and amino acids but also promote the digestion of otherwise indigestible components of the human diet. Thus, abnormal changes in microbial communities may affect human health and disease. For example, low microbial diversity could result in inflammatory bowel disease and obesity (Turnbaugh et al., 2009; Qin et al., 2010), whereas high microbial diversity in the vagina is associated with bacterial vaginosis (Fredricks et al., 2005). Researchers have confirmed the close relationship between microbes and diseases. Some microbes may cause various diseases, such as colon cancer (Sears and Garrett, 2014), kidney stones (Hoppe et al., 2011), asthma (Hilty et al., 2010), colorectal carcinoma (Sobhani et al., 2011; Kostic et al., 2012), and inflammatory bowel disease (Frank et al., 2007). On the one hand, uncovering disease-associated microbes can contribute to a better understanding of the pathogenesis of diseases. On the other hand, understanding the mechanisms by which microbes act in diseases provides novel strategies for the prevention, diagnosis, and treatment of those diseases (Zou et al., 2017; Peng et al., 2018).
Unfortunately, traditional biological experiments for uncovering the relationships between microbes and diseases are time-consuming and costly. Thus, there is an urgent need for computational models that predict disease-associated microbes.
In recent years, researchers have developed a number of feasible and effective prediction models for microbe-disease associations, which can provide the most promising disease-associated microbes for experimental verification. For example, based on the hypothesis that functionally similar microbes tend to be associated with similar diseases, Chen et al. (2016) proposed using the KATZ measure to predict human microbe-disease associations (KATZHMDA) on a large scale. A path-based approach (PBHMDA) applies a specially designed depth-first search algorithm on a heterogeneous network to reveal the microbes that are likely to be associated with a disease. Wang et al. (2017) developed a machine learning-based computational approach called LRLSHMDA, which calculates the association scores for microbe-disease pairs based on the known microbe-disease association network. Another computational method (NGRHMDA) predicts microbe-disease associations by applying a collaborative recommendation model on a graph. Bao et al. (2017) proposed the computational model NCPHMDA, which combines space consistency projection scores for diseases and microbes to predict latent disease-associated microbes. Zou et al. (2017) put forward a prediction model called BiRWHMDA, which simultaneously performs random walks on the microbe similarity network and the disease similarity network to uncover potential microbe-disease associations. Shi et al. (2018) proposed a predictive method based on Binary Matrix Completion (BMCMDA) for inferring microbe-disease associations.
However, the abovementioned methods have shortcomings in uncovering microbe-disease associations. Multiple similarity networks are available for predicting disease-microbe associations, but most previous methods operate on individual networks and ignore the complementarity between different similarity networks. How to better fuse them is still worth investigating. In this paper, to address this limitation, we present a novel computational model of multi-data integration and network consistency projection for prediction of Human Microbe-Disease Associations (HMDA-Pred), which integrates multiple similarity networks to boost the performance of human microbe-disease association prediction. To begin with, the Gaussian interaction profile kernel similarity network and the cosine similarity network for microbes and diseases were constructed based on known microbe-disease associations. Subsequently, we integrated the Gaussian interaction profile kernel similarity network of microbes and the cosine similarity network of microbes by a linear network fusion method. In the same way, we integrated the Gaussian interaction profile kernel similarity network of diseases and the cosine similarity network of diseases. Finally, we applied the network consistency projection algorithm to uncover the microbe-disease associations. Two evaluation strategies were used to assess the performance of HMDA-Pred: leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-fold CV). Related data and source code are available online at: https://github.com/AugustMe/HMDA-Pred.

Known Microbe-Disease Associations
We used the same microbe-disease associations as the existing literature (Chen et al., 2016; Peng et al., 2018). The dataset was derived from the Human Microbe-Disease Association Database, HMDAD (Ma et al., 2016), which collected 483 microbe-disease associations from the literature. After removing duplicate associations, we obtained 450 unique associations between 292 microbes and 39 diseases. Then, we constructed an adjacency matrix MD(nm × nd) to describe the associations between microbes and diseases, where nm and nd represent the number of microbes and diseases, respectively. If microbe m(i) has been shown to be associated with disease d(j), the value of MD(i, j) is 1; otherwise, it is 0. A value of 0 means only that there is no evidence yet that microbe m(i) is associated with disease d(j).
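As an illustration of this construction, the following minimal sketch builds a binary adjacency matrix from a list of (microbe, disease) records. The function name `build_adjacency` and the toy records are hypothetical and are not taken from HMDAD; only the 0/1 encoding and the removal of duplicates follow the description above.

```python
def build_adjacency(pairs):
    """Build a binary microbe-by-disease matrix from (microbe, disease) pairs.

    Duplicate pairs collapse to a single 1, mirroring the de-duplication step.
    """
    microbes = sorted({m for m, _ in pairs})
    diseases = sorted({d for _, d in pairs})
    m_index = {m: i for i, m in enumerate(microbes)}
    d_index = {d: j for j, d in enumerate(diseases)}
    md = [[0] * len(diseases) for _ in microbes]
    for m, d in pairs:
        md[m_index[m]][d_index[d]] = 1
    return md, microbes, diseases


# Hypothetical toy records (not real HMDAD entries); the last one is a duplicate.
toy_pairs = [
    ("microbe_a", "disease_x"),
    ("microbe_a", "disease_y"),
    ("microbe_b", "disease_y"),
    ("microbe_c", "disease_x"),
    ("microbe_a", "disease_x"),
]
MD, microbes, diseases = build_adjacency(toy_pairs)
n_associations = sum(map(sum, MD))  # number of unique associations
```

In this toy case the duplicate record collapses into a single entry, so the matrix holds four associations rather than five.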
In addition, we analyzed the degree distribution of the microbe-disease association network (Table 1 and Figure 1). The degree of a disease is the number of microbes related to that disease, and the degree of a microbe is the number of diseases related to that microbe. In the left graph of Figure 1, the abscissa indicates the disease degree, i.e., how many microbes are related to each disease, and the ordinate gives the number of diseases with that degree. In the right graph of Figure 1, the abscissa indicates the microbe degree, i.e., how many diseases are related to each microbe, and the ordinate gives the number of microbes with that degree. On average, each disease is related to 11.54 microbes and each microbe is involved in 1.54 diseases.
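As a quick consistency check on these averages, assuming they are simply the 450 unique associations divided by the number of diseases and by the number of microbes, respectively:

$$\frac{450}{39} \approx 11.54, \qquad \frac{450}{292} \approx 1.54$$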

Gaussian Interaction Profile Kernel Similarity for Diseases and Microbes
According to the hypothesis that functionally similar microbes show similar association patterns with diseases (Chen et al., 2016), we constructed Gaussian interaction profile kernel similarity networks for microbes and for diseases based on the adjacency matrix MD. First, a binary vector GIP(m(i)) represents the interaction profile of microbe m(i), recording whether m(i) has a known association with each disease (i.e., the ith row of adjacency matrix MD). Second, the Gaussian interaction profile kernel similarity between microbe m(i) and microbe m(j) is defined as follows:

$$KM(m(i), m(j)) = \exp\left(-\lambda_m \left\| GIP(m(i)) - GIP(m(j)) \right\|^2\right) \tag{1}$$

$$\lambda_m = \lambda'_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \left\| GIP(m(i)) \right\|^2 \right) \tag{2}$$

where $\lambda_m$ is a regulation parameter that controls the kernel bandwidth and is obtained by normalizing a new parameter $\lambda'_m$ by the average squared norm of the microbe interaction profiles. For the sake of simplicity, we set $\lambda'_m$ to 1 according to previous studies (van Laarhoven et al., 2011; Chen and Yan, 2013).
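In the same way, the Gaussian interaction profile kernel similarity between disease d(i) and disease d(j) was calculated as follows:

$$KD(d(i), d(j)) = \exp\left(-\lambda_d \left\| GIP(d(i)) - GIP(d(j)) \right\|^2\right) \tag{3}$$

$$\lambda_d = \lambda'_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \left\| GIP(d(i)) \right\|^2 \right) \tag{4}$$

where GIP(d(i)) represents the interaction profile of disease d(i) (i.e., the ith column of adjacency matrix MD). Here, the parameter $\lambda_d$ has the same meaning as $\lambda_m$, and we also set $\lambda'_d$ to 1 (van Laarhoven et al., 2011; Chen and Yan, 2013).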
In the end, we obtained the microbe Gaussian interaction profile kernel similarity matrix KM(nm × nm) and the disease Gaussian interaction profile kernel similarity matrix KD(nd × nd).
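The kernel can be sketched in a few lines of plain Python. This assumes the standard formulation of van Laarhoven et al. (2011), in which the bandwidth is the chosen parameter divided by the mean squared norm of the interaction profiles. The names `gip_kernel`, `lam_prime`, and `transpose`, as well as the toy matrix, are illustrative and are not the authors' implementation.

```python
import math


def gip_kernel(profiles, lam_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes at least one non-zero entry, so the mean squared norm is positive.
    """
    n = len(profiles)
    # Bandwidth: lam_prime divided by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    lam = lam_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-lam * sq_dist)
    return kernel


def transpose(matrix):
    """Swap rows and columns so that disease profiles become rows."""
    return [list(col) for col in zip(*matrix)]


# Toy 3-microbe x 2-disease adjacency matrix (illustrative only).
MD = [[1, 1], [0, 1], [1, 0]]
KM = gip_kernel(MD)             # microbe kernel, rows of MD as profiles
KD = gip_kernel(transpose(MD))  # disease kernel, columns of MD as profiles
```

Each microbe profile is a row of MD and each disease profile is a column, which is why the disease kernel is computed on the transposed matrix.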

Cosine Similarity for Diseases and Microbes
The calculation of disease cosine similarity is based on the assumption that if disease d(i) and disease d(j) are similar to each other, then, in the microbe-disease association matrix, pattern MD(:, i) (i.e., the ith column of adjacency matrix MD) and pattern MD(:, j) (i.e., the jth column of adjacency matrix MD) should be similar to each other. The same assumption should also hold for microbes. Therefore, the cosine similarity between disease d(i) and disease d(j) is defined as follows:

$$CD(d(i), d(j)) = \frac{MD(:, i) \cdot MD(:, j)}{\left\| MD(:, i) \right\| \left\| MD(:, j) \right\|} \tag{5}$$

After calculating the cosine similarity of each pair of diseases, the disease cosine similarity matrix CD(nd × nd) can be constructed. Similarly, the cosine similarity between microbe m(i) and microbe m(j) is given by:

$$CM(m(i), m(j)) = \frac{MD(i, :) \cdot MD(j, :)}{\left\| MD(i, :) \right\| \left\| MD(j, :) \right\|} \tag{6}$$

where MD(i,:) represents the ith row of adjacency matrix MD. After calculating the cosine similarity of each pair of microbes, the microbe cosine similarity matrix CM(nm × nm) can be constructed.
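A minimal sketch of the cosine similarity computation, written in plain Python for self-containment. The function name `cosine_similarity`, the handling of all-zero profiles (similarity set to 0), and the toy matrix are assumptions made for illustration.

```python
import math


def cosine_similarity(profiles):
    """Pairwise cosine similarity between the rows of `profiles`."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in profiles]
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                continue  # assumption: an all-zero profile gets similarity 0
            dot = sum(a * b for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Toy 3-microbe x 2-disease adjacency matrix (illustrative only).
MD = [[1, 1], [0, 1], [1, 0]]
CM = cosine_similarity(MD)                              # rows: microbes
CD = cosine_similarity([list(c) for c in zip(*MD)])     # columns: diseases
```

In the toy matrix, the second and third microbes share no disease, so their cosine similarity is 0, while the two diseases share one of their two microbes each, giving a similarity of 0.5.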

Integrated Similarity for Diseases and Microbes
To make full use of the disease Gaussian interaction profile kernel similarity matrix KD and the disease cosine similarity matrix CD, a comprehensive disease similarity matrix DS(nd × nd) was constructed by integrating the two. We proposed a linear network fusion (LNF) method to integrate KD and CD, defined as follows:

$$DS(d(i), d(j)) = \alpha \, KD(d(i), d(j)) + (1 - \alpha) \, CD(d(i), d(j)) \tag{7}$$

where DS(d(i), d(j)) represents the integrated similarity between disease d(i) and disease d(j), and α (0 < α < 1) is the weight that balances the two disease similarity matrices.
In the same way, the microbe Gaussian interaction profile kernel similarity matrix KM and the microbe cosine similarity matrix CM are integrated into a comprehensive microbe similarity matrix MS(nm × nm) as follows:

$$MS(m(i), m(j)) = \beta \, KM(m(i), m(j)) + (1 - \beta) \, CM(m(i), m(j)) \tag{8}$$

where MS(m(i), m(j)) represents the integrated similarity between microbe m(i) and microbe m(j), and β (0 < β < 1) is the weight that balances the two microbe similarity matrices.
In the end, we obtained a comprehensive microbe similarity matrix MS and a comprehensive disease similarity matrix DS, respectively.
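The fusion step amounts to an entry-wise weighted average of two matrices of the same shape. The sketch below assumes that the weight multiplies the Gaussian interaction profile kernel matrix and that its complement multiplies the cosine matrix; the function name `linear_network_fusion` and the toy matrices are hypothetical.

```python
def linear_network_fusion(gip, cos, weight):
    """Entry-wise convex combination: weight * gip + (1 - weight) * cos."""
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie strictly between 0 and 1")
    return [
        [weight * g + (1.0 - weight) * c for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(gip, cos)
    ]


# Toy 2 x 2 disease similarity matrices (illustrative values only).
KD = [[1.0, 0.4], [0.4, 1.0]]
CD = [[1.0, 0.5], [0.5, 1.0]]
DS = linear_network_fusion(KD, CD, weight=0.3)  # alpha = 0.3
```

Because both inputs are symmetric with ones on the diagonal, the fused matrix keeps those properties for any weight in the open interval (0, 1).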

HMDA-Pred
HMDA-Pred is a network-based computational approach for inferring disease-associated microbes based on the network consistency projection (NCP) algorithm. The flowchart of HMDA-Pred is shown in Figure 2. To begin with, based on known microbe-disease associations, we calculated the Gaussian interaction profile kernel similarity matrix and the cosine similarity matrix for microbes and for diseases. Then, we integrated the two similarity matrices for microbes and the two for diseases through LNF. Finally, we uncovered the microbe-disease associations using the scores obtained from the network consistency projection algorithm. The NCP algorithm has been successfully used to measure the similarity between nodes in link prediction problems on heterogeneous networks (Gu et al., 2016; Bao et al., 2017). The following describes how the NCP algorithm works in HMDA-Pred.
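First, we calculated the disease space projection score as follows:

$$NCPD(i, j) = \frac{MD(i, :) \times DS(:, j)}{\left| MD(i, :) \right|} \tag{9}$$

where MD(i,:) is composed of the associations of microbe m(i) with all diseases (i.e., the ith row of adjacency matrix MD), DS(:, j) is composed of the similarities between disease d(j) and all diseases (i.e., the jth column of similarity matrix DS), and |MD(i,:)| represents the norm of MD(i,:). NCPD(i, j) represents the projection score of microbe m(i) and disease d(j) from the projection space of disease. Second, we calculated the microbe space projection score as follows:

$$NCPM(i, j) = \frac{MS(i, :) \times MD(:, j)}{\left| MD(:, j) \right|} \tag{10}$$

where MD(:, j) is composed of the associations of disease d(j) with all microbes (i.e., the jth column of adjacency matrix MD), MS(i,:) is composed of the similarities between microbe m(i) and all microbes (i.e., the ith row of similarity matrix MS), and |MD(:, j)| represents the norm of MD(:, j). NCPM(i, j) represents the projection score of microbe m(i) and disease d(j) from the projection space of microbe. Finally, we combined and normalized NCPD and NCPM as follows:

$$NCP(i, j) = \frac{NCPD(i, j) + NCPM(i, j)}{\left| DS(:, j) \right| + \left| MS(i, :) \right|} \tag{11}$$

NCP is the final probability matrix of microbe-disease associations, and the element NCP(i, j) represents the final network consistency projection score of microbe m(i) and disease d(j).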
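The projection step can be sketched as follows, assuming the usual form of network consistency projection (Gu et al., 2016; Bao et al., 2017): each association profile is projected onto the matching similarity vector, and the two projections are normalized by the sum of the norms of the similarity vectors. The function names, the rule that an all-zero profile contributes a projection of 0, and the toy matrices are assumptions, not the authors' code.

```python
import math


def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ncp_scores(md, ms, ds):
    """Combine disease-space and microbe-space projections into one score matrix."""
    nm, nd = len(md), len(md[0])
    md_cols = [[md[i][j] for i in range(nm)] for j in range(nd)]
    ds_cols = [[ds[k][j] for k in range(nd)] for j in range(nd)]
    scores = [[0.0] * nd for _ in range(nm)]
    for i in range(nm):
        for j in range(nd):
            row_norm, col_norm = norm(md[i]), norm(md_cols[j])
            # Disease space: project microbe i's association row onto DS(:, j).
            ncp_d = dot(md[i], ds_cols[j]) / row_norm if row_norm else 0.0
            # Microbe space: project disease j's association column onto MS(i, :).
            ncp_m = dot(ms[i], md_cols[j]) / col_norm if col_norm else 0.0
            scores[i][j] = (ncp_d + ncp_m) / (norm(ds_cols[j]) + norm(ms[i]))
    return scores


# Toy inputs (illustrative values only): 3 microbes, 2 diseases.
MD = [[1, 1], [0, 1], [1, 0]]
MS = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]]
DS = [[1.0, 0.5], [0.5, 1.0]]
NCP = ncp_scores(MD, MS, DS)
```

In this toy example the pair (second microbe, first disease) has no known association, yet it receives a positive score because the microbe and the disease are each similar to entities that are associated.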

Performance Evaluation
To make the evaluation criteria consistent with existing methods, we performed LOOCV and 5-fold CV on our benchmark dataset. Both schemes are widely used not only in machine learning classification tasks based on sequence feature analysis but also in biological association prediction problems (Chen et al., 2016; Wang et al., 2017; Liu, 2019; Liu et al., 2019). For LOOCV, each of the 450 confirmed microbe-disease association pairs was used in turn as the test sample, while the remaining 449 associations were used as the training samples. For 5-fold CV, we randomly divided the 450 confirmed microbe-disease association pairs into five subsets, with each subset used in turn as the test samples and the remaining four subsets as the training samples. The 5-fold CV was repeated 100 times to reduce the bias introduced by random splitting.
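The two splitting schemes can be sketched as follows. In each split, the held-out associations are set to 0 in a copy of the adjacency matrix, which then serves as the training matrix. The function names and the toy matrix are hypothetical, and the text does not state whether the similarity matrices are recomputed for every split, so that step is left out here.

```python
import random


def known_pairs(md):
    """List the (row, column) positions of all known associations."""
    return [(i, j) for i, row in enumerate(md) for j, v in enumerate(row) if v == 1]


def loocv_splits(md):
    """Yield (held_out_pair, training_matrix) for each known association."""
    for i, j in known_pairs(md):
        train = [row[:] for row in md]
        train[i][j] = 0  # hide the single test association
        yield (i, j), train


def five_fold_splits(md, seed=0, k=5):
    """Yield (test_pairs, training_matrix) for k random folds of known pairs."""
    pairs = known_pairs(md)
    random.Random(seed).shuffle(pairs)
    for fold in range(k):
        test = pairs[fold::k]
        train = [row[:] for row in md]
        for i, j in test:
            train[i][j] = 0  # hide every association in this fold
        yield test, train


# Toy adjacency matrix with six known associations (illustrative only).
MD = [[1, 1], [0, 1], [1, 0], [1, 1]]
loo = list(loocv_splits(MD))
folds = list(five_fold_splits(MD, seed=1))
```

Repeating the 5-fold procedure with different seeds and averaging the resulting AUC values corresponds to the repeated 5-fold CV described above.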
To visualize the performance of HMDA-Pred, the receiver operating characteristic (ROC) curve was used to plot the relationship between the false-positive rate (1-specificity, 1-Spe) and the true-positive rate (sensitivity, Sen). The area under the ROC curve (AUC) was calculated; a value of 1 represents perfect prediction performance, while 0.5 indicates purely random prediction performance (Chen et al., 2012, 2016; Fan and Shen, 2014; Pan and Shen, 2018). Moreover, we used the area under the precision-recall (PR) curve (AUPR) as another indicator for model evaluation Shen, 2019, 2020). In addition, we adopted accuracy (Acc), precision (Pre), Matthews's correlation coefficient (MCC), and F1 score (F1) to further evaluate the model. They are defined as follows:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$Pre = \frac{TP}{TP + FP} \tag{13}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{14}$$

$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{15}$$

where TP represents the number of known microbe-disease associations that are correctly identified, FP represents the number of unknown microbe-disease associations that are incorrectly identified, TN represents the number of unknown microbe-disease associations that are correctly identified, and FN represents the number of known microbe-disease associations that are incorrectly identified.
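These indicators can be computed directly from the four counts, and the AUC can be computed from ranked scores. The sketch below uses the standard definitions; the AUC is obtained through the equivalent rank formulation (the probability that a known association is scored above an unknown one, with ties counted as one half). The function names and the toy numbers are illustrative only.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, MCC, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Acc": acc, "Pre": pre, "MCC": mcc, "F1": f1}


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy counts and scores (not results from the paper).
metrics = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a sort-based implementation would be preferable on the full score matrix.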

Parameter Selection
In this study, the parameters to be tuned are α and β in LNF. We varied α and β from 0.1 to 0.9 with a step size of 0.1 and ran LOOCV on the benchmark dataset for each combination. As shown in Table 2, HMDA-Pred achieves the best AUC when α is 0.3 and β is 0.6.
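The parameter search is an exhaustive scan over a 9 × 9 grid. In the sketch below, `toy_auc` is a hypothetical stand-in for a full LOOCV run; it is not the measured AUC surface and is shaped only so that its maximum sits at the reported optimum. In practice, `evaluate` would build the fused similarity matrices for the given weights and return the LOOCV AUC.

```python
from itertools import product


def grid_search(evaluate):
    """Return the (alpha, beta) pair on the 0.1..0.9 grid that maximizes `evaluate`."""
    grid = [k / 10 for k in range(1, 10)]
    return max(product(grid, grid), key=lambda pair: evaluate(*pair))


def toy_auc(alpha, beta):
    # Hypothetical stand-in for one LOOCV run; NOT the measured AUC surface.
    return 1.0 - (alpha - 0.3) ** 2 - (beta - 0.6) ** 2


best_alpha, best_beta = grid_search(toy_auc)
```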

Comparison With Other Integration Strategies
The similarity integration strategy proposed in this study is a linear network fusion (LNF) method. To assess its integration performance, we compared LNF with two common similarity fusion strategies: similarity network fusion (SNF) (Zheng et al., 2017) and similarity kernel fusion (SKF) (Jiang et al., 2018; Xie et al., 2019). Figure 3 shows the ROC curves of the three integration methods under the LOOCV scheme. LNF achieved an AUC of 0.9589, while SNF and SKF achieved 0.9437 and 0.8843, respectively. Therefore, within HMDA-Pred, LNF outperforms the other two fusion methods in terms of the prediction accuracy of microbe-disease associations.

Comparison With Single Similarity
In this study, we proposed to integrate different similarity data of microbes (i.e., Gaussian interaction profile kernel similarity and cosine similarity for microbes) and different similarity data of diseases (i.e., Gaussian interaction profile kernel similarity and cosine similarity for diseases) by LNF. The effect of integration was verified by comparative experiments covering all combinations of single similarity data of diseases and microbes. The experimental results are shown in Table 3. The proposed strategy of using LNF to integrate Gaussian interaction profile kernel similarity data and cosine similarity data yielded the highest AUC values in LOOCV and 5-fold CV, which were 0.9589 and 0.9361 ± 0.0037, respectively.

Comparison With Other Existing Methods
To further verify the predictive performance of HMDA-Pred, we compared it with three state-of-the-art methods for predicting microbe-disease associations, namely, KATZHMDA (Chen et al., 2016), BiRWHMDA (Zou et al., 2017), and LRLSHMDA (Wang et al., 2017). Figure 4 shows the comparison of AUC values between the methods on the benchmark dataset. In LOOCV, the AUC values of KATZHMDA, BiRWHMDA, LRLSHMDA, and HMDA-Pred are 0.8873, 0.8284, 0.8816, and 0.9589, respectively. In 5-fold CV repeated 100 times, the AUC values of KATZHMDA, BiRWHMDA, LRLSHMDA, and HMDA-Pred are 0.8428 ± 0.0035, 0.7984 ± 0.0027, 0.8410 ± 0.0052, and 0.9361 ± 0.0037, respectively. In the benchmark dataset, known microbe-disease associations are far fewer than unknown ones, so the dataset is imbalanced. Therefore, the AUPR value (area under the PR curve), which reflects the balance between recall and precision, is an indispensable indicator for comparing the performance of different methods on an imbalanced dataset. Based on the benchmark dataset, we plotted the PR curve of each method and calculated its AUPR value by LOOCV. As shown in Figure 5, the AUPR values of HMDA-Pred, BiRWHMDA, KATZHMDA, and LRLSHMDA are 0.6510, 0.4363, 0.4782, and 0.5045, respectively, which indicates that HMDA-Pred performs better than the other three methods on the imbalanced dataset.

Case Studies
In this section, we investigated the top 10 microbes predicted by HMDA-Pred to be potentially associated with asthma, colon cancer, and inflammatory bowel disease, respectively. Then, we validated the predicted results by searching the relevant literature, with the purpose of further evaluating the performance of HMDA-Pred.
Asthma is a common chronic disease, generally considered to be caused by a combination of genetic and environmental factors (Althani et al., 2016). All of the top 10 microbes predicted by HMDA-Pred have been confirmed in the relevant literature to be potentially related to asthma, as shown in Table 5. Colon cancer is a common gastrointestinal malignant tumor with high morbidity and mortality (Bao et al., 2017). We selected the top 10 microbes predicted by HMDA-Pred to be potentially related to colon cancer, and by searching the relevant literature, we confirmed that 8 of them were related to colon cancer, as shown in Table 6. Inflammatory bowel disease, also known as non-specific enteritis or idiopathic enteritis, has an etiology that is not yet completely clear, and there is currently no cure for it (Wu et al., 2018). The top 10 microbes most likely to be associated with inflammatory bowel disease were predicted by HMDA-Pred, and all of them were confirmed by the relevant literature, as shown in Table 7.

DISCUSSION
Effective computational methods can predict microbe-disease associations in a more efficient and low-cost manner, and thus serve as an important aid to biological experiments. In this study, we present a novel prediction method called HMDA-Pred, which infers disease-associated microbes from known microbe-disease associations, Gaussian interaction profile kernel similarity for microbes and diseases, and cosine similarity for microbes and diseases. HMDA-Pred achieved AUC values of 0.9589 and 0.9361 ± 0.0037 in LOOCV and 5-fold CV, respectively. In addition, we conducted case studies of asthma, colon cancer, and inflammatory bowel disease to further validate the predictive performance of HMDA-Pred, where 10, 8, and 10 of the top 10 candidate microbes, respectively, were confirmed by the literature. Given this performance, we expect HMDA-Pred to be a promising and effective tool for assisting clinical and biological research.
There are several reasons why HMDA-Pred performs well in microbe-disease association prediction. First, the datasets used in HMDA-Pred are relatively reliable. Second, a linear network fusion method is used to fuse multiple similarity networks into an informative matrix. Third, network consistency projection executed on the microbe and disease spatial networks is efficient and reliable. There is also room for improvement of HMDA-Pred in future work. First, although the predictive performance of HMDA-Pred has improved compared with previous methods, it could be further improved if more reliable similarities are considered, such as the semantic similarity of diseases and the functional similarity of microbes. Second, because of data imbalance, HMDA-Pred is inevitably biased toward diseases with more known related microbes.

DATA AVAILABILITY STATEMENT
All datasets and the link to the code for this study are included in the article.

AUTHOR CONTRIBUTIONS
QZ developed the prediction model and designed the experiments. QZ, YF, and MC analyzed the experiment and results and wrote the manuscript. MC and WW proofread the manuscript. All authors contributed to the article and approved the submitted version.