A Novel Approach Based on Bipartite Network Recommendation and KATZ Model to Predict Potential Micro-Disease Associations

Accumulating evidence indicates that the microbes colonizing human bodies have crucial effects on human health and the discovery of disease-related microbes will promote the discovery of biomarkers and drugs for the prevention, diagnosis, treatment, and prognosis of diseases. However clinical experiments of disease-microbe associations are time-consuming, laborious and expensive, and there are few methods for predicting potential microbe-disease association. Therefore, developing effective computational models utilizing the accumulated public data of clinically validated microbe-disease associations to identify novel disease-microbe associations is of practical importance. We propose a novel method based on the KATZ model and Bipartite Network Recommendation Algorithm (KATZBNRA) to discover potential associations between microbes and diseases. We calculate the Gaussian interaction profile kernel similarity of diseases and microbes based on validated disease-microbe associations. Then, we construct a bipartite graph and execute a bipartite network recommendation algorithm. Finally, we integrate the disease similarity, microbe similarity and bipartite network recommendation score to obtain the final score, which is used to infer whether there are some novel disease-microbe interactions. To evaluate the predictive power of KATZBNRA, we tested it with the walk length 2 using global leave-one-out cross validation (LOOV), two-fold and five-fold cross validations, with AUCs of 0.9098, 0.8463 and 0.8969, respectively. The test results also show that KATZBNRA is more accurate than two recent similar methods KATZHMDA and BNPMDA.


INTRODUCTION
A microbe is a microscopic organism, including bacteria, eukaryotes, archaea, and viruses (Wu et al., 2018). Various types of microbes live on or in different parts of a human body such as the skin, mouth, hair, stomach, and gastrointestinal tract. An adult human body contains a large number of bacterial cells, which is estimated to reach 10 14 and much more than the total number of human cells, with more than 5 million microbe genes, outnumbering the human genes by more than 100 fold (Sommer and Backhed, 2013). Most microbes are harmless and some are beneficial to humans (Grice and Segre, 2011). Recently, accumulated experimental evidence shows that microbes have an important impact on human health, nutrient absorption, immune response, cancer control, and the prevention of pathogen colonization (Wu et al., 2018). For example, the gut microbiota could significantly contribute nutrition absorption by producing indispensable vitamins and decomposing indigestible polysaccharides, and it also has an important impact on the mucus layer, the balance of antimicrobial peptides, and immunoglobulin A, and the differentiation and activation of some lymphocyte populations (Sommer and Backhed, 2013). Therefore, the gut microbiota is thought to be an extra 'organ' of humans (Gill et al., 2006). But some microbes may contribute to disease, such as psoriasis and inflammatory bowel disease (IBD). There have been reports that psoriasis occurs after strep throat and could worsen due to the colonization of Candida albicans, Malassezia, and Staphylococcus aureus on the skin or in the gut (Fry and Baker, 2007). Aroniadis et. al. (Aroniadis and Brandt, 2013) indicated that the biodiversity of bacteria, such as Bacteroidetes and Firmicutes, colonizing in individuals affected by IBD has been found to be reduced by 30 to 50%. Wang et. al. (Wang and Jia, 2016) showed that the gut microbiota's dysbiosis might be a key environmental risk factor of many human diseases, though it's difficult to reveal the true causality.
To explore the relationship between microbes and their human hosts, scientists from many countries collaborated and launched the Human Microbiome Project (HMP) (Human Microbiome Project, 2012a). Recently, high-throughput sequencing techniques and corresponding software packages have been developed rapidly, and a growing number of research analyses have been carried out on the microbiome, such as whole-genome shotgun (WGS), 16S, and the taxonomic profiling (Human Microbiome Project, 2012b), and have demonstrated significant associations between microbes and complex human diseases such as rheumatoid arthritis, colorectal cancer, obesity, and type 2 diabetes (Wang and Jia, 2016). However, these studies involve time-consuming and expensive biological experiments. Therefore, it is necessary to utilize the known information to predict the unknown microbe-disease interactions. Identifying microbe-disease interactions could promote discovering biomarkers and drugs for the prevention, diagnosis, treatment, and prognosis of diseases. Now, more and more computer algorithms (Chen and Zhang, 2013;Yang et al., 2014;Zhang et al., 2017;Zeng et al., 2018;Zhang et al., 2018a;Zhang et al., 2018b;Zhang et al., 2018c;Zhang et al., 2018d;Zeng et al., 2019) have been proposed for interaction prediction of miRNA-disease, lncRNA-disease, and drug-drug, and it is feasible to apply these methods to the microbe-disease association prediction field.
Recently, Ma et al. (2017) collected microbe-disease association data from previous published studies and constructed the Human Microbe-Disease Association Database (HMDAD). Based on the data from HMDAD, some microbe-disease association prediction methods have been proposed. Chen et al. (2017) used a KATZ measure to predict human microbe-disease association, and proposed an algorithm named KATZHMDA. KATZHMDA can predict new microbe-disease associations at a large scale. Bao et al. (2017) used network consistency projection and introduced an algorithm NCPHMD to predict human microbe-disease association. NCPHMD deals with unknown diseases or microbes that are not present in the disease-microbe databases. He et al. (He et al., 2018) presented an algorithm GRNMFHMDA. GRNMFHMDA assigns likelihood scores to unknown microbe-disease pairs by calculating weighted K nearest neighbor profiles of microbes and diseases, and then adapts the standard non-negative matrix factorization by integrating graph Laplacian and Tikhonov (L2) regularization to obtain a microbedisease association prediction score matrix. Zou et al. (2017) designed an approach BiRWHMDA. BiRWHMDA constructs a heterogeneous network by connecting the microbe similarity network and the disease similarity network based on known microbe-disease associations, and then uses a bi-random walk to predict microbe-disease association.
In the paper, we propose a novel approach to predict potential micro-disease association based on the KATZ measure and bipartite network recommendation algorithm (KATZBNRA), which is an improvement on KATZHMDA (Chen et al., 2017). Similar to KATZHMDA, KATZBNRA uses the KATZ measure and the similarity of diseases and microbes according to the Gaussian interaction profile kernel to predict novel microbe-disease associations based on the known microbe-disease associations. Furthermore, in order to improve the predicting accuracy, KATZBNRA uses a bipartite network recommendation algorithm.

MATeRIAls AND MeThODs
Known Disease-Microbe Associations HMDAD (Human Microbe-Disease Association Database, http://www.cuilab.cn/hmdad) collected the curated human microbe-disease association data from microbiota studies where the microbes were determined by 16s RNA sequencing on the genus level (Ma et al., 2017). HMDAD provides public access to the data, and our known microbe-disease association data were downloaded from HMDAD. The data contains 450 distinct confirmed associations between 39 diseases and 292 microbes and is coded in an adjacency matrix A R n n d m  × , where n d (or n m ) is the number of diseases (or microbes). If there has been an experiment confirming that microbe m j relates to disease d i ,A(i,j) is set to 1, otherwise A(i,j) is set to 0.

Disease Gaussian Interaction Profile Kernel similarity
According to (Chen et al., 2017), there is a generally accepted assumption that similar diseases show an interaction tendency to similar microbes. Similar to (Chen and Yan, 2013) and (Chen et al., 2017), we compute the disease network topologic similarity based on the Gaussian interaction profile kernel. For a diseasemicrobe association adjacent matrix A, the binary element A(i,j) at row i and column j encodes whether there is a confirmed association between disease d(i) and microbe m(j). The ith row of A is denoted by IP(d(i)). IP(d(i)) can be regarded a binary vector and is called the interaction profile of d(i) since it provides the association information of disease d(i) with all microbes. For two diseases, their similarity KD(d i ,d j ), based on the Gaussian interaction profile kernel, is calculated from their interaction profiles according to the following equations. (Vanunu et al., 2010), KD values in (0, 0.3] may be not informative, while KD values in [0.6, 1] may show significant similarity. Therefore, a logistic function transformation from KD(x, y) to KD'(x, y) in Equation (3) is utilized in order to measure the similarity of diseases x and y more appropriately.
The parameters ′ γ d and c could be set with cross-validation, but to simplify the calculation, we set ′ γ d = 1 as in van Laarhoven et al., 2011, c = -15 as in (Vanunu et al., 2010). According to (Vanunu et al., 2010)

Microbe Gaussian Interaction Profile Kernel similarity
As mentioned before, similar diseases show an association tendency with similar microbes. To measure the similarity of microbes, we also used the Gaussian interaction profile kernel as before. It could be calculated in a similar way as follows.
where ′ γ m is also set to 1, and IP(m i ) is the ith column of matrix A. Similarly, KM'(m i , m j ) could be calculated as Equation (3). It should be noted that in each cross-validation experiment, the similarities of diseases and microbes will be recalculated (Sun et al., 2018).

Bipartite Network Recommendation
The bipartite network recommendation is a two-step resource allocation process (Chen et al., 2018b), which is based on a bipartite graph G(D, M, E), where D represents disease nodes, M microbe nodes, E the edges corresponding to the known microbe-disease associations. Let f 0,i (m j ) denote the initial resource allocated to a microbe node m j when considering disease d i , k(m j ) be the number of adjacent disease nodes of microbe m j , and let k(d i ) be the number of adjacent microbe nodes of disease d i in graph G.
When focusing on disease d i , each disease d i related microbe node is initially allocated with a resource value of 1, i.e. if there is an edge between the disease node d i and a microbe node m j in G, allocate an initial resource of 1 to m j . The first step of the bipartite network recommendation is to transfer the resource from microbe nodes to disease nodes according to Equation (6), and the second step is to transfer the resource of the disease nodes back to microbe nodes according to Equation (8).
Please see an example of the process of the bipartite network recommendation focusing on disease d 1 in Figure 1. After the process, we obtain the recommendation scores (1, 0, 1/4, 3/4, 0) of the microbes for disease d 1 , which suggests that besides m 1 and m 4 , m 3 may also be related to the disease.
The matrix form of Equation (9) is as follows.
where W w ij n n m m = × { } , and B is a matrix with n m rows and n d columns. The ith column of B is the recommend scores of bipartite network recommendation regarding disease d i KATZBNRA KATZBNRA uses the KATZ model to compute the associations between diseases and microbes and is illustrated in Figure 2. As a network-based computation method, the KATZ model (Chen, 2015) had been used in the problem of link prediction in the heterogenous network to calculate the similarity of nodes. There are two factors that have been regarded as effective similarity metrics in the KATZ model, the walk steps (length, i.e. the number of edges of the walk) and the number of walks from one node to another. We use the KATZ model to calculate similarities between the nodes of the microbe and disease by counting the number of walks between them. Here A l (i,j), the element of the l-th power of A, is the number of l-length walks between disease node d i and microbe node m j . Due to the limited data from HMDAD, matrix A is sparse. In order to use more information, we integrated the matrices KM, KD, B into a matrix B* as Equation (12) and replace A by B* in the KATZ model to calculate similarities between microbes and diseases.
Since walks between nodes of microbe and disease with different lengths have different contributions to similarities of node pairs, in order to dampen longer walks' contribution, we introduced a parameter β l which is no smaller than 0, and if l 1 > l 2 , then β β l l 1 2 < . The potential association between diseases d i and microbe m j can be calculated as follows.
S is a matrix of size (n d + n m )×(n d + n m ), and could be partitioned into four sub-matrices as shown in Equation (12).
where the rows of S 1,1 and S 1,2 are n d , the rows of S 2,1 and S 2,2 are n m , the columns of S 1,1 and S 2,1 are n d , and the columns of S 1,2 and S 2,2 are n m . The element S 1,2 (i, j) of S 1,2 provides the possibility that an association between disease d i and the microbe m j exists, and our prediction result can be obtained from S 1,2 .
Considering that the walks of long lengths may be meaningless, we limit k in Equation (13) to be 2, 3 and 4, and the expression can be as follows.

Performance evaluation
The test dataset of microbe-disease association was downloaded from HMDAD. We used LOOCV (leave-one-out cross validation), two-fold cross validation and five-fold cross validation to test the prediction performance of KATZBNRA on the HMDAD data. In LOOCV, each known microbe-disease association takes turns to be picked out as the testing case and the other known associations are regarded as training data. We then obtained the prediction score of the test case output by KATZBNRA and ranked of the test case in the sorted list of all predicted microbedisease associations in descending order of their scores. We used different thresholds to determine the correct predictions and wrong predictions and calculated corresponding FPR (false positive rate) and TPR (true positive rate) according to Equation (19). Finally, the results were presented in the ROC (receiver operating characteristics) curve plot of TPR against FPR.

TPR TP FN TP FPR FP TN FP
where FN is the number of false negative predictions (i.e. the cases whose prediction scores below the threshold), and TP is the number of true positive predictions (i.e. the cases whose prediction scores are not smaller than the threshold). FP is the number of the predicted associations that are not in the HMDAD dataset with scores not smaller than the threshold, and TN is the number of predicted associations that are not in the HMDAD dataset with scores smaller than the threshold. The area under a ROC curve is called AUC, and AUC is generally utilized to compare the power of predictive models. AUC of 0.5 indicates an entirely random prediction while AUC = 1 means a completely correct prediction.
In order to further test the prediction power of KATZBNRA, we also adopted 5-fold cross validation and 2-fold cross validation besides LOOCV. 5-fold (or 2-fold) cross validation randomly divides the microbe-disease associations equally into five (or two) parts and one of the five (or two) parts is reserved as the verification data while the remaining is used as training data. Considering the potential random sampling bias, we repeated each LOOCV, 2-fold and 5-fold cross validation test 100 times, and all ROC curves and AUCs are the average results of the 100 repeated tests. Meanwhile, we compared KATZBNRA with several state-of-the-art predictive methods using these validations.
For our method KATZBNRA, the walk length k plays a critical role. To test the effect of k, we changed the value of k, and carried out a series of LOOCV experiments. As shown in Figure 3, when k is set to 2, 3 and 4, the AUCs of each walk lengths are 0.9098, 0.8968, and 0.8827, respectively. Obviously, when parameter k = 2, KATZBNRA achieved the best prediction performance and walks of longer lengths may make the association prediction worse. Therefore, in the following experiments, we set k = 2. KATZBNRA has two more parameters, γ′ and β. The test in a previous work (Chen et al., 2016) showed AUC tended to decrease when γ′ was increased from 1.0 to 1.5, 2.0 and 2.5, and β was increased from 0.01 to 0.05 and 0.1. We also evaluated the AUC of KATZBNRA with different values of parameter γ′ and c in Equation (2)  We compared KATZBNRA with another three prediction methods, the native bipartite network recommendation (BNR) (Zhou et al., 2007), KATZHMDA (Chen et al., 2017), and IMCMDA (Chen et al., 2018a) using LOOCV, 5-fold cross validation and 2-fold cross validation. The global LOOCV showed that the AUCs of KATZBNRA, KATZHMDA, IMCMDA and BNR were 0.9098, 0.8382, 0.7786, and 0.4113, respectively, as shown in Figure 4−6 show the 5-fold cross validation experimental results and the 2-fold cross validation experimental results, respectively. In 5-fold cross validation KATZBNRA, KATZHMDA, IMCMDA, and BNR obtained AUCs of 0.8972, 0.8330, 0.8041, and 0.5645, respectively, and in 2-fold cross validation, their AUCs were 0.8463, 0.8190, 0.7988 and 0.5434, respectively. In all the above experiments, the curves of KATZBNRA are above those of the other methods, which means that among the four methods, KATZBNRA achieved the best prediction performance.

Case studies
We studied asthma and inflammatory bowel disease (IBD) of microbe-related diseases of human beings based on recent published clinical and biological reports to further evaluate the ability of our method. The predicted disease-microbe associations which are contained in the HMDAD dataset are sorted according to their prediction scores in descending order. For asthma and IBD, we observed the microbes in the top 10 associations of the lists. This guarantees absolute independence between the verification candidate and the known association for model training.
As a common chronic lung inflammatory disease, asthma causes difficulty in breathing (Martinez, 2007). It is believed that asthma is caused by the environment and a combination of genes. For severe asthma, one of the leading causes is a microbe (Huang et al., 2011). All of top predicted 10 candidate microbes of KATZBNRA (Table 3) have been verified by recent studies.   As a typical chronic GI (gastrointestinal) tract inflammatory bowel disease, IBD includes ulcerative colitis and Crohn's disease (Lomax et al., 2006). We listed the top 10 IBD-related candidate microbes predicted by KATZBNRA in Table 4, among which eight microbes have been previously validated.

DIsCUssION
Based on the bipartite network recommendation and the KATZ model, the paper introduced a novel disease-microbe association prediction method called KATZBNRA. KATZBNRA uses the Gaussian interaction profile kernel to calculate the similarity of diseases and microbes in the bipartite network containing the known microbe-disease associations from the HMDAD database, and the bipartite network recommendation score on the KATZ model enables KATZBNRA to predict potential disease-microbe associations with high accuracy. The experimental results of LOOCV, 5-fold cross validation, 2-fold cross validation and the IBD and asthma case studies have demonstrated the excellent and reliable prediction ability of KATZBNRA. With regard to similar prediction problems such as predicting lncRNA-disease, drug-target, gene-disease, miRNA-disease, and other biological associations, this model can be applied with small modifications.

DATA AVAIlABIlITY sTATeMeNT
Publicly available datasets were analyzed in this study. This data can be found here: http://www.cuilab.cn/hmdad.

AUThOR CONTRIBUTIONs
SL and MX conceived the study and the conceptual design of the work. SL and XL collected the data and implemented the KATZBNRA algorithm. SL tested the algorithms and drafted the manuscript. MX and XL polished the manuscript. All authors have read and approved the manuscript.

ACKNOWleDGMeNTs
We sincerely thank all reviewers for their valuable comments that we have used to improve the quality of our manuscript.The paper is an extended version of the abstract presented orally at CBC 2019 and has been recommended by CBC 2019.  Propionibacterium acnes unconfirmed 9 Bacteroidaceae PMID:17897884 (Takaishi et al., 2008)