RNMFMDA: A Microbe-Disease Association Identification Method Based on Reliable Negative Sample Selection and Logistic Matrix Factorization With Neighborhood Regularization

Microbes with abnormal levels have important impacts on the formation and development of various complex diseases. Identifying possible Microbe-Disease Associations (MDAs) helps to understand the mechanisms of complex diseases. However, experimental methods for MDA identification are costly and time-consuming. In this study, a new computational model, RNMFMDA, was developed to find possible MDAs. RNMFMDA contains two main processes. First, reliable negative MDA samples were selected based on Positive-Unlabeled (PU) learning and random walk with restart on the heterogeneous microbe-disease network. Second, Logistic Matrix Factorization with Neighborhood Regularization (LMFNR) was developed to compute the association probabilities for all microbe-disease pairs. To evaluate the performance of the proposed RNMFMDA method, we compared RNMFMDA with five state-of-the-art MDA prediction methods based on five-fold cross-validations on microbes, diseases, and MDAs. RNMFMDA obtained the best AUCs of 0.6332, 0.8669, and 0.9081, respectively, for the three five-fold cross-validations, significantly outperforming the other models. The promising prediction performance may be attributed to the following three features: high-quality negative MDA sample selection, the LMFNR-based MDA prediction model, and the integration of various biological information. In addition, a few predicted microbe-disease pairs with high association scores are worthy of further experimental validation.


INTRODUCTION
Microbes are the most abundant microscopic organisms on Earth and control many major biological and chemical processes (Ley et al., 2006;Qu J. et al., 2019;Sachdeva et al., 2019). Normal microbial flora are beneficial to host health (McFarland, 2000;Langella and Martín, 2019;Qu J. et al., 2019). Beneficial microbes, including biotherapeutic agents, probiotics, and synbiotics, have been reported to provide effective therapeutic clues when normal microflora are disrupted (McFarland, 2000;Langella and Martín, 2019).
More importantly, microorganisms have an important effect on both infectious and non-infectious diseases (Findley et al., 2013;Ding and Schloss, 2014;Abu-Ali et al., 2018;Byrd et al., 2018;Liu et al., 2019). The human body may become sick when foreign microorganisms invade or when a microbial community becomes imbalanced (Zhu et al., 2018;Qu K. et al., 2019). For example, Fusobacterium is more abundant in asthmatic patients than in healthy people (Davis-Richardson et al., 2014). Lecithinase-negative Clostridium and Lactobacillus are much more abundant in colorectal carcinoma patients (Heavey and Rowland, 2004). Increased Lactobacillus can result in tertiary lymphoid (Sze et al., 2012). All the above reports suggest that there are close associations between microbes and human diseases. Therefore, finding new Microbe-Disease Associations (MDAs) helps to provide diagnostic and therapeutic clues for clinical research (Chen et al., 2017).
Experimental methods to identify possible MDAs are costly and time-consuming. Computational methods have thus gradually been developed to find potential MDAs. Ma et al. (2016) collected published MDA data from the literature and constructed the Human Microbe-Disease Association Database (HMDAD). Various computational models have since been developed based on known MDA data and Gaussian Interaction Profile (GIP) kernel similarity for diseases and microbes. Chen et al. (2017) assumed that functionally similar microbes are likely to associate with similar non-infectious diseases and presented the first tool (KATZHMDA) to predict potential MDAs based on the KATZ measure. Huang et al. (2017) proposed a neighbor- and graph-based recommendation model (NGRHMDA). Bao et al. (2017) designed a Network Consistency Projection-based MDA prediction method (NCPHMDA). Luo and Long (2018) constructed a heterogeneous network and presented a Network Topological Similarity-based human MDA prediction model (NTSHMDA). Wang et al. (2017) developed a semi-supervised learning framework (LRLSHMDA) to prioritize microbe candidates for all diseases of interest based on Laplacian Regularized Least Squares. Peng et al. (2018b) developed an adaptive boosting-based method to compute association scores for human microbe-disease pairs based on a strong classification model. Zhang et al. (2018) proposed a bi-direction similarity integration label propagation method (BDSILP) for identifying MDAs. Shi et al. (2018) assumed that the observed incomplete microbe-non-infectious disease association matrix is composed of a parameterized matrix and a noise matrix, and then developed a Binary Matrix Completion-based model (BMCMDA) to infer possible microbe-non-infectious disease associations. Qu J. et al. (2019) presented a human MDA prediction model (MDLPHMDA) based on matrix decomposition and label propagation.
The above methods were effectively applied to MDA identification and captured a few MDAs; however, their prediction performance remains to be improved. More importantly, negative training examples are missing in the MDA identification problem. Therefore, most models randomly extract negative MDAs from unknown microbe-disease pairs, which may contain positive MDAs, thereby severely affecting prediction accuracy. Learning from Positive and Unlabeled examples (PU learning) (Li et al., 2014) is a class of methods that learn models from positive and unlabeled examples. PU learning has been widely applied to text mining with good performance.
In this study, we developed a computational model, RNMFMDA, to predict human MDA candidates. RNMFMDA integrates reliable negative MDA selection based on PU learning and random walk with restart, Logistic Matrix Factorization with Neighborhood Regularization (LMFNR), and multiple heterogeneous data. RNMFMDA first computes disease similarity and microbe similarity. Credible negative MDAs are then selected based on PU learning and random walk with restart. Finally, LMFNR is used to identify MDA candidates. RNMFMDA was compared with five state-of-the-art MDA prediction methods: MDLPHMDA (Qu J. et al., 2019), NGRHMDA (Huang et al., 2017), NTSHMDA (Luo and Long, 2018), LRLSHMDA (Wang et al., 2017), and KATZHMDA (Chen et al., 2017). To evaluate RNMFMDA, we conducted five-fold Cross Validations (CVs) on microbes, diseases, and MDAs. The results showed that RNMFMDA obtained the best AUCs under all three CVs. In addition, we performed experiments to find possible microbes/diseases associated with a known disease/microbe. The experimental results suggest that RNMFMDA is a powerful MDA identification method.

MATERIALS AND EQUIPMENT
Assume that the ith microbe is represented as $m_i$ (i = 1, 2, ..., n), and the jth disease is denoted as $d_j$ (j = 1, 2, ..., m). The associations between n microbes and m diseases are denoted as a binary matrix $Y$ (n×m), where

$$Y_{ij} = \begin{cases} 1, & \text{if microbe } m_i \text{ is associated with disease } d_j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

The non-zero elements in $Y$ are called "MDA pairs" and considered as positive observations. The zero elements in $Y$ are called "unknown microbe-disease pairs" and considered as unlabeled observations. The microbe similarity matrix and the disease similarity matrix are represented as $S_M \in \mathbb{R}^{n \times n}$ and $S_D \in \mathbb{R}^{m \times m}$, respectively. Our objective is to select reliable negative MDAs based on PU learning and random walk with restart on the heterogeneous network, then compute the association probability score for each microbe-disease pair by LMFNR, and finally rank candidate microbe-disease pairs by score in descending order, so that the top-ranked pairs are the most likely to be MDAs.
We collected confirmed MDAs from HMDAD (Ma et al., 2016) (http://www.cuilab.cn/hmdad). The database provides 483 MDAs between 292 microbes and 39 diseases from 61 previous works. We removed duplicate MDAs that were reported with different evidence and finally obtained 450 MDAs between these microbes and diseases.

Microbe GAP Similarity
Motivated by the similarity computation method provided by van Laarhoven et al. (2011), we computed microbe Gaussian Association Profile (GAP) similarity based on the known MDA matrix. Given a microbe m(i), its GAP AP(m(i)) can be represented as the ith row of Y. The GAP similarity between two microbes m(i) and m(j) can be computed by Equation (2):

$$S_M(m(i), m(j)) = \exp\left(-\gamma_m \left\| AP(m(i)) - AP(m(j)) \right\|^2\right) \tag{2}$$

where $\gamma_m = \gamma'_m \Big/ \left(\frac{1}{n}\sum_{k=1}^{n} \left\| AP(m(k)) \right\|^2\right)$ denotes the normalized kernel bandwidth with bandwidth parameter $\gamma'_m$. The microbe similarity matrix $S_M$ (n×n) can be obtained based on the GAP similarity.
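The GAP similarity can be illustrated with a short NumPy sketch. This is a minimal example under stated assumptions rather than the implementation used in this study: the function name `gap_similarity` and the toy matrix are hypothetical, and the kernel is the standard Gaussian profile kernel with the bandwidth normalized by the mean squared profile norm.

```python
import numpy as np

def gap_similarity(profiles: np.ndarray, gamma_prime: float = 1.0) -> np.ndarray:
    """Gaussian association profile similarity between the rows of `profiles`."""
    # Normalized bandwidth: gamma' divided by the mean squared profile norm.
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all profiles are empty; the bandwidth is undefined")
    gamma = gamma_prime / mean_sq
    # Pairwise squared Euclidean distances: ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

# Toy association matrix (hypothetical): 3 microbes (rows) x 2 diseases (columns).
Y = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
S_M = gap_similarity(Y)          # microbe similarity from the rows of Y
S_D_gap = gap_similarity(Y.T)    # disease GAP similarity from the columns of Y
```

Passing `Y.T` reuses the same function for diseases, since a disease profile is a column of Y.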

Disease GAP Similarity
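For a disease d(i), its GAP AP(d(i)) can be represented as the ith column of Y. The GAP similarity between two diseases d(i) and d(j) can be calculated by Equation (3):

$$S_{GAP}(d(i), d(j)) = \exp\left(-\gamma_d \left\| AP(d(i)) - AP(d(j)) \right\|^2\right) \tag{3}$$

where $\gamma_d = \gamma'_d \Big/ \left(\frac{1}{m}\sum_{k=1}^{m} \left\| AP(d(k)) \right\|^2\right)$ denotes the normalized kernel bandwidth with bandwidth parameter $\gamma'_d$.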

Disease Symptom Similarity
Inspired by the similarity measure provided by Zhou et al. (2014), we computed the disease symptom similarity matrix $S_S$. Finally, the disease similarity matrix $S_D$ (m×m) can be computed by Equation (4), where γ is a parameter used to weigh the relative importance of the GAP similarity and the symptom similarity.
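Equation (4) is not reproduced in this text. A common way to integrate two similarity matrices with a single weight is a convex combination, and the form below is offered only as an assumption about its shape. In particular, which term γ multiplies is assumed, and $S_{GAP}$ here simply denotes the disease GAP similarity matrix.

```latex
S_D = \gamma \, S_{GAP} + (1 - \gamma) \, S_S, \qquad \gamma \in [0, 1]
```

Under this assumed form, γ = 1 would use the GAP similarity alone and γ = 0 the symptom similarity alone.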

Reliable Negative MDA Selection
There exist a few known MDAs and numerous unobserved microbe-disease pairs in the HMDAD database (Ma et al., 2016). No negative MDA samples are available because of the limitations of experimental methods. High-quality negative MDAs can boost the performance of MDA prediction models, yet most machine learning-based methods have to randomly select negative examples from unknown microbe-disease pairs. These randomly selected negative examples probably contain positive MDAs, thereby severely affecting the performance of MDA identification algorithms. Therefore, we developed a negative sample selection method to extract reliable negative MDA data based on PU learning and random walk with restart. The pipeline contains two basic processes: computing the association probability for each microbe-disease pair based on random walk with restart, and extracting high-quality negative MDA samples based on PU learning and the computed association scores.

Random Walk With Restart on the Heterogeneous Microbe-Disease Network
Inspired by the method proposed by Chen et al. (2012), we consider the microbe similarity network, the disease similarity network, and the MDA network to construct a heterogeneous microbe-disease network. We used the microbe similarity matrix $S_M$ (n×n), the disease similarity matrix $S_D$ (m×m), and the MDA matrix $Y$ (n×m) as the adjacency matrices of the above three networks, respectively. The adjacency matrix of the heterogeneous network can be denoted as:

$$\begin{bmatrix} S_M & Y \\ Y^T & S_D \end{bmatrix}$$

where $Y^T$ denotes the transpose of $Y$.
We then calculate the transition probabilities of random walk with restart on the heterogeneous network. Let

$$H = \begin{bmatrix} H_{MM} & H_{MD} \\ H_{DM} & H_{DD} \end{bmatrix}$$

denote the transition probability matrix, where $H_{MM}$ and $H_{DD}$ represent the walks within the microbe-microbe similarity network and the disease-disease similarity network, respectively, and $H_{MD}$ and $H_{DM}$ represent the skips between networks. Given a microbe/disease, if there exists a bipartite association between the microbe/disease and diseases/microbes, the particle will either skip between networks or stay in the current network, governed by the transition probability λ ∈ [0, 1].

We first predict MDA candidates from the perspective of microbes. Assume that a particle is situated on the ith microbe node $m_i \in M$. It will either walk to a microbe node $m_j \in M$ with the transition probability $H_{MM}(i, j)$, or skip to a disease $d_j \in D$ based on a bipartite association with $d_j$, with the transition probability $H_{MD}(i, j)$. Similarly, we can find possible MDAs from the perspective of diseases. Assume that a particle is situated on the ith disease node $d_i \in D$. It will either walk to a disease node $d_j \in D$ with the transition probability $H_{DD}(i, j)$, or skip to a microbe $m_j \in M$ based on a bipartite association with $m_j$, with the transition probability $H_{DM}(i, j)$.

Therefore, we describe random walk with restart on the heterogeneous network as:

$$P(t+1) = (1 - \theta)\, H^T P(t) + \theta\, P(0)$$

where $P(t)$ denotes a probability matrix used to represent the association scores of all unobserved microbe-disease pairs at the t-th step of the random walk, $H^T$ denotes the transpose of $H$, and θ represents the restarting probability. The particle will return to either a seed microbe or a seed disease. More importantly, it is possible to differentiate the relative importance of each network through the initial probability

$$P(0) = \begin{bmatrix} (1 - \eta)\, s_i \\ \eta\, v_i \end{bmatrix}$$

where $v_i$ and $s_i$ denote the initial probability distributions on the disease-disease similarity network and the microbe-microbe similarity network starting from their seed nodes, respectively. The parameter η ∈ [0, 1] is used to control the restarting probability in these two similarity networks. If η < 0.5, the particle tends to restart from one of the seed microbes rather than from one of the seed diseases.

Frontiers in Microbiology | www.frontiersin.org
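The random walk can be sketched in a few lines of NumPy. The explicit formulas for the four transition blocks are not reproduced in this text, so the sketch assumes the construction commonly used for heterogeneous networks (Chen et al., 2012): a node with at least one known association jumps to the other network with probability λ and stays with probability 1 − λ, while a node without associations always stays. The function names, the toy matrices, and the default tolerance are hypothetical. The values λ = 0.9, θ = 0.5, and η = 0.9 are those reported in the experimental settings.

```python
import numpy as np

def row_normalize(M):
    """Divide each row by its sum; rows that sum to zero stay zero."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M, dtype=float), where=sums > 0)

def transition_matrix(S_M, S_D, Y, lam):
    """Assumed transition matrix H on the heterogeneous network."""
    has_d = Y.sum(axis=1, keepdims=True) > 0    # microbes with a known disease
    has_m = Y.T.sum(axis=1, keepdims=True) > 0  # diseases with a known microbe
    # Walks inside a network keep 1 - lam when a jump is possible, else 1.
    H_MM = np.where(has_d, 1.0 - lam, 1.0) * row_normalize(S_M)
    H_DD = np.where(has_m, 1.0 - lam, 1.0) * row_normalize(S_D)
    # Jumps between networks take probability lam along known associations.
    H_MD = lam * row_normalize(Y)
    H_DM = lam * row_normalize(Y.T)
    return np.block([[H_MM, H_MD], [H_DM, H_DD]])

def rwr(H, p0, theta, tol=1e-10, max_iter=1000):
    """Iterate P(t+1) = (1 - theta) H^T P(t) + theta P(0) until it stabilizes."""
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - theta) * H.T @ p + theta * p0
        if np.linalg.norm(p_next - p) <= tol:
            return p_next
        p = p_next
    return p

# Toy data (hypothetical): 3 microbes, 2 diseases.
S_M = np.array([[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]])
S_D = np.array([[1.0, 0.3], [0.3, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
lam, theta, eta = 0.9, 0.5, 0.9

H = transition_matrix(S_M, S_D, Y, lam)
s0 = np.array([1.0, 0.0, 0.0])   # seed microbe: the first microbe
v0 = Y[0] / Y[0].sum()           # assumption: its known diseases act as seed diseases
p0 = np.concatenate([(1.0 - eta) * s0, eta * v0])
p = rwr(H, p0, theta)
disease_scores = p[S_M.shape[0]:]  # steady-state scores of the diseases for this seed
```

Because every row of H sums to one, the total probability mass is preserved at each step, so the steady-state vector remains a probability distribution over microbes and diseases.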

Reliable Negative MDA Extraction
We took known MDAs as the initial positive sample set P and unobserved microbe-disease pairs as the initial unlabeled sample set U, and developed a reliable negative MDA selection method based on PU learning. The method contains the following five steps:
Step 1. Randomly select a positive sample subset S from P and add S into U.
Step 2. Take P − S as positive samples and U + S as negative samples.
Step 3. Compute the association score matrix AM based on random walk with restart on the heterogeneous microbe-disease network.
Step 4. Rank the microbe-disease pairs in S based on AM and find the minimum score AM_min in S.
Step 5. For every sample x in U, if its association score AM(x) is lower than AM_min, add x to the reliable negative set RN.
With the above selection method, we obtain the reliable negative MDA example set RN.
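The five steps can be sketched as follows. This is an illustrative outline, not the code used in this study: `select_reliable_negatives`, `score_fn`, and `spy_frac` are hypothetical names, the scoring callback stands in for the random walk with restart, and the rule in Step 5 (keep unlabeled pairs scoring below the lowest-scoring hidden positive) is the usual spy-style criterion and is assumed here.

```python
import numpy as np

def select_reliable_negatives(Y, score_fn, spy_frac=0.2, seed=0):
    """Spy-style PU selection of reliable negative microbe-disease pairs.

    score_fn(Y_train) must return a score matrix with the same shape as Y
    (hypothetical callback, e.g. scores from random walk with restart).
    """
    rng = np.random.default_rng(seed)
    pos = np.argwhere(Y == 1)
    n_spies = max(1, int(round(spy_frac * len(pos))))
    # Step 1: pick a random subset S of the positives ("spies").
    spies = pos[rng.choice(len(pos), size=n_spies, replace=False)]
    # Step 2: hide S among the unlabeled pairs.
    Y_train = Y.copy()
    Y_train[spies[:, 0], spies[:, 1]] = 0
    # Step 3: score every pair using only the remaining positives.
    AM = score_fn(Y_train)
    # Step 4: lowest score obtained by a hidden positive.
    am_min = AM[spies[:, 0], spies[:, 1]].min()
    # Step 5 (assumed rule): unlabeled pairs scoring below am_min are reliable negatives.
    rn_mask = (Y == 0) & (AM < am_min)
    return np.argwhere(rn_mask), am_min

# Toy example with a fixed score matrix standing in for the random walk.
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]], dtype=float)
scores = np.array([[0.9, 0.4, 0.1], [0.8, 0.7, 0.2], [0.3, 0.05, 0.01]])
rn, am_min = select_reliable_negatives(Y, lambda Y_train: scores)
```

In this toy case every unlabeled pair scores below every known positive, so all six unlabeled pairs are returned whichever positive is drawn as the spy.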

MDA Prediction Based on LMFNR
The logistic matrix factorization method has been widely applied to various association prediction tasks and has obtained good performance (Liu et al., 2016, 2020). Inspired by the logistic matrix factorization method provided by Liu et al. (2016) and Liu et al. (2020), we developed an MDA prediction method (RNMFMDA) by integrating the reliable negative MDA sample selection method and the LMFNR method.
Suppose that both microbes and diseases are mapped into an r-dimensional shared latent space, where r ≪ n, m. The properties of a microbe $m_i$ / disease $d_j$ are represented by a latent vector $a_i \in \mathbb{R}^{1 \times r}$ / $b_j \in \mathbb{R}^{1 \times r}$. Then, the association probability $p_{ij}$ between $m_i$ and $d_j$ can be computed by Equation (11):

$$p_{ij} = \frac{\exp(a_i b_j^T)}{1 + \exp(a_i b_j^T)} \tag{11}$$

The latent vectors of all microbes / diseases can be denoted as $A \in \mathbb{R}^{n \times r}$ / $B \in \mathbb{R}^{m \times r}$, where $a_i$ is the ith row of $A$ and $b_j$ is the jth row of $B$.

In MDA identification tasks, the observed MDAs have been experimentally validated and are more reliable than unknown microbe-disease pairs. To find MDA candidates more accurately, we assigned higher confidence to known MDAs than to unknown pairs. In particular, each MDA is considered as c (c ≥ 1) positive training samples, and each reliable negative MDA is considered as a single negative training sample. c is a constant that measures the importance of observations. This importance weighting technique has been effectively applied in the area of informatics. We built the MDA prediction model given in Equation (12). Under Bayesian inference with the corresponding probability distributions, the model can be expressed as the optimization problem in Equation (13), where $\lambda_m$ and $\lambda_d$ are regularization parameters, and $\|A\|_F$ and $\|B\|_F$ denote the Frobenius norms of $A$ and $B$, respectively.

The nearest neighborhood information of biological entities in the association network can improve the prediction performance (Zhang et al., 2019a,b,c). Incorporating neighborhood regularization into the model yields the optimization problem in Equation (14). We can obtain $A$ and $B$ by solving Equation (14) with an alternating gradient ascent procedure.
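As a rough illustration of the alternating gradient ascent procedure, the following sketch fits a weighted logistic matrix factorization. It assumes the log-posterior commonly used in logistic matrix factorization (Liu et al., 2016): the sum over training pairs of c·y_ij·a_i b_j^T − (1 + c·y_ij − y_ij)·log(1 + exp(a_i b_j^T)), minus (λ_m/2)·||A||_F² and (λ_d/2)·||B||_F². The neighborhood regularization terms are omitted, a fixed step size is used, and the function `fit_lmf`, the training mask `W`, and the values of r, the step size, and the regularization weights are hypothetical choices rather than values from this study.

```python
import numpy as np

def sigmoid(X):
    """Element-wise logistic function, i.e. Equation (11) applied to A @ B.T."""
    return 1.0 / (1.0 + np.exp(-X))

def fit_lmf(Y, W, r=2, c=8.0, lam_m=0.1, lam_d=0.1, lr=0.02, n_iter=1000, seed=0):
    """Weighted logistic matrix factorization by alternating gradient ascent.

    Y : binary association matrix.
    W : 1 for training pairs (known MDAs and reliable negatives), 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = 0.1 * rng.standard_normal((n, r))
    B = 0.1 * rng.standard_normal((m, r))
    for _ in range(n_iter):
        # Ascent step on A with B fixed.
        P = sigmoid(A @ B.T)
        G = W * (c * Y - (1.0 + c * Y - Y) * P)  # gradient w.r.t. the logits A @ B.T
        A = A + lr * (G @ B - lam_m * A)
        # Ascent step on B with A fixed.
        P = sigmoid(A @ B.T)
        G = W * (c * Y - (1.0 + c * Y - Y) * P)
        B = B + lr * (G.T @ A - lam_d * B)
    return A, B

# Toy data (hypothetical): pair (1, 0) is neither a known MDA nor a reliable negative.
Y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
W = np.array([[1, 1], [0, 1], [1, 1]], dtype=float)
A, B = fit_lmf(Y, W)
P = sigmoid(A @ B.T)  # association probabilities for all microbe-disease pairs
```

The mask `W` is what allows the reliable negative set to enter the model: pairs outside the training set contribute nothing to the gradient, and their probabilities are read off from the fitted latent vectors.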
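Finally, the association probability matrix $Y_p$ for all unknown microbe-disease pairs can be represented as:

$$Y_p = \frac{\exp(AB^T)}{1 + \exp(AB^T)} \tag{15}$$

where the exponential and the division are applied element-wise.

RESULTS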

Experimental Settings and Evaluation
The experiment was performed under 100 trials of five-fold Cross Validation. An average performance was finally computed to reduce the prediction bias. For an MDA matrix Y n×m , CVs were conducted under three different experimental settings as follows.
Under CV1, in each round, 80% of the rows in Y were used as the training set and the remaining rows as the test set. Under CV2, in each round, 80% of the columns in Y were used as the training set and the remaining columns as the test set. Under CV3, in each round, 80% of the entries in Y were used as the training set and the remaining entries as the test set. These three CVs correspond to MDA prediction for (1) new (unknown) microbes, (2) new diseases, and (3) new microbe-disease pairs, respectively. Sensitivity, specificity, accuracy, and AUC were used to evaluate the performance. AUC is the area under the receiver operating characteristic (ROC) curve. The curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds. TPR and FPR can be computed by Equations (16, 17):

$$TPR = \frac{TP}{TP + FN} \tag{16}$$

$$FPR = \frac{FP}{FP + TN} \tag{17}$$

where the definitions of TP, FP, TN, and FN are as shown in Table 1. A higher AUC value represents better performance. In our experiments, AUC was computed in each round of CV, and the final AUC was averaged over the five rounds and the 100 trials.

λ is used to determine the probability of jumping between networks. θ is the restart rate. η denotes the restarting probability in the microbe similarity network and the disease similarity network. c is the importance weight of positive samples relative to negative samples. K denotes the number of nearest neighbors. For the parameters λ, θ, η, c, and K, we conducted a grid search to find the optimal values. RNMFMDA obtained the best performance with λ = 0.9, θ = 0.5, η = 0.9, c = 8, and K = 5, so these values were used in all experiments. The parameters $\gamma'_m$, $\gamma'_d$, and γ were set to the same values as in previous works, that is, $\gamma'_m = 1$, $\gamma'_d = 1$, and γ = 0.9. Other parameters were set according to the method provided by Liu et al. (2016). The iteration for random walk stops when $\|P(t+1) - P(t)\|_F \le$ 10e−12. The ratio of extracted negative MDAs to positive MDAs was set to 1:1, that is to say, the number of negative MDAs is 450. The parameters of the other five methods were set to the values provided in the corresponding papers.
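The three cross-validation settings and the two rates can be illustrated with a short sketch. The helper names `cv_folds` and `tpr_fpr` are hypothetical, and the fold assignment shown here (a random permutation split into five parts) is one simple way to realize the 80%/20% splits described above, not necessarily the procedure used in this study.

```python
import numpy as np

def cv_folds(Y, setting, k=5, seed=0):
    """Yield boolean test masks over Y for CV1 (rows), CV2 (columns), CV3 (entries)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    sizes = {"CV1": n, "CV2": m, "CV3": n * m}
    if setting not in sizes:
        raise ValueError("setting must be 'CV1', 'CV2', or 'CV3'")
    for idx in np.array_split(rng.permutation(sizes[setting]), k):
        mask = np.zeros((n, m), dtype=bool)
        if setting == "CV1":
            mask[idx, :] = True            # hold out whole microbes (rows)
        elif setting == "CV2":
            mask[:, idx] = True            # hold out whole diseases (columns)
        else:
            mask.reshape(-1)[idx] = True   # hold out individual pairs
        yield mask

def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates at one threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Toy check (hypothetical data).
Y = np.zeros((10, 5))
masks_cv1 = list(cv_folds(Y, "CV1"))
masks_cv3 = list(cv_folds(Y, "CV3"))
y_true = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.1])
tpr, fpr = tpr_fpr(y_true, scores, threshold=0.5)
```

Sweeping the threshold over the sorted scores and collecting the resulting (FPR, TPR) points traces the ROC curve whose area is the AUC.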

Performance Comparison of RNMFMDA With Other Five Methods
In this section, we compared our proposed RNMFMDA method with five state-of-the-art MDA prediction models: MDLPHMDA (Qu J. et al., 2019), NGRHMDA (Huang et al., 2017), NTSHMDA (Luo and Long, 2018), LRLSHMDA (Wang et al., 2017), and KATZHMDA (Chen et al., 2017). Tables 2-4 show the performance of RNMFMDA and the other five methods. The best performance is shown in boldface in Tables 2-4.
As shown in Tables 2-4, RNMFMDA performed better than the other five methods. Compared with MDLPHMDA, NGRHMDA, and NTSHMDA, RNMFMDA obtained a remarkable improvement on the four evaluation metrics under the three CVs. KATZHMDA and LRLSHMDA are ...

Figures 1-3 show the AUCs of these six methods. AUC is a more important measurement than the other three evaluation metrics. Based on a comprehensive assessment of the experimental results, RNMFMDA showed the best performance. In addition, these six methods showed different advantages under different CVs. This variation in improvement can be attributed to differences in data structure under the different CVs. In particular, RNMFMDA is more suitable for finding possible microbes associated with a given disease.

Case Study
We further evaluated the prediction performance of our proposed RNMFMDA on the 450 confirmed MDAs through two case studies. Asthma is a disease with considerable global morbidity. Over the past 10 years, little improvement in asthma has been observed despite escalating treatment costs (Pavord et al., 2018). In the first case, we masked all association information for asthma to find possible microbes. The results are shown in Table 8. Among the predicted top 10 and top 20 microbe-asthma association pairs, 8 and 15 microbes, respectively, have been reported to associate with asthma by related publications. Inflammatory Bowel Disease (IBD) is a periodic inflammation. It may be produced by a deregulated immune response to gut microbiome dysbiosis (Halfvarson et al., 2017). In the second case, we masked all association information for IBD to find possible microbes. The results are shown in Table 9. Among the predicted top 10 and top 20 microbe-IBD association pairs, 9 and 17 microbes, respectively, have been validated to associate with IBD by recent works.

DISCUSSION
There are numerous microbes in the human body. They play an important role in various biological processes. Many human diseases including gastrointestinal diseases are reported to be closely associated with microorganisms. Therefore, identifying the associations between microbes and diseases helps to understand the pathogenic mechanisms of these diseases and further develop new drugs.
Traditional experimental methods for validating possible associations between microbes and diseases are expensive and time-consuming, so computational methods have been developed to address this problem. However, the performance of existing computational models needs further improvement. More importantly, the lack of reliable negative MDA examples affects prediction performance. Therefore, RNMFMDA was developed to find possible MDAs. RNMFMDA obtained the best performance under the three CVs. We attribute this performance to the following three features. First, we developed a high-quality negative MDA extraction method based on PU learning and random walk with restart. Second, LMFNR is an effective model for predicting associations between two types of entities. Finally, we integrated various heterogeneous biological information. The integration of multiple heterogeneous data sources efficiently reflects the biological features of MDAs.
In the future, we will construct a multi-partite network by integrating MDAs, disease-gene associations (Tran et al., 2020), miRNA-disease associations (Peng et al., 2018a;Huang et al., 2019), long non-coding RNA-protein interactions (Peng et al., 2019), and long non-coding RNA-disease associations (Chen et al., 2018). More importantly, we will also develop more robust models, for example, ensemble strategies and deep learning-based models (Min et al., 2017;Peng L. et al., 2018), to improve MDA prediction.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.