RNMFMDA: A Microbe-Disease Association Identification Method Based on Reliable Negative Sample Selection and Logistic Matrix Factorization With Neighborhood Regularization

Peng, Lihong; Shen, Ling; Liao, Longjie; Liu, Guangyi; Zhou, Liqian

doi:10.3389/fmicb.2020.592430

ORIGINAL RESEARCH article

Front. Microbiol., 27 October 2020

Sec. Systems Microbiology

Volume 11 - 2020 | https://doi.org/10.3389/fmicb.2020.592430

RNMFMDA: A Microbe-Disease Association Identification Method Based on Reliable Negative Sample Selection and Logistic Matrix Factorization With Neighborhood Regularization

School of Computer Science, Hunan University of Technology, Zhuzhou, China

Microbes with abnormal levels have important impacts on the formation and development of various complex diseases. Identifying possible Microbe-Disease Associations (MDAs) helps to understand the mechanisms of complex diseases. However, experimental methods for MDA identification are costly and time-consuming. In this study, a new computational model, RNMFMDA, was developed to find possible MDAs. RNMFMDA contains two main processes. First, Reliable Negative MDA samples were selected based on Positive-Unlabeled (PU) learning and random walk with restart on the heterogeneous microbe-disease network. Second, Logistic Matrix Factorization with Neighborhood Regularization (LMFNR) was developed to compute the association probabilities for all microbe-disease pairs. To evaluate the performance of the proposed RNMFMDA method, we compared RNMFMDA with five state-of-the-art MDA prediction methods based on five-fold cross-validations on microbes, diseases, and MDAs. As a result, RNMFMDA obtained the best AUCs of 0.6332, 0.8669, and 0.9081, respectively for the three five-fold cross validations, significantly outperforming other models. The promising prediction performance may be attributed to the following three features: highly quality negative MDA sample selection, LMFNR-based MDA prediction model, and various biological information integration. In addition, a few predicted microbe-disease pairs with high association scores are worthy of further experimental validation.

1. Introduction

Microbes are the most abundant microscopic organisms on Earth and control many major biological and chemical processes (Ley et al., 2006; Qu J. et al., 2019; Sachdeva et al., 2019). Normal microbial flora are beneficial for the host health (McFarland, 2000; Langella and Mart́ın, 2019; Qu J. et al., 2019). Beneficial microbes including biotherapeutic agent, probiotics and synbiotics have been reported as effective therapeutic clues when normal microflora are disrupted (McFarland, 2000; Langella and Mart́ın, 2019).

More importantly, microorganisms have an important affect on infectious diseases and non-infectious diseases (Findley et al., 2013; Ding and Schloss, 2014; Abu-Ali et al., 2018; Byrd et al., 2018; Liu et al., 2019). The human body is possible to get sick when foreign microorganisms invade or a microbial community is imbalanced (Zhu et al., 2018; Qu K. et al., 2019). For example, there are more abundant Fusobacterium in asthmatic patients than healthy people (Davis-Richardson et al., 2014). Lecithinase-negative Clostridium and Lactobacillus are much more in colorectal carcinoma patients (Heavey and Rowland, 2004). Increased Lactobacillus can result in tertiary lymphoid (Sze et al., 2012). All the above reports suggested that there are close associations between microbes and human diseases. Therefore, finding new Microbe-Disease Associations (MDAs) helps to provide diagnostic and therapeutic clues for clinical researches Chen et al. (2017).

Experimental methods to predict possible MDAs are costly and time-consuming. Computational methods are thus gradually developed to find potential MDAs. Ma et al. (2016) collected published MDA data from literatures and constructed Human Microbe-Disease Association Database (HMDAD). Various computational models are then exploited based on known MDA data, Gaussian Interaction Profile Kernel (GIP) similarity for diseases and microbes. Chen et al. (2017) assumed that functionally similar microbes are likely to associate with similar non-infectious diseases and presented the first tool (KATZHMDA) to predict potential MDAs based on the KATZ measure. Huang et al. (2017) proposed a neighbor and graph-based recommendation model (NGRHMDA). Bao et al. (2017) designed a Network Consistency Projection-based MDA prediction method (NCPHMDA). Luo and Long (2018) constructed a heterogeneous network and presented a Network Topological Similarity-based human MDA prediction model (NTSHMDA). Wang et al. (2017) developed a semi-supervised learning framework (LRLSHMDA) to prioritize microbe candidates for all interested diseases based on Laplacian Regularized Least Squares. Peng et al. (2018b) exploited a adaptive boosting-based method to compute association scores for human microbe-disease pairs based on a strong classification model. Zhang et al. (2018) proposed a bi-direction similarity integration label propagation method (BDSILP) for identifying MDAs. Shi et al. (2018) assumed that observed incomplete microbe-non-infectious disease association matrix is composed of a parameterized matrix and a noise matrix, and then developed a Binary Matrix Completion-based model (BMCMDA) to infer possible microbe-non-infectious disease associations. Qu J. et al. (2019) presented a human MDA model (MDLPHMDA) based on matrix decomposition and label propagation.

The above methods were effectively applied to MDA identification and captured a few MDAs, however, the prediction performance remains to be improved. More importantly, in MDA identification problem, negative training examples are missing. Therefore, most of models randomly extracted negative MDAs from unknown microbe-disease pairs, which may contain positive MDAs, thereby severely affecting the prediction accuracy. Learning from Positive and Unlabeled examples (PU learning) (Li et al., 2014) is one type of methods used to learn the models from numerous positive and unlabeled examples. PU learning has been widely applied to text mining and obtained better performance.

In this study, we developed a computational model, RNMFMDA, to predict human MDA candidates. RNMFMDA integrated Reliable Negative MDA selection based on PU learning and random walk with restart, Logistic Matrix Factorization with Neighborhood Regularization (LMFNR), and multiple heterogeneous data. RNMFMDA first computed disease similarity and microbe similarity. Credible negative MDAs were then selected based on PU learning and random walk with restart. LMFNR was finally developed to identify MDA candidates. RNMFMDA was compared to five state-of-the-art MDA prediction methods, MDLPHMDA (Qu J. et al., 2019), NGRHMDA (Huang et al., 2017), NTSHMDA (Luo and Long, 2018), LRLSHMDA (Wang et al., 2017), and KATZHMDA (Chen et al., 2017). To evaluate our proposed RNMFMDA, we conducted five-fold Cross Validations (CVs) on microbes, diseases, and MDAs. The results showed that RNMFMDA obtained the best AUCs under the above three CVs. In addition, we further performed the experiments to find possible microbes/diseases associate with a known disease/microbe. The experimental result analysis suggested that RNMFMDA is a powerful MDA identification method.

2. Materials and Equipment

Assume that the ith microbe is represented as m_i(i = 1, 2, …, n), and the jth disease is denoted as d_j(j = 1, 2, …, m). The associations between n microbes and m diseases are denoted as a binary matrix Y_{(n × m)} where

\begin{array}{l} y_{i j} = {\begin{matrix} 1 i f m_{i} a s s o c i a t e s w i t h d_{j} \\ 0 o t h e r w i s e \end{matrix} & (1) \end{array}

The non-zero elements in Y are called “MDA pairs” and considered as positive observations. The zero elements in Y are called “unknown microbe-disease pairs” and considered as unlabeled observations. The microbe similarity matrix and the disease similarity matrix are represented as $S_{M} \in ℜ^{n \times n}$ and $S_{D} \in ℜ^{m \times m}$ , respectively.

Our objective is to select reliable negative MDAs based on PU learning and random walk with restart on the heterogeneous network, and then compute the association probability score for each microbe-disease pair by LMFNR, finally rank candidate microbe-disease pairs according to the scores in descending order, so that the top microbe-disease pairs are the most likely to be MDAs.

We collected confirmed MDAs from HMDAD (Ma et al., 2016) (http://www.cuilab.cn/hmdad). The database provides 483 MDAs between 292 microbes and 39 diseases from 61 previous works. We deleted the same MDAs based on different evidences and finally obtained 450 MDAs from these microbes and diseases.

3. Methods

3.1. Microbe GAP Similarity

Motivated by the similarity computation method provided by van Laarhoven et al. (2011), we computed microbe Gaussian Association Profile (GAP) similarity based on known MDA matrix. Given a microbe m(i), its GAP AP(m(i)) can be represented as the ith row of Y. The GAP similarity between two microbes m(i) and m(j) can be computed by Equation (2):

\begin{array}{l} S_{M} (m (i), m (j)) = exp (- γ_{m} ‖ A P (m (i)) - A P (m (j)) ‖^{2}) & (2) \end{array}

where $γ_{m} = {γ_{m}}^{'} / (\frac{1}{n} \sum_{k = 1}^{n} ‖ A P (m (k)) ‖^{2})$ denotes the normalized kernel bandwidth with bandwidth parameter ${γ_{m}}^{'}$ . The microbe similarity matrix S_{M(n × n)} can be obtained based on the GAP similarity.

3.2. Disease Similarity

3.2.1. Disease GAP Similarity

For a disease d(i), its GAP AP(d(i)) can be represented as the ith column of Y. The GAP similarity between two diseases d(i) and d(j) can be calculated by Equation (3):

\begin{array}{l} S_{G} (d (i), d (j)) = exp (- γ_{d} ‖ A P (d (i)) - A P (d (j)) ‖^{2}) & (3) \end{array}

where $γ_{d} = {γ_{d}}^{'} / (\frac{1}{m} \sum_{k = 1}^{m} ‖ A P (d (k)) ‖^{2})$ denotes the normalized kernel bandwidth with bandwidth parameter ${γ_{d}}^{'}$ .

3.2.2. Disease Symptom Similarity

Inspired by the similarity measure method provided by Zhou et al. (2014), we computed disease symptom similarity matrix S_S.

Finally, the disease similarity matrix S_{D(m × m)} can be computed by Equation (4):

\begin{array}{l} S_{D} (d (i), d (j)) = S_{G} (d (i), d (j)) + γ S_{S} (d (i), d (j)) & (4) \end{array}

where γ is a parameter used to weigh the importance between the GAP similarity and the symptom similarity.

3.3. Reliable Negative MDA Selection

There exists a few known MDAs and numerous unobserved microbe-disease pairs in the HMDAD database (Ma et al., 2016). There are no negative MDA samples because of the limitations of experimental methods. High-quality negative MDAs can boost the performance of MDA prediction models. Therefore, most of machine learning-based methods have to randomly select negative examples from unknown microbe-disease pairs. However, this part of randomly selected negative examples probably contains positive MDAs, thereby severely affecting the performance of MDA identification algorithms. Therefore, we developed a negative sample selection method to extract reliable negative MDA data based on PU learning and random walk with restart. The pipeline mainly contains two basic processes: computing the association probability for each microbe-disease pair based on random walk with restart and extracting high-quality negative MDA samples based on PU learning and the computed association scores.

3.3.1. Random Walk With Restart on the Heterogeneous Microbe-Disease Network

Inspired by the method proposed by Chen et al. (2012), we consider microbe similarity network, disease similarity network, and MDA network to construct a heterogeneous microbe-disease network. We used microbe similarity matrix S_{M(n × n)}, disease similarity matrix S_{D(m × m)}, and MDA matrix Y_{(n × m)} as the adjacency matrices of the above three networks, respectively. And the adjacency matrix on the heterogeneous network can be denoted as:

\begin{array}{l} H = [\begin{matrix} S_{M} & Y \\ Y^{T} & S_{D} \end{matrix}] & (5) \end{array}

where Y^T denotes the transpose of Y.

We then calculate different transition probabilities of random walk with restart on the heterogeneous graph. Assume that $H = [\begin{matrix} H_{M M} & H_{M D} \\ H_{D M} & H_{D D} \end{matrix}]$ represent the transition probability matrix, where H_MM and H_DD represent the walks within microbe-microbe similarity network and disease-disease similarity network, respectively, H_MD and H_DM represent the skips between networks. Given a microbe/disease, if there exist a bipartite association between the microbe/disease and diseases/microbes, the particle will either skip between the four networks or stay in the current network with a transition probability λ ∈ [0, 1].

We predict MDA candidates from a perspective of microbes. Assume that a particle be situated on the i-th microbe node m_i ∈ M, it will walk to a microbe node m_j ∈ M with the transition probability H_MM(i, j):

\begin{array}{l} H_{M M} (i, j) = {\begin{matrix} S_{M} (i, j) / \sum_{k = 1}^{n} S_{M} (i, k), & if \sum_{k = 1}^{m} Y (i, k) = 0 \\ (1 - λ) S_{M} (i, j) / \sum_{k = 1}^{n} S_{M} (i, k), & o t h e r w i s e \end{matrix} & (6) \end{array}

or skip to a disease d_j ∈ D based on a bipartite association with d_j with the transition probability H_MD(i, j):

\begin{array}{l} H_{M D} (i, j) = {\begin{matrix} λ Y (i, j) / \sum_{k = 1}^{m} Y (i, k), if \sum_{k = 1}^{m} Y (i, k) \neq 0 \\ 0, o t h e r w i s e \end{matrix} & (7) \end{array}

Similarly, we can find possible MDAs from a perspective of diseases. Assume that a particle be situated on the ith disease node d_i ∈ D. It will walk to a disease node d_j ∈ D with the transition probability H_DD(i, j):

\begin{array}{l} H_{D D} (i, j) = {\begin{matrix} S_{D} (i, j) / \sum_{k = 1}^{m} S_{D} (i, k), & if \sum_{k = 1}^{n} Y (k, i) = 0 \\ (1 - λ) S_{D} (i, j) / \sum_{k = 1}^{m} S_{D} (i, k), & o t h e r w i s e \end{matrix} & (8) \end{array}

or skip to a microbe m_j ∈ M based on a bipartite association with m_j with a transition probability H_DM(i, j):

\begin{array}{l} H_{D M} (i, j) = {\begin{matrix} λ Y (j, i) / \sum_{k = 1}^{n} Y (k, i), if \sum_{k = 1}^{n} Y (k, i) \neq 0 \\ 0, o t h e r w i s e \end{matrix} & (9) \end{array}

Therefore, we describe random walk with restart on the heterogeneous network as:

\begin{array}{l} P (t + 1) = (1 - θ) H^{T} * P (t) + θ P (0) & (10) \end{array}

where P(t) denotes a probability matrix used to represent the association scores of all unobserved microbe-disease pairs at the t-th step random walk, H^T denotes the transpose of H, and θ represents the restarting probability. The particle will return to either a seed microbe or a seed disease. More importantly, it is possible to differentiate the relative important of each network based on the initial probability $p_{i} (0) = [\begin{matrix} (1 - η) v_{i} \\ η s_{i} \end{matrix}]$ , where v_i and s_i denote the initial probability distributions on disease-disease similarity network and microbe-microbe similarity network starting from their seed nodes, respectively. The parameter η ∈ [0, 1] is used to control the restarting probability in these two similarity networks. If η < 0.5, the particle will more tend to restart from one of the seed microbes than from one of the seed diseases.

3.3.2. Reliable Negative MDA Extraction

We took known MDAs as initial positive sample set P, observed microbe-disease pairs as initial unlabeled sample set U and developed a reliable negative MDA selection based on PU learning. The method contains the following five steps:

Step 1. Randomly selecting positive sample subset S from P and adding S into U;

Step 2. Taking P − S as positive samples, U + S as negative samples;

Step 3. Computing the association score matrix AM based on random walk with restart on the heterogeneous microbe-disease network;

Step 4. Ranking microbe-disease pairs in S based on AM and finding the minimum score AM_min in S;

Step 5. For every sample x in U:

if AM_x satisfying AM_x < AM_min

then RN = RN ∪ x

We can obtain reliable negative MDA example set RN with the above negative selection method.

3.4. MDA Prediction Based on LMFNR

The logistic matrix factorization method has widely applied to the area of various association prediction and obtained better performance (Liu et al., 2016, 2020). Inspired by the logistic matrix factorization method provided by Liu et al. (2016) and Liu et al. (2020), we developed an MDA prediction method (RNMFMDA) by integrating the Reliable Negative MDA sample selection method and the LMFNR method.

Suppose that both microbes and diseases are mapped into r-dimensional shared latent spaces where r ≪ n, m. The properties of a microbe m_i / disease d_j is represented by a latent vector $a_{i} \in ℜ^{1 \times t}$ / $b_{i} \in ℜ^{1 \times t}$ . Then, the association probability p_ij between m_i and d_j can be computed by Equation (11):

\begin{array}{l} p_{i j} = \frac{exp (a_{i} b_{j}^{T})}{1 + exp (a_{i} b_{j}^{T})} & (11) \end{array}

The latent vectors of all microbes / diseases can be denoted as A ∈ ℜ^{n × r} / B ∈ ℜ^{m × r}, where a_i / b_j is the ith / jth row in A/B.

In MDA identification tasks, the observed MDAs have been experimentally validated and are more reliable than unknown microbe-disease pairs. To more accurately find MDA candidates, we assigned higher confidence scores to known MDAs than unknown pairs. Particularly, each MDA is considered as c(c≥ 1) positive training samples, and each reliable negative MDA is considered as a single negative training sample. c is a constant to measure the importance of observations. The importance weighting technique has been effectively applied to the area of informatics. And we built the following MDA prediction model:

\begin{array}{l} \begin{matrix} p (Y | A, B) = (\prod_{1 \leq i \leq n, 1 \leq j \leq m, y_{i j} = 1} {[p_{i j}^{y_{i j}} {(1 - p_{i j})}^{(1 - y_{i j})}]}^{c}) \\ \times (\prod_{1 \leq i \leq n, 1 \leq j \leq m, y_{i j} = 0} [p_{i j}^{y_{i j}} {(1 - p_{i j})}^{(1 - y_{i j})}]) \\ = \prod_{i = 1}^{n} \prod_{j = 1}^{m} p_{i j}^{c y_{i j}} {(1 - p_{i j})}^{(1 - y_{i j})} \end{matrix} & (12) \end{array}

The above model can represented as the following optimization function considering the probability distribution based on a Bayesian inference:

\begin{array}{l} \begin{matrix} min_{A, B} \sum_{i = 1}^{m} \sum_{j = 1}^{n} (1 + c y_{i j} - y_{i j}) log [1 + exp (a_{i} b_{j}^{T})] \\ - c y_{i j} a_{i} b_{j}^{T} + \frac{λ_{m}}{2} ‖ A ‖_{F}^{2} + \frac{λ_{d}}{2} ‖ B ‖_{F}^{2} \end{matrix} & (13) \end{array}

where λ_m and λ_d are parameters, ||A||_F and ||B||_F denote the Frobenius norm of A and B, respectively.

The nearest neighborhood information of biological entities in the association network can improve the prediction performance (Zhang et al., 2019a,b,c). For example, Zhang et al. (2019a), Zhang et al. (2019b), and Zhang et al. (2019c) used neighborhood information and effectively found microRNA-disease associations, drug-drug interactions and long non-coding RNA-miRNA interactions. Therefore, we integrated neighborhood information to the above optimization model and built the final LMFNR model by Equation (14):

\begin{array}{l} \begin{matrix} min_{A, B} \sum_{i = 1}^{m} \sum_{j = 1}^{n} (1 + c y_{i j} - y_{i j}) ln [1 + exp (a_{i} b_{j}^{T})] - c y_{i j} a_{i} b_{j}^{T} \\ + \frac{1}{2} t r [A^{T} (λ_{m} I + α L_{m}) A + \frac{1}{2} t r [B^{T} (λ_{d} I + α L_{d}) B \end{matrix} & (14) \end{array}

where tr(·) denotes the trace of a matrix, L_m and L_d were defined as the same to Liu et al. (2016).

We can obtain A and B by solving with the optimization problem by Equation (14) with an alternating gradient ascent procedure.

Finally, the association probability matrix Y_p for all unknown microbe-disease pairs can be represented as:

\begin{array}{l} Y_{p} = A B^{T} & (15) \end{array}

4. Results

4.1. Experimental Settings and Evaluation

The experiment was performed under 100 trials of five-fold Cross Validation. An average performance was finally computed to reduce the prediction bias. For an MDA matrix Y_{n × m}, CVs were conducted under three different experimental settings as follows.

• Five-fold Cross Validation 1 (CV1): CV on microbes, that is, random rows in Y (i.e., microbes) were masked for testing.

• Five-fold Cross Validation 2 (CV2): CV on diseases, that is, random columns in Y (i.e., diseases) were masked for testing.

• Five-fold Cross Validation 3 (CV3): CV on microbe-disease pairs, that is, random entries in Y (i.e., microbe-disease pairs) were masked for testing.

Under CV1, in each round, 80% of rows in Y was used as training set and the remaining was used as test set. Under CV2, in each round, 80% of columns in Y was used as training set and the remaining was used as test set. Under CV3, in each round, 80% of entries in Y was used as training set and the remaining was used as test set. These three CVs refer to MDA prediction for (1) new (unknown) microbes, (2) new diseases, and new microbe-disease pairs, respectively.

Sensitivity, specificity, accuracy, and AUC were used to evaluate the performances. AUC is the average area under the receiver operating characteristics (ROC) curve. The curve can be plotted by the ratio of True Positive Rate (TPR) to False Positive Rate (FPR) according to different thresholds. TPR and FPR can be computed by Equations (16, 17). High AUC value represents better performance. In our experiments, AUC was computed in each round of CV and final AUC was averaged over the five rounds for 100 times.

\begin{array}{l} T P R = \frac{T P}{T P + F N} = \frac{T P}{T} & (16) \end{array}

\begin{array}{l} F P R = \frac{F P}{F P + T N} = \frac{F P}{F} & (17) \end{array}

where the definitions of TP, FP and FN are as shown in Table 1.

TABLE 1

Table 1. Confusion matrix of a binary classifier.

λ is used to determine the probability of jumping between nodes. θ is the restart rate. η denotes the restarting probability in microbe similarity network and disease similarity network. c is the importance level of positive samples to negative samples. K denotes the number of neighborhood. For the parameters λ, θ, η, c, and K, we conducted grid search to find the optimal values. RNMFMDA obtained the best performance when these five parameters are set as λ = 0.9, θ = 0.5, η = 0.9, c = 8, and K = 5. So we set the above five parameters as the corresponding values. Parameters ${γ_{m}}^{'}$ , ${γ_{d}}^{'}$ , and γ are set the same values in previous works, that is, ${γ_{m}}^{'} = 1$ , ${γ_{d}}^{'} = 1$ , and γ = 0.9. For other parameters, we set the corresponding values according to the method provided by Liu et al. (2016). When ||P(t + 1) − P(t)||_F ≤ 10e − 12, the iteration for random walk will stop. The ratio of extracted negative MDAs to positive MDAs is set as 1:1, this is to say, the number of negative MDAs is 450. The parameters in other five methods were set as the same values provided by the corresponding papers.

4.2. Performance Comparison of RNMFMDA With Other Five Methods

In this section, we compared our proposed RNMFMDA method with five state-of-the-art MDA prediction models, MDLPHMDA (Qu J. et al., 2019), NGRHMDA (Huang et al., 2017), NTSHMDA (Luo and Long, 2018), LRLSHMDA (Wang et al., 2017), and KATZHMDA (Chen et al., 2017). Tables 2–4 showed the performance of RNMFMDA with other five methods. The best performance is described in boldface in Tables 2–4.

TABLE 2

Table 2. Performance comparison of RNMFMDA with other five methods under CV1.

TABLE 3

Table 3. Performance comparison of RNMFMDA with other five methods under CV2.

TABLE 4

Table 4. Performance comparison of RNMFMDA with other five methods under CV3.

As shown in Tables 2–4, RNMFMDA performed more efficiently than other five methods. Compared with MDLPHMDA, NGRHMDA, and NTSHMDA, RNMFMDA obtained a more remarkable improvement over four evaluation metrics under three CVs. KATZHMDA and LRLSHMDA are two classic MDA prediction methods. Under CV1, KATZHMDA and LRLSHMDA computed better specificity and accuracy than RNMFMDA. Under CV2 and CV3, these two methods obtained better sensitivity than RNMFMDA. Although KATZHMDA and LRLSHMDA obtained relatively better specificity and accuracy than RNMFMDA under individual CVs, RNMFMDA computed the best AUCs under three CVs. For example, the AUC values in RNMFMDA increased by 42.42, 31.08, 36.48, 34.37, and 51.96% compared with those in KATZHMDA, LRLSHMDA, MDLPHMDA, NGRHMDA, and NTSHMDA under CV1; the corresponding values increased by 0.08, 6.73, 5.66, 5.13, and 4.35%, respectively, under CV2; the values also increased by 5.62, 3.16, 1.57, 0.62, and 2.04%, respectively, under CV3. Figures 1–3 showed the AUCs of these six methods. AUC is a more important measurement compared with other three evaluation metrics. Based on the comprehensive measure of the experimental results, RNMFMDA showed the optimal performance.

FIGURE 1

Figure 1. Performance comparison of RNMFMDA with other five methods under CV1.

FIGURE 2

Figure 2. Performance comparison of RNMFMDA with other five methods under CV2.

FIGURE 3

Figure 3. Performance comparison of RNMFMDA with other five methods under CV3.

In addition, these six methods showed different advantages under different CVs. These variation in improvement can be attributed to differences in data structures under different CVs. In particular, RNMFMDA is more suitable to find possible microbes associated with a given disease.

4.3. Performance Comparison Considering PU Learning

In this section, we performed extensive experiments to analyze the influence of different negative MDA selection ratios on prediction performance. Tables 5–7 described the comparison results. NMDAR represents the ratio of selected negative MDA samples to known positive MDA samples.

TABLE 5

Table 5. Performance comparison considering the number of negative sample CV1.

TABLE 6

Table 6. Performance comparison considering the number of negative sample CV2.

TABLE 7

Table 7. Performance comparison considering the number of negative sample CV3.

As shown in Tables 5–7, RNMFMDA did not extract negative MDAs when NMDAR is 0, and selected negative MDAs according to different NMDARs of 10, 20, 50%, 1, 2, 3, 4, and 5. When NMDAR is 1, RNMFMDA obtained promising performance under three CVs. Compared with the situation without negative MDA samples, when NMDAR is 1, the AUC values of RNMFMDA respectively increased 16.90, 11.49, and 4.61% under three CVs. Taken as a whole, RNMFMDA with NMDAR of 1 obtained better performance. To reduce overfitting of the experimental results, we selected NMDAR as 1, that is, we extracted negative MDA examples with the same number of positive MDA examples.

Figures 4–6 showed the AUC values obtained by RNMFMDA under different NMDARs. The results suggested that our proposed negative example extraction method helps to improve MDA prediction.

FIGURE 4

Figure 4. The performance comparison under different negative MDA selection ratios under CV1.

FIGURE 5

Figure 5. The performance comparison under different negative MDA selection ratios under CV2.

FIGURE 6

Figure 6. The performance comparison under different negative MDA selection ratios under CV3.

4.4. Case Study

We further evaluated the prediction performance of our proposed RNMFMDA on the confirmed 450 MDAs by two case studies. Asthma is a disease with considerable global morbidity. Over the past 10 years, little improvement in asthma has been observed despite of escalating treatment costs (Pavord et al., 2018). In the first class, we mask all associated information for asthma to find possible microbes. The results are shown in Table 8. Among the predicted top 10 and 20 microbe-asthma association pairs, 8 and 15 microbes have been reported to associate with asthma by related publications, respectively.

TABLE 8

Table 8. The top 20 microbes associated with asthma.

Inflammatory Bowel Disease (IBD) is a periodic inflammation. It may be produced by a deregulated immune response to gut microbiome dysbiosis (Halfvarson et al., 2017). In the second class, we mask all association information for IBD to find possible microbes. The results are shown in Table 9. Among the predicted top 10 and 20 microbe-IBD association pairs, there are 9 and 17 microbes that are validated to associate with IBD by recent works, respectively.

TABLE 9

Table 9. The top 20 microbes associated with asthma.

5. Discussion

There are numerous microbes in the human body. They play an important role in various biological processes. Many human diseases including gastrointestinal diseases are reported to be closely associated with microorganisms. Therefore, identifying the associations between microbes and diseases helps to understand the pathogenic mechanisms of these diseases and further develop new drugs.

Traditional experimental methods applied to validate possible associations between microbes and diseases are expensive and time-consuming, computational methods are developed to solve with this problem. However, the performance of existing computational models need to further improve. More importantly, lacking of reliable negative MDA examples affects prediction performance. Therefore, RNMFMDA is developed to find possible MDAs.

RNMFMDA obtained the optimal performance under three CVs. We analyzed the reason that RNMFMDA obtained excellent performance and thought that it may be contributed to the following three features. First, we developed a high-quality negative MDA extraction method based on PU learning and random walk with restart. Second, LMFNR is a optimal model in predicting associations between two entities. Finally, we integrated various heterogeneous biological information. Multiple heterogeneous data integration efficiently reflected the biological features of MDAs.

In the future, we will construct a multi-partite network by integrating MDAs, disease-gene associations (Tran et al., 2020), miRNA-disease associations (Peng et al., 2018a; Huang et al., 2019), long non-coding RNA-protein interactions (Zhao et al., 2018; Peng et al., 2019), and long non-coding RNA-disease associations (Chen et al., 2018; Li et al., 2019). More importantly, we will still develop more robust models, for example, ensemble strategy (Hu et al., 2018) and deep learning-based models (Min et al., 2017; Peng L. et al., 2018) to improve MDA prediction.

Data Availability Statement

All datasets generated for this study are included in the article/Supplementary Material.

Author Contributions

LP, LS, and LZ: conceptualization. LP: funding acquisition, project administration, writing—original draft, and writing—review and editing. LP and LZ: investigation. LP and LS: methodology. LS, LL, and GL: software. LS: validation. All authors contributed to the article and approved the submitted version.

Funding

This research was funded by the National Natural Science Foundation of China (Grant 61803151) and the Natural Science Foundation of Hunan province (Grant 2018JJ3570).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank all authors of the cited references.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2020.592430/full#supplementary-material

References

Abu-Ali, G. S., Mehta, R. S., Lloyd-Price, J., Mallick, H., Branck, T., Ivey, K. L., et al. (2018). Metatranscriptome of human faecal microbial communities in a cohort of adult men. Nat. Microbiol. 3:356. doi: 10.1038/s41564-017-0084-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Bao, W., Jiang, Z., and Huang, D.-S. (2017). Novel human microbe-disease association prediction using network consistency projection. BMC Bioinformatics 18:543. doi: 10.1186/s12859-017-1968-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Byrd, A. L., Belkaid, Y., and Segre, J. A. (2018). The human skin microbiome. Nat. Rev. Microbiol. 16:143. doi: 10.1038/nrmicro.2017.157

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, L., Zhang, Y.-H., Huang, G., Pan, X., Wang, S., Huang, T., et al. (2018). Discriminating cirRNAs from other lncRNAs using a hierarchical extreme learning machine (H-ELM) algorithm with feature selection. Mol. Genet. Genomics 293, 137–149. doi: 10.1007/s00438-017-1372-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., Huang, Y.-A., You, Z.-H., Yan, G.-Y., and Wang, X.-S. (2017). A novel approach based on Katz measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33, 733–739. doi: 10.1093/bioinformatics/btw715

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., Liu, M.-X., and Yan, G.-Y. (2012). Drug-target interaction prediction by random walk on the heterogeneous network. Mol. Biosyst. 8, 1970–1978. doi: 10.1039/c2mb00002d

PubMed Abstract | CrossRef Full Text | Google Scholar

Davis-Richardson, A. G., Ardissone, A. N., Dias, R., Simell, V., Leonard, M. T., Kemppainen, K. M., et al. (2014). Bacteroides dorei dominates gut microbiome prior to autoimmunity in Finnish children at high risk for type 1 diabetes. Front. Microbiol. 5:678. doi: 10.3389/fmicb.2014.00678

PubMed Abstract | CrossRef Full Text | Google Scholar

Ding, T., and Schloss, P. D. (2014). Dynamics and associations of microbial community types across the human body. Nature 509:357. doi: 10.1038/nature13178

PubMed Abstract | CrossRef Full Text | Google Scholar

Findley, K., Oh, J., Yang, J., Conlan, S., Deming, C., Meyer, J. A., et al. (2013). Topographic diversity of fungal and bacterial communities in human skin. Nature 498:367. doi: 10.1038/nature12171

PubMed Abstract | CrossRef Full Text | Google Scholar

Halfvarson, J., Brislawn, C. J., Lamendella, R., Vázquez-Baeza, Y., Walters, W. A., Bramer, L. M., et al. (2017). Dynamics of the human gut microbiome in inflammatory bowel disease. Nat. Microbiol. 2, 1–7. doi: 10.1038/nmicrobiol.2017.4

PubMed Abstract | CrossRef Full Text | Google Scholar

Heavey, P. M., and Rowland, I. R. (2004). Gastrointestinal cancer. Best Pract. Res. Clin. Gastroenterol. 18, 323–336. doi: 10.1016/j.bpg.2003.10.003

PubMed Abstract | CrossRef Full Text | Google Scholar

Hu, H., Zhang, L., Ai, H., Zhang, H., Fan, Y., Zhao, Q., et al. (2018). HLPI-ensemble: prediction of human lncRNA-protein interactions based on ensemble strategy. RNA Biol. 15, 797–806. doi: 10.1080/15476286.2018.1457935

PubMed Abstract | CrossRef Full Text | Google Scholar

Huang, F., Yue, X., Xiong, Z., Yu, Z., and Zhang, W. (2019). Tensor decomposition with relational constraints for predicting multiple types of microRNA-disease associations. arXiv preprint arXiv:1911.05584. doi: 10.1093/bib/bbaa140

PubMed Abstract | CrossRef Full Text | Google Scholar

Huang, Y.-A., You, Z.-H., Chen, X., Huang, Z.-A., Zhang, S., and Yan, G.-Y. (2017). Prediction of microbe-disease association from the integration of neighbor and graph with collaborative recommendation model. J. Transl. Med. 15:209. doi: 10.1186/s12967-017-1304-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Langella, P., and Martín, R. (2019). Emerging health concepts in the probiotics field: streamlining the definitions. Front. Microbiol. 10:1047. doi: 10.3389/fmicb.2019.01047

PubMed Abstract | CrossRef Full Text | Google Scholar

Ley, R. E., Turnbaugh, P. J., Klein, S., and Gordon, J. I. (2006). Microbial ecology: human gut microbes associated with obesity. Nature 444:1022. doi: 10.1038/4441022a

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, G., Luo, J., Liang, C., Xiao, Q., Ding, P., and Zhang, Y. (2019). Prediction of lncRNA-disease associations based on network consistency projection. IEEE Access 7, 58849–58856. doi: 10.1109/ACCESS.2019.2914533

CrossRef Full Text | Google Scholar

Li, H., Chen, Z., Liu, B., Wei, X., and Shao, J. (2014). “Spotting fake reviews via collective positive-unlabeled learning,” in 2014 IEEE International Conference on Data Mining (Shenzhen: IEEE), 899–904. doi: 10.1109/ICDM.2014.47

CrossRef Full Text | Google Scholar

Liu, H., Han, M., Li, S. C., Tan, G., Sun, S., Hu, Z., et al. (2019). Resilience of human gut microbial communities for the long stay with multiple dietary shifts. Gut 68, 2254–2255. doi: 10.1136/gutjnl-2018-317298

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, H., Ren, G., Chen, H., Liu, Q., Yang, Y., and Zhao, Q. (2020). Predicting lncRNA-miRNA interactions based on logistic matrix factorization with neighborhood regularized. Knowl. Based Syst. 191:105261. doi: 10.1016/j.knosys.2019.105261

CrossRef Full Text | Google Scholar

Liu, Y., Wu, M., Miao, C., Zhao, P., and Li, X.-L. (2016). Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput. Biol. 12:e1004760. doi: 10.1371/journal.pcbi.1004760

CrossRef Full Text | Google Scholar

Luo, J., and Long, Y. (2018). NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1341–51. doi: 10.1109/TCBB.2018.2883041

PubMed Abstract | CrossRef Full Text | Google Scholar

Ma, W., Zhang, L., Zeng, P., Huang, C., Li, J., Geng, B., et al. (2016). An analysis of human microbe-disease associations. Brief. Bioinform. 18, 85–97. doi: 10.1093/bib/bbw005

PubMed Abstract | CrossRef Full Text | Google Scholar

McFarland, L. V. (2000). Beneficial microbes: health or hazard? Eur. J. Gastroenterol. Hepatol. 12, 1069–1071. doi: 10.1097/00042737-200012100-00001

CrossRef Full Text | Google Scholar

Min, S., Lee, B., and Yoon, S. (2017). Deep learning in bioinformatics. Brief. Bioinform. 18, 851–869. doi: 10.1093/bib/bbw068

CrossRef Full Text | Google Scholar

Pavord, I. D., Beasley, R., Agusti, A., Anderson, G. P., Bel, E., Brusselle, G., et al. (2018). After asthma: redefining airways diseases. Lancet 391, 350–400. doi: 10.1016/S0140-6736(17)30879-6

PubMed Abstract | CrossRef Full Text | Google Scholar

Peng, L., Liu, F., Yang, J., Liu, X., Meng, Y., Deng, X., et al. (2019). Probing lncRNA-protein interactions: data repositories, models, and algorithms. Front. Genet. 10:1346. doi: 10.3389/fgene.2019.01346

PubMed Abstract | CrossRef Full Text | Google Scholar

Peng, L., Peng, M., Liao, B., Huang, G., Li, W., and Xie, D. (2018). The advances and challenges of deep learning application in biological big data processing. Curr. Bioinform. 13, 352–359. doi: 10.2174/1574893612666170707095707

CrossRef Full Text | Google Scholar

Peng, L.-H., Sun, C.-N., Guan, N.-N., Li, J.-Q., and Chen, X. (2018a). HNMDA: heterogeneous network-based miRNA-disease association prediction. Mol. Genet. Genomics 293, 983–995. doi: 10.1007/s00438-018-1438-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Peng, L.-H., Yin, J., Zhou, L., Liu, M.-X., and Zhao, Y. (2018b). Human microbe-disease association prediction based on adaptive boosting. Front. Microbiol. 9:2440. doi: 10.3389/fmicb.2018.02440

PubMed Abstract | CrossRef Full Text | Google Scholar

Qu, J., Zhao, Y., and Yin, J. (2019). Identification and analysis of human microbe-disease associations by matrix decomposition and label propagation. Front. Microbiol. 10:291. doi: 10.3389/fmicb.2019.00291

PubMed Abstract | CrossRef Full Text | Google Scholar

Qu, K., Guo, F., Liu, X., Lin, Y., and Zou, Q. (2019). Application of machine learning in microbiology. Front. Microbiol. 10:827. doi: 10.3389/fmicb.2019.00827

PubMed Abstract | CrossRef Full Text | Google Scholar

Sachdeva, R., Campbell, B. J., and Heidelberg, J. F. (2019). Rare microbes from diverse earth biomes dominate community activity. bioRxiv 636373. doi: 10.1101/636373

CrossRef Full Text | Google Scholar

Shi, J.-Y., Huang, H., Zhang, Y.-N., Cao, J.-B., and Yiu, S.-M. (2018). BMCMDA: a novel model for predicting human microbe-disease associations via binary matrix completion. BMC Bioinformatics 19:169. doi: 10.1186/s12859-018-2274-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Sze, M. A., Dimitriu, P. A., Hayashi, S., Elliott, W. M., McDonough, J. E., Gosselink, J. V., et al. (2012). The lung tissue microbiome in chronic obstructive pulmonary disease. Am. J. Respirat. Crit. Care Med. 185, 1073–1080. doi: 10.1164/rccm.201111-2075OC

PubMed Abstract | CrossRef Full Text | Google Scholar

Tran, V. D., Sperduti, A., Backofen, R., and Costa, F. (2020). Heterogeneous networks integration for disease-gene prioritization with node kernels. Bioinformatics 36, 2649–2656. doi: 10.1093/bioinformatics/btaa008

PubMed Abstract | CrossRef Full Text | Google Scholar

van Laarhoven, T., Nabuurs, S. B., and Marchiori, E. (2011). Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27, 3036–3043. doi: 10.1093/bioinformatics/btr500

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, F., Huang, Z.-A., Chen, X., Zhu, Z., Wen, Z., Zhao, J., et al. (2017). LRLSHMDA: Laplacian regularized least squares for human microbe-disease association prediction. Sci. Rep. 7, 1–11. doi: 10.1038/s41598-017-08127-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, W., Jing, K., Huang, F., Chen, Y., Li, B., Li, J., et al. (2019a). SFLLN: a sparse feature learning ensemble method with linear neighborhood regularization for predicting drug-drug interactions. Inform. Sci. 497, 189–201. doi: 10.1016/j.ins.2019.05.017

CrossRef Full Text | Google Scholar

Zhang, W., Li, Z., Guo, W., Yang, W., and Huang, F. (2019b). A fast linear neighborhood similarity-based network link inference method to predict microRNA-disease associations. IEEE/ACM Trans. Comput. Biol. Bioinform. 1. doi: 10.1109/TCBB.2019.2931546

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, W., Tang, G., Zhou, S., and Niu, Y. (2019c). LncRNA-miRNA interaction prediction through sequence-derived linear neighborhood propagation method with information combination. BMC Genomics 20:946. doi: 10.1186/s12864-019-6284-y

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, W., Yang, W., Lu, X., Huang, F., and Luo, F. (2018). The bi-direction similarity integration method for predicting microbe-disease associations. IEEE Access 6, 38052–38061. doi: 10.1109/ACCESS.2018.2851751

CrossRef Full Text | Google Scholar

Zhao, Q., Yu, H., Ming, Z., Hu, H., Ren, G., and Liu, H. (2018). The bipartite network projection-recommended algorithm for predicting long non-coding RNA-protein interactions. Mol. Therapy Nucleic Acids 13, 464–471. doi: 10.1016/j.omtn.2018.09.020

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhou, X., Menche, J., Barabási, A.-L., and Sharma, A. (2014). Human symptoms-disease network. Nat. Commun. 5:4212. doi: 10.1038/ncomms5212

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhu, Y.-G., Gillings, M., Simonet, P., Stekel, D., Banwart, S., and Penuelas, J. (2018). Human dissemination of genes and microorganisms in Earth's critical zone. Glob. Change Biol. 24, 1488–1499. doi: 10.1111/gcb.14003

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: microbe-disease associations, reliable negative samples, positive-unlabeled learning, random walk with restart, logistic matrix factorization with neighborhood regularization

Citation: Peng L, Shen L, Liao L, Liu G and Zhou L (2020) RNMFMDA: A Microbe-Disease Association Identification Method Based on Reliable Negative Sample Selection and Logistic Matrix Factorization With Neighborhood Regularization. Front. Microbiol. 11:592430. doi: 10.3389/fmicb.2020.592430

Received: 07 August 2020; Accepted: 17 September 2020;
Published: 27 October 2020.

Edited by:

Qi Zhao, University of Science and Technology Liaoning, China

Reviewed by:

Wen Zhang, Huazhong Agricultural University, China
Jiyuan An, Queensland University of Technology, Australia

Copyright © 2020 Peng, Shen, Liao, Liu and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Liqian Zhou, emhvdWxxMTFAMTYzLmNvbQ==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.