ORIGINAL RESEARCH article

Front. Microbiol., 21 June 2023

Sec. Systems Microbiology

Volume 14 - 2023 | https://doi.org/10.3389/fmicb.2023.1207209

SAELGMDA: Identifying human microbe–disease associations based on sparse autoencoder and LightGBM

  • 1. School of Computer Science, Hunan University of Technology, Zhuzhou, China

  • 2. Department of Gastrointestinal Surgery, Yidu Central Hospital of Weifang, Weifang, China

  • 3. Geneis (Beijing) Co., Ltd., Beijing, China

  • 4. The Second Department of Oncology, Beidahuang Industry Group General Hospital, Harbin, China

  • 5. The Second Department of Oncology, Heilongjiang Second Cancer Hospital, Harbin, China

Abstract

Introduction:

Identification of complex associations between diseases and microbes is important to understand the pathogenesis of diseases and design therapeutic strategies. Biomedical experiment-based Microbe-Disease Association (MDA) detection methods are expensive, time-consuming, and laborious.

Methods:

Here, we developed a computational method called SAELGMDA for potential MDA prediction. First, microbe similarity and disease similarity are computed by integrating their functional similarity and Gaussian interaction profile kernel similarity. Second, one microbe-disease pair is presented as a feature vector by combining the microbe and disease similarity matrices. Next, the obtained feature vectors are mapped to a low-dimensional space based on a Sparse AutoEncoder. Finally, unknown microbe-disease pairs are classified based on Light Gradient boosting machine.

Results:

The proposed SAELGMDA method was compared with four state-of-the-art MDA methods (MNNMDA, GATMDA, NTSHMDA, and LRLSHMDA) under five-fold cross validations on diseases, microbes, and microbe-disease pairs on the HMDAD and Disbiome databases. The results show that SAELGMDA achieved the best accuracy, Matthews correlation coefficient, AUC, and AUPR under the majority of conditions, outperforming the other four MDA prediction models. In particular, SAELGMDA obtained the best AUCs of 0.8358 and 0.9301 under cross validation on diseases, 0.9838 and 0.9293 under cross validation on microbes, and 0.9857 and 0.9358 under cross validation on microbe-disease pairs on the HMDAD and Disbiome databases, respectively. Colorectal cancer, inflammatory bowel disease, and lung cancer are diseases that severely threaten human health. We used the proposed SAELGMDA method to find possible microbes for the three diseases. The results suggest a potential association between Clostridium coccoides and colorectal cancer and another between Sphingomonadaceae and inflammatory bowel disease. In addition, Veillonella may be associated with autism. The inferred MDAs need further validation.

Conclusion:

We anticipate that the proposed SAELGMDA method contributes to the identification of new MDAs.

1. Introduction

Human microbes are a class of organisms with simple structure and small size (Wu et al., 2018; Cheng et al., 2020). They are widely distributed in various organs of the human body, including the gut, gastrointestinal tract, lung, oral cavity, and skin (Lynch and Pedersen, 2016). Their abnormality may cause diseases, such as cancers, inflammatory bowel disease (El Mouzan et al., 2018), and asthma (Demirci et al., 2019). Therefore, it is important to uncover potential associations between microbes and diseases. Identification of Microbe-Disease Associations (MDAs) helps capture the complex pathogenesis of various diseases and provides novel insights into drug design. For example, a few methods have been developed to capture potential drugs against COVID-19 (Peng et al., 2022a; Shen L. et al., 2022; Tian et al., 2022). Traditional experimental methods are expensive, time-consuming, and laborious (Chen et al., 2019). Thus, much attention has been devoted to computational methods for new MDA prediction.

Many computational models have been designed to find potential MDAs based on known MDAs and biological features of diseases and microbes. These methods mainly comprise network-based methods and machine learning-based methods. Network-based MDA prediction methods include the KATZ measurement (Zhang et al., 2017; Li et al., 2019), random walk with network topology structure (NTSHMDA) (Luo and Long, 2018), and bi-random walk (Zou et al., 2017; Luo and Long, 2018; Yan et al., 2019). Network-based methods have effectively found a few new MDAs; however, they depend on known MDAs for similarity calculation and fail to screen possible microbes (or diseases) for a new disease (or microbe) that has no known associations.

Machine learning-based MDA prediction methods include Laplacian regularized least squares (LRLSHMDA) (Wang et al., 2017), binary matrix completion (Shi et al., 2018), graph regularized non-negative matrix factorization (He et al., 2018), logistic matrix factorization with neighborhood regularization combining positive-unlabeled learning (Peng et al., 2020), inductive matrix completion and graph attention networks (GATMDA) (Long et al., 2021), and low-rank matrix completion combining the nuclear norm minimization (MNNMDA) (Liu H. et al., 2023). Machine learning algorithms have further improved MDA prediction.

In particular, deep learning has been increasingly applied to the area of bioinformatics, such as cardiotoxicity identification related to hERG channel blockers (Wang T. et al., 2023), protein model quality assessment (Guo et al., 2022; Liu J. et al., 2023), metabolite-disease association discovery (Sun et al., 2022), lncRNA-protein interaction prediction (Lihong et al., 2021), lncRNA-miRNA association inference (Chen et al., 2021; Wang et al., 2022), lncRNA-disease association identification (Liang et al., 2022; Zhang et al., 2023), single-cell data analysis (Hu et al., 2023; Xu et al., 2023), drug-target interaction detection (Zhang et al., 2022; Li et al., 2023), and intercellular communication analyses (Peng et al., 2022b). Similarly, deep learning has been widely applied to accurate MDA prediction. These methods include deep matrix factorization combining Bayesian personalized ranking (Liu et al., 2020), multi-component graph attention network (Liu et al., 2021), graph convolutional network (Hua et al., 2022), metapath aggregated graph neural network (Chen and Lei, 2022), dual network contrastive learning model (Cheng et al., 2022), weighted meta-graph-based model (Long and Luo, 2019), knowledge graph neural network (Jiang et al., 2022), and relation graph convolutional network (Wang Y. et al., 2023).

Deep learning efficiently implements accurate MDA identification. In this manuscript, we developed a computational MDA prediction method called SAELGMDA by combining a sparse autoencoder for feature extraction and Light Gradient Boosting Machine (LightGBM) for MDA classification.

2. Materials and methods

2.1. Data description

To construct a human MDA network, we investigated a human MDA database called HMDAD provided by Ma et al. (2017) (http://www.cuilab.cn/hmdad). The database contains 483 experimentally confirmed MDAs between 39 diseases and 292 microbes. After removing duplicate records, we obtained 450 MDAs. In addition, Janssens et al. (2018) collected a newer MDA database named Disbiome. The database contains 5,573 experimentally confirmed human MDAs between 1,098 microbes and 240 diseases. After removing duplicate records, we obtained 4,351 MDAs between 1,052 microbes and 218 diseases.

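Consequently, an element Xij in an nd × nm MDA matrix X is represented as Eq. (1):

$$X_{ij} = \begin{cases} 1, & \text{if disease } d_i \text{ is associated with microbe } m_j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where nd and nm indicate the number of diseases and microbes, respectively. A microbe–disease pair is taken as a positive sample if Xij = 1; otherwise, it is taken as an unlabeled sample.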
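As an illustration of this step, the following minimal sketch builds a binary association matrix from a list of (disease, microbe) records and drops repeated records along the way. The function name `build_association_matrix` and the toy identifiers are hypothetical and are not taken from HMDAD or Disbiome.

```python
import numpy as np

def build_association_matrix(pairs, diseases, microbes):
    """Return the binary n_d x n_m matrix X with X[i, j] = 1 for known MDAs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(microbes)}
    X = np.zeros((len(diseases), len(microbes)), dtype=int)
    for disease, microbe in set(pairs):  # set() removes repeated records
        X[d_index[disease], m_index[microbe]] = 1
    return X

# Toy, made-up records (hypothetical identifiers).
diseases = ["d1", "d2"]
microbes = ["m1", "m2", "m3"]
pairs = [("d1", "m1"), ("d1", "m1"), ("d2", "m3")]
X = build_association_matrix(pairs, diseases, microbes)
print(X)
```

Entries equal to 0 are unlabeled rather than confirmed negatives, which is why the text calls them unlabeled samples.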

2.2. Methods

In this manuscript, we propose an MDA prediction method called SAELGMDA that combines a sparse autoencoder and LightGBM. First, disease similarity and microbe similarity are computed by integrating functional similarity and Gaussian Interaction Profile Kernel (GIPK) similarity. Second, each microbe–disease pair is represented as a d-dimensional vector. Third, the obtained features for microbe–disease pairs are mapped into a low-dimensional space via a sparse autoencoder. Finally, the low-dimensional features are fed to LightGBM for MDA classification. The pipeline of SAELGMDA is illustrated in Figure 1.

Figure 1

2.2.1. Functional similarity of diseases and microbes

We considered that similar diseases are more likely to associate with similar genes (Xu and Li, 2006; Wei and Liu, 2020) and computed disease functional similarity via disease-related genes. For two diseases di and dj and corresponding associated gene sets Gi = {gi1, gi2, …, gia} and Gj = {gj1, gj2, …, gjb}, the functional association between gene gk and gene set G = {g1, g2, …, gl} is first defined by Eq. (2):

where FS(gk, gt) indicates the functional similarity between gk and gt by Eq. (3):

where LLS′ denotes the normalized score of LLS by Eq. (4):

where LLS represents association log-likelihood score used to evaluate the functional linkage probability between two genes provided by HumanNet (Hwang et al., 2019; Long et al., 2021), LLSmax and LLSmin denote its maximum and minimum values, respectively.

Finally, the functional similarity between di and dj is computed by Eq. (5):

Microbe functional similarity matrix Mf is computed based on the method proposed by Kamneva (2017).
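For orientation, Eqs. (2)–(5) can be sketched in a form that is commonly used for gene-set-based functional similarity. This is an assumed reconstruction from the definitions in the text, not a verbatim copy of the authors' equations; the symbols F_G and D_f are introduced here for illustration only.

```latex
% Assumed common form of Eqs. (2)-(5); symbol names F_G and D_f are illustrative.
\begin{align}
F_{G}(g_k) &= \max_{1 \le t \le l} FS(g_k, g_t) \\
FS(g_k, g_t) &=
  \begin{cases}
    1, & k = t \\
    LLS'(g_k, g_t), & k \ne t
  \end{cases} \\
LLS'(g_k, g_t) &= \frac{LLS(g_k, g_t) - LLS_{\min}}{LLS_{\max} - LLS_{\min}} \\
D_f(d_i, d_j) &= \frac{\sum_{g \in G_i} F_{G_j}(g) + \sum_{g \in G_j} F_{G_i}(g)}{a + b}
\end{align}
```

Under this form, each gene of one disease is matched to its most similar gene in the other disease's gene set, and the matched scores are averaged over the a + b genes of both sets, so the similarity lies in [0, 1].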

2.2.2. GIPK similarity of diseases and microbes

Based on the assumption that functionally similar diseases usually associate or disassociate with similar microbes, the disease Gaussian Interaction Profile Kernel (GIPK) similarity (Van Laarhoven et al., 2011) is computed via the experimentally validated MDA network. In particular, the GIPK similarity of two diseases di and dj is computed by Eq. (6):

$$D_G(d_i, d_j) = \exp\left(-\gamma_d \lVert IP(d_i) - IP(d_j) \rVert^2\right) \tag{6}$$

where

$$\gamma_d = \gamma'_d \Big/ \left(\frac{1}{n_d} \sum_{k=1}^{n_d} \lVert IP(d_k) \rVert^2\right) \tag{7}$$

and IP(di) denotes the associations between disease di and each microbe, that is, the ith row of X. γd denotes the normalized kernel bandwidth with an original bandwidth γ′d of 1, and nd denotes the number of diseases.

Similarly, we computed the GIPK similarity matrix MG of microbes.
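The GIPK computation can be sketched with NumPy as follows. This is a minimal illustration that follows the standard definition of Van Laarhoven et al. (2011), in which the bandwidth is normalized by the mean squared norm of the interaction profiles. The function name `gipk_similarity` and the toy matrix are hypothetical, and the sketch assumes at least one known association so that the normalizer is non-zero.

```python
import numpy as np

def gipk_similarity(X, bandwidth=1.0):
    """Gaussian interaction profile kernel between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)              # ||IP(d_i)||^2 for every row
    gamma = bandwidth / sq_norms.mean()          # normalized bandwidth
    # Squared Euclidean distances between all pairs of interaction profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)         # guard against round-off
    return np.exp(-gamma * sq_dists)

# Toy association matrix: 3 diseases (rows) x 3 microbes (columns).
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
DG = gipk_similarity(X)      # disease-disease similarity (rows of X)
MG = gipk_similarity(X.T)    # microbe-microbe similarity (columns of X)
```

Diseases with identical interaction profiles obtain a similarity of 1, and the similarity decays as the profiles diverge.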

2.2.3. Similarity integration for diseases and microbes

Functional similarity cannot be computed for every disease because not all diseases have associated genes. Thus, we combined disease GIPK similarity and functional similarity by Eq. (8):

Similarly, the integrated microbe similarity SM is computed.

2.2.4. Feature representation for microbe–disease associations

For each microbe–disease pair (mi, dj), the feature vectors of mi and dj are obtained based on the similarity matrices SM and SD, respectively. In particular, the feature vector of dj is the vector of similarities between dj and all diseases, and the feature vector of mi is the vector of similarities between mi and all microbes. Thus, one microbe–disease pair is depicted as an (nd+nm)-dimensional feature vector after concatenation, where nd and nm indicate the number of diseases and microbes, respectively. In summary, there are n (n = nd×nm) samples (microbe–disease pairs), and each sample xi can be represented using a d (d = nd+nm)-dimensional vector. For xi, its label yi = 1 if the corresponding microbe and disease are associated; otherwise, yi = 0. Consequently, an MDA matrix X with n samples is represented by Eq. (9):
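The feature construction described above can be sketched as follows. The function name `pair_features` is hypothetical, and placing the disease part before the microbe part in the concatenated vector is an assumption, since the text does not fix the order.

```python
import numpy as np

def pair_features(SD, SM, X):
    """Build one (n_d + n_m)-dimensional feature vector per microbe-disease pair.

    SD: n_d x n_d integrated disease similarity matrix.
    SM: n_m x n_m integrated microbe similarity matrix.
    X:  n_d x n_m binary association matrix (rows: diseases, columns: microbes).
    """
    n_d, n_m = X.shape
    d_idx, m_idx = np.meshgrid(np.arange(n_d), np.arange(n_m), indexing="ij")
    d_idx, m_idx = d_idx.ravel(), m_idx.ravel()
    # Concatenate the disease's similarity row with the microbe's similarity row.
    features = np.hstack([SD[d_idx], SM[m_idx]])
    labels = X[d_idx, m_idx]
    return features, labels

# Toy example with identity similarities (2 diseases, 3 microbes).
SD = np.eye(2)
SM = np.eye(3)
X = np.array([[1, 0, 0],
              [0, 0, 1]])
features, labels = pair_features(SD, SM, X)
print(features.shape)  # one row per pair, n_d + n_m columns
```

With HMDAD (39 diseases and 292 microbes) this yields 11,388 samples of dimension 331, which matches the input layer size reported in the experimental settings.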

2.2.5. Feature extraction based on sparse autoencoder

The obtained features for microbe–disease pairs are high-dimensional, which can severely degrade the classification accuracy of models. Deep learning demonstrates stronger feature learning ability than traditional dimensionality reduction approaches. Thus, we designed a sparse autoencoder to reduce the feature dimensionality of each sample.

A sparse autoencoder (Andrew, 2011) is an unsupervised neural network model. It minimizes the reconstruction error and enforces sparsity constraints on all hidden nodes to obtain a more robust and meaningful representation of features, which further improves the prediction performance of classification models (Makhzani and Frey, 2013). First, a high-dimensional feature vector for the microbe–disease pair is fed to an encoder by Eq. (10):

$$H = f(WX + b) \tag{10}$$

where X represents the n input samples with d-dimensional feature vectors, H denotes the low-dimensional features after encoding, and W, b, and f(·) represent the weight, bias, and encoding function of the encoder, respectively.

Next, a decoder restores the low-dimensional representation H to the same shape as the input feature representation by Eq. (11):

$$\hat{X} = g(W'H + b') \tag{11}$$

where W′, b′, and g(·) represent the weight, bias, and decoding function of the decoder, respectively, and X̂ denotes the reconstructed feature representation.

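To minimize the reconstruction error, we build a cost function by Eq. (12):

$$J = MSE + \lambda \, \Omega_{sparsity} + \beta \, \Omega_{weights} \tag{12}$$

where λ and β denote the sparsity regularization parameter and the coefficient for L2 regularization, respectively.

The first term, MSE, is the mean square error. It measures the discrepancy between the input features X and the reconstructed features X̂ on the training data by Eq. (13):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 \tag{13}$$

The second term, Ωsparsity, is based on the Kullback–Leibler divergence. It controls sparsity through the sparsity proportion ρ by Eq. (14):

$$\Omega_{sparsity} = \sum_{t=1}^{s_l} KL(\rho \,\Vert\, \hat{\rho}_t) \tag{14}$$

where sl and ρ̂t denote the number of neurons in the lth hidden layer and the average activity of the tth neuron, respectively, and KL(ρ‖ρ̂t) denotes the relative entropy between Bernoulli random variables with mean ρ and mean ρ̂t. KL(ρ‖ρ̂t) is computed by Eq. (15):

$$KL(\rho \,\Vert\, \hat{\rho}_t) = \rho \log\frac{\rho}{\hat{\rho}_t} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_t} \tag{15}$$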

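The third term is the L2 regularization term Ωweights. It controls the magnitude of the weights and helps avoid overfitting by Eq. (16):

$$\Omega_{weights} = \frac{1}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(w_{ji}^{(l)}\right)^2 \tag{16}$$

where nl, sl, and w denote the number of layers, the number of units in the lth layer, and the weights, respectively.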
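To make the three terms of the cost function concrete, the following NumPy sketch evaluates Eqs. (12)–(16) for given inputs, reconstructions, hidden activations, and weight matrices. It is an illustration under stated assumptions rather than the authors' Keras implementation: the function names are hypothetical, and the average activations are clipped into (0, 1) so that the KL term stays finite, which is an added safeguard not described in the text.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """Relative entropy between Bernoulli variables with means rho and rho_hat."""
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_autoencoder_loss(X, X_hat, H, weights,
                            lam=0.1, beta=0.0005, rho=0.05, eps=1e-8):
    """MSE + lam * sparsity penalty + beta * L2 weight penalty."""
    # Mean over samples of the squared reconstruction error.
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    # Average activity of each hidden neuron, clipped (assumption) to (0, 1).
    rho_hat = np.clip(H.mean(axis=0), eps, 1.0 - eps)
    sparsity = kl_bernoulli(rho, rho_hat).sum()
    # Half the sum of squared weights over all layers.
    l2 = 0.5 * sum((W ** 2).sum() for W in weights)
    return mse + lam * sparsity + beta * l2

# Toy check: perfect reconstruction and activations exactly at the target rho.
X = np.ones((4, 3))
H = np.full((4, 2), 0.05)
weights = [np.ones((2, 2))]
loss = sparse_autoencoder_loss(X, X.copy(), H, weights)
```

In this toy case only the weight penalty contributes, so the loss reduces to β times half the sum of squared weights. The default values of `lam`, `beta`, and `rho` mirror the settings reported in Section 3.1.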

2.2.6. MDA classification based on LightGBM

Each microbe–disease pair is represented as a low-dimensional vector after dimensionality reduction by the sparse autoencoder. LightGBM (Ke et al., 2017) is an optimized version of the Gradient Boosting Decision Tree (GBDT) (Ye et al., 2009) and has performed well in bioinformatics applications. The constructed low-dimensional vectors are therefore used as the input of LightGBM to classify each microbe–disease pair. For an MDA dataset {(xi, yi)}, i = 1, …, n, LightGBM intends to learn an approximation to a certain function f(x) by minimizing the expectation of the loss function L(y, f(x)) by Eq. (17):

LightGBM integrates T decision trees to approximate the final model. A decision tree with J leaf nodes is expressed as wq(x), where q(x) denotes the decision rules that assign a sample to a leaf and w denotes the weights of the leaf nodes. Hence, the loss function of LightGBM is defined by Eq. (18):

The constant term in model (18) is removed for simplicity, and model (18) is transformed as Eq. (19):

where gi and hi denote the first-order and second-order derivatives of the loss function, respectively.

For the sample set Ij related to leaf j, model (19) can be transformed as Eq. (20):

Given a tree structure q(x), the optimal weight of each leaf node and the extreme value of the scoring function Γk that evaluates the quality of q(x) are defined by Eqs. (21) and (22):

Consequently, the objective function is represented as Eq. (23):

where IL and IR denote the example sets on the left and right sides, respectively.
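The role of the first-order and second-order derivatives gi and hi can be illustrated with the second-order formulas that are standard for gradient boosted trees: the optimal leaf weight is the negative ratio of summed gradients to summed Hessians, and a split is scored by how much it increases the sum of squared-gradient-over-Hessian terms. The sketch below assumes this standard form; the function names and the optional `reg_lambda` term are hypothetical and may differ from the exact equations used by the authors.

```python
import numpy as np

def leaf_weight(g, h, reg_lambda=0.0):
    """Optimal leaf weight: -sum(g) / (sum(h) + lambda)."""
    return -np.sum(g) / (np.sum(h) + reg_lambda)

def leaf_score(g, h, reg_lambda=0.0):
    """Quality term of a leaf: (sum g)^2 / (sum h + lambda)."""
    return np.sum(g) ** 2 / (np.sum(h) + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=0.0):
    """Gain of splitting a node into the left set I_L and the right set I_R."""
    left_mask = np.asarray(left_mask, dtype=bool)
    left = leaf_score(g[left_mask], h[left_mask], reg_lambda)
    right = leaf_score(g[~left_mask], h[~left_mask], reg_lambda)
    parent = leaf_score(g, h, reg_lambda)
    return 0.5 * (left + right - parent)

# Toy example with logistic loss: g = p - y and h = p * (1 - p) at p = 0.5.
y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.full(4, 0.5)
g, h = p - y, p * (1.0 - p)
left_mask = np.array([True, True, False, False])  # separates positives from negatives
gain = split_gain(g, h, left_mask)
w_left = leaf_weight(g[left_mask], h[left_mask])
```

A split that cleanly separates positive from negative pairs yields a large gain, whereas a split that leaves both children with the same class mix as the parent yields a gain of zero.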

3. Results

3.1. Experimental settings and evaluation metrics

Following RNMFMDA (Peng et al., 2020), the experiments were performed under three five-fold cross validations (CVs), each repeated 20 times. For an MDA matrix X, the three CVs were as follows:

  • five-fold CV 1 (CV1): CV on diseases, i.e., in each round, 80% of the nd diseases in X were taken as the training set and the remaining 20% as the test set.

  • five-fold CV 2 (CV2): CV on microbes, i.e., in each round, 80% of the nm microbes in X were taken as the training set and the remaining 20% as the test set.

  • five-fold CV 3 (CV3): CV on microbe–disease pairs, i.e., in each round, 80% of the entries (microbe–disease pairs) in X were taken as the training set and the remaining 20% as the test set.

The sparse autoencoder comprised an encoder and a decoder and was trained in Keras with the TensorFlow backend. The encoder comprised one input layer, three hidden layers, and an output layer. The numbers of neurons in these layers were 331, 256, 128, 96, and 64, respectively. The layers in the encoder and decoder were symmetric around the bottleneck. Tanh was used as the activation function in the output layer and ReLU in the other layers. The Adam algorithm (Kingma and Ba, 2014) was used for optimization. The batch size was set to 32 because a smaller batch size can make the model converge faster. The parameters λ, β, and ρ were set to 0.1, 0.0005, and 0.05, respectively. The final encoding size of the autoencoder was set to 64, that is, the features of MDAs were reduced to 64 dimensions.

For LightGBM, the parameters “num_leaves,” “learning_rate,” and “max_depth” denote the number of leaves in a tree, the learning rate (the step size of each boosting iteration), and the maximum depth of a tree, respectively. They were set to 31, 0.1, and –1, respectively; a max_depth of –1 places no limit on the tree depth. “feature_fraction” and “bagging_fraction” are two hyperparameters in the optimization process. The former denotes the fraction of features sampled at each iteration and was set to 0.9. The latter denotes the fraction of data sampled; it is used to speed up training and reduce overfitting and was also set to 0.9. “min_data” denotes the minimum number of records in a leaf and is also used to reduce overfitting. The parameters of the other four comparison methods were set to the defaults given in the corresponding publications. A microbe–disease pair is taken as a positive MDA when its predicted association probability is greater than 50%; otherwise, it is taken as a negative MDA.
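The reported settings can be collected in a parameter dictionary of the kind accepted by LightGBM, together with the 50% decision rule. This is a configuration sketch only: the "objective" entry is an assumption (the text does not state it), the value of "min_data" is not reported and is therefore omitted, and the helper `label_pairs` is a hypothetical name.

```python
import numpy as np

# Hyperparameters as reported in the text; "objective" is an assumption.
lgbm_params = {
    "objective": "binary",      # assumed: binary classification of pairs
    "num_leaves": 31,           # number of leaves in a tree
    "learning_rate": 0.1,       # step size of each boosting iteration
    "max_depth": -1,            # -1 means no depth limit
    "feature_fraction": 0.9,    # fraction of features sampled per iteration
    "bagging_fraction": 0.9,    # fraction of data sampled
}

def label_pairs(probabilities, threshold=0.5):
    """Label a pair as a positive MDA when its probability exceeds the threshold."""
    return (np.asarray(probabilities) > threshold).astype(int)

predicted = label_pairs([0.2, 0.5, 0.9])
```

Note that in LightGBM row subsampling via "bagging_fraction" only takes effect when "bagging_freq" is set to a positive value; the text does not report that setting.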

Four evaluation metrics were used to measure the performance of MDA prediction methods: accuracy, Matthews correlation coefficient (MCC) (Chicco and Jurman, 2020), area under the ROC curve (AUC), and area under the Precision-Recall curve (AUPR). Higher values for the four evaluation metrics represent better performance.
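For reference, accuracy and MCC can be computed from the confusion matrix, and AUC can be computed as the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair. The following sketch uses these standard definitions; the function names are hypothetical, and AUPR is omitted for brevity.

```python
import numpy as np

def accuracy_and_mcc(y_true, y_pred):
    """Accuracy and Matthews correlation coefficient for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return acc, mcc

def auc_by_ranking(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
acc, mcc = accuracy_and_mcc(y_true, [1, 0, 0, 0])
auc = auc_by_ranking(y_true, [0.9, 0.5, 0.4, 0.3])
```

Because known MDAs are a small fraction of all pairs, accuracy can be high even when few positives are recovered, which is why MCC, AUC, and AUPR are reported alongside it.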

3.2. Performance comparison of SAELGMDA with the other four methods

To evaluate the performance of SAELGMDA, we compared it with four state-of-the-art MDA identification algorithms (MNNMDA, GATMDA, NTSHMDA, and LRLSHMDA) under three CVs on the HMDAD and Disbiome datasets, that is, CV1, CV2, and CV3.

3.2.1. Performance comparison under CV1

Table 1 shows the accuracies, MCCs, AUCs, and AUPRs of SAELGMDA and the other four methods under CV1. The best performance in each column is shown in boldface in Tables 1–6. As shown in Table 1, SAELGMDA achieved the best MCC, AUC, and AUPR on the HMDAD database and the best accuracy, MCC, AUC, and AUPR on the Disbiome database, significantly outperforming the other four MDA prediction methods under CV1. Although its accuracy on HMDAD was slightly lower than that of MNNMDA and GATMDA, the difference was less than 0.01. Overall, SAELGMDA outperformed the other methods, especially in terms of AUC and AUPR. However, all methods, including SAELGMDA, obtained low MCC and AUPR under CV1, which may be caused by the small number of diseases. Figure 2 shows the ROC and PR curves of the five methods on the two databases under CV1.

Table 1

| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | SAELGMDA | 0.9497 ± 0.0022 | **0.1855 ± 0.0116** | **0.8358 ± 0.0109** | **0.2155 ± 0.0075** |
| HMDAD | MNNMDA | **0.9588 ± 0.0009** | 0.1085 ± 0.0109 | 0.6907 ± 0.0040 | 0.1206 ± 0.0021 |
| HMDAD | GATMDA | 0.9562 ± 0.0009 | 0.0421 ± 0.0018 | 0.5152 ± 0.0003 | 0.0816 ± 0.0014 |
| HMDAD | NTSHMDA | 0.9138 ± 0.0006 | 0.0101 ± 0.0008 | 0.6423 ± 0.0085 | 0.0531 ± 0.0007 |
| HMDAD | LRLSHMDA | 0.9421 ± 0.0007 | 0.1182 ± 0.0028 | 0.5343 ± 0.0109 | 0.0769 ± 0.0006 |
| Disbiome | SAELGMDA | **0.9819 ± 0.0000** | **0.3431 ± 0.0059** | **0.9301 ± 0.0002** | **0.3469 ± 0.0037** |
| Disbiome | MNNMDA | 0.9814 ± 0.0000 | 0.1521 ± 0.0008 | 0.6774 ± 0.0010 | 0.1207 ± 0.0004 |
| Disbiome | GATMDA | 0.9807 ± 0.0000 | 0.0542 ± 0.0019 | 0.5214 ± 0.0005 | 0.2166 ± 0.0192 |
| Disbiome | NTSHMDA | 0.9416 ± 0.0000 | 0.0204 ± 0.0000 | 0.5898 ± 0.0002 | 0.0235 ± 0.0000 |
| Disbiome | LRLSHMDA | 0.9772 ± 0.0000 | 0.1469 ± 0.0004 | 0.7200 ± 0.0005 | 0.1109 ± 0.0002 |

The performance of five MDA identification methods under CV1.

Figure 2

3.2.2. Performance comparison under CV2

Table 2 shows the prediction performance of SAELGMDA and the other four methods under CV2. The best performance in each column is shown in boldface. As shown in Table 2, SAELGMDA achieved the best accuracies, MCCs, and AUCs on the two databases under CV2. In particular, SAELGMDA obtained better MCC and AUPR on the HMDAD database than on the Disbiome database, which may be caused by the different data structures of the two databases. In addition, all five MDA prediction methods obtained lower MCC and AUPR on the Disbiome database. Figure 3 shows the ROC and PR curves of the five methods under CV2.

Table 2

| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | SAELGMDA | **0.986 ± 0.0000** | **0.8017 ± 0.0017** | **0.9838 ± 0.0001** | **0.8706 ± 0.0010** |
| HMDAD | MNNMDA | 0.9654 ± 0.0000 | 0.344 ± 0.0034 | 0.896 ± 0.0016 | 0.7479 ± 0.0052 |
| HMDAD | GATMDA | 0.9604 ± 0.0001 | 0.4775 ± 0.0065 | 0.7977 ± 0.0020 | 0.4677 ± 0.0096 |
| HMDAD | NTSHMDA | 0.9642 ± 0.0000 | 0.4449 ± 0.0029 | 0.8614 ± 0.0007 | 0.3718 ± 0.0026 |
| HMDAD | LRLSHMDA | 0.9642 ± 0.0000 | 0.4451 ± 0.0017 | 0.8596 ± 0.0009 | 0.4068 ± 0.0065 |
| Disbiome | SAELGMDA | **0.9818 ± 0.0000** | **0.3437 ± 0.0040** | **0.9293 ± 0.0003** | 0.3378 ± 0.0049 |
| Disbiome | MNNMDA | 0.9817 ± 0.0000 | 0.1907 ± 0.0016 | 0.7744 ± 0.0015 | **0.4117 ± 0.0023** |
| Disbiome | GATMDA | 0.9763 ± 0.0000 | 0.0915 ± 0.0011 | 0.5761 ± 0.0009 | 0.1069 ± 0.0031 |
| Disbiome | NTSHMDA | 0.9723 ± 0.0000 | 0.0951 ± 0.0002 | 0.7721 ± 0.0002 | 0.0767 ± 0.0000 |
| Disbiome | LRLSHMDA | 0.9657 ± 0.0000 | 0.1135 ± 0.0002 | 0.7792 ± 0.0002 | 0.0905 ± 0.0001 |

The performance of five MDA identification methods under CV2.

Figure 3

3.2.3. Performance comparison under CV3

Table 3 shows the performance of SAELGMDA and the other four methods under CV3. The best performance in each column is shown in boldface. The results in Table 3 suggest that SAELGMDA achieved the best accuracies, MCCs, and AUCs, significantly outperforming the other four MDA prediction methods under CV3. Moreover, most methods performed better under CV3 than under CV1 and CV2, suggesting that more training samples help improve the classification performance. Figure 4 shows the ROC and PR curves of the five methods under CV3.

Table 3

| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | SAELGMDA | **0.9859 ± 0.0000** | **0.7978 ± 0.0010** | **0.9857 ± 0.0000** | **0.8705 ± 0.0008** |
| HMDAD | MNNMDA | 0.9653 ± 0.0000 | 0.3401 ± 0.0055 | 0.9511 ± 0.0002 | 0.6465 ± 0.0023 |
| HMDAD | GATMDA | 0.8935 ± 0.0004 | 0.3427 ± 0.0020 | 0.8638 ± 0.0007 | 0.3230 ± 0.0060 |
| HMDAD | NTSHMDA | 0.9613 ± 0.0000 | 0.1783 ± 0.0338 | 0.8874 ± 0.0003 | 0.3568 ± 0.0026 |
| HMDAD | LRLSHMDA | 0.9453 ± 0.0000 | 0.0568 ± 0.0011 | 0.7997 ± 0.0002 | 0.1158 ± 0.0002 |
| Disbiome | SAELGMDA | **0.9826 ± 0.0000** | **0.3376 ± 0.0004** | **0.9358 ± 0.0000** | 0.3604 ± 0.0004 |
| Disbiome | MNNMDA | 0.9815 ± 0.0000 | 0.1523 ± 0.0012 | 0.9355 ± 0.0000 | **0.4175 ± 0.0002** |
| Disbiome | GATMDA | 0.8461 ± 0.0004 | 0.2032 ± 0.0002 | 0.8332 ± 0.0001 | 0.201 ± 0.0004 |
| Disbiome | NTSHMDA | 0.9807 ± 0.0000 | 0.0207 ± 0.0002 | 0.8146 ± 0.0000 | 0.0766 ± 0.0000 |
| Disbiome | LRLSHMDA | 0.9781 ± 0.0000 | 0.0744 ± 0.0002 | 0.7365 ± 0.0000 | 0.0625 ± 0.0000 |

The performance of five MDA identification methods under CV3.

Figure 4

3.2.4. Performance comparison of LightGBM and two classification models

To measure the MDA classification performance of LightGBM, we compared it with two other boosting algorithms, XGBoost and NGBoost. Extreme Gradient Boosting (XGBoost) is an ensemble learning method based on gradient boosted trees and can accurately cope with multicollinearity and complicated non-linear interactions (Chen and Guestrin, 2016; Zhu and Zhu, 2019). Natural Gradient Boosting (NGBoost) uses natural gradients instead of regular gradients to implement flexible probabilistic forecasts (Duan et al., 2020). Tables 4–6 show the accuracy, MCC, AUC, and AUPR of LightGBM, NGBoost, and XGBoost on the Disbiome and HMDAD datasets under the three cross validations. The results in Tables 4–6 indicate that LightGBM obtained better performance under the majority of conditions and can be used to improve MDA classification ability.

Table 4

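| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | LightGBM | 0.9497 ± 0.0022 | **0.1855 ± 0.0116** | 0.8358 ± 0.0109 | **0.2155 ± 0.0075** |
| HMDAD | NGBoost | **0.9526 ± 0.0016** | 0.1728 ± 0.0107 | 0.8301 ± 0.0097 | 0.1988 ± 0.0056 |
| HMDAD | XGBoost | 0.946 ± 0.0018 | 0.1832 ± 0.0092 | **0.8385 ± 0.0051** | 0.1843 ± 0.0050 |
| Disbiome | LightGBM | 0.9819 ± 0.0000 | 0.3431 ± 0.0059 | **0.9301 ± 0.0002** | 0.3469 ± 0.0037 |
| Disbiome | NGBoost | **0.9826 ± 0.0000** | **0.3631 ± 0.0032** | 0.9284 ± 0.0002 | **0.3598 ± 0.0027** |
| Disbiome | XGBoost | 0.9775 ± 0.0000 | 0.2706 ± 0.0034 | 0.905 ± 0.0003 | 0.2494 ± 0.002 |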

The performance of three classification models under CV1.

Table 5

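| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | LightGBM | **0.986 ± 0.0000** | **0.8017 ± 0.0017** | **0.9838 ± 0.0001** | **0.8706 ± 0.0010** |
| HMDAD | NGBoost | 0.9854 ± 0.0046 | 0.794 ± 0.0511 | 0.9808 ± 0.0102 | 0.8615 ± 0.0447 |
| HMDAD | XGBoost | 0.9846 ± 0.0000 | 0.7814 ± 0.0027 | 0.9803 ± 0.0001 | 0.8434 ± 0.0021 |
| Disbiome | LightGBM | **0.9818 ± 0.0000** | **0.3437 ± 0.0040** | **0.9293 ± 0.0003** | 0.3378 ± 0.0049 |
| Disbiome | NGBoost | 0.9817 ± 0.0034 | 0.3382 ± 0.0756 | 0.9284 ± 0.0164 | **0.3597 ± 0.0920** |
| Disbiome | XGBoost | 0.9771 ± 0.0054 | 0.2671 ± 0.0619 | 0.904 ± 0.0186 | 0.2502 ± 0.0640 |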

The performance of three classification models under CV2.

Table 6

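| Database | Method | Accuracy | MCC | AUC | AUPR |
|---|---|---|---|---|---|
| HMDAD | LightGBM | **0.9859 ± 0.0000** | **0.7978 ± 0.0010** | **0.9857 ± 0.0000** | **0.8705 ± 0.0008** |
| HMDAD | NGBoost | 0.9854 ± 0.0000 | 0.7905 ± 0.0013 | 0.9821 ± 0.0000 | 0.8625 ± 0.0013 |
| HMDAD | XGBoost | 0.9838 ± 0.0000 | 0.7679 ± 0.0011 | 0.9804 ± 0.0000 | 0.835 ± 0.0010 |
| Disbiome | LightGBM | **0.9826 ± 0.0000** | 0.3376 ± 0.0004 | **0.9358 ± 0.0000** | 0.3604 ± 0.0004 |
| Disbiome | NGBoost | **0.9826 ± 0.0000** | **0.3396 ± 0.0003** | 0.9336 ± 0.0000 | **0.3764 ± 0.0002** |
| Disbiome | XGBoost | 0.9805 ± 0.0000 | 0.2375 ± 0.0039 | 0.9129 ± 0.0000 | 0.2594 ± 0.0002 |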

The performance of three classification models under CV3.

3.2.5. Computational time analysis

We compared the computational time of SAELGMDA with that of the other four MDA prediction models, MNNMDA, GATMDA, NTSHMDA, and LRLSHMDA. The experiments were run on a machine with an AMD EPYC 7302 CPU, a GeForce RTX 2080 Ti, and 256 GB RAM under the Ubuntu 20.04.4 LTS operating system. Figure 5 shows the computational time (in minutes) of the five MDA prediction models for one run of five-fold cross validation on the two MDA datasets. As shown in Figure 5, SAELGMDA is the fastest method on the HMDAD dataset and the slowest one on the Disbiome dataset. Even on the Disbiome database, where it ran slowly, SAELGMDA needed only 10.57 min. In summary, SAELGMDA does not require much computational time on the two MDA datasets.

Figure 5

3.3. Case study

In this section, we predicted potential MDAs on the two MDA databases. In addition, multiple lines of evidence suggest that colorectal cancer, inflammatory bowel disease, and lung cancer are closely linked with microbes (Guarner and Malagelada, 2003; Müller and Macpherson, 2006; Zhang et al., 2015; Mármol et al., 2017; Chicco and Jurman, 2020). We therefore aimed to find possible microbes for the three diseases using the proposed SAELGMDA method. First, for each of the three diseases, the microbes that are known to associate with it were removed. Second, we computed the association scores between the disease and all microbes. Third, the computed scores were sorted in descending order. Finally, the top 20 microbes with the highest association scores were listed and checked against existing publications.
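The ranking step of this procedure can be sketched as follows. The scores would come from the trained classifier after the known associations of the query disease have been hidden; here they are made-up numbers, and the function name `top_candidates` and the microbe names are hypothetical. Flagging the hidden known associations corresponds to the "Confirmed by HMDAD" entries in the case-study tables.

```python
import numpy as np

def top_candidates(scores, known, microbes, k=20):
    """Rank microbes for one disease by descending association score.

    scores:   predicted association probability for each microbe.
    known:    1 if the association was recorded in the database (and hidden
              before training), 0 otherwise.
    Returns a list of (rank, microbe, score, was_known) tuples.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")[:k]
    return [(rank + 1, microbes[j], float(scores[j]), bool(known[j]))
            for rank, j in enumerate(order)]

# Made-up scores for three hypothetical microbes.
ranking = top_candidates(scores=[0.2, 0.9, 0.5],
                         known=[0, 1, 0],
                         microbes=["microbe_a", "microbe_b", "microbe_c"],
                         k=2)
```

Candidates that were not recorded in the database are the ones that require support from the literature or further experimental validation.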

3.3.1. Finding new MDAs based on known MDAs

We further predicted new MDAs based on known MDAs using SAELGMDA. The predicted top 50 MDAs are shown in Figure 6. In Figure 6, sky blue solid lines and red dotted lines represent known and unknown MDAs obtained from SAELGMDA, respectively. Deep sky blue round rectangles represent microbes and green diamonds denote diseases.

Figure 6

On the HMDAD database, all of the predicted top 50 MDAs are already recorded in the database. Among the pairs not recorded in the database, SAELGMDA predicted that Actinobacteria and liver cirrhosis have the highest association probability, with a ranking of 130 among all 11,388 microbe–disease pairs. Actinobacteria have been reported to associate with liver disease (Bull-Otterson et al., 2013). The expansion of Proteobacteria and Actinobacteria has a pathogenic effect on alcoholic liver disease (Bull-Otterson et al., 2013).

On the Disbiome database, SAELGMDA predicted that Veillonella may associate with autism, with a ranking of three among all 229,336 microbe–disease pairs. Zhang et al. (2018) reported that the abundance of Veillonella was severely decreased in the stools of children suffering from autism spectrum disorder. A decrease in its abundance has also been found in other subjects with autism (Strati et al., 2017). Furthermore, the decreased abundance of Veillonella may affect the fermentation of lactate in children with autism (Gronow et al., 2010).

3.3.2. Colorectal cancer-related microbe identification

Colorectal cancer is the third most frequent cause of cancer mortality worldwide, severely threatening global life and health (Biller and Schrag, 2021; Saeed et al., 2021; Wong et al., 2023). There are more than 1.85 million colorectal cancer cases and 850,000 colorectal cancer-related deaths each year. Among new colorectal cancer diagnoses, 20% of patients have metastatic disease. It has been reported that, among patients diagnosed with metastatic colorectal cancer, ~70%–75% survive more than 1 year, 30%–35% more than 3 years, and fewer than 20% more than 5 years. Although colonoscopy has been widely applied to screening, its effect on colorectal cancer remains unclear (Bretthauer et al., 2022). Table 7 shows the top 20 microbes associated with colorectal cancer on the HMDAD database.

Table 7

| Rank | Microbe | Evidence |
|---|---|---|
| 1 | Fusobacterium nucleatum | Confirmed by HMDAD |
| 2 | Firmicutes | Confirmed by HMDAD |
| 3 | Proteobacteria | PMID: 24603888, 27194068, 32298987 |
| 4 | Prevotella | Confirmed by HMDAD |
| 5 | Bacteroidetes | Confirmed by HMDAD |
| 6 | Clostridia | Confirmed by HMDAD |
| 7 | Fusobacterium | Confirmed by HMDAD |
| 8 | Bacteroides | Confirmed by HMDAD |
| 9 | Pseudomonas | PMID: 33998814, 25699023, 25217106 |
| 10 | Haemophilus | PMID: 31358825, 26549775 |
| 11 | Actinobacteria | PMID: 35899111, 35049922 |
| 12 | Acinetobacter | PMID: 32738757, 32595614 |
| 13 | Corynebacterium | PMID: 313873, 646934 |
| 14 | Lactobacillus | PMID: 36162222, 22830611, 35808840 |
| 15 | Streptococcus | PMID: 9771449, 21960713, 21247505, 18990738, 16845563 |
| 16 | Clostridium difficile | PMID: 26691472, 28060753, 21152135, 1626323 |
| 17 | Faecalibacterium prausnitzii | PMID: 26595550, 35625865, 32675782 |
| 18 | Clostridium coccoides | Unconfirmed |
| 19 | Lachnospiraceae | PMID: 28988196, 36893736 |
| 20 | Helicobacter pylori | PMID: 22294430, 16579836, 18506454, 31393968 |

The top 20 microbes related to colorectal cancer inferred by SAELGMDA on the HMDAD database.

As shown in Table 7, 19 of the top 20 microbes inferred for colorectal cancer on the HMDAD database have been confirmed by the database or the existing literature. For example, Pseudomonas is distinctly less abundant in cancer tissues than in normal tissues and has been increasingly regarded as an emerging clinically relevant opportunistic pathogen (Decker and Palmore, 2014; Gao et al., 2015). Haemophilus parainfluenzae shows higher representation in colorectal cancer subjects but is scarcely detected in control subjects (Kasai et al., 2016). A study of 219 patients with colorectal cancer suggested that Clostridium difficile has a close relationship with colorectal cancer (Yeom et al., 2010). Helicobacter pylori infection has been reported to be a potential risk factor for left-sided colorectal cancer (Zhang et al., 2012).

Moreover, we inferred that Clostridium coccoides has a possible association with colorectal cancer. The Clostridium coccoides group is regarded as one of the most prevalent groups of bacteria in the human intestine. Its members constitute ~60% of the mucin-adhered microbiota and comprise different species of highly oxygen-sensitive anaerobes (from genera such as Clostridium, Coprococcus, Eubacterium, and Ruminococcus). They contribute to preventing colonization by vancomycin-resistant Enterococcus in an antibiotic-treated mouse model (Grenda et al., 2022). The association between Clostridium coccoides and colorectal cancer needs further validation.

3.3.3. Inflammatory bowel disease-related microbe identification

Inflammatory bowel disease is an idiopathic inflammatory disorder that severely affects the gastrointestinal tract. It has become a global, chronic, and life-threatening disease over the last few decades. Mak et al. (2020) predicted that the number of patients with inflammatory bowel disease may increase exponentially worldwide. The disease typically includes Crohn's disease and ulcerative colitis. It manifests progressive and unpredictable features and is partially caused by bacteria that activate the patient's immune system to protect against foreign substances (Lomax et al., 2006; Kaplan and Windsor, 2021). It therefore has a close relationship with microbes, and identifying the associated microbes may help stem its global rise in the future. Table 8 lists the top 20 microbes associated with the disease on the HMDAD database.

Table 8

| Rank | Microbe | Evidence |
|------|---------|----------|
| 1 | Bacteroidetes | PMID: 12906096, 27999802, 21575910 |
| 2 | Proteobacteria | Confirmed by HMDAD |
| 3 | Firmicutes | PMID: 19235886 |
| 4 | Lachnospiraceae | Confirmed by HMDAD |
| 5 | Haemophilus | PMID: 33666710, 30685379 |
| 6 | Actinobacteria | Confirmed by HMDAD |
| 7 | Prevotella | PMID: 28542929, 26468751 |
| 8 | Clostridium coccoides | PMID: 27687331, 16432374 |
| 9 | Bifidobacterium | PMID: 34337079, 25793197, 24478468, 25391346 |
| 10 | Lactobacillus | PMID: 29854599, 32509162, 15664933 |
| 11 | Staphylococcus aureus | PMID: 31698044 |
| 12 | Fusobacterium | PMID: 27139617, 33996366, 25576662 |
| 13 | Clostridia | PMID: 22508484, 28506071 |
| 14 | Clostridium difficile | PMID: 22508484, 28506071 |
| 15 | Helicobacter pylori | PMID: 24914359, 19760778 |
| 16 | Streptococcus | PMID: 30392911, 23679203, 28618865, 16868828 |
| 17 | Bacteroides vulgatus | PMID: 12906096, 12162408 |
| 18 | Bacteroides | PMID: 12906096, 12162408 |
| 19 | Oxalobacteraceae | PMID: 29228248 |
| 20 | Sphingomonadaceae | Unconfirmed |

The top 20 microbes related to inflammatory bowel disease inferred by SAELGMDA on the HMDAD database.

As shown in Table 8, 19 of the top 20 microbes predicted on the HMDAD database have been validated by existing literature or by HMDAD itself as linked to inflammatory bowel disease. Researchers reported that Firmicutes were less represented in patients suffering from inflammatory bowel disease than in healthy subjects (Sokol et al., 2009). Streptococcus and Haemophilus were highly represented in patients with inflammatory bowel disease (Heidarian et al., 2019). Prevotella was reduced in pediatric Crohn's disease (Lewis et al., 2015). Clostridium coccoides was less abundant in patients with active inflammatory bowel disease than in those in remission (Prosberg et al., 2016).

In addition, we predicted that Sphingomonadaceae is closely linked to inflammatory bowel disease. The Sphingomonadaceae family is highly abundant in marine waters, freshwater, and even drinking water. Its members can degrade lignin-derived compounds and refractory organic matter comprising monocyclic and polycyclic aromatic hydrocarbons (Shen S. et al., 2022). Sphingomonadaceae adapt well to bile salts through metabolic pathways (de Vries et al., 2019). In addition, Sphingomonadaceae is strongly linked to triclosan degradation in nitrification and denitrification systems (Dai et al., 2022). Microbial communities adapted to bisphenol A through the selection of Sphingomonadaceae populations, including Sphingobium, Novosphingobium, and Sphingopyxis, and the selected Sphingomonadaceae demonstrated higher bisphenol A metabolic activity (Oh and Choi, 2019). The association between Sphingomonadaceae and inflammatory bowel disease needs further validation.

3.3.4. Lung cancer-related microbe identification

Lung cancer is one of the leading causes of cancer-related deaths worldwide. It accounts for ~18% of global cancer deaths (Sung et al., 2021). More than 350 patients die from lung cancer each day in the United States (Siegel et al., 2022). In China, it has the highest incidence and mortality of all cancer types (Xia et al., 2022). We used the proposed SAELGMDA model to identify potential microbes for lung cancer. Table 9 lists the top 20 microbes associated with it on the Disbiome database. As shown in Table 9, all of the top 20 microbes have been confirmed to be associated with lung cancer by existing literature or the Disbiome database. The results again support the MDA prediction performance of SAELGMDA.

Table 9

| Rank | Microbe | Evidence |
|------|---------|----------|
| 1 | Acidovorax | Confirmed by Disbiome |
| 2 | Parabacteroides | PMID: 30693820, 32010563, 33302682, 32329229 |
| 3 | Diaphorobacter | Confirmed by Disbiome |
| 4 | Bifidobacterium | Confirmed by Disbiome |
| 5 | Roseburia | PMID: 33302682, 32227387, 35735103 |
| 6 | Bacteroides | PMID: 30693820, 36498063, 30416658 |
| 7 | Lactobacillus | PMID: 26125762, 36361537, 36638662 |
| 8 | Leptotrichia | PMID: 34432217, 33454779 |
| 9 | Prevotella | Confirmed by Disbiome |
| 10 | Enterococcus | PMID: 33302682, 27717798, 31065547, 33111503 |
| 11 | Streptococcus | Confirmed by Disbiome |
| 12 | Corynebacterium | PMID: 350388, 6362846, 6998933, 6318791 |
| 13 | Porphyromonas | PMID: 33279803, 32615270 |
| 14 | Alistipes | PMID: 33939976, 34793492, 35115705 |
| 15 | Haemophilus | PMID: 21407824, 27052615, 21098042, 34963470 |
| 16 | Klebsiella | PMID: 32099416, 24706703 |
| 17 | Dialister | PMID: 30416658, 29023689, 34063829, 31595156 |
| 18 | Ruminococcus | PMID: 32227387, 33302682, 36737654, 33603241, 32240032 |
| 19 | Pseudomonas | PMID: 27507537, 25801231, 30101407 |
| 20 | Escherichia | PMID: 18496688; DOI: 10.1158/1538-7445.AM2023-5185 |

The top 20 microbes associated with lung cancer identified by SAELGMDA on the Disbiome database.
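The case-study tables can be read as the output of a simple ranking step. The following is a minimal sketch of that step, assuming the classifier yields one association score per microbe for a query disease. The function name `top_k_candidates`, the placeholder microbe names, and the scores are all hypothetical and are not taken from the paper.

```python
def top_k_candidates(scores, known_microbes, database_name, k=20):
    """Rank microbes for one query disease and label the available evidence.

    scores: dict mapping microbe name -> predicted association score
            (hypothetical values; higher means more likely associated).
    known_microbes: set of microbes already linked to the disease in the database.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    rows = []
    for rank, (microbe, _score) in enumerate(ranked, start=1):
        if microbe in known_microbes:
            evidence = f"Confirmed by {database_name}"
        else:
            # Not in the database: the literature must be searched by hand.
            evidence = "Needs literature check"
        rows.append((rank, microbe, evidence))
    return rows


# Toy input: placeholder names and made-up scores, for illustration only.
toy_scores = {"microbe_A": 0.91, "microbe_B": 0.34, "microbe_C": 0.78, "microbe_D": 0.55}
toy_known = {"microbe_A", "microbe_D"}

for row in top_k_candidates(toy_scores, toy_known, "Disbiome", k=3):
    print(row)
```

Candidates that end up in the "Needs literature check" group correspond to the rows that the tables mark either with PMIDs or as unconfirmed after a manual search.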

3.4. Discussion and conclusion

Systematic identification of associations between microbes and diseases significantly contributes to the understanding of the complex pathogenic mechanism of various diseases (Takahashi et al., 2018; Zhou et al., 2018; Yang et al., 2022). In particular, computational pathogenic microorganism discovery helps to capture potential biomarkers from candidate compounds for human complex diseases (Barrows et al., 2016; Zhu et al., 2021).

Here, we developed a computational method called SAELGMDA to improve MDA prediction. First, microbe similarity and disease similarity were computed by integrating their functional similarity and Gaussian interaction profile kernel (GIPK) similarity. Second, each microbe–disease pair was represented as a feature vector based on the microbe similarity matrix and the disease similarity matrix. Third, the resulting high-dimensional features were mapped to a low-dimensional space by a sparse autoencoder. Finally, unknown microbe–disease pairs were classified using LightGBM.
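The first two steps can be illustrated with a small, dependency-free sketch. It computes GIPK similarity in the form given by Van Laarhoven et al. (2011) and builds a pair feature vector by concatenating one row of each similarity matrix. The concatenation, the bandwidth parameter `gamma_prime = 1.0`, the toy association matrix, and all function names are assumptions made for illustration. The integration with functional similarity, the sparse autoencoder, and the LightGBM classifier are deliberately left out.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between the rows of a binary matrix."""
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / len(profiles)
    # Bandwidth normalised by the mean squared profile norm
    # (assumes at least one known association, so the mean is non-zero).
    gamma = gamma_prime / mean_sq_norm
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * dist_sq)
    return sim


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def pair_features(microbe_sim, disease_sim, microbe_idx, disease_idx):
    """Assumed pair representation: microbe similarity row followed by disease similarity row."""
    return microbe_sim[microbe_idx] + disease_sim[disease_idx]


# Toy association matrix (rows = microbes, columns = diseases); not real data.
assoc = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]
microbe_sim = gip_kernel(assoc)
disease_sim = gip_kernel(transpose(assoc))
vec = pair_features(microbe_sim, disease_sim, microbe_idx=0, disease_idx=2)
print(len(vec))  # number of microbes + number of diseases
```

With this representation the feature dimension grows with the number of microbes and diseases, which is one reason a dimensionality reduction step is useful before classification.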

Our proposed SAELGMDA method was compared with MNNMDA, GATMDA, LRLSHMDA, and NTSHMDA. Experimental results under CV1, CV2, and CV3 show that SAELGMDA outperforms these four methods under the majority of conditions and has superior MDA identification ability. To investigate the MDA classification performance of LightGBM, we further compared it with XGBoost and NGBoost. The results demonstrate that LightGBM obtained better accuracy. Case studies suggest possible associations between Clostridium coccoides and colorectal cancer, between Sphingomonadaceae and inflammatory bowel disease, and between Veillonella and autism. These associations need further validation.
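The three cross-validation settings differ in what is hidden from the model. The sketch below assumes, following the order used in the abstract, that CV1 holds out diseases, CV2 holds out microbes, and CV3 holds out individual microbe–disease pairs. The function names and the toy sizes are hypothetical, and details such as negative sampling are not covered.

```python
import random


def k_folds(items, k=5, seed=0):
    """Shuffle items and deal them into k roughly equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]


def cv_splits(n_microbes, n_diseases, mode, k=5, seed=0):
    """Yield (train_pairs, test_pairs) for one of three hold-out schemes.

    mode = "disease": hide every pair of the held-out diseases
    mode = "microbe": hide every pair of the held-out microbes
    mode = "pair":    hide a random subset of individual pairs
    """
    pairs = [(m, d) for m in range(n_microbes) for d in range(n_diseases)]
    if mode == "disease":
        folds = k_folds(range(n_diseases), k, seed)
        key = lambda pair: pair[1]
    elif mode == "microbe":
        folds = k_folds(range(n_microbes), k, seed)
        key = lambda pair: pair[0]
    elif mode == "pair":
        folds = k_folds(pairs, k, seed)
        key = lambda pair: pair
    else:
        raise ValueError(f"unknown mode: {mode}")
    for fold in folds:
        held_out = set(fold)
        test = [p for p in pairs if key(p) in held_out]
        train = [p for p in pairs if key(p) not in held_out]
        yield train, test


for train, test in cv_splits(n_microbes=6, n_diseases=5, mode="disease"):
    print(len(train), len(test))
```

In the disease and microbe modes the test entities never appear in training, so those settings probe generalization to entities without known associations, whereas the pair mode only hides individual entries.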

We used two MDA databases (Disbiome and HMDAD) to investigate the performance of our proposed SAELGMDA method. The HMDAD dataset is small, whereas the Disbiome dataset is larger. Under CV1, SAELGMDA, GATMDA, and LRLSHMDA performed better on the Disbiome dataset than on the HMDAD dataset, which suggests that more data improve the performance of these three methods under CV1. Under CV2 and CV3, all five methods achieved relatively high accuracy and AUC on both datasets. However, the MCC and AUPR computed by the five methods decreased significantly on the Disbiome dataset compared with the HMDAD dataset. This may be caused by data imbalance. In other words, SAELGMDA generalizes well when identifying potentially associated microbes for a query disease (CV1), but its generalization ability needs further improvement under CV2 and CV3.
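A small worked example shows why accuracy can stay high while MCC drops when negatives greatly outnumber positives. The two confusion matrices below are invented for illustration and are not results from the paper. Precision is shown only because it is one ingredient of the precision-recall curve that AUPR summarizes.

```python
import math


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion matrices with the same recall (80 of 100 positives found).
balanced = dict(tp=80, fp=20, tn=80, fn=20)        # 100 positives, 100 negatives
imbalanced = dict(tp=80, fp=200, tn=9800, fn=20)   # 100 positives, 10,000 negatives

for name, cm in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(
        name,
        round(accuracy(**cm), 3),
        round(precision(cm["tp"], cm["fp"]), 3),
        round(mcc(**cm), 3),
    )
```

In the imbalanced case the large number of true negatives lifts accuracy, while the false positives outnumber the true positives and pull both precision and MCC down.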

Although SAELGMDA outperformed the other four methods under the majority of conditions on the HMDAD and Disbiome databases, the performance of all five MDA prediction methods, especially in terms of MCC and AUPR, leaves room for improvement. In the future, we will integrate more biological data, such as microbe–drug associations and disease–gene associations, to extract effective features for microbe–disease pairs. Furthermore, we will explore new dimensionality reduction algorithms and classification models based on deep learning to improve MDA prediction.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author contributions

FW and HY: conceptualization and validation. LP: funding acquisition. YW, LP, and XL: project administration. FW: writing—original draft and software. HY, LP, and XL: writing—reviewing and editing and investigation. FW and LP: methodology. All authors contributed to the article and approved the submitted version.

Funding

LP was supported by the National Natural Science Foundation of China under Grant No. 61803151.

Acknowledgments

We would like to thank all authors of the cited references.

Conflict of interest

YW was employed by Geneis (Beijing) Co., Ltd., Beijing, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1. Andrew, N. (2011). Sparse autoencoder. CS294A Lecture Notes 72, 1–19.

  • 2. Barrows, N. J., Campos, R. K., Powell, S. T., Prasanth, K. R., Schott-Lerner, G., Soto-Acosta, R., et al. (2016). A screen of FDA-approved drugs for inhibitors of Zika virus infection. Cell Host Microbe 20, 259–270. doi: 10.1016/j.chom.2016.07.004

  • 3. Biller, L. H., and Schrag, D. (2021). Diagnosis and treatment of metastatic colorectal cancer: a review. JAMA 325, 669–685. doi: 10.1001/jama.2021.0106

  • 4. Bretthauer, M., Løberg, M., Wieszczy, P., Kalager, M., Emilsson, L., Garborg, K., et al. (2022). Effect of colonoscopy screening on risks of colorectal cancer and related death. N. Engl. J. Med. 387, 1547–1556. doi: 10.1056/NEJMoa2208375

  • 5. Bull-Otterson, L., Feng, W., Kirpich, I., Wang, Y., Qin, X., Liu, Y., et al. (2013). Metagenomic analyses of alcohol induced pathogenic alterations in the intestinal microbiome and the effect of lactobacillus rhamnosus gg treatment. PLoS ONE 8, e53028. doi: 10.1371/journal.pone.0053028

  • 6. Chen, T., and Guestrin, C. (2016). “Xgboost: a scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY: Association for Computing Machinery), 785–794. doi: 10.1145/2939672.2939785

  • 7. Chen, X., Li, T.-H., Zhao, Y., Wang, C.-C., and Zhu, C.-C. (2021). Deep-belief network for predicting potential mirna-disease associations. Brief. Bioinformatics 22, bbaa186. doi: 10.1093/bib/bbaa186

  • 8. Chen, X., Xie, D., Zhao, Q., and You, Z.-H. (2019). Micrornas and complex diseases: from experimental results to computational models. Brief. Bioinformatics 20, 515–539. doi: 10.1093/bib/bbx130

  • 9. Chen, Y., and Lei, X. (2022). Metapath aggregated graph neural network and tripartite heterogeneous networks for microbe-disease prediction. Front. Microbiol. 13, 919380. doi: 10.3389/fmicb.2022.919380

  • 10. Cheng, E., Zhao, J., Wang, H., Song, S., Xiong, S., Sun, Y., et al. (2022). “Dual network contrastive learning for predicting microbe-disease associations,” in IEEE/ACM Transactions on Computational Biology and Bioinformatics (New Jersey, NJ: IEEE). doi: 10.1109/TCBB.2022.3228617

  • 11. Cheng, L., Qi, C., Zhuang, H., Fu, T., and Zhang, X. (2020). gutmdisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48, D554–D560. doi: 10.1093/nar/gkz843

  • 12. Chicco, D., and Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over f1 score and accuracy in binary classification evaluation. BMC Genom. 21, 1–13. doi: 10.1186/s12864-019-6413-7

  • 13. Dai, H., Gao, J., Li, D., Wang, Z., Cui, Y., Zhao, Y., et al. (2022). Family sphingomonadaceae as the key executor of triclosan degradation in both nitrification and denitrification systems. Chem. Eng. J. 442, 1362021. doi: 10.1016/j.cej.2022.136202

  • 14. de Vries, H. J., Beyer, F., Jarzembowska, M., Lipińska, J., van den Brink, P., Zwijnenburg, A., et al. (2019). Isolation and characterization of sphingomonadaceae from fouled membranes. NPJ Biofilms Microbiomes 5, 6. doi: 10.1038/s41522-018-0074-1

  • 15. Decker, B. K., and Palmore, T. N. (2014). Hospital water and opportunities for infection prevention. Curr. Infect. Dis. Rep. 16, 1–8. doi: 10.1007/s11908-014-0432-y

  • 16. Demirci, M., Tokman, H., Uysal, H., Demiryas, S., Karakullukcu, A., Saribas, S., et al. (2019). Reduced Akkermansia muciniphila and Faecalibacterium prausnitzii levels in the gut microbiota of children with allergic asthma. Allergol. Immunopathol. 47, 365–371. doi: 10.1016/j.aller.2018.12.009

  • 17. Duan, T., Anand, A., Ding, D. Y., Thai, K. K., Basu, S., Ng, A., et al. (2020). “Ngboost: natural gradient boosting for probabilistic prediction,” in International Conference on Machine Learning (Vienna: The International Machine Learning Society), 2690–2700. PMLR.

  • 18. El Mouzan, M. I., Winter, H. S., Assiri, A. A., Korolev, K. S., Al Sarkhy, A. A., Dowd, S. E., et al. (2018). Microbiota profile in new-onset pediatric crohn's disease: data from a non-western population. Gut Pathog. 10, 1–10. doi: 10.1186/s13099-018-0276-3

  • 19. Gao, Z., Guo, B., Gao, R., Zhu, Q., and Qin, H. (2015). Microbiota disbiosis is associated with colorectal cancer. Front. Microbiol. 6, 20. doi: 10.3389/fmicb.2015.00020

  • 20. Grenda, T., Grenda, A., Domaradzki, P., Krawczyk, P., and Kwiatek, K. (2022). Probiotic potential of Clostridium spp.—advantages and doubts. Curr. Issues Mol. Biol. 44, 3118–3130. doi: 10.3390/cimb44070215

  • 21. Gronow, S., Welnitz, S., Lapidus, A., Nolan, M., Ivanova, N., Glavina Del Rio, T., et al. (2010). Complete genome sequence of Veillonella parvula type strain (te3t). Stand. Genomic Sci. 2, 57–65. doi: 10.4056/sigs.521107

  • 22. Guarner, F., and Malagelada, J.-R. (2003). Gut flora in health and disease. Lancet 361, 512–519. doi: 10.1016/S0140-6736(03)12489-0

  • 23. Guo, S.-S., Liu, J., Zhou, X.-G., and Zhang, G.-J. (2022). Deepumqa: ultrafast shape recognition-based protein model quality assessment using deep learning. Bioinformatics 38, 1895–1903. doi: 10.1093/bioinformatics/btac056

  • 24. He, B.-S., Peng, L.-H., and Li, Z. (2018). Human microbe-disease association prediction with graph regularized non-negative matrix factorization. Front. Microbiol. 9, 2560. doi: 10.3389/fmicb.2018.02560

  • 25. Heidarian, F., Alebouyeh, M., Shahrokh, S., Balaii, H., and Zali, M. R. (2019). Altered fecal bacterial composition correlates with disease activity in inflammatory bowel disease and the extent of il8 induction. Curr. Res. Transl. Med. 67, 41–50. doi: 10.1016/j.retram.2019.01.002

  • 26. Hu, H., Feng, Z., Lin, H., Cheng, J., Lyu, J., Zhang, Y., et al. (2023). Gene function and cell surface protein association analysis based on single-cell multiomics data. Comput. Biol. Med. 157, 106733. doi: 10.1016/j.compbiomed.2023.106733

  • 27. Hua, M., Yu, S., Liu, T., Yang, X., and Wang, H. (2022). MVGCNMDA: multi-view graph augmentation convolutional network for uncovering disease-related microbes. Interdiscip. Sci. Comput. Life Sci. 14, 669–682. doi: 10.1007/s12539-022-00514-2

  • 28. Hwang, S., Kim, C. Y., Yang, S., Kim, E., Hart, T., Marcotte, E. M., et al. (2019). Humannet v2: human gene networks for disease research. Nucleic Acids Res. 47, D573–D580. doi: 10.1093/nar/gky1126

  • 29. Janssens, Y., Nielandt, J., Bronselaer, A., Debunne, N., Verbeke, F., Wynendaele, E., et al. (2018). Disbiome database: linking the microbiome to disease. BMC Microbiol. 18, 1–6. doi: 10.1186/s12866-018-1197-5

  • 30. Jiang, C., Tang, M., Jin, S., Huang, W., and Liu, X. (2022). Kgnmda: a knowledge graph neural network method for predicting microbe-disease associations. IEEE/ACM Trans. Comput. Biol. Bioinform. 20, 1147–1155. doi: 10.1109/TCBB.2022.3184362

  • 31. Kamneva, O. K. (2017). Genome composition and phylogeny of microbes predict their co-occurrence in the environment. PLoS Comput. Biol. 13, e1005366. doi: 10.1371/journal.pcbi.1005366

  • 32. Kaplan, G. G., and Windsor, J. W. (2021). The four epidemiological stages in the global evolution of inflammatory bowel disease. Nat. Rev. Gastroenterol. Hepatol. 18, 56–66. doi: 10.1038/s41575-020-00360-x

  • 33. Kasai, C., Sugimoto, K., Moritani, I., Tanaka, J., Oya, Y., Inoue, H., et al. (2016). Comparison of human gut microbiota in control subjects and patients with colorectal carcinoma in adenoma: terminal restriction fragment length polymorphism and next-generation sequencing analyses. Oncol. Rep. 35, 325–333. doi: 10.3892/or.2015.4398

  • 34. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). “LightGBM: a highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, Vol. 30 (Long Beach, CA: MIT Press), 1–9.

  • 35. Kingma, D. P., and Ba, J. (2014). ADAM: a method for stochastic optimization. arXiv [preprint]. doi: 10.48550/arXiv.1412.6980

  • 36. Lewis, J. D., Chen, E. Z., Baldassano, R. N., Otley, A. R., Griffiths, A. M., Lee, D., et al. (2015). Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric crohn's disease. Cell Host Microbe 18, 489–500. doi: 10.1016/j.chom.2015.09.008

  • 37. Li, S., Xie, M., and Liu, X. (2019). A novel approach based on bipartite network recommendation and katz model to predict potential micro-disease associations. Front. Genet. 10, 1147. doi: 10.3389/fgene.2019.01147

  • 38. Li, T.-H., Wang, C.-C., Zhang, L., and Chen, X. (2023). Snrmpacdc: computational model focused on siamese network and random matrix projection for anticancer synergistic drug combination prediction. Brief. Bioinformatics 24, bbac503. doi: 10.1093/bib/bbac503

  • 39. Liang, Y., Zhang, Z.-Q., Liu, N.-N., Wu, Y.-N., Gu, C.-L., Wang, Y.-L., et al. (2022). Magcnse: predicting lncrna-disease associations using multi-view attention graph convolutional network and stacking ensemble model. BMC Bioinformatics 23, 1–22. doi: 10.1186/s12859-022-04715-w

  • 40. Lihong, P., Wang, C., Tian, X., Zhou, L., and Li, K. (2021). Finding lncRNA-protein interactions based on deep learning with dual-net neural architecture. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 3456–3468. doi: 10.1109/TCBB.2021.3116232

  • 41. Liu, D., Liu, J., Luo, Y., He, Q., and Deng, L. (2021). MGATMDA: predicting microbe-disease associations via multi-component graph attention network. IEEE/ACM Trans. Comput. Biol. Bioinformatics 19, 3578–3585. doi: 10.1109/TCBB.2021.3116318

  • 42. Liu, H., Bing, P., Zhang, M., Tian, G., Ma, J., Li, H., et al. (2023). MNNMDA: predicting human microbe-disease association via a method to minimize matrix nuclear norm. Comput. Struct. Biotechnol. J. 21, 1414–1423. doi: 10.1016/j.csbj.2022.12.053

  • 43. Liu, J., Zhao, K., and Zhang, G. (2023). Improved model quality assessment using sequence and structural information by enhanced deep neural networks. Brief. Bioinformatics 24, bbac507. doi: 10.1093/bib/bbac507

  • 44. Liu, Y., Wang, S.-L., Zhang, J.-F., Zhang, W., Zhou, S., Li, W., et al. (2020). DMFMDA: prediction of microbe-disease associations based on deep matrix factorization using bayesian personalized ranking. IEEE/ACM Trans. Comput. Biol. Bioinformatics 18, 1763–1772. doi: 10.1109/TCBB.2020.3018138

  • 45. Lomax, A. E., Linden, D. R., Mawe, G. M., and Sharkey, K. A. (2006). Effects of gastrointestinal inflammation on enteroendocrine cells and enteric neural reflex circuits. Auton. Neurosci. 126, 250–257. doi: 10.1016/j.autneu.2006.02.015

  • 46. Long, Y., and Luo, J. (2019). Wmghmda: a novel weighted meta-graph-based model for predicting human microbe-disease association on heterogeneous information network. BMC Bioinformatics 20, 1–18. doi: 10.1186/s12859-019-3066-0

  • 47. Long, Y., Luo, J., Zhang, Y., and Xia, Y. (2021). Predicting human microbe-disease associations via graph attention networks with inductive matrix completion. Brief. Bioinformatics 22, bbaa146. doi: 10.1093/bib/bbaa146

  • 48. Luo, J., and Long, Y. (2018). NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans. Comput. Biol. Bioinformatics 17, 1341–1351. doi: 10.1109/TCBB.2018.2883041

  • 49. Lynch, S. V., and Pedersen, O. (2016). The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379. doi: 10.1056/NEJMra1600266

  • 50. Ma, W., Zhang, L., Zeng, P., Huang, C., Li, J., Geng, B., et al. (2017). An analysis of human microbe-disease associations. Brief. Bioinformatics 18, 85–97. doi: 10.1093/bib/bbw005

  • 51. Mak, W. Y., Zhao, M., Ng, S. C., and Burisch, J. (2020). The epidemiology of inflammatory bowel disease: east meets west. J. Gastroenterol. Hepatol. 35, 380–389. doi: 10.1111/jgh.14872

  • 52. Makhzani, A., and Frey, B. (2013). K-sparse autoencoders. arXiv [preprint]. doi: 10.48550/arXiv.1312.5663

  • 53. Mármol, I., Sánchez-de Diego, C., Pradilla Dieste, A., Cerrada, E., and Rodriguez Yoldi, M. J. (2017). Colorectal carcinoma: a general overview and future perspectives in colorectal cancer. Int. J. Mol. Sci. 18, 197. doi: 10.3390/ijms18010197

  • 54. Müller, C., and Macpherson, A. (2006). Layers of mutualism with commensal bacteria protect us from intestinal inflammation. Gut 55, 276–284. doi: 10.1136/gut.2004.054098

  • 55. Oh, S., and Choi, D. (2019). Microbial community enhances biodegradation of bisphenol a through selection of sphingomonadaceae. Microb. Ecol. 77, 631–639. doi: 10.1007/s00248-018-1263-4

  • 56. Peng, L., Shen, L., Liao, L., Liu, G., and Zhou, L. (2020). RNMFMDA: a microbe-disease association identification method based on reliable negative sample selection and logistic matrix factorization with neighborhood regularization. Front. Microbiol. 11, 592430. doi: 10.3389/fmicb.2020.592430

  • 57. Peng, L., Wang, C., Tian, G., Liu, G., Li, G., Lu, Y., et al. (2022a). Analysis of CT scan images for covid-19 pneumonia based on a deep ensemble framework with densenet, swin transformer, and regnet. Front. Microbiol. 13, 993523. doi: 10.3389/fmicb.2022.995323

  • 58. Peng, L., Wang, F., Wang, Z., Tan, J., Huang, L., Tian, X., et al. (2022b). Cell-cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: data resources and computational strategies. Brief. Bioinformatics 23, bbac234. doi: 10.1093/bib/bbac234

  • 59. Prosberg, M., Bendtsen, F., Vind, I., Petersen, A. M., and Gluud, L. L. (2016). The association between the gut microbiota and the inflammatory bowel disease activity: a systematic review and meta-analysis. Scand. J. Gastroenterol. 51, 1407–1415. doi: 10.1080/00365521.2016.1216587

  • 60. Saeed, M., Shoaib, A., Kandimalla, R., Javed, S., Almatroudi, A., Gupta, R., et al. (2021). Microbe-based therapies for colorectal cancer: advantages and limitations. Semin. Cancer Biol. 86(Pt 3), 652–665. doi: 10.1016/j.semcancer.2021.05.018

  • 61. Shen, L., Liu, F., Huang, L., Liu, G., Zhou, L., Peng, L., et al. (2022). VDA-RWLRLS: an anti-sars-cov-2 drug prioritizing framework combining an unbalanced bi-random walk and laplacian regularized least squares. Comput. Biol. Med. 140, 105119. doi: 10.1016/j.compbiomed.2021.105119

  • 62. Shen, S., Anazawa, T., Matsuda, T., and Shimizu, Y. (2022). Draft genome sequences of Sphingomonadaceae strains isolated from a freshwater lake. Microbiol. Resour. Announc. 11, e00070-22. doi: 10.1128/mra.00070-22

  • 63. Shi, J.-Y., Huang, H., Zhang, Y.-N., Cao, J.-B., and Yiu, S.-M. (2018). Bmcmda: a novel model for predicting human microbe-disease associations via binary matrix completion. BMC Bioinformatics 19, 85–92. doi: 10.1186/s12859-018-2274-3

  • 64. Siegel, R. L., Miller, K. D., Fuchs, H. E., and Jemal, A. (2022). Cancer statistics, 2022. CA Cancer J. Clin. 72, 7–33. doi: 10.3322/caac.21708

  • 65. Sokol, H., Seksik, P., Furet, J., Firmesse, O., Nion-Larmurier, I., Beaugerie, L., et al. (2009). Low counts of faecalibacterium prausnitzii in colitis microbiota. Inflamm. Bowel Dis. 15, 1183–1189. doi: 10.1002/ibd.20903

  • 66. Strati, F., Cavalieri, D., Albanese, D., De Felice, C., Donati, C., Hayek, J., et al. (2017). New evidences on the altered gut microbiota in autism spectrum disorders. Microbiome 5, 1–11. doi: 10.1186/s40168-017-0242-1

  • 67. Sun, F., Sun, J., and Zhao, Q. (2022). A deep learning method for predicting metabolite-disease associations via graph neural network. Brief. Bioinformatics 23, bbac266. doi: 10.1093/bib/bbac266

  • 68. Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., et al. (2021). Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249. doi: 10.3322/caac.21660

  • 69. Takahashi, M. K., Tan, X., Dy, A. J., Braff, D., Akana, R. T., Furuta, Y., et al. (2018). A low-cost paper-based synthetic biology platform for analyzing gut microbiota and host biomarkers. Nat. Commun. 9, 3347. doi: 10.1038/s41467-018-05864-4

  • 70. Tian, G., Wang, Z., Wang, C., Chen, J., Liu, G., Xu, H., et al. (2022). A deep ensemble learning-based automated detection of covid-19 using lung CT images and vision transformer and convnext. Front. Microbiol. 13, 1024104. doi: 10.3389/fmicb.2022.1024104

  • 71. Van Laarhoven, T., Nabuurs, S. B., and Marchiori, E. (2011). Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27, 3036–3043. doi: 10.1093/bioinformatics/btr500

  • 72. Wang, F., Huang, Z.-A., Chen, X., Zhu, Z., Wen, Z., Zhao, J., et al. (2017). LRLSHMDA: Laplacian regularized least squares for human microbe-disease association prediction. Sci. Rep. 7, 7601. doi: 10.1038/s41598-017-08127-2

  • 73. Wang, T., Sun, J., and Zhao, Q. (2023). Investigating cardiotoxicity related with hERG channel blockers using molecular fingerprints and graph attention mechanism. Comput. Biol. Med. 153, 106464. doi: 10.1016/j.compbiomed.2022.106464

  • 74. Wang, W., Zhang, L., Sun, J., Zhao, Q., and Shuai, J. (2022). Predicting the potential human lncrna-mirna interactions based on graph convolution network with conditional random field. Brief. Bioinformatics 23, bbac463. doi: 10.1093/bib/bbac463

  • 75. Wang, Y., Lei, X., and Pan, Y. (2023). Microbe-disease association prediction using rgcn through microbe-drug-disease network. IEEE/ACM Trans. Comput. Biol. Bioinformatics 1, 1–10. doi: 10.1109/TCBB.2023.3247035

  • 76. Wei, H., and Liu, B. (2020). ICIRCDA-MF: identification of circrna-disease associations based on matrix factorization. Brief. Bioinformatics 21, 1356–1367. doi: 10.1093/bib/bbz057

  • 77. Wong, A. H., Ma, B., and Lui, R. N. (2023). New developments in targeted therapy for metastatic colorectal cancer. Ther. Adv. Med. Oncol. 15, 17588359221148540. doi: 10.1177/17588359221148540

  • 78. Wu, C., Gao, R., Zhang, D., Han, S., and Zhang, Y. (2018). Prwhmda: human microbe-disease association prediction by random walk on the heterogeneous network with pso. Int. J. Biol. Sci. 14, 849. doi: 10.7150/ijbs.24539

  • 79. Xia, C., Dong, X., Li, H., Cao, M., Sun, D., He, S., et al. (2022). Cancer statistics in china and united states, 2022: profiles, trends, and determinants. Chin. Med. J. 135, 584–590. doi: 10.1097/CM9.0000000000002108

  • 80. Xu, J., and Li, Y. (2006). Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22, 2800–2805. doi: 10.1093/bioinformatics/btl467

  • 81. Xu, J., Xu, J., Meng, Y., Lu, C., Cai, L., Zeng, X., et al. (2023). Graph embedding and gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Rep. Methods 3, 100382. doi: 10.1016/j.crmeth.2022.100382

  • 82. Yan, C., Duan, G., Wu, F.-X., Pan, Y., and Wang, J. (2019). BRWMDA: predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE/ACM Trans. Comput. Biol. Bioinformatics 17, 1595–1604. doi: 10.1109/TCBB.2019.2907626

  • 83. Yang, M., Yang, H., Ji, L., Hu, X., Tian, G., Wang, B., et al. (2022). A multi-omics machine learning framework in predicting the survival of colorectal cancer patients. Comput. Biol. Med. 146, 105516. doi: 10.1016/j.compbiomed.2022.105516

  • 84. Ye, J., Chow, J.-H., Chen, J., and Zheng, Z. (2009). “Stochastic gradient boosted distributed decision trees,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong), 2061–2064. doi: 10.1145/1645953.1646301

  • 85. Yeom, C. H., Cho, M. M., Baek, S. K., Bae, O. S., et al. (2010). Risk factors for the development of Clostridium difficile associated colitis after colorectal cancer surgery. J. Korean Soc. Coloproctol. 26, 329–333. doi: 10.3393/jksc.2010.26.5.329

  • 86. Zhang, L., Wang, C.-C., and Chen, X. (2022). Predicting drug-target binding affinity through molecule representation block based on multi-head attention and skip connection. Brief. Bioinformatics 23, bbac468. doi: 10.1093/bib/bbac468

  • 87. Zhang, M., Ma, W., Zhang, J., He, Y., and Wang, J. (2018). Analysis of gut microbiota profiles and microbe-disease associations in children with autism spectrum disorders in china. Sci. Rep. 8, 13981. doi: 10.1038/s41598-018-32219-2

  • 88. Zhang, W., Chen, Y., Liu, F., Luo, F., Tian, G., Li, X., et al. (2017). Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics 18, 1–12. doi: 10.1186/s12859-016-1415-9

  • 89. Zhang, Y., Hoffmeister, M., Weck, M. N., Chang-Claude, J., and Brenner, H. (2012). Helicobacter pylori infection and colorectal cancer risk: evidence from a large population-based case-control study in germany. Am. J. Epidemiol. 175, 441–450. doi: 10.1093/aje/kwr331

  • 90. Zhang, Y.-J., Li, S., Gan, R.-Y., Zhou, T., Xu, D.-P., Li, H.-B., et al. (2015). Impacts of gut bacteria on human health and diseases. Int. J. Mol. Sci. 16, 7493–7519. doi: 10.3390/ijms16047493

  • 91. Zhang, Z., Xu, J., Wu, Y., Liu, N., Wang, Y., Liang, Y., et al. (2023). CAPSNET-LDA: predicting lncrna-disease associations using attention mechanism and capsule network based on multi-view data. Brief. Bioinformatics 24, bbac531. doi: 10.1093/bib/bbac531

  • 92. Zhou, Y., Xu, Z. Z., He, Y., Yang, Y., Liu, L., Lin, Q., et al. (2018). Gut microbiota offers universal biomarkers across ethnicity in inflammatory bowel disease diagnosis and infliximab response prediction. mSystems 3, e00188-17. doi: 10.1128/mSystems.00188-17

  • 93. Zhu, S., and Zhu, F. (2019). Cycling comfort evaluation with instrumented probe bicycle. Transp. Res. Part A. Policy Pract. 129, 217–231. doi: 10.1016/j.tra.2019.08.009

  • 94. Zhu, T., Dai, Q., and He, P.-A. (2021). Identification of potential immune-related biomarkers in gastrointestinal cancers. Curr. Bioinform. 16, 1203–1213. doi: 10.2174/1574893615666210106121335

  • 95. Zou, S., Zhang, J., and Zhang, Z. (2017). A novel approach for predicting microbe-disease associations by bi-random walk on the heterogeneous network. PLoS ONE 12, e0184394. doi: 10.1371/journal.pone.0184394

Summary

Keywords

microbe-disease association, feature representation, dimensional reduction, sparse autoencoder, LightGBM

Citation

Wang F, Yang H, Wu Y, Peng L and Li X (2023) SAELGMDA: Identifying human microbe–disease associations based on sparse autoencoder and LightGBM. Front. Microbiol. 14:1207209. doi: 10.3389/fmicb.2023.1207209

Received

17 April 2023

Accepted

18 May 2023

Published

21 June 2023

Volume

14 - 2023

Edited by

Qi Zhao, University of Science and Technology Liaoning, China

Reviewed by

Yulin Zhang, Shandong University of Science and Technology, China; Ju Xiang, Changsha Medical University, China

*Correspondence: Lihong Peng, Xiaoling Li

†These authors have contributed equally to this work and share first authorship
