- 1 Computational Engineering Division, Engineering Directorate, Lawrence Livermore National Laboratory, Livermore, CA, United States
- 2 Biosciences and Biotechnology Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, United States
- 3 Global Security Computing Applications Division, Computing Directorate, Lawrence Livermore National Laboratory, Livermore, CA, United States
- 4 Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
Multiple studies have highlighted the interaction of the human microbiome with physiological systems such as the gut, immune system, liver, and skin via key axes. Advances in sequencing technologies and high-performance computing have enabled the analysis of large-scale metagenomic data, facilitating the use of machine learning to predict disease likelihood from microbiome profiles. However, challenges such as compositionality, high dimensionality, sparsity, and limited sample sizes have hindered the development of actionable models. One strategy to improve these models is by incorporating key metadata from both the human host and sample collection/processing protocols. This remains challenging due to sparsity and inconsistency in metadata annotation and availability. In this paper, we introduce a machine learning-based pipeline for predicting human disease states by integrating host and protocol metadata with microbiome abundance profiles from 68 different studies, processed through a consistent pipeline. Our findings indicate that metadata can enhance machine learning predictions, particularly at higher taxonomic ranks like Kingdom and Phylum, though this effect diminishes at lower ranks. Our study leverages a large collection of microbiome datasets comprising 11,208 samples, thereby enhancing the robustness and statistical confidence of our findings. This work is a critical step toward utilizing microbiome and metadata for predicting diseases such as gastrointestinal infections, diabetes, cancer, and neurological disorders.
1 Introduction
Imbalances in the human microbiome have been associated with conditions such as inflammatory bowel disease (Li et al., 2015), obesity (Liu et al., 2021), diabetes (Tilg and Moschen, 2014), cancer (Kandalai et al., 2023), and neurological disorders (Tiwari et al., 2023; Loh et al., 2024). Distinct microbial profiles can influence disease progression and treatment responses, underscoring the microbiome's potential as a prognostic indicator and therapeutic target. This potential paves the way for training machine learning (ML) models, given adequate training data, that use microbiome abundance information to predict the likelihood of a host developing specific diseases. Recent studies (Curry et al., 2021; Hernández Medina et al., 2022; Liao et al., 2023) have demonstrated the potential of ML techniques in this field. However, the development of actionable models can be hindered by challenges pertaining to microbiome data including (Papoutsoglou et al., 2023): (1) limited sample sizes in individual studies; (2) the additional complexity arising from the compositional nature of microbiome data; (3) the high dimensionality and sparsity of microbiome abundance profiles, particularly at lower taxonomic levels such as Genus and Species; (4) biases introduced by variations in sample collection, processing, and sequencing protocols across studies; and (5) the absence of standardized datasets and protocols for evaluating ML models. Previous studies have attempted to address these challenges through microbiome feature-specific analyses for biomarker identification (Oudah and Henschel, 2018; Oliver et al., 2023). An area that has received relatively less attention is the preparation and engineering of metadata features for inclusion in ML training datasets.
One strategy to improve machine learning models is by integrating key metadata from both host and sample collection/processing protocols. To date, however, there is a paucity of systematic assessments of the impact of such integration efforts on model performance. Host demographic factors, such as age and biological sex (O'Toole and Jeffery, 2015; Fransen et al., 2017), significantly influence microbiome composition, potentially altering disease susceptibility. Lifestyle and diet are crucial determinants of a healthy microbiome, while genetic predispositions can affect conditions like obesity, which, in turn, impact the gut microbiota (Santos-Marcos et al., 2023). Additionally, details about sample origin, collection, and sequencing methodologies can help reduce dataset biases. However, a major challenge lies in the lack of standardized host metadata variables across studies, as these are often conducted independently with distinct research questions in mind. To develop a machine learning model that effectively incorporates host and sample metadata, a consistent set of variables must be available across all studies. In cases of missing data, imputation techniques can be applied to fill these gaps.
More effective incorporation of metadata variables in future efforts will guard against artifactually inaccurate conclusions and limit the impact of systematic confounding factors; indeed, previous reviews have specifically stressed the challenge and criticality of more effective, integrated metadata processing methods (Kumar et al., 2024). Numerous important and laudable efforts are underway to improve metadata recording and curation, which are improving the accessibility of these datasets for model training (Gonzalez et al., 2018; Vangay et al., 2021; Mirzayi et al., 2021). These efforts will be further strengthened with testing and adoption of validated methods for metadata imputation, which will also facilitate the use and integration of past and current studies with incomplete or sparse metadata provision.
In this paper, we introduce a machine learning-based pipeline for predicting human disease states by integrating host and protocol metadata with microbiome abundance profiles from 68 different studies. Because not all studies collected the same set of metadata variables, we employ data imputation strategies to effectively utilize data from all sources. We quantitatively assess the impact of metadata on model classification accuracy and discuss factors that can both increase and decrease model performance across different disease groups when the input feature set is expanded with metadata. Additionally, we analyze which metadata variables provide the most informative insights. To our knowledge, this study leverages the largest meta-analysis of shotgun metagenomic microbiome datasets, classified using the complete NCBI nucleotide (nt) database and integrated with curated metadata features available in the literature, thereby enhancing the robustness and statistical confidence of our findings.
2 Methods and materials
2.1 Human disease prediction from host metadata and microbiome profile
Host disease prediction from microbiome profiles involves the application of machine learning models to classify individuals as either “diseased” or “control” based on their microbiome composition. These profiles are collected from specific body sites, depending on the interactions and mechanisms of interest between disease states and host microbiome sites. For example, skin or gut microbiome samples could be analyzed for dermatological conditions to either understand the direct effect of skin microbial communities on the disease or to study the bidirectional relationship of the gut-skin axis. The input to the machine learning models consists of normalized microbiome abundance profiles, which represent the relative abundances of microbial taxa in each individual. These profiles are preprocessed using normalization techniques to account for the compositional nature of microbiome data and mitigate the effects of varying sequencing depths. The centered log-ratio (CLR) transformation (Aitchison, 1982) is one of the most widely used methods and is applied in this study as well. The normalized profiles are then fed into the models, which are trained to predict whether a given microbiome profile is associated with a diseased or control individual, based on a specific disease under investigation. By identifying subtle microbial signatures, these models have the potential to provide valuable insights into underlying disease mechanisms. Figure 1 illustrates the pipeline employed in this study. Each step of the pipeline is explained in detail in the following sections.
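To make the normalization step concrete, a minimal Python sketch of the CLR transformation applied to a samples-by-taxa count matrix is given below. The pseudocount used to avoid taking the logarithm of zero is an illustrative assumption, as the exact zero-handling strategy is not prescribed here.

```python
import numpy as np

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio (CLR) transform of a samples x taxa count matrix.

    The pseudocount added to avoid log(0) is an assumption made for this
    illustration, not a prescription of the pipeline.
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    log_counts = np.log(counts)
    # Center each sample by its mean log abundance (the log of its geometric mean).
    return log_counts - log_counts.mean(axis=1, keepdims=True)

# Example: three samples, four taxa (hypothetical counts).
profiles = np.array([[10, 0, 5, 85],
                     [3, 7, 20, 70],
                     [50, 25, 15, 10]])
clr_profiles = clr_transform(profiles)
```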
For this analysis, we evaluated four well-established machine learning models: k-Nearest Neighbors (KNN), Logistic Classifier (LC), Linear Support Vector Classifier (LinearSVC), and Random Forest (RF). While other non-linear methods were considered, their performance was suboptimal compared to these selected models, so they were not included in the final evaluation. All models were implemented using the Scikit-learn library (Pedregosa et al., 2011).
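As a point of reference, the four classifiers could be instantiated with Scikit-learn as sketched below; the hyperparameter values shown are placeholders, and the settings actually used are reported in the Supplementary material.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Hyperparameter values are illustrative placeholders; the values used in this
# study are listed in the Supplementary material.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LC": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(C=1.0),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}
```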
2.2 Data
2.2.1 Data collection and processing
To facilitate microbiome machine learning, we leveraged high-performance computing (HPC) resources to perform metagenomic classification on publicly available metagenomic sequence data, in addition to robust metadata curation. This resulted in structurally consistent microbiome taxonomy feature count tables for 13,897 samples across 84 studies.
Methods of this pipeline were previously described (Martí et al., 2025; Kok et al., 2024). Briefly, literature studies were selected according to defined intake criteria and each metagenomic dataset assigned a label of "control" or "diseased". Metadata were extracted and a controlled vocabulary applied for variables such as disease state, adapting and, where possible, adhering to conventions established by the Genomic Standards Consortium (Field et al., 2011). Raw sequence data (fastq files) were downloaded from the appropriate public repository and pre-processed via fastp (Chen et al., 2018) according to defined threshold criteria. These metagenomic samples were then processed for taxonomic classification via Centrifuge (Kim et al., 2016) using a novel decontaminated index covering the entire tree of life (Martí et al., 2025) constructed from the NCBI BLAST Nucleotide (nt) database (NCBI Resource Coordinators, 2012). Finally, taxonomic results were post-processed using Recentrifuge (Martí, 2019) to normalize classification scores and perform additional filtering.
From the library of 84 studies, 10 contained samples from only a single class, either all “control” or all “diseased”. Since our goal was to train classifiers on individual studies using supervised machine learning models, which require samples from both classes, these studies were excluded from our analysis. Additionally, some studies had very limited sample sizes or were extremely class-imbalanced; in both cases, it is challenging to accurately assess classification performance. Therefore, we also excluded studies with fewer than 20 samples or with a class imbalance greater than 98%. Ultimately, we retained a set of 68 microbiome studies, shown in Figure 2 along with basic statistics for each study. The full list of referenced microbiome studies can be found in Supplementary material.
Figure 2. Summary statistics for the 68 microbiome studies used in this work. The bottom plot displays the number of samples in each study, while the top plot shows the proportion of control vs. diseased individuals in each study. Additional details and references for the microbiome studies can be found in Supplementary material.
2.2.2 Metadata variable selection
The union of metadata variables from the 68 microbiome studies comprised 171 variables. As expected, not all variables were collected in all studies, resulting in a large portion of the 171 variables being absent in many studies. Missing data significantly influences which subset of metadata is ultimately retained for use in machine learning models, as it can drastically affect the performance of the methods, a topic we will discuss in Section 2.2.3.
We employed a variable selection strategy that considers both the percentage of missing data for each variable in all studies and the variable's predictive power. There is a clear trade-off when retaining variables with missing data: while they may contain valuable subject information, they can also introduce noise if data imputation is performed.
From the original set of 171 variables, we selected a total of nine predictive variables, as listed in Table 1. The selection process is summarized in Figure 3 and described as follows. First, variables containing only missing values (NaNs) were removed, reducing the set to 131 variables. Next, we excluded identifier variables (e.g., sample or subject IDs), which are irrelevant for model prediction, as well as variables with constant values, which offer no predictive power. This step reduced the set to 109 variables. Finally, we removed all variables with more than 60% missing data across all studies, resulting in a final set of nine variables. The 60% missingness threshold was chosen as a trade-off between the number of retained variables and the overall fraction of missing data requiring imputation. A sensitivity analysis used to determine this threshold is shown in Supplementary Figure S1.
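The three filtering steps can be expressed compactly, as in the pandas sketch below; it assumes a samples-by-variables metadata table and uses hypothetical identifier column names, neither of which is prescribed by the pipeline.

```python
import pandas as pd

def select_metadata_variables(meta: pd.DataFrame,
                              id_columns=("sample_id", "subject_id"),
                              max_missing=0.60) -> pd.DataFrame:
    """Illustrative implementation of the variable selection steps;
    `id_columns` contains hypothetical examples of identifier variables."""
    # 1. Remove variables containing only missing values (NaNs).
    meta = meta.dropna(axis=1, how="all")
    # 2. Remove identifier variables and constant-valued variables.
    meta = meta.drop(columns=[c for c in id_columns if c in meta.columns])
    meta = meta.loc[:, meta.nunique(dropna=True) > 1]
    # 3. Remove variables with more than `max_missing` missing data overall.
    return meta.loc[:, meta.isna().mean() <= max_missing]
```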
All selected categorical variables were retained without modification, except for the disease_category variable from the 2022 American Gut Project (AGP) study, in which diseases were grouped into a broader category labeled “GI_Neurological”. In the AGP dataset, disease categories are self-reported, and multiple conditions can be assigned to the same individual. This broader grouping was chosen because the majority of reported conditions in the study are related to gastrointestinal (GI) or neurological disorders.
2.2.3 Missing metadata imputation
For host metadata variables to be used in machine learning models, missing values (NAs/NaNs) must be assigned a numerical representation. Various strategies have been developed in the field of data imputation. In this study, we employ and compare three different imputation techniques:
1. Imputation of all missing variables using the MICE algorithm (Raghunathan et al., 2001) (MICE_ALL);
2. Imputation of host_age using MICE with UNK (unknown) for missing data in other categorical variables (UNK_MICE_Host_Age);
3. Imputation of host_age using k-NN on the microbiome profiles with UNK for missing data in other categorical variables (UNK_KNN_Host_Age).
The retained metadata variables include a combination of subject-specific (e.g., age, sex) and protocol-specific (e.g., body location, sequencing method) variables. While subject-specific variables are most relevant to indicating and discriminating health states, variables describing the collection and analytical methods may be influential with respect to systematic distinctions in observed microbial features between studies. As such, they may be instructive in this and future studies and are included in the described assessment. Since host age is a key factor influencing microbiome composition (Seidel and Valenzano, 2018; Bosco and Noti, 2021; Brooks et al., 2023), it is expected to play a significant role in the machine learning model. Therefore, we place special emphasis on accurately imputing this variable. Figure 4 presents a heatmap showing the percentage of missing values for each selected metadata variable across all studies.
Figure 4. Percentage of missing metadata information for each variable across all studies. Darker color indicates higher levels of missing data.
In the first imputation method, we use MICE (Multivariate Imputation by Chained Equations; Raghunathan et al., 2001; Azur et al., 2011), a statistical approach that fills in missing values through a sequence of chained regression models and combines the results of multiple imputations into a final imputed dataset. We impute all variables using MICE and call this MICE_ALL. In the second method, we impute host_age using MICE and mark missing data in other categorical variables as UNK; we call this method UNK_MICE_Host_Age. For the third method, we impute host_age using k-NN on the microbiome profiles and label missing data in other categorical variables as UNK, a method we refer to as UNK_KNN_Host_Age. Here, a new sample's missing age is imputed by identifying the closest samples in the training set (by microbiome profile) and averaging their values. In our experiments, we used k = 1, but we also evaluated k = 3 and k = 5 and observed no significant differences.
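A minimal sketch of the host_age imputation variants is given below, using Scikit-learn's IterativeImputer as a MICE-style imputer and a nearest-neighbor search over microbiome profiles; whether these particular implementations match those used in this study is an assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors

def impute_age_mice(metadata_numeric):
    """MICE-style imputation of numeric metadata (e.g., host_age). In the
    UNK_MICE_Host_Age variant, missing categorical values are set to "UNK"
    beforehand rather than imputed."""
    imputer = IterativeImputer(sample_posterior=True, random_state=0)
    return imputer.fit_transform(metadata_numeric)

def impute_age_knn(train_profiles, train_age, query_profiles, k=1):
    """UNK_KNN_Host_Age variant: fill missing host_age for query samples from
    the k closest training samples in microbiome-profile space (k = 1 here)."""
    train_profiles = np.asarray(train_profiles, dtype=float)
    train_age = np.asarray(train_age, dtype=float)
    known = ~np.isnan(train_age)
    nn = NearestNeighbors(n_neighbors=k).fit(train_profiles[known])
    _, idx = nn.kneighbors(query_profiles)
    return train_age[known][idx].mean(axis=1)
```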
2.2.4 Feature encoding
Machine learning algorithms require that all predictive variables be represented numerically. This is called feature encoding. We employ different encoding strategies for different types of predictive variables. The selection depends on the characteristics of the variables under consideration. In what follows, we use the terms variable and feature interchangeably.
Numerical variables can be used directly as input features to machine learning models; therefore, host_age requires no additional encoding. Variables that are purely categorical (qualitative variables with no inherent ordering or associated numerical values) are typically represented via one-hot encoding. In this strategy, the variable is represented by a binary vector whose length equals the number of categories, containing zeros everywhere except for a 1 in the position corresponding to the given category. This is equivalent to introducing dummy variables for categorical variables in standard regression analysis. All categorical variables in Table 1 were encoded using this strategy.
Cases with missing categories in samp_collect_device, disease_category, host_sex, antibiotics, host_body_product and country are treated as an additional category for that feature, unless specified otherwise. We ignore all features that are not directly related to the host condition, such as IDs, study abstracts, paper references, etc.
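The resulting encoding can be illustrated with a small hypothetical metadata table; the column names follow Table 1, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical metadata table; column names follow Table 1, values are invented.
meta = pd.DataFrame({
    "host_age": [34.0, 61.0, 47.0],
    "host_sex": ["female", "male", None],
    "antibiotics": ["no", None, "yes"],
})

# host_age is numeric and used as-is; missing categories become an explicit
# "UNK" level, and the categorical variables are then one-hot encoded.
categorical = ["host_sex", "antibiotics"]
meta[categorical] = meta[categorical].fillna("UNK")
encoded = pd.get_dummies(meta, columns=categorical)
```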
2.2.5 Outcome definition
For each individual study, a binary variable indicating control (0) or diseased (1) serves as the outcome variable for the corresponding machine learning model. It is important to note that the definition of control is specific to each study. For instance, in a gastrointestinal (GI) study, a control label means the individual did not exhibit any GI disorders at the time of sample collection; however, other conditions like diabetes, cancer, or neurological disorders could theoretically be present but were not assessed. Similarly, a sample classified as diseased in a GI study may be considered a control in the context of a pulmonary disease study.
This suggests that disease state classification is better approached as a multi-label problem, since an individual may have multiple coexisting conditions. However, the existing collection of microbiome datasets does not support this approach, as the underlying studies assessed a single given disease condition. Furthermore, the variability and specificity of disease definitions across studies make it difficult to train a single, generalizable model to differentiate between control and diseased states using data from all studies combined, even though this would increase the training sample size. Therefore, developing multi-label models or broadly generalizable models is beyond the scope of this article.
2.3 Experimental setup
For each ML model in each microbiome study, we conducted 30 independent experiments using holdout cross-validation with varying train/validation/test splits. We allocated 70% of the data for training, 10% for validation, and 20% for testing. The hyperparameters used for each model are detailed in Supplementary material. The host disease prediction performance is evaluated using the widely recognized area under the Receiver Operating Characteristic curve (AUROC) metric (Hanley and McNeil, 1982). This metric assesses the trade-off between the True Positive Rate (Sensitivity) and the False Positive Rate (1 - Specificity) across various threshold settings, providing a comprehensive measure of the model's discriminatory ability.
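A sketch of the repeated holdout evaluation is given below; the use of stratified splits and of Random Forest as the example classifier are assumptions, and the validation portion (reserved for hyperparameter tuning) is not scored here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def repeated_holdout_auroc(X, y, n_repeats=30):
    """30 independent 70/10/20 train/validation/test splits scored with AUROC.
    Stratification and the Random Forest classifier are illustrative choices."""
    scores = []
    for seed in range(n_repeats):
        # First split: 70% train, 30% held out.
        X_train, X_tmp, y_train, y_tmp = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=seed)
        # Second split of the held-out 30%: 10% validation, 20% test.
        X_val, X_test, y_val, y_test = train_test_split(
            X_tmp, y_tmp, test_size=2 / 3, stratify=y_tmp, random_state=seed)
        model = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
        scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    return np.mean(scores), np.std(scores)
```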
3 Results and discussion
3.1 Host metadata only
We first investigated and compared the metadata imputation models presented in Section 2.2.3. For this, we trained binary classification models to distinguish between diseased and control patients using only metadata. The goal of this analysis was to assess the impact of each imputation method, as the microbiome profiles will be incorporated into the feature vectors in subsequent steps. Figure 5 presents the classification performance using host metadata as input. While the overall performance is modest, the models were able to extract some relevant discriminative information from the metadata (with Random Forest exceeding an AUROC of 0.62), even though the selected metadata variables alone are not expected to comprehensively indicate disease state.
Figure 5. Human disease classification performance using only host metadata. Bars represent the mean AUROC across all studies, and error bars denote the standard deviation. Statistical comparisons across imputation strategies were performed using the Friedman test, followed by pairwise Wilcoxon signed-rank tests with FDR correction when significant (Sections 2.2.2, 2.2.3, and 2.2.4). While overall performance was modest, significant differences were observed for the LC and RF classifiers, with UNK_KNN_Host_Age being the best-performing imputation strategy for the top-performing model, RF.
To compare the performance of different imputation methods across studies, we first aggregated performance metrics across 30 data splits per study–imputation–classifier combination using the median to reduce sensitivity to outliers. For each classifier, we applied the Friedman test to detect overall differences among imputation strategies. When the Friedman test indicated significant differences (p < 0.05), we performed pairwise Wilcoxon signed-rank tests between imputation methods and applied the Benjamini–Hochberg procedure to control the false discovery rate. Across classifiers, KNN (χ2 = 4.895, p = 0.086) and LinearSVC (χ2 = 4.358, p = 0.113) showed no significant differences among imputation strategies. In contrast, LC (χ2 = 6.488, p = 0.039) and RF (χ2 = 6.167, p = 0.046) showed significant differences. For the best-performing model, RF, the UNK_KNN_Host_Age imputation method was the best median performer (0.612), significantly outperforming MICE_ALL (padj = 0.004) and UNK_MICE_Host_Age (padj = 0.011). Average ranks also favored UNK_KNN_Host_Age (1.77), followed by UNK_MICE_Host_Age (2.06) and MICE_ALL (2.17). Given its improved performance, we will only consider the UNK_KNN_Host_Age imputation method for the remainder of this analysis.
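The comparison procedure can be sketched with SciPy and statsmodels as follows; the per-study AUROC vectors shown are hypothetical placeholders, not values from this study.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# Per-study median AUROCs for one classifier, aligned across the three
# imputation strategies; values are hypothetical placeholders.
auroc = {
    "MICE_ALL":          np.array([0.55, 0.60, 0.58, 0.62, 0.57]),
    "UNK_MICE_Host_Age": np.array([0.57, 0.61, 0.59, 0.63, 0.58]),
    "UNK_KNN_Host_Age":  np.array([0.60, 0.63, 0.61, 0.66, 0.62]),
}

stat, p = friedmanchisquare(*auroc.values())
if p < 0.05:  # pairwise tests only when the omnibus Friedman test is significant
    pairs = [("MICE_ALL", "UNK_MICE_Host_Age"),
             ("MICE_ALL", "UNK_KNN_Host_Age"),
             ("UNK_MICE_Host_Age", "UNK_KNN_Host_Age")]
    raw_p = [wilcoxon(auroc[a], auroc[b]).pvalue for a, b in pairs]
    # Benjamini-Hochberg correction to control the false discovery rate.
    p_adj = multipletests(raw_p, method="fdr_bh")[1]
```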
3.2 Microbiome only
Figure 6 presents the average performance across the 30 runs for each study. The error bars represent the standard deviation in the ML model's performance at each taxonomic level across all studies. As can be seen, the classification performance improves noticeably as we move to lower taxonomic levels (from Kingdom to Species). However, working with profiles at these finer taxonomic levels demands significantly more computational resources due to the increased dimensionality of the data (e.g., microbiome abundance profiles have 11 dimensions at the Kingdom level but nearly 31,000 at the Species level, as these profiles are derived from shotgun metagenomic data across all domains). The high dimensionality not only increases the time required for data processing and normalization but also leads to substantially longer training times for machine learning models. Overall, the models demonstrate better performance when microbial abundance is analyzed at lower taxonomic levels. This is consistent with previous observations that classification performance generally improves with increasing taxonomic resolution (Armour et al., 2022).
Figure 6. Classification performance (AUROC) of the ML models across different taxonomic levels, based solely on microbiome profile information. Bars represent the mean AUROC across all studies, and error bars denote the standard deviation. Overall, the models demonstrate better performance when microbial abundance is analyzed at lower taxonomic levels.
In Figure 6, we also highlight the performance of the ML methods, showing that LinearSVC and RF are the top-performing models. RF slightly outperformed LinearSVC at higher taxonomic levels (Kingdom, Phylum, Class, and Order), while LinearSVC yielded more accurate predictions at lower taxonomic levels (Family, Genus, and Species). This may reflect the greater flexibility of RF to operate across the nonlinear decision boundaries arising from relationships between coarser taxonomic ranks. This is consistent with previous model comparisons for inflammatory bowel disease (Kubinski et al., 2022). It may be that, in this case, LinearSVC more effectively handles the sparser, higher-dimensional species-level features.
3.3 Host metadata and microbiome jointly
Classification performance of ML models using both host metadata and microbiome profiles as input is presented in Figure 7. Notably, the Random Forest model demonstrates strong performance even at higher taxonomic levels, such as Kingdom and Phylum, highlighting the importance of incorporating host metadata. As can be seen in Figure 8, the improvement associated with metadata inclusion decreases as we move to lower taxonomic levels and becomes negligible, in the current analysis, at the Species level. A likely explanation is that the influence of metadata features wanes as the problem's dimensionality increases from dozens to tens of thousands of features. In other words, the limited metadata variables become “diluted” by the extensive microbiome metagenomic profile information.
Figure 7. AUROC performance of the ML models across various taxonomic levels, utilizing both host metadata and microbiome profiles. Bars represent the mean AUROC across all studies, and error bars denote the standard deviation. The Random Forest model demonstrates strong performance even at higher taxonomic levels, such as Kingdom and Phylum, highlighting the importance of incorporating host metadata.
Figure 8. Comparative classification performance of ML models trained using microbiome only vs. microbiome + host metadata. The left panel shows the AUROC for RF, and the right panel presents the difference in performance for both the RF and LinearSVC models. Bars represent the mean AUROC across all studies, and error bars denote the standard deviation. Significant improvement can be observed in particular for higher taxonomic levels; the improvement associated with metadata inclusion decreases as we move to lower taxonomic levels.
From a physiological perspective, this observation may also be due to defined demographic variables affiliating more directly with broad microbial taxonomic trends. This is consistent with previous observations, where shifts in Firmicutes/Bacteroidetes and other compositional ratios have been reported widely across metadata features such as age, gender, and antibiotic use (Odamaki et al., 2016; Zhernakova et al., 2016). There are further examples of compositional distinctions that are dependent on variable interaction, for instance, gender distinctions that are observed following stratification by BMI (Haro et al., 2016). As noted above, such interactions may be less pronounced, i.e., diluted, by higher dimensionality at the species level. However, this may differ by study and disease state, and future implementation of these methods may identify metadata impacts in other metagenomic contexts, not observed in the current study.
It is also important to consider the impact of the type of machine learning model used, as different models handle features in distinct ways. Some models, like k-NN, logistic regression, and LinearSVC, evaluate all features simultaneously, often by considering distances between entire feature vectors. In contrast, others, such as Random Forest, can focus on specific subsets of features, particularly in tree-based models with limited tree depth. This distinction can significantly influence model performance and interpretation when a limited set of metadata is incorporated into complex and high-dimensional microbiome profile data. In that sense, models with feature-level focus can more readily pick up the limited metadata features. Figure 8 shows comparative performance plots for Random Forest and LinearSVC. For the Random Forest model, the improvement of microbiome plus metadata over microbiome only is more noticeable than for LinearSVC across the majority of taxonomic levels, in particular at lower levels such as Family and Genus.
3.3.1 Statistical significance and effect sizes across methods and taxonomic levels
To formally quantify the impact of including metadata on predictive performance, we performed paired Wilcoxon signed-rank tests comparing AUROC values of joint (microbiome+metadata) vs. microbiome-only models across 68 datasets. For each method and taxonomy level, we computed the median AUROC values from the 30 independent train/test splits per dataset to obtain a single representative value, and FDR correction (Benjamini-Hochberg) was applied to account for multiple comparisons. Effect sizes were calculated using Cohen's d for paired samples. Detailed statistical results, including mean AUROC, Cohen's d, and FDR-corrected p-values for all methods and taxonomic levels, are reported in Supplementary Table S2.
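For each method and taxonomy level, the per-comparison statistics can be computed as sketched below; the FDR correction applied across the full set of comparisons is omitted from the sketch for brevity.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_comparison(joint_auroc, microbiome_auroc):
    """Paired Wilcoxon signed-rank test and Cohen's d for paired samples,
    comparing per-dataset median AUROCs of joint vs. microbiome-only models."""
    diff = np.asarray(joint_auroc) - np.asarray(microbiome_auroc)
    p_value = wilcoxon(diff).pvalue
    cohens_d = diff.mean() / diff.std(ddof=1)  # paired-samples effect size
    return p_value, cohens_d
```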
The results reveal that the magnitude and significance of performance improvements vary across both taxonomy levels and machine learning methods. At higher taxonomic levels (Kingdom, Phylum), most methods, including KNN, Random Forest, and LinearSVC, show moderate effect sizes (Cohen's d = 0.3–0.47) and highly significant differences, indicating that metadata contributes substantially to model performance when microbiome signals are coarse. At intermediate levels (Order, Family, Genus), effect sizes are generally smaller but remain significant for several methods, with Random Forest and LC showing the most consistent gains. At the Species level, effect sizes are modest and, in some cases (LinearSVC and RF), not statistically significant. These results support the trends observed in Figures 7, 8, indicating that microbiome features exert a stronger influence at higher taxonomic resolutions.
3.4 Analysis at disease group level
Given the varying degrees of association between microbiome profiles and specific diseases, as well as the differing significance of host metadata across disease categories, it is reasonable to hypothesize that incorporating host metadata may significantly enhance classification performance for certain disease groups more than others. This analysis is essential to identify where host factors play a pivotal role alongside the microbiome in disease prediction, as some diseases may be more influenced by host characteristics such as age, sex, or antibiotic usage (for instance, sex-mediated distinctions in immune response; Fransen et al., 2017).
In Figure 9, we present the classification performance difference between the host metadata + microbiome and microbiome only RF models for each study for the Kingdom, Order, and Species taxonomic ranks. The plots depict the mean and 95% confidence interval of the performance improvements resulting from the addition of metadata variables to the microbiome metagenomic profiles, across independent runs. This detailed comparison allows us to assess the specific impact of host metadata on the predictive accuracy of the models across various disease groups, providing insights into the contexts where host information is most beneficial.
Figure 9. Performance differences in human disease classification between microbiome + host metadata and microbiome only across all studies, grouped by disease category. Positive values indicate improved performance with the inclusion of host metadata. The plot shows the mean and 95% confidence interval of the RF's AUROC difference. *The American Gut Project (2022_American) was a crowdsourced study with diverse participants having various health conditions, rather than focusing on one specific disease.
As previously observed in Figure 8, host metadata exerts a more pronounced effect at higher taxonomic ranks, such as Kingdom and Phylum, compared to lower ranks like Genus and Species. We hypothesize that this phenomenon is due to the “dilution” of metadata effects within the more complex, higher-dimensional, and potentially more physiologically representative microbiome abundance features at more specific taxonomic resolution.
At the Kingdom level, Figure 9 shows that host metadata significantly enhances classification performance for the Neuro (13 studies) and Cancer (6 studies) categories. In these groups, incorporating host metadata consistently improves or at least maintains performance across all studies, highlighting a strong and robust effect. This is likely attributable to specific metadata variables that are known to be associated with the epidemiology of these diseases. In particular, the variables host_age and antibiotics are the top two metadata variables identified from the models in feature importance analyses. Cancer-related studies included primarily assessments of colorectal cancer, with one study of pancreatic cancer. Our data-driven observations are consistent with the clinical demographic literature, as age (Siegel et al., 2023), gender (Tsokkou et al., 2025), and antibiotic usage (Simin et al., 2020) are all variables influencing occurrence and risk for colorectal cancer. The neurological studies employed here included evaluations of Alzheimer's disease (AD), Parkinson's disease (PD), neuropsychiatric disorders, myalgic encephalomyelitis, and multiple sclerosis. Similarly, patient age is associated with the neurological diseases employed in this study, particularly for AD (The Alzheimer's Association, 2024).
Other disease categories, such as Autoimmune, Cirrhosis, and Oral, also show noticeable performance gains, though the smaller number of studies in these categories may limit the generalizability of these observations. In contrast, larger disease groups like Cardiovascular (6 studies) and Pulmonary (8 studies) show virtually no performance change, with only one study in each group showing improvement. This suggests that additional or different types of host metadata might be needed for these disease groups, warranting further investigation into other host factors. It is also possible that the broad range of disease conditions categorized within the cardiovascular and pulmonary groups occludes the capacity of the current analysis to identify and employ consensus metadata variables with utility. For instance, studies specific to ischemic heart disease (Martin et al., 2024) would likely demonstrate clearer benefit.
For the largest group, Gastrointestinal (14 studies), the majority of studies exhibit improved performance with the inclusion of metadata. However, a few studies show no change, and some even display a decline in performance. It is important to note that this category encompasses a heterogeneous set of conditions, including diarrhea, gastrointestinal infections, irritable bowel syndrome, inflammatory bowel disease, and seasickness, which, as noted above, likely contribute to the variable impact of metadata inclusion.
3.5 Missing host metadata and disease prediction performance
Figure 9 prompts a critical question: what factors contribute to performance improvement in some studies while leading to a decline in others, and in which cases does data imputation measurably improve performance? Several factors could be at play, including the relevance of the specific set of host metadata variables available and the sample size. However, a key factor is the extent of missing host metadata, which was subsequently imputed using the methods described in Section 2.2.3.
Although imputing missing data enabled the use of host metadata in the ML model and generally improved performance, it also introduced noise into the final feature set. This is an expected result, and imputation can potentially introduce bias and obscure true biological signals (Karpievitch et al., 2012). Our imputation techniques leveraged machine learning models to minimize this noise and maximize the information derived from similar studies, consistent with observations that model-based methods reduce false positive results after imputation (Andrews and Hemberg, 2019), yet some residual noise remains inevitable.
To quantitatively evaluate the impact of imputed missing data, we calculated the percentage of missing values for each host metadata variable across all studies and then aggregated these percentages for two groups: studies where host metadata had a positive impact and those where it had a negative impact. These aggregated percentages are presented in Table 2. The last column shows the percentage of missing data when all variables are considered.
Table 2. Average percentage of metadata information imputed over all studies grouped by studies in which metadata information had a Positive and Negative impact on human disease diagnosis (AUROC).
A key observation from Table 2 is that, when considering all host metadata variables (“Overall”), studies with a positive impact had lower levels of missing data across all taxonomic ranks, with a reduction of up to 5% at the Order level. This suggests that the more complete the host metadata, the greater the improvement in disease prediction performance, likely because less additional noise is introduced through imputation.
At the level of individual host metadata variables, an opposite trend is observed for samp_collect_device, host_body_product, and country, where studies with a positive impact exhibit a higher level of missing or imputed data. This can be explained by the RF model feature importance shown in Figure 10, which indicates that these three variables have little or no predictive power for host disease state. As a result, the model places minimal weight on them, regardless of the percentage of missing data. This observation is consistent with expectations, as these methodological parameters would be consistent across healthy and disease state samples within a given study. While it is clear that the extraction method, library preparation, and sequencing platform can introduce biases in resultant microbiome profiles (Poulsen et al., 2022; Elie et al., 2023), such biases are expected to be uniform across the labels implemented for classification in the current study.
Figure 10. Random Forest model feature importance at Kingdom level. Host metadata variables are highlighted in red. Bars represent the mean and 95% confidence interval across all 68 studies. Sample processing variables (collection device, body product, country) show minimal predictive importance, as expected since these parameters remain consistent across healthy and disease samples within each study.
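The importances underlying Figure 10 can be obtained, for example, from the impurity-based importances exposed by Scikit-learn's Random Forest; whether this is the exact importance measure used is an assumption, and the data and feature names below are synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for one study's Kingdom-level abundances plus encoded
# metadata; feature names are hypothetical examples.
feature_names = ["Bacteria", "Archaea", "Eukaryota", "Viruses",
                 "host_age", "host_sex_female", "antibiotics_yes"]
rng = np.random.default_rng(0)
X = rng.random((120, len(feature_names)))
y = rng.integers(0, 2, size=120)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Mean decrease in impurity per feature; in the paper such values are averaged
# across studies and reported with 95% confidence intervals.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda item: item[1], reverse=True)
```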
In contrast, variables such as host_age, host_sex, and antibiotics are highly relevant, as evidenced by their predictive value. Table 2 reveals that for these key variables, studies with a positive impact had significantly less missing data, up to 20% less compared to those with a negative impact. This strongly suggests that reduced performance in some studies can be attributed to the increased amount of missing data for these critical variables. It also underscores the importance of acquiring a complete set of metadata features, particularly those with high predictive value for the ML model.
Variables affiliated with sample processing, including sequencing method, body location, and collection device, demonstrated minimal feature importance values. Again, this observation is anticipated as, for a given study, these variables should be consistent across control and disease specimens. However, the described framework may be useful in future assessments evaluating distinct methods in pooled specimen sets, where it may be desirable to integrate datasets from sample cohorts that underwent distinct processing workflows. Nevertheless, capturing comprehensive metadata is crucial, not only for data interpretation but also for identifying study-specific artifacts and correcting potential biases. Study-specific factors such as sample collection devices, extraction protocols, library preparation methods, and sequencing platforms can act as confounders that obscure true disease-related effects. To mitigate these influences, it is important to include appropriate within-study controls, as done in this study, and to systematically evaluate correlations between metadata variables and disease status. Potential confounders can then be incorporated as covariates in statistical models, and batch correction methods such as percentile-normalization (Gibbons et al., 2018) and PLSDA-batch (Wang and Lê Cao, 2023) can be applied to reduce study-specific biases and to improve the detection of biological signal.
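As one example of such a correction, a simplified sketch of within-study percentile normalization (in the spirit of Gibbons et al., 2018) is shown below; this is not the reference implementation, and the handling of ties and zeros is an assumption.

```python
import numpy as np
from scipy.stats import percentileofscore

def percentile_normalize(abundances, is_control):
    """Convert each sample's feature values to percentiles of the within-study
    control distribution (simplified sketch, not the reference implementation)."""
    abundances = np.asarray(abundances, dtype=float)
    controls = abundances[np.asarray(is_control, dtype=bool)]
    normalized = np.empty_like(abundances)
    for j in range(abundances.shape[1]):
        normalized[:, j] = [percentileofscore(controls[:, j], value)
                            for value in abundances[:, j]]
    return normalized / 100.0
```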
A potential future application of an integrated metadata–microbiome model is the identification of microbial features that may serve as disease biomarkers. To illustrate this, we examined the neurological datasets included in this study to identify features most influential in driving model predictions (Figure 11). The neurological cohort was selected as it contains the highest number of studies, providing a clearer signal in the data. Consistent with feature importance assessed across all disease categories (Figure 10), host-associated variables such as antibiotic use and age exhibited the strongest contributions, followed by microbial taxa including Enterocloster asparagiformis, Faecalibacterium duncaniae, and Akkermansia muciniphila. These findings suggest that host context, captured by variables such as medication history and demographic factors, can modulate or even confound microbiome–disease associations, highlighting the importance of integrated modeling approaches that jointly account for both host and microbial contributions.
Figure 11. Random forest model feature importance at the Species level for neurological diseases. Bars represent the mean and 95% confidence interval across all 68 studies. Host metadata and health-associated gut microbes are the most important features for predicting neurological disease, demonstrating the model's potential for biomarker discovery.
Among the top microbial predictors, E. asparagiformis, F. duncaniae, and A. muciniphila were identified as key taxa. These species are generally considered hallmarks of a healthy gut microbiome and are notable producers of metabolites such as urolithins and short-chain fatty acids (SCFAs), which support colonocyte health and exert anti-inflammatory effects (Ottman et al., 2017; Martín et al., 2023; Pidgeon et al., 2025). Notably, F. duncaniae has been reported to be more abundant in healthy controls compared to patients in Alzheimer's disease and mild cognitive impairment (Ueda et al., 2021). In contrast, the role of A. muciniphila in neurological disorders remains ambiguous, with conflicting evidence supporting both protective and detrimental effects (Ma et al., 2025), potentially reflecting strain-level differences or disease stage–specific interactions. Nevertheless, shifts in these taxa may contribute to, or serve as indicators of, neurological disease, especially when considered alongside key host variables that shape microbiome composition and function.
4 Conclusion
The inconsistency with which microbiome metadata are recorded and curated in the literature can make it difficult to perform consistent, ML-driven meta-studies that include these variables. In fact, without effective imputation strategies, the study presented here would not have been possible. We presented a pipeline that integrates microbiome abundance profile information with host and sample processing metadata to predict human disease state. These models can assess an individual's potential resilience and provide the predicted probability of being “diseased”.
Our multi-study investigation, comprising 68 published microbiome studies, reinforces the growing body of evidence supporting that incorporating metadata as additional inputs significantly improves the accuracy of disease prediction models. We provide here an ML-centric assessment of the performance impact of such efforts. This study is intended not as a defined biomarker identification effort, but as a systematic, metadata-specific performance evaluation that can be used to support the incorporation of metadata features in future prognostic and diagnostic platforms utilizing microbiome datasets.
In our analyses, the supportive effect of metadata incorporation diminishes at lower taxonomic levels, such as Genus and Species. This may be attributed to the dilution of the limited metadata feature set compared to the extremely high dimensionality at these taxonomic levels, suggesting that additional metadata information could be required. Additionally, while missing metadata can be imputed using various methods, the noise introduced by imputation for critical variables can negatively impact performance for certain studies or disease groups. Impact on performance will also be dependent on the classifier method being applied.
Future efforts that should be prioritized for advancing ML-based microbiome analytics include an in-depth investigation into the impact of metadata on model generalization to new, unseen studies. Such analyses would assess models trained on data from the same disease category as the new dataset, due to the varying definitions of “control” and “diseased” subjects across different studies, as discussed in Section 2.2.5. Another potential avenue for exploration is the application of bias correction methods prior to training the ML models. As the number of publicly available microbiome datasets increases, future work is necessary to evaluate the generalizability of these models on unseen, independent cohorts for health–disease prediction. Domain adaptation approaches such as transfer learning may help mitigate dataset-specific biases and enhance model robustness. Furthermore, with advances in bioinformatics and genome annotation, functional features reflecting microbial activity, such as genes and pathways, can be incorporated into future model iterations. These features are likely to correlate more directly with disease mechanisms, improving both biological interpretability and predictive performance.
From our results, consistent with calls from numerous other consortia, we encourage the development of a standardized metadata collection framework for microbiome studies to ensure consistency and comparability across datasets. Such a framework would help address variability in data collection practices, enhance reproducibility, and ultimately support the development of more robust and generalizable machine learning models. The combination of uniform guidelines for metadata reporting with robust methods for data imputation would provide a framework that facilitates cross-study analysis and improves the model's ability to generalize to new, unseen datasets, thereby advancing the capacity to generate predictive platforms based on microbiome-derived features.
Data availability statement
The datasets utilized in this study can be found in https://zenodo.org/records/17315984.
Ethics statement
Ethics approval was not required for this study as it re-analyzed publicly available de-identified datasets from open literature. No new specimens were obtained or generated for this study.
Author contributions
AG: Conceptualization, Formal analysis, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing. HR: Conceptualization, Formal analysis, Writing – original draft. CV: Methodology, Writing – review & editing. HZ: Methodology, Writing – review & editing. BZ: Methodology, Writing – review & editing. CK: Data curation, Writing – review & editing. JM: Data curation, Writing – review & editing. NM: Data curation, Writing – review & editing. JT: Data curation, Writing – review & editing. CJ: Funding acquisition, Project administration, Writing – review & editing. NB: Funding acquisition, Project administration, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL Laboratory Directed Research and Development Program under Project No. 22-SI-002.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2025.1695501/full#supplementary-material
References
Aitchison, J. (1982). The statistical analysis of compositional data. J. R. Stat. Soc.: Ser. B 44, 139–160. doi: 10.1111/j.2517-6161.1982.tb01195.x
Andrews, T. S., and Hemberg, M. (2019). False signals induced by single-cell imputation. F1000Research 7:1740. doi: 10.12688/f1000research.16613.2
Armour, C. R., Topçuoğlu, B. D., Garretto, A., and Schloss, P. D. (2022). A goldilocks principle for the gut microbiome: taxonomic resolution matters for microbiome-based classification of colorectal cancer. mBio 13:e03161-21. doi: 10.1128/mbio.03161-21
Azur, M. J., Stuart, E. A., Frangakis, C., and Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20, 40–49. doi: 10.1002/mpr.329
Bosco, N., and Noti, M. (2021). The aging gut microbiome and its impact on host immunity. Genes Immunity 22, 289–303. doi: 10.1038/s41435-021-00126-8
Brooks, C. N., Wight, M. E., Azeez, O. E., Bleich, R. M., and Zwetsloot, K. A. (2023). Growing old together: what we know about the influence of diet and exercise on the aging host's gut microbiome. Front. Sports Active Living 5:1168731. doi: 10.3389/fspor.2023.1168731
Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. doi: 10.1093/bioinformatics/bty560
Curry, K. D., Nute, M. G., and Treangen, T. J. (2021). It takes guts to learn: machine learning techniques for disease detection from the gut microbiome. Emerg. Topics Life Sci. 5, 815–827. doi: 10.1042/ETLS20210213
Elie, C., Perret, M., Hage, H., Sentausa, E., Hesketh, A., Louis, K., et al. (2023). Comparison of DNA extraction methods for 16S rRNA gene sequencing in the analysis of the human gut microbiome. Sci. Rep. 13:10279. doi: 10.1038/s41598-023-33959-6
Field, D., Amaral-Zettler, L., Cochrane, G., Cole, J. R., Dawyndt, P., Garrity, G. M., et al. (2011). The genomic standards consortium. PLoS Biol. 9:e1001088. doi: 10.1371/journal.pbio.1001088
Fransen, F., van Beek, A. A., Borghuis, T., Meijer, B., Hugenholtz, F., van der Gaast-de Jongh, C., et al. (2017). The impact of gut microbiota on gender-specific differences in immunity. Front. Immunol. 8:754. doi: 10.3389/fimmu.2017.00754
Gibbons, S. M., Duvallet, C., and Alm, E. J. (2018). Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 14:e1006102. doi: 10.1371/journal.pcbi.1006102
Gonzalez, A., Navas-Molina, J. A., Kosciolek, T., McDonald, D., Vázquez-Baeza, Y., Ackermann, G., et al. (2018). Qiita: rapid, web-enabled microbiome meta-analysis. Nat. Methods 15, 796–798. doi: 10.1038/s41592-018-0141-9
Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36. doi: 10.1148/radiology.143.1.7063747
Haro, C., Rangel-Zúñiga, O. A., Alcalá-Díaz, J. F., Gómez-Delgado, F., Pérez-Martínez, P., Delgado-Lista, J., et al. (2016). Intestinal microbiota is influenced by gender and body mass index. PLoS ONE 11:e0154090. doi: 10.1371/journal.pone.0154090
Hernández Medina, R., Kutuzova, S., Nielsen, K. N., Johansen, J., Hansen, L. H., Nielsen, M., et al. (2022). Machine learning and deep learning applications in microbiome research. ISME Commun. 2:98. doi: 10.1038/s43705-022-00182-9
Kandalai, S., Li, H., Zhang, N., Peng, H., and Zheng, Q. (2023). The human microbiome and cancer: a diagnostic and therapeutic perspective. Cancer Biol. Ther. 24:2240084. doi: 10.1080/15384047.2023.2240084
Karpievitch, Y. V., Dabney, A. R., and Smith, R. D. (2012). Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinform. 13(Suppl. 16):S5. doi: 10.1186/1471-2105-13-S16-S5
Kim, D., Song, L., Breitwieser, F. P., and Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729. doi: 10.1101/gr.210641.116
Kok, C. R., Mulakken, N. J., Thissen, J. B., Martí, J. M., Lee, R., Trainer, J. B., et al. (2024). Meta2DB: curated shotgun metagenomic feature sets and metadata for health state prediction. bioRxiv. doi: 10.1101/2024.10.03.616398
Kubinski, R., Djamen-Kepaou, J.-Y., Zhanabaev, T., Hernandez-Garcia, A., Bauer, S., Hildebrand, F., et al. (2022). Benchmark of data processing methods and machine learning models for gut microbiome-based diagnosis of inflammatory bowel disease. Front. Genet. 13:784397. doi: 10.3389/fgene.2022.784397
Kumar, B., Lorusso, E., Fosso, B., and Pesole, G. (2024). A comprehensive overview of microbiome data in the light of machine learning applications: categorization, accessibility, and future directions. Front. Microbiol. 15:1343572. doi: 10.3389/fmicb.2024.1343572
Li, J., Butcher, J., Mack, D., and Stintzi, A. (2015). Functional impacts of the intestinal microbiome in the pathogenesis of inflammatory bowel disease. Inflamm. Bowel dis. 21, 139–153. doi: 10.1097/MIB.0000000000000215
Liao, H., Shang, J., and Sun, Y. (2023). GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data. Bioinformatics 39:btad747. doi: 10.1093/bioinformatics/btad747
Liu, B.-N., Liu, X.-T., Liang, Z.-H., and Wang, J.-H. (2021). Gut microbiota in obesity. World J. Gastroenterol. 27:3837. doi: 10.3748/wjg.v27.i25.3837
Loh, J. S., Mak, W. Q., Tan, L. K. S., Ng, C. X., Chan, H. H., Yeow, S. H., et al. (2024). Microbiota-gut-brain axis and its therapeutic applications in neurodegenerative diseases. Sig. Transd. Targeted Ther. 9:37. doi: 10.1038/s41392-024-01743-1
Ma, X., Liu, Q., and Yang, G. (2025). The multifaceted roles of Akkermansia muciniphila in neurological disorders. Trends Neurosci. 48, 403–415. doi: 10.1016/j.tins.2025.04.004
Martí, J. M. (2019). Recentrifuge: robust comparative analysis and contamination removal for metagenomics. PLoS Comput. Biol. 15:e1006967. doi: 10.1371/journal.pcbi.1006967
Martí, J. M., Kok, C. R., Thissen, J. B., Mulakken, N. J., Avila-Herrera, A., Jaing, C. J., et al. (2025). Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification. mSystems 10:e01239. doi: 10.1128/msystems.01239-24
Martín, R., Rios-Covian, D., Huillet, E., Auger, S., Khazaal, S., Bermúdez-Humarán, L. G., et al. (2023). Faecalibacterium: a bacterial genus with promising human health applications. FEMS Microbiol. Rev. 47:fuad039. doi: 10.1093/femsre/fuad039
Martin, S. S., Aday, A. W., Almarzooq, Z. I., Anderson, C. A., Arora, P., Avery, C. L., et al. (2024). 2024 heart disease and stroke statistics: a report of US and global data from the American Heart Association. Circulation 149, e347–e913. doi: 10.1161/CIR.0000000000001209
Mirzayi, C., Renson, A., Genomic Standards Consortium, Massive Analysis and Quality Control Society, Zohra, F., et al. (2021). Reporting guidelines for human microbiome research: the STORMS checklist. Nat. Med. 27, 1885–1892. doi: 10.1038/s41591-021-01552-x
NCBI Resource Coordinators (2012). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41, D8–D20. doi: 10.1093/nar/gks1189
Odamaki, T., Kato, K., Sugahara, H., Hashikura, N., Takahashi, S., Xiao, J.-z., et al. (2016). Age-related changes in gut microbiota composition from newborn to centenarian: a cross-sectional study. BMC Microbiol. 16:90. doi: 10.1186/s12866-016-0708-5
Oliver, A., Kay, M., and Lemay, D. G. (2023). TaxaHFE: a machine learning approach to collapse microbiome datasets using taxonomic structure. Bioinform. Adv. 3:vbad165. doi: 10.1093/bioadv/vbad165
O'Toole, P. W., and Jeffery, I. B. (2015). Gut microbiota and aging. Science 350, 1214–1215. doi: 10.1126/science.aac8469
Ottman, N., Geerlings, S. Y., Aalvink, S., de Vos, W. M., and Belzer, C. (2017). Action and function of Akkermansia muciniphila in microbiome ecology, health and disease. Best Pract. Res. Clin. Gastroenterol. 31, 637–642. doi: 10.1016/j.bpg.2017.10.001
Oudah, M., and Henschel, A. (2018). Taxonomy-aware feature engineering for microbiome classification. BMC Bioinform. 19:227. doi: 10.1186/s12859-018-2205-3
Papoutsoglou, G., Tarazona, S., Lopes, M. B., Klammsteiner, T., Ibrahimi, E., Eckenberger, J., et al. (2023). Machine learning approaches in microbiome research: challenges and best practices. Front. Microbiol. 14:1261889. doi: 10.3389/fmicb.2023.1261889
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Pidgeon, R., Mitchell, S., Shamash, M., Suleiman, L., Dridi, L., Maurice, C. F., et al. (2025). Diet-derived urolithin a is produced by a dehydroxylase encoded by human gut enterocloster species. Nat. Commun. 16:999. doi: 10.1038/s41467-025-56266-2
Poulsen, C. S., Ekstrøm, C. T., Aarestrup, F. M., and Pamp, S. J. (2022). Library preparation and sequencing platform introduce bias in metagenomic-based characterizations of microbiomes. Microbiol. Spectr. 10:e00090-22. doi: 10.1128/spectrum.00090-22
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., Solenberger, P., et al. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 27, 85–96.
Santos-Marcos, J. A., Mora-Ortiz, M., Tena-Sempere, M., Lopez-Miranda, J., and Camargo, A. (2023). Interaction between gut microbiota and sex hormones and their relation to sexual dimorphism in metabolic diseases. Biol. Sex Diff. 14:4. doi: 10.1186/s13293-023-00490-2
Seidel, J., and Valenzano, D. R. (2018). The role of the gut microbiome during host ageing. F1000Research 7:1. doi: 10.12688/f1000research.15121.1
Siegel, R. L., Wagle, N. S., Cercek, A., Smith, R. A., and Jemal, A. (2023). Colorectal cancer statistics, 2023. CA: Cancer J. Clinic. 73, 233–254. doi: 10.3322/caac.21772
Simin, J., Fornes, R., Liu, Q., Olsen, R. S., Callens, S., Engstrand, L., et al. (2020). Antibiotic use and risk of colorectal cancer: a systematic review and dose-response meta-analysis. Br. J. Cancer 123, 1825–1832. doi: 10.1038/s41416-020-01082-2
The Alzheimer's Association (2024). 2024 Alzheimer's disease facts and figures. Alzheimers Dement 20, 3708–3821. doi: 10.1002/alz.13809
Tilg, H., and Moschen, A. R. (2014). Microbiota and diabetes: an evolving relationship. Gut 63, 1513–1521. doi: 10.1136/gutjnl-2014-306928
Tiwari, P., Dwivedi, R., Bansal, M., Tripathi, M., and Dada, R. (2023). Role of gut microbiota in neurological disorders and its therapeutic significance. J. Clin. Med. 12:1650. doi: 10.3390/jcm12041650
Tsokkou, S., Konstantinidis, I., Papakonstantinou, M., Chatzikomnitsa, P., Liampou, E., Toutziari, E., et al. (2025). Sex differences in colorectal cancer: epidemiology, risk factors, and clinical outcomes. J. Clin. Med. 14:5539. doi: 10.3390/jcm14155539
Ueda, A., Shinkai, S., Shiroma, H., Taniguchi, Y., Tsuchida, S., Kariya, T., et al. (2021). Identification of Faecalibacterium prausnitzii strains for gut microbiome-based intervention in Alzheimer's-type dementia. Cell Rep. Med. 2:100398. doi: 10.1016/j.xcrm.2021.100398
Vangay, P., Burgin, J., Johnston, A., Beck, K. L., Berrios, D. C., Blumberg, K., et al. (2021). Microbiome metadata standards: report of the National Microbiome Data Collaborative's workshop and follow-on activities. mSystems 6. doi: 10.1128/mSystems.00273-21
Wang, Y., and Lê Cao, K.-A. (2023). PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Briefings Bioinform. 24:bbac622. doi: 10.1093/bib/bbac622
Keywords: human microbiome, host disease prediction, machine learning, host metadata, meta-analysis
Citation: Goncalves AR, Ranganathan H, Valdes C, Zhu H, Zhang B, Kok CR, Martí JM, Mulakken NJ, Thissen JB, Jaing C and Be NA (2026) Beyond microbial abundance: metadata integration enhances disease prediction in human microbiome studies. Front. Microbiol. 16:1695501. doi: 10.3389/fmicb.2025.1695501
Received: 01 September 2025; Accepted: 18 November 2025;
Published: 21 January 2026.
Edited by:
Lucinda Janete Bessa, Egas Moniz Center for Interdisciplinary Research (CiiEM), Portugal
Reviewed by:
Abhishek Kumar, Washington University in St. Louis, United States
Francisco Moya-Flores, Stanford University, United States
Copyright © 2026 Goncalves, Ranganathan, Valdes, Zhu, Zhang, Kok, Martí, Mulakken, Thissen, Jaing and Be. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Andre R. Goncalves, andre@llnl.gov; Nicholas A. Be, be1@llnl.gov