Neonatal gut microbiota stratification and identification of SCFA-associated microbial subgroups using unsupervised clustering and machine learning classification

Kasani, Payam Hosseinzadeh; Yun, Cheol-Heui; Cho, Kee Hyun; Jeong, Su Jin

doi:10.3389/fmicb.2025.1668451

ORIGINAL RESEARCH article

Front. Microbiol., 04 December 2025

Sec. Microorganisms in Vertebrate Digestive Systems

Volume 16 - 2025 | https://doi.org/10.3389/fmicb.2025.1668451

This article is part of the Research TopicNew and advanced mechanistic insights into the influences of the infant gut microbiota on human health and disease, Volume IIView all 19 articles

Neonatal gut microbiota stratification and identification of SCFA-associated microbial subgroups using unsupervised clustering and machine learning classification

Payam Hosseinzadeh Kasani¹

Cheol-Heui Yun^2,3,4

Kee Hyun Cho¹^*

Su Jin Jeong⁵^*

¹Department of Pediatrics, Kangwon National University Hospital, Kangwon National University School of Medicine, Chuncheon, Republic of Korea
²Department of Agricultural Biotechnology, and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
³Center for Food and Bioconvergence, and Interdisciplinary Programs in Agricultural Genomics, Seoul National University, Seoul, Republic of Korea
⁴Institutes of Green Bio Science and Technology, Seoul National University, Pyeongchang, Republic of Korea
⁵Department of Pediatrics, CHA Bundang Medical Center, CHA University School of Medicine, Seongnam, Republic of Korea

Background: The neonatal gut microbiome plays a critical role in infant health through the production of short-chain fatty acids (SCFAs). However, the organization of SCFAs-producing microbial communities in neonates remains poorly characterized. This study applied unsupervised clustering and machine learning to classify microbial subgroups associated with SCFAs production, providing insight into their composition and metabolic potential.

Methods: This study recruited 71 mother-infant pairs from Kangwon National University Hospital and Bundang CHA Hospital, collecting meconium samples within five days postpartum. Microbial diversity was analyzed by 16S rRNA gene sequencing (V3–V4 region) at the genus level, and SCFAs were quantified from the same samples. To identify functionally distinct microbial subgroups, K-Means, Agglomerative, Spectral, and Gaussian Mixture Model clustering were applied. Clustering validity was assessed using Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, and Prediction Strength Validation, with t-distributed Stochastic Neighbor Embedding (t-SNE) visualization to evaluate cluster separation. SCFAs distributions across clusters were compared, while random forest and logistic regression models were used to classify SCFAs-associated microbial clusters through Receiver Operating Characteristic curves (ROC).

Results: The clustering analysis identified distinct microbial subgroups linked to SCFAs production, with Agglomerative clustering outperforming K-Means in capturing functionally relevant structures. Cluster 1 had higher SCFAs levels, enriched in Bacteroides, Prevotella, and Enterococcus, while Cluster 2 exhibited lower SCFAs concentrations with a more heterogeneous composition. The introduction of a third cluster in multi-class analysis revealed an intermediate metabolic profile, suggesting a continuum in microbial metabolic function. Classification analysis confirmed random forest model superiority, achieving ROC score of 91.05% (Agglomerative) and 87.74% (K-Means) in binary classification, and 92.98% (Agglomerative) and 89.84% (K-Means) in multi-class classification, demonstrating RF’s strong predictive ability for SCFAs-based clusters.

Conclusion: Unsupervised clustering combined with classification analysis effectively predict SCFAs-associated subgroups and paving the way for future research on longitudinal tracking and functional genomic integration in early-life metabolic health.

Introduction

The gut microbiome plays a central role in human health, metabolism, and immune regulation, with its composition influenced by environmental factors such as diet and lifestyle (Gacesa et al., 2022; Ahn and Hayes, 2021; Maher et al., 2023). Research indicates that infant gut colonization begins in utero, as bacteria have been detected in the placenta (Guzzardi et al., 2019; Collado et al., 2016), umbilical cord (Jiménez et al., 2005), and meconium of both vaginal and cesarean section-delivered newborns (Dominguez-Bello et al., 2010; Mshvildadze et al., 2010).

A key function of the gut microbiota in human nutrition is the breakdown of complex polysaccharides into simpler sugars, which are then fermented to produce microbial metabolites. Among various microbial metabolites, short-chain fatty acids (SCFAs), particularly butyrate, acetate, and propionate play a crucial role in modulating host metabolism by interacting with G-protein-coupled receptors, which regulate energy balance and immune responses (Koh et al., 2016). They contribute to glucose homeostasis and appetite regulation, directly impacting the risk of obesity and metabolic syndrome (Byrne et al., 2015; Portincasa et al., 2022; Luo et al., 2022). Additionally, SCFAs serve as energy sources for colonocytes, enhancing intestinal epithelial function, strengthening tight junction proteins, and reducing intestinal permeability, which is particularly important in preventing conditions like necrotizing enterocolitis (Gao et al., 2021; den Besten et al., 2013). Their anti-inflammatory properties further support gut health, making them essential in dietary interventions for inflammatory bowel disease (Parada Venegas et al., 2019; Liu et al., 2021). Emerging research also suggests that SCFAs regulate epigenetic modifications through histone deacetylase inhibition, linking gut microbiota activity to gene expression changes associated with metabolic diseases (Kopczyńska and Kowalczyk, 2024; Nshanian et al., 2025; Guo et al., 2022).

The significance of SCFAs extends to early life, where play a crucial role in infant health by shaping gut microbiota, supporting immune development, and influencing metabolic pathways, as they help maintain gut homeostasis by fueling colonic cells, strengthening the gut barrier, and regulating inflammation (Bridgman et al., 2017; Heath et al., 2020; Barman et al., 2024; Milani et al., 2017). Additionally, emerging research suggests SCFAs contribute to neuroprotection and cognitive development (Hernández-Martínez et al., 2022; Łoniewska et al., 2023).

Accumulating evidence highlights the potential health benefit of SCFAs in pediatric populations and disruptions in gut microbiota composition and SCFAs production are increasingly associated with a range of pediatric health issues, including obesity, allergic disorders, inflammatory bowel disease, and neurodevelopmental disorders (Hernández-Martínez et al., 2022; Abo Kouadio Jerome et al., 2023; Cifuentes et al., 2024; Chun and Toldi, 2022; Hsu et al., 2024; Gio-Batta et al., 2022).

This is especially important because they primarily rely on breast milk components such as human milk oligosaccharides, lactose, and lipids, with SCFAs providing an additional energy source, particularly in preterm or low-birth-weight infants (Grier et al., 2017; Dai et al., 2020; Mueller et al., 2021). The detection of SCFAs within the first a weeks of life indicates that microbial colonization occurs rapidly, with bacteria ferment available substrates to produce metabolic byproducts (Milani et al., 2017; Mueller et al., 2021; Wang et al., 2021). This aligns with studies showing that Bifidobacterium, Bacteroides, and Lactobacillus establish early metabolic activity, reinforcing their role functionally active organisms rather than transient colonizers (Tamburini et al., 2016).

Given the high interindividual variability in gut microbial composition, classifying microbial communities based on functional metabolic outputs rather than taxonomy alone is crucial for understanding early-life microbiome development (Luo et al., 2025; Favari et al., 2024). Neonatal gut communities are highly dynamic and transitional, making it challenging to identify functionally relevant microbial subgroups using traditional taxonomic methods alone. Clustering analysis offers a powerful unsupervised machine learning approach to uncover patterns in microbial composition that may correlate metabolic activities (Reitmeier et al., 2020; Cai et al., 2017; Cai and Sun, 2011). Traditional microbiome studies often focus on relative bacterial abundances, overlooking functional redundancy and microbial interactions, which are essential for understanding gut microbial metabolism and host-microbiome interactions (Turnbaugh et al., 2009; Wu et al., 2020). By applying clustering algorithms to microbial dataset, researchers can identify novel microbiome subtypes linked to metabolic outputs, offering deeper insights into microbial functionality beyond taxonomic classification. Despite advances in microbiome-based clustering, several key knowledge gaps remain:

a. SCFAs are widely studied in adults, but their microbiome relationships in neonates remain unclear, particularly in the early postnatal period when microbial colonization is evolving rapidly.

b. The extent to which microbial clustering reflects functional metabolic differences is unknown, as the neonatal gut microbiome is still developing, making it difficult to distinguish stable metabolic signatures from transient colonization patterns (Bäckhed et al., 2015).

c. While clustering methods have been applied to adult gut microbiome studies, their use in neonatal microbiomes remains scarce, limiting our understanding of whether microbial structures in neonates align with functional SCFAs production.

Addressing these gaps could provide new insights into early-life microbiome functionality, and thus we hypothesize that neonatal gut microbiota can be classified into distinct functional subgroups with varying SCFAs production capacities, each exhibiting specific microbial signatures predictive of metabolic function.

This study aims to classify neonatal gut microbiota into functionally relevant subgroups to determine whether distinct microbial clusters exhibit differential SCFAs profiles. Using unsupervised clustering based solely on genus-level microbial composition, without including any maternal or infant metadata, we sought to identify intrinsic microbiome structures independent of external clinical factors. With multiple clustering techniques including K-Means, Agglomerative Clustering, Spectral Clustering, and Gaussian Mixture Models (GMM), we classify microbial communities based on genus-level taxonomic composition and assess their correlation with SCFAs variations. To determine the most effective clustering approach, Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, and Prediction Strength Validation will be used, alongside t-distributed Stochastic Neighbor Embedding (t-SNE) visualization for stability assessment. We identified key microbial genera contributing to high or low SCFAs production and examined whether microbial communities followed a binary or multi-clustered metabolic structure. To evaluate the functional relevance of SCFAs clusters, we applied supervised machine learning models, including Logistic Regression and Random Forest, to determine whether different level of SCFAs could be accurately predicted. By integrating unsupervised clustering with SCFAs profiling, this study identifies functionally distinct microbial subgroups that may influence neonatal metabolism, immunity, and gut health. These findings could enhance understanding of microbial colonization, aid in early diagnosis of microbiome-related disorders, and support targeted interventions to optimize SCFAs production and improve long-term health outcomes.

Materials and methods

Study design and participants

This study included a subset of 71 healthy mother-infant pairs recruited from Kangwon National University Hospital and Bundang CHA Hospital to study the establishment of gut microbiota during infancy. Seventy-one participants, recruited between December 06, 2021 and January 11, 2022, were asked to collect fecal samples from their infants. The study was approved by the Institutional Review Board of Kangwon National University Hospital (IRB no. KNUH-B-2021-12-004) for medical research. The approvals, as well as informed consent from the mothers, were obtained prior to collection of data and samples. Only healthy, full-term newborns (gestational age from 37 weeks 0 days to 41 weeks 6 days) with a birth weight of 2,500 grams or more, admitted to the newborn nursery, were included in the study. Infants admitted to the neonatal intensive care unit (NICU) were excluded, with the exception of those admitted for hyperbilirubinemia after 48 h of life, provided that phototherapy was the only treatment required. Mothers with a history of prolonged rupture of membranes (PROM) lasting 18 h or more, preterm labor, or antibiotics use during pregnancy were excluded to control for variables that could influence maternal and neonatal microbiome composition and health outcomes. Among the 71 enrolled infants, 43 were female and 28 were male. The majority were delivered by Cesarean section (n = 68), while only three infants were born via vaginal delivery. Twin births accounted for five cases of the cohort. Maternal conditions assessed included gestational diabetes mellitus (GDM; n = 7) and pregnancy-induced hypertension (PIH; n = 3). Baseline characteristics of the enrolled infants and their mothers are summarized in Table 1.

Table 1

Table 1. Baseline characteristics of enrolled infants and mothers.

Sample collection and storage

For fecal sample collection, meconium samples were collected from 71 infants within 5 days after birth using a fecal sampling kit of spoon type (Noble Bio®). The samples were immediately stored at −20 °C by hospital staff and then transported to the laboratory, where they were stored at −80 °C until analysis.

DNA extraction and metabarcoding

DNA was extracted using TIANLONG®-nucleic acid extraction kit (for stool DNA/RNA extraction) then determined concentration and purity using DNA/protein Analyzer (Pultton®). The target region (V3-V4) was amplified using PCRBIO VeriFi Mix (PCR Biosystems®) and 16S-amplicon primer (Macrogen®) at 95 °C for 3 min hot start followed by 25 cycles of 95 °C for 30s and 55 °C for 30s, 72 °C for 30s with a final elongation step of 72 °C for 5 min. The amplified DNA subsequently were purified automatically in nucleic acid extractor (TINLONG® Libex) using MagListo™ PCR/Gel Purification Kit (Bioneer®). Afterwards, Index PCR was performed to attach in dual indices and Illumina sequencing using PCRBIO VeriFi Mix (PCR Biosystems®) and Nextera® Index kit V2 Set A and Set B (Illumina®) at 95 °C for 3 min hotstart followed by 8 cycles of 95 °C for 30s and 55 °C for 30s, 72 °C for 30s with a final elongation step of 72 °C for 5 min. After indexing, the libraries were cleaned up using the same method and the size and concentration were determined using Qsep100 (Bioptic®) and Qbit flex fluorometer (Invitrogen®). After Qbit and Qsep measurements were completed, the mixing volume value was calculated then the PCR product was pooled into a microtube. The pooled libraries were denatured with NaOH 0.2 N, and then diluted with hybridization buffer before sequencing. Finally, DNA were pooled and sequenced using MiSeq® Reagent Kit V3 600 cycles kit (Illumina®) on an Illumina Miseq platform according to the manufacturer’s standard instruction. PhiX was used as an internal control. The V3-V4 hypervariable regions of the bacterial 16S rRNA gene were amplified using the following primers: forward primer 341F (5’-CCTACGGGNGGCWGCAG-3′) and reverse primer 805R (5’-GACTACHVGGGTATCTAATCC-3′) (PMID: 22933715). Sequencing data were analyzed using the DADA2 pipeline (version 1.16) (PMID: 27214047). Briefly, read trimming and filtering were performed using the filterAndTrim function in DADA2. Specifically, reads were truncated at any site with a quality score below 2 and reads with more than two expected errors were discarded. Additionally, reads were trimmed to remove primers and low-quality tails, with forward reads trimmed to 240 bp and reverse reads to 160 bp, based on the quality profiles of our sequencing runs. After trimming, the forward and reverse reads retained an overlap of approximately 25 bp, which was sufficient for accurate merging using the mergePairs() function in DADA2. The merged reads reconstructed full-length amplicons of approximately 430–440 bp, consistent with the expected length of the V3–V4 region. Amplicon Sequence Variants (ASVs) were inferred directly from these merged reads following DADA2’s standard denoising pipeline, and no artificial concatenation of non-overlapping ends was performed. Reads shorter than 150 base pairs were discarded. DADA2’s error model was then used to learn the error rates from the data, which were subsequently used to denoise the sequences. Forward and reverse reads were merged to create full-length sequences of the V3-V4 region. Chimeric sequences were identified and removed using the remove BimeraDenovo function in DADA2. An Amplicon Sequence Variant (ASV) table including 192,815 sequences was created, representing the abundance of each unique sequence in each sample. ASVs were assigned taxonomy using the SILVA database (release 138) (PMID: 23193283) with the assign Taxonomy function in DADA2. Taxonomic assignment results showed high resolution across different taxonomic levels. At the Kingdom level, 99.4% of ASVs were successfully assigned, all of which belonged to the Bacteria kingdom. Additionally, over 99% of ASVs were assigned at the Phylum, Class, Order, and Family levels.

Short-chain fatty acids analysis

Fecal samples were collected directly from infant diapers using sterile swabs, transferred to cryogenic tubes, and immediately stored at −20 °C until transport to the laboratory. For SCFA quantification, approximately 50 mg of fecal material was suspended in 800 μL of deionized water and acidified with 10 μL of 5 M HCl. The mixture was vortexed, followed by extraction with 400 μL of diethyl ether under cold conditions for 5 min. After centrifugation at 14000 rpm for 1 min, 200 μL of the ether layer was transferred to a clean tube, mixed with 20 μL of N, O-Bis(trimethylsilyl)trifluoroacetamide (BSTFA), and derivatized at 70 °C for 20 min, then incubated at 37 °C for 2 h. The resulting derivatives were analyzed by gas chromatography–mass spectrometry (GC–MS).

SCFAs were quantified using a Shimadzu GC-2010 Plus system coupled to a GCMS-TQ8030 detector (Tokyo, Japan) equipped with a DB-5 ms capillary column (30 m × 0.25 mm i.d., 0.25 μm film; Agilent J&W Scientific, Folsom, CA, USA). One microliter of the derivatized sample was injected in split mode (50:1) at 200 °C, with an oven temperature program as follows: hold at 40 °C for 2 min, ramp to 70 °C at 10 °C/min, to 85 °C at 4 °C/min, to 110 °C at 6 °C/min, and finally to 290 °C at 9 °C/min, maintained for 6 min. Helium served as the carrier gas at a constant flow rate of 0.89 mL/min. The ion source and interface temperatures were set at 200 °C and 250 °C, respectively. Data were collected in scan mode with an electron impact voltage of 0.1 kV and an event time of 0.03 s, monitoring the following m/z ions: 117 (acetic acid), 131 (propionic acid), and 145 (butyric acid). Quantification was performed using calibration curves generated from analytical standards of acetic acid (JUNSEI Chemical Co., Tokyo, Japan), propionic acid (Sigma-Aldrich, St. Louis, MO, USA), and butyric acid (Sigma-Aldrich). This analytical procedure followed the validated protocol previously described in our earlier study (Kwon et al., 2024) and was applied here with identical parameters for methodological consistency.

Diversity and abundance analysis

Alpha diversity was assessed at the genus level using Chao1 richness and Shannon diversity, which, respectively, estimate species richness and capture both richness and evenness of microbial communities. Prior to diversity analysis, the ASV table was processed in R (v4.0) using the phyloseq (v1.42) (McMurdie and Holmes, 2013) and microbiome (v1.18) (Lahti and Shetty, 2017) packages. Count data were rounded to the nearest integer, and zero-count taxa were removed to avoid distortions caused by rare or uninformative features and no rarefaction was applied. Both Chao1 and Shannon indices were calculated using the estimate_richness() function in phyloseq, and their distributions were compared among clusters using Kruskal–Wallis and Wilcoxon rank-sum tests.

For beta diversity, community dissimilarity among samples was quantified using Euclidean distances computed on Centered Log-Ratio (CLR)–transformed genus-level data, following compositional data analysis principles. The CLR transformation was implemented in R/Bioconductor package (Sharma et al., 2020) to account for the compositional nature of sequencing data. Distance matrices were generated with the vegan R package (v2.5–7) (Oksanen et al., 2001) and used to evaluate overall differences in microbial community composition between clusters. Ordination was performed using Principal Coordinates Analysis (PCoA) implemented in the ordinate() function of phyloseq (McMurdie and Holmes, 2013), and results were visualized with ggplot2 (v3.4) (Wickham, 2016). Ellipses representing the 95% confidence interval for each cluster were added to assess separation in community structure. Statistical differences in microbial composition were tested using PERMANOVA (Permutational Multivariate Analysis of Variance) with the adonis2() function in vegan (999 permutations). The resulting p-values and R² values indicated the significance and proportion of variation explained by the cluster groupings. All plots were generated using consistent color palettes and shape encodings to facilitate visual comparison across clustering methods.

For microbial composition analysis, genus-level abundance tables were quality-filtered, and missing values were replaced with zeros. To normalize for differences in sequencing depth, raw counts were converted into relative abundances (%) by dividing each genus count by the total number of reads per sample and multiplying by 100. The normalized abundance data were reshaped into long format in R using the tidyverse package (Wickham et al., 2019), and stacked bar plots were created using ggplot2 to visualize the distribution of dominant genera across K-Means and Agglomerative clusters. Each bar represents an individual sample, while colors correspond to bacterial genera grouped by phylum. This approach ensured direct comparability of taxonomic composition among clusters while maintaining the proportional structure of microbial communities.

Machine learning analysis

To investigate the relationship between genus-level microbial abundance clustering and SCFAs concentrations, a combination of supervised and unsupervised machine learning models was applied. Unsupervised clustering was used to identify naturally occurring patterns in microbial composition, grouping infants based on their gut microbiota profiles. This allowed for the exploration of potential associations between microbial clusters and SCFAs levels. Following this, supervised machine learning models were employed to assess the predictive power of microbial composition in distinguishing SCFAs-driven clusters, enabling a deeper understanding of the microbial signatures linked to SCFAs variation. By integrating both approaches, this analysis aimed to uncover biologically meaningful patterns in microbial community structure and their potential influence on metabolic outputs.

Unsupervised machine learning

Unsupervised clustering analysis was performed exclusively on infant gut microbial relative abundances at the genus level to identify natural groupings within the dataset. No infant or maternal metadata (e.g., delivery mode, feeding type, or gestational age) were included in the clustering process to ensure that the resulting groups reflected intrinsic microbial community structure rather than external influences.

Four clustering algorithms were applied including: K-Means, Agglomerative Hierarchical Clustering, Spectral Clustering, and Gaussian Mixture Model (GMM), each offering distinct advantages in detecting underlying patterns in microbial composition. To determine the optimal number of clusters, multiple evaluation metrics were utilized, including the Silhouette Score, which measures cluster cohesion and separation; the Calinski-Harabasz Index, which evaluates the compactness and separation of clusters; the Davies-Bouldin Index, which quantifies cluster dispersion; and Prediction Strength Validation, which assesses the stability of the clustering results. These complementary metrics ensured the selection of biologically meaningful and well-separated clusters for further downstream analysis. Both binary (two-cluster) and multi-class (three-cluster) configurations were examined based on performance consistency and interpretability. To visualize clustering patterns, the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm was applied to reduce high-dimensional microbial data into two dimensions, providing an intuitive overview of cluster separation.

After determining the optimal clustering structure based on genus-level microbial composition, a cluster label was assigned to each infant sample. This cluster information was then merged with the corresponding SCFA concentration data and maternal–infant metadata by matching sample identifiers. This post-clustering integration allowed us to compare metabolic (SCFA) and clinical variables across microbiome-derived clusters, while ensuring that only microbial features influenced the initial clustering process. In other words, SCFA and clinical data were used solely for downstream interpretation, not for model training or cluster formation, thereby avoiding bias and preserving the unsupervised nature of the analysis.

Supervised machine learning

To evaluate the discriminative potential of SCFAs clusters, supervised machine learning models were employed to classify samples based on their microbial composition and SCFAs profiles. For this purpose, two well-established classifiers were utilized, each offering distinct advantages in handling classification tasks as summarized blow.

Logistic regression (LR) is a statistical method adapted for ML that models the probability of an outcome based on input variables, primarily used for classification by applying a sigmoid function to a linear combination of features (Cox, 1958). While LR excels in distinguishing linearly separable categories, it can be extended to multiclass classification. Random forest (RF), a widely used supervised ML algorithm, employs ensemble learning by constructing multiple decision trees to improve classification and regression accuracy (Tin Ho, 1995). Each tree produces a class prediction, and the final outcome is determined by majority voting, making RF highly robust and versatile for complex predictive tasks.

To optimize model performance, Grid Search with Cross-Validation (GridSearchCV) was employed to fine-tune hyperparameters and identify the best-performing configurations (Arlot and Celisse, 2010). The dataset was divided into stratified folds, where one-fold served as the test set while the remaining folds were used for training. A 5-fold cross-validation approach was applied to ensure all data points were evaluated across multiple iterations, reducing bias and improving generalizability. The optimal hyperparameters were selected based on the highest Area Under the Curve (AUC) score in the validation set. Given the potential issue of class imbalance, the class_weight parameter in scikit-learn was applied to adjust for disparities in sample distribution, assigning higher weights to smaller sample classes and lower weights to larger sample classes to prevent bias in predictions (Mueller, n.d.).

Model performance was evaluated using the Receiver Operating Characteristic (ROC) curve, a widely used graphical tool for assessing the performance of binary and multi-class classification models across varying decision thresholds (Tan, 2009). The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity), providing insight into the trade-off between sensitivity and specificity at different classification thresholds. It is particularly useful for comparing classifiers, as it remains independent of class distribution and enables performance assessment across multiple threshold values.

Statistical analyses

Sequencing data analysis was conducted using R version 4.0, utilizing the phyloseq and ggplot2 packages (R Core Team, 2020; McMurdie and Holmes, 2013; Wickham, 2016). The top abundances of the microbiome were visualized using the aggregate top taxa and plotting functions provided by the microbiome package (Lahti and Shetty, 2017). The programming tasks in this study were executed using Python programming language (version 3.9) (Van Rossum and Drake, 2009). For data preprocessing and analysis, we utilized Pandas (McKinney, 2010) and NumPy (Harris et al., 2020), two popular Python libraries for data manipulation and analysis. Additionally, Scikit-learn (Pedregosa et al., 2011), a Python module that incorporates various ML algorithms, was employed for the analysis. We performed all analyses on 24-core Intel(R) Xeon(R) Gold 5,118 CPU @ 2.30GHz, RAM 128 GB (Intel Corporation, Santa Clara, CA, United States) running Windows 10 Pro. The data were expressed as means and standard deviations (continuous variables) or as numbers and percentages (categorical variables).

Results

Figure 1A presents the silhouette scores for clustering at the genus level using four clustering models. The Agglomerative and K-Means models achieved the highest silhouette scores across different cluster numbers, particularly for K = 3, where the Agglomerative model outperformed the others. The GMM model demonstrated moderate performance, while the Spectral clustering method consistently exhibited the lowest silhouette scores, suggesting weaker cluster separation.

Figure 1

Graphical data analyses of clustering at the genus level. Panel (A) shows silhouette scores, with KMeans peaking at a cluster count of 3. Panel (B) presents Calinski-Harabasz index, increasing with clusters, showing KMeans effectiveness. Panel (C) depicts Davies-Bouldin index, favoring lower scores for KMeans at higher clusters. Panel (D) illustrates prediction strength, showing significant drops beyond three clusters. Panels (E) to (H) display t-SNE visualizations: (E, G) for agglomerative clusters and (F, H) for KMeans clusters, highlighting varying numbers and distributions of clusters, with red asterisks denoting cluster centers.

Figure 1. Clustering performance evaluation at the genus level using four clustering models: K-Means, Agglomerative, Spectral, and Gaussian Mixture Model. (A) Silhouette scores, where higher values indicate better-defined clusters; (B) Calinski–Harabasz index, with higher scores reflecting more compact and well-separated clusters;(C) Davies–Bouldin index, where lower values indicate superior clustering performance; (D) Prediction strength analysis for K-Means and Agglomerative models. Performance of t-SNE implementation and cluster visualization: (E) Agglomerative binary clusters; (F) K-Means binary clusters; (G) Agglomerative multi-clusters; (H) K-Means multi-clusters. The convex hulls encapsulate each cluster, providing a visual representation of their boundaries, and the centroids, marked in red, highlight the central tendency of each cluster.

To further assess clustering quality, the Calinski-Harabasz index was computed, as shown in Figure 1B. Similar to the silhouette scores, the Agglomerative and K-Means models demonstrated superior performance, showing a consistent increase in the Calinski-Harabasz index as the number of clusters increased. Notably, the Spectral clustering method yielded substantially lower scores across all K values, reinforcing its lower clustering efficiency at the genus level. The GMM model displayed intermediate performance, ranking below Agglomerative and K-Means but above Spectral clustering.

The Davies-Bouldin index, an internal validation metric where lower scores indicate better clustering, is presented in Figure 1C. The Agglomerative and K-Means models again achieved the lowest Davies-Bouldin scores across most cluster numbers, with K-Means performing slightly better at K = 3. The GMM model exhibited relatively higher scores but still performed better than the Spectral method, which consistently yielded the highest Davies-Bouldin values, reflecting poor cluster compactness and separation.

Since Agglomerative and K-Means models performed better in previous three tests, prediction strength was evaluated for K-Means and Agglomerative clustering, as shown in Figure 1D, to validate cluster stability. Both methods exhibited strong prediction values at K = 2 where K-Means achieved higher prediction strength score, but a sharp decline was observed for higher K values, particularly beyond K = 3. The Agglomerative method maintained slightly higher prediction strength at K = 3 before decreasing, suggesting that three clusters may be a suitable choice for genus-level classification. This aligns with the results from silhouette, Calinski-Harabasz, and Davies-Bouldin scores, reinforcing the robustness of the clustering performance at K = 3.

Based on the clustering performance evaluation (Figure 1), Agglomerative and K-Means clustering were identified as the most effective models for grouping the dataset. Given the limited sample size (71 samples), we focused on two-clusters and three-clusters solutions to achieve both broad and fine-grained classifications.

Figures 1E,F illustrates the t-SNE visualization of Agglomerative and K-Means models with two clusters, where the data points are clearly separated into two distinct groups. This clustering approach successfully captured the structure within the dataset, with one group forming a compact cluster at the upper region, while the other cluster was more widely distributed. The clear separation of clusters suggests that Agglomerative model effectively identifies major subgroup differences at this level, aligning with the clustering validation metrics observed earlier. Figures 1G,H present the three-cluster solutions for Agglomerative and K-Means clustering, respectively. The introduction of a third cluster resulted in further subdivision within the data structure. In both clustering methods, the newly identified third cluster is positioned in the lower region of the t-SNE space, suggesting that this group exhibits distinct characteristics from the other two. While Agglomerative clustering (Figure 2C) maintained a hierarchical organization, K-Means clustering (Figure 2D) introduced a more compact third cluster, highlighting slight methodological differences in cluster assignment. These visualizations confirm that the three-cluster approach enhances granularity in data interpretation, further refining subgroup differentiation within the dataset.

Figure 2

Graphs A and B compare acetate, butyrate, and propionate levels using K-Means and Agglomerative clustering for two clusters, displaying p-values for significance. Graphs C and D show the same comparisons for three clusters. Box plots with individual data points are shown, with different clusters in different colors.

Figure 2. The distribution of SCFAs concentrations (acetate, butyrate, and propionate) across clusters identified by K-Means and Agglomerative clustering methods. (A): the SCFAs levels in binary clustering using K-Means, (B) SCFAs levels in binary clustering using Agglomerative, (C) SCFAs levels in multi-clustering using K-Means, (D) SCFAs levels in multi-clustering using Agglomerative. Kruskal-Wallis tests were used to assess overall differences across clusters, with pairwise comparisons for significant contrasts. Boxplots display median, interquartile range, and data distribution, highlighting potential differences in SCFAs production among microbial subgroups. Statistically significant differences are marked with asterisks (*p < 0.05, p < 0.01), while non-significant comparisons are labeled as ns (not significant).

To determine whether the identified microbial clusters exhibited distinct SCFAs profiles, SCFAs concentrations were compared across binary and multi-cluster solutions as presented in Table 2. This analysis aimed to assess whether microbial community structure correlates with SCFAs levels, providing insight into potential microbial metabolic functions. In both K-Means and Agglomerative binary clustering, cluster 1 exhibited significantly higher levels of SCFAs compared to cluster 2, which showed markedly lower concentrations. Specifically, in K-Means model, cluster 1 had the highest acetate (1190.57), butyrate (41.08), and propionate (200.86) levels, while cluster 2 showed much lower values (acetate: 280.57, butyrate: 9.36, propionate: 8.90). A similar pattern was observed in Agglomerative model, where cluster 1 demonstrated elevated acetate (1147.60), butyrate (39.17), and propionate (189.39) levels, whereas cluster 2 exhibited substantially lower values (acetate: 253.64, butyrate: 9.48, propionate: 9.34). These results suggest that cluster 1 represents a group with high SCFAs concentrations, whereas cluster 2 corresponds to a low SCFAs profile.

Table 2

Table 2. Comparison of SCFAs levels across clusters identified by K-Means and Agglomerative models.

The SCFAs concentrations (acetate, butyrate, and propionate) are presented for binary and multi-clustering solutions using K-Means and Agglomerative clustering methods. Clusters were derived based on genus-level microbial composition, and SCFAs levels are reported as mean values within each cluster. Binary clustering represents two distinct microbial groups, whereas multi-clustering further divides the dataset into three clusters.

In K-Means and Agglomerative multi-clustering, subtle variations in SCFAs levels are more apparent. In both methods, cluster 1 consistently exhibited the highest SCFAs levels, with K-Means values of acetate (1203.75), butyrate (45.50), and propionate (228.65), and Agglomerative values of acetate (1154.17), butyrate (43.03), and propionate (213.80). This indicates that cluster 1 represents individuals with elevated SCFAs concentrations. Meanwhile, cluster 2 displayed the lowest SCFAs levels, with acetate, butyrate, and propionate concentrations significantly reduced in both K-Means (280.57, 9.36, 8.90) and Agglomerative clustering (253.64, 9.48, 9.34), suggesting that Cluster 2 represents individuals with a low SCFAs profile. Cluster 3 demonstrated moderate SCFAs levels, positioned between the high and low clusters. In both clustering methods, acetate (K-Means: 1098.30, Agglomerative: 1098.30), butyrate (10.19), and propionate (6.32) were slightly higher than the lowest cluster but did not reach the elevated levels of cluster 1. This suggests that cluster 3 may represent an intermediate SCFAs profile, potentially reflecting a transitional or mixed metabolic state. These findings indicate a clear stratification of SCFAs levels across the three clusters, with cluster 1 being the highest, cluster 2 the lowest, and cluster 3 exhibiting moderate SCFAs concentrations. The corresponding maternal and neonatal characteristics associated with each clustering method are summarized in Supplementary Table 1 (binary clustering) and Supplementary Table 2 (multi-class clustering).

Figures 2A,B, present the SCFAs levels across two clusters identified using K-Means clustering. The propionate levels differ significantly between the two clusters [p = 0.010 (*)], while acetate (p = 0.06, ns) and butyrate (p = 0.17, ns) do not show statistically significant differences.

While in Agglomerative model, the statistical results reveal a significant difference in acetate levels between the two clusters [p = 0.041 (*)], whereas propionate (p = 0.05, ns) and butyrate (p = 0.22, ns) do not show strong evidence of clustering effects. Figure 2C present the boxplot distributions of acetate, butyrate, and propionate levels across three clusters identified by K-Means clustering. For acetate, while the overall Kruskal-Wallis test was not significant (p = 0.09, ns), pairwise comparisons suggest a notable difference between cluster 1 and cluster 3 [p = 0.017 (*)]. Butyrate levels did not show significant variation across clusters (p = 0.22, ns), indicating a more uniform distribution across groups. The Kruskal-Wallis test for propionate [p = 0.008 (**)] indicates a statistically significant difference across clusters, with cluster 1 having significantly difference propionate levels compared to cluster 2 [p = 0.004 (**)].

For Agglomerative multi clustering analysis presented in Figure 2D, the overall Kruskal-Wallis test was not statistically significant (p = 0.07, ns), for acetate, while pairwise comparisons suggest a notable difference between cluster 2 and cluster 3 [p = 0.016 (*)]. Butyrate levels remained consistent across all clusters (p = 0.29, ns), showing no evidence of differentiation. Propionate [p = 0.045 (*)] suggests an overall difference among clusters, with cluster 1 exhibiting significantly higher propionate levels compared to cluster 2 [p = 0.029 (*)].

To further investigate whether specific microbial compositions were associated with variations in SCFAs levels, we analyzed genus-level microbial compositions within each cluster. Figures 3A,B present the genus-level relative abundance distributions obtained using K-Means and Agglomerative clustering under a binary classification scheme. Both clustering methods successfully grouped the microbiome samples into two distinct clusters with noticeable differences in their microbial compositions. The dominant genera in each cluster, suggest that one cluster is enriched in genera such as Enterococcus, Bacteroides, and Prevotella, while the other cluster shows a more mixed distribution of various genera. The K-Means model (Figure 3A) appears to create a more distinct separation between clusters, with one cluster showing a more uniform genus composition. In contrast, Agglomerative model (Figure 3B) shows a slightly more heterogeneous composition within each cluster, possibly due to its hierarchical nature.

Figure 3

Bar charts showing genus abundance across different clustering methods: K-Means (Binary and Multi) and Agglomerative (Binary and Multi). Charts A and C show two clusters, while B and D display three. A legend identifies various genera with color coding.

Figure 3. Genus-level relative abundance distributions across clusters identified using K-Means and Agglomerative clustering methods. (A) Binary clustering using K-Means, (B) Binary clustering using Agglomerative, (C) Multi-clustering (three clusters) using K-Means, and (D) Multi-clustering using Agglomerative clustering. Each bar represents the relative abundance of different bacterial genera within individual samples, with colors corresponding to the genera listed in the legend.

Figures 3C,D display the genus-level relative abundance distributions when clustering was performed with three groups using K-Means and Agglomerative clustering, respectively. The introduction of a third cluster enables finer differentiation of microbial compositions, revealing substructures that were not captured in binary clustering. In K-Means clustering (Figure 3C), cluster 1 is characterized by a high relative abundance of Enterococcus and Bacteroides, while cluster 2 is dominated by Prevotella and Dialister. The cluster 3, in contrast, exhibits a more diverse microbiome profile, with Bifidobacterium, Streptococcus, and Roseburia being more prevalent.

Agglomerative clustering (Figure 3D) similarly identifies three clusters, though the transitions between them appear more gradual. The cluster 1 remains enriched in Enterococcus and Bacteroides, while cluster 2 contains a high proportion of Prevotella, Alloprevotella, and Escherichia-Shigella. The cluster 3, akin to its counterpart in K-Means, shows a more even distribution of multiple genera, with notable contributions from Streptococcus, Lactobacillus, and Oscillibacter.

To evaluate the microbial diversity, we compared Chao1 richness and Shannon diversity across clusters generated by K-Means and Agglomerative clustering models (Figures 4A,B). In the two-cluster models, Shannon diversity was significantly higher in Cluster 2 compared to Cluster 1 (p = 2.31 × 10⁻³ for K-Means; p = 1.98 × 10⁻³ for Agglomerative), indicating greater evenness and diversity within this group. In contrast, Chao1 richness did not differ significantly between clusters (p = 0.08 for K-Means; p = 0.48 for Agglomerative), suggesting comparable species richness between the two microbial community structures. In the three-cluster models, similar trends were observed. While Chao1 richness showed modest variation across clusters (p = 1.17 × 10⁻² for K-Means), pairwise comparisons revealed a borderline difference between Clusters 1 and 3 (p = 0.05). Shannon diversity, however, exhibited marked differences among clusters (p = 7.46 × 10⁻⁶ for K-Means; p = 8.39 × 10⁻⁶ for Agglomerative), with both methods consistently showing that Cluster 1 had lower microbial diversity than Clusters 2 and 3.

Figure 4

(A) Box plots show Chao1 richness and Shannon diversity for two clusters using KMeans and Agglomerative methods, with p-values indicating significance levels. (B) Similar box plots for three clusters, with Kruskal-Wallis p-values. (C) Scatter plots display KMeans and Agglomerative beta diversity for two clusters; PERMANOVA results are shown. (D) Scatter plots for three clusters with PERMANOVA results, illustrating distribution and separation of points in clusters.

Figure 4. Comparison of neonatal gut microbial diversity and community structure across K-Means and Agglomerative clustering models. (A,B) Alpha diversity analysis of neonatal gut microbiota based on (A) two-cluster and (B) three-cluster models generated by K-Means and Agglomerative clustering. Boxplots display Chao1 richness and Shannon diversity across identified microbial clusters, with Kruskal–Wallis and pairwise Wilcoxon p-values shown above each comparison (p < 0.05 indicated by *). (C,D) Beta diversity analysis using Principal Coordinates Analysis (PCoA) of Bray–Curtis dissimilarity for (C) two-cluster and (D) three-cluster solutions. Ellipses represent 95% confidence intervals around each cluster centroid. PERMANOVA results are shown in each panel, indicating significant compositional differences between clusters with corresponding p values and explained variance (R²). Data represent meconium microbiota from 71 infants analyzed at the genus level. Statistical tests were non-parametric (Kruskal–Wallis, Wilcoxon, and PERMANOVA, 999 permutations). Boxplots show median and interquartile range; points denote individual samples.

To assess between-sample variation in microbial community composition, beta diversity was evaluated using Bray–Curtis dissimilarity and visualized by PCoA (Figures 4C,D). In the two-cluster models, distinct spatial separation was observed between clusters in both the K-Means and Agglomerative approaches, indicating consistent differentiation in microbial composition. PERMANOVA confirmed these differences to be statistically significant (p = 0.001), with R² values of 0.217 for K-Means and 0.200 for Agglomerative clustering, demonstrating that approximately 20–22% of the variance in microbial community structure could be explained by cluster grouping. In the three-cluster models, PCoA plots revealed a clearer segregation of samples, with each cluster occupying distinct compositional space. The PERMANOVA test again demonstrated highly significant differences among clusters (p = 0.001) with higher explanatory power (R² = 0.358 for K-Means and R² = 0.343 for Agglomerative), suggesting that three-cluster solutions captured more refined structural variations within the microbial community. These findings collectively indicate that both clustering methods identified distinct microbial community profiles within the neonatal gut. While overall richness remained relatively stable across groups, the observed differences in Shannon diversity and community composition highlight that cluster formation reflects meaningful ecological variation, with certain clusters representing more compositionally diverse and functionally differentiated microbial communities.

The classification accuracy of SCFAs-based clusters was evaluated using LR and RF, as shown in Figure 5. The ROC curves illustrate the predictive performance of models for SCFAs-derived clusters, with K-Means and Agglomerative clustering represented in Figures 5A,B, respectively. The RF consistently demonstrating superior classification performance compared to LR. Specifically, in K-Means clustering, RF achieved an AUC of 87.74 ± 10.05, whereas LR exhibited a lower AUC of 75.59 ± 11.97, indicating that RF more effectively distinguished SCFAs-based clusters. Similarly, in Agglomerative clustering, RF again outperformed LR, with an AUC of 91.05 ± 10.35, compared to 78.73 ± 14.43 for LR.

Figure 5

ROC curve plots for evaluating machine learning models. (A) K-Means clustering with Random Forest (AUC = 87.74) and Logistic Regression (AUC = 75.59). (B) Agglomerative clustering with Random Forest (AUC = 91.05) and Logistic Regression (AUC = 78.73). (C) Multi-Class Logistic Regression shows three class curves and micro/macro averages. (D) Another Multi-Class Logistic Regression with different AUCs for each class. (E) Multi-Class Random Forest with three class curves and averages. (F) Another Multi-Class Random Forest showing higher AUCs for each class and micro/macro averages. All graphs display the true positive rate against the false positive rate.

Figure 5. The receiver operating characteristic curves for classification of SCFAs clusters using machine learning models. The ROC curves for binary classification of SCFAs clusters using (A) K-Means and (B) Agglomerative; Multi-class ROC curves for logistic regression (C) K-Means, (D) Agglomerative; Multi-class ROC curves for random forest (E) K-Means, (F) Agglomerative.

Figures 5C,D illustrate the multi-class classification performance of LR for SCFAs clusters, providing an evaluation of model capability in distinguishing three distinct groups. When applied to K-Means clustering (Figure 5C), LR achieved a micro-average AUC of 84.06% and a macro-average AUC of 75.12%, indicating moderate discriminative ability. In contrast, its application to Agglomerative clustering (Figure 5D) yielded slightly higher performance, with a micro-average AUC of 86.72% and a macro-average AUC of 77.25%. Overall, LR was able to differentiate SCFA-based clusters, though performance varied by clustering approach, with Agglomerative clustering providing marginally superior accuracy.

In comparison, the RF model (Figures 5E,F) demonstrated superior accuracy and stronger discriminative power across all SCFA-based clusters. For K-Means clustering (Figure 5E), RF achieved a micro-average AUC of 89.84% and a macro-average AUC of 88.32%, surpassing LR. When applied to Agglomerative clustering (Figure 5F), performance improved further, with a micro-average AUC of 92.98% and a macro-average AUC of 90.99%, confirming RF’s greater robustness in distinguishing SCFA-driven microbial community structures.

Discussion

In this study, we applied unsupervised learning to cluster neonatal gut microbiota based on SCFAs production profiles. Four well-known clustering algorithms were systematically compared using multiple validation metrics (Silhouette Score, Calinski–Harabasz Index, Davies–Bouldin Index, and Prediction Strength). This comprehensive evaluation identified the most effective approach for capturing meaningful microbial subgroups. Given the dataset size, both binary and three-class clustering strategies were tested, with K-Means and Agglomerative consistently outperforming other models. Cluster separation was further validated using t-SNE visualization and genus-level composition analysis to identify key taxa associated with differential SCFA production.

Infants were subsequently assigned to these clusters based on their SCFA profiles to examine the relationship between microbial structure and metabolic output. This analysis revealed how distinct microbial subgroups influence SCFA production and metabolic differentiation. Finally, we evaluated the extent to which these microbiome-driven clusters produced statistically significant differences in SCFAs levels. To evaluate the predictive power of microbial composition in SCFAs metabolism, we applied ML models (LR and RF) using infant demographic and clinical data combined with three SCFAs variables. Overall, the study provides insights into early microbial colonization, metabolic variability, and functional clustering, advancing understanding of neonatal gut development and potential microbiome-based biomarkers.

Clustering performance at the genus level showed that Agglomerative and K-Means consistently outperformed other models across all validation metrics. Centroid-based and hierarchical approaches more effectively captured microbial community structures than density-based or probabilistic methods. Evaluation metrics confirmed the robustness of a three-cluster solution, particularly for Agglomerative clustering, indicating well-separated microbial subgroups. Prediction strength analysis further validated this model, with Agglomerative maintaining slightly higher stability than K-Means at K = 3. Beyond three clusters, prediction strength declined sharply, indicating over-segmentation and reduced interpretability. Overall, these results highlight that three clusters represent the most stable and biologically meaningful classification.

The t-SNE visualizations confirmed the validity of our clustering approach, showing clear separations in both two-cluster and three-cluster solutions. Agglomerative clustering better preserved microbial community relationships, while K-Means formed more compact clusters, likely due to its centroid-based partitioning, which may impose artificial boundaries on fluid microbial ecosystems (Lozupone et al., 2012). The emergence of a third cluster suggests a transitional or metabolically distinct microbial community, supporting the idea that microbial composition exists along a functional gradient rather than rigid classifications (Pasolli et al., 2019). This aligns with metagenomic and ecological studies, which propose that microbial populations undergo gradual shifts rather than abrupt separations, influenced by dietary, environmental, and host factors (Gibbons et al., 2017).

The diversity analyses further reinforced the ecological validity of the identified clusters. Although overall species richness, as reflected by the Chao1 index, remained relatively stable among groups, the significant variation in Shannon diversity and Bray–Curtis dissimilarity indicates differences in community evenness and compositional heterogeneity across clusters. These findings indicate that clustering revealed distinct ecological subgroups within the neonatal gut, with some communities showing greater balance and functional diversity. The strong beta diversity separation and higher explanatory power of the three-cluster model suggest that early microbial colonization follows a structured pattern linked to metabolic potential and host–microbiome interactions.

We analyzed SCFAs concentrations (acetate, butyrate, and propionate) across microbial clusters to investigate how these microbiomes extracted structure correlates with metabolic function. In early life, the neonatal gut microbiome rapidly evolves under the influence of factors like delivery mode, maternal microbiota, and feeding patterns(Timmerman et al., 2017), with early microbial colonizers shaping metabolic development by fermenting complex polysaccharides into SCFAs, the primary microbial metabolites in the colon (Timmerman et al., 2017; Macfarlane and Macfarlane, 2012; Markowiak-Kopeć and Śliżewska, 2020). Our binary clustering analysis using K-Means and Agglomerative clustering revealed a distinct stratification of SCFAs levels, with cluster 1 displaying significantly higher concentrations and cluster 2 showing markedly lower levels. The microbial composition of these clusters offers key insights into their metabolic activity, as cluster 1 was enriched with SCFAs-producing bacteria, particularly Prevotella, Bacteroides, and Streptococcus, while cluster 2 was dominated by Enterococcus and exhibited a reduced presence of Prevotella and Bacteroides.

The predominance of Enterococcus in Cluster 2 aligns with findings by Al-Balawi & Morsy (Al-Balawi and Morsy, 2020), who identified Enterococcus faecalis and Enterococcus faecium as early neonatal colonizers, with E. faecalis comprising 62.5% of total lactic acid bacteria (LAB). Their study highlights the role of Enterococcus in the initial gut microbiota establishment, consistent with our observation that Enterococcus-dominant clusters exhibited lower SCFAs production, likely due to its distinct metabolic pathways compared to SCFAs-producing genera. Statistical analysis further confirmed the microbial influence on SCFAs metabolism.

In K-Means clustering, propionate levels differed significantly (p = 0.010), with acetate showing a trend toward significance (p = 0.06), while butyrate differences were not statistically significant. In contrast, Agglomerative clustering revealed a significant difference in acetate levels (p = 0.041), whereas propionate (p = 0.05) and butyrate (p = 0.22) did not reach significance. The significant acetate variation observed in Agglomerative clustering but not in K-Means suggests that hierarchical clustering may better capture microbial relationships influencing acetate production variability. This aligns with findings by Bäckhed et al. (2015), who reported that early colonizers such as Bacteroides were less frequently transmitted from mother to infant, while Enterococcus faecalis was more stably transmitted across birth modes. Given that our low-SCFAs cluster was dominated by Enterococcus, this may indicate that its metabolic contributions differ from those of Prevotella and Bacteroides, which were enriched in the high-SCFAs cluster and are well-established SCFAs producers (Martin et al., 2016).

Expanding the analysis to multi-class clustering revealed a third intermediate cluster in both K-Means and Agglomerative clustering, suggesting a potential metabolic transition between the two extremes of SCFAs profiles. While cluster 1 consistently exhibited the highest SCFAs levels and cluster 2 maintained the lowest, cluster 3 emerged as a transitional group with moderate SCFAs concentrations, bridging the metabolic divide between the two polar profiles. Taxonomic analysis revealed distinct microbial compositions across the three clusters. The cluster 1 remained predominantly composed of Prevotella and Bacteroides, while cluster 2 was largely dominated by Enterococcus. In contrast, cluster 3 exhibited a mixed profile, primarily consisting of Streptococcus with some Prevotella presence. The emergence of this intermediate cluster aligns with the findings of Martin et al. (2016), who highlighted the pivotal role of Bacteroides in early gut microbiota development, shaped by perinatal factors such as mode of delivery and diet. This reinforces the dynamic nature of early microbial colonization and its metabolic consequences.

Testing the three-cluster model revealed that differences in acetate and butyrate levels were largely non-significant, with the exception of propionate, which showed statistically significant variation in both K-Means (p = 0.008) and Agglomerative clustering (p = 0.045). Pairwise comparisons further underscored significant differences: acetate levels in cluster 3 were notably distinct from those in Cluster 2 (p = 0.017 in K-Means, p = 0.016 in Agglomerative), while propionate levels in cluster 1 were significantly higher than those in cluster 2 (p = 0.004 in K-Means, p = 0.029 in Agglomerative). The absence of significant differences in butyrate levels suggests that butyrate metabolism may be less susceptible to microbial compositional shifts compared to acetate and propionate. This finding is consistent with Tamana et al. (2021), who reported that a higher relative abundance of Bacteroides at one year was linked to improved neurodevelopmental outcomes. Similarly, Carlson et al. (2018) observed that infants enriched with Bacteroides at 12 months demonstrated enhanced cognitive development by the age of two. Given that cluster 1 in our study was enriched with Bacteroides and Prevotella, these results suggest that microbial SCFAs production may have far-reaching implications beyond metabolic health, potentially playing a role in shaping neurodevelopmental trajectories.

These findings highlight the broader significance of microbial clustering, suggesting potential microbiome-targeted interventions. The enrichment of Prevotella and Bacteroides in high-SCFAs clusters aligns with their known role in SCFAs synthesis, while the dominance of Enterococcus in low-SCFAs clusters points to alternative metabolic pathways or reduced SCFAs production. This is particularly relevant given Makino et al. (2013), who reported reduced Bacteroides transmission in C-section-delivered infants, emphasizing the influence of birth mode on neonatal microbiota. Furthermore, Bacteroides depletion has been associated with neurodevelopmental disorders, including autism spectrum disorder (Dan et al., 2020), with lower abundance linked to cognitive and behavioral alterations in early childhood (Carlson et al., 2018; Wang et al., 2017). Additionally, Bacteroides-Prevotella ratios have been correlated with infant weight and ponderal index at one month (Obermajer et al., 2017), underscoring the critical role of microbial composition in early metabolic health.

The classification analysis of SCFA-based clusters showed that the RF model consistently outperformed LR in distinguishing cluster structures. The RF achieved superior performance in both K-Means and Agglomerative models, particularly the latter, which yielded the highest ROC scores. This reflects RF model ability to capture nonlinear relationships and complex feature interactions. Multi-class classification further reinforced its advantage, as RF achieved higher micro- and macro-average AUC values than LR, indicating stronger discriminative power across all clusters. Agglomerative clustering also demonstrated slightly higher accuracy than K-Means, suggesting that hierarchical structures better represent biologically meaningful microbial groupings. These findings align with previous studies showing RF model superiority in disease classification tasks. Ripan et al. (2021) reported higher ROC values for RF compared to LR, support vector machine (SVM), and k-nearest neighbors algorithm (KNN) in heart disease prediction. Similarly, Hu et al. (2024) found that KNN outperformed SVM and LR in cardiovascular disease detection. In our study, RF achieved the best overall performance AUCs of 91.05 ± 10.35 for Agglomerative and 87.74 ± 10.05 for K-Means in binary classification, and 92.98% (micro-average) and 90.99% (macro-average) in multi-class classification highlighting RF’s robustness for microbiome-based cluster prediction.

Strengths of this study rely on its comprehensive and methodologically rigorous approach to understanding neonatal gut microbiome dynamics by integrating unsupervised clustering, machine learning, and metabolic profiling. It employs four distinct clustering algorithms (K-Means, Agglomerative, Spectral, and GMM), systematically evaluating their performance using multiple clustering validation metrics to identify the most biologically meaningful microbial subgrouping approach. Unlike traditional microbiome studies that focus primarily on taxonomic composition, this study incorporates SCFAs metabolic profiling, providing a functional perspective on microbial community interactions and bridging the gap between microbial composition and metabolic function. Additionally, supervised ML models (RF and LR) are implemented to classify SCFAs-driven clusters, with hyperparameter tuning and cross-validation ensuring robust and reproducible classification results. The findings offer biologically meaningful and clinically relevant insights, particularly regarding the influence of microbial communities on neonatal metabolism, which has implications for early-life gut development, immune regulation, and microbiome-targeted interventions. Furthermore, by benchmarking its classification performance against existing disease clustering models, such as those used in heart disease and cardiovascular disease classification, this study establishes its novelty and broader significance in pediatric microbiome research. Specifically, it provides new insights into early-life gut microbiota development, highlighting potential microbial biomarkers that could inform pediatric disease risk assessment, early diagnostics, and microbiome-targeted therapeutic interventions. Another key strength of this study is the use of multi-class clustering, which captures gradual metabolic transitions between microbial communities, rather than forcing rigid binary classifications. Additionally, this study enhances biological interpretability by integrating taxonomic and metabolic data, ensuring that the clustering outputs align with known microbial functions. Lastly, it provides a foundation for future longitudinal microbiome research, offering a framework for tracking microbial development and its long-term impact on metabolic health, ultimately paving the way for early disease detection and microbiome-targeted therapeutic strategies in neonates. This finding provides novel evidence that microbial community signatures alone can reveal underlying metabolic and clinical heterogeneity, supporting the potential of microbiome-based unsupervised models to predict functional states such as SCFA production capacity.

Despite its strengths, this study has several limitations. First, the analysis is cross-sectional, providing a static snapshot of microbial composition and SCFAs production, rather than tracking longitudinal changes in microbiome development over time. Second, while clustering effectively identifies microbial subgroups, it does not directly assess functional gene expression or metabolic activity, limiting our ability to confirm whether observed SCFAs variations result from differential microbial metabolism or external environmental factors. Third, although 16S rRNA sequencing provided valuable insights into taxonomic composition, it lacks species-level and functional resolution; future studies integrating shotgun metagenomics and metabolomics are warranted to comprehensively elucidate microbial functions and metabolic pathways. Additionally, sample size constraints may limit the generalizability of findings, requiring validation in larger and more diverse neonatal cohorts. Finally, while random forest demonstrated superior classification performance, alternative deep learning approaches could be explored to further enhance predictive accuracy. Future research should address these limitations by incorporating longitudinal data, functional genomic analyses, and expanded machine learning models to improve the resolution and applicability of microbiome-based SCFAs classification.

Conclusion

This study provides valuable insights into microbiome-based decision-making in neonatal health by demonstrating the potential of machine learning-driven clustering and classification models for stratifying microbial subgroups based on SCFA metabolic profiles. The findings emphasize the broader applicability of microbiome informatics in clinical and public health settings, offering a foundation for early microbiome-targeted interventions. Integrating SCFA-based microbiome classification with biomedical informatics systems could enhance predictive healthcare models, allowing clinicians to identify neonates at higher risk for gastrointestinal and metabolic disorders, optimize personalized nutritional and probiotic strategies, and incorporate microbiome screening into routine neonatal care. From a research perspective, these findings pave the way for longitudinal studies tracking microbial shifts over time, which could refine biomarker discovery for disease risk assessment. Additionally, advanced machine learning models hold promise for improving the predictive accuracy of microbial classification, enabling more precise and automated decision-making in neonatal and pediatric healthcare. Ultimately, this study underscores the clinical and translational potential of microbiome informatics in shaping evidence-based interventions, precision medicine strategies, and microbiome-driven health monitoring systems, fostering a proactive approach to neonatal metabolic and immune health management.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://www.ncbi.nlm.nih.gov/, PRJNA1123597.

Ethics statement

The studies involving humans were approved by the Institutional Review Board of Kangwon National University Hospital (IRB no. KNUH-B-2021-12-004) for medical research. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

PK: Writing – original draft, Investigation, Formal analysis, Conceptualization, Visualization, Methodology. C-HY: Validation, Writing – review & editing. KC: Resources, Supervision, Writing – review & editing, Data curation. SJ: Project administration, Supervision, Funding acquisition, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), the 2023 Kangwon National University Hospital Grant donated by Dr. Sang Shin Lee., and funded by the Ministry of Health and Welfare, Republic of Korea (grant number: HR22C1605).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2025.1668451/full#supplementary-material

References

Abo Kouadio Jerome, M., Fatogoma Etienne, S., Ngolo David, C., Thomas, M., Fréderic, P., Véronique, C., et al. (2023). Faecal short-chain fatty acid and early introduction of foods in the first 200 days of infant’s life in the district of Abidjan (Ivory Coast). Am. J. Food Nutr. 11, 7–15. doi: 10.12691/ajfn-11-1-2