Characteristics of Fecal Microbiota and Machine Learning Strategy for Fecal Invasive Biomarkers in Pediatric Inflammatory Bowel Disease

Background: Early diagnosis and treatment of pediatric inflammatory bowel disease (PIBD) is challenging because of the complexity of the disease and the lack of disease-specific biomarkers. Machine learning (ML) may provide a new route for identifying early biomarkers for the diagnosis of PIBD.
Methods: In total, 66 treatment-naive PIBD patients and 27 healthy controls were enrolled as an exploration cohort. Fecal microbiome profiling was performed using 16S rRNA gene sequencing. The correlation between microbiota and inflammatory and nutritional markers was evaluated using Spearman's correlation. A random forest model was used to set up an ML approach for the diagnosis of PIBD using 1902 markers. A validation cohort including 14 PIBD patients and 48 patients with irritable bowel syndrome (IBS) was enrolled to further evaluate the sensitivity and accuracy of the model.
Results: Compared with healthy subjects, PIBD patients showed a significantly lower diversity of the gut microbiome. The increased Escherichia-Shigella and Enterococcus were positively correlated with inflammatory markers and negatively correlated with nutrition markers, which indicated a more severe disease. A diagnostic ML model integrating the top 11 OTUs was successfully set up for the differential diagnosis of PIBD. This diagnostic model showed outstanding performance at differentiating IBD from IBS in an independent validation cohort.
Conclusion: The diagnostic panel based on ML of the gut microbiome may be a favorable tool for the precise diagnosis and treatment of PIBD. Studying the relationship between disease status and the microbiome is an effective way to clarify the pathogenesis of PIBD.


INTRODUCTION
The incidence of inflammatory bowel disease (IBD) has grown rapidly worldwide (Molodecky et al., 2012; Wang et al., 2013). About 20-30% of patients develop IBD before adulthood (Oliveira and Monteiro, 2017). The increase is even more dramatic in China: in the past decade, the prevalence of pediatric IBD (PIBD) in Shanghai has increased more than 10-fold. The Asian population has unique genetic and environmental signatures compared with Caucasian and other ethnic groups, including differences in dietary habits and the constitution of the gut microbiome. The current guidelines for the management of PIBD are mainly based on evidence from studies involving Caucasian populations, and a better understanding of the individual signature of Chinese PIBD patients is required for early precision diagnosis and individualized treatment in the future. Nevertheless, there have been few attempts in China to date to characterize PIBD systematically.
The gut microbiome has been proposed as a promising noninvasive diagnostic tool for PIBD in the last decade. The gut microbiota is the most important microecosystem in symbiosis with humans (Rooks and Garrett, 2016). PIBD patients often have dysregulated gut microbiota, including shifts in bacterial taxa constitution and diversity (Shaw et al., 2016). According to a systematic review covering both children and adults, microbiota diversity was decreased in patients with IBD vs controls (Pittayanon et al., 2020). A meta study in PIBD showed that Prevotella, Clostridium, Blautia, and Ruminococcus were depleted while Lactobacillus, Enterococcus, and Acidaminococcus were increased in IBD (Knoll et al., 2017). Malham et al. found that Akkermansia, Gemmiger, Ruminococcus, and Bacteroides were decreased in PIBD (Malham et al., 2019). These studies mainly included European and North American populations, and whether their findings hold in the Chinese population is unclear. In the Chinese population, the microbiota signature of treatment-naive PIBD patients and its role in the pathogenesis of PIBD remain largely to be addressed. Recently, Wang and colleagues reported the first gut microbiome profiling study of pediatric Crohn's disease patients in a Chinese population (Wang et al., 2018b). They found dysbiosis signatures in pediatric Crohn's disease (CD) that differed from those reported in Caucasian studies. According to their study, individual microbial signatures could be a useful tool for predicting a patient's response to anti-TNF-α therapy. However, their study only included patients with a clear diagnosis of CD who received anti-TNF-α treatment. Further studies including other types of IBD and enrolling treatment-naive patients could provide valuable information for the early and differential diagnosis of PIBD.
Machine learning (ML), one of the most useful artificial intelligence (AI) approaches for analyzing complex data, has been successfully used for the diagnosis and early prediction of diseases such as cardiovascular diseases, cancer, and immune diseases (Rahman et al., 2018; Cammarota et al., 2020). It is worthwhile to explore its diagnostic value in PIBD.
In our study, 66 newly diagnosed young IBD patients and 27 healthy controls were prospectively enrolled for gut microbiome profiling. We also enrolled 48 patients with irritable bowel syndrome (IBS) for comparison, to evaluate the performance of our diagnostic tool in differential diagnosis. Fecal microbiota profiling was carried out using 16S ribosomal RNA gene sequencing (16S rRNAseq). The profiles of the microbiome in PIBD and their relationship with disease activity and nutrition status were analyzed. A diagnostic model for PIBD was constructed based on intestinal microecological machine learning. Our study presents comprehensive profiling of the gut microbiome and reveals unique biomarkers in Chinese PIBD patients. These insights into the complex interactions between the gut microbiome and hosts may also provide new insight into the pathogenesis of PIBD.

MATERIALS AND METHODS
Study Cohort
In the initial discovery stage, we enrolled 66 IBD patients and 27 healthy controls for microbiome profiling. In the second validation stage, we enrolled 14 early-onset IBD and 48 IBS patients. All patients with IBD or IBS were recruited from the Department of Pediatrics, Ruijin Hospital affiliated with the School of Medicine, Shanghai Jiao Tong University from January 2016 to December 2019.
The diagnosis and disease evaluation were undertaken following the protocol used in a previous study (Wang et al., 2018a). Briefly, for the diagnosis of CD, the Porto criteria (Levine et al., 2014) were used and disease activity was assessed with the Pediatric Crohn's Disease Activity Index (PCDAI) (Hyams et al., 2005); for patients with UC, the Pediatric Ulcerative Colitis Activity Index (PUCAI) was used (Turner et al., 2009); for the diagnosis of IBS, the Rome IV criteria were used (Stanghellini et al., 2016). The index scores of height and weight were calculated using the Z-scoring method based on a national survey in China in 2005 (Li et al., 2009).
All patients enrolled were newly diagnosed children below 18 years old who had not received any treatment for IBD. Patients were excluded from the study if they met any of the following criteria: 1) the diagnosis changed and was no longer considered to be IBD; 2) the patient had taken antibiotics in the month before the fecal samples were collected; 3) the patient and their guardians did not agree to take part in the study. The healthy control group had not taken antibiotics for at least one month before entry. Written informed consent was obtained from each participant following the protocols approved by the institutional review boards of Shanghai Jiao Tong University.
Fecal samples were collected from all participants and stored at −80°C within 3 hours. DNA extraction was performed using the E.Z.N.A.® soil DNA Kit (Omega Bio-tek, Norcross, GA, U.S.A.) according to the manufacturer's protocols. The final DNA concentration and purity were determined with a NanoDrop 2000 UV-vis spectrophotometer (Thermo Scientific, Wilmington, DE, U.S.A.), and DNA quality was checked by 1% agarose gel electrophoresis. The concentrations of all samples were above 50 ng/µL, and 10 ng of DNA was used for 16S rRNAseq. The OD260/280 ratio of all DNA samples was between 1.8 and 2.0, confirming the quality of the samples.
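The acceptance thresholds described above can be expressed as a simple check. The following is a minimal Python sketch, not the authors' pipeline: the `DnaSample` class, the function names, and the example readings are hypothetical, and the OD260/280 bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    """Hypothetical record of one spectrophotometer reading."""
    sample_id: str
    conc_ng_per_ul: float  # DNA concentration in ng/uL
    od260_280: float       # OD260/280 purity ratio


def passes_qc(sample, min_conc=50.0, ratio_low=1.8, ratio_high=2.0):
    """Return True if the sample meets the thresholds stated in the text.

    Concentration must be above 50 ng/uL; the OD260/280 ratio must lie
    between 1.8 and 2.0 (bounds assumed inclusive).
    """
    return (sample.conc_ng_per_ul > min_conc
            and ratio_low <= sample.od260_280 <= ratio_high)


def input_volume_ul(sample, dna_ng=10.0):
    """Volume (uL) of extract that contains the 10 ng used for sequencing."""
    return dna_ng / sample.conc_ng_per_ul


# Illustrative readings only, not data from the study.
good = DnaSample("S01", conc_ng_per_ul=62.5, od260_280=1.85)
dilute = DnaSample("S02", conc_ng_per_ul=40.0, od260_280=1.90)
```

With these illustrative readings, `passes_qc(good)` is true, `passes_qc(dilute)` is false, and `input_volume_ul(good)` gives the volume needed to load 10 ng.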

Processing of Sequencing Data
The raw fastq files were demultiplexed, quality filtered by Trimmomatic, and merged by FLASH. Operational taxonomic units (OTUs) were clustered with a 97% similarity cutoff using UPARSE (version 7.1, http://drive5.com/uparse/), and chimeric sequences were identified and removed using UCHIME. The taxonomy of each 16S rRNA gene sequence was assigned against the Silva 128 16S bacterial database using a confidence threshold of 70%.
The 16S rRNA data were further analyzed and visualized on the online Majorbio Cloud Platform (www.majorbio.com). Alpha-diversity analyses, including community richness parameters (Sobs, Chao) and a community diversity parameter (Shannon), were performed. Beta-diversity measurements, including principal coordinate analyses (PCoA) and Partial Least Squares Discriminant Analysis (PLS-DA) based on OTU compositions, were determined. The bacterial taxonomic distributions of sample communities were visualized. Linear discriminant analysis effect size (LEfSe) was conducted to identify differentially abundant OTUs.
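To make the three alpha-diversity indices concrete, the following minimal Python sketch computes them from a single sample's OTU count vector. It is an illustration rather than the Majorbio implementation: the bias-corrected form of Chao1 and the natural logarithm for Shannon are assumptions, and the example counts are hypothetical.

```python
import math


def sobs(counts):
    """Observed richness: number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate (assumed variant).

    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return sobs(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTUs with reads."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical OTU counts for one sample (not study data).
example_counts = [10, 5, 2, 2, 1, 1, 1, 0]
richness = sobs(example_counts)        # 7 observed OTUs
estimated = chao1(example_counts)      # 7 + 3*2 / (2*3) = 8.0
diversity = shannon(example_counts)
```

Sobs and Chao describe richness only, whereas Shannon also reflects how evenly reads are spread across OTUs, which is why the text reports them as separate parameter groups.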

OTU-Based Biomarkers Identification
Random forest models (randomForest package, version 4.6-14) were used to model OTU-based biomarkers as described before (Edwards et al., 2018). Briefly, we ranked individual OTUs by their importance. Ten-fold cross-validation was performed to evaluate model performance as well as to remove less important OTUs. The top 11 OTUs from the random forest models, the smallest set retained by cross-validation, were selected as the optimal marker set. The probability of disease (POD) for IBD was calculated in both the exploration and validation cohorts and compared. To evaluate the discriminatory ability of the random forest models, receiver operating characteristic (ROC) curves were constructed and the area under the curve (AUC) was calculated.
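The AUC used to summarize the ROC curve equals the probability that a randomly chosen patient receives a higher POD than a randomly chosen control. The following minimal Python sketch illustrates that calculation and a cutoff-based sensitivity and specificity. It does not reproduce the random forest itself; the function names and the POD values are hypothetical.

```python
def roc_auc(pod_cases, pod_controls):
    """AUC via pairwise comparison (Mann-Whitney formulation).

    Counts the fraction of case/control pairs in which the case has the
    higher POD; ties contribute one half.
    """
    wins = 0.0
    for case in pod_cases:
        for control in pod_controls:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(pod_cases) * len(pod_controls))


def sensitivity_specificity(pod_cases, pod_controls, cutoff):
    """Classify POD >= cutoff as disease and report both rates."""
    sensitivity = sum(p >= cutoff for p in pod_cases) / len(pod_cases)
    specificity = sum(p < cutoff for p in pod_controls) / len(pod_controls)
    return sensitivity, specificity


# Hypothetical POD values (not study data).
pod_ibd = [0.9, 0.8, 0.4]
pod_ctrl = [0.7, 0.3, 0.2]
auc = roc_auc(pod_ibd, pod_ctrl)  # 8 of 9 pairs are ordered correctly
sens, spec = sensitivity_specificity(pod_ibd, pod_ctrl, cutoff=0.5)
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of this size; a rank-based formula gives the same value for larger data sets.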

Imputed Metagenomic Analysis
The metagenomes of the gut microbiota were imputed from 16S rRNAseq data with the Tax4Fun package available on the Majorbio Cloud Platform. The predicted functional composition profiles were collapsed into KEGG (Kyoto Encyclopedia of Genes and Genomes) database pathways.

Statistical Analysis
The free online Majorbio Cloud Platform (www.majorbio.com) or GraphPad 8.0 (GraphPad Software Inc, CA) was used for statistics. For the comparison of continuous variables between two groups, the Mann-Whitney U test was used. For correlation analysis, Spearman's rank test was performed. Multiple hypothesis tests were adjusted using the Benjamini and Hochberg false discovery rate (FDR), and associations below an FDR threshold of 0.05 were considered significant. The differences between populations were analyzed using a one-way ANOVA. P < 0.05 was considered statistically significant.
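The Benjamini and Hochberg adjustment mentioned above can be sketched in a few lines of Python. This is an illustrative implementation of the standard step-up procedure, not the code run on the analysis platform, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order.

    Each p-value is scaled by m / rank (rank 1 = smallest p), then a
    running minimum is taken from the largest rank downwards so that the
    adjusted values are monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four correlation tests (not study data).
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr]
```

For these four values the adjusted results are approximately 0.02, 0.04, 0.04, and 0.02, so all four would pass the 0.05 FDR threshold.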

RESULTS
Characteristics of the Participants
We recruited a total of 66 subjects with IBD and 27 healthy control subjects as the exploration group. Another 14 IBD and 48 IBS patients were enrolled for the evaluation of the diagnosis model. All the patients were newly diagnosed with PIBD. The demographic and clinical characteristics of PIBD and non-IBD controls are shown in Table 1.

Gut Microbial Dysbiosis in PIBD
To investigate the gut microbiome in our PIBD cohort, fecal samples from all 66 PIBD patients and 27 healthy controls were processed for 16S rRNAseq. Consistent with the findings reported in Caucasian populations, the gut microbiota α-diversity was significantly reduced in our Chinese PIBD cohort, including decreases in the Sobs index, Shannon index, and Chao index (Figures 1A-C). This suggests not only a significantly decreased number of bacterial species but also less evenly distributed species in PIBD. Beta diversity was also calculated to compare the similarity of bacterial species between PIBD patients and healthy controls (Figures 1D-F). This suggested an asymmetrical distribution between the two groups. Notably, we found that the richness of species could explain the differences along the principal coordinate in the weighted UniFrac PCoA analysis (Figure 1G). At the phylum level, Proteobacteria was significantly increased in PIBD patients (P=0.0014) while Actinobacteria was decreased in PIBD patients as compared with healthy controls (P<0.0001). Figure 2A shows the 10 most significantly altered genera between PIBD patients and healthy controls. Escherichia-Shigella and Enterococcus were enriched in PIBD patients (Figures 2B, C). Bifidobacterium, Faecalibacterium, and Blautia were decreased in PIBD patients (Figures 2D-F). The linear discriminant analysis effect size (LEfSe) results further show significantly different signatures between the two groups (Figures 2G, H).

Association Between the Microbiome and Disease Status
We next evaluated the relationship between genus composition and disease severity in IBD. For disease activity assessment, the PCDAI score was used for patients with CD; the mean PCDAI was 14.5 ± 17.1. For patients with ulcerative colitis (UC), the PUCAI score was used, and the mean PUCAI was 16.9 ± 12.5. We combined the two indexes as the Disease Activity Index (DAI) in the further statistics. The details of the patients' disease severity and nutrition status are listed in Table 2. Spearman analysis showed that DAI was positively correlated with several inflammatory markers, including ESR, CRP, WBC, PLT, and PLR. DAI was also negatively correlated with the nutrition markers, including ALB, HGB, and HCT (Figure 3A). In addition, several inflammatory markers were positively or negatively associated with each other. For example, PLR was significantly associated with every other parameter.
The correlation between disease activity and the microbiome profile was further examined using the DAI score and 16S rRNAseq data. As revealed by Spearman analysis, the richness of Enterococcus, Escherichia-Shigella, Streptococcus, Enterobacter, and Veillonella was positively associated with a higher DAI score and inflammatory markers, and negatively correlated with nutrition markers. Moreover, dysbiosis in IBD was also negatively associated with the Z scores of height and weight, suggesting that host gut dysbiosis may exacerbate the disease. Conversely, Faecalibacterium, Lachnoclostridium, Bacteroides, Parabacteroides, Blautia, and Prevotella were positively correlated with the Z score of height and/or weight and nutrition markers but negatively correlated with DAI and inflammatory markers. This suggests that these microbiota were associated with favorable outcomes in patients (Figures 3B, C).
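The Spearman coefficient used throughout this section is the Pearson correlation computed on ranks. The following minimal Python sketch shows the calculation with average ranks for ties. It is an illustration only: the function names are hypothetical, and the DAI scores and relative abundances are made-up values, not study data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical per-patient values (not study data).
dai_score = [5, 12, 20, 35, 50]
genus_abundance = [0.01, 0.03, 0.02, 0.10, 0.25]
rho = spearman_rho(dai_score, genus_abundance)
```

Because only ranks enter the calculation, the coefficient is insensitive to the skewed distributions typical of relative abundance data, which is one reason rank correlation suits this kind of analysis.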

Differential Diagnosis Discrimination With an 11-OTU Signature
To explore the diagnostic value of fecal microbiome profiling in PIBD, we applied the ML approach to analyze the major factors for the diagnosis of PIBD. We constructed a random forest model based on all 1902 OTUs of the gut microbiota in the exploration group. The top 30 OTUs were then ranked by their accuracy and Gini indexes (Figure 4A). Eleven OTUs were further selected by 10-fold cross-validation as the optimal marker set (Figure 4B). These OTUs were mainly from the genera Bifidobacterium (OTU2966, OTU218), Ruminiclostridium (OTU1660), Sphingobium (OTU398), Anaerostipes (OTU167), Fusicatenibacter (OTU1619), Clostridium (OTU1755), Brevundimonas (OTU924), Lachnospiraceae (OTU142), Adlercreutzia (OTU2761), and Dorea (OTU1657). The POD index was generated using random forest model analysis. It was significantly higher in PIBD samples than in healthy controls (P<0.0001) (Figure 4C).
The POD index was further examined using microbial data from the validation cohort. In addition to the comparison between healthy controls and PIBD patients used to generate the POD index, it is clinically even more relevant to distinguish PIBD at an early stage from other diseases with similar symptoms. To better evaluate the effectiveness of the model in the diagnosis and differential diagnosis of IBD, we therefore included patients with IBS rather than healthy controls in the validation cohort. The POD index was significantly higher in IBD samples than in IBS samples (P<0.0001) (Figure 4C). The performance of the model was assessed using ROC analysis. The exploration group achieved an AUC of 0.88 (95% CI: 0.8-0.95) (Figure 4D), and the AUC was 0.84 in the validation cohort (95% CI: 0.72-0.96) (Figure 4E). This result indicated that the gut microbiome-based classifier can accurately and sensitively distinguish IBD from non-IBD. Therefore, fecal invasive biomarkers obtained by ML showed powerful diagnostic potential for IBD.

Microbial Functions Altered in PIBD
The functional profiles of the gut microbiome in IBD patients and healthy controls were predicted with Tax4Fun based on the 16S rRNAseq data. The KEGG pathway analysis predicted that the bacterial invasion of epithelial cells pathway was significantly altered in IBD (Figures 5A-C). The gut microbiome in IBD was characterized by an over-representation of pathogenic bacteria. In contrast, the pathways enriched in healthy controls included "replication and repair", "amino acid metabolism", and "nucleotide metabolism" in the level 2 KEGG pathway analysis (Figure 5A). Differentially represented level 3 KEGG pathways are listed in Figure 5B. The abundance of the "bacterial invasion of epithelial cells" pathway was significantly increased in IBD patients (Figure 5C).

DISCUSSION
We studied the microbiota of Chinese PIBD patients using 16S rRNAseq. PIBD patients showed significantly lower diversity of their gut microbiome as compared to healthy controls. They had increased Escherichia-Shigella and Enterococcus, which were positively correlated with inflammatory markers and negatively correlated with nutrition markers. Bifidobacterium, Faecalibacterium, and Blautia were decreased in IBD patients. A diagnostic model integrating 11 OTUs was successfully developed using ML methods for differential diagnosis in PIBD. This diagnostic model showed outstanding performance in differentiating IBD from IBS in an independent validation cohort.
PIBD is a chronic gastrointestinal disease. The main challenge in the diagnosis of PIBD is the occult onset of the disease. Noninvasive early diagnostic tools would provide possibilities for early intervention and improve the quality of life of the patients. Several studies have shown that, through ML-based predictive models, microbial markers outperform clinical parameters in the diagnosis and prediction of relapse and response to therapies (Ananthakrishnan et al., 2017; Zhou et al., 2018; Aden et al., 2019). However, the gut microbiome differs significantly between the Chinese and Caucasian populations (Zhou et al., 2018). Moreover, Chinese PIBD patients also have different disease progression and different underlying pathogenesis, such as IL10RA gene polymorphism, as we and others have shown (Wang et al., 2018a; Su et al., 2021). Therefore, the previously reported tools could not be directly applied to our PIBD patient cohort. We believe the exploration of microbiota and ML-based diagnostic tools will help the early diagnosis and therapy of PIBD in China. Our current study provides the first successful diagnostic model of microbial OTU markers for PIBD in the Chinese population.
In our study, Escherichia-Shigella and Enterococcus were enriched in IBD patients. Escherichia-Shigella includes specific Escherichia coli strains. Pathogenic E. coli can evade the immune system of the host and induce inflammation by suppressing epithelial and inflammatory cell autophagy. Recurrent infection of Salmonella can cause colitis by accelerating molecular aging (Yang et al., 2017). Enterococcus can also trigger pathological processes in IBD (Mancabelli et al., 2017; Wang et al., 2018b; Lo Presti et al., 2019; Salem et al., 2019). Zhou et al. found that Enterococcus faecalis (E. faecalis) levels were associated with clinically active disease in patients with CD (Zhou et al., 2016). The gelatinase from E. faecalis can disrupt the intestinal epithelium by activating the protease-activated receptor 2 (Maharshak et al., 2015). A reduction of short-chain fatty acid (SCFA)-producing microbiota, including Faecalibacterium, Blautia, Clostridium, and Lachnospiraceae, was found in PIBD samples in this study. SCFAs are regarded as a source of energy for the epithelium and can protect the tight junctions of epithelial cells (Fachi et al., 2019). SCFAs also contribute to regulatory T cell function (Arpaia et al., 2013) and reduce neutrophil recruitment through blockade of IL-8 production (Sokol et al., 2008).
Abbreviations: IBD, inflammatory bowel disease; DAI, disease activity index; HAZ, z score for height for age; WAZ, z score for weight for age; WBC, white blood cells; PLT, platelets; CRP, C-reactive protein; ESR, erythrocyte sedimentation rate; N, neutrophil; L, lymphocyte; NLR, neutrophil-to-lymphocyte ratio; PLR, platelet-to-lymphocyte ratio; ALB, albumin; HGB, hemoglobin; HCT, hematocrit. *P < 0.05, **P < 0.01, ***P < 0.001.
The high richness of Escherichia-Shigella and Enterococcus in PIBD was positively associated with disease severity. In contrast, the microbiota lost in PIBD, such as Bifidobacterium, Faecalibacterium, and Blautia, were related to good nutrition status and a low disease score. Xue et al. also reported microbial dysbiosis in pediatric CD, with increased Enterococcus, Novosphingobium, and Enhydrobacter and decreased Bifidobacterium, Klebsiella, and Clostridium (Xue et al., 2020). Wang et al. had consistent findings (Wang et al., 2021). PIBD in China is characterized by increased Enterococcus and decreased Bifidobacterium, Faecalibacterium, and Blautia, which indicates that the microbiome of PIBD patients within the Chinese population has some conserved features. This might be associated with genetic or environmental factors and provides the foundation for microbiome-based diagnosis and disease evaluation.
Disease severity is significantly associated with inflammatory markers in the peripheral blood and nutritional status in children. It is generally accepted that elevated inflammation markers often indicate active disease, and the restoration of nutritional markers indicates that the patient has a more stable status (van Rheenen et al., 2021). This is consistent with our current study. In this study, the differential microbiota was closely related to disease severity, which indicates that changes in the microbiome can serve as biomarkers that reflect disease severity and predict the outcome before clinical symptoms appear. Rajca et al. reported reduced counts of Clostridium and Faecalibacterium in CD patients, and a lower baseline abundance of F. prausnitzii and Bacteroides predicted relapse (Rajca et al., 2014). Hyams et al. found that the abundance of Ruminococcaceae and Sutterella predicted remission in 400 newly diagnosed pediatric UC patients (Hyams et al., 2019). In a PIBD study, the enrichment of Rothia and Ruminococcus was associated with the development of stricturing complications (Kugathasan et al., 2017). Microbiome-based biomarkers could therefore predict disease progression or outcome well.
AI has revolutionized the study of IBD (Iablokov et al., 2020). Using the random forest models, the optimal 11 OTU markers for PIBD were identified. The POD based on the 11 OTU markers was distinct between PIBD patients and healthy controls, showing powerful classification potential for PIBD. More importantly, the POD successfully distinguished patients with IBD from those with IBS in the validation cohort. Similarly, a study in adult patients in the United States also successfully demonstrated ML modeling for IBD using gut microbiome data (Manandhar et al., 2021). Xu et al. constructed a gut microbiome-based diagnostic tool for differential diagnosis and achieved a high AUC value both in health vs IBD and UC vs CD.
These findings indicate that ML approaches can improve the accuracy and convenience of diagnosing IBD, and that fecal microbial markers show promising potential as non-invasive tools for the early diagnosis of PIBD. However, our initial study with ML-based tools also has limitations that require further investigation. For example, we designed this tool for the diagnosis of typical PIBD at an early stage, and therefore we could not include patients who mainly have extraintestinal manifestations. Whether this tool could be improved to identify non-typical PIBD remains to be validated. Although machine learning methods are still evolving, they show promising power in both early diagnosis and identifying underlying pathogenesis that can guide mechanistic studies. Nonetheless, microbiome-based biomarkers have an irreplaceable advantage, and continued exploration is needed to accelerate the applications of precision medicine.
Using functional prediction from 16S sequencing, we found that IBD patients have enriched strains of pathological bacteria and decreased diversity of the microbiota related to the host's metabolic capacities. Pathway analysis revealed significant upregulation of the bacterial invasion of epithelial cells pathway in our PIBD patients. Not surprisingly, this was closely related to the altered host microbiome in PIBD patients, featuring changes in Escherichia-Shigella, Enterococcus, and SCFA-producing microbiota (Solis et al., 2020). Another potential contributing factor is the host diet, as diet was recently found to influence the human gut microbiome and the pathogenesis of IBD (Adithya et al., 2021). A deeper investigation of the key species by metagenomic and metabolic sequencing may further improve our understanding of the early triggers in PIBD, enhance the performance of the diagnostic model, and provide new routes for the treatment of IBD.

CONCLUSION
This study found dysbiosis of the gut microbiota in PIBD. Escherichia-Shigella and Enterococcus were positively associated with the disease severity of PIBD. In contrast, Bifidobacterium, Faecalibacterium, and Blautia were associated with low inflammatory markers and good nutrition status. AI analysis of the gut microbiota using ML models successfully identified an optimal set of 11 OTU biomarkers (OTU2966, OTU218, OTU1660, OTU398, OTU167, OTU1619, OTU1755, OTU924, OTU142, OTU2761, and OTU1657) for the diagnosis of PIBD, which could potentially serve as non-invasive tools for the early diagnosis of PIBD (Figure 6).

DATA AVAILABILITY STATEMENT
The datasets presented in the study are deposited in the Genome Sequence Archive (Chen et al., 2021) in National Genomics Data Center (Members and Partners, 2021), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA005251) that are publicly accessible at https://ngdc.cncb.ac.cn/gsa.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine. Written informed consent to participate in this study was provided by the participants' legal guardians.

AUTHOR CONTRIBUTIONS
XW and YX contributed equally to this work. XW, YX, and CX designed this study. XW and NL analyzed the data. YY and XX collected the patients and recorded the data. XW, NL, and LG drafted the manuscript. CX and NL provided final approval of the manuscript. All authors contributed to the article and approved the submitted version.
FIGURE 6 | Study design: discovery cohort enrolled 66 newly diagnosed young IBD patients and 27 healthy controls for gut microbiome 16S rRNAseq. The profiles of the microbiome in PIBD and its relationship with the disease activity and nutrition status were analyzed. Intestinal microecological machine learning was constructed to generate a PIBD diagnosis tool using microbiome data. Subsequently, 14 patients with IBD and 48 patients with IBS were collected as a validation cohort for the evaluation of the diagnostic model.