Development of a Novel Metagenomic Biomarker for Prediction of Upper Gastrointestinal Tract Involvement in Patients With Crohn’s Disease

The human gut microbiota is an important component in the pathogenesis of Crohn’s disease (CD), promoting host–microbe imbalances and disturbing intestinal and immune homeostasis. We aimed to assess the potential clinical usefulness of the colonic tissue microbiome for obtaining biomarkers for upper gastrointestinal (UGI) tract involvement in CD. We analyzed colonic tissue samples from 26 CD patients (13 with and 13 without UGI involvement at diagnosis) from the Inflammatory Bowel Disease Multi-Omics Database. QIIME1, DiTaxa, linear discriminant analysis effect size (LEfSe), and PICRUSt2 methods were used to examine microbial dysbiosis. Linear support vector machine (SVM) and random forest classifier (RF) algorithms were used to identify the UGI tract involvement-associated biomarkers. There were no statistically significant differences in community richness, phylogenetic diversity, and phylogenetic distance between the two groups of CD patients. DiTaxa analysis predicted significant association of the species Ruminococcus torques with UGI involvement, which was confirmed by the LEfSe analysis (P = 0.025). For the feature ranking method in both linear SVM and RF models, the species R. torques and age at diagnosis contributed to the combined models. The L-methionine biosynthesis III (P = 0.038) and palmitate biosynthesis II (P = 0.050) were under-represented in CD with UGI involvement. These findings suggest that R. torques might serve as a novel potential biomarker for UGI involvement in CD and its correlations, in addition to a range of bacterial species. The mechanisms of interaction between hosts and R. torques should be further investigated.

Accordingly, the European Crohn's and Colitis Organization consensus guideline recommends that UGI endoscopy and radiology, such as magnetic resonance imaging, computed tomography, and small bowel capsule endoscopy, should be performed in all CD patients where UGI tract involvement is suspected (Annese et al., 2013). However, UGI tract involvement is a diagnostically challenging presentation in CD, due to a lack of specific clinical symptoms, and thus, there is a heavier reliance on imaging modalities in practice.
Chronic inflammation in CD patients is related to altered interactions between the host and the microbiota, and microbial imbalance (Frank et al., 2007;Xavier and Podolsky, 2007;Sartor, 2008;Halfvarson et al., 2017). Currently, the human microbiome, comprising of the entire microbial complement related with human hosts, is a critical and emerging area for biomarker discovery (Pascal et al., 2017;Douglas et al., 2018;Mills et al., 2019). The identification of microbial biomarkers and their use for the prediction of the disease provide valuable information for predictions in a wide range of applications.
Hence, the aims of this study were to compare the metagenomic profile in CD patients with and without UGI involvement at diagnosis, and to identify the metagenomic biomarkers predicting its development.

Data Sources and Processing
We used the data from the Inflammatory Bowel Disease Multi'omics Database 1 for the most comprehensive description to date of host and microbial activities in inflammatory bowel diseases. Tissue samples gathered during the initial screening colonoscopy at diagnosis were collected according to a standardized protocol, and the V4 region of the 16S rRNA gene was PCR-amplified and sequenced in the MiSeq platform (Illumina) (for detailed protocols see http://ibdmdb.org/protocols). We divided the subjects into two groups, "nonL4" versus "L4" -where nonL4 are CD patients without UGI tract involvement and L4 are those with UGI tract involvement in disease extent. 1 http://ibdmdb.org

Community Analysis
The obtained raw data were analyzed using Quantitative Insights Into Microbial Ecology (QIIME) version 1.9.0, a software that performs microbial community analysis and taxonomic classification of microbial genomes (Navas-Molina et al., 2013). Sequences were assigned to operational taxonomic units (OTUs) with a 97% similarity threshold and subsequently picked by UCLUST against a closed reference table, the latest version of the Greengenes OTU database (Edgar, 2010). For diversity analysis, samples were normalized so all the samples could be compared. Alpha diversity of OTU libraries was described using the Chao1, phylogenetic diversity (PD) whole tree, and observed species, and were compared using a Student's t-test. Distance matrices were constructed using the unweighted and weighted UniFrac algorithms in QIIME from the whole community phylogenetic tree. Significant differences between the predefined groups were analyzed using one-way analysis of similarities (ANOSIM) with 999 permutations with their corresponding Global-R statistics.

Biomarker Detection and Functional Analysis
To determine the potential biomarker OTUs, linear discriminant analysis effect size (LEfSe) analysis was performed with a linear discriminant analysis (LDA) score threshold of > 1.0 to detect features significantly different in abundance between the groups (Fisher, 1936;Segata et al., 2011).
In addition, we conducted subsequence-based 16S rRNA data processing using the DiTaxa software, which substitutes standard OTU-clustering method by segmenting 16S rRNA reads into the most frequent variable-length subsequences, for sequence phenotype classification and biomarker detection (Asgari et al., 2018). The linear support vector machine (SVM) and random forest classifier (RF) algorithms were used to build a predictive model and to calculate the importance of all variables and rank them accordingly. For linear SVM, we set the cost to the value of 1 and use RF classifier in the default settings.
PICRUSt2 was used to predict microbial content from each sample's data and functionally annotate the data (Langille et al., 2013). The results were further subjected to statistical analysis of taxonomic and functional profiles (STAMP v2.1.1) software (Parks et al., 2014). To investigate the metabolic network of the predicted organism, we used MetaCyc database 2 , which contains data regarding chemical compounds, reactions, enzymes, and metabolic pathways that have been experimentally validated and reported in the scientific literature (Caspi et al., 2016). The statistical analyses were performed using R version 3.5.1 (R Core Team, 2017; Venables and Smith, 2020). All significant thresholds were set at a two-sided p-value of 0.05.

Baseline Characteristics
Among the 37 potential CD patients, four patients with insufficient data on disease extent and seven patients who were not receiving tissue samples at the time of diagnosis were excluded, leaving 26 patients for analysis. Patients of the L4 group were diagnosed at a younger median age of 13.0 years (IQR 10.5-15.5 years) compared to 19.0 years (IQR 14.5-28.0 years) for the patients in nonL4 group (P = 0.005), and the male to female ratio was 2.3 and 1.2 in the L4 and nonL4 groups, respectively (P = 0.650) ( Table 1). The baseline C-reactive protein (CRP) score and simple endoscopic score for Crohn's disease (SES-CD) did not differ significantly between the groups (P = 0.711 and P = 0.056 for L4 and nonL4, respectively) ( Table 1). However, the CD patients with UGI tract involvement had higher erythrocyte sedimentation rate (ESR) than those without UGI tract involvement (P = 0.033) ( Table 1). All tissue samples were obtained from the rectum and ileum ( Table 1). None of the patients were on any active medication, such as corticosteroids, immunomodulators, or biological agents at the time of sample collection. The detailed demographic and clinical characteristics are summarized in Table 1.

Taxonomic Characterization
We analyzed the intestinal microbiota diversity of the two groups and tested whether intestinal microbiota diversity could be related to disease extent. The alpha diversity indices of Chao1, PD whole tree, and observed species diversity are shown in Supplementary Figure S1. All three diversity indices were higher in nonL4 compared to L4, but there were no significant differences between the two groups (P = 0.522, P = 0.503, and P = 0.275 for Chao1, PD whole tree, and observed species diversity, respectively; Supplementary Figure S1). Beta diversity was further evaluated using weighted-UniFrac analysis, which showed similar bacterial communities in patients of both groups (Supplementary Figure S1). Furthermore, an unweighted UniFrac-based principal coordinate analysis (PCoA) showed that samples were clustered by subject (ANOSIM: R = −0.010;  Figure S2). We also performed a weighted-UniFrac PCoA analysis with ANOSIM (R = −0.043; P = 0.898) (Supplementary Figure S2).

Bacterial Abundance and Distribution
Subsequently, we analyzed the intestinal microbiota abundance and distribution in the two groups and tested whether they could be related to UGI tract involvement in CD. At the genus level, bacteria from Akkermansia (0.3% vs.  (Figure 1).

Metagenomic Biomarker Discovery
We found significant differences in the community compositions between the two groups by LEfSe analysis. As shown in Figure 2, the microbial composition was also significantly different at the order level among groups. The Pasteurellales (P = 0.042), Sphingomonadales (P = 0.045), Campylobacterales (P = 0.024), and Clostridiales (P = 0.043) exhibited a relatively higher abundance in nonL4 group (Figure 2). The patients in nonL4 group had members of the class Epsilonproteobacteria (P = 0. 024) and the family Campylobacteraceae (P = 0.024) that were significantly dominant than those in L4 group patients (Figure 2). Furthermore, there were seven significantly different genera, composed of Campylobacter (P = 0.024), Prevotella (P = 0.034), Clostridium (P = 0.043), Coprobacillus (P = 0.015), Slackia (P = 0.034), and Lachnospira (P = 0.015) that were enriched in the nonL4 group, while Limnohabitans (P = 0.034) was enriched in the L4 group (Figure 2). At the species level, significantly more Haemophilus parainfluenzae were detected in the patients in nonL4 group (P = 0.028), while Ruminococcus torques were enriched in L4 group patients (P = 0.015) (Figure 2). Comparative taxonomic visualization of detected differentially expressed markers for DiTaxa and a common workflow are shown in Supplementary Figure S3     Roseburia faecis, Faecalibacterium prausnitzii, Bacteroides caccae, and Bacteroides uniformis.

Metagenomic Biomarker Evaluation
To further characterize the predictive value of the eight identified taxa by LEfSe or DiTaxa methods, we performed ROC analysis with clinical variables (age at diagnosis and sex) using the machine learning models (Figure 3). A comparison of the average performance as a predictive model suggests the superiority of SVM: the average performance of SVM is >0.799 AUC and 68.2-75.2% accuracy, while that of RF is <0.740 AUC and 57.8-66.5% accuracy (Figure 3). For the top performing model architecture, the addition of microbial features improves the predictive performance of linear SVM model; however, the performance in RF model tends to decrease. Notably, for the feature ranking method in both linear SVM and RF models, the top two factors -the species Ruminococcus torques and age at diagnosis-contributed to the combined models (Figure 3). Figure 3 also shows that the addition of the signature of the species Haemophilus parainfluenzae into the models enabled us to achieve the highest accuracy and to increase the diagnostic performance of UGI tract involvement in CD patients.

Metagenomic Functional Analysis
In addition, the functional diversity of the different putative metagenomes was assessed using the PICRUSt2 software. Pathways displaying a significant difference in mean proportions between L4 and nonL4 groups were represented (Figure 4). The pathways, including thiazole biosynthesis II (p < 0.001), superpathway of thiamine diphosphate biosynthesis II (P = 0.010), and octane oxidation (P = 0.035), were overrepresented, whereas L-methionine biosynthesis III (P = 0.038) and palmitate biosynthesis II (P = 0.050) were under-represented in L4 (Figure 4). For selected pathways, we also examined the extent to which these pathways are linked with the species Ruminococcus torques. As shown in Figure 5, the two related MetaCyc pathways, L-methionine biosynthesis III and palmitate biosynthesis II, showed an association with Ruminococcus torques, which may play important roles in the intestinal integrity and barrier function (Figure 5).

DISCUSSION
To our knowledge, this is the first study to identify a reliable metagenomic biomarker for UGI tract involvement in CD. The reported frequency of UGI tract involvement in CD largely varies. The main cause of the discrepancies regarding prevalence rates of UGI tract lesions is probably related to irregularly performing different diagnostic modalities for CD diagnosis, presumably because of the low reliability of mapping disease extent in clinical practice. However, to date, no data has been analyzed regarding a simple biomarker for CD patients with UGI tract involvement. Our main hypothesis is that the possible differences in taxonomic composition might potentially be used as proxy biomarkers for UGI tract involvement in CD patients, since altered microbial communities have been demonstrated to be an essential factor in driving intestinal inflammation in CD (Tamboli et al., 2004;Sartor, 2008).
We analyzed the differences in the tissue microbial community of CD patients at the species level. In this study, the species Dorea formicigenerans, Ruminococcus torques, Ruminococcus gnavus, Roseburia faecis, Faecalibacterium prausnitzii, Bacteroides caccae, Bacteroides uniformis, and Haemophilus parainfluenza were identified as predictive biomarkers by the LEfSe or DiTaxa programs. Interestingly, Ruminococcus torques, a butyrate-producing bacterial species, was the only one commonly identified by the two different algorithms.
The authors also examined whether the composition of the microbiota, with clinical predictors, could predict whether the patient would have UGI tract involvement or not using two different machine learning algorithms (linear SVM and RF). Modest predictive performances were achieved with a few features (eight taxa, age at diagnosis, and sex), especially in linear SVM. Notably, the most influential features for predicting disease extent were levels of the species Ruminococcus torques (positive correlation) and age at diagnosis (negative correlation). These consistent reproducible results for the species Ruminococcus torques present the possibility of using microbiota analysis as a screening tool to determine CD patients at high risk of UGI tract involvement.
Contrary to the inconsistent results regarding the signature of Ruminococcus torques in fecal samples of CD patients (Joossens et al., 2011;Gevers et al., 2014;Takahashi et al., 2016), most of the studies from tissue samples showed consistently high levels of the species in the mucosa of patients with CD compared to that in healthy subjects (Martinez-Medina et al., 2006;Png et al., 2010). Furthermore, our results show that its abundance remains significantly high in the CD patients with UGI tract involvement.
No role in CD pathophysiology has been suggested so far for the species Ruminococcus torques, belonging to the Clostridium coccoides group/cluster XIVa. They utilize MUC2, the main secreted mucin in the human intestine, as the sole carbon source and have a strong gastrointestinal mucin-degrading ability, providing further evidence of their adaptability in the human gut mucosal environment (Colina et al., 1996;Dethlefsen et al., 2006;Png et al., 2010). Therefore, it has been proposed that excessive mucin degradation by these bacteria may contribute to intestinal FIGURE 4 | Prediction of changed KEGG pathways using PICRUSt2 analysis. A total of five KEGG pathways were significantly altered in the L4 group compared to the nonL4 group; Bar plots on the left side display the mean proportion of each KEGG pathway. Dot plots on the right show the differences in mean proportions between the two indicated groups using P-values.
FIGURE 5 | Functional annotation of predicted species in the elucidated pathways represented in MetaCyc (solid blue line: general reactions, solid gray line: spontaneous or missing reactions, dotted green line: reaction step predicted for the species). Potential mechanisms implicated in the interplay between the gut microbiomes would be influenced the UGI involvement in CD, because, each reaction step in the pathways was not enriched only for the species Ruminococcus torques in the database.
disorders, as access of luminal antigens to the intestinal immune system is facilitated (Ganesh et al., 2013).
In addition, we found an inverse relationship for age at diagnosis with UGI involvement in CD. Present study showed that the median age of the patients with UGI involvement was significantly lower compared to those without UGI involvement. This is in accordance with a previous study by Thomas and colleagues who demonstrated a higher rate of younger patients (≤16 years) suffering from UGI tract involvement compared to those without (9.4% versus 17.8%, P = 0.005) (Greuter et al., 2018). Another study by Lopez-Siles et al. observed that CD patients below 16 years of age had a striking reduction in the population of Akkermansia sp. that possess a lower mucolytic activity compared to those with disease onset at a later age, which is in agreement with our findings (Lopez-Siles et al., 2018).
Taken together, the mucus barrier dysfunction, due to the replacement of a less mucolytic bacteria, such as Akkermansia species by a more mucolytic one, such as Ruminococcus torques, at young age may influence the microbial community on the intestinal mucosa and be instrumental in the development of UGI tract involvement in CD.
To characterize the functional role of the microbiome in phenotype, we annotated the taxa by the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. This analysis suggested that L-methionine biosynthesis III and palmitate biosynthesis II pathway were decreased, which are linked with Ruminococcus torques, while the three KEGG pathways predicted to be increased in L4 group were not associated with the species. Currently, the role of methionine metabolism and its metabolites, and palmitate metabolic pathway in the pathogenesis of CD is poorly understood. Methionine is known to improve the integrity and barrier function of the small intestinal mucosa and villus morphology, and development in previous studies from animal models (Chen et al., 2014;Shen et al., 2014). A previous in vivo study by Wei et al. showed that palmitate plays a key role in the preservation of the gut barrier function by regulating the secretion and function of MUC2 (Wei et al., 2012). Another recent study also demonstrated that palmitate enhances MUC2 production in goblet cells of intestine, leading to the establishment of a thick mucus gel, thereby maintaining the integrity of the gut barrier (Benoit et al., 2015). These findings imply that there might be a decrease in the two protective pathways through the communication between intestinal cells and microbial community, especially the species Ruminococcus torques may induce excessive mucus degradation of small intestine in CD patients with UGI tract involvement.
The main strength of this study is that it evaluated the potential metagenomic biomarkers for prediction of UGI tract involvement in CD patients through various analyses. Further, we analyzed the CD patients with new-onset disease, before the commencement of treatment. Changes in microbiota community structure important for disease pathogenesis are likely to be more evident in new-onset and treatment-naive patients than those undergoing treatment. Lastly, the study focuses on the mucosaassociated microbiota samples, which may be more relevant to disease pathogenesis and diagnosis than fecal samples. However, our study was limited by the small sample size. Another limitation was that it comprised of predominately white patients, and thus, the findings may not be generalized to other racial populations. Finally, although our study was focused only on the microbial community, microbial metabolites also have great potential for improving diagnosis of CD and reflect the abnormalities of the host intestine microbiota. Therefore, new biomarkers for CD patients with UGI involvement could be developed by integrated analysis of metabolomics and metagenomics from a multinational and multicenter cohort.

CONCLUSION
In conclusion, the species Ruminococcus torques in the tissue microbial community of CD patients might serve as a novel potential biomarker for UGI tract involvement. The UGI tract involvement in CD is higher in younger age group patients; therefore, it should be carefully monitored in them. The mechanisms of interactions between the host and Ruminococcus torques should be further investigated.

DATA AVAILABILITY STATEMENT
All datasets presented in this study are included in the article/Supplementary Material.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
MK designed the study. JC and HS analyzed and interpreted the data, and wrote the manuscript. JJ and JY supervised the project and revised the manuscript. All authors vouch for the data and analysis, have approved the final version, and agreed to publish the manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb. 2020.01162/full#supplementary-material FIGURE S1 | Analysis of alpha diversity as predicted by Chao 1 estimator, PD whole tree, and observed species (A); and beta diversity measured by weighted-UniFrac distances in L4 versus nonL4 groups (B). FIGURE S3 | Heat map of occurrence of markers by DiTaxa analysis between L4 and nonL4 groups. The rows are sorted based on the taxonomic marker assignments and the columns represent each group and are sorted firstly, based on their phenotype, and secondly, based on their pattern similarity.