A pilot study of the use of the oral and faecal microbiota for the diagnosis of ulcerative colitis and Crohn's disease in a paediatric population

Crohn's disease (CD) and ulcerative colitis (UC) are chronic inflammatory bowel diseases (IBD) that affect the gastrointestinal tract. Changes in the microbiome and its interaction with the immune system are thought to play a key role in their development. The aim of this study was to determine whether metagenomic analysis is a feasible non-invasive diagnostic tool for IBD in paediatric patients. A pilot study of the oral and faecal microbiota was proposed with 36 paediatric patients divided into three cohorts [12 with CD, 12 with UC and 12 healthy controls (HC)] and 6 months of follow-up. Finally, 30 participants were included: 13 with CD, 11 with UC and 8 HC (6 dropped out during follow-up). Despite the small size of the study population, a differential pattern of microbial biodiversity was observed between IBD patients and the control group. Twenty-one bacterial species were selected according to their discriminant accuracy, forming three sets of potential markers of IBD. Although IBD diagnosis requires comprehensive medical evaluation, the findings of this study show that faecal metagenomics or a reduced set of bacterial markers could be useful as a non-invasive tool for an easier and earlier diagnosis.


Metagenomics of the human microbiota
Changes in gene transcription are fundamental for the evolution of species. On the other hand, the conservation of regulatory mechanisms of gene expression has shed light on the evolution of human biology and disease when analysed by genetic models and comparative genomics.
The composition of microbial communities in the human body has attracted growing interest, especially in terms of their genetic characterization. Advances in DNA sequencing and the use of metagenomics and metatranscriptomics have made it possible to study complex microbial ecosystems without needing to isolate their members (1-4).
In the field of IBD, CD and UC patients are known to have a gut microbiome that differs significantly from that of healthy individuals. Various studies have shown that CD is related to a mismatch between the host and the intestinal bacterial flora, whose structure and function have been characterized by the application of metagenomics and transcriptomics (5)(6). Such an ecosystems biology approach has highlighted the link between the gut microbiota and functional alterations in the pathophysiology of CD and IBD in general.
The composition of the gut microbiome changes with age; in infants, most studies show that it resembles that of the adult at around the age of 3 (7). The infant gut is colonised by environmental microbes at birth, most of them originating from the mother's microbiome (vaginal, faecal, or skin) (8), the most common phyla being Firmicutes, Proteobacteria, Bacteroidetes and Actinobacteria. The most common in adults are Bacteroidetes and Firmicutes, which are usually the most abundant phyla in infants by the age of one. Compared with healthy children, paediatric IBD patients are known to have lower levels of several beneficial bacteria, such as the Bifidobacterium genus, the Firmicutes phylum and those from the genera Eubacterium, Ruminococcaceae and Clostridium, and higher levels of detrimental bacteria such as Escherichia coli, Veillonellaceae, Fusobacterium and Haemophilus parainfluenzae (9). Investigations of the oral microbiome also revealed that up to 40% of paediatric CD patients had oral involvement, with lower diversity in tongue samples than in healthy subjects (10).

Biodiversity analysis
In biodiversity measurement, a quantitative estimate of biological variability is obtained to compare biological entities composed of diverse components. It is possible to distinguish between alpha diversity (diversity within a particular area, community, or ecosystem), usually measured by counting the number of taxa (OTUs) within the ecosystem (families, genera, and species), and beta diversity (difference in species diversity between ecosystems), which involves comparing the number of taxa unique to each ecosystem.
In the case of alpha biodiversity, three dimensions are distinguished: abundance, richness, and diversity. Local abundance is the relative representation of a species in a particular ecosystem, usually measured as the number of individuals found per sample. Taxon richness is a measure of the total number of the considered taxa in a community. Diversity usually refers to both species number and equitability (or 'evenness') (11).
In the present study, alpha diversity was analysed using the function Analysis.Biodiver.Metagen() of the library BDSbiost3 (12). For alpha diversity analysis, the richness estimators (S, JACKNIFE2, CHAO), the diversity indexes (Shannon index [H], Simpson index [simp], inverse Simpson index [invsimp], alpha index [alpha]) and evenness (Pielou's index, J) were calculated. The statistical comparison of biodiversity between groups was performed using the mcpHill function from the library simboot (13).
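To make the listed estimators concrete, the following is a minimal Python sketch of how several of them are conventionally computed from one sample's OTU counts. It is not the implementation in Analysis.Biodiver.Metagen(); the function name alpha_diversity, the example counts, and the use of the bias-corrected Chao1 formula for "CHAO" are assumptions made for illustration. The second-order jackknife and the alpha index are omitted.

```python
import math


def alpha_diversity(counts):
    """Richness, diversity and evenness for one sample of integer OTU counts."""
    present = [c for c in counts if c > 0]
    if not present:
        raise ValueError("sample contains no reads")
    total = sum(present)
    props = [c / total for c in present]

    richness = len(present)                          # S: observed taxa
    shannon = -sum(p * math.log(p) for p in props)   # H, natural logarithm
    dominance = sum(p * p for p in props)
    simpson = 1.0 - dominance                        # simp
    inv_simpson = 1.0 / dominance                    # invsimp
    # Pielou's J: Shannon index divided by its maximum, ln(S)
    pielou = shannon / math.log(richness) if richness > 1 else 0.0

    # Bias-corrected Chao1 (assumed meaning of "CHAO"): uses the number of
    # taxa seen exactly once (f1) and exactly twice (f2)
    f1 = sum(1 for c in present if c == 1)
    f2 = sum(1 for c in present if c == 2)
    chao1 = richness + f1 * (f1 - 1) / (2 * (f2 + 1))

    return {
        "S": richness,
        "H": shannon,
        "simp": simpson,
        "invsimp": inv_simpson,
        "J": pielou,
        "chao1": chao1,
    }


# Hypothetical counts for five OTUs in one sample (the last one is absent)
example = alpha_diversity([5, 1, 1, 2, 0])
for name, value in example.items():
    print(name, round(value, 3))
```

A perfectly even community gives J = 1 and an inverse Simpson index equal to S, which is a quick sanity check when comparing against another implementation.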
The beta diversity analysis was carried out using the function Betabiodiversity() from the library BDSbiost3 (12) and represented as a network using the function ggraph() of the ggraph library, which extends ggplot2. Finally, the function coincidence.analysis() of the library BDSbiost3 (12) was used to compute coincidences with Venn diagrams.
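The idea of comparing taxa that are shared between groups or unique to one of them can be sketched with plain set operations. The Python example below is only an illustration: the functions venn_regions and jaccard_dissimilarity, the taxon labels and the group contents are hypothetical, and the Jaccard dissimilarity is an assumed stand-in for a presence/absence beta diversity measure, not the method used by Betabiodiversity() or coincidence.analysis().

```python
def venn_regions(groups):
    """Map each combination of group names to the taxa found in exactly those groups."""
    all_taxa = set().union(*groups.values())
    regions = {}
    for taxon in all_taxa:
        members = tuple(sorted(name for name, taxa in groups.items() if taxon in taxa))
        regions.setdefault(members, set()).add(taxon)
    return regions


def jaccard_dissimilarity(a, b):
    """1 minus the shared fraction of taxa; 0 means identical taxon lists."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Hypothetical taxon lists for the three cohorts
groups = {
    "HC": {"taxon_A", "taxon_B", "taxon_C", "taxon_D"},
    "CD": {"taxon_A", "taxon_B", "taxon_E"},
    "UC": {"taxon_A", "taxon_C", "taxon_F"},
}

for members, taxa in sorted(venn_regions(groups).items()):
    print(" & ".join(members), sorted(taxa))
print("HC vs CD:", jaccard_dissimilarity(groups["HC"], groups["CD"]))
```

Each key of the returned dictionary corresponds to one region of the Venn diagram, so the counts per region can be read off directly.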

OTU proportions and the origin of the bacterial communities
A binomial test of proportions, with the p-value (P) adjusted by the false discovery rate ("FDR") method for multiple hypothesis testing to limit Type I errors, was used to assess differential proportions of OTUs among the different groups. This test was performed using the function dif.propOTU.between.groups() of the library BDSbiost3, which is based on Fisher's exact test (12).
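The two ingredients named above, Fisher's exact test on a 2x2 table and an FDR adjustment across many OTUs, can be sketched as follows. This Python example is not the code of dif.propOTU.between.groups(); the function names and the counts are hypothetical, and the Benjamini-Hochberg procedure is assumed to be what "FDR" refers to.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low, high = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one
    p_value = sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical reads (OTU reads, other reads) in two groups for three OTUs
tables = {
    "otu_1": (30, 970, 5, 995),
    "otu_2": (12, 988, 10, 990),
    "otu_3": (50, 950, 20, 980),
}
raw = [fisher_exact_two_sided(*cells) for cells in tables.values()]
for name, p_raw, p_adj in zip(tables, raw, benjamini_hochberg(raw)):
    print(name, p_raw, p_adj)
```

Adjusting after all tests have been run, rather than test by test, is what keeps the expected proportion of false discoveries under control when many OTUs are compared at once.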

Discriminant and exploratory data analysis
Statistical analyses were performed using different R functions and libraries (14). The BDSbiost3 library for R (12) was used to assess the coverage of the sequenced reads and for discriminant and exploratory data analysis.
The coverage of the sequenced reads was analysed to assess the representativeness of the obtained OTUs, as previously described (15). For this purpose, the PILI3() function of the BDSbiost3 library was used, which allowed the computation of the rarefaction curve relating the number of reads to OTU abundance. The curve was extrapolated towards an infinite number of reads to check whether it had reached saturation or still had a margin to saturate.
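As an illustration of what a rarefaction curve measures, the Python sketch below computes the expected number of OTUs observed when n reads are drawn without replacement from a sample (the classical hypergeometric formula). It is not the PILI3() procedure and does not reproduce its extrapolation step; the function names and the counts are hypothetical.

```python
from math import comb


def expected_otus(counts, n):
    """Expected number of distinct OTUs in a random subsample of n reads."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of reads")
    all_draws = comb(total, n)
    # For each OTU, 1 minus the probability that none of its reads is drawn
    return sum(1 - comb(total - c, n) / all_draws for c in counts)


def rarefaction_curve(counts, step=1):
    """List of (reads, expected OTUs) pairs up to the full sequencing depth."""
    total = sum(counts)
    depths = list(range(0, total + 1, step))
    if depths[-1] != total:
        depths.append(total)
    return [(n, expected_otus(counts, n)) for n in depths]


# Hypothetical read counts for five OTUs in one sample
sample = [50, 30, 15, 4, 1]
curve = rarefaction_curve(sample, step=10)
for reads, otus in curve:
    print(reads, round(otus, 3))
```

A curve whose last increments are close to zero suggests that further sequencing would reveal few additional OTUs. Exact binomial coefficients become slow for very deep samples, so a real analysis would normally rely on a dedicated library.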
For the exploratory analysis, contingency tables (OTU abundance tables) were obtained separately for each sample group. These data followed a multinomial distribution (16) and allowed us to apply an exploratory dimension reduction technique using non-metric multidimensional scaling (nMDS). Discriminant analysis was computed using the function .MaLearning.Predict() of the BDSbiost3 library.
To study whether consortia or networks were formed among the microorganisms of each experimental group, and as part of the interpretation of the ecology of these microorganisms, the networks formed among bacterial species, genera and families were calculated. For this, several procedures within the BDSbiost3 library were used.
Espectral.CN() performs a spectral clustering network analysis for frequencies (taxon, omics data, counts) through sparse correlations or cosine similarity between columns (sparse matrices), with geometrical analysis and identification of communities (consortia). Miriam.Network() performs a network analysis for frequencies (taxon, omics data, counts) and Gaussian graphs, with the possibility of obtaining a complexity index. Espectral.CN() provides different procedures to separate the connections (nodes and edges) formed between microorganisms, as derived from the correlations established between species. Each possible separation represents a consortium; the main methods used were Walktrap and Girvan-Newman (18), algorithms widely applied in social network analysis to detect associations and user communities. We also obtained the main statistics of the resulting networks: number of connections within the network (nc), network diameter or effective size of the network (nd), number of groups formed (g) and connection quality (cq). The latter is a pseudo-qualitative appraisal of how the bacteria are connected.
In addition, a new complexity index is proposed, based on the number of nodes and connections in the networks formed between the different microorganism species, using the ggraf procedure and only the 30 most abundant species (because of computational limitations of the procedure). This index is implemented in the function Miriam.Network().
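The general idea of building a network from cosine similarity between taxa and then counting connections and groups can be sketched as below. This Python example is deliberately simplified and is not the algorithm of Espectral.CN() or Miriam.Network(): groups are taken to be connected components, a much cruder stand-in for Walktrap or Girvan-Newman communities, and the similarity threshold, function names and abundance values are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two abundance profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_edges(abundance, threshold=0.9):
    """Connect two taxa when their profiles across samples are similar enough."""
    return [
        (a, b)
        for a, b in combinations(sorted(abundance), 2)
        if cosine(abundance[a], abundance[b]) >= threshold
    ]


def connected_groups(nodes, edges):
    """Connected components of the network (isolated taxa form their own group)."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical counts of four taxa across three samples
abundance = {
    "t1": [1, 2, 3],
    "t2": [2, 4, 6],
    "t3": [3, 0, 0],
    "t4": [6, 0, 1],
}
edges = build_edges(abundance)
groups = connected_groups(abundance, edges)
print("connections (nc):", len(edges))
print("groups (g):", len(groups), groups)
```

Raising the threshold removes weaker connections and tends to split the network into more, smaller groups, so any count of groups should be reported together with the threshold used.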

Characterization of bacterial populations through inferential statistical analysis
Binomial tests were used to interpret the clinical evolution of the patterns and to build a predictor that would allow these changes to be detected. A graphical summary of the results for bacterial species is provided in Supplementary Figure 2. The graph shows the main differences between bacterial species (statistically significant in terms of relative frequency) when comparing the sample groups based on cohorts and collection time. The number of bacterial species for each group is specified, including those whose frequency differs by a minimum of 2%, 5% or 10% between groups (e.g., 3-2-1 indicates 3, 2 and 1 species with a difference > 2%, > 5% and > 10%, respectively). The graph lists the bacterial species with more than 10% difference in frequency between groups: negative differences are shown in green and positive differences in red.
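The "3-2-1" notation can be reproduced with a few lines of Python. The sketch below assumes that the thresholds are cumulative (a species differing by more than 10% is also counted under > 2% and > 5%) and, for brevity, it skips the significance filter described above; the function name, species labels and frequencies are hypothetical.

```python
def threshold_summary(freq_a, freq_b, thresholds=(2.0, 5.0, 10.0)):
    """Count species whose relative frequency (in %) differs by more than each threshold."""
    species = set(freq_a) | set(freq_b)
    # Signed difference: positive when the species is more frequent in group A
    diffs = {sp: freq_a.get(sp, 0.0) - freq_b.get(sp, 0.0) for sp in species}
    counts = [sum(1 for d in diffs.values() if abs(d) > t) for t in thresholds]
    return "-".join(str(c) for c in counts), diffs


# Hypothetical relative frequencies (%) in a patient group and in controls
patients = {"species_a": 15.0, "species_b": 8.0, "species_c": 4.0, "species_d": 1.0}
controls = {"species_a": 3.0, "species_b": 2.0, "species_c": 1.0, "species_d": 0.5}

label, diffs = threshold_summary(patients, controls)
print(label)
for sp in sorted(diffs):
    if abs(diffs[sp]) > 10.0:
        print(sp, diffs[sp])
```

The sign of each retained difference is what the green and red colouring in the graph encodes.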
Notably, the main variations in bacterial frequency were found in faecal samples. In CD patients, a relative increase of Streptococcus sanguinis, Streptococcus mitis and Bacteroides fragilis was observed compared with the control group. Both streptococcal species have been associated with colonization and pathogenesis, the former mainly with oral biofilm formation and infective endocarditis (19)(20)(21). The enterotoxigenicity of Bacteroides fragilis has been associated with the occurrence of IBD (22)(23)(24). Phocaeicola vulgatus (Bacteroides ovatus) is reported to have a protective role against dextran sulfate sodium- and lipopolysaccharide-induced colitis (25)(26), although some studies indicate that both Bacteroides vulgatus and Bacteroides thetaiotaomicron have the capacity to induce severe ulcerative disease (27).

1.3.2. Consortium networks formed between microorganisms: the biological system according to the law of system completeness
Possible microbial consortia or networks were explored to shed light on the ecology of these microorganisms and their behaviour in the host. Networks between species, genera and families were analysed and the complexity indexes of the consortia formed by the most abundant species (over 30-35) were calculated. The results, shown in Supplementary Figure 3 and Supplementary Table 5, indicate that the bacterial complexity in the control cohort is greater than in IBD patients, which agrees with the higher microbial biodiversity observed in that group. The lowest complexity was observed in the networks generated at disease onset in CD and UC patients, and although it increased throughout the treatment period, it remained lower than in the control cohort at 6M. Nevertheless, this result should be treated with caution due to the small number of samples.

Supplementary Figures and Tables
This section lists the supplementary tables and figures that accompany the main article.

Supplementary Tables
Supplementary Table 1. Result of the binomial test applied to percentages of bacterial species detected in the different groups. The controls for the different groups of samples are shown in green.
Supplementary Table 5. Summary of the statistical analysis of the bacterial consortia in the different groups during the clinical evolution of the IBD patients using complex networks.
The function .MaLearning.Predict() of the BDSbiost3 library allowed the evaluation of 5 different discriminant methods: linear discriminant analysis (LDA), support vector machine (SVM), xboosting (Xboost), kernel discrimination (kernel) and artificial neural nets (ANN). The results also included the final classification accuracy and a confusion matrix for each of the discrimination methods performed.
The relationship between the abundance (OTUs) of the species in each sample and their control (e.g., FE-DE and FE-HC groups) was mainly analysed using the Bray-Curtis distance, although the non-parametric Spearman correlation (17) was also applied to assess the significance of the changes in biodiversity in relation to the different environmental variables; this statistic is referred to as np.cor in the text. As all the variables under consideration were quantitative and continuous (i.e., biodiversity, clinical activity, clinical activity at the end of the visit, DNA concentration and purity, np.cor), the related data were summarized by means of principal component analysis (PCA). This analysis can significantly reduce the number of variables while still retaining much of the information in the original data set, and can graphically represent various hypotheses simultaneously. To represent the microbiological data, biodiversity, DNA concentration and np.cor, as well as some of the clinical variables of the children for all the experimental groups, PCA was carried out using the prcomp function (R stats package) together with the library factoextra.
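The two statistics used to relate each sample to its control, the Bray-Curtis distance and the Spearman rank correlation, have simple textbook definitions, sketched below in Python. This is not the BDSbiost3 code and it does not compute the significance of the correlation; the function names and abundance vectors are hypothetical.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    total = sum(a + b for a, b in zip(x, y))
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical OTU abundances in a patient sample and in its control
patient = [6, 7, 4, 0, 2]
control = [10, 0, 6, 1, 3]
print("Bray-Curtis:", bray_curtis(patient, control))
print("Spearman:", spearman(patient, control))
```

Bray-Curtis is driven by the abundant taxa, whereas the rank-based Spearman correlation is insensitive to the magnitude of the counts, which is one reason the two are often reported side by side.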

In A to D are shown, respectively, the pie charts of percentages for the following groups: faecal samples from ulcerative colitis (FE-UC) and Crohn's disease (FE-CD) patients, and oral samples from UC (OR-UC) and CD patients. HC (or CO), DE, 3M and 6M refer to the control cohort and to the sampling points at disease onset, 3 months and 6 months, respectively. For simplification, only species with a relative frequency > 2% and genera and families with a relative frequency > 1% are represented.

Some examples of bacterial species consortia formed in the samples during the clinical evolution of the patients, using two different grouping methods (Walktrap and Girvan-Newman communities). Each panel presents two examples. In A, several well-formed connected groups for the controls: HC-FE (left) and HC-OR (right). In B, several well-formed connected groups for DE-CD-FE (left) and DE-UC-FE (right). In C, connected groups for 3M-CD-FE (left) and 3M-UC-FE (right). Finally, in D, connected groups for 6M-CD-FE (left) and 6M-UC-FE (right).

... to bacterial species. The formulas indicate the number of statistically different families, genera, and species between stages of the experimental design (e.g., FECAL-CD HC-DE 13-3-1: 13 families, 3 genera and 1 species are statistically different between the healthy control (HC) samples and faecal samples in the Crohn's disease (CD) patients at disease onset (DE)).

Supplementary Table 2. Result of the binomial test applied to percentages of bacterial genera detected in the different groups.

Supplementary Table 3. Result of the binomial test applied to percentages of bacterial families detected in the different groups.

Supplementary Table 4. Biodiversity estimators (richness and diversity) by