Inference of Significant Microbial Interactions From Longitudinal Metagenomics Data

Data of next-generation sequencing (NGS) and their analysis have been facilitating advances in our understanding of microbial ecosystems such as human gut microbiota. However, inference of microbial interactions occurring within an ecosystem is still a challenge mainly due to sequencing data (e.g., 16S rDNA sequences) providing relative abundance of microbes instead of absolute cell count. In order to describe growtth dynamics of microbial communities and estimate the involved microbial interactions, we introduce a procedure by integrating generalized Lotka-Volterra equations, forward stepwise regression and bootstrap aggregation. First, we successfully identify experimentally confirmed microbial interactions based on relative abundance data of a cheese microbial community. Then, we apply the procedure to time-series of 16S rDNA sequences of gut microbiomes of children who were progressing to Type 1 diabetes (T1D progressors), and compare their gut microbial interactions to a healthy control group. Our results suggest that the number of inferred microbial interactions increased over time during the first 3 years of life. More microbial interactions are found in the gut flora of healthy children than that of T1D progressors. The inhibitory effects from Actinobacteria and Bacilli to Bacteroidia, from Bacteroidia to Clostridia, and the beneficial effect from Clostridia to Bacteroidia are shared between healthy children and T1D progressors. An inhibition of Clostridia by Gammaproteobacteria is found in healthy children that maintains through their first 3 years of life. This suppression appears in T1D progressors during the first year of life, which transforms to a commensalism relationship at the age of 3 years old. Gammaproteobacteria is found exerting an inhibition on Bacteroidia in the T1D progressors, which is not identified in the healthy controls.


INTRODUCTION
The human gut microbiota, an ecosystem comprising a diverse collection of interactive microbial species, plays a key role in nutrition, metabolism, physiology, and immune function in humans (Sekirov et al., 2010). An interaction between two microbes can either be neutral (0), positive (+), or negative (−) to the fitness of the interacting microbes. Based on the overall effects, an interaction can be classified into commensalism (0/+), amensalism (0/−), mutualism (+/+), parasitism (+/−), competition (−/−), or no interactions (0/0). Interactions among microbial species characterize the composition and function of gut microbiota, thereby influencing the host's health (Rios-Covian et al., 2016). The microbial interactions in the gastrointestinal tract are complex, flexible, and capable of adapting to physiological perturbations (Sun and Chang, 2014). The change of one species may shift the relative abundances of other members in the community and affect the community's functional capacity. A better understanding of ecological dynamics and microbial interactions is essential to investigate the consequences of taxonomic perturbations.
By using NGS technologies, such as 16S rRNA gene profiling, it is now possible to follow the time-evolution of a microbial population by measuring abundance of bacterial species in a microbial community. However, analyzing the microbial community dynamics from genomic survey data is not straightforward, because sequencing data provide relative abundances of microbial species based on a fixed total number of sequences rather than absolute cell count. The high density of genomic survey data and its compositional nature are the most common challenges involved in inferring microbial interaction networks. Hence, the power of computational approaches allows to address this challenge. Mathematical and computational approaches make it possible to analyse highly complex microbial communities.
Correlation networks have proved useful for detecting biological interactions. There are many different techniques for computing correlation networks, for example Pearson correlation coefficient (Pearson, 1909), Spearman correlation coefficient (Spearman, 2010), Bray-Curtis distance (Roger Bray and Curtis, 1957), Local Similarity Analysis (Ruan et al., 2006;Beman et al., 2011;Steele et al., 2011;Xia et al., 2013), Maximal Information Coefficient (Reshef et al., 2011), SparCC (Friedman and Alm, 2012) based on Aitchison's log-ratio analysis (Anders and Huber, 2010) and CoNet which combines information from several different standard comparison metrics (Faust et al., 2012). Weiss et al. (2016) benchmarked the performance of these techniques in dealing with the relative abundance in simulated data specific to microbiome studies. Despite the correlation networks are useful in studying some overall biological relationships, they have significant limits in deciphering non-linear interactions such as between three or more species, and are unable to detect relationships like amensalism and partial-obligate-syntrophy (Morris et al., 2013;Weiss et al., 2016). Indeed, two species may be correlated even if they do not directly interact with each other.
The use of non-linear differential equations like Lotka-Volterra model is an alternative approach to study microbial interactions. The generalized Lotka-Volterra (gLV) equations are able to describe the time-dependent population dynamics and predict ecological relationships (i.e., mutualism, commensalism, parasitism, and competition) between members of different biological species. Several studies have attempted to use gLV equations studying microbial communities that consist of multiple bacterial species (Stein et al., 2013;Fisher and Mehta, 2014;Marino et al., 2014;Bucci et al., 2016;Shaw et al., 2016). In particular, Mounier et al. used gLV equations describing the population dynamics of a cheese microbial community consisting of five microbial groups with experimental data of the absolute cell numbers (Mounier et al., 2008). This model predicted some microbial interactions that were confirmed by co-culture experiments afterwards. Fisher and Mehta (2014) have proposed an approach named Learning Interactions from Microbial Time Series (LIMITS), which implements forward stepwise regression with median bootstrap aggregation for gLV equations inferring the species interactions in the human gut microbiomes. By combining forward stepwise regression with bootstrap aggregation, LIMITS was able to overcome a statistical issue termed "errors-in-variables" (accounting for measurement errors in the independent variables) and infer species interactions from time series relative abundances of species (Fuller, 1980). However, these inferred interactions were not experimentally confirmed.
Although these tools brought some insights to infer microbial interactions, they have no proven ability to infer biologically confirmed microbial interactions from relative abundance data. Here, we propose a procedure by integrating gLV equations, forward stepwise regression, bootstrap aggregation and onesample t-test. First, we test and validate this procedure by correctly identifying microbial interactions from relative abundances of a cheese microbial community (Mounier et al., 2008). For application, we then apply our procedure to infer microbial interactions within gut microbiota of healthy and type 1 diabetes (T1D) progressing infants (seroconverters for diabetes autoimmune) from a set of reported data (Kostic et al., 2015).

Procedure Validation on Experimentally Confirmed Data: Identification of Significant Microbial Interactions Within a Cheese Microbial Community
The procedure was tested through describing the dynamics of relative abundances of five microbial groups and estimating their interactions within a cheese microbial community. In the original study, the microbial population growth was measured in cell count, which was converted into relative abundances as input of our procedure. For initial conditions, the intrinsic growth rates and inter-species interaction coefficients were set at 1 generation/day and 0, respectively. The carrying capacity is 2e 10 CFU/g according to coculture studies (Mounier et al., 2008). A diagram of implementation of the procedure is given in Figure 1.
The gLV modeling succeeded in describing the changes of relative abundance of the microbial populations, and roughly predicted the growth trends of each member of this community (Figure 2). The inferred intrinsic growth rates of the microbes and their significant interactions were shown in Figure 3. In particular, some of the inferred interactions ( Figure 3B) were previously confirmed in the co-culture experiments, including the promotion effects from G. candidum to the bacterial group and Leucobacter sp., and the inhibitory effects of G. candidum on D. hansenii (Mounier et al., 2008). The estimated intrinsic growth rate of the bacterial group was low ( Figure 3A), indicating that their growth was highly influenced by other members in the community. A benefit effect of G. candidum on the bacterial group was inferred, suggesting that the growth of the bacterial group partially relies on G. candidum. According to the coculture study (Mounier et al., 2008), the growth of some bacteria, such as Brevibacterium aurantiacum, was actually relying on G. candidum was identified.
The sensitivity of our procedure was assessed by varying the number of bootstrapping samples and the size of training data (ranging from 100 to 1,000 bootstraps). The performance of this procedure was influenced by the number of bootstrap samples. We observed that some significant interactions were detected using a relatively low number of bootstrap samples. For instance, the suppression of bacterial group by D. hansenii was identified as low as 100 bootstraps while it took 700 bootstraps to identify bacteria inhibiting G. candidum (Supplementary Table S1). The frequency of these interactions may be an indicator of their weight in shaping the community dynamics.
In summary, our procedure was able to identify ground-truth microbe-microbe interactions within a microbial community through relative abundance data.
Application on Time-Series 16S rRNA Gene Sequence Data: Gut Microbial Interaction Networks of Healthy Children vs. T1D Progressors Next, we used this approach to study the gut microbial dynamics in healthy children and those who were progressing to T1D during their first 3 years of life. Kostic et al. (2015) examined the composition dynamics of gut microbiomes in 33 children genetically predisposed to T1D from birth until 3 years of age with monthly sampling. The authors observed a relative FIGURE 2 | Producing the growth dynamics of the cheese microbial community. An example of using gLV equations and forward stepwise regression describing the growth dynamics of the cheese microbial community in (A) relative abundance and (B) cell count. Experimental data (Mounier et al., 2008) is indicated by dots and the predicted population dynamics are indicated by curves. The fitting performance is measured in root-mean-square error (RMSE). A, Actinobacteria; Bi, Bacilli; Ba, Bacteroidia; C, Clostridia; G, Gammaproteobacteria. reduction in alpha-diversity in the gut of children who progress to T1D compared to the seroconverters defined as positive for at least two autoantibodies (no T1D developers occurred during the follow-up) and in non-converters' gut. We applied the proposed procedure on the 16S rRNA sequencing data reported in (Kostic et al., 2015). We investigated the class level gut bacterial interactions in T1D-associated and healthy infants at different age stages. The carrying capacity of intestinal microbial community was assumed to be 1e 11 cells/g of feces [given the average bacteria number 3.4e 10 ± 3e 10 cells/g feces (Savino et al., 2017)]. The resulting intrinsic growth rates and interaction coefficients of the five bacterial classes were given Frontiers in Microbiology | www.frontiersin.org in Supplementary Table S3 for the two groups at different age stages.
The gut microbial interaction network consisted of just a few amensalism interactions for the first year of life, but became more complex over time (Figure 4). There were more bacterial interactions identified in healthy children than T1D progressors after the age of 1 year old. Some interactions were shared between healthy controls and T1D progressors, including the inhibitory effects Gammaproteobacteria to Clostridia for 1 year of age, the inhibitory effect from Actinobacteria to Bacteroidia and the promotion effect from Clostridia to Bacteroidia.
Three interactions were identified uniquely in the healthy controls: inhibitions of Actinobacteria by Bacteroidia and Clostridia, and a promotion from Gammaproteobacteria to Actinobacteria. An inhibitory effect from Gammaproteobacteria to Clostridia was established within the first year of life which maintained through the following 3 years (top row of Figure 4). Intriguingly, the effect from Gammaproteobacteria to Clostridia was identified as amensalism during the first year of life for T1D progressors, which transformed to a benefit effect at their age of three. An inhibitory effect from Gammaproteobacteria to Bacteroidia was only observed in the T1D progression group (bottom row of Figure 4). The relationship between Bacilli and Clostridia was amensalism for healthy children, but commensalism for T1D progression children at age of three.

DISCUSSIONS
Investigating the gut microbial community composition and their interaction patterns are crucial to understand the role of gut microbiota in maintaining human health and causing diseases. However, a systematic approach for accessing microbial interactions has been lacking. Some methods have been attempted to infer microbial interaction networks, such as correlation networks and mathematical modeling. Although these attempts have contributed in creating interaction networks, none of their resulting microbial interactions have been experimentally confirmed. In the present study, we showed that a combination of gLV equations, forward stepwise regression, and bootstrap aggregation was able to infer microbial interactions from longitudinal relative abundance data. In order to test the performance this procedure, we used the dataset of a cheese microbial community with known microbial interactions (Mounier et al., 2008). Our proposed procedure was able to correctly infer some experimentally confirmed microbial interactions, such as the promotions of G. candidum on Leucobacter sp. and on a group of bacteria, and a suppression of D. hansenii by G. candidum. In addition, the estimated changes in cell counts by our procedure roughly recapitulated the growth trends of each microbial group in the community. To access the performance of this procedure, we tested the cheese dataset on MetaMIS software (Shaw et al., 2016),  Table S2). However, our proposed procedure also provided bacterial growth rates as they are important information to decipher microbial population dynamics. Despite the potential of our proposed method in dealing with compositional data, it could not explain all the observations from the co-culture experiments. For example, our procedure predicted an inhibitory effect of D. hansenii on the bacterial group, which was inconsistent with the experimental observation (Mounier et al., 2008). In addition, the interaction coefficient may be not proportional to the microbial abundances (e.g., variable a in Equation 1 does not only depend on the abundance of the microbes). This level of complexity was not considered in our current procedure, but it is an important mechanism needed to be addressed in the future.
One goal of comparative metagenomics is to identify meaningful changes in the microbiome's taxonomic and functional composition that are associated with health and disease. Disruptions of the process of gut microbial colonization during childhood have been shown to be associated with an increased risk of adiposity and pathogenesis of autoimmune disorders such as T1D (Huh et al., 2012;Blustein et al., 2013;Murri et al., 2013;Kostic et al., 2015). To address how alterations of the gut microbial community composition may contribute to childhood disease, we must first investigate the normal dynamics of the community in the growing infant. By comparing the gut taxonomic trajectories of children who progress to T1D compared to healthy controls, Kostic et al. (2015) have found general changes in abundance of the gut microbial community and the timing of these changes. They identified some microbes strongly correlated which was consistent across most healthy subjects (Kostic et al., 2015), such as a positive correlation between Gammaproteobacteria and Actinobacteria or a negative correlation between Gammaproteobacteria and Clostridia (Kostic et al., 2015). These correlations can be explained by the inferred interactions from our procedure, e.g., promotion effect from Gammaproteobacteria to Actinobacteria and inhibition effect from Gammaproteobacteria to Clostridia.
By applying our procedure to a longitudinal time-series study of gut microbiota in children (Kostic et al., 2015), we demonstrated that the gut bacterial interaction network (at class level) was getting more complex along with age, with more interactions inferred for the healthy children than the T1D progressors. Some microbial interactions were shared between healthy children and the T1D progressors, such as inhibitory effects from Actinobacteria and Bacilli to Bacteroidia. Although the composition of the microbiome community was changing along with child development, some of the bacterial interactions, in particular at higher phylogenetic levels such as class, remained stable after establishment. For example, the inhibitory effect from Gammaproteobacteria to Clostridia was established for healthy children cohort in the first year of life and maintained through the age of three. In T1D progressors, whereas the amensalism relationship between Gammaproteobacteria and Clostridia appeared during their first year of life, which became commensalism (i.e., Gammaproteobacteria enhancing the growth of Clostridia) at the age of three. The inhibition from Gammaproteobacteria to Bacteroidia maintained during the first 3 years of life for T1D progressors, which was not detected in the healthy controls.
It has been observed that the quantity of Clostridium (genus level, belongs to the class Clostridia) correlated positively and significantly with the plasma glucose level in diabetic children. In newborn infants, Gammaproteobacteria have been shown to cause a healthy level of inflammation in their intestines, protecting them from excessive inflammatory, and autoimmune disorders later in life (Mirpuri et al., 2014). From the data of (Kostic et al., 2015), the proportion of Gammaproteobacteria was indeed lower in the T1D progression children group than healthy controls during the first months of life (e.g., about 11% vs. 16% for the two groups within 180 days after birth). In addition, Bacilli was also found to promote the growth of Clostridia in the gut community of T1D progression children, whereas it contributed to inhibition of Clostridia for the healthy controls. Thus, our results suggest that the inhibitory effects of Gammaproteobacteria and Bacilli on Clostridia might play a role in regulating plasma glucose in early live and protect children from developing to T1D. Further studies are necessary to test this hypothesis.

CONCLUSIONS
Here we propose a procedure which was capable of inferring experimentally confirmed microbial interactions out of compositional data. By using our procedure, we identified some similarities as well as differences in the gut bacterial interaction networks between children toward T1D progression and healthy controls in their first 3 years of life. The interaction networks were getting more complex in the gut microbiome of children after their first year of life. The number of interactions was higher in healthy children than in T1D progressors. Some bacterial interactions were exclusively found in each group, which may be able to predict the T1D progression state. Our results provide potential new insights into the relationship between gut microbial interactions and infants' T1D progression. In the future, by incorporating microbiome's functional shift (Manor and Borenstein, 2017), the procedure presented above might help to interpret disease-specific changes of microbial interactions and accordingly help to predict disease progression.

Generalized Lotka-Volterra Equation
We used gLV equations to describe an ecological community consisting of n microbiome taxa. For taxon i, the population dynamics is described as where x i (t) is the absolute abundance of taxon i at time t, N(t) the community size at time t, n the number of species in the ecosystem, and K the carrying capacity of the community (a logistic growth component). r i represents the intrinsic growth rate of taxon i, and a ij the effect that taxon j has upon taxon i, which is proportional to the relative abundance of species j (i.e., x j (t) N(t) ). We assumed the interactions between species i and j is not necessarily bilateral, which can take one of six possible forms based on the signs of a ij /a ji (i = j); +/− (parasitism), −/− (competition), +/+ (mutualism), +/0 (commensalism), −/0 (amensalism), and 0/0 (neutral). For simplicity, we did not consider intra-species interactions, such as the competition for resources between the members of a same species. However, a carrying capacity of the environment was introduced. We fitted relative abundance data to the gLV equations using the Levenberg-Marquardt algorithm for non-linear least squares minimization, implemented by LMFIT (Newville et al., 2016) which is a high-level interface to non-linear optimization and curve fitting tool for Python.

Forward Stepwise Regression
We used a forward stepwise regression method for the gLV model selecting interactions that explain most of the population dynamics of microbial species. This method started with no interaction coefficients in the model (i.e., only with growth terms), testing the addition of each interaction coefficient using a comparison criterion, adding the interaction if it improves the model the most, and repeating this process until no improvement of the model fitting. Stepwise regression is prone to overfitting the data, which arises from searching a large space of possible models. We use Bayesian information criterion (BIC) (Kass and Wasserman, 1995) as a selection criterion in order to solve the overfitting problem by introducing a penalty term for the number of parameters in the model. BIC is calculated by where U is the number of data points, and V the number of variable parameters in the glV model. Residual sum of squares (RSS) over the whole data set is given by where x i N and xˆi Nˆa re the actual observed and predicted relative abundance, respectively. T is the total number of time points, and n is the number of species in the ecosystem.
Therefore, a lower BIC implies either a better fit, fewer explanatory variables, or both.

Bootstrap Aggregating (Bagging)
In order to attenuate the instability problem of forward stepwise regression, we introduced bootstrap aggregating strategy for noise-reduction (Breiman, 1996) by randomly partitioning the data into two sets: training set and test set. A typical way to test for accuracy in models created by stepwise regression is to evaluate the models against a test set that was not used for model creation. Accuracy is then calculated as errors between the values predicted by the model and data in the test set. In order to assess both the descriptive and predictive power, we fitted the models to the training set but evaluate the models to the whole data set. The procedure is performed in the following steps: This forward stepwise regression process was repeated many times (up to 1,000 bootstraps in this study), each resulting in a set of interaction coefficients that compose a specific gLV model. The python code we used for generating candidate models is provided in Supplementary Code S1. These models were then aggregated. Next, we removed models that violated biological constrains (e.g., based on a minimum population size and/or a minimum number for each species in the community). After filtering, we selected models with minimum BIC values (the minimum BIC + 10% allowance), and their parameters (growth rates and interaction coefficients) were compiled. Finally, the significant interactions were determined through one-sample t-test with the value of P(a ij = 0) > 95%.

investigated interactions between
yeast and bacteria within a cheese microbial community composed of three yeasts (Debaryomyces hansenii 1L25, Geotrichum candidum 3E17, and Yarrowia lipolytica 1E07) and six bacteria (Arthrobacter arilaitensis 3M03, Brevibacterium aurantiacum 2M23, Corynebacterium casei2M01, Hafnia alvei 2E12, Leucobacter sp. strain 1L36, and Staphylococcus xylosus 1L18). Two experiments were conducted to measure the microbial dynamics during the development of the ecosystem. In the first one, the dynamic of the full ecosystem was studied. Cheeses were sampled in duplicate every day for 21 days for microbial enumeration together with measurements of lactose, lactate content, and pH. The second experiment aimed to investigate the effect of the absence of one, two or three yeasts (all combination were tested) in the ecosystem development.
To test our procedure, we studied the development dynamics of this cheese microbiome community [data were obtained from (Mounier et al., 2008)]. Kostic et al. studied the impact of the gut microbiome dynamic on type 1 diabetes (T1D). They recruited between September 2008 and August 2010 in Finland and Estonia 33 newborns with positive cord blood testing for HLA DR-DQ, which are alleles conferring risk of T1D. The infants were followed-up until the age of 3 years with monthly stool samples. Data regarding infections, drugs consumption (in particular antibiotics), diet were collected. Serum samples were collected at 0 (cord blood), 3, 6, 12, 18, 24, and 36 months to test for 4 diabetesassociated autoantibodies. 16S rRNA sequencing were performed for accessing the fecal microbiota composition. Eleven children had at least two positive autoantibodies seroconversion during their follow-up but did not develop T1D (i.e., seroconverters).
Four children progressed to T1D by the end of the study. The 11 seroconverters were matched with the 22 healthy controls for gender, HLA genotype and country. We applied our procedure on a set of 16S rDNA sequences (in OTUs) of the gut microbiota of 11 T1D progression children (seven seroconverters and four who progressed to T1D) and 22 healthy controls between the age of 0-3 years old [data were obtained from Supplementary Table S2 published in Kostic et al. (2015)]. In specific, we used the relative abundance of five bacterial classes: Actinobacteria; Bacteroidia, Bacilli, Clostridia, and Gammaproteobacteria.

AUTHOR CONTRIBUTIONS
XG and LO conceived the method and analyzed the data. XG and LO wrote the manuscript with help from B-TH, DG, and PG. LO and PG were responsible for management of the project.

FUNDING
This work was supported by the program Projets Transversaux de Recherche PTR 2015 (grant no PTR558). This work was also supported directly by internal resources of the French National Institute for Health and Medical Research (Inserm), the Institut Pasteur, the University of Versailles-Saint-Quentin-en-Yvelines (UVSQ) and the French Government's Investissement d'Avenir program, Laboratoire d'Excellence Integrative Biology of Emerging Infectious Diseases (grant no ANR-10-LABX-62-IBEID).

ACKNOWLEDGMENTS
A previous version of this article was published as preprint on BioRxiv (Gao et al., 2018).