A Bayesian Framework for the Classification of Microbial Gene Activity States

Numerous methods for classifying gene activity states based on gene expression data have been proposed for use in downstream applications, such as incorporating transcriptomics data into metabolic models in order to improve resulting flux predictions. These methods often attempt to classify gene activity for each gene in each experimental condition as belonging to one of two states: active (the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active). These existing methods of classifying gene activity states suffer from multiple limitations, including enforcing unrealistic constraints on the overall proportions of active and inactive genes, failing to leverage a priori knowledge of gene co-regulation, failing to account for differences between genes, and failing to provide statistically meaningful confidence estimates. We propose a flexible Bayesian approach to classifying gene activity states based on a Gaussian mixture model. The model integrates genome-wide transcriptomics data from multiple conditions and information about gene co-regulation to provide activity state confidence estimates for each gene in each condition. We compare the performance of our novel method to existing methods on both simulated data and real data from 907 E. coli gene expression arrays, as well as a comparison with experimentally measured flux values in 29 conditions, demonstrating that our method provides more consistent and accurate results than existing methods across a variety of metrics.


INTRODUCTION
Numerous approaches to understanding and utilizing gene expression measurements attempt to classify them into one of two states: active (roughly speaking, the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active) (Ferrell, 2002;Abel et al., 2013;Gallo et al., 2015). We label this classification a determination of the gene activity state.
These approaches are becoming more and more relevant with continued dramatic increases in the quantity and diversity of transcriptomics data as prices to obtain data continue to decline. In particular, recent approaches to metabolic modeling (MM) have focused on the integration of multiple sources of genetic information including transcriptomics data (Pfau et al., 2011;Lewis et al., 2012;Bordbar et al., 2014;Chubukov et al., 2014;Machado and Herrgård, 2014;Monk et al., 2014;Rezola et al., 2014). In these approaches, gene activity states are usually incorporated into constraints on the fluxes through reactions associated with the gene products. For example, GIMME (Becker and Palsson, 2008) applies a user-specified expression level threshold to classify gene activity states in any given experiment, then computes a penalty for flux through any reaction associated with an inactive gene; Flux Balance Analysis (FBA) is then constrained by minimizing the sum of penalties across all reactions in the model. PROM (Chandrasekaran and Price, 2010) uses a version of this approach in which the researcher finds the user-specified expression level threshold by assuming that a pre-defined percentage of all genes in an experiment are active (e.g., 33% Chandrasekaran and Price, 2010 or 50% Dunman, Personal Communication.). Others have proposed approaches in a similar spirit (Jerby and Ruppin, 2012), while some have allowed for more uncertainty through the addition of an "unsure" state, which yields no corresponding flux constraint (Shlomi et al., 2008), or by examining relative expression values . Some  suggest that large relative changes in gene expression (above a threshold) signal a shift from one state to the other, while others (Van Berlo et al., 2011) use both absolute and relative changes. Most recently, some have proposed continuous approaches whereby larger expression values for an experiment are classified as more likely to be active, and lower expression values for an experiment are classified as less likely to be active (e.g., GIM3E, Schmidt et al., 2013). Constraints on FBA via the estimated gene expression states are "soft" in that they can be violated to allow for uncertainty in expression state classification and also allow for potential post-transcriptional control; precise handling of such violations varies among approaches but typically involves a penalty term in the linear programming problem. While PROM classifies gene states in the standard manner, PROM does not directly constrain FBA based on the states.
While not all MM approaches to integration of transcriptomics data attempt to classify gene activity measurements into two states (Colijn et al., 2009;Moxley et al., 2009;Fang et al., 2012;Kim and Reed, 2012;Lee et al., 2012;Navid and Almaas, 2012), integrated MM approaches which do classify gene states suffer from at least four major limitations. First, many of the existing methods do not allow for different activity state thresholds between genes (e.g., gene A is assigned a threshold of 9.25 on a log-scale between active and inactive states; whereas gene B is assigned a threshold of 10). Second, many existing methods do not allow differences in the proportion of genes that are active from one experiment to another (e.g., a higher proportion of genes are expected to be classified as active in a minimal media condition than in a rich media condition).
Third, existing methods do not leverage a priori knowledge about potential gene co-regulation to improve activity state classification (e.g., given that two genes A and B are co-regulated, if A is classified as active, B should also be classified as active). Finally, almost all existing methods do not meaningfully estimate statistical uncertainty in the classification process; the typical approaches to classifying genes using Boolean rules (every gene is either active or inactive) do not attempt to incorporate uncertainty in the classification. While some have attempted to incorporate uncertainty into the classification process (e.g., some have an "unsure" classification Shlomi et al., 2008), all approaches (including PROM Chandrasekaran and Price, 2010) incorporate post-hoc uncertainty adjustments by allowing violations of the gene activity state using penalties which apply similarly across all genes-in essence uniformly down-weighting the impact of expression data to account for upstream processing uncertainty. Here we propose improvements to gene activity state classification that address these limitations. Our work is motivated by the observation that many researchers work under operational definitions of genes as "active" in some conditions and "inactive" in others. Our goal is to provide guidance to researchers who regularly put this intuition into practice, by assessing their methods for classifying genes into activity states based on gene expression data, and proposing statistical models for data analysis that lead to improved classifications. Thus, we propose a flexible Bayesian approach that uses parametric mixture models as a platform for meaningfully estimating gene activity states through the integration of expression data and knowledge of operon structure. We quantify confidence in gene state estimates, which subsequently then can be incorporated into downstream analyses. We assess the performance of the model against other common approaches of estimating gene activity states using both simulated data and real E. coli transcriptome data; we compare our activity estimates to predictions of gene activity derived from reaction fluxes in a metabolic model of E. coli as well as to experimentally measured reaction fluxes in E. coli (Ishii et al., 2007;Machado and Herrgård, 2014).

General Mixture Modeling Framework
Throughout this paper, we consider a set of m bacterial genes from a single organism whose expression levels ǫ have been observed across n different experimental settings. We define ǫ ij to be the expression level recorded for the ith gene in the jth experimental sample, where ǫ ij is the background corrected, normalized, logarithm of recorded amount of mRNA from an expression array. In the spirit of Becker and Palsson (2008) and Chandrasekaran and Price (2010), for each gene, i, we consider these observed ǫ ij values to come from one of two possible (unobserved) gene states: inactive (gene i is producing only basal levels of product) and active (gene i is producing product involved in a functioning cellular process). A natural probabilistic model for the observed gene expression levels for genes in each state is a conditional Gaussian model, which follows earlier work in other genomic contexts (Gamba et al., 2015;Morfopoulou and Plagnol, 2015). Namely, ǫ i |active ∼ N (µ 1 , σ 1 ) and ǫ i |inactive ∼ N(µ 0 , σ 0 ). In other words, the distribution of expression values for a gene in the active state follows a Gaussian distribution with an underlying mean, µ 1 , and standard deviation, σ 1 where the standard deviation captures the underlying measurement (Ohtaki et al., 2010) and biological variability (Losick and Desplan, 2008;Chalancon et al., 2012) in expression measurements across settings. Similar assumptions and definitions hold for gene expression measurements from the inactive state. Since in most settings, the true state of the gene is unknown a priori, the resulting observed gene expression values, ǫ i = (ǫ i1 , ǫ i2 , ǫ i3 , . . . , ǫ in ), can be modeled as coming from a Gaussian mixture distribution, ǫ i ∼ (1 − π)N (µ 0 , σ 0 ) + πN(µ 1 , σ 1 ), where the mixing parameter π represents the proportion of the time that the given gene i is active across the set of experiments.
In particular, since we cannot observe α i directly, we wish to generate a i = (a i1 , a i2 , a i3 , . . . , a in ), such that a ij is the posterior probability that gene i is active in experiment j. We will use a Bayesian approach to generate a i for each gene i.
For each gene i, we start with prior distributions on the five unknown parameters from the Gaussian mixture model: π, µ 0 , σ 0 , µ 1 , σ 1 . In practice, we reduce to four unknown parameters by requiring σ 0 = σ 1 . This assumption provides increased model convergence and robustness to outliers (Fraley and Raftery, 2007), and assumes that similar amounts of biological and measurement variability will be present in expression values for the inactive and active states. The prior distributions for each of the four unknowns are given as: µ 0 ∼ N(µ = 8, σ = √ 3), µ 1 ∼ N(µ = 9, σ = √ 3), σ 0 2 = σ 1 2 ∼ InverseWishart(Ψ = 3, ν = 1) and π ∼ Beta(α = 5, β = 5). Briefly, these choices of prior distributional shapes make later mathematical computation of posterior distributions straightforward and are standard in statistical practice (Murphy, 2007). The corresponding parameter values reflect reasonable experimental and biological assumptions [e.g., E (π) = 0.5; E(µ 0 )< E(µ 1 ), etc.]. The choices of 8 and 9 for the prior means of RMA normalized data (See Section Real Data Sets), represent values near the overall "average" ǫ across all genes and all experiments (the range of which tends to be between 4 and 16). We note that, for this application, our analysis of the robustness of parameter choices indicates that these choices appear to have little bearing on resulting downstream a ij generation (detailed results not shown).
Step 3. Use the estimated Gaussian mixture model, ǫ i,k = 1 , to find b ij,k = 1 (ǫ ij ), the conditional probability that an expression value, ǫ ij , is from the active state, Step 4. Generate a random vector I i,k = 1 , where I ij,k = 1 is a single random value {0 or 1} drawn from Bernoulli(p = b ij,k = 1 ), indicating whether gene i is active or inactive in experimental setting j, for the k = 1 iteration of the Gibbs sampler.
Step 5. Update the prior distributions of the four parameters by incorporating the prior distributions with I i,k = 1 . In particular, let C active and C inactive be the set of expression values currently assigned to the active and inactive clusters, respectively, according to I i . Then where |C active | represents the cardinality (size) of the set of expression values in C active . And, where ǫ 0 and ǫ 1 are the means of the classified inactive and active gene expression data, respectively. The Gibbs sampler then repeats Steps 2-5 for k = K times. In our case, we used K = 500, with values less than 500 tending to give less robust results (detailed results not shown).
To generate a i , values of I i,k are averaged across the K runs of the Gibbs sampler, ignoring an initial set of burn-in runs, b. In our case we used b = 50, which yielded robust a i values (detailed results not shown). In particular, a ij = , for all j.

Multivariate Inference Overview
While the univariate Gaussian mixture model and associated Gibbs sampler provide a standard way to generate a i values gene by gene, this approach fails to account for other a priori known biological information which may be able to further improve a i estimates. For example, in bacteria, operons are sets of contiguous genes that are co-regulated and therefore are generally active or inactive simultaneously. Thus, knowledge about which genes are in operons should allow us to improve gene activity estimates.
If there are p genes (i 1 ,i 2 ,. . . ,i p ) located within a given operon, r, then we can extend the univariate Gaussian mixture model described earlier to a multivariate Gaussian mixture model as follows: where we assume that within each state (active or inactive) the biological and measurement co-variability between gene expression measurements is zero. Our approach in the multivariate case is very similar to our approach in the univariate case, and so is only outlined here. For each operon r, we start with prior distributions on the four unique and unknown parameters/vectors from the Gaussian mixture model: π, − → µ 0 , − → µ 1 , 0 = 1 . The prior distributions for each of the four unknowns are given Beta(5, 5). These distributions and initial prior parameter values are mainly explained above. Off-main diagonal values of 0 in − → µ 0 and − → µ 1 suggest that there is no correlation between the means of the active (or inactive) state distributions across the genes in an operon, with 0 = 1 suggesting no covariability in the state expression variances of genes in an operon. As before, our analysis of the robustness of parameter choices indicates that these choices appear to have little bearing on resulting downstream a ij generation (detailed results not shown). The Gibbs sampler is performed as described above by simply replacing the univariate parameters with the multivariate parameters. Notably, this yields a i values which are identical across all p genes located within an operon, since inference about activity states is occurring for the operon as a whole, not gene-by gene.

Implementation Details for All Methods Being Compared
In order to compare the performance of the two methods proposed above to current best practices, we implemented five approaches to estimating a ij : median thresholding, trichotomous thresholding, rank based estimation, and our proposed univariate mixture modeling and multivariate mixture modeling approaches. We now briefly describe the methods compared here, along with some relevant implementation notes:

This approach dichotomizes expression values such that
for all m genes in experiment j. This approach is a special case of that proposed by GIMME (Becker and Palsson, 2008), which allows users to, a priori, select any threshold (median or otherwise). In practice, some users select the median (?). We note that the GIMME software program uses the mean as the default value for the threshold 1 . Due to the typical symmetry of gene expression values within an experiment, the mean is nearly identical to the median.

Trichotomous Threshold (TT)
This approach trichotomizes expression values such that percentile for all i genes in experiment j and P high,j is the 60th percentile for all i genes in experiment j and is in the spirit of GIMME, but allowing for an uncertain region as proposed by Shlomi et al. (2008).

Rank Based Approach (RB)
This approach is a continuous analog to the MT approach. In , where rank(ǫ ij ) is the rank within experiment j. This approach is in the spirit of GIM3E (Schmidt et al., 2013) which assigns the equivalent of a ij = 1 to max(ǫ ij ), and a monotone and continuously changing decreasing confidence as ǫ ij decreases.

Univariate Mixture Model (UniMM)
This approach, described above, uses a Bayesian approach to infer a ij values according to a Gaussian mixture model.

Multivariate Mixture Model (MultiMM)
This approach, described above, uses a Bayesian approach to infer a ij values according to a Gaussian mixture model while incorporating knowledge of operon structure.

Screening and Imputation Methods for Mixture Model Approaches
When implementing UniMM and MultiMM we first assessed the quality of the fit of a 2-component mixture model to the observed expression data for each gene or operon. In particular, some genes/operons may not change states (between active and inactive) across the available set of expression values (all n experiment settings), thus making a 2-component mixture distribution invalid. With this in mind we used a screening method to determine which genes/operons had strong evidence that they were 1-component instead of 2-component.
The screening method uses the Bayesian Information Criterion (BIC) to assess the fit of a 1-component (univariate or multivariate) Gaussian mixture distribution vs. a 2-component mixture distribution using the R package Mclust 2 . Following Raftery et al. (Raftery, 1995) we require the BIC to be at least 12 points better for the 1-component model to be chosen vs. the 2component model. We note, however, that if we were to simply choose the best BIC between the 1 and 2 component models there would be little impact on the number of genes screened as being from a single component (detailed results not shown).
The a ij values for genes which were screened as coming from only a single component (all active or all inactive) were estimated using a multiple imputation approach (MI) in order to identify similar genes/operons and impute a ij values. Multiple imputation is a well-known statistical procedure for estimation of missing values (Rubin, 1987). When a gene, i, was screened as being from a single component, we used the R package Mclust 2 to fit a single component Gaussian distribution to the data and estimate the corresponding mean and standard deviation of the model. Results of the UniMM approach for all genes from 2-component mixtures were then evaluated to identify "similar" genes, where similar genes had ( µ 0 ∈ x ǫ i ± 0.1 and σ 0 ∈ s ǫ i ± 0.1) or ( µ 1 ∈ x ǫ i ± 0.1 and σ 1 ∈ s ǫ i ± 0.1), where 0.1 is an arbitrary threshold. The multiple imputation approach then computes a i 's for each of the similar genes as if the ǫ i,j came from each of the similar genes. The final a i 's for each imputed gene are computed by averaging across the imputed a i 's from each similar gene. In the case of operons identified as coming from a single component (MultiMM approach), each gene in the operon is first considered separately using the same MI approach as for the UniMM method, and then the resulting a i 's for each gene in the operon are averaged in order to yield consistent a i values for all genes in the operon. For single component operons which are identified as always active or inactive, π = 1 or 0, respectively. Finally, we note that if no similar genes are identified, then the MI approach returns a ij = 0.5 for all j.

Real Data Sets
For most of our analyses, we used genome-wide gene expression data from 907 different microarray data sets collected on 4329 Escherichia coli genes via the M3D data repository (Faith et al., 2007(Faith et al., , 2008 ). Raw data from Affymetrix 4 CEL files were normalized using RMA (Irizarry et al., 2003). Details of data processing are described elsewhere (Tintle et al., 2012;Powers et al., 2015). We also performed analysis on gene expression and fluxomics data from Ishii et al. (Ishii et al., 2007) comprising 79 E. coli genes in 29 experimental conditions. E. coli operon predictions for 2648 operons, including 1895 single gene operons, were obtained from Microbes Online (Price et al., 2005).

Simulated Data Sets
We also simulated expression data with "known" gene activity states (active/inactive). The simulation of expression data was informed by the E. coli expression data described above. We first ran the Screening Method described above (see Section Screening and Imputation Methods for Mixture Model Approaches) and dropped all operons, including single gene operons, for which the two-component model did not yield the highest BIC (n = 697 dropped). We then randomly selected 26.3% (=697/2648) of the remaining 1951 operons to be single component in the simulated data, with each of the single component operons having an equal likelihood of being always active or always inactive.
We used two different methods to calculate the mixing parameter, π, used in the simulation for the 1438 two-component operons. The Uniform Method (Unif ) chooses a random value for π between 0.2 and 0.8. The Fitted Method (Fit) uses the MultiMM estimate of π (details given above). Values for − → µ 0 , − → µ 1 , 0 = 1 are all as estimated by the MultiMM method computed on the real expression data. To generate simulated expression values, ǫ s ij , we drew 907(π i ) random values from a multivariate normal distribution ( − → µ 1i , 1i ) and 907(1 − π i ) random values from a multivariate normal distribution ( − → µ 0i , 0i ). Thus, we generated a 907 by 3435 matrix of ǫ s ij values.

Validation of Real Gene Activity Calls Using Flux Variability Analysis
In order to generate alternative predictions of gene activity we used Flux Variability Analysis (Mahadevan and Schilling, 2003) on the E. coli iJO1366 metabolic model (Orth et al., 2011). In particular, we ran flux variability analysis on the E. coli metabolic model yielding flux bounds v low and v high for each reaction in the model. Media conditions and gene mutations were accounted for in a model maximizing biomass. The following rules were then used to determine predictions r ij of reaction activity for each reaction in the model and each of the 907 experiments.
These reaction-level predictions were converted to genelevel predictions, p ij , accounting for isozymes and multi-gene complexes as follows. If r ij = 1 and the reaction is associated with a single gene or multi-gene complex, p ij for all genes involved is 1. If r ij = 0 and the reaction is associated with a single gene or multiple isozymes, p ij for all genes involved is 0. If r ij = 1 but the reaction is associated with multiple isozymes, we cannot be sure which isozyme is responsible for enabling the reaction, so we make no prediction (assign no p ij value) for any of the genes involved. If r ij = 0 but the reaction is associated with a multigene complex, we cannot be sure which subunit is responsible for thwarting reaction activity, so we make no prediction (assign no p ij value) for any of the genes involved. If the previous four rules result in contradictory p ij values for any given gene (e.g., both p ij = 0 and p ij = 1), then we let p ij = 1 for that gene. This assumes that a gene with multiple roles can be active but perform only some of its roles; for instance, a gene product may be associated with an active reaction, but also be an isozyme on an inactive reaction, resulting in contradictory p ij values. In these cases p ij = 1 is the correct prediction. Finally, we note that all but three of the 907 experiments resulted in a growth prediction by the metabolic model, and that, following these rules, p ij values could be obtained for approximately 845,000 of the 1.2 million gene-by-experiment combinations in the metabolic model.

Statistical Analysis
We used two primary approaches to evaluate the quality of a ij 's resulting from different methods applied to both simulated and real data. The squared deviation approach quantifies the difference between a ij 's and true or predicted gene activity states.
We computed d 2 ij = a ij − α ij 2 for simulated data, where α ij is the true gene activity state, and d 2 ij = a ij − p ij 2 for the real expression data where p ij is the predicted gene activity state based on the flux variability analysis of the metabolic model. We note that d ij is only computed on the 1353 genes in the metabolic model for the real expression data, since other genes have no p ij values. The average deviation can then be computed by experiment, by gene or for various other subsets of the data. The consistency approach indicates that an a ij is consistent with p ij or α ij if a dichotomized version of the a ij is consistent with p ij or α ij . In particular, c ij is computed as follows: 1 if a ij > 0.5 and p ij = 1 0 if a ij > 0.5 and p ij = 0 0 if a ij < 0.5 and p ij = 1 1 if a ij < 0.5 and p ij = 0 0.5 if a ij = 0.5 A consistency score can then be computed by summing c ij across various subsets of the data (e.g., all genes within an experiment; all genes in an operon, etc.). Lastly, we evaluated the differential expression of metabolic pathway components (DeJongh et al., 2007;Aziz et al., 2008;Henry et al., 2010) among different subsets of experiments using a modified gene set analysis approach (Tintle et al., 2008). These pathway components have been previously shown to demonstrate strong consistency with gene expression data (Tintle et al., 2012). Briefly, we found the average a ij value for all genes within each pathway component in each of the 907 experiments, and then ran a two-sample t-test comparing the mean activity scores between different subsets of the 907 experiments.

Validation Using Experimentally Measured Fluxes
Lastly, we evaluated gene activity estimates inferred from expression data vs. experimentally measured reaction fluxes on a published set of 79 genes in 29 separate experimental conditions (Ishii et al., 2007;Machado and Herrgård, 2014). In short, we computed gene activity estimates (a ij 's) for each gene-experiment combination using methods described above. After generating the gene-level a ij 's, we mapped them to reaction-level predictions q ij , accounting for isozymes and multi-gene complexes as follows. For each reaction, if the reaction is associated with a single gene, q ij for that reaction is equal to the a ij for that gene. If the reaction is associated with a multi-gene complex, q ij for that reaction is equal to the minimum of the a ij 's for all the genes involved; and if the reaction is associated with isozymes, q ij for that reaction is equal to the maximum of the a ij 's for all the genes involved. In any of these three cases, if any gene has no a ij (because it was not one of the 79 genes for which expression data was measured), it is ignored for the purposes of taking the minimum or maximum; or if all of the genes associated with a given reaction have no a ij , that reaction is dropped from the analysis. We repeated this procedure with each a ij generation method, and also with the gene-level ǫ ij 's for comparison purposes, to create alternate reaction-level predictions based directly on expression value.
We compared the correlations we observed between each set of q ij 's with the (absolute values of the) experimentally measured fluxes for those reactions in those experimental conditions. A square-root transformation was applied to both the fluxes and the ǫ ij -based q ij 's to normalize these skewed distributions to ensure robust correlation estimates were obtained. Multiple linear regression models were used to predict fluxes using gene activity method estimates and expression values in order to evaluate the explanatory ability of different gene activity state estimates with flux values.
Software R scripts and an example implementation of the approaches considered here (MT, TT, RB, UniMM, and MultiMM) are freely available on the Software page at http://www.dordt.edu/statgen.

Performance on Simulated Data
We begin by evaluating the performance of the different approaches to a ij estimation on simulated data.  Larger numbers on the y-axis represent less deviation from true activity states, illustrating that MultiMM has the best performance, followed by UniMM, with MT yielding the worst performance. This figure illustrates the results on simulated data using the Unif simulation approach. Performance with the Fitted approach was similar. states, with MultiMM performing best. Table 1 likewise shows the overall performance of the five approaches for the simulated data using the consistency metric. Both Mixture Model (MM) approaches yield the best sensitivity and best specificity, with the MultiMM method again performing best overall. The MultiMM method also performed best compared to other methods when examining only the subset of genes identified as coming from a single component. Table 2 illustrates that the MM approaches also give the most consistent results at different confidence levels. Finally, Table 3 shows similarly top performance of the MultiMM approach for operons, with the MultiMM approach yielding the most consistent calls with no inconsistencies, and the best overall concordance with the simulated data (85.3% consistent and correctly assigned compared to 58% or worse for all other approaches). Tables 1-3 are shown based on data from the Unif simulation approach. Results using the fitted simulation approach are similar (detailed results not shown). Table 4 provides the overall performance of each of the five approaches for inferring gene activity states as compared to metabolic model flux predictions. Overall, the MM approaches yielded better specificity, at the expense of sensitivity, by, in general, yielding fewer a ij > 0.5 than the MT, TT, and RB methods. This resulted in a larger combined sensitivity plus specificity for the MM approaches, with a slight preference to the MultiMM method using this metric. Both MM approaches also yielded better c ij values when evaluating genes flagged as one component. Table 5 illustrates the overall performance of each of the five approaches by confidence. MM approaches provide substantially improved consistency over the other approaches when confidence levels are high or medium, with similar performance for low certainty a ij values. Sensitivity and specificity trends by confidence and approach follow directly from the patterns observed in Table 4 (detailed results not shown). Finally, Table 6 shows the consistency of a ij 's within operons. As noted in the table, the MultiMM method provides calls which are consistent within operons, which is expected based on the way that calls are made for each operon when using the MultiMM approach. We note that our analysis evaluating consistency of the gene activity state estimates with results from Flux Variability Analysis are limited by the quality of the modeling results from FVA and inherent limitations of the FVA approach. Thus, sensitivity and specificity estimates provided here should also not be viewed without recognizing these limitations.

L-Arabinose Operon
The L-arabinose (ara) operon is a well-studied set of three co-located genes (araB, araA, araD) which encode enzymes needed for the catabolism of arabinose in E. coli (Schleif, 2010). Across the 907 experiments in our dataset, the MultiMM algorithm calls the L-arabinose operon active (a ij > 0.5) in 227 experiments and inactive in 680 experiments (a ij < 0.5).
In the vast majority of cases where the MultiMM identified the operon as active, L-arabinose was identified as present in the media (96.4% = 219/227), and all 8 inconsistent cases were from the same experimental series. Similarly, when our algorithm indicated that the ara operon was inactive, L-arabinose was not indicated as being present in the media in the vast majority of cases (94.9% = 645/680). Many of the inconsistent cases had reasonable biological explanations for why they  Figure 2A shows the raw expression data (histogram) and an overlaid Gaussian mixture distribution from the MultiMM method for araB. The remaining three figures (Figures 2B-D) graph the posterior probability that araB is active in experiment j (a ij ) vs. the log expression value (ǫ ij ). The rank based method ( Figure 2B) yields uncertain calls for many of the expression values (many a ij values near 0.5) for values which are clearly inactive based on the histogram. The UniMM ( Figure 2C) and MultiMM ( Figure 2D) approaches yield results that directly correspond to the raw expression values shown in the top left histogram. Notably, the MultiMM improves on the call certainty of the UniMM: it takes one somewhat uncertain call from the UniMM approach (a ij = 0.35) and makes that call more certain by leveraging observations about the experiment from other genes in the operon (where the other genes in the operon are clearly in the inactive cluster; details not shown). Figure 3A shows the raw expression data (histogram) and an overlaid Gaussian mixture distribution from the MultiMM method for araA (Note: a histogram for araB is provided in Figure 2A). The remaining three figures (Figures 3B-D) graph the observed log-expression values (ǫ ij ) for araA vs. araB. araA and araB are contained within the same operon, a three gene operon that also includes araD (not shown). The rank based method ( Figure 3B) yields uncertain and inconsistent calls for many of the expression values which appear to be clearly in the inactive category based on the apparent clustering. The UniMM ( Figure 3C) and MultiMM ( Figure 3D) approaches both show much better performance. Notably, the MultiMM approach eliminates one inconsistent call by leveraging observations about this experiment (the third gene in the operon (araD), is clearly in the inactive cluster; details not shown).

Cysteine Synthase Operon
The cysteine synthase operon consists of five genes (cysM, cysA, cysW, cyst, and cysP) which encode proteins associated with (See Faith et al., 2007, Supplemental [See Faith et al., 2007_r1; ccdB_MG1063_t120_r1 is the remaining (12th) experiment]. None of these cases had a ij values near 0.5. cysteine biosynthesis. Figure 4 illustrates how the MultiMM performs better than the UniMM and RB approaches for genes in this operon. Figure 4A shows the raw expression data (histogram) and an overlaid Gaussian mixture distribution from the MultiMM method for cysM, with a comparable figure for cysP (Figure 4B). Note that cysM and cysP are located within the same operon, which also contains three other genes (cysA, cysW, and cysT; not shown). It is also important to note that, while the Gaussian mixtures fit the data well, there is less separation in the clusters than is present in the araA/araB example (see Figures 2, 3). Furthermore, note that the threshold for active-inactive appears to be at approximately 9 for cysM ( Figure 4A) but is higher (∼10.5) for cysP ( Figure 4B). Figures 4C-E graph the observed log-expression values (ǫ ij ) for cysM vs. cysP. The rank based method ( Figure 4C) yields uncertain and inconsistent calls for many of the expression values which appear to be clearly in the inactive category based on the apparent clustering. The UniMM ( Figure 4D) and MultiMM ( Figure 4E) approaches both show much better performance, though the UniMM approach yields yield results which more intuitively agree with the observed raw expression values than the rank-based approach (B). The MultiMM method, by leveraging information from all genes in the operon, is able to provide improved certainty over the other methods. numerous inconsistent calls. Notably, the MultiMM approach eliminates all inconsistent calls by leveraging observations about each experiment from the other genes in the operon (details not shown).
The relatively "high" expression values observed when genes in the cysteine synthase operon are in the inactive state lead to poor performance of the rank-based method. Furthermore, the increased within gene-state variability yields more uncertain calls from the UniMM method which leads to a large number of inconsistent gene calls for genes in this operon. The MultiMM method leverages operonal structure to consistently infer gene states for all genes in the operon. In the majority of cases when the experiment was performed in the presence of yeast extract, a rich media, the MultiMM method identified the operon as inactive (91.3%; 625/684); presumably, in these conditions, cysteine would not need to be synthesized by the cell. Relatedly, when the experiment was not performed in the presence of yeast extract, the cysteine synthase operon was typically identified as active by the MultiMM approach (68.6%; 153/223). The UniMM approach yielded similar results, but other methods showed much weaker association between the yeast extract media and cysteine synthase activity.

Overall Patterns of Gene Activity
Overall, the MultiMM method yielded 3538 genes which were determined to be in both active and inactive states at least once in the set of 907 experiments, with 791 genes that did not show evidence of changing states in this set of experiments. Among 684 experiments done in the presence of yeast extract (a rich media), the overall (across all genes and experiments) average a ij value was 0.418, compared to 0.428 among the 223 experiments performed without yeast extract (p = 0.002). To better understand which aspects of the metabolic network may be accounting for this difference, we used a gene set analysis approach to test for potential differences in average activity level  Figure 2A). (B-D) graph the observed expression values for araA vs. araB indicating how consistent or inconsistent the calls are for genes within an operon for each of three different methods of estimating gene activity states. Blue dots represent experiments for which the approach is very sure the gene pair is inactive (a ij < 0.2) and red triangles represent experiments for which the approach is very sure the gene pair is active (a ij > 0.8). Open squares represent unsure (0.2 < a ij < 0.8) calls for both genes, and black "X's" represent situations where either a ij < 0.2 for one gene and a ij > 0.2 for the other, or a ij < 0.8 for one gene and a ij > 0.8 for the other. The MultiMM method, by leveraging information from all genes in the operon, is able to provide the most consistent calls of the three methods.
for pathway components (see methods) between the different sets of experiments. Supplemental Table 1 provides the full list of 161 pathway components. Notably, 29 of the top 40 most positively differentially expressed pathway components (more activity when yeast extract was absent than when it was present) involve synthesis of metabolic components (p < 0.002 in all cases) compared to only six of the top 35 most negatively differentially expressed pathway components (more activity when yeast extract was present; p < 0.002 in all cases).
Furthermore, Supplemental Table 1 includes a column indicating whether or not a pathway component is related to amino acid biosynthesis according to the SEED (DeJongh et al., 2007). Eighteen of the 19 amino acid biosynthesis pathway components are in the top 44 most positively differentially expressed pathway components, as would be expected given that these pathway components are typically not needed in the presence of yeast extract.

Evaluation of Gene Activity State Estimates and Experimentally Measured Reaction Fluxes
Lastly, we evaluated different methods of estimating gene activity states vs. experimentally measured reaction fluxes. Table 8 summarizes the results of these analyses. The MultiMM method yielded the strongest correlation between experimental measured fluxes and gene activity state estimates (0.354; 95 CI: 0.303 to 0.406; p < 0.001). A multiple regression model predicting flux measurements by both MultiMM and raw gene expression values found that while MultiMM is significantly associated with flux values (Std Beta = 0.344; 95% CI: 0.280, 0.480; p < 0.001), raw expression values are not unique predictors of flux values (Std. Beta = 0.018; 95% CI: -0.046, 0.082; p > 0.05), suggesting that the MultiMM method has sufficiently captured the aspects of expression data which associate with flux. We note (see Table 8 for details), that this is also true of the TT method, but not the MT and RB methods. Importantly, however, in models predicting fluxes using both the MultiMM method and other gene activity show raw expression data with overlaid Gaussian mixture distributions from the MultiMM method for cysM and cysP, respectively. (C-E) graph the observed expression values for these two genes, indicating how consistent the expression values are with three different methods of estimating gene activity states. Blue dots represent experiments for which the approach is very sure the gene pair is inactive (a ij < 0.2) and red triangles represent experiments for which the approach is very sure the gene pair is active (a ij > 0.8). Open squares represent unsure (0.2 < a ij < 0.8) calls for both genes, and black "X's" represent situations where either a ij < 0.2 for one gene and a ij > 0.2 for the other, or a ij < 0.8 for one gene and a ij > 0.8 for the other. The MultiMM method, by leveraging information from all genes in the operon, is able to provide the most consistent calls of the three methods.
estimates, the MultiMM method explains substantially more variation in flux values than other methods [between 0.29 and 0.38 vs. −0.032 (MT), 0.085 (TT) and 0.093 (RB)]. Because of the high correlation between the UniMM and MultiMM methods on this dataset (see Table 8, footnote D) we only focused on the MultiMM method in the previous paragraph.

DISCUSSION
We have presented a Bayesian framework for the classification of microbial gene activity states (active or inactive), based on a compendium of genome-wide gene expression data. Our approach first uses the Bayesian Information Criterion to identify genes that likely have a mixture of both active and inactive states present in the data. A Gibbs sampler is then used to provide estimates of the posterior probability that a gene is active in each condition, based on a Gaussian normal mixture model. Our approach addresses four key limitations of existing approaches for classifying gene activity states: (a) different activity thresholds for different genes (Figure 4A vs. Figure 4B)  leveraging a priori evidence of co-regulation (Figures 4C-E) and (d) benefits of quantified statistical uncertainty (e.g., Figure 1, among others). Specific figures and tables in the manuscript provide visual intuition about how the proposed method addresses these limitations. By addressing these limitations, the Mixture Model approaches (MultiMM and UniMM) show less deviation from true gene activity states on simulated data, more consistency with metabolic model flux predictions and operon structure on real data, and stronger association with experimentally measured fluxes, with MultiMM doing best. Results from model fitting on the real data by the mixture model approaches showed great variance in the overall means, standard deviations and mixing proportions suggesting empirically that the limitations stated above are, in fact, realistic concerns for existing approaches which can be addressed by these mixture model approaches. Furthermore, association between inferred gene activity states using the MultiMM approach yielded stronger correlations with observed flux data, while other methods (MT, TT, and RB, as well as the raw expression data itself) left substantial variation in flux values as unexplained. Finally, pathway components associated with synthesis activities were significantly more expressed in the absence of a rich media condition (yeast extract) than in the presence of a rich medium as expected.
The improved performance of MultiMM over UniMM highlights the utility of incorporating genome-based operon predictions in activity state estimation. The Bayesian approach we have developed acts as a general framework for future innovation via the inclusion of other -omics data sources. For example, transcriptional regulatory networks (TRNs) can be actively incorporated into the analysis pipeline by expanding the MultiMM approach to utilize regulons in addition to operons. However, full integration of TRN information will require explicitly incorporating TRN uncertainty into the Bayesian framework. For example, due to TRN uncertainty we might be only 90% sure that two genes are co-regulated, in contrast to our current approach, which requires gene sets (operons) to be defined explicitly and with 100% certainty. Furthermore, this same uncertainty approach can likely be applied to operons when, for example, an operon comprises several transcription units. We believe the Bayesian framework provided here provides a flexible platform for this future innovation, in order to continue to reduce overall deviation of activity state estimates from true gene activity states. A similar model is also being explored by our group to incorporate (a) flux profiles, (b) functional information, (c) cross-organism gene homology and other sources of genetic information which, when incorporated into the activity state estimates, may further improve their accuracy and precision.
Downstream applications of improved gene activity state measurements are numerous, though we focus our discussion here mainly on metabolic modeling. In particular, our improved gene activity state measurements will allow us to incorporate gene activity information into subsequent metabolic models simulations through statistically informed penalties, rather than arbitrary or loose penalties, which often serve to down-weight gene expression data to the point where it has little to no real impact on downstream modeling results (e.g., Chandrasekaran and Price, 2010). We are currently exploring metabolic modeling advances that incorporate a ij 's. Some limitations of our approach and our analysis here are worth noting. Gene activity state estimates likely improve as the number and diversity of experimental conditions increases, though we saw promising results even in a small follow-up study of expression data in 29 conditions. Given the dramatically reduced price of RNA-sequencing, continued rapid growth in the size and diversity of expression data is expected to make this limitation less of an issue. Our analysis here focuses on E. coli, though an important area of future work involves application of our approaches to a variety of other microbes, both to demonstrate transferability of the approach, but also in order to explore methods of improving activity state inference by leveraging information from multiple organisms simultaneously (e.g., gene homology, regulatory and metabolic homology, etc.). Further work is needed to evaluate the performance of these methods on RNA-seq data, though we anticipate the performance should be similar after standard normalization and transformation procedures are applied to the data. Finally, numerous additional validation studies are possible (e.g., an expanded comparison to observed flux data) and should be considered in future work. An initial small-scale evaluation presented here showed promising initial results.

CONCLUSIONS
We have developed a flexible Bayesian framework for the estimation of gene activity states from compendia of microbial gene expression data. Our approach provides improved consistency with true gene activity states compared to existing approaches on a large compendia of E coli expression data. Future work is needed to evaluate the performance of the method on other organisms and to expand the Bayesian model presented to incorporate other -omics data sources.

AUTHOR CONTRIBUTIONS
NT, MD, and AB contributed to the initial design of the study and motivating research questions. NT, CD, BG, KC, KK, RL, CV, JC, EH, YA, and KF designed the method, developed evaluative metrics, simulated data and implemented the methods. AA, MC, AB, and MD contributed to the evaluation of data, implementation and running of the metabolic models and understanding and parsing of media conditions. NT, CD, BG, and KC drafted initial versions of the manuscript. All authors saw, edited and approved of the final manuscript.