Computation of Single-Cell Metabolite Distributions Using Mixture Models

Metabolic heterogeneity is widely recognized as the next challenge in our understanding of non-genetic variation. A growing body of evidence suggests that metabolic heterogeneity may result from the inherent stochasticity of intracellular events. However, metabolism has been traditionally viewed as a purely deterministic process, on the basis that highly abundant metabolites tend to filter out stochastic phenomena. Here we bridge this gap with a general method for prediction of metabolite distributions across single cells. By exploiting the separation of time scales between enzyme expression and enzyme kinetics, our method produces estimates for metabolite distributions without the lengthy stochastic simulations that would be typically required for large metabolic models. The metabolite distributions take the form of Gaussian mixture models that are directly computable from single-cell expression data and standard deterministic models for metabolic pathways. The proposed mixture models provide a systematic method to predict the impact of biochemical parameters on metabolite distributions. Our method lays the groundwork for identifying the molecular processes that shape metabolic heterogeneity and its functional implications in disease.


I. INTRODUCTION
Non-genetic heterogeneity is a hallmark of cell physiology.Isogenic cells can display markedly different phenotypes as a result of the stochasticity of intracellular processes and fluctuations in environmental conditions.Gene expression variability, in particular, has received substantial attention thanks to robust experimental techniques for measuring transcripts and proteins at a single-cell resolution 1,2 .This progress has gone hand-in-hand with a large body of theoretical work on stochastic models to identify the molecular processes that affect expression heterogeneity [3][4][5][6][7] .
In contrast to gene expression, our understanding of stochastic phenomena in metabolism is still in its infancy.Traditionally, cellular metabolism has been regarded as a deterministic process on the basis that metabolites appear in large numbers that filter out stochastic phenomena 8 .But this view is changing rapidly thanks to a growing number of single-cell measurements of metabolites and co-factors [9][10][11][12][13][14][15][16][17] that suggest that cell-to-cell metabolite variation is much more pervasive than previously thought.The functional implications of this heterogeneity are largely unknown but likely to be substantial given the roles of metabolism in many cellular processes, including growth 18 , gene regulation 19 , epigenetic control 20 and immunity 21 .For example, metabolic heterogeneity has been linked to bacterial persistence 22,23 , a dormant phenotype characterised by a low metabolic activity, as well as antibiotic resistance 24 and other functional effects 25 .In biotechnology applications, metabolic heterogeneity is widely recognised as a limiting factor on metabolite production with genetically engineered microbes [26][27][28] .
A key challenge for quantifying metabolic variability is the difficulty in measuring cellular metabolites at a single-cell a) Corresponding author: d.oyarzun@ed.ac.uk;This work was partly funded by the Human Frontier Science Program through a Young Investigator Grant (RGY0076-2015) awarded to D.O., a UKRI Future Leaders Fellowship (MR/T018429/1) awarded to P.T., and the EPSRC Centre for Mathematics of Precision Healthcare (EP/N014529/1) awarded to M.B. resolution [29][30][31] .As a result, most studies use other phenotypes as a proxy for metabolic variation, e.g.enzyme expression levels 32,33 , metabolic fluxes 34 or growth rate 35,36 .From a computational viewpoint, the key challenge is that metabolic processes operate on two timescales: a slow timescale for expression of metabolic enzymes, and a fast timescale for enzyme catalysis.Such multiscale structure results in stiff models that are infeasible to solve with standard algorithms for stochastic simulation 37 .Other strategies to accelerate stochastic simulations, such as τ -leaping, also fail to produce accurate simulation results due to the disparity in molecule numbers between enzymes and metabolites 38 .These challenges have motivated a number of methods to optimise stochastic simulations of metabolism [39][40][41][42][43] .Most of these methods exploit the timescale separation to accelerate simulations at the expense of some approximation error.This progress has been accompanied by a number of theoretical results on the links between molecular processes and the shape of metabolite distributions 6,[44][45][46] .Yet to date there are no general methods for computing metabolite distributions that can handle inherent features of metabolic pathways such as feedback regulation, complex stoichiometries, and the high number of molecular species involved.
In this paper we present a widely applicable method for approximating single-cell metabolite distributions.Our method is founded on the timescale separation between enzyme expression and enzyme catalysis, which we employ to approximate the stationary solution of the chemical master equation.The approximate solution takes the form of mixture distributions with: (i) mixture weights that can be computed from models for gene expression or single-cell expression data, and (ii) mixture components that are directly computable from deterministic pathway models.The resulting mixture model can be employed to explore the impact of biochemical parameters on metabolite variability.We illustrate the power of the method in two exemplar systems that are core building blocks of large metabolic networks.Our theory provides a quantitative basis to draw testable hypotheses on the sources of metabolite heterogeneity, which together with the ongoing efforts in single-cell metabolite measurements, will help to reevaluate the role of metabolism as an active source of pheno-typic variation.

II. GENERAL METHOD FOR COMPUTING METABOLITE DISTRIBUTIONS
We consider metabolic pathways composed of enzymatic reactions interconnected by sharing of metabolites as substrates or products.In general, we consider models with M metabolites P i with i ∈ {1, 2, ..., M } and N catalytic enzymes E j with j ∈ {1, 2, ..., N }.A typical enzymatic reaction has the form where P i and P k are metabolites, and E j and C j are the free and substrate-bound forms of the enzyme.The parameters (k f,j , k b,j ) and (k cat,j , k rev,j ) are positive rate constants specific to the enzyme.In contrast to traditional metabolic models, where the number of enzyme molecules is assumed constant, here we explicitly model enzyme expression and enzyme catalysis as stochastic processes.Our models also account for dilution of molecular species by cell growth and consumption of the metabolite products by downstream processes.
Though in principle one can readily write a Chemical Master Equation (CME) for the marginal distribution P(P 1 , P 2 , ...P M ) given the pathway stoichiometry, analytical solutions of the CME are tractable only in few special cases.To overcome this challenge, we propose a method for approximating metabolite distributions that can be applied in a wide range of metabolic models.We first note that using the Law of Total Probability, the marginal distribution P(P 1 , P 2 , ...P M ) can be generally written as: where P = (P 1 , P 2 , ...P M ) and E = (E 1 , E 2 , ..., E N ) are the vectors of metabolite and enzyme abundances, respectively.The equation in (2) describes the metabolite distribution in terms of fluctuations in gene expression, comprised in the distribution P(E), and fluctuations in reaction catalysis, described by conditional distribution P(P |E).
A key observation is that Eq. ( 2) corresponds to a mixture model with weights P(E) and mixture components P(P |E).To compute the mixture weights and components, we make use of the timescale separation between gene expression and metabolism.Gene expression operates on a much slower timescale than catalysis 40,44,47 , with protein half-lives typically comparable to cell doubling times and catalysis operating in the millisecond to second range.Therefore, in the fast timescale of catalysis we can write a conservation law for the total amount of each enzyme (free and bound): where E t,j is the total number of enzymes E j .Note that since our models integrate enzyme kinetics with enzyme expression, the variables E t,j follow their own, independent stochastic dynamics.It is important to note that in our approach, the conservation relation in (3) holds only in the fast timescale of catalysis.This contrasts with classic deterministic models for metabolic reactions, which typically focus on the fast catalytic timescale and assume enzymes as constant model parameters 48 .As a result of the separation of timescales, the weights and components of the mixture in Eq. ( 2) can be computed separately.The mixture weights P(E), in particular, can be computed as solutions of a stochastic model for enzyme expression 4 , or taken from single-cell measurements of enzyme expression.The mixture components P(P |E), on the other hand, can be estimated with the Linear Noise Approximation 49,50 (LNA) on the basis that metabolites appear in large numbers.In Figure 1 we illustrate a schematic of the proposed method.
enzyme expression enzyme catalysis Computation of single-cell metabolite distributions with Gaussian mixture models.We exploit the separation of timescales to compute the weights and components of the mixture model in Eq. ( 2).Mixture weights are computed as stationary solutions to the Chemical Master Equation (CME) for a chosen model for stochastic enzyme expression.The mixture components are computed via the Linear Noise Approximation 50 (LNA) applied to the pathway ODE model.This provides Gaussian approximations to the stationary metabolite distribution conditioned on the enzyme state.Overall, the method produces a Gaussian mixture model for metabolite distributions that can be applied in a wide range of metabolic pathways.

Gene expression model
We thus propose the following procedure for computing single-cell metabolite distributions: where S is the stoichiometric matrix and v(•) is the vector of deterministic reaction rates; for ease of notation we have assumed a unit cell volume, and hence the deterministic rates are equal to the propensities of the stochastic model.Note that due to the timescale separation, Eq. ( 4) must be solved assuming constant enzymes E, and its solution depends on the enzyme abundance, i.e.P = P (E) .
3. For each enzyme state E, compute the solution to the Lyapunov equation 50 : where A is the Jacobian of (4) evaluated at the steady state and BB T = Sdiag {v} S T .Note that, as in ( 4), the solution of the Lyapunov equation depends on the enzyme state, i.e.Σ = Σ(E).
4. Following the LNA, approximate the mixture components P(P |E) as a multivariate Gaussian distribution with mean P and covariance matrix Σ.

5.
Combine the weights P(E) and Gaussian components P(P |E) through the mixture model in (2).
In the next sections we illustrate the effectiveness of our method in two exemplar systems.

III. REVERSIBLE MICHAELIS-MENTEN REACTION
We first consider a stochastic model that integrates a reversible Michaelis-Menten reaction with a standard model for enzyme expression.
As shown in Figure 2A, the Michaelis-Menten mechanism includes reversible binding of four species: a metabolic substrate S, a free enzyme E, a substrate-enzyme complex C and a metabolic product P .To model enzyme expression, we use the well-known two-stage scheme for transcription and translation 51,52 (Figure 2A).The complete set of reactions is: The reactions in (6) correspond to a reversible Michaelis-Menten reaction as in (1), while reactions in (7) are the two-stage model for gene expression.We include four additional first-order reactions ( 8)-( 9) to model consumption of the metabolite product with rate constant k c , mRNA degradation with rate constant k deg , and dilution of all model species with rate constant δ.In what follows we assume that the FIG. 2. Exemplar metabolic systems.(A) Reversible Michaelis-Menten reaction; the full set of reactions are shown in Eq. ( 6)- (7).The model accounts for reversible catalysis of a substrate S into a product P .(B) Two-step pathway with noncompetitive end-product inhibition; the reactions are shown in Eq. ( 18)- (25).The product (P2) sequesters enzyme E1 into an inactive form E * 1 , thereby reducing the rate of the first reaction.In both examples we assume a constant substrate S and linear dilution of all chemical species.Enzymes are assumed to follow the two-stage model for gene expression 52 , which includes species for the enzymatic mRNA and protein.
substrate S remains strictly constant, for example to model cases in which the substrate represents an extracellular carbon source that evolves in much slower timescale than cell doubling times.
Since on the fast timescale of the catalytic reaction, the total number of enzymes can be assumed in quasi-stationary state 6,48 , we have that and therefore the general mixture model in (2) can be written as: The mixture weights P(E total ) can be computed from the stochastic model for gene expression in (7).Under the standard assumption that mRNAs are degraded much faster than proteins 4 , the stationary solution of the two-stage model can be approximated by a negative binomial distribution 52 : where Γ is the Gamma function and the parameters are defined as the burst frequency a = k tx /δ and burst size b = k tl /k deg .
To compute the mixture components P(P |E total ) with the LNA, we write the full system of deterministic rate equations (see (35) in Methods) for the three species E, C and P .Note that in this case, we can further reduce the rate equations by (i) using the conservation law in (10), and (ii) assuming that the binding and unbinding reactions between S and E reach equilibrium faster than the product P , a condition that generally holds in metabolic reactions.After algebraic manipulations, the reduced ODE can be written as: where (14)   and the parameters are The mean of each mixture component is simply given by the steady state solution of ( 13), which we denote as P (E total ).For a given enzyme abundance E total , the variance Σ(E total ) of each Gaussian component is given by the solution to the Lyapunov equation in (5): where f and g are first-order derivatives.Combining the negative binomial in (12) with the Gaussian components, we can rewrite Eq. ( 11) to get a Gaussian mixture model for the metabolite: where both P (x) and Σ(x) must be computed for each value of x = E total in the summation.The normalization constant in ( 16) is In Figure 3 we plot the mixture model ( 16) for realistic parameter values and compare this approximation with distributions computed from long runs of Gillespie simulations of the whole set of reactions ( 6)- (9).The results indicate that the mixture model provides an excellent approximation of the metabolite distribution, even in the case of skewed or tailed distributions.In the next section we test our methodology in a more complex pathway with feedback regulation.16) provides an excellent approximation for the metabolite distribution obtained with Gillespie's algorithm 37 .Distributions were computed for varying values of the bursting parameter a, suggesting that high-frequency bursting tends to decrease metabolite skewness.All parameter values can be found in Table I in the Methods.

IV. PATHWAY WITH END-PRODUCT INHIBITION
A common regulatory motif in metabolism is end-product inhibition, in which a pathway enzyme can bind to its own substrate as well as the pathway product (see Figure 2B).The product thus sequesters enzyme molecules, which reduces the number of free enzymes available for catalysis and slows done the reaction rate.To examine the accuracy of our method in this setting, we study a fully stochastic model for a two-step pathway with noncompetitive end-product inhibition: The two reactions in (18) and ( 19) are reversible Michaelis-Menten kinetics, sharing the intermediate metabolite P 1 as a product and substrate, respectively.The end-product inhibition in (20) consists of reversible binding between h molecules of P 2 and the first enzyme E 1 into a catalyticallyinactive complex E * .The remaining model reactions in ( 21)-( 25) are analogous to the previous example in Section III: reactions in ( 21)-( 22) describe the two-stage model for expression of both enzymes, and with reactions ( 23)-( 25) we model first-order mRNA degradation, product consumption, and dilution by cell growth.For simplicity we also assume that both enzymes are independently expressed, but in general our method can also account for cases in which enzymes are co-expressed or co-regulated 53 .The resulting model has two distinct pools of enzymes, which remain constant over the timescale of catalysis: and therefore the mixture model in (2) becomes enzyme distribution where the summation goes through all (E t,1 , E t,2 ) pairs.Since both enzymes are expressed independently, the enzyme distribution is the product of two negative binomials P(E t,1 , E t,2 ) = P(E t,1 ) × P(E t,2 ), each one analogous to the distribution in (12).
To compute the mixture components with the LNA, we use the rate equations for the reactions in ( 18)-( 23); the full set of ODEs is listed in Eq. ( 36) in the Methods.As in the first example, by employing the conservation laws in (26) and assuming rapid equilibrium of the complexes C 1 and C 2 , the deterministic model can be further simplified to a 2-dimensional ODE: where for ease of notation we have omitted the dependency on E t,1 and E t,2 .The nonlinear functions in (28) are where θ = k sq /k rsq is the product-enzyme binding constant and the remaining parameters are defined as As in the previous example, the ODEs in (28) correspond to the full model (36) rewritten in terms of both metabolites assuming that the enzyme-substrate reactions reach equilibrium in a faster timescale than catalysis.This reduced model can be readily employed to obtain approximations for the mixture components with the LNA.If we denote as P = P (E t,1 , E t,2 ) the steady state solution of (28), we can write the Lyapunov equation as AΣ + ΣA T + BB T = 0 with A and BB T given by where f (•), g(•), and their derivatives are evaluated at the steady state solution P (E t,1 , E t,2 ).The Gaussian components of the mixture model are then where P = (P 1 , P 2 ) T and | • | is the matrix determinant.After combining the joint distribution of enzymes and the components into Eq.( 27), we get a Gaussian mixture model for the joint marginal distribution of both metabolites: where P (x, y) and Σ(x, y) need to computed numerically for each pair (x, y) = (E t,1 , E t,2 ) in the summation.The burst frequencies a i = k tx,i /δ and burst sizes b i = k tl,i /k deg,i are specific to each enzyme, and the normalisation constant is given by To test the quality of the approximation, we numerically computed the mixture model in (33) for various combinations of parameter values, shown in Figure 4. We observe that the mixture model offers an excellent approximation as compared to exact Gillespie simulations of the full model ( 18)- (25).We note that in this case, the full stochastic model has seven species and three different timescales, and therefore the runtime of Gillespie simulations are extremely long, in the order of several hours per run.
To further illustrate the utility of our method, we employed the mixture model to study the impact of parameter perturbations on the metabolite distributions.Without an analytical solution, such a study would require the computation of long Gillespie simulations for each combination of parameter values, which quickly become infeasible due to the long simulation time.In contrast, the mixture model provides a systematic way to rapidly evaluate the influence of model parameters on metabolite distributions.In Figure 5A we show summary statistics of the marginal P(P 1 ) for various combinations of average enzyme expression levels.The results suggest that expression levels can have a strong impact on the mean and coefficient of variation of the intermediate metabolite.Moreover, in Figure 5B we plot the distribution P(P 1 , P 2 ) for combinations of bursting parameters.The results show that uncorrelated enzyme fluctuations can still result in correlated and skewed metabolite distributions.

V. DISCUSSION
Cellular metabolism has traditionally been assumed to follow deterministic dynamics.This paradigm results largely from the observation that cellular metabolites are highly abundant.However, recent data shows that single-cell metabolite distributions can display substantial heterogeneity in their abundance across single cells [9][10][11][12][13][14][15][16][17] .It has also been shown that expression of metabolic genes is as variable as any other component of the proteome 2 , and thus in principle it is plausible that such enzyme fluctuations propagate to metabolites.These observations have begun to challenge the paradigm of metabolism being a deterministic process, suggesting that metabolite fluctuations may play a role in non-genetic heterogeneity.
Here we described a new computational tool to predict the statistics of metabolite fluctuations in conjunction with gene expression.The method is based on a timescale separation argument and leads to a Gaussian mixture model for the stationary distribution of cellular metabolites.Computing distributions from this approximate model is substantially faster than through stochastic simulations, as these can be extremely slow due to the multiple timescales of metabolic pathways.Our technique can therefore be employed to efficiently explore the parameter space and predict the shape of metabolite distributions in different conditions.In earlier work we showed that the product of a single metabolic reaction can be accurately described by a Poisson mixture model 6 .Such approximation allowed the discovery of previously unknown regimes for metabolite distributions, including heavily tailed distributions and various types of bimodality and multimodality.The Poisson approximation, however, is bespoke to single reactions and not valid for more complex systems.In contrast, the Gaussian mixture model discussed here is more general and can be applied to multiple kinetic mechanisms, more complex stoichiometries, as well as post-translational regulation.
Another advantage of our approach is that the mixture weights can be computed offline from stochastic models for gene expression or single-cell expression data.The model is flexible in that it can readily accommodate gene expression models of various complexity.For the sake of illustration, in our examples we used the simple two-stage model for gene expression, but other models including gene regulation can also be employed 7 .Particularly relevant models are those that account for enzyme co-regulation, a widespread feature of bacterial operons 53 , which translates into correlations between expression of different pathway enzymes and the resulting metabolite abundances.
In principle, most metabolic reactions satisfy the timescale separation as a result of their kinetics being much faster than the rate at which cells can synthesise new enzymes.However, throughout our examples we assumed that the metabolic substrate S, which is typically a carbon source or other precursor, remains constant.This case represents an abundant nutrient source with little fluctuations, but it is not adequate when substrates are lowly abundant or subject to stochastic fluctuations dictated by the environment.For example, depending on the timescale of such environmental fluctuations, the substrate can become another source of variability apart from enzyme expression 7 .In such cases, the timescale separation argument may not hold anymore and our theory needs to be revised to account for substrate fluctuations.
A number of works have sought to find links between fluctuations across layers of cellular organisation, such as gene expression, metabolism and cell growth 32,33,35,54,55 .But since measurement of metabolites in single cells remains technically challenging, there is pressing need for computational methods to predict fluctuations in cellular metabolites.Our proposed method provides a systematic approach for such task, paving the way for the generation of hypotheses on the molecular sources of metabolic heterogeneity.

A. Model simulation
Stochastic simulations were computed with Gillespie's algorithm over long simulation times (several hours) corresponding to thousands of cell cycles.The ODE models and # molecules P 1 # molecules P Lyapunov equations were solved in Matlab.In all examples, the negative binomial distribution for gene expression in (12)  was computed with its continuum approximation (Gamma distribution).

B. Deterministic rate equations
a. Reversible Michaelis Menten.The full set of rate equations for the reversible reaction in ( 6)-( 9) is: To further reduce the above system of ODEs to Eq. ( 13) in the main text, we can substitute the conservation relation in Eq. (10), i.e.C = E total − E, and use the fact that the substrateenzyme complex (C) typically equilibrates much faster than the product P , which means that dC/dt ≈ 0 in the timescale of catalysis.
b. End-product inhibition.The full set of rate equations for the reactions in ( 18)-( 23) is: As in the previous example, we can use the rapid equilibrium assumption and the conservation relations in (26), i.e.E t,1 = E 1 + E * + C 1 and E t,2 = E 2 + C 2 , to simplify the 7-dimensional ODE in (36) to the 2-dimensional system in (28) of the main text.
FIG. 1. Computation of single-cell metabolite distributions withGaussian mixture models.We exploit the separation of timescales to compute the weights and components of the mixture model in Eq. (2).Mixture weights are computed as stationary solutions to the Chemical Master Equation (CME) for a chosen model for stochastic enzyme expression.The mixture components are computed via the Linear Noise Approximation 50 (LNA) applied to the pathway ODE model.This provides Gaussian approximations to the stationary metabolite distribution conditioned on the enzyme state.Overall, the method produces a Gaussian mixture model for metabolite distributions that can be applied in a wide range of metabolic pathways.

FIG. 3 .
FIG. 3. Stationary product distribution of a Michaelis-Menten reaction.The proposed mixture model in (16) provides an excellent approximation for the metabolite distribution obtained with Gillespie's algorithm37 .Distributions were computed for varying values of the bursting parameter a, suggesting that high-frequency bursting tends to decrease metabolite skewness.All parameter values can be found in TableIin the Methods.

10 FIG. 4 .
FIG. 4. Stationary distributions for the intermediate metabolite in a two-step pathway with end-product inhibition.The panels show the distribution of intermediate metabolite P1 for different combinations of parameter values.(A) Impact of enzyme bursting frequency a1 and a2.(B) Impact of binding constant between the first enzyme and the end-product.All parameter values can be found in Table II in the Methods.

2 #FIG. 5 .
FIG. 5. Impact of enzyme expression on metabolite distributions.(A) For a wide sweep of the total enzyme expression levels, we observe changes in the mean abundances and coefficient of variation (CV) of the intermediate metabolite P1.Shown are the logarithmic values of mean and coefficient of variation.(B) Enzyme bursting parameters can strongly shape the metabolite distribution.All parameter values can be found in Table III in the Methods.

TABLE II .
Parameter values for simulations in Figure4.

TABLE III .
Parameter values for simulations in Figure5.