Development of a genome-scale metabolic model of Clostridium thermocellum and its applications for integration of multi-omics datasets and strain design

Solving environmental and social challenges such as climate change requires a shift from our current non-renewable manufacturing model to a sustainable bioeconomy. To lower carbon emissions in the production of fuels and chemicals, plant biomass feedstocks can replace petroleum using microorganisms as catalysts. The anaerobic thermophile Clostridium thermocellum is a promising bacterium for bioconversion due to its capability to efficiently degrade untreated lignocellulosic biomass. However, the complex metabolism of C. thermocellum is not fully understood, hindering metabolic engineering to achieve high titers, rates, and yields of targeted molecules. In this study, we developed an updated genome-scale metabolic model of C. thermocellum that accounts for recent metabolic findings, has improved prediction accuracy, and is standard-conformant to ensure easy reproducibility. We illustrated two applications of the developed model. We first formulated a multi-omics integration protocol and used it to understand redox metabolism and potential bottlenecks in biofuel (e.g., ethanol) production in C. thermocellum. Second, we used the metabolic model to design modular cells for efficient production of alcohols and esters with broad applications as flavors, fragrances, solvents, and fuels. The proposed designs not only feature intuitive push-and-pull metabolic engineering strategies, but also novel manipulations around important central metabolic branch-points. We anticipate the developed genome-scale metabolic model will provide a useful tool for system analysis of C. thermocellum metabolism to fundamentally understand its physiology and guide metabolic engineering strategies to rapidly generate modular production strains for effective biosynthesis of biofuels and biochemicals from lignocellulosic biomass.


Introduction
In an attempt to redirect carbon and electron flows for enhanced ethanol production, Deng et al. 12 manipulated the pyruvate node and malate shunt of C. thermocellum. By converting phospho-76 enolpyruvate (pep) to oxaloacetate (oaa) and then to pyruvate (pyr ), this shunt can interchange 77 one mole of NADPH with one mol of NADH generated from glycolysis. Interestingly, the authors 78 noted that replacement of the malate shunt by alternative pathways not linked to NADPH increased 79 ethanol production and carbon recovery but reduced amino acid formation, confirming the role of 80 the malate shunt as an NADPH source in C. thermocellum. 81 Sulfur metabolism also plays a key role in redox metabolism of C. thermocellum and has been 82 investigated for its role in ethanol production. Sulfate, a component of C. thermocellum media, 83 serves as an electron acceptor, which is capable of oxidizing sulfate to sulfite and then sulfide. 84 Thompson et al. 10 demonstrated that the strain ∆hydG∆ech∆pfl , which cannot grow in conven-85 tional medium due to its inability to secrete hydrogen or formate, recovered growth by sulfate 86 supplementation to the culture medium. More recently, Biswas et al. 13 reported an increase in final 87 sulfide concentration and over-expression of the associated sulfate uptake and reduction pathway 88 in the ∆hydG strain, but did not observe a significant difference in final sulfide concentration in 89 ∆hydG∆ech. Remarkably, neither of the strains consumed cysteine from the medium, unlike the 90 wild-type. Sulfide can be converted to cysteine by CYSS (Cysteine synthase) or homocysteine and 91 then methionine by SHSL2 and METS (succinyl-homoserine succinate-lyase and methionine syn-92 thase), but the connection between the cessation of cysteine uptake and sulfate metabolism remains 2. 1

.1 Modeling updates
Several modifications were also performed in key bioenergetic reactions. The reactions catalyzed by membrane-bound enzymes, including inorganic diphosphatase (PPA) 8 and membrane-157 bound ferredoxin-dependent hydrogenase (ECH) 40 , were corrected to capture proton translocation. 158 Furthermore, hydrogenase reactions were updated to ensure ferredoxin association for all cases 159 and remove those reactions that do not involve ferredoxin and only use NAD(P)H as a cofactor 160 based on our recent understanding of C. thermocellum metabolism. 13 Gene-protein-reaction asso- was added to enable methionine biosynthesis (essential for growth) without succinate formation. 170 Although this reaction is not currently known to be associated with any gene in C. thermocellum, 171 it is known to be present in other Clostridium GSMs. 29 Next, the reaction deoxyribose-phosphate 172 aldolase (DRPA) was removed to ensure correct lethality prediction of the ∆hydG∆ech∆pfl mu-173 tant strain as well as the correct prediction of growth recovery in this mutant by addition of 174 external electron sinks such as sulfate or ketoisovalerate ( Table 1). The correct prediction of 175 ∆hydG∆ech∆pfl -associated phenotypes is critical to successfully use the model for computational 176 strain design. 30,32,41-45 177 2.2 Comparison of iCBI655 against other genome-scale models 178 We compared iCBI655 with the previous GSMs of C. thermocellum and the highly-curated GSM 179 iML1515 of the extensively studied bacterium Escherichia coli ( Table 2). The increased number 180 of genes in iCBI655 with respect to iAT601 cover a variety of functions, including hydrogenase 181 chaperones, cellulosome and cellulase, ATP synthase, and transporters. Remarkably, iCBI655 has 182 a smaller percentage of blocked reactions than iAT601, indicating higher biochemical consistency.

183
The number of metabolites in iCBI655 is smaller than those in iAT601 mainly due to the removal of 184 metabolites that did not appear in any reaction, duplicated metabolites (e.g., certain isomers), and 185 blocked pathways added automatically during gap-filling without any gene association. C. thermo-186 cellum DSM1313 has 2911 protein coding genes, 22% of which is captured by iCB655, while E. coli 187 MG1655 has 4240 genes, 35% of which is included in iML1515. Overall, iCBI655 increases the 188 coverage of the metabolic functionality of C. thermocellum but remains far from the well studied 189 E. coli. to be condition-specific; however, genome-scale models do not include a mechanistic description 194 that allows to determine these ATP consumption rates as part of the simulation. Instead, GAM 195 is incorporated into the biomass pseudo-reaction and NGAM has its own pseudo-reaction that 196 hydrolyzes ATP at a rate tuned by constraint parameters.

197
To increase model prediction accuracy for various conditions, we trained GAM and NGAM 198 parameters of iCBI655 using an extensive dataset of 28 extracellular fluxes  rial 2) measured during the growth phase under different reactor configurations, carbon sources, 200 and gene deletion mutants. This approach is based on the method used to train the iML1655 E. coli 201 model. 46 Remarkably, we observed highly linear trends under three different conditions, including 202 chemostat reactor with cellobiose as a carbon source, chemostat reactor with cellulose as a carbon 203 source, and batch reactor with either cellobiose or cellulose as carbon sources (Figure 1 a). This 204 model training has led to improved growth rate prediction of iCBI655 as compared to iAT601 that 205 has previously been trained with only a smaller dataset (Figure 1 b). Specifically, the iAT601  The field of metabolic network modeling suffers from a lack of standard enforcement and quality 211 control metrics that limit model reproducibility and applicability. To address this issue, Lieven 212 et al. recently developed the Memote framework that systematically tests for standards and best 213 practices in GSMs. 47 We applied Memote to both the iCBI655 and E. coli iML1515 models for 214 comparison (Figure 1 d). This analysis produces five independent scores that assess model quality.

215
The consistency score measures basic biochemical requirements, such as mass and charge balance of 216 metabolic reactions, and it was near 100% for both models. Additionally, the different annotation 217 scores quantify how many elements in the model contain relevant metadata. More specifically, 218 the systems biology ontology (SBO) annotation indicates if an object in the model refers to a 219 metabolite, reaction, or gene, while the respective annotation scores of these elements correspond 220 to properties (e.g., name, chemical formula, etc.) and identifiers linking them to relevant databases 221 (e.g., KEGG 48 or modelSEED 34 ). The overall score is computed as a weighted average of all the 222 individual scores with additional emphasis on the consistency score. In summary, the high scores 223 obtained by iCBI655 indicate the quality of the model and ensure its applicability for future studies.     the resolution of NADPH as the key cofactor in redox imbalance of ∆hydG∆ech, and it identified 305 specific pathways that undergo major changes in protein levels, providing interesting target reac-306 tions for further engineering. Generally, the developed FC-based omics integration protocol can 307 be applied to different omics data types due to its simplicity. The method does not require one to 308 formulate or assume a quantitative relationship between omics measurements and simulated fluxes.  for efficient biosynthesis of alcohols and esters. Briefly, with ModCell we aim to design a modu-317 lar (chassis) cell that can be rapidly combined with different pathway modules in a plug-and-play 318 fashion to obtain modular production strains exhibiting target phenotypes with minimal strain 319 optimization cycles. In this study, the target phenotype for modular production strains is growth-320 coupled to product synthesis (wGCP ), that corresponds to the minimum product synthesis rate at 321 the maximum growth rate. The design variables to attain the target phenotypes involve genetic 322 manipulations of two types: i) reaction deletions, constrained by the parameter α, that corresponds 323 to gene knock-outs; and ii) module reactions, constrained by the parameter β, that corresponds to 324 reactions deleted in the modular cell but added back to specific modules to enhance the compat-325 ibility of a modular cell. Once these two parameters are specified, the solution to the problem is 326 a set of Pareto optimal designs named Pareto front. In a Pareto optimal design, the performance 327 (i.e., objective value) of a given module can only be increased at the expense of lowering the per-328 formance of another module. To characterize the practicality of each design, we say a modular cell 329 is compatible with certain modules if the design objective is above a specific threshold (e.g., 0.5 in 330 this study). Hence, the compatibility of a design corresponds to the number of compatible modules. Pareto front is composed of 12 designs that can be clustered into two groups (Figure 4 b). The first 336 group (e.g., designs 8, 3, and 9) are compatible with all products except butanol and its derived 337 esters, whereas the second group (e.g., designs 2, 12, 1, and 10) have high objective values for 338 butanol and its derived esters.  (Table 3). These include common targets such as 343 deletion of hydrogenases that appear in the cluster of designs 10, 12, 4, 11, 2, and 7 with the 344 ∆hydG∆ech genotype discussed earlier or removal of reactions that form fermentative byproducts 345 such as ALCD2x and ACALD (ethanol), PFL (formate), LDH L (lactate). Interestingly, ACKr 346 or PTA (acetate) does not appear in this list, likely because acetate production can serve as a 347 regulatory valve for redox metabolism, especially in a modular cell that must be compatible with 348 products of diverse degrees of reduction.

349
More interestingly, we also found important branch-point deletion reactions 52 in central metabolism 350 that have not yet been explored for strain design. Most prominently, these reactions include GLUDy, 351 PEPCK re, and PPDK, which appear in 50%, 33%, and 25% of the designs, respectively (Table 3).

352
Both PEPCK re and PPDK present two alternative routes that influence the ratio of NADPH to 353 NADH, which is relevant to control metabolic fluxes though the specific dependencies of certain 354 enzymes towards each redox cofactor. Since GLUDy consumes NADPH and is a key reaction in Here I and J are the sets of metabolites and reactions in the model, respectively, and v j is the  In different simulation conditions, k, S ij remains fixed given the structure of the network for all i, j ∈ I, J . However, certain bounds u jk and l jk are modified to represent specific metabolic constraints. For example, to apply measured reaction fluxes such as in the case of GAM and NGAM calculation or the omics integration protocol (Section 4.6), l jk and u jk are specified using the experimentally measured average (µ jk ) and standard deviation (σ jk ), which for normally distributed samples with 3 replicates produces an interval with a confidence level above 90% (3-4). Similarly, to represent a certain gene deletion mutant k, the bounds are set to be u jk = l jk = 0 for the associated reaction j.
The feasible flux space Ω k can be explored in different ways; 54,61 for instance, an optimization objective is often defined to identify specific flux distributions v sim jk ∀j ∈ J : Here c j is the coefficient of reaction j in the linear objective function, which is changed according For the FVA method, a sequence of linear programming problems is solved where each flux is minimized (7) and maximized (8): v min jk ∈ arg min {v jk : v jk ∈ Ω k } ∀j ∈ J (7) v max jk ∈ arg max {v jk : v jk ∈ Ω k } ∀j ∈ J Then, the fluxes are normalized to the substrate uptake rate v uptake,k and fold change is calculated 474 in log 2 space (11).

475
FC sim j (v j,mut , v j,wt ) = log 2 floor v j,mut |v uptake,mut | − log 2 floor v j,wt |v uptake,wt | (11) 4.6.2 Calculation of measured fold changes 476 Fold change between case and control samples, FC l , is calculated in log 2 space for each gene l ∈ L, where L is the set of genes in the model. These gene fold changes can be mapped to metabolic reaction fold changes using the gene-protein reaction associations (GPR), given G j as the set of genes with FC l = 0 in the GPR of reaction j: where M ⊆ J is the set of consistent reactions which is considered for further analysis and the simulated fold changes are re-defined for brevity (14-15).
FC sim,pFBA FC sim,FVA       Tables  Table 1: Comparison of mutant growth rates predicted by iAT601 and iCBI655. Experimental values are taken form Thompson et al. 10 ; for some mutants whose growth recovery, not growth rate, was reported, they are presented with "+".    Figure 3: Metabolic map visualization using the iCBI Escher map. Values next to reaction labels correspond to proteomics fold change between ∆hydG∆ech and wild-type strains only for the 70 consistent reactions identified by the omics analysis (Section 2.5). The labels of extracellular metabolites are appended with " e". Reactions marked with a red cross are deleted in ∆hydG∆ech.