Comparison of transcriptional profiles of Clostridium thermocellum grown on cellobiose and pretreated yellow poplar using RNA-Seq

The anaerobic, thermophilic bacterium, Clostridium thermocellum, secretes multi-protein enzyme complexes, termed cellulosomes, which synergistically interact with the microbial cell surface and efficiently disassemble plant cell wall biomass. C. thermocellum has also been considered a potential consolidated bioprocessing (CBP) organism due to its ability to produce the biofuel products, hydrogen, and ethanol. We found that C. thermocellum fermentation of pretreated yellow poplar (PYP) produced 30 and 39% of ethanol and hydrogen product concentrations, respectively, compared to fermentation of cellobiose. RNA-seq was used to analyze the transcriptional profiles of these cells. The PYP-grown cells taken for analysis at the late stationary phase showed 1211 genes up-regulated and 314 down-regulated by more than two-fold compared to the cellobiose-grown cells. These affected genes cover a broad spectrum of specific functional categories. The transcriptional analysis was further validated by sub-proteomics data taken from the literature; as well as by quantitative reverse transcription-PCR (qRT-PCR) analyses of selected genes. Specifically, 47 cellulosomal protein-encoding genes, genes for 4 pairs of SigI-RsgI for polysaccharide sensing, 7 cellodextrin ABC transporter genes, and a set of NAD(P)H hydogenase and alcohol dehydrogenase genes were up-regulated for cells growing on PYP compared to cellobiose. These genes could be potential candidates for future studies aimed at gaining insight into the regulatory mechanism of this organism as well as for improvement of C. thermocellum in its role as a CBP organism.


INTRODUCTION
Microbial conversion of biomass to biofuels is an attractive route for biofuel development, but an essential challenge is to increase the microbial capacity both for overcoming the biomass recalcitrance and for converting the biomass-derived sugars to biofuels Alper and Stephanopoulos, 2009). Clostridium thermocellum, a gram-positive, thermophilic, anaerobic bacterium, is one of the model consolidated bioprocessing (CBP) systems used to study the enzymatic hydrolysis of cellulosic biomass to produce fuels (Islam et al., 2006;Brown et al., 2012;Yang et al., 2012). The following features of C. thermocellum Abbreviations: Abf, α-N-arabinofuranosidase; ADH, alcohol dehydrogenase; CBO, cellobiose-grown cells only; Cbp, carbohydrate-binding protein; CBP, consolidated bioprocessing; CDP, cellodextrin phosphorylase; CEP, cellobiose phosphorylases; COG, clusters of orthologous groups; emPAI, exponentially modified protein abundance index; FC, fold changes; Fdred, reduced ferredoxin; GH, glycoside hydrolases; NSAF, normalized spectral abundance factor; PF, passing filter; PTA, phosphotransacetylase; PYP, pretreated yellow poplar; PYPO, PYP-grown cells only; qRT-PCR, quantitative reverse transcription-PCR; RPKM, reads per kilobase of exon model per million mapped reads; RsgI, RsgI-like anti-sigma factors; SigI, sigma factor. contribute to its suitability as a model cellulolytic, biofuelproducing bacterium: (1) It produces cellulosomes, a type of highly organized multi-protein enzyme complexes, which have been shown to be highly efficient enzyme systems in deconstructing plant cell wall, especially in degrading the recalcitrant substrate crystalline cellulose (Bayer et al., 1998(Bayer et al., , 2004. (2) It carries out mixed-product fermentation, producing ethanol, H 2 and numerous organic acids including acetate, formate, and lactate (Demain et al., 2005;Islam et al., 2006). (3) It is suitable for both submerged and solid-state fermentation (Bayer et al., 2007), the latter having similarity to compost in the set-up of feedstock (Wei et al., 2012). (4) Its genome sequence is available (http:// genome.jgi-psf.org/cloth/cloth.info.html) and many of the cellulolytic enzymes are identified and biochemically characterized. The knowledge gained from studies of this species will benefit work on other clostridial species of industrial interest, such as C. acetobutylicum, known to produce the potential fuels acetone, ethanol, and butanol (Cooksley et al., 2012).
Fulfilling this potential will require a more in-depth understanding of the metabolic and genetic mechanisms by which C. thermocellum utilizes recalcitrant biomass substrate. So far, a number of studies have been carried out on C. thermocellum, such as transcriptomic analysis of stress responses to ethanol, furfural, and heat during growth on pure sugars (i.e., cellobiose) (Yang et al., 2012), time course studies of cell growth on pure crystalline cellulose (Avicel) (Raman et al., 2011), and comparisons of growth on pure crystalline cellulose (Sigmacel 50) and cellobiose (Riederer et al., 2011). However, despite reports on the sub-proteomic analyses of cellulosomes from C. thermocellum grown on Avicel (Gold and Martin, 2007) and on combinations of Avicel with pectin and/or xylan, or on pretreated switchgrass (Raman et al., 2009), there were no genome-wide transcriptomic studies reported for growth on pretreated woody plant biomass until most recently a transcriptomic analysis comparing C. thermocellum cells grown on pretreated switchgrass and woody biomass, cottonwood (Populus trichocarpa × Populus deltoids; black cottonwood × eastern cottonwood in common names), was published on December 2, 2013 (Wilson et al., 2013a), after the submission of this manuscript.
Woody biomass has been found to be more recalcitrant to enzyme digestion than is herbaceous biomass. For example, whereas the Recalcitrance Index for switchgrass is ∼0.35, the index for hardwoods, such as yellow poplar, is 0.56, indicating that hardwoods are more difficult to be degraded (Wei et al., 2009). Previous studies have shown that cellulolytic bacteria grown on different lignocellulosic substrates have different levels of glycoside hydrolase (GH) activities (Irwin et al., 2003). As such C. thermocellum grown on pretreated woody plant biomass is likely to have distinctive responsive genes, as well as different composition of cellulolytic enzymes different from those grown on herbaceous biomass. Recently, we found that species of the genus Clostridium (including C. thermocellum) were among the dominant species, comprising 6.3% of the total, in an anaerobic community decomposing yellow poplar wood chips (van der Lelie et al., 2012). In parallel, we studied the yellow poplar compost system (Wei et al., 2012), in which Clostridium was also found to be one of the dominant bacteria (data not shown). These results prompted us to explore C. thermocellum grown on dilute acid-pretreated yellow poplar (PYP) as a single-species model for the plant cell wall-degradation. PYP is a widely-utilized feedstock in process-development for conversion of biomass to fuels and chemicals, and it is important to identify specific enzymes that this organism calls into action to attack this form of the feedstock/substrate. PYP has been previously demonstrated that 60% conversion to simple sugars can be achieved with a loading of ∼8.4 mg per g biomass cellulose of a mixture of Trichoderma reesei CBHI and Acidothermus cellulolyticus E1 (95%: 5% on molar basis) (Vinzant et al., 1994;Baker et al., 1997). In this study, C. thermocellum was bench fermented using PYP and cellobiose as sole carbon sources, respectively, and transcriptional profiles were analyzed. RNA-seq is a recently developed technology for transcriptome profiling, which uses next-generation sequencing to reveal the presence and quantity of RNA in biological samples. The goals of this study are two-fold: first, we identified genes responsive for degradation of recalcitrant biomass substrates in PYP-vs. cellobiose-grown C. thermocellum cells. Secondly, we specifically focused on candidate genes related to cellulosome, cellodextrin transport, polysaccharide signal transduction and end-product synthesis related genes. These candidate genes are likely to be valuable for mechanism study; as well as for protein-engineering to further improve the abilities of this already potent organism to degrade the cell walls of recalcitrant biomass feedstocks. Figure 1A shows the overall experimental approach we designed to investigate the C. thermocellum utilization of pretreated biomass substrates as reflected by changes in the cell's transcriptome. Details of the experimental approach are described in the following sections.

CARBON AND LIGNOCELLULOSIC SUBSTRATES
Cellobiose and other chemical compounds were purchased from Sigma (St. Louis, MO). PYP was prepared as described previously (Tucker et al., 1998). Briefly, the milled yellow biomass (20% solids loading) was pretreated in 0.21% w/w H 2 SO 4 at 200 • C for 4 min. The resultant PYP contained ∼65% cellulose, 4% xylan, and 31% lignin (dry weight basis). PYP was exhaustively washed with deionized water until the pH reached that of deionized water, prior to being used in medium preparation as described below.

MICROORGANISMS AND CULTURE CONDITIONS
C. thermocellum ATCC 27405 was routinely cultured at 60 • C anaerobically in 30-mL serum bottles containing 10 mL of ATCC medium 1191 containing cellobiose, and was subcultured with 2% inoculum taken at exponential growth phase. For this study, similarly to practices in the literature (Gold and Martin, 2007;Rydzak et al., 2009), 100-mL batch culture in 250-mL serum bottles was used for the growth of the strain anaerobically at 58 • C in ATCC medium 1191, containing 0.30% (w/v) cellobiose as the control carbohydrate substrate, or 0.44% PYP as biomass substrate (sugar-content equivalent to 0.30% cellobiose) with an agitation of 130 rpm. Cell growth was monitored based on either the measurement of optical density (OD) by spectrophotometry at 600 nm (OD 600 ) for cellobiose-grown culture, or the increase in pellet protein amount for PYP-grown culture using the method described in literature (Raman et al., 2009). Three independent sets of PYP-vs. cellobiose-grown cell cultures, harvested at the late stationary phase (20 h for cellobiose-grown cells and 36 h for PYP-grown cells), were carried through the downstream RNA extraction preparations. Two sets of RNA preparations were used for cDNA library construction and RNA-Seq, whereas all three sets of RNA preparations were used for qRT-PCR analysis in verifying RNA-Seq data of selected genes.

MEASUREMENT OF H 2 AND ETHANOL, AND ISOLATION OF BACTERIAL TOTAL RNA
Cells were harvested for RNA isolation in late stationary phase after the hydrogen concentration at the headspace was analyzed using GC method, wherein an Agilent 7890A GC equipped with a Supelco 60/80 molsieve 5A column was used for the measurement of H 2 . The cell culture was centrifuged at high speed (8000 × g) for 5 min to pellet the bacterial cells. Cell-free culture supernatants were used to measure ethanol concentration by using HPLC (Agilent) with a refractive index detector (Shimadzu, Kyoto, Japan). All samples were filtered through a 0.45-μm filter before HPLC analysis. The organic acids were separated in an Aminex HPX-87H column (Bio-Rad) running at a flow rate of 0.6 ml/min at 50 • C, with 4 mM H 2 SO 4 as eluent.
The cell pellets collected from 10 mL culture were used for RNA extraction, using a combined protocol (Yang et al., 2012), in which the TRIzol (Invitrogen, Carlsbad, CA) extraction aqueous phase was mixed with an equal volume of 70% ethanol, and then applied to the column of Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA) for purification according to the manufacturer's instructions to obtain total RNAs. 7.5 μg total RNAs were mixed with 3 μL DNase I (Invitrogen; 1 U/μL) in 3 μL 10x DNase I reaction buffer, with adding RNase-free water added up to a total volume of 30 μL, and incubated at room temperature for 15 min. The DNase I was inactivated by adding 3 μL of 25 mM EDTA and heating for 10 min at 65 • C. The DNase I treated total RNA was precipitated by the ethanol-glycogen method and redissolved in 15 μL of 1 mM EDTA. Since the total RNAs contains 75% rRNA (Chen and Duan, 2011), mRNA was enriched by using the MICROBExpress Bacterial mRNA Enrichment Kit (cat # AM1905, Ambion, Life Technologies, Austin, TX) to remove 16 and 23 s rRNAs. The resultant RNAs were quantified and analyzed for integrity on the Agilent 2100 Bioanalyzer, and used for cDNA library construction as described below.

cDNA LIBRARY CONSTRUCTION, SEQUENCING, AND READ MAPPING
cDNA libraries were constructed with the RNA-Seq sample preparation kit (RS-100-0801, Illumina, San Diego, CA), using the procedure provided by the manufacturer. Briefly, the above enriched mRNA samples were fragmented, and then annealed to random hexamers and reverse transcribed. After second strand synthesis, end repair, and A-tailing, cDNA fragment ends were ligated to adapters that were complementary to sequencing primers. Resultant cDNA libraries were size separated on agarose gels with ∼200 bp fragments being excised, and amplified by 15 cycles of PCR. The prepared cDNA libraries were sequenced on an Illumina Genome Analyzer II by the Michigan State University DNA facility, using the standard protocols and running for 75 cycles of data acquisition. Solexa reads were aligned to the reference genome sequence of C. thermocellum strain (ATCC 27405) deposited at NCBI (Accession: NC_009012.1) using Bowtie (Langmead et al., 2009). The reference genome is 3.8-Mb long with 3305 predicted genes, including 71 structural RNA genes, according to the NCBI website for the genome (http://www.ncbi.nlm.nih.gov/genome/? term=Clostridium+thermocellum) accessed on April 28, 2010. The best end-to-end alignment (ties broken by read quality) was used with no more than two mismatches. The reads that cannot be unambiguously mapped were not included for further analysis. The above alignment approach and criteria were similar to those used in literature (Jia et al., 2009;Lin et al., 2011).

GENE EXPRESSION NORMALIZATION AND IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES
After read mapping, the midpoint of the read-reference alignment was used to determine which gene that read belongs to (or was derived from), and the RNA-seq read counts for each gene can be then calculated. To facilitate comparison of gene expression levels, both within and between two samples, we quantified and normalized gene expression levels by calculating the reads per kilobase of exon model per million mapped reads (RPKM), i.e., calculating the number of reads mapped per kilobase of exon model divided by the total number of mapped reads in the whole dataset (Mortazavi et al., 2008). As an indicator of gene expression level, the RPKM is widely recognized as a measure of Solexa read density that reflects the molar concentration of a transcript in the starting RNA sample, thus making the normalized gene expression levels comparable within and among samples.
The fold-change for the expression of an individual gene was calculated by taking the ratio of RPKM in PYP-grown cells to that in cellobiose-grown cells (RPKM PYP /RPKM cellobiose ). The cutoff value for defining a gene as "differentially expressed" is either a two-fold increase (with the value of RPKM PYP /RPKM cellobiose larger than 2.0), or a two-fold decrease (with the value of RPKM PYP /RPKM cellobiose less than 0.5).

STATISTICAL ANALYSIS
The log2-transformed raw RPKM dataset was imported into the statistical analysis software JMP Genomics 6.0 software (SAS Institute, NC), and the data of PYP-and cellobiose-grown cell samples were normalized together using the LOWESS normalization algorithm within JMP Genomics. To determine if the differential expression levels between PYP-and cellobiosegrown cell samples were statistically significant, the normalized Log2(RPKM) data were subjected to One-Way analysis of variance (ANOVA) as described in literature (Yang et al., 2012(Yang et al., , 2013Wilson et al., 2013b). The False Discovery Rate (FDR) testing method was used with a significance threshold of p < 0.05 being considered statistically significant.

BIOLOGICAL INTERPRETATION OF DIFFERENTIALLY EXPRESSED GENES
The identified, differentially expressed genes were interpreted and discussed in the context of biological processes and functions, using clusters of orthologous groups (COG) and Carbohydrate-Active enZYmes (CAZy) analyses of proteins (www.cazy.org/).

QUANTITATIVE REVERSE TRANSCRIPTION-PCR (qRT-PCR)
Based on their potential functional importance, 12 genes were selected for validating the results of the RNA-Seq analysis. The primers for these genes were either based on literature or designed with Primer Express 3.0 software (Applied Biosystems), and are described along with the PCR results in the Results and Discussion section. Total RNA was extracted from three sets of independent cultures grown on PYP vs. cellobiose as described above, and then converted to cDNA by random priming, using the SuperScript II kit (Invitrogen, San Diego, CA). PCR reactions were run in triplicate using procedure as previously described (Wei et al., 2012). The transcription level of genes was determined according to the 2 − CT method, using RecA as a reference gene for the normalization of gene expression levels (Stevenson and Weimer, 2005).

CORRELATION ANALYSIS FOR CELLULOSOME-RELATED GENE EXPRESSION AND PROTEIN ABUNDANCE
Two sets of quantitative cellulosomal protein data of cellobiosegrown C. thermocellum cells at late stationery phase were retrieved from literature (Gold and Martin, 2007;Raman et al., 2009), and plotted against our sub-dataset of log2(RPKM) values of cellulosomal genes in cells grown on the same substrate at the same culture phase in this study. Pearson correlation coefficient values were calculated using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA), and used as an indicator for the degree of correlation for the compared pairs.

COMPOSITIONAL ANALYSIS OF PRETREATED YELLOW POPLAR RESIDUES
The PYP residues from PYP-grown C. thermocellum culture were collected by centrifugation at low speed (100 × g) for 2 min to precipitate the insoluble substrate not consumed by the bacterial cells in the culture. Such centrifugation speed has been used in the literature to remove any insoluble substrate in C. thermocellum culture (Dror et al., 2005). Compositional analysis of the collected PYP residue was performed by the National Bioenergy Center, National Renewable Energy Laboratory, using method described in literature (Templeton et al., 2010).

C. THERMOCELLUM GROWTH AND THE PRODUCTION OF HYDROGEN AND ETHANOL
The first step of this study, as illustrated in Figures 1A-C, is the growth of C. thermocellum ATCC 27405 with two types of carbohydrate substrates, PYP being compared with cellobiose. GC analysis of gas composition in the headspace of batch cultures at the late stationary phase revealed a production yield of 1.22 vs. 0.92 mole H 2 /mole glucose consumed in cellobiose-and PYP-grown C. thermocellum, respectively (Table 1). Furthermore, HPLC analysis of metabolite production in supernatants of harvested cell culture revealed a production yield of 0.51 vs. 0.30 mole ethanol /mole glucose equivalent consumed in cellobioseand PYP-grown C. thermocellum, respectively. Overall, the production of hydrogen and ethanol by PYP-grown C. thermocellum are 75 and 58% of those by cellobiose-grown C. thermocellum, respectively ( Table 1). Wet chemistry analysis of spent PYP solids revealed that 52% of the glucan in the starting PYP substrate was consumed during the fermentations, which can partially explain the lower absolute H 2 and ethanol productions compared to the culture on cellobiose substrate (in which nearly all cellobiose was depleted at late stationary phase).

RNA-Seq RESULTS
The quality of RNA-Seq cDNA libraries was assessed on a Bioanalyzer prior to GA II (Figure 2). The results showed the size distribution of cDNA library was between 200 and 300 bp, which met the requirement for Solexa sequencing.
For the two PYP-grown cell cDNA library samples, Solexa sequencing from the first cDNA library generated 20.8 million total raw reads, with 14.5 million passing filter reads (70% PF reads); from the second cDNA library generated 19.6 million total raw reads, with 14.1 million passing filter reads (72% PF reads). Out of these PYP-grown cell derived reads, 2.3 and 2.1 millions reads were unambiguously mapped to C. thermocellum genome with the criteria set in the Materials and Methods section, respectively. For the two cellobiose-grown cell cDNA library samples, Solexa sequencing from their first cDNA library generated 21.1 million total raw reads, with 15.7 million passing filter reads (74% PF reads); from the second cDNA library generated 19.7 million total raw reads, with 14.0 million passing filter reads (71% PF reads). Out of these cellobiose-grown cell-derived reads, 2.6 and 2.2 millions reads were unambiguously mapped to C. thermocellum genome, respectively. The RPKM value for each gene in each condition was calculated as RPKM = Reads number  mapped to gene/Length of the gene (kb)/Total reads number (million reads), as described in the Materials and Methods section.
Overall, sequence analysis successfully aligned the Solexa reads to 3081 protein-coding genes and 61 structural RNA genes in cellobiose-and/or PYP-grown C. thermocellum mRNA samples, accounting for 95% [i.e., (3081 + 61)/3305] of all C. thermocellum genes in the genome, indicating that RNA-Seq analysis in this study achieved comprehensive coverage of the C. thermocellum transcriptome. In addition, the above result is consistent with the reports that in the genomes of other microorganisms such as Saccharomyces cerevisiae and S. pombe, more than 90% of the genes are transcriptionally active and expressed (Nagalakshmi et al., 2008;Wilhelm et al., 2008;Wang et al., 2009). The detailed lists of genes for the RNA-Seq sequences identified in the cellobiose-and PYP-grown cells are presented in Supplementary Data Sheet 1. Each of the RPKM values of cellobiose-grown cells and RPKM of PYP-grown cells was the average of two replicates.
Our data from Solexa-read transcriptome measurement of PYP-vs. cellobiose-grown C. thermocellum illustrate some key characteristics of the results. First, the obtained RPKM values for most of the transcripts of the active protein-coding genes in PYP-and cellobiose-grown C. thermocellum, are in the range of 1-50,000 (which can be revealed by re-sorting the Supplementary Data Sheet 1). Such range of RPKM values is comparable to the reported range of RPKM values in RNA-Seq whole-transcriptome analysis of other organism samples (Tang et al., 2009).
The gene expression levels can be classified into four levels: low, moderate, high, and very high, as illustrated in the X axis in Figure 3. While the RNA-Seq data sets of PYP-vs. cellobiosegrown cells had comparable numbers of genes that fall in the moderate expression levels (i.e., both 40% of the active proteinencoding genes), the PYP-grown cells had more highly expressed genes (41 vs. 33% in that of cellobiose-grown culture; Figure 3). In contrast, cellobiose-grown cells had more lowly expressed genes (26 vs. 18% in that PYP-grown culture).

FIGURE 3 | RPKM frequency histogram of transcripts from RNA-Seq of pretreated yellow poplar (PYP)-vs. cellobiose-grown C. thermocellum cells.
The diagram shows the distribution of the number of genes expressed at different RPKM levels. The total number of protein coding genes aligned with the RNA-Seq was 3081; the percentage value above each bar indicates the genes at specific expression level accounting for the proportion of total number of genes. The * mark indicates that significantly different frequencies (i.e., numbers of genes) were observed between the two RNA-Seq data sets from PYP-vs. cellobiose-grown C. thermocellum cells.

CHANGES IN GLOBAL GENE EXPRESSION AND CLUSTERS OF ORTHOLOGOUS GROUPS (COG) ANALYSIS
To compare the differential expression of genes between PYPand control cellobiose-grown cells, fold changes (FC) were computed as the ratio of the RPKM values obtained for individual genes in PYP-against cellobiose-grown cells. Analysis of changes in global gene expression had identified 1211 genes that show a two-fold or greater increase in expression (i.e., FC ≥ 2.0) or detected in PYP-grown cells only (referred as PYPO genes), and 314 genes with a two-fold or greater decrease (i.e., FC ≤ 0.5) in transcript abundance for PYP-against cellobiose-grown cells or detected in cellobiose-grown cells only (referred as CBO genes), as listed in Supplementary Data Sheet 2 with related statistic analyses showing that they were statistically significant.
The COG distribution for these up-and down-regulated genes in the transcriptome was determined and the result is shown in Figure 4. The top two categories for up-and down-regulated genes belong to the categories "general function prediction, [R]" (equivalent to unclassified) and "inorganic ion transport and metabolism, [P]." A closer examination of the distribution chart reveals that for the up-regulated genes, the categories "signal transduction mechanisms, [T]," "amino acid transport and metabolism, [E]," and "energy production and conversion, [C]" (which is important for the energy-consuming process for biofuel end-product production), were also the most well-represented categories in the transcriptome, with the number of induced genes above 200 for each category (Figure 4).
An important category of COG for the degradation and utilization of lignocellulosic substrate is "carbohydrate transport and metabolism, [G]," which includes primarily cellulosome-related genes. In this category, the number of up-and down-regulated genes is 175 and 46, respectively (Figure 4). These differentially expressed genes are described in detail in later section.

VALIDATION OF RNA-Seq RESULTS WITH QUANTITATIVE REVERSE TRANSCRIPTION-PCR
Quantitative reverse transcription-PCR (qRT-PCR) is a well accepted method for verifying microarray (Ferreira et al., 2010;Yang et al., 2012) and RNA-Seq data (Cusick et al., 2012;Huang et al., 2012;Ji et al., 2013). We used this method to validate the expression patterns of 12 selected genes in independent biological replicates. The selected genes mainly represented different functional categories involving in cellulosome, hydrogen and ethanol production and had a range of FC values based on RNA-Seq. In addition to these 12 genes, the gene RecA was chosen as a reference gene to normalize the real time RT-PCR data because it has been commonly used as reference gene for C. thermocellum (Stevenson and Weimer, 2005). As a further confirmation for the use of this gene as reference gene, RecA did not differ in expression in our cell samples grown on two substrates, the values being RPKM 1166 for PYP-grown cells vs. RPKM 1237 for cellobiose-grown cells. The primers for all 13 genes, along with one-by-one comparisons of the fold-changes in expression of each gene as measured by RNA-Seq and qRT-PCR, are listed in Supplementary Data Sheet 3, section 1. Most of the qRT-PCR data matched the RNA-Seq based FC values with a correlation coefficient of 0.95 for the set of 12 selected genes, which indicated that our RNA-Seq result is accurate and the conclusion from RNA-Seq should be reliable.

TRANSCRIPTIONAL CHANGES OF CELLULOSOMAL COMPONENTS
Cellulosome is an extracellular supramolecular machine that can efficiently degrade crystalline cellulosic substrates and associated plant cell wall materials. We are especially interested in the genes encoding cellulosomal component proteins in response to cellulosic substrates, which can lead to the identification of regulatory or rate-limiting components regarding cellulolysis.
The number of reported cellulosomal genes in the genome of ATCC 27405 strain has been increasing from the initial numbers of 71 (Zverlov et al., 2005) and 72 (Zverlov and Schwarz, 2006), to more recently, 81 genes (Raman et al., 2011). Note that the latter number is consistent with the updated genome annotation. Of these 81 genes, all were detected transcriptionally in this study. The list of 81 cellulosomal genes with their FC values is described in Table 2.
Analysis of this sub-transcriptome showed the following features: first, the overall cellulosome-associated genes were upregulated significantly in PYP-vs. cellobiose-grown cells, reflected by the facts that out of the 81 cellulosomal genes, 47 (i.e., 58%) were up-regulated with FC ≥ 2.0. In contrast, only 4 out of 81 (i.e., about 5%) cellulosomal genes, including a GH5 gene (Cthe_2193), CelU (Cthe_2360), XghA (Cthe_1398), and Cthe_0438, were down-regulated by two-fold or detected in celllobiose-grown cells only. The overall average FC value for all these 81 cellulosomal genes is 3.5 (see the bottom row in Table 2),

FIGURE 4 | COG functional categories of the 1211 up-and 314 down-regulated genes in PYP-grown cells compared with cellobiose-grown cells.
Note that since COG annotation classes overlap, the sum of COG annotated genes is larger than the number of total up-and down-regulated genes analyzed.
suggesting the whole cellulosome machinery was "geared up" at the late stationary phase on PYP substrate.
Secondly, the primary scaffoldin, CipA, and the main secondary scaffoldins, OlpB, Orf2P, and SdbA, have shown significant up-regulation and were found to have the highest abundance on the transcriptional level. This implies that the cellulosomal system is crucial for the efficient degradation of pretreated PYP. Our data further verifies the notion that the C. thermocellum cellulosome is the main contributor to the extremely high activity observed in cellulose degradation. Additionally, one putative scaffoldin gene, Cthe_0736 (cellulosome anchoring protein) has been up-regulated as much as 2.8 times, i.e., FC 2.8 (Table 2, row 31). OlpC, which has been recently identified as an important outer layer protein -cellulosome anchoring protein cohesin subunit (Pinheiro et al., 2009), was also found to be up-regulated in this study (Cthe_0452, FC 4.1; Table 2, row 19).
Thirdly, the major cellulosomal cellulases, such as CelS (Cthe_2089, exo-, FC 4.5; RPKM 3848-the most abundant cellulosomal transcripts in PYP-grown cells), CelA (Cthe_0269, endo-, FC 1.7, RPKM 1885-the third most abundant transcript in PYP-grown cells), CbhA (Cthe_0413, exo/endo, FC 7.5, RPKM 367), and Cel124A (Cthe_0435, endo-; FC 2.0; RPKM 401) were remarkably up-regulated, as shown in Table 2. Exo-and endorefer to the substrate site upon which the GHs act. While exocellulases remove one or more sugar units from the ends of polysaccharide chain, endoglucanases randomly hydrolyze the internal glycosidic bonds of polysaccharides. Among the above listed enzymes, CelS was reported to display high synergy with the endo-Cel124A (Brás et al., 2011). Our observation that the CelS had the highest RPKM value in PYP-grown cells is consistent with a report that a knockout mutant of this gene showed a ∼60% reduction in cell cellulolytic performance (Olson et al., 2010), and is also consistent with a most recent finding that CelS gene was found to be highly expressed in C. thermocellum cells grown on both pretreated switchgrass and cottonwood substrates (Wilson et al., 2013a). This study furthered such transcriptional analysis by showing that the transcript of CelS is not only highly abundant, but also significantly responsive to biomass substrate, as its RPKM level in cellobiose-grown cell was 4.5 times lower ( Table 2).
In addition, there was one cellulosome dockerin type I gene (Cthe_0438) for which the transcript was detected only in cellobiose-grown cells, with a low RPKM value of 4 (Table 2, row 81). This observation is consistent with a previous report regarding its absence in the sub-proteome of cellulosomes from cells grown on pretreated switchgrass (Raman et al., 2009). The domain architecture of Cthe_0438 is "DUF843-type I dockerin" (DUF indicates domain of unknown function). It is difficult to classify the association of this gene with any of the specific catalytic enzyme types, cellulases or hemicellulases, and thus this issue may warrant further studies.

EXPLORING THE CORRELATION BETWEEN RNA-Seq AND PUBLISHED PROTEOMIC DATA FOR CELLULOSOMAL GENES
Literature reports have described the quantitative sub-proteomic data of cellulosomes extracted from cell-free culture filtrates of C. thermocellum grown to late stationary phase on cellobiose, Avicel, and other cellulosic substrates (Gold and Martin, 2007;Raman et al., 2009). In one study (Gold and Martin, 2007), an "emPAI," defined as the exponentially modified protein abundance index, showed a linear relationship with protein concentration and was normalized to the value obtained for CipA. Similarly, in another study (Raman et al., 2009), the normalized spectral abundance factor (NSAF) represented the number of spectral counts divided by the number of amino acid residues in the protein and was also normalized to the value obtained for CipA. To determine whether or not a correlative relationship exists between these published cellulosomal sub-proteomic data and the RNA-Seq RPKM data in the present study, we first retrieved the emPAI/CipA and NSAF/CipA data from the literature (Supplementary Data Sheet 3, section 2), and then plotted it against the Log2(RPKM) data for the genes encoding the same proteins in this study. The results showed that despite the fact that the plotted data were obtained from three different research groups, there remain strong correlations between the RNA-Seq RPKM and the protein abundance indicators of emPAI/CipA and www.frontiersin.org April 2014 | Volume 5 | Article 142 | 9

FIGURE 5 | Correlation between the observed gene abundances in this study and the published protein abundance data of cellulosomal components. (A) Correlation between Log2(RPKM) and emPAI/CipA. (B)
Correlation between Log2(RPKM) and NSAF/CipA. The literature sources for data of emPAI/CipA and NSAF/CipA are described in the Results and Discussion section.

NON-CELLULOSOMAL GLYCOSIDE HYDROLASE PROTEINS AND "FREE" CELLULASES
In addition to the GHs existing in cellulosomes, the C. thermocellum genome includes non-cellulosomal genes coding for GHs, among which 12 were up-regulated by more than twofold (Table 3). Notably, both CelC (Cthe_2807) and Lic16A (Cthe_2809), the two members in the putative CelC-GlyR3-Lic16A operon (Newcomb et al., 2011), were up-regulated and predicted to be extracellular and/or bacterial cell wall associated ( Table 3). This was consistent with a recent report that genes in this operon were mainly expressed in the stationary phase (very little during exponential phase) of pure cellulose (Avicel) fermentation (Raman et al., 2011). In addition, Lic16A contains CBM54 and a tandem of four CBM4s, in which both types of CBMs are able to bind xylan; as well as cellulose (Dvortsov et al., 2010).
Remarkably, the grouping of four CBM4s in this protein has an ∼100-fold higher binding constant for xylan and cellulose than that of a protein with a single CBM4 module (Dvortsov et al., 2012). It is surprising that Cel9I (Cthe_0040), an important noncellulosomal processive endoglucanase that could digest crystalline cellulose with high efficiency, showed low abundance and FC value (RPKM 62 in PYP-grown cells, FC 1.3; see Supplementary Data Sheet 1, row no. 1875). Similarly, the only non-cellulosomal exo-cellulase CelY (Cthe_0071, exo-) also showed both low abundance and FC value (RPKM 61, FC 1.4, see Supplementary Data Sheet 1, row no. 1741), which is consistent with a recent report that knocking out the CelY gene had no significant impact on the cellulolytic capacity of the strain (Olson et al., 2010). Based on the evidence above, it would seem that the free-enzyme system plays a less important role than does the cellulosome system in the degradation of PYP by C. thermocellum.
In this study, while the transcripts of all seven pairs of genes had been detected (see Supplementary Data Sheet 1), three pairs of genes and an extra SigI-RsgI pair (which was not included in the initial literature prediction shown above) were up-regulated with FC values above 3.0 in the PYP-against cellobiose-grown cells at the late stationary phase, as described below: (1) σI3-RsgI3 (Cthe_0315-Cthe_0316), with FC 4.2, and 5.9, respectively. According to literature, RsgI3 contains a PA14 dyad domain that target pectin . The differential expression of this pair was not previously reported in cellulose-grown C. thermocellum.
(2) σI4-RsgI4 (Cthe_0403-Cthe_0404), with FC 3.1, and 3.9, respectively. RsgI4 contains CBM3 that binds cellulose. This observation supports previous observation of this pairs being involved in the time course of pure cellulose fermentation by C. thermocellum (Raman et al., 2011). (3) σ 24C -Rsi24C (Cthe_1470-Cthe_1470), with FC 4.0 and 5.9, respectively. Note that Rsi24C contains a module that resembles GH5 that target cellulose based on literature (Nataf et al., 2010). Similar to σI3-RsgI3, the differential expression of this pair was also not previously reported in cellulose-grown C. thermocellum. (4) Equally interesting, we found that another SigI-RsgI pair (Cthe_2521-Cthe_2522) was detected on transcriptome, with FC values of 3.9 and 4.8, respectively (Supplementary Data Sheet 1). The protein of sigma factor Cthe_2521 had been detected in the proteome of C. thermocellum cultured on cellobiose , but was not in the initial 7 pairs of SigI-RsgI proposed for polysaccharide signal transmission in the literature . The possible role of this pair warrants further investigation.

GENES RELATED TO CELLODEXTRIN TRANSPORT AND PHOSPHORYLATION
C. thermocellum has been reported to use ABC-type transporters for uptake of oligosaccharides derived from cellulose hydrolysis (Strobel et al., 1995), which is an important energyconserving mechanism by which importing long cellodextrins can reduce the cost of transport as one-ATP molecule is consumed per transport event. So far, four cellodextrin ABC transporters (carbohydrate-binding protein CbpA, B, C, and D) have been characterized for their substrate binding features (Nataf et al., 2009). Whereas CbpA (Cthe_0393) binds only to cellotriose (G3), CbpB (Cthe_1020) binds to G2-G5 cellodextrins, and CbpC (Cthe_2128) and D (Cthe_2446) bind to G3-G5 cellodextrins (Nataf et al., 2009;Rydzak et al., 2012). Several transcripts of these annotated genes have been detected in the transcriptome of C. thermocellum (Raman et al., 2011;Riederer et al., 2011). Previously, six cellodextrin ABC transporter genes (Cthe_1862, Cthe_0391-0393, and 1019-1020, including CbpA and CbpB) were found to be expressed at high levels throughout the course of Avicel alone fermentation (Raman et al., 2011). Most recently, Cthe_0391-0393 were found to be highly expressed on both pretreated switchgrass and cottonwood substrates (Wilson et al., 2013a). This study furthered such transcriptional analysis by comparing the gene expression between PYP-and cellobiose-grown cells at late stationary phase. A total of 12 transcripts of cellodextrin ABC transporters have been detected (Supplementary Data Sheet 3, section 3; also Figure 6), among which, seven were up-regulated in PYP-against cellobiose-grown cells, with FC values in the range of 3.1-10.3, suggesting that the PYP-grown cells had a mechanism for enhancing the uptake and utilization of polysaccharides derived from biomass substrates. For the subsequent phosphorolytic cleavage of the imported oligosaccharides and cellobiose, this study identified the transcripts of one cellodextrin phosphorylase (CDP, Cthe_2989) and two cellobiose phosphorylases (CEP, Cthe_0275 and Cthe_1221). Among these, the CEP Cthe_1221 was up-regulated with a FC value of 3.0 (Supplementary Data Sheet 3, section 3; also Figure 6), suggesting this gene may warrant further studies.

GENES RELATED TO GLYCOLYSIS, PYRUVATE CATABOLISM, AND END-PRODUCT SYNTHESIS
The deduced pathway for cellulolysis, glycolysis, ethanol, and H 2 production in C. thermocellum ATCC 27405 is illustrated in Figure 6, which is in accordance to accumulated literature findings (Demain et al., 2005;Rydzak et al., 2009Rydzak et al., , 2011Riederer et al., 2011;Carere, 2013). The set of gene IDs for the enzymes involved in above pathway were retrieved from the KEGG PATHWAY database (http://www. genome.jp/kegg) (Kanehisa et al., 2008), and were shown in Supplementary Data Sheet 3, section 3 with their FC values in PYP-vs. cellobiose-grown cells. Out of the listed genes, 18 and 12 were significantly up-and down-regulated, as shown in red and green text in Supplementary Data Sheet 3, section 3, respectively.
Fd red , and acetyl-CoA, in which acetyl-CoA eventually leads to the production of acetate and ethanol, the Fd red leads to the formation of hydrogen (Figure 6). The up-regulation of two out of four alcohol dehydrogenase (ADH) genes, namely Cthe_0423 (FC value 7.0) and Cthe_0394 (FC value 5.3), raise the average FC value for ADHs to 3.6. In contrast, the phosphotransacetylase (PTA; Cthe_1029), whose activity had been verified in C. thermocellum (Lamed and Zeikus, 1980), was significantly down-regulated in PYP-against cellobiose-grown cells, with an FC value of 0.2. These data suggest that based on mRNA profiling, carbon flux of acetyl-CoA may preferentially be channeled to ethanol production in PYP-grown cells.
The observed decrease of acetate/ethanol ratio, from 1.08 in cellobiose-grown cells to 0.92 in PYP-grown cells (Table 1), supports this carbon shift at the late stationary phase of PYP-grown cells.
Lactate and formate were not monitored in this study as previous study has shown that in the late growth phase of C. thermocellum cultures grown on cellobiose substrate, lactate, and formate together represent only a small fraction of the total end products produced (Islam et al., 2006). Future studies should also monitor the production of lactate, formate and CO 2 , which could provide a whole picture for carbon balance for the cells utilizing biomass substrates.

DYNAMICS FOR H 2 PRODUCTION WITH PYP AS SUBSTRATE
C. thermocellum genes encoding putative hydrogenases and sensory hydrogenases using ferredoxin and NAD(P)H as electron carriers are listed Table 4. In addition, hydrogenase maturation proteins are also listed. The classification of the above hydrogenases are mainly based on literatures that systematically characterized the putative hydrogenases in C. thermocellum and other species (Schut and Adams, 2009;Carere et al., 2012). Briefly, there are two types of hydrogenases (H 2 ases) according to the metal content in the respective active sites: The first type is [NiFe] H 2 ase, a putative Fd-dependent energy converting hydrogenase (ECH). Its hexameric structural subunits are encoded by Cthe_3019-Cthe_3024, with assembly of its active site assisted by a suite of maturation proteins encoded by Cthe_3013-Cthe_3018 (Table 4, Figure 6). This study showed that the FC values for 7 of the 12 detected Fd-H 2 ase and related genes (Cthe_3013-Cthe_3015, Cthe_3017-Cthe_3018, and Cthe_3022-Cthe_3023) were in the range of 0.6-1.6, suggesting an unchanged status at the transcriptional level for these genes. However, the remaining five genes (Cthe_3019-Cthe_3021, Cthe_3023-Cthe_3024) encoding the Fd-H 2 ase structural subunits were significantly down-regulated with FC values between 0.3 and 0.5, which might account for the observed lower H 2 yield in PYP-grown cells, compared to the control cellobiose-grown cells. This finding implies that the [NiFe] ECH H 2 ase likely functions in the H 2 production direction since its down-regulation led to less H 2 accumulation in the culture headspace.
The second type is [FeFe] H 2 ases that include two bifurcating H 2 ases: both being trimeric, and encoded by Cthe_0338-Cthe_0342 and Cthe_0428-Cthe_0430, respectively. Only seven genes are listed in Table 4 because Cthe_0339 is annotated as histidine kinase in GenBank protein database. The second type also includes a sensory H 2 ase (Cthe_0425-Cthe_0426) and a NAD(P)H dependent H 2 ase (Cthe_3003-Cthe_3004), as listed in Table 4. The data showed that the FC values for the trimeric bifurcating H 2 ase genes (Cthe_0338, Cthe_0340-Cthe_0342) and the NAD(P)H dependent H 2 ase genes (Cthe_3003-3004) were in the range of 0.6-1.5, indicating an unchanged status for their transcription (Figure 6). A homolog of this trimeric H 2 ase has been uncovered in Acetobacterium woodii which functions in H 2 consumption yielding reduced Fd and NAD(P)H (Schut and Adams, 2009;Hess et al., 2013a,b). In contrast, genes encoding the trimeric bifurcating H 2 ase (Cthe_0428-Cthe_0430) were significantly up-regulated in PYP-grown cells, with FC values of 6.4-7.9 (Figure 6). Based on literature (Schut and Adams, 2009;Calusinska et al., 2010), the trimeric [FeFe] H 2 ase identified in C. thermocellum is a putative bifurcating hydrogenase. This cytoplasmic enzyme was initially characterized in Thermotoga maritima and uses both reduced ferredoxin and NAD(P)H as substrates, and functions in H 2 production (Schut and Adams, 2009;Hess et al., 2013a). Yet the up-regulation of Cthe_0428-Cthe_0430 (this work) led to decreased H 2 production when cultured in PYP-vs. cellobiose-grown cells. As such, the exact role of the C. thermocellum trimeric H 2 ase is unknown, which makes it hard to link the above transcriptional data to the observed lower H 2 yield in PYP-grown cells. Further studies are warranted to investigate the exact direction (production or consumption of H 2 ) for each of the above [FeFe] H 2 ases.
In addition, the sensory H 2 ase encoded by Cthe_0425-Cthe_0426 had a FC value of 9.9 and 11.0, respectively (Table 4), and future studies are also needed to explore the implication of the significant up-regulation of these two genes in PYP-grown cells.

GENES FOR DEGRADING HEMICELLULOSE AND LIGNIN IN FEEDSTOCK BIOMASS
We noticed that some important hemicellulase genes, such as XynY (Cthe_0912, FC 13.8), XynD (Cthe_2590, FC 2.2), XynZ (Cthe_1963, FC 1.4), ManA (Cthe_2811, FC 3.4) were upregulated in PYP-against cellobiose-grown cells ( Table 2; cellulosomal proteins). In addition, some auxiliary enzymes such as α-N-arabinofuranosidase (Cthe_2548, FC 3.3; Table 3, noncellulosomal protein) were also up-regulated. This is probably caused by the adaptation of the strain to digest the PYP, of which the chemical composition and structure are much different from those of cellobiose. In contrast, we did not identify the primary genes involved in lignin modification and degradation, such as laccases, lignin peroxidases, or manganese peroxidases.

GROWTH PHASE AND THE TRANSCRIPTOME OF CELLS
The main focus of this study was to investigate the effects of different carbon sources (PYP vs. cellobiose) on the transcriptome of the cells. To attain this goal, the selection of sampling point was crucial. The experimental design of this study in choosing the late stationary phase is in accordance with literature (Gold and Martin, 2007;Raman et al., 2009). One merit for choosing this sampling point is that our transcriptomic data can be cross-analyzed with the literature's protein data of cellulosomes extracted from cellobiose-grown cells, which are used as one means of validating the transcriptomic data in this study.
Another merit in choosing this sampling point is that compared with other early growth stages (exponential or early stationary phases), the levels of simple, soluble sugars in PYP-and cellobiose-grown culture are both very low: due to either the limit of bacterial enzymes in releasing sugars from PYP or the depletion of cellobiose, respectively. In this sense, late stationary phase is relatively more suitable phase than other phases for comparing transcriptomes derived from different carbon substrate-grown cells in this study.
Nevertheless, it is noteworthy that different growth phases may cause some shifts to some metabolic pathways; thus future studies comparing cells grown on cellobiose vs. cells grown on woody biomass at the exponential phase would provide another angle to investigate the effects of biomass substrate on the transcriptomes of C. thermocellum cells, which may lead to the identification of both growth phase-and biomass substrate-responsive genes. In addition, the transcriptional profiles of these two cell samples could also be affected by the distinct concentrations of carbohydrate sources as well as the residual concentrations of metabolites/nutrients present at time of sampling, which will remain a challenge for future studies in managing these variables-some of which are nearly uncontrollable.

CONCLUSIONS
We conducted a RNA-Seq analysis of the transcriptional profiles of C. thermocellum grown on PYP and cellobiose, which is different from previous transcriptional studies that focused on degrading cellulose alone substrates, or sub-proteomic studies that focused on cellulosomal protein components. We found that nearly 60% of the genes encoding the protein components of the cellulosomes-the core machinery for cellulose degradationwere up-regulated, whereas only 5% were down-regulated. The top up-regulated cellulosomal genes, along with the responsive SigI-RsgI and cellodextrin transporter genes, present promising candidate genes for engineering C. thermocellum strains to improve their capacity in effectively converting lignocellulosic biomass substrates. Furthermore, the identified differentially expressed NAD(P)H H 2 ase, ADH, and PTA genes may provide insight into how the cells regulate the production of H 2 and ethanol under the carbon-limited condition.

ACKNOWLEDGMENTS
This work was funded by the Laboratory Directed Research and Development (LDRD) program at the National Renewable Energy Laboratory (NREL), and by the U.S. Department of Energy's Office of Science through the BioEnergy Science Center (BESC), Bioenergy Technology Office (DOE-BETO) and the Fuel Cell Technologies Office. This work was supported by DOE under Contract No. DE-AC36-08-GO28308 with NREL. Andrew Bowersox and Igor Bogorad were sponsored by DOE Academies Creating Teacher Scientists (ACTS) and Science Undergraduate Laboratory Internship (SULI) programs, respectively. We thank Dr. Katherine Chou of NREL for discussion.