Assessing and characterising the repertoire of constitutive promoter elements in soil metagenomic libraries in Escherichia coli

Although functional metagenomics has been widely employed for the discovery of genes relevant to biotechnology and biomedicine, its potential for assessing the diversity of transcriptional regulatory elements of microbial communities has remained poorly explored. Here, we have developed a novel framework for prospecting, characterising and estimating the accessibility of promoter sequences in metagenomic libraries by combining a bi-directional reporter vector, high-throughput fluorescence assays and predictive computational methods. Using the expression profiling of fluorescent clones from two independent libraries from soil samples, we directly analysed the regulatory dynamics of novel promoter elements, addressing the relationship between the “metaconstitutome” of a bacterial community and its environmental context. Through the construction and screening of plasmid-based metagenomic libraries followed by in silico analyses, we were able to provide both (i) a consensus of exogenous promoter elements recognizable by Escherichia coli and (ii) an estimation of the accessible promoter sequences in a metagenomic library, which was close to 1% of the whole set of available promoters. The results presented here should provide new directions for the exploration, through functional metagenomics, of novel regulatory sequences in bacteria, which could expand the Synthetic Biology toolbox for novel biotechnological and biomedical applications.


INTRODUCTION
The study of prokaryotic transcriptional regulation is essential for understanding the molecular mechanisms underlying decision-making processes in microorganisms [1], comprising populational (e.g. colony structure, quorum sensing detection), ecological (e.g. nutrient acquisition, biomass degradation) and pathogenic behaviours (e.g. host recognition, biofilm formation). The activity of most bacterial promoters is usually dependent on the combined action of transcription factors and sigma factors in response to multiple environmental stimuli [2].
For instance, in Escherichia coli, the compilation of decades of experimental data indicates that approximately 50% of its promoters are under the control of a single specific regulator, while all other genes are regulated by at least two transcription factors [3]. Moreover, the recent development of experimental and large-scale sequencing techniques, together with powerful computational approaches, has allowed both the discovery of insightful information about bacterial transcriptional systems and the development of novel approaches for studying those systems in greater depth [4-7]. However, despite technical innovations, most studies are still centred on the model organism E. coli, a single bacterial species among at least 30,000 others already sequenced [8], out of an estimated total of 1 trillion species [9].

With the advent of Metagenomics [10], the exploration of unculturable bacteria (approximately 99% of a bacterial community [11]) widely expanded genomic information, providing resourceful data about populational structures and genetic diversity in a myriad of environmental samples [12-14]. Two main approaches are commonly adopted for those metagenomic studies [15]: the sequence-based metagenomic approach, which relies on massive sequencing of metagenomic DNA and powerful bioinformatics tools for extracting information from the metagenomic sequences; and functional metagenomics [16,17], which directly explores the functionality of enzymes and other structural elements through a wide range of stress/substrate/product-based assays [18-21].
In this context, although a large number of genes/ORFs has been discovered through the previously described approaches, the detection of novel bacterial regulatory elements using high-throughput technologies has been poorly explored, presenting so far a single well-defined method for the discovery of substrate-inducible regulatory sequences, SIGEX [19], and a limited assay for the exploration of constitutive promoters [22]. This narrow range of methodologies is directly related to the biased functional search towards novel genes and to a lack of both experimental and computational tools for finding and validating promoter sequences in metagenomic libraries [23].

Unravelling novel bacterial promoters is essential for understanding the regulatory diversity of microorganisms, addressing important questions, such as the abundance of both constitutive and inducible elements in a metagenomic library, the bottlenecks regarding host choices (i.e. the constraints limiting the diversity of exogenous regulatory sequences that can be recognized by different hosts) and the correlation between promoter strength, transcriptional noise and the functional role of the regulated gene/operon [23-26]. Furthermore, prospecting and characterising novel regulatory sequences is crucial for expanding the current Synthetic Biology toolbox and generating novel biotechnological applications. For instance, there is a high demand for novel constitutive and inducible promoters responding to process-specific parameters imposed by a wide variety of processes, such as industrial applications, heterologous protein expression and biosensor generation [19,23,27-29].

In this context, the most common strategy for prospecting regulatory sequences is the use of unidirectional promoter trap-vectors, which consist of transcriptional fusions between DNA fragments and a reporter gene.
This method has been widely employed for assessing regulatory sequences in genomic DNA [30-33]; however, its application to metagenomic DNA fragments has remained poorly explored [19]. The main constraint regarding the use of unidirectional systems is that bacterial genomes present a large variation in the percentage of their leading-strand genes, ranging from ∼45% to ∼90% [34,35]. Thus, a bi-directional promoter reporter system would be preferable, as it increases the probability of finding promoter sequences. In the present work, we have developed a novel strategy for in-depth prospection, characterisation, and quantification of accessible promoter elements from soil metagenomic samples in E. coli as a standard host.

Although both constitutive and inducible promoters were potentially detectable by this method, we have focused exclusively on the study of the former, as a proof of concept, by avoiding substrate-based induction assays as previously reported [18-21]. We have collected soil samples from two differentially biomass-enriched sites of a Secondary Atlantic Forest in South-eastern Brazil and generated metagenomic libraries in a bi-directional probe vector for primary screenings. We have characterised the expression behaviours of a large set of GFPlva-expressing clones from both libraries and narrowed down our selection to 10 clones for an in-depth analysis regarding potential ORFs and endogenous promoters. By cross-validating in silico analyses and experimental data of predicted regulatory sequences, we have located and profiled the expression of 33 endogenous promoters within the selected clones (see Supplementary Table S2 online), providing resourceful information concerning the architecture and transcriptional dynamics of promoters from metagenomic fragments.
Thus, in order to contribute to this set of accessible genetic features, we have used our gathered data to provide for the first time a direct estimation of the whole set of accessible constitutive promoters in a soil metagenomic library hosted in E. coli, which we have called the "metaconstitutome" of an environmental sample.

Generating metagenomic libraries and screening for fluorescent clones
We have constructed and assessed two metagenomic libraries hosted in the E. coli DH10B strain for the analysis of bacterial promoters in environmental samples (Figure 1). The libraries were generated from soil microbial communities of two sites bearing differential tree litter composition (Anadenanthera spp. and Phytolacca dioica) within a Secondary Atlantic Forest zone at the University of Sao Paulo, Ribeirão Preto, Brazil. Both metagenomic DNA samples were cloned into the pMR1 bi-directional reporter vector, which has GFPlva and mCherry reporter genes in opposite directions [36]. Each metagenomic library presented about 250 Mb of environmental DNA distributed into approximately 60,000 clones harbouring inserts ranging from 1.5 Kb to 7 Kb, with an average size of 4.1 Kb (Table 1). We chose fragments of 1.5-7 Kb in order to validate our strategy on standard-sized functional metagenomic libraries based on plasmid vectors [18,19,37-39]. In total, 1,100 fluorescent clones were manually selected under blue light exposure, corresponding to a rate of approximately one fluorescent clone per 150 clones screened (USP1) or per 90 clones screened (USP3). These fluorescent clones were then directly recovered from LB agar plates supplemented with chloramphenicol. Direct screening was preferred over the use of metagenomic clone pools from stocks because it avoids clonal amplification, reducing the chances of both biased clone enrichment (e.g. clones with higher growth rates, usually clones bearing small inserts or no insert) and dilution of positive clones with impaired growth (e.g. clones with high expression of GFP and/or other exogenous genes).
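As a rough consistency check on the library figures quoted above, the cloned DNA per library can be approximated as the number of clones multiplied by the mean insert size, and the expected number of fluorescent clones as the number of clones divided by the screening rate. The sketch below is illustrative arithmetic only: the function names are hypothetical, and it assumes the rounded values given in the text (about 60,000 clones per library, a 4.1 Kb mean insert, and one hit per 150 or per 90 clones).

```python
def library_size_mb(n_clones: int, mean_insert_kb: float) -> float:
    """Approximate cloned DNA in a library, in Mb (clones x mean insert size)."""
    return n_clones * mean_insert_kb / 1000.0


def expected_fluorescent_clones(n_clones: int, clones_per_hit: int) -> float:
    """Expected number of fluorescent clones for a rate of one hit per N clones."""
    return n_clones / clones_per_hit


# Rounded values taken from the text; both libraries are assumed to hold
# about 60,000 clones each.
size_mb = library_size_mb(60_000, 4.1)
hits_usp1 = expected_fluorescent_clones(60_000, 150)
hits_usp3 = expected_fluorescent_clones(60_000, 90)

print(f"DNA per library: ~{size_mb:.0f} Mb")
print(f"Expected fluorescent clones: ~{hits_usp1 + hits_usp3:.0f}")
```

Under these assumptions the estimate comes to roughly 246 Mb per library and a little under 1,100 fluorescent clones in total, which is in line with the values reported above.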

Evaluating the expression dynamics of fluorescent clones
In order to analyse the expression patterns of the isolated clones, we evaluated the intrinsic dynamics of GFPlva and mCherry by randomly selecting 20 clones expressing each reporter (as schematically represented in Figure 1). As represented in Figures 2A-B, we found that clones expressing mCherry were not suitable for standard 8-hour microplate assays, as the fluorescence intensity values differed dramatically between 8 and 24 hours after the beginning of the experiment. The slow kinetics of mCherry expression has already been reported as a consequence of a two-step oxidation process for protein maturation, in contrast to the one-step maturation process found in GFP reporters [40]. On the other hand, the clones expressing GFPlva presented more suitable intrinsic properties for microplate assays, as shown by the very similar fluorescence intensities at the two time points tested. Furthermore, GFPlva has an LVA-degradation tag attached to its C-terminus, which reduces GFP accumulation and increases protein turnover, generating a more precise fluorescence output for the analysis of expression patterns [41].

Thus, 260 clones expressing GFPlva (160 clones from the USP1 library and 100 from USP3) were selected for further analysis of expression patterns in microplate reader assays with biological and technical triplicates. The dynamic profiles for each clone were converted into heat maps and hierarchically clustered by a Euclidean Distance algorithm into a dendrogram, concisely representing the expression patterns of each metagenomic library. In order to assess the diversity of promoter strengths among the generated metagenomic libraries, three previously characterized constitutive promoters (see Experimental Procedures for further information) positioned upstream of a GFPlva reporter were used as standards for strong, medium and weak expression profiles (referred to here as p100, p106 and p114, respectively).
Considering both metagenomic libraries, the inserts from selected clones were sequenced and analysed for both potential ORFs and RpoD-related promoter regions (-10 and -35 conserved regions). In the case of the identification of putative genes, twenty-nine ORFs with significant E-values (<0.001) were found (Table 2 and Supplementary Table S1 online), unevenly distributed between both DNA strands, in line with a lack of strong directional trends regarding bacterial genome organization [46]. The ORFs were also classified within a range of functional classes (delineated by MultiFun [47]) and potential bacterial phyla (see Supplementary Fig. S3 online). For this, we analysed the microorganisms associated with the closest similar protein of each identified ORF (Table 2). The most abundant ORFs were related to unknown functions (31%) and metabolism (31%), followed by stress adaptation cell processes (17%) (see Supplementary Table S1 online), while the most abundant phyla related to the recovered ORFs were Proteobacteria (35%), followed by Bacteroidetes (22%) and Chloroflexi (14%) (see Supplementary Fig. S3 online). The guanine-cytosine content of each insert was also assessed (Table 2), resulting in a median of 54%, varying from 43% to 61%, indicating their diverse phylogenetic affiliation. These results are in agreement with previous G-C content diversity analyses of soil samples, which ranged from 50% to 61% [48-50]. Even with a limited sample size when compared to NGS-based metagenomic studies, the abundance of gene functions and bacterial groups predicted in this work was similar to that found in previous studies of soil microbial communities [51-53]. Considering the above, these results suggest that different bacterial groups could be the sources of accessible promoters in E. coli, that is, regulatory sequences recognizable by the transcriptional machinery of E. coli that allowed the expression of the reporter genes.

The in silico promoter prediction has also provided relevant information concerning the potential number [...] Considering that, we delineated a strategy to experimentally assess the number and location of accessible promoters from our selected clones, contrasting experimental results with in silico data.

Experimental identification, characterisation, and cross-validation of promoter regions
In order to explore the potential set of accessible promoter regions from our metagenomic libraries, we developed a small DNA insert library generation approach (Figure 1). Firstly, the plasmids from the 10 previously selected clones (original clones) were pooled together for insert amplification in a single PCR reaction. The resulting amplicons were fragmented by Sau3AI digestion, and DNA fragments ranging from 0.2 Kb to 0.5 Kb were selected for subsequent cloning into the pMR1 vector. The generation of this sub-fragment library allowed the screening for both red and green fluorescent colonies, as they would represent the accessible set of promoters among the metagenomic DNA fragments studied. It is important to highlight that, as the cloning process was not directional, small fragments bearing promoter regions had a 50% chance of being cloned in either orientation; thus, clones expressing mCherry were also isolated for subsequent sequencing. A total of 100 clones from the small DNA insert library (80 expressing GFPlva and 20 expressing mCherry) were sequenced and then aligned against the original metagenomic fragments. As a result, we have identified at least 33 promoter regions within the initial set of selected metagenomic clones (Figure 3, Supplementary Fig. S4 and Supplementary Table S2 online). These findings showed that the in silico prediction of 140 RpoD-related promoters was an overestimate in comparison with the experimental results. This can be explained by the fact that prediction algorithms usually underestimate or overestimate results due to a lack of information regarding the diversity and variability of natural cis-regulatory sequences [58-60].

Additionally, the current experimental approach allowed us not only to identify novel promoter regions but also to determine promoter directionality.
The evaluation of promoter localization within the 10 selected clones revealed that, of the 33 experimentally selected small fragments, 7 (21%) were considered intragenic promoters, while the remaining 26 (79%) were considered primary promoters, defined as the furthest upstream promoter in a gene/operon [61]. This small-scale analysis slightly diverges from the architectural features found in the E. coli K-12 genome, in which the promoter dataset was dominated by primary promoters (66.3%), with a lower number of secondary promoters (19.6%), defined as intergenic and downstream of primary promoters [61], internal promoters that are intragenic (9.8%), and antisense promoters (4.2%) [61,62]. This observation might reflect the diversity of genomic architectures in metagenomic libraries and highlight the current underestimation of bacterial intragenic promoters, whose proportion here (21%) was roughly double that reported for E. coli (9.8%).

Based on the alignment results, we selected a defined set of small fragment clones related to each original sequence for dynamic expression profiling on a microplate reader. The results showed that, for each set of small fragments belonging to a metagenomic clone, there was at least one with an expression pattern corresponding to that previously observed for the original clone (Figure 3 and Supplementary Fig. S4). Similarly, we identified other small-insert clones with individual profiles different from the one primarily observed, representing alternative promoter regions in the original sequence that were not mapped in the initial approach (Figure 3). The diversity of the promoter expression profiles found in a single original metagenomic clone has a multifactorial nature, ruled by different processes. Firstly, the inherent relationship between the regulatory dynamics and the functional role of the regulated gene should be considered [26]. Secondly, there is the transcriptional bias imposed by the E. coli molecular machinery, which would recognize orthologous sequences, but not necessarily reproduce the original behaviours found in natural hosts [23,39,63,64]. Finally, another point to be considered is that the increase in expression levels can be the result of the artificial juxtaposition of the promoter to the fluorescent reporter ribosome binding site, as a consequence of the cloning process.

Regarding in silico cross-validation, of the 33 experimentally validated promoters, 23 RpoD-related promoters (70%) were supported by the algorithmic analysis, as they aligned to their respective original sequences (Figure 3). On the other hand, the remaining 10 sequences (30%) were considered promoters exclusively identified by experimental approaches. We hypothesized that these sequences could either be recognized by sigma factors other than sigma70 or present unusual consensus sequences for the -10 and -35 boxes that bypassed the algorithmic analysis. However, experimental validation in E. coli strains lacking diverse sigma factor genes would be necessary for a more accurate conclusion.

[...] Although this logo pattern was distant from the one proposed for the RpoD-dependent constitutive promoters identified in vitro (Figure 4A) [65], it was very similar to the previously described consensus [67] from experimentally validated promoter sets from the RegulonDB [3] and EcoCyc [68] databases (Figure 4B). To conclude, the results presented here have allowed us to identify a consensus for exogenous promoter recognition in E. coli, which can be an important resource for defining host-dependent restrictions in functional metagenomics.

[...] basis, using 32 prokaryotic genomes, that 40% of the enzymatic activities present in a soil metagenomic library could be readily accessed using E. coli as a host in an independent gene expression mode (in which both the promoter and the ribosome binding sites (RBS) are provided by the metagenomic insert).
Moreover, it was predicted that Firmicutes, instead of Proteobacteria, would present the largest fraction of independently expressible genes (73%). In contrast, recent empirical studies on E. coli and other hosts have shown that functional expression faces a myriad of challenges that were not taken into account in previous mathematical models, such as codon usage, improper promoter and RBS recognition [72], missing initiation factors, protein misfolding, missing co-factors, breakdown of product, improper secretion of product, toxicity of product or intermediates and formation of inclusion bodies [24,25]. Since it is impossible to predict the effect of these difficulties on unknown metagenomic fragments, the actual fraction of genes that can be successfully expressed in E. coli is probably significantly lower than that proposed by Gabor and collaborators (2004) [63]. In this context, our work supports the previous arguments [24,25] highlighting the large gap between theoretical predictions and experimental data, as we have shown that only a small portion of the whole set of promoters is accessible to E. coli in metagenomic libraries (~1%). Thus, we stress the importance of feeding mathematical models with empirical data in a continuous iterative process for improving their predictive power.

CONCLUSIONS
In summary, we have developed a novel methodology for prospecting, characterising and estimating the accessibility of promoter sequences in metagenomic samples by combining experimental and in silico approaches. The expression profiling of fluorescent clones was used for the first time as a direct approach to analyse the regulatory dynamics of an environmental sample, bearing great potential for revealing insightful trends regarding the transcriptional diversity of microbial communities. It has already been computationally demonstrated by [...]

Through the generation of a small-DNA insert library combined with in silico promoter prediction, we were able to provide both (i) a consensus of recognizable exogenous regulatory sequences in an E. coli host and (ii) an estimation of the accessible promoter sequences in a plasmid-based functional metagenomic library, which was close to 1% of the whole set of available promoters. These are resourceful data for building a concise framework regarding the accessibility of genetic features from metagenomic libraries and how it can be influenced by the choice of different microbial hosts [23,63,64] or by the tinkering of the host's transcription systems [72,77,78].

Although this work provided seminal information regarding promoter accessibility in metagenomic libraries, further high-throughput studies optimizing the proposed methods (e.g. application of automated screening methods; exploration of the whole set of fluorescent clones in a metagenomic library by Next-Generation Sequencing) will be essential for expanding our current estimation into a more holistic landscape. Finally, we highlight that, besides providing novel approaches for studying the regulatory diversity underlying environmental microbial communities, this work should be extremely useful for expanding the current Synthetic Biology toolbox through the discovery and characterisation of novel regulatory features.
Waltham, MA, USA). Promoter activities were expressed as the emission of fluorescence at 535 nm upon excitation with 485 nm light and then normalised by the optical density at each point (reported as fluorescence/OD600) after background correction. Background signal was evaluated with non-inoculated M9 medium and used as a blank for adjusting the baseline of measurements. E. coli DH10B harbouring the empty pMR1 plasmid was used as a negative control. Three different positive controls were used, consisting of E. coli DH10B harbouring the pMR1 plasmid with one of the following synthetic constitutive promoters from the iGEM BBa_J23104 Anderson's catalogue (http://parts.igem.org/Promoters/Catalog/Anderson) [80] upstream of a GFPlva reporter: J23100, J23106 and J23114 (referred to here as p100, p106 and p114, respectively). Unless otherwise indicated, measurements were taken at 30 min intervals over 8 h. All experiments were performed with both technical and biological replicates, with biological triplicates evaluated as independent measurements on different dates. Raw data were processed and plots were constructed using Microsoft Excel. All data were normalised by background values and transformed to a log2 scale for better data visualisation. Heatmap dendrograms with expression profiles were generated using the MeV2 (http://mev.tm4.org/) software.
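The normalisation described above can be sketched as follows. This is a minimal illustration and not the authors' spreadsheet workflow: the function names are hypothetical, and it assumes that the blank (non-inoculated M9) readings are subtracted from both the fluorescence and the OD600 values before taking their ratio, and that the log2 transform is applied to the ratio of a clone's activity to that of the negative control (empty pMR1). The exact order of these steps is not fully specified in the text.

```python
import math


def normalised_activity(fluorescence: float, od600: float,
                        blank_fluorescence: float, blank_od600: float) -> float:
    """Background-corrected fluorescence per unit of OD600.

    Assumption: the blank is subtracted from both readings before the ratio.
    """
    corrected_od = od600 - blank_od600
    if corrected_od <= 0:
        raise ValueError("OD600 must exceed the blank reading")
    return (fluorescence - blank_fluorescence) / corrected_od


def log2_relative_activity(activity: float, negative_control: float) -> float:
    """log2 of a clone's activity relative to the empty-vector control."""
    if activity <= 0 or negative_control <= 0:
        raise ValueError("activities must be positive for a log2 transform")
    return math.log2(activity / negative_control)


# Hypothetical readings for one well at a single time point.
clone = normalised_activity(1100.0, 0.55, 100.0, 0.05)
control = normalised_activity(350.0, 0.55, 100.0, 0.05)
relative = log2_relative_activity(clone, control)
print(f"log2 relative activity: {relative:.2f}")
```

Applying the same two functions to every 30 min reading over the 8 h assay would give one log2 profile per clone, which is the kind of input a heat map and dendrogram are built from.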

Small-DNA insert library generation and screening
In order to experimentally find and validate the promoter regions from each of the ten selected metagenomic clones, an experimental technique was developed based on the previously described methodology of metagenomic library construction. All selected clones had their plasmids extracted and pooled together in an equimolar ratio. The pooled sample was amplified in a single PCR reaction using a high-fidelity polymerase (Phusion) and previously described primers flanking the MCS (Multiple Cloning Site) region of the pMR1 vector, into which the metagenomic inserts were cloned. The resulting amplicons were first subjected to an analytical digestion followed by electrophoretic analysis to find the optimal concentration of Sau3AI enzyme for obtaining fragments ranging from 0.1 Kb to 0.5 Kb. Then, the purified pooled samples were fragmented by Sau3AI in a preparative digestion and thereafter excised from a 1% agarose gel in the region between 0.1 Kb and 0.5 Kb. These small DNA fragments, in turn, were ligated into the pMR1 vector. Aliquots of electrocompetent E. coli DH10B cells were transformed with the ligated DNA. A total of 100 fluorescent clones (80 expressing GFP and 20 expressing mCherry) were isolated by screening under blue light excitation and had their plasmids extracted for sequencing reactions. Fluorescent clones were stored at -80°C in LB medium supplemented with the required antibiotics and 10% glycerol (v/v).
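The fragmentation and size-selection steps can be mimicked in silico to anticipate which sub-fragments a given insert might yield. The sketch below is a simplification under stated assumptions: it performs a complete digest of a linear sequence at every GATC site (Sau3AI cuts immediately before the G), whereas the protocol above tunes the enzyme concentration, so real digests are likely partial and would also contain longer products. The function names and the toy sequence are hypothetical.

```python
from typing import List


def sau3ai_digest(seq: str) -> List[str]:
    """Complete in silico digest of a linear sequence at every ^GATC site."""
    seq = seq.upper()
    bounds = [0]
    pos = seq.find("GATC")
    while pos != -1:
        if pos > 0:  # a site at position 0 does not create a new fragment
            bounds.append(pos)
        pos = seq.find("GATC", pos + 1)
    bounds.append(len(seq))
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments: List[str], min_bp: int, max_bp: int) -> List[str]:
    """Keep fragments within a length window, like excising a gel slice."""
    return [f for f in fragments if min_bp <= len(f) <= max_bp]


# Toy sequence; a real insert would be several Kb long and the window
# would be 100-500 bp as in the protocol above.
toy = "AAGATCTTTTGATCCC"
fragments = sau3ai_digest(toy)
selected = size_select(fragments, 3, 7)
print(fragments, selected)
```

Because the digest only splits the sequence, concatenating the fragments in order always reproduces the input, which is a convenient sanity check when running this on real contigs.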

In silico analysis of ORFs and promoter regions
The inserts of selected clones were sequenced on both strands as previously described. Sequences were manually assembled to generate 10 contigs. Putative ORFs were identified and analysed using the online ORF Finder platform, available at the NCBI website (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Comparisons of nucleotide and translated amino acid sequences were performed against public databases (NCBI) using BlastN, BlastX and BlastP (BLAST, basic local alignment search tool) at the NCBI on-line server. For translation to protein sequences, the bacterial code was selected, allowing ATG, GTG, and TTG as alternative start codons. All predicted ORFs longer than 270 bp were translated and used as queries in BlastP. Sequences with significant matches were further analysed with psiBlast, and their putative function was annotated based on their similarities to sequences in the COG (Clusters of Orthologous Groups) and Pfam (Protein Families) databases. Predicted general cellular functions were annotated only for known ORFs based on the MultiFun classification (Serres et al, 2006

Calculations for promoter/Kb rates from databases and for promoter accessibility estimation
Data from predicted sequences of promoter sites, TSS (Transcriptional Start Sites) and TUs (Transcription Units) reported in different studies and databases regarding E. coli and other bacteria [3,61,65,70] were used as proxies for the total number of predicted promoters. Those values were divided by their respective genome sizes (or average genome sizes when calculating an average rate for multiple species at once) in order to provide promoter/Kb rates (e.g. 8,000 predicted promoters, TSS or TUs on a genome of 4.6 Mb would result in a rate of 1.7 promoters/Kb). The promoter accessibility estimation followed the same rationale and was based on the combination of the data from both metagenomic libraries presented in Table 1
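The rate calculation described above amounts to dividing a promoter count by a genome size expressed in Kb. The following sketch reproduces the worked example from the text; the function names are hypothetical, and the second helper only illustrates how such a rate could be scaled to a given amount of cloned DNA. It does not reproduce the authors' accessibility estimate, whose inputs are not fully given here.

```python
def promoters_per_kb(n_promoters: int, genome_size_mb: float) -> float:
    """Promoter density: predicted promoters (or TSS/TUs) per Kb of genome."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return n_promoters / (genome_size_mb * 1000.0)


def expected_promoters(total_dna_kb: float, rate_per_kb: float) -> float:
    """Promoters expected in a given amount of DNA at a given density."""
    return total_dna_kb * rate_per_kb


# Worked example from the text: 8,000 predicted promoters on a 4.6 Mb genome.
rate = promoters_per_kb(8_000, 4.6)
print(f"{rate:.1f} promoters/Kb")
```

Dividing an experimentally observed promoter count by the value returned by `expected_promoters` for the screened DNA would give an accessible fraction, under the strong assumption that the reference density applies to the sampled community.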

Data Availability
The nucleotide sequences obtained for the plasmid inserts have been deposited in the GenBank database under the Accession numbers (KY939589-KY939597), which are also shown in Table 2.

c Sequences with an E-value higher than 0.001 in Blastp searches were considered to be unknown proteins.

[...] beginning of the experiment. Maturation is substantially slower for mCherry than for GFPlva, which excluded the former from further analyses. Positive controls for GFP and mCherry are represented by p100 and pRED, respectively. Fluorescence data have been normalised by OD600 values for each sample following normalisation by values from the negative control (empty-pMR1). Data were transformed to a log2 scale to allow better visualisation of fluorescence variation. D) Hierarchical representation of a metaconstitutome (i.e. all expression profiles from a single metagenomic library). Fluorescence time-lapse dynamics were measured during 8 hours for each clone and represented as heat maps. Promoter activities (calculated as GFP/OD600) were normalised by the negative control [...]

[...] Promoters are indicated by elbow-shaped arrows and named according to their relative position in the contig. Promoter directionality, regarding the leading and lagging strands, is represented by green and red colours, respectively. Asterisks over specific promoters indicate regulatory regions which were cross-validated by matching in silico predictions. Dark arrows represent predicted ORFs, according to their relative positions in each contig (see Table 2 for more information). All genetic features respect their original relative sizes, following the 1 Kb scale depicted at the bottom of this figure. Beneath each metagenomic insert, there is a heat map cluster representing the whole set of promoter activities measured during 8-hour fluorescence assays. The first line of each cluster shows the original expression profile initially measured for each metagenomic insert.
All other lines represent expression activities from de novo experimentally validated promoters within each contig (small DNA fragments). The second line of each cluster represents the endogenous promoter showing the most similar activity with respect to the original expression profile for each contig. All expression profiles are properly identified at the rightmost [...]

[...] sequences of the 33 promoters experimentally validated in this study were aligned and subjected to Logo analysis [66]. The consensus from the metagenomic set (C) is very similar to the one from the experimentally validated set from E. coli (B).