Tracing Lifestyle Adaptation in Prokaryotic Genomes

Altermann, Eric

doi:10.3389/fmicb.2012.00048

METHODS article

Front. Microbiol., 21 February 2012

Sec. Evolutionary and Genomic Microbiology

Volume 3 - 2012 | https://doi.org/10.3389/fmicb.2012.00048

Tracing Lifestyle Adaptation in Prokaryotic Genomes

EA
Eric Altermann ^1,2^*

1. Rumen Microbiology, Animal Nutrition and Health Capability Group, Grasslands Research Centre, AgResearch Limited Palmerston North, New Zealand
2. Riddet Institute, Massey University Palmerston North, New Zealand

Abstract

Lifestyle adaptation of microbes due to changes in their ecological niches or acquisition of new environments is a major driving force for genetic changes in their respective genomes. Moving into more specialized niches often results in the acquisition of new gene sets via horizontal gene transfer to utilize previously unavailable metabolites, while genetic ballast is shed by gene loss and/or gene inactivation. In some cases, larger genome rearrangements can be observed, such as the incorporation of whole genetic islands, providing a range of new phenotypic capabilities. Until recently these changes could not be comprehensively followed and identified due to the lack of complete microbial genome sequences. The advent of high-throughput DNA sequencing has dramatically changed the scientific landscape and today microbial genomes have become increasingly abundant. Currently, more than 2,900 genomes are published and more than 11,000 genome projects are listed in the Genomes Online Database^‡. Although this wealth of information provides many new opportunities to assess microbial functionality, it also creates a new array of challenges when a comparison between multiple microbial genomes is required. Here, functional genome distribution (FGD) is introduced, analyzing the diversity between microbes based on their predicted ORFeome. FGD is therefore a comparative genomics approach, emphasizing the assessments of gene complements. To further facilitate the comparison between two or more genomes, degrees of amino-acid similarities between ORFeomes can be visualized in the Artemis comparison tool, graphically depicting small and large scale genome rearrangements, insertion and deletion events, and levels of similarity between individual open reading frames. FGD provides a new tool for comparative microbial genomics and the interpretation of differences in the genetic makeup of bacteria.

Introduction

Microbial genomes range in size from the smallest microbial genome known to date of Candidatus Tremblaya princeps with just under 139,000 nt (McCutcheon and von Dohlen, 2011) to large genomes with over 13,000,000 nt such as Sorangium cellulosum SO ce56 (Schneiker et al., 2007). With an average gene size of about 1,000 nt, microbial genomes harbor between 140 and 13,000 genes.

Lifestyle adaptation is one of the major driving forces for microbial genome re-arrangement processes such as gene loss and gene acquisition, genome rearrangements, and the movement of whole genetic islands. Until recently, such processes could not be monitored comprehensively and observations were limited to either a few select genes (typing) or the analysis of large scale genome re-arrangement events.

Typing methods such as multi-locus sequence typing (MLST; Chan et al., 2001), are based on the selection of a few widely distributed and conserved house-keeping genes. Today, MLST utilizes whole microbial genome predominantly to identify new select target regions for amplification, rather than analyzing the whole nucleotide sequence (Maiden, 2006).

With the advent of high-throughput DNA sequencing techniques such as automated capillary sequencing, solid-state sequencing¹, or pyrosequencing² or sequencing by synthesis³, access to high-quality draft, and complete bacterial genomes has become feasible and is a commonly used technique. Currently 2,943 complete genomes (including eukaryotic organisms) have been reported, with another 184 archeal and 5,490 bacterial genome projects in progress⁴. The availability of complete or nearly complete genomes triggered attempts to incorporate selected larger genetic subsets (Makarova et al., 2006; Makarova and Koonin, 2007) or complete genomes (Rohwer and Edwards, 2002; Henz et al., 2005) to infer evolutionary lineages, while less progress has been made in analysis of multiple whole microbial genomes from functional and comparative genomic perspectives. One of the most prominent examples of whole genome comparative analyses is based on Blast Score Ratios (Rasko et al., 2005).

Here, we introduce a new analysis tool, functional genome distribution (FGD). FGD does not attempt to represent the evolutionary path a genome has taken, since different genes will have been acquired by different routes. Instead, FGD investigates the overall levels of similarity between microbial genomes based on amino-acid sequences of the predicted complete ORFeomes. This reflects the impact the evolutionary force has had on genome makeup in the past, resulting in the current level of niche adaptation (Thomson et al., 2003). Thus presence, absence, or modification of individual genes or genetic islands defines the phenotypic potential of a given organism at a given temporal snapshot. The comparison of these ORFeomes to each other ultimately defines the level of similarity of the genomes. This approach then also takes into account important genetic adaptations to specific ecological niches or even to human made environments such as industrial fermentation processes. Such common genotype adaptations might render organisms more similar by FGD analyses than their respective evolutionary heritage would indicate. The presented approach of a FGD is a BLAST-based ORF-position-independent algorithm, implemented in the compACTor software. In the context of FGD analyses the term “functional” is used in the sense of functionality based on sequence and sequence similarity and is not based on annotation classification [i.e., such as implied by COG (Tatusov et al., 2003) or KEGG (Kanehisa, 2002; Kanehisa et al., 2008) databases].

Research in functional genomics (defined as the investigation of gene function by gene inactivation, gene complementation, and in silico analyses) relies heavily on the identification of differences between two or more genomes, identifying differences in the presence or absence of individual genes. The compACTor software also creates all-vs.-all ORFeome distance data files which, in combination with the respective GenBank files of the query microbes and the Artemis comparison tool (ACT; Carver et al., 2005), facilitate visual qualitative comparative in silico analyses of complete and draft phase microbial genomes for the subsequent analysis of changes to gene synteny and operon structures. Furthermore, to identify genes shared and unique between selected clusters a mining tool, FGDfinder, is provided, facilitating rapid identification of relevant gene sets for further in vitro analyses.

Materials and Methods

General workflow of the compACTor software

A pool of genomes in GenBank format represents the query space for compACTor (Figure 1). Gene models embedded and subsequently retrieved from the GenBank files of each query genome are parsed and the resulting ORFeome is translated into individual amino-acid sequences. The parsing algorithm considers both gene and CDS features for any given start position, while giving preference to CDS entries. Nucleotide sequences are retrieved from the genome sequence and translated into amino-acid sequence. Gene model errors such as unbalanced number of nucleotides, unrecognized amino-acid codons, or multiple stop-codons are reported but treated as non-critical. The translated ORFeome is then used to build query-specific amino-acid BLAST databases (DBs) using formatdb (Altschul et al., 1990). In subsequent runs previously build BLAST databases can be re-used (Figure 1), reducing the overall runtime of the compactor software. The pooled ORFeomes of all query entries are hashed for subsequent all-vs.-all analyses. Levels of similarities are inferred for each predicted query ORF by an all-vs.-all analysis, submitting sequence-database combinations to a non-filtered gapped BLASTP (Altschul et al., 1997) pipeline. Briefly, the deduced amino-acid sequence for each open reading frame (ORF) of each query entry is compared to all subject databases generated from the entry pool. Individual BLAST results are parsed and collated in ORF-specific ASCII result files. The total number of BLAST analyses performed is a direct function of the number of individual ORFeomes and the pool size.

Figure 1

The collated flatfile database is the basis for all subsequent calculations such as the generation of ACT (Carver et al., 2005) pMSP-datafiles, the prediction of putative strain-specific genes, and the generation of FGD trees. It is also possible to add new query entries to existing analyses. The compACTor software tests for the presence of respective ORF-database BLAST results and omits the BLAST pipeline where possible. This significantly reduces runtime requirements and facilitates future expansions of existing comparative analyses.

Generation of ACT comparison data files (pMSP)

Based on the all-vs.-all principle, all genome–genome pair combinations are analyzed. A bi-directional BLASTP approach was implemented for the creation of ORF-specific ACT comparison files. Genome pairs analyzed in respective comparison files are reflected within the filename which features both filenames in the order query-to-subject. Each predicted ORF of the query genome was allowed a maximum of 20 similarity hits to ORFs present in the respective subject genome. ACT allows two quality parameters, namely “Score cutoff” and “Percent ID cutoff” in its comparison file format. These two cutoffs determine the stringency of sequence similarities, and were replaced in the compACTor software output by the alternative parameters of “e-value” (“Percent ID”) and “weighted score” (“Score”). The parameter “e-value” is derived from individual e-values which are grouped into empirically determined trust level ranges (Table a1 in Appendix). Similarly, for the second parameter, a weighted score is calculated, placing additional emphasis on alignment quality and length (Eq. 1). Briefly, the initial alignment-centric local BLAST score is reflecting the maximum level of sequence similarity between a given ORF pair (ς, Eq. 1). To account for changes outside the local alignment and to further assess the alignment quality, the score is then subjected to three penalty blocks, assessing the quality of amino-acid similarity [(P − Id/2) + Id/λ, Eq. 1], the size of gaps (1 − G/A, Eq. 1), and the length of the BLAST alignment in relation to the query ORF length (A − G/λ, Eq. 1). These penalty parameters take into account differences in amino-acids sequence alignments which are not or insufficiently covered by the regular BLAST score. This process effectively reduces the level of similarity between both ORFs when the BLAST alignment covers only partial ORF sequences and/or the alignments show significant levels of insertions or deletions and results in a weighted BLAST score ψ.

The weighted score is a sub-parameter of the distance calculation between two genomes (see Eq. 2) and allows for stringency shifting in ACT while maintaining continuity to the FGD algorithm.

Weighted BLAST score

with: ψ = weighted score, ς = BLAST score between query sequence and respective BLAST hit, λ = length of deduced amino-acid in query sequence, P = identified positives in BLAST alignment, Id = identified identities in BLAST alignment, A = length of BLAST alignment, G = number of gaps in BLAST alignment

Artemis comparison tool compatible comparison files (pMSP) created with the compACTor software adhere to the MSPcrunch (Sonnhammer and Durbin, 1994) format. An example dataset of a two-way genome-to-genome comparison is provided in the Appendix.

Functional genome distribution

In prokaryotes, genome plasticity (re-arrangement of genomic regions between species), the presence of metabolic islands (clusters of genes with defined and specific metabolic functions), gene acquisition via horizontal gene transfer (HGT), and varied distribution of mobile genetic elements are well described and contribute significantly to phenotypic differences (Desiere et al., 2001; Makarova et al., 2006; Berger et al., 2007; Nicolas et al., 2007). Also, adaptation to new ecological niches and the resulting genetic drift can lead to rapid acquisition or the shedding of genetic elements without changes in selected sequence or gene subsets used in typing analyses (Makarova et al., 2006). Organisms, virtually indistinguishable based on selected individual gene (set) evolutionary analyses, may be markedly different in their complete genetic blueprint. To reflect this widespread diversity, a new algorithm, FGD, was empirically developed to distinguish organisms in the multi-dimensional genome space. Here, the focus distinctly lies on a snapshot picture of a current genome and how the genotype relates to other organisms. From a functional genomics perspective, the presence, absence, or modification of a gene determines a potential metabolic capability, while the respective position with the genome may influence the respective levels of gene expression (Sousa et al., 1997). Therefore, a position independent approach was embraced that investigates the phenotypic potential (absence/presence/similarity of genes) at the cost of loci dependent modulations of gene expression.

Each predicted ORF of a given query genome is analyzed via gapped BLASTP to a subject organism specific BLAST database. Result parameters of the best BLAST hit are applied to the following equations (Eqs 2 and 3).

Distance between two genomes

with: δ = final similarity score between two genomes, ψ = weighted BLAST score as described in Eq.1, λ = length of deduced amino-acid query sequence, T_ORFquery = total number of ORFs in query genome, n = current query ORF, and ρ_ORF = ORF number ratio as described in Eq.3.

ORFeome ratio between query and subject genome

with: ρ_ORF = ORF number ratio between query and subject genome, always ≤ 1, T_ORFquery = total number of ORFs in query genome, T_ORFsubject = total number of ORFs in subject genome.

Motivation of the FGD algorithm

The FGD algorithm initially investigates the level of similarity for each predicted ORF in a query genome to a subject ORF and is based on the BLAST score. This score describes the overall quality of the best local alignment found for a given query–subject sequence alignment. However, the score alone does not necessarily reflect the overall level of similarity between two sequences outside the local alignment. Also, from a functional genomics perspective, insertion, and deletion events in individual amino-acid sequences may change the properties of the gene product, thus decreasing the level of similarity. Furthermore, the relation of the local alignment to the overall sequence is an important factor. While stretches of highly conserved sequence contribute to a higher BLAST score, the presence of unique sequence outside the local alignment is likely to contribute to changing the properties of respective gene products. Therefore, the maximum level of similarity as described by the BLAST score will be decreased by the FGD algorithm if insertion/deletion events and imbalanced alignment/deduced sequence ratios are detected.

The FGD analysis then summarizes individual similarity scores determined for each ORF in the ORFeome. The presence of unique strain-specific genes and differences in the overall number of predicted ORFs in respective microbial genome pairs contribute to different genotypes and consequently decrease the level of overall functional similarity between genome pairs.

The initial alignment BLAST score is subjected to quality assessments as described above for Eq. 1. In Eq. 2 the resulting weighted score is normalized by the number of deduced amino-acids of the query ORF, resulting in a score per amino-acid [ψ/λ(n)]. Individual similarity scores are then are summed up over the query ORFeome and normalized over total number of predicted ORFs, resulting in a weighted score per ORF. Assessing differences in ORFeome sizes by calculating the ratio between query and subject ORFeomes (Eq. 3), the similarity score is adjusted to result in a final genome similarity score (δ_Gquery → Gsubject).

Genome similarity scores for each query–subject combination are entered into a distance matrix. A symmetrical distance matrix, based on the geometrical means of each genome–genome pair combination is calculated according to Eq.4.

Symmetrical distance matrix

with: Δ_sym = final score for symmetrical distance matrix between a given genome–genome pair and δ_Gquery → Gsubject = similarity score between a given genome–genome combination.

Distance matrices of larger datasets become increasingly complex to read directly. Therefore a traditional tree representation was used as a visual aid to the underlying data structure. The resulting symmetrical distance matrix can be imported into software packages such as Mega4 (Tamura et al., 2007). Genome clusters are approximated using the Unweighted Pair Group Method with Arithmetic mean (UPGMA) algorithm (Sneath and Sokal, 1962). Approximated branch lengths, representing the level of similarity between genomes are depicted by distance units (du). Although similar in appearance to phylogenetic trees, the resulting functional distribution tree (FDT) depicts a different concept, clustering organisms by their respective ORFeome similarities rather than by their inferred evolutionary line.

Identification of putative strain-specific genes

A key interest in comparative functional genomics is the identification of strain and cluster specific and shared gene sets, identifying the genetic differences, and similarities between selected microbial genomes. These gene sets can then be used to investigate potentially unique metabolic capabilities of microbes and may reveal specific genome adaptation to lifestyle changes. Cluster specific genes or genes shared between clusters are identified using the flatfile BLAST database created by compACTor. Query ORFs exhibiting a minimum e-value threshold with respect to a given subject genome are identified and stored if they fulfill the stringency restriction. Stringency between selected clusters can be set by selecting the number of allowable mismatches. A mismatch is defined as an ORF in the query cluster that has at least one ORF–ORF pairing above the minimum e-value threshold to the subject cluster. Objects implemented in the output are: name of the query GenBank file, respective ORF identifier, the name of the subject GenBank file, annotation of the subject ORF, and the corresponding e-value, query ORF length, score, query ORF start and stop positions, and gene, locus_tag, product, and note annotation. Data for each query ORFeome – subject genome combination are collated and summarized in a tab-delimited ASCII file. The individual data files can easily be imported into spreadsheet programs such as MS Excel or OpenOffice Calc and then be further re-grouped and analyzed. For example, grouping by query ORF number quickly identifies the degree to which a respective ORF is unique among the pool of genome entries.

Implementation

The compACTor and FGDfinder software has been realized as PERL scripts (ActivePerl version 5.8.8.822), using the official NCBI BLAST distribution⁵. The software is freely available for non-commercial use (commercial use requires a license).

The algorithms and BLAST utilities used in compACTor render the application CPU intensive. compACTor is multi-threaded and support multiple CPUs/cores to reduce runtime in a near linear fashion. For the analysis depicted in Figure 2, 39 genomes, ranging from ∼1.7 to ∼6.6 Mbp, were analyzed. A total number of 113,397 ORFs resulted in ∼4,300,000 individual BLAST queries. Artemis Comparison Files, the prediction of putative strain-specific ORFs and the phylogenetic tree were calculated using a collated and parsed BLAST flatfile database of >11 Gb. 1,560 Artemis comparison files were generated, reflecting all possible genome–genome combinations (a datapool of >2.5 Gb).

Figure 2

Known limitations

Known limitations of the algorithm at this time are the lack of integrated support for extrachromosomal genetic elements such as plasmids and additional chromosomes. A workaround exists by concatenating DNA sequences of all genetic information present in the respective organism and then performing an automated annotation using pipelines such as GAMOLA (Altermann and Klaenhammer, 2003). The implemented GenBank parser currently ignores joined features. In prokaryotes, joined features such as introns [e.g., the mobilization protein MobA E1 (genome position 1344187.0.1345428, 1347858.0.1348370) in Lactococcus lactis subsp. cremoris MG1363 (GI:124491690)] are very rare and usually do not interfere with the analyses. Also, identical genomes still incur a distance to each other, resulting in a small branching when visualizing the dissimilarity matrix. A normalization algorithm will be implemented in the next release to address this issue. compACTor and FGDfinder are actively developed and future releases will resolve those issues.

Results and Discussion

Evaluation of functional distribution trees

To assess the FGD concept 35 completed and four draft phase genomes from different phyla, including several archeal genomes, were selected (Table a2 in Appendix). The purpose of this diverse taxonomy is to investigate how FGD places individual genomes into clusters and how similar these genomes are to each other on a functional level. The majority of the genomes selected are members of the family Lactobacillaceae in the order Lactobacillales (12 genomes). To investigate the discriminatory power of the FGD algorithm, the range was expanded and five genomes of the family Bacillaceae (order Bacillales), three genomes of the family Streptococcaceae (order Lactobacillales) and two genomes of the family Leuconostocaceae (order Lactobacillales) were added. Eight more distantly related genomes of the order Clostridiales (class Clostridia) were chosen to broaden the taxonomic selection to different classes. All of these genomes are members of the phylum Firmicutes. Three representatives of the class Gammaproteobacteria, (phylum Proteobacteria) were included to investigate inter-phylum relationships. As a final outlier, six genomes of archeal Euryarchaeota were included in the analysis.

Figure 2 represents the functional distribution of the selected taxa within an FDT. To test the influence of early, incomplete draft phase genomes on functional placement, the draft phase sequence of Anaerostipes caccae DSM14662 (Schwiertz et al., 2002) which consisted of ∼1.69 Mbp at the time of analysis was included. FGD penalized the missing genetic information and set A. caccae apart from the Clostridiales as a separate cluster. When the updated genome sequence of A. caccae encompassing ∼3.6 Mbp was included instead, A. caccae shifted its position and clustered with Ruminococcus obeum and Ruminococcus gnavus (branch length 1.99 du), while showing a deeper branching to Butyrivibrio proteoclasticus (branch length 3.13 du; data not shown). This clearly highlights the necessity of obtaining high-coverage genome sequence data for FGD analyses. However, with high-throughput genome sequencing techniques currently available, initial draft phase genomes usually encompass 85–95% of the genome, thus allowing an initial representative functional placement.

Subsequent analysis of complete genomes included in the FDT revealed the deepest branching (20.6 du) for euryarcheal genomes. The selected methanogens form two distinct genome clusters (node at 5.3 du), separating Methanococcoides burtonii and Methanosarcina mazei (Deppenmeier et al., 2002) from the remaining four taxa. Initial comparisons of habitat, growth temperature, and GC content did not indicate a consistently shared denominator between the two groups. Further analyses will be necessary to determine the imminent functional similarities indicated by the FGD approach.

Interestingly, Clostridiaceae, Bacillaceae, and a subcluster comprised of Rumincocci and B. proteoclasticus B316 formed a new functional node within the FDT (branch depth from node to Lactobacillales was 7.8 du), combining the taxonomic families into one genome cluster. Although the genomes of members of Clostridiaceae and Bacillaceae are still placed into distinct functional groups (internal cluster branch depth was 4 du) and no taxon shuffling was observed between both sub-clusters, it appears that lifestyle adaptation has led to similar genome content, potentially indicative of a high level of HGT between both families or from one family to the other.

A survey of the nine ORFeomes of the Bacillus and Clostridium clusters revealed 154 ORFs that are highly conserved in both groups (e-value threshold 1e-100). As expected, most ORFs could be assigned to central house-keeping functions such as DNA synthesis and repair (21%), tRNA genes and related processes (12%), central metabolism (35%), transcription and translation (6%), cellular processes (9%), and molecule transport (8%). However, besides these central functions, a significant number of ORFs related to sporulation were present and highly conserved in both clusters (8%). Based on the algorithms used, it is reasonable to hypothesize that the presence of these highly conserved sporulation genes may be one of the key drivers for the observed clustering of bacilli and true clostridia.

When compared to the Ruminococcus subcluster a similar conserved gene set was found with the notable absence of most sporulation genes (e-value threshold 1e-60). This is in agreement with the observed non-sporulating phenotype. It is noteworthy, that two conserved genes involved in sporulation (stage V sporulation protein D, spoVD, and Sporulation initiation inhibitor protein, soj) were identified in the Ruminococcus and Butyrivibrio genomes. This may indicate an ongoing genetic loss in response to adaptation to a new environment (rumen) where sporulation is no longer offering an advantage in fitness. The other identified shared genes are likely to be present in most of the other microbial genomes analyzed, and thus would contribute to higher-level genome clustering.

In addition, the ORFeomes were analyzed for predicted genes which are conserved in one genome cluster but not in the other – and vice versa (threshold conserved: 1e-100; threshold unique: 1e-10). Overall, 84 ORFs were identified to be group specific. Seventy of these were found only in genomes assigned to the Bacillus cluster and 14 in genomes in the Clostridium group. Remarkably, genes involved in heme and cytochrome biogenesis (hemE, hemH, hemY, resB, and resC), cytochrome reduction (gcrB, gcrC, cypD), and cytochrome oxidation (qoxB, cydA, ctaB, ctaD, cydA) were identified, indicating a Bacillus-specific electron transport chain. It is thus tempting to speculate that, functionally, Bacilli are aerobic Clostridia, having acquired the capability of oxidative phosphorylation. Furthermore, a subset of the propionate metabolism pathways identified to be Bacillus-specific. This subset is involved in the conversion of propanoyl-CoA to succinate and succinyl-CoA (prpD, prpB, pccB, sucD, sucC) and might present an additional energy conversion option for Bacilli which is absent in Clostridia.

Only a few ORFs were identified to be Clostridium cluster specific. Among those a cobyric acid synthase cobQ was identified to be Clostridium specific. CobQ is part of the porphyrin metabolic pathway, involved in converting cobyrinic acid into coenzyme vitamin B₁₂. Notably, a central branching point in this pathway leads to the synthesis of hemes and cytochromes found in Bacilli (see above). Interestingly, CobQ is also absent in the Ruminococcus subcluster, providing further support for the proposed ongoing adaptation to the new rumen environment. In contrast, the Ruminococcus subcluster acquired a number of membrane and sugar utilization (e.g., beta-glucosidases bgl3A, bgl3B, and bgl3D and an l-fucose isomerase) which may aid in the adhesion to and degradation of plant fibers in the rumen (threshold conserved: 1e-60; threshold unique: 1e-10; Kelly et al., 2010).

While the function of these genes has been well described in the past, they deliver the proof-of-concept that FGD analyses are able to identify gene sets involved in lifestyle adaptation processes. Because the initial similarity analysis does not rely on existing gene annotation, uncharacterized ORFs (e.g., genes annotated as “conserved hypothetical”) can be identified as potential targets to contributing to respective phenotypes. This is particular important for poorly annotated microbial genomes with a high level of conserved hypothetical ORFs.

In summary, results obtained from the test dataset provide strong support for the usefulness of FGD analyses, by illustrating the ability of the method to draw together groups into common nodes based on shared core (shared by all genomes in a specific cluster) and lifestyle elements, yet distinguishing them into distinct sub-clusters based on relevant genotypic differences and lifestyle adaptation processes. Importantly, FGD subsequently allows identifying gene sets likely to be responsible for the observed clustering, providing meaningful new target selections for functional genomics analyses independent of other means of classification or prior annotation.

Furthermore, distinct placements in genome clustering of Leuconostoc mesenteroides and Oenococcus oeni (Ze-Ze et al., 2000; Makarova et al., 2006; Leuconostocaceae) were observed in the FDT. Both genera can be found epiphytically on fruits, fruit mashes, and vegetables and are used in industrial and food fermentation processes. Interestingly, Functional Distribution placed L. casei and L. plantarum into a separate cluster, branching deeply within the Bacilli (Figure A1 in Appendix). Both genomes are significantly larger than the average Lactobacillus genome of 1.8–2.0 Mbp with 2.8 and 3.3 Mbp, respectively. The significantly increased genome size likely reflects a more generalized lifestyle, capable of thriving in a variety of habitats such as raw and fermented dairy products, plants, and the intestinal and reproductive tracts of animals and humans. In contrast, the smaller genomes of other lactobacilli often reflect the more specialized lifestyle to one habitat such as the human or animal gastrointestinal system or specific fermentation processes.

These results may indicate that lifestyle adaptation can lead to a similar genetic makeup of taxa defined as being distinctively different by heritage based phylogeny.

Limits of the FGD resolution

In the previous example microbial genomes from a wider range of species were investigated. To determine if the algorithm can discriminate strains from the same species, 23 Chlamydia trachomatis genomes (host: human), three Chlamydia muridarum genomes (host: members of the family Muridae), and one Chlamydia pneumoniae genome (host: varied; see Table a3 in Appendix) were subjected to an FGD analysis. It is interesting to note that until 1999 C. muridarum, infecting only members of the family Muridae, was designated as C. trachomatis (Everett et al., 1999). C. pneumoniae, which can infect a wide variety of different hosts and causes atypical pneumonia, clusters away from both C. trachomatis and C. muridarum genomes, indicating a different – and possible more flexible – genome makeup (Figure A2 in Appendix). In contrast to the other Chlamydia genomes, C. pneumoniae harbors an additional ∼200 kb of genetic information, and a more detailed analysis will be necessary to determine whether gene loss or gene acquisition is the major driving force. Similarly, the C. muridarum group is clearly forming its own cluster, albeit indicating a higher-level of similarity to C. trachomatis than to C. pneumoniae. On this high level, FGD can clearly resolve genomic differences and support observed host specificities (varied hosts – Muridae – human). Within the C. trachomatis cluster, three distinct sub-clusters could be identified with little reshuffling observed. Cluster 1 comprises serotypes E, F, G, and J (and C. trachomatis Ds2923), Cluster 2 harbors serotypes A, B, D, and L (and C. trachomatis E Sweden2), and Cluster 3 groups two serotype L and one serotype A (Figure A2 in Appendix). The observed positioning of the serovar D strain C. trachomatis Ds2923 into Cluster 1 supports the pairwise alignment of several chlamydial isolates (Jeffrey et al., 2010) which identified the least number of nucleotide substitutions between Ds/2926 and E/11023. In contrast, different major groups were identified in this nucleotide based analysis; two major clades (D/G/J and E/F; Jeffrey et al., 2010) are contrasted by three clusters (E/F/G/J, A/B/D/L, and L/A).

Overall, the FGD analysis was able to resolve strains from the same species to a similar level and with similar results as other whole genome comparative approaches. However, one of the limitations seen in the analysis of very similar genomes from the same species was the difficulty in identifying cluster specific gene sets based on e-value using FGDfinder [only three hypothetical ORFs were found to be cluster specific to Cluster 3 (see Figure A2 in Appendix, threshold conserved: 1e-20; threshold unique: 1e-10)]. In its current version FGDfinder uses calculated e-values to determine respective conserved or cluster specific gene sets. A future version of the software will incorporate the FGD scoring algorithm, increasing the power of resolution when very similar ORFeome sets are compared.

Topographical stability of functional distribution trees

The overall stability of inferred trees and respective genome clusters was tested by a Jackknife analysis (James and McCulloch, 1990). Individual observations (genome entries) from the calculated distance matrix were removed sequentially and resulting UPGMA-based FDTs were approximated. This was done for each genome in the dataset, resulting in 39 subset trees. Respective topologies were assessed individually. To investigate the impact of whole genome clusters on tree robustness, a Jackknife analysis was performed defining observed genome clusters as individual observations. Again, FDTs were approximated and evaluated for each resulting subset (data not shown). In summary, tree topology remained stable and only a swapping of neighboring branches was observed while in no instance shuffling was found for individual genome entries between genome clusters in either Jackknife analysis. This strongly supports the overall stability and discriminatory power of the FGD analysis.

Similarly, a Jackknife analysis was performed for the narrow strain FGD analysis (Figure A2 in Appendix). Individual entries, individual clusters, and complete serotypes were removed and the resulting tree topology investigated. Removal of individual entries and complete clusters did not change the topology of the FGD tree. Only minor reshuffling was observed within a respective cluster when complete serotypes were removed from the analyses (e.g., removal of serotype L caused a repositioning of C. trachomatis D UW3 CX into the serotype D subcluster within Cluster 2.

As expected, the removal of individual entries reduces discriminatory power, resulting in grouping together previously more separated genome clusters, without any entry-reshuffling.

Comparative analysis using the artemis comparison tool

Comparison of the degree of genome similarity between two or more genomes relies on analyses of presence/absence of genes and their respective syntenies in an operon or genome-wide context. In combination with ORFeome based (ORF-to-ORF comparison on amino-acid level, a maximum of 20 similarity hits per ORF is permitted) MSPcrunch (Sonnhammer and Durbin, 1994) comparison format data files (pMSPs) created by the compACTor software and annotated GenBank files, the ACT (Carver et al., 2005) provides an excellent visualization platform for mobility through the entire genome. Figure 3 illustrates the differences and similarities found between closely related strains such as between members of the acidophilus-complex of the Lactobacilli or between subspecies as well as between less similar genomes of more distantly related microbes. For example, small scale [inversion of a specific gene locus between both L. delbrueckii ssp. bulgaricus genomes (Makarova et al., 2006; van de Guchte et al., 2006)] and large scale (double inversion ∼700 and ∼150 kb, respectively) of the terminus of DNA replication between L. gasseri (Azcarate-Peril et al., 2008) and L. johnsonii (Pridmore et al., 2004) genome inversions, deletion, and insertion events [between L. acidophilus (Altermann et al., 2005) and L. delbrueckii ssp. bulgaricus], localized gene synteny (between L. delbrueckii ssp. bulgaricus and L. brevis), general presence/absence of individual genes and larger synteny trends (between L. casei and L. plantarum) can be immediately identified.

Figure 3

Comparison of the FGD algorithm to alternative methods

Although similar algorithms exist to investigate phylogenetic relationships based on whole (or partial) genomes sequences (Snel et al., 1999; Wolf et al., 2001; Henz et al., 2005; Khiripet, 2005; Canchaya et al., 2006; Fuchsman and Rocap, 2006; Berger et al., 2007; Felis and Dellaglio, 2007; Blaiotta et al., 2008), their focus remains mostly to infer a heritage based phylogeny. Furthermore, often only subsets of ORFeomes are chosen for these analyses. These are then analyzed individually (with or without weighting) or as concatenated sequences. Thus an artificial restriction is introduced that may bias the analysis. An example of such a method has been published by Konstantinidis and Tiedje (2005). There, genome information to infer taxonomy of prokaryotes is used, calculating an average amino-acid identity of shared gene subsets. Only few publications investigate the functional relationship of microbial genomes, such as the development of the Blast Score Ratio which analyzes the complete ORFeome but is limited to three genomes at a time (Rasko et al., 2005). Other methods investigating the functional relationship and similarities between gene clusters have been used to address the problem of genomes with different sizes. One example of such a method, GRAST, explores the ongoing genome reductions and rearrangements by identifying clusters of functionally related genes (Toft and Fares, 2006). Subsets of orthologous gene pairs are identified to determine conserved genetic loci. Similar to Blast Score Ratio analyses, the number of input genomes is limited to two at a time. While the output in part shows similarity to FGD analyses (visual representation of a genome plot and the determination of common and non-common genes), the purpose of this method is distinctly different in specifically identifying genome plasticity trends. A combination of genome analysis and visualization tool, GeneComp, has been published earlier (Yang et al., 2003). GeneComp is able to use different BLAST flavors and then visualize the textual output with varying levels of alignment length stringencies. While this solution offers the advantage of providing a combined analysis and visualization package, a number of limitations exist when compared to FGD. Like BSR and GRAST, GeneComp is restricted to a maximum number of three genomes. Furthermore, the algorithm is sequence based, highlighting genome variations such as repeat regions, insertions, deletions, and rearrangements rather than specific similarities to predicted ORFs.

Non-sequence based methods such as MLST use only a relatively small number of conserved genome loci with the primary aim to establish a highly discriminating (microbial) typing system (Chan et al., 2001; Maiden, 2006; Diancourt et al., 2007). Noteworthy, the ability of FGD analyses to identify cluster-conserved gene sets may provide a high-quality starting point for the selection of MLST targets.

Conclusion

16S rRNA and other gene subset analyses mainly focus on the determination of the line of descendants of a given gene or organism (Zhang et al., 2009) or on the identification of protein families (Enright and Ouzounis, 2000; Enright et al., 2002; Kelil et al., 2007). Such phylogenetic studies aim to reconstruct the relationship between organisms and are paramount to analyze the (changing) community structures of complex biological ecosystems. While this type of phylogenetic analysis is well accepted and widely used, it does not reflect the respective comprehensive genotypes. In contrast, FGD provides a different view of microbial similarities to each other. The example data set demonstrated the effects of lifestyle adaptation on genome content. FGD has shown the potential to provide new insights into the relationships between microbes from a comparative genomics perspective. The algorithm has already been used in a variety of analyses ranging from microbial (Goh et al., 2011) and archeal (Leahy et al., 2010) genomes to bacteriophage (Lu et al., 2010) which describe the impact of FGD analyses in more detail within their respective scopes.

Functional genome distribution in combination with the graphical visualization in ACT using ORFeome distance files (pMSPs) and functionally annotated GenBank files offers a powerful tool for comparative genomics that allows comparisons of whole genomes within genome space, encompassing heritage based (vertical transmission), lateral gene transfer (HGT), and lifestyle-driven change (adaptation) in a single analysis. Therefore, rather than attempting to reconstruct the evolution of the core genome with its set of commonly shared genes, FGD allows a representation of whole genome similarity at the functional level.

It will be possible to add further functionality to the algorithm, such as the ability to mask defined or dynamically created gene clusters within groups of organisms, thus identifying potentially important genetic elements independent from otherwise overpowering gene sets, such as central house-keeping or metabolism genes.

Availability

The complete software suite consisting of the compACTor and the FGDfinder software is freely available upon email request for academic use. Commercial use is subject to a license agreement.

Supplementary Material

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Evolutionary_and_Genomic_Microbiology/10.3389/fmicb.2012.00048/abstract

Statements

Acknowledgments

Special thanks go to Peter Janssen for helpful and critical discussions during the development of the FGD algorithm, the subsequent interpretation of the data and the critical review of the manuscript. I also thank Sinead Leahy, Graeme Attwood, and Bill Kelly for their scientific contributions to this manuscript. This work was funded by the New Economic Research Fund programs C10 × 0314 and PROJ-13812-LNNR-AGR and supported by the Pastoral Greenhouse Gas Research Consortium (PGgRc).

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

‡.^http://genomesonline.org/cgi-bin/GOLD/bin/gold.cgi

1.^http://www.appliedbiosystems.com

2.^http://454.com/

3.^http://www.illumina.com

4.^http://genomesonline.org/cgi-bin/GOLD/bin/gold.cgi

5.^ftp://ftp.ncbi.nih.gov/blast/executables/LATEST

References

1
AltermannE.KlaenhammerT. R. (2003). GAMOLA: a new local solution for sequence annotation and analyzing draft and finished prokaryotic genomes.OMICS7, 161–169.10.1089/153623103322246557
2
AltermannE.RussellW. M.Azcarate-PerilM. A.BarrangouR.BuckB. L.McauliffeO.SoutherN.DobsonA.DuongT.CallananM.LickS.HamrickA.CanoR.KlaenhammerT. R. (2005). Complete genome sequence of the probiotic lactic acid bacterium Lactobacillus acidophilus NCFM. Proc. Natl. Acad. Sci. U.S.A.102, 3906–3912.10.1073/pnas.0409188102
3
AltschulS. F.GishW.MillerW.MyersE. W.LipmanD. J. (1990). Basic local alignment search tool. J. Mol. Biol.215, 403–410.10.1016/S0022-2836(05)80360-2
4
AltschulS. F.MaddenT. L.SchafferA. A.ZhangJ.ZhangZ.MillerW.LipmanD. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res.25, 3389–3402.10.1093/nar/25.17.3389
5
Azcarate-PerilM. A.AltermannE.GohY. J.TallonR.Sanozky-DawesR. B.PfeilerE. A.O’flahertyS.BuckB. L.DobsonA.DuongT.MillerM. J.BarrangouR.KlaenhammerT. R. (2008). Analysis of the genome sequence of Lactobacillus gasseri ATCC 33323 reveals the molecular basis of an autochthonous intestinal organism. Appl. Environ. Microbiol.74, 4610–4625.10.1128/AEM.00054-08
6
BergerB.PridmoreR. D.BarrettoC.Delmas-JulienF.SchreiberK.ArigoniF.BrussowH. (2007). Similarity and differences in the Lactobacillus acidophilus group identified by polyphasic analysis and comparative genomics. J. Bacteriol.189, 1311–1321.10.1128/JB.01393-06
7
BlaiottaG.FuscoV.ErcoliniD.AponteM.PepeO.VillaniF. (2008). Diversity of Lactobacillus strains based on partial HSP60 gene sequences and design of PCR-RFLP assays for species identification and differentiation. Appl. Environ. Microbiol.74, 208–215.10.1128/AEM.01711-07
8
CanchayaC.ClaessonM. J.FitzgeraldG. F.Van SinderenD.O’tooleP. W. (2006). Diversity of the genus Lactobacillus revealed by comparative genomics of five species. Microbiology152, 3185–3196.10.1099/mic.0.29140-0
9
CarverT. J.RutherfordK. M.BerrimanM.RajandreamM. A.BarrellB. G.ParkhillJ. (2005). ACT: the Artemis comparison tool. Bioinformatics21, 3422–3423.10.1093/bioinformatics/bti553
10
ChanM.-S.MaidenM. C. J.SprattB. G. (2001). Database-driven multi locus sequence typing (MLST) of bacterial pathogens. Bioinformatics17, 1077–1083.10.1093/bioinformatics/17.5.438
11
DeppenmeierU.JohannA.HartschT.MerklR.SchmitzR. A.Martinez-AriasR.HenneA.WiezerA.BaumerS.JacobiC.BruggemannH.LienardT.ChristmannA.BomekeM.SteckelS.BhattacharyyaA.LykidisA.OverbeekR.KlenkH. P.GunsalusR. P.FritzH. J.GottschalkG. (2002). The genome of Methanosarcina mazei: evidence for lateral gene transfer between bacteria and archaea. J. Mol. Microbiol. Biotechnol.4, 453–461.
- Pubmed Abstract
- Google Scholar
12
DesiereF.McshanW. M.Van SinderenD.FerrettiJ. J.BrussowH. (2001). Comparative genomics reveals close genetic relationships between phages from dairy bacteria and pathogenic Streptococci: evolutionary implications for prophage-host interactions. Virology288, 325–341.10.1006/viro.2001.1085
13
DiancourtL.PassetV.ChervauxC.GaraultP.SmokvinaT.BrisseS. (2007). Multilocus sequence typing of Lactobacillus casei (L. paracasei) reveals a clonal population structure with low levels of homologous recombination. Appl. Environ. Microbiol.73, 6601–6611.10.1128/AEM.01095-07
14
EnrightA. J.OuzounisC. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics16, 451–457.10.1093/bioinformatics/16.5.451
15
EnrightA. J.Van DongenS.OuzounisC. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res.30, 1575–1584.10.1093/nar/30.7.1575
16
EverettK. D.BushR. M.AndersenA. A. (1999). Emended description of the order Chlamydiales, proposal of Parachlamydiaceae fam. nov. and Simkaniaceae fam. nov., each containing one monotypic genus, revised taxonomy of the family Chlamydiaceae, including a new genus and five new species, and standards for the identification of organisms. Int. J. Syst. Bacteriol.49(Pt 2), 415–440.10.1099/00207713-49-2-803
17
FelisG. E.DellaglioF. (2007). Taxonomy of Lactobacilli and Bifidobacteria. Curr. Issues Intest. Microbiol.8, 44–61.
- Pubmed Abstract
- Google Scholar
18
FuchsmanC. A.RocapG. (2006). Whole-genome reciprocal BLAST analysis reveals that planctomycetes do not share an unusually large number of genes with eukarya and archaea. Appl. Environ. Microbiol.72, 6841–6844.10.1128/AEM.00429-06
19
GohY.GoinC.O’flahertyS.AltermannE.HutkinsR. (2011). Specialized adaptation of a lactic acid bacterium to the milk environment: the comparative genomics of Streptococcus thermophilus LMD-9. Microb. Cell Fact.10, S22.10.1186/1475-2859-10-S1-S22
20
HenzS. R.HusonD. H.AuchA. F.Nieselt-StruweK.SchusterS. C. (2005). Whole-genome prokaryotic phylogeny. Bioinformatics21, 2329–2335.10.1093/bioinformatics/bth324
21
JamesF. C.McCullochC. E. (1990). Multivariate analysis in ecology and systematics: panacea or Pandora’s box?Annu. Rev. Ecol. Syst.21, 129–166.10.1146/annurev.ecolsys.21.1.129
- CrossRef
- Google Scholar
22
JeffreyB. M.SuchlandR. J.QuinnK. L.DavidsonJ. R.StammW. E.RockeyD. D. (2010). Genome sequencing of recent clinical Chlamydia trachomatis strains identifies loci associated with tissue tropism and regions of apparent recombination. Infect. Immun.78, 2544–2553.10.1128/IAI.01324-09
23
KanehisaM. (2002). The KEGG database. Novartis Found. Symp.247, 91–101; discussion 101–103, 119–128, 244–152.10.1002/0470857897.ch8
24
KanehisaM.ArakiM.GotoS.HattoriM.HirakawaM.ItohM.KatayamaT.KawashimaS.OkudaS.TokimatsuT.YamanishiY. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids Res.36, D480–D484.10.1093/nar/gkm882
25
KelilA.WangS.BrzezinskiR.FleuryA. (2007). CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics8, 286.10.1186/1471-2105-8-286
- CrossRef
- Google Scholar
26
KellyW. J.LeahyS. C.AltermannE.YeomanC. J.DunneJ. C.KongZ.PachecoD. M.LiD.NoelS. J.MoonC. D.CooksonA. L.AttwoodG. T. (2010). The glycobiome of the rumen bacterium Butyrivibrio proteoclasticus B316(T) highlights adaptation to a polysaccharide-rich environment. PLoS ONE5, e11942.10.1371/journal.pone.0011942
- CrossRef
- Google Scholar
27
KhiripetN. (2005). “Bacterial whole genome phylogeny using proteome comparison and optimal reversal distance,” in Computational Systems Bioinformatics Conference, Stanford.
- Google Scholar
28
KonstantinidisK. T.TiedjeJ. M. (2005). Towards a genome-based taxonomy for prokaryotes. J. Bacteriol.187, 6258–6264.10.1128/JB.187.18.6258-6264.2005
29
LeahyS. C.KellyW. J.AltermannE.RonimusR. S.YeomanC. J.PachecoD. M.LiD.KongZ.MctavishS.SangC.LambieS. C.JanssenP. H.DeyD.AttwoodG. T. (2010). The genome sequence of the rumen methanogen Methanobrevibacter ruminantium reveals new possibilities for controlling ruminant methane emissions. PLoS ONE5, e8926.10.1371/journal.pone.0008926
- CrossRef
- Google Scholar
30
LuZ.AltermannE.BreidtF.KozyavkinS. (2010). Sequence analysis of Leuconostoc mesenteroides bacteriophage {Phi}1-A4 isolated from an industrial vegetable fermentation. Appl. Environ. Microbiol.76, 1955–1966.10.1128/AEM.00193-10
31
MaidenM. C. J. (2006). Multilocus sequence typing of bacteria. Annu. Rev. Microbiol.60, 561–588.10.1146/annurev.micro.59.030804.121325
32
MakarovaK.SlesarevA.WolfY.SorokinA.MirkinB.KooninE.PavlovA.PavlovaN.KaramychevV.PolouchineN.ShakhovaV.GrigorievI.LouY.RohksarD.LucasS.HuangK.GoodsteinD. M.HawkinsT.PlengvidhyaV.WelkerD.HughesJ.GohY.BensonA.BaldwinK.LeeJ. H.Diaz-MunizI.DostiB.SmeianovV.WechterW.BaraboteR.LorcaG.AltermannE.BarrangouR.GanesanB.XieY.RawsthorneH.TamirD.ParkerC.BreidtF.BroadbentJ.HutkinsR.O’sullivanD.SteeleJ.UnluG.SaierM.KlaenhammerT.RichardsonP.KozyavkinS.WeimerB.MillsD. (2006). Comparative genomics of the lactic acid bacteria. Proc. Natl. Acad. Sci. U.S.A.103, 15611–15616.10.1073/pnas.0607117103
33
MakarovaK. S.KooninE. V. (2007). Evolutionary genomics of lactic acid bacteria. J. Bacteriol.189, 1199–1208.10.1128/JB.01351-06
34
McCutcheonJ. P.von DohlenC. D. (2011). An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Curr. Biol.21, 1366–1372.10.1016/j.cub.2011.06.051
35
NicolasP.BessieresP.EhrlichS. D.MaguinE.Van De GuchteM. (2007). Extensive horizontal transfer of core genome genes between two Lactobacillus species found in the gastrointestinal tract. BMC Evol. Biol.7, 141.10.1186/1471-2148-7-141
- CrossRef
- Google Scholar
36
PridmoreR. D.BergerB.DesiereF.VilanovaD.BarrettoC.PittetA. C.ZwahlenM. C.RouvetM.AltermannE.BarrangouR.MolletB.MercenierA.KlaenhammerT.ArigoniF.SchellM. A. (2004). The genome sequence of the probiotic intestinal bacterium Lactobacillus johnsonii NCC 533. Proc. Natl. Acad. Sci. U.S.A.101, 2512–2517.10.1073/pnas.0307327101
37
RaskoD. A.MyersG. S.RavelJ. (2005). Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics6, 2.10.1186/1471-2105-6-2
- CrossRef
- Google Scholar
38
RohwerF.EdwardsR. (2002). The phage proteomic tree: a genome-based taxonomy for phage. J. Bacteriol.184, 4529–4535.10.1128/JB.184.16.4529-4535.2002
39
SchneikerS.PerlovaO.KaiserO.GerthK.AliciA.AltmeyerM. O.BartelsD.BekelT.BeyerS.BodeE.BodeH. B.BoltenC. J.ChoudhuriJ. V.DossS.ElnakadyY. A.FrankB.GaigalatL.GoesmannA.GroegerC.GrossF.JelsbakL.KalinowskiJ.KeglerC.KnauberT.KonietznyS.KoppM.KrauseL.KrugD.LinkeB.MahmudT.Martinez-AriasR.MchardyA. C.MeraiM.MeyerF.MormannS.Munoz-DoradoJ.PerezJ.PradellaS.RachidS.RaddatzG.RosenauF.RuckertC.SasseF.ScharfeM.SchusterS. C.SuenG.Treuner-LangeA.VelicerG. J.VorholterF. J.WeissmanK. J.WelchR. D.WenzelS. C.WhitworthD. E.WilhelmS.WittmannC.BlockerH.PuhlerA.MullerR. (2007). Complete genome sequence of the myxobacterium Sorangium cellulosum. Nat. Biotechnol.25, 1281–1289.10.1038/nbt1354
40
SchwiertzA.HoldG. L.DuncanS. H.GruhlB.CollinsM. D.LawsonP. A.FlintH. J.BlautM. (2002). Anaerostipes caccae gen. nov., sp. nov., a new saccharolytic, acetate-utilising, butyrate-producing bacterium from human faeces. Syst. Appl. Microbiol.25, 46–51.10.1078/0723-2020-00096
41
SneathP. H.SokalR. R. (1962). Numerical taxonomy. Nature193, 855–860.10.1038/193855a0
42
SnelB.BorkP.HuynenM. A. (1999). Genome phylogeny based on gene content. Nat. Genet.21, 108–110.10.1038/5052
43
SonnhammerE. L.DurbinR. (1994). A workbench for large-scale sequence homology analysis. Comput. Appl. Biosci.10, 301–307.
- Pubmed Abstract
- Google Scholar
44
SousaC.De LorenzoV.CebollaA. (1997). Modulation of gene expression through chromosomal positioning in Escherichia coli. Microbiology143(Pt 6), 2071–2078.10.1099/00221287-143-6-2071
45
TamuraK.DudleyJ.NeiM.KumarS. (2007). MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol. Biol. Evol.24, 1596–1599.10.1093/molbev/msm092
46
TatusovR.FedorovaN.JacksonJ.JacobsA.KiryutinB.KooninE.KrylovD.MazumderR.MekhedovS.NikolskayaA.RaoB. S.SmirnovS.SverdlovA.VasudevanS.WolfY.YinJ.NataleD. (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics4, 41.10.1186/1471-2105-4-41
- CrossRef
- Google Scholar
47
ThomsonN.BentleyS.HoldenM.ParkhillJ. (2003). Fitting the niche by genomic adaptation. Nat. Rev. Microbiol.1, 92–93.10.1038/nrmicro756
48
ToftC.FaresM. A. (2006). GRAST: a new way of genome reduction analysis using comparative genomics. Bioinformatics22, 1551–1561.10.1093/bioinformatics/btl139
49
van de GuchteM.PenaudS.GrimaldiC.BarbeV.BrysonK.NicolasP.RobertC.OztasS.MangenotS.CoulouxA.LouxV.DervynR.BossyR.BolotinA.BattoJ. M.WalunasT.GibratJ. F.BessieresP.WeissenbachJ.EhrlichS. D.MaguinE. (2006). The complete genome sequence of Lactobacillus bulgaricus reveals extensive and ongoing reductive evolution. Proc. Natl. Acad. Sci. U.S.A.103, 9274–9279.10.1073/pnas.0603024103
50
WolfY.RogozinI.GrishinN.TatusovR.KooninE. (2001). Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol.1, 8.10.1186/1471-2148-1-8
- CrossRef
- Google Scholar
51
YangJ.WangJ.YaoZ.-J.JinQ.ShenY.ChenR. (2003). GenomeComp: a visualization tool for microbial genome comparison. J. Microbiol. Methods54, 423–426.10.1016/S0167-7012(03)00094-0
52
Ze-ZeL.TenreiroR.PaveiaH. (2000). The Oenococcus oeni genome: physical and genetic mapping of strain GM and comparison with the genome of a ‘divergent’ strain, PSU-1. Microbiology146(Pt 12), 3195–3204.
- Pubmed Abstract
- Google Scholar
53
ZhangH.ZhongY.HaoB.GuX. (2009). A simple method for phylogenomic inference using the information of gene content of genomes. Gene441, 163–168.10.1016/j.gene.2008.07.008

Appendix

Figure A1

Figure A2

Table A1

e-Value range	Trust value
<0.1	0
0.1 ≤ e-value > 1e−10	1
1e−10 ≤ e-value > 1e−40	10
1e−40 ≤ e-value > 1e−50	20
1e−50 ≤ e-value > 1e−60	30
1e−60 ≤ e-value > 1e−70	40
1e−70 ≤ e-value > 1e−80	50
1e−80 ≤ e-value > 1e−90	60
1e−90 ≤ e-value > 1e−100	70
1e−100 ≤ e-value > 1e−110	80
1e−110 ≤ e-value > 1e−120	85
1e−120 ≤ e-value > 1e−130	90
1e−130 ≤ e-value > 1e−160	95
1e−160 ≤ e-value ≥ 0	100

e-Value range based trust levels.

Table A2

Designation*	Domain/family	Genome size [bp]	ORFeome size	Accession number
Lactobacillus plantarum WCFS1	Bacteria/Lactobacillaceae	3308274	3051	AL935263
Lactobacillus brevis ATCC 367	Bacteria/Lactobacillaceae	2291220	2314	CP000416
Pediococcus pentosaceus ATCC 25745	Bacteria/Lactobacillaceae	1832387	1847	NC_008525
Lactobacillus sakei subsp. sakei 23K	Bacteria/Lactobacillaceae	1884661	1886	CR936503
Lactobacillus casei ATCC 334	Bacteria/Lactobacillaceae	2895264	2909	CP000423
Lactobacillus salivarius UCC118	Bacteria/Lactobacillaceae	1827111	1738	CP000233
Lactobacillus reuteri F275	Bacteria/Lactobacillaceae	1999618	1944	CP000705
Lactobacillus johnsonii NCC 533	Bacteria/Lactobacillaceae	1992676	1857	AE017198
Lactobacillus gasseri ATCC 33323	Bacteria/Lactobacillaceae	1894360	1811	CP000413
Lactobacillus acidophilus NCFM	Bacteria/Lactobacillaceae	1993561	1979	CP000033
Lactobacillus delbrueckii subsp. bulgaricus ATCC 11842	Bacteria/Lactobacillaceae	1864998	2218	CR954253
Lactobacillus delbrueckii subsp. bulgaricus ATCC BAA-365	Bacteria/Lactobacillaceae	1856951	2040	CP000412
Bacillus cereus ATCC 14579	Bacteria/Bacillaceae	5411809	5490	AE016877.1

Bacillus thuringiensis serovar konkukian str. 97-27	Bacteria/Bacillaceae	5237682	5168	AE017355
Bacillus pumilus SAFR-032	Bacteria/Bacillaceae	3704465	3737	CP000813
Bacillus licheniformis ATCC 14580	Bacteria/Bacillaceae	4222645	4379	AE017333.1
Bacillus subtilis subsp. subtilis 168	Bacteria/Bacillaceae	4214630	4106	AL009126
Streptococcus pneumoniae R6	Bacteria/Streptococcaceae	2038615	2046	NC_003098
Streptococcus pyogenes SSI-1	Bacteria/Streptococcaceae	1894275	1861	BA000034
Streptococcus thermophilus LMD-9	Bacteria/Streptococcaceae	1856368	2003	CP000419
Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293	Bacteria/Leuconostocaceae	2038396	2073	NC_008531
Oenococcus oeni PSU-1	Bacteria/Leuconostocaceae	1780517	1864	NC_008528
Clostridium perfringens ATCC 13124	Bacteria/Clostridiaceae	3256683	2997	CP000246
Clostridium botulinum ATCC 3502	Bacteria/Clostridiaceae	3886916	3648	AM412317
Clostridium kluyveri DSM 555	Bacteria/Clostridiaceae	3964618	3926	CP000673
Clostridium difficile ATCC 9689	Bacteria/Clostridiaceae	4290252	3680	AM180355
Ruminococcus obeum ATCC 29174	Bacteria/Lachnospiraceae	3626304	4175	AAVO00000000; draft
Anaerostipes caccae L1-92	Bacteria/Lachnospiraceae	1691947	1582	ABAX00000000; draft
Ruminococcus gnavus ATCC 29149	Bacteria/Lachnospiraceae	3501953	3913	AAYG00000000; draft
Butyrivibrio proteoclasticus B316	Bacteria/Lachnospiraceae	3936787	3477	Unpublished draft
Pseudomonas aeruginosa PA7	Bacteria/Pseudomonadaceae	6588339	6371	CP000744
Escherichia coli O157:H7	Bacteria/Enterobacteriaceae	5528445	6006	AE005174
Escherichia coli K12 K-12	Bacteria/Enterobacteriaceae	4639675	4403	NC_000913
Methanococcoides burtonii DSM 6242	Archaea/Methanosarcinaceae	2575032	2446	NC_007955
Methanosarcina mazei Goe1	Archaea/Methanosarcinaceae	4096345	3371	NC_003901
Methanocaldococcus jannaschii DSM 2661	Archaea/Methanocaldococcaceae	1664970	1682	NC_000909
Methanosphaera stadtmanae DSM 3091	Archaea/Methanobacteriaceae	1767403	1588	NC_007681
Methanobrevibacter smithii ATCC 35061	Archaea/Methanobacteriaceae	1853160	1795	NC_009515
Methanothermobacter thermautotrophicus DeltaH	Archaea/Methanobacteriaceae	1751377	1918	NC_000916

Genomes used for assessment of functional genome distribution.

*Color coding used in the table corresponds to the color scheme shown in Figure 2 and Figure A1.

Table A3

Designation*	Serotype	Host	Genome status	Accession number
Chlamydia trachomatis A2497	A	Human	Complete	347974781
Chlamydia trachomatis A HAR-13	A	Human	Complete	76788711
Chlamydia trachomatis B TZ1A828	B	Human	Complete	231272648
OT Chlamydia trachomatis B Jali20 OT	B	Human	Complete	231273667
Chlamydia trachomatis D UW3 CX	D	Human	Complete	15604717
Chlamydia trachomatis D-LC	D	Human	Complete	297749010
Chlamydia trachomatis D-EC	D	Human	Complete	297748130
Chlamydia trachomatis Ds2923	D	Human	Complete	222356764
Chlamydia trachomatis E Sweden2	E	Human	Complete	289525045
Chlamydia trachomatis E 150	E	Human	Complete	296434583
Chlamydia trachomatis E 11023	E	Human	Complete	296438301
Chlamydia trachomatis F 70	F	Human	Complete	222444350
Chlamydia trachomatis F 70s	F	Human	Complete	222444349
Chlamydia trachomatis G 11074	G	Human	Complete	296437374
Chlamydia trachomatis G 9301	G	Human	Complete	297139873
Chlamydia trachomatis G 9768	G	Human	Complete	296435514
Chlamydia trachomatis G 11222	G	Human	Complete	296436438
Chlamydia trachomatis J 6276	J	Human	Complete	222444352
Chlamydia trachomatis J 6276s	J	Human	Complete	222444351
Chlamydia trachomatis L2-434 Bu	L	Human	Complete	166153973
Chlamydia trachomatis L2b UCH1 proctitis	L	Human	Complete	352951305
Chlamydia trachomatis L2c	L	Human	Complete	339625373
Chlamydia trachomatis L2tet1	L	Human	Complete	301334996
Chlamydia muridarum MopnTet14 draft		Muridae	Complete	311788820
Chlamydia muridarum Nigg		Muridae	Complete	29337300
Chlamydia muridarum Weiss.cb		Muridae	Draft	NC_002620.2
Chlamydia pneumoniae		Varied	Complete	340215159

Genomes used for assessment of functional genome distribution on strain level.

Summary

Keywords

functional genomics, genome comparison, lifestyle adaption, horizontal gene transfer, genome plasticity

Citation

Altermann E (2012) Tracing Lifestyle Adaptation in Prokaryotic Genomes. Front. Microbio. 3:48. doi: 10.3389/fmicb.2012.00048

Received

08 November 2011

Accepted

30 January 2012

Published

21 February 2012

Volume

3 - 2012

Edited by

Jacques Ravel, University of Maryland School of Medicine, USA

Reviewed by

David Nelson, Indiana University Bloomington, USA; Bing Ma, The Institute for Genome Sciences, USA

This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.

*Correspondence: Eric Altermann, Grasslands Research Centre, AgResearch Limited, Tennent Drive, Private Bag 11008, Palmerston North, New Zealand. e-mail: eric.altermann@agresearch.co.nz

This article was submitted to Frontiers in Evolutionary and Genomic Microbiology, a specialty of Frontiers in Microbiology.

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Evolutionary and Genomic Microbiology

METHODS article

Tracing Lifestyle Adaptation in Prokaryotic Genomes

Abstract

Introduction

Materials and Methods

General workflow of the compACTor software