Genome-Based Characterization of Biological Processes That Differentiate Closely Related Bacteria

Bacteriologists have strived toward attaining a natural classification system based on evolutionary relationships for nearly 100 years. In the early twentieth century it was accepted that a phylogeny-based system would be the most appropriate, but in the absence of molecular data, this approach proved exceedingly difficult. Subsequent technical advances and the increasing availability of genome sequencing have allowed for the generation of robust phylogenies at all taxonomic levels. In this study, we explored the possibility of linking biological characters to higher-level taxonomic groups in bacteria by making use of whole genome sequence information. For this purpose, we specifically targeted the genus Pantoea and its four main lineages. The shared gene sets were determined for Pantoea, the four lineages within the genus, as well as its sister-genus Tatumella. This was followed by functional characterization of the gene sets using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. In comparison to Tatumella, various traits involved in nutrient cycling were identified within Pantoea, providing evidence for increased efficacy in recycling of metabolites within the genus. Additionally, a number of traits associated with pathogenicity were identified within species often associated with opportunistic infections, with some support for adaptation toward overcoming host defenses. Some traits were also only conserved within specific lineages, potentially acquired in an ancestor to the lineage and subsequently maintained. It was also observed that the species isolated from the most diverse sources were generally the most versatile in their carbon metabolism. By investigating evolution, based on the more variable genomic regions, it may be possible to detect biologically relevant differences associated with the course of evolution and speciation.


INTRODUCTION
Since the early twentieth century scientists have recognized the value of phylogenetic inferences in determining natural relationships between taxa, which is essential for both taxonomy and evolutionary studies (Woese, 1987). However, the move toward a more natural classification system by these early bacteriologists, based on phylogenetics, proved exceedingly difficult as traditionally used morphological traits were not variable enough to group taxa reliably (Stanier and Van Niel, 1941;Woese, 1987;Woese et al., 1990). Although this led to the use of physiological characters, some researchers already argued early on that such data would not be suitable for developing evolutionary hypotheses. They emphasized that physiological traits would generally not be phylogenetically informative as long as there were no clear understanding of their genetic basis and overall biological importance (Stanier and Van Niel, 1941). The consensus view at the time was thus that phylogenetic inferences were definitely needed for elucidating the natural relationships among bacteria, but that this would only be possible with the use of suitably informative characters (Stanier and Van Niel, 1941;Woese, 1987Woese, , 1994Woese et al., 1990). As a result, scholars mostly abandoned the field of bacterial systematics until more reliable characters became available with the advent of nucleic acid-based molecular phylogenetics in the 1970s (Woese, 1987(Woese, , 1994(Woese, , 1998Woese et al., 1990;McInerney et al., 2011).
The next logical step after having used phylogenetics to identify taxa, particularly those above the species level, would be to assign biological characters to them. For example, if bacterial genera or the lineages within them represent natural clusters, it should be possible to identify properties that they share with one another, but that are different from those occurring in other such clusters. Previously, various standardized sets of physiological tests have been used to study phenotypic cohesion of bacterial taxa (Schubert, 1968;Gavini et al., 1989;Mergaert et al., 1993;Brady et al., 2010aBrady et al., , 2013, but these have been mainly developed from a clinical diagnostics perspective (Konstantinidis and Tiedje, 2005a,b;Sutcliffe, 2015). Accordingly, the characters identified by these tests have limited application outside this environment (Sutcliffe et al., 2012;Sutcliffe, 2015). Other than revealing some basic physiological capabilities, these standard phenotypic tests are incapable of capturing the countless traits encoded on bacterial genomes. In addition, the very limited set of traits analyzed rarely differentiates between taxa, as the members of a taxon can show immense physiological variability. Therefore, characters that are biologically meaningful and that potentially define or distinguish higher-level bacterial groups and taxa would thus have to be sought through other means.
For identifying biological traits that are potentially taxondefining, whole genome sequences represent invaluable resources. A wealth of traits can be inferred from a bacterium's genome sequence by making use of bioinformatics approaches and databases built from experimental evidence. For example, metabolic and physiological networks or pathways can be inferred from gene sequences by making use of their homology to sequences in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016b). Each sequence in the KEGG database have an associated KEGG Orthology (KO) term, which is in turn coupled to proteins whose functions have been experimentally verified (Kanehisa et al., 2016b). In this workflow for inferring physiological properties, the database of functionally verified protein entries is often regarded as a significant limitation. This is because functional characterization of genes occurs at a much slower pace than gene discovery, thus making it impossible to functionally annotate certain genes (Linghu et al., 2008). As a result, taxa related to the model organisms typically have a higher number genes that can be functionally annotated because of the higher similarity between their genomes (Linghu et al., 2008). Despite this limitation, the current information would still provide valuable biological knowledge, especially as the information in these databases increase.
In this study, we explored the possibility of linking biological characters to higher-level taxonomic groups in bacteria by making use of whole genome sequence information. For this purpose we used the genus Pantoea for which genome sequences of 21 species were available (Hong et al., 2012;Kamber et al., 2012;Wan et al., 2015;Palmer et al., 2017). These were chosen to span the diversity of the genus, which includes plant pathogens (P. agglomerans, P. ananatis, P. stewartii amongst others; Coutinho and Venter, 2009) and species that affect humans (P. brenneri, P. conspicua, P. eucrina, and P. septica; Walterson and Stavrinides, 2015), as well as species that have been isolated from insects, fungal fruiting bodies and environmental samples (Walterson and Stavrinides, 2015;Ma et al., 2016;Palmer et al., 2016;Rong et al., 2016). Overall the members of this genus appear to be highly adaptable to changing environments and may act opportunistically when in contact with potential eukaryotic hosts (Coutinho and Venter, 2009;De Maayer et al., 2012bWalterson and Stavrinides, 2015). From a phylogenetic perspective, Pantoea and its sister taxon, Tatumella, are nested within the Enterobacteriaceae where they are closely related to Erwinia (Glaeser and Kämpfer, 2015;Palmer et al., 2017). Pantoea further separates into four well-supported lineages, viz. the P. agglomerans (containing P. agglomerans, P. anthophila, P. brenneri, P. conspicua, P. deleyi, P. eucalypti, and P. vagans), P. ananatis (containing P. allii, P. ananatis, P. stewartii ssp. stewartii and P. stewartii ssp. indologenes), P. rodasii (containing P. rodasii, P. rwandensis, and Pantoea sp. GM01) and P. dispersa (containing P. dispersa, P. eucrina and P. wallisii) lineages . Other than a limited set of general biological traits (e.g., colony and cell morphology, respiration status, growth temperature), characters that potentially define Pantoea and its lineages have never been identified.
The overall goal of this study was to link biological properties to the current evolutionary hypothesis of Pantoea , thus allowing the identification of phenotypic characters that potentially define the genus and its lineages. Our specific aims were three-fold. First, for each of Pantoea and Tatumella, we functionally compared their shared gene sets (i.e., in terms of the pathways and processes each gene is predicted to be involved in) to evaluate the feasibility of using whole genome sequences for identifying taxon-defining characters at ranks higher than the species level. Secondly, the shared gene sets in each of the four Pantoea lineages were functionally compared to identify characters associated with the specific evolutionary path of the lineage and that potentially contributed to its initial emergence or subsequent maintenance. Thirdly, the loci underlying these differential characters were further characterized in order to determine their gene composition and distribution among species and whether their conservation is maintained by purifying selection as suggested before (Fang et al., 2008;Sorrels et al., 2009). Broadly, our strategy (Figure 1) involved the identification of shared gene sets, followed by their functional annotation.

Genomes Analyzed
All genomes analyzed during this study are publicly available and accessible at the National Centre for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/). Whole genome sequence data for 21 species of Pantoea and three species of Tatumella (Tracz et al., 2015) were included in the analyses (Table 1). These Pantoea species span the current known phylogenetic and phenotypic diversity of the genus, with most representatives of all of the major lineages . For the intergeneric comparisons, all 24 genomes were utilized. For the intrageneric comparisons, we included only 17 Pantoea genomes. We excluded three of the known lineages of this genus as one contained only two species (i.e., Pantoea sp. At-9b and P. cypripedii LMG 2655 T ), while the other two are each represented by single species (i.e., Pantoea sp. A4 and P. septica LMG 5345 T ).

Generation of Shared Gene Sets
The shared gene sets of the two genera, as well as the different lineages within Pantoea (see Figure 1), were generated with the EDGAR (Efficient Database framework for comparative Genome Analyses using BLAST score Ratios) server (https://edgar.computational.bio.uni-giessen.de; Blom et al., 2016). For each gene set, a representative of the lineage/genus was used for downstream analyses. The representatives used for the different lineages were P. agglomerans R190 for the first lineage (encompassing P. agglomerans, P. eucalypti, P. vagans, P. deleyi, P. anthophila, P. brenneri, and P. conspicua), P. ananatis LMG 2665 T for the second lineage (comprising of P. ananatis, P. allii, P. stewartii subsp. stewartii, and P. stewartii subsp. indologenes), P. dispersa strain number to EGD-AAK13 for the third lineage (encompassing P. dispersa, P. eucrina, and P. wallisii) and P. rodasii LMG 26273 T for the fourth lineage FIGURE 1 | Experimental strategy followed for the lineages within Pantoea in the study. Lineages were identified from the subtree of Pantoea from the protein sequence maximum-likelihood tree of all the shared genes of Palmer et al. (2017). Average nucleotide identity (ANI) values were used as a measure of relatedness between species of a lineage as obtained from Palmer et al. (2017). Shared gene sets were determined from the genome sequences of species within each lineage. Gene sets were then annotated with the Kyoto Encyclopedia of Genes and Genomes (KEGG), followed by BLAST verification and locus comparisons of characterized genes. Uncharacterized genes were subjected to Blast2GO analyses. A similar strategy was followed for the generic comparisons with the exception of the locus comparisons.

Functional Annotation and Identification of Differentially Present Metabolic Pathways
Functional annotation of the different gene sets were first performed by orthology searches against the KEGG database (Kanehisa et al., 2016b) using GhostKOALA (KEGG Orthology and Links Annotation; Kanehisa et al., 2016a) for all gene sets. Genes with KO terms associated with them could be separated based on the functional role of the pathways to which they could be mapped. Specific pathways where differences were detected in the global maps were also considered for comparative purposes (Figure 1). For the Pantoea lineages, genes with no KO associations were then analyzed to assign putative functions using Blast2GO (Conesa et al., 2005;Götz et al., 2008). This was done by subjecting these genes to BLAST analyses for Gene Ontology (GO) associations using Blast2GO implemented in CLC Genomics Workbench (CLC Bio). In these analyses, assignment to more than one GO term per gene was allowed when functional annotation suggested that a gene product is involved in multiple processes. All Blast2GO analyses were initiated by BLAST searches against the RefSeq non-redundant protein database of NCBI followed by InterproScan (Jones et al., 2014) analyses to identify protein domains as a means of identifying putative functions. Genes remaining without annotation was again subjected to BLAST analyses against the non-redundant database on NCBI to determine the distribution of these genes across taxa.
Individual sets of reconstructed metabolic pathways obtained from the KEGG database were compared to identify differences between lineages and genera. This was done by assigning the set of KEGG pathway maps from each genus/lineage a unique color and then overlaying them onto each other for identifying differences (Figure 1). From these overview pathway maps, specific metabolic pathways were identified for further investigation in eight functional categories used in KEGG. These were carbohydrate, lipid, nucleotide, amino acid and energy metabolism, as well as genes involved in environmental information processing and the metabolism of co-factors, vitamins, and xenobiotics.
Multi-gene pathways that were differentially present or absent were identified from the full set of differences obtained from the KEGG pathway comparisons. This was done to limit the number of genes potentially identified as absent due to sequencing or Frontiers in Microbiology | www.frontiersin.org assembly errors and also aided in simplifying the overall analysis. For this purpose multi-gene pathways were defined as processes where more than one gene was required to complete a process. From these pathways, the absence of genes from the genomes included in the respective gene sets were verified using local BLAST (Altschul et al., 1990) analyses (tblastn). The genomic coordinates of these genes were then noted to identify clustered genes. The gene clusters were identified and visualized using Geneious 6.1.6 (Biomatters).
Sequences for complete clusters were subsequently extracted from genomes and aligned using the MAFFT 7.309 (Katoh and Standley, 2013) server. When more than three members of a lineage possessed the gene clusters, their sequence alignments were subjected to codon-based selection analyses in MEGA 6.0.6 (Tamura et al., 2013) using HyPhy (Pond and Muse, 2005), to obtain dN (proportion of non-synonymous substitutions) and dS (proportion of synonymous substitutions) values at all codon positions across the alignments. The normalized dN-dS values were then plotted against codon positions in Microsoft Excel 2013.

Generation of Shared Gene Sets
In this study, we identified a number of biologically informative characters for Pantoea and the four lineages examined. For analyses at both the inter-and intra-generic levels, comparable taxon sets were compiled based on phylogenetic relatedness and Average Nucleotide Identity values (ANI Konstantinidis and Tiedje, 2005b;see Palmer et al., 2017). At the intra-generic level, these sets were also comparable in the sizes of the gene sets (Figure 2), with the exception of the P. rodasii lineage. This larger gene set could be attributed to the large size of the genomes of the three current members in the lineage, in comparison to the members of other lineages. However, a large proportion of the genes (>25%) in the respective genomes were present in both Pantoea and Tatumella. For the Pantoea comparisons more than 30% of the genes in respective genomes were present in all taxa, with 55-75% associated with specific lineages and 25-45% apparently species-specific. Overall, the gene sets for the four lineages consisted of 2844 genes for the P. agglomerans lineage, 2924 genes for the P. ananatis lineage, 3599 genes for the P. rodasii lineage and 2872 genes for the P. dispersa lineage (Figure 3).
For the inter-generic comparisons, the Pantoea gene set (calculated from 21 genomes) consisted of 1862 genes. The Tatumella gene set consisted of 2196 genes (calculated from three genomes). This difference in the number of shared genes can most likely be attributed to the number of genomes analyzed in these genera, as the number of genomes available for Tatumella is underrepresented.

Functional Annotation of the Pantoea and Tatumella Gene Sets with KEGG
The number of genes with KO associations for Pantoea and Tatumella were 1,576 (84.6% of the Pantoea gene set) and 1,760 (80.1% of the Tatumella gene set), respectively (Supplementary File S1). In both cases, the highest number of genes was involved in "Genetic Information Processing", followed by "Environmental Information Processing", with "Unclassified" genes making up the third largest gene group.
The pathways in which we identified differences between Pantoea and Tatumella were "Metabolic pathways", "Biosynthesis of secondary metabolites", "Microbial metabolism in diverse environments", "Biosynthesis of antibiotics", "Carbon metabolism", "Biosynthesis of amino acids", as well as "2-Oxocarboxylic acid metabolism", and "Fatty acid metabolism". Comparison of the relevant global metabolic maps revealed a higher number of reactions predicted for Tatumella (as would be expected due to the higher number of shared genes), except for "Fatty acid metabolism". Closer inspection of the fatty acid metabolism pathways indicated the ability to perform β-oxidation of fatty acids occurred in Pantoea but not Tatumella.
A total of 124 differences were identified between Pantoea and Tatumella (Supplementary File S1). These consisted of reactions involved in all functional classes, namely "Carbohydrate metabolism" (citrate cycle, pentose phosphate pathway, fructose and mannose metabolism, ascorbate and aldarate metabolism, starch and sucrose metabolism, glyoxylate and dicarboxylate metabolism and inositol phosphate metabolism), "Energy metabolism" (including methane, sulfur and nitrogen metabolism), "Lipid metabolism" (including fatty acid degradation and sphingolipid metabolism), "Nucleotide metabolism" (purine and pyrimidine metabolism), "Amino acid metabolism" (cysteine and methionine metabolism, lysine degradation, arginine and proline metabolism, histidine metabolism and β-alanine metabolism), "Cofactor metabolism" (nicotinate and nicotinamide metabolism), "Xenobiotics metabolism" (benzoate degradation, chloroalkane and chloroalkene degradation) and "Environmental information processing" [ABC transporters, two-component systems, phosphotransferase systems (PTSs) and chemotaxis]. By limiting the pathways investigated to those where two or more genes are required to complete a pathway, 10 pathways (32 differences) were retained and subsequently absence was confirmed with BLAST analyses (Table 2, Supplementary File S1).
As suggested from the global maps, genes required for βoxidation of long-chain fatty acids ("Lipid metabolism"-fatty acid degradation) were present in all members of Pantoea and absent in all members of Tatumella. A number of genes involved in specific pathways in carbohydrate metabolism where detected only in Pantoea, namely the "Pentose phosphate pathway" (Dribose to D-ribose-1-P), "Fructose and mannose metabolism" (Dmannitol to β-D-fructose-6-P), "Starch and sucrose metabolism" (sucrose to ADP-glucose; glycogen to trehalose and amylose, respectively) and "Inositol phosphate metabolism" (myo-inositol to 2-deoxy-5-keto-D-gluconate-6-P). The "Energy metabolism" pathway with differences was sulfur metabolism, where Pantoea possessed genes required for the uptake of extracellular taurine and its subsequent conversion to sulfite. Pantoea also possessed genes required for the conversion of guanine to (S)-allantoin during "Purine metabolism". Several pathways involved in "Amino acid metabolism" were also present only in Pantoea, specifically those involved in arginine and proline metabolism FIGURE 2 | A bar graph indicating the number of genes for each isolate, separated into genes highly conserved in the sister genera (Pantoea + Tatumella), through to genes not shared by all closely related species or unique genes. The dendrogram was inferred for the different lineages and their relationships to each other from the amino acid based topology of the core genome of Palmer et al. (2017). The length of each bar is indicative of the size of each genome analyzed (in terms of the number of genes). The different lineages are indicated with colored blocks (blue-P. ananatis lineage; green-P. agglomerans lineage; yellow-P. rodasii lineage; red-P. dispersa lineage). All genomes analyzed encoded a similar number of genes, with the genome of P. rodasii encoding the highest number of genes (∼5,800) and P. eucrina encoding the least number of genes (∼3,800).
(creatine and N-carbamoylsarcosine to sarcosine) and histidine metabolism (L-histidinol to urocanate). In addition, Pantoea also possessed genes for the conversion of L-aspartate to nicotinate-D-ribonucleotide during "Cofactor metabolism" (nicotinate and nicotinamide metabolism).
We also observed differences for genes in the category "Environmental Information Processing". Amongst the ABC transporters, those for glutamine, glutathione and glycine betaine/proline transporters were found only in Tatumella, while those for osmoprotectant, taurine (also seen in sulfur metabolism), L-arabinose, and microcin C were only present in Pantoea. In terms of the phosphotransferase systems, Pantoea possessed genes necessary to transport and convert N-acetylmuramic acid to N-acetylmuramic acid-6-P, while Tatumella possessed genes required for the transport and conversion of N-acetyl-D-glucosamine to N-acetyl-D-glucosamine-6-P and arbutin/salicin to arbutin-6-P/salicin-6-P.

Functional Annotation of the Genes Shared by Lineages in Pantoea with KEGG
For the P. agglomerans lineage, the number of shared genes with KO associations resulted in 2148 genes (75.5% of the gene set). A total of 2197 genes (75.1% of the gene set) in the P. ananatis lineage could be annotated using KO terms (Supplementary File S2). The highest percentage of genes with associated KO terms were 75.9% (2181 genes) for the P. dispersa lineage, with the lowest being 71.9% for the P. rodasii lineage (2559 genes) (Supplementary File S2). In contrast to the inter-generic gene sets, the highest number of genes in all four lineages were involved in "Environmental Information Processing", followed by "Genetic Information Processing", with "Unclassified" genes again being the third most prevalent (Figure 3, Supplementary File S2).
Comparisons of global maps indicated differences between the four lineages in "Biosynthesis of amino acids", "Biosynthesis of antibiotics", "Biosynthesis of secondary metabolites", "Carbon metabolism", "Overview metabolism" and "Microbial metabolism in diverse environments" (Supplementary File S2). Limiting the reactions investigated to two or more genes acting together to complete a pathway, led to the identification of a number of reactions involved in "Polyketide sugar unit biosynthesis", "Biosynthesis of siderophore group non-ribosomal peptides", "Starch and sucrose metabolism", "Riboflavin metabolism", "Fructose and mannose metabolism", "Lysine degradation", "Chloroalkane and chloroalkene degradation", "Benzoate degradation", "Pentose and glucuronate interconversions" and "Cysteine and methionine metabolism" (Supplementary File S3). However, we excluded "Starch and sucrose metabolism" and "Riboflavin biosynthesis" after local BLAST analyses showed that homologs of the respective genes were detected in all taxa (Supplementary File S3). They were likely not recognized previously in our generation of the shared gene datasets with EDGAR's strict orthology estimation criteria. The genes involved in the remaining nine processes (two of which were involved in siderophore synthesis) were all found to be clustered and allowed comparison of the gene clusters across all taxa containing these genes ( Table 3).
The two genes identified (rfbC and rfbD) being involved in "Polyketide sugar unit biosynthesis" were present in all the members of the P. ananatis lineage, with various members of the other three lineages lacking the genes (P. eucalypti, P. brenneri, Pantoea sp. GM01 and P. wallisii; Table 3). Upon examination of the gene cluster containing these genes, two different loci were identified (Figure 4). The first locus was observed in nearly all members of the genus that possessed these genes (including P. dispersa and P. stewartii subsp. indologenes), while the second locus (lacking rfbB) was found only in P. ananatis, P. stewartii subsp. stewartii, P. dispersa and a partial locus in P. stewartii subsp. indologenes. The first locus was also slightly different in P. deleyi and P. eucrina, as the position of rfbC in P. deleyi differed (Figure 4-gene indicated in a darker shade) and the locus of P. eucrina contained an additional three genes in comparison to the other taxa (Figure 4). Furthermore,  Gowrishankar, 1989;Stirling et al., 1989;Kempf and Bremer, 1995;Ko and Smith, 1999 Glutathione ABC Transporter gsiABCD (yliABCD) • Transports glutathione from outside environment into cell through the use of ATP  Kempf and Bremer, 1995;Kappes et al., 1999 (Continued) Frontiers in Microbiology | www.frontiersin.org  Scripture et al., 1987;Kehres and Hogg, 1992;Schneider, 2001 Microcin C ABC Transporter yejABEF • Transports microcin C from outside environment into cell through the use of ATP • Results in susceptibility to antimicrobial microcin C Pantoea (21/21) Tatumella (0/3) Vanneste et al., 2001;Novikova et al., 2007;Metlitskaya et al., 2009 N-acetylmuramic acid PTS selection analyses indicated purifying selection for rfbA and rfbB and diversifying selection for rfbC and rfbD in the first locus (Figure 4, Supplementary File S4). Contrary to this, both rfbC and rfbD were under purifying selection in the second locus, with rfbA being under mainly purifying selection for the first part and diversifying selection for the second part of the gene (Figure 4,  Supplementary File S4). The genes involved in "Biosynthesis of siderophore group non-ribosomal peptides" and "Lysine degradation" both encoded for different iron acquisitioning molecules (siderophores) ( Table 3). The genes present in most of the members in the genus ("Biosynthesis of siderophore group non-ribosomal peptides") were identified as being required for the production of enterobactin. The majority of the gene cluster encoding enterobactin appeared to have evolved under purifying selection, with only some regions that evolved mainly under diversifying selection (for example see Figure 5 entB; Supplementary File S4). Conversely, the genes involved in "Lysine degradation" in the P. ananatis lineage were those required to produce aerobactin from lysine. All members of the P. ananatis lineage lacked the genes involved in enterobactin biosynthesis, but contained the genes required for aerobactin synthesis, while all other members of the genus lacked the genes required for aerobactin biosynthesis (Figure 5). Selection analyses amongst the genes encoding for aerobactin biosynthesis indicated purifying selection in particular for iucA and iucB. Both these loci were absent from P. eucalypti, P. deleyi and P. eucrina.
The differentially present genes associated with "Fructose and mannose metabolism" consisted of rhaA, rhaB, and rhaD which convert L-rhamnose to glycerone-P and S-lactaldehyde (rhaD) ( Table 3). This cluster was present in all the members of the P. agglomerans and P. rodasii lineages, but present only in P. allii and P. ananatis in the P. ananatis lineage, and P. dispersa and P. wallisii in the P. dispersa lineage (Supplementary File S4). From the selection analyses of the P. agglomerans and P. rodasii lineages it was observed that rhaB and rhaD evolved under purifying selection, with rhaA evolving under diversifying selection (Supplementary File S4).
Our analysis showed that for "Chloroalkane and chloroalkene degradation", the specific pathway was absent from all the members of the P. dispersa lineage ( Table 3). This pathway Polyketide sugar unit biosynthesis dTDP-4-oxo-6-deoxy-D-glucose <-> dTDP-4-dehydro-beta-L-rhamnose Biosynthesis of siderophore group non-ribosomal peptides 2,3-Dihydroxybenzoate <-> Enterochelin  catalyzes the conversion of chloroacetaldehyde to glycolate and hydrochloric acid. All other members of the genus possessed the genes required for this process (Supplementary File S4).
The differentially present genes associated with "Benzoate degradation" were involved in the utilization of protochatechuate. They were present only in the P. ananatis lineage. However, closer examination revealed that most members of the lineage contained a cluster of 9 genes (pcaH, pcaG, pcaQ, pcaL, pcaB, KAT, pcaJ, pcaI, and pcaR), but that it contained a deletion in P. stewartii subsp. stewartii which truncated pcaL and removed pcaB and KAT from the cluster. Overall, the cluster appeared to be under purifying selection (Supplementary File S4).
All members of the genus, except P. eucrina, possessed a gene cluster (uxaA, uxaB, and uxaC) involved in "Pentose and glucuronate interconversions" (Table 3, Supplementary File S4). The products of uxaA and uxaB catalyze, respectively, the reversible conversion of 2-dehydro-3-deoxy-D-gluconate to D-altronate and D-altronate to D-tagaturonate, while uxaC facilitates interconversions between D-tagaturonate and Dglucuronate and between D-fructuronate and D-galacturonate. These three genes were conserved within the P. agglomerans, P. ananatis, and P. rodasii lineages, with only uxaA and uxaB being present in P. dispersa and P. wallisii. Overall, it appeared that these genes were evolving under neutral selection (Supplementary File S4). Comparison of the processes involved in "Cysteine and methionine metabolism" showed differences in the synthesis of spermidine (speD and speE) and the methionine salvage pathway (mtnA, mtnB, mtnC, mtnD, and mtnK) ( Table 3, Supplementary Files S3, S4). The two genes required for the biosynthesis of spermidine allow for the conversion of S-adenosyl-L-methionine and putrescine to 5 ′ -methylthioadenosine and spermidine (speE). These two genes were present in all members of the genus except Pantoea sp. GM01 (P. rodasii lineage), P. eucrina and P. wallisii (both from the P. dispersa lineage; Supplementary File S4). Genes involved in the methionine salvage pathway allow conversion of 5-methylthio-D-ribose to 3-(methylthio)-propanoate through 5 different intermediate reactions (Supplementary File S3). These methionine salvage pathway genes were absent in the P. dispersa lineage (P. dispersa, P. eucrina, and P. wallisii) (Supplementary File S4).
We also found several multi-gene systems for two-component systems (2 systems), ABC Transporters (14 systems) and PTSs (2 systems) that were differentially present within these lineages (Supplementary File S5). Local BLAST analyses allowed identification of taxa where these genes were indeed present, despite not being conserved within the specific gene sets (Figure 6, Supplementary File S5). The two-component systems identified were that for citrate as well as nitrate/nitrite uptake. The ABC transporters identified were the systems for nitrate/nitrite/cyanate, HMP/FAMP, spermidine/putrescine, putrescine, maltose/maltodextrin, D-xylose, myoinositol-1phophate, phosphonate, glutamine, arginine, urea, glutathione and iron(II)/manganese. The PTSs detected were those for cellobiose and L-ascorbate.

Annotation of Lineage-Specific Genes without KEGG Associations
A total of 264 genes were identified as being differentially present in the Pantoea lineages, to which no KO term assignment could be made. This set of uncharacterized genes consisted of 62 genes in the P. ananatis lineage, 75 genes in the P. agglomerans FIGURE 5 | The gene clusters involved in "Lysine degradation" and "Biosynthesis of siderophore group non-ribosomal peptides". The dendrogram was inferred from the species tree of Palmer et al. (2017). Lineages are indicated with colored blocks. Both these clusters encode for the biosynthesis of siderophores, namely aerobactin and enterobactin, respectively. The locus required for the production of aerobactin was conserved in members of the P. ananatis lineage, while the locus required for enterobactin biosynthesis was present in most other members of Pantoea. The enterobactin biosynthesis locus was completely absent from the genomes of the members of the P. ananatis lineage, while the aerobactin locus was lacking in all other members of Pantoea. As an indication of selective pressures on the loci, the normalized dN-dS value at each codon position was plotted across the clusters.
lineage, 98 genes in the P. rodasii lineage and 29 genes in the P. dispersa lineage. Analysis of these genes with Blast2GO allowed annotation of 182 genes. A further six genes could be assigned GO terms, but could not be fully annotated upon merging of annotations due to a lack in InterProScan hits. A total of 76 genes had no functional associations. These unannotated genes could however be used for blastp analyses to identify potential sources of horizontally acquired genes.
The 62 genes of the P. ananatis lineage were subjected to Blast2GO analyses, leading to the annotation of 38 genes. In terms of biological processes (GO Level 3), the highest number of genes were involved in "cellular metabolic processes", "primary metabolic processes" and "organic substance metabolic processes", followed by "regulation of cellular processes" and "nitrogen compound metabolic processes" (Figure 7). This lineage thus contained 24 genes present in all members of the lineage, without KEGG or GO functional annotations. Based on blastp hits, 15 of the 24 genes had their closest homologs within other members of the Enterobacteriaceae, while two genes had homologs in members of the Rhizobiaceae (α-Proteobacteria). The closest homolog for three genes was respectively from the Aurantimonadaceae (Martelella mediterranea; α-Proteobacteria), Corchorus olitorius (bush okra; Malvaceae, Eudicots), and Erwinia phage ENT90. The remaining four genes had no BLAST hits (blastp) on the non-redundant database (Supplementary File S6). FIGURE 6 | Differences between the lineages in processes involved in "Environmental Information Processing". The presence (+) or absence (−) of complete (all genes required for functional system) two component systems, ABC transporters and PTSs in the genomes of the species in the main lineages within Pantoea. The dendrogram of the relationships within and between lineages were inferred from Palmer et al. (2017). The separate lineages are indicated with colored blocks.
Of the 75 genes conserved within the P. agglomerans lineage not annotated with KEGG, 54 genes could be annotated with Blast2GO. Most of these genes were involved, in descending order, in "cellular metabolic processes", "organic substance metabolic processes", "establishment of localization", "primary metabolic processes" and "biosynthetic processes" (BP GO Level 3; Figure 7). The remaining 21 genes with no associated GO terms could not be functionally classified. Homologs of these genes were however, identified in other members of the Enterobacteriaceae, often pathogens, with a single gene having its closest homolog in the metagenome of a soil sample from an unknown source (Supplementary File S6).
Of the 98 genes conserved within the P. rodasii lineage without KEGG annotations, 76 could be annotated with Blast2GO. The five highest biological processes in which these genes were involved were "organic substance metabolic processes", "cellular metabolic processes", "primary metabolic processes", "regulation of cellular processes" and "nitrogen compound metabolic processes" (GO Level 3; Figure 7). This resulted in 22 genes without any functional annotation with either KEGG or GO analyses. Homologs for all 22 genes were identified in other members of the Enterobacteriaceae, of which 21 genes were most closely related to genes from human pathogens (Supplementary File S6).
Of the 29 unique genes in the P. dispersa lineage without any KEGG annotations, 14 genes could be annotated with Blast2GO. These genes were primarily involved in "cellular metabolic processes", "organic substance metabolic processes", "primary metabolic processes", "nitrogen compound metabolic processes" and "biosynthetic processes" (Figure 7). Of the 15 unannotated genes, homologs for all genes were identified in other members of the Enterobacteriaceae, particularly those associated with the stinkbug, Plautia stali (Supplementary File S6).

DISCUSSION
Our findings suggest that in silico mining of bacterial genome sequences is a feasible approach for inferring large sets of biological characters for particular taxa. This approach is invaluable for unveiling large repertoires of potential bacterial phenotypes and can thus contribute hugely toward identifying biologically relevant diagnostic characteristics from whole genome sequences. Furthermore, by superimposing such characters onto the phylogeny of a particular bacterial group it appears to be possible to identify those traits that might have contributed toward the initial emergence of a taxon and/or its subsequent stable persistence in nature.
Here we identified extensive sets of biological characters specific to Pantoea and its main phylogenetic lineages. Our study thus outlines the initial steps toward linking biological functions (based on the variable genomic components) to taxonomy (based on the stable, conserved genomic components).

Genome-Based Comparisons of Specific Processes between Genera
The methodology employed in this study allowed for the identification of biological characters that potentially define and differentiate bacterial genera from one another. Despite the necessity of these taxonomic ranks, our understanding of what constitutes and distinguishes genera is mostly limited. Previous attempts to obtain natural and logical groupings have always been based on a limited view of the organisms' metabolic potential, often with a focus on what was considered to be clinically relevant data rather than from a biological outlook (Konstantinidis and Tiedje, 2005a,b;Sutcliffe, 2015). Although, the current classification system aims to identify and describe naturally occurring groups by employing an evolution-based approach, it still does not provide any biologically meaningful information for the organisms (Cohan, 2002;Konstantinidis and Tiedje, 2005a;Tindall et al., 2010). However, our study of Pantoea and Tatumella clearly highlights how diverse sets of biological characters for bacterial genera may be inferred from genome data. Apart from so-called genus-defining traits that can potentially be used to differentiate these taxa, these characters also provide information on the general biology of the taxa investigated (see Table 3). Our findings indicate that such genome-based analyses provide a more informed view of the biology of the organisms, and the information emerging from comparing metabolic differences can be linked to the shared ancestry of groups of organisms.
Taken together, these findings suggest that evolution has equipped Pantoea with extensive repertoires of metabolic processes that make them generally more versatile in their ability to adapt to changing environments. Compared to Tatumella, they can utilize a wider range of carbon sources and use available resources more efficiently by recycling metabolic byproducts. This potentially also provides them with a competitive advantage in nutrient-poor environments such as mammalian blood. The various genus-defining traits we identified for Pantoea may thus contribute to our understanding of the complex, and often opportunistic, relationships these species have with their plant and animal hosts (De Baere et al., 2004;Cruz et al., 2007;De Maayer et al., 2012a).

Genome-Based Comparisons of Specific Processes between Lineages of Pantoea
Comparisons of the metabolic processes inferred from whole genome sequences allowed for the identification of various sets of traits specific to one or more lineages of Pantoea. Based on previous work in diverse bacteria (including Pantoea), we attempted to correlate these processes to the lifestyles of the taxa investigated. Although a number of the identified processes were likely related to pathogenicity (see Table 4), most probably play roles in niche adaptation and utilization in a non-pathogenic capacity (see Table 5).
The processes likely associated with pathogenesis, particularly in the Enterobacteriaceae, were those involved in O-antigen (Stevenson et al., 1994;Whitfield, 1995;Wang and Reeves, 1998;Kohchi et al., 2006;Greenfield and Whitfield, 2012), siderophore (Montgomerie et al., 1984;Williams and Carbonetti, 1986;Opal et al., 1990;Fecteau et al., 2001;Fiedler et al., 2001;Torres et al., 2001;Hubertus et al., 2003;Raymond et al., 2003;Garcia et al., 2011;Gao et al., 2012) and polyamine (Khan et al., 1992;Ha et al., 1998;Gugliucci and Menini, 2003;Shah and Swiatlo, 2008;Pegg, 2016) biosynthesis. Differences in the locus involved in Oantigen biosynthesis (of which some Pantoea species have two) were previously associated with a pathogen's ability to escape host responses (Whitfield, 1995;Kohchi et al., 2006;Greenfield and Whitfield, 2012). Our results further showed that all species in the P. ananatis lineage likely produce aerobactin, while many of those in the other lineages produce enterobactin. Although these siderophores essentially perform the same function, aerobactin is more efficient at scavenging iron during nutrient limitation, and may in some instances even assist in resistance against iron-dependent antimicrobials (Williams and Carbonetti, 1986;Torres et al., 2001;Garcia et al., 2011). Also, all of the examined species in the P. agglomerans and P. ananatis lineages are predicted to be capable of producing the polyamine spermidine (this process was detected in only some of the species in the other two lineages). Apart from their essential cellular functions (Ha et al., 1998;Shah and Swiatlo, 2008;Pegg, 2016), polyamines have been implicated in biofilm formation, escape from host phagolysosomes and toxin production and activity (Shah and Swiatlo, 2008).

(Supplementary
Overall, we could correlate the versatility in lifestyle and host to increased metabolic potential (in terms of compounds utilized) as well as pathogen-associated traits between the different lineages within Pantoea. Despite the association of various Pantoea species with only clinical infections (P. brenneri, P. conspicua, and P. eucrina), it appears that species associated with opportunistic clinical infections (De Baere et al., 2004;Cruz et al., 2007), possessed genes often associated with animal pathogenicity within the Enterobacteriaceae. In general, it appears that the lineages with the most diverse niche associated processes also corresponded to those species isolated from the most diverse environments. For example, members of the P. agglomerans and P. ananatis lineages are routinely isolated from various plant species, as epiphytes, endophytes, or pathogens, as well as from insects, animals and humans (Coutinho and Venter, 2009;Walterson and Stavrinides, 2015), while members of the P. dispersa and P. rodasii lineages are usually only associated with a single host organism, with the exception of P. dispersa. These characteristics may contribute to the opportunistic nature of lineages containing species like P. ananatis and P. agglomerans that are proven plant pathogens, but isolated from diverse environments including the clinical setting.

Evolution of Multi-gene Pathways in Pantoea and Its Lineages
Various evolutionary mechanisms likely shaped the presence and distribution of the multi-gene pathways inferred for Pantoea. Bacteria propagate asexually and progeny are anticipated to contain the same genetic material as the parent (Daubin et al., 2002). Any changes in an individual's genetic makeup can become fixed in populations if they provide a competitive advantage, or at least have no deleterious effects (Cohan, 2001(Cohan, , 2005Caro-Quintero and Konstantinidis, 2012). The most common forces facilitating genetic change are random mutations (point mutations as well as insertions and deletions) and horizontal gene transfer (HGT) (Gogarten et al., 2002;Cohen et al., 2011). Accordingly, more closely related species are likely to encode similar pathways, while those subject to HGT would have a more spurious distribution (Gogarten et al., 2002;Cohen et al., 2011).
Evolution of most of the multi-gene pathways in Pantoea and its lineages involved complex processes, involving vertical descent with lineage-and species-specific gene losses/gains via duplication or HGT. For example, lineage-and species-specific gene losses would be characterized by distribution patterns where particular gene clusters are present in all taxa neighboring a species lacking it, because of gene losses at a specific ancestral node (Koskiniemi et al., 2012). In our study, such processes were likely involved in the loci required for the conversion of chloroacetaldehyde to glycolate and in the methionine salvage pathway. In contrast to this, the sudden appearance of genes or loci in a lineage that are lacking in all neighboring taxa suggest they were acquired via HGT (Zaneveld et al., 2008). An example of this is the locus required for protochatechuate utilization. The evolution of the siderophore loci likely involved gene losses together with the horizontal acquisition of genes. This is evident from the absence of the enterobactin locus (despite its presence in all closely related lineages) from the P. ananatis lineage, and the presence of the locus encoding aerobactin in only this lineage, thus acquired through an HGT event.
Our results suggested that the gene clusters encoding the various Pantoea pathways examined in this study, mainly experience purifying selection. Numerous hypotheses attempt to explain why genes cluster and how clusters are maintained (Carbone et al., 2007;Geddy and Brown, 2007;Fang et al., 2008;Sorrels et al., 2009). However, various studies have showed that purifying selection may contribute to the stability and functionality of gene clusters once they have formed (Carbone et al., 2007;Geddy and Brown, 2007;Fang et al., 2008;Sorrels et al., 2009). Purifying selection seems to facilitate the maintenance of ancient or ancestral gene clusters by limiting the possibility of non-synonymous mutations becoming fixed, which in turn allows out-competition of individuals undergoing deleterious or lethal mutations inhibiting the functioning of the gene cluster (Fang et al., 2008;Sorrels et al., 2009). In our study, purifying selection appeared to play a role in the maintenance and functioning of the loci for siderophore biosynthesis (aerobactin and enterobactin loci) and protochatechuate utilization, which are evolving mainly under the influence of purifying selection. However, we also detected diversifying selection for some of the genes/gene regions examined [e.g., rfbC and rfbD (rfb locus 1), both associated with the conversion of dTDP-6-deoxy-D-xylo-4-hexulose to dTDP-Lrhamnose (Graninger et al., 1999)]. In these cases, selection is causing non-synonymous changes in the sequences of genes or parts of genes, thus driving the appearance of new alleles.

In Silico Predictions vs. Experimentally Confirmed Multi-gene Processes
Although a large number of differential characters were identified from the genome sequences of these organisms, little has been done in terms of experimental verification. Also, a number of characters could not be correlated to current experimental knowledge as false negative and positive results for phenotypic tests are common (Sutcliffe et al., 2012). For instance, despite the presence of genes required for the utilization of arbutin and salicin as carbon source in Tatumella, previous phenotypic tests have previously tested negative (Brady et al., 2010b;Tracz et al., 2015). This was also observed for the intra-generic comparisons. Examples within Pantoea included the uptake and utilization of cellobiose (Brady et al., 2009) and D-galacturonate, previously identified as a genus-defining attribute (Gavini et al., 1989;Brady et al., 2010a), as well as citrate (Brady et al., 2010a(Brady et al., , 2012. In these cases, phenotypic tests previously confirmed the ability to perform these functions, but these genes were lacking in various Pantoea species (absence confirmed in available genome sequences of additional isolates of these species). This lack in correlation may either be as a result of gene expression complexities during phenotypic tests, to sequencing, assembly or annotation errors or the presence of as of yet uncharacterized alternate pathways.
There were, however, also a number of in silico functional predictions that correlated well with the results of previous phenotypic tests. For example, the lack of genes required for the utilization of histidine in Tatumella, correlated entirely with negative results of all previous phenotypic tests (Brady et al., 2010b;Tracz et al., 2015). Moreover, the utilization of protochatechuate has previously tested positive in P. allii (Brady et al., 2011), and our findings showed that the locus encoding the necessary gene products is indeed present within this lineage. Other examples of characters supporting previously performed phenotypic tests are the transport for the utilization of Dxylose, maltose, myo-inositol-1-P, and sucrose. In all of the above mentioned examples gene clusters were only observed in taxa previously testing positive for the associated phenotypic characters (Brady et al., 2010a(Brady et al., ,b, 2011.

Perspectives and Relevance
The increase in data obtained from bacterial genome sequences has superseded the rate at which gene discovery, characterization, and verification can occur. This means that a number of genes could not be assigned definitive functions as no homologs were detected for these genes in the KEGG database. The majority of these genes, at both the generic and lineage-specific level, could be annotated with gene ontologies with Blast2GO, although the functions of some genes remained unknown. Of the uncharacterized genes, the most abundant GO terms were generally quite similar (data for genera not shown). This indicates that the unknown genes could be performing similar functions within the different lineages or genera, despite not being orthologous amongst the different gene sets. The genes to which no functional annotation could be assigned seemed to have originated mostly within the Enterobacteriaceae, either acquired through lateral acquisition of genes from other taxa within this family, or, in the case where genes are conserved in most of the lineages, acquired by an ancestor and subsequently lost in some lineages when the genes were no longer required. A number of unique genes were also identified with no known homologs, but expression of these genes should be confirmed before they are considered for further investigation. Although no functional information is available for the unannotated genes, the distribution across various taxa provides insight into potential HGT events.
The availability of whole genome sequence data has transformed the approaches available for understanding bacterial evolution. This has, however, not yet replaced traditional methods such as DNA-DNA hybridization, monophyly in phylogenetic analyses and physiological and metabolic tests for the delineation of bacterial taxa. Although physiologic capabilities provide an insight into what potential metabolic differentiation may have occurred during speciation, inconsistencies may still occur due to differential expression of genes in different isolates or regulation of expression in specific environments. In recent years, we have been moving toward approaches aiming to identify natural groups at higher taxonomic levels by implementing monophyly as a prerequisite for taxon descriptions, but no light can be shed on the biology of the organisms through these approaches. Instead we need to investigate the more variable genomic compartments reflecting the biology of the organisms to obtain a more natural and robust taxonomic system. By employing this approach, one would be able to supplement or supplant the available diagnostic characters used in bacterial taxonomy. Additionally, obtaining a holistic biological perspective from the genome will provide power to predict the lifestyle and ecology of the organisms and is essentially much more informative than only having discriminative power between taxa. We thus believe that this approach of identifying genome-based characteristics in metabolic networks for the taxonomic levels higher than the species, provide an approach of identifying biologically relevant differences along the course of speciation.