Evolutionary History of Bacteriophages in the Genus Paraburkholderia

The genus Paraburkholderia encompasses mostly environmental isolates with diverse predicted lifestyles. Genome analyses have shown that bacteriophages form a considerable portion of some Paraburkholderia genomes. Here, we analyzed the evolutionary history of prophages across all Paraburkholderia spp. Specifically, we investigated to what extent the presence of prophages and their distribution affect the diversity/diversification of Paraburkholderia spp., as well as to what extent phages coevolved with their respective hosts. Particular attention was given to the presence of CRISPR-Cas arrays as a reflection of past interactions with phages. We thus analyzed 36 genomes of Paraburkholderia spp., including those of 11 new strains, next to those of three Burkholderia species. Most genomes were found to contain at least one full prophage sequence. The highest number was found in Paraburkholderia sp. strain MF2-27; the nine prophages found amount to up to 4% of its genome. Among all prophages, potential moron genes (e.g., DNA adenine methylase) were found that might be advantageous for host cell fitness. Co-phylogenetic analyses indicated the existence of complex evolutionary scenarios between the different Paraburkholderia hosts and their prophages, including short-term co-speciation, duplication, host-switching and phage loss events. Analysis of the CRISPR-Cas systems showed a record of diverse, potentially recent, phage infections. We conclude that, overall, different phages have interacted in diverse ways with their Paraburkholderia hosts over evolutionary time.


INTRODUCTION
The interaction between bacterial hosts and bacteriophages (phages) has been intensively studied (reviewed in Salmond and Fineran, 2015). A known consequence of such interaction, which is mainly driven by lysis (fitness) pressure from phages, is bacterial diversification . This diversification is the result of an evolutionary arms race, where bacteria and phages constantly develop new attack-defense strategies to impede partner's mechanisms (Stern and Sorek, 2011;Wang et al., 2016).
Since their discovery, phages have fundamentally changed our traditional view-from a simple parasitic interaction to co-evolution dynamics-of bacterial hosts and phages Obeng et al., 2016). As the most abundant entities in the biosphere, phages commonly outnumber bacteria by at least one order of magnitude; they are estimated to infect about 10 23 to 10 25 bacterial cells every second in ocean ecosystems (Keen et al., 2017). Considerable numbers of phages have been shown to be present in bacterial genomes. In fact, integrated phages (prophages) are at the heart of bacterial diversification processes, e.g., in Escherichia coli (Lawrence and Ochman, 1998;Ohnishi et al., 2001;Touchon et al., 2016), Streptococcus agalactiae. S. pyogenes, Salmonella sp., Listeria innocua, and L. monocytogenes . Phages play essential roles in the life of their hosts, from the individual to the population level. For instance, the evolution of bacterial pathogenicity , human health (Manrique et al., 2016), and global nutrient cycling in ocean ecosystems are all affected by phage activities (Roux et al., 2016).
Phylogenetic approaches (in particular co-phylogenetic analyses), have been used to answer questions with respect to the co-evolution of tightly associated members of a community, such as viruses and their hosts (Geoghegan et al., 2017). Given the evolutionary timeline of these relationships, it is expected that congruence, or phylogenetic similarity, is detected from both partners. Congruence is unlikely to occur as a process of simple co-speciation (the process of speciation of one species in response to another one). It is entangled with other evolutionary mechanisms, such as duplications, host-switching, losses and failure to diverge (Conow et al., 2010;Hutchinson et al., 2017). To unravel the co-evolutionary scenario between prophages and their Paraburkholderia hosts, two approaches have been applied. First, global-fit/distance-based approaches address the congruence between the phylogenies of the associated organisms and evaluate the dependency of the phage phylogeny upon the host's tree (Hutchinson et al., 2017). The second approach is an event-based approach. This approach considers, for example, duplication, host-switching, and losses, in order to assess the co-evolutionary events (Conow et al., 2010).
Despite offering, in some cases, fitness advantages to their bacterial hosts, phages often provide a "burden" to host functioning that may lead to host cell death by lysis. Clustered regularly interspaced short palindromic repeats (CRISPRs) and their associated proteins (Cas) provide bacteria with protection against invading genetic elements such as phages and plasmids (Makarova et al., 2015). The CRISPR-Cas system is able to acquire short (26-72 bp) sequences of foreign DNA, named proto-spacers, and flank these sequences with proto-spaceradjacent motifs (PAMs) to make spacers, integrating these into so-called CRISPR arrays (Makarova et al., 2015). The CRISPRencoded RNA then guides complexes of Cas proteins, which recognize and cleave incoming foreign genetic material at specific sites, preventing further infection. Thus, CRISPR spacers are protective "immune" functions, that can provide insight into the history of bacterial host/phage interplays (Sun et al., 2015). Such interplays spur the diversity of phages (Shmakov et al., 2017), as shown by analyses of the sequences of CRISPR spacers that have little or no homology to any known sequences (Edwards et al., 2016).
Prophages can make up to about 20-30% of the size of bacterial genomes (Casjens, 2003). A previous study has shown the presence of inducible prophage sequences in the genome of the fungal-interactive Paraburkholderia terrae strain BS437 . Paraburkholderia species inhabit the mycosphere (the soil surrounding fungal hyphae), where frequent exchange of genes across the local microbes is possible (Haq et al., 2014;Zhang et al., 2014). We hypothesized that, by analyzing the presence of phages and CRISPR-Cas systems (especially CRISPR spacers) in the genomes of Paraburkholderia spp., we will unearth the evolutionary record of recent phage infections and shed light on the dynamic arms race interaction between the bacterial host and its phages. Here, we address the following key questions: to what extent does the presence of prophages and their distribution affect the diversity and diversification of Paraburkholderia spp.? To what extent did coevolution occur between these partners? And does the presence of CRISPR arrays in bacterial genomes reflect this interaction in natural systems?

Bacterial Growth Conditions, Genome Sequencing, and Assembly
The 12 newly-sequenced Paraburkholderia sp. genomes (P. terrae strains BS001, BS007, BS110, BS437, and DSM 17804 T ; P. phytofirmans strains BS455, PsJN, BIFAS53, and J1U5; P. hospita DSM 17164 T , P. caribensis DSM 13236 T , and Paraburkholderia sp. MF2-27) were used. Strains BS001, BS007, BS110, BS437, BS455, BIFAS53, and J1U5 have been previously isolated in our group (Warmink et al., 2011;Nazir et al., 2012;. All strains were grown aerobically in Luria-Bertani (LB) broth at 28 • C (180 rpm, shaking, overnight). The genomic DNA of the overnight cultures was then extracted using a modified (UltraClean) DNA isolation kit (MOBio Laboratories Inc., Carlsbad, CA, USA). The modification consisted of adding glass beads to the cultures in order to spur mechanical cell lysis. The extraction method is a rapid way to produce highly pure DNA from bacterial cultures. The extracted DNAs were purified with the Wizard DNA cleanup system (Promega, Madison, USA). The quality and quantity of the extracted DNAs were assessed using electrophoresis in 1% agarose gels.

Bacterial Genome Data Retrieving
Initially, we entered the Burkholderia database (http://www. burkholderia.com/strain/download, last accessed in March 2017), yielding a total of 1,185 strains (containing 123 complete genomes and 1,062 drafts of Burkholderia genomes). We then selected the recently-named genus Paraburkholderia, primarily containing 62 environmental species that are non-pathogenic (Sawana et al., 2014). Among the selected genomes, we found 24 species with complete genomes in the database. We included three outgroup species (i.e., Burkholderia glumae, B. cenocepacia, and B. pseudomallei) in the initial prophage identification analysis. A total of 36 genomes of Paraburkholderia spp. and three genomes of Burkholderia were thus used in this study (see Table 1 and Figure 1). The predicted habitats ("P"-plant-associated and "N-P"-Non-plant-associated) of the Paraburkholderia species were taken from literature data and then crossedchecked using the GOLD database, version October 2017.
Here, what we call plant-associated vs. non-plant-associated FIGURE 1 | Phylogeny of the 36 Paraburkholderia and three Burkholderia species used in this study. The 16S rRNA genes of the 39 bacteria were aligned using the SINA (Pruesse et al., 2007) and MAFFT (Katoh et al., 2002) alignment tools. The alignment was edited using Gblocks (Talavera et al., 2007). A maximum likelihood based phylogenetic tree was then constructed using FastTree (Price et al., 2009) and the tree (midpoint-rooting) was visualized using iTOL (Letunic and Bork, 2016). Gray circles represent "plant-associated" Paraburkholderia species, while white circles represent "non-plant-associated" Paraburkholderia species. See section Materials and Methods for explanation of plant-vs. non-plant-association.
(including fungal-interactive) Paraburkholderia species might not strongly reflect the true nature of these species, as some fungi can occur in the rhizosphere and so also be plant-associated.

Phylogenetic Analysis and Genome Comparisons
Prophage phylogenetic trees were generated using selected concatenated phage signature genes (i.e., capsid, portal, tail tape, and terminase), next to the individual phylogenies of those genes. The predicted proteins were aligned with MUSCLE (Edgar, 2004). The 16S rRNA genes of the Paraburkholderia strains were used to align with SINA (Pruesse et al., 2007) and MAFFT alignment tools (Katoh et al., 2002). Both phage and host gene alignments were edited using Gblocks (Talavera et al., 2007), with default parameters. Then, maximum likelihood phylogenetic trees were constructed using FastTree (Price et al., 2009) and these (midpoint-rooting) were visualized using iTOL (Letunic and Bork, 2016). Furthermore, genome comparison percentages were generated using BLAST + 2.4.0 (tBLASTx with cut-off value 10 −3 ) and map comparison figures created with Easyfig (Sullivan et al., 2011).

Detection of Prophage and CRISPR-Cas Arrays (Spacers)
Prophage regions were detected in the bacterial genomes using PHAST (Zhou et al., 2011). PHAST uses current publicly available viral databases, such as "NCBI phages and viruses, " to identify prophage position, length, boundaries, number of genes and attachment sites, such as tRNA sites. The completeness of the identified prophage regions was determined based on scores that consider three scenarios: (i) complete prophage regions contain only genes for known phage proteins and the joint proteins conceptually allow building a functional phage, (ii) incomplete prophage regions are defined as having >50% genes for proteins that are predicted to be of prophage nature, and (iii) questionable prophage regions are defined as having <50% genes for proteins predicted to be of prophage nature. Three other criteria were used to further define prophage regions, those are (i) regions with sizes below 10 Kb, (ii) regions lacking genes for phage core/hallmark proteins (e.g., tail protein, capsid/head protein, terminase and integrase) and (iii) regions with > 25% insertion sequence (IS) elements (that is, short DNA sequences that act as simple transposable elements) were discarded (Bobay et al., 2013). The resulting manually curated prophages, with genes for structural proteins, replication, recombination, and lysis proteins, were further analyzed.
The analyses of CRISPR-Cas arrays (repeat and spacers) were performed on the genomes of all known Paraburkholderia spp. from CRISPRdb (Grissa et al., 2008), which were downloaded to our local server. CRISPRFinder (http://crispr.i2bc.paris-saclay. fr/Server/) was then used to identify any spacers from bacterial genomes that are not in the database. Thus, Cas proteins were annotated using CRISPRone (Zhang and Ye, 2017). Manual readings were applied to the identified CRISPR-Cas systems using as criteria: (i) Regions with more than three repeats and two spacers were considered to constitute bona fide CRISPR arrays, (ii) CRISPR arrays lacking Cas proteins in the vicinity were kept, as predicted orphan CRISPR arrays (Makarova et al., 2015;Almendros et al., 2016), and (iii) CRISPR-Cas loci lacking CRISPR arrays were discarded. The classification and clustering of CRISPR repeats were analyzed using CRISPRmap, a comprehensive cluster analysis method (based on HMM clustering), which clusters conserved sequence families and potential structural motifs (Lange et al., 2013). To determine the record of past infections by phages and/or other mobile genetic elements (MGEs), all unique spacers were compared against the NCBI (viruses, phages, and plasmid) databases (last accessed: December 2017) using BLASTN (Altschul et al., 1997); the 11nt word size was used in BLASTN. The results showing highest coverage (≥70%) and identity were considered to represent valid hits. The analyses using BLAST all-vs.-all were carried out by doing BLAST the identified prophage dataset against the spacer dataset. The analyses of proto-spacers were also done using IMG/VR (https://img.jgi.doe.gov/cgi-bin/vr/main. cgi) against their viral spacer and metagenome spacer database (Paez-Espino et al., 2017).

Prophage-Host Co-phylogeny Analyses
Co-phylogeny analyses were performed using two methods: first, we used PACo (Procrustes approach to co-phylogenetics), which assesses the congruence or evolutionary dependency, of two groups of interacting species using both ecological interaction networks and their phylogenetic history. The analysis (evaluating the distribution of a set of shapes) produced superimposition plots (illustrating the correspondence coordinates of divergences between lineages, or patristic distances), in which the global-fit of prophage phylogenies onto the hosts is shown. The contribution of each individual host-prophage association to the global-fit is also evaluated (Hutchinson et al., 2017). The second approach is an event-based approach, which evaluates the combination of events of co-speciation, duplication, duplication with host switching, loss and failure to diverge, using Jane 4 (Conow et al., 2010). The default settings used were co-speciation = 0, duplication = 1, duplication and host switching = 2, sorting = 1, failure to diverge = 1. To find the best solution, while minimizing the cost of co-evolutionary events, a genetic algorithm (GA) that relies on bio-inspired operators, in this case randomized host, and prophage trees, was applied. The GA algorithm consists of population (the number of different solutions being considered at each iteration of the algorithm) and generation (the number of iterations performed by the algorithm as it seeks a good reconstruction of the parasite tree onto the host tree). These were set at 1,000 and 10, respectively (Conow et al., 2010). The statistical significance of the reconstructions was evaluated with 100,000 random tip mapping permutations.

Statistical Analysis
Statistical analysis of the prophage distribution was performed using RStudio (Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/). A Shapiro-Wilk test was used for testing the normality of the data distribution (Razali and Wah, 2011) and a Kolmogorov-Smirnov test to assess the significance of differences for non-normally-distributed data. For all statistical significance tests, the significance level was set to 95%.

Identification and Distribution of Prophages Across Selected Paraburkholderia Genomes
Prophages and prophage-derived elements were identified using the criteria stated in section Materials and Methods. A total of 105 genomic regions meeting these criteria were found before manual curation. Following removal of sequences <10 Kbp, those lacking genes for phage tail, capsid/head, terminase and integrase proteins and those with > 25% of ISs, we ended up with a total of 79 prophages in the genomes that were analyzed ( Table 1). These included the genomes of the three Burkholderia species used as an outgroup (Figure 1). Most of the prophages (75%) identified consisted of "remnants" of past infection events, as evidenced by the loss of essential phage genes (e.g., structural and replication genes). This finding is consistent with the deletion bias theory (Bobay et al., 2014) and with the idea that these phage insertions represent ancient events (Hendrix et al., 2001). We detected prophages in most of the selected Paraburkholderia strains and in the three outgroup Burkholderia (94.87%, n = 37), the exceptions being P. bryophila 376MFSha3.1 and P. caledonica NBRC 102488. The number of identified prophages per genome ranged from one to nine, with most of the Paraburkholderia species (27%, n = 10) harboring just one prophage region meeting the criteria. Remarkably, one outlier genome, that of Paraburkholderia sp. MF2-27, contained nine prophage regions, which stood in sharp contrast to the number present in all other genomes (range 16.8 to 62 Kb; Figure 2). The genome sizes of the identified prophages ranged from 11.5 to 419 Kb, contributing up to 4% of the genomes of the Paraburkholderia strains. The G+C contents of the prophages were on the average 60.5% (ranging from 56-66%), which was invariably below that of their hosts (average = 63%; ranging from 58.88 to 67.88%). No significant differences observed (Kolmogorov-Smirnov, N.S.not significant ρ = 0.3454) between the prophage genome sizes in the genomes of the "plant-associated" vs. the "non-plantassociated" Paraburkholderia species ( Figure 2C).
Here, we further focus on the 25% more recently acquired prophages, defined on the basis of their possession of full gene sets for phage particle assembly (e.g., genes for tail, capsid genes), replication, recombination and lysis proteins. Thus, we selected a total of 26 prophages, identified in the genomes of 17 Paraburkholderia strains, for further analysis ( Table 2).
Prophages often offer lysogenic conversions that advance their hosts' fitness. Here, we found some potential moron genes, on the basis of these being not essential for phage reproduction, in these complete phages (Supplementary Table 1). We found high identity (70-83% identity and 100% coverage) of a DNA adenine methylase gene in several prophages, i.e., φPphytPsJN, φPphyt455, φPcari1DS, φPcari2DS, φPcari1MW, φPcari2MW and φPglat2. However, most (i.e., 13.33-71.15%) of the genes in these prophages remained hypothetical (Supplementary Table 1), so it is possible that these phages constitute a repertoire of novel genes that may enhance host fitness.

Prophage Phylogenies and Genomic Analyses
To better understand the evolutionary trajectory of the selected prophage regions (Table 2), we built phylogenetic trees based on concatenated as well as single phage signature genes. For these, we selected the genes for phage capsid, portal, tape and terminase proteins (Figure 3). Despite the high divergence across the constructed phage phylogenies, the prophages clustered into five groups, as supported by all phylogenetic trees. Interestingly, the previously-identified P. terrae BS437 phage φ437 (Pratama and  van Elsas, 2017) revealed to be closely related to two prophages from Paraburkholderia sp. MF2-27, denoted φPsp20 and φPsp61, one prophage from P. xenovorans (φPxeno2) and another one from P. nodosa (φPnodo2). However, it was distantly related to the other prophages identified in P. terrae DSM 17804 T and P. terrae NBRC 100964. Moreover, the prophages identified in P. caribensis strains DSM 13236 T and MWAP64, P. terrae strains DSM 17804 T and NBRC 100964, and P. phytofirmans strains PsJN and BS455, always clustered in the same group (Figure 3). The grouping among these prophages was consistent with the comparative analyses of the whole genomes of these prophages (see Figure 4 and Supplementary Figure 1), which showed these to be highly syntenous to each other (100, 100, and 38-100% similarity, respectively). The conserved regions in these prophages (i.e., φPcari2DS and φPcari1MW; φPcari1DS and φPsp2MW; φPt17804 and φPtNBRC1; φPphytPsJN and φPhyt455) included phage genes for structural, replication, recombination and lysis proteins. The genome structures of the 26 complete prophages in Paraburkholderia showed different divergences (see Figure 4 and Supplementary Figure 1), which, as we postulate here, were either rearranged via homologous recombinations at short conserved boundary sequences in the phage genomes (Pedulla et al., 2003) or came about by multiple infection events from different phages (Ohnishi et al., 2001). This latter scenario was supported by the fact that the prophages from Paraburkholderia sp. MF2-27 (φPsp20, φPsp21, φPsp31, φPsp41, φPsp51 and φPsp61) were highly divergent. However, high levels of synteny were observed in the structures of the prophage genomes, for instance across the replication regions in prophages φPsp20 and φPsp61, as well as φPsp21 and φPsp51 (Supplementary Figure 1).

Host-Prophage Co-phylogeny Analyses
To understand the evolutionary relationships of the 26 identified full prophages and their hosts, we applied the two different methods of co-phylogenetic analysis: PACo (distance-based) and Jane 4 (event-based). The distance-based method showed globalfit values (m 2 XY ) between Paraburkholderia and their prophages of 0.28 (concatenated genes), 0.31 (capsid), 0.28 (portal), 0.27 (tape), and 0.27 (terminase). To test the robustness of the analyses, we applied 100,000 permutations (P < 0.0001) with α = 0.05 as the significance level. The m 2 XY values were inversely proportional to the topological congruence between the two phylogenies (Balbuena et al., 2013). Therefore, the analyses FIGURE 3 | Complete prophage relatedness trees based on (A) concatenated phage signature (capsid, portal, tape, and terminase) genes and individual phage genes, i.e., (B) capsid, (C) portal, (D) tape, and (E) terminase. The translations of concatenated and individual signature genes were aligned with MUSCLE (Edgar, 2004) and edited using Gblocks (Talavera et al., 2007). Maximum likelihood phylogenetic trees were constructed using FastTree (Price et al., 2009) and trees (midpoint-rooting) were visualized using iTOL (Letunic and Bork, 2016). Colors represent consistent groupings of the prophages. Red arrows indicate phage φ437.
suggested that the Paraburkholderia (host) phylogenetic trees did not predict the topology of the prophage trees, using any of the genes (concatenated, capsid, portal, tape, and terminase; Table 3).
To identify the extent of the contribution of each prophage host to the co-phylogenetic structure, we then evaluated the procrustean superimposition plots and the Jack-knifed square residual values (Balbuena et al., 2013). The former analysis showed clusters of possible congruencies of the phylogenies of the prophages (concatenated gene tree) and their hosts ( Figure 5A and Supplementary Figure 2). Some of these clusters were remarkably close to each other, reflecting the high relatedness with the species phylogeny of their hosts (Figure 1). Moreover, a significant portion of the prophage tree topology presented low squared residual values, specifically for prophages φPoxy3, φPphym1, φPphym2, φPt17804, φPtNBRC, φPxeno2, φPphyPsJN, φPphy455, φPcari1MW, φPcari2MW, φPcari1DS, and φPcari2DS. These lineages thus reflected the tree topologies of their hosts (based on 16S rRNA), suggesting possible co-evolutionary associations (Figure 5B and Supplementary  Figure 3).
The second analysis, which addressed the reconciliation of the prophage-concatenated phylogeny with the host tree, resulted in 31 putative evolutionary scenarios. These included 11 events of co-speciation, three duplications, 12 host-switching and five loss events (Table 3). However, no "failure to diverge" event showed up. Furthermore, three Paraburkholderia sp. MF2-27 prophages, i.e., φPsp31, φPsp51, and φPsp21, were predicted to derive from recent host switching by an ancestor of the P. bannensis prophage φPban3. Additionally, P. terrae BS437 phage φ437 was found to be derived from recent host switching from Paraburkholderia sp. MF2-27 phages φPsp20 and φPsp61 and P. nodosa phage φPnodo2 with the common ancestor of prophage φPxeno2 (Figure 6 and Supplementary Figure 4). Similar numbers of each of the evolutionary scenarios were obtained from analyses of the individual phage signature genes (see Table 3 and Supplementary  Figures 2, 3). Overall, the results show that the evolutionary trajectory of the Paraburkholderia prophages is predominantly characterized by host switching, followed by co-speciation events.

Insight Into CRISPR-Cas Systems in Paraburkholderia Genomes
To provide insight into the evolutionary history of past infections from bacteriophages in the Paraburkholderia strains, we investigated the occurrence of CRISPR-Cas systems, especially CRISPR arrays (spacers). In silico analyses were carried out to identify CRISPR-Cas loci and CRISPR arrays, applying strict criteria for detection (see section Materials and Methods). The analyses showed that 55.55% (n = 20) of the genomes of the Paraburkholderia species harbor identifiable CRISPR-Cas systems ( Figure 7A). Two complete systems were found in P. grimmiae and P. zhejiangensis. These consisted of genes for Cas proteins (cas1, cas2, cas3HR, cas5, cas6e, cas7, cas8e, and FIGURE 4 | Synteny analysis of complete prophage genomes. The groupings of the prophages are based on the phylogenetic tree constructed with the concatenated phage signature (i.e., capsid, portal, tape, and terminase) genes. Red arrows: phage lysis and lysogeny genes; blue arrows: phage structural genes (tail, capsid, and fiber); green arrows: replication, recombination, repressor, and phage related genes; gray arrows: hypothetical proteins; yellow arrows: non-phage or possible moron genes. Comparison percentage was generated using BLAST + 2.4.0 (tBLASTx with cutoff value 10 −3 ) and map comparison figures were created with Easyfig (Sullivan et al., 2011) as indicated in section Materials and Methods. Gene similarity percentage is indicated in gray-scale bars. Procrustean superimposition plot analysis, which minimizes differences between the two partners' principal correspondence coordinates of patristic distances. For each vector, the starting point (black dot) represents the configuration of prophages and the arrowhead the configuration of hosts. The vector length represents the global fit (residual sum of squares), which is inversely proportional to the topological congruence. (B) Contribution of each Paraburkholderia lineage and their prophages to the general pattern of coevolution. Each bar represents a Jack-knifed squared residual. Error bars represent the upper 95% confidence intervals from applying PACo to patristic distances. Further, the median squared residual value is shown (dashed red line). The colors represent the clusters shown in the prophage phylogeny (see Figure 3). cse2gr11) flanked by two CRISPR arrays with totals of 42 and 18 spacers, respectively ( Figure 7A). The complete CRISPR-Cas systems found in these two genomes (P. grimmiae and P. zhejiangensis) showed the presence of the Cas signature protein cas3, which classified them into class 1. Despite the variation in the arrangement of the Cas protein and the absence of cas4, the two complete systems could further be classified into subtype 1-E ( Figure 7A). Approximately 90% (n = 18) of the Paraburkholderia species had so-called orphan CRISPRs (those not associated with Cas genes or remnants of CRISPR systems), containing at least two spacers and three repeats. The genome of P. oxyphila had the highest number of spacers, with 36 and 37 repeats (see Table 4 and Supplementary Figure 5).
Given the high percentage of orphan CRISPRs, we then also classified the systems found on the basis of the repeat sequences found in the CRISPR arrays (see Supplementary Figure 5) using the CRISPRmap database (Lange et al., 2013). Based on the analyses, most repeats belonged to super-classes D (46.87%, n = 15) and B (12.5%, n = 4). Moreover, 37.5% (n = 12) of the repeats were not attributable to any superclass. Additionally, none (n = 32) of the repeats showed a match with any sequence family and/or structure motif in the CRISPRmap database (Lange et al., 2013). Furthermore, repeat30 from P. xenovorans was found to match motif11 in the database (Supplementary Figure 6). In detail, this repeat was related to a repeat found in CRISPR-Cas systems of the Gram-positive bacteria Streptococcus pneumoniae CGSP14 (NC_010582) and Clostridium botulinum F str. 230613 (NC_017297) (see Supplementary Figure 6 and Supplementary Tables 2, 3). Moreover, the highest numbers of repeats were found in the P. grimmiae genome, which harbored 44 repeat copies. These consisted of two consensus repeats, i.e., 12 copies of repeat10: GGGTCTATCTCCGCGCACGCGGAGGAACC and 32 of repeat11: GGTTCCTCCGCATGCGCGGAGATAGACCC. Both repeats had lengths of 29 bp ( Table 4). The repeat with next-high number was found in P. oxyphila. This genome harbored 39 repeats, consisting of three consensus sequences. These were: five copies of repeat21: TGTGTCGACTCGACACAGCACTCAATCG, 30 of repeat22: TTTCTAAGCTGCCTACGCGGCAGCGAAC and five of repeat23: GTCGACCAGAGTTAGCGCTTCAGC. These repeats had lengths of 28, 28, and 24 bp, respectively ( Table 4). The genome of P. zhejiangensis also harbored relatively high repeat sequence numbers, with a total of 20 repeat sequences. These consisted of two consensus repeats, i.e., 12 copies of repeat31: GGTCTATCTCCGCGCGCGCGGAGGAACC and eight of repeat32: GGTTCCTCCGCGTCCGCGGAGATAG. These repeats had lengths of 28 and 25 bp, respectively. Remarkably, we found similar CRISPR-array repeat sequences in some of the Paraburkholderia genomes, i.e. repeat24 from P. phenoliruptrix ac1100 was similar to repeat26 from Paraburkholderia sp. MF2-27, as well as repeat1 and repeat2 from P. terrae strain BS001 were similar to repeat5 and repeat6 from P. terrae strain BS110, respectively.
In order to discern the phages (and other mobile genetic elements) that most frequently infect the genomes of Paraburkholderia spp., we compared the spacer matches to phage, virus and plasmid sequences found in the database (NCBI). Based on the numbers of phages from different families (Figure 7B), we found that 31.14% (n = 52/167) of the spacers had best hits against database sequences, with 115 spacers remaining unknown. This analysis thus identified sequences of Coronaviridae, Flaviviridae, Geminiviridae, Herpeviridae, Inoviridae, Myoviridae, Podoviridae, and Siphoviridae. For example, most Myoviridae phages came from Gamma-Proteobacterial hosts, such as Burkholderia, Pseudomonas, and Erwinia (see Figure 7B; the detailed organism hits can be seen in Supplementary Table 4). Spacers with the highest hits often matched phages from the family Myoviridae (21%, n = 11). The comparison of the spacer dataset to the prophage dataset using BLAST (all-vs.-all) did not yield any matches. In the analyses, we were unable to detect any other mobile genetic elements (Supplementary Table 4).
To investigate whether the identified prophages could be predicted to infect hosts beyond Paraburkholderia spp., we compared our prophage dataset against the viral and metagenomics spacer database in the IMG/VR platform (Paez-Espino et al., 2017). Remarkably, most of the identified prophages were found to have a narrow predicted host range, as evidenced by the absence of matches to other bacteria. However, there were some exceptions. For instance, phage φ437 was found to contain two proto-spacers with 100% identity to spacers in the genomes of Yersinia pekkanenii strains A125KOH2 and CIP110230 (see Figure 7C and Supplementary Table 5). Other phages, i.e., φPban1, φPban3, φPsp21, φPsp31, and φPsp51, contained proto-spacers with 95-100% matches to spacers present in the metagenomics spacer database, specifically from maize rhizosphere and peatland microbiomes (Supplementary Table 6).

Screening for R-M Defense Systems
We found genes involved in R-M systems of types I, II, III, and IV across the Paraburkholderia genomes. Our results showed FIGURE 6 | Tanglegram depicting the Paraburkholderia (host) species phylogenetic tree in black and the prophage tree in blue. Auxiliary lines connecting the two trees are shown. The event-based method was performed with the default settings for cost regimes ("co-speciation" event = 0; "duplication" event = 1; "loss" event = 1; "duplication then host switching" event = 2) using Jane 4.0 (Conow et al., 2010). All analyses were performed with populations of 1,000 and 10 generations. Gray boxes represent plant-associated Paraburkholderia species and white boxes non-plant-associated Paraburkholderia species. Jane results showed that host-switching events occurred frequently, next to co-speciation.
that type-II R-M systems were widely found in all genomes, next to type-I ones (Figure 8). Further, only P. terrae BS001, BS007 and P. phytofirmans J1U5 contained all R-M system types (I-IV). Additionally, we found P. terrae strain BS437 to only have type-I (i.e., hsdR, hsdM, and hsdS) and type-II R-M systems (i.e. dcm-methyltransferase and adenine-specific-methyltransferase; Figure 8).

DISCUSSION
In this study, we addressed the question to what extent phages have shaped, over evolutionary time, the genomes of Paraburkholderia species. It has been amply shown that the genomes of bacteria are often littered with both functional and "fossilized" viral sequences (Casjens, 2003). Here, prophage sequences were indeed found in the majority of the Paraburkholderia genomes (Table 1). In most cases, we found evidence for genetic degradation, most likely due to the hosts' selective pressure leading to deletions (deletion bias). Remarkably, in just a few Paraburkholderia genomes we found multiple prophage regions, whereas most were found to contain just one such region (see Table 2 and Supplementary Figure 1).
A previous study on Paraburkholderia genomes showed that prophages can make up to 13% of these, as exemplified by the  P. phytofirmans J1U5 genome . It is worth to note that this previous study used not only the database-based approach (i.e., PHAST) but also an algorithmbased program (i.e., PhiSpy) to find novel prophage sequences. However, the latter program has been reported to give less consistent results (Popa et al., 2017;. Therefore, in this current study we decided to identify prophages based on more strict criteria, as outlined in section Materials and Methods. Moreover, we decided to base our analyses solely on the latest prophage/viral database (Zhou et al., 2011). Our current analysis shows that there was no significant difference between the size of the prophage regions found in the genomes of the plant-associated vs. the non-plant-associated Paraburkholderia strains (including soil and mycosphere inhabitants; Figure 2). It may support the notion that the phenotypic diversity of Paraburkholderia species enables them to inhabit diverse soil environmental settings. In consequence, at some point of their lifetime, they may have been exposed to either different or similar phage pools, allowing the acquisition of diverse novel sequences by phage insertions. The latter may thus relate to the lifestyles of these organisms in soil. Examples can be observed in Paraburkholderia sp. MF2-27 that was isolated from the mycosphere of Trichoderma harzianum (Rudnick et al., 2015) and was found to contain the highest prophage number in our dataset. This Paraburkholderia genome harbors nine prophage sequences, six of these being complete prophages. Clearly, its phage exposure legacy was different from that of the other hosts that were examined, which potentially reflects a more "turbulent" evolutionary record. In contrast, the plant-associated Paraburkholderia species containing the highest number of prophages was P. phymatum STM815, with a total of five prophages, two of which were complete (see Tables 1, 2).
Furthermore, the G+C contents of all full prophage regions were lower than those of the genomes of their host. This was taken to reflect their relatively "recent" acquisition on the evolutionary time scale (Hendrix et al., 2001;Casjens, 2003;Canchaya et al., 2004). It is worth to mention thatto the best of our knowledge-there is still a lack of reliable estimation of the time scale between phage integration and codon usage equilibrium in the host genomes. Moreover, some of the phages-within their taxon-were highly syntenous across each other (e.g., φPt17804 and φPtNBRC1; φPphytPsJN and φPhyt455), suggesting that these were (i) preserved, possibly functionally, and (ii) vertically inherited. These prophages may have derived from a single ancestral integration and then maintained through different diversification events (Bobay et al., 2013). The overall results of prophage distributions and prophage genome architectures suggested that possibly multiple infections by distinct prophages of the respective host cells had taken place. In an overall fashion, the genetic history of these Paraburkholderia prophages was found to be very complex, as was previously also observed in the genetic history of E. coli prophages (Ohnishi et al., 2001).
In a recent paper, we described that a novel prophagedenoted φ437-could be induced from P. terrae strain BS437 . We found that it harbors the putative moron gene amrZ and we hypothesized that this gene enhances the host's biofilm formation capacity. In the current study, we also found moron genes in other prophage genomes (see Supplementary Table 1). For example, genes for methylases were found and such proteins may be important for phages to overcome bacterial R-M systems, maintain the phage lysogenic stage, as well as support host pathogenicity (Murphy et al., 2013). Experimental studies are required to prove this.
Interestingly, the co-phylogenetic analyses (global-fit) between Paraburkholderia and their prophages revealed incongruence between trees (see Table 3). This means that the evolutionary events shaping the phage-host partnerships may have been duplications, host-switching and horizontal gene transfer (HGT) events. Jane results showed co-speciation, duplication, host switching and phage losses had differentially occurred (see Figure 6 and Table 3). Both analyses, thus, indicated that all host-prophage links below the set threshold ( Figure 5B) were corresponding to co-speciation (Figure 6; exceptions being φPphym1 and φPphym1). There are several scenarios under which host switching can occur in the natural environment. For example, φ437 may have switched from Paraburkholderia sp. MF2-27 to P. terrae strain BS437 (Figure 6). The two organisms were, interestingly, isolated from the mycosphere. Moreover, five host switch events were suggested to have occurred from "plant-associated" to nonplant-associated Paraburkholderia species, while three (plant to plant) and two (non-plant to non-plant) were observed from Paraburkholderia living in the same habitat (see section Materials and Methods for "plant-" vs. "non-plant-associated" Paraburkholderia species). These results support the notion of the ecological plasticity of Paraburkholderia species to occupy different niches in the soil (Haq et al., 2014). We also hypothesize that these Paraburkholderia (P. terrae strain BS437 and Paraburkholderia sp. MF2-27) might have been in close proximity and exposed to diverse prophages. The latter may thus have infected, and diverged in, these Paraburkholderia species, either including or not including transfers between species. Different rates of phage evolution could have been caused by (i) the variations in host evolution itself, (ii) accessibility of the common (horizontal) gene pool in different environments, (iii) constraints on the sequence diversity present across the genomes and available for recombination and (iv) the roles of temperate phages (i.e., high in gene flux-faster rate of gene acquisition and loss through HGT; Mavrich and Hatfull, 2017).
Our CRISPR-Cas analyses showed the infestation record of past infections by exogenous DNA elements (Figure 7). Strikingly, we found high variability of the CRISPR systems, with an uneven presence and numbers of CRISPR spacers. We surmised this is the consequence of differential Paraburkholderia arms races with exogenous elements like phages (see Table 4 and Figure 7). It is speculated that high variability of CRISP-Cas systems is due to (i) a high rate of HGT in some of these species "hubs, " (ii) possible duplication of the arrays and (iii) the enrichment of these arrays offering other advantages to the host cell (Weissman et al., 2017). Interestingly, Paraburkholderia species with complete CRISPR-Cas systems, i.e., P. grimmiae and P. zhejiangensis, grouped in the same branch (Figure 1), suggesting their close relationship. We also observed the composition of the genes for Cas proteins in P. grimmiae and P. zhejiangensis to be different from the classification proposed by Makarova et al. (2015).
The finding of high numbers of orphan CRISPRs in the analyzed genomes was remarkable. It could be the consequence of (i) rapid genetic rearrangement in the bacterial genome, (ii) the loss of functionality of CRISPR sequences by deletion,(iii) an incomplete or poor assembly of the draft genome (Shin et al., 2017) and (iv) lifestyle of the bacteria (Makarova et al., 2015). To date, only ∼40% of bacterial genome sequences in the current database were found to carry CRISPR systems. Moreover, it is still unknown why the genetic structures of CRISPR-Cas systems are so diverse and have such a non-uniform distribution (Vale and Little, 2010;Makarova et al., 2015). The finding of CRISPR arrays in only half of our Paraburkholderia genomes (see Table 4 and Figure 7), was thus not unexpected, as we may be just lifting the tip of the iceberg of Paraburkholderia CRISPR array sequences.
Furthermore, we clearly found evidence for the tenet that the host CRISPR spacers matched sequences of a variety of phage families ( Figure 7B). Matches to Myoviridae were the highest, suggesting phages from this family most commonly infected the Paraburkholderia species. This result was consistent with the recent discovery of the inducible Myoviridae phage φ437 from P. terrae BS437 . The matches with various different viruses and phages (see Supplementary  Table 1 for details of the hosts of these viruses and phages) showed some were hits with known Burkholderia phages (e.g., Burkholderia cepacia complex phages BcepC6B and phiE12-2) and other Proteobacteria phages (e.g., Pseudomonas phage Lu11 and PEV2). These results confirm the contention that most of the CRISPR spacers have been exposed to rapid genetic turnover processes (Makarova et al., 2015;Shmakov et al., 2017). We argue here that, given that phages indeed constituted the majority of matches of the spacers, they most likely constitute the main mobile genetic elements the Paraburkholderia hosts have been exposed to, as argued in Modell et al. (2017), Shin et al. (2017) and Shmakov et al. (2017). However, we acknowledge the paucity of knowledge on spacers in databases, as well as the limitations posed by current bioinformatics analysis programs. And, although we have come a long way in our understanding of the CRISPR-Cas systems of prokaryotes (bacteria and archaea), these still remain to be better explored and identified.
The clear matches of the proto-spacers in phage φ437 with genome sequences of Y. pekkanenii and with other phages in the metagenomics database suggest an co-evolutionary relationship. Either such phages may have a surprisingly broad host range, or the two divergent organisms may have been infected with related phages in their natural ecosystem. Y. pekkanenii has been isolated from soil (Murros-Kontiainen et al., 2011), suggesting a common niche. Niche sharing by P. terrae BS437 and Y. pekkanenii may, thus, be at the basis of the relatedness. However, these tenets are speculative and need confirmation.
The fact that many Paraburkholderia strains (i.e., P. grimmiae, P. zhejiangenis, P. terrae BS001, P. terrae BS007, P. terrae BS110, P. terrae DSM 17804 T , P. terrae NBRC 100964, P. xenovorans, Paraburkholderia sp. MF2-27, P. sacchari, and P. oxyphila) had multiple CRISPR arrays ( Figure 7A) indicated multiple exposures to phages. Moreover, Paraburkholderia may also have utilized other antiphage defense systems, such as the R-M system (Figure 8). Recent studies have reported previously unknown anti-phage systems, next to one anti-plasmid system, that are widespread and arm bacterial genomes against invading genetic elements like phages and plasmids (Doron et al., 2018;Ofir et al., 2018). We currently ignore the extent to which such systems are operational in Paraburkholderia species and so further analyses on these systems are warranted.
In summary, we here analyzed the distribution of prophage regions across the genomes of all species of the genus Paraburkholderia. Although we observed incongruences between the trees built for host and prophage evolutionary relationships, we obtained evidence for the tenet that duplication, host switching and HGT have affected the evolutionary histories. The analyses of CRISPR-Cas systems also indicated frequent phage-host encounters, revealing a complex and entangled relationship.