Comparative genomics defines the core genome of the growing N4-like phage genus and identifies N4-like Roseophage specific genes

Two bacteriophages, RPP1 and RLP1, infecting members of the marine Roseobacter clade were isolated from seawater. Their linear genomes are 74.7 and 74.6 kb and encode 91 and 92 coding DNA sequences, respectively. Around 30% of these are homologous to genes found in Enterobacteria phage N4. Comparative genomics of these two new Roseobacter phages and 23 other sequenced N4-like phages (three infecting members of the Roseobacter lineage and 20 infecting other Gammaproteobacteria) revealed that N4-like phages share a core genome of 14 genes involved in control of gene expression, replication and virion structure. Phylogenetic analysis of these genes placed the five N4-like roseophages (RN4) into a distinct subclade. Analysis of the RN4 phage genomes revealed that they share a further 19 genes, of which nine are found exclusively in RN4 phages and four appear to have been acquired from their bacterial hosts. Proteomic analysis of the RPP1 and RLP1 virions identified a second structural module present in the RN4 phages, similar to that found in the Pseudomonas N4-like phage LIT1. Searches of various metagenomic databases, including the GOS database, using CDS sequences from RPP1 suggest that these phages are widely distributed in marine environments, in particular in the open ocean.


INTRODUCTION
Phages (viruses that infect bacteria) are the most prevalent entities in the biosphere; they harbor a vast, untapped reservoir of genomic diversity and are important in driving the evolution of bacteria (Rohwer, 2003; Paul and Sullivan, 2005; Angly et al., 2006). They are also a significant component of the microbial food web and have a major influence on fluxes of organic and inorganic matter, in particular in the oceans (Fuhrman, 1999; Wilhelm and Suttle, 1999; Weinbauer and Rassoulzadegan, 2004; Suttle, 2005, 2007; Breitbart et al., 2007). Metagenomic surveys suggest that the true diversity of marine phages exceeds that represented by isolated phages (Breitbart and Rohwer, 2005; Angly et al., 2006; Hurwitz and Sullivan, 2013), and there remain major gaps in understanding which hosts are infected by the wide diversity of phages observed in the environment.
One of the major groups of bacteria found in the marine environment is the so-called Roseobacter clade. Its members represent a taxonomically and metabolically diverse group of bacteria found in pelagic and benthic habitats, where they play key roles in a wide range of biogeochemically important transformations (Buchan et al., 2005). Processes affecting their abundance and activity, such as viral lysis, are of biogeochemical significance but are currently poorly understood, as only a small number of bacteriophages interacting with Roseobacters (roseophages) have previously been described. The first isolated roseophage was SIO1 (Rohwer et al., 2000), but since then four lytic roseophages infecting Roseobacter denitrificans (phage RDJLΦ1), Ruegeria pomeroyi (phage DSS3Φ2), Sulfitobacter strain EE36 (phage EE36Φ1) and Sulfitobacter strain 2047 (phage pCB2047-B) have been described (Zhang and Jiao, 2009; Zhao et al., 2009; Ankrah and Budinoff, 2014). The latter three are closely related to Enterobacteria phage N4, which, for over 40 years, was the sole representative of the N4-like genus, a genetic orphan among the tailed phages (Schito et al., 1965; Ceyssens et al., 2010). N4 was unique in the phage world due to its use of three distinct RNA polymerases and single-stranded DNA-binding proteins/activators to control gene expression (Choi et al., 2008). In recent years a further 25 N4-like phages have been isolated and genome sequenced (Table 1), all of which share these features.
The aim of this study was to isolate and characterize lytic phages infecting members of the Roseobacter clade using a number of different Roseobacter host strains and samples of coastal seawater from the United Kingdom. We isolated two new Roseobacter N4-like phages (RN4-phages) that infect Roseovarius nubinhibens and Roseovarius sp. 217. Here, we report the sequencing of their genomes and the identification of phage-particle associated proteins by mass spectrometry. With the increased number of genome sequences available for N4-like phages it was possible to address questions regarding the structure and evolution of the genomes of this growing group of phages.

GROWTH OF BACTERIAL STRAINS
Cultures of Rsv. nubinhibens (Gonzalez et al., 2003) and Rsv. sp. 217 (Schäfer et al., 2005) were routinely grown in Marine Ammonium Mineral Salts amended with 10 g L⁻¹ peptone and 5 g L⁻¹ yeast extract (MAMS-PY).  (Gonzalez et al., 2003; Schäfer et al., 2005) to enrich any Roseobacter phages present. After incubation for 7 days, cells and large cellular debris were removed by centrifugation and the supernatant was used in plaque assays against the species in the original inoculum. Clear plaques could be observed on bacterial lawns of Rsv. nubinhibens and Rsv. sp. 217 after 24-48 h incubation at 25 °C. The plaques were then picked and made clonal.

PRODUCTION OF PHAGE STOCKS
The clonal phage samples made from agar plugs were used in plaque assays to produce plates with confluent lysis of the Roseovarius lawn. The top agar layer was removed using a flame-sterilized glass microscope slide and mixed with 3 ml (per plate) of artificial seawater (ASW) modified as described in Wilson et al. (1996). Chloroform was added to a final concentration of 25% (v/v) to lyse remaining host cells. The resulting slurry was mixed thoroughly for at least 1 min and incubated for at least 30 min at room temperature in the dark. The top agar and chloroform were removed by centrifugation at 1780 × g for 10 min at 4 °C. This typically produced stocks of 1 × 10⁸ plaque-forming units (PFU) ml⁻¹. Phages were further purified using CsCl gradient centrifugation for subsequent electron microscopy, DNA extraction and virion proteomic analyses (Sambrook and Russell, 2001).

MODIFIED BACTERIOPHAGE ONE-STEP GROWTH CURVE
Bacterial host cells grown in MAMS-PY in early exponential phase were harvested by centrifugation (4000 rpm/1300 × g, 15 °C for 10 min). The cells were then washed in Marine Broth (Pronadisa, Conda, Madrid) and centrifuged again at 16000 × g at room temperature for 10 min. The pellet was resuspended in sterile Marine Broth containing enough phage to give a multiplicity of infection of 0.001. Prior to addition of bacterial host cells, aliquots of the Marine Broth + phage solution had been removed to act as control samples. Both "bacteria + phage" and "phage-only" samples were then plated using the top agar overlay technique and the time noted for each plate. The plates were then transferred to a dark, 20 °C incubator for the duration of the experiment. At appropriate intervals plates were removed and the top agar layer removed with a flame-sterilized glass slide. This was mixed with either 3 ml ASW and 3 ml chloroform or 3 ml cold ASW. The period of time between plating and mixing with the ASW:chloroform or cold ASW solution was taken as the time of incubation.
All samples were left at 4 °C in the dark overnight then centrifuged at 1300 × g at 4 °C for 10 min to separate the agar and chloroform. The number of free plaque-forming units in the supernatant was then determined by appropriate dilution and plaque assays. Each time point for bacterial/phage samples was assayed in triplicate, control samples in duplicate, and each growth curve was repeated three times.
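As an illustration of the enumeration step, the following minimal sketch converts a plaque count into a titer. The function name, the argument names and the example numbers are assumptions made for illustration and are not taken from the study.

```python
def titer_pfu_per_ml(plaque_count, dilution_factor, volume_plated_ml):
    """Return the titer (PFU per ml) of the undiluted sample.

    dilution_factor is the fraction of the original sample in the plated
    dilution (for example 1e-5 for a 10^-5 dilution). All names here are
    hypothetical and only illustrate the standard plaque-assay arithmetic.
    """
    if dilution_factor <= 0 or volume_plated_ml <= 0:
        raise ValueError("dilution factor and plated volume must be positive")
    return plaque_count / (dilution_factor * volume_plated_ml)


# Hypothetical example: 50 plaques from 0.1 ml of a 10^-5 dilution.
example_titer = titer_pfu_per_ml(50, 1e-5, 0.1)
```

Replicate plates for a time point would each be converted in this way before averaging.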

PHAGE GENOMIC DNA DIGESTION WITH Bal31
CsCl-purified phage stocks were dialysed twice using size 3/MWCO 12-14,000 Da dialysis tubing for at least 2 h in ASW at 4 °C. DNA was isolated and purified using a phenol-chloroform extraction as described previously (Sambrook and Russell, 2001).
To determine the physical structure of the genome of the two phages (linear or circular), around 40 μg of phage DNA was digested with Bal31 at 30 °C as described elsewhere (Loessner et al., 2000). Briefly, samples were removed 0, 5, 10, 20, 40, and 60 min after the addition of the enzyme and the digest was stopped by incubation at 65 °C for 10 min. All samples were purified by phenol-chloroform extraction and precipitated with sodium acetate and ethanol, followed by digestion with NdeI fast digest (Fermentas) according to the manufacturer's instructions. The digest patterns were analyzed by pulsed field gel electrophoresis using a 1% PFGE grade agarose gel run in a CHEF Mapper (BioRad).

PHAGE GENOME SEQUENCING
RLP1 and RPP1 phage DNA was extracted from CsCl stocks and dissolved in 10 mM Tris 1 mM EDTA buffer pH 8 (TE). The genomes were sequenced by the GenePool at the University of Edinburgh using Illumina for RPP1 and a combination of Illumina and Roche 454 shotgun sequencing for RLP1. Short-read Illumina data from RPP1 were assembled using Velvet (Zerbino and Birney, 2008), whereas the mixture of 454 and Illumina reads from RLP1 was assembled using Minimus (Sommer et al., 2007).
RPP1 assembled into a single contig whilst RLP1 assembled into 10 contigs; initial annotation of the largest contig suggested a high degree of gene synteny between RLP1 and RPP1. Consequently, RPP1 was used as a scaffold for RLP1 and the order of contigs was confirmed by PCR. Sequencing of the PCR products (by Sanger sequencing) resulted in complete assembly of RLP1. Whole-genome sequence data were submitted to EMBL under accession numbers FR682616 and FR719956 for RLP1 and RPP1, respectively.

IDENTIFICATION OF CODING SEQUENCES
Coding sequences (CDSs) were predicted using the freely available gene prediction programs GeneMark™, heuristic approach (Besemer and Borodovsky, 1999) and GLIMMER 3.01 (NCBI) (Delcher et al., 1999). The final set of predicted CDSs for each genome was created by amalgamation of the two sets of results from GeneMark and GLIMMER. For predicted CDSs with discordant start codons between the two programs, the longer of the two predictions was kept.
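The amalgamation rule described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes each prediction is reduced to coordinates and strand, treats two predictions as the same CDS when they share a strand and stop position, and keeps the longer one. The `CDS` class and `merge_predictions` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDS:
    start: int   # leftmost coordinate, 1-based inclusive
    end: int     # rightmost coordinate, 1-based inclusive
    strand: str  # "+" or "-"

    @property
    def length(self):
        return self.end - self.start + 1

    @property
    def stop_key(self):
        # The stop codon lies at the right end on "+" and the left end on "-".
        return (self.strand, self.end if self.strand == "+" else self.start)


def merge_predictions(*prediction_sets):
    """Union of all predictions; for shared stops keep the longer CDS."""
    best = {}
    for predictions in prediction_sets:
        for cds in predictions:
            current = best.get(cds.stop_key)
            if current is None or cds.length > current.length:
                best[cds.stop_key] = cds
    return sorted(best.values(), key=lambda c: (c.start, c.end))


# Hypothetical coordinates standing in for the two programs' outputs.
set_a = [CDS(100, 400, "+"), CDS(500, 800, "-")]
set_b = [CDS(130, 400, "+"), CDS(500, 860, "-"), CDS(900, 1100, "+")]
merged = merge_predictions(set_a, set_b)
```

Keying on the stop position is a design choice of this sketch: alternative start codons for one open reading frame always share the same stop, so discordant starts collapse onto a single entry.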

DATABASE SEARCHES
Basic Local Alignment Search Tool (BLAST) comparisons were carried out on the predicted CDSs using different custom-made databases (Altschul et al., 1990). Initially, the predicted protein sequences from the two Roseovarius phages were searched using the BLASTp algorithm against a database containing all bacteriophage protein sequences freely available in July 2008. This was then repeated using BLASTp against the non-redundant protein sequences database at the National Center for Biotechnology Information (NCBI). In addition, HMMER was used to search the SWISS-PROT database. The results from the three searches were compared to assign a putative function to each predicted CDS in RLP1 and RPP1.
To examine the environmental distribution of RN4 phages, CDS sequences from RPP1 were used as query sequences for the BLAST algorithm against the environmental metagenomes downloaded from CAMERA (accession numbers CAM_PROJ_HumanGut, CAM_PROJ_AntarcticAquatic, CAM_PROJ_BotanyBay, CAM_P_0000545, CAM_P_0000915, CAM_PROJ_GOS, CAM_PROJ_SalternMetagenome) and from EBI for metagenomes from the freshwater lakes Bourget (MET6) and Pavin (MET7) (accessions ERS015568 and ERS015567, respectively). tBLASTx analysis was carried out with the following parameters modified from default settings: -F F -b 100000 -v 100000 -e 0.0001. A reciprocal BLASTp analysis was then carried out against a custom database of viral sequences. This was constructed from all complete viral genomes available from http://ftp.ncbi.nlm.nih.gov/genomes/Viruses as of February 2013. RPP1 was chosen as a representative of RN4 phages as it has the same complement of genes as RLP1, and an additional three genes. A sequence identified in a metagenome was only considered to be of RN4-like origin if RPP1 was one of the top four results in a BLAST search against the viral database described above. The top four were considered as there is significant similarity between the proteins of the RN4 phages DSS3Φ2, EE36Φ1, RLP1 and RPP1, which were also in the BLAST database.
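The top-four criterion can be illustrated with a short sketch. It assumes the reciprocal search results have already been reduced to (read identifier, subject genome, bit score) rows and that hits are ranked by bit score; the row layout, the ranking key, the function name and the example values are all assumptions for illustration only.

```python
from collections import defaultdict


def rn4_like_reads(rows, target="RPP1", top_n=4):
    """Return the reads whose top_n ranked hits include the target genome.

    rows: iterable of (read_id, subject_genome, bitscore) tuples.
    Ranking by bit score is an assumption of this sketch.
    """
    hits_by_read = defaultdict(list)
    for read_id, genome, bitscore in rows:
        hits_by_read[read_id].append((bitscore, genome))

    kept = set()
    for read_id, hits in hits_by_read.items():
        top = sorted(hits, key=lambda h: h[0], reverse=True)[:top_n]
        if any(genome == target for _, genome in top):
            kept.add(read_id)
    return kept


# Hypothetical rows: read1 has RPP1 within its top four hits, read2 does not.
example_rows = [
    ("read1", "RLP1", 300), ("read1", "RPP1", 295), ("read1", "DSS3", 200),
    ("read1", "EE36", 190), ("read1", "N4", 100),
    ("read2", "N4", 300), ("read2", "LIT1", 250), ("read2", "LUZ7", 240),
    ("read2", "JWAlpha", 200), ("read2", "RPP1", 150),
]
kept_reads = rn4_like_reads(example_rows)
```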
To account for the differences in size between genes and between metagenomic libraries, an approach similar to that of Zhao et al. (2013) was employed. The number of hits for each gene was divided by the number of sequences in the database, and the result was then divided by the size of the gene product. Samples were then scaled using the mean of all samples, to reduce the number of significant figures. Counts are presented as the normalized relative abundance of each gene.
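A minimal sketch of this normalization is given below. It assumes that "scaled using the mean of all samples" means dividing every normalized value by the mean over all gene and metagenome combinations; that reading, the function name and the data layout are assumptions rather than details confirmed by the text.

```python
def normalized_abundance(hit_counts, db_sizes, product_lengths):
    """Normalize hit counts by database size and gene product length.

    hit_counts: {metagenome: {gene: number_of_hits}}
    db_sizes: {metagenome: number_of_sequences_in_database}
    product_lengths: {gene: length_of_gene_product}
    Returns {(metagenome, gene): value scaled by the overall mean}.
    """
    raw = {
        (metagenome, gene): hits / db_sizes[metagenome] / product_lengths[gene]
        for metagenome, genes in hit_counts.items()
        for gene, hits in genes.items()
    }
    if not raw:
        return {}
    mean = sum(raw.values()) / len(raw)
    if mean == 0:
        # No hits anywhere: nothing to scale.
        return dict(raw)
    return {key: value / mean for key, value in raw.items()}


# Hypothetical counts for one gene in two equally sized metagenomes.
example = normalized_abundance(
    {"A": {"g1": 10}, "B": {"g1": 30}},
    {"A": 1000, "B": 1000},
    {"g1": 100},
)
```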
To determine how RN4 phage abundance changes within the defined environmental sites of the Global Ocean Survey (Venter et al., 2004), the same approach was carried out for individual sampling stations using ORFs 24, 36, and 51 (the three most abundant ORFs in the eight metagenomes examined) as queries.

CDS/GENOME COMPARISONS
Phage genome comparisons of all the available N4-like phages were carried out using OrthoMCL (Li et al., 2003), which computes a bidirectional best hit search in amino acid space (e-value cutoff of 1e−06, I = 1.5). The initial database was constructed from the amino acid sequences of all predicted proteins extracted from publicly available files in GenBank.
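The bidirectional best hit idea can be sketched for a single pair of genomes as follows. This is only an illustration of the reciprocal best hit step with the e-value cutoff mentioned above; it is not OrthoMCL itself and omits the subsequent clustering step controlled by the I parameter. The row layout and the function names are assumptions.

```python
def best_hits(rows, max_evalue=1e-6):
    """Map each query to its highest-scoring subject passing the cutoff.

    rows: iterable of (query, subject, evalue, bitscore) tuples.
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits between proteins of two genomes, A and B.
a_vs_b = [("a1", "b1", 1e-50, 200), ("a1", "b2", 1e-10, 60), ("a2", "b2", 1e-3, 30)]
b_vs_a = [("b1", "a1", 1e-48, 198), ("b2", "a1", 1e-9, 58)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```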

PHYLOGENETIC ANALYSES
The evolutionary history of selected genes encoding thioredoxins and the core N4-like genome was inferred using the Neighbor-Joining method (Saitou and Nei, 1987). The bootstrap consensus tree inferred from 1000 replicates was taken to represent the evolutionary history of the taxa analyzed (Felsenstein, 1985). Branches corresponding to partitions reproduced in less than 50% bootstrap replicates were collapsed. The evolutionary distances were computed using the Poisson correction method (Zuckerkandl and Pauling, 1965) and all positions containing gaps and missing data were eliminated from the dataset. Phylogenetic analyses were conducted in MEGA5 (Tamura et al., 2007).
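For readers unfamiliar with the distance measure, the sketch below shows the Poisson correction, d = -ln(1 - p), where p is the proportion of differing amino acid sites, applied after removing every alignment column that contains a gap or missing-data character. The analyses themselves were run in MEGA5; this standalone code, its function names and its toy alignment are illustrative assumptions only, and it assumes all aligned sequences have equal length.

```python
import math

MISSING = set("-?")  # characters treated as gap or missing data (assumption)


def complete_deletion(seqs):
    """Drop every alignment column containing a gap or missing character."""
    columns = [col for col in zip(*seqs) if not MISSING.intersection(col)]
    if not columns:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*columns)]


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1.0 - p)


# Toy alignment: the fourth column is removed because one sequence has a gap.
trimmed = complete_deletion(["MKT-LV", "MRTALV", "MKTALI"])
distance = poisson_distance(trimmed[0], trimmed[1])
```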

EXTRACTION OF PHAGE STRUCTURAL PROTEINS AND SODIUM-DODECYL-SULFATE POLYACRYLAMIDE GEL ELECTROPHORESIS
High titre suspensions of RLP1 and RPP1 roseophage stocks were purified twice on a CsCl step gradient to remove host cellular protein contaminants. Sodium deoxycholate (2% w/v) was added to the phage sample at 0.01 volume and the sample was left on ice for 30 min. Trichloroacetic acid was added to the samples to a final concentration of 12% (w/v) and the sample was left on ice for 30 min. The precipitated proteins were harvested by centrifugation using a TLA-100.3 (Beckman Coulter) at 37200 × g at 4 °C for 20 min. The pellet was washed twice in cold acetone then left to air dry. The dry pellet was re-suspended in 1 × Laemmli buffer (50 mM Tris-HCl pH 6.8, 2% (w/v) SDS, 10% (v/v) glycerol, 1% (v/v) β-mercaptoethanol, 12.5 mM EDTA, 0.02% (w/v) bromophenol blue). All samples were denatured at 100 °C for 10 min prior to electrophoresis on a 10-20% sodium dodecylsulfate (SDS) gradient polyacrylamide gel using a dual slab gel kit (C.B.S. Scientific) run overnight at 100 V. Protein bands were visualized using Coomassie stain.

MASS SPECTROMETRY ANALYSIS OF PHAGE PROTEINS
Protein bands of interest were excised from SDS-PAGE gels and tryptically digested using the manufacturer's recommended protocol on the MassPrep robotic protein handling system (Waters). The extracted peptides from each sample were analyzed by means of nanoLC-ESI-MS/MS using the NanoAcquity/Q-ToF Ultima Global instrumentation (Waters) with a 45-min LC gradient. All MS data were corrected for mass drift using reference data collected from the [Glu1]-Fibrinopeptide B (human-F3261 Sigma) sampled each minute of data collection. The data were then used to interrogate a database made up of the predicted protein sequences from RLP1 or RPP1 appended with the common Repository of Adventitious Proteins sequences (http://www.thegpm.org/cRAP/index.html) using ProteinLynx Global Server v2.3. All protein identification was carried out in the in-house Biological Mass Spectrometry and Proteomics Facility of the School of Life Sciences at the University of Warwick.

ISOLATION AND CHARACTERIZATION OF PHAGES RPP1 AND RLP1
Two lytic phages, RLP1 and RPP1, infecting two strains of Roseovarius, were isolated from seawater collected from Langstone Harbour, Hampshire, UK and from water collected from station L4 in the English Channel, respectively. The phages were named using the nomenclature suggested by Kropinski et al. (2009): vB_Rsv217_RLP1 (RLP1, Roseovarius Langstone Podovirus), which infects Roseovarius (Rsv.) 217 (Schäfer et al., 2005), and vB_RsvN_RPP1 (RPP1, Roseovarius Plymouth Podovirus), which infects Rsv. nubinhibens (Gonzalez et al., 2003 with other N4-like phages (Zhao et al., 2009; Ceyssens et al., 2010; Kulikov et al., 2012; Fouts et al., 2013) and appears to be a property of many podoviruses (Sullivan et al., 2003; Hess, 2008). Infection using soft agar overlays with both phages produced clear plaques around 0.5-2 mm in diameter after ca. 48 h incubation with susceptible hosts, and infectivity was found to be unaffected by chloroform treatment. Transmission electron microscopy (TEM) of purified virions revealed phages with icosahedral heads and short tails (Figure 1), characteristics typical of the family Podoviridae. RLP1 and RPP1 had capsid head sizes of 72.4 ± 2 and 77.4 ± 5 nm, respectively.

HOST-VIRUS INTERACTIONS
In laboratory conditions RLP1 and RPP1 only infected host cells when in a semi-solid agar matrix, but not in liquid culture. Therefore, it was not possible to carry out a standard liquid-based one-step growth curve analysis, and a modified assay was performed using infected hosts embedded in double-layer agar plates in order to characterize some basic properties of these phages (see Materials and Methods for details). In the modified assay, immediate processing of samples taken during infection (to determine nascent and mature/free phage) was not possible, as both infected and uninfected host cells and nascent and mature/free phages were trapped within the top agar matrix and therefore not available for plaque assay. Instead, an additional overnight incubation of the top agar layer in phage buffer, to allow diffusion of phage particles out of the matrix, was required prior to enumeration. To quench phage replication mid-cycle, chloroform was added to the phage buffer. As a result, only the total plaque-forming units (PFU), comprising both nascent and mature phage, could be determined. The results suggest that the eclipse period for both phages is between 2 and 3 h and the latent period is between 4 and 6 h (Figure 2); however, without a free phage infection profile this cannot be verified. RLP1 appears to have a larger burst size than RPP1, ∼100 PFU cell⁻¹ and ∼10 PFU cell⁻¹, respectively. A precise number for burst size could not be calculated, as it is likely that the infected cells were not synchronized, and it is possible that multiple infections of a single bacterium occurred because infected cells were not diluted as they are in a standard one-step growth assay. Compared to EE36Φ1 and DSS3Φ2, which had latent periods of 2 and 3 h respectively, the phages obtained here had slightly longer latent periods, although the data have to be interpreted with caution due to the use of a modified one-step experiment.

GENOME SEQUENCE AND STRUCTURE OF PHAGES RPP1 AND RLP1
The genome sizes of phages RPP1 and RLP1 determined by whole-genome sequencing were 74.7 and 74.6 kb, respectively, which was in good agreement with estimates based on PFGE (Supplementary Material Figure 1). Both phages have a GC content of 49% in contrast to their hosts, Roseovarius sp. 217 and Rsv. nubinhibens, which have a GC content of 60 and 63%, respectively.
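For reference, GC content as quoted above is the percentage of G and C among the unambiguous bases of a sequence. A minimal sketch follows; the function name is hypothetical, and the choice to ignore ambiguous characters such as N is an assumption of this example.

```python
def gc_content(sequence):
    """Return the GC percentage of a DNA sequence, ignoring non-ACGT symbols."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# Toy sequence: two of the four unambiguous bases are G or C.
example_gc = gc_content("ATGCNN")
```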
Both phage genomes were determined to be linear dsDNA through Bal31/NdeI double digest treatment (Figure 3). The presence of two progressively shortening bands is indicative of a linear genome with defined ends. Gene prediction identified 92 and 91 putative CDSs in RLP1 and RPP1, respectively. Most CDSs (in both phages) appear to initiate at an ATG codon, although around 10% use GTG or TTG as start codons. Three transfer RNA genes were also identified in both phages for proline (CCA), isoleucine (ATC) and glutamine (CAA). The two Roseovarius phages are highly related in almost all putative CDSs (Figure 4). Twenty-eight (∼30%) of the predicted CDSs in RLP1/RPP1 are related to those found in Enterobacteria phage N4 and a further 19 CDSs are similar to genes found in roseophages DSS3Φ2, EE36Φ1 and pCB2047-B (Table 2). Unlike N4 and N4-like Pseudomonas phages, no promoter consensus sequences could be identified to assign the predicted CDSs to early, middle or late genes.
FIGURE 2 | Modified one-step growth curve for phages RLP1 and RPP1. Host cells were infected at an MOI of 0.001. One-step growth curves of RLP1 on Rsv. sp. 217 and RPP1 on Rsv. nubinhibens. The number of phage increases over time, indicating infection has occurred. There is a marked increase in phage between 2 and 3 h, which suggests a burst event has occurred during this period. Each growth curve was performed in triplicate.

www.frontiersin.org
October 2014 | Volume 5 | Article 506 | 5
The properties and genome sequences of these two novel phages are remarkably similar even though they were isolated from samples obtained 7 years apart, from two locations in UK coastal waters, and they infect different hosts (one isolated from the Caribbean, the other from the English Channel). The host strains of these highly similar phages are only moderately close relatives at 93.5% 16S rRNA gene identity, and in the case of RLP1, even the closest relative (Rsv. mucosus, 99% 16S rRNA gene identity with Rsv. sp. 217) was not infected. Although relatively few lytic phages of Roseobacters had been reported previously, it is intriguing that five of the seven lytic roseophages are closely related N4-like phages, suggesting that similar phages may be common in the marine environment.

PHYLOGENETIC ANALYSIS OF N4-LIKE CORE GENES
Analysis of the 25 sequenced N4-like phages identified 14 core genes; examples of these genes in N4 are listed in Table 3 (see Supplementary Material Table 1 for the full list). This number of core genes is similar to the 12 that were found for podoviruses infecting marine Synechococcus and Prochlorococcus (Labrie et al., 2013); however, the environments and hosts of the N4-like phages in this study are more diverse. Of these core genes, five have no known function (designated as gps 24, 25, 53, 55, 69 in N4), leaving only nine genes with a putative function that are core to N4-like phages. As might be expected, these are involved in processes that all N4-like phages would undergo regardless of the host they infect, including DNA replication and packaging (gps 45, 50 and 68), transcription (gp15 and gp16) and production of structural proteins (gps 54, 55, 56 and 59). Interestingly, the homolog of RNAP2 in the Achromobacter phages JWAlpha and JWDelta has been divided into two parts due to the insertion of a 186 amino acid CDS similar to gp8 from Celeribacter phage P12053L (Wittmann et al., 2014). In N4, middle genes are transcribed by a heterodimeric RNA polymerase, the subunits of which are encoded by genes RNAP1 and RNAP2 (Willis et al., 2002). Though it is not clear if the RNAP2 homolog is functional in JWAlpha and JWDelta, we believe that the function of the gene product is essential and hence warrants its inclusion in the list of core genes.
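The notion of a core gene used here, an ortholog group with a member in every genome examined, can be sketched as a simple set operation. The group identifiers, genome labels and protein identifiers below are hypothetical placeholders, and the function is an illustration rather than the pipeline used in the study.

```python
def core_groups(groups, genomes):
    """Return the sorted IDs of ortholog groups present in every genome.

    groups: {group_id: iterable of (genome, protein_id) members}
    genomes: iterable of all genome names under comparison
    """
    required = set(genomes)
    return sorted(
        group_id
        for group_id, members in groups.items()
        if required <= {genome for genome, _ in members}
    )


# Hypothetical groups: og1 is found in all three genomes, og2 in only two.
example_groups = {
    "og1": [("N4", "p1"), ("RPP1", "p2"), ("RLP1", "p3")],
    "og2": [("RPP1", "p4"), ("RLP1", "p5")],
}
core = core_groups(example_groups, ["N4", "RPP1", "RLP1"])
```

Restricting the genome list to a subclade gives the genes shared by that subclade in the same way.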
Gene order of the core genes is largely conserved across all N4-like phage isolates (Figure 4) with unique/clade-specific genes tending to be toward the ends of the genomes. The insertion of genes specific to a subset of phage such as the RN4 phages also occurs at conserved positions as can be seen for rnr and trx (Figure 4). The high degree of synteny of the core genes involved in control of gene expression, DNA replication and structural proteins of 25 N4-like phages suggests that a stable association within each core module has been formed; conversely the areas between the blocks of core genes are likely hot-spots for recombination.
Phylogenetic analysis of the N4-like phages based on an alignment of concatenated core gene products showed that, with the exception of Escherichia phage EC1-UPM, phages that infect closely related hosts cluster together on well supported branches (Figure 5). For example, the five RN4-phages, which infect marine Alphaproteobacteria, form a distinct clade away from their relatives that target gammaproteobacterial hosts. Furthermore, the two phages which infect Roseovarius species, RLP1 and RPP1, are further delineated from the other three RN4-phages; however, the phages EE36Φ1 and pCB2047-B, which infect Sulfitobacter strains EE36 and 2047, respectively, did not form a distinct subclade. Overall, the phylogeny based on concatenated core genes is concordant with that previously reported by Wittmann et al. based on the proteomes of 24 N4-like phages (Wittmann et al., 2014). The delineation of N4 phages into clades that infect specific hosts suggests that all N4 phages shared a common ancestor and have since specialized to infect a particular group of hosts.

COMPARATIVE ANALYSIS OF RN4 PHAGES pCB2047-B, DSS3Φ2, EE36Φ1, RLP1 AND RPP1
Analysis of the five RN4-phages identified 33 conserved CDSs, of which 14 are N4 core genes, five have homologs in N4 phage, five are found in other N4-like phages and nine are exclusive to the RN4 phages (Table 2). Interestingly, one of the conserved RN4 phage genes, gp37 (in RPP1), is a host-like metabolic gene (an auxiliary metabolic gene, AMG; highlighted in bold in Table 2). Gp37 encodes a thioredoxin, which has also been found in the T7-like roseophage SIO1 (Rohwer et al., 2000). A homolog of this gene is also found in phages JWAlpha and JWDelta, which were isolated from waste water treatment plants. It is interesting to note that whilst these phages infect Achromobacter xylosoxidans, a nosocomial pathogen widely distributed in the natural environment (Wittmann et al., 2014), other members of the Achromobacter genus are found in freshwater and marine environments (Brenner et al., 2005).
Phages DSS3Φ2, EE36Φ1, RLP1 and RPP1 share a further 22 CDSs (Supplementary Material Table 2), one of which, gp51 (in RPP1), is another AMG. RPP1 gp51 encodes a class II ribonucleoside diphosphate reductase (rnr). A previous study by Dwivedi et al. showed that the rnr genes in DSS3Φ2 and EE36Φ1 cluster together, with their bacterial host(s) forming a sister group (Dwivedi et al., 2013). A similar analysis using trx from the five RN4 phages showed no clear relationship between phage and host genes (Supplementary Material Figure 2). The presence of the AMG trx in the five RN4 phages is likely to represent an adaptation to the marine environment, as it is common to all N4-like phages that infect marine bacteria (Figure 4). Thioredoxin-encoding genes can also be found in T7-like phages, though they are more common in viruses from the marine environment, e.g., SIO1 and P60, than in enteric phages (Zhao et al., 2009). What the function of this gene might be is unclear; in bacteriophage T7 there is an increased rate of processing when thioredoxin binds to T7 DNA polymerase (Huber et al., 1987). However, whilst trx is found in other marine phages, it is not clear if it serves the same function as in T7, as the correct domain required for thioredoxin to bind may not be present (Hardies et al., 2003). Thioredoxin is known to have many other roles, one of which is as a hydrogen donor to ribonucleotide reductase. This is possibly the most parsimonious function for trx, as four out of five RN4 phages also carry the rnr gene encoding a ribonucleotide reductase. With rnr commonly found in other marine phages (Angly et al., 2006), it is thought to provide a mechanism of scavenging ribonucleotides in the oligotrophic marine environment. Therefore, it could be speculated that in RN4 phages ribonucleotide reductase is expressed to replicate the function of the host gene, and the phage-encoded thioredoxin acts in coordination as a specific hydrogen donor, in a manner similar to that in T4 (Holmgren, 1989).
Homologs were identified using OrthoMCL, which computes reciprocal best BLAST hits. An e-value cutoff of 1e-6 and I = 1.5 were used to identify the 14 core genes in the 25 publicly available N4-like phage genomes.
FIGURE 5 | Phylogram of concatenated core genes of the 25 sequenced N4-like phages. The neighbor-joining tree was based on a ClustalW alignment of the concatenated core gene amino acid sequences; bootstrap values were based on 1000 replicates. Apart from Escherichia phage EC1-UPM, N4-like phages that infect closely related hosts cluster together on well supported branches. The tree is rooted at mid-point and branches with less than 50% bootstrap support were collapsed; the scale bar indicates expected changes per site.

IDENTIFICATION OF A SECOND STRUCTURAL MODULE IN RPP1 PHAGE
We identified, using mass spectrometry, 13 structural proteins in the mature RPP1/RLP1 virions (Table 4, Supplementary Material Figure 3), including five which have been identified as N4 virion proteins (gps 52, 54, 56, 59, and 67 in N4 phage; gps 64, 66, 68, 71, and 77 in RPP1). Nine of the identified structural proteins in RPP1/RLP1 (gps 63, 64, 66, 68, 71, 77, 80, 81, and 82 in RPP1) are likely "late" gene products, as inferred through synteny with N4 phage and their location after the vRNAP gene and other late genes in N4 (Kazmierczak and Rothman-Denes, 2005). The remaining four (gps 25, 28, 31, and 32 in RPP1) are located near the homologs of N4 gp24 and gp25, which in Enterobacteria phage N4 are middle gene transcripts (Kazmierczak and Rothman-Denes, 2005). This suggests there is a second structural module (SSM) in RPP1 which is expressed during the mid-phase of infection. Ceyssens et al. (2010) also identified a similar additional cluster of structural genes not expressed with the late genes in Pseudomonas phage LIT1. BLASTp analysis shows that the RPP1 gp32 gene product (a 650 aa protein) shares similarity with gp230 in Pseudomonas myovirus 201φ2-1, which is a fusion of homologs of φKZ gp145 and gp146, both tail proteins. Interestingly, genes within the second structural cluster in LIT1 (gps 48-56) have strong similarity to Pseudomonas aeruginosa prophage proteins and tail proteins from other Podoviridae (Ceyssens et al., 2010). Taken together, these observations suggest that the additional structural module encodes and/or is associated with the production of virion tail protein(s).
The gene products 25 and 28 in RPP1, found in the tail protein-linked SSM, contain protein chaperone-like domains which could be associated with the translocation of the unfolded/semi-folded vRNAP out of the virion head into the host cell during initial infection. This is required as the virion polymerase is relatively large, 382.5 kDa, whilst the narrowest section of the tail tube in N4 is only 25 Å in diameter (Choi et al., 2008).

The location of these additional structural genes (upstream of the N4 gp45 homolog encoding an ssDNA-binding protein which activates transcription of late phage genes) suggests they are "middle" genes, but the advantage of expressing such proteins prior to the capsid genes is not yet clear. It may point to a gene regulation requirement and/or a possibility that tail proteins require maturation prior to assembly on the virion. In general, the constituent parts of phage virion particles (heads, tails and tail fibers) are made separately via subassembly pathways rather than a single linear pathway. Upon completion of the virion segment, the heads and tails combine first, forming complexes that are visible by electron microscopy, then the distal tail fibers are added (Campbell, 2007). It is possible that the assembly of the structurally complex tail portion of the virion may involve multiple steps and requires the assistance of helper proteins whilst the head is relatively simple to construct. Consequently, there might be an advantage in expressing some tail structural genes earlier than the genes coding for head, portal and other tail fiber genes.
Of the 13 structural proteins identified in RPP1/RLP1, 10 are conserved in all the sequenced RN4 phages. These include gps 31 and 32 (in RPP1) from the SSM. Interestingly, whilst gp31 is shared only by the RN4 phages, a homolog of gp32 is also found in Erwinia phage S6 as gp66 (Born et al., 2011). The aforementioned gps 25 and 28 (in RPP1) are found only in phages DSS3φ2, EE36φ1 and RLP1, suggesting this module could be a determinant of host specificity, whilst gene product 81 is found only in RLP1 and RPP1.

ENVIRONMENTAL DISTRIBUTION OF RN4-LIKE PHAGES
We searched for RN4-like phage sequences using all the CDS sequences in RPP1 as BLAST queries against a range of environmental metagenomic datasets downloaded from CAMERA (Sun et al., 2011). The number of hits was normalized for database size and gene size to allow comparison between metagenomes (see Materials and Methods for further details). Previous searches of Global Ocean Sampling (GOS) metagenomic data using RN4 polymerase genes as well as the other N4-like genes as query sequences suggested that N4-like phages infecting Roseobacters are mainly found in coastal areas and may be rare in open ocean environments (Zhao et al., 2009). We found that homologs of CDSs from RN4 phages are widespread in a number of environments (Figure 6A), with the highest frequency of counts in samples from the Antarctic, Saltern Sea and GOS metagenomes. As expected, given the known distribution of members of the Roseobacter lineage, we found very low detection rates in the metagenomes from freshwater lakes (MET6, MET7).
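The exact normalization is given in Materials and Methods and is not reproduced in this section. As a minimal sketch of the general idea, the example below assumes hits are scaled to "hits per kb of query gene per Mb of metagenome"; the scaling units, the function name and the counts are all hypothetical.

```python
def normalized_hits(hit_count: int, gene_length_bp: int, database_size_bp: int) -> float:
    """Hits per kb of query gene per Mb of metagenome (illustrative scaling only)."""
    if gene_length_bp <= 0 or database_size_bp <= 0:
        raise ValueError("gene length and database size must be positive")
    gene_kb = gene_length_bp / 1_000
    database_mb = database_size_bp / 1_000_000
    return hit_count / (gene_kb * database_mb)


# Hypothetical example: the same raw hit count for a 1.5 kb gene
# against a 100 Mb metagenome and against a 1,000 Mb metagenome.
small_db = normalized_hits(50, 1_500, 100_000_000)
large_db = normalized_hits(50, 1_500, 1_000_000_000)

print(f"small database: {small_db:.4f}, large database: {large_db:.4f}")
```

Without such scaling, long genes and large metagenomes would appear enriched simply because they offer more sequence in which a match can occur.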
A more detailed analysis of the distribution of hits found in the GOS metagenome was carried out based on the previously defined environments as reported by the Sorcerer II GOS expedition (Rusch et al., 2007). The distribution of three RPP1-like genes across the GOS sampling sites was determined using the three most abundant gene sequences identified previously, ORFs 24, 38, and 51, as queries. A large proportion of matches were found in locations characterized as a coastal environment (Figure 6B); this would be expected based on the distribution of Roseobacter hosts in coastal environments. However, for some genes (ORF36 and ORF51), a higher percentage of hits were found in samples from open ocean environments (Figure 6B), suggesting that there are more RN4-like phages, and their corresponding hosts, present in the open ocean environment than previously thought. This finding should nevertheless be considered with caution, as we presume the hosts of these phages belong to the Roseobacter lineage. There is the possibility that these are not RN4 phages and instead belong to a different family of podoviruses that infect another group of bacteria which have not yet been cultured and/or had their genome sequenced.
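The per-environment comparison amounts to tallying, for each query gene, the share of hits that fall in each habitat category. A minimal sketch follows; the category labels and hit lists are hypothetical and do not reproduce the data in Figure 6B.

```python
from collections import Counter


def percent_by_environment(hit_environments):
    """Map each environment label to its percentage of all hits for one gene."""
    counts = Counter(hit_environments)
    total = sum(counts.values())
    return {env: 100.0 * n / total for env, n in counts.items()}


# Hypothetical hits for one query gene, labeled by sampling-site category.
hypothetical_hits = ["coastal"] * 3 + ["open ocean"] * 5 + ["estuary"] * 2
shares = percent_by_environment(hypothetical_hits)

for env, pct in sorted(shares.items()):
    print(f"{env}: {pct:.0f}%")
```

Percentages of this kind describe where the hits for a gene occur; they do not by themselves identify which phage or host the matching reads came from.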

EVOLUTION OF THE N4-LIKE PHAGE GENUS AND BEYOND
The genome arrangement of core and variable genes within this phage genus bears striking similarity to the T4 superfamily, in which the genomes have been defined as bipartite (Krisch and Comeau, 2008): a conserved core comprised of the minimal essential genes required for viral multiplication, and a larger, highly variable set of facultative genes which collectively create an optimal environment, particular to that host, to enable successful infection. However, in the T4 superfamily most of the "core T4" genes encode either virus replication functions or virion structural components. As N4 has such an unusual gene expression mechanism (Kazmierczak and Rothman-Denes, 2005), it is perhaps not surprising to find that genes involved in transcription control are conserved, such as the three RNA polymerase genes and the single-stranded DNA-binding protein involved in late gene expression.
In the T4 superfamily, the number of core genes varies according to the subset of phages considered. For example, there are 75 common core genes when "true" T-even (T4), pseudo T-even (RB49) and schizo T-even (Aeh1) phages are compared (Clokie et al., 2010), but this falls to 38 when the cyanophages are included (Millard et al., 2009; Sullivan et al., 2010). With the N4-like phages, the subdivisions below genus level are not as clear, but it appears that core genes from phages which infect closely related hosts bear more similarity to each other than to those from phages infecting evolutionarily distant hosts, as seen by the clustering of the RN4, Pseudomonas, Enterobacter/Escherichia, and Vibrio phages (Figure 4). In addition to vertical gene transfer, horizontal gene exchanges could have occurred from both phage (Pseudomonas tail proteins and the trx gene) and host (Roseobacter host-like proteins, e.g., rnr) sources.
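The dependence of core size on the phages considered follows from treating the core genome as the intersection of the gene-family sets of every genome in the comparison. The sketch below illustrates this with toy data; the phage names and family assignments are hypothetical, and it assumes homologous genes have already been grouped into named families.

```python
def core_genes(gene_families_by_phage):
    """Return the gene families present in every phage of the given subset."""
    family_sets = [set(families) for families in gene_families_by_phage.values()]
    return set.intersection(*family_sets) if family_sets else set()


# Hypothetical gene-family content of three phages (toy data).
toy_genomes = {
    "phage_A": {"rnap1", "rnap2", "vrnap", "ssb", "portal", "rnr"},
    "phage_B": {"rnap1", "rnap2", "vrnap", "ssb", "portal", "trx"},
    "phage_C": {"rnap1", "rnap2", "vrnap", "ssb", "tail_fiber"},
}

core_two = core_genes({name: toy_genomes[name] for name in ("phage_A", "phage_B")})
core_three = core_genes(toy_genomes)

print(len(core_two), "core families for two phages;",
      len(core_three), "for all three")
```

Adding a genome can only keep the core the same size or shrink it, which is why broader comparisons (such as including the cyanophages in the T4 superfamily) yield fewer core genes.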
Phage biologists have long debated whether phage genera actually exist or whether there is instead a continuum of phage genes into which all tailed phages dip to find a "best-fit" genome. The mosaic model of Hendrix et al. (1999) offers the best compromise, proposing that early phages exchanged large blocks of genetic information prior to the demarcation of the now accepted supergroups. Fine tuning of host- and environment-specific genes between close relatives then followed, the consequence of which is phages with genomes created from a mixture of vertical and horizontal gene transfer events. The results from this study fit well with this theory. The 14 core genes, which encode and control general infectivity, appear to be derived from ancient phages, thus accounting for the homology and gene synteny found in the terrestrial and marine phages, whilst the plastic periphery is comprised of genes such as rnr, trx and the tail/tail fiber structural proteins, which provide environmental adaptations and determine the host range. However, further analyses are required to determine whether the latter set of genes was horizontally or vertically acquired. Such studies, and the characterization of more N4-like phages, in particular those from the marine environment, will allow further population genetic analyses of this diverse phage group.

ACKNOWLEDGMENT
This work was supported by BBSRC and NERC (UK). Jacqueline Z.-M. Chan was supported through a BBSRC PhD studentship. Hendrik Schäfer was supported by a NERC Advanced Fellowship (NE/E01333/1) and phage genome sequencing was funded by a grant from the NERC (NE/F010044/1). Ms Susan Slade from the Biological Mass Spectrometry and Proteomics Facility, University of Warwick, is thanked for performing mass spectrometry analyses. The GenePool facility, University of Edinburgh, is thanked for performing the genome sequencing.