Salmonella enterica Prophage Sequence Profiles Reflect Genome Diversity and Can Be Used for High Discrimination Subtyping

Non-typhoidal Salmonella is a leading cause of foodborne illness worldwide. Prompt and accurate identification of the sources of Salmonella responsible for disease outbreaks is crucial to minimize infections and eliminate ongoing sources of contamination. Current subtyping tools including single nucleotide polymorphism (SNP) typing may be inadequate, in some instances, to provide the required discrimination among epidemiologically unrelated Salmonella strains. Prophage genes represent the majority of the accessory genes in bacteria genomes and have potential to be used as high discrimination markers in Salmonella. In this study, the prophage sequence diversity in different Salmonella serovars and genetically related strains was investigated. Using whole genome sequences of 1,760 isolates of S. enterica representing 151 Salmonella serovars and 66 closely related bacteria, prophage sequences were identified from assembled contigs using PHASTER. We detected 154 different prophages in S. enterica genomes. Prophage sequences were highly variable among S. enterica serovars with a median ± interquartile range (IQR) of 5 ± 3 prophage regions per genome. While some prophage sequences were highly conserved among the strains of specific serovars, few regions were lineage specific. Therefore, strains belonging to each serovar could be clustered separately based on their prophage content. Analysis of S. Enteritidis isolates from seven outbreaks generated distinct prophage profiles for each outbreak. Taken altogether, the diversity of the prophage sequences correlates with genome diversity. Prophage repertoires provide an additional marker for differentiating S. enterica subtypes during foodborne outbreaks.

Non-typhoidal Salmonella is a leading cause of foodborne illness worldwide. Prompt and accurate identification of the sources of Salmonella responsible for disease outbreaks is crucial to minimize infections and eliminate ongoing sources of contamination. Current subtyping tools including single nucleotide polymorphism (SNP) typing may be inadequate, in some instances, to provide the required discrimination among epidemiologically unrelated Salmonella strains. Prophage genes represent the majority of the accessory genes in bacteria genomes and have potential to be used as high discrimination markers in Salmonella. In this study, the prophage sequence diversity in different Salmonella serovars and genetically related strains was investigated. Using whole genome sequences of 1,760 isolates of S. enterica representing 151 Salmonella serovars and 66 closely related bacteria, prophage sequences were identified from assembled contigs using PHASTER. We detected 154 different prophages in S. enterica genomes. Prophage sequences were highly variable among S. enterica serovars with a median ± interquartile range (IQR) of 5 ± 3 prophage regions per genome. While some prophage sequences were highly conserved among the strains of specific serovars, few regions were lineage specific. Therefore, strains belonging to each serovar could be clustered separately based on their prophage content. Analysis of S. Enteritidis isolates

INTRODUCTION
Salmonella is a genus of Gram-negative bacteria composed of two species: Salmonella bongori and Salmonella enterica. The latter is divided into six subspecies and one of these subspecies, S. enterica subsp. enterica is divided into more than 2,500 serovars based on the surface antigens (Grimont and Weill, 2007). S. enterica includes all Salmonella of medical importance; these are divided into typhoidal serovars (Typhi and Paratyphi A, B, and C) and thousands of non-typhoidal serovars (NTS). Human infection with non-typhoidal Salmonella strains typically results in a selflimiting enterocolitis known as salmonellosis. The World Health Organization (WHO) estimated that approximately 180 million cases of salmonellosis occur in people every year with 298,000 deaths (WHO, 2016).
Salmonellosis typically results from consumption of contaminated food or water.
High discrimination subtyping of isolates is essential to link outbreak cases and for source tracking, so that intervention to prevent further illnesses may be made. This can be challenging for S. enterica, for though it is serologically diverse it is a relatively clonal pathogen and the majority of reported cases of salmonellosis can be attributed to a small group of serovars. In Canada, 75% of reported cases of salmonellosis are attributed to 10 serovars, with 57% of cases attributed to only 3 serovars: S. Enteritidis (30%), S. Heidelberg (15%), and S. Typhimurium (12%) (Public Health Agency of Canada [PHAC], 2002). Consequently, conventional Salmonella sub-typing schemes, including pulsed field gel electrophoresis (PFGE), phage typing and Multiple Locus Variable Number Tandem Repeat Analysis (MLVA) may lack the level of discrimination required to support effective public health decision making (Cooke et al., 2007;Feasey et al., 2012). Current whole genome sequencing (WGS)-based technologies can provide accurate and effective tools for the rapid subtyping of foodborne Salmonella (Brüssow et al., 2004;Ashton et al., 2017). However, in cases where a clearer distinction is needed among many closely related strains, e.g., delineating different outbreak strains or separating outbreak and sporadic strains, current WGS methodologies do not consistently meet the required need. Therefore, developing a parallel WGS-based subtyping tool may improve the differentiation among highly clonal Salmonella isolates.
The pan-genome of a bacterial species consists of a core genome which is the collection of coding sequences present in all members of the species and an accessory genome made up of sequences that are not consistently present among all members. The core genome defines the common attributes of the species whereas the accessory genome affect strain fitness by encoding proteins that influence virulence, antigenic structure, antimicrobial resistance, and metabolic characteristics. The accessory genome can represent up to 10% of the typical S. enterica genome (Jacobsen et al., 2011). These sequences include mobile genetic elements such as prophages, plasmids, transposons, and insertion elements (Thomson et al., 2008). Prophage sequences are common in the genome of Salmonella and their potential as molecular markers of genomic diversity has been previously investigated (Brüssow et al., 2004;Cooke et al., 2007;Feasey et al., 2012). For example, prophage sequences were used to discriminate between different field isolates of S. Typhimurium (Cooke et al., 2007). Moreover, two African and two global epidemic clades of S. Enteritidis were recently discriminated by their prophage repertoire (Feasey et al., 2016). Others have used a single prophage gene, the prophage integrase, to demonstrate genomic diversity in S. enterica (Colavecchio et al., 2017a). However, no single gene is likely to provide a comprehensive enough information for taxonomic analyses (Rohwer and Edwards, 2002) and is likely applicable only within a narrow spectrum of diversity such as within the same serovar of Salmonella (Colavecchio et al., 2017a).
Prophage sequences have been shown to encode virulence factors, toxins, and antimicrobial resistance genes (Brüssow et al., 2004;Colavecchio et al., 2017b) all of which contribute to diversity of strains. Therefore, we postulated that prophage features may be one of the principal drivers of diversification among and within serovars of Salmonella. To that end, we aimed to conduct a comprehensive characterization of prophage diversity in 151 serovars of S. enterica and to test the performance of the whole prophage repertoire as a measure of the host organism diversity. We tested the discriminative ability of this approach among S. enterica serovars, closely related isolates of the same serovar from different outbreaks and finally between predominant S. enterica serovars and other closely related members of the family Enterobacteriaceae including Proteus, Hafnia, and Citrobacter, which are sometimes misidentified as Salmonella isolates.

Prophage Sequence Detection
Even though the Salmonella genomes used in this study were assembled previously using A5 (Tritt et al., 2012;Coil et al., 2015) at a high level of quality (Emond-Rheault et al., 2017), we decided to reassemble the raw data using SPAdes genome assembler version 3.10.1 with updates available at: http://bioinf.spbau.ru/ en/spades (Bankevich et al., 2012). The raw paired-end Illumina reads were trimmed to a minimum PHRED quality score of Q10 from the 3 ′ end using the BBDuk tool (http://jgi.doe.gov/ data-and-tools/bbtools/). Trimmed reads shorter than 64 bases were discarded and the remainders were merged using BBMerge (http://jgi.doe.gov/data-and-tools/bbtools/). Both paired and unpaired reads were kept for de novo assembly, which was performed using SPAdes genome assembler 3.10.1 algorithm (Bankevich et al., 2012). The quality of each assembled genome was assessed following analysis with the QUAST software (Gurevich et al., 2013) and all were satisfactory based on the number of contigs, size of the largest contig, number of bases in the assembly, and NG 50 (data not shown). Prophage sequences within the assembled contigs of each genome were identified with the PHAge Search Tool Enhanced Release (PHASTER; Zhou et al., 2011;Arndt et al., 2016). All contigs <2 Kb were discarded before submitting the assembly to PHASTER, as PHASTER processes only contigs of length >= 2 kb.

Prophage Sequence Typing
We developed a prophage sequence typing procedure by clustering the identified prophages sequences present in the genomes. The prophage regions identified by PHASTER were extracted from each genome and every sequence was renamed by adding the sample name to the sequence header. All identified prophage regions from the strains analyzed were clustered using CD-HIT-EST (Fu et al., 2012) based on 90% identity and 90% prophage sequence length coverage. Increasing the stringency to identity and sequence coverage of 95, 99, and 100% did not improve the clustering efficiency at the serovar level (data not shown), hence the 90% threshold was selected. Due to the high genomic clonality of S. Enteritidis isolates we used 100% sequence identity and alignment coverage for comparing the prophage profiles of S. Enteritidis outbreak isolates. The representative sequence of each cluster generated by CD-HIT-EST was used to determine the prophage identity of that cluster by local alignment of the sequences against the PHASTER prophage virus database using BLASTX (Altschul et al., 1990). A prophage cluster matrix table was then constructed from the CD-HIT-EST and BLASTX results showing the prophage cluster number, the length and the identity of each prophage sequence detected in every analyzed genome (Supplementary Table 2).

Prophage Diversity and Composition Analyses
The prophage matrix table was fed to Quantitative Insights Into Microbial Ecology (QIIME) analytical platform (Caporaso et al., 2010) to determine the diversity of prophage repertoire within (alpha diversity) and among (beta diversity) the different strains in addition to prophage taxonomy summarization. To test the variability of prophage profiles between two or more groups of samples, the diversity of the identified prophage clusters was used to calculate the Bray-Curtis distances among different strains and the generated distances were employed in Unweighted Pair Group Method with Arithmetic mean (UPGMA) clustering. The generated UPGMA tree files were exported to the Interactive Tree of Life (iTOL) software (Letunic and Bork, 2011) for viewing and editing. The identified RE-2010 sequences in S. Enteritidis genomes from known outbreaks (n = 111) were extracted from PHASTER results and aligned with Multiple Alignment using Fast Fourier Transform (MAFFT; Katoh and Standley, 2013) tool from Galaxy platform (https://usegalaxy.org) and the alignment tree file was uploaded to iTOL platform for editing. We conducted an analysis of similarity (ANOSIM) on the prophage sequence profiles using QIIME where the p-value was calculated using 999 permutations; p-value <0.05 was considered significant. ANOSIM generates R-value that lies between 0 and +1 where the closer the value to +1 the higher the dissimilarity of the prophage profiles among the compared groups of isolates (i.e., among different serovars, outbreaks, or species) (Mottawea et al., 2016). Furthermore, a supervised learning serovar classification using QIIME was conducted by training the supervised random forests classifier with the prophage clusters as predictors and the serovar name as the class labels. This supervised learning analysis generated a ranking of prophage clusters by importance, estimated the classifier out-of-bag error, and predicted the serovar label in addition to the probabilities of classification.

Prophage Sequence Diversity in S. enterica Serovars
The prophage sequence diversity was characterized among the whole genome sequences of 1,760 S. enterica isolates with 86% or 1,508 isolates belonging to 151 serovars, while the serovar designations of 14% or 252 isolates were unknown. Overall, 11,297 discrete prophage sequences were identified among all the total genomes, with a median of 5 and interquartile range (IQR) of three prophage sequences per isolate. These sequences were grouped into 3,126 clusters based on 90% sequence identity and 90% long sequence coverage. No prophage sequence was detected in 3 S. enterica strains belonging to serovars Braenderup, Brandenburg and Typhimurium, despite a genome sequence coverage of 16, 22, and 19X, respectively. We identified only one prophage sequence in 12 strains (4 Havana, 3 Bareilly, and 1 strain from each of Gallinarum, Infantis, Liverpool, Putten, and Stanleyville serovars). In contrast, the highest number of prophage sequences in one genome (n = 15) was detected in a strain belonging to serovar Kisarawe. Prophage sequences were most prevalent in S. Typhimurium (148 strains) (Median ± IQR of 9 ± 2 prophages per isolate). The genomes of the other two common serovars; S. Enteritidis (208 strains) and S. Heidelberg (201 strains) carried 5 ± 1 and 6 ± 2 prophage sequences per isolate, respectively. S. Havana serovar (10 strains) had the fewest prophages (2 ± 1) (Figure 1 and Supplementary Table 3).

Prophage Sequence Profiles in S. enterica Genomes
In order to assign a prophage identity to every prophage sequence detected, the CD-HIT-identified reference sequence from each prophage cluster (see under Prophage sequence clustering, above) was aligned against the PHASTER prophages/virus database (last updated on Feb 9, 2017), using the BLASTX tool. Among the 151 serovars (n = 1,508 isolates), we detected 154 different prophages (Figure 2 and Supplementary Table 3).

Exploiting Prophage Sequence Diversity for Salmonella enterica Subtyping
In order to assess the performance of genomic prophage sequence diversity as a discriminatory marker for the different Salmonella serovars, we calculated Bray-Curtis distance among the prophage profiles of different strains and the generated distance matrix was used to construct a phylogenetic tree of the strains. In order to reduce the bias generated by low isolate number and prophages that may be detected due to analytical errors, this analysis included those serovars that had eight or more Salmonella strains (n = 1,350 isolates) and by using only prophage clusters detected in at least four strains. We found that strains within each serovar clustered separately from strains of other serovars as indicated by the constructed phylogenetic tree (Figure 3). Clustering of serovars based on prophages was found to be statistically significant following ANOSIM evaluation (R = 0.88, p = 0.001). In order to test the ability to predict the S. enterica serovar from the prophage sequences, we also conducted a supervised machine learning algorithm using random forests classifier and out-of-the-bag prediction to estimate the generalization error. The estimated generalization error was 0.15, while the baseline error (for random guessing) was 0.85 and the ratio of baseline error to the observed error was 5.38. The model confusion matrix, cv_probabilities and mislabelling matrix are detailed in Supplementary Tables 4-6. In general, these results indicate a significant correlation between the identified prophage profiles and the genome diversity of different S. enterica serovars.

Prophage Sequence Diversity Differentiates Salmonella enterica Serovar Enteritidis Outbreak Isolates
Salmonella enterica serovars are characterized by a high genetic clonality. For this reason, conventional typing methods such as PFGE cannot reliably differentiate between closely related strains (Bekal et al., 2016). Although typing by SNPs has been successfully used to differentiate strains of highly clonal serovars, such as S. Enteritidis Ashton et al., 2016) the high resolution does not always result in a separation of outbreak and sporadic strains. For these reasons, development of an accurate, parallel subtyping method should provide additional tools to discriminate bacterial strains. To this end, we investigated the correlation between prophage sequence profiles and the genome diversity among 111 S. Enteritidis strains isolated during seven outbreaks. A total of 607 prophage sequences was detected in the genomes of these strains. The identified prophage sequences were clustered based on 100% identity and sequence coverage. A UPGMA clustering tree was generated based on Bray-Curtis distances among the prophage profiles of these strains. We identified 43 prophage sequence clusters that belong to five different prophages in our dataset. The UPGMA clustering resulted in a complete separation of these strains into seven clusters with each cluster specific to an outbreak, except only one strain isolated from outbreak 7 that clustered with outbreak 4 (Figure 4; ANOSIM R = 0.98; p = 0.001). The large Gifsy-2 prophage fragment was consistently detected in all isolates (100%, fragment length = 31,114-31,120 bases). The small Gifsy-2 prophage fragment (8,541 bases) was also very prevalent (93%) but was missing in a few isolates. Similarly, prophage Erwinia phage vB was present in almost all the isolates (99%) either as a small (13,087 bases), or large (25,792 bases) fragment. On the other hand, four different sequence variants of prophage ST64B were detected in our Enteritidis dataset (Image 1): one variant was specific to outbreaks 2 and 3 (length = 34,766 bases), while the other three variants were distributed among the isolates of outbreaks 1 and 5 (lengths = 50,483, 54,013, and 64,334 bases). We did not detect phage ST64B sequence in strains from outbreaks 4, 6, and 7. Interestingly, the prophage RE2010 was detected as six different clusters (Figure 5): one specific for each of the outbreaks 2, 3, 4, 5, 6, and 7, whereas this prophage was not detected in outbreak 1 (Figure 4). Finally, we did not detect any sequence corresponding to the prophage SJ46 in outbreak 6 (Figure 4), but it was identified in other outbreaks with a length of 8,109 bases. These findings indicate that prophage sequence profiles distinguished among the different S. Enteritidis subtypes responsible for distinct foodborne outbreaks.

Non-typhoidal Salmonella Isolates Harbour Prophage Sequence Profiles Distinct From Other Closely Related Enterobacteriaceae
The bacterial set in this study included organisms commonly misidentified as S. enterica strains such as Citrobacter, Proteus, and Hafnia, as well as S. bongori. To investigate the genomic prophage sequence profiles in these pathogens as compared to S. enterica, we extracted the prophage clusters identified in these bacteria along with those identified in S. Enteritidis, S. Heidelberg and S. Typhimurium isolates, as these serovars represent the majority of our S. enterica set. UPGMA clustering showed that these organisms which can cause false positives clustered separately from S. Enteritidis, S. Heidelberg and S. Typhimurium isolates (Figure 6). ANOSIM R-value was 0.86 with p-value of 0.001. In contrast to Salmonella isolates, we observed that the prophage profiles of these bacteria were highly variable among strains analyzed with the exception of P. mirabilis, which were clustered together. This conservation among P. mirabilis isolates mainly arises from one prophage cluster identified as Shewanella phage 1/41 NC 025458 which was present in 13 out of the 14 isolates (Supplementary Table 7). These results revealed that the prophage sequences integrated in the genome of S. enterica  Enteritidis genomes involved in known outbreaks (n = 111) were extracted from PHASTER results and aligned with Multiple Alignment using Fast Fourier Transform (MAFFT) tool from Galaxy platform (https://usegalaxy.org). The alignment tree was exported to the Interactive Tree of Life (iTOL) software for viewing and editing. Numbers on branches represent the corresponding branch length. Blue color represents strains of outbreak 6; Red for outbreak 4; Purple for outbreak 8, Orange for outbreak 3, Green for outbreak 7, and Black for outbreak 2. isolates are distinct from those harbored by the genomes of other closely related bacteria of the family Enterobacteriaceae.

DISCUSSION
Bacteriophages appear to be significant drivers of the evolution of pathogenic bacteria including S. enterica (Figueroa-Bossi and Bossi, 1999;Figueroa-Bossi et al., 2001;Davies et al., 2016;Diard et al., 2017). Accordingly, prophage sequences constitute a significant component of Salmonella accessory genomes Owen et al., 2017) and have been reported to enable discrimination among closely related Salmonella lineages (Kingsley et al., 2009;Okoro et al., 2015;Colavecchio et al., 2017a;Owen et al., 2017). This discriminatory power of prophage sequences has been demonstrated also for two different serovars of Streptococcus pyogenes (Brüssow et al., 2004).
The current study reports a pipeline for the rapid identification of genomic prophage sequences in Salmonella and to use them to discriminate between different epidemiological categories and serovars. Previous efforts to characterize bacterial strains and distinguish one from another based on their prophage content have followed different approaches. Some studies have relied on descriptive methods that compare the number of prophages and their identity among a certain number of bacterial strains within specific taxonomic groups (Okoro et al., 2015;Zheng et al., 2017). In general, these studies focussed on prevalent and intact prophage regions while neglecting less predominant and incomplete ones. Some prophages exhibit sequence variants that are differentially integrated among distinct host organisms and serovars (Mmolawa et al., 2003;Bobay et al., 2013;Hiley et al., 2014;Zheng et al., 2017). For example, Bobay and coauthors have detected different prophage sequences in the whole genome of 20 strains of S. enterica and classified them based on their sequence similarity in order to understand prophage integration patterns and their genetic adaptation to host genomes (Bobay et al., 2013). These sequence variants may arise from modular exchange with other prophage DNA, mutation, adaptational shift, and/or drift (Brüssow et al., 2004;Boyd et al., 2012;Omer et al., 2017) and these changes can be expected to have evolutionary implications (Diard et al., 2017). To that end, prophage diversity within bacterial strains does not only rely on the presence or absence of a certain prophage, but also on the conservation of the prophage sequences. Few studies have targeted these sequence variations in certain phage genes to assess the diversity of both phages and their host genomes (Adriaenssens and Cowan, 2014;Colavecchio et al., 2017a). Since there is no universal phage gene (Rohwer and Edwards, 2002), relying on one gene will narrow the spectrum of the discriminative approach to phages that have this gene. Additionally, each prophage sequence feature represents a specific evolutionary event in the bacterial genome (Brüssow et al., 2004).
Here, we analyzed every sequence detected by PHASTER (Arndt et al., 2016) as a distinct operational unit that has its own discriminative weight among Salmonella strains. Moreover, we identified prophage sequence clusters based on the similarity and coverage lengths so that sequence variants are analyzed as distinct operational units. It should be noted that the performance of the current pipeline is limited by the sensitivity and the accuracy of the incorporated assembly and prophage detection tools. For example, we may lose some prophage regions because of improper sequence assembly. Furthermore, PHASTER appears to have the same sensitivity index (85.4%) as its predecessor, PHAST (Zhou et al., 2011). Therefore, the performance errors of the genome assembly and prophage detection tools involved in our pipeline could be reflected on the sensitivity and the prediction accuracy of our approach. Despite this potential limitation, the use of PHASTER proved to be adequately sensitive and the procedure outlined highly discriminative of epidemiologically unrelated strains of Enteritidis.
Salmonella populations are characterized by highly variable prophage profiles (Figueroa-Bossi et al., 2001;Brüssow et al., 2004;Owen et al., 2017). This is largely consistent with our finding that the prophage repertoires identified in this study are highly variable among different Salmonella serovars. The number of prophage sequences detected ranged from 1 to 15 regions per genome. Even within the same serovar, there is a high variability in prophage combinations between strains. For example, we detected a median of 9 ± 2 (IQR) prophages per genome in S. Typhimurium isolates. Eight of these prophages (Burkholderia phage BcepMu, Gifsy-2, Gifsy-1, Salmonella phage SJ46, Enterobacteria phage SfV, Escherichia phage EB49, Escherichia phage MG1655, Salmonella phage ST64B) are prevalent among S. Typhimurium genomes. Gifsy-2, Gifsy-1, and ST64B are known to be common among different S. Typhimurium lineages (Kingsley et al., 2009;Okoro et al., 2015;Owen et al., 2017). Certain prophages, such as Gifsy-2, are highly prevalent among different Salmonella serovars. We detected Gifsy-2 in 61 serovars of S. enterica and it is present in more than 95% of S. Enteritidis, S. Heidelberg and S. Typhimurium isolates. Still, Gifsy-2 exhibits three different sequence variants among the three serovars (Image 2). This result is consistent with Colavecchio et al. who reported that one gene of this prophage was sufficient to differentiate between S. Enteritidis and S. Heidelberg isolates (Colavecchio et al., 2017a). In addition to the genetic variants of common prophages, newly acquired prophages or lineage-specific prophages may contribute to strain divergence as shown recently with S. Newport (Zheng et al., 2017). In general, this variability in prophage sequence types illustrates the correlation between prophage repertoires and the genome diversity of different Salmonella serovars.
Foodborne Salmonellosis is an important concern for public health (Ziebell et al., 2017). To reduce the impact of outbreaks caused by foodborne pathogens, timely subtyping of isolates to incriminate sources of contamination is crucial. Conventional Salmonella typing methods such as PFGE, phage typing (PT), MLVA, and multiple amplification of phage locus typing (MAPLT) do not always provide the required level of discrimination (Assis et al., 2017;Ziebell et al., 2017). It was proposed previously that prophage composition could differentiate between S. Enteritidis subtypes during foodborne outbreaks (Ogunremi, 2013). Although prophage sequences are highly variable among Salmonella and can sometimes be transient and mobile , herein we have established that these repertoires correlate with outbreak strains of Salmonella with these results illustrate how prophage sequences can accurately distinguish between epidemiologically unrelated isolates. While Gifsy-2 prophage sequence is common among all the isolates from different outbreaks as noted by others (Colavecchio et al., 2017a), three other prophages (RE2010, ST64B, and SJ46) exhibit outbreak-specific sequences and profiles. Interestingly, a single prophage sequence (phage RE2010) was enough to differentiate between the investigated seven outbreaks. This finding supports a previous report on the contribution of RE2010 to separate S. Enteritidis isolates with the same PFGE pattern (Allard et al., 2013) but considerably expanded on previous observations by showing a common prophage profile among isolates recovered from each outbreak of S. Enteritidis, which distinguished the different outbreaks. Furthermore, prophages ST64B and SJ46 have been shown previously to be present as multiple variants among closely related S. enterica genotypes of the same serovar (Mmolawa et al., 2003;Hiley et al., 2014;Zheng et al., 2017). Therefore, we conclude that prophage diversity can serve to differentiate epidemiologically unrelated subtypes of S. Enteritidis. In addition, considering that single nucleotide polymorphism (SNP) in a 60-loci test can differentiate between unrelated S. Enteritidis , detecting SNPs within the sequence of universally distributed prophages, such as Gifsy-2 or RE2010, may provide a synergistic mean of differentiating S. Enteritidis subtypes during foodborne outbreaks.
One difficulty in the detection of foodborne Salmonella is the misidentification of strains of Citrobacter, Hafnia, and Proteus spp. as Salmonella (Muldoon et al., 2007;Fang et al., 2009;Margot et al., 2013). The results of this study revealed that the prophage profiles in these false positive strains are completely different from those of S. enterica isolates. In contrast to Salmonella prophages, which are characterized by high degree of conservation within the same serovar, the prophage profiles of Citrobacter and Hafnia spp. are characterized by high degree of variability among strains. For P. mirabilis, only one prophage remnant sequence is common among its isolates, while the other sequences are highly variable. Therefore, it is clear that prophage sequences could provide specific S. enterica targets for development of inclusive and exclusive detection tools of foodborne Salmonella pathogens.
In conclusion, this study reports a new pipeline for the detection of prophage sequence diversity in S. enterica. The diversity of the prophage sequences described here was found to be sufficient in differentiating among S. enterica serovars, between genetically related isolates from different epidemiological events and between Salmonella and Citrobacter, Hafnia, and Proteus. Our data reveal a high diversity among S. enterica prophages which is herein applied to develop a novel subtyping tool for identifying S. enterica clusters.

AUTHOR CONTRIBUTIONS
WM developed the prophage sequence analyses pipeline, conducted the data analyses, prepared the figures and tables and drafted the manuscript; M-OD wrote the pipeline script for prophage sequence analyses; AD, VU, and HH contributed to data analyses; JJ, LF, J-GE-R, JH, IK-I, BB, and RL developed the whole genome sequences of all the bacterial isolates, developed, curated, and maintained the SalFos database; AG, EB, EF, GA, JTW, SG, MW, FD, SM, SB, and LG provided bacterial isolates for whole genome sequencing and provided metadata; SB, RL, and LG were involved in the study design, data analysis and review; DO conceived the study and participated in data analysis, development of manuscript and review. All the authors revised and approved the manuscript.

FUNDING
This study was funded by Genome Canada grant.

ACKNOWLEDGMENTS
We would like to greatly thank the members of the genomics analysis and bioinformatics platforms at IBIS. SM holds a Tier 1 Canada Research Chair in Bacteriophages. WM is funded by Genome Canada. DO's contribution was funded by the Canadian Food Inspection Agency. We would like to express our special thanks to Sandeep Tamber, Danielle Malo, Ann Perets, Kenneth E. Sanderson, and Roger Stephan for providing bacterial strains.