Whole-Genome Sequencing for National Surveillance of Shigella flexneri

National surveillance of Shigella flexneri ensures the rapid detection of outbreaks to facilitate public health investigation and intervention strategies. In this study, we used whole-genome sequencing (WGS) to type S. flexneri in order to detect linked cases and support epidemiological investigations. We prospectively analyzed 330 isolates of S. flexneri received at the Gastrointestinal Bacteria Reference Unit at Public Health England between August 2015 and January 2016. Traditional phenotypic and WGS sub-typing methods were compared. PCR was carried out on isolates exhibiting phenotypic/genotypic discrepancies with respect to serotype. Phylogenetic relationships between isolates were analyzed by WGS using single nucleotide polymorphism (SNP) typing to facilitate cluster detection. For 306/330 (93%) isolates there was concordance between serotype derived from the genome and phenotypic serology. Discrepant results between the phenotypic and genotypic tests were attributed to novel O-antigen synthesis/modification gene combinations or indels identified in O-antigen synthesis/modification genes rendering them dysfunctional. SNP typing identified 36 clusters of two isolates or more. WGS provided microbiological evidence of epidemiologically linked clusters and detected novel O-antigen synthesis/modification gene combinations associated with two outbreaks. WGS provided reliable and robust data for monitoring trends in the incidence of different serotypes over time. SNP typing can be used to facilitate outbreak investigations in real-time thereby informing surveillance strategies and providing the opportunities for implementing timely public health interventions.


INTRODUCTION
Shigellosis is caused by four species of Shigella, including S. boydii, S. dysenteriae, S. flexneri and S. sonnei, transmitted via the fecal oral route. Symptoms typically start 1-2 days after exposure and include diarrhea, bloody diarrhea abdominal pain, fever, and tenesmus. The burden of shigellosis is highest in developing countries with up to 167 million episodes of diarrhea annually, leading to over a million deaths (Kotloff et al., 1999). A multicenter study of shigellosis in six Asian countries indicated the incidence rate to be highest in children under the age of 4 years old and in adults over 70 years old (von Seidlein et al., 2006). In the United Kingdom, S. flexneri is most commonly associated with causing travelers' diarrhea and outbreaks of gastrointestinal symptoms in men who have sex with men (MSM) . Furthermore, there are reports of increased intercontinental dissemination of multidrug resistant S. flexneri . Between 2004 and2015, 18,266 Shigella cases were reported by GBRU for England and Wales with S. flexneri accounting for 7075 (39%) of these infections (S. sonnei n = 8897, 49%; S. boydii n = 1364, 7%; S. dysenteriae n = 808, 4%; Shigella species unknown n = 122, 1%).
Whole-genome sequencing (WGS) has been shown to have potential in replacing traditional phenotypic and PCR methods (Gentle et al., 2016) for routine surveillance. This approach has the added value of further discriminating strains by their genetic relatedness to a single nucleotide polymorphism (SNP) level and has been used to investigate multiple gastrointestinal outbreaks at Public Health England (PHE) (McDonnell et al., 2013;Dallman et al., , 2016.
The aim of this study was to compare traditional serotyping with WGS for serotyping S. flexneri for routine public health surveillance and to evaluate utility of WGS data to support epidemiologically linked clusters and outbreak investigations.

Serotyping and PCR
Phenotypic identification of S. flexneri isolates were confirmed using the Omnilog GenIII MicroPlate (Biolog, Hayward, CA, United States). Serotyping was carried out using standard methods by slide agglutination using both commercially available monovalent antisera (Denka Seiken, Japan) and monoclonal antibody reagents (Reagensia AB, Sweden) and in-house antisera raised in rabbits (Gross and Rowe, 1985). Molecular PCR was carried out on the discrepancies between phenotypic and WGS typing results as previously described (Gentle et al., 2016) and a serotype assigned according to the gene combination detected (Supplementary Table S1).

Whole-Genome Sequencing
Genome sequencing and sequencing analysis were carried out as previously described . Genomic DNA extracted using the QiaSymphony DNA extraction platform (Qiagen) from 330 S. flexneri was fragmented and tagged for multiplexing with Nextera XT DNA Sample Preparation Kits (Illumina) and sequenced using the Illumina HiSeq 2500 at PHE. A reference database containing the gene sequences encoding the 12 O-antigen synthesis or modification genes described by Sun et al. (2011Sun et al. ( , 2012a, including wzxc1-5, wzxe1-5, wzx6, gtrI, gtrII, gtrIV, gtrV, gtrX, gtr1c, oac oac1b and opt, was constructed. Using the GeneFinder tool (Doumith, unpublished), FASTQ reads were mapped to the S. flexneri O-antigen synthesis or modification genes using Bowtie 2 (Langmead and Salzberg, 2012) and the best match to each target was reported with metrics including coverage, depth, mixture and nucleotide similarity in XML format for quality assessment. Only in silico predictions of serotype that matched to a gene determinant at >80% nucleotide identity over >80% target gene length were accepted. FASTQ sequences were deposited in the National Center for Biotechnology Information Short Read Archive under the bioproject PRJNA315192 (see Supplementary Table S2 for SRA identifiers).
To proactively detect outbreaks from WGS data, SNP typing was carried out on S. flexneri isolates belonging to clonal complex (CC) 245 and CC145. CC145 mostly comprises S. boydii serotypes but includes S. flexneri serotype 6, as this S. flexneri serotype was misidentified historically (Wirth et al., 2006;Chattaway et al., 2017). Hierarchical single linkage clustering was performed at seven descending thresholds of SNP distance ( 250, 100, 50, 25, 10, 5, 0) as previously described . This clustering results in a discrete seven digit code where each number represents the cluster membership at each descending SNP distance threshold. The resultant SNP addresses describes an isolates position in the S. flexneri population structure where two isolates with the same SNP addresses have 0 SNP differences between them.

Demographic of Patients
There were 330 isolates reported between 1st August 2015 and 18th January 2016, 236 (71%) were from males, 85 (26%) from females, and for nine (3%) cases the sex was not stated. Travel history was not provided for 154 (47%) cases, 85 cases reported travel 7 days prior to onset of symptoms and 91 (27%) reported that they did not travel during the 7 days prior to onset of symptoms. The most frequently reported destinations were India (n = 16, 5%) and Pakistan (n = 11, 3%) with other counties accounting for less < 1% each.

Comparison of WGS Predicted Serotype versus PCR Predicted Serotype Results
Of the 330 cultures tested prospectively by WGS, 306 (93%) had concordant results with phenotypic serotyping. Three of the mismatched results between the phenotypic and genotypic tests were attributed to mutations identified in O-antigen synthesis or modification genes (Table 1), including nonsense mutations resulting in early stop codons (n = 2) and a frameshift mutation. Repeat testing of sample 4 revealed the discrepancy was due to an auto-agglutination reaction of the strain with the sera ( Table 1).
The remaining mismatches were novel serotype gene profiles detected by WGS, both associated with outbreaks (Cluster 2, n = 13; Cluster 4, n = 5) (Tables 1, 2). Cluster 2 comprised 13 isolates associated with a local community outbreak. The isolates failed to agglutinate any of the serotype specific S. flexneri antisera and were positive for wzx1-5 and gtrIc gene targets in the PCR but were negative for gtrI. Typical strains of serotype 1c are positive for wzx1, gtrI, and gtrIc (Supplementary Table S1). These strains were designated 1c variants (1cv). Cluster 4 comprised five isolates of S. flexneri associated with an outbreak of gastrointestinal symptoms in five captive chimpanzees phenotypically identified as serotype 3a. The isolates had wzx1-5, oac, and grtX but were also positive for gtrII and were designated 3a variant (3av).

Outbreak Investigation Using SNP Clustering Typing
Analysis of the WGS data organized 161 of the 330 isolates into 36 five SNP single-linkage clusters of two or more isolates associated with CC245 and one cluster in CC145. The median number of cases in these clusters was 2 and ranged from 2 to 27; 32 (89%) clusters investigated comprised less than 5 isolates ( Table 2). It was not possible to identify epidemiological links associated with these small clusters using only the limited epidemiological data available from laboratory report forms. However, 9/32 (28%) comprised at least one case reporting recent travel abroad prior to onset of symptoms.
During the period of the study, four outbreaks were identified following routine surveillance of local hospital reports of gastrointestinal symptoms caused by S. flexneri. SNP typing confirmed that all the isolates belonging to each outbreak cluster were closely related and monophyletic. Cluster 1 was the largest cluster and had a high male to female ratio with the 97% of cases being adult males (for two cases the gender was not stated) ( Table 2). Isolates from this cluster were observed throughout the study period. The minimum SNP distance between isolates in this cluster was zero with the median distance 17. The maximum SNP distance between any two isolates was 64, however the majority of these SNPs were caused by a transposase mediated recombination of pic, encoding a serine protease. This demographic of adult males has previously been shown to be characteristic of clusters linked to sexual transmission among the MSM community (Borg et al., 2012;Gilbart et al., 2015;Baker et al., 2015;Simms et al., 2015). Cluster 1 was part of larger outbreak of  Table summarising the four clusters (containing five or more isolates) detected between 1st August 2015 and 18th January 2016. SNP address: hierarchical single linkage clustering performed on the pairwise SNP difference between all isolates at descending distance thresholds (250,100,50,25,10,5,0). Each number of the SNP address represents the cluster membership at that threshold level. SNP addresses with the '#' suffix match all subsequent clusters within that SNP address.ˆSNP distance corrected for recombination. * 2 cases gender were unknown and not included the ratio. $ determined as variant (v) due to absence of gtrl for 1cv or additional presence of gtrll for 3av.
S. flexneri serotype 2a previously described by Simms et al. (2015). Clusters 2 and 3 were community outbreaks and cases were geographically linked. They were temporally restricted and genetically homogenous, exhibiting 0-1 SNPs difference in the core genome between isolates. Despite a thorough epidemiological investigation, the source and route of transmission associated with Cluster 2 could not be determined. Cluster 3 was linked to consumption of contaminated food at a restaurant, most likely due to an infected food handler. Cluster 4 was associated with an outbreak of gastrointestinal symptoms in a group of captive chimpanzees with a minimum SNP distance between isolates of one and a maximum SNP distance of four.
The S. flexneri population structure clusters into seven distinct phylogenetic groups (PGs) . Phylogenetic analysis of the diversity of CC245 in the PHE collection is shown in Figure 1. Four out of seven clades are represented by samples received by PHE through routine surveillance. With respect to the clusters described in this study, Clusters 1, 3, and 4 fall within PG3 and Cluster 2 falls within PG1.

DISCUSSION
This study showed WGS to be a robust and reliable method for serotyping S. flexneri isolates and provided additional strain discrimination at the SNP level. There was high correlation between phenotypic serotyping and WGS serotyping (93%) thus facilitating the prospective comparison of WGS data with historical phenotypic data and ensuring continuity for monitoring trends in the incidence of different serotypes over time.
In three of the mismatches, the WGS derived serotype was shown to predict serotype based on gene presence which may not be expressed phenotypically. WGS provided insight on the effect of mutations on O-antigen modification genes and potential mechanisms that inactivate phenotypic expression. WGS analysis also identified novel serotype gene profiles associated with two outbreaks.
Serotype is not a robust phylogenetic marker, as the O-antigen synthesis/modification genes are encoded on mobile genetic FIGURE 1 | Maximum Likelihood tree showing a single representative from each 10 SNP single linkage cluster (N = 156) within the PHE clonal complex 245 database (N = 1333). Phylogenetic Groups as defined by Connor et al. (2015) are listed 1-7 and the four outbreak clusters annotated. Refer to Table 2 for details of the outbreak clusters. elements (prophages) . Prior to the implementation of WGS, detection of outbreaks of S. flexneri at PHE relied on the identification of epidemiological links between cases as serotyping was not discriminatory enough to detect outbreaks during routine surveillance. During this study, SNP typing provided microbiological evidence that the isolates associated with each of the four outbreaks identified were closely related. SNP typing has been used previously to investigate outbreaks of S. sonnei, and this study provides further evidence of the utility of this approach (McDonnell et al., 2013;Dallman et al., 2016;Mook et al., 2016;Baker et al., 2017).
Three of the outbreak clusters were temporally and genetically restricted with less than five SNP differences observed between outbreak isolates. In contrast, an outbreak representing on-going person-to-person transmission within the MSM community exhibited between zero and 64 SNP differences between isolates. S. flexneri may become endemic in defined populations or communities, such as religious communities or sexual networks. Over time circulating strains will accumulate mutations leading to an increase in observed SNPs between cases in that network. Similarly, extended transmission provides greater opportunities for horizontal gene transfer in the population and therefore bioinformatic analyses have to be robust to such influxes of variation.
This heterogeneity in genetic conservation between epidemiologically linked cases highlights the need to be flexible with respect to the case cluster definition. Epidemiological data associated with each outbreak, and WGS analyses of the deeper phylogenetic relationship between isolates, should be used in concert to inform outbreak investigations.
The use of WGS for routine surveillance of S. flexneri provides reliable and robust data that can be used to monitor trends in the incidence of different serotypes over time. WGS derived serotyping data ensures backward compatibility with historical phenotypic serotyping data. SNP typing can be used to facilitate outbreak investigations in real-time thereby enhancing surveillance strategies and providing the opportunities for implementing rapid public health interventions.

INFORMED CONSENT
Informed consent was not required as all data in this study was anonymized.

AUTHOR CONTRIBUTIONS
DG and AG performed the DNA extractions, PCR, phenotypic serotyping and identification. MC and CJ implemented the wet lab WGS pipelines and performed analysis. TD, HH, and DG performed bioinformatic analysis. MC and CJ wrote the manuscript and TD, DG, and HH contributed to the manuscript.

ACKNOWLEDGMENT
We would like to thank Vivienne do Nascimento and Dawn Hedges for curation of the serotyping scheme in GBRU.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmicb. 2017.01700/full#supplementary-material