Genome-Enabled Molecular Subtyping and Serotyping for Shiga Toxin-Producing Escherichia coli

Foodborne pathogens are a major public health burden in the United States, leading to 9.4 million illnesses annually. Since 1996, a national laboratory-based surveillance program, PulseNet, has used molecular subtyping and serotyping methods with the aim to reduce the burden of foodborne illness through early detection of emerging outbreaks. PulseNet affiliated laboratories have used pulsed-field gel electrophoresis (PFGE) and immunoassays to subtype and serotype bacterial isolates. Widespread use of serotyping and PFGE for foodborne illness surveillance over the years has resulted in the accumulation of a wealth of routine surveillance and outbreak epidemiological data. This valuable source of data has been used to understand seasonal frequency, geographic distribution, demographic information, exposure information, disease severity, and source of foodborne isolates. In 2019, PulseNet adopted whole genome sequencing (WGS) at a national scale to replace PFGE with higher-resolution methods such as the core genome multilocus sequence typing. Consequently, PulseNet's recent shift to genome-based subtyping methods has rendered the vast collection of historic surveillance data associated with serogroups and PFGE patterns potentially unusable. The goal of this study was to develop a bioinformatics method to associate the WGS data that are currently used by PulseNet for bacterial pathogen subtyping to previously characterized serogroup and PFGE patterns. Previous efforts to associate WGS to PFGE patterns relied on predicting DNA molecular weight based on restriction site analysis. However, these approaches failed owing to the non-uniform usage of genomic restriction sites by PFGE restriction enzymes. We developed a machine learning approach to classify isolates to their most probable serogroup and PFGE pattern, based on comparisons of genomic k-mer signatures. We applied our WGS classification method to 5,970 Shiga toxin-producing Escherichia coli (STEC) isolates collected as part of PulseNet's routine foodborne surveillance activities between 2003 and 2018. Our machine learning classifier is able to associate STEC WGS to higher-level serogroups with very high accuracy and lower-level PFGE patterns with somewhat lower accuracy. Taken together, these classifications support the ability of public health investigators to associate currently generated WGS data with historical epidemiological knowledge linked to serogroups and PFGE patterns in support of outbreak surveillance for food safety and public health.

Foodborne pathogens are a major public health burden in the United States, leading to 9.4 million illnesses annually. Since 1996, a national laboratory-based surveillance program, PulseNet, has used molecular subtyping and serotyping methods with the aim to reduce the burden of foodborne illness through early detection of emerging outbreaks. PulseNet affiliated laboratories have used pulsed-field gel electrophoresis (PFGE) and immunoassays to subtype and serotype bacterial isolates. Widespread use of serotyping and PFGE for foodborne illness surveillance over the years has resulted in the accumulation of a wealth of routine surveillance and outbreak epidemiological data. This valuable source of data has been used to understand seasonal frequency, geographic distribution, demographic information, exposure information, disease severity, and source of foodborne isolates. In 2019, PulseNet adopted whole genome sequencing (WGS) at a national scale to replace PFGE with higher-resolution methods such as the core genome multilocus sequence typing. Consequently, PulseNet's recent shift to genome-based subtyping methods has rendered the vast collection of historic surveillance data associated with serogroups and PFGE patterns potentially unusable. The goal of this study was to develop a bioinformatics method to associate the WGS data that are currently used by PulseNet for bacterial pathogen subtyping to previously characterized serogroup and PFGE patterns. Previous efforts to associate WGS to PFGE patterns relied on predicting DNA molecular weight based on restriction site analysis. However, these approaches failed owing to the non-uniform usage of genomic restriction sites by PFGE restriction enzymes. We developed a machine learning approach to classify isolates to their most probable serogroup and PFGE pattern, based on comparisons of genomic k-mer signatures. We applied our WGS classification method to 5,970 Shiga toxin-producing Escherichia coli (STEC) isolates collected as part of PulseNet's routine foodborne surveillance activities between 2003 and 2018. Our machine learning classifier is able to associate STEC WGS to higher-level serogroups with very high accuracy and lower-level PFGE patterns with somewhat lower accuracy.
Taken together, these classifications support the ability of public health investigators to associate currently generated WGS data with historical epidemiological knowledge linked to serogroups and PFGE patterns in support of outbreak surveillance for food safety and public health.

INTRODUCTION
In the United States, foodborne pathogens are responsible for an estimated 9.4 million illnesses each year, leading to over 55,000 hospitalizations and ∼1,300 deaths (Scallan et al., 2011). Shiga toxin-producing Escherichia coli (STEC) alone, is estimated to cause 265,000 infections per year (Scallan et al., 2011) and is known to cause a range of symptoms from mild gastroenteritis and diarrhea to severe symptoms such as bloody diarrhea and hemolytic uremic syndrome (HUS) (Mead and Griffin, 1998).
Rapid identification and characterization of the pathogens causing illness are essential for an effective epidemiological response. Molecular subtyping and serotyping methodologies have been used for this purpose since the early 1980s (Holmberg et al., 1984), aiding in epidemiological investigations of various foodborne pathogens. In 1996, PulseNet was established as the national molecular subtyping network for foodborne disease surveillance, with the aim to reduce the burden of foodborne illness through early detection of potential outbreaks . The PulseNet network of laboratories consists of local, state and regional public health laboratories across the US that perform routine phenotypic and molecular characterization of clinical isolates. Routine laboratory testing of STEC includes culture and phenotypic analysis for genus and species confirmation (Zadik et al., 1993;Brooks et al., 2005;Church et al., 2007), serotyping to determine O-and Hantigens (Wang et al., 2003;DebRoy et al., 2011), molecular detection of virulence genes via PCR (Paton and Paton, 1998;Belanger et al., 2002;Reischl et al., 2002;Gilmour et al., 2009), and molecular subtyping by pulsed-field gel electrophoresis (PFGE) (Cooper et al., 2006;Ribot et al., 2006). Isolate information, test results, and images of PFGE banding patterns are stored within a central national repository, the PulseNet National Databases , and are actively used for routine surveillance. Widespread use of serotyping and PFGE for foodborne illness surveillance over many years has resulted in the accumulation of data from nearly 1 million isolates from routine surveillance, associated with distinct serogroups and PFGE patterns, e.g., seasonal frequency, geographic distribution, demographic information, exposure information, disease severity, and common outbreak sources and vehicles (Tolar et al., 2019). This collection of outbreak data has provided investigators with links to more rapidly determine contamination sources of similar outbreaks (Tolar et al., 2019).
Until recently, PFGE has been the gold standard subtyping method for PulseNet and was the primary method used for tracking foodborne pathogens . PFGE provides limited resolution but has several advantages such as rapid identification and comparison of isolates (Graves and Swaminathan, 2001). With the technological advances and maturations, newer sequencing-based methodologies have become more practical to implement in a large laboratory network, such as PulseNet. In 2019, the PulseNet program replaced PFGE with whole genome sequencing (WGS) based methods as the new gold standard method for molecular subtyping and routine surveillance . WGS data provides a much higher-resolution alternatives to PFGE, such as single nucleotide variant-based (SNV) typing, core-genome multilocus typing (cgMLST), and whole genome multilocus typing (wgMLST) (Maiden et al., 2013). PulseNet now utilizes cgMLST based subtyping for its epidemiological investigations. While many of the molecular methodologies results can be inferred through sequence comparisons and annotations, it has been particularly challenging to identify the PFGE banding pattern from WGS data. In silico prediction of banding patterns has been shown to be incorrect due to factors not captured within the sequence itself (e.g., DNA methylation and specific electrophoresis conditions used for generating the restriction profile). As such, the transition of PulseNet from PFGE to WGS risks losing a wealth of valuable historical epidemiological investigation data.
In this study, we describe a machine learning-based computational method that infers PFGE banding patterns and serogroup information from the WGS datasets. Being able to associate WGS data with serogroup and PFGE patterns will allow PulseNet, and other public health laboratories, to access the accumulated epidemiological knowledge linked to historical surveillance data and utilize them in future investigations.

Isolate Genome Collection
STEC isolate genome data used in this study was collected as part of routine subtyping and surveillance activities of PulseNet and PulseNet affiliated laboratories between 2005 and 2018. Isolates were selected from the PulseNet STEC National Database ) based on the availability of their complete O-antigen serogroup and pulsedfield gel electrophoresis (PFGE) test results, and whole genome sequence data submitted to NCBI's Sequence Read Archive (SRA) (Supplementary Table 1). PFGE patterns represented by <5 isolates were removed from the study. This resulted in a total of 5,970 isolates, belonging to 27 O-antigen serogroups and 377 PFGE patterns ( Table 1)

Molecular Subtyping and Whole Genome Sequencing
Isolates were previously characterized and sequenced by PulseNet and PulseNet affiliated laboratories using a standardized set of procedures (Atikson et al., 2012). Isolates collected prior to 2012 were retroactively sequenced. Serotyping for each isolate was performed by latex agglutination and/or tube/slide agglutination (Atikson et al., 2012). PFGE was performed on the isolates following PulseNet standard protocols and resulting gel images were analyzed and submitted to the PulseNet STEC National Database for foodborne disease outbreak surveillance (Ribot et al., 2006).
Whole genome sequencing was performed using the Illumina MiSeq and HiSeq 2,500 platforms. Qiagen Blood and Tissue kits were used for DNA extractions and library preparation was performed with Nextera XT (MiSeq) and NEB Next (HiSeq 2,500) using 2 x 250 bp chemistry. PulseNet's minimum quality thresholds for acceptance of raw sequence data include 40x average read coverage, Q-score > 30 in each paired read set, and an expected genome size of 5.0 Mbp. Raw sequence data quality was assessed using the LYVE-SET (Katz et al., 2017) software package. Isolates passing the quality thresholds were submitted to NCBI's SRA and were retrieved for this study using the SRA Toolkit (https://github.com/ncbi/sra-tools).

Genomic Data Representation
Each isolate's WGS data was represented as a k-mer presence/absence matrix, with rows representing the isolate (i.e., the instance) and columns representing the k-mers (i.e., the features). Raw sequence reads were k-merized and hashed using the open-source k-mer sketching library as implemented within MASH (Ondov et al., 2016). Specifically, reads were k-merized to 32 bp fragments. Any k-mer with <5 occurrences in either orientation (forward or reverse) were filtered out as noise in the data. The remaining k-mers were hashed using the hashing  function as implemented within the MASH library (Ondov et al., 2016). Hashed k-mer outputs were sorted, and the smallest 10,000 hash values (k-mers) were selected for each isolate. These selected k-mers were merged across isolates to form the isolate by k-mer matrix, which was used as the input for our machine learning framework. This resulted in a 5,970 (isolates) x 254,464 (k-mers) large matrix.

Model Training Framework
We approached the PFGE pattern prediction problem as a hierarchical two-level system (Figure 2 and Supplementary Figure 2): serogroup prediction followed by PFGE pattern prediction. Consequently, a single serogroup prediction model and several serogroup-specific PFGE prediction models were created. For each model, the unimportant features were removed using feature selection to (a) reduce the time and complexity for model training, and (b) avoid overfitting of the models.
Feature selection procedure was applied on the isolate by kmer matrix to identify the minimal set of maximally informative features. This procedure was applied independently for each class, serogroup or PFGE pattern, as the informative features in one class may differ from another. We adopted a stepwise process to systemically evaluate feature sets of different sizes. The procedure starts by creating an initial random forest model using all 254,464 features across all isolates belonging to the specified class (Strobl et al., 2008). Features are then ranked in descending order of their Gini importance (Breiman, 2001). Feature sets with increasing number of features are created iteratively, starting with a set size of the 10 most important features and adding the next 10 most important features in each iteration. For each feature set, 5-fold cross validation (CV) accuracy is calculated as the number of correctly predicted isolates/total number of isolates within the class. Since the input to our model is short read sequence data, it is possible that certain k-mers are not represented within an isolate due to the stochasticity in the sequencing process. We account for these stochastic factors within our machine learning model by searching for feature sets where accuracy is high and relatively stable when compared to neighboring feature sets. More specifically, for each feature set evaluated in this step, we compare the accuracy of our feature set with the neighboring ±15 feature sets (i.e., accuracy achieved by ±150 k-mers). If the accuracy for the given feature set is ranked in the top 5% of its neighbor feature sets' accuracy, the given feature is deemed optimal.
The final models for each class were created by using a random forest classifier as implemented in the scikit-learn Python package (Pedregosa et al., 2011). The random forest classifier was run with default parameters with two exceptions: the number of estimators was set to 500 and a 5-fold CV was used for evaluating the model performance.

Performance Evaluation
To evaluate the performance of our final models, we used a 5fold CV F 1 accuracy. Given the small average number of isolates per class, a CV based accuracy is a better measure of our model's robustness. F 1 accuracy is calculated as below: F 1 accuracy accounts for the class imbalance and better captures the model accuracy compared to traditional accuracy measures. The 5-fold CV F 1 accuracy was calculated for both serogroup and PFGE pattern models. Overall classification performance of our system was calculated as the ability of the system to correctly predict both serogroup and the PFGE pattern. Random accuracy was calculated by using a bootstrapping technique with 10-replicates. Specifically, isolates were randomly assigned to a PFGE pattern and for each replicate, the number of correct predictions (both serogroup and PFGE pattern) were calculated. The final random accuracy was calculated as the macro-average of all bootstrap results = total correct predictions/total number of isolates.
For each isolate, we also recorded the probability of the top five predictions at the serogroup and PFGE pattern levels. The function predict_proba within the scikit-learn package was used for calculating each class' prediction probability.
The diversity of PFGE patterns in our dataset was quantified using Shannon's entropy (Shannon, 1948): where, i ranges from 1 to n PFGE patterns and p i is the frequency of the PFGE pattern within any given serogroup.

Data and Model Availability
The isolate sequence read data analyzed here is publicly accessible in NCBI's SRA (Supplementary Table 1). The final models, and the Python script for predicting the serogroup and PFGE pattern from an isolate's sequence read file (FASTQ) is available here: https://github.com/jordanlab/pfgeBLAST.

Computational Environment
The experiments described in this study were performed on a compute server provisioned with 24 processing cores, 64 GB of RAM, and running Red Hat Linux operating system.

Serogroup and PFGE Pattern Distribution
We evaluated 5,970 STEC isolates collected in US as part of the routine surveillance efforts of the PulseNet network of public health laboratories since 1996. All STEC isolates were characterized by (1) confirmation of genus and species with phenotypic analysis, (2) O-antigen serotyping by agglutination and (3) subtyping by PFGE. The results from this characterization are captured in the 10-character code of each isolate, which indicates the O-antigen serogroup (first three characters), restriction enzyme (next three characters), and the unique PFGE banding pattern (last four characters). The isolates analyzed in this study are representative of the proportion of Oantigen serogroups reported in the U.S.; the top 6 most frequent O-antigen serogroups (in order) in this study are O157, O26, O103, O111, O121, O45 ( Figure 1A). As expected, the top six most frequent PFGE patterns belong to serogroups O157 (EXH), O121 (EXK) and O111 (EXD). All included PFGE patterns had a minimum of five isolates (Supplementary Figure 1). Isolate FIGURE 3 | Density plot representing the F 1 accuracy achieved by varying number of features. F 1 accuracy from 5-fold cross validation is iteratively computed starting with 100 of the most important features determined by Gini importance and incremented by 10 features at each iteration. Each point represents the average F 1 accuracy across each of the 27 serogroup classes. Kernel density estimation is calculated to highlight the abundance of similar, uninformative features where F 1 accuracy provides diminishing returns.
Frontiers in Sustainable Food Systems | www.frontiersin.org selection was limited by the availability of paired sequence data and PFGE pattern information since PulseNet began using WGS on select isolates in 2013 before it transitioned to WGS for all STEC in 2019. Only a select number of historical isolates have been retroactively sequenced for study. All isolate sequence data is publicly available; however, each isolate's test results were acquired from the PulseNet STEC National Database.

Machine Learning Framework for Isolate Characterization
To model the PFGE patterns within a machine learning framework, we represented each isolate in an n-dimensional k-mer space (Figure 2). Specifically, each isolate's WGS reads were fragmented to 32 bp k-mers and the resulting k-mers were subsampled to 10,000 non-singleton k-mers per isolate. The kmers from all the isolates were compiled into a single k-mer presence/absence matrix of size 5,970 (isolates) x 254,464 (kmers). This large matrix was used as the input for the machine learning framework.
We approached the PFGE pattern prediction problem as a hierarchical two-level prediction system (Supplementary Figure 2): O-antigen serogroup and PFGE pattern-level predictors. A random forest classification algorithm was utilized for feature selection as well as for building the final machine learning models. Even though the large feature space for the input data (254,464 k-mers) is a reduced representation of the full k-mer space, it is still too large for the classification algorithms, leading to increased time and complexity for model building as well as the potential to over fit the data, producing less generalizable models. Hence, we employed a feature selection procedure for computing the minimal set of maximally informative features (see Materials and Methods). The feature selection procedure was independently applied for each class (i.e., serogroup and all PFGE patterns). Briefly, we iteratively calculated the accuracy of increasingly larger feature sets and identified the optimal feature set size where the accuracy was high and relatively stable when compared to the neighboring feature sets (Figure 3). We found that classification accuracy increases steeply until a feature set size of ∼400 k-mers is reached, after which the accuracy starts decreasing or flattening out irrespective of how many k-mers are added to the model. Overall, this procedure resulted in a set of smaller class-specific matrices, with feature count ranging from 10 to 7,700 features (Supplementary Table 2). The top-level serogroup model had the greatest number of features (n = 7,700). The average number of features across all classes (serogroup and PFGE pattern) was 302 (0.12% of the initial data).
The smaller class-specific matrices were used to build a total of 28 models (one serogroup-level and 27 PFGE pattern-level), using a random forest classification algorithm.

Model Performance
Our hierarchical two-level prediction system has an overall classification accuracy of 78.1% (4,663 out of 5,970). The WGSto-serogroup classification is highly accurate; the serogroups of 98.5% of STEC genomes were accurately classified (5,881 out 5,970). WGS-to-PFGE pattern classification was less accurate; 78.9% of PFGE patterns were accurately classified (4,710 out of 5,970). The accuracy of PFGE pattern classification is highly dependent on the diversity of PFGE patterns found for any given FIGURE 4 | Overall performance of PFGE classifier using final, two-step prediction model measured by F 1 accuracy. (A) PFGE pattern prediction performance measured by F 1 accuracy across serotypes. Bubble size scales with number of isolates within a serotype while color shade darkens with number of unique PFGE classes within a serotype. (B) Classification performance measured in F 1 accuracy when using each model independently and combined. F 1 accuracy of a classifier using random guessing was calculated to demonstrate a baseline expected performance without training. serogroup. PFGE patterns from diverse serogroups, containing a high number of unique patterns, showed far lower classification accuracy than PFGE patterns from less diverse serogroups (Figure 4 and Supplementary Figure 3). Nevertheless, given the presence of 377 unique PFGE patterns in our dataset, and considering the number of genomes corresponding to each pattern, 78.1% accuracy is ∼34-times higher than could be expected (2.3%) compared to purely random classification.
The performance of the PFGE pattern model can further be improved from 78% to 94% by reporting the five mostprobable PFGE patterns predicted for any isolate (Figure 5 and Supplementary Figure 4). Since the serogroups with the relatively lower accuracy are also the most diverse (on average 63 PFGE patterns per serogroup) and abundant (∼88% of total isolates in the study), being able to know the five most likely PFGE patterns is epidemiologically meaningful.
The full prediction runs on average in ∼10.27 seconds per isolate genome with the k-mer representation taking ∼8.1 seconds and random forest classification taking ∼2.16 seconds.

Serotyping Performance Comparison With Other Methods
We compared the performance of our method's serotyping ability with other popularly used software: ECTyper (https://github. com/phac-nml/ecoli_serotyping) and SRST2 (Inouye et al., 2014;Ingle et al., 2016). For our dataset, ECTyper had an accuracy of 72.7% and an average speed of 76.07 seconds per isolate, while SRST2 yielded a higher accuracy of 90.9% but with a slower average speed of 114.46 seconds per isolate (Supplementary Figure 5).

Importance of and Challenges in Predicting Serogroup and PFGE Patterns From WGS
Potential foodborne STEC outbreaks are initially discovered by matching genetically related isolates collected as part of the routine surveillance activities (Hedberg and Besser, 2006). Historically, similarity between STEC isolates was determined by using serotype and PFGE banding pattern information. Serotyping entailed determination of the O-group by the Oantigen on the cell surface and, if available, the H-type from the flagellar H-antigen. PFGE yields distinct DNA restriction patterns and has the discriminatory power to differentiate highly similar isolates. Further, PulseNet's highly optimized PFGE protocols ensured maximum discriminatory power, reproducibility, and comparability between isolates processed at distinct laboratories (Ribot et al., 2006). In the PulseNet National Database, PFGE banding patterns were procedurally assigned PFGE pattern identifiers by visual comparison of PFGE images. PFGE banding patterns matching an existing pattern were assigned the same identifier, distinct PFGE banding patterns were assigned unique identifiers.
Since PulseNet's transition to WGS in 2019, PulseNet's primary subtyping method, PFGE, was succeeded by two geneby-gene typing approaches: cgMLST and wgMLST. The cgMLST focusses on a subset of genes commonly found in >95% of the genomes; wgMLST encompasses all genes observed within the sequenced isolates (Maiden et al., 2013). cgMLST is shown to offer sufficient discriminatory resolution for identifying closely related genomes for outbreak surveillance while requiring a smaller computational footprint than wgMLST. The use of WGS-based analyses replaces several conventional laboratory tests, improves detection accuracy, and reduces resource cost by consolidating multiple conventional workflows into a single rapid assay.
Although WGS datasets have been immensely useful in advancing outbreak investigation science, being able to infer historically used PFGE banding pattern information has been immensely challenging. A wealth of isolate associated information such as previously identified sources, vehicles or seasonal trends, has been harder to utilize, risking loss of valuable collection of data. While it is theoretically possible to predict DNA fragment lengths and molecular weights from the restriction enzyme (Bikandi et al., 2004) utilized by PulseNet, it has been difficult to do so for a number of reasons: (1) limitation of PFGE in resolving DNA fragments of the same size where non-homologous DNA fragments appear as a single band (Davis et al., 2003), (2) unpredictability of DNA methylation sites where restriction is prevented (Davis et al., 2003), (3) specific run conditions established from PulseNet's PFGE protocol, and (4) subjectivity introduced by visual comparison of gel images. These limitations make simplistic approaches for in silico digestion of genomes incomparable to experimentally generated PFGE banding patterns.

Leveraging WGS for Creating High-Dimensional Models
More complex methodologies, such as machine learning algorithms, are able to capture obscure patterns from large collection of data, if such patterns exist. The availability of a large paired WGS-PFGE dataset allows us to implement such methodologies, enabling us to approach the problem in an agnostic and data-driven manner.
Gene-by-gene typing methodologies provide substantial resolution for epidemiological investigation purposes, but are poorly correlated with PFGE banding patterns (Supplementary Figure 6). This can be attributed to the reduction of the genomic problem (sequence-based information) to a genetic problem (allele numbers) within the multilocus sequence typing family of approaches. In other words, absence of sequence information in the cgMLST calls likely prohibits our machine learning approach from discovering informative patterns. On the other hand, representing the whole genome or the collection of genes within a machine learning amenable manner for thousands of isolates is computationally prohibitive and not practical. Hence, we proceeded to investigate the association of shorter sequences (k-mers) with PFGE banding patterns. k-mers have been shown to be sufficiently informative short sequences used across a number of different applications including genome assembly (Compeau et al., 2011), read-togenome mapping (Li and Durbin, 2009), taxonomic classification (Nayfach et al., 2016;Wood et al., 2019), and multi-locus sequence typing (Gupta et al., 2017;Espitia-Navarro et al., 2020). Consequently, we represent our genomes as high-dimensional k-mer presence/absence vectors and utilize these vectors within a machine learning framework.
Despite the high-dimensional nature of our approach, it is fast, scalable, and adds minimum overhead to the existing data analysis processes at PulseNet. Further, our machine learning classifier is able to associate STEC WGS to Oantigen serogroups and PFGE patterns with high accuracy. Taken together, these classifications support the ability of public health investigators to associate currently generated WGS data with historical epidemiological knowledge linked to STEC Oantigen serogroups and PFGE patterns in support of outbreak surveillance for food safety and public health.

Generalizable Approach for Building Models for Organisms
PulseNet currently tracks seven bacterial organisms, including STEC, each with a collection of historic PFGE pattern information. Although, the results presented in this study focus on STEC, we hope it serves as a proof-of-principle that such methods can also work well for other organisms of interest. This is especially timely, considering that other genome-based methodologies, such as cgMLST, are still in early development phases.

Limitations
Our current study has some outstanding challenges and limitations. As with most database-based methodologies, the accuracy and coverage of our method is dependent on the number of genomes used to build the initial model. Infrequently observed or sequenced PFGE patterns do not have enough genome representation in our model and will be incorrectly predicted. Similarly, any new PFGE pattern will be predicted incorrectly.
The accuracy of our method is also tied to the noise in the input data. PFGE banding patterns identifiers are susceptible to a high-level of subjectivity as they were primarily assigned by visually comparing band fragments. This subjectivity introduces noise when building random forest models. We expect that by sequencing additional genomic data and some manual recuration of the data will help us improve the current accuracy of our methodology.

DATA AVAILABILITY STATEMENT
The isolate sequence read data analyzed here is publicly accessible in NCBI's SRA (Supplementary Table 1). The final models, and the Python script for predicting the serogroup and PFGE pattern from an isolate's sequence read file (FASTQ) is available at: https://github.com/jordanlab/pfgeBLAST.

AUTHOR CONTRIBUTIONS
SI, HC, IJ, and LR defined the problem and designed the study. SBI and HAC collected the data. SI, SG, MJ, and AC performed the experimentation. SI, SG, and LR created the figures and tables. SI, HC, IJ, and LR wrote the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work was made possible through support from the Advanced Molecular Detection (AMD) initiative at the Centers for Disease Control and Prevention. SG, MJ, AC, IJ, and LR were supported by IHRC-Georgia Tech Applied Bioinformatics Laboratory.