Rapid Real-Time Polymerase Chain Reaction for Salmonella Serotyping Based on Novel Unique Gene Markers by Pangenome Analysis

An accurate diagnostic method for Salmonella serovars is fundamental to preventing the spread of associated diseases. A diagnostic polymerase chain reaction (PCR)-based method has proven to be an effective tool for detecting pathogenic bacteria. However, the gene markers currently used in real-time PCR to detect Salmonella serovars have low specificity and are developed for only a few serovars. Therefore, in this study, we explored the novel unique gene markers for 60 serovars that share similar antigenic formulas and show high prevalence using pangenome analysis and developed a real-time PCR to detect them. Before exploring gene markers, the 535 Salmonella genomes were evaluated, and some genomes had serovars different from the designated serovar information. Based on these analyses, serovar-specific gene markers were explored. These markers were identified as genes present in all strains of target serovar genomes but absent in strains of other serovar genomes. Serovar-specific primer pairs were designed from the gene markers, and a real-time PCR method that can distinguish between 60 of the most common Salmonella serovars in a single 96-well plate assay was developed. As a result, real-time PCR showed 100% specificity for 199 Salmonella and 29 non-Salmonella strains. Subsequently, the method developed was applied successfully to both strains with identified serovars and an unknown strain, demonstrating that real-time PCR can accurately detect serovars of strains compared with traditional serotyping methods, such as antisera agglutination. Therefore, our method enables rapid and economical Salmonella serotyping compared with the traditional serotyping method.


INTRODUCTION
The genus Salmonella, the causative agent of foodborne salmonellosis, can infect both animals and humans, leading to public health problems and economic loss (Gand M. et al., 2020a). Most Salmonella infections are caused by consuming contaminated water or food (Kasturi, 2020). Currently, Salmonella is divided into two species and six subspecies. Serotypes are further classified into more than 2,600 serovars following the White-Kauffman-Le Minor scheme, using antigenic agglutination reactions to three cell-surface antigens of somatic O, and flagellar H antigens denoted as H1 and H2 (Grimont and Weill, 2007;Yachison et al., 2017;Zhang et al., 2019;Gand M. et al., 2020a). As a reliable surveillance protocol is critical for detecting outbreaks or preventing their spread, using a differential serotyping method that identifies serogroups and serovars of Salmonella isolates from causative agents is important (Kasturi, 2020).
Traditional serotyping methods require numerous antisera, are labor-intensive, time-consuming, and complicated, and may produce ambiguous results (Hong et al., 2008;Xiong et al., 2017). Some isolates do not express antigens because of a single-nucleotide mutation in the genome, and some require multiple passes through semisolid media to enhance the flagella antigen expression (Ibrahim and Morin, 2018). As a result of these limitations, serotyping by antigenic agglutination and biochemical tests has been replaced by molecular serotyping (Zhang et al., 2019). For the rapid diagnosis or tracking of Salmonella serovars, epidemiological investigations have been conducted by molecular serotyping analysis based on pulsedfield gel electrophoresis, plasmid profiles, and polymerase chain reactions (PCRs) (Ozdemir and Acar, 2014;Gad et al., 2018;Yang et al., 2021). Notably, the PCR-based method is widely used for early diagnosis because it can diagnose a few bacteria in the specimen and can yield rapid results. However, as the PCR-based method for serotyping mainly uses markers within genes responsible for somatic and flagellar antigen expression, these genes do not detect strains that share the same antigenic formula. Thus, although PCR is reasonably rapid and inexpensive compared with conventional serotyping methods, the limiting factors of this assay are that molecular serotyping does not diagnose numerous serovars and focuses mainly on the most common serovars, such as Salmonella Typhimurium and Enteritidis (Ibrahim and Morin, 2018).
Previous studies have identified gene markers, such as fimA, hilA, invA, and ttr, that are useful in detecting Salmonella species (Laing et al., 2017;Gand M. et al., 2020b;Kreitlow et al., 2021). Moreover, for serovar identification, gene markers based on allelic variations in O and H antigen genes were used (Laing et al., 2017;Gand M. et al., 2020b). However, this gene marker cannot distinguish isolates that share the same antigenic formula. To overcome this problem, some studies have identified markers, such as STM0292, STM4200, STM4493, and STM2235 specific to Typhimurium and SEN1392 specific to Enteritidis, using comparative genomics (Akiba et al., 2011;Liu et al., 2012). These gene markers have specificity but have been screened using a limited number of genomes and cannot detect various serovars in a single reaction (Heymans et al., 2018;Ibrahim and Morin, 2018).
Advances in whole-genome sequencing technology have improved the understanding of the species and subtypes of pathogenic bacteria, and can provide information on the virulence of the underlying pathogenesis. Data obtained using whole-genome sequencing technology can be used to confirm paths of disease transmission and provide information on potential outbreaks (Ibrahim and Morin, 2018;Diep et al., 2019;Cooper et al., 2020). Therefore, whole-genome sequencing is currently used as a technique to obtain reliable and rapid serovar information. Recently, multiple in silico tools have been developed to determine Salmonella serovars from whole-genome sequence data (Zhang et al., 2015(Zhang et al., , 2019Yoshida et al., 2016). SeqSero and Salmonella In Silico Typing Resource (SISTR), which can infer serovar predictions from analyses of somatic O and flagellar H antigens derived from whole-genome sequence data, are representative in silico analysis tools (Uelze et al., 2020). Whole-genome sequencing and in silico tools have many advantages in serotyping, for example, they reveal detailed genetic information on the characteristics of isolates and accurate serovar predictions, provided the database is sufficient (Ibrahim and Morin, 2018). However, whole-genome sequence-based method must be considered against costs associated with genome sequencing to analyze the many samples. This method also requires trained researchers with a specific skill set (Mellmann et al., 2017). To overcome the limitations of genome sequencing, primer sets for PCR targeting serovar-specific genes were recently developed. They show accuracy for serovar detection but still cannot detect a large number of serovars Ye et al., 2021).
The purpose of this study is to evaluate Salmonella genomes by in silico serotyping, to select novel serovar-specific gene markers based on pangenome analysis, and to develop a real-time PCR method that can distinguish between 60 of the most common Salmonella serovars in a single 96-well plate by detecting unique serovar-specific gene markers.

In silico Serotyping
This study selected 60 serovars, which have been frequently isolated worldwide, are essential for public health, and are difficult to diagnose using traditional serotyping methods.

Pangenome Analysis and Discovery of Unique Gene Markers
The pangenome was analyzed using workflow for microbial pangenomic analysis with the Anvi'o package version 6.1 (Eren et al., 2015). Briefly, the assembled genome sequences were used as input and clustered based on the similarity of the amino acid sequences by the Markov cluster algorithm and NCBI's blastp algorithm according to the developer recommendations (Kayansamruaj et al., 2019). Then the pangenome result was visualized using anvi-display-pan code of Anvi'o, and the genomes were organized based on the pan gene cluster frequencies.
The unique gene of each serovar was obtained using a Bacterial Pan Genome Analysis (BPGA) version 1.3 (Chaudhari et al., 2016). The annotated protein in fasta-format file was used as input, and the pangenome was analyzed with the default value (cut-off: 50%). All genomes were compiled into separate local databases, which include the core-genome composed of proteins common to genomes of the target serovar, and the pangenome is composed of entire proteins of all genomes. The unique genes of each serovar were gathered by comparing the pan and coregenome databases. The extracted unique genes were aligned with 65,815,883 sequences using the Basic Local Alignment Search Tool (BLAST). To verify the unique gene markers, the unique gene of each serovar was aligned with 535 Salmonella genomes using USEARCH version 9.0 (Edgar, 2010). Afterward, serovarspecific primer pairs were designed from selected gene markers. To increase the efficiency of primer pairs, the length was 18-30 base pairs, the guanine-cytosine (GC) content was designed to be 45-60%, and the melting temperature (Tm) value was 52-58 • C (Abd-Elsalam, 2003). Representativeness of newly designed primer pairs were evaluated by in silico PCR. PCR was run from a web-based in silico PCR amplification 1 software using 625 Salmonella genomes (Supplementary Table 2). The genomes used in the in silico PCR come from NCBI and EnteroBase.

Cultured Bacterial Strains and DNA Extraction
The 199 Salmonella strains, 33 Salmonella species or Salmonella enterica strains, and 29 non-Salmonella strains used in this study are presented in Supplementary Table 3. All bacterial strains were grown in TSB at 37 • C for 18 h under aerobic conditions. The cultured strains were collected by centrifugation at 13,600 × g for 5 min. Then, genomic DNA was extracted from the pellet using the DNeasy Blood & Tissue Kit (QIAGEN, Hilden, Germany), according to the manufacturer's instructions. The concentration and purity of the genomic DNA extracted were measured using the MaestroNano R spectrophotometer (Maestrogen, Las Vegas, NV, United States).

Specificity and Accuracy for Developed Primer Pairs
Each 20-µl real-time PCR reaction mixture contained 500 nM of each primer pair, 10 µl of 2× Thunderbird SYBR R qPCR 1 http://in~silico.ehu.es/PCR/ Mix (Toyobo, Osaka, Japan), and 20 ng of template DNA. To avoid false-negative results, an internal standard targeting the bacterial 16S rRNA gene fragment was used (Aboutalebian et al., 2021). Amplification was conducted in a 7500 Real-Time PCR System (Applied Biosystems, Foster City, CA, United States), with an initial denaturation for 2 min at 95 • C, followed by 30 cycles of 95 • C for 5 s and 60 • C for 30 s. The melting curve was generated according to the following conditions: 95 • C for 15 s, 60 • C for 1 min, 95 • C for 30 s, and 60 • C for 15 s. The specificity of the primer pairs developed was tested against each Salmonella strain as well as each non-Salmonella reference strain. Primer pairs cross reactivity across all of the serovars were evaluated. To evaluate the accuracy, genomic DNA for each serovar were serially diluted, and then standard curves were generated in triplicate using diluted DNA from 0.0002 to 20 ng. The results obtained were analyzed using 7500 software version 2.3 (Applied Biosystems).

Evaluation of Real-Time Polymerase Chain Reaction With a Single 96-Well Plate
A real-time PCR method was developed that can distinguish between 60 of the most common Salmonella serovars in a single 96-well plate. The real-time PCR was designed so that each primer pair was run independently in a single 96-well plate (Supplementary Figure 1). Each well contained different primer pair, and the genomic DNA from a single isolate was added to each well. The serovar of strain was determined as serovar corresponding to the primer pair included in the well in which amplification occurred. The real-time PCR developed in this study was evaluated using 189 Salmonella strains whose serovars were confirmed through antisera agglutination and 33 strains whose serovars were unknown. Each strain was tested against each of the primer pairs. For serovar diagnosis of isolates, genomic DNA of the isolate was added to each well of the reaction plate containing different primer pair and 2× Thunderbird SYBR R qPCR Mix (Toyobo). Then, real-time PCR was conducted in the 7500 Real-Time PCR System (Applied Biosystems). The conditions were the same as those described in the previous "Specificity and Accuracy for Developed Primer Pairs" section.

Traditional Serotyping Using Antisera Agglutination
The antigenic formulas of strains were determined by the World Health Organization Collaborating Center for Reference and Research on Salmonella, located at the Pasteur Institute, and the serovar name was assigned by the White-Kauffmann-Le Minor scheme (Grimont and Weill, 2007). All strains were cultured in brain heart infusion (BD Difco, Sparks, MD, United States) and motility GI medium (BD Difco) to determine somatic and flagellar antigens. The somatic antigen was confirmed by the slide agglutination reaction using antisera (BD Difco). The flagellar phase was activated by the motility test and fixed by treatment with 0.6% formalin. The flagellar antigen was determined by an aggregation reaction in glass test tubes by mixing with an antisera solution (BD Difco). The serotyping results of antisera agglutination were compared with the results of the real-time PCR method.

In silico Serotyping
Whole-genome sequencing data have been used for serotype diagnosis or subtyping of Salmonella (Zhang et al., 2019;Uelze et al., 2020). SeqSero is a web-based serotyping tool that can predict many Salmonella serovars using whole-genome sequence data based on a database of Salmonella serovar determinants (Ibrahim and Morin, 2018). This tool extracts the relevant genomic regions of cell-surface antigens, such as the rfb gene cluster to determine the O antigen and the fliC and fljB genes to determine H1 and H2 antigens, from the genome assemblies or raw sequencing reads and aligns these genes to the curated database using BLAST (Zhang et al., 2015;Ibrahim and Morin, 2018). This software then determines the serovar according to the White-Kauffmann-Le Minor scheme.
Of the 24 genomes that produced incorrect serovars, eight genomes were predicted to be serotypes inconsistent with serovar nomenclature reported to NCBI, and 13 genomes indicated two or more serotypes ( Table 1). This shows that as the serotype by SeqSero is determined only by the O and H antigens, these genomes show that two or more serovars were indicated for similar serovars, such as Gallinarum or Enteritidis, Paratyphi C or Choleraesuis or Typhisuis, and Albany or Duesseldorf. Additionally, some genomes (n = 4) were determined to be partial serovars lacking O or H1 or H2-antigens, and accurate serovar information could not be extracted. The reason for this appears to be the failure to extract the serovar determinant from the assembled genomes (Zhang et al., 2015).
Salmonella In Silico Typing Resource also determines the serovar according to the White-Kauffmann-Le Minor scheme, based on their databases of Salmonella serotype determinants (wzx/wzy, fliC, and fljB alleles) (Yoshida et al., 2016). To resolve the ambiguous serovar designations resulting from antigen determination, SISTR uses the novel 330 locus cgMLST analysis, and together, these two determinants are used to provide an overall serovar prediction (Yoshida et al., 2016). As a result of SISTR analysis, 526 genomes (98.32%) were predicted as correct serovars (Supplementary Table 5), whereas nine genomes (1.68%) were predicted as incorrect serovars (Table 1).
In silico serotyping analyses of the genomes obtained from the NCBI showed 95.51 and 98.32% accuracy on SeqSero and SISTR platforms, respectively. These data are consistent with the success rates of each platform reported in previous studies (Zhang et al., 2015;Yoshida et al., 2016). However, some limitations of whole-genome sequencing also exist. Previous studies have reported that SeqSero provided two possible serovars that share the same antigenic formula but differed in the minor O antigen factor (Ibrahim and Morin, 2018). In this study, serovars such as Anatum (3,10:e,h:1,6) and Newington (Anatum var. 15 + , 3,10,15:e,h:1,6) shared similar antigenic formulas, yielding only Anatum serovars. Moreover, in silico serotyping tools could not predict some genomes. These unpredicted serovars may infrequently be separated, or possibly there were some gaps in their databases (Ibrahim and Morin, 2018).

Genome Evaluation
Previous studies have reported that the NCBI has often misclassified genomes, so the genome evaluation should be conducted before using the genome obtained from the NCBI (Kim et al., 2020. Therefore, the genomes used in this study were evaluated to prevent incorrect results obtained in selecting unique serovar genes. As a result of phylogeny based on the pangene cluster frequencies among the 535 genomes, most genomes were clustered according to serovar types (Figure 1). Serovars with similar antigenic formula types, such as Typhimurium and I 4,[5],12:i:-, were adjacent but clustered into different groups. These two serovars differ by only one flagellar antigen (1,4,[5],12:i:1,2 vs. 4,[5],12:i:-) in their antigenic formulas. However, some Enteritidis, Typhimurium, and Paratyphi B strains were clustered with different serovars. Paratyphi B SARA61 was clustered into Agona; Typhimurium TW-Stm6, FORC50, and FORC098 were clustered into I 4,[5],12:i:-, Enteritidis, and Tennessee, respectively; Enteritidis 92-0392 was clustered into Typhimurium. Most genomes were clustered with the same serovar group because of in silico serotyping analyses. This is consistent with a previous study that reported that similar or rare serotypes could produce unpredicted serovar results using the SeqSero database (Ibrahim and Morin, 2018).

Identification of Serovar-Specific Gene Markers
A total of 2,440,535 genes yielded a pangenome size of 15,853 genes. The core genome comprises 1,215 genes; the accessory genome, 11,156 genes; and the unique genome, 3,482 genes. The core genomes for each serovar included from 3,065 to 4,702 genes, and unique genomes contained from 1 to 160 genes. The unique genes considered specific for each serovar were identified by BLAST analysis, and genes rarely present in other bacteria and specific to the serovar were selected. Further, a unique gene marker of each serovar was finally selected considering the GC  content and sequence length suitable for the primer design. The 60 gene markers selected by the analysis were confirmed to be genes present in all target serovars and absent in other Salmonella serovars. Information on these gene markers is shown in Table 2.
The specificity of gene markers was evaluated using 535 genomes by in silico analysis. As a result, gene markers were present in the genomes of most target serovars (Figure 2). Notably, most of the gene markers shared 99-100% of the sequence identified in the target serovars, and 0-50% of the sequence identified against other serovars. In contrast, for the serovar Enteritidis, 60 genomes showed 99-100% identity, but in one genome, the Enteritidis gene marker was not found. Among them, Enteritidis 92-0392 contained the Typhimurium gene marker (100% identity) instead of the Enteritidis gene marker, and this genome was determined as Typhimurium in the pangenome analysis and in silico serotyping. For the Typhimurium gene marker, 46 genomes showed 99-100% identity, but in three genomes, the Typhimurium-specific gene marker could not be found. Typhimurium TW-Stm6, FORC50, and FORC098 showed 100% identity to I 4,[5],12:i:-, Enteritidis, and Tennessee-specific gene markers instead of the Typhimurium-specific gene marker, respectively. For serovar Paratyphi B, two genomes showed 100% identity, but in one genome, the Paratyphi B-specific gene marker could not be found. Paratyphi B SARA61 showed 100% identity to Agona gene markers instead of the Paratyphi B gene marker. As mentioned above, misclassified genomes were reclassified. Therefore, Agona, I 4,[5],12:i:-, and Tennessee had more genomes with corresponding gene markers than the number of genomes analyzed (Figure 2). In contrast, misclassified genome absent the corresponding gene markers, so some serovar had more analyzed genomes than the number of corresponding gene markers. Based on the pangenome analysis, serovar-specific      primer pairs that can accurately detect Salmonella serovars were developed (Table 3). An Infantis-specific primer pair was developed in the previous study . In this study, our method was applied against 232 Salmonella strains. However, since the number of strains included in some serovars (e.g., Aberdeen, Berta, Cerro, Hadar, Kedougou, etc.) is rarely isolated, only a small number of strains were analyzed. In silico PCR was performed using the 625 genomes to determine whether marker genes are representative of each serovar and whether the accuracy can maintain high enough once it is applied to more strains of these serovars. All primer pairs were successfully amplified for corresponding genome sequences (Supplementary Table 2). Therefore, in silico PCR results revealed that the marker gene was representative for each serovar.

Specificity and Accuracy of the Developed Real-Time Polymerase Chain Reaction
Pangenome analysis based on the whole-genome sequence can efficiently select serovar-specific gene markers using largescale genomes. The classification of pathogenic bacteria to their correct taxonomy using whole-genome sequencing shows reproducibility and accuracy, but is expensive and requires additional bioinformatics analysis (Ibrahim and Morin, 2018). In contrast, the PCR-based method can rapidly detect pathogenic bacteria with high accuracy and sensitivity (Abubakar et al., 2007;Chen et al., 2010). This method can cost-effectively detect many isolates with relatively simple procedures; it also has potent sensitivity and specificity (Hoorfar, 2011;Xiong et al., 2018). It is crucial to develop the primer pair with high specificity as  the accuracy of these PCR-based assays depends mainly on the specific gene or primer pair. Therefore, in this study, a real-time PCR method was developed for serotyping by detecting unique gene markers obtained through pangenome analysis.
A total of 199 Salmonella strains and 29 non-Salmonella strains were used to develop a real-time PCR method that is specific and accurate. Amplification plots for the most frequently isolated six serovars worldwide are shown in Figure 3, and the   result on the remaining serovars is in Supplementary Table 6. The genomic DNA across serovars yielded a detectable amplicon for the target primer pair, whereas those from all non-target Salmonella did not generate any signal (Figure 3). The Ct value ranges were 11.49-18.07 for each Salmonella serovar strain (Supplementary Table 6). Thus, primer pairs developed in this study were considered specific for the identification and detection of individual Salmonella serovars. Serial dilution was used on the genomic DNA of Salmonella reference strains to confirm the accuracy of the real-time PCR assay. All Salmonella serovar-specific primer pairs showed a linear relationship over the range of 0.002-20 ng. The slopes for the primer pairs of Bareilly, Enteritidis, I 4,[5],12:i:-, Montevideo, Typhi, and Typhimurium were −3.589, −3.395, −3.66, −3.457, −3.677, and −3.61, respectively, and the R 2 values were ≥0.997 (Figure 4). The primer pairs for the remaining 54 serovars also showed that slope values were −3.19 to −3.683, and the R 2 values were ≥0.996 (Supplementary Table 7). To show high efficiency, the slope value should be −3.1 to −3.9, and the R 2 value ≥ 0.996 for the standard curve (Broeders et al., 2014). The slope and R 2 values of all primer pairs developed in this study were within these ranges.

Evaluation of Real-Time Polymerase Chain Reaction and Validation by Traditional Serotyping
The real-time PCR method developed in this study was evaluated in a single 96-well plate using 222 Salmonella strains. Moreover, serovars of some strains were identified using the antisera agglutination method and compared with real-time PCR results. Real-time PCR competency was checked using an internal standard. Of 222 strains, 189 strains are known at the serovar level, and the remaining 33 were strains with unknown serovars. The real-time PCR result determined that the corresponding serovars were detected when amplified in a well containing a serovar-specific primer pair. All strains were amplified in wells containing a specific primer pair and internal standard primer pair, whereas no amplification was observed in the other wells (Supplementary Table 8). The internal standard primer pair was amplified in 222 Salmonella strains and 29 non-Salmonella, and the Ct value ranged from 10.03 to 14.95. Serovars of all strains were identified accurately by real-time PCR, and the result was identical to the serotyping result of the antisera agglutination method (Table 4). Anatum and Newington presented similar antigenic formulas as the result of antisera agglutination, but two serovars were distinguishable in realtime PCR. To verify our real-time PCR method, serovars were determined for 33 isolates identified down to the genus or species level obtained from the Korea Veterinary Culture Collection (KVCC) ( Table 5). As a result, 33 isolates were determined as 13 different serovars, such as I 4,[5],12:i:-, Enteritidis, Blockley, Hadar, Livingstone, Mbandaka, Montevideo, Rissen, Typhimurium, Infantis, Virchow, London, and Senftenberg. Therefore, our results suggest that the 60 primer pairs and realtime PCR method developed in this study are 100% accurate in detecting Salmonella serovars. However, in this study, few strains were used in some serovars that may influence the identification of the real marker genes and detecting accuracy. This method not only can clearly distinguish between two serovars presenting similar antigenic formulas, but also alleviates the time and cost required for traditional serotyping method. This method has the limitation of lack of replicates in a sample run of 60 serovars in 96-well plates, but may be necessary for application in the field as it is evaluated in a single plate.

CONCLUSION
In this study, novel serovar-specific gene markers were discovered through pangenome analysis of whole-genome sequences. The pangenome analysis could identify gene markers for 60 Salmonella serovars present in genomes of target serovars and absent in genomes of other serovars. Furthermore, in silico analyses confirmed that some genomes deposited in the public database, such as the NCBI, were incorrectly designated. The real-time PCR method, designed to detect serovar-specific gene markers using a single 96-well plate, successfully detected 222 strains, thus validating the specificity and effectiveness of the assay. Additionally, the traditional serotyping method yielded ambiguous results for strains that share similar antigenic formulas but were accurately identified as one serovar using realtime PCR. These results suggest that the efficient real-time PCR assay developed could be used as a high-throughput diagnostic tool to identify 60 serovars. The real-time PCR method developed in this study is useful in diagnosing Salmonella infections and has applications in food safety and human health.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.