Development and Validation of a Reference Data Set for Assigning Staphylococcus Species Based on Next-Generation Sequencing of the 16S-23S rRNA Region

Many members of the Staphylococcus genus are clinically relevant opportunistic pathogens that warrant accurate and rapid identification for targeted therapy. The aim of this study was to develop a careful assignment scheme for staphylococcal species based on next-generation sequencing (NGS) of the 16S-23S rRNA region. All reference staphylococcal strains were identified at the species level using Sanger sequencing of the 16S rRNA, sodA, tuf, and rpoB genes and NGS of the 16S-23S rRNA region. To broaden the database, an additional 100 staphylococcal strains, including 29 species, were identified by routine diagnostic methods, 16S rRNA Sanger sequencing and NGS of the 16S-23S rRNA region. The results enabled development of reference sequences encompassing the 16S-23S rRNA region for 50 species (including one newly proposed species) and 6 subspecies of the Staphylococcus genus. This study showed sodA and rpoB targets were the most discriminative but NGS of the 16S-23S rRNA region was more discriminative than tuf gene sequencing and much more discriminative than 16S rRNA gene sequencing. Almost all Staphylococcus species could be distinguished when the max score was 99.0% or higher and the sequence similarity between the best and second best species was equal to or >0.2% (min. 9 nucleotides). This study allowed development of reference sequences for 21 staphylococcal species and enrichment for 29 species for which sequences were publicly available. We confirmed the usefulness of NGS of the 16S-23S rRNA region by identifying the whole species content in 45 clinical samples and comparing the results to those obtained using routine diagnostic methods. Based on the developed reference database, all staphylococcal species can be reliably detected based on the 16S-23S rRNA sequences in samples composed of both single species and more complex polymicrobial communities. This study will be useful for introduction of a novel diagnostic tool, which undoubtedly is an improvement for reliable species identification in polymicrobial samples. The introduction of this new method is hindered by a lack of reference sequences for the 16S-23S rRNA region for many bacterial species. The results will allow identification of all Staphylococcus species, which are clinically relevant pathogens.

Many members of the Staphylococcus genus are clinically relevant opportunistic pathogens that warrant accurate and rapid identification for targeted therapy. The aim of this study was to develop a careful assignment scheme for staphylococcal species based on next-generation sequencing (NGS) of the 16S-23S rRNA region. All reference staphylococcal strains were identified at the species level using Sanger sequencing of the 16S rRNA, sodA, tuf, and rpoB genes and NGS of the 16S-23S rRNA region. To broaden the database, an additional 100 staphylococcal strains, including 29 species, were identified by routine diagnostic methods, 16S rRNA Sanger sequencing and NGS of the 16S-23S rRNA region. The results enabled development of reference sequences encompassing the 16S-23S rRNA region for 50 species (including one newly proposed species) and 6 subspecies of the Staphylococcus genus. This study showed sodA and rpoB targets were the most discriminative but NGS of the 16S-23S rRNA region was more discriminative than tuf gene sequencing and much more discriminative than 16S rRNA gene sequencing. Almost all Staphylococcus species could be distinguished when the max score was 99.0% or higher and the sequence similarity between the best and second best species was equal to or >0.2% (min. 9 nucleotides). This study allowed development of reference sequences for 21 staphylococcal species and enrichment for 29 species for which sequences were publicly available. We confirmed the usefulness of NGS of the 16S-23S rRNA region by identifying the whole species content in 45 clinical samples and comparing the results to those obtained using routine diagnostic methods. Based on the developed reference database, all staphylococcal species can be reliably detected based on the 16S-23S rRNA sequences in samples composed of both single species and more complex polymicrobial communities. This study will be useful for introduction of a novel diagnostic tool, which undoubtedly is an improvement for reliable species identification in polymicrobial samples. The introduction of this new
The staphylococci are opportunistic pathogens that are a part of the natural microbiota of human and animal skin and mucous membranes (Kaspar et al., 2016;Islam et al., 2017;Mrochen et al., 2017;Kosecka-Strojek et al., 2018). However, changes in patient populations, such as the increased number of premature neonates, elderly and immunocompromised patients, and the increasing use of implanted foreign prosthetic material and indwelling catheters have led to a rise in documented infections caused by CoNS and CoPS other than S. aureus (Flores-Mireles et al., 2015;Giormezis et al., 2015;Butin et al., 2016;Savini et al., 2016). Because most studies report such infections as being caused by CoNS and do not differentiate isolates at the species level, the real impact of single species, especially less frequent species, is underreported. Moreover, species that have only been discovered in the last few years are not part of the routine diagnostic tests to identify bacterial species. Accurate identification is highly desirable for precise therapy, monitoring the spread of infections with epidemiologic characteristics and investigating disease progression (Ghebremedhin et al., 2008;Hwang et al., 2011;Shin et al., 2011;Becker et al., 2014).
In routine diagnostics, culture-dependent phenotypic tests, including automated systems such as the Vitek 2 (bioMérieux, La Balme Les Grottes, France) and BD Phoenix (BD Diagnostic Systems, Sparks, MD, USA), and matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS), are used for identification of bacterial species. However, these methods are not always sufficiently reliable because of variable expression of phenotypic characteristics, and the databases are limited to only some species (Heikens et al., 2005;Dupont et al., 2010;Bergeron et al., 2011;Becker et al., 2015;Singhal et al., 2015;Ayeni et al., 2017;Gherardi et al., 2018). Accurate identification at the species level may not only change the diagnosis but can also help to identify unusual antimicrobial resistance patterns. The ideal method should have high discriminatory power and allow the identification of closely related species while also being relatively simple, inexpensive, rapid and reproducible. Therefore, genetic methods based on PCR or sequencing are good candidates for identification purposes. Most of these methods are based on specific nucleic acid target amplification and sequencing (Couto et al., 2001;Drancourt and Raoult, 2002;Becker et al., 2004).
For polymicrobial samples that need to be analyzed using culture-independent tests, the simultaneous identification of species of different genera using a single primer pair is a useful approach. As the 16S rRNA gene is universally present across bacteria, is highly conserved, and can be easily amplified using universal primers, microbial analyses are often performed using 16S rRNA amplicon sequencing (Nguyen et al., 2016). Although this method is widely used and accurate, the high degree of similarity between closely related species has limited its usefulness for identifying several CoNS species as well as distinguishing between the recently established species of the S. aureus complex (Hwang et al., 2011;Shin et al., 2011;Sabat et al., 2013;Tong et al., 2015).
Next-generation sequencing has highly improved microbiological genetic investigations by providing a costeffective method to characterize bacterial genomes. The main advantage of NGS over Sanger sequencing is its ability to produce millions of reads in a single run. The NGS technologies produce reads with high sequence quality and high throughput, but the reads are short and need to be assembled de novo, which can be challenging; therefore, qualified investigators are highly desirable for NGS results analysis (Sabat et al., 2013). Recently, a NGS-based method for the 16S-23S rRNA region was developed by Sabat and colleagues (Sabat et al., 2017). This method is based on PCR amplification of the 16S-23S rRNA region, followed by amplicon sequencing on the MiSeq platform (Illumina, Inc., San Diego, CA, USA). The resulting reads are de novo assembled into contigs. Species identification is based on alignment of the contig sequences with sequences deposited in the reference databases (Sabat et al., 2017). This method can be used for identification of common pathogens, such as Staphylococcus aureus or Escherichia coli, directly from patient specimens with a high identification potential. Identification is also possible for non-cultured microorganisms, polymicrobial samples or samples with a DNA concentration that is too low for direct whole genome sequencing (WGS). However, the main disadvantage of this method is a lack of 16S-23S rRNA reference sequences for many bacterial species, which hinders proper interpretation of the results (Sabat et al., 2017). The main aim of this study was to develop a dataset of reference sequences of the 16S-23S rRNA region for almost all staphylococcal species. For validation of the dataset, we also compared the identification potential of NGS of the 16S-23S rRNA region with Sanger sequencing of the 16S rRNA, sodA, tuf and rpoB genes based on whole database and determined the cut off values for genus-and species-level identification of staphylococcal strains. Finally, we confirmed the applicability of NGS of the 16S-23S rRNA region for species identification with 45 clinical samples using the newly developed cut off values for staphylococcal species.

Bacterial Isolates
The collection of bacterial reference strains used in this study is described in Table 1

Genomic DNA Extraction
For genomic DNA extraction, the isolates were grown for 18-20 h at 37 • C on blood agar plates. A full inoculation loop of 10 µl of bacterial colonies was homogenized with a TissueLyser II (Qiagen, Germantown, MD, USA). Total DNA was extracted by enzymatic lysis using the buffers and solutions provided with the DNeasy Blood and Tissue Kit (Qiagen, Germantown, MD, USA) according to the manufacturer's instructions. To obtain accurate quantification of the extracted genomic DNA for NGS, the Qubit dsDNA BR Assay Kit, which is a fluorometric method specific for duplex DNA, and the Qubit fluorometer 2.0 (Life Technologies, Inc., Eggenstein, Germany) were used according to the manufacturer's instructions.

Genomic DNA Extraction of Clinical Samples
The Purelink Genomic DNA purification kit (Invitrogen, Carlsbad, CA, USA) was used for DNA extraction. Briefly, initial lysis was performed using 180 µl Purelink genomic digestion buffer and 20 µl Proteinase K. Digestion was performed in a thermoshaker at 56 • C until lysis was complete. Two hundred microliter Purelink Genomic lysis/binding buffer was added to 200 µl of lysed sample and vortexed to create a homogenous solution. Two hundred microliter 96% ethanol was added and the DNA purification protocol was followed according to the manufacturer's instructions.
PCR Amplification and Sanger Sequencing of the 16S rRNA, sodA, tuf, and rpoB Genes All reference strains were identified at the species level by polymerase chain reaction (PCR) and Sanger sequencing of the 16S rRNA, sodA, tuf, and rpoB genes. Additional strains were identified at the species level by routine diagnostic methods and 16S rRNA gene sequencing. The amplification and sequencing primers and PCR conditions are listed in Table 2. All PCR products were resolved by electrophoresis using the 2200 TapeStation System (Agilent Technologies, Santa Clara, CA, USA) and then purified using the DNA Clean & Concentrator TM -5 purification kit (Zymo Research, Irvine, CA, USA). The pair-end Sanger sequencing with forward and reverse strand sequencing was performed in GATC/ Eurofins Genomics company (Ebersberg, Germany).

S. aureus and 16S rDNA Real-Time PCR of Clinical Samples
In order to detect the presence of Staphylococcus DNA in clinical samples, two real-time PCR assays were performed as showed in Table 3. Firstly, the 16S rDNA assay was used to assess the bacterial load followed by the more specific S. aureus realtime PCR.

Next-Generation Sequencing of the 16S-23S rRNA Region
Amplification of the 16S-23S rRNA region was performed using the primers 16S-27F (5 ′ -AGAGTTTGATCMTGGCTCAG-3 ′ ) and 23S-2490R (5 ′ -GACATCGAGGTGCCAAAC-3 ′ ) as described previously (Sabat et al., 2017). The PCR program for pure strains was as follows: initial denaturation for 2 min at 94 • C, followed by 30 cycles of denaturation at 94 • C for 30 sec, annealing at 66 • C for 30 sec, and extension at 72 • C for 120 sec and a final extension at 72 • C for 5 min. For clinical samples, inclusion of a polymerase (MTP Taq DNA Polymerase, Sigma-Aldrich, St. Louis, MO, USA) recommended by the vendor for clinical use was essential because contamination occurred in the negative controls. To enhance the sensitivity of the PCR product, the number of PCR cycles was increased to 35. The PCR program used for clinical samples was as follows: initial denaturation for 2 min at 94 • C, followed by 35 cycles of denaturation at 94 • C for 30 sec, annealing at 66 • C for 30 sec, and extension at 72 • C for 120 sec and a final extension at 72 • C for 5 min. The same program was used for the included controls. The obtained PCR products were purified and the DNA libraries were prepared with Nextera XT DNA Sample Preparation Kit (Illumina, Inc., San Diego, CA, USA) according to the manufacturer's instructions. The indexed libraries were pooled and loaded onto an Illumina MiSeq reagent cartridge using MiSeq reagent kit v3 and 600 250 ng Mellmann et al., 2006 cycles. The 2 × 300 bp sequencing was run on an Illumina MiSeq platform.

Data Analysis
The Sanger sequencing results were analyzed using the Chromas software (Technelysium Pty Ltd., South Brisbane, Australia). The obtained sequences were analyzed using nucleotide BLAST (Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/ BLAST/) and aligned to the reference sequences deposited in the GenBank (v. 231.0; June 25, 2019) and leBIBI databases. The best and the second best species alignments were analyzed.According to the criteria developed by Sabat et al. (2017), the bacterial species were assigned when the similarity score was 99% or higher and the similarity score differences with the next closest species was ≥0.2%. Therefore, the identification at the species level using Sanger sequencing of the 16S rRNA (1284-bp), sodA (430-bp), tuf (884-bp), and rpoB (740-bp) gene fragments was considered as unambiguous for sequences different in at least 3, 2, 3, and 3 nucleotides, respectively. The identification at the species level using NGS of the whole 16S-23S rRNA region 1x TaqMan   (4.3-kb) was considered as unambiguous for sequences different in at least 9 nucleotides. All sequences were compared with each other and to whole public database. The sequences were aligned in ClustalW (Larkin et al., 2007), and the phylogenetic trees were constructed using the neighbor-joining method (Saitou and Nei, 1987). The trees were drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Jukes-Cantor method (Tamura et al., 2004) and were in the units of the number of base substitutions per site. All positions containing gaps were eliminated. The evolutionary analyses were conducted in MEGA7 (Kumar et al., 2016). The pairwise comparison of each pair of sequences was obtained using the CLC Genomics Workbench (Qiagen, Germantown, MD, USA) considering deletions as differences.
The NGS generated 17,000-210,000 sequencing reads for pure culture to obtain a minimum coverage of 1,000 per sample. The fastq files (Illumina MiSeq) with read lengths of 250 or 300 nucleotides were de novo assembled with the DNASTAR SeqMan NGen software (version 12.1.0; DNASTAR, Madison, WI, USA). During read assembly, reads shorter than 250 nucleotides were excluded. For most species, the minimum match percentage was 85% or 93% and the mer size was set as 31 nucleotides. The minimum matches of 94% were required for S. agnetis, S. capitis subsp. capitis, S. pettenkoferi, and S. saprophyticus subsp. saprophyticus and 86% for S. carnosus subsp. utilis, S. haemolyticus, S. massiliensis, and S. nepalensis to obtain the best quality sequences. After assembly, mean sample coverage was 4349.75-fold. However, the coverage per sample varied between 966.96-and 10649.1-fold. Only runs with a Q30 read quality score of >80% were accepted. If the assembly resulted in multiple contigs, the obtained ones were checked for length and quality in order to select the longest main contig with the highest reads amount assigned. Finally, the main contig was exported as fasta file for use in the subsequent analyses. For all species and subspecies, the main contig comprising the whole 16S-23S rRNA region extending from 4,237 (S. epidermidis) to 4,625 nucleotides (S. lugdunensis) was obtained. For clinical samples, the mer size was set as 31 nucleotides, and the minimum match percentage was 93%. Most of the identified bacterial species or genera were represented by a single contig of the expected size of approximately 4,500 bp, but in some cases smaller contigs ranging from 700 to 3,500 bp in size represented the same bacterial species in a sample. If several contigs were annotated to the same microorganism in some samples, the reads of these contigs were added up. Species identification was based on alignment of contig sequences with 16S-23S rRNA sequences deposited in the GenBank database using nucleotide BLAST and comparison to the leBIBI database (using the 16S rRNA gene sequence as a reference). Additionally, all staphylococcal species identifications were checked with the Staphylococcus reference database. The bacterial species and genus were assigned when the similarity scores were as previously described by Sabat et al. (2017). A score below 90% was interpreted as an unidentified organism.

Sanger Sequencing of the 16S rRNA Gene
As presented in Table 4, Sanger sequencing of the 16S rRNA gene allowed unambiguous identification at the species level for 27

Sanger Sequencing of the rpoB Gene
Sanger sequencing of the rpoB gene allowed unambiguous identification at the species level for 48 species (including one proposed species), which constituted 96% of the species. Identification of the following pairs or groups of species was impossible because the rpoB gene sequences were identical or almost identical: S. felis -S. cohnii and S. intermedius -S. pseudintermedius. The rpoB sequences for eight species (S. argensis, S. chromogenes, S. devriesei, S. massiliensis, S. muscae, "S. pseudolugdunensis, " S. schweitzeri and S. stepanovicii) were not available in the GenBank (v. 231.0; June 25, 2019) database. Identification at the subspecies level was possible for S. hominis subsp. hominis -S. hominis subsp. novobiosepticus, S. petrasii subsp. jettensis -S. petrasii subsp. pragensis, and S. succinus subsp. casei -S. succinus subsp. succinus using the rpoB gene. Considering the nucleotide differences in the rpoB gene sequences between pairs of species used within this study, the pairs with the lowest differences were S. nepalensis -S. stepanovicii (5 nucleotides) and S. pettenkoferi -"S. pseudolugdunensis" (13 nucleotides). The pair with the highest nucleotide difference was S. fleurettii -S. piscifermentans, which differed in 192 nucleotides and was the highest difference among those of all genes used in this study (Table 4).

Sanger Sequencing of the sodA Gene
Sanger sequencing of the sodA gene allowed unambiguous identification at the species level for 48 species (including one proposed species), which constituted 96% of the species. Identification of the following pairs or groups of species was impossible because the sodA gene sequences were identical or almost identical: S. nepalensis -S. hominis and S. warneri -S. epidermidis. The sodA gene sequences were not available in the GenBank (v. 231.0; June 25, 2019) database for 7 species (S. argensis, S. devriesei, S. petrasii, "S. pseudolugdunensis, " S. schweitzeri, S. simiae, and S. stepanovicii) ( Table 4). The pairs of species used within this study with the lowest nucleotide differences in the sodA gene sequences were S. pettenkoferi -"S. pseudolugdunensis" (4 nucleotides), S. argensis -"S. pseudolugdunensis" (6 nucleotides) and S. argensis -S. pettenkoferi (8 nucleotides), and the pair with the highest difference was S. aureus -S. fleurettii, which differed in 149 nucleotides ( Table 4). Only the pair S. petrasii subsp. jettensis -S. petrasii subsp. pragensis could be identified at the subspecies level using sodA gene sequencing.

Next-Generation Sequencing of the 16S-23S rRNA Region
The nucleotide sequences of the 16S-23S rRNA region were obtained for all staphylococcal species. Analysis of the GenBank (v. 231.0; June 25, 2019) database showed that 16S-23S rRNA region sequences were available for 29 Staphylococcus species, whereas this study allowed the development of nucleotide sequences for an additional 21 species. Taking into consideration the differences in the length of the intergenic spacer located between the 16S and 23S rRNA genes, the average sequence length of the 16S-23S rRNA region was determined and equaled 4,381 nucleotides. The highest similarity among species analyzed within this study was found between S. pettenkoferi and "S. pseudolugdunensis" showing 99.8% sequence homology (7 nucleotides of difference), while the highest nucleotide difference was found between S. delphini and S. fleurettii and equaled 667 nucleotides (85.6% similarity).

Comparison of the Sequencing Methods
All strains from the collection were characterized by Sanger sequencing of the 16S rRNA, rpoB, sodA, and tuf genes but identification to the species level was not possible by all targets used due to identical or almost identical sequence or the lack of some reference sequences in the GenBank (v. 231.0; June 25, 2019) database. The identification was confirmed by 2 targets for 4 species (S. aureus, S. pettenkoferi, "S. pseudolugdunensis" and S. schweitzeri), by 3 targets for 12 species and by 4 targets for the vast majority of the species (34 species) ( Table 4).
To show the relationships among the species, phylogenetic trees were constructed. The computed overall mean distances according to the Jukes-Cantor model for the 16S rRNA, sodA, tuf, and rpoB genes and the 16S-23S rRNA region were 0.027, 0.023, 0.090, 0.178, and 0.045, respectively. Based on the criteria used for phylogenetic tree construction, the trees were constructed using the neighbor-joining method and revealed that the isolates clustered into groups for all of the methods used. All of the methods showed that S. massiliensis and S. auricularis were distantly related to the other species (Figure 1 and Supplementary Figures 1A-D). Analysis of the phylogenetic tree of the 16S-23S region showed similar clustering to the dendrogram based on 16S rRNA gene sequencing but was more discriminative, with unambiguous identification of almost all staphylococcal species (Figure 1).
For Staphylococcus species, sodA and rpoB genes Sanger sequencing had the highest identification potential allowing for an unambiguous identification of 96% of analyzed species, while the NGS-based method allowed for identification of 90% species ( Table 4). The 16S rRNA gene Sanger sequencing had the lowest identification potential of all the methods used.

Additional Staphylococcal Strains
To compare the routine diagnostic methods with NGS of the 16S-23S rRNA region for identification of Staphylococcus species, 101 staphylococcal strains isolated from human infections, animal infections, natural environments and reference collections were included (Supplementary Table 1). All strains were identified at the species level by different standard diagnostic tools, such as MALDI-TOF MS, VITEK R 2 ID, the BD Phoenix TM system and the API STAPH ID system. To confirm the obtained results, all strains were identified using 16S rRNA gene sequencing. Finally, all strains were identified by NGS of the 16S-23S rRNA region. Of all of the strains, 76 (76.8%) were correctly identified and 25 (25.2%) were misidentified by the routine diagnostic methods. The 16S rRNA sequencing allowed unambiguous identification of only 32 strains (31.7%), whereas 69 strains (68.3%) were incorrectly identified. NGS of the 16S-23S rRNA allowed unambiguous identification of all strains, including one nonstaphylococcal strain misidentified by standard methods and also unidentified by 16S rRNA sequencing as Shigella dysenteriae or Escherichia coli. NGS of the 16S-23S region allowed unambiguous identification of 5 S. pseudintermedius strains, which was not possible with any other method used. Moreover, the reference 16S-23S rRNA sequence dataset allowed differentiation of one S. agnetis strain that was aligned as S. hyicus by MALDI-TOF MS and both 16S rRNA [1 st ID: S. agnetis 1283/1283 (100%);

Intraspecies Nucleotide Sequence Variation of the 16S-23S rRNA Region
To show the variability of 16S-23S rRNA region, the nucleotide sequence variation within Staphylococcus species was determined ( Table 6). The analysis was performed for those species for which at least one nucleotide sequence of the 16S-23S rRNA region could be found in the GenBank (v. 231.0; June 25, 2019) database. For almost all species, the length of the 16S-23S rRNA region was the same within a species when the sequences obtained in this study and those deposited in GenBank (v. 231.0; June 25, 2019) were compared. The length of 16S-23S rRNA region was different within the same species only in case of S. capitis subsp. capitis, S. muscae, and S. pettenkoferi. The nucleotide variation within Staphylococcus species accounted from 0.05 to 5.9%, with the exception of S. aureus, S. capitis, and S. sciuri for which the intraspecies nucleotide variation was 9.70, 7.0, and 6.56%, respectively.

Clinical Samples
The aim of this part of our study was to assess the potential of NGS of the 16S-23S rRNA region to improve the resolution of Staphylococcus species identification directly from clinical samples. Forty-five clinical samples from various human infections were included ( Table 7). Selection of these samples was based on a positive culture for Staphylococcus species. Staphylococcus species were identified by culture in all 45 samples, whereas NGS of the 16S-23S rRNA region identified Staphylococcus species in 37 samples (STA1-STA6, STA8-STA12, STA14-STA15, STA17-STA29, STA32-STA40, STA42, and STA45). Among the analyzed samples, conventional culture methods and NGS of the 16S-23S rRNA region identified the same Staphylococcus species in 27 samples (STA1, STA4, STA5, STA9-STA11, STA15, STA17-STA24, STA26-STA28, STA32-STA34, STA36-STA40, and STA42). In 10 samples, NGS of the 16S-23S rRNA region identified a higher number  of Staphylococcus species than culture, showing two species in samples STA6, STA8, STA12, and STA45, three species in samples STA3, STA14, STA29, and STA35 and 4 and 6 species in samples STA2 and STA25, respectively. In a few samples, the culture-based methods allowed identification at the genus level, whereas NGS of the 16S-23S rRNA region was able to identify microorganisms at the species level (STA1, STA14, and STA31). Moreover, the NGS-based approach showed other pathogens and the coexisting microflora. The species content of the samples identified by NGS contained a range from one (STA4, STA26, and STA36) to a maximum of 33 (STA14) different microorganisms.

The Discrepancy Analysis of Identification Results in Clinical Samples
Twenty four samples with both consistent and discrepant S. aureus identification by culture and NGS-based approach were selected for the 16S rRNA and S. aureus real-time PCR (Table 8).
For all selected samples the 16S rRNA assay revealed presence of bacterial DNA and the S. aureus real-time PCR revealed the presence of S. aureus DNA between Ct = 18 and Ct = 33.1, and Ct = 21.58 and Ct = 36, respectively. Comparing the results of the differences in Ct values ( Ct) between 16S rRNA and S. aureus PCR, samples were divided into 3 groups. For the first one (STA2, STA4, STA5, STA6, STA9, STA12, STA32, STA34, STA35, STA36, STA37, STA42, and STA45) the NGS-based approach was able to detect S. aureus at high level between 24.3 and 100% (reads in a sample) and the Ct values were the lowest counting from 0.03 to 2.42. The second group included samples (STA1, STA3, STA13, STA33, STA38, and STA39) for which the Ct values varied from 0.85 to 16.58 and the S. aureus was detected at the bottom level of 0.2% to maximum 8.5%. The third group consisted of the samples for which the NGS-based approach failed to identify the S. aureus (STA7, STA31, STA41, STA43, and STA44). For this group the Ct values were very high at the level between 7.08 and 11.32. Therefore, a lack of S. aureus detection with NGS-based approach was a result of a low ratio between S. aureus DNA and total bacterial DNA in a sample.

DISCUSSION
Because of the increasing clinical significance of CoNS (Becker et al., 2014), accurate identification of staphylococci at the species level is highly desirable to permit a more precise determination of host-pathogen relationships and to better understand the pathogenic potential of various staphylococcal species. Phenotypic identification of CoNS appears to be unsatisfactory, unreliable, and irreproducible (Heikens et al., 2005;Dupont et al., 2010;Bergeron et al., 2011;Singhal et al., 2015;Ayeni et al., 2017). Therefore, applying genetic methods in standard microbiological diagnostics is necessary to improve the identification process. When an unknown organism needs to be identified in a clinical sample, 16S rRNA gene sequencing is the method of choice because of the availability of universal primers (Clarridge, 2004). The 16S rRNA gene sequencing is an appropriate target for most staphylococcal species; however, for some species, inter-species differentiation is difficult or impossible due to missing or insufficient heterogeneity within the 16S rRNA gene. Most reports show that the discriminatory power of 16S rRNA gene sequencing is very low for closely related Staphylococcus species (Heikens et al., 2005;Mellmann et al., 2006;Woo et al., 2009;Shin et al., 2011;Lange et al., 2015). Moreover, the accuracy of identification of bacterial species with 16S rRNA gene sequencing is hindered by the low quality of many of the sequences deposited in public databases (Becker et al., 2004). Other targeted sequencing methods may have a higher identification potential than 16S rRNA gene sequencing but often are limited to selected genera (Li et al., 2012). This study showed sodA and rpoB targets were the most discriminative but NGS of the 16S-23S rRNA region was more discriminative than tuf gene sequencing and much more discriminative than 16S rRNA gene sequencing based on obtained sequences and whole database search. Moreover, the NGS-based method showed the same clustering as the other methods (Figure 1). Because NGS of the 16S-23S rRNA region uses universal primers, this method is applicable to different and genetically unrelated bacterial genera.
Beyond the comparison of the five sequence-based methods used for staphylococcal identification, the main purpose of this study was to develop and validate a complete staphylococcal reference sequence dataset for the 16S-23S rRNA region and to evaluate the potential of this method for clinical samples. The NGS of the 16S-23S rRNA region developed by Sabat et al. (2017) provides the ability to detect microorganisms not only in samples from mixed infections, which also consist of commensal microorganisms, but also in whole microbiomes.   ( Frontiers in Cellular and Infection Microbiology | www.frontiersin.org    However, this method suffers from a lack of reference sequences in the GenBank database for many bacterial species at present. Prior to this study, 16S-23S rRNA sequences were available for only 29 Staphylococcus species. Our investigations allowed development of 16S-23S rRNA sequences for an additional 21 species, making identification of almost all Staphylococcus species feasible with the exception only of the recently described Antarctic S. edaphicus species (Pantøuček et al., 2017). In order to identify strains at the species level, the reference sequence with the highest similarity score needs to be found. For several Staphylococcus species, only one or a few reference 16S-23S rRNA sequences can be found during BLAST searches in the GenBank database. In such cases, it is possible that the sequence obtained during a study belongs to a different evolutionary cluster within a species than the reference and the nucleotide differences between them are high (more than 1%). Then, it is not possible to assign bacterial species with the similarity score 99% or higher. If more reference sequences are deposited in the genetic sequence databases, representing evolutionary diverse lineages, species will always be assigned with a similarity score above 99%.
The NGS of 16S-23S rRNA approach proved to be an excellent tool for identification at the species level for a great majority of Staphylococcus strains. Nevertheless, some problematic cases were found. In our study, in case of pairs: S. pseudintermedius -S. aureus; S. simulans -S. hyicus and S. warneri -S. epidermidis, an alignment to the next closest species accounted <0.2% but to only one and not published genome assembly. In all these cases, the second next species was aligned <99% similarity. Similar situation was found for S. agnetis -S. hyicus and S. schweitzeri -S. aureus. As the problems in accurate identification of these species are described (Tong et al., 2015;Adkins et al., 2017), we believe that the increase of deposited sequences for S. agnetis and S. schweitzeri will allow for an unequivocal identification. It is very important to develop a well-curated database with a verification of deposited sequences in terms of proper organism identification. For now, the sequences that are not published should not be considered as reference ones. There is no previous single study with a same dataset of reference sequences for genes commonly used for staphylococci identification, so usually those sequences cannot be compared. In this study, we have not only deposited such dataset for 4 commonly used identification targets but also added a package of sequences for a new identification tool with a high identification potential.
The developed reference dataset improved the identification accuracy of staphylococci in clinical samples. Data from this study showed that NGS of the 16S-23S rRNA region for most clinical samples correctly identified the bacterial species that were identified using culture-based methods. In most samples, a few more species of the same genera or other genera were also identified by NGS. In some samples, NGS allowed identification of clinically relevant pathogens that often remain unidentified by culture methods due to their challenging growth, such as GPAC (Gram-positive anaerobic cocci) bacteria, Actinotignum sp., and Streptococcus species (Murphy and Frick, 2013;Pedersen et al., 2017). In other cases, the NGS results were not consistent with the culture methods. In some cases, the NGS showed the dominating unidentified species within the samples, and in others, a lowquality PCR product was obtained. These problems may be removed by obtaining more reference sequences for other genera and using the NGS method for more clinical samples to slightly improve the PCR conditions. Moreover, when differences can be found within the whole 16S-23S rRNA region sequence, then polymorphism within the species is the cause, and inclusion of a higher number of reference sequences will improve the specieslevel identification. Facilitating NGS 16S-23S rRNA data analyses is crucial because at present, this method is quite complicated for clinical samples since it is time consuming and the results may be difficult to interpret. The new software that is being created will be a great improvement for researchers. Importantly, positive and negative controls should always be included to monitor the whole process.
The rapid development of DNA sequencing techniques has allowed substantial improvement of culture-independent identification of microbial pathogens. Additionally, advances in DNA sequencing techniques have allowed simultaneous investigation of millions of DNA fragments and enabled rapid identification of all microorganisms present in a given clinical sample. NGS-based techniques, especially NGS of the 16S rRNA gene, have been successfully applied for the comprehensive analysis of microbiomes not only from healthy people but also from microbiomes associated with many diseases (Jervis Bardy and Psaltis, 2016;Jovel et al., 2016;Pérez-Losada et al., 2017). Microbiome analysis could lead to reclassification of terms such as "infectious agent" or "bacterial pathogen, " because any microbiome appears to have one or more dominant bacteria but also contains potentially important coconspirators that may modulate growth, virulence, biofilm formation, quorum sensing, and antibiotic resistance, and sensitive NGS-based techniques enable their detection (Toma et al., 2014). In any case, identification of microbiome constituents at the phylum or genus level does not provide sufficient microbiological details. This lack occurs because microbes at the species level are transmitted between hosts and have different transmission power, tenacity and biological behavior. NGS of the 16-23S rRNA region allows polymicrobial diagnostics, and analysis of the intergenic regions contained in this region significantly increases the identification potential of the method, allowing unambiguous species identification.
In conclusion, our study demonstrated the reliability of NGS of the 16S-23S rRNA region for staphylococcal identification at the species level. The method based on NGS of the 16S-23S region undoubtedly had the highest identification potential of all of the methods used. We have developed a reference dataset of the 16S-23S rRNA region for 50 staphylococcal species (including one proposed species) and 6 subspecies. Therefore, all clinically relevant staphylococcal species can be detected in patient specimens at present. Expanding the database in the future will allow this approach to constitute a highly precise, rapid and reliable method for highly specific microbial identification in general.

AUTHOR CONTRIBUTIONS
AS, AK-S, and AF designed the project. AK-S, KB, and JM provided the strains and clinical samples with their data. MK-S, VA, EvZ, GW, and AS performed the experiments. MK-S and AS carried out de novo assemblies. All authors interpreted the data. MK-S and AS wrote the manuscript. All authors reviewed the manuscript.

FUNDING
This project was financed by funds granted by the National Science Center (NCN, Poland) on the basis of the decision no. UMO-2016/21/N/NZ6/00981 (for MK-S) and in part by the European Regional Development Fund within the EurHealth-1Health project (EU/INTERREG VA-681377 to KB and AF). The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.