Diagnostic Sequences That Distinguish M. avium Subspecies Strains

Over a decade ago Mycobacterium avium subspecies paratuberculosis (Map) specific genes were initially identified in a whole genome context by comparing draft genome sequences of Map strain K-10 with Mycobacterium avium subspecies hominissuis (Mah) strain 104. This resulted in identification of 32 Map specific genes, not including repetitive elements, based on the two-genome comparison. The goal of this study was to define a more complete catalog of M. avium subspecies-specific genes. This is important for obtaining additional diagnostic targets for Johne's disease detection and for understanding the unique biology, evolution and niche adaptation of these organisms. There are now over 28 complete genome sequences representing three M. avium subspecies, including avium (Maa), Mah, and Map. We have conducted a comprehensive comparison of these genomes using two independent pan genomic comparison tools, PanOCT and Roary. This has led to the identification of more than 250 subspecies defining genes common to both analyses. The majority of these genes are arranged in clusters called genomic islands. We further reduced the number of diagnostic targets by excluding sequences having high BLAST similarity to other mycobacterial species recently added to the National Center for Biotechnology Information database. Genes identified as diagnostic following these bioinformatic approaches were further tested by DNA amplification PCR on an additional 20 M. avium subspecies strains. This combined approach confirmed 86 genes as Map-specific, seven as Maa-specific and three as Mah-specific. A single-tube PCR reaction was conducted as a proof of concept method to quickly distinguish M. avium subspecies strains. With these novel data, researchers can classify isolates in their freezers, quickly characterize clinical samples, and functionally analyze these unique genes.


INTRODUCTION
Accurate diagnosis of mycobacterial infections is important for effective disease management within the livestock production industry. National and state veterinary laboratories are routinely confronted with unresolved cases of non-tuberculous mycobacterial (NTM) infections that are not further delineated due to complicated cross-reactive results as well as a lack of time and funds. However, correct identification of M. avium subspecies is very important because of the divergent, yet prominent, clinical relevance that exists among these strains (1). The Mycobacterium avium subspecies comprise three predominant, closely-related subspecies, M. avium subspecies avium (Maa), M. avium subspecies hominissuis (Mah), and M. avium subspecies paratuberculosis (Map) and their genetic similarity has limited the identification of diagnostic targets that distinguish each subspecies. A forth subspecies has been reported (silvaticum), but the lack of circulating strains and sequences in public databases suggests this subspecies designation may be artifactual (2). Map is the causative agent of Johne's disease, which is a chronic intestinal ailment of ruminants that results in significant economic loss to dairy, meat and wool industries (3,4). Mah is an opportunistic pathogen of swine and humans that manifests most commonly as a pulmonary infection in humans (5), while Maa is primarily a pathogen of birds (6).
In this report, the terms Map and non-Map are used to divide the M. avium subspecies. This is due to the focus on diagnostic Map sequences, which are major tools for research and surveillance of Johne's disease. Conversely, very few studies have reported sequences that are diagnostic for non-Map subspecies. The only known sequences that are specific for the Maa and Mah subspecies include IS1245 (7) and genes contained within insertions or other genetic rearrangements termed large sequence polymorphisms (LSPs) (8). Whole genome comparative methods are a comprehensive approach for identifying specific genes useful in molecular diagnostic tests. In this approach, genomes from closely related organisms are compared to identify genes uniquely correlated with particular groups. A complicating factor is that M. avium subspecies strains studied thus far all share >98% average nucleotide identity across the entire genome (2). In addition, the core genome appears stable for Map with 80% of the genome sequence comprising genes present in all strains, whereas the core genomes of non-Map subspecies comprise only 40% of genes (2). Despite strong genetic similarity and availability of commonly used diagnostic targets (i.e., IS900), there are still undiscovered and unexplored genetic targets that may be used in diagnostic tests to distinguish the M. avium subspecies. These sequences are also important to understanding the unique biology, evolution and niche adaptation of these strains. A primary goal of this study is to identify genetic targets that can distinguish M. avium subspecies from each other.
Despite high sequence identity among these subspecies, some specific sequences have long been identified and used for diagnostics of Johne's disease. The most commonly used target is the repetitive sequence, IS900, which is present in 16-22 copies in Map. This insertion element was identified as Map-specific long before whole-genome sequences became available (9,10). Other sequences reported as specific to Map include the single-copy genes hspX (MAP_RS11095; MAP2182c) (11, 12), F57 (MAP_RS22535; MAP_0865) (13), and locus 251 (MAP_RS14130; MAP2765c) (14,15). Locus 251 was originally identified from shotgun sequence data shortly before the first complete genome of an M. avium subspecies was assembled (15) and later used as a target in commercial PCR assays. F57 was originally identified from a genomic DNA library screen (16) and was later discovered as part of the MAP_0865 coding sequence present on a large sequence polymorphism termed LSP4 (17). This gene has also been a common target in real time PCR assays (13,18,19). Thirty-two Map-specific genes were identified by genomic hybridizations to DNA microarrays (20)(21)(22), but this approach lacked the resolution and completeness of whole-genome sequences.
To identify all remaining sequences specific for Map and new diagnostic sequences for Maa and Mah, complete genomes from several strains were compared and candidate genes identified. Two independent bioinformatic analyses were first conducted to obtain subspecies specific genes in M. avium. These analyses included Roary (23) and PanOCT (24) pan genome tools, which identify core and accessory genes for a group of closely related organisms. PanOCT works by joining homologous and conserved gene neighborhoods to cluster orthologous proteins while Roary preclusters highly similar protein sequences before BLASTP similarity searching. PanOCT analysis yields a variety of comparative data about the input genomes (24), but one of the more useful outputs are the match tables. These tables list orthologous clusters of genes along with information on the cutoff metrics that would result in genes being declared absent in the match table. Match tables were constructed from PanOCT analysis showing all genomes, including their corresponding genes and percent identity of their gene products. These tools were used previously to determine the core genome of Map and non-Map strains (2). Finally, once specific genes are identified, DNA amplification primers can then be designed for further testing on a panel of organisms representing additional M. avium strains.

In silico Identification of Subspecies Specific Genes
PanOCT output includes a match table of all genes in every input genome (Genbank flatfile format) with orthologs arranged in rows and genomes arranged in columns. The input genomes included the 29 M. avium genomes reported in Tables 1, 2 of (2). When orthologs are not present in a genome(s), this is interpreted as one of several possibilities: the amino acid percent identity is <35%, the BLAST E-value was >1 × 10 −5 , the minimum percent match length of subject and query (default is 1%) was not met, the frame-shift overlap parameter (default is 1.33) triggered a non-match, there were more than 20 amino acids at the beginning or end of a match that were missing, and finally, there was not at least one match that confirmed a protein fragment/frame-shift. Roary (v3.11.2) was used to define the accessory genome of each subspecies and was launched with -e and -n options to compute rapid core gene alignment. Coding sequences from both analyses were assembled and compared. Only subspecies-specific genes identified by both approaches were analyzed by DNA amplification.

Primer Design
Primers were designed using the National Center for Biotechnology Information (NCBI) primer search software (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). Parameters that differed from the defaults included minimum product size of 300 bp, maximum product size of 800 bp, optimal primer  If the open reading frames were small and adjacent to each other, a forward primer was designed in one coding sequence and the reverse primer was designed in the neighboring coding sequence. This approach provided more sequence to design optimal primer pairs and lessened the number of PCRs to be conducted. In one case, a primer pair was designed that spanned three small coding sequences, MAP_RS10970, MAP_RS10975, and MAP_RS10980.

Functional Analysis of Genes
All Roary Map-specific protein sequences of each gene (155 total) were extracted using an in-house python script (Supplementary Materials) based on Biopython (https:// biopython.org). KEGG orthologs for each protein sequence were assigned using kofamscan tool (https://github.com/ takaram/kofam_scan) based on HMM profile search with default parameters. Kofamscan results were parsed using another in-house python script (searchKEGG.py). Briefly, for each gene, we selected the best result based either on the ratio value compute as score / threshold_score where available or on the lower e-value. K numbers were then used to retrieve the complete KEGG hierarchies and functional classes were assigned based on the KEGG hierarchies. For functional analysis using eggNOG, all Roary Map-specific protein sequences were again extracted. Proteins were aligned against the eggNOG database v5.0 using Diamond (https://github.com/bbuchfink/ diamond) and functional annotation was assigned to each protein sequence when available using emapper.py python script from eggnog-Mapper tool. Results were parsed using an in-house python script (Supplementary Materials) in order to extract relevant information.

DNA Amplification
Genomic DNA was purified from mycobacterial strains shown in Table 1 using the method described previously (15). Advantage GC-2 (Takara Bio) was used for all amplifications on 100 ng of DNA and 100 µM of each primer. All reactions were repeated twice and if an unexpected result occurred, the reaction was again tested using EconoTaq PLUS GREEN master mix (Lucigen

Data Availability
Raw sequence data supporting the results of this article were deposited in the European Nucleotide Archive (ENA) and NCBI under accession PRJEB2204.

Map
The initial number of subspecies-specific gene candidates were obtained by two comparative genomic methods, Roary and PanOCT analysis. The same set of 28 complete genomes that were used in our previous study (2) was included herein. PanOCT analysis revealed a total of 115 genes that are present in all Map genomes, but absent in all non-Map genomes while Roary analysis identified 155 Map-specific genes ( Table 2;  Supplementary Table 2). The total number of Map-specific genes identified by either Roary or PanOCT analysis was 156 (Supplementary Table 2).
Of the 156 total genes, Roary analysis uncovered 41 genes not found by PanOCT analysis whereas one gene was identified by PanOCT analysis that was not listed by Roary. The reason for this single-gene discrepancy is due to different translational start sites in one of the 13 Map orthologs (EGA31_RS21685 in the Telford strain) encoding an alpha/beta hydrolase, which made this gene appear as though it was not present in all Map strains analyzed (Supplementary Figure 1; Table 3). Thus, it was excluded by Roary analysis. The 41 genes present by Roary analysis but absent by PanOCT is due to the different clustering methods used by these bioinformatic tools. However, it is of interest to further explore this subset of 41 genes to understand which bioinformatic approach might be better for defining these gene targets. BLAST analysis of these 41 genes against the NCBI database showed reasonably strong homologies to other, more distantly related, Mycobacterium species (Supplementary Table 3). These results suggest that PanOCT may be better suited to consider more evolutionarily distant strains than Roary. Nonetheless, 114 Map-specific gene candidates are present in both analyses (Supplementary Table 2). This result excludes Map-specific repetitive sequences, such as IS900, which are not annotated in the E1 and E93 genome sequences (2).

Non-map
The Mah-specific and Maa-specific genes were also obtained using similar analyses. There are only nine Mah-specific genes identified by PanOCT and 13 by Roary analysis ( Table 2;  Supplementary Table 4). Eight of these genes are present in both analyses. One gene, encoding hydroxycinnamic acid hydroxylase ( Table 3), that is uniquely present in the PanOCT analysis was due to differences in the clustering of orthologous genes. In this case, the BLAST score ratio, when used alone, would not classify these orthologs as a cluster. Therefore, Roary excluded it. However, when the conserved gene neighborhood is factored in the PanOCT analysis, these genes are found rooted in the same source. This might be what one would expect if there was a single event that introduced an entire cluster followed by divergence of a gene within that cluster. Conversely, there are five Mah genes identified as specific only by Roary analysis. As before, these differences were due to the clustering methods used by these bioinformatic tools (see Discussion). The largest set of subspecies-specific genes were obtained for Maa strains, which included 157 genes identified by both analyses ( Table 2;  Supplementary Table 5).

Location of M. avium Subspecies-Specific Genes
Subspecies specific genes are not randomly distributed over the entire genome but tend to be organized in clusters, termed large sequence polymorphisms (LSPs) (8,17). This fact is supported by the results of this study, where several specific genes are contained within LSPs regardless of the subspecies (Figure 1). Maa shows the most clusters (26 total), followed by Map and then Mah, which only has four clusters of eight total genes. For example, the largest genomic island is just over 100 kb in Maa and it contains 97 Maa-specific genes. This island was previously described as LSP1 by Semret et al. (8). The location of LSP1 is at 7 o'clock on the Maa genome map in Figure 1. Likewise, the largest genomic island in Map includes 30 genes spanning 55 kb and was previously named LSP P 14 (17). Not all genes within any defined LSP were found specific to a given subspecies in this study.

BLAST Similarity and in silico PCR Analysis
The 114 Map-specific genes were next analyzed by similarity searches against other complete mycobacterial genomes including members of the M. tuberculosis complex and nontuberculosis mycobacteria closely related to the MAC complex (25,26). Several Map-specific genes showed surprisingly strong similarity to mycobacterial genomes that were recently added to public sequence databases (Supplementary Table 6). These include M. marseillense, M. lepraemurium, M. branderi, and Mycolicibacterium doricum. In silico PCR using primers designed for this study (Supplementary Table 1) were tested on the strongest matching genome sequences obtained from BLAST analysis and amplification products of the correct size were obtained for nine Map genes (Supplementary Table 6). Removing these nine targets reduced the list of Map genes to 104 that are defined as Map-specific following PanOCT, Roary and in silico PCR analysis. Within that group of 104 genes, 54 were highly specific to Map as they had no matches in the NCBI sequence database (indicated by "no result" in Supplementary Table 6). Two additional M. avium genomes, Map DSM 44135 (27) and Maa ATCC 25291 (28), were published after this study began and in silico PCR on those genome sequences shows agreement with subspecies specificity for all genes except FCV17_RS22495, a Mah gene that yielded a correct size product in Maa ATCC 25291 (Supplementary Table 4).

Gene Ontology of Map-Specific Genes
Functional analysis was performed on Map-specific genes using Kyoto Encyclopedia of Genes and Genomes (KEGG) (29) and evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) (30). These genes were separated into 16 functional classifications by KEGG and 18 by eggNOG (Figure 2). Other than the most predominant unknown function class, KEGG analysis revealed 14 lipid transport system proteins, which likely remodel the cell wall in some way. EggNOG analysis showed the presence of Mce family proteins (n=6), and ABC transporters (n=5), but were predominantly genes of unknown function (n=45

DNA Amplification of Subspecies Specific Genes
Taq polymerase-based DNA amplification of selected genes was performed on purified genomic DNA from 16 additional M. avium subspecies strains and Map K-10 as a control. All 104 Map-specific genes (Supplementary Table 2) and all eight Table 4) were tested. In addition, a subset of eight genes from the 157 Maa-specific genes (Supplementary Table 5) were also tested by DNA amplification. IS900 and IS1245 repetitive element targets were included as controls. The results show that 106 of 120 genes tested were confirmed as subspecies-specific ( Table 4). A gene was dropped from the subspecies-specific gene list if it failed to amplify in any target strains or if it did amplify in any non-target strains. Such unexpected reactions are highlighted in yellow (if negative) and green (if positive) in Table 4.

Map
Eighty-six genes were confirmed as Map-specific by DNA amplification (Table 4). Conversely, four genes amplified non-Map DNA and thus are no longer considered Map-specific ( Table 4). The primers designed from the MAP_RS01130-MAP_RS01135 combination amplified the correct size product in the M. intracellulare strain, but only faintly, perhaps indicating nucleotide mismatches in the primers. Primers to MAP_RS22390-MAP_RS00525 along with MAP_RS00515-MAP_RS00520 are all clustered together in the genome and amplified a correct sized product in two Mah strains ( Table 4). MAP_RS08655 was the most conserved gene, amplifying a product in six of the nine non-Map strains tested (

Non-map
Gene candidates that are diagnostic for Mah are very limited. Among the eight genes identified as specific to Mah by bioinformatic approaches, only three (FCV17_RS18630, FCV17_RS18640, and FCV17_RS18645) remained specific following PCR analysis as the other five genes were amplified a correct sized product from M. intracellulare DNA ( Table 4).
Of the eight Maa-specific genes selected for PCR analysis, seven were confirmed specific while only one (DBO90_RS13535) showed amplification in three Map strains. However, there are 149 other Maa gene candidates that were not tested, but could be Maa-specific (Supplementary Table 4). The collective results identified 86 genes that are Map-specific, seven Maa-specific, and three Mah-specific. Thus, a final list of the subspecies-specific genes is included in Table 5. Collectively, these results further extended the number diagnostic targets tested for Johne's disease against 10 additional Map strains and 10 non-Map strains.
A Single Reaction PCR Distinguishes All M. avium Subspecies As a proof of concept, primer combinations were tested that can distinguish any M. avium subspecies using a single-tube amplification reaction. Although other PCR reactions have been developed to distinguish M. avium subspecies (31), this approach eliminates the need for three reactions to accomplish the same result. This primer combination was used on selected genomic DNAs from M. avium subspecies. The single DNA amplification was able to distinguish each subspecies on the basis of size by gel electrophoresis (Figure 3), but can also be adapted to real time PCR using different Taqman probes and concentrations (32).

DISCUSSION
A total of 32 Map-specific genes were initially identified by comparing Map K-10 with Mah strain 104 (17,20,22). These genes have aided the specific detection of Map, but are singlecopy genes compared to IS900, a repetitive element that was known to be diagnostic for Map long before whole-genome comparisons were performed (9, 10). Using a comparative genomic approach combined with PCR analysis, this study confirmed 18 of the original 32 genes and identified 68 additional Map-specific genes as diagnostic targets ( Table 5) yielding a total of 86 genes. Some genes were initially identified as potentially subspecies specific after PanOCT and Roary analysis, but were subsequently found to be conserved in other mycobacteria after additional strains were tested by in silico PCR or Taq-based PCR analysis. For example, eight Mah-specific genes were identified by bioinformatic analysis, but only three were still deemed Mahspecific following DNA amplification experiments. The number of subspecies-specific genes will likely be further reduced as additional Map and non-Map strains are analyzed. Significantly more genes specific to Map vs. Mah are likely attributed to the closed, stable genome of Map, which increases the chance that specific genes are present in all strains. The opposite is true for Mah, which has a more dynamic, open genome containing a smaller number of core sequences (2) and therefore, the number of specific genes is very small. Maa had the highest number of subspecies specific genes, possibly because only two complete genome sequences were bioinformatically analyzed for members of this subspecies. Only eight of the 157 Maa genes were selected for DNA amplification studies and seven genes remained specific to Maa. Nonetheless, with these novel data, researchers can classify isolates in their freezers, quickly characterize clinical samples, and functionally analyze these unique genes.
A total of 40% of the Map-specific genes and 24% of Maa-specific genes are annotated as hypothetical. This large percentage of Map hypothetical proteins is interesting considering the updated RefSeq annotation has resulted in a

MAP_RS19325-MAP_RS19330
356 Frontiers in Veterinary Science | www.frontiersin.org Amplification results with a green highlight indicate an unexpected positive reaction while results with a yellow highlight indicate an unexpected negative reaction. Bovine str. = 3 bovine strains that gave identical results. These include 6013, ATCC19698 and 5077. goat-bov = This column represents 1 goat strain and 1 bovine strain that gave identical results (4025 and 117p). 2 strains = M. avium subs hominissuis strains 10-5894 and 10-05561, which had identical results. 6106/6007 = Two M. avium subspecies avium strains that gave identical results.   significant drop in the overall number of hypothetical proteins, but it suggests how little is known about these novel subspeciesspecific genes. Also notable among this group is an operon of genes encoding Mce family proteins ( Table 5), some of which are surface exposed and play a role in virulence and lipid transport (33). These proteins are typically conserved among the mycobacteria, and at least one gene this group is no different as in silico PCR analysis showed that the correct sized amplification product is obtained using MAP_RS11145 primers on M. marseillence strains (Supplementary Table 6). This species of Mycobacterium is considered a member of the Mycobacterium avium complex based on 16S rRNA, rpoB, and hsp65 sequence similarity (25). There are significant differences between PanOCT and Roary analysis in terms of genes assigned to subspecies-specific orthologous clusters. There are 41 genes identified as Mapspecific by Roary that are not considered specific by PanOCT analysis. These discrepancies are attributed to differences in the algorithms of these two bioinformatic tools. PanOCT uses a single clustering tool compared to Roary, which uses two clustering mechanisms and then merges the results. In addition, Roary appears more stringent in the gene cluster formation through the use of BLASTP identity thresholds of 95%, which yields more gene clusters. This may explain why the Roary output listed more gene clusters specific to Map than PanOCT. Roary stringency results in more gene clusters overall since Roary will not cluster genes less identical than the threshold. Thus, Roary will generate more groups of subspecies-specific genes than PanOCT causing Roary to estimate a higher number of subspecies-specific genes. On the other hand, PanOCT will group some of the genes because of its lower stringency and the number of specific groups will be lower.
Other discrepancies not related to mechanistic processes used by each tool can be traced to unexpected annotation differences in very similar orthologs. It is important to note that PanOCT is designed to give context to its findings in relation to genome organization and works off annotation data. This may be the reason why there is only one Map-specific gene identified by PanOCT analysis that is not listed by Roary. That one gene, encoding alpha/beta hydrolase (MAP_RS19250 in K-10), has the same translational start for all of the Map strains except Telford, which starts 95 codons downstream of the other strains (Supplementary Figure 1). This prevented its inclusion by Roary, and thus it was not considered a Map-specific gene, but it is indeed Map-specific. The same situation can be applied to the Mah-specific gene (MAV_RS08635 in strain 104), which encodes hydroxycinnamic acid hydroxylase ( Table 3). These results pose the question of which bioinformatic approach works better for defining these diagnostic gene targets. Since agreement of both pipelines was required, the more conservative method would drive the final list of targets. Is the conservative assessment by PanOCT too strict or is Roary too liberal? Or is it somewhere in between? The only way to answer the question of whether the conservative nature of PanOCT was warranted would be to test those genes identified only by Roary to determine their diagnostic specificity. When the 41-gene subset from Roary analysis was further analyzed by BLAST against the NCBI database, most of these genes (31 of 41) showed reasonably strong homologies to other, more distantly related, Mycobacterium species (Supplementary Table 3). Therefore, PanOCT might be the better bioinformatic tool to use in this approach.
There are several possible reasons for the non-specific amplification of selected genes that were reported as specific by PanOCT and Roary analysis. First, the region amplified in non-Map strains may not have been called as a protein coding region (pseudogene, intergenic region, or frameshifted protein). Second, the region amplified in non-Map strains may be in a protein coding region that is annotated in a different reading frame compared to Map strains. Third, underlying errors in genomic sequence could produce both these types of annotation deviations or the regions may be near a site where one of the strain clusters is reorganized compared to the other cluster. One example is MAP_RS22905, a gene specific to Map based on PanOCT and Roary, but amplifies a product in non-Map strains by in silico PCR and Taq-based PCR. This gene encodes a hypothetical protein and is immediately upstream of MAP_RS17350 in Map K-10, but with a significant 27-amino acid overlap in these coding sequences. MAP_RS17350 is annotated as a frameshifted non-functional protein that contains several internal stop codons. The primers to this gene were designed such that the forward primer binds within the MAP_RS22905 coding sequence while the reverse primer binds within the coding sequence that overlaps with MAP_RS17350. In the non-Map strains (as well as other Map strains) there is a single protein that is functional and encompasses the sequence that corresponds to MAP_RS22905 and MAP_RS17350. Therefore, it is important to examine the location of predicted amplicons in the genome context to clear up some ambiguities.
While some Taq-based amplification reactions were unexpectedly positive, and thus non-specific, other reactions were unexpectedly negative. One Map-specific gene (MAP_RS14095) did not amplify any of the genomic DNAs using the primer pairs ( Table 4). These primers were carefully selected using strict parameters which resulted in a PCR product of the correct size by in silico PCR, but still did not amplify a product using purified genomic DNAs. Therefore, this primer pair was considered bad and MAP_RS14095 has yet to be confirmed as subspecies-specific. An additional six Map-specific genes failed to amplify using only the Map strain Linda template, although all of the other genes were successfully amplified with this genomic DNA sample. It is unclear why these genes did not amplify with this one template, since a search of unfinished contigs for the Linda strain suggests these six genes are indeed present. Nonetheless, these genes were not included among the Map-specific genes because they didn't amplify using two independent commercial recipes.
The insertion sequence IS900 and many other repeat elements were not identified by PanOCT or Roary analysis. This is due to the lack of such elements included in the two Map Egyptian isolates E1 (CP010113.1) and E93 (CP010114.1) (34). Applying the criteria used in this study that the gene must be present in all Map and absent in all non-Map, these sequences do not appear in the E1 and E93 strains for reasons speculated previously (2), and hence they did not make the list. Nonetheless, there is no doubt that these two commonly used repetitive sequences are indeed Map specific and present in all Map strains. Surprisingly, the ISMap02 sequence did show positive PCR reactions with some Mah strains, but not with Maa (data not shown), casting doubt on its diagnostic specificity. Repetitive sequences specific to Mah and Maa were also identified in this study. One example is DBO90_RS18520 (from Maa HJW), an IS110 family transposase that has 14-16 copies in Maa strains.
F57, hspX, and locus 251 have been widely used as diagnostic sequences for Map (11, 13,18,35). These genes were checked against the subspecies-specific gene lists obtained from the current analyses. The F57 sequence occurs within MAP_RS22535 (MAP_0865) and was also identified as Map-specific in this study. Likewise, hspX was originally discovered as a Mapspecific gene before the whole genome sequence was available (11) and was subsequently shown to be expressed within Mapinfected macrophages (12). The locus tag for this sequence is MAP_RS11095 (MAP_2182c) and has been used in studies to determine antigenic reactivity (36). Locus 251 has been tested in a real-time PCR platform and was considered Map-specific in that context (14). All three of these sequences were also identified in this study as Map-specific by Roary, PanOCT, and PCR analysis.
One area where annotations have improved dramatically has been in G+C rich organisms like the mycobacteria. RefSeq annotation has identified new coding sequences that were previously hidden in the same genome because original annotation software either missed them or called distinctly different start sites or reading frames for the same gene. Because of the consistency in the RefSeq annotation pipeline, genomes can now be directly compared at the gene feature level. Several subspecies specific genes identified in this study do not list an outdated locus tag, but only the updated RefSeq locus tag. Among the Map-specific genes, 17 do not have an original locus tag designation ( Table 5), representing 20% of the Map-specific genes identified in this study that might have been "missed" or changed significantly from the initial sequence annotation (37). For example, closer inspection of MAP_RS04350, which does not list an original locus tag in Table 5 and is annotated as a hypothetical protein, actually encodes the C-terminal 86 amino acids of MAP_0852 which was previously shown to be Map-specific (22). Other similar examples of shifted coding sequences that result in dropped locus tags from the original annotation exist as well. In our previous studies, we expressed Map-specific coding sequences identified by the original annotation and tested them for antibody reactivity (38). Although specific antigens were identified, additional testing revealed low antibody reactivity to all Map-specific coding sequences known at that time. Now with these updated RefSeq coding sequences identified by the prokaryotic genome annotation pipeline (PGAP), there still remains a possibility to obtain a truly specific and strong antigen for Map. Therefore, these updated RefSeq coding sequences should be recombinantly expressed and examined for antigenicity as has been done with previous Map-specific genes (38).

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Materials, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
JB conceived and designed the study. All authors made substantial contributions to the experimentation, analysis, and writing of the manuscript.

FUNDING
This work was funded by the USDA Agricultural Research Service.