Genetic Diversity Among Mycobacterium avium Subspecies Revealed by Analysis of Complete Genome Sequences

Mycobacterium avium comprises four subspecies that contain both human and veterinary pathogens. At the inception of this study, twenty-eight M. avium genomes had been annotated as RefSeq genomes, facilitating direct comparisons. These genomes represent strains from around the world and provided a unique opportunity to examine genome dynamics in this species. Each genome was confirmed to be classified correctly based on SNP genotyping, nucleotide identity and presence/absence of repetitive elements or other typing methods. The Mycobacterium avium subspecies paratuberculosis (Map) genome size and organization was remarkably consistent, averaging 4.8 Mb with a variance of only 29.6 kb among the 13 strains. Comparing recombination events along with the larger genome size and variance observed among Mycobacterium avium subspecies avium (Maa) and Mycobacterium avium subspecies hominissuis (Mah) strains (collectively termed non-Map) suggests horizontal gene transfer occurs in non-Map, but not in Map strains. Overall, M. avium subspecies could be divided into two major sub-divisions, with the Map type II (bovine strains) clustering tightly on one end of a phylogenetic spectrum and Mah strains clustering more loosely together on the other end. The most evolutionarily distinct Map strain was an ovine strain, designated Telford, which had >1,000 SNPs and showed large rearrangements compared to the bovine type II strains. The Telford strain clustered with Maa strains as an intermediate between Map type II and Mah. SNP analysis and genome organization analyses repeatedly demonstrated the conserved nature of Map versus the mosaic nature of non-Map M. avium strains. Finally, core and pangenomes were developed for Map and non-Map strains. A total of 80% Map genes belonged to the Map core genome, while only 40% of non-Map genes belonged to the non-Map core genome. These genomes provide a more complete and detailed comparison of these subspecies strains as well as a blueprint for how genetic diversity originated.

Mycobacterium avium comprises four subspecies that contain both human and veterinary pathogens. At the inception of this study, twenty-eight M. avium genomes had been annotated as RefSeq genomes, facilitating direct comparisons. These genomes represent strains from around the world and provided a unique opportunity to examine genome dynamics in this species. Each genome was confirmed to be classified correctly based on SNP genotyping, nucleotide identity and presence/absence of repetitive elements or other typing methods. The Mycobacterium avium subspecies paratuberculosis (Map) genome size and organization was remarkably consistent, averaging 4.8 Mb with a variance of only 29.6 kb among the 13 strains. Comparing recombination events along with the larger genome size and variance observed among Mycobacterium avium subspecies avium (Maa) and Mycobacterium avium subspecies hominissuis (Mah) strains (collectively termed non-Map) suggests horizontal gene transfer occurs in non-Map, but not in Map strains. Overall, M. avium subspecies could be divided into two major sub-divisions, with the Map type II (bovine strains) clustering tightly on one end of a phylogenetic spectrum and Mah strains clustering more loosely together on the other end. The most evolutionarily distinct Map strain was an ovine strain, designated Telford, which had >1,000 SNPs and showed large rearrangements compared to the bovine type II strains. The Telford strain clustered with Maa strains as an intermediate between Map type II and Mah. SNP analysis and genome organization analyses repeatedly demonstrated the conserved nature of Map versus the mosaic nature of non-Map M. avium strains. Finally, core and pangenomes were developed for Map and non-Map strains. A total of 80% Map genes belonged to the Map core genome, while only 40% of non-Map genes belonged to the non-Map core genome. These genomes provide a more complete and detailed comparison of these subspecies strains as well as a blueprint for how genetic diversity originated.

INTRODUCTION
The Mycobacterium avium complex (MAC) comprises three species which include M. avium, M. intracellulare, and more recently, M. chimaera (Tortoli et al., 2004). There currently exist four subspecies within M. avium, which are the focus of this study. These include: avium, hominissuis, paratuberculosis, and silvaticum. The subspecies designations of M. avium have recently been confirmed by calculating average nucleotide identity (ANI) and genome-to-genome distance pairwise values (Tortoli et al., 2019). M. avium infections in humans manifest in three different forms: lymphadenitis (Christensen and Koeppe, 2010), disseminated (Ohkusu et al., 2004), or the most common, pulmonary (Inderlied et al., 1993;Tran and Han, 2014). The MAC are all closely related genetically and thus it can be difficult to distinguish which species/subspecies are the cause in human infections. However, through genome sequencing, differences have been identified and highlighted, especially for M. avium subspecies paratuberculosis (Map), which is the causative agent of Johne's disease in cattle, sheep and other ruminant animals. This disease is endemic in the United States and is a problem worldwide due to the severe economic consequences to dairy cattle as well as goat and sheep industries. As more complete genome sequences accumulate, the opportunity to understand the pathological and genetic distinctions between the subspecies of M. avium will become more apparent.
Map infection of ruminant animals results in a lengthy incubation that yields a chronic inflammation of the intestine which mimics symptoms of Crohn's disease. Studies focusing on Johne's disease commonly use the K-10 bovine strain of Map since it has emerged as a widely disseminated reference strain and was used as the reference Map strain in this study. Originally isolated in 1990 from a dairy cow in Wisconsin, the genome sequence from this strain was the first one elucidated for Map. The K-10 genome sequence has now been available for 15 years (Li et al., 2005) and has been useful for diagnostics, epidemiology and pathogen biology. Two dozen more genomes of Map have since been reported in varying degrees of completion including bovine strains (Amin et al., 2015;Mobius et al., 2017), ovine strains in the United States and Australia (Bannantine et al., 2012;Brauning et al., 2019), camel strains (Ghosh et al., 2012), and human strains (Wynne et al., 2011;Bannantine et al., 2014).
Within Map, two primary strain lineages have emerged that correlate to the host they are isolated from (sheep versus cattle); although host-based lineages have been blurred as more isolates are typed from deer, bison and other hosts (Bryant et al., 2016). The type I and type III strains are generally ovine isolates and the type II strains are primarily bovine, deer, bison or human isolates. In this study, we examined 12 type II bovine isolates of Map and one ovine isolate, which is a type I strain (Brauning et al., 2019). No type III ovine strains are represented as no complete genome sequences are yet available. Two other subspecies are represented, which include M. avium subspecies avium (Maa) and M. avium subspecies hominissuis (Mah). However, no M. avium subspecies silvaticum complete genomes have been made publicly available to date. In this study, Mah and Maa strains are combined and termed non-Map strains.
With the number of M. avium subspecies complete genomes now available, we can look in depth at the genomic diversity among these subspecies with a much higher resolution and accuracy than was possible using DNA microarrays (Paustian et al., 2005(Paustian et al., , 2008 or partial genomic sequences (Bannantine et al., 2002). As an example, the movements of the hallmark IS900 insertion elements can be tracked among sequenced strains that could not be deciphered based solely on DNA hybridizations to microarrays and average nucleotide identities can be determined at a whole genome level. This study has revealed different numbers of the taxonomically defining insertion elements, IS900 and IS1245, and in some cases, there exist only partial copies of these mobile DNA sequences. Furthermore, gene deletions could be detected by DNA microarray provided they spanned at least one open reading frame represented on the array, but with complete genome sequences, small deletions and insertions (indels) are now readily detectable. Finally, core and pangenomes can be deciphered for each subspecies, which is less accurate with fragmented genome assemblies.
Over 90% of genome sequencing projects target microbial species. Partial sequencing of over 140 Map isolates worldwide has demonstrated a very stable genome in this subspecies with less than 0.5 single nucleotide polymorphisms (SNPs) per genome per year (Bryant et al., 2016). This is comparable to another stable bacterium, Listeria monocytogenes, which showed SNP differences of less than 6 out of 2,298 genes over 2 years (Schmid et al., 2014), but more variable than the highly stable Group A Streptococcus showing 0.0002 core genome SNPs per site/year (Coppens et al., 2019). In this study, we conducted a multi-strain comparison of the largest set to date of complete mycobacterial genomes from the same species. We targeted 13 complete Map genomes for analysis, but also included other non-Map genomes in the avium species for comparison purposes and to add evolutionary insights. We further identified the core and accessory genomes of Map and non-Map strains to show how infrequent horizontal gene transfer events occur in Map.

Genomes
At the study's conception, all M. avium complete genomes were download from the NCBI RefSeq database in November 2019 (Tables 1, 2). Each strain was compared with all others enrolled in this study for determining similarity using ANI and Jaccard coefficient. ANI was calculated using the OrthoANIu algorithm (Yoon et al., 2017) and Jaccard coefficient was calculated using PanOCT software (Fouts et al., 2012).

Bioinformatic Analysis
Paired-ends reads (250 bp) were simulated using ART software (v2.5.8) (Huang et al., 2012) based on the HiSeq 2500 platform for each genome. Reads were then mapped to either K-10 (Map) or 101115 (Mah, plasmids ignored) reference genomes with Burrow-Wheeler Aligner (BWA) mem (v0.7.12)  (Li and Durbin, 2009). SNPs were detected with FreeBayes (v1.1.0) (Garrison and Marth, 2012). SNPs were filtered out based on three criteria (i) quality of the SNP is more than 20 (ii) read depth is more than 20 and (iii) distance between 2 SNPs must be more than 10 bp to avoid sequencing error generated by the read simulation. Each VCF file was functionally annotated with the snpEff tool (v4.3) (Cingolani et al., 2012), then converted into a sqlite3 database with vcflib 1 (vcf2sqlite3.py) and finally merged in a unique sqlite3 database for further analysis using an in-house script (tables and figures). In parallel, each VCF file was merged into a single VCF file using vcflib to build the SNP concatenate used to infer UPGMA phylogenetic trees with Bionumerics version 7.6.3 created by Applied Maths NV and available from http://www.applied-maths.com for each reference. Genomic feature comparisons were performed in MacVector 17.0.9 using the compare genomes tool. Synteny alignments were determined by Mauve. In order to avoid false indications of inversions or other rearrangements, these genomes were first aligned to start at the dnaA gene prior to Mauve analysis.

Pangenome Analysis
The gbk files of each strain were retrieved from the NCBI RefSeq database and converted to gff3 format using bp_genbank2gff3.pl perl script from BioPerl library. Roary software (v3.11.2) (Page et al., 2015) was used to define the pangenome, core genome and accessory genome of each Map strain and non-Map strain separately. Roary was launched with -e and -n options to compute rapid core gene alignment. Phylogenies based on accessory genome matrix (presence/absence of gene) were generated using FastTree (v2.1) on Map and non-Map (Price et al., 2010). Roary outputs can be visualized with Phandango (Hadfield et al., 2017) and R scripts were used to view the resulting outputs. Pangenomes were also analyzed using PanOCT (Fouts et al., 2012) in conjunction with the JCVI pipeline and PanACEA visualization tool (Inman et al., 2019). PanOCT and 1 https://github.com/vcflib/vcflib Roary are both especially useful for clustering genes from closely related species/strains.

Nonsynonymous-Synonymous Ratio (dN/dS) Calculation
The nucleotide and protein sequences of all core genes identified by Roary analysis were extracted and binned into two groups corresponding to Map and non-Map strains. Each protein coding sequence was aligned with MAFFT (Katoh and Standley, 2013) and converted back into nucleotide codon alignments with Pal2Nal (Suyama et al., 2006). Phylogenetic trees were inferred with FastTree 2.1.11, based on resulting nucleotide codon alignment. Both tree and nucleotide codon alignment were fed into HyPhy software using a MEME (Multiple EM For Motif Elicitation) algorithm (Murrell et al., 2012) to determine dN/dS value for each gene in each group. Grouped dN/dS values were then calculated after removing 877 recombinant genes. Finally, the dN/dS values of the remaining 1559 genes were filtered to keep only values less than 10, above which are considered artefactual (Mastrorilli et al., 2018).
In silico Analysis of the hsp65 Gene Gene sequences were extracted from each genome and analyzed for defining SNPs. The hsp65 codes were assigned based on the nomenclature developed by Turenne and coworkers .

RefSeq Genomes Using PGAP Annotation Reveals Additional Genes
We used NCBI's RefSeq annotation for all M. avium genomes in this study. The Map bovine strain K-10 genome was initially annotated using Artemis and Glimmer in 2005 (Li et al., 2005) and most recently re-annotated using NCBI's prokaryotic genome annotation pipeline (PGAP) (Haft et al., 2018 Table S1).
A RefSeq annotation has much improved and consistent annotation across all genomes while also implementing certain quality standards, including lack of sequence contamination (Haft et al., 2018). From the K-10 RefSeq annotation, 227 newly discovered genes not present in the initial annotation were revealed and 67% of these "new" genes are annotated as hypothetical proteins (Supplementary Table S2). These genes could encode new antigens or novel virulence and metabolic functions. Conversely, 199 coding sequences have been annotated as pseudogenes (Supplementary Table S3). To be included in this study, all genome assemblies had to be complete, closed, and annotated using PGAP with a RefSeq accession number. These criteria excluded one complete genome, the Mah strain MAH11, which did not have a RefSeq accession when this study was conceived and has recently been published (Dragset et al., 2019). Therefore, a total of 13 Map genomes and 15 non-Map RefSeq genomes were analyzed (Tables 1, 2).

Characteristics of M. avium Genomes
These M. avium genomes are from strains distributed across five continents: Africa, North America, Europe, Asia and Oceania. Eleven of the thirteen Map are bovine isolates with one human (MAP4) and one ovine strain (Telford 9.2) completing the set (Table 1). However, the human isolate groups with the bovine isolates to yield 12 type-II strains and a single type-I ovine strain. Accession, country of origin and publication information are listed in Tables 1, 2 for all of the M. avium subspecies genomes. Map genome sizes are relatively consistent despite their broad geographical distribution. An average size of 4.83 Mb + 29.6 kb was observed for Map while Maa averaged 4.96 Mb + 5.8 kb among its two genomes ( Table 3). Mah strains had the largest average genome size at 5.25 Mb + 195.9 kb. All Map and  Singh et al., 2010;Sidoti et al., 2011) and this diagnostic element varies from 16 to 22 copies per genome. Interestingly, the IS1245 element has been considered a defining sequence for non-Map M. avium often used as the target in RFLP analysis (Van Soolingen et al., 1998;Johansen et al., 2005;Thibault et al., 2007), however, both Maa strains and one Mah strain do not contain this element ( Table 3).
Functional copies of this element vary from 0 to 36, but some strains include many additional partial copies of this insertion sequence annotated as pseudogenes (Supplementary Table S4).
No Map or Maa strains possess plasmid DNA, while, eight Mah strains have plasmids ( Table 3). The variation in genome size, gene, and pseudogene numbers suggest important differences in gene gain and loss events have occurred during evolution of the M. avium subspecies. One characteristic that is very consistent among all M. avium genomes is the uniquely high GC content (69% ; Table 3).
Several previously published analyses have been applied in this study to validate these genomes as Map or non-Map. For example, only Map strains possess a C-to-A SNP in MAP_1025, a gene encoding a proline rich protein (Bannantine et al., 2011). This defining SNP remains true among these M. avium strains with all 13 Map genomes possessing an adenine nucleotide at position 83 in MAP_1025 while all non-Map genomes have a cytosine at that position (Supplementary Table S5). The hsp65 gene (aka groEL2, MAP_3936, and MAP_RS20190 in K-10), which encodes a heat shock protein, has been used to distinguish members of the MAC based on sequence variants at 19 positions within the gene . A total of 14 different sequevars were identified among all 73 MAC isolates tested in that study . All Map bovine strains in the current study belong to sequevar code 5 whereas the Map ovine strain Telford is sequevar code 6 (Supplementary Table S6). This division of bovine and ovine strains into codes 5 and 6 is consistent with what was reported by Turenne and coworkers . The Mah genomes grouped into sequevar codes 1, 2, 7, and 9 while the two Maa strains were in code 4. Mah strain mc2 2500 was not in a sequevar previously reported. By analyzing genomes in the current study, three additional SNPs were detected in the 1,626 bp hsp65 gene that were not reported by Turenne et al. (2006). The locations of these SNPs are highlighted in red in Supplementary Table S6.
Several large sequence polymorphism regions reported by Semret et al. (2004Semret et al. ( , 2005 are able to distinguish Map from non-Map strains. For example, the LSP A 8 sequence is present in all non-Map genomes, but missing in all Map genomes while LSP P 12 is present in all Map strains and absent from all non-Map strains in this study (Semret et al., 2005). Finally, the Telford strain has characterized deletions reported in all ovine Map strains analyzed thus far, including a 19,930 and 8,049 bp deletion relative to Map bovine strains (Marsh et al., 2006).
The multi-locus variable number tandem repeat analysis (MLVA) is another molecular epidemiological tool that has been standardized for MAC strains (Cochard et al., 2019). Although this tool has been outperformed by whole genome SNP analysis due to homoplasy in MLVA (Bryant et al., 2016), it is nonetheless easy to perform and has a well-established database to compare strains against. This technique, applied to the set of genomes in this study, reveals that two Map genomes from the South Korea strains (MAPK_CN4/13 and MAPK_JB16/15) have an INMV type code belonging to Maa rather than Map. This is code 68 (INMV type 22532228) shown in Table 3. Still another Map genome from the South Korea strains (MAPK_JJ1/13) has an INMV code belonging to Mah (code 149, INMV type 22432228). Overall, INMV2 was predominant in this collection of genomes with 7 of the 13 Map genomes falling in this category ( Table 3). INMV codes could not be assigned to several non-Map strains to due ambiguities at one loci (VNTR3). These results suggest MLVA is not ideal for distinguish MAC subspecies.

M. avium Genomic Diversity
With multiple complete genomes, investigators can now examine in more detail the genome distances among M. avium strains. These measurements can be obtained a variety of ways at the nucleotide level including Jaccard cluster similarity (Jay et al., 2012) and more commonly, ANI (Kim et al., 2014). ANI values showed a narrow range from 98.51 to 99.99 among all M. avium subspecies. This metric is used to define species and subspecies boundaries and is a better measure of relatedness than data obtained from a single gene, such as 16S rRNA. For taxonomic speciation, the ANI cutoff is historically 96% for a species (Ciufo et al., 2018) and greater than 98% for subspecies. All of the subspecies of M. avium are well above 98% ANI and confirm their taxonomic grouping (Figure 1). By way of comparison, M. intracellulare ATCC 13950 (CP003322), which is another species within the MAC, had an 86.33% ANI with Map FDAARGOS. The same Map strain shares a 79.42% ANI with the more distantly related M. bovis Danish 1331 (NZ_CP039850) and 63.66% when compared to Escherichia coli K-12 (NC_000913.3). The Jaccard similarity coefficient was more discriminating with all M. avium genomes in this study sharing >69.68% Jaccard pairwise similarity to each other (Figure 1). Among the Map strains the Jaccard similarity is above 90% except when the E1 and E93 strains from Egypt were compared with the Australian sheep strain Telford (88.5 and 88.6%, respectively). Interestingly, when comparing the non-Map strains against each other, the Jaccard percentages (range = 69.68-96.31) eclipsed percentages for non-Map to Map comparisons (Range = 72.44-83.9). Based on Jaccard similarity, Mah strain 104 is most distantly related to all other M. avium strains (Figure 1). These values further illustrate the diversity that exists among the non-Map strains.

SNPs Among M. avium Subspecies
SNPs are valuable for constructing phylogenies of closely related genomes. SNP-based analysis among the 12 Map type II genomes, using K-10 as the reference, shows less than 300 single nucleotide variants among them, and just over 1,000 SNPs when compared to the Map type I isolate (Telford 9.2; Figure 2A). When the SNPs FIGURE 1 | Nucleotide level similarity analysis. The top matrix shows the Jaccard pairwise cluster similarity and the bottom matrix show the average nucleotide identity (ANI). The ANI values are a similarity index between two genomes expressed as a percentage. Jaccard percentages were obtained by direct pairwise genome comparisons using PanOCT as described in the methods. Note that the Jaccard similarity has a much broader range of values. The strain names and corresponding RefSeq accession numbers are shown in both matrices. Heat map shows high similarity in red versus lower values in green. (B) The number of synonymous (blue bars, first value) and non-synonymous (red bars, second value) SNPs. Both graphs were plotted on a log scale with the strain designation listed in the center. Numerical differences between total SNPs and the combined nonsynonymous plus synonymous SNPs represent the intergenic SNPs.
were divided into synonymous and non-synonymous categories, there were always more amino acid-changing non-synonymous SNPs than synonymous SNPs among Map strains, while the opposite is true among the non-Map strains ( Figure 2B). This suggests that there is more selective pressure on Map. To test this, dN/dS ratios were calculated, which is the ratio of nonsynonymous substitutions per non-synonymous site (dN) to the number of synonymous substitutions per synonymous site (dS). This ratio is used as an indicator of selective pressure acting on protein coding genes. Non-Map strains had a dN/dS ratio of 0.10, which shows that these strains are under a stabilizing selective pressure. In contrast, the Map strain dN/dS ratio is 1.27, which clearly shows the positive selective pressure these strains exhibit.
The little Map diversity that does exist is not related to geographical boundaries as multiple strains from Egypt and South Korea are present in distinct branches of the phylogenetic tree (Figure 3). This result agrees with that observed by Bryant et al., which showed no association between strain relatedness with geographic location for over 140 Map isolates (Bryant et al., 2016). The K-10 strain is most closely related to FDAARGOS_305 and the MAPK_CN9/15 strain from South Korea. Conversely, among the Map bovine strains, K-10 is most distantly related to the E1 strain from Eygpt and all the bovine strains are well separated from the Map type I strain (Telford 9.2). Overall, even the most divergent of Map strains had a SNP density of only 0.284 SNPs per 1 kb. This compared to 10 SNPs per 1 kb between Mah 104 and MAH11 (Dragset et al., 2019). This analysis confirms that single nucleotide variants are more abundant in Mah than other M. avium subspecies (Uchiya et al., 2017). Finally, there are well over 11,500 SNPs in Mah and Maa isolates when compared to K-10 ( Figure 2B) and each subspecies clustered together regardless of which strain was used as the reference (Figure 3).

Genomic Synteny
Genome organization among the 12 Map type II strains is remarkably stable with only one large inversion shared among six of the strains (Figure 4A). If the genomes were entered into Mauve without first aligning them at dnaA, many false rearrangements appeared (Supplementary Figure S1). The single inversion that is observed in the type II strains (Figure 4A) might be due to a mis-assembly in K-10 (Wynne et al., 2010) rather than a true DNA rearrangement. In contrast, the type I strain shows a more extensive degree of large-scale rearrangements relative to the Map K-10 strain despite using dnaA as the same starting point for alignment (Figure 4B). A genomic inversion in the center of these sequences is a primary cause of this genomic discontinuity.

Pangenome of M. avium Subspecies
A significant part of the genome evolutionary process involves the extensive gain and loss of genes (Iranzo et al., 2019). Pangenome analysis should ideally identify all orthologs and distinguish them from paralogs. Analysis using Roary (with a FIGURE 4 | Genomic alignment between Map strains. The 12 type II genomes (A) and the type I Telford and type II K-10 genomes (B) were all aligned relative to the dnaA gene as the anchor point. The Mauve alignment is organized into one horizontal section per input genome sequence. Each genome track contains the accession number and strain of the genome sequence along with a base pair scale showing the sequence coordinates for that genome. Blocks of various colors above and beneath the center line represent defined genome regions. Similar colored blocks in different genomes represent regions that aligned in each genome. When a block lies above the center line the aligned region is in the forward orientation relative to the K-10 genome sequence (top). Blocks below the center line indicate regions that align in the reverse complement (inverse) orientation.

95% BLASTp identity cut off) identified genes present across all
Map strains totaling 3,749, which comprises the core genome. Conversely, the accessory genome, which is subdivided into the cloud and shell genomes, total only 920 genes (Figure 5). An alternative method, PanOCT analysis, yielded a Map core genome comprising 3,772 genes, which is similar to the 3,749 core Map genes identified by Roary analysis. The pangenome of 15 non-Map strains comprise a total of 8,124 genes with the core genome consisting of 3,283 genes. Typically, the smaller the number of genomes analyzed, the larger the core. However, despite the similar numbers of Map and non-Map genomes in this study, there is a significantly larger core and smaller accessory genome associated with Map than for non-Map (Figure 5), suggesting a lack of horizontal gene transfer in Map. The Map core genome comprised 80% of the total genes compared to non-Map, which contained only 40% (Figure 5). Conversely, there were nearly 5 times more accessory genes among the non-Map genomes (920 Map versus 4,841 non-Map genes).
In comparison to Map, the non-Map pangenome has substantially more total genes (Figures 5, 6). This is in part due to plasmid DNAs, which are not present in all strains and comprise at least a portion of the accessory genome. To determine if the presence of plasmid DNA impacted phylogenetic lineages in non-Map strains, Roary analysis was conducted in both circumstances. The phylogenies are very similar whether or not plasmids are included in this analysis. The only difference observed is with Mah OCU464, a strain which contains two plasmids ( Table 3). This strain clusters with Mah TH135 when plasmids are included in the analysis and clusters with Mah MAC109 when plasmids are excluded (Figure 6). All of the gene clusters comprising the core genomes of Map and non-Map strains are listed in Supplementary Tables S7, S8. When using PanOCT, core gene clusters were, similarly, assigned (Supplementary Table S9). The pangenome map of all 28 M. avium strains combined is shown in Supplementary Figure S2. Collectively, the core genome of Map was twice the size of the core genome of non-Map despite having significantly fewer gene clusters and the accessory genome is very small (Figure 6). This suggests a more closed genome in Map with less horizontal gene transfer than non-Map.
When quantifying the frequency of each gene in the pangenome among all the Map or non-Map genomes, a unique pattern emerges within each group (Figure 7). The Map genes are present in most of the genomes (12-13) while the non-Map genes show a more biphasic pattern where they are either in a few (1-2) genomes or in most (14-15) genomes. In contrast, very few genes are present in approximately half of the genomes regardless of subspecies (Figure 7). These data further suggest that Map is a closed genome while non-Map show evidence of horizontal gene transfer.

Core Gene Phylogeny
Based on the Roary core gene analysis, a core gene SNP phylogeny was determined for all M. avium strains in this study. This phylogeny shows that the ovine strain Telford is well-isolated from both Map and non-Map strains (Figure 8). The tree also shows that Mah strain 104 is most closely related to the Maa strains. The Map type II strains separate into two main clusters similar to the Mah strains. Overall, the data suggest that the intermediate strains between Mah and Map are Maa and the Map type I strain. From a network phylogenetic analysis, not only do Map strains form a well-separated clade from other M. avium subspecies, but it appears some Mah strains are about as closely related to each other as they are to Map strains (Figure 9). This is supported by the many parallel lines (indicative of phylogenetic splits) on the non-Map side of the network. The Telford type-I strain is also clearly separated from the type-II strains of Map. Mah strains form a complex web-like tree which is in agreement with a previous study (Turenne et al., 2008) based on the diversity observed among the subspecies. Also note the Mah 104 strain is closely related to the two Maa strains. Historically, this strain was first identified as Maa (Mijs et al., 2002).

Origin of Genomic Diversity in M. avium Subspecies
From these collective results, it is clear non-Map strains show considerable genomic diversity, but where did this diversity originate? With the understanding that recombination can be a primary driver of diversity, ignoring this possibility when reconstructing M. avium phylogenies could lead to misleading conclusions about strain relationships. ClonalFrameML was used to detect recombination hot spots on a whole genome scale (Didelot and Wilson, 2015) and showed many hot spots were present in non-Map strains while absent in Map strains (Figure 10). When recombination-induced SNPs were removed to correct branch lengths, the phylogenetic relationships are more accurately reflected between Map and non-Map strains FIGURE 6 | Gene presence matrix of the Map (A) and non-Map pangenomes analyzed with (B) and without (C) plasmid DNA elements. At the right of each panel is a compressed gene cluster matrix where blue blocks indicate presence of the gene in that cluster and white indicates its absence. Each column represents an orthologous gene family. The corresponding phylogenetic tree is presented on the left and strains listed on the tree correspond to each row of the matrix. Only the phylogeny of OCU464 is affected by the presence/absence of plasmid DNAs among the non-Map strains (compare B and C). Note the discrepancy of the core genome between Map and non-Map. Fully 80% of the Map genome consists of core genes.  ( Figure 10A). A striking number of recombination events is detected by this analysis in non-Map strains relative to Map (compare dark blue squares in Figure 10B). These events are distributed evenly across non-Map genomes ( Figure 10B). Collectively, these data suggest that Map is a closed genome with horizontal gene transfer kept to a minimum.

DISCUSSION
To increase understanding of M. avium genome organization, only completely sequenced M. avium subspecies genomes were considered. Insufficient numbers of complete genomes had been a limitation in earlier studies, but recently a similar number of Map and non-Map M. avium genomes have emerged. Overall, we show that among the three M. avium subspecies studied, Map is a relatively stable, closed genome while Mah and Maa show more horizontal gene transfer and accessory genome components. Map clearly has a very homogenous genome, except for the Telford type I strain which has a different genomic organization and an order of magnitude more SNPs compared to the 12 type II Map strains. It will be interesting to examine additional type I and type III strains of Map once complete genomes become available. Currently, there are two draft genomes available for the type III strain of Map (Bannantine et al., 2012;Mobius et al., 2015), but no complete genome sequences. We recognize that analyzing 12 of 13 type II genomes may skew the data to enable Map to appear more homogeneous, but that was what met the criteria for inclusion in this study and highlights the current focus of research on these strains. Other studies using SNP-based analyses of selected genes or VNTRs of incomplete genomes suggest ovine strains are more polymorphic than type II bovine strains (Turenne et al., 2008;Bryant et al., 2016;Sohal et al., 2019). Furthermore, Map genomes aligned at a common start revealed that type II strains have a highly homologous genome synteny while the sheep strain showed large rearrangements.
Conversely, the non-Map members of M. avium have genomes that are mosaic in nature similar to what has been described elsewhere (Turenne et al., 2008;Uchiya et al., 2017;Yano et al., 2017). The homogeneous Map and heterogeneous non-Map genomes are well illustrated by the following observations. Map was divided into two hsp65 sequevars, while non-Map binned into 6 sequevars. The total numbers of SNPs were 10 times more in non-Map than Map and the non-synonymous SNPs were consistently higher than synonymous SNPs in Map genomes while the opposite is true for non-Map. The resulting phylogenies showed distinct clustering between Map and non-Map. Plasmid DNAs are present in some non-Map strains, but not in any FIGURE 10 | Recombination events detected in M. avium. ClonalFrameML was used with input from the core gene alignment by Roary and RAxML phylogeny to identify recombination induced SNPs and remove them to correct the tree. (A) Core gene phylogeny corrected from inferred recombination events. SNP-inducing homologous recombination due to horizontal gene transfer events in the core genome were removed to correct the true branch lengths on the tree. Map strains. Inferred recombination sites are significantly more prevalent in non-Map genomes. And finally, there is a large accessory genome in non-Map strains. Collectively, this further demonstrates the heterogeneous nature of Mah genomes and the conserved nature of Map genomes. One potential reason Map genomes appear very stable with low numbers of SNPs and lack of rearrangements could be due to minimal selective or environmental pressure that stems from the passive lifestyle niche Map exists in. An alternative explanation may be that Map has spread worldwide relatively recently through livestock transportation in the industrialized world.
It is interesting to note the strong bias toward two of the four M. avium subspecies in terms of genomes sequenced. From this set of 28 genomes, 26 comprise either hominissuis or paratuberculosis. This does not include yet another Mah MAH11 sequence recently completed (Dragset et al., 2019). This bias is most likely due to the fact that Map is a significant veterinary pathogen and M. avium subspecies hominissuis is a primary cause of human lung infections and lymphadenitis in pigs. Conversely, silvaticum and avium subspecies are either bird pathogens or environmental commensals and hence there is little funding or effort dedicated to understanding their genetic makeup. There is another possibility for the lack of representation of the silvaticum subspecies in sequence repositories. M. avium subspecies silvaticum was originally proposed by Thorel and coworkers based on specific growth conditions (Thorel et al., 1990), however, it is likely not a separate subspecies. Phylogenetic evidence and other analyses suggest that the silvaticum subspecies groups very closely to the avium subspecies Bryant et al., 2016;Tortoli et al., 2019). Also, there is very little sequence in public databases and what is present clusters tightly to Maa. Only the silvaticum type strain, designated ATCC 49884, has been carried over in the analyses of many phylogenetic studies as no others are available and there may be only one other isolate circulating with no new isolates having been reported or sequenced. Therefore, it is our position that the only M. avium subspecies that should be recognized are avium, hominissuis and paratuberculosis, however, very recent data obtained by calculating ANI and genome-to-genome distance suggest the addition of a new subspecies termed M. avium subspecies lepraemurium (Tortoli et al., 2019). This new subspecies differs from subspecies avium by 4 bp in the 16S rRNA gene.
One area where annotations have improved dramatically has been in GC-rich organisms like that observed in M. avium genomes. The new RefSeq annotation of these genomes has enabled the standardization and identification of pseudogenes. Also, the number of hypothetical proteins has dropped significantly with the new annotation. The resequenced K-10 with updated annotation (Wynne et al., 2010) had 3,100 hypothetical proteins, compared to the RefSeq annotation, which has only 793. All M. avium genomes in this study are now below 20% hypothetical proteins (Table 3). Despite the number of complete genomes for M. avium and their RefSeq annotation, there are still a considerable number of hypothetical proteins. These should become the target of future studies examining their function and a prioritized list, which includes proteins with a predictable biochemical activity, has already been developed (Galperin and Koonin, 2004). Interestingly, a high concentration of hypothetical proteins (67%) in the K-10 strain are listed among RefSeq annotated genes that were not in the initial annotation (Supplementary Table S2). Many of these newly predicted genes (82%) are also core genes in Map, increasing the interest to study these further.
Even with these complete genome sequences now available and a phylogenetic network established, a progenitor strain cannot be unequivocally determined from these analyses. The larger average genome size for Mah makes it tempting to suggest it is the progenitor, but there is no clear evidence for this hypothesis. It could simply be a more open genome that readily takes up new genes through recombination. Overall, M. avium has evolved through well-described pathways which include insertions/deletions, recombination and modifications such as SNPs. A model can be proposed once the order of these documented genomic modifications becomes known.
Whole genome similarity analysis based on nucleotide level comparisons such as ANI and Jaccard similarity are ideally performed on complete genomes sequences to obtain static, reproducible results. The M. avium strains in this study are closely related and belong together at the subspecies level as shown by > 98% ANI values, which drop quickly when comparing sequences outside M. avium, even with M. intracellulare, which is a member of the MAC. The Jaccard pairwise similarity values showed better discrimination of M. avium strain relatedness. The Jaccard method has not been widely adapted because detailed genome annotations were lacking, however, with RefSeq annotations available, similarity measurements using the Jaccard coefficient provide a more relevant and simplified basis for genome comparison (Jay et al., 2012). At the gene level, Map and Mah form well-separated clusters with Maa and the Map ovine strain as intermediates between these principle clusters (Figures 8, 9).
The two Egyptian Map strains E1 and E93 do not have complete copies of the hallmark IS900 insertion sequence ( Table 3). We identified 17 incomplete copies of IS900 represented in 34 fragments of ∼200 bp at each extremity of the IS element. Furthermore, both of their genome sizes are smaller than for all other Map strains (Table 3), which is also detectable in the Mauve alignment (Figure 4). We confirmed that these two genomes are type II strains on the basis of SNPs in the gyrA and gyrB genes (Castellanos et al., 2007) and they also contain the LSP20 region as do all other type II strains (Semret et al., 2004). These two genomes were not de novo assembled, but were assembled using K-10 as the reference and not all reads mapped to the reference (Amin et al., 2015). For these reasons, we suspect a mistake may have been introduced in the assembly of those genomes where the repeat elements might have been masked out during an initial assembly, but then not added back into the final assembly. All other type II strains in this study had either 16 or 17 copies of IS900.
Surprisingly, only 16 SNPs were detected in FDAARGOS_305, which suggests it is essentially the same as K-10. These two genomes also clustered tightly by core gene phylogeny ( Figure 8B). The FDAAROGOS_305 strain is part of a much larger microbial genome sequencing effort that began in May of 2014 and is funded by two US government agencies, the Department of Defense and the Food and Drug Administration. Termed FDA-ARGOS, this large sequencing effort currently has over 1,000 microbial genomes sequenced (Sichtig et al., 2019). When researching the source of the FDAAROGOS_305 strain, we discovered that FDA-ARGOS re-sequenced the K-10 strain that was deposited in the ATCC culture collection. This explained why the two genomes were separated by less than 20 SNPs and showed no major genomic rearrangements.
The core genome represents genes that are present in all strains while the pangenome encompasses all genes, orthologous or not, among the species. In E. coli for example, only 39% of the genes comprise the core genome and genome sizes among E. coli strains can vary by more than 500 kilobase pairs (Touchon et al., 2009). These numbers are similar for non-Map M. avium strains. By contrast in Map, the core genome is 80% (Figure 5) with genome sizes varying by only 126,426 bp among the 13 genomes analyzed. This fact, combined with recombination analysis (Figure 10), suggests that Map genomes do not show evidence of widespread horizontal gene transfer and that one strain can epitomize the Map subspecies, at least for type II strains. Previous studies have shown that Mah is more like E. coli in that it rapidly acquires new genes, contains plasmids, and the gene repertoire is larger than Map (Yano et al., 2017). Thus, no single Mah strain could adequately represent this subspecies.
The M. avium genomes that have been sequenced, along with others in progress, will serve as the foundation for population genomics and evolution of the MAC. These data have narrowed the gap between population genetic and phylogenetic approaches to study genome evolution, both of which are important to understanding the effect of gene gain and loss on adaptation and genome synteny. We have observed that the core genome evolves primary through SNPs, and possible recombinant events, while the accessory genome is acquired through horizontal gene transfer. Finally, the Map genome is very closed and stable, which is not the case for non-Map genomes.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the EVA, Project: PRJEB39090 and Analyses: ERZ1461588.

AUTHOR CONTRIBUTIONS
JB conceived and designed the study. All authors made substantial contributions to the analysis and writing of the manuscript.

FUNDING
Portions of this work were funded by the USDA Agricultural Research Service.

ACKNOWLEDGMENTS
We thank Kayla E. Straight for assistance with supplementary tables.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb. 2020.01701/full#supplementary-material FIGURE S1 | Mauve alignment between Map type II strains without prior alignment of dnaA as the starting point. The alignment is organized into one horizontal section per input genome sequence. Each genome track contains the accession number and strain of the genome sequence along with a base pair scale showing the sequence coordinates for that genome. Blocks of various colors above and beneath the center line represent defined genome regions. Similar colored blocks in different genomes represent regions that aligned in each genome. When a block lies above the center line the aligned region is in the forward orientation relative to the K-10 genome sequence (top). Blocks below the center line indicate regions that align in the reverse complement (inverse) orientation. Comparison of this figure with Figure 4 in the manuscript shows a striking difference.  S1 | Correlation of old and new locus tags for Map K-10. Shown is a complete list of the RefSeq locus tags for Map K-10 along with the corresponding old locus tags which have been used in the scientific literature. In addition, the protein ID and description are given for coding sequences. The "NA" designation in the old locus tag column indicates a gene that was not included in the original annotation. If "NA" is present in both the old locus tag and Protein ID columns, that gene is a pseudogene. TABLE S2 | RefSeq annotated genes not present in the initial Map K-10 annotation. Shown are the cds and gene numbers as well as the RefSeq locus tag, protein ID and expression product.
TABLE S3 | Pseudogenes in Map K-10. A total of 199 pseudogenes are listed along with their start and stop coordinates on the K-10 RefSeq annotation. The old locus tag is listed if applicable.
TABLE S4 | Complete and partial copies of IS1245 among non-Map genomes. The criteria to be listed as a functional copy of IS1245 is that the product accession must be WP_011725525.1 and the size of the element must be 1,233 bp. Any other size was considered a partial copy and a pseudogene. Each IS1245 copy is listed along with strain designation and locus tag.