A New Single Gene Differential Biomarker for Mycobacterium tuberculosis Complex and Non-tuberculosis Mycobacteria

Background: Tuberculosis (TB) and non-tuberculous mycobacteriosis are serious threats to health worldwide. A simple non-sequencing method is needed for rapid diagnosis, especially in less experienced hospitals, but no specific biomarker is in common use for all mycobacteria. The ku gene of the prokaryotic error-prone non-homologous end joining (NHEJ) system has the potential to be a highly specific detection biomarker for mycobacteria.
Methods: A total of 7294 mycobacterial genomes and 14 complete genomes from other families that, together with Mycobacteriaceae, belong to the order Corynebacteriales were downloaded and analyzed for the existence and variation of the ku gene. Mycobacterium tuberculosis complex (MTBC)-specific and non-tuberculosis mycobacteria (NTM)-specific primers were designed, and their actual amplification and identification efficiencies were tested with 150 strains of 40 Mycobacterium species and 10 kinds of common respiratory pathogenic bacteria.
Results: The ku gene of the NHEJ system was ubiquitous in all genome-sequenced Mycobacterium species and absent in the other families of Corynebacteriales. As a single-gene non-sequencing biomarker, its specific primers could effectively distinguish mycobacteria from other bacteria and MTBC from NTM, which would simplify the clinical detection of mycobacteria and has great practical clinical value. In addition, the sequence of the ku gene can effectively distinguish NTM to the species level with high resolution.
Conclusion: The Ku protein existed before the differentiation of Mycobacterium species. It is an important protein involved in maintaining genome integrity, it is related to the special growth stage of mycobacteria, and it is rare in prokaryotes. These features make it a highly specific differential biomarker for Mycobacterium.


INTRODUCTION
Mycobacterium is a genus of over 190 species and 13 subspecies. Apart from the causative agents of tuberculosis and leprosy, namely the Mycobacterium tuberculosis complex (MTBC) and Mycobacterium leprae, the other members of this genus are grouped together and termed non-tuberculosis mycobacteria (NTM). Both tuberculosis (TB) and non-tuberculous mycobacteriosis pose serious threats to health worldwide, especially with the increase of multi-drug and pan-drug-resistant strains.
Mycobacteria differ completely from other bacteria in their culture characteristics and in the antibiotics used to treat them. A preliminary classification of an infection as mycobacteriosis, and then as TB, NTM, or a mixed infection of the two, would make the subsequent culture more targeted, help improve the isolation/culture rate, and allow clinical medication to be administered appropriately so that transmission is reduced more effectively, especially in less-experienced hospitals. Such a method should be simple, fast, accurate and culture-free.
Compared with the relatively easy diagnosis of tuberculosis, the diagnosis of NTM disease depends more on culture and biochemical tests, or on exclusion, that is, a negative TB test in smear/culture of acid-fast-positive samples. Because the procedure is cumbersome, the isolation, cultivation and identification of NTM are not actually done in many hospitals in China. The incidence and disease burden of NTM are continuously increasing in many regions, and the prevalence of NTM in aged people, human immunodeficiency virus (HIV)-infected patients and those with severely damaged immune systems is significant, even exceeding that of TB (Wang et al., 2014;Halstrom et al., 2015;Mortaz et al., 2018). Therefore, the rapid diagnosis of NTM is also a prominent problem. A better single-gene biomarker that can be used to identify both MTBC and NTM is needed.
Mycobacteria have three DNA double-strand break (DSB) repair pathways, which include the NHEJ system required for the CRISPR/Cas9 system in the second step (Pitcher et al., 2006). The NHEJ system is absent in most prokaryotic cells. To date, eukaryotic NHEJ homologs have only been identified in M. smegmatis, M. tuberculosis, and Bacillus subtilis (Bs). Furthermore, prokaryotic NHEJ is a much simpler system that needs only two key proteins, Ku and ligase D (LigD) (Gong et al., 2005;Korycka-Machala et al., 2006;Shuman and Glickman, 2007;Gupta et al., 2011;Zheng et al., 2017). The Ku protein exists as a homodimer and preferentially binds to dsDNA ends (Weller et al., 2002). LigD is an adenosine triphosphate (ATP)-dependent DNA ligase that contains polymerase and nuclease domains, which facilitate the joining of long linear DNA molecules with incompatible ends (Della et al., 2004). The rarity of the NHEJ system in bacteria suggests that it may be developed into a Mycobacterium-specific detection biomarker. Although the NHEJ system has been confirmed to exist in M. tuberculosis strain H37Rv (encoded by Rv0937c and Rv0938) (Doherty et al., 2001;Weller et al., 2002), its distribution in other Mycobacterium species has not yet been elucidated, especially that of the Ku protein, which could specifically stimulate LigD and suppress homologous recombination (Della et al., 2004). In this study, we analyzed Mycobacterium genome data submitted to GenBank before September 2018 to explore the existence of the ku gene in the Mycobacterium genus and determine its applicability for Mycobacterium identification.

Genomic Data
We downloaded a total of 7294 genomes from 139 definite species, seven subspecies and five variants of

Sequence Extraction and Analysis
All the regions annotated as ku or mku and/or homologous to Rv0937c of M. tuberculosis were extracted from the genomes. Sequence alignments and comparisons were performed using the MEGA program version 6.0 (Tamura et al., 2013). Sequences were aligned with ClustalW using a gap opening penalty of 15 and a gap extension penalty of 6.66. Maximum likelihood trees were drawn. In each Mycobacterium species/variant, every ku sequence differing by even a single nucleotide was defined as a separate genotype and listed in Supplementary Table 1 with a representative sequence. The IS6110 and rpoB genes were analyzed in the same way as the ku gene, and Supplementary Table 2 lists the genotypes of the rpoB gene.
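The genotype definition used here (any nucleotide difference gives a new genotype) amounts to grouping identical sequences. The following is a minimal Python sketch of that idea, not the pipeline used in this study; the function name `assign_genotypes`, the input format and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def assign_genotypes(sequences):
    """Group genomes so that identical sequences share one genotype.

    `sequences` maps a genome identifier to its gene sequence
    (hypothetical input format). Returns a dict that maps each distinct
    sequence to the list of genome identifiers carrying it.
    """
    genotypes = defaultdict(list)
    for genome_id, seq in sequences.items():
        # Normalise case so that 'acgt' and 'ACGT' are not split apart.
        genotypes[seq.upper()].append(genome_id)
    return dict(genotypes)


# Toy data for illustration only (these are not real ku sequences).
toy = {
    "genome_A": "ATGCGTTCC",
    "genome_B": "ATGCGTTCC",
    "genome_C": "ATGCGTTCA",  # one nucleotide differs, so a new genotype
}
groups = assign_genotypes(toy)
for seq, members in groups.items():
    print(seq, len(members), members)
```

In this sketch the sequences are assumed to be complete and of equal length; partial sequences at contig ends would need to be excluded or handled separately before grouping.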

Primer Design and PCR Amplification
Primers were designed using Oligo 6.0, following the general design principles for PCR primers. Simulated PCRs were performed using the Analyze Mix Wizard of Clone Manager Professional 9.0. The 379 ku genotype sequences were added as molecules in the mix.

Statistical Analysis
A fourfold table (2 × 2) chi-square test was used to compare the variant rates of the ku and rpoB genes.

Distribution and Polymorphism of the ku Gene in Mycobacterium
The ku gene was found in almost all of the 7294 Mycobacterium genomes, with three exceptions: two incomplete genomes, M. setense strain 852014-10208_SCH5295773 and M. tuberculosis strain 0109V, which had no sequence analogous to the ku gene, and M. tuberculosis strain AH26_28866, which had an incomplete ku gene, with the first 290 bp in contig NZ_LKMH01000091.1 and 304-822 bp in contig NZ_LKMH01000168.1. In M. tuberculosis, the ku gene was highly conserved. Among the 5243 M. tuberculosis genomes there were 39 ku genotypes; 5149 (98.17%) genomes harbored the Rv0937c genotype, while 25 genomes each carried a genotype found in no other genome (Supplementary Table 1). The sequence similarity among the 39 genotypes was also very high. Only 37 sites in the 822 bp of the ku gene had variants, and sites 287, 449, and 451 had the highest rate of variation, at 5.13% (2/39).
In of M. decipiens (1/1) and RN08_1045 of M. microti (1/1) were completely identical to that of the Rv0937c genotype, as a result of which Rv0937c was the dominant genotype in the MTBC (98%, 5253/5359). On the other hand, nine M. canettii genomes were found to have six genotypes and two variant sites, A210G and A487G, which appeared in all M. canettii genomes and only in them, indicating obvious species specificity. In addition, given the high conservation of the ku gene in MTBC, we inferred that M. tuberculosis strain 0109V and M. setense strain 852014-10208_SCH5295773 might also carry the ku gene, and that its apparent absence was caused by incomplete genomic data.
In NTM, 278 sites of the ku gene were conserved in all NTM genotypes, but the conservation of the ku gene within each species varied greatly (Supplementary Table 1). In some species, the ku gene was highly conserved and had few genotypes. Overall, 32.4% of sites (266/822) of the ku gene were conserved in all Mycobacterium genotypes, but none of the NTM genotypes was 100% identical to any genotype of MTBC, including Rv0937c. On the phylogenetic tree based on the ku gene, MTBC could be clearly separated from NTM without any exception (Figure 1). The ku genotypes that had identical sequences in different species are listed in Table 2. For MTBC, only M. canettii, M. decipiens, and M. mungi had completely unique genotypes; the other species of MTBC could not be separated by the ku gene sequence. For NTM, the shared genotypes of M. avium and M. intracellulare, M. abscessus, and M. chelonae accounted for only a very small proportion. Therefore, apart from the other 18 NTM species in Table 2, most NTM could be identified to the species level by the ku gene sequence.

Comparison of the ku Gene With the IS6110 Element and the rpoB Gene
The distribution of the two most used identification loci, IS6110 and rpoB, in the whole Mycobacterium genus was also analyzed in this study.
The IS6110 element has 16 completely identical copies in the M. tuberculosis strain H37Rv genome. In 4450 M. tuberculosis genomes, there was at least one IS6110 copy with a sequence identical to the IS6110 element of H37Rv. The IS6110 elements in 428 genomes were only partially identical to that of H37Rv. The IS6110 elements of 306 M. tuberculosis genomes were located at the ends of contigs and were incomplete. The remaining 61 genomes, including the complete genome of M. tuberculosis UT205, had no sequences homologous to IS6110. In the other MTBC species/variants, the complete genomes of M. canettii CIPT 140070008 and 140070017 also had no sequences similar to IS6110. Therefore, even though it has been used as an important diagnostic marker to identify MTBC species (Coros et al., 2008;Guernier et al., 2017), the IS6110 element was not common to all the MTBC strains, or even to all M. tuberculosis strains. Tests based on it produce false negatives, which is consistent with previous studies (Viana-Niero et al., 2006;Freidlin et al., 2017).
The rpoB gene was widely distributed in Mycobacterium and had a total of 861 genotypes, after excluding 57 incomplete rpoB sequences from the analysis (Supplementary Table 2). The rpoB gene had 454 variant sites in M. tuberculosis alone, compared to only 37 variant sites of the ku gene in M. tuberculosis. Even after standardization by gene length, the variant rate of the rpoB gene in M. tuberculosis was 12.9% (454/3519 bp), which was far greater than that of the ku gene at 4.5% (37/822 bp) (χ² = 46.872, P < 0.01). The rpoB gene could separate the MTBC from all NTM without any exception on its phylogenetic tree, but the four complex groups of NTM could not be separated as distinctly as on the tree drawn from the ku gene (Figure 1).
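The fourfold-table comparison of variant sites can be checked with a few lines of Python. This is a sketch under two assumptions that the text does not state explicitly: that each nucleotide site is treated as one observation (variant or non-variant), and that the Pearson chi-square is computed without continuity correction. The function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], with one degree of freedom."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Variant-site counts and gene lengths as given in the text.
rpoB_variant, rpoB_length = 454, 3519
ku_variant, ku_length = 37, 822

chi2 = chi_square_2x2(
    rpoB_variant, rpoB_length - rpoB_variant,  # rpoB: variant, non-variant
    ku_variant, ku_length - ku_variant,        # ku:   variant, non-variant
)

# For one degree of freedom the upper-tail p-value is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, P < 0.01: {p_value < 0.01}")
```

Under these assumptions the statistic agrees with the reported value of 46.872 to three decimal places.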

Development of PCR System for Identifying of MTBC and NTM
Given the rarity of the ku gene in bacteria and the distinct clusters of MTBC and NTM on the phylogenetic tree, we inferred that mycobacteria might be specifically distinguished from other bacteria, and MTBC from NTM, by PCR alone without sequencing. A PCR system was therefore developed.
Mycobacterium tuberculosis complex-specific primers were designed in the conserved regions that were identical in the 55 ku genotypes of MTBC but different from all ku genotypes of NTM. Candidate primer sets with different product lengths were designed and screened, and the pair showing the greatest difference between MTBC and NTM was selected. ku-MTBC-U: GGT GGT CGA CTA CCG CGA TCT T and ku-MTBC-L: TCT TCG GGC TCG TCC AGC AAC C were located at 159-180 bp and 719-740 bp of the reference sequence of the Rv0937c genotype, respectively. Figure 2 shows that the design region of this pair of primers has a high degree of variation in NTM. In particular, the first base at the 3′ end of the upstream primer and the first, third and fourth bases at the 3′ end of the downstream primer were single nucleotide polymorphic loci that completely distinguished MTBC from NTM. Simulated PCR revealed that the 55 MTBC genotypes could be amplified by this primer set, whereas the 324 NTM genotypes could not.
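The logic of a simulated PCR can be illustrated with a minimal exact-match search in Python. This is only a sketch: it does not reproduce the algorithm of Clone Manager, it ignores the mismatch tolerance of real PCR, and the 822 bp template below is a synthetic placeholder that merely puts the two primer sites at the stated coordinates; it is not the Rv0937c sequence. The function names are illustrative.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_pcr(template, forward, reverse):
    """Return (start, end, length) of the amplicon in 1-based coordinates,
    or None if either primer lacks an exact binding site."""
    start = template.find(forward)
    if start == -1:
        return None
    # The downstream primer anneals to the opposite strand, so its
    # reverse complement is searched for on the template strand.
    site = reverse_complement(reverse)
    pos = template.find(site, start + len(forward))
    if pos == -1:
        return None
    end = pos + len(site)
    return start + 1, end, end - start


KU_MTBC_U = "GGTGGTCGACTACCGCGATCTT"  # primer sequence from the text
KU_MTBC_L = "TCTTCGGGCTCGTCCAGCAACC"  # primer sequence from the text

# Synthetic 822 bp placeholder: filler bases plus the two primer sites
# at positions 159-180 and 719-740 (NOT the real gene sequence).
template = (
    "A" * 158 + KU_MTBC_U + "A" * 538
    + reverse_complement(KU_MTBC_L) + "A" * 82
)
result = in_silico_pcr(template, KU_MTBC_U, KU_MTBC_L)
print(result)

# Changing the 3'-terminal base of the upstream binding site (a toy
# stand-in for an NTM-like polymorphism) abolishes the exact match.
mutated = template[:179] + "C" + template[180:]
print(in_silico_pcr(mutated, KU_MTBC_U, KU_MTBC_L))
```

With the primer coordinates given above, the expected product spans positions 159 to 740, that is, 582 bp.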
NZ_CP009616.seq of M. abscessus was used as the NTM reference for primer design. By marking the sites completely conserved in NTM on the reference sequence, it was found that the initial 1-20 bp of the ku gene was the best region for the design of the upstream primer ku-NTM-U, with the most conserved sites in NTM and a nine-base difference from MTBC (Figure 2). The downstream primer ku-NTM-L was an NTM/MTBC universal primer at 220-245 bp with eight conserved bases at the 3′ end. To improve the amplification efficiency for all NTM, degenerate bases were used. The primer set was ku-NTM-U: ATG CGT TCB ATH TGG AAR GG and ku-NTM-L: AGG CTC GCC AGR TCN TCR TCG GTG AT, and its specificity for NTM depends on the upstream primer.
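The degenerate positions in these primers follow the standard IUPAC nucleotide codes (B = C/G/T, H = A/C/T, R = A/G, N = any base). The Python sketch below counts how many distinct oligonucleotides each degenerate primer represents and turns a primer into a regular expression for exact matching. The helper names and the two 20-mer test sequences are hypothetical and serve only as an illustration.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def primer_regex(primer):
    """Compile a regex that matches every sequence the primer encodes."""
    parts = []
    for base in primer:
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else f"[{options}]")
    return re.compile("".join(parts))


KU_NTM_U = "ATGCGTTCBATHTGGAARGG"        # primer sequence from the text
KU_NTM_L = "AGGCTCGCCAGRTCNTCRTCGGTGAT"  # primer sequence from the text

print(degeneracy(KU_NTM_U))  # B (3) x H (3) x R (2) = 18
print(degeneracy(KU_NTM_L))  # R (2) x N (4) x R (2) = 16

# Hypothetical 20-mers: the first fits the degenerate pattern, the
# second has an A at the B position and therefore does not.
pattern = primer_regex(KU_NTM_U)
print(bool(pattern.fullmatch("ATGCGTTCGATCTGGAAGGG")))
print(bool(pattern.fullmatch("ATGCGTTCAATCTGGAAGGG")))
```

A higher degeneracy broadens the range of templates that can be amplified but dilutes the concentration of each individual oligonucleotide, which is one reason to restrict degenerate bases to the positions that actually vary.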
In the actual PCR amplification of the 148 Mycobacterium and common respiratory pathogenic bacteria, all of the MTBC strains were positive with ku-MTBC-U/L and negative with ku-NTM-U/L. The NTM strains showed the opposite pattern, and all 23 respiratory pathogenic bacteria were negative with both primer pairs. The sensitivity and specificity of the two primer pairs were 100%. Among the control primers, those for 16S rRNA and ITS were universal, and almost all of the tested strains, including the 23 respiratory pathogenic strains, were positive with them. Primers for hsp60 and rpoB were largely specific for Mycobacterium, but all 10 Nocardia strains, belonging to six Nocardia species, were also positive, although the other respiratory pathogenic strains were negative (Table 1). Without sequencing, none of the four pairs of control primers would be suitable for the identification of Mycobacterium, let alone the distinction between MTBC and NTM.
FIGURE 2 | The variation rates of the primer design sites in NTM ku genotypes.
Frontiers in Microbiology | www.frontiersin.org

DISCUSSION
The GenBank database contains plenty of Mycobacterium genome data. The sensitivity and specificity of molecular detection methods can be predicted and compared before actual use. In this work, approximately 151 definite species/variants of Mycobacterium have been included, accounting for 79.7% of all known mycobacterial species and all submitted genomes. This work has the same coverage as the phylogenomics and comparative genomics studies of Gupta et al. (2018). Although it does not cover all the Mycobacterium species, it may be the most comprehensive analysis that can be done so far and includes all of the clinically common species.
The similarity between the eukaryotic and bacterial Ku proteins suggests that they evolved from a common ancestor and that the process is very ancient (Weller et al., 2002). The ku gene should therefore have been present in the Mycobacterium genome before the differentiation of Mycobacterium species. Additionally, the Ku-based NHEJ system participates in the repair of DNA double-strand breaks (DSBs) in Mycobacterium, which maintains genome integrity and is pivotal for cell survival. The ku gene is thus unlikely to be lost during the evolution of Mycobacterium, and it is reasonable that the ku gene is distributed and conserved in all sequenced Mycobacterium genomes. Moreover, Weller et al. (2002) speculated that the Ku ligase system might be related to special growth stages of mycobacteria, especially bacterial sporulation and the long stationary phase of the life cycle. Different growth rates, culture characteristics and biochemical test characteristics are precisely the important indicators for distinguishing Mycobacterium species, especially NTM. Therefore, in theory, it is not unexpected that the ku gene can completely distinguish MTBC from NTM and distinguish species within NTM.
Among the detection methods for Mycobacterium, acid-fast (AF) staining, also known as Ziehl-Neelsen staining, is currently the most widely used preliminary diagnostic method. The sensitivity of AF staining compared with culture ranges from 22 to 78%, and its limit of detection ranges from 5 × 10³ to 1 × 10⁴ bacilli/mL. However, AF staining is not specific for Mycobacterium detection, and Mycobacterium spp. cannot be distinguished from other AF organisms, such as Nocardia, Rhodococcus, Tsukamurella, Gordona, Dietzia, Legionella micdadei, Cryptosporidium, Isospora belli, and Cyclospora cayetanensis, or from parasites such as Sarcocystis and Taenia saginata. Moreover, non-AF tuberculosis bacilli (Vilcheze and Kremer, 2017) also exist.
Thus, nucleic acid detection methods (NADMs) are becoming more and more important in the diagnosis and identification of mycobacteria (Niemann et al., 2000;Brunello et al., 2001). Loci used in developed NADMs include species-specific insertion sequences, such as IS6110 for members of the M. tuberculosis complex (Thierry et al., 1990), IS900 and F57 for M. avium subsp. paratuberculosis (Slana et al., 2008), IS901 for M. avium subsp. avium (Slana et al., 2010), and IS2404 and IS2606 for M. ulcerans, M. liflandii, M. pseudoshottsii, and M. shottsii (mycolactone-producing mycobacteria) (Fyfe et al., 2007), as well as genes shared among bacteria, for example the 16S rRNA, hsp65 and rpoB genes and the internal transcribed spacer (ITS), which are used in broad-range sequencing approaches. Multi-gene analysis (Homolka et al., 2012;Perez-Lago et al., 2014;Gupta et al., 2018) and whole genome sequencing (Vissa et al., 2009;Fedrizzi et al., 2017;Trofimov et al., 2018) have also been used in Mycobacterium and, because of their high resolution, can identify mycobacteria to the species level. Methods based on sequencing and homology comparison, especially those using the sequences of hundreds of core genes or the whole genome, are better suited to research purposes, and they are promising tools for identifying new mycobacterial species, new molecules for bacterial typing and new candidate genes for multidrug resistance. However, they are less practical for the rapid screening of large numbers of samples and for primary hospitals without bioinformatics capability. This is also why the single loci IS6110 and rpoB are the most commonly used biomarkers. IS6110, however, cannot be applied to NTM. rpoB is used not only for identification but also for antibiotic resistance prediction, and the GeneXpert assay based on it requires little technical training and can return results from unprocessed sputum samples in 90 min with minimal biohazard. GeneXpert is replacing AF staining and is endorsed by the World Health Organization (WHO) (Moure et al., 2011;Li et al., 2017), but it is also applicable only to MTBC and requires special instruments. In fact, until now, there has been no specific biomarker in common use for all mycobacteria without sequencing.
The analysis in this study showed the value of the ku gene as a diagnostic biomarker. The ku gene is not a common bacterial gene. Its rarity in prokaryotes (Weller et al., 2002), especially its absence in bacteria closely related to Mycobacterium (such as Nocardia), gives it high specificity. Its wide distribution in all sequenced Mycobacterium genomes makes it applicable to both MTBC and NTM. Together, these two features give the ku gene its greatest application value: it can directly distinguish mycobacteria, MTBC and NTM by PCR and thereby enable rapid clinical diagnosis. This practical value has been confirmed by the MTBC/NTM-specific primers we designed and by the testing of standard and clinical strains. In conclusion, the ku gene is a new single-gene biomarker of Mycobacterium that differentiates MTBC and NTM and makes their identification simpler and more accurate. However, it does not indicate drug resistance in Mycobacterium, so it cannot provide strain identification and drug resistance prediction simultaneously as the rpoB gene does. Its application value lies in its use as a single indicator for the primary screening and identification of mycobacteria, or in combination with rpoB, whose deficiencies it complements, for the accurate identification of Mycobacterium. More sensitive detection methods based on this gene, and their application to sample types other than pure cultures, will be explored in our further study to help the diagnosis of TB worldwide.