Improved Resistance Prediction in Mycobacterium tuberculosis by Better Handling of Insertions and Deletions, Premature Stop Codons, and Filtering of Non-informative Sites

Resistance in Mycobacterium tuberculosis is a major obstacle for effective treatment of tuberculosis. Multiple studies have shown promising results for predicting drug resistance in M. tuberculosis based on whole genome sequencing (WGS) data, however, these tools are often limited to this single species. We have previously developed a common platform for resistance prediction in multiple species. This platform detects acquired resistance genes (ResFinder) and species-specific chromosomal mutations (PointFinder) associated with resistance, all based on WGS data. In this study, we present a new version of PointFinder together with an updated M. tuberculosis database. PointFinder now includes predictions based on insertions and deletions, and it explicitly reports frameshift mutations and premature stop codons. We found that premature stop codons in four resistance-associated genes (katG, ethA, pncA, and gidB) were over-represented in resistant strains, and we saw an increased prediction performance when including premature stop codons in these genes as resistance markers. Different M. tuberculosis resistance prediction tools vary in performance mostly due to the mutation library used. We found that a well-established mutation library included non-predictive linage markers, and through forward feature selection we eliminated those from the mutation library. Compared to other similar web-based tools, PointFinder performs equally good. The advantages of PointFinder is that together with ResFinder it serves as a common web-based and downloadable platform for resistance detection in multiple species. It is easy to use for clinicians and already widely used in the research community.


INTRODUCTION
Next generation sequencing is a rapidly evolving field and it is in the process of being adopted as the standard in many clinical and public health settings. Here, it replaces many traditional typing and phenotyping methods such as those for species determination and detection of antimicrobial resistance. Rapid and precise detection of antimicrobial resistance is important for correct treatment, surveillance and control efforts. Antimicrobial resistance occurs either through horizontal gene transfer or by de novo chromosomal mutations (Munita and Arias, 2016). In Mycobacterium tuberculosis all acquired resistance have been mediated by chromosomal mutations, and horizontal transfer have not been described (Smith et al., 2013). In addition to acquired resistance M. tuberculosis have a number of intrinsic resistance mechanisms including modification of drug targets, chemical modification of drugs, enzymatic degradation of drugs, molecular mimicry of drug targets, and drug deportation by efflux pumps (Smith et al., 2013). This is a serious obstacle for effective tuberculosis treatment and prevention of the disease worldwide (World Health Organization [WHO], 2017). Mutations and other genetic changes may lead to enzymatic inactivation of antibiotic molecules, overexpression of novel efflux pumps and porin alterations in the cell wall, trapping of drugs and overexpression of proteins involved in neutralizing the effect of drugs. Due to slow growth rates of M. tuberculosis, determining resistance by conventional drug susceptibility testing (DST) is highly time-consuming. Contrarily, next-generation sequencing rapidly yields accurate whole genome sequencing (WGS) data. Using prior knowledge on the genomic changes leading to resistance, WGS data can be used for rapid prediction of antimicrobial resistance (Koser et al., 2014). In fact, studies have already shown promising results for predicting resistance in M. tuberculosis based on WGS for first-line anti-tuberculosis drugs (Feuerriegel et al., 2015; The CRyPTIC Consortium and the 100,000 Genomes Project et al., 2018). However, a challenge for applying this knowledge in a clinical setting is that resistance predictor tools are often limited to a single species. We have previously developed ResFinder (Zankari et al., 2012), an in silico method for detection of acquired genes associated with antimicrobial resistance in multiple species based on WGS data. ResFinder was recently expanded with PointFinder (Zankari et al., 2017), a speciesspecific tool detecting chromosomal mutations associated with drug resistance. PointFinder already includes five species. The rationale of this study is to expand PointFinder also to cover M. tuberculosis. In addition to point mutations insertions and deletions may also affect resistance. Especially if the insertion/deletion length is not a multiple of three they will cause the rest of the gene to be read out of frame, which have a high likelihood of introducing a stop codon leading to a truncated gene. We have therefore set out to do a thorough analysis of the correlation of premature stop codons with resistance. In this study we optimized and evaluated the performance of PointFinder's prediction of resistance in a sixth species, M. tuberculosis. M. tuberculosis was chosen because of its importance for global health. It is also an organism for which many resistance mutations have been described. Here, we wanted to investigate is some of these are in fact non-informative when it comes to predicting resistance, and study in more detail how the presence of premature stop codons affects resistance. We used a data set of 3,528 M. tuberculosis isolates in the optimization which consisted of omitting non-predictive mutations from a well-established mutation library, and including premature stop codons as resistance markers. 2,480 isolates were used to validate the performed optimization.

PointFinder Database
The PointFinder database contains both a mutation library listing resistance-associated chromosomal mutation and a collection of reference sequences in which these mutations occur. All database files are available at bitbucket.org/genomicepidemiology/pointfinder_db.
The tuberculosis mutation library was obtained from pathogenseq.lshtm.ac.uk, under Tuberculosis and Rapid DR Study and described in Coll et al. (2015). Additional mutations were achieved from a genome wide association (GWA) study performed by the same group in Coll et al. (2018). Mutations, which were observed in the GWA study to be significantly associated with resistance and observed more than 10 times, were also included in the mutation library. All genes, RNA genes and promoter regions of interest for resistance prediction in M. tuberculosis are shown in Table 1. Reference sequences for genes and genomic regions listed in Table 1 were obtained from the H37Rv M. tuberculosis reference strain, NCBI-reference sequence: NC_000962.3.

PointFinder
PointFinder is both a web service and command line application for predicting resistance associated with chromosomal mutations based on WGS data. The web service is available on cge.cbs.dtu.dk/services/ResFinder/where users can specify to search "Chromosomal mutations" in six different species, including M. tuberculosis. The command line version of PointFinder is available on bitbucket.org/genomicepi-demiology/pointfinder.
PointFinder is a Python program that accepts both FastQ and Fasta files for resistance prediction. Initially, the genes of interest for resistance prediction (Table 1) are identified. KMA (Clausen et al., 2018) is used for mapping of raw reads, and BLASTN, RRID:SCR_001598 (Camacho et al., 2009) for aligning assembled genomes, to the genes of interest.
Mutations are detected by comparing the alignments between the reference sequences and the sequences found in the input file. The aligned sequences are compared nucleotide by nucleotide when the alignment represents a promoter region or an RNA gene and codon by codon when it represents a coding gene sequence. Effort has been put into detecting insertions and deletions and reporting any disruption or restoring of the reading frame. If a premature stop codon is detected, this will also be explicitly reported, and no further search for mutations will be performed after the detection of a stop codon. The observed mutations are looked up into the mutation library which holds information about mutations known to be predictive for resistance. If a found mutation exists in the mutation library, the resistance phenotype is returned together with the PubMed ID of the article linking the observed genotype with the predicted resistance phenotype.

M. tuberculosis Data Sets
All data sets used in this study exclusively consisted of paired-end WGS data associated with phenotype data. Phenotype data was given as Resistant or Susceptible based on laboratory determined DST results for multiple anti-tuberculosis drugs. The first data set, called the ReSeq data set, was obtained from the Relational Sequencing TB Data Platform (Starks et al., 2015). It consisted of WGS data from 3,528 M. tuberculosis isolates. The second data set was obtained from the Supplementary Data in Coll et al. (2018) and was used as a validation data set. The validation data set contained 2,480 isolates. The ReSeq and validation data set contained sufficient phenotype data for 10 drugs namely; Rifampicin, Isoniazid, Streptomycin, Ethambutol, Amikacin, Capreomycin, Ethionamide, Kanamycin, Pyrazinamide, and Fluoroquinolones. The number of isolates with determined phenotype varied with each drug. Fluoroquinolones DSTs were determined for the specific compounds namely, Ciprofloxacin, Ofloxacin, Moxifloxacin, and Levofloxacin. However, in the analysis we considered Fluoroquinolones resistance as one, since the mutation library did not distinguish between different compounds. Thus, if an isolate was resistant to any of the Fluoroquinolone compounds it was considered Fluoroquinolones resistant. The data sets can be found in Supplementary Tables S1, S2.
A third data set was used to compare PointFinder to similar resistance predictor tools developed for M. tuberculosis. From a scientific report by Schleusener et al. (2017) we obtained 91 isolates that had been used to compare five existing M. tuberculosis resistance predictor tools. These 91 isolates were Illumina MiSeq paired end sequenced, and phenotype data existed for five drugs namely, Rifampicin, Isoniazid, Streptomycin, Ethambutol, and Pyrazinamide.

Measuring Prediction Performance
PointFinder's detection of resistance-associated mutations was used for binary classification of resistance and susceptibility using the following rules. Isolates were predicted resistant to a drug if one or more mutations predictive of resistance to the drug were found. Isolates were predicted susceptible to a drug if all genes of interest for resistance to the drug were found with an identity above 90% and a sequence coverage above 60%, and no resistance-associated mutations were detected in the genes. We used default options and parameters when running PointFinder. To assess the quality of PointFinder's binary classification we calculated the Matthew's Correlation Coefficient (MCC) and the sensitivity and specificity of the prediction.

Forward Selection of Predictive Mutation
To detect non-predictive mutations, we applied forward feature selection optimized based the MCC over threefold crossvalidation. We exclusively examined abundant mutations, defined as mutations found in the ReSeq data set 10 times or more. Mutations found less than 10 times were included in the initial state of the prediction model, whereas the abundant mutations were initially excluded. With each step of the forward selection one abundant mutation was added to the model. The mutation added was the one mutation that benefited the prediction the most based on the MCC. Mutations were added to the model one by one until adding any remaining mutations would decrease the quality of prediction. Examined mutations that were not selected in any of the threefold of the cross-validation were considered nonpredictive for resistance.

Statistical Analyses
Significant over-representation of premature stop codons in resistant isolates was assessed with Pearson's Chi-squared test on a 2 × 2 matrix using the statistical software R (Version 3.4.0). PointFinder was compared with a similar predictor called PhyResSE. We assessed if PhyResSE performed significantly better than PointFinder using bootstrapping. FIGURE 1 | Flow chart describing the PointFinder workflow. The input sequences are aligned to a database of reference genes. The genetic differences observed in the alignments are compared to a mutation library, with annotated phenotypes. Based on this a resistance phenotype prediction is made.

RESULTS
We created an updated method for predicting antimicrobial resistance from the genomic sequence. An overview of the method can be seen in Figure 1.

Evaluating and Optimizing the Mutation Library
We calculated the sensitivity, specificity and MCC for predicting drug resistance using PointFinder compared to phenotypic DST results ( Table 2). The resistance prediction was based on mutations from the mutation library detected in the 3,528 M. tuberculosis isolates from the ReSeq data set. The best prediction performances were obtained for the first-line drugs Rifampicin, Isoniazid and MDR (MCC of 0.85, 0.82, and 0.86, respectively). PointFinder's prediction performance varied dependently on the drug with MCCs ranging from 0.386 to 0.848. Especially, the prediction of resistance to Ethambutol, Pyrazinamide, Amikacin, and Ethionamide was less successful, which indicated that the mutation library was not fully developed. Table 3 shows the occurrence of PointFinder-detected premature stop codons in the resistance-associated genes found in resistant and susceptible isolates. Genes shown in bold in Table 3 were in the mutation library described with positionspecific premature stop codons causing resistance. With the exception of the panD gene, these genes showed a significantly higher occurrence of premature stop codons among resistant strains. However, for many genes the analysis was only based on a few premature stop codons. Only for four genes, katG, pncA, ethA, and gidB premature stop codons occurred more than 10 times, and we used this as a threshold for a considerable frequency. Moreover, premature stop codons in these genes were significantly over-represented in strains resistant to the drug that the genes were associated with (see Table 1). For katG, pncA, and ethA the representative p-values were below 0.00001 and for gidB it was 0.006. PointFinder's prediction performance given in Table 4 shows that considering premature stop codons in the four genes as resistance markers improved the MCC of the resistance prediction for drugs in question; Isoniazid, Streptomycin, Pyrazinamide, and Ethionamide. In the case of Streptomycin and Ethionamide, the performances were improved with a compromise of the specificity.
Besides a possible lack of predictive premature stop codons, the mutation library also seemed to include mutations that were not predictive for resistance. For example, a low specificity was observed in the resistance prediction of Ethambutol and Amikacin, due to many false positive predictions. This indicated that the mutation library contained mutations, which should be omitted.
To detect such non-predictive mutations, we used a forward feature selection approach where the selection of mutations was optimized based the MCC over threefold cross-validation. Mutations not selected in any of the threefold of the crossvalidation were considered non-predictive for resistance and shown in bold in Table 5. For 7 out of the 10 drugs, we   found one or more mutations that were deselected in every fold and these mutations were omitted from the mutation library. The occurrence of the deselected mutations in resistant and susceptible isolates are shown in Supplementary Table S3. Table 6 shows the prediction performance when excluding the mutations from the mutation library. Omitting the nonpredictive mutations from the mutation library did compromises the sensitivity, yet since the forward feature selection was trained based on the MCCs, the MCC performance was improved for all seven drugs.

Validating the Mutation Library Optimization
To validate the effects of including premature stop codons and excluding non-predictive mutations from the mutation library, we performed resistance predictions on a validation data set. This data set consisted of 2,480 isolates, and was independent of the ReSeq data set.
First, we examined occurrence of genes with premature stop codons in resistant and susceptible strains ( Table 7). Like in the ReSeq data set premature stop codons occurred with a considerably frequency in the genes katG, pncA, ethA, and gibB. However, here only the katG and pncA premature stop codons were significantly over-represented in the resistant strains. gidB was close to the significant level of 0.01 (p-value: 0.015) whereas ethA premature stop codons seemed to be equally distributed between resistant and susceptible strains (p-value: 0.642).
Additionally, we looked at the occurrence of the mutations that were considered non-predictive in the forward feature selection analysis. Data is shown in Supplementary Table S3.
Most of the mutations that were found to be non-predictive for resistance in the ReSeq data set were confirmed to be widely present in susceptible strains in the validation data set. The mutations, rpoB I491F, inhA V78A, pncA I6L, gyrA T80A, and rrs 517C > T, were present in none or in very few samples in the validation data set, and therefore, the positive effect of removing these mutations could not be validated. Table 8 shows prediction performances on the validation data set using three different mutation libraries; first, the initial mutation library, secondly, the mutation library where premature stop codons in katG, pncA, ethA, and gidB were included as resistance markers, and thirdly, the mutation library containing both the four premature stop codon markers and excluding the non-predictive mutations. Table 8 shows an overall improved prediction performance when including the premature stop codons as resistance markers and when excluding the nonpredictive mutations.

Comparing PointFinder With Similar Tools
A scientific report from 2017 by Schleusener et al. PhyResSE generally showed the best performance, therefor we used the same data set to compare PointFinder to PhyResSE. We reran the data set through PhyResSE, to make a direct comparison. Table 9 show the prediction performance of PointFinder and PhyResSE based on WGS data and DST results from the 91 isolates. The mutation library used for PointFinder included premature stop codons in katG, pncA, ethA, and gidB and did not contain the non-predictive mutations.

DISCUSSION
In this study, we presented an improved version of PointFinder where the detection of insertion and deletion together with frameshift mutations were handled properly. As an effect of the improvements we were able to enhance PointFinder's resistance prediction in M. tuberculosis by including premature stop codons as resistance markers. Additionally, we optimized the obtained M. tuberculosis mutation library by excluding mutations that through forward feature selection were considered nonpredictive for resistance. A scientific report from 2017 by Schleusener et al. compared five M. tuberculosis resistance prediction tools based on a data set of 91 isolates. These five tools were, CASTB (Iwai et al., 2015), PhyResSE (Feuerriegel et al., 2015), TBProfiler (Coll et al., 2015), KvarQ (Steiner et al., 2014), and Mykrobe Predictor TB (Bradley et al., 2015). To our knowledge it has not been studied thoroughly how the occurrence of premature stop codons in resistanceassociated genes affect the resistance phenotype. The mutation library lists premature stop codons predictive for resistance, yet these premature stop codons are only considered as predictive markers if found at the specific position listed. However, the outcome of a premature stop codon -gene truncation -is, in most cases, independent of the position in the gene. The first version of PointFinder described in Zankari et al. (2017) did not consider insertions and deletions, and as a consequence of  It was tested whether genes with premature stop codons were equally distributed between the two phenotypic groups. This was tested using a χ 2 test on a 2 × 2 matrix, and the p-value is given. * p-value below the significance level of 0.01. Res, number of resistant strains determined by drug susceptibility testing; Sus, number of susceptible strains determined by drug susceptibility testing; RMP, Rifampicin; INH, Isoniazid; STM, Streptomycin; PZA, Pyrazinamide; ETH, Ethionamide; CAP, Capreomycin.
this, frameshift mutations and premature stop codons was not correctly detected. With this new version of PointFinder, efforts were put into detecting reading frame disruptions and premature stop codons caused by insertions and deletions, and the improved PointFinder version was used to assess the impact of premature stop codons on resistance emergence. Among all genes annotated with predictive premature stop codons in the mutation library we found a significantly higher occurrence of premature stop codons among resistant strains in the ReSeq data set, with the exception of the panD gene ( Table 3). A study from 2014, showed a M. tuberculosis panD deleted mutant still susceptible to Pyrazinamide (Dillon et al., 2014). The study postulated that panD is not a target for Pyrazinamide resistance, and our results support this hypothesis by indicating that loss of function of panD is not associated with Pyrazinamide resistance.
Our results suggest that katG and pncA premature stop codons are predictive for resistance, whereas the role of ethA and gidB premature stop codons was less clear. Isoniazid, Pyrazinamide, and Ethionamide are pro-drugs, and the proteins encoded by katG, pncA, and ethA are enzymes catalyzing the activation of these drugs, respectively (Zhang et al., 1992;Almeida Da Silva and Palomino, 2011). If the enzymatic activity is lost (e.g., by the occurrence of a premature stop codon), the drug cannot be converted to its active form, which can explain the emergence of drug resistance.
Surprisingly, premature stop codons in ethA also occurred with a high frequency in susceptible strains, and in the validation data set premature stop codons in ethA were not overrepresented in resistant strains (Table 7). Since, ethA encodes the Ethionamide activating enzyme, we speculate whether this is not the only enzyme able to activate Ethionamide, or if Ethionamide also has antimicrobial effects as a pro-drug, or maybe premature stop codons can be neglected and not cause complete depletion of the ethA-encoded enzyme. Another explanation for the inconclusive effect of ethA premature stop codons, might be that the use of Ethionamide constitutes a selective pressure that favors premature stop codon in ethA leading to low-levels resistance close to the clinical breakpoint used in DST protocols.
Premature stop codons in gidB were slightly over-represented among the resistant strains both in the ReSeq (p-value = 0.006) and the validation data set (p-value = 0.015), yet, premature stop codons in gidB were also observed in many susceptible isolates (see Tables 3, 7). Like for ethA, this might reflect that depletion of the gidB-encoded protein causes resistance levels close to the clinical breakpoint. In fact, a functional study showed that knocking out gidB leads to low-level Streptomycin resistance (Wong et al., 2011). We observed an increased MCC when treating the mutation as a resistance marker, thereby, our study also indicates that loss of function of the gidB-encoded protein is associated with Streptomycin resistance.
The forward feature selection analysis implied that several mutations included in the obtained mutation library were misclassified as resistance markers, and the positive effects of removing these mutations were also seen in the MCC on the validation data set (Table 8), with the exception of predicting Amikacin resistance. The two mutation rrs 514A > C and rrs 517C > T that were removed in this case, were however also found in other studies to play no role in resistance to Amikacin (Maus et al., 2005;Jugheli et al., 2009).
Further investigation showed that the misclassification of many of the deselected mutations was also reported in other studies, for example for kasA G269S and kasA G312S (Sun et al., 2007), rrs 492C > T (Victor et al., 2001;Villellas et al., 2013), rrs 1401A > G (for Streptomycin resistance) (Via et al., 2010), gyrA T80A (Pantel et al., 2016), and embB E378A and embC T270I (Goude et al., 2009;Campbell et al., 2011;Koser et al., 2011). In the forward feature selection analysis, we chose to only include mutations that were observed 10 times or more, however, with more isolates or a lower threshold for including  mutations, we might discover even more misclassified mutations. On the other hand, the mutation rpoB L430P, rpoB H445N, and rpoB I491F were considered non-predictive for resistance to Rifampicin based on the forward feature selection. However, studies have shown that DST performed on liquored-based mediums fails to detect resistance in strains with rpoB I491F and other rpoB mutations that were clinically associated with treatment failure (Rigouts et al., 2013;André et al., 2017). Thus, with forward feature selection we risk removing mutations that truly causes resistance but appears not to, due to erroneous DST results. This underlines a problem regarding using DST results as the standard for determining resistance. A wellestablished mutation library is important to avoid incorrect mutation interpretations. When comparing PointFinder to PhyResSE we did see differences in variant interpretation. This was notable in the gidB gene associated with Streptomycin resistance. PointFinder only interpreted resistance based on premature stop codons in gidB, whereas PhyResSE included several gidB mutations in the interpretation, e.g., gidB A200E, V88A, and A138V (see Supplementary Table S4), and with the interpretation of these mutations as resistance markers PhyResSE showed a significantly better Streptomycin resistance prediction. A GWA study from 2018 did detect the same gidB mutations among 6,465 strains, but in this study this gidB mutations were either observed in less than 10 samples or not identified as being significantly associated with resistance (Coll et al., 2018). Based on this, we did not choose to include these gidB mutation in our mutation library. We have here evaluated the effect of genetic alterations on resistance. A limitation of this is that it is overlooked if mutations have an effect of for example fitness. Future studies may seek to clarify such correlations if large scale datasets with genomes and fitness estimations become available.
The predicting performance of PointFinder is comparable to other M. tuberculosis resistance prediction tools, like PhyResSE, and PointFinder has the advantage of being build into a larger platform for resistance prediction, that is not limited to a single species. Additionally, PointFinder is available on bitbucket.org/genomicepi-demiology/pointfinder, where all changes in the script are tracked. The databases are also available on bitbucket which gives the needed transparency. This creates a good foundation for future maintenance and improvements of the variant interpretation methods and the mutation library.

CONCLUSION
We have developed improved version of PointFinder with better detection of insertions and deletions as well as the possible associated frameshifts. We find that the accuracy of PointFinder's resistance prediction in M. tuberculosis is improved as a result. We also optimized the M. tuberculosis mutation library by excluding mutations that through forward feature selection were found to be non-predictive for resistance. We think that these methods may also be applied to increase the antibiotic resistance prediction in other species. The method is flexible and can be updated if new genetic markers for resistance is identified. The method is freely available on the web as well as a stand alone version.

DATA AVAILABILITY STATEMENT
The datasets analyzed in this study was obtained from platform.reseqtb.org and as Supplementary Data from Coll et al. (2018) and Schleusener et al. (2017). All accession numbers and phenotype data are also given as Supplementary Data (Supplementary Tables S1, S2, S4).

AUTHOR CONTRIBUTIONS
CJ implemented changes in the improved version of PointFinder, performed all analyses, and wrote the manuscript with inputs from all authors. PC provided help with statistical calculations and did proofreading. OL supervised the project and, together with FA, were in charge of overall direction and planning.

FUNDING
This study has received funding from the European Union's Horizon 2020 Research and Innovation Program under grant agreement no. 643476 (COMPARE) and the Center for Genomic Epidemiology (Grant 09-067103/DSF). The funding body did not play any role in the design of the study, writing of the manuscript nor did they have any influence on the data collection, analysis or interpretation of the data and results.

ACKNOWLEDGMENTS
We are grateful to Rosa Allesøe, Judit Szarvas, and Valeria Bortolaia for excellent technical support and theoretical guidance.