Original Research ARTICLE
Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data
- 1London School of Hygiene and Tropical Medicine, University of London, United Kingdom
- 2Imperial College London, United Kingdom
- 3University of Cape Town, South Africa
Background: Tuberculosis disease, caused by Mycobacterium tuberculosis, is a major public health problem. The emergence of M. tuberculosis strains resistant to existing treatments threatens to derail control efforts. Resistance is mainly conferred by mutations in genes coding for drug targets or drug-converting enzymes, but our knowledge of these mutations is incomplete. Whole genome sequencing (WGS) is an increasingly common approach to rapidly characterize isolates and identify mutations predicting antimicrobial resistance, thereby providing a diagnostic tool to assist clinical decision making.
Methods: We applied machine learning approaches to 16,688 M. tuberculosis isolates that had undergone WGS and laboratory drug-susceptibility testing (DST) across 14 anti-tuberculosis drugs, with 22.5% of samples being multi-drug resistant and 2.1% being extensively drug resistant. We used “nonparametric” classification-tree (CT) and gradient-boosted-tree (GBT) models to predict drug resistance and uncover any associated novel putative mutations. We fitted separate models for each drug, with and without “co-occurrent resistance” markers, that is, markers known to cause resistance to drugs other than the one of interest. Predictive performance was measured using sensitivity, specificity, and area under the ROC curve (AUC), taking DST results as the gold standard.
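The three performance measures named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical, chosen only to show how sensitivity, specificity, and AUC are computed when DST results are treated as the gold standard.

```python
from typing import Sequence


def sensitivity_specificity(
    dst: Sequence[int], predicted: Sequence[int]
) -> tuple[float, float]:
    """Return (sensitivity, specificity) of binary calls against DST labels.

    Labels use 1 = resistant and 0 = susceptible (an assumed encoding).
    """
    pairs = list(zip(dst, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # resistant, called resistant
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # resistant, missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # susceptible, called susceptible
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # susceptible, called resistant
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(dst: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a resistant isolate receives a higher
    score than a susceptible one, with ties counted as one half."""
    pos = [s for y, s in zip(dst, scores) if y == 1]
    neg = [s for y, s in zip(dst, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: DST phenotypes and model scores for eight isolates.
dst = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

# Hypothetical threshold of 0.5 to turn scores into resistant/susceptible calls.
calls = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(dst, calls)
auc = roc_auc(dst, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc:.4f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds, which is why the two kinds of measure are usually reported together.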
Results: The predictive performance was highest for resistance to first-line drugs, amikacin, kanamycin, ciprofloxacin, moxifloxacin, and MDR-TB (AUC above 96%), and lowest for third-line drugs such as D-cycloserine and para-aminosalicylic acid (AUC below 85%). The inclusion of co-occurrent resistance markers led to improved performance for some drugs, and to superior results when compared to similar models in other large-scale studies, which had smaller sample sizes. Overall, the GBT models performed better than the CT models. The mutation-rank analysis detected no new SNPs linked to drug resistance. Discordance between DST and genotypically inferred resistance may be explained by DST errors, novel rare mutations, hetero-resistance, and non-genomic drivers such as efflux-pump upregulation.
Conclusion: Our work demonstrates the utility of machine learning as a flexible approach to drug resistance prediction that is able to accommodate a much larger number of predictors and to summarise their predictive ability, thus assisting clinical decision making and SNP detection in an era of increasing WGS data generation.
Keywords: Mycobacterium tuberculosis, MDR-TB (Multidrug Resistant-TB), XDR-TB (Extensively drug resistant - TB), Drug Resistance, machine learning
Received: 24 May 2019;
Accepted: 02 Sep 2019.
Copyright: © 2019 Deelder, Christakoudi, Phelan, Diez Benavente, Campino, Mcnerney, Palla and Clark. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Dr. Luigi Palla, London School of Hygiene and Tropical Medicine, University of London, London, United Kingdom, Luigi.Palla@lshtm.ac.uk
Prof. Taane G. Clark, London School of Hygiene and Tropical Medicine, University of London, London, United Kingdom, Taane.Clark@LSHTM.ac.uk