AUTHOR=Muralidharan Harihara Subrahmaniam, Fox Noam Y., Pop Mihai TITLE=The impact of transitive annotation on the training of taxonomic classifiers JOURNAL=Frontiers in Microbiology VOLUME=14 YEAR=2024 URL=https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2023.1240957 DOI=10.3389/fmicb.2023.1240957 ISSN=1664-302X ABSTRACT=

Introduction

A common task in the analysis of microbial communities involves assigning taxonomic labels to the sequences derived from organisms found in the communities. Frequently, such labels are assigned using machine learning algorithms that are trained to recognize individual taxonomic groups based on training data sets comprising sequences with known taxonomic labels. Ideally, the training data should rely on labels that are experimentally verified, since formal taxonomic labels require knowledge of physical and biochemical properties of organisms that cannot be directly inferred from sequence alone. However, the labels associated with sequences in biological databases are most commonly computational predictions, which may themselves rely on computationally generated data, a process commonly referred to as “transitive annotation.”

Methods

In this manuscript, we explore the implications of training a machine learning classifier (in our case, the Ribosomal Database Project’s Bayesian classifier) on data that has itself been computationally generated. We generate new training examples based on 16S rRNA data from a metagenomic experiment, and evaluate the extent to which the taxonomic labels predicted by the classifier change after retraining.
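The retraining experiment can be sketched in miniature. The toy below is not the paper’s pipeline: a tiny k-mer naive Bayes stands in for the RDP classifier (which uses 8-mers over full 16S genes), the 10 bp “reads,” the genus names, and the probe sequence are all invented. It only illustrates the loop the Methods describe: classify unlabeled reads, fold the predicted labels back into the training set, retrain, and see whether earlier predictions change.

```python
from collections import Counter, defaultdict
import math

def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

class KmerNB:
    """Multinomial naive Bayes over k-mer counts with Laplace smoothing;
    a toy stand-in for the RDP Bayesian classifier."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, labeled_seqs):
        """labeled_seqs: iterable of (sequence, taxon_label) pairs."""
        self.counts = defaultdict(Counter)
        for seq, label in labeled_seqs:
            self.counts[label].update(kmers(seq, self.k))
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, seq):
        def log_lik(label):
            c = self.counts[label]
            total = sum(c.values()) + len(self.vocab)  # smoothed denominator
            return sum(math.log((c[w] + 1) / total) for w in kmers(seq, self.k))
        return max(sorted(self.counts), key=log_lik)

# Curated training data: two invented genera with distinct base composition.
curated = [
    ("ATATATATAT", "GenusA"), ("TATATATATA", "GenusA"),
    ("GCGCGCGCGC", "GenusB"), ("CGCGCGCGCG", "GenusB"),
]

# New reads of unknown origin (here, chimera-like mixed-composition reads).
new_reads = ["GCGCATATAT", "GCGCATATAT", "GCGCATATAT"]

clf = KmerNB().fit(curated)
before = clf.predict("GCGCATA")  # probe read under the curated model

# Transitive annotation: the classifier's own predictions become labels,
# and the newly "annotated" reads are folded back into the training set.
transitive = [(r, clf.predict(r)) for r in new_reads]
retrained = KmerNB().fit(curated + transitive)
after = retrained.predict("GCGCATA")  # same probe under the retrained model

print(before, "->", after)  # GenusB -> GenusA
```

In this toy run, only three transitively annotated reads shift GenusA’s k-mer profile enough to flip the probe read’s label from GenusB to GenusA, which is the kind of retraining-induced change the experiment measures.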

Results

We demonstrate that even a few computationally generated training data points can significantly skew the output of the classifier, to the point where entire regions of the taxonomic space are disturbed.

Discussion and conclusions

We conclude with a discussion of key factors that affect the resilience of classifiers to transitively annotated training data, and propose best practices for avoiding the artifacts described in our paper.