Original Research ARTICLE
A pipeline for classifying deleterious coding mutations in agricultural plants
- 1Saint Petersburg State Polytechnic University, Russia
- 2University of Southern California, United States
The impact of deleterious variation on plant fitness and crop productivity is not completely understood and remains a topic of active debate. Deleterious mutations in plants have so far been predicted solely with sequence conservation methods rather than function-based classifiers, owing to the lack of well-annotated mutational datasets in these organisms. Here, we developed a machine learning classifier based on a dataset of deleterious and neutral mutations in Arabidopsis thaliana by extracting 18 informative features that discriminate deleterious from neutral mutations, including 9 novel features not used in previous studies. We examined linear SVM, Gaussian SVM, and Random Forest classifiers, of which Random Forest performed best. The Random Forest classifier exhibited markedly higher accuracy than the popular PolyPhen-2 tool on the Arabidopsis dataset. Additionally, we tested whether the Random Forest, trained on the Arabidopsis dataset, accurately predicts deleterious mutations in Oryza sativa and Pisum sativum, and observed satisfactory accuracy (87% and 93%, respectively), higher than that obtained with PolyPhen-2. Applying transfer learning to the classifiers did not improve their performance. To further test the performance of the Random Forest classifier across different angiosperm species, we applied it to annotate deleterious mutations in Cicer arietinum and validated the predictions using population frequency data. Overall, we devised a classifier with the potential to improve the annotation of putative functional mutations in QTL and GWAS hit regions, as well as the evolutionary analysis of the proliferation of deleterious mutations during plant domestication, thereby optimizing breeding improvement and the development of new cultivars.
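The classifier comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn is available, and it replaces the Arabidopsis mutation dataset with synthetic data of the same width (18 features, binary deleterious/neutral label). The sample size, hyperparameters, and 5-fold cross-validation are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a labeled mutation table (assumption, not real data):
# 18 features per mutation, label 1 = deleterious, 0 = neutral.
X, y = make_classification(
    n_samples=600,
    n_features=18,
    n_informative=9,
    n_redundant=3,
    n_clusters_per_class=1,
    random_state=0,
)

# The three classifier families named in the abstract; settings are illustrative.
models = {
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Gaussian SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in scores.items():
    print(f"{name}: mean 5-fold accuracy = {acc:.3f}")
```

On synthetic data the ranking of the three models carries no biological meaning. The sketch only shows how such a comparison could be set up before substituting a real feature table.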
Keywords: Deleterious mutation, Random forest (bagging) and machine learning, Oryza, Pisum, Cicer
Received: 18 Sep 2018;
Accepted: 08 Nov 2018.
Edited by: Yuriy L. Orlov, Institute of Cytology and Genetics, Russian Academy of Sciences, Russia
Reviewed by: Konstantin V. Gunbin, Institute of Cytology and Genetics, Russian Academy of Sciences, Russia
Vasily Ramensky, Moscow Institute of Physics and Technology, Russia
Copyright: © 2018 Kovalev, Igolkina, Samsonova and Nuzhdin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Mrs. Anna A. Igolkina, Saint Petersburg State Polytechnic University, Saint Petersburg, 195251, Saint Petersburg, Russia, firstname.lastname@example.org