TECHNOLOGY AND CODE article
Front. Bioinform.
Sec. Integrative Bioinformatics
Volume 5 - 2025 | doi: 10.3389/fbinf.2025.1514880
This article is part of the Research Topic: Integrating Machine Learning and AI in Biological Research: Unraveling Complexities and Driving Advancements
EPheClass: Ensemble-based Phenotype Classifier from 16S rRNA gene sequences
Provisionally accepted
- 1Intelligent Technologies Research Centre, University of Santiago de Compostela, Santiago de Compostela, Spain
- 2Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, Galicia, Spain
- 3School of Management and Engineering Vaud, Yverdon-les-Bains, Switzerland
- 4CI4CB—Computational Intelligence for Computational Biology, SIB—Swiss Institute of Bioinformatics, Lausanne, Switzerland
- 5Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical Surgical Specialities, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- 6Departamento de Electrónica e Computación, Escola Técnica Superior de Enxeñaría, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
One area of bioinformatics currently attracting particular interest is the classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples. The microbial dysbiosis underlying these diseases is particularly challenging to classify because the data are high-dimensional, with potentially hundreds or even thousands of predictive features. In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, we downloaded from public repositories raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria; in total, 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although its margin over the other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease from saliva samples, it achieved, with only 13 features, an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. We also used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset, and we evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. These results highlight the importance and generalisability of our classification approach, which is applicable to different phenotypes, study niches, and sample types. The code is available at https://gitlab.citius.usc.es/lara.vazquez/epheclass.
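The dynamic selection idea behind DES-P can be illustrated with a minimal sketch in plain Python. This is not the EPheClass implementation: the function name des_p_predict, the three toy classifiers, and the small validation set are all assumptions made for illustration. DES-P is commonly described as keeping, for each query sample, only the base classifiers whose accuracy on the query's nearest validation neighbours exceeds that of a random classifier (1 divided by the number of classes), and then combining the kept classifiers by majority vote. The fallback to the full pool when no classifier qualifies is likewise an assumption of this sketch.

```python
from collections import Counter
from math import dist


def des_p_predict(query, classifiers, dsel_X, dsel_y, k=3, n_classes=2):
    """Predict a label for `query` with a DES-P-style selection rule (illustrative)."""
    # Region of competence: indices of the k validation samples closest to the query.
    neighbours = sorted(range(len(dsel_X)), key=lambda i: dist(query, dsel_X[i]))[:k]
    # Expected accuracy of a random classifier (0.5 in the binary case).
    random_level = 1.0 / n_classes

    # Keep only classifiers that beat the random level in the region of competence.
    selected = []
    for clf in classifiers:
        hits = sum(clf(dsel_X[i]) == dsel_y[i] for i in neighbours)
        if hits / len(neighbours) > random_level:
            selected.append(clf)

    # Assumption of this sketch: if nothing qualifies, fall back to the whole pool.
    if not selected:
        selected = list(classifiers)

    # Majority vote among the selected classifiers.
    votes = Counter(clf(query) for clf in selected)
    return votes.most_common(1)[0][0]


# Hypothetical base classifiers over two made-up features.
def clf_a(x):
    return int(x[0] > 0.5)   # thresholds the first feature


def clf_b(x):
    return int(x[1] > 0.5)   # thresholds the second feature


def clf_c(x):
    return 0                 # always predicts class 0


CLASSIFIERS = [clf_a, clf_b, clf_c]

# Hypothetical validation set: the label is 1 when the first feature is high.
DSEL_X = [(0.8, 0.2), (0.9, 0.3), (0.7, 0.1), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
DSEL_Y = [1, 1, 1, 0, 0, 0]


def plain_majority(query, classifiers):
    """Static majority vote over the whole pool, for comparison."""
    return Counter(clf(query) for clf in classifiers).most_common(1)[0][0]


if __name__ == "__main__":
    q = (0.85, 0.2)
    print("DES-P style:", des_p_predict(q, CLASSIFIERS, DSEL_X, DSEL_Y))
    print("Static vote:", plain_majority(q, CLASSIFIERS))
```

In this toy setting, the query (0.85, 0.2) lies in a region where only clf_a is locally accurate, so the dynamic rule relies on it alone, whereas a static vote over the whole pool is outvoted by the two locally incompetent classifiers. This is the kind of per-sample adaptivity that the DES family is designed to provide.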
Keywords: microbiome, phenotype classification, 16S rRNA gene, machine learning, feature selection, ensemble-based classification
Received: 21 Oct 2024; Accepted: 28 Aug 2025.
Copyright: © 2025 Vázquez-González, Peña-Reyes, Regueira-Iglesias, Castro, Tomás Carmona and Carreira. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Lara Vázquez-González, Intelligent Technologies Research Centre, University of Santiago de Compostela, Santiago de Compostela, Spain
Inmaculada Tomás Carmona, Intelligent Technologies Research Centre, University of Santiago de Compostela, Santiago de Compostela, Spain
Maria J Carreira, Intelligent Technologies Research Centre, University of Santiago de Compostela, Santiago de Compostela, Spain
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.