ORIGINAL RESEARCH article
Front. Plant Sci.
Sec. Functional and Applied Plant Genomics
Volume 16 - 2025 | doi: 10.3389/fpls.2025.1626539
This article is part of the Research Topic: Machine Learning for Mining Plant Functional Genes
Identification of DNA N6-methyladenine modifications in the rice genome with a fine-tuned large language model
Provisionally accepted
College of Biomedical Engineering, Sichuan University, Chengdu, China
DNA N6-methyladenine (6mA) plays a significant role in various biological processes. In the rice genome, 6mA influences gene expression and is involved in important processes such as growth and development. Identifying 6mA sites in rice is therefore crucial for understanding its complex system of gene expression regulation. Although several useful prediction models have been proposed, their performance still leaves room for improvement. To address this, we propose an architecture named iRice6mA-LMXGB that integrates a fine-tuned large language model to identify 6mA sites in rice. Specifically, our method consists of two main components: (1) a BERT model for feature extraction and (2) an XGBoost module for 6mA classification. We initialize the parameters of the BERT component from a pre-trained DNABERT-2 model. Through transfer learning, we fine-tune the model on the rice 6mA recognition task so that it converts raw DNA sequences into high-dimensional feature vectors. These features are then passed to an XGBoost classifier to generate predictions. Our approach achieves a validation accuracy of 0.9903 under five-fold cross-validation, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9994. Compared with existing predictors trained on the same dataset, our method demonstrates superior performance. To further validate the effectiveness of our fine-tuning strategy, we employ UMAP (Uniform Manifold Approximation and Projection) visualization. This study provides a powerful tool for advancing research in rice 6mA epigenetics.
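The evaluation protocol named in the abstract (five-fold cross-validation scored by accuracy and AUC) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the function names `roc_auc`, `accuracy`, `stratified_folds`, and `cross_validate` are hypothetical, as is the `fit_predict` callback, which merely marks where a feature extractor and classifier (in the paper, fine-tuned DNABERT-2 features followed by XGBoost) would be plugged in. The 0.5 decision threshold and the stratified split are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(1 for y, s in zip(labels, scores)
                  if (s >= threshold) == (y == 1))
    return correct / len(labels)


def stratified_folds(labels: Sequence[int], k: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds, dealing each class out evenly."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


# Hypothetical callback: (train_X, train_y, val_X) -> scores for val_X.
FitPredict = Callable[[list, List[int], list], List[float]]


def cross_validate(features: Sequence, labels: Sequence[int],
                   fit_predict: FitPredict, k: int = 5,
                   seed: int = 0) -> List[Tuple[float, float]]:
    """Return one (accuracy, AUC) pair per held-out fold."""
    results = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        scores = fit_predict([features[i] for i in train],
                             [labels[i] for i in train],
                             [features[i] for i in held_out])
        y_val = [labels[i] for i in held_out]
        results.append((accuracy(y_val, scores), roc_auc(y_val, scores)))
    return results
```

Averaging the per-fold pairs returned by `cross_validate` gives summary figures of the kind reported above. The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but would normally be replaced by a rank-based computation on a large dataset.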
Keywords: Rice genome, N6-methyladenine, Large Language Model, BERT, UMAP visualization
Received: 11 May 2025; Accepted: 04 Jun 2025.
Copyright: © 2025 Zhang, Chen, Xiang and Lv. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Zhibin Lv, College of Biomedical Engineering, Sichuan University, Chengdu, China
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.