ORIGINAL RESEARCH article
Front. Plant Sci.
Sec. Functional and Applied Plant Genomics
Volume 16 - 2025 | doi: 10.3389/fpls.2025.1618174
This article is part of the Research Topic: Machine Learning for Mining Plant Functional Genes
A BERT-based rice enhancer identification model combined with sequence-representation differential entropy interpretation
Provisionally accepted
1 School of Biomedical Engineering, Sichuan University, Chengdu, China
2 College of Life Science, Sichuan University, Chengdu, Sichuan Province, China
Rice is an important food crop, and research into its gene expression regulation is of great significance to molecular breeding and yield improvement. Enhancers are key elements that regulate the spatial and temporal expression of genes, making their precise identification a core challenge in functional genomics. Existing deep learning methods for rice enhancer identification have limitations in both feature extraction and model architecture. We therefore propose a novel model architecture, RiceEN-BERT-SVM, which uses a pre-trained nucleotide large language model as the feature extraction tool and an SVM for enhancer sequence recognition. In addition, the differential entropy of the feature representations is used to explain the improvement in model performance. The results indicate that direct application of the pre-trained language model achieves high accuracy in rice enhancer identification, with cross-validation and independent testing accuracies of 88.05% and 87.52%, respectively. These results surpass state-of-the-art models by 1.47–6.87% on the same dataset. Fine-tuning the language model further improved the performance of RiceEN-BERT-SVM by 6.95%, yielding a final accuracy of 93.63%. We introduce differential entropy over sequence feature representations to explain this improvement: the calculations show that, as the number of fine-tuning iterations increases, the differential entropy distributions of the positive and negative sample features gradually separate from their initially overlapping state, and this separation tracks the gain in model performance. However, there is an optimal amount of fine-tuning; in this study, the best performance is achieved at 6 fine-tuning iterations. Beyond this point, the differential entropy distributions of positive and negative samples begin to overlap again, and model performance declines.
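The abstract's interpretive tool, differential entropy of feature representations, can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes per-sequence embeddings (e.g., 768-dimensional [CLS] vectors from a BERT-style nucleotide model) are already available as NumPy arrays, stands them in with random data, and computes the differential entropy of each class under a per-dimension Gaussian assumption, h = ½ ln(2πeσ²) summed over dimensions. Under this reading, separation between the positive- and negative-class entropy values mirrors the distributional separation described in the abstract.

```python
import numpy as np


def gaussian_differential_entropy(features):
    """Differential entropy of a set of feature vectors under an
    independent per-dimension Gaussian assumption:
    h = sum_d 0.5 * ln(2 * pi * e * var_d)."""
    var = features.var(axis=0, ddof=1)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))


rng = np.random.default_rng(0)

# Hypothetical stand-ins for language-model embeddings of enhancer
# (positive) and non-enhancer (negative) sequences, 768-dim as in
# BERT-base. Real embeddings would come from the pre-trained model.
pos = rng.normal(loc=0.5, scale=1.0, size=(200, 768))
neg = rng.normal(loc=-0.5, scale=2.0, size=(200, 768))

h_pos = gaussian_differential_entropy(pos)
h_neg = gaussian_differential_entropy(neg)

# Classes whose embedding distributions differ (here, in variance)
# yield different differential entropies; in the paper's account,
# growing separation of the two distributions during fine-tuning
# accompanies improved SVM classification performance.
print(f"h(positive) = {h_pos:.1f}, h(negative) = {h_neg:.1f}")
```

In practice the positive/negative arrays would be replaced by the model's actual sequence embeddings, recomputed after each fine-tuning round to trace how the two entropy distributions diverge and then re-converge past the optimum.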
Keywords: rice enhancer, large language model, positive and negative sample distribution, support vector machine, visual explanation
Received: 25 Apr 2025; Accepted: 12 May 2025.
Copyright: © 2025 Pu, Hao, Zheng, Ma and Lv. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Zhibin Lv, School of Biomedical Engineering, Sichuan University, Chengdu, China
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.