ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Functional and Applied Plant Genomics

Volume 16 - 2025 | doi: 10.3389/fpls.2025.1618174

This article is part of the Research Topic: Machine Learning for Mining Plant Functional Genes

A BERT-based rice enhancer identification model combined with sequence-representation differential entropy interpretation

Provisionally accepted
Yajing Pu1, Xintong Hao1, Zhaoqi Zheng1, Huiyan Ma2, Zhibin Lv1*
  • 1School of Biomedical Engineering, Sichuan University, Chengdu, China
  • 2College of Life Science, Sichuan University, Chengdu, Sichuan Province, China

The final, formatted version of the article will be published soon.

Rice is an important food crop, and research into the regulation of its gene expression is of great significance for molecular breeding and yield improvement. Enhancers are key elements that regulate the spatial and temporal expression of genes, making their precise identification a core challenge in functional genomics. Existing deep learning methods for rice enhancer identification are limited in both how they extract enhancer features and how their models are structured. We therefore propose a novel model architecture, RiceEN-BERT-SVM, which uses a pre-trained nucleotide large language model as the feature extractor and a support vector machine (SVM) to recognize enhancer sequences. In addition, the differential entropy of the feature representations is used to explain the improvement in model performance. The results show that direct application of the pre-trained language model already achieves high accuracy in rice enhancer identification, with cross-validation and independent-test accuracies of 88.05% and 87.52%, respectively, surpassing state-of-the-art models by 1.47–6.87% on the same dataset. Fine-tuning the language model further improved the performance of RiceEN-BERT-SVM by 6.95%, yielding a final accuracy of 93.63%. To explain this improvement, we introduce differential entropy over the sequence feature representations. The calculations show that as the number of fine-tuning iterations increases, the differential entropy distributions of the positive and negative sample features gradually separate from their initial overlapping state, and this separation tracks the gain in model performance. There is, however, an optimal number of fine-tuning iterations: in this study, the best performance is reached at six iterations. Beyond that point, the differential entropy distributions of the positive and negative samples begin to overlap again and model performance declines.
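For readers who want a concrete picture of the pipeline the abstract describes, the following is a minimal sketch: embed DNA sequences with a pre-trained nucleotide BERT and classify the embeddings with an SVM. The checkpoint name, k-mer tokenization, mean pooling, and SVM hyperparameters are all assumptions for illustration; the paper's exact model, preprocessing, and settings are not specified in this abstract.

```python
# Hypothetical sketch of a BERT-embedding + SVM enhancer classifier.
# "zhihan1996/DNA_bert_6" is a stand-in nucleotide BERT; the paper's
# actual checkpoint and preprocessing may differ.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

MODEL_NAME = "zhihan1996/DNA_bert_6"  # assumption, not from the abstract
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def to_kmers(seq: str, k: int = 6) -> str:
    """DNABERT-style input: overlapping k-mers separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

@torch.no_grad()
def embed(seqs):
    """Mean-pool the last hidden layer into one fixed-length vector per sequence."""
    feats = []
    for s in seqs:
        inputs = tokenizer(to_kmers(s), return_tensors="pt",
                           truncation=True, max_length=512)
        hidden = model(**inputs).last_hidden_state  # (1, tokens, dim)
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.vstack(feats)

# Toy placeholder data; replace with the rice enhancer benchmark sequences.
pos = ["ACGT" * 50, "TTGACGTCAA" * 20]   # placeholder enhancer sequences
neg = ["AAAA" * 50, "CGCGCGCGCG" * 20]   # placeholder non-enhancer sequences
X = embed(pos + neg)
y = np.array([1] * len(pos) + [0] * len(neg))

clf = SVC(kernel="rbf", C=1.0)  # kernel and C are assumptions
print(cross_val_score(clf, X, y, cv=2).mean())
```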
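The interpretation step can likewise be sketched. One simple reading of "differential entropy of the feature representations" is to estimate one differential entropy value per sequence embedding and compare the resulting positive and negative distributions; treating each embedding's coordinates as a 1-D sample is a simplifying assumption, and the paper's exact estimator and aggregation are not given in this abstract. The sketch reuses X and y from the pipeline above.

```python
# Hypothetical sketch: per-sample differential entropy of embeddings,
# estimated with SciPy's sample-based estimator (Vasicek-type by default).
import numpy as np
from scipy.stats import differential_entropy

def per_sample_entropy(X: np.ndarray) -> np.ndarray:
    """One differential entropy estimate per embedding row."""
    return np.array([differential_entropy(row) for row in X])

h_pos = per_sample_entropy(X[y == 1])
h_neg = per_sample_entropy(X[y == 0])

# Well-separated entropy distributions should track higher SVM accuracy;
# re-overlap after too many fine-tuning iterations would match the
# performance decline the abstract reports.
print("positive entropy: %.3f +/- %.3f" % (h_pos.mean(), h_pos.std()))
print("negative entropy: %.3f +/- %.3f" % (h_neg.mean(), h_neg.std()))
```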

Keywords: rice enhancer, large language model, positive and negative sample distribution, support vector machine, visual explanation

Received: 25 Apr 2025; Accepted: 12 May 2025.

Copyright: © 2025 Pu, Hao, Zheng, Ma and Lv. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Zhibin Lv, School of Biomedical Engineering, Sichuan University, Chengdu, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.