ORIGINAL RESEARCH article

Front. Genet.

Sec. Computational Genomics

Volume 16 - 2025 | doi: 10.3389/fgene.2025.1641162

This article is part of the Research Topic: Insights in Computational Genomics: 2025

HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction

Provisionally accepted
Fuyu Li, Wenxiang Lu, Yunfei Bai*
  • Southeast University, Nanjing, China

The final, formatted version of the article will be published soon.

Introduction: Extrachromosomal circular DNA (eccDNA) is a class of circular DNA molecules derived from chromosomes, with diverse roles in disease. Long eccDNAs (typically 1-5 kb) pose detection challenges due to their large size, which hinders functional studies. We propose HyenaCircle, a novel deep learning model that leverages a large language model and third-generation sequencing data to predict long eccDNA formation.

Methods: Full-length eccDNAs of 1-5 kb were identified from Nanopore sequencing data with the FLED algorithm, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a purpose-designed classifier head. Data augmentation, regularization, and class-imbalance weighting were applied to increase model robustness.

Results: HyenaCircle achieved comparable performance, with a validation AUROC of 0.715 and a recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed a batch size of 16 and a learning rate of 5×10⁻⁵ as optimal. Ablation studies showed that flanking sequences are important: removing them reduced model stability. The model also showed superior stability over the baseline HyenaDNA architecture.

HyenaCircle integrates third-generation sequencing data with a large language model for long eccDNA prediction and outperformed the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and provides new insight into eccDNA prediction and identification.
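To make two of the preprocessing ideas in the Methods concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how an eccDNA interval might be extended by 100-bp flanks and how class weights could be derived for imbalanced labels. The abstract specifies only the 100-bp flank length; the 0-based half-open coordinate convention, the clamping at chromosome ends, the inverse-frequency weighting formula, and the function names `extend_with_flanks` and `inverse_frequency_weights` are all assumptions made for illustration.

```python
from collections import Counter

FLANK_BP = 100  # flank length stated in the abstract


def extend_with_flanks(start, end, chrom_length, flank=FLANK_BP):
    """Extend a 0-based, half-open interval [start, end) by `flank` bp per side.

    Assumption: the extended interval is clamped to the chromosome bounds.
    The abstract does not say how intervals near chromosome ends are handled.
    """
    if not 0 <= start < end <= chrom_length:
        raise ValueError("expected 0 <= start < end <= chrom_length")
    return max(0, start - flank), min(chrom_length, end + flank)


def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Assumption: this common 'balanced' heuristic stands in for the
    class-imbalance weighting mentioned in the abstract, whose exact
    form is not given there.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    # A 2-kb eccDNA on a hypothetical 10-kb contig.
    print(extend_with_flanks(1000, 3000, chrom_length=10_000))
    # One positive for every three negatives (toy labels).
    print(inverse_frequency_weights([1, 0, 0, 0]))
```

Weights of this kind would typically be passed to a weighted loss so that errors on the rarer class count more during training; how HyenaCircle applies its weighting is not described in the abstract.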

Keywords: Extrachromosomal circular DNA, eccDNA, Large Language Model, deep learning, Long eccDNA prediction, Third-generation sequencing

Received: 04 Jun 2025; Accepted: 19 Jun 2025.

Copyright: © 2025 Li, Lu and Bai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Yunfei Bai, Southeast University, Nanjing, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.