REVIEW article
Front. Genet.
Sec. Computational Genomics
Volume 16 - 2025 | doi: 10.3389/fgene.2025.1634882
Gene-LLMs: A Comprehensive Survey of Transformer-Based Genomic Language Models for Regulatory and Clinical Genomics
Provisionally accepted- VIT University, Vellore, India
The convergence of natural language processing (NLP) and genomics has given rise to a new class of transformer-based models, Genome Large Language Models (Gene-LLMs), capable of interpreting the language of life at unprecedented scale and resolution. They represent a major shift in bioinformatics: using raw nucleotide sequences, gene expression data, and multi-omic annotations, together with self-supervised pretraining, they decipher complex regulatory grammars hidden in the genome. This survey covers the entire lifecycle of Gene-LLMs, from raw data ingestion and k-mer or gene-level tokenization to pretext learning tasks such as masked nucleotide prediction and sequence alignment. We describe their wide-ranging use in key downstream tasks, including enhancer and promoter identification, chromatin state modeling, RNA-protein interaction prediction, and synthetic sequence generation. We further explore the impact of Gene-LLMs on functional genomics, clinical diagnostics, and evolutionary inference by analysing several recent benchmarks: CAGI5, GenBench, NT-Bench, and BEACON. We also point to new encoder-decoder modifications and the inclusion of biologically specific position embeddings, which hint at their interpretability and translational potential. Finally, the paper outlines a route toward federated genomic learning, multimodal sequence modeling, and low-resource adaptation for rare variant discovery, positioning Gene-LLMs as a cornerstone technology for the responsible and proactive future of biomedicine.
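As a rough illustration of two lifecycle stages named in the abstract, k-mer tokenization and masked nucleotide prediction, the sketch below splits a sequence into overlapping k-mers and then hides a random subset of tokens to form a pretext-learning example. It is not taken from any surveyed model: the function names `kmer_tokenize` and `mask_tokens`, the `[MASK]` symbol, and the default values of `k`, `stride`, and `mask_rate` are all assumptions chosen for illustration.

```python
import random


def kmer_tokenize(seq, k=3, stride=1):
    """Split a nucleotide sequence into k-mers (hypothetical helper).

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol (hypothetical helper).

    Returns the masked token list and a dict mapping each masked position
    to the original token, which a model would be trained to recover.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels


tokens = kmer_tokenize("ACGTACGGTA", k=3)
masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=1)
print(tokens)
print(masked)
print(labels)
```

One caveat about this simplification: with overlapping k-mers, the neighbours of a masked token still contain most of its bases, so independent per-token masking makes the recovery task easier than it looks. Masking contiguous spans or using non-overlapping tokens (`stride=k`) are common ways around this.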
Keywords: Sequential Genomic Data, Whole genome sequencing (WGS), Encoders in Genomics, Gene-LLMs, Multi-species training, Long-range attention, Nucleotide Transformer
Received: 25 May 2025; Accepted: 11 Aug 2025.
Copyright: © 2025 P, Leema, Leema, Saad and Babu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Anny Leema, VIT University, Vellore, India
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.