Machine learning tools for deciphering the regulatory logic of enhancers in health and disease

Foutadakis, Spyros; Bourika, Vasiliki; Styliara, Ioanna; Koufargyris, Panagiotis; Safarika, Asimina; Karakike, Eleni

doi:10.3389/fgene.2025.1603687

REVIEW article

Front. Genet., 13 August 2025

Sec. Human and Medical Genomics

Volume 16 - 2025 | https://doi.org/10.3389/fgene.2025.1603687

Machine learning tools for deciphering the regulatory logic of enhancers in health and disease

Spyros Foutadakis ¹^*

Vasiliki Bourika ²

Ioanna Styliara ³

Panagiotis Koufargyris ¹

Asimina Safarika ¹

Eleni Karakike ¹^*

1. 4th Department of Internal Medicine, Medical School, National and Kapodistrian University of Athens, Athens, Greece
2. Neonatal Unit, First Department of Pediatrics, National and Kapodistrian University of Athens, Athens, Greece
3. Department of Obstetrics and Gynaecology, School of Medicine, University of Patras, Patras, Greece

Article metrics

View details

Citations

2,9k

Views

545

Downloads

Abstract

Transcriptional enhancers are DNA regulatory elements that control the levels and spatiotemporal patterns of gene expression during development, homeostasis, and pathophysiological processes. Enhancer identification and characterization at the genome-wide scale rely on their structural characteristics, such as chromatin accessibility, binding of transcription factors and cofactors, activating histone modifications, 3D interactions with other regulatory elements, as well as functional characteristics measured by massively parallel reporter assays and sequence conservation approaches. Recently, machine learning approaches and particularly deep learning models (Enformer, BPNet, DeepSTARR, etc.) allow the prediction of enhancers, the impact of variants on their activity and the inference of transcription factor binding sites, leading, among others, to the construction of the first completely synthetic enhancers. We present the above computational tools and discuss their diverse applications towards cracking the enhancer regulatory code, which could have far-reaching ramifications for uncovering essential regulatory mechanisms and diagnosing and treating diseases. With an emphasis on sepsis, a leading cause of morbidity and mortality in hospitalized patients, we discuss computational approaches to identify sepsis-associated endotypes, circuits, and immune cell states and signatures characteristic of this condition, which could aid in developing novel therapies.

1 Introduction

Transcriptional enhancers are DNA regulatory elements that control the levels and spatiotemporal patterns of gene expression during development, homeostasis, and pathophysiological processes (Banerji et al., 1981; Agelopoulos et al., 2021; Sur and Taipale, 2016; Rickels and Shilatifard, 2018; Furlong and Levine, 2018). Enhancers regulate transcription irrespective of orientation to the Transcription Start Site (TSS) (Banerji et al., 1981), can act over large genomic distances exceeding 1 Mbp (Lettice et al., 2003; Long et al., 2020) and can skip their closest gene in the linear DNA sequence to regulate distal genes (Kessler et al., 2023; Chen et al., 2024). There are two main models regarding the architectural organization of enhancers: the enhanceosome and the billboard model (Arnosti and Kulkarni, 2005; Jindal and Farley, 2021). The enhanceosome operates with a high degree of cooperativity between enhancer-bound proteins, with the exact organization of transcription factor (tf) binding sites being crucial for enhancer output. The archetype enhanceosome is that of the interferon beta, induced following viral infections (Thanos and Maniatis, 1995). Nevertheless, enhanceosomes are rather rare and have been described for a limited number of cases involving mainly cytokine genes (Giese et al., 1995; Jindal and Farley, 2021). On the other hand, the billboard model posits that the binding sites are flexibly arranged and the bound proteins act as an ensemble interacting independently with their targets (Kulkarni and Arnosti, 2003; Jindal and Farley, 2021).

In the following sections, we present the main wet-lab methodologies to identify enhancers at the genome-wide scale (Catarino and Stark, 2018) and characterize their regulatory grammar (Jindal and Farley, 2021) and 3D interaction networks (Uyehara and Apostolou, 2023). We also describe booming machine learning algorithms to aid the above tasks towards a better understanding of pathophysiological conditions.

2 Enhancer identification at the genome-wide scale

Enhancer identification and characterization at the genome-wide scale rely on their structural characteristics, such as chromatin accessibility (Buenrostro et al., 2013; Marinov et al., 2023; John et al., 2013; Vierstra et al., 2020), binding of transcription factors and cofactors (Robertson et al., 2007; Lambert et al., 2018), activating histone modifications (Heintzman et al., 2007; Calo and Wysocka, 2013; Barral and Déjardin, 2023) and DNA hypomethylation (Angeloni and Bogdanovic, 2019) (Figure 1). More specifically, chromatin accessibility denotes the regulatory potential of a DNA sequence and is measured by popular assays such as DNaseI-seq (John et al., 2013; Vierstra et al., 2020) and ATAC-seq (Buenrostro et al., 2013; Marinov et al., 2023). The combinatorial presence of histone modifications on chromatin constitutes a regulatory code (Kouzarides, 2007), with the genomic distribution of individual modifications usually measured with the ChIP-seq technology (Heintzman et al., 2007; Fadri et al., 2024). Histone modifications associated with active enhancers include H3K27ac and H3K4me1, while H3K27me3 and H3K9me3 are repressive marks (Heintzman et al., 2007; Calo and Wysocka, 2013; Barral and Déjardin, 2023). Most importantly, the binding of transcription factors and coactivators such as Med1 and CBP/P300 and other members of the transcription apparatus are also hallmarks of active enhancers (Catarino and Stark, 2018). High levels of the above enhancer markers are found in long stretches of enhancer DNA termed super-enhancers or stretch-enhancers that play crucial roles in the control of cell identity and disease-related genes (Hnisz et al., 2013; Whyte et al., 2013; Parker et al., 2013). The above epigenomics methodologies are frequently paired with whole transcriptome measurements using RNA-seq (Mortazavi et al., 2008; Deshpande et al., 2023) and the integration of the complementary multiomics methodologies provides a more holistic picture of the regulatory mechanisms that govern transcriptional programs.

FIGURE 1

Diagram illustrating genomic analysis methods. Top left shows chromatin accessibility assays with ChIP-seq and motifs like Enformer and BPNet. Top right displays Hi-C assays with tools like Akita and DeepC. Bottom left features STARR-seq assays emphasizing putative enhancers and DeepSTARR. Bottom right depicts DNA methylation analysis with DeepCpG and MethylNet. — Depicted in black are the main (epi)genomics methodologies for studying enhancers, such as chromatin accessibility assays (ATAC-seq, DNaseI-seq) and ChIP-seq to identify open chromatin regions and the presence of histone modifications and tf and coactivator binding, respectively. Moreover, assays such as Hi-C are used to study the 3D organization of the genome, while massively parallel reporter assays like STARR-seq are used to examine the *bona fide* enhancer potential of putative regulatory regions. Depicted in blue are selected state-of-the-art deep learning algorithms trained with datasets produced through the above methodologies and used to identify enhancers, predict crucial motifs and assess the effect of variants on their function.

Until recently the above (epi)genomics methodologies were only applied in bulk cell populations, thus providing only an ensemble of averaged signals, although it is well established that there is pervasive heterogeneity and stochasticity across systems, meaning that each individual cell inevitably has a unique profile (Eling et al., 2019). In the last decade, technological advances have ushered in the era of single-cell genomics with technologies such as single-cell RNA-seq (Tang et al., 2009; Klein et al., 2015), single-cell ATAC-seq (Buenrostro et al., 2015), simultaneous measurement of different modalities (Baysoy et al., 2023) and even spatial methodologies (Moses and Pachter, 2022; Deng et al., 2022), allowing the study of individual cells (Heumos et al., 2023).

All the above structural characteristics of enhancers are just convenient proxies for enhancer identification and do not prove that a certain DNA sequence holds true enhancer potential. The classic method to measure enhancer activity is the reporter gene assay, with the genomics-based derivatives STARR-seq (Arnold et al., 2013) and MPRA (Melnikov et al., 2012) allowing the massively parallel measurement of the enhancer potential for thousands or even millions of DNA sequences (Inoue and Ahituv, 2015). A limitation of the above methodologies is their episomal nature, measuring the activity of putative enhancers outside their native chromatin environment. The random genomic integration of the tested regions with the variant MPRA technology termed lenti-MPRA (Inoue et al., 2017) alleviates this problem to a certain extent, but nevertheless it does not allow the measuring of enhancer activity at the endogenous locus. An approach to measure enhancer activity at the native chromatin environment is the detection of enhancer RNAs (eRNAs), long non-coding RNAs that are produced from transcription of enhancer sequences, usually in a bidirectional fashion (Kim et al., 2010; Sartorelli and Lauberth, 2020). Nevertheless, eRNAs are of an unstable nature, necessitating the application of nascent RNA sequencing approaches such as GRO-seq (Core et al., 2008) and PRO-seq (Mahat et al., 2016) for their detection. The ultimate test to prove the activity of an enhancer and identify its target gene(s) is the disruption or mutation of its sequence with simultaneous measurements of its effect on the expression levels and chromatin environment of nearby genes and regulatory elements, respectively. This can be achieved with CRISPR-based methodologies such as CRISPR-interference (Fulco et al., 2016) that can silence an enhancer, CRISPR-mediated saturation mutagenesis (Canver et al., 2015) and even CRISPR-mediated activation of enhancers (Heidersbach et al., 2023). The above experimental methodologies to detect enhancers are complemented by DNA sequence conservation approaches to examine enhancer evolution across species (Siepel et al., 2005; Villar et al., 2015).

Large national and international consortia like the ENCODE (ENCODE Project Consortium, 2012), the Roadmap Epigenomics (Roadmap Epigenomics Consortium, 2015), the International Human Epigenome Consortium (Stunnenberg et al., 2016) and the Genotype-Tissue Expression Project (GTEx Consortium, 2013), as well as individual labs, have produced thousands of datasets that are deposited at public repositories such as the Gene Expression Omnibus (Clough et al., 2024) and the European Nucleotide Archive (Yuan et al., 2024) and are freely available for re-analysis and as training material for machine learning applications discussed in a following section.

3 Three-dimensional genome organization and enhancer communication

Vertebrate genomes, apart from the linear DNA sequence, are organized in the three-dimensional space of the nucleus with important ramifications for transcription regulation, DNA replication, and genome integrity (Misteli, 2020). The main approaches to studying 3D genome organization include Hi-C-based methodologies (Rao et al., 2014) or super-resolution microscopy (Boettiger and Murphy, 2020). The higher organizational unit of the genome is the chromosome compartment, with compartment A containing mainly active DNA regions, while compartment B hosts inactive regions (Rao et al., 2014). Although older studies produced rather sparse Hi-C datasets that examined compartments at the Mbp scale, recent efforts using ultra-deep Hi-C data found that A and B compartments alternate at the kilobase scale level (Harris et al., 2023). Another chromatin organizational unit is the topologically associating domains (TADs) that usually span between a few hundred Kbp to a few Mbp (Dixon et al., 2012). TADs set the stage for the finer organizational unit, the loops, which are chromatin interactions between regulatory elements or structural loops that connect CTCF-bound sites (Rao et al., 2014; Furlong and Levine, 2018). Enhancers and promoters within a TAD have a higher probability of interacting than if these regulatory elements were on different TADs. Nevertheless, there are examples of interactions spanning TAD boundaries, especially of developmentally important genes (Hung et al., 2024).

There are two prevailing models regarding the mechanism by which an enhancer interacts with the promoter of the gene it regulates: the classic looping model and the emerging phase separation model (Popay and Dixon, 2022). It is generally accepted that spatial proximity between an enhancer and a promoter is required, but the minimum distance of this interaction and how it relates to activity is highly debated. In support of the looping model and more specifically its instructive nature, targeted tethering of a looping factor allowed the de novo formation of a chromatin loop and activation of transcription (Deng et al., 2012). On the other hand, it has been observed mainly through super-resolution microscopy experiments that the timing of enhancer-promoter proximity is not always correlated with activation of gene expression (Alexander et al., 2019; Benabdallah et al., 2019). The latter findings, together with evidence that depletion of CTCF or cohesin, the main organizers of loop formation, does not lead to global transcriptional changes (Rao et al., 2017; Hsieh et al., 2022) have challenged the universality of the looping model. An alternative model, compatible with the above findings, is the phase separation model (Hnisz et al., 2017), according to which weak multivalent interactions between disordered regions in the activation domains of transcription factors and coactivators lead to condensates that separate from their surroundings, causing high local concentrations of activating factors, especially at super-enhancers. Containment of an enhancer and a promoter within the same condensate would allow gene activation without physical interaction through loop formation. Alternatively, the condensate could transiently interact with the regulatory elements and the gene locus to control gene bursting (Du et al., 2024). Clearly, more experiments are required to establish in a broader fashion the regulatory scenarios in which each of these two models for enhancer regulation applies.

4 Using machine learning to decipher the regulatory logic of enhancers

Machine learning-based approaches to discover patterns in genomics data are broadly categorized into non-neural network algorithms such as support vector machines (SVMs) and random forests (RFs) as well as the increasingly popular neural networks/deep learning approaches (Eraslan et al., 2019; Smith et al., 2023; Li et al., 2023). Neural networks utilize many layers of interconnected neurons to decipher hard-to-recognize patterns, hence the term “deep learning”. There are supervised neural network approaches comprised of the most popular convolutional neural networks (CNNs), the fully connected, the recurrent and the graph convolutional, as well as unsupervised methodologies such as autoencoders and generative adversarial networks, with the latter approaches mainly applied in single-cell genomics. In supervised learning, a model is obtained that takes features as input and delivers a prediction for the target variable, which is the desired output used to train the model. A machine learning algorithm is trained using a set of features, usually genomics data such as ChIP-seq or chromatin accessibility datasets that are split into three sets: the training set used for optimizing the parameters of the model, the validation set for evaluating the performance of the model and the test set for the assessment of the best model (Eraslan et al., 2019). The parameters of the network are randomly initiated and refined in an iterative fashion using batches of the training dataset for memory efficiency and prevention of overfitting, and the process is usually parallelized using graphical processing units (GPUs). The analyst can fine-tune different hyperparameters, such as the number of layers and the batch size, using the validation set before the final evaluation of the model using the test set. On the contrary, in unsupervised learning, unlabeled data are characterized by utilizing useful properties of the data. Unsupervised methodologies include autoencoders, which embed the data into a low-dimensional space and force the network to extract the most useful features. While autoencoders have found applications using bulk sequencing data such as extraction of gene signatures from RNA-seq data (Tan et al., 2017), they are also suited for single-cell applications such as improving clustering and denoising data (Wang et al., 2021). The generative adversarial networks offer a different approach consisting of two neural networks, a discriminator and a generator trained in parallel, with the generator creating realistic data and the discriminator classifying if a sample is real or created by the generator.

While simple linear “shallow learning” models like logistic regression can take care of standard tabular data, genomic sequences pose several challenges, such as local dependencies. For example, in classifying regions as bound or unbound by a transcription factor, representing enhancers as position-weighted matrices (PWMs) or k-mers with support vector machine approaches such as gkmSVM (Ghandi et al., 2016) may miss patterns where binding depends on cooperative interaction between multiple motifs with fixed spacing. In contrast, convolutional neural networks are perfectly suited to discern such complex dependencies (Eraslan et al., 2019; Smith et al., 2023; Li et al., 2023; Toneyan et al., 2022). They are composed of multiple layers, each scanning the sequence with several filters- PWMs and quantifying similarities between the filter and the sequence, followed by a non-linear activation function and pooling. Each subsequent layer composes the output of the previous layer that can be fed to a fully connected neural network that receives all information to perform the final prediction task. Applications of deep learning tools in genomics include prediction of regulatory elements such as enhancers, tf binding and assessing the effect of DNA variants. The main models for the above tasks and their salient features are presented in Supplementary Table S1. Early seminal applications of CNNs in genomics include DeepBind (Alipanahi et al., 2015), DeepSEA (Zhou and Troyanskaya, 2015) and Basset (Kelley et al., 2016), which were trained on large-scale chromatin accessibility or transcription factor binding data and were used to prioritize variants by predicting their effect on chromatin accessibility patterns. The DanQ algorithm (Quang and Xie, 2016) uses a hybrid architecture of CNNs and bidirectional long short-term memory recurrent networks that deal effectively with long-range dependencies. Another algorithm, the improved successor of Basset, Basenji (Kelley et al., 2018), uses dilated convolution to increase the receptive field, thus accommodating inputs of 131 Kbp again to effectively take into consideration long-range dependencies. Transformers, previously used in natural language processing applications, are particularly capable of handling pairwise interdependencies in DNA sequence data. The Enformer algorithm (Avsec et al., 2021) was the first to employ a combination of CNN and Transformer architecture and achieved higher accuracy compared to previously designed tools. A recent approach, BPNet, trained with high-resolution ChIP-exo data, is a state-of-the-art algorithm and its architecture contains a 10-layer CNN, with 64 filters per layer with dilated convolution, thus achieving a receptive field of 1,034 bp for any position in the genome (Avsec et al., 2021). In general, models that use a large input sequence (receptive field), such as Basenji and Enformer, can better predict long-range enhancers. Together with BPNet, these three models are the state-of-the-art in supervised learning for tasks related to deciphering the enhancer regulatory code (Toneyan et al., 2022). Finally, it is interesting to mention that the constant maturation of deep neural network algorithms has led to a recent milestone in enhancer biology, the construction of synthetic enhancers with the aid of deep neural network models such as DeepSTARR (de Almeida et al., 2022; de Almeida et al., 2024; Taskiran et al., 2024).

There are also deep learning tools dedicated to other applications in genomics, such as predicting the 3D architecture from DNA sequence as well as predicting DNA methylation states. Machine learning approaches such as Akita (Fudenberg et al., 2020), DeepC (Schwessinger et al., 2020), Orca (Zhou, 2022), Epiphany (Yang et al., 2023) and C. origami (Tan et al., 2023) have been used to predict the 3D genome organization from the linear sequence and/or 1D epigenomics experiments (Wang et al., 2024; Smaruj et al., 2025) (Supplementary Table S2). One major distinction between approaches is the type of input data, with some tools using only DNA sequence (Akita, DeepC, ORCA), while Epiphany inputs only epigenomics data. Sequence-based-only approaches cannot make accurate de novo predictions in different cell types, while epigenomics-only approaches usually require an array of different datasets to improve predictive power. Combining the above approaches, C. origami (Tan et al., 2023) is expected to achieve better predictive accuracy at the expense of possibly introducing uncertainties in model interpretation. Another discriminating characteristic of 3D predictive models includes the type of output, which is either a 2D contact map (Akita, Orca, C. origami) or predicted pixels directly from the 1D representation (DeepC, Epiphany), with the former possibly offering advantages as it makes use of the local correlation structure of contact map data (Smaruj et al., 2025).

Finally, tools such as DeepCpG (Angermueller et al., 2017) and MethylNet (Levy et al., 2020) based on CNNs have been used to predict DNA methylation states.

4.1 Model sharing and interpretation

Models are usually created using machine learning frameworks such as TensorFlow (Abadi et al., 2016), PyTorch (Adam et al., 2017) and Keras (Chollet, 2015), a user-friendly API that operates on frameworks like PyTorch. Models can be shared through repositories called model zoos available through the above frameworks as well as Kipoi (Avsec et al., 2019), a model zoo dedicated to genomics.

Although deep neural networks are often criticized as being “black boxes”, they can nevertheless be interpreted with various methodologies. For example, in perturbation-based methods, the input is modified and changes in the output are inspected. In the case of DNA sequence-based models, perturbations could involve a single nucleotide substitution (Alipanahi et al., 2015; Zhou and Troyanskaya, 2015). Nevertheless, the above approaches are computationally expensive, in contrast to attribution-based approaches such as saliency maps (Simonyan et al., 2013). The latter attribute a model’s intermediate network value to the input, with the magnitude of the score showing the amount of contribution. Two state-of-the-art approaches in model interpretation for transcription factor motif-fed algorithms include DeepLift (Shrikumar et al., 2016) and TF-MoDISco (Shrikumar et al., 2020). DeepLift decomposes the output of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. In this way, it quantifies the importance of each nucleotide in a sequence for the model prediction. On the other hand, TF-MoDISco is suitable for motif interpretation and discovery and uses all neurons of a network to process sequence importance scores. In doing so, it clusters the most important nucleotides from different sequences into motifs.

5 Enhancer identification and machine learning tools in health and disease

Predicting enhancers and deciphering their regulatory code (Kim and Wysocka, 2023) could have far-reaching ramifications not only for discovering fundamental regulatory mechanisms (Kim et al., 2021) but also for diagnosing and treating various diseases (Karczewski and Snyder, 2018). A salient example of the former includes the application of chromBPnet, a convolutional neural network, in characterizing the regulatory syntax of enhancers that drive the reprogramming of human fibroblasts to pluripotent cells (Nair et al., 2023). ChromBPnet was used to infer putatively bound tf motifs that influence chromatin accessibility, thus offering insights on transcriptional regulators that drive cellular reprogramming. Moreover, genomics-based enhancer identification frequently in conjunction with machine learning tools has been used to subtype, stage and predict drug responses for various types of cancer such as colon, Ewing sarcoma and hematological malignancies (Akhtar-Zaidi et al., 2012; Riggi et al., 2014; Morgan and Shilatifard, 2015; Ntziachristos et al., 2016; Sur and Taipale, 2016; Morrow et al., 2018; Mack et al., 2019; Cejas et al., 2019; Shang et al., 2019; Yao et al., 2021; Zhu et al., 2024) and other conditions, such as heart failure (Spurrell et al., 2022) and Alzheimer’s disease (Berson et al., 2023).

Sepsis is another field of applicable but understudied enhancer discovery, as a condition characterized by important morbidity and mortality (Giamarellos-Bourboulis et al., 2024b), associated with a highly heterogenous but always dysregulated host immune response. Several sepsis subtypes have been identified among patients presenting with immunoparalysis (Cheng et al., 2016), those with macrophage activation-like syndrome (Karakike and Giamarellos-Bourboulis, 2019), as well as the novel category of patients with high interferon gamma and CXCL9 levels (Giamarellos-Bourboulis et al., 2024a). Other studies have applied unsupervised clustering to identify three or four distinct sepsis endotypes (Scicluna et al., 2017; Sweeney et al., 2018). Heterogeneity is addressed in other machine learning approaches, interrogating the transcriptome as conditioned by the pathogen or the predominating immune response (Komorowski et al., 2022); for example, bvnGPS2 has been trained with transcriptomic data to discriminate between bacterial and viral infections (Xie et al., 2024). Other deep learning models (Zhang et al., 2020; Davenport et al., 2016) trained with transcriptomic data from sepsis patients derived two classes of patients, with class 1 characterized by immunosuppression and higher mortality rates.

The large-scale application of epigenomics methodologies in sepsis that are needed to train machine learning algorithms is still at a rudimentary stage. MAGICAL (Chen et al., 2023), a hierarchical Bayesian framework, using single-cell RNA-seq and single-cell ATAC-seq data from peripheral blood mononuclear cells, was able to identify epigenetic circuit biomarkers distinctive of methicillin-susceptible or -resistant Staphylococcus aureus bloodstream infection. Nevertheless, there is an urgent need to produce large-scale epigenomics datasets from patients with sepsis, as epigenomics datasets, at least in other experimental systems (Dinh et al., 2020), have been found to have superior performance compared to transcriptomics data in stratifying patients and predicting responses to therapy.

Moreover, other immune response mechanisms, such as tolerance and resilience, are underrepresented in current subtyping efforts, and corresponding signatures are still missing (Shankar-Hari et al., 2024). Putative enhancers marked by H3K4me1 and H3K27ac that are activated in response to lipopolysaccharide-treated monocytes (Saeed et al., 2014) significantly overlapped sepsis-associated single nucleotide polymorphisms (SNPs) that were differentially expressed among two sepsis classes (Davenport et al., 2016). Moreover, there is evidence that histone modifications can act as a reservoir of epigenetic memory in immune tolerance and trained immunity (Netea et al., 2020). For a comprehensive review of epigenetic markers in sepsis, interested readers are referred to a recent review (Binnie et al., 2020). Finally, although promising immunotherapies have recently become available (Kyriazopoulou et al., 2021; Karakike et al., 2022), there is an urgent need for a deeper, holistic understanding of the immune dysregulation during sepsis and the development of novel biomarkers besides the classic markers already used in clinical practice (Karakike et al., 2024).

6 Discussion

Machine learning algorithms for genomics are evolving at a fast pace and as described above, have found applications in identifying enhancers, predicting their most crucial motifs and regulatory logic, as well as assessing the effect of variants on their function. In the latter application, state-of-the-art algorithms such as Enformer (Avsec et al., 2021) trained with large-scale structural (ChIP-seq, ATAC-seq) and functional epigenomics data such as massively parallel reporter assays (Arnold et al., 2013) and SELEX-like assays (Yan et al., 2021) could aid in assessing variants, for example, in immune conditions, such as sepsis where datasets are more scarce (Stankey and Lee, 2023). Nevertheless, machine learning algorithms are faced with challenges in their application, such as overfitting, meaning they do not generalize well to unseen examples. This could be due to the limited number of validated ground truth enhancers, the scarcity of large-scale training data beyond a small number of well-studied cell lines, and the presence of biological and technical variation in training datasets (Li et al., 2023).

The problem of data insufficiency could be ameliorated by using pre-trained self-supervised models adapted from the field of natural language processing, where models such as BERT (Devlin et al., 2019) and GPT (Brown et al., 2020) have achieved great success. For example, DNA-BERT (Ji et al., 2021) is a pre-trained bidirectional encoder that tokenizes the genome into k-mers, has cross-species transfer learning capabilities, and performs on par with other algorithms in tasks such as variant interpretation and transcription factor binding site prediction using only small amounts of task-specific labeled data. Other recently developed foundation models include Nucleotide Transformer (Dalla-Torre et al., 2025) and Alpha Genome (Avsec et al., 2025). Nevertheless, a comparison of pre-trained genomic language models with supervised tools such as Enformer did not find a clear advantage for self-supervised models in a variety of genomics tasks (Tang et al., 2025). Another challenge in clinical settings such as sepsis is that the presence of baseline datasets before overt disease onset (Garcia Lopez et al., 2024) is very rare. Perhaps in this scenario, machine learning algorithms could be used to impute missing datasets (Schreiber et al., 2023).

With more sophisticated and efficient algorithms trained with well-validated ground truth datasets, the stage is set for machine learning applications to delve into biological mechanisms and provide groundbreaking insights for the betterment of human health.

Statements

Author contributions

SF: Writing – original draft, Writing – review and editing. VB: Writing – review and editing. IS: Writing – review and editing. PK: Writing – original draft. AS: Writing – original draft. EK: Writing – review and editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2025.1603687/full#supplementary-material

References

1
AbadiM.AgarwalA.BarhamP.BrevdoE.ChenZ.CitroC.et al (2016). Tensorflow: large-scale machine learning on heterogeneous distributed systems. Available online at: https://arxiv.org/abs/1603.04467.
- Google Scholar
2
AdamP.GrossS.ChintalaS.ChananG.YangE.DeVitoZ.et al (2017). “Automatic differentiation in PyTorch,” in Presented at 31st conference on neural information processing systems (NIPS 2017).
- Google Scholar
3
AgelopoulosM.FoutadakisS.ThanosD. (2021). The causes and consequences of spatial organization of the genome in regulation of gene expression. Front. Immunol.12, 682397. 10.3389/fimmu.2021.682397
- CrossRef
- Google Scholar
4
Akhtar-ZaidiB.Cowper-Sal-lariR.CorradinO.SaiakhovaA.BartelsC. F.BalasubramanianD.et al (2012). Epigenomic enhancer profiling defines a signature of colon cancer. Science336 (6082), 736–739. 10.1126/science.1217277
- CrossRef
- Google Scholar
5
AlexanderJ. M.GuanJ.LiB.MaliskovaL.SongM.ShenY.et al (2019). Live-cell imaging reveals enhancer-dependent Sox2 transcription in the absence of enhancer proximity. Elife8, e41769. 10.7554/eLife.41769
- CrossRef
- Google Scholar
6
AlipanahiB.DelongA.WeirauchM. T.FreyB. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol.33 (8), 831–838. 10.1038/nbt.3300
- CrossRef
- Google Scholar
7
AngeloniA.BogdanovicO. (2019). Enhancer DNA methylation: implications for gene regulation. Essays Biochem.63 (6), 707–715. 10.1042/EBC20190030
- CrossRef
- Google Scholar
8
ArnoldC. D.GerlachD.StelzerC.BoryńŁ. M.RathM.StarkA. (2013). Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science339 (6123), 1074–1077. 10.1126/science.1232542
- CrossRef
- Google Scholar
9
ArnostiD. N.KulkarniM. M. (2005). Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards?J. Cell Biochem.94 (5), 890–898. 10.1002/jcb.20352
- CrossRef
- Google Scholar
10
AvsecŽ.KreuzhuberR.IsraeliJ.XuN.ChengJ.ShrikumarA.et al (2019). The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol.37 (6), 592–600. 10.1038/s41587-019-0140-0
- CrossRef
- Google Scholar
11
AvsecŽ.AgarwalV.VisentinD.LedsamJ. R.Grabska-BarwinskaA.TaylorK. R.et al (2021). Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods18 (10), 1196–1203. 10.1038/s41592-021-01252-x
- CrossRef
- Google Scholar
12
AvsecŽ.LatyshevaN.ChengJ.NovatiG.TaylorK. R.WardT.et al (2025). AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model. BioRxiv. 10.1101/2025.06.25.661532
- CrossRef
- Google Scholar
13
BanerjiJ.RusconiS.SchaffnerW. (1981). Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell27 (2 Pt 1), 299–308. 10.1016/0092-8674(81)90413-x
- CrossRef
- Google Scholar
14
BarralA.DéjardinJ. (2023). The chromatin signatures of enhancers and their dynamic regulation. Nucleus14 (1), 2160551. 10.1080/19491034.2022.2160551
- CrossRef
- Google Scholar
15
BaysoyA.BaiZ.SatijaR.FanR. (2023). The technological landscape and applications of single-cell multi-omics. Nat. Rev. Mol. Cell Biol.24 (10), 695–713. 10.1038/s41580-023-00615-w
- CrossRef
- Google Scholar
16
BenabdallahN. S.WilliamsonI.IllingworthR. S.KaneL.BoyleS.SenguptaD.et al (2019). Decreased enhancer-promoter proximity Accompanying enhancer activation. Mol. Cell76 (3), 473–484. 10.1016/j.molcel.2019.07.038
- CrossRef
- Google Scholar
17
BersonE.SreenivasA.PhongpreechaT.PernaA.GrandiF. C.XueL.et al (2023). Whole genome deconvolution unveils Alzheimer's resilient epigenetic signature. Nat. Commun.14 (1), 4947. 10.1038/s41467-023-40611-4
- CrossRef
- Google Scholar
18
BinnieA.TsangJ. L. Y.HuP.CarrasqueiroG.Castelo-BrancoP.Dos SantosC. C. (2020). Epigenetics of sepsis. Crit. Care Med.48 (5), 745–756. 10.1097/CCM.0000000000004247
- CrossRef
- Google Scholar
19
BoettigerA.MurphyS. (2020). Advances in chromatin imaging at kilobase-scale resolution. Trends Genet.36 (4), 273–287. 10.1016/j.tig.2019.12.010
- CrossRef
- Google Scholar
20
BrownT. B.MannB.RyderN.SubbiahM.KaplanJ.DhariwalP.et al (2020). Language models are few-shot learners. Preprint at arXiv. 10.48550/arXiv.2005.14165
- CrossRef
- Google Scholar
21
BuenrostroJ. D.GiresiP. G.ZabaL. C.ChangH. Y.GreenleafW. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods10 (12), 1213–1218. 10.1038/nmeth.2688
- CrossRef
- Google Scholar
22
BuenrostroJ. D.WuB.LitzenburgerU. M.RuffD.GonzalesM. L.SnyderM. P.et al (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature523 (7561), 486–490. 10.1038/nature14590
- CrossRef
- Google Scholar
23
CaloE.WysockaJ. (2013). Modification of enhancer chromatin: what, how, and why?Mol. Cell49 (5), 825–837. 10.1016/j.molcel.2013.01.038
- CrossRef
- Google Scholar
24
CanverM. C.SmithE. C.SherF.PinelloL.SanjanaN. E.ShalemO.et al (2015). BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature527 (7577), 192–197. 10.1038/nature15521
- CrossRef
- Google Scholar
25
CatarinoR. R.StarkA. (2018). Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev.32 (3-4), 202–223. 10.1101/gad.310367.117
- CrossRef
- Google Scholar
26
CejasP.DrierY.DreijerinkK. M. A.BrosensL. A. A.DeshpandeV.EpsteinC. B.et al (2019). Enhancer signatures stratify and predict outcomes of non-functional pancreatic neuroendocrine tumors. Nat. Med.25 (8), 1260–1265. 10.1038/s41591-019-0493-4
- CrossRef
- Google Scholar
27
ChenX.WangY.CappuccioA.ChengW. S.ZamojskiF. R.NairV. D.et al (2023). Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data. Nat. Comput. Sci.3 (7), 644–657. 10.1038/s43588-023-00476-5
- CrossRef
- Google Scholar
28
ChenZ.SnetkovaV.BowerG.JacintoS.ClockB.DizehchiA.et al (2024). Increased enhancer-promoter interactions during developmental enhancer activation in mammals. Nat. Genet.56 (4), 675–685. 10.1038/s41588-024-01681-2
- CrossRef
- Google Scholar
29
ChengS. C.SciclunaB. P.ArtsR. J.GresnigtM. S.LachmandasE.Giamarellos-BourboulisE. J.et al (2016). Broad defects in the energy metabolism of leukocytes underlie immunoparalysis in sepsis. Nat. Immunol.17 (4), 406–413. 10.1038/ni.3398
- CrossRef
- Google Scholar
30
CholletF. (2015). Keras. Available online at: https://github.com/fchollet/keras.
- Google Scholar
31
CloughE.BarrettT.WilhiteS. E.LedouxP.EvangelistaC.KimI. F.et al (2024). NCBI GEO: archive for gene expression and epigenomics data sets: 23-year update. Nucleic Acids Res.52 (D1), D138–D144. 10.1093/nar/gkad965
- CrossRef
- Google Scholar
32
CoreL. J.WaterfallJ. J.LisJ. T. (2008). Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science322 (5909), 1845–1848. 10.1126/science.1162228
- CrossRef
- Google Scholar
33
Dalla-TorreH.GonzalezL.Mendoza-RevillaJ.Lopez CarranzaN.GrzywaczewskiA. H.OteriF.et al (2025). Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat. Methods22 (2), 287–297. 10.1038/s41592-024-02523-z
- CrossRef
- Google Scholar
34
DavenportE. E.BurnhamK. L.RadhakrishnanJ.HumburgP.HuttonP.MillsT. C.et al (2016). Genomic landscape of the individual host response and outcomes in sepsis: a prospective cohort study. Lancet Respir. Med.4 (4), 259–271. 10.1016/S2213-2600(16)00046-1
- CrossRef
- Google Scholar
35
de AlmeidaB. P.ReiterF.PaganiM.StarkA. (2022). DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet.54 (5), 613–624. 10.1038/s41588-022-01048-5
- CrossRef
- Google Scholar
36
de AlmeidaB. P.SchaubC.PaganiM.SecchiaS.FurlongE. E. M.StarkA. (2024). Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature626 (7997), 207–211. 10.1038/s41586-023-06905-9
- CrossRef
- Google Scholar
37
DengW.LeeJ.WangH.MillerJ.ReikA.GregoryP. D.et al (2012). Controlling long-range genomic interactions at a native locus by targeted tethering of a looping factor. Cell149 (6), 1233–1244. 10.1016/j.cell.2012.03.051
- CrossRef
- Google Scholar
38
DengY.BartosovicM.MaS.ZhangD.KukanjaP.XiaoY.et al (2022). Spatial profiling of chromatin accessibility in mouse and human tissues. Nature609 (7926), 375–383. 10.1038/s41586-022-05094-1
- CrossRef
- Google Scholar
39
DeshpandeD.ChhuganiK.ChangY.KarlsbergA.LoefflerC.ZhangJ.et al (2023). RNA-seq data science: from raw data to effective interpretation. Front. Genet.14, 997383. 10.3389/fgene.2023.997383
- CrossRef
- Google Scholar
40
DevlinJ.ChangM.-W.LeeK.ToutanovaK. (2019). BERT: pre training of deep bidirectional Transformers for language understanding. Minneapolis, Minnesota: Association for Computational Linguistics, 4171–4186.
- Google Scholar
41
DinhT. A.SritharanR.SmithF. D.FranciscoA. B.MaR. K.BunaciuR. P.et al (2020). Hotspots of Aberrant enhancer activity in Fibrolamellar Carcinoma reveal Candidate Oncogenic Pathways and Therapeutic Vulnerabilities. Cell Rep.31 (2), 107509. 10.1016/j.celrep.2020.03.073
- CrossRef
- Google Scholar
42
DixonJ. R.SelvarajS.YueF.KimA.LiY.ShenY.et al (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature485 (7398), 376–380. 10.1038/nature11082
- CrossRef
- Google Scholar
43
DuM.StitzingerS. H.SpilleJ. H.ChoW. K.LeeC.HijazM.et al (2024). Direct observation of a condensate effect on super-enhancer controlled gene bursting. Cell187 (10), 2595–2598. 10.1016/j.cell.2024.04.001
- CrossRef
- Google Scholar
44
ElingN.MorganM. D.MarioniJ. C. (2019). Challenges in measuring and understanding biological noise. Nat. Rev. Genet.20 (9), 536–548. 10.1038/s41576-019-0130-6
- CrossRef
- Google Scholar
45
ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature489 (7414), 57–74. 10.1038/nature11247
- CrossRef
- Google Scholar
46
EraslanG.AvsecŽ.GagneurJ.TheisF. J. (2019). Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet.20 (7), 389–403. 10.1038/s41576-019-0122-6
- CrossRef
- Google Scholar
47
FadriM. T. M.LeeJ. B.KeungA. J. (2024). Summary of ChIP-seq methods and Description of an optimized ChIP-seq Protocol. Methods Mol. Biol.2842, 419–447. 10.1007/978-1-0716-4051-7_22
- CrossRef
- Google Scholar
48
FudenbergG.KelleyD. R.PollardK. S. (2020). Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods17 (11), 1111–1117. 10.1038/s41592-020-0958-x
- CrossRef
- Google Scholar
49
FulcoC. P.MunschauerM.AnyohaR.MunsonG.GrossmanS. R.PerezE. M.et al (2016). Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science354 (6313), 769–773. 10.1126/science.aag2445
- CrossRef
- Google Scholar
50
FurlongE. E. M.LevineM. (2018). Developmental enhancers and chromosome topology. Science361 (6409), 1341–1345. 10.1126/science.aau0320
- CrossRef
- Google Scholar
51
Garcia LopezA.SchäubleS.Sae-OngT.SeelbinderB.BauerM.Giamarellos-BourboulisE. J.et al (2024). Risk assessment with gene expression markers in sepsis development. Cell Rep. Med.5 (9), 101712. 10.1016/j.xcrm.2024.101712
- CrossRef
- Google Scholar
52
GhandiM.Mohammad-NooriM.GhareghaniN.LeeD.GarrawayL.BeerM. A. (2016). gkmSVM: an R package for gapped-kmer SVM. Bioinformatics32 (14), 2205–2207. 10.1093/bioinformatics/btw203
- CrossRef
- Google Scholar
53
Giamarellos-BourboulisE. J.AntonelliM.BloosF.KotsamidiI.PsarrakisC.DakouK.et al (2024a). Interferon-gamma driven elevation of CXCL9: a new sepsis endotype independently associated with mortality. EBioMedicine109, 105414. 10.1016/j.ebiom.2024.105414
- CrossRef
- Google Scholar
54
Giamarellos-BourboulisE. J.AschenbrennerA. C.BauerM.BockC.CalandraT.Gat-ViksI.et al (2024b). The pathophysiology of sepsis and precision-medicine-based immunotherapy. Nat. Immunol.25 (1), 19–28. 10.1038/s41590-023-01660-5
- CrossRef
- Google Scholar
55
GieseK.KingsleyC.KirshnerJ. R.GrosschedlR. (1995). Assembly and function of a TCR alpha enhancer complex is dependent on LEF-1-induced DNA bending and multiple protein-protein interactions. Genes Dev.9 (8), 995–1008. 10.1101/gad.9.8.995
- CrossRef
- Google Scholar
56
GTEx Consortium (2013). The genotype-Tissue expression (GTEx) project. Nat. Genet.45 (6), 580–585. 10.1038/ng.2653
- CrossRef
- Google Scholar
57
HarrisH. L.GuH.OlshanskyM.WangA.FarabellaI.EliazY.et al (2023). Chromatin alternates between A and B compartments at kilobase scale for subgenic organization. Nat. Commun.14 (1), 3303. 10.1038/s41467-023-38429-1
- CrossRef
- Google Scholar
58
HeidersbachA. J.DorighiK. M.GomezJ. A.JacobiA. M.HaleyB. (2023). A versatile, high-efficiency platform for CRISPR-based gene activation. Nat. Commun.14 (1), 902. 10.1038/s41467-023-36452-w
- CrossRef
- Google Scholar
59
HeintzmanN. D.StuartR. K.HonG.FuY.ChingC. W.HawkinsR. D.et al (2007). Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet.39 (3), 311–318. 10.1038/ng1966
- CrossRef
- Google Scholar
60
HeumosL.SchaarA. C.LanceC.LitinetskayaA.DrostF.ZappiaL.et al (2023). Best practices for single-cell analysis across modalities. Nat. Rev. Genet.24 (8), 550–572. 10.1038/s41576-023-00586-w
- CrossRef
- Google Scholar
61
HniszD.AbrahamB. J.LeeT. I.LauA.Saint-AndréV.SigovaA. A.et al (2013). Super-enhancers in the control of cell identity and disease. Cell155 (4), 934–947. 10.1016/j.cell.2013.09.053
- CrossRef
- Google Scholar
62
HniszD.ShrinivasK.YoungR. A.ChakrabortyA. K.SharpP. A. (2017). A phase separation model for transcriptional control. Cell169 (1), 13–23. 10.1016/j.cell.2017.02.007
- CrossRef
- Google Scholar
63
HsiehT. S.CattoglioC.SlobodyanyukE.HansenA. S.DarzacqX.TjianR. (2022). Enhancer-promoter interactions and transcription are largely maintained upon acute loss of CTCF, cohesin, WAPL or YY1. Nat. Genet.54 (12), 1919–1932. 10.1038/s41588-022-01223-8
- CrossRef
- Google Scholar
64
HungT. C.KingsleyD. M.BoettigerA. N. (2024). Boundary stacking interactions enable cross-TAD enhancer-promoter communication during limb development. Nat. Genet.56 (2), 306–314. 10.1038/s41588-023-01641-2
- CrossRef
- Google Scholar
65
InoueF.AhituvN. (2015). Decoding enhancers using massively parallel reporter assays. Genomics106 (3), 159–164. 10.1016/j.ygeno.2015.06.005
- CrossRef
- Google Scholar
66
InoueF.KircherM.MartinB.CooperG. M.WittenD. M.McManusM. T.et al (2017). A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res.27 (1), 38–52. 10.1101/gr.212092.116
- CrossRef
- Google Scholar
67
JindalG. A.FarleyE. K. (2021). Enhancer grammar in development, evolution, and disease: dependencies and interplay. Dev Cell.56(5), 575–587. 10.1016/j.devcel.2021.02.016
- CrossRef
- Google Scholar
68
JiY.ZhouZ.LiuH.DavuluriR. V. (2021). DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics37 (15), 2112–2120. 10.1093/bioinformatics/btab083
- CrossRef
- Google Scholar
69
JohnS.SaboP. J.CanfieldT. K.LeeK.VongS.WeaverM.et al (2013). Genome-scale mapping of DNase I hypersensitivity. Curr. Protoc. Mol. Biol.Chapter 27, Unit 21.27. 10.1002/0471142727.mb2127s103
- CrossRef
- Google Scholar
70
KarakikeE.Giamarellos-BourboulisE. J. (2019). Macrophage activation-like syndrome: a distinct entity leading to early Death in sepsis. Front. Immunol.10, 55. 10.3389/fimmu.2019.00055
- CrossRef
- Google Scholar
71
KarakikeE.DalekosG. N.KoutsodimitropoulosI.SaridakiM.PourzitakiC.PapathanakosG.et al (2022). ESCAPE: an open-label trial of Personalized immunotherapy in critically lll COVID-19 patients. J. Innate Immun.14 (3), 218–228. 10.1159/000519090
- CrossRef
- Google Scholar
72
KarakikeE.MetallidisS.PoulakouG.KosmidouM.GatselisN. K.PetrakisV.et al (2024). Clinical Phenotyping for Prognosis and immunotherapy Guidance in bacterial sepsis and COVID-19. Crit. Care Explor6 (9), e1153. 10.1097/CCE.0000000000001153
- CrossRef
- Google Scholar
73
KarczewskiK. J.SnyderM. P. (2018). Integrative omics for health and disease. Nat. Rev. Genet.19 (5), 299–310. 10.1038/nrg.2018.4
- CrossRef
- Google Scholar
74
KelleyD. R.SnoekJ.RinnJ. L. (2016). Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res.26 (7), 990–999. 10.1101/gr.200535.115
- CrossRef
- Google Scholar
75
KelleyD. R.ReshefY. A.BileschiM.BelangerD.McLeanC. Y.SnoekJ. (2018). Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res.28 (5), 739–750. 10.1101/gr.227819.117
- CrossRef
- Google Scholar
76
KesslerS.MinouxM.JoshiO.Ben ZouariY.DucretS.RossF.et al (2023). A multiple super-enhancer region establishes inter-TAD interactions and controls Hoxa function in cranial neural crest. Nat. Commun.14 (1), 3242. 10.1038/s41467-023-38953-0
- CrossRef
- Google Scholar
77
KimS.WysockaJ. (2023). Deciphering the multi-scale, quantitative cis-regulatory code. Mol. Cell83 (3), 373–392. 10.1016/j.molcel.2022.12.032
- CrossRef
- Google Scholar
78
KimT. K.HembergM.GrayJ. M.CostaA. M.BearD. M.WuJ.et al (2010). Widespread transcription at neuronal activity-regulated enhancers. Nature465 (7295), 182–187. 10.1038/nature09033
- CrossRef
- Google Scholar
79
KimD. S.RiscaV. I.ReynoldsD. L.ChappellJ.RubinA. J.JungN.et al (2021). The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet.53 (11), 1564–1576. 10.1038/s41588-021-00947-3
- CrossRef
- Google Scholar
80
KleinA. M.MazutisL.AkartunaI.TallapragadaN.VeresA.LiV.et al (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell161 (5), 1187–1201. 10.1016/j.cell.2015.04.044
- CrossRef
- Google Scholar
81
KomorowskiM.GreenA.TathamK. C.SeymourC.AntcliffeD. (2022). Sepsis biomarkers and diagnostic tools with a focus on machine learning. EBioMedicine86, 104394. 10.1016/j.ebiom.2022.104394
- CrossRef
- Google Scholar
82
KouzaridesT. (2007). Chromatin modifications and their function. Cell128 (4), 693–705. 10.1016/j.cell.2007.02.005
- CrossRef
- Google Scholar
83
KulkarniM. M.ArnostiD. N. (2003). Information display by transcriptional enhancers. Development130 (26), 6569–6575. 10.1242/dev.00890
- CrossRef
- Google Scholar
84
KyriazopoulouE.PoulakouG.MilionisH.MetallidisS.AdamisG.TsiakosK.et al (2021). Early treatment of COVID-19 with anakinra guided by soluble urokinase plasminogen receptor plasma levels: a double-blind, randomized controlled phase 3 trial. Nat. Med.27 (10), 1752–1760. 10.1038/s41591-021-01499-z
- CrossRef
- Google Scholar
85
LambertS. A.JolmaA.CampitelliL. F.DasP. K.YinY.AlbuM.et al (2018). The human transcription factors. Cell172 (4), 650–665. 10.1016/j.cell.2018.01.029
- CrossRef
- Google Scholar
86
LetticeL. A.HeaneyS. J.PurdieL. A.LiL.de BeerP.OostraB. A.et al (2003). A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum. Mol. Genet.12 (14), 1725–1735. 10.1093/hmg/ddg180
- CrossRef
- Google Scholar
87
LevyJ. J.TitusA. J.PetersenC. L.ChenY.SalasL. A.ChristensenB. C. (2020). MethylNet: an automated and modular deep learning approach for DNA methylation analysis. BMC Bioinforma.21 (1), 108. 10.1186/s12859-020-3443-8
- CrossRef
- Google Scholar
88
LiZ.GaoE.ZhouJ.HanW.XuX.GaoX. (2023). Applications of deep learning in understanding gene regulation. Cell Rep. Methods3 (1), 100384. 10.1016/j.crmeth.2022.100384
- CrossRef
- Google Scholar
89
LongH. K.OsterwalderM.WelshI. C.HansenK.DaviesJ. O. J.LiuY. E.et al (2020). Loss of Extreme long-range enhancers in human neural crest drives a Craniofacial disorder. Cell Stem Cell27 (5), 765–783. 10.1016/j.stem.2020.09.001
- CrossRef
- Google Scholar
90
MackS. C.SinghI.WangX.HirschR.WuQ.VillagomezR.et al (2019). Chromatin landscapes reveal developmentally encoded transcriptional states that define human glioblastoma. J. Exp. Med.216 (5), 1071–1090. 10.1084/jem.20190196
- CrossRef
- Google Scholar
91
MahatD. B.KwakH.BoothG. T.JonkersI. H.DankoC. G.PatelR. K.et al (2016). Base-pair-resolution genome-wide mapping of active RNA polymerases using precision nuclear run-on (PRO-seq). Nat. Protoc.11 (8), 1455–1476. 10.1038/nprot.2016.086
- CrossRef
- Google Scholar
92
MarinovG. K.ShiponyZ.KundajeA.GreenleafW. J. (2023). Genome-wide mapping of active regulatory elements using ATAC-seq. Methods Mol. Biol.2611, 3–19. 10.1007/978-1-0716-2899-7_1
- CrossRef
- Google Scholar
93
MelnikovA.MuruganA.ZhangX.TesileanuT.WangL.RogovP.et al (2012). Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat. Biotechnol.30 (3), 271–277. 10.1038/nbt.2137
- CrossRef
- Google Scholar
94
MisteliT. (2020). The self-Organizing genome: principles of genome architecture and function. Cell183 (1), 28–45. 10.1016/j.cell.2020.09.014
- CrossRef
- Google Scholar
95
MorganM. A.ShilatifardA. (2015). Chromatin signatures of cancer. Genes Dev.29 (3), 238–249. 10.1101/gad.255182.114
- CrossRef
- Google Scholar
96
MorrowJ. J.BaylesI.FunnellA. P. W.MillerT. E.SaiakhovaA.LizardoM. M.et al (2018). Positively selected enhancer elements endow osteosarcoma cells with metastatic competence. Nat. Med.24 (2), 176–185. 10.1038/nm.4475
- CrossRef
- Google Scholar
97
MortazaviA.WilliamsB. A.McCueK.SchaefferL.WoldB. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods5 (7), 621–628. 10.1038/nmeth.1226
- CrossRef
- Google Scholar
98
MosesL.PachterL. (2022). Museum of spatial transcriptomics. Nat. Methods19 (5), 534–546. 10.1038/s41592-022-01409-2
- CrossRef
- Google Scholar
99
NairS.AmeenM.SundaramL.PampariA.SchreiberJ.BalsubramaniA.et al (2023). Transcription factor stoichiometry, motif affinity and syntax regulate single-cell chromatin dynamics during fibroblast reprogramming to pluripotency. bioRxiv. 10.1101/2023.10.04.560808
- CrossRef
- Google Scholar
100
NeteaM. G.Domínguez-AndrésJ.BarreiroL. B.ChavakisT.DivangahiM.FuchsE.et al (2020). Defining trained immunity and its role in health and disease. Nat. Rev. Immunol.20 (6), 375–388. 10.1038/s41577-020-0285-6
- CrossRef
- Google Scholar
101
NtziachristosP.Abdel-WahabO.AifantisI. (2016). Emerging concepts of epigenetic dysregulation in hematological malignancies. Nat. Immunol.17 (9), 1016–1024. 10.1038/ni.3517
- CrossRef
- Google Scholar
102
ParkerS. C.StitzelM. L.TaylorD. L.OrozcoJ. M.ErdosM. R.AkiyamaJ. A.et al (2013). Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc. Natl. Acad. Sci. U. S. A.110 (44), 17921–17926. 10.1073/pnas.1317023110
- CrossRef
- Google Scholar
103
PopayT. M.DixonJ. R. (2022). Coming full circle: on the origin and evolution of the looping model for enhancer-promoter communication. J. Biol. Chem.298 (8), 102117. 10.1016/j.jbc.2022.102117
- CrossRef
- Google Scholar
104
QuangD.XieX. (2016). DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res.44 (11), e107. 10.1093/nar/gkw226
- CrossRef
- Google Scholar
105
RaoS. S.HuntleyM. H.DurandN. C.StamenovaE. K.BochkovI. D.RobinsonJ. T.et al (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell159 (7), 1665–1680. 10.1016/j.cell.2014.11.021
- CrossRef
- Google Scholar
106
RaoS. S. P.HuangS. C.Glenn St HilaireB.EngreitzJ. M.PerezE. M.Kieffer-KwonK. R.et al (2017). Cohesin loss Eliminates all loop domains. Cell171 (2), 305–320. 10.1016/j.cell.2017.09.026
- CrossRef
- Google Scholar
107
RickelsR.ShilatifardA. (2018). Enhancer logic and Mechanics in development and disease. Trends Cell Biol.28 (8), 608–630. 10.1016/j.tcb.2018.04.003
- CrossRef
- Google Scholar
108
RiggiN.KnoechelB.GillespieS. M.RheinbayE.BoulayG.SuvàM. L.et al (2014). EWS-FLI1 utilizes divergent chromatin remodeling mechanisms to directly activate or repress enhancer elements in Ewing sarcoma. Cancer Cell26 (5), 668–681. 10.1016/j.ccell.2014.10.004
- CrossRef
- Google Scholar
109
Roadmap Epigenomics Consortium, KundajeA.MeulemanW.ErnstJ.BilenkyM.YenA.Heravi-MoussaviA.et al (2015). Integrative analysis of 111 reference human epigenomes. Nature518 (7539), 317–330. 10.1038/nature14248
- CrossRef
- Google Scholar
110
RobertsonG.HirstM.BainbridgeM.BilenkyM.ZhaoY.ZengT.et al (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods4 (8), 651–657. 10.1038/nmeth1068
- CrossRef
- Google Scholar
111
SaeedS.QuintinJ.KerstensH. H.RaoN. A.AghajanirefahA.MatareseF.et al (2014). Epigenetic programming of monocyte-to-macrophage differentiation and trained innate immunity. Science345 (6204), 1251086. 10.1126/science.1251086
- CrossRef
- Google Scholar
112
SartorelliV.LauberthS. M. (2020). Enhancer RNAs are an important regulatory layer of the epigenome. Nat. Struct. Mol. Biol.27 (6), 521–528. 10.1038/s41594-020-0446-0
- CrossRef
- Google Scholar
113
SchreiberJ. M.BoixC. A.Wook LeeJ.LiH.GuanY.ChangC. C.et al (2023). The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles. Genome Biol.24 (1), 79. 10.1186/s13059-023-02915-y
- CrossRef
- Google Scholar
114
SchwessingerR.GosdenM.DownesD.BrownR. C.OudelaarA. M.TeleniusJ.et al (2020). DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods17 (11), 1118–1124. 10.1038/s41592-020-0960-3
- CrossRef
- Google Scholar
115
SciclunaB. P.van VughtL. A.ZwindermanA. H.WiewelM. A.DavenportE. E.BurnhamK. L.et al (2017). Classification of patients with sepsis according to blood genomic endotype: a prospective cohort study. Lancet Respir. Med.5 (10), 816–826. 10.1016/S2213-2600(17)30294-1
- CrossRef
- Google Scholar
116
ShangS.YangJ.JazaeriA. A.DuvalA. J.TufanT.Lopes FischerN.et al (2019). Chemotherapy-induced distal enhancers drive transcriptional programs to Maintain the chemoresistant state in ovarian cancer. Cancer Res.79 (18), 4599–4611. 10.1158/0008-5472.CAN-19-0215
- CrossRef
- Google Scholar
117
Shankar-HariM.CalandraT.SoaresM. P.BauerM.WiersingaW. J.PrescottH. C.et al (2024). Reframing sepsis immunobiology for translation: towards informative subtyping and targeted immunomodulatory therapies. Lancet Respir. Med.12 (4), 323–336. 10.1016/S2213-2600(23)00468-X
- CrossRef
- Google Scholar
118
ShrikumarA.GreensideP.ShcherbinaA.KundajeA. (2016). Not just a black box: learning important features through propagating activation differences. Preprint at arXiv.
- Google Scholar
119
ShrikumarA.TianK.AvsecŽ.ShcherbinaA.BanerjeeA.SharminM.et al (2020). Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5. 6.5. Available online at: http://arxiv.org/abs/1811.00416.
- Google Scholar
120
SiepelA.BejeranoG.PedersenJ. S.HinrichsA. S.HouM.RosenbloomK.et al (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res.15 (8), 1034–1050. 10.1101/gr.3715005
- CrossRef
- Google Scholar
121
SimonyanK.VedaldiA.ZissermanA. (2013). Deep inside convo lutional networks: visualising image classification models and saliency maps. Preprint at arXiv. 10.48550/arXiv.1312.6034
- CrossRef
- Google Scholar
122
SmarujP. N.XiaoY.FudenbergG. (2025). Recipes and ingredients for deep learning models of 3D genome folding. Curr. Opin. Genet. Dev.91, 102308. 10.1016/j.gde.2024.102308
- CrossRef
- Google Scholar
123
SmithG. D.ChingW. H.Cornejo-PáramoP.WongE. S. (2023). Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol.24 (1), 116. 10.1186/s13059-023-02955-4
- CrossRef
- Google Scholar
124
SpurrellC. H.BarozziI.KosickiM.MannionB. J.BlowM. J.Fukuda-YuzawaY.et al (2022). Genome-wide fetalization of enhancer architecture in heart disease. Cell Rep.40 (12), 111400. 10.1016/j.celrep.2022.111400
- CrossRef
- Google Scholar
125
StankeyC. T.LeeJ. C. (2023). Translating non-coding genetic associations into a better understanding of immune-mediated disease. Dis. Model Mech.16 (3), dmm049790. 10.1242/dmm.049790
- CrossRef
- Google Scholar
126
StunnenbergH. G.HirstM.International Human Epigenome Consortium (2016). The international human epigenome Consortium: a blueprint for Scientific collaboration and discovery. Cell167 (5), 1145–1149. 10.1016/j.cell.2016.11.007
- CrossRef
- Google Scholar
127
SurI.TaipaleJ. (2016). The role of enhancers in cancer. Nat. Rev. Cancer16 (8), 483–493. 10.1038/nrc.2016.62
- CrossRef
- Google Scholar
128
SweeneyT. E.AzadT. D.DonatoM.HaynesW. A.PerumalT. M.HenaoR.et al (2018). Unsupervised analysis of transcriptomics in bacterial sepsis across multiple datasets reveals three robust clusters. Crit. Care Med.46 (6), 915–925. 10.1097/CCM.0000000000003084
- CrossRef
- Google Scholar
129
TanJ.DoingG.LewisK. A.PriceC. E.ChenK. M.CadyK. C.et al (2017). Unsupervised extraction of stable expression signatures from public Compendia with an ensemble of neural networks. Cell Syst.5 (1), 63–71. 10.1016/j.cels.2017.06.003
- CrossRef
- Google Scholar
130
TanJ.Shenker-TaurisN.Rodriguez-HernaezJ.WangE.SakellaropoulosT.BoccalatteF.et al (2023). Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nat. Biotechnol.41 (8), 1140–1150. 10.1038/s41587-022-01612-8
- CrossRef
- Google Scholar
131
TangF.BarbacioruC.WangY.NordmanE.LeeC.XuN.et al (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods6 (5), 377–382. 10.1038/nmeth.1315
- CrossRef
- Google Scholar
132
TangZ.SomiaN.YuY.KooP. K. (2025). Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Genome Biol.26 (1), 203. 10.1186/s13059-025-03674-8
- CrossRef
- Google Scholar
133
TaskiranI. I.SpanierK. I.DickmänkenH.KempynckN.PančíkováA.EkşiE. C.et al (2024). Cell-type-directed design of synthetic enhancers. Nature626 (7997), 212–220. 10.1038/s41586-023-06936-2
- CrossRef
- Google Scholar
134
ThanosD.ManiatisT. (1995). Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell83 (7), 1091–1100. 10.1016/0092-8674(95)90136-1
- CrossRef
- Google Scholar
135
ToneyanS.TangZ.KooP. K. (2022). Evaluating deep learning for predicting epigenomic profiles. Nat. Mach. Intell.4 (12), 1088–1100. 10.1038/s42256-022-00570-9
- CrossRef
- Google Scholar
136
UyeharaC. M.ApostolouE. (2023). 3D enhancer-promoter interactions and multi-connected hubs: organizational principles and functional roles. Cell Rep.42 (4), 112068. 10.1016/j.celrep.2023.112068
- CrossRef
- Google Scholar
137
VierstraJ.LazarJ.SandstromR.HalowJ.LeeK.BatesD.et al (2020). Global reference mapping of human transcription factor footprints. Nature583 (7818), 729–736. 10.1038/s41586-020-2528-x
- CrossRef
- Google Scholar
138
VillarD.BerthelotC.AldridgeS.RaynerT. F.LukkM.PignatelliM.et al (2015). Enhancer evolution across 20 mammalian species. Cell160 (3), 554–566. 10.1016/j.cell.2015.01.006
- CrossRef
- Google Scholar
139
WangJ.MaA.ChangY.GongJ.JiangY.QiR.et al (2021). scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat. Commun.12 (1), 1882. 10.1038/s41467-021-22197-x
- CrossRef
- Google Scholar
140
WangY.KongS.ZhouC.WangY.ZhangY.FangY.et al (2024). A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles. Brief. Bioinform26 (1), bbae651. 10.1093/bib/bbae651
- CrossRef
- Google Scholar
141
WhyteW. A.OrlandoD. A.HniszD.AbrahamB. J.LinC. Y.KageyM. H.et al (2013). Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell153 (2), 307–319. 10.1016/j.cell.2013.03.035
- CrossRef
- Google Scholar
142
XieJ.ZhengX.YanJ.LiQ.JinN.WangS.et al (2024). Deep learning model to discriminate diverse infection types based on pairwise analysis of host gene expression. iScience27 (6), 109908. 10.1016/j.isci.2024.109908
- CrossRef
- Google Scholar
143
YanJ.QiuY.Ribeiro Dos SantosA. M.YinY.LiY. E.VinckierN.et al (2021). Systematic analysis of binding of transcription factors to noncoding variants. Nature591 (7848), 147–151. 10.1038/s41586-021-03211-0
- CrossRef
- Google Scholar
144
YangR.DasA.GaoV. R.KarbalaygharehA.NobleW. S.BilmesJ. A.et al (2023). Epiphany: predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol.24 (1), 134. 10.1186/s13059-023-02934-9
- CrossRef
- Google Scholar
145
YaoQ.EpsteinC. B.BanskotaS.IssnerR.KimY.BernsteinB. E.et al (2021). Epigenetic alterations in Keratinocyte Carcinoma. J. Invest. Dermatol141 (5), 1207–1218. 10.1016/j.jid.2020.10.018
- CrossRef
- Google Scholar
146
YuanD.AhamedA.BurginJ.CumminsC.DevrajR.GueyeK.et al (2024). The European nucleotide archive in 2023. Nucleic Acids Res.52 (D1), D92–D97. 10.1093/nar/gkad1067
- CrossRef
- Google Scholar
147
ZhangZ.PanQ.GeH.XingL.HongY.ChenP. (2020). Deep learning-based clustering robustly identified two classes of sepsis with both prognostic and predictive values. EBioMedicine62, 103081. 10.1016/j.ebiom.2020.103081
- CrossRef
- Google Scholar
148
ZhouJ. (2022). Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet.54 (5), 725–734. 10.1038/s41588-022-01065-4
- CrossRef
- Google Scholar
149
ZhouJ.TroyanskayaO. G. (2015). Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods12 (10), 931–934. 10.1038/nmeth.3547
- CrossRef
- Google Scholar
150
ZhuY.LeeH.WhiteS.WeimerA. K.MonteE.HorningA.et al (2024). Global loss of promoter-enhancer connectivity and rebalancing of gene expression during early colorectal cancer carcinogenesis. Nat. Cancer5 (11), 1697–1712. 10.1038/s43018-024-00823-z
- CrossRef
- Google Scholar

Summary

Keywords

deep learning, enhancers, genomics, machine learning, sepsis, transcriptional regulation

Citation

Foutadakis S, Bourika V, Styliara I, Koufargyris P, Safarika A and Karakike E (2025) Machine learning tools for deciphering the regulatory logic of enhancers in health and disease. Front. Genet. 16:1603687. doi: 10.3389/fgene.2025.1603687

Received

31 March 2025

Accepted

31 July 2025

Published

13 August 2025

Volume

16 - 2025

Edited by

Jared C. Roach, Institute for Systems Biology (ISB), United States

Reviewed by

David N. Arnosti, Michigan State University, United States

Faizah Aplop, University of Malaysia Terengganu, Malaysia

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Spyros Foutadakis, foutadakiss@gmail.com; Eleni Karakike, elkarakike@gmail.com

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Human and Medical Genomics

REVIEW article

Machine learning tools for deciphering the regulatory logic of enhancers in health and disease

Abstract

1 Introduction

2 Enhancer identification at the genome-wide scale

3 Three-dimensional genome organization and enhancer communication

4 Using machine learning to decipher the regulatory logic of enhancers

4.1 Model sharing and interpretation

5 Enhancer identification and machine learning tools in health and disease

6 Discussion

Statements

Author contributions

Funding

Conflict of interest

Generative AI statement

Publisher’s note

Supplementary material

References

Summary

Outline

Figures

Cite article

Article metrics

REVIEW article

Machine learning tools for deciphering the regulatory logic of enhancers in health and disease

Abstract

1 Introduction

2 Enhancer identification at the genome-wide scale

3 Three-dimensional genome organization and enhancer communication

4 Using machine learning to decipher the regulatory logic of enhancers

4.1 Model sharing and interpretation

5 Enhancer identification and machine learning tools in health and disease

6 Discussion

Statements

Author contributions

Funding

Conflict of interest

Generative AI statement

Publisher’s note

Supplementary material

References

Summary

Outline

Figures

Cite article

Share article

Article metrics