About this Research Topic
Given the importance of gene regulation for proper cell functioning and adaptability, decoding its architecture has become one of the most pressing tasks in modern (computational) biology. To this end, it has been a long-held ambition to enable quantitative prediction of gene regulation, i.e., inference of gene expression levels, from genomic and epigenomic features alone. The rise in computing power and recent advances in learning algorithms, alongside high-throughput next-generation sequencing that provides large-scale quantification of gene expression at single-cell resolution, as well as the identification of novel genes and noncoding RNAs at unprecedented levels, may bring us one step closer to realizing this ambition.
Early applications of modern machine learning approaches, in particular deep learning, to predicting mRNA abundance levels directly from DNA sequence have already yielded promising results. Despite this, it remains an open question how the individual factors and epigenomic features involved in the gene regulatory apparatus interact within the vast genomic landscape of an organism’s non-coding regions and, in turn, contribute to mRNA expression levels. To advance our understanding of gene expression inference, we need scalable algorithms that: i) allow for the integration of diverse genomic and epigenomic datasets as well as structural and/or biological priors; ii) are interpretable with respect to the representations they have learned; and iii) allow the learned representations to be transferred to novel and different environments and contexts.
This Research Topic welcomes both original studies and review articles assessing modern machine learning approaches that integrate genomic and/or epigenomic datasets to quantitatively predict gene regulation. The topics of interest include, but are not limited to, the following:
- Quantitative inference of gene expression levels from DNA sequence and/or epigenomic features;
- Analyses of transcription factor interaction within multiple binding elements and/or effects of single nucleotide polymorphisms, copy number variations, etc., in causing loss or creation of promoter binding elements and enhancers;
- Strategies for heterogeneous data integration of genomic sequences and epigenetic datasets, including chromatin accessibility, methylation, or chromatin conformation-related data;
- Evaluation and/or benchmarking of different (un)supervised learning paradigms, including Bayesian (deep) learning, deep convolutional models, graph neural networks, or attention-based approaches;
- Approaches to structural learning of efficient model architectures based on biological or structural priors, or on data-driven methods such as Neural Architecture Search or Bayesian sampling;
- Analyses of model scalability versus model complexity trade-offs with respect to structural priors, inductive biases, dimensionality reduction, and randomized or sampling-based techniques;
- Investigations into model interpretability, e.g., visualization of individual components of trained models, such as model filters representing sequence binding motifs;
- Studies into the possibilities/limitations of the transfer of trained models to novel/different contexts and environments.
Keywords: machine learning, plant genome, plant genes, gene regulation, epigenomics, deep learning
Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.