Predicting Genome Architecture: Challenges and Solutions

Belokopytova, Polina; Fishman, Veniamin

doi:10.3389/fgene.2020.617202

REVIEW article

Front. Genet., 22 January 2021

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.617202

Polina Belokopytova ^1,2
Veniamin Fishman ^1,2^*

1. Natural Sciences Department, Novosibirsk State University, Novosibirsk, Russia
2. Institute of Cytology and Genetics Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk, Russia

Abstract

Genome architecture plays a pivotal role in gene regulation. The use of high-throughput methods for chromatin profiling and 3-D interaction mapping provides rich experimental data sets describing genome organization and dynamics. These data call for the development of new models and algorithms connecting genome architecture with epigenetic marks. In this review, we describe how chromatin architecture can be reconstructed from epigenetic data using biophysical or statistical approaches. We discuss the applicability and limitations of these methods for understanding the mechanisms of chromatin organization. We also highlight the emergence of new predictive approaches for scoring effects of structural variations in human cells.

Studying Genome Architecture: Methods and Mechanisms

The human genome has a three-dimensional structure, which folds in the nucleus, producing specific chromatin interactions. These chromatin interactions can be experimentally assessed by modern microscopy methods (reviewed in Boettiger and Murphy, 2020) or sequencing approaches, such as genome-wide modifications of chromatin conformation capture (Hi-C) (Lieberman-Aiden et al., 2009; Rao et al., 2014), split-pool recognition of interactions by tag extension (Quinodoz et al., 2018), and genome architecture mapping (Beagrie et al., 2017). These methods are covered by comprehensive reviews (Kempfer and Pombo, 2020) and comparative studies (Fiorillo et al., 2020). Here, we focus mainly on the Hi-C technique and its results because this method was most widely applied in various genomic studies during the last decade, allowing the accumulation of a huge amount of experimental data. Both methodological aspects of the Hi-C technique (Fiorillo et al., 2020) and biological principles revealed by applying this method to study genome architecture (Szabo et al., 2019) are discussed in detail in several recent reviews. We refer readers to Box 1, where we briefly discuss the main concepts of this field for the sake of completeness.

BOX 1

Hi-C includes crosslinking and digestion of chromatin, followed by proximity ligation and sequencing of ligation products (Lieberman-Aiden et al., 2009; Rao et al., 2014). During the proximity ligation step, only those genomic regions that spatially co-localize have a chance to be ligated. Thus, counting ligation products by next-generation sequencing allows deciphering the spatial proximity of loci. Although several single-cell Hi-C methods are published (Flyamer et al., 2017), the technique is most often applied to large cell populations, and ligation event frequency (also referred to as interaction or contact frequency throughout this review) should be interpreted as the average frequency of loci co-localization among the studied cell population. This snapshot of averaged chromatin contacts in a population, typically represented by a matrix of pairwise interaction frequencies, is known as a Hi-C map.
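As a minimal sketch of how such a matrix arises, the snippet below bins a handful of ligation products into a symmetric contact map. The function name `build_contact_map`, the coordinates, and the 25-kb bin size are illustrative assumptions, not part of any published Hi-C pipeline; real pipelines also filter, deduplicate, and normalize the counts.

```python
from collections import Counter

def build_contact_map(pairs, region_length, bin_size):
    """Count ligation products per pair of genomic bins (symmetric matrix)."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = Counter()
    for pos1, pos2 in pairs:
        # Order the two bins so (i, j) and (j, i) are counted together.
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for (i, j), c in counts.items():
        matrix[i][j] = c
        matrix[j][i] = c
    return matrix

# Hypothetical ligation products: pairs of coordinates (bp) on one chromosome.
pairs = [(1_200, 48_000), (3_500, 47_100), (52_000, 9_900), (61_000, 64_500)]
hic_map = build_contact_map(pairs, region_length=100_000, bin_size=25_000)
for row in hic_map:
    print(row)
```

Each entry of `hic_map` is a raw count for one pair of bins; in a population experiment, these counts reflect how often the two bins co-localize across many cells.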

Using Hi-C and other methods, several important principles of genome architecture were recently discovered. At the largest scales, chromosomes occupy distinct territories, showing only limited intermingling (Tavares-Cadete et al., 2020) and characterized by a power-law decay of contact frequencies with the genomic distance between loci (Lieberman-Aiden et al., 2009). Within the territories, one can distinguish compartments that correspond to different chromatin types (Lieberman-Aiden et al., 2009). Mechanisms underlying compartment formation are actively debated, and there is a growing body of theoretical and experimental evidence suggesting the essential role of liquid–liquid phase separation in these processes (Kantidze and Razin, 2020; Razin and Gavrilov, 2020; Razin and Ulianov, 2020). At a finer scale, specific loci may preferentially interact with each other, forming topologically associated domains (TADs) (Dixon et al., 2012), stripes (Vian et al., 2018), cliques (Petrovic et al., 2019), and loops (Rao et al., 2014). Although the terminology is not well established in this field (de Wit, 2020), the currently proposed mechanisms underlying the formation of these structures fall into two categories.

First is a recently proposed loop extrusion mechanism (Sanborn et al., 2015; Fudenberg et al., 2016). It is thought that ring-shaped cohesin and condensin proteins bind chromatin and then form and continuously extend loops in an ATP-dependent manner. Extrusion stops upon encountering another extrusion complex or, in the case of cohesins, upon reaching CTCF protein bound to DNA in a specific orientation. This results in increased interaction frequency between loci bound by cohesin, displayed on Hi-C maps as loops (two-point interactions) (Rao et al., 2014) or stripes (one-to-many-points interactions) (Vian et al., 2018). The chromatin interaction patterns arising from loop extrusion mechanisms could be qualitatively described by the landscape of CTCF binding and also depend on the loading and processivity of cohesin (Fudenberg et al., 2016). Moreover, loop extrusion results in increased proximity of all loci located between convergently oriented CTCF sites, which is captured by the formation of looping domains (Rao et al., 2014).

The second mechanism responsible for the formation of loops and cliques is mediated by the formation of regulatory protein complexes, for example, polycomb complexes (Eagen et al., 2017) and certain transcription factors (Petrovic et al., 2019). This mechanism is at least partially independent of cohesin-mediated extrusion because a subset of loops remains stable upon degradation of the cohesin complex (Rao et al., 2017).

It is important to note that profiles of chromatin interactions captured by the Hi-C experiment are formed by the joint action of different mechanisms. For example, the formation of TADs, which represent self-interacting regions in the genome, is affected both by loop extrusion and compartmentalization processes (Szabo et al., 2019; de Wit, 2020), which is consistent with both convergent CTCF sites and chromatin state transition enrichment at TAD boundaries (Dixon et al., 2012; Rao et al., 2014; Huang et al., 2015).

Why Model 3-D Genome Folding?

The models and algorithms predicting genome architecture can be used in different ways. First, we can apply modeling to gain new insights into, or test hypotheses about, the molecular mechanisms underlying 3-D genome folding. Polymer modeling is used more often for this purpose, but convolutional neural networks, such as Akita (Fudenberg et al., 2020) and DeepC (Schwessinger et al., 2020), also enable identifying the main chromosome features contributing to genome architecture. During the last few years, a significant amount of data describing the main features of 3-D genome folding has accumulated, and understanding of the underlying molecular mechanisms, including loop extrusion and phase separation, has been largely facilitated by biophysical modeling and statistical analysis of chromatin properties. This field of research is well described in reviews (Imakaev et al., 2015; Lin et al., 2019). However, known mechanisms do not explain all 3-D chromatin features, which limits hypothesis-driven models, and further research is required to explain them.

Second, 3-D genome models can be used to predict functional consequences caused by changes in 3-D genome folding. It is shown that alterations of chromatin topology accompanying genomic variations, especially large structural variations, can cause changes of gene expression (Franke et al., 2016; Rodríguez-Carballo et al., 2017; Kraft et al., 2019). One can find examples of such gene expression changes and their underlying mechanisms in the last part of this review. In these cases, modeling of 3-D genome architecture is essential for accurate prediction of the consequences of the genomic mutations.

Last, one can use modeling to predict the 3-D genome architecture for new data. For example, it is possible to predict chromatin interactions for cell types lacking experimental Hi-C data (Belokopytova et al., 2020). Machine learning methods are often applied for this purpose.

Which 3-D Genome Structures Can Be Predicted, and Why Are They Relevant?

Chromosome conformation capture methods, such as Hi-C, allow deciphering the main features of chromatin folding. Since the first Hi-C experiments, chromatin structures such as compartments, TADs, and loops have been revealed (see Box 1 for details of mechanisms underlying these structures). In the following, we describe the main Hi-C map features and algorithms used to predict them. Readers new to the field may also find it helpful to consult Table 1, which lists algorithms for predicting different 3-D genome features.

TABLE 1

Tool name	Input features	Target features	Method/algorithm
See review by Xu et al. (2018)	Histone marks, TF binding, DHS	Promoter–enhancer interactions	See review by Xu et al. (2018)
MacPherson et al. (2018) model	HP1, H3K9me3	Compartments	Polymer modeling
MiChroM + MEGABASE (Di Pierro et al., 2017)	Histone marks, TF binding	Compartments	NN classifier + polymer modeling
Huang et al. (2015) model	Histone marks	TADs	BART
3Disease Browser (Li et al., 2016)	Enhancers and TAD boundaries	Rearranged TADs	Linear model
Lollipop (Kai et al., 2018)	ChIP-seq data, CTCF directionality	Loops	ML ensemble classifier (random forest)
3DEpiloop (Al Bkhetan and Plewczynski, 2018)	Histone marks, TF binding	Loops	ML ensemble classifier (random forest)
CTCF-MP (Zhang et al., 2018)	CTCF binding, DHS, nucleotide sequence	Loops	ML ensemble classifier/NN (boosted trees/word2vec)
EpiTensor (Zhu et al., 2016)	Histone marks, TF binding	Loops	Tensor modeling + PCA
DeepMILO (Trieu et al., 2020)	Sequence of loop anchors	Rearranged loops	CNN and RNN
3D-GNOME (Sadowski et al., 2019)	CTCF ChIA-PET	Rearranged loops	Linear models
3DPredictor (Belokopytova et al., 2020)	CTCF, RNA-seq	Whole Hi-C map	ML ensemble regression (gradient boosting)
HiC-Reg (Zhang et al., 2019)	Histone marks, TF binding, DHS	Whole Hi-C map	ML ensemble regression (random forest)
Akita (Fudenberg et al., 2020)	Sequence	Whole Hi-C map	CNN
DeepC (Schwessinger et al., 2020)	Sequence	Whole Hi-C map	CNN
Qi and Zhang (2019) model	CTCF binding, chromatin states	Whole Hi-C map	Polymer modeling
HiP-HoP (Buckle et al., 2018)	CTCF and cohesin binding, histone marks or DHS	Whole Hi-C map	Polymer modeling
Rowley et al. (2017) model	GRO-seq + CTCF binding	Whole Hi-C map	Explicit algebraic model
PRISMR (Bianco et al., 2018)	Wild-type Hi-C data	Whole Hi-C map in mutated cells	Polymer modeling

Tools for modeling and predicting chromatin interactions.

DHS, DNase I hypersensitivity sites; TFs, transcription factors; TADs, topologically associated domains; ML, machine learning; NN, neural network; CNN, convolutional neural network; RNN, recurrent neural network; BART, Bayesian additive regression trees; PCA, principal component analysis.

Promoter–Enhancer Interactions

Interactions between promoters and enhancers are essential for the regulation of gene expression. Pioneering attempts to find such regulatory connections relied on either the correlation of epigenetic marks of promoters and enhancers across different cell types or evolutionary conservation of promoter–enhancer proximity in the linear DNA molecule (Spicuglia and Vanhille, 2012; Andersson and Sandelin, 2020). With the advent of genome-wide 3C-methods, we gained the ability to measure spatial proximity between genomic segments. The question about the exact role of spatial contacts between regulatory elements in the control of gene expression is still under active debate; however, many studies define “interacting” enhancers and promoters as pairs of loci belonging to the anchors of one Hi-C loop. Although we argue that using this loop-based definition of interacting promoters and enhancers might be confusing (see Box 1 and limitations section below for additional discussion), several algorithms are designed to predict enhancer–promoter pairs located within the anchors of one loop (Whalen et al., 2016).

Loops

Instead of predicting whether promoters and enhancers overlap loop anchors, some algorithms, such as Lollipop (Kai et al., 2018), 3DEpiloop (Al Bkhetan and Plewczynski, 2018), and EpiTensor (Zhu et al., 2016), are designed to directly infer all loop positions using epigenetic data. In mammals, most of the looping interactions are formed due to the cohesin-mediated loop extrusion process (see Box 1 for details). Thus, some algorithms, such as CTCF-MP (Zhang et al., 2018) or Lollipop (Kai et al., 2018), are focused exclusively on the prediction of CTCF-mediated interactions, whereas others, such as the DeepMILO algorithm (Trieu et al., 2020), separately assess the quality of prediction for CTCF-mediated and all other loops.

TADs

TADs have the shape of triangles on Hi-C maps, which indicates an increase of chromatin interaction frequency within TADs and insulation at TAD borders. These structures are largely dependent on the extrusion process and are also influenced by other mechanisms (see references provided in Box 1 for discussion of the TAD definition and current views on mechanisms explaining TAD formation). TADs are also relevant for promoter–enhancer interactions as the majority of the functional interactions occur within the same TAD. It is known that TAD boundaries are enriched in CTCF binding sites (usually in convergent orientation) and different epigenetic marks (Dixon et al., 2012). Based on these observations, Huang et al. (2015) use ChIP-seq data for different proteins in a computational model predicting TAD boundaries and chromatin interaction hubs.

Compartments

Chromatin compartments are the main features of distant contacts revealed by chromosome conformation capture. Hi-C maps show that interactions occur more often within each compartment rather than across compartments (Lieberman-Aiden et al., 2009). The presence of compartments results in a checkerboard-like (or “plaid-like”) pattern of contacts on Hi-C maps. It is shown that compartments reflect the clustering of different types of chromatin (see Box 1 for details). Seminal work proposed binary division of the genome into eu- and heterochromatin, which correspond to A- and B-compartments. Subsequent research extends this view, suggesting that multiple chromatin states exist, each described by a unique profile of spatial interactions (Rao et al., 2014). In accord with this, several models are proposed, allowing the prediction of compartmental interactions based on epigenetic data (Di Pierro et al., 2017; MacPherson et al., 2018). Most of these algorithms utilize physical modeling to infer spatial chromatin interactions. Machine learning methods are often used as a part of the algorithm to attribute genomic loci to a certain compartment based on its epigenetic signatures.

Hi-C Maps

Predictions of all aforementioned features require similar epigenetic information. Thus, it should be possible to develop an algorithm predicting all topological structures simultaneously. Because it is widely assumed that biologically relevant interactions do not occur at a distance above several megabases, most of the algorithms limit their prediction to these distances, which reduces computational time and resources. For instance, machine learning algorithms, such as 3DPredictor (Belokopytova et al., 2020), HiC-Reg (Zhang et al., 2019), Akita (Fudenberg et al., 2020), and DeepC (Schwessinger et al., 2020), predict all interactions within an ∼1–3 Mb window. In addition, some polymer modeling approaches, such as HiP-HoP (Buckle et al., 2018) and PRISMR (Bianco et al., 2018), could be used to predict the whole Hi-C heat map.

From Contact Frequencies to 3-D Models

Hi-C and other 3C-based methods provide a snapshot of pairwise interactions between loci. Although we call this “3-D” information, it cannot be trivially transformed into 3-D structures. An approach known as restraint-based (RB) modeling interprets the 3C-based data as a set of spatial restraints to build a 3-D model of the chromatin fiber by satisfying the input restraints. The chromatin fiber is represented as a polymer of consecutive monomers, and several computational optimization strategies can be employed to find 3-D models of chromatin (Dekker et al., 2013; Serra et al., 2015). The challenge of predicting 3-D genomic structures from high-resolution chromosome conformation capture data was recently taken by several groups, and we refer the reader to the recent review by Kimberly MacKay and Anthony Kusalik describing problems and solutions in this field (MacKay and Kusalik, 2020) and to the articles collected in the recently published book Modeling the 3D Conformation of Genomes (Tiana and Luca, 2019).

How Do the Modeling Algorithms Work? Problems and Limitations

All models and algorithms that are currently used to infer chromatin contacts from epigenetic data can be divided into two categories. The first includes models derived from the physical simulation of chromatin behavior, i.e., polymer modeling. The second includes statistical algorithms searching for interdependencies between genetic and epigenetic properties and patterns of 3-D contacts. Here, we describe the principles and limitations of both approaches.

Polymer Modeling

The physics of chromatin has been the subject of intense research over many decades. Seminal studies by de Gennes and Witten (1980) provide basic rules describing polymer behavior under different conditions. Importantly, these studies show that, when a polymer is large (i.e., its size significantly exceeds the size of individual monomers), its physical properties do not depend on the monomer’s chemical structure. Instead, the behavior of a polymer depends on several physical parameters, such as monomer concentration, solvent quality, and temperature. For different combinations of these parameters, the polymer would exist in one of the well-described equilibrium states, such as the random coil, the swollen coil, the equilibrium globular state, and others (Fudenberg and Mirny, 2012). Thus, knowing the key parameters and using the laws of polymer physics would allow the description (and prediction) of chromatin behavior within the nucleus. These ideas gave rise to the first physical models of chromatin architecture.

Development and validation of physical models during recent decades are linked to the development of experimental techniques measuring genome architecture (Figure 1). The presence of chromosome territories as well as measurements of mean distances between defined loci by FISH disagree with basic swollen coil or random coil polymer properties (Hahnfeldt et al., 1993). There were multiple attempts to resolve these disagreements, of which the fractal globule (Mirny, 2011) is currently the most accepted. This model, originally proposed by Grosberg et al. (1988), suggests that chromatin exists in a highly unknotted fractal-like non-equilibrium state, and the predictions obtained using this model fit well with the experimentally measured scaling of Hi-C contacts (Lieberman-Aiden et al., 2009).

FIGURE 1

Although the fractal globule recapitulates the experimentally observed scaling of chromatin contacts better than the equilibrium globule state, it is still far from a complete description of chromatin folding in a real cell. Leaving aside the remaining disagreements (see Grosberg, 2016, for a detailed review), the fractal globule represents a pictorial description of the chromatin structures and does not include locus-specific features. Thus, to build a more comprehensive description of chromatin conformation and dynamics in a real cell, active (energy-consuming) locus-specific mechanisms should be introduced into the system.

One such mechanism, which maintains the structure of chromatin, is a loop extrusion process (see Box 1 for details on this mechanism). This process was recently introduced into physical models of chromatin by Fudenberg et al. (2016) and Sanborn et al. (2015), and later experimentally validated by Ganji et al. (2018), Davidson et al. (2019), and Kim et al. (2019). A recent preprint from Banigan et al. (2020) shows another impressive application of polymer modeling in which it helps to investigate if a one- or two-sided loop extrusion model works in the cell and to identify a class of one-sided extrusion models that can reproduce in vivo experiments. The models of loop extrusion show good agreement with the experimental Hi-C data. Importantly, loop extrusion models use epigenetic information about CTCF binding to account for CTCF-mediated extrusion barriers. This allows making the model locus-specific; moreover, modifying CTCF anchors in silico results in different chromatin packaging as revealed by the models (Sanborn et al., 2015). Thus, such physical models allow predicting chromatin packaging and its perturbations knowing CTCF-binding sites.
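The role of CTCF orientation as an extrusion barrier can be illustrated with a deliberately simplified one-dimensional sketch. This is not the polymer simulation used in the cited studies: it follows a single two-sided extruder on a lattice and assumes that barriers are impermeable, that the extruder never unloads, and that a forward motif (`'>'`) halts only the leftward-moving leg while a reverse motif (`'<'`) halts only the rightward-moving leg. The function name `extrude` and the site positions are hypothetical.

```python
def extrude(load_site, ctcf, n_sites, max_steps=10_000):
    """Two-sided extrusion on a 1-D lattice until both legs are blocked.

    ctcf maps a lattice site to '>' (forward motif) or '<' (reverse motif).
    Assumption: barriers are impermeable and the extruder never unloads.
    Returns the list of (left, right) leg positions over time.
    """
    left = right = load_site
    trajectory = [(left, right)]
    for _ in range(max_steps):
        left_blocked = left == 0 or ctcf.get(left) == '>'
        right_blocked = right == n_sites - 1 or ctcf.get(right) == '<'
        if left_blocked and right_blocked:
            break
        if not left_blocked:
            left -= 1
        if not right_blocked:
            right += 1
        trajectory.append((left, right))
    return trajectory

# Hypothetical CTCF landscape: two forward sites and one reverse site.
ctcf = {20: '>', 60: '>', 80: '<'}
print(extrude(30, ctcf, n_sites=100)[-1])  # loop anchored at the convergent pair
print(extrude(70, ctcf, n_sites=100)[-1])  # left leg halts at the nearer forward site
```

In this sketch, the rightward leg passes the forward-oriented site at position 60 unhindered, so only convergently oriented sites end up as loop anchors. While one leg is already halted and the other keeps moving, the pairs of positions in the trajectory share one anchor, which is the lattice analog of a stripe.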

Another class of locus-specific models is designed to study and predict the packaging of different chromatin types. Distinct types of chromatin differentially interact with themselves and surrounding proteins. This can be imagined as a polymer composed of several distinct units or blocks. Such polymers are called block copolymers, and their behavior could be modeled knowing the interaction potential between blocks (Bates and Fredrickson, 1990). Several attempts have been made to apply this logic for modeling chromatin interactions in Drosophila and human (Jost et al., 2014; Di Pierro et al., 2016; Ulianov et al., 2016). These models predict that specific preferences of interactions between similar blocks of chromatin result in spatial segregation of distinct chromatin domains in the process of liquid–liquid phase separation (Nuebler et al., 2018).
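To make the notion of a block-type interaction potential concrete, the following sketch scores one fixed conformation under two different block patterns. It is an illustration only: the energies in `EPSILON`, the contact cutoff, and the hairpin coordinates are hypothetical, and real block copolymer models sample many conformations by molecular dynamics or Monte Carlo rather than scoring a single one.

```python
import math

# Hypothetical pairwise energies (arbitrary units): like blocks attract each other.
EPSILON = {("A", "A"): -1.0, ("B", "B"): -1.0, ("A", "B"): 0.0, ("B", "A"): 0.0}

def interaction_energy(positions, types, cutoff):
    """Sum type-dependent energies over non-bonded bead pairs within the cutoff."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 2, len(positions)):  # skip bonded neighbors (j = i + 1)
            if math.dist(positions[i], positions[j]) <= cutoff:
                energy += EPSILON[(types[i], types[j])]
    return energy

# A hairpin of eight beads: four along y = 0, then four back along y = 1.
hairpin = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (1, 1), (0, 1)]

# The same fold brings like blocks together for one pattern but not for the other.
print(interaction_energy(hairpin, "AABBBBAA", cutoff=1.1))
print(interaction_energy(hairpin, "AAAABBBB", cutoff=1.1))
```

The fold is energetically favorable only when it juxtaposes beads of the same type, which is the basic reason why such models produce spatial segregation of chromatin types.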

Block copolymer models rely on the epigenetic information about histone modifications and/or architectural factor binding to assign DNA segments to specific chromatin types. Once developed, these models could be used to predict chromatin architecture if epigenetic data is available. Indeed, several studies show that such prediction recapitulates Hi-C data very well (Di Pierro et al., 2017), especially when accounting for the loop extrusion process (Nuebler et al., 2018; Qi and Zhang, 2019).

To further extend block copolymer models, one should consider the physical nature of interactions between blocks. In a nucleus, these interactions are mediated by specific factors, such as polycomb-group proteins (Plys et al., 2019; Eeftens et al., 2020), BRD-domain containing proteins (Gibson et al., 2019), HP1 (Larson et al., 2017; Sanulli et al., 2019), mediator and RNA polymerase II (Cho et al., 2018), or interactions between DNA and nuclear lamina proteins (Chiang et al., 2019; Ulianov et al., 2019). The above-described block copolymer models account for these interactions implicitly by setting specific interaction potentials between different block types. Other models explicitly introduce binder proteins that mediate interactions in the system.

There are multiple ligand-binding theories applied to model DNA–protein interactions in chromatin, reviewed, for example, in Teif and Rippe, 2010. Among recent models that aim to explain genome-wide interaction profiles revealed by 3C-based methods, several consider specific chromatin binders, such as HP1 (Teif et al., 2015; MacPherson et al., 2018), lamina proteins (Chiang et al., 2019; Ulianov et al., 2019), or generic active and inactive complexes (Brackley et al., 2016b), whereas others describe binders as abstract molecules with defined physical properties but unknown biological nature (Nicodemi and Prisco, 2009; Barbieri et al., 2012; Brackley et al., 2013, 2017; Chiariello et al., 2016). Mechanistically, chromatin clustering may be reproduced by these models either due to the affinity of binders or because of multivalent interactions between binders and chromatin, which results in bridging-induced attraction (Brackley et al., 2013, 2017; Johnson et al., 2015). In addition to compartmentalization, these mechanisms could explain TAD and loop formation (Brackley et al., 2016b). For more details on these and other physical models, we refer the reader to a recently published extensive review (Brackey et al., 2020) and a collection of articles provided with the book (Tiana and Luca, 2019).

Here, it is pertinent to note that the binder positions are inferred from epigenetic data even in those models that use “abstract” binders. This allows predicting chromatin folding with high accuracy in normal and mutated genomes from epigenetic data (Scialdone et al., 2011; Bianco et al., 2012, 2018; Brackley et al., 2016a,b; Barbieri et al., 2017; Chiariello et al., 2017; Kragesteen et al., 2018). For example, the HiP-HoP model (Buckle et al., 2018) infers binder positions based on H3K27 acetylation data and/or chromatin accessibility, and the authors show that this epigenetic information is sufficient for prediction of chromatin interactions. In the PRISMR model (Bianco et al., 2018), Hi-C data obtained from wild-type cells are used to define the number of binder types and their affinities, and this information can be further used to model chromatin conformation after a deletion or duplication event occurs.

The examples mentioned above show that physical modeling could be a powerful tool for both validation of proposed molecular mechanisms underlying chromatin architecture and predicting spatial interactions based on epigenetic data. In the following, we discuss some limitations that should be addressed to allow a comprehensive description of genome organization by physical modeling.

Limitations of Physical Models

Physical Modeling Is Hypothesis-Driven

As was mentioned above, physical models rely on an explicitly defined set of rules to describe polymer behavior. However, we are still far from a complete understanding of all biophysical processes involved in chromatin organization. Thus, it is clear that none of the currently developed models can accurately explain all details of genome architecture and dynamics.

For example, the PRISMR and HiP-HoP models introduce specific binders whose positions and affinity could be inferred from experimental Hi-C or ChIP-seq data. The problem is not only that we do not know the correspondence between the model’s abstract binders and real proteins. The major concern is that these abstract binders might not be given the same physical properties as real proteins. Biochemical dissection of regulatory complexes, such as PRC1 or Mediator, shows the complexity of their structural organization and regulation, which is not described by current models. This limits modeling approaches to qualitative predictions of trends rather than quantitative comparison with contact maps.

Inferring Key Physical Parameters Might Be Challenging

There are many biophysical parameters that are currently unknown but essential for modeling. These include affinity constants and concentrations of chromatin binders, the positions of boundaries, the processivity of loop extruders, and other factors. One solution to this problem is extracting the missing parameters from available ChIP-seq data. For example, in the MEGABASE + MiChroM model developed by Di Pierro and colleagues, chromatin states are first inferred from epigenetic data using a machine learning approach and then used in a block copolymer model optimized to fit Hi-C data (Di Pierro et al., 2017). However, in many cases, available ChIP-seq data are only indirectly connected to the affinity and concentration of the key architectural factors, and the dependence between ChIP-seq signals and biophysical properties of chromatin may vary in different cell types. Thus, a model developed using one cell type might not be well transferable to another.

There are also models that fit their parameters directly using Hi-C data. This is, for example, the PRISMR model (Bianco et al., 2018), which defines binder types and positions based on Hi-C maps. The transferability of this model to other cell types or loci without knowing corresponding experimental Hi-C data could be problematic.

There are also several technical parameters of simulation that could influence the results, including the finite volume effect, polymer conformation used for model initialization, equilibration time, sampling size, etc. We refer those readers interested in this subject to a recent review describing potential pitfalls and methods developed to overcome these limitations (Gartner and Jayaraman, 2019).

Physical Modeling Is Computationally Intensive and Often Requires Coarse-Graining

Using a polymer modeling approach is computationally intensive. Technically, the vast majority of the physical models describe chromatin as beads on a string. Ideally, each bead should represent a single nucleosome as histone octamers are monomers of chromatin organization. However, this leads to a huge number of beads required to simulate chromosome-scale loci. The behavior of beads is typically simulated using LAMMPS software, which is computationally intensive for such a large number of objects. Substantial computational resources are needed for every modeling attempt, and these are not always accessible. Although it is possible to model only a particular chromosomal region, whole chromosome or whole genome modeling is computationally too expensive.
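A back-of-the-envelope calculation shows the scale of the problem. The numbers below are assumptions chosen for illustration: a haploid human genome of roughly 3.1 Gb and a nucleosome repeat length of roughly 200 bp. The helper `n_beads` is hypothetical.

```python
GENOME_BP = 3_100_000_000  # approximate haploid human genome size (assumption)
NUCLEOSOME_BP = 200        # approximate nucleosome repeat length (assumption)

def n_beads(region_bp, bp_per_bead):
    """Number of beads needed to cover a region at a given resolution."""
    return -(-region_bp // bp_per_bead)  # ceiling division

resolutions = [("one nucleosome", NUCLEOSOME_BP), ("1 kb", 1_000), ("100 kb", 100_000)]
for label, bp_per_bead in resolutions:
    beads = n_beads(GENOME_BP, bp_per_bead)
    map_entries = beads * (beads + 1) // 2  # upper triangle of the contact map
    print(f"{label:>15} per bead: {beads:>12,} beads, {map_entries:.2e} map entries")
```

Under these assumptions, nucleosome resolution requires on the order of ten million beads for the whole genome, whereas a 100-kb resolution requires only tens of thousands.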

One solution could be to decrease the resolution and use more coarse-grained models, with which several atoms or molecules are grouped and represented by a single simple object. However, this comes at a cost of the inability to resolve fine patterns of interactions. There are multiple levels of chromatin coarse-graining, starting from atomic resolution and up to hundreds of thousands of base pairs, each suitable for the specific problem of interest (see Table 1 in the recent review published by Brackey et al., 2020). The choice of coarse-graining should be considered carefully in order to find a balance between the detail of the model and computational cost.

To sum up, physical modeling is essential for validating hypotheses about mechanisms driving chromatin organization. When using epigenetic data to infer properties of chromatin monomers, it is easy to repurpose a physical model from hypothesis validation to prediction of locus-specific chromatin organization. However, there are several limitations of these predictions, and we next describe another class of approaches based on machine learning techniques that have the potential to overcome some of the aforementioned limitations.

Statistical Approach

It is known that different epigenetic marks and transcription factors correlate with various regulatory elements, chromatin states, and other genomic features. For example, histone modification H3K9me3 correlates well with constitutive heterochromatin, which correlates with the B compartment (Strom et al., 2017), TAD boundaries are enriched in CTCF protein (Dixon et al., 2012; Rao et al., 2014), and open chromatin regions are enriched in specific histone modifications. Thus, one can simply use regression to predict 3-D genome features based on epigenetic data. For example, correlation-based methods are used for the prediction of enhancer–promoter interactions using histone modifications, CAGE, ChIP-seq, and other chromatin features as input (Xu et al., 2020).

Although linear models could explain 3-D organization to some extent, it is clear that certain dependencies between genetic features and chromatin interactions are not linear. The most prominent example of such non-linearity is the scaling of the average chromatin contact frequency with genomic distance, which is well described by a power law. This dependence, P(s) ∼ s^x, has only one free parameter x, which can be easily obtained by fitting experimental data. Of course, accounting for distance dependence alone is not enough to obtain accurate estimates of contact frequencies. One should also describe locus-specific insulation, compartmentalization, and other features of genome organization. This description takes the form of algebraic expressions with free parameters that can be fit from the data. This approach was used by Rowley et al. (2017), who proposed an algebraic expression combining linear and exponential terms to predict genomic contacts based on GRO-seq transcription data, CTCF binding, and genomic distance. As a result, Rowley et al. simulated Hi-C maps that reproduce the main 3-D structures, such as TADs and loops, with high accuracy.
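As an illustration of how the exponent x in P(s) ∼ s^x can be obtained from data, the sketch below fits a straight line to log-transformed contact frequencies by ordinary least squares. The function name, the synthetic distances, and the synthetic contact frequencies are assumptions made for this example and do not come from any of the cited studies.

```python
import math

def fit_power_law_exponent(distances, contact_freqs):
    """Least-squares slope of log P(s) versus log s, i.e. the exponent x in P(s) ~ s^x."""
    # Keep only strictly positive pairs, because the logarithm is undefined otherwise.
    pairs = [(math.log(s), math.log(p))
             for s, p in zip(distances, contact_freqs) if s > 0 and p > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive (distance, frequency) pairs")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx

# Synthetic (hypothetical) data: average contact frequency decaying as 1/s.
distances = [10_000 * k for k in range(1, 51)]
contacts = [5.0 * s ** -1.0 for s in distances]
exponent = fit_power_law_exponent(distances, contacts)
print(round(exponent, 3))
```

With real Hi-C data, the average contact frequency at each genomic distance would first be computed from the map, and the fit is usually restricted to the distance range in which the power law holds.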

However, there might be multiple non-linear dependencies between histone modifications, transcription factor binding, and chromatin interactions, which cannot be defined analytically as an algebraic expression, such as a power law. These dependencies could be found by sophisticated machine learning algorithms, such as logistic regression, gradient boosting, random forest regression, neural networks, and others (Eraslan et al., 2019; Figure 2).

FIGURE 2

Machine learning algorithms operate with a numerical representation of input information (features), such as nucleotide sequence, genomic distance, or epigenetic marks, and with experimentally measured target values, such as contact frequency between loci, positions of loop anchors, etc. The main result of machine learning training is a function that transforms input features into predictions of target values. The similarity between predictions and experimental data is measured using a user-defined loss function. During a training step, a portion of the available data called the training subsample is used to optimize the transforming function so that the loss function is minimal; this is how the algorithm finds interdependencies between features and target values. These interdependencies might represent general biological mechanisms or be subsampling artifacts specific to the training subsample. Moreover, the function transforming the input features into predictions of target values typically has numerous adjustable parameters. This could allow fitting the detail and noise in the training data to the extent that it negatively impacts the performance of the model on held-out data. In this case, the developed algorithm is of no use even if its accuracy on the training data is high because it cannot generalize to unseen samples. This problem is well known in the machine learning field under the name of “overfitting.” To verify that any increase in accuracy over the training subset is generalizable, the algorithm should be evaluated using a portion of unseen data (validation subset). It is essential that the validation subset does not contain samples present in the training subset. However, during the design of training and validation subsets, one should note that genomic objects that are not equivalent from a mathematical point of view might share a large amount of biological information. 
For example, nested chromatin loops might share a large portion of epigenetic information encoded by the window spanning loop anchors although the anchors themselves do not overlap and formally represent different pairs of genomic regions. Such indirect overlap results in the sharing of information between training and validation data sets, leading to an overestimation of prediction accuracy (Belokopytova et al., 2020). To overcome this problem, one can use different chromosomes for training and validation data sets.
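The chromosome-based separation of training and validation data can be sketched as follows. This is a minimal example under stated assumptions: the sample dictionaries, their keys, and the function name are hypothetical and only illustrate the idea that no chromosome contributes to both subsets.

```python
def split_by_chromosome(samples, validation_chroms):
    """Split samples so that no chromosome contributes to both subsets."""
    validation_chroms = set(validation_chroms)
    train, validation = [], []
    for sample in samples:
        target = validation if sample["chrom"] in validation_chroms else train
        target.append(sample)
    return train, validation

# Hypothetical locus pairs with a measured contact value.
samples = [
    {"chrom": "chr1", "start": 100_000, "end": 250_000, "contact": 12.0},
    {"chrom": "chr1", "start": 120_000, "end": 240_000, "contact": 15.0},  # nested in the pair above
    {"chrom": "chr2", "start": 500_000, "end": 650_000, "contact": 8.0},
    {"chrom": "chr3", "start": 300_000, "end": 900_000, "contact": 2.5},
]
train, validation = split_by_chromosome(samples, validation_chroms={"chr2"})
```

Because the two nested chr1 pairs always end up in the same subset, the epigenetic information they share cannot leak from training into validation, which a random split of individual pairs would not guarantee.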

Machine learning–based algorithms can find complex non-linear patterns when fitting the model. Machine learning is used both for binary classifiers and for regression-based models, enabling the prediction of structures ranging from two-point interactions to whole Hi-C maps. Several algorithms employing these methods for promoter–enhancer interaction prediction were recently developed, including TargetFinder (Whalen et al., 2016), DeepTACT (Li et al., 2019), 3DPredictor (Belokopytova et al., 2020), and HiC-Reg (Zhang et al., 2019). We refer the reader to the informative review of Xu et al. (2020) describing different algorithms for the prediction of enhancer–promoter interactions. Other spatial chromatin structures, such as loops (Zhu et al., 2016; Al Bkhetan and Plewczynski, 2018; Kai et al., 2018; Zhang et al., 2018; Trieu et al., 2020) and contact probabilities (Zhang et al., 2019; Belokopytova et al., 2020; Fudenberg et al., 2020; Schwessinger et al., 2020), can also be predicted by machine learning–based algorithms (see the section above). Furthermore, a machine learning–based approach can reveal the biological features underlying 3-D genome folding, which improves our understanding of biological mechanisms. For example, extracting position weight matrices from the layers of convolutional neural networks helps to identify the main features, in particular, the sequences that contribute most to the prediction and, consequently, to the 3-D chromatin structure. Another example is the analysis of feature importance in a gradient-boosting algorithm, which gives a ranked list of features and helps to find the most informative ones. In either case, analysis of features and algorithm parameters can suggest hypotheses about the biological mechanisms underlying the process under study.

Challenges and Limitations

Defining Target Features and Their Properties

The development of a predictive algorithm should start from a clear statement of biological features one wants to predict. Clear definitions of the features are important for the selection of positive and negative samples as well as for the choice of the machine learning algorithm.

Let us consider the goal of predicting interacting promoter–enhancer pairs. How would one define positive cases, i.e., interacting pairs? It is now clear that the majority of loops (see Box 1 for details of mechanisms underlying these structures) observed on Hi-C maps are due to the joint activity of cohesin and CTCF proteins. These complexes form loops that might facilitate interactions of promoters and enhancers located within the looping region by reducing the spatial distance between them, but they do not necessarily directly mediate contacts between these regulatory elements. In accord with this, direct functional tests based on targeted enhancer deletions or CRISPR-interference approaches (Gasperini et al., 2019) indicate that the vast majority of interacting enhancer–promoter pairs do not overlap with loop anchors although they are often located within a reasonable distance from them (Belokopytova et al., 2020). Thus, functionally interacting enhancer–promoter pairs might show only a slight increase in contact frequency. It is worth noting that the NG Capture-C approach (Davies et al., 2015) provides more sensitive and robust quantitation and detects more significant interactions than Hi-C; however, Hi-C data are more widespread and available. At the same time, the majority of algorithms predicting 3-D genome structures are classifiers, so they address only the question of whether a promoter and an enhancer interact, answering yes or no. We argue that quantitative measurement and prediction of spatial enhancer–promoter interactions are more informative than qualitative attribution to the loop anchors, and regression-based methods are more suited for such predictions.

Another example of varying feature definition is loop prediction. In this case, authors often use loops called by specific algorithms as positive samples. However, a large proportion of loop calls differs between algorithms and from visually assessed loops (Belokopytova et al., 2020; Salameh et al., 2020). Methods for loop detection, like those for TAD detection, are constantly improving. For example, the recently published loop-calling method Peakachu can detect more loops than previous algorithms (Salameh et al., 2020). The same applies to TAD calling: Zufferey et al. (2018) compared 22 different TAD callers and found that TAD sizes and numbers vary significantly among callers and data resolutions.

To sum up, it is very important to consider the nature and biological properties of target features and carefully design positive and negative samples if using classifiers for prediction.

Predicting Single-Cell Data

The statistical approach is well suited for 3-D genome structure prediction and investigation, but it uses population data. The resulting prediction is actually a mean value for a cell population, which provides no information about the 3-D genome organization of a single cell or about differences in spatial contacts between individual cells. Conversely, physical modeling always produces ensembles of single-cell chromatin configurations. Nevertheless, this does not mean that a predicted configuration exactly matches a real biological cell, even if the ensemble average matches population Hi-C data. Recently, however, Conte et al. (2020) showed consistent agreement between the predicted structures and independent single-cell super-resolution microscopy data, which provides evidence that, at least in the studied loci, polymer physics approaches accurately capture single-cell chromatin conformation. This issue remains under active debate.

Understanding Mechanisms Underlying Prediction

Another limitation is that one cannot extract a simple algebraic formula transforming features into target values from a trained machine learning model. Therefore, the statistical dependencies found by machine learning algorithms are difficult to interpret in biological terms. Nevertheless, it is possible to evaluate the contribution of each feature to the prediction. We have already discussed several approaches for the estimation of feature importance above; in addition, modifying features in silico and assessing how the modifications impact prediction could provide insights into the role of the biological features used for prediction (Fudenberg et al., 2020).

Choosing Data Parameterization Function

To train a machine learning model, input data should be represented in a specific format, typically as a numeric vector of fixed length. The process of converting the input data into the desired format is called parameterization, and choosing the parameterization function might not be trivial. For example, ChIP-seq data are often used for the prediction of spatial chromatin contacts. There are several ways to submit these data to the algorithm: as the sum of ChIP-seq signals in the interval between two genomic loci of interest, the total number of peaks in this region, the signal value of the nearest ChIP-seq peaks, the p-values of peaks, etc. In our experience, differences in parameterization could significantly affect prediction accuracy. Thus, a particularly challenging part is to choose the parameterization that yields the best performance of the algorithm.
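The parameterization options listed above can be made concrete with a short sketch. The peak representation as (position, signal) tuples, the function name, and the choice of four features are assumptions for illustration only; they are not the parameterization used by any particular published tool.

```python
def parameterize_chipseq(peaks, left, right):
    """Turn ChIP-seq peaks into a fixed-length feature vector for a locus pair.

    peaks: list of (position, signal) tuples on one chromosome.
    left, right: coordinates of the two loci of interest (left < right).
    """
    inside = [(pos, sig) for pos, sig in peaks if left <= pos <= right]
    total_signal = sum(sig for _, sig in inside)   # summed signal between the loci
    peak_count = len(inside)                       # number of peaks between the loci
    # Signal of the peak nearest to each locus (0.0 when there are no peaks at all).
    nearest_left = min(peaks, key=lambda p: abs(p[0] - left))[1] if peaks else 0.0
    nearest_right = min(peaks, key=lambda p: abs(p[0] - right))[1] if peaks else 0.0
    return [total_signal, float(peak_count), nearest_left, nearest_right]

# Hypothetical peaks and loci.
peaks = [(1_000, 4.0), (5_500, 9.0), (12_000, 2.5), (30_000, 7.0)]
features = parameterize_chipseq(peaks, left=5_000, right=13_000)
```

Each of the four numbers encodes the same experiment in a different way, so a model trained on one encoding may perform differently from a model trained on another, which is why several encodings are usually worth comparing.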

Input Data Quality

Another important issue is the quality of the training data. Some machine learning algorithms are sensitive to outliers present in the data. In this case, the data should be smoothed or transformed before training the model. For example, for Hi-C and RNA-seq data, it is often useful to log-transform values.
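A log-transformation of this kind can be sketched in a few lines. The pseudocount of 1.0, which keeps zero counts defined, and the example values are assumptions for illustration; the appropriate pseudocount depends on the data.

```python
import math

def log_transform(values, pseudocount=1.0):
    """Natural-log transform with a pseudocount so that zero values remain defined."""
    return [math.log(v + pseudocount) for v in values]

# Hypothetical raw values with one extreme outlier.
raw = [0.0, 3.0, 10.0, 2500.0]
transformed = log_transform(raw)
```

After the transformation, the outlier is only a few units away from the other values instead of several hundred times larger, while the ordering of the values is preserved.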

Recently, high-resolution Hi-C maps were published (Hsieh et al., 2015, 2020; Krietenstein et al., 2020). They reveal chromatin structures in more detail and thereby improve predictions. Moreover, we noticed that the prediction of a higher-resolution heat map is more accurate than the prediction of the same heat map at a lower resolution (Belokopytova et al., 2020). This can be explained by the features used for prediction. Much of the information comes from ChIP-seq data, in which a protein-binding event is attributed to a small locus (usually less than 200 base pairs). Thus, using ultra-high-resolution Hi-C maps provides a better correspondence between protein-binding sites and interacting loci, allowing the model to learn effects mediated by specific proteins in a more direct way.

Overfitting

Another problem of machine learning approaches is overfitting. In this case, the model performs well on the training data set but poorly on a holdout sample, which means that it does not capture the real complex patterns underlying the 3-D genome structure. Non-overlapping subsets for training and validation help to detect overfitting. There are two main ways to minimize overfitting: training the model on more examples and changing the complexity of the model. However, in the case of biological data, it is not always possible to have enough training samples. To increase the number of samples, it may be necessary to combine data from multiple sources. This leads to the next challenge: normalizing data from different sources, which requires rigorous data preprocessing (Xu and Jackson, 2019).

What Do We Consider a Good Prediction?

Every data type has its own specifics, and this is also true for the Hi-C maps discussed below. It should be remembered that binary classifiers or regression-based methods are usually used for 3-D chromatin architecture prediction. There are some common metrics to assess a binary classifier’s performance, such as F1-score, AUC, and others. These metrics do not have any special characteristics related to genomic data.

The performance estimation of regression-based methods is more specific to Hi-C maps. How can we decide that one heat map is similar to another? A Hi-C map is a matrix of numbers, so we can apply any metric for matrix comparison.

The basic metric is Pearson’s correlation. Let us consider, for instance, a Pearson’s correlation equal to 0.8: Does this correspond to a good or bad prediction? Intuitively, it seems that a Pearson’s correlation equal to 0.8 indicates accurate prediction. However, interpreting absolute values is not a good idea. As we discussed above, contact probability shows a prominent dependence on distance, and even very simple prediction algorithms efficiently capture this dependence. Even when the distance between loci is not directly provided, it could be inferred from many epigenetic features. For example, cumulative ChIP-seq signals scale with the length of the genomic region, allowing prediction of contact probability. As we show in Figure 3, using randomly shuffled ChIP-seq signals, which have no biological meaning, allows the generation of predictions highly correlating with experimental data. Also, the whole-map correlation coefficient does not reflect the prediction of specific topological structures, such as TADs, loops, or compartments.
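The effect of distance dependence on the whole-map correlation can be reproduced with a toy example. In the sketch below, the "experimental" map is synthetic: contacts decay as 1/d and are enriched 1.5-fold inside four hypothetical domains of 10 bins each, whereas the "prediction" knows only the distance. All values and names are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

n_bins = 40
experimental, distance_only = [], []
for i in range(n_bins):
    for j in range(i + 1, n_bins):
        d = j - i
        same_domain = (i // 10) == (j // 10)   # four toy domains of 10 bins each
        experimental.append((1.0 / d) * (1.5 if same_domain else 1.0))
        distance_only.append(1.0 / d)          # prediction that uses nothing but distance

r = pearson(experimental, distance_only)
print(round(r, 3))
```

Although the distance-only prediction contains no information about the domains, its whole-map correlation with the toy experimental map is very high, which illustrates why an absolute Pearson value alone says little about prediction quality.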

FIGURE 3

There are several workarounds allowing the comparison of Hi-C maps using correlation coefficients. First, one can compare the correlation between predicted and experimental data with the correlation between experimental replicates. Ideally, the prediction should be as similar to the experimental data as replicates are to each other. However, replicates are not always available; in addition, Yang et al. showed that Pearson’s correlation between unrelated samples is sometimes as high as that between replicates (Figure 3 in Yang et al., 2017).

Another baseline could be obtained by scoring differences of Hi-C maps between distinct cell types. Chromatin organization is moderately conserved between different cell types (Dixon et al., 2012; Battulin et al., 2015) and even between different species (Fishman et al., 2019; Nuriddinov and Fishman, 2019); thus, predicting cell type–specific features might be more challenging than predicting the overall 3-D organization. For a high-quality algorithm, one would expect the difference between prediction and experimental data on the target cell type to be smaller than the difference between cell types. Besides this, one should carefully select data sets for comparison, accounting for their noise level. A lower noise level in the experimental data on the target cell type results in higher measures of prediction accuracy, whereas a high noise level in the cell type used as a baseline results in low baseline metrics, thus overestimating predictive power.

To overcome the limitations of standard correlations as measurements of Hi-C map similarity, Yang et al. proposed a framework that first minimizes the effect of noise and biases by smoothing the Hi-C matrix and then addresses the distance-dependence effect by stratifying Hi-C data according to genomic distance (Yang et al., 2017). The resulting metric, the stratum-adjusted correlation coefficient (SCC), distinguishes subtle differences between closely related cell lines, biological replicates, and pseudoreplicates (Figure 3 in Yang et al., 2017).
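The idea of stratification by genomic distance can be sketched as follows. Note that this is a deliberately simplified illustration and not the SCC of Yang et al. (2017): it omits the smoothing step and uses plain stratum sizes as weights, and the function names and toy maps are assumptions for this example.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation; returns None when either vector is constant."""
    if min(xs) == max(xs) or min(ys) == max(ys):
        return None
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def stratified_correlation(map_a, map_b, max_distance):
    """Size-weighted mean of per-distance Pearson correlations between two square maps."""
    strata = defaultdict(lambda: ([], []))
    n = len(map_a)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_distance + 1)):
            xs, ys = strata[j - i]          # one stratum per genomic distance
            xs.append(map_a[i][j])
            ys.append(map_b[i][j])
    weighted_sum, total_weight = 0.0, 0
    for xs, ys in strata.values():
        r = pearson(xs, ys) if len(xs) > 1 else None
        if r is not None:                   # skip strata without any variation
            weighted_sum += r * len(xs)
            total_weight += len(xs)
    return weighted_sum / total_weight if total_weight else None

# Toy maps: 1/d decay, with 1.5-fold enrichment inside hypothetical 5-bin domains.
n = 20
experimental = [[0.0] * n for _ in range(n)]
distance_only = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        boost = 1.5 if i // 5 == j // 5 else 1.0
        experimental[i][j] = boost / (j - i)
        distance_only[i][j] = 1.0 / (j - i)

self_score = stratified_correlation(experimental, experimental, max_distance=10)
baseline_score = stratified_correlation(experimental, distance_only, max_distance=10)
```

In this toy setting, a prediction that depends only on distance is constant within every stratum, so no stratum contributes and the function returns `None` for it; the prediction receives no credit for capturing the distance decay alone.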

Besides Pearson’s correlation and SCC, standard metrics for matrix comparison, such as MAE, MRE, and others, can be used for algorithm performance estimation. As with Pearson’s correlation, interpreting the values of these metrics requires a comparison with a baseline. Overall, we recommend using several metrics and several baselines for the optimal assessment of prediction accuracy (Figure 3).
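For reference, two such element-wise metrics can be written down directly. The sketch assumes that MAE stands for mean absolute error and MRE for mean relative error, the latter computed only over entries with a non-zero observed value; these readings and the example numbers are assumptions for illustration.

```python
def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted| over all entries."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_relative_error(observed, predicted):
    """Mean of |observed - predicted| / |observed| over entries with non-zero observed value."""
    terms = [abs(o - p) / abs(o) for o, p in zip(observed, predicted) if o != 0]
    return sum(terms) / len(terms)

# Hypothetical flattened contact values.
observed = [10.0, 4.0, 2.0, 1.0]
predicted = [8.0, 5.0, 2.0, 2.0]
mae = mean_absolute_error(observed, predicted)
mre = mean_relative_error(observed, predicted)
```

The absolute error is dominated by the large short-range contacts, whereas the relative error weights weak long-range contacts as strongly as strong ones, so the two metrics can rank the same predictions differently.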

Nevertheless, it is useful to visualize the predicted Hi-C map for empirical assessment to be confident that the chosen metric correctly reflects the differences between heat maps. Another way is to assess how well specific 3-D chromatin structures, such as TADs and loops, are predicted. To obtain such statistics, one can call loops or insulating boundaries on the experimental and predicted maps and then compare the detected structures and their overlap.

The selection of metrics for prediction accuracy estimation is an important issue for every algorithm. The chosen metric should correctly reflect differences in 3-D chromatin features.
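Comparing structures called on predicted and experimental maps can be sketched as a simple overlap count. The loop coordinates, the 10-kb tolerance, and the function name below are hypothetical; in practice the tolerance would be chosen according to the map resolution.

```python
def matched_loops(query, reference, tolerance):
    """Count query loops whose both anchors lie within `tolerance` bp of a reference loop."""
    matched = 0
    for q_left, q_right in query:
        if any(abs(q_left - r_left) <= tolerance and abs(q_right - r_right) <= tolerance
               for r_left, r_right in reference):
            matched += 1
    return matched

# Hypothetical loop calls given as (left anchor, right anchor) in base pairs.
experimental_loops = [(100_000, 400_000), (900_000, 1_250_000), (2_000_000, 2_300_000)]
predicted_loops = [(105_000, 395_000), (900_000, 1_500_000),
                   (2_010_000, 2_300_000), (5_000_000, 5_200_000)]

hits = matched_loops(predicted_loops, experimental_loops, tolerance=10_000)
precision = hits / len(predicted_loops)
recall = matched_loops(experimental_loops, predicted_loops, tolerance=10_000) / len(experimental_loops)
```

Reporting both the fraction of predicted loops that are supported experimentally and the fraction of experimental loops that are recovered avoids rewarding a method that simply calls very few or very many loops.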

Prediction of Functional Consequences of Rearrangements

Some rearrangements are known to change the 3-D chromatin structure, causing diseases. Several works show the importance of chromatin folding in the gene regulation process (Franke et al., 2016; Rodríguez-Carballo et al., 2017; Kraft et al., 2019). Inversions, duplications, and other rearrangements can lead to TAD disruption, changes in promoter–enhancer interactions, and the emergence of new interactions between regulatory elements and genes. These insights are significant for medical genetics because the interpretation of chromosomal rearrangements in non-coding regions remains a big challenge. Zepeda-Mendoza et al. (2018) provide detailed instructions on how to run a computational pipeline that identifies relevant candidates for position effects of non-coding balanced and apparently balanced chromosomal abnormalities. This pipeline includes an analysis of TADs and of possible changes in enhancer–promoter interactions due to the rearrangement. Hence, the analysis of the consequences of chromosomal rearrangements in the context of the 3-D genome structure is becoming a routine assay. The recently published machine learning algorithm TADA (Hertzberg et al., 2020) can prioritize large chromosomal alterations, such as copy number variants (CNVs), based on their pathogenicity.

Besides the prediction of the overall rearrangement effect, it is possible to predict changes in 3-D genome structures such as TADs and loops. The 3D-GNOME algorithm (Sadowski et al., 2019; Wlasnowolski et al., 2020) generates chromatin 3-D structures using a Monte Carlo approach based on chromatin conformation capture (3C) data. It uses high-quality CTCF or RNA polymerase II ChIA-PET data as a reference chromatin interaction pattern. For rearrangement prediction, it applies a series of simple rules to recover chromatin interaction patterns. The 3D-GNOME algorithm can visualize alterations emerging in genomic structures after the introduction of structural variants (SVs)¹. Another approach is to predict changes in chromatin loops using the machine learning–based DeepMILO algorithm (Trieu et al., 2020). The algorithm can extract features directly from the DNA sequences of loop anchors without using information about the presence and orientation of CTCF motifs. This allows it to predict true Hi-C loops that lack a CTCF signal at their anchors. DeepMILO can predict the effects of even small mutations, and its authors identified insulator loops predicted to change in multiple cancer patients as well as genes affected by these loops.

The aforementioned algorithms predict the perturbation of specific chromatin structures, such as loops and TADs. Other tools are capable of predicting a complete Hi-C map of the mutated locus. Algorithms such as Akita (Fudenberg et al., 2020), DeepC (Schwessinger et al., 2020), 3DPredictor (Belokopytova et al., 2020), PRISMR (Bianco et al., 2018), and others can predict alterations of 3-D chromatin architecture induced by structural variants.

An area of increasing interest and active research is the effect of small INDELs and single-base-pair variants on chromatin architecture. It is known that even a single-nucleotide substitution can lead to changes in 3-D genome structure, for example, by modifying CTCF binding sites (Schmiedel et al., 2016; Sun et al., 2020). A separate task for predictive algorithms is to foresee the consequences of such mutations. Some algorithms, such as DeepMILO (Trieu et al., 2020), Akita (Fudenberg et al., 2020), and DeepC (Schwessinger et al., 2020), use the nucleotide sequence as the main feature for prediction. These algorithms are very powerful in predicting changes induced by small mutations because the mutations directly affect input features. On the other hand, training these algorithms requires knowledge of 3-D chromatin organization in wild-type cells of the same type because a nucleotide sequence does not provide cell type–specific epigenetic information.

Other algorithms do not use nucleotide sequences for prediction directly. In this case, it is important to model the changes in input features caused by an SNP or a small INDEL. For instance, in the case of polymer modeling, one needs to change the positions of binding sites or to remove the part of the polymer corresponding to the mutated DNA. The same applies to statistical approaches that do not use the nucleotide sequence as a feature for prediction.
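For a deletion, modeling the change in input features can be as simple as removing the affected signal and shifting downstream coordinates. The sketch below makes a strong simplifying assumption, namely that the epigenetic signal outside the deleted interval is unchanged in the mutant; the peak representation and function name are hypothetical.

```python
def apply_deletion(peaks, del_start, del_end):
    """Return the peaks as they would appear on an allele lacking [del_start, del_end)."""
    deleted_length = del_end - del_start
    rearranged = []
    for pos, signal in peaks:
        if pos < del_start:
            rearranged.append((pos, signal))                    # upstream: unchanged
        elif pos >= del_end:
            rearranged.append((pos - deleted_length, signal))   # downstream: shifted left
        # Peaks inside the deleted interval are dropped.
    return rearranged

# Hypothetical wild-type peaks given as (position, signal).
wild_type = [(10_000, 5.0), (42_000, 8.0), (60_000, 3.0)]
mutant = apply_deletion(wild_type, del_start=40_000, del_end=50_000)
```

The rearranged feature track can then be parameterized in the same way as the wild-type track and passed to a trained model to obtain a prediction for the mutated locus.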

Conclusion

The mechanisms that underlie genome organization are intensively studied. Multiple groups developed computational algorithms to explain mechanisms underlying genome architecture and predict chromatin folding in normal and mutated cells. However, there is still no approach that is able to completely describe the whole complexity of the nuclear organization. Physical models are limited by incomplete knowledge of mechanisms and relevant system parameters, such as interaction affinities and concentrations. Statistical methods do not allow understanding of the exact mechanisms underlying captured dependencies. And for both methods, it is not clear whether developed algorithms trained and validated using several cell types could be broadly and efficiently transferred to other cell types and conditions.

The latter question could be answered using the rapidly growing number of high-resolution Hi-C data sets. Multiple published experimental data sets describe 3-D genome structure in normal and rearranged genomes. Such experiments provide detailed Hi-C maps of mutated regions that can be used as validation data for predictive algorithms. We believe that benchmarking and comparing existing predictive algorithms using these data sets would help to describe their power and limitations and to develop new, comprehensive approaches for the prediction of chromatin organization and dynamics in the future.

Statements

Author contributions

Both authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Funding

This work was supported by the RSF grant #19-74-00102. Computations shown in Figure 3 were performed using nodes of the Novosibirsk State University high-throughput computation cluster [supported by the Ministry of Education and Science of Russian Federation, grant #2019-0546 (FSUS-2020-0040)].

Acknowledgments

We thank Emil Valeev and Olga Gladkih who helped us with designing illustrations.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

1.^https://3dgnome.cent.uw.edu.pl/

References

1
Al Bkhetan, Z., Plewczynski, D. (2018). Three-dimensional epigenome statistical model: genome-wide chromatin looping prediction. Sci. Rep. 8:5217. doi: 10.1038/s41598-018-23276-8
2
Andersson, R., Sandelin, A. (2020). Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 21, 71–87. doi: 10.1038/s41576-019-0173-8
3
Banigan, E., van den Berg, A., Brandão, H., Marko, J., Mirny, L. (2020). Chromosome organization by one-sided and two-sided loop extrusion. Elife 9:815340. doi: 10.1101/815340
4
Barbieri, M., Chotalia, M., Fraser, J., Lavitas, L. M., Dostie, J., Pombo, A., et al. (2012). Complexity of chromatin folding is captured by the strings and binders switch model. Proc. Natl. Acad. Sci. U.S.A. 109, 16173–16178. doi: 10.1073/pnas.1204799109
5
Barbieri, M., Xie, S. Q., Torlai Triglia, E., Chiariello, A. M., Bianco, S., De Santiago, I., et al. (2017). Active and poised promoter states drive folding of the extended HoxB locus in mouse embryonic stem cells. Nat. Struct. Mol. Biol. 24, 515–524. doi: 10.1038/nsmb.3402
6
Bates, F. S., Fredrickson, G. H. (1990). Block copolymer thermodynamics: theory and experiment. Annu. Rev. Phys. Chem. 41, 525–557. doi: 10.1146/annurev.pc.41.100190.002521
7
Battulin, N., Fishman, V. S., Mazur, A. M., Pomaznoy, M., Khabarova, A. A., Afonnikov, D. A., et al. (2015). Comparison of the three-dimensional organization of sperm and fibroblast genomes using the Hi-C approach. Genome Biol. 16:77. doi: 10.1186/s13059-015-0642-0
8
Beagrie, R. A., Scialdone, A., Schueler, M., Kraemer, D. C. A., Chotalia, M., Xie, S. Q., et al. (2017). Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524. doi: 10.1038/nature21411
9
Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D., Fishman, V. (2020). Quantitative prediction of enhancer–promoter interactions. Genome Res. 30, 72–84. doi: 10.1101/gr.249367.119
10
Bianco, S., Lupiáñez, D. G., Chiariello, A. M., Annunziatella, C., Kraft, K., Schöpflin, R., et al. (2018). Polymer physics predicts the effects of structural variants on chromatin architecture. Nat. Genet. 50, 662–667. doi: 10.1038/s41588-018-0098-8
11
Bianco, V., Scialdone, A., Nicodemi, M. (2012). Colocalization of multiple DNA loci: a physical mechanism. Biophys. J. 103, 2223–2232. doi: 10.1016/j.bpj.2012.08.056
12
Boettiger, A., Murphy, S. (2020). Advances in chromatin imaging at kilobase-scale resolution. Trends Genet. 36, 273–287. doi: 10.1016/j.tig.2019.12.010
13
Brackey, C. A., Marenduzzo, D., Gilbert, N. (2020). Mechanistic modeling of chromatin folding to understand function. Nat. Methods 17, 767–775. doi: 10.1038/s41592-020-0852-6
14
Brackley, C. A., Brown, J. M., Waithe, D., Babbs, C., Davies, J., Hughes, J. R., et al. (2016a). Predicting the three-dimensional folding of cis-regulatory regions in mammalian genomes using bioinformatic data and polymer models. Genome Biol. 17:59. doi: 10.1186/s13059-016-0909-0
15
Brackley, C. A., Johnson, J., Kelly, S., Cook, P. R., Marenduzzo, D. (2016b). Simulated binding of transcription factors to active and inactive regions folds human chromosomes into loops, rosettes and topological domains. Nucleic Acids Res. 44, 3503–3512. doi: 10.1093/nar/gkw135
16
Brackley, C. A., Liebchen, B., Michieletto, D., Mouvet, F., Cook, P. R., Marenduzzo, D. (2017). Ephemeral protein binding to DNA shapes stable nuclear bodies and chromatin domains. Biophys. J. 112, 1085–1093. doi: 10.1016/j.bpj.2017.01.025
17
Brackley, C. A., Taylor, S., Papantonis, A., Cook, P. R., Marenduzzo, D. (2013). Nonspecific bridging-induced attraction drives clustering of DNA-binding proteins and genome organization. Proc. Natl. Acad. Sci. U.S.A. 110, E3605–E3611. doi: 10.1073/pnas.1302950110
18
Buckle, A., Brackley, C. A., Boyle, S., Marenduzzo, D., Gilbert, N. (2018). Polymer simulations of heteromorphic chromatin predict the 3D folding of complex genomic Loci. Mol. Cell 72, 786–797.e11. doi: 10.1016/j.molcel.2018.09.016
19
Chiang, M., Michieletto, D., Brackley, C. A., Rattanavirotkul, N., Mohammed, H., Marenduzzo, D., et al. (2019). Polymer modeling predicts chromosome reorganization in senescence. Cell Rep. 28, 3212–3223.e6. doi: 10.1016/j.celrep.2019.08.045
20
Chiariello, A. M., Annunziatella, C., Bianco, S., Esposito, A., Nicodemi, M. (2016). Polymer physics of chromosome large-scale 3D organisation. Sci. Rep. 6:29775. doi: 10.1038/srep29775
21
Chiariello, A. M., Esposito, A., Annunziatella, C., Bianco, S., Fiorillo, L., Prisco, A., et al. (2017). A polymer physics investigation of the architecture of the murine orthologue of the 7q11.23 human locus. Front. Neurosci. 11:559. doi: 10.3389/fnins.2017.00559
22
Cho, W. K., Spille, J. H., Hecht, M., Lee, C., Li, C., Grube, V., et al. (2018). Mediator and RNA polymerase II clusters associate in transcription-dependent condensates. Science 361, 412–415. doi: 10.1126/science.aar4199
23
Conte, M., Fiorillo, L., Bianco, S., Chiariello, A. M., Esposito, A., Nicodemi, M. (2020). Polymer physics indicates chromatin folding variability across single-cells results from state degeneracy in phase separation. Nat. Commun. 11:3289. doi: 10.1038/s41467-020-17141-4
24
Davidson, I. F., Bauer, B., Goetz, D., Tang, W., Wutz, G., Peters, J. M. (2019). DNA loop extrusion by human cohesin. Science 366, 1338–1345. doi: 10.1126/science.aaz3418
25
Davies, J. O. J., Telenius, J. M., McGowan, S. J., Roberts, N. A., Taylor, S., Higgs, D. R., et al. (2015). Multiplexed analysis of chromosome conformation at vastly improved sensitivity. Nat. Methods 13, 74–80. doi: 10.1038/nmeth.3664
26
de Gennes, P. G., Witten, T. A. (1980). Scaling concepts in polymer physics. Phys. Today 33, 51–54. doi: 10.1063/1.2914118
27
de Wit, E. (2020). TADs as the caller calls them. J. Mol. Biol. 432, 638–642. doi: 10.1016/j.jmb.2019.09.026
28
Dekker, J., Marti-Renom, M. A., Mirny, L. A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403. doi: 10.1038/nrg3454
29
Di Pierro, M., Cheng, R. R., Aiden, E. L., Wolynes, P. G., Onuchic, J. N. (2017). De novo prediction of human chromosome structures: epigenetic marking patterns encode genome architecture. Proc. Natl. Acad. Sci. U.S.A. 114, 12126–12131. doi: 10.1073/pnas.1714980114
30
Di Pierro, M., Zhang, B., Aiden, E. L., Wolynes, P. G., Onuchic, J. N. (2016). Transferable model for chromosome architecture. Proc. Natl. Acad. Sci. U.S.A. 113, 12168–12173. doi: 10.1073/pnas.1613607113
31
Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., et al. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380. doi: 10.1038/nature11082
32
Eagen, K. P., Aiden, E. L., Kornberg, R. D. (2017). Polycomb-mediated chromatin loops revealed by a subkilobase-resolution chromatin interaction map. Proc. Natl. Acad. Sci. U.S.A. 114, 8764–8769. doi: 10.1073/pnas.1701291114
33
Eeftens, J. M., Kapoor, M., Brangwynne, C. P. (2020). Epigenetic memory as a time integral over prior history of Polycomb phase separation. bioRxiv [preprint]. doi: 10.1101/2020.08.19.254706
34
Eraslan, G., Avsec, Ž., Gagneur, J., Theis, F. J. (2019). Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403. doi: 10.1038/s41576-019-0122-6
35
Fiorillo, L., Musella, F., Kempfer, R., Chiariello, A., Bianco, S., Kukalev, A., et al. (2020). Comparison of the Hi-C, GAM and SPRITE methods by use of polymer models of chromatin. bioRxiv [preprint]. doi: 10.1101/2020.04.24.059915
36
Fishman, V., Battulin, N., Nuriddinov, M., Maslova, A., Zlotina, A., Strunov, A., et al. (2019). 3D organization of chicken genome demonstrates evolutionary conservation of topologically associated domains and highlights unique architecture of erythrocytes’ chromatin. Nucleic Acids Res. 47, 648–665. doi: 10.1093/nar/gky1103
37
Flyamer, I. M., Gassler, J., Imakaev, M., Brandão, H. B., Ulianov, S. V., Abdennur, N., et al. (2017). Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature 544, 110–114. doi: 10.1038/nature21711
38
Franke, M., Ibrahim, D. M., Andrey, G., Schwarzer, W., Heinrich, V., Schöpflin, R., et al. (2016). Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538, 265–269. doi: 10.1038/nature19800
39
Fudenberg, G., Imakaev, M., Lu, C., Goloborodko, A., Abdennur, N., and Mirny, L. A. (2016). Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049. doi: 10.1016/j.celrep.2016.04.085
40
Fudenberg, G., Kelley, D. R., and Pollard, K. S. (2020). Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117. doi: 10.1038/s41592-020-0958-x
41
Fudenberg, G., and Mirny, L. A. (2012). Higher-order chromatin structure: bridging physics and biology. Curr. Opin. Genet. Dev. 22, 115–124. doi: 10.1016/j.gde.2012.01.006
42
Ganji, M., Shaltiel, I. A., Bisht, S., Kim, E., Kalichava, A., Haering, C. H., et al. (2018). Real-time imaging of DNA loop extrusion by condensin. Science 360, 102–105. doi: 10.1126/science.aar7831
43
Gartner, T. E., and Jayaraman, A. (2019). Modeling and simulations of polymers: a roadmap. Macromolecules 52, 755–786. doi: 10.1021/acs.macromol.8b01836
44
Gasperini, M., Hill, A. J., McFaline-Figueroa, J. L., Martin, B., Kim, S., Zhang, M. D., et al. (2019). A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390.e19. doi: 10.1016/j.cell.2018.11.029
45
Gibson, B. A., Doolittle, L. K., Schneider, M. W. G., Jensen, L. E., Gamarra, N., Henry, L., et al. (2019). Organization of chromatin by intrinsic and regulated phase separation. Cell 179, 470–484.e21. doi: 10.1016/j.cell.2019.08.037
46
Grosberg, A. Y. (2016). Extruding loops to make loopy globules? Biophys. J. 110, 2133–2135. doi: 10.1016/j.bpj.2016.04.008
47
Grosberg, A. Y., Nechaev, S. K., and Shakhnovich, E. I. (1988). The role of topological constraints in the kinetics of collapse of macromolecules. J. Phys. 49, 2095–2100. doi: 10.1051/jphys:0198800490120209500
48
Hahnfeldt, P., Hearst, J. E., Brenner, D. J., Sachs, R. K., and Hlatky, L. R. (1993). Polymer models for interphase chromosomes. Proc. Natl. Acad. Sci. U.S.A. 90, 7854–7858. doi: 10.1073/pnas.90.16.7854
49
Hertzberg, J., Mundlos, S., Vingron, M., and Gallone, G. (2020). TADA – a Machine learning tool for functional annotation based prioritisation of putative pathogenic CNVs. bioRxiv [preprint]. doi: 10.1101/2020.06.30.180711
50
Hsieh, T. H. S., Cattoglio, C., Slobodyanyuk, E., Hansen, A. S., Rando, O. J., Tjian, R., et al. (2020). Resolving the 3D landscape of transcription-linked mammalian chromatin folding. Mol. Cell 78, 539–553.e8. doi: 10.1016/j.molcel.2020.03.002
51
Hsieh, T. H. S., Weiner, A., Lajoie, B., Dekker, J., Friedman, N., and Rando, O. J. (2015). Mapping nucleosome resolution chromosome folding in yeast by Micro-C. Cell 162, 108–119. doi: 10.1016/j.cell.2015.05.048
52
Huang, J., Marco, E., Pinello, L., and Yuan, G. C. (2015). Predicting chromatin organization using histone marks. Genome Biol. 16:162. doi: 10.1186/s13059-015-0740-z
53
Imakaev, M. V., Fudenberg, G., and Mirny, L. A. (2015). Modeling chromosomes: beyond pretty pictures. FEBS Lett. 589, 3031–3036. doi: 10.1016/j.febslet.2015.09.004
54
Johnson, J., Brackley, C. A., Cook, P. R., and Marenduzzo, D. (2015). A simple model for DNA bridging proteins and bacterial or human genomes: bridging-induced attraction and genome compaction. J. Phys. Condens. Matter 27:064119. doi: 10.1088/0953-8984/27/6/064119
55
Jost, D., Carrivain, P., Cavalli, G., and Vaillant, C. (2014). Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains. Nucleic Acids Res. 42, 9553–9561. doi: 10.1093/nar/gku698
56
Kai, Y., Andricovich, J., Zeng, Z., Zhu, J., Tzatsos, A., and Peng, W. (2018). Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features. Nat. Commun. 9:4221. doi: 10.1038/s41467-018-06664-6
57
Kantidze, O. L., and Razin, S. V. (2020). Weak interactions in higher-order chromatin organization. Nucleic Acids Res. 48, 4614–4626. doi: 10.1093/nar/gkaa261
58
Kempfer, R., and Pombo, A. (2020). Methods for mapping 3D chromosome architecture. Nat. Rev. Genet. 21, 207–226. doi: 10.1038/s41576-019-0195-2
59
Kim, Y., Shi, Z., Zhang, H., Finkelstein, I. J., and Yu, H. (2019). Human cohesin compacts DNA by loop extrusion. Science 366, 1345–1349. doi: 10.1126/science.aaz4475
60
Kraft, K., Magg, A., Heinrich, V., Riemenschneider, C., Schöpflin, R., Markowski, J., et al. (2019). Serial genomic inversions induce tissue-specific architectural stripes, gene misexpression and congenital malformations. Nat. Cell Biol. 21, 305–310. doi: 10.1038/s41556-019-0273-x
61
Kragesteen, B. K., Spielmann, M., Paliou, C., Heinrich, V., Schöpflin, R., Esposito, A., et al. (2018). Dynamic 3D chromatin architecture contributes to enhancer specificity and limb morphogenesis. Nat. Genet. 50, 1463–1473. doi: 10.1038/s41588-018-0221-x
62
Krietenstein, N., Abraham, S., Venev, S. V., Abdennur, N., Gibcus, J., Hsieh, T. H. S., et al. (2020). Ultrastructural details of mammalian chromosome architecture. Mol. Cell 78, 554–565.e7. doi: 10.1016/j.molcel.2020.03.003
63
Larson, A. G., Elnatan, D., Keenen, M. M., Trnka, M. J., Johnston, J. B., Burlingame, A. L., et al. (2017). Liquid droplet formation by HP1α suggests a role for phase separation in heterochromatin. Nature 547, 236–240. doi: 10.1038/nature22822
64
Li, R., Liu, Y., Li, T., and Li, C. (2016). 3Disease browser: a web server for integrating 3D genome and disease-associated chromosome rearrangement data. Sci. Rep. 6:34651. doi: 10.1038/srep34651
65
Li, W., Wong, W. H., and Jiang, R. (2019). DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47:e60. doi: 10.1093/nar/gkz167
66
Lieberman-Aiden, E., van Berkum, N. L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293. doi: 10.1126/science.1181369
67
Lin, D., Bonora, G., Yardımcı, G. G., and Noble, W. S. (2019). Computational methods for analyzing and modeling genome structure and organization. Wiley Interdiscip. Rev. Syst. Biol. Med. 11:e1435. doi: 10.1002/wsbm.1435
68
MacKay, K., and Kusalik, A. (2020). Computational methods for predicting 3D genomic organization from high-resolution chromosome conformation capture data. Brief. Funct. Genom. 19, 292–308. doi: 10.1093/bfgp/elaa004
69
MacPherson, Q., Beltran, B., and Spakowitz, A. J. (2018). Bottom–up modeling of chromatin segregation due to epigenetic modifications. Proc. Natl. Acad. Sci. U.S.A. 115, 12739–12744. doi: 10.1073/pnas.1812268115
70
Mirny, L. A. (2011). The fractal globule as a model of chromatin architecture in the cell. Chromosome Res. 19, 37–51. doi: 10.1007/s10577-010-9177-0
71
Nicodemi, M., and Prisco, A. (2009). Thermodynamic pathways to genome spatial organization in the cell nucleus. Biophys. J. 96, 2168–2177. doi: 10.1016/j.bpj.2008.12.3919
72
Nuebler, J., Fudenberg, G., Imakaev, M., Abdennur, N., and Mirny, L. A. (2018). Chromatin organization by an interplay of loop extrusion and compartmental segregation. Proc. Natl. Acad. Sci. U.S.A. 115, E6697–E6706. doi: 10.1073/pnas.1717730115
73
Nuriddinov, M., and Fishman, V. (2019). C-InterSecture-a computational tool for interspecies comparison of genome architecture. Bioinformatics 35, 4912–4921. doi: 10.1093/bioinformatics/btz415
74
Petrovic, J., Zhou, Y., Fasolino, M., Goldman, N., Schwartz, G. W., Mumbach, M. R., et al. (2019). Oncogenic notch promotes long-range regulatory interactions within hyperconnected 3D cliques. Mol. Cell 73, 1174–1190.e12. doi: 10.1016/j.molcel.2019.01.006
75
Plys, A. J., Davis, C. P., Kim, J., Rizki, G., Keenen, M. M., Marr, S. K., et al. (2019). Phase separation of polycomb-repressive complex 1 is governed by a charged disordered region of CBX2. Genes Dev. 33, 799–813. doi: 10.1101/gad.326488.119
76
Qi, Y., and Zhang, B. (2019). Predicting three-dimensional genome organization with chromatin states. PLoS Comput. Biol. 15:e1007024. doi: 10.1371/journal.pcbi.1007024
77
Quinodoz, S. A., Ollikainen, N., Tabak, B., Palla, A., Schmidt, J. M., Detmar, E., et al. (2018). Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell 174, 744–757.e24. doi: 10.1016/j.cell.2018.05.024
78
Rao, S. S. P., Huang, S. C., Glenn St Hilaire, B., Engreitz, J. M., Perez, E. M., Kieffer-Kwon, K. R., et al. (2017). Cohesin loss eliminates all loop domains. Cell 171, 305–320.e24. doi: 10.1016/j.cell.2017.09.026
79
Rao, S. S. P., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680. doi: 10.1016/j.cell.2014.11.021
80
Razin, S. V., and Gavrilov, A. A. (2020). The role of liquid–liquid phase separation in the compartmentalization of cell nucleus and spatial genome organization. Biochemistry 85, 643–650. doi: 10.1134/S0006297920060012
81
Razin, S. V., and Ulianov, S. V. (2020). Divide and rule: phase separation in eukaryotic genome functioning. Cells 9:2480. doi: 10.3390/cells9112480
82
Rodríguez-Carballo, E., Lopez-Delisle, L., Zhan, Y., Fabre, P. J., Beccari, L., El-Idrissi, I., et al. (2017). The HoxD cluster is a dynamic and resilient TAD boundary controlling the segregation of antagonistic regulatory landscapes. Genes Dev. 31, 2264–2281. doi: 10.1101/gad.307769.117
83
Rowley, M. J., Nichols, M. H., Lyu, X., Ando-Kuri, M., Rivera, I. S. M., Hermetz, K., et al. (2017). Evolutionarily conserved principles predict 3D chromatin organization. Mol. Cell 67, 837–852.e7. doi: 10.1016/j.molcel.2017.07.022
84
Sadowski, M., Kraft, A., Szalaj, P., Wlasnowolski, M., Tang, Z., Ruan, Y., et al. (2019). Spatial chromatin architecture alteration by structural variations in human genomes at the population scale. Genome Biol. 20:148. doi: 10.1186/s13059-019-1728-x
85
Salameh, T. J., Wang, X., Song, F., Zhang, B., Wright, S. M., Khunsriraksakul, C., et al. (2020). A supervised learning framework for chromatin loop detection in genome-wide contact maps. Nat. Commun. 11:3428. doi: 10.1038/s41467-020-17239-9
86
Sanborn, A. L., Rao, S. S. P., Huang, S. C., Durand, N. C., Huntley, M. H., Jewett, A. I., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl. Acad. Sci. U.S.A. 112, E6456–E6465. doi: 10.1073/pnas.1518552112
87
Sanulli, S., Trnka, M. J., Dharmarajan, V., Tibble, R. W., Pascal, B. D., Burlingame, A. L., et al. (2019). HP1 reshapes nucleosome core to promote phase separation of heterochromatin. Nature 575, 390–394. doi: 10.1038/s41586-019-1669-2
88
Schmiedel, B. J., Seumois, G., Samaniego-Castruita, D., Cayford, J., Schulten, V., Chavez, L., et al. (2016). 17q21 asthma-risk variants switch CTCF binding and regulate IL-2 production by T cells. Nat. Commun. 7:13426. doi: 10.1038/ncomms13426
89
Schwessinger, R., Gosden, M., Downes, D., Brown, R. C., Oudelaar, A. M., Telenius, J., et al. (2020). DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124. doi: 10.1038/s41592-020-0960-3
90
Scialdone, A., Cataudella, I., Barbieri, M., Prisco, A., and Nicodemi, M. (2011). Conformation regulation of the X chromosome inactivation center: a model. PLoS Comput. Biol. 7:1002229. doi: 10.1371/journal.pcbi.1002229
91
Serra, F., Di Stefano, M., Spill, Y. G., Cuartero, Y., Goodstadt, M., Baù, D., et al. (2015). Restraint-based three-dimensional modeling of genomes and genomic domains. FEBS Lett. 589, 2987–2995. doi: 10.1016/j.febslet.2015.05.012
92
Spicuglia, S., and Vanhille, L. (2012). Chromatin signatures of active enhancers. Nucleus 3, 126–131. doi: 10.4161/nucl.19232
93
Strom, A. R., Emelyanov, A. V., Mir, M., Fyodorov, D. V., Darzacq, X., and Karpen, G. H. (2017). Phase separation drives heterochromatin domain formation. Nature 547, 241–245. doi: 10.1038/nature22989
94
Sun, Y., Dong, L., Zhang, Y., Lin, D., Xu, W., Ke, C., et al. (2020). 3D genome architecture coordinates trans and cis regulation of differentially expressed ear and tassel genes in maize. Genome Biol. 21:143. doi: 10.1186/s13059-020-02063-7
95
Szabo, Q., Bantignies, F., and Cavalli, G. (2019). Principles of genome folding into topologically associating domains. Sci. Adv. 5:eaaw1668. doi: 10.1126/sciadv.aaw1668
96
Szabo, Q., Donjon, A., Jerković, I., Papadopoulos, G. L., Cheutin, T., Bonev, B., et al. (2020). Regulation of single-cell genome organization into TADs and chromatin nanodomains. Nat. Genet. 52, 1151–1157. doi: 10.1038/s41588-020-00716-8
97
Tavares-Cadete, F., Norouzi, D., Dekker, B., Liu, Y., and Dekker, J. (2020). Multi-contact 3C reveals that the human genome during interphase is largely not entangled. Nat. Struct. Mol. Biol. 27, 1105–1114. doi: 10.1038/s41594-020-0506-5
98
Teif, V. B., Kepper, N., Yserentant, K., Wedemann, G., and Rippe, K. (2015). Affinity, stoichiometry and cooperativity of heterochromatin protein 1 (HP1) binding to nucleosomal arrays. J. Phys. Condens. Matter 27:064110. doi: 10.1088/0953-8984/27/6/064110
99
Teif, V. B., and Rippe, K. (2010). Statistical-mechanical lattice models for protein-DNA binding in chromatin. J. Phys. Condens. Matter 22:414105. doi: 10.1088/0953-8984/22/41/414105
100
Tiana, G., and Giorgetti, L. (eds.) (2019). Modeling the 3D Conformation of Genomes. Series in Computational Biophysics. Boca Raton, FL: CRC Press. doi: 10.1201/9781315144009
101
Trieu, T., Martinez-Fundichely, A., and Khurana, E. (2020). DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol. 21:79. doi: 10.1186/s13059-020-01987-4
102
Ulianov, S. V., Doronin, S. A., Khrameeva, E. E., Kos, P. I., Luzhin, A. V., Starikov, S. S., et al. (2019). Nuclear lamina integrity is required for proper spatial organization of chromatin in Drosophila. Nat. Commun. 10:1176. doi: 10.1038/s41467-019-09185-y
103
Ulianov, S. V., Khrameeva, E. E., Gavrilov, A. A., Flyamer, I. M., Kos, P., Mikhaleva, E. A., et al. (2016). Active chromatin and transcription play a key role in chromosome partitioning into topologically associating domains. Genome Res. 26, 70–84. doi: 10.1101/gr.196006.115
104
Vian, L., Pȩkowska, A., Rao, S. S. P., Kieffer-Kwon, K. R., Jung, S., Baranello, L., et al. (2018). The energetics and physiological impact of cohesin extrusion. Cell 173, 1165–1178.e20. doi: 10.1016/j.cell.2018.03.072
105
Whalen, S., Truty, R. M., and Pollard, K. S. (2016). Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496. doi: 10.1038/ng.3539
106
Wlasnowolski, M., Sadowski, M., Czarnota, T., Jodkowska, K., Szalaj, P., Tang, Z., et al. (2020). 3D-GNOME 2.0: a three-dimensional genome modeling engine for predicting structural variation-driven alterations of chromatin spatial structure in the human genome. Nucleic Acids Res. 48, W170–W176. doi: 10.1093/nar/gkaa388
107
Xu, C., and Jackson, S. A. (2019). Machine learning and complex biological data. Genome Biol. 20:76. doi: 10.1186/s13059-019-1689-0
108
Xu, H., Zhang, S., Yi, X., Plewczynski, D., and Li, M. J. (2020). Exploring 3D chromatin contacts in gene regulation: the evolution of approaches for the identification of functional enhancer-promoter interaction. Comput. Struct. Biotechnol. J. 18, 558–570. doi: 10.1016/j.csbj.2020.02.013
109
Xu, T., Zheng, X., Li, B., Jin, P., Qin, Z., and Wu, H. (2018). A comprehensive review of computational prediction of genome-wide features. Brief. Bioinform. doi: 10.1093/bib/bby110 [Online ahead of print].
110
Yang, T., Zhang, F., Yardımcı, G. G., Song, F., Hardison, R. C., Noble, W. S., et al. (2017). HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949. doi: 10.1101/gr.220640.117
111
Zepeda-Mendoza, C. J., Menon, S., and Morton, C. C. (2018). Computational prediction of position effects of human chromosome rearrangements. Curr. Protoc. Hum. Genet. 97:e57. doi: 10.1002/cphg.57
112
Zhang, R., Wang, Y., Yang, Y., Zhang, Y., and Ma, J. (2018). Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 34, i133–i141. doi: 10.1093/bioinformatics/bty248
113
Zhang, S., Chasman, D., Knaack, S., and Roy, S. (2019). In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun. 10:5449. doi: 10.1038/s41467-019-13423-8
114
Zhu, Y., Chen, Z., Zhang, K., Wang, M., Medovoy, D., Whitaker, J. W., et al. (2016). Constructing 3D interaction maps from 1D epigenomes. Nat. Commun. 7:10812. doi: 10.1038/ncomms10812
115
Zufferey, M., Tavernari, D., Oricchio, E., and Ciriello, G. (2018). Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 19:217. doi: 10.1186/s13059-018-1596-9

Keywords

Hi-C, modeling, polymer physics, machine learning, predicting approaches

Citation

Belokopytova P and Fishman V (2021) Predicting Genome Architecture: Challenges and Solutions. Front. Genet. 11:617202. doi: 10.3389/fgene.2020.617202

Received

14 October 2020

Accepted

15 December 2020

Published

22 January 2021

Volume

11 - 2020

Edited by

Yuriy L. Orlov, I.M. Sechenov First Moscow State Medical University, Russia

Reviewed by

Davide Marenduzzo, University of Edinburgh, United Kingdom; Vladimir B. Teif, University of Essex, United Kingdom; Ron Schwessinger, University of Oxford, United Kingdom

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Veniamin Fishman, minja-f@ya.ru; minja-f@yandex.ru

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
