Extended Topological Persistence and Contact Arrangements in Folded Linear Molecules

Verovšek, Sara Kališnik; Mashaghi, Alireza

doi:10.3389/fams.2016.00006

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 31 May 2016

Sec. Biological Modeling and Simulation

Volume 2 - 2016 | https://doi.org/10.3389/fams.2016.00006

Extended Topological Persistence and Contact Arrangements in Folded Linear Molecules

Sara Kališnik Verovšek¹

Alireza Mashaghi²^*

¹Department of Mathematics, Stanford University, Stanford, CA, USA
²Department of Ophthalmology, Harvard Medical School, Harvard University, Boston, MA, USA

Structure plays a pivotal role in determining the functional properties of self-interacting linear biomolecular chains, for example proteins and nucleic acids. In this paper, we propose a method for representing each such molecule combinatorially—as a one-dimensional simplicial complex—in a novel way that takes into account intra-chain contacts. The representation allows for efficient quantification of structural similarities and differences between molecules, and for studying molecular topology using extended persistence. This method performs a multi-scale analysis on a filtered simplicial complex as it tracks clusters, holes, and higher dimensional voids in the filtration. From extended persistence we extract information about the arrangement of intra-chain interactions, a topological property which demonstrably affects folding and unfolding dynamics of the linear chains.

1. Introduction

Proteins and nucleic acids are linear organic polymers which are present in and vital to every living cell [1, 2]. They perform a vast array of functions within living organisms, for example, proteins provide mechanical support of cells, catalyze specific chemical reactions as well as transport and store nutrients and metabolites. In determining the functional properties of proteins and nucleic acids, and more generally, self-interacting molecular chains, shape plays a pivotal role. Chemists have been aware of the connection between the shape of a molecule and its function for decades already. For example, Linus Pauling wrote in 1974: “I am convinced that it will be found in the future…that the shapes and sizes of molecules are of just as great significance in determining their physiological behavior as are their internal structure and ordinary chemical properties [3].”

Topology is the branch of mathematics which deals with shape. It can be used to study connectivity information, which includes the classification of connected components of a space, loops and higher dimensional surfaces within the space. It can further be used to approximate complicated objects with simple, combinatorial ones, for example simplicial complexes. In the past decade with the emergence of ‘big data’, topology started playing a more prominent role in data analysis [4, 5]. Using a method called persistent homology [6] and a generalization of it, called extended persistence [7], researchers have solved problems in sensor networks [8, 9], gained insights into texture images [10] and classified lesions [11, 12]. Topological ideas have also inspired methods for visualizing complex datasets [13].

Chemists and structural biologists have successfully applied topological principles to classify elementary features in the structure of molecules [14–17]. Knot theory provided an appropriate framework for modeling the structure and function of the genomic DNA and for the action of particular enzymes in altering the topology of DNA in site-specific recombination. However, physical knots are extremely rare in RNA molecules [18]. Polypeptide chains can form knots [19–21], but only a small fraction of proteins (< 1%) in the Protein Data Bank (PDB), including rRNA methyltransferases, carbonic anhydrides, and ubiquitin hydrolase, are identified to be knotted [22, 23]. Although some progress has been made [24], knot theory does not suffice to distinguish between most RNA and protein folds.

Circuit topology [25] addresses this problem and takes into account multiple intra-molecular contacts (interactions), in which one part of the chain binds to another part. The arrangements of these contacts (parallel, series, cross; see Figure 1) can be used for classification of biomolecules; molecular structures that are not resolvable with previous methods may be distinguished based on their circuit topology [25]. Circuit topology complements other contact based methods (e.g., contact order based approach, Gaussian network models, Anisotropic network models) in predicting folding kinetics and their relation to molecular shape [26–31]. Recent studies revealed the power of circuit topology framework in determining the (un)folding dynamics [32–35] as well as functions [25] of folded (bio)polymers.

FIGURE 1

Figure 1. (A) A linear molecule with four contact sites. The two chain ends are distinguishable in linear biomolecules like nucleic acids and proteins. The curved arrows show how the molecule folds, i.e., four contact sites merge into two binary contacts. The arrangement is parallel in this case, as shown in (B), on the left side. (B) Three basic circuit relations between two contacts in a linear molecule. From left to right: parallel, series, and crossed.

In this paper, we work within the circuit topology framework and provide a method for representing every linear molecule as a one-dimensional simplicial complex. While the idea of representing a molecule as a graph is not novel [36–38], our approach is, as it takes into account intra-molecular contacts in such a way that the circuit relations between contacts are preserved. As a measure of similarity and dissimilarity, we propose the graph edit distance which is commonly used in bioinformatics and structural biology for comparing structures [39–43]. Once we have a metric space model for linear molecules, we can use a plethora of other tools, for example, compute persistent homology, extended persistence, etc. In this paper, in particular, we use extended persistence to extract complete information about contact arrangements. This is the first computational method available to compute arrangement matrices presented in Mashaghi et al. [25]. Extended persistence has been successfully applied to analyze biomolecules previously [44–46], however this is the first time it has ever been used in the context of circuit topology.

2. Mathematical Background

The aim of this work is to analyze properties of self-interacting linear molecular chains using topological methods. We present the theoretical material in a compact, informal manner following [4, 5] and encourage the readers interested in details to read the given references. For a deep treatment of homology theory, we refer the reader to Allen Hatcher's Algebraic Topology [47].

2.1. Simplicial Complexes and Homology

The idea behind algebraic topology is that one can distinguish spaces by the occurrences of patterns within a space. Homology is one of many topological invariants that can be used for this purpose, but has the advantage of being relatively easy to compute. It allows us to assigns the so-called Betti numbers, i.e., a list of non-negative integers, β₀, β₁, β₂, …, to any topological space. Each of these numbers carries connectivity information about the space, for example, β₀ counts the number of connected components of a space, β₁ counts the number of loops in the space, β₂ counts the number of voids, etc.

Sometimes ignoring a certain amount of data or structure yields a simpler theory, which can give results not readily obtainable in the original setting. Relative homology is an example where this occurs. Given a space X and a nice enough subset A, the relative homology of (X, A), H(X, A), looks exactly like the homology of the quotient X∕A (except in dimension 0, where X∕A always has one connected component more). A quotient space X∕A is, intuitively speaking, the result of identifying points in X that belong to A.

Homology was initially defined for spaces described in a very particular way, namely as simplicial complexes. The basic idea of a simplicial complex is that of gluing together points, lines, triangles, tetrahedra, and the higher dimensional equivalents (which are called simplices) along their boundaries in a structured way.

We formally define a k-simplex as follows. Suppose that k + 1 points $v_{0}, \dots, v_{k} \in ℝ^{n}$ are affinely independent, i.e., v₁ − v₀, …, v_k − v₀ are linearly independent vectors. Then the set of points

C = {t_{0} v_{0} + \dots + t_{k} v_{k} | t_{i} \geq 0, 0 \leq i \leq k, \sum_{i = 0}^{k} t_{i} = 1} .

is a k-simplex. We denote it by {v₀, …, v_k}. A 0-simplex is a point or a vertex, a 1-simplex is an edge, a 2-simplex is a triangle and a 3-simplex is a tetrahedron (see Figure 2 for examples).

FIGURE 2

Figure 2. (A) A vertex, an edge, a triangle, and a tetrahedron. In all three cases β₀ = 1. There are no loops or higher dimensional voids and therefore β_n = 0 for n ≥ 1. (B) Example of an edit path whose cost is 3. This is the minimal cost edit path, thus the graph edit distance between these two graphs is 3.

A simplicial complex K is a finite collection of simplices that satisfies the following conditions:

1. If σ ∈ K and a simplex τ ⊆ σ, then τ is also in K.

2. The intersection of any two simplices σ₁, σ₂ ∈ K is either ∅ or a face of both σ₁ and σ₂.

1-dimensional simplicial complexes have been widely enough studied to deserve their own name, graphs, and the branch of mathematics that studies them is graph theory. A subcomplex of a simplicial complex K is an simplicial complex K′ ⊆ K.

2.2. Distance Functions on the Space of Graphs

Once we have a way to represent every linear molecule as a graph, it would be desirable to have a notion of “distance” on this space of molecules that would serve as a measure of how “similar” or “dissimilar” two molecules are. A space equipped with such a function (distance function) is called a metric space. Formally, a distance function d: X × X → ℝ on space X is a function that satisfies the following properties:

• For all x, y ∈ X, d(x, y) ≥ 0 and d(x, y) = 0 if and only if x = y.

• For all x, y ∈ X, d(x, y) = d(y, x) (symmetry).

• For all x, y, z ∈ X, d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality).

One of the metrics commonly used on the space of all (finite) graphs is the graph edit distance [39–43, 48] and its variations. Intuitively speaking, graph edit distance defines the similarity of two graphs by the minimum amount of distortion which is needed to transform one graph into the other.

Suppose we have a graph (V, Σ), where V denotes a finite set of vertices and Σ a finite set of edges between them. Each edge is represented by a two-element set that contains vertices that the edge connects. Let μ: V ∪ Σ → ℤ be a function that assigns an integer to each vertex and each edge. We call μ a labeling function. If the labeling function is not explicitly specified, we assume that all vertices have the same label. The allowed modifications include the insertion, deletion, and substitution of vertices and edges. A vertex deletion operation, for example, refers to the removal of a vertex from a graph, and a vertex/edge substitution operation is equivalent to changing the label of a vertex/edge substitution. A sequence of edit operations that transforms one graph into another is called an edit path. Computing the edit distance of two graphs is equivalent to finding an edit path with a minimal number of operations required.

The edit distance method can be tailored to a specific application by assigning each edit operation a cost that depends on the strength of the corresponding distortion. The total edit cost of a given edit path is the sum of costs of individual edit operations. The edit distance of two graphs is defined to be the minimal cost edit path between them, that is, the least expensive way to edit one graph into the other one. Such an edit distance is a metric if the cost functions is positive definite, symmetric and satisfies triangle inequality at the level of single edit operations [49]. If the cost of all operations is c, then the graph edit distance is just the minimal number of modifications required times c. See Figure 2B for an example of the graph edit distance computation.

2.3. Extended Persistence

Homology provides information about the number of connected components, loops and higher dimensional voids that occur within a topological space. However, sometimes as a tool it is not powerful enough to make desired distinctions between structures. For example, Figure 4 depicts three simplicial complexes that arise from three qualitatively different arrangements of contacts, but whose Betti numbers are all the same. A variant of homology, called extended persistence [7], allows us to extract information about a simplicial complex K equipped with a function f: K → {1, 2, …, n} (filtering function). In our case, function f records the time when a simplex is added in a filtration. Depending on the nature of the problem, it can be any other function significant to K. For example, Aadcock et al. [11] constructed simplicial complexes from images of liver lesions and used grayscale values as the filter.

To determine the extended persistence of (K, f) with respect to the i-dimensional homology functor H_i with coefficients in a field, construct the following sequence of vector spaces and maps between them:

\begin{array}{l} 0 \to H_{i} (K_{1}) \to H_{i} (K_{2}) \to \dots \to H_{i} (K_{n}) = H_{i} (K) \\ \to H_{i} (K, K^{n}) \to \dots \to H_{i} (K, K^{1}) = 0 \end{array}

Here K_i contains such simplices from K that f(σ) ⊂ {1, 2, …, i} for every σ ∈ K_i; Kⁱ contains such simplices from K that f(σ) ⊂ {i, …, n} for every σ ∈ Kⁱ. In our case H(X, A) is isomorphic to H(X∕A), except in the case of 0-dimensional homology (in that case β₀(X∕A) = β₀(X, A) + 1).

The idea behind this sequence is to grow the space from the bottom up; and then to relativize it from the top down and keep track of when i-dimensional voids appear and disappear in this sequence.

Figure 3 depicts a simplicial complex with labeled vertices. The first eight pictures show how simplices are added. The following eight pictures represent spaces isomorphic to relevant quotients. The 1-dimensional extended persistence keeps track of when each loop appears and disappears in this sequence.

FIGURE 3

Figure 3. A depiction of a simplicial complex with a filtering function (numbers next to vertices) and of spaces that appear in extended persistence. There are two loops of extended type—the first present over [6, 4) and the other over [7, 2). We interpret this as follows: in this simplicial complex there is one loop that begins to form at 2 and is completed at 7 and another completed at 6 whose first edge appears at 4.

The first stage or the standard part of the extended persistence is

0 \to H_{i} (K_{1}) \to H_{i} (K_{2}) \to \dots H_{i} (K_{n})

while the second stage or the extended part is

H_{i} (K, K^{n}) \to H_{i} (K, K^{n - 1}) \to \dots \to H_{i} (K, K^{1}) = 0.

Every i-dimensional void which appears at some point of the two-stage process will eventually disappear. Depending on where in the extended filtration they appear and disappear, we distinguish three types of voids:

• ordinary : appears at a and disappears at b during the first stage with a < b;

• relative : appears at a and disappears at b during the second stage, with b < a;

• extended : appears at a in one stage and disappears at b in the other.

In our setting we are only interested in the voids of the extended type. For example, in Figure 3, there are two loops of extended type—the first present over [6, 4) and the other over [7, 2). We interpret this as follows: in this simplicial complex there is one loop that begins to form at 2 and is completed at 7 and another completed at 6 whose first edge appears at 4. There are also several features of the relative type.

2.4. Circuit Topology

Pairwise relations between intramolecular contacts affect the function and properties of a molecule [25] (for example the folding rate of the molecule, etc). We distinguish three basic types of relations that can occur between two contacts: parallel, series, and cross. These three basic types and their simplicial complexes are depicted in Figure 4. Intuitively, the easiest way to determine the type of a relation is to consider the molecular strand as an interval of length one, and then check which points would bind together and form contacts. Figure 1 illustrates what happens for chains with two contacts. Each broken line connects two points that form a contact and from the picture it is easy to see what the “parallel” (denoted by p), “series” (denoted by s) and “cross” (denoted by x) arrangements refer to. This leads us to represent every contact as an interval [a, b] in which a is the point on a molecular chain that is linked to b (the molecular strand is directed and we take a < b).

FIGURE 4

Figure 4. Simplicial representations of molecules with parallel, series, and cross arrangement. In all three cases the corresponding Betti numbers are: β₀ = 1, β₁ = 2, β_n = 0 for n ≥ 2.

We have a simple criterion to test these arrangements for interval representations. Let c₁ = [a₁, b₁] and c₂ = [a₂, b₂] be contacts. Without loss of generality we may assume that a₁ ≤ a₂. Then:

• c₁ and c₂ are parallel if [a₂, b₂] ⊆ [a₁, b₁];

• c₁ and c₂ form a series if b₁ ≤ a₂;

• c₁ and c₂ cross if a₂ < b₁ < b₂.

The arrangement of contact pair remains unchanged if we reverse the direction of the molecule. In terms of intervals this change would correspond to a map r: I → I, r(x) = 1 − x.

We record the information about the arrangements of these contacts in a n × n arrangement matrix, where n is the number of contacts and every element of the matrix is taken from {p, s, x}.

3. Results

3.1. Metric Space of Molecules

We first use the algebraic tools presented in the previous section to model the molecular chains with the most basic circuit topologies as simplicial complexes (Figure 4). In this paper we only consider chains with binary contacts as multi-valent contacts can typically be decomposed to multiple binary ones. In this setting each linear molecule can be thought of as directed chain with contacts linking different points on the chain. Red denotes the beginning of the chain and blue the end. Black dots are points of contact in the molecule and arrows denote the direction of the strand.

Algorithm 1 converts a linear molecule into a simplicial complex. We use it in Figure 5 to get a simplicial complex representation of the Hammerhead Ribozyme, a small RNA motif with catalytic activity capable of self-cleavage [50].

ALGORITHM 1

Algorithm 1. Representing a linear molecule as a simplicial complex

FIGURE 5

Figure 5. (A) A depiction of Hammerhead Ribozyme and (B) a coarse grained schematic of the molecule. (C) The simplicial complex that represents Hammerhead Ribozyme. (D) Filter function of Hammerhead Ribozyme. Voids of extended type in the extended persistence are represented by a multiset {[15, 2), [7, 4), [16, 6), [14, 9), [13, 11)}. (E) Corresponding arrangement matrix. To arrive at the arrangement matrix, take the intervals representing voids of extended type and use the interval criterion from Section 2.4 to determine the pairwise arrangements of contacts.

We quantify the dissimilarity between two molecules as the graph edit distance between its simplicial complex (graph) representations. The simplest is to assign the same cost to all modifications. We can also tailor the cost value of a modification so that it reflects the strength of the corresponding distortion. In the case of linear molecules energetic and entropic factors come in. There are operations that are energetically forbidden (like breaking connectivity of the chain) to which we can assign greater cost. For intra-chain contacts, forming and breaking contacts are both possible, forming contacts being energetically favored over breaking. Depending on the context, we may take a graph edit distance that involves the least energy cost (and the highest energy release). Note that sequences of edge and vertex additions and deletions may result in graphs that are not representatives of linear molecules obtainable by using Algorithm 1, but this metric still provides a reasonable way of measuring the distance between the graphs that do.

In Figure 6, we calculate pairwise graph edit distances between three basic arrangements—parallel (P), cross (X) and series (S) arrangements. While other graphs may have identical topologies to the P-, S-, or X-type arrangements, it is these specific arrangments that represent linear molecules. The statistics of these arrangements defines folding kinetics; a molecule that is rich in S-type structures folds non-cooperatively while a molecule, which is rich in P-type structures, undergoes cooperative folding. Furthermore, inter-conversion of these structures represents a generic way through which a topology can evolve from another topology [33]. As it turns out, the P- and S-arrangements are relatively close in distance, as are P- and X-arrangements, while X- and S-arrangements are relatively far.

FIGURE 6

Figure 6. The graph edit distance (with uniform costs) between a molecule with two contacts in series arrangement and a molecule with contacts in parallel arrangement is 4, as is the graph edit distance between a molecule with two contacts in parallel arrangement and a molecule with contacts in cross arrangement. The graph edit distance between molecules with contacts in series and cross arrangements is 8. It is not too difficult to see that the paths of distance 4 from P- to S-type arrangements and from S- to X-type arrangements are minimal, that the distance from an S- to an X-type arrangement is no less than 8 can be seen through detailed considerations of the possibilities of removing vertices (and edges) from a graph with 9 vertices and 10 edges (the S-type) to obtain a graph with 7 vertices and 8 edges (the X-type). For instance, removing two vertices connected to four edges each would already account for a graph edit distance of 10, so only certain vertices need to be considered.

3.2. On Genus

Bon et al. [51] have proposed the topological classification of RNA secondary structures with pseudoknots based on the concept of topological genus. The genus is a topological invariant, whose geometric interpretation is quite simple. For example, the genus of a graph is the minimal integer n such that the graph can be drawn without crossing itself on a sphere with n handles (i.e., an oriented surface of genus n). Computer software exists (for example SAGE) that can compute the genus of a given graph. So we can model an RNA molecules as graphs using Algorithm 1 (each graph is given as a list of vertices and 2-element sets of vertices determining the edges), find its genus and classify it according to Bon et al. [51], which demonstrates that our framework fits in well with other existing ones.

3.3. Determining the Contact Arrangement Matrix of a Molecule using Extended Persistence

Let M be any linear molecule. We can represent it as a simplicial complex K using Algorithm 1. As we move along the molecular strand we assign numbers to contacts—start with 1 for the beginning of the strand, 2 for the first point that binds to another point on the chain, etc. This defines a function f on K. See Figure 5D for such a filter function of the Hammerhead Ribozyme. The extended persistence of (K, f) is a multiset of intervals. We are interested in the voids of extended type {[b_i, a_i)}. Each interval carries information about a contact that connects two points on the molecular strand. For example, the appearance of [b_i, a_i) demonstrates that there is a contact completed at b_i that starts forming at a_i. Using the interval criterion from Subsection 2.4) we determine the pairwise arrangements of contacts, i.e., the arrangement matrix.

The software for computing extended persistence is available at http://www.mrzv.org/software/dionysus/.

3.4. Folding

Here we demonstrate the applicability of our proposed approach to study the folding process. Folding of a molecular chain involves sequential formation of contacts and starts with a configuration that includes no contact, the so called unfolded or extended state. The final product of this reaction is a natively folded state. Given a conformation with 2n-contact sites, which eventually forms n contacts with certain correct arrangement. At a certain time point we look at the system:

• it may have n′ contacts formed where n′ < n

• it may have n contacts formed.

We make use of the metric space model from Section 3.1 to represent folding schematically. The example we observe here is of pbu E adenine riboswitch [52, 53] and is depicted in Figure 7. Riboswitches are known to regulate genes through conformational changes in ligand-binding RNA aptamers [54, 55].

FIGURE 7

Figure 7. (A) E adenine riboswitch folding. Four intramolecular contacts are formed during folding as indicated in the figure [53]. (B) Coarse grained representation of the folding process. (C) We can represent folding in a graph where the x-axis (reaction coordinate) is the graph edit distance from the original molecule to the molecular strand with no contacts and y-axis is the number of contacts.

In the beginning, the molecule is a single chain with no intra-chain contacts. Then contacts start forming, increasing the complexity. Although the graph edit distance to a molecular chain with no contacts (represented by a graph with two vertices and an edge between them) increases with the number of contacts along this pathway, the relation between the number of contacts and changes in edit distance is not linear and and depends on how the topology changes during the transitions. We note that folding is not always a unidirectional path with increasing number of contacts. Sometimes the molecule adopts a non-native arrangement that has to be corrected before folding to native state is accomplished.

To further illustrate how topology changes during folding, we study some of the most basic folding and rearrangement steps. Figure 8 shows how graph edit distance to a molecular strand with no contacts changes upon formation of parallel contacts, the so-called zipping, series contacts and cross contacts. Graph edit distance to a molecular strand with no contacts is proportional to the number of contacts n and increases with a slope smaller, 7n, than that of series contacts, 9n. The approach presented here is general and can be applied to more complex molecular structures commonly seen in living organisms.

FIGURE 8

Figure 8. Parallel, cross, and series arrangement in the case of n contacts. With each contact we add, the complexity increases and is detected by the graph edit distance.

4. Conclusions

Intra-chain interactions that bring two parts of a chain into physical contact are characteristic features of folded proteins and nucleic acids. The importance of these interactions has been widely recognized and numerous tools and approaches have been developed to identify the interactions and to relate them to molecular function and dynamics [36, 56]. Contact order and average connectivity of protein contact network have been successfully applied to explain folding rate and folding pathways of certain proteins [26, 36]. In the field of RNA research, a number of tools exist where graph theoretic approaches are used to characterize secondary and higher order structures. Graph edit distance is for example used by the Vienna RNA package [39, 40]. Despite all these efforts, there have been no tools available to extract arrangement of contacts from structures (beyond statistical measures e.g., contact order and average connectivity) and to efficiently compare the arrangements among many molecules or many conformations of one molecule.

In this paper we presented a systematic way of representing self-interacting linear molecules as simplicial complexes. We equipped the resulting spaces with a metric that can be used as a measure of similarity and dissimilarity between different molecules. One application of this model using extended persistence yields the first computational method for determining the contact arrangement matrix of a single chain molecule, which was previously determined manually [25]. We can also use this model to structurally present RNA and protein folding processes, to analyze genome-wide chromosome conformation capture data, and to assess functional and evolutionary relatedness of biomolecules.

Author Contributions

AM conceived and designed research. SK and AM performed the research and wrote the paper.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer KX and handling Editor declared their shared affiliation and a past co-authorship and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Acknowledgments

The authors thank Gunnar Carlsson for careful reading of the manuscript and helpful comments.

References

1. Clancy S, Brown W. Translation: DNA to mRNA to Protein. Nat Educ. (2008) 1:101. Available online at: http://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393

2. Mashaghi A, Katan A. A physicist's view of DNA. De Physicus (2013) 3:59–61. Available online at: http://ceesdekkerlab.tudelft.nl/wp-content/uploads/Alireza-Allard-manuscript.pdf

3. Pauling L. The molecular basis of biological specificity. Nature (1974) 248:769–71.

PubMed Abstract | Google Scholar

4. Carlsson G. Topology and data. Bull Am Math Soc. (2009) 46:255–308. doi: 10.1090/S0273-0979-09-01249-X