- 1Image and Signal Processing Group, Institute for Computer Science, Leipzig University, Leipzig, Germany
- 2Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig University, Leipzig, Germany
Sequence alignments are often used to analyze genomic data. However, such alignments are often only calculated and compared on small sequence intervals for analysis purposes. When comparing longer sequences, these are usually divided into shorter sequence intervals for better alignment results. This usually means that the order context of the original sequence is lost. To prevent this, it is possible to use a graph structure to represent the order of the original sequence on the alignment blocks. The visualization of these graph structures can provide insights into the structural variations of genomes in a semi-global context. In this paper, we propose a new graph drawing framework for representing gMSA data. We produce a hierarchical graph layout that supports the comparative analysis of genomes. Based on a reference, the differences and similarities of the different genome orders are visualized. In this work, we present a complete graph drawing framework for gMSA graphs together with the respective algorithms for each of the steps. Additionally, we provide a prototype and an example data set for analyzing gMSA graphs. Based on this data set, we demonstrate the functionalities of the framework using two examples.
1 Introduction
The quality and the throughput of sequencing technologies, and thus the variety of data, have continuously increased in recent years (Genomes Project Consortium et al., 2015; Goodwin et al., 2016; Hickey et al., 2023). However, there is a large gap between sequence determination and sequence analysis (Gärtner et al., 2018). Especially, the capabilities for visually comparing sequences provided by the tools are insufficient, as most tools available today are designed for comparing closely related sequences or sequences of a limited length, only.
Multiple sequence alignments (MSAs) are a central procedure in the field of genetic information analysis (Albers et al., 2011). Sequence alignments are commonly visualized using dot plots, synteny views, and parallel coordinate views (Albers et al., 2011). Thereby, the focus is on visualizing local details in the alignments and not global trends. In the MSAs of higher animals and plants, the alignment block size is usually smaller than the size of individual genes (Gärtner et al., 2018). Consequently, it is difficult to analyze the MSAs from a more global perspective on rearrangements like the inversion, the translocation, or the duplication of a sequence with the traditionally used visualization approaches. On the other hand, there are approaches to compare whole genomes with genome-wide multiple sequence alignments (gMSAs). The visualizations showing global trends of such gMSAs are aggregated and are often restricted to only two sequences. As a result, rearrangements are barely visible in these global approaches.
Recently, research efforts and successes in pangenomes have increased (Consortium, 2016; Liao et al., 2023). In a pangenome similarities and differences are usually summarized in the form of a common reference genome assembly for a selection of individuals of one species. This addresses the problem that reference-based comparisons, such as classical MSAs, always have a bias towards the reference. Pangemones are usually represented by graph models (Consortium, 2016). The intervals of a pangenome can be visualized with a graph layout, e.g., by Sequence Tube Maps (Beyer et al., 2019), to compare the different sequences of the individuals used to generate the pangenome.
However, it would be also interesting to investigate variations between different species using a gMSA (Gärtner et al., 2018), especially if an artificially generated common coordinate system can be used as an additional order to counteract the bias toward the reference (Gärtner et al., 2018). Therefore, our overarching objective is to facilitate the examination of variations in the sequences of a gMSA from a semi-global perspective. For each original sequence in the MSA (e.g., a chromosome or contig), a total order of the alignment blocks can be found that represents the original sequence with the underlying sequence intervals. The sum of these alignment block sequences can be represented as a directed multi-graph (gMSA graph).
Here, we propose a graph layout framework that supports the visual comparative analysis of such gMSA graphs using the graph data basis produced by Gärtner et al. (Gärtner et al., 2018). This layout framework is based on the Sugiyama framework (Sugiyama et al., 1981) that was developed for directed acyclic graphs (DAGs). The Sugiyama framework consists of individual steps, for each of which several algorithms exist. However, most of the existing algorithms do not lead to layouts that meet the requirement, that the sequences forming a gMSA graph can be compared easily. Therefore, we propose new and tailored algorithms for most of the steps.
Our contribution is the description of the adapted Sugiyama framework to draw gMSA graphs to support the visual comparative analysis from a semi-global perspective and a prototype to demonstrate the results.
First, related work, important definitions, and fundamentals for both the biological and graph-based aspects are presented (Section 2 and 2.1). Then, the design criteria for the graph layout are presented (Section 2.2). Next, the entire framework from the input data to the final graph layout is described (Section 2.3). Two examples are provided (Section 3) followed by a discussion (Section 4).
2 Methods
Comparative genomics is the field of research in which two or more genomes are analyzed based on their conservation and synteny (Nusrat et al., 2019), The analysis of conservation consists of finding sequence intervals with a high degree of similarity in the genomes (e.g., with an alignment). Synteny refers to analyzing if the location, order, proximity, and orientation of the sequence intervals are similar in the compared sequences.
Nusrat and Harbig et al. (Nusrat et al., 2019) provide a broad overview of existing visualization techniques and tools for comparative genomics. This is the basis for the following discussion. There are generally three basic techniques for the visual comparison of genomes (gMSAs):
The comparison using alignment based techniques is especially performed for small sequences or local analyses (Li et al., 2009; Carver et al., 2012; Yachdav et al., 2016).
When comparing the genomes with connected conserved blocks, the genomes (sequence intervals) are arranged on two or more axes and the synteny is displayed using color or line coding. For this, there are mainly two different types of layouts: linear and circular arrangements for comparing sequences. Examples for tools using linear layouts are Cinteny (Sinha and Meller, 2007) (using colors and lines) or Synteny Explorer (Bryan et al., 2017). Circular layouts are, for example, used by the tools MizBee (Meyer et al., 2009), Circos (Krzywinski et al., 2009), or Synteny Explorer (Bryan et al., 2017). With MizBee (Meyer et al., 2009), a tool is provided that supports analyses at the genome, the chromosome, and the block levels with circular and linear layouts. However, its line based approach is in some instances not suitable for a detailed comparisons on the genome level, since it tends to produce several intersections. Furthermore, most tools like MizBee (Meyer et al., 2009) are designed for the analysis of only two sequences.
In dot based approaches, the comparative sequence axes are arranged orthogonally and similarities are indicated by diagonal rows of dots. Tools using this approach are, for example,: Gepard (Krumsiek et al., 2007), EDGAR (Blom et al., 2016), Syn-Map2 (Haug-Baltzell et al., 2017), and iDotter (Gerighausen et al., 2017). Most of the tools only allow the comparative analyses of two sequences and do not scale well for the comparison of larger regions.
In the following, a brief overview of graph-related related work is given. We decided to use the Sugiyama framework (Sugiyama et al., 1981) as the basis to layout the gMSA graphs since it is by far the most common layout framework for directed graphs (Healy et al., 2013) and is highly adaptable.
Other graph-based approaches related to our problem setting can be found in the field of digital humanities. There, the comparison of different versions of a text is an important aspect. According to Schmidt et al. (Schmidt and Colomb, 2009), so-called Text Variant Graphs emphasize such overlapping textual structures. Jänicke et al. (Jänicke et al., 2014a; Jänicke et al., 2014b) developed Sentence Alignment Flow, a well readable layout algorithm for Text Variant Graphs. While their approach is similar to ours, these graphs are easier to visualize since texts always have the same reading direction, but this does not always have to be the case in gMSA graphs. Therefore, and since the sequences can be considerably longer and the differences are more complex in gMSA graphs, the development of a separate approach was necessary.
The following definitions are based on Chapter 7 of the “Handbook on Graph Drawing and Visualization” (Duncan et al., 2013). Let
2.1 Data definition
According to Wang et al. (Wang and Jiang, 1994), we define a nucleotide sequence (nt-seq) as an arbitrary, finite string
Figure 1. In this example the construction of an MSA
It should be noted, that a vertex in the gMSA graph can also represent a sequence of alignment blocks (merged alignment blocks) if no information is lost as a result. This is the case when two or more alignment blocks are traversed by only one contig or are traversed co-linearly in the same order by several contigs. In this manuscript, the terms alignment block (including merged alignment blocks) and vertex will be used as synonyms.
The alignment block sequences (also called vertex sequences), which reflect the contigs in the gMSA (Figure 1F), serve as the data basis for our visualization. Therefore, our input data consists of a list of vertex sequences
It should also be noted that each comparative sequence starts at an alignment block of the guide sequence and ends at another alignment block of the guide sequence. Between the start and the end alignment block of a comparative sequence there can be any number of alignment blocks, which do not necessarily belong to the guide sequence. These two conditions shall avoid loose ends of the comparative sequences, which are not shared with the guide sequence (as this should be in focus).
2.2 Design requirements
We had extensive discussions with our collaboration partners from the bioinformatics field about what is important and helpful to them for a comparative analysis of sequences. At the same time, we were inspired by well-known visualizations, such as classical genome browsers, and existing graph-based representations for alignment data, such as text variant graphs. With this, the following design requirements were derived:
1. The general reading direction of the graph is from left to right.
2. The guide sequence is represented linearly in the layout, i.e., at the same vertical position with a uniform direction.
3. The first vertex of the guide sequence is placed furthest to the left and the last one furthest to the right, i.e., at the smallest and largest horizontal position, respectively.
4. The comparative sequences are ordered by decreasing relevance, creating a genome order.
5. Sub-sequences of the comparative sequences are arranged above and below the guide sequence if they diverge from the guide sequence.
6. Contiguous sub-sequences of the comparative sequences are aligned horizontally, i.e., at the same vertical position, if possible.
7. If there is a deviation (e.g., an insertion) from the already processed sequences (predecessor in the genome order), these parts are placed between the corresponding vertices if possible.
The Requirements: 1–3 and 5–7 for the layout were highly inspired by the Text Variant Graphs by Jänicke et al. (Jänicke et al., 2014a; Jänicke et al., 2014b). The general reading direction of Requirement 1 (also in accordance with Requirement 3) describes especially the direction of the guide sequence in which no edge may be reversed (Requirement 2) and is therefore based on the familiar representation of genomic data in common genome browsers. Of course, back edges against the reading direction are generally possible in the comparative sequences. Requirement five supports the demonstration of similarities and differences between the sequences. Following Requirement 5, the space above and below the reference is used which creates a compact visualization, unlike classic genome browsers.
Not all comparative sequences can have an equally strong influence on the final layout. Therefore, the genome order of Requirement four is important and affects the graph layout in a strong manner, since the creation of the DAG (which is necessary for the Sugiyama framework) depends on it. This will be discussed further in Section 2.3.1. The length of the contiguous sub-sequences of Requirement six also depends on the genome order (Requirement 4). The same holds for the deviations (Requirement 7).
We impose additional constraints (adapted from Healy and Nikolov (Healy et al., 2013)):
1. Edges should point in a uniform direction,
2. Short edges are preferred over long edges,
3. The vertices should be uniformly distributed in the drawing space,
4. Edge crossings should be avoided, and
5. Straight edges should be preferred
These constraints are frequently imposed in general onto the layout of directed graphs (e.g., by the Sugiyama framework).
2.3 Framework
Our layout algorithm for gMSA graphs is based on the Sugiyama framework (Sugiyama et al., 1981; Healy et al., 2013) (Steps 2–4). While the original framework applies to DAGs only, it was extended to general directed graphs by adding one step at the beginning (Step 1) and at the end (Step 5) of the original framework. In addition of being cyclic, our graphs also contain multiple edges between vertices. Therefore, an additional step for routing these edges was added to the framework (Step 6). At the end, the final layout is computed (Step 7). Overall, we obtain the following extended framework:
1. Cycle removal by reversing edges (Section 2.3.1)
2. Layer assignment (Section 2.3.2)
3. Vertex ordering (Section 2.3.3)
4. Orthogonal coordinate assignment (Section 2.3.4)
5. Reversing the edges reversed in Step 1 (Section 2.3.5)
6. Routing the edges between vertices (Section 2.3.6)
7. Computing the final drawing (Section 2.3.7)
For each step, we describe, which algorithms were used for performing the respective step. Besides taking algorithms from literature, we also adapted known algorithms and created new ones for meeting the requirements (Section 2.2). A data flow diagram of the described framework is shown in the Supplementary Section S1.
The Sugiyama framework has the following additional properties:
1. The algorithm produces a layered graph layout, i.e., every vertex is assigned to exactly one layer.
2. There are no edges between two vertices of the same layer.
3. Crossings of the edges are minimized.
4. Crossings between edges and vertices are avoided.
It is generally assumed that edge crossings impede the readability of graphs and create visual clutter. This is mitigated by reducing the number of edge crossings as reflected by Property 3.
As the requirements partially contradict the properties of the Sugiyama Algorithm such as, e.g., Requirement five and Property 3, it is necessary to consider which is more important for the resulting visualization. In the example mentioned, the straight arrangement of genome parts (Requirement 5) is more important for the visual comprehension than the minimization of the edge crossings (Property 3), which will be reflected by our algorithms.
2.3.1 Cycle removal
Although each alignment block sequence (contig) is cycle-free by itself, there may (and frequently will) be cycles in a gMSA graph due to the sum of the contigs. The simplest example for this is if contig one first visits alignment block one and then alignment block two in its sequence and this is reversed for contig 2. Thus, a preparation step is necessary to create a proper DAG which can be processed by the Sugiyama framework. Therefore, the orientation of some of the edges of the original graph has to be reversed. This step is very important because the edge selection strongly influences the layering and with this the resulting layout.
Several heuristics address the problem of reversing a minimal number of edges for transforming a directed graph containing cycles into a DAG, as this problem is NP-complete (Di Battista et al., 1999; Healy et al., 2013). The heuristics by Berger and Shor (Berger and Shor, 1990) and Eades et al. (Eades et al., 1993) use approaches based on linear ordering. Another approach by Gansner et al. (Gansner et al., 1993) uses a depth-first traversal of strongly connected components to solve this problem. According to them (Gansner et al., 1993), the heuristic reverses edges whose direction against the flow is natural. The heuristic by Demetrescu and Finocchi (Demetrescu and Finocchi, 2003) solves the problem for weighted directed graphs. For us, however, the comparative aspect of the vertex sequences, by meeting our requirements (Section 2.2), is more important than the number of reversed edges. For example, the greedy heuristic by Eades et al. (Eades et al., 1993) was tested and even adapted without success (Figure 2D). This and similar techniques potentially produce sinks or sources in the DAG and therefore Requirements 2, 3, and seven are potentially violated, as sinks and sources would be placed furthest to the right or to the left (Figure 2D). Therefore, a new approach for edge reversal was developed that adheres to these requirements. An example of a DAG created by the new algorithm is shown in Figure 2C.
Figure 2. An example of our new cycle removal algorithm and the results for a greedy heuristic (Eades et al., 1993) are shown. (A) The vertex sequences of the selected sequences. This is the data basis for the graph created in this example. (B) A possible final graph representation of the data from (A). (C) A DAG created by applying Algorithm one to the data from (A). The bypath
The genome order (Requirement 4) has an important significance for this algorithm and thereby for all the subsequent steps. The higher the priority of a comparative sequence in the genome order, the higher the influence on the DAG and thus on the final layout.
Before describing our algorithm, we introduce the following definitions. A vertex sequence
We introduce the notion of a bypath. Let
be a successor function (see Equation 1), mapping each vertex
be a predecessor function (see Equation 2), mapping each vertex
A sub-path
For layout purposes, the set of edges
For
be a function (see Equation 4), mapping an edge from the DAG
The pseudo code of the algorithms described in this section is provided in in the Supplementary Section S2.1. The original graph is represented by the list of vertex sequences
Depending on which comparative sequences are processed first, the decomposition of these vertex sequences in the bypaths will be strongly affected. This also influences the amount of reversed edges, the created DAG, and therefore the subsequent steps of the Sugiyama framework. Since the similarities in vertex sequences between phylogenetically closely related genomes should be large, the priority of a comparative sequence should be higher whenever its relation to the guide sequence is higher. This promotes the visual comparability of the genomes in the gMSA graph layout. It could therefore be advantageous to sort the comparative sequences according to their phylogenetic proximity to the guide sequence, creating a genome order (Requirement 4).
The edges of the guide sequence are always added in their original direction to fulfill Requirement 2. For a given bypath
First, between all adjacent vertex pairs
1.
2. For all predecessors
If the bypath needs to be reversed, the reversed edge
1.
2. For all successors
Again, if the bypath needs to be reversed, the reversed edge
The direction tuple set
In Supplementary Section S2.2 an example is provided showing the results for the sets
The result of this step is a DAG with only one source (the first vertex of the guide sequence) and only one sink (the last vertex of the guide sequence) (Figure 2C). This is necessary but not sufficient for fulfilling the Requirements 1, 2, 3, and 7. These requirements are only completely met after the layering step (Section 2.3.2). Besides, this step is the foundation for fulfilling Requirement six since the length of the sub-sequences is determined by the bypaths. Requirement four is essential for the resulting layout since with every change in the genome order, the DAG created and thus the final layout could be completely different. Finally, the mapping of the multiple edge function
2.3.2 Layer assignment
For assigning the vertices to layers, quite some algorithms meeting different requirements exist (Healy et al., 2013). For the layer assignment, we use the longest-path algorithm without any adaption. A more detailed explanation of the algorithm with associated pseudo code can be found in the Supplementary Section S3.
It is worth emphasizing that our cycle removal algorithm from Section 2.3.1 paired with the properties of the longest-path algorithm satisfies the Requirements 1, 2, 3, and 7, since the guide sequence is the longest path in the graph. The resulting layers represent the horizontal position (x-coordinate) of the vertices in the final layout.
For the subsequent steps of the framework, a proper layering is required, which can be obtained from the results of the longest path algorithm as follows (Healy et al., 2013). Let
Let
An edge between two dummy vertices is called inner edge and all other edges are called outer edge. All dummy vertices of a properly layered DAG are added to the set of dummy vertices
In the following, the notion
2.3.3 Vertex ordering
An important step of the Sugiyama framework is the reduction of edge-edge crossings (edge crossings) to improve the readability and to avoid visual clutter. There are three different crossing types, depending on the involved number of inner edges:
Especially the Type 2 conflicts should be avoided because they produce crossings of long edges, which are hard to follow even without crossings. Moreover, Type 2 conflicts can always be avoided and their absence is a precondition for some algorithms computing the orthogonal coordinates.
To minimize the number of those crossings, the order of the vertices within the layers is changed. We adapted the global k-level crossing reduction heuristic called global sifting that was introduced by Bachmaier et al. (Bachmaier et al., 2010) such that the Requirements 2, 5, and six are fulfilled.
The following formal explanations are based on Bachmaier et al. (Bachmaier et al., 2010). Let the graph
A block
Figure 3. The blue rectangles in the graphs represent the vertices and the black circles the dummy vertices. The edges between them indicate the order of the selected vertex sequences corresponding to their color. The black frames around single vertices or sequences of dummy vertices illustrate blocks. Block-sets are represented by the gray rectangles. Each block-set has a unique vertical position in the block-set order. (A) The vertex order of the selected genomes. (B) A possible initial block-set order with avoidable edge intersections. (C) A possible block-set order after applying our algorithm.
We introduce the term block-set
Let
The idea of the heuristic proposed by Bachmaier et al. (Bachmaier et al., 2010) is to place a block at every position (every block has a unique vertical position) thus finding the optimal position of the block with minimal edge intersections. This is done successively for every block and repeated several times (according to the original heuristic ten sifting rounds suffice). Using the original heuristic, the Requirements two and 6 may be violated by placing the blocks of a block-set far apart concerning their vertical position so that they cannot be horizontally aligned during the next step of the Sugiyama framework. In our heuristic, block-sets, which consist ordinarily of a sequence of blocks, are shifted instead of the blocks. By doing this, the connected blocks of the bypaths can be placed horizontally aligned thus complying with the Requirements two and six (illustrated in Figure 3). The result of this algorithm is an order of the block-sets
2.3.4 Assignment of the orthogonal coordinate
For each vertex
According to Healy and Nikolov (Healy et al., 2013), straight edges (especially for long edges) and vertices that are centered with respect to their neighbors are aesthetically desirable in this step. A standard algorithm for this purpose is the approach by Brandes and Köpf (Brandes et al., 2002), which is linear in time in the number of vertices and edges, and allows at most two bends per edge. Unfortunately, this approach might violate some of our requirements. For all vertices of a block-set, the vertical positions have to be the same to fulfill the Requirements 2, 5, and 6. This is not always the case when using the algorithm by Brandes and Köpf (Brandes et al., 2002). Therefore, a new approach was developed where the vertical positions of the block-sets and thus also the vertical positions of the vertices are determined. This algorithm also guarantees at most two bends per edge in the final layout like the algorithm by Brandes and Köpf (Brandes et al., 2002).
The output of the previous step (Section 2.3.3) was a sorted block-set list
In two steps, all block-sets are placed as closely as possible to the guide sequence without creating intersections between the block-sets. During the first step, the block-sets before the guide sequence in
As a prerequisite, the index of the guide sequence
This step defines the relative vertical position for all vertices (block-sets) and fulfills the Requirements 2, 5, and 6. Improving the vertical position of an aesthetically unpleasing special case and a complexity analysis of this approach are discussed in the Supplementary Section S5.
2.3.5 Preparing the graph to be drawn
The last step of the Sugiyama framework, putting back the edges reversed in Section 2.3.1 in their original direction, is not necessary in this framework because the edges of the DAG are mapped by the multiple edge function
2.3.6 Edge routing
The Sugiyama framework was created for directed acyclic graphs with single edges between two nodes. The gMSA graph, however, is a multi-graph with potentially multiple edges between two vertices. Therefore, a routing algorithm for multiple edges is proposed, which is applied after the positioning of the vertices using the Sugiyama framework.
The multiple edges of
Let
Figure 4. Let
Our requirements for routing the edges are the avoidance of additional edge intersections and a compact edge packing, which still supports readability (Figure 5B). To achieve these requirements, we adhere to the following conventions and impose the following constraints:
Figure 5. The blue boxes represent the vertices, whereby every vertex consists of three areas: the up-, the straight-, and the down-space. The dotted line in the straight-edge space of (A) illustrates the vertical center of the vertex. The gray boxes represent the edge positions. “FS” is the abbreviation for free space and represents the amount of free space between two edges. “BS” stands for border space and represents the amount of free space between the border of a vertex and the vertical line of an edge. (A) A detailed view of placing edges at the right border of a vertex is shown. (B) The final edge layout between two layers created by our approach is shown. One of the vertices is shown in (A) in detail. (C) The detailed view of the down-space of two opposite vertices from (B) is shown. To avoid (partial) overlaps, first the down-edges from the left vertex and afterwards the down-edges from the right vertex are placed. The space used for the down-edges of the left vertex is blocked and can not be used at the right vertex. (D) The detailed view of the horizontal layering of the vertical lines of all edges from (B) in the inter layer space is shown. Every vertical line of an edge has a unique horizontal position in the layout to avoid additional intersections.
To determine the vertex-edge connection points and the horizontal positions of the vertical lines of an edge, four steps are performed:
1. Preprocessing
2. Calculating relative edge-vertex connection coordinates
3. Horizontal edge layering in the inter layer space
4. Determining the space parameter
The first step is a preparation step where the left and the right sides of every vertex are processed consecutively. During the second step, the relative coordinates of the edge-vertex connections are calculated. During the third step, the horizontal positions of the vertical lines of the edges in the inter layer space are determined. During the fourth step, the two space parameters needed vertex height and needed inter layer space, which are necessary for the final drawing (Section 2.3.7), are calculated. A detailed description of the algorithms and a complexity analysis are provided in the Supplementary Section S7. With the chosen orthogonal layout, additional edge intersections are avoided and a compact edge packing that still supports readability was produced.
2.3.7 Final drawing
In this final step, the gMSA graph is drawn. First, every vertex
Figure 6. Two examples of the drawing of a gMSA graph generated with our workflow. The settings used for the graph in (A) are shown in Table 1 and for the graph in (B) in Table 2. The blue vertices represent single alignment blocks and the orange vertices represent merged alignment blocks. The width of a vertex depends on the length of the included alignment block(s) and is between pre-defined limits. The colored edges represent the different sequences, whereby the red edges represent the guide sequence and the other colors the comparative sequences. The directional glyphs (colored arrows) within the vertices indicate the reading direction for the individual sub-sequences aligned. Since there are no aligned sub-sequences for the supergenome, there are also no directional glyphs. The start and end glyphs (triangular flag) illustrate for each contig which vertex is the first one or the last one, respectively.
The vertices are drawn as rectangles. Blue vertices represent single alignment blocks and the orange vertices represent merged alignment blocks. Every vertex has the same height, which is at least the needed vertex height, which was calculated before (Section 2.3.6). The width of a vertex depends on the length of the included alignment block(s) and is between pre-defined limits. The vertices are positioned on a grid, with the block-set position reflecting the vertical position and the layer affiliation reflecting the horizontal position. Therefore, the final vertical position of a vertex depends on the vertical position of its block-set. The vertical free space between two vertically adjacent vertices is always the same. The final horizontal position of a vertex depends on the assigned layer. Since the vertices in a layer can be of different width, the vertices of a layer are vertically aligned with their horizontal midpoint. Consequently, the width of the widest vertex in the graph defines the needed width for every layer. The horizontal free space between two layers, where the edges are located, is always the same and is at least the needed inter layer space, which was calculated before (Section 2.3.6). With this information all vertices can be drawn.
Having fixed the final coordinates of the vertices, the relative coordinates of the edges (Section 2.3.6) can be transformed into final coordinates and drawn. Each edge is drawn individually, including the individual edges in a bundled edge. This makes it easier to follow the route of the individual vertex sequences from
Additionally, the directional glyphs (colored arrows) within the nodes indicate the reading direction for the individual sub-sequences aligned. The color codes the affiliation to the contig, a glyph pointing to the right (reading direction in the graph) means a positive direction of the aligned sub-sequence, and a glyph pointing to the left means a negative direction. For merged vertices, it is possible that a glyph pointing in both directions indicates that there are subsequences in both directions in the contained alignment blocks.
The start and end glyphs (triangular flag) illustrate for each contig which vertex is the first or last, respectively. The color codes the affiliation to the contig, a glyph pointing to the right (reading direction in the graph) means that the sequence begins here, and a glyph pointing to the left indicates the end of a sequence.
The final result is a heuristically generated drawing of the input gMSA graph where all requirements (Section 2.2) are fulfilled. A complexity analysis of this step is discussed in the Supplementary Section S8.
3 Results
In this section we discuss two examples of a gMSA graph layout that were created with our new framework. The examples are sub-graphs of the data set described in the Supplementary Section S9. With the graph layout generated, the similarities and the differences between the guide sequence and the comparative sequences are visually highlighted. Therefore, all requirements of Section 2.2 are fulfilled.
3.1 Example 1
In the first example, four comparative sequences are compared with a guide sequence. The genome order chosen as well as the coloring assigned to the sequences and the selected range are presented in Table 1.
Table 1. Settings used for the graph layout of Figure 6. The nucleotide range 7744–9984 was used for the GS. The order in the table represents the genome order (GS: guide sequence, CS: comparative sequence).
The red edges represent the guide sequence (GS) which traverses the alignment blocks
The green edges represent the first comparative sequence (CS1) having the highest priority for the comparative analysis. All vertices of the GS are traversed. All edges of the CS1 point against the reading direction of the GS and follow the reverse order of the vertices of the GS. In addition, the green directional glyphs (CS1) always point against those of the GS. This may indicate that the reversed strand of the DNA was sequenced and the actual order corresponds to the one of the GS. The difference to the GS are three potential insertions, since the vertices
The purple edges represent the second comparative sequence (CS2). The edges and the directional glyphs of the (purple) are pointing in the same direction as the ones of the CS1; all against the general reading direction, which may indicate the sequencing of the reversed strand of DNA. Unlike the CS1, some vertices (
The pink edges represent the third comparative sequence (CS3). As with the CS1 and the CS2, there are also potential insertions and deletions compared to the GS. In this case the edges all point in the general reading direction and the order of the vertices shared by the GS and the CS3 follows the order of the GS. The directional glyphs always point in the same direction in the shared vertices.
The brown edges represent the fourth comparative sequence (CS4) and show a different pattern than the other comparative sequences. This is the artificial common coordinate system (supergenome) (Gärtner et al., 2018). There are regions where the CS4 and the GS are linear to each other:
3.2 Example 2
In the second example, a longer nucleotide interval of over 106 kb was selected as the GS. The genome order chosen as well as the coloring assigned to the sequences and the selected range are presented in Table 2.
Table 2. Settings used for the graph layout of Figure 6 (B). The nucleotide range 4,515,228–4,621,763 was used for the GS (over 106 kb). The order in the table represents the genome order (GS: guide sequence, CS: comparative sequence).
The comparative sequences CS2, CS3, and CS4 are mostly co-linear to the GS. All three sequences visit the same vertices in the selected area in the same order and have the same directional glyphs. With the exception of four insertions (
However, the green sequence (CS1) is similar to the GS in some parts (it traverses all of its vertices), but it differs from the GS in terms of the order of the vertices. The CS1 starts at vertex
All these findings allow the domain users to confirm, to reject, or to purpose hypotheses regarding the comparative analysis of genomes.
4 Discussion and summary
For layouting gMSA graphs, we presented a complete framework together with the respective algorithms for each of the steps of the framework. Our framework is based on the Sugiyama framework, and we used or adapted existing algorithms, where possible, and developed new algorithms, where needed, within this framework. The layout obtained meets the design requirements derived from the task of visually comparing genome-wide Multiple-Sequence-Alignment (gMSA) graphs. It supports analysts in gaining insights into the closeness and distance of species based on these gMSAs. In addition, artificial common coordinate reference systems, such as the supergenome of Gärtner et al. (Gärtner et al., 2018), can be visualized. As a further possible application, this visualization can be used to visually evaluate the quality of sequence assembling or such artificial reference systems. It could also be tested to visualize pangenomes with this graph layout.
One limiting factor is the runtime of the framework. With large graphs, the time complexities, which are sometimes quadratic or even cubic for the individual steps, can result in a long runtime. Also, with a strong fragmentation of the subsequences of the alignments and strong dissimilarities between the orders, the graph can be very inflated and it can be difficult to follow the orders due to many long edges. However, this is usually due to poor alignment quality. The choice of the order of the comparative sequences has a strong influence on the graph layout. Whether a phylogenetic hierarchy or a graph-based similarity metric leads to better results for such an order has not yet been tested but would be exciting to investigate.
We have implemented a prototype of the described framework. The created graph layout can be used as a starting visualization for an exhaustive analysis of gMSAs. For a more precise analysis, further visualizations such as local MSA visualizations and visualizations of associated annotation data must of course be integrated into connected views. Our next step is therefore to expand the prototype into a visualization system with such functionalities. The final drawing could also be customized according to specific tasks. For example, the aligned sequences could be displayed directly in the vertices or additional information, such as the alignment quality, could be encoded using the vertex color. The layout created can be strongly influenced by changing the drawing parameters. In the Supplementary Figure S7, you can see a clearer encoding of the alignment block length by greatly increasing the maximum vertex width.
Data availability statement
The source code for this project is available in this repository: https://github.com/jeremias-schebera/gMSA-Graph-Browser---Source-Code.git. An executable prototype is available under the following DOI:https://doi.org/10.5281/zenodo.10284921. Docker Compose is required to execute the prototype. Both the source code and the prototype are open source under the MIT license.
Author contributions
JS: Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing–original draft, Writing–review and editing. DZ: Conceptualization, Methodology, Software, Writing–original draft, Writing–review and editing. DW: Conceptualization, Investigation, Methodology, Project administration, Supervision, Writing–original draft, Writing–review and editing.
Funding
The author(s) declare financial support was received for the research, authorship, and/or publication of this article. The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany and by Sächsische Staatsministerium für Wissenschaft, Kultur und Tourismus in the programme Center of Excellence for AI-research “Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig”, project identification number: ScaDS.AI.
Acknowledgments
In particular, we would like to thank Yves Annanias for the constructive and helpful discussion. We would also like to thank Peter Stadler for critically reading the manuscript. The authors acknowledge support from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbinf.2024.1358374/full#supplementary-material
References
Albers, D., Dewey, C., and Gleicher, M. (2011). Sequence Surveyor: Leveraging overview for Scalable Genomic Alignment Visualization. IEEE Trans. Vis. Comput. Graph. 17, 2392–2401. doi:10.1109/TVCG.2011.232
Bachmaier, C., Brandenburg, F. J., Brunner, W., and Hübner, F. (2010). “A Global k-Level Crossing Reduction Algorithm,” in WALCOM: Algorithms and Computation. Editors M. S. Rahman, and S. Fujita (Berlin, Heidelberg: Springer Berlin Heidelberg), 70–81.
Berger, B., and Shor, P. W. (1990). “Approximation alogorithms for the maximum acyclic subgraph problem,” in Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms (USA: Society for Industrial and Applied Mathematics), 236–243. SODA ’90.
Beyer, W., Novak, A. M., Hickey, G., Chan, J., Tan, V., Paten, B., et al. (2019). Sequence tube maps: making graph genomes intuitive to commuters. Bioinformatics 35, 5318–5320. doi:10.1093/bioinformatics/btz597
Blom, J., Kreis, J., Spänig, S., Juhre, T., Bertelli, C., Ernst, C., et al. (2016). EDGAR 2.0: an enhanced software platform for comparative gene content analyses. Nucleic Acids Res. 44, W22–W28. doi:10.1093/nar/gkw255
Brandes, U., and Köpf, B. (2002). “Fast and simple horizontal coordinate assignment,” in Graph Drawing. Editors P. Mutzel, M. Jünger, and S. Leipert (Berlin, Heidelberg: Springer Berlin Heidelberg), 31–44.
Bryan, C., Guterman, G., Ma, K., Lewin, H., Larkin, D., Kim, J., et al. (2017). Synteny explorer: An interactive visualization application for teaching genome evolution. IEEE Trans. Vis. Comput. Graph. 23, 711–720. doi:10.1109/tvcg.2016.2598789
Carver, T., Harris, S. R., Otto, T. D., Berriman, M., Parkhill, J., and McQuillan, J. A. (2012). BamView: visualizing and interpretation of next-generation sequencing read alignments. Briefings Bioinforma. 14, 203–212. doi:10.1093/bib/bbr073
Consortium, TCPG (2016). Computational pan-genomics: status, promises and challenges. Briefings Bioinforma. 19, 118–135. doi:10.1093/bib/bbw089
Demetrescu, C., and Finocchi, I. (2003). Combinatorial algorithms for feedback problems in directed graphs. Inf. Process. Lett. 86, 129–136. doi:10.1016/S0020-0190(02)00491-X
Di Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. (1999). Graph Drawing: Algorithms for the Visualization of Graphs. 1st edn. Upper Saddle River, NJ, USA: Prentice Hall PTR.
Duncan, C. A., and Goodrich, M. T. (2013). “Planar Orthogonal and Polyline Drawing Algorithms,” in Handbook on Graph Drawing and Visualization. Editor R. Tamassia (Chapman and Hall/CRC), 223–245. chap. 7.
Eades, P., Lin, X., and Smyth, W. (1993). A fast and effective heuristic for the feedback arc set problem. Inf. Process. Lett. 47, 319–323. doi:10.1016/0020-0190(93)90079-o
Gansner, E. R., Koutsofios, E., North, S. C., and Vo, K. (1993). A technique for drawing directed graphs. IEEE Trans. Softw. Eng. 19, 214–230. doi:10.1109/32.221135
Gärtner, F., Höner, C., Müller, L., and Stadler, P. F. (2018). Coordinate systems for supergenomes. Algorithms Mol. Biol. 13, 15. doi:10.1186/s13015-018-0133-4
Genomes Project Consortium Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., et al. (2015). A global reference for human genetic variation. Nature 526, 68–74. doi:10.1038/nature15393nature15393
Gerighausen, D., Hausdorf, A., Zänker, S., and Zeckzer, D. (2017). “idotter - an interactive dot plot viewer,” in 25th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2017.
Goodwin, S., McPherson, J. D., and McCombie, W. R. (2016). Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351. doi:10.1038/nrg.2016.49
Haug-Baltzell, A., Stephens, S. A., Davey, S., Scheidegger, C. E., and Lyons, E. (2017). SynMap2 and SynMap3D: web-based whole-genome synteny browsers. Bioinformatics 33, 2197–2198. doi:10.1093/bioinformatics/btx144
Healy, P., and Nikolov, S. N. (2013). “Hierarchical Drawing Algorithms,” in Handbook on Graph Drawing and Visualization. Editor R. Tamassia (Chapman and Hall/CRC), 409–453. chap. 13.
Hickey, G., Monlong, J., Ebler, J., Novak, A. M., Eizenga, J. M., Gao, Y., et al. (2023). Pangenome graph construction from genome alignments with minigraph-cactus. Nat. Biotechnol. 42, 663–673. doi:10.1038/s41587-023-01793-w
Jänicke, S., Büchler, M., and Scheuermann, G. (2014a). “Improving the layout for text variant graphs,” in Workshop VisLR: Visualization as Added Value in the Development, Use and Evaluation of Language Resources.
Jänicke, S., Geaaner, A., Büchler, M., and Scheuermann, G. (2014b). “Visualizations for text re-use,” in 2014 International Conference on Information Visualization Theory and Applications (IVAPP), 59–70.
Krumsiek, J., Arnold, R., and Rattei, T. (2007). Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 23, 1026–1028. doi:10.1093/bioinformatics/btm039
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., et al. (2009). Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645. doi:10.1101/gr.092759.109
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079. doi:10.1093/bioinformatics/btp352
Liao, W. W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., et al. (2023). A draft human pangenome reference. Nature 617, 312–324. doi:10.1038/s41586-023-05896-x
Meyer, M., Munzner, T., and Pfister, H. (2009). Mizbee: A multiscale synteny browser. IEEE Trans. Vis. Comput. Graph. 15, 897–904. doi:10.1109/tvcg.2009.167
Nusrat, S., Harbig, T., and Gehlenborg, N. (2019). Tasks, techniques, and tools for genomic data visualization. Comput. Graph. Forum 38, 781–805. doi:10.1111/cgf.13727
Schmidt, D., and Colomb, R. (2009). A data structure for representing multi-version texts online. Int. J. Human-Computer Stud. 67, 497–514. doi:10.1016/j.ijhcs.2009.02.001
Sinha, A., and Meller, J. (2007). Cinteny: Flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinforma. 8, 82. doi:10.1186/1471-2105-8-82
Sugiyama, K., Tagawa, S., and Toda, M. (1981). Methods for Visual Understanding of Hierarchical System Structures. IEEE Transaction Syst. Man, Cybern. 11, 109–125. doi:10.1109/tsmc.1981.4308636
Wang, L., and Jiang, T. (1994). On the Complexity of Multiple Sequence Alignment. J. Comput. Biol. 1, 337–348. doi:10.1089/cmb.1994.1.337
Keywords: genome analysis, multiple sequence alignment, graph drawing, visualization, genome comparison
Citation: Schebera J, Zeckzer D and Wiegreffe D (2024) A layout framework for genome-wide multiple sequence alignment graphs. Front. Bioinform. 4:1358374. doi: 10.3389/fbinf.2024.1358374
Received: 19 December 2023; Accepted: 08 July 2024;
Published: 16 August 2024.
Edited by:
William C. Ray, Nationwide Children’s Hospital, United StatesReviewed by:
Gianluca Della Vedova, University of Milano-Bicocca, ItalySidharth Mohan, Janssen Pharmaceuticals, Inc., United States
Copyright © 2024 Schebera, Zeckzer and Wiegreffe. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jeremias Schebera, schebera@informatik.uni-leipzig.de