A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution

Genome rearrangements are the evolutionary events on level of genomes. It is a global view on evolution research of species to analyze the genome rearrangements. We introduce a new method called RGRPT (recovering the genome rearrangements based on phylogenetic tree) used to identify the genome rearrangements. We test the RGRPT using simulated data. The results of experiments show that RGRPT have high sensitivity and specificity compared with other tools when to predict rearrangement events. We use RGRPT to predict the rearrangement events of six mammalian genomes (human, chimpanzee, rhesus macaque, mouse, rat, and dog). RGRPT has recognized a total of 1,157 rearrangement events for them at 10 kb resolution, including 858 reversals, 16 translocations, 249 transpositions, and 34 fusions/fissions. And RGRPT has recognized 475 rearrangement events for them at 50 kb resolution, including 332 reversals, 13 translocations, 94 transpositions, and 36 fusions/fissions. The code source of RGRPT is available from https://github.com/wangjuanimu/data-of-genome-rearrangement.


INTRODUCTION
The rapid development of sequencing technologies makes the phylogenetic analysis from the level of whole genome possible. A studied genome is represented as a line of conserved segments (called syntenic blocks). The genome rearrangements of species are changes of syntenic block orderings and losing of sequence blocks. These events include reversal, translocation, transposition, fusion, fission, and so on (Xu et al., 2017;Cheng et al., 2019;Dong et al., 2018). The research on genome rearrangements is mainly three aspects.
One is the computation of evolutionary distance between two species by considering genome rearrangements. Researchers have proposed a lot of metric for measuring the dissimilarity of evolution between species and a large amount of algorithms for computing the metrics. The breakpoint distance is the minimum rearrangement operations transforming one genome to the other genome, which is computed by means of breakpoint graph (Blanchette et al., 1997;Sankoff and Blanchette, 1998). There are lots of algorithms for computing breakpoint distance. In 1995, Hannenhalli and Pevzner put forward an algorithm with O(n 5 ) time complexity to compute the breakpoint distance just considering reversal events (Hannenhalli and Pevzner, 1999). Later, Kaplan improved the algorithm to time complexity O(n 5 ) (Kaplan et al., 2000). In 1996, Hannenhalli designed an algorithm with O(n 3 ) time complexity to compute it by considering translocation events (Hannenhalli, 1995). In 2001, Zhu et al. improved the algorithm to time complexity O(n 2 logn) (Zhu and Ma, 2002). And then Zhu et al. devised an algorithm with O(n 2 ) time complexity (Liu et al., 2004). The DCJ distance is introduced by Yancopoulos et al. (Sophia et al., 2005), which uses the double cut and join (DCJ for short) operation to model rearrangement events, such as reversal, translocation, transposition, fusion, and fission in an unified way. Yancopoulos et al. first propose a method to compute the DCJ distance by considering only translocations and reversals on linear chromosomes (Sophia et al., 2005). Paper (Lu et al., 2006) has proposed an O(n 2 ) time algorithm to compute the distance by considering the fusions and fissions between circular unsigned chromosomes. Unimog (Hilker et al., 2012) is software for computing DCJ distance which implements lots of algorithms (Erdös et al., 2011;Jakub et al., 2011). SoRT is a tool to compute breakpoint distance and the DCJ distance for linear/circular multi-chromosomal gene orders (Yen-Lin et al., 2010). SCJ distance (Feijão and Meidanis, 2011) is defined using the single cut and join (SCJ for short) operations, which is in analogy to DCJ measure. The distance can be computed by a speedily computable.
Two is the reconstruction of the ancestral gene orders by using the genomes of extant species. Ma et al. (Ma et al., 2006) use maximum parsimony principle to recover reliably ancestral genomes starting from phylogenetic tree and adjacent genes in genome and make the probabilistic reconstruction accuracy analysis for the six mammalian genome (human, mouse, rat, dog, opossum, and chicken) based on the improved Jukes-Cantor model. PMAG utilized the Bayesian theorem in the probabilistic framework to infer ancestral genomes (Yang et al., 2014). Multiple Genome Rearrangements (MGR) recovers the ancestral genome by minimizing the rearrangement distance (Bourque and Pevzner, 2002). Multiple Genome Rearrangements and Ancestors (MGRA) is developed to reconstruct ancestral genomes based on multiple breakpoint graphs and is used to analyze rearrangement evolutionary events of seven mammalian genomes (human, chimpanzee, macaque, mouse, rat, dog, and opossum) (Alekseyev and Pevzner, 2009). Decostar (Duchemin et al., 2017) is a software which reconstructs neighborhood relations of ancestral genes aiming at reconstructing the organization of ancestral genomes.
Three is the recognition of the rearrangement events of existing species. Efficient Method to Recover Ancestral Events (EMRAE) is an algorithm which can recognize rearrangement events in evolution described by phylogenetic tree by means of adjacent genes in genomes (Zhao and Bourque, 2009).

MATERIALS AND METHODS Preliminaries
A genome is composed of several chromosomes, and each chromosome is an ordering of syntenic blocks. For convenience, each syntenic block is recorded by an integer, so a chromosome is represented by a signed permutation X=c 1 c 2 ⋯g n , where c i (1≤i≤n) is an integer representing a syntenic block, its sign is assigned with the orientation that is either positive (recorded by c i ) or negative (recorded by -c i ). The chromosome X=c 1 c 2 ⋯c n is the same as -X = -c n -c n -1 … -c 1 . A reversal r (i, j) (i ≤ j) converts chromosome X=c 1 c 2 ⋯c n into a new chromosome Xʹ=c 1 c 2 ⋯−c j −c j-1 ⋯−c i+1 −c i c j+1 ⋯c n , where the reversal is from c i to c j .
A translocation event breaks two chromosomes into four segments and then reconnects them into two new chromosomes. Given two chromosomes X = X 1 X 2 and Y = Y 1 Y 2 , where X 1 =x 1 x 2 ⋯x i-1 ,X 2 =x i x i+1 ⋯x m ,Y 1 =y 1 y 2 ⋯y j-1 , and Y 2 =y j y j+1 ⋯y n , a translocation is represented by tl(i,j). X 1 and Y 1 are exchanged to form two new chromosomes Xʹ=Y 1 X 2 and Yʹ=X 1 Y 2 , or X 1 and Y 2 are exchanged to form two new chromosomes X" = -Y 2 X 2 and A transposition event is to exchange two adjacent fragments on one chromosome into a new chromosome. A transposition is represented by tp(i, j, k), i.e., the fragment c i ⋯c j of one chromosome inserted into after c k . If c k is on the same chromosome (k > j or k < i), then the transposition tp(i, j, k) is called intra-chromosomal; otherwise, it is inter-chromosomal. Given a chromosome X=c 1 c 2 ⋯c i c i+1 ⋯c j-1 c j ⋯c k ⋯c n and an intra-chromosomal transposition, X is converted into A fusion event is to connect two chromosomes into a new chromosome. The fusion acting on chromosomes X 1 and X 2 is represented by f u(X 1 , X 2 ) and forming a new chromosome X 1 X 2 or X 1 −X 2 . A fission is to split a chromosome into two new chromosomes. A fission acting on the chromosome X = X 1 X 2 is represented by f i(X) and forming two new chromosomes X 1 and X 2 (where X 1 and X 2 are non-empty segments).
An adjacency a(c i ,c i+1 ) of genome X is two adjacent integers in one chromosome of X. a(c i ,c i+1 ) is the same as a(−c i+1 ,−c i ). For example, all adjacencies on chromosome X = 1,234 are a(1, 2), a(2, 3), and a(3, 4). For a set of genomes S, an adjacency a is effective w.r.t. S if it belongs to at least one genome and not all genomes. For example, two uni-chromosomal genomes G 1 and G 2 , the chromosome X = 1,234 of G 1 and the chromosome Y = 1 -3 − 24 of G 2 , then all effective adjacencies w.r.t. G 1 and G 2 are a(1, 2), a(2, 3), a(3, 4), a(1, −3), and a(−2, 4).

EMRAE
Given a phylogenetic tree T describing the evolution of the genomes G, EMRAE first computes all effective adjacencies w.r.t. G. Then, it predicts the rearrangement events for each edge of T by means of inference rules (will be introduced in the following). Figure 1 shows a reversal r(2, 3) during the evolution from A to B, where A and B are two uni-chromosomal genomes, and the chromosomes are X = 1,234 and Y = 1 -3 -24, respectively. The set of genomes will be divided into two subsets recorded by S A and S B after removing the edge e from T. Suppose there is not any rearrangement events inside S A and S B . Then, adjacencies a(1, 2) and a(3, 4) can be found in each genome of S A and not in any one genome of S B ; a(1,−3) and a(−2,4) can be found in each genome of S B and not in any one genome of S A . In turn, we can utilize the four adjacencies a(1, 2), a(3, 4), a(1, −3), and a(−2,4) to identify a reversal r(2, 3) occurring on the edge e. The EMRAE method infers the rearrangement events by means of the similar rules.
Let e = (A, B) be an edge of T, G={G 1 ,G 2 ,⋯,G m }the genomes of leaves, and a 1 ,a 2 ,⋯a i the children of A and b 1 ,b 2 ,⋯b j the children of B. EMRAE first selects a number of adjacencies as candidate adjacencies Ca(e,A) for edge e and node A according the following steps.
1. Find the adjacencies are in each genome of S A and not in any one genome of S B , then put them to Ca(e, A); 2. If A is an internal node, find all edges connected with A except e and record them with e 1 ,e 2 ,⋯,e k . For each e i =(u i ,A)(1≤i≤k), G can be divided into two parts after removing e i , S ui is the part not including A. a. Find the adjacencies that are in one genome of each S ui (1 ≤ i ≤ k) and not in any one genome of S B , then put them to Ca(e,A); b. Compute Ca(e i , u i ) and Ca(e i ,u)(1≤i≤k). For each one Ca(e i , u i ), find the adjacency a 1 from Ca(e i , u i ), such that a 1 is not overlap gene with any one adjacency in Ca(e i , u), a 1 has overlap gene with one adjacency a 2 in each Ca(e j ,u j )(1≤j≠i≤k), and a 2 has overlap gene with at least one adjacency in Ca(e j , u), then put a\s\do5 (1) to Ca(e, u).

EMRAE then infers rearrangement from Ca(e, A) and
Ca(e, B) for edge e = (A, B) with the help of inference rules in the following section. From the definitions of genome rearrangements, we find that each genome rearrangement can change several adjacencies. For example, each reversal r(i, j)(i ≤ j) can change two adjacencies a 1 =a(c i-1 ,c i ) and a 2 =a(c j ,c j+1 ) into b 1 = a(c i -1 , -c j ) and b 2 =a(−c i ,c j+1 ). Based on those facts, we obtain the inference rules introduced in the following section.

Inference Rule
Let e = (A,B) be an edge of the phylogenetic tree T. Given adjacencies a 1 = a (c 1 -1 , c i ), a 2 = a (c j, c j+1 ) in Ca(e,A) and b 1 =a(c i-1 ,− c j ), b 2 =a(−c i ,c j+1 ) in Ca(e,B), EMRAE infers a reversal r(i,j) from A to B if all genomes are uni-chromosomal or a 1 , a 2 are in the same chromosome in S A and b 1 , and b 2 are in the same chromosome in S B . Otherwise, we infer a translocation tl (i, j). Similarly, given adjacencies a 1 =a(c i-1 ,c i ), a 2 =a(c j c j+1 ) in Ca(e,A) and b 1 =a(c i+1 ,c j+1 ), b 2 =a(c j ,c i ) in Ca(e,B), EMRAE infers a translocation tl(i,j), or a reversal for a 1 , a 2 in Ca(e,A) and adjacencies b 1 , b 2 in Ca(e,B).
Assume that there are adjacencies a 1 =a(c i-1 ,c i ), a 2 =a(c j ,c j+1 ), and a 3 =a(c k ,c k+1 ) in Ca(e,A) and b 1 =a(c i-1 ,c j+1 ), b 2 =a(c k ,c i ), and b 3 =a(c j ,c k+1 ) in Ca(e,B). EMRAE can predict a transposition tp(i,j,k) during the evolution from A to B if all genomes are uni-chromosomal. Otherwise, suppose m genomes in S A have a 1 and a 2 , then EMRAE can predict a transposition tp (i,j,k) if there are at least m/2 genomes such that the four integers of a 1 and a 2 on the same chromosome, or there are at least m/2 genomes such that the four integers of a 2 and a 3 on the same chromosome.
Assume that there is a=a(c i ,c j ) in Ca(e,A). EMRAE can predict a fission that splits the adjacency a=a(c i ,c j ) if a is sign-compatible for each genome G k in S B . The fusion from A to B can be seen as a fission from B to A.

Recovering the Genome Rearrangements Based on Phylogenetic Tree
EMRAE can not identify the rearrangement occurring in the frontier of genomes. We take Figure 2, for example, where species A, B, and C are uni-chromosomal genomes A = 1,234, B = −2 -134, and C = 1,234. A reversal r(1,2) has occurred in the evolution from A to B. EMRAE can compute the candidate adjacencies a (−1,3) for Ca(e 1 ,B) and a(2,3) for Ca(e 1 ,A). So, EMRAE can not infer the reversal r(1,2) on the edge e 1 according to the candidate adjacencies.
We improve EMRAE so that the improved method (called RGRPT) is able to infer the rearrangement events occurring in the frontier region. The inference rule of RGRPT is the same as that of EMRAE. The difference between RGRPT and EMRAE is that they have different candidate adjacencies. RGRPT puts 0 to the head and tail for each chromosome, so there will be added a lot of adjacencies for each genome. For example, considering the uni-chromosomal genomes X = 1,234 and Y = −2 −134, the two additional candidate adjacencies a(0,1) and a(0,−2) are added.
RGRPT adds candidate adjacencies in the step b of EMRAE. For each one Ca(e i ,u i ) and an adjacency a 1 from Ca(e i ,u i ), if there is an adjacency a 2 in each Ca(e j ,u j )(1≤j≠i≤k) such that a 1 with a 2 has overlap gene, then put a 1 to Ca(e,u).

RESULTS
All of the experiments were performed on a computer with Intel Vostro 14 2.0 GHz CPU, 4 GB RAM, and 500 GB Hard Disk Drives (HDD). The operating system was Win10 64 bit with Java 1.6 installed. RGRPT was written in Java.
We tested RGRPT with both simulated data and the practical data (i.e., real biological data) introduced by the following section.

Simulated Data
Here, we start with an uni-chromosomal genome as the ancestor, and it evolves along the phylogenetic tree with n taxa whose topology sees the Figure 3.
We generate two simulated data sets in order to test the affectivity of RGRPT. One of them is created from the phylogeny only with reversals events. The other data set is generated from the phylogeny with kinds of events, including reversals, translocation, transposition, fusion, and fission, and the quantity of those events is in a certain ratio. The two data sets can test the ability of methods to recover the simple and the complex evolution histories. First data set is created just using reversal events. Since the reversal on only one gene is rare (Korbel et al., 2007), we set the ratio of reversal on one gene and on more than one gene as 1:3. The number of leaves is from 3 to 10 with step 1. For each number of leaves, the ancestor genome with m gene, where m from 50 to 150 with step 10. Each edge will happen k reverse, where k is random integer number from 3 to 10. So, there are 11 groups data for each leaf number. Sensibility is the percentage of correctly predicted events in all practical events. Specificity is the percentage of correctly predicted events in all predicted events. We compute the sensibility and specificity for RGRPT and EMRAE for each group data. Table 1 shows the average sensitivity and specificity for each leaf number. The second column of the table records the number of all events, and its last row records the average values. Table 1 shows that RGRPT achieves higher sensibility than EMRAE, and RGRPT achieves comparable specificity with EMRAE. Obviously, RGRPT can distinguish more actually occurred events than EMRAE. So, the experimental results show that the RGRPT is more efficient than EMRAE for predicting reversal events.
Second data set is generated by using all events, i.e., reversal, translocation, transposition, fusion, and fission. The reversals are generally more than the other rearrangement events. The fusions and the fissions are very rare; so, we record the number of the two events together. Here, we set the ratio of those events as 10:2:2:0.1. The ancestor genome has 5 chromosomes and each chromosome with 100 genes. The ancestor genome evolves along the topology with four leaves (see Figure 3). Each edge happen k events, where k is random number from 1 to μ and μ is 6, 12, 18, and 24. For each μ, it runs 10 times; so, we can obtain 10 groups data for each μ. Table 2 shows the average of 10 groups data for each μ. This table indicates that the RGRPT is more efficient than EMRAE for predicting all events.

Practical Data
The practical data is from the paper (Zhao and Bourque, 2009). It contains six mammalian genomes, i.e., human, chimpanzee, rhesus monkey, mouse, voles, and dog. The data are created from two different levels of resolution 10 kb and 50 kb. Figure 4 is the tree describing the phylogeny of species. The results are shown in Tables 3 and 4. EM and RG represent EMRAE and RGRPT respectively, and Rev, Tloc, Tran, Fus, and Fis represent reversal,   edge are mostly in all edges either at 10 kb resolution or at 50 kb resolution. The syntenic blocks of genomes at 10 kb resolution are more than the syntenic blocks of genomes at 50 kb resolution. The fact reduces the recognized rearrangement events at 10 kb resolution that are more than the recognized rearrangement events at 50 kb resolution. Experiments show that RGRPT can recover more ancestor events than EMRAE.

DISCUSSION
This paper proposes a new method, RGRPT, to infer ancestor rearrangement events. RGRPT takes a phylogenetic tree describing the evolution of species and the genomes of species as input. Experiments on the simulated data and practical data show that RGRPT is more efficient than EMRAE and can recover more ancestor rearrangement events than EMRAE. RGRPT provides a method for us to research the genome rearrangement of species. We can use RGRPT to recognize the ancestral genome rearrangement for the evolution of other species in future (Tian et al., 2018).

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/wangjuanimu/ data-of-genome-rearrangement.