A Stochastic Phylogenetic Algorithm for Mitochondrial DNA Analysis

This paper presents an exploratory analysis of the mitochondrial DNA (mtDNA) of 32 species in the subphylum Vertebrata, divided in 7 taxonomic classes. Multiple stochastic parameters, such as the Hurst and detrended fluctuation analysis (DFA) exponents, Shannon entropy, and Chargaff ratio are computed for each DNA sequence. The biological interpretation of these parameters leads to defining a triplet of novel indices. These new functions incorporate the long-range correlations, the probability of occurrence of nucleic bases, and the ratio of pyrimidines-to-purines. Results suggest that relevant regions in mtDNA can be located using the proposed indices. Furthermore, early results from clustering algorithms indicate that the indices introduced might be useful in phylogenetic studies.


INTRODUCTION
Previous mathematical studies on DNA sequences have seen a variety of approaches and frequently involve a numerical representation of the nucleotide chains. For instance, distance matrices have been constructed using different metrics (Randi et al., 2003;Liao and Wang, 2004;Zhang and Tan, 2007;Kandiah and Shepelyansky, 2013). These matrices, in combination with clustering methods, are used to evaluate phylogenetic relationships among species (Yu and Huang, 2013).
Other studies involve the representation of DNA sequences as random-walks, known as DNAwalks (Peng et al., 1994). The main objectives of these studies focus on the long-range correlations among nucleotides; i.e., "how the frequency of each nucleotide of a pairing nucleotide couple changes locally" (Namazi and Kiminezhadmalaie, 2015). These DNA-walk studies find differences in the long-range correlation between coding and non-coding DNA sequences (Peng et al., 1994).
Recently, DNA-walk analysis has been used in combination with the fractal dimension and Hurst exponent to identify mosaic structures in DNA that allow distinguishing between healthy and cancerous cells (Namazi and Kiminezhadmalaie, 2015).
Additionally, alternative statistical tools frequently used in DNA sequence analysis include Shannon entropy, which is a measure of the amount of "information" stored within a system (López-Ruiz et al., 1995). In a biological sense, Shannon entropy evaluates the probability of independent occurrences of each nucleic base in a DNA sequence. In recent studies, fluctuations in local Shannon entropy in DNA sequences have been analyzed to identify regions of repeating patterns of one or more nucleotides, known as tandem repeats (Thanos et al., 2018). The capability of Shannon entropy to highlight important segments in DNA sequences has led to the supported notion that entropy studies might be used for biological classifications of species (Melnik and Usatenko, 2014).
Similarly, the concept of complexity has played a central role in various DNA sequence analyses. For instance, López-Mancini-Calbet (LMC) complexity, employed in this paper, has led to the development of an effective gene-predicting technique (López-Ruiz et al., 1995;Monge and Crespo, 2015). In a recent study, the symbolic complexity of DNA sequences is used to identify segments resulting from random duplication, as well as changes in the speed of accumulation of point mutations (Salgado-Garcia and Ugalde, 2016).
Our objective is to examine the parameters previously mentioned to determine a small number of coefficients with biological relevance that may be used to determine rates of change in nucleotide bases, establish comparisons between regions, and better understand the relation among species in a phylogenetic sense.
This paper is structured as follows: section 2 introduces the concepts and methodology; section 3 presents the results obtained and the variables introduced; and section 4 is devoted to a discussion of the results, comments on the methodology in general, and final remarks. Tables and figures are incorporated in sections 2 and 3, respectively. The Supplementary Material includes a table with the identification codes for the data.

METHODOLOGY
GenBank R is the National Institutes of Health's genetic sequence database made possible by the collaboration of several organizations. All datasets used within this work were obtained through GenBank because of its availability of access, encouragement of use, and the advantage that the information stays up-to-date.
A total of 32 complete mtDNA sequences of different species in the subphylum Vertebrata were selected. The lengths vary from 16, 207 to 18, 254 base pairs (bp). The choice of this type of DNA presents multiple advantages: it is relatively small in size (in contrast, human chromosomal DNA contains hundreds of millions bp); the sequences contain conserved regions, can be compared in blocks among different species, and contain a small percentage of non-coding regions; and the interpretation of the mutations in mtDNA as an estimator of evolutionary change (Barton and Jones, 1983). For these reasons, the exploratory nature of this study does not require additional information on the species themselves. Thus, the selection criteria focused on 32 different members from 7 groups intuitively related in taxonomic classes. The 32 NCBI codes from the data files have been attached in the Table S1.
A pre-processing of the data files consists of a realignment of the sequences to set the control region of the heavy chain (Hchain) in the direction of transcription as the new ending point. This realignment is done once. The displacement loop, or Dloop, is within the control region and the most varying region in mtDNA, with substantial differences observed even among individuals of the same species (Yamamoto, 2001). See Figure S1 (Supplementary Material). Additionally, the header information was removed, which contains the identification key and the name of the organism. The downloaded files (in .fasta format) were processed using the programming language R version 3.4.4 (2018-03-15). The packages used are stringr and fractal.

DNA-Walk
DNA consists of sequences of nitrogenous bases: adenine (A), guanine (G), thymine (T), and cytosine (C). The length and distribution of the bases fluctuate from species to species. Several mappings have been introduced based on properties intrinsic to DNA. Moreover, adenine and guanine have a two-ring structure and belong to the purine group, while cytosine and thymine have a one-ring structure and belong to the pyrimidine group. Furthermore, adenine bonds with thymine through a double hydrogen bond, which is called a weak bond, while guanine and cytosine bond through a triple hydrogen bond, which is called a strong bond. Figure 1 illustrates these descriptions. In summary, we have: Considering the properties described previously, it is possible to read a DNA sequence and assign either a +1 or −1 depending on whether the respective nucleotide is a purine or pyrimidine (RY rule). This can be interpreted as random steps x i of a onedimensional walk. Then, the final position after n steps is given by where x 0 = 0 by definition. Let S = {s 1 s 2 . . . s M } be a nucleotide sequence of length M, where s k ∈ {A, C, G, T} for k ∈ {1, 2, . . . , M}. Hence, a one-dimensional DNA-walk can be defined through the following rules: • RY rule: • SW rule: • KM rule: where s k is the k−th nucleotide and x k is the value of the k−th assigned step in a DNA sequence. The path of the DNA-walk after n steps is then defined as the partial sums X n = x 0 + n k=1 x k , where n ∈ {1, 2, . . . , M} and x 0 = 0.
In the context of DNA-walks, Equation (2) evaluates the tendency of changes between purines and pyrimidines. Transversions (substitutions of purines for pyrimidines, or vice versa) are less likely to happen and have been used to evaluate FIGURE 1 | Chemical structure of DNA. Adenine, guanine, cytosine, and thymine are shown in colors green, blue, red, and purple, respectively. Notice the double-ring structure of the purines (A, G) and the single-ring structure of the pyrimidines (C, T). Similarly, the type of bond is readily observable: doubleand triple-Hydrogen bonds for A, T and G, C, respectively. This illustration, by Madeleine Price Ball, has a Creative Commons Zero (CC0, i.e., "No Rights Reserved") license and has been published in previous articles (Wikimedia Commons Contributors, 2018). molecular evolution (Stoltzfus and Norris, 2016). Thus, using this rule within corresponding blocks of nucleotides in different species, it is possible to observe changes in the DNA-walk that could be interpreted as an evolutionary variation. Similarly, Equation (4) is associated with the rate of recombination between transversions and transitions (purine-purine or pyrimidinepyrimidine substitutions).
Moreover, Equation (3) refers to the difference in abundance of the GC bond with respect to the AT bond. A higher GC content suggests a significantly higher temperature for DNA denaturing (melting temperature T m ). Previous studies have shown that GC content is associated to an age-related natural selection and environmental factors (Min and Hickey, 2008). Finally, it is assumed that each DNA-walk is an ergodic stochastic process. Specifically, the conceived notion adopted is that each DNA sequence may be used to represent the ensemble of DNA sequences of individuals within the same species.
In summary, the three assignment rules provide insight into the evolutionary aspects of the organisms considered.

Hurst Exponent and DFA Exponent
Additional information of the long-range correlations of DNA-walks can be obtained via stochastic methods such as rescaled-range analysis and detrended fluctuation analysis. With these methods, it is possible to obtain the Hurst exponent, which represents a quantitative measure of the fractal nature of DNA sequences.
The Hurst exponent, here denoted by α, satisfies 0 < α < 1. In comparisons of mtDNA sequences, each Hurst exponent can be interpreted as a measure of the tendency of changes between nucleotides according to the rules mentioned in the previous section. The calculations used to obtain the Hurst exponent have been reported in previous studies (Peng et al., 1994;Buldyrev et al., 1995).
The Hurst exponent is directly related to the fractal dimension α ′ by the relation: The fractal dimension evaluates changes in detail of the pattern of a DNA-walk with respect to the scale used for measurement. An alternative method to calculate the Hurst exponent of a DNA-walk is DFA. In contrast to the rescaled-range analysis, DFA analyzes the random fluctuations of the DNA-walk without trend in the data (Peng et al., 1994;Buldyrev et al., 1995). The DFA exponent is computed using the following algorithm: • Given a numerical sequence X = {X 1 , X 2 , . . . , X M }, calculate the cumulative sum where k = 1, 2, . . . , M and X is the mean value of X. • Divide y k into M/L subintervals of length L. For each window, calculate the polynomial linear fit (the local trend) y k,L via least-squares minimization. • Calculate the fluctuation, which is an average of the squares of the detrended sequence given by • The slope β of the linear regression analysis in the scale log F(L)/ log L is an estimator of the Hurst exponent.
This method tests for self-similarity at different window sizes L. No correlation (or short-range correlations) gives stochastic properties such as those of a random-walk, so β = 0.5; in contrast, long-range correlations give a value of β = 0.5. Specifically, correlation yields β > 0.5, while anti-correlation gives β < 0.5. This paper adopts a minimum block size of 4 nucleotides, while the maximum is B = M 2 , corresponding to half the length of the sequence in question. Should M be odd, B is rounded down.

Chargaff Ratio
In a remarkable discovery, Erwin Chargaff determined that there is a balance held in DNA by the nucleobases (Chargaff, 1950), known as Chargaff 's Rule. These state: (1) that globally (i.e., considering both strands of DNA) adenine is equal to thymine in quantity, and (2) that guanine is equal to cytosine in quantity. This result was the basis for the Watson-Crick model, which determined that adenine binds with thymine and that guanine binds with cytosine (Watson and Crick, 1953).
On this basis, and in the context of this work, the Chargaff ratio is defined as the ratio of pyrimidines to purines: where N C , N T , N A , N G represent the amount of cytosine, thymine, adenine, and guanine, respectively, within one strand of DNA. Note that this value is always positive. If 0 ≤ ξ < 1, there are more purines than pyrimidines (i.e., N C + N T < N A + N G ); similarly, ξ > 1 reflects an excess of pyrimidines over purines. A Chargaff ratio with value 1 results from an equal number of either type of nucleotide bases.

Shannon Entropy
In his seminal paper, Claude Shannon introduced the concept of information entropy. It measures the "amount" of information or uncertainty of a system (Shannon and Weaver, 1998). Let = {ω 1 , ω 2 , . . . , ω N } be a set of events where each ω i has probability of occurrence p i ∈ [0, 1], for i = 1, 2, . . . , N. Thus, the Shannon entropy of the system is defined as where K is a positive constant chosen appropriately according to the units desired for measurement (thus, for this work, K = 1). For the case when p i = 0, p i log 2 (p i ) = 0 in the limit definition. Also, note that the logarithm is in base 2; this is because information in a computer is encoded in binary digits, or bits, which are the basic units of measurement of information.
For N = 2, events ω 1 and ω 2 have probability p and 1 − p, respectively, see Figure S2 (Supplementary Material). Thus, it can be seen that a maximum is attained at p = 1 − p = 1 2 . This result can be extended to the general case with N events. The proof requires Jensen's inequality for a concave function (in this case, the logarithmic function), and is given below. Using some algebra to rewrite Equation (9) with K = 1 yields By the weighted arithmetic-mean and geometric-mean inequality, this implies that where equality (the maximum) is satisfied when p 1 = p 2 = · · · = p N . That is, when To evaluate Shannon entropy in the context of DNA sequence analysis, it seems rather reasonable to define the set of possible events as = {A, G, C, T}. However, it is expected that the probability of occurrence of each nucleotide in a DNA sequence will likely be different for different species; thus, these associated probabilities will be calculated empirically for each DNA sequence in a straightforward fashion. That is, by counting the amount of each nucleotide within the sequence and taking the corresponding proportion by dividing by the total amount of nucleotides M. Thus, the probabilities will be given by where N A , N C , N G , N T are the amount of adenine, cytosine, guanine, and thymine, respectively.
In the context of DNA sequence analysis, maximum entropy is attained whenever the nucleic bases within a DNA sequence are found with equiprobability. It may thus be interpreted that such a sequence is the result of a random combination of these events. Any departure from the maximum value of the Shannon entropy due to an underlying structure might contribute to determining any tendencies present in a sequence, see Figure S3 (Supplementary Material).
In a more general sense, the entropy fluctuations could be analyzed by means of the Local Shannon entropy. By studying the local fluctuations of entropy at a given scale, and across scales, an "entropic microscope" could highlight areas with a high degree of variation or, equally interesting, low degree of variation, as seen in previous studies (Melnik and Usatenko, 2014;Thanos et al., 2018).

Coefficient of Disequilibrium
Additional information of DNA sequences can be derived from the deviations from equiprobability of occurrence of each FIGURE 2 | DNA-walk illustration for various species using the purine-pyrimidine rule. Observe the vicinity of nucleotide 2, 700 and the change in tendency from a purine-rich region (positive slope) to a predominance of pyrimidines for the remaining DNA-walk (negative slope).
Frontiers in Genetics | www.frontiersin.org nucleotide. This measure is known as disequilibrium (López-Ruiz et al., 1995). The events in the set have probability p i for i = 1, 2, 3, 4. The coefficient of disequilibrium, D, is defined as: FIGURE 3 | DNA-walk illustration for various species using the strong-and weak-bond rule. Observe the immediate (and consistent) tendency. This indicates that mtDNA is rich in adenine and thymine, whose type of bond is weaker than that of cytosine and guanine. This sum of squared distances can be seen as a type of variance. Note that D = 0 in the case of equilibrium. Any deviation from this would result in D > 0. The maximum disequilibrium value, D max = 3 4 can be obtained using multivariate calculus.
The coefficient of disequilibrium may represent a measure of relatedness between a DNA sequence and one resulting from a random process if each (independent) event has a probability p i of occurrence. That is, larger deviations from an equiprobable space yield higher coefficients of disequilibrium. It can be observed that this behavior counters that of the Shannon entropy in an intuitive manner.

Coefficient of Complexity
The coefficient of complexity C is then given by the product of the Shannon entropy (9) and the coefficient of disequilibrium (12), as in (13). It can be seen from (12) that D resembles the definition of The coefficient of complexity may thus be regarded as the Shannon entropy weighted by the coefficient of disequilibrium, which can be interpreted as the tendency of a random sequence.

RESULTS
The three DNA-walks for the 7 groups are depicted in Figures 2-4. Results for the Chargaff ratio ξ and Shannon entropy H are shown in Table 1, while Tables 2, 3 contain the Hurst and DFA exponents for each type of random-walk and for each sequence.
In Figure 2, there is an initial upward trend that is present irrespective of the species. The RY rule (Equation 2) implies that a (local) inclination toward the positive direction of the vertical axis corresponds to a (local) majority of purines (adenine or guanine). Similarly, the downward trend in Figure 3 reflects a consistent predominance of the weakly-pairing bases, adenine or thymine (considering rule SW). Thus, adenine dominates within the range 0− ∼ 3, 000 bp.
The Hurst exponents for the rules RY, SW, and KM (Equations 2-4, respectively) fall in the range of 0.900 − 0.912 and imply a long-term positive autocorrelation. To put it into perspective, a Hurst exponent value of 0.9 indicates that, on average, the tendency of changes between nucleotides varies slightly as the sub-sequence size is changed. Moreover, the proximity of the Hurst exponent toward unity suggests that either purines or pyrimidines are predominant; it cannot distinguish, however, which one prevails. Similarly, the DFA exponents fall within 0.64 − 0.91 which implies the existence of strong long-range correlations in the sequences even after detrending. Interestingly, neither the Hurst nor DFA exponent values are near zero in any of the species considered. A possible explanation is that the tendency of changes between nucleotides does not vary randomly; i.e., mtDNA has an informational structure.
For all the DNA sequences, the Chargaff ratio is positive with ξ > 1, implying a larger amount of pyrimidines than purines. This implication is visually reflected in the overall downward tendency of the curves in Figure 2.
The disequilibrium coefficient takes values D ∈ (0.01 − 0.03). From Equation (12), values near 0 imply that the probabilities p i ≈ 1 4 for any of the four nucleic bases. In other words, the disequilibrium values obtained suggest that the four nucleotide bases appear with almost the same proportion within each of the 32 mtDNA sequences. This is further supported by the Shannon entropy values. In this case, Equation (10) and N = 4 yield a (theoretical) maximum entropy value H = log 2 (4) = 2. Hence, the empirical entropy values H ∈ (1.89 − 1.97) suggest near-equiprobability among the nucleic bases.
A graph of D vs. the Shannon entropy H suggests a linear relation. On this account, the disequilibrium coefficient is omitted for the remainder of the study. In addition, the complexity coefficient is omitted due to its direct proportionality to D. See Figure S4 (Supplementary Material).
This work proposes three new evolutionary indices as functions of Shannon entropy, the Chargaff ratio, and the fractal dimensions derived from the Hurst and DFA exponents: These indices reflect the long-range correlations found in DNAwalks and the information given by Shannon entropy and the Chargaff ratio. The fractal dimensions α ′ and β ′ are derived from the Hurst and DFA exponents, respectively, using Equation (5). The natural logarithm can be seen as a transformation that maximizes the differences between the coefficients. Equations (14), (15), and (16) are defined from an evolutionary perspective, while Equation (16) provides information on the energy content of sequences.
In Equation (14), the logarithm of the fractal dimension derived from the Hurst exponent using the KM rule provides information regarding the transversions and transitions of the entire DNA sequence. On the other hand, the Chargaff ratio is used as a weighting factor for the fractal dimension derived using the RY rule. The logarithm of the product of these quantities provides an evolutionary measure related to the long-range correlations. The last term in the equation (the Shannon entropy) evaluates the probability of independent nucleotide changes for a given DNA sequence.
Equation (15) uses the fractal dimensions of the DFA exponents, which are computed using the detrended DNAwalks. Therefore, it is not accurate to include the Chargaff ratio or Shannon entropy as normalization parameters. Finally, Equation (16) represents a measure of the natural selection factors in relation to the environment. Results for v 1 , v 2 , v 3 are shown in Table 4.
Clustering algorithms may benefit from the proposal. Preliminary results, shown in Figure 5, suggest a possible application in studies centering on the evolutionary relations among species. The proposed indices are used in the groupaverage agglomerative clustering algorithm with Euclidean metric and the sum of distances as the clustroid. Furthermore, an additional grouping was constructed using a traditional program, ClustalW, which is frequently applied to the study of phylogenetic trees, as seen in Figure 6.
The implementation of the algorithm using the R programming language is not computationally demanding, with running times of about 15-20 min. In comparison, ClustalW requires about 2 and a half hours for the construction of the phylogenetic tree of 32 mtDNA sequences.
The comparative analysis between the two methods shows consistency among the group of primates and other mammals sharing a common ancestry of similar lineage to the lemur. On the other hand, the marsupials and rodents (including the common rabbit) are more closely grouped with the stochastic algorithm and present a common ancestor, just as calculated by the traditional method. Other groups that share proximity with both methods are the reptiles and the birds, as well as the fish group and some amphibians.
The most pronounced differences are found in certain taxa. The proposed method relates the rabbit more closely to rodents, with characteristics similar to marsupials. Meanwhile, the traditional method positions the rabbit closer to primates. Another interesting point is that the proposed stochastic method shows that small reptiles and birds are more closely related, while the traditional method relates the birds closer to large reptiles.

CONCLUSIONS
As has been suggested by other studies, Shannon entropy and Hurst and DFA exponents provide insight into the properties of DNA sequences (Peng et al., 1994;Oiwa and Glazier, 2004;Melnik and Usatenko, 2014;Monge and Crespo, 2015;Namazi and Kiminezhadmalaie, 2015;Salgado-Garcia and Ugalde, 2016;Thanos et al., 2018). This exploratory analysis combines various measures utilized in the literature to establish a biologically meaningful measure of distinction among species.
Our proposal defines new indices as functions of Shannon entropy, the Chargaff ratio, and fractal dimensions using rescaled-range analysis and DFA. These indices can be employed to construct phylogenetic trees using clustering algorithms.
Long-range correlations attributed to DNA-walks can be identified during our study. These can represent data with persistence in its evolutionary memory; i.e., that mtDNA sequences contain highly conserved regions among similar species.
The comparison between the traditional and the proposed clustering method shows clear agreements; however, there are differences that must be analyzed under an evolutionary perspective. For example, we notice that the mtDNA sequences of the common rabbit and the common snapping turtle show different properties in both methods. According to the established phylogeny, the placement of the rabbit is closer to the rodents. Interestingly, results of the stochastic hierarchical clustering suggest a potential application for phylogenetic studies.
Evolutionary processes are associated to an adaptive selection of the species throughout millions of years. However, the fluctuations of the changes in nucleotide bases could be random in order to find new sequence combinations. The proposed method attempts to measure the stochastic fluctuations to yield indices that allow the observation of tendencies and correlations in the mutations that produce new species throughout evolutionary history.

AUTHOR CONTRIBUTIONS
MC-R provided data collection of the mtDNA sequences from the GenBank R , worked on numerical and graphical results, and drafted the article. FJ provided numerical analysis, methodology, and mathematical insight. FH-C rendered numerical analysis, as well as mathematical and biological interpretations. JC-G contributed with numerical analysis, revision, critical revision for important intellectual content, and co-final approval of the version to be published. OG-A provided textual and structural revision of the co-final version of this work.

ACKNOWLEDGMENTS
Thanks are due to the Consejo Nacional de Ciencia y Tecnología (Conacyt) for providing a scholarship for one of the authors. Special recognition is given to the Universidad Autónoma de Nuevo León, the Facultad de Ciencias Físico-Matemáticas, and the Centro de Investigación en Ciencias Físico-Matemáticas for logistical support given during our research endeavors. Thanks are due to Programa para el Desarrollo Profesional Docente, para el Tipo Superior (PRODEP) for the support for the publication of the article.