Quantitative Analysis of Protein Evolution: The Phylogeny of Osteopontin

The phylogenetic analysis of proteins conventionally relies on the evaluation of amino acid sequences or coding sequences. Individual amino acids have measurable features that allow the translation from strings of letters (amino acids or bases) into strings of numbers (physico-chemical properties). When the letters are converted to measurable properties, such numerical strings can be evaluated quantitatively with various tools of complex systems research. We build on our prior phylogenetic analysis of the cytokine Osteopontin to validate the quantitative approach toward the study of protein evolution. Phylogenetic trees constructed from the number strings differentiate among all sequences. In pairwise comparisons, autocorrelation, average mutual information and box counting dimension yield one number each for the overall relatedness between sequences. We also find that bivariate wavelet analysis distinguishes hypermutable regions from conserved regions of the protein. The investigation of protein evolution via quantitative study of the physico-chemical characteristics pertaining to the amino acid building blocks broadens the spectrum of applicable research tools, accounts for mutation as well as selection, gives assess to multiple vistas depending on the property evaluated, discriminates more accurately among sequences, and renders the analysis more quantitative than utilizing strings of letters as starting points.


INTRODUCTION
The analysis of amino acid sequences or coding sequences for proteins is central to the study of molecular evolution. Among the main tools is the construction of phylogenetic trees, which are assembled based on algorithms that consider the numbers of mismatched amino acids or bases between aligned sequences (Charleston, 2013;He, 2019). In this strategy, multiple comparisons may be characterized by identical numbers of differences and are placed on the same evolutionary level. The resultant tree represents a hypothesis on the developmental paths that have led to the existing diversity.
The rationale for basing evolutionary studies of proteins on their differences in amino acid sequences is rooted in the mechanism of mutation, which alters bases in the coding sequence, consecutively resulting in alterations of individual amino acids. However, selection-the second major driver of evolution-favors certain mutations over others, and this preference is not captured by the evaluation of sequence mismatches. It is feasible to research protein evolution by taking the selection component into account, which may be best reflected in the physicochemical properties of the mutated and selected amino acids. We hypothesized that basing phylogenetic analysis on such quantifiable properties of amino acids is more discriminating than predicating this analysis on base or amino acid sequences. By incorporating a readout for selection, not only mutation, evolutionary distances calculated through this quantitative method are likely more accurate than distance estimates obtained from conventional approaches.
The replacement of the letters, which indicate identities or mismatches of the amino acids, with quantitative measurements of their properties generates strings of numbers that can be studied with various mathematical techniques of complex systems analysis. Therefrom, numerical assessments are obtainable for the pairwise relatedness of these strings of numbers, including autocorrelation, average mutual information, and fractal dimension. More quantitative phylogenetic trees can thus be generated. Bivariate wavelet analysis differentiates hypermutable from conserved regions. Here we apply these techniques to the evaluation of Osteopontin phylogeny.
The cytokine Osteopontin is of keen interest because of its importance for tissue remodeling, cellular immune responses, and calcium homeostasis in milk and urine. In pathophysiology, the biomolecule contributes to the progression of multiple cancers (Weber, 2008). From the evolutionary context, conserved and variable domains can be inferred. It is implied that domains with a high level of conservation among species are reflective of important biological functions being fulfilled by such regions. A comprehensive phylogenetic analysis of 202 Osteopontin protein sequences was recently conducted with conventional methods (Weber, 2018). It followed a much more basic tree construction reported earlier (Sivakumar and Devaraj, 2014). Here, we go beyond those predecessor studies with a quantitative evaluation based on the physico-chemical properties of the Osteopontin amino acid strings.

Source Data
To study Osteopontin, we have used a subset of taxa from Weber (2018), having eliminated the sequences belonging to fish, birds, reptiles and rodents. As we develop new approaches for phylogenetic analysis in the present investigation, we sought to focus the large number of available sequences only on comparisons among higher-developed taxa. The less extensive range of diversity in this domain also allows us to concentrate on more nuanced sequence variations. We constructed consensus sequences for each taxonomic group under study here (primata, perissodactyla, carnivora, chiroptera, artiodactyla a = camelidae/suidae/celaceae, artiodactyla b = cervidae/bovidae, marsupialia, prototheria, xenarthra/afroteria) by applying the most common amino acid to all polymorphic sites, then aligning them in Clustal Omega.
To corroborate the methodology, we performed a separate analysis on the avian Osteopontin sequences. Five major clades of neoaves have been described (Prum et al., 2015). Among 64 avian Osteopontin sequences in NCBI nucleotide, we previously identified four groups. We constructed consensus sequences for each taxonomic group by choosing the most common amino acid for every polymorphic site, aligned the sequences with Clustal Omega, converted them into number strings according to their physico-chemical properties (gaps were replaced with 0 yielding number strings of 332 positions for all groups), and calculated the distances on phylogenetic trees. We also constructed a tree based on the autocorrelation values.

Sequence Conversion
The values for select properties of the amino acids were used to replace the letter codes, so as to characterize volume, hydropathy index, solubility, octanol interface, or pI at 25 • C. Even though the exact numbers pertaining to the properties of the isolated amino acids may shift after their incorporation into a protein, they represent good estimates for quantification of the molecular characteristics. Gaps were replaced by 0, yielding number strings of 356 entries for the Osteopontins from the advanced taxa.

Phylogenetic Tree
Because available programs to generate phylogenetic trees require the input of letter strings, a manual approach had to be applied for this study. We generated phylogenetic trees with a strategy similar to conventional algorithms (Meneely et al., 2017). For each physico-chemical property under study, the number strings pertaining to the Osteopontins were aligned pairwise, such that all pairs were covered. In each position (1 through 356 for advanced species, 1 through 332 for aves, 1 through 230 for VEGF), the absolute of the difference between two sequences was calculated, and the results for all positions were added up to yield the sum-difference. Among all the pairwise comparisons, the smallest sum-difference between two strings of numbers was considered to represent the evolutionary distance between the closest relatives. In the next step, the number strings for the two taxa with the smallest sum-difference were combined by averaging at each position, and the process was repeated until all distances between taxa had been calculated.
The tree was constructed with the branch lengths equaling the calculated distances.

Autocorrelation
Proteins at various stages of evolution sometimes repeat patterns or have other properties, according to which developmentally earlier values display some relation to later values. The autocorrelation statistic measures the degree of that affiliation as it refers to linear dependence. The magnitude of its dimensionless number reflects the extent of similarity. The formula for autocorrelation R m is comprised of terms for autocovariance and variance autocorelation = autocovariance variance Autocorrelation coefficients range from −1 to +1, with +1 indicating perfect synchrony and −1 reflecting exact mirror images. An absence of any correlation yields R m = 0.

Average Mutual Information
The average mutual information, an information theory measure, summarizes a non-linear correlation function that quantifies the amount of information shared between the sequence data from two species. The average mutual information was calculated with the mi.empirical function in the R package entropy, where the Shannon entropy is calculated. The mutual information, MI, is defined as MI = H(X)+H(Y)-H(X,Y), where H(.) represents the marginal or the joint entropy function.

Box Counting Dimension
Contrasting with Euclidean measures, the dimension in terms of fractal geometry is best described as a non-integer. This dimension serves as a quantitative measure for the evaluation of geometric complexity by objects, such as two number strings representing different taxa. A general relationship assumes Here, we estimated the box counting dimension for the plots of pairwise comparisons between taxa after binning into 10 × 10 squares of 2-dimensional graphs (with both axes of identical length). The numbers generated as indicators of dimension are larger than 1 (approximating identity between the taxa) and smaller than 2 (approximating total independence between the sequences). Smaller box counting dimensions are reflective of closer relatedness between the number strings representing two taxa.

Bivariate Wavelet Analysis
The cross-wavelet analysis allows us to conduct pairwise comparison between taxa from the frequency domain. It illustrates the similarity in periodicity between two sequences. We have four images to summarize the analysis results, including the cross-wavelet power plot, the wavelet coherence plot, the average power plot and the phase difference image. While crosswavelet power corresponds to covariance, wavelet coherence is a measure similar to correlation with a value range between 0 and 1. Two sequences are coherent in periodicity if they have a constant relative phase. The cross-wavelet power and coherence plots also contain arrows showing the area of significant joint periods (significance level = 0.05). The direction of the arrows indicates the direction of phase differences with (π/2, π) and (−π, −π/2) angles for out-of-phase and (−π/2, π/2) for inphase. A more explicit global view of the phase difference is shown in the phase difference image. The coherence plot, which is not affected by wide swings in sequences, is more appropriate in showing the areas where two sequences share jointly significant periods. The cross-wavelet average power plot shows the shared periods, the corresponding average power and the corresponding statistical significance.

Physico-Chemical Properties of Amino Acids
There are empirical numbers for all twenty amino acids, which are used in higher species, to characterize their molecular weights, sizes, volumes, various solubility characteristics, and isoelectric points ( Table 1). These numbers represent good estimates for quantification of the relevant properties. For each taxon under investigation, the consensus Osteopontin sequence was generated by choosing the most common amino acid for every polymorphic site, and the sequences were aligned with Clustal Omega (Supplementary Table 1). Gaps were replaced by 0 (yielding number strings of 356 entries for all taxa), and all amino acids were replaced by the number for the physicochemical characteristic of interest (volume, hydropathy index, solubility, octanol interface, and pI at 25 • C) (for solubility as an example, see Supplementary Figure 1).

Phylogenetic Trees
There are several algorithms in use for the construction of conventional phylogenetic trees, yielding similar but distinct results (see examples in Supplementary Figure 2). Applying a more quantitative strategy, for each physico-chemical property under study, we utilized the number strings to calculate relatedness, and assemble numerical trees. The trees based on amino acid volume and hydropathy are very similar to each other (Figures 1A,B). Likewise, the trees based on solubility and octanol interface resemble each other closely (Figures 1C,D). These two groups are reflective of alternative hypotheses regarding the evolution of Osteopontin. Noticeably, the trees generated in this study (Figure 1 and Supplementary Figure 5-described below) are mutually less divergent than the trees generated with the conventional algorithms (Supplementary Figure 2).
We corroborated the approach with the separate subset of avian Osteopontin sequences (Supplementary Figure 6, deliberately placed at the end). Strikingly, the conventional The table shows the amino acid name, code, and one-letter abbreviation, followed by the values for the indicated physico-chemical properties. The top row specifies applicable units.
Frontiers in Genetics | www.frontiersin.org trees displayed stark divergence, whereas the trees generated on the basis of physico-chemical properties or on the basis of their autocorrelation (averaged across all properties) were highly consistent.

Overall Relatedness
Readouts of complexity allow for pairwise comparisons between two strings of numbers. The protein sequences under study here were analyzed according to their autocorrelation, average mutual information, and box counting dimension (an estimate of their fractal dimension). The autocorrelation of two data strings with few mutations is expected to be high (approaching 1.0). Thus, increasing evolutionary distance can be reflected in a reduction of the autocorrelation. We calculated the autocorrelation values, pairwise between taxa, for each physico-chemical parameter under study (Supplementary Table 2A) and averaged all values as estimates for evolutionary relatedness ( Table 2A).
The average mutual information is information shared between the measurements of two strings. For protein sequences, the average mutual information reflects evolutionary relatedness. We calculated the average mutual information by pairwise comparison between all taxa for each of the physico-chemical parameters under study (Supplementary Table 2B). We averaged the calculated numbers across all five properties (Table 2B).
On a graph that plots the number values for Osteopontins by two species against each other, a close relatedness between the two taxa is reflected in a data distribution near the 45 • angle, resulting in a small fractal dimension. More distant relationships are characterized by wider scatter and higher fractal dimensions (practically approximated by the box counting dimension with values between 1 and 2). We calculated the box counting dimensions for the amino acid volumes, isoelectric points and solubilities among all pairs of taxa in this study (Supplementary Table 2C) and utilized their averages as estimates for evolutionary distance (Table 2C).
Once the amino acid sequences of proteins are converted to numerical values that represent their measurable characteristics, those strings become amenable to quantitative evaluation of their similarities. Depending on the focus of the investigation, the properties can be analyzed individually or together. Here, we chose to average autocorrelation, average mutual information, and fractal dimension across all properties evaluated. We interpret the resulting numbers as being reflective of evolutionary distance. Thus, the overall relatedness among protein sequences can be estimated quantitatively by tabulating the values for the above measures (Supplementary Table 3). Of note, the evolutionary distances inferred from autocorrelation, average mutual information and box counting dimension are very similar. The approach is corroborated by the consistent results. It is further validated by the analysis of the subset of avian Osteopontins (Supplementary Figure 6) as well as by the analysis of a different protein, VEGF (Supplementary Figure 7).

Regional Mutability
The values pertaining to the physico-chemical characteristics of individual amino acids can be plotted, with the mean values across taxa and a measure of their scatter, on a graph of property (here the values from Table 1) versus position (here amino acid 1 through 356). Domains with no or small error bars are highly conserved, whereas data clusters with large error bars indicate mutable regions (Supplementary Figure 3). In a more stringent approach, the numerical sequences that represent the amino acid properties in each position within a protein can be viewed as waves (which start at position 1 and extend through the length of the protein), Algorithms are available for the comparisons of such waves. Preserved regions display synchrony between the waves that represent two taxa, while mutated regions are visualized as divergence between the waves. Bivariate wavelet analysis can thus reveal domains of evolutionary stability (preserved domains) versus evolutionary instability (Figure 2 and Supplementary Figure 4). One representation is the cross-wavelet average power plot, which provides a summarized view on the shared periods, the corresponding average power and the associated statistical significance (Figure 2A). Significant joint periods (red dots in Figure 2A) are reflective of low mutability and have higher abundance between species of close evolutionary relatedness.
The phase difference image ( Figure 2B) provides a global view of the synchrony status by the two variables at all amino acid positions with different Fourier periods. The graph displays inphase relationships between the two wavelets as shades of green to yellow, with the x-axis indicating the amino acid position and the y-axis indicating the corresponding Fourier period. The color shades of green to yellow represent regions of high conservation between the two species under study. The graph also shows outof-phase relationships as shades of blue and red, and those are regions in the protein with evolutionary instability. Marsupials and prototheria are quite distant from each other and from the other taxa studied here, which is reflected in a high abundance of blue to red areas in the graph.
Two waves are coherent if they have a constant relative phase. The wavelet coherence plot is a measure similar to correlation. It has a value range between 0 and 1, and it shows statistical significance only in areas where the two series being compared actually share significant joint periods ( Figure 2C). Coherence is high (red color in the graph, enclosure by white contour FIGURE 2 | Continued period for synchrony (0-256). The color range (π/2,π) and (-π, -π/2) covers shades of blue and red to code for out-of-phase and (-π/2, π/2) shades of green and yellow for in-phase. (C) Wavelet coherence plot. The x-axis indicates the amino acid position in the protein (1-356), while the y-axis displays the length of the Fourier period for synchrony. The color legend shows the magnitude of coherence in frequencies between two sequences, with 1 as high coherence (orange) and 0 low coherence (blue). White contour lines depict significance for joint periodicity, black arrows depict the phase difference in the areas with significant joint periods.
lines) between taxa with evolutionary closeness. Green to blue areas in the graph are a readout for the extent of the loss of coherence between two taxa. Their x coordinates show the region of mutability on the protein, while their y coordinates (Fourier period for synchrony) reflect the extent of the difference between taxa. Artiodactyla b display a region of difference from all other taxa between positions 200 and 250. This equates to a stretch of gaps downstream of the RGD motif, where all other taxa have amino acids. Similarly, a stretch of gaps close to position 100 in marsupialia and prototheria is recognizable in the coherence plots.

DISCUSSION
Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics (Penner et al., 2011). Conventional phylogenetic trees, based on the alignments of nucleotides or amino acids, often tabulate the numbers of mismatches as a basis for calculating relatedness. Accounting for gaps often is problematic [in a prior phylogenetic analysis of Osteopontin using a large set of sequences (Weber, 2018), the exclusion of gaps caused a substantial reduction in the number of residues that contributed to the calculations]. Further, this approach may place two species at identical distance from all others based on identical numbers of mutated positions (see the conventional phylogenetic trees in Supplementary Figure 2). Trees based on numerical values, which measure amino acid properties, can use the value 0 to account for a gap. Due to the applicable range of such values for the individual amino acids, they are extremely unlikely to encounter identical distances between any pairs. The use of such numbers in tree construction considers not only mutation but also selection as forces of evolution, because selection favors certain mutations over others, and this preference is reflected in the physico-chemical properties of the mutated and selected amino acids (it is not reflected in the assessment of mismatches alone). The resultant tree diagrams are more likely to be a representation of the actual evolutionary distances.
In addition, we note that the calculated values for overall relatedness, comprising autocorrelation, average mutual information, and fractal dimension are interpretable as evolutionary distances that can be mapped in a tree (Supplementary Figure 5). Their use could substantially simplify and unify the construction of phylogenetic trees compared to a diversity of algorithms that are currently in use.
Some techniques of complex systems analysis have been successfully applied to phylogenetic studies, albeit not on the basis of physico-chemical quantities. The patterns of inheritance predict greater similarity in the tempo of molecular evolution between direct ancestors and descendants than in branched relationships, resulting in autocorrelation of evolutionary rates in the tree of life. Rate autocorrelation has turned out to be a common phenomenon throughout evolution (Gillespie, 1984;Tao et al., 2019). Various calculations of mutual information-based distances have yielded better results than conventional distances in distance-based phylogeny. A version based on single letter Shannon entropies gave superior results throughout the entire animal kingdom (Penner et al., 2011). It is reasonable to propose that the logical next step of replacing amino acid identities with their quantifiable properties results in additional improvement.
For the analysis of the highly developed organisms, our approach has produced similar, but not identical trees depending on which parameter was used for generating them. This is not a concern, not only because there is room for improvement of the very basic algorithm applied (as described in section "Methods"). Even for the simple alignment of amino acid sequences followed by counting the number of mismatches, there are diverse methods that yield similar but distinct results [for Osteopontin see Weber (2018), also Supplementary Figure 2]. We believe that the basis of assigned values for various characteristics and their comprehensive analysis offer a path to closer approximation of the actual evolutionary history than the reliance on sequence information alone. Alternatively, for certain proteins, some physico-chemical characteristics may be more important for function than others. It is meaningful to assume that their selection offers a more accurate estimation of the evolutionary history than properties of lesser functional importance. Further, it is noteworthy that calculations of mutual information can assist in judging the relative merits of distinct algorithms by estimating the significance values of specific alignments (Penner et al., 2011). This enables an objective comparison among methods.
This investigation has focused on the innovation of replacing the letter strings with quantitative properties in evolutionary analysis. Further refinements are possible. The box counting procedure is subject to error arising from arbitrary grid placement (quantization error), which is strictly positive and varies as a function of scale. This causes problems for the procedure's slope estimation step. The quantization error, due to both grid position and orientation, can be a substantial source of error in the estimation of fractal dimension, but pattern search provides an efficient means of minimizing it (Bouda et al., 2016). Future quantitative evolutionary studies, which include the calculation of box counting dimensions, should incorporate such measures.
A wide range of phenomena in nature is describable through strings of numbers. Once a phenomenon has been converted in this manner, the analytical tools of complex systems research become applicable (Abarbanel, 1995) and can yield a plethora of information. Here, we demonstrate the usefulness of converting protein sequence information into numbers that reflect the physico-chemical properties of the constituent amino acids, followed by utilizing this description for close-to-quantitative evolutionary investigation.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.