An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

Alignment methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational costs in handling time and space complexity. On the other hand, alignment-free methods incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group and the information entropy of frequency of k-mers. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source codes of our method are provided at https://github.com/ wuyaoqun37/IEPWRMkmer.


INTRODUCTION
The reconstruction of a phylogenetic tree is a primary problem in evolutionary biology. Sequence alignment is a key step in the reconstruction, aiming to identify the homology of sequences and uncover phylogenetic relationships in sequences. Traditional sequence comparison is based on pairwise or multiple sequence alignment (Felsenstein and Felenstein, 2004;Morrison, 2006) and was implemented by software packages such as BLAST (Altschul et al., 1990), ClustalW (Thompson et al., 1994), and MrBayes (Ronquist et al., 2012). However, the methods based on sequence alignment have some disadvantages, including high computational cost in handling the time and space complexity of the algorithm. Therefore, alignment-free methods have been proposed to overcome these problems (Zielezinski et al., 2017). The computational cost of alignment-free methods is low because they are generally of linear complexity (Fox et al., 1977).
Several alignment-free methods for sequence comparison are based on word counts (Blaisdell, 1986;Höhl et al., 2006;Wang et al., 2016). A key idea is to use the close distribution of k-mers to imply the high correlation degree, hence the similarity of the sequences. The methods have been implemented in software tools, such as FFP (Sims et al., 2009), kWIP (Murray et al., 2017), CVtree (Qi et al., 2004), and DLtree (Wu et al., 2017). Many k-mer methods transform the input sequence into a frequency vector of k-mers, then define the distance of the sequences by that of the frequency vector of k-mers (Qi et al., 2004;Wu et al., 2017). To reduce the statistical dependence between adjacent word matches, Spaced-Words (Leimeister and Boden, 2014) proposed to use spaced words, which are defined by patterns of matches without reference to positions. Some alignment-free methods are based on match length, which defines the distance between sequences based on the length of substring matches between two sequences. These include the shortest unique substring method (Haubold et al., 2005), ACS (Ulitsky et al. 2006), UA (Comin and Verzotto, 2012), and ALFRED (Thankachan et al. 2016). In addition, graphical representation was used to construct the probability distribution of a DNA sequence (Yu et al., 2011). The chaos game representation transforms the distribution of characters in a DNA sequence into the distribution of nodes in a graph (Hoang et al. 2016;Yin, 2017;Mendizabal-Ruiz et al., 2018). Many researchers considered extracting the position information of a k-mer (Huang and Wang, 2011;Ding et al., 2013;Tang et al., 2014). Ding et al. (2013) used the average interval distance of normalized k-mers to capture evolutionary information for sequence comparison. Tang et al. (2014) presented the average relative distance of normalized k-mers to improve the method of Ding et al. (2013). Ma et al. (2020) proposed the PWKmer method, which combines the k-mer counts and k-mer position distributions for phylogenetic analysis.
In this work, we propose a new alignment-free method which combines the position-weighted measure of k-mers proposed by Ma et al. (2020) and the information entropy of frequency of k-mers to obtain phylogenetic information for sequence comparison. It is named information-entropy position-weighted k-mer relative measure (IEPWRMkmer). To evaluate the performance of this method, we carry out phylogenetic analysis on two data sets used by other researchers.

Genomic Datasets
Dataset 1 The first dataset for analysis consists of the same whole genome DNA sequences of 30 mammalian species studied in Li et al. (2001), Otu and Sayood (2003), and Tang et al. (2014). The accession numbers, species, and species name are listed in Table 1. All sequences were downloaded from NCBI GenBank.

Dataset 2
The second dataset for analysis is the HIV-1 dataset studied in Ma et al. (2020). This dataset contains 43 HIV genome sequences used in Wu et al. (2007) and a controversial taxonomic sequence used in Chang et al. (2014). The dataset includes subtypes A, B, C, D, F, G, J, K, and H of the HIV-1 M, O, N groups and the CPZ sequence. The area, accession numbers, and subtypes are listed in Table 2. All these sequences were downloaded from NCBI GenBank. We use two approaches to validate the method. First, we use the Robinson-Foulds (RF) distance to compare our method with other alignment-free methods. Second, we use the bootstrap method to construct consensus trees and show the stability of the trees obtained by our method.

METHODS
Let S s 1 s 2 /s L be a DNA sequence with length L, a 1 a 2 /a k is a k-mer, where a i ∈(A,T,C,G). If the k-mer a 1 a 2 /a k occurs in S, we denote by p a1a2/a k the vector composed of the positions of a 1 a 2 /a k in this given sequence and by p a1a2/a k (i) its ith element. If the k-mer a 1 a 2 /a k does not occur in S, we set p a 1 a 2 /a k (0). For example, for the DNA sequence GTAACCTGAACGTACTTGGA with length 20, we list all 2mer position vectors: P AA (3,9); P AC (4,10,14); P AG (0); P AT (0); P CA (0); P CC (5); P CG (11); P CT (6,15); P GA (8,19); P GC (0); P GG (18); P GT (1,12); P TA (2,13); P TC 0; P TG (7,17); P TT (16).
In this example, the 2-mers AG, AT, CA, GC, and TC do not appear. For each k-mer, its position vector provides its position distribution information in the sequence. One can use the k-mer position vectors to reconstruct the DNA sequence (Ma et al., 2020). Ma et al. (2020) defined the position-weighted measure D(a 1 a 2 /a k ) of a 1 a 2 /a k based on its position in the sequence as where n is the length of the vector p a 1 a 2 /a k . Actually p a1a2/a k (i)/L means the position weight of a 1 a 2 /a k in the given sequence with length L.
We denote by N the number of sequences in a dataset. In order to characterize the importance of k-mers in the whole dataset, we count the number m of the sequences that contain a k-mer a 1 a 2 /a k . Then the occurrence frequency F(a 1 a 2 /a k ) of this k-mer in the whole dataset is defined as m/N. We introduce the Shannon entropy H(a 1 a 2 /a k ) of frequency F(a 1 a 2 /a k ) defined by Murray et al. (2017) as where F stands for F (a 1 a 2 /a k ).
In this study, we aim to get more DNA phylogenetic information by combining the above two methods and defining Here, we regard Shannon entropy H (a 1 a 2 /a k ) as another weight.
For a fixed K, there are 4 K k-mers. For each k-mer a 1 a 2 /a k , we can calculate the corresponding E(a 1 a 2 /a k ), then arrange 4 K of these E(a 1 a 2 /a k ) to get a feature representation vector (E 1 , E 2 , /, E 4 K ) according to the alphabet order of the 4 K k-mers for each genome.
For two given genome sequences A and B, we can obtain by the method. We use the Manhattan distance to calculate the pairwise distance between these two genome sequences: For a given dataset, we can derive a distance matrix by Eq. 4. This distance matrix contains the sequence similarity information. After obtaining the distance matrix, we insert it into the mega 7.0 software (Sudhir et al., 2016) and use Neighbor-Joining (NJ) program (Saitou et al. 1987) to construct the phylogenetic tree.

Robinson-Foulds Distance and the Bootstrap Method
We use the Robinson-Foulds (RF) distance (Robinson and Foulds 1981) to judge the quality of the method. A smaller RF value means a closer distance between the phylogenetic tree and the reference tree.  proposed a modified version of the bootstrap method to evaluate the reliability of the constructed phylogenetic tree. We also use this method in the present work. Its workflow is as follows: Each row is the feature vector (E 1 , E 2 , /, E 4 K ) of a species, and each column is the feature value of all genome sequences based on the same k-mer. Through random sampling of all columns, in which some columns may be selected many times, while some columns may not be selected at all, we randomly select one column. After 4 K times of selection, a new N×4 K feature matrix is constructed. Using the new feature matrix, the Manhattan distance of any two rows is calculated to get a new distance matrix. Then we use the NJ method to construct a phylogenetic tree and repeat the above steps 100 times. Finally, a consensus tree is drawn by using consense. exe in the Phylip package. The frequency of a particular branch of a phylogenetic tree can be used as a measure of the stability of this branch.

Experiment 1
We use the genomes of 30 mammalian species in dataset 1 to construct a phylogenetic tree using ClustalX (Larkin et al. 2007) as the reference tree. ClustalX is one of the widely used multiple alignment programs. The result is shown in Figure 1A. It is seen that rabbit, fat dormouse, squirrel, guinea pig, mouse, rat, platypus, opossum, and wallaroo belong to the rodents group; human, baboon, orangutan, gibbon, gorilla, pigmy chimpanzee, and common chimpanzee belong to the primates group; blue whale, fin whale, hippopotamus, cow, sheep, pig, donkey, horse, Indian-rhinoceros, white rhinoceros, cat, dog, gray seal, and harbor seal belong to the ferungulates group. When K < 5, it is not feasible to construct a phylogenetic tree using our method. When K 5, 6, the 30 mammals cannot be divided into three groups in our tree. When K 7, it can be divided into three groups, but the relationship between guinea pig and fat dormouse is not correct. When K 8, 9, the branches of the tree become correct. We list the RF distances between the phylogenetic tree constructed by our method at K 5, 6, 7, 8, 9 and the reference tree constructed by ClustalX in Table 3. From Table 3, we can see that the RF distance reaches the minimum when K 8. We show the phylogenetic tree of K 8 constructed by our method in Figure 1B. From Figure 1B, we can see that the species in the three main categories are grouped correctly. Primates and ferungulates are closer, and this relationship is consistent with that in Figure 1A. In terms of branches, monotremes (platypus), marsupials (wallaroo, opossum), murid rodents (mouse, rat), non-murid rodents (guinea pig, squirrel, fat dormouse, rabbit), perissodactyls (white rhinoceros, horse, Indian rhinoceros, donkey), carnivores (harbor seal, dog, gray seal, cat), artiodactyls (sheep, cow, hippopotamus, pig), primates (human, pigmy chimpanzee, common chimpanzee, gorilla, baboon, gibbon, orangutan), and cetaceans (blue whale, fin whale) are grouped into respective taxonomic classes accurately. Figure 2 shows the RF distance between the reference tree constructed by ClustalX and the phylogenetic tree constructed by our method, Tang's method, PWKmer, DLtree, and CVtree on dataset 1. Using our method, when K 8, the RF distance is 8. The shortest RF distance of DLtree (K 9) is 10, the shortest distance of CVtree (K 9) is 16, the shortest distance of Tang's method (K 7) is 16, and the shortest distance of PWKmer (K 9) is 10. Therefore, the results of our method are closer to those of  ClustalX than those of the other methods, which indicates that our method is effective. Figure 3 shows the consensus tree of 30 mammalian species based on our method. Compared with Figure 1B, 30 mammalian species are divided into the rodents group, the ferungulates group, and the primates group correctly. The support rate is 80% for the rodents group and 100% for both ferungulates and primates groups. Among the branches, marsupials (opossum, wallaroo), carnivores (dog, cat, harbor seal, gray seal), murid roots (rat, mouse), and cetaceans (fin whale, blue whale) are all supported by a 100% rate. In the artiodactyls group (cow, sheep, pig, hippopotamus), pig is separated out of the artiodactyls group, but the support rate is low at 43%. It indicates that the phylogenetic tree constructed by our method is quite robust.

Experiment 2
The human immunodeficiency viruses (HIV) represent a group of retroviruses, which are not presumed to have originated from human cellular DNA sequences, hence are distinct from endogenous retroviruses (Wu et al., 2007). HIV-1 can be classified into three major phylogenetic groups, namely M (major), N (new), and O (others). Group M is responsible for the HIV pandemic, it is divided into nine subtypes, namely A, B, C, D, F, G, J, K, and H. Based on differential phylogenetic clustering, the subtypes A and F are further divided into subsubtypes (A1, A2) and (F1, F2), respectively. Groups N and O are derived from other primates and then infect humans. CPZ is a non-human primate virus isolated from chimpanzees, which is closest to human-to-human transmission of HIV.
We performed the phylogenetic analysis of 44 HIV-1 complete genome sequences in dataset 2 using ClustalX and our method. FIGURE 2 | The Robinson-Foulds distance between the tree reconstructed by ClustalX method and the phylogenetic trees reconstructed by our method (IEPWRMkmer K 8), the CVTree method, the DLTree method, Tang's method (K 7), and the PWKmer method (K 9) on dataset 1 (we used the optimal tree by CVTree and DLTree).  FIGURE 5 | The RF distance between the reference tree constructed by Clustalx and the phylogenetic trees constructed by our method (IEPWRMkmer, K 7), Tang's method (K 8), the PWKmer method (K 9), the DLtree method, and the CVtree method. (For the PWKmer method, the DLtree method, and the CVtree method, we chose their optimal classification tree).
Frontiers in Genetics | www.frontiersin.org October 2021 | Volume 12 | Article 766496 6 The phylogenetic trees reconstructed by ClustalX and our method (K 7) are shown in Figure 4A and Figure 4B, respectively. From Figure 4B, we can see that the species from all subtypes can be correctly classified into their groups (A, B, C, D, F, G, J, K, H, O, and M), and CPZ as the reference sequence is separated into the outermost. From the internal branches, both F and A contain two subtypes (F1 and F2) and (A1 and A2), respectively. Our method can separate the two subtypes, and in the branches, both F and A subtypes can be closely grouped together. Figure 5 shows the RF distances between the reference tree constructed by ClustalX and the phylogenetic trees constructed by our method, Tang's method, PWKmer, DLtree, and CVtree. Using our method, when K 7, the RF distance is 10. The shortest RF distance of the DLtree (K 11) is 12, the shortest distance of the CVtree (K 9) is 16, the shortest distance of the PWKmer (K 9) is 10, and the shortest distance of Tang's method (K 9) is 10. Therefore, our method performs better than the DLtree and the CVtree on dataset 2 and has the same performance as Tang's method and PWKmer. The results indicate that our method is quite effective again. Figure 6 shows the consensus tree of 44 HIV-1 based on our method. Comparing with Figure 4B, all HIV-1 sequences are divided into the M, N, O, and CPZ groups, whose support rate is 100%. From the branch point of view, in group M, the branch support rate of all subtypes is 100%. For subtypes A and F, the subtypes (A1, A2) and (F1 and F2) are clustered with 100% support. It again indicates that the phylogenetic tree constructed by our method is quite robust.
Estimate of the Optimal Parameter K Different lengths of k-mers contain different phylogenetic information. Short k-mers may not contain sufficient DNA sequence information. Long k-mers contain sufficient phylogenetic information, but it needs large memory and takes a long time to calculate the distance based on information on long k-mers. Therefore, it is also very important to estimate an optimal value of K as heralded in  for the DLTree method and (Qi et al., 2004) for the CVTree method.
In this paper, we propose to use the Shannon entropy of the feature matrix to determine the optimal value of K. Using Eq. 3, we can obtain an N ×4 K feature matrix for a dataset with N genomes. Then, we propose to define a scoring strategy as The optimal K is the value at which score(K) reaches its maximum.   We use Eq. 5 to calculate score(K) on datasets 1 and 2 for different K. The relationship between score(K) and K is shown in Figure 7 for these two datasets. It is seen that score(K) reaches the largest value when K 8 on the two datasets. Considering that the larger K is, the more memory resources are consumed, we only consider the values near K 8 (e.g., K 7, 8, 9). For the 30 mammalian species dataset, we have seen that the phylogenetic tree for K 8 constructed by our method is closest to the reference tree. The same happened for the HIV-1 dataset with K 7. The outcomes indicate that score(K) can provide an effective means to estimate the optimal value of K.

CONCLUSION
In this paper, a new alignment-free method is proposed for phylogenetic analysis and sequence comparison based on whole genome sequences. Our method combines the positionweighted measure of k-mers and the information entropy of frequency of k-mers. We used the Manhattan metric to measure the distance between a pair of sequences and the NJ method to construct the phylogenetic tree. In order to test the effectiveness and reliability of our method, we applied it on two datasets of 30 mammalian species and 44 HIV-1 genomes. The results demonstrated that the present method is efficient and reliable. A suitable K value is important to capture rich phylogenetic information of DNA sequences. In order to choose an optimal K value, we proposed a scoring measure based on the information entropy. The obtained results on two real datasets support that the method can capture the k-mer distribution information and is effective for whole genome sequence comparison and phylogenetic analysis.
Remark: The method of this paper is derived from the two studies Ma et al. (2020) and Murray et al. (2017). There are differences between this work and previous works: Tang et al. presented the average relative distance for normalized k-mers. PWKmer uses the counts and position distributions of k-mers to capture more evolutionary information. KWIP (Murray et al. 2017) uses information entropy to weight the inner product (SipSj), while we use information entropy to weight the relative positions of k-mers. KWIP uses a kernel function to calculate the distance, while we use the Manhattan metric to calculate the pairwise distance between species. Here, we claimed that the results obtained by the IEPWRMkmer method are close to those by ClustalX and the IEPWRMkmer is superior to the other distance metrics. We used the phylogenetic tree constructed by ClustalX as the reference tree or standard tree, hence we cannot claim that our method is superior to the ClustalX method.

DATA AVAILABILITY STATEMENT
The genome datasets analyzed for this study can be found in the GenBank https://www.ncbi.nlm.nih.gov/ AUTHOR CONTRIBUTIONS Y-QW contributed to the conception and design of the study, developed the method, and wrote the manuscript. Z-GY gave the ideas and supervised the project. All authors discussed the results and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.