ORIGINAL RESEARCH article

Front. Genet., 22 October 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.766496

An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

  • 1. Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China

  • 2. Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China

  • 3. Faculty of Science, Engineering and Technology, Swinburne University of Technology, Hawthorn, VIC, Australia


Abstract

Alignment-based methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational cost in time and space. Alignment-free methods, on the other hand, incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called the information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group and the information entropy of the frequency of k-mers. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source code of our method is provided at https://github.com/wuyaoqun37/IEPWRMkmer.

Introduction

The reconstruction of a phylogenetic tree is a primary problem in evolutionary biology. Sequence alignment is a key step in the reconstruction, aiming to identify the homology of sequences and uncover the phylogenetic relationships among them. Traditional sequence comparison is based on pairwise or multiple sequence alignment (Felsenstein, 2004; Morrison, 2006) and has been implemented in software packages such as BLAST (Altschul et al., 1990), ClustalW (Thompson et al., 1994), and MrBayes (Ronquist et al., 2012). However, alignment-based methods have some disadvantages, including the high computational cost of the algorithms in time and space. Alignment-free methods have therefore been proposed to overcome these problems (Zielezinski et al., 2017). The computational cost of alignment-free methods is low because they are generally of linear complexity (Fox et al., 1977).

Several alignment-free methods for sequence comparison are based on word counts (Blaisdell, 1986; Höhl et al., 2006; Wang et al., 2016). The key idea is that sequences whose k-mer distributions are close are highly correlated and hence similar. These methods have been implemented in software tools such as FFP (Sims et al., 2009), kWIP (Murray et al., 2017), CVtree (Qi et al., 2004), and DLtree (Wu et al., 2017). Many k-mer methods transform the input sequence into a frequency vector of k-mers and then define the distance between sequences as the distance between their k-mer frequency vectors (Qi et al., 2004; Wu et al., 2017). To reduce the statistical dependence between adjacent word matches, Spaced-Words (Leimeister et al., 2014) uses spaced words, which are defined by patterns of match and don't-care positions. Some alignment-free methods are based on match length, defining the distance between two sequences from the lengths of the substring matches between them. These include the shortest unique substring method (Haubold et al., 2005), ACS (Ulitsky et al., 2006), UA (Comin and Verzotto, 2012), and ALFRED (Thankachan et al., 2016). In addition, graphical representation has been used to construct the probability distribution of a DNA sequence (Yu et al., 2011). The chaos game representation transforms the distribution of characters in a DNA sequence into the distribution of nodes in a graph (Hoang et al., 2016; Yin, 2019; Mendizabal-Ruiz et al., 2018). Many researchers have considered extracting the position information of a k-mer (Huang and Wang, 2011; Ding et al., 2013; Tang et al., 2014). Ding et al. (2013) used the average interval distance of normalized k-mers to capture evolutionary information for sequence comparison. Tang et al. (2014) presented the average relative distance of normalized k-mers to improve the method of Ding et al. (2013). Ma et al. (2020) proposed the PWKmer method, which combines the k-mer counts and k-mer position distributions for phylogenetic analysis.

In this work, we propose a new alignment-free method which combines the position-weighted measure of k-mers proposed by Ma et al. (2020) and the information entropy of frequency of k-mers to obtain phylogenetic information for sequence comparison. It is named information-entropy position-weighted k-mer relative measure (IEPWRMkmer). To evaluate the performance of this method, we carry out phylogenetic analysis on two data sets used by other researchers.

Materials and Methods

Genomic Datasets

Dataset 1

The first dataset for analysis consists of the whole mitochondrial genome DNA sequences of 30 mammalian species, the same sequences studied in Li et al. (2001), Otu and Sayood (2003), and Tang et al. (2014). The accession numbers, species, and common names are listed in Table 1. All sequences were downloaded from NCBI GenBank.

TABLE 1

| No | Accession no | Species | Common name |
|---|---|---|---|
| 1 | AJ002189 | Sus scrofa | Pig |
| 2 | AJ010957 | Hippopotamus amphibius | Hippopotamus |
| 3 | AJ001588 | Oryctolagus cuniculus | Rabbit |
| 4 | U96639 | Canis familiaris | Dog |
| 5 | AF010406 | Ovis aries | Sheep |
| 6 | V00662 | Homo sapiens | Human |
| 7 | U20753 | Felis catus | Cat |
| 8 | X72004 | Halichoerus grypus | Gray seal |
| 9 | D38115 | Pongo pygmaeus | Orangutan |
| 10 | V00654 | Bos taurus | Cow |
| 11 | X97337 | Equus asinus | Donkey |
| 12 | D38116 | Pan troglodytes | Common chimpanzee |
| 13 | D38113 | Pan paniscus | Pigmy chimpanzee |
| 14 | Z29573 | Didelphis virginiana | Opossum |
| 15 | Y10524 | Macropus robustus | Wallaroo |
| 16 | X99256 | Hylobates lar | Gibbon |
| 17 | Y18001 | Papio hamadryas | Baboon |
| 18 | X97336 | Rhinoceros unicornis | Indian rhinoceros |
| 19 | Y07726 | Ceratotherium simum | White rhinoceros |
| 20 | X63726 | Phoca vitulina | Harbor seal |
| 21 | AJ238588 | Sciurus vulgaris | Squirrel |
| 22 | AJ001562 | Glis glis | Fat dormouse |
| 23 | AJ222767 | Cavia porcellus | Guinea pig |
| 24 | X79547 | Equus caballus | Horse |
| 25 | X14848 | Rattus norvegicus | Rat |
| 26 | V00711 | Mus musculus | Mouse |
| 27 | D38114 | Gorilla gorilla | Gorilla |
| 28 | X61145 | Balaenoptera physalus | Fin whale |
| 29 | X72204 | Balaenoptera musculus | Blue whale |
| 30 | X83427 | Ornithorhynchus anatinus | Platypus |

Names, species, and accession numbers for mitochondrial genomes of 30 mammalian species.

Dataset 2

The second dataset for analysis is the HIV-1 dataset studied in Ma et al. (2020). This dataset contains 43 HIV genome sequences used in Wu et al. (2007) and one sequence of controversial taxonomy used in Chang et al. (2014). The dataset includes subtypes A, B, C, D, F, G, H, J, and K of the HIV-1 group M, sequences of groups N and O, and the CPZ sequence. The areas, accession numbers, and subtypes are listed in Table 2. All these sequences were downloaded from NCBI GenBank.

TABLE 2

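| No | Area | Accession no | Subtype |
|---|---|---|---|
| 1 | Belgium (DRC) | AF084936 | G |
| 2 | Finland (Kenya) | AF061641 | G |
| 3 | Sweden (DRC) | AF061642 | G |
| 4 | Belgium | AF190128 | H |
| 5 | Belgium | AF190127 | H |
| 6 | Cent. Afr. Rep | AF005496 | H |
| 7 | Tanzania | AF447763 | CPZ |
| 8 | Cameroon | L20571 | O |
| 9 | Senegal | AJ302647 | O |
| 10 | Cameroon | L20587 | O |
| 11 | Cameroon | AY169812 | O |
| 12 | India | AF067155 | C |
| 13 | South Africa | AY772699 | C |
| 14 | Ethiopia | U46016 | C |
| 15 | Brazil | U52953 | C |
| 16 | Cameroon | AY371157 | D |
| 17 | DRC | K03454 | D |
| 18 | Uganda | U88824 | D |
| 19 | Somalia | AF069670 | A1 |
| 20 | Uganda | AF484509 | A1 |
| 21 | Uganda | U51190 | A1 |
| 22 | Kenya | AF004885 | A1 |
| 23 | DRC | AF286238 | A2 |
| 24 | Cyprus | AF286237 | A2 |
| 25 | Sweden | AF082395 | J |
| 26 | Sweden | AF082394 | J |
| 27 | Cameroon | AJ249239 | K |
| 28 | DRC | AJ249235 | K |
| 29 | Cameroon | AJ249237 | F2 |
| 30 | Cameroon | AY371158 | F2 |
| 31 | Cameroon | AJ249236 | F2 |
| 32 | Cameroon | AF377956 | F2 |
| 33 | Finland | AF075703 | F1 |
| 34 | France | AJ249238 | F1 |
| 35 | Brazil | AF005494 | F1 |
| 36 | Belgium (DRC) | AF077336 | F1 |
| 37 | Cameroon | AJ271370 | N |
| 38 | Cameroon | AY532635 | N |
| 39 | Cameroon | AJ006022 | N |
| 40 | Netherlands | AY423387 | B |
| 41 | Thailand | AY173951 | B |
| 42 | Australia | [missing] | B |
| 43 | France | K03455 | B |
| 44 | U.S. | AY331295 | B |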

Accession numbers, subtypes, and areas for the 44 HIV-1 genomes.

We use two approaches to validate the method. First, we use the Robinson-Foulds (RF) distance to compare our method with other alignment-free methods. Second, we use the bootstrap method to construct consensus trees and show the stability of the trees obtained by our method.

Methods

Let $S = s_1 s_2 \cdots s_L$ be a DNA sequence of length $L$ and let $w = w_1 w_2 \cdots w_k$ be a k-mer, where $s_i, w_j \in \{A, T, C, G\}$. If the k-mer $w$ occurs in $S$, we denote by $P_w$ the vector composed of the positions of $w$ in this given sequence and by $P_w(i)$ its $i$th element. If the k-mer $w$ does not occur in $S$, $P_w = (0)$. For example, for the DNA sequence GTAACCTGAACGTACTTGGA with length 20, we list all 2-mer position vectors:

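$P_{AA} = (3, 9)$; $P_{AC} = (4, 10, 14)$; $P_{AG} = (0)$; $P_{AT} = (0)$; $P_{CA} = (0)$; $P_{CC} = (5)$; $P_{CG} = (11)$; $P_{CT} = (6, 15)$; $P_{GA} = (8, 19)$; $P_{GC} = (0)$; $P_{GG} = (18)$; $P_{GT} = (1, 12)$; $P_{TA} = (2, 13)$; $P_{TC} = (0)$; $P_{TG} = (7, 17)$; $P_{TT} = (16)$.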

In this example, the 2-mers AG, AT, CA, GC, and TC do not appear. For each k-mer, its position vector provides its position distribution information in the sequence. One can use the k-mer position vectors to reconstruct the DNA sequence (Ma et al., 2020).
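The position vectors in this example can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' implementation: the function names `kmer_positions` and `reconstruct` are illustrative, positions are 1-based as in the text, and the sequence is assumed to contain only the letters A, C, G, and T.

```python
from itertools import product


def kmer_positions(seq, k):
    """Map every possible k-mer to its 1-based start positions in seq.

    k-mers that never occur are given the vector (0,), as in the text.
    """
    positions = {"".join(p): [] for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in positions:  # windows with other letters (e.g. N) are skipped
            positions[word].append(i + 1)
    return {w: tuple(p) if p else (0,) for w, p in positions.items()}


def reconstruct(positions, k, length):
    """Rebuild the sequence by writing each k-mer back at its start positions.

    Assumes the sequence contains only A, C, G, T, so that every site is
    covered by at least one recorded k-mer.
    """
    seq = [None] * length
    for word, starts in positions.items():
        if starts == (0,):  # k-mer absent from the sequence
            continue
        for start in starts:
            seq[start - 1:start - 1 + k] = word
    return "".join(seq)


example = "GTAACCTGAACGTACTTGGA"
vectors = kmer_positions(example, 2)
print(vectors["AC"])                                       # positions of the 2-mer AC
print(reconstruct(vectors, 2, len(example)) == example)    # round trip check
```

Because every site of the sequence is covered by at least one k-mer start, writing the k-mers back at their recorded positions recovers the original sequence, which is the reconstruction property mentioned above.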

Ma et al. (2020) defined the position-weighted measure of $w$ based on its positions in the sequence (Eq. 1), where $n$ is the length of the vector $P_w$. This measure gives the position weight of $w$ in the given sequence of length $L$.

We denote by $N$ the number of sequences in a dataset. To characterize the importance of a k-mer in the whole dataset, we count the number $m$ of sequences that contain the k-mer $w$. The occurrence frequency $F(w)$ of this k-mer in the whole dataset is then defined as $m/N$. We introduce the Shannon entropy $H(w)$ of the frequency $F(w)$ defined by Murray et al. (2017) as

$$H(w) = -\left[ F \log_2 F + (1 - F) \log_2 (1 - F) \right] \quad (2)$$

where $F$ stands for $F(w)$.
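As an illustration of the frequency and entropy weight, the sketch below computes $F = m/N$ for a k-mer and the Shannon entropy of that frequency. It assumes the two-term (presence/absence) entropy with base-2 logarithm used by kWIP (Murray et al., 2017); the function names and the toy k-mer sets are hypothetical.

```python
import math


def presence_frequency(kmer_sets, word):
    """F = m / N: fraction of the N sequences whose k-mer set contains word."""
    m = sum(1 for kmers in kmer_sets if word in kmers)
    return m / len(kmer_sets)


def shannon_entropy(f):
    """Entropy of a presence/absence frequency f (assumed kWIP form, log base 2)."""
    if f <= 0.0 or f >= 1.0:
        return 0.0  # a k-mer present in none or in all sequences carries no weight
    return -(f * math.log2(f) + (1.0 - f) * math.log2(1.0 - f))


# Toy dataset: the sets of 2-mers observed in N = 4 hypothetical sequences.
toy_sets = [{"AA", "AC"}, {"AA", "GT"}, {"AA"}, {"AA", "AC"}]
for word in ("AA", "AC", "GT"):
    f = presence_frequency(toy_sets, word)
    print(word, f, shannon_entropy(f))
```

Under this form, a k-mer found in every sequence (or in none) gets weight 0, and a k-mer found in exactly half of the sequences gets the maximum weight 1, so the entropy emphasizes k-mers that discriminate between genomes.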

In this study, we aim to obtain more phylogenetic information from DNA sequences by combining the above two measures (Eq. 3). Here, we regard the Shannon entropy $H(w)$ as another weight.

For a fixed $K$, there are $4^K$ k-mers. For each k-mer $w$, we calculate the corresponding value of the combined measure (Eq. 3), and then arrange these $4^K$ values in the alphabetical order of the $4^K$ k-mers to obtain a feature representation vector for each genome.
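The construction of the feature vector can be sketched as follows. This is only an illustration under two loudly stated assumptions, since the exact formulas are not reproduced in this text: the position weight of a k-mer is taken to be the sum of its relative positions $P_w(i)/L$, and the combined measure is taken to be the entropy weight multiplied by that position weight. Both are placeholders that can be swapped out; all function names are hypothetical.

```python
from itertools import product


def position_weight(starts, length):
    """ASSUMPTION: placeholder for the position-weighted measure.

    Sums the relative positions P_w(i) / L of a k-mer; an absent k-mer
    (empty list of starts) gets weight 0.
    """
    return sum(p / length for p in starts)


def feature_vector(seq, k, entropy, weight=position_weight):
    """Return the 4**k feature values in alphabetical order of the k-mers.

    entropy maps a k-mer to its dataset-level entropy weight H(w).
    ASSUMPTION: the combined value is H(w) * weight(P_w, L).
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]  # alphabetical order
    starts = {w: [] for w in words}
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in starts:
            starts[w].append(i + 1)  # 1-based positions
    return [entropy.get(w, 0.0) * weight(starts[w], len(seq)) for w in words]


# Hypothetical entropy weights for two of the sixteen 2-mers.
vec = feature_vector("GTAACCTGAACGTACTTGGA", 2, {"AA": 1.0, "AC": 0.5})
print(len(vec), vec[:3])
```

Passing the weight as a function keeps the sketch usable with whatever position-weighted measure is actually intended; only the ordering of the $4^K$ entries and the role of the entropy as a second weight are taken from the text.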

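For two given genome sequences $A$ and $B$, we obtain their feature vectors $V^A = (v^A_1, v^A_2, \ldots, v^A_{4^K})$ and $V^B = (v^B_1, v^B_2, \ldots, v^B_{4^K})$ by the above method. We use the Manhattan distance to calculate the pairwise distance between these two genome sequences:

$$D(A, B) = \sum_{i=1}^{4^K} \left| v^A_i - v^B_i \right| \quad (4)$$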

For a given dataset, we can derive a distance matrix by Eq. 4. This distance matrix contains the sequence similarity information. After obtaining the distance matrix, we import it into the MEGA 7.0 software (Kumar et al., 2016) and use the Neighbor-Joining (NJ) program (Saitou and Nei, 1987) to construct the phylogenetic tree.
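The distance matrix step is straightforward to express in code. The following minimal sketch computes the Manhattan distance between every pair of feature vectors; the function names and the toy vectors are illustrative, and the tree-building step in MEGA is not covered.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(vectors):
    """Symmetric N x N matrix of pairwise Manhattan distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = manhattan(vectors[i], vectors[j])
    return dist


# Three toy feature vectors standing in for three genomes.
toy = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
for row in distance_matrix(toy):
    print(row)
```

The cost is linear in the vector length $4^K$ for each pair, so the full matrix takes on the order of $N^2 \cdot 4^K$ operations for $N$ genomes.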

Robinson-Foulds Distance and the Bootstrap Method

We use the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) to judge the quality of the method. The RF distance counts the bipartitions that occur in only one of two trees, so a smaller RF value means that the constructed phylogenetic tree is topologically closer to the reference tree.
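To make the RF distance concrete, the sketch below computes it for two small trees on the same leaf set. It is an illustration only: trees are represented as nested tuples of leaf names (a hypothetical representation chosen for brevity), the trees are treated as unrooted, and the distance is the number of non-trivial bipartitions found in exactly one of the two trees. Some programs report half of this value.

```python
def collect_clades(tree, found):
    """Return the leaf set below tree and record the leaf set of each internal node."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial splits of an unrooted tree, each stored as the side
    that does not contain a fixed reference leaf."""
    found = []
    all_leaves = collect_clades(tree, found)
    reference = min(all_leaves)
    splits = set()
    for clade in found:
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # drop trivial splits
            splits.add(side)
    return all_leaves, splits


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    leaves_a, splits_a = bipartitions(tree_a)
    leaves_b, splits_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf set")
    return len(splits_a ^ splits_b)


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

Here the two trees share the split separating D and E from the rest but disagree on how A, B, and C are paired, so one split is unique to each tree.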

Yu et al. (2010) proposed a modified version of the bootstrap method to evaluate the reliability of the constructed phylogenetic tree. We also use this method in the present work. Its workflow is as follows. The feature vectors of the $N$ genomes form an $N \times 4^K$ feature matrix, in which each row is the feature vector of a species and each column holds the feature values of all genome sequences for the same k-mer. We randomly select one column at a time, sampling with replacement from all columns, so that some columns may be selected many times while others may not be selected at all. After $4^K$ selections, a new $N \times 4^K$ feature matrix is constructed. Using the new feature matrix, the Manhattan distance of any two rows is calculated to get a new distance matrix. Then we use the NJ method to construct a phylogenetic tree, and we repeat the above steps 100 times. Finally, a consensus tree is drawn by using consense.exe in the Phylip package. The frequency of a particular branch among the resampled trees can be used as a measure of the stability of this branch.
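The resampling part of this workflow can be sketched as below. This is a minimal illustration with hypothetical function names: it resamples the columns of a feature matrix with replacement and returns one Manhattan distance matrix per replicate. Building the NJ trees and the consensus tree is left to external tools and is not shown.

```python
import random


def resample_columns(features, rng):
    """Draw as many columns as the matrix has, with replacement."""
    n_cols = len(features[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return [[row[c] for c in chosen] for row in features]


def bootstrap_distance_matrices(features, replicates=100, seed=0):
    """One Manhattan distance matrix per bootstrap replicate."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    matrices = []
    for _ in range(replicates):
        sample = resample_columns(features, rng)
        n = len(sample)
        dist = [
            [sum(abs(a - b) for a, b in zip(sample[i], sample[j])) for j in range(n)]
            for i in range(n)
        ]
        matrices.append(dist)
    return matrices


# Toy feature matrix: 3 genomes (rows) by 4 k-mers (columns).
toy = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 0.0, 3.0], [2.0, 0.0, 0.0, 0.0]]
replicate_matrices = bootstrap_distance_matrices(toy, replicates=5, seed=1)
print(len(replicate_matrices), replicate_matrices[0])
```

Resampling columns rather than sequence sites keeps the procedure alignment-free: each replicate perturbs which k-mers contribute to the distances, and a branch that survives most replicates does not depend on a small subset of k-mers.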

Results

Experiment 1

We use the genomes of 30 mammalian species in dataset 1 to construct a phylogenetic tree using ClustalX (Larkin et al. 2007) as the reference tree. ClustalX is one of the widely used multiple alignment programs. The result is shown in Figure 1A. It is seen that rabbit, fat dormouse, squirrel, guinea pig, mouse, rat, platypus, opossum, and wallaroo belong to the rodents group; human, baboon, orangutan, gibbon, gorilla, pigmy chimpanzee, and common chimpanzee belong to the primates group; blue whale, fin whale, hippopotamus, cow, sheep, pig, donkey, horse, Indian-rhinoceros, white rhinoceros, cat, dog, gray seal, and harbor seal belong to the ferungulates group. When K < 5, it is not feasible to construct a phylogenetic tree using our method. When K = 5, 6, the 30 mammals cannot be divided into three groups in our tree. When K = 7, it can be divided into three groups, but the relationship between guinea pig and fat dormouse is not correct. When K = 8, 9, the branches of the tree become correct. We list the RF distances between the phylogenetic tree constructed by our method at K = 5, 6, 7, 8, 9 and the reference tree constructed by ClustalX in Table 3. From Table 3, we can see that the RF distance reaches the minimum when K = 8. We show the phylogenetic tree of K = 8 constructed by our method in Figure 1B. From Figure 1B, we can see that the species in the three main categories are grouped correctly. Primates and ferungulates are closer, and this relationship is consistent with that in Figure 1A. 
In terms of branches, monotremes (platypus), marsupials (wallaroo, opossum), murid rodents (mouse, rat), non-murid rodents (guinea pig, squirrel, fat dormouse, rabbit), perissodactyls (white rhinoceros, horse, Indian rhinoceros, donkey), carnivores (harbor seal, dog, gray seal, cat), artiodactyls (sheep, cow, hippopotamus, pig), primates (human, pigmy chimpanzee, common chimpanzee, gorilla, baboon, gibbon, orangutan), and cetaceans (blue whale, fin whale) are grouped into respective taxonomic classes accurately.

FIGURE 1

TABLE 3

| K | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|
| RF distance | 38 | 28 | 22 | 8 | 10 |

The RF distances between the phylogenetic trees constructed by our method at K = 5, 6, 7, 8, 9 and the reference tree constructed by ClustalX.

Figure 2 shows the RF distance between the reference tree constructed by ClustalX and the phylogenetic tree constructed by our method, Tang’s method, PWKmer, DLtree, and CVtree on dataset 1. Using our method, when K = 8, the RF distance is 8. The shortest RF distance of DLtree (K = 9) is 10, the shortest distance of CVtree (K = 9) is 16, the shortest distance of Tang’s method (K = 7) is 16, and the shortest distance of PWKmer (K = 9) is 10. Therefore, the results of our method are closer to those of ClustalX than those of the other methods, which indicates that our method is effective.

FIGURE 2

Figure 3 shows the consensus tree of 30 mammalian species based on our method. As in Figure 1B, the 30 mammalian species are correctly divided into the rodents group, the ferungulates group, and the primates group. The support rate is 80% for the rodents group and 100% for both the ferungulates and primates groups. Among the branches, marsupials (opossum, wallaroo), carnivores (dog, cat, harbor seal, gray seal), murid rodents (rat, mouse), and cetaceans (fin whale, blue whale) are all supported at a 100% rate. In the artiodactyls group (cow, sheep, pig, hippopotamus), pig is separated from the other members, but the support rate for this separation is low at 43%. This indicates that the phylogenetic tree constructed by our method is quite robust.

FIGURE 3

Experiment 2

The human immunodeficiency viruses (HIV) represent a group of retroviruses that are not presumed to have originated from human cellular DNA sequences and hence are distinct from endogenous retroviruses (Wu et al., 2007). HIV-1 can be classified into three major phylogenetic groups, namely M (major), N (new), and O (others). Group M is responsible for the HIV pandemic and is divided into nine subtypes, namely A, B, C, D, F, G, J, K, and H. Based on differential phylogenetic clustering, the subtypes A and F are further divided into sub-subtypes (A1, A2) and (F1, F2), respectively. Groups N and O are derived from other primates and then infect humans. CPZ is a non-human primate virus isolated from chimpanzees, and it is the virus closest to the HIV-1 transmitted among humans.

We performed the phylogenetic analysis of 44 HIV-1 complete genome sequences in dataset 2 using ClustalX and our method. The phylogenetic trees reconstructed by ClustalX and our method (K = 7) are shown in Figure 4A and Figure 4B, respectively. From Figure 4B, we can see that the sequences from all subtypes are correctly classified into their groups (A, B, C, D, F, G, J, K, H, O, and N), and CPZ, as the reference sequence, is placed on the outermost branch. Regarding the internal branches, F and A each contain two sub-subtypes, (F1 and F2) and (A1 and A2), respectively. Our method separates the two sub-subtypes in each case, while the F and A subtypes as a whole are each closely grouped together.

FIGURE 4

Figure 5 shows the RF distances between the reference tree constructed by ClustalX and the phylogenetic trees constructed by our method, Tang’s method, PWKmer, DLtree, and CVtree. Using our method, when K = 7, the RF distance is 10. The shortest RF distance of DLtree (K = 11) is 12, the shortest distance of CVtree (K = 9) is 16, the shortest distance of PWKmer (K = 9) is 10, and the shortest distance of Tang’s method (K = 9) is 10. Therefore, our method performs better than DLtree and CVtree on dataset 2 and has the same performance as Tang’s method and PWKmer. The results again indicate that our method is quite effective.

FIGURE 5

Figure 6 shows the consensus tree of 44 HIV-1 genomes based on our method. As in Figure 4B, all HIV-1 sequences are divided into the M, N, O, and CPZ groups, each with a support rate of 100%. At the branch level, the branch support rate of every subtype in group M is 100%. For subtypes A and F, the sub-subtypes (A1, A2) and (F1, F2) are clustered with 100% support. This again indicates that the phylogenetic tree constructed by our method is quite robust.

FIGURE 6

Estimate of the Optimal Parameter K

Different lengths of k-mers contain different phylogenetic information. Short k-mers may not contain sufficient DNA sequence information. Long k-mers contain sufficient phylogenetic information, but computing distances from them requires a large amount of memory and a long time. Therefore, it is important to estimate an optimal value of K, as was done in Yu et al. (2010) for the DLTree method and in Qi et al. (2004) for the CVTree method.

In this paper, we propose to use the Shannon entropy of the feature matrix to determine the optimal value of K. Using Eq. 3, we can obtain an $N \times 4^K$ feature matrix for a dataset with N genomes. We then define a score on this feature matrix (Eq. 5).

The optimal K is the value at which this score reaches its maximum.

We use Eq. 5 to calculate the score on datasets 1 and 2 for different K. The relationship between the score and K is shown in Figure 7 for these two datasets. It is seen that the score reaches its largest value when K = 8 on the two datasets. Considering that the larger K is, the more memory resources are consumed, we only consider the values near K = 8 (e.g., K = 7, 8, 9). For the 30 mammalian species dataset, we have seen that the phylogenetic tree for K = 8 constructed by our method is closest to the reference tree. The same happened for the HIV-1 dataset with K = 7. The outcomes indicate that the score can provide an effective means to estimate the optimal value of K.

FIGURE 7

Conclusion

In this paper, a new alignment-free method is proposed for phylogenetic analysis and sequence comparison based on whole genome sequences. Our method combines the position-weighted measure of k-mers and the information entropy of frequency of k-mers. We used the Manhattan metric to measure the distance between a pair of sequences and the NJ method to construct the phylogenetic tree. In order to test the effectiveness and reliability of our method, we applied it on two datasets of 30 mammalian species and 44 HIV-1 genomes. The results demonstrated that the present method is efficient and reliable. A suitable K value is important to capture rich phylogenetic information of DNA sequences. In order to choose an optimal K value, we proposed a scoring measure based on the information entropy. The obtained results on two real datasets support that the method can capture the k-mer distribution information and is effective for whole genome sequence comparison and phylogenetic analysis.

Remark: The method of this paper is derived from the two studies Ma et al. (2020) and Murray et al. (2017). There are differences between this work and previous works: Tang et al. (2014) presented the average relative distance for normalized k-mers. PWKmer uses the counts and position distributions of k-mers to capture more evolutionary information. kWIP (Murray et al., 2017) uses information entropy to weight the inner product $\langle S_i, S_j \rangle$, while we use information entropy to weight the relative positions of k-mers. kWIP uses a kernel function to calculate the distance, while we use the Manhattan metric to calculate the pairwise distance between species. Here, we claim that the results obtained by the IEPWRMkmer method are close to those obtained by ClustalX and that IEPWRMkmer is superior to the other distance measures compared. Because we used the phylogenetic tree constructed by ClustalX as the reference (standard) tree, we cannot claim that our method is superior to the ClustalX method.

Statements

Data availability statement

The genome datasets analyzed for this study can be found in the GenBank https://www.ncbi.nlm.nih.gov/

Author contributions

Y-QW contributed to the conception and design of the study, developed the method, and wrote the manuscript. Z-GY gave the ideas and supervised the project. All authors discussed the results and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by funds from the National Natural Science Foundation of China (grant numbers: 11871061 and 12026213); The National Key Research and Development Program of China (grant number: 2020YFC0832405); Innovation Foundation of Qian Xuesen Laboratory of Space Technology.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic Local Alignment Search Tool. J. Mol. Biol. 215 (3), 403–410. doi: 10.1016/S0022-2836(05)80360-2

  • 2

    Blaisdell, B. E. (1986). A Measure of the Similarity of Sets of Sequences Not Requiring Sequence Alignment. Proc. Natl. Acad. Sci. 83 (14), 5155–5159. doi: 10.1073/pnas.83.14.5155

  • 3

    Chang, G., Wang, H., and Zhang, T. (2014). A Novel Alignment-free Method for Whole Genome Analysis: Application to HIV-1 Subtyping and HEV Genotyping. Inf. Sci. 279, 776–784. doi: 10.1016/j.ins.2014.04.029

  • 4

    Comin, M., and Verzotto, D. (2012). Alignment-free Phylogeny of Whole Genomes Using Underlying Subwords. Algorithms Mol. Biol. 7 (1), 1–12. doi: 10.1186/1748-7188-7-34

  • 5

    Ding, S., Li, Y., Yang, X., and Wang, T. (2013). A Simple K-word Interval Method for Phylogenetic Analysis of DNA Sequences. J. Theor. Biol. 317, 192–199. doi: 10.1016/j.jtbi.2012.10.010

  • 6

    Felsenstein, J. (2004). Inferring Phylogenies. (Sunderland, MA: Sinauer Associates). doi: 10.1086/383584

  • 7

    Fox, G. E., Magrum, L. J., Balch, W. E., Wolfe, R. S., and Woese, C. R. (1977). Classification of Methanogenic Bacteria by 16S Ribosomal RNA Characterization. Proc. Natl. Acad. Sci. 74 (10), 4537–4541. doi: 10.1073/pnas.74.10.4537

  • 8

    Haubold, B., Pierstorff, N., Möller, F., and Wiehe, T. (2005). Genome Comparison without Alignment Using Shortest Unique Substrings. BMC Bioinformatics 6 (1), 123–211. doi: 10.1186/1471-2105-6-123

  • 9

    Hoang, T., Yin, C., and Yau, S. S.-T. (2016). Numerical Encoding of DNA Sequences by Chaos Game Representation with Application in Similarity Comparison. Genomics 108, 134–142. doi: 10.1016/j.ygeno.2016.08.002

  • 10

    Höhl, M., Rigoutsos, I., and Ragan, M. A. (2006). Pattern-based Phylogenetic Distance Estimation and Tree Reconstruction. Evol. Bioinformatics 2, 359–375. doi: 10.2174/157489306775330570

  • 11

    Huang, Y., and Wang, T. (2011). Phylogenetic Analysis of DNA Sequences with a Novel Characteristic Vector. J. Math. Chem. 49 (8), 1479–1492. doi: 10.1007/s10910-011-9811-x

  • 12

    Kumar, S., Stecher, G., and Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol. Biol. Evol. 33 (7), 1870–1874. doi: 10.1093/molbev/msw054

  • 13

    Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., et al. (2007). Clustal W and Clustal X Version 2.0. Bioinformatics 23 (21), 2947–2948. doi: 10.1093/bioinformatics/btm404

  • 14

    Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., and Morgenstern, B. (2014). Fast Alignment-free Sequence Comparison Using Spaced-word Frequencies. Bioinformatics 30, 1991–1999. doi: 10.1093/bioinformatics/btu177

  • 15

    Li, M., Badger, J. H., Chen, X., Kwong, S., Kearney, P., and Zhang, H. (2001). An Information-Based Sequence Distance and its Application to Whole Mitochondrial Genome Phylogeny. Bioinformatics 17 (2), 149–154. doi: 10.1093/bioinformatics/17.2.149

  • 16

    Ma, Y., Yu, Z., Tang, R., Xie, X., Han, G., and Anh, V. V. (2020). Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-Mers Method. Entropy 22 (2), 255. doi: 10.3390/e22020255

  • 17

    Mendizabal-Ruiz, G., Román-Godínez, I., Torres-Ramos, S., Salido-Ruiz, R. A., Vélez-Pérez, H., and Morales, J. A. (2018). Genomic Signal Processing for DNA Sequence Clustering. PeerJ 6 (3), e4264. doi: 10.7717/peerj.4264

  • 18

    Morrison, D. A. (2006). Multiple Sequence Alignment for Phylogenetic Purposes. Aust. Syst. Bot. 19 (6), 479–539. doi: 10.1071/sb06020

  • 19

    Murray, K. D., Webers, C., Ong, C. S., Borevitz, J., and Warthmann, N. (2017). kWIP: The K-Mer Weighted Inner Product, a De Novo Estimator of Genetic Similarity. PLoS Comput. Biol. 13 (9), e1005727. doi: 10.1371/journal.pcbi.1005727

  • 20

    Otu, H. H., and Sayood, K. (2003). A New Sequence Distance Measure for Phylogenetic Tree Construction. Bioinformatics 19 (16), 2122–2130. doi: 10.1093/bioinformatics/btg295

  • 21

    Qi, J., Luo, H., and Hao, B. (2004). CVTree: a Phylogenetic Tree Reconstruction Tool Based on Whole Genomes. Nucleic Acids Res. 32 (Suppl. 2), W45–W47. doi: 10.1093/nar/gkh362

  • 22

    Robinson, D. F., and Foulds, L. R. (1981). Comparison of Phylogenetic Trees. Math. Biosciences 53 (1-2), 131–147. doi: 10.1016/0025-5564(81)90043-2

  • 23

    Ronquist, F., Teslenko, M., Van Der Mark, P., Ayres, D. L., Darling, A., Höhna, S., et al. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space. Syst. Biol. 61 (3), 539–542. doi: 10.1093/sysbio/sys029

  • 24

    Saitou, N., and Nei, M. (1987). The Neighbor-Joining Method: a New Method for Reconstructing Phylogenetic Trees. Mol. Biol. Evol. 4 (4), 406–425. doi: 10.1093/oxfordjournals.molbev.a040454

  • 25

    Sims, G. E., Jun, S.-R., Wu, G. A., and Kim, S.-H. (2009). Alignment-free Genome Comparison with Feature Frequency Profiles (FFP) and Optimal Resolutions. Proc. Natl. Acad. Sci. 106 (8), 2677–2682. doi: 10.1073/pnas.0813249106

  • 26

    Tang, J., Hua, K., Chen, M., Zhang, R., and Xie, X. (2014). A Novel K-word Relative Measure for Sequence Comparison. Comput. Biol. Chem. 53, 331–338. doi: 10.1016/j.compbiolchem.2014.10.007

  • 27

    Thankachan, S. V., Chockalingam, S. P., Liu, Y., Apostolico, A., and Aluru, S. (2016). ALFRED: a Practical Method for Alignment-free Distance Computation. J. Comput. Biol. 23 (6), 452–460. doi: 10.1089/cmb.2015.0217

  • 28

    Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and Weight Matrix Choice. Nucl. Acids Res. 22 (22), 4673–4680. doi: 10.1093/nar/22.22.4673

  • 29

    Ulitsky, I., Burstein, D., Tuller, T., and Chor, B. (2006). The Average Common Substring Approach to Phylogenomic Reconstruction. J. Comput. Biol. 13 (2), 336–350. doi: 10.1089/cmb.2006.13.336

  • 30

    Wang, Y., Lei, X., Wang, S., Wang, Z., Song, N., Zeng, F., et al. (2016). Effect of K-Tuple Length on Sample-Comparison with High-Throughput Sequencing Data. Biochem. Biophysical Res. Commun. 469 (4), 1021–1027. doi: 10.1016/j.bbrc.2015.11.094

  • 31

    Wu, Q., Yu, Z.-G., and Yang, J. (2017). DLTree: Efficient and Accurate Phylogeny Reconstruction Using the Dynamical Language Method. Bioinformatics 33 (14), 2214–2215. doi: 10.1093/bioinformatics/btx158

  • 32

    Wu, X., Cai, Z., Wan, X.-F., Hoang, T., Goebel, R., and Lin, G. (2007). Nucleotide Composition String Selection in HIV-1 Subtyping Using Whole Genomes. Bioinformatics 23 (14), 1744–1752. doi: 10.1093/bioinformatics/btm248

  • 33

    Yin, C. (2019). Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J. Comput. Biol. 26 (2), 143–151. doi: 10.1089/cmb.2018.0173

  • 34

    Yu, C., Deng, M., and Yau, S. S.-T. (2011). DNA Sequence Comparison by a Novel Probabilistic Method. Inf. Sci. 181 (8), 1484–1492. doi: 10.1016/j.ins.2010.12.010

  • 35

    Yu, Z.-G., Chu, K. H., Li, C. P., Anh, V., Zhou, L.-Q., and Wang, R. W. (2010). Whole-proteome Phylogeny of Large dsDNA Viruses and Parvoviruses through a Composition Vector Method Related to Dynamical Language Model. BMC Evol. Biol. 10 (1), 1–11. doi: 10.1186/1471-2148-10-192

  • 36

    Yu, Z.-G., Zhan, X.-W., Han, G.-S., Wang, R. W., Anh, V., and Chu, K. H. (2010). Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment. Int. J. Mol. Sci. 11 (3), 1141–1154. doi: 10.3390/ijms11031141

  • 37

    Zielezinski, A., Vinga, S., Almeida, J., and Karlowski, W. M. (2017). Alignment-free Sequence Comparison: Benefits, Applications, and Tools. Genome Biol. 18 (1), 1–17. doi: 10.1186/s13059-017-1319-7

Summary

Keywords

alignment-free method, k-mer relative distance, information entropy, phylogenetic analysis, genome

Citation

Wu Y-Q, Yu Z-G, Tang R-B, Han G-S and Anh VV (2021) An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction. Front. Genet. 12:766496. doi: 10.3389/fgene.2021.766496

Received

29 August 2021

Accepted

29 September 2021

Published

22 October 2021

Volume

12 - 2021

Edited by

Juan Wang, Inner Mongolia University, China

Reviewed by

Liang Cheng, Harbin Medical University, China

Yanjuan Li, Quzhou University, China

*Correspondence: Zu-Guo Yu,

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics
