An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

Wu, Yao-Qun; Yu, Zu-Guo; Tang, Run-Bin; Han, Guo-Sheng; Anh, Vo V.

doi:10.3389/fgene.2021.766496

ORIGINAL RESEARCH article

Front. Genet., 22 October 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.766496

An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

1. Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China
2. Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
3. Faculty of Science, Engineering and Technology, Swinburne University of Technology, Hawthorn, VIC, Australia

Article metrics

View details

Citations

2,4k

Views

851

Downloads

Abstract

Alignment methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational costs in handling time and space complexity. On the other hand, alignment-free methods incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group and the information entropy of frequency of k-mers. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source codes of our method are provided at https://github.com/ wuyaoqun37/IEPWRMkmer.

Introduction

The reconstruction of a phylogenetic tree is a primary problem in evolutionary biology. Sequence alignment is a key step in the reconstruction, aiming to identify the homology of sequences and uncover phylogenetic relationships in sequences. Traditional sequence comparison is based on pairwise or multiple sequence alignment (Felsenstein and Felenstein, 2004; Morrison, 2006) and was implemented by software packages such as BLAST (Altschul et al., 1990), ClustalW (Thompson et al., 1994), and MrBayes (Ronquist et al., 2012). However, the methods based on sequence alignment have some disadvantages, including high computational cost in handling the time and space complexity of the algorithm. Therefore, alignment-free methods have been proposed to overcome these problems (Zielezinski et al., 2017). The computational cost of alignment-free methods is low because they are generally of linear complexity (Fox et al., 1977).

Several alignment-free methods for sequence comparison are based on word counts (Blaisdell, 1986; Höhl et al., 2006; Wang et al., 2016). A key idea is to use the close distribution of k-mers to imply the high correlation degree, hence the similarity of the sequences. The methods have been implemented in software tools, such as FFP (Sims et al., 2009), kWIP (Murray et al., 2017), CVtree (Qi et al., 2004), and DLtree (Wu et al., 2017). Many k-mer methods transform the input sequence into a frequency vector of k-mers, then define the distance of the sequences by that of the frequency vector of k-mers (Qi et al., 2004; Wu et al., 2017). To reduce the statistical dependence between adjacent word matches, Spaced-Words (Leimeister and Boden, 2014) proposed to use spaced words, which are defined by patterns of matches without reference to positions. Some alignment-free methods are based on match length, which defines the distance between sequences based on the length of substring matches between two sequences. These include the shortest unique substring method (Haubold et al., 2005), ACS (Ulitsky et al. 2006), UA (Comin and Verzotto, 2012), and ALFRED (Thankachan et al. 2016). In addition, graphical representation was used to construct the probability distribution of a DNA sequence (Yu et al., 2011). The chaos game representation transforms the distribution of characters in a DNA sequence into the distribution of nodes in a graph (Hoang et al. 2016; Yin, 2017; Mendizabal-Ruiz et al., 2018). Many researchers considered extracting the position information of a k-mer (Huang and Wang, 2011; Ding et al., 2013; Tang et al., 2014). Ding et al. (2013) used the average interval distance of normalized k-mers to capture evolutionary information for sequence comparison. Tang et al. (2014) presented the average relative distance of normalized k-mers to improve the method of Ding et al. (2013). Ma et al. (2020) proposed the PWKmer method, which combines the k-mer counts and k-mer position distributions for phylogenetic analysis.

In this work, we propose a new alignment-free method which combines the position-weighted measure of k-mers proposed by Ma et al. (2020) and the information entropy of frequency of k-mers to obtain phylogenetic information for sequence comparison. It is named information-entropy position-weighted k-mer relative measure (IEPWRMkmer). To evaluate the performance of this method, we carry out phylogenetic analysis on two data sets used by other researchers.

Materials and Methods

Genomic Datasets

Dataset 1

The first dataset for analysis consists of the same whole genome DNA sequences of 30 mammalian species studied in Li et al. (2001), Otu and Sayood (2003), and Tang et al. (2014). The accession numbers, species, and species name are listed in Table 1. All sequences were downloaded from NCBI GenBank.

TABLE 1

No	Accession no	Species	Sequence name
1	AJ002189	Sus scrofa	Pig
2	AJ010957	Homo sapiens	Hippopotamus
3	AJ001588	Pan troglodytes	Rabbit
4	U96639	Canis familiaris	Dog
5	AF010406	Ovis aries	Sheep
6	V00662	Homo sapiens	Human
7	U20753	Felis catus	Cat
8	X72004	Halichoerus grypus	Gray seal
9	D38115	Pongo pygmaeus	Orangutan
10	V00654	Bos taurus	Cow
11	X97337	Equus asinus	Donkey
12	D38116	Pan troglodytes	Common chimpanzee
13	D38113	Pan paniscus	Pigmy chimpanzee
14	Z29573	Didelphis virginiana	Opossum
15	Y10524	Macropus robustus	Wallaroo
16	X99256	Hylobates lar	Gibbon
17	Y18001	Papio hamadryas	Baboon
18	X97336	Rhinoceros unicornis	Indian rhinoceros
19	Y07726	Ceratotherium simum	White rhinoceros
20	X63726	Phoca vitulina	Harbor seal
21	AJ238588	Sciurus vulgaris	Squirrel
22	AJ001562	Glis glis	Fat dormouse
23	AJ222767	Cavia porcellus	Guinea pig
24	X79547	Equus caballus	Horse
25	X14848	Rattus norvegicus	Rat
26	V00711	Mus musculus	Mouse
27	D38114	Gorilla gorilla	Gorilla
28	X61145	Balenoptera physalus	Fin whale
29	X72204	Balenoptera musculus	Blue whale
30	X83427	Ornithorhyncus anatinus	Platypus

Names, species, and accession numbers for mitochondrial genomes of 30 mammalian species.

Dataset 2

The second dataset for analysis is the HIV-1 dataset studied in Ma et al. (2020). This dataset contains 43 HIV genome sequences used in Wu et al. (2007) and a controversial taxonomic sequence used in Chang et al. (2014). The dataset includes subtypes A, B, C, D, F, G, J, K, and H of the HIV-1 M, O, N groups and the CPZ sequence. The area, accession numbers, and subtypes are listed in Table 2. All these sequences were downloaded from NCBI GenBank.

TABLE 2

No	Area	Accession no	Subtype
1	Belgium (DRC)	AF084936	G
2	Finland (Kenya)	AF061641	G
3	Sweden (DRC)	AF061642	G
4	Belgium	AF190128	H
5	Belgium	AF190127	H
6	Cent. Afr. Rep	AF005496	H
7	Tanzania	AF447763	CPZ
8	Cameroon	L20571	O
9	Senegal	AJ302647	O
10	Cameroon	L20587	O
11	Cameroon	AY169812	O
12	India	AF067155	C
13	South Africa	AY772699	C
14	Ethiopia	U46016	C
15	Brazil	U52953	C
16	Cameroon	AY371157	D
17	DRC	K03454	D
18	Uganda	U88824	D
19	Somalia	AF069670	A1
20	Uganda	AF484509	A1
21	Uganda	U51190	A1
22	Kenya	AF004885	A1
23	DRC	AF286238	A2
24	Cyprus	AF286237	A2
25	Sweden	AF082395	J
26	Sweden	AF082394	J
27	Cameroon	AJ249239	K
28	DRC	AJ249235	K
29	Cameroon	AJ249237	F2
30	Cameroon	AY371158	F2
31	Cameroon	AJ249236	F2
32	Cameroon	AF377956	F2
33	Finland	AF075703	F1
34	France	AJ249238	F1
35	Brazil	AF005494	F1
36	Belgium (DRC)	AF077336	F1
37	Cameroon	AJ271370	N
38	Cameroon	AY532635	N
39	Cameroon	AJ006022	N
40	Netherlands	AY423387	B
41	Thailand	AY173951	B
42	Australia	Gray seal	B
43	France	K03455	B
44	U.S.	AY331295	B

Accession numbers, subtype, and area for 44 HIV-1.

We use two approaches to validate the method. First, we use the Robinson-Foulds (RF) distance to compare our method with other alignment-free methods. Second, we use the bootstrap method to construct consensus trees and show the stability of the trees obtained by our method.

Methods

Let S = be a DNA sequence with length L, is a k-mer, where ∈(A,T,C,G). If the k-mer occurs in S, we denote by the vector composed of the positions of in this given sequence and by its ith element. If the k-mer does not occur in S,=(0). For example, for the DNA sequence GTAACCTGAACGTACTTGGA with length 20, we list all 2-mer position vectors:

P_AA=(3,9); P_AC=(4,10,14); P_AG= (0); P_AT= (0); P_CA=(0); P_CC=(5); P_CG=(11); P_CT=(6,15); P_GA=(8,19); P_GC=(0); P_GG=(18); P_GT=(1,12); P_TA=(2,13); P_TC = 0; P_TG=(7,17); P_TT=(16).

In this example, the 2-mers AG, AT, CA, GC, and TC do not appear. For each k-mer, its position vector provides its position distribution information in the sequence. One can use the k-mer position vectors to reconstruct the DNA sequence (Ma et al., 2020).

Ma et al. (2020) defined the position-weighted measure of based on its position in the sequence aswhere n is the length of the vector . Actually means the position weight of in the given sequence with length L.

We denote by N the number of sequences in a dataset. In order to characterize the importance of k-mers in the whole dataset, we count the number m of the sequences that contain a k-mer . Then the occurrence frequency F of this k-mer in the whole dataset is defined as m/N. We introduce the Shannon entropy H() of frequency F() defined by Murray et al. (2017) aswhere F stands for F ().

In this study, we aim to get more DNA phylogenetic information by combining the above two methods and defining

Here, we regard Shannon entropy H () as another weight.

For a fixed K, there are 4^Kk-mers. For each k-mer , we can calculate the corresponding , then arrange 4^K of these to get a feature representation vector () according to the alphabet order of the 4^Kk-mers for each genome.

For two given genome sequences A and B, we can obtain = and by the method. We use the Manhattan distance to calculate the pairwise distance between these two genome sequences:

For a given dataset, we can derive a distance matrix by Eq. 4. This distance matrix contains the sequence similarity information. After obtaining the distance matrix, we insert it into the mega 7.0 software (Sudhir et al., 2016) and use Neighbor-Joining (NJ) program (Saitou et al. 1987) to construct the phylogenetic tree.

Robinson-Foulds Distance and the Bootstrap Method

We use the Robinson-Foulds (RF) distance (Robinson and Foulds 1981) to judge the quality of the method. A smaller RF value means a closer distance between the phylogenetic tree and the reference tree.

(Yu et al., 2010) proposed a modified version of the bootstrap method to evaluate the reliability of the constructed phylogenetic tree. We also use this method in the present work. Its workflow is as follows: Each row is the feature vector ( of a species, and each column is the feature value of all genome sequences based on the same k-mer. Through random sampling of all columns, in which some columns may be selected many times, while some columns may not be selected at all, we randomly select one column. After 4^K times of selection, a new N4^K feature matrix is constructed. Using the new feature matrix, the Manhattan distance of any two rows is calculated to get a new distance matrix. Then we use the NJ method to construct a phylogenetic tree and repeat the above steps 100 times. Finally, a consensus tree is drawn by using consense. exe in the Phylip package. The frequency of a particular branch of a phylogenetic tree can be used as a measure of the stability of this branch.

Results

Experiment 1

We use the genomes of 30 mammalian species in dataset 1 to construct a phylogenetic tree using ClustalX (Larkin et al. 2007) as the reference tree. ClustalX is one of the widely used multiple alignment programs. The result is shown in Figure 1A. It is seen that rabbit, fat dormouse, squirrel, guinea pig, mouse, rat, platypus, opossum, and wallaroo belong to the rodents group; human, baboon, orangutan, gibbon, gorilla, pigmy chimpanzee, and common chimpanzee belong to the primates group; blue whale, fin whale, hippopotamus, cow, sheep, pig, donkey, horse, Indian-rhinoceros, white rhinoceros, cat, dog, gray seal, and harbor seal belong to the ferungulates group. When K < 5, it is not feasible to construct a phylogenetic tree using our method. When K = 5, 6, the 30 mammals cannot be divided into three groups in our tree. When K = 7, it can be divided into three groups, but the relationship between guinea pig and fat dormouse is not correct. When K = 8, 9, the branches of the tree become correct. We list the RF distances between the phylogenetic tree constructed by our method at K = 5, 6, 7, 8, 9 and the reference tree constructed by ClustalX in Table 3. From Table 3, we can see that the RF distance reaches the minimum when K = 8. We show the phylogenetic tree of K = 8 constructed by our method in Figure 1B. From Figure 1B, we can see that the species in the three main categories are grouped correctly. Primates and ferungulates are closer, and this relationship is consistent with that in Figure 1A. In terms of branches, monotremes (platypus), marsupials (wallaroo, opossum), murid rodents (mouse, rat), non-murid rodents (guinea pig, squirrel, fat dormouse, rabbit), perissodactyls (white rhinoceros, horse, Indian rhinoceros, donkey), carnivores (harbor seal, dog, gray seal, cat), artiodactyls (sheep, cow, hippopotamus, pig), primates (human, pigmy chimpanzee, common chimpanzee, gorilla, baboon, gibbon, orangutan), and cetaceans (blue whale, fin whale) are grouped into respective taxonomic classes accurately.

FIGURE 1

TABLE 3

K	5	6	7	8	9
RF distance	38	28	22	8	10

The RF distance between the phylogenetic tree conducted by our method at K = 5,6,7,8,9 and the reference tree conducted by ClustalX.

Figure 2 shows the RF distance between the reference tree constructed by ClustalX and the phylogenetic tree constructed by our method, Tang’s method, PWKmer, DLtree, and CVtree on dataset 1. Using our method, when K = 8, the RF distance is 8. The shortest RF distance of DLtree (K = 9) is 10, the shortest distance of CVtree (K = 9) is 16, the shortest distance of Tang’s method (K = 7) is 16, and the shortest distance of PWKmer (K = 9) is 10. Therefore, the results of our method are closer to those of ClustalX than those of the other methods, which indicates that our method is effective.

FIGURE 2

Figure 3 shows the consensus tree of 30 mammalian species based on our method. Compared with Figure 1B, 30 mammalian species are divided into the rodents group, the ferungulates group, and the primates group correctly. The support rate is 80% for the rodents group and 100% for both ferungulates and primates groups. Among the branches, marsupials (opossum, wallaroo), carnivores (dog, cat, harbor seal, gray seal), murid roots (rat, mouse), and cetaceans (fin whale, blue whale) are all supported by a 100% rate. In the artiodactyls group (cow, sheep, pig, hippopotamus), pig is separated out of the artiodactyls group, but the support rate is low at 43%. It indicates that the phylogenetic tree constructed by our method is quite robust.

FIGURE 3

Experiment 2

The human immunodeficiency viruses (HIV) represent a group of retroviruses, which are not presumed to have originated from human cellular DNA sequences, hence are distinct from endogenous retroviruses (Wu et al., 2007). HIV-1 can be classified into three major phylogenetic groups, namely M (major), N (new), and O (others). Group M is responsible for the HIV pandemic, it is divided into nine subtypes, namely A, B, C, D, F, G, J, K, and H. Based on differential phylogenetic clustering, the subtypes A and F are further divided into sub-subtypes (A1, A2) and (F1, F2), respectively. Groups N and O are derived from other primates and then infect humans. CPZ is a non-human primate virus isolated from chimpanzees, which is closest to human-to-human transmission of HIV.

We performed the phylogenetic analysis of 44 HIV-1 complete genome sequences in dataset 2 using ClustalX and our method. The phylogenetic trees reconstructed by ClustalX and our method (K = 7) are shown in Figure 4A and Figure 4B, respectively. From Figure 4B, we can see that the species from all subtypes can be correctly classified into their groups (A, B, C, D, F, G, J, K, H, O, and M), and CPZ as the reference sequence is separated into the outermost. From the internal branches, both F and A contain two subtypes (F1 and F2) and (A1 and A2), respectively. Our method can separate the two subtypes, and in the branches, both F and A subtypes can be closely grouped together.

FIGURE 4

Figure 5 shows the RF distances between the reference tree constructed by ClustalX and the phylogenetic trees constructed by our method, Tang’s method, PWKmer, DLtree, and CVtree. Using our method, when K = 7, the RF distance is 10. The shortest RF distance of the DLtree (K = 11) is 12, the shortest distance of the CVtree (K = 9) is 16, the shortest distance of the PWKmer (K =9) is 10, and the shortest distance of Tang’s method (K = 9) is 10. Therefore, our method performs better than the DLtree and the CVtree on dataset 2 and has the same performance as Tang’s method and PWKmer. The results indicate that our method is quite effective again.

FIGURE 5

Figure 6 shows the consensus tree of 44 HIV-1 based on our method. Comparing with Figure 4B, all HIV-1 sequences are divided into the M, N, O, and CPZ groups, whose support rate is 100%. From the branch point of view, in group M, the branch support rate of all subtypes is 100%. For subtypes A and F, the subtypes (A1, A2) and (F1 and F2) are clustered with 100% support. It again indicates that the phylogenetic tree constructed by our method is quite robust.

FIGURE 6

Estimate of the Optimal Parameter K

Different lengths of k-mers contain different phylogenetic information. Short k-mers may not contain sufficient DNA sequence information. Long k-mers contain sufficient phylogenetic information, but it needs large memory and takes a long time to calculate the distance based on information on long k-mers. Therefore, it is also very important to estimate an optimal value of K as heralded in (Yu et al., 2010) for the DLTree method and (Qi et al., 2004) for the CVTree method.

In this paper, we propose to use the Shannon entropy of the feature matrix to determine the optimal value of K. Using Eq. 3, we can obtain an N4^K feature matrix for a dataset with N genomes. Then, we propose to define a scoring strategy as

The optimal K is the value at which reaches its maximum.

We use Eq. 5 to calculate on datasets 1 and 2 for different K. The relationship between and K is shown in Figure 7 for these two datasets. It is seen that reaches the largest value when K = 8 on the two datasets. Considering that the larger K is, the more memory resources are consumed, we only consider the values near K = 8 (e.g., K = 7, 8, 9). For the 30 mammalian species dataset, we have seen that the phylogenetic tree for K = 8 constructed by our method is closest to the reference tree. The same happened for the HIV-1 dataset with K = 7. The outcomes indicate that can provide an effective means to estimate the optimal value of K.

FIGURE 7

Conclusion

In this paper, a new alignment-free method is proposed for phylogenetic analysis and sequence comparison based on whole genome sequences. Our method combines the position-weighted measure of k-mers and the information entropy of frequency of k-mers. We used the Manhattan metric to measure the distance between a pair of sequences and the NJ method to construct the phylogenetic tree. In order to test the effectiveness and reliability of our method, we applied it on two datasets of 30 mammalian species and 44 HIV-1 genomes. The results demonstrated that the present method is efficient and reliable. A suitable K value is important to capture rich phylogenetic information of DNA sequences. In order to choose an optimal K value, we proposed a scoring measure based on the information entropy. The obtained results on two real datasets support that the method can capture the k-mer distribution information and is effective for whole genome sequence comparison and phylogenetic analysis.

Remark: The method of this paper is derived from the two studies Ma et al. (2020) and Murray et al. (2017). There are differences between this work and previous works: Tang et al. presented the average relative distance for normalized k-mers. PWKmer uses the counts and position distributions of k-mers to capture more evolutionary information. KWIP (Murray et al. 2017) uses information entropy to weight the inner product (SiSj), while we use information entropy to weight the relative positions of k-mers. KWIP uses a kernel function to calculate the distance, while we use the Manhattan metric to calculate the pairwise distance between species. Here, we claimed that the results obtained by the IEPWRMkmer method are close to those by ClustalX and the IEPWRMkmer is superior to the other distance metrics. We used the phylogenetic tree constructed by ClustalX as the reference tree or standard tree, hence we cannot claim that our method is superior to the ClustalX method.

Statements

Data availability statement

The genome datasets analyzed for this study can be found in the GenBank https://www.ncbi.nlm.nih.gov/

Author contributions

Y-QW contributed to the conception and design of the study, developed the method, and wrote the manuscript. Z-GY gave the ideas and supervised the project. All authors discussed the results and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by funds from the National Natural Science Foundation of China (grant numbers: 11871061 and 12026213); The National Key Research and Development Program of China (grant number: 2020YFC0832405); Innovation Foundation of Qian Xuesen Laboratory of Space Technology.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1
AltschulS. F.GishW.MillerW.MyersE. W.LipmanD. J. (1990). Basic Local Alignment Search Tool. J. Mol. Biol.215 (3), 403–410. 10.1016/S0022-2836(05)80360-2
- CrossRef
- Google Scholar
2
BlaisdellB. E. (1986). A Measure of the Similarity of Sets of Sequences Not Requiring Sequence Alignment. Proc. Natl. Acad. Sci.83 (14), 5155–5159. 10.1073/pnas.83.14.5155
- CrossRef
- Google Scholar
3
ChangG.WangH.ZhangT. (2014). A Novel Alignment-free Method for Whole Genome Analysis: Application to HIV-1 Subtyping and HEV Genotyping. Inf. Sci.279, 776–784. 10.1016/j.ins.2014.04.029
- CrossRef
- Google Scholar
4
CominM.VerzottoD. (2012). Alignment-free Phylogeny of Whole Genomes Using Underlying Subwords. Algorithms Mol. Biol.7 (1), 1–12. 10.1186/1748-7188-7-34
- CrossRef
- Google Scholar
5
DingS.LiY.YangX.WangT. (2013). A Simple K-word Interval Method for Phylogenetic Analysis of DNA Sequences. J. Theor. Biol.317, 192–199. 10.1016/j.jtbi.2012.10.010
- CrossRef
- Google Scholar
6
FelsensteinJ.FelensteinJ. (2004). Inferring Phylogenies. (Sunderland, MA: Sinauer Associates). 10.1086/383584
- CrossRef
- Google Scholar
7
FoxG. E.MagrumL. J.BalchW. E.WolfeR. S.WoeseC. R. (1977). Classification of Methanogenic Bacteria by 16S Ribosomal RNA Characterization. Proc. Natl. Acad. Sci.74 (10), 4537–4541. 10.1073/pnas.74.10.4537
- CrossRef
- Google Scholar
8
HauboldB.PierstorffN.MöllerF.WieheT. (2005). Genome Comparison without Alignment Using Shortest Unique Substrings. BMC Bioinformatics6 (1), 123–211. 10.1186/1471-2105-6-123
- CrossRef
- Google Scholar
9
HoangT.YinC.YauS. S.-T. (2016). Numerical Encoding of DNA Sequences by Chaos Game Representation with Application in Similarity Comparison. Genomics108, 134–142. 10.1016/j.ygeno.2016.08.002
- CrossRef
- Google Scholar
10
HöhlM.RigoutsosI.RaganM. A. (2006). Pattern-based Phylogenetic Distance Estimation and Tree Reconstruction. Evol. Bioinformatics2, 359–375. 10.2174/157489306775330570
- CrossRef
- Google Scholar
11
HuangY.WangT. (2011). Phylogenetic Analysis of DNA Sequences with a Novel Characteristic Vector. J. Math. Chem.49 (8), 1479–1492. 10.1007/s10910-011-9811-x
- CrossRef
- Google Scholar
12
KumarS.StecherG.TamuraK. (2016). MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol. Biol. Evol.33 (7), 1870–1874. 10.1093/molbev/msw054
- CrossRef
- Google Scholar
13
LarkinM. A.BlackshieldsG.BrownN. P.ChennaR.McGettiganP. A.McWilliamH.et al (2007). Clustal W and Clustal X Version 2.0. Bioinformatics23 (21), 2947–2948. 10.1093/bioinformatics/btm404
- CrossRef
- Google Scholar
14
LeimeisterC.-A.BodenM.HorwegeS.LindnerS.MorgensternB. (2014). Fast Alignment-free Sequence Comparison Using Spaced-word Frequencies. Bioinformatics30, 1991–1999. 10.1093/bioinformatics/btu177
- CrossRef
- Google Scholar
15
LiM.BadgerJ. H.ChenX.KwongS.KearneyP.ZhangH. (2001). An Information-Based Sequence Distance and its Application to Whole Mitochondrial Genome Phylogeny. Bioinformatics17 (2), 149–154. 10.1093/bioinformatics/17.2.149
- CrossRef
- Google Scholar
16
MaY.YuZ.TangR.XieX.HanG.AnhV. V. (2020). Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-Mers Method. Entropy22 (2), 255. 10.3390/e22020255
- CrossRef
- Google Scholar
17
Mendizabal-RuizG.Román-GodínezI.Torres-RamosS.Salido-RuizR. A.Vélez-PérezH.MoralesJ. A. (2018). Genomic Signal Processing for DNA Sequence Clustering. PeerJ6 (3), e4264. 10.7717/peerj.4264
- CrossRef
- Google Scholar
18
MorrisonD. A. (2006). Multiple Sequence Alignment for Phylogenetic Purposes. Aust. Syst. Bot.19 (6), 479–539. 10.1071/sb06020
- CrossRef
- Google Scholar
19
MurrayK. D.WebersC.OngC. S.BorevitzJ.WarthmannN. (2017). KWIP: The K-Mer Weighted Inner Product, a De Novo Estimator of Genetic Similarity. Plos Comput. Biol.13 (9), e1005727. 10.1371/journal.pcbi.1005727
- CrossRef
- Google Scholar
20
OtuH. H.SayoodK. (2003). A New Sequence Distance Measure for Phylogenetic Tree Construction. Bioinformatics19 (16), 2122–2130. 10.1093/bioinformatics/btg295
- CrossRef
- Google Scholar
21
QiJ.LuoH.HaoB. (2004). CVTree: a Phylogenetic Tree Reconstruction Tool Based on Whole Genomes. Nucleic Acids Res.32 (Suppl. l_2), W45–W47. 10.1093/nar/gkh362
- CrossRef
- Google Scholar
22
RobinsonD. F.FouldsL. R. (1981). Comparison of Phylogenetic Trees. Math. Biosciences53 (1-2), 131–147. 10.1016/0025-5564(81)90043-2
- CrossRef
- Google Scholar
23
RonquistF.TeslenkoM.Van Der MarkP.AyresD. L.DarlingA.HöhnaS.et al (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space. Syst. Biol.61 (3), 539–542. 10.1093/sysbio/sys029
- CrossRef
- Google Scholar
24
SaitouN.NeiM. (1987). The Neighbor-Joining Method: a New Method for Reconstructing Phylogenetic Trees. Mol. Biol. Evol.4 (4), 406–425. 10.1093/oxfordjournals.molbev.a040454
- CrossRef
- Google Scholar
25
SimsG. E.JunS.-R.WuG. A.KimS.-H. (2009). Alignment-free Genome Comparison with Feature Frequency Profiles (FFP) and Optimal Resolutions. Pnas106 (8), 2677–2682. 10.1073/pnas.0813249106
- CrossRef
- Google Scholar
26
TangJ.HuaK.ChenM.ZhangR.XieX. (2014). A Novel K-word Relative Measure for Sequence Comparison. Comput. Biol. Chem.53, 331–338. 10.1016/j.compbiolchem.2014.10.007
- CrossRef
- Google Scholar
27
ThankachanS. V.ChockalingamS. P.LiuY.ApostolicoA.AluruS. (2016). ALFRED: a Practical Method for Alignment-free Distance Computation. J. Comput. Biol.23 (6), 452–460. 10.1089/cmb.2015.0217
- CrossRef
- Google Scholar
28
ThompsonJ. D.HigginsD. G.GibsonT. J. (1994). CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific gap Penalties and Weight Matrix Choice. Nucl. Acids Res.22 (22), 4673–4680. 10.1093/nar/22.22.4673
- CrossRef
- Google Scholar
29
UlitskyI.BursteinD.TullerT.ChorB. (2006). The Average Common Substring Approach to Phylogenomic Reconstruction. J. Comput. Biol.13 (2), 336–350. 10.1089/cmb.2006.13.336
- CrossRef
- Google Scholar
30
WangY.LeiX.WangS.WangZ.SongN.ZengF.et al (2016). Effect of K-Tuple Length on Sample-Comparison with High-Throughput Sequencing Data. Biochem. Biophysical Res. Commun.469 (4), 1021–1027. 10.1016/j.bbrc.2015.11.094
- CrossRef
- Google Scholar
31
WuQ.YuZ.-G.YangJ. (2017). DLTree: Efficient and Accurate Phylogeny Reconstruction Using the Dynamical Language Method. Bioinformatics33 (14), 2214–2215. 10.1093/bioinformatics/btx158
- CrossRef
- Google Scholar
32
WuX.CaiZ.WanX.-F.HoangT.GoebelR.LinG. (2007). Nucleotide Composition String Selection in HIV-1 Subtyping Using Whole Genomes. Bioinformatics23 (14), 1744–1752. 10.1093/bioinformatics/btm248
- CrossRef
- Google Scholar
33
YinC. (2019). Encoding and Decoding DNA Sequences by Integer Chaos Game Representation. J. Comput. Biol.26 (2), 143–151. 10.1089/cmb.2018.0173
- CrossRef
- Google Scholar
34
YuC.DengM.YauS. S.-T. (2011). DNA Sequence Comparison by a Novel Probabilistic Method. Inf. Sci.181 (8), 1484–1492. 10.1016/j.ins.2010.12.010
- CrossRef
- Google Scholar
35
YuZ.-G.ChuK. H.LiC. P.AnhV.ZhouL.-Q.WangR. W. (2010). Whole-proteome Phylogeny of Large dsDNA Viruses and Parvoviruses through a Composition Vector Method Related to Dynamical Language Model. BMC Evol. Biol.10 (1), 1–11. 10.1186/1471-2148-10-192
- CrossRef
- Google Scholar
36
YuZ.-G.ZhanX.-W.HanG.-S.WangR. W.AnhV.ChuK. H. (2010). Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment. Ijms11 (3), 1141–1154. 10.3390/ijms11031141
- CrossRef
- Google Scholar
37
ZielezinskiA.VingaS.AlmeidaJ.KarlowskiW. M. (2017). Alignment-free Sequence Comparison: Benefits, Applications, and Tools. Genome Biol.18 (1), 1–17. 10.1186/s13059-017-1319-7
- CrossRef
- Google Scholar

Summary

Keywords

alignment-free method, k-mer relative distance, information entropy, phylogenetic analysis, genome

Citation

Wu Y-Q, Yu Z-G, Tang R-B, Han G-S and Anh VV (2021) An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction. Front. Genet. 12:766496. doi: 10.3389/fgene.2021.766496

Received

29 August 2021

Accepted

29 September 2021

Published

22 October 2021

Volume

12 - 2021

Edited by

Juan Wang, Inner Mongolia University, China

Reviewed by

Liang Cheng, Harbin Medical University, China

Yanjuan Li, Quzhou University, China

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zu-Guo Yu, yuzuguo@aliyun.com

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Statistical Genetics and Methodology

ORIGINAL RESEARCH article

An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

Abstract

Introduction