Cotton Duplicated Genes Produced by Polyploidy Show Significantly Elevated and Unbalanced Evolutionary Rates, Overwhelmingly Perturbing Gene Tree Topology

A phylogenetic tree can be used to illustrate the evolutionary relationship between a group of genes, especially duplicated genes, which are sources of genetic innovation and are often a hotspot of research. However, duplicated genes may have complex phylogenetic topologies due to changes in their evolutionary rates. Here, by constructing phylogenetic trees using different methods, we evaluated the phylogenetic relationships of duplicated genes produced by polyploidization in cotton. We found that at least 83.2% of phylogenetic trees did not conform the expected topology. Moreover, cotton homologous gene copy number has little effect on the topology of duplicated genes. Compared with their cacao orthologs, elevated evolutionary rates of cotton genes are responsible for distorted tree topology. Furthermore, as to both branch and site models, we inferred that positive natural selection during the divergence of fiber-development-related MYB genes was likely, and found that the reconstructed tree topology may often overestimate natural selection, as compared to the inference with the expected trees. Therefore, we emphasize the importance of borrowing precious information from gene collinearity in tree construction and evaluation, and have evidence to alert the citation of thousands of previous reports of adaptivity and functional innovation of duplicated genes.


INTRODUCTION
Molecular phylogeny describes evolutionary relationships among organisms or genes that they harbor through molecular biology technology (Yang and Rannala, 2012). It is an area of molecular evolution and has attracted wide attention in recent years, mainly because it is difficult to evaluate phylogenetic relationship by any other method in many cases (Zhang et al., 2018). Molecular methods can provide much clearer answers to some long-standing phylogenetic problems than traditional methods. The accumulation of DNA sequence data has exerted a tremendous influence on the development of the phylogenetic system . The coming genome era provides even more opportunities to understand phylogenetic relationship among species and genes.
Phylogenetic trees are commonly used to illustrate the evolutionary relationship among a group of taxa. The order of species formation events leading to the formation of an extant taxonomic species is unique in history and is similar to the formation of an extant gene in a genome. Therefore, only one of all possible trees constructed with a given dataset can represent the true evolutionary history. The tree built from a specific set of data and a selected tree-building method may be the same as or different from the real tree. If a phylogenetic tree is made up of a gene from each species, the presumed tree is referred as a gene tree. It may differ from a species tree in topology or in branching pattern (Edwards, 2009;Jombart et al., 2017). The reasons may involve alternative loss of anciently duplicated homologs in each species or untoned evolutionary rates among homologs. If a phylogenetic tree is made up of multiple gene homologs from each and more species, often constructed to understand gene evolutionary trajectory or functional innovation, the tree could be much more complex and much diverted from the real tree, due to the above-mentioned reasons and more.
Duplicated genes provide important opportunities for genetic novelty (van de Peer et al., 2009). It has been widely noted that multi-mer in a regulatory complex of proteins were likely produced by gene duplication. Multiple domains in a protein might also be made due to gene duplication. For example, disease resistance genes often have tens, or even hundreds, of copies in a genome, conferring resistance capability to fight fast evolving environmental pathogens. Therefore, the evolution and function of duplicated genes are a popular focus of interest in biology research. The existence of duplicated genes might provide more freedom to them, resulting in elevated or unbalanced evolutionary rates and functional innovation (Lynch and Conery, 2000;Salman-Minkov et al., 2016). During evolution, two homologs derived from a common ancestral gene may each partially take the ancestor's functions, often with multiple functions conflicting in time and space in cells, resulting in subfunctionalization (Lynch and Force, 2000); one homologous copy may evolve new function(s), referred as neo-functionalization. A combination of the above functional innovations could also occur, or one copy can be pseudogenized or lost to lower functional redundancy or to eliminate dosage changes, resulting in non-functionalization.
In a plant genome, there are often thousands of duplicated genes produced by various means of genetic duplication, such as polyploidization, tandem gene duplication, or transposon-related duplication, etc. Recursive polyploidization specifically has made plant genomes very complex and provides a great impact on evolutionary and functional innovation (Charon et al., 2012;Kim et al., 2014;Liu et al., 2014). Though wide-spread gene losses often occurred, a polyploidization event often produced thousands of duplicated genes in an extant plant genome. Major eudicot plants share a hexaploid ancestor, and the corresponding event has around 1600 duplicated copies in an eudicot plant, such as grape, cotton, soybean, etc. (Jaillon et al., 2007;Jiao et al., 2011;Wang et al., 2017). Cotton was affected by a decaploidization event, likely shared with the other Gossypium relatives, but not the non-Gossypium Malvaceae plants, such as durian and cacao (Wang et al., 2016(Wang et al., , 2019. This means that, if no gene loss had occurred, a grape gene would have one cacao ortholog and five orthologs in collinear positions in Gossypium raimondii (DD). However, owing to wide-spread gene losses after the decaploidization, only 39.1% of cacao genes have two or more duplicated copies at colinear positions in extant cotton DD genome (Wang et al., 2016). The inference of these thousands of years of polyploidization events and their produced duplicated genes were based on the detection of gene collinearity between chromosomes, and a set of collinear homologs were supposed to be derived from a common ancestral gene in the genome before the decaploidization.
Phylogenetic and evolutionary analysis of plant genomes may be problematic due to the changes in their evolutionary rates after polyploidiztion. In Gramineae, it has been suggested that barley (Hordeum vulgare), sorghum (Sorghum bicolor), and maize (Zea mays) evolve 12-33% faster than rice (Oryza sativa), which retains the most conserved genome (Wang et al., 2015). Assuming that genes evolve at the same rate will lead to weird inferences when duplicate genes in different species are used to determine the age of the same polyploidization. Using grape orthologs as a reference, the comparison of cacao and cotton genes showed that cotton genes evolved 19 and 15% faster at synonymous and non-synonymous substitution sites than their cacao homologs, respectively. Cacao was used to assess the difference in evolutionary rates between durian and cotton, indicating that after division from cacao, cotton evolves about 64% faster than durian (Wang et al., 2019). The difference between cotton paralogs is greater than the difference between them and cacao orthologs due to the increased rate of cotton evolution. The higher evolutionary rate in cotton than its related species was at least partly attributed to the occurrence of the polyploidization in cotton, as an elevated evolutionary rate of genes was also observed in other paleopolyploidies (Wang et al., 2011;The Tomato Genome Consortium., 2012;Chalhoub et al., 2014).
Elevated evolutionary rates of cotton genes raised the following interesting questions: how have the phylogeny of duplicated homolog genes that produced simultaneously been affected? Do the trees conform to the real or expected gene trees? What can be understood from the unbalanced nature of the evolution of the duplicated genes? Is there any difference from the large gene families? Is there evidence of adaptive selection and how does diverted phylogeny affect the inference of adaptive evolution? To answer the above questions, here, we used different tree-constructing methods to build phylogenetic trees of colinear homologous genes in collinearity between genomes of cotton, grape, and cacao, analyzed the topological structure of homologous gene trees, compared them with the expected trees inferred based on gene collinearity, and assessed the difference in inferring selective pressure based on the constructed and the expected trees.

Gene Phylogeny Construction
The coding sequences (CDS) of each group of genes were aligned using ClustalW (Thompson et al., 1994;Larkin et al., 2007) and MUSCLE (Edgar, 2004), and tree construction was performed using five methods in MEGA X (Kumar et al., 2018) (Maximum Likelihood, Neighbor-Joining, Minimum-Evolution, UPGMA, and Maximum Parsimony) with default parameters. Grape was set as the outgroup for each phylogenetic tree. The reliability of an inferred tree was characterized with bootstrap analysis with 1000 replications.

Gene Copy Number Analysis
BLASTP (Camacho et al., 2009) was used to align protein sequences of grape with cotton to detect the copy number of each grape gene in the cotton. We characterized gene copy variation in cotton by searching each grape gene at two E-value cutoffs of 1e−5 and 1e−20, respectively.

Calculation of Ks
Synonymous nucleotide substitutions on synonymous sites (Ks) were estimated by using the Nei-Gojobori approach (Nei and Gojobori, 1986) by implementing the Bioperl Statistical module.

Selective Pressure Detection
MYB family transcription factor genes were identified in diploid and tetraploid cotton genomes using a list of seed genes retrieved from a previous report (Paterson et al., 2012). With MYB genes, PhyML 3.0 was used to build ML trees, with the Jones, Taylor, and Thorton (JTT) model and 100 non-parametric bootstrap replicates (Guindon et al., 2010). We applied likelihood ratio (LR) tests to detect likely positive selection based on the ML methods and codon substitution models. Based on previously reported methods (Mondragon-Palomino et al., 2002;Mondragon-Palomino and Gaut, 2005), we implemented Codeml from the PAML package and analyzed the ML and expected trees to infer ω, the ratio of the non-synonymous to synonymous distances (Yang, 1997;Yang et al., 2000). We detected variation in ω among sites by employing a likelihood ratio test between M0 and M1, and M7 and M8 models.

Homologous Gene Phylogenetic Tree
To understand the evolution of cotton's duplicated genes, we involved cacao, the close relative of cotton, and grape, with a genome closely resembling that of the ancestral eudicot, to our experience in plant genomics analysis, in the present analysis. We selected 662 groups of homologs at the collinear positions of the three genomes involved, and each group had one grape gene as the outgroup of other homologs, one cacao gene, and at least three colinear cotton paralogs, being orthologous to the grape and the cacao gene (Figure 1 and Supplementary Table S1). The cotton paralogs were likely produced by the decaploidization. 2123 cotton genes were involved in the above homologous groups, i.e., each group has an average 3.21 cotton paralogs. Fractions of 81.4, 16.5, and 2.1% of all the homologous groups have 3, 4, or 5 cotton paralogs, respectively.
Notably, only a quite small fraction of trees conformed to the expected tree topology (Figure 1, Table 1, and Supplementary Table S2). The expected trees reflect the relationship of colinear paralogs produced by the decaploidy, and colinear orthologs originated due to speciation (Figure 1). That is, the decaploidy-produced cotton paralogs were expected to group together, with the cacao and the grape orthologs being their outgroup. Coding sequences of genes were translated into amino acid sequences to produce the alignment. ClustalW was used to make the alignment. With four tree-constructing approaches, ML, NJ, ME, and MP, an average 15.2% of constructed trees agree with the expected tree topology ( Table 1). With the ML approach, a maximum of 16.8% of constructed trees agree with the expected tree. In contrast, with a UPGMA approach, a minimum of 0.9% of trees matched the expected topology, showing that it is the weakest choice to reconstruct the right phylogeny. With all approaches, only five (0.76%) groups of homologs have all constructed trees conforming to the expected topology, while 503 (76%) groups have all trees not conforming to the expected topology (Figure 2). With the non-UPGMA approaches, only 45 (6.8%) groups of homologs all have constructed gene trees conforming to the expected topology, while 503 (76%) groups all have trees not conforming to the expected topology.
Different alignment approaches did not affect the tree topology much. Sequences were also aligned using MUSCLE and then trees were constructed with the above five approaches (Supplementary Tables S2, S3). Results were very similar   compared to the trees constructed using ClustalW. One thing to note is that with the MP approach, the trees constructed with MUSCLE had ∼2% increase agreeing with the expected topology as compared to those constructed with ClustalW.

Gene Homologous Copy Number and Tree Topology
Besides collinear homologs, there are other homologs in the cotton genome which increase redundancy. And these homologs, which are theoretically likely to elevate evolutionary freedom, possibly affect the topology of the trees constructed above. Here, for each contrasted tree, we checked whether the cotton collinear homologs have other homologous copies in the whole genome. Initially, at the BLASTP e-value < 1e−20, we searched homologous copies, and grossly did not find much relationship between cotton homologous copy numbers and the percentages of trees agreeing with the expected topology. However, when homologous copy number < 5, the agreeing percentage reached its maximum (21.3%) with the ML approach ( Table 2). In consideration of a much divergent and fast evolutionary rates of genes, especially when they are tens of millions of years old as is the occurrence of the decaploidy, we lowered the criteria to find homologs to BLASTP e-value < 1e−5, for three treeconstructing approaches (ML, NJ, and ME), when homologous copy number < 5, the agreeing percentages reaches the highest values (22.2%) (Supplementary Table S4); and in contrast, when copy number ≥ 50, the agreeing percentages reaches the lowest values (10.1%) (Supplementary Table S4).

Elevated Evolutionary Rates of Cotton Homologs
Elevated evolutionary rates of cotton genes are responsible for distorted tree topology. In all the distorted trees, the cacao gene is clustered with one or more cotton genes with the other cotton homolog(s) to form their outgroup branch(es) (Figure 2). This finding could be explained by elevated evolutionary rates of those cotton genes forming outgroup branches. Here, by estimating synonymous nucleotide substitutions at the synonymous site  (Ks), we characterized the level of difference between cottongrape and cotton-cacao homologous genes (Table 3). To find whether numbers of cotton paralogs affect their evolutionary rates, we checked 3803 homologous gene groups, including those with one and two cotton colinear paralogs, which were previously excluded in tree construction analysis. We found that, with groups with 1-4 cotton homologs, the average of Ks between cotton and grape orthologs is significantly larger than the Ks between cacao and grape orthologs (T-test P-value ≤ 0.008).
The copy number of collinear cotton paralogs may affect their evolutionary rates. To characterize how elevated cotton genes were in evolutionary rates as to their respective cacao orthologs, we defined the average relative rate difference ratio, R A = [Average(Gr i Vv)−TcVv]/TcVv, where GrVv is the Ks between a cotton gene and its grape ortholog, and TcVv is the Ks between the cacao gene and its grape ortholog, and average Ks were found when multiple cotton colinear genes were involved. As to the R A , we found that cotton genes evolved 9.1-18.4% (significantly) faster than their cacao orthologs with different groups involving 2-5 cotton colinear paralogs (Table 3). Actually, even when there is only one cotton colinear gene preserved, the cotton-grape orthologs' Ks is significantly larger than the cacao-grape orthologs' Ks values (T-test P-value = 2.32e−11).
Moreover, cotton homologs showed significantly diverged evolutionary rates. While there are cotton homologs that have much higher evolutionary rates, in the meantime, there are ones that have evolved even slower than their cacao orthologs. We defined the maximal and minimal relative rate difference ratios, where the maximal and minimal Ks were found when multiple cotton colinear paralogs were involved. When there are two cotton colinear paralogs, as to R Max and R Min , one cotton gene evolved 38.6% faster than its cacao ortholog ( Table 3), while the other cotton gene evolved 3.8% slower than the cacao ortholog. More divergent evolutionary rates were found with the groups with more cotton paralogs. The cotton genes having evolved much faster are more likely to develop new functions, i.e., neofunctionalization, while the ones much slower are more likely to preserve the ancestral function(s).

Problematic Tree Topology Affects Inference of Natural Selection
Of the 662 groups of homologous genes selected, 13 groups of them contain MYB family transcription factor genes, which may help differentiate epidermal cells into fibers. To understand whether problematic tree topology affected the assessment of functional innovation, we detected the orthologous genes from the cultivated cottons, G. hirsutum, G. barbadense, and G. arboretum. A total of 199 MYB family transcription factor gene homologs were included in the following analysis, with an average of 15.3 per group.
We performed natural selection pressure analyses on the constructed ML trees and the expected trees, respectively. The expected trees were made as to the relationship of colinear homologs, as described above (Figure 1). Notably, we found that none of the trees constructed by the ML method conformed to the expected topology. As to the branch model, only one group inferred the same gene homologs to be positively selected, ignoring the difference in ω values among branches leading to the gene homologs on the reconstructed and the expected trees.
With the other 12 groups of MYB genes, the reconstructed and the expected trees came to very different inferences of positively selected genes (Figure 3 and Supplementary Figure S1). Often (8/12) the expected trees inferred fewer positively selected genes or branches. As to the reconstructed tree, a RAD-like gene inferred 12 positively selected branches or genes, as compared to only 4 in the expected tree, showing much more overassessment of natural selection (Figures 3A,B). Surely, the tree topology difference should be responsible of this over-assessment. Checking the tree topology, we found that one cotton gene GhA05g2602 was clustered together with the cacao ortholog, Tc08g0220, while the other cotton genes were their outgroup. This shows that GhA05g2602 may have preserved ancestral gene function, while the other cotton gene had elevated evolutionary rates. In contrast, a MYB59 gene inferred ten positively selected branches and genes with the expected tree and more than five with the reconstructed tree (Figures 3C,D). It seems that one of three sets of decaploidy-derived colinear cotton paralogs elevated the evolutionary rates and moved outsides of the branch of the cacao ortholog, Tc09g3887.
The reconstructed and the expected trees often inferred different positively selected sites. As to the site model using both NEB (naïve empirical Bayes) and BEB (Bayes empirical Bayes) posteria tests, there were only 4 out of 13 groups inferring all the same positive selection sites with both the expected and the reconstructed trees (Supplementary Table S5). However, only two sites had a posterior probability greater than 0.95, verified by NEB and BEB.

DISCUSSION
The selected 662 colinear homologous groups described above provided a valuable evaluation as to whether correct trees could be produced based on sequence alignment. As we have shown, the majority (80%) of reconstructed trees could not reflect the actual relationship between these homologs. Without a reliable tree reflecting the actual relationship of their evolutionary history, one cannot rightly infer evolutionary innovation of genes.
Here, we showed that nearly all cases that infer positive selection are not right.
Fortunately, gene collinearity within a genome and between genomes could, to some extent, make up the flaws in tree topology reconstruction. Colinear genes often have a deterministic origination. Within a genome, they may be produced by rounds of polyploidization, and each of the polyploidization events are often well inferred about their occurrence time and related ploidy number. Between two genomes, colinear genes are often credible orthologs due to speciation, or outparalogs produced by polyploidization affected their common ancestor. Here, the above analysis involved a major-eudicot-common hexaploidy, and a decaploidy specific to the cotton lineage. The homologs produced by these two events were well grouped into orthologs and (out)paralogs. Therefore, without sequence alignment and tree reconstruction, we know their actual relationship used to construct the expected trees. With these expected trees, we could perform correct evolutionary analysis. To our knowledge, this is the first time gene collinearity is emphasized to be introduced to gene tree construction and evolutionary analysis of duplicated genes.
Some may doubt that alternative loss of anciently duplicated genes may result in problematic evaluation. This likelihood was lowered to its utmost with the cross-multigenomic collinearity restriction. Ancestral duplication at chromosomal location might result in tandem genes. In fact, only a small fraction (8.7%) of these genes have a tandem homology in at least a considered genome. Therefore, the likelihood of a tree topology resulted from the alternative loss of ancestral genes is very slim.
Many and many trees have been constructed to infer and describe their evolutionary history and functional innovation. Duplicated genes are the key to understanding gene functional innovation. However, without gene collinearity support, we have reasons to worry whether this research could have come to the right conclusion, and would advise double checking their reports of adaptivity and functional innovations of genes or encoded proteins.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

AUTHOR CONTRIBUTIONS
XW conceived and led the research. FM, YP, and JW performed the analysis. JY, CL, ZZ, CW, and HG contributed to data collection and joined constructive discussions. XW, YP, and JW wrote the manuscript.