Short branch attraction in phylogenomic inference under the multispecies coalescent

Accurate reconstruction of species trees often relies on the quality of input gene trees estimated from molecular sequences. Previous studies suggested that if the sequence length is fixed, the maximum likelihood may produce biased gene trees which subsequently mislead inference of species trees. Two key questions need to be answered in this context: what are the scenarios that may result in consistently biased gene trees? and for those scenarios, are there any remedies that may remove or at least reduce the misleading effects of consistently biased gene trees? In this article, we establish a theoretical framework to address these questions. Considering a scenario where the true gene tree is a 4-taxon star tree T*=S1,S2,S3,S4 with two short branches leading to the species S1 and S2, we demonstrate that maximum likelihood significantly favors the wrong bifurcating tree S1,S2,S3,S4 grouping the two species S1 and S2 with short branches. We name this inconsistent behavior short branch attraction, which may occur in real-world data involving a 4-taxon bifurcating gene tree with a short internal branch. If no mutation occurs along the internal branch, which is likely if the internal branch is short, the 4-taxon bifurcating tree is equivalent to the 4-taxon star tree and thus will suffer the same misleading effect of short branch attraction. Theoretical and simulation results further demonstrate that short branch attraction may occur in gene trees and species trees of arbitrary size. Moreover, short branch attraction is primarily caused by a lack of phylogenetic information in sequence data, suggesting that converting short internal branches to polytomies in the estimated gene trees can significantly reduce artifacts induced by short branch attraction.


Introduction
Coalescent-based approaches have been shown to be statistically consistent in estimating species trees as the number of loci and the sequence length increase to infinity (Felsenstein, 2006;Liu et al., 2010).However, short sequences are commonly observed in the coding and non-coding regions across species in the Tree of Life, necessitating an assessment of coalescent approaches in the context of finite sequence lengths.Previous studies (Roch and Warnow, 2015;Roch et al., 2019) showed that molecular sequences with a fixed length can mislead coalescent methods to estimate an incorrect species tree even if the number of loci increases to infinity.The failure of coalescent methods in this case is not caused by the deficiency of coalescent models (i.e., violation of the coalescent model assumptions) (Carvajal-Rodriguez et al., 2006;Adams et al., 2018;Jiang et al., 2020), but by the maximum likelihood (ML) gene trees consistently favor incorrect phylogenetic relationships of the species involved.Specifically, molecular sequences with a finite length may produce biased gene trees.These biased gene trees can subsequently mislead coalescent methods to estimate an incorrect species tree.It should be noted here that random bias in gene tree estimation does not have a major effect on species tree estimation.For instance, if some loci support incorrect relationships between two species S 1 and S 2 , while others support incorrect relationships between other species S i and S j , the misleading effects of biased gene trees are canceled out in estimating species trees.However, the greatest challenge in species tree inference arises if gene tree inference is consistently biased towards supporting the same or a similar set of incorrect relationships.Key questions that need to be addressed are what scenarios result in consistently biased gene trees and if any remedies can alleviate the misleading effects.These questions do not have straightforward solutions.Finding both necessary and sufficient conditions for biased gene trees is challenging, and while large gene tree estimation errors may mislead species tree estimation, coalescent methods are robust to a certain degree of gene tree errors and can still recover the true species tree (Liu et al., 2015).This article aims to establish a theoretical framework to address these questions, enabling the identification of problematic scenarios in real-world data analysis and providing potential solutions to these issues.
Genes with minimal phylogenetic information can substantially increase gene tree estimation error, which may reduce the accuracy of species tree inference (Xi et al., 2015).Similar consequences involving minimal phylogenetic information have been observed for another well-documented phenomena long branch attraction (LBA) (Felsenstein, 1978), which can mislead phylogenetic tree inference as well.LBA occurs when two long terminal branches are separated by a short internal branch in a phylogenetic tree, another example of lack of phylogenetic signals causing systematic errors in estimating phylogenetic trees.Long branch artefacts have historically been seen as a major problem for parsimony-based inference, but maximum likelihood and Bayesian approaches can also be susceptible to these issues (Martyn and Steel, 2012;Su and Townsend, 2015;Susko, 2015).Current theories on LBA have identified branch length conditions that lead to inconsistent inferences when using maximum parsimony (Hendy and Penny, 1989;Kim, 1996).Moreover, Townsend et al. (2012) proposed a "signal and noise" framework that uses substitution rates to estimate the ability of molecular sequences to resolve a four-taxon tree with equally-subtending branch lengths.This framework can be generalized to account for LBA bias resulting from asymmetric topologies or unequal evolution rates and can identify branch length conditions where phylogenetic inference is inconsistent for these types of phylogenies (Su and Townsend, 2015).Nevertheless, these theories originated within the framework of conventional phylogenetic models (Felsenstein, 1981).However, when it comes to the multispecies coalescent model (Rannala and Yang, 2003), there is a scarcity of theories addressing the specific conditions of branch lengths that can mislead species tree estimation.A review of the evidence on the impact of missing data on species tree estimation indicates that missing data can bias both gene tree and species tree estimation (Xi et al., 2016).The estimation problems induced by minimal phylogenetic signal can be alleviated by sampling more informative gene sequences.More severe challenges arise, however, when gene tree estimation is biased and cannot be remedied by increasing the sequence length or the number of loci (Xi et al., 2015).
Owing to its asymptotic properties (i.e., consistency, asymptotic unbiasedness, asymptotic efficiency, etc.), ML is among the most popular methods for building phylogenetic trees from molecular sequence data.It is well known that maximum likelihood estimates (MLE) are often biased if the sample size is finite (Mardia et al., 1999).Given a set x 1 , …, x n of identically and independently distributed random variables generated from the normal distribution with mean ∝ and variance σ 2 , the MLE of the variance σ 2 is underestimates the variance σ 2 .Biasness is a good measure of an estimator's performance in a finite sample size.A continuous estimator θ of the parameter θ is said to be unbiased if its expectation is equal to the parameter value, i.e., E θ = θ.Since the tree topology is a discrete random variable, the ML tree T is defined as an unbiased estimator of the true tree T * if the most probably ML tree is the true tree T *, i.e., P T = T * > P T = T for any T ≠ T *.In contrast to previous studies (Su and Townsend, 2015) that employed predicted utility to identify branch length conditions in which ML methods fail to recover the true phylogenetic tree, this study takes a different approach to investigate the statistical properties, specifically the bias, of ML phylogenetic trees.In this paper, we can show that under certain conditions ML trees are biased estimators of the underlying phylogenetic trees (i.e., gene trees and species trees) if the sequence length is finite.The proof begins with a scenario where the true gene tree is a 4-taxon star tree T * = S 1 , S 2 , S 3 , S 4 with two short branches leading to species S 1 and S 2 .In this case, ML significantly favors the wrong bifurcating tree S 1 , S 2 , S 3 , S 4 grouping the two species S 1 and S 2 with short branches, which is called short branch attraction (SBA).SBA may also occur in a 4-taxon bifurcating gene tree with a short internal branch.If no mutation occurs in the internal branch, which is quite likely if the internal branch is short, the 4-taxon bifurcating tree is equivalent to the 4-taxon star tree and thus it suffers the same misleading effect of short branch attraction.Similarly, if the true species tree is a 4-taxon star tree with two short branches leading to the species S 1 and S 2 , most gene trees generated from this species tree are the 4-taxon bifurcating trees with a short internal branch.Due to SBA, the ML gene trees consistently favor the wrong tree S 1 , S 2 , S 3 , S 4 across genes, which subsequently mislead the coalescent methods (Liu and Yu, 2011;Mirarab and Warnow, 2015) to estimate the wrong species tree S 1 , S 2 , S 3 , S 4 .Our findings indicate that a phylogenetic tree may exhibit the short branch artifact when a quartet within the tree consists of a short internal branch and four unequal external branches.As such, the phenomenon known as long branch attraction, which is characterized by a short internal branch and dissimilar external branches, represents a specific instance of the aforementioned artifact.This study presents a comprehensive analytical framework for examining the impact of branch length heterogeneity on the inference of gene and species trees using finite sequence lengths.

Biased maximum likelihood estimates of phylogenetic trees
The MLEs of phylogenetic trees have been shown to be statistically consistent as the sequence length K goes to infinity (Rogers, 1997), i.e., P T MLE = T * 1 as K ∞, where T MLE is the MLE of the phylogenetic tree T *.If the sequence length K is finite, ML methods may produce a biased estimate of the phylogenetic tree T *.Consider a 4-taxon tree with an internal branch of length t 0 ≥ 0 and four terminal branches of lengths t 1 , t 2 , t 3 , t 4 in mutation units (i.e., the branch length t represents the number of mutations per site).Without loss of generality, we assume that 0 ≤ t 1 ≤ t 2 ≤ t 3 ≤ t 4 < ∞.Let D = d 1 , …, d K be the DNA alignment of length K generated from the 4-taxon tree.In this paper, the Jukes-Cantor substitution model (Jukes and Cantor, 1969) is adopted for modeling the evolution of a single nucleotide.(1) The probabilities p = p 1 , …, p 15 of 15 site patterns are functions of the true 4-taxon tree T * and branch lengths t = t 0 , t 1 , t 2 , t 3 , t 4 .The probability of the alignment D is equal to the probability of frequencies ω, i.e., P D | T *, t = P ω | p .If the sequence length is K, the number of different alignments is 15 k .For simplicity, we assume that the MLE is unique.
Then the ML tree can be obtained from each of the 15 k alignments.Let T 1 = S 1 , S 2 , S 3 , S 4 , T 2 = S 1 , S 3 , S 2 , S 4 , and T 3 = S 1 , S 4 , S 2 , S 3 be the three (unrooted) ML trees estimated from the sequence alignment D. The probability that the MLE T MLE is equal to the tree T = T j , j = 1, 2, 3 is given by Here D i 's are the alignments from which the MLE tree is T j .The MLE T MLE is said to be biased if the most probable ML tree is not the true tree T *, i.e., P T MLE = T j > P T MLE = T * for some T j ≠ T *.

Short branch attraction
In gene trees-In this section, we analyze three distinct scenarios (4-taxon star trees, 4-taxon bifurcating trees, and n-taxon trees) for gene trees.We show that the SBA artifact can mislead the ML estimation of gene trees in these scenarios.

Scenario 1: 4-taxon star trees:
We first consider a 4-taxon star tree where the four species S 1 , S 2 , S 3 , S 4 diverged from the same ancestral node (Figure 1A).Let W, X, Y, Z be the nucleotides of the species S 1 , S 2 , S 3 , S 4 in a single site of the DNA alignment D. Let H be the nucleotide at the internal node of the star tree T *.Given the nucleotide H, the probability of nucleotides W, X, Y, Z is the multiplication of the probabilities of four terminal branches, i.e., P W XY Z H, T *, t 1 , t 2 , t 3 , t 4 = P HW t 1 P HX t 2 P HY t 3 P HZ t 4 (3) In (3), P HW t 1 is the probability that the nucleotide H changes to the nucleotide W after time t 1 .Because P H = 1 4 for H = A, C, G, T , the probability of a single site for a star tree T * is given by , G, T P HW t 1 P HX t 2 P HY t 3 P HZ t 4 (4) Additionally, there are 12 out of 15 nucleotide patterns that produce an unresolved ML tree (a star tree, i.e., T MLE = T *).The remaining three patterns (xxyy, xyxy, xyyx) lead to a bifurcating ML tree.If the sequence length K = 1, it follows from equation ( 2) that the probability P T MLE = T * that the MLE is a star tree is equal to the sum of the probabilities of the 12 site patterns, and the probability that the MLE is a bifurcating tree, which is the true tree ), the probability P T MLE = T * that the MLE T MLE is the true star tree T * is greater than the probability P T MLE ≠ T * that the MLE T MLE is not the true star tree T *.
Thus, the ML tree T MLE is an unbiased estimator of the 4-taxon star tree T *.In real data analysis, most phylogenetic programs, for example RAxML (Stamatakis et al., 2005) and PHYML (Guindon et al., 2010), are forced to produce bifurcating trees.Therefore, we here only consider the site patterns If t 1 = t 2 = t 3 = t 4 , then P xxyy = P xyxy = P xyyx .The ML tree is an unbiased estimator of the 4-taxon star tree T * if P T MLE = T 1 = P T MLE = T 2 = P T MLE = T 3 .If the sequence length K = 1, the probability P T MLE = T 1 is equal to the probability P xxyy , and P T MLE = T 2 = P xyxy , and P T MLE = T 3 = P xyyx .Thus, the ML tree T MLE is an unbiased estimator if the 4-taxon star tree T * has equal branch lengths t 1 = t 2 = t 3 = t 4 .If the branch lengths are not equal, for example, t 1 < t 2 < t 3 < t 4 , then P xxyy > P xyxy > P xyyx .It follows that P T MLE = T 1 > P T MLE = T 2 and P T MLE = T 1 > P T MLE = T 3 .Thus, the ML tree T MLE is a biased estimator of the 4-taxon star tree T * with unequal branch lengths.Moreover, the ML tree T MLE favors the tree T 1 which groups the lineages of S 1 and S 2 with short branches t 1 and t 2 .We call this phenomenon short branch attraction (SBA).
If the sequence length K is large, there is at least one site with the pattern xxyy, xyxy, or xyyx in the sequence alignment D. As a result, ML methods would consistently produce a bifurcating tree as the estimate of the 4-taxon star tree T *.Let λ = P xxyy + P xyxy + P xyyx .The probability that the sequence alignment of length K consists of at least one site with the patterns xxyy, xyxy, and xyyx is 1 − (1 − λ) K , which converges to 1.0 as K goes to infinity.Note that the probability that the counts of three patterns are equal, the case that ML methods estimate a star tree, converges to 0 as K ∞.
It indicates that the MLE T MLE converges to a bifurcating tree as the sequence length K goes to infinity.Thus, T MLE is a biased estimator of the star tree T * if the sequence length K is large.Moreover, ML methods favor the tree T 1 = S 1 , S 2 , S 3 , S 4 over the other two trees T 2 = S 1 , S 3 , S 2 , S 4 and T 3 = S 1 , S (Supplementary Appendix A2).This result indicates that the MLE T MLE of the 4-taxon star tree T * is biased toward the bifurcating tree T 1 which groups the lineages of S 1 and S 2 due to SBA.The probability P T MLE = T 1 , or equivalently the biasness of T MLE , increases as the branch lengths t 1 and t 2 decrease and/or t 3 and t 4 increase.

Scenario 2: 4-taxon bifurcating trees:
In this section, the true tree T * is assumed to be a 4-taxon bifurcating tree T * = T 2 = S 1 : t 1 , S 3 : t 3 : t 0 , S 2 : t 2 , S 4 : t 4 with an internal branch length = T 0 and four terminal branches length = t 1 , t 2 , t 3 , t 4 (Figure 1B).To calculate the probabilities of 15 site patterns, we consider two scenarios -the nucleotides x n 1 and x n 2 at two internal nodes n 1 and n 2 are identical or distinct.Under the Jukes-Cantor model, the probability of two identical nucleotides is e −4t 0 /3 , while the probability of two distinct nucleotides is e −4t 0 /3 .If the nucleotides x n 1 and x n 2 at two internal nodes are identical, the probability of a site pattern & = W XY Z coincides with that for a star tree described in equation ( 3), The probability of the site pattern & is equal to the weighted sum of the two probabilities in equations ( 7) and ( 8 Thus, when the internal branch length t 0 is small, according to the theory derived for 4-taxon star trees in the previous section, the ML tree is a biased estimator of T * = T 2 , favoring the wrong tree T 1 due to SBA.As  1B).Thus, as the internal branch length t 0 ∞, the MLE T MLE becomes an unbiased estimator of T *.
2.1.1.3.Scenario 3: generalization to the trees of more than 4 taxa: Let T be an n-taxon n > 4 bifurcating tree (Figure 1C).Let n 01 and n 02 be the two nodes at the two ends of an internal branch b of T .Because T is a bifurcating tree, each of the two internal nodes n 01 and n 02 is also associated with two other nodes (Figure 1C).Let n 1 and n 3 be the two nodes associated with n 01 .Let n 2 and n 4 be the two nodes associated with n 02 (Figure 1C).The nodes n 1 and n 2 are terminal nodes, while the nodes n 3 and n 4 could be the terminal or internal nodes in the n-taxon tree T .The six nodes along with the five branches connecting them form a 4-taxon subtree T * = n 1 , n 3 , n 2 , n 4 (i.e., the subtree in brown, Figure 1C), where b is the "internal" branch of length t 0 and the remaining four are "terminal" branches of length t 1 < t 2 < t 3 < t 4 .The theory derived for Scenario 1 and 2 indicates that there are 15 patterns with distinct probabilities for the nucleotides at the four "terminal" nodes n 1 , n 2 , n 3 , n 4 .Here, we ask the same question: what are the probabilities of the 15 site patterns?Firstly, those probabilities do not depend on the other parts of the n-taxon tree T (i.e., the blue subtrees in Figure 1C).Secondly, if the assumed substitution model is time reversible -in fact, most substitution models are time reversible, then the theory derived in Scenario 2 can be applied to computing the probabilities of 15 site patterns for this 4-taxon subtree T *.Specifically, the probability of the site pattern & is equal to the sum of the probabilities of the site pattern & when the nucleotides x n 01 and x n 02 at two internal nodes n 01 and n 02 are identical or distinct as described in equation ( 9).Finally, according to the theory derived for Scenario 2, if the internal branch length t 0 is small (i.e., t 0 0), the ML tree T MLE is a biased estimator of T * = n 1 , n 3 , n 2 , n 4 , favoring the wrong tree n 1 , n 2 , n 3 , n 4 due to SBA.However, because the nucleotides at the nodes n 3 , n 4 are not observable, it is difficult to formally prove that the ML tree T MLE is a biased estimator of T * when t 0 is small and the branch lengths t 1 and t 2 are much less than the branch lengths t 3 and t 4 .Instead, the biased MLE of the subtree T * will be illustrated by simulation.

Short branch attraction in species trees-
In this section, we analyze three distinct scenarios (4-taxon star trees, 4-taxon bifurcating trees, and n-taxon trees) for species trees.We demonstrate that the short branch attraction artifact has the potential to lead to erroneous species tree estimation in these scenarios 2.1.2.1.Scenario 1: 4-taxon star species trees: Let S* = S 1 : τ 1 , S 2 : τ 2 , S 3 : τ 3 , S 4 : τ 4 be a 4-taxon star species tree with unequal branch lengths τ 1 < τ 2 < τ 3 < τ 4 (Figure 1D).Let θ 0 be the population size parameter of the ancestral population at the root of the species tree S*.
It is assumed that a single allele is sampled from each species.The gene tree T * generated from this star species tree under the multispecies coalescent model is a 4-taxon bifurcating tree with unequal branch lengths.Under the coalescent model, the three unrooted gene trees T 1 = S 1 , S 2 , S 3 , S 4 , T 2 = S 1 , S 3 , S 2 , S 4 , and T 3 = S 1 , S 4 , S 2 , S 3 have the same probability , 1 3 .Let t 0 be the length of the internal branch of T *.It follows from the coalescent theory that the internal branch length t 0 has the exponential distribution with mean 5 9 θ.When θ is small (i.e., θ 0), most gene trees generated from the star species tree S* have a short internal branch and four terminal branches whose lengths t 1 < t 2 < t 3 < t 4 tend to be in the same order as those τ 1 < τ 2 < τ 3 < τ 4 in the species tree S* Due to SBA, the MLEs of such 4-taxon gene trees support the tree T 1 = S 1 , S 2 , S 3 , S 4 with a probability ⊕ 1.0, which significantly deviates from the true probability distribution 1 3 , 1 3 , 1 3 of the gene trees derived from the species tree S* under the multispecies coalescent model.As a result, the biased gene trees mislead the coalescent methods to consistently estimate the wrong species tree S 1 , S 2 , S 3 , S 4 as the number of gene trees goes to infinity.Thus, SBA occurs in the 4-taxon star species trees.
2.1.2.2.Scenario 2: 4-taxon bifurcating species trees: Consider a 4-taxon bifurcating species tree S* = S 1 : τ 1 , S 3 : τ 3 , S 2 : τ 2 , S 4 : τ 4 (Figure 1E).Let τ 1 and τ 2 be the branch lengths and θ 1 and θ 2 be the population size parameters of two internal branches (Figure 1E).The probability that two alleles from species S 1 and S 2 do not coalesce in their most recent common ancestral population (MRCA) is θ 1 .The probability that three alleles from species S 1 , S 2 , and S 3 do not coalesce in their MRCA θ 2 .The probability p that the four alleles coalesce in the root population is equal to the probability p 1 that two alleles from species S 1 and S 2 do not coalesce in their MRCA multiplied by the probability p 2 that three alleles from species S 1 , S 2 , and S 3 do not coalesce in their MRCA, i.e., p = p 1 p 2 = 1 − are small (i.e., θ 1 and θ 2 are large, or τ 1 and τ 2 are small), the four alleles from species S 1 , S 2 , S 3 , and S 4 have a high probability of coalescing in the root population, a case described in Scenario 1: 4-taxon star species trees for which the coalescent methods consistently estimate the wrong species tree S 1 , S 2 , S 3 , S 4 .Thus, SBA occurs in the 4-taxon bifurcating species trees if the population size parameters θ 1 and θ 2 of the two ancestral populations are large, or the branch length τ 1 and τ 2 are small.

Scenario 3:
n-taxon bifurcating species trees: Let S* be the 4-taxon subtree (i.e., the subtree in red, Figure 1F) of an n-taxon n > 4 bifurcating species tree S. Let n 1 , n 2 , n 3 , n 4 be the four terminal nodes of S* (Figure 1F).Note that the nodes n 3 and n 4 are internal nodes, while the nodes n 1 and n 2 are two terminal nodes in the n-taxon species tree S. It is assumed that one allele is sampled from each species.If all genealogical lineages coalesce in the blue subtrees below the nodes n 3 , n 4 (Figure 1F), then only one lineage enters each of the nodes n 3 , n 4 .In this case, the coalescence process occurring in the subtree S* is the same process as that occurs in the 4-taxon species tree described in the previous section Scenario 2: 4 -taxon bifurcating species trees.Thus, SBA may occur in the 4-taxon subtree S* of an n-taxon species tree if the population size parameter θ 0 in the root population is small, and the population size parameters θ 1 and θ 2 of the two ancestral populations are large, or the branch length τ 1 and τ 2 are small.If multiple lineages enter the nodes n 3 , n 4 , the gene tree lineages generated from the subtree S* involve more than 4 taxa.Although SBA may still occur in the n-taxon gene trees (see Scenario 3: Generalization to the trees of more than 4 taxa) which can subsequently mislead the coalescent methods to estimate the wrong species tree, it is difficult to formally prove that SBA can occur in the n-taxon species trees.Instead, we will use simulation to illustrate SBA in the species trees of more than 4 species.

The algorithmic bias of PhyML and
RAxML-This section is to assess the algorithmic bias of two popular phylogenetic programs, PhyML and RAxML, in building the ML trees from sequence alignments.Later, when we use simulation to illustrate that ML trees are biased estimators, we need to show that the level of the biasness of ML trees exceeds what the algorithmic bias can explain, i.e., the biasness of ML trees is due to model Saturated sequences of 1000 bpS were simulated for species S 1 , S 2 , S 3 , S 4 and then used to build phylogenetic trees.The results indicate that an algorithmic bias of PhyML and RAxML for saturated sequences is less severe than that for identical sequences (Figure 2B).PhyML appears to favor the tree S 1 , S 4 , S 2 , S 3 , as 40% of the ML trees reconstructed by PhyML are S 1 , S 4 , S 2 , S 3 (Figure 2B), which is significantly higher than the expected proportion 1/3.By contrast, RAxML appears to favor the tree S 1 , S 2 , S 3 , S 4 when the number of trees is 100 or 500, but the proportion of S 1 , S 2 , S 3 , S 4 is 0.33 when the number of trees is 1,000 (Figure 2B), which is not significantly different from the expected proportion 1/3.

Scenario 3: sequences from a 4-taxon star tree:
Identical or saturated sequences are not frequently observed in real sequence data.Here, we consider a more realistic scenario where the true phylogenetic tree is a 4-taxon star tree.In this simulation, DNA sequences were generated from the tree (S 1 : 0.01, S 2 : 0.01, S 3 : 0.01, S 4 : 0.01) with equal branch lengths of 0.01.When the true tree is a polytomy tree, three ML trees S 1 , S 2 , S 3 , S 4 , S 1 , S 3 , S 2 , S 4 , and S 1 , S 4 , S 2 , S 3 are expected to be uniformly distributed with equal probabilities 1/3, 1/3, 1/3 .Deviation from this uniform distribution indicates an algorithmic bias of PhyML and RAxML.The simulation results suggest that both PhyML and RAxML favor the tree S 1 , S 2 , S 3 , S 4 .When the number of trees is 1,000, the proportion of the tree S 1 , S 2 , S 3 , S 4 estimated by PhyML and RAxML is 0.41 and 0.4, respectively (Figure 2C), significantly higher than the expected proportion 1/3.
The three simulations (identical sequences, saturated sequences, and 4-taxon star tree) indicate that both programs (PHYML and RAxML) suffer an algorithmic bias, but the algorithmic bias of PHYML appears to be more severe than that of RAxML.Therefore, we will use RAxML to build ML trees in the subsequent analyses.

Short branch attraction in gene trees
2.2.2.1.Scenario 1: 4-taxon star trees with unequal branch lengths: According to the theory we developed above, the MLE of a 4 -taxon star tree T * with unequal branch lengths favors a bifurcating tree in which two lineages with short branches are grouped together.
Here, we use simulation to demonstrate the biasness of ML methods in estimating 4-taxon star trees with unequal branch lengths.In this simulation, the 4-taxon star tree (S 1 : 0.0001, S 2 : 0.01, S 3 : 0.0001, S 4 : 0.01) has two short branches of length 0.0001 leading to species S 1 and S 3 .DNA sequences were simulated from this 4-taxon star tree and then used to build ML trees by RAxML.The simulation was repeated 100 times and we calculated the proportions of three ML trees S 1 , S 2 , S 3 , S 4 , S 1 , S 3 , S 2 , S 4 , and S 1 , S 4 , S 2 , S 3 .As expected, more than 90% of the ML trees are S 1 , S 3 , S 2 , S 4 (Figure 3A).When the true species tree is (S 1 : 0.1, S 2 : 0.01, S 3 : 0.1, S 4 : 0.01) in which the length of two short terminal branches increases from 0.0001 to 0.1, most of the ML trees are still S 1 , S 3 , S 2 , S 4 , but the probability of the most probable ML tree drops from 0.91 to 0.74 for sequence length = 100 bps (Figure 3B).As the sequence length increases, the probability drops from 0.75 to 0.45 (Figure 3B).

Scenario 2: 4-taxon bifurcating trees with unequal branch lengths:
Likewise, SBA occurs in the 4 -taxon bifurcating trees, misleading ML methods to produce the wrong tree estimates.To assess the effects of SBA in gene trees, DNA sequences of 100,500, and 1,000 bps were simulated from a 4 -taxon bifurcating tree ( S 1 : 0.0001, S 2 : 0.01 : 0.0001, S 3 : 0.0001, S 4 : 0.01).Because the terminal branches leading to the species S 1 and S 3 are short, the sequences of species S 1 and S 3 generated from this tree are almost identical to each other.As a result, the ML methods group the species S 1 and S 3 and estimate the wrong tree S 1 , S 3 , S 2 , S 4 (Figure 3C).The simulation results suggest that 80% of the ML trees reconstructed by RAxML are the tree S 1 , S 3 , S 2 , S 4 (Figure 3C).When the true species tree is ( S 1 : 0.01, S 2 : 0.01 : 0.0001, S 3 : 0.01, S 4 : 0.01) in which the length of two short terminal branches increases from 0.0001 to 0.1, most of the ML trees are still S 1 , S 3 , S 2 , S 4 , but the probability of the most probable ML tree drops from 0.86 to 0.41 for sequence length = 100 bps (Figure 3D).As the sequence length increases, the probability increases from 0.41 for sequence length = 100 bps to 0.49 for 500 bps and 0.42 for 1,000 (Figure 3D) 2.2.2.3.Scenario 3: SBA in n-taxon trees: DNA sequences were simulated from the 8-taxon tree ((A: 0.0001, ( S 1 : 0.01, S 2 : 0.01 : 0.01, S 3 : 0.01): 0.01): 0.0001, B: 0.0001, ( S 4 : 0.01, S 5 : 0.01 : 0.01, S 6 : 0.01): 0.01) with two short branches length = 0.0001 leading to the species A and B. The species A and B are placed in two monophyletic clades -A, S 1 , S 2 , S 3 and B, S 4 , S 5 , S 6 of the 8-taxon tree.However, when the sequence length is 100 bps, the species A and B are mistakenly grouped together in 73% of the ML trees reconstructed by RAxML (Figure 3E).The percentage increases to 96 and 83%, respectively, for the sequence length = 500 and 1,000 bps (Figure 3E).
Accordingly, if the true gene trees are given, coalescent methods are expected to estimate three (unrooted) species trees with equal probabilities , 1 3 .However, due to the SBA, the ML gene trees estimated from DNA sequences tend to group the species S 1 and S 3 across genes, resulting in a high proportion (58%, 94, 93% for the sequence length 100 bp, 500 bp, and 1,000 bp, respectively, Figure 4A) of the tree S 1 , S 3 , S 2 , S 4 .As a result, the coalescent methods estimated the wrong species tree S 1 , S 3 , S 2 , S 4 .For the 4-taxon bifurcating species tree, a high proportion of the ML gene trees support the tree S 1 , S 3 , S 2 , S 4 (Figure 4B).Again, the coalescent methods, misled by the biased ML gene trees, estimated the wrong species tree S 1 , S 3 , S 2 , S 4 .Similarly, SBA significantly biased the gene tree distribution (i.e., > 70% of the ML gene trees comprise a group (A, B)) (Figure 4C) for the 8-taxon species tree, and the coalescent methods estimated the wrong species tree containing a monophyletic group (A, B) of the species A and B.

The algorithmic bias of PhyML and RAxML
3.1.1.Scenario 1: identical sequences-A sequence of 100, 500, 1,000 nucleotides were randomly generated from the multinomial distribution with probabilities p A = p C = p G = p T = 0.25.Then, the sequence was replicated four times to generate identical sequences for species S 1 , S 2 , S 3 , and S 4 .Phylogenetic trees were estimated from four identical sequences using the phylogenetic programs PhyML and RAxML with the GTRGAMMA substitution model (Yang, 2006).Each simulation was repeated 100 times.The proportions of three bifurcating trees tree1 = S 1 , S 2 , S 3 , S 4 , tree2 = S 1 , S 3 , S 2 , S 4 , and tree3 = S 1 , S 4 , S 2 , S 3 were calculated and compared with the expected probabilities 1/3, 1/3, 1/3 using the multinomial test.
3.1.2.Scenario 2: saturated sequences-A sequence of 100, 500, 1,000 nucleotides were generated independently from the multinomial distribution with probabilities P A = 0.1, P C = 0.2, P G = 0.3, P T = 0.4 for species S 1 , S 2 , S 3 , and S 4 .Phylogenetic trees were estimated parameter was simulated from the normal distribution with mean 0.5 and variance 0.01.Phylogenetic trees were estimated by PhyML and RAxML with the substitution model GTRGAMMA substitution model.

Short branch attraction in species trees
DNA sequences were simulated from three non-clock species trees.The first species tree is a 4-taxon star tree(S 1 : 0.0001, S 2 : 0.01, S 3 : 0.0001, S 4 : 0.01) with two short branches length = 0.0001 leading to the species S1 and S3.The population size parameter in the root population is set to θ = 0.0001.The second species tree is a 4-taxon bifurcating tree (( S 1 : 0.0001, S 2 : 0.01 : 0.0001, S 3 : 0.0001): 0.0001, S 4 : 0.01) with two short internal branches length = 0.0001 and two short terminal branches length = 0.0001 leading to the species S 1 and S 3 .The population size parameters θ = 0.0001 for the root population and θ = 0.01 for the other two internal branches (i.e., ancestral populations).The third species tree is an 8-taxon tree (((A: 0.0001, ( S 1 : 0.01, S 2 : 0.01 : 0.01, S 3 : 0.02): 0.01): 0.0001, B: 0.0001):0.0001,( S 4 : 0.01, S 5 : 0.01 : 0.01, S 6 : 0.02): 0.01) with two short branches leading to the species A and B. The population size parameters θ = 0.0001 for the root population and θ = 0.01 for the other internal branches (i.e., ancestral populations) in the 8-taxon species tree.One thousand gene trees were generated from each of the three species trees under the multispecies coalescent model using an R package Phybase (Liu and Yu, 2010).Then, DNA sequences were simulated from the gene trees using the GTR + GAMMA substitution model where the base frequencies were simulated from the Dirichlet distribution 1, 1, 1, 1 , and the rate parameters were simulated from the lognormal distribution 6, 1, 1 , and the shape parameter was simulated from the normal distribution with mean 0.5 and variance 0.01.The ML gene trees were built from the simulated DNA sequences by RAxML and then were compared with the true gene trees simulated from the three species trees.Finally, the species trees were estimated from the ML gene trees using a coalescent method NJst (Liu and Yu, 2011).

Discussion
Under regularity conditions (absolute continuity, identifiability, etc.), the MLEs are statistically consistent in estimating model parameters as the sample size goes to infinity.The estimator θ of the parameter θ is said to be a biased estimator if the expected value of θ is not equal to the true parameter value θ, i.e., E θ ≠ θ.When the sample size is finite, it is frequently observed that the MLE θ MLE is a biased estimator of the model parameter θ.
Hence, it is important to investigate/explore the behavior of MLEs in the context of a finite sample size.The theory developed in this paper indicates that the presence of heterogeneous branch lengths (i.e., SBA) can introduce bias into the probability distribution of the ML gene trees, consequently leading to erroneous species tree inference.The findings align with a prior investigation (Dornburg et al., 2019) which demonstrated that the utility of a character depends on the relative rates and times of evolution of subtending lengths of the internode to be resolved and therefore heterogeneous branch lengths can introduce bias into phylogenetic inference.SBA may occur in the polytomy of trees and in bifurcating trees with two short terminal branches separated by a similarly short internal branch.Our theoretical and simulation analyses surprisingly demonstrate that short internal branches not only introduce a large amount of phylogenetic uncertainty but can also severely bias gene tree and species tree inference.The simulation for identical sequences, saturated sequences, and phylogenetic trees with polytomies suggests that widely applied phylogenetic programs PhyML and RAxML favor a particular bifurcating tree, rather than producing three equally likely bifurcating trees.PhyML appears to be more problematic than RAxML in estimating phylogenetic trees from noninformative sequences.The algorithmic artefact of PhyML and RAxML may further exacerbate the effect of SBA on gene tree and species tree estimation.
LBA is a phenomenon in molecular phylogenetics where rapidly evolving lineages tend to cluster together in a phylogenetic tree, even if they are not actually closely related.It can lead to erroneous inferences of evolutionary relationships if distantly related lineages share many rapidly evolving characters due to substitution saturation.Analytic tools have been developed to detect and avoid the LBA artifact (Su and Townsend, 2015).One way to address the problems of LBA is to use more slowly evolving characters or to use more complex models that can account for the different rates of evolution among different lineages.Another approach is to shorten long branches by increasing taxon sampling.In contrast, SBA artifacts can create an "artificial" branch length that connects distantly related taxa when their molecular sequences share numerous slowly evolving characters due to short branches.Presently, there are no effective techniques to counteract the adverse impacts of the SBA artifact on gene and species tree estimation.
We hypothesize that trees with short internal branches may lead to SBA, because the sequence data generated from such trees lack sufficient phylogenetic information to resolve the corresponding short internal branches.Likewise, there is a lack of phylogenetic information in the identical and saturated sequences, or the sequences generated from star trees, causing the algorithmic error of PhyML and RAxML.Thus, lack of phylogenetic signals is the primary cause of SBA and the algorithmic bias of phylogenetic programs PhyML and RAxML.We previously demonstrated similar pathological results involving gene tree estimation when short gene sequences with minimal phylogenetic data were involved (Xi et al., 2016).In particular, relationships were resolved artificially by taxon ordering in the data matrix leading to spurious species tree inference.Since SBA is caused by a lack of phylogenetic information, removing short loci from phylogenetic analysis can lower the bias in gene trees and species tree estimation, but a decline in the number of loci can increase the uncertainty of species tree estimation.
The multispecies coalescent model is a hierarchical model involving two stochastic processes: mutation process and coalescence.The mutation process describes how nucleotides evolve in gene trees, whereas the coalescence process describes how genealogical lineages evolve in the species tree.The algorithmic bias and systematic error due to SBA we uncovered here are rooted in the mutation process and only affect gene tree estimation.Because accurate estimation of species trees relies on accurate estimation of gene trees, biased gene trees can mislead species tree estimation.Moreover, LBA and SBA artifacts are the consequence of minimal phylogenetic signal in the sequence data, rather than deficiencies involving multispecies coalescent models.Our simulation indicates that 95% of the biased gene trees have an extremely short internal branch length < 1e − 5 .This suggests that converting these short branches to a polytomy in the estimated gene trees can likely reduce the algorithmic bias and SBA artifacts.We have developed an algorithm in the latest version of MP-EST (Liu et al., 2010) to convert short branches to a polytomy in the ML gene trees with a plan to update the species tree estimation program MP-EST such that it can take polytomy gene trees to reconstruct species trees.The 4-taxon gene trees and species trees used in the theoretical proof and simulation.(A) 4-taxon star tree: t 1 , t 2 , t 3 , t 4 are the lengths of the four terminal branches leading to the species S 1 , S 2 , S 3 , and S 4 .(B) 4-taxon bifurcating tree: t 0 is the length of the internal branch and t 1 , t 2 , t 3 , t 4 are the lengths of the four terminal branches leading to the species S 1 , S 2 , S 3 , and S 4 .(C) n-taxon bifurcating tree: the 4-taxon subtree is highlighted in red.The subtree has two internal nodes n 01 and n 02 and t 0 , t 1 , t 2 , t 3 , t 4 are the branch lengths of the subtree.
(D) 4-taxon star species tree: θ 0 is the population size parameter in the root population.(E) 4-taxon bifurcating species tree: θ 0 is the population size parameter in the root population.θ 1 and τ 1 are the population size parameter and the branch length of the ancestral population of the species S 1 and S 2 , while θ 2 and τ 2 are the population size parameter and the branch length of the ancestral population of the species S 1 , S 2 and S 3 .(F) n-taxon species tree: the 4-taxon subtree is highlighted in red.The proportions of three ML trees S 1 , S 2 , S 3 , S 4 , S 1 , S 3 , S 2 , S 4 , and S 1 , S 4 , S 2 , S 3 were calculated and compared with the expected proportions 1/3, 1/3, 1/3 if two phylogenetic programs PhyML and RAxML do not have an algorithmic bias.with two short branches length = 0.0001 leading to the species A and B. The y-axis in (A-D) is the proportion of three ML trees S 1 , S 2 , S 3 , S 4 , S 1 , S 3 , S 2 , S 4 , and S 1 , S 4 , S 2 , S 3 , while the y-axis in (E) is the proportion of ML trees with the group (A,C).The x-axis is the sequence length 100,500, and 1,000 bp FIGURE 1.
T *, is given by P T MLE ≠ T * = P xxyy + P xyxy + P xyyx .Because P xxxy + P xxyx + P xyxx + P yxxx > P xxyy + P xyxy + P xyyx (Supplementary , T P HW t 1 P HX t 2 P HY t 3 P HZ t 4If two nucleotides x n 1 and x n 2 are distinct, the probability of the site pattern & is given by ), i.e., 0 and P x n 1 = x n 2 | t 0 1.The probability of the site pattern & converges to the probability P & | T *, t 1 , t 2 , t 3 , t 4 , x n 1 = x n 2 derived from 4-taxon star trees.