Statistical Phylogenetic Tree Analysis Using Differences of Means

We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by two input alignments, instead of comparing point estimations of trees. This statistical approach can be applied to gene tree analysis for example, detecting unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses difference of means to compare two distributions of trees, after mapping trees into a vector space. Bootstrapping alignment columns can then be applied to obtain p-values. To compute distances between means, we employ a “kernel method” which speeds up distance calculations when trees are mapped in a high-dimensional feature space, e.g., splits or quartets feature space. In this pilot study, first we test our statistical method on data sets simulated under a coalescence model, to test whether two alignments are generated by congruent gene trees. We follow our simulation results with applications to data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same genome. A companion toolkit, Phylotree, is provided to facilitate computational experiments.


Introduction
Estimating differences between phylogenetic trees is one of the fundamental questions in computational biology.Conflicting phylogenies arise when, for example, different phylogenetic reconstruction methods are applied to the same data set, or even with one reconstruction method applied to multiple different genes.Gene phylogenies may be codivergent by virtue of congruence (identical trees) or insignificant incongruence.Otherwise, they may be significantly incongruent [16].All of these outcomes are fundamentally interesting.Congruence of gene trees (or subtrees) is often considered the most desirable outcome of phylogenetic analysis, because such a result indicates that all sequences in the clade are orthologs (homologs derived from the same ancestral sequence without a history of gene duplication or lateral transfer), and that discrete monophyletic clades can be unambiguously identified, perhaps supporting novel or previously described taxa.In contrast, gene trees that are incongruent are often considered problematic because the precise resolution of speciation events seems to be obscured.Thus, it would also be very useful to identify significant incongruencies in gene trees because these represent noncanonical evolutionary processes.(e.g., [4,14,15,17]).In this paper we propose a statistical hypothesis test which tells whether two phylogenetic trees are significantly incongruent to each other by comparing two distributions for phylogenetic trees, instead of comparing two point estimations.More specifically we will compare two distributions of trees using difference of means.In this paper we estimate these distribution by Bayesian sampling from the posterior distribution.Our statistical hypotheses are: H 0 : Phylogenetic trees T 1 and T 2 are congruent.H 1 : Phylogenetic trees T 1 and T 2 are incongruent.
Usually a statistical test on the above hypotheses considers point estimates of the trees obtained by a tree reconstruction method, such as maximum likelihood estimates [6,7] or the neighbor-joining method [20].See [21] and references within for an overview.Variation of reasonable tree estimates can be assessed, for example, by using the bootstrap or jackknife method.
There are several techniques to test if gene trees are codiverged.For example, the Bayesian estimation methods (e.g., [1,4,14]), the Templeton test implemented in paup * [24] (e.g., [8]), the partition-homogeneity test (PHT) also implemented with paup * (e.g., [27]), Kishino-Hasegawa test (e.g., [18]), and the likelihood ratio test (LRT; e.g., [26]) are statistical methods to see if there is a "significant" level of incongruence between the trees (these methods are also called partition likelihood support (PLS) [13]).However, there is a limitation in many methods for comparing two phylogenetic trees: It is implicitly assumed that the two given trees are actually correctly estimated phylogenies.In reality, trees are estimated from observed data (e.g.fossil record, sequence data), and tree uncertainty is the rule instead of the exception.Thus, to estimate trees, we propose using posterior means instead of maximum likelihood, and we apply the bootstrap method to assess variation in the posterior means.Our method could also be applied with tree estimators like maximum likelihood, instead of the posterior mean.
This paper is organized as follows: In Section 2, we state our method.In Section 3, we show simulation studies with data generated by the software Mesquite [Maddison Knowles 2006].In Section 4, we apply our method to well-known gopher-louse data sets from [10] and grass-endophyte data sets from [21].We end with a discussion.

Preliminaries
Let T n be the space of trees on n taxa.When analyzing and comparing phylogenies, often tree features are used.The notion of tree features can be expressed formally as a vector space embedding: Definition 1.Given a vector space embedding v : T n → R m for some m, the vector v(T ) is the feature vector of T .
The difference between trees T 1 , T 2 can be quantified as the distance ||v(T 1 ) − v(T 2 )||, where || • || is any norm.In this paper we will focus on L 2 norms.
A notable example of our framework is the dissimilarity map distance.
between leaves i and j in T .The dissimilarity map distance is where || • || represents L 2 norm (Euclidean length).
In our computational experiments, we will use the dissimilarity map distance.Dissimilarity map distance was studied in [2].One can also consider a variation where all edge lengths are set to 1.The arising dissimilarity distance is called the path difference and only depends on tree topologies.

Testing for congruence of two trees
In our framework, given are D 1 , D 2 , each a collection of n aligned homologous sequences.We assume D 1 , D 2 were generated by models of sequence evolution on unknown trees T 1 , T 2 .After embedding trees into a vector space, our statistical hypotheses are: For convenience, we describe our approach as comparing two gene trees T 1 , T 2 from the same set of species.One can also compare a phylogeny for host species and a phylogeny for corresponding parasites, as we do in section 3.2.
Random fluctuations in sequence evolution can cause reconstructed gene trees for D 1 and D 2 to look at least slightly different, even if the true underlying trees are equal.Thus we need a way to tell if the difference between two estimated trees is "significant." One classical approach to assess variability in reconstructed trees is the bootstrap [6].The bootstrap generates new hypothetical sequence alignments, by sampling (with replacement) columns of aligned sequence.Then trees can be re-estimated for each hypothetical alignment.One common application of the bootstrap is to measure support for each clade; clades that appear in most bootstrap replicate trees are regarded as likely clades in the true tree.
Here we propose a bootstrap procedure to assess significance of the distance between two trees.Our method is based on the triangle inequality.Namely, if v( T1 ), v( T2 ) are estimators for v(T 1 ), v(T 2 ), then the triangle inequality says which gives a lower bound on the distance between the true trees T 1 , T 2 .See Figure 1

Difference of means
The bootstrap procedure we have proposed can be applied with any tree estimator, such as neighbor joining or maximum likelihood.Since we are presuming tree uncertainty is high, and Bayes estimator trees are more accurate than neighbor joining or ML [12], we prefer a Bayes estimator approach.Given an alignment D, generated by sequence evolution on an unknown tree T , Bayesian MCMC sampling methods will approximately sample from the posterior distribution P (T | D) ∼ P (D | T )P (T ) [29].For two posterior distributions P (T 1 | D 1 ) and P (T 2 | D 2 ), let t 1 , . . ., t N 1 be a sample from P (T 1 | D 1 ), and similarly for t 1 , . . ., t N 2 a sample from P (T 2 | D 2 ).Then we can use 1 ) as an estimator for v(T 1 ), and similarly for v(T 2 ).The difference of means is and || ∆|| is an estimator for ||v(T 1 ) − v(T 2 )||.

The kernel trick for estimating || ∆||
Some feature space embeddings produce very high-dimensional feature vectors v(T 1 ), v(T 2 ), yet the distance ||v(T 1 ) − v(T 2 )|| can be computed quickly without explicitly writing down the feature vectors for T 1 and T 2 .
Notable examples include Robinson-Foulds distance and quartet distance.In such cases, it would be desirable if the difference of means || ∆|| could be estimated, by sampling trees and computing the distances between samples (without writing down any feature vectors).This is indeed possible, using a kernel trick: R m be four pairwise independent random variables, where x 1 and x 2 are drawn according to distribution P , and y 1 , y 2 are drawn according to distribution A proof of Proposition 1 is provided in the supplement.Using the proposition, and a subroutine which can compute ||v(t) − v(t )|| for two given tree samples t, t , the length || ∆|| = ||Ev(T 1 ) − Ev(T 2 )|| can be estimated from the samples {t i }, {t i }.

Simulations
In this section we estimate posterior distributions of phylogenetic trees via MCMC-based software MrBayes [11] and apply the difference of means method to test if two phylogenetic trees are incongruent.Note other softwares such as BEAST [3] could also be used.Simulated data sets were generating using the software Mesquite [17] with parameters chosen similar to [17], to emulate real data and test the effectiveness of our method.Mesquite takes two parameters; the species depth in terms of number of generations and the population size in terms of number of individuals.Three simulation sets were generated, determined by the species depths of 100000, 600000, and 1000000.The effective population size was fixed to 100000 for all data sets.For each simulation set, two species trees, species tree one and two, with eight species were generated using the pure birth yule process in Mesquite.Sequence alignments were generated by Mesquite under HKY85 model with transition-transversion ratio of 3.0, a discrete gamma distribution with four categories and shape parameters 0.8.In all our simulations, we set the stationary probability distribution π = (0.3, 0.2, 0.2, 0.3) for A, C, G, T respectively, the 3:2 AT:GC ratio was maintained through all trees, and our sequences were generated with 1000 base pairs.The coalescence gene trees generated had branch lengths in terms of the coalescence model and therefore a scaling factor of 10 −8 was used to yield sequences with sequence divergence similar to real data.Table 1 shows sequence divergences.The sequence divergence was calculated in two ways: (1) the average percent pairwise difference between all sequences [17], and (2) the minimum of the pairwise percent differences among sequences [9].1: Q1 means the first quantile and Q3 means the third quantile.By "min" we mean the smallest number and "max" means the largest number among a sample.Sequence divergences were calculated in two ways: 1) the pairwise minimum percentage of sequence divergence and 2) the average pairwise percentage of sequence divergence.between the two species trees used to generate gene trees for our simulations are 0.4333 for 1, 000, 000 species depth, 0.2672 for 600, 000 species depth and 0.046 for 100, 000 species depth.
We generated simulated data sets in three different ways; (1) two separate sequence data sets generated from the same gene tree, (2) sequence data sets generated from two different gene trees under the same species tree, (3) sequence data sets generated by two sequence data sets generated from two different gene trees whose species trees are also different.We tested ten gene trees for each species depth (i.e. 30 different gene trees in total) generated under the same species tree.One can find the species trees we used in Figure 2. We used two sets of sequences generated under the HKY model with the same tree for each test.We have the three species depths of 1000,000, 600,000 and 100,000, with fixed population size of 100,000.Notice that we do not observe (a) Box plots of p-values for simulated data.We generated ten gene trees for each species depth (i.e. 30 different gene trees in total) for each species tree.Sequences for each gene tree were generated using the HKY model.Species depths of 1, 000, 000, 600, 000 and 100, 000 were used, with fixed population size of 100, 000.White box plots are for testing Type I error by comparing identical gene trees generated by Mesquite (we used the species tree on the right in Figure 2), light grey box plots represent the p-values with two different gene trees generated from the same species tree under the coalescence model (we used the species tree on right in Figure 2  any Type I errors with our testing method, however, in within-species comparisons at species depth of 1,000,000 the p-values were high in general (Figure 3(a)).Also notice that with pairs of gene trees where each pair of gene trees are generated from different species trees under the coalescence model, the p-values were 0.000 for all pairs of genes from 1,000,000 and 600,000 species depth.However, in the case of species depth 100,000 we see that only one pair (Species1 Genetree0 / Species2 Genetree7) has a p-value less than 0.05 (see Table 5 in the supplement).We appear to have Type II errors, probably because species trees with 100,000 species depth are very close to each other.
p-values and distance between true trees appear strongly correlated.We fitted correlations between p-values and distance between true trees as well as correlation between p-values and the difference of means for the posterior distributions given the original sequence data sets, using a function called loess (Figure 3(b)).The fitted lines show negative correlation between the p-values and the distance between true trees and also negative correlation between the p-values and the difference of means.Note that the fitted lines for distances between true trees and for differences of means in Figure 3(b) any p-values below the α-level (0.05 in our case) are within their confidence intervals.Actually they are within their confidence intervals up to the p-value equals to 0.3.This means the differences of means with posterior distributions given the original sequence data sets are good measurements for distance between true trees for our statistical tests.This is particularly important since we usually do not know the true trees with biological data sets.For complete results of our simulations see Table 4 and Table 5 in the supplement.
The posterior distributions were estimated using MrBayes with the following parameters: (1) for the model: GTR + Gamma + Invariant sites; (2) for MCMC: number of runs: 1, number of chains: 2, chain length: 100, 000, sample frequency: 1, 000, burn-in: 25%; and (3) for bootstrap sampling: 100 bootstrap samples with sample size of 379 columns which is the length of sequence alignments in the data sets.
We also tested our Method with the data sets from [21].After removing cases of apparent host jumps, the data sets contain sequences from 20 taxa of grasses and 20 taxa of endophytes.Sequences were aligned with the aid of PILEUP implemented in SEQWeb Version 1.1 with Wisconsin Package Version 10 (Genetics Computer Group, Madison, Wisconsin).PILEUP parameters were adjusted empirically; a gap penalty of two and a gap extension penalty of zero resulted in reasonable alignment of intron-exon junctions and intron regions of endophyte sequences, and of intergenic spacer and intron regions of cpDNA sequences.Alignments were scrutinized and adjusted by eye, using tRNA or protein coding regions as anchor points.For phylogenetic analysis of the symbionts, sequences from tubB (encoding β-tubulin) and tefA (encoding translation elongation factor 1-α) were concatenated to create a single, contiguous sequence of approximately 1400 bp for each endophyte, of which 357 bp was exon sequence and the remainder was intron sequence.For phylogenetic analysis of the hosts, sequences for both cpDNA intergenic regions (trnT-trnL and trnL-trnF) and the trnL intron were aligned individually then concatenated to give a combined alignment of approximately 2200 bp.Analysis was also performed using the sequences from tubB and tefA separately.

Data set p-value
Grass-endophyte tefA 0.040 Grass-endophyte tubB 0.080 Grass-endophyte tubB plus tefA 0.000 (b) p-values for grass-endophyte data sets from [21].After removing cases of apparent host jumps, the data comprises 20 taxa of grasses and 20 taxa of endophytes.The first two rows compare grass phylogeny to gene trees for tefA and tubB in endophytes; the last row uses the concatenation of tefA and tubB.The posterior distributions were estimated using MrBayes with the following parameters: (1) for the model: GTR + Gamma + Invariant sites; (2) for MCMC: number of runs: 1, number of chains: 2, chain length: 100, 000, sample frequency: 1, 000, burn-in: 25%; and (3) for bootstrap sampling: 100 bootstrap samples, number of bootstrap columns equals length of original alignment.
These results are interesting in comparison with the prior finding of significant relationship between the phylogenies of the grasses and their endophytes [21].The previous analysis indicated a significant relationship between ages of corresponding nodes in endophyte and grass phylogenies, addressing whether divergences of grass and endophyte clades tended to occur at approximately the same time.In contrast, results of the analysis above suggest that the grass and endophyte phylogenies are significantly different (Table 2(b)).We conclude that such a relationship of node ages does not necessarily imply similar phylogenetic histories.This is reasonable because the relationships of grasses and their endophytes is expected to be one of diffuse cospeciation at best.Individual species of endophyte may be associated with genera or tribes of grasses, but rarely with individual species.This contrasts with the gopher-gopher louse situation, where evidence suggests a much stricter coevolutionary relationship (Table 2(a)).
We chose an additional biological data set to compare phylogenies of genes that occur together in endophyte genomes.Whereas tefA and tubB are housekeeping genes present in all isolates, lolC is a secondary metabolism gene sporadically present in endophyte isolates [22].It has been suggested that such sporadically occurring secondary metabolism genes may be distributed in fungi largely by horizontal gene transfer [28].To investigate this possibility in the case of lolC, we used our approach to test whether the phylogenies of these three genes were significantly different.The most likely trees obtained by MCMC showed related but nonidentical topologies (Figure 4; note placement of genes from Epichloe festucae and Epichloe brachyelytri).Our test found no significant difference between the phylogenies, although the p-values appear stochastically smaller than the p-values observed for simulated data under the null.This perhaps reflects the conservative nature of our test.Removing either Epichloe festucae or Epichloe brachyelytri) altered the results only slightly (Table 3(b)).These results indicate that lolC evolution was largely or exclusively by decent, and disfavored horizontal transfer as an explanation for the sporadic distribution of this gene.

Toolkit for Computational Experiments
To facilitate computations for our experiments, we developed a set of programs, collectively called Phylotree.Phylotree is organized as a collection of scripts for running a complete computational experiment starting from sequence alignments, then sampling phylogenetic trees and computing distances between phylogenetic trees and their distributions (see Section 2).Supported distance measures include path-difference, dissimilarity map distance, Robinson-Foulds distance.Available scripts allow for selecting the number of columns and the number of bootstrap samples, linking taxa in the alignments and provide flexibility for using different sampling methods (e.g., MrBayes or BEAST) and distance measures.This is free software, and will be distributed under the terms of the GNU General Public License.One can download the software at http://csurs7.csr.uky.edu/phylotree/.The website is password protected and login information can be obtained at http://cophylogeny.net/research.php.

Discussion
In this paper we presented a method to determine if two phylogenetic trees with given alignments are significantly incongruent.Our method computes the difference of means of posterior distributions of trees, which has the advantage of using entire tree distributions, as opposed to single tree estimators.
In this paper we used the triangle inequality (d 1 < d 2 + d 3 in Figure 1) to derive a bootstrap procedure to compute p-values.However, our bootstrap procedure appears to be very conservative, producing p-values whose null distribution is stochastically much larger than uniform U (0, 1).Thus in order to increase the power we might want to consider different criteria for computing p-values.One approach may be to define v( T1 ), v( T2 ) to be the average of bootstraps {v(T * 1 )}, {v(T * 2 )}, rather than the initial tree estimates.Another possibility is to replace the triangle inequality with a max condition (e.g. in Figure 1 use the condition d 1 < max(d 2 , d 3 )).We explored this in the supplementary material, and it seems that the max condition provides much more power, but is somewhat anti-conservative.
In this paper we used the dissimilarity map as feature space.However, there other common tree features which can be used to define different feature spaces.Examples of distances derived from tree features include (normalized) Robinson-Foulds distance [19]; quartet distance [5]; and the path difference metric [23].Of course, in all the above examples, we could choose any vector space norm, such as L p for any p.The important point is that there are many different useful features (i.e., choices of vector space embeddings) which can be used to analyze trees, and many such as splits and quartets have already been used for quite some time.Moreover, with the kernel trick presented above we can efficiently calculate distances between distributions of trees using the Robinson-Foulds and quartet distance.Thus it is interesting to use different feature spaces for our statistical method, and we leave this for future work.

But
By dividing the both sides of the equation in ( 4) by 2 we have: Similarly we have Then we substitute E(x T 1 x 1 ) and E(y T 1 y 1 ) in equations ( 5) and ( 6) into the equation in (3), we have (7)  We tested ten gene trees for each species depth (i.e. 30 different gene trees in total) generated under the same species tree.We used two sets of sequences generated under the HKY model with the same tree for each test.The numbers in the table are p-value for our hypothesis testing.We give p-values for the three species depths of 1000,000, 600,000 and 100,000, with fixed population size of 100,000.We used the species tree on right with each species depth in

Figure 1 :
Figure 1: A diagram showing two cases of the differences of means method.In Subfigure 1(a) the distance d 1 between v(T 1 ) and v(T 2 ) is greater than the distance d 2 between v(T 1 ) and v( T 1 ) plus the distance d 3 between v(T 2 ) and v( T 2 ), i.e., d 1 ≥ d 2 + d 3 .In Subfigure 1(b) we see d 1 ≤ d 2 + d 3 .

Figure 2 :
Figure 2: The three pairs of species trees used in our simulations.The dissimilarity maps normalized by ), and dark grey box plots summarize the p-values with gene trees generated from two different species trees.Some boxes of p-values are identically zero.Box plots were computed in R[25] using boxplot (b) These plots represent correlation between distances among true trees and p-values, and correlation between difference of means and p-values.These are computed with the same data sets in Figure3(a).The sample size here is 94 (i.e., there are 94 pairs of sequence data sets we tested in total).For more details see supplement.We fitted the data in R [25] using loess to perform local regression.The dotted lines are for 95% confidence intervals of the fitted lines.The vertical solid line is the line x = 0.05 which represents the α-level, p value = 0.05.

Figure 4 :
Figure 4: Trees with maximum likelihood identified by MCMC search on aligned intron sequences from the lolC, tefA, and tubB genes of Epichloe and Neotyphodium species.Some Neotyphodium species are interspecific hybrids that have multiple genomes from different ancestors.The genes from the different genomes are distinguished, for example, as lolC1 and lolC2.The same number labeling leaves on the three trees indicates genes from the same genome from the same fungal isolate.

Figure 2 .
dist represents the average of the normalized difference of means with posterior distributions given the original sequence data sets.

Table 4 :
Testing Type I error by comparing the identical gene trees generated by Mesquite.