Combinatorial and Computational Investigations of Neighbor-Joining Bias

The Neighbor-Joining algorithm is a popular distance-based phylogenetic method that computes a tree metric from a dissimilarity map arising from biological data. Realizing dissimilarity maps as points in Euclidean space, the algorithm partitions the input space into polyhedral regions indexed by the combinatorial type of the trees returned. A full combinatorial description of these regions has not been found yet; different sequences of Neighbor-Joining agglomeration events can produce the same combinatorial tree, therefore associating multiple geometric regions to the same algorithmic output. We resolve this confusion by defining agglomeration orders on trees, leading to a bijection between distinct regions of the output space and weighted Motzkin paths. As a result, we give a formula for the number of polyhedral regions depending only on the number of taxa. We conclude with a computational comparison between these polyhedral regions, to unveil biases introduced in any implementation of the algorithm.


Introduction
The Neighbor-Joining (NJ) algorithm [23] is a polynomial-time phylogenetic tree construction method.It is agglomerative, so it constructs ancestral relationships between taxa by clustering the most closely related taxa at each step until a complete phylogeny is formed.The performance of the NJ algorithm has been studied from multiple mathematical and biological perspectives [4,12,13,15].Moreover, some statistical conditions have been given to guarantee a good performance of the algorithm for different types of biological data [16,21].Due to its speed and theoretical performance guarantees, NJ remains a popular tool in the design of phylogenetic pipelines for large and complex data sets [18,29,19].
The NJ algorithm takes as input a dissimilarity map D between n taxa and returns an unrooted binary tree with edge weights forming a tree metric.One established paradigm for evaluating the performance of the NJ algorithm is to consider it as a heuristic for either the Least Squares Phylogeny (LSP) problem, which is NP-complete [8], or the Balanced Minimum Evolution (BME) problem, which is a special case of a weighted LSP approach [9].The LSP problem asks for the tree metric T that minimizes the function i,j∈( [n]   2 ) This perspective gives rise to the problem of determining whether a given heuristic for LSP or BME is biased, favoring certain phylogenies over others in its output.Addressing this problem requires a clear accounting of the possible outputs of the NJ algorithm.
Dissimilarity maps are symmetric matrices that can be realized as points in a Euclidean space.In this geometric setting, the space of all tree metrics form a polyhedral fan [13], which is a union of polyhedral cones meeting along common faces.The geometry of the space of tree metrics is well-understood [24,27].Phylogenetic inference methods that take a dissimilarity map as an input divide the Euclidean space into a family of subsets indexed by the combinatorial type of the trees returned.For the NJ algorithm, these subsets are also polyhedral regions, as its decision criteria are linear inequalities on the input data.
Comparing these polyhedral cones had lead to new computational methods to evaluate the performance of the NJ algorithm as a heuristic for LSP in [7] and of NJ as a heuristic for BME in [12].These methods are based on measuring the spherical volume of the cones, which is the volume of their intersection with the unit sphere.The spherical volume of polyhedral cones had unveiled unexpected bias in the UPGMA method, an older agglomerative phylogenetic method that has poorer performance guarantees in comparison to NJ [6].
The contribution of this article is that we give a formula that enumerates the polyhedral cones in the partition returned by NJ algorithm, by giving a bijection between the cones and weighted Motzkin paths.We also explore the inherent bias in the algorithm by computing the spherical fraction of the cones and noting significant variation between some cones, similar to those found in [6] for the UPGMA algorithm.
In Section 2 we give a in-depth description of NJ and provide related definitions necessary to describe distinct NJ outputs.In Section 3 we give a bijection between weighted Motzkin paths and labeled Newick strings to describe the outputs combinatorially.Lastly, in Section 4 we estimate the spherical fraction of the NJ cones and unveil a bias.We propose two ways to correct the bias, and display the performance of these corrections via computational experiments for small numbers of taxa.

The Neighbor-Joining Algorithm
A graph is a pair of sets {V, E} called vertices and edges, with the latter representing relations between pairs of vertices.The degree of a vertex is the number of edges adjacent to it.A path in a graph is a sequence of edges joining a sequence of vertices.If the sequence starts and ends at the same vertex, the path is called a cycle.A graph without cycles is a tree.The vertices of a tree are usually called nodes, except for those of degree one which are called leaves.In trees, there is a unique path between any two vertices.A tree is rooted if there is a distinguished vertex ρ called the root.If T is a rooted tree, there is a partial order on the vertices of T induced by the root ρ and the unique path between vertices, such that ρ ≤ v for all vertices v of T .A binary tree is a tree where all vertices have degree three except for the leaves and the root, if a root exists.We will restrict our attention to unrooted binary trees.An unrooted binary tree with n leaves has n−2 nodes and 2n−3 edges.We also consider the star tree, which is a non-binary unrooted tree with exactly one node that is adjacent to all leaves.We illustrate a star tree in the left image of Figure 1.The NJ algorithm takes a dissimilarity map as input and returns an unrooted binary tree with a tree metric.It is useful to view the progress of the NJ algorithm as a series of graph transformations.The algorithm stars from a star tree t n , a graph with n leaves and only one node O adjacent to all leaves, as illustrated on the left of Figure 1.Throughout the algorithm, the node O plays an important role; thus, we give special names to the vertices adjacent to O. Definition 2.1.We call boughs to all vertices adjacent to O. Vertices that are boughs can be either leaves or internal nodes, so we refer to them as stems and bouquets respectively.
In the recursive step, the algorithm takes a tree t k with k boughs, it selects a pair of them and joins them by adding a new node adjacent to O in a way that the resulting tree The trees corresponding to the first and second step in the NJ algorithm boughs.The algorithm iterates this step until there are only three boughs.Figure 1 illustrates the first two steps in the NJ algorithm, starting from the star t 7 and the next graph constructed by the NJ algorithm corresponding to the tree t 6 , obtained from the star by joining the stems a, b by introducing a bouquet u.The subgraph consisting of only two adjacent leaves is sometimes called a cherry, and the recursive step of the NJ algorithm is then referred as cherry picking step [13].This term served as inspiration for the names in Definition 2.1, as it resembles the way the NJ algorithm creates trees.
The algorithm selects a pair of boughs a, b to agglomerate based on the Q-criterion, which is a function given by where X is the set of boughs, with k = |X|.The NJ algorithm selects the vertices a, b where Q is minimal.It has been shown that this criterion uniquely determines the NJ algorithm [4], and that a pair of leaves minimizing Q come from a cherry in the true tree [23,28].To agglomerate the selected nodes a, b, the NJ algorithm introduces a new bouquet u adjacent to them, resulting in a tree with one bough less.It then estimates the distance in this new tree from the remaining vertices to u via the reduction formula We sometimes use D k and Q k to denote the dissimilarity map and the Q-criterion associated to the tree t k with k boughs.Thus, the last step of the algorithm is decided by a pair of nodes a, b that minimize Q 4 .However, in this last step there is more than one minimizing pair, as we demonstrate in the following lemma.
Lemma 2.2.For n = 4, the matrix Q 4 always achieves its minimum in exactly two entries.
Proof.Let {a, b, c, d} be the four vertices corresponding to t 4 and suppose, without loss of generality, that the minimum in Therefore, we get that Q(a, b) = Q(c, d), meaning that the minimum is achieved twice.
Remark 2.3.The previous lemma is similar to the four-point condition but not the same, as Lemma 2.2 holds even when the dissimilarity map D is not additive.
Remark 2.4.Note that if n > 4, the minimum in Q n could be achieved in more than one entry too, but that happens in a very small dimensional space in R ( n 2 ) , thus it occurs in a measure zero set.Only in the case n = 4 it happens always.2.2.Newick notation.We represent trees with the Newick notation.This is one of the most widely used notations in bioinformatics to encode information about the tree topology, branch distances, and vertex labels.It consist of parentheses that represent tree data as textual strings (see [30,14]).A pair of vertices enclosed within matching parentheses indicates they have a common ancestor.There are slightly different formats representing the Newick notation, so we explain the one we use with an example.Let t be the tree in Figure 2, with leaves labeled by the set {a, b, c, d, e, f }.Newick notation for t can be written as ((c, (a, b)), (e, f ), d), ed, (c, (a, b))), e, f ), or in many other ways.This non-uniqueness of Newick notation can be inconvenient, as one must determine whether two strings represent the same tree or not.We let the length of a string in Newick notation to be the maximum number of parentheses of one orientation contained in the string (to the right or the left), and its size to be largest amount of taxa contained in a parenthesis.For instance, the two strings above have length 3, yet the size of the former is 3 whereas for the latter is 4.
We now explain the Newick notation in the NJ algorithm.Let [n] indicate the leaves of a tree.The starting point is the start tree t n , which is indicated with the string (1, 2, . . ., n).Then, the algorithm selects two vertices and we join them by enclosing them with a parenthesis, which we append at the left to the remaining string.For instance, if 1 and 2 are the selected vertices, we would write ((1, 2), 3, . . ., n) to indicate the tree obtained in the second step in the NJ algorithm.This is illustrated in the tree to the right of Figure 1, by letting a = 1 and b = 2.In the recursive step, NJ will take a string s k of length k and writes a string of length k−1 by joining two elements of s k with a new parenthesis attached to the left.In this way, in the string ((c, (a, b)), (e, f ), d) we know that the last step was to join c with the pair (a, b), but we do not know if the previous string was ((a, b), (e, f ), c, d) or ((e, f ), (a, b), c, d).A parenthesis in the Newick string indicates a node in the tree, so we could remove this ambiguity by labeling the nodes when we introduce them in the algorithm.We label these nodes with a circled number written to the left of a parenthesis, indicating the step when the node was created in the algorithm.For instance, we could write ( 3 (c, 1 (a, b)), 2 (e, f ), d) to indicate that the first step in the NJ algorithm was to join (a, b), then (e, f ), and lastly c with (a, b).We refer hereafter to this notation as ordered Newick notation.Hence, we have a bijection of the trees that output the NJ algorithm with these ordered Newick strings.
Algorithm 1: The Neighbor-Joining Algorithm Input : A dissimilarity map D n on the set {1, . . ., n}.Output: An unrooted binary tree T with leaf labels {1, . . ., n} written in ordered Newick notation.Initialize k = n, r = 0, and S k = (1, . . ., k) representing the Newick format for t k , the star tree on k leaves with boughs B k .while k > 3 do (1) Identify boughs {a, b} of t k minimizing ( for all remaining boughs in t k (5) Increase r by one and decrease k by 1. return Label O in t 3 with zero and return t 3 = T end

Combinatorial description of NJ
To a given data matrix D, the NJ algorithm associates a binary tree with n leaves, together with a tree metric.Without the distinction of ordered Newick notation, it can appear that the algorithm associates the same tree to different data.For instance, let us consider the following matrices: .
The NJ algorithm associates the same unrooted binary tree to both of them.However, their Newick notation is not the same.For D, the corresponding tree given by NJ is ( (d, (a, b the same mathematically, they have a subtle difference that one may take into account when interpreting the biological meaning of the output of the NJ algorithm.The tree ((d, (a, b)), c, e) corresponding to D indicates that a, b where joined together and that c and e were never joined to anything, whereas the tree ((d, (c, e)), a, b) for D indicates the opposite, that a, b were never joined to anything but c and d were joined to each other.To distinguish these data and relate them to ordered Newick notation, we endow binary trees with something we called agglomeration order and we explain it next.
3.1.Binary trees with agglomeration order.The expected number of trees that arise as an output depends on the shape (topology) and the labels of the leaves.For unrooted trees with n taxa, there are (2(n−2)−1)!! labeled binary trees.Thus, for n = 4 there is only one tree shape and three ways to label the leaves.1. Number of trees for small number of taxa.
We will give a combinatorial formula to compute the expected number of output trees from the algorithm.For this, we set the combinatorial definitions we will require.Definition 3.1.For a binary tree with n leaves, an agglomeration order means labeling the n−2 internal nodes with the set {0, 1, 2, . . ., n−3}, such that the labels of the internal nodes in every path from 0 to a leaf form a decreasing sequence.
We use circled numbers to indicate the assignment of the agglomeration order, and to simplify the schematics, we omit0 the label 0, so it is easier to verify that all sequences of internal vertices starting there and terminating at a leaf are decreasing.In Figure 4 we illustrate an agglomeration order for the tree t with 6 taxa from Figure 2. There, if we exchange 3 with 1 we get a label of the internal nodes that is not an agglomeration order.In Newick notation, the tree in Figure 4 would be ( 3 (c, 1 (a, b)), 2 (e, f ), d).
Theorem 3.2.The output space of the NJ algorithm is in bijection with the set of agglomeration orders on unrooted binary trees.
Proof.The NJ algorithm starts with the star tree t n consisting of n leaves and just one internal node O. From there, at each step, the algorithm applies a series of graph transformations to a given tree, preserving the number of leaves, while increasing the number of nodes by one.This new node is adjacent to O. Thus, at each step the new node is closer (combinatorially, not metrically) to O than all other nodes in a path to a leaf.Numbering each node with the moment they appear in the algorithm, starting with 0 for the node O, results in an agglomeration order.As this steps are reversible, each agglomeration order can arise from the algorithm, giving the bijection with the output space.
Therefore, understanding the NJ algorithm leads to understanding orders of agglomeration for binary trees.To simplify the rest of the exposition, we give the following definitions.Definition 3.3.We call an agglomerated tree to refer to an unrooted binary tree endowed with an agglomeration order.The number of agglomerated trees with n leaves will be denoted by Φ(n).Lastly, we let ϕ(n) denote the number of all possible agglomeration orders on unrooted binary trees with n leaves.
Computing the number Φ(n) is important to understand more about the combinatorial complexity of the NJ algorithm.For instance, Φ(n) counts the number of polyhedral cones NJ defines.We tried to determine Φ(n) by estimating first the number ϕ(n) of agglomeration orders, knowing that the number of unrooted binary trees is given by the combinatorial formula (2n−5)!!.The last column of Table 1 is precisely the value of Φ(n) for n = 4, . . ., 8. For n = 4 and 5, there is only one tree topology and one can list all agglomeration orders they can have, so we get ϕ(4) = 2 and ϕ(5) = 4.For these cases, it holds true that Φ(4) = ϕ(4) • 3!! and Φ(5) = ϕ(5) • 5!!.However, as we can see in Table 1, this is no longer the case for other values of n, as Φ(n) is not always divisible by (2n−5)!!.This suggests that to determine the number ϕ(n) one needs to consider the topology of the tree in a non-trivial way.Nonetheless, we were able to give formulas for the numbers ϕ(n) and Φ(n) using Motzkin paths.However, it remains open to understand more about the connection between unrooted binary trees and agglomeration orders.Problem 3.4.Determine the number of agglomeration orders that can be assigned to a given unrooted binary tree.

Motzkin paths.
Motzkin paths are combinatorial structures appearing in many contexts.They are counted by Motzkin numbers, which are related to Catalan numbers [1,2,22,20].Note that while Catalan numbers are known to count combinatorial objects referred to as planar rooted trees [10], the trees in this paper are fundamentally different objects.
A Motzkin path is an integer lattice path starting and ending in the horizontal axis without crossing it, consisting of up steps u = (1, 1), down steps d = (1, −1), and horizontal steps h = (1, 0).A Motzkin path with no horizontal steps is a Dyck path.The number of Dyck paths from (0, 0) to (2N, 0) is given by the Catalan number C N , whereas the number of Motzkin paths from (0, 0) to (0, N ), is given by the Motzkin number M N .
Our main result here is to give a bijection between Motzkin paths and agglomerated trees.This bijection allows the derivation of a formula to determine the number Φ(n) that counts all the possible outputs for the NJ algorithm.The key is to focus on the recursive step of the algorithm.
Remark 3.5.Let O be the unique node of the star tree t n .In the recursive step, the NJ algorithm takes a tree t k with stems and r bouquets, such that k = + r.From there, it constructs the graph t k−1 by adding an internal node in three possible ways: • It merges two stems.We call this step α.
• It merges two bouquets.We call this step β.
• It merges a stem to a bouquet.We call this step γ.We illustrate this three steps in the following Figure.From this remark, we see that we can summarize a given tree t with a distinguished node O by its bough vector ( , r), where and r are the number of stems and bouquets in t respectively.For instance, the tree t in Figure 7 has three nodes and nine leaves, but just four stems and two bouquets.Thus, the bough vector (4, 2) summarizes t.The bough vector for the star with n leaves is (n, 0).Let t k be a tree from the NJ algorithm, and let ( , r) be its bough vector, so k = +r.Note that the tree obtained from t k after an α step is summarized by the bough vector (l, r) + (−2, 1).Similarly, after a step β or γ, the bough vector of the tree is (l, r) + (0, −1), or (l, r) + (−1, 0) respectively.Note that the NJ algorithm ends with a tree T 3 consisting only of three boughs.Thus, for n ≥ 4, the possible bough vectors for the last step are (2, 1), (1, 2), or (0, 3).Note that the first step of the algorithm is forced to be always an α step, thus we can omit it and analyze the rest of the steps, starting at (n−2, 1).Definition 3.6.For n ≥ 4 we define an NJ path of length n−4 as an integer lattice path consisting of steps α = (−2, 1), β = (0, −1), and γ = (−1, 0), starting at (n−2, 1) and ending at one of (2, 1), (1, 2), or (0, 3), without crossing the horizontal axis y = 1.Theorem 3.7.For every n ≥ 4, there is a bijection between NJ paths of lenght n−4 and Motzkin paths of length n−4 from (1, 1) to either (n−3, 1), (n−3, 2), or (n−3, 3).
Motzkin paths starting at the origin and ending at ( , r) are called partial Motzkin paths.Thus, after translating (1, 1) → (0, 0), we could write Theorem 3.7 in terms of partial Motzkin paths.As a consequence, we translate Theorem 3.2 into the following.For 6 taxa, all possible NJ paths and their corresponding Motzkin paths are depicted in Figure 8. NJ paths there start at (6, 0) corresponding to the start tree t 6 , and the first step is always an α step.From there, one chooses from the three steps {α, β, γ} consecutively until reaching one of the points (0, 3), (1, 2), or (2, 1).In this case, there are only 5 paths, each corresponding to an agglomeration order in Figure 9.For instance, the path formed by the sequence ααγ (read from left to right) corresponds to the agglomeration order of the tree in Figure 4, denoted in Newick format by ( 3 (c, 1 (a, b)), 2 (e, f ), d).

3.3.
The number of agglomerated trees.The results of the previous discussion reduce the counting of the number of trees obtained from the NJ algorithm to enumerating some partial Motzkin paths, which are counted by Motzkin numbers.Hence, we let M k be the number of partial Motzkin paths of length k.More specifically, we let M k,j be the number of partial Motzkin paths on length k that end at level j.It is known (see [22]) that the Motzkin numbers M k satisfy the following formulas involving Catalan numbers: While the numbers M k,j satisfy the following formula [3, Theorem 10.8.1] .
These numbers M k,j construct the Motzkin triangle that enumerates all partial Motzkin paths [17], which is shown in Figure 10.The triangle is defined recursively by The numbers in the triangle form the sequence A026300 in [25], and the first and second rows that for all NJ paths, there is only one way to reach the point (0, 2) from the ending points (0, 3), (1, 2), and (2, 1).This holds in general due the recursion (3.4).Thus, counting the number of NJ paths that end in one of these three points is equivalent to counting all NJ paths ending at (0, 2), or equivalently, all partial Motzkin paths ending at (n−3, 1).In this way, we conclude that the number ϕ(n) is given by the Motzkin number M n−3,1 , and equation (3.3) gives a formula for them.We summarize these observations in the following theorem.
Theorem 3.9.The number ϕ(n) of agglomeration orders on unrooted binary trees with n leaves equals the Motzkin number M n−3,1 .Thus, it can be written as This gives a new combinatorial interpretation of the series A002026 in [25] as the number of binary trees with an agglomeration order.In order to find a formula for Φ(n), the output size of the NJ algorithm, we consider weighted partial Motzkin paths which are partial Motzkin paths with weight assignments of non-negative numbers {a k,j , b k,j , c k,j } to the steps {u, d, h}.We interpret the weights as the multiplicity of the arrow, or equivalently, as the number of arrows in the given direction.In Figure 11, we show the weight assignment to the arrows of the Motzkin triangle.Triangles of this kind are called Motzkin triangles with multiplicities [17].
Let t k be a tree in the NJ 4algorithm with interior vector ( , r), such that k = + r.If the next step in the NJ algorithm is a step α, one needs to choose two of the stems to join.There are 2 ways to do this.Thus, translating NJ paths to partial Motzkin paths by Corollary 3.8, we need to assign weights in the following way: Proof.From definition, equation (3.6) can be written as Lemma 3.11.Let n ≥ 4, and for 2 ≤ k ≤ n − 2, we have with the sum in the right hand side ending at Proof.The recursion (3.5) for weighted Motzkin paths writes the left hand side of the sum as Notice that, by Lemma 3.10, we have 2 , and by definition, Hence, the sum in the right hand side equals We conclude the proof by noting that, when using the recursion (3.5) to write a sum of Motzkin paths of length i+1 in terms of those of length i, the maximum amount of summands that can appear in the sum is attained when i = n−2 2 .Hence, we obtain the following formula for the number Φ(n).Theorem 3.12.For n ≥ 4, let Φ(n) be the number of labeled agglomerated trees with n leaves.Then, in the weighted version of the partial Motzkin paths.From Lemma 3.11 we obtain We can use Lemma 3.11 again to compute the sum in the right hand side.Continuing recursively we see that Defining M 0,0 = n 2 we obtain the result.The last equality in the theorem comes from writing n 2 as n(n−1) 2 and expanding the product of binomial coefficients.
We conclude this section by noticing that this formula for Φ(n) produces the sequence of the last column of Table 1.Moreover, the sequence formed by our formula in Theorem 3.12 does not appear in [25]; however, it can be obtained from the product of consecutive binomial coefficients.This product forms the sequence A006472, which counts the number of ranked trees.Thus, our sequence is obtained from A006472 starting from n = 4 by dividing each element by 3.This suggest a connection between agglomerated trees and ranked trees [11].

Estimated volumes
The Neighbor-Joining and UPGMA algorithms take dissimilarity maps as inputs and return trees with an additive dissimilarity map.The decisions in both algorithms are based on linear inequalities that divide R ( n 2 ) into half-spaces.To elucidate, label the coordinates of R ( n 2 ) with the 2-element subsets of [n] = {1, 2, . . ., n}, so that the symmetric matrix D n is a point in R ( n 2 ) .Recall, for instance, the example in the beginning of Section 3. The two matrices D and D from (3.1) are both in correspondence with the tree in Figure 3.For the matrix D, the tree returned by the NJ algorithm is ((d, (a, b)), c, e) in Newick format.This means that the first cherry formed by the algorithm was (a, b).For this to hold true, the Q-criterion of the matrix D must satisfy the following 9 inequalities: Since each entry Q(i, j) is a linear equation on the entries of D, these equations form semialgebraic sets in R 15 .All input matrices D 5 lying in the intersection of the 9 half-spaces listed in (4.1) will return the same tree with the same agglomeration order.In general, these halfspaces form a polyhedral cone in R ( n 2 ) , and every input that satisfies them will return same agglomerated tree.Thus, we identify this cone with this labeled agglomerated tree.For D the NJ algorithm writes the tree as ((d, (c, e)), a, b).Thus, the 9 inequalities for D are not the same as those for D. Therefore, the trees lie on different cones, even though they are the same topologically.
Eickmeyer et.al.[12,13] studied the defining inequalities of the NJ cones for small taxa.In other words, the cones were described using hyperplanes, also known as an H-representation.To give a full combinatorial description, the vectors for the extreme rays of the cones would be required-a V -representation.In contrast, for the UPGMA algorithm [26], a V -representation for the corresponding polyhedral cones was given in [6].
Remark 4.1.Ideally, a phylogenetic method would depend only on the quality of the data it receives as input, so we would expect it to return each tree with equal probability.This would imply that the algorithm has no bias towards a specific type of tree.
One could estimate this probability by measuring the spherical volume of the polyhedral cones, which is the proportional volume when intersecting the cones with the unit sphere in R ( n 2 ) .A bias in the algorithm towards some trees can be detected when the spherical volume is not roughly the same for each tree.Computational methods based in this paradigm are used to evaluate the performance of NJ as a heuristic for LSP in [7] and of NJ as a heuristic for BME in [12].Davidson and Sullivant [6] approximated the spherical volume of the UPGMA cones finding cones with considerable less volume than others, unveiling a bias of the algorithm towards balanced trees (rooted trees with two children of the root defining subtrees with a similar amount of leaves).
We studied the approximated spherical volume of the NJ cones, by analyzing how simulated data lies in the cones.We implemented the NJ algorithm in Mathematica [31].Our implementation takes as input a dissimilarity map, represented as a symmetric real matrix, and returns the agglomerated tree as well as all the agglomeration steps.Our software, as well as the results of the computations discussed in the rest of the paper, is available at [5].
4.1.Simulation results.In order to measure the polyhedral cones, for n = 4, . . ., 8, we uniformly sampled 1,000,000 data matrices lying inside the unit sphere and the positive orthant of R ( n 2 ) .For each, our software computed its corresponding agglomerated tree.We counted the number of matrices that were associated to each agglomerated tree.
Table 2 displays our results for 4 taxa.The first column represents agglomerated trees in Newick format.The second shows the percentage of the million random instances lying in that cone.We can notice there is an apparent bias to certain trees.Three of the six trees accumulated 86.7771% of the random input matrices, while the remaining instances were equally distributed among the other three cones.This unpredicated bias is explained by Lemma 2.2, since in the last step of the algorithm, the minimum is achieved twice, it is left to the implementation the way to resolve this conflict.We noticed that the bias in Table 2 is due the way Mathematica sorts the minimum in these cases.We want to remark that all implementations have to resolve the decision of the tie in the last step, and the original paper of Saitou and Nei [23] noticed the tie in equations (11b) and (11c), suggesting to resolve a tie between (ab) and (cd) by choosing to agglomerate the one with the smallest index (lexicographically), in this case (ab).We explored two ways to correct this artificial bias due our implementation.

Bias correction.
There could be many ways to resolve the tie in the last step of the algorithm.One could choose a distinguish set of pairs to resolve the clash of minima.For instance, we could always choose to agglomerate the pair involving the lowest (or highest) index.This choice would make it so that all instances in Table 2 would be equally distributed between the trees ((1, 2), 3, 4), ((1, 3), 2, 4), and ((1, 4), 2, 3), and the other three would not observe any instances.This could be desirable for practical reasons, but not for understanding the geometry of the algorithm.This convenient choice also has an impact on the biological interpretation of the output, as we would be choosing beforehand what should be considered biologically near without taking in consideration any biological information.Thus, we explored the impact of some different choices.
Another natural way to correct the ambiguity in the last step of the NJ algorithm is to choose uniformly at random one of the two coinciding agglomerations.We repeated the experiment with this correction in our software, and found this choice equally distributes the number of instances observed in each cone.We refer to this bias correction just as Uniform.Table 3 displays the results for 4 taxa after this correction.
Lastly, we also consider another more biologically motivated bias correction method.As noticed, the last step of the algorithm could have to decide between joining two sets of taxa A, B, over other two C, D, and we could take biological information in consideration for making this choice.For instance, one could decide to join the pair containing more taxa: we call this that is more balanced, we could produce a bias similar to the one discovered in the UPGMA algorithm [7].In Table 4 we display a simulation for the implementation of the Baggage bias correction, using the Uniform correction when there is ambiguity.We can verify that for n = 4, the Baggage correction and the Uniform coincide, but this is not the case for n ≥ 5. We show in Table 5 the comparison for 5 taxa of our implementation without bias correction (column labeled Regular), versus the Uniform, and Baggage bias correction methods.To interpret the results, we notice from Remark 4.1 that if the NJ method had no bias, each agglomerated tree should be returned with probability 1.666%, which is very close to what is obtained in the Uniform method.5: Fraction of 1,000,000 samples in each tree for different bias corrections.
In Table 5 we see that our original implementation had a bias due the way Mathematica sorted the double minima conflict in the last step.The Uniform method seems to correct the bias, making each tree occur with equal probability.Lastly, the Baggage method creates a bias towards trees formed with less α steps.There could be biological reasons to correct the bias to agree with what is observed.We realized these computations for n = 4, . . ., 8 taxa, obtaining similar results to those displayed in Table 5 and are available at [5].We conclude by remarking that it is left to understand other reasonable ways to explore the bias correction, and understand its combinatorial implication.Knowing about different bias that can arise in the algorithm, not only helps to develop other uses for this agglomerative method, but also to aid in the study of bias from the heuristics for LSP or BME, as they can be compared with a biased NJ method.

2. 1 .
Description of the algorithm.Let [n] := {1, . . ., n}.A dissimilarity map, which is a function D : [n] × [n] → R ≥0 such that D(a, b) = D(b, a), D(a, a) = 0, and D(a, b) ≥ 0 for all a, b ∈ [n].A dissimilarity map is additive or a tree metric, if there exists a tree T with edges E with a weight function w : E → R ≥0 on T so that D(a, b) equals the sum of the edge weights on the unique path from a to b in T .
) Define S k−1 by appending r (a, b) to the left of S k after dropping a and b. (3) Construct t k−1 from t k by (a) Deleting the edges from both a and b to O. (b) Introducing a new vertex labeled u adjacent to O. (c) Connecting u to both a and b. (4) Compute D k−1 on the new set of boughs via the formula

Figure 4 .
Figure 4.An agglomeration order in an unrooted tree with 6 taxa.

Figure 9 .
Figure 9.All agglomeration orders in a tree with 6 taxa.

Figure 10 .
Figure 10.Motzkin triangle with the second row highlighted (from bottom to top) are the sequences A001006, A002026 respectively.Recall from Definition 3.3 that ϕ(n) denotes the number of agglomeration orders on unrooted binary trees with n leaves.We now derive the relation of ϕ(n) with partial Motzkin numbers.Notice in Figure8that for all NJ paths, there is only one way to reach the point (0, 2) from the ending points (0, 3), (1, 2), and (2, 1).This holds in general due the recursion (3.4).Thus, counting the number of NJ paths that end in one of these three points is equivalent to counting all NJ paths ending at (0, 2), or equivalently, all partial Motzkin paths ending at (n−3, 1).In this way, we conclude that the number ϕ(n) is given by the Motzkin number M n−3,1 , and equation (3.3) gives a formula for them.We summarize these observations in the following theorem.

Figure 11 .
Figure 11.Motzkin triangle with multiplicities k ≤ n − 4 and 0 ≤ j ≤ k, or zero otherwise.Lemma 3.10.For n ≥ 4 and k ≤ n − 3, the weights a k,j , b k,j , c k,j satisfy(3.6) However, there are two possible ways to write them in Newick notation.For instance, we could have ((a, b), c, d) or ((c, d), a, b) to mean the same tree, but the NJ algorithm differentiate these two.Therefore, for n = 4 the NJ algorithm outputs 6 trees instead of only 3. We summarize this information for the first five cases in the following table.

Table 2 .
Fraction of the 1,000,000 samples in each agglomerated tree with 4 taxa.

Table 3 .
Fraction of 1,000,000 samples found in each agglomerated tree.If in the last step the NJ algorithm has to decide between an α or a β step, Baggage would chose β over α as it is joining two bouquets having at least 2 leaves in each, as opposed to α that is joining two stems, i.e., two leaves.Formally, we think of the NJ algorithm as joining sets of taxa, starting from a list of singletons, and stopping until when we obtain a set of three taxa sets.For a taxa set A we let |A| denote the size of A, which is the number of taxa it contains.Then, Baggage chooses to join taxa sets A and B over C and D, if |A| + |B| > |C| + |D|.If both pairs of sets have the same taxa, so |A| + |B| equals |C| + |D|, we could use the uniform method to solve this case.We note that if we consider to join a pair A, B

Table 4 .
Fraction of 1,000,000 samples found in each agglomerated tree.