A novel phylogeny and morphological reconstruction of the PIN genes and first phylogeny of the ACC-oxidases (ACOs)

The PIN and ACO gene families present interesting questions about the evolution of plant physiology, including testing hypotheses about the ecological drivers of their diversification and whether unrelated genes have been recruited for similar functions. The PIN-formed proteins contribute to the polar transport of auxin, a hormone which regulates plant growth and development. PIN loci are categorized into groups according to their protein length and structure, as well as subcellular localization. An interesting question with PIN genes is the nature of the ancestral form and location. ACOs are members of a superfamily of oxygenases and oxidases that catalyze the last step of ethylene synthesis, which regulates many aspects of the plant life cycle. We used publicly available PIN and ACO sequences to conduct phylogenetic analyses. Third codon positions of these genes in monocots have a high GC content, which could be historical but is more likely due to a mutational bias. Thus, we developed methods to extract phylogenetic information from nucleotide sequences while avoiding this convergent feature. One method consisted in using only A-T transformations, and another used only the first and second codon positions for serine, which can only take A or T and G or C, respectively. We also conducted tree-searches for both gene families using unaligned amino acid sequences and dynamic homology. PIN genes appear to have diversified earlier than ACOs, with monocot and dicot copies more mixed in the phylogeny. However, gymnosperm PINs appear to be derived and not closely related to those from primitive plants. We find strong support for a long PIN gene ancestor with short forms subsequently evolving one or more times. ACO genes appear to have diversified mostly since the dicot-monocot split, as most genes cluster into a small number of monocot and dicot clades when the tree is rooted by genes from mosses. Gymnosperm ACOs were recovered as closely related and derived.


INTRODUCTION
The dramatic increase in the amount of publicly available genomic information has facilitated analyses of gene-family origins and evolution. Plant gene phylogenies have proliferated, but they are commonly made using distance matrices of amino acid sequences. This method is expected to amplify misleading information resulting from convergence (Farris, 1983;Simmons, 2000;Simmons et al., 2002), obscuring any signal of the true history.
Methodological approaches to studying gene family histories need focused attention, since gene phylogenies are inherently difficult to verify. Molecular phylogenies of species can be compared to their morphology and biogeography to ask whether resulting trees are plausible. Perfect congruence is not expected, but phylogenies at extreme variance cast doubt on the quality of the methods and underlying data. Because gene copies can undergo any combination of diversification, loss, and neofunctionalization in different taxonomic lineages, nearly any sort of phylogenetic hypothesis can be the reasonable outcome of an analysis. Gene copies in the same taxon or with the same morphology would be expected to have a higher probability of being closely related, but this is not necessarily so, and it is dependent on the timing of gene family diversification events and the phylogenetic placement of progenitor copies.
Here we employ phylogenetic methods designed to minimize the effects of convergence and amplify historical signal in the date. We apply these methods to the PIN-formed (PIN) auxin transporters and the ethylene-forming, ACC-oxidase enzymes (ACO). We chose to analyze these gene families because both are important plant development genes for which there are hypotheses about the timing and drivers of their diversification, but they differ in the amount of phylogenetic study thus far received. The PIN genes have been the subject of different phylogenetic analyses (Paponov et al., 2005;Křeček et al., 2009;De Smet et al., 2011;Carraro et al., 2012;Viaene et al., 2013;Bennett et al., 2014), but for ACOs no such analysis has been attempted. We thus are able to compare the results of our methods to prior analyses in PIN genes, and we lay the groundwork for a new discussion on the history of ACOs. PIN-formed proteins polarly transport the plant hormone auxin, which regulates several aspects of plant growth and development (Robert and Friml, 2009;Zazímalová et al., 2010). Since the discovery and characterization of the first PIN mutant in Arabidopsis (Okada et al., 1991), several other PINs have been identified and characterized in different plant species. In the Arabidopsis genome, there are eight PIN loci, which are categorized into groups according to their protein length and structure, as well as subcellular localization (Paponov et al., 2005;Křeček et al., 2009). The first cladistic analysis of PIN genes (Carraro et al., 2012), which was rooted by moss and lycophyte copies, suggested that PIN genes diversified mostly since the rise of land plants, around the time of the monocot-dictot split. A subsequent analysis (Viaene et al., 2013) focused on the evolution of PIN gene morphology; the preferred topology-rooted by protist, animal, bacterial genes-suggested that the moss sequence "PpPIN1D" is sister to all other PIN genes, and the morphology evolved from short forms to long.
ACOs help in the synthesis of ethylene, which is a gaseous hormone under normal environmental conditions, and which regulates many aspects of the plant life cycle (Bleecker and Kende, 2000;Lin et al., 2009). In higher plants, ethylene is synthesized via two committed enzyme-catalyzed steps from S-adenosyl-Lmethionine. The first step is catalyzed by 1-aminocyclopropane-1-carboxylic acid (ACC) synthase (ACS), and the second (and last) step is carried out by ACC oxidase (ACO). ACOs are members of a superfamily of oxygenases and oxidases, most of which utilize Fe (II) as a cofactor and 2-oxoglutarate (2OG) as a cosubstrate (Sato and Theologis, 1989;Bidonde et al., 1998;Wang et al., 2002). The subcellular localization of ACO proteins is prevalently cytosolic rather than membrane-bound (Chung et al., 2002;Hudgins et al., 2006;Lin et al., 2009). ACOs have high similarity throughout the protein coding sequences and expression analyses reveal that the ACO genes display a high degree of differential expression in tissues at various stages of development. A variety of plant species produce ethylene, including unicellular and multicellular algae, although angiosperms use a different biosynthetic pathway from primitive land plants and algae (Wang et al., 2002;De Paepe and Van der Straeten, 2005;Plettner et al., 2005;Yordanova et al., 2010;Wanke, 2011;Yasumura et al., 2012).

ASSEMBLING OF THE ACO AND PIN DATA SETS
For all phylogenetic analyses, unless otherwise specified, coding sequences (CDS) were used. All taxa with publicly available sequences were included, although a random subset of all available angiosperm sequences were taken, so as to generate a manageable data set size. For the PIN data set (Table 1), reported unique sequence identifiers were used to retrieve the corresponding sequences from the Phytozome v.9.1 (www.phytozome. org) (Goodstein et al., 2012), ConGenIE (congenie.org), and Genebank (www.ncbi.nlm.nih.gov/genbank/) (Benson et al., 2005) databases. The only major plant group that was not included in the PIN data set was ferns, as previously reported PIN genes from them are not publicly available (Viaene et al., 2013). ACO sequences ( Table 2) were identified from previously published studies and via queries with the BLASTn algorithm at the National Center for Biotechnology Information (NCBI) nucleotide collection and from the Phytozome v.9.1 (Goodstein et al., 2012). Only proteins that were annotated as aminocyclopropane-carboxylate oxidases were retained. Transmembrane profiles for PIN amino acid sequences were predicted querying the TMHMM Server v.2.0 (www.cbs.dtu.dk/ services/TMHMM/) (Moller et al., 2001). PIN proteins were classified (1-5) according to their length, number of transmembrane domains, and length of the central hydrophilic loop (See Table 1). Generally, caution should be exerted when classifying PIN proteins according to their number of TMDs, as those are predicted protein domains that will need final confirmation by reconstruction of the tertiary structure by crystallography. In two cases where gene sequences showed no notable differences from long forms but were predicted to have only the N-terminal transmembrane domain (OsPIN3a and Aco018694), we coded them as long.

ALIGNMENT AND PHYLOGENETIC ANALYSIS
The moss PIN gene "PpPIN1D" (Gene Code "Pp1s79_126V6.1") was excluded from analysis due to suspicion of it being a pseudogene; although not the most distant gene in the data set, it has only about half the number of nucleotides relative to other moss PIN genes (with the gaps appearing in the middle), and in preliminary phylogenies it was recovered on a very long branch (usually twice as long as its sister). This can also be seen in our previous phylogeny of PIN genes (Figure 2A in Carraro et al., 2012). Likewise, we did not include purported PIN homologs from non-plants (Viaene et al., 2013), as we had no evidence for their homology.

ABF20554 PsmACO1
Selaginella moellendorffii  Sorghum bicolor Sb02g026280 Sb02g026280 Sb05g005710 Sb05g005710 Sb05g005720 Sb05g005720 Sb09g003790 Sb09g003790 Sb09g003800 Sb09g003800 Sb10g022640 Sb10g022640 Sb04g034520 Sb04g034520 SeaView retains original nucleotide data through the process of amino acid alignment and thus allows an accurate backtranslation after alignment. SeaView also has the alignment program MAFFT (Katoh et al., 2009) implemented, but, with the -gapopen parameter defined, it failed to run with our data set. Clustal misaligned one of the short genes, which was fixed by hand, although tree-searching on the uncorrected version of the alignment resulted in nearly identical topologies. We noticed that monocots tended to have a very high GC content, especially in the third codon position (Figure 2). Thus, using the program Mega v. 4.0.2 (Tamura et al., 2007), we removed the third codon positions of these nucleotide alignments, and in the program BioEdit (Hall, 2007) we replaced G and C with N ( Figures 1C,D). Also in BioEdit, we translated nucleotide alignments and replaced all amino acids but Serine with N ( Figure 1G); this required several steps, as BioEdit uses a default codon translation for amino acids in back-translation. We also used the program Mesquite (Maddison and Maddison, 2009) to convert fasta files into files that can be read by the phylogenetic program TNT (Goloboff et al., 2008). (Mesquite changes N to "?" automatically when generating TNT files and so these were returned to N using the search-and-replace function in BioEdit.) We used TNT because it clearly reads N as "any nucleotide," preserving the gap information. We analyzed the same alignments under likelihood bootstrap in the program RAxML (Stamatakis et al., 2008) on the Cipres (Miller et al., 2010) computing cluster. RAxML does not recognize gaps and treats them and N as simply missing data. For searches in TNT, we used the "new technology" function (which combines several strategies for exploring tree space) with 100 initial builds, and we followed with calculations of bootstrap support using 1000 pseudoreplicates. We took the strict consensus of the shortest trees, which is reported here. For searches of amino acids in POY, we started with 100 Wagner tree builds and conducted SPR and TBR swapping, selecting the shortest trees (with zero-length branches collapsed) and reporting the strict consensus.
Bootstrap support, which is a measure of the redundancy of signal for optimal clades, was not expected to be high, given the nature of these analyses. Single-gene phylogenies with large portions of their content removed to avoid convergence are unlikely to contain enough information to support every clade in resampling analyses, but we do report resampling support for the largest alignments (all 1st and 2nd codon positions, with GC replaced by N).
To mimic the way RAxML treats gaps and missing data, we also replaced all Ns and gaps with "?" (Figures 1F,H) and reran them in TNT, which reads "?" as "any nucleotide or gap." We also removed gappy regions in the program Gblocks (Castresana, 2000) (Figure 1E), replacing the remaining gaps (which were allowed in half of the positions) with N. We also searched the unaligned amino acid sequences in the phylogenetic program POY v. 4 and 5 (Varón et al., 2009), which optimizes the multiple sequence alignment and tree searching simultaneously. When completed, PIN trees were uploaded to Mesquite with a character matrix of their protein lengths (Figure 3; coded as 1-5), and parsimony ancestral reconstructions traced over the trees. Characters were treated as unordered.

RESULTS
PIN phylogenies recovered from a variety of sequence alignments and under parsimony or likelihood reconstructed the evolution of these genes from long to short (Figures 4, 5). The phylogeny recovered with full gap information under parsimony showed a clearer evolution from long, through intermediate, and to short forms, and the short versions were recovered as two clades under likelihood and parsimony when gaps were replaced by "?" or gappy regions removed by Gblocks (Figures 4B, 5A,B). Although several smaller clades remained stable throughout the analysis, and the moss and most of the lycophyte sequences tended to remain as sister to the remaining PIN genes, the relationships among the major lineages were generally unresolved. Genes having the trans-membrane domain only on the C-terminal end appeared to have evolved from long-form PINs, possibly twice, and those having this domain only on the N-terminal appear to have evolved from short-form genes, perhaps more than once. Monocot PINs were recovered mostly in small stable clades mixed among the dicot genes. Lycophyte genes were monophyletic under likelihood and under parsimony when the alignment was trimmed by Gblocks ( Figure 5B); otherwise they tended to form a paraphyletic grade at the base of the tree, near the Pyscomitrella patens (moss) genes, or several small clades, only some of which were near the moss genes. Gymnosperm genes were found monophyletic only under likelihood (Figure 4B), and at the base of the tree, after the lycophytes, under parsimony ( Figure 4A).
ACO genes clustered more distinctly by taxonomic group, under both parsimony and likelihood, except for the moss and lycophyte genes, which were mixed at the base of the FIGURE 1 | Different alignments used in this study. Alignments began by translating coding sequences and aligning them with ClustalW2 ("gapopen" option set to 15), and then back-translating them into nucleotides (A,B). From these only the 1st and 2nd codon positions were taken (C), and then all G and C nucleotides converted to "N" (D), which is read as "any nucleotide" in the phylogenetic program TNT. This alignment then had gappy regions removed by the program Gblocks (E) or had all of its GC bases and gaps converted to "?" (F), which is read as "any nucleotide or gap" in TNT. The alignment in B also had all codon positions except those coding for Serine converted to "N" , then the 3rd codon positons removed (G). This also had all Ns and gaps converted to "?" (H). tree (Figure 6). Gymnosperm ACOs were recovered as closely related but not near the basal plants. Under parsimony gymnosperms ACOs were recovered as monophyletic, with the exception of the Norway spruce gene PaABO4, and under likelihood they were recovered in two clades that formed a paraphyletic grade between two angiosperm diversifications. In both analysis, ACOs of monocotyledonous species (Oryza sativa, Sorghum bicolor, and Zea mays) cluster into three groups of closely related copies; under likelihood these groups are each monophyletic. Genes resulting from duplications are recovered as closest in both analysis (for example Sb05g005710 and Sb05g005720). ACOs from dicotyledonous species form several clades or paraphyletic grades, each containing copies from a mixture of species, except one which contains exclusively Arabidopsis sequences. As in monocots, dicot species that underwent relatively recent whole genome duplications (Malus domestica, Populus thricocarpa) present tightly related gene copies with high sequence similarity.
Trees recovered under dynamic homology using amino acids were less organized by taxonomic group than those recovered using the nucleotide alignments above (Figure 7). With PIN genes, copies from the moss P. patens remained sister to all other PINs, but the lycophyte copies were recovered throughout the rest of the tree. This would make any ancestral reconstruction of gene length ambiguous, as the lycophyte copies are classified as long. With the ACOs, the mixing of gene copies by taxonomic group occurred mostly at the base of the tree, although it was more complete.
Tree lengths, numbers, and likelihoods are provided in Table 3. The full alignment of 1st and 2nd codon positions, with all GC

FIGURE 2 | Frequencies of nucleotides (T, C, A, and G) at each codon position (1-3) for 151 PIN and 193 ACO genes in dicots (A,B), and monocots (C,D). Bars represent 95% confidence intervals.
content replaced by N resulted in a large number of equally parsimonious trees for PINs and ACOs, mostly due to equally parsimonious resolutions among the main lineages in PINs, and among members of small, derived clades in ACOs. These unstable regions of each gene family's phylogeny also received little bootstrap support (Figures S1, S2).

DISCUSSION
We find strong support for shorter PINs evolving during the diversification of angiosperms, and the basal plants (here represented by the moss Physcomitrella patens and lycophyte Selaginella moellendorfii) retaining only long PINs. Whether originating more than once or not, short PINs have evolved more recently and their number increased in monocots and dicots following genome duplications in species such as Oryza Sativa, Zea mays, Populus trichocarpa. This finding contradicts that of Viaene et al. (2013). The difference results from how our phylogenies are rooted, and rooting rests homology assessments, alignment methods, the way phylogenetic programs handle missing data, and assumptions about whether primitive plants are more likely to retain primitive gene copies. There are several tests for homology (Patterson, 1988), and the only one presently available for analyzing gene families is similarity; thus for highly dissimilar sequences some justification should be offered for their inclusion. To illustrate how divergent the outgroups of Viaene et al. (2013) are, their supposed PIN homolog from ants has a shorter uncorrected pdistance relative to the ingroups when reversed than in its original direction. Homology for the animal sequences cannot be justified based on auxin transport, since animals do not have auxin, so it seems likely such sequences are simply not homologous. For such divergent sequences, even if homologous, they will generate large gappy sections in the alignment, only some of which can be removed by hand, and they will likely root randomly. Another source of alignment difficulties is the partial nature of the algal PIN sequences which Viaene et al. (2013) took from ESTs. If the phylogenetic programs they employed treated gaps and missing data the same, algal sequences among the outgroups would be inclined to pull short PIN genes to the base of the tree. Indeed, it is not surprising that non-homologous and partial sequences in the outgroups attracted the strange and truncated PpPIN1D sequence to the outside of the ingroups-more through shared exclusion than similarity-thus strengthening the appearance of a short-to-long evolution of PIN genes.
In the absence of a suitable outgroup for plant PIN genes, we have simply made the assumption that plants recovered as sister to all the others (what one might call "primitive" or "basal" plants) carry gene copies that are also likely to be sister to all the others. This may not be correct, as evidenced by the placement of some gymnosperm copies as more derived than angiosperm copies. However, perhaps the most stable and supported relationship we recovered with PINs and ACOs was the clear distinction between moss (and usually lycophyte) copies and all the others. Thus, these gene families either diversified very early, and mosses and lycophytes retained only the most derived copies, or, more parsimoniously, the gene families simply diversified after the rise of spermatophytes. In any case, we note that the reconstruction of a long-to-short evolution of PIN genes is not merely the result of putting moss copies as the outgroup, for even in generally unresolved trees, like the one recovered from the alignment trimmed of gaps by Gblocks (Figure 5B), the main clades of short-form PINs are closely related to each other and are derived from angiosperm long forms. Only rerooting specifically by a short-form copy would change this, and in reconstructions with two origins of short-form copies, one clade would remain derived. A recent phylogenetic analysis of PIN genes done independently of the analysis here (Bennett et al., 2014) uses a similar methodological approach and obtains results that are broadly congruent with our previous analysis (Carraro et al., 2012) and the analysis here. Bennett et al. (2014) use nucleotide sequences and root by bryophytes, and they recover the odd PpPIND in a derived position on a long branch. They also find multiple, later origins for short (or more specifically, what they term "non-canonical") forms of PIN genes. It thus appears that whether improvements are made to PIN phylogenies by adding more sequences (Bennett et al., 2014) or excluding sources of convergence in the data, as we do here, PIN genes increasingly seem to have undergone shortening events multiple times.
In our phylogenies it appears that most modern ACOs arose subsequent to the monocot-dicot split [140-150 Ma ago, during the late Jurassic-early Cretaceous (Chaw et al., 2004)], but PIN genes diversified much earlier, as evidenced by their more thorough historical mixture of monocot and dicot copies. A broad diversification of ACOs during the Mesozoic is later than was FIGURE 5 | Strict consensuses of equally parsimonious trees using various alignments of the PIN gene data set with only 1st and 2nd codon positions. With Gs and Cs removed, the influence of gaps was minimized so as to mimic how gaps and Ns are treated in likelihood programs by replacing them with "?" (A). Alternatively, the alignment had gappy regions removed using the program Gblocks, and then all Gs, Cs, and remaining gaps replaced by "?" (B). A Serine-only alignment was also used, with only 1st and 2nd positions and all gaps and non-Serine postions replaced with "?" (C).
previously hypothesized by John (1997), who did not consider them present in primitive land plants and believed their appearance was necessitated by droughts at the end of certain Permian periods (the Devonian at 360 Ma ago and the Carboniferous at 300 Ma ago). It appears that multiple ACO copies were present during the Permian and before the split between gymnosperms and angiosperms, and even before the split between mosses and lycophytes, but the later proliferation of copies in angiosperms requires a new ecological driver besides droughts in the Permian.
Gymnosperm PINs and ACOs appear to have derived from angiosperm gene copies. This is not corrected simply by using a gymnosperm root, since that renders moss and lycophyte copies derived. Rather it indicates that for both PINs and ACOs the ancestral copy in gymnosperms was a more derived copy than some of the copies inherited and retained in the ancestral angiosperm. A very limited number of gymnosperm sequences are available for both gene families, so the possibility remains that the history of gymnosperm sequences will become clearer as more of them are included in future analyses.
Amino acid sequences should present problems for historical reconstruction (Simmons, 2000), despite their popularity in plant gene family phylogenies. Genetic code degeneracy and selection pressure on protein function are sources of convergence, and although amino acids may correct for back-mutations in the third codon position, which can be another source of convergence, ignoring the third codon position removes information on recent divergences. Using amino acids with phenetic tree-building methods like Neighbor-Joining (algorithms that cluster sequences by overall similarity) (Saitou and Nei, 1987) will likely amplify convergence in amino acids, and using them with probabilistic optimality criteria requires a model of evolution both for the alignment step and tree-searching. We find here that trees made with amino acids and the most agnostic cladistic method for optimizing alignment and tree-searching (dynamic homology) produced trees with little in their favor relative to the trees made from GC-free nucleotide alignments, and we would not recommend using amino acids for future historical studies of gene families.
Although a few monocot PIN and ACO seqeunces do not have a high GC content, it seems likely that this quality is the result of an ongoing substitution bias and thus a source of misleading, convergent signal. Monocot copies are not monophyletic, and almost all of them present very high GC content, which argues against this being the result of a historical event no longer maintained in monocots. For example, in PINs almost half of the monocot copies have GC content at the third codon position above 50%, and all but four are over 40%. We notice in Meister and Barow's survey (2007) that monocots in general have statistically significantly higher GC content than dicots (using a t-test of the GC percentages, p < 0.001), and that they attain a maximum content around 50%, about 10% higher than the maximum GC content of dicots. However on average monocot genomes have only about 1% more GC content than dicots. We notice that the most GC-rich species in Meister and Barow (2007) are the grasses, which constitute all the monocots in our data set. Given how strong this bias appears in grass ACOs and PINs, it was perhaps trivial for us to retrieve monocot clades in our previous PIN phylogeny (Carraro et al., 2012), which was based on all three codon positions and did not have GC content removed. However, we still recover most monocots together (either still in clades or paraphyletic grades) and in the same combinations here as before. For example, the monocot group of nine shortform genes that includes OsPIN5b and SbPIN5 was recovered both previously and here ( Figure 4B), but previously these were monopheletic, and here they form a grade out of which diversifies a clade of short-form and N-terminal-TMD-only genes. Even using the alignment that had only the first and second positions of serine, which should be immune from a GC mutational bias, we recovered clusters of monocots, a close relationship among primitive land plants, and a clade of mostly short PIN genes ( Figure 5C).
Here we present new phylogenies for PIN and ACO genes, after working to improve the methods used to reconstruct the histories of gene families. First, we avoid the use of amino acids and distance (phenetic) algorithms, which have the potential to convey and amplify homoplasy. Next, in a further attempt to avoid homoplasy among genes with similar lengths and GC content, we avoid the use of indels and treat Gs and Cs or their transformations as missing data. We root trees by gene copies found in bryophytes, and we exclude sequences which are not clearly homologous. The results suggest an evolution from long to short PINs, perhaps multiple times, and a diversification of ACOs mostly after the dicot-monocot split. More sequences from a wider taxonomic range for these gene families are welcome for the continued development of their phylogenetic hypotheses and a deeper understanding of their histories.