Glycosyltransferase Family 61 in Liliopsida (Monocot): The Story of a Gene Family Expansion

Plant cell walls play a fundamental role in several plant traits and also influence crop use as livestock nutrition or biofuel production. The Glycosyltransferase family 61 (GT61) is involved in the synthesis of cell wall xylans. In grasses (Poaceae), a copy number expansion was reported for the GT61 family, and raised the question of the evolutionary history of this gene family in a broader taxonomic context. A phylogenetic study was performed on GT61 members from 13 species representing the major angiosperm clades, in order to classify the genes, reconstruct the evolutionary history of this gene family and study its expansion in monocots. Four orthogroups (OG) were identified in angiosperms with two of them displaying a copy number expansion in monocots. These copy number expansions resulted from both tandem and segmental duplications during the genome evolution of monocot lineages. Positive selection footprints were detected on the ancestral branch leading to one of the orthogroups suggesting that the gene number expansion was accompanied by functional diversification, at least partially. We propose an OG-based classification framework for the GT61 genes at different taxonomic levels of the angiosperm useful for any further functional or translational biology study.


INTRODUCTION
Glycosyltransferases (GTs) constitute a large superfamily of enzymes that catalyze the assembling of monosaccharide moieties into linear and branched glycan chains (Rini et al., 2009). GTs have been subdivided into several families (Coutinho et al., 2003) with 105 of them identified up to now (Lombard et al., 2014). Among them, the glycosyltransferase family 61 (GT61) contains genes strongly believed to play a central role in the synthesis and feruloylation of Arabinoxylan, the major components of cell walls in grasses (Mitchell et al., 2007). These components are known to play a crucial role against pathogen penetration, manufacturing processes for human and animal consumption as well as alcohol and biofuel production (Freeman et al., 2017).
Previous studies (Anders et al., 2012;Voiniciuc et al., 2015) consistently divided the GT61 genes family in three major clades, named A, B, and C. Clade C is the most differentiated and contains usually one gene per species whereas clades A and B contain several members per species. In clade A, a gene expansion was shown in the Poaceae family based on sequences from O. sativa and Sorghum bicolor. However, comparative genomics databases based on a wider plant genome sampling such as GreenPhyl (Rouard et al., 2010) indicate possible gene copy amplification in other monocots beyond the Poaceae ( Figure 1A). Gene family expansions can be due to both small size duplications (tandem duplications) and large scale duplications [segmental or whole genome duplications (WGDs)]. In the first case, duplicated copies of the genes remain physically close in the genome (forming gene clusters) while for large scale duplications, the genes are more distant with reduced physical interactions between them. Whatever the duplication mode, each copy accumulates independent mutations (genic conversion apart) and the classically described evolutionary fates of duplicated genes are either the loss of redundant copies, subfunctionalization (maintenance of both copies to assure the original function) or neofunctionalization (acquisition of a new function retained over time by natural selection due to a selective effect increasing fitness) [Moore and Purugganan (2005) and Innan and Kondrashov (2010)]. Somehow, the persistence of two or more copies of a gene over the course of evolution is a clue of functional differentiation among the copies (Fischer et al., 2014). Such a gene expansion and selective pressure of the GT61 gene duplicates have not been examined yet.
The objectives of this study are first to investigate the homology relationships in the GT61 by identifying orthogroups (OG) and thus providing an OG-based classification framework for this family (Cenci et al., 2014;Cenci and Rouard, 2017). Orthogroups are defined as group of genes of a given sample of species that descended from a single ancestral gene. Since the functional annotation transfer relies on the accurate identification of orthologous sequences between genomes, this work will likely be useful for any further classification and gene function assignment in the GT61 gene family. The second objective was to identify the type of duplication events that occurred in the GT61 beyond Poales in the monocots and clarify whether patterns of functional differentiation were present.

Sequence Identification and Conserved
Motif Analysis of GT61 Genes GT61 protein sequences from M. acuminata were retrieved from GreenPhyl database (Rouard et al., 2010) and used to search GT61 sequences of other species with BLASTp (score > 200) in their respective NCBI Annotation Release from A. trichopoda (release 101, species code AMBTC), V. vinifera (102, VITVI), T. cacao (100, THECC), P. dactylifera (101, PHODA), E. guineensis (101, ELAGV), S. italica (103, SETIT), B. distachyon (103, BRADI), O. sativa (102, ORYSA), and A. officinalis (100, ASPOF) or from species specific sequence databases: A. thaliana (ARATH) from TAIR10.1 (Berardini et al., 2015), M. acuminata (MUSAC) v2, from the Banana Genome Hub (Droc et al., 2013), C. canephora (COFCA), from the Coffee Genome Hub (Dereeper et al., 2015) and A. comosus (ANACO) from Plaza v4 (Van Bel et al., 2018). Sequences were manually curated to verify their gene structure and when necessary exon introns boundaries were corrected. A list detailing species and annotation for all the sequences used as well as a FASTA file containing them are available in Supplementary Datas S1, S2. A five-digit species code at the end of sequence names (as reported above between brackets) indicates the relative species. GT61 genes' physical locations along the genome were determined for all the species. When chromosome pseudomolecules were unavailable, the assignation was based on scaffold coordinates. GT61 genes separated by no more than five other genes were considered in tandem cluster.

Alignment, Phylogenetic Analysis and Orthogroup Identification Method
Phylogenetic analyses were performed on protein sequence alignments obtained with the MAFFT program (Katoh and Standley, 2013) via the EMBL-EBI bioinformatics interface  using default parameters. Conserved blocks were extracted from the alignments with Gblocks (Castresana, 2000). The selection of conserved blocks was performed by allowing: (i) smaller final blocks, (ii) gap positions within the final blocks, and (iii) less strict flanking positions. Phylogenetic trees were built with PhyML (Guindon and Gascuel, 2003) available at phylogeny.fr (Dereeper et al., 2008) using an LG substitution model and the Approximate Likelihood-Ratio Test (aLRT) as statistical tests for branch support (Guindon et al., 2009). Phylogenetic trees were visualized with MEGA6 (Tamura et al., 2013).
In this study, orthogroups (OG) were visually delineated with regards to the angiosperm species tree in Figure 1B (source NCBI taxonomy). OGs were thus identified based on gene trees as the clades including both monocot and dicot, implying the existence of a common ancestor gene before the monocot/dicot lineage split. The same method was applied for GT61 genes that underwent additional copy amplification in monocot lineage with commelinids and Poaceae divergence as reference taxonomic level.

PAML Analysis
In order to investigate the selection pressures driving evolution of the GT61 family, different models allowing the dN/dS ratio (ω, i.e., the non-synonymous on synonymous substitution rate ratio) to vary according to branches, sites or both, were tested using the codeml program of the PAML4 software (Yang, 1997). Three kinds of models were used: 'site' models, wherein the dN/dS ratio is allowed to vary between sites; 'branch' models wherein the dN/dS ratio is allowed to vary between branches; and 'branch-site' models wherein the dN/dS ratio is allowed to vary between both branches and sites. Site models were implemented in homemade python scripts, relying on the egglib package (De Mita and Siol, 2012).
The 'site' models were used to test whether positive selection drove the differentiation between paralogous sequences within each species, as performed in Fischer et al. (2016). Two models were tested: the nearly neutral model (M8a) assumes that codons evolve either neutrally or under purifying selection whereas the positive selection model (M8) assumes positive selection acting on certain codons. Likelihood ratio tests (LRTs) were performed to compare M8 with M8a and, hence, to detect sequences groups (species) for which models that include positive selection are more likely to occur than models that do not. When models with positive selection were more likely, Bayes empirical method was used to calculate the posterior probabilities at each codon and to detect those under positive selection (i.e., those with a posterior probability of having a dN/dS > 1 above 95%). Sites detected to be under positive selection at the codon level were manually validated according to alignment quality and reliability.
To determine if each of the six branches leading to subgroups A1 to A6 underwent a significant different selective pressure compared to all the other branches of the tree, the 'branch' and 'branch-site' models were used. Each of the six branches was tested individually by comparing the likelihood of a 'branch' model allowing the dN/dS to take a different value for the tested branch with the likelihood of a 'null' model. The null model was defined as follow. Two different values of dN/dS are allowed: a first value of dN/dS on the 'outgroup' branches (G group) and another value of dN/dS on all the other branches of the tree. As multiple testing is implicit in this method, the p-values were corrected using the total number of branch partitions tested (i.e., 6).
Branch partitions tested with 'branch-site' models were the same as for the 'branch' models. Two models were compared: the null model (A0), in which sites on all branches (including the G group) evolved under the same selective pressure (purifying or neutral), was compared to a model including positive selection (model A) in which some sites on the tested branch evolved under positive selection, whereas sites on the rest of branches still evolved under purifying selection or neutrality. Again, the most likely model was inferred by LRT and sites detected to be under positive selection at the codon level were manually validated according to alignment quality and reliability.

RESULTS
A total of 219 members of GT61 gene family were found in A. thaliana, T. cacao, V. vinifera, C. canephora, A. officinalis, M. acuminata, E. guineensis, P. dactylifera, A. comosus, B. distachyon, S. italica, O. sativa, and A. trichopoda. In dicots and A. trichopoda, the number of GT61 genes found per species ranged between 4 and 7 and in monocots between 9 and 39, with 24 in average (between 15 and 39 when only commelinids are considered, i.e., excluding A. officinalis) ( Table 1 and Supplementary Data S1).

Phylogenetic Analyses
The phylogenetic tree of the whole GT61 gene family was obtained with 253 aligned positions (amino acids). Three main clades were detected (Figure 2 and Supplementary Data S3). The first one (in green in Figure 2, named clade G), has a strong branch support (aLRT = 0.98) and contains 11 sequences belonging to both monocot and dicot lineages plus one sequence of A. trichopoda. However, in this clade, there are no sequences of the following species: A. thaliana, M. acuminata and all considered Poaceae ( Table 1). The second clade is also strongly supported (aLRT = 1), and includes 44 sequences (subclades D and F in Table 1) distributed in sub-clades with variable support (in blue and fuchsia in Figure 2). The third clade (clade A, aLRT = 0.94) is the largest one (in red in Figure 2). It includes 164 sequences, i.e., more than 75% of GT61 sequences ( Table 1).
A second phylogenetic analysis restricted to the second clade was performed in order to increase its phylogenetic resolution, using the clade G as an outgroup. Thus, the 44 sequences of this clade and 11 of the clade G were re-aligned and the phylogenetic tree was built based on 307 amino acid aligned positions (Figure 3 and Supplementary Data S3). In this new tree, a strongly supported clade (aLRT = 0.98) includes seven sequences from only A. trichopoda and three dicots species (T. cacao, V. vinifera, and A. thaliana; in fuchsia in Figures 2, 3). The remaining 34 sequences (aLRT = 0.80) form five new clades named D1 to D4, and D (dicot) (Figure 3). The OG-GT61-D (dicot) clade (aLRT = 0.94) contains all and only dicot sequences (Figure 3). An A. trichopoda sequence is branched to this clade with a lower branch support (aLRT = 0.76) (Figure 3). The sequences composing the four remaining clades (D1-D4) come all exclusively from monocot species, and each clade includes sequences from almost all the commelinid species studied. As expected, A. officinalis sequences have a basal position in all four clades, with LOC109843905 and LOC109840057 clearly included in clade D1 and D4, respectively, whereas the A. officinalis sequence (LOC109831329 * ) is branched on a basal position of the two clades D2 and D3 (Figure 3).
In the same order of idea, a third phylogenetic tree was built to better disentangle the phylogenetic relationships within the group A. The 164 sequences it contained were re-aligned and a new phylogenetic tree was build based on 277 aligned amino acid positions. Several well-supported clades were identified but relationships among them remained poorly resolved (Figure 4 and Supplementary Data S3). All the dicot sequences were grouped in the same well-supported clade named ' A (dicot)'. Five others clades contain at least one sequence of each monocot species sampled in our study (A1, A2, A3, A4, and A6) whereas a sixth (A5) is missing sequences from O. sativa. Finally, the clade A7 contains only sequences from A. comosus and from the three Poaceae species. All these clades have an aLRT support higher than 0.95 except A3 (0.81). The seven remaining sequences, whether grouped or not in small well-supported clades, could not be assigned to the main clades previously described.

Orthogroups Identification and Classification of GT61 Genes
As previously mentioned, an orthogroup (i.e., group of orthologous genes) is defined, for a given sample of species, as a group of genes all issued from a single ancestral gene present in the last common ancestor of the species being considered. Based on the species considered in this study (i.e., monocots and dicots), we identified 4 orthogroups. The sequences included in the phylogenetic clade G appear clearly derived from the same ancestral gene present in the monocot/dicot last common ancestor and constitutes thus a first orthogroup (named OG-GT61-G). Indeed, the internal topology of clade G follows almost perfectly the expected topology of angiosperm phylogeny. Clade G is well supported and contains sequences from both monocot and dicot species (Figure 3); one sequence from A. trichopoda is included in this cluster. Two other orthogroups (OG-GT61-F and OG-GT61-D) were defined in the second main clade. Again, the ancestral gene, from which all the genes of the clade OG-GT61-F are derived, was probably present in the common ancestor of monocot and dicots. However, in this clade, the absence of genes from the monocots species studied here can be interpreted as a gene loss that probably occurred early in the evolution of monocot lineages. Conversely, for the OG-GT61-D, in the lineage of commelinids copy amplification took place that generated four copies; each one was conserved in the descendant species and formed four commelinid-specific orthogroups, named from OG-GT61-D1 to OG-GT61-D4 (Figure 3). The position of the three A. officinalis sequences in the phylogenetic tree suggests that the duplication which resulted in the presence of D2 and D3 sequences occurred after commelinids lineage diverged from Asparagales (Figure 3). Due to the low branch support for OG-GT61-D2 and -D3 clades, one cannot exclude the possibility that Acorales and Poales sequence duplications (no representative sequence was found for M. acuminata) were independent and successive to the respective lineages separation (Figure 3), implying the existence of a unique orthogroup.
All the sequences of the group A, issued from both dicot and monocot species, were assigned to an orthogroup named OG-GT61-A. Clade A (dicot) contains all the dicots sequences and each dicot species possesses one to three sequences. Speciesspecific tandem duplications occurred in some species, such as A. thaliana and T. cacao. A lineage-specific amplification was also observed in A. trichopoda. The phylogenetic analysis performed on the sequences of this orthogroup shows that several amplifications occurred but concerned almost exclusively genes from the monocots.

Genome Distribution of GT61 Sequences
The analysis of the genomic positions of the GT61 sequences from the orthogroup A in monocots showed that several genes are tandemly distributed. Each tandem is composed of genes, each one belonging to a different commelinid-specific OG-A. Moreover, the tandem organization is collinear among the studied monocot species ( Table 2). The tandem repeat loci present in the genome of P. dactylifera and E. guineensis contain members from all the subclades (A1-A6). Differently, in M. acuminata, whose genome experienced three lineage-specific WGDs, the number of OG-GT61-A tandem loci is higher (nine loci) and it appears that each of them underwent gene loss, a process known as 'fractionation process' in polyploid species (Langham et al., 2004). Thus, the complete ancestral structure of the tandem repeat locus of M. acuminata can be inferred by consensus from all the clusters (Table 2). In Poales (not shown in Table 2), because of (i) the presence of the A7 orthogroup members and (ii) the occurrence of additional tandem amplifications, the OG-GT61-A tandem loci harbor a more complex organization, but they still remain partially collinear with those of Arecales and Zingiberales.
Based on comparative analysis of loci on which GT61 genes are tandemly distributed and on the phylogenetic analysis, a model was built to reconstruct the history of GT61 gene amplification before the commelinid radiation. Since the GT61 family underwent additional copy amplifications in the Poales, only Arecales and Zingiberales are taken into account (Figure 5).

Selective Pressure in GT61 Family
In order to investigate evolutionary forces which could explain the diversification of this gene family into six orthogroups in commelinids, several models of codon evolution were tested. Due to the further gene expansion that took place in Poaceae, only A. comosus was included in analyses as a species representative of Poales. A. officinalis sequences were also removed to simplify the focus on commelinids. The sequences of the OG-GT61-G were included as outgroup. The 103 amino acid sequences  were aligned and converted back into codons (311 positions). The phylogenetic tree used in the PAML analysis was built with the same codon alignment (Figure 6 and Supplementary Data S3).
The first model tested the diversification between paralogous sequences within each species and revealed no signal of positive selection ( Table 3).
The second model allowed us to test the significance of a different dN/dS value on each of the six ancestral branches leading to the different orthogroups (A1-A6), compared to the dN/dS values of the rest of the tree (Figure 6 and Table 4). This analysis showed that on the branch leading to the orthogroup A6, the dN/dS value was significantly higher (0.7543) than the dN/dS values of all the remaining (i.e., background and root) branches of the tree (0.1467 and 0.1881 for the root). This result shows that the constraints were much more relaxed, on average, in this branch, i.e., after the duplication leading to the A6 copy and before the speciation occurred.
To go further in the investigation of the selective constraints on those six branches, other models were used. They tested for the presence of sites under positive selection on the selected branches (Figure 6 and Table 5). On two branches, A5 and A6, a proportion of sites (2.3 and 9.8% for A5 and A6, respectively) were under positive selection. For the A5 branch, the estimated value of dN/dS in the site category expected to be superior to one is exactly 1, suggesting that the significance of the 'branch-site' model compared to the null model may be artefactual. On the contrary, the dN/dS value estimated for positively selected sites on the A6 branch was 5.98, confirming that those sites are clearly under positive selection on this branch. Indeed, Bayes empirical method (BEB) showed that 5 codons have a high posterior probability to be under positive selection (p > 0.95). These codons are located in the cleaned alignment positions 67, 129, 151, 236, and 311, where amino acid variants are often specific to the OG-GT61 sequences (Supplementary Data S4).

DISCUSSION
In order to reconstruct the evolutionary history of the GT61 gene family in angiosperms and to propose a classification of its members, we conducted a deep phylogenetic analysis of the GT61 genes present in 13 species representative of angiosperms.
Our strategy was to identify orthogroups (OG) for three different sub-sample of species, from the widest one (monocots and dicots) to the narrowest (Poaceae species), with an intermediate one (commelinid species). This strategy allowed us to make assumptions on how many ancestral genes were present at each node of interest across the angiosperm evolutionary history, and thus to model when and where duplication and loss events occurred. When all the species were considered (the largest species subsample), four orthogroups could be defined: OG-GT61-A, -D, -F, and -G [the very divergent clade C in the phylogeny of Anders et al. (2012) was not considered in this study]. It indicates that all the GT61 genes from monocots and dicots derived from at least 4 ancestral genes. This number is consistent with the fact that six GT61 genes were found in A. trichopoda, included here as an outgroup for the monocot/dicot lineage. Actually, two of them are included in the clade containing the OG-GT61-A genes and probably result from an A. trichopoda lineage-specific duplication while other two are included in the well-supported F orthogroup (Figures 2-4).
Since a commelinid-specific expansion was observed within OG-GT61-A and -D, the commelinids were chosen as the second species sub-sample to define narrower OGs. Thus, 7 and 4 orthogroups, respectively (OG-GT61-A1 to -A7 and OG-GT61-D1 to D4) were identified. Finally, due to additional amplifications detected in A orthogroups of Poaceae (represented by O. sativa, B. distachyon and S. italica in our study), Poaceae constituted the third species level and 15 additional OG could be defined (Supplementary Figure S1). This number should be validated by an analysis including a more complete representation of Poaceae species. A GT61 additional amplification was observed also in Arecales, but since only two species represented this group, we considered unreliable to define an Arecales-specific orthogroups.
The GT61 gene family underwent dramatic copy amplifications, as already noted in the study by Anders et al. (2012), which was restricted to Poaceae species, in particular within OG-GT61-A. However, the estimation of when the amplifications occurred, in particular over the course of monocot evolution needed to be specified. Our results show that a first round of amplification involved the ancestor of all commelinids analyzed here. Moreover, the phylogenetic positions of the non-commelinid A. officinalis GT61 sequences in D and A orthogroups suggests that the amplification process started before the split between the Asparagales and Commelinid lineages. The copy number amplification observed for grasses (Poaceae) by Anders et al. (2012) would actually involve a larger taxonomic range (at least for some duplication events).
Gene amplifications shared by the commelinid were observed in two different orthogroups: OG-GT61-A and, to a lesser extent, OG-GT61-D. The dicot clade position in the OG-GT61-A and OG-GT61-D trees raises the question whether the amplification process started before the monocot/dicot split and was followed by loss of the duplicated copies in the dicot lineage. For example GT61-A1 clade appears to have a basal position in the OG-GT61-A subtree (Figure 2 and Supplementary Figure S1) but unresolved position of the outgroup (A. trichopoda) sequences prevent us from solving this question.
The Poales-specific amplifications within the orthogroups A1, A2, A4, and A7 indicate that new copies were retained. With the exception of the Poales-specific A7 OG, independent amplifications took place also in Arecales (P. dactylifera and E. guineensis). Finally, in the M. acuminata GT61 family, gene amplification was observed in all A OGs, likely explained by the three WGDs that took place during the Musa genome evolution (D'Hont et al., 2012). Taken together, these observations indicate that the expansion in the GT61 family was not an evolutionary burst that occurred before the Commelinids or Poaceae radiation but rather the result of recurrent duplication events in the monocot evolution. The analysis of GT61 gene genomic locations indicates that the GT61 family expansion was shaped by both local tandem amplification and large scale duplications (WGDs).
Combined analyses of phylogeny and comparative genomics ( Table 2) of the OG-GT61-A members allowed us to reconstruct the first phase of amplification, i.e., the one concerning the entire commelinid lineage (Figure 5). This reconstruction is based on highly supported phylogenetic branches (aLRT higher than 0.9) and on relative positions and orientations of GT61 genes in tandemly organized loci found in all the commelinids. However, to reach a higher level of resolution in the GT61 family tandem amplification history in monocots, additional high quality genome would be needed, especially in other monocot orders such as Acorales or Liliales for instance.
One hypothesis to explain the retention of duplicated copies during the genome evolution of all monocot species analyzed here is that the GT61 genes underwent a functional divergence. This hypothesis relies, in a schematic way, either on the neofunctionalization scenario under which a new advantageous function appears on one copy (for which positive selection is expected to act), or on the subfunctionalization scenario under which divergence accumulated between copies makes them non-redundant and consequently prevents each of them from being eliminated during the evolution (for which only neutral processes are expected) (Moore and Purugganan, 2005;Innan and Kondrashov, 2010). In order to go further, and to figure out which functional implications could be deduced from the evolutionary history of GT61 gene family, we searched for putative footprints of selection. The PAML analysis performed on the sequences of OG-GT61-A of monocots revealed positive selection footprints on the branch specific to the OG-GT61-A6. In particular, five codons with a dN/dS value significantly higher than 1 were identified. Only two of them lie in the DUF563 domain. No other branches or sites under positive selection were identified but action of diversifying selection cannot be excluded. In particular, some analyzed branches are really short (for example in the case of OG-GT61-A3 and -A4, Figure 6) and the statistical power is known to be reduced in short branches. It is also possible that the alignment cleaning steps removed significant codons.
At present, the function of only three OG-GT61-A genes has been identified: TaXAT1 and TaXAT2 in wheat (Anders et al., 2012) and XAX1 in rice (Chiniquy et al., 2012). These three genes are involved in cell wall xylan synthesis. TaXAT1 and TaXAT2 belong to the orthogroups GT61-A1a and GT61-A2b, respectively; the rice XAX1 belongs to the commelinid orthogroup GT61-A7 (Poales specific). It is likely that other copies share similar function in synthesis of cell wall xylans. The increasing number of genes observed in the monocot could be related to the higher biochemical complexity and diversity of the cell wall of this taxonomic group compared to other plant species (Peña et al., 2016). Duplicated copies could have enabled the transfer of different molecules to the xylose chain, thus modifying the cell wall composition. Moreover, the cell wall xylan composition is different in tissues and organs (Anders et al., 2012) and a part of the amplification could be followed by regulatory subfunctionalization adjusting expression of the cell wall network genes to an optimal level in each plant tissue. Since these additional copies were not lost during the course of evolution, this retention possibly being the result of selective forces, the new functions most probably improved the fitness of the plants.

AUTHOR CONTRIBUTIONS
AC and MR conceived the study. AC performed the analyses and wrote the manuscript. NC performed the PAML analysis.

FUNDING
This work was supported by CGIAR Fund Donors and CGIAR Research Program on Roots, Tubers and Bananas (RTB) and the Belgian Development Cooperation (DGD).