Alignment of Common Wheat and Other Grass Genomes Establishes a Comparative Genomics Research Platform

Grass genomes are complicated structures: they share a common tetraploidization, and particular genomes have been further affected by additional polyploidizations. These events and the subsequent genomic re-patterning have resulted in complex, interwoven gene homology both within and between genomes. Accurately deciphering the structure of these complicated plant genomes would help us better understand their compositional and functional evolution at multiple scales. Here, we build on our previous research by performing a hierarchical alignment of the common wheat genome vis-à-vis eight other sequenced grass genomes with the most up-to-date assemblies and annotations. With these data, we constructed a list of homologous genes and then, in a layer-by-layer process, separated the orthologs and paralogs, which were established by speciations and recursive polyploidizations, respectively. Compared with the other grasses, the far fewer collinear outparalogous genes within each of the three subgenomes of common wheat suggest that homoeologous recombination and genomic fractionation likely occurred after its formation. In sum, this work contributes to the establishment of an important and timely comparative genomics platform for researchers in the grass community and possibly beyond. The list of homologous genes can be found in the Supplemental material.

Recursive polyploidizations have contributed to the evolution of grasses. The sequencing of the rice genome revealed a grass-common tetraploidization, or whole-genome duplication (WGD) event, that occurred about 100 million years ago (Paterson et al., 2004; Wang et al., 2005). This event might have played a major role in promoting the speciation of new grasses to form the large monocot family that exists today. Since then, further polyploidizations have occurred in this family, including one that likely contributed directly to the formation of maize and two sequential ones that contributed to the origin of common wheat (Triticum aestivum). The latter arose through hybridization of the wheat genome A (Triticum monococcum) with genome B, followed by hybridization with genome D (Aegilops tauschii). Recursive polyploidizations greatly complicate the structure of plant genomes, and they leave large numbers of duplicated genes even after widespread post-polyploidy gene losses. Nonetheless, the duplicated genes arising from polyploidization are an important evolutionary driving force, one that has exerted its biological effects for millions of years.
An accurate alignment of multiple genomes is critical for understanding their structures, revealing homologous genes, and inferring how evolutionary events actually unfolded. By using the rice genome as a reference and by examining gene collinearity, XW and JW were able to align several sequenced grass genomes (Wang X. et al., 2015b), but this was done before the genome of common wheat had become available. Nevertheless, that study provided insight into the genomic changes that followed the divergence of the grasses, and it helped redate key events during the evolution of the Poaceae family. The identified homologous genes could be related to each recursive polyploidization and to each speciation event, making it possible to hierarchically distinguish the paralogous and orthologous genes. This genetic information is valuable for understanding how genome structure formed and changed, and in particular for clarifying cases of gene divergence and phylogeny. During this alignment process, the rice genome served as a reference because it is well-sequenced and well-assembled and has been conservative in both its genome structure and its gene evolution (Salse et al., 2009; Wang X. et al., 2015b).
Here, we build on this prior work by adding the now-available genome of common wheat to the previously constructed multiple-genome alignment of grass species. Although only one new species is added, it has three subgenomes, and its inclusion thus required considerable effort. In addition, we used the most recent assemblies and/or annotations of rice, barley, and other grasses in the present analysis. The present effort aims to produce a list of homologous (paralogous and orthologous) genes related to the different polyploidizations and speciations, to characterize the genomic instability of common wheat, and to contribute to establishing a grass comparative genomics platform.

Materials
Grass genomes and their gene annotations for each species were downloaded (Supplementary Table 1). Then the data were preprocessed.

Analysis of Genomic Homology
To obtain the collinear gene homologs, we used BLASTP to search for potential anchors (E < 1e-5; top five matches) between every possible pair of chromosomes within the nine grasses, and between every possible species pair. Based on these results, dot-plots were drawn with Perl scripts to compare the genomes visually and to better understand their structures. By using the software MCSCAN (Tang et al., 2008) and CollinearScan (Wang et al., 2006), we identified the homologous blocks containing collinear genes within a genome and between different genomes (maximal searching gap ≤ 50 genes; P < 0.05). By characterizing the similarity of the homologous regions, as measured by both the collinear gene number and the sequence identity, we then distinguished the paralogous and orthologous genes among them, as detailed previously (Wang X. et al., 2015b).
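The anchor-selection criteria above (E < 1e-5; top five matches) can be illustrated with a minimal sketch. It assumes the BLASTP results were written in the standard 12-column tabular format (`-outfmt 6`), which the text does not state. The function name `filter_blast_hits`, the exclusion of self-hits, and the ranking by bit score are likewise illustrative assumptions, not the authors' scripts.

```python
from collections import defaultdict


def filter_blast_hits(lines, max_evalue=1e-5, top_n=5):
    """Return {query: [subject, ...]} with at most top_n subjects per query.

    Assumes 12-column tabular BLAST output (-outfmt 6), where column 11 is
    the E-value and column 12 is the bit score.
    """
    best = {}  # (query, subject) -> (bitscore, evalue) of the strongest HSP
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query == subject:
            continue  # assumption: self-hits are not useful anchors
        if evalue >= max_evalue:
            continue  # keep only hits with E < max_evalue
        key = (query, subject)
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, evalue)

    per_query = defaultdict(list)
    for (query, subject), (bitscore, evalue) in best.items():
        per_query[query].append((bitscore, evalue, subject))

    anchors = {}
    for query, hits in per_query.items():
        # Rank by bit score (high to low), then E-value (low to high).
        hits.sort(key=lambda h: (-h[0], h[1]))
        anchors[query] = [subject for _, _, subject in hits[:top_n]]
    return anchors
```

Because the function accepts any iterable of lines, an open file handle for a BLASTP result file can be passed to it directly.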

Pairwise Alignment of the Genomes
By inferring gene collinearity, we performed a whole-genome multiple alignment for common wheat vis-à-vis the sequenced grass genomes of rice, purple false brome, barley, foxtail millet, sorghum, maize, and two diploid wheat species (Supplementary Table 2). The duplicated genes produced by the grass-common tetraploidization (GCT) and the maize-specific tetraploidization (MST) were thus obtained. Based on the derived intraspecific and interspecific collinear gene information, we constructed a table of the homologous genes, with their orthologs and (out)paralogs associated with speciations and polyploidizations, respectively (Wang X. et al., 2015a,b).
The detailed statistics for the homologous blocks and genes within a plant genome, or between any pair of genomes, are given in Supplementary Table 3 (wherein any tandem genes were filtered out). Homologous blocks that had more than a certain number of collinear genes were counted to reflect the breakages of genomic homology. The homologous genes in common wheat were further divided into subgroups to show the extent of gene collinearity within each subgenome and between the subgenomes. For example, in subgenome A of common wheat, we found 619, 38, 17, and 5 homologous blocks with at least 4, 10, 20, and 50 collinear gene pairs, respectively, which contained 4,054, 1,070, 810, and 418 collinear gene pairs in total. In subgenome B, we found 584, 38, 13, and 8 homologous blocks with at least 4, 10, 20, and 50 collinear gene pairs, respectively, which contained 3,806, 986, 651, and 512 collinear gene pairs in total. In subgenome D, we found 602, 34, 15, and 4 homologous blocks with at least 4, 10, 20, and 50 collinear gene pairs, respectively, which contained 3,969, 988, 745, and 379 collinear gene pairs in total. This means that nearly one-fourth of the genes in each wheat subgenome have collinear homologs, a higher proportion than in rice, sorghum, and the other genomes affected only by the GCT, but lower than in maize (28.3%), which was affected by an additional MST. The finding suggests that the gene-dense regions, which often have well-preserved gene collinearity, have been well-assembled.
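The threshold-based block counts reported above can be reproduced from a list of block sizes with a short sketch like the following. The function name `summarize_blocks` and the toy block sizes are hypothetical; only the thresholds (4, 10, 20, and 50 collinear gene pairs) come from the text.

```python
def summarize_blocks(block_sizes, thresholds=(4, 10, 20, 50)):
    """For each threshold, count the blocks with at least that many collinear
    gene pairs and sum the gene pairs they contain.

    block_sizes: iterable of collinear gene-pair counts, one per block.
    Returns {threshold: (number_of_blocks, total_gene_pairs)}.
    """
    sizes = list(block_sizes)
    summary = {}
    for threshold in thresholds:
        kept = [size for size in sizes if size >= threshold]
        summary[threshold] = (len(kept), sum(kept))
    return summary


# Toy input (hypothetical block sizes, not data from the study).
toy_blocks = [4, 5, 12, 25, 60]
for threshold, (n_blocks, n_pairs) in summarize_blocks(toy_blocks).items():
    print(f">= {threshold:>2} pairs: {n_blocks} blocks, {n_pairs} collinear gene pairs")
```

The counts are cumulative by construction: a block with 60 collinear gene pairs is counted at every threshold, which is why both the block numbers and the gene-pair totals decrease as the threshold rises.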

Multiple Alignments of the Genomes
By integrating the information on collinear homologs found within and between the genomes, we were able to construct first an alignment of the grass genomes (Figure 1), and then an alignment of the chromosomes of wheat and its close relatives (Figure 2).
The first alignment was built by using rice as a reference. Rice has a well-preserved genome structure that closely resembles that of the grass-common ancestor, and its genome has been well-sequenced and assembled. The alignment was done by putting the collinear gene information into a table, wherein the rice gene IDs from its 12 chromosomes were placed in the first column. However, because of the GCT (which rice shares with the other grasses), rice needs two columns to contain the duplicated genes, i.e., the GCT paralogs. Similarly, each of the other grasses except maize and common wheat also has two columns, each being orthologous to one of the rice columns. Maize, however, which experienced the MST, has two paralogous columns corresponding to each of the two rice columns (and, likewise, to the columns of each of the other grasses). Common wheat is a hexaploid derived from three diploid wheat species, so each of the two rice columns has three wheat-orthologous columns. Therefore, for the nine grass genomes studied, the resulting alignment table had a total of 24 columns. Each row of the table contained the collinear homologs, orthologs, or (out)paralogs. Where a gene was missing from an expected location in a given row, a dot was put in its place to flag the likely gene loss or translocation (deletion/insertion).
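The column count of the alignment table follows directly from the ploidy history described above, as the sketch below shows. The column labels, the column order, and the helper names `build_columns` and `make_row` are illustrative assumptions; the actual table layout may differ. The two diploid wheat species are taken to be A. tauschii and T. urartu, as named elsewhere in the text.

```python
# Copies of each species aligned to ONE GCT-derived rice column:
# 1 for grasses with no later polyploidization, 2 for maize (MST),
# 3 for hexaploid common wheat (subgenomes A, B, and D).
COPIES_PER_GCT_COLUMN = {
    "rice": 1,
    "purple false brome": 1,
    "barley": 1,
    "foxtail millet": 1,
    "sorghum": 1,
    "A. tauschii": 1,
    "T. urartu": 1,
    "maize": 2,
    "common wheat": 3,
}


def build_columns(copies):
    """List one label per table column: every species copy, for both GCT copies."""
    columns = []
    for gct_copy in (1, 2):  # the GCT doubled every ancestral region
        for species, n_copies in copies.items():
            for k in range(1, n_copies + 1):
                columns.append(f"{species}|GCT{gct_copy}|copy{k}")
    return columns


def make_row(columns, genes):
    """Build one table row; a dot marks a gene missing from its expected column."""
    return [genes.get(column, ".") for column in columns]


columns = build_columns(COPIES_PER_GCT_COLUMN)
print(len(columns))  # 2 * (7 * 1 + 2 + 3) = 24

# Hypothetical gene IDs, for illustration only.
row = make_row(columns, {"rice|GCT1|copy1": "riceGeneA",
                         "common wheat|GCT1|copy2": "wheatGeneB"})
print(row.count("."))  # 22 columns in this row have no collinear gene
```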
Considering intragenomic homology, the number of paralogous blocks in the different species ranged from 30 to 96 (Table 1); these blocks consisted of 922-6,614 collinear gene pairs (Table 2) and 1,614-9,196 homologous genes (Table 3). Notably, we found 30 homologous blocks involving 2,852 collinear gene pairs and 4,026 homologous genes in subgenome A of common wheat, 31 homologous blocks involving 2,589 collinear gene pairs and 3,749 homologous genes in subgenome B, and 30 homologous blocks involving 2,778 collinear gene pairs and 3,975 homologous genes in subgenome D, most of which were produced by the GCT (Tables 1-3).
Considering intergenomic homology, we found that, compared with the other genomes, the common wheat A, B, and D subgenomes had 96-615 orthologous blocks containing 4,582-10,163 collinear gene pairs, and 67-126 outparalogous blocks containing 988-6,003 collinear gene pairs (Tables 1, 2). Orthologous regions or genes found between the common wheat A, B, and D subgenomes are more numerous than those found between each subgenome and either A. tauschii or T. urartu, but similar in number to those found between each subgenome and the other grasses (Table 2). Between the subgenomes, there were 24 A-B, 37 A-D, and 8 B-D orthologous regions containing 8,197, 8,255, and 7,993 collinear gene pairs, respectively, accounting for ∼60% of the predicted genes. In addition, there were 51 A-B, 56 A-D, and 55 B-D outparalogous regions produced by the GCT, containing 1,447, 1,530, and 1,536 collinear gene pairs, respectively, and accounting for ∼6% of the total genes in each subgenome; this is about half the number of outparalogs found between barley, rice, sorghum, Brachypodium, and maize (Table 3). In particular, each of the three wheat subgenomes has preserved ∼40% more outparalogs with the other grasses (excluding A. tauschii and T. urartu) than are preserved between any two wheat subgenomes (Table 3). The far fewer outparalogous collinear genes found between the subgenomes of common wheat point to possible genome fractionation after its origination through extra polyploidizations.
With respect to the alignment of the grasses other than common wheat, we have updated the inference reported previously (Wang X. et al., 2015b) based on the latest versions of the genome data available (Middleton et al., 2014; Du et al., 2017; Mascher et al., 2017; Zimin et al., 2017).
We used barley as a reference to construct the alignment table of common wheat and its diploid relatives (Figure 2). With the available genome sequences, we found that barley had better homology with each of the three wheat subgenomes than with either of the two diploid wheat genomes, and that this level of homology was similar to that between barley and Brachypodium (Table 3). The orthologous collinear genes between barley and each wheat subgenome are ∼25% more numerous than those between barley and each of the diploid relatives.
Any local region of the genome alignment can be displayed linearly to view the details of the aligned genes, as well as the gene losses or translocations found there. The alignment of local regions often revealed large-scale gene losses after the GCT and lineage-specific events after the divergence of the species (Figure 3). Using rice chromosomes 1 and 5, which were produced by the GCT, as a reference, we displayed the alignment of a region from 44.5 to 44.8 Mb on rice chromosome 1, along with its corresponding regions from all other (sub)genomes (Figure 3). For example, this region of rice chromosome 1 was orthologous to a region from 34.2 to 34.5 Mb on chromosome 3 of common wheat subgenome A, with which it shared three collinear genes. It also shared orthology with a region from 67.9 to 68.6 Mb on chromosome 3 of subgenome B. We also found a paralogous region with two collinear genes on rice chromosome 5, which shared orthology with chromosome 1 of subgenome A of common wheat. For the most part, however, the local alignment figure reveals only a few collinear genes, suggesting widespread gene loss or removal of genes from their ancestral locations.

Chromosome Reorganization
The alignment of chromosomes illustrates neatly how chromosome reorganization may have occurred in the other grasses after their divergence from rice, which is thought to have preserved much of the ancestral grass karyotype after the GCT. Judging by the inner half-set of circles in the global alignment (Figure 1), rice chromosome 1 was fully preserved in chromosome 3 of wheat and its close relatives, including barley. A similar pattern is evident for many other rice chromosomes, except chromosome 3, which was split into parts that formed wheat chromosomes 4 and 5.

DISCUSSION
The comparative analysis of homology within and between the nine grasses enhances our understanding of the evolution of grasses. Nearly 30 million years after a whole-genome duplication event ∼100 million years ago (Paterson et al., 2004; Wang et al., 2005), the common ancestor of sorghum, maize, and foxtail millet separated from the common ancestor of wheat, rice, barley, and purple false brome (Hilu, 2004).
Our study is a considerable expansion of previously published work inferring gene collinearity (Salse et al., 2008; Murat et al., 2010, 2014; Wang X. et al., 2015a). Here, we added common wheat to a multiple-cereal genome alignment. As an important group of Poaceae plants, wheat crops experienced both the GCT and the hybridizations that brought the wheat subgenomes together. The common wheat plant of today is the result of sequential hybridizations of wheat genome A with genome B, followed by hybridization with genome D. It thus includes three subgenomes, which makes the present wheat genome structure much more complex. In addition, we included the most recent genome assemblies and/or annotations of the other genomes in the present analysis.
[Table note: Numbers in bold on the main diagonal denote the paralogous gene pairs within a genome, numbers above the diagonal denote the orthologous gene pairs between two genomes, and numbers below the diagonal denote the outparalogous gene pairs between two genomes.]
We constructed a collinearity table of genes that were hierarchically associated with the polyploidizations and speciations during the evolution of grasses. Doing so provides an important comparative genomics platform to support future related research in grasses. The gene collinearity dataset for the studied grass genomes is valuable in several ways. Firstly, researchers can use it to examine chromosome segments of interest and find out how their genes were affected by genomic changes. This is possible because the collinear genes work as anchors that help locate specific DNA changes in the intragenic regions and in the regulatory cis-elements. Secondly, the collinear genes displayed in the alignment table can be used to construct phylogenetic trees for later use in sophisticated evolutionary analyses (Supplementary Table 2). Specifically, the information provided by our study clarifies when and how these genes originated and diverged, thus providing robust data to support the study of their functional innovation, especially in the case of duplicated genes. For plants, such duplicated genes are currently a "hotspot" of research activity (Innan, 2009; Mun et al., 2009; Wang et al., 2009; Wang H. et al., 2015). Thirdly, the new data we provide here may help resolve problematic trees that have been constructed to date. Plant genes evolve at very divergent rates, and using sequence divergence in isolation might lead to wrong phylogenetic trees that fail to reflect the true relationships among plant taxa (Wang and Paterson, 2013). By contrast, gene collinearity clearly displays the actual relationships among genes and thereby helps construct a correct phylogenetic tree, which forms a sound basis for any evolutionary and functional analysis. Characterization of the homology within common wheat, and between it and the other grasses, shows that fewer outparalogous but similar numbers of orthologous collinear genes occur within common wheat than between its three subgenomes and the other grasses, excluding A. tauschii and T. urartu.
The similar level of orthology among the subgenomes, and between each subgenome and some other grasses, may mean that genome fractionation produced small translocated segments, which eroded outparalogy more strongly than orthology, because smaller orthologous segments can still be inferred through gene collinearity. This result may be partially explained by the still-incomplete genome assembly. However, considering that the wheat genome was sequenced and assembled later than those of its two diploid relatives, using similar or even better technology, it is quite plausible that much genome instability and fractionation have occurred since the formation of the ancestral hexaploid only ∼10,000 years ago (Mayer et al., 2014). Although common wheat is viewed as an allopolyploid whose homoeologous chromosomes may have diverged considerably before the hexaploid formed, illegitimate recombination has probably still occurred and accumulated a considerable effect over time.
This inference is tenable considering that the GCT was previously proposed to be an allopolyploidization (Murat et al., 2014), and that it may have resulted in non-negligible homoeologous recombination (Wang et al., 2009). A pair of GCT homoeologs in the grasses, rice chromosomes 11 and 12 (and their respective orthologs in other grasses), have been illegitimately recombining with each other at one of their terminal regions for millions of years, and this process is still ongoing in Oryza species (Jacquemin et al., 2011; Wang et al., 2011). There is solid evidence suggesting that homoeologous recombination has resulted in large-scale gene losses, possibly by incurring breakages in the DNA double helix that have led to gene conversion, and thus it perhaps represents a mechanism for transmitting information between homoeologous genes (Gaeta and Chris Pires, 2010; Chen, 2013). More evidence of homoeologous exchanges can be found in Brassica napus (Cai et al., 2014), an allotetraploid that originated at a similar time (Chalhoub et al., 2014). It seems that homoeologous recombination between the common wheat subgenomes has been extensive and remains ongoing. The cumulative effect of this process may have contributed to wheat's domestication and to the innovation of key biological functions, all of which invites further research in conjunction with population genomic data.

AUTHOR CONTRIBUTIONS
XW conceived the study and led the research. JW and SS implemented and coordinated the analyses. SS, JW, JY, FM, RX, LW, ZW, WG, XL, YLi, YLiu, and NY performed the analysis. XW, SS, and JW wrote the paper.