Illegitimate Recombination Between Homeologous Genes in Wheat Genome

Polyploidies produce a large number of duplicated regions and genes in genomes, which have a long-term impact and stimulate genetic innovation. The high similarity between homeologous chromosomes, forming different subgenomes, or homologous regions after genome repatterning, may permit illegitimate DNA recombination. Here, based on gene colinearity, we aligned the (sub)genomes of common wheat (Triticum aestivum, AABBDD genotype) and its relatives, including Triticum urartu (AA), Aegilops tauschii (DD), and T. turgidum ssp. dicoccoides (AABB) to detect the homeologous (paralogous or orthologous) colinear genes within and between (sub)genomes. Besides, we inferred more ancient paralogous regions produced by a much ancient grass-common tetraploidization. By comparing the sequence similarity between paralogous and orthologous genes, we assumed abnormality in the topology of constructed gene trees, which could be explained by gene conversion as a result of illegitimate recombination. We found large numbers of inferred converted genes (>2,000 gene pairs) suggested long-lasting genome instability of the hexaploid plant, and preferential donor roles by DD genes. Though illegitimate recombination was much restricted, duplicated genes produced by an ancient whole-genome duplication, which occurred millions of years ago, also showed evidence of likely gene conversion. As to biological function, we found that ~40% catalytic genes in colinearity, including those involved in starch biosynthesis, were likely affected by gene conversion. The present study will contribute to understanding the functional and structural innovation of the common wheat genome.


INTRODUCTION
Common wheat is one of the most widely grown cereal crops in the world and an essential source of food, its production affects the global economy, and failed harvests could lead to social unrest (Choulet et al., 2014;Wang et al., 2015;Appels et al., 2018). The sequencing of its relatives, the diploid Triticum urartu (genotype A d A d ; hereafter capital letters to indicate genome types and subscripts to indicate ploidy of organisms carrying them, with d, t, or h to show diploid, tetraploid, or hexaploid, respectively), and Aegilops tauschii (genome D d D d ), and the tetraploid wild emmer wheat (T. turgidum ssp. dicoccoides, genome A t A t B t B t ) contributed to understanding their genome evolution and innovation (Middleton et al., 2014;Du et al., 2017;Zimin et al., 2017;Gardiner et al., 2019). Wheat, as a heterologous hexaploid (genotype A h A h B h B h D h D h ), was possibly produced when T. urartu hybridized with an ear of wheat carrying a B genome, thereby forming a neo-tetraploid that hybridized with the A. tauschii. Two hybridization process made the common wheat genome complicated and made its genome even more complicated is that a grass-common wholegenome duplication (cWGD) having occurred~100 million years ago produced thousands of duplicated genes in the extant grass genomes (Paterson et al., 2004;Wang et al., 2015;Du et al., 2017). The event played an important role in promoting the formation of new species of grasses .
Genetic recombination is the primary source of genetic novelty (Kurosawa and Ohta, 2011). In plants, meiotic and mitotic recombinations are reciprocal, involving asymmetric exchange of genetic information between the homologs (Jia et al., 2013;Gardiner et al., 2019). Whereas, irreversible recombination involves the transfer of information from one site to its homeolog, leading to gene transformation (Datta et al., 1997;Kurosawa and Ohta, 2011). Gene conversion involves the unidirectional transfer of genetic material from a 'donor' sequence to a homologous 'acceptor.' In eukaryotes, it constitutes the main form of homologous recombination initiated by DNA double-strand breaks, as reviewed previously (Chen et al., 2007).
Whole-genome duplication will result in a large number of paralogous genes on homeologous chromosomes or chromosomal regions, and substantial similarity between homeologous regions could permit the occurrence of illegitimate recombination (as compared to normal homologous recombination), often resulting in a pattern called gene conversion that could be explained by that a gene is replaced by its homologous genes (Wang et al., 2009). There were reports to show that gene conversion could have occurred between paralogs in grasses and other plants. In some cases, gene conversion could have been frequent and is even on-going between paralogs on the homeologous chromosomes, e.g., rice chromosome 11 and its cWGD homeolog, chromosome 12 (Wang et al., 2009;Kurosawa and Ohta, 2011;Wang et al., 2019).
Illegitimate recombination has profound consequences on the evolution of paralogous genes and the chromosomes that they reside in. While paralogous recombination elevates base mutation rates, evolutionary rate, and DNA sequence deletion, converted regions have highly similar paralogous DNA (Wang et al., 2009;. This is a seemingly perplexing but an essence reasonable finding. Temporally segmental restriction of paralogous recombination along homeologous chromosomes has produced stepwise strata of increased similarity from their centromeres to chromosome ends . Increased mutation rates may foster genetic innovation. A significant QTL (qS12) resulting in hybrid male sterility was mapped within~400 kb region adjacent to the highly recombined areas on the short arm of chromosome 12 (Zhang et al., 2011). Moreover, two genes harboring the E3-ligase RING-C2 domain on the distal parts of rice chromosomes 11 and 12, were identified, and have possibly evolved in concert via gene conversion (Jung et al., 2012).
Despite thousands of paralogs in wheat genomes produced by recurrent polyploidies, a comprehensive analysis of gene conversion is lacking. Here, by aligning the wheat and its relatives' genomes, we inferred gene colinearity within a genome and orthology between them, and eventually characterized the pattern of illegitimate recombination and gene conversion. The present work will contribute to the understanding of genetic and structural innovation of wheat genes and chromosomes.

Alignment of Wheat and Relative Genomes
Chromosome homeology was revealed by inferring colinear genes using the software ColinearScan (Wang et al., 2006). We revealed 232 A h B h , 187 A h D h , and 154 B h D h homeologous regions between wheat subgenomes, involving 14,161, 15,376, and 15,553 homeologous genes, respectively ( Table 1). This meant about 40% of their genes were involved in intersubgenome colinearity. Actually, as to the numbers and percentages of colinear genes between genomes, B h D h shared better gene colinearity than the other pairs of subgenomes; while A h B h shared the least gene colinearity. As to each of the three subgenomes, D h shared a better colinearity with the other two subgenomes. This seems to agree with the inference that D h subgenome joined the latest to produce the hexaploid wheat and/ or has a conservative nature, as discussed later.
Similarly, we detected orthologous regions and genes between the diploid wheat relative genomes (A d A d from T. urantu and D d D d from A. tauschii) and the corresponding common wheat subgenomes. Actually, with the same parameters, we inferred 211 A d A h and 374 D d D h orthologous regions, including 8,195 and 20,474 colinear genes, respectively. Notably, the D d and D h shared much better orthology than the A d and A h did, possibly due to much more genome fractionation in the A d and A h . This fact also supports D h 's being a later player to form the hexaploid. We also revealed gene colinearity between the tetraploid wild wheat, T. turgidum (A t A t B t B t ), with the common wheat genome. Mainly, the tetraploid B t subgenome and hexaploid B h subgenome shared 97 orthologous regions, involving 13,913 colinear genes and 80.0% of these colinear genes are supported by OrthorMCL (Table 2, Figure 1).

Gene Conversion Between Common Wheat Subgenomes
To infer likely gene conversion, we retrieved homologous gene quartets from the above constructed multiple alignments. With D d as the reference, we assumed 7,462 A d A h D d D h homeologous gene quartets, with each quartet containing a D d gene, and its ortholog in common wheat D h subgenome, and their corresponding orthologs in A d genome and common wheat A h genome ( Table 1). This meant that about 42.36-43.91% of colinear genes in each genome were involved in the revealed quartets ( Figure 2A). Theoretically, the common wheat A h D h homeologous genes were more diverged than each with their diploid orthologs. However, due to gene conversion following illegitimate recombination, the homeologs might be more similar to one another, resulting in an aberrant tree topology. The phylogenetic tree of the homologous quartet with gene conversion is shown in Figures 2C-H. By checking tree topology after aligning the gene sequence of each quartet, we found that 164 common wheat homeologs (~2.20% of all those involved in quartets) were likely converted.
For there was not a diploid B genome available, we took the B t genome from the emmer wheat in the analysis and constructed A d A h B t B h and B t B h D d D h quartets ( Table 3). We inferred 6059 A d A h B t B h quartets and found that 127 (or 2.10% of all those involved in quartets) A h B h homeologs were likely converted, showing much more conversion than the B h D h homeologs. Similarly, we inferred 12,770 B h -D h quartets, and found that 253 (or 1.99% of all those involved in quartets) were likely converted, showing similar gene conversion rates between subgenomes ( Figure 3). With the converted homeologs, one copy might act as the donor, while the other as the acceptor. If one gene was taken as the donor, the sequence of the acceptor would be converted to be identical or highly similar, immediately after the conversion, to the donor's ortholog in the diploid (tetraploid) plant. It seemed that in A h D h gene conversions, about 62.20% of the converted genes took the D h homeologs as the donor. This showed that the subgenome D was more likely taken as the repairing template during recombination with the subgenome A. With A h B h and B h D h converted genes, the B h homeologs were more often (59.84 and 66.40%) taken as the donors, respectively, showing the subgenome B was often taken as the repairing template when pairing with the subgenomes A or D ( Table 3).

The Lower Gene Conversion Rate Between Ancient Paralogs Produced by the Grass Common Whole-Genome Duplication
We found likely gene conversion between ancient paralogs in each subgenome produced by the cWGD~100 million years ago. To see gene conversions between ancient paralogs, such as those in the subgenome A h , we first revealed the paralogs having preserved colineartiy resulting from the cWGD, inferred their A d orthologs, and found the A d A h A d A h quartets. Then, we checked likely tree topology changes, being aberrant from the expected one. The orthologs were supposed to be more similar  than each to the paralog in the same subgenome. However, if the paralogs, such as those A h paralogs, were more similar, it showed likely gene conversion. Here, we found about 700 homologous quartets within each of the three subgenomes and showed that ancient gene conversion rates varied from 0.83 to 1.71% (Table  4), being lower (though not significantly) than those between these subgenomes, as estimated above.

Gene Function Analysis
By performing Gene Ontology analysis, we found that, in the whole genome level, specific functional genes were enriched ( Figure 4). The organelle-related functional genes were significantly more likely to be converted compared to all colinear genes (~3.32 to 1.89%, Fisher′s exact test p-value = 0.041). Similarly, the organelle extracellular region genes were significantly more likely to be converted compared to all colinear genes (~0.70% to 0, p-value = 0.004). Genes involved in the nutrient reservoir and molecular transduction were also significantly likely affected by conversion ( Figure 4D).
Comparatively, the membrane part genes and biological regulation genes showed substantially less likely to be involved in conversion as compared to all colinear genes (~4.71 and 7.63%, p-value = 0.0182; and~9.95 and 15.45%, p-value = 0.002) ( Figure 4A). As to each subgenome, specific genes were enriched in converted genes ( Figures 4B-D). For subgenome B, the molecular transducer activity genes showed a significant increase in converted genes (~0 to 2.02%, p-value = 0.0004). As to the signaling genes in subgenome A, those in subgenome D showed a considerable increase (1.27 and 10.09%, p-value = 0.0002). The biological regulation genes in subgenome D showed a significant increase (~1.83 to 8.28%, p-value = 0.003).
Starch, one of the final products of photosynthesis, helps form the wheat seeds. About 65-80% of the grain composition of common wheat is starch. Here, we found that about 40% of starch biosynthesis catalytic genes were likely involved in gene conversion in the whole genome and each its subgenome. Therefore, we selected a set of catalytic genes regulating starch FIGURE 1 | Alignment of the wheat and relative genomes with wheat DD as reference. The whole-genome duplication (WGD) in the common grass ancestor of plants caused them to have at least two circles of chromosomes. The hybridization event caused common wheat to have six such chromosomes. The innermost circle represents the seven chromosomes of the wheat DD genome (D d ) from Aegilops tauschii, and the gray lines linking paralogous genes. A d , Triticum urartu; B t , subgenome B in tetraploid wild wheat; A h , B h , and D h , three subgenomes of common wheat. biosynthesis, the converted ones, and their close homologs in wheat and relative (sub)genomes to construct a phylogenetic tree ( Figure  5). The converted genes shown here are Ah7g3962/Bh7g2982, Ah6g3817/Dh6g3664, Ah5g5014/Ah4g3299, Bh1g2423/Dh1g2297, each pair showing conversion between different subgenomes.

DISCUSSION
The doubling of the whole-genome often leads to an extremely unstable genome of the species (Freeling et al., 2012;Zhang et al., 2013;Wang et al., 2018). Illegitimate recombination, in contrast to recombination usually between homologous chromosomes, may occur between homeologous chromosomes (Wang et al., 2009;Woodhouse et al., 2010;Murat et al., 2014). Genome instability is often characterized by large-scale DNA losses, DNA rearrangement, even chromosome breakages, and fusions, which eventually contribute to regaining genome stability. After some time, which is difficult to define how long it is, the duplicated genome would come to a calm state. However, though at a muchlowered rate, illegitimate recombination may still occur between homeologous chromosomal regions. In rice, sorghum and other grasses, we found illegitimate recombination could last tens of millions of years, resulting in a particular stratum structure between two rice homeologous chromosomes, and between their sorghum counterparts, and many involved rice and sorghum genes might have been converted by their homeologous copies (Wang et al., 2009;. Here, we checked likely gene conversion in wheat, an important world-wide crop, and found hundreds of converted genes between the subgenomes. This shows that the wheat genome has not been quite stable after the formation of its hexaploid structure about 10,000 years ago (Feldman et al., 2012). There has been genetically illegitimate recombination between the subgenomes, contributing to the structural and functional changes of homeologous chromosomes and the genes residing on them. We also found likely gene conversion between more ancient duplicated genes produced by the cWGD having occurred~100 million of years ago (Wang et al., 2015). Though the conversion scale of duplicated genes within each subgenome is lower than that between different subgenomes, it suggested the long-lasting lingering effect of illegitimate recombination after the whole-genome duplication, as reported previously in rice and its Oryza relatives, sorghum, and other  grasses (Jacquemin et al., 2009;Ling et al., 2013). Notably, we revealed that genes from different common wheat subgenomes tend to play different roles during the occurrence of conversion. The subgenome DD shares more colinear genes with the other two subgenomes and its genes act more likely as donors, showing likely preferential roles of DD chromosomes and the genes residing on them. This may be related to the fact that subgenome DD was the last one to be added to form the hexaploid plant, common wheat (Middleton et al., 2014;Du et al., 2017;Zimin et al., 2017;Gardiner et al., 2019). The initial formation of a tetraploid ancestor (AABB) may have been followed likely wide-spread DNA and gene losses, accompanying illegitimate recombination between homeologous chromosomes, due to genomic instability of the tetraploid. After this turmoil stage, the homeologous chromosomes became much different, and the tetraploid genome came to a relatively stable state. Then the third genome DD, contributed by an ancient plant that A. tauschii is derived from, hybridized with the tetraploid to form the hexaploid common wheat. The newly joined subgenome DD lives much at ease with the reduced DNA content of the subgenomes AA and BB. Though there should have been continual DNA losses and illegitimate recombination, the occurrence rates should have likely decreased. This eventually resulted in a better gene colinearity of the subgenome DD with the other two than the other ones with one another, as observed above. Furthermore, relatively innocent state of DD genes might render themselves privileges to act as the donors during gene conversion, to preserve ancestral gene structures and functions.

C) if a D h gene is converted by an A h paralog; (D) if an A h gene is converted by a D h paralog; (E) if an Ah gene is converted by a B h paralog; (F) if a B h gene is converted by an A h paralog; (G) if a B h gene is converted by a D h paralog; (H) if a D h gene is converted by a B h paralog. Places without squares indicate gene loss.
The above-inferred conversion involved homeologs produced by the formation of the hexaploid about 10,000 years ago, while the tetraploid progenitor (AABB) might have formed at least 0.5 million years ago (Marcussen et al., 2014;Brenchley et al., 2012). Comparatively, the divergence of their diploid ancestor occurred at least 2.5 to 7 million years ago (Marcussen et al., 2014;Cheng et al., 2019). With updated fossil evidence applying to colinear genes, a grass-family level estimation of their Triticeae common ancestor suggested its appearing more than 20 million years ago (Wang et al., 2015). These facts mean that the wheat homeologs have much diverged from one another and phylogenetic inference of gene conversion between homeologs by comparing their close orthologs in diploid or tetraploid progenitor would not likely result in high false positives. This supports the credibility of the present gene conversion inference. However, this could really be an underestimation of gene conversion. Thus, it is not easy of converted paralogs to be well supported at such rigorous criteria about gene tree topology change. The subgenome copies are quite similar to each other, and there may be no much time to gather mutations between the diploid-  hexaploid orthologs to show likely occurrence of gene conversion between the hexaploid subgenome copies. Notably, we found that around 40% of starch biosynthesis catalytic genes were involved in gene conversion, showing that conversion may favor certain types of functional genes, as shown with the above pilot characterization with GO. Actually, while characterizing the phylogeny of those catalytic genes, the constructed phylogenetic tree involved low Bootstrap values in some branches. However, the credibility of the tree was well supported by known or expected phylogenetic relationship, paralogy between homeologs from subgenomes, or orthology between homeologs from different organisms. Actually, the known or expected relationship between homeologs was further supported by inferred gene colinearity, which is a good criteria to explore the real phylogeny, especially when it was disturbed by non-uniform evolutionary rates of plant genes (Meng et al., 2020).

Materials
The genome-wide datasets, including whole-genome sequences, annotated genes, and their translated proteins, were retrieved from following public databases.

Gene Colinearity Inference
By using the similarity comparison tool BLASTP compared the proteins within and between the genomes of wheat and related species, and scoring according to the similarity between the protein sequences, we inferred the relationship of gene homology according to the results of the comparison, and the expected value (E-value) is limited to 1e−5. Then the result of BLASTP was used for the software ColinearScan to infer colinear gene pairs. Colinear genes are those in a genomic region sharing locational orders with their homologs in another genomic region. The detection of genomic homologous colinear fragments was based on the principle of similarity between genes, which was combined with the position information of genes on the chromosomes. The colinearity analysis software ColinearScan was used to infer homologous regions (or blocks) containing five or more colinear genes . The maximal searching gap between colinear genes was set to be 50 intervening non-colinear genes(P <0.05), as often adopted in previous analyses . Besides, we used OrthoMCL (http://www.micans.org/mcl/src/mcl-latest.tar.gz) to identify orthologous and paralogous genes to support the credibility of the gene colinearity inferred above (Li et al., 2003;Brenchley et al., 2012).

Inference of Homologous Quartets and Gene Conversion
We used wheat A. tauschii genomes as a reference to construct multiple alignments. In details, each genome was compared and aligned to the referenced genome by inferring gene colinearity; any two genomes were compared to find inter-genome colinearity; and then, each genome was aligned to itself to find paralogous genes produced by cWGD (Wang et al., 2015;Ling et al., 2018;Liu et al., 2019). We used multiple alignments to construct homologous gene quartets based on the homologous relationships. By checking colinear genes in the multiple alignments, the paralogous gene pairs in each of the involved genomes and the orthologous gene pairs between genomes were obtained, and homologous gene quartets were retrieved. Using these quartets, we inferred gene conversion based on the aberrant topology of each homologous gene quartet. Aberrant tree topology is that the relationship between homeologous genes is different from the expected phylogeny ( Figure 2A). With each gene quartet, multiple sequence alignment was performed using ClustalW (Thompson et al., 1994). If in the alignment, the number of gaped sites in any of the aligned sequences is more than 50% of the alignment length or the amino acid identity was less than 40%. Then the alignment was discarded for further inference (Wang et al., 2009) As to the conversion inference with each homologous quartet, phylogenetic gene trees were constructed by using Neighbor-joining approach implemented in MEGAX with encoded amino acid sequences as input, and default parameters were adopted (Tamura et al., 2007). Bootstrap of 1,000 iterations was performed to improve the reliability of the evolutionary tree. Then, with trees supported by bootstrap value >70%, we checked the tree topology to find whether there was likely gene conversion. Synonymous nucleotide substitutions on synonymous substitution sites (Ks) between homologous genes (Nei and Gojobori, 1986) were estimated to characterize the distance between homologs with the program implemented in Bioperl (Stajich et al., 2002). Whole-genome scale gene conversion was displayed by using Circos software (Krzywinski et al., 2009).

GO Enrichment Analysis
We performed GO enrichment analysis and used the domain similarity comparison software InterProScan to annotate genes with GO (Gaudet, 2019). The best international standard matching wheat genes was obtained through the BLASTP comparison between the international standard protein database and the whole wheat genomic protein file. Then, we used WEGO2.0 (http://wego.genomics.org.cn/) to generate the wheat gene function map. The log function standardized the number of genes involved in each category, and then the functional distribution and correlation were analyzed. The pvalue was calculated using hypergeometric delivery to explain the significance of gene enrichment. Here we selected a set of starch catalytic activity function genes to build a phylogenetic tree. CLUSTALW was used to perform sequence alignment on the protein sequences (Thompson et al., 1994), and a phylogenetic tree was constructed using the Maximum Likelihood (Murshudov et al., 1997). The pilot value was based on 1,000 copies.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the URGI, https://wheat-urgi.versailles.inra.fr.

AUTHOR CONTRIBUTIONS
XW conceived and led the research. CL, JW, PS, JY, FM, ZZ, HG, CW, and XL performed the analysis or contributed analysis tools. XW and CL wrote the paper. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
We appreciate financial support from the Ministry of Science and Technology of the People's Republic of China (2016YFD0101001), China National Science Foundation (3117022), Tangshan Key Laboratory Project to XW. We thank the helpful discussion with researchers at the iGeno Co. Ltd, China.