Structural Variations Affecting Genes and Transposable Elements of Chromosome 3B in Wheats

Structural variations (SVs) such as copy number and presence–absence variations are polymorphisms that are known to impact genome composition at the species level and are associated with phenotypic variations. In the absence of a reference genome sequence, their study has long been hampered in wheat. The recent production of new wheat genomic resources has led to a paradigm shift, making possible to investigate the extent of SVs among cultivated and wild accessions. We assessed SVs affecting genes and transposable elements (TEs) in a Triticeae diversity panel of 45 accessions from seven tetraploid and hexaploid species using high-coverage shotgun sequencing of sorted chromosome 3B DNA and dedicated bioinformatics approaches. We showed that 23% of the genes are variable within this panel, and we also identified 330 genes absent from the reference accession Chinese Spring. In addition, 60% of the TE-derived reference markers were absent in at least one accession, revealing a high level of intraspecific and interspecific variability affecting the TE space. Chromosome extremities are the regions where we observed most of the variability, confirming previous hypotheses made when comparing wheat with the other grasses. This study provides deeper insights into the genomic variability affecting the complex Triticeae genomes at the intraspecific and interspecific levels and suggests a phylogeny with independent hybridization events leading to different hexaploid species.


INTRODUCTION
Bread wheat is one of the most consumed crops in the world (Hawkesford et al., 2013), and a better knowledge of the extent of genomic diversity among wheat and its wild relatives is needed to meet the challenge of ensuring food security in the context of climate change and sustainability of agriculture. However, genomic studies in wheat are facing intrinsic genome complexity due to allohexaploidy (2n = 6× = AABBDD), 85% of transposable elements (TEs), and a large genome size (∼15.5 Gb). Bread wheat originated from two interspecific hybridization events: the first one occurred ∼800,000 years ago between Triticum urartu (AA) and a species close to Aegilops speltoides (BB); the second one occurred ∼10,000 years ago between Triticum turgidum (AABB) and Aegilops tauschii (DD) (Marcussen et al., 2014). A reference sequence of the complete bread wheat genome of cv. Chinese Spring (CS) has been produced recently by the International Wheat Genome Sequencing Consortium, revealing 107,891 gene models and around 4 million TEs (IWGSC, 2018). Chromosome 3B is the largest of the 21 wheat chromosomes and, thus, the only chromosome that can be isolated individually from any wheat cultivar using flow cytometric sorting (Doležel et al., 2007). This is why chromosome 3B was established as the model to study the wheat genome. Its analysis highlighted a structural and functional partitioning, and this particular organization was then found to be common to all wheat and barley chromosomes (Choulet et al., 2014;Mascher et al., 2017;IWGSC, 2018). Indeed, chromosomal extremities are the regions where most of recombination events occur, are enriched in genes, and are depleted in TEs (Choulet et al., 2014;Daron et al., 2014). They are also enriched in non-syntenic and recently duplicated genes whose functions are related to adaptation Pingault et al., 2015). These conclusions were drawn through comparisons with rice, maize, Brachypodium, and sorghum, but little is known about the extent of structural variability among Triticeae and at the intraspecific level.
Structural variations (SVs) are genomic polymorphisms that could originate from insertion, duplication, deletion, translocation, or inversion (Alkan et al., 2011). Insertions/duplications and deletions lead to polymorphisms affecting the gene content called copy-number variations (CNVs) and presence-absence variations (PAVs). In plants, such SVs have been associated with a diversity of phenotypes for major traits like photoperiod sensitivity and vernalization (Sutton et al., 2007;Cook et al., 2012;Maron et al., 2013). In wheat, CNVs affect Ppd-B1, Vrn-A1, and Rht-D1, the key genes controlling photoperiod, vernalization, and dwarfism, respectively (Díaz et al., 2012;Li et al., 2012;Würschum et al., 2015). Previous studies attempted to assess the landscape of SVs at large scales, but the complexity of the genome limits our ability to identify them accurately. Eighty-five CNVs and five PAVs were detected between two tetraploid wheats based on the analysis of 3497 cDNAs (Saintenac et al., 2011). In 2017, Montenegro et al. analyzed resequencing data from a panel of 18 bread wheat accessions and estimated a pangenome of ∼140,500 genes (Montenegro et al., 2017). A comparative sequence analysis of chromosome 2D between two distant accessions of T. aestivum revealed that 13% and 20% of gene models are specific to each other (Thind et al., 2018). Comparison between Chinese Spring, Synthetic w7984, and Opata highlighted that large (at least 100 kb) SVs were more abundant in distal regions (Gutierrez-Gonzalez et al., 2019).
Little is known about SVs affecting wheat TEs. However, they account for 85% of the genome and, in this regard, are the major drivers of the wheat genome dynamics. A recent study showed that the vast majority of TEs that are present in the hexaploid genome inserted after A, B, and D genomes diverged from a common ancestor ∼3 Mya, but before the polyploidization events . This resulted in a nearly complete TE turnover making A-B-D homeologous loci completely different in terms of TEs. However, the global genome architecture remained stable, suggesting a role of TEs in maintaining a constrained genome architecture. Although polyploidization did not trigger massive TE activation, the TE turnover is still an ongoing process at the origin of genome diversification of di-, tetra-, and hexaploid wheats (Keidar-Friedman et al., 2018;Arystanbekkyzy et al., 2019).
Here, we used chromosome sorting combined with shotgun sequencing in order to decipher the extent of SVs affecting chromosome 3B using a diverse panel of 45 Triticum accessions from seven species. We established robust methods based on short-read mapping to accurately call PAVs and CNVs affecting genes and TEs while preventing from the complexity due to polyploidy. We show that 23% of the genes are variable among this diverse panel and more than half of the TEs is affected by PAVs. This gave us the most detailed and accurate view of the landscape of SVs in such a complex genome and confirmed that distal chromosomal regions are privileged regions for rapid diversification.
DNA of sorted chromosomes was amplified and sequenced with Illumina HiSeq-2000, yielding between 171 and 845 million reads per sample, corresponding to an estimated sequencing depth of 43× on average ( Table 1). The 45 resequencing datasets were mapped on the complete Chinese Spring (CS) genome sequence (IWGSC RefSeq_v1.0). On average, 98.6% of reads were mapped, demonstrating the absence of a major contamination in our samples (Table 1) and that sequence divergence between different Triticum genomes did not prevent from mapping reads accurately on the CS reference. On average, 81% of mapped reads were mapped onto the chr3B pseudomolecule. The remaining 19% are mapped onto unanchored scaffolds (chrUn; 4% of mapped reads) but also to other chromosomes, suggesting the presence of DNA from other chromosomes in the sorted DNA samples (especially chr2B whose size is close to that of chr3B; Additional file 1: Supplementary Table S1). The distribution of mapped reads along the 20 other chromosomes did not reveal any case of homeologous exchange with 3A or 3D, neither translocated from other chromosomes.
Based on the non-repeated fraction of chr3B (representing 134 Mb, see Materials and Methods), we found that 89% Frontiers in Genetics | www.frontiersin.org of the chromosome was covered, on average, per accession ( Table 1). We used the Chinese Spring sample as a positive control and found that 4% of these regions were not covered, comprising 47 gene models. We discarded them from further analyses. The coverage ranged from 85 to 92% for the 44 other accessions, revealing that chr3B has not been affected by a large introgression, translocation, or deletion in any accession. No correlation was found between the coverage and depth of coverage (DoC; R = 0.26, P > 0.05), meaning that depth is high enough to avoid bias in SV detection (Additional file 1: Supplementary Figures S1-S2).

SNP Diversity and Haplotypes Across chr3B
After read mapping, we called single-nucleotide polymorphisms (SNPs) in the non-repeated space of chr3B. In addition, we also searched for SNPs in the TE space by analyzing reads that were mapped uniquely onto markers called ISBPs (insertion site-based polymorphisms; adapted from Rimbert et al., 2018). For that, we extracted 150 bps spanning the junction between a TE and its flanking sequences at both 5 and 3 ends. ISBPs are unique k-mers that can be used to map reads unambiguously (Additional file 1: Supplementary Figure S3). In total, 605,443 SNPs were detected (transition/transversion ratio: 2.33). It ranged from 41,255 for East Asian T. aestivum to 165,932 for T. dicoccoides, and 74% of SNPs were shared by at least two accessions. The SNP density was higher in the centromere proximal region (called "C") for all accessions except for the East Asian T. aestivum lines that were almost devoid of SNPs within a 250-Mb region encompassing the centromere (Figure 1 and Additional file 1: Supplementary Figures S4, S5). These results agree with recent results showing a clear separation between Asian and European T. aestivum accessions (Balfourier et al., 2019). Minor allele frequency (MAF) was lower in the C region, indicating an enrichment in rare alleles (Figure 1).
We computed haplotype blocks based on linkage disequilibrium (LD) between SNPs. This highlighted strong differences along the chromosome with large blocks in the central part of the chromosome and small blocks in the distal regions (Figure 1). This is in accordance with the recombination pattern observed for this chromosome (Choulet et al., 2014).

Structural Variations Affecting TEs
TEs account for 84% of chromosome 3B sequence with 216,378 elements belonging to 482 families. To assess the level of variability of the TE space, we designed 309,462 ISBPs (that are uniquely mappable) representing the extremities of 211,494 TEs (i.e., 98% of all chr3B TEs, Additional file 1: Supplementary Figure S3). Read mapping revealed that 186,611 ISBPs (60%) designed from 154,012 TEs (73%) were absent in at least one accession of the panel. These results highlight the extreme variability of the TE space since more than half of the 700 Mb of TEs are not conserved among the Triticum genus. This variability differs strongly between individuals and species, with TE inDels ranging from 7% in Tae_sirazaiaibarigi_2 to 24% in Synthetic-w7984. When only considering the three closest accessions from CS (East-Asian T. aestivum), inDels affect 7-8% of the TEs. Such polymorphisms were mainly shared by at least 2 accessions (155,830/186,611, 84%) and shared within each species (from 63% to 79% of shared polymorphisms, Additional file 1: Supplementary Table S2).
For 193,054 TEs (91%), we were able to design ISBPs at both 5 and 3 ends, therefore allowing us to trace the complete (two ends) vs partial (one end) absence of the element. Out of these, absence of both ends was observed for 38% (75,089). This revealed that SVs are mainly due to small deletions covering TEs only partially, rather than large deletions. This also revealed inDels that are due to recent TE insertions corresponding to cases where a TE is conserved among a monophyletic group while totally absent in all other accessions. We found 1,445 TEs conserved only among T. aestivum accessions, 574 (338 Gypsy, 60 Copia, 37 unclassified LTR-RTs, 7 LINEs, 117 DNA transposons, and 15 unclassified) conserved only among East Asian accessions, and 76 (42 retrotransposons, 30 DNA-transposons, and 4 unclassified) specific to CS. Extrapolating these values to the whole genome, one could expect roughly 29,000 new TE insertions since the hexaploidization, 10,000 years ago and 1,500 new TEs inserted in CS since it diverged, very recently, from Tae_Nanking_No25, Tae_Fukuhokomugi, and Tae_Sirazaiabarigi_2. We did not find any TE family contributing massively to these recent insertions, showing that no TE burst occurred recently. Thus, we conclude that the TE space is highly variable between Triticeae accessions, with a core-TE space restricted to 40% of the studied loci, mainly because of small deletions rather than to transposition.
In addition to the ISBP-based approach, we established a second approach in order to estimate the variations in copy number for the 482 TE families by analyzing the depth of coverage. We mapped reads on the whole CS genome, focused only on reads mapped inside TEs, and weighed read counts according to multiple mapping (see Materials and Methods). The main result is that, at the TE family and subfamily levels, no strong difference in copy number was detected for any of the 45 accessions (Additional file 1: Supplementary Figure S6). No massive amplification of any TE family was detected. The biggest difference observed here was a three-fold increase of the Mutator family DTM_famc6 in Tdi_45309 compared to CS. Among T. aestivum, the three accessions Nachipundo, Tom Thumb, and Hope exhibit some differences of TE composition, although it concerns a limited number of families (Additional file 1: Supplementary Figure S6). Tom Thumb and Hope are known to originate from a hybrid cross, which might explain their slightly different TE composition (see paragraph "Phylogenetic analysis"). Nachipundo originates from Nepal, and it was previously shown that Nepalese accessions are clearly distinct (Horvath et al., 2009). Six gypsy subfamilies have an increased copy number in Nachipundo compared to CS: Danae (RLG_famc9.3), Nusif (RLG_famc4.1, RLG_famc4.2, RLG_famc4.3), RLG_famc24, and RLG_famc25. FIGURE 1 | Distribution of the SNP density, minor allele frequency (MAF), and haplotype blocks across chromosome 3B. SNP density is represented as the number of SNPs per kb of uniquely mappable sequences. Distal (R1-R3), interstitial (R2a-R2b), and centromeric/pericentromeric (C) regions of chr3B are represented as defined in Choulet et al., 2014. The "C" region exhibits higher SNP density, lower MAF, and larger haplotype blocks (max = 111 Mb) than the other regions.

Structural Variations Affecting Genes and Identification of Genes Absent From CS
We analyzed the variation of the depth of coverage (DoC; Figure 2A) across genes and their surrounding low copy sequences in order to identify potential CNVs (see Materials and Methods). This led to identifying 761 genes (i.e., 13% of the 6,082 gene set) potentially affected by upCNVs (i.e., increased number of copies compared to CS). To add a layer of evidence for the existence of these upCNVs, we searched for the presence of pseudo-heterozygous SNPs in the mapping data. Indeed, since wheat accessions are highly homozygous, the presence of heterozygous SNPs supports the existence of slightly divergent paralogous copies ( Figure 2B). Indeed, pseudo-heterozygous sites were highly enriched in the 761 candidates compared to other genes (34% vs 1%, respectively). By combining DoC increase and presence of pseudo-heterozygous SNPs, we ended up with 210 genes affected by upCNVs. They represent on average 21 (0.3%) genes per accession, ranging from 7 to 49, and the vast majority of them (70%) were shared between at least two accessions ( Table 2 and Additional file 1: Supplementary Table S3). Among the 210 upCNVs, analysis of amino acid changes observed in duplicated gene copies revealed that 31 exhibit mutations with a strong effect (start codon loss or new stop codon), 140 have variations that only moderately affect the protein (missense variants), and 39 have only mutations that are synonymous or The numbers of genes affected by CNVs and PAVs are indicated for each species. "Unique" indicates the number of SVs that are present in only one accession of the species. "Shared" and "All" indicate the number of SVs that are shared by at least two accessions or shared among all accessions, respectively. Tde, T. dicoccoides; Tdi, T. dicoccum; Tca, T. carthlicum; Tdu, T. durum; Tma, T. macha; Tsp, T. spelta; Tae, T. aestivum.
within UTRs. We also found 15 cases of genes with potential altered splicing. For downCNVs (decrease in copy number compared to CS; Figure 2C), we first identified duplicated genes in CS. Clustering of the 6,129 CS genes revealed the presence of 443 duplicated genes belonging to 191 clusters. Among them, 250 genes (56% of the duplicated genes; 4% of the total gene set) exhibit a lower copy number in at least one accession of our panel (  Table S3). This analysis led to identifying two genes (TraesCS3B01G011000: putative zinc ion binding protein, and TraesCS3B01G059600: putative D-Ala-D/L-Ala epimerase) that were duplicated only in CS.
For gene PAVs, we considered a threshold of 10% of the gene length covered. In total, 781 genes were affected by PAVs, representing 14% of chr3B single-copy genes. They represent on average 145 genes (2%) per accession, ranging from 42 to 217 (0.7-3.5%; Table 2; Additional file 1: Supplementary Table S3). As expected, East Asian T. aestivum accessions exhibit the lower rate of variations with 75 genes (1%) showing PAVs, while T. carthlicum exhibit the highest: 184 (3%) genes. Only one gene is absent in all tested accessions, meaning it is specific to CS: a cupredoxin-encoding gene (TraesCS3B01G482300).
A de novo assembly of the unmapped reads was conducted for each sample to identify genes that are absent from the reference. We retrieved between 695 and 11,173 contigs (>500 bps) per dataset representing between 620 kb and 7.96 Mb (<1% of chr3B size). We searched for similarity with known proteins from related plants (see Materials and Methods) and applied a clustering step to detect genes present in several accessions. Hence, we identified 330 genes (possibly pseudogenes) absent from the reference CS but present in our diversity panel and shared by at least two accessions. They represent approximately 1% of the gene repertoire (Additional file 1: Supplementary Table S3).

The Wheat Pan-and Core-Chromosome 3B
We applied the concept of pan-and core genomes to describe the extent of variability among Triticum chr3B. Detection of PAVs, CNVs, and newly assembled genes revealed a pan-chromosome with 6,459 genes (6129 + 330) and a core genome of 5,134 genes, i.e., 79%. In total, 1,509 genes were variable in terms of presence/absence and copy number among the panel, i.e., 23% of the pan-chromosomes. Pairwise comparisons with CS revealed on average 293 variable genes (5%), ranging from 157 (2% for Tae_Nanking_No25) to 358 (6% for Tsp_Epeautre_de_L'Aveyron) (Additional file 1: Supplementary Table S3).
We observed that SVs more likely affect genes that were previously defined as non-syntenic , i.e., not conserved at a syntenic position with the related grasses (Brachypodium, rice, and sorghum). Gene Ontology (GO) term enrichment revealed that variable genes tend to be enriched in functions like protein phosphorylation and protein catabolic process ( Table 3). It also pointed out the over-representation of functions related to pathogen resistance: 103 receptor-like kinase genes (RLKs) and 62 nucleotide-binding domain and leucinerich-repeat-containing proteins (NLRs) are subject to CNVs in our panel. As an example, the two NLRs TraesCS3B01G081400 and TraesCS3B01G083800 are upCNVs in eight and nine accessions, respectively. While the two NLRs are single-copy genes in CS (given the criteria used for clustering, see Materials and Methods), the DoC fold change is between 2.3 and 3.5 for TraesCS3B01G081400 and between 3.0 and 6.5 for TraesCS3B01G083800. Combined with the presence of pseudoheterozygous SNPs, it showed at least one additional copy. Another example is the case of four Flowering Locus T-like genes (two pairs of duplicates: TraesCS3B01G007400-010100 and TraesCS3B01G015600-015200) that were downCNVs, thus, single-copy genes, in 14 and 33 accessions, respectively. The wellknown family of Flowering Locus T-like genes is a large family involved in the regulation of floral transition and developmental processes like fruit set, vegetative growth, stomatal control, and tuberization (Pin and Nilsson, 2012). In addition, we found another FLT-like gene (TraesCS3B01G489500) affected by upCNVs (in Tae_nachipundo and Tae_n67m2) and PAVs (in Tea_sirazaiaibarigi).
We also compared the expression pattern and histone marks of variable vs core genes using the available RNAseq data produced in 15 conditions  and CHIPseq data (3-leaf stage; IWGSC, 2018; Ramírez-González et al., 2018), both for CS. Variable genes tend to be expressed at a lower level in fewer conditions and tend to be targeted by facultative repressive epigenetic mark H3K27me3 (Additional file 1: Supplementary Figure S7). Altogether, these data are consistent with previous findings and confirm that, as observed in other species, Triticeae variable genes tend to exhibit features of genes related to the adaptation of the plant to its environment.

Distribution of Polymorphisms
We called PAVs across the non-repeated fraction of the chromosome by splitting the sequence into segments of 100 bps while excluding TEs except previously designed ISBPs. In total, this represented 1,593,666 ordered segments cumulating 134 Mb (see Materials and Methods). We then used a sliding window of 50 segments to define regions of varying size affected by a PAV (Figure 3). Between 626 and 2,372 PAVs were found for the 45 accessions. The average size of a PAV was between 29 and 70 kb. The largest region was 6 Mb absent in Tsp_Epeautre_noir_barbu_velu (from position 215,620,552 to 221,751,806 bps). These results highlighted the fact that most of the structural variability between these species/accessions originated from small deletion events occurring across the whole chromosome (Figure 3). We did not observe any large deletion, translocation, or introgression event in any of the tested accessions.
These data allowed us to study the distribution of SVs across the chromosome in regard to the partitioning previously mentioned for the Triticeae chromosomes (Choulet et al., 2014;Mascher et al., 2017;IWGSC, 2018). The SV density is much higher in the distal regions than in the proximal regions: 38 vs 8%, respectively (Figure 3). It also revealed the presence of one peak representing an increase of local PAV density in the central chromosomal region: 6 Mb at position 514-519 Mb, reflecting a single large region of CS absent in the genomes of East-Asian T. aestivum, T. macha, T. carthlicum, one T. spelta (Tsp_Epeautre_noir_barbu_velu), and one T. dicoccum (Tdi_ 45309). This region contains 18 genes of which a cluster of four duplicates (sharing >99% identity). This suggests a recent insertion of a DNA segment in the common ancestor of East-Asian T. aestivum.

Phylogenetic Analysis
The phylogenetic relationships between the 45 accessions were investigated using either SNPs or PAVs. SNP-based and PAV-based dendrograms were mainly consistent, although some minor differences were observed (Baker's gamma index = 0.97; Figure 4). Both dendrograms defined the same eight monophyletic groups corresponding to the seven Triticum species of our panel with T. aestivum split into two groups: East-Asian accessions (China and Japan) vs all other accessions, in agreement with a recent work on a large diversity panel of T. aestivum (Balfourier et al., 2019). The tree topology shows that the three hexaploid species originated from independent hybridization events and do not share a common tetraploid parent. Hexaploid Triticum do not form a monophyletic group: T. macha is branched with East-Asian T. aestivum, while T. spelta is branched with the tetraploids. We also identified four accessions that were classified in a species but for which the position in the tree suggested a recent hybrid origin (see Discussion): Tae_Hope was branched with T. dicoccum, Tae_Tom Thumb was related to T. durum, and Synthetic_w7984 and Tdi_415152 were also related to T. durum.

DISCUSSION
The complexity of the wheat genome makes it a challenge for molecular analyses, and it requires dedicated strategies to avoid misleading interpretations due to limitations of short read-based technologies in a highly redundant, polyploid, and large genome. Here, as a pilot study, we reduced the complexity through chromosome sorting, which allowed us to avoid multiple mapping of reads on homeologous gene copies, which share on average 97% nucleotide identity . Because of this complexity, most of previous studies focused on the gene space; i.e., 2% of the genome. Here, we obtained a global view of the variability at the whole chromosome scale, not only around genes but also for the 84% of TEs that shaped the chromosome. In addition, we selected a highly diverse panel of accessions, belonging to different species, with different ploidy levels. Indeed, it is well known that diversity within bread wheat is low because of the domestication and selection bottlenecks; thus, here we would like to get better insights into the extent of the potential reservoir of diversity present in the related species. We detected 1,179 variable genes (including CNV and PAV); i.e., 19% of the genes (or copies) sampled are not conserved among the seven Triticeae species. In addition, 330 genes not present in CS IWGSC RefSeq v1.0 were detected in several accessions. In total, variable genes account for 23% of chromosome 3B pangenome. In barley, a closely related Triticeae species, SVs were estimated to affect 15% of sequences and 9% of genes (Muñoz-Amatriaín et al., 2013), which is in the same range that what was found here. Thus, our results suggest that polyploidy did not trigger a significant boost in variability.
By comparing wheat to other grasses, we previously showed that distal chromosome regions evolve at a higher rate than the central region and concentrate most of the recently duplicated genes . Here, we demonstrate that this is also true at the intraspecific level for wheat, agreeing with several recent studies showing that large genomic variation mostly affects the distal regions (Rimbert et al., 2018;Balfourier et al., 2019;Gutierrez-Gonzalez et al., 2019). This genome partitioning was also the main outcome of the barley genome sequence analysis (Mascher et al., 2017), suggesting that this is a common feature to the Triticeae which share a genome architecture under constraint. This partitioning is associated with the higher frequency of meiotic recombination in the distal regions with fast-evolving regions being the regions where most of crossovers occur (Darrier et al., 2017). By building haplotypes, we confirmed the presence of a very large haplotype block of 111 Mb, which includes the centromere, showing that recombination did not break the genetic linkage in this region. In contrast, haplotype blocks are very small in the distal regions showing that the most variable regions are also the regions that recombined in the recent wheat history. This organization might have been selected for its efficiency to adapt to a changing environment. This is also in line with the findings that genes involved in the response to environment are the most variable. Our results on a diverse panel of Triticum accessions confirmed these observations on fewer accessions (Gutierrez-Gonzalez et al., 2019).
At the intraspecific level, for T. aestivum, we found 16% of dispensable/variable genes. Extrapolating this to the whole genome (i.e., 107,891 predicted genes), we can estimate a core genome of 90,628 genes and, on average, 17,263 variable genes. The difference with previously reported 64% (82,725) core genes predicted in Montenegro et al., 2017, may be explained by the use of higher depth of coverage and more stringent criteria to call an SV in our study.
If the diversity affecting genes has temptingly been estimated in previous pilot studies, the extent of variability of the repeat fraction of the genome was only poorly characterized. Here, at the scale of a complete chromosome, we show that the variability of the TE space is high: more than 60% of the ca. 200,000 elements shaping chromosome 3B of CS are either partially or totally absent in the related accessions. However, the number of new insertions detected either in CS or even in Asian T. aestivum (574 TEs) represents less than 1% of the TE content of the chromosome. Thus, it is obvious that TE transposition is not the main factor of this variability. This was confirmed by the strong conservation of the TE copy number estimated for each family. Indeed, only a few families are present in different proportions between accessions. This supports previous findings suggesting that the TE composition is highly controlled since A-B-D divergence, with no major TE burst . Most of SVs affecting TEs originate from deletions, agreeing with previous conclusions made based on MITEs (Keidar-Friedman et al., 2018).
We used SV data to study the phylogenetic relationships between Triticeae. The tree topology suggests that hexaploid species originated from independent hybridization events and do not share a common tetraploid parent. Indeed, T. macha is branched with East-Asian T. aestivum, confirming previous hypotheses stating that T. macha originated from a hybrid cross between T. aestivum and wild emmer wheat (Matsuoka, 2011). Similarly, the position of T. carthlicum is consistent with previous findings indicating that it originated from a cross between domesticated emmer wheat and T. aestivum (Matsuoka, 2011). For T. spelta, it was proposed previously that it likely originated from an introgression of tetraploid wheat into hexaploid wheat (Blatter et al., 2004), which is in agreement with our phylogeny. Our study also revealed four cases where an accession classified in a species is more related to another species. Tae_Hope was branched with T. dicoccum; this can be explained by the fact that Hope has a recent parent which is an emmer wheat (INRA). Synthetic_w7984 is branched with T. durum, which is explained by the fact that it is a synthetic accession obtained by hybridization of a T. durum accession with Ae. tauschii (Sorrells et al., 2011). Tdi_415152, while being classified as a T. dicoccum, appeared to be a spontaneous hybrid between T. dicoccum and an unknown T. durum, explaining its clustering in the T. durum clade (INRA). Finally, Tae_Tom Thumb position among T. durum species suggests a similar scenario, with a potential hybrid origin from T. aestivum and T. durum. These results on multiple events leading to polyploid species provide information for future redefinition of species among the Triticeae and show the difficulty to trace the phylogenetic relationships between species in the Triticum clade due to recurrent hybridizations.
We established a strategy dedicated to complex genomes based on chromosome flow-sorting, Illumina resequencing, and fine-tuned bioinformatics approaches, to get highly accurate SVs affecting genes and the TE space at a wide scale of diversity. This study establishes the foundation for future pangenomic analyses at a larger scale and provides insights into the complex phylogenetic relationships between Triticum species.

Flow Sorting and Sequencing Chromosome 3B
Flow cytometry was used to sort chromosome 3B DNA of the 45 accessions following the protocol detailed in Vrána et al. (2000). Sorted chromosomal DNA was then amplified as previously described in Šimková et al. (2008) to get sufficient amount of DNA for sequencing. Each DNA sample was then sequenced with Illumina HISEQ-2000 using 2 × 100 bp paired-end reads (ENA project number PRJEB31708). For Chinese Spring, we used Illumina reads produced on flow sorted chromosome 3B DNA previously available [(Choulet et al., 2014); ENA project number PRJEB7360; Illumina runs ERR634934, ERR634939, ERR634942, ERR634936].

Read Mapping and Quality Assessment
Reads obtained for the 45 samples were mapped using BWA-MEM v0.7.12 (Li and Durbin, 2010) on the reference genome sequence of T. aestivum cv. Chinese Spring: IWGSC RefSeq v1.0 which includes the latest high-quality assembly of chromosome 3B pseudomolecule and the annotation of gene models IWGSC RefSeq Annotation v1.0 (IWGSC, 2018). We used samtools/1.3.1 (Li et al., 2009) to filter the alignments and bedtools/2.26 (Quinlan and Hall, 2010) to calculate the depth of coverage. A mapping quality (mapq) threshold of 11 (samtools view −q 11) was applied to filter out multimapped reads.
The coverage (i.e., proportion of chr3B covered by reads) and the depth of coverage (DoC) were calculated by considering the number of reads mapped with mapq 11 on the nonrepeated fraction of the chromosome 3B (including genes, nongenic low-copy sequences, and ISBPs) accounting for 134 Mb. We used a DoC threshold of 5× to consider that a region was covered. To check for potential translocation between chromosomes 3B and 2B, we searched for reads mapped (mapq 11 and DoC > 5×) over at least 50 kb on chromosome 2B. To estimate the extent of local DoC variations across chromosome 3B, we used ISBPs conserved in all 45 accessions. First, we filtered out reads that were unmapped, not primary aligned, with supplementary alignments, and with quality <11 (samtools view −F 2308 −q11). Then, we discarded PCR duplicates (samtools rmdup). Finally, we counted reads that overlap TE insertion sites (i.e., unique k-mer) over at least 20 bps on each side of the junction using "bedtools coverage" (parameters: −f 0.6 −F0.9). To call SVs and SNPs, we filtered out reads using samtools view (−F 2308 −q11) and PCR duplicates using samtools rmdup.

SNP Calling
SNPs were called on non-repeat fraction of chr3B (134 Mb) using bam files generated at the previous step and with samtools mpileup (options: −C 50 −q 30 −Q 20) and Varscan2/2.3.9 (options: mpileup2snp -min-coverage 10 -min-reads2 10strand-filter 1 -min-freq-for-hom 0.85 -min-var-freq 0.02; Koboldt et al., 2012). For diversity analysis and haplotype assessment, we selected only SNPs of class 1 and 2; i.e., SNPs for which the 45 genotypes are homozygous (alleles AA or BB) or with the two homozygous and the potential heterozygous alleles (AA, AB, and BB). Then, haplotype blocks were computed using plink (parameters: -blocks no-pheno-req -allow-no-sexblocks-max-kb 500000 -indep-pairwise 500000 5 0.99; v1.90b6.2; Purcell et al., 2007) on the previously generated VCF file. Based on the N50 of the haplotype block size, large haplotype blocks were considered as >3.1 Mb. To validate the presence of an upCNV, we searched for the presence of pseudo-heterozygous SNPs as evidence for the presence of several slightly divergent copies of the same gene, whose reads map one single gene in the reference. To discriminate pseudo-heterozygous SNPs from true residual heterozygosity, we considered only cases where all homozygous accessions share the same allele (no homozygous variant).

Phylogenetic Analyses
To get robust markers, we first pruned SNPs using plink (parameters: -blocks no-pheno-req -allow-no-sex -blocks-maxkb 500000 -indep-pairwise 500000 5 0.99; v1.90b6.2; Purcell et al., 2007) while keeping undetermined values (NA) in our genotyping matrix. We then used R package pvclust on both SNP and PAV matrix to perform a hierarchical clustering (function: pvclust(), distance: simple matching, clustering: Ward.D2, Bootstrap: 10000). Dendrograms were then compared and aligned using R package dendextend (Galili, 2015). Entanglement score estimated the quality of the dendrogram alignment, zero being perfectly untangle (SNP tree vs PAV tree = 0.01). To evaluate the similarity between the two dendrograms, Baker's gamma index (Baker, 1974) was calculated using dendextend function cor_bakers_gamma. We then performed permutation test for the calculation of the statistical significance of Baker's gamma index. For that, we compared (i) the SNP dendrogram to itself (Baker's gamma index = 1, i.e., theoretical perfect case of identical trees comparison), (ii) SNP dendrogram to a dendrogram obtained after random shuffling of the SNP dendrogram (iteration = 100, mean Baker's gamma index = 0.005), and (iii) SNP dendrogram to PAV dendrogram (Baker's gamma index = 0.97).

Detection of CNVs and PAVs Affecting Genes
We used the High Confidence gene models predicted in the IWGSC RefSeq v1.0 Annotation as a reference gene annotation of chromosome 3B (IWGSC, 2018). We used samtools to analyze read mapping data in regard to gene positions. We used the available annotation of TEs and bedtools "complements" function in order to create a non-repeated chromosome 3B sequence of 134 Mb (including non-coding low-copy DNA, genes, and ISPBs). Reads mapped on the nonrepeated fraction of chromosome 3B were then filtered using samtools view: reads unmapped, not primary aligned, with supplementary alignments, and mapped with quality under 11 were filtered out (samtools view −F 2308 −q11). PCR duplicates were also filtered out using samtools rmdup. Read depth and coverage were finally calculated using bedtools coverage. Forty-seven genes not covered after mapping CS resequencing dataset were filtered out, leaving 6,082 genes that could be analyzed.
Genes were considered absent (PAV) when covered over less than 10% of their length, while considering only coding exons. For the detection of CNVs affecting genes, we calculated a normalized ratio between the read count for each gene in accession X and the read count in Chinese Spring. First, we normalized the read counts based on the proportion of the length of the gene covered by reads. Second, for each accession, we normalized read counts based on the total number of reads mapped on all chromosome 3B genes. Finally, we computed a log2ratio of the normalized read counts for each gene between accession X and Chinese Spring. For each accession, a boxplot of the 6,082 gene log2ratios was computed and we used the R native function boxplot upper and lower whiskers as thresholds to identify potential genes with up-or downCNVs, i.e., DoC outliers/extreme values as CNVs are expected to be the outliers. Normalizations and statistics were computed using homemade Perl and R scripts. In addition, pseudo-heterozygous SNPs were used to detect the presence of upCNVs (multiple slightly different gene copies). They were obtained after the SNP calling using samtools mpileup and Varscan2 mpileup2snp (v2.3.9, options: -mincoverage 10 -min-reads2 10 -strand-filter 1 -min-freq-for-hom 0.85 -min-var-freq 0.02). We selected only SNPs of class three (see paragraph "SNP calling") as evidence for upCNVs. For downCNVs, we added a detection of highly duplicated genes across the CS chromosome 3B in order to build a list of candidate genes that might be present in fewer numbers of copies in other accessions. For that, we clustered chromosome 3B protein sequences with CD-HIT (Li and Godzik, 2006) using the following parameters: −c 0.95 (i.e., 95% identity), −s 0.5 (i.e., 50% maximum gene size difference between the shortest and largest gene in the cluster), and −g 1 (more accurate mode).

Detection of TE inDels and CNVs
We developed a Perl script (getISBPfromBed.pl; available at 1 ) in order to extract 150 bps encompassing the junction 1 https://github.com/fchoulet/biotools between every TE predicted across CS chromosome 3B and its insertion site, each corresponding to a potential 5 and 3 ISBP marker. In case of major overlap between two neighbor ISBPs (>99 pbs), we kept only one ISBP to avoid redundancies. ISBPs containing one or more Ns were filtered out. In addition, we discarded non-unique ISBPs (detected by mapping ISBP sequences using BWAmem). These later represent typical cases where a TE was predicted as two or more adjacent fragments, leading to design ISBPs that are fully included into an element and, thus, are repeated. TE inDels were assessed by mapping reads on the set of ISBPs while keeping only 100-bp reads fully mapped with at maximum 2 mismatches, meaning that the junction between the TE extremity of the insertion site is covered.
CNVs affecting TEs were assessed by comparing read depth for a given TE family in CS and in other accessions. We mapped reads with BWA-mem on IWGSC RefSeq v1.0, including TEs, and selected all best alignment(s) for each read. We normalized the read counts based on the length of the read mapped and based on multiple mapping; i.e., when a read mapped two TE copies, we counted 0.5 per copy. For each TE copy in each accession, we calculated a cumulative weighted read count and calculated a fold change compared with CS. Only TEs covered in CS for at least 50× were analyzed.

Detection of PAVs and CNVs Affecting the Non-coding Non-repeated Fraction of Chromosome 3B
We built a BED file as a template to call PAVs only in genes, non-coding low-copy DNA, and TE extremities (ISBPs). We retrieved the positions of low-copy regions (including genes) as the complement to the TE positions. These segments were split into 100-bp windows, while discarding all segments smaller than 51 bps. We then added the positions of the complete list of 150 bp ISBPs to the BED file. Altogether, these regions account for 134 Mb split into 1,593,666 loci of 100/150 bps. Presence/absence of each locus was computed by analyzing read mapping data as described above. This BED file was also used to detect deletion blocks. Blocks were defined using a sliding window of 50 neighbor loci (with step = 5). Large deletion blocks were considered when at least two neighbor windows with more than 50% loci are absent. Distribution of SVs across the chromosome was analyzed in a sliding window of 10 Mb (step = 1 Mb), and segmentation analyses were computed with R package changepoint (penalty: "BIC, " penalty value: 0.05, method: "SegNeigh;" Chen and Gupta, 2012).

Assembly of Unmapped Reads and Gene Prediction
We used samtools to select unmapped reads (−F4) and to create 2 paired-end and 1 single end fastq files (−f 64, −f 128). We assembled these reads independently for each accession using