Relationships and genome evolution of polyploid Salix species revealed by RAD sequencing data

Despite the general progress in using next generation sequencing techniques for evolutionary research questions, the analysis of polyploid species is still hampered by the lack of suitable analytical tools and the statistical difficulties of dealing with more than two alleles per locus. Polyploidization and especially allopolyploidy leads to new combinations of traits by combining genomes of two or more parental species. This enhances the adaptive potential and often results in speciation. However, multiple origins of polyploids, backcrossing to the parental species and post-origin evolution can strongly influence the genome composition of polyploid species. Here, we used RAD sequencing, which revealed 23,393 loci and 320,010 high quality SNPs, to analyze the relationships and origin of seven polyploid species of the diverse genus Salix by utilizing a phylogenomic and a network approach, as well as analyzing the genetic structure and composition of the polyploid genome in comparison to putative parental species. We adapted the SNiPloid pipeline that was originally developed to analyse SNP composition of recently established allotetraploid crop lineages to RAD sequencing data by using concatenated RAD loci as reference. Our results revealed a well-resolved phylogeny of 35 species of Eurasian shrub willows (Salix subg. Chamaetia/Vetrix), including 28 diploid and 7 polyploid species. Polyploidization in willows appears to be predominantly connected to hybridization, i.e. to an allopolyploid origin of species. More ancient allopolyploidization events involving hybridization of more distantly related, ancestral lineages were observed for two hexaploid and one octoploid species. Our data suggested a more recent allopolyploid origin for the included tetraploids within the major subclades and identified putative parental taxa that appear to be plausible in the context of geographical, morphological and ecological patterns. SNiPloid and HyDe analyses disentangled the different genomic signatures resulting from hybrid origin, backcrossing, and secondary post-origin evolution in the polyploid species. All tetraploids showed a considerable post-origin, species-specific proportion of SNPs. The amount of extant hybridization appears to be related to the degree of geographical and ecological isolation of species. Our data demonstrate that high-quality RAD sequencing data are suitable and highly informative for the analysis of the origin and relationships of polyploid species. The combination of the traditional tools RAxML, STRUCTURE, SplitsTree and recently developed programs like SNAPP, HyDe and SNiPloid established a bioinformatic pipeline for unraveling the complexity of polyploid genomes.

Eocene and started to diversify in the middle Eocene (Wu et al. 2015). Previous studies 126 proposed that ancient genome duplications occurred several times in Salix (Dorn 1976; 127 Leskinen and Alström-Rapaport 1999) and in the Salicoid lineage (Tuskan et al. 2006). 128 Paleopolyploid species are more difficult to recognize, because their cytological behaviour 129 has undergone 'diploidization', i.e. they have returned to regular chromosome pairing at 130 meiosis as typical for diploid sexual species (Comai 2005). Therefore, polyploidization events 131 can be best understood in the phylogenetic framework of diploid congeneric species. Recent Salix) from the mostly shrubby species of a big clade uniting the two subgenera Chamaetia 160 and Vetrix, but lacked any infrageneric resolution (Wu et al. 2015). The diversification of this 161 clade started probably in the late Oligocene (Wu et al. 2015). We are mainly interested in the 162 evolution of these shrub willows, which range from creeping arctic-alpine dwarf shrubs to 163 medium-sized trees ). In the Chamaetia/Vetrix clade, which comprises 164 about three quarters of all Salix species, more than one third of all species with known ploidy 165 levels are tetra-, hexa-or octoploids. An analytical toolkit for the discrimination of polyploid 166 willows has been established, based on the combination of SSR markers and flow cytometry, 167 but was only used to detect the ploidy level for breeding purposes (Guo et al. 2016). In Salix. 187 In this study we want to test the utility of RAD sequencing data for the analysis of 188 polyploid species with an emphasis on allotetraploids in an empirical dataset of diploid and 189 polyploid shrub willows. We will use a comprehensive approach of different traditional and For this study, we sampled 28 diploid species, one triploid, three tetraploid, two 201 hexaploid and one octoploid species representing 19 sections sensu Skvortsov (1999) Table S1. Tween 20) at 4 °C, and then chopped with a razor blade. After incubating for 10 minutes on 222 ice, the homogenate was filtered through a 30 µm nylon mesh. Then the suspension was 223 centrifuged at 12.5 × RPM at 10°C for 5 minutes. This step was repeated two to three times 224 for some samples until pellet of nuclei showed up. The supernatant was discarded and the 225 nuclei re-suspended with 200μL Otto I buffer. Before the samples were analyzed, 800 μL Otto 226 Ⅱ (0.4 M Na2HPO4.12H2O) containing DAPI (4′-6-diamidino-2-phenylindole, 3 μg mL-1) 227 was added and incubated for 30 minutes in the dark to stain the nuclei. The histograms for 228 each sample were evaluated with FloMax V2.0 (Sysmex Partec GmbH, Münster, Germany).

229
Salix caprea with known ploidy level (2x=2n=38) was used as an external standard. To re-230 establish the G1 peak position the standard was used after every tenth sample. The ploidy 231 level was roughly calculated as: sample ploidy = reference ploidy × mean position of the G1 232 sample peak divided by the mean position of the G1 reference peak.  The statistics for the tested settings are summarized in Supplement Table S2. Finally, 257 the m40 dataset (29.06% missing data) was used for phylogenetic approaches and the m100 258 dataset (9.19% missing data) for the genetic structure analysis. The seven steps of the ipyrad 259 pipeline were also performed for each clade that contains one or more polyploid samples to 260 find the optimal number of shared loci and SNPs, respectively, specific for each clade. 261 We inferred phylogenetic relationships on concatenated alignments of the complete  Because we used a concatenated data set of short loci, it was not possible to analyse them as 265 separate partitions. However, Leaché et al. (2015) have shown that it is preferable to use full 266 sequence information instead of only SNPs, because this might lead to branch length and 267 accuracy bias. 268 We performed for each analysis a rapid bootstrapping analysis with 100 replicates 269 using the -f a option, which searches for the best-scoring tree. The resulting trees of all       Genetic structure and network analyses 306 In order to test for an influence of reticulate evolution on the genetic composition of 307 the included species, especially the polyploids, we explored the genetic structure of each 308 sample for both the complete dataset and each clade separately by using the unlinked SNPs 309 data. They consist of one SNP per locus and represent therefore independent loci. As a side 310 effect, the size of the dataset is decreased and therewith more suitable to the computational 311 power needed by the used program. To avoid bias by too much missing data, we used a 312 dataset composed of loci shared by more than 90% of individuals (m100). The clade specific 313 analyses were performed using the results of the clade specific ipyrad runs. Additionally, we  suggest using the recessive allele flag for polyploid samples that will consider genotypic 319 ambiguity that might be present if there are more than two alleles. However, there is no way to enter data from species with different ploidy levels into Structure. Here, we circumvent the 321 challenge of analysing different ploidy levels at the same time. The allelic information of a 322 polyploid sample was summarized to a consensus sequence during the ipyrad pipeline. In the 323 end, diploids as well as polyploids will look the same in the Structure input file (e.g. the four 324 alleles "AATT" will be reduced to "AT"). This way we simplified the complex data and could 325 combine diploids and polyploids in the same analyses. Some information might be lost 326 especially if more than two alleles are originally present (e.g. "AGTT") or the dosage of 327 alleles is unknown. However, the huge number of informative SNPs that remain in the data 328 set enabled us to analyse the genetic structure. We chose a burn-in of 10,000 and a MCMC of 329 100,000 replicates, with three iterations of each value of K (K=number of genotypic groups).

330
A higher amount of replications would be preferable; however, the limits in computational 331 power forced us to reduce the number of replicates per iteration. After an initial test run on a 332 smaller dataset, the range of K was set from 2 to 7 in the overall analysis. For the subclades, genomes. This hints to a mutation that occurred after the polyploidization event. Finally, cat 5 409 corresponds to putative homeo-SNPs, i.e., the tetraploid is heterozygous for homeologous 410 alleles of both parental genomes. The remaining portion of SNPs is categorized as "other" and 411 contains SNP combinations that do not fall into the four categories mentioned above. To use the advantages of this pipeline and to endeavour the SNP composition of the 418 tetraploids in our dataset we adapted the SNiPloid pipeline to RAD sequencing data (Fig. 1). 419 We initially tested a recent diploid hybrid of S. helvetica and S. purpurea (Gramlich et al. 420 2018) and revealed 31% Cat 5 SNPs, 6% Cat 1, 11% Cat 2 SNPs, and 23% of Cat 3/4 SNPs, 421 confirming the power of the pipeline. For our study, we utilized a pseudoreference using the   Table S2). The HyDe results of the m40 as well as four accessions of S. triandra as outgroup (Fig. 2, Supplement Fig. S2). The

517
In the RAxML phylogeny, clade I was divided into two subclades (Fig. 2). The supported SNPs did not fall into the given categories and are treated as 'others' (Fig. 5b).

558
Genetic structure analysis for the subclade for the most likely value of K=4 based on 37,867 unlinked SNPs.  (Fig. 2). 579 The HyDe analysis of the unlinked SNP data tested 5,313 combinations and revealed SNPs did not fall into the given categories (Fig. 5d). Salix laggeri was situated in sister position to the remaining species of section Vetrix 608 in the clade-specific analysis. The NeighbourNet showed S. laggeri in close relationship to S. 609 appendiculata and S. caprea (Fig. 4). The genetic structure analysis of subclade IIa for K=3 610 revealed a similar genetic composition as S. appendiculata, with some admixture of the S. 611 caprea specific partition (Fig. 4). The SNiPloid analysis with S. caprea and S. appendiculata 612 as putative diploid parents revealed 5.6% homeoSNPs and about 12.2% interspecific SNPs 613 (5.3% cat 1, 6.9% cat 2). About 36.6% of SNPs were shared with one allele of one parent (cat 614 3/4), while 45.9% of SNPs were not categorized (Fig. 5c).  Fig S3). However, in the clade-specific structure 629 analysis for the most likely K-value of five genetic clusters (Supplement Fig S3), S. bicolor   This method was developed for species tree estimation based on SNP data to circumvent the 691 generation of gene trees. Reduced representation methods generate thousands of biallelic 692 SNPs in short loci that are too small to calculate gene trees. Thus, SNAPP worked well for the SNP data generated with ipyrad in our project. However, since the RAxML analysis already 694 revealed a quite robust tree topology, we used this time-consuming coalescence approach 695 only for the subclade data. Our results confirmed that the coalescent approach using RAD 696 sequencing data is suitable for species delimitation, as recently shown by Brandrud et al.

697
(2019) on Dactylorhiza. 698 To overcome the drawbacks of bifurcating tree topologies for understanding reticulate 699 relationships, we used the informative SNPs to reconstruct networks and to infer genetic  Here, we decided to use the specific unlinked SNPs output of the ipyrad pipeline for the 707 genetic structure analyses with Structure. Because it is not possible to analyse mixed ploidy 708 levels, we circumvent this by condensing the information of all observed alleles into only two 709 alleles, which is automatically done by ipyrad. We decided to do this, first, because we were  origin. HyDe is a tool to detect hybridization in a given dataset. In this study, we used it to 723 identify potential parental taxa for the tetraploid species included. We observed that the 724 number of significant hybridization events depends on the size of the input data. We used the individuals. This is in contrast to the results of HyDe. We assume that the huge amount of 735 underlying data representing the whole genome is an explanation for the observed distinct 736 topology in our analyses, while HyDe uncovers non-visible introgression. Another 737 explanation could be that the amount of data cause significant "false" hybrid combination. 738 However, in this study we focused on the detection of putative parental species of the 739 tetraploids and not on the amount of hybridization in general. Here, we showed that it is While the higher polyploid willows are in sister position to major clades or on basal 793 branches, the tetraploids fall within these clades (Fig. 1). This could be due an older origin of on the putatively younger tetraploids that are situated at the terminal branches of subclades. 808 We wanted to test hypotheses on the putative Eurasian parental species, or, lineages, 809 respectively, that might have contributed to the formation and speciation of the tetraploids.   In willows, polyploidization appears to be predominantly connected to hybridization, 899 i.e. to allopolyploid origin, as hypothesized by Skvortsov (1999). These events happened in