A Genome-Wide Genetic Diversity Scan Reveals Multiple Signatures of Selection in a European Soybean Collection Compared to Chinese Collections of Wild and Cultivated Soybean Accessions

Targeted and untargeted selections including domestication and breeding efforts can reduce genetic diversity in breeding germplasm and create selective sweeps in crop genomes. The genomic regions at which selective sweeps are detected can reveal important information about signatures of selection. We have analyzed the genetic diversity within a soybean germplasm collection relevant for breeding in Europe (the EUCLEG collection), and have identified selective sweeps through a genome-wide scan comparing that collection to Chinese soybean collections. This work involved genotyping of 480 EUCLEG soybean accessions, including 210 improved varieties, 216 breeding lines and 54 landraces using the 355K SoySNP microarray. SNP calling of 477 EUCLEG accessions together with 328 Chinese soybean accessions identified 224,993 high-quality SNP markers. Population structure analysis revealed a clear differentiation between the EUCLEG collection and the Chinese materials. Further, the EUCLEG collection was sub-structured into five subgroups that were differentiated by geographical origin. No clear association between subgroups and maturity group was detected. The genetic diversity was lower in the EUCLEG collection compared to the Chinese collections. Selective sweep analysis revealed 23 selective sweep regions distributed over 12 chromosomes. Co-localization of these selective sweep regions with previously reported QTLs and genes revealed that various signatures of selection in the EUCLEG collection may be related to domestication and improvement traits including seed protein and oil content, phenology, nitrogen fixation, yield components, diseases resistance and quality. No signatures of selection related to stem determinacy were detected. In addition, absence of signatures of selection for a substantial number of QTLs related to yield, protein content, oil content and phenological traits suggests the presence of substantial genetic diversity in the EUCLEG collection. Taken together, the results obtained demonstrate that the available genetic diversity in the EUCLEG collection can be further exploited for research and breeding purposes. However, incorporation of exotic material can be considered to broaden its genetic base.


INTRODUCTION
Crop improvement relies on genetic diversity of plant genetic resources. A high genetic diversity provides an opportunity for plant breeders to develop cultivars with desirable characteristics (Savchenko, 2017;Byrne et al., 2018). Today's improved cultivars of various crops, with specific characteristics depending on their use and environmental adaptation, are the result of historical domestication events and intentional as well as unintentional selections by farmers and breeders (Bradshaw, 2017;Stoskopf et al., 2019). The processes of domestication and selection lead to an increased frequency of favorable alleles, and in the most extreme situation may cause complete fixation at genomic loci underlying beneficial traits in the genepool of the crop (Smýkal et al., 2018;Weigand and Leese, 2018). At chromosomal scale, the locus that underlies a beneficial trait is surrounded by other linked loci carrying neutral mutations. The selection process targeting the advantageous allele also results in an increase of the frequency of alleles at those neutral loci that are in linkage disequilibrium with the advantageous allele. This causes a socalled "selective sweep" in the genome. Genomic regions that have undergone a selective sweep are characterized by high levels of homozygosity, an increase in low-and high-frequency alleles, a high linkage disequilibrium and a low genetic diversity (Nielsen, 2005;Hufford et al., 2012).
Selective sweeps have been analyzed in several crops to reconstruct their history of domestication and selection and to identify the genetic loci and their underlying genome sequence that were mainly affected by these processes. For example, Hufford et al. (2012) identified 3,040 genes through selective sweep analysis, revealing signatures of domestication and improvement of maize in the United States. The dispersion history and adaptive evolution of wheat throughout the agroecological zones of China have been inferred by population genetics analysis including selective sweep analysis (Zhou et al., 2018). Ndjiondjop et al. (2019) identified 37 candidate selective sweep regions harboring genes related to biotic and abiotic stress tolerance in African rice, and demonstrated that those regions displayed low genetic diversity as a result of strong positive selection and domestication in African rice compared to Asian rice and its wild progenitor (Oryza barthii A. Chev.). A selective sweep analysis in barley by Pankin et al. (2018) provided molecular evidence of multiple domestication origins and allowed to distinguish domestication-related traits (i.e., nonbrittle rachis) from improvement-related traits (i.e., naked grain).
Modern cultivated soybean was domesticated approximately 5000 years ago from its wild progenitor Glycine soja, which is distributed throughout Eastern Asia, including most of China, South Korea, and Japan (Jeong et al., 2019). Soybean is the world's most grown high-value legume crop with beans containing high percentages of protein and oil (Pagano and Miransari, 2016). Being a restorative crop, soybean fixes atmospheric nitrogen in symbiosis with Rhizobium bacteria and delivers environmental services by minimizing the need for mineral nitrogen fertilizer. Between 2008 and 2018, global soybean production has grown from 212 to 337 million tons per year, while the total cultivated surface increased from 97 to 124 million hectares (IDH and IUCN NL, 2019). Soybean was first introduced to Europe during the second half of the 19th century. The current soybean acreage of 5.5 million hectares in Europe, representing a mere 3.4% of the world soybean production (FAOSTAT., 2019), can meet only 34% of the current European need for soybean (IDH and IUCN NL, 2019)). To meet the increasing European demand and to reduce the dependency on import, it is crucial to expand soybean cultivation and to adapt soybean genotypes to new cultivation zones in Europe. This requires a good understanding of the origin and genetic architecture of European soybean germplasm and how it relates to the germplasm from other origins.
Based on the responsiveness of soybean flowering and maturity to photoperiod and temperature, a total of 13 distinct maturity groups have been defined, of which only early maturing types (maturity groups MG000 to MGII) are suitable for cultivation in Europe (Kurasch et al., 2017). Previous studies have shown a narrow genetic base of the European soybean germplasm (Hahn and Würschum, 2014;Žulj Mihaljević et al., 2020), which can be due to the use of only a few ancestors originating from Canada, North America, Japan and China for breeding in Europe (Ristova et al., 2010;Hahn and Würschum, 2014;Miladinović et al., 2018). In addition, the original material used for breeding probably carried a low level of genetic diversity, as both pedigree and molecular marker data have indicated a narrow genetic base of North American and Canadian germplasm (Gizlice et al., 1996;Vaughn and Li, 2016;Bruce et al., 2019). In contrast, the Chinese soybean breeding pool contains a high level of genetic diversity because of a long history of cultivation over diverse eco-geographical zones with varying ranges of temperature and photoperiod (Liu et al., 2017). Selective sweep analysis has also been applied in soybean to understand the domestication and selection history. For example, Wen et al. (2015); Zhou et al. (2015) and Wang et al. (2016) report candidate selection regions harboring genes potentially involved in traits of agronomic relevance such as grain yield, seed size, flowering date, maturity date, seed protein and oil content and traits related to stress tolerance. In addition, selective sweep analysis by Jeong et al. (2019) reported domestication-related signals in soybean using mainly germplasm from Japan and Korea. Thus, available studies have considered materials of Chinese, Japanese and Korean origin. However, similar studies have not been performed in the soybean genepool available in Europe.
Our current knowledge on the origin and genetic relationships within European soybean germplasm is still fragmented. Main reasons are the limited number of accessions (covering only a fraction of the total genetic diversity) that were included in previous studies (e.g., 28 accessions in Ristova et al. (2010), 93 in Hahn and Würschum (2014), 75 in Kurasch et al. (2017) and97 in Žulj Mihaljević et al. (2020)), and/or the low number of genetic markers that were used for screening. Miladinović et al. (2018), genotyped 445 accessions at 85,000 SNP loci, but only used the 38 SNPs located in maturity genes for analysis, and focused on materials from one European breeding program. To fill this gap in our knowledge about the genetics of soybean germplasm relevant for breeding in Europe and to develop breeding tools for legume crops including soybean, a consortium was established within the European Union project EUCLEG 1 . In this context a unique collection of 480 soybean accessions considered relevant for European breeders, originating from 25 countries and covering a broad range of genetic diversity was assembled (named the EUCLEG collection in what follows). This offers a unique opportunity to compare the genetic diversity of the EUCLEG collection to that contained in reference materials from China, helping us to understand the main forces that have shaped the soybean genepool currently being used in breeding programs outside China.
Here, we present an analysis of the genetic diversity within the EUCLEG collection and identify selective sweeps through a genome-wide diversity scan between the EUCLEG and a Chinese soybean collection (Wang et al., 2016) ("NJAU collection"). Specific objectives of this study were: (i) to explore the structure and genetic relatedness of accessions in the EUCLEG collection; (ii) to determine the level of genetic diversity in the EUCLEG collection compared to that of the NJAU collection; (ii) to identify genomic regions that putatively underwent selective sweeps in the EUCLEG collection and their significance for future soybean improvement efforts in Europe.

EUCLEG Collection
The EUCLEG collection consists of 480 accessions belonging to maturity groups (MG) 000, 00, 0 and I/II, and includes 210 improved varieties, 216 breeding lines and 54 landraces. A detailed description of this collection is provided in Supplementary Table 1. This collection was made in collaboration with seed companies and gene banks by considering the accessions with diverse geographical origins, with varying levels of selection (landraces, varieties and breeding lines), and their relevance for European soybean breeders.
For DNA extraction, one fully developed unifoliate leaf was sampled per accession from plants grown in the field. Leaf samples were lyophilized and stored under vacuum conditions until use. Genomic DNA was extracted using the DNeasy R Plant Mini Kit 2 . Pure and good quality DNA samples with an average concentration of 106 ng/µl (range 24-731 ng/µl) were used for genotyping using the 355K SoySNP Axiom microarray (Wang et al., 2016) from Affymetrix (Thermo Fisher Scientific), via Eurofins, DK.

NJAU Collection
The NJAU collection originates from the Germplasm Storage of Chinese National Center for Soybean Improvement, Nanjing Agricultural University, China, and comprises 122 wild and 272 cultivated accessions. It covers the three ecological habitats of soybean in China including the regions of Northern China, Huang-Huai and Southern China. A full description of the NJAU collection is available in Wang et al. (2016). This collection has previously been genotyped using the 355K SoySNP microarray (Wang et al., 2016). Here, we have combined the raw microarray fluorescence data of the EUCLEG samples with the raw fluorescence data of the NJAU samples to perform a "joint" SNP calling.

SNP Calling
SNP calling was performed using the software Axiom Analysis Suite (AAS) from Affymetrix R , following the instructions provided in the Axiom Analysis Suite 3.1 user guide 3 . Before SNP calling of the 874 samples of the combined data set (EUCLEG and NJAU), we first checked the performance of the 355K SoySNP microarray on the 480 EUCLEG samples separately. This step was considered necessary, as the 355K SoySNP microarray was developed using the NJAU collection and might perform suboptimally with plant materials of a different origin. In brief, the Affymetrix R Power Tools (APT) software package, version 1.15.0 implemented in AAS performed sample quality control based on 20,000 non-polymorphic probe sets and considering the parameters Dish Quality Control (DQC; determines the intensity of contrast between signal and noise) and Sample Call Rate (QC-CR; refers to the ratio of genotype-called SNPs to attempted SNPs in a sample). Based on criteria DQC > 0.82 and QC-CR ≥ 97, AAS filtered out four poor-quality samples. The R package SNPolisher version 1.3.6.7 implemented in AAS was used for SNP calling using 609,883 probe sets targeting 355,595 SNPs. Its Ps_Classification function classified the SNPs/ probe sets into six categories based on the following SNP QC metrics: call rate (CR) ≥97%, Fisher's linear discriminant (FLD) ≥ 3.6, heterozygous strength offset (HetSO) ≥−0.1, and homozygote ratio offset (HomRO) ≥0.3 for one-cluster or twocluster SNPs or ≥−0.9 for three-cluster SNPs. A summary of the SNP classification was obtained for the 476 good quality samples of the EUCLEG collection. We compared this summary with the SNP classification summary obtained from the NJAU collection by Wang et al. (2016).
In a second step, we genotyped the combined dataset (EUCLEG and NJAU), starting from the raw fluorescence data following the procedure described above. In the quality control step, 69 poor quality samples were excluded. SNP calling was performed on the remaining 805 good quality samples. After genotyping, low quality SNPs based on SNP QC metrics were excluded and a final genotyping dataset containing 229,557 SNPs was generated. This dataset was divided in three subsets for further processing: EUCLEG, NJAU-Wild and NJAU-Cultivated, comprising 477, 82, and 246 accessions, respectively. For the divisions NJAU-Wild and NJAU-Cultivated, we refer to Wang et al. (2016). In further analyses, we considered either the whole collection (EUCLEG and NJAU) or some of these subsets.
For some of the downstream analyses, the genomic coordinates of the SNPs were required. Because during the development of the 355K SoySNP microarray SNP coordinates were assigned using an older version of the soybean reference genome sequence (Glyma.Wm82.a1), we positioned SNPs onto the novel reference genome sequence Glyma.Wm82.a2 (with improved assembly and gene annotation quality compared to Glyma.Wm82.a1). Finally, the 224,993 SNPs corresponding to probes that could be positioned onto the 20 soybean chromosomes using a blast query were considered for further analyses.

Population Structure Analysis
The results of population structure of the NJAU collection are available in Wang et al. (2016). Here, we performed a population structure analysis of EUCLEG and NJAU combined (805 accessions in total). Two approaches were applied. In the first approach, a model-based structure analysis was performed in fastSTRUCTURE 1.0 (Raj et al., 2014) including 179,812 SNPs with minor allele frequency (MAF) of at least 5% across the sample set. The K value was varied from 2 to 10, while for other parameters default settings were used. The optimum value of K was determined using the best marginal likelihood value of fastSTRUCTURE-output from K = 2 to K = 10. The results of fastSTRUCTURE were graphically visualized using the R-package pophelper v. 2.1.0 (Francis, 2017). In the second approach, a principal component analysis (PCA) was performed in TASSEL 5 (Bradbury et al., 2007). The missing values of SNPs were imputed using the unweighted average method and PCA was performed on the genetic correlation matrix of accessions considering the first five principal components.
Population structure was also inferred for the 477 genotyped accessions of the EUCLEG collection separately, using the settings and methods described in the previous paragraph. The analysis included 139,986 SNPs with MAF of at least 5% across the sample set. The number of subgroups (K) was determined considering the delta log-likelihood criterion. For the interpretation of the results of fastSTRUCTURE, once an optimal value of K was identified, each accession was assigned to a subgroup "n" for which the ancestral coefficient reached a value Qn ≥ 0.4; where n is the number of subgroup (1 to K). Accessions for which the two highest Q values differed by less than 0.2 were considered "Admixed". Finally, the degree of divergence between the EUCLEG and the NJAU collections was estimated by calculating a fixation index (F ST ) value per SNP site in VCFtools v. 0.1.15 (Danecek et al., 2011) following the methods of Weir and Cockerham (1984).
Hierarchical cluster analysis of EUCLEG and NJAU combined was performed to determine the relationship among accessions of both collections, and to see the relationship between the EUCLEG part from the Chinese origin with other accessions. For this, a dendrogram was constructed following Ward's D2 method (Murtagh and Legendre, 2014) and using Nei's standard genetic distances between accessions (Nei, 1972).

Genetic Diversity Estimates
Linkage disequilibrium (LD) analysis was performed in VCFtools v. 0.1.15 considering the filtered genotyping data including 139,986 SNPs,162,098 SNPs and 185,194 SNPs with MAF of at least 5% in the sample sets EUCLEG, NJAU-Cultivated, and NJAU-Wild, respectively. LD was estimated for each chromosome by computing the r 2 for all pairwise comparisons of two SNPs located at a maximum of 1000 kbp inter-SNP distance. The LD decay distance per chromosome was estimated as the point at which r 2 dropped to half of its maximum value. The genome-wide LD decay was estimated by pooling the LD output across all chromosomes.
Diversity estimates were determined for EUCLEG, NJAU-Cultivated, and NJAU-Wild separately, including all 224,993 SNPs. The EUCLEG collection contained 21 accessions from Chinese origin, but to avoid any possible confounding effect of those accessions while comparing genetic diversity between EUCLEG and NJAU, these 21 accessions were not considered. The average pairwise divergence among genotypes within each collection was then determined by computing the nucleotide diversity index (π) per SNP site in VCFtools v. 0.1.15.

Selective Sweep Analysis
To detect signals of selection in the EUCLEG collection, the cross-population composite likelihood ratio test (XP-CLR) implemented in XP-CLR v. 1.0 (Chen et al., 2010) was used. XP-CLR is a site frequency spectrum (SFS)-based method that detects selective sweeps by jointly modeling the multi-locus allele frequency differentiation between two populations. Given the allele frequency of a locus in the reference population, XP-CLR maximizes the likelihood ratio of the allele frequency in an object population between a selective sweep model and a null model (Chen et al., 2010). We compared the object EUCLEG collection to the reference NJAU-Wild collection. The 21 accessions of Chinese origin in the EUCLEG collection were not considered in this analysis. XP-CLR requires as input the genetic position (expressed in recombination units) of each SNP. Because for most SNPs the genetic position was unknown, we transformed physical positions (Mbp) to recombination positions (cM) considering a homogeneous recombination rate (1 Mbp = 1 cM) throughout the soybean genome.
The whole genome was scanned in XP-CLR choosing a sliding window of 1 Mbp at steps of 5 kbp. XP-CLR options were as follows: XPCLR -xpclr genofile1.txt genofile2.txt mapfile outputfile -w1 snpWin 0.01 gridSize 5000 chrN -p0 corrLevel 0.95; where genofile1.txt and genofile2.txt correspond to the object (EUCLEG) and reference (NJAU-Wild) collections, respectively. Because two SNP loci with high pairwise r 2 values can provide redundant information, corrLevel was set to 0.95 to weight the XP-CLR value of a window containing highly correlated SNPs. Windows with weighted XP-CLR scores in the top 1% of the empirical distribution of the genome-wide XP-CLR values were used to delineate regions of interest. To define the regions of interest (hereafter called selective sweep regions), we combined neighboring windows when the gap was less than 1 Mbp.
SoyBase 4 (Grant et al., 2010) was used to search for reported QTLs located in the selective sweep regions and to generate a list of genes positioned in respective selective sweep regions. The transcripts description was obtained from "Gmax_275_Wm82.a2.v1.annotation_info.txt" 5 and the annotation of biological functions was obtained from UniProtKB 6 . The former contains transcript definition of the best hit obtained through homology-dependent sequence analysis of soybean transcripts in the Arabidopsis genome, whereas the latter corresponds to their molecular and biological functions manually annotated and reviewed from literature and computational analysis by the UniProtKB.

Evaluation of the Use of the 355K SoySNP Array in the EUCLEG Collection
Analysis of the EUCLEG collection with the 355K SoySNP array revealed a total of 285,953 SNP markers (80% of the total 355,595 SNPs present on the array) belonging to the recommended categories including PolyHighResolution (PHR, total 211,593), MonoHighResolution (MHR, total 46,953) and NoMinorHom (NMH, total 27,407). These categories refer to SNPs exhibiting all three genotypic classes with a good cluster resolution (PHR), SNPs with good cluster resolution but displaying only one of the homozygous clusters (MHR) and SNPs with good cluster resolution but for which one of the two homozygous clusters is missing (NMH). These proportions correspond quite well with those previously reported for the NJAU collection by Wang et al. (2016) (Supplementary Figure 1), indicating that the 355K SoySNP array is not only useful for the genotyping of Chinese soybean germplasm, but also for germplasm from other origins.
Genotyping of the EUCLEG and NJAU combined collection rendered 229,557 SNPs (65% of the total 355,595) belonging to the recommended categories (PHR, MHR and NMH; a total of 194,171, 16,868 and 18,518 respectively). For the remaining 126,038 SNPs, at least one of the QC metrics were below the threshold and hence they were assigned to nonrecommended categories. There were 16% (of the total 355,595) more SNPs of non-recommended categories in the combined analysis as compared to the separate analysis of the EUCLEG collection. This was because a number of SNPs (19% of the total 355,595) of recommended categories (PHR, MHR and NMH) in the EUCLEG separate analysis were assigned to the non-recommended categories in the combined analysis. Taken together, these results indicate that SNP calling on EUCLEG and NJAU combined is essential to get a more precise classification of SNP markers as compared to a separate analysis for each collection. In addition, these results indicate that the SNP dataset of EUCLEG and NJAU combined contains a high number of SNPs from the recommended categories.
For 224,993 SNPs of the total 229,557 (98%), new coordinates could be positioned onto 20 chromosomes in the Glyma.Wm82.a2 soybean reference genome sequence using a BLAST query. The probes targeting the remaining SNPs (4,564) were either missing in the novel soybean genome assembly or were assigned to the non-anchored scaffolds and not to chromosomes, and were excluded from subsequent analyses. The genome-wide distribution of the final set of 224,993 SNPs used for downstream data analyses is given in Table 1. The longest chromosome (18) contained the highest number of SNPs (6.3% of the total 224,993), and the shortest chromosome (11) contained the lowest number of SNPs (3.8% of the total 224,993). The average SNP density was the lowest on chromosome 1 and the highest on chromosome 13 (19 and 27 SNPs per 100 kbp, respectively). The average distance between two adjacent SNPs was 2.6 kbp ( Table 1 and Supplementary Figure 2).

Population Structure Analysis
We analyzed the presence of population structure using two approaches, fastSTRUCTURE and PCA. In the EUCLEG and NJAU combined analysis comprising 805 accessions, the marginal likelihood of the fastSTRUCTURE-output from K = 2 to K = 10 indicated the optimum K between 2 and 4 (Supplementary Figure 3). The EUCLEG sample set clustered separately from NJAU sample sets ( Figure 1A and Supplementary Figure 4). The division between NJAU-Cultivated and NJAU-Wild (as defined by Wang et al., 2016) was also apparent. The F ST values were 0.14, 0.34 and 0.22 for EUCLEG vs. NJAU-Cultivated, EUCLEG vs. NJAU-Wild and NJAU-Cultivated vs. NJAU-Wild comparisons, respectively. This indicates that the set of 805 soybean accessions considered in this study consists of three major groups including EUCLEG, NJAU-Cultivated and NJAU-Wild. The relatively low F ST between EUCLEG and NJAU-Cultivated in relation to comparisons with NJAU-Wild supports a strong differentiation between cultivated and wild soybean accessions.
In the separate analysis of the EUCLEG collection the marginal likelihood of the fastSTRUCTURE-output from K = 2 to K = 10 indicated the optimum at K = 5 (Supplementary Figure 5). Although some small subgroups were formed at K > 5, we worked further with five subgroups as this clustering was in concordance with the background information of the accessions. Population structure in the EUCLEG collection is presented in Figures 1B,C and is summarized in Table 2. Additional information about the type and maturity group of accessions within each subgroup can be found in Supplementary Table 1. Of the total of 477 accessions, 362 were assigned to one of five subgroups (G1 to G5), and the remaining 115 accessions The results of the cluster analysis were similar with those of the fastSTRUCTURE analysis of EUCLEG and NJAU combined (see above). The accessions of the two collections were assigned to different clusters. NJAU-cultivated and NJAU-wild grouped into two separate clusters as in Wang et al. (2016). Interestingly, the accessions of Chinese origin included in the EUCLEG collection clustered mostly among the EUCLEG germplasm, spread over several clusters and only one of them clustered within NJAU-Cultivated (Supplementary Figure 6). This indicates that EUCLEG accessions originating from China resemble more closely with other EUCLEG accessions than the Chinese accessions from NJAU.

Genetic Diversity in the EUCLEG and NJAU Collections
A higher number of fixed SNP sites (MAF = 0) was observed in the EUCLEG collection (20%) compared to the NJAU collections (13% and 10% in NJAU-Cultivated and NJAU-Wild, respectively) (Supplementary Figure 7). Moreover, the  proportion of polymorphic SNPs (MAF ≥ 5%) was lower in EUCLEG (62%) than in NJAU-Cultivated (72%) and NJAU-Wild (82%). These results suggest an overall higher level of homozygosity in the EUCLEG collection compared to the NJAU collections. Linkage disequilibrium (LD) dropped to half of its maximum at 175, 100, and 50 kbp in EUCLEG, NJAU-Cultivated and NJAU-Wild, respectively (Figure 2), suggesting lower effective population size in the EUCLEG and NJAU-Cultivated collections than in the NJAU-Wild collection. In addition, the three collections showed different patterns of LD per chromosome (Table 1), which indicates different histories of recombination and selection in these three collections.
Average pairwise divergence among genotypes per site (π) decreased from 0.31 in NJAU-Wild to 0.26 in NJAU-Cultivated and to 0.23 in EUCLEG (Table 1). This is in agreement with a loss of genetic diversity due to domestication and selection in cultivated soybean. There is also a clear tendency toward less variation in average π values per chromosome in NJAU-Wild compared to NJAU-Cultivated and to EUCLEG (coefficient of variation of average π values per chromosome is 4.6, 10.4, and 13.1%, respectively). This is consistent with a scenario of a genetic bottleneck and selection that might have more prominently affected specific chromosomes in the cultivated genepools suggesting that selective sweep analysis comparing EUCLEG and NJAU-Wild may help to identify chromosomal regions that have undergone selection and domestication in the past and that probably are involved in the determination of important agronomic traits.

Selective Sweep Analysis
XP-CLR analysis revealed 23 selective sweep regions with an average length of 1.8 Mbp (range 1.14 Mbp-3.75 Mbp) ( Table 3), accounting for 4% of the total sequenced genome FIGURE 2 | Linkage disequilibrium decay at increasing genetic distances in the EUCLEG collection, the NJAU-Cultivated collection and in the NJAU-Wild collection. X-axis and Y-axis show the average distance between two SNPs in kbp and mean linkage disequilibrium (r 2 ) within bins of 5 kbp distance, respectively.
length. Selective sweep regions were present on 12 different chromosomes including 1, 2, 6,7,8,9,10,12,15,18,19, and 20 and some chromosomes had multiple selective sweep regions (Figure 3 and Supplementary Figure 8). Exploration of SoyBase 7 delivered 248 of the total 2,880 previously published QTLs coinciding with the selective sweep regions (Table 4 and Supplementary Table 2). A total of 3,811 genes were positioned within the selective sweep regions. The description of genes is provided in Supplementary Table 3. Consistent with the selected proportion of total chromosome size, selective sweep regions on chromosome 7 and 19 contained the highest (1,104) and the lowest (44) number of genes, respectively (Supplementary Table 3).
A nearly equal and relatively low nucleotide diversity (π) was observed in selective sweep regions 6.2, 12.2, and 15.1 in both EUCLEG (0.14, 0.06, and 0.23) and NJAU-Cultivated (0.13, 0.06, and 0.21) when compared to NJAU-Wild (0.39, 0.35, and 0.34) ( Table 3 and Supplementary Figure 9). This indicates that the traits regulated by these regions may have undergone similar histories of selection in EUCLEG and NJAU-Cultivated. Known QTLs for water use efficiency (WUE) and time to flowering and maturity coincide with region 6.2 and 15.1 ( Table 4). In addition, genes conferring resistance to pathogens or controlling time to flowering are located in these regions (Supplementary Table 3). Strikingly, no QTL coincided with selective sweep region 12.1, although this region harbors genes for important functions such as control of time to flowering (Supplementary Table 3).
Interestingly, π was extremely low (0.04 − 0.13) in selective sweep regions 1.  "Selective sweep region" refers to the name (left side of decimal indicates the chromosome on which the region is located and the right side is an ordinal number), "Start" and "End" delineate the chromosome coordinates of the selective sweep region. "Number of SNPs" is the total number of SNPs contained in the candidate selective sweep region. "Average XP-CLR value" is the average of XP-CLR values for all the windows contained in the selective sweep region. π is the average nucleotide diversity. Figure 9). Such a low diversity in EUCLEG can be caused by the effects of domestication and/or selection. These regions contain known QTLs for seed fatty acids, seed oil, yield components, resistance to biotic stresses including Sclerotinia stem rot (SCL), time to flowering, and WUE (Table 4). Genes controlling flowering and maturity, resistance against pathogens, uptake of minerals, and abiotic stress response are also located in these selective sweep regions (Supplementary Table 3). Some of the candidate genes for seed isoflavone content reported by Meng et al. (2016) are located in region 8.2 (Glyma.08G190300, Glyma.08G190500).
We also observed known QTLs related to hilum color and seed coat color that coincide with selective sweep region 8.1 (Table 4). This region contains a group of chalcone synthase (CHS) genes that are part of the flavonoid and anthocyanin biosynthesis pathway required for seed coat color (Akada and Dube, 1995) (Supplementary Table 3).

Genotyping Non-Chinese Soybean Accessions With the 355K SoySNP Array
In soybean, different genotyping microarrays are available including the SoySNP50K iSelect Bead chip from Illumina (Song et al., 2013) and the SoySNP180K Axiom microarray from Affymetrix (Lee et al., 2015) containing probes for 52,041 and 180,961 SNPs, respectively. Here we have used the recently developed NJAU 355K Affymetrix SoySNP array containing probes for 355,595 SNPs (Wang et al., 2016). The proportion of high quality SNPs detected in this study for the EUCLEG collection corresponded quite well with that for the NJAU collection reported by Wang et al. (2016). This indicates that the 355K SoySNP microarray, which was originally developed using plant materials of Chinese origin, is also useful for genotyping soybean from non-Chinese origin. It was therefore possible to perform SNP calling on combined EUCLEG and NJAU collections. This joint analysis offered the advantage that genotyping a larger sample set minimizes the chance of misclassification of SNPs, which reduces the type I error (Mascha and Vetter, 2018). This combined SNP dataset is therefore of great value to analyze the genetic diversity available in the EUCLEG collection and to contrast this with the genetic diversity present in the Chinese collection.

Genetic Structure of the EUCLEG and the NJAU Collections
Soybean is native to China, Japan and Korea, and has been introduced to Europe and other parts of the world (Singh and Hymowitz, 1999;Liu et al., 2017). Evidence from previous studies has shown that the earlier soybean breeding programs of different parts of the world have used Chinese soybean accessions as ancestors (Gizlice et al., 1994;Wysmierski and Vello, 2013), although an earlier study showed a clear distinction between soybean collections from United States and China (Liu et al., 2017). Similarly, population structure analysis in our study revealed a clear genetic differentiation between the EUCLEG and both the NJAU-Wild, and NJAU-Cultivated collections. This reflects breeding efforts in different parts of the world over many decades that have concentrated on improving the local adaptation of soybean to different environmental conditions. Accessions of Chinese origin included in the EUCLEG collection clustered closely with other accessions from EUCLEG instead of clustering with NJAU accessions. This is probably because these accessions have been used for breeding purposes outside China and in this way show closer relationships with their descendants.
The level of genetic diversity (π) in this study was the lowest for the EUCLEG collection, followed by NJAU-Cultivated, and with NJAU-Wild containing the highest level of genetic diversity. This agrees with the model for soybean breeding history presented by Hyten et al. (2006), in which domestication and further selection has reduced the genetic diversity in Asian germplasm. This was followed by genetic bottlenecks during introduction of soybean to other regions of the world and further selection. As a consequence, the substantially lower level of diversity in the EUCLEG collection in comparison to the NJAU collections reflects the combined effect of all three processes (domestication, introduction bottlenecks and selection).
Nevertheless, the EUCLEG collection is strongly structured, with a distribution of genetic diversity over five subgroups. In agreement with previous reports by Žulj Mihaljević et al. (2020), our analysis confirms that soybean accessions from Southern Europe are closely related to those from United States and Canada. Soybean accessions from Eastern and Western Europe contain a range of diversity as they were distributed over all five subgroups (G1 to G5). These results indicate frequent exchange of genetic resources across countries of Eastern and Western Europe, as well as the incorporation of diversity from different geographical origins including Japan, China, United States and Canada into European breeding activities (Tavaud-Pirra et al., 2009;Hahn and Würschum, 2014).
Surprisingly, no significant signatures of selection were detected for stem determinacy, which is an important adaptive trait affecting grain yield in soybean (Kato et al., 2019). Determinate growth habit is a domestication related trait (reviewed in Sedivy et al., 2017), as a high level of determinacy contributes to synchronous seed maturation, thus avoiding undesired variability of moisture content in the harvested material. However, determinate varieties perform less well at high latitudes (Kato et al., 2019) including a large part of Northern and Western Europe (Schori et al., 2003). We have previously shown that stem determinacy is quite variable in the EUCLEG collection (Borra-Serrano et al., 2020), which may explain why the selective sweep analysis presented here failed to obtain relevant signatures of selection for this trait.
In Europe, soybean is considered a protein crop and, together with yield, seed protein content is one of the main breeding goals (Berschneider, 2016). We have identified several signatures of selection that coincide with QTLs for these traits. In addition, selective sweep regions also contain some QTLs related to nutrient use efficiency (e.g., P, Fe, K, and Ca) indicating that they might have been selected for nutrient use efficiency. However, from the total 230 QTLs for seed protein that are described in SoyBase, only 31 were located in the selective sweep regions detected in this study. This, together with the observation that the protein content is higher in wild accessions than in cultivated accessions (Chen and Nelson, 2004) indicates that during the domestication and improvement processes, either some favorable haplotypes for high protein content might have been lost or that the diversity present at those sites might not have been exploited yet in the EUCLEG collection. Therefore, there is still room to improve seed protein content by exploiting the genetic diversity available in the EUCLEG collection for this trait.
Previously known QTLs related to high seed oil content also coincided with the selective sweep regions in the EUCLEG collection compared to NJAU-Wild. This is in line with the observation that wild soybean seeds have low oil content (Chen and Nelson, 2004). Although European soybean is not specifically bred for high oil content, soybean breeding programs in Europe have incorporated breeding materials from United States, where seed oil content is an important trait (Xavier et al., 2018). A low to medium genetic diversity of the respective selective sweep regions in the EUCLEG collection as compared to NJAU-Cultivated and NJAU-Wild suggests that these QTLs are not completely fixed in the EUCLEG collection.
Our analysis has also revealed some signatures of selection related to seed fatty acids including the polyunsaturated fatty acids (linolenic acid and linoleic acid) and monounsaturated fatty acids (oleic acid). Improving the quality of seed oil by minimizing the level of polyunsaturated fatty acids is an important consideration for improved stability of soybean oil (Clemente and Cahoon, 2009). Moreover, signatures of selection were also found for traits related to tofu quality and isoflavone content, a metabolite that helps in the prevention of chronic diseases such as cancer and cardiovascular diseases (Messina and Messina, 2010). This is relevant information for European breeding programs with a particular interest in compositional traits related to food production.
Soybean, a member of the Fabaceae family, has the ability to fix atmospheric nitrogen through symbiosis with rhizobium bacteria. It has been reported that high yielding soybean varieties have a better ability to fix nitrogen (Collino et al., 2015). While direct selection of nitrogen fixation may not have been one of the objectives of current breeding programs, we have found that some previously known QTLs related to nitrogen fixation coincide with the selective sweep regions in the EUCLEG collection. This suggests that this trait might have been improved indirectly through selection for high yield. However, from the total of 145 QTLs for nitrogen fixation reported in SoyBase 11 coincide with selective sweep regions in our analysis. This suggests that still a broad genetic diversity might be present in the EUCLEG collection which can be further used to improve nitrogen fixation.
Resistance to diseases, especially to Sclerotinia stem rot (SCL) caused by Sclerotinia sclerotiorum, is considered important in European soybean breeding programs because SCL is widely spread throughout Europe (Rüdelsheim and Smets, 2012). Pannecoucque et al. (2018) reported the presence of genetic variation in the level of susceptibility for SCL in 14 early maturing varieties from Europe. Sources of SCL resistance have been identified on 11 soybean chromosomes and a total of 99 QTLs have been reported in SoyBase (reviewed in Neupane et al., 2019). In our analysis, only 9 QTLs coincided with a selective sweep region. Moreover, soybean mosaic virus (SMV) can be a serious issue in Western and Northern Europe (Aper et al., 2016). Of the 18 QTLs conferring resistance to different strains of SMV reported in SoyBase, only a single QTL coincided with a selective sweep region in the EUCLEG collection. Lack of selection signals for a large number of previously reported QTLs linked to resistance to SCL and SMV suggests a high genetic diversity at the corresponding genomic loci in the EUCLEG collection. A more detailed analysis of the genetic patterns at these QTLs in the EUCLEG collection might be relevant to plan future breeding efforts to improve SCL and SMV resistance in European germplasm.
To the best of our knowledge this is one of the first studies that explores the genetic diversity of a large soybean collection relevant for breeding in Europe, in comparison to Chinese germplasm. Although we have found several selective sweeps that could be linked to useful traits in soybean through XP-CLR analysis, some methodological aspects require attention. First, genotyping using whole genome sequencing rather than microarray data could provide more variants and thus a more detailed description of the genomic regions that have experienced selective sweeps (Ronen et al., 2013). For example, this would enable a gene enrichment analysis which ultimately could provide information about candidate genes. Second, some of the selective sweeps determined by XP-CLR might be false positives caused by demographic processes such as bottlenecks or population expansions (Weigand and Leese, 2018). While the first aspect can be tackled if sufficient resources are available, d it is hard to entirely overcome the second limitation unless other methods are developed that allow to differentiate among the different scenarios that can lead to a positive signal of selection.

CONCLUSION
The present study focused on the exploration of a representative sample of soybean accessions relevant for breeding in Europe, the EUCLEG collection. This is one of the first studies in which the patterns of genetic diversity in a large soybean germplasm set relevant for breeding in Europe has been compared to the genetic diversity contained in Chinese cultivated and wild soybean germplasm. Our study has demonstrated a relatively lower genetic diversity in the EUCLEG collection compared to Chinese collections of cultivated and wild accessions, which indicates a narrow genetic base of the EUCLEG collection. However, a more detailed analysis of the patterns of genetic diversity in the EUCLEG collection has revealed substantial sub-structuration in five subgroups associated with geographical origins, and without a clear association with maturity classes. A selective sweep analysis has revealed the presence of multiple signatures of selection in the EUCLEG collection, compared to Chinese wild germplasm. In particular, genomic regions previously reported to influence grain protein, yield and disease resistance have been identified, whose exploration in future work might facilitate further selection efforts. No signals of selection have been detected for loci involved in stem determinacy, probably because no directed selection has been performed for this trait among the germplasm represented in the EUCLEG collection. On the other hand, clear signatures of selection were detected for at least two loci involved in photoperiod sensitivity and time to flowering, which are main traits considered by breeders in order to adapt soybean for cultivation in Europe (only maturity classes 000 to I/II). Taken together, our results have identified relevant genomic regions that can be further exploited to improve soybean for the European agricultural sector, either through further improvement of genetic resources that are available in Europe, or through incorporation of exotic soybean material in European breeding programs.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
AS, HM, JA, and IR-R conceived the study. AS and HM performed the research and analyzed the data. AS and IR-R drafted the manuscript. AS, HM, JA, TR, JW, DY, and IR-R interpreted the results and elaborated the manuscript. All authors read and approved the final version of the manuscript.

ACKNOWLEDGMENTS
We would like to thank Nancy Mergan for technical assistance in DNA extraction. Thanks also to Sabine Van Glabeke for assistance with SNP calling and to Dr. Rex Nelson (Computational Biologist at SoyBase) for sharing with us the description of the GWAS QTL database.