Genome-Wide Association Study of Seed Dormancy and the Genomic Consequences of Improvement Footprints in Rice (Oryza sativa L.)

Seed dormancy is an important agronomic trait that affects grain yield and quality through pre-harvest germination, and it is influenced by both environmental and genetic factors. However, our knowledge of the factors controlling seed dormancy remains limited. To better understand the molecular mechanism underlying this trait, a genome-wide association study (GWAS) was conducted in an indica-only population consisting of 453 accessions genotyped using 5,291 SNPs. Nine significant lead SNPs, both known and new, were identified on eight chromosomes. These lead SNPs explained 34.9% of the phenotypic variation, and four of them were converted into dCAPS markers to accelerate molecular breeding. Moreover, a total of 212 candidate genes were predicted, and eight of them showed tissue-specific expression in expression profile data from several public bioinformatics databases. In particular, LOC_Os03g10110, which has a maize homolog involved in embryo development, was identified as a candidate regulator for further functional investigation. Additionally, a polymorphism information content (PIC) ratio method was used to screen for improvement footprints, and 27 selective sweeps were identified, most of which harbored domestication-related genes. Further analysis showed that three significant SNPs were adjacent to the candidate selection signals, supporting the accuracy of our GWAS results. These findings show that genome-wide screening for selective sweeps can be used to identify new improvement-related DNA regions even when the associated phenotypes are unknown. This study enhances our knowledge of the genetic variation in seed dormancy, and the new dormancy-associated SNPs should benefit molecular breeding.


INTRODUCTION
Plant seeds, especially cereal grains, are vitally important sources of human nutrition, providing half of the global per capita energy intake. Consequently, various seed traits have been under strong artificial and natural selection during crop domestication (Kovach et al., 2007; Izawa et al., 2009). One such trait is seed dormancy, which is regarded as the failure of an intact viable seed to complete germination under favorable conditions (Bewley, 1997). Because seed dormancy is closely related to pre-harvest sprouting (PHS), it is one of the most important traits in rice breeding programs (Bewley and Black, 1982). Rice seed dormancy is a double-edged sword in terms of cultivation and utilization. Weak dormancy leads to a higher PHS rate in rainy weather and results in production losses and poor quality. In southern China, the long rainy season causes heavy PHS in 5-20% of hybrid rice grains (Hu et al., 2003). However, deep seed dormancy, especially in hybrid seeds, can cause non-uniform germination or even prevent germination after sowing. Thus, balancing the advantages and disadvantages of seed dormancy is an important goal in rice breeding, and understanding the genetic variation of seed dormancy is of great interest to plant breeders.
Plant seed dormancy is a complex agronomic trait. Traditional bi-parental QTL mapping is limited by the small number of recombination events that occur over the few generations needed to develop a recombinant inbred line population. Therefore, most of the biological mechanisms of seed dormancy have not been clearly elucidated. In previous reports, only three genes, qSD12 (Gu et al., 2010), qSD7-1 (Gu et al., 2011), and qSD1-2, were identified as regulating hormone accumulation in developing or mature seeds in rice. Thus, more novel QTLs for this trait need to be isolated by efficient and reliable QTL mapping methods. Genome-wide association studies (GWASs), which are based on the historic recombination in a large natural population, a high-density SNP map and a comprehensive HapMap, have become a powerful approach that complements linkage mapping for identifying complex trait variation at the genome-wide level. GWASs can overcome the limitations of traditional bi-parental populations and dissect complex traits with high mapping resolution. In the past few years, GWASs have been successfully applied to dissect complex traits in humans (Edwards et al., 2005), animals (Duijvesteijn et al., 2010), and plant species such as Arabidopsis thaliana (Atwell et al., 2010), soybean (Wen et al., 2014), wheat (Sukumaran et al., 2015), barley (Gawenda et al., 2015), maize, sorghum (Zhang D. et al., 2015), tomato (Zhang et al., 2016a), rapeseed (Qu et al., 2015), sesame (Wei et al., 2015), and Aegilops tauschii. In rice, GWASs have also been applied widely to detect novel QTLs involved in different complex traits, for example yield (Begum et al., 2015; Huang X. et al., 2015), environmental stress resistance (Kumar et al., 2015; Lv et al., 2015), grain quality, blast resistance (Wang et al., 2015), flowering time (Huang et al., 2011), and seed dormancy (Magwa et al., 2016). Together with linkage analysis, the GWAS strategy is playing an increasingly important role in dissecting novel QTLs and uncovering genome-wide variation.
Rice domestication has long been an active research topic. Many domestication-related genes have been mapped or cloned in previous studies, such as PROG1 (Jin et al., 2008; Tan et al., 2008), An1 (Luo et al., 2013), sh4 (Li et al., 2006), Bh4 (Zhu et al., 2011), and sd1 (Asano et al., 2011). Seed dormancy is a typical and direct domestication-related trait that has been strongly affected by natural and human selection over the long history of rice domestication. Two seed dormancy genes, Sdr4 (Sugimoto et al., 2010) and qSD7-1/Rc (Gu et al., 2011), have been shown to be directly involved in rice domestication. Long-term domestication has dramatically changed phenotypes, altered allele frequencies, reduced genetic diversity and polymorphism information content (PIC), and drastically increased linkage disequilibrium (LD) (Richards et al., 2004; Chen et al., 2010; Qanbari et al., 2014). Many methods based on high-density single nucleotide polymorphism (SNP) markers have been used to detect genomic selection signals, such as nucleotide diversity (π) (Huang et al., 2012), PIC (Rostoks et al., 2006), population differentiation (FST) (Wilkinson et al., 2013), the cross-population composite likelihood ratio (XP-CLR), and extended haplotype homozygosity (EHH) (Sabeti et al., 2002; Olsen et al., 2006). In rice, multiple selective sweeps have been detected using these methods (Olsen et al., 2006; Huang et al., 2012; Huang X. et al., 2015). Therefore, selective sweep analysis is another approach to identify QTLs for domestication traits.
In this study, building on our previous reports, a high-density custom-designed array containing 5,291 SNPs was used to genotype 453 indica accessions. The aims of our research were (1) to identify SNPs significantly associated with seed dormancy, and putative genes potentially regulating it, in the whole panel; (2) to design dCAPS markers for the different alleles of the associated SNPs for breeding applications; (3) to analyze the genetic diversity, population structure and LD of the landraces and improved lines; and (4) to detect selective sweep regions using PIC ratio statistics between the landraces and improved lines.

Plant Materials and SNP Genotypes
The association mapping population consisted of 453 indica accessions previously described in detail by Lu et al. (2015). According to the pedigree information, these accessions were classified into three types: landraces (266), improved lines (89) and foreign introduced lines (81) (Table S1). A rice landrace was defined as a "farm cultivar," "traditional variety," or "local variety" that had developed over a long time and adapted to the local natural and cultural environment. Rice landraces have always maintained rich genetic diversity. An improved line was defined as a "breeding variety" that was artificially selected by a breeder to meet a specific economic need, such as high yield, fine quality or multi-resistance. The genetic diversity of the improved lines was lower than that of the landraces. Therefore, the PIC ratios between the landraces (266) and improved lines (89) were used to detect whole-genome selective sweep signals. All accessions were planted in a randomized complete block design with three field replications. Each line was planted in six rows with six hills per row, at a spacing of ∼20 × 20 cm, in Lingshui (LS; 18°32′N, 110°01′E) and Hangzhou (HZ; 30°15′N, 120°12′E) in 2014.
Rice genomic DNA was extracted from young leaf tissue, and all accessions were genotyped using an Illumina custom-designed array containing 5,291 SNP markers following the Infinium HD Assay Ultra Protocol [https://support.illumina.com/downloads/infinium_hd_ultra_assay_protocol_guide_(11328087_b).html] (Illumina, Inc., San Diego, CA, USA). Genotypes were called using the GenomeStudio software (Illumina, Inc.). The quality of each SNP was checked manually following a previous study. SNPs that were tri- or tetra-allelic, of low quality, highly heterozygous, or had a minor allele frequency (MAF) <5% were removed from the dataset. Finally, a total of 3,948 SNPs were retained for further analysis (Table S2).
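The filtering criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (quality was checked manually in GenomeStudio); the function name `snp_passes_qc`, the genotype encoding, and the heterozygosity cutoff `het_max` are assumptions made only for illustration, because the text gives no numeric threshold for "high heterozygosis". Only the MAF cutoff of 5% comes from the text.

```python
from collections import Counter

def snp_passes_qc(genotypes, maf_min=0.05, het_max=0.10):
    """Return True if one SNP passes the filters sketched here.

    genotypes: list of two-letter calls such as "AA" or "AG"; None marks a
    missing call. het_max is a hypothetical cutoff (not stated in the text).
    """
    called = [g for g in genotypes if g]
    if not called:
        return False
    allele_counts = Counter(allele for g in called for allele in g)
    # Tri- and tetra-allelic SNPs are removed; monomorphic SNPs have MAF = 0.
    if len(allele_counts) != 2:
        return False
    maf = min(allele_counts.values()) / sum(allele_counts.values())
    het_rate = sum(1 for g in called if g[0] != g[1]) / len(called)
    return maf >= maf_min and het_rate <= het_max

# Hypothetical calls for four SNPs across a handful of accessions.
print(snp_passes_qc(["AA"] * 9 + ["GG"]))       # MAF = 0.10, kept
print(snp_passes_qc(["AA"] * 39 + ["GG"]))      # MAF = 0.025, removed
print(snp_passes_qc(["AA", "CC", "GG"] * 5))    # tri-allelic, removed
print(snp_passes_qc(["AG"] * 10))               # fully heterozygous, removed
```

For a largely inbred crop such as rice, a strict heterozygosity cutoff is a common choice, but the value used here is a placeholder.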

Seed Germination Evaluation and Data Analysis
The materials were cultivated under conditions that were as uniform as possible. The heading date of each plant was recorded as the date on which the first panicle emerged from the flag leaf sheath. To reduce the marginal effect, four plants in the middle of each line were chosen for testing. The degree of seed dormancy was evaluated as the germination percentage at seed maturity. To minimize the influence of harvest conditions, ∼100 filled seeds per line were harvested 30-40 days after heading (depending on the accession) and stored at 4 °C to maintain seed freshness for germination evaluation on the next day; another ∼100 seeds per line were air-dried in a greenhouse (∼35 °C) for 7 days and then treated at 45 °C for 3 days to break dormancy. To test germination, the seeds were wrapped in doubled sheets of wetted 20 × 20 cm absorbent filter paper, placed vertically, and germinated at 30 °C and 100% relative humidity in the dark for 10 days. Germinated seeds were defined as those in which the shoot length exceeded half the length of the seed. Germination (%) = number of germinated seeds × 100/number of tested seeds. This experiment was repeated three times, and the mean values of the three replications were used as the final phenotypic data. All seeds were intact grains.
Means, standard errors (SEs), broad-sense heritability (H²B), the percentage of phenotypic variation explained by population structure (R²Q) and interactions of genotype × environment (G × E) were calculated as in our previous report. The skewness, kurtosis and coefficient of variation (CV) were calculated using the functions SKEW(), KURT() and STDEV()/AVERAGE() in Excel 2007, respectively.
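For readers who want to reproduce the descriptive statistics outside a spreadsheet, the following Python sketch mirrors the sample formulas behind Excel's SKEW(), KURT() and STDEV()/AVERAGE(). The function name `excel_like_stats` and the example values are illustrative assumptions, not data from this study.

```python
import math

def excel_like_stats(values):
    """Return (skewness, excess kurtosis, CV in %) using Excel's sample formulas."""
    n = len(values)
    if n < 4:
        raise ValueError("KURT() needs at least four observations")
    mean = sum(values) / n
    # STDEV(): sample standard deviation with an n - 1 denominator.
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z3 = sum(((x - mean) / sd) ** 3 for x in values)
    z4 = sum(((x - mean) / sd) ** 4 for x in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    cv = sd / mean * 100
    return skew, kurt, cv

# Hypothetical values, used only to exercise the formulas.
skew, kurt, cv = excel_like_stats([3, 4, 5, 2, 3, 4, 5, 6, 4, 7])
print(f"skewness={skew:.4f}, kurtosis={kurt:.4f}, CV={cv:.2f}%")
```

Both SKEW() and KURT() are bias-corrected sample estimators, so the results differ slightly from population-moment formulas on small samples.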

Genome-Wide Association Analysis
To minimize the effects of environmental variation, best linear unbiased predictions (BLUPs) were calculated using the R package lme4 (Bates et al., 2011) to estimate the phenotypic value of each line across the two environments (LS and HZ). The BLUP model can be described as

$$Y_{ijk} = \mu + L_k + E_i + R(E)_{ij} + (L \times E)_{ik} + \varepsilon_{ijk}$$

where $Y_{ijk}$ is the observed phenotype for the kth line in the jth replicate of the ith environment; $\mu$ is the overall mean; $L_k$ is the random effect of the kth line; $E_i$ is the random effect of the ith environment; $R(E)_{ij}$ is the random effect of the jth replicate in the ith environment; $(L \times E)_{ik}$ is the random interaction effect of the ith environment and the kth line; and $\varepsilon_{ijk}$ is a random error following $N(0, \sigma_e^2)$. GWASs were performed in TASSEL version 4.0 (Bradbury et al., 2007). The EMMA (Kang et al., 2008) and P3D (Zhang et al., 2010) algorithms were used to reduce computing time. The compressed mixed linear model (cMLM) with the population structure matrix (Q) and the relative kinship matrix (K) as covariates was used to reduce false-positive associations (Yu et al., 2006; Zhang et al., 2010). The Q matrix was calculated in our previous study, and the K matrix was generated using TASSEL version 4.0 (Bradbury et al., 2007) based on the 3,948 SNP markers. Four models (GLM, Q, K, and Q + K) were used to evaluate type I errors. The GLM model applied no correction for population structure or familial relatedness; the Q model used the population structure factor (Q matrix) as a covariate; the K model used the familial relatedness effect (K matrix) as a covariate; and the Q + K model used both the Q and K matrices as covariates. The GLM and Q models were implemented using a general linear model program, whereas the K and Q + K models were implemented in a mixed linear model program. The significance threshold for trait-marker associations was determined by Bonferroni correction (α = 1; 1/3,948 = 2.5E-04) (Duggal et al., 2008; Yang et al., 2014). This threshold corrects for multiple comparisons and is calculated as 1/N (α = 1), where N is the number of individual trait-SNP combinations tested. Candidate genes were predicted within a 200-kb genomic region (±100 kb of each significant SNP) from the Rice Haplotype Map Project Database (http://202.127.18.221/RiceHap2/) (Lu et al., 2015).
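The threshold arithmetic and the ±100-kb candidate window can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to query the Rice Haplotype Map Project Database: the function names, the gene tuples and their coordinates are all hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=1.0):
    """Threshold alpha / N; the text uses alpha = 1 and N = 3,948 SNPs."""
    return alpha / n_tests

def candidate_window(pos_bp, flank_bp=100_000):
    """Return the (start, end) of the 200-kb window centred on a SNP."""
    return max(1, pos_bp - flank_bp), pos_bp + flank_bp

def genes_in_window(genes, chrom, pos_bp, flank_bp=100_000):
    """genes: iterable of (gene_id, chrom, start_bp, end_bp) tuples.

    A gene is reported if it overlaps the window at all.
    """
    start, end = candidate_window(pos_bp, flank_bp)
    return [gid for gid, c, s, e in genes if c == chrom and s <= end and e >= start]

print(f"{bonferroni_threshold(3948):.2e}")  # 2.53e-04, reported as 2.5E-04

# Hypothetical gene models and SNP position, for illustration only.
genes = [
    ("geneA", 3, 4_950_000, 4_953_000),
    ("geneB", 3, 5_099_000, 5_102_000),
    ("geneC", 3, 5_200_000, 5_204_000),
    ("geneD", 7, 5_000_000, 5_003_000),
]
print(genes_in_window(genes, chrom=3, pos_bp=5_000_000))  # ['geneA', 'geneB']
```

Counting genes that only partly overlap the window is a design choice of this sketch; requiring full containment would give a slightly shorter list.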

Candidate Gene Expression Profiles
The expression profiles of all candidate genes were analyzed according to the results of Davidson et al. (2012). Previous studies showed that seed dormancy-related genes had higher expression levels in the embryo, endosperm or seed (Sugimoto et al., 2010; Gu et al., 2011; Ye et al., 2015); thus, the expression data of Davidson et al. (2012) for these three tissues were used to screen the candidate genes. The screened candidate genes were then validated using the expression data in the Bio-Analytic Resource Plant Biology (BAR) database (http://bar.utoronto.ca/) and the Rice Genome Annotation Project database (http://rice.plantbiology.msu.edu/). SNPs located in the candidate genes and haplotype data were obtained from the RiceVarMap database (http://ricevarmap.ncpgr.cn/). Homologous gene identification was performed in the Rice Genome Annotation Project database, and then protein sequences were aligned using BLASTP in NCBI (https://www.ncbi.nlm.nih.gov/).

Linkage Disequilibrium Decay, Population Genetics and Polymorphism Information Content Analyses
To detect the regions of improvement sweeps, the 266 landraces and 89 improved lines were used to analyze LD decay, population structure and PIC ratios. The LD decay of the two panels was investigated using TASSEL version 4.0 (Bradbury et al., 2007). The LD decay rate was measured as the chromosomal distance at which the average pairwise correlation coefficient (r²) dropped to half of its maximum value (Lu et al., 2015). A neighbor-joining (NJ) tree and principal component analysis (PCA) were used to infer the population structure. Based on Nei's genetic distance (Nei, 1972), the NJ tree was constructed using Powermarker 3.25 (Liu and Muse, 2005) and the PCA was performed using NTSYSpc version 2.1 (Rohlf, 2000). The FST among different subgroups and the PIC of each SNP marker were also evaluated using Powermarker 3.25 (Liu and Muse, 2005). The PIC ratio was used to test for improvement sweep regions according to the differing genetic diversity of the two populations in the selected regions of the genome. The ratio can be expressed as $\mathrm{PIC\ ratio} = \mathrm{PIC}_{\mathrm{landrace}} / \mathrm{PIC}_{\mathrm{improved\ line}}$. SNPs with PIC ratios > 3 were empirically considered improvement footprints.
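The PIC ratio test can be illustrated with a short Python sketch. PIC is computed here with the standard Botstein et al. (1980) formula, which is assumed to match the PowerMarker output; the function names, the allele frequencies, and the handling of SNPs that are monomorphic in the improved lines (returned as infinity) are assumptions of this sketch rather than details given in the text.

```python
def pic(allele_freqs):
    """Polymorphism information content from allele frequencies summing to 1.

    PIC = 1 - sum(p_i^2) - sum over pairs i<j of 2 * p_i^2 * p_j^2
    """
    freqs = list(allele_freqs)
    homozygosity = sum(p * p for p in freqs)
    pair_term = sum(
        2 * p_i ** 2 * p_j ** 2
        for i, p_i in enumerate(freqs)
        for p_j in freqs[i + 1:]
    )
    return 1 - homozygosity - pair_term

def pic_ratio(landrace_freqs, improved_freqs):
    """PIC(landrace) / PIC(improved line); inf when the improved panel is fixed."""
    denominator = pic(improved_freqs)
    if denominator == 0:
        return float("inf")  # assumption: a fixed SNP counts as an extreme ratio
    return pic(landrace_freqs) / denominator

def is_improvement_footprint(landrace_freqs, improved_freqs, cutoff=3.0):
    """Apply the empirical cutoff (PIC ratio > 3) described in the text."""
    return pic_ratio(landrace_freqs, improved_freqs) > cutoff

# Hypothetical biallelic SNPs: (landrace frequencies, improved-line frequencies).
print(round(pic_ratio([0.5, 0.5], [0.9, 0.1]), 2))          # 2.29, below cutoff
print(is_improvement_footprint([0.5, 0.5], [0.95, 0.05]))   # True, ratio about 4.14
```

For a biallelic SNP, PIC peaks at 0.375 when both alleles are equally frequent, so a ratio above 3 requires the improved lines to be close to fixation at that site.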

Diversity Panel and Phenotypic Variation
A rice diversity panel consisting of 453 indica accessions gathered from 20 rice-planting countries (Figure 1a) was used in our GWAS analyses. These accessions show rich variation in seed dormancy, from deep to weak dormancy (Figures 1b,c). The germination of dormant seeds and of dormancy-broken seeds was compared to confirm the existence of dormancy. The germination of dormancy-broken seeds was nearly 100%, which was significantly higher than that of dormant seeds regardless of environment (LS or HZ), indicating that all of the accessions had true dormancy characteristics and were viable (Figure 1d).
Germination of threshed seeds was used to test the degree of seed dormancy in our study. The G × E analysis indicated that environmental effects should not be ignored (Table 1). Thus, to minimize the effect of environmental variation, BLUPs of the genetic effect for each line were used for the overall association analysis in the panel. The phenotypic variation in germination of threshed seeds in the two environments and the predictions from the BLUP method are shown in Table 1 and Table S3. The mean germination values were 34.7, 36.8, and 34.8% in LS, HZ and the BLUPs, respectively. The coefficients of variation were 87.8, 84.3, and 67.3%, respectively, and the ranges of germination were 0.0-98.2%, 0.0-97.4%, and 5.8-86.0%, respectively. These results suggested that the accessions had abundant phenotypic variation and were suitable for GWAS analysis (Pearson and Manolio, 2008). The skewness values were positive, ranging from 0.4 to 0.5, indicating that the germination distribution was somewhat right-skewed (Table 1, Figure 1b). The phenotypic variation explained by the population structure (R²Q) ranged from 45.7 to 61.2%, and the broad-sense heritability (H²B) was 85.2%. Rice has been found in archaeological sites dating to 8000 B.C. (Higham and Lu, 1998) and has been domesticated from wild rice, artificially and naturally, over a long period. Great changes have occurred in numerous traits, including seed dormancy. Thus, one possible explanation for the skewed phenotype distribution is that the degree of seed dormancy in some cultivated rice accessions was significantly weakened to meet production needs during rice domestication. The high R²Q value suggested that the phenotypic variation was strongly affected by the population structure of the panel, so the population structure factor (Q matrix) should be taken into account in the GWAS analyses. In addition, the relatively high H²B indicated that genetic improvement of the trait is effective and could play a significant role in future breeding.
In recent years, GWASs in plants have become a new QTL detection strategy for uncovering the genetic basis of heritable traits, using high-throughput genotyping technologies and large natural germplasm collections that contain rich genetic and phenotypic variation. Compared with linkage mapping, GWASs use a germplasm mapping population and LD to improve genetic mapping efficiency and precision (Remington et al., 2001; Salvi and Tuberosa, 2005). Here, a large indica rice diversity panel containing a wealth of genetic diversity in seed dormancy (Figures 1b,c, Table 1) was used to perform a GWAS. Previous studies have also demonstrated that this indica-only population is well suited for association mapping (Lu et al., 2016).

Genome-Wide Analysis of Seed Dormancy
Our previous report showed that the indica-only panel could be classified into four populations (POPs) and a mixed subgroup (Mixed) (Figure S1), and relative kinship analysis indicated that there was no or weak relatedness in the panel (Lu et al., 2016). Each accession was assigned to a corresponding group using a Q component ≥ 0.6 as the threshold. The average germination was significantly different among the five populations (Figure S1). In particular, POP1 had the highest value (>60%) and POP4 had the lowest (∼15%) (Figure 2A). This result was also supported by the high R²Q (Table 1). Taken together, these results highlighted the need to account for population structure and relative kinship when performing the subsequent GWAS analyses.
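The group assignment rule (Q component ≥ 0.6, otherwise Mixed) can be written as a small Python function. The function name, the label format and the example membership vectors are hypothetical; only the 0.6 threshold and the POP/Mixed naming come from the text.

```python
def assign_group(q_components, threshold=0.6, mixed_label="Mixed"):
    """Assign an accession to POP1..POPk or to the mixed subgroup.

    q_components: ancestry fractions from the Q matrix, one per population.
    """
    best = max(range(len(q_components)), key=lambda i: q_components[i])
    if q_components[best] >= threshold:
        return f"POP{best + 1}"
    return mixed_label

# Hypothetical Q-matrix rows for three accessions.
print(assign_group([0.70, 0.10, 0.10, 0.10]))  # POP1
print(assign_group([0.10, 0.60, 0.20, 0.10]))  # POP2
print(assign_group([0.40, 0.30, 0.20, 0.10]))  # Mixed
```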
To evaluate the effects of these two factors and to control false positive associations, four models (GLM, Q, K, and Q + K) were compared using a quantile-quantile (Q-Q) plot (Figure 2B). Compared with the GLM and Q models, the K and Q + K models showed much better control of type I errors. Moreover, the Q + K model accounted for both population structure and relatedness. Therefore, all further analyses were performed using the Q + K model with the cMLM.
The GWAS was conducted using the BLUPs of individual germination over the two environments. A total of 12 SNPs significantly associated with seed dormancy were detected across eight chromosomes (Figure 2C). Four significant SNPs located close to each other were detected within a 260-kb region on the short arm of chromosome 7, among which the lead SNP (seq-rs3227, P = 1.19E-06) was used as the representative. As a result, nine peak SNPs were retained (Table 2). The contribution of a single SNP to the phenotypic variation ranged from 3.5 to 5.8%, and together the SNPs explained 34.9% of the variation (Table 2). In addition, all SNPs except seq-rs5598 on chromosome 12 were adjacent to or overlapped with previously reported QTLs (Table 2). In particular, the strongest trait-associated SNP, seq-rs3227 (P = 1.19E-06), on chromosome 7 was located ∼164.2 kb downstream of the cloned pleiotropic gene qSD7-1 for seed dormancy, and another peak SNP, seq-rs3527 (P = 1.93E-04), resided ∼3.7 kb downstream of Sdr4, which is involved in seed dormancy (Figure 2C; Figure S2) (Sugimoto et al., 2010; Gu et al., 2011).
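Collapsing neighboring significant SNPs into a single lead SNP, as was done for the four SNPs on chromosome 7, can be sketched as follows. This is one simple way to do it and not necessarily the authors' procedure: the function name, the `max_span_bp` parameter, and all SNP identifiers, positions and P-values in the example are hypothetical.

```python
def pick_lead_snps(hits, max_span_bp=300_000):
    """Group significant SNPs by position and keep the lowest-P SNP per group.

    hits: list of (snp_id, chrom, pos_bp, p_value). A new group starts when the
    chromosome changes or a SNP lies more than max_span_bp from the first SNP
    of the current group (max_span_bp is an assumed value).
    """
    leads, cluster = [], []
    for hit in sorted(hits, key=lambda h: (h[1], h[2])):
        starts_new_group = cluster and (
            hit[1] != cluster[0][1] or hit[2] - cluster[0][2] > max_span_bp
        )
        if starts_new_group:
            leads.append(min(cluster, key=lambda h: h[3]))
            cluster = []
        cluster.append(hit)
    if cluster:
        leads.append(min(cluster, key=lambda h: h[3]))
    return leads

# Hypothetical hits: four clustered SNPs plus two isolated ones.
hits = [
    ("snp_a", 7, 6_000_000, 1.0e-05),
    ("snp_b", 7, 6_100_000, 1.2e-06),
    ("snp_c", 7, 6_200_000, 2.0e-04),
    ("snp_d", 7, 6_260_000, 1.0e-04),
    ("snp_e", 7, 24_000_000, 1.9e-04),
    ("snp_f", 12, 1_000_000, 1.1e-04),
]
print([snp_id for snp_id, *_ in pick_lead_snps(hits)])  # ['snp_b', 'snp_e', 'snp_f']
```

A fixed physical window is the simplest grouping rule; grouping by pairwise LD (r²) between significant SNPs is a common alternative when LD decay varies along the genome.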
In recent years, association studies have become a leading method to detect genes (or QTLs) underlying human diseases (Edwards et al., 2005) and complex agricultural traits. However, some inflation of type I errors (false positive associations) caused by gene effects, allele frequencies, sample size and marker density is inevitable (Pe'er et al., 2006; Moonesinghe et al., 2007). Population structure and relative kinship are two common factors that determine the false positive rate (Zhang Z. et al., 2009). In our study, the high R²Q (Table 1), which was supported by the large germination variation among the five populations (Figure 2A), indicated that population stratification could cause false positive associations in the GWAS analyses. Thus, the cMLM (Q + K) method (Zhang et al., 2016b), which includes both covariates simultaneously, was used to control false positive results (Figure 2B).
In previous studies, many seed dormancy QTLs distributed across the 12 rice chromosomes have been identified using traditional linkage mapping (Gu et al., 2004, 2011; Guo et al., 2004; Wan et al., 2006; Sugimoto et al., 2010; Xie et al., 2011; Marzougui et al., 2012). In this study, the phenotypic variation explained by each individual significant SNP was <6%, indicating that rice seed dormancy is a typical quantitative trait controlled by loci of minor effect. Moreover, eight of the nine associated SNPs had been detected previously in linkage analyses (Table 2), and two cloned seed dormancy genes on chromosome 7, qSD7-1 and Sdr4, were adjacent to the peak trait-associated SNPs seq-rs3227 and seq-rs3527, respectively (Table 2). Thus, the GWAS lead SNPs were largely confirmed by previous reports, which shows that our results were reliable. Only one SNP, seq-rs5598 (P = 1.07E-04), had not been detected previously, which implies that genome-wide association mapping based on a high-density genetic map is also an effective strategy for uncovering novel QTLs for complex agronomic traits. Seed dormancy is easily affected by multiple environmental factors, and seed maturation time is one of the most important. However, it is difficult to investigate because there is no effective standard to define or evaluate when seeds are mature. Generally, seeds that have turned yellow and hardened are considered mature, but maturation is a continuous process and it is hard to record a specific time point. In this study, to reduce the effects of seed maturation time, we harvested filled seeds 30-40 days after heading, choosing the day for each accession empirically. The relationship between seed dormancy and seed maturation time should therefore be explored in depth. In addition, how to define and determine seed maturation time, which may be related to the grain-filling rate, will be a valuable topic for future study.

Trait-Associated SNPs Effects and Molecular Breeding Application
The phenotypic differences between the two alleles of each of the strongest trait-associated SNPs are summarized in Table S4. In absolute terms, the differences ranged from ∼0.67 to ∼25.4% for each SNP. Moreover, the differences between the two alleles reached significant (P ≤ 0.05) or highly significant (P ≤ 0.01 or 0.001) levels for five SNPs (Figure 3). Among them, the absolute differences for four SNPs were close to or more than 10%, with the largest, ∼25.4%, observed for seq-rs3527 (P = 1.93E-04) (Figure 3B; Table S4). The results suggested that these trait-associated SNPs could be useful for molecular breeding in the future.
To confirm this, the elite alleles with positive effects were used to test the effectiveness of pyramid breeding. Without considering interactions among these lead SNPs or environmental influences, the more elite alleles that were pyramided in a variety, the greater the increase in germination (Figure 3E). This result indicated that pyramiding favorable alleles could attenuate seed dormancy. In practice, breeders could introduce different numbers of favorable alleles into different varieties to adjust seed dormancy according to their breeding goals. In the GWAS panel, most of the accessions carried four to six favorable alleles and had moderate germination ranging from 30.8 to 42.4%. Only a few accessions had extreme germination, either <18.1% or >48.3% (Figure 3E). This result suggested that a suitable number of favorable alleles has been maintained in these accessions through artificial phenotypic selection during rice breeding to meet cultivation needs. Accessions with extreme phenotypes (very high or very low PHS) do not suit human needs and are gradually eliminated during the breeding process, but they may be good materials for genetic research. Because LS and HZ lie at different latitudes, their day length and temperature differ considerably, which raises the question of which environment is more suitable for mining favorable alleles for indica breeding. Alleles that are favorable and stably expressed at both locations are likely to be widely adopted by breeders throughout the country. However, alleles mined specifically at one location would also offer useful information for breeding under similar conditions. For example, seasonal changes in day length and temperature are more pronounced in HZ than in LS. Consequently, the phenotypic effects of some favorable alleles, especially those sensitive to day length and temperature, would be more apparent and easier to observe there.
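The pyramiding comparison amounts to counting favorable alleles per accession and averaging germination within each count class. A minimal Python sketch is given below; like the analysis described above, it ignores interactions among SNPs and environmental effects. The function name, the SNP labels, the favorable alleles and the germination values are hypothetical and are not taken from Table S4 or Figure 3E.

```python
from collections import defaultdict

def mean_germination_by_allele_count(accessions, favorable):
    """Average germination (%) for each number of favorable alleles carried.

    accessions: list of (genotype, germination_pct), where genotype maps a
                SNP id to the allele carried by that (inbred) accession.
    favorable:  dict mapping each lead SNP id to its favorable allele.
    """
    groups = defaultdict(list)
    for genotype, germination in accessions:
        n_favorable = sum(
            1 for snp, allele in favorable.items() if genotype.get(snp) == allele
        )
        groups[n_favorable].append(germination)
    return {n: sum(vals) / len(vals) for n, vals in sorted(groups.items())}

# Hypothetical two-SNP example.
favorable = {"snp1": "A", "snp2": "G"}
accessions = [
    ({"snp1": "A", "snp2": "G"}, 60.0),
    ({"snp1": "A", "snp2": "T"}, 40.0),
    ({"snp1": "C", "snp2": "G"}, 30.0),
    ({"snp1": "C", "snp2": "T"}, 10.0),
]
print(mean_germination_by_allele_count(accessions, favorable))
# {0: 10.0, 1: 35.0, 2: 60.0}
```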
How can these useful lead SNPs be applied to molecular breeding? Although the flanking sequence of each SNP (Table S2) can be used to detect the SNPs with next-generation sequencing technology, this approach is inconvenient for routine breeding. Thus, a dCAPS marker for each SNP was designed using dCAPS Finder 2.0 (Neff et al., 2002) (Table S5). These dCAPS markers can clearly distinguish the genotypes of the corresponding SNPs by polyacrylamide gel electrophoresis (Figure S3). These markers should be beneficial for marker-assisted selection breeding in the future.

Candidate Gene Prediction and Expression Profiling
The flanking regions within a 200-kb window (±100 kb) of the lead SNPs were searched to identify candidate genes in the Rice Haplotype Map Project Database (Lu et al., 2016). For the two SNPs on chromosome 7, seq-rs3227 (P = 1.19E-06) and seq-rs3527 (P = 1.93E-04), two known seed dormancy-related genes, qSD7-1 (Gu et al., 2011) and Sdr4 (Sugimoto et al., 2010), respectively, were located in the corresponding regions (Table 2, Figure S2). For the other seven SNPs, a total of 212 candidate genes were identified, which are summarized in Table S6. To identify the most promising candidate genes, the expression levels in embryo, endosperm and seed tissues, taken from the results of Davidson et al. (2012), were used as a screening reference. Interestingly, qSD7-1 had a high expression level in the seed at 5 days after pollination (DAP). By comparison, Sdr4 had high expression levels in the embryo at 25 DAP, the endosperm at 25 DAP, and seeds at 5 and 10 DAP (Figures 4A,B, Table S6). These results were consistent with previous reports (Sugimoto et al., 2010; Gu et al., 2011; Ye et al., 2015). After searching the expression data for the 212 candidate genes, the expression patterns of eight genes were found to be similar to those of qSD7-1 or Sdr4 (Figures 4A,B, Table S6). Expression data from the BAR database and the Rice Genome Annotation Project database were then used to verify the eight genes (Figure 4C, Table S6). The candidate gene LOC_Os03g10110, which encodes a cupin domain-containing protein, had high expression levels in seed tissues at different stages (Figures 4B,C, Table S6). Moreover, the expression pattern of this gene was highly consistent with that of Sdr4, which has very high expression levels in seeds at 5 and 10 DAP, embryos at 25 DAP, and endosperm at 25 DAP (Figure 4D, Table S6). Homology analysis indicated that the gene is homologous to grmzm2g078441, a maize gene in the cupin family of unknown function that is highly expressed in embryo tissues (Teoh et al., 2013). A previous report suggested that cupins comprise a superfamily of functionally diverse proteins that include germins and plant storage proteins (Dunwell, 1998). Storage proteins may be important in plants during seed germination and seedling growth (Shewry et al., 1995). The RiceVarMap database (http://ricevarmap.ncpgr.cn/) showed that a total of 34 SNPs were located within this gene (Table S7) and that 15 of them were nonsynonymous mutations (Figure 4E). Haplotype network analysis indicated that most indica accessions could be classified into haplotype groups I and III, while the majority of japonica accessions were assigned to haplotype group II (Figure 4F, Table S8). This result suggested that the haplotypes of the candidate gene display some indica-japonica specificity.

Population Division, Nucleotide Diversity and Linkage Disequilibrium between Improved Lines and Landraces
Rice improvement is the outcome of continuous artificial selection to enhance the adaptation of the plant to fit human needs. Consequently, various rice traits, such as plant type and seed shape, have been changed dramatically (Figures S4A,B). The successive selection effect also alters the genetic diversity at the genomic level, resulting in a distorted pattern of genetic variation and LD decay.
To identify selective sweeps across the whole rice genome, 89 improved lines and 266 landraces, most of which were widely distributed in southern China, were compared (Figures S4C,D). To better understand the population stratification and geographic structure, a neighbor-joining (NJ) tree was constructed and PCA was performed to illustrate the relatedness among the accessions. The results indicated that the improved lines comprised four subgroups, with PC1 and PC2 accounting for 22.0 and 8.4% of the genetic variation, respectively (Figures 5A,B). By contrast, the landraces were classified into five subgroups, with PC1 and PC2 explaining 18.2 and 10.0% of the genetic variation, respectively (Figures 5C,D). The FST ranged from 0.07 to 0.47 among all nine subgroups, indicating strong population differentiation among some of the subgroups (Figure S5). The FST averaged 0.17 among the subgroups of the improved lines, suggesting a moderate level of differentiation, and 0.26 among the landrace subgroups, implying greater population differentiation than in the improved lines (Figure S5). These averages were close to previously published values in rice (Lu et al., 2015), but much lower than that between indica and japonica (FST = 0.55), and slightly larger than those in soybean (Wen et al., 2014), maize, and sesame (Wei et al., 2015).
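As an illustration of how a pairwise FST value of the kind reported above can be obtained from allele frequencies, the sketch below uses Wright's definition, FST = (HT - HS) / HT, for biallelic SNPs with equally weighted subgroups, combined across SNPs as a ratio of sums. The source does not state which estimator or software was used, so the estimator, the function name fst_two_pops and the frequencies are all assumptions for illustration.

```python
def fst_two_pops(p1, p2):
    """Wright's F_ST between two subgroups from per-SNP allele frequencies.

    p1, p2: frequencies of the same allele at each SNP in subgroup 1 and 2.
    Assumptions: biallelic SNPs, equal subgroup weights, ratio of sums over SNPs.
    """
    num = 0.0  # sum over SNPs of (H_T - H_S)
    den = 0.0  # sum over SNPs of H_T
    for a, b in zip(p1, p2):
        p_bar = (a + b) / 2.0
        h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
        h_s = (2.0 * a * (1.0 - a) + 2.0 * b * (1.0 - b)) / 2.0  # mean within subgroups
        num += h_t - h_s
        den += h_t
    return num / den if den > 0.0 else 0.0


# Hypothetical frequencies: strongly diverged at one SNP, identical at another.
print(round(fst_two_pops([0.1], [0.9]), 2))            # strong differentiation
print(round(fst_two_pops([0.5, 0.3], [0.5, 0.3]), 2))  # no differentiation
```

Sample-size-corrected estimators differ from this simple form, so values from this sketch should not be expected to reproduce the published figures.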
To further evaluate the degree of human selection, the gene diversity, PIC and LD decay were quantified for the two populations. Gene diversity and PIC values for the landraces were both significantly higher than in the improved lines (Figures 6A,B). These estimates were close to previous estimates from 926 SNPs in maize, but much smaller than those calculated from simple sequence repeat (SSR) markers in rice and spring barley landraces (Pasam et al., 2014). This result demonstrated that the landraces had retained more genetic diversity than the improved lines. Because increased LD is another hallmark of artificial selection in rice, the LD decay distances of the two populations were compared. The LD decay distance increased from 163.3 kb for the landraces to 352.4 kb for the improved lines (Figure 6C). The larger LD decay distance in the improved lines may have been caused by a loss of genetic diversity and a low frequency of genetic recombination because of human selection forces during improvement (Lam et al., 2010). The extent of LD for the landraces was similar to that in a previous evaluation in rice (Lu et al., 2015), but much greater than that in maize (Remington et al., 2001).

Figure 6 (caption fragment). Polymorphism information content in the improved lines and landraces. (C) Estimation of genome-wide average LD decay distances from the improved line (red) and landrace panels (blue). The dashed line represents the LD decay rate, which was measured as the chromosomal distance at which the average pairwise correlation coefficient (r²) dropped to half of its maximum value. ***P = 0.001.

Figure 7 (caption fragment). ... (Table S9) are marked with red, yellow and dark blue dots, respectively. (B-F) Five GWAS results, including the two seed dormancy genes qSD7-1 and Sdr4, that are very close to strong selection signals. The peaks of blue lines and red dots represent selective sweep regions and GWAS signals, respectively. The orientations of the two known genes and selective sweep regions are indicated by arrows. (G) The extent of LD around the SNPs in multiple resistance genes on chromosome 6. The r² values are indicated by the color key.
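The LD decay distance defined in the Figure 6 caption (the chromosomal distance at which the average pairwise r² drops to half of its maximum value) can be sketched as follows. The binned curve is hypothetical, and both the linear interpolation between bins and the function name ld_decay_distance are assumptions; the source does not describe how the half-maximum crossing was located.

```python
def ld_decay_distance(dist_kb, mean_r2):
    """Distance (kb) at which mean r^2 first falls to half of its maximum.

    dist_kb: bin distances in ascending order.
    mean_r2: average pairwise r^2 in each bin.
    Assumption: linear interpolation between adjacent bins.
    Returns None if the curve never reaches half of its maximum.
    """
    half = max(mean_r2) / 2.0
    for i in range(1, len(dist_kb)):
        r_prev, r_cur = mean_r2[i - 1], mean_r2[i]
        if r_prev > half >= r_cur:
            frac = (r_prev - half) / (r_prev - r_cur)
            return dist_kb[i - 1] + frac * (dist_kb[i] - dist_kb[i - 1])
    return None


# Hypothetical binned LD curve (not data from the study).
dist_kb = [0, 100, 200, 300, 400]
mean_r2 = [0.8, 0.5, 0.3, 0.25, 0.2]
decay = ld_decay_distance(dist_kb, mean_r2)
print(decay)
```

In this toy curve the maximum is 0.8, so the half-maximum value of 0.4 is crossed between the 100-kb and 200-kb bins.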

Genome Wide Selective Sweep Signal Scan
Artificial selection has probably changed the nucleotide diversity within the genomes of cultivars. To detect the genomic regions most affected by artificial selection during rice improvement, PIC ratios between the landraces and improved lines were screened throughout the whole genome (Figure 7A). A PIC ratio >3.0 (the top ∼1% of all values) was used as an empirical threshold. After screening, the PIC ratios of 57 SNPs exceeded the cutoff. Peak signals within a ∼±1.5-Mb window were grouped into a single DNA region because selection leads to a greater LD decay distance and an extended haplotype structure (Qanbari et al., 2014). Finally, 27 selection DNA regions were identified (Table S9). Among these regions, 18 known domestication-related genes resided in 15 corresponding selection regions, such as the "green revolution" gene sd1 (Sasaki et al., 2002), the seed shattering-related gene sh4 (Li et al., 2006), qSD7-1 and Sdr4 for seed dormancy (Sugimoto et al., 2010; Gu et al., 2011) and IPA1 for ideal plant architecture (Jiao et al., 2010) (Figure 7A, Table S9).
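The screening steps above (per-SNP PIC ratio, the >3.0 cutoff, and grouping of nearby peak signals) can be sketched as follows. The PIC formula is the standard biallelic form, PIC = 1 - p² - q² - 2p²q². The SNP names, positions and allele frequencies are hypothetical, and two choices are assumptions not stated in the source: SNPs that are monomorphic in the improved lines are skipped, and signals are chained into one region when consecutive positions lie within 1.5 Mb of each other.

```python
def pic_biallelic(p):
    """Polymorphism information content of a biallelic SNP with allele frequency p."""
    q = 1.0 - p
    return 1.0 - p * p - q * q - 2.0 * p * p * q * q


def pic_ratio(p_landrace, p_improved):
    """PIC(landraces) / PIC(improved lines); None if undefined (assumption: skip)."""
    denom = pic_biallelic(p_improved)
    if denom <= 0.0:
        return None
    return pic_biallelic(p_landrace) / denom


def group_signals(signals, max_gap=1_500_000):
    """Merge (chrom, pos) signals into (chrom, start, end) regions.

    Assumption: a signal joins the current region if it lies within
    `max_gap` bp of the last signal already in that region.
    """
    regions = []
    for chrom, pos in sorted(signals):
        if regions and regions[-1][0] == chrom and pos - regions[-1][2] <= max_gap:
            regions[-1] = (chrom, regions[-1][1], pos)
        else:
            regions.append((chrom, pos, pos))
    return regions


# Hypothetical SNPs: (name, chrom, pos, freq in landraces, freq in improved lines).
snps = [
    ("snp1", 7, 6_000_000, 0.50, 0.05),
    ("snp2", 7, 7_000_000, 0.50, 0.04),
    ("snp3", 7, 12_000_000, 0.45, 0.03),
    ("snp4", 6, 1_000_000, 0.40, 0.30),
]
THRESHOLD = 3.0  # empirical cutoff quoted in the text
selected = []
for name, chrom, pos, p_land, p_imp in snps:
    ratio = pic_ratio(p_land, p_imp)
    if ratio is not None and ratio > THRESHOLD:
        selected.append((chrom, pos))
regions = group_signals(selected)
print(regions)
```

In this toy input, snp1 and snp2 merge into one region, snp3 forms its own region because it is 5 Mb away, and snp4 does not pass the cutoff.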
Interestingly, in our GWAS, the qSD7-1 gene for seed dormancy was located near the most significant SNP, seq-rs3227 (P = 1.19E-06) (Table 2). This SNP also showed strong selection signals, with a maximum PIC ratio of ∼4.0 within a 6-Mb genome region (Figure 7B). Notably, another selection signal, seq-rs3480 (PIC ratio = 8.8), appeared near the seed dormancy-related gene Sdr4, which resided upstream of the GWAS signal seq-rs3527 (P = 1.93E-04) (Table 2) at a physical distance of ∼2.8 Mb (Figure 7C). More importantly, there were three strong selection signals in the vicinity of three other peak SNPs previously identified by GWAS (Table 2, Figures 7D-F). In particular, one of the GWAS peak SNPs (seq-rs2896, P = 7.96E-05) was located just 1.1 kb downstream of the strongest selection SNP (seq-rs2895, PIC ratio = 3.2) (Figure 7E). In addition, on chromosome 6, there were eight selection signals adjacent to rice blast resistance genes (Figure 7A, Table S9). The pairwise LD values showed that most of these were not located in one haplotype block (r² < 0.8) (Figure 7G), implying that rice resistance-related traits may also have experienced strong artificial selection during rice improvement (Figure 7E).
Taken together, our results suggested that most of the selective sweep regions (∼66.7%) were adjacent to either known domestication-related genes or SNPs that we identified previously by GWAS (Table 2, Figure 7, Table S9). These results indicated that genome-wide screening to detect selective sweeps using PIC ratios between two populations with different degrees of human selection can be used to identify DNA regions potentially related to artificial selection in rice. Additionally, the accuracy of our previous GWAS results (Table 2) was further supported by these selective sweep regions. However, no selection signals were detected on chromosomes 5 and 12. There are two possible reasons for this. On the one hand, a lack of sufficient polymorphic SNP markers in these specific genome regions in the two panels may have led to poor detection capability. On the other hand, weak selective pressure on these two chromosomes may have limited our power to detect selection signals.

CONCLUSIONS
In this study, a total of nine known and new SNPs associated with rice seed dormancy were identified via GWAS. dCAPS markers were designed to accelerate molecular breeding for rice seed dormancy. Moreover, 212 candidate genes were identified. The expression profiles and haplotype network data from public databases revealed eight genes, especially LOC_Os03g10110, which has a maize homolog involved in embryo development, as candidate regulators for further investigations to verify their biological functions. A genome-wide screen to detect artificial selection signals identified 27 selection DNA regions. Among them, 15 were adjacent to known domestication-related genes and three strong selection signals were located near GWAS lead SNPs. These results not only further support the accuracy of our GWAS findings but also suggest that genome-wide screening for selective sweeps can be used to identify new improvement-related DNA regions, even when the associated phenotypes are unknown. This study enhances our knowledge of the genetic variation in rice seed dormancy, and the new GWAS SNPs will provide real benefits for genomic selection in breeding programs. More importantly, the genomic consequences of improvement footprints will enable the detection of domestication-related traits.

AVAILABILITY OF DATA AND MATERIAL
The SNP dataset used during this study is available in the Dryad digital repository (http://dx.doi.org/10.5061/dryad.cp25h). Any other datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.