Genetic Diversity and Elite Allele Mining for Grain Traits in Rice (Oryza sativa L.) by Association Mapping

Mining elite alleles for grain size and weight is of importance for the improvement of cultivated rice and selection for market demand. In this study, association mapping for grain traits was performed on a selected sample of 628 rice cultivars using 262 SSRs. Grain traits were evaluated by grain length (GL), grain width (GW), grain thickness (GT), grain length to width ratio (GL/GW), and 1000-grain weight (TGW) in 2013 and 2014. Our result showed abundant phenotypic and genetic diversities found in the studied population. In total, 2953 alleles were detected with an average of 11.3 alleles per locus. The population was divided into seven subpopulations and the levels of linkage disequilibrium (LD) ranged from 34 to 84 cM. Genome-wide association mapping detected 10 marker trait association (MTAs) loci for GL, 1MTAs locus for GW, 7 MTAs loci for GT, 3 MTAs loci for GL/GW, and 1 MTAs locus for TGW. Twenty-nine, 2, 10, 5, and 3 elite alleles were found for the GL, GW, GT, GL/GW, and TGW, respectively. Optimal cross designs were predicted for improving the target traits. The accessions containing elite alleles for grain traits mined in this study could be used for breeding rice cultivars and cloning the candidate genes.


INTRODUCTION
Today, rice (Oryza sativa L.) feeds more than half the world's population, and accounts for 20% of the world's total calorie intake. Although a staple in diets worldwide, rice is central to the economy and landscape of wider East Asian, Southeast Asian, and South Asian ancient, and modern civilizations. The rapid human population growth in the world is boosting demand for a corresponding increase in crop grain yield. Developing countries account for 95% of global rice production, with China and India responsible for nearly half the world output. Nine out of the top 10 rice producing countries are in Asia and interestingly, major rice producers are also typically the major rice consumers.
Rice grain shape related traits (grain length, width, thickness, and length to-width ratio) have a direct bearing on grain weight and quality, and hence commercial success, of modern rice (O. sativa L.) cultivars (Redoña and MacKill, 1998). Grain yield in rice is determined by three major components: number of panicles per plant, number of grains per panicle, and grain weight. Among these, the most reliable trait is grain weight, which is measured as the 1000-grain weight (TGW). Grain size, as specified by grain length (GL), grain width (GW), grain thickness (GT), and grain length to-width ratio (GL/GW), is a major determinant of grain appearance quality and grain weight in rice.
Grain size is an important agronomic trait for artificial selection in rice breeding. Breeders tend to select plants with large seed size for high yield and appropriate grain size for milling yield and market preferences. However, it is difficult for breeders to improve grain size efficiently by phenotypes, since the traits are quantitatively inherited (McKenzie and Rutger, 1983). Rice yield is the most noticeable characteristic to farmers while the crop is in the ground, but when the product of the crop, the milled rice, reaches the market, quality becomes the key determinant of its sale-ability. Buyers, millers, and consumers judge the quality of rice on the uniformity of its size and shape as well as the pleasing appearance of its overall size-shape relationship.
Many QTLs for grain size and weight have been identified and reported by various researchers (Lin et al., 1996;Hua et al., 2002;Xing et al., 2002;Aluko et al., 2004;Li et al., 2004;Agrama et al., 2007;Song et al., 2007;Bai et al., 2010;Huang et al., 2011;Wang et al., 2011;Tran Thi et al., 2014;Dang et al., 2015). Genes that affect the grain size have been identified in inter-specific crosses (Xiao et al., 1998;Brondani et al., 2002;Ordonez et al., 2010). In most cases, wild-type alleles were associated with small grain, whereas cultivar alleles were associated with large grains. Usually, grain size is determined by GL, GW, and GT. The three traits are quantitatively inherited under the control of several or many genes. To date, 5 key genes controlling grain size have been isolated in rice: GS3, GW2, qSW5 or GW5, GIF1, and GS5 (Fan et al., 2006;Shomura et al., 2008;Li et al., 2011). GS3 has a major effect on GL, whereas qSW5/GW5 and GW2 confer both the GW and weight in rice. GIF1 encodes a cell-wall invertase that is required for carbon partitioning during early grain filling, and the over-expression of GIF1 by using its native promoter leads to large grains (Wang et al., 2008). Shomura et al. (2008) found that a deletion in qSW5 was associated with grain size owing to an increase in the cell number in the outer glume of the rice spikelet.
A number of QTL studies showed that one genomic region was associated with several traits, especially yield component traits, indicating linkage and/or pleiotropic effects (Xiao et al., 1996;Tan et al., 2000;Yu et al., 2005;Tian et al., 2006;Liu et al., 2015). Recent advent of molecular and computational tools now enables the estimation of genetic diversity and population structure of rice germplasm rather easily. With the growing availability of genome sequence data and advances in technology for rapid identification and scoring of genetic markers, linkage disequilibrium (LD) based genome-wide association study (GWAS) has gained favor in higher plants, especially crops, for the mapping of genetic factors responsible for trait variation (Remington et al., 2001;Gupta et al., 2005;Mackay and Powell, 2007;Cockram et al., 2008;Sneller et al., 2009;Atwells et al., 2010;Zhou and Stephens, 2012). Providing the intrinsic nature of exploiting historical recombination events, association mapping offers increased mapping resolution to polymorphisms at sequence level and should therefore enhance the efficiency of gene discovery and facilitate marker assisted selection (MAS) in plant breeding (Moose and Mumm, 2008;Zhang et al., 2010). Once the plant cultivars are genotyped with high-density markers, association mapping is promising in resolving the genetic basis of traits of both economic and ecological importance.
The objectives of this study were (1) to evaluate the population structure and genetic diversity of a set of rice materials; (2) to detect the extent of LD between pairs of SSR markers on genome-wide scale in the population used; and (3) to detect QTLs controlling seed grain components traits and mine elite alleles; (4) to explore design of parental combinations for cultivar improvement.

Sample Collection
A total of 628 rice accessions were used from China (507) and Vietnam (121) in this study (Supplementary Table S1). Five hundred and seven varieties originated from different regions of China and have been widely used as parents in plant breeding during the past decades. The seeds of all accessions were collected, stored, and supplied by State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China.

Phenotypic Data Collection
For field studies, 628 rice accessions were planted out in a field at the Jiangpu Experimental Station, Nanjing Agricultural University, China (118.62 • E, 32.07 • N) from early May to November in 2013 and 2014 using a randomized complete block design with two replications. For all varieties, seedlings aged about 30 days were transplanted onto the field at a spacing of 20 cm between rows and 17 cm between each individual with standard agronomic management. The middle five plants in the central row of each plot were sampled to examine grain size traits. Grain size traits contained grain length (GL), grain width (GW), grain thickness (GT), ratio of grain length to grain width (GL/GW), and 1000-grain weight (TGW), were evaluated. For each trait, means of the replicates were used in the data analyses.

SSR Marker Genotyping
Genomic DNA was extracted from leaf tissue of each selected plant according to the methods described by Murray and Thompson (1980). According to the published rice molecular map and microsatellite database of Temnykh et al. (2000) and McCouch et al. (2002), 262 polymorphic microsatellite markers, approximately evenly distributed scattered on 12 chromosomes were used for genotyping (Figure 1). Primers were synthesized by Shanghai Generay Biotech Co. Ltd., Shanghai, China. Each 10 µl PCR reaction consisted of 10 mM tris HCl (PH 9.0), 50 mM KCl, 0.1% triton X-100, 1.5 mM MgCl 2 , 0.5 nM dNTPs, 0.14 pM forward primers, 0.14 pM reverse primers, 0.5 U of taq polymerase, and 20 ng of genomic DNA. DNA amplification was performed using a PTC-100 TM Peltier Thermal Cycler (MJ Research TM Incorporated, USA) under the following conditions: (1) denaturation at 94 • C for 5 min; (2) 34 cycles of denaturation at 94 • C for 0.5 min, annealing at 55-61 • C for 1 min, and extension at 72 • C for 1 min; (3) final extension at 72 • C for 10 min. The PCR products were run on 8% polyacrylamide gel at 150 V for 1 h, and visualized using silver staining.

Heritability
Analysis of variance (ANOVA) was carried out to determine genotypic and environmental variances among the traits measured using the SAS package (SAS Institute Inc., CARY, NC, USA). Heritability in the broad sense (H 2 B ) was computed on the basis of the natural population through analysis of variance using the formula H 2 B = σ 2 g /(σ 2 g + σ 2 e /n), where σ 2 g is genetic variance, σ 2 e is error variance, and n is number of replicates.

Genetic Diversity, Phylogenetic Analysis, and Population Structure
The summary statistics including the number of alleles per locus, major allele frequency, gene diversity, and polymorphism information content (PIC) values were determined using PowerMarker version 3.25 (Liu and Muse, 2005). Nei's distance (Nei et al., 1983) was calculated and used for the unrooted phylogeny reconstruction using neighbor joining method as implemented in PowerMarker with the tree viewed using MEGA 4.0 (Tamura et al., 2007). Analysis of population structure among rice accessions was performed using the software package STRUCTURE in its   revised version 2.2 (Falush et al., 2007). The optimum number of populations (K) was selected after five independent runs of a burn-in of 50,000 iterations followed by 100,000 iterations for each value of K (testing from K = 2 to K = 10). A model based clustering algorithm was applied that identified subgroups with distinctive allele frequencies and placed individuals into K clusters, where K is chosen in advance but can be varied for independent runs of the algorithm. The most likely number of clusters (K) was selected by comparing the logarithmized probabilities of data [Pr (X|K)] and a value for each value of K according to Pritchard et al. (2000). The software SPAGeDi55 (Spatial Pattern Analysis of Genetic Diversity) was used to calculate the pairwise relatedness coefficients (K, kinship matrix) to estimate the genetic relatedness among individuals.

Linkage Disequilibrium
Linkage disequilibrium (LD) was evaluated for each pair of SSR loci using TASSEL 2.1, both on all accessions and on the clusters as inferred by STRUCTURE. D ′ measures modified for multiple loci were used. Significance (P-values) of D ′ for each SSR pair was determined by 100,000 permutations. To compare phenotypes of the seven groups identified by STRUCTURE, ANOVA was employed with the SAS program version 8 (SAS Institute Inc., Cary, NC), followed by multiple means comparisons among these groups.

Association Mapping
To avoid possible spurious associations, the mixed linear model (Q+K) built in TASSEL V2.1 (Excoffier et al., 2005;Yu and Buckler, 2006;Bradbury et al., 2007) was used to account for population structure and relatedness of individuals among 628 rice accessions. Association between marker alleles and grain size component traits and weight data were performed (trait analysis by association, and linkage) taking into account gross level population structure (Q) (Bradbury et al., 2007). The P-value determined whether a marker is associated with the trait and the R 2 indicated the fraction of the total variation explained by the marker identified. Using the association locus identified, the "null allele" (non-amplified allele) was used to determine the phenotypic effects of other alleles (Breseghello and Sorrells, 2006). The formula used for calculating phenotypic effect of a single allele was a i = x ij /n i − N k /n k , where a i was the phenotypic effect of the allele of i; x ij denoted the phenotypic measurement values of j variety carrying the allele of i; n i represented the number of materials carrying the allele of i; N k meant the phenotypic value of the variety of k carrying the null allele; and n k represented the number of materials for the null allele. If we want to increase the trait value of interest, we have to use alleles with positive effect as the elite allele likewise, if we want to reduce the trait, we use allele of negative effect as the elite allele. In our study, marker loci with PVE >5% were considered for further analysis. Top 30 varieties with higher phenotypic values were analyzed together with the selected marker loci to determine elite alleles and their carrier varieties.

Phenotypic Evaluations
Mean value, range, coefficient of variation, kurtosis, and skewness for each trait measured in 628 accessions were shown in Table 1. The phenotypic data of the GW, GT, GL/GW, and TGW followed a normal distribution but GL followed a skewed distribution based on the skewness values and kurtosis. A two-way analysis of variance (ANOVA) showed that differences among cultivars for each trait were highly significant (P < 0.01), indicating a large amount of genetic variation existed in the population.
There existed variances between 2013 and 2014 for the five grain components traits studied (Supplementary Tables S2-S6 Table 2 showed that GL was negatively correlated with GW, GT, and positively correlated with GL/GW but shows no correlation with TGW when α = 0.01. GW was positively correlated with GT and TGW but negatively correlated with GL/GW. Again, GT also shows positive correlation with TGW but negatively correlated to GL/GW. TGW was negatively correlated to GL/GW.

Genetic Diversity
A total of 262 SSR markers, randomly distributed across the genome, were used to evaluate the genetic diversity of the population. All of the 262 SSR markers were polymorphic across the 628 rice accessions and a total of 2953 alleles were detected (Supplementary Tables S7, S8). Numbers of alleles ranged from 3 (at locus RM7163_chr11) to 25 (RM7545_chr10) with an average of 11.3 alleles per locus (Figure 2 Table  S8). Two hundred and twenty-six markers (89.6%) were highly informative (PIC > 0.5), 22 (8%) moderately informative (0.5 > PIC > 0.25), and 2 (0.75%) slightly informative (PIC < 0.25).

Population Structure and Genetic Relatedness
The model-based simulation of population structure using SSRs showed that the log-likelihood increased with the elevation of model parameter K, so the statistic K was used to determine a suitable value for K. Here, the K-value was much higher for the model parameter K = 7 than for other values of K. Population structure data based on the Q matrix for each accession are summarized in Supplementary Table S1, and the 628 accessions could be divided into seven subpopulations, viz. from POP1 to POP7 (Figure 3). A neighbor-joining tree of 628 accessions was constructed based on Nei's genetic distance and the information revealed was consistent with the result from STRUCTURE analysis (Figure 4). For instance, the accessions from Vietnam were mainly clustered into POP2 while the accessions which were landraces coming from Taihu lake valley were in POP3.
Genetic relatedness analysis in this study based on 262 SSR markers showed that more than 50% of the kinship coefficient values were <0.05 (Figure 5), 35.6% were in a range of 0.05-0.10, and the remaining 9.5% showed various degrees of genetic relatedness, indicating that there was no or weak relatedness between pair-wise accessions used. Based on the results of the relatedness analysis, a K matrix was constructed for the association analysis.

Linkage Disequilibrium
Among the seven subpopulations, the lowest percentage of significant pair-wise loci in LD was found in POP3 (4.4%), and the highest was found in POP4 (29.3%; Table 3). POP1 had the highest average of D ′ -value among the subpopulation (0.64) and POP7 had the lowest average of D ′ -value among the seven subpopulations (0.52), suggesting that accessions in this subpopulation might have been subjected to extreme artificial selection. Regression analysis between the D ′ -value and genetic distance of syntenic (intra-chromosome) marker pairs shown that the seven subpopulation genomes fitted in the equation y = blnx + c (Figure 6). The minimum distance of LD decay for POP1 to POP7 was 79. 39, 42.82, 83.67, 78.26, 73.70, 67.48, and 34.35 cM, respectively. It was realized that POP7 had the lowest decay velocity, while POP3 demonstrated the fastest decay velocity among the seven subpopulations.

Association Analysis
Marker-trait associations were detected using MLM for five grain traits across 2 years. Association analysis identified markertrait associations (P < 0.05) for all the traits evaluated (Supplementary Table S9). Table 4 showed the significant association loci with the PVE more than 5% for five grain traits across 2 years. Ten markers located on all the 12 chromosomes except chromosome 3, 10, and 11 were associated with GL. The range of phenotypic variation explained (PVE) was from 5.03 to 21.97%. RM297_Chr 1, which resides on 161.3 cM, had the maximum PVE for GL, which was 21.97% in 2013 and 21.89% in 2014 (Table 4). One marker was found distributed on chromosome 1 associated with GW, of which RM1_Chr 1 had PVE, from 6.1 to 8.3% (Table 4). Seven markers distributed on chromosome 1 were found associated with GT, of which RM84 _chr1 explained the maximum phenotypic variation, viz 19.16% in 2013 and 23.46% in 2014, respectively ( Table 4). Three markers associated with GL/GW distributed on chromosomes 1 and 2 with the corresponding PVE range from 5.21 to 19.37% of which RM297_chr1 had the highest PVE of 19.37% in 2013 and 18.47% in 2014 explaining the maximum phenotypic variation ( Table 4). One marker RM259 associated with TGW distributed on chromosome 1 and the PVE was from 5.40% in 2013 and 5.88% in 2014. RM259_chr1, reside on 66.3 cM on the short arm of chromosome 1 ( Table 4).

Mining Elite Alleles
In this study, the alleles with positive effects are considered elite alleles for all five grain traits measured. Supplementary  Table S10 shows a summary of elite alleles of the significant association loci with PVE more than 5% and their typical carrier materials for the given traits. The total numbers of elite alleles for GL, GW, GT, GL/GW, and TGW detected across the entire population were 29, 2, 10, 5, and 3, respectively. The allele RM3600-120 bp showed the largest phenotypic effect (1.81 mm) for GL, and the typical carrier accession was Yuedao 62 (Supplementary Table S10). The allele RM1-90 bp showed the largest phenotypic effect (0.20 mm) for GW, and the typical carrier accession was Wumangzaodao (Supplementary Table  S10). The allele RM3453-135 bp showed the largest phenotypic effect (0.31 mm) for GT, and the typical carrier accession was Zhen9424 (Supplementary Table S10). The allele RM1-170 bp showed the largest phenotypic effect (0.92) for GL/GW, and the typical carrier accession was Yuedao100 (Supplementary Table  S10). The allele RM259-185 bp showed the largest phenotypic effect (0.74 g) for TGW, and the typical carrier accession was Yuedao86 (Supplementary Table S10).

Excellent Cross Design for Novel Parental Combination
Based on the number of positive alleles that could be pyramided into an individual plant and the expected phenotypic effects, the best five cross combinations for improving GL, GT, and GL/GW were proposed ( Table 5). The elite alleles carried by the parents in excellent crosses were listed in Supplementary Table   S11. These are super parents or varieties carrying elite alleles which are potential for cross breeding purpose. Alleles identified in these materials will help in the cross breeding programme for grain size improvement. Figure 7 shows parents in their superior crosses. Yuedao 62 and Ningjing1R-37 were found repeatedly in the supposed parental combinations which demonstrate their excellent possession of elite alleles.

DISCUSSION
Population structure is a strong confounding factor in association studies, especially with respect to traits that are important in local adaptation or diversifying selection and familial relatedness associated with recent co-ancestry (Nordborg and Weigel, 2008). Here, we made use of the MLM approach, which took into account population structure and familial relatedness, in order to reveal the association between SSR makers and five seed grain traits in rice. Such genome-wide association mapping should, therefore, be valid even in a selfpollinating species such as rice with high levels of population structure and much familial relatedness. By use of Structure software, population consisted of 628 accessions was clustered into seven subpopulations, i.e., POP1, POP2, POP3, POP4, POP5, POP6, and POP7. One common discovery about each subpopulation is that the accessions with the same geographic origin were classed into the same group. It was revealed that POP6 had accessions coming from Vietnam likewise POP1 also revealed accession coming from the Northeastern part of China. This helps us to conclude that the result of the groupings by structure analysis was essentially consistent with the geographical region. General agreement between the genetic and predefined clusters suggests that knowledge of the ancestral background can facilitate choices of parental lines in ricebreeding programs. Also, the sample size used in this study is larger than used in the previous studies in rice (Garris et al., 2005;Agrama et al., 2007;Mather et al., 2007;Rakshit et al., 2007;Thomson et al., 2007). A larger sample size increases detection power and allows the quantification of the effects of more alleles.
Linkage disequilibrium is the basis of association analysis (Flint-Garcia et al., 2003). In comparison to other populations, the levels of LD for POP2 and POP5 in this study (42.82 and 34.35 cM, respectively), we observed a substantial drop in LD values between 20 and 40 cM, suggesting it should, nevertheless, be possible to achieve resolution down to the 25 cM level. The same observation on LD at larger distances was found in Arabidopsis (Nordborg et al., 2002) and (Kraakman et al., 2004). Many factors affect LD (Ardlie et al., 2002), but the most probable cause of the high level of LD in rice is that it is a self-pollinated species. Selection can also increase LD, for instance, by a hitchhiking effect in which the alleles at flanking loci of a locus under selection can be rapidly swept to high frequency or fixation (Kraakman et al., 2004). These studies suggest that the extent of LD varies among different genomic regions (Mather et al., 2007), different rice accessions studied (Agrama and Eizenga, 2008) and different markers used. Thus, in large populations like this kind of autogamous species, the stretches of LD extending over several cM are expected. Again, on the basis of the LD decay range in this present study, genome-wide LD mapping is possible using this set of rice materials.
In this study, we identified twenty-two markers (with PVE more than 5%) associated with grain traits using the entire set of accessions, including 10 associated with GL, 1 associated with GW, 7 associated with GT, 3 associated with GL/GW, and 1 associated with TGW. Seven of the 22 associations were in regions where the QTL associated with the given traits had been identified (http://www.gramene.org/), and they are listed in Supplementary Table S12. Fifteen loci in this study were found for the first time, including 4 for GL, 1 for GW, 7 for GT, and 3 for GL/GW. For the 4 new loci in GL, RM297_chr 1 had the largest PVE (21.97% in 2013 and21.89% in 2014). The marker RM1, which is located on chromosome 1, was a new locus associated with GW. For the 7 new loci in GT, RM84_chr 1 had the largest PVE (19.16% in 2013 and23.46% in 2014). For the 3 new loci in GL/GW, RM297_chr 1 had the largest PVE (19.37% in 2013 and18.47% in 2014). These results might increase the QTL information for the grain traits and provide help for further fine mapping and cloning.
For GL trait, heritability in the broad sense averaged across 2 years was 95.8%. Among the ten SSR associated markers detected for GL, RM297_chr1 had the largest PVE (21.97% in 2013 and 21.89% in 2014). RM297-145 bp had the largest phenotypic effect value (0.22 mm) among positive elite alleles found on their marker loci. And the typical carrier material was Tongjing 109. We expect GL could be improved by the crosses listed in Table 5 which shows a potential cross breeding of super parents based on the number of alleles that could be pyramided into an individual plant and the expected phenotypic effects for improving seed grain size component traits and predicted alleles outcomes.
For GW trait, heritability in the broad sense averaged across 2 years was 91.9%. One SSR associated marker detected for GW, RM1_chr1 had PVE of 8.33% in 2013 and 6.05% in 2014. RM1-90 bp had the highest phenotypic effect value (0.20 mm) among the positive alleles found on the locus (Supplementary Table S10).
For GT trait, heritability in the broad sense averaged across 2 years was 92%. Among the seven SSR associated markers detected for GT, RM84_chr1 had the largest PVE (19.16% in 2013 and 23.46% in 2014). RM3453-135 bp had the largest phenotypic effect value (0.31 mm) among positive elite alleles found on their marker loci and the typical carrier material was Zhen9424. We expect GT could be improved by the crosses listed in Table 5.
For GL/GW trait, heritability in the broad sense average across 2 years was 98.2%. Among the three SSR associated marker detected for GL/GW, RM297_chr1 had the largest phenotypic effect value (19.37% in 2013 and 18.47%in 2014). RM297-145 bp had the highest phenotypic effect value (0.19) among the positive elite alleles found on the loci. This elite allele was carried by four accessions, in which Zijianjingnuo was the typical carrier material. GL/GW trait could be improved by the crosses listed in Table 5.
Finally on TGW trait, heritability in the broad sense average across 2 years was 95.7%. One SSR associated marker detected for TGW, RM259_chr1 had phenotypic effect value of 5.40% in 2013 and 5.88% in 2014. RM259-185 bp had the highest phenotypic effect value (0.74 g) among the positive alleles found on the locus (Supplementary Table S10).
Genetic relatedness analysis based on 262 SSR markers showed that more than 50% of the kinship coefficient values were <0.05, 35.6% were in a range of 0.05-0.10, and the remaining 9.5% showed various degrees of genetic relatedness (Figure 6), indicating that there was no or weak relatedness between pair-wise accessions used in this study.
In conclusion, association mapping provides a powerful tool in unraveling the complex traits in plants and helps to identify superior alleles through marker-assisted selection in plant breeding. This provides essential clues that can be used to improve rice breeding programs.

AUTHOR CONTRIBUTIONS
DH planned and designed the research; WE, XD, LL, EB, and IZ carried out the field experiment; WE, XD, LL, and EB carried out the molecular experiment; WE and XD analyzed data; WE and XD wrote the manuscript; DH revised the manuscript. WE and XD contributed equally to this work. All authors read and approved the manuscript.

ACKNOWLEDGMENTS
Funding support was provided by a grant from the China national "863" program (2010AA101301), a grant from doctoral found of Educational Ministry (B0201300662) and a grant from National Natural Science Foundation of China (3157 1743).