Genotyping of Soybean Cultivars With Medium-Density Array Reveals the Population Structure and QTNs Underlying Maturity and Seed Traits

Soybean was domesticated about 5,000 to 6,000 years ago in China. Although genotyping technologies such as genotyping by sequencing (GBS) and high-density array are available, it is convenient and economical to genotype cultivars or populations using medium-density SNP array in genetic study as well as in molecular breeding. In this study, 235 cultivars, collected from China, Japan, USA, Canada and some other countries, were genotyped using SoySNP8k iSelect BeadChip with 7,189 single nucleotide polymorphisms (SNPs). In total, 4,471 polymorphic SNP markers were used to analyze population structure and perform genome-wide association study (GWAS). The most likely K value was 7, indicating this population can be divided into 7 subpopulations, which is well in accordance with the geographic origins of cultivars or accession studied. The LD decay rate was estimated at 184 kb, where r2 dropped to half of its maximum value (0.205). GWAS using FarmCPU detected a stable quantitative trait nucleotide (QTN) for hilum color and seed color, which is consistent with the known loci or genes. Although no universal QTNs for flowering time and maturity were identified across all environments, a total of 30 consistent QTNs were detected for flowering time (R1) or maturity (R7 and R8) on 16 chromosomes, most of them were corresponding to known E1 to E4 genes or QTL region reported in SoyBase (soybase.org). Of 16 consistent QTNs for protein and oil contents, 11 QTNs were detected having antagonistic effects on protein and oil content, while 4 QTNs soly for oil content, and one QTN soly for protein content. The information gained in this study demonstrated that the usefulness of the medium-density SNP array in genotyping for genetic study and molecular breeding.


INTRODUCTION
Soybean [Glycine max (L.) Merr.] is one of important crops worldwide, providing a sustainable source of high-quality protein feed and vegetable oil. Soybean was domesticated in China more than 5,000-6,000 years ago. Soybean can grow across a wide range of latitudes from 50 • N to 35 • S (Norman, 1978). Soybean yield related traits such as flowering, maturity and protein/oil contents are quantitatively inherited traits controlled by internal and external factors (Xia et al., 2013).
Each soybean cultivar adapts to a limited latitudinal region for its maximal yield since soybean is a short day plants with photoperiod sensitivity (Xia et al., 2012b). Flowering time and maturity are important agronomic traits related to soybean adaptability and productivity. More than 200 loci or genes have been mapped to control flowering time in soybean (SoyBase, www.soybase.org). Previous studies identified eleven major-effect loci affecting flowering and maturity in soybean, which have been designated as E1 to E10, and the J locus for "long juvenile period" (Bernard, 1971;Buzzell, 1971;Buzzell and Voldeng, 1980;McBlain and Bernard, 1987;Ray et al., 1995;Bonato and Vello, 1999;Cober and Voldeng, 2001;Cober et al., 2010;Kong et al., 2014;Samanfar et al., 2017). Of these genes, E1, E2, E3, E4, E6, E9, E10, and J have been cloned and functionally characterized (Liu et al., 2008;Watanabe et al., 2009Watanabe et al., , 2011Xia et al., 2012a;Zhai et al., 2014a;Zhao et al., 2016;Lu et al., 2017;Samanfar et al., 2017). E1 encodes a nuclear-localized B3 domain-containing protein, suppresses both GmFT2a and GmFT5a expression, two FT orthologs promoting early flowering in soybean (Xia et al., 2012a). E1 expression is suppressed in short day, which is regarded as the main factor for soybean being a short day plant (Xia et al., 2012a;Zhai et al., 2015;Zhang et al., 2016). E2 encodes a homolog of GIGANTEA, controls soybean flowering through regulation of GmFT2a expression but not GmFT5a (Watanabe et al., 2011). E3 and E4 are Phytochrome A (PHYA) genes of GmPHYA3 and GmPHYA2 (Liu et al., 2008;Watanabe et al., 2009). Various allelic combinations of E1, E3 or E4 lead to various photoperiod insensitivity, enabling soybean to adapt to high-latitude environments (Zhai et al., 2014b). J loci is identified as the ortholog of Arabidopsis thaliana EARLY FLOWERING 3 (ELF3), which control flowering time through regulation of E1 expression . Higher E1 expression in short day enables soybean to grow in the area of lower latitude near equator. E9 and E10 are GmFT2a and GmFT4, FT homolog of Arabidopsis (Zhai et al., 2014a;Zhao et al., 2016). Apart from negative report on existence of E5 loci (Dissanayaka et al., 2016), molecular identities of E7 and E8 are still unknown. Many quantitative trait loci (QTL) or quantitative trait nucleotide (QTN) related to soybean flowering time (first flowering, R1) and maturity have also been documented at SoyBase (http:// soybase.org). Many genes or QTL might regulate flowering time through regulation of the expression of the E1 gene (Zhai et al., 2015).
Soybean seed compositions traits such as protein and oil contents are important quality traits in breeding programs. Patil et al. (2017) reviewed molecular mapping and genomic of soybean seed protein, and concluded genetic improvement of soybean protein meal is a complex process because of negative correlation with oil, yield, and the temperature (Patil et al., 2017). Major QTL were repeated detected on chromosome (20 (LG I) and 15 (LG E) (Patil et al., 2017). Leamy et al. (2017) studied seed composition traits in wild soybean (Glycine soja) and found 29 SNPs located on ten different chromosomes that are significantly associated with the seven seed composition traits, of which eight SNPs co-localized with QTLs previously uncovered in linkage or association mapping studies conducted with cultivated soybean samples (Leamy et al., 2017). Zhou et al. (2015) mapped major QTN for protein on chromosome 13, 3, 17, 12, 11, and 15 using a 302 accessions (Zhou et al., 2015). More than 100 quantitative trait loci (QTLs) for soybean oil content have been documented at SoyBase (https://www.soybase.org). Cao et al. (2017) found 8 QTLs explained a range of phenotypic variance from 6.3 to 26.3% using RIL population, and qOil-5-1, qOil-10-1, and qOil-14-1 were detected in different environments (Cao et al., 2017). And qOil-5-1 was also detected using natural population and further localized to a linkage disequilibrium block region of approximately 440 kb . WRINKLED1(WRI1), LEAFY COTYLEDON1 (LEC1), and LEC2 are involved in the regulatory pathways modulating seed oil content in Arabidopsis. However, their homologs have been modified in the palaeopolyploid soybean, each exhibiting similar intensities of purifying selection to their respective duplicates since these pairs were formed by a 13 mya (million years ago) whole-genome duplication (WGD) event .
Recently, researchers have been applied GWAS in soybean (Bandillo et al., 2015;Wen et al., 2015;Zhang et al., 2015Zhang et al., , 2016Zhou et al., 2015;Contreras-Soto et al., 2017;Fang et al., 2017). Zhang et al. (2015) revealed that genetic loci underlying some agronomically important traits, such as days to flowering, days to maturity, duration of flowering-to-maturity, and plant height in early maturity soybean . The ability of GWAS to capture one trait often depends on the frequency of the accessions with contrast phenotypic value in the population being investigated. Recently, as the great advance in sequencing technology, genotyping by sequencing (GBS) has been a choice over other genotyping method, SNP array and traditional SSR markers.
In comparison of traditional linage analysis, genomewide association study (GWAS) takes advantage of more historic recombination events that have occurred within natural populations. GWAS has been widely applied to crop plants such as maize (Tian et al., 2011), rice (Huang et al., 2010;Ma et al., 2016). However, in rice, recently studies demonstrates the power of GWAS in combination of biparental association mapping and fine-mapping in dissect agronomic important trait (Huang et al., 2010;Ma et al., 2016).
In this study, we genotyped 235 cultivars using Illumina SoySNP8k iSelect BeadChip; and 4471 core SNP markers were selected. A relatively complex population structure (K = 7) was revealed. GWAS were performed to identify the QTN associated with flowering time and the protein/oil contents using FarmCPU. More than 30 QTN were identified under multiple environments for flowering time and maturity; while 16 consistent QTNs were detected for protein and oil contents.

Cultivars and Growth Condition
A set of 235 cultivars collected from China, Japan, USA, and Canada were mainly obtained from the Gene Resource Center of Jilin Academy of Agricultural Sciences, China. The origin and other traits for these cultivars are listed in Table S1.

Phenotypic Observation
Soybean accessions were evaluated for photoperiodic responses at six geographic locations: (1) (Fehr et al., 1971). R1 refers the beginning of bloom (the opening of the first flower at any node on the main stem). R7 represents the beginning of maturity (one normal pod on the main stem has reached its mature pod color, normally brown or tan); R8 stands for full maturity (95 percent of the pods having reached their mature pod color). For a given cultivar, each specific R stage is defined only when at least 50% of individual plants reached that stage.
Seed were harvested upon maturity. In HRB, GZL, MDJ locations, cultivars that did not reach mature stage (R8) were precluded for maturity and protein/oil content.
Seed coat or hilum color were classified into four groups and coded as follows: (1) yellow or yellowish; (2) green or light brown; (3) brown; (4) black. Seed-weight (100-seedweight) was determined by weighing 3 different set of randomly selected 100 seeds for each cultivar or accession. Seed protein and oil contents of cultivars were measured using MATRIX-I FT-NIR spectrometer (Bruker). The protein or oil contents were measured three times using different bulk seeds of a given cultivar.
The heritability estimates were calculated using variance components obtained by lme4 of R package (Fehr, 1987).

Genotyping With SNP Markers
DNA was extracted from fresh leaves using the hexadecyltrimethylammonium bromide (CTAB) method with slight modification (Murray and Thompson, 1980;Xia et al., 2007). Due to availability of financial budget, cultivars were divided into two batches (95 cultivars and 140 cultivars) to proceed genotyping. Genotyping using Illumina SoySNP8k iSelect BeadChip (Akond et al., 2013;Yang et al., 2017), which contained a total of 7,189 SNPs and was specifically manufactured by Infinium HD Ultra. SNP genotyping was performed with the Illumina Iscan platform (Illumina, Inc., San Diego, CA). A series of procedures, such as incubation, DNA amplification, preparation of bead assay, hybridization of samples for the bead assay, extension, staining of samples, and imaging of the bead assay, were conducted following previously reported methods (Song et al., 2013). The SNP alleles were called with the Genome Studio Genotyping module (Illumina, Inc.) (Song et al., 2013), and SNP data is available at ftp://159.226.208. 134/public/SNP_data.zip (Data Sheet 1).

Population Structure Analysis and GWAS
Population structure analysis was performed using STRUCTURE (Pritchard et al., 2000) and to choose the appropriate number of inferred clusters to model the data, 5 independent runs were performed for each K cluster (2 < K < 13, the length of the burn-in is 10,000, the length of MCMC(Markov chain Monte Carlo) is 10,000). After several attempts, we found that our parameter set was sufficient, longer length of burn-in and MCMC did not change the result significantly. Furthermore, population structure was assessed for K values ranging from 2 to 13 on the entire panel using high quality SNPs. The calculation method of STRUCTURE is based on the Bayesian model. For the simulation result of each K value, STRUCTURE will correspondingly produce the log maximum likelihood value, "LnP(D)." As LnP(D) increases, the K value is closer to the real case. The simulation result with largest LnP(D) and smallest K value is the optimal result (Evanno et al., 2005). The neighborjoining tree was analyzed using the TASSEL (Version 5.2.38) (Bradbury et al., 2007).
By analyzing r 2 value of all pairs of SNPs located within 1 Mb of physical distance, the LD decay trend was found following the regression of negative natural logarithm. Heterozygosis, linkage disequilibrium decade, and kinship plot were generated using GAPIT (Lipka et al., 2012) with default parameters. For kinship plot, a heat map of the values in the values in the kinship matrix is created. Kinship matrix was using the VanRaden kinship algorithm (Tang et al., 2016).
GWAS was conducted the Fixed and random model Circulating Probability Unification (FarmCPU; Liu X. L. et al., 2016) with Bonferroni-corrected threshold with 0.01. This recently developed model selection algorithm takes into account the confounding problem between covariates and test marker by using both Fixed Effect Model (FEM) and a Random Effect Model (REM) (Arora et al., 2017). The first three principal components calculated using GAPIT were used as covariates. The quantilequantile (Q-Q) plot was used for assessing how fit the model was to account for population structure.
Rare SNPs other than two majority nucleotide identities were treated as unknown. Heterozygosis was calculated for both individuals and makers ( Figure S1A). By analyzing r 2 value of all pairs of SNPs located within 1 Mb of physical distance, the LD decay trend was found following the regression of negative natural logarithm ( Figure 1D). The LD decay rate was estimated at 184 kb, where r 2 drop to half of its maximum value (0.205). Also this trend was confirmed using GAPIT ( Figure S1B). This LD rate calculated is well consistent with previous studies Song et al., 2016).

Population Structures
Two hundred thirty five cultivars were originally obtained from different geographic origins, e.g., different latitudinal regions of China, Japan, USA. Apart from 5 landraces, the majority of set of germplasms are modern cultivars (Table S1). According to the population structure, the most likely value of K was 7 and such a portioning of the population was consistent with the significant delta K value (Figures 1A,B). Moreover, this result is also well in accordance with the neighbor-joining tree ( Figure 1A). All cultivars are classified into 7 subgroups, which are generally in accordance with their geographic origins, Japan, Northern America, central China, Huang-huai region China, Northern area China, landraces (wild soybean) (Figure 1). This classification was also supported by the VanRaden kinship algorithm (Figure 2).
In this study, a relatively complex population structure (K = 7) was revealed in comparison of previous reports in which population structures (K = 2, 4, 9) were disclosed Fang et al., 2017). After eliminating batch specific or biased markers, the set of 4471 markers might represents the core markers for this set of germplasm (Data Sheet 1, ftp://159.226.208.134/public/SNP_data.zip).
We detected four significant QTNs for seed coat color using FarmCPU ( Figure 3B). The major QTN was also located at 8622793 bp of chromosome 08. The major QTN detected for seed coat color was about 20 kb away from that for hilum (Figure 3). The clustered CHS family is considered to be candidate genes responsible for the seed coat color (Cho et al., 2017). Also other three QTNs were detected on chromosome 08 (41,212,762 bp), chromosome 12 (37,411,186 bp) and chromosome 14 (41,162,011 bp). A peak but not over the threshold was present on chromosome 13. Recently, seed coat bloom in wild soybeans is mainly controlled by Bloom1 (B1) on chromosome 13, which encodes a transmembrane transporter-like protein for biosynthesis of the bloom in pod endocarp (Zhang et al., 2018). Interestingly, this gene also elevated seed oil content in domesticated soybeans.

GWAS on Flowering Time and Maturity
In this study, flowering time R1 and maturity R7 and R8 were evaluated in six geographic locations. For flowering time, the basic statistics of flowering time (R1) of cultivars were presented in Table 1. It took longer days to reach R1 in the northern locations, HRB, MDJ, and GZL (Figure 4). Other parameters such as Skewness, Kurtosis, K-S distance, K-S probability, SWilk W, SWilk probability indicated these traits were quantitatively inherited ( Table 1). The correlation coefficients with a range of 0.592 to 0.978 between R1 of soybean cultivars grown at different locations in 2011 or 2012 ( Table 2) were all statistically   11, 2011; 12, 2012. For protein or oil contents, R1, from emergence to first flower. **, Correlation coefficient is statistically highly significant (P < 0.01 ); *, Correlation coefficient is statistically significant (P < 0.05). significant, which indicates this trait is genetically inherited, and also phenotypic data are validated. Statistical analysis ( Table 3) showed that broad sense heritability was 0.5833.
Although phenotypic data for R7 and R8 were not conducted in all locations, the basic distributions were presented in Figure S2, which was similar to R1 trait. Since some cultivars could not reached R7 or R8 before frost in northern locations, HRB, MDJ, and GZL.
In order to analyze the relationship between R1 and R7/R8, the correlation coefficients matrix were generated and listed in Table S2. The correlation coefficients of R7 (R8) between different geographic locations or years were statistically significant except for that between MDJ and southern location, HA and NJ. The correlation coefficients between R1 and R7 or R8 were higher in the same location than in different location. Considering maturity genes, such as E1-E4, are controlling flowering time as well as maturity, we also enclosed R7 and R8 for GWAS.
Although no consistent QTNs for flowering time and maturity were identified across all environments, a total of 30 consistent QTNs were detected for flowering time (R1) or maturity (R7 and R8) on 16 chromosomes (Figures 3C-H; Table 4; Figures S3-S6; Table S3). In Table 4 and Table S3, we only listed the QTN that has been detected more than three environments. In Table 4, we listed the corresponding QTLs listed in SoyBase or known genes with a physical distance less than 5 Mb.
In our previous study, the genotypes at E1, E2, E3, and E4 of 180 cultivars revealed great allelic variations at E1 and E3 genes (Zhai et al., 2014b). The power of GWAS to capture a certain trait often depends on the frequency of the accessions with contrast phenotypic value in the population being investigated (Yan et al., 2017). In the previous GWAS studies, fewer QTNs were detected for this trait. When the modern cultivars only a QTN corresponding to E3 was detected at a natural population of 304 short-season soybean lines (K = 9) .  (Liu et al., 2008) Only QTN that was detected more than three environments were listed. $ Effect of−3.524 for R1_MDJ_2012 was not counted due to the oppositing effect; & effect of−6.267737 for R1_GZL_2011 was not counted due to the oppositing effect.
No universal QTN was detected over all environments in this study. Common QTNs detected in three or more environments are also informative for us to understand this trait, although authenticity of these QTNs detected in this study need to be verified. GWAS and biparental linkage mapping are commentary each other in mapping and thereafter gene cloning. At present, around 50 biparental populations were generated using the cultivars in this study. We will use these populations to verify the QTN obtained in this study. Fine-mapping or positional cloning will be performed when a novel gene or QTN is verified.

GWAS of Protein and Oil Contents of Cultivar Seeds
In this study, protein and oil contents were simultaneously measured in 5 geographic location in 2011 and 2012. The basic statistics of two traits were listed in Table 5 and presented in Figure 5. The parameters such as Skewness, Kurtosis, K-S distance, K-S probability, SWilk W, SWilk probability indicated this trait were quantitatively inherited ( Table 5). The correlation coefficients between protein and oil were presented in Table 6. From the correlation coefficients, the protein contents were negatively and significantly correlated to oil content in the same environments or different environments; while the protein contents in an environments was positively correlated to protein contents in other environments (Table 6, Figure 5). The trend was the same for oil contents. According to statistical analysis, the broad sense heritability for oil and protein were 0.6364 and 0.3947. When we used data for protein and oil contents obtained in 9 environments for GWAS using FarmCPU, 16 consistent QTNs for protein and oil contents were detected for oil or protein over 3 environments ( Table 7; Table S4; Figures 3G,H, Figures S7, S8). Eleven QTNs were detected having antagonistic effects on protein and oil content, while 4 QTNs soly for oil content, and one QTN soly for protein content. Of eleven QTN for both traits detected over 3 environments, each QTN showed antagonistic effects on protein and oil contents, which indicated these QTNs are involved in biological pathway affecting both oil and protein. Major QTL were repeatedly detected on Chromosome 20 (LG I) and 15 (LG E) using America cultivars (Patil et al., 2017). In this study, we detected three QTNs on Chromosome 20 (LG O). Two QTNs were identified for both traits, QTN (Gm20_2372509_T_C-1_T_R_2179344425) at position of 2366428 with antagonistic effects on protein (0.431691) and oil (-0.45203) and QTN (Gm20_7927513_A_G-1_T_F_2179344472) with antagonistic effects on protein(0.76146) and oil (-0.47998). Another QTN (Gm20_38151772_C_T-1_T_R_2179344711) for oil with effect of−0.53353 was identified on chromosome 20. We did not detect any consistent QTN on Chr 15 (LG E). All 16 QTNs mapped in this study ( Table 7) were physically near (less than 5 Mb) QTL reported in SoyBase.

Conclusion and Further Consideration
Instead of traditional molecular markers, e.g., SSR, AFLP, advances in sequencing technologies have enabled high-density array and GBS to be widely applied to genomic and genetic study to dissect genetic population structure and GWAS (Sonah et al., 2013;Bandillo et al., 2015;Wen et al., 2015;Zhang et al., 2015;Contreras-Soto et al., 2017;Fang et al., 2017;Yan et al., 2017). However, this study employed a medium density array to reveal population genetic structure, the result showed the quality of the population genetic study has been improved by elimination of some batch specific or biased SNPs. Also the GWAS quality has been monitored using hilum color and seed coat color. Fast genotyping method e.g., using a set of core SNP array is in high demand for genetic study or molecular breeding (Chaudhary et al., 2015). The information gained in this study demonstrated that the usefulness of the medium-density SNP array in genotyping for genetic study and molecular breeding.
Up to date, there are a large number of loci or QTL have been identified by GWAS using different set of natural population or by linkage or association mapping using biparental populations under different environments in different years. In generally, the effect of each locus is rather small, its detection might be influenced by population size, population structure, accuracy of phenotyping, physical location of the causal gene (e.g., pericentromeric region), epistatic association between QTLs as well as environmental factors. High negative correlation coefficients between oil and protein content in soybean was revealed in this study, which is consistent with previous reports (Boydak et al., 2002;Karaaslan et al., 2008); common regions or loci might have favorable effect on one and unfavorable effect on the other. The higher negative correlation coefficients of two traits might reflect that we might be able to detect QTL or QTN with higher effect on both traits. Hwang et al. (2014) found seven of 13 regions associated with oil content also have effect on protein content (Hwang et al., 2014). Similarly, in this study, we have detected 11 common QTNs associated with oil and antagonistically associated with protein, although no universal QTN detected over all environments. However, the overall oil and protein content can be varied to a great extent, also the environmental effect e.g., latitudinal   Najing. For years, 11, 2011;12, 2012. For protein or oil contents, PR, protein content; OL, oil content. **, Correlation coefficient is statistically highly significant (P < 0.01 ); *, Correlation coefficient is statistically significant (P < 0.05).
Frontiers in Plant Science | www.frontiersin.org Only QTN that was detected more than three environments were listed.
location, temperature can also influence the balance of two contents, there are a lot loci affecting most to one content, but not the other, at least not significantly (Eskandari et al., 2013). Overall, a large number of loci have been identified to underlie some important agronomic traits e.g., flowering time, maturity, oil and protein contents; however, a detailed study may only detect some of them. Ideally, a large numbers of natural population can be subtracted into a subpopulation each member of which carries higher or lower phenotypic values for a given trait; GWAS for the given trait can be performed using in this subpopulation (Yan et al., 2017).
A large number of QTLs or loci underlying agronomically important traits have been identified by GWAS or linkage mapping, some of which were detected in different environments or in different populations while some are environmental or population specific. Although molecular identities of genes or QTL underlying some important agronomic traits e.g., maturity have been disclosed, vast of loci underlying quantitative traits like soybean seed protein /oil content are still largely unknown. GWAS in combination with biparental populations such as RIL, NIL, CSSL, is very powerful for QTL identification and their gene cloning. As high throughput sequencing data aggregate, the important QTL or QTN detected by traditional linkage mapping or GWAS will be verified and subsequently cloned. As most components of a molecular or signaling pathway have been identified (Gentzbittel et al., 2015), information of gene regulation or crosstalk with different pathways will enable us to build a genetic network that can be used in molecular design breeding.  The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018. 00610/full#supplementary-material        Table S1 | Geographic origins of soybean cultivars or accession used in this study. Table S2 | The correlation coefficients between R1 (first flower), R7 (beginning of maturity) and R8 (Fully Maturity) of soybean cultivars grown at different locations in 2011 or 2012. Table S3 | QTNs for flowering time (R1) and maturity (R7 or R8) were detected using FarmCPU. Table S4 | QTNs for protein and oil contents were detected using FarmCPU.