Identification of QTNs Controlling 100-Seed Weight in Soybean Using Multilocus Genome-Wide Association Studies

Hundred-seed weight (HSW) is an important measure of yield and a useful indicator to monitor the inheritance of quantitative traits affected by genotype and environmental conditions. To identify quantitative trait nucleotides (QTNs) and mine genes useful for breeding high-yielding and high-quality soybean (Glycine max) cultivars, we conducted a multilocus genome-wide association study (GWAS) on HSW of soybean based on phenotypic data from 20 different environments and genotypic data for 109,676 single-nucleotide polymorphisms (SNPs) in 144 four-way recombinant inbred lines. Using five multilocus GWAS methods, we identified 118 QTNs controlling HSW. Among these, 31 common QTNs were detected by various methods or across multiple environments. Pathway analysis identified three potential candidate genes associated with HSW in soybean. We used allele information to study the common QTNs in 20 large-seed and 20 small-seed lines and identified a higher percentage of superior alleles in the large-seed lines than in small-seed lines. These observations will contribute to construct the gene networks controlling HSW in soybean, which can improve the genetic understanding of HSW, and provide assistance for molecular breeding of soybean large-seed varieties.


INTRODUCTION
Soybean [Glycine max (L.) Merr.] is an important source of edible oil and plant protein for humans. Soy-based foods are popular in the international market, and the demand for soybeans is increasing. Hundred-seed weight (HSW) is an important agronomic trait related to soybean yield and food quality. Therefore, locating the quantitative trait locus (QTLs) related to HSW and exploring valuable alleles associated with the QTLs have important theoretical and application value for improving soybean yield and quality.
Linkage analysis based on framework linkage maps of low polymorphism molecular markers, such as restriction fragment length polymorphisms (RFLPs) and simple sequence repeats (SSRs), is the main method for detecting QTLs controlling HSW in soybean. Quantitative trait loci associated with HSW have been identified in F 2 (Wang et al., 2010), recombinant inbred line (RIL; Reinprecht et al., 2006), and backcross (BC; Li W.X. et al., 2008) populations using QTL mapping methods such as interval mapping (IM; Lander and Botstein, 1989), composite interval mapping (CIM; Zeng, 1994), and inclusive complete interval mapping (Wang, 2009). Genome-wide CIM (Wang et al., 2016b;Wen et al., 2017) was recently proposed for detecting small-effect and linked QTLs. However, QTLs detected by these methods have rarely been applied successfully to marker-assisted breeding research because large linkage between markers and QTL is feasible to be interrupted. Thus, genome-wide association studies (GWASs) have been conducted gradually for QTL mapping of HSW in soybean.
Drifting and selection of genes during crop breeding and domestication produce complex population structures, which can cause false positives. Statistical methods to circumvent this issue include case-control studies, transmission imbalance tests, structured associations, principal component analysis (PCA), and hybrid models. A newly developed GWAS statistical method based on a mixed linear model (MLM) can overcome the above difficulties and address fixed and random effects flexibly. Common variables from STRUCTURE (Pritchard et al., 2000) and PCA can serve as fixed effects to account for false associations due to group structure. The complex relationship between individuals constitutes the affinity matrix in the MLM . The kinship matrix derived from a set of aggregated individuals is used in the computationally efficient compressed MLM (CMLM; Zhang et al., 2010). In some cases, the traditional maximum-likelihood method cannot be used to solve an MLM with a large number of genotypes because the calculation intensity is too large. Therefore, the efficient mixed model association (EMMA; Kang et al., 2008) algorithm was developed to reset the parameter of MLM likelihood function. Among the many software packages available, mrMLM.GUI integrates the most accurate and computationally efficient methods for GWAS and genome prediction selection, and these methods are integrated into the R package, which can analyze large amounts of data in the shortest time.
Genome-wide association study has been used to detect QTLs related to HSW in soybean. Zhou et al. (2015) used a total of 302 germplasm accessions including wild soybean varieties, local varieties, and bred cultivars to detect HSW loci on chromosome 17, while Zhang et al. (2016) used 309 soybean germplasm accessions to detect 22 HSW QTLs by GWAS analysis. These studies used single-gene locus methods such as MLM Zhang et al., 2010). The mrMLM method used in our study is a multilocus method that detects more small-effect QTLs than the single-locus method (He et al., 2019), reducing the probability of false positives (Fang et al., 2017).
We used 144 RILs with four-way hybridization (FW-RILs) to obtain phenotypic data for the HSW and SNP genotype data across multiple environments. The multilocus GWAS method was used to detect quantitative trait nucleotides (QTNs) associated with HSW. We identified potential candidate genes and common QTNs associated with large-seed lines across multiple methods or in multiple environments, laying the foundation for high-yield soybean breeding.

Field Experiments and Phenotypic Data Collection
The 144 FW-RILs were planted in 20 environments; details of the planting environments are summarized in Supplementary  Table S1. The field experiment in each environment followed a randomized block design. Plot length was 5 m, ridge distance was 65 cm, plant spacing was 6 cm, three rows were repeated three times, and field management was in accordance with general field cultivation. After maturity, 10 plants showing uniform growth were selected from each row, and HSW was measured indoors. Phenotypic values are all averages of three replicates.

Phenotypic Variation Analysis
Mean, variance, standard deviation, minimum, maximum, skewness, and kurtosis of FW-RILs were calculated in each environment. Analysis of variance was performed on phenotypic data from the 20 environments using the following model: where µ is the grand average, G i is the ith genotype effect, E j is the jth environment effect, GE ij is the genotype × environmental effect, and ε ij is the error effect following N(0, σ 2 ). The mean squared difference was estimated by applying the mean square of each variation source, and the estimated generalized heritability was calculated using the following formula.
where h 2 is the generalized heritability of gene × environment interaction,σ 2 G is the genotype variance, σ 2 GE is the genotype × environmental interaction variance, σ 2 is the error variance, e is the number of environments, and r is the number of repetitions in each environment. Data were analyzed using the general linear model method in Statistical Analysis System (SAS) 9.2, North Carolina.

Genotyping
Leaves of each accession were collected in the field at the three-leaf stage and placed in 2-mL centrifuge tubes. Liquid  nitrogen was added to the tubes, and leaves were rapidly ground to a white powder. Soybean genomic DNA was extracted using the CTAB method (Doyle et al., 1990). DNA was dissolved in 50 µL ddH 2 O, 1 µL RNAase (10 mg/mL) was added, and samples were stored at −20 • C. Single-nucleotide polymorphism genotyping was performed by Beijing Boao Biotechnology Co., Ltd. (Beijing, China) using the SoySNP660K BeadChip. A total of 109,676 SNPs were obtained on 20 chromosomes after mass filtration based on the following criteria: minor allele frequency (MAF) of the SNP was identified (MAF > 0.05), maximum missing data rate <10% (Belamkar et al., 2016), and heterozygous loci deleted.

Analysis of Linkage Disequilibrium and Population Structure
Linkage disequilibrium (LD) analysis was performed using TASSEL 5.0 (Bradbury et al., 2007) to calculate the square of the allelic frequency dependence (r 2 ) of all SNPs located within 10-Mb physical distance, and the physical distance of LD decay was estimated as the position where r 2 dropped to half of its maximum value. Population structure was analyzed after filtration on 5,000 SNPs evenly distributed on 20 chromosomes using STRUCTURE 2.3.4 (Pritchard et al., 2000). The number of burn-in iterations per run was 100,000, followed by 100,000 Markov Chain Monte Carlo replications after burn-in. Mixed and allele frequency correlation models were considered in the analysis. Five implicit iterations were used in the STRUCTURE analysis. The best subgroups (K) were identified according to the method of Evanno et al. (2005) using STRUCTURE HARVESTER (Earl and Vonholdt, 2012). The assumed number of K ranged from 1 to 10.

Genome-Wide Association Studies
Genome-wide association studies were performed using the software mrMLM.GUI (version 3.0). There are five multilocus GWAS methods in the mrMLM package that can be used to identify significant QTNs, namely, mrMLM (Wang et al., 2016a), FASTmrMLM (Tamba and Zhang, 2018), FASTmrEMMA  a mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO were indicated by 1-5, respectively. b r 2 (%), proportion of total phenotypic variation explained by each QTN. (Wen et al., 2018), pLARmEB , and ISIS EM-BLASSO . In the first stage, the critical P value parameter was set to 0.01, except that of FASTmrEMMA was set to 0.005. In the final stage, the critical LOD value of significant QTNs was set to 3. All matrices involved in these five methods were calculated using mrMLM.GUI3.0.

Superior Allele Analysis
Among the significant QTNs identified, some were detected in a variety of environments or methods. These significant QTNs are referred to as common QTNs. The positive and negative of the QTN effect values were used as the criterion for selecting superior alleles. If the QTN effect value is positive, the genotype of code 1, which was obtained by GWAS, is the superior allele; if the QTN effect value is negative, the other genotype is the superior allele. The percentage of excellent alleles is calculated as follows: for each line, the percentage of excellent alleles is the number of common QTNs carrying excellent alleles divided by the total number of common QTNs; for each common QTN, the percentage of alleles among the 144 FW-RILs is equal to the number of rows a mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO were indicated by 1-5, respectively. b r 2 (%), proportion of total phenotypic variation explained by each QTN. Bold text indicates the QTNs appeared to be near QTLs associated with hundred-seed weight in soybean that had been mapped in earlier studies.
containing the alleles divided by the total number of rows. Heat maps were generated using the R package Complexheatmap (Mellbye and Schuster, 2014).

Identification of Potential Candidate Genes
The physical locations of generic QTNs identified by the above methods were marked. Intervals containing each of the common QTNs were then selected on the Phytozome website. These intervals are determined by the LD decay rate. Highly expressed genes at certain locations based on relevant traits were selected. This step was done on the BAR website. Finally, the genes were matched to those in the Kyoto Gene and Genomics Encyclopedia (KEGG) to analyze gene expression pathways and identify potential candidate genes.

Phenotypic Variation
We measured HSW phenotypic data for 144 FW-RILs and the four parent lines in 20 environments. The phenotypic values for the four parents are given in Supplementary  Table S2. In 19 environments, the kurtosis and skewness (absolute value) were less than 1, indicating a continuous normal distribution of HSW (Supplementary Table S3 and Figure S1). We used SAS software to perform a descriptive statistical analysis, conduct an analysis of variance, and estimate the generalized heritability of phenotypic values ( Table 1). The results showed that significant phenotypic variation was detected in the HSW of 144 FW-RILs across 20 environments. An analysis of variance revealed significant differences in genotype effects, environmental effects, and genotype × environment interactions (Table 1). Therefore,  HSW is not only affected by genotype and environmental factors, but also by genotype-by-environmental interaction effects. The heritability of HSW in multiple environments was 79%. This indicates that although HSW is affected by the environment, most of the variation is due to genetic effects. Bold font indicates the 20 lines with smaller seeds, and non-bold font indicates the 20 lines with larger seeds.

Population Structure
Among 109,676 SNPs, we selected 5,000 SNPs with superior polymorphism evenly distributed on the 20 soybean chromosomes. We used STRUCTURE 2.3.4 to calculate K ( Figure 1A; K = 1-10) and identified two subgroups (selected K = 2) based on the K value ( Figure 1B). We analyzed the r 2 values of all pairs of SNPs located within 10 Mb of each other and determined the LD decay trend based on regression to the negative natural logarithm.
We further examined significant QTNs occurring in multiple environments. Three common QTNs were identified in E2 and E3 ( Table 2). AX-157525995 was located on chromosome 6, with LOD values ranging from 3.08 to 4.48 ( Table 2). This QTN explained a proportion of phenotypic variance (PVE) of between 3.28 and 6.43%. AX-157369770 was located on chromosome 11, with LOD values ranging from 3.07 to 3.80 ( Table 2). This QTN explained a proportion of phenotypic variance (PVE) of between 3.78 and 5.02% ( Table 2). AX-157588392 was located on chromosome 12, with LOD values ranging from 4.20 to 4.26 ( Table 2). This QTN explained a proportion of phenotypic variance (PVE) of between 6.36 and 6.64% ( Table 2). The QTN effect direction (positive or negative) was consistent across environments and methods ( Table 2).
Among the 31 common QTNs, 20, 6, 2, and 3 QTNs were detected commonly by two, three, four, and five methods, respectively (Figure 3). From the above studies, mrMLM, FASTmrMLM, and ISIS EM-BLASSO detected a larger number of QTNs among the five methods generally (Figure 4A). In different arrangements of two methods, the number of common QTNs detected by mrMLM combined with FASTmrMLM was large ( Figure 4B).

Distribution of Superior Alleles in the FW-RILs
We identified 20 large-seed lines with high HSW and 20 smallseed lines with low HSW based on the average phenotypic values of 144 FW-RILs in 20 environments ( Table 4). The HSW phenotypic values of the 20 large-seed lines ranged from 20.65 to 22.55 g. For each large-seed line, the percentage of superior alleles (PSAs) for the 31 common QTNs ranged from 35 to 77%; nine lines (45%) showed a PSA of less than 50%. The phenotypic values for the 20 small-seed lines ranged from 16.53 to 17.79 g, and the PSA ranged from 21 to 52% (Table 4). Therefore, the large-seed lines had a higher PSA than the small-seed lines (Figure 5).
The PSA range for each common QTN in the 20 large-seed lines was 15-100%, with 18 QTNs having PSAs of ≥50%. Among the 20 small-seed lines, PSAs ranged from 5 to 80%, and nine QTNs had PSAs of >50% (Table 5 and Figure 6). The number of common QTNs with PSAs ≥50% in 20 large-seed lines was higher than 20 small-seed lines. Based on these results, we can find elite lines by identifying the superior allele ratio of highyield soybeans.
We identified several superior alleles that were present in various large-seed lines; AX-116905453, AX-157457247, AX-157588392, and AX-157090659 were all present in HN96, HN61, HN95, HN124, HN87, and HN41 (Figure 6). The superior allele  AX-116905453 was present in all 20 large-seed lines. We believe that superior alleles may have an important effect on the HSW of soybeans. In further research, we hope to use this information to develop larger seed size soybean varieties through markerassisted selection.

Potential Candidate Genes Determined Based on Common QTNs
For each common QTN, we use the LD decay distance as the interval range to find candidate genes. Because the population we selected was not a natural group, but an FW-RIL population, the LD decay distance was very large. We therefore chose the range of potential candidate genes according to the location with the largest decay rate. As the LD decayed fastest before 200 kb and then tended to flatten (Figure 2A), we searched for potential candidate genes at 100-kb intervals on either side of each QTN. We identified 635 genes in this interval, of which 129 were highly expressed during seed formation (Supplementary Table S7).
Based on the annotation data, 51 of the 129 genes were annotated in 29 pathways and three protein families in the KEGG database (Supplementary Tables S5, S6 and Figure 7). Three of these are potential candidate genes based on their annotation information and function in metabolic pathways ( Table 6).

DISCUSSION
Most genetic analyses of yield-related traits in soybean, such as HSW, have been based on RILs arising from the hybridization of two parents used for QTL mapping analyses. Sites detected in these analyses will therefore be suitable only for that specific population. Most mapping populations involve a limited number of recombination, and few molecular markers can be used; fine mapping is challenging. Four-way RILs were derived from four parents, and there may be one to four alleles in a QTL. As long as there is a significant difference in genetic effect between the two alleles in a QTL, this QTL is detected. Therefore, the intensity of QTL detection and superior allele selection are improved.
In recent years, with the rapid development of GWAS, natural population is generally used as research materials to identify HSW QTL in soybean. The FW-RILs used in this study are the MAGIC (multiparent advanced generation intercross) lines, which were artificially configured with good linkage relationship, whereas the natural population has no linkage relationship. The number of parents of the FW-RILs is limited, and the population structure is not obvious compared with the natural population composed of many germplasm resources. Thus, the false-positive rate of FW-RILs is lower than that of the natural population; the statistical power and QTL positioning accuracy are higher.
Previous studies have shown that HSW is a typical quantitative trait controlled by multiple genes, and it is controlled by many microgene loci under different genetic conditions. A GWAS using 31,283 SNPs and CMLM detected five loci associated with HSW (Copley et al., 2018). Yan et al. (2017) used SoySNP50K BeadChip to genotype more than 42,000 SNPs, and the MLM accounting for both population structure and kinship was conducted for GWAS. Eight SNPs located on chromosomes 4 and 17 were significantly associated with HSW. The variation explained by significant markers (R 2 ) ranged from 6.9 to 13.2%. Contreras-Sota et al. (2017) used SNP markers and MLM considering Q + K models for GWAS to identify genomic regions controlling HSW. Seven SNPs were significantly associated with HSW on chromosomes 5, 7, 11, and 12 across the locations under study, and R 2 ranged from 13.2 to 31.2%. Although high-density SNP markers were applied in these experiments, statistical methods  treating single loci mainly detect QTLs with high heritability and neglect those with moderate or low heritability. The five multilocus GWAS methods used in this study greatly reduce the false positives in the result and increase the credibility of the results. All potential QTNs with large or small effects in different statistical models could be detected by the methods.
Based on the common QTNs detected in this study and the annotated pathways associated with these QTNs, we identified three genes that may be associated with HSW in soybean ( Table 6). Glyma.06G321900 is associated with psaG, an intrinsic membrane protein associated with Photosystem I (PSI). psaG is a nuclear-encoded gene corresponding to PSI subunit V of spinach (Steppuhn et al., 1988). Further experiments evaluate the potential role of subunit V in PSI function under water-deficit stress (Wood and Duff, 1999). Photosynthesis converts light energy into chemical energy for plant growth, increasing crop yield. Therefore, we speculate that this gene increases photosynthesis in the absence of water, which indirectly affects the HSW of soybean. Glyma.08G365900 is related to mannose phosphate isomerase (PMI), and the PMI (manA) gene encodes a mannose-6-phosphate transferase, catalyzing the conversion of 6-phosphoric acid mannose to fructose 6-phosphate. Cells can metabolize fructose via the glycolytic pathway (Reed et al., 2001), rendering mannose a carbon source for the growth of transformed cells, and thus the plant exhibits a more prominent growth advantage. We speculate that PMI (manA) indirectly controls the HSW of soybean plants. Glyma.20G240000 is associated with hexosaminidase (Hex). During fruit ripening, β-Hex releases a large amount of free N-glycans from glycoprotein polypeptide chains (Yunovitz and Gross, 1994). Free N-glycans are important signaling molecules that promote fruit ripening (Handa et al., 1985;Priem and Gross, 1992), possibly through β-Hex triggering ethylene synthesis systems (Jagadeesh et al., 2004). We therefore suggest that Hex controls soybean seed development by affecting the biosynthesis of phytohormones.

Summary
We identified 31 QTNs using a five-multilocus GWAS method. An analysis of the common QTNs identified three candidate genes. By analyzing the ratio of superior alleles in large-and small-seed lines, we established that the superior alleles of these common QTNs might be useful for molecular marker-assisted breeding of soybean plants with larger seeds.

DATA AVAILABILITY STATEMENT
The genotyping data is available at https://figshare.com/s/ 84eb97ad5c5e523072e5.

AUTHOR CONTRIBUTIONS
WL and HN conceived and designed the experiments. JS, XT, YW, YF, XL, JW, CY, SJ, and XS performed the field experiments. SL and ZT performed the genome sequencing. ZQ, KZ, and HN analyzed and interpreted the results. ZQ and HN drafted the manuscript and all authors contributed to manuscript revision. WL acquired the funding. All authors contributed to the article and approved the submitted version.