Identification of the Genomic Region Underlying Seed Weight per Plant in Soybean (Glycine max L. Merr.) via High-Throughput Single-Nucleotide Polymorphisms and a Genome-Wide Association Study

Seed weight per plant (SWPP) of soybean (Glycine max (L.) Merr.), a complicated quantitative trait controlled by multiple genes, was positively associated with soybean seed yields. In the present study, a natural soybean population containing 185 diverse accessions primarily from China was used to analyze the genetic basis of SWPP via genome-wide association analysis (GWAS) based on high-throughput single-nucleotide polymorphisms (SNPs) generated by the Specific Locus Amplified Fragment Sequencing (SLAF-seq) method. A total of 33,149 SNPs were finally identified with minor allele frequencies (MAF) > 5% which were present in 97% of all the genotypes. Twenty association signals associated with SWPP were detected via GWAS. Among these signals, eight SNPs were novel loci, and the other twelve SNPs were overlapped or located in the linked genomic regions of the reported QTL from SoyBase database. Several genes belonging to the categories of hormone pathways, RNA regulation of transcription in plant development, ubiquitin, transporting systems, and other metabolisms were considered as candidate genes associated with SWPP. Furthermore, nine genes from the flanking region of Gm07:19488264, Gm08:15768591, Gm08:15768603, or Gm18:23052511 were significantly associated with SWPP and were stable among multiple environments. Nine out of 18 haplotypes from nine genes showed the effect of increasing SWPP. The identified loci along with the beneficial alleles and candidate genes could be of great value for studying the molecular mechanisms underlying SWPP and for improving the potential seed yield of soybean in the future.


INTRODUCTION
Seed weight per plant, a complicated and agronomically important quantitative trait, was significantly related with yield in soybean (Glycine max L. Merr.) (Chen et al., 2007;Liu et al., 2016). SWPP is controlled by multiple genes or quantitative trait loci (QTL). An additive effect dominates the inheritance pattern of SWPP (Chen et al., 2007). The development of cultivars with suitable SWPP has been an important breeding object of many soybean breeders because SWPP was an important soybean yield component. SWPP is influenced by the environment or genotype by environment interactions, and this trait performs differently in different environments. Hence, breeding soybean cultivars with suitable SWPP via traditional methods requires evaluation in multiple environments over several years, which is expensive, time consuming, and labor-intensive.
Genome-wide association studies (GWAS), based on highdensity markers and natural populations, have been regarded as an efficient alternative to linkage analysis with more extensive recombination events and shorter LD segments. Thus, the resolution and accuracy of marker-phenotype associations could be further increased via GWAS compared with the conventional QTL mapping of segregating populations . To date, GWAS has been widely utilized to elucidate the genetic basis of many complex traits in some crops, including rice (Huang et al., 2010), maize (Weng et al., 2011), wheat (Raman et al., 2010), and barley (Cockram et al., 2010). In soybean, the genetic architecture of some important traits, such as protein , fatty acid content , and SCN resistance (Han et al., 2015;Zhao et al., 2017), has been well dissected. Moreover, the rapid development of next-generation sequencing technology and SNP genotyping technology has propelled much of the practicability of GWAS. In previous studies, Yan et al. (2017), Contreras-Soto et al. (2017 identified seventeen, seven, and forty-eight SNPs, respectively, which were all significantly associated with 100-seed weight of soybean (HSW). However, to date, only few studies were conducted to identify QTL underlying SWPP based on highthroughput sequencing technology. Han et al. (2016) detected rs18976374 (located on Chr.16) significantly related with SWPP by using a bar coded multiplex sequencing approach with an Illumina Genome Analyzer II. Liu et al. (2016) found one SNP 1 www.soybase.org (ss244932137 located on Chr.3) associated with SWPP through GWAS strategy based on an association panel of 138 cultivars genotyped by Illumina SoySNP6KiSelect BeadChip.
The aims of the present study were to understand better the genetic architecture of SWPP via GWAS based on 185 tested accessions and 33,149 SNPs and to analyze the potential candidate genes that might regulate soybean SWPP in associated genomic regions with peak SNPs.

Soybean Germplasms and Field Trials
A total of 185 diverse soybean accessions (Supplementary Table S1), including landraces and elite cultivars, were applied to evaluate the variation of SWPP and for the subsequent sequencing analysis. All samples were grown in Harbin in , Gongzhuling in 2015, and Shenyang in 2015. Field experiments were performed with single row plots of 3 m in length with 0.65 m between rows. A randomized complete block design was used with three replications in each tested environment. After reaching full maturity, a total of 10 randomly selected plants per row in each plot were weighed and used to evaluate SWPP.

DNA Isolation and Sequencing Analysis
Total DNA from the fresh leaves of a single test sample was extracted by the CTAB method according to Han et al. (2015). The isolated high-quality DNA was partly sequenced via specific locus amplified fragment sequencing (SLAF-seq) methodology (Sun et al., 2013). Soybean reference genome (Version:Glyma.Wm82.a2) was preliminarily analyzed and digested enzyme Mse I (EC 3.1.21.4) and Hae III (EC: 3.1.21.4) (Thermo Fisher Scientific Inc., Waltham, MA, United States) were used to generate more than 50,000 sequencing tags (approximately 300-500 bp in length) of all tested samples. The obtained tags were evenly distributed among the unique genomic regions of the 20 soybean chromosomes. The sequencing libraries of each tested accession were built based on the sequencing tags. A barcode method combined with the Illumina Genome Analyzer II system (Illumina Inc., San Diego, CA, United States) was used to generate the 45-bp sequence reads at both ends of the sequencing tags from each accession library. Short Oligonucleotide Alignment Program 2 (SOAP2) software was used to align raw paired-end reads to the soybean reference genome (Version:Glyma.Wm82.a2). The SLAF groups were designed based on the raw reads, which mapped to the same unique genomic positions. Approximately 58,000 high-quality SLAF tags were acquired from each tested accession. The SNPs were called as such based on MAF ≥ 0.05. If the minor allele depth or the total depth of the sample was larger than 1/3, then the genotype was considered heterozygous.
For thirty lines, a genome resequencing with 10-fold in depth was conducted on an Illumina HiSeq 2000 sequencer. Paired-end resequencing reads were mapped to the soybean Williams 82 reference genome (Version:Glyma.Wm82.a2) with BWA (Version: 0.6.1-r104) (Zhou et al., 2015) using the  default parameters. SAMtools48 (Version: 0.1.18) software (Zhou et al., 2015) was used for converting mapping results into the BAM format and to filter the unmapped and non-unique reads. Duplicated reads were filtered with the Picard package (picard.sourceforge.net, Version:1.87) (Zhou et al., 2015). The BEDtools (Version: 2.17.0) (Zhou et al., 2015) coverageBed program was applied to compute the coverage of sequence alignments. A sequence was defined as absent when the coverage was lower than 90% and present when the coverage was higher than 90%. SNP detection was performed by the Genome Analysis Toolkit (GATK, version 2.4-7-g5e89f01) and SAMtools (Zhou et al., 2015). Only the SNPs detected by both methods could be analyzed further. SNPs with allele frequencies lower than 1% in the population were discarded. SNP annotation was performed based on the soybean genome (Version:Glyma.Wm82.a2) using the package ANNOVAR (Version: 2013-08-23) (Zhou et al., 2015).

Population Structure Evaluation and Linkage Disequilibrium (LD) Analysis
The population structure analysis of the natural group was conducted based on the PCA programs in the GAPIT software package (Lipka et al., 2012). The LD block was evaluated across the soybean genome based on SNPs with MAF ≥ 0.05 and missing data ≤ 10% by using squared allele frequency correlations (r 2 ) in TASSEL version 3.0 (Bradbury et al., 2007). In contrast to the GWAS, missing SNP genotypes were not imputed with the major allele prior to LD analysis. The parameters in the software programs were set according to the MAF (≥0.05) and the integrity of each SNP (≥80%).

Genome-Wide Association Analysis
The SWPP association signals were identified based on 33,149 SNPs (Supplementary Table S2) from 185 tested samples with CMLM in GAPIT (Lipka et al., 2012). The P value was calculated using the Bonferroni method with α ≤ 0.05 (≤2.58 × 10 −4 ) and was used as the threshold to declare whether a significant association signal existed (Holm, 1979).

Prediction of Candidate Genes Controlling SWPP
According to the studies of Hwang et al. (2014), Han et al. (2015), and Zhao et al. (2017), the average LD decay distance of soybean genome (Version:Glyma.Wm82.a2) was approximately 200 kbp. Thus, candidate genes located in the 200-kbp genomic region of each peak SNP were classified and annotated underlying the soybean reference genome Williams 82 2 .

Haplotype Analysis of Candidate Genes
Based on the genome annotation, SNPs were classified in exonic regions (overlapping with a coding exon), splicing sites (within 2 bp of a splicing junction), 5 UTRs and 3 UTRs, intronic regions (overlapping with an intron), upstream and downstream regions (within a 1 kb region upstream or downstream from the transcription start site), and intergenic regions. SNPs in coding exons were further grouped into synonymous SNPs (did not cause amino acid changes) and nonsynonymous SNPs (caused amino acid changes). The variation happened in these regions (except for intergenic regions) of candidate genes in thirty lines generated from genome re-sequencing data which were analyzed using the General Linear Model (GLM) method in TASSEL version 3.0 (Bradbury et al., 2007) to identify related SNPs and haplotypes. Significant SNPs affecting the SWPP were claimed when the test statistics reached P < 0.01.

Statistical and Variation Analysis of SWPP
The SWPP of the 185 tested soybean accessions, grown in multiple locations over years, was determined. The mean results of the SWPP analysis showed that the effects of genotype, environment, and genotype by environment interactions were significant. The skewness and kurtosis of SWPP across the six environments were both less than 1, indicating continuous variation and a near normal distribution (Figure 1). Therefore, the SWPP in the present study was suitable for the subsequent GWAS.

Specific-Locus Amplified Fragment Sequencing (SLAF-seq) and Genotyping
The selective population contained 185 diverse accessions, primarily collected from China. The genomic DNA from all tested accessions was extracted and partially sequenced based on the SLAF-seq approach. For each tested sample, a mean of 49,571 high-quality tags (or SLAFs) was scanned from 153 million paired-end reads with a 45-bp read length and 6.51-fold sequencing depth. A total of 33,149 SNPs with MAF ≥ 0.05 were generated from the high-quality tags and subsequently used to perform GWAS for SWPP. The obtained SNPs were evenly distributed among the 20 soybean chromosomes, resulting in a marker density of 28.7 kbp per SNP. Chr. 6 and Chr. 11 included the most and least numbers of SNPs, which was 3,556 and 638, respectively (Figure 2).

Extent of Linkage Disequilibrium (LD) and Analysis of Population Structure
The average distance of LD decay was analyzed to characterize the mapping resolution for GWAS, and LD decayed differently among all the 20 chromosomes. Accordingly, the mean LD decay of the panel was evaluated at 214 kbp when r 2 dropped to half of its maximum value ( Figure 3A). To scan the population stratification of the association panel, principal component analysis and kinship analysis were conducted based on all 33,149 SNP markers. The first three PCs explained 13.83% of the genetic variation. A drastic decline in the genetic variation appeared at PC2 ( Figure 3B). However, analysis of the variation of the first 10 PCs revealed an inflection point at PC3 (Figure 3C). Thus, these results suggested that mainly the first three PCs dominated the population structure. Additionally, the heatmap of kinship matrix revealed low levels of genetic relatedness among the 185 tested samples ( Figure 3D).

Quantitative Trait Nucleotide (QTN) Associated With SWPP Evaluated by GWAS
A total of 20 association signals, distributed on 12 of the 20 soybean chromosomes, were detected by CMLM in the present study (Figure 4 and Table 1). Among these signals, only one QTN (Gm18:23052511 located on Chr.18) was identified in the linked region of a known SWPP QTL. However, as a specific member of seed weight, another 11 SNPs were overlapped or located in the linked region of a known seed weight QTL from the SoyBase databank, particularly for 100-seed weight ( Table 2). The remaining 8 association signals were regarded as novel loci, which were first reported for SWPP in the present study ( Table 2).
The average SWPP for all tested samples with two different alleles were compared (Table 1), and the results demonstrated that the SWPP among these samples were so different that the appropriate alleles might be effectively applied in the marker assisted selection (MAS) of soybean cultivars with suitable SWPP.

Prediction of Candidate Genes Controlling SWPP
The genes located in the 200-kbp genomic region of each peak SNP of the identified loci were considered as candidate genes, consistent with a mean LD decay distance of 214 kbp for the entire association panel. Approximately 126 candidate genes were identified. Among these genes, 26 genes had no functional annotation, and one gene had unknown function domains. The remaining 99 genes were classified into different groups by MAPMAN (Thimm et al., 2004) to determine clearly the potential functions. A total of 16 categories were scanned, including co-factor and vitamin metabolism, misc, RNA regulation of transcription, DNA synthesis/chromatin structure, protein synthesis/modification/degradation, signaling, development, transport, amino acid metabolism, hormone metabolism, abiotic stress, cell wall, major CHO metabolism, nucleotide metabolism, other groups, and unassigned genes (Supplementary Figure S1). The factors that influenced SWPP were almost the same as those that influence seed weight, and seed size was the main component for determining seed weight. Thus, the factors that could regulate the mechanism of seed size might be important elements in directly or indirectly adjusting SWPP . In some plants, several genes controlling seed size/weight have been identified, including the genes associated with hormone metabolism, such as auxin and cytokinin (Schruff et al., 2006;Li et al., 2013), various transcription factors associated with RNA regulation of transcription in plant development and maturation, such as TTG2, AP2, MINI3, and C2H2 (Garcia et al., 2005;Ohto et al., 2009;Costa et al., 2010;Yin et al., 2010;Cao et al., 2016), the genes associated with protein modification or degradation (particularly ubiquitin ligase genes) (Li et al., 2008;Xia et al., 2013;Wang et al., 2017), and regulatory genes dominating transport systems and other metabolic processes associated with plant growth and development, such as ABC transporters and amino acid metabolism (Wu et al., 2007;Kim et al., 2010;Less et al., 2010;Hildebrandt et al., 2015). Among these genes detected by GWAS in the present study, those associated with the RNA regulation of transcription, including different transcription factor families, such as AP2 and C2H2, were scanned and identified as the functional genes for SWPP, including Glyma.02G159600, Glyma.02G173900, Glyma.07G157600, Glyma.07G157800, Glyma.07G157900, Glyma.08G195300, Glyma.08G195500, Glyma.08G195600, Glyma.08G195700, Glyma.08G195800, Glyma.17G127400, Glyma.17G127700, Glyma.19G057700,and Glyma.19G067900 located on Chr.2,Chr.7,Chr.8,Chr.17,and Chr.19. Two brassinosteroid response factor genes (Glyma.10G129700 and Glyma.17G127100 with distances of 34.31 and 39.83 kbp from peak SNPs Gm10:34747895 and Gm17:10196755, respectively) were identified as candidate genes. Similarly, Glyma.10G130000, an auxin response factor gene located 44.84 kbp from SNP Gm10:34946492 on Chr.10, Glyma.15G240600, an ethylene factor gene located 31.54 kbp from SNP Gm15:45663581 on Chr.15, and Glyma.17G127500, a gibberellin factor gene located 15.58 kbp from SNP Gm17:10196755 on Chr.17, were all considered as the genes controlling SWPP. Three E3 ubiquitin ligase genes (Glyma.08G195200, Glyma.08G195400, and Glyma.08G195900 with distances of 40.42, 17.31, and 15.42 kbp, respectively, from the peak SNP Gm08:15768591 located on Chr.8), which are associated with protein modification or degradation, might also affect SWPP. An additional 10 genes belonging to transport systems (such as ABC transporters) and other main metabolic processes (such as amino acid metabolism, N-metabolism, and vitamin metabolism), including Glyma.02G159900, Glyma.08G196200, Glyma.12G163700, Glyma.12G163800, Glyma.14G207200, Glyma.14G207300, Glyma.14G207700, Glyma.14G207800, Glyma.17G127800, and Glyma.18G143700, were also selected and might also contribute to SWPP.
To predict the possible roles of candidate genes associated with SWPP, haplotype analysis of the 99 genes was performed. A total of 578 SNPs in 99 candidate genes were found among the thirty soybean lines (MAF > 0.1) through genome resequencing. Finally, 44 SNPs from nine genes were significantly associated with SWPP among multiple environments (Figure 5 and Supplementary Table S3). Glyma.07G157900, with only one SNP, was not shown in the figure (Figure 5). Two haplotypes were identified for each of the nine genes and the SWPP between each pair of haplotypes showed significant or highly significant difference (Figure 6). Glyma.07G157900, from Gm07:19488264, was detected under the environments of Shenyang in 2016 and the average at Shenyang in 2015 and 2016, which was located in the genomic region of the known loci, "seed weight 12-4" (Csanadi et al., 2001). A total of seven genes from Gm08:15768591 and Gm08:15768603 were detected, which overlapped the region of the known loci, "seed weight 22-1" (Zhang et al., 2004) and "seed weight 35-1" (Han et al., 2012). Of the seven genes, Glyma. Glyma.18G143700 was detected from Gm18:23052511 that overlapped the region of known loci, "SWPP 6-6" (Yao et al., 2015) and was detected under the condition of Shenyang in 2015, Shenyang in 2016, and the average at Shenyang in 2015 and 2016. These genes and beneficial haplotypes might be valuable for MAS in regulating SWPP of soybean.

DISCUSSION
Seed weight per plant, controlled by multiple genes, was an important component of seed yield in soybean. Thus far, some QTL have been identified based on linkage analysis from the SoyBase databank. However, few studies dissecting the genetic basis of SWPP in soybean via GWAS in combination with high-throughput SNPs and diverse accessions across multiple environments have been discussed, and even fewer candidate genes have been reported. In the present study, 185 soybean accessions were widely collected from China and used to conduct GWAS via high-throughput SNPs and diverse environments. In total, twenty SNPs were identified among twelve different soybean chromosomes, and these SNPs might have value for further breeding cultivars with appropriate SWPP.
In addition to the QTL associated with SWPP in the SoyBase databank, the genes associated with seed weight were referred for more accurately identifying the genomic region underlying the SWPP of soybean via the twenty SNPs used in the present study. Among the twenty SNPs, eight loci, including Gm02:27731352 located on Chr.2, Gm03:25789435 and Gm03:26388034 located on Chr.3, Gm12:24101071 located on Chr.12, Gm14:47326193 located on Chr.14, Gm15:45663581 located on Chr.15, and Gm19:19848537 and Gm19:19848932 located on Chr.19, were novel genes first reported in the present study. The genomic region of Chr.19, which included two close loci (Gm19:19848537 and Gm19:19848932), might be a primary region associated with SWPP. Additionally, another twelve SNPs were overlapped or located near known QTL (

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018.01392/ full#supplementary-material FIGURE S1 | Functional classifications of the candidate genes of seed weight per plant in soybean.