Association mapping reveals novel genes and genomic regions controlling grain size architecture in mini core accessions of Indian National Genebank wheat germplasm collection

Wheat (Triticum aestivum L.) is a staple food crop for the global human population, and thus wheat breeders are consistently working to enhance its yield worldwide. In this study, we utilized a sub-set of Indian wheat mini core germplasm to underpin the genetic architecture for seed shape-associated traits. The wheat mini core subset (125 accessions) was genotyped using 35K SNP array and evaluated for grain shape traits such as grain length (GL), grain width (GW), grain length, width ratio (GLWR), and thousand grain weight (TGW) across the seven different environments (E1, E2, E3, E4, E5, E5, E6, and E7). Marker-trait associations were determined using a multi-locus random-SNP-effect Mixed Linear Model (mrMLM) program. A total of 160 non-redundant quantitative trait nucleotides (QTNs) were identified for four grain shape traits using two or more GWAS models. Among these 160 QTNs, 27, 36, 38, and 35 QTNs were associated for GL, GW, GLWR, and TGW respectively while 24 QTNs were associated with more than one trait. Of these 160 QTNs, 73 were detected in two or more environments and were considered reliable QTLs for the respective traits. A total of 135 associated QTNs were annotated and located within the genes, including ABC transporter, Cytochrome450, Thioredoxin_M-type, and hypothetical proteins. Furthermore, the expression pattern of annotated QTNs demonstrated that only 122 were differentially expressed, suggesting these could potentially be related to seed development. The genomic regions/candidate genes for grain size traits identified in the present study represent valuable genomic resources that can potentially be utilized in the markers-assisted breeding programs to develop high-yielding varieties.


Introduction
Bread wheat (Triticum aestivum L.) is an important staple food crop, serving as the main source of energy, protein, and fiber for much of the world's human population (Ling et al., 2013;Ji et al., 2022). However, with the current rate of yearly increment in wheat yield, feeding the ever-increasing world population, which is expected to reach 9.3 billion by 2050, will be a daunting task. Further, depletion of natural resources such as land and water and a rise in the mean earth surface temperature exacerbates the problem and poses a challenge to producing sufficient wheat to feed the human population in the future (Nehe et al., 2019). To increase wheat yield, it is important to understand the genetic basis of traits that contribute to grain yield (GY). Grain yield and its contributing traits are complex in nature, highly influenced by environmental conditions, and regulated by multiple genes (Kato et al., 2000;Li et al., 2019;Ji et al., 2022). Quantitative trait loci (QTL) associated with GY has been extensively studied and reported on all 21 wheat chromosomes (Bennett et al., 2012;Sun et al., 2020). Several studies have identified numerous QTL for GY and productivity (Bennett et al., 2012;Zegeye et al., 2014;Malik et al., 2021;Ji et al., 2022). However, there is very limited information on the marker-assisted improvement of GY traits in wheat. This is primarily due to the non-availability of tightly linked robust markers with the GYassociated traits. The conventional QTL mapping approach has been extensively used for gene mapping and has enabled genetic dissection of seed traits in wheat (Breseghello and Sorrells, 2007;Duan et al., 2020). However, this approach does not allow detection of all the possible allelic variants of the target gene that might exist in the natural populations of wheat. Another downside of the QTL mapping approach is its poor resolution. Availability of gold standard wheat reference genome sequence and high-density SNP arrays is expected to accelerate high-resolution mapping of complex traits using both conventional and association mapping approaches (IWGSC et al., 2018;Chaurasia et al., 2021).
In the past few years, the genome-wide association study (GWAS) has become a popular approach to identify the QTL associated with complex traits in crops. In contrast to QTL mapping, this approach enables the exploration of a large number of alleles for any locus from a natural population of diverse individuals. This approach facilitates high-resolution mapping of traits because the individuals used for the association analysis might have undergone several rounds of historical recombination (Yu and Buckler, 2006;Li et al., 2019). Several GWAS studies have been performed on major crops such as Oryza sativa (Spindel et al., 2016), Zea mays (Xu et al., 2017;Xu et al., 2018), Hordeum vulgare (Visioni et al., 2013), Avena sativa (Newell et al., 2011), Brassica napus (Zhou et al., 2017), Glycine max , Sorghum bicolor (Morris et al., 2013), and in wheat for the genetic dissection of various desirable traits (Peng et al., 2018;Chaurasia et al., 2021;Malik et al., 2021). Over the past few years, efforts have been made to develop GWAS models that are more suited to investigating genetics of simple as well as complex traits in plants. These GWAS models are broadly grouped into single-locus GWAS (SL-GWAS) and multi-locus GWAS (ML-GWAS) methods. SL-GWAS methods have been widely used to detect genetic variants for various traits, but one main limitation of this model is that the pvalues of the markers identified to be associated with the target trait need to be subjected to multiple rounds of testing to avoid false positive associations. To overcome this limitation, Zhang et al. (2020) developed a mrMLM package that contains the following six ML-GWAS methodologies: mrMLM (multilocus random-SNPeffect MLM) (Wang et al., 2016), FASTmrMLM (fast mrMLM) , ISIS EM-BLASSO (iterative modified-sure independence screening expectation-maximization Bayesian least absolute shrinkage and selection operator) , pKWmEB (integration of Kruskal-Wallis test with empirical Bayes) (Ren et al., 2018), FASTmrEMMA (fast multi-locus random-SNP-effect efficient mixed model analysis) , and pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes) .
Among the GY-associated traits, grain size contributes the most, making it a key selection target in wheat breeding programs for developing high yielding varieties. Thousand grain weight (TGW) is the main component of GY and is determined by grain size traits such as grain length (GL), grain width (GW), and grain length width ratio (GLWR) (Sun et al., 2009;Li et al., 2022). Thus, it is important to understand the genetic and molecular basis of the mechanisms governing grain size in wheat genotypes and to identify the superior novel alleles governing this trait from germplasm collections for exploitation in the breeding program. Therefore, the main aim of this study was to dissect the genetic control of grain size traits such as GL, GW, GLWR, and TGW in wheat germplasm employing 35K SNP array using multi-locus GWAS approaches.

Experimental materials and phenotyping
The experimental material for the GWAS study comprised of 125 diverse wheat accessions, a subset of a mini core developed from the composite wheat core set (Phogat et al., 2020) of the National Genebank of India. These accessions were comprised of 85 indigenous and 40 exotic collections, which included released varieties, landraces, genetic stocks, and elite genotypes (Supplementary Table 1). These accessions were evaluated following augmented block design in five blocks using four checks (HD2967 and C-306) randomized in each block over five years (Rabi 2015-16 to Rabi 2019. The GWAS panel was evaluated at the ICAR-National Bureau of Plant Genetic Resources (NBPGR), Issapur Farm, Delhi located 28.3748°N, 77.0902°E, 228.6 m AMSL, for five consecutive years; during the fifth year, the trial was also taken up at the ICAR-NBPGR, Regional Station, Jodhpur located at 26. Each genotype was grown in a three-rows plot of 2 m length each, with a row-to-row distance of 0.25 m. Pests and diseases were controlled chemically, whereas weeds were controlled manually.
The wheat GWAS panel was evaluated for GL (mm), GW (mm), GLWR, and TGW (gm) from the harvested grain samples. The measurements of these grain parameters were performed by selecting the main spike of five random individual plants in the middle of the row for each accession. Grains per spike were estimated by hand-threshing the mature spike. TGW of each genotype was recorded by weighing all the seeds from a sample, dividing it by the total seed number measured, and multiplying the result by 1000. For GL and GW analysis, ten seeds from each five spikes were measured using digital vernier caliper and average value of the plot accessions was taken up for analysis. Grain length/width ratio (GLWR) was calculated by dividing the grain length mean by the grain width mean for each genotype.

Phenotypic data analyses
The phenotypic data was analyzed using ACBD-R (Augmented Complete Block Design with R) version 4.0 software (Rodriguez et al., 2018). The mean, coefficient of variation (CV), least significant difference (LSD), genotypic variance, and heritability were estimated. In ACBD-R v4.0, the best linear unbiased predictors (BLUPs) of each genotype were calculated for each environment and across environments along with four checks varieties. The calculated BLUPs were then used in the GWAS analysis. The frequency distribution graphs, correlation coefficients of the recorded traits, and principal component analysis were obtained through SAS JMP Version 14 software (https://www.jmp.com/ en_in/software/data-analysis-software.html).

Genomic DNA isolation and SNP genotyping
Genomic DNA was isolated from one-week-old wheat seedlings using the CTAB method (Murray and Thompson, 1980) with a few modifications and then treated with RNase to remove any RNA contamination. The integrity of DNA samples was checked on 0.8% agarose gel and concentration was determined by using a NanoDrop1000 (Thermo Scientific). Genotyping of isolated DNA samples was done using Breeder's 35K Axiom ® array (Allen et al., 2017). The SNPs with a genotyping call rate < 97% and minor allele frequency (MAF) <5% were removed while performing genomic data analysis.
Clustering, population structure, and linkage disequilibrium analysis A total of 23,874 SNPs were used to perform principal component analysis (PCA) and generate kinship matrix using TASSEL 5.2 program (https://www.maizegenetics.net/tassel). STRUCTURE software was used to estimate the level of genetic differentiation in the population using the Bayesian model-based approach; the parameter burn-in period and Monte Carlo Markov Chain (MCMC) replication number were set to 10,000 and 20,000 respectively for ten independent runs to estimate the number of subpopulations (k) in a putative range of k = 1 to 5. The optimal subpopulation number was estimated using an ad hoc statistic delta k (Evanno et al., 2005). The squared allele frequency correlation (r 2 ) between SNP markers was used to estimate linkage disequilibrium (LD) using TASSEL v5.2 (https://www.maizegenetics.net/tassel).

Genome-wide association analysis
All 125 accessions were genotyped using 35K SNPs array. We used five ML-GWAS methods which were included in the R package mrMLM v4.0.2 (https://cran.r-project.org/web/packages/ mrMLM/index.html). These five models are mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISISEM-BLASSO. All the parameters were set at default values. The critical thresholds of significant association for all the five methods were set as logarithm of the odds (LOD) score ≥3.00. The most significant SNPs, detected in at least two methods, were considered as reliable SNPs.

Differential gene expression analysis
We utilized RNA-seq data of two wheat genotypes with contrasting seed size i.e., IC111905 (large-seeded)and EC575981 (small-seeded) at 15 days and 30 days post anthesis (DPA) with three biological replicates to check the expression profile of putative candidate genes located in the identified genomic regions. Illumina sequencing was performed, which generated approximately 177 Gb raw data. Approximately 98.5% of reads passed the quality control and clean reads were mapped back on to the reference genome (IWGSC v2.0) (https://plants.ensembl.org/Triticum_aestivum/ Info/Index) using bwa-mem software (https://sourceforge.net/ projects/bio-bwa/files/). The differential gene expression analysis was performed using edge R package and genes with p-value <0.05 were considered as significantly differentially expressed genes. Heat maps of differentially expressed genes were generated by MeV software (https://sourceforge.net/projects/mev-tm4/).

Result Phenotypic evaluation and variability
All the genotypes of the wheat association panel were phenotyped for grain size parameters (GL, GW, GLWR, and TGW) in seven different environments (E 1 -E 7 ). The descriptive statistics of the investigated traits in seven environments are presented in Supplementary Table 2, and revealed wide variability for all the traits. GL ranged from 5.24 to 8.20 mm in E 1 , 5.09 to 8.17 mm in E 2 , 4.83 to 8.15 mm in E 3, 5.04 to 8.09 mm in E 4, 4.43 to 7.79 mm in E 5 , 4.43 to 8.41 mm in E 6 , and 4.47 to 7.87 mm in E 7 . GW ranged from 2.29 to 3.95 mm in E 1 , 2.26 to 3.71 mm in E 2 , 1.89 to 3.73 mm in E 3, 1.87 to 3.69 mm in E 4, 1.41 to 3.67 mm in E 5 , 1.41 to 3.58 mm in E 6 , and 1.48 to 3.81 mm in E 7 . GLWR ranged from 1.49 to 2.65 in E 1 , 1.29 to 2.70 in E 2 , 1.64 to 3.12 in E 3, 1.64 to 3.52 in E 4, 1.62 to 3.12 in E 5 , 1.69 to 3.12 in E 6, and 1.61 to 3.02 in E 7 . TGW ranged from 18.00 to 64.54 in E 1 , 13.51 to 67.56 in E 2 , 7.91 to 51.93 in E 3, 9.87 to 61.39 in E 4, 12.17 to 56.66 in E 5 , 13.75 to 67.21 in E 6, and 14.26 to 55.51 in E 7 . The coefficients of variation for GL, GW, GLWR, and TGW ranged from 8.32% to 24.00%, indicating considerable variability for these traits. The CV percent was highest for TGW in E 6 (24.00%) followed by E 4 (21.62%) and E 2 (20.39%). The frequency distribution of four traits (GL, GW, GLWR, and TGW) (Supplementary Figure 1) showed near normal distribution in all environments, indicating the quantitative nature of these traits except for GW under environments E 3 and E 4 .

Multivariate analysis Correlation between traits in different environments
Pearson's correlation coefficients were estimated among grain traits for diverse wheat accessions under each environment separately (Supplementary Table 4). GL was found to have a consistently significant positive correlation with GW (0.368, 0.406, 0.444), GLWR (0.585, 0.562, 0.335), and TGW (0.471, 0.394, 0.538) under E 1 , E 2, and E 3 respectively. Contrarily, GW showed a negative correlation with GLWR (-0.363, -0.438, and -0.684) and significant positive correlation with TGW (0.371, 0.380, and 0.604) under E 1 , E 2 , and E 3 respectively. Under the environment E 4 , GL showed significant positive correlations with GW, GLWR, and TGW that ranged between 0.225 (GW) to 0.457 (GLWR), while GW showed highly a significant negative correlation with GLWR (-0.748) and significant positive correlation with TGW (0.488). Similarly, under E 5 , a significant positive correlation was observed with GW (0.406), GLWR (0.361), and TGW (0.643), while GW showed a significant negative correlation with GLWR (-0.686) and significant positive correlation with TGW (0.723). A similar correlation pattern among s e e d t r a i t s w a s a l s o o b s e r v e d u n d e r E 6 a n d E 7 (Supplementary Table 4).

Phenotypic correlation between different environments for traits
The magnitudes of correlation between environments were assessed for knowing behavior of genotypes for trait expression. A total of twenty-one combinations of correlations were observed between different pairs of environments for all the traits (Supplementary Table 5). Here, significant positive associations were revealed for response of traits by genotypes for all environment pairs except five for grain length, and TGW and three for grain width and GLWR. For GL, these correlation values ranged from 0.084 between E 3 to E 7 to 0.985 between E 1 to both E 2 and E 4 . For GW, these correlation values ranged from 0.061 between E 4 to E 7 to 0.953 between E 1 and E 2 . Similarly, for GLWR, the lowest association was observed as 0.089 between E 4 to E 6 and highest as 0.961 between E 1 and E 2 . For TGW, lowest association was observed as 0.021 between E 4 to E 6 and highest as 0.954 between E 2 and E 4 .

Principal component analysis and correlation study based on pooled analysis
PCA was performed on the basis of pooled data for seven environments for grain parameters. Genotype by trait biplot depicted two-dimensional spatial diversity of accessions as well as trait variability ( Figure 1A). High trait variability was observed for traits like GLWR, GL, GW, and TGW, which was evidenced by the larger length of the characters and high positive correlations between traits such as GL and TGW and GLWR and GL, and a negative correlation between TGW and GW, which is evident by the narrow angle between them. Here, the first principal component, PC1, explained 57.4% of the cumulative variation. The major contributing traits for PC1 were GW (0.90), TGW (0.90), and GL (0.79) in positive direction. Second principal component, PC2, explained 34.5% of the cumulative variation. The major contributing traits for PC2 were GLWR (0.97) and GL (0.56). Further, we analyzed correlations between traits using the BLUP values where GL showed a highly significant positive correlation with GW (0.562), GLWR (0.368), and TGW (0.667), whereas GW showed a highly significant negative correlation with GLWR (-0.508) and significant positive correlation with TGW (0.680). GLWR was not significantly related to TGW ( Figure 1B).

Genotyping
A total of 125 wheat accessions representing a subset of the Indian National Genbank mini core germplasm were genotyped using 35K wheat SNP array that contains 35,143 genome-wide single nucleotide polymorphism (SNP) markers. The SNP probe s e q u e n c e s o f w h e a t a r r a y w e r e B L A S T n ( h t t p s : / / blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html) against wheat genome to find out their physical location, which revealed only 31,926 SNPs with known positions. Furthermore, SNPs were also filtered on the basis of minor allele frequency (≥0.05), missing threshold of < 10%, and call rate ≥97%. Finally, a total of 23,874 SNPs were retained for genetic diversity, population structure, and GWAS analysis. Mapping of 23,874 filtered SNP markers provided a whole genome-wide coverage along the 21 chromosomes of wheat ( Figure 2). Further, distribution analysis of SNPs on wheat chromosomes revealed that the maximum number of SNPs was positioned on 1B (1594), followed by 2D (1575) and 1D (1516), while the lowest number was positioned on 4D (508), followed by 4B (800) and 4A (808). We also compared the distribution of SNPs on the three wheat sub genomes; it was found that 7,291 SNPs belonged to A sub-genome, 8,784 SNPs were found on B subgenome, and 7,799 SNPs belonged to D sub-genome (Table 1).  Population structure, kinship, and linkage disequilibrium decay analyses We used 23, 874 SNP markers to ascertain the population structure in the wheat mini core set using STRUCTURE and PCA analysis. The most probable number of populations were estimated using delta K method implemented in the STRUCTURE HARVESTOR program. The value of DK peaked at K=2 and revealed two sub-populations in the wheat mini core germplasm. Sub-population 1 represented 82% of the individuals; out of that, 62% were pure and 38% admixtures. Whereas sub-population 2 had 18% of the individuals of the AM panel, and contained 75% pure and 25% admixtures. PCA also detected the two sub-populations indicated by two significant components, explaining the maximum variation of the population. Further, kinship matrix was also created to explore the relationship among the individuals using the genome association and prediction integrated tool (GAPIT) which demonstrated the presence of two sub-groups within the association panel ( Figure 3).
The LD decay in the wheat mini core set was estimated by calculating the squared correlation coefficient (r 2 ) for all the SNPs. The LD decay for the whole genome was 1.9 Mb. Further, it was found that the decay was most rapid in the A sub-genome (1.63 Mb), followed by the D sub-genome (1.93 Mb) and B sub-genome (2.28 Mb) (Figure 4).

GWAS for grain size traits
GWAS was performed using 23,874 SNPs filtered on various parameters to identify genomic loci associated to four different grain yield traits (GL, GW, GLWR, and TGW) independently for the seven environments and also based on the BLUP values derived from data of grain size traits of all the seven environments. Here, we have used five multi-locus models (mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO) to conduct GWAS. A total of 752 significant SNPs were predicted for four grain size traits using ML-GWAS models with LOD score ≥ 3. Manhattan plots for GL drawn using various ML-GWAS models that depict marker trait associations are presented in Figure 5. Of these 752 SNPs, 72 were identified using BLUP values derived from the data of all the environments and other SNPs were identified by analyzing data of each location separately. We classified 752 SNPs according to trait, which demonstrated that 156, 179, 250, and 167 SNPs were associated with GL, GW, GLWR and TGW traits, respectively ( Figure 6). In addition, comparative study on the basis of the  Figure 6). On the basis of redundancy of SNPs in the models and locations, we combined the identified SNPs and found a total of 160 SNPs were simultaneously detected in two or more multi-locus models. These SNPs were designated as reliable QTNs for the respective traits. Furthermore, distribution of these 160 significant SNPs was also analyzed across the environments. Out of these 160 QTNs, 87 were confined to only one environment that included 13, 19, 27, and 25 QTNs for GL, GW, GLWR, and TGW respectively, and 3 QTNs that were associated with more than one trait (Table  S6). On the other hand, a total of 73 QTNs were simultaneously identified in two or more environments as well as two or more models (Table 2). Among these 14, 17, 11, and 10 QTNs were identified for GL, GW, GLWR, and TGW traits respectively while 21 SNPs were associated with more than one trait. The physical distribution of all the 160 SNPs on chromosomes demonstrated that SNPs were present on all the chromosomes. Moreover, the highest number of SNPs were found on chr3D (7 SNPs), followed by chr2D (6 SNPs), and chr7D (6 SNPs) while chr1D, chr6B, and chr6D had only one SNPs.

Allelic effects of identified genomic regions on grain shape
To evaluate the allelic effects of QTNs on respective phenotype, we only analyzed those QTNs that were detected in more than two environments and revealed R 2 value ≥10% with at least one GWAS model. Association panel genotypes were divided into two groups according to allele types in order to test whether the mean phenotypes of the two groups were significantly different (Figure 7). Results showed that six QTNs had significant effect (P ≤ 0.01) on their respective traits. Among these six QTNs, four QTNs (Q.GL-GW-3D (AX-95008504), Q.GL-5D (AX-94482861), Q.GL-5D (AX-95020206), and Q.GL-TGW-6A (AX-95238912)) demonstrated significant effects on GL(mm) whereas one, QTN Q.GW-4B (AX-94878781), had significant effect on GW and another, Q.GLWR-2A (AX-94736090), showed significant effect on GLWR. The QTNs with significant phenotypic effects on seed traits might contribute to their genetic variations.

Annotation of identified QTNs
All the 160 significantly associated QTNs with grain size traits that were detected in two or more models were searched for their annotation in the wheat reference genome assembly cv. Chinese Spring (IWGSCrefseq version 2.0, https://wheat.pw.usda.gov/GG3/ iwgsc-2.0), available at Plant Ensemble. Of these 160 QTNs, annotation was only detected for 136 QTNs. The detailed analysis of annotated SNPs showed that Q.TGW-5D (SNP-AX-95234313) was located within a gene encoding cytochrome 450 and was identified in E4 environment using mrMLM and FASTmrMLM models with LOD score ranging between 3.8 to 3.92. Another, Q.GW-3D (SNP-AX-94540502), for grain weight identified at the E6 environment was annotated as ABC transporter and detected by mrMLM and ISIS EM-BLASSO models. We also checked the annotation of QTNs that were identified at multiple environments and found that Q.GL-1B (SNP-AX-94699549), Q.GL-GLWR-4B (SNP-AX-94879134), Q.GL-3D (SNP-AX-95074739), and Q.GW-5A (SNP-AX-94657794) were located w i t h i n g e n e s e n c o d i n g A B C t r a n s p o r t e r , W R K Y transcription_factor, Glucan endo-1,3-beta-glucosidase, and Zinc_finger_protein respectively. Among these four QTLs, Q.GL-1B (encoding for an ABC transporter) was located at 585,331,222bp on chromosome 1B. It was identified in two different environments, E 1 and E 4, using ISIS EM-BLASSO model with LOD scores 4.17 and 5.65 and R 2 3.05% and 6.42%, respectively. Q.GL-GLWR-4B (encoding for a WRKY transcription factor) was located on chr4B at 63,231,309bp position and identified in four different environments: E 1 with ISIS EM-BLASSO model, E 2 with ISIS EM-BLASSO and FASTmrEMMA models, E 3 with pLARmEB model, and E 4 with ISIS EM-BLASSO model. It had an LOD score ranging between 3.52 to 8.12. This QTN was associated with GL in all locations except E 2 . Further, in E 2 , this Q.GL-GLWR-4B was also determined by three different models, namely FASTmrMLM, pLARmEB, and ISIS EM-BLASSO, but associated with GLWR trait.

Expression analysis
The transcriptome sequencing of contrasting seed size wheat genotypes, i.e. IC111905 (large-seeded) and EC575981(smallseeded), was performed at two time intervals during the seed development (i.e., 15 and 30 DPA) to quantify expression of all annotated genes within associated genomic regions.
Expression analysis demonstrated that only 123 genes were expressed in both stages (15 and 30 DPA) of small and large seeded genotypes, of which 23, 33, 27, and 28 were uniquely associated with GL, GW, GLWR, and TGW respectively. Among the identified genes, those with foldchange ≥1 and p-value<0.05 were considered as significantly differentially expressed genes. A total of 18 and 12 genes were significantly differentially regulated in large seeded The rate of Linkage disequilibrium decay (R 2 ) between pairs of polymorphic markers of the whole wheat genome and its sub-genomes A, B, and D are plotted against the genetic distance (Mb).
cultivars at 15 and 30 DPA respectively. At 15 DPA, 11 genes were upregulated and 7 genes were downregulated, while 6 genes were significantly upregulated and downregulated in large seed cultivars at 30 DPA (Figure 8). Many genes, including TraesCS7B02G462900 Result also showed that Q.GLWR-5B (SNP-AX-94915493) associated with GLWR was located within gene TraesCS5B02G552400 (hypothetical protein), which was only expressed in small grain cultivars with 10 fold upregulation, which showed it has some specific role in small seed cultivars. Another gene, namely TraesCS7D02G463100, located within the QTNQ.GW-7D and identified in E 1 and E 5 location was upregulated in large grain cultivars. TraesCS7D02G463100 is nuclear transcription factor associated with GLWR. Manhattan plots of associated QTNs for grain length (GL) in wheat using multi-locus GWAS model. The x-axis shows the chromosome label and the yaxis displays -thresholds for significance (LOD score = 3) and log10 (p-value). The significant QTNs with LOD score >=3 is represented with purple dots. Distribution of identified and significant SNPs for each trait on the basis of detection models of multi-GWAS.

Discussion
Grain yield is a highly complex agronomic trait, governed by several genes and also influenced by environmental conditions (Li et al., 2022). It is essentially determined by two main components i.e., number of grains per m 2 and thousand grain weight (TGW) (Sun et al., 2009;Kumari et al., 2018;Li et al., 2019). In the breeding history, grain yield was mainly improved with increase in the grain number per m 2 , which is determined by grain number per spike (Kumari et al., 2018). In the present study, we focused on the grain shape traits i.e. GL, GW, and GLWR, which determine TGW, a phenotypically stable yield contributing trait and used by the breeders for selecting high yielding varieties (Avni et al., 2018;Duan et al., 2020). The TGW and other grain size traits contribute to higher grain yield than grain number per spike. (Ji et al., 2022). Thus, it is very important to study the grain shape traits when the aim is to improve grain yield. Here, we have applied GWAS to identify genomic regions regulating variation for grain yield in a sub-set of the Indian National Genebank wheat mini core set germplasm (125 accessions). These mini core set accessions have been identified from a core set (2226 accessions), constituted from the entire wheat accessions (22416) conserved in the National Genebank of India (Phogat et al., 2020). Therefore, the mini core set accessions are a valuable genetic resource for mapping various desirable traits including grain shape traits. The phenotyping of wheat mini core set accessions across the seven environments revealed significant variability among wheat accessions for grain parameters. High coefficients of variation for TGW under all the environments indicated broad phenotypic variation and a large improvement potential. Heritability is the proportion of genotypic variance to all observable variance in the total population. Over the environments, heritability was high for GL and moderate for GW, GLWR, and TGW. The trend of heritability is more specific to environment than traits, as we observed low/ moderate heritability for E 5 and E 6 . These environments fall in stress prone areas affected by less rainfall and high temperatures, which might have caused low heritability of the traits. Promising accessions for grain parameters were identified. Among them, EC578134, IC539313, IC535217, EC464070, and EC578152 were promising for GL as well as TGW. EC339611, EC578134, and IC535217 were promising for GL as well as GLWR, whereas IC335715 was promising for GW and TGW. These accessions can be used in breeding programs for trait introgression, genetics, and genomics study. The significant positive correlation of grain length and width with thousand grain weight revealed that the selection of grains with increased width and length can greatly contribute to grain weight and indirectly to grain yield. Earlier studies have reported moderate to strong correlations between TGW and size (Rasheed et al., 2018). Simmonds et al. (2016) reported that GL and GW in tetraploid and hexaploid wheat can greatly influence the TGW, as longer and broader grains have more starch accumulation and, hence, a higher weight (Simmonds et al., 2016). Previous studies have also reported positive associations among TGW, GL, and GW (Breseghello and Sorrells, 2007;Ramya et al., 2010). Principal component analysis also found GW, TGW, and GL as major contributing traits positively contributing to variations in grain shape among wheat genotypes Boxplot for 6 reliable QTNs (A-F). Genotypes were divided into two groups at each locus based on the allele type. A significant difference between the phenotype of these two groups was analyzed using t-test (P ≤ 0.001). Two alleles for each QTN (Locus) are given on X-axis. Y-axis shows phenotypic values of the traits.
( Figure 1). In our study, the association between GW and GLWR is consistent and significantly negative in all environments. A negative correlation between GW and GLWR could be attributed to compensation of photosynthates to GW rather than to GL. The different correlations could be explained by the influence of the environment on the plant growth and grain development. This study shows that GL, GW, and GLWR are all expected to increase with TGW, one of the major yield components of grain yield and which can be targeted to enhance wheat yield potential. Genetic diversity and population structure in the wheat mini core subset was analyzed using 35K wheat SNP array. Both STRUCTURE and PCA analyses revealed two subpopulations in the wheat mini core set germplasm used in our study. The whole genome LD decay distance in the wheat mini core set was 1.93Mb. Further, LD decay was most rapid in A genome followed by D and B sub-genome. Many earlier studies in wheat have reported much longer LD decay distance ranging from 4Mb to 15 Mb or even more (Pang et al., 2020;Hanif et al., 2021;Li et al., 2021). This suggested the presence of a big LD block size, which has so far limited high-resolution trait mapping in wheat. One of the ways to overcome this problem is to use a very high density genic-SNP array having lakhs of SNPs derived from the coding regions for genotyping of association panels used for conducting GWAS. The high-density genotyping would facilitate construction of haplotypes maps of the associated regions that may help us in pinpointing the exact causal SNP/genes for the target traits.
The Genome-wide association study (GWAS) has been found to be a powerful tool to investigate genetic bases of complex traits in many plant species such as rice, maize, soybean, and wheat (Zegeye et al., 2014;Zhang et al., 2015;Spindel et al., 2016;Zhang et al., 2018;Chaurasia et al., 2020;Chaurasia et al., 2021). There are many statistical methods based on different algorithms that can be used to predict the true association between SNP markers and corresponding phenotypic variations in GWAS (Spindel et al., 2016). In our study, we used the ML-GWAS method for the detection of marker trait-associations for grain shape traits. Multi-locus methods are effective because of their higher statistical power which provides higher efficiency and accuracy for QTNs detection. In numerous studies, it has been found that ML-GWAS is much better than other methods (Bennett et al., 2012;Visioni et al., 2013;Spindel et al., 2016;Ma et al., 2018;Xu et al., 2018;Khan et al., 2019). Peng et al. (2018) used six ML-GWAS models to detect the genetic dissection of 20 free amino acid (AA) levels in T. aestivum and claimed that ML-GWAS methods are more reliable and powerful. In the current study, we used five multilocus methods, mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, and ISIS EM-BLASSO, to perform GWAS analysis of four agronomic traits in our association panel. Among these five models, pLARmB identified the highest number of QTNs (211 SNPs), followed by FASTmrMLM (202), ISI EMBLASSO (177), mrMLM(132), and FASTmrEMMA(62). QTNs for thousand grain weight, grain length, grain width and grain length width ratio QTL for grain yield component traits have been extensively studied and reported on all the 21 chromosomes of wheat (Brinton et al., 2017;Cao et al., 2019;Ma et al., 2019;Ji et al., 2022). In our analysis, a total of 160 reliable QTNs were detected for four grain shape-related traits across the seven locations (Table 2; Supplementary Table 6).
For grain length, 27 QTNs were detected, which were distributed on 17 wheat chromosomes (chr1B, chr1D, chr2B, chr2D, chr3A, chr3B, chr3D, chr4A, chr4B, chr5A, chr5B, chr5D, chr6B, chr6D, chr7A, chr7B, and chr7D). Among these 27 QTNs, 10 QTNs were major (R 2 ≥ 10% at least in one GWAS method), of which Q.GL-4A (SNP-AX-94839917), Q.GL-7D (SNP-AX-94872194), Q.GL-7A (SNP-AX-94760450) and Q.GL-6D (SNP-AX-94647721) were strongest because their R 2 value were ≥20%. The Q.GL-4A on the chr4A with highest R 2 = 31.94% may explain a significant proportion of the variation for GL in the wheat mini core germplasm. Moreover, this QTN was identified simultaneously in two environments i.e., E 1 and E 2, and with four different models. The Q.GL-7D is located within a gene encoding Thioredoxin M type protein with R 2 ranging between 4.82% to 21.88%. This QTN was predicted in three different environments i.e., E 1 , E 3, and E 4, and using three different models, suggesting this could be a reliable QTN contributing to GL variation in wheat. In an earlier study, Thioredoxin has been shown to play an important role in preventing sprouting of developing grains in cereals (wheat and barley) by reducing the intramolecular disulfide bonds of storage proteins and other proteins in the starchy endosperm, and thereby affecting grain yield (Guo et al., 2013).
For grain width, 36 QTNs were identified that were distributed on 17 wheat chromosomes. Among these QTNs, Q.GW-3D (SNP-AX-94642652), Q.GW-5B (SNP-AX-94547840), Q.GW-3A (SNP-AX-94741529), Q.GW-4D (SNP-AX-95213549), and Q.GW-2B (SNP-AX-94519462) were predicted as major QTNs as the phenotypic variance explained by these QTLs was ≥10% of at least one of the ML-GWAS models. Q.GW-3D and Q.GW-5B were annotated as unnamed protein product and hypothetical protein respectively. Additionally, Q.GW-2B and Q.GW-4D had R 2 values ranging from 7.7 to 12.23 and 0.72 to 17.72, respectively. Q.GW-2B was identified at two environments E 3 and E 4, while Q.GW-4D was identified at E 2 . Interestingly, both intragenic SNPs showed higher expression in large grain wheat cultivars than small seed cultivars. This suggested that these QTNs might have important roles in determining variation for GW in wheat.
Thirty-seven and thirty-five QTNs were predictive for GLWR and TGW traits respectively. The GLWR-associated QTNs were distributed on 17 chromosomes (chr1A, chr1D, chr2A, chr2B, chr2D, chr3A, chr3B, chr3D, chr4A, chr4B, chr5A, chr5B, chr6A, chr6B, chr6D, chr7B, and chr7D) while QTNs for TGW were spread over 16 chromosomes (chr1A, chr1B, chr1D, chr2A, chr2B, chr2D, chr3A, chr3B, chr3D, chr4B, chr4D, chr5A, chr5D, chr6A, chr7B, and chr7D). In TGW, a total of fourteen SNPs had R 2 ≥10 and were considered as major genomic regions for this trait. Q.TGW-1A (SNP-AX-94605845) was annotated as TTL1 protein (TETRATRICOPEPTIDE-REPEAT THIOREDOXIN-LIKE 1) with R 2 = 11.78% and highly expressed in large grains. Studies have reported that TTL1 positively regulates the stress response regulated by ABA (Guo et al., 2013). The loss of TTL1 function causes plants to be sensitive to salt and osmotic stress during seed germination and later development (Rosado et al., 2006). So, it could be possible that the identified genomic region in our study may positively regulate the expression of TTL1 gene and regulate seed maturation. Q.TGW-5D (SNP-AX-95234313) was located within a gene encoding cytochrome 450 and it was only identified at E4 environment with R 2 = 21.45. In a previous study on cytochrome family protein, CYP78A3 on chr7 has been shown to play an important role in wheat seed development by promoting integument cell proliferation . Thus, it could be suggested that Q.TGW-5D (cytochrome 450) identified in our study might also have some role in seed development. A total of 13 QTNs were associated with GLWR and were considered as strong QTNs explaining ≥10% phenotyping variance of the trait. Most of the QTNs were annotated to be either hypothetical proteins or intergenic SNPs. Three QTNs for TGW, namely Q.GLWR-2D (SNP-AX-94922377), Q.GLWR-1A (SNP-AX-95213485), and Q.GLWR-6A (SNP-AX-94722285), were simultaneously identified in three different environments and located on chr2D, chr1A, and chr6A respectively.

Comparison of the QTLs identified in the present and previous studies
In wheat, several candidate genes underlying grain size and weight have been identified including TaGS (Bernard et al., 2008), TaGW2 (Su et al., 2011), TaGS-D1 (Zhang et al., 2014), TaCWI (Jiang et al., 2015), and Tackx4 (Chang et al., 2015). Additionally, McCartney et al. (2005) identified two major QTLs for TKW responsible for reduced plant height that were near the Rht-B1b and Rht-D1b genes that control plant height (McCartney et al., 2005;Gao et al., 2015). Another QTL, Qtgw-cb.5A, was identified as a key determinant of final grain weight which increased grain length by driving pericarp cell expansion (Brinton et al., 2017). We performed the comparative analysis of QTNs for grain shape identified in the present study with previously reported QTLs on the basis of their physical locations on chromosomes. Some of the previously reported grain size-associated QTLs were also predicted in our analysis. For example, Qgl.cib-CK1-4A associated with GL on chr4A coincided with Q.GL-4A (SNP-AX-94839917) for grain length trait at the same region on chr4A and identified in two environments (E 1 and E 2 ). Further, LOD (4.92~6.13) and R2 value (15.33~31.94) of this QTL demonstrated its importance in regulating GL trait. Goel et al. (2019) identified qTKW.6A.1 associated with TGW on 6A at the interval 166.64-596.18 Mb (Goel et al., 2019). The QTN, Q.TGW-6A (SNP-AX-95240001), identified in our study appears to correspond to qTKW.6A.1. Interestingly, Q.TGW-6A was identified at two locations, E2 and E4, which showed that it is a stable genomic region for TGW. Further, we found that Q.GL-TGW-6A (SNP-AX-95238912) and Q.GLWR-6A (SNP-AX-94722285), which are located on chr6A at 362.7Mb and 307Mb, overlapped with the grain shape QTLs identified by Cao et al. (2019) and Ji et al. (2022), respectively. Interestingly, Q.GL-TGW-6A was identified in three environments (E3, E4, and E5). LOD score and R 2 value of Q.GL-TGW-6A and Q.GLWR-6A ranged from 3.46 to 8.23 and 2.6 to 23.8 respectively. On the other hand, Q.GLWR-6A was present at 307Mb on chr6A with LOD score (3.47~7.59) and R2 value (5.03~18.87). Since the two QTNs on chr6A were also identified in the previous studies, these appear to represent major genomic regions for the grain shape traits.
A few other underlying genes influenced grain size and weight have been reported by Cabral et al., 2018. TaGS-D1, controlling GL and grain weight, is an ortholog of OsGs3 and located at 106.73 Mb on chr3D. Expression pattern of this TaGS-D1 (TraesCS7A03G0037700) gene in our data showed relatively higher expression in large seeded genotypes as compare to small seeded genotypes. So, we examined nearby QTNs around the gene and we found two QTNs, Q.GLWR-7D and Q.TGW-7D, located in the vicinity of TaGS-D1 and positioned at 54.9Mb on chr7DS and 100.1 Mb on chr7D respectively. The presence of these two QTNs indirectly suggested a major locus which corresponds to either TaGS-D1 or an additional novel gene for grain shape trait on the short of chr7D. A second grain weight locus cytokinin oxidase/dehydrogenase (TaCKX6-D1) gene is physically located on chromosome 3D and played a key role in controlling cytokinin levels and affects grain weight in wheat (Zhang et al., 2012). TaCKX6-D1 gene is located at 106.73 Mb on chr3D, so its expression could be influenced by nearby SNPs around the gene. On the basis of the physical location of gene, we found two significant genes, Q.GL-GW-3D and Q.TGW-3D (SNP-AX-95008504, and SNP-AX-94406908), in our analysis at 151.4 Mb and 239.3 Mb respectively. Q.GL-GW-3D associated with GL was identified in four environments (E 1 , E 2 , E 4 , and E 5 ) with LOD value from 3.05 to 6.33 and R 2 value from 4.09 to 14.97, which showed the significance of SNP. The second, Q.TGW-3D, demonstrated association with TGW with LOD (3.16~8.2) and R 2 (7.21~11.48) and was identified at E2, E3, and E4 environments. Both the QTNs were annotated as hypothetical proteins and expressed in our transcriptome data. Q.TGW-3D was highly expressed in large seed cultivars while Q.GL-GW-3D also showed expression in both cultivars. In conclusion, in this study we have comprehensively phenotyped wheat mini core germplasm accessions for grain shape traits and identified promising accessions for large grain size and length which can be incorporated in breeding programs. Further, integration of phenotyping and genotyping data has enabled us to identify genomic regions/candidate genes, some of which are novel. Comparative study also showed that many QTNs identified in our study represented novel genomic regions that can be further validated for their role in determining grain size and can be potentially exploited in breeding programs to develop highyielding varieties.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm.nih.gov/, PRJNA906296.

Author contributions
JK planned and handled the field experiments, conducted formal analysis, and assisted in writing the manuscript. DL contributed to the data curation, investigation, formal analysis, and original draft writing. MY and ST planned and executed DNA extraction and SNP data generation. PJ, SS, ST, SM, and HA performed data analysis, writing, reviewing, and editing, NS, KS and KM recorded phenotyping data and reviewed and edited the manuscript, RS, MY, and GS contributed resources and participated in the writing, reviewing, and editing. AS contributed to the conceptualization, supervision, writing, reviewing, and editing of the manuscript. All authors contributed to the article and approved the submitted version.

Funding
Financial support for this study was received from the Science and Engineering Research Board (project: EMR/2017/005133) and ICAR-National Innovations in Climate Resilient Agriculture (NICRA) Project (Project code 1006607)