Edited by: Roberto Papa, Università Politecnica delle Marche, Italy
Reviewed by: Ernesto Igartua, Consejo Superior de Investigaciones Científicas (CSIC), Spain; Pasquale De Vita, Consiglio per la Ricerca in Agricoltura e l'Analisi dell'Economia Agraria (CREA), Italy
*Correspondence: Peter S. Kristensen
This article was submitted to Plant Breeding, a section of the journal Frontiers in Plant Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The aim of the this study was to identify SNP markers associated with five important wheat quality traits (grain protein content, Zeleny sedimentation, test weight, thousand-kernel weight, and falling number), and to investigate the predictive abilities of GBLUP and Bayesian Power Lasso models for genomic prediction of these traits. In total, 635 winter wheat lines from two breeding cycles in the Danish plant breeding company Nordic Seed A/S were phenotyped for the quality traits and genotyped for 10,802 SNPs. GWAS were performed using single marker regression and Bayesian Power Lasso models. SNPs with large effects on Zeleny sedimentation were found on chromosome 1B, 1D, and 5D. However, GWAS failed to identify single SNPs with significant effects on the other traits, indicating that these traits were controlled by many QTL with small effects. The predictive abilities of the models for genomic prediction were studied using different cross-validation strategies. Leave-One-Out cross-validations resulted in correlations between observed phenotypes corrected for fixed effects and genomic estimated breeding values of 0.50 for grain protein content, 0.66 for thousand-kernel weight, 0.70 for falling number, 0.71 for test weight, and 0.79 for Zeleny sedimentation. Alternative cross-validations showed that the genetic relationship between lines in training and validation sets had a bigger impact on predictive abilities than the number of lines included in the training set. Using Bayesian Power Lasso instead of GBLUP models, gave similar or slightly higher predictive abilities. Genomic prediction based on all SNPs was more effective than prediction based on few associated SNPs.
Wheat (
Test weight (the weight of 100 L of grain) is a grain quality trait that is important to many end-users (Bordes et al.,
Phenotyping of quality traits is expensive and time-consuming. Consequently, most wheat breeders prioritize the yield of new lines over improvement of grain quality. Furthermore, it is difficult to phenotype most quality traits in the early stages of a breeding program, because relatively large quantities of grain are needed for the tests. Thus, reliable genetic markers for quality traits are useful as indicators of the grain quality of wheat lines. DNA can be extracted from few grain or leaf samples, and genotyping of genetic markers are in many cases economically advantageous compared to phenotyping (Heffner et al.,
A number of genes/QTL affecting quality traits in wheat have been found across the genome (for reviews see e.g., Liu et al.,
SNP-BLUP (Best Linear Unbiased Prediction) models are commonly used for genomic selection. In these multiple regression models, a high number of SNP effects are fitted simultaneously (Meuwissen et al.,
The objectives of the present study were to identify genetic loci associated with the quality traits grain protein content, Zeleny sedimentation value, test weight, falling number, and TKW, and to develop models for genomic prediction of these quality traits using GBLUP and Bayesian Power Lasso models. Genomic predictions based on all SNPs were compared with predictions based on few SNPs with most significant effects.
A total of 635 F6 winter wheat lines from two breeding cycles (set2014 and set2015) of the Danish plant breeding company Nordic Seed A/S were used for the analyses. The lines of the first breeding cycle, set2014, consisted of 321 lines and were harvested in 2014, and the lines of the second breeding cycle, set2015, consisted of 314 lines that were harvested in 2015. In total, 96 different lines were used as crossing parents in the two sets, and 6 of these lines were used as parents in both sets. The number of different full-sib families from the crosses was 159, and the number of lines in each full-sib family ranged from 1 to 33 with an average of 10 lines (Table
Distribution of lines in full-sib families.
Full-sibs per family | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 15 | 18 | 33 |
Number of families | 39 | 27 | 21 | 29 | 9 | 10 | 5 | 5 | 3 | 2 | 3 | 2 | 1 | 1 | 1 | 1 |
Total number of lines | 39 | 54 | 63 | 116 | 45 | 60 | 35 | 40 | 27 | 20 | 33 | 24 | 13 | 15 | 18 | 33 |
The wheat lines were phenotyped for grain protein content, Zeleny sedimentation value, test weight, falling number, and TKW (
DNA was extracted from leaves of three bulked, 2-week old seedlings for each line using a modified CTAB method (Rogers and Bendich,
Number of SNPs mapped to each of the 21 chromosomes.
Chromosome | 1A | 2A | 3A | 4A | 5A | 6A | 7A | Total, A |
No. of SNPs | 474 | 474 | 489 | 319 | 559 | 557 | 657 | 3529 |
Chromosome | 1B | 2B | 3B | 4B | 5B | 6B | 7B | Total, B |
No. of SNPs | 736 | 713 | 692 | 355 | 797 | 685 | 565 | 4543 |
Chromosome | 1D | 2D | 3D | 4D | 5D | 6D | 7D | Total, D |
No. of SNPs | 269 | 217 | 123 | 49 | 133 | 128 | 114 | 1033 |
Genome-wide association analyses were conducted by single marker regression using the following model that was run for each of the 10,802 SNPs:
where
G-matrices with genomic relationship between the lines were used to correct for family structure in order to avoid spurious associations (Price et al.,
where
Population structure was studied by performing a principal component analysis on the G-matrix computed from all 10,802 SNPs using the built-in R function “prcomp” (R Development Core Team,
Average genetic distances were calculated from the 10,802 SNPs using the R function “dist” (R Development Core Team,
Genomic inflation factors, λIF, were calculated for each trait and used to correct the
GWAS were also performed by fitting all SNPs at the same time with a Bayesian Power Lasso model using the Bayz software (Janss,
where
where
Genomic predictions using all SNPs were performed using the Bayesian Power Lasso model (3) and using a GBLUP model:
where
Estimation of model effects and variance components for the GBLUP and Bayesian Power Lasso models were done using DMU and Bayz, respectively. For the GBLUP models, the narrow sense genomic heritability (de los Campos et al.,
where
For the Bayesian Power Lasso models,
Predictive abilities of the models were determined as the correlations between observed phenotypes corrected for fixed effects and GEBVs. These correlations were compared with the square root of the narrow sense genomic heritability, which is the maximum correlation that can be achieved. Bias in the genomic predictions was calculated as the slope of the regression line of the corrected phenotypes on the GEBVs. The expectation of this slope is 1.0, and the bias is the deviation from this expectation.
Several types of cross-validation strategies were tested to study the effectiveness of different approaches for implementation of genomic selection in breeding programs:
- LOO (Leave-One-Out), where the GEBV of each line was predicted based on the rest of the lines. The LOO strategy was used to study the predictive ability when using the largest training set available and the highest possible genetic relationship between lines in the training and validation set.
- LFO (Leave-Family-Out), where the GEBVs of lines in each half-sib family were predicted based on lines from other half-sib families, so that they had no parents in common. The LFO cross-validation strategy was used to study the effect of the genetic relationship between the lines in training and validation sets.
- LSO (Leave-Set-Out), where the GEBVs of lines in each set were predicted based on lines only from the other set. The LSO cross-validation strategy was used to study the predictive ability when predicting GEBVs of lines from one breeding cycle based on lines from another breeding cycle.
- k-fold, where the lines were randomly divided into k folds (2, 5, or 10) of equal size and the GEBVs of lines in each fold were predicted based on lines only in the other folds. The size of the training sets in the 2-, 5-, and 10-fold cross-validations were approximately 318 lines, 508 lines, and 572 lines, respectively. The k-fold cross-validation strategy was used to study the effect of the size of the training set.
Predictions of breeding values were also performed using the three and the ten best SNPs for each trait according to the single marker GWAS or the Bayesian Power Lasso (model 1 or 3). Here, the 635 lines were randomly divided into 5-folds of equal size. The best SNPs and their effects were estimated based on the lines in four of the folds and used for predictions of breeding values in the fifth fold. This was done to avoid unrealistically high predictive abilities that would be the result of using the same lines for estimation and prediction. The best SNPs were defined as the SNPs most significantly associated with the studied trait for the single marker regressions, and as the SNPs with the largest additive genetic effects for the Bayesian Power Lasso. When selecting SNPs using model 1, maximum one SNP was selected from each chromosome to avoid selecting several SNPs linked to the same QTL. Predictions were made using the SNP effects estimated in model 1. SNP effects were also re-estimated as random effects by fitting the selected SNPs simultaneously in either model 3 or in a SNP-BLUP model (equivalent to GBLUP model 5). Predictions were evaluated by calculating the correlation between observed phenotypes corrected for fixed effects and estimated breeding values.
Grain from 635 winter wheat lines were phenotyped for the quality traits grain protein content, Zeleny sedimentation value, test weight, falling number, and TKW. The lines were from two different breeding cycles (set2014 and set2015). Phenotypic variation was observed for all traits, especially for Zeleny sedimentation and falling number, which had a coefficient of variation of 26.9 and 23.4%, respectively (Table
Mean, range, and coefficient of variation (CV) for phenotypic data.
Grain protein content (%) | 8.7 | 7.5–10.9 | 6.4 |
Zeleny sedimentation (mL) | 18.3 | 8.0–36.0 | 26.9 |
Test weight (kg/hL) | 78.7 | 73.4–83.7 | 2.2 |
Falling number (s) | 254.7 | 79.0–391.0 | 23.4 |
Thousand-kernel weight (g) | 53.7 | 40.8–63.0 | 6.0 |
Distribution of the phenotypes.
The 635 lines were genotyped for 13,006 SNP markers, and 10,802 of these were selected for the analyses (Table
Minor allele frequency (MAF) of SNPs and heterozygosity of lines.
The genomic relationship between the lines was determined by computing a G-matrix based on the 10,802 SNPs. The lines were genetically related both within and between the two sets. The lines of set2014 were not clearly separated from the lines of set2015 in the principal component analysis of the G-matrix (Figure
Genomic relationship between lines.
Narrow sense genomic heritabilities (based on single plots) and variance components based on the GBLUP and Bayesian Power Lasso models are shown in Table
Variance components, their standard errors, and narrow sense genomic heritabilities estimated from the GBLUP and from the Bayesian Power Lasso models.
Grain protein content | 0.16 ± 0.015 | 0.12 ± 0.011 | 0.56 | 0.13 ± 0.015 | 0.13 ± 0.013 | 0.51 |
Zeleny sedimentation | 15.4 ± 1.01 | 4.2 ± 0.49 | 0.78 | 20.3 ± 0.85 | 4.1 ± 0.46 | 0.83 |
Test weight | 2.2 ± 0.14 | 0.51 ± 0.064 | 0.81 | 2.0 ± 0.11 | 0.51 ± 0.070 | 0.79 |
Falling number | 2, 555 ± 184 | 825 ± 95 | 0.75 | 2, 575 ± 151 | 815 ± 107 | 0.76 |
Thousand-kernel weight | 10.3 ± 0.69 | 2.4 ± 0.31 | 0.81 | 7.4 ± 0.49 | 2.3 ± 0.36 | 0.76 |
GWAS were carried out using single marker regression of each marker on the phenotypes. Family structure was taken into account using G-matrices based on all chromosomes, except the one for the selected marker. Q-Q plots of observed against expected –log10(
Q-Q plots. Plots of observed –log10(
Manhattan plots of the corrected –log10(
Manhattanplots of –log10(
For falling number, a nearly significant region was found on chromosome 7B (
For TKW, a region on chromosome 1B was nearly significant (
Another approach for GWAS was also used, where all SNPs were fitted at the same time using a Bayesian Power Lasso model, where SNP effects were assumed to be from an exponential power distribution (Gao et al.,
Percentage of additive genetic variance explained by the SNPs according to the Bayesian Power Lasso analyses.
The SNPs that explain most variance according to the Bayesian analyses were located in the same genetic regions as the most significant SNPs identified by single marker regression for Zeleny sedimentation, TKW, and falling number. However, the explained variance was lower compared with the single marker regressions. For the Bayesian analyses, only one SNP explaining more than 1.5% of the additive genetic variance was found (SNP on chromosome 5D associated with Zeleny sedimentation). Sorting the lines based on their genotypes for the associated SNPs for Zeleny sedimentation on chromosome 5D, 1D, and 1B, and looking at the mean of the phenotypic data, showed that these SNPs had a clear effect on Zeleny sedimentation (Table
Mean values for Zeleny sedimentation and number of lines with the different alleles of the top SNPs on chromosome 5D, 1D, and 1B.
SNP 5D | T | 22.1 | 166 | T | G | A | 28.4 | 28 | |
C | 16.8 | 442 | G | 21.7 | 7 | ||||
SNP 1D | G | 23.7 | 82 | A | A | 21.5 | 101 | ||
A | 17.4 | 538 | G | 16.9 | 21 | ||||
SNP 1B | A | 19.3 | 389 | C | G | A | 21.7 | 11 | |
G | 16.5 | 215 | G | 19.7 | 11 | ||||
A | A | 16.8 | 214 | ||||||
G | 15.9 | 164 |
For test weight, the SNP with the largest effect were located on chromosome 7A, but the SNP only explained 0.13% of the genetic variance. For grain protein content, one SNP on chromosome 2A and five unmapped SNPs were identified to have the largest effects. However, together these SNPs explained less than 5 % of the total genetic variance.
Genomic predictions were conducted based on all 10,802 SNPs, and the predictive ability was evaluated using different kinds of cross-validations. The correlations between observed phenotypes corrected for fixed effects and GEBVs using GBLUP models are shown in Figure
Correlations between observed phenotypes corrected for fixed effects and GEBVs based on the GBLUP models. Maximum correlation, h, is the square root of the narrow sense genomic heritability and is shown as blue bars over the correlations. Correlations are based on different kinds of cross-validations.
Genomic predictions were also performed using the Bayesian Power Lasso model. The predictive abilities were very close to or slightly better than the predictions based on the GBLUP model (Table
Comparison of predictive abilities based on GBLUP and Bayesian Power Lasso models.
Grain protein content | GBLUP | 0.75 | −0.01 | 0.43 | 0.49 | 0.50 |
Bayesian | 0.71 | −0.02 | 0.44 | 0.48 | 0.50 | |
Zeleny sedimentation | GBLUP | 0.88 | 0.64 | 0.76 | 0.77 | 0.78 |
Bayesian | 0.92 | 0.70 | 0.80 | 0.81 | 0.82 | |
Test weight | GBLUP | 0.90 | 0.40 | 0.66 | 0.68 | 0.71 |
Bayesian | 0.89 | 0.40 | 0.66 | 0.69 | 0.71 | |
Falling number | GBLUP | 0.87 | 0.51 | 0.67 | 0.67 | 0.69 |
Bayesian | 0.87 | 0.51 | 0.68 | 0.68 | 0.69 | |
Thousand-kernel weight | GBLUP | 0.90 | 0.37 | 0.59 | 0.63 | 0.65 |
Bayesian | 0.87 | 0.38 | 0.59 | 0.63 | 0.66 |
The GEBVs predicted based on the LOO cross-validations were either not or only slightly biased (regressions from 0.93 for falling number to 0.98 for Zeleny sedimentation, Table
Regressions of corrected phenotypes on GEBVs and their standard errors based on different cross-validations strategies using the GBLUP models.
Grain protein content | 0.96 ± 0.06 | 0.53 ± 0.10 | −0.02 ± 0.11 | 0.89 ± 0.07 | 0.94 ± 0.06 | 0.97 ± 0.06 |
Zeleny sedimentation | 0.98 ± 0.03 | 1.02 ± 0.04 | 1.05 ± 0.05 | 0.99 ± 0.03 | 0.98 ± 0.03 | 0.97 ± 0.03 |
Test weight | 0.95 ± 0.04 | 0.71 ± 0.05 | 0.70 ± 0.06 | 0.91 ± 0.04 | 0.93 ± 0.04 | 0.95 ± 0.04 |
Falling number | 0.93 ± 0.04 | 0.88 ± 0.05 | 0.83 ± 0.05 | 0.92 ± 0.04 | 0.91 ± 0.04 | 0.92 ± 0.04 |
Thousand-kernel weight | 0.96 ± 0.04 | 0.73 ± 0.06 | 0.66 ± 0.06 | 1.01 ± 0.05 | 0.95 ± 0.05 | 0.97 ± 0.04 |
In total, 128 of the 635 lines were selected (mainly based on yield) to continue to F7 in the breeding program. The selected lines were phenotyped for Zeleny sedimentation, TKW, and falling number again in the F7 generation. The correlation between the phenotypic F7 data and GEBVs estimated from the F6 data using LSO cross-validations was 0.68 for Zeleny sedimentation, 0.35 for TKW, and 0.41 for falling number.
The three and ten best SNPs were identified for each trait using the single marker regression models and the Bayesian Power Lasso models. The correlation between observed phenotypes corrected for fixed effects and GEBVs based on these SNPs were calculated to compare the models.
The correlations based on the best SNPs according to the single marker regression were higher than the correlations based on the Bayesian Power Lasso for grain protein content, test weight, falling number, and TKW (Table
Correlations between observed phenotypes corrected for fixed effects and breeding values predicted based on the best three or ten SNPs according to the single marker regression and to the Bayesian Power Lasso.
Grain protein content | 0.21 | 0.27 | 0.11 | 0.14 | 0.34 | 0.38 | 0.25 | 0.28 |
Zeleny sedimentation | 0.61 | 0.61 | 0.59 | 0.59 | 0.62 | 0.66 | 0.68 | 0.69 |
Test weight | 0.26 | 0.22 | 0.16 | 0.18 | 0.42 | 0.46 | 0.19 | 0.29 |
Falling number | 0.17 | 0.28 | 0.17 | 0.14 | 0.41 | 0.36 | 0.34 | 0.17 |
Thousand-kernel weight | 0.24 | 0.21 | 0.09 | 0.24 | 0.31 | 0.43 | 0.18 | 0.43 |
Correlations between observed phenotypes corrected for fixed effects and GEBVs based on 5-fold cross-validations. GEBVS were predicted from the three or ten best SNPs based on single marker regression (effects re-estimated with SNP-BLUP) and from all 10,802 SNPs based on the GBLUP model.
Elite breeding material was used in the present study, so the number of significant SNPs (and their effects) were low compared with other studies, where the plant material had high diversity and larger phenotypic differences between lines (Sun et al.,
The wheat lines have been grown with low nitrogen fertilization (to follow the Danish agricultural practice), so the grain protein content was quite low. No SNPs were significantly associated with grain protein content, suggesting that this trait is controlled by many QTL with small effects. Furthermore, the relatively low heritability indicate that protein content is strongly affected by environmental effects and GxE interactions (Shewry,
The SNPs most significantly associated with grain protein content were located on chromosome 2A, 4D, and 7A. In agreement with the present study, Groos et al. (
Few SNPs were mapped to the D-genome chromosomes compared with the chromosomes of the A- and B-genome both in Bordes et al. (
Regions associated with Zeleny sedimentation were also found on chromosomes 1B and 1D. Known loci that can affect gluten quality on these two chromosomes are the glutenin loci
In accordance with the present study, QTL associated with falling number have previously been found on chromosomes 4D and 7B. Mohler et al. (
The allele frequencies of the SNPs positively associated with Zeleny sedimentation on chromosome 1B, 1D, and 5D were 64 %, 14 %, and 28 %, respectively. Hence, selection of lines based on these three SNPs can significantly improve the Zeleny sedimentation in the breeding material. For TKW, the frequency of the positive allele was 97% for the most significant SNP on 1B, so the allele is almost fixed in the material. Most of the wheat lines had the positive alleles of the SNPs on chromosome 4D and 7B associated with falling number, as the allele frequencies of these two SNPs were 77 and 89%, respectively. Thus, if only marker-assisted selection is used, there is mainly potential for improving Zeleny sedimentation in the studied population. This was also confirmed by the predictions using few associated SNPs, where the difference between using few or all SNPs were smaller for Zeleny sedimentation than for the other traits.
The additive genetic variance explained by the best SNPs were considerably lower when estimated using the Bayesian Power Lasso than when using single marker regression. SNP effects are shrunken to fit with the overall genetic variance, when fitting all SNPs simultaneously in the Bayesian Power Lasso. Conversely, the SNP effects might be overestimated when performing single marker regression because of the Beavis effect (Beavis,
Since the variance explained by single SNPs was quite low for most of the studied traits, genomic selection seems to be a more promising strategy than selection based on few markers. For all traits, the predictive ability increased when using all markers compared to using the three or ten best markers. Other recent studies have reached the same conclusion. Würschum et al. (
One way to implement genomic selection in breeding programs is to select the best lines from a new set (breeding cycle) based on phenotyped lines of previous sets. Only six lines (of a total of 96 different parents) were used as crossing parents for both sets used in the present study. Nevertheless, the principal component analysis and G-matrix showed that the lines of the two sets were genetically related. The average genetic distance between the sets (0.76) was only slightly higher than within sets (0.74). This indicates that the crossing parents used for each set were genetically related. The genetic distance calculated as modified Rogers' distance can range from 0 to 1 (Reif et al.,
The LSO cross-validations resemble how the predictions would perform most realistically, when predicting lines of a new set before phenotypic information is available for those lines. However, the correlations based on the LSO were not as high as for the other types of cross-validations, because of a combination of a smaller training set size, lower genetic relationship between lines of the training and validation sets, and GxE interactions. The k-fold cross-validations showed that lowering the number of lines in the training set had a small negative impact on the correlations. Thus, the main reasons for the lower correlations of the LSO compared with the LOO and LFO were the decrease in genetic relationship between lines and GxE interactions. However, the effect of the genetic relationship may partly be confounded with the GxE interactions. Including data from more years, would most probably improve the results, which is the case in a recent study with spring wheat, where the predictive abilities for several quality traits increased as lines from more years were included in the training set (Battenfield et al.,
Bayesian models often perform better than GBLUP models when the genetic relationship between training and validation sets are low, because they are better at capturing LD between markers and QTL (Gao et al.,
Here, the largest improvement was for the LSO cross-validations for Zeleny sedimentation, when using the Bayesian Power Lasso model instead of the GBLUP model. Thus, the Bayesian Power Lasso model is mainly advantageous to use, if the genetic relationships between lines are low and for traits, where some QTL have large effects.
SNPs significantly associated with Zeleny sedimentation were identified on chromosome 1B, 1D, and 5D, where genes that are known to affect baking quality of wheat are located. These three SNPs together explain 18.9% of the additive genetic variance according to the single marker regression analysis. The same genetic regions were identified using the Bayesian Power Lasso. However, the explained variance estimated in this model was considerably lower. The prediction of phenotypes based on the best three or best ten SNPs found using the two different methods revealed that the SNPs found using the single marker regression could more accurately predict the phenotypes in most cases. Predictions improved for all traits when using ten SNPs instead of three. The GWAS and predictions using few SNPs indicated that there was a couple of QTL with large effect on Zeleny sedimentation, while the other quality traits seemed to be controlled by many QTL with small effects. Genomic predictions based on all SNPs further improved the predictions and worked quite well for most traits. The predictive abilities were highest for Zeleny sedimentation, while grain protein content was the most difficult trait to predict, especially across breeding cycles. The different types of cross-validations indicated that the genetic relationship between lines and GxE interactions were more important for the predictive abilities and for the bias than the size of the training set. However, the predictions are based on single replication of lines, so additional replications across years and locations would be useful to study the effects of the GxE interactions.Using Bayesian Power Lasso models resulted in similar or slightly higher predictive abilities than when using GBLUP models.
PK: performed most of the phenotyping, data analysis and wrote the draft for the manuscript; All authors helped designing the study, interpreting results, and reviewing the manuscript. Data was acquired by JA, JO, and PK. Funding was acquired by AJ, JA, JJ, and PK.
The study was performed in a collaboration between Aarhus University and the plant breeding company Nordic Seed A/S. Authors PK, AJ, JA, and JO were employed by company Nordic Seed A/S. The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Hanne Svenstrup, Janni H. Jørgensen, Eva M. Szendrei, and Jette Andersen from Nordic Seed A/S, Denmark and the Wheat Chemistry and Quality Laboratory at CIMMYT (The International Maize and Wheat Improvement Center), Mexico for supporting laboratory work and phenotyping.
The Supplementary Material for this article can be found online at: