Harnessing Genetic Diversity in the USDA Pea Germplasm Collection Through Genomic Prediction

Bari, Md. Abdullah Al; Zheng, Ping; Viera, Indalecio; Worral, Hannah; Szwiec, Stephen; Ma, Yu; Main, Dorrie; Coyne, Clarice J.; McGee, Rebecca J.; Bandillo, Nonoy

doi:10.3389/fgene.2021.707754

ORIGINAL RESEARCH article

Front. Genet., 24 December 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.707754

This article is part of the Research TopicNew Developments for Embracing Genomic Selection in Breeding ApplicationsView all 14 articles

Harnessing Genetic Diversity in the USDA Pea Germplasm Collection Through Genomic Prediction

Md. Abdullah Al Bari¹

Ping Zheng²

Indalecio Viera¹

Hannah Worral³

Stephen Szwiec³

Yu Ma²

Dorrie Main²

Clarice J. Coyne⁴

Rebecca J. McGee⁵

Nonoy Bandillo¹*

¹Department of Plant Sciences, North Dakota State University, Fargo, ND, United States
²Department of Horticulture, Washington State University, Pullman, WA, United States
³NDSU North Central Research Extension Center, Minot, ND, United States
⁴USDA-ARS Plant Germplasm Introduction and Testing, Washington State University, Pullman, WA, United States
⁵USDA-ARS Grain Legume Genetics and Physiology Research, Pullman, WA, United States

Phenotypic evaluation and efficient utilization of germplasm collections can be time-intensive, laborious, and expensive. However, with the plummeting costs of next-generation sequencing and the addition of genomic selection to the plant breeder’s toolbox, we now can more efficiently tap the genetic diversity within large germplasm collections. In this study, we applied and evaluated genomic prediction’s potential to a set of 482 pea (Pisum sativum L.) accessions—genotyped with 30,600 single nucleotide polymorphic (SNP) markers and phenotyped for seed yield and yield-related components—for enhancing selection of accessions from the USDA Pea Germplasm Collection. Genomic prediction models and several factors affecting predictive ability were evaluated in a series of cross-validation schemes across complex traits. Different genomic prediction models gave similar results, with predictive ability across traits ranging from 0.23 to 0.60, with no model working best across all traits. Increasing the training population size improved the predictive ability of most traits, including seed yield. Predictive abilities increased and reached a plateau with increasing number of markers presumably due to extensive linkage disequilibrium in the pea genome. Accounting for population structure effects did not significantly boost predictive ability, but we observed a slight improvement in seed yield. By applying the best genomic prediction model (e.g., RR-BLUP), we then examined the distribution of genotyped but nonphenotyped accessions and the reliability of genomic estimated breeding values (GEBV). The distribution of GEBV suggested that none of the nonphenotyped accessions were expected to perform outside the range of the phenotyped accessions. Desirable breeding values with higher reliability can be used to identify and screen favorable germplasm accessions. Expanding the training set and incorporating additional orthogonal information (e.g., transcriptomics, metabolomics, physiological traits, etc.) into the genomic prediction framework can enhance prediction accuracy.

Introduction

Pea (Pisum sativum L.) is a vitally important pulse crop that provides protein (15.8–32.1%), vitamins, minerals, and fibers. Pea consumption has cardiovascular benefits as it is rich in potassium, folate, and digestible fibers, which promote gut health and prevent certain cancers (Mudryj et al., 2014; Tayeh et al., 2015). Considering the health benefits of pulse crop, the US Department of Agriculture recommends regular pulses consumption, including peas, to promote human health and wellbeing (http://www.choosemyplate.gov/). In 2019, more than 446,000 hectares of edible dry pea were planted with production totaling 1,013,600 tonnes in the USA, making it the fourth largest legume crop (http://www.fao.org) (USDA, 2020). Growing peas also help maintain soil health and productivity by fixing atmospheric nitrogen (Burstin et al., 2015). Recently, the pea protein has emerged as a frontrunner and showed the most promise in the growing alternative protein market. The Beyond Meat burger is a perfect example of a pea protein product gaining traction in the growing market. About 20-g protein (17.5%) in each burger comes from pea (https://www.nasdaq.com/articles/heres-why-nows-thetime-to-buy-beyond-meat-stock-2019-12-05). Another product made from pea, Ripptein, is a non-dairy milk product of pea protein that is gaining tremendous interest as an alternative dairy product (https://www.ripplefoods.com/ripptein/). Additionally, peas are gaining attention in the pet food market as it is grain-free and an excellent source of essential amino acids required by cats and dogs (PetfoodIndustry.com; Facciolongo et al., 2014). As the demand for pea increases, particularly in the growing alternative protein market, genetic diversity expansion is needed to hasten the current rate of genetic gain in pea (Vandemark et al., 2014).

Germplasm collections serve as an essential source of variation for germplasm enhancement that can sustain long-term genetic gain in breeding programs. The USDA Pisum collection, held at the Western Regional Plant Introduction Station at Washington State University, is a good starting point to investigate functional genetic variation useful for applied breeding efforts. To date, this collection consists of 6,192 accessions plus a Pea Genetic Stocks collection of 712 accessions. From this collection, the USDA core collection, comprised of 504 accessions, was assembled to represent ∼18% of all USDA pea accessions at the time of construction (Simon and Hannan 1995; Coyne et al., 2005). Subsequently, single-seed descent derived homozygous accessions were developed from a subset of the core to form the ‘Pea Single Plant'-derived (PSP) collection. The PSP was used to facilitate the collection’s genetic analysis (Cheng et al., 2015). The USDA Pea Single Plant Plus Collection (Pea PSP) was assembled as well as included the PSP and Chinese accessions and field, snap and snow peas from US public pea breeding programs (Holdsworth et al., 2017).

Genomic selection (GS) takes advantage of high-density genomic data that holds a promise to increase the rate of genetic gain (Meuwissen et al., 2001). As genotyping costs have significantly declined relative to current phenotyping costs, GS has become an attractive option as a selection decision tool to evaluate accessions in extensive germplasm collections. A genomic prediction approach could use only genomic data to predict each accession’s breeding value in the collection (Meuwissen et al., 2001; Habier et al., 2007; VanRaden, 2008). The predicted values would significantly increase the value of accessions in germplasm collections by giving breeders a means to identify those favorable accessions meriting their attention from the thousand available accessions in germplasm collections (Longin et al., 2014; Crossa et al., 2016; Jarquin et al., 2016). Several studies used the genomic prediction approach to harness diversity in germplasm collections, including lentil (Haile et al., 2020), soybean (Jarquin et al., 2016), wheat (Crossa et al., 2016), rice (Spindel et al., 2015), sorghum (Yu et al., 2016), maize (Gorjanc et al., 2016), and potato (Bethke et al., 2019). A pea genomic selection study for drought-prone Italian environment revealed increased selection accuracy of pea lines (Annicchiarico et al., 2019; Annicchiarico et al., 2020). To the best of our knowledge, no such studies have been performed using the USDA Pea Germplasm Collection, but a relevant study has been conducted using a diverse pea germplasm set comprised of more than 370 accessions genotyped with a limited number of markers (Burstin et al., 2015; Tayeh et al., 2015).

To date, methods to sample and utilize an extensive genetic resource like germplasm collections remain a challenge. In this study, a genomic prediction approach targeting complex traits, including seed yield and phenology, was evaluated to exploit diversity contained in the USDA Pea Germplasm Collection. No research has been conducted before on genomic prediction for the genetic exploration of the USDA Pea Germplasm Collection. Different cross-validation schemes were used to answer essential questions surrounding the efficient implementation of genomic prediction and selection, including determining best prediction models, optimum population size and number of markers, and impact of accounting population structure into genomic prediction framework. We then examined the distribution of all nonphenotyped accessions using SNP information in the collection by applying genomic prediction models and estimated reliability criteria of genomic estimated breeding values for the assessed traits.

Materials and Methods

Plant Materials

A total of 482 USDA germplasm accessions were used in this study, including the Pea Single Plant Plus Collection (Pea PSP) comprised of 292 accessions (Cheng et al., 2015). The USDA Pea Core Collection contains accessions from different parts of the world and represents the entire collection’s morphological, geographic, and taxonomic diversity. These accessions were initially acquired from 64 different countries and are conserved at the Western Regional Plant Introduction Station, USDA, Agricultural Research Service (ARS), Pullman, WA (Cheng et al., 2015).

DNA Extraction, Sequencing, SNP Calling

Green leaves were collected from seedlings of each accession grown in the greenhouse with the DNeasy 96 Plant Kit (Qiagen, Valencia, CA, USA). Genomic libraries for the Single Plant Plus Collection were prepped at the University of Minnesota Genomics Center (UMGC) using genotyping-by-sequencing (GBS). Four hundred eighty-two (482) dual-indexed GBS libraries were created using restriction enzyme ApeKI (Elshire et al., 2011). A NovaSeq S1 1 × 100 Illumina Sequencing System (Illumina Inc., San Diego, CA, USA) was then used to sequence the GBS libraries. Preprocessing was performed by the UMGC that generated the GBS sequence reads. An initial quality check was performed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Sequencing adapter remnants were clipped from all raw reads. Reads with final length <50 bases were discarded. The high-quality reads were aligned to the reference genome of Pisum sativum (Pulse Crop Database https://www.pulsedb.org/, Kreplak et al., 2019) using the Burrow Wheelers Alignment tool (Version 0.7.17) (Li and Durbin, 2009) with default alignment parameters, and the alignment data was processed with SAMtools (version 1.10) (Li et al., 2009). Sequence variants, including single and multiple nucleotide polymorphisms (SNPs and MNPs, respectively), were identified using FreeBayes (Version 1.3.2) (Garrison and Marth, 2012). The combined read depth of 10 was used across samples for identifying an alternative allele as a variant, with the minimum base quality filters of 20. The putative SNPs from freeBayes were filtered across the entire population to maintain the SNPs for biallelic with minor allele frequency (MAF) < 5%. The putative SNP discovery resulted in biallelic sites of 380,527 SNP markers. The QUAL estimate was used for estimating the Phred-scaled probability. Sites with a QUAL value less than 20 and more than 80% missing values were removed from the marker matrix. The rest of the markers were further filtered out so that heterozygosity was less than 20%. The filters were applied using VCFtools (version 0.1.16) (Danecek et al., 2011) and in-house Perl scripts. The SNP data were uploaded in a public repository and is available at this link: https://www.ncbi.nlm.nih.gov/sra/PRJNA730349 (Submission ID: SUB9608236). Missing data were imputed using a k-nearest neighbor genotype imputation method (Money et al., 2015) implemented in TASSEL (Bradbury et al., 2007). SNP data were converted to a numeric format where 1 denotes homozygous for a major allele, -1 denotes homozygous for an alternate allele, and 0 refers to heterozygous loci. Finally, 30,646 clean, curated SNP markers were identified and used for downstream analyses.

Phenotyping

Pea germplasm collections (Pea PSP) were planted following augmented design with standard checks (“Hampton,” “Arargorn,” “Columbian,” and “1,022”) at the USDA Central Ferry Farm in 2016, 2017, and 2018 (planting dates were March 14, March 28, and April 03, respectively). The central Ferry farm is located at Central Ferry, WA at 46°39′5.1″N; 117°45′45.4″ W, and elevation of 198 m. The Central Ferry farm has a Chard silt loam soil (coarse-loamy, mixed, superactive, mesic Calcic Haploxerolls) and was irrigated with subsurface drip irrigation at 10 min d⁻¹. All seeds were treated with fungicides; mefenoxam (13.3 ml a.i. 45 kg-1), fludioxonil (2.4 ml a.i. 45 kg -1), and thiabendazole (82.9 ml a.i.45 kg -1), insecticide; thiamethoxam (14.3 ml a.i. 45 kg -1), and sodium molybdate (16 g 45 kg -1) prior to planting. Thirty seeds were planted per plot; each plot was 152 cm long, having double rows with 30 cm center spacing. The dimensions of each plot were 152 × 60 cm. Standard fertilization and cultural practices were used.

The following traits were recorded and are presented in this manuscript. Days to first flowering are the number of days from planting to when 10% of the plot’s plants start flowering. The number of seeds per pod is the number of seeds in each pod. Plant height (cm) is defined as when all plants in a plot obtained full maturity and were measured in centimeters from the collar region at soil level to the plants’ top. Pods per plant is the number of recorded pods per plant. Days to maturity referred to physiological maturity when plots were hand-harvested, mechanically threshed, cleaned with a blower, and weighed. Plot weight (gm) is the weight of each plot in grams after each harvest. Seed yield (kg ha⁻¹) is the plot weight converted to seed yield in kg per hectare.

Phenotypic Data Analysis

A mixed linear model was used to extract best linear unbiased predictors (BLUPs) for all traits evaluated using the following model:

y_{i j} = μ + G_{i} + E_{j} + {(G * E)}_{i j} + e_{i j} (1)

where $y_{i j}$ is the observed phenotype of ith genotypes and jth environment which is the number of years, $μ$ is the overall mean, $G_{i}$ is the random genetic effect (i is number of genotypes), $E_{j}$ is the random environments (j is number of years), ${(G * E)}_{i j}$ is the genotype by environment interaction, and $e_{i j}$ is the residual error.

For the purpose of estimating heritability, we fit the same model above. The heritability in broad sense (H²) on an entry-mean basis for each assessed trait was calculated to evaluate the quality of trait measurements following the equation (Hallauer et al., 2010):

H^{2} = \frac{σ_{G}^{2}}{σ_{G}^{2} + σ_{G E}^{2} / j + σ_{e}^{2} / j r} (2)

where $σ_{G}^{2}$ is the genetic variance, $σ_{G E}^{2}$ is variance due to the genotype by year interaction, $σ_{e}^{2}$ is the error variance, j is number of years considered as environments, and r is the relative number of occurrences of each genotype in a trial (harmonic mean of the entries). We also calculated heritability proposed by Cullis et al. (2006) implemented in Sommer package in R (Covarrubias-Pazaran, 2016).

H_{Cullis}^{2} = 1 - (\frac{PEV}{md * Vg}) (3)

where PEV is the predicted error variance for the genotype, V_g refers to the genotypic variance, md is the mean values from the diagonal of the relationship matrix, which is an identity matrix.

The R package, lme4 (Bates et al., 2015), was used to analyze the data. The trait values derived from the BLUPs were used to measure correlation with the ggcorrplot using ggplot2 package (Wickham 2016). All phenotypic and genomic prediction models were analyzed in the R environment (R Core Team, 2020).

Genomic Prediction Models

The genomic prediction models were fitted as follows:

y = μ + Z u + ε (4)

where y is a vector of the genotype BLUPs obtained from Eq. 1, $μ$ is the intercept of the model used for the study, Z is the SNP marker matrix, u is the vector of marker effects, and ɛ is a residual vector.

Five genomic prediction models were evaluated including ridge regression ridge regression best linear unbiased prediction approach (RR-BLUP), partial least squares regression model (PLSR), random forest (RF), BayesCpi, and Reproducing Kernel Hilbert Space (RKHS).

The RR-BLUP model assumes all markers have an equal contribution to the genetic variance. One of the most widely used methods for predicting breeding values is RR-BLUP, comparable to the best linear unbiased predictor (BLUP) used to predict the worth of entries in the context of mixed models (Meuwissen et al., 2001). The RR-BLUP basic frame model is:

y = Z u + ε (5)

where u ∼ $N (0, I σ_{u}^{2})$ is a vector of marker effects and Z is the genotype matrix e.g., (aa, Aa, AA) = (0, 1, 2) for biallelic single nucleotide polymorphisms (SNPs) that relates to phenotype y (Endelman, 2011). The RR-BLUP genomic prediction was implemented using the “RR-BLUP” package (Endelman, 2011).

Partial least square regression is a reduction dimension technique that aims to find independent latent components that maximize the covariance between the observed phenotypes and the markers (predictor variables) (Colombani et al., 2012). The number of components (also known as latent variables) should be less than the number of observations to avoid multicollinearity issues and commonly the number of components are chosen by cross validation. PLSR was executed using the “pls” package (Mevik and Wehrens, 2007).

Random forest is a machine learning model for genomic prediction that uses an average of multiple decision trees to determine the predicted values. This regression model was implemented using the “randomForest” package (Breiman, 2001). The number of latent components for PLSR and decision trees for random forest was determined by a five-fold cross-validation to have a minimum prediction error.

BayesCpi was used to verify the influence of distinct genetic architectures of different traits on prediction accuracy. The BayesCpi assumes that each marker has a probability $π$ of being included in the model, and this parameter is estimated at each Markov Chain Monte Carlo (MCMC) iteration. The vector of marker effects u is assumed to be a mixture of distributions having the probability π of being null effect and (1- π) of being a realization of a normal distribution, so that, $u_{j} | π, σ_{g}^{2} \sim N (0, σ_{g}^{2})$ . The vector of residual effects was considered as $e \sim N (0, σ_{e}^{2})$ . The marker and residual variances were assumed to follow a chi-square distribution $σ_{g}^{2} \sim χ^{2} (S_{b}, ν_{0})$ and $σ_{e}^{2} \sim χ^{2} (S_{b}, ν_{0})$ , respectively, with $ν_{0} = 5$ degrees of freedom as prior and $S_{b}$ shape parameters assuming a heritability of 0.5 (Pérez and de los Campos, 2014).

The last model used was the RKHS. The method is a regression where the estimated parameters are a linear function of the basis provided by the reproducing kernel (RK). RKHS considers both additive and non-additive genetic effects (de los Campos et al., 2013). In this work, the multi-kernel approach was used by averaging three kernels with distinct bandwidth values. In this implementation the averaged kernel, $\bar{K}$ was given by:

$\bar{K} = \sum_{r} K_{r} σ_{β_{r}}^{2} {\tilde{σ}}_{β}^{- 2}$ , where ${\tilde{σ}}_{β}^{2} = \sum_{r} σ_{β_{r}}^{2}$ . Here r = 3 and $σ_{β_{r}}^{2}$ are interpretable as variance parameters associated with each kernel. Therefore, for each r^th kernel the proportion of sharing alleles between pairs of individuals (ii′) was given by $K_{r} = \exp {- h_{k} d_{i i^{'}}^{2}}$ , where h_k is a bandwidth parameter associated with r^th reproducing kernel and $d_{i i ´}^{2}$ is the genetic distance between individuals i and i′ computed as follows: $d_{i i `}^{2} = {\sum_{j = 1}^{p} (x_{i j -} x_{i ´ j})}^{2}$ , where j = 1, … , p markers stated as above. The bandwidth parameter values for the three kernels were h = 0.5{1/5,1,5, as suggested by (Pérez and de los Campos, 2014). Those values were chosen using the rule proposed by de los Campos et al. (2010).

Genomic prediction models RR-BLUP, PLSR, RF were carried out using “GSwGBS” package (Gaynor, 2015) while the BayesianCpi and RKHS were executed with the BGLR package (de los Campos et al., 2010). We calculated each genomic prediction model’s predictive ability as the Pearson correlation between the estimated breeding values from model 1) (obtained using the full data set) and those of validation set predicted from the respective model. For that, we used a cross-validation scheme considering 80% of the observations, randomly selected, as training and the remaining 20% as validation set. The process was repeated 20 times for each model. From the predictive ability values, we estimated the confidence interval for this parameter using the bootstrap method considering 10,000 samples (James et al., 2013).

Determining Optimal Training Population Size

The influence of training population size on predictive ability was evaluated using a validation set comprising of 50 randomly selected lines and training sets of variable sizes. The validation set was formed by randomly sampling 50 lines without replacement. The training population of size n was formed sequentially by adding 25 accessions from the remaining accessions such that its size ranged between 50 and 175. We subset the collection into subgroups of 50, 75, 100, 125, 150, and 175 individuals each. The RR-BLUP model was used to predict each trait. This procedure was repeated 20 times, and accuracies of each training population size were averaged across 20 replicates. To predict a particular subpopulation with increasing population size, a similar procedure was followed to using variable training population size ranged from 50 to 175 with an increment of 25.

Determining Optimal Marker Density

To evaluate the effects of GBS marker selection on predictive ability, we randomly sampled markers five times with the following subset: one thousand (1 K), five thousand (5 K), ten thousand (10 K), fifteen thousand (15 K), twenty thousand (20 K), twenty-five thousand (25 K), and thirty thousand (30 K). A random sampling of SNP was implemented to minimize or avoid any possible biases on sampling towards a particular distribution. Using the RR-BLUP model, a five-fold cross validation approach was used to obtain predictive ability in each marker subset. This procedure was repeated 20 times and predictive ability for each subset of SNPs were averaged across 20 replicates.

Accounting for Population Structure Into the Genomic Prediction Framework

We explored the confounding effect due to population structure on predictive ability. We investigated subpopulation structure on 482 accessions genotyped with 30,600 SNP markers using the ADMIXTURE clustering-based algorithm (Alexander et al., 2009). ADMIXTURE identifies K genetic clusters, where K is specified by the user, from the provided SNP data. For each individual, the ADMIXTURE method estimates the probability of membership to each cluster. An analysis was performed in multiple runs by inputting successive values of K from 2 to 10. The optimal K value was determined using ADMIXTURE’s cross-validation (CV) error values. Based on >60% ancestry, each accession was classified into seven subpopulations (K = 7). Accessions within a subpopulation with membership coefficients of <60% were considered admixed. A total of eight subpopulations were used in this study, including admixed as a separate subpopulation. Principal component (PC) analysis was also conducted to summarize the genetic structure and variation present in the collection.

To account for the effect of population structure, we included the top 10 PC, or the Q-matrix from ADMIXTURE into the RR-BLUP model and performed five-fold cross-validation repeated 20 times. Alternatively, we also used the subpopulation (SP) designation identified by ADMIXTURE as a factor in the RR-BLUP model. Albeit a smaller population size, we also performed a within-subpopulation prediction. As stated above, a subpopulation was defined based on >60% ancestry cut-off. Only three subpopulations with this cut-off were identified and used with reasonable number of entries (e.g., N > 40): SP5 (N = 51), SP7 (N = 58), and SP8 (N = 41). A leave-one-SP-out was used to predict individuals within the subpopulation with the RR-BLUP model. We also used increasing population sizes to predict specific subpopulation (e.g. SP8) using the RR-BLUP model.

Estimating Reliability Criteria and Predicting Unknown Phenotypes

Nonphenotyped entries were predicted based on the RR-BLUP model using SNP markers only. The reliability criteria for each of the nonphenotyped lines were then calculated using the formula (Hayes et al., 2009; Clark et al., 2012) as follows:

r (PEV) = \sqrt{(1 - (\frac{PEV}{σ_{G}^{2}}))} (6)

where PEV is the predicted error variance, and $σ_{G}^{2}$ is the genetic variance.

Results

Phenotypic Heritability and Correlation

Recorded days to first flowering had a wide range of variability from 60 to 84 days with a mean of 71 days. The estimated heritability for days to first flowering was 0.90 using Eq. 2 and 0.80 as per Cullis heritability using Eq. 3 (Table 1). For the number of seeds per pod, the mean was 5.7 with a heritability estimate of 0.84 (H²_Cullis = 0.66). The heritability for plant height was 0.81 (H²_Cullis = 0.68), with an average height of 74 cm. The number of pods per plant had a heritability estimate of 0.50 (H²_Cullis = 0.27) with a mean of 18 number of pods per plant and ranged from 15 to 23 pods. Days to physiological maturity had a mean of 104 days with an estimated heritability of 0.51 (H²_Cullis = 0.38). Seed yield per hectare ranged widely from 1734 to 4,463 kg ha⁻¹ with a mean yield of 2,918 kg ha⁻¹ and a heritability value of 0.67 (H²_Cullis = 0.46). The number of pods per plant was highly and positively correlated with seed yield. Correlation estimation also suggested seed yield was positively correlated with plant height, days to physiological maturity, and days to first flowering (Supplementary Figure S1).

TABLE 1

TABLE 1. Heritability and summary statistics for seed yield and other agronomic traits.

Predictive Ability of Different Genomic Prediction Models

No single model consistently performed best across all traits that we evaluated (Table 2), however Bayesian model BayesCpi, RKHS, and RR-BLUP, in general, tended to generate better results. Roughly the predictive abilities from different models were similar, although slight observed differences were likely due to variations on genetic architecture and the model’s assumptions underlying them. For days to first flowering, the highest predictive ability was obtained from the RR-BLUP (0.60). RR-BLUP, RF, and RKHS models generated the highest predictive ability for number of pods per plant (0.28). Number of seeds per pod was better predicted by RR-BLUP and Bayes Cpi (0.42). For plant height highest prediction accuracies were obtained from RF and BayesCpi (0.45). BaysCpi also gave the highest prediction accuracies for days to physiological maturity (0.47). For seed yield, RKHS had slight advantages over other models (0.42). As mentioned above, some differences between the model’s accuracy were only marginal and cannot be a criterion for choosing one model (Table 2). For example, among the tested models, the highest difference in predictive accuracy, considering number of seeds per pod, had a magnitude of 0.02, a marginal value. The lack of significant differences among genomic prediction models can be interpreted as either a good approximation to the optimal model by all methods or there may be a need for further research (Yu et al., 2016). Unless indicated otherwise, the rest of our results focused on findings from the RR-BLUP model.

TABLE 2

TABLE 2. Predictive ability for seed yield and agronomic traits using five genomic prediction models.

Determining Optimal Number of Individuals

Increasing the training population size led to a slight increase in the predictive ability overall for all traits. Across all traits except days to first flowering and plant height, predictive ability reached a maximum with the largest training population size of N = 175 (Figure 1). A training population comprised of 50 individuals had the lowest predictive ability across all traits. For days to first flowering, and plant height predictive ability did steadily increase up at N = 150, and prediction ability reached the maximum for most traits at highest training population size with N = 175. Regardless of population size, predictive ability was consistently higher for days to first flowering, whereas predictive ability was consistently lower for pods per plant (Figure 1). However, while predicting subpopulation 5 highest predictive ability was obtained for plant height (Supplementary Figure S3).

FIGURE 1

FIGURE 1. Predictive ability with increasing training population size using the RR-BLUP model, DFF is days to first flowering, DM, is days to physiological maturity, NoSeedsPod is number of seeds per pod, PH is plant height in cm, PodsPlant is the number of pods per plant, SeedYield is seed yield in kg ha⁻¹.

Determining Optimal Marker Density

The different marker subsets had insignificant differences on predictive ability for all the traits evaluated in this study. In general, however, predictive abilities were higher between 5K and 15K SNPs and reached a plateau with increasing number of SNPs (Supplementary Figure S2). For seed yield, plant height, and days to maturity, highest predictive ability were 0.38, 0.39, and 0.42 respectively. The highest predictive ability for days to first flowering was 0.61 using a SNP subset of 15K.

Accounting for Population Structure in the Genomic Prediction Model

Population structure explained some portion of the phenotypic variance, ranging from 9 to 19%, with the highest percentages observed for plant height (19%) and seed yield (17%). Using either ADMIXTURE or PCA to account for the effect due to population structure, we improved the predictive ability. We observed a 6% improvement for days to first flowering and 32% for seed yield compared over models that did not account for population structure.

We also performed within subpopulation predictions. Presented here are the predictive abilities for subpopulations 5, 7, and 8, as they had at least 40 entries. Subpopulation 8 had the highest predictive ability for days to first flowering (0.68), plant height (0.33), days to maturity (0.43), and seed yield (0.37). The highest predictive abilities for the number of seeds per pod (0.40) and pods per plant (0.12) were obtained from subpopulation 7 (Table 3). Notably, predictive ability was generally higher when all germplasm sets or subpopulations were included in the model compared to when predictions were made using a subset of germplasm.

TABLE 3

TABLE 3. Predictive ability within and across subpopulations using RR-BLUP and all SNP markers.

Predicting Genotyped but Nonphenotyped Accessions

The genomic prediction model was then used to predict nonphenotyped entries based on their SNP information. Based on the distribution of GEBV, none of the predicted phenotypes for nonphenotyped accessions exceeded the top-performing observed phenotypes for seed yield (Figure 2). The mean seed yield of predicted entries (2,914 kg ha⁻¹) was not significantly different from the mean seed yield of observed genotypes (2,918 kg ha⁻¹). The mean of observed and predicted entries were non-significant for the other five traits (Supplementary Table S1). The GEBV for number of pods per plant, number of seeds per pod (Supplementary Figures S4, S5), days to first flowering, and days to maturity all fall within the range of observed phenotypes (Similar Figures not added).

FIGURE 2

FIGURE 2. Distribution of phenotyped and predicted nonphenotyped accessions within the USDA pea germplasm collection for seed yield and plant height.

Reliability Estimation

We obtained reliability criteria for all traits, including seed yield and phenology, for 244 nonphenotyped accessions. The average reliability values ranged from 0.30 to 0.35, while the highest values for evaluated traits ranged from 0.75 to 0.78. The higher reliability values were distributed in the top, bottom, and intermediate predicted breeding values (Supplementary Tables S2–S7). For seed yield (kg ha⁻¹), the highest reliability was obtained from the bottom 50 (Figure 3). Higher reliability criteria are primarily distributed among the intermediate and top GEBV for days to first flowering. Predicted intermediate plant height showed the highest reliability, as presented in Figure 3.

FIGURE 3

FIGURE 3. Reliability criteria for nonphenotyped lines: the top 50 of genomic estimated breeding values are blue, and bottom 50 are in red, intermediates are in green. (A) reliability estimates for seed yield (kg ha⁻¹), (B) days to first flowering, (C) plant height, (D) number of seeds per plant.

Discussion

Widely utilized plant genetic resources collections, such as the USDA pea germplasm collection, hold immense potential as diverse genetic resources to help guard against genetic erosion and serve as unique sources of genetic diversity from which we could enhance genetic gain, boost crop production, and help reduce crop losses due to disease, pests, and abiotic stresses (Jarquin et al., 2016; Crossa et al., 2017; Holdsworth et al., 2017; Mascher et al., 2019). As the costs associated with genotyping on a broader and more accurate scale continue to decrease, opportunities increase to evaluate and utilize these collections in plant breeding. Relying on phenotypic evaluation alone can be costly, rigorous, and time-intensive. However, by incorporating high-density marker coverage and efficient computational algorithms, we can better realize the potential for utilizing these germplasm stocks by reducing the time and cost associated with their evaluation (Yu et al., 2016; H. Li et al., 2018; Yu et al., 2020). In this study, we evaluated the potential of genotyping-by-sequencing derived SNPs for genomic prediction. We found that it holds promises for extracting useful diversity from germplasm collections for applied breeding efforts.

In this study, predictive ability was generally similar among methods, and there was no single model that worked across traits, consistent with results obtained by other authors (Burstin et al., 2015; Spindel et al., 2015; Yu et al., 2016; Azodi et al., 2019). For example, considering only the punctual estimates, RR-BLUP model was the best for days to first flowering, however for plant height, days to physiological maturity, and seed yield, the best models were BayesCpi and RF, BayesCpi and RKHS, respectively. In recent work, Azodi et al. (2019) compared 12 models (6 linear and 6 non-linear) considering 3 traits in 6 different plant species, and they did not find any best algorithm for all traits across all species. Newer statistical methods are expected to boost prediction accuracy; however, the biological complexity and unique genetic architecture of traits can be regarded as the root cause for getting zero or slight improvement on prediction accuracy (Valluru et al., 2019; Yu et al., 2020). As data collection accelerates in at different levels of biological organization (Kremling et al., 2019), genomic prediction models will expand and nonparametric models, including machine learning, may play an essential role for boosting prediction accuracy (Azodi et al., 2019; Yu et al., 2020).

A related work in pea has been published but only based on a limited number of markers (Burstin et al., 2015). This work assessed genomic prediction models in a diverse collection of 373 pea accessions with 331 SNP markers and found no single best model across traits, which is consistent with our findings. In this work, the authors reported that traits with higher heritability, such as thousand seed weight and flowering date, had higher prediction accuracy. We also verified days to first flowering as having the highest heritability and predictive accuracies through all the models. Interestingly, yield components like the number of seeds per pod and pods per plant showed lower predictive accuracy, regardless of prediction models used. Consistent with our results, Burstin et al. (2015) also found yield components like seed number per plant as having lower predictive accuracy and higher standard deviation for prediction. These traits are highly complex and largely influenced by the environment.

The predictive ability increased for all traits except plant height when we increased the model’s training population size, suggesting that adding more entries in the study can boost predictive ability. By accounting population structure into genomic prediction framework, we observed an improved prediction accuracy for some traits—seed yield and days to first flowering—but not for other traits. Although the population structure explained 9–19% of the phenotypic variance, we cannot fully and conclusively answer the effect of population structure in prediction accuracy due to smaller population size. In addition, accounting for the relatedness among individuals in the training and testing sets can potentially boost prediction accuracy (Riedelsheimer et al., 2013; Lorenz and Smith, 2015; Rutkoshi et al., 2015); it was outside the scope of this research but deserves further study. Adding more environments (year-by-location combination) can also potentially improve prediction accuracy using genomic prediction frameworks that account for genotype-by-environment interactions and/or phenotypic plasticity (Jarquin et al., 2014; Crossa et al., 2017; X. Li et al., 2018; Guo et al., 2020). In general, we observed that predictive ability slightly increased and plateaued after reaching certain subset of SNPs. Such a plateau on prediction ability maybe due to overfitting of models (Hickey et al., 2014; Norman et al., 2018), presumably due to extensive linkage disequilibrium in the pea genome (Kreplak et al., 2019).

Previous studies have indicated the importance of considering reliability values when using predictive ability values to select genotypes (Yu et al., 2016). We found higher reliability estimates were spread across all GEBVs rather than clustering around higher or lower extreme of GEBVs. Those accessions with top predicted values and high reliability estimates maybe selected as candidate parents for increasing seed yield and/or germplasm enhancement. However, for a trait such as days to flowering in pea, even low or intermediate predicted values maybe suitable candidates when paired with high reliability values. We found the means of GEBV for nonphenotyped entries were non-significantly different with phenotyped accessions, and almost none of nonphenotyped accessions were expected to exceed seed yield of phenotyped accessions. Several accessions in the USDA pea germplasm collection can be readily incorporated into breeding programs for germplasm enhancement by incorporating above-average accessions with high or moderately high reliability values (Yu et al., 2020).

Conclusions and Research Directions

The research findings demonstrated that the wealth of genetic diversity available in a germplasm collection could be assessed efficiently and quickly using genomic prediction to identify valuable germplasm accessions that can be used for applied breeding efforts. With the integration of more orthogonal information (e.g., expression, metabolomics, proteomics, etc.) into genomic prediction framework (Kremling et al., 2019; Valluru et al., 2019) coupled with the implementation of more complex genomic selection models like a multivariate genomic selection approach (Rutkoski et al., 2015), we can considerably enhance predictive ability. This research framework could greatly contribute to help discover and extract useful diversity targeting high-value quality traits such as protein and mineral concentrations from a large germplasm collection in the future.

Data Availability Statement

The original contributions presented in the study are publicly available. The SNP dataset can be accessed here: https://www.ncbi.nlm.nih.gov/sra/PRJNA730349.

Author Contributions

NB, CJC, and MAB conceived and designed the manuscript. CJC, DM, and RJM designed and executed the field and genotyping experiments. YM and PZ performed DNA extraction, constructed the library, and called SNPs. MAB, IV, and SS analyzed data, curated SNPs, and ran genomic prediction models. NB oversaw statistical analyses. MAB, HW, IV, and NB wrote and edited the overall manuscript. All authors edited, reviewed, and approved the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors would like to acknowledge the funding provided by USDA Plant Genetic Resource Evaluation Grant for the GS analysis, USA Dry Pea and Lentil Council Research Committee for the field phenotyping, and USDA ARS Pulse Crop Health Initiative for the SNP genotyping and support from USDA ARS Project: 5348-21000-017-00D (CC), and 5348-21000-024-00D (RJM). We also acknowledge the support from USDA-NIFA (Hatch Project ND01513 for NB) and the North Dakota Department of Agriculture through the Specialty Crop Block Grant Program (19–429). Technical assistance from Britton Bourland, Lydia Fields, Kurt Tetrick, and Jennifer Morris is gratefully acknowledged. This work used resources of the Center for Computationally Assisted Science and Technology (CCAST) at North Dakota State University, Fargo, ND, USA which were made possible in part by NSF MRI Award No. 2019077.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.707754/full#supplementary-material

Supplementary Figure S1 | Phenotypic correlation among seed yield and agronomic traits evaluated in this study, DFF is days to first flowering, PH is plant height in cm, SeedYield is seed yield in kg ha^-1, DM is the days to physiological maturity

Supplementary Figure S2 | Predictive ability with increasing SNP markers RR-BLUP model, DFF is days to first flowering, DM, is days to physiological maturity, NoSeedsPod is number of seeds per pod, PH is plant height in cm, PodsPlant is pods per plant, SeedYield is seed yield in kg ha^-1

Supplementary Figure S3 | Predictive ability of subpopulation 5 with increasing training population size

Supplementary Figure S4 | Distribution of phenotyped and predicted non phenotyped accessions for seed yield and number of pods per plant in the USDA germplasm collections

Supplementary Figure S5 | Distribution of phenotyped and predicted non phenotyped accessions for seed yield and number of seeds per pod in the USDA germplasm collections

References

Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Res. 19 (9), 1655–1664. doi:10.1101/gr.094052.109

PubMed Abstract | CrossRef Full Text | Google Scholar

Annicchiarico, P., Nazzicari, N., Laouar, M., Thami-Alami, I., Romani, M., and Pecetti, L. (2020). Development and Proof-Of-Concept Application of Genome-Enabled Selection for Pea Grain Yield under Severe Terminal Drought. Ijms 21 (7), 2414–2420. doi:10.3390/ijms21072414

PubMed Abstract | CrossRef Full Text | Google Scholar

Annicchiarico, P., Nazzicari, N., Pecetti, L., Romani, M., and Russi, L. (2019). Pea Genomic Selection for Italian Environments. BMC Genomics 20 (1), 1–18. doi:10.1186/s12864-019-5920-x

PubMed Abstract | CrossRef Full Text | Google Scholar

Azodi, C. B., Bolger, E., McCarren, A., Roantree, M., de los Campos, G., and Shiu, S.-H. (2019). Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3: Genes, Genomes, Genet. 9 (11), 3691–3702. doi:10.1534/g3.119.400498

PubMed Abstract | CrossRef Full Text | Google Scholar

Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting Linear Mixed-Effects Models Usinglme4. J. Stat. Soft. 67 (1), 1–48. doi:10.18637/jss.v067.i01