
ORIGINAL RESEARCH article

Front. Plant Sci., 15 January 2026

Sec. Plant Breeding

Volume 16 - 2025 | https://doi.org/10.3389/fpls.2025.1699491

Optimizing training sets to identify superior genotypes in hybrid populations

  • Department of Agronomy, National Taiwan University, Taipei, Taiwan

The identification of superior hybrids from candidate populations is a central goal in plant breeding, particularly for commercial applications and large-scale cultivation. In this study, several promising training set optimization methods in genomic selection (GS) are evaluated and extended to construct predictive models for the identification of top-performing genotypes in hybrid populations. The methods investigated include: (i) a ridge regression-based approach, MSPE(v2)Ridge, (ii) a generalized coefficient of determination-based method, CDmean(v2), and (iii) an A-optimality-like ranking strategy, GVaverage. To assess predictive performance in identifying genotypes with the highest true breeding values (TBVs), three evaluation metrics were developed. Since TBVs are latent quantities derived from models, simulation experiments based on real genotype data from wheat (Triticum aestivum L.), maize (Zea mays), and rice (Oryza sativa L.) were carried out to assess the proposed methods. Results demonstrated that GVaverage not only achieved substantial computational efficiency but also generally generated highly informative training sets across a broad range of sizes. However, when constructing small training sets, GVaverage occasionally failed to maintain adequate genomic diversity. In such cases, CDmean(v2) is recommended as a more reliable alternative. Overall, the proposed framework provides a flexible and effective approach to optimizing training sets for hybrid breeding, thereby enhancing the accuracy of genomic prediction in practical breeding programs.

1 Introduction

Genomic selection (GS) has become an indispensable strategy in modern plant breeding for the identification of superior genotypes from large breeding populations. The central principle of GS is the capture of quantitative trait loci across the genome using dense molecular markers (Meuwissen et al., 2001). Genomic prediction models are then constructed to estimate genomic estimated breeding values (GEBVs) for individuals in the breeding population based on a training set comprising both phenotype and genotype data. Since the accuracy of GS is highly dependent on the quality of phenotype data and the genomic relationship between the training set and the target population, optimization of the training set has become a critical factor for the successful implementation of GS (Wu et al., 2023).

Considerable research has focused on developing optimization methods for training sets, especially in inbred populations (Muleta et al., 2019; Sabadin et al., 2022; Fernandez-Gonzales et al., 2023). A wide variety of strategies—ranging from minimizing prediction error variance (PEV) to maximizing the generalized coefficient of determination (CD)—have been proposed (Rincent et al., 2012; Isidro et al., 2015; Rincent et al., 2017; Akdemir et al., 2015; Ou and Liao, 2019; Akdemir and Isidro-Sánchez, 2019). These methods, however, generally assume additive marker effects and have been designed primarily for inbred populations. A recent comprehensive review by Alemu et al. (2024) highlighted the development of these approaches and emphasized the need for more flexible methods applicable to a broader range of breeding scenarios.

In practice, plant breeders are often tasked not only with improving inbred lines but also with identifying superior hybrids for commercial deployment. Hybrid breeding exploits heterosis, or hybrid vigor, to enhance yield, stress tolerance, and quality traits in crops. The performance of a hybrid depends on both general combining ability (related to additive effects) and specific combining ability (linked to dominance and epistatic effects). Consequently, training set optimization for hybrid populations must explicitly account for nonadditive genetic components. Despite its practical importance, this area has received far less attention than inbred-focused optimization. For example, Guo et al. (2019) evaluated several data-mining strategies for representative subset selection in hybrid breeding and demonstrated their advantages over random sampling and earlier CD- or PEV-based methods.

More recently, Chen et al. (2024) proposed and validated training set optimization methods for inbred populations that consider only additive effects. Building on these advances, the present study extends such approaches to hybrid populations by integrating both additive and dominance marker effects into predictive modeling, with the aim of identifying superior hybrids with the highest true breeding values (TBVs). As TBVs of hybrids are latent quantities derived from models, comprehensive simulation studies based on real genotype data from wheat (Triticum aestivum L.), maize (Zea mays), and rice (Oryza sativa L.), presented in Zhao et al. (2015), Technow et al. (2014), and Xu et al. (2018), respectively, were conducted to evaluate the proposed methods.

This work aims to (i) adapt and evaluate state-of-the-art training set optimization methods for hybrid breeding, (ii) compare their predictive ability and computational efficiency, and (iii) provide practical recommendations for method selection under different training set sizes. Our findings contribute to bridging the methodological gap between inbred- and hybrid-focused optimization and offer breeders flexible tools for enhancing genomic prediction in hybrid breeding programs.

2 Materials and methods

2.1 Genome datasets

Three genome datasets were used in this study and are described as follows.

2.1.1 Wheat

Zhao et al. (2015) investigated the genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding. Their study was based on 135 advanced elite winter wheat lines, comprising 15 male lines and 120 female lines. A set of 1,604 hybrids derived from crosses between the male and female lines was evaluated for grain yield (GY, Mg/ha) across 11 environments, and the adjusted means of GY were available in the dataset. For the genotype data, the 135 parental lines were genotyped using a 90,000-SNP Illumina Infinium array. After quality tests, 17,372 high-quality SNP markers were retained. In the current study, the genotype data with 1,604 hybrids and 17,372 SNPs were used for simulation experiments.

2.1.2 Maize

Technow et al. (2014) analyzed a maize dataset comprising 123 dent lines and 86 flint lines, in which the genotype data consist of 35,478 SNP markers across the 209 lines. A total of 10,578 single-cross hybrids can be produced between the 123 dent lines and the 86 flint lines, but only 1,254 of these possible hybrids were measured for the two traits: GY (in quintals per hectare) and grain moisture content (GM, in percent). The genotype data with 1,254 hybrids and 35,478 SNPs were used for simulation experiments in our study.

2.1.3 Rice

A hybrid rice population was presented in Xu et al. (2018), consisting of 575 hybrid combinations produced by crossing five male sterile lines with 115 inbred lines. The 120 parental lines were sequenced, and SNPs with a missing rate > 0.2 in the male sterile lines or > 0.5 in the 115 inbred lines were filtered out, retaining 2,561,889 SNPs. The filtered SNPs were aligned with the 3kRG core SNP set v.4 (a total of 996,009 SNPs), leaving 116,482 SNPs. The aligned SNP markers were further filtered by missing rate ≥ 0.05 and minor allele frequency ≤ 0.1, resulting in 63,735 SNPs. Phenotype data, including eight traits, were evaluated at two locations. In our study, the genotype data with 575 hybrids and 63,735 SNPs were used for simulation experiments.

Details regarding the quality filtering criteria, imputation methods, and normalization procedures for the above datasets can be found in the cited references. To examine the relative effects of different training set sizes across the crop genome datasets, 500 hybrids were randomly sampled from each dataset to serve as the candidate population. These candidate populations were designated WHEAT.CP, MAIZE.CP, and RICE.CP. Let $w_{ij}^A$ be the additive effect score of individual $i$ at SNP $j$, for $i = 1, 2, \ldots, n_c$ and $j = 1, 2, \ldots, p$, where $n_c$ is the number of individuals in the candidate population and $p$ is the number of SNP markers. Following VanRaden (2008) and Wu et al. (2019), the additive effect scores were coded as:

$$w_{ij}^A = \begin{cases} 1, & \text{if } AA \\ 0, & \text{if } AB \\ -1, & \text{if } BB, \end{cases}$$

where AA denotes the homozygote of the major allele, AB denotes the heterozygote, and BB denotes the homozygote of the minor allele at locus j. Moreover, the corresponding dominance effect scores wijD were coded as:

$$w_{ij}^D = \begin{cases} 0, & \text{if } AA \\ 1, & \text{if } AB \\ 0, & \text{if } BB. \end{cases}$$

Normalized marker scores were then used in all analyses throughout this study. Let $x_{ij}^A$ and $x_{ij}^D$ denote the normalized marker scores for additive and dominance effects, respectively; then $x_{ij}^A = (w_{ij}^A - \bar{w}_j^A)/s_j^A$, where $\bar{w}_j^A$ and $s_j^A$ are the sample mean and sample standard deviation of the $w_{ij}^A$ values at SNP $j$. The $x_{ij}^D$ were obtained in the same manner. Furthermore, genomic relationship matrices for the additive and dominance effects were calculated as:

$$K_A = \frac{X_A X_A^T}{p} \quad (1)$$

and

$$K_D = \frac{X_D X_D^T}{p} \quad (2)$$

where $X_A$ and $X_D$ are the normalized marker score matrices for additive and dominance effects of the candidate population, respectively.
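The coding and relationship matrices above can be sketched in a few lines. The following is an illustrative Python example (the study's own implementation is in R); the tiny genotype matrix and all variable names are hypothetical, not from the authors' code:

```python
# Illustrative sketch of the additive/dominance coding and the genomic
# relationship matrices K_A and K_D (Equations 1, 2), on made-up data.
from statistics import mean, stdev

geno = [["AA", "AB", "BB"],   # 4 hypothetical hybrids x 3 SNPs
        ["AB", "AA", "AB"],
        ["BB", "AB", "AA"],
        ["AA", "BB", "AB"]]

ADD = {"AA": 1, "AB": 0, "BB": -1}   # additive scores w^A
DOM = {"AA": 0, "AB": 1, "BB": 0}    # dominance scores w^D

def normalize(W):
    """Column-standardize a score matrix (subtract mean, divide by sd)."""
    n, p = len(W), len(W[0])
    X = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [W[i][j] for i in range(n)]
        m, s = mean(col), stdev(col)
        for i in range(n):
            X[i][j] = (col[i] - m) / s
    return X

def relationship(X):
    """K = X X^T / p, as in Equations 1 and 2."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][k] * X[j][k] for k in range(p)) / p
             for j in range(n)] for i in range(n)]

KA = relationship(normalize([[ADD[g] for g in row] for row in geno]))
KD = relationship(normalize([[DOM[g] for g in row] for row in geno]))
```

With sample standard deviations in the normalization, each resulting K is symmetric and its trace equals the number of individuals minus one.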

2.2 Training set optimization methods

The methods for training set optimization were derived mainly from two kinds of statistical models: whole genome regression (WGR) models and genomic best linear unbiased prediction (GBLUP) models.

2.2.1 A whole-genome regression model-based method

The WGR model, including both additive and dominance effects, is expressed as:

$$y = \beta_0 1_n + X_A \beta_A + X_D \beta_D + e, \quad (3)$$

where y represents the vector of phenotypic values; β0 denotes the constant term; 1n is the vector of ones of length n (with n being the number of phenotypic values); βA corresponds to the vector of additive effects; βD corresponds to the vector of dominance effects; and e denotes the vector of random errors.

The model described in Equation 3 can be rewritten as:

$$y = \beta_0 1_n + X\beta + e,$$

where $X = [X_A, X_D]$ and $\beta = (\beta_A^T, \beta_D^T)^T$. Following Chen et al. (2024), a mean square of prediction error (MSPE) method has been proposed for evaluating the entire candidate population:

$$\mathrm{MSPE}_{\mathrm{Ridge}} = 1 + \frac{1}{n_c}\left\{ \mathrm{Tr}\left[ X_c A A^T X_c^T \right] + \mathrm{Tr}\left[ (X_c - X_c A X_t)(X_c - X_c A X_t)^T \right] \right\}, \quad (4)$$

where $X_c$ and $X_t$ are the merged marker score matrices of additive and dominance effects for the candidate population and the training set, respectively, and

$$A = X_t^T \left( X_t X_t^T + \lambda I_{n_t} \right)^{-1}.$$

Here, nt is the training set size and λ is the regularization parameter. Note that the curly brackets in Equation 4 were missing in the original formulae presented by Chen et al. (2024); the correct formulae can be found in Appendix B of their paper. Minimizing MSPERidge in Equation 4 is equivalent to minimizing the following simplified version:

$$\mathrm{MSPE(v2)}_{\mathrm{Ridge}} = \mathrm{Tr}\left[ X_c A A^T X_c^T \right] + \mathrm{Tr}\left[ (X_c - X_c A X_t)(X_c - X_c A X_t)^T \right]. \quad (5)$$

To reduce computing time, the marker scores were replaced by principal component (PC) scores, with the leading PCs retained such that the cumulative explained variance exceeded 99%. Moreover, the regularization parameter $\lambda$ was fixed at 1 when calculating MSPE(v2)Ridge in the current study.
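The MSPE(v2)Ridge criterion of Equation 5 can be sketched directly from its definition. Below is an illustrative numpy version under the default λ = 1; the marker matrices are random stand-ins, and the authors' actual R implementation (which works on PC scores) differs:

```python
# Illustrative sketch of the MSPE(v2)_Ridge criterion (Equation 5);
# smaller values indicate a more informative training set.
import numpy as np

rng = np.random.default_rng(0)

def mspe_v2_ridge(Xc, Xt, lam=1.0):
    """Tr[Xc A A' Xc'] + Tr[(Xc - Xc A Xt)(Xc - Xc A Xt)'],
    with A = Xt' (Xt Xt' + lam * I)^(-1)."""
    nt = Xt.shape[0]
    A = Xt.T @ np.linalg.inv(Xt @ Xt.T + lam * np.eye(nt))
    var_term = np.trace(Xc @ A @ A.T @ Xc.T)
    B = Xc - Xc @ A @ Xt                 # bias of predicting Xc from Xt
    bias_term = np.trace(B @ B.T)
    return float(var_term + bias_term)

Xc = rng.standard_normal((20, 10))       # candidate-population scores (toy)
Xt = Xc[:5]                              # one candidate training set
score = mspe_v2_ridge(Xc, Xt)            # to be minimized over training sets
```

In the paper this quantity is minimized over candidate training sets with an exchange algorithm; the sketch only evaluates it for one fixed set.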

2.2.2 A genomic best linear unbiased prediction model-based method

A GBLUP model considering both additive and dominance effects was used to construct training sets, which can be described as:

$$y = 1_n \mu + g_A + g_D + e, \quad (6)$$

where $\mu$ is the general mean; $g_A$ and $g_D$ are the vectors of genotypic values for the additive and dominance effects, respectively; and $e$ is the vector of random errors. It is assumed that $g_A$, $g_D$, and $e$ are mutually independent and multivariate normal: $g_A \sim N(0, \sigma_A^2 K_A)$, $g_D \sim N(0, \sigma_D^2 K_D)$, and $e \sim N(0, \sigma_e^2 I_n)$, where $\sigma_A^2$ and $\sigma_D^2$ are the genetic variances for additive and dominance effects, $K_A$ and $K_D$ are the corresponding genomic relationship matrices, and $\sigma_e^2$ is the error variance.

The GBLUP model described in Equation 6 can be rewritten as:

$$y = 1_n \mu + g + e, \quad (7)$$

where $g = g_A + g_D$ and $g \sim N(0, \sigma_A^2 K_A + \sigma_D^2 K_D)$. Following Chen et al. (2024), a heuristic-based CD method for evaluating the entire candidate population can be described as follows. Let $g_t$ be the vector of true genomic values for a selected training set under Equation 7. The BLUP for $g_t$ is given by:

$$\hat{g}_t = \left( M_t + G_t^{-1} \right)^{-1} M_t y_t,$$

where $M_t = I_{n_t} - \bar{J}_{n_t}$, $G_t = \alpha_A K_{A_t} + \alpha_D K_{D_t}$, and $y_t$ consists of the phenotypic values in the training set. Here, $\bar{J}_{n_t}$ is the square matrix of order $n_t$ with all elements equal to $1/n_t$; $\alpha_A = \sigma_A^2/\sigma_e^2$ and $\alpha_D = \sigma_D^2/\sigma_e^2$; and $K_{A_t}$ and $K_{D_t}$ are the additive and dominance relationship matrices for the training set. Moreover, let $g_c$ be the vector of true genomic values in the candidate population; then the BLUP for $g_c$ is given by:

$$\hat{g}_c = G_{ct} \left( M_t G_t + I_{n_t} \right)^{-1} M_t y_t,$$

where $G_{ct} = \alpha_A K_{A_{ct}} + \alpha_D K_{D_{ct}}$. Here, $K_{A_{ct}}$ and $K_{D_{ct}}$ are the additive and dominance relationship matrices between the candidate population and the training set. Furthermore, the variance-covariance matrix of $\hat{g}_c$ is equal to the covariance matrix between $g_c$ and $\hat{g}_c$, which is:

$$\mathrm{Var}(\hat{g}_c) = \mathrm{Cov}(g_c, \hat{g}_c) = G_{ct} \left( M_t G_t + I_{n_t} \right)^{-1} M_t G_{ct}^T \, \sigma_e^2. \quad (8)$$

Let $A_i$ denote the $i$th diagonal element of $G_{ct}(M_t G_t + I_{n_t})^{-1} M_t G_{ct}^T$ in Equation 8, and let $B_i$ denote the corresponding diagonal element of $G_c = \alpha_A K_{A_c} + \alpha_D K_{D_c}$, for $i = 1, 2, \ldots, n_c$, where $K_{A_c}$ and $K_{D_c}$ denote the additive and dominance relationship matrices for the candidate population, respectively. A heuristic-based CD criterion is given by:

$$\mathrm{CDmean(v2)} = \sum_{i=1}^{n_c} \frac{\left[ \mathrm{cov}(g_{ci}, \hat{g}_{ci}) \right]^2}{\mathrm{var}(g_{ci}) \times \mathrm{var}(\hat{g}_{ci})} = \sum_{i=1}^{n_c} \frac{A_i^2}{B_i \times A_i} = \sum_{i=1}^{n_c} \frac{A_i}{B_i}. \quad (9)$$

Maximizing CDmean(v2) is equivalent to maximizing the mean of squared correlations between the true and estimated genotypic values in the candidate population. The parameters $\alpha_A$ and $\alpha_D$ were both fixed at 1 when calculating CDmean(v2), because the selection of training sets based on the CD criterion is usually not affected by the choice of parameter values (Rio et al., 2022; Fernandez-Gonzales et al., 2023). We further discuss the robustness of this setting below. An exchange algorithm implemented in the R package TSDFGS (Ou, 2022) was utilized to optimize training sets by minimizing MSPE(v2)Ridge in Equation 5 or maximizing CDmean(v2) in Equation 9.
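Equation 9 can be evaluated directly from the relationship matrices for any candidate training set. The following numpy sketch is illustrative only, with $\alpha_A = \alpha_D = 1$ as in the paper; the kinship matrices are toy positive semidefinite stand-ins built from random markers, not the paper's data or R code:

```python
# Illustrative sketch of the CDmean(v2) criterion (Equation 9);
# larger values indicate a more informative training set.
import numpy as np

rng = np.random.default_rng(1)

def cdmean_v2(KA, KD, train_idx, alpha_a=1.0, alpha_d=1.0):
    nc = KA.shape[0]
    G = alpha_a * KA + alpha_d * KD          # candidate-population G_c
    t = np.asarray(train_idx)
    Gt = G[np.ix_(t, t)]                     # training-set block G_t
    Gct = G[:, t]                            # candidates x training set
    nt = len(t)
    Mt = np.eye(nt) - np.full((nt, nt), 1.0 / nt)   # I - Jbar
    H = Gct @ np.linalg.inv(Mt @ Gt + np.eye(nt)) @ Mt @ Gct.T
    A = np.diag(H)                           # numerators A_i
    B = np.diag(G)                           # denominators B_i
    return float(np.sum(A / B))              # sum of per-candidate CDs

# Toy kinship matrices for 30 candidates from 50 random markers.
KA = (Z := rng.standard_normal((30, 50))) @ Z.T / 50
KD = (W := rng.standard_normal((30, 50))) @ W.T / 50
score = cdmean_v2(KA, KD, train_idx=range(10))
```

In the paper this score is maximized over candidate training sets with the exchange algorithm in TSDFGS; the sketch evaluates it for one fixed set.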

2.2.3 A-optimality-like method

The A-optimality criterion is used to minimize the average variance of model parameter estimates in the field of classical optimal design. In contrast, the GVaverage criterion here was used to construct training sets that maximize genomic variation, because the probability of correctly identifying the truly superior genotypes in a candidate population is enhanced when the training set explains as much of the variation in genotypic values as possible (Tanaka and Iwata, 2018; Tsai et al., 2021; Tu and Liao, 2024).

For a fixed training set size $n_t$, Chen et al. (2024) proposed an A-optimality-like method to construct training sets such that the average variance of the selected $g_t$ is maximized. The method, modified for the GBLUP model in Equation 7, can be described as:

$$\mathrm{GVaverage} = \underset{d^* \in S_d}{\arg\max} \; \mathrm{Tr}\left[ K_d \right],$$

where $S_d$ denotes the set comprising all possible subsets of $n_t$ candidates, and $K_d$ is the submatrix of $\sigma_A^2 K_{A_c} + \sigma_D^2 K_{D_c}$ corresponding to $d$. The variance components $\sigma_A^2$ and $\sigma_D^2$ were both fixed at 1 when calculating GVaverage. The GVaverage method ranks all candidates based on their individually calculated metrics and selects the top-ranked ones, eliminating the need for computationally intensive search algorithms.
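Because the trace of a submatrix is just the sum of the chosen diagonal entries, this ranking rule reduces to sorting candidates by the diagonal of $\sigma_A^2 K_{A_c} + \sigma_D^2 K_{D_c}$. An illustrative stdlib Python sketch (variable names hypothetical, not from the paper's R code):

```python
# GVaverage ranking: score each candidate by its diagonal entry of
# var_a*KA + var_d*KD and keep the top nt; no exchange algorithm needed.
def gv_average(diag_KA, diag_KD, nt, var_a=1.0, var_d=1.0):
    """Return indices of the nt candidates with the largest genomic variance."""
    score = [var_a * a + var_d * d for a, d in zip(diag_KA, diag_KD)]
    order = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
    return order[:nt]

# Toy diagonals for six candidates.
diag_KA = [1.2, 0.8, 1.0, 1.5, 0.9, 1.1]
diag_KD = [0.3, 1.0, 0.2, 0.4, 0.9, 0.1]
picked = gv_average(diag_KA, diag_KD, nt=3)
```

The one-pass sort is what makes GVaverage so much cheaper than the exchange-algorithm methods; the Discussion's caveat about near-duplicate candidates is also visible here, since two highly related individuals with large diagonals would both be picked.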

2.3 Evaluation metrics

Let $\hat{v}_{(1)}, \hat{v}_{(2)}, \ldots, \hat{v}_{(n_c)}$ be the GEBVs corresponding to the TBVs $v_{(1)} \geq v_{(2)} \geq \cdots \geq v_{(n_c)}$ in a candidate population. By reordering these GEBVs, it follows that $\hat{v}_{(\pi_1)} \geq \hat{v}_{(\pi_2)} \geq \cdots \geq \hat{v}_{(\pi_{n_c})}$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_{n_c})$ is a permutation of $\pi_0 = (1, 2, \ldots, n_c)$. Notably, $\pi_1, \pi_2, \ldots, \pi_{n_c}$ are the ranks in TBV corresponding to $\hat{v}_{(\pi_1)}, \hat{v}_{(\pi_2)}, \ldots, \hat{v}_{(\pi_{n_c})}$. Three evaluation metrics, given below, were used to assess the predictive performance of a training set in identifying the best $k$ genotypes with the highest TBVs in the candidate population.

2.3.1 Normalized discounted cumulative gain

Blondel et al. (2015) proposed the normalized discounted cumulative gain (NDCG) score at position k, defined as follows. The discounted cumulative gain (DCG) at position k of the predicted ranking π=(π1,π2,,πnc) and the ideal ranking π0=(1, 2,, nc) based on a training set are as follows:

$$\mathrm{DCG@}k\left(v, \pi(\hat{v})\right) = \sum_{i=1}^{k} f\left(v_{(\pi_i)}\right) d(i), \quad (10)$$
$$\mathrm{DCG@}k\left(v, \pi_0(v)\right) = \sum_{i=1}^{k} f\left(v_{(i)}\right) d(i), \quad (11)$$

where $f(v)$ is a monotonically increasing gain function and $d(i)$ is a monotonically decreasing discount function. In this study, the linear gain function $f(v) = v$ and the discount function $d(i) = 1/\log_2(i+1)$ were used in Equations 10, 11. The NDCG score for identifying the best $k$ genotypes is then defined as:

$$\mathrm{NDCG@}k(v, \hat{v}) = \frac{\mathrm{DCG@}k\left(v, \pi(\hat{v})\right)}{\mathrm{DCG@}k\left(v, \pi_0(v)\right)}. \quad (12)$$

The $\mathrm{NDCG@}k(v, \hat{v})$ value ranges from 0 to 1, with higher values indicating better performance of the training set.
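Equations 10-12 with the gain and discount functions above can be sketched as follows; the TBV values and ranking are made up for illustration:

```python
# Illustrative NDCG@k with gain f(v) = v and discount d(i) = 1/log2(i+1).
from math import log2

def ndcg_at_k(tbv_sorted, pi, k):
    """tbv_sorted: TBVs in decreasing order; pi[i]: TBV rank (1-based)
    of the genotype in GEBV position i+1."""
    dcg = sum(tbv_sorted[pi[i] - 1] / log2(i + 2) for i in range(k))
    idcg = sum(tbv_sorted[i] / log2(i + 2) for i in range(k))
    return dcg / idcg

tbv = [9.0, 7.5, 6.0, 4.0, 2.0]   # TBVs, already in decreasing order
pi = [2, 1, 4, 3, 5]              # TBV ranks induced by the GEBV ordering
score = ndcg_at_k(tbv, pi, k=3)   # below 1 because the ranking is imperfect
```

A perfect GEBV ranking (pi equal to the identity permutation) gives exactly 1.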

2.3.2 Spearman’s rank correlation

Spearman’s rank correlation (SRC) was applied to measure the linear relationship between the ranks of the top k genotypes with the highest GEBVs and their true ranks in TBVs. It is defined as:

$$\mathrm{SRC@}k = \frac{\sum_{i=1}^{k}\left(i - \frac{k+1}{2}\right)\left(\pi_i - \bar{\pi}\right)}{\sqrt{\left[\sum_{i=1}^{k}\left(i - \frac{k+1}{2}\right)^2\right]\left[\sum_{i=1}^{k}\left(\pi_i - \bar{\pi}\right)^2\right]}},$$

where $\bar{\pi} = \sum_{i=1}^{k} \pi_i / k$. SRC@$k$ is simply the Pearson correlation calculated from the paired values $(i, \pi_i)$ for $i = 1, 2, \ldots, k$.
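Since SRC@k is a Pearson correlation on the pairs $(i, \pi_i)$, a stdlib sketch is short; the example ranking is hypothetical:

```python
# Illustrative SRC@k: Pearson correlation between positions 1..k and the
# corresponding TBV ranks pi_1..pi_k.
from math import sqrt

def src_at_k(pi, k):
    xs = list(range(1, k + 1))
    ys = pi[:k]
    mx, my = sum(xs) / k, sum(ys) / k
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) *
               sum((y - my) ** 2 for y in ys))
    return num / den

score = src_at_k([2, 1, 3, 5, 4], 4)   # imperfect ranking, so below 1
```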

2.3.3 Rank sum ratio

The sum of ranks in TBVs corresponding to the top $k$ genotypes with the highest GEBVs is $\sum_{i=1}^{k} \pi_i$, and the sum of ranks of the ideal ranking is $\sum_{i=1}^{k} i$. The ratio of these two rank sums indicates the ability of the training set to identify the best $k$ genotypes and is defined as:

$$\mathrm{RS\_ratio@}k = \frac{\sum_{i=1}^{k} i}{\sum_{i=1}^{k} \pi_i}.$$

The RS_ratio@k also ranges from 0 to 1, taking the value of 1 if the top k genotypes with the highest GEBVs exactly match the k genotypes with the highest TBVs.
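The rank sum ratio is the simplest of the three metrics and can be written in one line; the example ranking is made up:

```python
# Illustrative RS_ratio@k: ideal rank sum 1 + 2 + ... + k divided by the
# realized TBV-rank sum of the top-k GEBV genotypes.
def rs_ratio_at_k(pi, k):
    return sum(range(1, k + 1)) / sum(pi[:k])

score = rs_ratio_at_k([2, 1, 5, 3], 4)   # 10/11, since one pick ranks 5th
```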

2.4 Simulation studies

Based on the GBLUP model in Equation 6, phenotype data were simulated to evaluate the performance of the construction methods MSPE(v2)Ridge, CDmean(v2), and GVaverage. Random sampling was also performed as a baseline method. For each candidate population (WHEAT.CP, MAIZE.CP, and RICE.CP), the additive and dominance relationship matrices $K_A$ and $K_D$ were first calculated using Equations 1, 2. Subsequently, training sets were generated by each construction method at sizes of 50, 100, 150, and 200.

From historical studies on crops such as wheat, rice, and maize (Hassan et al., 2013; Chen et al., 2023; Barreto et al., 2024; Al-Daej, 2023), 31 estimates of $\gamma = \sigma_D^2/\sigma_A^2$ were available for various traits. The mean and standard deviation of the 31 estimated $\gamma$ values were 1.43 and 2.56, respectively. The model parameters were therefore fixed at $\mu = 100$, $\sigma_A^2 = 20$, and $\sigma_D^2 = \gamma \sigma_A^2$ with $\gamma = 0.5, 1, 2$, and $4$, and $\sigma_e^2 = \frac{1-h^2}{h^2}(\sigma_A^2 + \sigma_D^2)$ with $h^2 = (\sigma_A^2 + \sigma_D^2)/(\sigma_A^2 + \sigma_D^2 + \sigma_e^2) = 0.3$ and $0.6$ in this simulation study. Accordingly, 2,000 datasets were generated for each parameter setting, i.e., $g_A \sim \mathrm{MVN}(0, \sigma_A^2 K_A)$, $g_D \sim \mathrm{MVN}(0, \sigma_D^2 K_D)$, and $e \sim \mathrm{MVN}(0, \sigma_e^2 I_n)$. Simulated TBVs were obtained as the fixed general mean $\mu$ plus the simulated $g = g_A + g_D$, and simulated phenotypic values $y$ were obtained as the simulated TBVs plus the residual $e$. The mean values and standard deviations of NDCG@k, SRC@k, and RS_ratio@k with $k = 20$ across the 2,000 simulated datasets were calculated to compare the methods. Moreover, the relative improvement percentages (RIPs) across different metrics and optimization methods, calculated relative to the random sampling method, were also reported. The RIP can be expressed as:

$$\mathrm{RIP} = \frac{\bar{M}_i - \bar{M}_0}{\bar{M}_0} \times 100. \quad (13)$$

Here, $\bar{M}_i$ represents the mean value of a given metric under a specific optimization method, and $\bar{M}_0$ denotes the corresponding mean value obtained from the random sampling method. The Bayesian reproducing kernel Hilbert space (RKHS) method in the R package BGLR (Pérez and de los Campos, 2014) was used to perform the GEBV prediction.
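Equation 13 is a simple percentage change over the random-sampling baseline; for instance, with hypothetical metric means:

```python
# Illustrative RIP (Equation 13): percent improvement of a method's mean
# metric value over the random-sampling baseline mean.
def rip(mean_method, mean_random):
    return (mean_method - mean_random) / mean_random * 100

gain = rip(0.92, 0.88)   # a method beating the baseline by ~4.5%
```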

3 Results

The mean values and the standard deviations of the evaluation metrics for identifying the top 20 genotypes across 2,000 simulation runs for each dataset are presented in Figures 1-3. The RIPs, as defined in Equation 13, for all datasets are provided in Supplementary Tables S1-S9. In addition, the minimum and maximum RIPs across all simulation scenarios for each combination of evaluation metric and training set construction method are summarized in Tables 1-3. The key findings are as follows:


Figure 1. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2,000 simulated datasets based on the training set construction methods in the WHEAT.CP dataset.


Figure 2. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2,000 simulated datasets based on the training set construction methods in the MAIZE.CP dataset.


Figure 3. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2000 simulated datasets based on the training set construction methods in the RICE.CP dataset.


Table 1. The minimum and maximum relative improvement percentages across all simulation scenarios for the WHEAT.CP dataset.


Table 2. The minimum and maximum relative improvement percentages across all simulation scenarios for the MAIZE.CP dataset.


Table 3. The minimum and maximum relative improvement percentages across all simulation scenarios for the RICE.CP dataset.

i. Performance of construction methods: Overall, GVaverage generally outperformed the other three methods, and CDmean(v2) performed slightly better than MSPE(v2)Ridge and random sampling across most datasets and evaluation metrics. An exception occurred in the RICE.CP dataset, where GVaverage underperformed the other methods at $n_t = 50$ for NDCG@k and RS_ratio@k (Figure 3).

ii. Random sampling as a baseline: The random sampling method consistently exhibited competitive performance relative to MSPE(v2)Ridge and CDmean(v2), suggesting that the heuristic-based optimization methods do not always confer large advantages.

iii. Dataset-specific differences: Method performance varied across datasets. In the MAIZE.CP dataset, GVaverage substantially outperformed the other three methods across all evaluation metrics. By contrast, in WHEAT.CP and RICE.CP, the four methods showed more similar levels of performance.

iv. Effect of dominance-to-additive variance ratio (γ): For fixed $n_t$ and $h^2$, the predictive ability of all methods gradually declined as γ increased, indicating that higher relative dominance variance reduced accuracy.

v. Effect of training set size ($n_t$): For fixed γ and $h^2$, the performance of all methods improved with increasing training set size, consistent with expectations from GS theory.

vi. Effect of heritability (h2): For fixed γ and nt, all methods benefited from higher heritability, highlighting the importance of genetic signal strength in prediction accuracy.

vii. Metric-specific comparisons: Across datasets and scenarios, the differences among construction methods were generally smaller for NDCG@k compared with SRC@k and RS_ratio@k, suggesting that NDCG@k may be less sensitive to differences in training set optimization.

4 Discussion

The simulation studies demonstrated that GVaverage generally outperformed the MSPE(v2)Ridge, CDmean(v2), and random sampling methods across the datasets and training set sizes, except when the training set size was small ($n_t = 50$) in the RICE.CP dataset. Note that the minimal RIPs of the metrics in RICE.CP for GVaverage occurred in cases with $n_t = 50$. This decline in performance likely reflects reduced diversity in the training set. GVaverage prioritizes candidates with the highest genomic variation, which may inadvertently select individuals with high relatedness, thereby limiting the genetic diversity of the training set. For example, if two candidates with the highest genomic variation are nearly identical, GVaverage would likely select both, thereby reducing the overall genomic diversity of the training set. In contrast, the heuristic-based approaches use exchange algorithms to maintain diversity, which is especially beneficial for small training sets.

The MSPE(v2)Ridge and CDmean(v2) methods for training set selection were primarily designed to maximize the relationship between the training set and the candidate population. In contrast, the GVaverage method focuses on maximizing the variation of dominance effects, thereby expanding the search space for identifying superior hybrids. Notably, GVaverage exhibited a clear advantage in the maize dataset, likely due to maize displaying more pronounced dominance effects than wheat and rice. The genomic variance attributable to dominance effects for genotypes in the WHEAT.CP, MAIZE.CP, and RICE.CP datasets is displayed in Figure 4, indicating that maize exhibited substantially stronger dominance effects than wheat and rice. This finding helps explain why GVaverage showed a marked advantage over the other methods in the MAIZE.CP dataset. Moreover, small training sets can hinder accurate estimation of dominance relationships among hybrids, which may also explain the weaker performance in small training set scenarios.

Figure 4
Three histograms comparing genomic variance of dominance effects for wheat, maize, and rice. Each graph has the x-axis labeled as genomic variance from 0 to 4 and the y-axis as count from 0 to 100. Wheat and rice show similar peaked distributions around 1, while maize has a broader distribution with peaks around 1 and extended values.

Figure 4. Histograms of genomic variances attributable to dominance effects for genotypes in the WHEAT.CP, MAIZE.CP, and RICE.CP datasets.

Another consistent observation from the results was that differences among construction methods were smaller for NDCG@k than for SRC@k and RS_ratio@k. This observation can be attributed to the fact that NDCG@k in Equation 12 incorporates both the simulated TBV and a discount function. In our simulations, the simulated TBV was defined as the sum of the general mean and the simulated genotypic value, with the general mean set substantially larger than the genotypic value. As a result, the metric became less sensitive to differences in DCG values between the predicted and ideal rankings. This result is also supported by the fact that the estimates for NDCG@k had much smaller standard deviations than those for SRC@k and RS_ratio@k (Figures 1-3).

To check whether the MSPE(v2)Ridge and CDmean(v2) methods are affected by the parameter settings when constructing training sets, 20 random training sets, each comprising 50 genotypes, were chosen from the MAIZE.CP dataset, which has a higher dominance effect than the WHEAT.CP and RICE.CP datasets. The MSPE(v2)Ridge values in Equation 5 were calculated at $\lambda = 0.01, 0.1, 1, 10$, and $100$, and the CDmean(v2) values in Equation 9 were calculated with $(\alpha_A, \alpha_D)$ equal to (0.5, 0.5), (0.5, 1), (1, 1), (1, 2), and (1, 4) for the 20 training sets. These results are displayed in Supplementary Tables S10, S11. For MSPE(v2)Ridge, the results in Supplementary Table S10 showed that the rankings of candidate training sets remained identical, indicating robustness to the choice of $\lambda$. From Supplementary Table S11, Pearson's correlations and Spearman's rank correlations between the resulting CDmean(v2) values at the default setting $(\alpha_A, \alpha_D) = (1, 1)$ and the other settings are displayed in Table 4. Both correlation coefficients decreased as the discrepancy in the dominance variance component increased. However, the correlation analyses across different weight settings of the additive and dominance components still demonstrated strong concordance with the default setting, supporting its use in practice. To check the robustness of GVaverage, $\sigma_A^2 K_{A_c} + \sigma_D^2 K_{D_c}$ was calculated with $(\sigma_A^2, \sigma_D^2)$ equal to (0.5, 0.5), (0.5, 1), (1, 1), (1, 2), and (1, 4). The overlap proportion of the top 50 candidates and RS_ratio@k with $k = 50$ between the default setting $(\sigma_A^2, \sigma_D^2) = (1, 1)$ and the other settings are displayed in Table 5. These results showed high stability under alternative parameter settings of the variance components, further reinforcing the method's practical utility with default parameters. The robustness verification based solely on these rank concordance tests may nevertheless be insufficient. A more comprehensive sensitivity analysis, considering a wider range of parameter values, runtime comparisons, and changes in evaluation metrics, would provide stronger evidence. We plan to address this issue in future work.


Table 4. Pearson’s correlations (r) and Spearman’s rank correlations (SRC) over the CDmean(v2) values of the 20 random training sets between the default setting, (αA,αD) = (1,1), and the other four settings in the MAIZE.CP dataset.


Table 5. The overlap proportion of the top 50 candidates and RS_ratio@k=50 between the default setting, (σA2,σD2) = (1,1), and the other four settings for the GVaverage method in the MAIZE.CP dataset.

We employed the randomly selected datasets of WHEAT.CP, MAIZE.CP, and RICE.CP throughout our study. To evaluate the potential effect of stochastic sampling of candidate populations, we conducted additional simulations at nt=50 across 50 randomly sampled candidate populations per dataset. These simulation results are displayed in Supplementary Figures S1–S3, which were consistent with those obtained from a single candidate population (Figures 1-3), confirming the robustness of our conclusions. Notably, GVaverage continued to underperform in the small training sets for the rice dataset, particularly in estimating NDCG@k and RS_ratio@k at h2=0.6 (Supplementary Figure S3), an outcome consistent with earlier findings (Figure 3).

The ranking agreement metrics of NDCG@k, SRC@k, and RS_ratio@k may have limited biological interpretability, as they are not directly linked to actual genetic gain. Nevertheless, because the present study focuses on identifying the truly superior genotypes within a candidate population, these metrics remain valuable for quantifying information gain from the selected training sets. There is, however, still room for developing evaluation metrics with stronger biological relevance. Furthermore, the simulation framework in this study was primarily based on the GBLUP model described in Equation 6, which incorporates most factors that may influence the identification of superior genotypes in hybrid populations. Despite this, the model still lacks some degree of practical and biological realism. Future studies could improve upon this by relaxing the assumptions of normality and independence of genotypic values derived from additive and dominance effects. Alternatively, empirical estimates of model parameters from real datasets could be employed to enhance biological validity.

The simulation studies conducted for performance comparison evaluated the effects of the training sets within a single breeding cycle. However, it is well recognized that other key factors, such as genetic variation, can substantially influence genetic gain over multiple breeding cycles (Lehermeier et al., 2017; Chung and Liao, 2022). Future research could therefore explore the long-term impact of training sets constructed using the proposed methods. Moreover, the phenotypic value of a trait is influenced by genotype, environment, and genotype-by-environment interactions. In practice, local environmental conditions can markedly affect hybrid performance during the growth period. Consequently, superior hybrids identified through simulation may not necessarily perform as expected under real field conditions. Thus, conducting extensive field experiments across different crops and environments to validate the main findings of this study would be highly valuable.

The ranking method based on GVaverage is expected to be substantially more computationally efficient than the heuristic-based approaches MSPE(v2)Ridge and CDmean(v2). Nevertheless, concerns may still arise regarding actual computation time. Accordingly, Table 6 reports the runtimes required to generate training sets of sizes 50, 100, 150, and 200 for the WHEAT.CP, MAIZE.CP, and RICE.CP datasets. The runtimes ranged from 745.08 to 6,602.93 min for MSPE(v2)Ridge, from 148.18 to 1,359.55 min for CDmean(v2), and from 0.001 to 0.0276 s for GVaverage. Computation time may escalate as the training set size or candidate population size increases; this requirement can, however, be mitigated by upgraded hardware and optimized software implementations.
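The efficiency gap is structural: a ranking criterion scores every candidate once and sorts, whereas the heuristics repeatedly re-evaluate an objective over exchanged sets. The sketch below illustrates a hypothetical GVaverage-like ranking in which each candidate is scored by its average combined additive-plus-dominance relationship to the population; the scoring function, kernel constructions, and variable names here are illustrative assumptions, not the criterion or code from this study.

```python
import numpy as np

def rank_by_average_relationship(K_A, K_D, n_t, var_a=1.0, var_d=1.0):
    """Illustrative GVaverage-style ranking (assumed form, not the paper's code).

    Each candidate is scored by the mean of its row in the combined
    relationship matrix var_a*K_A + var_d*K_D; the n_t candidates with
    the highest average relationship to the population form the training
    set. One pass over K plus a sort -- no iterative exchange search.
    """
    K = var_a * np.asarray(K_A) + var_d * np.asarray(K_D)
    scores = K.mean(axis=1)
    return np.argsort(scores)[::-1][:n_t]

# toy example: 300 candidates, kernels built from random marker codes
rng = np.random.default_rng(0)
M = rng.choice([-1, 0, 1], size=(300, 500)).astype(float)  # -1/0/1 genotype codes
K_A = M @ M.T / M.shape[1]                 # simple additive kernel
H = (M == 0).astype(float)                 # crude heterozygote indicator
K_D = H @ H.T / M.shape[1]                 # crude dominance kernel
train = rank_by_average_relationship(K_A, K_D, n_t=50)
print(len(train))                          # 50
```

Because the score is computed once per candidate, the cost is dominated by forming the kernels, which is consistent with the sub-second runtimes observed for GVaverage.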

Table 6

Table 6. Runtime of executing R code for the construction methods to generate training sets in the datasets.

Overall, these results reinforce the value of GVaverage as a computationally efficient and generally competitive method for training set optimization in hybrid populations. While Chen et al. (2024) showed its competitiveness in inbred populations, the present study extends its utility to hybrids by incorporating dominance effects into the construction framework. GVaverage can thus be recommended for medium-to-large training sets where efficiency and scalability are critical. However, for small training sets, CDmean(v2) remains preferable because it safeguards diversity and maintains predictive accuracy. Finally, the practical contribution of this study is underscored by the availability of R code implementing the proposed methods (https://github.com/spcspin/AD), providing breeders with accessible tools for optimizing training sets in hybrid breeding programs.
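For readers connecting training set choice to prediction itself, the kernel-ridge form of GBLUP can be sketched as below. This is a simplified stand-in for the Equation 6 model: the combined kernel K would be built as σ²_A·K_A + σ²_D·K_D, and the regularization is folded into a single ratio λ = (1 − h²)/h² rather than estimated variance components. It is not the authors' implementation.

```python
import numpy as np

def gblup_predict(K, train_idx, cand_idx, y_train, h2=0.5):
    """Kernel-ridge/GBLUP prediction of candidate genotypic values.

    K is a combined genomic relationship matrix over all individuals;
    lambda = (1 - h2) / h2 acts as the noise-to-signal ratio. This is the
    standard GBLUP identity g_hat = mu + K_ct (K_tt + lambda I)^-1 (y - mu).
    """
    lam = (1.0 - h2) / h2
    K_tt = K[np.ix_(train_idx, train_idx)]
    K_ct = K[np.ix_(cand_idx, train_idx)]
    mu = y_train.mean()
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + K_ct @ alpha

# toy use: simulate additive TBVs consistent with K, then predict candidates
rng = np.random.default_rng(2)
M = rng.choice([-1.0, 0.0, 1.0], size=(200, 400))   # toy marker matrix
K = M @ M.T / M.shape[1]
g = M @ rng.normal(size=400) / np.sqrt(M.shape[1])  # TBVs with Cov(g) = K
train, cand = np.arange(100), np.arange(100, 200)
y = g[train] + rng.normal(scale=0.7, size=100)      # noisy phenotypes
pred = gblup_predict(K, train, cand, y, h2=0.6)
print(pred.shape)                                   # (100,)
```

In this framing, training set optimization amounts to choosing train_idx so that K_ct carries as much information about the candidates as possible, which is exactly what the construction methods above compete on.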

5 Conclusion

This study provides a comprehensive evaluation of training set optimization methods for GS in hybrid populations by explicitly incorporating both additive and dominance effects. Our findings demonstrate that GVaverage offers a fast and effective strategy for constructing training sets of medium to large sizes, while CDmean(v2) is more reliable in small-sample scenarios, where maintaining genetic diversity is critical. These insights not only extend existing optimization frameworks from inbred to hybrid breeding but also provide practical recommendations for method selection under varying breeding program conditions. Future research could explore integrating epistatic effects, multienvironment data, and dynamic training set updating strategies to further enhance the accuracy and robustness of genomic prediction in hybrid breeding.

Data availability statement

WHEAT.CP, MAIZE.CP, and RICE.CP datasets and the 20 training sets of size 50 sampled from the MAIZE.CP dataset for the robustness verification can be downloaded from the GitHub repository (https://github.com/spcspin/AD).

Author contributions

S-PC: Software, Writing – original draft, Investigation, Data curation. C-TL: Writing – original draft, Funding acquisition, Project administration, Methodology, Supervision, Writing – review & editing, Investigation, Validation, Conceptualization.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Ministry of Science and Technology, Taiwan (Grant No. MOST 112-2118-M-002-003-MY2).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1699491/full#supplementary-material

References

Akdemir, D. and Isidro-Sánchez, J. (2019). Design of training population for selective phenotyping in genomic prediction. Sci. Rep. 9, 1–15. doi: 10.1038/s41598-018-38081-6, PMID: 30723226

Akdemir, D., Sanchez, J. I., and Jannink, J. L. (2015). Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel Evol. 47, 1–10. doi: 10.1186/s12711-015-0116-6, PMID: 25943105

Al-Daej, M. I. (2023). Identifying the combining ability and genetic components of some rice agronomic traits (Oryza sativa L.). J. King Saud University-Science 35, 102657. doi: 10.1016/j.jksus.2023.102657

Alemu, A., Åstrand, J., Montesinos-López, O. A., Isidro y Sánchez, J., Fernández-Gónzalez, J., Tadesse, W., et al. (2024). Genomic selection in plant breeding: Key factors shaping two decades of progress. Mol. Plant 17, 552–578. doi: 10.1016/j.molp.2024.03.007, PMID: 38475993

Barreto, C. A. V., das Graças Dias, K. O., de Sousa, I. C., Azevedo, C. F., Nascimento, A. C. C., Guimarães, L. J. M., et al. (2024). Genomic prediction in multi-environment trials in maize using statistical and machine learning methods. Sci. Rep. 14, 1062. doi: 10.1038/s41598-024-51792-3, PMID: 38212638

Blondel, M., Onogi, A., Iwata, H., and Ueda, N. (2015). A ranking approach to genomic selection. PloS One 10, e0128570. doi: 10.1371/journal.pone.0128570, PMID: 26068103

Chen, S. P., Sung, W. H., and Liao, C. T. (2024). Constructing training sets for genomic selection to identify superior genotypes in candidate populations. Theor. Appl. Genet. 137, 270. doi: 10.1007/s00122-024-04766-y, PMID: 39550734

Chen, S. P., Tung, C. W., Wang, P. H., and Liao, C. T. (2023). A statistical package for evaluation of hybrid performance in plant breeding via genomic selection. Sci. Rep. 13, 12204. doi: 10.1038/s41598-023-39434-6, PMID: 37500737

Chung, P. Y. and Liao, C. T. (2022). Selection of parental lines for plant breeding via genomic selection. Front. Plant Sci. 13, 934767. doi: 10.3389/fpls.2022.934767, PMID: 35968112

Fernández-González, J., Akdemir, D., and Isidro y Sánchez, J. (2023). A comparison of methods for training population optimization in genomic selection. Theor. Appl. Genet. 136, 30. doi: 10.1007/s00122-023-04265-6, PMID: 36892603

Guo, T., Yu, X., Li, X., Zhang, H., Zhu, C., Flint-Garcia, S., et al. (2019). Optimal designs for genomic selection in hybrid crops. Mol. Plant 12, 390–401. doi: 10.1016/j.molp.2018.12.022, PMID: 30625380

Hassan, M. S., El-Said, R. A. R., and Abd-El-Haleem, S. H. M. (2013). Estimation of heritability and variance components for some quantitative traits in bread wheat (Triticum aestivum L.). World Appl. Sci. J. 27, 944–949.

Isidro, J., Jannink, J. L., Akdemir, D., Poland, J., Heslot, N., and Sorrells, M. E. (2015). Training set optimization under population structure in genomic selection. Theor. Appl. Genet. 128, 145–158. doi: 10.1007/s00122-014-2418-4, PMID: 25367380

Lehermeier, C., Teyssedre, S., and Schon, C. C. (2017). Genetic gain increases by applying the usefulness criterion with improved variance prediction in selection of crosses. Genetics 207, 1651–1661. doi: 10.1534/genetics.117.300403, PMID: 29038144

Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi: 10.1093/genetics/157.4.1819, PMID: 11290733

Muleta, K. T., Pressoir, G., and Morris, G. P. (2019). Optimizing genomic selection for a sorghum breeding program in Haiti: A simulation study. G3 Genes|Genomes|Genetics 9, 391–401. doi: 10.1534/g3.118.200932, PMID: 30530641

Ou, J. H. (2022). TSDFGS: Training set determination for genomic selection (R package version 2.0). doi: 10.32614/CRAN.package.TSDFGS

Ou, J. H. and Liao, C. T. (2019). Training set determination for genomic selection. Theor. Appl. Genet. 132, 2781–2792. doi: 10.1007/s00122-019-03387-0, PMID: 31267147

Pérez, P. and de los Campos, G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. doi: 10.1534/genetics.114.164442, PMID: 25009151

Rincent, R., Charcosset, A., and Moreau, L. (2017). Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theor. Appl. Genet. 130, 2231–2247. doi: 10.1007/s00122-017-2956-7, PMID: 28795202

Rincent, R., Laloë, D., Nicolas, S., Altmann, T., Brunel, D., Revilla, P., et al. (2012). Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192, 715–728. doi: 10.1534/genetics.112.141473, PMID: 22865733

Rio, S., Akdemir, D., Carvalho, T., and Sanchez, J. I. Y. (2022). Assessment of genomic prediction reliability and optimization of experimental designs in multi-environment trials. Theor. Appl. Genet. 135, 405–419. doi: 10.1007/s00122-021-03972-2, PMID: 34807267

Sabadin, F., DoVale, J. C., Platten, J. D., and Fritsche-Neto, R. (2022). Optimizing self-pollinated crop breeding employing genomic selection: From schemes to updating training sets. Front. Plant Sci. 13, 935885. doi: 10.3389/fpls.2022.935885, PMID: 36275547

Tanaka, R. and Iwata, H. (2018). Bayesian optimization for genomic selection: a method for discovering the best genotype among a large number of candidates. Theor. Appl. Genet. 131, 93–105. doi: 10.1007/s00122-017-2988-z, PMID: 28986680

Technow, F., Schrag, T. A., Schipprack, W., Bauer, E., Simianer, H., and Melchinger, A. E. (2014). Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197, 1343–1355. doi: 10.1534/genetics.114.165860, PMID: 24850820

Tsai, S. F., Shen, C. C., and Liao, C. T. (2021). Bayesian approaches for identifying the best genotype from a candidate population. J. Agric. Biol. Environ. Stat. 26, 519–537. doi: 10.1007/s13253-021-00454-2

Tu, H. N. and Liao, C. T. (2024). A modified Bayesian approach for determining a training set to identify the best genotypes from a candidate population in genomic selection. J. Agric. Biol. Environ. Stat., 1–22. doi: 10.1007/s13253-024-00641-x

VanRaden, P. M. (2008). Efficient methods to compute genomic prediction. J. Dairy Sci. 91, 4414–4423. doi: 10.3168/jds.2007-0980, PMID: 18946147

Wu, P. Y., Ou, J. H., and Liao, C. T. (2023). Sample size determination for training set optimization in genomic prediction. Theor. Appl. Genet. 136, 57. doi: 10.1007/s00122-023-04254-9, PMID: 36912999

Wu, P. Y., Tung, C. W., Lee, C. Y., and Liao, C. T. (2019). Genomic prediction of pumpkin hybrid performance. Plant Genome 12, 180082. doi: 10.3835/plantgenome2018.10.0082, PMID: 31290920

Xu, Y., Wang, X., Ding, X., Zheng, X., Yang, Z., Xu, C., et al. (2018). Genomic selection of agronomic traits in hybrid rice using an NCII population. Rice 11, 32. doi: 10.1186/s12284-018-0223-4, PMID: 29748895

Zhao, Y., Li, Z., Liu, G., Jiang, Y., Maurer, H. P., Würschum, T., et al. (2015). Genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding. PNAS 112, 15624–15629. doi: 10.1073/pnas.1514547112, PMID: 26663911

Keywords: genomic best linear unbiased prediction model, genomic prediction, hybrid performance, plant breeding, whole genome regression model

Citation: Chen S-P and Liao C-T (2026) Optimizing training sets to identify superior genotypes in hybrid populations. Front. Plant Sci. 16:1699491. doi: 10.3389/fpls.2025.1699491

Received: 05 September 2025; Revised: 04 December 2025; Accepted: 09 December 2025;
Published: 15 January 2026.

Edited by:

Madhav Bhatta, Bayer Crop Science, United States

Reviewed by:

Venkata Naresh Boddepalli, Iowa State University, United States
Felipe Sabadin, Utah State University, United States

Copyright © 2026 Chen and Liao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chen-Tuo Liao, ctliao@ntu.edu.tw

Present address: Szu-Ping Chen, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, United States

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.