- Department of Agronomy, National Taiwan University, Taipei, Taiwan
The identification of superior hybrids from candidate populations is a central goal in plant breeding, particularly for commercial applications and large-scale cultivation. In this study, several promising training set optimization methods in genomic selection (GS) are evaluated and extended to construct predictive models for identifying top-performing genotypes in hybrid populations. The methods investigated include (i) a ridge regression-based approach built on the mean square of prediction error (MSPE), (ii) a generalized coefficient of determination (CD)-based method, and (iii) an A-optimality-like ranking strategy. To assess predictive performance in identifying genotypes with the highest true breeding values (TBVs), three evaluation metrics were developed. Since TBVs are latent quantities derived from models, simulation experiments based on real genotype data from wheat (Triticum aestivum L.), maize (Zea mays), and rice (Oryza sativa L.) were carried out to assess the proposed methods. Results demonstrated that the A-optimality-like method not only achieved substantial computational efficiency but also generally generated highly informative training sets across a broad range of sizes. However, when constructing small training sets, it occasionally failed to maintain adequate genomic diversity. In such cases, a heuristic-based method is recommended as a more reliable alternative. Overall, the proposed framework provides a flexible and effective approach to optimizing training sets for hybrid breeding, thereby enhancing the accuracy of genomic prediction in practical breeding programs.
1 Introduction
Genomic selection (GS) has become an indispensable strategy in modern plant breeding for the identification of superior genotypes from large breeding populations. The central principle of GS is the capture of quantitative trait loci across the genome using dense molecular markers (Meuwissen et al., 2001). Genomic prediction models are then constructed to estimate genomic estimated breeding values (GEBVs) for individuals in the breeding population based on a training set comprising both phenotype and genotype data. Since the accuracy of GS is highly dependent on the quality of phenotype data and the genomic relationship between the training set and the target population, optimization of the training set has become a critical factor for the successful implementation of GS (Wu et al., 2023).
Considerable research has focused on developing optimization methods for training sets, especially in inbred populations (Muleta et al., 2019; Sabadin et al., 2022; Fernandez-Gonzales et al., 2023). A wide variety of strategies—ranging from minimizing prediction error variance (PEV) to maximizing the generalized coefficient of determination (CD)—have been proposed (Rincent et al., 2012; Isidro et al., 2015; Rincent et al., 2017; Akdemir et al., 2015; Ou and Liao, 2019; Akdemir and Isidro-Sánchez, 2019). These methods, however, generally assume additive marker effects and have been designed primarily for inbred populations. A recent comprehensive review by Alemu et al. (2024) highlighted the development of these approaches and emphasized the need for more flexible methods applicable to a broader range of breeding scenarios.
In practice, plant breeders are often tasked not only with improving inbred lines but also with identifying superior hybrids for commercial deployment. Hybrid breeding exploits heterosis, or hybrid vigor, to enhance yield, stress tolerance, and quality traits in crops. The performance of a hybrid depends on both general combining ability (related to additive effects) and specific combining ability (linked to dominance and epistatic effects). Consequently, training set optimization for hybrid populations must explicitly account for nonadditive genetic components. Despite its practical importance, this area has received far less attention than inbred-focused optimization. For example, Guo et al. (2019) evaluated several data-mining strategies for representative subset selection in hybrid breeding and demonstrated their advantages over random sampling and earlier CD- or PEV-based methods.
More recently, Chen et al. (2024) proposed and validated training set optimization methods for inbred populations that consider only additive effects. Building on these advances, the present study extends such approaches to hybrid populations by integrating both additive and dominance marker effects into predictive modeling. The aim is to identify superior hybrids with the highest true breeding values (TBVs). As TBVs of hybrids are latent quantities derived from models, comprehensive simulation studies based on real genotype data from wheat (Triticum aestivum L.), maize (Zea mays), and rice (Oryza sativa L.), presented in Zhao et al. (2015), Technow et al. (2014), and Xu et al. (2018), respectively, were conducted to evaluate the proposed methods.
This work aims to (i) adapt and evaluate state-of-the-art training set optimization methods for hybrid breeding, (ii) compare their predictive ability and computational efficiency, and (iii) provide practical recommendations for method selection under different training set sizes. Our findings contribute to bridging the methodological gap between inbred- and hybrid-focused optimization and offer breeders flexible tools for enhancing genomic prediction in hybrid breeding programs.
2 Materials and methods
2.1 Genome datasets
Three genome datasets were used in this study and are described as follows.
2.1.1 Wheat
Zhao et al. (2015) investigated the genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding. Their study was based on 135 advanced elite winter wheat lines, comprising 15 male lines and 120 female lines. A set of 1,604 hybrids derived from crosses between the male and female lines was evaluated for grain yield (GY, Mg/ha) across 11 environments, and the adjusted means of GY were available in the dataset. For the genotype data, the 135 parental lines were genotyped with a 90,000-SNP Illumina Infinium array; after quality control, 17,372 high-quality SNP markers were retained. In the current study, the genotype data with 1,604 hybrids and 17,372 SNPs were used for simulation experiments.
2.1.2 Maize
Technow et al. (2014) analyzed a maize dataset comprising 123 dent lines and 86 flint lines, with genotype data consisting of 35,478 SNP markers across the 209 lines. A total of 10,578 single-cross hybrids could be obtained between the 123 dent lines and the 86 flint lines, but only 1,254 of them were phenotyped for two traits: GY (in quintals per hectare) and grain moisture content (GM, in percent). The genotype data with 1,254 hybrids and 35,478 SNPs were used for simulation experiments in our study.
2.1.3 Rice
A hybrid rice population was presented in Xu et al. (2018), consisting of 575 hybrid combinations produced by crossing five male sterile lines with 115 inbred lines. The 120 parental lines were sequenced, and SNPs with a missing rate > 0.2 in the male sterile lines or > 0.5 in the 115 inbred lines were removed, retaining 2,561,889 SNPs. The retained SNPs were then aligned with the 3kRG core SNP set v.4 (996,009 SNPs in total), leaving 116,482 SNPs. These were further filtered by removing markers with a missing rate ≥ 0.05 or a minor allele frequency ≤ 0.1, resulting in 63,735 SNPs. Phenotype data for eight traits were evaluated at two locations. In our study, the genotype data with 575 hybrids and 63,735 SNPs were used for simulation experiments.
Details regarding the quality filtering criteria, imputation methods, and normalization procedures for the above datasets can be found in the cited references. To examine the relative effects of different training set sizes across the crop genome datasets, 500 hybrids were randomly sampled from each dataset to serve as the candidate population. These selected candidate populations were designated as WHEAT.CP, MAIZE.CP, and RICE.CP. Let x_{ij}^{A} be the additive effect score of individual i at SNP j, for i = 1, …, n_c and j = 1, …, p, where n_c is the number of individuals in the candidate population and p is the number of SNP markers. Following VanRaden (2008) and Wu et al. (2019), the additive effect scores were coded as:
\[
x_{ij}^{A} =
\begin{cases}
1, & \text{genotype AA},\\
0, & \text{genotype AB},\\
-1, & \text{genotype BB},
\end{cases}
\]
where AA denotes the homozygote of the major allele, AB denotes the heterozygote, and BB denotes the homozygote of the minor allele at locus j. Moreover, the corresponding dominance effect scores were coded as:
\[
x_{ij}^{D} =
\begin{cases}
1, & \text{genotype AB},\\
0, & \text{genotype AA or BB}.
\end{cases}
\]
Normalized marker scores were then used in analyses throughout this study. Let w_{ij}^{A} and w_{ij}^{D} respectively denote the corresponding normalized marker scores for additive and dominance effects, where
\[
w_{ij}^{A} = \frac{x_{ij}^{A} - \bar{x}_{j}^{A}}{s_{j}^{A}},
\]
and \(\bar{x}_{j}^{A}\) and \(s_{j}^{A}\) are the sample mean and the sample standard deviation of all x_{ij}^{A} values in SNP j. Likewise, the w_{ij}^{D} were obtained in the same manner. Furthermore, genomic relationship matrices for the additive and dominance effects were calculated as:
\[
G_{A} = \frac{W_{A} W_{A}^{\top}}{p} \qquad (\text{Equation 1})
\]
and
\[
G_{D} = \frac{W_{D} W_{D}^{\top}}{p}, \qquad (\text{Equation 2})
\]
where W_{A} and W_{D} are the normalized marker score matrices for additive and dominance effects of the candidate population, respectively.
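As a concrete illustration of the coding and normalization steps above, the following Python sketch (the study itself used R) builds G_A and G_D from additive scores coded 1/0/−1. The function name and the G = WW′/p scaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def genomic_relationship_matrices(geno):
    """Build additive (G_A) and dominance (G_D) genomic relationship
    matrices from additive scores coded 1 (AA), 0 (AB), -1 (BB).

    Each SNP column is centered and scaled by its sample mean and
    standard deviation, and G = W W' / p with p SNPs; the exact
    scaling constant in the paper may differ.
    """
    x_a = np.asarray(geno, dtype=float)      # n x p additive scores
    x_d = (x_a == 0).astype(float)           # dominance score: 1 for AB, else 0
    p = x_a.shape[1]

    def normalize(x):
        mu = x.mean(axis=0)
        sd = x.std(axis=0, ddof=1)
        sd[sd == 0] = 1.0                    # guard against monomorphic SNPs
        return (x - mu) / sd

    w_a, w_d = normalize(x_a), normalize(x_d)
    return w_a @ w_a.T / p, w_d @ w_d.T / p
```

Both returned matrices are symmetric n × n relationship matrices whose diagonals reflect each hybrid's genomic variance contribution.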
2.2 Training set optimization methods
The methods for training set optimization were derived mainly from two kinds of statistical models: whole genome regression (WGR) models and genomic best linear unbiased prediction (GBLUP) models.
2.2.1 A whole-genome regression model-based method
The WGR model, including both additive and dominance effects, is expressed as:
\[
y = \mu 1_{n} + X_{A} a + X_{D} d + e, \qquad (\text{Equation 3})
\]
where y represents the vector of phenotypic values; \(\mu\) denotes the constant term; \(1_{n}\) is the unit vector of length n (with n being the number of phenotypic values); \(X_{A}\) and \(X_{D}\) are the marker score matrices; a corresponds to the vector of additive marker effects; d corresponds to the vector of dominance marker effects; and e denotes the vector of random errors.
The model described in Equation 3 can be rewritten as:
\[
y = \mu 1_{n} + X \beta + e,
\]
where \(X = [X_{A}, X_{D}]\) and \(\beta = (a^{\top}, d^{\top})^{\top}\). Following Chen et al. (2024), a mean square of prediction error (MSPE) criterion has been proposed for evaluating the entire candidate population:
where \(X_{c}\) and \(X_{t}\) are the merged marker score matrices of additive and dominance effects for the candidate population and the training set, respectively. Here, \(n_{t}\) is the training set size and \(\lambda\) is the regularization parameter. Note that the curly brackets in Equation 4 were missing in the original formulae presented by Chen et al. (2024); the correct formulae can be found in Appendix B of their paper. Minimizing the MSPE in Equation 4 is equivalent to minimizing the following simplified version:
To reduce computing time, the marker scores were replaced by principal component (PC) scores, retaining the leading PCs such that the cumulative explained variance exceeded 99%. Moreover, the regularization parameter \(\lambda\) was fixed at 1 when calculating the MSPE in the current study.
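The dimension-reduction step can be sketched as follows, assuming a standard SVD-based principal component computation; the function name and threshold handling are illustrative, not the authors' code.

```python
import numpy as np

def pc_scores(w, threshold=0.99):
    """Replace centered marker scores with principal-component scores,
    keeping the smallest number of PCs whose cumulative explained
    variance reaches `threshold` (0.99 in the text)."""
    w = w - w.mean(axis=0)                        # center each SNP column
    u, s, _ = np.linalg.svd(w, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)         # cumulative explained variance
    k = int(np.searchsorted(frac, threshold) + 1) # smallest k meeting threshold
    return u[:, :k] * s[:k]                       # n x k PC score matrix
```

For unstructured data nearly all components are needed, whereas highly structured (low-rank) marker data collapses to a few PCs, which is what makes this step worthwhile for dense SNP panels.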
2.2.2 A genomic best linear unbiased prediction model-based method
A GBLUP model considering both additive and dominance effects was used to construct training sets, which can be described as:
\[
y = \mu 1_{n} + g_{A} + g_{D} + e, \qquad (\text{Equation 6})
\]
where \(\mu\) is the general mean; \(g_{A}\) and \(g_{D}\) are respectively the vectors of genotypic values for the additive and dominance effects; and e is the vector of random errors. It is assumed that \(g_{A}\), \(g_{D}\), and e are mutually independent and follow multivariate normal distributions, \(g_{A} \sim N(0, \sigma_{A}^{2} G_{A})\), \(g_{D} \sim N(0, \sigma_{D}^{2} G_{D})\), and \(e \sim N(0, \sigma_{e}^{2} I_{n})\), where \(\sigma_{A}^{2}\) and \(\sigma_{D}^{2}\) are respectively the genetic variances for additive and dominance effects, \(G_{A}\) and \(G_{D}\) are the corresponding genomic relationship matrices, and \(\sigma_{e}^{2}\) is the error variance.
The GBLUP model described in Equation 6 can be rewritten as:
\[
y = \mu 1_{n} + g + e, \qquad (\text{Equation 7})
\]
where \(g = g_{A} + g_{D}\) and \(\mathrm{Var}(g) = \sigma_{A}^{2} G_{A} + \sigma_{D}^{2} G_{D}\). Following Chen et al. (2024), a heuristic-based CD method for evaluating the entire candidate population can be described as follows. Let \(g_{t}\) be the vector of true genomic values for a selected training set as described in Equation 7. The BLUP for \(g_{t}\) is given by:
where \(V_{t} = \sigma_{A}^{2} G_{A,t} + \sigma_{D}^{2} G_{D,t} + \sigma_{e}^{2} I_{n_{t}}\), \(J_{n_{t}}\) is the square matrix of order \(n_{t}\) with all elements equal to \(1/n_{t}\), and \(y_{t}\) consists of the phenotypic values in the training set. Here, \(G_{A,t}\) and \(G_{D,t}\) are the additive and dominance relationship matrices for the training set. Moreover, let \(g_{c}\) be the vector of true genomic values in the candidate population; the BLUP for \(g_{c}\) is then given by:
where \(\Sigma_{ct} = \sigma_{A}^{2} G_{A,ct} + \sigma_{D}^{2} G_{D,ct}\). Here, \(G_{A,ct}\) and \(G_{D,ct}\) are the additive and dominance relationship matrices between the candidate population and the training set. Furthermore, the variance–covariance matrix for \(\hat{g}_{c}\) is equal to the covariance matrix between \(\hat{g}_{c}\) and \(g_{c}\), which is:
Let \(v_{i}\) denote the i-th diagonal element of the matrix in Equation 8, and let \(u_{i}\) denote the corresponding diagonal element of \(\sigma_{A}^{2} G_{A,c} + \sigma_{D}^{2} G_{D,c}\), for \(i = 1, \dots, n_{c}\), where \(G_{A,c}\) and \(G_{D,c}\) separately denote the additive and dominance relationship matrices for the candidate population. A heuristic-based CD criterion is given by:
Maximizing CDmean(v2) in Equation 9 is equivalent to maximizing the mean of squared correlations between the true and estimated genotypic values in the candidate population. The variance parameters \(\sigma_{A}^{2}\) and \(\sigma_{D}^{2}\) were both fixed at 1 when calculating CDmean(v2), because the selection of training sets based on the CD criterion is usually not affected by the setting of these parameter values (Rio et al., 2022; Fernandez-Gonzales et al., 2023). We further discuss the robustness of this setting in the Discussion. An exchange algorithm implemented in the R package TSDFGS (Ou, 2022) was utilized to optimize training sets by minimizing the MSPE in Equation 5 or maximizing CDmean(v2) in Equation 9.
2.2.3 A-optimality-like method
The A-optimality criterion is used in classical optimal design to minimize the average variance of the parameter estimates. Here, by contrast, an A-optimality-like criterion was utilized to construct training sets that maximize genomic variation. This is because the probability of correctly identifying the truly superior genotypes in a candidate population is enhanced when the training set explains as much of the variation in genotypic values as possible (Tanaka and Iwata, 2018; Tsai et al., 2021; Tu and Liao, 2024).
For a fixed training set size \(n_{t}\), Chen et al. (2024) proposed an A-optimality-like method to construct training sets such that the average variance of the selected genotypic values is maximized. The method, modified for the GBLUP model in Equation 7, can be described as:
where S denotes the set comprising all possible subsets with \(n_{t}\) candidates, and each subset is evaluated through the corresponding submatrix of the genomic variance–covariance matrix \(\sigma_{A}^{2} G_{A} + \sigma_{D}^{2} G_{D}\). The variance components \(\sigma_{A}^{2}\) and \(\sigma_{D}^{2}\) were both fixed at 1 when calculating the criterion. The method ranks all candidates based on their individually calculated scores and selects the top-ranked ones, eliminating the need for computationally intensive search algorithms.
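Because the per-candidate score is not printed above, the following sketch makes one plausible reading concrete: score each candidate by its diagonal element of \(\sigma_{A}^{2} G_{A} + \sigma_{D}^{2} G_{D}\) (with both variance components set to 1, as in the text) and keep the \(n_{t}\) highest-scoring candidates. This is an assumption for illustration, not the authors' exact criterion.

```python
import numpy as np

def a_opt_like_training_set(G_A, G_D, n_t):
    """Rank-and-select sketch of the A-optimality-like method.

    Assumed score: each candidate's diagonal element of G_A + G_D
    (its genomic variance under the GBLUP model with both variance
    components fixed at 1). No exchange algorithm is needed.
    """
    scores = np.diag(G_A) + np.diag(G_D)
    order = np.argsort(scores)[::-1]     # candidates by decreasing score
    return order[:n_t]                   # indices of the selected training set
```

The single sort is what makes this approach orders of magnitude faster than exchange-algorithm searches over subsets.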
2.3 Evaluation metrics
Let \(\hat{g}_{(1)} \ge \hat{g}_{(2)} \ge \dots \ge \hat{g}_{(n_c)}\) be the GEBVs, sorted in descending order, corresponding to the TBVs in a candidate population, where the ordering is a permutation of \(1, \dots, n_{c}\). Notably, the values \(r_{1}, \dots, r_{n_c}\), defined as the ranks in TBVs corresponding to the sorted GEBVs, need not coincide with \(1, \dots, n_{c}\). Three evaluation metrics given below were used to assess the predictive performance of a training set in identifying the best genotypes with the highest TBVs in the candidate population.
2.3.1 Normalized discounted cumulative gain
Blondel et al. (2015) proposed the normalized discounted cumulative gain (NDCG) score at position k, defined as follows. The discounted cumulative gains (DCG) at position k of the predicted ranking and of the ideal ranking based on a training set are:
\[
\mathrm{DCG@}k^{\mathrm{pred}} = \sum_{i=1}^{k} g\!\left(y_{(i)}\right) d(i) \qquad (\text{Equation 10})
\]
and
\[
\mathrm{DCG@}k^{\mathrm{ideal}} = \sum_{i=1}^{k} g\!\left(y_{[i]}\right) d(i), \qquad (\text{Equation 11})
\]
where \(y_{(i)}\) is the TBV of the genotype ranked i-th by GEBV, \(y_{[i]}\) is the i-th largest TBV, \(g(\cdot)\) is a monotonically increasing gain function, and \(d(\cdot)\) is a monotonically decreasing discount function. In this study, a linear gain function and a decreasing discount function were used in Equations 10 and 11. The NDCG score for identifying the best k genotypes is then defined as:
\[
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k^{\mathrm{pred}}}{\mathrm{DCG@}k^{\mathrm{ideal}}}. \qquad (\text{Equation 12})
\]
The NDCG@k value ranges from 0 to 1, with higher values indicating better performance of the training set.
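A minimal NDCG@k implementation consistent with the description above might look as follows. The 1/log2(i+1) discount is the common NDCG default and an assumption here, since the exact discount used in the study is not printed.

```python
import numpy as np

def ndcg_at_k(tbv, gebv, k):
    """NDCG@k with a linear gain on the (nonnegative) TBVs and the
    common 1/log2(i+1) discount; the predicted ranking sorts by GEBV,
    the ideal ranking sorts by TBV."""
    tbv, gebv = np.asarray(tbv, float), np.asarray(gebv, float)
    disc = 1.0 / np.log2(np.arange(2, k + 2))   # discounts for positions 1..k
    pred = np.argsort(gebv)[::-1][:k]           # top-k by GEBV
    ideal = np.argsort(tbv)[::-1][:k]           # top-k by TBV
    return np.sum(tbv[pred] * disc) / np.sum(tbv[ideal] * disc)
```

A perfect ranking yields exactly 1; misplacing high-TBV genotypes lowers the score toward 0.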
2.3.2 Spearman’s rank correlation
Spearman’s rank correlation (SRC) was applied to measure the linear relationship between the ranks of the top k genotypes with the highest GEBVs and their true ranks in TBVs. It is defined as:
\[
\mathrm{SRC@}k = 1 - \frac{6 \sum_{i=1}^{k} d_{i}^{2}}{k\,(k^{2} - 1)},
\]
where \(d_{i} = i - r_{i}\) and \(r_{i}\) is the rank in TBVs of the genotype with the i-th highest GEBV. SRC@k is the Pearson correlation calculated from the paired values \((i, r_{i})\) for \(i = 1, \dots, k\).
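The rank construction and the classical Spearman formula above can be sketched directly (assuming no ties in TBVs or GEBVs):

```python
import numpy as np

def src_at_k(tbv, gebv, k):
    """Spearman's rank correlation between the predicted ranks 1..k of
    the top-k genotypes by GEBV and their true ranks in TBVs, via the
    classical 1 - 6*sum(d^2)/(k(k^2-1)) formula (no ties assumed)."""
    tbv, gebv = np.asarray(tbv, float), np.asarray(gebv, float)
    n = len(tbv)
    true_rank = np.empty(n)
    true_rank[np.argsort(tbv)[::-1]] = np.arange(1, n + 1)  # rank 1 = best TBV
    top = np.argsort(gebv)[::-1][:k]                        # top-k by GEBV
    d = np.arange(1, k + 1) - true_rank[top]                # rank differences
    return 1.0 - 6.0 * np.sum(d**2) / (k * (k**2 - 1))
```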
2.3.3 Rank sum ratio
The sum of the ranks in TBVs corresponding to the top k genotypes with the highest GEBVs is \(\sum_{i=1}^{k} r_{i}\), where \(r_{i}\) is the rank in TBVs of the genotype with the i-th highest GEBV, and the sum of ranks of the ideal ranking is \(k(k+1)/2\). The ratio of these two rank sums indicates the ability of the training set to identify the best k genotypes and is defined as:
\[
\mathrm{RS\_ratio@}k = \frac{k(k+1)/2}{\sum_{i=1}^{k} r_{i}}.
\]
RS_ratio@k also ranges from 0 to 1, taking the value of 1 if the top k genotypes with the highest GEBVs exactly match the k genotypes with the highest TBVs.
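A direct implementation of this metric, under the reading that the ideal rank sum is divided by the realized rank sum, could be:

```python
import numpy as np

def rs_ratio_at_k(tbv, gebv, k):
    """RS_ratio@k: the ideal rank sum k(k+1)/2 divided by the sum of
    the true TBV ranks of the top-k genotypes selected by GEBV; a
    perfect selection yields 1, worse selections yield smaller values."""
    tbv, gebv = np.asarray(tbv, float), np.asarray(gebv, float)
    n = len(tbv)
    true_rank = np.empty(n)
    true_rank[np.argsort(tbv)[::-1]] = np.arange(1, n + 1)  # rank 1 = best TBV
    top = np.argsort(gebv)[::-1][:k]                        # top-k by GEBV
    return (k * (k + 1) / 2) / np.sum(true_rank[top])
```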
2.4 Simulation studies
Based on the GBLUP model in Equation 6, phenotype data were simulated to evaluate the performance of the construction methods, namely MSPE, CDmean(v2), and the A-optimality-like method. Random sampling was also performed as a baseline method. For each candidate population (WHEAT.CP, MAIZE.CP, and RICE.CP), the additive and dominance relationship matrices \(G_{A}\) and \(G_{D}\) were first computed using Equations 1 and 2. Subsequently, training sets were generated by each construction method at sizes of 50, 100, 150, and 200.
From historical studies on crops such as wheat, rice, and maize (Hassan et al., 2013; Chen et al., 2023; Barreto et al., 2024; Al-Daej, 2023), 31 estimates of the dominance-to-additive variance ratio \(\sigma_{D}^{2}/\sigma_{A}^{2}\) were available for various traits. The mean and standard deviation of the 31 estimated values were 1.43 and 2.56, respectively. Therefore, in this simulation study the model parameters were fixed at several levels of \(\sigma_{D}^{2}/\sigma_{A}^{2}\) spanning this empirical range, with heritability \(h^{2}\) set at multiple levels up to 0.6. Accordingly, 2,000 datasets were generated for each parameter setting, i.e., each combination of \(\sigma_{D}^{2}/\sigma_{A}^{2}\), \(h^{2}\), and \(n_{t}\). Simulated TBVs were obtained as the fixed general mean plus the simulated genotypic values \(g_{A} + g_{D}\), and simulated phenotypic values were obtained as the simulated TBVs plus the residual e. The mean values and the standard deviations of NDCG@k, SRC@k, and RS_ratio@k with k = 20 across the 2,000 simulated datasets were calculated to compare the methods. Moreover, the relative improvement percentages (RIPs) across different metrics and optimization methods, calculated relative to the random sampling method, were also reported. The RIP can be expressed as:
\[
\mathrm{RIP} = \frac{M_{\mathrm{opt}} - M_{\mathrm{rs}}}{M_{\mathrm{rs}}} \times 100\%. \qquad (\text{Equation 13})
\]
Here, \(M_{\mathrm{opt}}\) represents the mean value of a given metric under a specific optimization method, and \(M_{\mathrm{rs}}\) denotes the corresponding mean value obtained from the random sampling method. The Bayesian reproducing kernel Hilbert space (RKHS) method in the R package BGLR (Pérez and de los Campos, 2014) was used to perform the GEBV prediction.
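The RIP of Equation 13 amounts to a one-line computation:

```python
def rip(method_mean, random_mean):
    """Relative improvement percentage (Equation 13): the change of a
    metric's mean under an optimization method relative to the mean
    under random sampling, in percent."""
    return 100.0 * (method_mean - random_mean) / random_mean
```

For instance, a method mean of 0.90 against a random-sampling mean of 0.75 gives an RIP of 20%, while a method mean below the baseline gives a negative RIP.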
3 Results
The mean values and the standard deviations of the evaluation metrics for identifying the top 20 genotypes across 2,000 simulation runs for each dataset are presented in Figures 1–3. The RIPs, as defined in Equation 13, for all datasets are provided in Supplementary Tables S1–S9. In addition, the minimum and maximum RIPs across all simulation scenarios for each combination of evaluation metric and training set construction method are summarized in Tables 1–3. The key findings are as follows:
Figure 1. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2,000 simulated datasets based on the training set construction methods in the WHEAT.CP dataset.
Figure 2. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2,000 simulated datasets based on the training set construction methods in the MAIZE.CP dataset.
Figure 3. The means and the standard deviations of the resulting values of evaluation metrics for identifying the best 20 genotypes over 2,000 simulated datasets based on the training set construction methods in the RICE.CP dataset.
Table 1. The minimum and maximum relative improvement percentages across all simulation scenarios for the WHEAT.CP dataset.
Table 2. The minimum and maximum relative improvement percentages across all simulation scenarios for the MAIZE.CP dataset.
Table 3. The minimum and maximum relative improvement percentages across all simulation scenarios for the RICE.CP dataset.
i. Performance of construction methods: Overall, the A-optimality-like method generally outperformed the other three methods, and the heuristic-based methods performed slightly better than random sampling across most datasets and evaluation metrics. An exception occurred in the RICE.CP dataset, where the A-optimality-like method underperformed the other methods at n_t = 50 for SRC@k and RS_ratio@k (Figure 3).
ii. Random sampling as a baseline: The random sampling method consistently exhibited competitive performance relative to the MSPE and CDmean(v2) methods, suggesting that the heuristic-based optimization methods do not always confer large advantages.
iii. Dataset-specific differences: Method performance varied across datasets. In the MAIZE.CP dataset, the A-optimality-like method substantially outperformed the other three methods across all evaluation metrics. By contrast, in WHEAT.CP and RICE.CP, the four methods showed more similar levels of performance.
iv. Effect of the dominance-to-additive variance ratio (σ²_D/σ²_A): For fixed h² and n_t, the predictive ability of all methods gradually declined as σ²_D/σ²_A increased, indicating that a higher relative dominance variance reduced accuracy.
v. Effect of training set size (n_t): For fixed σ²_D/σ²_A and h², the performance of all methods improved with increasing training set size, consistent with expectations from GS theory.
vi. Effect of heritability (h²): For fixed σ²_D/σ²_A and n_t, all methods benefited from higher heritability, highlighting the importance of genetic signal strength in prediction accuracy.
vii. Metric-specific comparisons: Across datasets and scenarios, the differences among construction methods were generally smaller for NDCG@k than for SRC@k and RS_ratio@k, suggesting that NDCG@k may be less sensitive to differences in training set optimization.
4 Discussion
The simulation studies demonstrated that the A-optimality-like method generally outperformed the MSPE, CDmean(v2), and random sampling methods across the datasets and training set sizes, except when the training set size was small (n_t = 50) in the RICE.CP dataset. Note that the minimal RIPs of the metrics in RICE.CP for the A-optimality-like method occurred in cases with n_t = 50. This decline in performance likely reflects reduced diversity in the training set. The A-optimality-like method prioritizes candidates with the highest genomic variation, which may inadvertently select individuals with high relatedness, thereby limiting the genetic diversity of the training set. For example, if two candidates with the highest genomic variation are nearly identical, the method would likely select both, reducing the overall genomic diversity of the training set. In contrast, the heuristic-based approaches use exchange algorithms that help maintain diversity, which is especially beneficial for small training sets.
The MSPE and CDmean(v2) methods for training set selection were primarily designed to maximize the relationship between the training set and the candidate population. In contrast, the A-optimality-like method focuses on maximizing the variation of dominance effects, thereby expanding the search space for identifying superior hybrids. Notably, it exhibited a clear advantage in the maize dataset, likely because maize displays more pronounced dominance effects than wheat and rice. The genomic variance attributable to dominance effects for genotypes in the WHEAT.CP, MAIZE.CP, and RICE.CP datasets is displayed in Figure 4, indicating that maize exhibited substantially stronger dominance effects than wheat and rice. This finding helps explain why the A-optimality-like method showed a marked advantage over the other methods in the MAIZE.CP dataset. Moreover, small training sets can hinder accurate estimation of dominance relationships among hybrids, which may also explain its weaker performance in small training set scenarios.
Figure 4. Histograms of genomic variances attributable to dominance effects for genotypes in the WHEAT.CP, MAIZE.CP, and RICE.CP datasets.
Another consistent observation from the results was that differences among construction methods were smaller for NDCG@k than for SRC@k and RS_ratio@k. This can be attributed to the fact that NDCG@k in Equation 12 incorporates both the simulated TBV and a discount function. In our simulations, the simulated TBV was defined as the sum of the general mean and the simulated genotypic value, with the general mean set substantially larger than the genotypic value. As a result, the metric became less sensitive to differences in DCG values between the predicted and ideal rankings. This interpretation is also supported by the fact that the estimates for NDCG@k had much smaller standard deviations than those for SRC@k and RS_ratio@k (Figures 1–3).
To check whether the MSPE and CDmean(v2) methods are affected by the parameter settings when constructing training sets, 20 random training sets, each comprising 50 genotypes, were chosen from the MAIZE.CP dataset, which exhibits stronger dominance effects than the WHEAT.CP and RICE.CP datasets. The MSPE values in Equation 5 were calculated at λ = 0.01, 0.1, 1, 10, and 100, and the CDmean(v2) values in Equation 9 were calculated with (σ²_A, σ²_D) set to the default and four alternative settings for the 20 training sets. These results are displayed in Supplementary Tables S10, S11. For the MSPE, the results in Supplementary Table S10 showed that the rankings of candidate training sets remained identical across the λ values, indicating robustness to the choice of λ. From Supplementary Table S11, Pearson’s correlation and Spearman’s rank correlation between the resulting values at the default setting (σ²_A, σ²_D) = (1, 1) and the other settings are displayed in Table 4. Both correlation coefficients decreased as the discrepancy in the dominance variance component increased. However, the correlation analyses across different weight settings of the additive and dominance components still demonstrated strong concordance with the default setting, supporting its use in practice. To check the robustness of the A-optimality-like method, its score was calculated with (σ²_A, σ²_D) set to the default and the same four alternative settings. The overlap proportions of the top 50 candidates between the default setting and the other settings are displayed in Table 5. These results showed high stability under alternative parameter settings of the variance components, further reinforcing the method's practical utility with default parameters. The robustness verification based solely on these rank concordance tests may be insufficient; a more comprehensive sensitivity analysis, considering a wider range of parameter values, runtime comparisons, and changes in evaluation metrics, would provide stronger evidence. We plan to address this issue in future work.
Table 4. Pearson’s correlations (r) and Spearman’s rank correlations (SRC) over the CDmean(v2) values of the 20 random training sets between the default setting, (σ²_A, σ²_D) = (1, 1), and the other four settings in the MAIZE.CP dataset.
Table 5. The overlap proportion of the top 50 candidates between the default setting, (σ²_A, σ²_D) = (1, 1), and the other four settings for the A-optimality-like method in the MAIZE.CP dataset.
We employed the randomly selected WHEAT.CP, MAIZE.CP, and RICE.CP datasets throughout our study. To evaluate the potential effect of stochastic sampling of candidate populations, we conducted additional simulations across 50 randomly sampled candidate populations per dataset. These simulation results, displayed in Supplementary Figures S1–S3, were consistent with those obtained from a single candidate population (Figures 1–3), confirming the robustness of our conclusions. Notably, the A-optimality-like method continued to underperform for small training sets in the rice dataset, particularly for SRC@k and RS_ratio@k at n_t = 50 (Supplementary Figure S3), an outcome consistent with earlier findings (Figure 3).
The ranking agreement metrics of NDCG@k, SRC@k, and RS_ratio@k may have limited biological interpretability, as they are not directly linked to actual genetic gain. Nevertheless, because the present study focuses on identifying the truly superior genotypes within a candidate population, these metrics remain valuable for quantifying information gain from the selected training sets. There is, however, still room for developing evaluation metrics with stronger biological relevance. Furthermore, the simulation framework in this study was primarily based on the GBLUP model described in Equation 6, which incorporates most factors that may influence the identification of superior genotypes in hybrid populations. Despite this, the model still lacks some degree of practical and biological realism. Future studies could improve upon this by relaxing the assumptions of normality and independence of genotypic values derived from additive and dominance effects. Alternatively, empirical estimates of model parameters from real datasets could be employed to enhance biological validity.
The simulation studies conducted for performance comparison evaluated the effects of the training sets within a single breeding cycle. However, it is well recognized that other key factors, such as genetic variation, can substantially influence genetic gain over multiple breeding cycles (Lehermeier et al., 2017; Chung and Liao, 2022). Future research could therefore explore the long-term impact of training sets constructed using the proposed methods. Moreover, the phenotypic value of a trait is influenced by genotype, environment, and genotype-by-environment interactions. In practice, local environmental conditions can markedly affect hybrid performance during the growth period. Consequently, superior hybrids identified through simulation may not necessarily perform as expected under real field conditions. Thus, conducting extensive field experiments across different crops and environments to validate the main findings of this study would be highly valuable.
The ranking method based on the A-optimality-like score is expected to be substantially more computationally efficient than the heuristic-based MSPE and CDmean(v2) approaches. Nevertheless, concerns may still arise regarding the actual computation time. Accordingly, Table 6 reports the runtimes required to generate training sets of sizes 50, 100, 150, and 200 for the WHEAT.CP, MAIZE.CP, and RICE.CP datasets. The runtimes ranged from 745.08 to 6,602.93 min, from 148.18 to 1,359.55 min, and from 0.001 to 0.0276 s for the MSPE, CDmean(v2), and A-optimality-like methods, respectively. It should be noted that computational time may escalate as the training set size or candidate population size increases. The time requirement, however, can be mitigated by upgraded hardware and optimized software implementations.
Table 6. Runtime of executing R code for the construction methods to generate training sets in the datasets.
Overall, these results reinforce the value of the A-optimality-like method as a computationally efficient and generally competitive method for training set optimization in hybrid populations. While Chen et al. (2024) showed its competitiveness in inbred populations, the present study extends its utility to hybrids by incorporating dominance effects into the construction framework. The A-optimality-like method can thus be recommended for medium-to-large training sets where efficiency and scalability are critical. For small training sets, however, a heuristic-based method remains preferable owing to its ability to safeguard diversity and maintain predictive accuracy. Finally, the practical contribution of this study is underscored by the availability of R code implementing the proposed methods (https://github.com/spcspin/AD), providing breeders with accessible tools for optimizing training sets in hybrid breeding programs.
5 Conclusion
This study provides a comprehensive evaluation of training set optimization methods for GS in hybrid populations by explicitly incorporating both additive and dominance effects. Our findings demonstrate that the A-optimality-like method offers a fast and effective strategy for constructing training sets of medium to large sizes, while a heuristic-based method is more reliable in small-sample scenarios, where maintaining genetic diversity is critical. These insights not only extend existing optimization frameworks from inbred to hybrid breeding but also provide practical recommendations for method selection under varying breeding program conditions. Future research could explore integrating epistatic effects, multienvironment data, and dynamic training set updating strategies to further enhance the accuracy and robustness of genomic prediction in hybrid breeding.
Data availability statement
WHEAT.CP, MAIZE.CP, and RICE.CP datasets and the 20 training sets of size 50 sampled from the MAIZE.CP dataset for the robustness verification can be downloaded from the GitHub repository (https://github.com/spcspin/AD).
Author contributions
S-PC: Software, Writing – original draft, Investigation, Data curation. C-TL: Writing – original draft, Funding acquisition, Project administration, Methodology, Supervision, Writing – review & editing, Investigation, Validation, Conceptualization.
Funding
The author(s) declare that financial support was received for this work and/or its publication. This work was supported by the Ministry of Science and Technology, Taiwan (Grant No. MOST 112-2118-M-002-003-MY2).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that Generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1699491/full#supplementary-material
References
Akdemir, D. and Isidro-Sánchez, J. (2019). Design of training population for selective phenotyping in genomic prediction. Sci. Rep. 9, 1–15. doi: 10.1038/s41598-018-38081-6, PMID: 30723226
Akdemir, D., Sanchez, J. I., and Jannink, J. L. (2015). Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 47, 1–10. doi: 10.1186/s12711-015-0116-6, PMID: 25943105
Al-Daej, M. I. (2023). Identifying the combining ability and genetic components of some rice agronomic traits (Oryza sativa L.). J. King Saud University-Science 35, 102657. doi: 10.1016/j.jksus.2023.102657
Alemu, A., Åstrand, J., Montesinos-López, O. A., Isidro y Sánchez, J., Fernández-Gónzalez, J., Tadesse, W., et al. (2024). Genomic selection in plant breeding: Key factors shaping two decades of progress. Mol. Plant 17, 552–578. doi: 10.1016/j.molp.2024.03.007, PMID: 38475993
Barreto, C. A. V., das Graças Dias, K. O., de Sousa, I. C., Azevedo, C. F., Nascimento, A. C. C., Guimarães, L. J. M., et al. (2024). Genomic prediction in multi-environment trials in maize using statistical and machine learning methods. Sci. Rep. 14, 1062. doi: 10.1038/s41598-024-51792-3, PMID: 38212638
Blondel, M., Onogi, A., Iwata, H., and Ueda, N. (2015). A ranking approach to genomic selection. PloS One 10, e0128570. doi: 10.1371/journal.pone.0128570, PMID: 26068103
Chen, S. P., Sung, W. H., and Liao, C. T. (2024). Constructing training sets for genomic selection to identify superior genotypes in candidate populations. Theor. Appl. Genet. 137, 270. doi: 10.1007/s00122-024-04766-y, PMID: 39550734
Chen, S. P., Tung, C. W., Wang, P. H., and Liao, C. T. (2023). A statistical package for evaluation of hybrid performance in plant breeding via genomic selection. Sci. Rep. 13, 12204. doi: 10.1038/s41598-023-39434-6, PMID: 37500737
Chung, P. Y. and Liao, C. T. (2022). Selection of parental lines for plant breeding via genomic selection. Front. Plant Sci. 13, 934767. doi: 10.3389/fpls.2022.934767, PMID: 35968112
Fernández-González, J., Akdemir, D., and Isidro y Sánchez, J. (2023). A comparison of methods for training population optimization in genomic selection. Theor. Appl. Genet. 136, 30. doi: 10.1007/s00122-023-04265-6, PMID: 36892603
Guo, T., Yu, X., Li, X., Zhang, H., Zhu, C., Flint-Garcia, S., et al. (2019). Optimal designs for genomic selection in hybrid crops. Mol. Plant 12, 390–401. doi: 10.1016/j.molp.2018.12.022, PMID: 30625380
Hassan, M. S., El-Said, R. A. R., and Abd-El-Haleem, S. H. M. (2013). Estimation of heritability and variance components for some quantitative traits in bread wheat (Triticum aestivum L.). World Appl. Sci. J. 27, 944–949.
Isidro, J., Jannink, J. L., Akdemir, D., Poland, J., Heslot, N., and Sorrells, M. E. (2015). Training set optimization under population structure in genomic selection. Theor. Appl. Genet. 128, 145–158. doi: 10.1007/s00122-014-2418-4, PMID: 25367380
Lehermeier, C., Teyssedre, S., and Schon, C. C. (2017). Genetic gain increases by applying the usefulness criterion with improved variance prediction in selection of crosses. Genetics 207, 1651–1661. doi: 10.1534/genetics.117.300403, PMID: 29038144
Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi: 10.1093/genetics/157.4.1819, PMID: 11290733
Muleta, K. T., Pressoir, G., and Morris, G. P. (2019). Optimizing genomic selection for a sorghum breeding program in Haiti: A simulation study. G3 Genes|Genomes|Genetics 9, 391–401. doi: 10.1534/g3.118.200932, PMID: 30530641
Ou, J. H. (2022). TSDFGS: Training set determination for genomic selection (R package version 2.0). doi: 10.32614/CRAN.package.TSDFGS
Ou, J. H. and Liao, C. T. (2019). Training set determination for genomic selection. Theor. Appl. Genet. 132, 2781–2792. doi: 10.1007/s00122-019-03387-0, PMID: 31267147
Pérez, P. and de los Campos, G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. doi: 10.1534/genetics.114.164442, PMID: 25009151
Rincent, R., Charcosset, A., and Moreau, L. (2017). Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theor. Appl. Genet. 130, 2231–2247. doi: 10.1007/s00122-017-2956-7, PMID: 28795202
Rincent, R., Laloë, D., Nicolas, S., Altmann, T., Brunel, D., Revilla, P., et al. (2012). Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192, 715–728. doi: 10.1534/genetics.112.141473, PMID: 22865733
Rio, S., Akdemir, D., Carvalho, T., and Sanchez, J. I. Y. (2022). Assessment of genomic prediction reliability and optimization of experimental designs in multi-environment trials. Theor. Appl. Genet. 135, 405–419. doi: 10.1007/s00122-021-03972-2, PMID: 34807267
Sabadin, F., DoVale, J. C., Platten, J. D., and Fritsche-Neto, R. (2022). Optimizing self-pollinated crop breeding employing genomic selection: From schemes to updating training sets. Front. Plant Sci. 13, 935885. doi: 10.3389/fpls.2022.935885, PMID: 36275547
Tanaka, R. and Iwata, H. (2018). Bayesian optimization for genomic selection: a method for discovering the best genotype among a large number of candidates. Theor. Appl. Genet. 131, 93–105. doi: 10.1007/s00122-017-2988-z, PMID: 28986680
Technow, F., Schrag, T. A., Schipprack, W., Bauer, E., Simianer, H., and Melchinger, A. E. (2014). Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197, 1343–1355. doi: 10.1534/genetics.114.165860, PMID: 24850820
Tsai, S. F., Shen, C. C., and Liao, C. T. (2021). Bayesian approaches for identifying the best genotype from a candidate population. J. Agric. Biol. Environ. Stat. 26, 519–537. doi: 10.1007/s13253-021-00454-2
Tu, H. N. and Liao, C. T. (2024). A modified Bayesian approach for determining a training set to identify the best genotypes from a candidate population in genomic selection. J. Agric. Biol. Environ. Stat., 1–22. doi: 10.1007/s13253-024-00641-x
VanRaden, P. M. (2008). Efficient methods to compute genomic prediction. J. Dairy Sci. 91, 4414–4423. doi: 10.3168/jds.2007-0980, PMID: 18946147
Wu, P. Y., Ou, J. H., and Liao, C. T. (2023). Sample size determination for training set optimization in genomic prediction. Theor. Appl. Genet. 136, 57. doi: 10.1007/s00122-023-04254-9, PMID: 36912999
Wu, P. Y., Tung, C. W., Lee, C. Y., and Liao, C. T. (2019). Genomic prediction of pumpkin hybrid performance. Plant Genome 12, 180082. doi: 10.3835/plantgenome2018.10.0082, PMID: 31290920
Xu, Y., Wang, X., Ding, X., Zheng, X., Yang, Z., Xu, C., et al. (2018). Genomic selection of agronomic traits in hybrid rice using an NCII population. Rice 11, 32. doi: 10.1186/s12284-018-0223-4, PMID: 29748895
Keywords: genomic best linear unbiased prediction model, genomic prediction, hybrid performance, plant breeding, whole genome regression model
Citation: Chen S-P and Liao C-T (2026) Optimizing training sets to identify superior genotypes in hybrid populations. Front. Plant Sci. 16:1699491. doi: 10.3389/fpls.2025.1699491
Received: 05 September 2025; Revised: 04 December 2025; Accepted: 09 December 2025;
Published: 15 January 2026.
Edited by:
Madhav Bhatta, Bayer Crop Science, United States
Reviewed by:
Venkata Naresh Boddepalli, Iowa State University, United States
Felipe Sabadin, Utah State University, United States
Copyright © 2026 Chen and Liao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chen-Tuo Liao, ctliao@ntu.edu.tw
†Present address: Szu-Ping Chen, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, United States