Genomic Prediction and the Practical Breeding of 12 Quantitative-Inherited Traits in Cucumber (Cucumis sativus L.)

Genomic prediction is an effective way for predicting complex traits, and it is becoming more essential in horticultural crop breeding. In this study, we applied genomic prediction in the breeding of cucumber plants. Eighty-one cucumber inbred lines were genotyped and 16,662 markers were identified to represent the genetic background of cucumber. Two populations, namely, diallel cross population and North Carolina II population, having 268 combinations in total were constructed from 81 inbred lines. Twelve cucumber commercial traits of these two populations in autumn 2018, spring 2019, and spring 2020 were collected for model training. General combining ability (GCA) models under five-fold cross-validation and cross-population validation were applied to model validation. Finally, the GCA performance of 81 inbred lines was estimated. Our results showed that the predictive ability for 12 traits ranged from 0.38 to 0.95 under the cross-validation strategy and ranged from −0.38 to 0.88 under the cross-population strategy. Besides, GCA models containing non-additive effects had significantly better performance than the pure additive GCA model for most of the investigated traits. Furthermore, there were a relatively higher proportion of additive-by-additive genetic variance components estimated by the full GCA model, especially for lower heritability traits, but the proportion of dominant genetic variance components was relatively small and stable. Our findings concluded that a genomic prediction protocol based on the GCA model theoretical framework can be applied to cucumber breeding, and it can also provide a reference for the single-cross breeding system of other crops.


INTRODUCTION
Genomic prediction predicts the breeding value of potential hybrids using high-density molecular markers or genetic relationship matrices (Meuwissen et al., 2001;Jannink et al., 2010;Crossa et al., 2017) and is more effective in selecting potential hybrids. The genomic prediction method has been widely used in crop breeding, such as maize (Riedelsheimer et al., 2012), rice (Xu et al., 2016), and potato (Sverrisdottir et al., 2018). However, relatively fewer studies in horticultural crops are reported, such as in pea (Tayeh et al., 2015), tomato (Duangjit et al., 2016), and strawberry (Gezan et al., 2017), while there are still no related reports in cucumber.
Cucumber (Cucumis sativus L.) is an important vegetable crop, with a worldwide production of more than 87 million tons in 2019 (http://www.fao.org/faostat/en/?#data/QC). Nowadays, using the traditional breeding methods, diverse ecotypes of cucumber have been developed to satisfy different consumption markets (Weng et al., 2015). Many important horticultural traits, such as size/shape of fruits and flowering time, which are quantitatively inherited and their underlying quantitative trait locus and related genes have been recently reported and identified . Mostly, the quantitative traits of cucumber are their economic traits that have a polygenic structure and are highly influenced by environmental factors. However, selecting potential hybrids by using the marker-assistant selection method has its limitations under such circumstances. Fortunately, genomic prediction models can capture complex genetic variances using genomic-coverage molecular markers to achieve more accuracy in the selection process. In cucumber, the genomic prediction has also become more practical with the release of the full cucumber genome (Huang et al., 2009;Yang et al., 2012) and the decreasing cost of genome sequencing.
In genomic prediction practices, fitting high-density markers as fixed effects will cause over-fitting issues in a linear model (due to n << p, where n signifies observations and p signifies predictors). Therefore, alternatively, they can be fitted as a random effect such as ridge-regression best linear unbiased prediction (BLUP) (Ogutu et al., 2012). To improve the normal distribution assumption of the random effect in ridge-regression BLUP for some traits with more simple inheritance, the Bayesian models may assign different priors to the marker effects allowing for stronger shrinkage or a mixture of null-and major-effect markers (Karkkainen and Sillanpaa, 2012;Perez and de los Campos, 2014). In addition, genetic relationship matrices are used as predictors in the genomic-BLUP (GBLUP) model (Nishio and Satoh, 2014), which can estimate the genetic variance components more conveniently under the Bayesian theoretical framework, and it is equivalent to the ridge-regression BLUP model (Endelman, 2011).
In a single-cross breeding system, F 1 hybrids produced by the inbred lines are breeding goals, where their performances are divided into general combining ability (GCA) and special combining ability (SCA) to explain the genetic effects from parental independence and its interaction (de Santana et al., 2017). In previous studies, GCA and SCA have been used in cucumber breeding (Golabadi et al., 2015;Moradipour et al., 2017;Ene et al., 2019), which indicates that both GCA and SCA are good indicators for evaluating either the inbred lines or hybrid lines. Furthermore, the accurate estimation of combining ability relies on the orthogonal a priori hypothesis of variance components, and the GCA model has been developed for the single-cross breeding system (González-Diéguez et al., 2021). In the GCA model, the genetic relationship matrices of additive, dominance, additive-by-additive, and residual genetic effects are defined orthogonally. The GCA model has shown its potential in a maize single-cross breeding program (Technow et al., 2014;González-Diéguez et al., 2021). In this study, we investigated whether the GCA model is suitable for the cucumber single-cross breeding system.
In general, the predictive ability of the model is estimated by the cross-validation (Heslot et al., 2012). In some cases, the model performance needs more comprehensive verification. The performance of a model for cross seasons and/or cross populations is necessary to validate since the population size and the times of trial in the field are usually limited in a training group. From another perspective, whether the genetic variance components are similar among different seasons and/or population structures is also essential for the comprehension of trait performance in a breeding system.
To verify the applicability of genomic prediction in cucumber, three GCA models under two model validation strategies in cucumber were performed to solve the following issues: (i) estimation of variance parameters under orthogonal prior hypothesis; (ii) the GCA model performance under crossvalidation and cross-population strategies; and (iii) estimation of the GCA performance of 81 inbred lines.

Construction of Model Training Populations
Seventy-one cucumber inbred lines were collected for genomic prediction. For these collected lines, the inbred lines with "WI ##" were provided by professor Yiqun Weng at the University of Wisconsin-Madison, and most of the other inbred lines were obtained from our own Laboratory. Besides, another 10 inbred lines from Liu et al. (2015) and Qi et al. (2013) were used for the analysis. The origin of some inbred lines refers to Bo et al. (2016), and the detailed information of all 81 inbred lines is listed in Supplementary Table 1. There were two populations for model training, namely, diallel cross (DC) population and the North Carolina II (NC) population. The DC population included 18 inbred lines, 153 DC F 1 hybrids, and another 5 F 1 hybrids developed from 8 cucumber inbred lines with a total of 176 combinations. The NC population included 92 combinations, which were derived by "23, " "3,511" as male parents, and other 49 inbred lines as female parents based on the NC genetic mating design . The detailed information for two model training populations is listed in Supplementary Table 2.

Genotypic Data Collection
Seventy-two collected inbred lines were planted in a greenhouse at the Horticultural Farm of Northwest A&F University (HF-NWAF), Yangling, Shaanxi Province, China, (108.0809 • E, 34.3006 • N) during spring 2018. Young leaves of these inbred lines were collected for DNA extraction by using the cetyl trimethyl ammonium bromide (CTAB) method from the seedlings at a four-leaf stage. Murray and Thompson (1980). The qualified DNA samples were used for sequencing library construction following the paired-end library protocol (Illumina company). The Illumina HiSeq2500 platforms and the pairedend 150 bp sequencing strategy were used for re-sequencing.
To obtain clean reads, raw sequencing data were filtered by removing sequences with adaptors, filtering out the reads with N content over 10%, and ignoring the reads that had over 50% of low-quality bases. Regarding reads alignment and single nucleotide polymorphism (SNP) calling, 81 inbred lines were sequenced and analyzed altogether. First, the clean reads of 81 inbred lines were aligned onto the draft genome of cucumber "9930" V2.0 (https://www.ncbi.nlm.nih.gov/genome/ 1639?genome_assembly_id=228904), using BWA software (Li and Durbin, 2010) (parameter: mem -t 4 -k 32 -M). Second, the SNP calling process was performed by using SAMtools software  with the "mpileup" function. Missing and heterozygous SNP loci were imputed by major homozygous SNP genotype, and the SNP locus with minor allele frequency <0.05 was removed. Later, the SNPs located on coding sequences were annotated by using ANNOVAR software (Wang et al., 2010). Finally, 16,662 non-synonymous SNPs were collected as genotypic data for the construction of genetic relationship matrices. The origin SNP information of 81 inbred lines is given in Supplementary File 1.

Phenotypic Data Collection
Three field trials of the DC population were carried out during autumn 2018 (2018A), spring 2019 (2019S), and spring 2020 (2020S). The planting dates of these three seasons are July 27, 2018, March 14, 2019, and March 6, 2020, respectively. One field trial of the NC population was carried out during 2020S. For each field trial, combinations in the DC population had two blocks (separated glasshouse as a block) while four (two blocks in a single glasshouse) in the NC population. A total of 10 plants were sowed for each combination in a block, and the detailed experimental design was visualized in Supplementary Figure 1.
A total of 12 traits were collected in both DC and NC populations, namely, commercial fruit yield (cFY, kg), commercial fruit number (cFN), female flower time (FFT, days), commercial fruit weight (cFW, g), commercial fruit length (cFL, mm), commercial fruit diameter (cFD, mm), commercial fruit neck length (cFNL, mm), commercial fruit flesh thickness (cFTH, mm), commercial seed cavity radius (cSCR, mm), commercial fruit spine density (cFSD, n·cm −2 ), female flower node ratio (FFNR), and the first female flower node (FFFN) traits. The naming standard of these traits follows the rules in the study by Wang et al. (2020).
The commercial fruits of all combinations in three seasons were collected and measured according to the market standard of fresh cucumbers (Golabadi et al., 2015). The phenotypic data collection methods were described as follows: For cFY and cFN, the total cFY and cFN were continuously recorded up to 30 harvesting days. FFT was recorded from sowing until days to 50% of female flowering. Moreover, each block was a measurement unit for cFY, cFN, and FFT. FFNR was recorded by calculating a ratio of female flower nodes within the first 20 nodes and FFFN from the FFFN. A total of six plants were used for measuring FFNR and FFFN, while a total of six commercial fruits were collected for the measurement of cFW, cFL, cFD, cFNL, cFTH, cSCR, and cFSD, respectively. cFSD trait was recorded by selecting six random fruits, and their spine numbers were converted to spine density (n·cm −2 ). All these 12 traits were recorded in 2019S and 2020S; however, the FFNR and FFFN traits were not available in 2018A. The information of origin phenotypic data are listed in Supplementary File 2.
Before model training, for most traits in each season, including cFW, cFL, cFD, cFNL, cFTH, cSCR, cFSD, FFNR, and FFFN, the phenotypic data were corrected using the following mixed linear model: where y ijk is the origin phenotypic value, g i is the best linear unbiased estimates (BLUEs) of the ith hybrids, block j is the jth block, Rep k is the kth replicates, and ε ijk is the residual item. For cFY, cFN, and FFT in each season, the phenotypic data were corrected using a simplified model: y ijk = g i + block j + ε ijk , where the replicate item was ignored.

Model Type and Model Validation Strategy
For a single-cross breeding system, the decomposition of combining ability of F 1 hybrids is the basic theoretical framework of genomic prediction, and the basic formula is: where y ij is the BLUEs of F 1 combinations (female parent i × male parent j), µ is the overall mean of the group, GCA i and GCA j are the GCA of female parent (i) and male parent (j), respectively. SCA ij is the SCA of the F 1 hybrids (female parent i × male parent j), and ε ij is the residual item. In this study, the GCA models developed by González-Diéguez et al. (2021) were used for model training. In the full GCA model, additive, dominance, additive-by-additive epistatic, and residual genetic effects were all considered. In addition, GCA components were considered by the following formula: where GCA is the GCA effect vector of inbred lines, g A is the additive effect vector of inbred lines, g AA is the additiveby-additive (intrapopulation epistatic) effect vector of inbred lines, and T 1 and T 2 are incidence matrices assigning hybrids to female parents and male parents, respectively. The additive effects in inbred line population are assumed to be distributed as g A ∼MVN 0, G A σ 2 A , and the G A matrix was calculated based on the following formula: where p i and q i are allele frequencies in ith SNP loci and n is the SNP number. Z=M−1 n p ′ , where M is the m × n SNP matrix of inbred lines, and m is the inbred line number. {0, 1} are basic elements in M, where 0 means "aa" and 1 means "AA." The intrapopulation additive-by-additive effects in inbred line population are assumed as g AA ∼MVN 0, G AA σ 2 AA . r is the residual genetic effect (Endelman et al., 2018), which is assumed to be distributed as r∼MVN(0, Iσ 2 r ), and r = g AAA + g AAAA + . . .. For the SCA component, the dominance effect was considered, and the formula is where SCA is the SCA effect vector of hybrids, g D is the dominance effect of hybrids, which is assumed as g D ∼MVN 0, G D σ 2 D . G AA and G D genetic relationship matrices are constructed followed by González-Diéguez et al. (2021).
Thus, the full GCA model is the GCA additive-dominanceepistasis (A-D-E) model, and the formula is where y is the BLUE vector of F 1 combinations, ε is the residual item, and ε∼MVN(0, Iσ 2 e ). Other items in formula (5) have been described earlier.
Besides, the other two simplified models were also used for the analysis as follows: Formula (6) is the GCA additive-dominance (A-D) model, which contains additive, dominance, and residual genetic effects. Formula (7) is the GCA additive (A) model, which contains additive and residual genetic effects. In addition, three GCA models without residual genetic effects were also performed. All the three GCA models were performed by using the Bayesian ridge regression framework, and variance components including σ 2 A , σ 2 D , σ 2 AA , σ 2 r , and σ 2 e were estimated by using the Bayesian ridge regression model. Broad-sense heritability (H 2 ) was calculated by the proportion of genetic variance in the total variance. To show the genetic relationship among 81 inbred lines, G A matrix was visualized using a heat map. Finally, the GCA of 81 inbred lines in all three seasons was estimated using the full GCA model.
Two model validation strategies, namely, five-fold crossvalidation and cross-population validation, were used to check the predictive ability of the model. These two strategies are shown in Figure 1.
For the five-fold cross-validation strategy, DC population in 2018A, DC population in 2019S, and (DC + NC) populations in 2020S were training groups, respectively. For the crosspopulation validation strategy, the DC population in 2018A, 2019S, and 2020S were the training groups, and the NCII population in 2020S was the validation group. Regarding each model validation strategy, the Pearson's correlation relationship between the phenotypic BLUEs and the genomic estimated breeding values (GEBVs) was calculated as model predictive ability, and each model validation process was repeated 20 times to obtain robust results.

Statistical Analysis
In this study, most of the analyses were performed by using R (3.6.1 version) software (R Development Core Team, 2013). The correlations among the BLUEs of traits in three seasons were visualized by using "ggpairs" (https://ggobi.github.io/ ggally/reference/ggpairs.html), and the heat map of the G A matrix was visualized by "pheatmap" package using R software (https://CRAN.R-project.org/package=pheatmap). The principal component analysis of DC and NC populations was performed based on the 16,662 non-synonymous SNP information of these two populations using "prcomp" R function. Regarding the GCA models validation process, the "BGLR" function in "BGLR" package in R software was used for the analysis (Perez and de los Campos, 2014). For the parameter setting of "BGLR" function, the nIter parameter was 30,000, the burnIn parameter was 10,000, and the thin parameter was 5 (default value). The trace plot of each variance parameter was visualized to check its convergence. The R script including the function for the model validation process is presented in Supplementary File 3.

Overview of Genetic Relationship of Inbred Lines and Phenotypic Data
In this study, a total of 81 cucumber inbred lines were collected to represent the main genetic background of cucumber. The heat map of G A matrix (Figure 2) showed that all the collected cucumber inbred lines were divided into four genetic groups, namely, East Asian group (43 inbred lines), Eurasian group (23 inbred lines), Indian group (9 inbred lines), and xishuangbanna group (6 inbred lines). Two populations, namely, DC and NC, were constructed for genomic prediction (Supplementary Figure 2A). The principal component analysis of two populations (Supplementary Figure 2B) showed that the DC population had a similar but broader genetic background compared with the NC population.
Two main types of traits in DC and NC populations were collected, including fruit yield-related and commercial-related traits. Fruit yield-related traits included cFY, cFN, FFT, FFNR, and FFFN traits; fruit commercial-related traits included cFW, cFL, cFD, cFNL, cFTH, cSCR, and cFSD traits.

Model Predictive Ability Under Five-Fold Cross-Validation Strategy
For the five-fold cross-validation strategy, the model predictive ability was estimated by the training group itself. Table 1

The Estimation of Variance Parameters of GCA Models
The posterior genetic variance parameters of three GCA models for 12 traits in three seasons were estimated. The information of variance parameters for the cFY trait is listed in Table 2, and the information of variance parameters for the other 11 traits is listed in Supplementary Table 3. For all the 12 traits, with increasing model complexity, a higher proportion of genetic variance components was observed, leading to a higher broadsense heritability. In addition, a relatively high proportion of additive-byadditive (epistatic) genetic variance was captured in the GCA (A-D-E) model for most of the traits. For example, for cFY in 2018A, the additive-by-additive genetic variance accounted for   43.5% of the total genetic variance under the GCA (A-D-E) model and for cFY in 2019S, the ratio was 47.6%. And in 2020S, the ratio was 39.6%. But for the high broad-sense heritability traits, like cFL in 2020S, the ratio of additive-by-additive genetic variance to total genetic variance was only 20.6%. When σ 2 AA existed (GCA (A-D-E) model), the proportion of additive and residual genetic components drops compared with the other GCA models. In contrast, the proportion of dominant genetic variance in total genetic variance during three GCA models is relatively stable.

Regarding the GCA model, (A) is an additive model, (A-D) is an additive-dominance model, and (A-D-E) is the additive-dominance-epistasis (additive-by-additive) model. The predictive ability is presented as mean (SD). Different letters indicate a significant difference in the predictive abilities of different models
Besides, the percentage of genetic variance components in the total variance among traits under the GCA (A-D-E) model was visualized (Figure 4). σ 2 A components accounted for low proportions of the total genetic variance (1.9-15.2%), especially for cFTH in 2018A (1.9%). σ 2 D components also accounted for a lower proportion of the total genetic variance (1.2-14.9%), and only for FFT, FFFN in 2020S had relatively higher dominant variance (12.7-14.9%). σ 2 AA and σ 2 r components accounted for higher proportions of the total genetic variance, respectively (9.7-36.8% for σ 2 AA and 8.6-56.1% for σ 2 r ).

Regarding the GCA model, (A) is an additive model, (A-D) is an additive-dominance model, and (A-D-E) is the additive-dominance-epistasis (additive-by-additive) model.
σ 2 A , σ 2 D , σ 2 AA , and σ 2 r are additive, dominance, additive-by-additive, and residual genetic variance components, respectively. σ 2 ε is the residual variance. The estimated posterior of genetic variance components is expressed as mean (SD). GCA, general combining ability; cFY, commercial fruit yield.

FIGURE 4 | The percentage of genetic variance components in the total variance among traits under the general combining ability (A-D-E) model. σ 2
A , σ 2 D , σ 2 AA , and σ 2 r are additive, dominance, additive-by-additive, and residual genetic variance components, respectively.

The Phenotypic Correlation of DC Population in Three Seasons
The phenotypic correlation of the DC population of each trait in 2018A, 2019S, and 2020S was estimated (Supplementary Figure 3). Most of the traits have a relatively higher correlation in three seasons, especially for cFL traits (r = 0.92-0.95). However, the cFTH showed relatively inconsistent correlations among these three seasons, in which the cFTH in 2018A has a weak correlation with the cFTH in 2019S and 2020S (r = −0.02-0.21), while cFTH trait in 2019S has a relatively higher correlation with that in 2020S (r = 0.63).

Model Predictive Ability Under Cross-Population Strategy
For the cross-population strategy, there were three validation schemes ( Table 3). The NC population in 2020S was the validation group, and the DC population in 2018A, 2019S, and 2020S were the training groups, respectively. As shown in Table 3, when the DC population in 2018A was the training group, the GCA (A) model was enough to obtain the best performance for cFW, cFL, cFNL, cFTH, cSCR, and cFSD, and the GCA (A-D-E) model had the best performance for cFY, cFN, FFT, and cFD. When the DC population in 2019S and 2020S were training groups, the GCA (A-D-E) model performed the best   for most of the traits, except for cFL in 2020S and FFNR in both 2019S and 2020S. Besides, most of the traits had effective predictive ability under the cross-population strategy, indicating that the GCA models are effective for cross-population prediction, while cSCR and cFTH under the cross-population strategy had relatively worse performance (−0.38 to 0.53).

The GCA Estimation Result of 81 Inbred Lines Under the Full GCA Model
The estimated GCA information of 81 inbred lines during three seasons is listed in Supplementary Table 4. The result shows that most of the inbred lines in the East Asian and Eurasian groups had higher fruit yield, earlier FFT, and more diverse fruit commercial traits. The GCA effects including additive and additive-by-additive genetic effects are stable genetic effects between generations, and they are essential for evaluating inbred lines. Overall, the estimated GCA information of these 81 inbred lines could provide useful guidance in cucumber breeding practice, which makes the breeding scheme more purposeful.

DISCUSSION
The Discussion About GCA Models The GCA model has been developed by González-Diéguez et al. (2021), which can trace genetic effects of male and female parents ("according to origin") based on the GCA and SCA theoretical framework. Moreover, it can also estimate the orthogonal variance components. The correlation among the variance components estimated by the GCA (A-D-E) model for cFY (Supplementary Figure 4) shows that the correlation relationship among genetic variance components is weak (r ≈ 0), which might be due to its orthogonal prior hypothesis. Orthogonal variance components are essential for the estimation of genetic effects, because the non-orthogonal variance components have overlaps among each other, and genetic effects cannot be estimated accurately. Therefore, the GCA model is a more powerful tool for the estimation of variance components.
Residual genetic effects contain high-order additive effects and interaction (epistatic) effects (g AAA , g AAAA , . . .), which are important for the genetic structure (González-Diéguez et al., 2021) but are difficult to be estimated separately. In this study, the GCA model without residual genetic effects was also considered. The variance components estimated by the GCA models without residual genetic effects (Supplementary Table 5) showed that the additive and the additive-by-additive effects will have overestimated tendency when residual genetic effects do not exist, though dominance effect and broad-sense heritability may not have obvious changes. The residual genetic effects will be estimated separately when residual genetic effects exist, causing the genetic variance estimation more accurate. However, due to the a priori hypothesis that the variance components are orthogonal, the dominant and the residual items are independent and not affected.
Besides, in the GCA model developed by González-Diéguez et al. (2021), the parents come from different populations. In this study, we assumed that both female and male parents are from the same population because cultivated cucumber species have relatively lower genetic diversity and similar genetic structure due to the severe domestication bottleneck events (Qi et al., 2013). Therefore, there were some differences in the details of the GCA models compared to the previous study (González-Diéguez et al., 2021). In addition, this study involved some wild and semiwild germplasm, which were not suitable for direct breeding of commercial cucumber varieties. However, the significance of including this germplasm for model training was to expand the genetic background of the training population and make the genomic prediction models more robust (Crain et al., 2020).

Additive and Non-additive Genetic Effects in GCA Models
The additive effect, also called "breeding value, " is the accumulation of genotype values of minor genes (Hayes et al., 2009). An additive effect is essential for a breeding program because it is stable and accumulable between generations. Nonadditive effects, like dominance and additive-by-additive effects, are also important to capture more genetic variance (Varona et al., 2018). Although non-additive models are transient and highly dependent on the heterozygosity of the population, F 1 hybrids are final cultivars in a single-cross breeding system, thus non-additive effects can provide more accurate information for the prediction of F 1 hybrid performance, and the performance of three GCA models under the five-fold cross-validation strategy verified this conclusion ( Table 1). Some previous studies have reported that models containing non-additive effects perform better than pure additive models in some cases (Munoz et al., 2014;Dias et al., 2018;Wu et al., 2019), especially for the traits with lower heritability (Liu et al., 2018).
The proportion of each genetic variance to the total genetic variance reflects the degree of control on each genetic effect of traits. In this study, the SCA effect (σ 2 D ) accounts for only a small part (1.2-14.9%) of the total genetic variance for most of the traits (Figure 4) compared with the GCA effect (20.4-81.4%). This estimation result indicates that the GCA effect plays a leading role in explaining phenotypic variance in cucumber. Therefore, breeding excellent inbred lines is very important for selecting superior cucumber hybrid varieties, and Supplementary Table 4 shows the valuable GCA information for 81 cucumber inbred lines.
Although year/season effects have important impacts on breeding work, we did not handle these effects in the model, as we preferred to analyze and compare the changes of genetic effects in different seasons/years. But the phenotypic correlation of the DC population among three seasons was calculated to check whether there were any obvious year/seasonal effects. It was stated as G × E effects (where G signifies genotypes and E signifies environments). Most of the traits have a relatively higher correlation in three seasons indicating that there were no obvious G × E for these traits.

The Comparison of Two Model Validation Strategies
In this study, two model validation strategies were used, which were five-fold cross-validation and cross-population strategies. In the previous studies, k-fold cross-validation and leave-one-out cross-validation are the two main methods for model predictive ability estimation (Wu et al., 2019;Cui et al., 2020), which are similar to the five-fold cross-validation strategy in this study. The cross-validation strategy provides an effective tool for checking overfitting issues among three GCA models. For instance, the cFL in 2018A, the GCA (A) model is enough to obtain the best performance, since dominance and additive-by-additive genetic components occupy a relatively small proportion of the total genetic variance (23.1%), thus more complex GCA models are unnecessary.
For the five-fold cross-validation strategy, the model predictive ability represents the reliability estimation of prediction results. In general, the predictive ability has a positive correlation with trait heritability (Combs and Bernardo, 2013), indicating that the higher predictive ability means higher trait heritability. Furthermore, the higher trait heritability means the higher proportion of genetic variance explained by the genomic prediction model to the total variance, resulting in more reliable estimates for untested inbreds and combinations. For example, the predictive ability of cFL is 0.90-0.95 under the GCA (A-D-E) model, and the corresponding heritability is 0.70-0.83. In fact, cFL is more stable under different environments compared with yield-related traits. Therefore, traits with high predictive ability will provide more precise references to practical breedings.
The cross-population strategy is cross-population validation in the same or different seasons. When the training group and the validation group are in the same season, the environmental factor is similar, and the model performance mainly depends on the genetic similarity between populations (Sverrisdottir et al., 2018;Edwards et al., 2019). More complex models can capture a higher genetic variance proportion of the training group for explaining the variance in the validation group. Thus, more complex models have better performance ( Table 3, 2020S(DC)->2020S(NC) validation strategy).
When the training group and the validation groups are in a different year, especially in different seasons, not only the genetic similarity between populations but also environmental factors may affect the model performance. Although GCA (A-D) and GCA (A-D-E) models can catch non-additive genetic effects, like dominance and additive-by-additive variance components, these non-additive effects are transient and highly dependent on the population structure and specific environment. The additive effect is relatively stable among populations but the additive effect is reduced in the non-additive models, especially in the GCA (A-D-E) model (Supplementary Table 3). Therefore, the GCA (A) model is enough for cFW, cFL, cFNL, cFTH, cSCR, and cFSD under the "2018A(DC)->2020S(NC)" validation strategy ( Table 3).

The Comparison Between GCA Models and Traditional Combining Ability Estimation Methods
Combining ability estimation has always been an important content in plant breeding, which provides a valuable reference for the breeding of inbreds and hybrids (Lv et al., 2012). There are some studies on combining ability estimation in cucumber breeding (López-Sesé and Staub, 2002;Golabadi et al., 2015;Ene et al., 2019), where rigorous field experimental design and phenotypic variance decomposition were used to estimate GCA and SCA. In these studies, SCA has a substantial contribution to some traits. For example, FL and FD have high proportions of variance of SCA in the study by Golabadi et al. (2015), but the variance of SCA for these traits may be overestimated.
In this study, the genetic relationship matrices based on highdensity markers and the phenotypes of the training population were used for model training, and genetic effects were estimated accurately based on the assumption of orthogonal genetic variance. The results show that for all investigated traits, the SCA (dominance effect) only accounts for a small part of the total variance (1.2-14.9%), and the GCA accounts for the main part of the total variance (20.4-81.4%). Besides, in actual breeding practice, the heterosis of cucumber is not obvious in most cases (Weng, 2021), indicating that the proportion of non-additive effects (SCA) is low. Therefore, the GCA models in this study may be more accurate in combining ability estimation compared with traditional methods.
In addition, the GCA model can predict the GCA of inbred lines, which have genotypes but not involved in field trials. Moreover, the GCA models do not have strict requirements for the design of the training population, though more representative training groups will have more reliable results. These advantages can make combining ability estimation more efficient and then saving the costs in breeding.

Opportunities and Challenges of Genomic Prediction Applied to Cucumber Breeding
The traditional hybrid breeding scheme limits the efficiency of cucumber breeding, as potential hybrids that need to be tested will increase exponentially when parents increase linearly. The genomic prediction could perform large-scale accurate prediction with a smaller training group, which improves the selection efficiency significantly and meet the increasing demand for breeding tasks. Besides, the cucumber genome is relatively small (∼400 Mbp) (Weng, 2021), and fewer molecular markers can achieve high marker density, which is an advantage of low cost for genomic prediction applied to cucumber breeding.
Although genomic prediction has broad application prospects in cucumber breeding, there are some challenges that need to be considered. First, some fresh market cucumber species should be planted in a greenhouse, but the greenhouse area may limit the training population size. Second, some cucumber agronomic traits are important but difficult to quantify, such as fruit crispness, fruit aroma, fruit skin color consistency, etc. Therefore, more in-depth studies are needed to quantify these traits to support the application of genomic prediction. Regarding further studies, stress resistance and fruit nutritional quality traits of cucumber will be considered in genomic prediction to make the breeding scheme more practicable.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: NCBI BioProject accession: PRJNA741624.

AUTHOR CONTRIBUTIONS
CL: data curation, investigation, methodology, software, writing-original draft, writing-review, and editing. XL: data curation, investigation, methodology, writing-review, and editing. YH: funding acquisition, project administration, and supervision. XW and YD: methodology, writingreview, and editing. HM: funding acquisition, project administration, supervision, writing-review, and editing. ZC: conceptualization, project administration, supervision, writing-original draft, writing-review, and editing. All authors contributed to the article and approved the submitted version.