Partial Least Squares Enhances Genomic Prediction of New Environments

In plant breeding, the need to improve the prediction of future seasons or new locations and/or environments, also denoted as “leave one environment out,” is of paramount importance to increase the genetic gain in breeding programs and contribute to food and nutrition security worldwide. Genomic selection (GS) has the potential to increase the accuracy of future seasons or new locations because it is a predictive methodology. However, most statistical machine learning methods used for the task of predicting a new environment or season struggle to produce moderate or high prediction accuracies. For this reason, in this study we explore the use of the partial least squares (PLS) regression methodology for this specific task, and we benchmark its performance with the Bayesian Genomic Best Linear Unbiased Predictor (GBLUP) method. The benchmarking process was done with 14 real datasets. We found that in all datasets the PLS method outperformed the popular GBLUP method by margins between 0% (in the Indica data) and 228.28% (in the Disease data) across traits, environments, and types of predictors. Our results show great empirical evidence of the power of the PLS methodology for the prediction of future seasons or new environments.


INTRODUCTION
Genomic selection (GS) proposed by Meuwissen et al. (2001) is a disruptive methodology that uses statistical machine learning algorithms and data to improve the selection of candidate lines early in time without the need to measure phenotypic information. The empirical evidence in favor of GS methodology can be found in applications of many crops, such as wheat, maize, cassava, rice, chickpea, groundnut, etc. (Roorkiwal et al., 2016;Crossa et al., 2017;Wolfe et al., 2017;Huang et al., 2019). However, the practical implementation of GS is challenging because the GS methodology does not always guarantee medium or high prediction accuracies, since many factors affect the prediction performance.
One important complexity of GS models arises when predicting unphenotype cultivars in specific environments (e.g., planting data-site-management combinations) by incorporating G×E interaction into the genomic-based statistical models. Also important is the genomic complexity related to G×E interactions for multi-traits, as these interactions require statistical-genetic models that exploit the complex multivariate relations due to multi-trait and multienvironment variance-covariance but also to exploit the genetic correlations between environments, between traits, and between traits and environments simultaneously. In GS modeling, the interaction between markers and environmental covariates can be a complex task due to the high dimensionality of the matrix of markers, environmental covariates, or both. Jarquín et al. (2014) suggested modeling this interaction using Gaussian processes, where the associated variance-covariance matrix induces a reaction norm model. The authors showed that assuming normality for the terms involving the interaction and also assuming that the interaction obtained using a first-order multiplicative model is distributed normally, then the covariance function is the cell-by-cell product (Hadamard) of two covariance structures, one describing the genetic information and the other describing the environmental effects. Many studies using a reaction norm model indicated that including the G×E interaction in the model substantially increased the accuracy of across-environment (locations and/or years) predictions (Crossa et al., 2017).
Many of the factors that affect the prediction performance of the GS methodology are under scrutiny to be improved. For example, the quality of the marker data is of paramount importance and the breeder needs to be careful to use only high-quality markers. Also, the quality of the training population is key to guaranteeing reasonable prediction accuracies in untested new lines. For this reason, research is in progress to optimize the design of the training-testing sets. Regarding statistical machine learning algorithms, there is also research in progress to select the best algorithm for each implementation. Currently, it is very common to find applications in GS using random forest, mixed models, Bayesian methods (GBLUP, BRR, BayesA, BayesB, BayesC, and Bayes Lasso), support vector machine, gradient boosting machine methods, and deep learning methods. But, as stated in the "No Free Lunch Theorem," none of these algorithms is the best statistical machine learning algorithm for predictive modeling problems such as classification and regression. The no free lunch theorem indicates that the performance of all statistical machine learning algorithms is similar in select specific situations. However, there are niches where a particular algorithm consistently outperforms the others. For example, in interacting with images, there is empirical evidence that deep neural network methods are the best, yet one limitation of deep neural network models is that they require considerably large datasets (Montesinos-López O. A. et al., 2018;Montesinos-López A. et al., 2018;Montesinos-López et al., 2019).
The Bayesian method GBLUP is quite robust and most of the time produces reasonable and competitive predictions with the advantage that we can implement this method requiring no extra time for the tuning process. This is due to its efficient default hyperparameters for most applications, which is an important advantage regarding support vector machine, gradient boosting machine, random forest, and deep learning models that require considerable effort to select a reasonable set of hyperparameters. However, it is important to note that deep learning methods are the most demanding and challenging for selecting reasonable hyperparameters since this algorithm requires many inputs (Montesinos-López O. A. et al., 2018;Montesinos-López A. et al., 2018).
For the prediction of new environments (or seasons), most statistical machine learning methods struggle to produce reasonable predictions because often there is not a good match between the distribution of the training and testing set. For this reason, the prediction of a new season or environments is a more challenging task than when using conventional strategies of crossvalidation (CV1), which tests the performance of lines that have not been measured in any of the observed environments, and CV2, which tests the performance of lines that have been measured in some environments but not in others (Burgueño et al., 2012). This is significant for small plant breeding programs, where data from multiple locations are scarce or non-existent, with the need to predict which lines are more likely to perform better in future seasons or environments (Monteverde et al., 2019).
In multi-environmental plant breeding field trials, information on environments may enhance the information in genotype×environment interactions (GE), and the principal component regression procedure that relates environments to the principal component scores of the GE has been proposed (Aastveit and Martens, 1986). To overcome some of the interpretation difficulties present in the principal component regression scores, Aastveit and Martens (1986) proposed the partial least squares (PLS) regression method as a more direct and parsimonious linear model. PLS regression describes GE in terms of differential sensitivity of cultivars to environmental variables where explanatory variables are linear combinations of the complete set of measured environmental and/or cultivar variables with no limit to the number of explanatory covariables. Vargas et al. (1998Vargas et al. ( , 1999 carried out extensive studies for assessing the importance of environmental covariables to interpret and understand GE in multienvironment plant breeding trials. Furthermore, Crossa et al. (1999) demonstrated the usefulness of PLS for understanding GE using environmental and marker covariables.
Based on the above considerations, in this study we explore the power of PLS for the prediction of new environments, a problem denoted as "leave one environment out." The prediction accuracy of the PLS regression method is compared to that of GBLUP, one of the most robust and widely used methods in genomic selection. The benchmarking was done with 14 real datasets in which at least two environments were evaluated.

Bayesian GBLUP Model
The model used was , where X E is the standardized (centered and scaled) matrix of dimension I × r containing the environmental information of I environments and for each environment were measured r environmental covariates; g j , j 1, . . . , J, are the random effects of lines, gL ij are the random effects of location-line interaction (GE) and ϵ ij are random error components in the model assumed to be independent normal random variables with mean 0 and variance σ 2 . Furthermore, it is assumed that g (g 1 , . . . , g J ) T~N J (0, σ 2 g G), gL (gL 11 , . . . , gL 1J , . . . , gL IJ ) T~N IJ (0, σ 2 gL (H ⊗ G)), where G is the genomic relationship matrix as computed by VanRaden (2008), ⊗ denotes the Kronecker product and H is the environmental relationship matrix of size I. Environmental covariates were available for only the last two datasets (datasets 13 and 14), so for the first 12 datasets the H environmental relationship matrix was reduced to an identity matrix, I I , of dimension I × I. The implementation of this model was done in the BGLR library of Pérez and de los Campos (2014). This model contains GE but was also implemented without GE interaction, that is, the model (1) without the fourth component on the right side of Eq. 1, such that

Partial Least Squares (PLS) Method
PLS regression was first introduced by Wold (1966) and was originally developed for econometrics and chemometrics. It is a multivariate statistical technique designed to deal with the p > n problem, i.e., when the number of explanatory variables (p) is much larger (and more highly correlated) than the number of observations (n). The PLS works for relating one or more response variables (Y) to a set of explanatory variables (X) (Wold, 2001;Boulesteix and Strimmer, 2006). For PLS regression, the components, called Latent Variables (LVs) in this context, are obtained iteratively. One starts with the SVD of the cross-product matrix S X T Y, thereby including information on the variation in both X and Y, and on the correlation between them. The first left and right singular vectors, w and q, are used as weight vectors for X and Y, respectively, to obtain scores t and u: where E and F are initialized as both X and Y, respectively. The X scores t are often normalized: The Y scores u are not actually necessary in the regression but are often saved for interpretation purposes. Next, X and Y loadings are obtained by regressing against the same vector t: Finally, the data matrices are "deflated": the information related to this latent variable, in the form of the outer products tp T and tq T , is subtracted from the (current) data matrices E and F.
The estimation of the next component can then start from the SVD of the cross-product matrix E T n+1 F n+1 . After every iteration, vectors w, t, p and q are saved as columns in matrices W, T, P, and Q, respectively. One complication is that columns of matrix W cannot be compared directly: they are derived from successively deflated matrices E and F. It has been shown that an alternative way to represent the weights is that all columns relate to the original X matrix is given by Now, instead of regressing Y on X, we use T scores to calculate the regression coefficients, and later convert these back to the realm of the original variables by pre-multiplying with matrix R (since T XR): Again, here only the first a components are used. Since regression and dimension reduction are performed simultaneously, all B, T, W, P, and Q are part of the output. Both X and Y are considered when calculating the LV in T. Moreover, they are defined so that the covariance between the LV and the response is maximized. Finally, predictions for new data (X new ) should be made as: with T new X new R. How many components are optimal must be determined, usually by cross-validation. We used the root mean squared error of prediction (RMSEP), which was minimized with 10-fold cross-validation in the training dataset and for each value of LV (Mevik and Cederkvist, 2004). PLS is similar to principal component regression (PCR). In theory, PLS regression should have an advantage over PCR. One could imagine a situation where a minor component in X is highly correlated with Y; not selecting enough components would then lead to poor predictions. In PLS regression, such a component would be automatically present in the first LV. However, there is hardly any difference between the use of PLS regression and PCR; in most situations, the methods achieve similar prediction accuracies, although PLS regression usually needs fewer latent variables than PCR. Put the other way around: with the same number of latent variables, PLS regression will cover more of the variation in Y and PCR will cover more of X. Both behave similarly to ridge regression.
It is important to point out that under the PLS regression method, first we computed the design matrices (dummy variables) of environments (X L ), genotypes (X g ), and GE interactions (X gL ). The dimensions of matrices X L , X g , and X gL were JI × I, JI × J, and JI × JI, respectively. While the dimension of the response variable (Y), was of JI × 1. Then, to include the environmental and markers information in the design matrices of environments (X L ), genotypes (X g ), and GE (X gL ), we used the following augmented design matrices: X L L E , X g L g , and X gL (L E ⊗ L g ) for environments, the genotypes and GE components, respectively. L E is a matrix of order I × I that denotes the square root of the environmental relationship matrix H or order I × I, while L g is a matrix of order J × J that denotes the square root of the genomic relationship matrix G or order J × J. It is important to point out that the resulting augmented matrices: X L L E , X g L g , and X gL (L E ⊗ L g ) have the same dimension of the non-augmented matrices X L , X g , and X gL , since the matrices for which they were post-multiplied were square matrices that have the same number of columns of X L , X g , and X gL , respectively. For this reason, the input matrix used in the PLS analysis was X [X L L E , X g L g , X gL (L E ⊗ L g )] when the predictor contained the GE term, which is of order JI × (I + J + JI), but when the GE term was not included, the input matrix used was X [X L L E , X g L g ] of order JI × (I + J). Both models (GBLUP and PLS) were implanted under a univariate (uni-trait) framework, so the response variable is a vector that contains information of all environments available in the dataset. Implementation of both models (GBLUP and PLS) was done in the R statistical software (R Core Team, 2022). However, the PLS model was implemented using the PLS library (Mevik and Wehrens, 2007).

Data Sets
Note that for datasets 1-12, the PLS included only marker covariables, thus the data augmentation X g L g applies and the markers are included as the square root of the genomic relationship matrix G (L g ). For datasets 13-14, the PLS included both marker and environmental covariables; thus, data augmentation applies to both factors X L L E , and X g L g where the environments are included as the square root of the environmental relationship matrix H (L E ). Three datasets were collected by the Global Wheat Program (GWP) of the International Maize and Wheat Improvement Center (CIMMYT) and belong to elite yield trials (EYT) established in four different cropping seasons with four or five environments. Dataset 1 is from 2013-2014, dataset 2 is from 2014-2015, and dataset 3 is from 2015-2016. The EYT datasets 1, 2 and 3 contain 776, 775, and 964 lines, respectively. The experimental design used was an alpha-lattice design and the lines were sown in 39 trials, each covering 28 lines and 2 checks in 6 blocks with 3 replications. Several traits were available for some environments and lines in each dataset. In this study, we included four traits measured for each line in each environment: days to heading (DTHD, number of days from germination to 50% spike emergence), days to maturity (DTMT, number of days from germination to 50% physiological maturity or the loss of the green color in 50% of the spikes), plant height, and grain yield (GY). For full details of the experimental design and how the Best Linear Unbiased Estimates (BLUEs) were computed, see Juliana et al. (2018).
Genome-wide markers for the 2,515 (776 + 775+964) lines in the two datasets were obtained using genotyping-by-sequencing (GBS) (Elshire et al., 2011;Poland et al., 2012) at Kansas State University using an Illumina HiSeq2500. After filtering, 2,038 markers were obtained from an initial set of 34,900 markers. The imputation of missing marker data was carried out using LinkImpute (Money et al., 2015) and implemented in TASSEL (Trait Analysis by Association Evolution and Linkage) version 5 (Bradbury et al., 2007). Lines that had more than 50% missing data were removed, and 2,515 lines were used in this study (776 lines in the first dataset, 775 lines in the second dataset, and 964 lines in the third dataset).

Dataset 4. Groundnut Data
The phenotypic dataset reported by Pandey et al. (2020) contains information on the phenotypic performance for various traits in four environments. In the present study, we assessed predictions using the trait seed yield per plant (SYPP), pods per plant (NPP), pod yield per plant (PYPP) and yield per hectare (YPH), for 318 lines in four environments, denoted as Environment1 (ENV1) The dataset is balanced, resulting in a total of 1,272 assessments with each line included once in each environment. Marker data were available for all lines and 8,268 SNP markers remained after quality control (each marker was coded with 0, 1 and 2).

Dataset 5. Maize Data
This maize dataset was included in Souza et al. (2017) from USP (Universidad Sao Paulo) and consists of 722 (with 722 × 4 = 2,888 observations) maize hybrids obtained by crossing 49 inbred lines. The hybrids were evaluated in four environments (Env1-Env4) in Piracicaba and Anhumas, São Paulo, Brazil, in 2016. The hybrids were evaluated using an augmented block design, with two commercial hybrids as checks to correct for micro-environmental variation. At each site, two levels of nitrogen (N) fertilization were used. The experiment conducted under ideal N conditions received 100 kg ha-1 of N (30 kg ha-1 at sowing and 70 kg ha-1 in a coverage application) at the V8 plant stage, while the experiment with low N received 30 kg/ha of N at sowing. The parent lines were genotyped with an Affymetrix Axiom Maize Genotyping Array of 616 K SNPs. Markers with a minor allele frequency (MAF) of 0.05 were removed. After applying QC, 54,113 SNPs were available to make the predictions.

Dataset 6. Disease Data
This dataset contains 438 lines for which three diseases were recorded: Pyrenophora tritici-repentis (PTR) that causes a disease originally named "yellow spot" but also known as tan spot, yellow leaf spot, yellow leaf blotch, or helminthosporiosis. The second disease, Parastagonospora nodorum (SN), is a major fungal pathogen of wheat fungal taxon that includes several plant pathogens affecting the leaves and other parts of the plants. The third disease, Bipolaris sorokiniana (SB), is the cause of seedling diseases, common root rot and spot blotch of several crops such as barley and wheat. The 438 wheat lines were evaluated in the greenhouse for several replicates, and the replicates were considered as different environments (Env1, Env2, Env3, Env4, Env5, and Env6). The total number of observations was 438 × 6 = 2,628 observations for which the three traits were measured.
DNA samples were extracted from each line, following the manufacturer's protocol. DNA samples were genotyped using 67,436 single nucleotide polymorphisms (SNPs). For a given marker, the genotype for the ith line was coded as the number of copies of a designated marker-specific allele carried by the ith line (absence = zero and presence = one). SNP markers with unexpected heterozygous genotypes were recoded as either AA or BB. We kept those markers that had fewer than 15% missing values. Next, we imputed the markers using observed allelic frequencies. We also removed markers with MAF<0.05. After quality control and imputation, a total of 11,617 SNPs were available for analysis.

Datasets 7-12. Wheat Data
Spring wheat lines selected for grain yield analyses from CIMMYT first year yield trials (YT) were used as the training population to predict the quality of lines selected from EYT for grain yield analyses in a second year. The analyses were conducted for only the grain yield trait unless specified, and using six sets of data, as reported below: All the lines were genotyped using genotyping-by-sequencing (GBS; Poland et al., 2012). The TASSEL version 5 GBS pipeline was used to call marker polymorphisms (Glaubitz et al., 2014), and a minor allele frequency of 0.01 was assigned for SNP discovery. The resulting 6,075,743 unique tags were aligned to the wheat genome reference sequence (RefSeq v.1.0) (IWGSC 2018) with an alignment rate of 63.98%. After filtering for SNPs with homozygosity >80%, p-value for Fisher's exact test <0.001 and χ 2 value lower than the critical value of 9.2, we obtained 78,606 GBS markers that passed at least one of those filters. These markers were further filtered for less than 50% missing data, greater than a 0.05 minor allele frequency, and less than 5% heterozygosity. Markers with missing data were imputed using the "expectation-maximization" algorithm in the "R" package rrBLUP (Endelman, 2011).

Datasets 13. Indica
The rice dataset, Monteverde et al. (2019), contains information on the phenotypic performance of four traits (GY = Grain Yield, PHR = Percentage of Head Rice Recovery, GC = percentage of Chalky Grain, PH = Plant Height) in three environments (2010, 2011, and 2012). For each year, 327 lines were evaluated and measured three times (once for each developmental stage: vegetative, reproductive, and maturation) and for 18 environmental covariates. 1) ThermAmp denotes the thermal amplitude (°C), average of daily thermal amplitude calculated as max temperature (°C)-min temperature (°C). 2) RelSun denotes the relative sunshine duration (%), quotient between the real duration of the brightness of the Sun and the possible geographical or topographic duration. 3) SolRad denotes solar radiation (cal/cm2/day), solar radiation calculated using Armstrong's formula. 4) EfPpit denotes effective precipitation (mm), average daily precipitation in mm added and stored in the soil. 5) DegDay denotes degrees day in rice (°C), mean of daily average temperature minus 10°. 6) RelH denotes relative humidity (hs), sum of hours (0-24 h) where the relative humidity was equal to 100%. 7) PpitDay denotes the precipitation day, which is the sum of days when it rained. 8) MeanTemp denotes the mean temperature (°C), average of temperature over 24 h (0-24 h). 9) AvTemp denotes average temperature (°C), average temperature calculated as daily (Max + Min)/2. 10) MaxTemp denotes the maximum temperature (°C), average of maximum daily temperature. 11) MinTemp denotes minimum temperature (°C), average of minimum daily temperature. 12) TankEv denotes the tank water evaporation (mm), amount of evaporated water under influence of Sun and wind. 13) Wind denotes wind speed (2 m/km/24 h), distance covered by wind (in km) over 2 m height in 1 day. 14) PicheEv denotes piche evaporation (mm), the amount of evaporated water without the influence of the Sun. 15) MinRelH denotes the minimum relative humidity (%), lowest value of relative humidity for the day. 16) AccumPpit denotes accumulated precipitation (mm), daily accumulated precipitation. 17) Sunhs denotes sunshine duration, sum of total hours of sunshine per day. 18) MinT15 denotes the minimum temperature below 15°, the sum of the days where the minimum temperature was below 15°.
This dataset is balanced, giving a total of 981 assessments with each line included once in each environment. Marker data were available for all lines and 16,383 SNP markers remained after quality control (each marker was coded with 0, 1, and 2).

Datasets 14. Japonica
This rice dataset was also reported by Monteverde et al. (2019) but belongs to the tropical Japonica population and contains phenotypic information on the same four traits as for the Indica population (GY = Grain Yield, PHR = Percentage of Head Rice Recovery, GC = percentage of Chalky Grain, PH = Plant Height) in five environments (2009( , 2010( , 2011( , and 2013( ). In 2009( , 2010( , 2011( , and 2013 lines were evaluated, respectively. In each year, the same 54 environmental covariates measured in the Indica dataset (dataset 13) were examined, that is, the 18 covariates measured in the three developmental stages: vegetative, reproductive, and maturation. Since this dataset is not balanced, a total of 1,051 assessments were evaluated in the five environments. Marker data were available for 320 lines and 16,383 SNP markers remained after quality control (each marker was coded with 0, 1, and 2).

Metrics for Evaluation of Prediction Accuracy
We employed the leave-one-environment out type of crossvalidation in each of the 14 datasets (Montesinos-López et al., 2022). That is, we used I − 1 environments as the training set and the remaining environments as the testing set until each of the I environments played the role of testing one time. In the case of the model (1), we did not require a tuning process, but under PLS we divided the respective training (information of the I − 1 environments) in inner training (80% of the training) and validation set (20% of the training). This was done under five nested cross-validations, which implies the training set was divided into five parts and one of those was used as the validation set and the remaining four as inner training. Then, the average of the five validation sets was reported as prediction accuracy to select the optimal hyperparameter, which in PLS, was the number of principal components to retain (Montesinos-López et al., 2022). Then, with this optimal hyperparameter (number of principal components), we refitted the model with the whole training set (information of the I − 1 environments), and finally, with these trained models, the prediction for the testing set (a full environment) was obtained. As a metric to evaluate the prediction accuracy, we used the normalized root mean square , with y i denoting the observed value i, whilef(x i ) represents the predicted value for observation i, with i 1, . . . , n). This metric was computed under the GBLUP (NRMSE GBLUP ) and PLS (NRMSE PLS ) methods, and then we calculated with both metrics the relative efficiency:

RE NRMSE NRMSE GBLUP NRMSE PLS
When RE NRMSE > 1, the best prediction performance in terms of NRMSE was obtained using the PLS method, but when RE NRMSE < 1, the GBLUP method was superior in terms of prediction accuracy, and of course, when RE NRMSE 1, both methods were equally efficient.

RESULTS
The results are given in nine sections. In sections 1-6 are the results for dataset 1 to dataset 6, respectively, section 7 gives the results for the datasets (dataset 7 to dataset 12), section 8 gives the results for dataset 13, and section 9 gives the results for dataset 14.

Dataset 1. EYT_1
With Predictor = E + G + GE When the GE was considered in the predictor, we observed relative efficiencies in terms of NRMSE of the GBLUP methods vs. the PLS method for trait DTHD of 0.73, 1.642, 3.343, and 0.434 for environments Bed5IR, EHT, Flat5IR, and LHT, respectively. Only in environments EHT and Flat5IR genomic prediction performance of PLS regression method was superior to GBLUP by 64.2% (EHT) and 234.3% (Flat5IR). However, across environments, we observed that the GBLUP method was better than the PLS since the relative efficiency was equal to 0.929. Across environments, the GBLUP outperformed the PLS by 7.1% in terms of prediction performance (Figure 1 with predictor = E + G + GE). For trait DTMT, the RE NRMSE were 1.591 (Bed5IR), 0.779 (EHT), 3.457 (Flat5IR) and 0.197 (LHT). Only in environments Bed5IR and Flat5IR the PLS outperformed the GBLUP by 59.1% and 245.7%, respectively, while in the remaining two of four environments, the GBLUP method was the best. Across environments, the GBLUP method was also better than the PLS by 17.7% (RE NRMSE = 0.823) (Figure 1 with predictor = E + G + GE). Also, in trait GY, only in one out of four environments the PLS outperformed the GBLUP method, since the RE NRMSE were 0.435 (Bed5IR), 0.884 (EHT), 1.825 (Flat5IR) and 0.764 (LHT). Across environments, the GBLUP was better than the PLS by 18.9% since the RE NRMSE = 0.811 (Figure 1 with predictor = E + G + GE). Finally, for trait Height, the relative efficiencies were 0.889 (Bed5IR), 1.431 (EHT), 7.895 (Flat5IR), 0.145 (LHT), and 1.095 (Global), which means that the PLS outperformed the GBLUP method by 43.1% (EHT), 689.5% (Flat5IR), and by 9.5% across environments. But in environments like Bed5IR and LHT, the GBLUP method was superior to the PLS method by 11.1% and 85.5%, respectively ( Figure 1 with predictor = E + G + GE). See more details in Supplementary Material.

With Predictor = E + G
When only the main effects of environments and genotypes were considered in the predictor, the RE NRMSE for each environment and across environments for trait DTHD were 1.344 (Bed5IR)

With Predictor = E + G + GE
With GE in the predictor, the RE NRMSE for trait DTHD were 0.81 (Bed2IR), 1.386 (Bed5IR), 3.011 (EHT), 2.181 (Flat5IR), 1.059 FIGURE 1 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 1 (EYT_1), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1 the PLS outperforms in terms of prediction performance (lower NRMSE) the GBLUP method.
FIGURE 2 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 2 (EYT_2), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1, the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).
Frontiers in Genetics | www.frontiersin.org July 2022 | Volume 13 | Article 920689 (LHT), and 1.235 (Global), which means that in four out of five environments, the PLS method was better than the GBLUP method by 38.6%, 201.1%, 118.1%, and 5.9% in environments Bed5IR, EHT, Flat5IR, and LHT, respectively. Across environments, the PLS outperformed the GBLUP method by 23.5% (Figure 2 with predictor = E + G + GE). For trait DTMT, the RE NRMSE were 1.85 (Bed2IR), 1.091 (Bed5IR), 1.135 (EHT), 2.467 (Flat5IR), 1.177 (LHT), and 1.255 (Global), which means that in the five environments, the PLS was superior to the GBLUB method by 85% (Bed2IR) (Figure 4 with predictor = E + G + GE). For trait FIGURE 3 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 3 (EYT_3), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1, the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).

With Predictor = E + G + GE
FIGURE 4 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 4 (Groundnut), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1, the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).

With Predictor = E + G + GE
With GE, the RE NRMSE for dataset Wheat_1 were 0.915 (YT_13_14), 1.198 (YT_14_15), and 1.041 (Global), that is, the PLS outperformed the GBLUP method by 19.8% (YT_14_15), and 4.1% across environments (Figure 7 with FIGURE 5 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 5 (Maize), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1 the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).
FIGURE 6 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 6 (Disease), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1 the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).

With Predictor = E + G + GE
With GE and considering the environmental covariates (EC), the RE NRMSE for trait GY were 0.888 in 2010, 1.010 in 2011, 1.010 in 2012, and 0.974 (Global), which means that in only one environment and across environments, the GBLUP method was better than the PLS method by 11.2% (2010) and 2.6% (Global). However, for this same trait but not including the environmental covariates, the RE NRMSE were 0.896 in 2010, 1.000 in 2011, 1.037 in 2012, and 0.980 (Global), so that the GBLUP only outperformed the PLS in 2010 by 10.4% and across years by 2% (Figure 8 with predictor = E + G + GE). For trait PHR, the environmental covariates in the predictor the RE NRMSE were 0.965 in 2010, 0.947 in 2011, 0.948 in 2012, which means that the GBLUP was better than the PLS by 3.5% in 2010, 5.3% in 2011, 5.2% in 2012. But ignoring the EC, the RE NRMSE were 0.972 in 2010, 0.947 in 2011, 0.940 in 2012, which means that the GBLUP was better than the PLS by 2.8% in 2010, 5.3% in 2011, 6.0% in 2012, and 4.7% (Global) (Figure 8 with predictor = E + G + GE).
FIGURE 7 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in datasets 7-12 (Wheat_1 to Wheat_6), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1, the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE).
Frontiers in Genetics | www.frontiersin.org July 2022 | Volume 13 | Article 920689 For trait GC with EC, the RE NRMSE were 0.974 in 2010, 0.934 in 2011, 0.992 in 2012, which means that the GBLUP was better than the PLS by 2.6% in 2010, 6.63% in 2011, 0.8% in 2012. But ignoring the EC, the RE NRMSE were 0.979 in 2010, 0.949 in 2011, 0.992 in 2012, which means that the GBLUP was better than the PLS by 2.1% in 2010, 5.1% in 2011, 0.8% in 2012 (Figure 8 with predictor = E + G + GE). Finally, for trait PH with EC, the RE NRMSE were 1.031 in 2010, 1.02 in 2011, 0.925 in 2012, which means that the GBLUP was better than the PLS by 7.5% in 2012 and 0.9% (Global). But ignoring the EC, the RE NRMSE were 0.915 in 2010, 0.991 in 2012, and0.979 (Global), which means that the GBLUP was better than the PLS by 8.5% in 2010, 0.9% in 2010, and 2.1% (Global) (Figure 8 with predictor = E + G + GE). See more details in Supplementary Material.

With Predictor = E + G
Without the GE and taking into account the EC, we observed the RE NRMSE for trait GY were 0.988 in 2010, 0.981 in 2011, 1.011 in 2012, which means that the GBLUP was better than the PLS by 1.2% in 2010, 1.9% in 2011, and 0.7% (Global). However, without taking into account the EC, the RE NRMSE were 0.955 in 2010, 0.980 in 2011, 1.011 in 2012, which means that the GBLUP was better than the PLS by 4.5% in 2010, 2.0% in 2011, and 1.8% (Global) (Figure 8 with predictor = E + G). For trait PHR, taking into account the EC, the RE NRMSE were 0.911 in 2010, 0.932 in 2011, 0.955 in 2012, that is, the GBLUP was better than the PLS by 8. 9% in 20109% in , 6.8% in 20119% in , 4.5% in 2012. While ignoring the EC, the RE NRMSE were 0.913 in 2010, 0.933 in 2011, 0.954 in 2012, so that the GBLUP was better than the PLS by 8. 7% in 20107% in , 6.7% in 20117% in , 4.6% in 2010 (Figure 8 with predictor = E + G).
For trait GC, taking into account the EC, the RE NRMSE were 0.983 in 2010, 0.933 in 2011, 0.986 in 2012, so that the GBLUP was better than the PLS by 1. 7% in 20107% in , 6.7% in 20117% in , 1.4% in 2012. While ignoring the EC, the RE NRMSE were 0.987 in 2010, 0.935 in 2011, 0.986 in 2012, and 2009, 2010, 2011, 2013. Which means that in both cases with (and without) EC across years, the best prediction performance was under the PLS method, which was superior by 1.5% with EC and by 5.9% without EC (Figure 9 with predictor = E + G + GE  , 2010, , 2011, , 2013 Which means that in both cases with (and without) EC across years, the PLS method outperformed the GBLUP method by 0.5% with EC and by 7.7% without EC (Figure 9 with predictor = E + G + GE  , 2010, 2011, 2012, 2013 and Global respectively, which means that with (and without) EC the GBLUP outperformed across year the PLS by 1.3% and 13.8% respectively (Figure 9 with predictor = E + G + GE  2010,2011,2012,2013, and Global, respectively, which means that with (and without) EC the PLS method outperformed across years the GBLUP method by 8.9 % and 25.7% respectively (Figure 9 with predictor = E + G + GE  2010,2011,2012,2013 and Global, respectively, which means that without EC the PLS method outperformed across years the GBLUP method by 11.1%, but with EC the GBLUP outperformed the PLS by 1% (Figure 9 with predictor = E + G  2010,2011,2012,2013, and Global, respectively, which means that with (and without) EC the GBLUP method outperformed across years the PLS method by 2.8 and 3.2%, respectively (Figure 9 with predictor = E + G). See more details in Supplementary Material.

DISCUSSION
Breeders need novel methodologies to match the productivity required to feed the world's increasing population. For this FIGURE 9 | Relative efficiency in terms of normalized root mean square error (RE_NRMSE) computed by dividing the NRMSE under the best linear unbiased predictor model (GBLUP) between the NRMSE of the partial least squares regression method. Prediction performance is reported for each environment and across environments (Global) in dataset 14 (Japonica), also with two predictors (E + G and E + G + GE). When the RE_NRMSE>1, the PLS outperforms the GBLUP method in terms of prediction performance (lower NRMSE). (A) With environmental covariates (EC) and (B) without EC.
reason, the adoption of genomic selection has been successful in plant breeding since it is a disruptive methodology that offers significant savings in resources in selecting candidate lines because GS is a predictive methodology trained with data that contain phenotypic and genotypic information. GS predicts breeding values or phenotypic values of new (untested) lines that were only genotyped, so it allows for earlier selection of lines, since we can obtain predictions for the lines of interest once the model is trained. (Crossa et al., 2017). However, even though GS is efficient and simple to understand, breeders worldwide are still struggling to put it into practice successfully. To guarantee the successful implementation of GS, we need to guarantee moderate or high prediction accuracies, which is challenging since the resulting prediction accuracy depends on many factors, such as the degree of relatedness between the training and testing sets, the statistical machine learning model used, the size of the training and testing sets, the trait under study, the quality of inputs like markers and environmental information, and the prediction problem at hand (for example, the prediction of lines that were evaluated in other environments, or the prediction of lines that were not evaluated in previous years or environments, etc.).
The reaction norm model of Jarquín et al. (2014) has been successfully used for multi-environmental data and for incorporating interactions among multi-type input sources (e.g., dense molecular markers, pedigree, high-throughput phenotypes, environmental covariables) in several crops (Perez-Rodriguez et al., 2015;Crossa et al., 2016). One of the major gaps in GS research for G×E modeling using environmental data relies in the steps of collection, processing, and integrating those data into an ecophysiology-smart and parsimony manner. A method for envirotyping-informed genomic prediction of GxE using linear and non-linear kernels was presented by Costa-Neto et al. (2020) in two maize germplasms and taking account for additive, dominance and their interaction with environments. These authors showed that nonlinear kernels are the better option to deal with environmental similarity realized with environmental covariables in GS.
For the predictions mentioned above, research is underway to optimize the GS methodology so that its practical implementation can be more consistent, robust, and straightforward. For this reason, with the goal of finding more robust statistical machine learning methods, we evaluated the prediction performance of the partial least squares regression in the context of predicting a new year or environment (leave one environment out cross-validation strategy). We found the PLS method was superior to the popular GBLUP method in most cases when the goal was to predict a complete location or year.
Our results are encouraging because they show that PLS regression is a powerful tool for this type of prediction problem, since across all datasets, the PLS outperformed the GBLUP method by large margins (see Figures 1-7). Across traits, environments, and types of predictors, the PLS outperformed the GBLUP in terms of normalized root mean square error by 58.8% in dataset 1 (EYT_1 data), 52.52% in dataset 2 (EYT_2 data), 49.84% in dataset 3 (EYT_3 data), 178.21 in dataset 4 (groundnut data), 127.07% in dataset 5 (maize data), 228.28% in dataset 6 (disease data), 62.31% in datasets 7-12 (wheat_1 to wheat_6 datasets) and 0% in dataset 13 (Indica dataset). For those environments that were more similar to the environments in the training set less error of prediction was observed, while for those environments that were more different from the environments in the training set, more error of prediction was reached. Our findings agree with those reported by Monteverde et al. (2019) for this type of prediction problem ("leave one environment out"), but in our case, instead of using Pearson's correlation as the metric of prediction performance, we used the normalized root mean square error.
Although our evaluations are empirical, we found that the PLS regression method is an efficient statistical machine learning prediction tool that is especially appropriate for small sample data with many (possibly correlated) independent variables, especially useful for small p and large n problems. However, since we only evaluated the leave one environment out crossvalidation scheme, our assertions are valid only for this context. However, other authors (Campbell and Ntobedzi, 2007;Bergström et al., 2012;Vucicevic et al., 2015;Zhang et al., 2015;Kouskoura et al., 2019) have evaluated the prediction accuracy of the PLS methodology with conventional strategies of cross-validation (k-fold cross-validations, stratified-k-fold cross-validations, etc.) and there is also empirical evidence that the PLS regression method produces very competitive predictions.
We observed that implementing the PLS regression is fast for small datasets, but the larger the dataset, the more computational resources are required for its implementation. In general, we observed that the implementation of the PLS is at least two times slower than the GBLUP method. In part this is because under PLS selecting the optimal number of principal components (hyperparameters) is by means of an inner five-fold crossvalidation that also is time consuming, but a key factor to a successful implementation of the PLS regression method. For this reason, the computational resources required for PLS implementation become a problem for moderate and large datasets. However, since the PLS regression only depends on one hyperparameter (number of principal components to retain), the tuning process is very easy, but with an increased need on computational resources.
One reason PLS is very competitive in terms of predictive modeling is that it automatically performs variable selection by creating the latent variables (factors) that are linear combinations of the original independent variables and the response variable (s). For this reason, PLS is quite efficient for handling many and correlated predictors, and additionally, it can detect which inputs (independent variables) are the most explanatory in a trained model by looking at the model coefficients (Wold, 2001;Mehmood et al., 2012). Vargas et al. (1998 and Crossa et al. (1999) have shown the benefits of PLS for identifying the set of independent variables (environmental as well as molecular markers) that best explain the genotype by environment interaction. For all these reasons, PLS regression has become a well-established tool in predictive and association modeling in bioinformatics and genomics, because it is often possible to interpret the extracted factors in terms of the underlying physical system; to derive "hard" modeling information from the soft model.
It is important to note that the PLS regression method is not restricted only to continuous and univariate response variables since it can also be used for binary and categorical univariate response variables and to other tasks such as survival analysis, multivariate modeling, and modeling of regulation network. At present, most reported applications of the PLS method to genomic data focus on the analysis of microarray data from gene expression experiments.

CONCLUSION
In this study, we compare the prediction performance of the popular GBLUP (in its Bayesian version) to the partial least squares (PLS) regression for the prediction of future seasons or new environments, when the goal is to predict the whole information of a new year or environment. This prediction scheme, denoted as "leave one environment out," is challenging because many times we do not have any information or reference for the new year or environment we want to predict. Consequently, the predictions accuracies are low or very low. We found that the PLS regression outperformed the prediction performance of the Bayesian GBLUP method in terms of normalized root mean square error by at least 49.84% (across traits, environments, and types of predictors, when environmental covariates were not considered), which is an impressive large gain. For this reason, we encourage doing more research in this direction to support our findings. However, with our results, we can see that the PLS regression algorithm is a powerful tool for predicting new environments (or years) in the context of genomic selection.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.