New developments in Sparse PLS regression

Methods based on partial least squares (PLS) regression, which has recently gained much attention in the analysis of high-dimensional genomic datasets, have been developed since the early 2000s for performing variable selection. Most of these techniques rely on tuning parameters that are often determined by cross-validation (CV) based methods, which raises important stability issues. To overcome this, we have developed a new dynamic bootstrapbased method for significant predictor selection, suitable for both PLS regression and its incorporation into generalized linear models (GPLS). It relies on the establishment of bootstrap confidence intervals, that allows testing of the significance of predictors at preset type I risk $\alpha$, and avoids the use of CV. We have also developed adapted versions of sparse PLS (SPLS) and sparse GPLS regression (SGPLS), using a recently introduced non-parametric bootstrap-based technique for the determination of the numbers of components. We compare their variable selection reliability and stability concerning tuning parameters determination, as well as their predictive ability, using simulated data for PLS and real microarray gene expression data for PLS-logistic classification. We observe that our new dynamic bootstrapbased method has the property of best separating random noise in y from the relevant information with respect to other methods, leading to better accuracy and predictive abilities, especially for non-negligible noise levels. Keywords: Variable selection, PLS, GPLS, Bootstrap, Stability


Introduction
Partial least squares (PLS) regression, introduced by (Wold et al., 1983), is a well-known dimension-reduction method, notably in chemometrics and spectrometric modeling (Wold et al., 2001). In this paper, we focus on the PLS univariate response framework, better known as PLS1. Let n be the number of observations and p the number of covariates. Then, y = (y 1 , . . . , y n ) T ∈ R n represents the response vector, with (.) T denoting the transpose. The original underlying algorithm, developed to deal with continuous responses, consists of building latent variables t k , 1 k K, also called components, as linear combinations of the original predictors X = (x 1 , . . . , x p ) ∈ M n,p (R), where M n,p (R) represents the set of matrices of n rows and p columns. Thus, where X 0 = X, and X k−1 , k 2 represents the residual covariate matrix obtained through the ordinary least squares regression (OLSR) of X k−2 on t k−1 . Here, w k = (w 1k , . . . , w pk ) T ∈ R p is obtained as the solution of the following maximization problem (Boulesteix and Strimmer, 2007): with the constraint w k 2 2 = 1, and where y 0 = y, and y k−1 , k 2 represents the residual vector obtained from the OLSR of y k−2 on t k−1 .
The final regression model is thus: with ǫ = (ǫ 1 , . . . , ǫ n ) T the n × 1 error vector and (c 1 , . . . , c K ) the regression coefficients related to the OLSR of y k−1 on t k , ∀k ∈ [[1, K]], also known as y-loadings. More details are available, notably in Höskuldsson (1988) and Tenenhaus (1998). This particular regression technique, based on reductions in the original dimension, avoids matrix inversion and diagonalization, using only deflation. It allows us to deal with highdimensional datasets efficiently, and notably solves the collinearity problem (Wold et al., 1984).
With great technological advances in recent decades, PLS regression has been gaining attention, and has been successfully applied to many domains, notably genomics. Indeed, the development of both microarray and allelotyping techniques result in high-dimensional datasets from which information has to be efficiently extracted. To this end, PLS regression has become a benchmark as an efficient statistical method for prediction, regression and dimension reduction (Boulesteix and Strimmer, 2007). Practically speaking, the observed response related to such studies does commonly not follow a continuous distribution. Frequent goals with gene expression datasets involve classification problems, such as cancer stage prediction, disease relapse, and tumor classification. For such reasons, PLS regression has had to be adapted to take into account discrete responses. This has been an intensive research subject in recent years, leading globally to two types of adapted PLS regression for classification. The first, studied and developed notably by Nguyen and Rocke (2002a), Nguyen and Rocke (2002b) and Boulesteix (2004), is a two-step method. The first step consists of building standard PLS components by treating the response as continuous. In the second step, classification methods are run, e.g., logistic discrimination (LD) or quadratic discriminant analysis (QDA). The second type of adapted PLS regression consists of building PLS components using either an adapted version of or the original iteratively reweighted least squares (IRLS) algorithm, followed by generalized linear regression on these components. This type of method was first introduced by Marx (1996). Different modifications and improvements, using notably ridge regression (Le Cessie and Van Houwelingen, 1992) or Firth's procedure (Firth, 1993) to avoid non-convergence and infinite parameter-value estimations, have been developed, notably by Nguyen and Rocke (2004), Ding and Gentleman (2005), Fort and Lambert-Lacroix (2005) and Bastien et al. (2005). In this work, we focus on the second type of adapted PLS regression, referred to from now on as GPLS.
As previously mentioned, a feature of datasets of interest is their high-dimensional setting, i.e., n ≪ p. Chun and Keleş (2010) have shown that the asymptotic consistency of PLS estimators does not hold in this situation, so filtering or predictor selection become necessary in order to obtain consistent parameter estimation. However, all methods described above proceed to classification using the entire set of predictors. For datasets that frequently contain thousands of predictors, such as microarray ones, a variable filtering pre-processing thus needs to be applied. A commonly used pre-processing method when performing classification uses the BSS / WSS-statistic: withμ j the sample mean of x j , andμ jq the sample mean of x j in class G q for q ∈ {1, . . . , Q}.
Then, predictors associated with the highest values of this are retained, but no specific rule exists to choose the number of predictors to retain. A Bayesian-based technique, available in the R-package limma, has become a common way to deal with this, computing a Bayesianbased p-value for each predictor, therefore allowing users to choose the number of relevant predictors based on a threshold p-value (Smyth, 2004). This method cannot be considered parsimonious, but rather as a pre-processing stage for exclusion of uninformative covariates. Reliably selecting relevant predictors in PLS regression is of interest for several reasons. Practically speaking, it would allow users to identify the original covariates which are significantly linked to the response, as is done in OLS regression with Student-type tests. Statistically speaking, it would avoid the establishment of over-complex models and ensure consistency of PLS estimators. Several methods for variable selection have already been developed (Mehmood et al., 2012). Lazraq et al. (2003) group these techniques into two main categories. The first, model-wise selection, consists of first establishing the PLS model before performing a variable selection. The second, dimensionwise selection, consists of selecting variables on one PLS component at a time.
A dimension-wise method, introduced by Chun and Keleş (2010) and called Sparse PLS (SPLS), has become the benchmark for selecting relevant predictors using PLS methodology. The technique is for continuous responses and consists of simultaneous dimension reduction and variable selection, computing sparse linear combinations of covariates as PLS components. This is achieved by introducing an L 1 constraint on the weight vectors w k , leading to the following formulation of the objective function: where λ determines the amount of sparsity. More details are available in Chun and Keleş (2010). Two tuning parameters are thus involved: η ∈ [0, 1] as a rescaled parameter equivalent to λ, and the number of PLS components K, which are determined through CV-based mean squared error (CV MSE). We refer to this technique as SPLS CV in the following. Chung and Keles (2010) have developed an extension of this technique by integrating it into the generalized linear model (GLM) framework, leading it to be able to solve classification problems. They also integrate Firth's procedure, in order to deal with non-convergence issues. Both tuning parameters are selected using CV MSE. We refer to this method as SGPLS CV in the following.
A well-known model-wise selection method is the one introduced by Lazraq et al. (2003). It consists of bootstrapping pairs (y i , x i• ) , 1 i n, where x i• represents the i th row of X, before applying PLS regression with a preset number of components K on each bootstrap sample. By performing this method, approximations of distributions related to predictors' coefficients can be achieved. This leads to bootstrap-based confidence intervals (CI) for each predictor, and so opens up the possibility of directly testing their significance with a fixed type I risk α. The advantages of this method are twofold. First, by focusing on PLS regression coefficients, it is relevant for both the PLS and GPLS frameworks. Second, it only depends on one tuning parameter K, which must be determined earlier.
An important related issue should be mentioned. While performing this technique, approximations of distributions are achieved conditionally on the fixed dimension of the extracted subspace. In other words, it approximates the uncertainty of these coefficients, conditionally on the fact that estimations are performed in a K-dimensional subspace for each bootstrap sample. The determination of an optimal number of components is crucial for achieving reliable estimations of the regression coefficients (Wiklund et al., 2007). Thus, since this determination is specific to the dataset in question, it must be performed for each bootstrap sample, in order to obtain reliable bootstrap-based CI. We have established some theoretical results which confirm this (SI 1).
Determining tuning parameters by using q-fold cross-validation (q-CV) based criteria may induce important issues concerning the stability of extracted results (Hastie et al. (2009, p.249); Boulesteix (2014); Magnanensi et al. (2015)). Thus, using such criteria for successive choosing of the number of components should be avoided. As mentioned, amongst others, by Efron and Tibshirani (1993, p.255) and Kohavi (1995), bootstrap-based criteria are known to be more stable than CV-based ones. In this context, Magnanensi et al. (2015) developed a robust bootstrap-based criterion for the determination of the number of PLS components, characterized by a high level of stability and suitable for both the PLS and GPLS regression frameworks. Thus, this criterion opens up the possibility of reliable successive choosing for each bootstrap sample.
In this article, we introduce a new dynamic bootstrap-based technique for covariate selection suitable for both the PLS and GPLS frameworks. It consists of bootstrapping pairs (y i , x i• ), and successive extraction of the optimal number of components for each bootstrap sample, using the previously mentioned bootstrap-based criterion. Here, our goal is to better approximate the uncertainty related to regression coefficients by removing the condition of extracting a fixed K-dimensional subspace for each bootstrap sample, leading to more reliable CI. This new method both avoids the use of CV, and features the same advantages as those previously mentioned related to the technique introduced by Lazraq et al. (2003). We refer to this new dynamic method as BootYTdyn in the following.
We also succeed in adapting the bootstrap-based criterion introduced by Magnanensi et al. (2015) to the determination of a unique optimal number of components related to each preset value of η in both the SPLS and SGPLS frameworks. Thus, these adapted versions, by reducing the use of CV, improve the reliability of the hyper-parameter tuning. We will refer to these adapted techniques as SPLS BootYT and SGPLS BootYT, respectively.
The article is organized as follows. In Section 2, we introduce our new dynamic bootstrapbased technique, followed by the description of our adaptation of the BootYT stopping criterion to the SPLS and SGPLS frameworks. In Section 3, we present simulations related to the PLS framework, and summarize the results, depending notably on different noise levels in y. In Section 4, we treat a real microarray gene expression dataset with a binary response, here the original localization of colon tumors, by benchmarking our new dynamic bootstrap-based approach for GPLS regression. Lastly, we discuss results and conclusions in Section 5.

A new dynamic bootstrap-based technique
As mentioned in Section 1, the selected number of extracted components is crucial for reliable estimation of β PLS j , 1 j p (Wiklund et al., 2007). We have shown (SI 1) that selecting an optimal number of components on the original dataset and using it to perform PLS regression on the constructed bootstrap samples, as done by Lazraq et al. (2003), is not correct for the obtention of reliable CI.
In order to take into account these theoretical results, we have developed a new dynamic bootstrap-based approach for variable selection relevant for both the PLS and GPLS frameworks. The approach consists of estimating the optimal number of components for each bootstrap sample created during the algorithm. To obtain consistent results, a robust and resample-stable stopping criterion in component construction has to be used. Let us use β j to mean β PLS j in the following, in order to lighten notation. The algorithm for this new method is as follows: 1. Let D ori be the original dataset and R the total number of bootstrap samples D b r , r ∈ [[1, R]].

∀r ∈ [[1, R]]:
• Extract the number of PLS components that is needed for D b r using a preset stopping criterion.
3. ∀j ∈ [[1, p] ], construct a (100 × (1 − α))% bilateral BC a CI for β j , noted: ∈ CI j then retain x j else delete x j . 5. Obtain the reduced model M sel by only integrating the significant predictors, and extracting the number of components K sel determined by the preset stopping criterion.
Note that here, we set α = 0.05.

An adapted bootstrap-based Sparse PLS implementation
As mentioned by Boulesteix (2014), using q-CV based methods for tuning parameters potentially induces problems, notably concerning variability of results due to dependency on the way folds are randomly chosen. However, as detailed in Section 1, the selection of both tuning parameters involved in the SPLS regression developed by Chun and Keleş (2010) is performed using q-CV MSE. Therefore, to improve the reliability of this selection, we adapt the bootstrap-based stopping criterion to this method, which gives the following algorithm: 1. Let {η 1 , . . . , η s } be a set of pre-chosen values for the sparsity parameter and {k 1 , . . . , k s } = {1, . . . , 1} the set of initial numbers of components for each η i . Let i = 1.
Retesting all components obtained after each increase in k i is essential since the original predictors involved in the components construction change when k i increases. (Chun and Keleş, 2010). This fact, combined with the aim of retaining orthogonality between components, leads the components themselves to change, so that the significance of each component has to be retested at each step.
While the original implementation compares K max × s models through CV MSE, with K max the maximal number of components (set by the user), this new bootstrap-based version only focuses on s models, since only one number of components is determined for each preset value of η. An illustration of the stability improvement obtained by using this new implementation, based on the simulated dataset introduced in Section 3.2.1 with sd (ǫ) = 5, is shown in Fig. 1.

Simulations for accuracy comparisons
These simulations are based on a simulation scheme proposed by Chun and Keleş (2010, p. 14) and modified in order to study high-dimensional settings. We consider the case where there are less observations than predictors, i.e., n < p, and set n = 100, and p = 200 or 1000. Let q be the number of spurious predictors. While Chun and Keleş (2010, p. 14) only consider a ratio q/p equal to 0.25 and 0.5, both the 0.05 and 0.95 ratio values have been added here. Four independent and identically distributed hidden variables h 1 , . . . , h 4 , following a N (0, 25I n ) distribution, were computed. Then, columns of the covariate matrix X were generated by x j = h l + ǫ j for p l−1 + 1 j p l , where l = 1, . . . , 4, (p 0 , . . . , p 4 ) = (0, (p − q) /2, p − q, p − r, p), r = 5 when p = 200 and r = 10 when p = 1000, and ǫ 1 , . . . , ǫ p are drawn independently from a N (0, 0.1I n ). Also, y is generated by 3h 1 − 4h 2 + f , where f is normally distributed with mean 0 and variance such that the signal-to-noise ratio (SNR) equals 10.
Using this simulation scheme, accuracy of the SPLS technique using 10 fold-CV for selecting tuning parameters (SPLS CV), and our new dynamic bootstrap method combined with the bootstrap-based stopping criterion (BootYTdyn), is compared. In order to do so, for each parameter setting, 50 selections of the sparse support related to both methods were established. Lastly, mean accuracy values over the 50 trials were calculated. Results are summarized in Table 1. Based on these results, SPLS CV gives better accuracy than BootYTdyn for ratio values q/p that are not close to 0 or 1. While both give good performance when the ratio is close to 0, i.e., when a major part of predictors are significant, BootYTdyn outperforms SPLS CV when only a small proportion of predictors are significant. Nevertheless, in this simulation set-up, covariates are collected into four groups. While within-group correlations between covariates are close to one, between-group ones are close to zero. This unrealistic situation makes irrelevant the determination of an optimal support, and seems more appropriate to selecting the number of components. As an illustration, 50 additional samples in the p = 1000 and q/p = 0.5 case were simulated. We then calculated the predictive MSE (PMSE) based on four different supports S 1 , S 2 , S 3 , S 4 , where Table 2. In the light of these observations, the aim of this simulation scheme is instead the extraction if an optimal number of components rather than an optimal support. We thus decided to use a real dataset as covariate matrix for a more general and relevant comparison.

Simulations for global comparison 3.2.1. Dataset simulations
In this study, we used a real microarray gene expression dataset, which was created using fresh-frozen primary tumors samples, from a multi-center cohort, with stage I to IV colon cancer. 566 samples fulfilled RNA quality requirement, and constituted our database. These samples were split into a 443 sample discovery set and a 123 sample test set, well balanced for the main anatomo-clinical characteristics. This database has already been studied by Marisa et al. (2013) and more details on it are available in their article.
In order to reduce computational time, a preliminary selection of 100 predictors was performed. Based on the original localization of the tumors as response vector, and the full 566 samples, the 100 most differentially expressed probe sets were extracted. As mentioned in Section 1, this pre-processing is based on a Bayesian technique and gives us our benchmark predictors matrix.
We performed simulations for six distinct levels of random noise standard deviation in order to investigate the performance of the different methods. Both these standard deviations and their related SNR are shown in Table 3.

Benchmarked methods
Using these simulated datasets, eight methods were analyzed and compared.
1. Q 2 . The original bootstrap-based method, introduced by Lazraq et al. (2003), combined with the 10-fold CV-based Q 2 criterion (Tenenhaus, 1998, p. 83) for pre-selecting the number of components. 2. BIC. The bootstrap-based method, introduced by Lazraq et al. (2003), combined with the corrected BIC using estimated degrees of freedom (DoF) (Krämer and Sugiyama, 2011) for pre-selecting the number of components. 3. BootYT. The bootstrap-based method, introduced by Lazraq et al. (2003), combined with the bootstrap-based criterion (Magnanensi et al., 2015) for previously selecting the number of components. 4. BICdyn. Our new dynamic bootstrap-based method combined with the corrected BIC criterion for successive selections of number of components. 5. BootYTdyn. Our new dynamic bootstrap-based method combined with the bootstrapbased criterion for successive selections of number of components. 6. SPLS CV. The original SPLS method using 10-fold CV for tuning parameter determination (Chun and Keleş, 2010). 7. SPLS BootYT. The new adapted SPLS version using the bootstrap-based criterion for component selection. 8. Lasso. Lasso regression, included as a benchmark (Efron et al., 2004).

Simulation scheme and notation
In order to perform reliable comparisons between the eight methods, each type of trial was performed one hundred times. Numbers of components, sparse supports and sparse tuning parameters are the main examples of these. Results linked to the highest occurrence rates are then chosen for method comparison. All bootstrap-based techniques were performed with R = 1000 bootstrap samples, and each related CI was constructed with type I risk α = 0.05.
The global comparison has two main parts. First, in order to compare accuracy and stability related to each technique, we focus both on different supports, and models extracted by the different variable selection methods. Indeed, in the PLS framework, a specific model results from both a set of predictors and a specific number of components. Due to the sparsity parameter in SPLS approaches, the same support can be extracted, but with a different number of components leading to different models. Lasso regression can also extract the same support for several different sparsity parameters, leading to different estimations of model coefficients. Therefore, for clarity, the following notation, related to each specific variable selection technique, is introduced.
• S sel , the selected support, i.e., the one that appears the most often.
• M sel , the selected model, i.e., the one that appears the most often.
• %S sel , rate of occurrence of the selected support.
• %M sel , rate of occurrence of the selected model.
• K sel , the number of components related to the selected model.
Second, in order to compare the predictive ability of models, 10-fold CV MSE, related to each selected sparse model through PLS regression, were computed one hundred times. The test set was also used in order to confirm results obtained by CV.
Note that, concerning the dynamic BIC-based method for sd (ǫ) = 0.5, only 97 trials performed well. Lastly, due to equality of occurrence rates between the two most-represented pairs of tuning parameters, results for SPLS CV and sd(ǫ) = 5 come instead from 150 trials.

Stability and accuracy results
Both the mean accuracy values over trials in each case, and stability results based on extracted supports, are given in Tables 4, 5 and 6. The numbers of components used for the original bootstrap-based approach (Lazraq et al., 2003) are summarized in SI 3.  Concerning the number of different models, results related to lasso regression are the same as those concerning the number of different supports. Only the result for sd(ǫ) = 0.5 differs, since one trial finished with a fifth value of the sparsity parameter and the same support as the model that was selected. Therefore, only results concerning models established using SPLS Q 2 BIC BICdyn BootYT BootYTdyn SPLS CV SPLS BootYT Lasso sd(ǫ) = 0.5 11 (5) 6 (4) 5 (5) 8 (5) 7 (6) 4 (4) 4 (4) 4 (4) sd(ǫ) = 1 10 (4) 9 (5) 9 (5) 9 (5) 6 (4) 4 (4) 4 (4) 5 (3) sd(ǫ) = 3 16 (4) 11 (6) 8 (6) 7 (4) 6 (4) 8 (4) 6 (4) 7 (4) sd(ǫ) = 4 24 (3) 11 (5) 10 (5) 6 (4) 5 (3) 6 (4) 6 (4) 7 (4) sd(ǫ) = 5 21 (3) 12 (5) 10 (5) 9 (4) 3 (3) 5 (3) 3 (3) 7 (4) sd(ǫ) = 6.366 15 (3) 16 (4) 11 (4) 7 (4) 3 (3) 4 (3) 5 (3) 7 (4) methods are summarized in Table 7. In the case of bootstrap-based techniques, supports and models are similar, since no sparsity parameter is needed. Concerning the three bootstrap-based techniques and in the light of accuracy results (Table 4), BootYT outperforms both the others, except in the case where sd(ǫ) = 0.5, for which BIC should be used. This exception is confirmed through the comparison of the dynamic bootstrap-based methods, where the sd(ǫ) = 0.5 case is the only one where using the BIC criterion also represents the most relevant choice. This phenomenon matches with conclusions obtained by Magnanensi et al. (2015), in that the BIC criterion is well-designed for small values of noise variance, while BootYT outperforms it for non-negligible levels of random noise. The use of the Q 2 criterion for selecting the number of components would appear never be a reliable option, so combining this criterion with our new dynamical approach was not done. The accuracy values highlight the fact that our new method is always an improvement over the original one. Concerning the two versions of SPLS regression, both gave similar accuracy. These stand to perform better than others for the smallest levels of random noise variance, while BootYTdyn outperforms all others for the highest values of noise variability. Tables 5 and 7, the Q 2 -based approach for predictor selection has non-negligable stability issues, providing between 18 and 96 different models in the 100 simulations. Depending on both the noise variability and the criterion combined with our dynamic approach, the latter improves over the original one by stabilizing the choice of sparse support. Cases where this is observed match with previous conclusions established when analyzing accuracy results, namely both the BIC-based dynamic approach for small values of noise variability, and BootYTdyn used for datasets with non-negligible noise variances. This strengthens the conclusion that BIC is well-designed for small values of noise variance, while BootYT performs best for non-negligible noise. As for SPLS methods, our new bootstrap-based version gains in stability in that it retains fewer different optimal pairs (η opt , K opt ), and also sparse models, than the original does. Moreover, since Γ 1 = Γ 2 for all studied datasets, this directly implies that it retains a unique optimal number of components for each sparse support. It thus permits a choice of optimal model in a more reliable way than using the CV-based technique. Lastly, lasso regression has good stability performances, leading it, BootYTdyn and SPLS BootYT to be recommended when stability is important.

Based on results introduced in
The last descriptive statistic concerns the number of significant predictors retained ( Table  6). The Q 2 -based approach is the least sparse, selecting the highest number of covariates. Both BICdyn BootYTdyn improve respectively BIC and BootYT, showing better accuracy related to their selected support. Indeed, the four expected covariates are always included in these selected supports. Once more, BootYT can be recommended for datasets with nonnegligible random noise variability compared with corrected BIC, which has to be applied for small values of random variance. Indeed, both the BIC and BICdyn approaches, by retaining globally increasing numbers of predictors while the random noise standard deviation increases, lead us to suppose that they use some predictors to model this random noise, leading to over-fitting. On the contrary, while BootYT selects a stable number of significant predictors, BootYTdyn selects a decreasing number of significant predictors as the random noise standard deviation increases, which is expected. As a confirmation of over-fitting issues with BIC, we present the 10-fold CV-based MSE in Fig. 2. These results tend to confirm our suspicion of over-fitting since, except for results related to BootYTdyn, the others have MSE that do not match with the theoretical random noise variances, suggesting that a part of the noise is being modeled. As for the two SPLS methods, there is no pertinent difference to mention, both of them concluding on similar numbers of selected predictors. Lastly, like in BIC-based approaches, the lasso extracts an increasing number of significant predictors.
As a first conclusion, we can thus reasonably conclude that, based on these first simulation results, BootYTdyn and SPLS BootYT should be used in practice.

Complexity and predictive ability results
To confirm and strengthen the conclusions of Section 3.2.2, we will now focus on the predictive abilities of the models selected by the various approaches. We calculate one hundred times the 10-fold CV MSE based on the original simulated response values (without noise) of the various selected models. These results thus reflect the accuracy in predicting the original information by leaving out random noise. We also compute the Predictive MSE (PMSE) based on the test set, by using its simulated response y test without including noise. Lastly, we extract the DoF of each selected sparse model (DoF sel ) to compare the respective complexities.
Graphical results for these statistics, related to Q 2 , BIC and BootYT, are shown in Fig. 3. The evolution of estimated DoF both highlights and confirms BIC and Q 2 over-fitting issues. Indeed, as the random noise standard deviation increases, these methods globally build sparse models with increasing complexity. Thus, they model increasing parts of the inserted random noise, implying poor predictive abilities of their selected models compared to those obtained by applying BootYT. This is confirmed through higher values of PMSE and CV MSE, especially for datasets with non-negligible random noise variability. These results confirm the conclusions from the previous section in that using the Q 2 criterion for selecting the number of PLS components should be avoided, and that BootYT outperforms corrected BIC, except for responses with negligible random noise levels. Therefore, only BIC and BootYT are retained for further comparison.
The results shown in Fig. 4 highlight that BootYTdyn is the only method that models with decreasing DoF, ensuring a complexity reduction suitable to the avoidance of prediction issues. Indeed, BootYTdyn selects models with the lowest PMSE and 10-CV MSE.
In light of these results, only BootYTyn is retained for further comparison. Comparing the two SPLS implementations with respect to their predictive abilities lead us to recommend SPLS BootYT, since models selected by this bootstrap-adapted SPLS technique feature comparable if not lower PMSE and 10-CV MSE (Fig. 5). Let us clarify that, in order to ensure relevant comparisons, we used ordinary PLS regressions with both the support and the number of components selected by the SPLS methods, and not SPLS methods with selected tuning parameters, for computing the 10-CV MSE.
For datasets characterized by a low level of random noise variability in y, SPLS BootYT builds models inducing the smallest CV MSE and PMSE. By focusing on non-negligible random variability, BootYTdyn gives the smallest predictive errors, thus showing the high level of robustness of this approach against random noise. This robustness is, as previously mentioned, due to its decreasing number of significant predictors, and also components, leading to decreasing DoF, i.e., a loss of complexity. Lastly, we compare the two retained methods BootYTdyn and SPLS BootYT, and the lasso is run. As for SPLS methods, in order to perform relevant comparisons, the supports extracted by the lasso are used as sets of covariates for a PLS regression. The number of PLS components is then established by performing one hundred times their selection using the bootstrap-based stopping criterion; the number of components related to the highest occurrence rate was selected. In this way, results shown in Fig. 6 concerning 10-CV MSE give the predictive ability of the extracted supports for PLS regression. In order to provide a clear picture of the impact of this choice, the PMSE obtained through the lasso is also shown in Fig. 6. These approaches are referred to as Lasso and Lasso.supp. Except for negligible values of random noise variability in y, the support extracted by the lasso regression and applied as explanatory variables for PLS regression are linked to both the highest 10-CV MSE and highest PMSE. This is a direct consequence of the lasso's accuracy issues mentioned in Section 3.2.4, notably the increasing number of extracted covariate while the random noise variability rises. However, performing predictions using the model obtained by the original lasso regression lead to the lowest PMSE values. This is due to the L 1 penalization which is applied on the vector of estimated parameters, thus permitting correction of the relative lack of accuracy of this technique.
To conclude, we summarize our conclusions in the following Table 8, and recommend certain approaches, depending on whether the initial aim was to select significant predictors or to obtain a sparse model with attractive predictive ability.

Real dataset application
In this section, we deal with the predictors matrix introduced in Section 3.2.1 and the original binary response vector. Five approaches for variable selection, adapted for the GPLS framework, are considered for comparison.

1.
BootYT. The bootstrap-based method, introduced by Lazraq et al. (2003), combined with the bootstrap-based criterion (Magnanensi et al., 2015) for pre-selecting the number of components. 2. BootYTdyn. The new dynamic bootstrap-based method combined with the bootstrapbased criterion for successive determinations of the number of components. 3. SGPLS CV. The original SGPLS method using 10-fold CV for tuning parameter selection (Chung and Keles, 2010). 4. SGPLS BootYT. The new adapted SGPLS version using the bootstrap-based criterion for component selection. 5. RSGPLS. An approach developed by Durif et al. (2015). It consists of adapting the SGPLS method by introducing a ridge penalty to ensure convergence of parameter estimations, and stability in hyper-parameter tuning. They also propose an adjustment of the L 1 constraint in order to further penalize less significant predictors. Hyperparameters are tuned using CV MSE. 6. Lasso. The adapted lasso regression for logistic framework, available in the R package glmnet, as a benchmark.
Concerning bootstrap-based approaches, the incorporation of PLS methodology into GLM, developed by Bastien et al. (2005), was used. Due to non-convergence issues for parameter estimations, some samples were excluded using a threshold for parameter estimations. Indeed, a model built using a bootstrap sample with at least one parameter estimate that is higher in absolute value than 10 4 times the one (in absolute value as well) estimated on the original dataset, leads to the exclusion of this bootstrap sample. Thus, to ensure a sufficient number of relevant bootstrap samples, the preset number of computed samples was increased to R = 4000. As for the lasso, three different loss functions for the establishment of the sparsity parameter using 10-fold CV were used: the number of misclassified values, the MSE, and the deviance. We refer to them as Lasso.Cl, Lasso.MSE and Lasso.Dev, respectively. The set of values defined for the sparsity parameter is preset by the "glmnet" package. For the SGPLS and RSGPLS approaches, the number of components K varies from 1 to 10, the sparsity parameter η varies from 0.04 to 0.99 (by steps of 0.05), and the ridge parameter involved in the RSGPLS technique is selected from 31 points log 10 -linearly spaced in the range [10 −2 , 10 3 ], as in Durif et al. (2015).
Each method was performed one hundred times in order to obtain relevant results. However, due to the high observed stability of results extracted with the BootYTdyn approach, and in order to save computational time, this was only performed twenty times instead of a hundred.
Stability results for tuning parameter selection, except for the bootstrap-based methods, are shown in Fig. 7 and Table 9.   Based on results summarized in Table 9, the lasso methodology, using CV-based MSE or deviance values for the selection of its hyper-parameter, is the most stable. This can be explained by the fact that only one hyper-parameter is involved, while the other techniques are based on two or three tuning parameters. Using the number of misclassified values for CV has to be avoided for stable selection of the sparsity parameter. Our SGPLS adaptation with the BootYT criterion improves reliability in selecting the set of tuning parameters, as previously observed in the PLS framework (Sections 2.2 and 3.2.4). Note that, concerning RSGPLS, three different sets of optimal parameters were extracted with maximal occurrence rate of five, all of them selecting the same set of predictors. Thus, the set of parameters which leads to the smallest number of misclassified values on the training dataset was retained. As already mentioned in Section 3.2.3, extracting the same support does not necessarily lead to the same model when sparsity or ridge parameters are involved. Therefore, the numbers of sparse supports and models retained are respectively summarized in Tables 10 and 11.  In light of these results, BootYTdyn and both the MSE-and deviance-based lasso techniques are the most stable in extracting supports and models. This could be due in part to the fact than all of them depend on only one tuning parameter. Even if this hyper-parameter for the BootYTdyn approach, i.e., the number of components, has to be chosen R times, the high stability of the bootstrap-based stopping criterion introduced by Magnanensi et al. (2015) endows this approach with good stability in selecting the sparse support. As for the PLS framework, our new bootstrap-based SGPLS implementation improves the stability here. The lack of stability of the lasso based on misclassified values is directly induced by the discrete form of this statistic. This issue was also observed and mentioned in Magnanensi et al. (2015).
The various selected supports are displayed in Fig. 8. Note that both the MSEand deviance-based lasso regressions select the same support and the same model, noted Lasso.MSE.Dev in the following.  In the following, two additional independent public datasets, stated by Marisa et al. (2013) as being comparable with our original dataset, were included for comparison of predictive abilities. These datasets are named GSE18088 (n=53) (Gröne et al., 2011) and GSE14333 (n=247) (Jorissen et al., 2009). Both the MSE and number of misclassified (MC) value of the selected models were computed on both the training and test parts of the original dataset, as well as on the two additional datasets. These results are summarized in Table 12. Since independent datasets were available, we decided to follow the suggestion of Van Wieringen et al. (2009, p.1596: "the true evaluation of a predictor's performance is to be done on independent data". In this real data study, BootYTdyn retains only one predictor, which is also retained by all other methods. Thus, as expected, this most sparse support induced the highest values of both the MSE and number of misclassified observations, based on the training subset of the original dataset. It also provided the highest values based on the test subset of the original dataset, which is to be expected since, as explained in Section 3.2.1, these two parts are well-balanced in terms of anatomo-clinical characteristics. Thus, this causes a bias in evaluation of model's predictive abilities, by making comparable MSE and misclassified values based on both the training and testing subsets. Results in Table 12 confirm this property, and also highlight the usefulness of additional independent datasets for reliably comparing predictive abilities. While a well-designed model for predictive purposes will provide similar PMSE values on comparable independent additional datasets compared to MSE obtained on the training dataset, an over-fitted one would be related to higher PMSE values, due to its dependance on training data. Thus, the differences, noted ∆ MSE, between the MSE obtained on the training subset of the original dataset, and those obtained on the three test datasets, are shown in Fig. 9. Our new dynamic bootstrap-based approach is the only one that exhibits this MSE stability property, while all others provided higher PMSE values on the two additional datasets than on the original one. This led BootYTdyn to have only 48 (16%) misclassified values on these two additional datasets, which represents the best result among all studied approaches. Thus, we can reasonably assume that our new method has helped to remove non-informative predictors, in the sense of not being relevant for improving predictive ability.
Lastly, the unique extracted probe set, named 230784 at, is already known to be related to the original location in the distal colon. The sign of the regression coefficient obtained with BootYTdyn is coherent with this state-of-the-art result, and thus strengthens our conclusion.

Discussion
In this article, we developed a new dynamic bootstrap-based technique for variable selection, suitable for both the PLS and GPLS frameworks, and proposed a bootstrap-based version of the SPLS and SGPLS methods for selecting the number of components. While the first of these lets us completely avoid CV, the second lets us select the set of tuning parameters in a more reliable way.
In state-of-the-art approaches, the use of CV-based techniques for selection of hyperparameters is common, and can lead to important stability issues, observed notably by Boulesteix (2014) and Magnanensi et al. (2015), and confirmed in our studies. Sun et al. (2013) also worked on this subject, and proposed a methodology for selecting tuning parameters of penalized regressions in order to stabilize variable selection. Even if their method is not applicable to the selection of the number of components in PLS regression, it would be interesting to adapt it to both SPLS CV and our new SPLS BootYT implementation, in order to look for potential stability gains. However, our new dynamic bootstrap-based technique represents a useful method for avoiding CV-related issues, since the unique hyperparameter is successively selected using a bootstrap-based criterion. This new technique improves on the original bootstrap-based methodology introduced by Lazraq et al. (2003), in that it permits us to approximate the distribution of covariates' regression coefficients by removing the condition of working in a subspace of fixed dimension K. Theoretical results have been established that strengthen the usefulness of building subspaces spanned by a dynamic number of components for performing PLS regressions on bootstrap samples.
In the PLS framework, conclusions based on our simulations are twofold. First, for datasets with negligible random noise in y, SPLS BootYT is recommended. Indeed, both in terms of accuracy and predictive abilities, it outperforms all other techniques compared here. Second, for datasets with non-negligible random noise in y, which represents the more realistic case, our new dynamic bootstrap-based method is recommended. As for SPLS BootYT for the negligible random noise variability framework, BootYTdyn outperform all other methods for each property studied. Furthermore, it is the only method which concludes as to a decreasing number of significant predictors, while the random noise variability increases, which was expected here.
Results obtained from our classification study using real datasets match with previous conclusions. Indeed, BootYTdyn is the only one which leads to the expected PMSE values on two additional independent datasets, which allows us to suppose that over-fitting issues were avoided, and that these PMSE are induced by noise or information which cannot be modeled using only gene expression. Furthermore, the extracted probe set is already known to be linked to the relevant location in the distal colon, which strengthens our confidence in this new dynamic approach.
Lastly, our new bootstrap-based SPLS implementation improves the stability of this method. Indeed, in all cases studied, both the SPLS CV and SGPLS CV conclude in a higher number of different sets of hyper-parameters than our bootstrap-based versions do, leading to higher numbers of different supports and models too.
In the future, simulations need to be done to booster the results obtained on real datasets for the logistic framework (Section 4). Testing the performance of these new approaches for responses following other distributions also must be done. However, based on all results obtained in these studies, our new dynamic method appears to be the most efficient compared to state-of-the-art approaches for datasets with non-negligible noise variability, a common situation in daily practice.