New Developments in Sparse PLS Regression

Magnanensi, Jérémy; Maumy-Bertrand, Myriam; Meyer, Nicolas; Bertrand, Frédéric

doi:10.3389/fams.2021.693126

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 16 July 2021

Sec. Mathematics of Computation and Data Science

Volume 7 - 2021 | https://doi.org/10.3389/fams.2021.693126


Jérémy Magnanensi^1,2

Myriam Maumy-Bertrand^1,3

Nicolas Meyer²

Frédéric Bertrand^1,3*

¹IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg and CNRS, Strasbourg, France
²Laboratoire de Biostatistique et Informatique Médicale, Faculté de Médecine, EA3430, Université de Strasbourg and CNRS, Strasbourg, France
³LIST3N, Université de Technologie de Troyes, Troyes, France

Methods based on partial least squares (PLS) regression, which has recently gained much attention in the analysis of high-dimensional genomic datasets, have been developed since the early 2000s for performing variable selection. Most of these techniques rely on tuning parameters that are often determined by cross-validation (CV) based methods, which raises essential stability issues. To overcome this, we have developed a new dynamic bootstrap-based method for significant predictor selection, suitable for both PLS regression and its incorporation into generalized linear models (GPLS). It relies on establishing bootstrap confidence intervals, which allows testing of the significance of predictors at preset type I risk α, and avoids CV. We have also developed adapted versions of sparse PLS (SPLS) and sparse GPLS regression (SGPLS), using a recently introduced non-parametric bootstrap-based technique to determine the numbers of components. We compare their variable selection reliability and stability concerning tuning parameters determination and their predictive ability, using simulated data for PLS and real microarray gene expression data for PLS-logistic classification. We observe that our new dynamic bootstrap-based method has the property of best separating random noise in y from the relevant information with respect to other methods, leading to better accuracy and predictive abilities, especially for non-negligible noise levels.

1. Introduction

Partial least squares (PLS) regression, introduced by [1], is a well-known dimension-reduction method, notably in chemometrics and spectrometric modeling [2]. In this paper, we focus on the PLS univariate response framework, better known as PLS1. Let n be the number of observations and p the number of covariates. Then, $y = {(y_{1}, \dots, y_{n})}^{T} \in ℝ^{n}$ represents the response vector, with ${(.)}^{T}$ denoting the transpose. The original underlying algorithm, developed to deal with continuous responses, consists of building latent variables $t_{k}, 1 ⩽ k ⩽ K$ , also called components, as linear combinations of the original predictors $X = (x_{1}, \dots, x_{p}) \in ℳ_{n, p} (ℝ)$ , where $ℳ_{n, p} (ℝ)$ represents the set of matrices of n rows and p columns. Thus,

t_{k} = X_{k - 1} w_{k}, 1 ⩽ k ⩽ K, (1)

where $X_{0} = X$ , and $X_{k - 1}, k ⩾ 2$ represents the residual covariate matrix obtained through the ordinary least squares regression (OLSR) of $X_{k - 2}$ on $t_{k - 1}$ . Here, $w_{k} = {(w_{1 k}, \dots, w_{p k})}^{T} \in ℝ^{p}$ is obtained as the solution of the following maximization problem [3]:

w_{k} = \underset{w \in ℝ^{p}}{argmax} {{Cov}^{2} (y_{k - 1}, t_{k})} (2)

= \underset{w \in ℝ^{p}}{argmax} {w^{T} X_{k - 1}^{T} y_{k - 1} y_{k - 1}^{T} X_{k - 1} w}, (3)

with the constraint ${‖ w_{k} ‖}_{2}^{2} = 1$ , and where $y_{0} = y$ , and $y_{k - 1}, k ⩾ 2$ represents the residual vector obtained from the OLSR of $y_{k - 2}$ on $t_{k - 1}$ .

The final regression model is thus:

y = \sum_{k = 1}^{K} c_{k} t_{k} + ϵ (4)

= \sum_{j = 1}^{p} β_{j}^{PLS} x_{j} + ϵ, (5)

with $ϵ = {(ϵ_{1}, \dots, ϵ_{n})}^{T}$ the $n \times 1$ error vector and $(c_{1}, \dots, c_{K})$ the regression coefficients related to the OLSR of $y_{k - 1}$ on $t_{k}$ , $\forall k \in 〚 1, K 〛$ , also known as y-loadings. More details are available, notably in Höskuldsson [4] and Tenenhaus [5].
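The construction in Eqs 1 to 5 can be illustrated with a short NumPy sketch. It is a minimal illustration rather than the implementation used in this article: the function name `pls1`, the centering of $X$ and $y$, and the recursion used to express each component in terms of the original predictors are assumptions made for this example. It relies on the fact that, under the unit-norm constraint, the maximizer of Eq. 3 is proportional to $X_{k - 1}^{T} y_{k - 1}$.

```python
import numpy as np


def pls1(X, y, K):
    """Sketch of PLS1 by successive deflation (Eqs 1-5).

    Returns the components T (n x K), the y-loadings c (K,), and the
    coefficients beta (p,) expressed on the centered predictors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xk = X - X.mean(axis=0)  # X_0 (centering is an assumption of this sketch)
    yk = y - y.mean()        # y_0
    T, c, P, R = [], [], [], []
    for _ in range(K):
        w = Xk.T @ yk                  # direction maximizing Eq. 3
        w = w / np.linalg.norm(w)      # unit-norm constraint on w_k
        t = Xk @ w                     # Eq. 1: t_k = X_{k-1} w_k
        tt = t @ t
        p_load = Xk.T @ t / tt         # OLSR of X_{k-1} on t_k
        c_k = (yk @ t) / tt            # OLSR of y_{k-1} on t_k (y-loading)
        # r_k satisfies t_k = X_0 r_k, which gives beta without any inversion
        r = w - sum((p_j @ w) * r_j for p_j, r_j in zip(P, R))
        Xk = Xk - np.outer(t, p_load)  # deflate the predictors
        yk = yk - c_k * t              # deflate the response
        T.append(t)
        c.append(c_k)
        P.append(p_load)
        R.append(r)
    T = np.column_stack(T)
    c = np.array(c)
    beta = np.column_stack(R) @ c      # Eq. 5 coefficients
    return T, c, beta
```

With this sketch, `T @ c` and the centered predictors multiplied by `beta` give the same fitted values, which is the equality between Eqs 4 and 5.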

This particular regression technique, based on reductions in the original dimension, avoids matrix inversion and diagonalization, using only deflation. It allows us to deal with high-dimensional datasets efficiently, and notably solves the collinearity problem [6].

With the great technological advances of recent decades, PLS regression has been gaining attention, and has been successfully applied to many domains, notably genomics. Indeed, the development of both microarray and allelotyping techniques has resulted in high-dimensional datasets from which information has to be efficiently extracted. To this end, PLS regression has become a benchmark as an efficient statistical method for prediction, regression and dimension reduction [3]. Practically speaking, the observed response in such studies commonly does not follow a continuous distribution. Frequent goals with gene expression datasets involve classification problems, such as cancer stage prediction, disease relapse, and tumor classification. For such reasons, PLS regression has had to be adapted to take into account discrete responses. This has been an intensive research subject in recent years, leading globally to two types of adapted PLS regression for classification. The first, studied and developed notably by Nguyen and Rocke [7], Nguyen and Rocke [8] and Boulesteix [9], is a two-step method. The first step consists of building standard PLS components by treating the response as continuous. In the second step, classification methods are run, e.g., logistic discrimination (LD) or quadratic discriminant analysis (QDA). The second type of adapted PLS regression consists of building PLS components using either the original iteratively reweighted least squares (IRLS) algorithm or an adapted version of it, followed by generalized linear regression on these components. This type of method was first introduced by Marx [10]. Different modifications and improvements, using notably ridge regression [11] or Firth’s procedure [12] to avoid non-convergence and infinite parameter-value estimations, have been developed, notably by Nguyen and Rocke [13], Ding and Gentleman [14], Fort and Lambert-Lacroix [15] and Bastien et al. [16].
In this work, we focus on the second type of adapted PLS regression, referred to from now on as GPLS.

As previously mentioned, a feature of datasets of interest is their high-dimensional setting, i.e., $n ≪ p$. Chun and Keleş [17] have shown that the asymptotic consistency of PLS estimators does not hold in this situation, so filtering or predictor selection becomes necessary in order to obtain consistent parameter estimation. However, all methods described above proceed to classification using the entire set of predictors. For datasets that frequently contain thousands of predictors, such as microarray or RNAseq ones, a variable-filtering pre-processing step thus needs to be applied. A commonly used pre-processing method when performing classification uses the BSS/WSS-statistic:

{BSS}_{j} / {WSS}_{j} = \frac{\sum_{q = 1}^{Q} \sum_{i : y_{i} \in G_{q}} {({\hat{μ}}_{j q} - {\hat{μ}}_{j})}^{2}}{\sum_{q = 1}^{Q} \sum_{i : y_{i} \in G_{q}} {(x_{i j} - {\hat{μ}}_{j q})}^{2}}, (6)

with ${\hat{μ}}_{j}$ the sample mean of $x_{j}$, and ${\hat{μ}}_{j q}$ the sample mean of $x_{j}$ in class $G_{q}$ for $q \in \{1, \dots, Q\}$. Then, predictors associated with the highest values of this statistic are retained, but no specific rule exists to choose the number of predictors to retain. A Bayesian-based technique, available in the R-package limma, has become a common way to deal with this, computing a Bayesian-based p-value for each predictor, therefore allowing users to choose the number of relevant predictors based on a threshold p-value [18].
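As an illustration of Eq. 6, the following sketch computes the BSS/WSS ratio for every column of a predictor matrix. The function name `bss_wss` and the array layout (observations in rows, predictors in columns) are choices made for this example only.

```python
import numpy as np


def bss_wss(X, labels):
    """BSS/WSS ratio of Eq. 6 for each predictor (column) of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)          # mu_j
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for g in np.unique(labels):
        Xg = X[labels == g]                # observations in class G_q
        class_mean = Xg.mean(axis=0)       # mu_jq
        # the inner sum over i in G_q repeats the same term n_q times
        bss += len(Xg) * (class_mean - overall_mean) ** 2
        wss += ((Xg - class_mean) ** 2).sum(axis=0)
    return bss / wss
```

Predictors can then be ranked by decreasing ratio, for instance with `np.argsort(-bss_wss(X, labels))`. How many of them to keep remains the open choice discussed above.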

This method cannot be considered parsimonious, but rather as a pre-processing stage for exclusion of uninformative covariates. Reliably selecting relevant predictors in PLS regression is of interest for several reasons. Practically speaking, it would allow users to identify the original covariates which are significantly linked to the response, as is done in OLS regression with Student-type tests. Statistically speaking, it would avoid the establishment of over-complex models and ensure consistency of PLS estimators. Several methods for variable selection have already been developed [19]. Lazraq et al. [20] group these techniques into two main categories. The first, model-wise selection, consists of first establishing the PLS model before performing a variable selection. The second, dimension-wise selection, consists of selecting variables on one PLS component at a time.

A dimension-wise method, introduced by Chun and Keleş [17] and called Sparse PLS (SPLS), has become the benchmark for selecting relevant predictors using PLS methodology. The technique is for continuous responses and consists of simultaneous dimension reduction and variable selection, computing sparse linear combinations of covariates as PLS components. This is achieved by introducing an $L_{1}$ constraint on the weight vectors $w_{k}$ , leading to the following formulation of the objective function:

w_{k} = \underset{w \in ℝ^{p}}{argmax} {w^{T} X_{k - 1}^{T} y_{k - 1} y_{k - 1}^{T} X_{k - 1} w},

subject to {‖ w ‖}_{2}^{2} = 1, {‖ w ‖}_{1} < λ, (7)

where λ determines the amount of sparsity. More details are available in Chun and Keleş [17]. Two tuning parameters are thus involved: $η \in [0,1]$ as a rescaled parameter equivalent to λ, and the number of PLS components K, which are determined through CV-based mean squared error (CV MSE). We refer to this technique as SPLS CV in the following.
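To make the role of η concrete, the sketch below computes one sparse direction vector. It assumes the form usually described for a univariate response, in which the normalized direction $X^{T} y$ is soft-thresholded at η times its largest absolute entry. This is an assumption of the example, not a reproduction of the implementation of Chun and Keleş [17], and the name `sparse_direction` is hypothetical.

```python
import numpy as np


def sparse_direction(X, y, eta):
    """Soft-thresholded direction vector for one sparse PLS component.

    Assumed form: entries of z = X^T y / ||X^T y|| whose absolute value is
    below eta * max|z| are set to zero, the others are shrunk toward zero.
    """
    z = np.asarray(X, dtype=float).T @ np.asarray(y, dtype=float)
    z = z / np.linalg.norm(z)
    threshold = eta * np.max(np.abs(z))
    w = np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w     # renormalize unless all entries vanish
```

With `eta = 0` the ordinary PLS direction is recovered, and larger values of `eta` set more weights exactly to zero, so that fewer predictors enter the component.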

Chung and Keles [21] have developed an extension of this technique by integrating it into the generalized linear model (GLM) framework, enabling it to handle classification problems. They also integrate Firth’s procedure, in order to deal with non-convergence issues. Both tuning parameters are selected using CV MSE. We refer to this method as SGPLS CV in the following.

A well-known model-wise selection method is the one introduced by Lazraq et al. [20]. It consists of bootstrapping pairs $(y_{i}, x_{i •}), 1 ⩽ i ⩽ n$ , where $x_{i •}$ represents the ith row of $X$ , before applying PLS regression with a preset number of components K on each bootstrap sample. By performing this method, approximations of distributions related to predictors’ coefficients can be achieved. This leads to bootstrap-based confidence intervals (CI) for each predictor, and so opens up the possibility of directly testing their significance with a fixed type I risk α. The advantages of this method are twofold. First, by focusing on PLS regression coefficients, it is relevant for both the PLS and GPLS frameworks. Second, it only depends on one tuning parameter K, which must be determined earlier.

Two important related issues should be mentioned. Firstly, the presence of variables unrelated to the phenomenon under study strongly increases the level of inherent noise in the response and in the predictors. This phenomenon occurs frequently in various contexts and has been studied by several authors, both in the context of PLS regression and its extensions [22, 23] and in the more general context of machine learning, where techniques related to least squares have proven to be relevant and effective [24–28].

Secondly, while performing this technique, approximations of distributions are achieved conditionally on the fixed dimension of the extracted subspace. In other words, it approximates the uncertainty of these coefficients, conditionally on the fact that estimations are performed in a K-dimensional subspace for each bootstrap sample. The determination of an optimal number of components is crucial for achieving reliable estimations of the regression coefficients [29]. Thus, since this determination is specific to the dataset in question, it must be performed for each bootstrap sample, in order to obtain reliable bootstrap-based CI. We have established some theoretical results which confirm this (Supplementary Section 1.3).

Determining tuning parameters by using q-fold cross-validation (q-CV) based criteria may induce important issues concerning the stability of extracted results (Hastie et al. [30], p.249; Boulesteix [31]; Magnanensi et al. [32]). Thus, using such criteria for successive choosing of the number of components should be avoided. As mentioned, amongst others, by Efron and Tibshirani [33], p.255 and Kohavi [34], bootstrap-based criteria are known to be more stable than CV-based ones. In this context, Magnanensi et al. [32] developed a robust bootstrap-based criterion for the determination of the number of PLS components, characterized by a high level of stability and suitable for both the PLS and GPLS regression frameworks. Thus, this criterion opens up the possibility of reliable successive choosing for each bootstrap sample.

In this article, we introduce a new dynamic bootstrap-based technique for covariate selection suitable for both the PLS and GPLS frameworks. It consists of bootstrapping pairs $(y_{i}, x_{i •})$ , and successive extraction of the optimal number of components for each bootstrap sample, using the previously mentioned bootstrap-based criterion. Here, our goal is to better approximate the uncertainty related to regression coefficients by removing the condition of extracting a fixed K-dimensional subspace for each bootstrap sample, leading to more reliable CI. This new method both avoids the use of CV, and features the same advantages as those previously mentioned related to the technique introduced by Lazraq et al. [20]. This method also has the advantage of behaving consistently with the variation of the noise level in the dataset. We refer to this new dynamic method as BootYTdyn in the following.

We also succeed in adapting the bootstrap-based criterion introduced by Magnanensi et al. [32] to the determination of a unique optimal number of components related to each preset value of η in both the SPLS and SGPLS frameworks. Thus, these adapted versions, by reducing the use of CV, improve the reliability of the hyper-parameter tuning. We will refer to these adapted techniques as SPLS BootYT and SGPLS BootYT, respectively.

Our proposed approach has the following constructive attributes:

1. theoretical results are developed and presented that give indications on the number of components related to a bootstrap sample and show the necessity of a dynamic algorithm for Sparse PLS;

2. our stable stopping criterion for the construction of PLS components is adapted to Sparse PLS;

3. for each value of the parsimony parameter, the optimal number of components is determined using our new algorithm;

4. the number of models to be compared is considerably reduced;

5. this is made possible by the adaptation of our stable stopping criterion for PLS components and by its properties, especially its stability.

The article is organized as follows. In Section 2, we introduce our new dynamic bootstrap-based technique, followed by the description of our adaptation of the BootYT stopping criterion to the SPLS and SGPLS frameworks. In Section 3, we present simulations related to the PLS framework, and summarize the results, depending notably on different noise levels in y. In Section 4, we treat a real microarray gene expression dataset with a binary response, here the original localization of colon tumors, by benchmarking our new dynamic bootstrap-based approach for GPLS regression. Lastly, we discuss results and conclusions in Section 5 as well as possible future work.

2 Bootstrap-Based Approaches for Predictor Selection

2.1 A New Dynamic Bootstrap-Based Technique

As mentioned in Section 1, the selected number of extracted components is crucial for reliable estimation of $β_{j}^{PLS}, 1 ⩽ j ⩽ p$ [29]. We have shown (Supplementary Section 1.3) that selecting an optimal number of components on the original dataset and using it to perform PLS regression on the constructed bootstrap samples, as done by Lazraq et al. [20], is not appropriate for obtaining reliable CIs.

In order to take into account these theoretical results, we have developed a new dynamic bootstrap-based approach for variable selection relevant for both the PLS and GPLS frameworks. The approach consists of estimating the optimal number of components for each bootstrap sample created during the algorithm. To obtain consistent results, a robust and resample-stable stopping criterion in component construction has to be used. Let us use $β_{j}$ to mean $β_{j}^{PLS}$ in the following, in order to lighten notation. The algorithm for this new method is as follows:

1. Let $D^{o r i}$ be the original dataset and R the total number of bootstrap samples $D_{r}^{b}, r \in 〚 1, R 〛$ .

2. $\forall r \in 〚 1, R 〛$ :

• Extract the number of PLS components that is needed for $D_{r}^{b}$ using a preset stopping criterion.

• Compute the estimations ${\hat{β}}_{j}^{r}, \forall j \in 〚 1, p 〛$ , by fitting the relevant PLS or GPLS model.

3. $\forall j \in 〚 1, p 〛$, construct a $(100 \times (1 - α))$ % two-sided $B C_{a}$ CI for $β_{j}$, denoted:

{CI}_{j} = [{CI}_{j, 1}, {CI}_{j, 2}] .

4. $\forall j \in 〚 1, p 〛$ , If $0 \notin {CI}_{j}$ then retain $x_{j}$ else delete $x_{j}$ .

5. Obtain the reduced model $ℳ_{sel}$ by only integrating the significant predictors, and extracting the number of components $K_{s e l}$ determined by the preset stopping criterion.

Note that here, we set $α = 0.05$ .
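The steps above can be sketched as follows for the PLS case. This is a simplified illustration under stated assumptions, not the authors' implementation: percentile intervals are used in place of $B C_{a}$ intervals, and `choose_k_standin` is a hypothetical stopping rule (stop when the gain in $R^{2}$ falls below a threshold) that merely stands in for the bootstrap-based criterion of [32]. All function names are invented for this example.

```python
import numpy as np


def pls1_beta(X, y, K):
    """PLS1 coefficients on centered data, using up to K components."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    P, R, beta = [], [], np.zeros(X.shape[1])
    for _ in range(K):
        w = Xk.T @ yk
        nrm = np.linalg.norm(w)
        if nrm < 1e-12:                    # nothing left to extract
            break
        w = w / nrm
        t = Xk @ w
        tt = t @ t
        p_load = Xk.T @ t / tt
        c_k = (yk @ t) / tt
        r = w - sum((p_j @ w) * r_j for p_j, r_j in zip(P, R))
        beta = beta + c_k * r
        Xk = Xk - np.outer(t, p_load)
        yk = yk - c_k * t
        P.append(p_load)
        R.append(r)
    return beta


def choose_k_standin(X, y, k_max=5, min_gain=0.01):
    """Hypothetical stopping rule, NOT the BootYT criterion of [32]."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    tss, prev_r2, k = yc @ yc, 0.0, 0
    for cand in range(1, min(k_max, X.shape[1]) + 1):
        resid = yc - Xc @ pls1_beta(X, y, cand)
        r2 = 1.0 - (resid @ resid) / tss
        if r2 - prev_r2 < min_gain:
            break
        prev_r2, k = r2, cand
    return max(k, 1)


def dynamic_bootstrap_select(X, y, choose_k=choose_k_standin,
                             R=1000, alpha=0.05, seed=0):
    """Steps 1-4: re-select the number of components on every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = np.empty((R, p))
    for r in range(R):
        idx = rng.integers(0, n, size=n)   # resample pairs (y_i, x_i.)
        Xb, yb = X[idx], y[idx]
        k_b = choose_k(Xb, yb)             # dynamic choice for this sample
        betas[r] = pls1_beta(Xb, yb, k_b)
    # percentile CI used here as a simpler substitute for the BCa CI
    lo, hi = np.quantile(betas, [alpha / 2, 1 - alpha / 2], axis=0)
    selected = np.flatnonzero((lo > 0) | (hi < 0))   # keep x_j if 0 is outside CI_j
    return selected, lo, hi
```

Step 5 then amounts to refitting on the retained columns, for instance `pls1_beta(X[:, selected], y, choose_k(X[:, selected], y))`. Passing a different `choose_k` callable is how another stopping criterion would be plugged in.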

2.2 An Adapted Bootstrap-Based Sparse PLS Implementation

As mentioned by Boulesteix [31], using q-CV based methods for tuning parameters potentially induces problems, notably concerning variability of results due to dependency on the way folds are randomly chosen. However, as detailed in Section 1, the selection of both tuning parameters involved in the SPLS regression developed by Chun and Keleş [17] is performed using q-CV MSE. Therefore, to improve the reliability of this selection, we adapt the bootstrap-based stopping criterion to this method, which gives the following algorithm:

1. Let $\{η_{1}, \dots, η_{s}\}$ be a set of pre-chosen values for the sparsity parameter and $\{k_{1}, \dots, k_{s}\} = \{1, \dots, 1\}$ the set of initial numbers of components for each $η_{i}$. Let $i = 1$.

2. Let $c_{j}^{η_{i}}, j \in 〚 1, k_{i} 〛$, be the regression coefficients of $y$ on $T_{k_{i}} = (t_{1}, \dots, t_{k_{i}}) \in ℳ_{n, k_{i}} (ℝ)$. Obtain $k_{i}$ $B C_{a}$ CI for $c_{j}^{η_{i}}, j \in 〚 1, k_{i} 〛$, using the bootstrap-based stopping criterion, denoted:

{CI}_{j}^{k_{i}} = [{CI}_{j, 1}^{k_{i}}, {CI}_{j, 2}^{k_{i}}] .

3. If there exists $j \in 〚 1, k_{i} 〛$ such that $0 \in {CI}_{j}^{k_{i}}$, then set $K_{o p t}^{η_{i}} = k_{i} - 1$; otherwise set $k_{i} = k_{i} + 1$ and return to step 2.

4. If $i \neq s$, then set $i = i + 1$ and return to step 2.

5. Return the set of the extracted numbers of components $\{K_{o p t}^{η_{1}}, \dots, K_{o p t}^{η_{s}}\}$ related to $\{η_{1}, \dots, η_{s}\}$.

6. Return the pair $(η_{o p t}, K_{o p t}^{η_{o p t}})$ having the lowest CV-based MSE.

Retesting all components obtained after each increase in $k_{i}$ is essential since the original predictors involved in the components construction change when $k_{i}$ increases [17]. This fact, combined with the aim of retaining orthogonality between components, leads the components themselves to change, so that the significance of each component has to be retested at each step.
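The control flow of steps 1 to 5 can be sketched as below. The callable `component_cis(eta, k)` is a hypothetical placeholder: it is assumed to fit a sparse PLS model with `k` components for the given `eta` and to return one confidence interval per y-loading, recomputed for all `k` components at each call. The upper bound `k_max` is a safeguard added for this example and is not part of the algorithm above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Interval = Tuple[float, float]


def components_per_eta(
    etas: Sequence[float],
    component_cis: Callable[[float, int], List[Interval]],
    k_max: int = 10,
) -> Dict[float, int]:
    """Return K_opt for each eta by testing all y-loadings at each step."""
    k_opt: Dict[float, int] = {}
    for eta in etas:                       # step 4: loop over the eta grid
        k = 1                              # step 1: start with one component
        while k <= k_max:
            cis = component_cis(eta, k)    # step 2: CIs for c_1, ..., c_k
            if any(lo <= 0.0 <= hi for lo, hi in cis):
                break                      # step 3: a loading is not significant
            k += 1
        k_opt[eta] = k - 1                 # last k for which all CIs excluded 0
    return k_opt
```

Step 6 would then compare only the `len(etas)` resulting pairs by CV-based MSE, for instance by taking the minimum over a dictionary of CV errors indexed by η.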

While the original implementation compares $K^{max} \times s$ models through CV MSE, with $K^{max}$ the maximal number of components (set by the user), this new bootstrap-based version only focuses on s models, since only one number of components is determined for each preset value of η. An illustration of the stability improvement obtained by using this new implementation, based on the simulated dataset introduced in Section 3.3.1 with $s d (ϵ) = 5$ , is shown in Figure 1.


FIGURE 1. Distribution of 100 selections of $(η_{o p t}, K_{o p t})$ using the original SPLS approach (A) and the new bootstrap-based implementation (B).

3 Simulation Studies

3.1 Simulations for Accuracy Comparisons

These simulations are based on a simulation scheme proposed by Chun and Keleş [17, p.14] and modified in order to study high-dimensional settings. We consider the case where there are fewer observations than predictors, i.e., $n < p$, and set $n = 100$, and $p = 200$ or 1,000. Let q be the number of spurious predictors. While Chun and Keleş [17, p.14] only consider ratios $q / p$ equal to 0.25 and 0.5, the ratio values 0.05 and 0.95 have been added here. Four independent and identically distributed hidden variables $h_{1}, \dots, h_{4}$, following a $N (0, 25 I_{n})$ distribution, were generated. Then, columns of the covariate matrix $X$ were generated by $x_{j} = h_{l} + ϵ_{j}$ for $p_{l - 1} + 1 ⩽ j ⩽ p_{l}$, where $l = 1, \dots, 4$, $(p_{0}, \dots, p_{4}) = (0, (p - q) / 2, p - q, p - r, p)$, $r = 5$ when $p = 200$ and $r = 10$ when $p = 1000$, and $ϵ_{1}, \dots, ϵ_{p}$ are drawn independently from a $N (0, 0.1 I_{n})$ distribution. Also, y is generated by $y = 3 h_{1} - 4 h_{2} + f$, where f is normally distributed with mean 0 and variance such that the signal-to-noise ratio (SNR) equals 10.
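A data generator following this description could look like the sketch below. Two points are assumptions of the example rather than statements from the text: the SNR is taken to be the ratio of the theoretical signal variance (here $9 \times 25 + 16 \times 25 = 625$) to the noise variance, and the function name `simulate_scheme` and its arguments are invented for illustration.

```python
import numpy as np


def simulate_scheme(n=100, p=200, q_ratio=0.25, snr=10.0, seed=0):
    """Generate (X, y) with four groups of columns driven by hidden variables."""
    rng = np.random.default_rng(seed)
    q = int(round(q_ratio * p))            # number of spurious predictors
    r = {200: 5, 1000: 10}[p]              # only the two settings given in the text
    bounds = [0, (p - q) // 2, p - q, p - r, p]          # (p_0, ..., p_4)
    H = rng.normal(0.0, 5.0, size=(n, 4))                # h_l ~ N(0, 25 I_n)
    X = np.empty((n, p))
    for l in range(4):
        width = bounds[l + 1] - bounds[l]
        eps = rng.normal(0.0, np.sqrt(0.1), size=(n, width))   # eps_j ~ N(0, 0.1 I_n)
        X[:, bounds[l]:bounds[l + 1]] = H[:, [l]] + eps        # x_j = h_l + eps_j
    signal = 3.0 * H[:, 0] - 4.0 * H[:, 1]
    noise_var = 625.0 / snr                # assumed SNR = Var(signal) / Var(f)
    y = signal + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return X, y, bounds
```

Only the first `p - q` columns are linked to the response. The remaining `q` columns depend on $h_{3}$ and $h_{4}$ and are therefore spurious.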

Using this simulation scheme, the accuracy of the SPLS technique using 10-fold CV for selecting tuning parameters (SPLS CV) is compared with that of our new dynamic bootstrap method combined with the bootstrap-based stopping criterion (BootYTdyn). In order to do so, for each parameter setting, 50 selections of the sparse support related to both methods were established. Lastly, mean accuracy values over the 50 trials were calculated. Results are summarized in Table 1.


TABLE 1. Mean accuracy values (SNR).

Based on these results, SPLS CV gives better accuracy than BootYTdyn for ratio values $q / p$ that are not close to 0 or 1. While both give good performance when the ratio is close to 0, i.e., when a major part of predictors are significant, BootYTdyn outperforms SPLS CV when only a small proportion of predictors are significant.

Nevertheless, in this simulation set-up, covariates are collected into four groups. While within-group correlations between covariates are close to one, between-group ones are close to zero. This unrealistic situation makes the determination of an optimal support irrelevant, and seems more appropriate to selecting the number of components. As an illustration, 50 additional samples in the $p = 1000$ and $q / p = 0.5$ case were simulated. We then calculated the predictive MSE (PMSE) based on four different supports $S_{1}$, $S_{2}$, $S_{3}$ and $S_{4}$, where $S_{1} = \{x_{j}, 1 ⩽ j ⩽ p\}$, $S_{2} = \{x_{j}, 1 ⩽ j ⩽ (p - q)\}$, $S_{3} = \{x_{1}, x_{251}\}$ and $S_{4} = \{x_{1}, x_{251}\} \cup \{x_{j}, (p - q) + 1 ⩽ j ⩽ p\}$. Results are summarized in Table 2.


TABLE 2. PMSE values for different supports.

In the light of these observations, the aim of this simulation scheme is the extraction of an optimal number of components rather than an optimal support. We thus decided to use a real dataset as covariate matrix for a more general and relevant comparison.

3.2 Simulations for Global Comparison

3.3.1 Dataset Simulations

In this study, we used a real microarray gene expression dataset, which was created using fresh-frozen primary tumor samples, from a multi-center cohort, with stage I to IV colon cancer. 566 samples fulfilled the RNA quality requirements, and constituted our database. These samples were split into a 443-sample discovery set and a 123-sample test set, well balanced for the main anatomo-clinical characteristics. This database has already been studied by Marisa et al. [35] and more details on it are available in their article.

In order to reduce computational time, a preliminary selection of 100 predictors was performed. Based on the original localization of the tumors as response vector, and the full 566 samples, the 100 most differentially expressed probe sets were extracted. As mentioned in Section 1, this pre-processing is based on a Bayesian technique and gives us our benchmark predictor matrix.

Then, based on correlation values, four positively-correlated predictors were selected to form the set of significant covariates (Supplementary Section 1.1). To this end, let $X_{s e l} = (x_{1}, x_{12}, x_{15}, x_{59}) \in ℳ_{n, 4} (ℝ)$ be the matrix composed of these predictors, so that y is simulated as follows:

y = X_{s e l} β + ϵ, (8)

with $β = {(3.559, 2.071, 1.440, 1.770)}^{T}$ , $E (ϵ) = 0$ and Var $(ϵ) = σ^{2} I_{n}$ .

We performed simulations for six distinct levels of random noise standard deviation in order to investigate the performance of the different methods. Both these standard deviations and their related SNR are shown in Table 3.


TABLE 3. Noise standard deviation (SNR).

3.3.2 Benchmarked Methods

Using these simulated datasets, eight methods were analyzed and compared.

1. $Q^{2}$ . The original bootstrap-based method, introduced by Lazraq et al. [20], combined with the 10-fold CV-based $Q^{2}$ criterion [5, p.83] for pre-selecting the number of components.

2. BIC. The bootstrap-based method, introduced by Lazraq et al. [20], combined with the corrected BIC using estimated degrees of freedom (DoF) [36] for pre-selecting the number of components.

3. BootYT. The bootstrap-based method, introduced by Lazraq et al. [20], combined with the bootstrap-based criterion [32] for previously selecting the number of components.

4. BICdyn. Our new dynamic bootstrap-based method combined with the corrected BIC criterion for successive selections of number of components.

5. BootYTdyn. Our new dynamic bootstrap-based method combined with the bootstrap-based criterion for successive selections of number of components.

6. SPLS CV. The original SPLS method using 10-fold CV for tuning parameter determination [17].

7. SPLS BootYT. The new adapted SPLS version using the bootstrap-based criterion for component selection.

8. Lasso. Lasso regression, included as a benchmark [37].

3.3.3 Simulation Scheme and Notation

In order to perform reliable comparisons between the eight methods, each type of trial was performed one hundred times. Numbers of components, sparse supports and sparse tuning parameters are the main examples of these. Results linked to the highest occurrence rates are then chosen for method comparison. All bootstrap-based techniques were performed with $R = 1000$ bootstrap samples, and each related CI was constructed with type I risk $α = 0.05$ .

The global comparison has two main parts. First, in order to compare accuracy and stability related to each technique, we focus both on different supports, and models extracted by the different variable selection methods. Indeed, in the PLS framework, a specific model results from both a set of predictors and a specific number of components. Due to the sparsity parameter in SPLS approaches, the same support can be extracted, but with a different number of components leading to different models. Lasso regression can also extract the same support for several different sparsity parameters, leading to different estimations of model coefficients. Therefore, for clarity, the following notation, related to each specific variable selection technique, is introduced.

• $\{S_{1}, \dots, S_{Γ_{1}}\}$, the set of extracted supports.

• $\{ℳ_{1}, \dots, ℳ_{Γ_{2}}\}$, the set of fitted models.

• $S_{s e l}$, the selected support, i.e., the one that appears the most often.

• $ℳ_{s e l}$, the selected model, i.e., the one that appears the most often.

• $\% S_{s e l}$, rate of occurrence of the selected support.

• $\% ℳ_{s e l}$, rate of occurrence of the selected model.

• $K_{s e l}$ , the number of components related to the selected model.
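The quantities in this list can be computed from the outcomes of the repeated trials with a simple frequency count. In the sketch below, the function name `summarize_trials` and the encoding of a support as a `frozenset` of predictor indices (and of a model as a pair of support and number of components) are assumptions made for illustration.

```python
from collections import Counter


def summarize_trials(outcomes):
    """Return (selected outcome, its occurrence rate, number of distinct outcomes).

    Applied to supports this gives (S_sel, %S_sel, Gamma_1); applied to
    models, e.g. (support, K) pairs, it gives (M_sel, %M_sel, Gamma_2).
    """
    counts = Counter(outcomes)
    selected, n_selected = counts.most_common(1)[0]
    return selected, n_selected / len(outcomes), len(counts)
```

When two outcomes are tied for the highest count, `most_common` returns the one encountered first, so a tie should be detected and handled separately rather than resolved by this ordering.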

Second, in order to compare the predictive ability of models, 10-fold CV MSE, related to each selected sparse model through PLS regression, were computed one hundred times. The test set was also used in order to confirm results obtained by CV.

Note that, concerning the dynamic BIC-based method for $sd (ϵ) = 0.5$ , only 97 trials performed well. Lastly, due to equality of occurrence rates between the two most-represented pairs of tuning parameters, results for SPLS CV and $sd (ϵ) = 5$ come instead from 150 trials.

3.3.4 Stability and Accuracy Results

Both the mean accuracy values over trials in each case, and stability results based on extracted supports, are given in Table 4. The numbers of components used for the original bootstrap-based approach [20] are summarized in Supplementary Section 1.2.


TABLE 4. Mean accuracy values over trials in each case and stability results based on extracted supports and number of predictors.

Concerning the number of different models, results related to lasso regression are the same as those concerning the number of different supports. Only the result for $sd (ϵ) = 0.5$ differs, since one trial finished with a fifth value of the sparsity parameter and the same support as the model that was selected. Therefore, only results concerning models established using SPLS methods are summarized in Table 5. In the case of bootstrap-based techniques, supports and models are similar, since no sparsity parameter is needed.


TABLE 5. Results for SPLS model stability.

Concerning the three bootstrap-based techniques and in the light of accuracy results (Table 4), BootYT outperforms both the others, except in the case where $sd (ϵ) = 0.5$, for which BIC should be used. This exception is confirmed through the comparison of the dynamic bootstrap-based methods, where the $sd (ϵ) = 0.5$ case is the only one where using the BIC criterion also represents the most relevant choice. This phenomenon matches the conclusions obtained by Magnanensi et al. [32], in that the BIC criterion is well-designed for small values of noise variance, while BootYT outperforms it for non-negligible levels of random noise. The use of the $Q^{2}$ criterion for selecting the number of components never appears to be a reliable option, so this criterion was not combined with our new dynamic approach. The accuracy values highlight the fact that our new method is always an improvement over the original one. Concerning the two versions of SPLS regression, both gave similar accuracy. These tend to perform better than the others for the smallest levels of random noise variance, while BootYTdyn outperforms all others for the highest values of noise variability.

Based on the results in Tables 4 and 5, the $Q^{2}$-based approach for predictor selection has non-negligible stability issues, providing between 18 and 96 different models in the 100 simulations. Depending on both the noise variability and the criterion combined with our dynamic approach, the latter improves over the original one by stabilizing the choice of sparse support. Cases where this is observed match the previous conclusions established when analyzing accuracy results, namely both the BIC-based dynamic approach for small values of noise variability, and BootYTdyn used for datasets with non-negligible noise variances. This strengthens the conclusion that BIC is well-designed for small values of noise variance, while BootYT performs best for non-negligible noise. As for SPLS methods, our new bootstrap-based version gains in stability in that it retains fewer different optimal pairs $(η_{o p t}, K_{o p t})$, and also fewer sparse models, than the original does. Moreover, since $Γ_{1} = Γ_{2}$ for all studied datasets, this directly implies that it retains a unique optimal number of components for each sparse support. It thus permits a choice of optimal model in a more reliable way than using the CV-based technique. Lastly, lasso regression has good stability performance, leading it, BootYTdyn and SPLS BootYT to be recommended when stability is important.

The last descriptive statistic concerns the number of significant predictors retained (Table 4). The $Q^{2}$ -based approach is the least sparse, selecting the highest number of covariates. Both BICdyn and BootYTdyn improve on BIC and BootYT, respectively, showing better accuracy related to their selected support. Indeed, the four expected covariates are always included in these selected supports. Once more, BootYT can be recommended for datasets with non-negligible random noise variability compared with corrected BIC, which has to be applied for small values of random variance. Indeed, both the BIC and BICdyn approaches retain globally increasing numbers of predictors as the random noise standard deviation increases, which leads us to suppose that they use some predictors to model this random noise, leading to over-fitting. On the contrary, while BootYT selects a stable number of significant predictors, BootYTdyn selects a decreasing number of significant predictors as the random noise standard deviation increases, which is expected. As a confirmation of over-fitting issues with BIC, we present the 10-fold CV-based MSE in Figure 2.

FIGURE 2. Boxplots of 10-CV MSE based on y with noise.

These results tend to confirm our suspicion of over-fitting since, except for results related to BootYTdyn, the other methods have MSE values that do not match the theoretical random noise variances, suggesting that a part of the noise is being modeled. As for the two SPLS methods, there is no notable difference to mention, both of them selecting similar numbers of predictors. Lastly, as with the BIC-based approaches, the lasso extracts an increasing number of significant predictors.

As a first conclusion, these initial simulation results suggest that BootYTdyn and SPLS BootYT should be used in practice.

3.3.5 Complexity and Predictive Ability Results

To confirm and strengthen the conclusions of Section 3.3.2, we will now focus on the predictive abilities of the models selected by the various approaches. For each selected model, we compute the 10-fold CV MSE one hundred times, based on the original simulated response values (without noise). These results thus reflect the accuracy in predicting the original information by leaving out random noise. We also compute the Predictive MSE (PMSE) based on the test set, by using its simulated response $y_{test}$ without including noise. Lastly, we extract the DoF of each selected sparse model ( ${DoF}_{sel}$ ) to compare the respective complexities.
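For reference, the two prediction-error measures described above can be written as follows. This is a sketch under assumed notation, none of which is fixed by the text: $y^{*}$ denotes the noise-free simulated response on the training set, $F_{1}, \ldots, F_{10}$ the CV folds, $\hat{y}_{i}^{(-k)}$ the prediction for observation $i$ from the model fitted without fold $F_{k}$, and $\hat{y}_{test,i}$ the prediction for the $i$-th test observation.

$$
{MSE}_{CV} = \frac{1}{n} \sum_{k=1}^{10} \sum_{i \in F_{k}} \left( y_{i}^{*} - \hat{y}_{i}^{(-k)} \right)^{2}, \qquad {PMSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left( y_{test,i} - \hat{y}_{test,i} \right)^{2}
$$

Because both quantities are evaluated against noise-free responses, a model that fits part of the random noise is penalized in both, which is the property the comparisons in this section rely on.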

Graphical results for these statistics, related to $Q^{2}$ , BIC and BootYT, are shown in the first row of Figure 3.

FIGURE 3. From left to right: DoF of the extracted sparse models, PMSE based on $y_{test}$ without noise, and boxplots of 10-CV MSE based on y without noise.

The evolution of estimated DoF both highlights and confirms BIC and $Q^{2}$ over-fitting issues. Indeed, as the random noise standard deviation increases, these methods globally build sparse models with increasing complexity. Thus, they model increasing parts of the inserted random noise, implying poor predictive abilities of their selected models compared to those obtained by applying BootYT. This is confirmed through higher values of PMSE and CV MSE, especially for datasets with non-negligible random noise variability. These results confirm the conclusions from the previous section in that using the $Q^{2}$ criterion for selecting the number of PLS components should be avoided, and that BootYT outperforms corrected BIC, except for responses with negligible random noise levels. Therefore, only BIC and BootYT are retained for further comparison.

The results shown in the second row of Figure 3 highlight that BootYTdyn is the only method that selects models with decreasing DoF, ensuring a reduction in complexity that helps avoid prediction issues. Indeed, BootYTdyn selects models with the lowest PMSE and 10-CV MSE.

In light of these results, only BootYTdyn is retained for further comparison. Comparing the two SPLS implementations with respect to their predictive abilities leads us to recommend SPLS BootYT, since models selected by this bootstrap-adapted SPLS technique feature comparable if not lower PMSE and 10-CV MSE (third row of Figure 3). Let us clarify that, in order to ensure relevant comparisons, we computed the 10-CV MSE using ordinary PLS regressions with both the support and the number of components selected by the SPLS methods, and not SPLS methods with the selected tuning parameters.

For datasets characterized by a low level of random noise variability in $y$ , SPLS BootYT builds models inducing the smallest CV MSE and PMSE. For non-negligible random variability, BootYTdyn gives the smallest predictive errors, thus showing the high level of robustness of this approach against random noise. This robustness is, as previously mentioned, due to its decreasing number of significant predictors, and also of components, leading to decreasing DoF, i.e., a loss of complexity.

Lastly, we compare the two retained methods, BootYTdyn and SPLS BootYT, with the lasso. As for SPLS methods, in order to perform relevant comparisons, the supports extracted by the lasso are used as sets of covariates for a PLS regression. The number of PLS components is then established by performing their selection one hundred times using the bootstrap-based stopping criterion; the number of components with the highest occurrence rate was selected. In this way, the results shown in the fourth row of Figure 3 concerning 10-CV MSE give the predictive ability of the extracted supports for PLS regression. In order to provide a clear picture of the impact of this choice, the PMSE obtained through the lasso is also displayed on the fourth row of Figure 3. These approaches are referred to as Lasso and Lasso.supp.
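The rule of repeating the component selection and keeping the most frequent value can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `most_frequent_ncomp` is hypothetical, the input stands for the numbers of components returned by the repeated runs of the stopping criterion, and breaking ties toward the smaller number of components is an assumption that the text does not specify.

```python
from collections import Counter


def most_frequent_ncomp(selected_ncomps):
    """Return (number of components, occurrence rate) for repeated selections.

    selected_ncomps: one selected number of components per repetition.
    Ties are broken toward the smaller value (an assumption, not from the text).
    """
    counts = Counter(selected_ncomps)
    best_count = max(counts.values())
    # Among the values reaching the highest count, keep the smallest one.
    ncomp = min(k for k, c in counts.items() if c == best_count)
    return ncomp, best_count / len(selected_ncomps)


# Toy run: six repetitions instead of one hundred.
ncomp, rate = most_frequent_ncomp([2, 3, 2, 1, 2, 3])
print(ncomp, rate)
```

The occurrence rate returned alongside the selected value gives a direct view of how stable the repeated selection was.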

Except for negligible values of random noise variability in y, the supports extracted by the lasso regression and applied as explanatory variables for PLS regression are linked to both the highest 10-CV MSE and the highest PMSE. This is a direct consequence of the lasso’s accuracy issues mentioned in Section 3.3.4, notably the increasing number of extracted covariates as the random noise variability rises. However, performing predictions using the model obtained by the original lasso regression leads to the lowest PMSE values. This is due to the $L^{1}$ penalization which is applied on the vector of estimated parameters, thus permitting correction of the relative lack of accuracy of this technique.

To conclude, we summarize our conclusions in the following Table 6, and recommend certain approaches, depending on whether the initial aim was to select significant predictors or to obtain a sparse model with attractive predictive ability.

TABLE 6. Recommended approaches.

4 Real Dataset Application

In this section, we deal with the predictors matrix introduced in Section 3.3.1 and the original binary response vector. Six approaches for variable selection, adapted for the GPLS framework, are considered for comparison.

1. BootYT. The bootstrap-based method, introduced by Lazraq et al. [20], combined with the bootstrap-based criterion [32] for pre-selecting the number of components.

2. BootYTdyn. The new dynamic bootstrap-based method combined with the bootstrap-based criterion for successive determinations of the number of components.

3. SGPLS CV. The original SGPLS method using 10-fold CV for tuning parameter selection [21].

4. SGPLS BootYT. The new adapted SGPLS version using the bootstrap-based criterion for component selection.

5. RSGPLS. An approach developed by Durif et al. [38]. It consists of adapting the SGPLS method by introducing a ridge penalty to ensure convergence of parameter estimations, and stability in hyper-parameter tuning. They also propose an adjustment of the $L_{1}$ constraint in order to further penalize less significant predictors. Hyper-parameters are tuned using CV MSE.

6. Lasso. The adapted lasso regression for logistic framework, available in the R package glmnet, as a benchmark.
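The bootstrap-based approaches in this list share a common selection step: a predictor is declared significant when a bootstrap confidence interval for its regression coefficient excludes zero. A minimal sketch of that step is given below. It assumes a two-sided percentile interval, which is one possible interval type and is not stated in this section, and it takes the bootstrap coefficient estimates as already computed. The function names and the toy data are hypothetical.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def significant_predictors(boot_coefs, alpha=0.05):
    """Indices of predictors whose percentile interval excludes zero.

    boot_coefs: list of R rows, one per bootstrap sample, each holding the
    p coefficient estimates. The percentile interval is an assumption.
    """
    n_predictors = len(boot_coefs[0])
    selected = []
    for j in range(n_predictors):
        column = sorted(row[j] for row in boot_coefs)
        low = quantile(column, alpha / 2)
        high = quantile(column, 1 - alpha / 2)
        if low > 0 or high < 0:
            selected.append(j)
    return selected


# Toy data: predictor 0 is always positive, predictor 1 is centred on zero.
toy = [[2 + 0.01 * i, -1 + 0.02 * i] for i in range(101)]
print(significant_predictors(toy))
```

In the dynamic variant, the number of components would be re-selected on each bootstrap sample before the coefficients are computed; that part is outside this sketch.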

Concerning bootstrap-based approaches, the incorporation of PLS methodology into GLM, developed by Bastien et al. [16], was used. Due to non-convergence issues for parameter estimations, some bootstrap samples were excluded using a threshold on the parameter estimates. Specifically, a bootstrap sample is excluded when the model built on it has at least one parameter estimate that is higher in absolute value than $10^{4}$ times the corresponding estimate (in absolute value as well) obtained on the original dataset. Thus, to ensure a sufficient number of relevant bootstrap samples, the preset number of computed samples was increased to $R = 4000$ . As for the lasso, three different loss functions were used for the establishment of the sparsity parameter by 10-fold CV: the number of misclassified values, the MSE, and the deviance. We refer to them as Lasso.Cl, Lasso.MSE and Lasso.Dev, respectively. The set of values defined for the sparsity parameter is preset by the “glmnet” package. For the SGPLS and RSGPLS approaches, the number of components K varies from 1 to 10, the sparsity parameter η varies from 0.04 to 0.99 (by steps of 0.05), and the ridge parameter involved in the RSGPLS technique is selected from 31 points ${log}_{10}$ -linearly spaced in the range $[10^{- 2}, 10^{3}]$ , as in Durif et al. [38].
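The exclusion rule and the tuning grids described above can be sketched in a few lines. The names below (`is_valid_bootstrap_fit`, `K_GRID`, and so on) are hypothetical, and the coefficient vectors are assumed to be available as plain lists; the sketch only restates the rule and the grid bounds given in the text.

```python
def is_valid_bootstrap_fit(boot_coefs, orig_coefs, factor=1e4):
    """Keep a bootstrap fit only if no coefficient explodes.

    A fit is rejected when at least one |bootstrap estimate| exceeds
    factor * |estimate on the original dataset| for the same parameter.
    Note that an original estimate equal to zero rejects any nonzero
    bootstrap estimate under this literal reading of the rule.
    """
    return all(abs(b) <= factor * abs(o) for b, o in zip(boot_coefs, orig_coefs))


def filter_bootstrap_fits(all_boot_coefs, orig_coefs, factor=1e4):
    """Return the coefficient vectors of the bootstrap fits that are kept."""
    return [c for c in all_boot_coefs if is_valid_bootstrap_fit(c, orig_coefs, factor)]


# Tuning grids as described in the text.
K_GRID = list(range(1, 11))                                  # K = 1, ..., 10
ETA_GRID = [round(0.04 + 0.05 * i, 2) for i in range(20)]    # 0.04, 0.09, ..., 0.99
RIDGE_GRID = [10 ** (-2 + 5 * i / 30) for i in range(31)]    # 31 points, log10-linear

# Toy check: the second and third fits each contain one exploding coefficient.
orig = [0.5, -2.0]
boots = [[0.4, -1.9], [6000.0, -2.1], [0.6, -1e5]]
print(filter_bootstrap_fits(boots, orig))
```

Because some fits are discarded, the number of usable samples is smaller than the number drawn, which is why the text raises the preset number of bootstrap samples.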

Each method was performed one hundred times in order to obtain relevant results. However, due to the high observed stability of results extracted with the BootYTdyn approach, and in order to save computational time, this was only performed twenty times instead of a hundred.

Stability results for tuning parameter selection, except for the bootstrap-based methods, are shown in Figure 4 and Table 7.

FIGURE 4. Distribution of selected sets of tuning parameters over the 100 trials, for SGPLS CV (A) and SGPLS BootYT (B).

TABLE 7. Number of different selected sets of tuning parameters over the 100 trials (rate of occurrence of the retained set of tuning parameters).

Based on results summarized in Table 7, the lasso methodology, using CV-based MSE or deviance values for the selection of its hyper-parameter, is the most stable. This can be explained by the fact that only one hyper-parameter is involved, while the other techniques are based on two or three tuning parameters. Using the number of misclassified values for CV has to be avoided for stable selection of the sparsity parameter. Our SGPLS adaptation with the BootYT criterion improves reliability in selecting the set of tuning parameters, as previously observed in the PLS framework (Sections 2.2, 3.3.4). Note that, concerning RSGPLS, three different sets of optimal parameters were extracted with maximal occurrence rate of five, all of them selecting the same set of predictors. Thus, the set of parameters which leads to the smallest number of misclassified values on the training dataset was retained. As already mentioned in Section 3.3.3, extracting the same support does not necessarily lead to the same model when sparsity or ridge parameters are involved. Therefore, the numbers of sparse supports and models retained are summarized in Table 8.

TABLE 8. Number $Γ_{1}$ of different extracted supports $(S_{sel})$ (first row) and number $Γ_{2}$ of different extracted sparse models $(ℳ_{sel})$ (second row).
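Counting the two stability statistics of Table 8 amounts to counting distinct objects over the repeated trials. The sketch below assumes that each trial is summarized by its selected support and by the tuple of tuning parameters that defines the fitted model; this encoding and the function name are illustrative assumptions.

```python
def count_supports_and_models(trials):
    """Return (Gamma_1, Gamma_2) over repeated trials.

    trials: iterable of (support, params) pairs, where support is a
    collection of predictor names and params is a hashable tuple of the
    tuning parameters. Two trials give the same model only if both the
    support and the parameters agree (an assumed encoding of a model).
    """
    trials = list(trials)
    supports = {frozenset(support) for support, _ in trials}
    models = {(frozenset(support), params) for support, params in trials}
    return len(supports), len(models)


# Toy run: two distinct supports, three distinct models.
toy_trials = [
    ({"x1", "x2"}, (2,)),
    ({"x2", "x1"}, (2,)),   # same support and parameters as the first trial
    ({"x1", "x2"}, (3,)),   # same support, different number of components
    ({"x1"}, (1,)),
]
print(count_supports_and_models(toy_trials))
```

With this encoding, $Γ_{1} = Γ_{2}$ means that each support is always paired with the same tuning parameters.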

In light of these results, BootYTdyn and both the MSE- and deviance-based lasso techniques are the most stable in extracting supports and models. This could be due in part to the fact that all of them depend on only one tuning parameter. Even if this hyper-parameter for the BootYTdyn approach, i.e., the number of components, has to be chosen R times, the high stability of the bootstrap-based stopping criterion introduced by Magnanensi et al. [32] endows this approach with good stability in selecting the sparse support. As in the PLS framework, our new bootstrap-based SGPLS implementation improves the stability here. The lack of stability of the lasso based on misclassified values is directly induced by the discrete form of this statistic. This issue was also observed and mentioned in Magnanensi et al. [32].

The various selected supports are displayed in Figure 5. Note that both the MSE- and deviance-based lasso regressions select the same support and the same model, noted Lasso.MSE.Dev in the following.

FIGURE 5. Summary of selected supports.

In the following, two additional independent public datasets, stated by Marisa et al. [35] as being comparable with our original dataset, were included for comparison of predictive abilities. These datasets are named GSE18088 (n = 53) [39] and GSE14333 (n = 247) [40]. Both the MSE and the number of misclassified (MC) values of the selected models were computed on both the training and test parts of the original dataset, as well as on the two additional datasets. These results are summarized in Table 9. Since independent datasets were available, we decided to follow the suggestion of Van Wieringen et al. [41, p.1596]: “the true evaluation of a predictor’s performance is to be done on independent data”.

TABLE 9. Summary of model fitting and predictive abilities.

In this real data study, BootYTdyn retains only one predictor, which is also retained by all other methods. Thus, as expected, this sparsest support induced the highest values of both the MSE and the number of misclassified observations, based on the training subset of the original dataset. It also provided the highest values based on the test subset of the original dataset, which is to be expected since, as explained in Section 3.3.1, these two parts are well-balanced in terms of anatomo-clinical characteristics. This balance biases the evaluation of the models’ predictive abilities, by making the MSE and misclassified values comparable between the training and testing subsets. Results in Table 9 confirm this property, and also highlight the usefulness of additional independent datasets for reliably comparing predictive abilities. While a model that is well-designed for predictive purposes will provide PMSE values on comparable independent datasets that are similar to the MSE obtained on the training dataset, an over-fitted one will show higher PMSE values, due to its dependence on the training data. Thus, the differences, noted $Δ$ MSE, between the MSE obtained on the training subset of the original dataset, and those obtained on the three test datasets, are shown in Figure 6.

FIGURE 6. Differences between the MSE computed on the training subset of the original dataset and the PMSE obtained on test sets.
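The quantities compared in Table 9 and Figure 6 can be computed as in the following sketch. It assumes that each model outputs predicted probabilities for a binary response coded 0/1, that an observation is classified as 1 when its predicted probability is at least 0.5, and that the difference is taken as PMSE minus training MSE; the threshold, the sign convention and all names are assumptions made for illustration.

```python
def mse(y_true, y_prob):
    """Mean squared error between 0/1 responses and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def misclassified(y_true, y_prob, threshold=0.5):
    """Number of observations whose thresholded prediction differs from y."""
    return sum(int(p >= threshold) != y for y, p in zip(y_true, y_prob))


def delta_mse(train, tests):
    """PMSE on each test set minus the MSE on the training set.

    train: (y_true, y_prob) for the training subset.
    tests: dict mapping a dataset name to its (y_true, y_prob) pair.
    """
    mse_train = mse(*train)
    return {name: mse(*data) - mse_train for name, data in tests.items()}


# Toy data with hypothetical dataset names.
train = ([0, 1], [0.1, 0.9])
tests = {"test_subset": ([0, 1], [0.2, 0.8]), "external": ([0, 1], [0.5, 0.5])}
print(delta_mse(train, tests))
print(misclassified([0, 1, 1, 0], [0.2, 0.8, 0.4, 0.6]))
```

Under this sign convention, a difference close to zero on the independent datasets corresponds to the MSE stability discussed in the text, while a clearly positive difference points to over-fitting.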

Our new dynamic bootstrap-based approach is the only one that exhibits this MSE stability property. In contrast, all other approaches provided higher PMSE values on the two other datasets than on the original. This led BootYTdyn to have only 48 (16%) misclassified values on these two additional datasets, representing the best result among all studied approaches. Thus, we can reasonably assume that our new method has helped to remove non-informative predictors, in the sense of not being relevant for improving predictive ability.

Lastly, the unique extracted probe set, named 230784_at, is already known to be related to the original location in the distal colon. The sign of the regression coefficient obtained with BootYTdyn is coherent with this state-of-the-art result, and thus strengthens our conclusion.

5 Discussion

In this article, we developed a new dynamic bootstrap-based technique for variable selection, suitable for both the PLS and GPLS frameworks, and proposed a bootstrap-based version of the SPLS and SGPLS methods for selecting the number of components. While the first of these lets us completely avoid CV, the second lets us select the set of tuning parameters in a more reliable way.

In state-of-the-art approaches, the use of CV-based techniques for selection of hyper-parameters is common, and can lead to important stability issues, observed notably by Boulesteix [31] and Magnanensi et al. [32], and confirmed in our studies. Our new dynamic bootstrap-based technique represents a useful method for avoiding CV-related issues, since the unique hyper-parameter is successively selected using a bootstrap-based criterion. This new technique improves on the original bootstrap-based methodology introduced by Lazraq et al. [20], in that it permits us to approximate the distribution of covariates’ regression coefficients by removing the condition of working in a subspace of fixed dimension K. Theoretical results have been established that strengthen the usefulness of building subspaces spanned by a dynamic number of components for performing PLS regressions on bootstrap samples.

In the PLS framework, conclusions based on our simulations are twofold. First, for datasets with negligible random noise in $y$ , SPLS BootYT is recommended. Indeed, both in terms of accuracy and predictive abilities, it outperforms all other techniques compared here. Second, for datasets with non-negligible random noise in y, which represents the more realistic case, our new dynamic bootstrap-based method is recommended. Just as SPLS BootYT does in the negligible-noise setting, BootYTdyn outperforms all other methods for each property studied. Furthermore, it is the only method which selects a decreasing number of significant predictors as the random noise variability increases, which was expected here.

Results obtained from our classification study using real datasets match with previous conclusions. Indeed, BootYTdyn is the only one which leads to the expected PMSE values on two additional independent datasets, which allows us to suppose that over-fitting issues were avoided, and that these PMSE are induced by noise or information which cannot be modeled using only gene expression. Furthermore, the extracted probe set is already known to be linked to the relevant location in the distal colon, which strengthens our confidence in this new dynamic approach.

Lastly, our new bootstrap-based SPLS implementation improves the stability of this method. Indeed, in all cases studied, both the SPLS CV and SGPLS CV select a higher number of unique sets of hyper-parameters than our bootstrap-based versions do, leading to higher numbers of unique supports and models too.

In the future, our work may be extended in the following three directions. Firstly, Sun et al. [42] also worked on this subject, and proposed a methodology for selecting tuning parameters of penalized regressions in order to stabilize variable selection. Even if their method is not applicable to the selection of the number of components in PLS regression, it would be interesting to adapt it to both SPLS CV and our new SPLS BootYT implementation, in order to look for potential stability gains.

Secondly, combining regular or sparse PLS-based techniques with the existing techniques developed in [24–28] may speed up those algorithms, specialize them toward a specific target, or improve their accuracy by taking into account some latent structure of the data.

Thirdly, simulations may be performed to strengthen the results obtained on real datasets for the logistic framework (Section 4). In addition, the performance of these new approaches should also be tested for responses following other distributions, such as Poisson or Gamma. However, based on all the results obtained in this paper, our new dynamic method seems to be the most efficient compared to the state-of-the-art approaches for datasets with non-negligible noise variability, a common situation in daily practice.

Data Availability Statement

The datasets analyzed for this study can be found in the Datasets_benchmark GitHub repository, https://github.com/fbertran/Datasets_benchmark. The packages used for the simulations in this article are available on CRAN, https://cran.r-project.org. In particular, the plsRglm package, which implements partial least squares generalized regression, is available at https://CRAN.R-project.org/package=plsRglm, and the bootPLS package, which implements bootstrap-based selection of the number of components for PLS regression and generalized regression models, is available at https://github.com/fbertran/bootPLS.

Author Contributions

JM, FB, MM-B and NM designed the study. JM did the simulations. JM, FB and MM-B wrote the manuscript.

Funding

This work was supported by grants from the Agence Nationale de la Recherche (ANR) (ANR-11-LABX-0055_IRMIA); the CNRS (UMR 7501) LabEx IRMIA to FB, NM and MM-B; Mastère Spécialisé® Expert Big Data Engineer of the University of Technology of Troyes to FB and MM-B.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2021.693126/full#supplementary-material

References

1. Wold, S, Martens, H, and Wold, H. The Multivariate Calibration Problem in Chemistry Solved by the PLS Method. In: Matrix Pencils. Heidelberg, Deutschland: Springer-Verlag (1983) p. 286–93. doi:10.1007/bfb0062108

2. Wold, S, Sjöström, M, and Eriksson, L. PLS-regression: a Basic Tool of Chemometrics. Chemometrics Intell Lab Syst (2001) 58:109–30. doi:10.1016/s0169-7439(01)00155-1

3. Boulesteix, AL, and Strimmer, K. Partial Least Squares: a Versatile Tool for the Analysis of High-Dimensional Genomic Data. Brief Bioinform (2007) 8:32–44. doi:10.1093/bib/bbl016

4. Höskuldsson, A. PLS Regression Methods. J Chemometrics (1988) 2:211–28. doi:10.1002/cem.1180020306

5. Tenenhaus, M. La régression PLS, Théorie et pratique. Paris, France: Editions Technip (1998).

6. Wold, S, Ruhe, A, Wold, H, and Dunn, III, WJ. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J Sci Stat Comput (1984) 5:735–43. doi:10.1137/0905052

7. Nguyen, DV, and Rocke, DM. Multi-class Cancer Classification via Partial Least Squares with Gene Expression Profiles. Bioinformatics (2002) 18:1216–26. doi:10.1093/bioinformatics/18.9.1216

8. Nguyen, DV, and Rocke, DM. Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data. Bioinformatics (2002) 18:39–50. doi:10.1093/bioinformatics/18.1.39

9. Boulesteix, A-L. PLS Dimension Reduction for Classification with Microarray Data. Stat Appl Genet Mol Biol (2004) 3:1–30. doi:10.2202/1544-6115.1075

10. Marx, BD. Iteratively Reweighted Partial Least Squares Estimation for Generalized Linear Regression. Technometrics (1996) 38:374–81. doi:10.1080/00401706.1996.10484549

11. Cessie, SL, and Houwelingen, JCV. Ridge Estimators in Logistic Regression. Appl Stat (1992) 41:191–201. doi:10.2307/2347628

12. Firth, D. Bias Reduction of Maximum Likelihood Estimates. Biometrika (1993) 80:27–38. doi:10.1093/biomet/80.1.27

13. Nguyen, DV, and Rocke, DM. On Partial Least Squares Dimension Reduction for Microarray-Based Classification: a Simulation Study. Comput Stat Data Anal (2004) 46:407–25. doi:10.1016/j.csda.2003.08.001

14. Ding, B, and Gentleman, R. Classification Using Generalized Partial Least Squares. J Comput Graphical Stat (2005) 14:280–98. doi:10.1198/106186005x47697

15. Fort, G, and Lambert-Lacroix, S. Classification Using Partial Least Squares with Penalized Logistic Regression. Bioinformatics (2005) 21:1104–11. doi:10.1093/bioinformatics/bti114

16. Bastien, P, Vinzi, VE, and Tenenhaus, M. PLS Generalised Linear Regression. Comput Stat Data Anal (2005) 48:17–46. doi:10.1016/j.csda.2004.02.005

17. Chun, H, and Keleş, S. Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection. J R Stat Soc Ser B (Statistical Methodology) (2010) 72:3–25. doi:10.1111/j.1467-9868.2009.00723.x

18. Smyth, GK. Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat Appl Genet Mol Biol (2004) 3:Article3. doi:10.2202/1544-6115.1027

19. Mehmood, T, Liland, KH, Snipen, L, and Sæbø, S. A Review of Variable Selection Methods in Partial Least Squares Regression. Chemometrics Intell Lab Syst (2012) 118:62–9. doi:10.1016/j.chemolab.2012.07.010

20. Lazraq, A, Cléroux, R, and Gauchi, J-P. Selecting Both Latent and Explanatory Variables in the PLS1 Regression Model. Chemometrics Intell Lab Syst (2003) 66:117–26. doi:10.1016/s0169-7439(03)00027-3

21. Chung, D, and Keleş, S. Sparse Partial Least Squares Classification for High Dimensional Data. Stat Appl Genet Mol Biol (2010) 9:Article17. doi:10.2202/1544-6115.1492

22. Bastien, P, Bertrand, F, Meyer, N, and Maumy-Bertrand, M. Deviance Residuals-Based Sparse PLS and Sparse Kernel PLS Regression for Censored Data. Bioinformatics (2015) 31:397–404. doi:10.1093/bioinformatics/btu660

23. Meyer, N, Maumy-Bertrand, M, and Bertrand, F. Comparaison de variantes de régressions logistiques PLS et de régression PLS sur variables qualitatives: application aux données d’allélotypage. J de la Société Française de Statistique (2010) 151:1–18.

24. Gupta, D, and Richhariya, B. Entropy Based Fuzzy Least Squares Twin Support Vector Machine for Class Imbalance Learning. Appl Intell (2018) 48:4212–31. doi:10.1007/s9489-018-1204-4

25. Gupta, U, Gupta, D, and Prasad, M. Kernel Target Alignment Based Fuzzy Least Square Twin Bounded Support Vector Machine. In: Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence; 18-21 Nov. 2018; Bangalore, India. SSCI (2018) p. 228–35. doi:10.1109/SSCI.2018.8628903

26. Borah, P, and Gupta, D. A Two-Norm Squared Fuzzy-Based Least Squares Twin Parametric-Margin Support Vector Machine. In: M Tanveer, and RB Pachori, editors. Machine Intelligence and Signal Analysis. Advances in Intelligent Systems and Computing, 748. Singapore: Springer Singapore (2019) p. 119–34. doi:10.1007/978-981-13-0923-6_11

27. Borah, P, Gupta, D, and Prasad, M. Improved 2-norm Based Fuzzy Least Squares Twin Support Vector Machine. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI); 18-21 Nov. 2018; Bangalore, India. IEEE (2018) p. 412–9. doi:10.1109/SSCI.2018.8628818

28. Gupta, U, and Gupta, D. Least Squares Large Margin Distribution Machine for Regression. Appl Intell (2021). doi:10.1007/s10489-020-02166-5

29. Wiklund, S, Nilsson, D, Eriksson, L, Sjöström, M, Wold, S, and Faber, K. A Randomization Test for PLS Component Selection. J Chemometrics (2007) 21:427–39. doi:10.1002/cem.1086

30. Hastie, T, Tibshirani, R, and Friedman, JJH. The Elements of Statistical Learning. 2nd ed., Vol. 1. New York, USA: Springer (2009).

31. Boulesteix, AL. Accuracy Estimation for PLS and Related Methods via Resampling-Based Procedures. Paris: PLS’14 Book of Abstracts (2014). p. 13–4.

32. Magnanensi, J, Bertrand, F, Maumy-Bertrand, M, and Meyer, N. A New Universal Resample-Stable Bootstrap-Based Stopping Criterion for PLS Component Construction. Stat Comput (2017) 27:1–18. doi:10.1007/s11222-016-9651-4

33. Efron, B, and Tibshirani, RJ. An Introduction to the Bootstrap, Vol. 57. Boca Raton, USA: Chapman & Hall/CRC (1993).

34. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proceedings of the 14th international joint conference on Artificial intelligence-Volume 2, Montreal, Quebec, Canada, August 20–25, 1995. USA: Morgan Kaufmann Publishers Inc. (1995). p. 1137–43.

35. Marisa, L, de Reyniès, A, Duval, A, Selves, J, Gaub, MP, Vescovo, L, et al. Gene Expression Classification of colon Cancer into Molecular Subtypes: Characterization, Validation, and Prognostic Value. Plos Med (2013) 10:e1001453. doi:10.1371/journal.pmed.1001453

36. Krämer, N, and Sugiyama, M. The Degrees of freedom of Partial Least Squares Regression. J Am Stat Assoc (2011) 106:697–705. doi:10.1198/jasa.2011.tm10107

37. Efron, B, Hastie, T, Johnstone, I, Tibshirani, R, et al. Least Angle Regression. Ann Stat (2004) 32:407–99. doi:10.1214/009053604000000067

38. Durif, G, Modolo, L, Michaelsson, J, Mold, JE, Lambert-Lacroix, S, and Picard, F. High Dimensional Classification with Combined Adaptive Sparse Pls and Logistic Regression. Bioinformatics (2017) 34:485–93. doi:10.1093/bioinformatics/btx571

39. Gröne, J, Lenze, D, Jurinovic, V, Hummel, M, Seidel, H, Leder, G, et al. Molecular Profiles and Clinical Outcome of Stage UICC II colon Cancer Patients. Int J Colorectal Dis (2011) 26:847–58. doi:10.1007/s00384-011-1176-x

40. Jorissen, RN, Gibbs, P, Christie, M, Prakash, S, Lipton, L, Desai, J, et al. Metastasis-associated Gene Expression Changes Predict Poor Outcomes in Patients with Dukes Stage B and C Colorectal Cancer. Clin Cancer Res (2009) 15:7642–51. doi:10.1158/1078-0432.ccr-09-1431

41. Van Wieringen, WN, Kun, D, Hampel, R, and Boulesteix, A-L. Survival Prediction Using Gene Expression Data: a Review and Comparison. Comput Stat Data Anal (2009) 53:1590–603. doi:10.1016/j.csda.2008.05.021

42. Sun, W, Wang, J, and Fang, Y. Consistent Selection of Tuning Parameters via Variable Selection Stability. J Machine Learn Res (2013) 14:3419–40.

Keywords: variable selection, partial least squares, sparse partial least squares, generalized partial least squares, bootstrap, stability

Citation: Magnanensi J, Maumy-Bertrand M, Meyer N and Bertrand F (2021) New Developments in Sparse PLS Regression. Front. Appl. Math. Stat. 7:693126. doi: 10.3389/fams.2021.693126

Received: 09 April 2021; Accepted: 21 June 2021;
Published: 16 July 2021.

Edited by:

Hong Zhu, Jiangsu University, China

Reviewed by:

Yunlong Feng, University at Albany, United States
Deepak Gupta, National Institute of Technology, Arunachal Pradesh, India

Copyright © 2021 Magnanensi, Maumy-Bertrand, Meyer and Bertrand. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Frédéric Bertrand, frederic.bertrand@utt.fr

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.