
REVIEW article

Front. Anal. Sci., 19 May 2022
Sec. Chemometrics
Volume 2 - 2022 | https://doi.org/10.3389/frans.2022.867938

Review of Variable Selection Methods for Discriminant-Type Problems in Chemometrics

  • Department of Chemistry, Harynuk Research Group, the Metabolomics Innovation Centre, University of Alberta, Edmonton, AB, Canada

Discriminant-type analyses arise from the need to classify samples based on their measured characteristics (variables), usually with respect to some observable property. When samples are difficult to obtain, or when advanced instrumentation is used, it is very common to encounter situations with many more measured characteristics than samples. The method of Partial Least Squares Regression (PLS-R), and its variant for discriminant-type analyses (PLS-DA), are among the most ubiquitous tools for such problems. PLS solves the inverse least-squares problem for rank-deficient data in a way that maximises the co-variance between the known properties of the samples (commonly referred to as the Y-block) and their measured characteristics (the X-block). A relatively small subset of highly co-variate variables is weighted more strongly than those that are poorly co-variate, in such a way that an ill-posed matrix inverse problem is circumvented. Feature selection is another common way of reducing the dimensionality of the data to a relatively small, robust subset of variables for use in subsequent modelling. The utility of these features can be inferred and tested in any number of ways, which are the subject of this review.

1 Introduction

The output of modern chemical instrumentation can provide a high degree of dimensionality, or number of features, to describe each sample that can be used to gain insight into complex chemical mixtures. Analysts often use this information to construct models for discriminant-type problems, but the burden of dimensionality limits the application of typical algorithms such as Support Vector Machines (SVM) (Crammer and Singer, 2001), or Canonical Variates Analysis (CVA). CVA is commonly referred to in the literature as Linear Discriminant Analysis (LDA) (Nørgaard et al., 2006), and the two terms are often used interchangeably. However, LDA refers specifically to the calculation of one or many classification thresholds, while CVA refers to the dimensionality reduction technique that maximises class separation along the latent variables. Reducing the dimensionality is critical for representative and interpretable models, and often relies on methods such as Partial Least Squares (PLS) (Wold et al., 2001) to weight variables in the X-block that are highly co-variate with those values in the Y-block. However, the resultant regression coefficient output is not always particularly informative, since the analyst needs to make decisions about which regression coefficient loadings are more or less significant than others for inclusion in their resultant interpretation of the data. PLS-DA alone offers very little recourse in those instances where the regression coefficient scores of the models are not particularly useful for classifying samples external to the training set.

Applications of discriminant-type problems have enjoyed considerable attention over the past few decades, thanks in part to the explosion of interest in metabolomics (Dettmer et al., 2007). By examining the differences in metabolite expression, quantitative differences between biological states can be inferred through the use of discriminant analyses. This can simplify the problem of biological interpretation, but the question of what regression coefficient scores ought to be deemed significant is still relevant. Simplifying the output of the data analysis routine to include only those features that are deemed to be significant for subsequent interpretation is one motivation for employing a feature selection routine.

There is an idealised linear relationship between the analytical response of one or several chemical factors and their absolute quantities. While this relationship is not always observed at the limits of the dynamic range of the instrument, through careful method optimisation it is usually safe to assume that the underlying chemical phenomena can be studied using linear methods. This is as true for regression-type problems as it is for discriminant-type problems, and simplifies the practical application of these technologies. Hence, in chemometrics the focus for variable selection and modelling typically favours linear methods (Hopke, 2003). Though better performance has, in some cases, been reported for non-linear models (de Andrade et al., 2020), linear models are more easily interpretable and thus favoured for applications where underlying correlation and causal relationships are being sought.

Different assumptions can be made about the data, depending on whether a discriminant or regression analysis is being performed, and variable selection techniques that may be appropriate for regression-type problems may not be applicable for discriminant-type problems. Known characteristics of the X-block are also relevant for certain variable selection routines, based on the properties of the instrumentation used for data collection. In this article, we distinguish between two basic types of data commonly encountered by analysts: discrete, pre-processed, identifiable, tabulated data (e.g., peak table data output from a chromatographic system), and continuous data (e.g., raw chromatographic or spectroscopic data). More tools can be applied for continuous data, since “windows” of adjacent variables can be considered all at once and variables within certain regions can be assumed to correlate with one another and with some underlying chemical information. However, variables will often correlate to multiple chemical species simultaneously, making interpretation less straightforward.

1.1 Discriminant Analyses as Regression

Feature selection may be used to improve the mathematical characteristics of the problem; that is, to avoid the computation of an ill-posed problem. For either regression or discriminant-type problems, when the number of variables exceeds the number of observations, or samples, there are an infinite number of possible solutions to the equation:

$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}$ (1)

The solution to Eq. 1 identifies a useful variable subset in X that is able to accurately predict those values in Y via the regression coefficient, β. This in effect minimises the following cost function in the least-squares sense, yielding a solution to Eq. 1 via β:

$\mathrm{SSR} = \lVert \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2$ (2)
$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\,\mathrm{SSR} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y}$ (3)

Where $\lVert \cdot \rVert_2^2$ denotes the square of the L2 norm of the error term, often referred to as the sum of squared residuals (SSR), and $\hat{\boldsymbol{\beta}}$ is the estimated value for β, determined empirically in a way that minimises Eq. 2.

In this case, the only meaningful difference between a discriminant-type analysis and a regression-type analysis is the content of the Y-block. Namely, if the Y-block contains categorical information for two or more classes, then the information encoded is discrete, and the analysis is a discriminant-type analysis. For continuous, quantitative information encoded in the Y block, the analysis is generally referred to as a regression. Considerations for scaling are also different for regression-type vs discriminant-type problems: while scaling is critical for a Y-block of continuous data, it is optional for categorical data since the manner in which the observations are weighted is consistent across all observations. However, treatment of a classification problem as a regression problem is only commonly seen in those instances where PLS is used, although principal component scores have also been investigated (Yendle and MacFie, 1989). Even in those cases where PLS-DA is employed, a number of critical parameters such as residuals must be ignored, since a line-of-best-fit through binary classification data is assumed to have high and poorly informative residuals due to the very nature of regression problems. Within the context of a regression, a feature selection routine indicates which variable contributions are negligible, and removes them from the model. In essence, this is analogous to setting certain variable contributions within a regression vector to zero.
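To make the preceding discussion concrete, the following minimal sketch (in Python with NumPy; the data and all variable names are illustrative) dummy-codes a two-class Y-block and solves Eq. 3 directly for a well-posed case with more samples than variables, classifying samples by thresholding the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, well-posed example: 40 samples, 5 variables (more samples than variables)
X = rng.normal(size=(40, 5))
y_class = np.repeat([0, 1], 20)            # two known classes
X[y_class == 1, 0] += 2.0                  # variable 0 carries the class difference

# Dummy-code the Y-block (+1 / -1) and mean-centre both blocks
Y = np.where(y_class == 1, 1.0, -1.0)
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()

# Eq. 3: beta_hat = (X'X)^-1 X'Y (solve() is preferred over an explicit inverse)
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)

# Classify by thresholding the fitted values at zero
y_pred = (Xc @ beta_hat + Y.mean() > 0).astype(int)
print("training accuracy:", np.mean(y_pred == y_class))
```

When the number of variables exceeds the number of samples, XᵀX is singular and this direct solution fails, which is precisely the situation that PLS and the feature selection methods discussed below are intended to address.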

1.2 Canonical Variates Analysis (CVA)

Rather than calculating a linear model that best minimises the sum of squared residuals through categorical data, it is also possible to deploy CVA to maximise Fisher’s discriminant ratio (FDR) (similar to Equation 8) of two or more classes based on each sample’s projection scores on a series of latent variables. This method is more robust than regression methods against outliers and unequal numbers of samples per class, but assumes homoscedasticity for deriving the decision threshold (Theodoridis, 2020). CVA is calculated via the co-variance matrices that reflect within-class variance versus between-class variance (Nørgaard et al., 2006):

$\mathbf{S}_{\mathrm{within}} = \frac{1}{n-g}\sum_{i=1}^{g}\sum_{j=1}^{n_i}\left(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i\right)\left(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i\right)^{\mathsf{T}}$ (4)
$\mathbf{S}_{\mathrm{between}} = \frac{1}{g-1}\sum_{i=1}^{g}n_i\left(\bar{\mathbf{x}}_i-\bar{\mathbf{x}}\right)\left(\bar{\mathbf{x}}_i-\bar{\mathbf{x}}\right)^{\mathsf{T}}$ (5)

Here, xij refers to the jth observation of the ith class, g is the number of classes, n is the total number of observations, and ni is the number of observations in class i. For Swithin, each observation xij is centred relative to its class mean, x̄i. For Sbetween, the variance of each class centroid, x̄i, is determined relative to the overall mean, x̄.

A solution for CVA can be found once the problem is written as an eigenvalue problem:

$\mathbf{S}_{\mathrm{within}}^{-1}\mathbf{S}_{\mathrm{between}}\,\mathbf{w} = \lambda\mathbf{w}$ (6)

Where w is the eigenvector, or latent variable that maximises F(w):

$F(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}}\mathbf{S}_{\mathrm{between}}\mathbf{w}}{\mathbf{w}^{\mathsf{T}}\mathbf{S}_{\mathrm{within}}\mathbf{w}}$ (7)

As a classifier, CVA is used relatively infrequently relative to PLS-DA, since it cannot handle a singular co-variance matrix for either Swithin or Sbetween. However, since PLS-DA makes use of a regression to effect its discrimination, it does suffer from the same drawbacks as any other classifier that is informed by a linear line-of-best-fit. CVA has been modified to account for ill-posed problems where PLS-DA is typically applied, through its application as a technique for solving the matrix inverse of high-dimensional data. This was first described by Nørgaard et al. (2006) as Extended Canonical Variates Analysis (ECVA). For either traditional CVA or ECVA, the decision threshold is calculated by fitting a multivariate normal distribution to each class, and assigning predictions based on each sample’s highest modelled likelihood of belonging to each class.
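A minimal sketch of Eqs 4–7 is given below (Python with NumPy; the function and data are illustrative), assuming a well-posed problem where Swithin is invertible. The canonical variates are obtained from the eigendecomposition in Eq. 6, and samples can then be projected onto them for classification by a separate decision rule.

```python
import numpy as np

def cva(X, y, n_components=2):
    """Canonical variates via eigendecomposition of S_within^-1 S_between (Eq. 6)."""
    classes = np.unique(y)
    n, g = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)

    S_within = np.zeros((X.shape[1], X.shape[1]))
    S_between = np.zeros_like(S_within)
    for c in classes:
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        S_within += diff.T @ diff                      # Eq. 4 numerator
        d = (Xc.mean(axis=0) - grand_mean)[:, None]
        S_between += len(Xc) * (d @ d.T)               # Eq. 5 numerator
    S_within /= (n - g)
    S_between /= (g - 1)

    # Eq. 6: eigenvectors of S_within^-1 S_between are the canonical variates
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_within, S_between))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]       # latent variables w

# Example: project simulated data onto the first two canonical variates
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4)); y = np.repeat([0, 1, 2], 20)
X[y == 1, 1] += 1.5; X[y == 2, 2] -= 1.5
W = cva(X, y, n_components=2)
scores = (X - X.mean(axis=0)) @ W
```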

1.3 Types of Feature Selection

Feature selection routines can be categorised as belonging to either filter, wrapper, or embedded methods (Kohavi and John, 1997). There also exist hybrid methods that combine at least two of these approaches. True to their name, filter methods use some variable ranking scheme, and include only those variables with a value greater than a particular threshold. While simple and computationally efficient, filter methods require extensive user intervention to determine an appropriate value for the threshold, and may or may not account for correlations between variables depending on the variable ranking metric used. Wrapper methods evaluate several candidate variable subsets and select the variable subsets with the best performance. While wrapper methods can be robust, they are computationally expensive due to an inherent degree of redundancy built into the algorithms. Embedded methods include variable selection as a part of the calculation of the model itself.

This review will describe methods for variable selection, rather than methods for variable weighting such as Projection Pursuit Analysis (PPA) (Hou and Wentzell, 2011), PLS, or manifold learning for non-linear modelling (Van der Maaten and Hinton, 2008). While variable significance can be judged, and subsequent variable selection performed, from the output of these methods, this requires human intervention, and so such techniques are widely considered to be dimensionality reduction techniques rather than variable selection techniques.

2 Filter Methods

Filter methods proceed following variable ranking, and may require user intervention in order to determine an appropriate cut-off for variable significance. Variables scoring above the threshold are included in the model, and the rest are discarded. The advantage of filter methods is that they are relatively easy to apply, and certain variable ranking metrics may include the importance of co-variate or correlated variables as they affect a latent variable discriminant analysis such as PLS-DA or CVA (Kvalheim, 2020). However, since many latent variable discriminant analyses have a tendency to over-fit the data, variables selected using metrics derived from the latent variable model are only as useful as the latent variable model itself.

2.1 Variable Ranking Metrics

2.1.1 Fisher(F)–Ratios

The simplest variable ranking filter is arguably the Fisher or F-ratio, which describes the variance attributable to the class means relative to the overall mean, over the pooled within-class variance of all samples (Maddala and Lahiri, 1992). These two considerations have already been described in Eqs 4, 5, and the F-ratio is described quite simply as:

$F = \frac{S_{\mathrm{between}}}{S_{\mathrm{within}}}$ (8)

A key difference between the F-ratio used for variable significance and its use within CVA is that the F-ratio is calculated for individual variables, and a multivariate optimisation for class separation is not performed.

The F-ratio is described by the F-distribution, which includes degree-of-freedom values for both the numerator and denominator as (g − 1) and (n − g) respectively as input parameters, where n is the total number of samples and g refers to the number of classes. Critical values of significance can be used to inform the appropriate threshold based on the parametric F-distribution, but there is no guarantee that the observed distribution of F-ratios will follow a theoretical one, especially in cases where the data are not normally distributed, since Equation 8 can also be written as:

$F_{\nu_1,\nu_2} = \frac{\chi_1^2/\nu_1}{\chi_2^2/\nu_2}$ (9)

Where the numerator and denominator are χ2 values with degrees of freedom ν1 and ν2 respectively. This relationship highlights the Gaussian assumptions of the F-distribution (Box, 1953), and by extension CVA.

The F-ratio has been used extensively in discriminant-type analyses of chemical data, owing to the simplicity of its application, its relationship to one-way ANOVA via Eq. 9, and the fact that direct variable-variable comparisons evade issues surrounding the dimensionality of the data. Experiments where the number of identifiable underlying chemical features greatly outnumbers the number of samples (e.g., many omics problems and non-target environmental analyses) see frequent use of the F-ratio (Johnson and Synovec, 2002; Marney et al., 2013; de la Mata et al., 2017; Pesesse et al., 2019), in part because the costs associated with sample analysis make any attempts to lessen the impact of dimensionality much less practical. Since the objective of any analysis of F-ratios is a descriptive rather than an inferential one, it could be argued that F-ratio measurements are not a learning technique, but rather a descriptive statistical analysis. However, the act of applying a threshold for variable significance introduces bias to the model, which nonetheless warrants further confirmation with external samples.
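A per-variable F-ratio filter can be implemented in a few lines, as sketched below (Python with NumPy and SciPy; the data and threshold choice are illustrative). Here the cut-off is taken from the parametric F-distribution, subject to the caveats about non-normal data noted above.

```python
import numpy as np
from scipy import stats

def f_ratios(X, y):
    """Univariate F-ratio (Eq. 8) for each column of X given class labels y."""
    classes = np.unique(y)
    n, g = len(y), len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = sum(np.sum(y == c) * (X[y == c].mean(axis=0) - grand_mean) ** 2
                     for c in classes)
    ss_within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                    for c in classes)
    return (ss_between / (g - 1)) / (ss_within / (n - g))

# Filter: keep variables whose F-ratio exceeds the parametric critical value
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 500)); y = np.repeat([0, 1], 15)
X[y == 1, :5] += 1.5                                   # 5 truly discriminating variables
F = f_ratios(X, y)
threshold = stats.f.ppf(0.99, dfn=1, dfd=len(y) - 2)   # g - 1 = 1, n - g = 28
selected = np.where(F > threshold)[0]
```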

F-ratios can be applied on either continuous or peak-table data. Synovec’s research group (Pierce et al., 2006; Marney et al., 2013) has frequently compared F-ratios of each ion channel (mass-to-charge ratio, m/z) from samples of known classes to qualitatively identify regions of two-dimensional chromatograms where there is “discriminating” information, using either a pixel-based or tile-based approach. In these approaches, the variance across individual mass channels is compared via Fisher ratio, accounting for known class membership, and the Fisher ratios are summed across all mass channels to visualise the significance of each pixel on a chromatogram. It is worth noting that the pixel-based approach can falsely indicate significance if there is significant chromatographic drift between samples. This issue has been overcome somewhat by the tile-based approach, wherein peaks are expected to drift within the tile spaces themselves. In the tile-based approach, an additional summation step is used to indicate the quantity of a particular ion within the tile. For complex data and/or large studies with hundreds or thousands of samples, there may still be issues with peaks drifting in and out of tiled regions across the data set if the tiles are not sized appropriately or if signals drift too much over the course of the study (this would affect the subsequent determination of the F-ratios).

2.1.2 Selectivity Ratios

The Selectivity Ratio (SR) (Rajalahti et al., 2009) is another metric for variable ranking. It encodes multivariate and co-linearity information within the rank of each variable as informed by the ratio of its variance explained within the predictive model versus its variance within the residual matrix:

$\mathbf{X}_{LV} = \mathbf{T}\mathbf{P}^{\mathsf{T}} + \mathbf{E}$ (10)

Where XLV is an m × n reconstruction of observations and variables via a latent variable analysis such as PLS, as informed by a regression model Y = Xb, and E is an error or residual matrix of the same dimensions. SR is defined for a particular variable as the ratio between its variance explained in the latent variable space and its contribution to the noise. Here, for j ∈ [1, n], SRj is the selectivity ratio for the jth variable out of n:

$SR_j = \frac{\lVert \mathbf{T}\mathbf{P}_j^{\mathsf{T}} \rVert^2}{\lVert \mathbf{E}_j \rVert^2}$ (11)

This method has been applied in a number of studies (Rajalahti et al., 2009; Amante et al., 2019) for discriminant-type problems, due in part to the ease with which it can be integrated within the framework of PLS-DA.
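As a rough sketch of Eq. 11 (Python with scikit-learn; names are illustrative), the explained and residual variance of each variable can be estimated from the scores and loadings of a fitted PLS model. Note that the selectivity ratio of Rajalahti et al. (2009) is defined for a target-projected model, so the single-step reconstruction below is only an approximation of that procedure.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 200)); y = np.repeat([0.0, 1.0], 20)
X[y == 1, :10] += 1.0

pls = PLSRegression(n_components=2, scale=False).fit(X, y)
Xc = X - X.mean(axis=0)
X_hat = pls.x_scores_ @ pls.x_loadings_.T     # T P^T, Eq. 10
E = Xc - X_hat                                # residual matrix

# Eq. 11: ratio of explained to residual variance, per variable
sr = (X_hat ** 2).sum(axis=0) / (E ** 2).sum(axis=0)
ranked = np.argsort(sr)[::-1]                 # variables ranked by selectivity ratio
```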

2.1.3 VIP Scores

Variable Importance in Projection (VIP) scores are a measure of a particular variable’s influence on the latent variable model, which is correlated to the variance explained in the Y-block. They are calculated as a weighted sum of squares of the PLS weights, scaled by the amount of variance explained by each component of the model (Farrés et al., 2015). VIP scores were originally described by Wold et al. (1993). However, the article by Chong and Jun (2005) arguably first popularised the method, which is described using Equation 12:

$\mathrm{VIP}_j = \sqrt{\frac{n\sum_{i=1}^{k} b_i^2\,\mathbf{t}_i^{\mathsf{T}}\mathbf{t}_i\left(w_{ij}/\lVert \mathbf{w}_i \rVert\right)^2}{\sum_{i=1}^{k} b_i^2\,\mathbf{t}_i^{\mathsf{T}}\mathbf{t}_i}}$ (12)

In Equation 12, for a matrix of m × n samples by variables, VIPj refers to the VIP score for the jth variable, and n refers to the total number of variables. For a k-component PLS model, bi indicates the regression coefficient for the ith component, derived by regressing the vector or matrix of observed values on the score matrix, T; ti indicates the ith column of the score matrix T; and wij indicates the weight of the jth variable on the ith component in the X-block, normalised to the Euclidean norm of the weight vector for that component.

Many analysts use a threshold of 1 (Stoessel et al., 2018) to indicate what VIP scores are significant, but examination of the resultant projections before and after variable inclusion or exclusion are advised to prevent over-fitting of the data (Andersen and Bro, 2010). VIP scores are included as a method of variable ranking in the popular online platform MetaboAnalyst (Pang et al., 2021), and are used in many publications, with an unsurprisingly high representation in fields related to metabolomics (Seijo et al., 2013; Stoessel et al., 2018; Ghisoni et al., 2020; Sinclair et al., 2021).
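The following sketch (Python with scikit-learn; attribute names follow scikit-learn’s PLSRegression, everything else is illustrative) shows one common way to compute Eq. 12 from a fitted model and to apply the conventional, though fallible, threshold of 1.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """VIP scores (Eq. 12) from a fitted PLSRegression model."""
    T = pls.x_scores_             # scores, one column per component
    W = pls.x_weights_            # X-block weights
    Q = pls.y_loadings_           # Y-block loadings (b_i for a single y)
    n_vars = W.shape[0]
    ss = (Q ** 2).flatten() * np.sum(T ** 2, axis=0)   # b_i^2 t_i' t_i per component
    w_norm = W / np.linalg.norm(W, axis=0)             # w_ij / ||w_i||
    return np.sqrt(n_vars * (w_norm ** 2 @ ss) / ss.sum())

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 100)); y = np.repeat([0.0, 1.0], 20)
X[y == 1, :8] += 1.0
pls = PLSRegression(n_components=2).fit(X, y)
vip = vip_scores(pls)
selected = np.where(vip > 1)[0]   # common, but not infallible, threshold
```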

Although other variable ranking metrics are used (Tran et al., 2014; Mehmood et al., 2020), the F-ratio, selectivity ratio, and VIP score are among those most frequently encountered in chemometrics for discriminant problems. Selectivity ratios and VIP scores can tend towards over-fitting the data, since they rely on parameters returned by PLS-DA, which suffers from problems related both to dimensionality and to the application of a regression model for a discriminant-type problem. On the other hand, selectivity ratios and VIP scores account for co-linearity, unlike the F-ratio. Despite this drawback, the calculation of F-ratios is a more statistically informative criterion for discriminant-type problems than VIP scores and the selectivity ratio, and remains popular as a simple method for selecting variables that feature a high degree of univariate discrimination.

Also of note: Talukdar et al. (2018) utilised non-linear kernel partial least squares (kPLS) and eliminated poorly weighted coefficients in the model to improve prediction accuracy, indicating that filter methods can be easily applied in conjunction with more sophisticated modelling and variable significance techniques.

3 Wrapper Methods

Wrapper methods evaluate a number of different variable subsets through multiple iterations of the algorithm, and return the best-performing variable subset (Kohavi and John, 1997). The choice of performance metric used to evaluate the variable subsets, as well as the method for determining the variable subsets, has a profound effect on the resultant output of the algorithm. Wrappers can evaluate a single operation iteratively through a single dataset (Rinnan et al., 2014) followed by validation on an external or cross-validated set, or can be evaluated at each iteration using different combinations of the data (Sinkov et al., 2011).

3.1 Variable Subset Selection

3.1.1 Forward Selection, Backwards Elimination

Regardless of the method used to determine a variable subset, some initial assumptions about which variables are likely to be useful in the final model must be made, since generating and evaluating random subsets is computationally prohibitive for high-dimensional data. This may qualify certain wrapper methods as hybrid approaches, due to their reliance on a filter method in a preliminary variable ranking step. The simplest methods for variable subset selection use either forward selection, where variables are successively added to a model; or backwards elimination, where variables are successively removed from a model. In either case, the performance of the variable subset is determined at each step where a variable is either added or removed. Studies that use a hybrid forward selection/backwards elimination routine (Sinkov et al., 2011) have been proposed for systems where there are many thousands, or even millions, of data points describing each sample. When using only one of forward selection or backwards elimination, the only challenge is selecting a high-performing variable subset, and frequently all variables are tested. But in order to utilise both forward selection and backwards elimination, an initial population of variables for the backwards elimination must be chosen, followed possibly by a point where one should stop the subsequent forward selection. This has been done by estimating the distributions of each variable’s Fisher ratios (Adutwum et al., 2017), and an estimate for the optimal “start” and “stop” was performed within this framework using numerical experiments.
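The sketch below (Python with scikit-learn; the ranking step, classifier, and stopping rule are all illustrative choices) shows a simple greedy forward selection wrapper: variables are first ranked by a univariate F-test so that only the top candidates are considered, and candidates are added while the cross-validated accuracy of a linear discriminant model keeps improving.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import f_classif

def forward_selection(X, y, n_candidates=50, max_features=10):
    """Greedy forward selection over the top F-ratio-ranked candidate variables."""
    f_vals, _ = f_classif(X, y)
    candidates = list(np.argsort(f_vals)[::-1][:n_candidates])
    selected, best_score = [], -np.inf
    while candidates and len(selected) < max_features:
        scores = [cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates]
        best_j = candidates[int(np.argmax(scores))]
        if max(scores) <= best_score:        # stop once adding a variable no longer helps
            break
        best_score = max(scores)
        selected.append(best_j)
        candidates.remove(best_j)
    return selected, best_score

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 300)); y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.2
subset, score = forward_selection(X, y)
```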

3.1.2 Genetic Algorithms

Genetic algorithms (GAs) are another method for selecting a variable subset: high-performing variable subsets “evolve” and are selected to share information with other informative subsets to create new subsets. The theory is that at each iteration the surviving subsets become more informative, as variables contained within poorly performing subsets are excluded from further consideration. Genetic algorithms are poorly characterised mathematically, since they imitate biological processes rather than exploiting specific mathematical characteristics of the data. They are best understood through their dynamic programming routine, which can be summarised for classification problems (Cocchi et al., 2018):

Algorithm 3.1.2: Variable Subset Selection by Genetic Algorithm.

1. A user-defined (A) number of individual “chromosomes”, each containing binary information indicating the presence or absence of the variables, are randomly generated and the performance criteria for each evaluated.

2. A user-defined (B) number of highly performing individual chromosomes are selected to move forward to the next “generation”.

3. The highly performing individual chromosomes randomly exchange information to create new individuals, typically of the same population size (A). In each case there is a small probability (C) for a random mutation to occur. Following this, the fitness of each individual chromosome is reassessed.

4. Reiterate steps 2, 3 for a set number of iterations (Da), or until a performance criterion Db is reached (Lavine et al., 2011).

As described in Algorithm 3.1.2, there are a number of required user input parameters that can have a profound impact on the feature selection routine. Genetic algorithms can also be quite slow due to their reliance on dynamic programming to select a high-performance variable subset. As such, the use of GAs can be difficult for high-dimensional data. Ballabio et al. (2008) demonstrated dimensionality reduction techniques prior to a feature selection step using GAs, to circumvent this problem.
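A minimal sketch of Algorithm 3.1.2 follows (Python with NumPy and scikit-learn; the population size, selection pressure, mutation rate, and fitness function are all illustrative choices corresponding to parameters A, B, C, and Da above).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)

def fitness(mask, X, y):
    """Cross-validated accuracy of the variable subset encoded by a binary chromosome."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LinearDiscriminantAnalysis(),
                           X[:, mask.astype(bool)], y, cv=5).mean()

def ga_select(X, y, pop_size=30, n_keep=10, p_mutation=0.01, n_generations=20):
    """Minimal GA per Algorithm 3.1.2: A=pop_size, B=n_keep, C=p_mutation, Da=n_generations."""
    n_vars = X.shape[1]
    population = rng.random((pop_size, n_vars)) < 0.05          # step 1: random chromosomes
    for _ in range(n_generations):                               # step 4: iterate
        scores = np.array([fitness(ind, X, y) for ind in population])
        parents = population[np.argsort(scores)[::-1][:n_keep]]  # step 2: selection
        children = []
        while len(children) < pop_size:                           # step 3: crossover + mutation
            a, b = parents[rng.integers(n_keep, size=2)]
            cut = rng.integers(1, n_vars)
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_vars) < p_mutation
            children.append(child)
        population = np.array(children)
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[np.argmax(scores)].astype(bool)

X = rng.normal(size=(60, 150)); y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.2
best_subset = ga_select(X, y)
```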

Lavine has proposed a number of feature selection methods based on GAs acting as wrapper functions, for source identification of jet fuel using Solid Phase Microextraction (SPME) and Gas Chromatography (Lavine et al., 2000) and fuel spill identification (Lavine et al., 2001) among others.

A drawback of wrapper methods is that while the performance evaluation functions themselves are generalisable, the same cannot be said for their implementation. Oftentimes, several nested internal validation routines are used to train and evaluate the candidate features in a way that is tailor-made to the data, especially if there are hierarchical classifications involved. If these routines are not perfectly transparent, they may not be reproducible, and they generalise poorly to data with different characteristics.

3.1.3 Methods Based on PLS

Many wrapper methods based on PLS exist in the literature - arguably the most widely-known and most influential of these methods is Uninformative Variable Elimination (UVE) (Centner et al., 1996). Through several iterations, a model is trained on a number of samples, and variables that are under-performing (typically based on an analysis of the regression vector) are eliminated. The resultant variable subset is then assessed based on its ability to correctly indicate the samples external to the model, and variables that are consistently selected are included in the final model. While most broadly applicable for regression-type analyses, this technique has also been used to distinguish different cultivars of corn using Terahertz (far-infrared) spectroscopy (Yang et al., 2021).

For spectroscopic data with a high degree of co-linearity, different wavelength bands can be selected as candidate variable subsets. This is the principle of Interval-PLS (Nørgaard et al., 2000). It has also been applied in discriminatory analyses: for example, Peris-Díaz et al. (2018) compared the performance of different spectral regions based on their ability to distinguish different samples of amber based on geological age and region of origin with Raman spectroscopy. The wrapper function in this case is the use of several PLS models calibrated on different spectral regions, which can be either sequentially backwards eliminated or forward selected (Mehmood et al., 2012).

Although a number of methods based on PLS exist, it is also worth noting that not all of these methods have been applied for discriminant-type analyses. Recursive weighted partial least squares (rPLS) (Rinnan et al., 2014) has not been widely demonstrated on discriminant problems, although it has been used in one recent publication by Aliakbarzadeh et al. (2016) to select relevant features for different cultivars of saffron. In this publication, the features selected by rPLS appeared to correspond well to features selected using other methods. rPLS iterates through incrementally scaled versions of the X-block, weighted using the regression coefficient (for PLSR) calculated at each iteration (r), where Dr in Equation 13 is a diagonal matrix formed from the regression vector (b) normalised to its maximum value. The algorithm reaches convergence once the weight of each variable included in the final variable subset approaches 1, and the weights of those not deemed to be significant approach 0.

$\mathbf{X}_r = \mathbf{X}_{r-1}\mathbf{D}_r$ (13)
$\mathbf{X}_R = \mathbf{X}\prod_{r=1}^{R}\mathbf{D}_r$ (14)
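A schematic of the reweighting in Eqs 13–14 is sketched below (Python with scikit-learn; this is not the reference implementation of Rinnan et al. (2014), and the convergence test and component count are illustrative).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rpls_weights(X, y, n_components=2, n_iter=30):
    """Schematic rPLS (Eqs 13-14): reweight X by the normalised regression vector each round."""
    Xr = X.copy()
    cumulative = np.ones(X.shape[1])
    for _ in range(n_iter):
        b = PLSRegression(n_components=n_components, scale=False).fit(Xr, y).coef_.ravel()
        d = np.abs(b) / np.abs(b).max()      # diagonal of D_r, normalised to its maximum
        Xr = Xr * d                          # Eq. 13
        cumulative *= d                      # running product, Eq. 14
        if np.all((d > 0.999) | (d < 0.001)):   # weights have drifted towards 0 or 1
            break
    return cumulative

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 80)); y = np.repeat([0.0, 1.0], 20)
X[y == 1, :4] += 1.0
weights = rpls_weights(X, y)
selected = np.where(weights > 0.5)[0]
```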

3.2 Methods for Evaluating the Performance of the Discriminant Function

3.2.1 Traditional Measures

Performance metrics can strongly inform the resultant output of a wrapper-based method that iterates through several combinations of internal training and test sets. The results of a discriminant-type analysis can typically be summarised in the form of a confusion matrix, which is a table that summarises the distribution of the predicted sample classes relative to their known classes. Although accuracy (the number of true positives and true negatives over the total number of samples) is a commonly used metric, it is not particularly informative on grossly unbalanced datasets, and is inappropriate in fields such as diagnostics where sensitivity (the number of correctly indicated positive samples over the number of known members of the “true” class) is a much more important consideration than specificity (the number of correctly indicated negative samples over the known number of members of the “negative” class) for binary classification problems. Nonetheless, considerations for the predictive power of a classification model must be appropriately summarised in order to simplify the problem such that the variable subset can be optimised relative to a single cost function.

3.2.2 Fβ Scores

The F1-score (not to be confused with the F-ratio in this manuscript) is a summary of the model performance, incorporating measures of sensitivity and precision (true positives over the sum of true positives and false positives), and is given by the equation:

$F_1 = \frac{2\,\mathrm{PPV}\times\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$ (15)

Where the F1 score can be summarised simply as the harmonic mean of sensitivity and precision. In those instances where the analyst would like to bias the performance measure somewhat to better reflect the needs of the analysis, it is also possible to incorporate an additional β coefficient that weights the relative importance of sensitivity vs. precision:

$F_\beta = \frac{\left(1+\beta^2\right)\mathrm{PPV}\times\mathrm{TPR}}{\beta^2\times\mathrm{PPV}+\mathrm{TPR}}$ (16)

Where PPV describes precision (Positive Predictive Value) and TPR describes sensitivity (True Positive Rate).
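The computation of Eq. 16 from a confusion matrix is illustrated below (Python with scikit-learn; the labels and the choice of β = 2 are illustrative), alongside the equivalent library call.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = tp / (tp + fp)                       # precision
tpr = tp / (tp + fn)                       # sensitivity
beta = 2.0                                 # weight sensitivity twice as heavily as precision
f_beta = (1 + beta**2) * ppv * tpr / (beta**2 * ppv + tpr)   # Eq. 16

# The same value from scikit-learn's built-in implementation
assert np.isclose(f_beta, fbeta_score(y_true, y_pred, beta=beta))
```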

3.2.3 Area Under the Curve

The Area Under the Curve (AUC) value for the Receiver Operator Characteristics of a classification is also used as a measure of model performance. The Receiver Operator Characteristics of a binary classifier plot the True Positive Rate against the False Positive Rate as a function of the position of the decision boundary relative to the samples being classified. Intuitively, a decision boundary that classifies all samples as belonging to class 1 would have both a high false positive and a high true positive rate. If the samples being considered are perfectly resolved along the discriminant axis, then the true positive rate can be increased without any effect on the false positive rate. However, when classes overlap along the discriminant axis, changes in the true positive rate are accompanied by changes in the false positive rate. The AUC is the area under the receiver operator characteristic curve, which is a value between 0 and 1, with 1 being a perfectly performing classifier. This has not frequently been used to guide feature selection routines in chemometrics, although it has been demonstrated in adjacent fields (Wang and Tang, 2009).

3.2.4 Cluster Resolution

Cluster Resolution (CR) is a statistical measure of model performance that measures the maximum confidence interval over which two confidence ellipses are non-intersecting in two or more dimensions. This has been demonstrated in a linear subspace such as principal component space for uncorrelated, orthogonal scores. It was first described by Sinkov and Harynuk (2011), using a dynamic programming approach. A numerical determination was proposed by Armstrong et al. (2021), which minimised the χ2 value of significance between a binary set of clusters. It is a particularly useful method for those instances where there are a number of missing values in the X-block, which is a persistent problem in non-target analytical studies involving complex samples. In these data sets, single chemical features are not always reliably registered in the same column of the X-block across all samples, and when considering trace compounds, they may not always be properly detected if their abundance is near the abundance threshold for inclusion in the data table. An advantage of CR is that single chemical components registered in multiple columns of the dataset will be highly correlated, and thus identifiable as being useful features when projected into principal component space. CR is also a less granular metric than those derived from classification performance metrics, and can offer information about the discriminatory power of different variable subsets without samples crossing a decision threshold. Therefore, relatively minor changes are easily scrutinised when evaluating a feature selection routine. However, a drawback of this technique is that, when compared to other methods, it requires a larger number of samples (20–30 per class as a minimum) for proper estimation of variances along the principal component axes.

4 Embedded Methods

4.1 Regularisation Methods

Regularisation methods add an additional term to the minimisation of the least-squares problem that penalises the magnitude of the regression coefficient (β) and, in some formulations, forces some of its components to be exactly zero. These methods constrain the length of β via a regularisation constant γ, such that certain entries of the regression vector are weighted more significantly, or via λ such that certain entries are excluded from the final model as zeros.

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\,\lVert \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \gamma\lVert \boldsymbol{\beta} \rVert_2^2$ (17)

The length of the regression vector can be constrained via the Euclidean norm (the L2 norm) as in Equation 17, where it is described as being a ridge regression or Tikhonov regularisation, or via the L1 norm:

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\,\lVert \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda\lvert \boldsymbol{\beta} \rvert_1$ (18)

The regularised regression coefficient, β̂R, for Equation 17 can be solved analytically as:

$\hat{\boldsymbol{\beta}}_R = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \gamma\mathbf{I}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y}$ (19)

Equation 18 is referred to as a Least Absolute Shrinkage and Selection Operator (LASSO) regression by Tibshirani (1996), although a similar approach was first described earlier by Santosa and Symes (1986). The L2 norm of any vector (x) can be described as $\lVert x \rVert_2 = \sqrt{x^{\mathsf{T}}x} = \sqrt{\sum_{i=1}^{n} x_i^2}$, and the L1 norm as the sum of the absolute values of x: $\lvert x \rvert_1 = \sum_{i=1}^{n} \lvert x_i \rvert$. In either case, the regularisation coefficient (γ or λ) must be selected by the user. This is typically done by picking a value that maximises the prediction accuracy of the model, achieved by analysing the sum of squared residuals for a set of previously unconsidered samples in the case of regression-type problems, or by analysing the prediction accuracy for discriminant-type problems.

However for LASSO, the coefficient β̂L must be determined through convex optimisation. Despite reliance on numerical methods for optimisation, it has been proven that a unique solution exists for LASSO regressions of ill-posed problems (Tibshirani, 2013).
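A short sketch of both penalties applied to a dummy-coded Y-block is given below (Python with scikit-learn; the data, ridge penalty value, and cross-validation scheme are illustrative). The LASSO penalty is chosen here by cross-validation, and the non-zero coefficients constitute the selected variable subset.

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 200))
y = np.repeat([-1.0, 1.0], 25)               # dummy-coded two-class Y-block
X[y == 1, :5] += 1.0

# Ridge (Eq. 17): all coefficients are shrunk, but none are exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))

# LASSO (Eq. 18): lambda chosen by cross-validated prediction error
lasso = LassoCV(cv=5).fit(X, y)
selected = np.where(lasso.coef_ != 0)[0]      # variables retained in the sparse model
print(f"LASSO: lambda = {lasso.alpha_:.4f}, {len(selected)} variables retained")
```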

An extension of LASSO for use in discriminant analyses via CVA was proposed by Trendafilov and Jolliffe (2007); however, it was applied on datasets containing relatively few variables, which is not a common occurrence in chemometrics. Witten and Tibshirani (2011) implemented a different formulation of the problem, and proved its utility on far higher dimensionality data, which has been released as the R package penalizedLDA (Witten and Witten, 2015). This package has been used in chemometrics-type work, in distinguishing wild-grown and cultivated Ganoderma lucidum using Fourier transform infrared spectroscopy (Zhu and Tan, 2016), which also included a comparison of various other methods for sparse discriminant analyses. Also of note was LASSO coupled to a logistic regression for classification of different fabric dyes using UV-Vis spectroscopy (Rich et al., 2020).

For values of m ≪ n in X, it may be helpful to select a number of variables that are highly correlated with each other, especially for spectral data where the bandwidth of any particular transition implies a degree of co-linearity within the dataset. Using a LASSO regression, only the most highly correlated variables are usually included in the final model, but due to the regularisation parameter some correlated variables may be lost. As a consequence, the model may be less predictive, and it may be harder to interpret the output of the model. By adding both regularisation parameters from Eqs 17, 18, it is possible to include more correlated features in the final model:

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\,\lVert \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \gamma\lVert \boldsymbol{\beta} \rVert_2^2 + \lambda\lvert \boldsymbol{\beta} \rvert_1$ (20)

Based on Equation 20, Clemmensen et al. (2011) developed a sparse extension of CVA for classification of highly co-linear data, which is of particular interest in chemometrics. This algorithm has been released as the R package: sparseLDA (Clemmensen and Kuhn, 2016) and has been used to analyse data from a Direct Analysis in Real Time - High Resolution Mass Spectrometry (DART-HRMS) experiment, to the end of classifying personal lubricants for forensic analysis (Coon et al., 2019), and for distinguishing different printer inks by micro-Raman spectroscopy (Buzzini et al., 2021).

Sparsity has also been used for variable selection via PLS-DA. This approach was first reported using a sparse PLS regression step as described by Chun and Keleş (2010), followed by classification using a standard classifier such as CVA or a logistic regression. Lê Cao et al. (2011) developed a single-step solution for sparse PLS-DA using an approximation of the LASSO regularisation function via:

$\underset{\mathbf{u}_i,\mathbf{v}_i}{\arg\min}\,\lVert \mathbf{X}^{\mathsf{T}}\mathbf{Y} - \mathbf{u}_i\mathbf{v}_i^{\mathsf{T}} \rVert_F^2 + \mathrm{sign}(\mathbf{u}_i)\left(\lvert \mathbf{u}_i \rvert - \lambda\right)_+$ (21)

Where i ∈ [1, … , K] describes the number of vectors that span the partial least squares subspace, calculated from the iterative deflation (i − 1) of matrices Y and X. The iterative approach used was previously described in literature illustrating its application for sparse PCA (Shen and Huang, 2008), and the approximation of the LASSO regularisation function used was a soft thresholding approach, where only the positive values of ui were used as the sparse vectors. This method has been included in the R package mixOmics (Rohart et al., 2017). Sparse PLS-DA was shown to improve classification accuracy on previously published datasets in chemometrics, although manipulation of the regularisation coefficient in addition to the number of latent variables certainly adds a layer of complexity to the analysis (Filzmoser et al., 2012). It was also applied to distinguish between conventional and organic walnut oils by solid-phase micro-extraction GC-MS in a recent publication (Kalogiouri et al., 2021).

Also of note is a sparse discriminant analysis implemented using a Bayesian information criterion by Orlhac et al. (2019), and another sparse method using shrunken centroids (Chen et al., 2015), both for simultaneous classification and feature selection.

In many of the previous citations, sparse methods for discriminant-type analyses are not used as a primary means of investigation, and it appears that in chemometrics research sparse methods are often included for the sake of comparison to existing methods. In some cases, it appears that sparse methods do not improve upon the classification accuracy versus more standard chemometrics tools such as PLS-DA, or reduction of the data dimensionality via PCA prior to an analysis of the resultant scores by CVA (Buzzini et al., 2021). Sparse methods may also suffer from the drawbacks associated with other embedded methods for feature selection, such as a tendency to over-fit to the training set. For hyphenated chromatographic-mass spectrometric data, where analysts have come to expect a high level of missing features from their datasets, a feature selection method that optimises some criterion of the training set may not correctly predict the test or validation set. This problem may be attributed to the high sensitivity of hyphenated instruments, but a complacency with low industry standards for data pre-processing is also suspected (Lu et al., 2008). It is also worth noting that most sparse methods for discriminant analysis or regression appear to be published in the R programming language, and since the working language for much of chemometrics is MATLAB, this could be another possible reason why these methods are not widely applied.

4.2 Sparse Projection Pursuit Analysis

Kurtosis minimisation as a projection index has long been utilised in chemometrics and performs well for classification problems, since scores with low measures of kurtosis are typically well-resolved within a linear subspace (Hou and Wentzell, 2011; Wentzell et al., 2021a,b). Kurtosis is described as the fourth statistical moment, following mean, variance, and skew (Equation 22). Distributions with a high degree of kurtosis describe data with a higher tendency for outliers relative to a normal distribution. For the purposes of revealing clustering of the data, distributions with a low value of kurtosis can indicate a bimodality in their distribution. This bimodality naturally lends itself to clusters that are readily observable in higher dimensions, which are calculated step-wise for univariate measures of kurtosis (Equation 23), but can also be calculated simultaneously via a multivariate determination of kurtosis (Eq. 24) to avoid categorising samples into nominal clusters. What is interesting about PPA is that it is an unsupervised method, and does not calculate the subspace in a way that is informed by class information. It has been argued that this makes it less prone to over-fitting (Hou and Wentzell, 2011); however, it performs poorly for ill-posed problems, and a dimensionality reduction step has historically been applied prior to its use. Being an unsupervised method, the use of a separate classifier is typically required following the analysis.

$K = \frac{\frac{1}{m}\sum_{i=1}^{m}\left(z_i-\bar{z}\right)^4}{\left(\frac{1}{m}\sum_{i=1}^{m}\left(z_i-\bar{z}\right)^2\right)^2}$ (22)
$K = \frac{m\sum_{i=1}^{m}\left(\mathbf{v}^{\mathsf{T}}\mathbf{x}_i\mathbf{x}_i^{\mathsf{T}}\mathbf{v}\right)^2}{\left(\mathbf{v}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{v}\right)^2}$ (23)
$K = m\sum_{i=1}^{m}\mathrm{tr}\left[\left(\mathbf{V}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{V}\right)^{-1}\mathbf{V}^{\mathsf{T}}\mathbf{x}_i\mathbf{x}_i^{\mathsf{T}}\mathbf{V}\right]^2$ (24)

In Equation 22 the kurtosis of the ith sample’s projection (as zi) to a latent variable space is described for all samples i ∈ [1, m] where m is the total number of samples for an m × n matrix. Minimisation of Eqs 23, 24 as a function of the latent variables individually (v) or collectively (V), operates on the sum of the projections of each sample as xi.
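A small sketch of the univariate projection index (Eq. 22) is given below (Python with NumPy and SciPy; the naive multi-start optimiser stands in for the quasi-power algorithm discussed next and is purely illustrative), showing how a low-kurtosis projection can reveal a bimodal, and hence clustered, score distribution.

```python
import numpy as np
from scipy.optimize import minimize

def kurtosis(z):
    """Univariate kurtosis of a score vector (Eq. 22)."""
    z = z - z.mean()
    return np.mean(z**4) / np.mean(z**2) ** 2

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 5)); y = np.repeat([0, 1], 30)
X[y == 1, 0] += 3.0                      # bimodal along variable 0
X = X - X.mean(axis=0)

def objective(v):
    v = v / np.linalg.norm(v)            # constrain the projection vector to unit length
    return kurtosis(X @ v)

# Naive multi-start search for a minimum-kurtosis projection (not the quasi-power algorithm)
best = min((minimize(objective, rng.normal(size=X.shape[1])) for _ in range(10)),
           key=lambda r: r.fun)
v_opt = best.x / np.linalg.norm(best.x)
scores = X @ v_opt                       # bimodal scores reveal the two clusters
```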

Hou and Wentzell (2014) implemented a regularisation parameter to the minimisation of kurtosis via a quasi-power algorithm. More recently, Driscoll et al. (2019) utilised a genetic algorithm to reduce the dimensionality of the feature set in order to perform PPA using kurtosis minimisation as a projection index (kPPA).

5 Hybrid, and Miscellaneous Approaches

5.1 Decision Trees

Decision trees, or more specifically Classification and Regression Trees (CART) are trees that perform binary operations based on some information criteria of the variable such as entropy or Gini impurity to establish a threshold for the decision (Questier et al., 2005). Each tree is comprised of nodes (τ) that are connected by branches. Each node is a point at which a decision is made about a particular sample given the information encoded by a single variable. If one node leads to two other nodes, it is described as a parent node–otherwise it is considered a terminal node, wherein a final decision regarding the class membership of a sample is made by weighing the outcomes of the decisions made previously further up the tree.

The heterogeneity, or disorder, of each decision can be measured using entropy ($-\sum_{k=0}^{1} p_k \log_2(p_k)$). Due to the logarithmic operator, however, this term is computationally expensive. The Gini impurity ($\Delta k = \sum_{k=0}^{1} p_k(1-p_k)$) is more efficient to calculate and closely resembles entropy. In either case, pk is the probability of a given sample being labelled as the kth class based on the characteristic being measured at a particular node.
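The two impurity measures can be compared directly with a short sketch (Python with NumPy; the node proportions are illustrative).

```python
import numpy as np

def entropy(p):
    """Shannon entropy of class proportions at a node: -sum(p_k log2 p_k)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of class proportions at a node: sum(p_k (1 - p_k))."""
    return np.sum(p * (1 - p))

# Impurity of a node containing 8 samples of class 0 and 2 samples of class 1
p = np.array([0.8, 0.2])
print(f"entropy = {entropy(p):.3f} bits, Gini impurity = {gini(p):.3f}")
```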

Further details about decision trees can be found in an article by Zhang et al. (2005).

5.2 Random Forests

A Random Forest (RF) is comprised of a large number of decision trees, T, that vote on the membership of an unknown sample. RFs are frequently used as a variable selection step, since the method largely evades the issue of dimensionality through a number of randomly determined variable subsets voting on the model outcome. RF is also a method that is especially simple to use, with the only critical parameter being the number of decision trees to include. This parameter is easy to optimise, since the number of decision trees usually scales with the relative dimensionality of the data. The functionality of RF can be used to overcome problems related to over-fitting of individual decision trees, through the consensus of a majority of parallel models. RFs are a widely-used tool in data analysis broadly speaking, and have long been used for Quantitative Structure-Activity Relationship (QSAR) modelling (Svetnik et al., 2003) in particular.

Random Forests (RFs) (Breiman, 2001) are difficult to classify as belonging to one of either filter, wrapper, or embedded approaches. RFs are large collections of decision trees used to vote on the class membership of a sample based on its characteristics. The randomly generated variable subsets for each decision tree liken the method to wrapper methods, but since the method performs both classification and feature selection simultaneously (Menze et al., 2009), they are also widely considered to fall under the umbrella of embedded methods. For interpretation of the most significant features, however, a threshold of variable significance is typically employed as part of the analysis, which could also qualify RFs as a filter method. For the purposes of this review, feature selection by RFs will be classified as a “hybrid” method, if only to signify that it is somewhat of an outlier compared with the methods that have been previously discussed.

Similar to PLS-DA and PPA, the importance of each variable in RF can be summarised by its relative importance in the model. In RF, the Gini importance is a measure of how often a variable was used to split the data across multiple decision trees, weighted by its discriminating value. For the jth variable:

$\mathrm{Gini}(j) = \sum_{T}\sum_{\tau}\Delta k(\tau, T)$ (25)

Where Δk indicates the Gini impurity, or the ability of a variable to separate two classes (in a binary example), and Gini(j) is the sum of this discriminatory power summed over all nodes (τ) and trees (T) in the RF model (Menze et al., 2009). The Gini coefficient has been used for feature selection routines for the detection of bovine spongiform encephalopathy via serum (Menze et al., 2007).
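In practice, impurity-based importances closely related to Eq. 25 are available directly from common implementations; the sketch below (Python with scikit-learn; the data and the number of retained variables are illustrative) ranks variables by the normalised mean decrease in impurity reported by the fitted forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(80, 300)); y = np.repeat([0, 1], 40)
X[y == 1, :5] += 1.5

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ is the impurity decrease (cf. Eq. 25) averaged over trees and normalised
importances = rf.feature_importances_
selected = np.argsort(importances)[::-1][:10]   # top-ranked variables for interpretation
```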

Mean decrease in accuracy (MDA) is also sometimes used for variable selection with RF, and describes the difference in prediction accuracy when a considered variable is excluded from the model. This method was used in conjunction with VIP scores using MetaboAnalyst by Azizan et al. (2021) to the end of detecting lard adulteration using fatty acids as analysed by GC-MS.

Although variable impact can be assessed using RFs, the exact manner in which variables correlate with each other is not immediately clear from the RF model. This is due to the fact that both decision trees and RFs are non-linear methods, and as such cannot be summarised by considering the co-linearity of variables directly. With that being said, it is nonetheless possible to interpret the selected features in the context of a linear model (Azizan et al., 2021). Despite its limited application in chemometrics for classification problems, RF has been included in a number of popular data analysis suites, including MetaboAnalyst (Pang et al., 2021) and ChromCompare+ (an analysis suite for GC×GC-TOFMS data), likely owing to the simplicity of its operation for inexperienced users.

5.3 Hybrid Methods

The cluster resolution function has been used in a number of studies as an objective function to select variables via a hybrid backwards elimination/forward selection approach (de la Mata et al., 2017; Nam et al., 2020; Sorochan Armstrong et al., 2022). The algorithm used for these studies, frequently called Feature Selection by Cluster Resolution (FS-CR), proceeds once the variables have been ranked, and appropriate cutoffs determined for backwards elimination to start, and for forward selection to end (as mentioned earlier, the start and stop numbers).

Hybridisation of wrapper and filter methods is common, either for assessing an initial subset of variables on which to perform backwards elimination or forward selection, or for intelligently selecting a number of variables to consider using other subset selection methods (Zhang et al., 2019; Singh and Singh, 2021).

6 Conclusion

Analysts demand the most predictive subset of features using the fewest number of parameters possible. Routines with fewer parameters typically rely more strongly on well-informed methodologies to optimise variable subsets, and offer fewer avenues of recourse should the routine fail to correctly indicate samples external to the model. Routines with more parameters to optimise may be more dependent on a skilled analyst, who may explore a number of avenues to gain better insight into the classification problem. A number of factors must be considered, including the relative dimensionality of the data and/or the total number of features (which may scale poorly regardless of the number of samples depending on what technique is being used), the number of missing elements in the data, the number and severity of outliers present in the dataset, and of course the instrumentation being used. The effect of outliers can be explored using unsupervised methods like PCA or PPA, and the effect of missing data can be observed by projecting previously unconsidered samples into the model to assess potential over-fitting.

The vast majority of feature selection routines that see frequent use in chemometrics return a subset of variables whose linear combination can provide adequate discrimination, assuming that the instrument was operated within its linear range during the data acquisition step. However, despite the fact that linear methods can account for co-linear variables, some data may require non-linear methods for subset determination or evaluation if the variables or combinations thereof are not sufficiently informative. Non-linear methods for variable selection and evaluation such as genetic algorithms or RFs can offer more discriminating power, but subsequent model interpretation may be difficult, and extensive user experimentation may be required for methods based on genetic algorithms to optimise a high number of user-input parameters.

A recent study by Vrábel et al. (2020) examined the result of a contest for a challenging classification problem based on laser-induced breakdown spectroscopy. Each team used various combinations of either linear or non-linear approaches for feature selection and classification, but the winning team was the one that focused on manual data exploration and interpretation to inform a simple classifier using PLS-DA. This study highlights the conventional wisdom that human interpretation and insight is difficult to beat using machines, regardless of the technique used.

The orthodoxy of linear modelling and feature selection in chemometrics may yet be challenged at some point in the future, where feature selection tools based on highly non-linear methods (Upadhyay et al., 2020; Ranjan et al., 2021) are eventually explored using chemical data. Uniform Manifold Approximation and Projection (UMAP) is generating a lot of interest in fields adjacent to chemometrics, in particular the -omics fields, where the underlying biological phenomena may not always be described using the same linear assumptions that are suitable for most chemical analyses (Shen et al., 2020).

Author Contributions

MSA performed all research, and summarisation for the article. AdlM reviewed the article for correctness. JH was responsible for conceptualisation, and also reviewed the article for correctness.

Funding

The authors acknowledge the support provided by the Natural Sciences and Engineering Council of Canada (NSERC), and funding provided by Genome Canada, Genome Alberta, and the Canada Foundation for Innovation that support the Metabolomics Innovation Centre (TMIC).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Adutwum, L. A., de la Mata, A. P., Bean, H. D., Hill, J. E., and Harynuk, J. J. (2017). Estimation of Start and Stop Numbers for Cluster Resolution Feature Selection Algorithm: an Empirical Approach Using Null Distribution Analysis of Fisher Ratios. Anal. Bioanal. Chem. 409, 6699–6708. doi:10.1007/s00216-017-0628-8


Aliakbarzadeh, G., Parastar, H., and Sereshti, H. (2016). Classification of Gas Chromatographic Fingerprints of Saffron Using Partial Least Squares Discriminant Analysis Together with Different Variable Selection Methods. Chemom. Intelligent Laboratory Syst. 158, 165–173. doi:10.1016/j.chemolab.2016.09.002


Amante, E., Salomone, A., Alladio, E., Vincenti, M., Porpiglia, F., and Bro, R. (2019). Untargeted Metabolomic Profile for the Detection of Prostate Carcinoma-Preliminary Results from PARAFAC2 and PLS-DA Models. Molecules 24, 3063. doi:10.3390/molecules24173063


Andersen, C. M., and Bro, R. (2010). Variable Selection in Regression-A Tutorial. J. Chemom. 24, 728–737. doi:10.1002/cem.1360


Armstrong, M. S., de la Mata, A. P., and Harynuk, J. J. (2021). An Efficient and Accurate Numerical Determination of the Cluster Resolution Metric in Two Dimensions. J. Chemom. 35 (7-8), e3346. doi:10.1002/cem.3346


Azizan, N. I., Mokhtar, N. F. K., Arshad, S., Sharin, S. N., Mohamad, N., Mustafa, S., et al. (2021). Detection of Lard Adulteration in Wheat Biscuits Using Chemometrics-Assisted Gcms and Random Forest. Food Anal. Methods 14, 1–12. doi:10.1007/s12161-021-02046-9


Ballabio, D., Skov, T., Leardi, R., and Bro, R. (2008). Classification of Gc-Ms Measurements of Wines by Combining Data Dimension Reduction and Variable Selection Techniques. J. Chemom. 22, 457–463. doi:10.1002/cem.1173


Box, G. E. P. (1953). Non-Normality and Tests on Variances. Biometrika 40, 318–335. doi:10.1093/biomet/40.3-4.318


Breiman, L. (2001). Random Forests. Mach. Learn. 45, 5–32. doi:10.1023/a:1010933404324


Buzzini, P., Curran, J., and Polston, C. (2021). Comparison between Visual Assessments and Different Variants of Linear Discriminant Analysis to the Classification of Raman Patterns of Inkjet Printer Inks. Forensic Chem. 24, 100336. doi:10.1016/j.forc.2021.100336


Centner, V., Massart, D.-L., de Noord, O. E., de Jong, S., Vandeginste, B. M., and Sterna, C. (1996). Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 68, 3851–3858. doi:10.1021/ac960321m


Chen, C., Zhang, Z.-M., Ouyang, M.-L., Liu, X., Yi, L., and Liang, Y.-Z. (2015). Shrunken Centroids Regularized Discriminant Analysis as a Promising Strategy for Metabolomics Data Exploration. J. Chemom. 29, 154–164. doi:10.1002/cem.2685


Chong, I.-G., and Jun, C.-H. (2005). Performance of Some Variable Selection Methods when Multicollinearity Is Present. Chemom. intelligent laboratory Syst. 78, 103–112. doi:10.1016/j.chemolab.2004.12.011


Chun, H., and Keleş, S. (2010). Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 3–25. doi:10.1111/j.1467-9868.2009.00723.x


Clemmensen, L., Hastie, T., Witten, D., and Ersbøll, B. (2011). Sparse Discriminant Analysis. Technometrics 53, 406–413. doi:10.1198/tech.2011.08118

CrossRef Full Text | Google Scholar

Clemmensen, L., and Kuhn, M. M. (2016). Package ‘sparselda’. CRAN Repository.

Google Scholar

Cocchi, M., Biancolillo, A., and Marini, F. (2018). “Data Analysis for Omic Sciences: Methods and Applicationsof Comprehensive Analytical Chemistry,” in Chap. Chemometric Methods for Classification and Feature Selection. Editors J. Jaumot,, and R. Tauler (Elsevier), 82, 265–299. doi:10.1016/bs.coac.2018.08.006

CrossRef Full Text | Google Scholar

Coon, A. M., Beyramysoltan, S., and Musah, R. A. (2019). A Chemometric Strategy for Forensic Analysis of Condom Residues: Identification and Marker Profiling of Condom Brands from Direct Analysis in Real Time-High Resolution Mass Spectrometric Chemical Signatures. Talanta 194, 563–575. doi:10.1016/j.talanta.2018.09.101

PubMed Abstract | CrossRef Full Text | Google Scholar

Crammer, K., and Singer, Y. (2001). On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines. J. Mach. Learn. Res. 2, 265–292.

de Andrade, B. M., de Gois, J. S., Xavier, V. L., and Luna, A. S. (2020). Comparison of the Performance of Multiclass Classifiers in Chemical Data: Addressing the Problem of Overfitting with the Permutation Test. Chemom. Intelligent Laboratory Syst. 201, 104013. doi:10.1016/j.chemolab.2020.104013

de la Mata, A. P., McQueen, R. H., Nam, S. L., and Harynuk, J. J. (2017). Comprehensive Two-Dimensional Gas Chromatographic Profiling and Chemometric Interpretation of the Volatile Profiles of Sweat in Knit Fabrics. Anal. Bioanal. Chem. 409, 1905–1913. doi:10.1007/s00216-016-0137-1

Dettmer, K., Aronov, P. A., and Hammock, B. D. (2007). Mass Spectrometry-Based Metabolomics. Mass Spectrom. Rev. 26, 51–78. doi:10.1002/mas.20108

Driscoll, S. P., MacMillan, Y. S., and Wentzell, P. D. (2019). Sparse Projection Pursuit Analysis: an Alternative for Exploring Multivariate Chemical Data. Anal. Chem. 92, 1755–1762. doi:10.1021/acs.analchem.9b03166

Farrés, M., Platikanov, S., Tsakovski, S., and Tauler, R. (2015). Comparison of the Variable Importance in Projection (VIP) and of the Selectivity Ratio (SR) Methods for Variable Selection and Interpretation. J. Chemom. 29, 528–536. doi:10.1002/cem.2736

Filzmoser, P., Gschwandtner, M., and Todorov, V. (2012). Review of Sparse Methods in Regression and Classification with Application to Chemometrics. J. Chemom. 26, 42–51. doi:10.1002/cem.1418

Ghisoni, S., Lucini, L., Rocchetti, G., Chiodelli, G., Farinelli, D., Tombesi, S., et al. (2020). Untargeted Metabolomics with Multivariate Analysis to Discriminate Hazelnut (Corylus avellana L.) Cultivars and Their Geographical Origin. J. Sci. Food Agric. 100, 500–508. doi:10.1002/jsfa.9998

Hopke, P. K. (2003). The Evolution of Chemometrics. Anal. Chim. Acta 500, 365–377. doi:10.1016/s0003-2670(03)00944-9

Hou, S., and Wentzell, P. D. (2014). Regularized Projection Pursuit for Data with a Small Sample-To-Variable Ratio. Metabolomics 10, 589–606. doi:10.1007/s11306-013-0612-z

Hou, S., and Wentzell, P. (2011). Fast and Simple Methods for the Optimization of Kurtosis Used as a Projection Pursuit Index. Anal. Chim. Acta 704, 1–15. doi:10.1016/j.aca.2011.08.006

Johnson, K. J., and Synovec, R. E. (2002). Pattern Recognition of Jet Fuels: Comprehensive GC×GC with ANOVA-Based Feature Selection and Principal Component Analysis. Chemom. Intelligent Laboratory Syst. 60, 225–237. doi:10.1016/s0169-7439(01)00198-8

Kalogiouri, N. P., Manousi, N., Rosenberg, E., Zachariadis, G. A., Paraskevopoulou, A., and Samanidou, V. (2021). Exploring the Volatile Metabolome of Conventional and Organic Walnut Oils by Solid-Phase Microextraction and Analysis by GC-MS Combined with Chemometrics. Food Chem. 363, 130331. doi:10.1016/j.foodchem.2021.130331

Kohavi, R., and John, G. H. (1997). Wrappers for Feature Subset Selection. Artif. Intell. 97, 273–324. doi:10.1016/s0004-3702(97)00043-x

Kvalheim, O. M. (2020). Variable Importance: Comparison of Selectivity Ratio and Significance Multivariate Correlation for Interpretation of Latent-Variable Regression Models. J. Chemom. 34, e3211. doi:10.1002/cem.3211

Lavine, B. K., Brzozowski, D., Moores, A. J., Davidson, C., and Mayfield, H. T. (2001). Genetic Algorithm for Fuel Spill Identification. Anal. Chim. Acta 437, 233–246. doi:10.1016/s0003-2670(01)00946-1

Lavine, B. K., Nuguru, K., and Mirjankar, N. (2011). One Stop Shopping: Feature Selection, Classification and Prediction in a Single Step. J. Chemom. 25, 116–129. doi:10.1002/cem.1358

Lavine, B. K., Ritter, J., Moores, A. J., Wilson, M., Faruque, A., and Mayfield, H. T. (2000). Source Identification of Underground Fuel Spills by Solid-Phase Microextraction/High-Resolution Gas Chromatography/Genetic Algorithms. Anal. Chem. 72, 423–431. doi:10.1021/ac9904967

Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse PLS Discriminant Analysis: Biologically Relevant Feature Selection and Graphical Displays for Multiclass Problems. BMC Bioinforma. 12, 1–17. doi:10.1186/1471-2105-12-253

Lu, H., Liang, Y., Dunn, W. B., Shen, H., and Kell, D. B. (2008). Comparative Evaluation of Software for Deconvolution of Metabolomics Data Based on GC-TOF-MS. TrAC Trends Anal. Chem. 27, 215–227. doi:10.1016/j.trac.2007.11.004

Maddala, G. S., and Lahiri, K. (1992). Introduction to Econometrics. 2nd Edn. New York: Macmillan.

Marney, L. C., Siegler, W. C., Parsons, B. A., Hoggard, J. C., Wright, B. W., and Synovec, R. E. (2013). Tile-Based Fisher-Ratio Software for Improved Feature Selection Analysis of Comprehensive Two-Dimensional Gas Chromatography–Time-Of-Flight Mass Spectrometry Data. Talanta 115, 887–895. doi:10.1016/j.talanta.2013.06.038

Mehmood, T., Liland, K. H., Snipen, L., and Sæbø, S. (2012). A Review of Variable Selection Methods in Partial Least Squares Regression. Chemom. Intelligent Laboratory Syst. 118, 62–69. doi:10.1016/j.chemolab.2012.07.010

Mehmood, T., Sæbø, S., and Liland, K. H. (2020). Comparison of Variable Selection Methods in Partial Least Squares Regression. J. Chemom. 34, e3226. doi:10.1002/cem.3226

Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., et al. (2009). A Comparison of Random Forest and its Gini Importance with Standard Chemometric Methods for the Feature Selection and Classification of Spectral Data. BMC Bioinforma. 10, 1–16. doi:10.1186/1471-2105-10-213

Menze, B. H., Petrich, W., and Hamprecht, F. A. (2007). Multivariate Feature Selection and Hierarchical Classification for Infrared Spectroscopy: Serum-Based Detection of Bovine Spongiform Encephalopathy. Anal. Bioanal. Chem. 387, 1801–1807. doi:10.1007/s00216-006-1070-5

Nam, S. L., de la Mata, A. P., Dias, R. P., and Harynuk, J. J. (2020). Towards Standardization of Data Normalization Strategies to Improve Urinary Metabolomics Studies by GC×GC-TOFMS. Metabolites 10, 376. doi:10.3390/metabo10090376

Nørgaard, L., Bro, R., Westad, F., and Engelsen, S. B. (2006). A Modification of Canonical Variates Analysis to Handle Highly Collinear Multivariate Data. J. Chemom. 20, 425–435.

Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., and Engelsen, S. B. (2000). Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl. Spectrosc. 54, 413–419.

Orlhac, F., Mattei, P.-A., Bouveyron, C., and Ayache, N. (2019). Class-Specific Variable Selection in High-Dimensional Discriminant Analysis through Bayesian Sparsity. J. Chemom. 33, e3097. doi:10.1002/cem.3097

Pang, Z., Chong, J., Zhou, G., de Lima Morais, D. A., Chang, L., Barrette, M., et al. (2021). MetaboAnalyst 5.0: Narrowing the Gap between Raw Spectra and Functional Insights. Nucleic Acids Res. 49 (W1), W388–W396. doi:10.1093/nar/gkab382

Peris-Díaz, M. D., Łydżba-Kopczyńska, B., and Sentandreu, E. (2018). Raman Spectroscopy Coupled to Chemometrics to Discriminate Provenance and Geological Age of Amber. J. Raman Spectrosc. 49, 842–851. doi:10.1002/jrs.5357

Pesesse, R., Stefanuto, P.-H., Schleich, F., Louis, R., and Focant, J.-F. (2019). Multimodal Chemometric Approach for the Analysis of Human Exhaled Breath in Lung Cancer Patients by TD-GC×GC-TOFMS. J. Chromatogr. B 1114, 146–153. doi:10.1016/j.jchromb.2019.01.029

Pierce, K. M., Hoggard, J. C., Hope, J. L., Rainey, P. M., Hoofnagle, A. N., Jack, R. M., et al. (2006). Fisher Ratio Method Applied to Third-Order Separation Data to Identify Significant Chemical Components of Metabolite Extracts. Anal. Chem. 78, 5068–5075. doi:10.1021/ac0602625

Questier, F., Put, R., Coomans, D., Walczak, B., and Vander Heyden, Y. (2005). The Use of CART and Multivariate Regression Trees for Supervised and Unsupervised Feature Selection. Chemom. Intelligent Laboratory Syst. 76, 45–54. doi:10.1016/j.chemolab.2004.09.003

Rajalahti, T., Arneberg, R., Berven, F. S., Myhr, K.-M., Ulvik, R. J., and Kvalheim, O. M. (2009). Biomarker Discovery in Mass Spectral Profiles by Means of Selectivity Ratio Plot. Chemom. Intelligent Laboratory Syst. 95, 35–48. doi:10.1016/j.chemolab.2008.08.004

Ranjan, B., Sun, W., Park, J., Mishra, K., Schmidt, F., Xie, R., et al. (2021). DUBStepR Is a Scalable Correlation-Based Feature Selection Method for Accurately Clustering Single-Cell Data. Nat. Commun. 12, 1–12. doi:10.1038/s41467-021-26085-2

Rich, D. C., Livingston, K. M., and Morgan, S. L. (2020). Evaluating Performance of LASSO Relative to PCA and LDA to Classify Dyes on Fibers. Forensic Chem. 18, 100213. doi:10.1016/j.forc.2020.100213

Rinnan, Å., Andersson, M., Ridder, C., and Engelsen, S. B. (2014). Recursive Weighted Partial Least Squares (rPLS): an Efficient Variable Selection Method Using PLS. J. Chemom. 28, 439–447. doi:10.1002/cem.2582

Rohart, F., Gautier, B., Singh, A., and Lê Cao, K.-A. (2017). mixOmics: An R Package for 'omics Feature Selection and Multiple Data Integration. PLoS Comput. Biol. 13, e1005752. doi:10.1371/journal.pcbi.1005752

Santosa, F., and Symes, W. W. (1986). Linear Inversion of Band-Limited Reflection Seismograms. SIAM J. Sci. Stat. Comput. 7, 1307–1330. doi:10.1137/0907087

Sorochan Armstrong, M. D., Arredondo Campos, O. R., Bannon, C. C., de la Mata, A. P., Case, R. J., and Harynuk, J. J. (2022). Global Metabolome Analysis of Dunaliella tertiolecta, Phaeobacter italicus R11 Co-cultures Using Thermal Desorption - Comprehensive Two-Dimensional Gas Chromatography - Time-Of-Flight Mass Spectrometry (TD-GC×GC-TOFMS). Phytochemistry 195, 113052. doi:10.1016/j.phytochem.2021.113052

Seijo, S., Lozano, J. J., Alonso, C., Reverter, E., Miquel, R., Abraldes, J. G., et al. (2013). Metabolomics Discloses Potential Biomarkers for the Noninvasive Diagnosis of Idiopathic Portal Hypertension. Official J. Am. Coll. Gastroenterology (ACG) 108, 926–932. doi:10.1038/ajg.2013.11

Shen, B., Yi, X., Sun, Y., Bi, X., Du, J., Zhang, C., et al. (2020). Proteomic and Metabolomic Characterization of COVID-19 Patient Sera. Cell 182, 59–72. doi:10.1016/j.cell.2020.05.032

Shen, H., and Huang, J. Z. (2008). Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation. J. Multivar. Analysis 99, 1015–1034. doi:10.1016/j.jmva.2007.06.007

Sinclair, E., Trivedi, D. K., Sarkar, D., Walton-Doyle, C., Milne, J., Kunath, T., et al. (2021). Metabolomics of Sebum Reveals Lipid Dysregulation in Parkinson's Disease. Nat. Commun. 12, 1–9. doi:10.1038/s41467-021-21669-4

Singh, N., and Singh, P. (2021). A Hybrid Ensemble-Filter Wrapper Feature Selection Approach for Medical Data Classification. Chemom. Intelligent Laboratory Syst. 217, 104396. doi:10.1016/j.chemolab.2021.104396

Sinkov, N. A., and Harynuk, J. J. (2011). Cluster Resolution: A Metric for Automated, Objective and Optimized Feature Selection in Chemometric Modeling. Talanta 83, 1079–1087. doi:10.1016/j.talanta.2010.10.025

Sinkov, N. A., Johnston, B. M., Sandercock, P. M. L., and Harynuk, J. J. (2011). Automated Optimization and Construction of Chemometric Models Based on Highly Variable Raw Chromatographic Data. Anal. Chim. Acta 697, 8–15. doi:10.1016/j.aca.2011.04.029

Stoessel, D., Stellmann, J.-P., Willing, A., Behrens, B., Rosenkranz, S. C., Hodecker, S. C., et al. (2018). Metabolomic Profiles for Primary Progressive Multiple Sclerosis Stratification and Disease Course Monitoring. Front. Hum. Neurosci. 12, 226. doi:10.3389/fnhum.2018.00226

Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., and Feuston, B. P. (2003). Random Forest: a Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958. doi:10.1021/ci034160g

Talukdar, U., Hazarika, S. M., and Gan, J. Q. (2018). A Kernel Partial Least Square Based Feature Selection Method. Pattern Recognit. 83, 91–106. doi:10.1016/j.patcog.2018.05.012

Theodoridis, S. (2020). "Chapter 7 - Classification: a Tour of the Classics," in Machine Learning. 2nd Edn. Editor S. Theodoridis (Academic Press), 301–350. doi:10.1016/B978-0-12-818803-3.00016-7

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x

Tibshirani, R. J. (2013). The Lasso Problem and Uniqueness. Electron. J. Stat. 7, 1456–1490. doi:10.1214/13-ejs815

Tran, T. N., Afanador, N. L., Buydens, L. M., and Blanchet, L. (2014). Interpretation of Variable Importance in Partial Least Squares with Significance Multivariate Correlation (sMC). Chemom. Intelligent Laboratory Syst. 138, 153–160. doi:10.1016/j.chemolab.2014.08.005

Trendafilov, N. T., and Jolliffe, I. T. (2007). DALASS: Variable Selection in Discriminant Analysis via the LASSO. Comput. Statistics Data Analysis 51, 3718–3736. doi:10.1016/j.csda.2006.12.046

Upadhyay, D., Manero, J., Zaman, M., and Sampalli, S. (2020). Gradient Boosting Feature Selection with Machine Learning Classifiers for Intrusion Detection on Power Grids. IEEE Trans. Netw. Serv. Manag. 18, 1104–1116.

Van der Maaten, L., and Hinton, G. (2008). Visualizing Data Using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.

Vrábel, J., Képeš, E., Duponchel, L., Motto-Ros, V., Fabre, C., Connemann, S., et al. (2020). Classification of Challenging Laser-Induced Breakdown Spectroscopy Soil Sample Data - EMSLIBS Contest. Spectrochim. Acta Part B At. Spectrosc. 169, 105872. doi:10.1016/j.sab.2020.105872

Wang, R., and Tang, K. (2009). "Feature Selection for Maximizing the Area under the ROC Curve," in 2009 IEEE International Conference on Data Mining Workshops (IEEE), 400–405. doi:10.1109/icdmw.2009.25

Wentzell, P. D., Giglio, C., and Kompany-Zareh, M. (2021a). Beyond Principal Components: a Critical Comparison of Factor Analysis Methods for Subspace Modelling in Chemistry. Anal. Methods 13, 4188–4219. doi:10.1039/d1ay01124c

Wentzell, P. D., Gonçalves, T. R., Matsushita, M., and Valderrama, P. (2021b). Combinatorial Projection Pursuit Analysis for Exploring Multivariate Chemical Data. Anal. Chim. Acta, 338716. doi:10.1016/j.aca.2021.338716

Witten, D. M., and Tibshirani, R. (2011). Penalized Classification Using Fisher's Linear Discriminant. J. R. Stat. Soc. Ser. B Stat. Methodol. 73, 753–772. doi:10.1111/j.1467-9868.2011.00783.x

Witten, D., and Witten, M. D. (2015). Package 'penalizedLDA': Penalized Classification Using Fisher's Linear Discriminant. Available at: https://cran.r-project.org/web/packages/penalizedLDA/penalizedLDA.pdf (accessed April 19, 2021).

Wold, S., Johansson, E., and Cocchi, M. (1993). 3D QSAR in Drug Design: Theory, Methods and Applications. Leiden, Holland: ESCOM, 523–550.

Wold, S., Sjöström, M., and Eriksson, L. (2001). PLS-Regression: a Basic Tool of Chemometrics. Chemom. Intelligent Laboratory Syst. 58, 109–130. doi:10.1016/s0169-7439(01)00155-1

Yang, S., Li, C., Mei, Y., Liu, W., Liu, R., Chen, W., et al. (2021). Discrimination of Corn Variety Using Terahertz Spectroscopy Combined with Chemometrics Methods. Spectrochimica Acta Part A Mol. Biomol. Spectrosc. 252, 119475. doi:10.1016/j.saa.2021.119475

Yendle, P. W., and MacFie, H. J. (1989). Discriminant Principal Components Analysis. J. Chemom. 3, 589–600. doi:10.1002/cem.1180030407

Zhang, J., Xiong, Y., and Min, S. (2019). A New Hybrid Filter/Wrapper Algorithm for Feature Selection in Classification. Anal. Chim. Acta 1080, 43–54. doi:10.1016/j.aca.2019.06.054

Zhang, M., Xu, Q., Daeyaert, F., Lewi, P., and Massart, D. (2005). Application of Boosting to Classification Problems in Chemometrics. Anal. Chim. Acta 544, 167–176. doi:10.1016/j.aca.2005.01.075

Zhu, Y., and Tan, T. L. (2016). Penalized Discriminant Analysis for the Detection of Wild-Grown and Cultivated Ganoderma lucidum Using Fourier Transform Infrared Spectroscopy. Spectrochimica Acta Part A Mol. Biomol. Spectrosc. 159, 68–77. doi:10.1016/j.saa.2016.01.018

Keywords: chemometrics, classification, metabolomics, mass spectrometry, spectroscopy, chromatography, machine learning, variable/feature selection

Citation: Sorochan Armstrong MD, de la Mata AP and Harynuk JJ (2022) Review of Variable Selection Methods for Discriminant-Type Problems in Chemometrics. Front. Anal. Sci. 2:867938. doi: 10.3389/frans.2022.867938

Received: 01 February 2022; Accepted: 21 April 2022;
Published: 19 May 2022.

Edited by:

Raffaele Vitale, Université de Lille, France

Reviewed by:

Federico Marini, Sapienza University of Rome, Italy
Rosalba Calvini, University of Modena and Reggio Emilia, Italy

Copyright © 2022 Sorochan Armstrong, de la Mata and Harynuk. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: James J. Harynuk, james.harynuk@ualberta.ca
