Cauchy non-convex sparse feature selection method for the high-dimensional small-sample problem in motor imagery EEG decoding

Introduction: The time, frequency, and space information of electroencephalogram (EEG) signals is crucial for motor imagery decoding. However, these temporal-frequency-spatial features are high-dimensional small-sample data, which poses significant challenges for motor imagery decoding. Sparse regularization is an effective method for addressing this issue. However, the sparse regularization models most commonly employed in motor imagery decoding, such as the least absolute shrinkage and selection operator (LASSO), are biased estimation methods and lead to the loss of target feature information. Methods: In this paper, we propose a non-convex sparse regularization model that employs the Cauchy function. By designing a proximal gradient algorithm, our proposed model achieves closer-to-unbiased estimation than existing sparse models. Therefore, it can learn more accurate, discriminative, and effective feature information. Additionally, the proposed method can perform feature selection and classification simultaneously, without requiring additional classifiers. Results: We conducted experiments on two publicly available motor imagery EEG datasets. The proposed method achieved average classification accuracies of 82.98% and 64.45% in the subject-dependent and subject-independent decoding assessments, respectively. Conclusion: The experimental results show that the proposed method can significantly improve the performance of motor imagery decoding, with better classification performance than existing feature selection and deep learning methods. Furthermore, the proposed model shows better generalization capability, with parameter consistency over different datasets and robust classification across different training sample sizes. Compared with existing sparse regularization methods, the proposed method converges faster and requires less model training time.


Introduction
Motor imagery-based brain-computer interface (BCI) systems have been widely applied in stroke rehabilitation, neuroprosthetics, and robot control (Liao et al., 2023). However, motor imagery electroencephalogram (EEG) signals are spontaneous, with poor signal quality and large individual differences, resulting in low accuracy and poor stability of motor imagery decoding (Zhang et al., 2021). Currently, motor imagery decoding is still a big challenge.
The time, frequency, and space information of EEG signals is crucial for motor imagery decoding (Zheng et al., 2022). Therefore, temporal-frequency-spatial feature-based motor imagery decoding has been widely studied (Chen et al., 2023). In the process of temporal-frequency-spatial feature extraction, the original EEG signals are first decomposed into multiple time-frequency units, then the common spatial pattern (CSP) algorithm is used to extract the spatial features on each time-frequency unit, and finally, the spatial features of multiple time-frequency units are cascaded into a feature vector (Miao et al., 2021), which significantly increases the feature dimension of EEG. The number of feature dimensions exceeds one hundred or even several hundred, while feature redundancy exists. In addition, due to the difficulty and high cost of collecting EEG samples, especially for patients, the sample size is generally relatively small, usually only a few dozen. Therefore, the temporal-frequency-spatial feature is high-dimensional small-sample data, which will bring a series of problems to the EEG classification model, such as the problem of overfitting and model solution underdetermination (Chadebec et al., 2022).
For high-dimensional small-sample problems, feature selection is an effective method (Chen et al., 2023), which can remove redundant information, reduce the feature dimension, simplify the model complexity, and effectively solve many problems (Shen and Zhang, 2022). Sparse regularization-based feature selection methods are commonly used in motor imagery decoding, such as the least absolute shrinkage and selection operator (LASSO; Zhang et al., 2022), group LASSO (gLASSO; Zhang et al., 2020), and sparse group LASSO (sgLASSO; Jiao et al., 2018). These methods are all convex sparse regularization models, which penalize the regression coefficients of the model by the $l_1$ norm so that regression coefficients with small absolute values are automatically compressed to zero, thus generating sparse solutions and achieving feature selection. However, the $l_1$ norm leads to biased estimation because it penalizes all components of the regression coefficients to the same extent. In addition to compressing the regression coefficients corresponding to the noisy features to zero, a certain degree of compression is applied to the target features, resulting in a biased estimation of the target features. Therefore, a biased estimation model applied to feature selection will result in the loss of useful information and degrade the classification performance.
Non-convex sparse regularization models penalize the regression coefficients to different degrees depending on their values; they are approximately unbiased estimation models and have stronger noise suppression and sparsity induction capabilities (Wang et al., 2018). The commonly used non-convex regularization models, such as smoothly clipped absolute deviation (SCAD; Chopra and Lian, 2010) and minimax concave penalty (MCP; You et al., 2019) models, have been widely used in the fields of image restoration and image denoising, and their effect is remarkable. SCAD and MCP models penalize the regression coefficients in chunks, reducing the compression of the regression coefficients corresponding to the target features and alleviating the biased estimation problem of the $l_1$ norm to some extent (Wen et al., 2018). However, the SCAD and MCP models may still compress the regression coefficients of a portion of the target features. Therefore, there is still a need to explore new non-convex sparse regularization methods to better address the biased estimation problem and learn more accurate, discriminative, and effective feature information.
In addition, many deep learning methods for temporal-frequency-spatial feature learning have been proposed, which are mostly inspired by the FBCSP approach (Ang et al., 2008) in a machine learning framework, using convolutional neural network (CNN) for frequency band filtering followed by spatial filtering (Zancanaro et al., 2021). Earlier and more classical works include ConvNets (Schirrmeister et al., 2017) and EEGNet (Lawhern et al., 2018). There are also works that use traditional band-pass filtering banks to decompose the raw EEG signal into multiple frequency subbands and then use CNN to learn spatial domain or time domain information, such as FBCNet (Mane et al., 2020), FBMSNet (Liu et al., 2022), and the literature (Sakhavi et al., 2018; Kwon et al., 2019; Dai et al., 2020). Subsequent work uses multiscale convolution to learn frequency domain information in parallel and then learns either spatial domain or time domain information at each branch, such as MSFBCNN (Wu et al., 2019), MMCNN (Jia et al., 2020), and the literature (Chang et al., 2022; Li et al., 2023). Deep learning methods have a strong representation learning capability but require a large number of data samples (Autthasan et al., 2021). Although deep learning has been widely used in motor imagery decoding, feature selection is integrated into the overall network structure, and the theoretical support and physiological interpretability are relatively poor. Furthermore, the model training is time-consuming.
A new non-convex sparse regularization model is proposed to deal with high-dimensional small-sample problems for motor imagery decoding in this paper, which can learn more accurate, discriminative, and effective temporal-frequency-spatial features. Specifically, we propose a non-convex sparse regularization model based on the Cauchy function and design an effective solution algorithm based on the proximal gradient. The proposed model penalizes the weight coefficients of each feature independently with a better ability to induce sparsity while avoiding the compression of the weight coefficients of the target features to zero during noise suppression, which achieves approximately unbiased estimation. We conducted experiments on two publicly available motor imagery EEG datasets to fully and adequately validate the effectiveness of the proposed model, using subject-dependent and subject-independent assessment methods.
The main contributions and innovations of this paper are summarized below.

Data description
The proposed method was validated using two publicly available motor imagery EEG datasets: Dataset 1 from the international BCI competition and Dataset 2 (no. 002-2014) from the database of the BNCI Horizon 2020 project. The basic information of both datasets is shown in Table 1; other detailed information can be found on the official website.
For Dataset 1, we only study binary classification problems, so the four types of tasks are combined pairwise to obtain six binary classification problems (Zhang et al., 2022), namely, L vs. R, L vs. F, L vs. T, R vs. F, R vs. T, and F vs. T, where L vs. R denotes the left-hand and right-hand motor imagery tasks; the rest can be deduced accordingly. Note that, for each binary classification task, the training set and the test set each contain 144 samples. For Dataset 2, the original data are downsampled to 256 Hz in this paper.

Methods
This section first introduces the temporal-frequency-spatial feature extraction method, then describes the EEG decoding framework based on temporal-frequency-spatial features, and finally proposes the non-convex sparse regularization model based on the Cauchy function.

Temporal-frequency-spatial feature extraction
As shown in Figure 1, the temporal-frequency-spatial feature extraction mainly consists of three steps. First, time window segmentation. A sliding time window is used to intercept the original EEG signal to obtain 5 time windows with a length of 2 s and a step of 0.5 s, i.e., 0-2 s, 0.5-2.5 s, …, 2-4 s. Second, band-pass filtering. Each time window is filtered with a band-pass filter bank to obtain 17 sub-bands with a width of 4 Hz and an overlap of 2 Hz, i.e., 4-8 Hz, 6-10 Hz, …, 36-40 Hz; a 6th-order Butterworth filter is used. After the above signal processing, the original EEG signal is divided into a total of 85 time-frequency units. Third, feature extraction. For each time-frequency unit, the CSP method is used to extract the spatial features separately, thus obtaining multiple groups of temporal-frequency-spatial features containing rich EEG information. In this paper, the number of spatial filter pairs for CSP is set to 1 (Blankertz et al., 2008; Lotte and Guan, 2011), i.e., each time-frequency unit contains two spatial features.
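The bookkeeping behind the 85 time-frequency units can be checked with a short sketch. The Python below only enumerates the windows and sub-bands described above; it does not filter signals or compute CSP, and the function names `time_windows` and `sub_bands` are illustrative assumptions rather than names from the paper.

```python
def time_windows(start=0.0, stop=4.0, length=2.0, step=0.5):
    """Sliding windows of `length` seconds, moved forward by `step` seconds."""
    windows, t = [], start
    while t + length <= stop + 1e-9:
        windows.append((t, t + length))
        t += step
    return windows


def sub_bands(low=4, high=40, width=4, step=2):
    """Band-pass sub-bands of `width` Hz whose lower edges are `step` Hz apart."""
    return [(f, f + width) for f in range(low, high - width + 1, step)]


windows = time_windows()          # (0, 2), (0.5, 2.5), ..., (2, 4)
bands = sub_bands()               # (4, 8), (6, 10), ..., (36, 40)
units = [(w, b) for w in windows for b in bands]

# One CSP filter pair per unit gives two spatial features per unit.
n_features = 2 * len(units)
print(len(windows), len(bands), len(units), n_features)
```

Under these settings the cascaded feature vector has 2 × 85 = 170 dimensions per trial, which is consistent with the statement in the Introduction that the feature dimension exceeds one hundred.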

EEG decoding framework
The EEG decoding framework is shown in Figure 2, where each group of temporal-frequency-spatial features is cascaded by the time window and frequency band to obtain a feature vector. One motor imagery task corresponds to one feature vector, and feature vectors from multiple tasks will form a sample matrix, each row of which is a sample and each column is a one-dimensional feature. Feature selection and classification are performed sequentially on the sample matrix. Filtered and wrapped methods need to be configured with additional classifiers, and the embedded methods can perform feature selection and classification simultaneously.

Cauchy non-convex sparse regularization model
The general mathematical model of the embedded feature selection method based on sparse regularization is as follows:

$$\min_{w} \; \frac{1}{2}\left\| y - Xw \right\|_2^2 + \lambda P(w) \tag{1}$$

where $y \in \mathbb{R}^{N}$ is the label vector, $X \in \mathbb{R}^{N \times P}$ is the sample matrix, $w \in \mathbb{R}^{P}$ is the coefficient vector, $N$ is the total number of samples, $P$ is the feature dimension of one sample, $\|\cdot\|_2^2$ denotes the square of the $l_2$ norm, and $\lambda > 0$ denotes the regularization parameter. The first term of Eq. (1) is the data fidelity term and the second term is the penalty term. $P(w)$ is a function of the coefficient vector $w$, which penalizes or constrains $w$. During model training, some regression coefficients in $w$ are compressed to zero by $P(w)$. The features corresponding to a coefficient of zero will not work in the model fit. Therefore, sparse regularization models can simultaneously achieve feature selection and classification. When $P(w)$ takes a different penalty function, the model will obtain solutions with different structures. LASSO is a biased convex sparse model, and its specific mathematical model is as follows (Zhang et al., 2022):

$$\min_{w} \; \frac{1}{2}\left\| y - Xw \right\|_2^2 + \lambda \left\| w \right\|_1 \tag{2}$$

where $P(w) = \|w\|_1 = \sum_{i} |w_i|$ and $|w_i|$ denotes the absolute value of $w_i$. The LASSO model penalizes all components of the regression coefficients to the same extent, which results in a biased estimation. Similarly, gLASSO (Zhang et al., 2020) and sgLASSO (Jiao et al., 2018) are also biased convex sparse models.
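The bias described above can be seen in the closed-form update that the $l_1$ penalty induces on a single coefficient. The following minimal Python sketch uses the standard soft-thresholding operator; the function name `soft_threshold` and the numeric values are illustrative assumptions, not values from the paper.

```python
def soft_threshold(v, lam):
    """Proximity operator of lam * |w| evaluated at v (soft-thresholding)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


# A small, noise-like coefficient is set exactly to zero ...
small = soft_threshold(0.3, 0.5)
# ... but a large, target-like coefficient is also shrunk by the full lam.
large = soft_threshold(4.0, 0.5)
print(small, large)
```

Every coefficient that survives is moved toward zero by the same amount `lam`, regardless of its size, which is the source of the biased estimation of the target features.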

Existing non-convex sparse regularization models
Non-convex sparse regularization models are approximately unbiased estimation models, which have stronger noise suppression and sparsity induction capabilities than convex sparse regularization (Wang et al., 2018). In the following, we provide a detailed introduction to two existing non-convex sparse regularization methods, namely SCAD and MCP.
SCAD is an approximately unbiased non-convex sparse model (Chopra and Lian, 2010). In its objective function, λ > 0 is the regularization parameter and γ > 2; γ is set to 3 in this paper. SCAD compresses each weight coefficient to different degrees. The compression of the weight coefficients corresponding to the noise variables (absolute values less than λ) has the same effect as the LASSO model, which tends to compress this part of the weight coefficients to zero; the compression of the weight coefficients corresponding to the target variables is gradually reduced. Since SCAD reduces or even avoids the compression of the weight coefficients corresponding to the target variables, it effectively overcomes the biased estimation of LASSO and improves its parameter estimation consistency and variable selection consistency.
MCP is also an approximately unbiased non-convex sparse model (You et al., 2019). In its objective function, γ > 1; γ is set to 2 in this paper. Similar to SCAD, MCP also compresses each weight coefficient to different degrees. MCP compresses the weight coefficients corresponding to the noise variables (absolute values less than γλ), while it does not compress the weight coefficients corresponding to the target variables (absolute values greater than γλ). Thus, MCP also achieves approximately unbiased estimation.
The SCAD and MCP models penalize the regression coefficients in chunks, which mitigates the biased estimation problem to some extent, but still inaccurately compresses a portion of the target features.
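For reference, the piecewise ("chunked") penalties can be written out explicitly. The forms below are the standard SCAD and MCP penalties as commonly stated in the statistics literature, applied to a single coefficient $w_i$; the subscripted names $P_{\mathrm{SCAD}}$ and $P_{\mathrm{MCP}}$ are notation assumed here for illustration.

$$P_{\mathrm{SCAD}}(w_i) = \begin{cases} \lambda |w_i|, & |w_i| \le \lambda \\ \dfrac{2\gamma\lambda |w_i| - w_i^2 - \lambda^2}{2(\gamma - 1)}, & \lambda < |w_i| \le \gamma\lambda \\ \dfrac{(\gamma + 1)\lambda^2}{2}, & |w_i| > \gamma\lambda \end{cases}$$

$$P_{\mathrm{MCP}}(w_i) = \begin{cases} \lambda |w_i| - \dfrac{w_i^2}{2\gamma}, & |w_i| \le \gamma\lambda \\ \dfrac{\gamma\lambda^2}{2}, & |w_i| > \gamma\lambda \end{cases}$$

Both penalties are constant for $|w_i| > \gamma\lambda$, so their derivative vanishes there and large coefficients are not shrunk. Below that threshold the penalty still has a non-zero slope, which is where residual compression of target features can occur.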

The proposed Cauchy regularization models
To better solve the biased estimation problem in temporal-frequency-spatial feature selection, we propose a non-convex sparse regularization model based on the Cauchy function. The Cauchy function has a single parameter γ, with γ ≥ 0. In this paper, the Cauchy function is used as the penalty term $P(w)$ to construct a new non-convex regularized feature selection model, in which $\|\cdot\|_1^2$ denotes the square of the $l_1$ norm. This concave log function imposes an uneven penalty on all regression coefficients (Zhang et al., 2020). It allows a larger penalty to be imposed on small-valued elements than on larger-valued elements, a property that makes the log function closer to unbiased estimation than the SCAD and MCP models. Also, the Cauchy model has a better ability to induce sparsity than the $l_1$ norm (Zhang et al., 2020).
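The uneven penalty can be made concrete with a short worked derivative. The expression below assumes the penalty takes the form used by Karakuş et al. (2020), namely the negative logarithm of the Cauchy function; this form is an assumption about the model rather than a quotation of it.

$$P(w) = -\sum_{i} \log\!\left(\frac{\gamma}{\gamma^2 + w_i^2}\right), \qquad \frac{\partial P}{\partial w_i} = \frac{2 w_i}{\gamma^2 + w_i^2}$$

Under this assumption, the slope of the penalty, which sets the strength of the shrinkage, reaches its maximum $1/\gamma$ at $|w_i| = \gamma$ and decays like $2/|w_i|$ for $|w_i| \gg \gamma$. Large coefficients are therefore shrunk progressively less, whereas the $l_1$ penalty has the same slope for every non-zero coefficient.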
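In this paper, the Cauchy non-convex regularized model is solved in two parts that are iterated cyclically until convergence. The specific procedure is as follows:
1. Gradient solution. A gradient step on the differentiable term of the model gives an intermediate point $v^{\tau}$ at the $\tau$-th iteration.
2. Proximity operator solution. The proximity operator of the Cauchy function, defined as in Karakuş et al. (2020), is computed at the intermediate point $v^{\tau}$ to obtain $w^{\tau+1}$. Taking the partial derivative of Eq. (9) with respect to $w$ and setting it to zero yields Eq. (12).
Next, the Cardano method (Karakuş et al., 2020) is used to solve for $w^{\tau+1}$ in Eq. (12), i.e.

$$w^{\tau+1} = \frac{v^{\tau}}{3} + \sqrt[3]{\frac{q}{2} + \sqrt{\frac{p^3}{27} + \frac{q^2}{4}}} + \sqrt[3]{\frac{q}{2} - \sqrt{\frac{p^3}{27} + \frac{q^2}{4}}}$$

where $p$ and $q$ are the coefficients of the Cardano method given in Karakuş et al. (2020). In this paper, γ is set to 0.007 in subject-dependent decoding and 0.003 in subject-independent decoding.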
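The two-part iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their experiments were run in MATLAB): it assumes a data term of the form 0.5·‖y − Xw‖², a penalty of the form −log(γ/(γ² + w²)) as in Karakuş et al. (2020), and a step size named `mu`; the function names, the toy data, and the fallback for the three-root case are all hypothetical.

```python
import math


def cauchy_prox(x, gamma, mu):
    """Proximity operator of -log(gamma / (gamma**2 + u**2)) at x, step size mu.

    The stationarity condition of (u - x)**2 / (2*mu) + log(gamma**2 + u**2) is
    u**3 - x*u**2 + (gamma**2 + 2*mu)*u - x*gamma**2 = 0, solved by Cardano.
    """
    c = gamma ** 2 + 2.0 * mu
    p = c - x ** 2 / 3.0
    q = x * gamma ** 2 + 2.0 * x ** 3 / 27.0 - x * c / 3.0
    disc = p ** 3 / 27.0 + q ** 2 / 4.0
    if disc >= 0.0:
        # One real root: sum of two real cube roots, shifted by x / 3.
        root = math.sqrt(disc)
        s = math.copysign(abs(q / 2.0 + root) ** (1.0 / 3.0), q / 2.0 + root)
        t = math.copysign(abs(q / 2.0 - root) ** (1.0 / 3.0), q / 2.0 - root)
        return x / 3.0 + s + t
    # Three real roots (only possible when p < 0): use the trigonometric form
    # and keep the root with the smallest proximal objective.
    m = 2.0 * math.sqrt(-p / 3.0)
    theta = math.acos(max(-1.0, min(1.0, 3.0 * q / (-p * m))))
    roots = [x / 3.0 + m * math.cos((theta - 2.0 * math.pi * k) / 3.0)
             for k in range(3)]
    return min(roots, key=lambda u: (u - x) ** 2 / (2.0 * mu)
               + math.log(gamma ** 2 + u ** 2))


def cauchy_proximal_gradient(X, y, gamma, mu, n_iter=200):
    """Iterate gradient step + Cauchy proximity operator from w = 0."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        # Residual X w - y.
        r = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        # 1. Gradient step on the data term gives the intermediate point v.
        v = [w[j] - mu * sum(X[i][j] * r[i] for i in range(n)) for j in range(d)]
        # 2. Proximity operator of the Cauchy penalty, coordinate by coordinate.
        w = [cauchy_prox(vj, gamma, mu) for vj in v]
    return w


X = [[1.0, 0.0], [0.0, 1.0]]   # toy sample matrix: two samples, two features
y = [1.0, 0.0]                 # only the first feature carries signal
w = cauchy_proximal_gradient(X, y, gamma=0.5, mu=0.5)
print(w)
```

In this toy problem the coefficient of the uninformative second feature stays at exactly zero while the first remains non-zero, which is the selection behaviour the model relies on. The step size would normally be chosen no larger than the reciprocal of the largest eigenvalue of XᵀX; the value used here is simply one that satisfies this for the toy matrix.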

Evaluation indicators and assessment methods
For each subject, the classification accuracy of the test set is used as an evaluation indicator, i.e., the number of correctly classified samples divided by the total number of test set samples. Two assessment methods, subject-dependent and subject-independent decoding, are used to verify the classification performance of the proposed method. For subject-dependent decoding, one model is trained for each subject, and the division of the training and test sets of the model is kept consistent with the original data set, as detailed in the data description section.
For subject-independent decoding, the training and test sets of all subjects except the target subject are used to train the model, and the test set of the target subject is used to evaluate the performance of the model. For example, if subject 1 in Dataset 1 is selected as the target subject, all training and test sets of the other 8 subjects constitute the training set of the model, and the test set of subject 1 constitutes the test set of the model.
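The subject-independent split and the accuracy indicator can be expressed compactly. The sketch below is illustrative only: the helper names `subject_independent_split` and `accuracy`, the per-trial flag `in_test_set`, and the toy subject labels are assumptions introduced here.

```python
def subject_independent_split(subjects, in_test_set, target):
    """Return (train_idx, test_idx) for one target subject.

    subjects[i] is the subject of trial i; in_test_set[i] is True when trial i
    belongs to that subject's original test set.
    """
    # All trials (original training and test sets) of the other subjects.
    train_idx = [i for i, s in enumerate(subjects) if s != target]
    # Only the original test-set trials of the target subject.
    test_idx = [i for i, s in enumerate(subjects)
                if s == target and in_test_set[i]]
    return train_idx, test_idx


def accuracy(y_true, y_pred):
    """Correctly classified samples divided by the number of test samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


subjects = ["A01", "A01", "A02", "A02", "A03", "A03"]
in_test_set = [False, True, False, True, False, True]
train_idx, test_idx = subject_independent_split(subjects, in_test_set, "A01")
print(train_idx, test_idx)
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Note that the target subject's own training trials are used neither for training nor for testing in this scheme, so the model never sees data from the subject it is evaluated on.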

Comparison methods and model parameter settings
There are nine feature selection methods involved in the comparison. F-score and MI are filtered methods, and BDE and BPSO are wrapped methods. LASSO, gLASSO, sgLASSO, SCAD, MCP, and Cauchy are embedded methods, among which LASSO, gLASSO, and sgLASSO are based on convex sparse regularization, and SCAD, MCP, and Cauchy are based on non-convex sparse regularization. We further divide the training set of the model into a training subset and a validation set and use the average accuracy of 10 cross-validations as the selection criterion for the optimal model.
The F-score and MI methods first rank the features using their respective metric criteria and finally select the optimal feature subset using 10-fold cross-validation and the classifier. After the optimal feature subset is obtained by the BDE and BPSO methods, it is directly fed into the classifier for classification. The model parameters of the BDE and BPSO methods are set following the literature (Datta and Dutta, 2012; Too et al., 2019), and the optimal regularization parameters are selected using 10-fold cross-validation. γ is set to 3 and 2 in the SCAD and MCP models, respectively.
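Parameter selection by 10-fold cross-validation can be sketched as follows. The Python below is a schematic under assumptions: folds are contiguous and unshuffled for simplicity, the helper names are hypothetical, and `make_placeholder_scorer` is a stand-in that returns a made-up score in place of actually training and validating a model.

```python
import math


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n_samples // k + (1 if f < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_cv_accuracy(n_samples, k, fit_and_score):
    """Average validation score over k folds."""
    scores = []
    for val_idx in k_fold_indices(n_samples, k):
        held_out = set(val_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        scores.append(fit_and_score(train_idx, val_idx))
    return sum(scores) / len(scores)


def select_parameter(candidates, n_samples, k, make_scorer):
    """Return the candidate with the highest mean cross-validation score."""
    return max(candidates,
               key=lambda c: mean_cv_accuracy(n_samples, k, make_scorer(c)))


def make_placeholder_scorer(param):
    """Placeholder only: a real scorer would fit on train_idx, test on val_idx."""
    def fit_and_score(train_idx, val_idx):
        return 1.0 / (1.0 + abs(math.log10(param) + 2.0))
    return fit_and_score


folds = k_fold_indices(144, 10)
best = select_parameter([0.001, 0.01, 0.1], n_samples=144, k=10,
                        make_scorer=make_placeholder_scorer)
print([len(f) for f in folds], best)
```

With 144 training samples, as in each binary task of Dataset 1, this gives four folds of 15 samples and six folds of 14. In practice the folds would usually be shuffled and stratified by class.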
There are five classifiers used for filtered and wrapped methods, including FLDA, BLDA, sBLDA, KNN, and LR. The K value of the KNN classifier is set to 5, and no parameters need to be set for other classifiers.

Subject-dependent decoding
The classification results of all feature selection methods in subject-dependent decoding are listed in Table 2. Due to the limited space, only the average classification accuracy is listed for each dataset, which is obtained by averaging the classification accuracies of all subjects within the dataset. The classification results for Dataset 1 were obtained by averaging the classification accuracies of all subjects in the six sets of binary classification tasks. From Table 2 we can see that the proposed Cauchy feature selection method achieves the highest classification accuracy on both Dataset 1 and Dataset 2.
The average classification accuracy for all data is shown in Figure 3, which is obtained by averaging the classification accuracies of all subjects in Dataset 1 and Dataset 2. From the overall results of Figure 3, the existing embedded methods (LASSO, gLASSO, sgLASSO, SCAD, and MCP) have little or no advantage over the filtered and wrapped methods. However, the proposed method in this paper has a clear advantage. In addition, the non-convex regularization methods outperform the convex regularization methods in this setting, which supports the superiority of non-convex regularization.

Subject-independent decoding
The classification results of all feature selection methods in subject-independent decoding are listed in Table 3. Similar to Table 2, only the average classification accuracy for each dataset is listed. The proposed Cauchy method achieves the best classification results on Dataset 1. Although Cauchy is not optimal on Dataset 2, it is not far from the highest value and outperforms the vast majority of existing methods. The average classification accuracy for all the data is shown in Figure 4, which is obtained by averaging the classification accuracies of all subjects in Dataset 1 and Dataset 2. As can be seen from Figure 4, the Cauchy method still achieves the best classification results, followed closely by the LASSO method. In subject-independent decoding, the existing convex regularization methods (LASSO, gLASSO, sgLASSO) outperformed the non-convex regularization methods (SCAD and MCP). The filtered and wrapped methods performed comparably or even better than the existing embedded methods. However, the results of filtered and wrapped methods based on KNN classifiers are very poor. This indicates that some classifiers are suitable for subject-dependent decoding but not for subject-independent decoding.
Table note: Dataset 1 (L vs. R) denotes a subset of Dataset 1, i.e., the subset of data corresponding to the left-hand and right-hand motor imagery tasks; the others can be deduced accordingly. Bold display indicates that the method is optimal.

Compared with deep learning methods
In sections 4.3.1 and 4.3.2, the compared methods belong to machine learning methods. In this section, the proposed method is compared with deep learning methods. Deep ConvNet (Schirrmeister et al., 2017), EEGNet-8,2 (Lawhern et al., 2018), Spectral-Spatial CNN (Kwon et al., 2019), and MIN2NET (Autthasan et al., 2021) perform temporal-frequency-spatial feature learning for EEG decoding in different ways. In Table 4, we directly cite the experimental results provided in the literature (Autthasan et al., 2021) without reproducing these deep learning methods. From Table 4 we can see that the proposed method has significant advantages in subject-dependent decoding. In subject-independent decoding, the proposed method is optimal on Dataset 1 (L vs. R) and second only to Spectral-Spatial CNN (Kwon et al., 2019) on Dataset 2.

Model generalization ability of the Cauchy method
The model generalization ability of the proposed Cauchy method is analyzed from two aspects. First, the parameter consistency over different datasets. Second, the classification performance across different training sample sizes.
We first investigate whether the model parameters are the same or close when the optimal classification accuracy is achieved over different datasets. The Cauchy model has only one parameter γ, as shown in Eq. (8). In the subject-dependent decoding, the change in the average accuracy of all subjects in Dataset 1 (L vs. R) and Dataset 2 as γ is varied from 0 to 1 is shown in Figure 5.
From Figure 5, we can see that the classification accuracy change curves of the two datasets are almost the same, indicating that the Cauchy model parameters have good consistency over different datasets. To better represent the consistency of the model parameters, we expand the curves on the fuchsia and lime green boxes of Figure 5, and the results are shown in Figure 6. The fuchsia box part corresponds to Figure 6A, and the lime green box part corresponds to Figure 6B. It can be seen from Figure 6 that the model parameters are also relatively consistent over different data sets, and the parameter values for obtaining the optimal classification accuracy are relatively close.
We validated the classification performance of the proposed model across different training sample sizes using the data of subjects A01 and A09 in Dataset 1 (L vs. R) and subjects S01 and S04 in Dataset 2. In Figure 7, the test set remains unchanged, but the sample size of the training set increases sequentially. In addition, the sample size in the training set is the same for both classes of tasks. From Figure 7, we can see that the proposed method is overall superior to existing methods, especially after the training sample size per class reaches 25. Therefore, the proposed method has robust classification ability.
In summary, the proposed model shows better generalization capability with parameter consistency over different datasets and robust classification across different training sample sizes.

Discussion
We first discuss the overall experimental results, followed by a further analytical study of the feature selection method. Finally, we explore new research directions for future work.

FIGURE 5 The classification accuracy varies with the Cauchy model parameter γ, ranging from 0 to 1.

Classification results
For the subject-dependent decoding, we can see from Table 2 and Figure 3 that the proposed Cauchy method outperforms the existing feature selection methods. The non-convex regularized feature selection methods (SCAD, MCP, and Cauchy) outperformed the convex regularization methods (LASSO, gLASSO, and sgLASSO), indicating that the introduction of non-convex sparse regularization methods into EEG decoding is effective. Some filtered and wrapped methods also achieve better classification results, but rely on specific classifiers.
For subject-independent decoding, we can see from Table 3 and Figure 4 that the proposed Cauchy method outperforms existing feature selection methods. However, the classification results of existing non-convex regularization methods are lower than those of convex regularization methods. The classification results of most feature selection methods are close and all are low, possibly because the extracted temporal-frequency-spatial features are not distinguishable across subjects. Also, some classifiers work well in subject-dependent decoding, but very poorly in subject-independent decoding. Again, it is shown that filtered and wrapped methods are very much influenced by the classifier.
In addition, we can see from Table 4 that our proposed method works significantly better than the deep learning methods in subject-dependent decoding. In subject-independent decoding, although our method outperforms most of the deep learning methods, the classification accuracy of both is not high.
Furthermore, we can see from Figures 5-7 that the proposed Cauchy feature selection method shows better generalization capability.
In summary, the proposed Cauchy method achieved good classification results in both subject-dependent and subject-independent decoding. However, it is still challenging to develop feature extraction and feature selection methods that are effective for both subject-dependent and subject-independent decoding.

Classifier impact on filtered and wrapped feature selection methods
We take F-score and BDE methods as examples to analyze the effect of classifiers on filtered and wrapped methods in subject-dependent and subject-independent decoding. The average classification accuracy is shown in Table 5, which is obtained by averaging the classification accuracies of all subjects in Dataset 1 and Dataset 2, with the maximum value marked with an upper triangle and the minimum value marked with a lower triangle. From the results in Table 5, we can draw two main conclusions. First, the classification accuracies of different classifiers with the same feature selection method vary relatively widely. For BDE in subject-dependent decoding, the maximum classification accuracy is 79.92% and the minimum classification accuracy is 70.43%, a difference of 9.49 percentage points. Second, the same classifier with the same feature selection method performs differently under different assessment methods. For F-score, the KNN classifier achieved the best classification results in subject-dependent decoding but was the worst in subject-independent decoding.
In summary, the traditional filtered and wrapped methods are influenced by the classifiers. How to select a classifier matching the feature selection method deserves further study. In contrast, the proposed Cauchy method can simultaneously perform feature selection and classification without relying on additional classifiers and is thus more convenient and efficient to use.

Model analysis for Cauchy feature selection method
The model analysis of the Cauchy method includes model training time and model convergence.
We first compared the model training time of six embedded feature selection methods. The program runs in the following environment: OS: Windows 10; CPU: AMD Ryzen 7 4800H @ 2.90 GHz; RAM: 16 GB; MATLAB R2017b. To prevent randomness from affecting a fair comparison, the average model training time of all subjects in the dataset is used as the evaluation criterion. The model training time of the sgLASSO method in Dataset 1 (L vs. R) and Dataset 2 is 130.25 s and 115.11 s, respectively, which is much longer than that of the other methods. To keep the plot readable, the sgLASSO method is not included in Figure 8. From Figure 8 we can see that the model training time of the Cauchy method is comparable to that of LASSO and second only to that of SCAD.
The model convergence curves of various embedded methods in subject-dependent decoding are shown in Figure 9, still using the data of subject A01 in Dataset 1 for the experiment, in which subject A01 performs left-hand and right-hand tasks. As can be seen, the Cauchy method converges faster and more stably. It is worth noting that the objective function of each feature selection method is different, so the loss range of the model is also different.
In summary, the proposed Cauchy feature selection method has good convergence performance.

Future work
In the follow-up work, we will continue to optimize the solution method of the proposed Cauchy model, such as using the alternating direction multiplier method. In this way, we can improve the convergence speed of the model, reduce the model training time, and make the algorithm more applicable to online brain-computer interface systems. In this paper, subject-dependent decoding achieves better classification results, but the classification accuracy of subject-independent decoding needs to be improved. We will further explore more effective feature extraction and feature selection methods to enhance the performance of subject-independent decoding. In addition, cross-dataset decoding (Miao et al., 2023; Miao and Zhao, 2023) is also a key direction of our attention. The extended application of the proposed method is also very important. The proposed method only deals with the data of healthy subjects, and in the future, we will apply it to stroke patients as well as to data from other EEG paradigms, such as P300 and emotional EEG.

Conclusion
For motor imagery EEG decoding, a non-convex sparse regularization method based on the Cauchy function is proposed in this paper, which can perform feature selection and classification simultaneously, without relying on additional classifiers. The proposed method can effectively alleviate the biased estimation problem of convex sparse regularization models and is closer to unbiased estimation than existing non-convex sparse regularization models. Therefore, the feature selection effect is better than existing methods. The experimental results of the subject-dependent and subject-independent decoding show that the proposed method outperforms existing feature selection methods and deep learning methods. The proposed method shows good parameter consistency over different datasets and robust classification across different training sample sizes. Furthermore, the proposed method trains in less time and converges faster than existing sparse regularization methods.
In Eq. (1), the sample matrix has size N × P, where N is the total number of samples and P is the feature dimension of one sample.

FIGURE 1 Temporal-frequency-spatial feature extraction. Time window interception is performed first, followed by frequency band filtering, and finally CSP features are extracted on each time-frequency unit.

FIGURE 2 EEG decoding framework. The dashed boxes indicate the specific implementation methods of each data processing step. The embedded feature selection method based on sparse regularization performs feature selection and classification simultaneously.
Proximity operator solution. Compute the proximity operator of the Cauchy function at the intermediate point $v^{\tau}$, i.e.

FIGURE 3 Average classification accuracy of all data in subject-dependent decoding. (A) Filtered and wrapped methods combined with 5 classifiers. (B) Embedded methods perform feature selection and classification simultaneously, without additional classifiers.

FIGURE 4 Average classification accuracy of all data in subject-independent decoding. (A) Filtered and wrapped methods combined with 5 classifiers. (B) Embedded methods perform feature selection and classification simultaneously, without additional classifiers.

FIGURE 6 The classification accuracy varies with the Cauchy model parameter γ. (A) γ ranging from 0.001 to 0.01, (B) γ ranging from 0.01 to 0.1.

FIGURE 7 The classification accuracy varies with different training sample sizes. The classification results of four subjects were used for experimental display. (A) Subject A01 in Dataset 1 (L vs. R), (B) Subject A09 in Dataset 1 (L vs. R), (C) Subject S01 in Dataset 2, (D) Subject S04 in Dataset 2.
direction multiplier method. In this way, we can improve the convergence speed of the model, reduce the model training time, and make the algorithm more applicable to online brain-computer interface systems.

FIGURE 8 Model training time of various embedded methods in subject-dependent decoding. (A) Dataset 1 (L vs. R), (B) Dataset 2. The sgLASSO method is not included because its model training time is much longer than that of the other methods, which would affect the readability of the plot.

TABLE 1
Description of all datasets.
http://bncihorizon-2020.eu/database/data-sets
Frontiers in Neuroscience, doi: 10.3389/fnins.2023.1292724
Similar to Table 2, only the average classification accuracy for each dataset is listed. The proposed Cauchy method achieves the best classification results on Dataset 1. Although Cauchy is not optimal on Dataset 2, it is not far from the highest value.

TABLE 2
Classification accuracy of various feature selection methods in subject-dependent decoding.

TABLE 3
Classification accuracy of various feature selection methods in subject-independent decoding.

TABLE 4
Classification accuracy of the proposed method and deep learning methods.
Bold values indicate that the method achieved the best classification results on a particular dataset.

TABLE 5
The average classification accuracy achieved by different classifiers with the same feature selection method.