Single-Trial EEG Classification via Orthogonal Wavelet Decomposition-Based Feature Extraction

Achieving high classification performance is challenging due to non-stationarity and low signal-to-noise ratio (low SNR) characteristics of EEG signals. Spatial filtering is commonly used to improve the SNR yet the individual differences in the underlying temporal or frequency information is often ignored. This paper investigates motor imagery signals via orthogonal wavelet decomposition, by which the raw signals are decomposed into multiple unrelated sub-band components. Furthermore, channel-wise spectral filtering via weighting the sub-band components are implemented jointly with spatial filtering to improve the discriminability of EEG signals, with an l2-norm regularization term embedded in the objective function to address the underlying over-fitting issue. Finally, sparse Bayesian learning with Gaussian prior is applied to the extracted power features, yielding an RVM classifier. The classification performance of SEOWADE is significantly better than those of several competing algorithms (CSP, FBCSP, CSSP, CSSSP, and shallow ConvNet). Moreover, scalp weight maps of the spatial filters optimized by SEOWADE are more neurophysiologically meaningful. In summary, these results demonstrate the effectiveness of SEOWADE in extracting relevant spatio-temporal information for single-trial EEG classification.


INTRODUCTION
Brain-computer interface (BCI) systems provide an approach for communicating with the external world by brain signals (Lemm et al., 2011). BCI systems based on Electroencephalogram (EEG) is the most common non-invasive modality, as EEG is inexpensive and has high temporal resolution. Motor imagery based BCI is a commonly applied paradigm that can efficiently decode the imagination of movement, and related features are derived from event-related desynchronization/synchronization (ERD/ERS) (Blankertz et al., 2008). The signal processing steps of a BCI system include brain signal acquisition, EEG signal preprocessing, feature extraction and feature classification. The preprocessing step aims at enhancing the relevant information by attenuating the artifacts and noise, e.g., band-pass filtering. The feature extraction stage forms discriminative features for the performed tasks in the spatial domain, temporal domain or frequency domain. The extracted features are then used to train a classification or regression model to decode the user's intent. Hence, advanced signal processing plays a key role in neuroscience research, and feature extraction and classification of EEG signal have always been the two most critical problems encountered in EEG signal analysis.
However, EEG signal processing is very challenging due to the low spatial resolution, high non-stationarity and high intrasubject variability of EEG. The challenges in developing an efficient feature extraction algorithm are to address the abovementioned issues and form discriminative features. Common spatial patterns (CSP) computes spatial filters, i.e., linear combinations of EEG channels, to enhance class-discriminative band power features contained in the EEG (Blankertz et al., 2008;Lemm et al., 2011). However, the potential problem of CSP is that it does not take into consideration the discriminative information in the temporal or frequency domain. To address this problem, many studies have applied spatial filtering or spatio-temporal filtering for feature extraction, as reviewed below. The common spatio-spectral patterns (CSSP) algorithm (Lemm et al., 2005) is an approach to simultaneously optimize spatial filters and channel-wise temporal filters; however, embedding one temporal delay has less flexility. The common sparse spectral spatial patterns (CSSSP) algorithm  optimizes one high-order finite impulse response filter for all the channels, and a sparse penalty term for the temporal filter is embedded in the objective function of CSSSP. However, no close-form solution is available by CSSSP. DFBCSP (Higashi and Tanaka, 2013) is developed to maximize the power ratio between two motor imagery states (i.e., two classes) to design spatial and temporal filter pairs. Note that the high-dimensional filters are not regularized to ameliorate the potential over-fitting issue. The filter bank common spatial pattern (FBCSP) algorithm (Ang et al., 2008) operates spatial filtering and spectral filtering in a sequential manner, which first filters the EEG signals into several distinct sub-band components and then selects discriminative features from the sub-bands via specific criteria. The variants of FBCSP (Novi et al., 2007;Kavitha et al., 2008;Zhang et al., 2011) differ in the feature selection criterion. BSSFO (Suk and Lee, 2013) formulates a Bayesian inference framework for the features of each sub-band. Along a different line, several algorithms have been proposed to optimize the spatial filters and spectral/temporal filters in separate stages. SPEC-CSP (Tomioka et al., 2006) and ISSPL (Wu et al., 2008) iteratively optimize spatial filters and spectral filters by maximizing the Fisher ratio or margin hyper-plane. However, the objective functions of FBCSP, ISSPL and SPEC-CSP are distinctive for spatial filter and spectral filter optimization; therefore, convergence and optimality cannot be guaranteed. In our previous work, the regularized spatio-temporal filtering and classification (RSTFC) algorithm (Qi et al., 2015) is proposed as a new EEG analysis framework, which shows advantageous classification performance. In recent years, there has been increasing interest in using deep learning with convolutional neural networks to decode imagined tasks from raw EEG signals (Lawhern et al., 2016;Schirrmeister et al., 2017). However, more training samples are needed, as the number of parameters is larger than that of the conventional approaches.
In addition to spatial filtering, time-frequency representation is highly desirable for EEG feature extraction from timefrequency plots (Qin and He, 2005;Yang et al., 2007). In recent years, wavelet transform has been found to be an effective timefrequency analysis tool for analyzing transient signals and has been applied to seizure detection (Bhattacharyya and Pachori, 2017) and emotion recognition (Mohammadi et al., 2017). However, there is no standard method for selecting the best wavelet and determining the decomposition level for processing EEG signals (Hlawatsch and Boudreaux-Bartels, 1992;Subasi et al., 2006). An algorithm based on an autoregressive model and wavelet packet decomposition is proposed in Zhang et al. (2017); however, the decomposed signals are redundant. CSP is employed for feature extraction on the EEG signal that are reconstructed and de-noised by single level wavelet in Zhang et al. (2010). In Mousavi et al. (2011), wavelet packets are used to decompose EEG signals into multiple levels, and fuzzy logic is combined to select the discriminative packets. EEG signals are decomposed by wavelets (Robinson et al., 2011(Robinson et al., , 2013 after preprocessing, and then sub-band signals are reconstructed at each level, which are further spatially filtered by the CSP algorithm for feature extraction. An orthogonal wavelet is employed to decompose the EEG signal into several sub-bands in Robinson et al. (2012), and then spatial regularized CSP is performed to extract features using the wavelet coefficients. The wavelet coefficients are taken as input features for probabilistic neural network in Gandhi et al. (2011). In Zhao et al. (2009), a Morlet wavelet transformation is applied for different frequency bins to each row of the composite covariance matrix and store the new rows in a larger matrix, and then optimizes the filters by maximizing the time-frequency ratio. However, the Morlet wavelet is a linear transformation, and some rows might be linearly dependent, leading to a non-full-rank matrix. In summary, there are mainly two potential problems of the existing wavelet-related algorithms: (a) the reconstructed signals are linearly dependent as the used wavelet is not orthogonal, or (b) the reconstructed signals from each level are used to construct features separately; therefore, optimality cannot be guaranteed.
In this paper, an orthogonal wavelet decomposition-based algorithm termed SEOWADE is proposed for single-trial EEG classification of motor imagery signals. The contributions are threefold: (i) Unrelated sub-band components are obtained for subsequent feature extraction. Specifically, the preprocessed EEG signal is decomposed using orthogonal wavelets, and the decomposed signals are subsequently reconstructed at different levels/scales to yield sub-band components. (ii) To enhance the discriminativity of the extracted features for classification, the proposed algorithm localize signals in spectral domain and spatial domain by implementing spatio-spectral filtering on reconstructed EEG signals from multiple levels. (iii) As the choice of wavelet may have a significant impact on the quality of the results regarding the classifier, cross-validation is employed to select the most suitable wavelet function for , 2, · · · , L}: details coefficients and approximation coefficients at the l-th level u l ,v l , l ∈ {1, 2, · · · , L}: reconstructed signal at the l-th level P ∈ R (L+1)×T : sub-band components of a single channel with the decomposition level is L X ∈ R M×T : wavelet filtered and embedded data matrix of EEG signal X, where M = C · (L + 1) R 1 , R 2 ∈ R M×M : estimated augmented covariance matrices of the EEG data under two states w ∈ R M×1 : combined spatio-spectral filter 2m: dimension of feature vector, i.e., m features for each state W ∈ R M×2m : the combined spatio-spectral filter matrix f ∈ R 2m×1 : the feature vector obtained by logarithm of the normalized variance of EEG signals projected onto W signal processing. The SEOWADE algorithm is validated using three EEG data sets from past BCI competitions, and it is confirmed that the performance of the proposed algorithm is comparable to or even better than other stateof-the-art algorithms.
The remainder of this paper is structured as follows. Mathematical details of the SEOWADE algorithm is presented in section 2. The experimental results of SEOWADE on the three data sets from past BCI competitions are provided in section 3, where the details of the data sets, analysis pipelines, and classification results are provided. SEOWADE is benchmarked against those of contemporary algorithms: CSP, CSSP, CSSSP, FBCSP, and the algorithm termed shallow ConvNets proposed in Schirrmeister et al. (2017). Finally, section 4 concludes this paper. For ease of reference, the essential mathematical symbols used in this paper are shown in Table 1.

METHODOLOGY
In this section, the orthogonal wavelet decomposition-based feature extraction method is described. We used orthogonal wavelets to construct independent sub-band components to localize signals spectro-temporally. Subsequently, spectral filtering via linearly transformed the sub-band components and spatial filtering is applied to enhance the discriminativity of EEG signals. In this way, the motor imagery-related patterns are generated by spatio-spectral filtering, and distinct power features defining the performed tasks are extracted. Finally, sparse Bayesian learning with Gaussian prior is applied to the extracted power features, with a RVM classifier obtained for classification.

Wavelet Decomposition and Reconstruction
Fourier transform has been widely applied to the analysis of non-stationary EEG signals. The advantage of wavelet analysis over short-time Fourier transform is that one can look at the signals at different scales or resolutions: a approximate level and a detailed level. Specifically, the wavelet is a smooth function that oscillates and quickly vanishes, which can localize well in both the frequency domain and the temporal domain (Vetterli and Herley, 1992). A wavelet family ψ a,b (t) is a set of elementary functions, which are generated by dilations a's and translations b's of a unique admissible mother wavelet where a, b ∈ R, a = 0, and t is the time point. The dilation parameter a, the translation parameter b determine the oscillatory frequency and length, the shifting position of the wavelet, respectively.

Discrete Orthogonal Wavelet Transform
In this subsection, we describe how the discrete orthogonal wavelet transform is implemented. In wavelet transform multiresolution analysis, a finite impulse response filter pair [highpass (HP) and low-pass (LP) filters] is specifically designed, the frequency responses of which separate the high-frequency and low-frequency components of the input signal. The HP filter coefficients are associated with the scaling function, and the LP filter is associated with the wavelet function. The outputs of the LP filters are called the approximation (A), and the outputs of the HP filters are called the details (D). In multi-resolution algorithms, any time series can be completely decomposed in terms of the approximation and detail components. Applications of discrete wavelet transformation produce a multi-resolution analysis of signals across time and frequency; however, the resulted wavelet components are redundant, and the computational complexity increases with additional decomposition layers. To address this issue and accommodate non-stationarity frequency analysis, discrete orthogonal wavelet transformation is applied for EEG signal analysis in this paper to obtain an improved tradeoff between temporal resolution and frequency resolution by varying the window length over frequencies.
The concept of multi-resolution and successive approximation of orthogonal wavelet transformation can be explained as follows. Let A i , i ∈ Z be the space of band limited functions, and let D i , i ∈ Z be the orthogonal complement of A i in A i−1 . These spaces are related as in the equations (Hazarika et al., 1997): The orthogonal wavelet bases are constructed such that they span A i and D i . For example, at level i = 1, the functions that approximate signals of space A 0 in A 1 represent a perfect half-band LP filter h 1 and those in D 1 represent a perfect half-band HP filter h 0 . At each decomposition level, orthogonal wavelet decomposition involves filtering with h 1 and h 0 followed by sub-sampling by 2. From a signal processing approach, the orthogonal wavelet transform can be defined as applying filters and samplers on discrete time sequences to perform a coarse half-resolution approximation of the original time sequences (Vetterli and Herley, 1992). The original signals can be reconstructed from these sub-band components using the reverse process, i.e., up-sampling by 2 and filtering using time reversed filter sequences h ′ 1 and h ′ 0 .

Wavelet Filtering and Embedding
Given a T-length EEG signal x = [x(0), x(1), · · · , x(T − 1)], we show how the sub-band components of x by wavelet filtering and embedding are obtained as follows. The finite impulse response of the N-length filter h 0 is in the form of h 0 = [h 0 (0), h 0 (1), · · · , h 0 (N − 1)], while the N-length filter h 1 satisfies the following expression: The assumption is that the even-shifted version of the impulse response h 0 [rows of the circulate matrix H 0 given in (2)] forms an orthogonal set that spans the subspace D 1 : Similarly, the even-shifted version of h 1 forms H 1 which has the same structure as H 0 and spans subspace A 1 . Note that the subspaces spanned by H 0 and H 1 are non-overlapped: where B * denotes the complex conjugate of matrix B, and I is the identity matrix. The EEG signals x projected onto subspaces D 1 and A 1 followed by sub-sampling by 2, denoted by u 1 and v 1 can be represented by: where ⊤ denotes the transpose operator. Then, u 1 consists of the detailed coefficients, and v 1 consists of the approximation coefficients. The reconstruction of signal is achieved as: (5) Figure 1A shows the processing steps of decomposition and reconstruction by orthogonal wavelet transformation at level 1. For multilevel orthogonal wavelet decomposition, the aforementioned process repeats to decompose v l , l ∈ {1, 2, · · · , L} at each level. The wavelet decomposition using coefficients from all L levels first gives L + 1 wavelet coefficients, and then retains coefficients from only one components for construction, until all the coefficients are used for construction, with L + 1 reconstructed signals obtained: v ′ L , u ′ L , u ′ L−1 , · · · , u ′ 1 . Specifically, Figure 1B demonstrates the process of obtaining the sub-band components with the impulse response of decomposition and reconstruction filters. The sub-band components via wavelet filtering and embedding can be presented by the reconstructed signals of the EEG signal x are denoted as a embedding matrix: In Figure 2, an example of the 100th trial of the training data set from subject al implemented by DB7 is presented, with the level of wavelet decomposition is three.

SEOWADE
Though the CSP algorithm is a highly successful method that has gained a surge of popularity and interest, the performance suffers from a non-discriminative brain rhythm that has an overlapping frequency range with the most discriminative brain rhythm. On the other hand, the frequency band on which the CSP operates is either selected manually or unspecifically set to a broad band filter, which is likely to degrade the performance by using an inappropriate frequency band. In the proposed algorithm, the sub-band components are weighted channelwisely for spectral filtering because the discriminative frequency bands are channel-distinct, and simultaneously spatial filtering is implemented to enhance the discriminability of EEG signals. Note that the previous paper (Robinson et al., 2013) considered only the relative power distribution between each wavelet component, while our algorithm considers the total power of all the orthogonal wavelet components after spatio-spectral filtering, which is considered to capture the discriminative features of the EEG signal.

Combined Spatio-Spectral Filtering
The channel-wise spectral filters and the spatial filter can be re-parameterized by a single vector, which is presented below. Let X = [x 1 , x 2 , · · · , x C ] ⊤ ∈ R C×T denote a single-trial EEG signal, where C and T denote the number of channels and sample points, respectively. Each x c ∈ R 1×T , c ∈ {1, 2, · · · , C} denotes a T-length time sequence given by x c = [x c (0), x c (1), · · · , x c (T − 1)]. Suppose the L+1-length spectral filter for the sub-band components of the c-th channel is a c = [a c (0), a c (1), · · · , a c (L)] ⊤ , c ∈ {1, 2, · · · , C}, and the C-length spatial filter is s = [s(1), s(2), · · · , s(C)] ⊤ . The C channelspecific spectral filters a c 's composes a matrix A ∈ R (L+1)×C as: A = [a 1 , a 2 , · · · , a C ]. Then, the combined operations of spatial filtering and channel-wise spectral filtering on X can be collectively expressed as: FIGURE 1 | (A) EEG signal decomposition and reconstruction by orthogonal wavelet. Here h 0 (n) and h 1 (n) present the finite impulse response of the half-band LP filter and half-band HP filter, and h ′ 0 (n) and h ′ 1 (n) present the reversed filters of h 0 (n) and h 1 (n), respectively. (B) EEG signal decomposition and reconstruction by orthogonal wavelet filtering to obtain the wavelet filtered and embedded EEG data, with the level of wavelet decomposition equals to three.
where w ∈ R M×1 with M = (L + 1) × C denotes the combined spatio-spectral filter: where X denotes the embedding spatio-spectral matrix that appends the reconstructed wavelet signals of all the channels: and P c ∈ R (L+1)×T denotes the component matrix of the c-th channel, c ∈ {1, 2, · · · , C}, as described in (6). According to (7), the implementation of spatial filtering and channel-wise spectral filtering on EEG signal X can be represented by linear combinations ofX by w, where w is reparameterized by a spatial filter and channel-wise spectral filters.

Filter Optimization
The optimization objective of our algorithm is to determine spatio-spectral filter w by solving the following maximizing variance ratio problem as in CSP (Blankertz et al., 2008): where R 1 and R 2 are augmented covariance matrices under two states (labeled with 1 and 2) estimated by X's with trace normalization applied: where N i is the number of training samples for the i-th state, X i k is the k-th training sample for the i-th state, i ∈ {1, 2}. To ameliorate the potential over-fitting problem, we perform a l 2 -norm regularization, which penalizes non-smooth combined spatio-spectral filters, as shown in the following Rayleigh quotient maximization problem: where ρ is the regularization parameter. Then, w can be decomposed into a spatial filter s and channel-wise spectral filter a c 's, c ∈ {1, 2, · · · , C}. In this work, the c-th coefficient of s is s(c) = sgn(w c (1)) · w c 2 , and a c = w c /s c . Typically, one can retain only a small number of projection vectors, which contain most of the discriminative information for each state. We term the number of filters per class as m. Multiple spatio-spectral filters can be obtained by performing the generalization eigenvalue problem: R 1 w = λ R 2 w. The resulting eigenvalues λ = J 2 (w) associated with each spatio-spectral filter is presumed to indicate the discriminability of the spatio-spectrally filtered EEG signals between the two class. By exchanging R 1 and R 2 , m filters for another state can be obtained. In total, 2m spatio-spectral filters are retained for subsequent analysis.

Classification
Suppose W is a combination of the 2m filters: W = [w 1 , w 2 , · · · , w 2m−1 , w 2m ] ∈ R M×2m . Spatio-filtering can be implemented by projecting the wavelet filtered and embedded EEG data onto W as: Z = W ⊤ X ∈ R 2m×T . A classifier can be trained on the feature vectors obtained by normalizing and log-transforming the variances of Z as: where f denotes the feature vector. The log transformation serves to approximate the normal distribution of the feature vector, which is the basic assumption of some classifiers.
Typically, the prediction of class labelŷ is defined over the features f as follows, where u ∈ R 2m×1 and u 0 denote the vector of parameters and the bias term, respectively, and is the basis function:ŷ The objective of a classifier is to estimate the values for those parameters. Sparse Bayesian learning is designed to manage the computational and statistical complexity in a principled way, which defines a hyper-parameterized prior over the parameter with the following form: This type of prior ultimately favor a sparse model that also fits the features well. The relevance vector machine (RVM) is a Bayesian sparse kernel technique for regression and classification, which is implemented as the classifier with Gaussian prior in this paper. In summary, the involved analysis steps of SEOWADE can be described as follows, with the analysis flowchart presented in Figure 3: • The EEG data are acquired and preprocessed; • The preprocessed signals are decomposed by orthogonal wavelets at different scales, which are then reconstructed using wavelet reconstruction, with sub-band components are obtained; • With the wavelet filtered and embedded data, the spatiospectral filters are optimized by (12); • We normalize and log-transform the discriminative features constructed by the spatio-spectrally filtered signals as in (13); • The RVM classifier with a Gaussian prior is trained with the processed features, and the performance of the algorithm in terms of classification performance is determined.

DATA ANALYSIS AND RESULTS
In this section, the binary classification performance of SEOWADE algorithm applied to three real EEG data sets is shown. In particular, the performance of SEOWADE is compared with that of the following algorithms: CSP, FBCSP (Ang et al., 2008), CSSP (Lemm et al., 2005), CSSSP , and shallow ConvNets proposed in Schirrmeister et al. (2017).
Several intuitive examples are illustrated to demonstrate the relative advantage of SEOWADE over the compared algorithms.
The MATLAB code of SEOWADE is available upon requests from the authors to allow for the reproducibility.

Data Description
In this work, three publicly available motor imagery data sets from past BCI competitions are used for performance evaluation. The data sets consist of EEG data from 17 subjects during motor imagination.
(1) Data Set A (Data Set IVa from BCI Competition III; Dornhege et al., 2004): This data set is recorded from five subjects (al, aa, av, aw, and ay), with a sampling rate of 100 Hz and 118 recorded channels. The experiment is a cuebased motor imagery paradigm, with the subjects performing right-hand and foot motor imagery tasks (labeled 1 and 2, respectively). A total of 280 trials are available for each subject, among which 224 (80%), 168 (60%), 84 (30%), 56 (20%), and 28 (10%) trials compose the training data set for al, aa, av, aw, and ay, and the remaining trials compose the testing data set. (2) Data Set B (Data Set IIIa from BCI Competition III; Schlögl et al., 2005): This data set is recorded three subjects (K3, K6, L1), with a sampling rate of 250 Hz and 60 recorded channels. These subjects perform left-hand, right-hand, foot and tongue motor imagery tasks (labeled 1, 2, 3, and 4, respectively). Both the training and the testing data sets contain 45 trials per class for subject K3, and 30 trials for subjects K6 and L1. We split the data set of each subject to generate C 2 4 = 6 data subsets, as our aim is to evaluate the binary classification performances of these algorithms. Therefore, a total of 18 data subsets are obtained, and for convenience, we use "subject's name-class labels" to denote the data subset.
(3) Data Set C (Data Set IIa from BCI Competition IV; Naeem et al., 2006): This data set comprises 22-channel EEG signals with a sampling rate of 250 Hz. Nine subjects (A01-A09) perform left-hand, right-hand, foot and tongue motor imagery tasks (labeled 1, 2, 3, and 4, respectively). Both the training and the testing data sets contain 72 trials per class for each subject. Akin to Data Set B, 6 × 9 = 54 data subsets are generated for binary classification.
A total of 5 + 18 + 54 = 77 data subsets are obtained for evaluation. The proposed SEOWADE algorithm and competing algorithms are run on each data subset for binary classification. The channel layouts for the three data sets are presented in Figure 4.

Analysis Pipeline
The analysis pipeline of the considered algorithms for binary classification is illustrated as follows.
The following preprocessing steps are applied to the above mentioned 77 data subsets.
• All the channels are used for data analysis, i.e., channels covering both hemispheres are included in the analysis to obtain spatio-spectral information. • A sixth-order Butterworth filter with a pass-band range of 7-40 Hz is used to filter the EEG signals to filter out the components unrelated to sensorimotor rhythms. • The time window is set to 0.5-3.5 s for the three data sets, where 0 denotes the time of cue ends.
For each data subset, the training data sets after preprocessing are fed into SEOWADE and other competing algorithms to optimize the spatial filters, spatio-temporal filters or spatio-spectral filters. The normalized and logtransformed variances of the spatially, spatio-temporally or spatio-spectrally filtered signals are then defined as the features for all the algorithms. Following previous studies, 3 features per state are constructed from the filters corresponding to the 3 largest generalized eigenvalues. That is, 2m = 6 features are obtained.
RVM with Gaussian prior is applied as the classifier for CSP, FBCSP, CSSP, CSSSP and SEOWADE. The classification performance on the test data set is measured in terms of mean classification accuracy and standard deviation.

Hyper-Parameters Settings
Several hyper-parameters need to be pre-determined for SEOWADE by 10-fold cross-validation during the analysis of CSSP, CSSSP, and SEOWADE. The hyper-parameters and the candidate sets for the algorithms are summarized in Table 2. Note that when L = 0 and ρ = 0 in SEOWADE, or τ = 0 in CSSP, these algorithms are reduced to CSP. N-fold cross-validation, sometimes called rotation estimation (Devijver and Kittler, 1982), is the statistical practice of partitioning a sample of data into N subsets such that the analysis is initially performed on N − 1 subsets, while the other subset is retained for subsequent use in conforming and validating the initial analysis. The initial subsets of data are called the training set, and the other subset is called the validation or testing set. The cross-validation process is then repeated N times (the folds), with each of the N subsets used exactly once as the validation data set. The results from the N folds can then be averaged to produce a single estimation. For each hyper-parameter, the one that achieves the highest average classification accuracy across the 10 repetitions is determined as its value.

Classification Performances
The classification performance of SEOWADE is compared with those of other compared algorithms in this subsection. The testing classification accuracies for all the algorithms compared with SEOWADE are shown in Figure 5, and the points above the diagonal line indicate that the classification accuracy of SEOWADE is higher than that of the compared algorithm on the x-axis. Moreover, the mean testing classification accuracies on the three data sets and on all the data sets are summarized in Table 3, in which the entries highlighted in boldface indicate the best performance among the competing algorithms. Repeated measures analysis of variance (ANOVA) shows that SEOWADE and the other five algorithms significantly differ in the classification performance [F (5,5×76) = 7.9617, and p = 3.76 × 10 −7 ). Moreover, based on the Bonferronicorrected Wilcoxon signed-rank tests, the significance of the results between the two compared algorithms at the 5, 0.5, and 0.1% levels is indicated by *, **, *** beneath the corresponding sub-figure of Figure 5. It can be confirmed that a considerable improvement in classification performance can be achieved by SEOWADE over the compared algorithms, which shows the effectiveness of SEOWADE.

Presentations of Extracted Filters
To determine the physiological significance of the algorithm and the source of informative features in the brain, we study the extracted spatial filters obtained from the proposed SEOWADE Regularization parameter ρ 10 −6 × {1, 2, · · · , 100} FIGURE 5 | Classification accuracies (%) of the algorithms (CSP, FBCSP, CSSP, CSSSP, and shallow ConvNets) compared with SEOWADE for the 77 binary data subsets, among which the five-pointed stars, circles and diamonds in each sub-figure denote the corresponding data subsets from Data Set A, Data Set B and Data Set C, respectively. Moreover, based on the Bonferronicorrected Wilcoxon signed-rank tests, the significance of the results between the two compared algorithms at the 5, 0.5, and 0.1% levels is indicated by *, **, *** beneath the corresponding sub-figure of this figure.  Figure 6 shows the topological scalp distribution maps of the spatial filters optimized by CSP, CSP, CSSP, CSSSP, and SEOWADE, with the weight maps related to the leading eigenvalue for each class. Overall, the scalp weight distributions from SEOWADE are neurophysiologically relevant to the motor imagery tasks, with strong weights over the corresponding regions in the contralateral motor cortex. In addition, we compare the spatial filters of different algorithms by calculating the absolute value r's of correlation coefficients between their spatial maps, which is shown in the rightmost column of Figure 6. It can be seen that some spatial filter maps obtained by different algorithms are highly correlated, yet not identical.

Computational Costs
SEOWADE can be optimized efficiently. Suppose the number of the decomposition level for SEOWADE is L, and the number of channels is C, below we assess the computational complexity of each algorithm in the feature extraction step. The computational complexity is O(((N + 1)C) 2 ) for SEOWADE, and O(C 2 ), O((2C) 2 ) for CSP and CSSP respectively. As for FBCSP, the complexity is approximately 8 times as large as that of CSP, since CSP is called for each of the 8 sub-band components. As for CSSSP, the spatial filters and channel-common temporal filters are optimized iteratively, and the complexity is O(C 3 ) for spatial filter optimization and O(K 1 N) for temporal filter optimization in each iteration, respectively, where K 1 is the number of iterations for temporal filter optimization with gradient descent. Let K 2 denote the number of iterations to reach convergence for CSSSP. The total complexity for CSSSP is O(K 2 (C 3 + K 1 N)).  To highlight, in Table 4, we provide the runtimes for CSP, FBCSP, CSSP, CSSSP, shallow Convnet (abbreviated as sConvnet here), and SEOWADE (with DB7 implemented as the orthogonal wavelet) on the representative examples. sConv is coded in Pytorch, and other algorithms are run on a Windows PC with a 3.70-GHz Inter Core (TM) i7-8700K CPU and 16-GB RAM, in MATLAB © R2018b. It can be seen that the runtime of SEOWADE is less than those of CSSP and sConvnet, albeit more than those of CSP and CSSP.

DISCUSSION
Note that common average reference, removal of eyes movements and blinks are not performed for SEOWADE and other competing algorithms. The reason is that these preprocessing steps amount to spatial filtering on EEG, which is to a large extent redundant considering that the spatial filtering algorithms considered in the paper are already optimal for within-subject classification under their respective optimization criteria.
The high classification performance of SEOWADE on the three open motor imagery data sets demonstrates the effectiveness of SEOWADE. However, a relatively small-sized training set may lead to biased models with poor generalization performance. To address this issue, our work can be improved along the following two lines: (i) Introducing inter-subject transfer learning. One idea is borrowing useful information from other subjects or other sessions to facilitate the current model learning process. In this case, how to utilize the information from other subjects to regularize the current model of the target subject will be the focus of our future work. (ii) Generating adversarial examples. Alternatively, the number of training sample can be enlarged by generating adversarial examples, i.e., through perturbing the real sample via Generative Adversarial Networks (GAN). With more training samples, the obtained model is expected to be more stable and more robust with better generalization ability.

CONCLUSIONS
In this work, an SEOWADE algorithm based on orthogonal wavelet decomposition is proposed for motor imagery EEG signal classification. By SEOWADE, spatial filter and channel-specific spectral filters can be optimized simultaneously under a single optimization problem, with a l 2 -norm regularization term embedded to ameliorate over-fitting issue. Feature vectors are extracted via normalize and log-transform the spatio-spectral filtered signals, and then RVM with Gaussian prior is employed for classification. The proposed algorithm can effectively localize signals both in spatial and temporal domain, and thus can provide discriminative features for classification. One motivation of BCI research is the application to motor imagery signal classification. Three motor imagery data sets from past BCI competitions are used to evaluate the performance of the proposed algorithm. Compared with the classic algorithms CSP, FBCSP, CSSP, CSSSP and shallow ConvNets, the testing classification performance of SEOWADE is significantly better at the 5% level. The extracted spatial patterns of SEOWADE are mainly located in the contralateral cortex, which is neurophysiologically relevant.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
FQ, WWa, and WWu were responsible for the data analysis and contributed to the writing of the manuscript and interpretation of the data. XX and FW contributed to the data analysis and interpretation of the data. ZG, ZY, and YL contributed to the interpretation of the data. All authors contributed to the article and approved the submitted version.