A Fast, Open EEG Classification Framework Based on Feature Compression and Channel Ranking

Superior feature extraction, channel selection and classification methods are essential for designing electroencephalography (EEG) classification frameworks. However, the performance of most frameworks is limited by their improper channel selection methods and too specifical design, leading to high computational complexity, non-convergent procedure and narrow expansibility. In this paper, to remedy these drawbacks, we propose a fast, open EEG classification framework centralized by EEG feature compression, low-dimensional representation, and convergent iterative channel ranking. First, to reduce the complexity, we use data clustering to compress the EEG features channel-wise, packing the high-dimensional EEG signal, and endowing them with numerical signatures. Second, to provide easy access to alternative superior methods, we structurally represent each EEG trial in a feature vector with its corresponding numerical signature. Thus, the recorded signals of many trials shrink to a low-dimensional structural matrix compatible with most pattern recognition methods. Third, a series of effective iterative feature selection approaches with theoretical convergence is introduced to rank the EEG channels and remove redundant ones, further accelerating the EEG classification process and ensuring its stability. Finally, a classical linear discriminant analysis (LDA) model is employed to classify a single EEG trial with selected channels. Experimental results on two real world brain-computer interface (BCI) competition datasets demonstrate the promising performance of the proposed framework over state-of-the-art methods.


INTRODUCTION
For a long time, the brain-computer interface (BCI) has been a prevalent communication system for directly bridging between a brain and a computer. A typical BCI system collects brain activities associated with mental tasks, translates these neural signals into appropriate commands, and eventually sends them to a computer (Handiru and Prasad, 2016). As a non-invasive brain activity measurement method, electroencephalography (EEG) has attracted increasing interest, owing to its low risk, low cost, feasibility, and significant potential for practical applications (Yang et al., 2016). EEG data is commonly composed of multichannel signals recorded from several electrodes placed on the scalp to obtain the activity of various cortexes. For example, motor-imagery (MI) EEG signals are primarily gained from the motor-relevant cortex and could provide users with direct control of various devices (Lafleur et al., 2013;Meng et al., 2016), e.g., a wheelchair, quadcopter, or robotic arm, without using any peripheral nerves or muscle movements.
The core component of an EEG-based BCI (EEG-BCI) system is the decoding or classification of EEG signals. The basic structure of an EEG classification framework typically consists of four parts: signal pre-processing, feature extraction, channel selection and classification, of which the feature extraction and channel selection are attracting more attention. Feature extraction is fundamental to EEG classification and many methods have been presented to acquire the implicit essential information from raw EEG signals, e.g., spectral power (SP) and time-domain parameters (TDP) (Müller et al., 2004). Channel selection attempts to remove irrelevant EEG features in redundant channels to reduce the setup time and equipment cost, and improve the effectiveness and efficiency of EEG-BCI systems (Alotaiby et al., 2015). Well-known channel selection methods are often based on evolutionary algorithms and mutual information.
Even though multiple methods have been proposed to tackle EEG classification tasks using different feature extraction and channel selection methods, bottlenecks still appear when existing EEG-BCI frameworks are applied to practical applications, as shown in Figure 1A.
(1) The computational complexity of channel selection methods has often been neglected by previous works (Kee et al., 2015;Zhang et al., 2016), thereby slowing the EEG feature extraction and offline classifier training procedures, which negatively affects the deployment efficiency of EEG frameworks in different situations. Moreover, excessively long setup and experiment times may negatively affect the (human) subjects' mental concentration and condition, heavily decreasing the EEG signal-to-noise ratio (SNR). (2) Convergence is seldom considered in some iterative frameworks (Yu et al., 2015), leading to stochastic, even divergent solution searching. A non-monotonic convergence process may postpone the EEG classification and further degenerate the subjects' experience. Moreover, it is difficult to consistently obtain an optimal decision persistently because of the unstable solution searching process.
(3) Existing frameworks are mostly designed too specifically (Bashar et al., 2016), narrowing their expansibility. The majority of datasets in pattern recognition approaches are expressed in two-dimensional matrices, i.e., samples and features, which differ from previous EEG data matrices. Thus, an overly welldesigned EEG framework is usually confined to specific methods and offers few open accesses to the many more remaining methods, thus, seriously affecting its application and expansion. Additionally, considering that novel methods are continuously emerging, inadequate support for them may limit the sustained improvement of existing frameworks.
Intuitively, the solution to (2) is to use fast convergent channel selection algorithms, which is also beneficial for solving (1). Another solution to (1) is decreasing the data scale, including feature extraction and dimension reduction. Meanwhile, the solution to (3) is to present the EEG signals in two dimensional data matrices, which are commonly used in other signal processing areas, without removing the latent spatial information carried by multi-channel EEG signals. In the meantime, representing EEG data in two dimensional matrices is conducive to building bridges between EEG feature extraction and previous feature selection methods.
In light of this, as shown in Figure 1A, we propose a fast, open EEG classification framework characterized by feature compression, low-dimensional representation, and convergent iterative channel ranking. Firstly, to alleviate the computational burden, we use fast data clustering to give EEG feature vectors numerical signatures according to their channel-wise similarity. Thus, it is possible to reduce the feature dimension in EEG signals. Secondly, instead of simply flattening three dimensional (trial-channel-feature) EEG signals into two dimensional (trialchannel*feature) ones, we compress the EEG features and contract the feature vectors into their numerical signatures. Thus, we can represent the EEG signal using a two dimensional matrix and retain its spatial information in the meantime. In this manner, EEG signals are mapped into a normal data matrix with rows and columns indicating trials and channels, respectively, which is consistent with most pattern recognition studies. Thirdly, to obtain fast and stable decisions, we use a few iterative feature selection approaches with theoretical convergence to rank and select informative EEG channels, thereby shortening the procedure and improving the decision performance of the EEG-BCI systems.
Our main aims and contributions are highlighted in the following: • We leverage the data clustering between feature extraction and channel selection in the traditional EEG-BCI to channel-wise compress the features to reduce the data scale. • We provide an EEG classification framework applicable to most common EEG processing and pattern recognition methods to provide an entrance for increasing the EEG-BCI performance through technology integration. • We introduce feature selection approaches with theoretical convergence to rank and select channels to further improve the performance of the EEG-BCI.

RELATED WORK
In past decades, great efforts have been made to address the EEG classification problem, mainly including feature extraction and channel selection. One of the most widely used EEG feature extraction methods is common spatial patterns (CSP). CSP utilizes covariance analysis to amplify the class disparity in the spatial domain, to combine signals from different channels (Blankertz et al., 2007;Li et al., 2013). Improved CSPs, e.g., common spatiospectral patterns (CSSP) (Lemm et al., 2005), iterative spatiospectral pattern learning (ISSPL) (Wu et al., 2008), and filter bank common spatial patterns (FBCSP) (Kai et al., 2012), mainly optimize the combination of multi-channel signals by developing a spectral weight coefficient evaluation. Obviously, these spatial feature extraction methods are good at selecting the pivotal spatial information included in EEG signals. However, signal combination cannot be practically used to reduce the channel number and data scale. Apart from the spatial domain, extracting features in the temporal and frequency domains is also prevalent in many works. Multivariate empirical mode decomposition (MEMD) generates multiple dimensional envelopes by projecting signals in all directions of various spaces (Rehman and Mandic, 2010;Islam et al., 2012). The autoregressive (AR) model assumes that EEG signals can be approximated in the AR process, in which the features could be gained as parameters of the approximated AR models (Zabidi et al., 2013). TDP was introduced as a set of broadband features based on the variance of different EEG signals in various orders (Vidaurre et al., 2009). A wavelet transform (WT) simultaneously provides ample frequency and time information about EEG signals at the low and high frequencies, respectively (Jahankhani et al., 2006). We utilize SP based on AR and TDP as two feature extraction examples in this paper.
EEG channel selection has been extensively studied. For instance, multi-objective genetic algorithms (GA) (Kee et al., 2015) and Rayleigh coefficient maximization based GA  were introduced to simultaneously optimize the number of selected channels and improve the system accuracy by embedding classifiers into the GA process. Recursive feature elimination (Guyon and Elisseeff, 2003) and zero-norm optimization (Weston et al., 2003) based on the training of support vector machines (SVMs) were used to reduce the number of channels without decreasing the motor imagery EEG classification accuracy (Lal et al., 2004). Sequential floating forward selection (SFFS) (Pudil et al., 1994) and successive improved SFFS (ISFFS) (Zhaoyang et al., 2016) took an iterative channel selection strategy that selected the most significant feature from the remaining features and dynamically deleted the least meaningful feature from the selected feature subset. A Gaussian conjugate group-sparse prior was incorporated into the classical empirical Bayesian linear model to gain a groupsparse Bayesian linear discriminant analysis (gsBLDA) method for simultaneous channel selection and EEG classification (Yu et al., 2015). Mean ReliefF channel selection (MRCS) adopted an iterative strategy to adjust the ReliefF-based weights of channels according to their contribution to the SVM classification accuracy . The Fisher criterion, based on Fisher's discriminant analysis was utilized to evaluate the discrimination of TDP features, extracted from all channels in different time segments via channel selection, using time information (CSTI) methods (Yang et al., 2016). GA and pattern classification using multi-layer perceptrons (MLP) and ruleextraction based on mathematical programming were combined to create a generic neural mathematical method (GNMM) to select EEG channels . However, a visible limitation is that the convergence of most of these methods is unstable, e.g., gsBLDA, leading to an uncertain channel selection procedure and selected subset.
Therefore, it is natural to select EEG channels by taking advantage of feature selection methods in other areas, e.g., robust feature selection (RFS) (Nie et al., 2010), joint embedding learning and sparse regression (JELSR) (Hou et al., 2011), nonnegative discriminative feature selection (NDFS) (Li et al., 2012), selecting feature subset with sparsity and low redundancy (FSLR) (Han et al., 2015b), robust unsupervised feature selection (RUFS) (Qian and Zhai, 2013), joint Laplacian feature weights learning (JLFWL) (Yan and Yang, 2014), a general augmented Lagrangian multiplier (FS_ALM) (Cai et al., 2013) and structural sparse least square regression based on the l 0 -norm (SSLSR) (Han et al., 2015a). RFS, JELSR, and NDFS focused on the l 2,1 -norm minimization regularization to develop an accurate and compact representation of the original data. RUFS tried to solve the combined object of robust clustering and robust feature selection using a limited-memory BFGS based iterative solution. JLFWL selected important features based on l 2 -norm regularization, and determined the optimal size of the feature subset according to the number of positive feature weights. FSLR could retain the preserving power, while implementing high sparsity and low redundancy in a unified manner. FS_ALM and SSLSR attempted to handle the least square regression based on l 0 -norm regularization by introducing a Lagrange multiplier and a direct greedy algorithm, respectively. Further, prevalent deep learning was also utilized in this area. A point-wise gated convolutional deep network was developed to dynamically select key features using a gating mechanism (Zhong et al., 2016). Multi-modal deep Boltzmann machines were employed to select important genes (biomarkers) in gene expression data (Syafiandini et al., 2016). A deep sparse multi-task architecture was exploited to recursively discard uninformative features for Alzheimer's disease diagnosis (Suk et al., 2016). In this paper, we employ three performanceverified feature selection methods with theoretical convergence, i.e., RFS, RUFS, and SSLSR, to rank and select channels.

MATERIALS AND METHODS
In this section, we describe the datasets, introduce some notations used throughout this paper, and present the proposed frameworks.

Datasets and Pre-processing
In our experiments, we selected two public real world datasets on motor imagery (MI) paradigms as examples. A brief description of these two datasets is offered below and their basic statistics are summarized in Table 1. EEG datasets of other paradigms, e.g., P300, are also suitable for this framework.

DS1
Dataset IVa from BCI Competition III is a public EEG dataset provided by the Berlin BCI group Fraunhofer FIRST (Intelligent Data Analysis Group) and Campus Benjamin Franklin of the Charité University (Neurophysics Group). This public dataset is recorded from five healthy subjects (aa,al,av,aw,ay) during right hand and right foot motor imageries. The EEG recordings consist of 118 channels at positions of the extended international 10/20system. We choose a version of the data that is down-sampled at 100 Hz for analysis. In the experiments, subjects performed three motor imageries for 3.5 s after visual cues for left hand, right hand, or right foot. After the duration of motor imagery, a resting interval with random length of 1.75-2.25 s was inserted for relaxation. The dataset provides only EEG trials for right hand and right foot imagery. For each subject, the dataset consists of signals of 140 trials per class. In our experiment, the signals in the time interval of [0.5, 3.0] s are analyzed for each trial. A visual cue is presented to mark the start time (0 s). A bandpass finite impulse response (FIR) filter using the window method, with a band of 0.1-40 Hz and order 33, is applied to DS1.

DS2
Dataset I from BCI Competition IV is another public EEG dataset provided by the Berlin BCI group Fraunhofer FIRST (Intelligent Data Analysis Group) and Campus Benjamin Franklin of the Charité University (Neurophysics Group). This public dataset is recorded from four healthy subjects (a,b,f,g) during two classes of motor imagery selected from three classes: left hand, right hand, and foot (side chosen by the subject; optionally also both feet). In the experiments, the data was continuous signals of 59 EEG channels and visual cues pointing left, right or down were presented for a period of 4.0 s during which the subject was instructed to perform the cued motor imagery task. These periods were interleaved with 2.0 s of blank screen and 2.0 s with a fixation cross displayed in the center of the screen. The dataset provides only EEG trials for left hand and foot imagery. For each subject, the dataset consists of signals of 100 trials per class.
In our experiment, the signals in the time interval of [0.0, 4.0] s are analyzed for each trial. A visual cue is presented to mark the start time (0 s). A bandpass FIR filter using the window method, with a band of 0.1 to 40 Hz and order 33, is also applied to DS2.

Notations
In this document, scalars, matrices, vectors, sets, and functions are denoted as small, boldface capital, boldface lowercase, blackboard capital, and script capital letters, respectively. x T , X T , (:,j) , and trace(X)indicate the transpose of vector x, the transpose of matrix X, the i-th element of x, the i-th sample of the variable X, the element of X occurring in the i-th row and j-th column, the i-th row of X, the j-th column of X and the trace of X respectively. Moreover, ||x|| 1 is the l 1 -norm of x, ||X|| 1 and ||X|| 2 are the m 1 -norm and m 2 -norm of matrix X. For any vector x ∈ R n×1 and any matrix X ∈ R n×m , the definitions of l p -norm, m r -norm and m r,s -norm are given in the Appendix.
Furthermore, assume that we have recorded EEG signals of N trials, and let X = {X i } N i=1 be the set of EEG signal corresponding to all N trials. Specifically, we represent each trial of EEG signals, i.e., X i (i = 1, 2, ..., N), as matrix X M×C , where M and C are the number of sampled time points and channels in a trial respectively. The class indicator vector can be denoted as y ∈ {0, 1, ..., L − 1} N , where L is the number of classes.

Proposed Methods
We first depict the flowchart of the framework, called feature compressing and channel ranking (FCCR). Afterward, we provide the details for its three core components. Lastly, its pseudo codes are given.

Flowchart of the Framework
Our framework starts with EEG signal pre-processing and singlechannel SP or TDP feature extraction. Next, after dividing all of the trials into the training and testing sets, we gather all the features from the different channels for the training trials and assign them cluster signatures through k-means. Then, EEG signals are mapped from the three dimensional (trialchannel-feature) to the two dimensional (trial-channel) matrix by compressing all feature vectors into their cluster signatures. With this data representation (two-dimensional matrix) commonly used in most pattern recognition areas, we employ RFS, RUFS, or SSLSR to rank and select the channels. Lastly, a traditional LDA is used to classify the testing trials, with the features corresponding to the selected channels. The procedure of our proposed FCCR is summarized in Figure 1B. Notably, FCCRs are subject-specific since the procedures from feature clustering to channel ranking in Figure 1B are dependent of the subject.

Feature Extraction
We take SP based on AR model and TDP as two examples to extract EEG features. In fact, other methods, including WT and MEMD, can also be used to replace them.
Each EEG signal of a single channel is treated as a time-varying variable, denoted as x(t), in this section.

SP features
A power spectrum is an energy density distribution over frequencies. The AR Filter actually computes a time series of such spectra, representing a flow of energy through each single point in the frequency domain. In this way, the SP based on AR model can be treated as the energy per frequency per time.
AR model is a representation of a type of random process (Zabidi et al., 2013), which specifies that the output variable depends linearly on its own previous values. Hence, the predicted valuex(t) of an AR process is defined by, where a p (k) is the coefficient of the p-th order AR process. Then, the estimated power spectrum directly corresponds to the filter's transfer function whose coefficients are actually AR coefficients. To obtain spectral power for finite-sized frequency bins, that power spectrum is multiplied by total signal power, and integrated over the frequency ranges corresponding to individual bins. We select frequency bins from 0 to 30 Hz with step 3 Hz, resulting 11 values per channel for each trial.

TDP features
TDPs are a set of broadband features with physical meanings, whose definition is as (Vidaurre et al., 2009).
in which we apply logarithm to regulate the TDPs' distribution to Gaussian approximately (Vidaurre et al., 2009) to fit the LDA classifier (Müller et al., 2004). Three TDPs are used in this paper to extract EEG features in both temporal and frequency domain, that is, f TDP = [TDP (0) , TDP (1) , TDP (2) ]. Typically, TDP (0) depicts the EEG character in terms of amplitude, TDP (1) can be interpreted as a kind of EEG pattern in terms of high frequency (mainly the beta band), and TDP (2) carries the essential information of the change in frequency (Vidaurre et al., 2009).

Feature Compressing and Data Representation
With SP or TDP features extracted from all channels of all trials, we collect all features of training trials (with N trials) as a feature for TDP features), abandoning its trial and channel information. Then, the classical k-means is used as an example to cluster the above EEG features. Likewise, almost all of data clustering approaches, such as spectral clustering and AP clustering, are acceptable to replace it.
In this paper, we utilize k-means clustering to partition N×C× 11 or N × C × 3 features into K(≤ N) sets C = {C 1 , C 2 , ..., C K } in order to minimize the within-cluster sum of squares (WCSS). Formally, the object of k-means is to find: where µ j is the mean of features in C j . This is equivalent to minimizing the pairwise squared deviations of features in the same cluster: After dividing training features into K clusters using standard Lloyd's algorithm (Lloyd, 1982), we can replace feature vector f i with its corresponding cluster signature j if i ∈ C j . In this way, training signals can be represented as a matrixX N×C .

Channel Ranking
After representing training trials as a normal data matrixX with its rows and columns indicating trials and channels, we take RFS, RUFS and SSLSR as examples to rank and select EEG channels. The main strength of these methods is that they could find the optimal channel subset with the strict convergence in theory.
Similarly, researchers can use other appropriate feature selection algorithms to rank channels. In this section,Ỹ ∈ {0, 1} N×L is the class label indicator matrix (Ỹ il = 1 if the i-th sample belongs to the l-th class, otherwise, Y il = 0), and W ∈ R C×L is the weight matrix of channels. After obtaining W, we rank all channels according to their weights, which are computed as ||W (i,:) || 2 .

Ranking based on RFS
RFS is an efficient and robust feature selection by emphasizing l 2,1 -norm minimization on both the loss function and the regularization term (Nie et al., 2010). The objective of RFS is Note that Equation (5) is equivalent to Then, the optimal U is gained by setting the derivative of the Lagrangian function of Equation (7) on U to zeros, that is where D is a diagonal matrix with the i-th diagonal element as D ii = 1 2||U (i,:) || 2 . The pseudo codes of RFS is given in Algorithm 1.

Output:
Channel weight matrix W.
as an identity matrix. 2: repeat 3: The theoretical proof of the convergence of Algorithm 1 can be found in Nie et al. (2010).

Ranking based on RUFS
RUFS utilizes l 2,1 -norm minimization on processes of both label learning and feature learning to effectively handle outliers and noise in the data, as well as reduce redundant or irrelevant channels. The objective of RUFS is often written as min F,G,W ||X − GF|| 2,1 + βtrace(G T LG) where L is the normalized graph Laplacian matrix commonly used in unsupervised learning (Li et al., 2012). Then, the optimal W is obtained by using a limited-memory BFGS (Nocedal, 1980) and BMLVM (Benson and Moré, 2001) based alternating iterative algorithm, which can be summarized in Algorithm 2.

Ranking based on SSLSR
SSLSR is an effective greedy algorithm to directly handle the challenging l 2,0 -norm based structural sparse least square regression, which has the objective where C is the number of selected channels. Start with W 0 = 0, SSLSR develops an alternating forward and backward strategy to pick and eliminate (if possible) a channel in each iteration.
Suppose at the beginning of the iter-th iteration, the selected channel set is F iter . Then ∀i ∈ F iter , W iter (i,:) = 0. In the forward step, we need to find the channel which reduces the loss most, i.e., find i / ∈ F iter and θ ∈ R L×1 such that where e i ∈ R C×1 is the vector of zeros, except for the icomponent which is one. θ is the optimal weight vector for one channel.
Since L(W iter + e i θ T ) is convex in terms of θ T , the optimal θ for i / ∈ F iter can be gained by setting ∂L(W iter + e i θ T )/∂θ = 0, that is, Thus, we have Before beginning the backward step, F and W are updated as F iter+1 = F iter ∪ {i iter } and W iter+1 = W iter + e i iter θ iter T respectively.
In the backward step, the channel with the least contribution to reducing the loss is selected and we need to determine whether to remove it according to a specific criterion. That is to say, we need firstly to find which is equivalent to finding j ∈ F iter such that min j∈F iter L(W iter − e j W iter (j,:) ).
The eliminating criterion relies on the decrease and increase of the loss in forward and backward steps. Specifically, if and only if iter the j iter -th channel can be removed from F, where ν ∈ (0, 1) is a hyper-parameter, iter − = L(W iter ) − L(W iter+1 ) and iter + = L(W iter − e j iter W iter (j iter ,:) ) − L(W iter ). The pseudo codes of SSLSR are as given in Algorithm 3 and Algorithm 4.
The theoretical proof of the convergence of Algorithm 3 can be found in Han et al. (2015a).

Pseudo Codes of the Framework
The flowchart and pseudo codes of the proposed framework are summarized in Figure 1 and Algorithm 5.

RESULTS
We refer to our methods based on RFS, RUFS, and SSLSR as FCCR1, FCCR2, and FCCR3, respectively, and compare them with some baselines. In this section, we present the experimental setup, followed by the classification results.

Experimental Setup
In this section, the baselines, metrics and other experimental settings are given in sequence.

Input:
EEG data matrixX N×C and class label indicator matrix Y N×L , number of selected channels C and number of maximum iteration MI.

Output:
Selected feature index set F. Find j iter = argmin j∈F iter L(W iter − e j W iter (j,:) ).
14:  To validate the effectiveness of FCCRs, we compare them with the following baselines, including three kinds of static channel subset and three kinds of channel selection with training algorithms.

Output:
N-dimensional predicted label vectorỹ. 1: EEG data pre-processing. 2: EEG feature extraction channel-wise according to Section 3.3.2. 3: Partition all trials into training and testing sets. 4: Cluster features of training trials using k-means. 5: EEG data representation. 6: Channel ranking and selection by Algorithm 1, Algorithm 2, or Algorithm 3. 7: Train LDA classifier using training trials with selected channels. 8: Predict the label vector of testing trails using trained LDA.
1. All channels: Signals of all available channels are used for EEG classification; 2. 3C channel s: Signals of C3, Cz, and C4 are used for EEG classification; 3. MEMD+STFT (Bashar et al., 2016): Features extracted based on multivariate empirical mode decomposition and short time Fourier transform are used for EEG classification; 4. gsBLDA (Yu et al., 2015): Signals of channels selected based on group sparse Bayesian linear discriminant analysis are used for EEG classification; 5. MRCS : Signals of channels selected by combining ReliefF and SVM are used for EEG classification; 6. NSGA-II (Kee et al., 2015): Signals of channels selected by a multi-objective genetic algorithm, i.e., NSGA-II, are used for EEG classification.

Metrics
Following previous researches, we employ the classification accuracy and F1 score to evaluate the performance of these methods. Accuracy discovers the one-to-one relationship between predicted and ground truth labels. Denotingỹ i and y i as the predicted result and the ground truth label of a trial X i respectively, we can compute accuracy as follows.
where n is the total number of trials and δ(x, y) is the delta function that equals 1 if x = y and 0 otherwise. A larger accuracy indicates a better performance. F1 score is another widely used measure in evaluating the performance of classification. It is the harmonic mean of precision and recall, Again, a larger F1 indicates a better performance.

Other Settings
It is universally acknowledged that the performance of most algorithms is dependent on the hyper-parameters they contain. Therefore, we set some parameters in advance.
In the feature clustering process, for our FCCRs, the number of clusters is determined using the "grid-search" strategy, i.e., from 2 to 8 with a step of 2, 10 to 90 with a step of 10, and 100 to 700 with a step of 200 on each subject, using the litekmeans algorithm.
In the channel ranking process, for gsBLDA, the stop criterion is whether the max iteration beyond 10 or the change of α or β is smaller than 0.01. For MRCS, the number of nearest neighbors k = 10, the maximum iteration number 50, ǫ = 0.01, and 10-fold cross-validation is used to evaluate the average classification accuracy when the top-n channels are used. For NSGA-II, the population is 300, the maximum generation is 200, and the 10-fold cross-validation accuracy is also utilized to estimate the first objective. The only difference from Kee et al. (2015) and Zhang et al. (2016) is that LDA, instead of the library for support vector machines (LIBSVM), is employed inside the channel selection procedure to maintain consistence with the following classification process.
In the trial classification process, 5 × 5 cross-validation on the LDA classifier is used throughout this study [except for MEMD+STFT, in which we use k nearest neighbor (kNN) according to Bashar et al., 2016]; i.e., five times for each dataset, we partition all the trials into five sets and choose four of them as the training sets, with the last set as the testing set. It should be noted that the training and testing sets are identical for all methods. For gsBLDA, MRCS, FCCR1, FCCR2, and FCCR3, the numbers of selected channels range from 1 to 118 (DS1) or 59 (DS2) with an incremental step of 1, and the numbers of the channels selected in NSGA-II are set according to the population with the best fitness.

Experimental Results
To examine the effectiveness of the proposed framework, the classification experimental results are given and analyzed in this section. Because of the limited length of this paper, for some experiments related to subjects and features, we randomly  The number of selected channels (listed in the bracket) and the standard deviation are also given. The best performance is highlighted in bold.
partitioned the results for all subjects into different experiments, and show them in Figures 3-5. Thus, all nine subjects and two features are covered, if we consider these experimental results together.

Classification Performance Comparison
We summarize the classification results of different methods using the two real world MI-EEG datasets in Tables 3-6. From these four tables, we make the following observations. First, the classification performance of the selected channel subsets is superior to All channels, demonstrating the practical significance of channel selection. For instance, as shown in Table 3, the best mean accuracies for the selected channels on DS1 are 81.93 and 82.57%; i.e., they are 9.14 and 17.8% better than the 75.07 and 70.07% for All channels, respectively.
Second, channel subsets selected through training often perform better than static channel subsets, as they verify the role of channel selection approaches using training. Taking Table 6 as an example, the best mean F1 scores on DS2 for training methods are 75.13 and 78.99%, which outperform the 67.00 and 69.71% of the static 3C channels by 12.1 and 13.3%, respectively.
Third, our proposed FCCRs usually outperform other channel selection methods on various subjects from different datasets. In terms of mean accuracies and F1 scores, our FCCRs consistently obtain the best performance. Specifically, of 36 results on nine subjects with two features in terms of two metrics, the FCCRs gain the 31 best results, occupying 86.1%.
Finally, our FCCRs can sometimes gain the best performance with fewer channels, indicating the effectiveness of our framework. Especially on subject ay with SP features, all three FCCRs obtain high accuracies and F1 scores with only four channels. In contrast, the fewest channels selected by other methods are 36 (NSGA-II), nine times the FCCRs' results.
The p-values of one-way analysis of variance (ANOVA) 1 statistical tests of the improvement for all methods are plotted in Figure 2, from which we can observe that the FCCRs obtain a significant improvement over the other methods in most situations. The three FCCRs have 18 improvements over six baselines for each panel; an average of 14 improvements is significant, occupying 77.8%.
Except for the above, we plot the precision and recall (P-R) pairs for all methods in Figure 3, from which we can obtain other observations. First, some channel selection approaches may be limited in the recall, e.g., MRCS and NSGA-II on subject aa with SP features; gsBLDA, MRCS and NSGA-II on subject aw with TDP features; and gsBLDA on subject a with TDP features.
Second, our FCCRs can obtain a relatively high precision with acceptable recall. Specifically, on subject aw with TDP features, FCCR2, and FCCR3 both gain the recall of 1 and the precision larger than 0.8 at the same time, much better than the baselines.
Third, the P-R curve varies obviously with the subjects. For subjects aa and b with SP features, the recall of all the methods is below 0.9, and the precision is hardly beyond 0.9. Conversely, The number of selected channels (listed in the bracket) and the standard deviation are also given. The best performance is highlighted in bold. The number of selected channels (listed in the bracket) and the standard deviation are also given. The best performance is highlighted in bold.
Frontiers in Neuroscience | www.frontiersin.org The number of selected channels (listed in the bracket) and the standard deviation are also given. The best performance is highlighted in bold.
FIGURE 2 | p-value for the improvements of nine methods on two datasets with SP or TDP features. The white diamonds at the center of each square denote a statistically significant improvement. Red rectangles mark the p-values of the proposed FCCRs relative to the baselines.
many markers are distributed in the area beyond 0.9 in terms of both precision and recall for subjects aw and a with TDP features.

Selected Channels Comparison
To visually compare the performance of the different channel selection approaches, we plot the classification accuracies of nine methods with different numbers of selected channels on four subjects in Figure 4. The observations are as follows. First, almost all channel selection methods are effective, including 3C channels, gsBLDA, MRCS, NSGA-II and three FCCRs, since their accuracies with fewer selected channels can outperform the ones with all channels. Second, few channel selection results are superior to the ones with 3C channels. In fact, on subject g with TDP features, only the FCCRs outperform the 3C channels. Third, the proposed FCCRs can obtain the best performance compared with the other methods in most situations. For these four results, the best performance is always acquired by FCCRs. Fourth, our FCCRs are able to gain the highest accuracies with few channels, which is consistent with the above results. Additionally, in Figure 5, we display the selected channel subset on the subject's scalp with the top classification accuracies for subject ay, in which the channels ranked from top to bottom are indicated by the red to blue areas. It is clear that significant and informative channels relevant to the MI-EEG, i.e., channels in the motor-sensory cortex, are selected first by the proposed FCCRs. gsBLDA, MRCS and NSGA-II usually select some irrelevant or redundant channels, possibly having an adverse effect on their classification performance, which is consistent with the results in section 4.2.1 and Figure 4.

Computational Complexity Comparison
In this subsection, we experimentally analyze the computational complexity of nine methods by comparing their average running time (ART) for nine subjects with two datasets, on a PC with a 4.0GHz Intel i-7 6700K CPU with 32 GB of memory.
From Figure 6 and Table 7, we can observe that the ARTs of the FCCRs are longer than some approaches without training, including All channels and 3C channels. In addition, the fact that the FCCRs run faster than MRCS, NSGA-II, and MEMD+STFT demonstrates their efficiency. It is interesting that MEMD+STFT FIGURE 4 | Average classification accuracy and F1 scores of nine algorithms with numbers of selected channels varying from one to all available in the step of one on subjects with different features.
is unexpectedly slow, even though it has no channel selection. It is also clear that the ARTs of FCCRs are <10 s on nine subjects, which is comparable with motor imagery paradigms, demonstrating their practical worth in real applications.

Parameter Sensitivity
The insensitivity of our proposed framework to cluster numbers is demonstrated in Figure 7, in which we notice that cluster numbers exert a limited influence on classification accuracies in quite a large range, from 2 to 300.

Comprehensive Comparison
In addition to the above experiments, we comprehensively compare nine methods comprehensively in terms of six metrics, i.e., mean accuracy (MA), mean F1 score (MF), mean precision (MP), mean recall (MR), average running speed (ARS, number of trials divided by ART), and average channel reduction ratio (ACRR, ratio of number of removed channels to all channels), and show the results in Figure 8. In this figure, the inner hexagon indicates 0 for MA, MF, MP, and MR, the smallest ARS, and 0% for ACRR. The outer hexagon indicates 1 for MA, MF, MP, and MR, the largest ARS, and 100% for ACRR. Thus, for all results, more outside markers mean better performance.
It appears from Figure 8 that, for all data sets with SP or TDP features, the FCCRs are superior to other state-of-the-art methods in terms of almost all metrics, except for ARS.

DISCUSSION
From the experimental results shown above, we find that our methods are conducive to EEG classification through FIGURE 5 | Topographical map of multichannel MI-EEG recordings for subject ay with the SP feature. Six rows indicate six methods, i.e., gsBLDA, MRCS, NSGA-II and our three proposed methods. Five columns indicate the five cross-validation. We only retain the channels included in the optimal channel subset. Red and blue represent the top and bottom rankings. feature compression, low-dimensional data representation, and convergent channel ranking. The primary reasons for the promising classification performance of our FCCRs include the following: (1) motor-sensory cortex signals are much more relevant to MI-EEG than signals of other cortexes; signals from redundant channels corresponding to unrelated cortexes may  negatively influence the EEG classification performance; (2) informative channels often vary between subjects and the optimal channel subsets need to be found empirically; the intrinsic physiological and mental condition divergence among subjects during the experimental process is fairly large; (3) our FCCRs can probably select more pivotal channels and rank them in the top position, by virtue of their channel selection methods with high capability and strict convergence. With regard to the selected channel comparison results, the primary explanations follow: (1) signals from irrelevant and redundant channels have a negative effect on classifying MI-EEG trials, as they introduce confusing information; (2) 3C channels, i.e., Cz, C3, and C4, are closely related to the motor-sensory cortex and pivotal for classifying motor imagery EEG signals; they are infrequently selected by some methods.
The computational complexity comparison results could be intuitively foreseen by comparing their flowcharts and procedures. The detailed explanation mainly includes the following: (1) the process of evaluating the channels' weights is unnecessary in methods without training, e.g., All channels and 3C channels; thus, their ARTs are small; (2) the FCCRs compress the feature vectors and exploit strictly convergent channel selection approaches to reduce the data scale and the period of evaluating the channels' weights; they thereby accelerate the EEG classification procedure and decrease the FCCRs' ARTs; (3) methods with embedded classifiers, e.g., MRCS and NSGA-II, need to estimate the channel subset and update the channel's weights in each iteration, leading to large ARTs. Additionally, MEMD+STFT has large ARTs because extracting the MEMD features is time consuming, especially when the total number of signal projections is beyond 200 in the MEMD algorithm.
For the parameter sensitivity results, the cluster numbers have limited influence on estimating the similarity among features and clustering-based representative signals. Therefore, changing the number of clusters does not significantly affect the following channel selection and classification.
From the comprehensive comparison results, we find that the FCCRs have some exciting advantages, including a small data scale, a strictly convergent procedure, short running time, and high classification accuracy. All of these visibly promote their comprehensive performance, which is helpful for implementing FIGURE 7 | Average classification accuracies(%) of our three methods when the numbers of selected channels are 10, 30, 50, 70, and 90 for DS1 and 10, 20, 30, 40, and 50 for DS2, and the numbers of clusters are 2, 6, 20, 60, and 300. The three rows correspond to FCCR1, FCCR2 and FCCR3, respectively. them with field programmable gate arrays (FPGAs) or digital signal processors (DSPs) in practical applications.
As shown in Figure 1, the main differences between the FCCRs and previous studies are feature compression and convergent channel ranking. On one hand, in current EEG feature extraction methods, spatial features, or features in the time and frequency domain, are always directly used to select channels, and even classify trials (Blankertz et al., 2007;Zhaoyang et al., 2016). From the perspective of classical pattern recognition, the essential role of the features depends on their similarities or differences, rather than the explicit numerical values themselves. Therefore, compressing features by clustering them and endowing them with signatures not only preserves the intrinsic similarities involved in the features, but also reduces the data scale. On the other hand, most existing channel selection methods carefully evaluate the channels' weights by directly classifying the signals that compose them (Kee et al., 2015;Zhang et al., 2016). Thus, their performance is classifierdependent, lacking adaptability and support for unfamiliar situations. In most practical applications, classifiers are not fixed and should be flexibly chosen according to practical conditions. Moreover, ignoring the convergence of the selection methods may lead to unsatisfactory results, since the process of adding or removing channels is not controllable. Thus, it is better to leverage classifier-independent and performance-verified selection approaches with proven strict convergence to guarantee both the process and the result of the channel weights evaluation. Philosophically, some solutions have tortuous exteriors but are very rapid in many researching areas. By introducing feature compression and convergent channel selection, we propose the FCCRs as such a promising opportunity to classify multiple EEG signal types.
Inevitably, FCCRs still have some potential limitations to be overcome, which may bring a forth series of innovations and applications by integrating plenty of current and forthcoming signal processing and pattern recognition approaches in the future. Firstly, we will focus on the feature extraction and clustering methods for seizing the key information from the original EEG signals with low SNRs. Moreover, considering the ever-developing feature selection approaches, dynamic weighting techniques for evaluating channels are also among our directions for future research. In addition, deciding the optimal number of channels is still an open problem, for which we will attempt to propose a criterion in the near future. FIGURE 8 | Mean accuracy, mean F1 score, mean precision, mean recall, average running speed and average channel reduction ratio for nine algorithms. Except for the average running speed, whose range varies according to the practical minimum and maximum values of the nine methods, the range of the other metrics is [0,1] ([0%,100%] for ACRR).

CONCLUSION
In this paper, we present a novel EEG classification framework named FCCR, which introduces feature compression and normal signal representation between the traditional feature extraction and channel selection procedures. In addition, some strict convergent algorithms with theoretical proofs are leveraged to rank and select informative channels, thereby showing that our proposed framework is capable of classifying EEG signals with acceptable accuracy and speed. Extensive experimental results on two real world motor imagery EEG datasets validate the effectiveness and efficiency of the proposed framework, as compared with state-of-the-art methods.

AUTHOR CONTRIBUTIONS
JH, JZ, and CW designed research. JH and YZ performed research. AK, GX, and HZ analyzed data and drafted parts of the first version of the manuscript. JH, HS, and JC wrote and revised the paper under the supervision of JZ and CW.