Optimized Projection and Fisher Discriminative Dictionary Learning for EEG Emotion Recognition

Electroencephalogram (EEG)-based emotion recognition (ER) has drawn increasing attention in the brain–computer interface (BCI) field due to its great potential in human–machine interaction applications. According to the characteristics of their rhythms, EEG signals can usually be divided into several frequency bands. Most existing methods concatenate the features of multiple frequency bands and treat them as a single feature vector. However, it is often difficult to utilize band-specific information in this way. In this study, an optimized projection and Fisher discriminative dictionary learning (OPFDDL) model is proposed to efficiently exploit the specific discriminative information of each frequency band. Using subspace projection, the EEG signals of all frequency bands are projected into a subspace. A shared dictionary is learned in the projection subspace such that the specific discriminative information of each frequency band can be utilized efficiently while the discriminative information shared among multiple bands is preserved. In particular, the Fisher discrimination criterion is imposed on the atoms to minimize the within-class sparse reconstruction error and maximize the between-class sparse reconstruction error. Then, an alternating optimization algorithm is developed to obtain the optimal solution for the projection matrix and the dictionary. Experimental results on two EEG-based ER datasets demonstrate the effectiveness of the model.


INTRODUCTION
The brain-computer interface (BCI) has been one of the research hotspots in health monitoring and biomedicine in recent years (Edgar et al., 2020; Ni et al., 2020b). A BCI does not rely on muscles or the peripheral nervous system. It establishes a direct information transmission channel between the brain and the outside world. The electroencephalography (EEG) signals captured by a BCI system are a powerful tool for analyzing neural activities and brain conditions. EEG has the advantages of convenience (i.e., non-invasive, non-destructive, and simple) and validity (i.e., sensitivity, validity, and compatibility) (Sreeja and Himanshu, 2020). The EEG signal is also an important tool for revealing the emotional state of human beings. It has been shown that when people are in different thinking and emotional states, the rhythm components and waveforms of their EEG signals differ. In BCI, emotion recognition (ER) starts by presenting external stimuli to subjects to induce specific emotions such as happiness, sadness, and anger. These stimuli may be videos, images, music, and so on. During the session, EEG data are recorded by EEG devices. Subsequently, the first step is to preprocess the recorded EEG and extract useful features from it. The next step is to train the classifier and optimize its parameters. The final step is to test the trained model with new EEG data that were not used in the training process.
Machine learning classifiers have been widely used in EEG-based ER, such as support vector machine (SVM), deep learning (Hwang et al., 2020; Song et al., 2020), nearest neighbor classifier, random forest (Fraiwan et al., 2012), and probabilistic neural networks (Nakisa et al., 2018). In recent years, dictionary learning-based methods have achieved great success in EEG-based recognition tasks for BCI (Ameri et al., 2016; Gu et al., 2020; Ni et al., 2020a). In general, dictionary learning-based classification methods learn discriminative and robust dictionaries from training samples. The test sample is represented as a sparse linear combination of atoms of the learned dictionary, and the classification task is then carried out according to the reconstruction error and/or the sparse coefficients. Dictionary learning works well even with noisy EEG signals. Barthélemy et al. (2013) developed an efficient method to represent EEG signals based on an adapted Gabor dictionary and demonstrated on real data that the learned multivariate model is flexible and that the learned representation is informative and interpretable. Abolghasemi and Ferdowsi (2015) developed a dictionary learning framework to remove ballistocardiogram (BCG) artifacts from EEG. Exploiting the noise robustness of the sparse dictionary, they proposed a new cost function that models BCG artifacts and then removes them from the original EEG signals. Kashefpoor et al. (2019) developed a correlational label consistent K-SVD dictionary learning method applied to an EEG-based screening tool. This method was applied to speckle extraction of EEG signals and extracted spectral features in both the time and frequency domains. To address the artifacts caused by eye movement and blinking, Kanoga et al. (2019) proposed a multi-scale dictionary learning method to eliminate eye artifacts from single-channel measurements. Specifically, the time-domain waveforms related to repetitive phase events in EEG signals were learned within the framework of dictionary learning, and the proposed multi-scale dictionary learning method was used to represent the signal components on different timescales. To achieve highly accurate classification of EEG in BCI, Huang et al. (2020) developed a signal identification model using sparse representation and fast compressed residual convolutional neural networks (CNNs). The authors used common spatial patterns to extract EEG signal features and built a redundant dictionary from these features. The proposed deep model then served as a classifier to recognize the input EEG signals.
Although machine learning has achieved good classification performance in some application scenarios, there is still room to improve the accuracy and applicability of the classification. Since EEG data provide comprehensive information across different frequency bands to characterize emotions, it is desirable to design an ER method that utilizes the specific discriminative information of each frequency band and preserves the common discriminative information shared by multiple band signals. Motivated by the success of dictionary learning, in this study, we propose optimized projection and Fisher discriminative dictionary learning (DDL) for EEG-based ER. According to the Fisher discrimination criterion of minimum within-class sparse reconstruction error and maximum between-class sparse reconstruction error, we learn a discriminative projection to map the multiple band signals into a shared subspace and simultaneously build a shared dictionary that establishes the connection between different bands and represents the characteristics of the signals well. The joint learning of projection and dictionary therefore ensures that the common internal structure of the multiple frequency bands is mined in the subspace.
The main contributions of this study are as follows:
(1) Multiple frequency band collaborative learning is introduced into dictionary learning for EEG-based ER. This learning mechanism can efficiently integrate the band-independent information and the inter-band correlation information.
(2) Through the feature projection matrix, the data of multiple frequency bands are projected into a common projection subspace to keep the latent manifold of the EEG signals. Meanwhile, the discriminative dictionary is learned by enforcing the classification criterion so that the learned sparse codes have strong representation and discrimination ability.
(3) The joint optimization method has several benefits. Learning independent projection matrices makes the model easily extensible, and learning a dictionary in a subspace allows extraneous information in the original features to be discarded. In addition, the alternating optimization procedure ensures that the dictionary and the projection are optimized together.
(4) Extensive experiments on the SEED and DREAMER datasets demonstrate that the multiple band collaborative learning is effective and that the method can improve the discrimination ability of sparse coding in EEG-based ER.

BACKGROUND

Datasets
The experimental data in this study are taken from two public EEG emotion datasets: SEED (Zheng and Lu, 2015) and DREAMER (Katsigiannis and Ramzan, 2018). Table 1 briefly describes the two datasets. Both the SEED and DREAMER datasets were collected while subjects watched emotion-eliciting movies. In the SEED dataset, each subject participated in three experiments held in three separate time periods, corresponding to three sessions, and each session contains 15 EEG data trials. Thus, a total of 15 × 3 = 45 trials are formed per subject. SEED provides five frequency bands: δ band (1-3 Hz), θ band (4-7 Hz), α band (8-13 Hz), β band (14-30 Hz), and γ band (31-50 Hz). For the DREAMER dataset, the data recorded for each subject contain three parts: 18 experimental signal segments, 18 baseline signal segments corresponding to the relaxation state, and 18 corresponding labels. DREAMER provides EEG features for the frequency bands θ, α, and β.

Machine Learning-Based EEG Signal Processing Program
In machine learning-based EEG ER, feature extraction and emotion classification are the critical procedures. Taking the SEED dataset as an example, the process of constructing five frequency band sequences is described in Figure 1. First, the EEG signals collected by the BCI are preprocessed by filtering. Then, according to the characteristics of the different rhythms, the EEG signals are divided into several rhythmic signal components ranging from 0 to 50 Hz. Second, EEG features are extracted by various strategies. Time-domain, frequency-domain, and non-linear analysis methods are the three most commonly used types of EEG feature extraction methods. The time-domain features aim to capture the temporal information of EEG signals, such as higher-order crossings (HOC) (Petrantonakis and Hadjileontiadis, 2010), Hjorth features (Petrantonakis and Hadjileontiadis, 2010), and event-related potential (ERP) (Brouwer et al., 2015). The frequency-domain features aim to capture EEG emotion information from a frequency perspective; examples include rhythm (Bhatti et al., 2016), wavelet packet decomposition (WPD) (Wu et al., 2008), and approximate entropy (AE) (Ko et al., 2009). Non-linear features are extracted from the transformed phase space. They contain quantitative measures that represent the complex dynamic characteristics of the EEG signals, such as the Lyapunov exponent (Lyap) (Kutepov et al., 2020) and the correlation dimension (CorrDim) (Geng et al., 2011). Finally, machine learning methods are applied to the extracted feature sets to perform EEG emotion classification.
A common approach to handling multiple bands of EEG data with traditional dictionary learning methods is to directly concatenate the features of multiple bands in the high-dimensional space and treat this single feature vector as the input to the model. However, dictionary learning may not perform well in this setting because different band features usually carry different characteristics of EEG emotion.

Dictionary Learning
FIGURE 1 | The process of constructing multiple frequency band sequences (Wei et al., 2020).
Frontiers in Psychology | www.frontiersin.org
Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$ be a set of $n$ training signals of dimension $m$. To minimize the reconstruction error while satisfying the sparsity constraint, the sparse representation and dictionary learning of $X$ can be accomplished by

$$\min_{D,S} \|X - DS\|_F^2 \quad \text{s.t.} \quad \|s_i\|_0 \le T, \quad i = 1, \ldots, n, \tag{1}$$

where $D = [d_1, \ldots, d_K] \in \mathbb{R}^{m \times K}$ is a dictionary with $K$ atoms, $S = [s_1, \ldots, s_n] \in \mathbb{R}^{K \times n}$ is the sparse coefficient matrix of the signals $X$, and $s_i$ is the sparse coefficient vector of $x_i$ over $D$. $T$ is the sparsity constraint factor. The constraint $\|s_i\|_0 \le T$ requires the signal $x_i$ to have at most $T$ non-zero entries in its decomposition. It is not easy to find the optimal sparse solution using the $\ell_0$-norm regularization term; thus, an alternative formulation of Equation (1) replaces it with $\ell_1$-norm regularization as

$$\min_{D,S} \|X - DS\|_F^2 + \lambda \|S\|_1, \tag{2}$$

where $\lambda$ is the regularization parameter. Equation (2) can be optimized by many efficient $\ell_1$ optimization methods, such as the famous K-SVD algorithm (Aharon et al., 2006; Jiang et al., 2013). However, Equation (2) is an unsupervised learning framework. To learn a discriminative dictionary for classification tasks, different kinds of loss functions or the Fisher discrimination criterion are incorporated into dictionary learning. Fisher discrimination constraints on the atoms of the dictionary (Peng et al., 2020), on the sparse coefficients $S$ (Li et al., 2013), or on the reconstruction error of $X$ (Zheng and Sun, 2019; Zhang et al., 2021) strive to preserve the class distribution and geometric structure of the data. Suppose the data matrix $X$ consists of samples from $C$ different classes. From $X$, both the sub-dictionary $D_i$ and the sub-sparse coefficient matrix $S_i$ are learned for the $i$-th class data ($i = 1, 2, \ldots, C$). The whole dictionary $D$ is represented as $D = [D_1, D_2, \ldots, D_C]$. Let $W_w$ and $W_b$ denote the within-class and between-class reconstruction error scatter matrices of $X$, respectively; then

$$W_w = \sum_{j=1}^{n} \left(x_j - D\delta_{l_j}(s_j)\right)\left(x_j - D\delta_{l_j}(s_j)\right)^T \tag{3}$$

and

$$W_b = \sum_{j=1}^{n} \left(x_j - D\zeta_{l_j}(s_j)\right)\left(x_j - D\zeta_{l_j}(s_j)\right)^T, \tag{4}$$

where the function $\delta_{l_j}(\cdot)$ returns the sparse codes consistent with the class of $x_j$ ($j = 1, 2, \ldots, n$), and the function $\zeta_{l_j}(\cdot)$ returns the sparse codes not consistent with the class of $x_j$. A discriminative dictionary can then be learned by reducing the within-class diversity and increasing the between-class separation using Equations (3) and (4) (Zheng and Sun, 2019; Zhang et al., 2021).
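As a rough illustration of the sparse coding half of the ℓ1-regularized problem min ||X − DS||_F^2 + λ||S||_1, the sketch below keeps the dictionary D fixed and computes S with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is swapped in here for brevity and is not the K-SVD procedure cited above; the function names, the toy data, and the iteration count are assumptions made only for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, D, S, lam):
    """Value of ||X - D S||_F^2 + lam * ||S||_1."""
    return np.linalg.norm(X - D @ S, "fro") ** 2 + lam * np.abs(S).sum()

def ista_sparse_code(X, D, lam=0.01, n_iter=300):
    """Approximate the sparse codes S for a fixed dictionary D with ISTA."""
    # Step size 1/L, where L = 2 * sigma_max(D)^2 is the Lipschitz constant
    # of the gradient of the quadratic term.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    S = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ S - X)  # gradient of ||X - D S||_F^2
        S = soft_threshold(S - step * grad, step * lam)
    return S

# Toy data (hypothetical): 20 unit-norm atoms in R^8, 5 signals built from 2 atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 20))
D /= np.linalg.norm(D, axis=0)
S_true = np.zeros((20, 5))
S_true[[3, 11], :] = 1.0
X = D @ S_true
S_hat = ista_sparse_code(X, D, lam=0.01)
```

With this step size, each ISTA iteration cannot increase the objective, so `objective(X, D, S_hat, 0.01)` is no larger than its value at the all-zero starting point. A full dictionary learning method would alternate this coding step with an update of D.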

Objective Function
Here, we describe in detail the optimized projection and Fisher DDL model for collaborative learning of multiple frequency band EEG signals. The training framework of the OPFDDL model is illustrated in Figure 2. We learn a discriminative projection to map multiple frequency band EEG signals into a common subspace; simultaneously, we learn a common discriminative dictionary to encode the band-invariant information of the multiple frequency bands. In particular, to promote the discrimination ability of the model, we formulate the model according to the Fisher discrimination criterion (Gong et al., 2019) under the structure of dictionary learning.
Let $X^r = \{x_j^r\}$ denote the signal set of frequency band $r$ ($r = 1, \ldots, R$), where $R$ is the number of frequency bands and $x_j^r$ is the $j$th sample in $X^r$. To build the connection between different frequency bands and exploit the specific characteristics of each representation, we project $x_j^r$ into a feature subspace as $z_j^r = Q_r x_j^r$ using a transformation matrix $Q_r \in \mathbb{R}^{m \times d_r}$. We therefore obtain $\{z_j^r\}_{r=1}^{R} = \{Q_r x_j^r\}_{r=1}^{R}$ as the feature representations for the $R$ frequency bands. The within-class reconstruction error $J_w^r$ and the between-class reconstruction error $J_b^r$ of the $r$th frequency band in the projection subspace are then defined by Equations (5) and (6), where

$$W_w^r = \sum_{j=1}^{n_r} \left(x_j^r - D\delta_{l_j}(s_j^r)\right)\left(x_j^r - D\delta_{l_j}(s_j^r)\right)^T$$

is the within-class scatter matrix for sparse coding of the $r$th frequency band, and

$$W_b^r = \sum_{j=1}^{n_r} \left(x_j^r - D\zeta_{l_j}(s_j^r)\right)\left(x_j^r - D\zeta_{l_j}(s_j^r)\right)^T$$

is the between-class scatter matrix for sparse coding of the $r$th frequency band. From the classification point of view, minimizing the within-class scatter and maximizing the between-class scatter in the dictionary learning-based classifier is expressed by Equation (7). With $\tilde{Q}$, $\tilde{W}_w$, and $\tilde{W}_b$ collecting the corresponding quantities of all $R$ bands, Equation (7) can be rewritten as Equation (8). The projection matrix $\tilde{Q}$ is constrained to be orthogonal, which is highly effective in the optimization process. The solution of Equation (8) is then pursued through Equation (9). Note that the parameter $\mu$ is an adaptive weight obtained by a closed-form solution rather than a manually tuned parameter.
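To make the within-class and between-class scatter matrices for one frequency band concrete, the following sketch builds them from class-consistent and class-inconsistent sparse codes. It is only an illustration under stated assumptions: the samples are taken to live in the same space as the dictionary atoms, each atom carries a class label (`atom_labels`, a hypothetical name), and the δ and ζ functions are implemented as masks that zero out the coefficients of the other group of atoms.

```python
import numpy as np

def class_scatter(X, D, S, labels, atom_labels):
    """Return (W_w, W_b) for one band.

    X: (m, n) samples, D: (m, K) dictionary, S: (K, n) sparse codes,
    labels: (n,) class of each sample, atom_labels: (K,) class of each atom.
    """
    # mask[k, j] is True when atom k belongs to the class of sample j.
    mask = atom_labels[:, None] == labels[None, :]
    E_w = X - D @ (S * mask)    # residuals using the class-consistent codes
    E_b = X - D @ (S * ~mask)   # residuals using the class-inconsistent codes
    # Summing e_j e_j^T over all samples equals E @ E.T.
    return E_w @ E_w.T, E_b @ E_b.T

# Toy data (hypothetical sizes): 2 classes, 3 atoms per class, 10 samples in R^4.
rng = np.random.default_rng(1)
m, n = 4, 10
atom_labels = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([0, 1] * 5)
D = rng.standard_normal((m, atom_labels.size))
S = rng.standard_normal((atom_labels.size, n))
X = rng.standard_normal((m, n))
W_w, W_b = class_scatter(X, D, S, labels, atom_labels)
```

The trace of `W_w` equals the total squared within-class reconstruction error, which is the quantity a Fisher-style criterion tries to keep small while keeping the trace of `W_b` large.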

Optimization
In the following, the alternating optimization approach is used to update the parameters {Q, D, µ} in Equation (9).
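(1) Update step for $\tilde{Q}$. With $D$ and $\mu$ fixed and $\tilde{W}_w$ and $\tilde{W}_b$ known, $\tilde{Q}$ is obtained by solving Equation (10). The projection matrix $\tilde{Q}$ is formed by the eigenvectors corresponding to the $d$ smallest eigenvalues of Equation (10).
(2) Update step for $D$. Let $\tilde{X}$ be the block-diagonal matrix whose diagonal blocks are $X^1, \ldots, X^R$, and let $\delta_{l_j}^r = [\delta_{l_j}(s_1^r), \delta_{l_j}(s_2^r), \ldots, \delta_{l_j}(s_{n_r}^r)]$ and $\zeta_{l_j}^r = [\zeta_{l_j}(s_1^r), \zeta_{l_j}(s_2^r), \ldots, \zeta_{l_j}(s_{n_r}^r)]$, with $[\delta_{l_j}^1, \delta_{l_j}^2, \ldots, \delta_{l_j}^R]$ and $[\zeta_{l_j}^1, \zeta_{l_j}^2, \ldots, \zeta_{l_j}^R]$ collecting these codes over all bands. With $\tilde{Q}$ and $\mu$ known, Equation (9) can be rewritten as Equation (11). For each column $\tilde{X}_k$ of $\tilde{X}$, the optimization of $D$ reduces to the problem in Equation (12). Then, $D$ is updated according to Equation (13), where $\lambda_D$ is the step size.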
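The eigenvector selection in the projection update can be sketched as follows. Because the exact matrix in Equation (10) is not reproduced here, the example assumes the common trace-difference form W_w − µW_b; that choice, the function name `update_projection`, and the toy matrices are assumptions for illustration only. The projection is stored with orthonormal rows so that it maps a feature vector x to Qx.

```python
import numpy as np

def update_projection(W_w, W_b, mu, d):
    """Return a (d, dim) projection with orthonormal rows."""
    M = W_w - mu * W_b            # ASSUMPTION: trace-difference matrix
    M = 0.5 * (M + M.T)           # symmetrize against round-off
    # numpy.linalg.eigh returns eigenvalues in ascending order, so the first
    # d columns are the eigenvectors of the d smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, :d].T

# Toy positive semi-definite scatter matrices (hypothetical).
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 30))
B = rng.standard_normal((6, 30))
W_w, W_b = A @ A.T, B @ B.T
Q = update_projection(W_w, W_b, mu=0.5, d=3)
```

Under this assumption, `Q @ Q.T` is the identity and the trace of `Q @ (W_w - 0.5 * W_b) @ Q.T` equals the sum of the three smallest eigenvalues, which is the minimum over all projections with orthonormal rows.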
(4) Update step for $\mu$. Recall that the matrices $\tilde{W}_b$ and $\tilde{W}_w$ can be built from the obtained $\tilde{Q}$ and $S$. With these quantities fixed, $\mu$ is obtained by setting $\partial L / \partial \mu = 0$, which yields the closed-form solution in Equation (15).
Based on the above analysis, the implementation process of OPFDDL is described in Algorithm 1. We initialize the sub-dictionary for each class with the K-SVD algorithm and then combine the sub-dictionaries to form the initial dictionary $D$.
Once the projection matrix $Q$ and the dictionary $D$ are learned, testing proceeds as follows. The testing procedure of OPFDDL is illustrated in Figure 3. For each testing EEG signal $z$, its $r$th frequency band feature is denoted as $z^r$. We map $z^r$ into the projection subspace using $Q_r$ and assign it the class label with the smallest class-wise reconstruction error. Finally, we use majority voting over the $R$ frequency bands to identify the class label of signal $z$, i.e., $z$ is assigned to the class that receives the largest number of votes, where the vote count of class $j$ is the number of bands that assign $z$ to class $j$.
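The two testing stages, per-band classification by smallest class-wise reconstruction error followed by majority voting across bands, can be sketched as below. This is a simplified illustration: the projected feature `y` and its sparse code `s` are assumed to be given, `atom_labels` (a hypothetical name) marks which class each atom belongs to, and ties in the vote are broken in favor of the label encountered first.

```python
import numpy as np
from collections import Counter

def classify_band(y, D, s, atom_labels):
    """Label one projected band feature by the smallest class-wise residual."""
    classes = np.unique(atom_labels)
    errors = []
    for c in classes:
        idx = atom_labels == c
        # Reconstruct y using only the atoms and coefficients of class c.
        errors.append(np.linalg.norm(y - D[:, idx] @ s[idx]))
    return int(classes[int(np.argmin(errors))])

def majority_vote(band_labels):
    """Return the label predicted by the largest number of bands."""
    return Counter(band_labels).most_common(1)[0][0]

# Toy example (hypothetical): orthonormal atoms, two atoms per class.
D = np.eye(4)
atom_labels = np.array([0, 0, 1, 1])
y = np.array([0.0, 0.0, 1.0, 0.5])
s = D.T @ y                      # exact code because D is orthonormal here
band_label = classify_band(y, D, s, atom_labels)
final_label = majority_vote([band_label, 1, 0])
```

In this toy case the class-1 atoms reconstruct `y` exactly, so the band is labeled 1, and two of the three band votes agree on class 1.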

Algorithm 1
The OPFDDL algorithm
Input: Multiple frequency band EEG signals $X^r$ ($r = 1, 2, \ldots, R$) with their class labels.
Output: Projection matrix $Q$ and dictionary $D$.
Initialization: Initialize $D$ and $S$ using the K-SVD algorithm and initialize $\tilde{Q}$ such that $\tilde{Q}\tilde{Q}^T = I$.
Repeat
    Sparse codes update: compute the sparse code $s$ for each training sample using Equation (14);
    Projection matrix update: compute $\tilde{Q}$ using Equation (10);
    Dictionary update: compute $D$ using Equations (11-13);
    Adaptive weight update: compute $\mu$ using Equation (15);
Until convergence

EXPERIMENT

Experimental Settings
Following the study of Li Y. et al. (2019), we used three feature extraction methods on the SEED dataset: differential entropy (DE), power spectral density (PSD), and fractal dimension (FD). We extracted the EEG features over all frequency bands for each channel in 1-s windows with no overlap. For each subject, we used 10 randomly selected trials for model training and the remaining 5 trials for testing. The classification performance corresponding to each period is recorded for each subject. For the DREAMER dataset, to balance the number and length of the segments, we divided the 60-s EEG signals into 59 blocks with an overlap rate of 50%. The DE feature extraction method was applied, and 14-dimensional features were obtained for each frequency band. For each subject, we trained our model using 12 randomly selected trials and used the remaining 6 trials for testing. We compared our proposed model with five machine learning methods: SVM (Cortes and Vapnik, 1995), K-SVD (Aharon et al., 2006), PCB-ICL-TSK (Ni et al., 2020b), DDL (Zhou et al., 2012), and dictionary pair learning (DPL) (Ameri et al., 2016). The Gaussian kernel and Gaussian fuzzy membership were used in SVM and PCB-ICL-TSK, respectively. The parameters of the comparison methods were set according to the default settings of the corresponding methods. In OPFDDL, the dimension of the projection subspace was set to 90% of the dimension of the EEG signal features. The number of atoms in each class was selected from {10, 15, 20, 25, 30, 35}. The λ parameter in Equation (2) was set to 0.01. We used 5-fold cross-validation to select the optimal parameters, and we performed five independent runs to evaluate the classification accuracy of all methods.
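As a quick check of the DREAMER segmentation arithmetic and a sketch of a DE feature, the example below counts sliding windows and computes differential entropy under a Gaussian assumption, for which DE = 0.5 ln(2πeσ²). The 2-s window length is inferred from the stated 59 blocks and 50% overlap rather than given in the text, and the function names and toy samples are hypothetical.

```python
import math

def count_windows(duration_s, window_s, overlap):
    """Number of full windows when consecutive windows overlap by `overlap`."""
    step = window_s * (1.0 - overlap)
    return int(math.floor((duration_s - window_s) / step)) + 1

def differential_entropy(samples):
    """DE of a band-limited segment, ASSUMING it is Gaussian distributed."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((v - mean) ** 2 for v in samples) / n
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

# A 60-s signal cut into 2-s windows with 50% overlap gives 59 blocks.
n_blocks = count_windows(60, 2, 0.5)

# A unit-variance toy segment has DE = 0.5 * ln(2 * pi * e).
de_value = differential_entropy([1.0, -1.0, 1.0, -1.0])
```

In practice, the DE value would be computed per channel on the band-pass filtered signal of each block, giving one feature per channel and frequency band.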

Experiment Results on the SEED Dataset
In this subsection, we performed comparison experiments on the SEED dataset using various combinations of frequency bands and various features. The average accuracies of all methods with the three features are summarized in Table 2. From these results, we make the following observations: (1) Among the different frequency band combinations, every method performs best when all frequency bands are used. For example, the classification accuracy of OPFDDL using all frequency bands is 6.70, 3.83, and 2.17% higher than that using the bands β + γ, α + β + γ, and θ + α + β + γ, respectively. The classification accuracy of SVM using all frequency bands is 3.57, 2.90, and 1.76% higher than that using β + γ, α + β + γ, and θ + α + β + γ, respectively. In addition, in most cases, the SDs of all methods are small when all five bands are used. This demonstrates that multiple bands are helpful for EEG-based ER, because the features of each band have discrimination ability and the five bands are complementary for distinguishing EEG emotions. (2) The classification performance of the three features is comparable. The DE feature performs slightly better and shows an advantage in most cases. The classification accuracy of OPFDDL using the DE feature is 88.87%. This indicates that the DE feature is suitable for EEG emotion signals. (3) OPFDDL outperforms all comparison methods, especially when all five bands are used, because it can effectively integrate band-independent information and inter-band correlation information. These results indicate that direct concatenation of the five frequency bands of EEG data cannot fully exploit the inherent distinguishing characteristics of the data. Given both the information common to all bands and the information specific to each band, it is important to jointly learn multiple band representations. (4) Compared with K-SVD and ODFDL, OPFDDL generates a shared dictionary over all frequency bands in the projected subspace, which maintains the data structure of multiple frequency bands. In addition, based on the Fisher discrimination criterion of maximizing within-class compactness and maximizing between-class separation, OPFDDL can learn a discriminative dictionary from the cooperation of multiple frequency bands.
To further validate the discrimination ability of OPFDDL, Figure 4 reports the confusion matrices of the OPFDDL model with the DE feature. As shown in Figure 4, OPFDDL achieves better classification performance on positive and neutral emotions than on negative emotions in all cases. This result is similar to that reported in previous studies (Zheng and Lu, 2015; Li Y. et al., 2019). It suggests that subjects may have different EEG signals when experiencing negative emotions and similar EEG signals when experiencing positive and neutral emotions. In addition, OPFDDL achieves the best classification performance when all five frequency bands are used (see Figure 4D).
FIGURE 4 | Confusion matrices of OPFDDL of frequency bands using differential entropy (DE) features.

Experiment Results on the DREAMER Dataset
In this subsection, we performed comparison experiments on the DREAMER dataset using the DE feature. As for the SEED dataset, our model is compared with the abovementioned five methods. In the experiment, we verified the performance of OPFDDL according to valence, arousal, and dominance. The accuracy of all methods under the frequency bands θ + α + β is shown in Table 3. Consistent with the results on the SEED dataset, OPFDDL performed best among all comparison methods. Based on the Fisher discrimination criterion, OPFDDL can learn the intrinsic relationships of the EEG bands and obtain a discriminative dictionary from the cooperation of multiple frequency bands in the projection subspace. In addition, the joint optimization strategy, which addresses the shared projection subspace and dictionary learning together, can incrementally enhance the recognition performance of our proposed model. Thus, our proposed model can utilize more distinctive representations of the multiple frequency bands of EEG signals. Then, we recorded the average accuracies of each subject for the OPFDDL model using the DE feature in terms of arousal, valence, and dominance. The experimental results are shown in Table 4. The proposed OPFDDL model achieved satisfactory recognition performance for all three dimensions of arousal, valence, and dominance. Based on the structure of dictionary learning and the principles of projection and the Fisher discrimination criterion, OPFDDL makes better use of the discriminative information of different frequency bands and has stronger generalization ability, so it can be effectively used in EEG emotion classification tasks.

Parameter Variations
In this subsection, we first discuss the convergence of OPFDDL using the DE feature of all frequency bands on the SEED and DREAMER datasets. The stopping threshold for the iterations was set to $10^{-3}$. Figure 5 plots the accuracy against the number of iterations for one subject in each of the two datasets. The results verify the convergence of OPFDDL: the model converges within 20 iterations.
Then, we discuss the number of atoms used in OPFDDL. The number of atoms in each class, $K_c$, was increased from 10 to 35 in increments of 5. Figure 6 plots the accuracy against the parameter $K_c$. The results show that after an initial sharp increase, the classification accuracy of OPFDDL becomes stable beyond $K_c = 20$. In addition, the trend of the accuracy is consistent across the two datasets. Thus, OPFDDL achieves acceptable classification performance with small dictionary sizes.

CONCLUSIONS
Most previous machine learning methods extract feature representations for all frequency bands together without considering the specific discriminative information of the different frequency bands. In this study, we propose collaborative learning of multiple frequency bands for EEG-based ER. In particular, our model integrates projection and dictionary learning based on the Fisher discrimination criterion. For subspace projection optimization, a shared subspace is employed for all frequency bands such that the band-specific representations and the shared band-invariant information can be utilized simultaneously. For dictionary learning optimization, a shared dictionary is learned in the projected subspace, where the Fisher discrimination criterion is used to minimize the within-class sparse reconstruction error and maximize the between-class sparse reconstruction error. The joint learning strategy allows the model to be extended easily. Consequently, we obtain a discriminative dictionary with a small size. We have evaluated OPFDDL experimentally and demonstrated its performance on two real-world EEG emotion datasets, i.e., SEED and DREAMER. In further studies, we will try to utilize and test more discriminative sparse representation criteria in our model. In addition, we only considered subject-dependent classification in EEG emotion identification. Applying this model to subject-independent classification remains a challenging task.

DATA AVAILABILITY STATEMENT
The SEED dataset analyzed for this study can be found at this link (http://bcmi.sjtu.edu.cn/~seed/seed.html). The DREAMER dataset analyzed for this study can be found at this link (https://zenodo.org/record/546113).

AUTHOR CONTRIBUTIONS
XG, JZho, and JZhu conceived and developed the theoretical framework of the study. All authors carried out the experiments and data processing, and drafted the study.