Singular Learning of Deep Multilayer Perceptrons for EEG-Based Emotion Recognition

Guo, Weili; Li, Guangyu; Lu, Jianfeng; Yang, Jian

doi:10.3389/fcomp.2021.786964

ORIGINAL RESEARCH article

Front. Comput. Sci., 21 December 2021

Sec. Human-Media Interaction

Volume 3 - 2021 | https://doi.org/10.3389/fcomp.2021.786964

This article is part of the Research Topic: Bridging the Gap between Machine Learning and Affective Computing

Weili Guo^1,2

Guangyu Li^1,2*

Jianfeng Lu^2

Jian Yang^1,2*

¹PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology, Nanjing, China
²School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Human emotion recognition is an important issue in human–computer interaction, and electroencephalograph (EEG) has been widely applied to emotion recognition because of its high reliability. In recent years, methods based on deep learning have reached state-of-the-art performance in EEG-based emotion recognition. However, the parameter space of deep neural networks contains singularities, which may dramatically slow down the training process. It is therefore worthwhile to investigate the specific influence of singularities when deep neural networks are applied to EEG-based emotion recognition. In this paper, we focus on this problem and analyze the singular learning dynamics of deep multilayer perceptrons theoretically and numerically. The results can help in designing better algorithms that overcome the serious influence of singularities in deep neural networks for EEG-based emotion recognition.

1 Introduction

Emotion recognition is a fundamental task in affective computing and has attracted considerable attention in recent years (Mauss and Robinson, 2009). Human emotion can be expressed through external and internal signals: external signals usually include facial expressions, body actions, and speech, whereas electroencephalograph (EEG) and galvanic skin response (GSR) are typical internal signals. EEG measures the electrical activity of the brain using electrodes placed along the scalp, and it is rather reliable; therefore, EEG has played an increasingly significant role in research on human emotion recognition in recent years (Yin et al., 2021).

Research on EEG-based emotion recognition mainly addresses two aspects: how to extract better features from EEG signals and how to construct a model with better performance. For the first aspect, researchers have investigated feature extraction methods for EEG signals in the time domain, frequency domain, and time–frequency domain, and a series of results have been reported (Fang et al., 2020; Nawa et al., 2020). In this paper, we focus on the second aspect, i.e., the computational model, for which many models have been proposed to recognize emotions from EEG signals (Zong et al., 2016; Yang et al., 2018a; Zhang et al., 2019; Cui et al., 2020). In recent years, deep learning has achieved great success in many fields (Yang et al., 2018b; Yang et al., 2019; Basodi et al., 2020; Zhu and Zhang, 2021), and many works address EEG-based emotion recognition by applying deep neural networks (DNNs) (Cao et al., 2020; Natarajan et al., 2021), where deep learning methods show significant superiority over conventional methods (Ng et al., 2015; Tzirakis et al., 2017; Hassan et al., 2019). However, the learning dynamics of DNNs, including deep multilayer perceptrons (MLPs), deep belief networks, and deep convolutional neural networks, are often affected by singularities that exist in the parameter space of DNNs (Nitta, 2016).

Due to the influence of singularities, the training of DNNs often becomes very slow and the plateau phenomenon, in which the training error stays nearly constant for many epochs, can often be observed. When DNNs are applied to EEG-based emotion recognition, this severe negative effect of singularities on the learning process is also inevitable, so the efficiency and performance of the networks cannot be guaranteed. However, up to now, few studies have investigated this problem, which is the focus of this paper. The main contribution of this paper is a theoretical and numerical analysis of singular learning in DNNs for EEG-based emotion recognition. We choose deep MLPs as the learning machine because they are typical DNNs, and the results are also representative of other DNNs. The types of singularities in the parameter space are analyzed and the specific influence of the singularities is clearly shown. Based on the results obtained in this paper, related algorithms can be designed to overcome this issue.

The rest of this paper is organized as follows. A brief review of related work is presented in Section 2. Section 3 presents a theoretical analysis of singularities in deep MLPs for EEG-based emotion recognition, and Section 4 numerically analyzes the learning dynamics near singularities. Section 5 gives the conclusion and discussion.

2 Related Work

In this section, we provide a brief overview of previous work on EEG-based emotion recognition and singular learning of DNNs.

In recent years, due to the high accuracy and stability of EEG signals, EEG-based algorithms have attracted ever-increasing attention in the emotion recognition field. To extract better features from EEG signals, researchers have proposed various feature extraction models (Zheng et al., 2014; Zheng, 2017; Tao et al., 2020; Zhao et al., 2021), such as power spectral density (PSD), differential entropy (DE), and differential asymmetry (DASM). By using PSD and DE to extract dimension-reduced features of EEG signals, Fang et al. (2020) chose the original features and the dimension-reduced features as a multi-feature input and verified the validity of the proposed method experimentally. Li et al. (2020) integrated psychoacoustic knowledge and raw waveform embedding within an augmented feature space. Song et al. (2020) employed an additional branch to characterize the intrinsic dynamic relationships between different EEG channels and presented a type of sparse graphic representation to extract more discriminative features. Besides feature extraction methods, more attention has been paid to emotion classification. Given the excellent capabilities of deep learning, various types of DNNs have been widely used in emotion classification (Li et al., 2018; Li et al., 2019; Ma et al., 2019; Atmaja and Akagi, 2020; Cui et al., 2020; Zhong et al., 2020), including deep convolutional neural networks, deep MLPs, long short-term memory (LSTM)-based recurrent neural networks, and graph neural networks. The obtained results show that these DNN models can provide superior performance compared to previous models (Yang et al., 2021a; Yang et al., 2021b).

As mentioned above, various DNNs have been widely used in EEG-based emotion recognition; however, the training processes of DNNs often encounter many difficulties. Although numerous studies have sought to explain these difficulties, the underlying mechanism is still far from being revealed. As there are singularities in the parameter space of DNNs where the Fisher information matrix is singular, the singular learning dynamics of DNNs have been studied and have attracted more and more attention. As the basis of DNNs, traditional neural networks often suffer from the serious influence of various singularities (Amari et al., 2006; Guo et al., 2018; Guo et al., 2019), and the learning dynamics of DNNs are also easily influenced by singularities. Nitta (2016, 2018) analyzed the types of singularity in DNNs and deep complex-valued neural networks. Ainsworth and Shin (2020) investigated the plateau phenomenon in ReLU-based neural networks. By using the spectral information of the Fisher information matrix, Liao et al. (2020) proposed an algorithm to accelerate the training process of DNNs.

In view of the serious influence of singularities on DNNs, the training processes of DNNs will also encounter difficulties when DNNs are applied to EEG-based emotion recognition. Thus, it is necessary to carry out a theoretical and numerical analysis to reveal the mechanism and to propose related algorithms that overcome the influence of singularities.

3 Theoretical Analysis of Singular Learning Dynamics of Deep Multilayer Perceptrons

In this section, we theoretically analyze the learning dynamics near singularities of deep MLPs for EEG-based emotion recognition.

3.1 Learning Paradigm of Deep Multilayer Perceptrons

First, we introduce a typical learning paradigm of deep MLPs. Consider a typical deep multilayer perceptron with $L$ hidden layers (the architecture of the network is shown in Figure 1). Let $M_{i}$ be the number of neurons in hidden layer $i$, $M_{0}$ the dimension of the input layer, and $M_{L + 1}$ the dimension of the output layer. $W_{j k}^{(i)}$ denotes the weight connecting the $j$th node of the previous layer to the $k$th node of hidden layer $i$, and $W_{p q}^{(L + 1)}$ denotes the weight connecting the $p$th node of hidden layer $L$ to the $q$th node of the output layer, for $1 \le i \le L$, $1 \le j \le M_{i - 1}$, $1 \le k \le M_{i}$, $1 \le p \le M_{L}$, and $1 \le q \le M_{L + 1}$. Then $θ = \{W^{(1)}, W^{(2)}, \dots, W^{(L + 1)}\}$ represents all the parameters of the network, where $W^{(i)} = [W_{1}^{(i)}, W_{2}^{(i)}, \dots, W_{M_{i}}^{(i)}]$ and $W_{j}^{(i)} = {[W_{1 j}^{(i)}, W_{2 j}^{(i)}, \dots, W_{M_{i - 1} j}^{(i)}]}^{T}$ for $1 \le i \le L + 1$ and $1 \le j \le M_{i}$. In this paper, the widely used log-sigmoid function $ϕ (x) = \frac{1}{1 + e^{- x}}$ is adopted as the activation function of the hidden layers and the linear (purelin) function $ψ (x) = x$ is adopted as the activation function of the output layer. Then, for the input $x \in R^{M_{0}}$, denoting the input to hidden layer $k$ as $X^{(k - 1)}$ for $1 \le k \le L$ and the input to the output layer as $X^{(L)}$, the mathematical model of the network can be described as follows:

$$f (x, θ) = ψ ({(W^{(L + 1)})}^{T} X^{(L)}) = {(W^{(L + 1)})}^{T} X^{(L)} . \tag{1}$$

FIGURE 1. Architecture of deep MLPs.

For $1 \le k \le L$, $X^{(k)}$ can be computed as $X^{(k)} = ϕ (X^{(k - 1)}, W^{(k)}) = ϕ ({(W^{(k)})}^{T} X^{(k - 1)})$, where $ϕ$ is applied element-wise and $X^{(0)}$ is the input $x$.
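The forward computation of Eq. 1 can be illustrated with a minimal sketch in plain Python. The function names (`sigmoid`, `layer`, `forward`) and the storage of each weight matrix as a list of columns are illustrative assumptions, not part of the original experiments, which were run in Matlab.

```python
import math


def sigmoid(z):
    """Log-sigmoid activation phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(X_prev, W):
    """Hidden layer: X^(k) = phi((W^(k))^T X^(k-1)).

    W is stored as a list of columns (an assumption of this sketch):
    W[j] holds the weights feeding node j of the layer.
    """
    return [sigmoid(sum(w * x for w, x in zip(col, X_prev))) for col in W]


def forward(x, hidden_weights, output_weights):
    """Eq. 1 for a single output neuron: f(x, theta) = (W^(L+1))^T X^(L)."""
    X = x
    for W in hidden_weights:
        X = layer(X, W)
    return sum(w * h for w, h in zip(output_weights, X))


# Toy 2-2-2-1 network with all hidden weights set to zero: every hidden
# activation is phi(0) = 0.5, so the output is 0.5 * (1.0 + 3.0).
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(forward([1.0, -2.0], [zeros, zeros], [1.0, 3.0]))
```

With a linear output layer, the network output is simply the weighted sum of the last hidden activations, which is why the output weights play a special role in the singularity analysis of Section 3.2.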

We choose the square loss function to measure the error:

$$l (y, x, θ) = \frac{1}{2} {(y - f (x, θ))}^{2}, \tag{2}$$

and use the gradient descent method to minimize the loss:

$$θ_{t + 1} = θ_{t} - η \frac{\partial l (y, x, θ_{t})}{\partial θ_{t}}, \tag{3}$$

where η is the learning rate.
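As a hedged illustration of Eqs. 2 and 3, the sketch below performs one gradient descent update on a single sample for a network with a single output neuron, using standard backpropagation. The helper names `forward_all` and `sgd_step`, the per-sample update, and the column-wise weight storage are assumptions of this sketch; the paper does not state whether its updates were per-sample or batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_all(x, hidden, out):
    """Return the activations [X^(0), ..., X^(L)] and the output f(x, theta)."""
    acts = [list(x)]
    for W in hidden:  # W[j] = incoming weights of node j (assumed layout)
        acts.append([sigmoid(sum(w * a for w, a in zip(col, acts[-1])))
                     for col in W])
    f = sum(w * a for w, a in zip(out, acts[-1]))
    return acts, f


def sgd_step(x, y, hidden, out, eta):
    """One update of Eq. 3 for the square loss of Eq. 2 on a single sample."""
    acts, f = forward_all(x, hidden, out)
    err = f - y  # dl/df for l = 0.5 * (y - f)^2
    new_out = [w - eta * err * a for w, a in zip(out, acts[-1])]
    # delta[j] = dl/d(pre-activation of node j) in the last hidden layer,
    # using phi'(z) = phi(z) * (1 - phi(z)).
    delta = [err * w * a * (1.0 - a) for w, a in zip(out, acts[-1])]
    new_hidden = [None] * len(hidden)
    for k in range(len(hidden) - 1, -1, -1):
        prev = acts[k]  # input to hidden[k]
        new_hidden[k] = [[w - eta * d * p for w, p in zip(col, prev)]
                         for col, d in zip(hidden[k], delta)]
        if k > 0:
            # Propagate the error signal through the old weights of layer k.
            delta = [p * (1.0 - p) * sum(hidden[k][j][i] * delta[j]
                                         for j in range(len(delta)))
                     for i, p in enumerate(prev)]
    return new_hidden, new_out
```

Note that the error signal for the last hidden layer carries the output weight as a factor, which is the algebraic origin of the zero weight singularity discussed in Section 3.2.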

3.2 Singularities of Deep Multilayer Perceptrons in Electroencephalograph-Based Emotion Recognition

In this paper, we focus on the mechanism of the singular learning dynamics of deep MLPs applied to EEG-based emotion recognition rather than on achieving the best performance; therefore, the network need not be very large, and a size sufficient to capture the essence of the singular learning dynamics is adequate. Without loss of generality, we choose deep MLPs with two hidden layers and a single output neuron, i.e., $L = 2$ and $M_{3} = 1$. In this case, $W^{(3)} = W_{1}^{(3)} = {[W_{11}^{(3)}, W_{21}^{(3)}, \dots, W_{M_{2} 1}^{(3)}]}^{T}$, which we simply denote as $W^{(3)} = {[W_{1}^{(3)}, W_{2}^{(3)}, \dots, W_{M_{2}}^{(3)}]}^{T}$. Then, the deep MLPs can be rewritten as:

$$\begin{aligned} f (x, θ) & = {(W^{(3)})}^{T} ϕ (ϕ (x, W^{(1)}), W^{(2)}) \\ & = \sum_{j = 1}^{M_{2}} W_{j}^{(3)} ϕ (ϕ (x, W^{(1)}), W_{j}^{(2)}) . \end{aligned} \tag{4}$$

Next, we analyze the types of singularities. From Eq. 4, we can see that if one output weight equals zero, e.g., $W_{j}^{(3)} = 0$, then whatever the values of $W^{(1)}$ and $W_{j}^{(2)}$ are, the contribution of unit $j$ to the output is always 0 and the unit effectively vanishes. As the value of $W_{j}^{(2)}$ then has no effect on the output of the deep MLPs, the training process will encounter difficulties on the subspace $R_{1} = \{θ | W_{j}^{(3)} = 0\}$. Besides the above singularity, if two columns of the weight matrix $W^{(2)}$ overlap, e.g., $W_{i}^{(2)} = W_{j}^{(2)}$, then

$$\begin{array}{l} W_{i}^{(3)} ϕ (ϕ (x, W^{(1)}), W_{i}^{(2)}) \\ + W_{j}^{(3)} ϕ (ϕ (x, W^{(1)}), W_{j}^{(2)}) \\ = (W_{i}^{(3)} + W_{j}^{(3)}) ϕ (ϕ (x, W^{(1)}), W_{i}^{(2)}) \end{array}$$

keeps the same value as long as $W_{i}^{(3)} + W_{j}^{(3)}$ is fixed, regardless of the particular values of $W_{i}^{(3)}$ and $W_{j}^{(3)}$. Therefore, we can identify their sum $W = W_{i}^{(3)} + W_{j}^{(3)}$; nevertheless, each of $W_{i}^{(3)}$ and $W_{j}^{(3)}$ remains unidentifiable. Thus, the training will also encounter difficulties on the subspace $R_{2} = \{θ | W_{i}^{(2)} = W_{j}^{(2)}\}$.
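The reason both subspaces hinder gradient descent can be made explicit by differentiating the loss of Eq. 2 for the network of Eq. 4. The following is a sketch in the notation above; the shorthand $X^{(1)} = ϕ (x, W^{(1)})$, $h_{j} = ϕ (X^{(1)}, W_{j}^{(2)})$, and $e = y - f (x, θ)$ is introduced here only for brevity, and the derivation uses the log-sigmoid identity $ϕ' (z) = ϕ (z) (1 - ϕ (z))$.

```latex
\begin{aligned}
\frac{\partial l}{\partial W_{j}^{(3)}} &= - e \, h_{j}, \\
\frac{\partial l}{\partial W_{j}^{(2)}} &= - e \, W_{j}^{(3)} \, h_{j} (1 - h_{j}) \, X^{(1)} .
\end{aligned}
```

On $R_{1}$, the second gradient vanishes identically because of the factor $W_{j}^{(3)}$, so the update of Eq. 3 leaves $W_{j}^{(2)}$ unchanged. On $R_{2}$, $h_{i} = h_{j}$, so $W_{i}^{(3)}$ and $W_{j}^{(3)}$ receive identical updates and the gradient carries no information about their difference.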

To sum up the above analysis, it can be seen that there are at least two types of singularities:

(1) Zero weight singularity: $R_{1} = \{θ | W_{j}^{(3)} = 0\}$,

(2) Overlap singularity: $R_{2} = \{θ | W_{i}^{(2)} = W_{j}^{(2)}\}$.
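Both types of singularity can be checked numerically on a toy instance of Eq. 4. The sketch below uses a hypothetical 2-2-2-1 network with arbitrarily chosen weights (none of these values come from the experiments) and verifies that the output is insensitive to the unidentifiable parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp(x, W1, W2, W3):
    """Eq. 4 with column-wise weights: f = sum_j W3[j] * phi(W2[j] . X1)."""
    X1 = [sigmoid(sum(w * xi for w, xi in zip(col, x))) for col in W1]
    X2 = [sigmoid(sum(w * h for w, h in zip(col, X1))) for col in W2]
    return sum(w3 * h for w3, h in zip(W3, X2))


x = [0.3, -1.2]
W1 = [[0.5, -0.4], [1.1, 0.7]]

# Zero weight singularity R1: with W3[0] = 0, the column W2[0] is irrelevant.
a = mlp(x, W1, [[0.2, -0.9], [0.6, 0.1]], [0.0, 1.5])
b = mlp(x, W1, [[7.0, 3.0], [0.6, 0.1]], [0.0, 1.5])

# Overlap singularity R2: with W2[0] == W2[1], only W3[0] + W3[1] matters.
shared = [0.6, 0.1]
c = mlp(x, W1, [shared, shared], [0.4, 1.1])
d = mlp(x, W1, [shared, shared], [2.0, -0.5])

# Both differences should be zero up to floating-point rounding.
print(abs(a - b), abs(c - d))
```

In both cases a whole family of parameter values realizes the same input-output function, which is what makes the Fisher information matrix singular on these subspaces.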

So far, we have theoretically analyzed the types of singularities that exist in the parameter space of deep MLPs; in the next section, we numerically analyze the influence of these singularities on EEG-based emotion recognition.

4 Numerical Analysis of Learning Dynamics Near Singularities

In this section, we numerically analyze the singularities through experiments on an EEG dataset. The SEED dataset is a typical benchmark that was developed by SJTU and has been widely used to evaluate methods for EEG-based emotion recognition. In this paper, the training process is carried out using the SEED dataset.

4.1 Data Preprocessing

The SEED dataset (Zheng and Lu, 2015) is collected from a 62-channel EEG device and contains EEG signals of three emotions (positive, neutral, and negative) from 15 subjects. Due to the low signal-to-noise ratio of raw EEG signals, a preprocessing step is necessary to extract meaningful features. Each EEG channel is commonly decomposed into five frequency bands: delta (1–3 Hz), theta (4–7 Hz), alpha (8–13 Hz), beta (14–30 Hz), and gamma (31–50 Hz). That is, for one subject, the data take the form 5 × 62. As this dimension is very large, we use principal component analysis (PCA) (Abdi and Williams, 2010) to extract the features of the EEG signal. After the PCA step, the EEG data take the form 5 × 5, and stacking all elements into a vector finally reduces the input dimension to 25.
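The dimension reduction from 5 × 62 to a 25-dimensional input can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify how PCA was fitted, so the sketch assumes that each sample is a 5 × 62 (bands × channels) matrix, that PCA is fitted on the band rows pooled over samples, and that the 62 channels are projected onto five components. The data are synthetic, the function names are hypothetical, and the components are approximated by power iteration rather than by an exact eigendecomposition.

```python
import math
import random


def top_components(rows, k, iters=50, seed=0):
    """Return (mean, k orthonormal directions) approximating the top-k PCA basis.

    Power iteration on the covariance; each direction is re-orthogonalised
    against the earlier ones (Gram-Schmidt), so the basis is orthonormal.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            # Apply the unnormalised covariance: v <- C^T (C v).
            proj = [sum(c[j] * v[j] for j in range(d)) for c in centred]
            v = [sum(p * c[j] for p, c in zip(proj, centred)) for j in range(d)]
            for u in comps:
                dot = sum(a * b for a, b in zip(u, v))
                v = [a - dot * b for a, b in zip(v, u)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        comps.append(v)
    return mean, comps


def to_feature_vector(sample, mean, comps):
    """Project each band row onto the components and flatten row by row."""
    return [sum((row[j] - mean[j]) * u[j] for j in range(len(u)))
            for row in sample for u in comps]


# Synthetic stand-in for EEG features: 10 samples of shape 5 x 62.
rng = random.Random(1)
samples = [[[rng.gauss(0.0, 1.0) for _ in range(62)] for _ in range(5)]
           for _ in range(10)]
rows = [row for s in samples for row in s]
mean, comps = top_components(rows, k=5)
feature = to_feature_vector(samples[0], mean, comps)
print(len(feature))  # 5 bands x 5 components = 25 inputs for the MLP
```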

4.2 Learning Trajectories Near Singularities

We now carry out experiments on the SEED dataset and numerically analyze the learning dynamics near singularities. We choose the numbers of neurons in the two hidden layers as M₁ = 8 and M₂ = 8; thus, the architecture of the deep MLPs is 25−8−8−1. As there are three emotions in the SEED dataset, we assign the values 1, 2, and 3 to the labels positive, neutral, and negative, respectively. The numbers of training and testing samples are 1,000 and 500, respectively. Then, setting the learning rate to η = 0.002, the target error to 0.05, and the maximum number of epochs to 8,000, we use Eq. 3 to train the network. Based on the experiment results, several cases of learning dynamics are shown below. Besides the training error, the classification accuracy is also used to measure the performance. In the following figures of experiment results, “◦” and “×” represent the initial state and final state, respectively. The experiments were run using MATLAB 2013a on a PC with an Intel Core i7-9700K CPU @3.60 GHz, 32 GB RAM, and an NVIDIA GeForce RTX 2070 GPU.
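Because the network has a single linear output and the labels are encoded as 1, 2, and 3, some rule is needed to turn the scalar output into a class. The paper does not state this rule; the sketch below assumes the nearest label is chosen, and the function names are hypothetical.

```python
def predict_label(output, labels=(1, 2, 3)):
    """Return the label closest to the scalar network output (assumed rule).

    Ties are resolved in favour of the smaller label.
    """
    return min(labels, key=lambda c: abs(output - c))


def accuracy(outputs, targets):
    """Fraction of samples whose nearest label equals the target label."""
    hits = sum(predict_label(o) == t for o, t in zip(outputs, targets))
    return hits / len(targets)


# Toy outputs: 2.6 is closer to 3 than to its target 2, so 3 of 4 are correct.
print(accuracy([1.2, 2.6, 2.9, 0.4], [1, 2, 3, 1]))
```

Under this assumed rule, the training error and the classification accuracy can move differently: the error may decrease while outputs remain on the wrong side of a midpoint such as 1.5 or 2.5.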

Case 1. Fast convergence: the learning process converges quickly to the global minimum. In this case, the learning dynamics are not influenced by any singularity and the parameters quickly converge to the optimal values. The initial value of W⁽³⁾ is W⁽³⁾⁽⁰⁾ = [0.8874, 0.6993, 0.5367, −0.9415, −0.8464, −0.9280, 0.3335, −0.7339]^T and the final value of W⁽³⁾ is W⁽³⁾ = [3.1443, 2.5868, 2.3291, −1.1544, −1.2281, −2.9704, 2.9650, −1.8221]^T. The experiment results are shown in Figure 2, which presents the trajectories of the training error, the output weights W⁽³⁾, and the classification accuracy. As can be seen from Figure 2A, the learning dynamics quickly converge to the global minimum and are not affected by any singularity.

FIGURE 2. Case 1 (Fast Convergence). (A) Trajectory of training error. (B) Trajectory of W⁽³⁾. (C) Trajectory of classification accuracy.

Case 2. Zero weight singularity: the learning process is affected by the zero weight (elimination) singularity. In this case, one output weight crosses 0 during the learning process and a plateau phenomenon can be clearly observed. The initial value of W⁽³⁾ is W⁽³⁾⁽⁰⁾ = [0.4825, 0.9885, −0.9522, −0.3505, −0.5004, 0.9749, −0.9111, −0.5056]^T, and the final value of W⁽³⁾ is W⁽³⁾ = [3.0297, 3.1006, −1.7413, 0.1717, −1.9567, 3.5131, −1.9037, −0.9143]^T. The experiment results are shown in Figure 3, which presents the trajectories of the training error, the output weights W⁽³⁾, and the classification accuracy. From Figure 3B, we can see that $W_{4}^{(3)}$ crosses 0 during the learning process, so the learning process is affected by the zero weight singularity. While $W_{4}^{(3)}$ crosses 0, the plateau phenomenon can be clearly observed (Figure 3A). Then, the parameters escape the influence of the singularity. After training, the training error is larger and the classification accuracy is lower than in Case 1, which means that the parameters do not reach the optimum.

FIGURE 3. Case 2 (Zero weight singularity). (A) Trajectory of training error. (B) Trajectory of W⁽³⁾. (C) Trajectory of classification accuracy.

Case 3. Extending the training time of Case 2. In this experiment, we only increase the number of training epochs to 15,000, and the rest of the experiment setup remains the same as in Case 2. The experiment results are shown in Figure 4. Compared to Figure 3, it can be seen that the learning process affected by the zero weight singularity can still arrive at the optimum, but it takes much more time. This means that zero weight singularities greatly reduce the efficiency of deep MLPs.

FIGURE 4. Case 3 (Extending training time of Case 2). (A) Trajectory of training error. (B) Trajectory of W⁽³⁾. (C) Trajectory of classification accuracy.

Case 4. Changing the initial value of Case 2. To confirm that the plateau phenomenon corresponds to the zero weight singularity, a supplementary experiment is carried out in which only the initial value W⁽³⁾⁽⁰⁾ is changed and the rest of the experiment setup remains the same. The initial value of W⁽³⁾ is W⁽³⁾⁽⁰⁾ = [−0.5056, −0.9111, 1.7749, −0.5004, 1.6495, −0.9522, 0.9885, 1.2825]^T, and the final value of W⁽³⁾ is W⁽³⁾ = [−1.3660, −1.8232, 3.2529, −1.9425, 3.2450, −1.6325, 1.9452, 3.4158]^T. The experiment results are shown in Figure 5, which presents the trajectories of the training error, the output weights W⁽³⁾, and the classification accuracy. As can be seen in Figure 5, no element of W⁽³⁾ becomes zero. Moreover, no plateau phenomenon can be observed, and the classification accuracy reaches a comparatively high value. By comparing the experiment results shown in Figures 3 and 5, we can conclude that the plateau phenomenon is indeed caused by the zero weight singularity.
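The reported initial and final output weights already indicate which runs passed through the zero weight singularity: a weight whose sign differs between the start and the end of training must have crossed 0. The sketch below applies this check to the values reported for Cases 1, 2, and 4. The helper `sign_changes` is hypothetical, and comparing endpoints only detects an odd number of crossings; a full trajectory would be needed to rule out a weight that crosses 0 and returns.

```python
def sign_changes(initial, final):
    """Return 1-based indices whose weight has opposite signs at start and end."""
    return [j + 1 for j, (a, b) in enumerate(zip(initial, final)) if a * b < 0]


# Initial and final W^(3) as reported for Cases 1, 2, and 4.
case1 = ([0.8874, 0.6993, 0.5367, -0.9415, -0.8464, -0.9280, 0.3335, -0.7339],
         [3.1443, 2.5868, 2.3291, -1.1544, -1.2281, -2.9704, 2.9650, -1.8221])
case2 = ([0.4825, 0.9885, -0.9522, -0.3505, -0.5004, 0.9749, -0.9111, -0.5056],
         [3.0297, 3.1006, -1.7413, 0.1717, -1.9567, 3.5131, -1.9037, -0.9143])
case4 = ([-0.5056, -0.9111, 1.7749, -0.5004, 1.6495, -0.9522, 0.9885, 1.2825],
         [-1.3660, -1.8232, 3.2529, -1.9425, 3.2450, -1.6325, 1.9452, 3.4158])

print(sign_changes(*case1))  # []
print(sign_changes(*case2))  # [4]
print(sign_changes(*case4))  # []
```

Only Case 2 shows a sign change, at the fourth output weight, which matches the crossing of $W_{4}^{(3)}$ reported for Figure 3B.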

FIGURE 5. Case 4 (Changing initial value of Case 2). (A) Trajectory of training error. (B) Trajectory of W⁽³⁾. (C) Trajectory of classification accuracy.

Remark 1. From the results shown in Figures 2–5 and Table 1, we can see that the training and testing accuracy in Case 2 is the lowest. This means that when the training process is affected by the zero weight singularity, the parameters cannot reach the optimum within the same training time as in the fast convergence case. When we extend the training time in Case 2, the parameters escape the influence of the zero weight singularity and finally arrive at the optimum, as shown in Case 3. Thus, the points on the zero weight singularity are saddle points, not local minima. To sum up, the zero weight singularity seriously delays the training process, and it is worthwhile to investigate algorithms that overcome the influence of zero weight singularities.

TABLE 1. Training and testing classification accuracy.

Remark 2. In the experiments, we did not observe learning dynamics of deep MLPs that were affected by overlap singularities. This is in accordance with the conclusion of our analysis of the learning dynamics of shallow neural networks (Guo et al., 2018); i.e., overlap singularities mainly influence neural networks of low dimension, whereas large-scale networks predominantly suffer from zero weight singularities. Thus, we should pay more attention to how to overcome the influence of zero weight singularities.

In this section, we have numerically analyzed the learning dynamics near singularities of deep MLPs for EEG-based emotion recognition and shown the singular case. We find that the learning dynamics of deep MLPs are mainly influenced by zero weight singularities and rarely affected by overlap singularities.

5 Conclusion and Discussion

Deep learning has been widely used in EEG-based emotion recognition and has shown superior performance compared to traditional methods. However, for various DNNs, there exist singularities in the parameter space, which cause singular behaviors in the training process. In this paper, we investigate the singular learning dynamics of DNNs applied to EEG-based emotion recognition. Choosing deep MLPs as the learning machine, we first carry out a theoretical analysis of the singularities of deep MLPs and find that there are at least two types of singularities: the overlap singularity and the zero weight singularity. Then, a numerical analysis is carried out through several experiments. The experiment results show that the learning dynamics of deep MLPs are seriously influenced by zero weight singularities and rarely affected by overlap singularities. Furthermore, the plateau phenomenon is caused by the zero weight singularity. Thus, future work should pay more attention to overcoming the serious influence of the zero weight singularity in order to improve the efficiency of DNNs in EEG-based emotion recognition.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Author Contributions

WG and GL: Methodology. WG and JY: Validation and investigation. WG: Writing—original draft preparation. GL, JL, and JY: Formal analysis, data curation. JL and JY: Writing—reviewing and editing, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61906092, 61802059, 62006119, and 61876085, the Natural Science Foundation of Jiangsu Province of China under Grant Nos. BK20190441, BK20180365, and BK20190444, and the 973 Program No. 2014CB349303.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdi, H., and Williams, L. J. (2010). Principal Component Analysis. WIREs Comp. Stat. 2, 433–459. doi:10.1002/wics.101