Dual-Threshold-Based Microstate Analysis on Characterizing Temporal Dynamics of Affective Process and Emotion Recognition From EEG Signals

Recently, emotion classification from electroencephalogram (EEG) data has attracted much attention. As EEG is a non-stationary, rapidly changing voltage signal, the features extracted from EEG usually change dramatically, whereas emotion states change gradually. Most existing feature extraction approaches do not consider these differences between EEG and emotion. Microstate analysis can capture important spatio-temporal properties of EEG signals while reducing the fast-changing signal to a sequence of prototypical topographical maps. Although microstate analysis has been widely used to study brain function, few studies have applied it to analyze how the brain responds to emotional auditory stimuli. In this study, we propose a novel feature extraction method based on EEG microstates for emotion recognition. Determining the optimal number of microstates automatically is a challenge when applying microstate analysis to emotion. We propose dual-threshold-based atomize and agglomerate hierarchical clustering (DTAAHC) to determine the optimal number of microstate classes automatically. By using the proposed method to model the temporal dynamics of the auditory emotion process, we extract microstate characteristics as novel temporospatial features to improve the performance of emotion recognition from EEG signals. We evaluated the proposed method on two datasets. For the public music-evoked Dataset for Emotion Analysis using Physiological signals (DEAP), the microstate analysis identified 10 microstates, which together explained around 86% of the data at global field power peaks. Emotion recognition accuracy reached 75.8% for valence and 77.1% for arousal using microstate sequence characteristics as features, outperforming the feature sets of previous studies. For the speech-evoked EEG dataset, the microstate analysis identified nine microstates, which together explained around 85% of the data, and accuracy reached 74.2% for valence and 72.3% for arousal. The experimental results indicate that microstate characteristics can effectively improve the performance of emotion recognition from EEG signals.


INTRODUCTION
To make human-machine interaction more natural, emotion recognition plays an important role. Interest in emotion recognition from different modalities (e.g., face, speech, body posture, and physiological responses) has risen in the past decades. Physiological signals can measure the changes in physiological responses to emotional stimuli. They have an advantage in eliminating social masking or factitious emotion expressions, giving a better view of underlying emotions (Jang et al., 2015). Among the various types of physiological signals, the electroencephalogram (EEG) provides a direct measure of the electrical activity of the brain. It has been used in cognitive neuroscience to investigate the regulation and processing of emotion (Dennis and Solomon, 2010; Thiruchselvam et al., 2011). With the rapid development of dry EEG electrode techniques, EEG-based emotion recognition has found increasing applications in fields such as affective brain-computer interaction (Atkinson and Campos, 2016; Chen et al., 2021), healthcare (Hossain and Muhammad, 2019), emotional companionship, and e-learning (Ali et al., 2016).
However, some limitations still exist in traditional feature sets. As EEG is a non-stationary, rapidly changing voltage signal, the features extracted from EEG usually change dramatically, whereas emotion states change gradually (Wang et al., 2014). This leads to large differences among EEG features even within the same emotion state at adjacent time points. Most existing feature extraction approaches do not consider these differences between EEG and emotion. In this study, we propose a feature extraction method based on EEG microstates for emotion recognition. Microstate analysis treats multichannel EEG as a series of momentary quasi-stable scalp electric potential topographies (Pascual-Marqui et al., 1995). These quasi-stable potential topographies are referred to as microstates, so brain electrical activity can be modeled as a time sequence of non-overlapping microstates. Microstate sequences capture important spatio-temporal properties of an EEG signal while reducing the fast-changing EEG signal to a sequence of prototypical topographical maps. Characterizing the dynamics of brain neuronal activity through EEG microstate patterns could provide novel information for improving EEG-based emotion recognition.
Microstate analysis has been used to study the resting state of the human brain based on the topography of the EEG signals (Khanna et al., 2015; Michel and Koenig, 2018). The greater part of the literature acknowledges four standard microstate maps in healthy subjects at rest. In addition, the characteristics of microstate sequences have been shown to offer a potential biomarker for some diseases, such as mood and anxiety disorders (Al Zoubi et al., 2019), autism spectrum disorder (D'Croz-Baron et al., 2019), and schizophrenia (Soni et al., 2018, 2019; da Cruz et al., 2020; Kim et al., 2021). Baradits et al. (2020) created a specified feature set to represent microstate characteristics. These features were used to classify patients with schizophrenia and healthy controls.
While microstate analysis has been widely used to study brain function, few studies have used this method to analyze how the brain responds to emotional auditory stimuli. There are some challenges in applying microstate analysis to the emotion process. Given the complexity of the emotion process, how to determine the optimal number of microstates automatically is a subject worthy of study. Modified K-means and K-medoids have been used to determine the microstate classes in many studies (Von Wegner et al., 2018). However, these methods need a preset number K of cluster centers, and the resulting clusters are sensitive to initialization. Emotional response is a complex cognitive process, so it is difficult to predict the number of microstate classes subjectively. The atomize and agglomerate hierarchical clustering (AAHC) algorithm was proposed specifically for the microstate analysis of EEG (Murray et al., 2008). It is a hierarchical clustering method that offers more optional clustering results. The method initializes with a large number of clusters and then reduces the number of clusters by one during each iteration step. It stops when only one single final cluster is obtained, but the best partition among the numerous clustering results must be determined subjectively.
To overcome this limitation, this study proposes dual-threshold-based atomize and agglomerate hierarchical clustering (DTAAHC), which can determine the optimal number of microstate classes automatically. For microstate analysis, microstates are expected to be distinct and to explain the original EEG topographies as much as possible. Therefore, two optimization criteria are used to estimate the quality of the candidate microstates during iterations. Compared with AAHC, in addition to the global explained variance (GEV) contribution, the proposed algorithm also considers the microstate topographic similarity. Global map dissimilarity (GMD) is used to measure the topographic differences of candidate microstates. In addition, the iteration stops when the criterion GEV reaches a threshold. Although we made only a minor alteration to the AAHC algorithm, the new method can identify the optimal microstate classes automatically and reduce the computational cost. By using the proposed method to model the temporal dynamics of the auditory emotion process, we extract microstate characteristics as novel temporospatial features for improving the performance of emotion recognition from EEG signals. The schema of the present study is shown in Figure 1.

MATERIALS AND METHODS
This section provides details of the experimental tasks and datasets used in this study. In addition, we describe the proposed DTAAHC and the temporal parameters of microstate sequences for emotion recognition.

Datasets
Speech, music, and ambient sound events carry emotional information in human communication. In the present study, we focused on the emotional response induced by speech and music. Two independent datasets were available for analysis.

Dataset 1: Speech-Evoked Emotion Cognitive Experiment Participants
Nineteen healthy participants (8 females and 11 males) with normal hearing participated in the experiment. The mean age of the 19 subjects was 22.4 (SD = 5.4; range, 18-27) years. All subjects were self-reported right-handers and had no personal history of neurological or psychiatric illness. The subjects were undergraduate and graduate students at Harbin Institute of Technology, and all were required to have sufficient proficiency in English. The ethics committee of Heilongjiang Provincial Hospital approved the study. The procedure was explained to the subjects, and written informed consent was obtained.

Stimuli selection
There are two main models for representing emotions: the categorical model and the dimensional model. In the former, emotions are described with words denoting emotions or class tags. In the dimensional model, the representation is based on a set of quantitative measures using multidimensional scaling. One of the classical and widely used categorical models comprises six basic emotion classes, namely, anger, disgust, fear, joy, sadness, and surprise (Ekman et al., 1987). Various dimensional models have also been proposed (Schlosberg, 1954; Russell and Mehrabian, 1977; Russell, 1980). In this work, we use the valence-arousal scale of Russell (1980), which is widely used in research on affect, to quantitatively describe emotions. In this scale, each emotional state can be placed on a two-dimensional plane with arousal and valence as the horizontal and vertical axes, respectively. In the present research, we first selected stimuli by the categorical model. After selection, each stimulus was rated on the valence and arousal scales online using the Self-Assessment Manikin (SAM).
Considering the six basic emotions, we collected 20 pairs of audio clips for each emotion category. Each pair of clips was the same slice of a film in two languages (original English version vs. Chinese-dubbed version).
The stimuli used in the experiment were selected in three steps. First, we selected the raw films by watching a range of films over 1 month. The principles considered in the raw film selection were as follows: (A) the films display relatively strong emotions; (B) the films have both an original English version and a Chinese-dubbed version; and (C) the Chinese-dubbed version matches the original version to the greatest extent. We finally selected 40 films as raw sources. Second, we manually selected emotional clips from the films. The selection requirements were as follows: (A) each segment contains the speech of only one speaker; (B) each segment expresses a single desired target emotion; (C) each segment lasts for 5 s and contains at least one complete utterance; and (D) the background sound is not too prominent. We finally selected 158 pairs of clips and extracted the soundtracks from these film clips. Third, all the audio clips were manually rechecked by 10 subjects to guarantee the quality of the emotional expression. Clips with ambiguous emotions were removed. We finally selected 20 pairs of clips for each emotion category that maximized the strength of the elicited emotions. The list of the film clips is shown in Supplementary Table 1.
To obtain reliable emotional labels for these clips, we utilized Amazon's Mechanical Turk service to collect ratings from native English-speaking and native Chinese (Mandarin)-speaking subjects. We initially set a target of 40 repetitions per clip. The subjects were allowed to classify as many of the 240 possible audio clips as they wished; no single subject was expected to complete all 240 audio exemplars. When a subject completed only a portion of the 240 audio clips, we continued to solicit additional subjects until the required number of responses was achieved.
We presented subjects with the selected audio clips and asked them to rate the emotional content of what they had just heard and how they arrived at that decision. For each audio clip, a discrete affective label and dimensional emotional annotations (arousal and valence) on 1-9 scales were obtained. Figure 2 shows the mean locations of the stimuli on the arousal-valence plane.

Experimental protocol
Before the experiment, the subjects were given a set of instructions to help them understand the experiment protocol. When the instructions were clear, the participants were led into the experiment room and the sensors were placed on their heads. After that, an experimenter explained the meaning of the different scales of the SAM. The SAM is a non-verbal pictorial assessment technique that directly measures the valence, arousal, and dominance associated with a person's affective reaction to a wide variety of stimuli. The arousal dimension ranges from a relaxed, sleepy figure to an excited, wide-eyed figure. The valence dimension ranges from a frowning, unhappy figure to a smiling, happy figure. The dominance-submissiveness scale represents how controlling and dominant vs. controlled or submissive one feels: a large figure indicates maximum control in the situation. The participants could perform three practice trials to familiarize themselves with the experiment.
The subjects were instructed to keep their eyes open for the entire duration of the experiment. The process of our experiment is depicted in Figure 3. In this experiment, each subject performed two sessions of around 25 min each, with a 5-min break after the first session. Each session consisted of 40 trials.
Audio clips inducing different emotional states were presented in random order. Each trial consisted of the following steps: (a) a 3-s baseline recording, during which the subjects were instructed to watch a fixation cross presented on a computer monitor; (b) a 5-s audio clip, during which the subjects were instructed to listen attentively and watch a central visual fixation; and (c) a 30-s self-assessment of arousal, valence, and dominance, during which the subjects used a computer keyboard to rate the SAM on a scale of 1-9.
The experiment was programmed using the Psychophysics Toolbox for Matlab. Table 1 summarizes the number of trials for high/low valence and arousal and the average rating for the four conditions.

EEG acquisition
The EEG signals were continuously recorded using a 64-channel EEG system (64-channel Quik-Cap and Neuroscan Synamp2 amplifier). The cap had 64 electrodes and two integrated bipolar leads for vertical and horizontal electrooculography (EOG). In addition to the scalp channels, the two EOG leads and two mastoid electrodes (M1 and M2) were recorded. Each electrode impedance was kept below 10 kΩ. The sampling rate was 1,000 Hz. The electrodes were placed over the scalp according to the international 10-20 system.

EEG pre-processing
EEG signal pre-processing was performed to reduce unwanted noise and artifacts that compromise signal quality. First, the four signals from the two EOG leads and two mastoid electrodes were removed, and the 62 remaining signals were used for subsequent processing and analysis. Then, the EEG signals were average-referenced, down-sampled to 500 Hz, and band-pass filtered to 1-35 Hz to obtain the desired frequency range and remove electrical line noise. After that, eye blinks and muscular artifacts were excluded using independent component analysis (ICA). For each group, each participant, and each trial, the 3-s baseline before the audio clip was removed from the EEG signal to correct for stimulus-unrelated variations. The pre-processing was performed using EEGLAB for Matlab.

Dataset 2: Music-Evoked Emotion Cognitive Experiment
Music is a powerful method for emotional communication and can evoke genuine basic emotions in the listener (Daly et al., 2015). Physiological measurements can be used to identify personal emotional responses to music. A popular public database, the Dataset for Emotion Analysis using Physiological signals (DEAP), has been widely used to analyze affective states (Koelstra et al., 2011). DEAP is a multimodal dataset, including EEG, EMG, galvanic skin resistance, electrooculography, blood volume pressure, skin temperature, and respiration pattern. A total of 32 subjects participated in the data collection, and 40 carefully pre-selected 1-min-long music videos were used as the stimuli to elicit emotions for each subject. Before each video was displayed, a 5-s baseline was recorded. Each participant was requested to complete a self-assessment of arousal, valence, and dominance on a scale of 1-9 after watching. In this research, we used the 32-channel original EEG signals for emotion recognition based on microstate analysis. The raw EEG data can be downloaded from http://www.eecs.qmul.ac.uk/mmv/datasets/deap/. During pre-processing, the EEG data were average-referenced, down-sampled to 128 Hz, and filtered with a 1-35-Hz band-pass, and eye artifacts were removed with ICA. The 5-s baseline before the stimuli was used to correct the data for stimulus-unrelated variations. There are a total of 1,280 trials for analysis.

The Proposed Dual-Threshold-Based Microstate Analysis
Microstate analysis rests on the principle, demonstrated in previous studies, that scalp topographies show quasi-stable periods. More specifically, the changes in electric field configurations can be described by a limited number of microstate classes, each of which remains stable for around 80-120 ms before abruptly transitioning to another configuration. EEG microstates might represent and characterize the dynamic neuronal activity underlying conscious contents.

Global Field Power
Global field power (GFP) is calculated to find a series of dominant template topographies. GFP constitutes a single, reference-independent measure of response strength at a global level (Lehmann and Skrandies, 1980). GFP is simply the standard deviation of all electrodes at a given time; it tells the researcher how strong, on average across the electrode montage, the recorded potential is. It is often used to measure the global brain response to an event or to characterize rapid changes in brain activity.
For each subject, GFP was calculated for each sample time according to Eq. 1:

GFP(t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(u_i(t)-\bar{u}(t)\big)^2}    (1)

where N denotes the number of electrodes, u_i(t) is the measured voltage of a specific electrode at time t, and \bar{u}(t) is the average voltage of the N electrodes at the respective sample time t.
The local maxima of the GFP curve represent high global neuronal synchronization (Skrandies, 2007) and are considered with the highest signal-to-noise ratio. The topographies around these peaks remain stable and are submitted to the clustering algorithm. For each participant, the GFP of each trial is calculated. After smoothing the GFP with a Gaussian-weighted moving average of 50 time points, topographies at GFP peaks were collected and fed into a DTAAHC clustering algorithm to identify the microstates.
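The steps above translate directly into code. The following is a minimal NumPy sketch (function names are ours): the 50-point smoothing window comes from the text, while the exact Gaussian kernel width is an illustrative assumption.

```python
import numpy as np

def gfp(eeg):
    # eeg: array of shape (n_channels, n_samples);
    # GFP is the standard deviation across electrodes at each sample (Eq. 1)
    return eeg.std(axis=0)

def gaussian_smooth(x, window=50):
    # Gaussian-weighted moving average over `window` time points;
    # the kernel width relative to the window is an illustrative choice
    half = window // 2
    t = np.arange(-half, half + 1)
    w = np.exp(-0.5 * (t / (window / 5.0)) ** 2)
    return np.convolve(x, w / w.sum(), mode="same")

def gfp_peaks(g):
    # indices of local maxima of the (smoothed) GFP curve
    return np.where((g[1:-1] > g[:-2]) & (g[1:-1] > g[2:]))[0] + 1
```

The peak topographies `eeg[:, gfp_peaks(gaussian_smooth(gfp(eeg)))]` are what would then be submitted to the clustering step.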

The Proposed Dual-Threshold-Based AAHC
AAHC is a bottom-up hierarchical clustering wherein the number of clusters is initially large and progressively diminishes. Classical agglomerative hierarchical clustering would eventually disintegrate short-duration periods of stable topography: these periods would be assigned to other clusters even if they contribute a high GEV (Murray et al., 2008). In AAHC, clusters are given priority according to their GEV contributions, so short-duration periods are conditionally maintained. Specifically, during each iteration, AAHC frees the cluster with the lowest GEV and then re-assigns these "free" maps to the surviving clusters by calculating spatial correlation. The iterations stop when only one single final cluster is obtained. An important next step is the choice of the number of desired output clusters. Unfortunately, there is no definitive solution: the more clusters one identifies, the higher the quality of the clustering but the lower the data reduction. Five criteria to decide on the number of microstate clusters have been described by Poulsen et al. (2018). GEV measures the percentage of data that can be explained by the microstate classes. The cross-validation criterion is related to the residual noise. Dispersion (W) is a measure of the average distance between members of the same cluster; however, it is not a suitable measure of fit for polarity-invariant methods such as modified K-means and AAHC. The Krzanowski-Lai criterion and the normalized Krzanowski-Lai criterion are based on dispersion (W).
Here we propose DTAAHC to determine the optimal number of microstate classes automatically during clustering. Compared with AAHC, in addition to GEV contribution, the proposed algorithm also considers the microstate topographic similarity. For microstate analysis, microstates are expected to be distinct and could explain the original EEG topographies as much as possible. Therefore, two optimization criteria are used to estimate the quality of the topographical maps of microstate classes during iterations. First, the cluster with the lowest GEV is freed and reassigned to the surviving clusters. Second, the clusters are merged if the GMD between the candidate microstate classes is lower than 0.1. In addition, the iteration stops when the criterion GEV reaches the threshold. Although we made a minor alteration to the AAHC algorithm, the new method could identify the optimal microstate classes automatically and reduce the computational cost. The detailed introduction of this method is discussed below.
GMD is used to measure the topographic differences of microstate maps, independent of electric field strength. It is defined as

GMD = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{u_i-\bar{u}}{\mathrm{GFP}_u}-\frac{v_i-\bar{v}}{\mathrm{GFP}_v}\right)^2}    (2)

where u_i and v_i are the voltages of the ith electrode of the two specified microstates, \bar{u} and \bar{v} are the average voltages of the N electrodes, and \mathrm{GFP}_u and \mathrm{GFP}_v are the corresponding global field powers. GMD ranges from 0 to 2, where 0 indicates topographic homogeneity and 2 indicates topographic inversion. GEV measures the percentage of data that can be explained by the microstate classes and is frequently used to quantify how well the microstate classes describe the whole data; the higher the GEV, the better. It is influenced by the dimensionality of the data. The total GEV is the sum of the GEV values over all microstate classes:

\mathrm{GEV} = \sum_{l} \mathrm{GEV}_l    (3)

The \mathrm{GEV}_l value for a specific microstate class with label l is

\mathrm{GEV}_l = \frac{\sum_{t}\big(\mathrm{GFP}(t)\, C_{V_t,M_l}\big)^2\,\delta(L_t = l)}{\sum_{t}\mathrm{GFP}(t)^2}    (4)

where L_t is the microstate label assigned at time t. The spatial correlation C_{V_t,M_l} between the instantaneous EEG topography V_t and the candidate microstate class M_l can be calculated by Eq. 6:

C_{V_t,M_l} = \frac{\sum_{i=1}^{N} V_{ti} M_{li}}{\sqrt{\sum_{i=1}^{N} V_{ti}^2}\,\sqrt{\sum_{i=1}^{N} M_{li}^2}}    (6)

where V_{ti} is the voltage of the ith electrode of the instantaneous EEG at time t (local peak index), and M_{li} denotes the topography of microstate class l at electrode i.
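These quantities are straightforward to compute; the following is a minimal NumPy sketch (function names are ours, and maps are average-referenced inside the functions):

```python
import numpy as np

def spatial_corr(v, m):
    # spatial correlation between an instantaneous topography and a template (Eq. 6),
    # computed on average-referenced maps
    v = v - v.mean()
    m = m - m.mean()
    return np.dot(v, m) / (np.linalg.norm(v) * np.linalg.norm(m))

def gmd(u, v):
    # global map dissimilarity between two maps, independent of field strength:
    # each map is average-referenced and scaled by its GFP before comparison
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    return np.sqrt(np.mean((u - v) ** 2))

def total_gev(eeg, templates, labels):
    # total GEV of a labeled segmentation (Eqs. 3 and 4);
    # eeg: (n_channels, n_samples), labels: template index per sample
    g = eeg.std(axis=0)  # GFP at each sample
    num = sum((g[t] * spatial_corr(eeg[:, t], templates[labels[t]])) ** 2
              for t in range(eeg.shape[1]))
    return num / np.sum(g ** 2)
```

Identical maps give GMD = 0, polarity-inverted maps give GMD = 2, and a segmentation whose templates match every sample exactly gives a total GEV of 1.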
In this study, DTAAHC is performed on the EEG topographies at the local peaks of GFP. During initialization, each topography map is considered a unique cluster. In subsequent iterations, the spatial correlation C_{V_t,M_l} between each instantaneous EEG topography V_t and the candidate microstate class M_l is calculated by Eq. 6, and the clusters with maximum spatial correlation are merged. The centroid of the maps in a cluster is defined as the candidate microstate class for that cluster. Then, the two optimization criteria are applied. The GEV_l for a specific microstate class with label l is calculated by Eq. 4; the cluster with the lowest GEV is removed, and its maps are re-assigned to the most similar clusters during each iteration step. The GMDs between the candidate microstate classes are calculated, and clusters are merged if their GMD is lower than the threshold. The iterations stop when the GEV is higher than the threshold. In the present work, the threshold of GEV is set to 85% (Lehmann et al., 2005; Michel and Koenig, 2018; D'Croz-Baron et al., 2019), and the threshold of GMD is set to 0.1 (Murray et al., 2008). Table 2 shows the DTAAHC procedure.
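Table 2 gives the exact procedure; the loop below is only a simplified, illustrative re-implementation of the dual-threshold logic (polarity-invariant correlation, GMD-based merging, GEV-based stopping). All names are ours, and details such as the exact merge order are simplified relative to the published procedure.

```python
import numpy as np

def _corr(a, b):
    # polarity-invariant spatial correlation, as used in AAHC-style clustering
    a = a - a.mean()
    b = b - b.mean()
    return abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def _gmd(u, v):
    # polarity-invariant global map dissimilarity
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    return min(np.sqrt(np.mean((u - v) ** 2)), np.sqrt(np.mean((u + v) ** 2)))

def dtaahc(maps, gev_thresh=0.85, gmd_thresh=0.1):
    """Simplified DTAAHC sketch. maps: (n_maps, n_channels) topographies at GFP peaks.
    Returns (templates, labels)."""
    gfp2 = maps.var(axis=1)                     # squared GFP of each peak map
    clusters = [[i] for i in range(len(maps))]  # start: every map is its own cluster

    def centroid(c):
        return maps[c].mean(axis=0)

    def cluster_gev(c):
        t = centroid(c)
        return sum(gfp2[i] * _corr(maps[i], t) ** 2 for i in c) / gfp2.sum()

    while len(clusters) > 1:
        # dual threshold 1: merge clusters whose candidate templates are nearly identical
        merged = True
        while merged and len(clusters) > 1:
            merged = False
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    if _gmd(centroid(clusters[a]), centroid(clusters[b])) < gmd_thresh:
                        clusters[a] += clusters.pop(b)
                        merged = True
                        break
                if merged:
                    break
        # dual threshold 2: stop once the candidate classes explain enough variance
        if sum(cluster_gev(c) for c in clusters) >= gev_thresh or len(clusters) == 1:
            break
        # atomize: free the cluster with the lowest GEV contribution and
        # re-assign each freed map to the most similar surviving cluster
        worst = min(range(len(clusters)), key=lambda k: cluster_gev(clusters[k]))
        freed = clusters.pop(worst)
        for i in freed:
            best = max(range(len(clusters)),
                       key=lambda k: _corr(maps[i], centroid(clusters[k])))
            clusters[best].append(i)

    templates = np.array([centroid(c) for c in clusters])
    labels = np.empty(len(maps), dtype=int)
    for k, c in enumerate(clusters):
        for i in c:
            labels[i] = k
    return templates, labels
```

On noise-free data composed of two distinct scaled topographies, this sketch collapses the initial singleton clusters into exactly two classes and stops once the GEV threshold is met.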

Microstate Sequence Characteristics
After the microstate classes are identified, the original individual EEG data can be labeled as a microstate sequence by fitting the microstate classes back to the topography at each sample point. Temporal parameters can then be extracted as features for further analysis and compared between different experimental conditions or between groups of subjects.

Backfitting
Microstate classes are assigned to the EEG at each time frame (or index of GFP peaks) according to the highest spatial correlation (see Eq. 5). The maximum spatial correlation determines the microstate label L_t. In the fitting process, temporal smoothing (Pascual-Marqui et al., 1995; Poulsen et al., 2018) is applied to avoid interruptions of spontaneous EEG sequences by unwanted noise; that is, class assignments are based on topographical similarity with the microstate classes as well as on the microstate labels of the samples immediately before and after the current EEG sample. Different temporal parameters and statistical analyses are computed after the class assignment for every subject.
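As an illustration, backfitting together with a crude form of temporal smoothing (rejecting very short label runs, one simple surrogate for the smoothing schemes cited above) could look like the following sketch; the names and the `min_len` heuristic are ours.

```python
import numpy as np

def backfit(eeg, templates):
    # Assign each EEG sample the label of the template with the highest
    # absolute spatial correlation (polarity-invariant, as in microstate analysis).
    n_t = eeg.shape[1]
    labels = np.empty(n_t, dtype=int)
    for t in range(n_t):
        v = eeg[:, t] - eeg[:, t].mean()
        corrs = [abs(np.dot(v, m - m.mean()))
                 / (np.linalg.norm(v) * np.linalg.norm(m - m.mean()))
                 for m in templates]
        labels[t] = int(np.argmax(corrs))
    return labels

def reject_short_segments(labels, min_len=3):
    # Crude temporal smoothing: relabel runs shorter than min_len samples
    # with the label of the preceding run
    labels = labels.copy()
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            if t - start < min_len and start > 0:
                labels[start:t] = labels[start - 1]
            start = t
    return labels
```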

Temporal Parameters
EEG microstate sequences (EEG-MS) are symbolic time series of potential neurophysiological relevance. The temporal dynamic characteristics of EEG-MS can be described by the following parameters, which mainly represent the activation strength, the spatial configuration, and the temporal attributes of microstates: (1) Duration (ms): the average length of continuous segments assigned to a given microstate class. (2) Occurrence (times/s): the average number of times per second that a given microstate class appears. (3) Time coverage (%): the fraction of the total recording time occupied by a given microstate class. (4) GEV (%): the percentage of the total variance explained by a given microstate class.
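These parameters can be computed directly from a label sequence. The following is a small sketch (names are ours), assuming one label per sample:

```python
import numpy as np

def microstate_parameters(labels, sfreq, n_classes):
    # labels: (n_samples,) microstate class index per sample; sfreq: sampling rate in Hz
    labels = np.asarray(labels)
    n = len(labels)
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.r_[0, change]   # first sample of each constant-label run
    ends = np.r_[change, n]     # one past the last sample of each run
    params = {}
    for k in range(n_classes):
        runs = [e - s for s, e in zip(starts, ends) if labels[s] == k]
        params[k] = {
            "duration_ms": 1000.0 * np.mean(runs) / sfreq if runs else 0.0,
            "occurrence_per_s": len(runs) / (n / sfreq),
            "coverage": float(np.mean(labels == k)),
        }
    return params
```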

Transition Probabilities
Transition probabilities can be derived to quantify the probability of a certain class switching to another class. The transition probability between two states is given as T_{ij} = P(X_{t+1} = S_j | X_t = S_i). A Markov chain describes the probability distribution of the system either remaining in its current state or transitioning to a different state at the next time point. In this study, separate transition probabilities are computed and compared for each of the four conditions (high vs. low valence and high vs. low arousal).
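The empirical estimate from a label sequence is straightforward. One caveat: when labels are given per sample rather than per segment, the diagonal is dominated by self-transitions, so studies often estimate T on the sequence of distinct segments instead. A minimal sketch (names are ours):

```python
import numpy as np

def transition_matrix(labels, n_classes):
    # T[i, j] = P(X_{t+1} = S_j | X_t = S_i), estimated from consecutive label pairs
    counts = np.zeros((n_classes, n_classes))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0   # leave rows of unseen states as zeros
    return counts / rows
```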

Statistical Analysis
Statistical analyses were performed using in-house scripts. Each microstate parameter was compared on the valence and arousal dimensions separately. A trial is labeled as belonging to the "high" group if its dimension value is higher than 4.5 and to the "low" group if its dimension value is lower than 4.5. To evaluate group differences in the microstate parameters mentioned above, we used the Wilcoxon rank-sum test for comparisons (Musaeus et al., 2019; Chu et al., 2020). The Wilcoxon rank-sum test is a nonparametric approach that allows us to compare two populations whose underlying distributions are not normal but have similar shapes.
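In SciPy, this test corresponds to `scipy.stats.ranksums`. The parameter values below are invented purely for illustration, not taken from the study:

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical microstate durations (ms) for "high" and "low" trials; made-up values
high = np.array([95.0, 102.0, 110.0, 98.0, 105.0, 99.0])
low = np.array([82.0, 88.0, 85.0, 90.0, 84.0, 86.0])

stat, p = ranksums(high, low)  # two-sided Wilcoxon rank-sum test
if p <= 0.05:
    print(f"significant group difference (p = {p:.3f})")
```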

RESULTS

Microstate Classes
For dataset 1, the group-level clustering revealed nine optimal microstate classes for emotional speech-evoked EEG. These nine microstate topography templates are illustrated in Figure 4A and labeled #1-9. For dataset 2, the microstate analysis identified 10 microstates for emotional music video-evoked EEG (see Figure 4B).

Global Explained Variance
The performance of the microstate segmentation algorithm is reported in terms of the GEV, which estimates the portion of the EEG topographies that can be explained by the microstates. For dataset 1, the nine EEG microstate classes together explained around 85% of the data at global field power peaks. The GEV of each microstate class ranged from 6.55 to 11.25% (see Figure 4C). For dataset 2, ten microstates explained 86% of the variance at all global field power peaks, with the GEV of each microstate class ranging from 6.73 to 11.68%.

Global Map Dissimilarity
GMD is calculated as a measure of topographic differences of microstate maps. For dataset 1, the GMD matrix across different microstates is shown in Table 3. The GMD ranged from 0.10 to 0.25 (mean = 0.18, SD = 0.06). Table 4 presents the GMD between different microstates of dataset 2. The average GMD is 0.25 (SD = 0.08). The range of the GMD is 0.10-0.34.

Temporal Parameters
It is controversial whether a first-order Markov model can capture the complex temporal dependencies of longer time series lasting minutes (von Wegner et al., 2017). The duration of one trial in DEAP is 60 s, whereas it is 5 s in the emotional speech-evoked cognitive experiment. Therefore, the microstate sequence characteristics are evaluated on the speech-evoked EEG dataset. We compared the temporal parameters of microstates in the valence and arousal dimensions separately. We divided the trials into two groups based on the valence or arousal level: a trial is labeled as "high" if its valence (or arousal) value is higher than 4.5 and as "low" if it is lower than 4.5.
The comparison results are shown in Table 5. For the valence dimension, the mean duration, occurrence, time coverage, and GEV are examined for the high-valence and low-valence groups. The Wilcoxon rank-sum test was used to identify statistically significant differences between the high/low conditions for each microstate class in every temporal parameter, with the significance level set to 5%; significant group differences are marked with an asterisk. The results revealed that the duration of microstate #3 significantly increased during the response to a high-valence stimulus (p = 0.02). No significant differences in occurrence, time coverage, or GEV between the groups were found.
For the arousal dimension, microstates #3 and #6 showed a marked increase in duration for high arousal (p = 0.05). On the other hand, the occurrence, time coverage, and GEV of microstate #7 decreased for high arousal.
Further tests examined the model of transition probabilities for valence and arousal, respectively. Table 6 depicts the statistically significant differences (p-values) in the directions of transitions between the high- and low-level groups. For valence, the statistical analysis revealed significant differences between the high and low groups in five transitions. The asterisk indicates a significant difference (p ≤ 0.05).

Emotion Recognition Results
In order to verify the effectiveness of our feature sets, we first used the EEG data from the public DEAP dataset to validate our framework. Then, the proposed feature extraction was applied to the speech-evoked EEG dataset. A fivefold cross-validation method is adopted to evaluate the performance: the dataset is split into five folds; in each iteration, one fold is used to test the model, and the rest serve as the training set. The process is repeated until each fold has been used as the test set.
For the two-class classification problem, the accuracies are measured using

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Performance on DEAP Dataset
The dataset is separated into high and low classes by the valence or arousal dimension. Valence and arousal ratings higher than 4.5 are labeled as high, and lower ratings as low.
Considering temporal dependencies more complex than first-order Markov models, von Wegner et al. (2017) suggested that the geometric distribution of microstate durations holds for short EEG time series of up to 16 s. In DEAP, the duration of the EEG signals is 60 s. Therefore, we segment each signal using a moving window with a length of 5 s to evaluate short-time identifiability.
We perform three experiments on the microstate-related feature sets. We first use the four temporal parameters (duration, occurrence, time coverage, and GEV) as features to obtain accuracies for the valence and arousal dimensions and then use transition probabilities as features to obtain the accuracies. Finally, we combine temporal parameters and transition probabilities into one feature set to measure performance. The extracted features are fed into a support vector machine (SVM) for classification. SVM is widely used for emotion recognition and has shown promising performance in many fields. We also compare against other feature sets reported in the literature.
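A sketch of the third experiment, combining the two feature groups and feeding them to an SVM; the feature values are random placeholders and the feature dimensionalities are illustrative assumptions (four temporal parameters per microstate class, plus the off-diagonal transition probabilities).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_trials, n_states = 40, 10

# Hypothetical per-trial features:
# duration, occurrence, coverage, GEV for each microstate class ...
temporal = rng.normal(size=(n_trials, 4 * n_states))
# ... plus off-diagonal transition probabilities between classes.
transitions = rng.random(size=(n_trials, n_states * (n_states - 1)))

X = np.hstack([temporal, transitions])   # combined feature set
y = rng.integers(0, 2, size=n_trials)    # placeholder high/low labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
```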
The accuracy results of high/low valence and arousal are given in Table 7. The four temporal parameters with SVM yield accuracy rates of 72.5 and 72.1% for high/low valence and high/low arousal, while the transition probabilities score 74.4 and 73.9%, respectively. The highest scores of 75.8% for valence and 77.1% for arousal are obtained by combining temporal parameters and transition probabilities. Our methods are compared with other state-of-the-art approaches that use the DEAP dataset. According to the comparison table, our study achieves higher accuracy rates than the previous studies. The results demonstrate that the parameters derived from microstate sequences are promising features for characterizing the dynamics of neural activity and recognizing emotion from EEG signals.

Performance on Speech-Evoked EEG Signals
In this section, the performances of microstate characteristic features are evaluated on the emotional speech-evoked EEG dataset.
Three different classifiers, namely, SVM, random forest, and artificial neural network (ANN), are applied to the three feature sets.
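A minimal sketch of comparing the three classifier types with scikit-learn; the data, hyperparameters, and cross-validation setup are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))        # placeholder microstate features
y = rng.integers(0, 2, size=60)      # placeholder high/low labels

classifiers = {
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(n_estimators=100,
                                            random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                         random_state=0),
}

# Mean fivefold cross-validation accuracy per classifier.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
```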

DISCUSSION
In this study, we applied microstate analysis to the emotional auditory response. Our proposed method, DTAAHC, revealed that nine template maps best described the entire dataset, explaining ∼85% of the global variance for speech-evoked EEG. For music-evoked EEG, 10 template maps explained ∼86% of the data. In previous visual research, Gianotti et al. (2008) studied the temporal dynamics of the neural activity responding to emotional word and picture stimuli using ERP microstate analysis. In the emotional word experiment, 11 sequential microstates were identified; four of them were valence-sensitive and two were arousal-sensitive. In the emotional picture experiment, the microstate analysis identified 15 sequential microstates, of which five were valence-sensitive and two were arousal-sensitive. Although four prototypical microstate classes are useful for comparing or complementing results across different studies, several studies have also suggested that the number of microstate classes should be explicitly driven by the data. Muthukrishnan et al. (2016) performed microstate analysis in a visuospatial working memory task, in which the optimal number of clusters was determined by the cross-validation criterion without prior assumptions. D'Croz-Baron et al. (2019) found that six template microstate maps best described a dataset spanning autism spectrum disorder and neurotypical controls. In research on schizophrenia (Soni et al., 2018, 2019), four to six microstate maps were clustered, depending on the conditions of the experiments. Michel and Koenig (2018) discussed a metacriterion for the optimal number of clusters, suggesting that the most appropriate choice is a pragmatic compromise between the needs for specificity and generalizability.
The four prototypical microstates exhibit highly similar topographies across studies and are consistently labeled as classes A, B, C, and D. Microstate A exhibits a left-right orientation, map B a right-left orientation, map C an anterior-posterior orientation, and map D a fronto-central maximum (Michel and Koenig, 2018). In terms of the orientation of the electrical axis, we relate some microstates of our study to the four prototypical microstates. Here we mark maxima as "+" and minima as "−." In our emotional speech-evoked cognitive experiment, three microstates (#3, #4, and #8) are characterized by a fronto-central orientation of the maxima, similar to map D (Santarnecchi et al., 2017; da Cruz et al., 2020). Some studies suggest that microstate D is associated with attention network activity (Milz et al., 2016). For the music-evoked EEG dataset, microstates #5 and #8 exhibit a fronto-central maximum.
We also identify some microstates that differ markedly from the prototypical microstates. In the speech-evoked emotion experiment, microstate #9 has a local extremum in the posterior region (+). In the music-evoked emotion experiment, microstates #4 and #6 exhibit local maxima in the posterior region, while microstates #7 and #9 show local minima at the axis center.
For future research, the relationship between microstates and brain functions can be explored using source localization. Some computational approaches, e.g., distributed linear inverse solution (LAURA) (de Peralta Menendez et al., 2004), can help understand the brain source activation in terms of intracranial generators.
We further delved into the temporal characteristics of microstates for emotional speech perception. The Wilcoxon rank-sum test was used to analyze the statistical differences in the microstate parameters between groups. For the valence dimension, the results indicated that the mean duration of microstate #3 (active prefrontal cortex) in the high group was longer than that in the low group. For the arousal dimension, three microstates showed significant differences between the high and low groups. Specifically, the mean duration of microstates #3 and #6 (active frontal lobe) in the high group was longer than that in the low group, and the occurrence, coverage, and GEV of microstate #7 (active temporal lobe) differed significantly between the high and low groups. In previous research, Gianotti et al. (2008) found that five of the 15 microstates differed for pleasant vs. unpleasant pictures, and two of the 15 differed for high- vs. low-arousing pictures. However, it is difficult to compare that work with our study directly, since visual and auditory information activate different cortices.
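The group comparison described here can be sketched with SciPy's implementation of the Wilcoxon rank-sum test; the group values below are synthetic placeholders, not study data.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(3)

# Hypothetical per-subject mean durations (ms) of one microstate
# class, for the high- and low-valence groups.
high_group = rng.normal(loc=85.0, scale=10.0, size=20)
low_group = rng.normal(loc=75.0, scale=10.0, size=20)

stat, p = ranksums(high_group, low_group)
significant = p <= 0.05   # the significance criterion used in the study
```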

CONCLUSION
The main purpose of this study is to extract novel features based on EEG microstates for emotion recognition. Determining the optimal number of microstates automatically is a challenge for applying microstate analysis to emotion. To overcome the limitation, this research proposed DTAAHC. The proposed method identified 10 microstates on a public music-evoked EEG dataset (DEAP) and nine microstates on our recorded emotional speech-evoked EEG dataset. Subsequently, the microstate sequence characteristics were compared in the aspect of high/low valence or arousal conditions. Finally, these characteristics were fed into the classifier for emotion recognition. All the findings in this work suggested that the microstate sequence characteristics can effectively improve the performance of emotion recognition from EEG signals. We hope this work will stimulate future research to propose novel algorithms to reduce the limitation of microstate analysis and uncover more interesting mechanisms of the affective process, e.g., linking the source localization of microstates to brain functions can help understand the functional significance of these states.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Heilongjiang Provincial Hospital. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
JC was involved in the conduct of the experiment, data analysis, and writing of the manuscript. HL, LM, and FS were involved in the conception, supervision, and manuscript review.
HB was involved in the study design and conduct of the experiment. YS was involved in the study design and subject recruitment. All authors contributed to the article and approved the submitted version.