MS-MDA: Multisource Marginal Distribution Adaptation for Cross-Subject and Cross-Session EEG Emotion Recognition

As an essential element for the diagnosis and rehabilitation of psychiatric disorders, electroencephalogram (EEG)-based emotion recognition has achieved significant progress due to its high precision and reliability. However, one obstacle to practicality lies in the variability between subjects and sessions. Although several studies have adopted domain adaptation (DA) approaches to tackle this problem, most of them treat multiple EEG data from different subjects and sessions together as a single source domain for transfer, which either fails to satisfy the assumption of domain adaptation that the source has a certain marginal distribution, or increases the difficulty of adaptation. We therefore propose the multi-source marginal distribution adaptation (MS-MDA) for EEG emotion recognition, which takes both domain-invariant and domain-specific features into consideration. First, we assume that different EEG data share the same low-level features; then we construct independent branches for multiple EEG data source domains to adopt one-to-one domain adaptation and extract domain-specific features. Finally, the inference is made by multiple branches. We evaluate our method on SEED and SEED-IV for recognizing three and four emotions, respectively. Experimental results show that the MS-MDA outperforms the comparison methods and state-of-the-art models in cross-session and cross-subject transfer scenarios in our settings. Code is available at https://github.com/VoiceBeer/MS-MDA.


INTRODUCTION
Emotion, as physiological information and unlike the widely studied logical intelligence, is central to the quality and range of daily human communication (Dolan, 2002; Tyng et al., 2017). In human-computer interaction (HCI), emotion is crucial in influencing situation assessment and belief information, from cue identification to situation classification, with decision selection for building a friendly user interface (Jeon, 2017). For example, affective brain-computer interfaces (aBCIs), which act as a bridge between the emotions extracted from the brain and the computer, have shown potential for rehabilitation and communication (Birbaumer, 2006; Frisoli et al., 2012; Lee et al., 2019). Besides, many studies have shown a strong correlation between emotions and mental illness. Barrett et al. (2001) study the relation between emotion differentiation and emotion regulation. Joormann and Gotlib (2010) find that depression is strongly associated with the use of emotion regulation strategies. Bucks and Radford (2004) investigate the identification of non-verbal communicative signals of emotion in people suffering from Alzheimer's disease. To quantify emotion, most researchers have focused on conventional methods such as classifying emotions from facial expression or language (Ekman, 1993). In recent years, owing to their reliability, easy accessibility, and high precision, non-invasive BCIs based on signals such as the electroencephalogram (EEG) have been widely used for brain signal acquisition and the analysis of psychological disorders (Sanei and Chambers, 2013; Acharya et al., 2015; Liu et al., 2015; Ay et al., 2019). With EEG signals, many works also investigate rehabilitation methods for psychological disorders: Jiang et al. (2021) use the spatial information of EEG signals to classify depression, and Zhang et al. (2020) propose a brain functional network framework for major depressive disorder using EEG signals. Besides, Hosseinifard et al. (2013) investigate non-linear features from EEG signals for classifying depression patients and normal subjects. The flow of an EEG-based affective BCI (aBCI) for emotion recognition is introduced in section 3.1.
Due to the non-stationarity of EEG signals across individual sessions and subjects (Sanei and Chambers, 2013), it is still challenging to obtain a model that can be shared across different subjects and sessions in EEG-based emotion recognition, which gives rise to two scenarios: cross-subject and cross-session (i.e., data collected from the same subject at the same session can be very biased; a detailed description is given in section 3.2). Besides, the analysis and classification of the collected signals are time-consuming and labor-intensive, so it is important to make use of the existing labeled data to analyze new signals in EEG-based BCIs. For this purpose, domain adaptation is widely used in research works. As a sub-field of machine learning, domain adaptation (DA) improves learning in the unlabeled target domain through the transfer of knowledge from the source domains, which can significantly reduce the number of labeled samples required (Pan and Yang, 2009). In practice, we often face situations that contain multiple source domain data (i.e., data from different subjects or sessions). Due to the shift between domains, adopting DA for EEG data is difficult, especially when facing multiple sources. In recent years, researchers have tended to merge all source domains into one single source and then use DA to align the distribution (Source-combine DA in Figure 1) (Zheng and Lu, 2016; Jin et al., 2017; Li et al., 2018, 2019a,b, 2020; Zheng et al., 2018; Zhao et al., 2021). This simple approach may improve the performance because it expands the training data for the model, but it ignores and disrupts the non-stationarity of each EEG source domain itself (i.e., EEG data of different people obey different marginal distributions). Besides, when the sources are directly merged into one new source domain, it cannot be determined whether the new marginal distribution still obeys an EEG-data distribution, which brings a larger bias.
FIGURE 1 | Two strategies of multi-source domain adaptation. (A) is a single-branch strategy while (B) is a multi-branch strategy. In (A), all source domains are combined into one new big source, which is then used to align the distribution with the target domain, while in (B), multiple sources are aligned at the same time and are divided into multiple branches to adopt DA with the target domain. In short, (A) is one source, one branch with one-to-one DA; (B) is multiple sources, multiple branches with one-to-one DA. The figure is best viewed in color.
To solve the multi-source domain adaptation problems in EEG-based emotion recognition, we propose a Multi-Source Marginal Distribution Adaptation for cross-subject and cross-session EEG emotion recognition (MS-MDA, as illustrated in Figure 1). First, we assume all the EEG data share low-level features, especially those taken from the same device, the same subject and the same session. Based on this, we construct a simple common feature extractor to extract domain-invariant features. Then for multiple sources, since each of them has some specific features, we pair every single source domain with the target domain to form a branch for one-to-one DA, and align the distribution and extract domain-specific features. After that, a classifier is trained for each branch, and the final inference is made by these multiple classifiers from multiple branches. The details of MS-MDA are given in section 4.
In this study, we make the following two contributions:
1. We propose and evaluate MS-MDA for EEG-based emotion recognition, a new multi-source adaptation approach that avoids disrupting the marginal distributions of EEG data. Extensive experiments demonstrate that our method outperforms the comparison methods on SEED and SEED-IV, and additional experiments also illustrate that our method generalizes well.
2. Though many works have achieved considerable results, there is no systematic discussion of the normalization operation of EEG data. Thus we design and conduct extensive experiments to investigate the effects of three normalization types (i.e., electrode-wise, sample-wise, and global-wise; details are given in section 5.4), and of the order of normalization, that is, whether to concatenate multiple sources first or to normalize each session individually first. To our knowledge, we are the first to investigate the normalization methods for EEG data, which we believe can serve as a guide for future works and be applied to all data in EEG-based datasets and EEG-related domains.
In the remainder of this paper, we first review related works on domain adaptation in the field of EEG-based emotion recognition in section 2. Section 3 introduces the materials, including the diagram of EEG-based affective BCI with transfer scenarios, datasets, and pre-processing methods. The details of MS-MDA are given in section 4, whereas section 5 presents the settings, results, and additional experiments. Section 6 discusses the results of the experiments and our findings, as well as problems and solutions. Finally, section 7 concludes the work and outlines future extensions.

RELATED WORK
In recent years, research on affective computing has become one of the trends in machine learning, neural systems, and rehabilitation studies. In these works, emotions are usually characterized by one of two types of emotion model: discrete categories (basic emotional states, e.g., happy, sad, neutral; Zheng and Lu, 2015) or continuous values (e.g., in the 3D space of arousal, valence, and dominance; Koelstra et al., 2011). With domain adaptation techniques, many works have achieved significant performance in the field of affective computing. Zheng and Lu (2016) first apply Transfer Component Analysis (Pan et al., 2010) and Kernel Principal Component Analysis based methods on the SEED dataset to personalize EEG-based affective models and demonstrate the feasibility of adopting DA in EEG-based aBCIs. Chai et al. (2017) propose adaptive subspace feature matching to decrease the marginal distribution discrepancy between two domains, which requires no labeled samples in the target domain. To solve cross-day binary classification, Lin et al. (2017) extend robust principal component analysis (rPCA) (Candès et al., 2011) to their filtering strategy, which can capture EEG oscillations of relatively consistent emotional responses. Different from the above, Li et al. (2019b) consider the multi-source scenario and propose a Multisource Style Transfer Mapping (MS-STM) framework for cross-subject transfer. They first take a few labeled training data to learn multiple STMs, which are then used to map the target domain distribution to the space of the sources. Though they consider adding prior information of source-specific features, they do not take the domain-invariant features into consideration, thus losing the low-level information.
In recent years, with the development of deep learning techniques and their usability, many works on EEG-based decoding with neural networks have been proposed. Jin et al. (2017) and Li et al. (2018) adopt the deep adaptation network (DAN) (Long et al., 2015) for EEG-based emotion recognition, which takes maximum mean discrepancy (MMD) (Borgwardt et al., 2006) as a measure of the distance between the source and the target domain, and is trained to reduce it on multiple layers. Extending the original method, Chai et al. (2016) propose the subspace alignment auto-encoder (SAAE), which first projects both source and target domains into a domain-invariant subspace using an auto-encoder, after which kernel PCA, graph regularization, and MMD are used to align the feature distribution. To adapt the joint distribution, Li et al. (2019a) propose a domain adaptation method for EEG-based emotion recognition that simultaneously adapts marginal distributions and conditional distributions; they also present a fast online instance transfer (FOIT) for improved EEG emotion recognition (Li et al., 2020). Zheng et al. extend the SEED dataset to the SEED-IV dataset and present EmotionMeter, a multi-modal emotion recognition framework that combines the two modalities of eye movements and EEG waves. With the concept of attention-based convolutional neural networks (CNN) (Yin et al., 2016), Fahimi et al. (2019) develop an end-to-end deep CNN for cross-subject transfer and fine-tune it using some calibration data from the target domain. To tackle the requirement of amassing extensive EEG data, Zhao et al. (2021) propose a plug-and-play domain adaptation method that shortens the calibration time to within a minute while maintaining the accuracy. Wang et al. (2021) present a domain adaptation SPD matrix network (daSPDnet) to help cut the demand for calibration data in BCIs.
These aBCI works have achieved significant improvements in their respective directions and transfer scenarios, and on multiple benchmark databases. However, many of them focus on combining multiple sources into one and adopting one-to-one DA, which ignores the differences between the marginal distributions of different EEG domains (source-combine DA in Figure 1). This operation may compromise the effectiveness of downstream tasks, and although it extends the training data to some extent, the trained models do not generalize well enough. Therefore, inspired by the multi-source transfer framework of Zhu et al. (2019), we propose MS-MDA (multi-source marginal distribution adaptation for EEG-based emotion recognition), which transfers multiple source domains to the target domain separately, thus avoiding the destruction of the marginal distribution of the multiple EEG source domains, and which also takes the domain-invariant features into consideration. Due to the sensitivity of the EEG data and intuition, we do not adopt complex networks, but just a combination of a few multi-layer perceptrons (MLPs) (Gardner and Dorling, 1998), which makes our method computationally efficient and easy to extend.
FIGURE 2 | The flowchart of EEG-based BCI for emotion recognition. The emotions are first evoked and encoded into EEG data, then the EEG data are pre-processed and extracted to various forms of features for subsequent pattern recognition.

Diagram
The flow of one EEG-based aBCI for emotion recognition is shown in Figure 2, which involves five steps:
• Stimulating emotions. The subjects are first exposed to stimuli that correspond to a target emotion. The most commonly used stimuli are movie clips with sound, which can better evoke the desired emotion because they mix sound with images and actions. After each clip, a self-assessment is also completed by the subject to ensure the consistency of the evoked emotion and the target emotion.
• EEG signal acquisition and recording. The EEG data are collected using the dry electrodes on the BCI, and are then labeled with the target emotion.
• Signal pre-processing. Since the EEG data are a mixture of various kinds of information containing much noise, the EEG signal must be pre-processed to get cleaner data for subsequent recognition. This step often includes down-sampling, band-pass filtering, temporal filtering, and spatial filtering to improve the signal-to-noise ratio (SNR).
• Feature extraction. In this step, features of the pre-processed signals are extracted in various ways. Most current research works extract features in the time or frequency domain.
• Pattern recognition. Machine learning techniques are used to classify or regress data according to specific application scenarios.

Scenarios
Considering the sensitivity of EEG, domain adaptation in emotion recognition can be divided into several cases: (1) Cross-subject transfer. In one session, new EEG data from a new subject are taken as the target domain, and the existing EEG data from the other subjects are taken as the source domains for DA. (2) Cross-session transfer. For one subject, data collected in the previous sessions are used as the source domains for DA, and data collected in the new session are taken as the target domain. In our work, since the datasets we evaluate on contain 3 sessions and 15 subjects (refer to section 3.3 for details), we take the data of the first 2 sessions from one subject as the source domains for cross-session transfer, and the data of the first 14 subjects from one session as the source domains for cross-subject transfer. The results of the cross-session scenario are averaged over 15 subjects, and the results of the cross-subject scenario are averaged over 3 sessions. Standard deviations are also calculated.
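The two splits described above can be sketched as follows. This is only an illustration: the function names and the 0-based indexing of sessions and subjects are assumptions made here, not part of the original implementation.

```python
from typing import List, Tuple


def cross_session_split(n_sessions: int = 3) -> Tuple[List[int], int]:
    """For one subject: the first sessions are sources, the last one is the target."""
    sessions = list(range(n_sessions))
    return sessions[:-1], sessions[-1]


def cross_subject_split(n_subjects: int = 15) -> Tuple[List[int], int]:
    """For one session: the first subjects are sources, the last one is the target."""
    subjects = list(range(n_subjects))
    return subjects[:-1], subjects[-1]


if __name__ == "__main__":
    print(cross_session_split())   # two source sessions, one target session
    print(cross_subject_split())   # fourteen source subjects, one target subject
```

Each source index corresponds to one independent source domain, so the cross-session setting yields 2 branches and the cross-subject setting yields 14.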

Datasets
The databases we evaluate on are SEED (Duan et al., 2013; Zheng and Lu, 2015) and SEED-IV, both established by the BCMI laboratory led by Prof. Bao-Liang Lu at Shanghai Jiao Tong University.
The SEED database contains emotion-related EEG signals that are evoked by 15 film clips (with positive, neutral, and negative emotions) from 15 subjects (7 males and 8 females, with an average age of 23.27) with 3 sessions each. The signals are recorded by a 62-channel ESI NeuroScan system. SEED-IV is an evolution of SEED, which contains 3 sessions, each with 15 subjects and 24 film clips. Compared with SEED, which contains EEG signals only, this database also includes eye movement features recorded by SMI eye-tracking glasses.

Pre-processing
After the raw EEG data are collected, signal pre-processing and feature extraction are applied. For both SEED and SEED-IV, to increase the SNR, the raw EEG signals are first down-sampled to a 200 Hz sampling rate and then processed with a band-pass filter between 1 Hz and 75 Hz. After that, features are extracted.
Recent works extract features from EEG data in the time domain, frequency domain, and time-frequency domain. Among them, Differential Entropy (DE), as in (1), has the ability to distinguish patterns from different bands (Soleymani et al., 2015), thus we choose to take DE features as the input data of our model. For SEED and SEED-IV, extracted DE features at the five frequency bands of delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz), and gamma (31-50 Hz) are provided.
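For reference, the standard definition of differential entropy for a continuous random variable with density f, together with its closed form when a band-filtered EEG segment is assumed to be approximately Gaussian with variance σ², can be written as below. The Gaussian assumption is the one commonly made for DE features; it is stated here as an assumption, and the block is not necessarily the exact form of (1).

```latex
h(X) = -\int_{-\infty}^{\infty} f(x)\,\log f(x)\,dx
     = \frac{1}{2}\log\left(2\pi e\,\sigma^{2}\right),
\qquad X \sim \mathcal{N}(\mu, \sigma^{2})
```

Under this assumption, the DE of a band depends only on the variance of the signal in that band.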
The data from one subject in one session for both databases are in the form of channel (62) × trial (15 for SEED, 24 for SEED-IV) × band (5); we then merge the channel with the band, and the form becomes trial × 310 (62 × 5). For SEED, the 15 trials contain 3394 samples in total for each session. For SEED-IV, the 24 trials contain 851/832/822 samples for the three sessions, respectively. In the end, all data are formed into 3394 × 310 (SEED) or 851/832/822 × 310 (SEED-IV), with corresponding label vectors in the form of 3394 × 1 or 851/832/822 × 1.
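The merge of the channel and band axes can be sketched with NumPy as follows. The function name and the channel-major ordering of the resulting 310-D vector are assumptions made for illustration; the text only specifies the shapes.

```python
import numpy as np


def flatten_de_features(de: np.ndarray) -> np.ndarray:
    """Reshape DE features from (channel, sample, band) to (sample, channel * band)."""
    n_channels, n_samples, n_bands = de.shape
    # Move the sample axis first, then merge channel and band into one feature axis.
    return de.transpose(1, 0, 2).reshape(n_samples, n_channels * n_bands)


if __name__ == "__main__":
    # Random stand-in for one SEED session: 62 channels, 3394 samples, 5 bands.
    de = np.random.default_rng(0).normal(size=(62, 3394, 5))
    print(flatten_de_features(de).shape)
```

With this ordering, feature index `c * 5 + b` of a sample holds the DE value of channel `c` in band `b`.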

METHOD
For simplicity of presentation, we list the symbols used in the following sections and their definitions in Table 1.
Given a set of pre-existing EEG data and newly collected EEG data, our goal is to learn a model φ that is trained on these multiple independent source domain data using DA, and thus makes better predictions on the newly collected data than a model obtained by simply combining the existing data into one source domain. The architecture of the proposed method is illustrated in Figure 3.
As shown in Figure 3, the input to the MS-MDA consists of N source domain data $\{X_{S_i}\}_{i=1}^{N}$ and a target domain data $\{X_T\}$. These data are fed into a common feature extractor module to get the domain-invariant features $\{Q_{S_i}\}_{i=1}^{N}$ and $\{Q_T\}$. Then, for each domain-specific feature extractor, the extracted common features $\{Q_{S_i}\}_{i=1}^{N}$ are fed into one branch together with $\{Q_T\}$ to get their domain-specific features $\{R_{S_i}\}_{i=1}^{N}$ and $\{R_{T_i}\}_{i=1}^{N}$; on top of that, the MMD value is calculated, which is a measure of the distance between the current source and the target domain. Next, the target domain features $\{R_{T_i}\}_{i=1}^{N}$ and all the source domain features $\{R_{S_i}\}_{i=1}^{N}$ extracted in the last step are passed to the domain-specific classifiers to get the corresponding predictions $\{\hat{Y}_{S_i}\}_{i=1}^{N}$ and $\{\hat{Y}_{T_i}\}_{i=1}^{N}$; the results of the source domains are then used to calculate the classification loss. Since the target domain is fed into all the source domain classifiers, multiple target domain predictions are generated. These predictions are used to calculate the discrepancy loss. In the end, the average of these target-domain predictions is taken as the output of the model. Details of these modules are given below.

Common Feature Extractor
The common feature extractor in the MS-MDA maps the source and target domain data from their original feature spaces to a shared latent space, in which common representations of all domains are extracted. This module helps to extract low-level domain-invariant features.

Domain-Specific Feature Extractor
The domain-specific feature extractor follows the Common Feature Extractor (CFE). After obtaining the common features of all domains, we set up N single fully connected layers corresponding to the N source domains. For each pair of source and target domain, we map the data to a unique latent space via the corresponding Domain-specific Feature Extractor (DSFE), and then obtain the domain-specific features in each branch. To apply DA and bring the two domains close in the latent space, we choose the MMD to estimate the distance between these two domains. MMD is widely used in DA and can be formulated as in (2). During training, the MMD loss is decreased to bring the source domain and the target domain closer in the feature space, which helps make better predictions for the target domain. This module aims to learn multiple domain-specific features.
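As an illustration of how an MMD value between one source branch and the target can be estimated, the sketch below computes the biased empirical squared MMD with a single Gaussian kernel. The kernel choice, the fixed bandwidth `sigma`, and the function names are assumptions for this example; the kernel used in (2) and in the authors' code may differ.

```python
import numpy as np


def gaussian_kernel(a: np.ndarray, b: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Pairwise Gaussian kernel values between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def mmd2(source: np.ndarray, target: np.ndarray, sigma: float = 1.0) -> float:
    """Biased empirical estimate of the squared MMD between two feature sets."""
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return float(k_ss + k_tt - 2.0 * k_st)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r_s = rng.normal(size=(256, 32))            # stand-in for source features R_S
    r_t = rng.normal(loc=0.5, size=(256, 32))   # stand-in for shifted target features R_T
    print(mmd2(r_s, r_t, sigma=8.0))
```

The estimate is zero when both feature sets are identical and grows as the two distributions move apart, which is the quantity each branch minimizes.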

Domain-Specific Classifier
The domain-specific classifier uses the features extracted by the DSFE to predict the result. In the Domain-specific Classifier (DSC), there are N single softmax classifiers, one for each source domain. For the training of each classifier, we choose cross-entropy to estimate the classification loss, as shown in (3). Besides, since there are N classifiers in this module, and these N classifiers are trained on N source domains, if their predictions are simply averaged as the final result, the variance will be high, especially when the target domain samples are at the decision boundary, which will have a significant negative impact on the results. To reduce this variance, a metric called discrepancy loss is introduced to make the predictions of the N classifiers converge, as shown in (4). The average of the predictions of the N classifiers is taken as the final result.
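A minimal NumPy sketch of the three operations in this module is given below: the cross-entropy classification loss, a discrepancy loss between classifiers, and the averaged final prediction. The discrepancy is written as the mean absolute (L1) difference between the softmax outputs of every pair of classifiers; the pairwise averaging and all function names are assumptions for illustration, not the exact form of (3) and (4).

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy of one classifier on labeled source samples."""
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))


def discrepancy_loss(target_logits: list) -> float:
    """Mean absolute difference between the softmax outputs of each classifier pair."""
    probs = [softmax(l) for l in target_logits]
    total, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            total += float(np.mean(np.abs(probs[i] - probs[j])))
            pairs += 1
    return total / pairs if pairs else 0.0


def ensemble_predict(target_logits: list) -> np.ndarray:
    """Average the N classifiers' probabilities and take the most likely class."""
    mean_prob = np.mean([softmax(l) for l in target_logits], axis=0)
    return mean_prob.argmax(axis=1)
```

When all classifiers agree on the target samples, the discrepancy term vanishes and the averaged prediction equals each individual prediction.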
In summary, MS-MDA accepts N source domain EEG data and one target domain EEG data, and passes them through a common feature extractor to get N source domain features and one target domain feature. Next, N domain-specific feature extractors are used to compute, pair by pair, the MMD loss between one individual source and the target domain and to extract their domain-specific features. Finally, the domain-specific classifiers perform the classification task: the classification loss of the N classifiers is calculated on the source features, and the discrepancy loss of the N classifiers is calculated on the target domain features produced by the N feature extractors.
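The data flow summarized above can be sketched as a forward pass in NumPy. The layer sizes 310, 64, and 32 follow the implementation details reported in section 5.1; the hidden widths (256 and 128), the random weights, and the class and method names are placeholders assumed for this example, and training is omitted.

```python
import numpy as np


def leaky_relu(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    return np.where(x > 0, x, slope * x)


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def linear(rng, n_in: int, n_out: int):
    """Randomly initialised (weight, bias) pair standing in for trained parameters."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


class MSMDASketch:
    def __init__(self, n_sources: int, n_classes: int, hidden=(256, 128), seed: int = 0):
        rng = np.random.default_rng(seed)
        dims = (310, *hidden, 64)  # hidden widths are assumed, not taken from the text
        self.cfe = [linear(rng, a, b) for a, b in zip(dims[:-1], dims[1:])]
        self.dsfe = [linear(rng, 64, 32) for _ in range(n_sources)]
        self.dsc = [linear(rng, 32, n_classes) for _ in range(n_sources)]

    def common(self, x: np.ndarray) -> np.ndarray:
        """Common feature extractor: 3-layer MLP, LeakyReLU after every layer."""
        for w, b in self.cfe:
            x = leaky_relu(x @ w + b)
        return x

    def branch(self, q: np.ndarray, i: int):
        """Branch i: domain-specific features R and classifier logits."""
        w, b = self.dsfe[i]
        r = leaky_relu(q @ w + b)
        w, b = self.dsc[i]
        return r, r @ w + b  # no activation after the classifier layer

    def predict(self, x_target: np.ndarray) -> np.ndarray:
        """Feed the target through every branch and average the predictions."""
        q = self.common(x_target)
        probs = [softmax(self.branch(q, i)[1]) for i in range(len(self.dsfe))]
        return np.mean(probs, axis=0).argmax(axis=1)
```

During training, source batch i would pass through `common` and `branch(..., i)` only, while the target batch passes through every branch so that the MMD and discrepancy terms can be computed.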
The training is based on (5) and follows Algorithm 1. Of the three losses, minimizing the MMD loss yields domain-invariant features for each pair of source and target domains; minimizing the classification loss yields more accurate classifiers for predicting the source domain data; and minimizing the discrepancy loss makes the multiple classifiers more consistent with each other. The setting of α is described in section 5.1, and we also investigate different settings of β in section 5.5.1.
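Based on the description of the three losses and the two coefficients, a plausible form of the total objective is the weighted sum below. The symbol names for the individual loss terms are chosen here for illustration, and the expression is a reading of the text rather than a verbatim copy of (5).

```latex
\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{mmd} + \beta\,\mathcal{L}_{disc}
```

In this form, α weights the domain adaptation term and β weights the agreement between classifiers, while the classification loss always contributes with weight 1.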
Algorithm 1 Overview of MS-MDA.
Input:
Update model by minimizing the total loss
9: end for
10: return $\{\hat{Y}_T\}$;
Output: Prediction of target domain data, $\{\hat{Y}_T\}$;

EXPERIMENTS
In this section, we describe the experimental settings and the results of emotion classification on the two datasets SEED and SEED-IV, together with a normalization study covering three normalization types and two normalization orders for EEG data in domain adaptation. Besides, we also conduct some exploratory experiments in addition to the evaluation of our proposed method and the comparison methods.

Implementation Details
As mentioned in section 4, there are many details in the three modules of MS-MDA. First, for the Common Feature Extractor (CFE), since we do not take raw data (i.e., EEG signals) but the extracted DE features as input vectors, complex deep models such as deep convolutional neural networks are not suitable for this module. We therefore choose a 3-layer MLP for simplicity, which reduces the feature dimension from 310-D (62 × 5, channel × band) to 64-D. In the CFE, every linear layer is followed by a LeakyReLU (Xu et al., 2015) layer. We also evaluate the effect of the ReLU (Nair and Hinton, 2010) activation function, but due to the sensitivity of the EEG data, much information would be lost with ReLU since values less than zero are dropped, so we choose LeakyReLU as a compromise. Next, the domain-specific feature extractor (DSFE) and the domain-specific classifier (DSC) each consist of a single linear layer, which reduces 64-D to 32-D and 32-D to the corresponding number of categories (3 for SEED, 4 for SEED-IV), respectively. In the DSFE, as in the CFE, the linear layer is followed by a LeakyReLU layer, while in the DSC, there is only one linear layer without any activation function. The network is trained using an Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 0.01 for 200 epochs. The batch size we choose is 256, which means we take 256 samples from each domain in every iteration (we also evaluate different settings of batch size and epoch in section 5.5). The whole model is trained under (5). For the domain adaptation loss, we choose MMD as the metric of the distance between two domains in the feature space (CORAL loss has a similar effect). As for the discrepancy loss, L1 regularization is used; we also evaluate this loss in section 5.5. Besides, we dynamically adjust the coefficient α so that training focuses on the classification results first, and then starts aligning MMD and the convergence between the classifiers ($\alpha = \frac{2}{1 + e^{-10 \cdot i/\mathrm{epoch}}} - 1$). As for the training data, we take the DE features and reshape each sample into a 310-D vector as described in section 3.4. Before feeding the data into the model, we normalize all of them electrode-wise; refer to section 5.4 for details.
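The warm-up schedule of α can be computed directly from the formula above. In this sketch, `i` is interpreted as the index of the current training step and `epoch` as the total number of steps; this interpretation and the function name are assumptions.

```python
import math


def alpha_schedule(i: int, epoch: int) -> float:
    """alpha = 2 / (1 + exp(-10 * i / epoch)) - 1, rising from 0 toward 1."""
    return 2.0 / (1.0 + math.exp(-10.0 * i / epoch)) - 1.0


if __name__ == "__main__":
    for step in (0, 20, 50, 100, 200):
        print(step, round(alpha_schedule(step, 200), 4))
```

The coefficient starts at exactly 0, so only the classification loss drives the first updates, and it approaches 1 toward the end of training, at which point the MMD term carries roughly the same weight as the classification loss.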

Results
Experiment results of the comparison methods and our proposed method on SEED and SEED-IV are listed in Table 2 (the best results are shown in bold). All hyper-parameters are the same, except for those results taken directly from the original papers. It should be noted that, since many previous works do not make their code publicly available, we re-implement the comparison methods (in the deep learning domain adaptation field) as described in their papers with our settings, and also include some typical deep learning domain adaptation models for better comparison (DDC, Tzeng et al., 2014; DCORAL, Sun and Saenko, 2016). The results indicate that our method largely outperforms the comparison methods in most transfer scenarios. For the SEED dataset, our method achieves improvements of at least 7 and 3% in the cross-session and cross-subject scenarios, respectively, while for the SEED-IV dataset, the minimum improvements are 7 and 18% for the two transfer scenarios. The results also show that our method outperforms the comparison methods significantly in the cross-subject scenario. The reason may be that the number of sources in the cross-subject scenario is 14, much larger than the 2 sources in the cross-session scenario, which maximizes the effect of treating multiple sources as multiple individuals in domain adaptation rather than concatenating them.

Ablation Study
To understand the effect of each module in the MS-MDA, we remove them one at a time and evaluate the performance of the ablated model; the results are shown in Table 3. The first row of SEED and SEED-IV shows the performance of the full model (the same as in Table 2). The second row ablates the MMD loss in the training process, which makes the model focus only on the classification loss and the discrepancy loss. The significant drop compared to the full model indicates the important effect of domain adaptation. Notice that even the results without the MMD loss are better than those of many comparison methods, showing the importance of treating multiple sources as multiple individuals during training. The third row, which removes the discrepancy loss, shows that this loss affects the performance but the impact is minimal; the reason is that we want this discrepancy loss to be the icing on the cake rather than to have a dominant effect on the model. The fourth row only considers the classification loss, and thus removes losses (2) and (4). Besides, we also conduct experiments to demonstrate how the performance changes with the number of sources. Since the number of experiments would be massive under a full cross-validation rule, we simply take the first subject as the source domain for the one-branch experiment, the first two subjects for the two-branch experiment, the first three subjects for the three-branch experiment, etc. The results are plotted in Figure 4. From the figure, it is clear that as the number of sources increases, the accuracy of our algorithm improves considerably.

Normalization
During the experiments, we also find that different normalizations of the data can significantly impact the outcomes, as can the order of normalization, that is, whether multiple sources are concatenated first or each session is normalized individually first. Thus we design diagrams and conduct extensive experiments to investigate the effects of different normalization strategies on the input data, i.e., the extracted feature vectors from the two datasets. Since we have reshaped the original 4-D matrices (session × channel × trial × band) into 3-D matrices [session × trial × (channel*band)], for each session, there is a 2-D matrix of trial × 310. Following the common machine learning normalization approaches and the prior knowledge and intuition of EEG data (i.e., the data acquired by the same electrode are more consistent with the same distribution), the normalization methods for these 2-D matrices can be categorized into three types, as shown in Figure 5. Besides, since we also take the multi-source situation into consideration, the order of normalization may also influence the performance.
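The three normalization types and the two orders can be sketched as follows for 2-D matrices of samples × 310 features. Several choices here are assumptions made for illustration: z-score normalization is used as the normalization operation, electrode-wise is read as normalizing each of the 310 columns over the samples, sample-wise as normalizing each row, and global-wise as normalizing the whole matrix with one mean and standard deviation.

```python
import numpy as np


def normalize(x: np.ndarray, mode: str, eps: float = 1e-8) -> np.ndarray:
    """Z-score a (sample, feature) matrix electrode-wise, sample-wise, or globally."""
    if mode == "electrode":      # one mean/std per feature column
        mean, std = x.mean(axis=0, keepdims=True), x.std(axis=0, keepdims=True)
    elif mode == "sample":       # one mean/std per sample row
        mean, std = x.mean(axis=1, keepdims=True), x.std(axis=1, keepdims=True)
    elif mode == "global":       # one mean/std for the whole matrix
        mean, std = x.mean(), x.std()
    else:
        raise ValueError(f"unknown normalization mode: {mode}")
    return (x - mean) / (std + eps)


def order_a(sources: list, mode: str) -> np.ndarray:
    """Order A: normalize every single source first, then concatenate them."""
    return np.concatenate([normalize(s, mode) for s in sources], axis=0)


def order_b(sources: list, mode: str) -> np.ndarray:
    """Order B: concatenate all sources into one domain first, then normalize it."""
    return normalize(np.concatenate(sources, axis=0), mode)
```

With electrode-wise normalization, order A removes the per-source offset of every feature, whereas order B keeps the offsets between sources, which is one way the two orders can lead to different results.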
We evaluate three normalization methods and two normalization orders on SEED and SEED-IV with our proposed method MS-MDA and the representative domain adaptation model DAN (Long et al., 2015). The results are listed in Table 4. In all three sets, electrode-wise normalization outperforms the other three normalization types significantly. Comparing DAN 1 with DAN 2, the results indicate that the first normalization order, normalizing the data first and then concatenating them, is better. In the third set, that of MS-MDA, we find that the results of all four normalization types are better than those in the first and second sets, and the improvement is significant. The row w/o normalization in MS-MDA, for example, improves by up to 47%, which also indicates the generalization of our proposed method across different normalization types, and the positive effect of taking multiple sources as individual branches for DA.

Coefficient Study
After multiple sets of experiments, we find that the MMD loss is easy to control and that it plays an influential role in the training, as shown in Table 3. However, for the discrepancy loss, many problems remain. Adding this loss to the model too early affects the overall performance, while adding it too late loses the benefit of learning convergence. Too large a weight causes the training to focus on convergence, so that the few correct classifiers might follow the many incorrect ones; too small a weight may not have enough influence on the model. Also, for the ease of use and simplicity mentioned earlier, we do not run many tests on β, but simply compare the effects of a few settings of β; the results are shown in Table 5. From the table, we can see that compared to row one (w/o disc. loss), introducing the discrepancy loss increases the performance in most cases, especially when it is trained for the whole process in the cross-subject scenario on SEED-IV. We therefore choose a weight of 0.01 and train the discrepancy loss for the whole process according to the results.
FIGURE 5 | (beginning of caption missing) (2) is an operation for all sources. In order A, (1) in the figure stands for the normalization, and (2) stands for the concatenation (i.e., normalize every single source first, and then concatenate them all). In order B, (1) stands for the concatenation while (2) is for the normalization (i.e., concatenate every single source into one big source domain first, and then normalize this domain).

Hyper-Parameters and Visualization
To better investigate our proposed method, we evaluate it with different hyper-parameters and take the representative method DAN as the comparison. The results are shown in Figures 6, 7. As the batch size increases, both models show a drop in performance, especially at a batch size of 512, where the results decrease significantly compared to 256 on SEED-IV. As the number of training epochs increases, neither model improves substantially, especially MS-MDA, but our method achieves moderate accuracy and converges faster. Comparing the cross-subject experiments on the two datasets, MS-MDA has a clear advantage over DAN, which indirectly shows that our approach brings a larger performance improvement for multiple-source domain adaptation in EEG-based emotion recognition. We also visualize the four loss items (total loss, classification loss, MMD loss, and discrepancy loss) in Figure 8. The total loss, the classification loss, and the discrepancy loss decrease as the training step increases. However, the MMD loss curve shows a relatively significant rise at the 2k step. We assume that α reaches the value 1 at the 2k step, which gives the MMD loss the same weight as the classification loss and thus slightly impacts the model.

Training percentage stands for when to add the discrepancy loss into the training: 1 means the whole training process, while 0.2 stands for the last 20% of the training process. Weight of β represents the ratio compared to α. The best results are shown in bold.

For a better understanding of the effect of our proposed method, we randomly pick 100 EEG samples from each subject (domain) in the cross-subject scenario and visualize them with t-SNE (Van der Maaten and Hinton, 2008), as displayed in Figure 9. We only plot the cross-subject scenario since it has more sources, which makes the visualization more informative.
In Figure 9, each color stands for a source domain, and the target domain is in black. For better plotting, we make the target samples transparent to avoid overlap. Note that in the lower left figure we pick 1400 samples, since we concatenate all sources into one.
From Figure 9, it is apparent that the distributions of the EEG data from different subjects (shown in different colors) are close. Most samples are concentrated in one area, with a few outliers in individual subjects. These distributions confirm our hypothesis that all EEG data share some low-level features, i.e., their distributions in the feature space overlap slightly. This is especially noticeable after normalization (upper right in Figure 9), where the distribution of these EEG data is neater and centered. In DAN, since all the source data are concatenated as one source domain, there is only one color for the source domain. The lower left panel of Figure 9 illustrates that the domain adaptation process brings the source and target domains closer together, resulting in a high degree of overlap between the green and black samples, with the black samples concentrated in the more central region of the source domain. As for MS-MDA, since we adapt the distribution of each source domain separately to the target domain, it is intuitive that the black dots should be close to and overlap with each color of the source domains, and the lower right panel of Figure 9 confirms this expectation.

FIGURE 9 | Visualization with t-SNE for raw data (upper left), normalized data (upper right), data using DAN (lower left), and data using MS-MDA (lower right). The input data of the last fully-connected (DSC) layer are used for the computation of the t-SNE. Target data are drawn as black X markers; the 14 source domains are drawn in 14 colors. Notice that since we have concatenated all the source domains, the lower left figure has only one color for the source domain. All four figures are best viewed in color.
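The sampling step behind a plot like Figure 9 can be sketched as below. This is a minimal illustration, not the plotting code used for the figure: the helper names are hypothetical, the random arrays stand in for extracted features, and the embedding step assumes scikit-learn is installed (it is imported lazily so that the sampling helper works without it).

```python
import numpy as np

def sample_per_domain(domains, n_per_domain=100, seed=0):
    """Randomly pick n_per_domain rows from every domain.

    domains: list of arrays, each of shape (n_samples, n_features).
    Returns the stacked features and an integer domain label per row.
    """
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for idx, x in enumerate(domains):
        pick = rng.choice(len(x), size=n_per_domain, replace=False)
        feats.append(x[pick])
        labels.append(np.full(n_per_domain, idx))
    return np.concatenate(feats, axis=0), np.concatenate(labels, axis=0)

def embed_2d(features, seed=0):
    """Project features to 2-D with t-SNE (requires scikit-learn)."""
    from sklearn.manifold import TSNE  # lazy import, optional dependency
    return TSNE(n_components=2, random_state=seed).fit_transform(features)

# Hypothetical stand-in data: 14 source subjects plus 1 target subject.
rng = np.random.default_rng(1)
domains = [rng.normal(size=(120, 64)) for _ in range(15)]
features, domain_ids = sample_per_domain(domains, n_per_domain=100)
print(features.shape, domain_ids.shape)
```

Calling `embed_2d(features)` would then give one 2-D point per sampled row, which can be colored by `domain_ids` (for example, drawing the target domain in black).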

DISCUSSION
As can be seen from Table 2, compared with the selected methods and prior works, our proposed method achieves a significant improvement, especially for cross-subject DA, in which the number of source domains is large. The ablation experiments in Table 3 also show that our proposed method requires both the MMD and the discrepancy loss in most cases. Eliminating the MMD loss causes a significant performance drop on both datasets, confirming the importance of DA. Eliminating the disc. loss has a smaller impact than eliminating the MMD loss, but still confirms the benefit of multi-source convergence. During the experiments, we also find that the type of data normalization has a significant impact on the overall results, so we design experiments to explore the normalization of EEG data in DA and thereby improve the performance of our model. As can be seen in Table 4, there is not much difference between the two normalization orders, and electrode-wise normalization is the most appropriate choice, with a large performance improvement over the other three methods; for our method, which does not concatenate data, electrode-wise normalization is also the most effective. This conclusion is in line with our intuition that data collected from the same electrode are relatively regular or conform to a certain distribution, while data collected from different electrodes differ considerably. In addition, we find that the disc. loss needs to be carefully adjusted, otherwise it can easily be harmful. We suppose this is because the loss introduces a convergence effect on the multiple classifiers in the model (in other words, it smooths the inferences made by the multiple classifiers), and if most of the classifiers are wrong, this convergence effect will lead the correct classifiers into error. Therefore, we also test and evaluate the impact of the disc. loss coefficient on the model under different settings. From Table 5, we can see that the disc. loss achieves the best results when its coefficient is set to 0.01 times the MMD loss coefficient and it is used throughout the full model training.
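The smoothing effect of the disc. loss discussed above can be made concrete with a small sketch. It assumes that the loss is the mean absolute difference between the softmax outputs of every pair of branch classifiers on the same target batch, and that the final inference averages the branch probabilities; both choices, as well as the function names, are illustrative assumptions rather than a statement of the exact formulation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discrepancy_loss(branch_logits):
    """Mean absolute difference between every pair of branch predictions.

    branch_logits: list of arrays, one per source branch, each of shape
    (batch_size, n_classes), all computed on the same target batch.
    """
    probs = [softmax(l) for l in branch_logits]
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return float(np.mean([np.abs(probs[i] - probs[j]).mean() for i, j in pairs]))

def ensemble_predict(branch_logits):
    """Average the branch probabilities and take the arg max per sample."""
    probs = np.mean([softmax(l) for l in branch_logits], axis=0)
    return probs.argmax(axis=1)

# Hypothetical logits for two target samples and three emotion classes.
agree = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
print(discrepancy_loss([agree, agree, agree]))  # branches agree
print(discrepancy_loss([agree, -agree]))        # branches disagree
print(ensemble_predict([agree, agree]))
```

Because the loss is minimized when all branches output the same distribution, it pulls every classifier towards the others regardless of which one is correct, which is consistent with the failure mode described above when most branches are wrong.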
After exploring the internal details of the model, we also evaluate its performance under different hyperparameters. For better comparison, we choose the representative DAN as the comparison method. From Figures 6, 7, we can see that the performance of both models decreases significantly as the batch size increases. We assume the reason is that a small batch size tends to fall into local optima and overfit. Besides, the number of epochs has a minimal impact on both models, particularly MS-MDA. From Figures 6, 7, we can also clearly see that MS-MDA has a significant advantage over DAN in cross-subject DA, where the number of source domains is large, which also confirms the importance of constructing multiple branches so that each source domain is adapted separately.
Although the results clearly show that our proposed method brings a significant performance improvement, we also find that the training time increases linearly with the number of source domains: the more source domains there are, the larger the model and the longer the training takes. This is unlike concatenating all source data into one domain, where the only additional time comes from the increase in the amount of data. To address this problem, our current idea is to selectively discard some less relevant source domains and not build DA branches for them, which would also allow the disc. loss to play a more prominent role because there is less negative information. In addition, the encoders in the current model are simple MLPs, while many works have verified the usability of LSTM for EEG data (Ma et al., 2019;Jiao et al., 2020;Tao and Lu, 2020), so we will consider using LSTM encoders in future work.
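The idea of discarding less relevant source domains could, for instance, be prototyped as below. The text does not specify a relevance criterion, so this sketch assumes one purely for illustration: it ranks sources by a linear-kernel MMD to the target (the squared distance between feature means) and keeps the closest ones. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def linear_mmd(source, target):
    """Squared distance between feature means (MMD with a linear kernel)."""
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)

def select_sources(sources, target, keep):
    """Return the indices of the `keep` sources closest to the target.

    Assumption: a small linear MMD is used as a proxy for relevance.
    """
    scores = [linear_mmd(s, target) for s in sources]
    order = np.argsort(scores)
    return sorted(order[:keep].tolist()), scores

# Hypothetical data: four sources with increasing shift from the target.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 8))
shifts = (0.0, 5.0, 0.1, 3.0)
sources = [rng.normal(shift, 1.0, size=(200, 8)) for shift in shifts]

kept, scores = select_sources(sources, target, keep=2)
print(kept)
```

Only the kept sources would then receive their own DA branch, so the number of branches, and with it the training time, is bounded by `keep` rather than by the total number of available subjects.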

CONCLUSION
In this paper, we propose MS-MDA, a domain adaptation method for EEG-based emotion recognition that is applicable to situations with multiple source domains. Through experimental evaluation, we find that this method has a better ability to adapt to multiple source domains, which is validated by comparison with the selected methods and the SOTA models, especially in cross-subject experiments, where our proposed method achieves up to 20% improvement. In addition, we explore the impact of different normalization methods for EEG data in domain adaptation, which we believe can serve as an inspiration for other EEG-based works while improving the effectiveness of their models. As for future work, the current model constructs a DA branch for every source domain without selection, which increases the model size and training time as the number of source domains grows and also introduces source-domain information that is not relevant to the target into the model. A more efficient approach may be to selectively build DA branches from a reservoir of source domains, allowing the model to be more efficient while focusing only on the source domain information that is relevant to the target domain.

DATA AVAILABILITY STATEMENT
The datasets analyzed for this study can be found on the official website of the BCMI laboratory at https://bcmi.sjtu.edu.cn/home/seed/index.html.

AUTHOR CONTRIBUTIONS
HC and JL: conceptualization and writing-original draft. HC, MJ, and ZL: investigation, data analysis, and construction of the experiments. CF, JL, and HH: review and editing of the draft. All authors contributed to the article and approved the submitted version.