Possibilistic distribution distance metric: a robust domain adaptation learning method

Affective Brain-Computer Interface (aBCI) systems, which predict the emotions of individual subjects from models trained on multiple subjects, often fail to achieve satisfactory results because Electroencephalogram (EEG) patterns differ across subjects. Subject-specific classifiers have been tried instead, but they typically lack sufficient labeled data. Domain Adaptation (DA), which addresses the problem of inconsistent distributions between training and test datasets, has therefore received widespread attention in EEG-based emotion recognition. Most existing methods use the Maximum Mean Discrepancy (MMD) or its variants to minimize domain distribution inconsistency. However, noisy data in a domain can cause significant drift in the domain means, which degrades the adaptation performance of MMD-based methods to some extent. We therefore propose a robust domain adaptation learning method with a possibilistic distribution distance metric. First, the traditional MMD criterion is transformed into a novel possibilistic clustering model to weaken the influence of noisy data, yielding a robust possibilistic distribution distance metric (P-DDM) criterion, and the robustness of domain distribution alignment is further improved by a fuzzy entropy regularization term. P-DDM is theoretically proven to be an upper bound of the traditional MMD criterion under certain conditions, so minimizing P-DDM effectively optimizes the MMD objective. Second, based on the P-DDM criterion, a robust domain adaptation classifier (C-PDDM) is proposed, which adopts the graph Laplacian matrix to preserve the geometric consistency of instances in the source and target domains, thereby improving label propagation performance.
At the same time, by maximizing the use of source-domain discriminative information to minimize the domain discrimination error, the generalization performance of the learning model is further improved. Finally, extensive experiments and analyses on two EEG datasets (i.e., SEED and SEED-IV) show that the proposed method achieves superior or comparable robustness in most cases, with accuracy gains of around 10%.


Introduction
In the field of affective computing research (Mühl et al., 2014), automatic emotion recognition (AER) (Dolan, 2002) has received considerable attention from the computer vision community (Kim et al., 2013; Zhang et al., 2017). Thus far, numerous Electroencephalogram (EEG)-based emotion recognition methods have been proposed (Musha et al., 1997; Jenke et al., 2014; Zheng, 2017; Li X. et al., 2018; Pandey and Seeja, 2019). From a machine learning perspective, EEG-based AER can be modeled as a classification or regression problem (Kim et al., 2013; Zhang et al., 2017), where state-of-the-art AER techniques typically train their classifiers on multiple subjects to achieve accurate emotion recognition. In this case, subject-independent classifiers usually have poor generalization performance, as emotion patterns may vary across subjects (Pandey and Seeja, 2019). Significant progress in emotion recognition has been made by improving feature representations and learning models (Zheng et al., 2015; Zheng and Lu, 2015; Li et al., 2018a,b, 2019; Song et al., 2018; Du et al., 2020; Zhong et al., 2020). Since individual differences in EEG-based AER naturally exist, qualitative and empirical observations suggest that a learned classifier may generalize poorly to previously unseen subjects (Jayaram et al., 2016; Zheng and Lu, 2016; Ghifary et al., 2017; Lan et al., 2019). As a possible alternative, subject-specific classifiers are often impractical due to insufficient training data. Moreover, even when they are feasible in specific scenarios, the classifier must still be fine-tuned to maintain a sound recognition capacity, partly because the EEG signals of the same subject change over time (Zhou et al., 2022). To address these challenges, the domain adaptation (DA) learning paradigm (Patel et al., 2015; Tao et al., 2017, 2021, 2022; Zhang et al., 2019b; Dan et al., 2022) has been proposed and widely applied. DA enhances learning performance in the target domain by transferring and leveraging prior knowledge from related but differently distributed domains (referred to as source or auxiliary domains), where the target domain has few or even no training samples.
Reducing or eliminating distribution differences between domains is a crucial challenge in DA learning. To this end, mainstream DA methods primarily eliminate distribution biases between domains by exploring domain-invariant features or samples (Pan and Yang, 2010; Patel et al., 2015). To fully exploit domain-invariant feature information, traditional shallow DA models have been extended to the deep DA paradigm. Benefiting from deep feature transformation, deep DA methods have achieved impressive adaptation performance (Long et al., 2015, 2016; Ding et al., 2018; Chen et al., 2019; Lee et al., 2019; Tang and Jia, 2019). Unfortunately, although these deep DA methods provide more transferable, domain-invariant features, they can only alleviate, not eliminate, the domain shift caused by distribution differences. In addition, their performance advantages may be attributed to one or several factors such as deep feature representation, model fine-tuning, and adaptive regularization layers/terms; however, their learning results still lack theoretical or practical interpretability. DA theory provides a generalization error bound for domain adaptation (Ben-David et al., 2010), in which the expected error of the target hypothesis ε_T(h) is mainly constrained by three terms: (1) the expected error of the source hypothesis ε_S(h); (2) the distribution difference between the source and target domains d_H(D_S, D_T); and (3) the difference between the labeling functions of the two domains [i.e., the third term of Equation (1)]. Therefore, in this paper we consider all three aspects simultaneously to reduce the DA generalization error bound (Zhang et al., 2021). Most existing methods assume that once the domain difference is minimized, a classifier trained only on the source domain can also generalize well to the target domain. Therefore, current mainstream DA methods aim to minimize the statistical distribution difference between the two domains. In other words, reducing or eliminating the distribution difference between domains, so as to transfer knowledge from the source domain and improve learning performance in the target domain, is the core goal of DA methods; the key to this goal is effectively measuring the distribution difference between domains.
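As a concrete reference point for measuring the distribution difference between domains, the following sketch computes the standard biased empirical MMD² between two sample sets with an RBF kernel. The function names, sample sizes, and bandwidth are illustrative choices, not taken from the paper:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical MMD^2 between source and target samples in the RKHS
    # induced by the RBF kernel: E[k(s,s)] + E[k(t,t)] - 2 E[k(s,t)].
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
shifted = mmd2(rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (200, 4)))
print(same < shifted)  # identical distributions give a smaller MMD^2
```

Being a squared RKHS norm, the biased estimate is non-negative, and it grows as the two distributions move apart.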
To address the domain distribution shift phenomenon, early instance re-weighting methods calculate the probability of each instance belonging to the source or target domain by likelihood ratio estimation (i.e., the membership of each instance). The domain shift problem can then be relieved by re-weighting instances according to their memberships. MMD (Gretton et al., 2007) is a widely adopted, simple, and effective strategy for instance re-weighting. However, its optimization is often carried out separately from classifier training, so it is difficult to ensure that both are optimal at the same time. To address this issue, Chu et al. (2013) proposed a joint instance re-weighting DA classifier. To overcome the conditional-distribution-consistency assumption of instance re-weighting methods, feature transformation methods have received widespread attention and exploration (Pan et al., 2011; Baktashmotlagh et al., 2013; Long et al., 2013; Liang et al., 2018; Luo et al., 2020; Kang et al., 2022). Among representative methods, Pan et al. (2011) proposed Transfer Component Analysis (TCA), which learns a transformation matrix using MMD to minimize the distribution distance between the source and target domains while preserving data divergence information, but does not consider domain semantic realignment. Long et al. (2013) then proposed Joint DA (JDA), which jointly aligns the domain feature distributions and the class-conditional distributions, initializing the target labels of the class-conditional alignment with pseudo-labels. Recently, Luo et al.
(2020) proposed a Discriminative and Geometry-Aware Unsupervised Domain Adaptation (DGA-DA) framework, which combined the TCA and JDA methods and introduced a strategy that makes different classes from the two domains mutually exclusive. In addition, the framework preserves the geometric structure of domain data to achieve effective propagation of target labels. Most existing affective models are deep transfer learning methods built on the domain-adversarial neural network (DANN) (Ganin et al., 2016), as proposed in Li et al. (2018c,d), Du et al. (2020), Luo et al. (2018), and Sun et al. (2022). The main idea of DANN is to find a shared feature representation for the source and target domains whose distribution difference is indistinguishable, while maintaining the predictive ability of the estimated features on the source samples for a specific classification task. Baktashmotlagh et al. (2013) proposed a Domain Invariant Projection (DIP) algorithm, which investigated the use of polynomial kernels in MMD to construct a compact domain-shared feature space. The DANN family still faces some challenges; PR-PL (Zhou et al., 2022) explored prototypical representations on top of DANN to further characterize the different emotion categories, and designed a clustering-based DA concept to minimize inner-class divergence. A review of existing DA research shows that MMD is the main distribution distance measurement technique adopted by feature-transformation-based DA methods. Traditional MMD-based DA methods focus solely on minimizing cross-domain distribution differences while ignoring the statistical (clustering) structure of the target domain distribution, which to some extent hampers the inference of target-domain labels. To address this issue, Kang et al.
(2022) proposed a contrastive adaptation network for unsupervised domain adaptation. Target-domain labels are initialized under a clustering assumption, and the feature representation is adjusted by measuring contrastive domain differences (i.e., minimizing within-class domain differences and maximizing between-class domain differences) across multiple fully connected layers. During training, the target-label assumptions and the feature representations are cross-iterated and optimized to enhance the model's generalization capability. Furthermore, inspired by clustering patterns, Liang et al. (2018) proposed an effective domain-invariant projection integration method that uses clustering ideas to seek the best projection for each class within the domain, bridging the domain-invariant semantic gap and enhancing inner-class compactness. However, it still essentially belongs to the family of MMD-based feature transformation DA methods.
It is worth noting that existing MMD-based methods do not fully consider the impact of intra-domain noise when measuring domain distribution distance. In real scenarios, noise inherently exists in domains, and intra-domain noise leads to mean-shift problems in the distance measurements of traditional MMD methods and their variants, which to some extent affects the generalization performance of MMD-based DA methods. As shown in Figure 1, panels A1 and B1 represent the noise-free source domain and target domain, respectively, with μ_s* and μ_t* denoting the means of the source and target domains, and panel C1 shows the domain adaptation result based on MMD. When the source domain contains noise (Figure 1A2), the mean shifts and the MMD criterion can no longer measure the distribution distance effectively: most target-domain samples (Figure 1B2) are matched to a single category of the source domain (Figure 1C2), which degrades the inference performance of domain adaptation learning.
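The mean-shift phenomenon described above can be illustrated with a quick numerical sketch; the sample sizes, outlier fraction, and outlier location are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, (500, 2))               # noise-free source domain
noisy = np.vstack([clean, np.full((25, 2), 15.0)])   # add 5% far-away outliers

mu_clean = clean.mean(axis=0)
mu_noisy = noisy.mean(axis=0)
shift = np.linalg.norm(mu_noisy - mu_clean)
print(shift)  # a handful of outliers drags the domain mean noticeably
```

Since MMD compares domains through their (kernel) means, such a shift directly distorts the measured distribution distance.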
Existing research (Krishnapuram and Keller, 1993) pointed out that possibilistic clustering models can effectively suppress noise interference during clustering. Accordingly, Dan et al. (2021) proposed an effective classification model based on the possibilistic clustering assumption. Inspired by this work, we aim to jointly address the robustness and discriminability issues of the MMD criterion to enhance the adaptability of MMD-based methods, and we propose a robust Possibilistic Distribution Distance Metric (P-DDM) criterion. Specifically, for each EEG instance (from either the source or target domain) we measure its distance to the overall domain mean (i.e., the mean of the combined source and target domains) and use the corresponding matching membership to judge the relevance between the instance and the mean: the smaller the distance, the larger the membership, and vice versa. In this way, the membership values alleviate the impact of noise during matching. The robustness and effectiveness of P-DDM are further enhanced by introducing a fuzzy entropy regularization term. On this basis, a domain adaptation Classifier based on P-DDM (C-PDDM) is proposed, which introduces the graph Laplacian matrix to preserve the geometric structure consistency within the source and target domains, thereby improving label propagation performance. At the same time, a target-domain classification model with better generalization performance is obtained by maximizing the use of source-domain discriminative information to minimize the domain discrimination error. The main contributions of this paper are as follows: 1) The traditional MMD measurement is transformed into a clustering optimization problem, and a robust possibilistic distribution distance metric criterion (P-DDM) is proposed to solve the domain mean-shift problem in noisy environments; 2) It is theoretically proven that under certain
conditions, P-DDM is an upper bound of the traditional MMD measurement, so the minimization of MMD for domain distribution measurement can be effectively achieved by optimizing P-DDM; 3) A DA classifier model based on P-DDM (i.e., C-PDDM) is proposed, its consistent convergence is proven, and its DA generalization error bound is derived based on Rademacher complexity theory; 4) Extensive experiments on two EEG datasets (i.e., SEED and SEED-IV) demonstrate the robustness of the method and a certain degree of improvement in classification accuracy.

Proposed framework: C-PDDM
In domain adaptation learning, D_s = {(x_i^s, y_i^s)}_{i=1}^n denotes the n samples of the source domain and their associated labels, and X_s denotes all the source samples; similarly, X_t denotes the m unlabeled samples of the target domain. μ_s and μ_t denote the mean values of the source domain and target domain, respectively. Our proposal makes the following assumptions: 1) The marginal distributions of the source domain and target domain are different, i.e., P(X_s) ≠ Q(X_t), but the two domains share the same feature space, i.e., X_s, X_t ∈ X, where X is the common feature space of the source and target domains. 2) The conditional probability distributions of the source and target domains are different, but the domains share the same label space, i.e., Y_s, Y_t ∈ Y, where Y is the common label space of the source and target domains.
In the face of a complex and noisy DA environment, and guided by the DA generalization error theory (Ben-David et al., 2010), the proposed method pursues the following objectives so as to make the distance metric for domain adaptation more robust and achieve good target classification performance: (1) Robust distance metric: solve the problem of domain mean shift under the influence of noise, thereby effectively aligning the domain distributions; (2) Target-domain knowledge inference: bridge the discriminative information of the source domain while minimizing the domain discrimination error, on the basis of preserving the consistency of the domain data geometry, and learn a target-domain classifier with high generalization performance. Based on these objectives, the general form of the proposed method can be described as the sum of two terms: D(X_s, X_t), the robust distance metric, which reduces the impact of noisy data on the alignment of domain distributions; and R(Y, W), the domain adaptation learning loss, which involves the label matrix Y (i.e., the comprehensive label matrix of the source and target domains) and the comprehensive learning model W of the source and target domains.

Motivation
In a certain reproducing kernel Hilbert space (RKHS) H, the original data representation can be transformed into a feature representation in H through a non-linear mapping φ: ℝ^d → H (Long et al., 2016). The corresponding kernel function is defined as K: X × X → ℝ, with K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩ (Pan et al., 2011; Long et al., 2015). For the problem of inconsistent distributions in domain adaptation, existing research (Bruzzone and Marconcini, 2010; Gretton et al., 2010) has shown that mapping sample data into a high-dimensional or even infinite-dimensional space can capture higher-dimensional feature representations of the data (Carlucci et al., 2017); that is, in a suitable RKHS, the distance between two distributions can be effectively measured by the maximum mean discrepancy (MMD) criterion. Based on this, let F be a class of functions f: X → ℝ. The MMD between the two domain distributions P and Q can be defined as

MMD(P, Q) = sup_{f ∈ F} ( E_{x∼P}[f(x)] − E_{x∼Q}[f(x)] )  (3)

The MMD measure minimizes the expected difference between the two domain distributions through the function f, making the two distributions as similar as possible. When the domain sample size is sufficiently large (or approaches infinity), the expected difference approximates (or equals) the empirical mean difference. Therefore, Equation (3) can be written in the empirical form of MMD:

MMD(X_s, X_t) = ‖ (1/n) Σ_{i=1}^{n} φ(x_i^s) − (1/m) Σ_{j=1}^{m} φ(x_j^t) ‖²_H  (4)

To establish the connection between the traditional MMD criterion and the mean clustering model, we give the following theorem.

Theorem 1. The MMD measure can be loosely modeled as a special clustering problem with one cluster center, where the cluster center is μ and the instance clustering membership is V_k.

Proof: Starting from the definition of MMD in Equation (4), the empirical MMD can be relaxed into a weighted sum of distances between each mapped instance φ(x_k) and a single cluster center μ [Equation (5)]. When n ≠ m, the numbers of samples drawn from the source and target domains can be made equal during sampling, and the sample membership V_k with respect to the single cluster center is defined accordingly. From Equation (5), it can be seen that the one-cluster-center form with cluster center μ is an upper bound of the traditional MMD measure. In other words, the MMD measure can be relaxed to a special one-cluster-center objective function, and by optimizing this clustering objective, the minimization of MMD between domains can be achieved.
As indicated by Theorem 1 and Baktashmotlagh et al. (2013), the MMD criterion for domain distributions is essentially related to a clustering model, which can be used to achieve more effective distribution alignment between different domains by clustering the domain data. It is worth noting that traditional clustering models are sensitive to noise (Krishnapuram and Keller, 1993), which makes MMD-based domain adaptation (DA) methods generally suffer from the domain mean shift caused by noisy data. To address this issue, this paper explores more robust forms of clustering and proposes an effective new criterion for measuring domain distribution distance.

P-DDM
Recently proposed possibilistic clustering models can effectively overcome the impact of noise on clustering performance (Dan et al., 2021). Therefore, this paper generalizes the special one-cluster-center form above to a possibilistic one-cluster-center form and proposes the robust possibilistic distribution distance metric criterion P-DDM. By introducing the possibilistic clustering assumption, the hard clustering form of MMD is generalized to a soft clustering form that controls the contribution of each instance according to its distance from the overall domain mean: the farther the distance, the smaller the contribution of the instance. This weakens the influence of the mean shift caused by noisy data in the domain and improves the robustness of domain adaptation learning.
To achieve robust domain distribution alignment, the distribution distance measurement criterion based on the possibilistic clustering assumption pursues two goals: (1) calculate the distribution difference between domains in the kernel space under the possibilistic clustering assumption, by measuring the distance between each instance in the combined domain and the overall domain mean; and (2) measure the matching contribution of each instance. Each instance x_k in the combined domain has a matching contribution value λ_k, which is the degree to which x_k matches the overall domain mean; the closer the distance, the larger the value of λ_k. Thus, the possibilistic distribution distance measure can be defined as in Equation (7), where the parameter b is the weight exponent of λ_k, used to adjust the uncertainty, i.e., the degree to which data points belong to multiple categories. To avoid trivial solutions, b is set to 2 in the subsequent equations of this paper; a detailed discussion of different values of b can be found in Krishnapuram and Keller (1993). W_p(X_s, X_t, λ) denotes the objective function of possibilistic clustering with cluster center μ; when λ_k = V_k, it reduces to the special one-cluster-center form above.

Theorem 2. When λ_k ∈ (0, 1/r] with r = min(n, m), W_p(X_s, X_t, λ) is an upper bound of the traditional MMD method.

Proof: Combining Equation (5) and Equation (7) yields the inference chain of Equation (8). According to the value range of V_k, when λ_k ∈ (0, 1/r] and r = min(n, m), the second inequality in Equation (8) holds, which proves that W_p(X_s, X_t, λ) is an upper bound of traditional MMD. By Theorems 1 and 2, the traditional MMD metric criterion can be modeled in a possibilistic one-cluster-center objective form.
From this perspective, the possibilistic distribution distance metric not only achieves alignment of the domain feature distributions but also weakens the "negative transfer" effect of noisy data during training. However, Equation (7) only considers the overall mean regression problem, clustering each instance toward the overall domain mean while ignoring the semantic structure of the instances during domain distribution alignment, which may destroy the local class structure within the domains. Inspired by the global-and-local idea of Tao et al. (2016), we further consider the semantic distribution structure in domain alignment and calculate the semantic matching contribution of each instance. Therefore, on the basis of feature distribution alignment, we propose an integrated semantic alignment, rewritten as Equation (9), where λ_{k,c} is the membership of x_k belonging to the c-th class in the overall domain (i.e., the source and target domains integrated into one domain).
To further improve the robustness and effectiveness of the possibilistic distribution distance metric on noisy data, we add a fuzzy entropy regularization term to Equation (9). The semantic-alignment P-DDM in Equation (9) can then be further defined as Equation (10), where β is a tunable balancing parameter that forces the values of λ_{k,c} for relevant data to be as large as possible so as to avoid trivial solutions. After this improvement, P-DDM is a monotonically decreasing function of λ_{k,c}. Through the fuzzy entropy term in the second part of Equation (10), P-DDM reduces the impact of noisy data on model classification: the larger the fuzzy entropy, the greater the sample discrimination information, which helps to enhance the robustness and effectiveness of the distribution distance measurement. In addition, the possibilistic distribution measurement model regularized by fuzzy entropy can effectively suppress the contribution of noisy data during domain distribution alignment, thereby reducing the interference of noisy/abnormal data in domain adaptation learning. The robustness effect of fuzzy entropy can be further seen in the empirical analysis in Gretton et al. (2010).
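To illustrate how entropy-regularized possibilistic memberships down-weight noise, the sketch below uses the generic closed form λ_k = exp(−d_k/β), which minimizing Σ_k λ_k d_k + β Σ_k (λ_k ln λ_k − λ_k) over λ_k > 0 yields. This is a standard possibilistic-clustering result used here for intuition, not the paper's exact membership update:

```python
import numpy as np

def possibilistic_weights(X, mu, beta=1.0):
    # Entropy-regularised possibilistic memberships: minimising
    #   sum_k lam_k * d_k + beta * sum_k (lam_k*log(lam_k) - lam_k)
    # gives lam_k = exp(-d_k / beta), so far-away (noisy) points
    # receive exponentially small matching contributions.
    d = np.sum((X - mu) ** 2, axis=1)
    return np.exp(-d / beta)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[12.0, 12.0]]])  # one far outlier
lam = possibilistic_weights(X, X.mean(axis=0))
print(lam[-1] < lam[:-1].min())  # the outlier contributes least
```

Weighting the domain-mean computation by λ in this way is what suppresses the mean shift that plain MMD suffers from.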

Design of domain adaptation function
The P-DDM criterion addresses domain distribution alignment under noise. Next, we pursue the two goals required for target-domain knowledge inference: (1) preserve the geometric consistency within the source domain and the target domain, i.e., the label information of adjacent samples should be consistent; and (2) minimize the structural risk loss over both the source and target domains. Given this objective task, the general form of the target risk function can be described as the sum of two terms: Ω_Y, the joint knowledge transfer and label propagation loss, which preserves the geometric consistency of the data between the source and target domains; and Ω_W, the structural risk loss term, which covers both the source domain and the target domain. These two terms are designed separately below.
Specifically, the affinity matrix M is defined with a Gaussian kernel: M_ij = exp(−‖x_i − x_j‖² / (2σ²)) if x_i and x_j are neighbors, and M_ij = 0 otherwise, where x_k ∈ Ne(x_m) means that x_k is a neighbor of x_m. σ is a hyper-parameter controlling the local influence range of the Gaussian kernel function: the larger the value of σ, the larger the local influence range, and vice versa. When σ is fixed, M_ij decreases monotonically as the distance between x_i and x_j increases.
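The neighborhood weighting above can be sketched as follows; the k-NN construction, the unnormalized Laplacian L = D − M, and the parameter values are standard, illustrative choices rather than the paper's exact settings:

```python
import numpy as np

def knn_graph_laplacian(X, k=5, sigma=1.0):
    # Gaussian affinity restricted to k-NN pairs, as in the label-propagation
    # term; L = D - M is the (unnormalised) graph Laplacian.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)                 # exclude self-neighbours
    nn = np.argsort(d2, axis=1)[:, :k]           # indices of the k nearest points
    M = np.zeros_like(d2)
    rows = np.repeat(np.arange(len(X)), k)
    M[rows, nn.ravel()] = np.exp(-d2[rows, nn.ravel()] / (2 * sigma**2))
    M = np.maximum(M, M.T)                       # symmetrise the neighbour relation
    L = np.diag(M.sum(axis=1)) - M
    return M, L

X = np.random.default_rng(3).normal(size=(50, 3))
M, L = knn_graph_laplacian(X)
print(np.allclose(L.sum(axis=1), 0))  # Laplacian rows sum to zero
```

The quadratic form trace(YᵀLY) then penalizes label disagreement between neighboring samples, which is exactly the geometric-consistency goal stated above.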
In combination with source-domain knowledge transfer and the graph Laplacian matrix (Long et al., 2013; Wang et al., 2017), the objective form of label propagation modeling can be described as Equation (13).

Minimize structural risk loss
In our proposed method, the source-domain classifier (and correspondingly the target-domain classification model) is defined as a linear decision function with a bias term, where b_s (b_t) is the source-domain (target-domain) bias and W_s (W_t) is the parameter matrix of the source domain (target domain). By absorbing the bias into the parameter matrix, the classifiers of the source domain and the target domain can each be rewritten in a unified linear form, and we write the final classifier as f(x) = Wᵀφ(x). According to the minimum squared loss function, minimizing the structural risk loss over both domains (source and target) can be described as Equation (14), where the first term denotes the structural risk loss with y_k ∈ Y, and the second term is the constraint on W. By using l_{2,1} regularization, we can achieve feature selection and effectively control the complexity of the model, preventing over-fitting of the target classification model to some extent.
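The ℓ2,1-regularized least-squares term can be solved with the iterative re-weighting scheme of Nie et al. (2010). The sketch below is a generic version of that scheme; the function name, parameter values, and synthetic data are illustrative, not the paper's exact update:

```python
import numpy as np

def l21_least_squares(X, Y, alpha=1.0, iters=30, eps=1e-8):
    # min_W ||X W - Y||_F^2 + alpha * ||W||_{2,1} via iterative re-weighting:
    # with U = diag(1 / (2 * ||w_i||_2)) over the rows w_i of the current W,
    # each step solves the weighted ridge system (X^T X + alpha * U) W = X^T Y.
    W = np.linalg.lstsq(X, Y, rcond=None)[0]     # warm start: plain least squares
    for _ in range(iters):
        u = 1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps))
        W = np.linalg.solve(X.T @ X + alpha * np.diag(u), X.T @ Y)
    return W

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
W_true = np.zeros((10, 2))
W_true[:3] = rng.normal(size=(3, 2))             # only 3 features are relevant
Y = X @ W_true + 0.01 * rng.normal(size=(200, 2))
rows = np.linalg.norm(l21_least_squares(X, Y, alpha=5.0), axis=1)
print(rows[:3].min() > rows[3:].max())  # rows of irrelevant features shrink
```

The row-wise sparsity induced by the ℓ2,1 norm is what gives the feature-selection effect mentioned in the text.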
The classification task in the proposed method is ensured by the dual prediction of the label matrix Y and the decision function W, which guarantees the reliability of the prediction. The target classification function is obtained by combining Equation (13) and Equation (14).

Final formulation
By combining the semantic-alignment form of P-DDM [i.e., Equation (10)] and the target classification function [i.e., Equation (16)], the final optimization problem of the proposed method C-PDDM can be described as Equation (17), where β, α, and ρ are balance parameters. With all model parameters obtained, target-domain knowledge inference is achieved by maximizing the utilization of source-domain discriminative information and linearly fusing the two classifiers f_s and f_t; this linear fusion model is then used for target-domain knowledge inference. The fusion form can be written as f(x) = υ f_s(x) + (1 − υ) f_t(x), where υ is an adjustable parameter that balances the two classifiers; to reflect the importance of source-domain discriminative information as prior knowledge, υ is set to 0.9 based on empirical experience.
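The linear fusion step can be sketched directly; the two score functions below are hypothetical stand-ins for the learned f_s and f_t:

```python
import numpy as np

def fused_predict(f_s, f_t, X, upsilon=0.9):
    # f(x) = upsilon * f_s(x) + (1 - upsilon) * f_t(x), with upsilon = 0.9 as
    # in the paper, favouring source-domain discriminative information.
    return np.argmax(upsilon * f_s(X) + (1 - upsilon) * f_t(X), axis=1)

f_s = lambda X: X @ np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical source scores
f_t = lambda X: X @ np.array([[0.8, 0.2], [0.1, 0.9]])   # hypothetical target scores
X = np.array([[2.0, 0.0], [0.0, 2.0]])
print(fused_predict(f_s, f_t, X))  # → [0 1]
```

With υ close to 1 the fused decision is dominated by the source classifier, while the target classifier contributes a small correction.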

C-PDDM optimization
The optimization problem of C-PDDM is non-convex with respect to λ_{k,c}, W, and Y. We adopt an alternating iterative optimization strategy to solve for λ_{k,c}, W, and Y, so that each optimization variable has a closed-form solution.

Update λ_{k,c} given W and Y
With W and Y fixed, the objective function in Equation (16) reduces to a subproblem in λ_{k,c}.

Theorem 3. The optimal solution to the primal optimization problem of the objective function (17) is given in closed form by Equation (18).

Proof. Setting the derivative of the objective with respect to λ_{k,c} to zero yields Equation (19); combining and simplifying the terms in Equation (19) gives the solution of λ_{k,c} in Equation (18), and Theorem 3 is proved. From Theorem 3, the membership of any sample can be obtained via Equation (18).

Update W given Y and λ_{k,c}
Since the first and third terms in Equation (16) do not involve W, the optimization formula for C-PDDM can be rewritten as Equation (20), where Λ is the membership matrix whose entry λ_{k,c} is the membership of x_k belonging to the c-th class.

Theorem 4. The optimal solution to the primal optimization problem of the objective function (20) is given by Equation (21): setting the derivative with respect to W to zero yields Equation (22), and rearranging Equation (22) gives the solution in Equation (21).

Update Y given W and λ_{k,c}
Finally, λ_{k,c} is fixed and W = AY is substituted into Equation (16). The constraint YYᵀ = I reduces the interference information in the label matrix Y, yielding the objective form for optimizing the solution of Y.

Algorithm description
In unsupervised domain adaptation learning scenarios (i.e., the target domain has no labeled data), the initial labels of the target domain can be obtained through three strategies to achieve semantic alignment between domains (Liang et al., 2018): (1) random initialization; (2) zero initialization; (3) clustering the target-domain data with a model trained on the source-domain data. Strategies (1) and (2) are cold-start methods, while (3) is a hot-start method that is relatively friendly to subsequent learning performance. Therefore, we adopt the third strategy to initialize the prior information of λ_{k,c}, W, and Y. The proposed method adopts the iterative optimization strategy commonly used in multi-objective optimization, and the algorithm stops iterating when |J^(z) − J^(z−1)| < ε, where J^(z) denotes the value of the objective function at the z-th iteration and ε is a pre-defined threshold.
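The alternating loop and stopping rule can be sketched generically as follows; `step` and `J` are placeholders for the closed-form λ/W/Y updates and the objective of Equation (17), and the toy halving update stands in for them:

```python
import numpy as np

def alternate_optimize(step, J, z_max=100, eps=1e-4):
    # Iterate the (placeholder) update until |J(z) - J(z-1)| < eps or the
    # maximal iteration number z_max is reached, as in Algorithm 1.
    prev = np.inf
    state = None
    for _ in range(z_max):
        state = step()
        cur = J(state)
        if abs(prev - cur) < eps:
            break
        prev = cur
    return state, cur

x = [4.0]
def halve():                 # toy stand-in for the alternating updates
    x[0] *= 0.5
    return x[0]

state, cur = alternate_optimize(halve, lambda s: s * s)
print(cur < 1e-3)  # the objective value has converged below the threshold
```

In the actual algorithm, one `step` would run the three closed-form updates of λ_{k,c}, W, and Y in sequence.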

Computational complexity
We use Big-O notation to analyze the computational complexity of Algorithm 1. The proposed method C-PDDM mainly consists of two jointly optimized parts: P-DDM and target label propagation. Specifically, constructing the k-Nearest Neighbor (k-NN) graph and pre-computing the kernel matrix K require computational costs of O(dn²) and O(dN²), respectively. Before training in Algorithm 1, pre-computing the kernel matrix and the graph Laplacian matrix of C-PDDM and loading them into memory can further improve computational efficiency. In short, the proposed algorithm is feasible and effective in practical applications.

Analysis of convergence
To prove the convergence of Algorithm 1, the following lemma is introduced.
Lemma 1 (Nie et al., 2010). For any two non-zero vectors u and v, the following inequality holds:

‖u‖₂ − ‖u‖₂²/(2‖v‖₂) ≤ ‖v‖₂ − ‖v‖₂²/(2‖v‖₂).

We then prove the convergence of the proposed algorithm through Theorem 5. Theorem 5. Algorithm 1 decreases the objective value of optimization problem (17) in each iteration and converges to the optimal solution.
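Lemma 1 can be checked numerically: the gap between the two sides equals (‖v‖₂ − ‖u‖₂)²/(2‖v‖₂) ≥ 0. A small sketch (the function name is ours):

```python
import numpy as np

def lemma_gap(u, v):
    """Right side minus left side of Lemma 1; non-negative iff it holds."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    lhs = nu - nu ** 2 / (2 * nv)
    rhs = nv - nv ** 2 / (2 * nv)   # simplifies to ||v||/2
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = [lemma_gap(rng.normal(size=5), rng.normal(size=5)) for _ in range(1000)]
```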

Proof. For simplicity of expression, the updated results of the optimization variables λ_{k,c}, W, and Y after the t-th iteration are denoted as λ^t_{k,c}, W^t, and Y^t, respectively. The internal loop iteration update in Step 8 of Algorithm 1 corresponds to an alternating optimization over these three variables. According to the definition of the matrix U and Lemma 1, each of the three updates does not increase the objective value. Finally, Theorem 5 is proved. According to the update rule in Algorithm 1 and Theorem 5, the optimization objective (17) is non-increasing in the objective value over iterations. Therefore, it can be inferred that Algorithm 1 effectively converges to the optimal solution.

Analysis of generalization
Rademacher complexity can effectively measure the ability of a function set to fit noise (Ghifary et al., 2017; Tao and Dan, 2021). We therefore derive the generalization error bound of the proposed method through Rademacher complexity. Let H be a set of hypothesis functions in the RKHS, where the input set is compact and Y is a label space. Given a loss function loss(·,·): Y × Y → ℝ₊ and a distribution D over the inputs, the expected loss of two hypothesis functions h, h′ ∈ H is defined as L_D(h, h′) = E_{x∼D}[loss(h(x), h′(x))]. The domain distribution difference between the source distribution S and the target distribution T can then be defined as the largest gap |L_S(h, h′) − L_T(h, h′)| over h, h′ ∈ H. Let f_S and f_T be the true label functions for S and T, respectively, and let the corresponding optimized hypothesis functions be the minimizers of L_S(h, f_S) and L_T(h, f_T), with expected losses denoted accordingly. Our C-PDDM method minimizes the corresponding empirical loss.
The following theorem gives the generalization error bound of the proposed method. Theorem 6 (Generalization Error Bound) (Nie et al., 2010). Let X_s and X_t be datasets of the source domain and the target domain, respectively, and let loss be a q-Lipschitz loss function. Then, with probability at least 1 − δ, the generalization error of any hypothesis function h ∈ H is bounded by the empirical loss plus the possibilistic distribution distance W(λ_{k,c}, X_s, X_t), the model alignment term R(Y, W), and the Rademacher complexity ℜ(H). Theorem 6 shows that the possibilistic distribution distance measure W(λ_{k,c}, X_s, X_t) and the model alignment function R(Y, W) can simultaneously control the generalization error bound of the proposed method. Therefore, the proposed method can effectively improve its generalization performance in domain adaptation by minimizing both the possibilistic distribution distance between domains and the model bias.
The experimental results on real-world datasets also confirm this conclusion.

Discussion of kernel selection
The literature has theoretically shown that the Gaussian kernel family provides an effective RKHS embedding space for consistent estimation of domain distribution distance measures; the detailed derivation can be found in Sriperumbudur et al. (2010a,b). Therefore, all kernel functions used in this paper are Gaussian kernels k(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)). To illustrate the impact of the Gaussian kernel bandwidth on the distribution of RKHS sample embeddings, the following theorem on the function set of the Gaussian kernel is introduced: Theorem 7 (Sriperumbudur et al., 2010a).
According to Theorem 7, the larger the kernel bandwidth, the larger the RKHS embedding distance of the domain distribution, which slows down the convergence of the domain distribution distance measure W(λ_{k,c}, X_s, X_t) based on the soft clustering hypothesis of the MMD criterion. To further study the performance impact of the Gaussian kernel bandwidth, the bandwidth is parameterized; that is, a generalized Gaussian kernel function with a tunable parameter q is defined, whose effect will be shown in the experimental analysis below. When q is too large, the samples within each domain are highly cohesive, leading to a certain degree of mixing between positive and negative classes, which is not conducive to effective classification. Conversely, when q is too small, the convergence of the distribution distance measurement algorithm based on the possibilistic clustering hypothesis may slow down to some extent. Therefore, this paper limits q ∈ [1, q₀], where q₀ is a sufficiently large tunable parameter. The above analysis shows that the distribution distance measurement based on the possibilistic clustering hypothesis not only constrains the distributions between domains to be as consistent as possible, but also reduces the divergence of the sample distributions within each domain over a certain range of kernel bandwidths, thereby accelerating the convergence of the domain distribution divergence measurement and further improving the efficiency of the algorithm.
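The parameterized kernel can be sketched as follows. The exact parameterization is not recoverable from the source, so the form k_q(x_i, x_j) = exp(−q‖x_i − x_j‖²) used here is an assumption consistent with the stated behavior (larger q gives faster decay and tighter within-domain cohesion); the function name is ours.

```python
import numpy as np

def generalized_gaussian_kernel(xi, xj, q):
    """Assumed form of the generalized Gaussian kernel:
    k_q(x_i, x_j) = exp(-q * ||x_i - x_j||^2), with q in [1, q0].
    Larger q -> faster decay -> more cohesive within-domain samples."""
    return float(np.exp(-q * np.sum((xi - xj) ** 2)))
```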
It is worth noting that kernel selection is an open problem in kernel learning methods. Recently, some studies have proposed Multi-Kernel Learning (MKL) (Long et al., 2015) to overcome the kernel selection problem of single-kernel learning methods, so MKL can also be used to improve the performance of the proposed method. Specifically, the first step is to construct a new space spanned by multiple kernel feature mappings {φ_a}_{a=1}^{A}, which project X into A different spaces. An orthogonal integration space can then be built by concatenating these A spaces, yielding the mapped features in the final space for each x_i ∈ X. The kernel matrix in this final space can be written as K = [K_1; …; K_A], where K_a is the a-th kernel matrix from the A feature spaces. Kernel functions usable in practice include the Gaussian kernel, the inverse square distance kernel, and the inverse distance kernel.
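The stacked multi-kernel matrix K = [K_1; …; K_A] can be sketched as follows. The three kernels match those named in the text, but the specific inverse-distance forms 1/(d² + ε) and 1/(d + ε) and the regularizer ε are assumptions for numerical stability; the function name is ours.

```python
import numpy as np

def multi_kernel_stack(X, sigma=1.0, eps=1e-8):
    """Stack the three candidate kernels into K = [K_1; K_2; K_3]."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K_gauss = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian kernel
    K_inv_sq = 1.0 / (d2 + eps)                  # inverse square distance kernel (assumed form)
    K_inv = 1.0 / (np.sqrt(d2) + eps)            # inverse distance kernel (assumed form)
    return np.concatenate([K_gauss, K_inv_sq, K_inv], axis=0)
```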

Emotional databases and data preprocessing
To make a fair comparison with state-of-the-art (SOTA) methods, extensive experiments were conducted on two well-known open datasets, SEED (Zheng and Lu, 2015) and SEED-IV (Zheng et al., 2019). In the SEED dataset, 15 subjects participated in data collection; each subject completed three sessions at different times, each session containing 15 trials covering 3 emotional stimuli (negative, neutral, and positive). In the SEED-IV dataset, 15 subjects likewise participated; each subject completed three sessions at different times, each session containing 24 trials covering 4 emotional stimuli (happy, sad, fearful, and peaceful).
The EEG signals of both datasets were collected with the 62-channel ESI Neuroscan system. In preprocessing, the data is down-sampled to 200 Hz, environmental noise is manually removed, and the data is filtered with a 0.3-50 Hz band-pass filter. In each trial, the data is divided into multiple 1-s segments. Based on the 5 predefined frequency bands [Delta (1-3 Hz), Theta (4-7 Hz), Alpha (8-13 Hz), Beta (14-30 Hz), and Gamma (31-50 Hz)], the differential entropy (DE) is extracted to represent the logarithmic power spectrum in each band, yielding a total of 310 features (5 frequency bands × 62 channels) per EEG segment. All features are then smoothed by the Linear Dynamic System (LDS) method, which exploits the temporal dependency of emotion transitions and filters out noisy EEG components unrelated to emotions (Shi and Lu, 2010).
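For a Gaussian-distributed signal, the DE of a segment reduces to 0.5·ln(2πe·σ²). A minimal sketch of the per-band feature extraction, assuming a crude FFT band mask rather than the authors' exact filtering pipeline (function names are ours):

```python
import numpy as np

def differential_entropy(segment):
    """DE of an (approximately Gaussian) 1-s EEG segment:
    DE = 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(segment))

def band_de(segment, fs, band):
    """Crude FFT band-pass followed by DE; a sketch of the
    5-band x 62-channel (310-D) feature extraction."""
    f = np.fft.rfftfreq(len(segment), 1.0 / fs)
    spec = np.fft.rfft(segment)
    spec[(f < band[0]) | (f > band[1])] = 0.0   # zero bins outside the band
    return differential_entropy(np.fft.irfft(spec, len(segment)))
```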

Settings
The hyper-parameter settings of the C-PDDM method are also crucial before analyzing the experimental results. For all methods, a Gaussian kernel is used in both the source and target domains, where σ can be obtained by minimizing MMD to obtain a benchmark. Based on experience, we first select σ as the square root of the average norm of the training data for binary problems, and scale it with the number of classes C for multiclass classification. The underlying geometric structure depends on the k nearest neighbors used to compute the Laplacian matrix. In our experiments, performance varies only slightly when k is not large; therefore, to construct the nearest neighbor graph in C-PDDM, we grid-search the optimal number of neighbors k in {3, 5, 10, 15, 17} and report the best recognition accuracy under the optimal configuration. Before presenting the detailed evaluation, it is necessary to explain how the hyper-parameters of C-PDDM are tuned. The parameter β balances the fuzzy entropy and domain probability distribution alignment terms in objective function (16). The parameters α and ρ are adjustable and balance the importance of structure description and feature selection; these two parameters therefore have a significant impact on the final performance of the method.
Considering that parameter uncertainty is still an open problem in the field of machine learning, we determine these parameters based on previous work experience.Therefore, we evaluate all methods on the dataset by empirically searching the parameter space to obtain the optimal parameter settings and give the best results for each method.Except for special cases, all parameters of all relevant methods are tuned to obtain the optimal results.
As unsupervised domain adaptation has no target labels to guide standard cross-validation, we perform leave-one-subject-out evaluation on the two datasets SEED and SEED-IV (the details of this protocol are given in Section 6.2). The optimal parameter values are searched over {10⁻⁶, 10⁻⁵, …, 10⁵, 10⁶} by selecting the highest average accuracy on the two datasets under this protocol. This strategy usually yields a good C-PDDM model for unsupervised domain adaptation, and a similar strategy is adopted to find the optimal parameter values of the other domain adaptation methods. In the following subsections, a set of experiments tests the sensitivity of the proposed C-PDDM to parameter selection (Section 6.4.1), verifying that C-PDDM achieves stable performance over a wide range of parameter values. The hyper-parameters of the other methods are selected according to their original literature.
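The search procedure above can be sketched as an exhaustive sweep over the log-spaced grid. The `evaluate` callable and the parameter names are placeholders, not the authors' code:

```python
import itertools
import numpy as np

# Log-spaced grid used for each regularization parameter: 10^-6 ... 10^6.
grid = [10.0 ** p for p in range(-6, 7)]

def best_setting(evaluate, names=("alpha", "beta", "rho")):
    """Exhaustive search over the hyper-parameter grid; `evaluate` stands in
    for the leave-one-subject-out mean accuracy of a trained model."""
    best, best_acc = None, -np.inf
    for values in itertools.product(grid, repeat=len(names)):
        params = dict(zip(names, values))
        acc = evaluate(params)
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc
```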

Experiment protocols
To fully evaluate the robustness and stability of the proposed method, we adopt four different validation protocols based on leave-one-subject-out cross-validation (Zhang et al., 2021) to compare the proposed method with the SOTA methods.

1) Cross-subject cross-session leave-one-subject-out cross-validation.
To fully estimate the robustness of the model on unknown subjects and trials, this paper uses a strict cross-subject cross-session leave-one-out protocol to evaluate the model. All session data of one subject is used as the target domain, and all sessions of the remaining subjects are used as the source domain.
We repeat training and validation until all sessions of each subject have served as the target domain once. Due to the differences between subjects and sessions, this evaluation protocol poses a significant challenge to the effectiveness of EEG-based emotion recognition models.

2) Cross-subject single-session leave-one-subject-out cross-validation. This is the most widely used validation scheme in EEG-based emotion recognition tasks (Luo et al., 2018; Li J. et al., 2020). One session of a subject is treated as the target domain, while the remaining subjects are treated as the source domain. We repeat training and validation until each subject has served as the target once. As in other studies, we only consider the first session in this cross-validation.

3) Within-subject cross-session leave-one-session-out cross-validation. Similar to existing methods, a time-series cross-validation scheme is employed, where past data is used to predict current or future data. For each subject, the first two sessions are treated as the source domain and the last session as the target domain. The average accuracy and standard deviation across subjects are reported as the final results.

4) Within-subject single-session cross-validation. Following the protocols of existing studies (Zheng and Lu, 2015; Zheng et al., 2019), for each session of a subject, we take the first 9 (SEED) or 16 (SEED-IV) trials as the source domain and the remaining 6 (SEED) or 8 (SEED-IV) trials as the target domain. The results are reported as the average performance over all participants.

In the performance comparisons under these four validation protocols, we use "*" to indicate replicated model results.
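The cross-subject protocols above can be sketched as a simple split generator (the function name is ours):

```python
def leave_one_subject_out(n_subjects):
    """Yield (source_subjects, target_subject) pairs for the cross-subject
    protocols: every subject serves as the target domain exactly once."""
    for target in range(n_subjects):
        source = [s for s in range(n_subjects) if s != target]
        yield source, target

folds = list(leave_one_subject_out(15))  # SEED/SEED-IV: 15 subjects
```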
Results analysis on SEED and SEED-IV

Cross-subject cross-session
To verify the efficiency and stability of the model under cross-subject and cross-session conditions, we validate the proposed C-PDDM with cross-subject cross-session leave-one-subject-out cross-validation on the SEED and SEED-IV databases. As shown in Tables 1, 2, our proposed model achieves the highest emotion recognition accuracy. The C-PDDM method, with or without deep features, achieves recognition performances of 86.49 ± 5.20 and 73.82 ± 6.12 on the three-class SEED task, and 72.88 ± 6.02 and 67.83 ± 8.06 on the four-class SEED-IV task. Compared with existing research, the proposed C-PDDM has slightly lower accuracy on SEED-IV than PR-PL, but PR-PL uses adversarial learning, which has a higher computational cost. In the other three cases, the proposed C-PDDM has the best recognition performance. These results indicate that C-PDDM has higher recognition accuracy and better generalization ability, and is more effective in emotion recognition.

Cross-subject single-session
Table 3 summarizes the model results for the recognition task under cross-subject single-session leave-one-subject-out validation and compares them with the latest methods in the literature. All results are presented as mean ± standard deviation. The results show that our proposed model (C-PDDM) achieves the best performance (74.92%) with a standard deviation of 8.16 when compared with traditional machine learning methods. The recognition performance of C-PDDM is better than that of the DICE method, indicating that C-PDDM is superior to DICE in dealing with noisy situations. When compared with the latest deep learning methods, especially deep transfer learning networks based on DANN (Li J. et al., 2020) [such as ATDD-DANN (Du et al., 2020), R2GSTNN (Li et al., 2019), BiHDM (Li Y. et al., 2020), BiDANN (Li et al., 2018c), and WGAN-GP (Luo et al., 2018)], the proposed C-PDDM method effectively addresses individual differences and noisy label issues in aBCI applications. The recognition performance of PR-PL is slightly better than that of C-PDDM, which may be because PR-PL uses an adversarial loss for model learning, at the cost of higher computation. Overall, C-PDDM achieves a competitive result, indicating better generalization performance in the cross-subject same-session setting.

TABLE 1 The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED database using cross-subject cross-session leave-one-subject-out cross-validation.

Within-subject cross-session
The within-subject cross-session cross-validation results on SEED and SEED-IV are shown in Tables 4, 5, respectively, reporting the mean and standard deviation over subjects. On both datasets, the proposed C-PDDM method achieves results close to or better than the DICE method among traditional machine learning baselines. This may be because each subject is less likely to generate noisy data across different sessions, which does not highlight the advantages of C-PDDM. In addition, on the SEED-IV dataset (four-class emotion recognition), C-PDDM performs best among both traditional machine learning and the latest deep learning methods as the number of categories increases. This indicates that the proposed method is more accurate and more scalable in fine-grained emotion recognition tasks.
TABLE 2 The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED-IV database using cross-subject cross-session leave-one-subject-out cross-validation.

The bold values are the best performance in the tables.
TABLE 3 The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED database using cross-subject single-session leave-one-subject-out cross-validation.

Within-subject single-session

The previous evaluation strategy considered the first two sessions of the SEED dataset as the source domain. The evaluation results of emotion recognition for each subject within each session are presented in Table 6. Compared with traditional machine learning methods, the C-PDDM method has comparable performance, and it still outperforms the DICE method. Compared with the latest deep learning methods, the C-PDDM method achieves the highest recognition performance, reaching 96.38%, which is even higher than the PR-PL method. This comparison demonstrates the efficiency and reliability of the proposed C-PDDM method in various emotion recognition applications.

TABLE 5 The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED-IV database using within-subject cross-session cross-validation.

TABLE 6 The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED database using within-subject single-session cross-validation.

For the SEED-IV dataset, we calculated the performance over all three sessions (emotional categories: happiness, sadness, fear, and neutral). Our proposed model outperforms the latest classical research methods, achieving the highest accuracies of 71.85 and 83.94% in Table 7. This comparison shows that the more emotional categories there are, the more prominent the generalization advantage of the proposed C-PDDM method in applications.

Discussion
To comprehensively study the performance of the model, we evaluate the effects of different settings in C-PDDM. Note that all results presented in this section are based on the SEED dataset, using the cross-subject single-session cross-validation protocol.

Ablation study
We conducted ablation studies to systematically explore the effectiveness of the different components of the proposed C-PDDM model and their respective contributions to overall performance. As shown in Table 8, when 5 labeled samples exist for each category in the target domain, the recognition accuracy (93.83% ± 5.17) is very close to that of C-PDDM under unsupervised learning (92.19% ± 4.70). This gap indicates the impact of individual differences on model performance and highlights the large potential of transfer learning in aBCI applications. Moreover, the results show that simultaneously preserving the local structure of data in both the source and target domains helps improve model performance; otherwise, the recognition accuracy decreases significantly (90.60% ± 5.29 and 91.37% ± 5.82, respectively). When the ‖W‖_{2,1} regularizer is changed to ‖W‖_2, the recognition accuracy drops to 91.84% ± 6.33. This result reflects the sample selection and denoising effects achieved by the ℓ_{2,1} constraint.
For the pseudo-labeling method, when it changes from fixed to linear dynamic, the corresponding accuracy increases from 89.95 to 92.19%. When multi-kernel learning is adopted, the accuracy further improves to 93.68%. These results indicate that multi-kernel learning helps rationalize the importance of each kernel in different scenarios and enhances the generalization of the model.
Next, we analyze the impact of different hyper-parameters on the overall performance of the model. The experimental results show that recognition accuracy is better when α, β, and ρ are dynamically learned than when they are fixed. When the local structural information and fuzzy entropy information in the domain are ignored, performance drops by about 2% (i.e., α = 0, α = 1, β = 0, and β = 100). In addition, the results suggest that performance is optimal when δ is around 0.5, indicating that the class means of the source domain and target domain are equally important.

Effect of noisy labels
To further verify the robustness of the model under noisy-label learning, we randomly corrupt the source labels at different ratios and test the resulting model on unseen target data. Specifically, we replace the corresponding proportion of true labels in Y_s with randomly generated labels, train the model by semi-supervised learning, and then test the trained model in the target domain. Note that noise is added only in the source domain, while the target domain is used for model evaluation. In our implementation, the noise ratio is set to 5, 15, 25, and 30% of the number of source-domain samples, respectively. The results in Figure 2 show that the accuracy of the proposed C-PDDM decreases at the slowest rate as the noise ratio increases, indicating that C-PDDM is a reliable model with high tolerance to noisy data. In future work, we can combine recently proposed methods, such as Xiao et al. (2020) and Jin et al. (2021), to further eliminate common noise in EEG signals and improve the stability of the model in cross-corpus applications.
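The label-corruption step above can be sketched as follows (the function name and the fixed seed are ours; random replacement labels may occasionally coincide with the true label, as in any uniform-replacement scheme):

```python
import numpy as np

def corrupt_labels(y, ratio, n_classes, seed=0):
    """Replace `ratio` of the source labels with uniformly random labels,
    as in the noisy-label robustness experiment (ratios 0.05-0.30)."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = rng.choice(len(y), size=int(ratio * len(y)), replace=False)
    y[idx] = rng.integers(0, n_classes, size=len(idx))
    return y
```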

Confusion matrices
To qualitatively study the performance of the model on each emotion category, we visualize and analyze the confusion matrices and compare the results with the latest models (i.e., BiDANN, BiHDM, RGNN, PR-PL, DICE, ResNet101). As shown in Figure 3, all models are good at distinguishing positive emotions from the others (recognition rates above 90%) but are relatively weak at distinguishing negative from neutral emotions. For example, the corresponding recognition rate of BiDANN (Li et al., 2018c) is even lower than 80% (76.72%). In addition, the PR-PL method achieves the best performance, possibly due to its adoption of adversarial networks, but at the cost of increased computation. Compared with other existing methods (Figures 3A-C,E), our proposed model improves recognition ability, especially in distinguishing neutral and negative emotions, and its overall performance is better than the DICE method (as shown in Figures 3E,F).

Convergence
The proposed C-PDDM adopts an iterative optimization strategy, and we verify its convergence experimentally. The experiments were run on the MATLAB platform with the following configuration: 64 GB memory and an 8-core 2.5 GHz Intel i7-11850H processor. Figure 4 shows the convergence process of C-PDDM over iterations. We can clearly observe that the proposed algorithm reaches its minimum after about 30 iterations. In the algorithm, the objective value of each optimization sub-problem decreases monotonically, which demonstrates that the C-PDDM method has good convergence.

Conclusion
This paper proposes a novel transfer learning framework, C-PDDM, based on a possibilistic clustering hypothesis, which uses a possibilistic distribution distance metric criterion and fuzzy entropy for EEG data distribution alignment, and introduces the Laplacian matrix to preserve the local structural information of the source and target domain data. We evaluate the proposed C-PDDM model on two well-known emotion databases (SEED and SEED-IV) and compare it with existing state-of-the-art methods under four cross-validation protocols (cross-subject single-session, within-subject single-session, within-subject cross-session, and cross-subject cross-session). Extensive experimental results show that C-PDDM achieves the best results under most of the four protocols, demonstrating its advantages in dealing with individual differences and noisy label issues in aBCI systems.

Here C is the number of classes, n_c is the number of samples of the c-th class in the source domain, and m_c is the number of samples of the c-th class in the target domain. When c = 0, μ_{s,c} and μ_{t,c} are the overall mean values of the source domain and the target domain, respectively, and Equation (9) is a feature distribution alignment form. When c ≥ 1, μ_{s,c} and μ_{t,c} are the associated c-th class mean values of the source domain and the target domain, respectively. Y_t is the target domain label matrix; the row of Y_t corresponding to an unlabeled target sample is all zeros. Y_s is the source domain label matrix. L is the graph Laplacian matrix (Long et al., 2013) with D the degree matrix. Optimization problem (23) is a standard singular value decomposition problem, where Y consists of the eigenvectors of the matrix H; thus Y can be obtained by solving the singular value decomposition of H.
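The orthogonally constrained Y sub-problem can be sketched as follows, assuming (as stated above) that the optimum is spanned by the leading eigenvectors of H; the function name is ours and H is symmetrized for numerical stability.

```python
import numpy as np

def solve_Y(H, c):
    """Solve the Y sub-problem under Y Y^T = I: take the top-c eigenvectors
    of the (symmetrized) matrix H as the rows of Y, per problem (23)."""
    Hs = (H + H.T) / 2.0                  # symmetrize for stability
    vals, vecs = np.linalg.eigh(Hs)       # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:c]      # indices of the c largest
    return vecs[:, top].T                 # rows of Y are orthonormal
```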


ALGORITHM 1
Domain adaptation learning based on C-PDDM.

Input: the source domain data {X_s, Y_s}; the target domain data X_t with unknown labels Y_t (the initialization can be obtained by a clustering algorithm); the model parameter values of β, α, ρ, θ; the iteration-stop threshold ε; and the maximal iteration number Z.
Output: the contribution matrix λ_{k,c} matching each instance to the class mean points of the entire domain, the decision function W, and the label matrix Y.
Procedure:
1. Initialize the label values for unlabeled data from the target domain.
2. Compute the means of the different classes in the target domain and the source domain, denoted as μ_{t,c} and μ_{s,c}, and compute the means of the different classes over the overall domain (i.e., the integrated source and target domains).
…
5. Obtain the initialization W⁰ of W using (21).
6. Obtain the initialization Y⁰ of Y using (23).
7. Compute the value of the objective function.
8. Repeat until convergence:
8.1 Fix the current W and Y to update λ_{k,c}.
8.2 Fix the current λ_{k,c} and Y to update W to W^z by Eq. (21).
8.3 Fix the current λ_{k,c} and W to update Y to Y^z by Eq. (23).
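The alternating loop of Step 8, together with the objective-based stopping rule, can be sketched as follows. The three `update_*` callables are placeholders for the membership, W, and Y updates, and the absolute-change stopping test is an assumption consistent with the pre-defined threshold ε:

```python
def c_pddm_loop(update_membership, update_W, update_Y, objective,
                eps=1e-4, max_iter=50):
    """Skeleton of Algorithm 1's alternating optimization: cycle the three
    updates, then stop when the objective change falls below eps."""
    prev = None
    for z in range(max_iter):
        update_membership()   # Step 8.1: update lambda_{k,c}
        update_W()            # Step 8.2: update W
        update_Y()            # Step 8.3: update Y
        cur = objective()
        if prev is not None and abs(prev - cur) <= eps:
            break             # objective has stabilized
        prev = cur
    return z + 1              # number of iterations performed
```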

FIGURE 2
Robustness on the source domain with different noise levels.

Here, the model results reproduced by us are indicated by "*". The bold values are the best performance in the tables.

TABLE 4
The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED database using within-subject cross-session cross-validation.

TABLE 8
The ablation study of our proposed model.

TABLE 7
The mean accuracies (%) and standard deviations (%) of emotion recognition on the SEED-IV database using within-subject single-session cross-validation.