Seizure Classification From EEG Signals Using an Online Selective Transfer TSK Fuzzy Classifier With Joint Distribution Adaption and Manifold Regularization

To recognize abnormal electroencephalogram (EEG) signals of epileptics, in this study we propose an online selective transfer TSK fuzzy classifier based on joint distribution adaptation and manifold regularization. Compared with most existing transfer classifiers, our classifier has its own characteristics: (1) the labeled EEG epochs from the source domain cannot accurately represent the primary EEG epochs in the target domain, so our classifier makes use of very few calibration data in the target domain to induce the target predictive function; (2) joint distribution adaptation is used to minimize both the marginal distribution distance and the conditional distribution distance between the source domain and the target domain; (3) clustering techniques are used to select source domains so that the computational complexity of our classifier is reduced. We construct six transfer scenarios based on the original EEG signals provided by the University of Bonn to verify the performance of our classifier and introduce four baselines and a transfer support vector machine (SVM) for benchmarking. Experimental results indicate that our classifier achieves the best performance and is not very sensitive to its parameters.


INTRODUCTION
The maturity of brain-computer interface (BCI) technology has provided an important channel for humans to use artificial intelligence (AI) to explore the cognitive activities of the brain. For example, many AI methods have been proposed for the intelligent diagnosis of epilepsy, in place of neurological physicians, through electroencephalogram (EEG) signals (Ghosh-Dastidar et al., 2008; Van Hese et al., 2009; Wang et al., 2016). In this study, we also focus on the intelligent diagnosis of epilepsy through EEG signals. The classic diagnostic procedure for epilepsy using intelligent models is illustrated in Figure 1. We observe that, for an emerging task, a large number of labeled EEG epochs are required to train an intelligent model, so substantial manual labeling effort is needed. Because the EEG responses of different patients in the same cognitive activity show a certain degree of similarity, we expect to leverage abundant labeled EEG epochs available in a related source domain to train an accurate intelligent model that can be reused in the target domain. To this end, transfer learning is often used, which has been proven promising for epilepsy EEG signal recognition. For example, Yang et al. (2014) proposed a transfer model, LMPROJ, for epilepsy EEG signal recognition under the support vector machine (SVM) framework. In LMPROJ, the marginal probability distribution distance between the source domain and the target domain, measured by the maximal mean discrepancy (MMD), is minimized to reduce the distribution difference. Jiang et al. (2017c) improved LMPROJ and proposed A-TL-SSL-TSK for epilepsy EEG signal recognition under the TSK fuzzy system framework. Compared with LMPROJ, A-TL-SSL-TSK not only used marginal probability distribution consensus as a transfer principle but also introduced semisupervised learning (the cluster assumption) for regularization.
Additionally, in our previous work (Jiang et al., 2020), we proposed an online multiview and transfer model O-MV-T-TSK-FS for EEG-based drivers' drowsiness estimation. It minimized not only the marginal distribution differences but also the conditional distribution differences between the source domain and the target domain. But it did not derive any information from unlabeled data. More references about transfer learning for epilepsy EEG signal recognition can be found in Jiang et al. (2019) and Parvez and Paul (2016).
Although existing intelligent models, for example, LMPROJ and A-TL-SSL-TSK, underlying the transfer learning framework are effective for epilepsy EEG signal recognition, there still exist some issues that should be further addressed.
• To tolerate the distribution difference between the source domain and the target domain, it is not enough to minimize only the marginal distribution difference between the two domains.
• Most existing models use only one source domain for knowledge transfer; that is, all available labeled data in the source domain are leveraged for model training. However, some labeled data may cause negative transfer.
Therefore, in this study, considering the above two issues together, we propose a new intelligent TSK fuzzy classifier (online selective transfer TSK fuzzy classifier with joint distribution adaption and manifold regularization, OS-JDA-MR-T-TSK-FC) for epilepsy EEG signal recognition. First, it extends marginal distribution adaptation between the source domain and the target domain in two ways: it additionally introduces conditional distribution adaptation to further minimize the distribution difference, and it preserves manifold consistency under the marginal distribution. Second, it can selectively leverage knowledge from multiple source domains.
The following sections are organized as follows: in Data and Methods, we give the EEG data and our proposed method. In Results, we report the experimental results. Discussions about experimental results are presented in Discussions, and the whole conclusions are summarized in the last section.

Data
In this study, we use the widely used epilepsy EEG data to verify the proposed intelligent model. The data from the University of Bonn are open to the public for scientific research. Table 1 gives the data archive and collection conditions. Additionally, Figure 2 illustrates the amplitudes recorded from one volunteer in each group during the collection procedure. The original EEG data cannot be directly used for model training (Jiang et al., 2017b; Tian et al., 2019); feature extraction methods should be employed to extract robust features before model training.

Feature Extraction
Three feature extraction algorithms, that is, wavelet packet decomposition (WPD) (Li, 2011), short-time Fourier transform (STFT) (Pei et al., 1999), and kernel principal component analysis (KPCA) (Li et al., 2005), are employed to extract three kinds of features from the original epilepsy EEG signals.

• Wavelet Packet Decomposition
Wavelet packet decomposition is used to extract time-frequency features from epilepsy EEG signals. More specifically, the epilepsy EEG signals are decomposed into six different frequency bands using the Daubechies 4 wavelet. Each band is considered as one feature. Figure 3 illustrates the six features of group A.
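The decomposition above can be sketched with numpy alone. This is an illustrative assumption, not the paper's exact pipeline: a 4-tap Daubechies (D4) orthogonal filter bank with periodic extension is applied recursively five times, and the energy of each of the resulting six bands is taken as one feature.

```python
import numpy as np

# 4-tap Daubechies (D4) analysis filters (an assumed reading of the
# paper's "Daubechies 4" wavelet): scaling filter H, wavelet filter G.
_S3 = np.sqrt(3.0)
H = np.array([1 + _S3, 3 + _S3, 3 - _S3, 1 - _S3]) / (4 * np.sqrt(2.0))
G = H[::-1] * np.array([1, -1, 1, -1])  # quadrature mirror filter

def dwt_step(x):
    """One analysis step: periodic correlation with H/G + dyadic downsampling."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    approx = np.array([np.dot(H, x[np.arange(i, i + 4) % n]) for i in range(0, n, 2)])
    detail = np.array([np.dot(G, x[np.arange(i, i + 4) % n]) for i in range(0, n, 2)])
    return approx, detail

def wavelet_energy_features(signal, levels=5):
    """Decompose into `levels` detail bands + 1 approximation band and
    return the per-band energies as a 6-dimensional feature vector."""
    feats = []
    approx = signal
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        feats.append(float(np.sum(detail ** 2)))
    feats.append(float(np.sum(approx ** 2)))
    return np.array(feats)
```

Because the filter bank is orthonormal, the six band energies sum to the energy of the input signal, which is a useful sanity check for an implementation.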
• Short-Time Fourier Transform
Short-time Fourier transform is used to extract frequency-domain features from epilepsy EEG signals. More specifically, the epilepsy EEG signals are segmented into locally stationary signal segments, and then the Fourier transform is used to extract a group of spectra of the local segments, which have evident time-varying characteristics at different times. Finally, six frequency bands are extracted from each group of spectra. Figure 4 illustrates the six features of group A.
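A minimal numpy sketch of the band-power extraction above. The Bonn sampling rate of 173.61 Hz is taken from the dataset description; the window length, hop size, and the six band edges are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

FS = 173.61  # Bonn dataset sampling rate (Hz)
# assumed band edges in Hz; the paper does not state the exact bands
BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 20), (20, 30), (30, 45)]

def stft_band_features(signal, fs=FS, win=128, hop=64):
    """Average power in six frequency bands over a sliding Hann window."""
    x = np.asarray(signal, dtype=float)
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power per frame
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    mean_power = spec.mean(axis=0)                    # average over time
    return np.array([mean_power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in BANDS])
```

For example, a pure 10 Hz sine should concentrate its power in the assumed 8-13 Hz band.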

• Kernel Principal Component Analysis
Kernel principal component analysis is used to extract time-domain features from epilepsy EEG signals. More specifically, the Gaussian function is chosen as the kernel to map the original features nonlinearly. Then six kinds of features are selected from the top six principal component eigenvectors. Figure 5 illustrates the six features of group A.
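A numpy-only sketch of the KPCA projection described above; the kernel width `gamma` is an assumed hyperparameter, not a value from the paper:

```python
import numpy as np

def kpca_features(X, n_components=6, gamma=1.0):
    """Project the rows of X onto the top `n_components` kernel principal
    components of a Gaussian (RBF) kernel; gamma is an assumed width."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # double-center the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                       # projections of the training set
```

The projected components come out ordered by decreasing explained (kernel-space) variance, matching the "top six" selection in the text.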

Online Transfer Scenario Construction
We construct six online transfer scenarios from the EEG data after feature extraction (Table 2). Each scenario consists of five source domains (as multiple source domains) and one target domain. Specifically, the two healthy groups (A, B) and the three epileptic groups (C, D, E) are combined to generate six different pairs of combinations, that is, AC, AD, AE, BC, BD, and BE. Five pairs are alternately selected from the six combinations as source domains, and the remaining one is taken as the target domain, so that each pair has the opportunity to become the target domain.
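The leave-one-pair-out construction above can be sketched in a few lines; the pair labels come directly from the group names in the text:

```python
# healthy groups (A, B) paired with epileptic groups (C, D, E)
pairs = [h + e for h in "AB" for e in "CDE"]   # AC, AD, AE, BC, BD, BE

# each pair serves once as the target domain; the remaining five pairs
# form the multiple source domains of that scenario
scenarios = [{"target": t, "sources": [p for p in pairs if p != t]}
             for t in pairs]
```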
In general, calibration in BCIs can be divided into two types, that is, offline calibration and online calibration (Jiang et al., 2020). Offline calibration means that we have obtained a pool of unlabeled EEG epochs; some of them are labeled by experts to train a classifier, and the unseen epochs are then classified by the trained classifier. Online calibration means that the training EEG epochs are obtained on the fly, that is, the classifier is trained online. Both calibration methods have their own advantages and disadvantages. For example, in offline calibration, unlabeled EEG epochs can be used to assist the labeled ones in classifier training, for example, via semisupervised learning (Mallapragada et al., 2009; Zhang et al., 2013; Dornaika and El Traboulsi, 2016). Additionally, if necessary, we can easily obtain the label of any EEG epoch at any time. In online calibration, we not only have no unlabeled EEG epochs for classifier training but also have little control over which epochs arrive next. However, online calibration is more attractive because it is more in line with the needs of practical application scenarios. Therefore, in this study, we only consider online calibration for seizure classification. To simulate online calibration in the aforementioned six transfer scenarios, we first generate M = 20 subject-specific objects from the target domain. The online calibration flowchart is shown in Figure 6.
We repeat all rounds 10 times to obtain statistically meaningful results, where each run starts at a random position m_0.
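The simulation protocol above can be sketched as a small harness. The batch size of four follows the experiments described later; `train_and_eval` is a hypothetical placeholder for whatever classifier is retrained at each round:

```python
import numpy as np

def simulate_online_calibration(target_X, target_y, train_and_eval,
                                M=20, batch=4, repeats=10, seed=0):
    """Simulate online calibration: starting at a random position m0,
    `batch` more labeled subject-specific objects are revealed per round
    and the classifier is retrained via the user-supplied `train_and_eval`.
    Returns accuracies with shape (repeats, M // batch + 1)."""
    rng = np.random.default_rng(seed)
    n = len(target_y)
    accs = np.zeros((repeats, M // batch + 1))
    for r in range(repeats):
        m0 = rng.integers(0, n - M)          # random starting position m0
        for k in range(M // batch + 1):
            m = k * batch                    # labeled objects so far
            calib = slice(m0, m0 + m)
            accs[r, k] = train_and_eval(target_X[calib], target_y[calib])
    return accs
```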

Methods
In this section, we elaborate on the proposed method for seizure classification. We first state the transfer problem mathematically, then present the online transfer learning framework and the resulting online transfer TSK fuzzy classifier (OS-JDA-MR-T-TSK-FC). Lastly, we give the detailed algorithmic steps of OS-JDA-MR-T-TSK-FC, including how to select source domains.

Problem Statement
A domain D = {X, P(x)} in the transfer learning (or domain adaptation) scenario consists of a d-dimensional feature space X ⊆ R^d and a marginal distribution P(x), and a task Ŵ = {Y, P(y|x)} in the same scenario consists of a one-dimensional label space Y and a conditional distribution P(y|x), where y ∈ Y. Suppose that D_s and D_t are two domains derived from D; they are deemed to be different when X_s ≠ X_t and/or P_s(x) ≠ P_t(x). Analogously, two tasks Ŵ_s and Ŵ_t derived from Ŵ are different when Y_s ≠ Y_t and/or P_s(y|x) ≠ P_t(y|x).
Based on the above definitions, the target of OS-JDA-MR-T-TSK-FC is to train a predictive function on a source domain D_s having N labeled EEG epochs to predict the class label of an unseen epoch in the target domain D_t with a low expected error, under the hypotheses that X_s = X_t, Y_s = Y_t, P_s(x) ≠ P_t(x), and P_s(y|x) ≠ P_t(y|x).

• Online Transfer Learning Framework
Because the classic one-order TSK fuzzy classifier (1-TSK-FC) (Deng et al., 2015;Jiang Y. et al., 2017a;Zhang J. et al., 2018;Zhang et al., 2019) is considered as the basic component of our online transfer learning framework, we first give some details about 1-TSK-FC before introducing our framework.
The kth fuzzy rule involved in 1-TSK-FC is formulated in the following if-then form:

IF x_1 is A_1^k ∧ x_2 is A_2^k ∧ ... ∧ x_d is A_d^k, THEN f_k(x) = p_0^k + p_1^k x_1 + ... + p_d^k x_d, (1)

where k = 1, 2, ..., K, K represents the total number of fuzzy rules 1-TSK-FC uses, A_j^k represents the fuzzy set subscribed by x_j for the kth fuzzy rule, and ∧ represents a fuzzy conjunction operator. Each fuzzy rule is premised on the feature space and maps the fuzzy sets in the feature space into a varying singleton represented by f_k(x). After the steps of inference and defuzzification, the predictive function y^o(·) for an unseen object x is formulated as

y^o(x) = Σ_{k=1}^{K} [μ_k(x) / Σ_{k'=1}^{K} μ_{k'}(x)] f_k(x), (2)

in which the firing strength μ_k(x) is expressed as

μ_k(x) = Π_{j=1}^{d} μ_{A_j^k}(x_j), (3)

where μ_{A_j^k}(x_j) can be expressed in the following form when the Gaussian kernel function is employed:

μ_{A_j^k}(x_j) = exp(−(x_j − c_j^k)² / (2δ_j^k)), (4)

where c_j^k and δ_j^k are two parameters representing the kernel center and kernel width, respectively. Therefore, training 1-TSK-FC means finding the optimal c_j^k and δ_j^k in the if parts and p^k = [p_0^k, p_1^k, ..., p_d^k]^T in the then parts. Referring to the literature (Zhang et al., 2019), the parameters in the if parts can be trained by clustering techniques. For instance, c_j^k and δ_j^k can be trained by fuzzy c-means (FCM) (Gu et al., 2017) as

c_j^k = Σ_{i=1}^{N} μ_ik x_ij / Σ_{i=1}^{N} μ_ik, (5)

δ_j^k = h Σ_{i=1}^{N} μ_ik (x_ij − c_j^k)² / Σ_{i=1}^{N} μ_ik, (6)

where μ_ik is the fuzzy membership degree of x_i belonging to the kth cluster, and h is a regularization parameter that can always be set to 0.5 according to the suggestions in Jiang Y. et al. (2017a). When c_j^k and δ_j^k in the if parts are determined by FCM or other similar techniques, for an object x_i in the training set, let

x_e = (1, x_i^T)^T, (7.a)

x̃_i^k = [μ_k(x_i) / Σ_{k'=1}^{K} μ_{k'}(x_i)] x_e, (7.b)

x_gi = ((x̃_i^1)^T, (x̃_i^2)^T, ..., (x̃_i^K)^T)^T, p_g = ((p^1)^T, (p^2)^T, ..., (p^K)^T)^T; (7.c)

then we can rewrite the predictive function y^o(·) in (2) as the following form:

y^o(x_i) = p_g^T x_gi. (8)

Referring to Zhou et al. (2017) and Zhang Y. et al. (2018), we formulate an objective function as follows to solve p_g:

J_{1-TSK-FC}(p_g) = (1/2)(p_g)^T p_g + (η/2) Σ_{i=1}^{N} (y_i − p_g^T x_gi)², (9)

where the first term (1/2)(p_g)^T p_g is a generalization term, the second is a squared-error term, and η > 0 is a balance parameter used to control the tolerance of errors and the complexity of 1-TSK-FC.
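The antecedent construction can be sketched as follows. As an assumption for brevity, plain k-means with hard memberships stands in for FCM when estimating the centers and widths; the firing-strength-weighted design matrix then stacks one `(1, x)` block per rule:

```python
import numpy as np

def antecedents(X, K, h=0.5, seed=0):
    """Estimate Gaussian antecedent parameters (centers c, widths delta).
    Sketch: k-means with hard memberships stands in for FCM."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]          # init centers
    for _ in range(20):                                  # plain k-means
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == k].mean(0) if np.any(lab == k) else C[k]
                      for k in range(K)])
    delta = np.array([h * ((X[lab == k] - C[k]) ** 2).mean(0)
                      if np.any(lab == k) else np.ones(X.shape[1])
                      for k in range(K)])
    return C, np.maximum(delta, 1e-6)

def design_matrix(X, C, delta):
    """Map each x to x_g: normalized rule firing strengths times (1, x)."""
    mu = np.exp(-((X[:, None] - C[None]) ** 2 / (2 * delta[None])).sum(-1))
    mu = mu + 1e-12                                      # numeric guard
    mu = mu / mu.sum(1, keepdims=True)                   # normalize over rules
    Xe = np.hstack([np.ones((len(X), 1)), X])            # (1, x^T)
    return (mu[:, :, None] * Xe[:, None, :]).reshape(len(X), -1)
```

Each row of the result has K blocks of size d+1, so the normalized firing strengths (the first entry of each block) sum to one per sample.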
By setting the partial derivative of the objective function w.r.t. p_g to zero, that is, ∂J_{1-TSK-FC}(p_g)/∂p_g = 0, we can compute p_g analytically as

p_g = ((1/η) I + Σ_{i=1}^{N} x_gi x_gi^T)^{−1} Σ_{i=1}^{N} x_gi y_i. (10)

In this study, 1-TSK-FC is taken as the basic learning component to support the transfer learning framework. Many previous works (Yang et al., 2014; Jiang et al., 2017c) explored only the marginal distribution adaptation between the source domain and the target domain for transfer learning. In our framework, we introduce conditional distribution adaptation to further minimize the distribution difference, and we additionally impose manifold consistency on the marginal distribution. Therefore, the transfer learning framework can be formulated as

min_{p_g} (1/2)(p_g)^T p_g + (η/2)[Σ_{i=1}^{N} (y_i − p_g^T x_gi)² + ω_t Σ_{i=N+1}^{N+M} (y_i − p_g^T x_gi)²] + λ_1 D(J_s, J_t) + λ_2 MR(f), (11)

where ω_t is the overall weight of the subject-specific objects. Generally, ω_t should be larger than 1 so that more emphasis is given to the objects in D_t than to those in D_s; therefore, we set ω_t = max(2, σ·N/M). λ_1 and λ_2 are regularization parameters. The loss term contains two parts: the first measures the loss on D_s, and the second measures the loss on D_t. D(J_s, J_t) is the joint distribution adaptation term, and MR(f) is the manifold regularization term. Below, we explain how to embody them formally.
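The analytic solution (10) is a ridge regression on the firing-strength-weighted design matrix, which can be sketched directly:

```python
import numpy as np

def solve_pg(Xg, y, eta=1.0):
    """Closed-form consequent parameters of 1-TSK-FC:
    p_g = (I/eta + Xg^T Xg)^{-1} Xg^T y  (ridge regression)."""
    D = Xg.shape[1]
    return np.linalg.solve(np.eye(D) / eta + Xg.T @ Xg, Xg.T @ y)
```

As eta grows, the regularization vanishes and the solution approaches ordinary least squares on (Xg, y).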

• Objective function of OS-JDA-MR-T-TSK-FC
Under the framework shown in (11), we specify each term to get the objective function of our online transfer TSK fuzzy classifier OS-JDA-MR-T-TSK-FC.

Loss Function
The squared loss is taken as the loss function to measure the sum of squared training errors on both D_s and D_t; hence, the loss term in (11) can be formulated as

Σ_{i=1}^{N} (y_i − f(x_i))² + ω_t Σ_{i=N+1}^{N+M} (y_i − f(x_i))², (12)

where f(x) = p_g^T x_g is the predictive function of 1-TSK-FC. Suppose we have a diagonal matrix V ∈ R^{(N+M)×(N+M)} in which each diagonal element is defined as

v_ii = 1 if 1 ≤ i ≤ N, and v_ii = ω_t if N < i ≤ N + M. (13)

Substituting (13) into (12), (12) can be rewritten as

(y − X_g p_g)^T V (y − X_g p_g), (14)

where y = [y_1, ..., y_{N+M}]^T and X_g = [x_g1, ..., x_gN, ..., x_g(N+M)]^T, in which each row x_gi is derived from x_i by using (7.c).

Joint distribution adaptation
As we know, even when the EEG epoch features in D_s and D_t are extracted in the same way, the joint distributions (marginal and conditional distributions) of D_s and D_t are generally different. To meet practical requirements, we assume that P_s(x) ≠ P_t(x) and P_s(y|x) ≠ P_t(y|x). Therefore, a joint distribution adaptation should be designed to minimize the distribution distance D(J_s, J_t) between D_s and D_t. First, the projected MMD (Gangeh et al., 2016; Jia et al., 2018; Lin et al., 2018) is employed to measure the marginal distribution distance D(P_s, P_t) between D_s and D_t. As a result, D(P_s, P_t) can be expressed as

D(P_s, P_t) = ||(1/N) Σ_{i=1}^{N} p_g^T x_gi − (1/M) Σ_{i=N+1}^{N+M} p_g^T x_gi||² = p_g^T X_g^T Ω X_g p_g, (15)

where Ω is the MMD matrix, which can be defined as

Ω_ij = 1/N² if x_i, x_j ∈ D_s; Ω_ij = 1/M² if x_i, x_j ∈ D_t; Ω_ij = −1/(NM) otherwise. (16)

Second, suppose that D_{s,c} belongs to D_s and its objects are selected by {x_i | x_i ∈ D_s ∧ y_i = c}, and D_{t,c} belongs to D_t and its objects are selected by {x_i | x_i ∈ D_t ∧ y_i = c}, where c denotes the cth class in one domain. Also, for the source domain, N_c denotes the number of objects in the cth class, and for the subject-specific objects in the target domain, M_c denotes the number of objects in the cth class. Hence, the conditional distribution distance D(Q_s, Q_t) can be expressed as

D(Q_s, Q_t) = Σ_{c=1}^{2} ||(1/N_c) Σ_{x_i ∈ D_{s,c}} p_g^T x_gi − (1/M_c) Σ_{x_i ∈ D_{t,c}} p_g^T x_gi||² = p_g^T X_g^T Ω̃ X_g p_g, (17)

where Ω̃ = Σ_{c=1}^{2} Ω_c and Ω_c is an MMD matrix defined as follows:

(Ω_c)_ij = 1/N_c² if x_i, x_j ∈ D_{s,c}; (Ω_c)_ij = 1/M_c² if x_i, x_j ∈ D_{t,c}; (Ω_c)_ij = −1/(N_c M_c) if one of x_i, x_j is in D_{s,c} and the other is in D_{t,c}; (Ω_c)_ij = 0 otherwise. (18)

According to probability theory, D(J_s, J_t) = D(P_s, P_t) + D(Q_s, Q_t), so the joint distribution adaptation term can be formulated as

D(J_s, J_t) = p_g^T X_g^T Ω X_g p_g + p_g^T X_g^T Ω̃ X_g p_g = p_g^T X_g^T (Ω + Ω̃) X_g p_g. (19)
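The marginal and class-wise MMD matrices over the stacked [source; target] data can be sketched as rank-one outer products of signed indicator vectors; the binary class labels {0, 1} are an illustrative assumption:

```python
import numpy as np

def mmd_matrices(y_s, y_t, classes=(0, 1)):
    """Build the marginal MMD matrix Omega and the summed class-wise
    (conditional) MMD matrices over the stacked [source; target] data."""
    N, M = len(y_s), len(y_t)
    n = N + M
    s = np.zeros(n)
    s[:N] = 1.0 / N                         # source entries
    s[N:] = -1.0 / M                        # target entries
    omega = np.outer(s, s)                  # marginal MMD matrix
    omega_c = np.zeros((n, n))
    y = np.concatenate([y_s, y_t])
    idx = np.arange(n)
    for c in classes:
        e = np.zeros(n)
        src = (y == c) & (idx < N)
        tgt = (y == c) & (idx >= N)
        if src.any():
            e[src] = 1.0 / src.sum()
        if tgt.any():
            e[tgt] = -1.0 / tgt.sum()
        omega_c += np.outer(e, e)           # class-conditional term
    return omega, omega_c
```

By construction both matrices annihilate constant vectors: a function that is identical on the two domains incurs zero MMD penalty.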

Manifold regularization
Under the manifold assumption (Lin and Zha, 2008; Chen and Wang, 2011; Geng et al., 2012), if two objects x_i and x_j are very close in the intrinsic geometry in terms of P(x_i) and P(x_j), then the corresponding Q(y_i|x_i) and Q(y_j|x_j) are considered to be similar. That is to say, for the objects in D_s and the calibration objects in D_t, if they lie on the same manifold, their output (conditional probability distribution) differences should be as small as possible. Therefore, the manifold regularization term can be formulated as follows under geodesic smoothness:

MR(f) = Σ_{i,j=1}^{N+M} w_ij (f(x_i) − f(x_j))² = p_g^T X_g^T L X_g p_g, (20)

where W = [w_ij]_{(N+M)×(N+M)} is the graph affinity matrix in which each element is defined as

w_ij = 1 if x_i ∈ ξ_v(x_j) or x_j ∈ ξ_v(x_i), and w_ij = 0 otherwise, (21)

where ξ_v(x_i) represents the set of v-nearest neighbors of object x_i. L = [l_ij]_{(N+M)×(N+M)} is the corresponding normalized graph Laplacian matrix of W, which can be computed by L = I − D^{−1/2} W D^{−1/2}, where D is the degree matrix whose diagonal elements are d_ii = Σ_{j=1}^{N+M} w_ij. By embedding the manifold regularization into the transfer learning framework, the marginal probability distributions of the objects in the target domain and the source domain are fully utilized to guarantee consistency between the predictive structure of the decision function f and the intrinsic manifold structure of the data.
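A numpy sketch of the v-nearest-neighbor affinity graph and its normalized Laplacian; binary 0/1 weights are assumed for the affinities:

```python
import numpy as np

def normalized_laplacian(X, v=5):
    """v-NN graph with binary affinities and its normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2}."""
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:v + 1]          # v nearest neighbors (skip self)
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize ("i in NN(j) or j in NN(i)")
    deg = W.sum(1)
    Di = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(n) - Di @ W @ Di
```

A quick sanity check: the normalized Laplacian is symmetric and its eigenvalues lie in [0, 2].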
Combining the weighted squared loss, the joint distribution adaptation term, and the manifold regularization term with the generalization term, the overall objective of OS-JDA-MR-T-TSK-FC becomes

J(p_g) = (1/2)(p_g)^T p_g + (η/2)(y − X_g p_g)^T V (y − X_g p_g) + λ_1 p_g^T X_g^T (Ω + Ω̃) X_g p_g + λ_2 p_g^T X_g^T L X_g p_g, (22)

where V is the diagonal weight matrix of the loss term, Ω + Ω̃ is the joint MMD matrix, and L is the normalized graph Laplacian. We can deduce a closed-form solution of p_g by setting the derivative of (22) w.r.t. p_g to zero:

p_g = ((1/η) I + X_g^T (V + (2λ_1/η)(Ω + Ω̃) + (2λ_2/η) L) X_g)^{−1} X_g^T V y. (23)
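A hedged sketch of the resulting solver: it combines the diagonal loss weights, the joint-MMD matrix, and the graph Laplacian into one linear system. The exact scaling of the regularization parameters relative to eta is an assumption of this sketch:

```python
import numpy as np

def solve_transfer_pg(Xg, y, V, Omega, Lap, eta=1.0, lam1=0.1, lam2=0.1):
    """Closed-form p_g for the regularized transfer objective (sketch):
    p_g = (I/eta + Xg^T (V + (2*lam1/eta)*Omega + (2*lam2/eta)*Lap) Xg)^{-1}
          Xg^T V y."""
    D = Xg.shape[1]
    A = (np.eye(D) / eta
         + Xg.T @ (V + (2 * lam1 / eta) * Omega + (2 * lam2 / eta) * Lap) @ Xg)
    return np.linalg.solve(A, Xg.T @ (V @ y))
```

With lam1 = lam2 = 0 and V = I, this degenerates to the plain ridge solution of the single-domain 1-TSK-FC, which is a useful correctness check.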

Algorithm of OS-JDA-MR-T-TSK-FC
Different from most existing transfer models, OS-JDA-MR-T-TSK-FC can leverage knowledge from multiple source domains. However, as we know, too many source domains increase the computational cost. Additionally, source domains that differ significantly from the target domain may introduce negative transfer. Therefore, following Wu et al. (2017), we adopt a distance-based schema to select relevant source domains. We use v_{z,c} to denote the mean vector of the cth class in the zth source domain, where z = 1, 2, ..., Z. Similarly, v_{t,c} denotes the mean vector of the cth class in the target domain. The distance between the zth source domain and the target domain can be computed as

d(z, t) = Σ_{c=1}^{2} ||v_{z,c} − v_{t,c}||₂. (24)

With (24), we obtain a distance set {d(1, t), d(2, t), ..., d(Z, t)} that contains Z domain distances. The distance set is then partitioned by k-means into k groups (in this study, k is set to 2), and the source domains are selected from the cluster whose center is smaller. As a whole, the training of OS-JDA-MR-T-TSK-FC contains three parts: the first is source domain selection, the second is model training on each selected source domain combined with the target domain, and the last is classifier combination. Algorithm 1 shows the detailed training steps of OS-JDA-MR-T-TSK-FC.
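The selection step above can be sketched as follows: the per-class mean distance between each source domain and the target domain is computed, the Z distances are clustered by a 1-D k-means with k = 2, and the cluster with the smaller center is kept:

```python
import numpy as np

def domain_distance(src_class_means, tgt_class_means):
    """d(z, t): sum over classes of the Euclidean distance between the
    class mean vectors of source domain z and the target domain."""
    return sum(np.linalg.norm(vs - vt)
               for vs, vt in zip(src_class_means, tgt_class_means))

def select_sources(dists):
    """Partition the Z distances with a 1-D k-means (k = 2) and return the
    indices belonging to the cluster whose center is smaller."""
    d = np.asarray(dists, dtype=float)
    c = np.array([d.min(), d.max()])             # init the two centers
    for _ in range(20):
        lab = np.argmin(np.abs(d[:, None] - c[None, :]), axis=1)
        for k in (0, 1):
            if np.any(lab == k):
                c[k] = d[lab == k].mean()
    keep = int(np.argmin(c))
    return [z for z in range(len(d)) if lab[z] == keep]
```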
OS-JDA-MR-T-TSK-FC can also be used for multiclass tasks. According to Zhou et al. (2017), we can convert y from the space R to the space R^C by setting y_ij = 1 if y(x_i) = j and y_ij = 0 otherwise, where i = 1, 2, ..., N + M, j = 1, 2, ..., C, and C represents the number of classes. Thus, the label matrix becomes Y = [y_1, ..., y_N, ..., y_{N+M}]^T ∈ R^{(N+M)×C}, and p_g is correspondingly converted from R^{(d+1)K} to R^{(d+1)K×C}.
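The label conversion above is a one-hot encoding; labels in {0, ..., C-1} are assumed for indexing convenience:

```python
import numpy as np

def to_label_matrix(y, C):
    """Convert labels in {0, ..., C-1} to an (N+M) x C 0/1 matrix Y, so
    that p_g can be solved column by column against each class."""
    Y = np.zeros((len(y), C))
    Y[np.arange(len(y)), y] = 1.0
    return Y
```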
Calculate p_g and record it as (p_g)_z by (23); use (p_g)_z to predict the N_z + M training objects and record the training accuracy as α_z;
End
Return f(x) = α_1 (p_g^T)_1 x_g + α_2 (p_g^T)_2 x_g + ... + α_{Z'} (p_g^T)_{Z'} x_g;

RESULTS
Experiment setups and comparison results will be reported in this section.

Setups
For fairness, we introduce three baselines and two transfer learning algorithms for the comparison study. The three baselines all use 1-TSK-FC for training, but their training sets differ.
(1) Baseline 1 (BL1). Its training set consists of the five source domains directly concatenated, and its testing set is the target domain. Therefore, BL1 is a calibration-independent classifier, which does not use the subject-specific data in the target domain for training.
(2) Baseline 2 (BL2). It uses only the subject-specific calibration EEG data in the target domain for training. Its testing set is the unlabeled data in the target domain. Therefore, BL2 is a source domain-independent classifier, which does not consider the EEG data in the source domains at all.
(3) Baseline 3 (BL3). BL3 is trained on five training sets, respectively. Each set consists of one source domain plus the subject-specific data in the target domain. The five trained models are finally combined by the weighting schema also used in Algorithm 1. Its testing set is the unlabeled data in the target domain.
(4) Transfer support vector machine (TSVM) (Chapelle et al., 2008). It trains five TSVM classifiers by incorporating the unlabeled EEG data in the target domain for semisupervised learning. The five trained models are finally combined by the weighting schema also used in Algorithm 1.
(5) ARRLS (Long et al., 2014). It trains five ARRLS classifiers by combining unlabeled EEG data in the target domain for supervised learning. The five trained models are finally combined by a weight schema that is also used in Algorithm 1.

Experimental Results
In this section, we report the experimental results from several aspects, that is, classification performance, interpretability, and robustness.
• Classification Performance
Table 3 shows the average classification performance over the six scenarios in the KPCA, WPD, and STFT feature spaces, respectively. Table 4 shows the classification performance on KPCA features, Table 5 on WPD features, and Table 6 on STFT features. The best results are marked in bold.

• Interpretability
Unlike TSVM, which works in a black-box manner, the proposed OS-JDA-MR-T-TSK-FC has high interpretability because 1-TSK-FC is taken as its basic component. Table 7 shows the five trained fuzzy rules (antecedent and consequent parameters) on SC-1 in the KPCA feature space.

• Robustness
From the objective function of OS-JDA-MR-T-TSK-FC, we see that three parameters, that is, ω_t (σ), λ_1, and λ_2, should be fixed before a classification task. Therefore, we should examine how sensitive OS-JDA-MR-T-TSK-FC is to these parameters.

DISCUSSIONS
We observe from Table 3 that the proposed OS-JDA-MR-T-TSK-FC achieves the best average performance across the six transfer scenarios in all feature spaces when the number of subject-specific objects is more than 4. Compared with the three baselines in particular, the advantages are even more obvious.
Moreover, the classification results in Tables 4-6 exhibit the following characteristics:
• BL1 does not use the subject-specific objects, so its accuracy is independent of M, whereas the other four classifiers depend on M, and it is intuitive that they gradually outperform BL1 as M increases.
• BL2 is trained only on the subject-specific objects. Therefore, BL2 becomes unusable when M is set to 0, whereas BL1, BL3, TSVM, and OS-JDA-MR-T-TSK-FC can still work because, besides subject-specific objects, they also leverage training objects from the source domains. Compared with the other algorithms, BL2 performs poorly when M is too small because it cannot obtain enough training patterns from the subject-specific objects.
• When M is set to 0, TSVM always achieves the best performance. As subject-specific objects are gradually added into the training set, OS-JDA-MR-T-TSK-FC soon outperforms TSVM, which indicates that significant differences exist among the domains. Hence, a domain-dependent classifier such as TSVM is not well suited to our online transfer scenarios.
• When one batch (four subject-specific objects are taken as a batch in our experiments) or at most two batches of subject-specific objects are added into the training set, the classification performance of OS-JDA-MR-T-TSK-FC becomes stable. That is to say, the number of subject-specific objects OS-JDA-MR-T-TSK-FC needs is very small. Hence, OS-JDA-MR-T-TSK-FC meets practical requirements, because subject-specific objects are very scarce in real-world applications.
In addition to classification performance, interpretability is also a main characteristic of the proposed OS-JDA-MR-T-TSK-FC. From Table 7, we see that it generates five interpretable fuzzy rules on SC-1 in the KPCA feature space. Each feature in a fuzzy rule can be interpreted as the energy of an EEG signal band, and each fuzzy membership function can be endowed with a linguistic description. For example, "x_1 is A_1^k" in the antecedent of a fuzzy rule can be interpreted as "the energy of an EEG band is a little high," where the term "a little high" can be replaced by others such as "a little low," "medium," or "high." In this way, suppose an expert in EEG signal analysis assigns five kinds of linguistic descriptions to each fuzzy membership function, that is, "low," "a little low," "medium," "a little high," and "high." Then the first fuzzy rule in Table 7 can be interpreted as follows: if the energy of EEG signal band 1 is "high," the energy of band 2 is "a little low," the energy of band 3 is "low," the energy of band 4 is "low," the energy of band 5 is "low," and the energy of band 6 is …