Feature separation and adversarial training for the patient-independent detection of epileptic seizures

An epileptic seizure is the external manifestation of abnormal neuronal discharges, which seriously affect physical health. The pathogenesis of epilepsy is complex, and the types of epileptic seizures are diverse, resulting in significant variation in epileptic seizure data between subjects. If epilepsy data from multiple patients are fed directly into a model for training, the model will underfit. To overcome this problem, we propose a robust epileptic seizure detection model that effectively learns from multiple patients while eliminating the negative impact of the data distribution shift between patients. The model adopts a multi-level temporal-spectral feature extraction network to extract features, a feature separation network to separate the features into category-related and patient-related components, and an invariant feature extraction network to extract essential feature information related to categories. The proposed model is evaluated on the TUH dataset using leave-one-out cross-validation and achieves an average accuracy of 85.7%. The experimental results show that the proposed model outperforms the methods in the related literature and provides a valuable reference for the clinical application of epilepsy detection.


Introduction
Epilepsy is a chronic disorder caused by the sudden abnormal discharge of nerve cells in the brain, resulting in temporary brain dysfunction. It is the second most common neurological disorder after headache, affecting approximately 70 million people worldwide. The clinical manifestations of epileptic seizures are complex and the seizure types are varied; manifestations may include impaired consciousness, limb spasms, urinary incontinence, frothing, and other symptoms. Although epileptic seizures have little impact on patients in the short term, long-term frequent seizures severely affect patients' physical, mental, and intellectual health (Rakhade and Jensen, 2009; Rasheed et al., 2021). Most people with epilepsy can control their condition with medication and surgery. Still, about 30% of people with epilepsy have intractable epilepsy that cannot be adequately controlled with medication (Kwan and Brodie, 2000), which poses a severe threat to the life and health of patients and a heavy burden on their families and society. The pathogenesis of epilepsy is also complex. The characteristics of EEG (electroencephalogram) data during a seizure are related to the site of origin and the cause of the epilepsy. Different diseases of the nervous system or various conditions of the brain can cause different epileptic seizures, and the same condition of the nervous system can cause more than one type of epileptic seizure. Previous studies have reported that about 7% of neurons fire in patients with a subclinical seizure, about 14% when an aura appears, and about 36% in patients with a clinical seizure. Therefore, even in the same patient, the intensity, type, location, and duration of each seizure may differ. Across multiple patients, the differences are more marked (Babb et al., 1987; Fisher et al., 2017).
Most existing epileptic seizure detection methods focus on the patient-dependent scenario, in which a patient's seizures are detected by learning from that patient's own historical records; this approach is easy to implement and achieves high detection accuracy. In contrast, patient-independent methods have the advantage of being able to alert potential patients, but they are easily corrupted by inter-patient noise. Most existing studies fail to eliminate the significant differences between patients, which are caused by multiple factors such as physical condition, pathogenesis, seizure intensity, and seizure type. When a model is trained directly on data from multiple patients, it easily underfits, and detection performance drops sharply on new patients. For these reasons, we propose a new method that uses a feature extraction network and a feature separation network to improve the discriminability of features, and that aligns the marginal and conditional distributions of the features to enhance the ability to extract patient-invariant features.
The main contributions of our study can be summarized as follows: (1) We propose a novel domain generalization model based on feature disentanglement and adversarial training that enhances the ability to extract patient-invariant features and thereby improves the generalization ability of the model. (2) We verify the proposed model through extensive experimental evaluations. The experimental results show that the proposed approach has significant potential as an epileptic seizure detection method and provides a valuable reference for clinical application.
The remainder of this paper is organized as follows. The section "2. Related work" reviews related work on epileptic seizure detection. In the section "3. Methodology," a patient-independent epileptic seizure detection model is proposed. In the section "4. Experiments," we present experiments and results on a benchmark dataset. In the section "5. Discussion," we analyze the effectiveness of the proposed method. Finally, some conclusions are given in the section "6. Conclusion."

Related work
As a subclass of machine learning, deep neural networks have made remarkable progress in computer vision, natural language processing, and other fields, and researchers have proposed a variety of network models and methods for specific application scenarios. In research on domain generalization, two approaches are usually adopted: (1) Methods based on experience and prior knowledge are designed to extract universal features that support good detection on new patients. (2) Domain adaptation techniques are used to extract features that are invariant across multiple patients, improving the generalization ability of the model.
For the first approach, Ansari et al. (2021) proposed an automated seizure onset detection system that used power spectrum features and some statistical features to detect seizure onset, achieving a mean latency of 0.9 s and 1.02 false detections per hour. Liu et al. (2022) proposed a novel patient-independent approach that used wavelet decomposition, a Convolutional Neural Network (CNN), a Bidirectional Long Short-Term Memory (Bi-LSTM) network, and a novel channel perturbation technique, achieving mean accuracies of 97.51% and 93.70%. Sridevi et al. (2019) proposed a patient-independent approach that used spectral entropy, spectral energy, and signal energy as features and achieved a better classification effect.
For the second approach, Zhao et al. (2021) proposed a domain adaptation method that eliminates the domain shift from the source domain to the target domain and achieved better performance. Li et al. (2021) proposed a bi-hemisphere domain adversarial neural network that achieved good recognition performance in EEG emotion recognition. Tang and Zhang (2020) applied a conditional adversarial domain adaptation neural network to motor imagery EEG decoding and achieved a better classification effect.
In epilepsy detection, Zhang et al. (2020) used feature separation and adversarial representation learning to decompose the data into category-related (seizure or normal) features and patient-related features, achieving an average accuracy of 80.5% on the TUH EEG dataset. Dissanayake et al. (2021) used a CNN structure and a Siamese network structure and achieved an accuracy of 88.81% on the CHB-MIT dataset.
To the best of our knowledge, the above methods do not completely eliminate the effects of the data distribution shift between patients, so in this study, we propose a robust approach to address this problem.

The proposed network
The proposed patient-independent epileptic seizure detection model is illustrated in Figure 1. It includes three subnets: (1) a multi-level temporal-spectral feature extraction network, (2) a feature separation network, and (3) an invariant feature extraction network. The feature extraction network, illustrated in Figure 2, extracts temporal and frequency-domain feature information from EEG data and enhances its representation with the Squeeze-and-Excitation Network (Hu et al., 2018), so that the extracted features are discriminable. The feature separation network disentangles the features into category-related features and patient-related features. Finally, the invariant feature extraction network extracts invariant, patient-independent features by aligning the marginal distribution and the conditional distribution, which improves the generalization ability of the model.

Figure 1. The architecture of the proposed network.

Figure 2. The architecture of the multi-level temporal-spectral feature extraction network.

Multi-level temporal-spectral feature extraction network
Electroencephalogram data are two-dimensional data, similar to images, and exhibit uncertainty and randomness; therefore, the original data must be preprocessed. We use min-max normalization to scale the data. See also Rahim et al. (2016) and Versaci and Morabito (2021) for preprocessing.
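The min-max step can be sketched as follows. This is a minimal illustration, assuming that each channel of a (channels x samples) segment is scaled independently to [0, 1]; the per-channel choice, the function name `min_max_normalize`, and the `eps` guard are assumptions made for the example, not details given in the text.

```python
import numpy as np

def min_max_normalize(segment, eps=1e-8):
    """Scale each row (channel) of a 2-D array to the range [0, 1]."""
    segment = np.asarray(segment, dtype=float)
    lo = segment.min(axis=1, keepdims=True)
    hi = segment.max(axis=1, keepdims=True)
    # eps avoids division by zero for a flat (constant) channel
    return (segment - lo) / (hi - lo + eps)

# Synthetic stand-in for a 22-channel, 1 s segment sampled at 250 Hz
rng = np.random.default_rng(0)
eeg = rng.normal(scale=50.0, size=(22, 250))
normalized = min_max_normalize(eeg)
```

Scaling per channel keeps one high-amplitude electrode from compressing the others; scaling over the whole segment would be an equally simple alternative.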
As convolution operators are essentially equivalent to a low-pass filter (Azimi et al., 2019), the embedding block, that is, successive temporal convolution and batch normalization (BN) operations, is first adopted to infer an optimal filter band for the subsequent analysis. After the original data and the output embeddings are stacked with a channel-wise concatenation function, the embedding block yields a sub-band matrix, which provides the subsequent network with adaptive sub-band responses as well as the original data. Finally, the data are fed into the multi-level spectral feature extraction module and the multi-level temporal feature extraction module for feature extraction.
In the multi-level temporal-spectral feature extraction network, to prevent the deformation of the boundary data caused by zero padding in the convolution operation, the head and tail of the data are filled according to formula (1), where | is the concatenation operator, x(i) is the i-th element of the input x, and R is the kernel size of the convolution operation.
To reduce computation time, the proposed method performs multi-level wavelet decomposition with convolution operations, as defined in formula (2), where ⊗ is the convolution operation, g and h are a pair of scaling and wavelet filters, s is the stride of the convolution operation, $y_A(i)$ are the approximation (low-pass) coefficients, and $y_D(i)$ are the detail (high-pass) coefficients.
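The idea of computing a wavelet decomposition with strided filtering can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the Haar filter pair is used as a simple stand-in for the wavelet actually chosen, no boundary padding is applied, the stride is fixed to 2, and the function names are hypothetical.

```python
import numpy as np

# Haar filter pair, used here only as a simple stand-in (assumption).
G_LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)    # scaling (low-pass) filter g
H_HIGH = np.array([1.0, -1.0]) / np.sqrt(2.0)  # wavelet (high-pass) filter h

def strided_filter(x, kernel, stride):
    """Slide `kernel` over `x` with the given stride and no padding.

    This is the cross-correlation form that deep-learning libraries
    usually call "convolution".
    """
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n_out)])

def wavelet_decompose(x, levels, g=G_LOW, h=H_HIGH, stride=2):
    """Return the final approximation and the detail coefficients of each level."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        detail = strided_filter(approx, h, stride)  # y_D: high-pass branch
        approx = strided_filter(approx, g, stride)  # y_A: low-pass branch
        details.append(detail)
    return approx, details

signal = np.sin(2 * np.pi * 10 * np.arange(256) / 256.0)
approx, details = wavelet_decompose(signal, levels=5)
```

Each level halves the number of coefficients, so five levels applied to a 256-sample input leave 8 approximation coefficients and detail vectors of length 128, 64, 32, 16, and 8.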
In the multi-level spectral feature extraction module, to extract the wavelet coefficients corresponding to the standard physiological sub-bands δ (0∼4 Hz), θ (4∼8 Hz), α (8∼16 Hz), β (16∼32 Hz), and γ (32∼64 Hz), we select the Daubechies order-4 (Db4) wavelet, since previous studies reported that the Db4 mother wavelet is useful for epileptiform transient detection due to its high correlation coefficients with the epileptic spike signal (Indiradevi et al., 2008). This yields the frequency features.

In the multi-level temporal feature extraction module, considering the data distribution shift between subjects, we use five independent convolution, batch normalization, and exponential linear unit (ELU) operations to capture multi-level temporal feature information with different receptive fields. The convolution kernel size is set to [S, 1], where S takes the values {k, k, k/2, k/4, k/8} with $k = 2^5$; this yields the temporal features.

To further extract discriminative feature information, the features extracted by the multi-level spectral feature extraction module and the multi-level temporal feature extraction module are combined along the feature dimension. The combined features $f_{all}$ are fed into the Squeeze-and-Excitation Network to enhance feature discrimination.
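The kernel sizes and the channel reweighting step can be illustrated with a short sketch. The kernel sizes follow directly from the values stated above; the squeeze-and-excitation gate is a minimal NumPy version with random weights, and the feature shape, the reduction ratio, and the function name `squeeze_excite` are assumptions made for the example.

```python
import numpy as np

# Temporal kernel sizes S = {k, k, k/2, k/4, k/8} with k = 2^5
k = 2 ** 5
kernel_sizes = [k, k, k // 2, k // 4, k // 8]

def squeeze_excite(features, w1, w2):
    """Reweight feature channels with a squeeze-and-excitation gate.

    features: (channels, length) array
    w1: (channels // r, channels) bottleneck weights
    w2: (channels, channels // r) expansion weights
    """
    squeezed = features.mean(axis=1)              # squeeze: one statistic per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)       # excitation: ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate in (0, 1)
    return features * gate[:, None], gate

rng = np.random.default_rng(0)
channels, length, reduction = 16, 125, 4          # illustrative sizes (assumptions)
f_all = rng.normal(size=(channels, length))
w1 = rng.normal(scale=0.1, size=(channels // reduction, channels))
w2 = rng.normal(scale=0.1, size=(channels, channels // reduction))
reweighted, gate = squeeze_excite(f_all, w1, w2)
```

In a trained network the two weight matrices are learned, so the gate emphasizes the channels that help discrimination and suppresses the rest.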

Feature separation network
The feature information (category information, patient information, etc.) is contained in every dimension and is intertwined. If the features can be disentangled by the feature separation network, their separability and discriminability will improve. Therefore, based on prior knowledge, we split the features obtained from the feature extraction network into two parts: the first half is the category-related component, denoted $F_{category\_related}$, and the second half is the patient-related component, denoted $F_{patient\_related}$. To ensure that the first half captures the category-related component, a category classifier with a cross-entropy loss function is used; to ensure that the second half captures the patient-related component, a patient classifier with a cross-entropy loss function is used. To separate the two parts as far as possible, the maximum divergence loss function is applied between the category-related component and the patient-related component (Bui et al., 2021).
In the loss functions of the category classifier and the patient classifier, N is the number of samples, $x_i$ is a data sample, $G_f$ is the feature extraction network, $G_{c1}$ is the category classifier, $G_p$ is the patient classifier, L is the cross-entropy loss function, $y_i$ is the category label (seizure or normal), $d_i$ is the patient label, and $D_s \in D_1 \cup D_2 \cup \dots \cup D_n$, where $D_1, D_2, \dots, D_n$ are the data of each patient.
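One plausible form of these two losses, consistent with the symbols defined above, is sketched below. It is an assumption rather than the exact expression: the names $L_{c1}$ and $L_p$ and the explicit split of $G_f(x_i)$ into its two halves are illustrative.

```latex
\begin{aligned}
G_f(x_i) &= \big[\, F_{category\_related}(x_i),\; F_{patient\_related}(x_i) \,\big], \\
L_{c1} &= \frac{1}{N} \sum_{x_i \in D_s} L\big( G_{c1}(F_{category\_related}(x_i)),\; y_i \big), \\
L_{p} &= \frac{1}{N} \sum_{x_i \in D_s} L\big( G_{p}(F_{patient\_related}(x_i)),\; d_i \big).
\end{aligned}
```

Under this reading, each classifier sees only its own half of the feature vector, which is what pushes category information into the first half and patient information into the second.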
To separate the category-related component ($F_{category\_related}$) from the patient-related component ($F_{patient\_related}$), we use the maximum divergence loss function. The separated features are then combined to create new features.

Invariant feature extraction network
The feature separation network effectively disentangles the features and improves their discriminability, but the resulting features are not yet invariant across patients. To improve the generalization ability of the model, the proposed method builds on DANN (Domain-adversarial training of neural networks) (Ganin et al., 2016; Yu et al., 2019) and MADA (Multi-adversarial domain adaptation) (Pei et al., 2018) to achieve better invariant feature learning. The global patient discriminator aligns the features of each patient according to the marginal distribution. The local patient discriminator aligns the features of each category according to the conditional distribution. In the global adversarial loss function and the local adversarial loss function, L is the cross-entropy loss function, $G_f$ is the feature extraction network, $G_g$ and $G_l^k$ (k = 1, 2) are the global and local patient discrimination networks, $d_i$ is the patient label, $y_i^k$ (k = 1, 2) is the k-th component of the original label after one-hot encoding, and $D_s \in D_1 \cup D_2 \cup \dots \cup D_n$ is the patient sample set.
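Following the general form used in DANN and MADA, the two adversarial losses can be written as in the sketch below. This is a hedged reconstruction from the symbols defined above, not the exact expression: the names $L_g$ and $L_l$ are illustrative, and the input to the discriminators is assumed to be the feature vector $G_f(x_i)$.

```latex
\begin{aligned}
L_{g} &= \frac{1}{N} \sum_{x_i \in D_s} L\big( G_{g}(G_f(x_i)),\; d_i \big), \\
L_{l} &= \frac{1}{N} \sum_{k=1}^{2} \sum_{x_i \in D_s} L\big( G_{l}^{k}\big( y_i^{k}\, G_f(x_i) \big),\; d_i \big).
\end{aligned}
```

Because $y_i^k$ is a one-hot component, each sample contributes to exactly one local discriminator, so the k-th discriminator only sees features of category k.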
In the category classifier, the center loss function (Wen et al., 2016) is adopted to concentrate the features of each category around its center; in this loss, $c_{y_i}$ is the category center. Through the above operations, the marginal and conditional distributions of the features are aligned and the features are gathered around the center of each category, so the invariant features are obtained. In the loss function of the category classifier (Rahim et al., 2015; White et al., 2020; Versaci et al., 2022; Waheed et al., 2023), $G_{c2}$ is the category classifier and $y_i$ is the category label.
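A minimal sketch of the center loss, in the form given by Wen et al. (2016), is shown below. The toy features and the choice of per-class means as centers are assumptions for illustration; in training, the centers are updated alongside the network.

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * sum of squared distances between each feature and its class center."""
    diffs = features - centers[labels]   # pick the center c_{y_i} for every sample
    return 0.5 * np.sum(diffs ** 2)

features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])          # 0 = normal, 1 = seizure
# Class centers taken as per-class means here (an illustrative choice)
centers = np.stack([features[labels == c].mean(axis=0) for c in (0, 1)])
loss = center_loss(features, labels, centers)
```

The loss is zero only when every feature coincides with its category center, so minimizing it pulls the features of each category together.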

Training details
We propose an adversarial training strategy to train all the loss functions jointly (Matsuura and Harada, 2020), in which the coefficient λ is set to 0.1. The parameters $\theta_g$, $\theta_l^1$, and $\theta_l^2$ are trained through a special layer called the Gradient Reversal Layer (GRL), which acts as an identity mapping during forward propagation and reverses the gradient during backpropagation.
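The behavior of the gradient reversal layer can be sketched without a deep-learning framework. The class below is a hypothetical stand-alone illustration; scaling the reversed gradient by `lam = 0.1` is an assumption made for the example and may not match how λ enters the joint loss.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=0.1):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged on the way to the discriminator
        return x

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor changes sign
        return -self.lam * grad_output

grl = GradientReversal(lam=0.1)
features = np.array([0.5, -1.0, 2.0])
out = grl.forward(features)
grad_to_extractor = grl.backward(np.array([1.0, 1.0, -2.0]))
```

With the sign flipped, the discriminators still learn to identify the patient, while the feature extractor is updated in the direction that makes patients harder to tell apart.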

Finally, we search for the optimal parameters $\theta_f$, $\theta_{c1}$, $\theta_p$, $\theta_{c2}$, $\theta_g$, $\theta_l^1$, and $\theta_l^2$ of the joint objective. Here, $\theta_f$ are the parameters of the multi-level temporal-spectral feature extraction network, $\theta_{c1}$ are the parameters of the category classifier in the feature separation network, $\theta_p$ are the parameters of the patient classifier in the feature separation network, $\theta_{c2}$ are the parameters of the category classifier in the invariant feature extraction network, $\theta_g$ are the parameters of the global patient discriminator in the invariant feature extraction network, and $\theta_l^1$ and $\theta_l^2$ are the parameters of the local patient discriminators in the invariant feature extraction network.
During training, if the training samples are split into mini-batches, the features of all the training samples cannot be obtained at the same time, so we feed all the training samples into the network as a single batch. The model is trained with the Adam optimizer at a learning rate of 0.005; the center loss function is optimized with the Stochastic Gradient Descent (SGD) optimizer at a learning rate of 0.05; training runs for 200 epochs. We use grid search to set the hyperparameters in the experiment.

Dataset
The proposed approach is evaluated on a benchmark dataset, the TUH corpus (Obeid and Picone, 2016), a neurological seizure dataset of clinical EEG recordings with 22 channels placed according to the international 10/20 system. We form a subset of the TUH corpus with 14 subjects by selecting the subjects with more than 250 s of seizure state. For each subject, we use 500 s of EEG signals (half normal and half seizure) at a sampling rate of 250 Hz. Each EEG fragment has 250 sample points (lasting 1 s), and adjacent fragments overlap by 50%. Fragments belonging to the epileptic seizure state are labeled 1, and those belonging to the normal state are labeled 0. The sample set is then divided into a training set and a test set.
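The fragmenting step can be sketched as below. The window length, overlap, sampling rate, and channel count come from the description above; the zero-filled placeholder recordings, the function name `segment`, and the choice to segment the seizure and normal portions separately are assumptions for illustration.

```python
import numpy as np

def segment(recording, window=250, overlap=0.5):
    """Cut a (channels, samples) recording into overlapping windows.

    Returns an array of shape (n_segments, channels, window).
    """
    step = int(window * (1.0 - overlap))            # 125 samples for 50% overlap
    n_samples = recording.shape[1]
    starts = range(0, n_samples - window + 1, step)
    return np.stack([recording[:, s:s + window] for s in starts])

fs = 250                                            # sampling rate in Hz
# Placeholders for 250 s of seizure EEG and 250 s of normal EEG (22 channels)
seizure = np.zeros((22, 250 * fs), dtype=np.float32)
normal = np.zeros((22, 250 * fs), dtype=np.float32)

seizure_segments = segment(seizure)
normal_segments = segment(normal)
x = np.concatenate([seizure_segments, normal_segments])
y = np.concatenate([np.ones(len(seizure_segments), dtype=int),    # 1 = seizure
                    np.zeros(len(normal_segments), dtype=int)])   # 0 = normal
```

Under these assumptions, each 250 s portion yields 499 one-second fragments, so one subject contributes 998 labeled fragments.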

Evaluation metrics
The experiment used accuracy (ACC), sensitivity (SN), and specificity (SP) to quantify the performance of the algorithm (Yang et al., 2023):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$

where TP (True Positive) is the number of positive samples judged to be positive, TN (True Negative) is the number of negative samples judged to be negative, FP (False Positive) is the number of negative samples judged to be positive, and FN (False Negative) is the number of positive samples judged to be negative.
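The three metrics can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration with made-up labels; the function name `detection_metrics` is hypothetical, and the code assumes that both classes are present so that no denominator is zero.

```python
def detection_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels (1 = seizure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # fraction of seizure fragments that are detected
    sp = tn / (tn + fp)   # fraction of normal fragments that are kept as normal
    return acc, sn, sp

# Toy example: 4 seizure and 4 normal fragments, one error in each class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc, sn, sp = detection_metrics(y_true, y_pred)
```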

Baselines
The adopted baseline models include:
• Zabihi et al. (2013) applied Discrete Wavelet Transform (DWT) and calculated metrics such as relative scale energy and Shannon entropy as features; SVM is used for data classification.
• Fergus et al. (2015) applied Power Spectral Density (PSD) and calculated metrics such as peak frequency and max frequency as features; KNN is used for data classification.
• Schirrmeister et al. (2017) applied convolutional neural networks to distinguish seizure segments by decoding task-related information from EEG signals.
• Kiral et al. (2018) designed a deep neural network for seizure diagnosis and further developed a prediction system on a wearable device.
• Zhang et al. (2020) proposed an adversarial representation learning strategy, which achieves robust and explainable epileptic seizure detection.
• Dissanayake et al. (2021) used the CNN network structure and Siamese network structure to improve the generalization ability of the model.
The six comparison methods and our experiment used the same data segment length on the TUH dataset with leave-one-out cross-validation; the comparison results are given in Table 1.

Through comparative analysis, the methods of Schirrmeister et al. (2017) and Kiral et al. (2018) only used a deep neural network to train a model on the pooled data of multiple patients, without considering the negative impact of inter-patient differences on the trained model, resulting in poor detection accuracy when applied to new patients. Zabihi et al. (2013) used features such as relative scale energy and Shannon entropy, and Fergus et al. (2015) used features such as peak frequency and max frequency. These methods were able to extract the obvious common features but not the deeper ones, so their detection accuracy was higher than that of Schirrmeister et al. (2017) and Kiral et al. (2018) and lower than that of Zhang et al. (2020) and Dissanayake et al. (2021). The methods of Zhang et al. (2020) and Dissanayake et al. (2021), which applied neural networks to eliminate the negative impact of the data distribution shift between patients, obtained better results than the methods that did not address this shift. The method proposed in this paper uses feature separation and adversarial training to disentangle features in the latent space while learning domain-invariant features, thereby mitigating the influence of inter-patient differences; its results are the best, with an average detection accuracy of 85.7% under leave-one-out cross-validation.
In addition, the confusion matrix and the receiver operating characteristic (ROC) curve with the area under the curve (AUC) value are shown for a closer look at the detection results. The results of one of the best-performing subjects (patient 6) are illustrated in Figure 3. From the confusion matrix we can see that our approach achieves a sensitivity of 98.4% and a specificity of 100%.

Discussion
To analyze the effectiveness of the proposed method, we first removed the feature separation network while leaving the other settings unchanged and then tested on the TUH dataset using leave-one-out cross-validation. The results are shown in Table 2. With the feature separation network removed, the average accuracy is 81.6%. The feature separation network therefore ensures feature separability and improves feature discrimination, which improves detection performance.
Second, for the invariant feature extraction network, since DANN only aligns the marginal distribution of the features of multiple patients and MADA only aligns the conditional distribution of the features of multiple patients, we propose a method that aligns the marginal distribution and the conditional distribution of each patient's features at the same time. In addition, $y_i^k$ (k = 1, 2) in the MADA method is replaced with the one-hot encoding of the original training label. The model is then trained adversarially for each category, so that the invariant features of each category can be obtained.
To assess the advantages of the proposed method, we also train and test networks that use only DANN and only MADA. The proposed method achieves the best performance; the comparison results are shown in Table 3.
For a clear illustration, we further use the t-SNE method (Maaten and Hinton, 2008) to visualize the feature distributions of the compared methods, as illustrated in Figure 4. DANN only tries to align the marginal distribution; however, because of the shift in data distribution between patients, the marginal distribution is difficult to align, and the features remain scattered. MADA aligns the conditional distribution, but features of different categories are mixed together. In the proposed method, the features are clustered by category and can be discriminated. This shows that the proposed method has advantages in learning invariant features.
The reasons are as follows. First, DANN uses a global domain adversarial method that aligns the marginal distribution of the features without regard to the data category. Second, MADA uses a local domain adversarial method that aligns the conditional distribution of the features according to the data category, but $y_i^k$ (k = 1, 2) in the MADA method is the output of the classification network rather than the true category information, so the features of each category cannot be aligned accurately. The proposed method aligns the marginal distribution and the conditional distribution simultaneously and uses the true labels of the training set as $y_i^k$ (k = 1, 2), which improves feature alignment. Therefore, the proposed method has the best performance.
For future work, we suggest the following three points. First, in the proposed method, the data features are divided into category-related features and patient-related features. In future work, the features can be divided into finer-grained components, and new network structures and loss functions can be used for feature extraction to improve the algorithm's performance.
Second, the proposed method uses adversarial training to learn the invariant features, but the results of adversarial training are not stable; there are significant differences between each training epoch; therefore, new invariant feature learning methods can be studied in the future to improve the stability of training.
Third, the experiments of the proposed method are all conducted on an existing public dataset and have not been verified on a real clinical dataset. Therefore, we need to cooperate with clinical hospitals to obtain clinical epilepsy data and verify the actual effect.

Conclusion
In this study, a domain generalization model based on feature separation and adversarial training is proposed for the case where there is a significant shift in the data distribution between patients in an epilepsy dataset. The model includes a feature extraction network, a feature separation network, and an invariant feature extraction network. The multi-level temporal-spectral feature extraction network extracts valuable features using convolutional operations and an attention mechanism. The feature separation network is used to improve feature discrimination. The invariant feature extraction network aligns the marginal distribution and the conditional distribution of the features to make them more discriminable and general. On the TUH dataset of 14 patients with leave-one-out cross-validation, the proposed method achieves the best result among the compared methods from the related literature; therefore, it can provide a reference for the clinical application of epilepsy detection.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ supplementary material.

Ethics statement
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.