Positional multi-length and mutual-attention network for epileptic seizure classification

The automatic classification of epilepsy electroencephalogram (EEG) signals plays a crucial role in diagnosing neurological diseases. Although promising results have been achieved by deep learning methods in this task, capturing the minute abnormal characteristics, contextual information, and long dependencies of EEG signals remains a challenge. To address this challenge, a positional multi-length and mutual-attention (PMM) network is proposed for the automatic classification of epilepsy EEG signals. The PMM network incorporates a positional feature encoding process that extracts minute abnormal characteristics from the EEG signal and utilizes a multi-length feature learning process with a hierarchical residual dilated LSTM (RDLSTM) to capture long contextual dependencies. Furthermore, a mutual-attention feature reinforcement process is employed to learn the global and relative feature dependencies and enhance the discriminative abilities of the network. To validate the effectiveness of the PMM network, we conduct extensive experiments on public datasets, and the experimental results demonstrate the superior performance of the PMM network compared to state-of-the-art methods.


Introduction
Epilepsy is a prevalent neurological disease worldwide, affecting individuals' cognitive abilities and presenting risks of sudden falls or fatality (Rajinikanth et al., 2022). To mitigate epilepsy risks, the analysis of electroencephalography (EEG) signals is the most effective approach to identify real-time neural disorder activity. However, EEG events often exhibit subtle amplitude variations, and manual detection of EEG signals is time-consuming, prone to errors, and requires specialized expertise. Thus, the significance of automatic EEG diagnosis lies in its capacity to analyze EEG signals with efficiency and accuracy, facilitating the timely detection of epilepsy (Xin et al., 2022; Wu et al., 2023).
Early studies on automatic EEG diagnosis focused on using hand-engineered low-level features, such as spectral, temporal, low-frequency, and high-frequency features, to achieve automatic classification of EEG signals (Liu et al., 2022). For instance, Lemm et al. (2005) applied spatio-spectral filters to mitigate the impact of noisy, non-stationary, and contaminated information in EEG signals, thereby improving classification performance. Meng et al. (2014) proposed a novel approach that learned spatial and spectral features and optimized the loss function by calculating the mutual information between the learned spectral features and class labels. Additionally, to assess the impact of different frequency sub-bands on EEG classification accuracy (Tsipouras, 2019), multiple sub-bands were combined as feature vectors. In another work, Qi et al. (2015) introduced regularized spatio-temporal filtering to classify EEG signals using supervised optimization algorithms. Jrad (2016) utilized high-frequency oscillations to extract relevant features, which were then input into a support vector machine for classifying different EEG events. Furthermore, Gao et al. (2019) developed a multiscale information analysis model that utilized high-frequency EEG oscillations to recognize emotional states. Despite the success of these approaches, the subjective selection of hand-engineered features typically requires domain knowledge and may not capture the full range of characteristics present in input EEG signals.
Recently, following the great success of convolutional neural networks (CNNs) across a broad array of medical image analysis tasks, a large body of work in this area has emerged (Chen et al., 2021, 2023). Compared with traditional hand-engineered methods, CNN-based ones have the advantage of extracting more complicated and discriminative characteristics from the medical image. For instance, Zheng and Lu (2015) adopted deep belief networks (DBNs) as the main detection architecture to train differential features for the automatic detection of seizures. Regarding temporal features, Kasabov and Capecci (2015) designed a spiking neural network architecture that extracted spatiotemporal features for detecting and interpreting EEG signals. Liu M. et al. (2020) used pre-trained CNN models to extract deep features and then adopted the Cartesian K-means algorithm to conduct semi-supervised learning on the EEG data. Furthermore, an unsupervised learning method (Chai et al., 2016), which combined an auto-encoder network with a subspace alignment solution in a unified framework, was developed for analyzing EEG data. Other works, such as Qiu et al. (2018) and Liu J. et al. (2020), utilized the sparse autoencoder with different classifiers to jointly detect the seizure signal. Moreover, to improve the performance of the classification model, Yuan et al. (2018a) proposed a multi-view CNN model that aimed to learn the brain seizure from input multi-channel signals. Similarly, Yuan et al. (2018b) further developed a novel channel-aware attention network for multi-channel EEG seizure detection using CNNs. Hossain et al. (2019) proposed a model that extracts spectral and temporal features and then inputs them to the classifier for EEG seizure classification. For learning multi-scale features from the EEG, Zhang et al. (2020) designed a multi-scale non-local (MNL) network with two special layers to achieve promising seizure classification results. Additionally, some other works (Aliyu and Lim, 2021; Hussain et al., 2021; Saichand, 2021) adopted the long short-term memory (LSTM) to overcome the vanishing gradient problem of the recurrent neural network and boost the feature extraction ability on EEG signal data.
Despite the promising results shown by CNN-based methods for EEG signal classification, three major challenges still need to be addressed. Firstly, seizures in EEG signals often exhibit subtle abnormal characteristics that can pose challenges for feature extraction, potentially impacting the performance of classification models. Secondly, the extraction of long contextual dependencies is crucial for effective EEG signal classification, but the use of LSTM for this purpose is impeded by limited receptive fields, which compromises its ability to capture necessary contextual information. Lastly, previous works have paid relatively little attention to the incorporation of global relative dependencies in EEG signal analysis, which could offer valuable discriminative information for improving classification accuracy. To tackle these challenges, we propose a novel approach called the positional multi-length and mutual-attention (PMM) network. The PMM network comprises three main processes: positional feature encoding, multi-length feature learning, and mutual-attention feature reinforcement. In the positional feature encoding process, minute abnormal characteristics from the shallow layers of the network are captured through residual positional attention. This helps the PMM network focus on and extract crucial information associated with those characteristics. The multi-length feature learning process employs a stack of hierarchical residual dilated (RD) LSTMs to acquire long contextual dependencies within the EEG signal. By doing so, the network becomes adept at capturing temporal patterns across various time scales and effectively modeling the relationships between distant time steps. To further strengthen the features, a mutual-attention feature reinforcement process is introduced. This process exploits both the global discriminative and relative dependencies present in the EEG signal. It selectively enhances informative features while suppressing irrelevant ones, thereby enhancing the overall discriminative power of the network. Incorporating these three processes into the PMM network enables it to capture minute abnormal characteristics, long contextual dependencies, and global discriminative and relative dependencies simultaneously, resulting in a significant improvement in classifying EEG signals. Overall, the main contributions of this paper can be summarized as follows:
(1) A novel PMM network is proposed for the automatic classification of epilepsy seizures from EEG signals, with the incorporation of positional feature encoding to improve the extraction of minute abnormal characteristics from the EEG signals.
(2) In the proposed multi-length feature learning process, hierarchical RDLSTMs are used to capture long contextual dependencies from the EEG signal. Additionally, mutual-attention feature reinforcement is employed to jointly explore global discriminative features and relative dependencies.
(3) Extensive experiments are conducted on publicly available datasets. The results of the comparative analysis demonstrate that competitive performance is achieved by our proposed PMM network when compared to other state-of-the-art methods.
The remainder of the paper is organized as follows. Section 2 provides an introduction to the main method used in our proposed network. In Section 3, we provide a detailed description of the experimental data utilized in our study. Section 4 covers the implementation details, evaluation metrics, and a series of experiments conducted to evaluate the performance of our proposed approach. Finally, in Section 6, we summarize the findings of our study and provide a conclusion.

FIGURE
The PMM network first pre-processes the input EEG signal. The processed signal then undergoes positional feature encoding, multi-length feature learning, and mutual-attention feature reinforcement. Finally, the reinforced features are processed in a dense layer with softmax activation to generate the predicted result.

FIGURE
Overview of the PMM network. The PFEBlocks capture minute abnormal features, while the stacked RDLSTMs capture long contextual dependencies from the processed EEG signal. Additionally, the mutual-attention feature reinforcement further enhances the network's capability by extracting global discriminative features and relative dependencies.

Method
As depicted in Figure 1, the input EEG signal undergoes an initial pre-processing step to obtain the processed signal. This processed signal is then fed into the positional feature encoding module, which extracts subtle abnormal characteristics from the shallow layers. Subsequently, multi-length feature learning and mutual-attention feature reinforcement are employed to enhance the classification of the PMM network. Finally, the reinforced feature is delivered to the dense layer with the softmax activation function to generate the predicted result. To provide a more specific overview, the main architecture of the proposed PMM network is illustrated in Figure 2.
Initially, the input processed signal undergoes a positional feature encoding process, employing multiple positional feature encoding blocks (PFEBlocks) to capture minute abnormal features. Next, a multi-length feature learning process utilizing stacked RDLSTMs is employed to capture long contextual dependencies from the EEG signal. Furthermore, the network includes a mutual-attention feature reinforcement mechanism, which enables the extraction of both global discriminative features and relative dependencies. In the following subsections, we provide more detailed descriptions of positional feature encoding, multi-length feature learning, and mutual-attention feature reinforcement.

Positional feature encoding
In the shallow layers of the network, the extracted feature map contains crucial details of the EEG signal that are vital for accurate EEG classification. Inspired by the structure of the residual block (Figure 3A), we incorporate a positional feature encoding block (Figure 3B) during feature encoding to automatically extract informative detail representations from the EEG signal. Considering the input of the positional feature encoding as $F_e$, it first passes through the 1D convolution layer, which can be defined as:
$$X_o(i) = \sum_{r=1}^{R} W(r)\, F_e(i + r - 1) + b \tag{1}$$
where $X_o$ is the output feature vector, $R$ denotes the receptive field, and $W$ and $b$ are the weight parameter and bias, respectively. After that, a randomized leaky rectified linear unit (RReLU) nonlinear activation function is employed, which is formulated as:
$$f(x) = \begin{cases} x, & x \geq 0 \\ a x, & x < 0 \end{cases} \tag{2}$$
where $x$ is the input value and $a$ represents a random number drawn from the uniform distribution $U(p, q)$, which is given as:
$$a \sim U(p, q), \quad p < q \ \text{and} \ p, q \in [0, 1) \tag{3}$$
Next, an extra 1D convolution layer is adopted to further refine the output feature from the RReLU activation function, resulting in $X'_o$. In contrast to the traditional residual block, the proposed positional feature encoding block applies a sigmoid operation on $F_e$ to obtain the position weight matrix $W_{pos}$. The formulation of $W_{pos}$ is defined as follows:
$$W_{pos} = \frac{e^{F_e}}{e^{F_e} + 1} \tag{4}$$
Afterward, we multiply $W_{pos}$ with $X'_o$ and then add the result to $F_e$ by a residual connection. Therefore, the final output feature map $F_o$ is formulated as:
$$F_o = W_{pos} \odot X'_o + F_e \tag{5}$$
Subsequently, the positional feature encoding block is followed by a 1D max-pooling layer, which downsamples the resolution of $F_o$, and the resulting feature map is then further refined and enhanced through multi-length feature learning.
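The block described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`conv1d_same`, `rrelu`, `pfe_block`), the single-channel signal, the zero padding, and the RReLU bounds of 1/8 and 1/3 are all hypothetical choices made for the example.

```python
import math
import random


def conv1d_same(signal, kernel, bias=0.0):
    """1D convolution with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + r] for r, w in enumerate(kernel)) + bias
        for i in range(len(signal))
    ]


def rrelu(values, lower=0.125, upper=1.0 / 3.0, rng=random):
    """Randomized leaky ReLU: negative inputs are scaled by a ~ U(lower, upper).
    The bounds are assumed values, not taken from the paper."""
    return [x if x >= 0 else rng.uniform(lower, upper) * x for x in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pfe_block(f_e, kernel1, kernel2, rng=random):
    """Positional feature encoding: F_o = sigmoid(F_e) * X'_o + F_e."""
    x_o = conv1d_same(f_e, kernel1)        # first 1D convolution
    x_o = rrelu(x_o, rng=rng)              # RReLU activation
    x_o_prime = conv1d_same(x_o, kernel2)  # second 1D convolution
    w_pos = [sigmoid(v) for v in f_e]      # position weights from the block input
    return [w * x + f for w, x, f in zip(w_pos, x_o_prime, f_e)]


# Identity kernels make the effect of the gating and the residual path easy to see.
f_e = [0.0, 1.0, -2.0, 0.5]
out = pfe_block(f_e, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], rng=random.Random(0))
```

Because the position weights are computed from the block input rather than from the convolved features, the gate depends only on where the input itself is large or small.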

Multi-length feature learning
To leverage the valuable information provided by the dependencies among multi-length features, the learned features obtained from the positional feature encoding are fed into the multi-length feature learning process. To be more specific, we denote the output features from the positional feature encoding process as $F_{pos}$. For learning dependencies of the input signal features, we use the LSTM as the primary feature extraction unit for learning high-level representations, as illustrated in Figure 4. Notably, considering the LSTM's shortcoming of vanishing and losing the information of the cell state (Schoene et al., 2020), we further add the residual dilation (Chang et al., 2017) to the LSTM for learning the multi-length and long sequence dependencies, as shown in Figure 5. Here, denote the cell state, hidden state, and input of the LSTM at time $t$ as $c_t$, $h_t$, $x_t$, respectively. Thus, the output of the block input $z_t$ is calculated as:
$$z_t = g(W_z x_t + R_z h_{t-1} + b_z) \tag{6}$$
where $W_z$, $R_z$, $b_z$ are the input weight, recurrent weight, and bias, respectively, and $g(\cdot)$ is the tanh activation function.

FIGURE
The structure of RDLSTM block.

FIGURE
The structure of the mutual-attention feature reinforcement.
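Then, the input gate, forget gate, and output gate, together with the cell state and hidden state, can be formulated as:
$$i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i) \tag{7}$$
$$f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f) \tag{8}$$
$$c_t = z_t \odot i_t + c_{t-1} \odot f_t \tag{9}$$
$$o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o) \tag{10}$$
$$h_t = g(c_t) \odot o_t \tag{11}$$
where $R_i$, $R_f$, $R_o$ are the recurrent weights of the respective gates. Denote the outputs of the block input, cell state, hidden state, input gate, forget gate, and output gate at the $l$-th layer by $z_t^{(l)}$, $c_t^{(l)}$, $h_t^{(l)}$, $i_t^{(l)}$, $f_t^{(l)}$, $o_t^{(l)}$, respectively. In the dilated LSTM, the states at time $t-1$ are replaced by those at time $t-d$, where $d$ is the dilation rate, and $h_t^{(l-1)}$, the hidden state at the $(l-1)$-th layer, is used to create a shortcut connection with the current LSTM cell to mitigate the issue of gradient vanishing. Additionally, hierarchically stacking the RDLSTMs enables the network to capture effective long dependencies among multi-length features across different layers. Therefore, dilation rates of 1, 2, and 4 are utilized to increase the receptive field of the network exponentially, enabling it to capture both local and global information. Following the processing in the RDLSTMs, the learned features are fed into the mutual-attention feature reinforcement process to extract global context information and further enhance the network's understanding of the input feature.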
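For concreteness, one plausible form of the residual dilated recurrence is sketched below, with the superscript $(l)$ indexing the layer and $d$ the dilation rate. It assumes that the input to layer $l$ is the hidden state of layer $l-1$, that the states at $t-1$ are replaced by those at $t-d$, and that the shortcut is added to the hidden output. These choices, and the recurrent weights $R^{(l)}$, are assumptions made for illustration and may differ from the exact formulation used in this work.

```latex
\begin{aligned}
z_t^{(l)} &= g\big(W_z^{(l)} h_t^{(l-1)} + R_z^{(l)} h_{t-d}^{(l)} + b_z^{(l)}\big) \\
i_t^{(l)} &= \sigma\big(W_i^{(l)} h_t^{(l-1)} + R_i^{(l)} h_{t-d}^{(l)} + b_i^{(l)}\big) \\
f_t^{(l)} &= \sigma\big(W_f^{(l)} h_t^{(l-1)} + R_f^{(l)} h_{t-d}^{(l)} + b_f^{(l)}\big) \\
c_t^{(l)} &= z_t^{(l)} \odot i_t^{(l)} + c_{t-d}^{(l)} \odot f_t^{(l)} \\
o_t^{(l)} &= \sigma\big(W_o^{(l)} h_t^{(l-1)} + R_o^{(l)} h_{t-d}^{(l)} + b_o^{(l)}\big) \\
h_t^{(l)} &= g\big(c_t^{(l)}\big) \odot o_t^{(l)} + h_t^{(l-1)}
\end{aligned}
```

Setting $d = 1$ and dropping the final $h_t^{(l-1)}$ term recovers an ordinary stacked LSTM layer, which is a quick way to check the sketch against the standard cell.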

Mutual-attention feature reinforcement
Previous research has demonstrated that attention-based learning is effective in encoding discriminative features and capturing global dependencies (Vaswani et al., 2017; Zhang et al., 2020). Building on these findings, we propose to enhance the feature learning capability by incorporating a mutual-attention feature reinforcement after the multi-length feature learning process (as shown in Figure 6). Formally, let the output feature maps of the RDLSTM with dilation rates $q$, $k$, $v$ be denoted as $F_q$, $F_k$, $F_v$ ($q \neq k \neq v$). In order to standardize the features to the same value domain, a batch normalization layer $N(\cdot)$ (Ioffe and Szegedy, 2015) is initially applied to $F_q$, $F_k$, $F_v$, yielding $N(F_q)$, $N(F_k)$, $N(F_v)$. Subsequently, these features are separately passed through three linear layers, each with its own weight matrix and bias (e.g., $b^{(value)}_v$ for the value), to obtain the corresponding query $Q$, key $K$, and value $V$, which are combined as inputs to the mutual-attention module. The mutual scaled attentions $A_{qk}$, $A_{qv}$, $A_{vk}$ are then calculated from $Q$, $K$, and $V$ using the transpose $\top$ and the scaling parameter $\sqrt{S}$, and the softmax function is applied to normalize the attention values into probability distributions:
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
Furthermore, to exploit more discriminative and global representations, we employ multi-head attention (Vaswani et al., 2017), which performs the mutual-attention operations $T$ times. The output of multi-head attention, $A^{(multi)}_d$, is obtained by concatenating the input features with $\mathrm{Concat}(\cdot)$ and linearly combining them with the weight matrices $W^{(multi)}_{qk}$, $W^{(multi)}_{qv}$, $W^{(multi)}_{vk}$. Finally, the output of the mutual-attention feature reinforcement is denoted as $F_{at}$.
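The scaled attention step can be illustrated with a short plain-Python sketch. It follows the standard scaled dot-product attention of Vaswani et al. (2017) for a single pair of inputs; how the three pairings are formed and merged in the mutual-attention module is not reproduced here, and the function name `scaled_attention` and the toy matrices are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(a, b, values):
    """Compute softmax(a b^T / sqrt(S)) and apply it to `values`.

    a, b: lists of S-dimensional rows; values: one row per row of b.
    """
    s = len(a[0])
    out = []
    for a_row in a:
        scores = [
            sum(x * y for x, y in zip(a_row, b_row)) / math.sqrt(s)
            for b_row in b
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Toy example: two time steps with two features each (assumed values).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
a_qk = scaled_attention(q, k, v)
```

Each output row is a convex combination of the rows of `v`, so attention re-weights existing features rather than creating values outside their range.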

Classification of EEG data
After the feature extraction process, the extracted features are passed into a softmax layer to generate prediction probabilities for different classes. Mathematically, the training data can be represented as $\{(s^{(1)}, y^{(1)}), (s^{(2)}, y^{(2)}), \cdots, (s^{(N)}, y^{(N)})\}$, where $s^{(N)} \in \mathbb{R}^{1 \times C}$ denotes the input features, $y^{(N)} \in \{1, 2, \cdots, C\}$ represents the class label, and $C$ is the total number of class labels. Therefore, the mapping function of softmax could be given as:
$$\hat{y}_c = P(y = c \mid s; \theta) = \frac{e^{s\,\theta_c}}{\sum_{j=1}^{C} e^{s\,\theta_j}}$$
where $\theta$ denotes the learned parameters of the softmax. To optimize the network, we use the cross-entropy $J(\cdot)$ as the loss function, which can be defined as:
$$J = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
where $y_c$ is the true class label and $\hat{y}_c$ denotes the predicted class label. Overall, the whole process of the proposed PMM network is illustrated in Algorithm 1.
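As a small worked example of the classification step, the sketch below computes softmax probabilities and the cross-entropy loss for one sample with a one-hot label. The logits and the function names are assumptions made for illustration; in practice a framework routine would normally be used instead.

```python
import math


def softmax(logits):
    """Map raw class scores to probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """J = -sum_c y_c * log(y_hat_c); only the true class contributes for a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# Hypothetical scores for a three-class task; the true class is the first one.
probs = softmax([2.0, 0.5, 0.1])
loss = cross_entropy(probs, [1, 0, 0])
```

The loss shrinks toward zero as the probability assigned to the true class approaches one, which is the quantity the optimizer pushes on during training.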

Data description
We evaluate our approach using the Bonn EEG dataset, initially reported in Andrzejak (2001). This dataset consists of five subsets: Set A, B, C, D, and E. Each subset contains 100 single-channel EEG segments, each with a duration of 23.6 seconds. Subsets A and B were collected from healthy subjects, with recordings taken during both eyes open and closed conditions. Subsets C, D, and E correspond to different locations in epileptic subjects. Subset C represents recordings from the hippocampal formation, Subset D records the epileptogenic zone, and Subset E captures signals during seizure activity. It is important to note that the signals in subsets C and D were recorded during seizure-free intervals, while subset E was captured during seizure activity. For simplicity, the eye movements in subsets A and B were not considered in our evaluation.
Moreover, the UCI-EEG Recognition dataset (Wu and Fokoue, 2017) is also used for the detection of epileptic seizures. It consists of five distinct groups, each consisting of 100 single-channel EEG signals. Each EEG file corresponds to a 23.6-s recording of brain activity, which is sampled into 4,097 data points. Therefore, the dataset comprises a total of 500 subjects, with each subject's data containing 4,097 data points. Additionally, the EEG samples in this dataset are further divided into 23 data chunks, with each chunk containing 178 data points. Overall, the dataset contains 11,500 time-series EEG signal data samples from the 500 subjects. In the EEG recognition dataset, Class 1 represents the state of epileptic seizure, while Classes 2-5 represent normal healthy states. This dataset facilitates a binary classification task aimed at distinguishing between the combined normal states (Classes 2-5) and the seizure condition (Class 1).
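The sample counts above follow from simple arithmetic, which the sketch below makes explicit. Note that 23 chunks of 178 points cover 4,094 of the 4,097 points; dropping the trailing points and using consecutive, non-overlapping chunks are assumptions of this example, and `chunk_signal` is a hypothetical helper name.

```python
def chunk_signal(signal, chunk_len=178, n_chunks=23):
    """Split one recording into n_chunks consecutive chunks of chunk_len points.

    Any points beyond n_chunks * chunk_len are dropped (an assumption of this sketch).
    """
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


recording = list(range(4097))             # stand-in for one 23.6-s recording
chunks = chunk_signal(recording)
total_samples = 500 * len(chunks)         # 500 recordings, 23 chunks each
leftover = len(recording) - 23 * 178      # points not covered by the chunks
```

Each chunk then serves as one sample, which is how 500 recordings yield 11,500 samples.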

Implementation details
The PMM network is implemented using the PyTorch deep learning framework, and cross-entropy is adopted as the loss function. The optimizer used is Adam, which helps in the convergence of the network. The initial learning rate is set at 0.0003 and is decayed by a factor of 0.001 after each epoch. To accelerate the training process, a GTX 1080 GPU is used. Additionally, 10-fold cross-validation is carried out to assess the performance of the model.

Evaluation metrics
The performance of the experiment is evaluated using several commonly used performance metrics, including accuracy, precision, sensitivity, specificity, and F1-score.
Accuracy is defined as the ratio of the number of correctly predicted samples to the total number of predicted samples:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision refers to the ratio of the number of correctly predicted positive samples to the total number of predicted positive samples:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Sensitivity measures the proportion of positives that are correctly identified:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
Specificity measures the proportion of negatives that are correctly identified:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
The F1-score is the harmonic mean of precision and sensitivity:
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$$
In the equations above, TP (true positives) represents the number of EEG data samples that are abnormal and correctly identified as abnormal. Similarly, TN (true negatives) represents the number of EEG data samples that are normal and correctly identified as normal. FP (false positives) refers to the number of normal EEG data samples that are incorrectly predicted as abnormal, and FN (false negatives) refers to the number of abnormal EEG data samples that are incorrectly predicted as normal.
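The five metrics can be computed directly from the four confusion counts, as in the sketch below. The counts are invented for illustration and are not results from this study, and the function does not guard against empty classes (zero denominators).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, sensitivity, specificity, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts: 45 abnormal and 55 normal samples.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

Reporting sensitivity and specificity alongside accuracy matters for imbalanced tasks such as ABCD-E, where a classifier that favors the majority class can still reach high accuracy.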
To ensure a comprehensive evaluation of the system, a 10-fold cross-validation approach is applied. During each iteration, one fold is used for testing the model, while the remaining nine folds are used for training. This process is repeated ten times, with each fold used as the test set once. The average values of accuracy, sensitivity, and specificity are then collected from the ten-fold cross-validation, providing an average performance measurement of the system across different categories of data.
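The fold rotation described above can be sketched as follows. The generator uses contiguous, unshuffled folds for simplicity; whether the data were shuffled or stratified by class before splitting is not stated, so that part is an assumption, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Every sample appears in exactly one test fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Example with 500 samples: ten folds of 50 test samples each.
folds = list(k_fold_indices(500, k=10))
```

The reported metrics would then be the mean over the ten test folds.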

The performance of double classes classification
In this section, the performance of double-class classification is evaluated on the Bonn dataset. Table 1 compares the performance of different double-class combinations, including A-E, B-E, C-E, D-E, AB-E, AC-E, AD-E, BC-E, BD-E, CD-E, ABC-E, ABD-E, BCD-E, and ABCD-E. Among these combinations, the highest performance is achieved in the A-E classification with an accuracy of 99.95%, while the most challenging classification task is D-E.

The performance of multiple classes classification
We further evaluate the performance of the proposed PMM network on multi-class classification using the Bonn dataset. We compare the class combinations A-C-E, A-D-E, B-C-E, B-D-E, AB-CD-E, and A-B-C-D-E separately and present the results in Table 2. The results clearly indicate that the B-D-E combination achieves the best performance, with an accuracy of 98.73%, sensitivity of 97.22%, specificity of 98.61%, precision of 97.32%, and F1-score of 97.52%. On the other hand, the A-B-C-D-E combination, consisting of five classes, shows the lowest performance. This illustrates how the difficulty of multi-class classification grows as the number of classes increases.

The effectiveness of different components
In this section, we conduct extensive experiments to validate the effectiveness of each proposed component in the A-B-C-D-E classification task using the Bonn dataset. We refer to the positional feature encoding block, multi-length feature learning, mutual-attention feature reinforcement, and residual dilation with LSTM as PFEBlock, MFL, MFR, and RDLSTM, respectively. The model without any proposed module is defined as "Original". Table 3 illustrates the results of these experiments. It can be observed that integrating any of the processes, i.e., PFEBlock, MFL, or MFR, leads to improved classification performance compared to RDLSTM. This confirms the effectiveness of each proposed process in enhancing the overall classification performance. Additionally, we find that adding the MFL yields the best performance, which further demonstrates that the dependencies among multi-length features are vitally important in this task.

Influence of different dilated rates
The choice of dilated rates in the dilated LSTM network plays a crucial role in achieving improved performance. We conducted tests to determine the best rates based on various metrics. The results, as shown in Table 4, indicate that larger dilated rates lead to enhanced performance when using a single dilated rate or two different rates. The higher dilated rates offer better performance because they allow the network to capture a broader range of information from the input data. By increasing the dilation rate, the network can expand its receptive field and consider a wider context, resulting in more accurate and informed predictions. Comparing the dilated rate sequences "1, 2, 4, 8" and "1, 2, 4", we find that the latter sequence achieves superior performance. This is because the rates "1, 2, 4" strike a balance between capturing local patterns and incorporating global relationships within the data. On the other hand, including the rate "8" in the first sequence potentially introduces noise or redundant information, which may degrade the model's performance.

Comparison with other classification methods
To evaluate the performance of the proposed network, we first compare it with other classification methods on the Bonn dataset, especially those based on CNNs, in various classification tasks. For consistency, we use fixed combinations of EEG classes, including ABCD-E, AB-CD-E, A-B-C-D-E, A-E, AC-E, C-E, A-D-E, D-E, A-D, B-E, and B-C-D. Tables 5 and 6 present the results of the comparison. It can be observed that our proposed method achieves competitive performance in most classification tasks when compared to other methods, particularly in the double-class classification task. This demonstrates the effectiveness of our proposed method in handling and accurately classifying EEG signals for various classification tasks. In addition, we

Discussion
From a technical perspective, our paper introduces the PMM network as a potential solution to address the challenges associated with learning minute abnormal characteristics and modeling long dependencies in EEG signals. We employ positional feature encoding to enhance the network's detection of subtle abnormalities, leveraging temporal position information. Additionally, our proposed multi-length feature learning enables the network to extract features at different scales, capturing short-term and long-term dependencies in the EEG signals. Moreover, incorporating the mutual-attention feature reinforcement mechanism enhances the network's ability to identify relevant spatial and temporal dependencies, allowing it to distinguish abnormal patterns from background activity more effectively. These advancements collectively contribute to the PMM network's potential in clinical applications and EEG signal analysis by providing a more comprehensive and accurate approach for capturing small abnormal characteristics, modeling long dependencies, and improving attention mechanisms.
From a clinical perspective, our proposed PMM network offers advancements in EEG signal analysis that have the potential to benefit clinical practice. It effectively captures minute abnormal characteristics often associated with neurological disorders, allowing for precise identification even within complex EEG patterns. By modeling long dependencies and incorporating multi-length feature learning, the network provides a comprehensive understanding of the underlying abnormal processes that evolve over time. The mutual-attention feature reinforcement mechanism further enhances specificity in detecting abnormal patterns, which is crucial for accurate diagnosis and informed decision-making in patient management. While these improvements hold promise for clinical practice, further research and evaluation are needed. Extensive testing with diverse datasets is necessary to validate the network's performance across the various EEG classification tasks encountered in real-world clinical settings. Such validation is crucial before the network's potential can be fully realized and integrated into routine clinical workflows.

Conclusion
In this paper, we introduced the PMM network to address the challenges related to learning minute abnormal characteristics and long dependencies in EEG signals. Our proposed approach effectively captures the minute abnormal characteristics through positional feature encoding and improves the modeling of long dependencies with multi-length feature learning and mutual-attention feature reinforcement. Experimental evaluations on publicly available datasets demonstrated that the PMM network achieves competitive performance compared to other state-of-the-art methods. One limitation of this study is that the proposed network was only evaluated on a limited set of data, which may not cover the full spectrum of EEG classification tasks. Therefore, in future work, we aim to extend our network to more diverse time-series datasets to further validate its effectiveness and generalizability. Moreover, while this approach proves beneficial for capturing subtle abnormal characteristics in EEG signals, it may be sensitive to variations in signal alignment and time-dependent patterns. Different EEG recording setups or variations in patient-specific factors could introduce spatial and temporal misalignments, potentially affecting the network's performance. Therefore, future research should focus on developing more robust techniques for positional feature encoding that can adapt to different recording setups and account for these variations, ensuring the network's stability and reliability across diverse EEG datasets. By addressing this limitation, we can enhance the network's applicability and strengthen its performance in real-world clinical scenarios.

FIGURE
The structure of residual block (A) and positional feature encoding block (B).

FIGURE
The structure of LSTM block.
Here, $W_i$, $W_f$, $W_o$ and $b_i$, $b_f$, $b_o$ are the corresponding input weights and biases of the LSTM, and $\sigma(\cdot)$ and $\odot$ represent the sigmoid function and element-wise multiplication, respectively. In the RDLSTM, instead of using the previous cell state $c_{t-1}$ and hidden state $h_{t-1}$, the cell takes in the cell state $c_{t-d}$ and hidden state $h_{t-d}$, where the dilation rate $d$ exponentially increases the receptive field of the LSTM. By incorporating these distant past states, the RDLSTM can capture long dependencies from the EEG signals, allowing for a more comprehensive understanding of the input sequence.

ALGORITHM Seizure Classification of PMM Network.
The prediction of EEG class label $\hat{y}_c \rightarrow \{1, 2, \cdots, C\}$;
1 Initialization: C ← number of classes; E ← number of epochs;
2 Initialization: P(·) Positional feature encoding; RDLSTM(·) Residual dilated LSTM; MA(·, ·) Mutual-attention feature reinforcement; Pool(·) Max-pooling operation; D(·)
TABLE The overall performance of double classes classification on Bonn dataset.
TABLE The overall performance of multiple classes classification on Bonn dataset.

TABLE The effectiveness of different components.
TABLE The influence of different dilated rates on Bonn dataset.
TABLE Comparison with other methods (b).
TABLE Comparison with other methods on UCI-EEG dataset.