CMTS-GNN: a cross-modal temporal-spectral graph neural network with cognitive network explainability

Wang, Yi; Meng, Lu; Fan, Yuying

doi:10.3389/fneur.2025.1700161

ORIGINAL RESEARCH article

Front. Neurol., 30 October 2025

Sec. Pediatric Neurology

Volume 16 - 2025 | https://doi.org/10.3389/fneur.2025.1700161

This article is part of the Research Topic: The Convergence of Cognitive Neuroscience and Artificial Intelligence: Unraveling the Mysteries of Emotion, Perception, and Human Cognition

Yi Wang¹

Lu Meng¹

Yuying Fan²*

¹School of Information Science and Engineering, Northeastern University, Shenyang, China
²Department of Pediatrics, Shengjing Hospital of China Medical University, Shenyang, China

Infantile spasms (IS) represent a severe form of epileptic encephalopathy occurring in early infancy. Timely and accurate detection is critical, as delays or misdiagnosis are associated with adverse neurodevelopmental outcomes that can impair perceptual, cognitive, and affective development. Conventional EEG analysis is often challenged by the complexity, heterogeneity, and large volume of IS data, rendering manual review both time-intensive and susceptible to inter-rater variability. To address these challenges, we introduce CMTS-GNN, a Cross-Modal Temporal-Spectral Graph Neural Network. This model integrates complementary information from temporal and spectral EEG representations through bidirectional cross-modal attention and gated fusion mechanisms. It further incorporates explicit modeling of brain-region connectivity to capture functional interactions that underlie perceptual processing, cognitive control, and affective dynamics. By doing so, CMTS-GNN aims to improve both detection accuracy and interpretability. We evaluated the proposed model on an in-house infantile spasms dataset and the publicly available CHB-MIT epilepsy dataset. Evaluation protocols included five-fold cross-validation and subject-independent schemes (leave-one-subject-out/leave-one-patient-out). On our in-house dataset, five-fold cross-validation resulted in an accuracy of 99.02%, precision of 98.96%, recall of 97.47%, F1-score of 98.20%, and AUC of 99.27%. For the CHB-MIT dataset, the same protocol yielded an accuracy of 98.54%, precision of 98.31%, recall of 98.71%, F1-score of 98.47%, and AUC of 98.87%, outperforming several recent approaches across most metrics. Subject-independent evaluations further confirmed the model's robustness and generalizability across different patients. Importantly, by modeling connectivity across brain regions, CMTS-GNN provides clinically meaningful explanations for its decisions, enhancing interpretability.
In summary, CMTS-GNN offers an accurate, generalizable, and interpretable framework for automated IS detection from EEG. It holds potential to support earlier clinical intervention, thereby helping to mitigate long-term perceptual, cognitive, and affective morbidity in affected infants.

1 Introduction

Infantile spasms (IS) represent a severe form of epileptic encephalopathy occurring in early infancy, characterized by stereotypical epileptic spasms, a highly disorganized electroencephalographic pattern known as hypsarrhythmia, and developmental stagnation or regression that may compromise perception, cognition, and affective development (1). IS is widely classified within the spectrum of West syndrome and exhibits marked clinical heterogeneity (2, 3). The global incidence is approximately 0.02% to 0.05%, with no significant sex differences. Most cases present between 4 and 9 months of age, with a peak onset around 6 months (4, 5). This developmental window coincides with critical periods for sensory–perceptual integration and early cognitive–affective maturation. The typical spasms present as brief, repetitive clusters, characterized by flexion or extension of the trunk, often accompanied by autonomic symptoms such as ocular deviation and alterations in respiratory rhythm. These events are more frequent or pronounced during wakefulness or transitional sleep states (6, 7). Due to the presence of atypical spasm manifestations in some cases, IS can be easily misdiagnosed as other infantile movement disorders, leading to delayed diagnosis and potentially irreversible neurodevelopmental impairment that affects perceptual, cognitive, and affective trajectories (8).

Electroencephalography (EEG) is a non-invasive technique that records neuronal electrophysiological activity via scalp electrodes. It serves as a critical tool in the clinical diagnosis of epilepsy and related encephalopathies, offering high temporal resolution and cost-effectiveness (9). In infantile spasms, EEG holds central diagnostic value. During ictal episodes, characteristic changes such as voltage attenuation and bursts of fast rhythms can be observed (10). Current clinical diagnosis relies on prolonged, synchronized video-EEG monitoring, which requires manual interpretation by trained clinicians to detect ictal events and abnormal discharge patterns (11). However, this approach faces three major challenges. First, EEG patterns associated with infantile spasms are highly heterogeneous: they vary significantly between individuals and also fluctuate over time within the same patient, reflecting complex spatiotemporal variability in epileptic discharges (12). Second, prolonged monitoring generates a large volume of data, making manual analysis time-consuming and labor-intensive, which results in low diagnostic efficiency (13). Third, EEG interpretation is highly dependent on clinician expertise, and inter-rater consistency among experts is limited, which hinders the standardization of diagnosis and treatment (14, 15). These challenges highlight the urgent need for automated detection technologies in the diagnosis and management of infantile spasms.

In recent years, deep learning-based end-to-end models have shown promising performance in the detection of epileptiform discharges, offering a feasible pathway for the automated recognition of EEG signals (16–18). Zhou et al. (19) developed a convolutional neural network (CNN) framework for automatic seizure detection, which processes raw EEG signals directly in the frequency domain without the need for manual feature extraction. Cao et al. (20) proposed a deep transfer learning-based feature fusion algorithm for multi-state epileptic EEG classification. The method constructs sub-band mean amplitude spectrum maps to characterize brain rhythm activity and leverages five ImageNet-pretrained deep neural networks (AlexNet, VGG19, Inception-v3, ResNet152, and Inception-ResNet-v2) to extract and fuse discriminative EEG features. In the study by Tsiouris et al. (21), a long short-term memory (LSTM) network was employed to extract temporal information from EEG segments for seizure detection. Further advancing this approach, Yao et al. (22) integrated an attention mechanism into the LSTM framework to enhance the model's ability to detect epileptic seizures. Recent studies have revealed specific patterns of correlation among neural signals originating from distinct brain regions (23). Brain networks are typically modeled as graph structures due to their inherently non-Euclidean nature. While traditional CNNs are well-suited for processing regular, Euclidean data such as images, they are limited in capturing the complex topological properties of brain connectivity. To more effectively leverage the spatial and structural information embedded in brain networks, graph neural networks (GNNs) have been introduced (24). Recent advances have adopted graph convolutional approaches, modeling EEG electrode channels as nodes within a topological graph, where edges denote functional or anatomical connections between electrodes.
This framework mitigates the constraints imposed by fixed convolutional kernels in conventional CNNs and enables the retention of more intricate structural characteristics embedded in EEG data (25–27). Meng et al. (14) proposed a method based on Graph Convolutional Networks (GCNs) to automatically identify Electrical Status Epilepticus during Sleep (ESES) from electroencephalogram (EEG) recordings. Their model preserves the intrinsic graph structure of EEG signals and leverages both time-domain and frequency-domain features, achieving higher accuracy and generalizability compared to traditional approaches such as template matching and conventional machine learning models. However, this method has certain limitations, particularly when dealing with dynamic temporal data. EEG signals exhibit not only spatial correlations across electrodes but also strong temporal dependencies. In conventional graph classification tasks, the topological structure and temporal dynamics of the graph may not be fully exploited simultaneously. In the context of EEG, the signal at each electrode is not only correlated with signals from other electrodes but also shows a clear dependency over time. If a GNN fails to account for this temporal dependency, critical information may be lost, potentially degrading classification performance.

Most existing methods primarily focus on a single modality, with limited consideration of the relationships among temporal, spatial, and frequency domains. Although current deep learning approaches are capable of capturing temporal dependencies, they often lack explicit modeling of interactions across different modalities. Our work addresses this limitation by introducing a multimodal attention mechanism to explicitly model the dependencies between temporal and frequency features, thereby bridging this gap. In addition, most previous studies only validated their models on a single dataset, raising concerns about generalizability. Furthermore, existing explainability analyses have mainly targeted adult epilepsy datasets, whereas our study systematically analyzes explainability specifically on infantile spasm datasets, providing valuable references for clinical translation. To address these challenges, we propose a novel Cross-Modal Temporal-Spectral Graph Neural Network (CMTS-GNN) that integrates both temporal and spectral information for spasm detection. The proposed model combines multi-scale temporal feature extraction, spectral-domain modeling, and a cross-modal attention mechanism to fully leverage the temporal, frequency, and spatial characteristics of EEG data. CMTS-GNN has been evaluated on both a proprietary dataset and a public benchmark dataset to validate its generalization ability. We employ five-fold cross-validation for comprehensive performance assessment and conduct subject-independent validation to ensure complete separation of patient data between the training and test sets, thereby preventing data leakage and overfitting. The main contributions of this work are summarized as follows:

• We propose CMTS-GNN, a cross-modal temporal-spectral graph neural network that integrates temporal and spectral EEG features via bidirectional attention and gated fusion, enabling comprehensive and robust modeling of spatio-temporal patterns for infantile spasm detection.

• The model explicitly divides EEG channels into five regions—frontal, central, parietal, occipital, and temporal lobes—based on the international 10–20 electrode system. Region-wise attention pooling is then employed to adaptively aggregate salient features within each brain region. This region-aware design significantly enhances the model's spatial specificity and interpretability in representing brain functional areas. Using attribution methods, we spatially visualize the basis of the model's decisions and observe that its focus closely aligns with the clinically recognized epileptogenic zones of infantile spasms. This further strengthens the model's explainability and medical credibility, laying a solid foundation for future clinical translation.

• The proposed model not only achieves state-of-the-art accuracy and robustness on the dedicated infantile spasm dataset but also demonstrates strong generalization performance in cross-domain transfer experiments on the public CHB-MIT epilepsy dataset. These results suggest that the framework presented in this study can efficiently detect infantile spasms as well as effectively recognize epileptic seizures, highlighting its significant potential for widespread clinical application.

The remainder of this paper is organized as follows. Section 2 provides a detailed description of the methods used in this study. Our experimental results are presented in Section 3. Finally, Section 4 concludes the study.

2 Materials and methods

2.1 Datasets

Dataset A. We evaluated the proposed method on two electroencephalogram (EEG) datasets. Dataset A was obtained from Shengjing Hospital of China Medical University and contains EEG recordings from 40 pediatric patients diagnosed with infantile spasms. All participants were younger than two years, and electrodes were positioned following the international 10-20 system. The cohort comprises 16 females and 24 males. Table 1 summarizes patient-level demographics and recording information.

Table 1

Table 1. Dataset A (infantile spasms): participant demographics and recording summary.

Dataset B (CHB-MIT). The CHB-MIT dataset (28) used in this study was collected at Boston Children's Hospital and consists of pediatric EEG from children with epilepsy. Signals were recorded with 23 scalp electrodes arranged according to the 10-20 standard, yielding 844 hours of continuous EEG. The database contains 198 seizure events. Recordings are available for 24 subjects in total, but patient 24 was excluded here because detailed metadata and channel information are missing for that subject, which was added in a later phase of collection. All EEG was sampled at 256 Hz, and seizure onset/offset times were manually annotated. Most recording files are about one hour in duration, although some for particular patients extend to two or four hours.

2.2 Data processing

Due to variations in the number of recording channels and sampling frequencies across datasets, a standardized preprocessing pipeline was applied. Specifically, 16 commonly used EEG channels (Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6) were selected, and all signals were resampled to a uniform frequency of 250 Hz. The EEG recordings for each patient were then segmented into 5-second epochs, and each segment was labeled by experienced neurologists. Given that EEG signals are often contaminated by power line interference, electromyographic (EMG) artifacts, and ocular movements during acquisition, a multi-stage filtering strategy was adopted for signal denoising. A bandpass filter ranging from 0.7 to 40 Hz was applied to suppress both power line noise and high-frequency artifacts, while preserving seizure-related features and minimizing information loss. This approach helps prevent the loss of critical ictal waveforms due to over-filtering. To address inter-subject variability in signal amplitude, a dynamic gain control mechanism was introduced. Specifically, an average reference was applied during the preprocessing stage to reduce common-mode interference. Subsequently, Z-score normalization was performed on each channel, ensuring that the mean and variance of the signals were standardized to zero and one, respectively. This normalization strategy not only improves model convergence during training but also enhances its generalization ability across heterogeneous datasets.
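The re-referencing, normalization, and epoching steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: it assumes the signal has already been band-pass filtered and resampled to 250 Hz, and it assumes the z-score is computed per channel over the whole recording (the text does not state the normalization window). The function name `preprocess` and the synthetic input are hypothetical.

```python
import numpy as np

def preprocess(eeg, fs=250, epoch_sec=5, eps=1e-8):
    """Re-reference, normalize, and epoch a (channels, samples) recording.

    Assumes band-pass filtering and resampling to `fs` were already applied.
    """
    # Average reference: subtract the cross-channel mean at every time point.
    eeg = eeg - eeg.mean(axis=0, keepdims=True)
    # Per-channel z-score (assumed here to span the whole recording).
    mu = eeg.mean(axis=1, keepdims=True)
    sd = eeg.std(axis=1, keepdims=True)
    eeg = (eeg - mu) / (sd + eps)
    # Non-overlapping epochs; a trailing partial epoch is dropped.
    n = fs * epoch_sec
    n_epochs = eeg.shape[1] // n
    trimmed = eeg[:, : n_epochs * n]
    return trimmed.reshape(eeg.shape[0], n_epochs, n).transpose(1, 0, 2)

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(16, 250 * 12))  # 12 s of synthetic data
epochs = preprocess(raw)
print(epochs.shape)  # (2, 16, 1250)
```

Each returned epoch has shape (16, 1250), matching the 16-channel, 5-second segments used in the rest of the method.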

2.3 Temporal graph construction

To simultaneously capture temporal dynamics and inter-channel dependencies within the temporal branch of CMTS-GNN, each 5-s EEG segment is represented as a temporal graph. We consider segments with C = 16 channels and T sampling points per segment. In this graph, nodes correspond to electrode channels; node features are the standardized time series of each channel; and edge weights quantify the strength of time-varying functional connectivity. The total number of nodes is 16.

Let the raw EEG matrix be X ∈ ℝ^{C×T}. We apply per-channel z-score standardization to obtain Z:

\begin{array}{c} Z_{i, t} = \frac{X_{i, t} - μ_{i}}{σ_{i} + ε}, & (1) \end{array}

where X_{i,t} is the amplitude of channel i at time t; Z_{i,t} is the standardized amplitude; μ_i and σ_i denote the mean and standard deviation of channel i; and ε > 0 is a stability constant. The vector z_i = (Z_{i,1}, …, Z_{i,T}) serves as the feature of node i.

To characterize time-varying inter-channel relations, we compute sliding-window Pearson correlations over Z. With window length L and step size S, the number of windows is K = ⌊(T − L)/S⌋ + 1. For the k-th window, let $z_{i}^{(k)}$ and $z_{j}^{(k)}$ denote the length-L subsequences of channels i and j. Their correlation is

\begin{array}{c} r_{i j}^{(k)} = \frac{cov (z_{i}^{(k)}, z_{j}^{(k)})}{σ_{i}^{(k)} σ_{j}^{(k)} + ε}, & (2) \end{array}

where cov(·, ·) is the sample covariance and $σ_{i}^{(k)}, σ_{j}^{(k)}$ are the sample standard deviations of the corresponding subsequences.

Based on these dynamic correlations, we construct a fully connected, undirected, weighted graph without self-loops 𝒢 = (𝒱, ℰ, W). For each unordered pair {i, j} with i ≠ j, the edge weight is defined as the average correlation across windows:

\begin{array}{c} w_{i j} = \frac{1}{K} \sum_{k = 1}^{K} r_{i j}^{(k)}, \quad \text{with } w_{i j} = w_{j i}, \; w_{i i} = 0 . & (3) \end{array}

Equivalently, ℰ = {{i, j} ∣ 1 ≤ i < j ≤ C} and the adjacency (weight) matrix W = [w_ij] is symmetric.

Through this construction, the graph topology explicitly encodes cross-channel functional connectivity, while the node features preserve complete time-domain information. This representation enables CMTS-GNN to exploit complementary temporal and spatial cues in subsequent processing.
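Equations 1 to 3 can be read as the following NumPy sketch. The window length `L=250` and step `S=125` are illustrative assumptions (the text does not report them), and `temporal_graph` is a hypothetical name.

```python
import numpy as np

def temporal_graph(X, L=250, S=125, eps=1e-8):
    """X: (C, T) segment. Returns node features Z (C, T) and edge weights W (C, C)."""
    C, T = X.shape
    # Eq. (1): per-channel z-score standardization.
    Z = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)
    K = (T - L) // S + 1                      # number of sliding windows
    W = np.zeros((C, C))
    for k in range(K):
        win = Z[:, k * S : k * S + L]
        cov = np.cov(win)                     # sample covariance, rows are channels
        sd = np.sqrt(np.diag(cov))            # sample standard deviations
        W += cov / (np.outer(sd, sd) + eps)   # Eq. (2): windowed Pearson correlation
    W /= K                                    # Eq. (3): average over windows
    np.fill_diagonal(W, 0.0)                  # no self-loops
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 1250))
X[1] = 2.0 * X[0] + 1.0                       # a perfectly correlated channel pair
Z, W = temporal_graph(X)
```

With these settings a 1250-sample segment yields K = 9 windows, and the weight between the two linearly related channels is close to 1.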

2.4 Spectral graph construction

The proposed CMTS-GNN integrates temporal and frequency-domain information within a unified graph-based framework to comprehensively capture the temporal dynamics, spectral characteristics, and spatial dependencies of infantile spasms (IS) EEG signals. In the temporal branch, the raw time series of each EEG channel $x_{i} \in ℝ^{T}$ is processed by a multi-scale encoder composed of three parallel one-dimensional convolutional branches with kernel sizes k ∈ {100, 50, 25}:

\begin{array}{c} h_{i}^{(k)} = ReLU (BN (x_{i} * W_{k})), & (4) \end{array}

where W_k is the convolution kernel for scale k, BN(·) denotes batch normalization, and * represents the one-dimensional convolution operator. The outputs from all scales are concatenated along the channel dimension and passed through global average pooling to produce compact multi-scale temporal features:

\begin{array}{c} h_{i}^{\mathrm{temp}} = \mathrm{GAP} \left( \big\Vert_{k} \, h_{i}^{(k)} \right) . & (5) \end{array}

Both the temporal graph, constructed from dynamic functional connectivity, and the frequency-domain graph, constructed from the weighted phase lag index (wPLI), are processed using edge-conditioned graph convolution, in which edge attributes are transformed into learnable kernels for message passing:

\begin{array}{c} h_{i}' = σ \left( \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} W_{ϕ (e_{i j})} h_{j} \right), & (6) \end{array}

where 𝒩(i) denotes the neighbor set of node i, e_ij is the edge attribute (either a DFC or wPLI weight), ϕ(·) is an MLP that maps edge attributes to convolution kernels, and σ is the ReLU activation.
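The wPLI edge weights of the frequency-domain graph are not defined in this section. For orientation, the sketch below uses the commonly cited definition, the magnitude of the mean imaginary cross-spectrum divided by the mean magnitude of the imaginary cross-spectrum, estimated over sliding windows. The window length, step, Hann taper, frequency band, and averaging over the band are all assumptions, and `wpli_matrix` is a hypothetical name, so this should not be read as the authors' estimator.

```python
import numpy as np

def wpli_matrix(X, fs=250, win=250, step=125, band=(1.0, 40.0), eps=1e-12):
    """X: (C, T). Returns a symmetric (C, C) wPLI matrix with zero diagonal."""
    C, T = X.shape
    taper = np.hanning(win)
    starts = range(0, T - win + 1, step)
    # Fourier coefficients per window: (n_windows, C, n_freqs).
    F = np.stack([np.fft.rfft(X[:, s : s + win] * taper, axis=1) for s in starts])
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    F = F[:, :, (freqs >= band[0]) & (freqs <= band[1])]
    # Imaginary part of the cross-spectrum for every channel pair.
    im = np.imag(F[:, :, None, :] * np.conj(F[:, None, :, :]))  # (n_win, C, C, n_freq)
    num = np.abs(im.mean(axis=0))        # |E[Im S_ij]|
    den = np.abs(im).mean(axis=0)        # E[|Im S_ij|]
    w = (num / (den + eps)).mean(axis=-1)  # average over the chosen band
    np.fill_diagonal(w, 0.0)
    return w

fs = 250
t = np.arange(5 * fs) / fs
rng = np.random.default_rng(0)
a = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.normal(size=t.size)
b = np.sin(2 * np.pi * 10 * t - np.pi / 4) + 0.1 * rng.normal(size=t.size)
w = wpli_matrix(np.stack([a, b]), fs=fs, band=(8.5, 11.5))
```

Two 10 Hz sinusoids with a constant quarter-pi phase lag give a value close to 1 in the band around 10 Hz, whereas zero-lag coupling would contribute nothing to the imaginary part.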

Anatomical priors are incorporated by grouping EEG channels into R = 5 brain regions $\{V_{r}\}_{r = 1}^{R}$ (frontal, central, parietal, occipital, and temporal). Within each region, features are aggregated via attention pooling:

\begin{array}{c} g_{r} = \sum_{i \in V_{r}} α_{i}^{(r)} h_{i}, α_{i}^{(r)} = \frac{exp (w_{r}^{⊤} h_{i})}{\sum_{j \in V_{r}} exp (w_{r}^{⊤} h_{j})}, & (7) \end{array}

where w_r is a learnable vector for region r.
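A minimal NumPy sketch of the region-wise attention pooling in Equation 7 is given below. The channel-to-region assignment is an assumption made for illustration (in particular, placing F7 and F8 in the frontal group), since the exact mapping is not listed in the text; `region_attention_pool` is a hypothetical name and the weights are random rather than learned.

```python
import numpy as np

CHANNELS = ["Fp1", "Fp2", "F3", "F4", "C3", "C4", "P3", "P4",
            "O1", "O2", "F7", "F8", "T3", "T4", "T5", "T6"]
# Hypothetical grouping; the paper does not list the exact assignment.
REGIONS = {
    "frontal": ["Fp1", "Fp2", "F3", "F4", "F7", "F8"],
    "central": ["C3", "C4"],
    "parietal": ["P3", "P4"],
    "occipital": ["O1", "O2"],
    "temporal": ["T3", "T4", "T5", "T6"],
}

def region_attention_pool(H, w):
    """H: (C, D) node features; w: (R, D), one scoring vector per region."""
    pooled, weights = [], {}
    for r, (name, chans) in enumerate(REGIONS.items()):
        idx = [CHANNELS.index(c) for c in chans]
        scores = H[idx] @ w[r]
        scores = scores - scores.max()          # numerical stability
        alpha = np.exp(scores) / np.exp(scores).sum()
        pooled.append(alpha @ H[idx])           # g_r = sum_i alpha_i h_i
        weights[name] = alpha
    return np.stack(pooled), weights

rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))
w = rng.normal(size=(5, 8))
G, alphas = region_attention_pool(H, w)         # G has shape (5, 8)
```

The per-region weights `alphas` are the quantities one would inspect when asking which electrodes dominate a region's representation.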

Cross-modal interaction is enabled by a bidirectional multi-head attention mechanism, allowing temporal features to attend to spectral features and vice versa, based on the scaled dot-product attention:

\begin{array}{c} Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{h}}}) V, & (8) \end{array}

where d_h is the per-head dimensionality.

The raw and cross-enhanced features are then fused through a gated mechanism:

\begin{array}{c} u = σ (W_{g} [h^{raw}; h^{enh}]), h^{fused} = u ⊙ h^{enh} + (1 - u) ⊙ h^{raw}, & (9) \end{array}

where ⊙ denotes element-wise multiplication and σ is the sigmoid function.

Finally, the fused temporal and spectral regional features are concatenated, flattened, and passed through a fully connected classifier to produce the final prediction. This end-to-end architecture allows CMTS-GNN to jointly exploit temporal, spectral, and spatial cues for robust automated detection of infantile spasms.

2.5 CMTS-GNN overview

Infantile spasm EEG signals are characterized by substantial heterogeneity and rapidly shifting spatiotemporal patterns, making them difficult to model with conventional sequence-based approaches. Such models often struggle to capture the non-Euclidean topology inherent to EEG channel arrangements while also integrating complementary cues from temporal and spectral domains. To overcome these challenges, the proposed CMTS-GNN unifies three key operations into a single pipeline: it first extracts temporal features at multiple scales, then performs edge-aware graph reasoning enriched with brain-region-wise pooling, and finally applies bidirectional cross-modal attention coupled with gated fusion. The resulting architecture simultaneously models waveform dynamics, frequency rhythms, and inter-channel connectivity, delivering a cohesive and clinically relevant framework for EEG analysis. The network architecture of CMTS-GNN is shown in Figure 1.

Figure 1

Diagram showing a process for detecting spasms using patient data. It includes steps for temporal and frequency domain feature extraction, brain region-wise attention pooling, and classification into non-spasms and spasms. Visual elements include graphs, matrices, and labeled arrows indicating the flow from raw data to final predictions.

Figure 1. The overall architecture of CMTS-GNN. The CMTS-GNN model is designed to classify EEG segments as spasm or non-spasm events by leveraging both temporal and spectral characteristics of EEG data. Raw EEG signals are processed in parallel through temporal and spectral branches. Within each branch, attention-based pooling aggregates features across anatomically grouped brain regions, generating region-wise temporal and spectral feature maps. A bidirectional cross-modal attention module is then applied to enable effective interaction between temporal and spectral representations, enhancing the features based on the complementary information from both modalities. Subsequently, the attention-refined features are adaptively integrated with the original representations through gated fusion blocks, where learnable sigmoid gates dynamically control the contribution of each modality. The resulting fused representation encodes rich and complementary spatiotemporal information, which is ultimately fed into a classifier for final decision-making between non-spasm and spasm events.

2.5.1 Multi-scale temporal feature extraction

Given a time-domain EEG segment X^{(t)} ∈ ℝ^{N×T} with N channels and T samples per channel, three parallel 1-D convolution branches with kernel sizes {100, 50, 25} are applied to capture long-range, medium-range, and short-range temporal dependencies. Each branch consists of a convolution layer, batch normalization, and ReLU activation:

\begin{array}{c} H_{k} = \mathrm{ReLU} (\mathrm{BN} (\mathrm{Conv1D}_{k} (X^{(t)}))), \quad k \in \{100, 50, 25\} . & (10) \end{array}

Global average pooling over the temporal dimension produces compact channel-wise descriptors. If the sequence length after convolution is L_k:

\begin{array}{c} {\bar{H}}_{k} (i, :) = \frac{1}{L_{k}} \sum_{t = 1}^{L_{k}} H_{k} (i, t, :), i = 1, \dots, N . & (11) \end{array}

The pooled features from all branches are concatenated and linearly transformed into a shared hidden space of width D:

\begin{array}{c} H_{ms}^{(t)} = ϕ ([{\bar{H}}_{100} ∥ {\bar{H}}_{50} ∥ {\bar{H}}_{25}]), ϕ (\cdot) = W_{f} (\cdot) + b_{f}, & (12) \end{array}

where $H_{ms}^{(t)} \in ℝ^{N \times D}$ . This step generates scale-robust temporal embeddings that retain both transient bursts and contextual information.
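The three-branch encoder of Equations 10 to 12 can be sketched in NumPy as follows. Several details are assumptions because the text does not give them: the number of filters per branch `F`, stride 1 with no padding, and a batch-norm-like standardization without learned scale and shift. Weights are random, and `conv_branch` and `multi_scale_features` are hypothetical names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_branch(X, kernels, eps=1e-5):
    """X: (N, T) channels; kernels: (F, k). Returns pooled features (N, F)."""
    k = kernels.shape[1]
    windows = sliding_window_view(X, k, axis=1)   # (N, L_k, k), L_k = T - k + 1
    H = windows @ kernels.T                       # valid 1-D convolution, (N, L_k, F)
    # Batch-norm-like standardization per filter (no learned parameters).
    H = (H - H.mean(axis=(0, 1))) / np.sqrt(H.var(axis=(0, 1)) + eps)
    H = np.maximum(H, 0.0)                        # ReLU, Eq. (10)
    return H.mean(axis=1)                         # global average pooling, Eq. (11)

def multi_scale_features(X, params, W_f, b_f):
    pooled = [conv_branch(X, params[k]) for k in (100, 50, 25)]
    return np.concatenate(pooled, axis=1) @ W_f + b_f   # Eq. (12)

rng = np.random.default_rng(0)
N, T, F, D = 16, 1250, 4, 32
params = {k: rng.normal(scale=k ** -0.5, size=(F, k)) for k in (100, 50, 25)}
W_f = rng.normal(scale=0.1, size=(3 * F, D))
b_f = np.zeros(D)
X_t = rng.normal(size=(N, T))
H_ms = multi_scale_features(X_t, params, W_f, b_f)     # shape (16, 32)
```

At 250 Hz the three kernel sizes span 0.4 s, 0.2 s, and 0.1 s, which is one way to read the long-range, medium-range, and short-range description above.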

2.5.2 Edge-aware graph encoding and brain-region pooling

In both temporal and spectral streams, EEG channels are modeled as graph nodes, with edges encoding functional connectivity derived from DFC or wPLI. Node features are projected to a common width D:

\begin{array}{c} {\hat{X}}^{(t)} = W_{p}^{(t)} H_{ms}^{(t)} + b_{p}^{(t)}, {\hat{X}}^{(s)} = W_{p}^{(s)} X_{in}^{(s)} + b_{p}^{(s)} . & (13) \end{array}

For an edge (i, j) with scalar attribute $a_{i j}^{(m)}$ in modality m ∈ {t, s}, an MLP outputs an edge-specific kernel:

\begin{array}{c} W_{i j}^{(m)} = \mathrm{reshape} (\mathrm{MLP}^{(m)} (a_{i j}^{(m)})) \in ℝ^{D \times D} . & (14) \end{array}

Node features are updated via mean aggregation over neighbors, followed by ReLU and LayerNorm:

\begin{array}{c} Z_{i}^{(m)} = \mathrm{LN} \left( \mathrm{ReLU} \left( \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} W_{i j}^{(m)} {\hat{X}}_{j}^{(m)} \right) \right) . & (15) \end{array}
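Equations 14 and 15 describe an edge-conditioned layer; a minimal NumPy sketch follows. The two-layer MLP with hidden width 16, the LayerNorm without affine parameters, and the random weights are assumptions for illustration, and `edge_mlp` and `ecc_layer` are hypothetical names.

```python
import numpy as np

def edge_mlp(a, D, W1, b1, W2, b2):
    """Map scalar edge attributes a (C, C) to kernels of shape (C, C, D, D)."""
    h = np.maximum(a[..., None] * W1 + b1, 0.0)      # (C, C, hidden)
    return (h @ W2 + b2).reshape(*a.shape, D, D)     # Eq. (14)

def ecc_layer(X, A, params, eps=1e-5):
    """X: (C, D) node features; A: (C, C) edge weights (DFC or wPLI)."""
    C, D = X.shape
    kernels = edge_mlp(A, D, *params)
    mask = ~np.eye(C, dtype=bool)                    # fully connected, no self-loops
    msgs = np.einsum("ijde,je->ijd", kernels, X)     # W_ij x_j for every pair
    agg = (msgs * mask[..., None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
    out = np.maximum(agg, 0.0)                       # ReLU
    mu = out.mean(axis=1, keepdims=True)
    var = out.var(axis=1, keepdims=True)
    return (out - mu) / np.sqrt(var + eps)           # LayerNorm, Eq. (15)

rng = np.random.default_rng(0)
C, D, hidden = 16, 8, 16
X_hat = rng.normal(size=(C, D))
A = rng.uniform(-1.0, 1.0, size=(C, C))
A = (A + A.T) / 2.0
params = (rng.normal(size=hidden), np.zeros(hidden),
          rng.normal(scale=0.1, size=(hidden, D * D)), np.zeros(D * D))
Z = ecc_layer(X_hat, A, params)                      # shape (16, 8)
```

Because every edge gets its own D × D kernel, the cost grows with C² · D², which stays small for a 16-node graph.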

Channels are grouped into five anatomical regions (frontal, central, parietal, occipital, temporal) based on the 10-20 system. Within each region r, attention pooling computes:

\begin{array}{c} s_{i}^{(m, r)} = {(w_{r}^{(m)})}^{⊤} Z_{i}^{(m)}, α_{i}^{(m, r)} = \frac{e^{s_{i}^{(m, r)}}}{\sum_{j \in R_{r}} e^{s_{j}^{(m, r)}}}, \\ u_{r}^{(m)} = \sum_{i \in R_{r}} α_{i}^{(m, r)} Z_{i}^{(m)} . & (16) \end{array}

Stacking R = 5 regions yields U^{(m)} ∈ ℝ^{R×D}.

2.5.3 Cross-modal interaction and gated fusion

At the region level, temporal and spectral matrices interact via bidirectional multi-head cross-attention. For the temporal to spectral direction:

\begin{array}{c} \tilde{U}^{(t)} = \left[ \mathrm{Concat}_{h = 1}^{H} \, \mathrm{softmax} \left( \frac{Q^{(h)} K^{(h) ⊤}}{\sqrt{d_{h}}} \right) V^{(h)} \right] W_{O}, & (17) \end{array}

with $Q^{(h)} = U^{(t)} W_{Q}^{(h)}, K^{(h)} = U^{(s)} W_{K}^{(h)}, V^{(h)} = U^{(s)} W_{V}^{(h)}$ . The spectral → time direction is analogous.
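The bidirectional cross-attention of Equation 17 can be sketched in NumPy as below. The width D = 16, the four heads, and the random projection matrices are illustrative assumptions, `cross_attention` is a hypothetical name, and the per-head projections are stored as column blocks of a single D × D matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(U_q, U_kv, Wq, Wk, Wv, Wo, n_heads):
    """Queries come from one modality, keys and values from the other."""
    R, D = U_q.shape
    d_h = D // n_heads
    split = lambda M: M.reshape(M.shape[0], n_heads, d_h).transpose(1, 0, 2)
    Q, K, V = split(U_q @ Wq), split(U_kv @ Wk), split(U_kv @ Wv)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h))   # (heads, R, R)
    heads = (A @ V).transpose(1, 0, 2).reshape(R, D)       # concatenate heads
    return heads @ Wo, A

rng = np.random.default_rng(0)
R, D, n_heads = 5, 16, 4
U_t, U_s = rng.normal(size=(R, D)), rng.normal(size=(R, D))
make = lambda: [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(4)]
U_t_enh, A_ts = cross_attention(U_t, U_s, *make(), n_heads=n_heads)  # temporal attends to spectral
U_s_enh, A_st = cross_attention(U_s, U_t, *make(), n_heads=n_heads)  # spectral attends to temporal
```

Each row of `A_ts[h]` is a distribution over the five spectral regions for one temporal region, so the attention maps are small enough to inspect directly.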

Gated fusion adaptively combines original and enhanced features:

\begin{array}{c} \hat{u}_{r}^{(t)} = g_{t}^{(r)} ⊙ \tilde{u}_{r}^{(t)} + (1 - g_{t}^{(r)}) ⊙ u_{r}^{(t)}, & (18) \end{array}

\begin{array}{c} \hat{u}_{r}^{(s)} = g_{s}^{(r)} ⊙ \tilde{u}_{r}^{(s)} + (1 - g_{s}^{(r)}) ⊙ u_{r}^{(s)}, & (19) \end{array}

where $g_{t}^{(r)}$ and $g_{s}^{(r)}$ are sigmoid gates computed from the concatenation of the original and attention-enhanced features of the corresponding modality.
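As a small illustration of the gating in Equations 18 and 19, the NumPy sketch below fuses one modality's original and enhanced regional features. The linear gate with weight `W_g` and bias `b_g` follows Equation 9; the random values and the name `gated_fusion` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(u_raw, u_enh, W_g, b_g):
    """Element-wise convex combination of original and enhanced features."""
    gate = sigmoid(np.concatenate([u_raw, u_enh], axis=-1) @ W_g + b_g)
    return gate * u_enh + (1.0 - gate) * u_raw, gate

rng = np.random.default_rng(0)
R, D = 5, 16
u_raw, u_enh = rng.normal(size=(R, D)), rng.normal(size=(R, D))
W_g, b_g = rng.normal(scale=0.1, size=(2 * D, D)), np.zeros(D)
fused, gate = gated_fusion(u_raw, u_enh, W_g, b_g)
```

Since the gate lies in (0, 1), every fused entry falls between the original and enhanced values, so the model can fall back to the unimodal feature wherever cross-modal attention is unhelpful.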

Fused features from all regions are concatenated, flattened, and passed to a two-layer fully connected classifier with dropout:

\begin{array}{c} y = W_{2} Dropout (ReLU (W_{1} f + b_{1})) + b_{2}, \hat{p} = σ (y) . & (20) \end{array}

This sequential design (multi-scale temporal encoding, graph reasoning with anatomical priors, cross-modal alignment, and gated fusion) produces robust, interpretable segment-level predictions.

3 Experiments and results

3.1 Experimental environment

Our method is implemented in PyTorch and trained on an Ubuntu server equipped with an Intel® Xeon® Gold 6133 CPU (2.50 GHz) and an NVIDIA GeForce RTX 3090 Ti GPU. The Adam optimizer is used with a learning rate of 0.01. The network is trained with a batch size of 32 for 150 epochs. Because the classes are imbalanced, the focal loss proposed by Lin et al. (29) is used as the training criterion.
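For reference, the binary focal loss can be sketched in NumPy as follows. The values `alpha=0.25` and `gamma=2.0` are the defaults suggested in the original focal loss paper and are assumptions here, since the settings used in this study are not reported.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """p: predicted spasm probabilities; y: binary labels (1 = spasm)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

p = np.array([0.95, 0.60, 0.10, 0.40])
y = np.array([1, 1, 0, 0])
loss = focal_loss(p, y)
```

The factor (1 − p_t)^γ shrinks the contribution of well-classified segments, so training concentrates on hard examples; with γ = 0 and α = 0.5 the expression reduces to half the usual binary cross-entropy.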

3.2 Evaluation metrics

\begin{array}{c} Accuracy = \frac{T P + T N}{T P + T N + F P + F N} & (21) \end{array}

\begin{array}{c} Recall = \frac{T P}{T P + F N} & (22) \end{array}

\begin{array}{c} Precision = \frac{T P}{T P + F P} & (23) \end{array}

\begin{array}{c} Specificity = \frac{T N}{T N + F P} & (24) \end{array}

\begin{array}{c} AUC = \frac{\sum_{i \in P} r_{i} - \frac{P (P + 1)}{2}}{P N}, r_{i} = rank (s_{i}), & (25) \end{array}

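Here, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. In Equation 25, s_i is the predicted score of sample i, r_i is its rank among all samples sorted in ascending order of score, the sum runs over the positive samples, and P and N are the numbers of positive and negative samples.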
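The metrics above can be computed as in the following NumPy sketch. The F1-score, which is reported in the results but not listed among the equations, is included using its standard definition; tied scores receive average ranks, which is an assumption about how ties are handled. Function names and the toy counts are hypothetical.

```python
import numpy as np

def confusion_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. (21)
        "recall": recall,                              # Eq. (22)
        "precision": precision,                        # Eq. (23)
        "specificity": tn / (tn + fp),                 # Eq. (24)
        "f1": 2 * precision * recall / (precision + recall),
    }

def average_ranks(s):
    """Ascending ranks starting at 1; tied scores share their mean rank."""
    s = np.asarray(s, dtype=float)
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    _, inv, counts = np.unique(s, return_inverse=True, return_counts=True)
    return np.bincount(inv, weights=ranks)[inv] / counts[inv]

def rank_auc(y, s):
    """Eq. (25): rank-sum (Mann-Whitney) form of the AUC."""
    y = np.asarray(y)
    r = average_ranks(s)
    P, N = int((y == 1).sum()), int((y == 0).sum())
    return (r[y == 1].sum() - P * (P + 1) / 2) / (P * N)

m = confusion_metrics(tp=8, tn=85, fp=2, fn=5)
auc = rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

In the toy example the positives hold ranks 2 and 4, so the rank sum is 6 and the AUC is (6 − 3) / 4 = 0.75.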

3.3 Comparative experiment

To provide a comprehensive evaluation of our proposed CMTS-GNN model, we reproduced several representative state-of-the-art methods and conducted a unified performance comparison on Dataset A using 5-fold cross-validation. While the official implementations of some models were not publicly available, we carefully replicated the architectures and training procedures based on the original papers to ensure high fidelity. The experimental results are summarized in Table 2 and further illustrated through the confusion matrices (Figure 2).

Table 2

Table 2. Performance comparison between the proposed method and state-of-the-art methods using 5-fold cross-validation on dataset A.

Figure 2

Eight confusion matrices compare different methods for spasm detection. Each matrix shows row-normalized percentages of correctly and incorrectly classified spasm and non-spasm EEG segments. Accuracy and error rates vary among the methods presented by Md. Nurul Ahad Tawhid, Xiashuang Wang, Sergi Abadala, Saravanan Srinivasan, Wenna Chen, Hui Huang, Weidong Dang, and the proposed method. The proposed method exhibits the highest performance, with a spasm detection accuracy of 97.47% and a non-spasm accuracy of 99.62%.

Figure 2. Comparison of confusion matrices between the proposed method and state-of-the-art methods on dataset A. The colorbar indicates the percentage, which is row-normalized.

Hybrid architectures such as the ConvLSTM-based model proposed by Md. Nurul Ahad Tawhid et al. (30) and the CNN-LSTM framework by Xiashuang Wang et al. (31) combine convolutional and recurrent layers to capture spatiotemporal dependencies in EEG signals. When evaluated on our dataset, the ConvLSTM model achieved an accuracy of 87.84% and a recall of 80.22%, but showed limited precision at 75.89%, resulting in an F1-score of 77.86% and an AUC of 89.63%. CNN-LSTM improved the overall accuracy to 88.78% and precision to 81.44%, though its recall declined to 75.61%, indicating reduced sensitivity to spasm events. The confusion matrices for both models reveal a noticeable presence of off-diagonal elements, reflecting misclassifications likely caused by the domain shift from adult to infant EEG. Sergi Abadala et al. (32) proposed a Graph Transformer Network (GTN) designed to model inter-channel dependencies in EEG data. Although it achieved a precision of 88.38% in our experiments, the recall was only 73.09%, suggesting under-detection of spasm episodes. Likewise, the hybrid 3D-Denoising Convolutional Autoencoder (3D-DCAE) + Bi-LSTM model by Srinivasan et al. (33) exhibited the weakest performance among all compared models, with a recall of just 54.70% and an F1-score of 66.07%, indicating limited generalizability to infantile EEG patterns.

Models incorporating attention mechanisms and multi-level feature fusion have shown relatively better adaptability to our dataset. The 1D-CNN with attention-based feature fusion, proposed by Wenna Chen et al. (34), achieved strong performance, with 95.55% accuracy, 91.65% precision, 92.12% recall, and an F1-score of 91.74%, and its confusion matrix showed minimal off-diagonal misclassifications. Similarly, the multiband 3D-CNN with attention mechanisms by Hui Huang et al. (35) yielded competitive performance. The Multi-branch Deep Convolutional Neural Network (MDCNN) proposed by Weidong Dang et al. (36) achieved 96.12% accuracy, 93.65% precision, 92.14% recall, and an F1-score of 92.77%, highlighting the advantages of deeper convolutional structures in capturing EEG dynamics.

In comparison, the proposed CMTS-GNN model achieved state-of-the-art results, with an accuracy of 99.02%, precision of 98.96%, recall of 97.47%, F1-score of 98.20%, and AUC of 99.27%. CMTS-GNN encloses the largest area across all metrics, signifying superior balance between sensitivity and specificity. Moreover, the confusion matrix of CMTS-GNN shows near-perfect classification, with negligible false positives and false negatives, in contrast to the scattered misclassifications observed in other methods.

These results demonstrate that the integration of multiscale temporal encoding, edge-aware graph modeling, cross-modal attention, and brain region-wise pooling enables CMTS-GNN to effectively capture complex spatiotemporal-frequency dependencies in EEG data. Consequently, our method not only surpasses the compared approaches in classification performance but also achieves a balanced and reliable detection of infantile spasms.

3.4 Ablation experiments

To verify the contribution of each module in our model, we designed several variant models. First, we use cross-modal fusion between temporal and frequency domains as the baseline model, and then progressively integrate additional modules to form the complete model. The specific configurations are as follows:

• Variant A (Cross-modal fusion): We use a network that performs cross-modal fusion between temporal-domain and frequency-domain graphs as the baseline model.

• Variant B (+ Multi-Head Attention): Based on the cross-modal fusion, we add a multi-head attention mechanism.

• Variant C (+ Brain Region-Wise Attention Pooling): We enhance the cross-modal fusion model by introducing Brain Region-Wise Attention Pooling.

• Variant D (+ Multi-Head Attention + Brain Region-Wise Attention Pooling): We incorporate both Multi-Head Attention and Brain Region-Wise Attention Pooling into the cross-modal fusion framework (our proposed model CMTS-GNN).

Table 3 summarizes the results of the four variants. The baseline model using only cross-modal fusion between temporal and spectral features (Variant A) yielded the lowest performance across all metrics, with an accuracy of 76.86% and an F1-score of 52.65%. Adding the multi-head attention mechanism (Variant B) markedly improved performance, raising the F1-score to 89.72% and highlighting its effectiveness in modeling inter-modal dependencies. Adding Brain Region-Wise Attention Pooling to the baseline instead (Variant C) also led to substantial improvements over Variant A across all evaluation metrics, with a notable increase in precision (96.62%) and specificity (98.78%), indicating the benefit of anatomical priors in feature aggregation. Finally, the full model (Variant D), integrating both multi-head attention and brain-region-wise pooling, achieved the highest performance with an accuracy of 99.02%, F1-score of 98.20%, and specificity of 99.05%. These results demonstrate that each module contributes incrementally and synergistically to the overall performance, validating the design of the CMTS-GNN architecture.

Table 3. The comparison of experimental results from ablation experiments.

3.5 Leave-one-patient-out cross-validation on dataset A

To rigorously evaluate the generalizability of the proposed CMTS-GNN model across different subjects, we conducted a Leave-One-Patient-Out Cross-Validation (LOPO-CV) experiment. In this setting, the dataset comprising 40 infant patients was partitioned such that, in each iteration, the EEG recordings from one patient were held out as the test set, while the remaining 39 patients' data were used for training. This process was repeated 40 times, ensuring that each patient served exactly once as the test subject. LOPO-CV offers a stringent and subject-independent evaluation protocol, particularly suitable for medical applications where inter-subject variability is high. It allows us to assess the model's robustness and its ability to generalize to previously unseen patients, a critical requirement for real-world clinical deployment in infantile spasm detection. Because Table 4 shows substantial and heterogeneous class imbalance at the subject level, we explicitly balanced our cross-validation splits. For 5-fold CV, we used a grouped, stratified split at the patient level: patients were ordered by their spasm counts and assigned to folds in a round-robin manner so that each fold approximated the global spasm/non-spasm ratio and contained comparable EEG hours; no re-sampling was applied on the validation fold.
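The two splitting schemes described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the descending sort order, the tie-breaking rule, and the spasm counts are assumptions introduced here, and the sketch balances spasm counts only, not EEG hours.

```python
from collections import defaultdict


def round_robin_patient_folds(spasm_counts, n_folds=5):
    """Assign whole patients to folds so that no patient is split across folds.

    spasm_counts maps a patient id to that patient's number of spasm segments.
    """
    # Assumption: sort by descending spasm count, ties broken by patient id.
    ordered = sorted(spasm_counts, key=lambda pid: (-spasm_counts[pid], pid))
    folds = defaultdict(list)
    for rank, pid in enumerate(ordered):
        folds[rank % n_folds].append(pid)  # deal patients out in turn
    return [folds[k] for k in range(n_folds)]


def leave_one_patient_out(patient_ids):
    """Yield (train_ids, test_id) pairs, holding out each patient exactly once."""
    for test_id in patient_ids:
        yield [p for p in patient_ids if p != test_id], test_id


# Hypothetical spasm counts for 10 patients (illustrative values only).
raw_counts = [40, 3, 17, 8, 25, 1, 12, 30, 5, 21]
counts = {f"P{i:02d}": c for i, c in enumerate(raw_counts, start=1)}

folds = round_robin_patient_folds(counts, n_folds=5)
for k, fold in enumerate(folds):
    print(f"fold {k}: patients={fold}, spasms={sum(counts[p] for p in fold)}")

for train_ids, test_id in leave_one_patient_out(sorted(counts)):
    print(f"test={test_id}, n_train_patients={len(train_ids)}")
```

Because folds are built from patient identifiers rather than from individual segments, segments from one patient cannot appear in both the training and validation sets of the same fold.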

Table 4. Performance of CMTS-GNN using leave-one-patient-out cross-validation on dataset A.

The leave-one-patient-out cross-validation results presented in Table 4 demonstrate the strong generalization and robustness of the CMTS-GNN model for infantile spasm detection across a diverse cohort of 40 subjects. Notably, 10 patients (numbers 3, 5, 10, 16, 17, 23, 30, 32, 35, and 39) achieved perfect scores on all metrics, reflecting cases where the model fully separated spasm from non-spasm events. The majority of samples were correctly classified, indicating both high sensitivity and specificity. Given the pronounced class imbalance, accuracy alone can be inflated; we therefore interpret performance in light of this imbalance and report per-subject precision, recall, specificity, F1-score, and accuracy, so that each subject contributes equally. In subjects with very few spasms, precision is expected to be lower because non-spasm segments dominate, whereas consistently high recall indicates that true spasm episodes are still captured despite the imbalance. False negatives and false positives were relatively rare, but some patients, such as numbers 2 and 18, displayed lower precision, reflecting more false positive predictions. In these cases, the confusion matrix showed an increased number of non-spasm samples misclassified as spasms, suggesting that patient-specific signal variability or noise may present challenges for the model. Despite this, recall remained above 75% for nearly all patients, underscoring the model's robustness in capturing true spasm episodes even in less distinct or noisy EEG segments. The overall distribution of LOPO-CV metrics shows a low standard deviation, reflecting consistent model performance and minimal overfitting to individual subjects. Furthermore, the confusion matrices did not reveal any subject with systematic misclassification of either spasms or non-spasms, supporting the patient-independence and clinical reliability of CMTS-GNN. These results indicate that CMTS-GNN can effectively generalize across patients and holds significant potential for real-world deployment in clinical settings for automated infantile spasm detection.
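The effect of class imbalance on per-patient metrics can be illustrated with a short sketch. The confusion counts below are hypothetical and are not taken from Table 4; the helper names `patient_metrics` and `macro_summary` are introduced here only for illustration. The third patient has very few spasms, so a handful of false positives lowers precision sharply while recall and accuracy stay high.

```python
from statistics import mean, stdev


def patient_metrics(tp, fp, tn, fn):
    """Metrics for one held-out patient, computed from raw confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty denominators

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def macro_summary(per_patient):
    """Mean and sample standard deviation of each metric, one vote per patient."""
    summary = {}
    for name in per_patient[0]:
        values = [m[name] for m in per_patient]
        summary[name] = (mean(values), stdev(values))
    return summary


# Hypothetical (tp, fp, tn, fn) counts for three held-out patients.
patients = [
    patient_metrics(10, 0, 90, 0),   # perfectly separated patient
    patient_metrics(8, 2, 88, 2),    # a few errors of each kind
    patient_metrics(2, 5, 193, 0),   # very few spasms, some false alarms
]

for name, (avg, sd) in macro_summary(patients).items():
    print(f"{name}: mean={avg:.3f}, sd={sd:.3f}")
```

Averaging per-patient metrics in this way gives each subject equal weight regardless of recording length, which is one reading of the equal-contribution criterion stated above.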

3.6 Explainability of model decisions

To provide insight into the decision-making process of our deep learning model, we employed the gradient multiplied by input attribution method. This approach, originally described by Karen Simonyan et al. (37) in 2013 in the context of saliency maps, quantifies feature importance by computing the element-wise product of the input and the gradient of the output with respect to that input. This method has since been widely adopted in the field of neural network interpretability, and was further developed by Mukund Sundararajan et al. (38) in 2017 through the introduction of Integrated Gradients. The resulting relevance scores reflect the direct contribution of each input feature to the model's prediction, offering an intuitive and computationally efficient means of interpreting complex models. In the context of electroencephalogram (EEG) analysis, the application of gradient multiplied by input attribution is particularly important (16, 37, 39). EEG signals are high-dimensional and spatially distributed, with substantial variability across both subjects and brain regions. Traditional deep learning models, while powerful in capturing nonlinear spatiotemporal dependencies, often lack transparency, making it difficult to assess which channels or temporal segments drive the network's predictions. By employing gradient multiplied by input attribution, we can generate channel-wise or region-wise relevance maps, enabling neuroscientific interpretation and clinical validation of model behavior. This not only enhances trust in automated EEG classification systems, but also helps uncover physiologically meaningful patterns that may underlie epileptic activity or other neurological events.
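As a minimal sketch of the gradient multiplied by input rule, the example below uses a single logistic unit as a stand-in for the network, since its gradient can be written in closed form. The model, weights, and input values are assumptions chosen for illustration and bear no relation to CMTS-GNN itself; in practice the gradient would come from automatic differentiation of the trained model.

```python
import math


def predict(x, weights, bias):
    """Toy model: p = sigmoid(w . x + b), standing in for a trained network."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient_times_input(x, weights, bias):
    """Attribution of feature i is x_i * dp/dx_i."""
    p = predict(x, weights, bias)
    grad = [p * (1.0 - p) * w for w in weights]  # analytic dp/dx_i for a sigmoid
    return [xi * g for xi, g in zip(x, grad)]


def numerical_gradient(x, weights, bias, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic gradient."""
    grad = []
    for i in range(len(x)):
        upper, lower = list(x), list(x)
        upper[i] += eps
        lower[i] -= eps
        diff = predict(upper, weights, bias) - predict(lower, weights, bias)
        grad.append(diff / (2 * eps))
    return grad


# Hypothetical three-channel input and weights.
x = [1.0, 2.0, 0.0]
w = [0.5, -0.25, 2.0]
attributions = gradient_times_input(x, w, bias=0.0)
print(attributions)
```

Note that the third feature receives zero attribution despite having the largest weight, because its input value is zero. This is a known property of the method: relevance reflects both the sensitivity of the output and the magnitude of the input.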

Given that the proposed cross-modal fusion architecture is capable of simultaneously integrating temporal and spectral graph features, we further designed a weighted fusion mechanism for the attribution scores, combining the channel contributions from both modalities in a weighted manner. The fusion coefficient was set to 0.5 to ensure equal representation of temporal and spectral information. Specifically, we applied the gradient multiplied by input method to compute the attribution scores for each channel in both the temporal and spectral domains, and then aggregated these scores using the weighted scheme to obtain the final channel relevance scores. To facilitate spatial pattern comparison across different samples, we averaged the channel scores along the temporal dimension for each sample to obtain a single spatial vector. Finally, based on the international standard 10–20 electrode system, we visualized the model's decision basis by plotting EEG topographic maps.
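The fusion and averaging steps can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function name `channel_relevance`, the nested-list layout (channels by steps), and the example values are hypothetical. The sketch averages each modality along its second axis before fusing; because both operations are linear, this gives the same result as fusing first whenever the two attribution arrays share a shape.

```python
def channel_relevance(attr_temporal, attr_spectral, alpha=0.5):
    """Fuse per-channel attributions from two modalities into one spatial vector.

    attr_temporal and attr_spectral are lists of rows, one row per channel.
    alpha weights the temporal modality; (1 - alpha) weights the spectral one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(attr_temporal) != len(attr_spectral):
        raise ValueError("both modalities must cover the same channels")

    fused = []
    for t_row, s_row in zip(attr_temporal, attr_spectral):
        t_mean = sum(t_row) / len(t_row)  # collapse the temporal axis
        s_mean = sum(s_row) / len(s_row)  # collapse the spectral axis
        fused.append(alpha * t_mean + (1.0 - alpha) * s_mean)
    return fused


# Hypothetical attributions for two channels and two steps per modality.
temporal = [[1.0, 3.0], [0.0, 0.0]]
spectral = [[2.0, 2.0], [-4.0, 0.0]]
print(channel_relevance(temporal, spectral))             # equal weighting
print(channel_relevance(temporal, spectral, alpha=1.0))  # temporal only
```

The resulting vector has one entry per electrode and can be passed to any topographic plotting routine that accepts 10–20 electrode positions.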

As shown in Figure 3, the topographic maps of attribution scores for several infantile spasm samples indicate that, when the model identifies spasm events, it focuses notably on neural activity in the frontal, central, and temporal regions. These areas consistently display higher positive attribution scores in most spasm samples, indicating their discriminative value in the model's classification decisions. In contrast, channels in the occipital region tend to exhibit negative or low contributions, suggesting that this region contributes little to spasm recognition. The spatial activation pattern remains highly consistent across different samples while also showing individualized lateralization, reflecting the model's sensitivity to the potential distribution of epileptic foci. Importantly, these attribution results are consistent with findings from clinical EEG research, which indicate that infantile spasms most frequently originate from the frontal lobe, central motor cortex, and temporal pole regions, as documented by Lux et al. (40) and Watanabe et al. (41). This correspondence supports the neurophysiological plausibility and medical relevance of the model's explanations, further supporting the value of deep models in spasm prediction.

Figure 3. Topographic maps of attribution scores are shown for samples from five infants with spasms. Higher normalized attribution scores indicate features that are more relevant for the model's classification decision, whereas lower scores represent less relevant or irrelevant inputs. (A) displays attribution maps for spasm samples, while (B) corresponds to non-spasm samples.

In contrast, analysis of the attribution topographic maps for non-spasm samples shows that when the model identifies non-spasm states, the overall distribution of channel relevance scores becomes more diffuse, with no concentrated activation regions. Most channels present attribution scores close to zero or mildly negative, especially in the occipital and central areas, which consistently show a suppressive contribution in multiple samples. This suggests that the model derives non-spasm evidence from these regions. Compared to the prominent frontal and temporal activation observed in spasm samples, the spatial discriminability and activation magnitude in non-spasm samples are substantially reduced. This trend demonstrates that the model can effectively distinguish spatial features under different clinical states, providing visual evidence for its stability and reliability in practical clinical applications.

From a cognitive network perspective, CMTS-GNN yields explanations at the level of large-scale functional systems rather than isolated channels. The brain-region-wise attention pooling in Equation (16) produces region embeddings that serve as proxies for canonical systems. Bidirectional cross-modal attention together with the gated fusion in Equations 18–19 then quantifies how evidential support flows between these systems across temporal and spectral representations. Aggregating gradient × input attributions within each anatomically defined region provides a decomposable “network evidence” profile per segment, revealing that spasm decisions are primarily driven by fronto-central and anterior temporal systems, with consistent suppression or low evidence in occipital cortex. This network-level pattern accords with circuits subserving early sensorimotor control, cognitive control, and affective reactivity, and thus offers a cognitively meaningful account of why the model classifies a segment as spasm vs. non-spasm. Practically, per-region evidence can be surfaced alongside predictions to support clinical review and to track patient-specific lateralization over time, linking model outputs to interpretable cognitive networks and facilitating biomarker development for downstream mental-health modeling.
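The per-region evidence profile described above can be sketched by averaging channel-level attributions within each region. The channel-to-region grouping, the use of the mean as the aggregation rule, and the scores below are assumptions for illustration; they are not the region definitions or values used in the paper.

```python
# Assumed grouping of 10-20 electrodes into regions (illustrative only).
REGION_CHANNELS = {
    "frontal": ["Fp1", "Fp2", "F3", "F4", "F7", "F8", "Fz"],
    "central": ["C3", "C4", "Cz"],
    "temporal": ["T3", "T4", "T5", "T6"],
    "parietal": ["P3", "P4", "Pz"],
    "occipital": ["O1", "O2"],
}


def region_evidence(channel_scores, regions=REGION_CHANNELS):
    """Average channel attributions within each region that has data."""
    evidence = {}
    for region, channels in regions.items():
        present = [channel_scores[ch] for ch in channels if ch in channel_scores]
        if present:  # skip regions with no recorded channels
            evidence[region] = sum(present) / len(present)
    return evidence


def rank_regions(evidence):
    """Order regions from strongest to weakest support for the prediction."""
    return sorted(evidence, key=evidence.get, reverse=True)


# Hypothetical channel relevance scores for one spasm segment.
scores = {"F3": 0.6, "F4": 0.4, "C3": 0.3, "T3": 0.45, "O1": -0.2, "O2": -0.4}
profile = region_evidence(scores)
print(profile)
print(rank_regions(profile))
```

A profile of this kind could be shown next to each prediction so that a reviewer sees which regions supported the decision and which argued against it.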

3.7 Leave-one-patient-out cross-validation on dataset B

To further evaluate the generalization capability of the proposed CMTS-GNN model across different epilepsy types and EEG backgrounds, we conducted transfer testing on the public CHB-MIT epilepsy dataset. The CHB-MIT dataset consists of long-term EEG recordings from multiple epilepsy patients, encompassing a wide spectrum of seizure types and exhibiting background activity and ictal patterns that differ substantially from those observed in infantile spasms. Employing this dataset as an independent test set not only imposes stricter requirements on model robustness and cross-domain adaptability, but also more accurately simulates real-world clinical scenarios. For data preprocessing, all EEG recordings (both seizure and non-seizure segments) were uniformly segmented into five-second epochs to standardize input length and enhance temporal resolution for model analysis. In addition, to ensure consistency across samples and facilitate robust cross-subject evaluation, we retained only the 18 EEG channels that were common to all recordings: FP2-F4, C4-P4, T8-P8, F7-T7, FP1-F3, FP1-F7, P7-O1, F4-C4, T7-P7, P8-O2, P3-O1, F8-T8, FZ-CZ, FP2-F8, CZ-PZ, F3-C3, C3-P3, and P4-O2.
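The channel selection and epoching steps can be sketched as follows. The sketch assumes a recording already loaded as a mapping from channel name to a list of samples, non-overlapping windows, and a sampling rate of 256 Hz (the rate documented for CHB-MIT); the function names and the dictionary layout are hypothetical, and reading the EDF files is left out.

```python
# The 18 bipolar channels listed above, kept in the order given in the text.
COMMON_CHANNELS = [
    "FP2-F4", "C4-P4", "T8-P8", "F7-T7", "FP1-F3", "FP1-F7",
    "P7-O1", "F4-C4", "T7-P7", "P8-O2", "P3-O1", "F8-T8",
    "FZ-CZ", "FP2-F8", "CZ-PZ", "F3-C3", "C3-P3", "P4-O2",
]


def select_channels(recording, keep=COMMON_CHANNELS):
    """Return the kept channels in a fixed order, dropping all others."""
    missing = [ch for ch in keep if ch not in recording]
    if missing:
        raise KeyError(f"recording lacks required channels: {missing}")
    return [recording[ch] for ch in keep]


def segment_epochs(signals, fs=256, epoch_seconds=5):
    """Cut equal-length signals into non-overlapping epochs.

    Returns a list of epochs, each a list of per-channel sample lists.
    Trailing samples that do not fill a whole epoch are discarded.
    """
    n = fs * epoch_seconds
    n_epochs = min(len(s) for s in signals) // n
    return [[s[k * n:(k + 1) * n] for s in signals] for k in range(n_epochs)]


# Hypothetical 12-second recording with one extra channel that is dropped.
fs = 256
recording = {ch: [0.0] * (12 * fs) for ch in COMMON_CHANNELS}
recording["ECG"] = [0.0] * (12 * fs)

epochs = segment_epochs(select_channels(recording), fs=fs)
print(len(epochs), len(epochs[0]), len(epochs[0][0]))
```

With these assumptions, each epoch holds 18 channels of 1,280 samples, and the final two seconds of the 12-second example are dropped.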

To avoid class imbalance and to provide a fair evaluation of the model's discriminative ability, we adopted a balanced scheme with equal proportions of positive and negative samples. Several representative and state-of-the-art methods were selected for unified performance comparison on Dataset B using five-fold cross-validation. In addition, we employed a leave-one-subject-out (LOSO) cross-validation strategy, in which the EEG data of each patient were used in turn as the test set while the data from the remaining patients served as the training set. This approach provides a comprehensive assessment of the model's generalization ability and robustness across different individuals. Detailed experimental results are presented in Tables 5 and 6.
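The text does not state how the equal class proportions were obtained. One common way to reach them is random undersampling of the majority class, sketched below purely as an assumption; the function name, the fixed seed, and the label values are hypothetical.

```python
import random


def balance_by_undersampling(labels, seed=0):
    """Return sorted epoch indices with equal numbers of each class.

    labels holds 1 for seizure epochs and 0 for non-seizure epochs.
    The larger class is randomly subsampled to the size of the smaller one.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(positives), len(negatives))
    return sorted(rng.sample(positives, n) + rng.sample(negatives, n))


# Hypothetical labels: 3 seizure epochs among 23 epochs in total.
labels = [0] * 10 + [1] * 3 + [0] * 10
kept = balance_by_undersampling(labels)
print(kept, [labels[i] for i in kept])
```

Undersampling discards most background epochs, so the reported metrics describe discrimination on a balanced set rather than false-alarm rates on continuous recordings.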

Table 5. Performance comparison between the proposed method and state-of-the-art methods using 5-fold cross-validation on dataset B.

Table 6. Performance of CMTS-GNN using leave-one-patient-out cross-validation on dataset B.

The proposed method demonstrated strong overall performance in the five-fold cross-validation experiments conducted on the CHB-MIT public dataset. Specifically, it outperformed the comparative approaches on every evaluation metric except AUC, reaching an accuracy of 98.54%, precision of 98.31%, recall of 98.71%, and F1-score of 98.47%. The method by Wenna Chen et al. (34) achieved the highest AUC (99.04%), but its other metrics, such as accuracy and F1-score, were slightly lower than those of the proposed method. The metrics of Tawhid et al. (30) and Wang et al. (31) were all inferior to those of our method, with particularly noticeable gaps in recall and precision. As shown in Figure 4, the confusion matrix provides an intuitive reflection of the classification performance on both positive and negative samples. It can be observed that the proposed method achieves a higher true positive rate (98.71%) and true negative rate (98.38%) than the comparative methods, indicating fewer missed detections and false alarms in practical detection. While other methods also perform well, some exhibit higher error rates in negative sample discrimination; for example, the true negative rate of the Tawhid et al. (30) method is 97.61%, slightly lower than that of the proposed method. In summary, the proposed method compares favorably with several state-of-the-art algorithms in the five-fold cross-validation experiments on the CHB-MIT public dataset. It achieves the best accuracy, precision, recall, and F1-score, and its AUC is close to the highest, indicating strong potential for application in automatic seizure detection.

Four confusion matrices compare performance across different methods. Each matrix displays percentages for predicted versus true labels of spasms and non-spasms. From left to right: Md. Nurul Ahad Tawhid et al. shows 98.96% accuracy for spasms and 97.61% for non-spasms. Xiashuang Wang et al. shows 98.04% accuracy for spasms and 97.82% for non-spasms. Wenna Chen et al. shows 98.56% accuracy for spasms and 98.06% for non-spasms. The proposed method shows 98.71% accuracy for spasms and 98.38% for non-spasms.

Figure 4. Comparison of confusion matrices between the proposed method and state-of-the-art methods on dataset B. The colorbar indicates the percentage, which is row-normalized.
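Row normalization, as used in the confusion matrices of Figure 4, divides each row by the number of true samples in that class, so the diagonal entries are the true negative rate and the true positive rate. A minimal sketch with hypothetical labels (the function name and values are illustrative, not the study's data):

```python
def row_normalized_confusion(y_true, y_pred):
    """Return a 2x2 confusion matrix in percent, normalized per true class.

    Rows index the true class and columns the predicted class,
    with 0 for the negative class and 1 for the positive class.
    """
    counts = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical labels: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

cm = row_normalized_confusion(y_true, y_pred)
true_negative_rate, true_positive_rate = cm[0][0], cm[1][1]
print(cm)
```

Because each row sums to 100%, the off-diagonal entry of the positive row is the miss rate and that of the negative row is the false alarm rate.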

4 Conclusion

We proposed CMTS-GNN, a cross-modal temporal-spectral graph neural network for automated infantile spasm detection from EEG, and demonstrated state-of-the-art performance with strong generalizability and interpretability. On the dedicated infantile spasm dataset, CMTS-GNN reached 99.02% accuracy, 98.96% precision, 97.47% recall, 98.20% F1, and 99.27% AUC under five-fold evaluation, and exhibited robust patient-independent generalization in leave-one-patient-out testing, with multiple subjects achieving perfect scores. Cross-domain transfer to CHB-MIT confirmed robustness under distribution shift, yielding 98.54% accuracy, 98.31% precision, 98.71% recall, 98.47% F1, and 98.87% AUC in five-fold evaluation, while most patients surpassed 90% accuracy in leave-one-subject-out testing. Attribution analysis highlighted frontal, central, and temporal regions during spasm detections, in line with clinical knowledge. These results establish CMTS-GNN as an accurate, generalizable, and clinically interpretable solution for infantile spasm detection and motivate future work on larger and more diverse cohorts, integration of additional physiological signals, and refined interpretability to support clinical deployment.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by Northeastern University Medical and Bioethics Committee. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants' legal guardians/next of kin.

Author contributions

YW: Formal analysis, Writing – review & editing, Writing – original draft, Methodology. LM: Software, Supervision, Writing – review & editing, Writing – original draft. YF: Visualization, Data curation, Formal analysis, Writing – review & editing, Investigation, Writing – original draft.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This study was supported by National Natural Science Foundation of China (62073061) and Guangdong Basic and Applied Basic Research Foundation (2025A1515011602).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Lux AL, Osborne JP. A proposal for case definitions and outcome measures in studies of infantile spasms and West syndrome: consensus statement of the West Delphi group. Epilepsia. (2004) 45:1416–28. doi: 10.1111/j.0013-9580.2004.02404.x

2. Pavone P, Striano P, Falsaperla R, Pavone L, Ruggieri M. Infantile spasms syndrome, West syndrome and related phenotypes: what we know in 2013. Brain Dev. (2014) 36:739–51. doi: 10.1016/j.braindev.2013.10.008

3. Lux AL. Latest American and European updates on infantile spasms. Curr Neurol Neurosci Rep. (2013) 13:334. doi: 10.1007/s11910-012-0334-z

4. Riikonen R. A long-term follow-up study of 214 children with the syndrome of infantile spasms. Neuropediatrics. (1982) 13:14–23. doi: 10.1055/s-2008-1059590

5. Osborne JP, Lux AL, Edwards SW, Hancock E, Johnson AL, Kennedy CR, et al. The underlying etiology of infantile spasms (West syndrome): information from the United Kingdom Infantile Spasms Study (UKISS) on contemporary causes and their classification 2. Epilepsia. (2010) 51:2168–74. doi: 10.1111/j.1528-1167.2010.02695.x

6. Caraballo RH, Ruggieri V, Gonzalez G, Cersosimo R, Gamboni B, Rey A, et al. Infantile spams without hypsarrhythmia: a study of 16 cases. Seizure. (2011) 20:197–202. doi: 10.1016/j.seizure.2010.11.018

7. Tiwari AK, Pachori RB, Kanhangad V, Panigrahi BK. Automated diagnosis of epilepsy using key-point-based local binary pattern of EEG signals. IEEE J Biomed Health Inform. (2016) 21:888–96. doi: 10.1109/JBHI.2016.2589971

8. Ramantani G, Bölsterli BK, Alber M, Klepper J, Korinthenberg R, Kurlemann G, et al. Treatment of infantile spasm syndrome: update from the interdisciplinary guideline committee coordinated by the German-Speaking Society of Neuropediatrics. Neuropediatrics. (2022) 53:389–401. doi: 10.1055/a-1909-2977

9. Chaddad A, Wu Y, Kateb R, Bouridane A. Electroencephalography signal processing: A comprehensive review and analysis of methods and techniques. Sensors. (2023) 23:6434. doi: 10.3390/s23146434

10. Chopra SS. Infantile spasms and West syndrome-a clinician's perspective. Indian J Pediat. (2020) 87:1040–6. doi: 10.1007/s12098-020-03279-y

11. L-Molnár T, Siegler Z, Hegyi M, Jakus R, Bodó T, Kormos E, et al. A tartós videó-EEG-monitorozás szerepe a gyermekkori epilepsziák diagnosztikájában. Orvosi Hetilap. (2024) 165:722–6. doi: 10.1556/650.2024.33037

12. Demarest ST, Shellhaas RA, Gaillard WD, Keator C, Nickels KC, Hussain SA, et al. The impact of hypsarrhythmia on infantile spasms treatment response: observational cohort study from the National Infantile Spasms Consortium. Epilepsia. (2017) 58:2098–103. doi: 10.1111/epi.13937

13. Jing J, Sun H, Kim JA, Herlopian A, Karakis I, Ng M, et al. Development of expert-level automated detection of epileptiform discharges during electroencephalogram interpretation. JAMA Neurol. (2020) 77:103–8. doi: 10.1001/jamaneurol.2019.3485

14. Meng L, Hu J, Deng Y, Hu Y. Electrical status epilepticus during sleep electroencephalogram waveform identification and analysis based on a graph convolutional neural network. Biomed Signal Process Control. (2022) 77:103788. doi: 10.1016/j.bspc.2022.103788

15. Halford JJ, Shiau D, Desrochers J, Kolls B, Dean B, Waters C, et al. Inter-rater agreement on identification of electrographic seizures and periodic discharges in ICU EEG recordings. Clin Neurophysiol. (2015) 126:1661–9. doi: 10.1016/j.clinph.2014.11.008

16. Roy Y, Banville H, Albuquerque I, Gramfort A, Falk TH, Faubert J. Deep learning-based electroencephalography analysis: a systematic review. J Neural Eng. (2019) 16:051001. doi: 10.1088/1741-2552/ab260c

17. Zhao S, Tuan LA, Fu J, Wen J, Luo W. Exploring clean label backdoor attacks and defense in language models. IEEE/ACM Trans Audio, Speech, Lang Proc. (2024) 32:3014–24. doi: 10.1109/TASLP.2024.3407571

18. Zhao S, Tian J, Fu J, Chen J, Wen J. Feamix: Feature mix with memory batch based on self-consistency learning for code generation and code translation. IEEE Trans Emerg Topics Comp Intellig. (2024) 9:192–201. doi: 10.1109/TETCI.2024.3395531

19. Zhou M, Tian C, Cao R, Wang B, Niu Y, Hu T, et al. Epileptic seizure detection based on EEG signals and CNN. Front Neuroinform. (2018) 12:95. doi: 10.3389/fninf.2018.00095

20. Cao J, Hu D, Wang Y, Wang J, Lei B. Epileptic classification with deep-transfer-learning-based feature fusion algorithm. IEEE Trans Cognit Dev Syst. (2021) 14:684–95. doi: 10.1109/TCDS.2021.3064228

21. Tsiouris KM, Pezoulas VC, Zervakis M, Konitsiotis S, Koutsouris DD, Fotiadis DI. A long short-term memory deep learning network for the prediction of epileptic seizures using EEG signals. Comput Biol Med. (2018) 99:24–37. doi: 10.1016/j.compbiomed.2018.05.019

22. Yao X, Li X, Ye Q, Huang Y, Cheng Q, Zhang GQ. A robust deep learning approach for automatic classification of seizures against non-seizures. Biomed Signal Process Control. (2021) 64:102215. doi: 10.1016/j.bspc.2020.102215

23. Bullmore E, Sporns O. Complex brain networks: graph theoretical analysis of structural and functional systems. Nat Rev Neurosci. (2009) 10:186–98. doi: 10.1038/nrn2575

24. Kipf T. Semi-supervised classification with graph convolutional networks. arXiv [preprint] arXiv:1609.02907. (2016). doi: 10.48550/arXiv.1609.02907

25. Verma S, Zhang ZL. Stability and generalization of graph convolutional neural networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage, AK; New York, NY: Association for Computing Machinery (2019). p. 1539–48.

26. Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. (2020) 34:249–70. doi: 10.1109/TKDE.2020.2981333

27. Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, et al. Graph neural networks: a review of methods and applications. AI Open. (2020) 1:57–81. doi: 10.1016/j.aiopen.2021.01.001

28. Guttag J. CHB-MIT Scalp EEG Database (version 1.0.0). In: PhysioNet. (2010).

29. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Venice; Piscataway, NJ: IEEE (2017). p. 2980–8.

30. Tawhid MNA, Siuly S, Li T. A convolutional long short-term memory-based neural network for epilepsy detection from EEG. IEEE Trans Instrum Meas. (2022) 71:1–11. doi: 10.1109/TIM.2022.3217515

31. Wang X, Wang Y, Liu D, Wang Y, Wang Z. Automated recognition of epilepsy from EEG signals using a combining space-time algorithm of CNN-LSTM. Sci Rep. (2023) 13:14876. doi: 10.1038/s41598-023-41537-z

32. Abadal S, Galván P, Mármol A, Mammone N, Ieracitano C, Giudice ML, et al. Graph neural networks for electroencephalogram analysis: Alzheimer's disease and epilepsy use cases. Neural Netw. (2025) 181:106792. doi: 10.1016/j.neunet.2024.106792

33. Srinivasan S, Dayalane S, Mathivanan SK, Rajadurai H, Jayagopal P, Dalu GT. Detection and classification of adult epilepsy using hybrid deep learning approach. Sci Rep. (2023) 13:17574. doi: 10.1038/s41598-023-44763-7

34. Chen W, Wang Y, Ren Y, Jiang H, Du G, Zhang J, et al. An automated detection of epileptic seizures EEG using CNN classifier based on feature fusion with high accuracy. BMC Med Inform Decis Mak. (2023) 23:96. doi: 10.1186/s12911-023-02180-w

35. Huang H, Chen P, Wen J, Lu X, Zhang N. Multiband seizure type classification based on 3D convolution with attention mechanisms. Comput Biol Med. (2023) 166:107517. doi: 10.1016/j.compbiomed.2023.107517

36. Dang W, Lv D, Rui L, Liu Z, Chen G, Gao Z. Studying multi-frequency multilayer brain network via deep learning for EEG-based epilepsy detection. IEEE Sens J. (2021) 21:27651–8. doi: 10.1109/JSEN.2021.3119411

37. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv [preprint] arXiv:1312.6034. (2013). doi: 10.48550/arXiv.1312.6034

38. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: International Conference on Machine Learning. New York: PMLR (2017). p. 3319–28.

39. Lawhern VJ, Solon AJ, Waytowich NR, Gordon SM, Hung CP, Lance BJ. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. J Neural Eng. (2018) 15:056013. doi: 10.1088/1741-2552/aace8c

40. Lux AL, Osborne JP. The influence of etiology upon ictal semiology, treatment decisions and long-term outcomes in infantile spasms and West syndrome. Epilepsy Res. (2006) 70:77–86. doi: 10.1016/j.eplepsyres.2006.01.017

41. Watanabe K, Negoro T, Okumura A. Symptomatology of infantile spasms. Brain Dev. (2001) 23:453–66. doi: 10.1016/S0387-7604(01)00274-1

Keywords: infantile spasms, cognitive control, explainability analysis, cross-modal, brain regions

Citation: Wang Y, Meng L and Fan Y (2025) CMTS-GNN: a cross-modal temporal-spectral graph neural network with cognitive network explainability. Front. Neurol. 16:1700161. doi: 10.3389/fneur.2025.1700161

Received: 12 September 2025; Accepted: 08 October 2025;
Published: 30 October 2025.

Edited by:

Shuai Zhao, Nanyang Technological University, Singapore

Reviewed by:

Sanqing Xu, Huazhong University of Science and Technology, China
Dingnan Deng, Jiaying University, China

Copyright © 2025 Wang, Meng and Fan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yuying Fan
