
ORIGINAL RESEARCH article

Front. Neurorobot., 09 January 2026

Volume 19 - 2025 | https://doi.org/10.3389/fnbot.2025.1704111

AMANet: a data-augmented multi-scale temporal attention convolutional network for motor imagery classification

Shu Wang1, Raofen Wang1*, Liang Chang2, Jianzhen Wu1 and Lingyan Hu1
  • 1School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
  • 2School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China

Motor imagery brain–computer interfaces (MI-BCIs) have garnered considerable attention due to their potential to promote neural plasticity. However, the limited number of MI-EEG samples per subject and the susceptibility of features to noise and artifacts pose significant challenges for achieving high decoding performance. To address this problem, a Data-Augmented Multi-Scale Temporal Attention Convolutional Network (AMANet) is proposed. The network consists of four main modules. First, the data augmentation module comprises three steps: sliding-window segmentation to increase the sample size, Common Spatial Pattern (CSP) filtering to extract discriminative spatial features, and linear scaling to enhance network robustness. Second, multi-scale temporal convolution is incorporated to dynamically extract temporal and spatial features. Third, the ECA attention mechanism is integrated to adaptively adjust the weights of different channels. Finally, depthwise separable convolution is utilized to fully integrate and classify the deeply extracted temporal and spatial features. In 10-fold cross-validation, AMANet achieves classification accuracies of 84.06 and 85.09% on BCI Competition IV Datasets 2a and 2b, respectively, significantly outperforming baseline models such as Incep-EEGNet. On the High-Gamma dataset, AMANet attains a classification accuracy of 95.48%. These results demonstrate the excellent performance of AMANet in motor imagery decoding tasks.

1 Introduction

Brain–Computer Interfaces (BCIs) enable communication between the human brain and external devices without direct muscular or neural intervention (Pfurtscheller and Neuper, 2001; McFarland and Wolpaw, 2011). Among BCI paradigms, the Motor Imagery BCI (MI-BCI) has been recognized for its noninvasiveness and high temporal resolution (Graimann et al., 2009). By acquiring neural signals in real time and decoding user intent from neuronal activation patterns (Pfurtscheller et al., 2006), a BCI can control external devices (Ang et al., 2012, 2015) or facilitate information transfer (Li et al., 2010). Accurate decoding of electroencephalogram (EEG) signals generated during motor imagery not only underpins intelligent control of prostheses and robots but also holds promise for applications in medical rehabilitation and human–machine interaction (Condori et al., 2016; Cho et al., 2018). However, the complexity of EEG acquisition and the limited sample sizes constrain decoding performance (Craik et al., 2019), making enhancement of decoding accuracy under small-sample conditions critical for BCI applications.

In recent years, to address the challenge of small-sample EEG decoding, various convolutional neural network (CNN) (Lecun et al., 1998)-based models have been proposed by researchers to achieve efficient feature extraction and classification under limited-sample conditions. For example, Mattioli et al. combined a one-dimensional CNN (1D-CNN) (Mattioli et al., 2021) with the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), increasing the overall classification accuracy in a five-class task from only 33.38% without augmentation to 99.38% after augmentation. Zhang et al. (2020) employed a generative model-based data augmentation approach in motor imagery tasks and performed classification using a single-scale CNN, achieving a substantial improvement in recognition accuracy with limited training data. Huang and Zhou (2021) proposed a hybrid approach that combines common spatial pattern (CSP) (Ramoser et al., 2000) with deep learning: CSP was first applied to enhance the spatial separability between two motor imagery classes, and the filtered signals were then fed into a deep network for end-to-end classification. It is noteworthy that this method focuses on enhancing feature separability rather than expanding the training dataset; thus, in sample-limited scenarios, further improvements in generalization performance can often be achieved by integrating such methods with data augmentation. Although effectiveness in small-sample environments has been shown by these approaches, they typically rely on single-scale convolution with a fixed receptive field, making it difficult to fully capture multi-scale time–frequency features (Zhang et al., 2021).

With the advancement of deep learning, multi-scale convolution structures (Ma et al., 2020) have been proposed by researchers to overcome the limitations of single-scale models in feature extraction and to enhance the capture of EEG information across different time–frequency scales. For example, Wu et al. developed MSFBCNN (Wu et al., 2019), which employs multiple parallel temporal convolution branches to extract multi-scale temporal features, combined with spatial convolution, thereby improving MI classification accuracy and cross-subject generalization on the BCI Competition IV dataset. Liu et al. designed FBMSNet (Liu et al., 2023), which integrates filter banks with multi-scale depthwise convolution to jointly extract frequency-band features, achieving approximately 79% accuracy across multiple datasets. Li et al. (2023) proposed MTFB-CNN, which applies multi-scale time–frequency block convolution directly to raw EEG signals without complex preprocessing, significantly improving MI decoding performance. While these methods enhance feature capture capability, they lack adaptive mechanisms to emphasize salient information and suppress noise, which limits decoding performance in small-sample scenarios.

To enhance EEG decoding performance, attention mechanisms have gradually been recognized as a key technique. They selectively emphasize critical information while suppressing irrelevant or redundant data, thereby improving feature representation (Ou and Zou, 2025). In recent years, attention mechanisms have been increasingly integrated into deep learning frameworks to highlight crucial information in motor imagery signals and boost classification performance. For example, Liao et al. (2025) proposed CIACNet, which incorporates an improved CBAM (Woo et al., 2018) module and combines a dual-branch CNN with a temporal convolutional network (TCN) (Dudukcu et al., 2023) to achieve multi-level feature extraction and fusion for MI-EEG signals, attaining excellent performance on the BCI-IV 2a and 2b datasets. Similarly, Altaheri et al. (2023) introduced ATCNet, integrating multi-head attention into the TCN structure to emphasize the most discriminative features, thus significantly enhancing decoding accuracy.

Inspired by these studies, AMANet, a data-augmented multi-scale temporal attention convolutional network for effective EEG decoding, is proposed. The model is composed of five modules: a data augmentation block (DG-Block), a multi-scale temporal block (MST-Block), an ECA attention block, a depthwise separable fusion block (DSF-Block), and a classification block. The main contributions of this work are as follows:

• The AMANet model is proposed, which adaptively and dynamically captures small-sample EEG features and achieves superior results on the BCI Competition IV Dataset 2a and 2b.

• To address the limited sample size, this study applies a sliding window strategy to expand the training set and employs CSP to extract six discriminative spatial filter pairs, enhancing spatial feature extraction while reducing model parameters and improving robustness.

• To improve decoding performance, multi-scale temporal convolutions are employed for dynamic EEG feature extraction and pointwise convolution is adopted for channel integration, thereby strengthening multi-scale representation and improving the model’s generalization to complex EEG signals.

• To emphasize critical channels and reduce computational cost, the efficient channel attention (ECA) mechanism is introduced, which highlights key features through adaptive weighting and enables lightweight spatial feature extraction without complex dimensionality reduction.

The remainder of this paper is organized as follows. The architecture of AMANet is introduced in Section 2. The experimental setup is described in Section 3. The experimental results are presented and discussed in Section 4, and finally, a brief conclusion is provided in Section 5.

2 Methods

The architecture of AMANet is depicted in Figure 1. First, the training set is expanded and robustness is enhanced by the data augmentation block (DG-Block). Next, both temporal and spatial features are extracted by the multi-scale temporal block (MST-Block), comprising the multi-scale temporal feature block (MS-Block) and the spatial feature refinement block (ST-Block). Then, channel features are adaptively reweighted by an Efficient Channel Attention (ECA) mechanism to highlight the most discriminative information. Subsequently, spatiotemporal information is coupled by the DSF-Block for integrated feature representation. Finally, the fused features are assigned to the target classes by the classification layer. Notably, to accommodate the multi-scale, non-stationary, and low signal-to-noise characteristics of EEG signals, this study adopts a multi-stage temporal convolutional architecture. The convolutions at different stages serve hierarchical temporal modeling functions rather than mere repetitive stacking. Specifically, the first-stage multi-scale convolution module extracts coarse-grained multi-temporal-scale features using kernels of different lengths, which can capture short-term local dynamics, medium-range rhythmic variations, and longer-term dependencies, acting as a learnable multi-band filter bank. The subsequent second-stage convolutions further integrate these coarse features in a fine-grained temporal manner and combine them with spatial filtering, achieving inter-channel feature projection and temporal enhancement. The third stage then strengthens the temporal features from the first two stages and suppresses noise through depthwise separable convolutions and lightweight channel attention, thereby improving the overall stability of feature representation.
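To make the stage ordering concrete, the following minimal PyTorch sketch traces tensor shapes through simplified stand-ins for the three convolutional stages (the ECA attention inserted between the second and third stages is omitted here and sketched in Section 2.3). Layer widths follow the paper’s notation (F1 = 8, D = 2, F2 = 16, C1 = 12 for Dataset 2a, T1 = 500); the layers themselves are illustrative placeholders, not the exact AMANet implementation.

import torch
import torch.nn as nn

# Dimensions taken from the notation used in this paper (Dataset 2a):
# C1 = 12 CSP channels, T1 = 500 samples per window, F1 = 8, D = 2, F2 = 16.
F1, D, F2, C1, T1 = 8, 2, 16, 12, 500

stages = nn.Sequential(
    # Stage 1 (MS-Block): coarse multi-scale temporal filtering; collapsed here
    # to a single (1, 16) kernel for brevity, whereas the real block uses three scales.
    nn.Conv2d(1, F1, (1, 16), padding=(0, 8), bias=False), nn.BatchNorm2d(F1), nn.ELU(),
    # Stage 2 (ST-Block): spatial projection across the C1 CSP channels,
    # followed by a finer temporal convolution and temporal pooling.
    nn.Conv2d(F1, F1 * D, (C1, 1), groups=F1, bias=False), nn.BatchNorm2d(F1 * D), nn.ELU(),
    nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8), bias=False), nn.BatchNorm2d(F1 * D), nn.ELU(),
    nn.AvgPool2d((1, 4)),
    # Stage 3 (DSF-Block): depthwise + pointwise temporal refinement and pooling.
    nn.Conv2d(F1 * D, F1 * D, (1, 32), groups=F1 * D, padding=(0, 16), bias=False),
    nn.Conv2d(F1 * D, F2, 1, bias=False), nn.BatchNorm2d(F2), nn.ELU(),
    nn.AvgPool2d((1, 8)),
)

x = torch.randn(4, 1, C1, T1)   # a batch of CSP-filtered EEG windows
print(stages(x).shape)          # approximately (4, F2, 1, T1 // 32)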


Figure 1. The overall structure of AMANet.

To ensure consistent tensor dimensions, each convolutional layer employs padding tailored to its kernel size. The model parameters are listed in Table 1. With the exception of the CSP block, ELU activations are used throughout, which avoids the ReLU “dead neuron” issue and improves gradient flow. Within the ECA mechanism, channel weights are normalized to the (0, 1) range by a sigmoid function, preventing extreme weights from biasing the feature representation.


Table 1. Hyperparameters of the AMANet model.

2.1 Data augmentation block

As illustrated in Figure 2, the block first applies a sliding window strategy with a window size of 500 and a stride of 75 to the EEG signals (with T sampling points), resulting in five overlapping sub-windows. The main reason for adopting this commonly used high-overlap strategy (Hwang et al., 2023) is that it significantly increases the number of data samples while enhancing the model’s robustness and generalization ability without compromising temporal resolution, thereby providing richer training data for subsequent classification tasks. The sliding window transforms the original feature representation from $X \in \mathbb{R}^{C \times T}$ to $X_1 \in \mathbb{R}^{C \times T_1}$. Subsequently, the CSP algorithm is employed to extract spatial features from the segmented signals. To prevent data leakage between the training and test sets, the CSP spatial filters are strictly computed on the training set and then applied to the corresponding test set, ensuring that information from the test data is not utilized in the estimation of the spatial filters. CSP computes the normalized covariance matrices of the different classes and performs generalized eigenvalue decomposition to jointly diagonalize them, thereby extracting the most discriminative spatial filters. The top six pairs of filters are selected to produce the output $X_2 \in \mathbb{R}^{C_1 \times T_1}$. Finally, to enhance the model’s robustness to amplitude variations, the EEG signals are further subjected to linear scaling, which simulates inter-trial amplitude differences without altering the time–frequency structure. This procedure not only augments the training data and mitigates overfitting but also improves the model’s robustness to amplitude fluctuations, as shown in Equation 1:


Figure 2. Schematic of the data augmentation block.

$X_{\text{augmented}} = \beta \cdot X_2 \quad (1)$

where $\beta$ denotes the scaling factor. It is drawn from a normal or uniform distribution (default scaling range: 0.1) and perturbs the signal amplitude within 90 to 110% of its original value. The augmented signals are then fed into the subsequent network to further extract spatiotemporal features and improve classification performance.
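As a concrete illustration of the three augmentation steps, a minimal NumPy/SciPy sketch is given below: sliding-window segmentation, a two-class CSP fit on training trials only, and random amplitude scaling according to Equation 1. The helper names (sliding_windows, csp_filters, scale_amplitude), the synthetic data, and the two-class simplification are assumptions for illustration, not the authors’ implementation.

import numpy as np
from scipy.linalg import eigh

def sliding_windows(trial, win=500, stride=75):
    # Cut one trial (channels x T) into overlapping windows (n_windows x channels x win).
    starts = range(0, trial.shape[1] - win + 1, stride)
    return np.stack([trial[:, s:s + win] for s in starts])

def csp_filters(class_a, class_b, n_pairs=6):
    # Fit CSP on training trials only; returns 2 * n_pairs spatial filters.
    def mean_cov(trials):
        covs = [t @ t.T / np.trace(t @ t.T) for t in trials]  # normalized covariances
        return np.mean(covs, axis=0)
    ca, cb = mean_cov(class_a), mean_cov(class_b)
    w, v = eigh(ca, ca + cb)                                  # generalized eigendecomposition
    picks = np.concatenate([np.arange(n_pairs),               # smallest eigenvalues
                            np.arange(len(w) - n_pairs, len(w))])  # largest eigenvalues
    return v[:, picks].T                                      # (2 * n_pairs, channels)

def scale_amplitude(x, low=0.9, high=1.1, rng=None):
    # Linear scaling: multiply by a factor beta drawn from [low, high] (Equation 1).
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(low, high) * x

# Synthetic example: 20 training trials per class, 22 channels, 1,000 samples.
rng = np.random.default_rng(1234)
train_a = rng.standard_normal((20, 22, 1000))
train_b = rng.standard_normal((20, 22, 1000))
W = csp_filters(train_a, train_b)                 # fit on the training set only
windows = sliding_windows(train_a[0])             # (n_windows, 22, 500)
augmented = np.stack([scale_amplitude(W @ w, rng=rng) for w in windows])
print(augmented.shape)                            # (n_windows, 12, 500)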

2.2 Multi-scale temporal block

The MST-Block is composed of the MS-Block and the ST-Block. The MS-Block, illustrated in Figure 3, is primarily designed to dynamically extract multi-scale temporal features from EEG signals (Tao et al., 2024; Chang et al., 2025). According to previous studies (Lawhern et al., 2018; Ingolfsson et al., 2020), smaller convolutional kernels are more effective in capturing the temporal characteristics of EEG data. Therefore, the MS-Block employs three temporal convolutional layers with kernel sizes of (1, 8), (1, 16), and (1, 24), each comprising $F_1$ filters. The resulting feature tensor is denoted as $X_1^{\text{augmented}} \in \mathbb{R}^{F_1 \times C_1 \times T_1}$. Each convolution is followed by batch normalization (BN) and the ELU activation function to enhance model stability and representational capacity. The outputs of the three branches are concatenated along the channel dimension and fused by a pointwise convolution, preserving temporal integrity while reducing computational cost, so the final output tensor remains $X_1^{\text{augmented}} \in \mathbb{R}^{F_1 \times C_1 \times T_1}$. This module effectively captures the multi-scale temporal features of EEG data, laying a solid foundation for subsequent spatial modeling and deep feature extraction.
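A minimal PyTorch sketch of the MS-Block is shown below. The three kernel sizes and the BN/ELU layers follow the description above, while the class name, the padding choices, and the pointwise fusion layer are illustrative assumptions rather than the authors’ code.

import torch
import torch.nn as nn

class MSBlock(nn.Module):
    # Three parallel temporal convolutions whose outputs are concatenated and
    # fused back to F1 channels by a pointwise convolution (illustrative sketch).
    def __init__(self, f1=8, kernels=(8, 16, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, f1, (1, k), padding=(0, k // 2), bias=False),
                nn.BatchNorm2d(f1),
                nn.ELU(),
            )
            for k in kernels
        ])
        self.fuse = nn.Conv2d(len(kernels) * f1, f1, kernel_size=1, bias=False)

    def forward(self, x):                              # x: (B, 1, C1, T1)
        feats = [branch(x) for branch in self.branches]
        t = min(f.shape[-1] for f in feats)            # trim to a common temporal length
        feats = [f[..., :t] for f in feats]
        return self.fuse(torch.cat(feats, dim=1))      # (B, F1, C1, ~T1)

x = torch.randn(2, 1, 12, 500)
print(MSBlock()(x).shape)                              # approximately (2, 8, 12, 500)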


Figure 3. Multi-scale temporal convolutional structure.

The ST-Block, shown in Figure 4, is designed to extract spatial features from EEG signals and then apply further temporal convolution. It is composed of two consecutive convolutional layers, each followed by BN and an ELU activation. The first layer performs a global spatial convolution using $D = 2$ spatial filters of size $(C_1, 1)$, where $C_1$ denotes the number of CSP feature channels. By setting groups $= F_1$, each filter operates on a single channel to learn inter-channel relationships without mixing information. Its output feature map is represented as $X_2^{\text{augmented}} \in \mathbb{R}^{F_1 D \times 1 \times T_1}$. The second layer applies a temporal convolution with kernel size $(1, K_1/4)$ to capture fine-grained time-domain features. Subsequently, average pooling is performed along the temporal axis to reduce the feature-map dimensions, decrease computational cost, suppress noise, and enhance robustness. The final output is $X_3^{\text{augmented}} \in \mathbb{R}^{F_1 D \times 1 \times T_1/4}$. Additionally, dropout is introduced to mitigate overfitting.
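The following sketch illustrates the ST-Block as described above (grouped spatial convolution, temporal convolution with kernel (1, K1/4), average pooling, and dropout); parameter defaults follow the paper’s notation, but the class itself is an illustrative assumption, not the authors’ implementation.

import torch
import torch.nn as nn

class STBlock(nn.Module):
    # Grouped spatial convolution over the C1 CSP channels, then a finer temporal
    # convolution with kernel (1, K1 // 4) and temporal average pooling (sketch).
    def __init__(self, f1=8, d=2, c1=12, k1=64, pool=4, dropout=0.3):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(f1, f1 * d, (c1, 1), groups=f1, bias=False),  # groups=F1: no channel mixing
            nn.BatchNorm2d(f1 * d),
            nn.ELU(),
        )
        self.temporal = nn.Sequential(
            nn.Conv2d(f1 * d, f1 * d, (1, k1 // 4), padding=(0, k1 // 8), bias=False),
            nn.BatchNorm2d(f1 * d),
            nn.ELU(),
            nn.AvgPool2d((1, pool)),
            nn.Dropout(dropout),
        )

    def forward(self, x):                      # x: (B, F1, C1, T1)
        return self.temporal(self.spatial(x))  # (B, F1*D, 1, ~T1 // 4)

x = torch.randn(2, 8, 12, 500)
print(STBlock()(x).shape)                      # approximately (2, 16, 1, 125)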


Figure 4. Spatial feature refinement block structure.

2.3 Efficient channel attention block

ECA (Jia et al., 2023) is a lightweight yet effective channel attention mechanism designed to model inter-channel dependencies, enabling the network to focus on more discriminative feature channels. In this study, the design principle of ECA is functionally similar to that of self-attention mechanisms, as both enhance discriminability by adaptively weighting latent feature dimensions (Du et al., 2022). However, ECA achieves this in a more lightweight manner, making it suitable for small-sample EEG tasks. It should be noted that ECA is applied to the latent feature channels obtained after CSP spatial filtering and convolutional feature mapping. At this stage, each channel primarily reflects a combined representation of different spectral response patterns and spatial projection characteristics. The channel dimension no longer follows an explicit physical adjacency order; instead, it emphasizes the functional encoding of discriminative features. Therefore, in this work, ECA is not introduced based on an assumption of spatial adjacency. Rather, it is regarded as a lightweight adaptive weighting mechanism that leverages the latent relationships among features to emphasize important channels, suppress redundant channels, and highlight feature representations with higher discriminative contribution. As shown in Figure 5, unlike the Squeeze-and-Excitation (SE) (Zhu et al., 2023) module, ECA replaces the fully connected layers with a 1D convolution, thus reducing parameter complexity without sacrificing performance. Specifically, the output feature map $X_3^{\text{augmented}} \in \mathbb{R}^{F_1 D \times 1 \times T_1/4}$ of the ST-Block is fed into the ECA module. First, global average pooling (GAP) compresses each feature’s spatial dimensions (H × W) into a single scalar $z_c$, as shown in Equation 2:


Figure 5. ECA block structure, where $C_1 = 12$ or $2$, $H = 1$, $W = T_1/4 = 125$, and $K = 3$ denotes the convolutional kernel size. $X_3$ represents the input feature map, GAP denotes the global average pooling layer, and $X_c$ represents the output feature map.

$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_3^{\text{augmented}}(i, j), \quad c = 1, 2, \ldots, C_1 \quad (2)$

Subsequently, a feature vector $z = [z_1, z_2, \ldots, z_{C_1}]$ is obtained. A 1D convolution with a small kernel ($k = 3$) is then applied along the channel dimension to capture local cross-channel dependencies. The kernel size is calculated as shown in Equation 3:

$k = \left( \log_2(C_1) + b \right) / \gamma, \quad \gamma = 2, \; b = 1 \quad (3)$

The convolution result is subsequently normalized via a sigmoid activation function to generate channel-wise weighting coefficients, as defined in Equation 4:

$s = \sigma\left(\mathrm{Conv1D}_k(z)\right), \quad \sigma = \mathrm{Sigmoid} \quad (4)$

Finally, the weighting coefficients are multiplied in a channel-wise manner with the original input features to achieve weighted feature optimization, yielding a final output $X_c \in \mathbb{R}^{F_1 D \times 1 \times T_1/4}$, as shown in Equation 5:

$X_c = s_c \cdot X_3^{\text{augmented}}, \quad c = 1, 2, \ldots, C_1 \quad (5)$

The ECA module is introduced after the ST-Block to enhance the model’s focus on spatial features, thereby optimizing the feature representations fed into the DSF-Block and further improving overall classification performance.
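For reference, a standard ECA implementation corresponding to Equations 2–5 is sketched below, together with the adaptive kernel-size rule of Equation 3 forced to an odd value (an assumption consistent with k = 3 in Figure 5); it is not the authors’ exact code.

import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    # Equation 3; the result is forced to be odd (k = 3 for the channel counts used here).
    k = int(abs((math.log2(channels) + b) / gamma))
    return k if k % 2 else k + 1

class ECA(nn.Module):
    # Global average pooling (Eq. 2), 1-D convolution along the channel axis (Eq. 4),
    # sigmoid gating and channel-wise rescaling of the input (Eq. 5).
    def __init__(self, k_size=3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: (B, C, H, W)
        z = self.gap(x)                                      # (B, C, 1, 1)
        s = self.conv(z.squeeze(-1).transpose(-1, -2))       # convolve across channels
        s = self.sigmoid(s.transpose(-1, -2).unsqueeze(-1))  # (B, C, 1, 1) weights in (0, 1)
        return x * s                                         # reweighted feature map

x = torch.randn(2, 16, 1, 125)                 # X3: (B, F1*D, 1, T1/4)
print(ECA(eca_kernel_size(16))(x).shape)       # (2, 16, 1, 125)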

2.4 Depthwise separable fusion block

The DSF-Block is devised to extract deeper temporal features. Its first layer is a depthwise convolution that performs temporal convolution independently on each input channel to prevent mixing of inter-channel information. Unlike the spatially focused depthwise convolution in the ST-Block, the DSF-Block emphasizes deep temporal feature mining; thus, setting groups $= F_1 \times D$ ensures each filter operates only on its corresponding channel’s time series. The second layer is a pointwise convolution with a $1 \times 1$ kernel that fuses information across channels and enhances feature representation. An average pooling layer is also introduced to reduce the parameter count. The final output feature map is represented as $X_c \in \mathbb{R}^{F_1 D \times 1 \times T_1/32}$. The architecture of the DSF-Block is shown in Figure 6. Through this carefully designed hierarchy, the DSF-Block efficiently extracts deep temporal features and fully integrates inter-channel information, providing critical support for the model’s final classification performance.
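A minimal sketch of the DSF-Block, assuming K2 = 32, F2 = 16, and a pooling factor of 8 so that the temporal length shrinks from T1/4 to roughly T1/32, is given below; the class name and the BN/ELU placement are illustrative assumptions rather than the authors’ code.

import torch
import torch.nn as nn

class DSFBlock(nn.Module):
    # Depthwise temporal convolution (groups = F1*D) followed by a 1x1 pointwise
    # fusion convolution and temporal average pooling (illustrative sketch; K2 = 32).
    def __init__(self, in_ch=16, f2=16, k2=32, pool=8):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, (1, k2), groups=in_ch,
                                   padding=(0, k2 // 2), bias=False)
        self.pointwise = nn.Conv2d(in_ch, f2, kernel_size=1, bias=False)
        self.post = nn.Sequential(nn.BatchNorm2d(f2), nn.ELU(), nn.AvgPool2d((1, pool)))

    def forward(self, x):                                    # x: (B, F1*D, 1, T1/4)
        return self.post(self.pointwise(self.depthwise(x)))  # (B, F2, 1, ~T1/32)

x = torch.randn(2, 16, 1, 125)
print(DSFBlock()(x).shape)                                   # approximately (2, 16, 1, 15)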


Figure 6. Depthwise separable fusion block structure.

2.5 Classification block

In the classification block, the input features are flattened into a vector of dimension $F_2 \times (T_1/32)$. They first pass through a fully connected layer that reduces this to $F_2 \times 2$, followed by an ELU activation to introduce nonlinearity and enhance representational capacity. Dropout is then applied to mitigate overfitting and improve generalization. A second fully connected layer maps the $F_2 \times 2$ features to the final number of classes, yielding the decision logits. Instead of explicitly applying Softmax, these logits are fed directly into PyTorch’s CrossEntropyLoss, which internally combines LogSoftmax and NLLLoss, to compute the classification loss in a numerically stable and streamlined manner.
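A minimal sketch of this classification head is shown below, with the flattened feature size matching F2 × (T1/32) (about 16 × 15 for Dataset 2a); the class name and the exact dropout placement are assumptions for illustration.

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Flatten -> FC to F2*2 -> ELU -> Dropout -> FC to n_classes; the raw logits
    # are passed directly to CrossEntropyLoss (LogSoftmax + NLLLoss internally).
    def __init__(self, f2=16, t_out=15, n_classes=4, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # F2 x (T1/32) features per trial
            nn.Linear(f2 * t_out, f2 * 2),
            nn.ELU(),
            nn.Dropout(dropout),
            nn.Linear(f2 * 2, n_classes),        # decision logits, no explicit Softmax
        )

    def forward(self, x):                        # x: (B, F2, 1, T1/32)
        return self.net(x)

head = ClassificationHead()
logits = head(torch.randn(4, 16, 1, 15))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
print(logits.shape, float(loss))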

3 Experiments

3.1 Dataset

To comprehensively evaluate the performance of the proposed network, this study utilizes three motor imagery EEG datasets: Dataset 2a and Dataset 2b from BCI Competition IV, and the High-Gamma Dataset (HGD).

Dataset 2a (Brunner et al., n.d.) includes EEG recordings from 9 subjects with 25 channels, which are band-pass filtered between 0.5–100 Hz. Each subject completed two sessions on different days, each containing 288 eight-second trials involving motor imagery of the left hand, right hand, both feet, or tongue. For analysis, EEG segments from 2 to 6 s after the cue were selected to capture stable task-related activity. Three EOG channels (left, right, center) were removed, thus leaving 22 EEG channels for classification.

Dataset 2b (Leeb et al., n.d.) also involves EEG recordings from 9 subjects, with only three channels: C3, Cz, and C4. The signals were sampled at 250 Hz and underwent band-pass filtering between 0.5 and 100 Hz. The experimental task involved motor imagery of the left and right hands, which comprised two non-feedback sessions and three feedback sessions. Each non-feedback session included 6 blocks per hand, with 10 trials per block, thus totaling 120 trials. Each feedback session consisted of 4 blocks per hand, with 20 trials per block, thus totaling 160 trials.

HGD dataset (Schirrmeister et al., 2017) consists of EEG recordings from 14 subjects acquired using 128 channels at a sampling rate of 500 Hz, with the signals band-pass filtered between 0.5 and 100 Hz and additionally processed with a 50 Hz notch filter to suppress power-line noise. The experimental paradigm involves four motor imagery tasks, corresponding to left hand, right hand, both feet, and tongue movements. Each subject completed four sessions, with 160 trials per session. In this study, EEG segments from 0.5 to 4.5 s after cue onset, corresponding to the stable task period, were selected for analysis, and all EEG channels were utilized for classification.

3.2 Training procedure

To evaluate the effectiveness of the AMANet model, this study adopted 10-fold cross-validation and utilized the Adam optimizer for model training. The initial learning rate was set to 0.001, the batch size was set to 32, and the dropout rate was set to 0.3. Taking the BCI Competition IV-2a dataset as an example, the data of each subject was partitioned into 10 subsets. In each round, one subset served as the validation set, whereas the remaining nine subsets were used for training. This procedure was iterated ten times. The adoption of 10-fold cross-validation not only maximizes data utilization but also helps prevent overfitting and enhances the model’s generalization capability. The random seed was fixed at 1234, and training was conducted for 300 epochs. Early stopping was not applied. The model was implemented in PyTorch and trained on an NVIDIA GeForce GTX 1660 Ti GPU.
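The following sketch reproduces this training protocol (10-fold cross-validation, Adam with a learning rate of 0.001, batch size 32, 300 epochs, fixed seed 1234, no early stopping) using a placeholder model and synthetic data standing in for AMANet and the per-subject EEG; these stand-ins are assumptions for illustration only.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold

torch.manual_seed(1234)

# Stand-ins: X would hold the augmented CSP-filtered windows of one subject
# with shape (n_samples, 1, C1, T1), and y the corresponding class labels.
X = torch.randn(360, 1, 12, 500)
y = torch.randint(0, 4, (360,))

def build_model():
    # Placeholder for AMANet; any nn.Module with matching input/output shapes works here.
    return nn.Sequential(nn.Flatten(), nn.Linear(12 * 500, 4))

criterion = nn.CrossEntropyLoss()
fold_accuracies = []
kfold = KFold(n_splits=10, shuffle=True, random_state=1234)
for train_idx, val_idx in kfold.split(X):
    model = build_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loader = DataLoader(TensorDataset(X[train_idx], y[train_idx]), batch_size=32, shuffle=True)
    for epoch in range(300):                    # fixed number of epochs, no early stopping
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            criterion(model(xb), yb).backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        preds = model(X[val_idx]).argmax(dim=1)
    fold_accuracies.append((preds == y[val_idx]).float().mean().item())

print(f"mean 10-fold accuracy: {sum(fold_accuracies) / len(fold_accuracies):.4f}")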

4 Results and discussion

This study trains and evaluates AMANet on the four-class BCI Competition IV Dataset 2a, the binary Dataset 2b, and the High-Gamma dataset, comparing its classification accuracy with that of baseline models. Results are visualized via confusion matrices. Ablation experiments are subsequently conducted to demonstrate the contributions of CSP and the other AMANet components. Finally, key parameters, such as the number of temporal and spatial filters, are tuned to select the optimal classification model.

4.1 Classification accuracy

As shown in Figure 7, the classification accuracies of the AMANet model for nine subjects (S1–S9) in the BCI Competition IV-2a and IV-2b datasets are presented. Overall, the model achieved an average accuracy of 84.06% on the IV-2a dataset. Notably, S3 (91.32%), S7 (89.58%), and S8 (93.40%) attained the highest accuracies, indicating a strong discriminative capability for these subjects, which may be attributed to more distinct signal features, higher data quality, or greater consistency in task execution strategies. In contrast, S5 (74.65%) and S6 (76.39%) exhibited comparatively lower performance, possibly owing to insufficient EEG samples or poorer signal quality. For the 2b dataset, accuracies also varied substantially across subjects, with S4 achieving the highest accuracy (90.14%) and S3 the lowest (81.07%). Overall, all subjects exceeded an accuracy of 80%, demonstrating the model’s strong recognition ability for most individuals. Nevertheless, these results also suggest that AMANet could be further optimized to better address inter-subject variability in the IV-2b dataset.


Figure 7. Classification results of different subjects in the IV-2a and IV-2b datasets.

4.2 Model comparison

The classification accuracies and kappa values of our proposed AMANet and six benchmark models, namely EEGNet (Lawhern et al., 2018), Incep-EEGNet (Riyad et al., 2020), FBCSP (Ang et al., 2008), ShallowConvNet, DeepConvNet (Schirrmeister et al., 2017) and TCNet Fusion (Musallam et al., 2021), on the BCI Competition IV 2a dataset are presented in Table 2. At the single-subject level, AMANet achieved the highest accuracy and κ on S2, S4, S6 and S8, with S8 reaching an accuracy of 93.40% and a κ of 0.91. Compared with the benchmarks, this corresponds to improvements of 4.48–13.85% in accuracy and 0.06–0.20 in κ. The exceptional result on S7 indicates strong subject-specific adaptability. In contrast, S5 and S6 recorded lower accuracies (74.65 and 76.39%) and κ values (0.66 and 0.68). This may be attributed to EEG noise or artifacts affecting feature extraction or to suboptimal parameter settings leading to overfitting. Notably, despite the relatively low performance of S6, AMANet still outperformed all benchmark models on that subject, further demonstrating its superiority. In addition, except for the TCNet Fusion model (p > 0.05), the differences between AMANet and all other baseline models are statistically significant (p < 0.05). Overall, AMANet achieved a mean accuracy of 84.06% and a mean κ of 0.78, both exceeding those of the benchmark models, indicating superior classification consistency and stability.


Table 2. Comparison of classification accuracy (%) and kappa value (k) on the BCI Competition IV-2a dataset (4 classes).

Table 3 presents the classification results of the AMANet model on the BCI Competition IV-2b dataset, compared with five benchmark models: FBCSP (Ang et al., 2008), EEGNet (Lawhern et al., 2018), EEG-ITNet (Salami et al., 2022), 1D-Multi-scale-CNN (Tang et al., 2020) and SHNN (Liu et al., 2022). The results show that AMANet achieved the highest or near-optimal performance for most subjects. For instance, it attained peak accuracies of 87.35 and 84.29% for S1 and S2, respectively, significantly outperforming the compared models. For S9, it achieved 84.11%, slightly lower than that of 1D-Multi-scale-CNN (86.81%) but far higher than that of EEG-ITNet (55.31%). Although the accuracy of AMANet was marginally lower than EEG-ITNet and SHNN for certain subjects (e.g., S4 and S8), its overall performance remained stable and reliable. Notably, for subjects S3, S5, S6, and S7, where EEG-ITNet performed better, AMANet still maintained accuracies above 80%, which demonstrates strong robustness. Overall, AMANet achieved an average classification accuracy of 85.09% across all subjects, surpassing EEG-ITNet (82.59%) and SHNN (83.49%), thus validating its effectiveness and comprehensive advantage in motor imagery BCI classification tasks.


Table 3. Comparison of classification accuracy (%) on the BCI competition IV-2b dataset.

The classification accuracy and Kappa values of different models on the High-Gamma dataset are presented in Table 4. As shown in the table, the traditional FBCSP method yields the poorest performance, with an average accuracy of only 71.04% and an average Kappa of 0.62, whereas deep learning models significantly outperform traditional approaches. Specifically, ShallowConvNet, DeepConvNet, and TCNet Fusion achieve average accuracies exceeding 94% with Kappa values around 0.94, which demonstrates the strong capability of deep convolutional networks in extracting high-frequency motor imagery EEG features. EEGNet and Incep-EEGNet, as lightweight convolutional networks, also exhibit stable performance, with average accuracies of 89.67 and 93.03%, and corresponding Kappa values of 0.85 and 0.91. Notably, the proposed model consistently achieves excellent performance across all subjects, with an average accuracy of 95.48% and a Kappa of 0.94. It attains the highest or near-highest classification performance for subjects such as S1, S2, and S5, highlighting its superior capability in capturing key high-frequency EEG features and its robust cross-subject stability. Overall, although various deep convolutional networks can effectively improve motor imagery EEG classification performance, the proposed model consistently achieves the best or near-best performance across subjects. This indicates that our model has stronger feature extraction ability and higher classification stability, thus significantly outperforming existing methods.


Table 4. Comparison of classification accuracy (%) and kappa values (k) on the high-gamma dataset.

Table 5 presents a comparison between AMANet and other mainstream models on the IV-2a dataset in terms of trainable parameters, computational complexity (MACs), memory requirements, and classification accuracy. MACs (Multiply–Accumulate Operations) measure the total number of multiplication–addition operations during a single forward pass, while memory requirements refer to the total memory occupied by the output feature maps of all network layers (Li et al., 2023). Specifically, AMANet has 8.09 times more trainable parameters than EEGNet, yet still fewer than those of other compared models; its average MACs are 1.22 times higher than those of EEGNet, but remain lower than those of the other baselines. Notably, due to the increased network depth, its memory usage is increased by 17.34 times compared to EEGNet. These results indicate that AMANet achieves higher classification accuracy by enhancing the feature extraction capability at the cost of additional parameters, computational complexity, and memory consumption (Rizvi et al., 2023).


Table 5. Comparison of model complexity and performance on the BCI competition IV-2a dataset.

4.3 Visualization

This section employs confusion matrices and t-distributed stochastic neighbor embedding (t-SNE) to evaluate the model’s performance. The confusion matrices of the AMANet model for five subjects (S1, S3, S7, S8, S9) from the BCI Competition IV-2a dataset are presented in Figure 8. Each matrix includes the labels “left,” “right,” “foot,” and “tongue,” where diagonal entries represent correct classification rates and off-diagonal entries indicate misclassifications. The results demonstrate that subject S8 achieved the best overall performance, with the darkest diagonal and the lightest off-diagonal colors. Specifically, the accuracies for “tongue” and “left” were the highest, reaching 94.1 and 93.3%, respectively. The “foot” class achieved an accuracy of 81.2%, with 18.8% misclassified as “tongue”; this may stem from the overlapping spatial activation patterns of the two types of EEG signals. Previous studies have indicated that foot movements primarily activate the central sensorimotor area, whereas tongue movements are more associated with the lateral cortex (Pfurtscheller et al., 2006). However, individual differences may make it difficult for CSP features to achieve effective separation. The “right” class had the lowest accuracy of 80.0%, with 20.0% misclassified as “tongue,” which reflects the similarity of EEG features between left- and right-hand motor imagery. Overall, AMANet demonstrated superior performance on S8 in comparison with other subjects, suggesting its effectiveness in capturing individual EEG characteristics. Nevertheless, inter-subject variability remains, which may be attributed to the higher signal quality or more stable motor imagery features in S8. These findings highlight the potential of incorporating personalized training strategies for further enhancing the model’s generalization and robustness across individuals.


Figure 8. Confusion matrix of dataset 2a (S1, S3, S7, S8, S9).

To further evaluate the proposed model’s performance in feature separability, this study employed the t-SNE method to visualize the distribution of different motor imagery samples in a low-dimensional space. The feature distribution of subject S3 from the BCI Competition IV 2a and 2b datasets is illustrated in Figure 9. The results indicate that for S3, the four-class motor imagery tasks (left hand, right hand, foot, and tongue) exhibit well-formed clusters with clear inter-class boundaries in the two-dimensional space, which demonstrates the model’s strong capability in feature extraction and multi-class discrimination. In contrast, the two-class tasks (left hand and right hand) for the same subject exhibit a certain degree of overlap, although relatively distinct boundaries can still be observed. Overall, the model demonstrates more prominent discriminative performance in multi-class tasks, while some room for improvement remains in binary tasks for certain subjects.


Figure 9. t-SNE distribution map of dataset S3 for 2a and 2b.

4.4 Ablation study

Ablation results for AMANet’s multi-scale module, attention mechanism, and data augmentation on Dataset 2a are reported in Table 6. With only the attention mechanism and augmentation enabled (excluding the multi-scale module), the accuracy was 81.56%. Adding the multi-scale module further raises the accuracy to 84.06%, the highest value, which demonstrates its effectiveness in capturing discriminative features across temporal scales (Shen et al., 2022). Removing the attention mechanism (while retaining only the multi-scale module and augmentation) reduces the accuracy to 81.13%, indicating the critical role of the attention module in allocating channel weights to enhance key features. Omitting data augmentation (while using only the multi-scale module and attention) results in an accuracy drop to 75.67%, underscoring the importance of data augmentation in mitigating overfitting and improving robustness in small-sample EEG scenarios. Further analysis of the roles of the ST-Block and DSF-Block shows that when the ST-Block is removed, leaving only the multi-scale module, attention mechanism, and DSF-Block, the accuracy drops to 77.39%, indicating that the ST-Block plays an important role in spatiotemporal feature extraction and in capturing long-term temporal dependencies, contributing significantly to model performance. Conversely, when the DSF-Block is removed, leaving only the multi-scale module, attention mechanism, and ST-Block, the accuracy decreases to 79.06%, demonstrating that the DSF-Block is also indispensable for spatial feature fusion and modeling feature dependencies. Overall, these results validate the synergistic contributions of the multi-scale module, attention mechanism, data augmentation, ST-Block, and DSF-Block in enhancing model performance, with each component making a substantial contribution to the final classification accuracy.


Table 6. Impact of different modules on the classification accuracy of the AMANet model.

To assess the specific effects of the CSP and ECA modules, five model variants (Figure 10) were designed. The baseline model, WMNet, which excludes both CSP and ECA, achieved an accuracy of 78.37% with a κ value of 0.70. When only the ECA module was added (WMANet), the performance decreased to 77.45% accuracy with a κ value of 0.69, indicating that without CSP providing discriminative spatial features, ECA may amplify irrelevant or redundant feature information, which is detrimental to classification. By contrast, CMNet, which includes CSP but does not use data augmentation, achieved 76.87% accuracy with a κ value of 0.69, lower than that of WMNet. This suggests that applying CSP alone, when training data diversity is insufficient, may lead to underfitting of specific spatial patterns in the training set, limiting generalization. When AMNet incorporated CSP while retaining data augmentation, performance improved substantially, achieving 81.13% accuracy with a κ value of 0.73. This demonstrates that data augmentation enhances the robustness of the discriminative spatial features extracted by CSP. Finally, the full model, combining CSP and ECA modules under data augmentation, achieved the best performance with 84.06% accuracy with a κ value of 0.78, indicating that CSP provides reliable spatial discriminative features while ECA performs fine-grained weighting on these features, resulting in a synergistic effect that significantly enhances overall classification capability.


Figure 10. Classification accuracy and kappa values for different network structures.

4.5 Parameter optimization

4.5.1 Selection of the convolutional kernel size

The impact of different convolutional kernel size combinations in the multi-scale temporal encoding module on motor imagery EEG classification accuracy is illustrated in Figure 11. On Dataset 2a, kernel selection proves critical: enlarging the kernel set from {4, 8, 12} to {8, 16, 24} increases the accuracy from 81.35 to 84.06%, demonstrating that a moderate kernel expansion broadens the receptive field and enhances the capture of temporal features. However, further enlarging to {12, 18, 32} and {16, 32, 48} reduces the accuracy to 83.47 and 82.79%, respectively—likely because overly large kernels, despite extracting wider temporal context, markedly increase the parameter count, complicate the training process, over-smooth features, obscure local details, and promote overfitting. Therefore, model design must balance the feature extraction capacity against complexity, with {8, 16, 24} identified as the optimal kernel combination.


Figure 11. Classification results of different combinations of convolution kernels.

4.5.2 Attention mechanism selection

To analyze the impact of different attention mechanisms on classification performance, this study compared four settings: ECA, CBAM, SE, and No-Attention, with the experimental results summarized in Table 7. Overall, all three attention mechanisms achieved more stable performance improvements compared to the model without attention, confirming the effectiveness of incorporating attention mechanisms in EEG motor imagery classification tasks. For individual subject results, ECA achieved the best performance on S1, S3, and S8, demonstrating its strong feature enhancement capability for specific subjects. CBAM achieved the highest classification accuracy on S2, S5, and S7, with S7 reaching 93.10%, the highest among all subjects, indicating that the simultaneous application of channel and spatial attention helps to effectively model the local saliency distribution of EEG features. SE slightly outperformed other methods on S3 and S9, reflecting its advantage in modeling channel-wise dependencies under certain scenarios. In contrast, the No-Attention model generally exhibited relatively lower performance across most subjects. Regarding overall average performance, ECA achieved the highest mean classification accuracy of 84.06%, with a Kappa coefficient of 0.78, which is also superior to CBAM and SE (both 0.76) and the No-Attention model (0.73). This indicates that ECA offers superior stability and overall generalization. The underlying reason may be that SE recalibrates channel weights through global average pooling and fully connected layers, and while it can model channel dependencies, it may lose some fine-grained discriminative information during the “squeeze-and-excitation” process. CBAM, by incorporating both channel and spatial attention, can highlight local salient features more precisely but features a more complex structure, which increases the risk of overfitting in small-sample EEG scenarios. In contrast, ECA enables local interactions along the channel dimension using lightweight one-dimensional convolutions, effectively preserving fine-grained feature information without dimensional compression. This not only reduces model complexity but also improves overall stability and generalization. Considering accuracy, Kappa coefficient, and model complexity, ECA was ultimately selected as the attention module for AMANet, ensuring high performance while maintaining computational efficiency and model robustness.


Table 7. Comparison of different attention mechanisms.

4.5.3 Sliding window selection

To analyze the impact of the sliding window approach on experimental results, the classification accuracy under several conditions was compared in this study: without a sliding window, and with a window length of 500 and step sizes of 250, 125, 75, and 25, respectively. The results are presented in Table 8. Without the application of a sliding window, the average accuracies on the 2a and 2b datasets were 75.85 and 76.14%, respectively. After the employment of the sliding window method, the lowest average accuracies increased to 78.15 and 78.45%, which indicates that the method not only enlarged the sample size but also significantly improved the classification performance. Notably, when the step size was set to 75, the average accuracies of the nine subjects in the two datasets reached 84.06 and 85.09%, achieving the best performance. However, when the step size was further reduced to 25, the accuracies dropped by 2.83 and 1.24%, respectively. This decline may be attributed to excessive overlap in the data, which made the model more sensitive to redundant information and thus negatively affected classification outcomes.


Table 8. Classification accuracy (%) at different sliding window time intervals.

Next, the sliding interval was fixed at 75, and window lengths of 750, 625, 500, 375, and 250 were tested. As shown in Table 9, when the sliding window length is 500, the average classification accuracies on the two datasets reach their highest values of 84.06 and 85.09%, achieving the optimal classification performance. Therefore, the sliding window length is set to 500 and the sliding interval to 75 in this paper.


Table 9. Classification accuracy (%) under different sliding window lengths.

4.5.4 Different parameter selection for the model

The performance in terms of accuracy and parameter count of four model configurations (MC1, MC2, MC3, and MC4) on the 2a dataset is compared in Table 10. The results indicate that Configuration MC4 demonstrates the best overall performance. Although it has the smallest parameter count among the four configurations (only 22,000), it achieves an accuracy of 84.06%, which is only 0.20% lower than that of Configuration MC1 (84.26%), even though MC1 has a significantly larger parameter count of 188,000. In contrast, Configurations MC2 and MC3 both have 84,000 parameters and achieve the same accuracy of 82.35%. They neither offer an accuracy advantage nor require fewer parameters than Configuration MC4, thus making them suboptimal. Additionally, the comparison among Configurations MC2, MC3, and MC4 reveals that the model’s parameter count is primarily influenced by the number of $F_1$ and $F_2$ filters. Therefore, Configuration MC4 significantly reduces model complexity and computational cost while maintaining high accuracy, offering a clear efficiency advantage. This makes it particularly suitable for resource-constrained environments, such as deployment on mobile or edge devices. In summary, Configuration MC4 is the best-performing setup in this comparative experiment and represents the most suitable parameter combination for the current task requirements.


Table 10. Comparison of different network parameters of the model.

5 Conclusion

A Data-Augmented Multi-Scale Temporal Attention Convolutional Network (AMANet) was proposed in this work to enhance motor imagery EEG classification performance. By augmenting limited samples, AMANet effectively mitigates overfitting caused by insufficient training data and adaptively captures discriminative EEG features. To further assess the impact of architectural design, systematic analyses were conducted on convolutional kernel sizes, attention mechanisms, and sliding window lengths. Experimental results on the BCI Competition IV-2a, IV-2b, and High-Gamma datasets demonstrate that AMANet achieves classification accuracies of 84.06, 85.09, and 95.48%, respectively, significantly outperforming multiple benchmark methods and confirming its effectiveness in MI-EEG decoding. Ablation studies further reveal that both the multi-scale structure and the attention mechanism contribute positively to performance, and their integration enables more effective extraction of cross-temporal discriminative features. Although the proposed method has been validated against several classic and widely used baseline models, a limitation remains: we did not include some of the more recently proposed models characterized by substantially higher structural complexity and computational cost. Given that the primary goal of this work is to evaluate model suitability under small-sample conditions and limited computational resources, we deliberately selected comparison methods that are representative, stable, and practically reproducible. In future work, we will expand the set of competing methods to include more complex state-of-the-art architectures for a systematic assessment of our approach across varying levels of model complexity. Furthermore, to enhance the practical utility of the model, we will focus on two key directions: first, improving cross-subject generalization to increase the model’s adaptability in real-world scenarios; and second, applying model quantization, parameter compression, and transfer-learning-based techniques to reduce the memory and computational footprint, thereby optimizing deployability in edge computing and other resource-constrained environments.

Here, the number of EEG channels is denoted by $C = 22$. The number of spatial filters in the CSP block is represented by $C_1$, which is set to 12 for the 2a dataset and 2 for the 2b dataset. The numbers of sampling points are indicated by $T = 1{,}000$ and $T_1 = 500$, the number of temporal filters is represented by $F_1 = 8$, and the number of pointwise filters is denoted by $F_2 = 16$. The sizes of the different convolution kernels are denoted by $K_1 = 64$ and $K_2 = 32$, while the number of spatial filters per temporal filter is represented by $D = 2$. SW indicates the sliding window, and $\beta$ denotes the scaling range. Padding is applied during convolution so that the spatial dimensions of the output tensor are preserved.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/robintibor/high-gamma-dataset.

Ethics statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements because the study used only secondary data.

Author contributions

SW: Writing – original draft, Writing – review & editing. RW: Methodology, Writing – review & editing. LC: Methodology, Writing – review & editing. JW: Writing – review & editing. LH: Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Health Commission of Jiangxi Province under Key Science and Technology Innovation Grants 2023ZD008 and the Science and Technology Department of Shanghai, China, under Grant 23010501700.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Altaheri, H., Muhammad, G., and Alsulaiman, M. (2023). Physics-informed attention temporal convolutional network for EEG-based motor imagery classification. IEEE Trans. Ind. Inform. 19, 2249–2258. doi: 10.1109/TII.2022.3197419


Ang, K. K., Chin, Z. Y., Wang, C., Guan, C., and Zhang, H. (2012). Filter Bank common spatial pattern algorithm on BCI competition IV datasets 2a and 2b. Front. Neurosci. 6:39. doi: 10.3389/fnins.2012.00039,


Ang, K. K., Chin, Z. Y., Zhang, H., and Guan, C. (2008). Filter bank common spatial pattern (FBCSP) in brain-computer interface, in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (IEEE), 2390–2397.


Ang, K. K., Guan, C., Phua, K. S., Wang, C., Zhao, L., Teo, W. P., et al. (2015). Facilitating effects of transcranial direct current stimulation on motor imagery brain-computer Interface with robotic feedback for stroke rehabilitation. Arch. Phys. Med. Rehabil. 96, S79–S87. doi: 10.1016/j.apmr.2014.08.008,


Brunner, C., Leeb, R., Müller-Putz, G. R., Schlögl, A., and Pfurtscheller, G. (n.d.). BCI Competition 2008 – Graz data set A: experimental paradigm. Available online at: http://biosig.sourceforge.net/


Chang, L., Yang, B., Zhang, J., Li, T., Feng, J., and Xu, W. (2025). DSTA-net: dynamic spatio-temporal feature augmentation network for motor imagery classification. Cogn. Neurodyn. 19:118. doi: 10.1007/s11571-025-10296-0,


Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi: 10.1613/jair.953


Cho, J.-H., Jeong, J.-H., Shim, K.-H., Kim, D.-J., and Lee, S.-W.. 2018. Classification of hand motions within EEG signals for non-invasive BCI-based robot hand control., in 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), (IEEE), 515–518.


Condori, K. A., Urquizo, E. C., and Diaz, D. A. (2016). Embedded brain machine Interface based on motor imagery paradigm to control prosthetic hand., in 2016 IEEE ANDESCON, (IEEE), 1–4. doi: 10.1109/ANDESCON.2016.7836266


Craik, A., He, Y., and Contreras-Vidal, J. L. (2019). Deep learning for electroencephalogram (EEG) classification tasks: a review. J. Neural Eng. 16:031001. doi: 10.1088/1741-2552/ab0ab5,


Du, Y., Huang, J., Huang, X., Shi, K., and Zhou, N. (2022). Dual attentive fusion for EEG-based brain-computer interfaces. Front. Neurosci. 16:1044631. doi: 10.3389/fnins.2022.1044631,


Dudukcu, H. V., Taskiran, M., Cam Taskiran, Z. G., and Yildirim, T. (2023). Temporal convolutional networks with RNN approach for chaotic time series prediction. Appl. Soft Comput. 133:109945. doi: 10.1016/j.asoc.2022.109945

Crossref Full Text | Google Scholar

Graimann, B., Allison, B., and Pfurtscheller, G. (2009). Brain–computer interfaces: A gentle introduction, 1–27.

Google Scholar

Hersche, M., Rellstab, T., Schiavone, P. D., Cavigelli, L., Benini, L., and Rahimi, A.. 2018. Fast and accurate multiclass inference for MI-BCIs using large multiscale temporal and spectral features., in 2018 26th European Signal Processing Conference (EUSIPCO), (IEEE), 1690–1694.

Google Scholar

Huang, H., and Zhou, B.. 2021. Motion imagery classification based on the combination of CSP and deep learning., in 2021 2nd International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI), (IEEE), 152–155.

Google Scholar

Hwang, J., Park, S., and Chi, J. (2023). Improving multi-class motor imagery EEG classification using overlapping sliding window and deep learning model. Electronics 12:1186. doi: 10.3390/electronics12051186

Crossref Full Text | Google Scholar

Ingolfsson, T. M., Hersche, M., Wang, X., Kobayashi, N., Cavigelli, L., and Benini, L.. 2020. EEG-TCNet: an accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces., in 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), (IEEE), 2958–2965.

Google Scholar

Jia, H., Yu, S., Yin, S., Liu, L., Yi, C., Xue, K., et al. (2023). A model combining multi branch spectral-temporal CNN, Efficient Channel attention, and LightGBM for MI-BCI classification. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 1311–1320. doi: 10.1109/TNSRE.2023.3243992,

PubMed Abstract | Crossref Full Text | Google Scholar

Lawhern, V. J., Solon, A. J., Waytowich, N. R., Gordon, S. M., Hung, C. P., and Lance, B. J. (2018). EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. J. Neural Eng. 15:056013. doi: 10.1088/1741-2552/aace8c,

PubMed Abstract | Crossref Full Text | Google Scholar

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. doi: 10.1109/5.726791

Crossref Full Text | Google Scholar

Leeb, R., Brunner, C., Müller-Putz, G. R., Schlögl, A., and Pfurtscheller, G. n.d. BCI Competition 2008-Graz data set B Experimental paradigm Available online at: http://biosig.sourceforge.net/

Google Scholar

Li, H., Chen, H., Jia, Z., Zhang, R., and Yin, F. (2023). A parallel multi-scale time-frequency block convolutional neural network based on channel attention module for motor imagery classification. Biomed. Signal Process. Control. 79:104066. doi: 10.1016/j.bspc.2022.104066

Crossref Full Text | Google Scholar

Li, Y., Long, J., Yu, T., Yu, Z., Wang, C., Zhang, H., et al. (2010). An EEG-based BCI system for 2-D cursor control by combining mu/Beta rhythm and P300 potential. I.E.E.E. Trans. Biomed. Eng. 57, 2495–2505. doi: 10.1109/TBME.2010.2055564,

PubMed Abstract | Crossref Full Text | Google Scholar

Liao, W., Miao, Z., Liang, S., Zhang, L., and Li, C. (2025). A composite improved attention convolutional network for motor imagery EEG classification. Front. Neurosci. 19:1543508. doi: 10.3389/fnins.2025.1543508,

PubMed Abstract | Crossref Full Text | Google Scholar

Liu, C., Jin, J., Daly, I., Li, S., Sun, H., Huang, Y., et al. (2022). SincNet-based hybrid neural network for motor imagery EEG decoding. IEEE Trans. Neural Syst. Rehabil. Eng. 30, 540–549. doi: 10.1109/TNSRE.2022.3156076,

PubMed Abstract | Crossref Full Text | Google Scholar

Liu, K., Yang, M., Yu, Z., Wang, G., and Wu, W. (2023). FBMSNet: a filter-Bank multi-scale convolutional neural network for EEG-based motor imagery decoding. I.E.E.E. Trans. Biomed. Eng. 70, 436–445. doi: 10.1109/TBME.2022.3193277,

PubMed Abstract | Crossref Full Text | Google Scholar

Ma, Z., Niu, Y., and Hu, J.. 2020. Deep multi-scale convolutional neural network method for depth estimation from a single image., in 2020 Chinese Control And Decision Conference (CCDC), (IEEE), 3984–3988.

Google Scholar

Mattioli, F., Porcaro, C., and Baldassarre, G. (2021). A 1D CNN for high accuracy classification and transfer learning in motor imagery EEG-based brain-computer interface. J. Neural Eng. 18:066053. doi: 10.1088/1741-2552/ac4430,

PubMed Abstract | Crossref Full Text | Google Scholar

McFarland, D. J., and Wolpaw, J. R. (2011). Brain-computer interfaces for communication and control. Commun. ACM 54, 60–66. doi: 10.1145/1941487.1941506,

PubMed Abstract | Crossref Full Text | Google Scholar

Musallam, Y. K., AlFassam, N. I., Muhammad, G., Amin, S. U., Alsulaiman, M., Abdul, W., et al. (2021). Electroencephalography-based motor imagery classification using temporal convolutional network fusion. Biomed. Signal Process. Control. 69:102826. doi: 10.1016/j.bspc.2021.102826

Crossref Full Text | Google Scholar

Ou, Q., and Zou, J. (2025). Channel-wise attention-enhanced feature mutual reconstruction for few-shot fine-grained image classification. Electronics 14:377. doi: 10.3390/electronics14020377

Crossref Full Text | Google Scholar

Pfurtscheller, G., Brunner, C., Schlögl, A., and Lopes da Silva, F. H. (2006). Mu rhythm (de)synchronization and EEG single-trial classification of different motor imagery tasks. NeuroImage 31, 153–159. doi: 10.1016/j.neuroimage.2005.12.003,

PubMed Abstract | Crossref Full Text | Google Scholar

Pfurtscheller, G., and Neuper, C. (2001). Motor imagery and direct brain-computer communication. Proc. IEEE 89, 1123–1134. doi: 10.1109/5.939829

Crossref Full Text | Google Scholar

Ramoser, H., Muller-Gerking, J., and Pfurtscheller, G. (2000). Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehabil. Eng. 8, 441–446. doi: 10.1109/86.895946,

PubMed Abstract | Crossref Full Text | Google Scholar

Riyad, M., Khalil, M., and Adib, A. (2020). Incep-EEGNet: A ConvNet for motor imagery decoding, 103–111.

Google Scholar

Rizvi, S. M., Rahman, A. A.-H. A., Sheikh, U. U., Fuad, K. A. A., and Shehzad, H. M. F. (2023). Computation and memory optimized spectral domain convolutional neural network for throughput and energy-efficient inference. Appl. Intell. 53, 4499–4523. doi: 10.1007/s10489-022-03756-1,

PubMed Abstract | Crossref Full Text | Google Scholar

Salami, A., Andreu-Perez, J., and Gillmeister, H. (2022). EEG-ITNet: an explainable inception temporal convolutional network for motor imagery classification. IEEE Access 10, 36672–36685. doi: 10.1109/ACCESS.2022.3161489

Crossref Full Text | Google Scholar

Schirrmeister, R. T., Springenberg, J. T., Fiederer, L. D. J., Glasstetter, M., Eggensperger, K., Tangermann, M., et al. (2017). Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 38, 5391–5420. doi: 10.1002/hbm.23730,

PubMed Abstract | Crossref Full Text | Google Scholar

Shen, L., Sun, M., Li, Q., Li, B., Pan, Z., and Lei, J. (2022). Multiscale temporal self-attention and dynamical graph convolution hybrid network for EEG-based stereogram recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 30, 1191–1202. doi: 10.1109/TNSRE.2022.3173724,

PubMed Abstract | Crossref Full Text | Google Scholar

Tang, X., Li, W., Li, X., Ma, W., and Dang, X. (2020). Motor imagery EEG recognition based on conditional optimization empirical mode decomposition and multi-scale convolutional neural network. Expert Syst. Appl. 149:113285. doi: 10.1016/j.eswa.2020.113285

Crossref Full Text | Google Scholar

Tao, W., Wang, Z., Wong, C. M., Jia, Z., Li, C., Chen, X., et al. (2024). ADFCNN: attention-based dual-scale fusion convolutional neural network for motor imagery brain–computer interface. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 154–165. doi: 10.1109/TNSRE.2023.3342331,

PubMed Abstract | Crossref Full Text | Google Scholar

Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). CBAM: Convolutional block attention module, 3–19 doi: 10.1007/978-3-030-01234-2_1.

Crossref Full Text | Google Scholar

Wu, H., Niu, Y., Li, F., Li, Y., Fu, B., Shi, G., et al. (2019). A parallel multiscale filter Bank convolutional neural networks for motor imagery EEG classification. Front. Neurosci. 13:1275. doi: 10.3389/fnins.2019.01275,

PubMed Abstract | Crossref Full Text | Google Scholar

Zhang, G., Luo, J., Han, L., Lu, Z., Hua, R., Chen, J., et al. (2021). A dynamic multi-scale network for EEG signal classification. Front. Neurosci. 14:578255. doi: 10.3389/fnins.2020.578255,

PubMed Abstract | Crossref Full Text | Google Scholar

Zhang, K., Xu, G., Han, Z., Ma, K., Zheng, X., Chen, L., et al. (2020). Data augmentation for motor imagery signal classification based on a hybrid neural network. Sensors 20:4485. doi: 10.3390/s20164485,

PubMed Abstract | Crossref Full Text | Google Scholar

Zhu, H., Wang, L., Shen, N., Wu, Y., Feng, S., Xu, Y., et al. (2023). MS-HNN: multi-scale hierarchical neural network with squeeze and excitation block for neonatal sleep staging using a Single-Channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 2195–2204. doi: 10.1109/TNSRE.2023.3266876,

PubMed Abstract | Crossref Full Text | Google Scholar

Keywords: attention mechanism, brain–computer interface, common spatial pattern, data augmentation, motor imagery

Citation: Wang S, Wang R, Chang L, Wu J and Hu L (2026) AMANet: a data-augmented multi-scale temporal attention convolutional network for motor imagery classification. Front. Neurorobot. 19:1704111. doi: 10.3389/fnbot.2025.1704111

Received: 12 September 2025; Revised: 03 December 2025; Accepted: 09 December 2025;
Published: 09 January 2026.

Edited by:

Jing Jin, East China University of Science and Technology, China

Reviewed by:

Omar Mendoza Montoya, Monterrey Institute of Technology and Higher Education (ITESM), Mexico
Ruiyu Zhao, East China University of Science and Technology, China
Weijie Chen, East China University of Science and Technology, China

Copyright © 2026 Wang, Wang, Chang, Wu and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Raofen Wang, rfwangsues@163.com
