Data Augmentation for Deep Neural Networks Model in EEG Classification Task: A Review

Classification of electroencephalogram (EEG) signals is a key approach to measuring the rhythmic oscillations of neural activity and is one of the core technologies of brain-computer interface systems (BCIs). However, extracting features from non-linear and non-stationary EEG signals remains a challenging task for current algorithms. With the development of artificial intelligence, various advanced algorithms have been proposed for signal classification in recent years. Among them, deep neural networks (DNNs) have become the most attractive type of method due to their end-to-end structure and powerful ability of automatic feature extraction. However, it is difficult to collect large-scale datasets in practical applications of BCIs, which may lead to overfitting or weak generalizability of the classifier. To address these issues, data augmentation (DA) has been proposed as a promising technique to improve the performance of the decoding model. In this article, we investigate recent studies and developments of DA strategies for EEG classification based on DNNs. The review consists of three parts: which EEG-based BCI paradigms are used, which types of DA methods are adopted to improve the DNN models, and what accuracy can be obtained. Our survey summarizes current practices and performance outcomes, with the aim of promoting or guiding the deployment of DA for EEG classification in future research and development.


INTRODUCTION
As a key tool for capturing the intention behind brain activity, electroencephalography (EEG) can be used to measure rhythmic oscillations of the brain and reflects the synchronized activity of substantial populations of neurons (Atagün, 2016). The rhythmic oscillation is closely related to state changes of the nerve center, which directly reflect the mental activity of the brain (Pfurtscheller, 2000; Villena-González et al., 2018). The brain-computer interface (BCI) is one of its typical applications, serving as a communication protocol between users and computers that does not rely on the normal neural pathways of the brain and muscles (Nicolas-Alonso and Gomez-Gil, 2012). Based on how the signals are acquired, BCIs can be divided into three types: non-invasive, invasive, and partially invasive BCIs (Rao, 2013; Levitskaya and Lebedev, 2016). Due to their low risk, low cost, and convenience, EEG-based non-invasive BCIs are the most popular type of BCI and are the main type discussed in this article.
During the interaction, the automatic classification of EEG is an important step toward making BCIs more practical in applications (Lotte et al., 2007). However, some limitations present challenges for classification algorithms (Boernama et al., 2021). First, EEG signals have weak amplitudes and are always accompanied by irrelevant components, so they suffer from a low signal-to-noise ratio. Second, EEG is in essence the potential change produced by the cluster activity of neurons, which makes it a non-stationary signal. Machine learning and non-linear theory are widely used for EEG classification in current research (Lotte et al., 2018). However, long calibration times and weak generalization ability limit their application in practice.
In the past few years, deep neural networks (DNNs) have achieved excellent results in image, speech, and natural language processing (Hinton et al., 2012; Bengio et al., 2013). Features can be automatically extracted from the input data by successive non-linear transformations based on hierarchical representations and mapping. Due to their ability to minimize the interference of redundant information and to extract non-linear features, DNN-based EEG decoding has attracted more and more attention. However, one precondition for obtaining the expected results is the support of large-scale datasets that ensure the robustness and generalization ability of DNNs (Nguyen et al., 2015). There are still some challenges for EEG collection. First, it is difficult to collect large-scale data due to strict requirements for the experimental environment and subjects, which may cause overfitting and increase the structural risk of the model. Moreover, EEG signals are highly susceptible to changes in psychological and physiological conditions, which cause high variability of the feature distribution across subjects/sessions. This not only reduces the accuracy of the decoding model, but also limits the generalization of the model on an independent test set.
One promising approach is regularization (Yu et al., 2008; Xie et al., 2015), which can effectively improve the generalization ability and robustness of DNNs. There are three ways to achieve regularization: adding a term to the loss function (e.g., L2 regularization), regularizing directly in the model (e.g., dropout, batch normalization, kernel max-norm constraint), and data augmentation (DA). Compared to the first two approaches, DA addresses overfitting by using a more comprehensive set of data to minimize the distance between the training and test datasets. This is especially useful for EEG signals, where the limitation of small-scale datasets greatly affects the performance of classifiers. Therefore, researchers are increasingly concerned with optimizing deep learning (DL) models using DA in the task of EEG classification. The framework of the methodology is shown in Figure 1.
The rest of the article is organized as follows. The search methods for identifying relevant studies are described in detail in section "Method." In section "Results," the basic concept and specific methods of DA in EEG classification based on DNNs are presented. Section "Discussion" discusses the current research status and challenges. Finally, conclusions are drawn in section "Conclusion."

METHOD
A wide literature search from 2016 to 2021 was conducted through Web of Science, PubMed, and IEEE Xplore. The keywords used for the search were DA, EEG, deep learning, and DNNs. Table 1 lists the collection criteria for inclusion or exclusion.
This review was conducted following PRISMA guidelines (Liberati et al., 2009). Results are summarized in a flowchart in Figure 2. The flowchart identifies and narrows down the collection of related studies. Duplicates between all datasets and studies that meet the exclusion criteria are excluded. Finally, 56 papers that meet the inclusion criteria are included.

Concepts and Methods for Data Augmentation
Data augmentation aims to prevent the overfitting of the DNN model by artificially generating new data based on existing training data (Shorten and Khoshgoftaar, 2019). There are three main strategies of this technology: basic image manipulations, deep learning, and feature transformation. The first approach performs augmentation directly in the input space while the last two methods realize DA based on the feature space of datasets. Here, we briefly describe these methods in the following parts.

Data Augmentation Based on Image Manipulations
Data augmentation based on image manipulations performs simple transformations using geometric features in an intuitive and low-cost way. Typical methods can be divided into the following categories.

Geometry Transformations
The geometric features of images are generally a visual representation of the physical information that contains both direction and contour elements (Cui et al., 2015;Paschali et al., 2019). Common operations include:

Flipping
This method is realized by mirroring the image about the horizontal or vertical axis while keeping the size of the matrix unchanged.

Cropping
The operation of cropping can be realized by cropping the central patch of images randomly and then mixing the remaining parts.

Rotation
Rotation-based augmentation is realized by rotating images about a coordinate axis. The choice of rotation parameters is an important factor that affects the augmentation effect.

Photometric and Color Transformations
Performing augmentations in the color-channel space is another practical method (Heyne et al., 2009). During the operation, the raw data are converted to a form such as the power spectrum, stress diagram, and so on, which represents the distribution of spatial features.

Color Transformations
Color transformation realizes the generation of new data by adjusting the RGB matrix.

Noise Injection
Another approach to increase the diversity of data is injecting random matrices into the raw data, which are usually derived from Gaussian distributions (Okafor et al., 2017).
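The image-space operations described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the procedure of any reviewed study: the function names are hypothetical, rotation is restricted to multiples of 90 degrees (arbitrary angles would need interpolation), and cropping is shown in its simplest form, cutting out one random patch.

```python
import numpy as np


def flip(image, axis):
    """Mirror a 2-D image about one axis (0: up/down, 1: left/right)."""
    return np.flip(image, axis=axis)


def rotate90(image, k=1):
    """Rotate a 2-D image counter-clockwise by k * 90 degrees."""
    return np.rot90(image, k=k)


def random_crop(image, size, rng):
    """Cut a random (size x size) patch out of a 2-D image."""
    height, width = image.shape
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    return image[top:top + size, left:left + size]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return image + rng.normal(0.0, sigma, size=image.shape)


rng = np.random.default_rng(0)
image = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for a spectral image

augmented = [
    flip(image, axis=1),
    rotate90(image),
    random_crop(image, size=2, rng=rng),
    add_gaussian_noise(image, sigma=0.1, rng=rng),
]
```

Flipping and rotation keep the matrix size, whereas cropping changes it, so cropped patches usually have to be resized or padded before they are fed to a network with a fixed input shape.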

Data Augmentation Based on Deep Learning
Augmentation methods based on image manipulations perform the transformation in the input space of the data. However, these approaches cannot take advantage of the underlying features of the data to perform augmentation (Arslan et al., 2019). Recently, a novel DA method has attracted the attention of researchers. It applies DNNs to map the data space from high-dimensional to low-dimensional and to realize feature extraction, from which the artificial data are reconstructed (Cui et al., 2014). There are two typical deep learning strategies for DA: autoencoders (AE) and generative adversarial networks (GAN).

Autoencoder and Its Improved Version
As shown in Figure 3, an AE is a feed-forward neural network used to encode the raw data into low-dimensional vector representations by one-half of the network and to reconstruct these vectors back into the artificial data using another half of the network (Yun et al., 2019).
To obtain the expected generated data, the variational autoencoder (VAE) was proposed to improve the performance of the autoencoder. Compared with the AE, the VAE ensures that the generated data follow a specific probability distribution by adding constraints to the structure (Figure 4), where µ is the mean of the probability distribution, σ² represents the variance, and ε is the deviation.
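The role of µ, σ², and ε can be illustrated with the reparameterization step commonly used in VAEs, z = µ + σ·ε with ε drawn from a standard normal distribution, together with the KL term that pulls the latent distribution toward N(0, I). The sketch below assumes this standard formulation; the review itself does not spell out the equation, and the function names and the log-variance parameterization are our own choices.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)    # the random deviation
    return mu + sigma * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # hypothetical encoder output: 4 samples, 2 latent dims
log_var = np.zeros((4, 2))   # log(sigma^2) = 0, i.e. unit variance

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

When the encoder output already matches the standard normal (µ = 0, σ² = 1), the KL term is zero, which is the sense in which the constraint ties the generated data to a specific distribution.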

Generative Adversarial Networks and Their Improved Version
Generative adversarial networks generate artificial data based on the principle of adversarial learning. As shown in Figure 5, a GAN sets up a competition between two networks, a generator G and a discriminator D, that reaches a dynamic balance in which the statistical distribution of the target data is learned (Deng et al., 2014). The optimization problem of GAN can be defined as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

Where p(x) is the distribution of the training data, p(z) is the distribution of the generator input z, and D is the discriminative model used to estimate the probability that a sample is real data x rather than generated data G(z). V represents the value function and E is the expected value. During training, the goal of GAN is to find the Nash equilibrium of a non-convex game with high-dimensional parameters. However, the optimization process places no constraint on the loss function, so the model can easily generate meaningless output during training. To address this issue and expand the scope of application, researchers have proposed improved structures such as deep convolutional GAN (DCGAN), conditional GANs, cycle GANs, and so on (Goodfellow et al., 2014). Among these architectures for DA, DCGAN builds the generator and discriminator networks from CNNs rather than multilayer perceptrons, which increases the internal complexity relative to the original GAN (Radford et al., 2015). To improve the stability of the training process, an additional cycle-consistency loss function was proposed to optimize the structure of GAN, which defines cycle GANs (Kaneko et al., 2019). Conditional GANs effectively alleviate mode collapse by adding a conditional vector to both the generator and the discriminator (Regmi and Borji, 2018). Another architecture of interest is the Wasserstein GAN (WGAN), which uses the Wasserstein distance rather than the Jensen-Shannon or Kullback-Leibler divergence to measure the distance between generated and real data in order to improve the training performance (Yang et al., 2018).
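The motivation for replacing the Jensen-Shannon divergence with the Wasserstein distance can be seen on toy one-dimensional samples. The sketch below is an illustration under our own assumptions (Gaussian toy data, histogram binning, hypothetical function names) and is not taken from any reviewed study. It computes the JS divergence between two histograms and the Wasserstein-1 distance between two equal-size samples, for which the optimal coupling simply pairs the sorted values.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch only handles samples of equal size")
    return np.mean(np.abs(a - b))


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)
bins = np.linspace(-5.0, 30.0, 71)

for shift in (10.0, 20.0):
    fake = real + shift  # "generated" data that is far from the real data
    p, _ = np.histogram(real, bins=bins)
    q, _ = np.histogram(fake, bins=bins)
    print(shift, js_divergence(p, q), wasserstein_1d(real, fake))
```

Once the two histograms no longer overlap, the JS divergence saturates at log 2 and carries no information about how far apart the distributions are, whereas the Wasserstein distance keeps growing with the shift and therefore still provides a useful training signal.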

Data Augmentation Based on Feature Transformation
Compared with image manipulation and deep learning methods, feature transformation performs DA through spatial transformation of features in low dimensions, which generates artificial data with a diverse distribution. However, only a few studies have reported related methods. A novel spatial filtering method has been proposed to generate data using a time-delay strategy by combining it with a common spectral-spatial pattern (CSSP; Blankertz and BCI Competition, 2005). Another study applied empirical mode decomposition to divide EEG into multiple modes for DA (Freer and Yang, 2019). To clearly show the taxonomy of DA, Figure 6 summarizes all the DA methodologies collected in this review.
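As a rough illustration of the decomposition-based strategy, the sketch below builds artificial trials by recombining intrinsic mode functions (IMFs) from different trials of the same class. Several points are assumptions rather than details given in this review: the IMFs are taken as already computed by some EMD implementation (not shown), the residual is treated as the last mode, the mixing rule draws each mode from a randomly chosen donor trial, and the name `mix_imfs` is hypothetical.

```python
import numpy as np


def mix_imfs(imfs, n_new, rng):
    """Create artificial trials by recombining IMFs across trials.

    imfs: array of shape (n_trials, n_modes, n_samples) holding the
          precomputed modes of single-channel trials from ONE class.
    Returns an array of shape (n_new, n_samples).
    """
    n_trials, n_modes, n_samples = imfs.shape
    new_trials = np.empty((n_new, n_samples))
    mode_index = np.arange(n_modes)
    for k in range(n_new):
        # Assumed mixing rule: mode i is taken from a random donor trial.
        donors = rng.integers(0, n_trials, size=n_modes)
        new_trials[k] = imfs[donors, mode_index].sum(axis=0)
    return new_trials


rng = np.random.default_rng(0)
# Random stand-in for real IMFs: 6 trials, 4 modes, 128 samples each.
imfs = rng.standard_normal((6, 4, 128))
artificial = mix_imfs(imfs, n_new=10, rng=rng)
```

Because each artificial trial is a sum of modes, it stays in the same signal space as the original trials and can be passed through the same preprocessing before classification.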

Typical EEG Paradigms
Based on the form of interaction, BCIs can be divided into two types: active and passive. Active BCI is defined as neural activity in response to a specific external stimulus and contains three typical paradigms: motor imagery (MI), visual evoked potentials (VEP), and event-related potentials (ERP). MI is a mental process that imitates motor intention without real output. Different imagery tasks activate the corresponding regions of the brain, and this activation is reflected by various feature representations of EEG (Bonassi et al., 2017).
Visual evoked potentials are continuous responses from the visual region when humans receive flashing visual stimuli (Tobimatsu and Celesia, 2006). When external stimuli are presented in a fixed frequency form, the visual region is modulated to produce a continuous response related to this frequency, i.e., Steady-State Visually Evoked Potentials (SSVEP; Wu et al., 2008).
Event-related potentials refer to the potential response elicited by a specific stimulus, such as a visual, auditory, or tactile stimulus (Luck, 2005).
Compared with active BCI, passive BCI aims to output the EEG signals from subjects' arbitrary brain activity and is a form of BCI that does not rely on a voluntary task (Roy et al., 2013; Arico et al., 2017).
In this section, we review recent reports for DA in EEG classification based on DNNs.

Data Augmentation Strategy for EEG Classification
In recent years, scientific interest in the application of DA to EEG classification has grown considerably. Abdelfattah et al. (2018) employed recurrent GANs (RGAN) to improve the performance of classification models in MI-BCI tasks. Unlike the standard GAN structure, they used recurrent neural networks as the generator component. Due to its ability to capture the time dependencies of signals, RGAN shows great advantages in generating time-series data.
Verification with three models showed that the classification accuracy was significantly improved after DA. Zhang et al. (2020) carried out research on image augmentation using deep convolutional GAN (DCGAN), which replaces pooling layers with fractional-strided convolutions in the generator and strided convolutions in the discriminator. Considering the rule of feature distribution, they transformed the time-series signal into spectrogram form and applied adversarial training with convolution operations to generate data. Meanwhile, they compared the performance of different DA models and verified that the data generated by DCGAN show the best similarity and diversity.
Freer and Yang (2019) proposed a convolutional long short-term memory network (CLSTM) to perform binary classification of MI EEG. To enhance the robustness of the classifier, they applied noise injection, multiplication, flip, and frequency shift, respectively, to augment the data. Results show that the average classification accuracy could be improved by 14.0% after DA.
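The four time-series operations named above can be sketched as follows. This is a hedged illustration, not the implementation of Freer and Yang (2019): the parameter values are arbitrary, "flip" is assumed here to mean inverting the sign of the amplitude (it could also be read as time reversal), and the frequency shift is implemented by modulating the analytic signal, which is one common way to shift a spectrum.

```python
import numpy as np


def inject_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise to every sample."""
    return x + rng.normal(0.0, sigma, size=x.shape)


def multiply(x, factor):
    """Scale the amplitude of the whole trial."""
    return x * factor


def flip_sign(x):
    """Assumed meaning of 'flip': invert the amplitude sign."""
    return -x


def frequency_shift(x, shift_hz, fs):
    """Shift all frequency components of x (..., samples) by shift_hz."""
    n = x.shape[-1]
    spectrum = np.fft.fft(x, axis=-1)
    # Analytic signal: keep DC (and Nyquist), double positive, drop negative.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * h, axis=-1)
    t = np.arange(n) / fs
    return np.real(analytic * np.exp(2j * np.pi * shift_hz * t))


rng = np.random.default_rng(0)
fs = 256                                   # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs                     # one second of data
trial = np.vstack([np.sin(2 * np.pi * 10 * t),
                   np.sin(2 * np.pi * 20 * t)])  # 2 channels x 256 samples

augmented = {
    "noise": inject_noise(trial, sigma=0.05, rng=rng),
    "multiplication": multiply(trial, factor=1.05),
    "flip": flip_sign(trial),
    "frequency_shift": frequency_shift(trial, shift_hz=2.0, fs=fs),
}
```

In this toy trial the 10 Hz channel becomes a 12 Hz sinusoid after the shift, which shows why frequency shifting should be applied with care when the class information is carried by specific frequency bands.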
Zhang Z. et al. (2019) created a novel DA method for the MI-BCI task in which they applied empirical mode decomposition (EMD) to divide the raw EEG frame into multiple modes. The process of decomposition was defined as:

$$x(t) = \sum_{i=1}^{s} \mathrm{IMF}_i(t) + r_s(t) \tag{2}$$

Where x(t) is the signal recovered by EMD, IMF represents the intrinsic mode functions, s represents the number of IMFs, and r_s(t) is the final residual. In the training stage, they mixed the IMFs to generate new data and then transformed the result into tensors using complex Morlet wavelets, which were finally input into a convolutional neural network (CNN). Experimental results verified that the artificial EEG frames could enhance the performance of the classifier and yield higher accuracy. Panwar et al. (2020) proposed a WGAN (Eq. 3) with gradient penalty to synthesize EEG data for the rapid serial visual presentation (RSVP) task. It is worth noting that WGAN applies the Wasserstein distance to measure the distance between real and generated data, taken as the infimum over all joint distributions γ in Π(P_r, P_g), i.e., those whose marginals are P_r and P_g:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x_r, x_g) \sim \gamma}\left[\lVert x_r - x_g \rVert\right] \tag{3}$$
Where P_r and P_g are the distributions of real data x_r and generated data x_g, W represents the distance between the two distributions, and E is the mean value. To improve the training stability and convergence, they utilized a gradient penalty to optimize the training process. Meanwhile, the proposed method addressed the problems of frequency artifacts and instability in the training process of DA. To evaluate the effectiveness of DA, they proposed two evaluation indices (visual inspection and log-likelihood score from Gaussian mixture models) to assess the quality of generated data. Experiments show that presentation-associated patterns of EEG could be seen clearly in the generated data and that significant improvement was obtained with the EEGNet model after DA in the RSVP task (Lawhern et al., 2018). A similar method was used by Aznan et al. (2019), who applied WGAN to generate synthetic EEG data that optimizes the efficiency of interaction in the SSVEP task. They then used the generated EEG to pre-train the classifier in the offline stage and fine-tuned the classifier with real collected EEG. This approach was used to control a robot and achieve real-time navigation. Results show that the DA method significantly improves the accuracy in real-time navigation tasks across multiple subjects.
Yang et al. (2020) argued that the typical DA methods of geometric transformation (GT) and noise injection (NI) ignored the effect of the signal-to-noise ratio (SNR) across trials. Therefore, they proposed a novel DA method based on randomly averaging EEG data, which artificially generates EEG data with different SNR patterns. The DA was achieved by randomly taking n (1 < n < N) examples from the same category and calculating the average potential at each iteration, where N represents the number of all trials. RNN and CNN were used to classify different specific frequencies in the visual evoked potential (VEP) task and obtained significant improvement after DA.
Li et al. (2019) discussed the effect of noise addition on the time-series and spectral forms of the signal in the MI-BCI task. They applied a CNN combined with channel-projection and mixed-scale to classify 4-class MI signals and concluded that noise may destroy the amplitude and phase information of time-series signals but cannot change the feature distribution of the spectrum. Therefore, they performed STFT to transform the time-series EEG signal into spectral images, which was defined as amplitude-perturbation DA. Results show that the performance was improved by DA for almost all subjects in two public datasets.
Lee et al. (2020) investigated a novel DA method called borderline-synthetic minority over-sampling technique (Borderline-SMOTE). It generates synthetic data for the minority class by using the m nearest neighbors of minority-class instances and then adds these instances to the real data by weighted calculation. The effectiveness of DA was evaluated on EEG data collected from the P300 task. Results show that the proposed method could enhance the robustness of decision boundaries and improve the classification accuracy of P300-based BCIs.
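The random-averaging strategy of Yang et al. (2020) described in this passage is simple enough to sketch directly. The code below follows the stated rule, drawing n trials with 1 < n < N from one class and averaging them, while the array layout, the function name `random_average`, and the synthetic data are our own assumptions.

```python
import numpy as np


def random_average(trials, n_new, rng):
    """Generate artificial trials by averaging random subsets of one class.

    trials: array of shape (N, channels, samples), all from the same class.
    Each new trial averages n randomly chosen trials with 1 < n < N.
    """
    n_trials = trials.shape[0]
    if n_trials < 3:
        raise ValueError("need at least 3 trials so that 1 < n < N is possible")
    new_trials = np.empty((n_new,) + trials.shape[1:])
    for k in range(n_new):
        n = rng.integers(2, n_trials)  # upper bound is exclusive: 2 <= n <= N - 1
        picked = rng.choice(n_trials, size=n, replace=False)
        new_trials[k] = trials[picked].mean(axis=0)
    return new_trials


rng = np.random.default_rng(0)
# Synthetic stand-in: 20 trials, 8 channels, 250 samples, common signal plus noise.
signal = np.sin(2 * np.pi * 12 * np.arange(250) / 250)
trials = signal + rng.normal(0.0, 1.0, size=(20, 8, 250))
artificial = random_average(trials, n_new=50, rng=rng)
```

Averaging n trials reduces uncorrelated noise roughly by a factor of the square root of n, so varying n across iterations yields artificial trials with different SNR levels, which is the property the method is designed to exploit.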
Regarding EEG-based passive BCIs, they have gradually become more prominent in research (Zander et al., 2009; Cotrina et al., 2014; Aricò et al., 2018), and are used to detect and monitor the affective states of humans. In this part, we introduce some cases of DA application to passive BCIs. Kalaganis et al. (2020) proposed a DA method based on graph-empirical mode decomposition (EMD) to generate EEG data, which combines the advantages of the multiplex network model and the graph variant of classical empirical mode decomposition. They designed a sustained attention driving task in a virtual reality environment and realized automatic detection of the human state using a graph CNN. The experimental results show that exploring the graph structure of the EEG signal could reflect its spatial features and that integrating graph CNN with DA obtained a more stable performance. Wang et al. (2018) discussed the limitations of DA for EEG in emotion recognition tasks and pointed out that the features of EEG in emotion detection tasks have a high correlation with
Where µ and σ are the mean value and standard deviation, respectively, P is the probability density function, and z is a Gaussian random variable. x_g is the generated data after noise injection. Three classification models, namely LeNet, ResNet, and SVM, were used to evaluate the performance. Results show that the generated data could significantly improve the performance of the classifiers based on LeNet and ResNet. However, it has little effect on the SVM model. Luo et al. (2020) applied conditional Wasserstein GAN (cWGAN) and selective VAE (sVAE) to enhance the performance of the classifier in emotion recognition tasks. The loss function of sVAE is defined as follows: Where ELBO represents the evidence lower bound and x_r and x_g are real data and generated data, respectively.
The goal of optimization was to maximize the ELBO, which is equivalent to minimizing the KL divergence between the real data and generated data. Based on the loss function of GAN, an extra penalty term is added to it:

$$\lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

Where λ is the weight coefficient for the trade-off between the original objective and the gradient penalty, D is the discriminator, $\hat{x}$ represents the data points sampled along the straight line between the real distribution and the generated distribution, and $\lVert \cdot \rVert_2$ is the 2-norm. In their work, the training samples of the DA models were transformed into the forms of power spectral density or differential entropy, and the performance of different classifiers was compared after DA. Experiments show that both representations of EEG signals were suitable for producing artificial datasets that enhance the performance of the classifier.
Bashivan et al. (2015) emphasized that the challenge in modeling cognitive signals from EEG is extracting representations of signals across subjects/sessions and suggested that DNNs had an excellent ability for feature extraction. Therefore, they transformed the raw EEG signal into topology-preserving multispectral images as training sets in a mental load classification task. To address overfitting and weak generalization ability, they randomly added noise to the spectral images to generate training sets. However, this DA method did not significantly improve classification performance and only strengthened the stability of the model.
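The gradient-penalty term used with the Wasserstein objective in this passage can be illustrated with a toy critic whose gradient is known in closed form. The sketch below is an assumption-laden illustration rather than the implementation of any reviewed study: real models obtain the gradient by automatic differentiation in a deep learning framework, whereas here a hypothetical linear critic D(x) = w·x is used so that the gradient is simply w, and λ = 10 is an illustrative value.

```python
import numpy as np


def gradient_penalty(x_real, x_gen, critic_grad, lam, rng):
    """lam * mean((||grad D(x_hat)||_2 - 1)^2) on interpolated points x_hat."""
    # Sample each x_hat on the straight line between a real and a generated point.
    alpha = rng.uniform(0.0, 1.0, size=(x_real.shape[0], 1))
    x_hat = alpha * x_real + (1.0 - alpha) * x_gen
    grads = critic_grad(x_hat)                 # shape (batch, features)
    norms = np.linalg.norm(grads, axis=1)      # 2-norm per sample
    return lam * np.mean((norms - 1.0) ** 2)


# Hypothetical linear critic D(x) = w @ x; its gradient is w at every point.
w = np.array([3.0, 4.0])


def linear_critic_grad(x):
    return np.tile(w, (x.shape[0], 1))


rng = np.random.default_rng(0)
x_real = rng.normal(0.0, 1.0, size=(16, 2))
x_gen = rng.normal(2.0, 1.0, size=(16, 2))

penalty = gradient_penalty(x_real, x_gen, linear_critic_grad, lam=10.0, rng=rng)
```

For this critic the gradient norm is 5 everywhere, so the penalty is 10 × (5 − 1)² = 160; a critic whose gradient norm equals 1 on the interpolated points would incur no penalty, which is the behavior the term encourages.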
To comprehensively show the implementation, we summarize the details of the application of DA in EEG decoding in Table 2.

DISCUSSION
The limitation of small-scale datasets hinders the application of DL for EEG classification. Recently, the strategy of DA has received widespread attention and is employed to improve the performance of DNNs. However, there remain several issues worth discussing.
Taking the above discussion into consideration, we found that the input forms of DA models could be divided into three categories: time-series data, spectral images, and feature matrices. We also found that researchers preferred to convert EEG signals into images for subsequent processing in MI tasks. One possible reason is that the features of MI are often accompanied by changes in frequency band energy, i.e., event-related desynchronization (ERD)/event-related synchronization (ERS; Phothisonothai and Nakagawa, 2008; Balconi and Mazza, 2009). This phenomenon indicates that more significant feature representations of MI-EEG are displayed in the time-frequency space rather than the time domain. In contrast, studies based on VEP paradigms prefer to use the time-series signal as the input, since VEP has the strict requirement of being time-locked and contains more obvious features in the time sequence (Basar et al., 1995; Kolev and Schurmann, 2009; Meng et al., 2014). Another form of input is the feature matrix, which can be extracted by wavelet, entropy, STFT, power spectral density, and so on (Subasi, 2007; Filippo et al., 2009; Seitsonen et al., 2010; Lu et al., 2017; Lashgari et al., 2020).
From an implementation point of view, DA can be divided into input space augmentation and feature space augmentation. The former has the advantage of interpretability and lower computational cost. However, based on the classification performance presented in Table 2, we found that operating in feature space could obtain more significant improvements than operating in input space. One explanation is that this type of DA model can extract an intrinsic representation of the data owing to its capacity for non-linear mapping and automatic feature extraction.
Generative adversarial networks have become popular for generating EEG signals in recent years (Hung and Gan, 2021), although they have still not been clearly demonstrated to be the most effective strategy across different EEG tasks. Due to the limited number of studies, it is still unclear which method is the more popular technique. Consequently, researchers should select the appropriate DA method according to the paradigm type and feature representation of EEG.
Previous studies show that DA could improve the decoding accuracy of EEG to varying degrees in different EEG tasks. However, this improvement varies greatly across datasets and preprocessing modes. There are several possible explanations. First, most studies have not discussed whether DA produces negative effects in the training stage of the classifier. As mentioned above, EEG signals are accompanied by strong noise and multi-scale artifacts, but existing DA methods are global operations that cannot effectively distinguish these irrelevant components. Meanwhile, EEG signals collected from specific BCI tasks (SSVEP, P300) have features that are time-locked and phase-locked, so using geometric transformations (GT) to produce artificial data may yield wrong feature representations. In contrast, GT performs effectively in MI and emotion recognition (ER) tasks, because this kind of signal has no strict requirement for feature-locking. Therefore, the feature representation of EEG should be analyzed before the application of GT. Second, few studies discuss the boundary conditions of the feature distribution for generated data, even though it is one of the important guarantees of data validity.
Another important issue worth discussing is how much generated data most effectively enhances the performance of a classifier. Researchers have explored the influence of different ratios of real data (RD) to generated data (GD) on classification performance and demonstrated that the enhancement effect does not increase with the size of GD (Zhang et al., 2020). Research on the effect of different amounts of training data on classification performance using artificial data has indicated that improving performance requires at least a doubling of the size of GD. Consequently, the size of the GD should be determined by multi-group trials with different mix proportions.
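Such multi-group trials amount to building several training sets that differ only in the GD:RD ratio and evaluating the same classifier on each. A minimal sketch of the set construction is given below; the function name `mix_training_set`, the ratio grid, and the random stand-in data are assumptions for illustration, and the classifier training and evaluation steps are left out.

```python
import numpy as np


def mix_training_set(x_real, y_real, x_gen, y_gen, ratio, rng):
    """Append ratio * len(x_real) randomly chosen generated samples to the real set."""
    n_gen = int(round(ratio * len(x_real)))
    if n_gen > len(x_gen):
        raise ValueError("not enough generated samples for this ratio")
    idx = rng.choice(len(x_gen), size=n_gen, replace=False)
    x = np.concatenate([x_real, x_gen[idx]])
    y = np.concatenate([y_real, y_gen[idx]])
    return x, y


rng = np.random.default_rng(0)
# Stand-in data: 20 real and 60 generated trials, 8 channels x 100 samples each.
x_real = rng.standard_normal((20, 8, 100))
y_real = rng.integers(0, 2, size=20)
x_gen = rng.standard_normal((60, 8, 100))
y_gen = rng.integers(0, 2, size=60)

training_sets = {}
for ratio in (0.5, 1.0, 2.0):  # GD:RD ratios to compare
    training_sets[ratio] = mix_training_set(x_real, y_real, x_gen, y_gen, ratio, rng)
```

Generated data should only ever be added to the training split; the validation and test sets must remain real recordings so that the comparison across ratios reflects performance on genuine EEG.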
Based on the above analysis we believe that the following studies are worthy of exploring in further research. First, different DA methods can be combined to extend datasets and augmentation would be executed both in input space and feature space. For example, generated data based on GT can be put into GANs to realize secondary augmentation, which may improve the diversity of generated data. Second, combining meta-learning with data enhancement might reveal why DA affects classification tasks, which may improve the interpretability of generated data. Meanwhile, DA based on GAN is a mainstream method at present, but how to improve the quality of generated data is still a valuable point.

CONCLUSION
Collecting large-scale EEG datasets is a difficult task due to the limitations of available subjects, experiment time, and operation complexity. Data augmentation has proven to be a promising approach to avoid overfitting and improve the performance of DNNs. Consequently, the research state of DA for EEG decoding based on DNNs is discussed in the study. The latest studies in the past 5 years have been discussed and analyzed in this work. Based on the analysis of their results, we could conclude that DA is able to effectively improve the performance in EEG decoding tasks. This review presents the current practical suggestions and performance outcomes. It may provide guidance and help for EEG research and assist the field to produce high-quality, reproducible results.

AUTHOR CONTRIBUTIONS
CH was responsible for manuscript writing. JL drew the related figures. YZ and WD guided the literature collection and the structure of the manuscript. All authors contributed to the article and approved the submitted version.