An Evaluation of the EEG Alpha-to-Theta and Theta-to-Alpha Band Ratios as Indexes of Mental Workload

Many research works indicate that EEG bands, specifically the alpha and theta bands, are potentially helpful indicators of cognitive load. However, minimal research exists to validate this claim. This study aims to assess and analyze the impact of the alpha-to-theta and the theta-to-alpha band ratios on supporting the creation of models capable of discriminating self-reported perceptions of mental workload. A dataset of raw EEG data was utilized in which 48 subjects performed a resting activity and an induced, task-demanding exercise in the form of a multitasking SIMKAP test. Band ratios were devised from frontal and parietal electrode clusters. Model building and testing were done with high-level independent features from the frequency and temporal domains extracted from the computed ratios over time. Target features for model training were extracted from the subjective ratings collected after the resting and task demand activities. Models were built by employing Logistic Regression, Support Vector Machines and Decision Trees and were evaluated with performance measures including accuracy, recall, precision and f1-score. The results indicate high classification accuracy for those models trained with the high-level features extracted from the alpha-to-theta and theta-to-alpha ratios. Preliminary results also show that models trained with logistic regression and support vector machines can accurately classify self-reported perceptions of mental workload. This research contributes to the body of knowledge by demonstrating the richness of the information in the temporal, spectral and statistical domains extracted from the alpha-to-theta and theta-to-alpha EEG band ratios for the discrimination of self-reported perceptions of mental workload.


INTRODUCTION
Human mental workload is a fundamental concept for investigating human performance. It represents an intrinsically complex and multilevel concept, and ambiguities exist in its definition. The most general description of mental workload can be framed as the quantification of the cognitive cost of performing a task in a finite timeframe in order to predict operator performance, system performance or both (Reid and Nygren, 1988;Rizzo and Longo, 2018;Hancock et al., 2021). Mental workload has been regarded as an essential factor that substantially influences task performance (Young et al., 2015;Galy, 2018;Longo, 2018a). As a construct, it has been widely applied in the design and evaluation of complex human-machine systems and environments such as in aircraft operation (Hu and Lodewijks, 2020;Yu et al., 2021), train and vehicle operation (Li et al., 2020;Wang et al., 2021), nuclear power plants (Gan et al., 2020;Wu et al., 2020), various human-computer and brain-computer interfaces (Longo, 2012;Asgher et al., 2020;Putze et al., 2020;Bagheri and Power, 2021) and in educational contexts (Moustafa and Longo, 2019;Orru and Longo, 2019;Longo and Orr, 2020;Longo and Rajendran, 2021), to name a few. Mental workload research has gathered momentum over the last two decades, given that numerous technologies have emerged that engage users at multiple cognitive levels, with different requirements for task activities carried out in diverse environmental conditions. Different methods have been proposed to measure human mental workload. These methods can be clustered into three main groups. Subjective measures rely on the analysis of the subjective feedback provided by humans interacting with an underlying task, usually in the form of a post-task survey.
The most well-known subjective measurement techniques are the NASA Task Load Index (NASATLX) (Hart and Staveland, 1988), the Workload profile (WP) (Tsang and Velazquez, 1996), and the Subjective Workload Assessment Technique (SWAT) (Reid and Nygren, 1988). Task performance measures, often referred to as primary and secondary tasks measures, focus on the objective measurement of a human's performance in an underlying task. Examples of such measures include timely completion of a task, reaction time to secondary tasks, number of errors on the primary task and tapping error. Physiological measures are based upon the analysis of physiological responses of the human body. Examples include EEG (electroencephalography), MEG (magnetoencephalography), brain metabolism, endogenous eye blink rates, pupil diameter, heart rate variability (HRV) measures or electrodermal responses such as galvanic skin response (GSR) (Byrne, 2011).
Many research works indicate that EEG data contains information that can help correlate task engagement and mental workload in cognitive processes like vigilance, learning and memory (Berka et al., 2007;Roy et al., 2016), in operating under environmental factors such as temperature (Wang et al., 2019) and in critical systems domains such as transport (Borghini et al., 2014;Diaz-Piedra et al., 2020), nuclear power plants (Choi et al., 2018) and aviation (Wilson et al., 2021). The reason for using EEG is that it offers several benefits compared to imaging techniques or mere behavioral observational approaches. The most important benefit of EEG is its excellent time resolution, which offers the possibility to study the precise time-course of cognitive and emotional processing of behavior. Billions of neurons in the human brain are organized in a highly intricate and convoluted fashion, resulting in complex firing patterns. These patterns, accompanied by frequency oscillations, are measurable with EEG and reflect certain cognitive, affective or attentional states. These frequencies, in adults, are usually decomposed into different bands: the delta band (1-4 Hz), the theta band (4-8 Hz), the alpha band (8-12 Hz), the beta band (13-25 Hz) and the gamma band (≥ 25 Hz) (Mesulam, 1990).
Recent studies seem to indicate changes in frequency band across different brain regions when a subject performs specific tasks (Gevins and Smith, 2003;Schmidt et al., 2013;Borys et al., 2017). The theta band is thought to be linked to mental fatigue and mental workload (Gevins et al., 1995). The increase in theta spectral power is thought to be correlated with the rise in the use of cognitive resources (Tsang and Vidulich, 2006;Xie et al., 2016), task difficulty (Antonenko et al., 2010) and working memory (Borghini et al., 2012). Alpha band tends to show sensitivity in experiments with mental workload (Xie et al., 2016;Puma et al., 2018), cognitive fatigue (Borghini et al., 2012), attention and alertness (Kamzanova et al., 2014).
Even though EEG bands have been proposed as indicators that can discriminate mental workload (Gevins and Smith, 2003;Tsang and Vidulich, 2006;Antonenko et al., 2010;Coelli et al., 2015), it is unclear which of these best contribute to such discrimination. This article aims to identify the impact of the high-level features extracted from alpha and theta band ratios (and their combination) on the discrimination of levels of perception of mental workload self-reported by users. To tackle this aim, an empirical research experiment has been designed to generate time-series of alpha and theta band ratios, and their combinations, and to extract high-level features that can be used to build models to classify self-reported perceptions of mental workload.
The remainder of this article is organized as follows: Section 2 outlines the related work regarding the specific definition and use of the alpha-to-theta and theta-to-alpha band ratios along with their relationship to mental workload. Section 3 describes the design of an empirical experiment and the methodology employed for answering the above research goal. Section 4 presents the findings followed by a critical discussion while Section 5 concludes this work, proposing future research directions.

RELATED WORK
Recent studies analyze EEG bands in various experimental settings designed for specific domains and purposes such as fatigue and drowsiness (Borghini et al., 2014), brain-computer interfaces (Gevins and Smith, 2003;Käthner et al., 2014), learning (Borys et al., 2017;Dan and Reiner, 2017) as well as for specific brain function disorders such as Alzheimer's disease (Schmidt et al., 2013). Most research studies seem to indicate the possibility that EEG signals across various cortical regions can be a helpful tool toward discriminating mental workload while performing experiments with varying degrees of task demands (Borghini et al., 2014).
The theta band is thought to be linked to mental fatigue and drowsiness (Gevins et al., 1995;Borghini et al., 2014). An increase of spectral power in the theta band is associated with an increase of demand in cognitive resources (Tsang and Vidulich, 2006;Xie et al., 2016), an increase in task difficulty (Gevins and Smith, 2003;Antonenko et al., 2010;Käthner et al., 2014;Borghini et al., 2015) and an increase in working memory (Borghini et al., 2012, 2014). Particularly, the theta power spectrum seems to increase in cases where prolonged concentration is required while executing a task (Gevins and Smith, 2003;Borghini et al., 2014;Käthner et al., 2014). Some research even indicates a decrease in vigilance and alertness where a higher power spectrum in the theta band is observed (Kamzanova et al., 2014). The brain regions thought to be associated with theta activity are mostly in the frontal cortical area (Gevins and Smith, 2003;Borghini et al., 2014;Dan and Reiner, 2017).
The research on the alpha band appears to indicate sensitivity toward mental workload (Xie et al., 2016;Puma et al., 2018) and cognitive fatigue (Borghini et al., 2012, 2014), and an increase in alpha band activity is associated with a decrease in attention and alertness (Kamzanova et al., 2014). An increase/decrease in the alpha band power spectrum is witnessed during relaxed states with eyes closed and opened, respectively (Antonenko et al., 2010). A continuous suppression in the alpha band seems to be linked with increments of task difficulty (Mazher et al., 2017). The brain regions that are primarily associated with alpha-band activity are the parietal and occipital areas (Borghini et al., 2014;Puma et al., 2018).
The beta band is thought to be linked to visual attention (Wróbel, 2000) and short term memory tasks (Palva et al., 2011), and it was hypothesized, inconclusively, that an increase in the beta band is associated with an increase in working memory (Spitzer and Haegens, 2017). An increase in the beta band spectrum seems to be associated with increased levels of task engagement (Coelli et al., 2015) and concentration (Kakkos et al., 2019). The brain regions that are associated with beta-band activity are the parieto-occipital areas, as observed during visual working memory task experiments (Mapelli and Özkurt, 2019).
Multiple EEG band combinations and ratios have also been used to improve mental workload assessment. For instance, beta/(alpha + theta), known as the engagement index, is used to study human task engagement (Mikulka et al., 2002), mental attention (MacLean et al., 2012) and mental effort (Smit et al., 2005). The reduction in the alpha band activity seems to correlate with increased activity in the frontal-parietal areas with an increase in beta power followed by a decrease in theta, which indicates high vigilance states (MacLean et al., 2012). Alpha band activity reduction is also thought to correlate with activities in the parietal brain region where a decrease in beta activity followed by an increase of theta band activity indicates states of drowsiness and low attention (MacLean et al., 2012).
Attempts to assess mental workload and task engagement using the information from the theta and alpha bands in the form of theta-to-alpha band ratios are seen in Gevins and Smith (2003), Käthner et al. (2014), Dan and Reiner (2017), and Xie et al. (2016). This is based on the assumption that an increase in the theta power band in the frontal brain region, and a decrease in the alpha power in parietal region is associated with an increase in mental workload (Käthner et al., 2014). The increase in both alpha and theta power is related to the rise of fatigue (Käthner et al., 2014;Xie et al., 2016). Research seems to indicate that task load manipulations are followed by an increase of theta band activity in frontal brain regions followed by a decrease in alpha power in the parietal areas (Gevins and Smith, 2003;Käthner et al., 2014;Dan and Reiner, 2017).
The motivation for this article arises from the fact that research studies indicate that band ratios, specifically those involving the theta and alpha bands, are associated with mental workload states (Gevins and Smith, 2003;Borghini et al., 2014) and, to some extent, this seems to justify their potential as workload indicators (Fernandez Rojas et al., 2020). While research exists that focuses on the alpha, theta and beta bands as well as their respective ratios such as beta/(alpha + theta) and, to some extent, theta-to-alpha, there is an absence of research related to the use of the alpha-to-theta and theta-to-alpha ratios and their role in discriminating self-reported perceptions of mental workload. Therefore, to address the goal stated in the introductory Section 1, we formulate a research problem focused on the investigation of the importance of high-level features extracted from the alpha-to-theta and the theta-to-alpha EEG band ratios for the discrimination of levels of perception of mental workload. In other words, the research question that can be formulated is: what is the impact of high-level features extracted from alpha and theta band ratios (and their combination) on discriminating levels of perception of mental workload self-reported by users?

DESIGN AND METHODOLOGY
To answer the research problem and research question outlined above, the following research hypotheses are defined:
1. H1: If high-level features are extracted from indexes of mental workload built upon alpha-to-theta and theta-to-alpha band ratios, then their discriminatory capacity to self-reported perceptions of mental workload will be higher than those extracted from indexes of mental workload built upon the alpha and theta bands alone.
2. H2: If more adjacent EEG electrodes from the respective cortical areas are used to create indexes of mental workload built upon alpha-to-theta and theta-to-alpha band ratios, then they will exhibit higher discriminatory capacity to self-reported perceptions of mental workload than those indexes built with fewer electrodes.
In order to test these research hypotheses, empirical comparative research has been designed based on a process pipeline as illustrated in Figure 1 with details outlined in the following subsections.

Experiment Design and Dataset Description
The STEW (Simultaneous Task EEG Workload) dataset (Lim et al., 2018) has been selected for experimental purposes. The dataset consists of raw EEG data collected from 48 subjects across 14 channels in two experimental conditions. In one condition, the EEG data was recorded from subjects in the rest state while not performing any mental activity. In the second condition, a multitasking SIMKAP test was presented to subjects, and EEG data was recorded. In both cases, a sampling frequency of 128 Hz was used with 2.5 min of EEG recordings utilizing the Emotiv EPOC EEG headset. Every recording contains 19,200 data samples (128 samples/s × 150 s) across the following 14 channels: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4. Additionally, a subjective rating was collected after each task execution whereby users rated their experienced mental workload on a scale from 1 to 9. The rating was a Likert scale with 1 = "very, very low mental effort"; 2 = "very low mental effort"; 3 = "low mental effort"; 4 = "rather low mental effort"; 5 = "neither low nor high mental effort"; 6 = "rather high mental effort"; 7 = "high mental effort"; 8 = "very high mental effort" and 9 = "very, very high mental effort." The rationale for using perceived mental workload scores was to provide a form of subjective validation, verifying whether a subject indeed experienced an increase in cognitive load while performing the SIMKAP condition as compared to the resting condition.

EEG Denoising Pipeline
Applying a denoising pipeline is an important step to pre-process the raw EEG data and to remove noise from it to facilitate subsequent analysis. In detail, this process follows Makoto's pre-processing pipeline (Miyakoshi, 2018), including:
• re-referencing channel data to average reference.
• high-pass filtering of each channel at 1 Hz.
• source separation and artifact removal via Independent Component Analysis (ICA).
The key pre-processing step is the application of ICA, which is utilized to separate the 14 EEG signal sources into independent components for each subject. Fourteen components are generated and used to automatically find and remove artifacts without human intervention using part of the methodology described in Nolan et al. (2010). In detail, the criteria for identifying bad components include the computation of the z-scores of each component's spectral kurtosis, slope, Hurst exponent and gradient median. Spectral kurtosis is a parameter in the frequency domain indicating how the component's impulsiveness varies with frequency. The slope of a component represents the mean slope of its power spectrum over two time points. The Hurst exponent, a measure of long-term memory in time series, quantifies the tendency of a component to either regress to its mean or to follow an upward/downward trend. The gradient median is the median slope of the component's time course. All of the components exhibiting z-scores outside the range ±3 can be considered as artifacts since they are outliers and significantly different from all the others. The value of ±3 for the z-score was adopted from Nolan et al. (2010) as part of the automatic outlier detection taken from the FASTER method. Finally, the inverse ICA has been executed to convert the remaining "good" components back into the original neural EEG signal.
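The ±3 z-score criterion can be illustrated with a minimal Python sketch. It is not the FASTER implementation: the function names, the use of the population standard deviation and the rule that a component is rejected when any one of its properties is an outlier are assumptions made for illustration, and the property values are invented.

```python
from statistics import mean, pstdev

def flag_outlier_components(values, threshold=3.0):
    """Return indices of components whose z-score lies outside +/- threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all components identical: nothing stands out
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]

def flag_bad_components(properties, threshold=3.0):
    """Union of outliers over several per-component properties.

    properties: dict mapping a property name (e.g. "kurtosis") to a list
    holding one value per ICA component. Rejecting a component when ANY
    property is an outlier is an assumption of this sketch.
    """
    bad = set()
    for values in properties.values():
        bad.update(flag_outlier_components(values, threshold))
    return sorted(bad)

# Hypothetical property values for 14 components (not taken from the dataset).
properties = {
    "kurtosis": [0.0] * 13 + [100.0],  # last component is extreme
    "slope": [5.0] + [0.0] * 13,       # first component is extreme
}
bad_components = flag_bad_components(properties)
```

With only 14 components per subject, a z-score beyond ±3 requires one component to deviate very strongly from the rest, which is in line with the small number of components removed per subject.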

Forming Cluster Combinations
A baseline of initial parietal and frontal electrodes was adopted following the electrode locations from the 10-20 international system to form different alpha and theta clusters for analysis and comparison purposes. These electrode locations were cross-referenced with the locations, naming notation and electrode availability of the Emotiv EPOC EEG headset. The initial electrodes selected from the frontal and parietal locations are indicated as S1 and S2 in Figure 1. Due to the limited availability of electrodes on the Emotiv EPOC EEG headset (highlighted in green in Figure 1), three frontal clusters and one parietal cluster were constructed. In detail, the cluster combinations and electrodes, together with the channel aggregation approach used, are shown in Table 1.

Formation of the Mental Workload Indexes From Clusters of EEG Alpha and Theta Bands
Generating band ratios from EEG channels over time follows the methodology used in Borghini et al. (2014). The alpha-to-theta and theta-to-alpha ratios were computed from the average power spectral density (PSD) values of the alpha band in the cluster c − α and the average PSD values of the theta band in the clusters c1 − θ, c2 − θ and c3 − θ, as outlined in Table 1. Three alpha-to-theta and three theta-to-alpha ratios are set to take different clusters from frontal electrodes and one cluster from parietal electrodes. The band ratios are computed as follows:

$$\text{at-}x = \frac{\frac{1}{|c{-}\alpha|}\sum_{e \in c{-}\alpha} PSD_{\alpha}(e)}{\frac{1}{|c_x{-}\theta|}\sum_{e \in c_x{-}\theta} PSD_{\theta}(e)} \qquad (1)$$

$$\text{ta-}x = \frac{\frac{1}{|c_x{-}\theta|}\sum_{e \in c_x{-}\theta} PSD_{\theta}(e)}{\frac{1}{|c{-}\alpha|}\sum_{e \in c{-}\alpha} PSD_{\alpha}(e)} \qquad (2)$$

where c − α and c x − θ are the respective alpha and theta clusters (from Table 1), with e an electrode in them, and x a cluster among those using the theta band. The combination of the clusters in Table 1, jointly with their individual use, led to the formation of the set of possible mental workload indexes (configurations), denoted as Equation 3, which comprises the ratio indexes and the individual alpha and theta clusters. In this study, a 1 s non-overlapping sliding window technique is employed to segment long EEG data, and for each window, an index of mental workload can be calculated.
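The windowed ratio computation described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: the PSD estimator is not specified in the text, so a plain periodogram (squared magnitude of the FFT) is used here; the band edges follow the 4-8 Hz and 8-12 Hz definitions given in the introduction; the function names and the synthetic signals are hypothetical.

```python
import numpy as np

FS = 128  # sampling frequency in Hz, as in the STEW recordings

def band_power(window, fs, low, high):
    """Mean periodogram power of a 1-D window in the band [low, high) Hz."""
    freqs = np.fft.rfftfreq(window.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(window)) ** 2  # periodogram (an assumption)
    mask = (freqs >= low) & (freqs < high)
    return power[mask].mean()

def cluster_band_series(cluster, fs, low, high):
    """Band power per 1 s non-overlapping window, averaged over electrodes.

    cluster: array of shape (n_electrodes, n_samples).
    """
    n_windows = cluster.shape[1] // fs
    series = np.empty(n_windows)
    for w in range(n_windows):
        segment = cluster[:, w * fs:(w + 1) * fs]
        series[w] = np.mean([band_power(ch, fs, low, high) for ch in segment])
    return series

def ratio_indexes(alpha_cluster, theta_cluster, fs=FS):
    """Return the alpha-to-theta and theta-to-alpha series for one cluster pair."""
    alpha = cluster_band_series(alpha_cluster, fs, 8.0, 12.0)
    theta = cluster_band_series(theta_cluster, fs, 4.0, 8.0)
    return alpha / theta, theta / alpha

# Synthetic stand-in for a 150 s recording: a 10 Hz rhythm on two "parietal"
# channels and a 6 Hz rhythm on three "frontal" channels (hypothetical data).
t = np.arange(150 * FS) / FS
parietal = np.vstack([2.0 * np.sin(2 * np.pi * 10 * t)] * 2)
frontal = np.vstack([1.0 * np.sin(2 * np.pi * 6 * t)] * 3)
at, ta = ratio_indexes(parietal, frontal)
```

A 150 s recording yields 150 index values per ratio, one per window; these time series are the input to the feature extraction step.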

Feature Extraction From Indexes and Selection
Extracting high-level features from MWL indexes is crucial since it allows the finding of distinguishing properties that otherwise would not be possible if a raw index alone is considered.
The extraction of such high-level features from the indexes defined in the set Equation 3 is executed using TSFEL (Time Series Feature Extraction Library) (Barandas et al., 2020). The advantage of using TSFEL is that it offers a wide range of statistical properties that can be extracted from multiple domains, including the frequency and temporal domains. It is useful for identifying peculiar aspects of a signal and its specific properties such as variability, slope or peak-to-peak distance, just to name a few. Classes of extracted features span from the most well-known, such as statistical/spectral kurtosis, mean and median of a signal, to less frequently employed features such as human range energy ratio, the empirical cumulative distribution function (ECDF), variability and peak-to-peak distance. The idea behind considering a large number of initial features was to assess their individual importance, and subsequently retain only the most informative ones by adopting a systematic feature selection approach rather than selecting them subjectively from intuition. Feature reduction can also facilitate model training in terms of required computational time. The selection criteria were based on the "SelectKBest" feature selection algorithm, which retains the features with the largest ANOVA F-value between a feature vector and a class label. The reason for choosing such an approach is that it offers a better trade-off in terms of accuracy, stability and stopping criteria in comparison to other feature selection algorithms such as SelectPercentile or VarianceThreshold (Powell et al., 2019). Determining the threshold for an optimal number of features is an iterative process of supervised evaluation of model performance with variable numbers of features.
Initially, a model with all features was built and its performance in terms of accuracy was observed; in subsequent steps, the number of features was iteratively halved as long as the model performance increased. The first iteration in which the reduced number of features led to a decrease in model performance served as the stopping criterion. Finally, a Pearson correlation was computed to assess the correlation between selected features in order to reduce multicollinearity among them. Reducing multicollinearity of features is an essential step for retaining the predictive power of each of them, since using highly correlated features very often hampers model training. Experiments conducted by Lieberman and Morris (2014) indicate a correlation threshold of ±0.5 for optimal model performance.
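The selection steps above (ANOVA F-value ranking, iterative halving and correlation filtering) can be sketched as follows. This is an illustrative NumPy re-implementation standing in for SelectKBest, not the code used in the study; the function names, the greedy order of the correlation filter and the toy data are assumptions.

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F-value of each column of X against class labels y."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        ss_between += group.shape[0] * (group.mean(axis=0) - grand_mean) ** 2
        ss_within += ((group - group.mean(axis=0)) ** 2).sum(axis=0)
    df_between = classes.size - 1
    df_within = X.shape[0] - classes.size
    return (ss_between / df_between) / (ss_within / df_within)

def select_k_best(X, y, k):
    """Indices of the k columns with the largest F-value (ties keep column order)."""
    order = np.argsort(-anova_f(X, y), kind="stable")
    return sorted(order[:k].tolist())

def halve_until_worse(n_features, evaluate):
    """Halve the feature count while accuracy improves; return the best count.

    evaluate: hypothetical callback mapping a feature count to model accuracy.
    """
    best_k, best_score = n_features, evaluate(n_features)
    k = n_features // 2
    while k >= 1:
        score = evaluate(k)
        if score <= best_score:  # performance no longer increases: stop
            break
        best_k, best_score = k, score
        k //= 2
    return best_k

def drop_correlated(X, threshold=0.5):
    """Greedily keep columns whose |Pearson r| with all kept columns is <= threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    kept = []
    for j in range(corr.shape[0]):
        if all(abs(corr[j, i]) <= threshold for i in kept):
            kept.append(j)
    return kept

# Toy data: column 0 separates the classes, column 1 is noise,
# column 2 duplicates column 0 (scaled), so it is redundant.
X = np.array([[1, 1, 2], [2, 5, 4], [3, 2, 6], [7, 4, 14], [8, 1, 16], [9, 5, 18]])
y = np.array([0, 0, 0, 1, 1, 1])
top_two = select_k_best(X, y, 2)
uncorrelated = drop_correlated(X, threshold=0.5)
```

In the toy data the two highest-ranked columns are the informative one and its scaled copy, which is exactly the situation the correlation filter is meant to catch.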

Models Training
The modeling and training process aims at learning classification models capable of discriminating self-reported mental workload scores from subjects (target feature), given the features extracted and selected in the previous step D (independent features). The mental workload scores were selected rather than the type of condition ("Simkap" or "Rest") because we wanted a sensible indicator of mental workload, not a task load condition. In other words, a self-reported indicator of mental workload can be considered a more reliable representation of the user experience than a class representing a certain task load condition. This argument originates from the fact that, in both task load conditions, users can experience any level of cognitive load. For example, a novice user can experience high mental workload for an easy task load condition when compared to an expert user. Similarly, a skilled user can experience moderate mental workload even while in a resting condition because of significant mind wandering. In research from Charles and Nixon (2019), a distinction between the objective elements of the work (taskload) and the subjective perception of mental workload is outlined. Both taskload and subjective perception of mental workload can be mediated by operator experience or time constraint factors. Therefore, it is intuitive that task load conditions are not equivalent to mental workload experiences. In fact, on one hand, the former are strictly defined prior to task execution, and are static, meaning they are immutable during task execution. On the other hand, the latter are unknown prior to task execution and can change depending on a number of factors, including for example the user's prior knowledge, motivation, time of execution, fatigue and stress, among others.
To stress this further, research has clearly shown that even the same person can execute a task, designed with a specific, static load condition (pre-defined task demands), differently at various times of the day (Hancock et al., 1992). Additionally, to facilitate subsequent interpretation, we treated model training as a binary classification problem, mainly to use more interpretable evaluation metrics such as precision, recall, accuracy and f1-score. Therefore, the target feature range of 1-9 of the self-reported mental workload scores was mapped into two levels of mental workload, the "suboptimal MWL" and the "super optimal MWL." The split was adopted based on the assumption of a parabolic relationship between experienced mental workload and performance as outlined in Longo and Rajendran (2021). This split was done by aggregating the scores from 1 to 4, representing some degree of low mental workload (effort), into the "suboptimal MWL," and all the scores from 6 to 9, representing some degree of high mental workload (effort), into the "super optimal MWL." All scores equal to five were discarded because they represent a neutral experience of mental workload.
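The mapping of the 1-9 ratings to the two classes can be written as a short Python sketch; the function name and the example ratings are hypothetical.

```python
def binarize_rating(rating):
    """Map a 1-9 self-reported rating to a binary MWL class.

    Returns None for the neutral rating 5, which is discarded.
    """
    if not 1 <= rating <= 9:
        raise ValueError("rating must be between 1 and 9")
    if rating <= 4:
        return "suboptimal MWL"
    if rating >= 6:
        return "super optimal MWL"
    return None

# Hypothetical ratings; the neutral rating (5) is dropped from the targets.
ratings = [2, 5, 7, 9, 4]
targets = [label for label in map(binarize_rating, ratings) if label is not None]
```

Discarding the neutral rating reduces the number of usable instances, which is one reason the dataset available for training is small.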
The learning techniques chosen for achieving this aim are Logistic Regression (L-R), Support Vector Machines (SVM) and Decision Trees (DTR). Many research works have considered these three learning techniques for continuous and more prolonged EEG recordings (Berka et al., 2007;Hu and Min, 2018;Doma and Pirouz, 2020). Logistic regression and SVM, as error-based learning techniques, are suitable for binary classification tasks (as in this work). On the other hand, as an information-based technique, decision trees are suitable for distinguishing important features by calculating their information gains during model training.
Because the selected dataset is small (48 subjects), repeated Monte Carlo sampling for model training and validation is set up in the following order:
1. A randomized 70% of subjects is selected from both the "suboptimal MWL" and the "super optimal MWL" classes (dependent feature) for model training.
2. The remaining 30% is kept for model testing.
3. The above splits are repeated for 100 iterations to observe random training data, and effectively capture the probability density of the target variable.
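The three steps above can be sketched with the Python standard library. The function names, the fixed seed and the subject identifiers are assumptions for illustration; the 70/30 proportion and the 100 iterations follow the text.

```python
import random

def stratified_split(labels, train_fraction=0.7, rng=random):
    """Split subject ids into train/test, sampling each class separately.

    labels: dict mapping subject id to class label.
    """
    train, test = [], []
    for cls in sorted(set(labels.values())):
        ids = sorted(s for s, label in labels.items() if label == cls)
        rng.shuffle(ids)
        cut = round(train_fraction * len(ids))
        train += ids[:cut]
        test += ids[cut:]
    return train, test

def monte_carlo_splits(labels, n_iterations=100, seed=0):
    """Repeat the randomized stratified split; the seed makes runs reproducible."""
    rng = random.Random(seed)
    return [stratified_split(labels, rng=rng) for _ in range(n_iterations)]

# Hypothetical labels for 20 subjects, 10 per class.
labels = {
    f"sub{i:02d}": "suboptimal MWL" if i < 10 else "super optimal MWL"
    for i in range(20)
}
splits = monte_carlo_splits(labels)
```

Each of the 100 splits would be used to train and test one model, so that the evaluation metrics can be reported as distributions rather than single values.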
A general rule of thumb implies a minimum of 1/5th ratio for each feature in the data to increase model accuracy (Friedman, 1997). Given the low number of training instances in each of the target classes ("suboptimal MWL," "super optimal MWL"), the "curse of dimensionality" problem is anticipated (Verleysen and François, 2005). Therefore, a strategy for generating synthetic data is adopted, which is based on the generation of statistically similar synthetic data that mimics the original data. For this purpose, the Synthetic Data Quality Score, based on metrics like Field Correlation Stability, Deep Structure Stability and Field Distribution Stability (Gretel.ai, 2022), is adopted. To obtain the Field Correlation Stability, the correlation between every pair of independent features (fields) is computed in the training data and then in the synthetic data. The absolute difference between these values is computed and averaged across all independent features. The lower this average, the higher the correlation stability of the synthetic data. Deep Structure Stability verifies the statistical integrity of the generated dataset by performing a deep, multifield analysis of distributions and correlations. This is done by executing Principal Component Analysis (PCA) on the original data, and comparing it against that on the synthetic data. A synthetic quality score is created by comparing the distributional distance between the principal components found in the two datasets. The closer the principal components, the higher the quality of the synthetic data. Field Distribution Stability measures how closely the field distribution in the synthetic data mimics that of the original data.
The comparison of two distributions is done using the Jensen-Shannon (JS) distribution distance given as:

$$JS_{\pi}(O, S) = H(M) - \left(\pi_1 H(O) + \pi_2 H(S)\right)$$

where H(O) and H(S) are the Shannon entropy values for the original (O) and synthetic (S) data respectively, and H(M) is the Shannon entropy of the mixture M, the sum of the dataset probability distributions (P) weighted by the selected weights (π), given as $M = \sum_{i=1}^{2} \pi_i P_i$. The lower the distance score on average across all fields, the higher the Field Distribution Stability quality score and consequently the higher the quality of the synthetic data generated.
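A minimal Python sketch of the entropy-based JS computation for two discrete distributions follows. It is not Gretel.ai's implementation: the function names, the equal weights and the base-2 logarithm are assumptions, and the inputs are taken to be probability vectors over the same bins.

```python
import math

def shannon_entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution (zero-probability bins are skipped)."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

def js_divergence(original, synthetic, weights=(0.5, 0.5), base=2.0):
    """Entropy of the weighted mixture minus the weighted entropies."""
    w_o, w_s = weights
    mixture = [w_o * o + w_s * s for o, s in zip(original, synthetic)]
    return shannon_entropy(mixture, base) - (
        w_o * shannon_entropy(original, base) + w_s * shannon_entropy(synthetic, base)
    )

# Identical distributions give 0; fully disjoint ones give 1 (base 2, equal weights).
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With base-2 logarithms and equal weights the score is bounded between 0 and 1, which makes it straightforward to average across fields.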
The Synthetic Data Quality Score represents the arithmetic mean of the field correlation stability, deep structure stability and field distribution stability. In this sense, the Synthetic Data Quality Score can be viewed as a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same as those drawn from the original data. Synthesizing new data is performed using the synthetic generators offered by Gretel.ai. The training process for the combined (original + synthetic) dataset uses the same Monte Carlo sampling with the same steps as outlined above for the original data. A randomized 70% of subjects is selected from the combined (original + synthetic) dataset for both the "suboptimal MWL" and the "super optimal MWL" classes (dependent feature) for model training, the remaining 30% of the combined (original + synthetic) subjects is kept for model testing, and these randomized splits are repeated for 100 iterations. During model training, the data was normalized using z-score normalization given as z = (x−µ)/σ, where µ is the mean of the training samples and σ is their standard deviation. The rationale for using z-score normalization is that it centers the data (µ = 0) and scales it to unit standard deviation (σ = 1), which makes it suitable for reducing the influence of extreme peak values in the data, by transforming them in such a way that they are no longer massive outliers.
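The z-score normalization can be sketched in Python as below. The use of the population standard deviation and the reuse of the training statistics on the test samples are assumptions of this sketch, as are the function names and the toy values.

```python
from statistics import mean, pstdev

def fit_zscore(train_values):
    """Estimate the mean and standard deviation on the training samples only."""
    return mean(train_values), pstdev(train_values)

def apply_zscore(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously estimated statistics."""
    return [(v - mu) / sigma for v in values]

# Toy feature values (hypothetical): mean 5 and standard deviation 2.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]
mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
test_z = apply_zscore(test, mu, sigma)
```

After the transformation the training feature has zero mean and unit standard deviation, while a test value such as 11 is expressed as three standard deviations above the training mean.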

Models Evaluation
A set of evaluation metrics was employed to assess the ability of the selected models to generalize to unseen data after learning from the training data. These metrics measure and summarize the quality of the trained models when tested with previously unseen data. For a binary classification problem, as in this case, the evaluation of the models depends on the True Positives (tp) and True Negatives (tn), which denote the number of positive and negative instances that are correctly classified. It also depends on the False Positives (fp) and False Negatives (fn), which denote the number of misclassified negative and positive instances, respectively. From these counts, several metrics are used to evaluate the performance of the trained models. Accuracy measures the ratio of correct predictions over the total number of evaluated instances, Accuracy = (tp + tn)/(tp + fp + tn + fn). Precision measures the proportion of instances predicted as positive that are truly positive, Precision = tp/(tp + fp). Recall measures the fraction of positive instances that are correctly classified, Recall = tp/(tp + fn). The F-Measure or f1-score is the harmonic mean of the recall and precision values, f1-score = (2 · Precision · Recall)/(Precision + Recall). These evaluation metrics are essential to assess the robustness of the selected models, built upon high-level features extracted from the MWL indexes, in the discrimination of self-reported perceptions of mental workload. The best model minimizes the number of false positives (fp) for precision and the number of false negatives (fn) for recall. The two usually come at the cost of each other, so neither metric alone captures both kinds of error. The f1-score, being the harmonic mean of precision and recall, takes both metrics into account. Consequently, to test hypotheses H1 and H2 on firm grounds, the f1-score metric is adopted too.
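As an illustration of how the four metrics follow from the confusion-matrix counts, the sketch below computes them from tp, tn, fp and fn. Recall uses the standard definition tp/(tp + fn). The function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and f1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a binary classifier evaluated on 100 instances.
example = classification_metrics(tp=40, tn=30, fp=10, fn=20)
```

With these counts, precision (0.8) is higher than recall (about 0.67), and the f1-score falls between the two, closer to the lower value, which is the behavior expected of a harmonic mean.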

RESULTS
The results section follows the same order of steps as outlined in the design section.

EEG Artifact Removal
Artifact removal is performed on each EEG signal for each of the 48 subjects, separately for the "Rest" and "Simkap" task load conditions. The average number of ICA components removed from the EEG data associated with each subject is 1.61 for the "Rest" and 1.46 for the "Simkap" condition. The number of removed artifacts is within the limits of the adopted methodology described in Nolan et al. (2010). Figure 2 depicts the removal occurrence for a total of 14 components across all 48 users for the "Rest" and "Simkap" conditions. As can be seen from Figure 2, at most one ICA component per subject differs significantly from the other components (±3 standard deviations). These components are removed by zeroing them, and the multi-channel EEG data is subsequently reconstructed by applying the inverse ICA. Since at least one bad component was identified and removed for most subjects, it is reasonable to claim that some artifact has been removed from the EEG signal, thus facilitating the subsequent computation of the alpha and theta bands.
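The zeroing-and-reconstruction step can be sketched as follows. This is a minimal illustration that assumes an unmixing matrix has already been estimated by some ICA algorithm, and that each component has been assigned a scalar score on which the ±3 standard deviation rule is applied. Both the score and the function names are hypothetical and do not reproduce the exact procedure of Nolan et al. (2010).

```python
import numpy as np

def flag_outlier_components(scores, threshold=3.0):
    """Return indices of components whose score lies beyond +/- threshold SDs."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.array([], dtype=int)
    z = (scores - scores.mean()) / std
    return np.flatnonzero(np.abs(z) > threshold)

def remove_components(eeg, unmixing, bad_components):
    """Zero the selected ICA components and project back to channel space.

    eeg: array of shape (n_channels, n_samples)
    unmixing: square array of shape (n_components, n_channels)
    """
    idx = np.asarray(list(bad_components), dtype=int)
    sources = unmixing @ eeg            # channel space -> component space
    sources[idx, :] = 0.0               # remove the flagged components
    mixing = np.linalg.pinv(unmixing)   # inverse ICA transform
    return mixing @ sources             # component space -> channel space

# Toy example: 14 synthetic sources mixed into 14 channels.
rng = np.random.default_rng(0)
true_sources = rng.normal(size=(14, 500))
mixing_matrix = rng.normal(size=(14, 14))
eeg = mixing_matrix @ true_sources
unmixing = np.linalg.inv(mixing_matrix)

scores = [1.0] * 13 + [100.0]           # hypothetical per-component scores
bad = flag_outlier_components(scores)   # only the last component is extreme
cleaned = remove_components(eeg, unmixing, bad)
```

Zeroing a component in the source space and applying the mixing matrix is equivalent to reconstructing the signal without that component's contribution to every channel.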

Evaluation of Feature Extraction and Selection
All high-level features are collected from the statistical properties of the mental workload indexes in various domains, including the temporal and frequency domains. The initial number of collected features is 210, and the exhaustive list is provided in the Supplementary Material accompanying this article. The ANOVA F-value is computed for each of these features, and those with the highest values are retained for subsequent model training.
Since the SelectKBest algorithm requires the number of features to retain to be specified in advance, an iterative approach is adopted: features are included incrementally during model training, and the performance of the models trained with those features is assessed through their accuracy. This iterative search for the optimal number of features is performed on the original dataset. Figure 3 illustrates the convergence toward the optimal number of features in relation to model performance, grouped by learning technique (L-R, SVM, DTR). The result is a reduced set of features that is retained for model training on the dataset enhanced with synthetic data. Figure 4 shows the Pearson correlation matrix for the "Rest" and the "Simkap" states for the alpha-to-theta ratios for the case of index at-1 (as designed in Section 3.3).
Noticeably, most of the features have pairwise correlations in the range between −0.5 and +0.5, which limits multicollinearity and suggests that all of them are potentially relevant and have high predictive capability (Lieberman and Morris, 2014). Figure 4 illustrates the results associated with a single mental workload index (at-1). However, the results associated with the other indexes are mostly consistent with these, as can be seen in Supplementary Figures S1-S9 accompanying this article.
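The ANOVA-based ranking used for feature selection can be illustrated with a short NumPy sketch. It computes the one-way ANOVA F-value of each feature against the class label and keeps the k highest-scoring features. This is a re-implementation of the ranking idea under stated assumptions, not the SelectKBest code itself, and the function names and toy data are hypothetical.

```python
import numpy as np

def anova_f_values(X, y):
    """One-way ANOVA F-value of every feature (column of X) against labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = X.shape[0] - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

def top_k_features(X, y, k):
    """Indices of the k features with the highest F-value, best first."""
    f_values = anova_f_values(X, y)
    return np.argsort(f_values)[::-1][:k]

# Toy data (hypothetical): 48 samples, 5 noise features, feature 2 made informative.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 24)
X = rng.normal(size=(48, 5))
X[y == 1, 2] += 3.0

selected = top_k_features(X, y, k=1)
```

Repeating the selection for increasing values of k and recording the resulting model accuracy gives the kind of curve from which an optimal number of features can be read.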

Evaluation of the Training Set Across Indexes
After the feature selection process, the models were trained with Monte Carlo sampling using Logistic Regression (L-R), Support Vector Machines (SVM) and Decision Trees (DTR), as described in design subsection E. Model training suffered from the "curse of dimensionality" issue since it comprised 48 subjects across only the seven selected features. The number of training instances is low compared to the number of independent features retained for modeling purposes. This is followed by the peak phenomenon of feature inclusion, where the number of features and their cumulative discriminatory effect is essential for the average predictive power of a classifier, which is data-dependent (Zollanvari et al., 2019). The initial model evaluation with test data on the original dataset did not reveal an accuracy above 60% for the standalone mental workload indexes built upon the alpha and theta bands alone (c1 − θ, c2 − θ, c3 − θ, c − α). An accuracy of 70% was observed for the mental workload indexes built upon the alpha-to-theta and theta-to-alpha ratios. An in-depth analysis of the learning curves associated with the classifiers indicated model underfitting and an inability to generalize to test data. Moreover, analyzing the spectral entropy of the mental workload indexes revealed only a small variance, as can be seen from the boxplots of Figure 5. Small data variance subsequently increases the bias, influencing the model's ability to generalize. Thus, as expected, synthetic data generation was applied in order to train robust models.
FIGURE 2 | The number of components removed across all 48 subjects for "Rest" and "Simkap" task load conditions.
FIGURE 3 | Optimal number of features against model performance with data coming from the Simultaneous Task EEG Workload (STEW) dataset. The dashed lines indicate the number of features considered in each iteration. The optimal number of top features to select is around seven, indicated by the green dashed line, which also acts as a stopping criterion.
FIGURE 4 | Pearson correlation coefficient matrix for the MWL index at-1: "Rest" (Left) and "Simkap" (Right) task load conditions. at-1: alpha-to-theta ratio between the indexes c − α and c1 − θ. The scale on the right of the image indicates the range of the Pearson correlation coefficients.
FIGURE 5 | Variance of the spectral entropy associated with the original data: Left ("Rest" state), Right ("Simkap" state). The interquartile range (Q1-Q3) is small.
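For readers unfamiliar with the spectral entropy referred to in this subsection, the sketch below shows one common way of computing it: the Shannon entropy of the normalized power spectrum of a series. The exact definition used for the features in this study is not restated here, so the normalization by the number of frequency bins and the function name are assumptions.

```python
import numpy as np

def spectral_entropy(signal, normalize=True):
    """Shannon entropy of the normalized power spectrum of a 1-D series."""
    signal = np.asarray(signal, dtype=float)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    total = power.sum()
    if total == 0:
        return 0.0  # constant series: no spectral content
    p = power / total
    p = p[p > 0]  # skip empty bins, since 0 * log(0) is taken as 0
    entropy = -np.sum(p * np.log2(p))
    if normalize:
        entropy /= np.log2(power.size)  # scale to the range [0, 1]
    return float(entropy)

# Toy series of 150 points, matching the number of 1 s segments per subject.
t = np.arange(150)
sine = np.sin(2 * np.pi * 10 * t / 150)        # power concentrated in one bin
noise = np.random.default_rng(3).normal(size=150)  # power spread over all bins

entropy_sine = spectral_entropy(sine)
entropy_noise = spectral_entropy(noise)
```

A series whose power is concentrated in a few frequencies yields a value near zero, whereas a noise-like series yields a value near one. A narrow spread of this quantity across subjects is what the boxplots of Figure 5 are meant to convey.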

Synthetic Data Evaluation
The input for data synthesis was the initial dataset comprised of 48 subjects and 150 data points (2.5 min of EEG data per participant split into segments of 1 s) for each of the indexes designed in Equation 3. Two synthetic datasets are created, one for the "Rest" and one for the "Simkap" task load conditions in order to retain the original dataset's intrinsic properties. Table 2 illustrates the overall synthetic quality scores for the mental workload indexes set in Equation 3.
Findings suggest that the overall synthetic data quality score is always above 87% across all the mental workload indexes selected for the comparative analysis. The synthetic quality score is measured on the following scale: (1-20)% Very Poor, (20-40)% Poor, (40-60)% Moderate, (60-80)% Good and (80-100)% Excellent. The quality of the synthesized data is therefore excellent, and in line with similar studies (Hernandez-Matamoros et al., 2020). Consequently, an additional 180 synthesized subjects were generated, each with 150 data points (2.5 min of EEG activity split into 150 segments of 1 s) for each mental workload index. The final combined dataset with original and synthesized data
TABLE 2 | Synthetic score for different mental workload indexes and two task load conditions ("Rest" and "Simkap").
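The way the quality score is formed and read can be illustrated with a small sketch. It takes the score as the arithmetic mean of the three stability measures and maps it onto the five-level scale given above. The handling of the band boundaries (upper bound inclusive), the function names and the input values are assumptions made for illustration, and do not describe the Gretel.ai implementation.

```python
def synthetic_quality_score(correlation_stability, deep_structure_stability,
                            distribution_stability):
    """Arithmetic mean of the three stability measures, all in percent."""
    return (correlation_stability + deep_structure_stability
            + distribution_stability) / 3.0

def quality_label(score):
    """Map a percentage score to the five-level scale (upper bounds inclusive)."""
    if score > 80:
        return "Excellent"
    if score > 60:
        return "Good"
    if score > 40:
        return "Moderate"
    if score > 20:
        return "Poor"
    return "Very Poor"

# Hypothetical stability values for one index and one task load condition.
score = synthetic_quality_score(90.0, 84.0, 87.0)
label = quality_label(score)
```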

Validation of Models for Discriminating Self-Reported Perceptions of Mental Workload
The training of the models with the Logistic Regression and Support Vector Machines learning techniques utilized the linear optimizer, since it offers speed and optimal convergence when minimizing a multivariate function by solving univariate optimization problems during repeated training of the model (Fan et al., 2008). For model training with Decision Trees, the Gini index was used to measure the quality of each split while building the model.
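To make the split criterion concrete, the following sketch computes the Gini index of a set of labels and the weighted Gini index of a candidate threshold split on a single feature. It is a minimal illustration with hypothetical function names and toy values, not the decision tree implementation used in the study.

```python
import numpy as np

def gini_impurity(labels):
    """Gini index of a label set: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))

def split_gini(feature, labels, threshold):
    """Weighted Gini index of the two children produced by a threshold split."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = labels.size
    return (left.size / n) * gini_impurity(left) + (right.size / n) * gini_impurity(right)

# Toy feature with two balanced classes (hypothetical values).
feature = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]

parent = gini_impurity(labels)               # impurity before splitting
perfect = split_gini(feature, labels, 0.5)   # split separating the classes
poor = split_gini(feature, labels, 0.15)     # split isolating a single sample
```

A tree-growing procedure prefers the threshold with the lowest weighted Gini index, which in this toy case is the split at 0.5.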
The classifiers' performance is shown in Figure 6. The evaluation metrics are shown across all mental workload indexes and are presented in descending order. The best classification accuracy is observed for the models built with Support Vector Machines (SVM) and Logistic Regression (L-R). In order to identify the best learning technique, a two-tailed t-test between the three learning techniques was performed on the employed evaluation metrics. The results indicated no statistically significant difference between Logistic Regression (L-R), Support Vector Machines (SVM) and Decision Trees (DTR). This supports the validity of the training approach adopted in the design: no matter which learning technique is adopted, the results across all applied evaluation metrics do not differ significantly. Table 3 reports the p-values of the t-tests between evaluation metrics for each learning technique used in the study. The t-test was conducted with a significance level of α = 0.05.
Further analysis of the alpha-to-theta ratio indexes (at − 1, at − 2) indicates better performance than that of the individual indexes used for computing those ratios (c1 − θ, c2 − θ and c − α). This holds for all learning techniques (L-R, SVM and DTR).
In the case of the theta-to-alpha mental workload indexes, this is also seen in the first two indexes (ta − 1 and ta − 2). Figure 7 illustrates the performance of the band ratios (alpha-to-theta and theta-to-alpha) across all evaluation metrics, given as density plots for the case of Support Vector Machines (SVM). The density plots for all other learning techniques are available in Supplementary Figures S10-S11. Table 4 outlines the significance levels of a two-tailed t-test between the alpha-to-theta and theta-to-alpha ratio indexes and the indexes used to construct those ratios. A comparison of the average model performance between the original data and the data enhanced with synthetic data is shown in Table 5. An analysis of the number of electrodes across the alpha and theta bands, as given in Table 1 in design Section 3, shows a higher number of electrodes in indexes c1 − θ and c3 − θ than in indexes c2 − θ and c − α. To assess the impact of the number of electrodes on the overall performance of the models, cross-plots between indexes at − 1 vs. at − 2 and at − 3 vs. at − 2, as well as ta − 1 vs. ta − 2 and ta − 3 vs. ta − 2, are analyzed. Figure 8 illustrates this cross density plot comparison of performance between the alpha-to-theta and theta-to-alpha ratio indexes. Furthermore, a two-tailed significance test between these band ratio indexes (at − 1 and at − 3 vs. at − 2, as well as ta − 1 and ta − 3 vs. ta − 2) reveals a statistically significant difference. Table 6 presents the p-values for a significance level of α = 0.05, together with the p-values obtained with Bonferroni correction applied, resulting in a significance level set at α = 0.005.
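The effect of the Bonferroni correction on the significance level can be sketched as follows. The number of comparisons is not stated in this section; the value of ten used below is an assumption inferred from the ratio between the two significance levels (0.05 and 0.005). The p-values are hypothetical and are not taken from Tables 4 or 6.

```python
def bonferroni(p_values, n_comparisons, alpha=0.05):
    """Return the corrected threshold, adjusted p-values and significance flags."""
    threshold = alpha / n_comparisons
    adjusted = [min(1.0, p * n_comparisons) for p in p_values]
    significant = [p < threshold for p in p_values]
    return threshold, adjusted, significant

# Hypothetical raw p-values; n_comparisons=10 is an assumed value.
raw_p = [1e-6, 0.004, 0.02]
threshold, adjusted, significant = bonferroni(raw_p, n_comparisons=10)
```

Comparing raw p-values against α divided by the number of comparisons is equivalent to comparing the adjusted p-values against the original α, which is why the third value above is significant at 0.05 before correction but not after it.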

DISCUSSION
Rapid advancements in various tools and technologies have introduced new perspectives on using EEG signals to classify task load conditions with machine learning techniques. The analysis done so far on EEG frequency bands, specifically the alpha and theta bands, seems to correlate changes in these bands with task load (Gevins and Smith, 2003; Borghini et al., 2014). Researchers face many problems in using EEG band ratios for the purpose of mental workload modeling: (i) the limited number of participants in each empirical experiment, (ii) the lack of a clear definition of mental workload and (iii) the lack of a clear EEG measure of mental workload. These issues can be overcome, and this article provides evidence for such a claim. In fact, this research work demonstrates how the first issue can be tackled by using modern deep-learning methods for synthetic data generation, making it possible to expand the often limited cardinality of existing datasets created with EEG data. It also contributes to tackling the second issue by advancing the understanding of mental workload as a construct by means of an empirical experiment with EEG data. In particular, it constructs indexes of mental workload by employing the alpha and theta EEG bands individually and in combination, and extracts statistical features from these indexes for the discrimination of self-reported perceptions of mental workload.
Results show that, starting from a highest accuracy of 60% for the individual alpha and theta indexes on the original dataset, classifier performance increased by between 8 and 20% when this data was augmented with synthetic data.
Regarding the mental workload ratio indexes, especially the alpha-to-theta indexes, it was possible to build models with a minimum of 18.4-30.2% higher performance (as measured by f1-score, accuracy, precision and recall) than the other indexes. Furthermore, the results show that the mental workload indexes at − 1, at − 2 and ta − 1, ta − 2 can better discriminate self-reported perceptions of mental workload in comparison to their individual counterparts (c − α, c1 − θ and c2 − θ). This supports hypothesis H1, given earlier, that the alpha-to-theta and theta-to-alpha ratios can discriminate self-reported perceptions of mental workload significantly better than the individual use of EEG band power, and that they can be used in designing highly accurate classification models. The accuracy, f1-score, recall and precision evaluation metrics indicate a good classification across almost all alpha-to-theta and theta-to-alpha indexes.
One interesting observation is the impact of the number of electrodes in the selected indexes on the overall accuracy of the classifiers. For example, it can be seen from Table 1 that c1 − θ from the theta band has a higher number of electrodes that contribute to the computation of the band ratios, and indicates higher accuracy in both the alpha-to-theta and theta-to-alpha indexes. Given the results from Figure 8 and Table 6, hypothesis H2, namely that a higher number of electrodes used for calculating the alpha-to-theta and theta-to-alpha ratios improves the predictive power of the classifiers, cannot be conclusively proven. Figure 8 indicates better performance of the at − 2 and ta − 2 indexes, which are computed from the c2 − θ and c − α individual indexes that, as seen from Table 1, have fewer electrodes. One potential explanation hypothesized by the authors lies in the nature of the experiment performed while collecting the EEG recordings of the STEW dataset, where the "Rest" and "Simkap" activities are performed in sequence, one after the other. Some research indicates a strong correlation between EEG frequency patterns and the relative levels of distinct neuromodulators (Vakalopoulos, 2014). This sudden change in task load activity may lead to neuromodulation in the parietal region and neuronal suppression in the frontal cortical region, resulting in better performance of the band ratio indexes (at − 2 and ta − 2) with a smaller number of electrodes. Further research is required to validate this claim.
Based on the results above, we can conclude that EEG band ratios, namely the alpha-to-theta and theta-to-alpha ratio mental workload indexes, can significantly discriminate self-reported perceptions of mental workload and can be used to design models for detecting such levels of mental workload perception. The observations, however, cannot conclusively prove that a higher number of electrodes, especially in the parietal region, leads to a better discrimination of self-reported perceptions of mental workload.
TABLE 4 | The two-tailed t-test between the alpha-to-theta and theta-to-alpha ratio indexes and their individual indexes, with Bonferroni (†) correction applied, resulting in a significance level set at α = 0.005.
.20 (5.73 · 10^−29)** (6.88 · 10^−28)†
ta-2 -11.6 (2.75 · 10^−24)** (3.30 · 10^−23)† -8.07 (6.16 · 10^−14)** (7.93 · 10^−13)†
ta-3 --14.6 (2.83 · 10^−33)** (3.40 · 10^−32)† 11.21 (6.54 · 10^−23)** (7.85 · 10^−22)†

CONCLUSION
Various EEG frequency bands have been reported to correlate directly with human mental workload. In particular, EEG bands such as the alpha and theta bands tend to increase or decrease in the state of mental workload (Borghini et al., 2014). However, a conjoint analysis of both bands in the form of indexes over time has not been sufficiently carried out so far. This article has empirically demonstrated that EEG band ratios, specifically the alpha-to-theta and theta-to-alpha ratios, can be treated as mental workload indexes for the discrimination of self-reported perceptions of mental workload. In detail, a set of higher-level features associated with these indexes has proven useful for the inductive formation of models, by means of machine learning, for the discrimination of two levels of mental workload perception ("suboptimal MWL" and "super optimal MWL"). Another important contribution of this research is the analysis of the impact of electrode density in band ratios on the formation of discriminative models of self-reported perceptions of mental workload.
at − 2 vs. at − 3: 2-tail test value between indexes at − 2 and at − 3. ta − 1 vs. ta − 2: 2-tail test value between indexes ta − 1 and ta − 2. ta − 2 vs. ta − 3: 2-tail test value between indexes ta − 2 and ta − 3. The (†) sign indicates the p-value with Bonferroni correction applied, resulting in a significance level set at α = 0.005.
Future research work will investigate the usage of the alpha-to-theta and theta-to-alpha ratio indexes in relation to the following issues:
• replication of the experiment conducted in this research with additional publicly available datasets, to further validate the contribution to knowledge.
• evaluation of human tasks different from those employed in this research, for instance those conducted in the automobile industry (Di Flumeri et al., 2018), in the context of Human-Computer Interaction (HCI) (Longo, 2012) and in education (Longo, 2018b; Longo and Orru, 2018).
• use of multi-channel EEG data collected from a larger pool of electrodes, and thus formation and evaluation of additional mental workload indexes built with different clusters of electrodes for the alpha and theta bands.
• the design of a novel experiment with additional task load conditions of incremental complexity, for example by employing the multiple resource theory (Wickens, 2008), and the definition of objective task performance measures that can be used as dependent features for indexes of mental workload.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
LL and BR designed the study. BR conducted the experiment. All authors reviewed and approved the final manuscript.