Performance Improvement of Near-Infrared Spectroscopy-Based Brain-Computer Interface Using Regularized Linear Discriminant Analysis Ensemble Classifier Based on Bootstrap Aggregating

Ensemble classifiers have been shown to yield better classification accuracy than a single strong learner in many machine learning studies. Although many studies on electroencephalography-brain-computer interfaces (BCIs) have used ensemble classifiers to enhance BCI performance, ensemble classifiers have hardly been employed for near-infrared spectroscopy (NIRS)-BCIs. In addition, since there has not been any systematic and comparative study, the efficacy of ensemble classifiers for NIRS-BCIs remains unknown. In this study, four NIRS-BCI datasets were employed to evaluate the efficacy of linear discriminant analysis ensemble classifiers based on bootstrap aggregating. From the analysis results, significant (or marginally significant) increases in the bitrate as well as the classification accuracy were found for all four NIRS-BCI datasets employed in this study. Moreover, significant bitrate improvements were found in two of the four datasets.


INTRODUCTION
In general, brain-computer interface (BCI) systems (1) measure the brain signals in response to specific stimuli or mental tasks, (2) extract representative features from the acquired brain signals, (3) translate them by applying pattern recognition algorithms, and (4) control external devices or communicate with environments (Wolpaw et al., 2002; Schalk et al., 2004). In some cases, feedback is given to BCI users to improve the BCI performance (Lebedev and Nicolelis, 2006; Hwang et al., 2009; Kanoh et al., 2009; Blankertz et al., 2010). Among the aforementioned procedures, feature selection and pattern recognition are the most important parts that determine the overall performance of a BCI system (Nicolas-Alonso and Gomez-Gil, 2012). Particularly in the case of near-infrared spectroscopy (NIRS)-BCI, many different kinds of features have been tested to validate their suitability for various NIRS-BCI systems with different experimental paradigms and environments (Hwang et al., 2014; Naseer et al., 2016). It was reported that the temporal mean, maximum, and slope yielded reasonable BCI performance; however, there is no consensus on the most suitable features that can be generally applied to different NIRS-BCIs. In addition, various kinds of pattern recognition methods have also been proposed and tested with the aim of improving the performance of NIRS-BCI systems. Among them, the linear discriminant analysis (LDA) classifier has been most widely used for NIRS-BCIs because of its fast learning and good classification performance (Holper and Wolf, 2011; Power et al., 2011, 2012a; Schudlo and Chau, 2014; Hong et al., 2015; Shin et al., 2016, 2018a; Hong and Khan, 2017).
In applying the classifier, dimension reduction or feature selection methods are generally employed because the dimensionality of NIRS feature vectors is usually larger than the number of training samples, which might degrade the BCI performance owing to a poorly estimated sample covariance matrix (Sereshkeh et al., 2019). Regularization with a shrinkage parameter is another option to alleviate the adverse effect of the large dimensionality (Fazli et al., 2012).
Ensemble learning can be considered a good alternative for further improving the overall BCI performance. Ensemble classifiers are grounded in the theory that a combination of multiple weak learners that barely exceed the chance level is capable of achieving better classification accuracy than a single strong learner. It has been reported that ensemble classifiers can improve the performance of electroencephalography-BCIs. Sun et al. (2007), Ahangi et al. (2013), and Gao et al. (2016) employed various types of ensemble learning methods (e.g., bagging, boosting, and random subspace) to evaluate the feasibility of ensemble learning for motor imagery EEG data. Fatourechi et al. (2008) stacked support vector machine (SVM) classifiers to classify finger flexion movement with a low false positive rate. Rakotomamonjy and Guigue (2008) employed a majority voting system based on SVM for P300 signals elicited by an oddball paradigm. Hassan and Bhuiyan (2017) demonstrated the automated identification of sleep stages by means of boosting methods, and Hosseini et al. (2018) exploited random subspace ensembles and majority voting for seizure detection. In the case of NIRS-BCIs, there have been a few studies that employed ensemble classifiers (Schudlo and Chau, 2015; Gurel et al., 2019), but they did not compare the performance of ensemble classifiers with that of conventional classifiers. To the best of our knowledge, no study has systematically investigated the performance improvement of NIRS-BCIs by the employment of ensemble classifiers. Specifically, because regularized linear discriminant analysis (RLDA), which alleviates the degradation of classification accuracy, is generally known to be appropriate for high-dimensional NIRS datasets, we employed RLDA as the weak learner in the ensemble method.
In the present study, for the first time, we explore whether the performance of NIRS-BCIs can be enhanced by using an ensemble of weak learners rather than a single strong learner through a systematic comparison of BCI performances with multiple NIRS datasets recorded with different experimental paradigms and/or under different recording environments.

MATERIALS AND METHODS
We employed four different NIRS datasets recorded by the first author of this paper. Datasets denoted by "dataset I" and "dataset II" can be freely downloaded at: http://doc.ml.tu-berlin.de/hBCI/ (Shin et al., 2017b) and "dataset III" can be downloaded at: http://dx.doi.org/10.14279/depositonce-5830 (Shin et al., 2018c). "Dataset IV" is a NIRS dataset used in the study of Shin et al. (2018b). All data processing was performed using MATLAB R2018b (Mathworks, MA, United States) and the BBCI toolbox 1 (Blankertz et al., 2016). A brief summary of the datasets I-IV is given in Table 1.

Data Recording
Near-infrared spectroscopy data were collected using NIRScout (NIRx GmbH, Berlin, Germany) at a sampling rate of 12.5 Hz. Adjacent source-detector distance was fixed to 30 mm. The locations of nine physical NIRS channels over the prefrontal area are depicted in Figure 1A.

Two Motor Imagery Tasks (Dataset I)
Twenty-nine participants were seated and performed two designated motor imagery (MI) tasks (kinesthetic motor imagery of grasping with either the left or the right hand at a rate of approximately 1 Hz) during the task period (0-10 s), 30 times each, in a randomized order.

Mental Arithmetic vs. Idle State (Dataset II)
The same participants who took part in the previously described MI experiment (dataset I) were asked to perform a mental arithmetic (MA) task. Starting with an initial problem of subtracting a single-digit number between 6 and 9 from a three-digit number (e.g., 219 − 7), they continuously subtracted the given single-digit number from the result of the previous calculation (e.g., 219 − 7 = 212, 212 − 7 = 205, 205 − 7 = 198, ...) as fast as they could during the task period (0-10 s). For the idle state (IS), the participants relaxed and tried not to come up with any distracting thoughts during the task period (0-10 s). The MA and IS tasks were randomly repeated 30 times each.

Footnote 1: https://github.com/bbci/bbci_public

Data Recording
Near-infrared spectroscopy data were acquired with NIRScout at a sampling rate of 10.4 Hz. Sixteen sources and 16 detectors were placed over the frontal area (around AFz), and sixteen NIRS channels with a source-detector separation of 30 mm were created. The NIRS channel locations are illustrated in Figure 1B.

Word Generation vs. Idle State (Dataset III)
For the word generation (WG) task, twenty-six participants were seated and kept coming up with words beginning with a randomly given syllable as quickly as they could during the given task period (0-10 s). Repetition of the same word was not allowed for each trial to avoid potential adaptation. For the IS, the participants took a rest and tried not to think about anything for 10 s. The WG and IS tasks were randomly performed 30 times each.

Data Recording
Near-infrared spectroscopy data were acquired at a sampling rate of 13.3 Hz using a portable NIRS acquisition system (LIGHTNIRS, Shimadzu Corp., Kyoto, Japan). Six sources and six detectors over the prefrontal area created 16 NIRS channels with a 30-mm source-detector separation. The locations of the 16 physical NIRS channels are illustrated in Figure 1C.

Mental Arithmetic vs. Motor Imagery vs. Idle State (Dataset IV)
For the MI task, seventeen participants were seated and imagined complex finger tapping at a rate of approximately 2 Hz for 10 s. The participants performed the MA task in the same way as for dataset II, and for the IS, they relaxed without performing any specific mental task. The MI, MA, and IS tasks were randomly performed 30 times each.

Behavioral Data
Available behavioral data are stored in each repository for the datasets I-IV.

Preprocessing
In the original articles (Shin et al., 2017b, 2018b), the four datasets were preprocessed in different manners. For the sake of a fair performance comparison, all datasets were preprocessed in the same manner here. The hemodynamic changes in reduced and oxidized hemoglobin (ΔHbR and ΔHbO) were converted from the raw light intensity changes using the modified Beer-Lambert law, and were then band-pass filtered using a zero-phase Butterworth filter with a passband of 0.01-0.09 Hz to eliminate physiological noise (Matthews et al., 2008). No trials were excluded because the recorded data were minimally affected by motion artifacts.
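The zero-phase band-pass step can be sketched as follows. This is only an illustration on synthetic data: the filter order (3), the 10 Hz sampling rate, and the function name `bandpass_zero_phase` are assumptions, since the text specifies only the filter type and the 0.01-0.09 Hz passband. The conversion by the modified Beer-Lambert law is not shown.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_zero_phase(x, fs, low=0.01, high=0.09, order=3):
    """Zero-phase Butterworth band-pass along the last axis (time).

    The order is an assumed value; the passband follows the text.
    """
    # Second-order sections are numerically safer than (b, a) coefficients
    # for such a narrow, low-frequency passband.
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # Forward-backward filtering cancels the phase delay of a single pass.
    return sosfiltfilt(sos, x, axis=-1)

# Synthetic example: a slow component inside the passband plus
# a 1 Hz cardiac-like oscillation that should be removed.
fs = 10.0                                    # Hz, illustrative
t = np.arange(0, 600, 1 / fs)                # 10 min of samples
slow = np.sin(2 * np.pi * 0.05 * t)
cardiac = 0.5 * np.sin(2 * np.pi * 1.0 * t)
filtered = bandpass_zero_phase(slow + cardiac, fs)
```

Away from the edges of the recording, `filtered` should follow `slow` closely and without a time shift, which is the reason for choosing a forward-backward filter.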

Classification
The classification procedures were performed using the data from each of the participants separately.

Features
The baseline of the filtered data was corrected by subtracting the temporal mean of the data within the [-1 0] s interval. The baseline-corrected data were then segmented into epochs ranging from 0 to 15 s, which contained part of the post-task break period, considering the hemodynamic delay on the order of several seconds (approximately 6-8 s) (Cui et al., 2010). ... (2)] × [the number of windows (2)].
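A minimal sketch of the baseline correction and epoching described above is given below. The feature part is an assumption: the text states only that two time windows were used, so the window limits (0-5 s and 5-10 s) and the choice of mean and slope as feature types are hypothetical, as are the function names.

```python
import numpy as np

def epoch_and_baseline(signal, fs, onset_s, base=(-1.0, 0.0), epoch=(0.0, 15.0)):
    """Cut one epoch from signal (n_channels, n_samples) and subtract the
    temporal mean of the [-1, 0] s interval relative to task onset."""
    def idx(t):
        # Seconds relative to onset -> absolute sample index.
        return int(round((onset_s + t) * fs))
    baseline = signal[:, idx(base[0]):idx(base[1])].mean(axis=1, keepdims=True)
    return signal[:, idx(epoch[0]):idx(epoch[1])] - baseline

def window_features(ep, fs, windows=((0.0, 5.0), (5.0, 10.0))):
    """Mean and slope per channel and window (hypothetical window limits)."""
    feats = []
    for start, stop in windows:
        seg = ep[:, int(round(start * fs)):int(round(stop * fs))]
        t = np.arange(seg.shape[1]) / fs
        # polyfit on (n_samples, n_channels) fits each channel separately;
        # row 0 of the result holds the slopes.
        slope = np.polyfit(t, seg.T, 1)[0]
        feats.extend([seg.mean(axis=1), slope])
    return np.concatenate(feats)

# Synthetic example: channel 0 is a ramp (2 units/s), channel 1 is constant.
fs = 10.0
t = np.arange(300) / fs
signal = np.vstack([2.0 * t, np.full_like(t, 3.0)])
ep = epoch_and_baseline(signal, fs, onset_s=5.0)
feats = window_features(ep, fs)
```

With two channels, two feature types, and two windows, `feats` has 2 × 2 × 2 entries, which mirrors the product form of the feature dimension in the text.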

Single Strong Learner
Three types of classifiers were considered, namely SVM, LDA, and RLDA. For SVM, a linear kernel was employed and the feature vectors were standardized by subtracting the mean and dividing by the standard deviation. The other parameters were left at the default options given by MATLAB. For LDA, the typical (unregularized) LDA was used. A typical LDA classifier assigns a sample x to the k-th class that maximizes

$$\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k,$$

where π_k, µ_k, and Σ are the a priori probability and the mean of the samples in the k-th class, and the covariance matrix common to all classes, respectively. However, in the case of NIRS feature vectors, typical LDA is not likely to be adequate because the classification accuracy degrades under high dimensionality, that is, when the number of features is greater than the number of samples. For this reason, the RLDA classifier with a shrinkage parameter (γ) was employed to alleviate the adverse effects of large dimensionality on the BCI performance by replacing the empirical covariance matrix Σ with (1 − γ)Σ + γI, where I is the identity matrix. The optimal γ between 0 and 1 was determined individually following Ledoit and Wolf (2004), Schäfer and Strimmer (2005), and Blankertz et al. (2011). For the ternary classification, linear SVM and LDA with a "one-versus-one" error-correcting output model were used, and the multiclass RLDA was applied.
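The shrinkage regularization described above can be illustrated with a small NumPy sketch. It is not the toolbox implementation used in the study: the class name `ShrinkageLDA` is hypothetical, γ is passed in as a fixed value rather than determined analytically, and the shrinkage target is the plain identity matrix as written in the text (some implementations additionally scale the identity by the average eigenvalue of Σ).

```python
import numpy as np

class ShrinkageLDA:
    """LDA whose pooled covariance is replaced by (1 - gamma) * Sigma + gamma * I."""

    def __init__(self, gamma=0.1):
        self.gamma = gamma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n, d = X.shape
        means = np.array([X[y == k].mean(axis=0) for k in self.classes_])
        priors = np.array([(y == k).mean() for k in self.classes_])
        # Pooled within-class covariance (common to all classes).
        centered = X - means[np.searchsorted(self.classes_, y)]
        sigma = centered.T @ centered / (n - len(self.classes_))
        # Shrinkage keeps the matrix invertible even when d > n.
        sigma = (1 - self.gamma) * sigma + self.gamma * np.eye(d)
        # Column k of w_ is Sigma^{-1} mu_k; b_ holds the class offsets.
        self.w_ = np.linalg.solve(sigma, means.T)
        self.b_ = -0.5 * np.sum(means.T * self.w_, axis=0) + np.log(priors)
        return self

    def predict(self, X):
        # Pick the class with the largest linear discriminant score.
        return self.classes_[np.argmax(X @ self.w_ + self.b_, axis=1)]

# Synthetic example with more features (60) than training samples (40).
rng = np.random.default_rng(0)

def make_data(n_per_class, d=60, shift=3.0):
    X = rng.standard_normal((2 * n_per_class, d))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :5] += shift          # only five features are informative
    return X, y

X_train, y_train = make_data(20)
X_test, y_test = make_data(100)
clf = ShrinkageLDA(gamma=0.1).fit(X_train, y_train)
acc_rlda = np.mean(clf.predict(X_test) == y_test)
```

With γ = 0 the covariance estimate in this example would be singular, so the linear system could not be solved reliably; any γ > 0 removes that problem.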

Ensemble of Weak Learners
The bootstrap aggregating (Bagging) algorithm draws N_learn training sets of the same size with replacement (fraction of the training set resampled for every weak learner: 100% in this study), and then builds one classification model per training set (N_learn models in total) using a weak learner h(·). The final aggregate classification model H(x), based on majority voting, is given by:

$$H(x) = \underset{k}{\arg\max} \sum_{i=1}^{N_{learn}} \mathbb{1}\left[h_i(x) = k\right],$$

where h_i is the weak learner trained on the i-th bootstrap training set. To verify the efficacy of the LDA ensemble classifier, the RLDA classifier was used as the weak learner and the value of γ was set to 0.1 as a rule of thumb. Stratified random sampling was applied to split the whole dataset into ten subsets, and a 10 × 10-fold cross-validation was performed for both the single strong learner and the ensemble of weak learners, resulting in the "strong classification accuracy (acc_strong)" and the "Bagging classification accuracy (acc_bag)," respectively.
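The Bagging procedure with an RLDA weak learner can be sketched as follows. This is a simplified illustration on synthetic data, not the MATLAB pipeline used in the study: the function names are hypothetical, the redraw of bootstrap samples that miss a class is an added safeguard, and the 10 × 10-fold cross-validation is omitted.

```python
import numpy as np

def fit_rlda(X, y, classes, gamma=0.1):
    """Weak learner: LDA with covariance (1 - gamma) * Sigma + gamma * I."""
    d = X.shape[1]
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    priors = np.array([(y == k).mean() for k in classes])
    centered = X - means[np.searchsorted(classes, y)]
    sigma = centered.T @ centered / max(len(y) - len(classes), 1)
    sigma = (1 - gamma) * sigma + gamma * np.eye(d)
    W = np.linalg.solve(sigma, means.T)
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return W, b

def fit_bagging(X, y, n_learn=50, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    models = []
    while len(models) < n_learn:
        # Bootstrap sample: same size as the training set, with replacement.
        idx = rng.integers(0, len(y), size=len(y))
        if len(np.unique(y[idx])) < len(classes):
            continue  # redraw if a class is missing (assumed safeguard)
        models.append(fit_rlda(X[idx], y[idx], classes, gamma))
    return classes, models

def predict_bagging(classes, models, X):
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    rows = np.arange(X.shape[0])
    for W, b in models:
        votes[rows, np.argmax(X @ W + b, axis=1)] += 1
    # Majority vote; ties are resolved in favor of the lowest class label.
    return classes[np.argmax(votes, axis=1)]

# Synthetic two-class example (not NIRS data).
rng = np.random.default_rng(1)

def make_data(n_per_class, d=30, shift=3.0):
    X = rng.standard_normal((2 * n_per_class, d))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :5] += shift
    return X, y

X_train, y_train = make_data(30)
X_test, y_test = make_data(100)
classes, models = fit_bagging(X_train, y_train, n_learn=25)
acc_bag = np.mean(predict_bagging(classes, models, X_test) == y_test)
```

Each weak learner sees a different bootstrap sample, so the individual decision boundaries differ slightly; the vote averages out part of that variability.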

Bitrate
Information transfer rate (ITR) is one of the most popular metrics to evaluate the performance of communication systems. The ITR per minute, called bitrate, is utilized to assess the performance of BCI systems, as follows (Dornhege et al., 2007):

$$\text{Bitrate} = \frac{60}{T}\left[\log_2 n + acc \cdot \log_2 acc + (1 - acc)\log_2\frac{1 - acc}{n - 1}\right],$$

where T, n, and acc are a single trial length in seconds (usually the length of the task period), the number of different types of mental tasks, and classification accuracy, respectively.
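As a worked illustration, the bitrate can be computed with the short function below. It assumes the standard Wolpaw-style ITR definition in terms of T, n, and acc, with T given in seconds; the function name is hypothetical.

```python
import math

def bitrate(acc, n_classes, trial_len_s):
    """Bits per minute for accuracy acc in [0, 1], n_classes mental tasks,
    and a single trial length of trial_len_s seconds."""
    if not 0.0 <= acc <= 1.0:
        raise ValueError("acc must lie in [0, 1]")
    bits = math.log2(n_classes)
    # Terms of the form 0 * log2(0) are taken as 0.
    if acc > 0.0:
        bits += acc * math.log2(acc)
    if acc < 1.0:
        bits += (1.0 - acc) * math.log2((1.0 - acc) / (n_classes - 1))
    return bits * 60.0 / trial_len_s

# A perfect binary classifier with 10 s trials conveys 1 bit per trial.
perfect_binary = bitrate(1.0, 2, 10.0)
# At the binary chance level no information is transferred.
chance_binary = bitrate(0.5, 2, 10.0)
```

Because the trial length enters as 60/T, a fixed 10 s task period caps the binary bitrate at 6 bits/min, whatever the classifier.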

Statistical Test
Normality of the data distribution was tested with the Anderson-Darling test, and according to the test decision (p < 0.05), a two-tailed paired t-test was performed to test the hypothesis that the averages of acc_bag and acc_strong are different. The p-values were corrected by the false discovery rate (Benjamini and Yekutieli, 2001) unless otherwise noted.

RESULTS
Figure 2 shows the grand average of the classification accuracy as a function of N_learn. As N_learn increased, the classification accuracies improved irrespective of the type of NIRS dataset. Overall, the rate of increase dropped rapidly for N_learn > 10, and the classification accuracy had almost converged at N_learn = 50. Comparisons of individual acc_strong and acc_bag are presented in Figures 3, 4. In Figure 3, magenta dashed lines indicate the classification accuracy value (70%) generally known as a threshold for effective BCI control. Black dashed lines denote the theoretical chance levels based on the binomial distribution (p < 0.05) (Combrisson and Jerbi, 2015). For Figures 3, 4, the values of N_learn used to compute acc_bag differ between individuals, and the optimal values of N_learn were chosen in the range of 10 ≤ N_learn ≤ 50.

For dataset I, the grand average of acc_bag (62.6 ± 9.6%) was significantly higher than the averages of all three acc_strong, i.e., acc_SVM (59.6 ± 9.5%), acc_LDA (57.9 ± 9.5%), and acc_RLDA (59.1 ± 11.6%). For dataset II, the grand average of acc_bag (88.5 ± 7.7%) was significantly higher than those of the strong learners except for acc_RLDA (86.7 ± 8.6%). For dataset III, the grand average of acc_bag (74.8 ± 11.8%) was significantly higher than that of acc_RLDA (71.2 ± 12.4%) but not than those of the other strong learners. For the ternary classification, the Bagging algorithm yielded a significantly higher classification accuracy (71.2 ± 12.4%) than the strong learners. The individual classification accuracies are provided numerically in the Supplementary Information.
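The theoretical chance levels based on the binomial distribution can be computed as sketched below. The sketch assumes that the threshold is the inverse binomial cumulative distribution evaluated at 1 − α, divided by the number of trials, and that each participant contributes 30 trials per task (60 trials for the binary and 90 for the ternary problems); the function name is hypothetical.

```python
import math

def chance_level(n_trials, n_classes, alpha=0.05):
    """Accuracy threshold k / n_trials, where k is the smallest number of
    correct trials with P(X <= k) >= 1 - alpha for X ~ Binomial(n_trials, 1/n_classes)."""
    p = 1.0 / n_classes
    cdf = 0.0
    for k in range(n_trials + 1):
        # Accumulate the binomial probability mass up to k correct trials.
        cdf += math.comb(n_trials, k) * p**k * (1.0 - p) ** (n_trials - k)
        if cdf >= 1.0 - alpha:
            return k / n_trials
    return 1.0

binary_level = chance_level(60, 2)    # two tasks, 30 trials each
ternary_level = chance_level(90, 3)   # three tasks, 30 trials each
```

Both thresholds lie above the nominal chance levels of 1/2 and 1/3 because, with a finite number of trials, a classifier guessing at random can exceed the nominal level by luck.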
In Figure 4, symbols above the dashed diagonal line indicate that the Bagging algorithm improved the individual classification accuracy (i.e., acc_bag > acc_strong), while those below the diagonal line indicate that acc_bag < acc_strong. It was revealed that acc_bag significantly exceeded acc_strong in nine out of 12 comparisons (p < 0.05). For the ternary classification (dataset IV), the improvement of bitrate was particularly significant in all cases.

Figure 5 shows comparisons of individual bitrates (bits/min), where the symbols above the dashed diagonal line indicate that the Bagging algorithm resulted in higher bitrates than the single strong learner, and vice versa. In the case of dataset I, significant (or marginally significant) bitrate improvements were observed, and they were found in over 70% of the individual results in all three cases. The Bagging algorithm improved the bitrates highly significantly in the comparisons with SVM or LDA (corrected-p < 0.001). For dataset III, the Bagging algorithm was significantly superior to RLDA in terms of bitrate (corrected-p < 0.001), unlike in the other two cases. Note that the Bagging algorithm always outperformed typical LDA in the ternary system (dataset IV).

DISCUSSION
Summary
In this study, we explored, for the first time, whether the performance of binary and ternary NIRS-BCI systems can be improved by using ensemble classifiers. We created ensembles of weak learners based on the Bagging algorithm. Four NIRS-BCI datasets recorded with different experimental paradigms were used for the quantitative performance comparisons between the Bagging algorithm and the conventional single strong learner approach. Our results demonstrated that the Bagging algorithm significantly (or marginally significantly) outperformed the single strong learner in terms of classification accuracy and bitrate for all the datasets.

Necessity of Using an Appropriate Ensemble Classifier
To create a better ensemble classifier, it is important to select an appropriate ensemble aggregation method and type of weak learner, and to determine the optimal hyperparameters, such as the number of ensemble learning cycles (N_learn in this study). For the optimization of the hyperparameters, various approaches can be employed, such as grid search, random search (Bergstra and Bengio, 2012), and Bayesian optimization (Mockus, 2012); however, since the optimized hyperparameters generally depend on the test set employed in the optimization, it is practically difficult to derive universally optimal hyperparameters. This implies that simply using ensemble classifiers does not always guarantee an enhanced performance in NIRS-BCIs and that a customized ensemble classifier appropriate for the given datasets needs to be employed. If the aggregation method and hyperparameters are not properly chosen, or are determined based on subjective assumptions, the desired results are unlikely to be obtained. For example, when a binary decision tree was arbitrarily designated as the weak learner in this study, the classification accuracy was not enhanced at all compared to acc_strong. In addition, as shown in Figure 2, small values of N_learn resulted in low acc_bag, even lower than acc_strong, because a small ensemble cannot be trained sufficiently with various sample sets, which deteriorates the classification accuracy. On the other hand, a Bagging ensemble containing enough weak learners effectively reduced the variance of the estimates, which is consistent with the theoretical background of Bagging (Mayr et al., 2014). As mentioned above, the γ value for the weak learner was chosen as a rule of thumb. In addition, by changing the γ value from 0.001 to 0.5, we assessed whether an improvement of classification accuracy was possible. As a result, γ = 0.1 yielded a significant difference in classification accuracies against N_learn (Bonferroni-corrected p < 0.001, not shown in the text) except for dataset I. This underpins the importance of proper parameter selection for ensemble learning methods as well. In this study, we successfully achieved an enhanced BCI performance by using the RLDA classifier with appropriate hyperparameters (10 ≤ N_learn ≤ 50 and γ = 0.1).

Limitation: Bitrate and Real Time Analysis
We improved the bitrate by successfully improving the classification accuracy in the present study. However, it is very difficult to reduce the trial length due to the inherent limitations of fNIRS-BCIs, such as the slow response time caused by the hemodynamic delay. Recently, a steady-state visually evoked potential (SSVEP)-BCI has shown an average performance of 701 bit/min (Nagel and Spüler, 2019). Even though many efforts have been devoted to improving the bitrate of fNIRS-BCIs (Cui et al., 2010; Zafar and Hong, 2017; Hong and Zafar, 2018), it is difficult to bridge the performance gap between fNIRS-BCIs and EEG-BCIs. However, for such an SSVEP-BCI, which is a type of exogenous BCI, the need for an external stimulus that easily causes user fatigue could be problematic. This study dealt with the efficacy of the ensemble learning methods using the previously released open-access NIRS-BCI datasets. Since the experimental environment and analysis techniques for the implementation of real-time NIRS-BCIs are completely different from those for the implementation of offline NIRS-BCIs, it does not make sense to verify the feasibility of ensemble learning for online NIRS-BCIs with the offline NIRS-BCI datasets. Therefore, the efficacy of ensemble learning for online NIRS-BCIs should be validated in future studies.

Figure caption (fragment): AVG represents the average of the classification accuracies across all participants. *Corrected-p < 0.05, **corrected-p < 0.01, and ***corrected-p < 0.001 (false discovery rate correction).

Efforts to Improve the Performance of NIRS-BCIs: Future Perspective
There have been many efforts to improve the overall performance of NIRS-BCIs. Recently, off-the-shelf NIRS systems adopting novel designs and form factors have been introduced to the market and their usefulness in NIRS-BCIs has been verified (Shin et al., 2017a; Kim et al., 2018; Kwon et al., 2018; Lancia et al., 2018). However, most of the new form factors adopted by the recent NIRS systems do not possess general applicability because they are designed to record hemodynamic changes from the prefrontal area only. In addition, artificial intelligence methods based on deep learning have demonstrated their potential in enhancing the performance of BCI systems (Cecotti and Graser, 2011; Chiarelli et al., 2018; Lawhern et al., 2018; Nicholas et al., 2018; Sakhavi et al., 2018). Even though some studies have reported the superiority of the deep learning-based approach over conventional machine learning methods (Trakoolwilaiwan et al., 2018), there still exist controversies regarding the employment of these methods (Hennrich et al., 2015). Since deep learning techniques generally depend on human factors, such as how well the deep learning model structure is designed, objective and thorough investigations of deep learning models that can enhance the performance of NIRS-BCIs are necessary.

FIGURE 5 | Scatter plots comparing individual bitrates for (A) dataset I, (B) dataset II, (C) dataset III, and (D) dataset IV. The x- and y-axes correspond to bitrate_strong and bitrate_bag, respectively. Gray dashed lines are points where bitrate_strong = bitrate_bag. The corrected-p-values represent the significance of improvement of the bitrate by the bagging method. Pentagram, square, and diamond symbols are for SVM, LDA, and RLDA, respectively. Symbol color is in accordance with the bar color shown in Figure 3.
Meanwhile, some recent studies showed the potential of combining ensemble learning concepts with deep learning approaches (Xiao et al., 2018). The development of a novel ensemble classifier incorporating deep learning techniques and its application to NIRS-BCIs would be a promising topic, which we would like to pursue in future studies.

CONCLUSION
In this study, we demonstrated that the performance of NIRS-BCIs can be enhanced by employing a proper ensemble classifier (here, the RLDA ensemble classifier based on the Bagging algorithm), which has never been investigated before. The ensemble learning method employed was beneficial for improving the classification accuracies of all four datasets considered in this study. In our future studies, the ensemble classifier introduced in this study will be applied to new NIRS-BCI datasets to confirm its general applicability, and new types of ensemble classifiers that can further enhance the performance of NIRS-BCIs will also be tested.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
JS planned the study and analyzed the data. C-HI supervised the work. Both authors wrote and reviewed the manuscript.

FUNDING
This work was supported in part by a grant from the Institute for Information and Communications Technology Promotion (IITP), funded by the Korea Government (MSIT) (2017-0-00432, Development of non-invasive integrated BCI SW platform to control home appliances and external devices by user's thought via AR/VR interface) and in part by grants from the Brain Research Program through the National Research Foundation of Korea, funded by the Ministry of Science and ICT (NRF-2015M3C7A1031969 and NRF-2017R1A6A3A01003543).