In silico vs. Over the Clouds: On-the-Fly Mental State Estimation of Aircraft Pilots, Using a Functional Near Infrared Spectroscopy Based Passive-BCI

There is growing interest in implementing tools to monitor cognitive performance in naturalistic work and everyday life settings. The emerging field of neuroergonomics promotes the use of wearable and portable brain monitoring sensors such as functional near infrared spectroscopy (fNIRS) to investigate cortical activity in a variety of human tasks outside the laboratory. The objective of this study was to implement an on-line passive fNIRS-based brain computer interface to discriminate two levels of working memory load during highly ecological aircraft piloting tasks. Twenty-eight recruited pilots were equally split into two groups (flight simulator vs. real aircraft). In both cases, identical approaches and experimental stimuli were used (a serial memorization task consisting of repeating series of pre-recorded air traffic control instructions, easy vs. hard). The results show that pilots in the real flight condition committed more errors and had higher anterior prefrontal cortex activation than pilots in the simulator when completing cognitively demanding tasks. Nevertheless, evaluation of single trial working memory load classification showed high accuracy (>76%) across both experimental conditions. The contributions here are two-fold. First, we demonstrate the feasibility of passively monitoring cognitive load in a realistic and complex situation (live piloting of an aircraft). Second, the differences in performance and brain activity between the two experimental conditions underscore the need for ecologically valid investigations.

INTRODUCTION
Neuroergonomics is an emerging field of interdisciplinary research that promotes the understanding of the brain in complex real-life activities. This approach merges knowledge and methods from cognitive psychology, system engineering, and neuroscience (Parasuraman and Wilson, 2008). Accurate and reliable mental state assessment of human operators during use of complex systems is a prime goal of neuroergonomics that aims to measure the "brain at work" (Parasuraman and Rizzo, 2008). Understanding the underlying neurocognitive processes of such interaction could be used to improve safety and efficiency of the overall human-machine pairing. This could be achieved by (i) the augmentation of human performance and its translation to improved functioning "at work", (ii) informing the design of the complex systems, or (iii) adapting the user interface and task parameters dynamically during use.
Aviation operations constitute an ideal paradigm to implement this approach. Pilots deal with an uncertain environment and face complex interaction with the flightdeck (Causse et al., 2013; Çakır et al., 2016; Reynal et al., 2016). For instance, several studies have emphasized that pilots' working memory (WM) abilities are heavily recruited to handle the flight path, to monitor the flight parameters, and to maintain up-to-date situation awareness (Causse et al., 2011a,b). WM is also an important component when following air traffic control (ATC) instructions (Morrow et al., 1993). This activity requires mentally storing flight parameters (e.g., heading, altitude, speed) to follow the adequate flight path. However, it is well known that human working memory is fundamentally limited (Baddeley, 1992; Miller, 1994) and easily overwhelmed when task demand is excessive (Durantin et al., 2014a). Human factors studies have emphasized that a variety of environmental stressors may negatively impact pilots' ability to execute ATC clearances (Billings and Cheaney, 1981; Taylor et al., 1994, 2005; Scerbo et al., 2003; Risser et al., 2006; Rome et al., 2012; Dehais et al., 2017). Thus, the implementation of monitoring technology in the cockpit to infer a state of cognitive limitation could represent a promising approach to enhance flight safety (Roy et al., 2017; Verdière et al., 2018).
Indeed, the development of brain computer interface (BCI) technology provides interesting prospects to continuously monitor and take advantage of the brain dynamics and the neural mechanisms underlying cognition. Among the three categories of BCIs (active, reactive, and passive) (Zander and Kothe, 2011; Vecchiato et al., 2016), the first two types are aimed at transforming cerebral activity into messages or commands to voluntarily control a distant apparatus (e.g., a mouse cursor). Passive BCIs (pBCI) are of particular interest for neuroergonomic applications (Cutrell and Tan, 2008; Frey et al., 2017; Gramann et al., 2017). They allow unlabeled brain activity recorded during a task to be interpreted in order to derive various mental states (Blankertz et al., 2010; Roy et al., 2013; Van Erp et al., 2015; Zander et al., 2017). These mental state-inference systems offer a unique insight into the development of human-system interactions to overcome cognitive limitations (Zander and Kothe, 2011; Brouwer et al., 2013). While several pBCIs have been successfully implemented in driving (Dijksterhuis et al., 2013) and flight simulators (Aricò et al., 2016; Çakır et al., 2016; Callan et al., 2016; Verdière et al., 2018), very few studies have attempted to test these adaptive systems under realistic settings (Callan et al., 2015).
Electroencephalography (EEG) is by far the most popular technique in the BCI community (George and Lécuyer, 2010; Borghini et al., 2017), as it has excellent qualities for monitoring cognitive states (Brouwer et al., 2012; Roy et al., 2013), including superior temporal resolution, and has been used to monitor working memory (Roy et al., 2013; Mühl et al., 2014). However, the localization of sources from the EEG signal requires higher-density recordings and additional computation to solve the inverse problem, which may not be amenable to critical operational situations such as flying real aircraft. In addition, setup time and susceptibility to motion artifacts should be considered for minimally intrusive deployment. Thus, functional near infrared spectroscopy (fNIRS) has recently been gaining popularity as the sensors have been miniaturized and have become portable and wireless (Ayaz et al., 2013; Strait et al., 2014; Naseer and Hong, 2015; Schudlo and Chau, 2015). This brain activity monitoring technique uses the near-infrared light absorption properties of hemoglobin to estimate local variations of cortical oxygenation (Villringer and Obrig, 2002; Ayaz et al., 2012). fNIRS has been successfully used to detect working memory solicitation (Li et al., 2005; Schreppel et al., 2008; Hirshfield et al., 2011; Gagnon et al., 2012; Herff et al., 2014; McKendrick et al., 2014; León-Domínguez et al., 2015; Unni et al., 2017). Despite its relatively low temporal resolution, fNIRS offers several advantages compared to more established traditional tools (Kikukawa et al., 2008; Piper et al., 2014; McKendrick et al., 2015; Davranche et al., 2016), such as a relatively high spatial resolution (around 1 cm² depending on the sensor geometry) and a high signal-to-noise ratio, as the sensors are relatively more robust to motion artifacts (Huppert et al., 2009), eye-blinks, and facial muscle activity (Izzetoglu et al., 2004).
It is also possible to run experiments with active and mobile subjects, and even outdoors (Piper et al., 2014; McKendrick et al., 2016). Specifically, fNIRS is less sensitive than EEG to the noisy electromagnetic environment of the aircraft (radio transmission, radio-navigation beacons, GPS antenna, etc.), making it a candidate to measure pilots' brain activity during real flight. As fNIRS is an emerging neuroimaging technique, we believe that it is important to investigate its capabilities and its utility in future applications.
The present study aims to develop an on-line fNIRS-based pBCI for the assessment of the working memory of aircraft pilots during real flight. Earlier studies have demonstrated that fNIRS-based BCIs can be implemented. They rely on oxygenation changes in the prefrontal cortex (PFC), which can be used for measuring WM load (Schreppel et al., 2008; Ayaz et al., 2012; Gagnon et al., 2012; Durantin et al., 2014a,b). Here, a pilot-ATC interaction task was designed with two contrasted levels of WM load. A Support Vector Machine (SVM) based classifier was implemented to discriminate the WM load level on-line, on a single-trial basis. This classifier was first tested in a high fidelity flight simulator. The same machine learning approach was then used to assess the WM load level in an actual flight condition. To the authors' knowledge, this is the first study to monitor pilots' brain activity on-line under such operational settings and with such ecological validity. We also compared pilots' WM performance and related PFC activity in the high fidelity simulator and in real flight. The objective was to determine whether these two conditions (simulated and real operational settings) were equivalent in terms of task demand (Dahlstrom and Nahlinder, 2009; Batula et al., 2017). As most aviation psychology experiments and pilot training sessions are conducted with flight simulators, such an assessment is critical for the future design and development of such approaches (Philip et al., 2005).

Participants
Fourteen visual flight rules (VFR) pilots (three women; mean group age: 29.25 ± 6.92; mean flight hours 80 ± 50) completed the experiment. Pilots had normal or corrected-to-normal vision, normal hearing, and no psychiatric disorders. They all had medical clearance to fly. After providing written informed consent, they were instructed to complete task training. The data from two participants were rejected due to a high level of fatigue in one case, and data collection issue for the second. Typical total duration of a subject's session (informed consent approval, practice task, and real task) was about two hours. This work was approved by the Institutional Review Board (IRB) of the Inserm Committee of Ethics Evaluation (CEEI: Comité d'Evaluation Ethique de l'Inserm IRB00003888). The methods were carried out in accordance with approved guidelines and participants gave written informed consent approved by the IRB of CEEI.

Neurophysiological Measurements: fNIRS
During this experiment, we recorded hemodynamics of the prefrontal cortex using the functional near-infrared spectrometer fNIR Device Model 100B (Biopac®) equipped with 16 optodes (Figure 1). On this continuous-wave system, the optode separation was about 25 mm and two wavelengths were used, 730 and 850 nm. The DPF (differential pathlength factor) value was 5.97, which is within the range commonly used in the literature (Kato et al., 1993; Luo et al., 2002). Four regions of interest (ROI) were defined to allow for explorative statistical comparisons with the data collected during the real flight experiment (see section 3).
Each optode of the device records hemodynamics at a frequency of 2 Hz in terms of oxygenation level variations in comparison to an initial baseline recorded prior to the experiment. Changes in the concentrations of oxygenated ([HbO2]) and deoxygenated hemoglobin ([hHb]) relative to the baseline can be calculated from changes in detected light intensity using the modified Beer-Lambert Law (Delpy et al., 1988). Cognitive Optical Brain Imaging (COBI) Studio® software (Ayaz and Onaral, 2005; Ayaz et al., 2011) was used to collect data. The data stream was available on-line from a TCP/IP interface. Before recording, the signals of each optode were carefully checked for saturation with COBI Studio, which provides a visual representation of signal quality, and the headband was adjusted on the participant's forehead accordingly. After this check, a baseline was established, which simply consisted of letting the participant rest for 10 s.
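As an illustration of the modified Beer-Lambert step, the following minimal Python sketch solves the two-wavelength system for the concentration changes. It is not the COBI Studio implementation. The function names are hypothetical, and the extinction coefficients must be supplied by the caller; the paper gives none, so the values used in any call are placeholders. Only the 2.5 cm optode separation and the DPF of 5.97 come from the text.

```python
import math


def delta_od(i_baseline, i_now):
    """Change in optical density relative to the baseline light intensity."""
    return math.log10(i_baseline / i_now)


def mbll(dod_w1, dod_w2, eps, distance_cm, dpf):
    """Solve the 2x2 modified Beer-Lambert system for one optode.

    Model assumed here, for each wavelength w:
        dOD(w) = (eps_HbO2(w) * d[HbO2] + eps_hHb(w) * d[hHb]) * distance * DPF

    eps is ((eps_HbO2_w1, eps_hHb_w1), (eps_HbO2_w2, eps_hHb_w2)).
    These coefficients are NOT given in the paper and must be provided.
    Returns (d_hbo2, d_hhb) in the concentration unit implied by eps.
    """
    (a, b), (c, d) = eps
    path = distance_cm * dpf
    det = (a * d - b * c) * path
    if det == 0:
        raise ValueError("extinction coefficients give a singular system")
    # Cramer's rule on the 2x2 system
    d_hbo2 = (dod_w1 * d - b * dod_w2) / det
    d_hhb = (a * dod_w2 - c * dod_w1) / det
    return d_hbo2, d_hhb
```

Calling `mbll(dod_730, dod_850, eps, 2.5, 5.97)` for each of the 16 optodes at every sample would give the [HbO2] and [hHb] streams that the rest of the pipeline consumes.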

Experimental Environment: Flight Simulator
We used the ISAE-SUPAERO (Institut Supérieur de l'Aéronautique et de l'Espace, French Aeronautical University in Toulouse, France) flight simulator to conduct the experiment in an ecological situation. Its user interface is composed of a Primary Flight Display, a Navigation Display, and an Electronic Central Aircraft Monitoring Display (Figure 2).

Task Description
This protocol was adapted from a previous study. As in real flight operations, pilots heard ATC instructions (pre-recorded for this experiment) to vector them and were asked to read back the instructions. Their answers were recorded for off-line behavioral analysis. The ATC messages were delivered at 78 dB through a Sennheiser® headset. Two levels of difficulty were defined based on the flight parameters that the participant had to read back during the experiment:
• Low WM load: the first two digits were the same for each flight parameter (e.g., 14 for "speed 140, heading 140, altitude 1400, vertical speed +1400").
• High WM load: each flight parameter value was different from the previous one and composed of different digits to increase task difficulty (e.g., "speed 172, heading 238, altitude 6400, vertical speed −2800").
The task consisted of 10 repetitions of each difficulty level, for a total of 20 trials. The order of difficulty was randomized under two constraints:
• the first 10 trials contained 5 trials of high difficulty and 5 trials of low difficulty (which is necessary for machine learning purposes, see section 2.1.5);
• the difficulty could not be the same for more than two successive trials.
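One simple way to draw an order that satisfies both constraints is rejection sampling, sketched below in Python. The paper does not say how the order was actually generated, so this is an assumed procedure, and the function names are hypothetical.

```python
import random


def max_run(seq):
    """Length of the longest run of identical consecutive items."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def draw_trial_order(rng, max_tries=10000):
    """Draw 20 trials: 5 high + 5 low in each half, no run longer than 2."""
    for _ in range(max_tries):
        first = ["high"] * 5 + ["low"] * 5   # trials 1-10 (training data)
        second = ["high"] * 5 + ["low"] * 5  # trials 11-20 (testing data)
        rng.shuffle(first)
        rng.shuffle(second)
        order = first + second
        if max_run(order) <= 2:
            return order
    raise RuntimeError("no valid order found; increase max_tries")
```

Balancing each half separately guarantees that the 10 training trials contain both classes equally, which is the property the classifier training relies on.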
Each ATC message started with the airplane call sign (i.e., "Supaero 32"), immediately followed by a sequence of flight parameters, and ended with the message "over" (Figure 3). Thereafter, pilots had an 18 s response window to repeat the instruction. A practice session was conducted prior to the experiment runs to familiarize the participants with the experiment protocol and the interface. During the experiment, the experimenter noted whether the volunteer read back each message correctly, so as to compute the total number of correct responses in the low and high load conditions.

Experimental Time Course
For machine learning purposes, the experiment was split into three successive phases (Figure 4):
• Phase A (data gathering phase): 10 instructions with two levels of difficulty were successively presented to the pilot in a random order. During phase A, the correctness of the pilot's response was also checked for further pilot performance analysis. The fNIRS data were processed and recorded for each trial's response window. The level of difficulty of each message was also recorded.
• Phase B (classifier training phase): the classifier training process was activated, based on the data gathered during phase A. This phase was not perceived by the pilot and allowed further classification actions. At the end of this phase, the pilot's classifier (the pilot-specific classification model, once trained) was available for classification requests.
• Phase C (classifier testing phase): 10 instructions with random levels of difficulty (high WM load or low WM load) were successively presented. The aim of the classification process was to discriminate the difficulty of the trial. After the response window of each trial, the classifier returned a WM load estimate for that trial.
Note that the transition (phase B) from phase A to phase C was not perceptible to the participants.

MACD Filter
Raw fNIRS data were filtered in real time using a MACD (moving average convergence divergence) filter, commonly used in economic market analysis (Appel, 2005). This filter, based on the difference between a short-term EMA (exponential moving average) and a long-term EMA, implements a second-order band-pass filter that eliminates low-frequency (<0.02 Hz) and high-frequency (>0.33 Hz) components from the raw fNIRS signal (Utsugi et al., 2007). This low-order filter has a quasi-linear phase in its bandwidth and is particularly suited for real-time applications. For the experiment, we filtered [HbO2] and [hHb] on-line on 16 optodes. Each EMA is parameterized by N, the number of time points defining its window. We chose a 6 s short-term EMA and a 13 s long-term EMA for MACD filtering, according to previous work (Durantin et al., 2014b), to get the desired bandwidth.
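A streaming MACD filter can be sketched in a few lines of Python. The smoothing factor α = 2/(N + 1) is the usual EMA convention from market analysis and is an assumption here, as is seeding each EMA with the first sample; the class names are hypothetical. At the simulator's 2 Hz sampling rate, the 6 s and 13 s windows correspond to N = 12 and N = 26 points.

```python
class Ema:
    """Exponential moving average over a window of n_points samples."""

    def __init__(self, n_points):
        self.n_points = n_points
        self.alpha = 2.0 / (n_points + 1)  # assumed standard EMA convention
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x  # seed with the first sample (assumption)
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


class MacdFilter:
    """Short-term EMA minus long-term EMA, updated one sample at a time."""

    def __init__(self, fs_hz, short_s=6.0, long_s=13.0):
        self.short = Ema(round(short_s * fs_hz))
        self.long = Ema(round(long_s * fs_hz))

    def update(self, x):
        return self.short.update(x) - self.long.update(x)
```

One filter instance per signal and per optode would be needed (32 instances for 16 optodes). Because both EMAs converge to the same value on a constant input, the difference removes slow drifts, while the short-term EMA attenuates fast fluctuations.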

Single Trial SVM-Based WM Load Estimation
The classification's goal was to discriminate on-line whether the last trial was a high WM load trial or a low WM load trial. For each pilot, we used the first 10 trials to train the pilot's classifier (phases A and B, see section 2.1.5). From trial 11 to 20, we used the pilot's classifier to discriminate trial difficulty, without any further training. An accuracy score (number of correct predictions divided by the total number of predictions) of the pilot's classifier was provided at the end of the experimental session. The [HbO2] and [hHb] filtered signals of the sixteen optodes were segmented into trials, in real time, according to the task synchronization module (Figure 5). Each trial starts when an ATC message is played, and lasts 30 s. All data points of a trial (2 different inputs per optode, 16 optodes, 30 s of data with a 2 Hz sampling, corresponding to 1920 features) were considered as the input of the machine learning process. A 30 s sliding window was chosen to be consistent with the literature regarding inter-subject variability (Jasdzewski et al., 2003; Sato et al., 2005). Note that the transition from the "Response" phase to the "Rest" one was unnoticeable, as it was anticipated that participants started to rest as soon as they completed the instruction.
As our number of features was large compared to the training sample, we used a linear Support Vector Machine (SVM) (Cortes and Vapnik, 1995). The principle of the SVM is to find the separating hyperplane that maximizes the distance between the hyperplane and the closest training points in each class. To avoid over-fitting, we chose to customize the SVM regularization parameter for each pilot's classifier. In a linear SVM, the regularization parameter C controls the trade-off between errors of the SVM on training data and margin maximization. During the training process of each participant, the parameter C was varied over a large range of values (from 10⁻³ to 10⁴) in multiplicative steps of 10.
Hence, a five-fold cross-validation on the first 10 trials was run with the scikit-learn (Pedregosa et al., 2011) packages (sklearn.svm and sklearn.cross_validation) to select the C parameter with the highest performance in terms of accuracy. The classifier training (phase B) was performed as soon as the data of the first 10 trials were available, for on-line purposes (Aricò et al., 2016).
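The selection of C can be sketched as a plain grid search with five-fold cross-validation. To stay library-independent, the Python sketch below takes the classifier as a callable, `fit_predict`, which is a hypothetical placeholder; in practice it would wrap a linear SVM such as scikit-learn's `SVC(kernel="linear", C=c)`. Balancing each fold with one trial per class and breaking ties in favor of the smallest C are assumptions that the paper does not specify.

```python
def c_grid():
    """Candidate C values from 1e-3 to 1e4 in multiplicative steps of 10."""
    return [10.0 ** e for e in range(-3, 5)]


def stratified_folds(labels, n_folds=5):
    """Deal the trial indices of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        for k, i in enumerate(indices):
            folds[k % n_folds].append(i)
    return folds


def select_c(features, labels, fit_predict, n_folds=5):
    """Return (best_c, best_accuracy) over the grid.

    fit_predict(c, train_x, train_y, test_x) -> list of predicted labels.
    It is a hypothetical callable standing in for the linear SVM.
    """
    best_c, best_acc = None, -1.0
    for c in c_grid():
        correct = 0
        for test_idx in stratified_folds(labels, n_folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            preds = fit_predict(
                c,
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
        acc = correct / len(labels)
        if acc > best_acc:  # strict '>' keeps the smallest C on ties
            best_c, best_acc = c, acc
    return best_c, best_acc
```

With 10 training trials, each fold holds out only two trials, so the cross-validated accuracy can take just eleven distinct values and ties between neighboring C values are likely.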

Experimental Components' Architecture
We implemented a WM load estimator that integrated different components (Figure 5):
• a simulated ATC which broadcasts a list of chosen messages to the pilot;
• the ISAE flight simulator (Figure 2);
• a fNIR sensor which measures the prefrontal oxygenation (Figure 1);
• a MACD filter for artifact removal (see section 2.1.6);
• a synchronization module that also formats filtered data for the classification process: the filtered fNIRS output must be synchronized with the pilot's state, according to the instant of arrival of the incoming message and according to the pilot's response window;
• a classifier (see section 2.1.7) which evaluates in real time whether the last ATC instruction was a high WM load trial or a low WM load trial.
Results were logged into a file, while real-time feedback was provided through a system terminal.
Task monitoring, data acquisition, and computation were conducted on the same computer (Core i5-3210M, 2.50 GHz, 4 GB RAM). During the experiment, the classifier training (phase B) duration was short (800 ms) and remained unnoticeable for the participant. The classifier testing phase lasted 10 ms and was also unnoticeable for the participant (Figure 6).

Participants
Fourteen VFR pilots (one woman; mean group age: 23.07 ± 5.35; mean flight hours 44.07 ± 37.52) completed the experiment. Please note that these volunteers were different from the ones who participated in the flight simulator experiment. The data from two participants were rejected due to light saturation issues and a device synchronization issue. After providing informed consent, they were instructed to complete task training on the ground. None of the recruited subjects had a neurological or psychiatric history or was on medication. Each of them gave written informed consent for the experiment. The experimental protocol was approved by the committee of the European Aviation Safety Agency (EASA permit to fly approval number: 60049235). The methods were carried out in accordance with approved guidelines and participants gave written informed consent approved by the EASA.
FIGURE 5 | Illustration of the fNIRS based inference system. Pre-recorded ATC messages were sent to the pilot (1). The pilot's prefrontal activity was measured with a fNIRS device (2). Output measures (3) were MACD-filtered and synchronized with the temporal design of the trial (4). When all of the required data were available for the trial, a request was sent to the pilot's classifier to assess the WM load of the trial (5).
FIGURE 6 | Trial timeline and computing latencies. Upper timeline shows ATC span task trial events duration (see Figure 3). Bottom timeline illustrates duration constraints to get pilot's estimated WM load: classifier's response is available in the worst case less than 10 ms after pilot's response window.

Neurophysiological Measurements: Mini-fNIRS
We used the miniaturized and wireless fNIR Device Model 1200W (Biopac®) portable system (Ayaz et al., 2013) to record the pilots' hemodynamics of the prefrontal cortex (Figure 7). This device was chosen as it was wireless (i.e., the pilot's head was not attached to any cables) and did not require an external power supply, unlike the Model 1200S. This was a prerequisite to facilitate its implementation and use in the aircraft for our experiment. This device had the same hardware design, and exactly the same LED light source components and detectors, as the fNIRS Model 1200S used in the flight simulator. Consistent with the previous device, on this continuous-wave system, the optode separation was about 25 mm and two wavelengths were used, 730 and 850 nm. The DPF value was 5.97. This four-optode device records hemodynamics at a frequency of 4 Hz in terms of oxygenation level variations in comparison to a baseline, in the same way as the 1200S desktop version. With a flexible circuit board and separation-adjustable split pads, the sensors were positioned to monitor brain areas similar to the ROIs extracted from the 1200S sensor. Changes in the concentrations of oxygenated ([HbO2]) and deoxygenated hemoglobin ([hHb]) can be calculated from changes in detected light intensity using the modified Beer-Lambert Law (Delpy et al., 1988). Cognitive Optical Brain Imaging (COBI) Studio® software (Ayaz and Onaral, 2005; Ayaz et al., 2011) was used to collect data. The data stream was available on-line from a TCP/IP interface. Before recording, the signals of each optode were carefully checked for saturation with COBI Studio, which provided a visual representation of signal quality. Aluminum foil attached to a dark ski band, together with a cap, was placed over the mini-fNIRS to shield it against ambient infrared light from the sun.
Data were MACD-filtered and we used a similar on-line architecture of experimental components, with the exception that a real plane replaced the flight simulator.

Experimental Environment: DR400 Aircraft
The ISAE Supaero DR400 light aircraft was used for the purpose of the experiment (Figure 8). It was powered by a 180 HP Lycoming engine and was equipped with classical gauges, radio and radio navigation equipment, and actuators such as rudder, stick, thrust, and switches to control the flight. The participant was seated in the left seat and was equipped with the mini-fNIRS system. The participant wore a Clarify Aloft® headset through which task-related auditory stimuli were delivered from a PC via an audio cable. The participant could still communicate with the other crew members and with real ATC, even while receiving auditory stimuli (emulated ATC). The safety pilot was an ISAE flight instructor. He sat in the right seat and had the authority to stop the experiment and take over control of the aircraft for any safety reason. The backseater was the experimenter: his role was to set up the sensor, to trigger the experimental scenario, and to supervise data collection.

Task Description
The experimental task and audio messages were similar to the previous protocol (see section 2.1), with the same experimental time course and the same instructions for the participant.
A practice session on the ground was conducted prior to the experiment runs to familiarize the participants with the experiment protocol and the interface. After training was completed on the ground, the mini-fNIRS system was placed over the participant's forehead. The participant then took off from Lasbordes (LFCL, Toulouse, France) airfield and began a local flight. The experimental task per se started when the pilot had left the Lasbordes traffic pattern and was stabilized at an altitude of 2500 feet. The participant was asked to fly as straight and stable as possible and to perform only slight avoidance maneuvers as necessary. Once the aircraft was stabilized, the ten-second baseline was recorded. After the completion of the WM task, the participant headed back to land at Lasbordes airfield. The total flight lasted one hour, including the WM task.
As in the simulated condition, the backseater noted whether the volunteer read back each message correctly, in order to compute the total number of correct responses in the low and high load conditions. These data made it possible to compare WM performance across the conditions (i.e., low vs. high; simulated vs. real flight).

Experimental Components' Architecture and WM Load Estimation
We implemented a similar WM load estimator in the airplane as in the flight simulator. Machine learning inputs were slightly adjusted to fit the data flow available with the mini-fNIRS wireless portable device. The [HbO2] and [hHb] filtered signals of the four (instead of 16) available optodes were segmented into trials, in real time, according to the task synchronization module (see section 2.1.8). Each trial starts when an ATC message is played, and lasts 30 s. All data points of a trial (two different inputs per optode, i.e., [HbO2] and [hHb], four optodes, 30 s of data with a 4 Hz sampling, corresponding to 960 features) were considered as the input of the machine learning process.
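The two input layouts (16 optodes at 2 Hz in the simulator, four optodes at 4 Hz in flight) can be checked with a short Python sketch. The helper names are hypothetical, and the ordering of values inside the feature vector is an assumption, since the paper does not describe it.

```python
def n_features(n_optodes, fs_hz, trial_s=30, signals_per_optode=2):
    """Number of data points in one trial across all optodes and signals."""
    return signals_per_optode * n_optodes * trial_s * fs_hz


def flatten_trial(hbo2, hhb):
    """Concatenate per-optode [HbO2] and [hHb] samples into one vector.

    hbo2 and hhb are lists with one list of samples per optode.
    The ordering (optode by optode, HbO2 first) is an assumed layout.
    """
    vector = []
    for optode_hbo2, optode_hhb in zip(hbo2, hhb):
        vector.extend(optode_hbo2)
        vector.extend(optode_hhb)
    return vector


simulator = n_features(n_optodes=16, fs_hz=2)  # 2 * 16 * 30 * 2 = 1920
aircraft = n_features(n_optodes=4, fs_hz=4)    # 2 * 4 * 30 * 4 = 960
```

In both settings the feature count far exceeds the 10 training trials, which is the situation that motivated a linear SVM with a tuned regularization parameter.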

Statistical Analyses
Off-line statistical analyses were performed with the "R" software (R Core Team, 2013) and the "EzANOVA" package (Anderson, 2001) to compare WM performance and prefrontal cortex activations in the flight simulator and in the real flight conditions during the 20 trials. Two-tailed unpaired t-tests were performed to compare the WM performance in the high and low load conditions across the two flight conditions (simulator and real flight). As the number of optodes was not equivalent between the two fNIRS devices (16 vs. 4), we defined four regions of interest (ROIs) for the fNIR100 device that was used in the simulator condition to allow for explorative comparisons with the real flight condition. ROI1, ROI2, ROI3, and ROI4 were derived respectively from the spatial averaging of optodes 1 to 4, 5 to 8, 9 to 12, and 13 to 16 (see Figure 1). We computed the mean frontal [HbO2] peak response and the mean frontal [hHb] peak response (peak value within 30 s post-trial onset minus the 2 s average pre-trial onset) over the four ROIs of the PFC for each trial and each pilot, using the MACD-filtered data in both flight conditions (i.e., simulator and real flight). A multivariate analysis of variance for repeated measures (MANOVA) was conducted over the mean [HbO2] data with the between-subject factor flight condition (simulator vs. real flight) and the within-subject factors WM load (High vs. Low) and ROI (#1, #2, #3, and #4; see Figure 1). A similar MANOVA over the mean [hHb] data was then conducted. We then ran a two-tailed unpaired t-test to compare the classification accuracy in the two experimental conditions. Tukey's Honestly Significant Difference (HSD) test was used for all post-hoc comparisons.
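The ROI averaging and the baseline-corrected peak response can be sketched as follows in Python. The function names and data layout are hypothetical, and the sketch takes "peak" to mean the maximum of the post-onset window, which is an assumption (for [hHb] a minimum could equally be intended).

```python
# Optode numbers per region of interest, as defined for the 16-optode device
ROIS = {1: range(1, 5), 2: range(5, 9), 3: range(9, 13), 4: range(13, 17)}


def roi_average(optode_series, roi_optodes):
    """Sample-by-sample spatial average over the optodes of one ROI.

    optode_series maps an optode number to its list of filtered samples.
    """
    selected = [optode_series[o] for o in roi_optodes]
    return [sum(samples) / len(selected) for samples in zip(*selected)]


def peak_response(series, onset_idx, fs_hz, post_s=30, pre_s=2):
    """Peak within post_s after onset minus the mean of pre_s before onset."""
    n_pre = int(pre_s * fs_hz)
    n_post = int(post_s * fs_hz)
    if onset_idx < n_pre:
        raise ValueError("not enough samples before trial onset")
    pre = series[onset_idx - n_pre:onset_idx]
    post = series[onset_idx:onset_idx + n_post]
    return max(post) - sum(pre) / len(pre)
```

For the four-optode device used in flight, no spatial averaging would be needed under this layout: each optode would be passed to `peak_response` directly with `fs_hz=4`.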

Real Flight vs. Flight Simulator: Off-Line Behavioral and Neurophysiological Analyses
Participants committed on average 5.33 errors (SD = 1.95) in the WM task in the simulator condition and on average 8.25 errors (SD = 2.42) in the real flight condition, all occurring during the high load trials (see Figure 9). As no error was committed in the low WM load condition, we performed a statistical analysis to compare the effect of the flight conditions on WM performance in the high load condition only. An unpaired t-test revealed that the real flight condition led to a significantly higher number of errors on the WM task in the high load condition (p < 0.001, Cohen's d = 1.34). The MANOVA over the mean [HbO2] data disclosed a significant WM load × Flight condition × ROI interaction [F(3,66) = 3.36; p = 0.039; see Figure 10]. Post-hoc analyses revealed that high load trials performed in the real flight condition led to higher [HbO2] in ROI #2 than their counterparts performed in the simulator (p = 0.0001). The MANOVA over the mean [hHb] data did not disclose any significant WM load × Flight condition × ROI interaction [F(3,66) = 0.69; p = 0.56].
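As a quick check of the behavioral effect size, Cohen's d can be recomputed from the summary statistics above. The Python sketch below uses the pooled-standard-deviation form, which is an assumption since the paper does not state which variant was used, and takes 12 retained pilots per group (14 recruited, two rejected in each experiment).

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var)


# Errors in high load trials: simulator vs. real flight
d = cohens_d(5.33, 1.95, 12, 8.25, 2.42, 12)  # approximately 1.33
```

The result agrees with the reported value of 1.34 up to the rounding of the published means and standard deviations.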

Simulator
During the testing phase, a mean of 76.66% (SD = 16.14%) of the trials were accurately classified on-line as low WM load or high WM load trials. We obtained a mean precision of 85.60% (SD = 19.36%) and a mean recall of 73.33% (SD = 24.62%). Individual classifiers' accuracies are shown in Table 1.

Real Flight
During the testing phase, a mean of 78.33% (SD = 11.93%) of the trials were accurately classified on-line as low WM load or high WM load trials. We obtained a mean precision of 84.14% (SD = 18.56%) and a mean recall of 76.67% (SD = 22.29%). Individual classifiers' accuracies are shown in Table 2.
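The three scores reported for each condition can be computed per pilot from the 10 testing trials, then averaged across pilots. The Python sketch below treats the high WM load class as the positive class, which is an assumption, and uses a hypothetical function name; the example labels are made up for illustration and are not data from the study.

```python
def binary_metrics(y_true, y_pred, positive="high"):
    """Accuracy, precision, and recall for one pilot's testing phase."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Defined as 0.0 when the denominator is empty (a convention)
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels only: 3 hits, 2 misses, 1 false alarm, 4 correct rejections
truth = ["high"] * 5 + ["low"] * 5
guess = ["high", "high", "high", "low", "low", "high", "low", "low", "low", "low"]
scores = binary_metrics(truth, guess)  # accuracy 0.7, precision 0.75, recall 0.6
```

With only 10 testing trials per pilot, each individual accuracy moves in steps of 10 percentage points, which contributes to the sizeable standard deviations reported across pilots.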

Real Flight vs. Flight Simulator: Statistical Analysis
A t-test disclosed no statistically significant difference in classification accuracy between the two experimental conditions (p = 0.67, Cohen's d = 0.17).

DISCUSSION
The motivation of this study was to develop on-line tools to monitor pilots' cognitive performance under realistic settings. We followed a two-step methodological approach: we first implemented and tested an inference system in a flight simulator and then in a real aircraft. We designed a task known to elicit WM (Gateau et al., 2015), as this executive function is particularly engaged when operating aircraft (Causse et al., 2011a,b).

Summary of Findings
The behavioral results confirmed that these two levels of WM load were well contrasted, as the participants exhibited lower performance during the higher difficulty level. This result is in line with the study by Taylor et al. (Taylor et al., 2005; Durantin et al., 2015) and previous experiments showing that pilots' WM performance declines when four different ATC instructions have to be read back. Moreover, this drop in performance was most pronounced for the participants under actual flight conditions. Consistent with this finding, the real flight condition yielded higher PFC activation than the simulated one only when the pilots had to execute the difficult WM load task. Taken together, these findings suggest that the mental demand was higher when operating the actual aircraft, as the participants not only had to perform the WM task but also had to monitor the flight path, the aircraft status, and the airspace much more carefully than in the simulated condition.
Whereas this multitasking aspect of the real flight was not detrimental from a behavioral and neurophysiological point of view when processing the low WM stimuli, it became critical when processing the high WM ones. One could suspect a prioritization issue leading the pilots to focus more on flying the aircraft, thus leaving few resources available to face the demand of the high WM stimuli. This could be one explanation for the higher levels of activation observed in the fNIRS measurements, which would reflect the higher load of concurrent cognitive tasks induced by the real flying task compared to the simulated one. Unfortunately, our aircraft was not equipped with a flight data recorder, preventing us from analyzing the flight performance and investigating these prioritization and multi-tasking issues. Despite this limitation, our study is consistent with Dahlstrom and Nahlinder (2009), who found evidence of higher cardiac activity when flying under realistic settings than in a flight simulator. These results raise the question of the ecological validity of simulators. Their use is of undeniable interest (e.g., understanding cognitive performance, training pilots, assessing cockpit design) and they present several advantages in terms of cost and reproducibility. However, our findings and others (Philip et al., 2005; Dahlstrom and Nahlinder, 2009) suggest that simulators may need to be calibrated against real flying conditions to be more engaging.
Several field studies have demonstrated the potential of fNIRS to measure cortical activity while walking outdoors (McKendrick et al., 2016), facing a prolonged stay at high altitude (Davranche et al., 2016), riding bikes (Piper et al., 2014) and motorcycles (Kawashima et al., 2014), and flying helicopters (Kikukawa et al., 2008). Our study was conducted in accordance with the recent neuroergonomics approach of measuring brain activity out of the laboratory. Indeed, beyond the offline analyses, we used machine learning techniques to perform single trial discrimination of the low WM load versus high WM load trials. The results of the classification process were displayed in a terminal to the experimenter after each trial: once the data of a trial were available, the SVM discrimination process never required more than 10 ms to provide its result. The mean accuracy in classifying low vs. high WM trials in the two experimental conditions exceeded the threshold of 70%, defined as a sufficient rate for pBCI (Kubler et al., 2006; Tai and Chau, 2009). Previous fNIRS-based BCI studies (Kanoh et al., 2009; Hu et al., 2012; Power et al., 2012; Robinson et al., 2016) were not implemented under realistic settings and describe experiments in controlled lab settings.

Limitations and Avenues for Future Research
Despite the promising results presented in this paper for the development of fNIRS-based pBCI in ecologically valid environments, one could argue that the translation of the fNIRS-based pBCI from a real cockpit to day-to-day flight operations might not be feasible. First, the addition of machine learning and this on-line classifier approach to standard aviation procedures still remains a challenge, as the reliability of the classifier does not meet aviation certification criteria (10⁻³ allowable failure probability). One approach to overcome this reliability problem would be to integrate complementary measurements such as EEG, which could significantly enhance classification performance when combined with fNIRS, as suggested by Khan et al. (2014).
Also, the accuracy score per subject must be interpreted with caution. With two classes and only five testing trials per class (a number imposed by experimental constraints), classification performance should be higher than 75% to be statistically significant (p < 0.05) (Müller-Putz et al., 2008; Combrisson and Jerbi, 2015). Considering both groups in this study, 17 of 24 subjects were already above this threshold with our online classifier. Further work on the machine learning methodology would be needed to improve and optimize the classifier performance.
Secondly, the availability of the information about WM level estimation is a key concern. One criterion for evaluating an on-line inference system is the delay of single trial classification. In our study, the diagnosis of the WM load was available less than 1.01 s after each pilot's response window. This could allow, for instance, automatic feedback to ATC that the pilot is currently facing a high workload situation and may have misinterpreted the last communication. This timing is comparable with the latency of other on-line fNIRS-based BCIs (for a review, see Strait et al., 2014). However, solutions have to be explored to speed up response detection in the fNIRS signal, which could drastically reduce the latency in detecting a change in mental state (Cui et al., 2010; Hong and Naseer, 2016). Thirdly, our study was limited to monitoring WM load in a binary and discrete fashion. Further studies have to be conducted to continuously discriminate a gradient of WM levels from underload to overload (Unni et al., 2017). Finally, lingering issues remain regarding the effect of accelerations and headband motion on the fNIRS signal (Mackey et al., 2013). In other scenarios, accelerometer data with special processing could be used to eliminate any systemic effect of blood pooling.
Also, one should consider that fNIRS-based pBCI could first be used for civilian applications, as highly automated modern aircraft prevent pilots from exceeding 1g maneuvers, both for passenger comfort and to remain within the flight envelope protection. Despite these limitations, one can propose a progressive framework for the introduction of fNIRS in aviation. A first step is to consider the use of fNIRS-based BCI to improve training via neurofeedback (Pope et al., 2014) and to tailor the flight sessions to the trainee (Chad et al., 2018). A second step is to use such an inference system to monitor pilots' brain activity during each operational flight for quantified-self purposes. These daily measures could be used to assess pilots' cognitive workload state and mental fatigue, thus providing airlines with analysis tools for crew rostering. A third step is to stream the fNIRS data to the flight data recorder for accident analyses. These logged neurophysiological data would provide additional insights into the crew's cognition during critical events and help accident investigators. A last step, once the reliability of the fNIRS-based inference system meets the standard, would be to adapt the flight deck depending on the crew's changing WM load level. As previously demonstrated, stochastic decisional systems could be implemented to infer that human operators are engaged in a demanding WM task and dynamically adapt interactions to prevent them from being distracted (Gateau et al., 2016). The objective is to improve task allocation to enable better task switching, interruption management, and multi-tasking (Kohlmorgen et al., 2007; Solovey et al., 2011). Finally, one should consider that such an fNIRS-based system could be applied to a variety of contexts in which human operators interact with complex and critical systems (e.g., nuclear power plants, trains).
In summary, this study is the first report of the use of an online fNIRS-based pBCI both in simulation (in silico) and in an aircraft during flight (over the clouds) to measure pilots' WM load. The implementation of this pBCI required addressing several technical constraints, for instance adapting and testing a new wireless fNIRS device that can be used by pilots and that has been approved for use during real flight. It also led us to identify solutions to potential sources of noise in the signals, such as shielding against sunlight infrared using an aluminum-based cover. Moreover, it provides important, albeit preliminary, information about fNIRS measures of the PFC hemodynamic response and its relationship to working memory workload in both simulation and actual flight environments. The level of immersion or realism of the flight environment does appear to influence performance as well as the hemodynamic response in the anterior prefrontal cortex, at least for the air traffic control related working memory task. The measurements in the simulator had larger fNIRS sensor coverage; future studies may compare simulation vs. actual flight, or different levels of realism of the environment, with larger cortical coverage within the actual flight environment, for a more granular comparison. Since fNIRS technology allows the development of mobile, nonintrusive, and miniaturized devices, it has the potential to be deployed in future operational environments to monitor the pilot, adapt the complex system interface, and/or assess the training of operators.

AUTHOR CONTRIBUTIONS
Study conception and design: FD, TG, and HA. Data acquisition: TG and FD. Data analysis: TG, FD, and HA. Data interpretation and writing: FD, HA, and TG.