Detecting Pilot's Engagement Using fNIRS Connectivity Features in an Automated vs. Manual Landing Scenario

Monitoring pilots' mental states is a relevant approach to mitigate human error and enhance human machine interaction. A promising brain imaging technique to perform such a continuous measure of human mental state under ecological settings is Functional Near-InfraRed Spectroscopy (fNIRS). However, to our knowledge no study has yet assessed the potential of fNIRS connectivity metrics for passive Brain Computer Interfaces (BCI). Therefore, we designed an experimental scenario in a realistic simulator in which 12 pilots had to perform landings under two contrasted levels of engagement (manual vs. automated). The collected data were used to benchmark the performance of classical oxygenation features (i.e., Average, Peak, Variance, Skewness, Kurtosis, Area Under the Curve, and Slope) and connectivity features (i.e., Covariance, Pearson's and Spearman's Correlation, Spectral Coherence, and Wavelet Coherence) to discriminate these two landing conditions. Classification performance was obtained by using a shrinkage Linear Discriminant Analysis (sLDA) and a stratified cross validation, using each feature alone or combining them. Our findings disclosed that the connectivity features performed significantly better than the classical concentration metrics, with the highest single-feature accuracy obtained for the wavelet coherence (average: 65.3/59.9 %, min: 45.3/45.0, max: 80.5/74.7 computed for HbO/HbR signals respectively). A maximum classification performance was obtained by combining the area under the curve with the wavelet coherence (average: 66.9/61.6 %, min: 57.3/44.8, max: 80.0/81.3 computed for HbO/HbR signals respectively). More generally, all connectivity measures allowed an efficient classification when computed over HbO signals. These promising results provide methodological cues for further implementation of fNIRS-based passive BCIs.


INTRODUCTION
It is widely accepted that pilot error represents a major cause of aircraft crashes (Li et al., 2001), being more frequently cited than mechanical failure. Safety statistics show that the progressive introduction of automation in the cockpit since the 1960s has improved safety, with modern "computerized" cockpits showing an accident rate half that of the previous generation of aircraft. However, it appears that such technologies have created a new category of potentially deadly incidents whereby crews are unable to comprehend the situation presented before them, and persevere in erroneous decision-making (Dehais et al., 2010). This is especially true for the final approach and landing phases, which represent almost half of the on-board accidents and fatal accidents (Myers and Arnold, 2016).
Without question the development of automation has dramatically changed the role of the crew from "direct (manual) controllers" to "system supervisors/decision makers." Both increased trust in automation and the complexity of these computerized systems (Sarter et al., 1997;Dehais et al., 2012;Tessier and Dehais, 2012) reduce crews' basic flying abilities, and leave them ill-equipped to cope with emergency situations when automation fails (Mumaw et al., 2001). Another drawback of automation is that it imposes long periods of inactivity and thus dramatically decreases pilots' vigilance (Wright and McGown, 2001). For instance, some recent surveys disclosed that 56% of British Airways pilots experienced sleep while on duty (Steptoe and Bostock, 2012;Reis et al., 2013). These operational situations show that automation can vary pilots' engagement from a very low engagement state (disengagement) that induces states of low vigilance and mind wandering, to a very high engagement state (over-engagement) leading to perseveration and attentional tunneling (Wickens and Alexander, 2009). These extreme cognitive states may jeopardize safety and advocate for the introduction of monitoring solutions.
The idea of introducing physiological data into Human Machine Interfaces, called "Physiological Computing" (Fairclough, 2008), could allow the system to take the operator's state into account. Brain monitoring techniques such as passive Brain Computer Interfaces (pBCI) have shown their ability to detect and characterize several operator mental states such as workload, fatigue or more generally engagement (Zander et al., 2010;Zander and Kothe, 2011;Khan and Hong, 2015;Roy and Frey, 2016). Building a system capable of continuously monitoring the operator, or of detecting degraded operator states, would potentially allow it to adapt to these changes and optimize both safety and performance. Such closed-loop systems are the ultimate goal of neuroadaptive technology.
While the real-time identification of these degraded mental states still remains a challenge, a first reasonable step is to characterize the brain activity when flying with and without the use of automation. One possible solution to meet this goal is to consider the use of functional near infra-red spectroscopy (fNIRS). Less popular in the BCI community than electroencephalography, mainly due to its low temporal resolution, this brain imaging technique presents several advantages for ecological settings as its signal is less affected by electrical and motion artifacts. Moreover, its high spatial resolution gives direct access to specific brain structures without additional computational costs, at least for cortical areas. Thus, several studies have shown the potential of fNIRS to infer several mental states under laboratory settings or ecological settings such as flight simulators (Ayaz et al., 2012;Gateau et al., 2015).
Classically, authors have used the relative variation of local HbO and HbR concentration and related features (e.g., slope, area under the curve, skewness) to relate cerebral activation to specific cognitive tasks (Tai and Chau, 2009;Durantin et al., 2015;Gateau et al., 2015). Yet the goal is always to improve the estimation, especially in critical settings. A solution proposed by some authors is to use connectivity measures (Borghini et al., 2014) to account for brain dynamics (for a review on functional connectivity see Bastos and Schoffelen, 2015). Indeed, cognition cannot be reduced to the activation of specialized brain areas but should rather be seen as the cooperation among large scale distributed neural networks (Siegel et al., 2012;Hutchison et al., 2013;van den Heuvel and Sporns, 2013). In other words, examining spontaneous hemodynamic fluctuations can provide a detailed picture of the functional architecture of the brain (Fox and Raichle, 2007). Moreover, connectivity features have been used with success to estimate various mental states based on EEG data (Roy and Frey, 2016) in laboratory settings as well as in ecological settings. For instance, a recent study combined EEG connectivity analysis and crew monitoring in a simulator and showed differences in connectivity patterns during different flight phases (Toppi et al., 2016). A few studies combined optical brain imaging like fNIRS with connectivity analysis (Lu et al., 2010;Funane et al., 2011;Cui et al., 2012;İşbilir et al., 2016), either to identify brain dynamics or brain-to-brain relationships, yet they did not perform mental state estimation. Hence, the contribution of connectivity measures to fNIRS-based mental state estimation is yet to be assessed.
Classical correlation/covariance measures were successfully used in EEG (Gevins et al., 1987); however, some spontaneous oscillations observed in blood-related imaging (fNIRS and fMRI) seem to be frequency specific, especially Low Frequency Oscillations (LFOs) around 0.1 Hz (Obrig et al., 2000;Tong and Frederick, 2010). Accordingly, frequency-specific connectivity metrics such as coherence and wavelet coherence have gained some momentum in fNIRS signal analysis (Rowley et al., 2006;Cui et al., 2012;Holper et al., 2012;Mirelman et al., 2014).
The objectives of the present study are: (i) to evaluate the feasibility of estimating the pilot's engagement using fNIRS connectivity measures in an ecological setting such as a flight simulator, and (ii) to assess the potential of connectivity measures to better characterize engagement than classical measures.
To meet these goals, a simplified task was designed whereby pilots had to perform different manual and automated landings. Parieto-occipital areas were targeted as they play a key role in visual attention, which is particularly involved while flying (Dehais et al., 2016). Prefrontal cortex activity was also measured as its activation reflects mental demands (Moro et al., 2016) and top-down regulation. Off-line classification was performed over different classical metrics (average, peak, variance, skewness, kurtosis, area under the curve and slope) and connectivity metrics to identify the most predictive ones. Regarding connectivity features, classical dependency measures, namely Covariance, Pearson's correlation (Greenblatt et al., 2012) and Spearman's correlation (Spearman, 1904), and some spectral measures, namely magnitude squared coherence (Mandel and Wolf, 1976) and wavelet coherence (Torrence and Compo, 1998;Lachaux et al., 2002;Grinsted et al., 2004), were compared (for a review on connectivity metrics see Lachaux et al., 2002;Greenblatt et al., 2012).

Participants
Twelve visual flight rules (VFR) pilots (11 males, mean group age 24 ± 3) completed the experiment. Pilots had normal or corrected-to-normal vision, normal hearing, and no psychiatric disorders. They all had medical clearance to fly. After providing written informed consent, they were instructed to complete a 5-min task training. The typical total duration of a subject's session (informed consent approval, practice task and real task) was about 1 h. This work was approved by the local ISAE-SUPAERO committee (Approval Number: CERNI-Université fédérale de Toulouse-2017-057).

Experimental Design
The protocol consisted of 8 scenarios in a flight simulator: 4 with manual landing and 4 with automated landing. The Airbus A320 full motion simulator at ISAE-SUPAERO (French Aeronautical Engineering School in Toulouse) was used to conduct the experiment in ecological conditions. It simulates a twin-engine aircraft flight mode. The user interface is composed of a Primary Flight Display (PFD), a Navigation Display and an upper Electronic Central Aircraft Monitoring Display page. The pilot also had a Flight Control Unit (FCU) to interact with the autopilot (Figure 1).
The scenarios were divided into 3 phases: a rest phase, a cruise phase and lastly a landing phase. The landing was performed either in manual mode (i.e., hard condition), in which the pilots controlled the aircraft speed and trajectory, or with the autopilot engaged (i.e., easy condition; Figure 2). Landing conditions (Auto vs. Manual) were pseudo-randomly distributed.
During the cruise phase, the autopilot was engaged and the pilots were asked to relax. This phase was mostly set to serve as a baseline. When approaching the ILS (Instrument Landing System) range (approximately 2 min) they were asked either to let the autoflight system perform the landing or to disengage the automation to manually land the aircraft. Autopilot and auto throttle deactivation was done by pushing a red button on the flight stick and the throttle respectively. Participants did not know in advance whether the landing would be automated or manually executed. Considering the whole spectrum of the landing task, our experimental conditions were designed to be contrasted in terms of mental demands. The landing phase ended 10 s after the pilots touched down on the landing ground. Before starting the experiment, the participants performed a 30-min training session to familiarize themselves with the simulator environment.

Subjective Workload Assessment
After the end of the experiment, the pilots were asked to complete a commonly used subjective workload level questionnaire, the NASA-TLX (Hart and Staveland, 1988) in order to compare the two conditions. This questionnaire combines 6 factors, i.e., mental demand, physical demand, temporal demand, overall performance, frustration level, and effort.

fNIRS Recording
Two NIRSport acquisition devices (NIRx Medical Technologies) were used in tandem mode to increase the number of sensors. Each system has 8 sources and 8 detectors operating at wavelengths of 760 and 850 nm, sampled at 7.8125 Hz. By using 2 systems, frontal and occipital areas were each covered with 8 sources and 8 detectors constrained mechanically by a plastic spacer at the appropriate distance (3 cm maximum), resulting in 42 measurement channels. The probabilistic paths of photons through the cortex were estimated using the Monte-Carlo transport software tMCimg via the Atlas Viewer from Homer2 (Boas et al., 2002;Aasted et al., 2015). The optode placement and the results of the simulation are shown in Figure 3. Before starting the experiment, a calibration was performed in order to check each optode's signal quality.

Pre-processing
fNIRS data were analyzed using Matlab R2015b with several functions from the Homer2 software package (Dubb and Boas, 2016). The overall analysis pipeline is described in Figure 4. The landing phase was divided into epochs of 200 samples (∼25 s) overlapping by 60 samples (∼7.5 s). As the landing duration (152 ± 22 s) could slightly differ among participants depending on their performance, the fixed number of extracted epochs was based on the shortest landing in duration, resulting in 12 epochs per landing and per subject.
Each epoch was processed independently in order to potentially extend our method to online processing. Raw data were converted to optical densities; an artifact removal algorithm and a band pass filter were applied on each epoch separately. A wavelet interpolation method was used for the artifact correction. This method has been shown to have the greatest signal to noise ratio among the current artifact removal methods available. A Butterworth high pass filter (cutoff: 0.01 Hz, order 3) and a low pass filter (cutoff: 0.5 Hz, order 5) were applied for the band pass filtering step.
The filtered and artifact free data were then converted to oxyhemoglobin [HbO] and deoxy-hemoglobin [HbR] concentration variations.
For further analysis, only the 80 centered samples (∼10 s) of each epoch were kept by applying a boxcar function. This window was applied to avoid spectral leakage, specifically from the wavelet transform, and to obtain a 10 s window without overlap. At the end of this processing stage, for each landing (trial) we had 12 non-overlapping, filtered and artifact free epochs of 80 samples.
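The boxcar step above can be illustrated with a minimal sketch, assuming each epoch is held as a plain Python sequence of 200 samples. The function name `center_crop` and the placeholder data are hypothetical and not taken from the Homer2 pipeline; only the epoch length, the kept length and the sampling rate come from the text.

```python
FS_HZ = 7.8125  # sampling rate reported for the fNIRS recording


def center_crop(epoch, keep=80):
    """Keep the `keep` central samples of an epoch (boxcar window)."""
    if keep > len(epoch):
        raise ValueError("keep cannot exceed the epoch length")
    margin = (len(epoch) - keep) // 2  # samples dropped on each side
    return epoch[margin:margin + keep]


epoch = list(range(200))          # placeholder epoch: sample indices 0..199
window = center_crop(epoch)       # central samples, indices 60..139
duration_s = len(window) / FS_HZ  # window length in seconds
```

With a 200-sample epoch, the margin dropped on each side is 60 samples, and the retained 80 samples span 10.24 s at 7.8125 Hz.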

Oxygenation Measures
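The peak (maximum) and the first four moments (average, variance, skewness, and kurtosis) of each epoch $x = (x_1, \dots, x_N)$ were computed as follows:

$$\mathrm{Peak} = \max_{1 \le i \le N} x_i \qquad (1)$$

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2, \quad \mathrm{Skewness} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i-\mu}{\sigma}\right)^3, \quad \mathrm{Kurtosis} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i-\mu}{\sigma}\right)^4 \qquad (2)$$

The Area Under the Curve (AUC) was calculated by summing the absolute values of the signal.
The slope was computed by least-squares linear regression with the MATLAB polyfit function.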
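As an illustration of the seven oxygenation features, the following sketch computes them for one epoch in plain Python. It is not the authors' MATLAB code: the function name `oxygenation_features` is hypothetical, the moments use a 1/N normalization (an assumption; MATLAB's `var` divides by N − 1 by default), and the slope is expressed per sample rather than per second.

```python
def oxygenation_features(x):
    """Peak, first four moments, AUC and slope of one epoch (list of floats)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    m2 = sum(d ** 2 for d in dev) / n  # variance (1/N normalization, assumed)
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    # Least-squares slope of x against the sample index 0..n-1
    t_mean = (n - 1) / 2
    slope = (sum((t - t_mean) * d for t, d in enumerate(dev))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return {
        "peak": max(x),
        "average": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
        "auc": sum(abs(v) for v in x),  # sum of absolute values, as in the text
        "slope": slope,
    }


features = oxygenation_features([0.0, 1.0, 2.0, 3.0, 4.0])
```

For this linear ramp the slope is 1 per sample and the skewness is 0, which gives a quick sanity check of the implementation.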

Connectivity Measures
Connectivity measures were computed, as previously, using both the [HbO] and [HbR] signals on each epoch separately, where x and y represent two signals from two different channels. Five connectivity measures were computed (Covariance, Pearson's correlation, Spearman's correlation, Coherence, and Wavelet Coherence). Covariance of two signals x and y can be described as a "measure of joint variability":

$$\mathrm{cov}(x, y) = E\big[(x - E[x])\,(y - E[y])\big]$$

where E represents the expected value. Intuitively, covariance characterizes the simultaneous variations of two signals. Covariance will be positive when the differences between the signals and their averages tend to be of the same sign, and negative in the opposite case. Pearson's correlation coefficient is the covariance of two signals normalized by the product of their standard deviations (std):

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}$$

It represents the linear correlation between two signals; its values range from −1 to +1, meaning respectively a perfect negative and a perfect positive linear correlation, with 0 corresponding to no linear correlation at all.
Spearman's rank correlation coefficient is "defined as the Pearson correlation coefficient between the ranked variable" (Myers and Arnold, 2003):

$$r_s = \rho_{rg_x, rg_y} = \frac{\mathrm{cov}(rg_x, rg_y)}{\sigma_{rg_x} \, \sigma_{rg_y}}$$

where $rg_x$ and $rg_y$ are the ranked variables (of x and y respectively). Using the ranks instead of the values allows describing monotonic non-linear relationships between signals, whereas Pearson's coefficient only characterizes linear relationships.

Spectral Coherence $C_{xy}(f)$, or Magnitude Squared Coherence, is defined as the absolute squared value of the cross-spectral density of two signals (x and y) for a frequency f, normalized by the product of their auto-spectral densities:

$$C_{xy}(f) = \frac{|G_{xy}(f)|^2}{G_{xx}(f)\, G_{yy}(f)}$$

where $G_{xy}(f)$ represents the cross spectral density (being the spectrum of the cross correlation function) for a frequency f, and $G_{xx}(f)$ and $G_{yy}(f)$ are the auto spectral densities (i.e., the spectra of the auto correlation functions) of x and y respectively. Spectral coherence can be seen as a correlation coefficient in the frequency domain.

For the last measure, a coherence based on the wavelet transform was used (Torrence and Compo, 1998): the wavelet coherence. The wavelet coherence power $R_n^2(s)$ can be defined as:

$$R_n^2(s) = \frac{\left| S\!\left(s^{-1} W_n^{xy}(s)\right) \right|^2}{S\!\left(s^{-1} \left|W_n^{x}(s)\right|^2\right) \cdot S\!\left(s^{-1} \left|W_n^{y}(s)\right|^2\right)}$$

where $W_n^{x}(s)$ and $W_n^{y}(s)$ represent respectively the wavelet transforms of x and y at time point n for a wavelet scale s. $W_n^{xy}(s)$ is the cross wavelet transform of x and y (being the wavelet transform of the cross correlation function). S is a smoothing operator (for more detail see Torrence and Compo, 1998).
This measure can be seen as "a localized correlation coefficient in time frequency space" (Grinsted et al., 2004). Coherence values range from 0 to 1, with 1 meaning that the two analyzed signals show perfectly phase-locked oscillations at a given frequency.
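The three time-domain measures (covariance, Pearson's and Spearman's correlation) can be sketched in a few lines of plain Python. This is an illustrative implementation, not the authors' code: the function names are hypothetical, the covariance uses a 1/N normalization (an assumption), and tied values receive their average rank. The two coherence measures are omitted here because they require spectral and wavelet estimation routines.

```python
import math


def covariance(x, y):
    """Covariance of two equal-length signals (1/N normalization, assumed)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    """Covariance normalized by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranked signals."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cubic = [v ** 3 for v in x]  # monotonic but non-linear in x
rho_linear = pearson(x, [2 * v + 1 for v in x])
rho_cubic = pearson(x, y_cubic)
rs_cubic = spearman(x, y_cubic)
```

The cubic example shows the difference stated in the text: the relationship is perfectly monotonic, so Spearman's coefficient reaches 1 while Pearson's coefficient stays below 1.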
Connectivity measures were computed on each epoch separately and for each couple of channels, namely $C_n^k = 861$ couples (k = 2, n = 42).

Region of interest
In order to reduce the amount of data and the dimensionality, the 42 different channels were combined into 6 regions of interest (ROI): Frontal-Left and Right, Fronto-Central; Occipital-Left and Right and Occipito-Central. For the oxygenation features, this was done by averaging all the features from the channels included in each of these 6 regions. For the connectivity features, 15 connections were possible across the 6 ROI. Values were first evaluated for each pair (861 couples) and then averaged across couples connecting the same regions. Couples of channels included inside one ROI were also kept, which gave 15 + 6 = 21 connectivity measures.
At this step, we had 6 measures for each oxygenation feature and 21 measures for each connectivity feature per epoch and per subject.
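The reduction from 861 channel couples to 21 ROI-level values can be sketched as follows. The channel-to-ROI assignment used here (7 consecutive channels per ROI) is purely hypothetical and only serves to make the example runnable; the actual assignment depends on the montage. The function name `roi_connectivity` is likewise an assumption.

```python
from itertools import combinations

ROIS = ["Frontal-Left", "Frontal-Right", "Fronto-Central",
        "Occipital-Left", "Occipital-Right", "Occipito-Central"]
N_CHANNELS = 42

# Hypothetical assignment: 7 consecutive channels per ROI (not from the paper)
CHANNEL_ROI = [ROIS[ch // 7] for ch in range(N_CHANNELS)]


def roi_connectivity(pair_values):
    """Average channel-pair values over pairs linking the same two ROIs.

    pair_values maps (i, j) channel index pairs to a connectivity value.
    Pairs inside a single ROI are kept under the key (roi, roi).
    """
    sums, counts = {}, {}
    for (i, j), value in pair_values.items():
        key = tuple(sorted((CHANNEL_ROI[i], CHANNEL_ROI[j])))
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


channel_pairs = list(combinations(range(N_CHANNELS), 2))  # all channel couples
dummy_values = {pair: 1.0 for pair in channel_pairs}      # placeholder data
roi_values = roi_connectivity(dummy_values)
```

Enumerating the couples confirms the counts given in the text: 42 channels yield 861 couples, and 6 ROIs yield 15 between-region plus 6 within-region connections.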

Frequency specific measures
For the two coherence measures (Magnitude Squared Coherence and Wavelet Coherence), the obtained coherence values were averaged over a frequency range between 0.08 Hz (1/12.8 s) and 0.3125 Hz (1/3.2 s), in accordance with the fNIRS literature (Cui et al., 2012).
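A minimal sketch of this band averaging, assuming the coherence has already been estimated on a set of frequency bins. The function name `band_average`, the inclusive band edges and the example values are assumptions made for illustration only.

```python
F_LOW = 1 / 12.8  # 0.078125 Hz, i.e. a 12.8 s period
F_HIGH = 1 / 3.2  # 0.3125 Hz, i.e. a 3.2 s period


def band_average(freqs, coherence, f_low=F_LOW, f_high=F_HIGH):
    """Average coherence values whose frequency lies in [f_low, f_high]."""
    selected = [c for f, c in zip(freqs, coherence) if f_low <= f <= f_high]
    if not selected:
        raise ValueError("no frequency bin falls inside the band")
    return sum(selected) / len(selected)


# Made-up coherence spectrum: only the bins at 0.1, 0.2 and 0.3 Hz are kept
freqs = [0.05, 0.1, 0.2, 0.3, 0.4]
coh = [0.9, 0.4, 0.6, 0.8, 0.1]
band_value = band_average(freqs, coh)
```

For the wavelet coherence, the same selection would apply to the scales whose equivalent periods fall between 3.2 and 12.8 s, followed by an average over time points.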
Our paradigm was an intra-subject binary classification. Each subject performed 8 landings (4 of each of the 2 conditions). Data were processed to obtain twelve 10-s epochs for each landing, which gives 12 × 8 = 96 epochs (examples) for each subject. Our model prediction performance was assessed by using a stratified cross validation, which is a good tradeoff between bias and variance estimation (Kohavi and Sommerfield, 1995;Friedman et al., 2001). The classifier was trained with examples that originated from 6 different landings (3 of each of the 2 conditions, i.e., 6 × 12 = 72 examples) and tried to predict examples from the last 2 landings (1 of each condition, i.e., 2 × 12 = 24 examples). This method was applied for every combination (16) of landings left out of the training set and the averaged performance was kept.
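The fold structure described above can be enumerated explicitly. In this sketch the landing identifiers (0 to 3 for manual, 4 to 7 for automated) and the function name `cv_folds` are hypothetical; an example is represented as a (landing, epoch) pair rather than a feature vector.

```python
from itertools import product

EPOCHS_PER_LANDING = 12
MANUAL = [0, 1, 2, 3]  # hypothetical ids of the 4 manual landings
AUTO = [4, 5, 6, 7]    # hypothetical ids of the 4 automated landings


def cv_folds():
    """Leave one manual and one automated landing out for testing."""
    folds = []
    for manual_out, auto_out in product(MANUAL, AUTO):
        test_landings = [manual_out, auto_out]
        train_landings = [l for l in MANUAL + AUTO if l not in test_landings]
        train = [(l, e) for l in train_landings
                 for e in range(EPOCHS_PER_LANDING)]
        test = [(l, e) for l in test_landings
                for e in range(EPOCHS_PER_LANDING)]
        folds.append((train, test))
    return folds


folds = cv_folds()
```

Holding out whole landings, rather than individual epochs, keeps all epochs of a given landing on the same side of the split; the 4 × 4 choices of held-out landings give the 16 combinations mentioned in the text.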
Regarding the features, 2 types of comparisons were done. Firstly, a single feature comparison where each feature classification performance is assessed separately was performed.
Secondly, features were merged together to evaluate their potential. They were combined 2 by 2 and the classification performance obtained with each couple was assessed.

Subjective workload comparison
A paired-sample t-test was performed in order to compare the average overall workload obtained for the 2 conditions among subjects.

Classification performance significance
For a 2-class problem like ours, the theoretical chance level for classification is 100/2 = 50 %, but this only holds for an infinite number of samples. To assess the significance of our classifier (decoding accuracy), the classification error was modeled by a binomial cumulative distribution (see Combrisson and Jerbi, 2015 for more details):

$$P(z) = \sum_{i=z}^{n} \binom{n}{i} \left(\frac{1}{c}\right)^{i} \left(\frac{c-1}{c}\right)^{n-i}$$

where P is the probability to predict the correct class at least z times, n is the number of samples, and c is the number of classes. The performance of our classification pipeline was assessed by repeating the stratified cross validation 16 times and averaging it. As stated earlier, our classifier was trained with 72 samples and tested on the last 24 samples. By using the cumulative binomial distribution, it sets the 5% significance classification threshold at 58.3%.
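The threshold can be checked numerically with the binomial distribution. The sketch below assumes n = 96 (all epochs of one subject) and c = 2 classes; the choice of n is an assumption, made because it reproduces the reported 58.3 % figure, and the function name `chance_threshold` is hypothetical. It follows the inverse-CDF convention: the smallest number of correct predictions whose cumulative probability under chance reaches 1 − α.

```python
from math import comb


def chance_threshold(n, c=2, alpha=0.05):
    """Smallest k such that P(X <= k) >= 1 - alpha for X ~ Binomial(n, 1/c)."""
    p = 1 / c
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p ** k * (1 - p) ** (n - k)
        if cdf >= 1 - alpha:
            return k
    return n


n_samples = 96                        # assumed: 12 epochs x 8 landings
k_correct = chance_threshold(n_samples)
threshold_pct = 100 * k_correct / n_samples
```

Under these assumptions the threshold is 56 correct predictions out of 96, i.e. 58.3 %, so an accuracy has to exceed this value to be considered above chance at the 5 % level.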

Classification performance comparison
In order to compare the classification performance for each feature, a repeated measure ANOVA was used considering FEATURES (or FEATURES COUPLE) and CHROMOPHORE (HbO/HbR) as within factors. A post-hoc Tukey's Honestly Significant Difference (HSD) procedure was applied to perform multiple comparisons.

Subjective Workload Assessment
Participants rated their workload significantly higher for the manual landing condition (M = 66.6 ± 9) than the automatic landing condition (M = 18.7 ± 7; t(11) = −17.43, $p < 10^{-8}$). Figure 5 illustrates the classification performances for each feature computed over the HbR and HbO signals. In order to compare classification performance among features, a repeated measure ANOVA was done.

Classification with Individual Features
The statistical analysis showed that there was a significant effect of feature type on classification performance [F(11, 121) = 5.66, $p < 10^{-3}$] and it also revealed a significant effect of the chromophore used [F(1, 11) = 8.73, p < 0.05]. Post-hoc comparisons revealed significant differences among features mainly for HbO. In particular, Wavelet Coherence had a significantly better performance than the Average, Skewness, Kurtosis, and Slope. Also, every connectivity feature gave a significantly greater performance than the Skewness. Moreover, regarding HbR, the Wavelet Coherence and the Covariance gave a significantly greater performance than the Kurtosis. The connectivity features did not differ significantly from one another. Post-hoc comparisons did not show any significant effect of the chromophore on the classification performance regardless of the feature used. In other words, for any given feature, the results obtained with the HbO and HbR signals did not differ significantly.
Moreover, every connectivity feature computed over the HbO signals led to an average classification performance above chance level (>58.3 %). Furthermore, Pearson's, Spearman's correlation, and the Wavelet Coherence exceeded the chance level for both HbO and HbR. Concerning classical oxygenation features, the AUC and Variance were the only features to reach a classification performance above chance level but only when computed over HbO signals.
Regarding the best features, Wavelet Coherence achieved the best classification performance across subjects, with an average of 65.34 and 59.94% of correct classification for HbO and HbR respectively. The second was the Covariance (62.93 and 56.03 %) followed by the Area Under the Curve (61.76 and 57.83%) for HbO and HbR respectively. Figures 6, 7 show the averaged classification performance for all the possible combinations of 2 oxygenation or 2 connectivity features respectively.

Classification with Combined Features
Following the same procedure as before, a repeated measure ANOVA was done with the data shown in Figures 6, 7. It revealed that there was a significant effect of the feature couple [F(30, 330) = 5.42, $p < 10^{-3}$] but not of the chromophore [F(1, 11) = 2.47, p = 0.14] on the classification performance.
When evaluating multiple comparisons for HbO, the main observation is that the 7 best connectivity couples gave a significantly greater classification performance than the 7 worst oxygenation couples. It can also be noted that connectivity couples did not exhibit significant differences between one another.
For oxygenation features, 9 out of 21 couples of features led to a classification performance above chance level, namely AUC-Peak, AUC-Variance, AUC-Average, Average-Variance, AUC-Slope, Variance-Slope, Peak-Variance, AUC-Skewness, and Variance-Skewness. The AUC-Peak couple reached a classification performance of 61.2 and 56.7% for HbO and HbR respectively. Moreover, AUC appears in 5 of these 9 best couples. Regarding combined connectivity features, every connectivity couple reached a classification performance above chance except the couple Covariance-Coherence when computed over HbR. The best couple (Covariance-Wavelet Coherence) led to a classification performance of 66.4 and 59.8% (for HbO and HbR respectively).
Results for every feature couple, including couples mixing oxygenation and connectivity features, for every subject are given in Tables 1, 2. In these tables, results are rounded to the closest integers and ordered by their average value (the last row is the best-performing couple on average across subjects); columns refer to subjects (S1-S12) and rows to feature couples.

DISCUSSION
The main motivation of the present study was to assess the potential of connectivity measures to classify two different levels of task engagement with fNIRS under relatively ecological settings. We therefore designed a protocol whereby pilots had to perform several manual and automated landings. Our subjective measures confirmed that these two situations were contrasted, as manual landing led to significantly higher subjective NASA-TLX scores than automated landing. Our overall classification results confirmed that the two different engagement levels could be discriminated in a flight simulator. This is in line with previous neuroergonomics studies showing that this brain optical imaging technique is well suited for mental state monitoring in ecological situations (Herff et al., 2013;Durantin et al., 2015;Gateau et al., 2015;Foy et al., 2016). The best classification accuracy reached 66.9 %, a result that, at first glance, does not compare favorably with recent studies. For instance, Hong et al. (2015) obtained a classification performance of 75.6 % on 10 subjects with a mental motor imagery and mental arithmetic paradigm using average and slope features over chromophore concentration. Holper and Wolf (2011) used a complex vs. simple imaginary movement paradigm with 12 subjects. By combining different features such as the average, variance, skewness and kurtosis computed over HbO and HbR, they reached a performance of 81.3 %. Naseer et al. (2016) obtained a 93 % classification performance with almost similar features to classify mental arithmetic vs. rest on 7 subjects.
However, these studies did not consider a continuous assessment but rather an event-locked assessment of a specific cognitive activity, whereas our flying task involved different executive and attentional skills. Interestingly enough and contrary to our results, Khan and Hong (2015) showed that classical oxygenation metrics could yield a high accuracy (84.9 %) when continuously monitoring drowsiness under ecological settings such as driving in simulated conditions. The comparison with our study remains challenging as the construct of engagement is probably more subtle to capture. Finally, the limited number of trials did not allow us to optimize the training of our model to guarantee high classification accuracy.
Interestingly, the connectivity measures led to better classification performance than the classical oxygenation metrics (i.e., chromophore concentration variation). The better performance of the connectivity metrics over classical ones could rely on two main explanations. Firstly, one has to consider that the analysis of task-related concentrations (i.e., hemodynamic response) is time-locked to the event. It has been proposed that these task-related responses induce a small increase (<5%) in neural energy consumption compared to the overall brain energy consumption (Raichle and Mintun, 2006). Thus by focusing only on a localized hemodynamic response, the majority of the brain activity is dismissed. It is now well admitted that cognition relies on the activation of several distributed brain areas rather than single dedicated processing units (Siegel et al., 2012;Hutchison et al., 2013;van den Heuvel and Sporns, 2013). Thus, the analysis of the interaction between neural networks provides more information on the brain dynamics, especially when concerned with the understanding of complex real-life task (Cui et al., 2012;Leff et al., 2015;İşbilir et al., 2016). Secondly, some relevant studies disclosed that frequency or amplitude correlations among spontaneous LFOs (around 0.1 Hz) are tightly linked to cortical processes (Lowe et al., 1998;Xiong et al., 1999;Obrig et al., 2000, see Siegel et al., 2012 for a review). As a matter of fact, when considering continuous monitoring of the brain activity, where no specific events are expected, connectivity features based on frequency or amplitude coupling can give an insight on the ongoing cognitive processes.
The comparison of the connectivity metrics classification performance revealed that covariance, correlation (Pearson's or Spearman's) and wavelet coherence led to significantly higher classification accuracies than respectively 3, 2, and 4 classical oxygenation metrics. It is interesting to note that these metrics present complementary advantages. On one hand, correlation and covariance are straightforward measures with a low computational cost. This is of great advantage for pBCIs. On the other hand, the wavelet coherence takes into account both time-locked and phase-locked oscillations. While it has been used for some years by the fMRI community, the wavelet coherence metric has only recently been applied to fNIRS signals (Cui et al., 2012;İşbilir et al., 2016). Wavelet coherence also makes it possible to target specific and relevant frequency bands such as LFOs, as discussed previously. However, the implementation of wavelet coherence based pBCIs remains challenging as this metric requires a high number of wavelet convolutions and the calculation costs could be critical in an online paradigm. One promising approach to overcome this issue is to consider dimensionality reduction (Guyon and Elisseeff, 2003). Taken together, our findings provide some methodological guidance for the implementation of fNIRS based BCI metrics. To the best of our knowledge, this study is one of the few to benchmark different fNIRS connectivity metrics and to use them for classification purposes in ecological settings. It paves the way toward online mental state estimation in ecological aeronautical settings, but some challenges still remain.
Despite its potential interest, our paper has several limitations. Firstly, this experiment involved 12 subjects who only performed four trials of each condition. This limited number of trials was a compromise, as the participants would experience fatigue and discomfort if wearing the cap for a long period (around 40 min). Secondly, the choice of the two contrasted conditions (automatic vs. manual) can be discussed, since potential confounds such as motor responses could influence our measures. Yet we did not target motor areas, therefore the risk is low. Moreover, our motivation is to monitor pilots' brain activity when facing realistic flying conditions, and the design of well contrasted and controlled conditions in ecological environments such as flying remains challenging. This first experiment was meant to set the path to more refined protocols to characterize different tasks with a view to performing crew monitoring as achieved by Toppi et al. (2016). The third limitation concerns the fNIRS signal analysis. Indeed, the fNIRS signal is the result of a global component influenced by skin blood flow and a local neuronal component. Some algorithms based on spatial filtering and principal component analysis such as the one proposed by Zhang et al. (2016) could have been used if the analysis had not been done on each epoch separately. Moreover, fNIRS signals can also be influenced by other physiological activities such as heartbeats, respiration or changes in blood pressure. It would have been interesting to also record those activities to evaluate how they correlate with the engagement level. Given the paradigm settings and these limitations, and although our classification performance exceeded chance level, it is not possible to make any claim regarding the underlying neurophysiological processes.
Finally, the performance of the classification pipeline needs to be improved before its implementation in the cockpit, as such a rate of false negative detections cannot be afforded in critical systems, even though this accuracy level could still be usable within a multi-sensor fusion approach. A promising way to increase classification performance could be to use a bimodal EEG-fNIRS pBCI (Fazli et al., 2012).

FUNDING
This study was supported by a PhD grant delivered by the DGA (Direction Générale de l'Armement).