# DETECTION AND ESTIMATION OF WORKING MEMORY STATES AND COGNITIVE FUNCTIONS BASED ON NEUROPHYSIOLOGICAL MEASURES

EDITED BY : Felix Putze, Christian Mühl, Fabien Lotte, Stephen Fairclough and Christian Herff PUBLISHED IN : Frontiers in Human Neuroscience and Frontiers in Neuroscience

#### Frontiers Copyright Statement

© Copyright 2007-2019 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88945-691-8 DOI 10.3389/978-2-88945-691-8


## DETECTION AND ESTIMATION OF WORKING MEMORY STATES AND COGNITIVE FUNCTIONS BASED ON NEUROPHYSIOLOGICAL MEASURES

Topic Editors:

Felix Putze, University of Bremen, Germany; Christian Mühl, German Aerospace Center, Germany; Fabien Lotte, Inria/LaBRI (CNRS - Bordeaux INP - Univ. Bordeaux), France; Stephen Fairclough, Liverpool John Moores University, United Kingdom; Christian Herff, Maastricht University, Netherlands

Executive cognitive functions like working memory determine the success or failure of a wide variety of different cognitive tasks, such as problem solving, navigation, or planning. Estimation of constructs like working memory load or memory capacity from neurophysiological or psychophysiological signals would enable adaptive systems to respond to cognitive states experienced by an operator and trigger responses designed to support task performance (e.g., by simplifying the exercises of a tutor system when the subject is overloaded, or by shutting down distractions from the mobile phone). The determination of cognitive states like working memory load is also useful for automated testing/assessment or for usability evaluation. While there exists a large body of research work on neural and physiological correlates of cognitive functions like working memory activity, fewer publications deal with the application of this research with respect to single-trial detection and real-time estimation of cognitive functions in complex, realistic scenarios. Single-trial classifiers based on brain activity measurements such as electroencephalography, functional near-infrared spectroscopy, physiological signals or eye tracking have the potential to classify affective or cognitive states based upon short segments of data. For this purpose, signal processing and machine learning techniques need to be developed and transferred to real-world user interfaces.

The goal of this Frontiers Research Topic was to advance the State-of-the-Art in signal-based modeling of cognitive processes. We were especially interested in research towards more complex and realistic study designs, for example collecting data in the wild or investigating the interaction between different cognitive processes or signal modalities. Bringing together many contributions in one format allowed us to look at the state of convergence or diversity regarding concepts, methods, and paradigms.

Citation: Putze, F., Mühl, C., Lotte, F., Fairclough, S., Herff, C., eds. (2019). Detection and Estimation of Working Memory States and Cognitive Functions Based on Neurophysiological Measures. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-691-8

# Table of Contents

*Editorial: Detection and Estimation of Working Memory States and Cognitive Functions Based on Neurophysiological Measures* Felix Putze, Christian Mühl, Fabien Lotte, Stephen Fairclough and Christian Herff

### SECTION 1

#### COGNITIVE WORKLOAD

*In silico vs. Over the Clouds: On-the-Fly Mental State Estimation of Aircraft Pilots, Using a Functional Near Infrared Spectroscopy Based Passive-BCI*

Thibault Gateau, Hasan Ayaz and Frédéric Dehais


Carina Walter, Wolfgang Rosenstiel, Martin Bogdan, Peter Gerjets and Martin Spüler

*Dynamic Threshold Selection for a Biocybernetic Loop in an Adaptive Video Game Context*

Elise Labonte-Lemoyne, François Courtemanche, Victoire Louis, Marc Fredette, Sylvain Sénécal and Pierre-Majorique Léger

### SECTION 2

#### MEMORY PROCESSES

*Single-Trial EEG Analysis Predicts Memory Retrieval and Reveals Source-Dependent Differences*

Eunho Noh, Kueida Liao, Matthew V. Mollison, Tim Curran and Virginia R. de Sa


Jlenia Toppi, Laura Astolfi, Monica Risetti, Alessandra Anzolin, Silvia E. Kober, Guilherme Wood and Donatella Mattia

### SECTION 3

#### ATTENTION

*On the Relationship Between Attention Processing and P300-Based Brain Computer Interface Control in Amyotrophic Lateral Sclerosis*

Angela Riccio, Francesca Schettini, Luca Simione, Alessia Pizzimenti, Maurizio Inghilleri, Marta Olivetti-Belardinelli, Donatella Mattia and Febo Cincotti


Anne-Marie Brouwer, Maarten A. Hogervorst, Bob Oudejans, Anthony J. Ries and Jonathan Touryan

### SECTION 4

#### METHODS & OTHER COGNITIVE PROCESSES


Hiroyuki Takayoshi, Keiichi Onoda and Shuhei Yamaguchi

*Isolating Discriminant Neural Activity in the Presence of Eye Movements and Concurrent Task Demands*

Jon Touryan, Vernon J. Lawhern, Patrick M. Connolly, Nima Bigdely-Shamlo and Anthony J. Ries

*Assessing the Depth of Cognitive Processing as the Basis for Potential User-State Adaptation*

Irina-Emilia Nicolae, Laura Acqualagna and Benjamin Blankertz

# Editorial: Detection and Estimation of Working Memory States and Cognitive Functions Based on Neurophysiological Measures

#### Felix Putze<sup>1</sup>\*, Christian Mühl<sup>2</sup>, Fabien Lotte<sup>3</sup>, Stephen Fairclough<sup>4</sup> and Christian Herff<sup>1,5</sup>

<sup>1</sup> Cognitive Systems Lab, University of Bremen, Bremen, Germany, <sup>2</sup> Sleep and Human Factors Research, German Aerospace Center, Institute of Aerospace Medicine, Cologne, Germany, <sup>3</sup> Inria / LaBRI (CNRS - Bordeaux INP - Univ. Bordeaux), Talence, France, <sup>4</sup> School of Natural Sciences and Psychology, Liverpool John Moores University, Liverpool, United Kingdom, <sup>5</sup> School for Mental Health and Neuroscience, Maastricht University, Maastricht, Netherlands

Keywords: cognitive functions, working memory - long-term memory interactions, BCI (brain computer interface), EEG, fNIRS (functional near infrared spectroscopy), cognitive processes

**Editorial on the Research Topic**

**Detection and Estimation of Working Memory States and Cognitive Functions Based on Neurophysiological Measures**

### 1. SCOPE

Executive cognitive functions like working memory determine the success or failure of a wide variety of different cognitive tasks, such as problem solving, navigation, or planning. Estimation of constructs like working memory load or memory capacity from neurophysiological or psychophysiological signals would enable adaptive systems to respond to cognitive states experienced by an operator and trigger responses designed to support task performance (e.g., by simplifying the exercises of a tutor system when the subject is overloaded [Gerjets et al., 2014], or by shutting down distractions from the mobile phone). The determination of cognitive states like working memory load is also useful for automated testing/assessment or for usability evaluation. While there exists a large body of research work on neural and physiological correlates of cognitive functions like working memory activity, fewer publications deal with the application of this research with respect to single-trial detection and real-time estimation of cognitive functions in complex, realistic scenarios. Single-trial classifiers based on brain activity measurements such as electroencephalography (EEG, Kothe and Makeig, 2011; Lotte et al., 2018), functional near-infrared spectroscopy (fNIRS, Putze et al., 2014; Herff et al., 2015), physiological signals (Fairclough et al., 2005; Fairclough, 2008), or eye tracking (Putze et al., 2013) have the potential to classify affective (Koelstra et al., 2010; Heger et al., 2014; Mühl et al., 2014) or cognitive states based upon short segments of data. For this purpose, signal processing and machine learning techniques need to be developed and transferred to real-world user interfaces.

The goal of this Frontiers Research Topic was to advance the State-of-the-Art in signal-based modeling of cognitive processes. We were especially interested in research toward more complex and realistic study designs, for example collecting data in the wild or investigating the interaction between different cognitive processes or signal modalities. Bringing together many contributions in one format allowed us to look at the state of convergence or diversity regarding concepts, methods, and paradigms.

Edited and reviewed by: Tom Verguts, Ghent University, Belgium

\*Correspondence: Felix Putze felix.putze@uni-bremen.de

Received: 25 September 2018 Accepted: 09 October 2018 Published: 30 October 2018

#### Citation:

Putze F, Mühl C, Lotte F, Fairclough S and Herff C (2018) Editorial: Detection and Estimation of Working Memory States and Cognitive Functions Based on Neurophysiological Measures. Front. Hum. Neurosci. 12:440. doi: 10.3389/fnhum.2018.00440

The accepted manuscripts in this research topic cover a large range of aspects of cognition, reflecting the broadness of the field and its many application domains. A dominant challenge in the research topic is the analysis of cognitive workload (or memory load) from neurological signals. This does not come as a surprise because workload is a thoroughly studied construct and workload models can be immediately exploited, e.g., for adaptive human-machine interaction. While all these manuscripts share a joint research interest, they tackle the challenge of workload modeling in different application domains, with different signals, different classification approaches, and different features. Working memory and attentional control represent two recurring themes throughout the collection of papers in this research topic. The most prominent modality in this research topic is EEG, drawing from both spectral and time-domain features. In multiple articles, EEG is complemented by other modalities: two of them use fNIRS as a different mode to capture neural activity (two others use fNIRS as a single modality); two others use eye tracking as a way to capture visual attention and one also uses physiological signals (such as heart rate and breath rate). This shows that researchers today routinely select which (combinations of) signals are most promising for a given task.

### 2. HIGHLIGHTS

In the following paragraphs, we give an overview of the contributed manuscripts and their main contributions to the field. We structure them according to the cognitive processes which are in the respective focus of interest, according to the core themes identified in the previous section.

#### 2.1. Cognitive Workload

Gateau et al. performed a study in which they estimated cognitive workload (induced by a serial memorization task at two difficulty levels) in pilots from prefrontal fNIRS. They showed that binary classification of workload level was feasible both in a flight simulator and in a real aircraft. Additionally, the authors investigated the differences between these two conditions and revealed that workload tends to be higher in the real aircraft. Workload classification performance was nevertheless similar in both conditions. Grissmann et al. investigated how the simultaneous stimulation of multiple cognitive and affective states, namely workload and valence, influenced the respective neural markers in the EEG signal. They found that while correlates of workload (in frontal theta and parietal alpha power) could still be detected, these correlates were affected by negative valence. Moreover, they found no evidence of typical neural correlates of affect. This has implications for many real-world applications for which different concurrent mental states can be expected. In a study with 17 healthy volunteers, Aghajani et al. investigated mental workload classification using EEG and fNIRS. Workload was induced using a letter n-back task and could be classified using support vector machines from EEG, fNIRS, and the combination of both, which outperformed the individual modalities.

The impact of a multimodal approach for the classification of working memory load was also investigated by Liu et al. These authors presented a study in which they discriminated three levels of mental workload, induced by an n-back task, using EEG, fNIRS, and physiological measures, with data combined over modalities performing best. They showed that all three modalities yielded better-than-chance classification. To reduce calibration time for each subject, they demonstrated how results could be improved by learning from other subjects.

Extending offline analysis to a closed-loop system, Walter et al. demonstrate in their study that cognitive workload can be estimated from EEG to automatically adapt the difficulty level in a learning environment. The prediction model was trained on 10 subjects beforehand and no user-specific calibration data were needed, highlighting the feasibility in realistic scenarios. Labonte-Lemoyne et al. investigate a closed-loop system which adapted a game of Tetris to the cognitive load of the player, as measured from EEG and facial expression, with dynamic thresholding. In their user study, they show that adaptation to measured cognitive user states is not a trivial task: neither perceived experience nor objective performance could be improved compared to the control condition.

#### 2.2. Memory Processes

Noh et al. used ERP features from the EEG signal to discriminate between successful and unsuccessful retrieval of memorized items. In their study, they show that the presence or absence of contextual information interacts with the type of information to be encoded. Tian et al. conducted an EEG study to investigate neural markers for the discrimination of low and high performers in a repeated memory task. They found that fronto-central spectral entropy during the retention period of working memory was predictive for intra-session and inter-session differences in performance. Toppi et al. use graph-theoretic connectivity features to discriminate different phases of working memory processing in 60-channel EEG data from a study in which 17 participants performed a Sternberg task. The authors could identify topological differences between the encoding, storage, and retrieval phases and also found a relationship between the neural markers and the participants' behavior.

#### 2.3. Attention

Riccio et al. worked with a P3-BCI and studied how attentional processing differs between patients suffering from amyotrophic lateral sclerosis (ALS) and healthy users, and whether and how this corresponds to differences in P3-BCI performance. The corresponding study investigates attentional processing via a rapid serial visual presentation task. Results revealed that ALS patients have degraded attentional performance, which translates into smaller P3 amplitude and lower P3-BCI classification performance. Using EEG, Baldwin et al. investigated mind wandering during simulated driving. In their study, they found increased alpha activity and reduced magnitude of P3a components to auditory stimuli in periods of mind wandering. As mind wandering during driving is a potential safety threat, this study can have a large impact on future adaptive interfaces. Brouwer et al. presented a study combining EEG and eye tracking to investigate target encoding during visual search. They could show that eye tracking features distinguished better between hits and misses (i.e., targets which are never reported by the participant), while EEG features differentiated better between targets and non-targets. These results highlight how the two modalities complement each other.

#### 2.4. Methods & Other Cognitive Processes

Verdière et al. measure engagement of pilots via fNIRS in a flight simulator. In their study, different levels of engagement are induced via manual vs. automatic landing. The authors propose connectivity-based features and show that these can be used for effective classification. Takayoshi et al. used rewards in a simple number-discrimination task to study motivation and apathy tendency. In their study, they found P2 and P3 ERPs in EEG to be modulated by reward dependence and apathy tendency, which is an important step toward an adaptive interface based on motivation. Touryan et al. used simultaneous visual search and auditory n-back tasks to investigate the neural sources of target detection in the presence of eye movements and concurrent task demands. In their study, the authors used a modified measure-projection approach to separate the neural sources into six regions of interest (ROIs). This enabled efficient target detection from EEG signals and revealed the time course of EEG activity in these ROIs, thus leading to a better understanding of the underlying processes. Nicolae et al. investigate a relatively novel concept by studying depth of processing (on three levels: no processing, shallow processing, or deep processing) from event-related time-domain and frequency-domain features in EEG on a single-trial basis. In their study, they demonstrate the feasibility of this approach in three domains (namely: memory, language, and visual imagination).

### 3. SUMMARY

The presented articles tackle a large variety of cognitive processes, their neurophysiological correlates and their detection. This research shows great progress in the basic understanding of cognition as well as in applied research, especially for the development of adaptive systems. The articles also highlight the need for more research into realistic and ecologically valid application environments of cognitive state estimation. If such states could be reliably estimated in real application scenarios outside the lab, which is not really the case so far, then this could prove very useful for multiple applications such as aeronautics, driving, education or video games, among many others. Future directions of research should also include thorough attempts to define and differentiate the modeled concepts (e.g., how can we discriminate "mental workload" from "memory load"?) and find a consensus on publicly available data sets for joint evaluations. A role model here could be the field of affective computing. Available common data sets such as DEAP (Koelstra et al., 2012), fNIRS-nback (Herff et al., 2014), EEG (So et al., 2017), or EEG combined with fNIRS (Shin et al., 2017) could be a start for such a program.

#### AUTHOR CONTRIBUTIONS

FP coordinated the writing process and proposed the structure of the article. All authors contributed to the writing process.

#### ACKNOWLEDGMENTS

We thank all reviewers involved in the compilation of this exciting research topic. FL received research support from the European Research Council with project BrainConquest (grant ERC-2016-STG-714567).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Putze, Mühl, Lotte, Fairclough and Herff. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# In silico vs. Over the Clouds: On-the-Fly Mental State Estimation of Aircraft Pilots, Using a Functional Near Infrared Spectroscopy Based Passive-BCI

#### Thibault Gateau<sup>1</sup>, Hasan Ayaz<sup>2</sup> and Frédéric Dehais<sup>1</sup>\*

<sup>1</sup> ISAE-SUPAERO, Institut Supérieur de l'Aéronautique et de l'Espace, Université Fédérale de Midi-Pyrénées, Toulouse, France, <sup>2</sup> School of Biomedical Engineering, Science Health Systems, Drexel University, Philadelphia, PA, United States

There is growing interest in implementing tools to monitor cognitive performance in naturalistic work and everyday life settings. The emerging field of research known as neuroergonomics promotes the use of wearable and portable brain monitoring sensors such as functional near infrared spectroscopy (fNIRS) to investigate cortical activity in a variety of human tasks outside the laboratory. The objective of this study was to implement an on-line passive fNIRS-based brain computer interface to discriminate two levels of working memory load during highly ecological aircraft piloting tasks. Twenty-eight recruited pilots were equally split into two groups (flight simulator vs. real aircraft). In both cases, identical approaches and experimental stimuli were used (a serial memorization task consisting of repeating series of pre-recorded air traffic control instructions, easy vs. hard). The results show that pilots in the real flight condition committed more errors and had higher anterior prefrontal cortex activation than pilots in the simulator when completing cognitively demanding tasks. Nevertheless, evaluation of single-trial working memory load classification showed high accuracy (> 76%) across both experimental conditions. The contributions here are two-fold. First, we demonstrate the feasibility of passively monitoring cognitive load in a realistic and complex situation (live piloting of an aircraft). In addition, the differences in performance and brain activity between the two experimental conditions underscore the need for ecologically valid investigations.

Keywords: fNIRS, BCI, working memory, prefrontal cortex, simulated and real flight, neuroergonomics

## 1. INTRODUCTION

Neuroergonomics is an emerging field of interdisciplinary research that promotes the understanding of the brain in complex real-life activities. This approach merges knowledge and methods from cognitive psychology, system engineering, and neuroscience (Parasuraman and Wilson, 2008). Accurate and reliable mental state assessment of human operators during use of complex systems is a prime goal of neuroergonomics that aims to measure the "brain at work" (Parasuraman and Rizzo, 2008). Understanding the underlying neurocognitive processes of such interaction could be used to improve safety and efficiency of the overall human-machine pairing. This could be achieved by (i) the augmentation of human performance and its translation to improved functioning "at work", (ii) informing the design of the complex systems, or (iii) adapting the user interface and task parameters dynamically during use.

#### Edited by:

Fabien Lotte, Institut National de Recherche en Informatique et en Automatique (INRIA), France

#### Reviewed by:

Thorsten O. Zander, Technische Universität Berlin, Germany; Megan K. Strait, University of Texas Rio Grande Valley, Rio Grande City, United States

\*Correspondence: Frédéric Dehais frederic.dehais@isae.fr

Received: 12 October 2017 Accepted: 17 April 2018 Published: 17 May 2018

#### Citation:

Gateau T, Ayaz H and Dehais F (2018) In silico vs. Over the Clouds: On-the-Fly Mental State Estimation of Aircraft Pilots, Using a Functional Near Infrared Spectroscopy Based Passive-BCI. Front. Hum. Neurosci. 12:187. doi: 10.3389/fnhum.2018.00187

Aviation operations constitute an ideal paradigm to implement this approach. Pilots deal with an uncertain environment and face complex interaction with the flightdeck (Causse et al., 2013; Çakır et al., 2016; Reynal et al., 2016). For instance, several studies have emphasized that pilots' working memory (WM) abilities are heavily recruited to handle flightpath, to monitor the flight parameters, and to maintain an up-to-date situation awareness (Causse et al., 2011a,b). WM is also an important component when following air traffic control (ATC) instructions (Morrow et al., 1993). This activity indeed requires mentally storing flight parameters (e.g., heading, altitude, speed) to follow the adequate flight path. However, it is well-known that human working memory is fundamentally limited (Baddeley, 1992; Miller, 1994) and easily overwhelmed when task demand is excessive (Durantin et al., 2014a). Human factor studies emphasized that a variety of environmental stressors may negatively impact pilots' ability to execute ATC clearances (Billings and Cheaney, 1981; Taylor et al., 1994, 2005; Scerbo et al., 2003; Risser et al., 2006; Rome et al., 2012; Dehais et al., 2017). Thus, the implementation of monitoring technology in the cockpit to infer a state of cognitive limitation could represent a promising approach to enhance flight safety (Roy et al., 2017; Verdière et al., 2018).

Indeed, the development of brain computer interface (BCI) technology provides interesting prospects to continuously monitor and take advantage of the brain dynamics and the neural mechanisms underlying cognition. Among the three categories of BCIs (active, reactive, and passive) (Zander and Kothe, 2011; Vecchiato et al., 2016), the first two types are aimed at transforming cerebral activity into messages or commands to voluntarily control distant apparatus (e.g., mouse cursor). Passive BCIs (pBCI) are of particular interest for neuroergonomic applications (Cutrell and Tan, 2008; Frey et al., 2017; Gramann et al., 2017). They make it possible to interpret unlabeled brain activity during a task to derive various mental states (Blankertz et al., 2010; Roy et al., 2013; Van Erp et al., 2015; Zander et al., 2017). These mental state-inference systems offer a unique insight into the development of human-system interactions to overcome cognitive limitations (Zander and Kothe, 2011; Brouwer et al., 2013). While several pBCIs have been successfully implemented in driving (Dijksterhuis et al., 2013) and flight simulators (Gateau et al., 2015; Aricò et al., 2016; Çakır et al., 2016; Callan et al., 2016; Verdière et al., 2018), very few studies have attempted to test these adaptive systems under realistic settings (Callan et al., 2015).

Electroencephalography (EEG) is by far the most popular technique (George and Lécuyer, 2010; Borghini et al., 2017) in the BCI community as it has excellent qualities for monitoring cognitive states (Brouwer et al., 2012; Roy et al., 2013), including superior temporal resolution, and has been used to monitor working memory (Roy et al., 2013; Mühl et al., 2014). However, the localization of sources from the EEG signal requires higher-density recordings and additional computation to solve the inverse problem, which may not be amenable to critical operational situations such as flying real aircraft. In addition, setup time and susceptibility to motion artifacts should be considered for minimally intrusive deployment. Thus, the use of functional near infrared spectroscopy (fNIRS) has been gaining popularity recently as the sensors have been miniaturized and have become portable and wireless (Ayaz et al., 2013; Strait et al., 2014; Naseer and Hong, 2015; Schudlo and Chau, 2015).

This brain activity monitoring technique uses the near-infrared light absorption properties of hemoglobin to estimate local variations of cortical oxygenation (Villringer and Obrig, 2002; Ayaz et al., 2012). fNIRS has been successfully used to detect working memory solicitation (Li et al., 2005; Schreppel et al., 2008; Hirshfield et al., 2011; Gagnon et al., 2012; Herff et al., 2014; McKendrick et al., 2014; León-Domínguez et al., 2015; Unni et al., 2017). Despite its relatively low temporal resolution, fNIRS offers several advantages compared to more established traditional tools (Kikukawa et al., 2008; Piper et al., 2014; McKendrick et al., 2015; Davranche et al., 2016), such as a relatively high spatial resolution (around 1 cm<sup>2</sup> depending on the sensor geometry) and a high signal-to-noise ratio, as the sensors are relatively more robust to motion artifacts (Huppert et al., 2009), eye-blinks and facial muscle activity (Izzetoglu et al., 2004). It is also possible to run experiments with active and mobile subjects and even outdoors (Piper et al., 2014; McKendrick et al., 2016). Specifically, it is less sensitive than EEG to the noisy electromagnetic environment in the aircraft (radio transmission, radio-navigation beacons, GPS antenna, etc.), making it a candidate to measure pilots' brain activity during real flight. As fNIRS is an emerging neuroimaging technique, we believe that it is important to investigate its capabilities and its utility in future applications.
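The absorption principle described above is commonly formalized with the modified Beer-Lambert law. The following is a generic textbook sketch and not necessarily the processing applied in this study; the symbols (baseline intensity $I_0$, extinction coefficients $\varepsilon$, source-detector distance $d$, differential pathlength factor $\mathrm{DPF}$) are introduced here for illustration only. The change in optical density at wavelength $\lambda$ is modeled as

$$
\Delta OD(\lambda) = -\log\frac{I(\lambda, t)}{I_0(\lambda)} = \left[\varepsilon_{\mathrm{HbO}}(\lambda)\,\Delta c_{\mathrm{HbO}} + \varepsilon_{\mathrm{HbR}}(\lambda)\,\Delta c_{\mathrm{HbR}}\right] d\,\mathrm{DPF}(\lambda),
$$

where the base of the logarithm has to match the convention used for the extinction coefficients. Measuring at two wavelengths $\lambda_1$ and $\lambda_2$ gives a 2 × 2 linear system that can be solved for the concentration changes of oxygenated and deoxygenated hemoglobin:

$$
\begin{pmatrix}\Delta c_{\mathrm{HbO}}\\ \Delta c_{\mathrm{HbR}}\end{pmatrix}
= \frac{1}{d}
\begin{pmatrix}
\varepsilon_{\mathrm{HbO}}(\lambda_1) & \varepsilon_{\mathrm{HbR}}(\lambda_1)\\
\varepsilon_{\mathrm{HbO}}(\lambda_2) & \varepsilon_{\mathrm{HbR}}(\lambda_2)
\end{pmatrix}^{-1}
\begin{pmatrix}
\Delta OD(\lambda_1)/\mathrm{DPF}(\lambda_1)\\
\Delta OD(\lambda_2)/\mathrm{DPF}(\lambda_2)
\end{pmatrix}.
$$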

The present study aims to develop an on-line fNIRS-based pBCI for the assessment of the working memory of aircraft pilots during real flight. Earlier studies demonstrate that fNIRS-based BCIs have been implemented. They rely on oxygenation changes in the prefrontal cortex (PFC) and can be used for measuring WM load (Schreppel et al., 2008; Ayaz et al., 2012; Gagnon et al., 2012; Durantin et al., 2014a,b). Here, a pilot-ATC interaction task was designed with two contrasted levels of WM load. A Support Vector Machine (SVM) based classifier performing on-line single trial discrimination of the WM load level was implemented. This classifier was first tested in a high fidelity flight simulator. The same machine learning approach was then used to assess the WM load level in an actual flight condition. To the authors' knowledge, this is the first study to monitor pilots' brain activity on-line under such operational settings and with such ecological validity. We also compared pilots' WM performance and related PFC activity in the high fidelity simulator and in real flight conditions. The objective was to determine whether these two conditions (simulated and real operational settings) were equivalent in terms of task demand (Dahlstrom and Nahlinder, 2009; Batula et al., 2017). As most aviation psychology experiments and pilot training sessions are conducted in flight simulators, such an assessment is critical for the future design and development of such approaches (Philip et al., 2005).

### 2. MATERIALS AND METHODS

#### 2.1. Passive BCI in Flight Simulator

#### 2.1.1. Participants

Fourteen visual flight rules (VFR) pilots (three women; mean group age: 29.25 ± 6.92 years; mean flight hours: 80 ± 50) completed the experiment. Pilots had normal or corrected-to-normal vision, normal hearing, and no psychiatric disorders. They all had medical clearance to fly. After providing written informed consent, they were instructed to complete task training. The data from two participants were rejected, one because of a high level of fatigue and the other because of a data collection issue. The typical total duration of a subject's session (informed consent approval, practice task, and real task) was about two hours. This work was approved by the Institutional Review Board (IRB) of the Inserm Committee of Ethics Evaluation (CEEI: Comité d'Evaluation Ethique de l'Inserm IRB00003888). The methods were carried out in accordance with approved guidelines and participants gave written informed consent approved by the IRB of CEEI.

#### 2.1.2. Neurophysiological Measurements: fNIRS

During this experiment, we recorded hemodynamics of the prefrontal cortex using the functional near-infrared spectrometer fNIR Device Model 100B (Biopac®) equipped with 16 optodes (**Figure 1**). On this continuous-wave system, the optode separation was about 25 mm and two wavelengths were used, 730 and 850 nm. The DPF (differential pathlength factor) value was 5.97, which is within the range commonly used in the literature (Kato et al., 1993; Luo et al., 2002). Four regions of interest (ROI) were defined to allow for explorative statistical comparisons with the data collected during the real flight experiment (see section 3).

Each optode of the device records hemodynamics at a frequency of 2 Hz in terms of oxygenation level variations relative to an initial baseline recorded prior to the experiment. Changes in the concentrations of oxygenated (Δ[HbO2]) and deoxygenated hemoglobin (Δ[hHb]) relative to the baseline can be calculated from changes in detected light intensity using the modified Beer-Lambert Law (Delpy et al., 1988). Cognitive Optical Brain Imaging (COBI) Studio® software (Ayaz and Onaral, 2005; Ayaz et al., 2011) was used to collect data. The data stream was available on-line from a TCP/IP interface. Before recording, the signals of each optode were carefully checked for saturation with COBI Studio, which provides a visual representation of signal quality, and the headband was adjusted on the participant's forehead accordingly. After this check, a baseline was established by letting the participant rest for 10 s.
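The conversion from light intensity to hemoglobin concentration changes mentioned above can be illustrated with a minimal sketch of the modified Beer-Lambert Law for two wavelengths. This is not the computation performed by COBI Studio, which is not described here. The function names are hypothetical, the optical density is assumed to use a base-10 logarithm, the optical path is assumed to be the optode separation multiplied by the DPF (2.5 cm × 5.97), and the extinction coefficients must be supplied by the caller from a published table.

```python
import math


def delta_od(i_baseline, i_current):
    """Change in optical density relative to the baseline intensity (base-10 log assumed)."""
    return -math.log10(i_current / i_baseline)


def mbll_two_wavelengths(d_od, eps, path_cm):
    """Solve the 2x2 modified Beer-Lambert system for (delta HbO2, delta hHb).

    d_od    : (dOD at wavelength 1, dOD at wavelength 2)
    eps     : ((eps_HbO2, eps_hHb) at wavelength 1, (eps_HbO2, eps_hHb) at wavelength 2),
              caller-supplied extinction coefficients (none are hard-coded here)
    path_cm : effective optical path, assumed to be separation * DPF
    """
    (a, b), (c, d) = eps
    det = (a * d - b * c) * path_cm
    if det == 0:
        raise ValueError("extinction coefficient matrix is singular")
    d_hbo2 = (d * d_od[0] - b * d_od[1]) / det
    d_hhb = (a * d_od[1] - c * d_od[0]) / det
    return d_hbo2, d_hhb


# Effective path for the settings reported in the text: 25 mm separation, DPF = 5.97.
PATH_CM = 2.5 * 5.97
```

The returned values are expressed in whatever concentration unit the supplied extinction coefficients imply.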

#### 2.1.3. Experimental Environment: Flight Simulator

We used the ISAE-SUPAERO (Institut Supérieur de l'Aéronautique et de l'Espace - French Aeronautical University in Toulouse, France) flight simulator to conduct the experiment in an ecological situation. Its user interface is composed of a Primary Flight Display, a Navigation Display, and an Electronic Central Aircraft Monitoring Display (**Figure 2**).

#### 2.1.4. Task Description

This protocol was adapted from a previous study (Gateau et al., 2015). As in real flight operations, pilots heard ATC instructions (pre-recorded for this experiment) to vector them and were asked to read back the instructions. Their answers were recorded for off-line behavioral analysis. The ATC messages were delivered at 78 dB through a Sennheiser® headset. Two levels of difficulty were defined based on the flight parameters that the participant had to read back during the experiment:


The task consisted of 10 repetitions of each difficulty for a total of 20 trials. The task difficulty order was randomly distributed with two constraints:


Each ATC message started with the airplane call sign (i.e., "Supaero 32"), immediately followed by a sequence of flight parameters, and ended with the message "over" (**Figure 3**). Thereafter, pilots had an 18 s response window to repeat the instruction. A practice session was conducted prior to the experiment runs to familiarize the participants with the experiment protocol and the interface.

During the experiment, the experimenter scored the volunteer's ability to read back each message so as to compute the total number of correct responses in the low and high load conditions.

#### 2.1.5. Experimental Time Course

For machine learning purposes, the experiment was split into three successive phases (**Figure 4**):


FIGURE 1 | fNIR Device® Model 1200S headband and associated optode numbering. Only the four detectors closest to an emitter constituted optodes. Optodes are represented in red with their associated number. Four regions of interest (ROI) were defined for statistical analysis purposes (#1, #2, #3, #4).

FIGURE 2 | (Left) ISAE-Supaero flight simulator; the closed cabin is visible from the rear, and eight screens are used to visualize the external environment. (Right) The pilot subject with the fNIRS headband.

was to discriminate the difficulty of the trial. After the response window of each trial, the classifier returned a WM load estimate for that trial.

Note that the transition (phase B) from phase A to phase C was not perceptible to the participants.

#### 2.1.6. MACD Filter

Raw fNIRS data were filtered in real time using a MACD (Moving Average Convergence Divergence) filter, commonly used in economic market analysis (Appel, 2005). This filter, based on the difference between a short-term EMA (Exponential Moving Average) and a long-term EMA, implements a second order band-pass filter that eliminates low-frequency (<0.02 Hz) and high-frequency (>0.33 Hz) components from the raw fNIRS signal (Utsugi et al., 2007). This low order filter has a quasi-linear phase in its bandwidth and is particularly suited to real-time applications. For the experiment, we filtered Δ[HbO2] and Δ[hHb] on-line on the 16 optodes.

N represents the number of time points defining the EMA window:

$$\mathbf{y} = EMA_N(\mathbf{x}) \Leftrightarrow \mathbf{y}_n = \frac{2}{N+1}\mathbf{x}_n + \frac{N-1}{N+1}\mathbf{y}_{n-1} \tag{1}$$

$$MACD_{N_{\text{short}}, N_{\text{long}}}(\mathbf{x}) = EMA_{N_{\text{short}}}(\mathbf{x}) - EMA_{N_{\text{long}}}(\mathbf{x}) \tag{2}$$

We chose a 6 s short-term EMA and a 13 s long-term EMA according to previous work (Durantin et al., 2014b) for MACD filtering, to get the desired bandwidth.
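The EMA recursion and the MACD difference can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the window length in samples is assumed to be the window duration multiplied by the sampling rate (12 and 26 samples at 2 Hz), and the recursion is assumed to start from the first sample.

```python
def ema(x, n):
    """Exponential moving average: y_n = 2/(N+1) * x_n + (N-1)/(N+1) * y_{n-1}."""
    if n < 1 or len(x) == 0:
        raise ValueError("need a positive window and a non-empty signal")
    alpha = 2.0 / (n + 1)
    y = [float(x[0])]  # assumed initialization: y_0 = x_0
    for sample in x[1:]:
        # (N-1)/(N+1) equals 1 - alpha
        y.append(alpha * sample + (1.0 - alpha) * y[-1])
    return y


def macd(x, fs_hz=2.0, short_s=6.0, long_s=13.0):
    """Short-term EMA minus long-term EMA, applied to one channel."""
    n_short = int(round(short_s * fs_hz))  # 12 samples at 2 Hz (assumed conversion)
    n_long = int(round(long_s * fs_hz))    # 26 samples at 2 Hz (assumed conversion)
    short = ema(x, n_short)
    long_ = ema(x, n_long)
    return [s - l for s, l in zip(short, long_)]
```

Because both averages converge to the same value on a constant input, the difference removes the slowly varying offset of the signal, while the averaging itself attenuates fast fluctuations. For on-line use, the same recursion can be applied sample by sample by keeping only the two previous EMA values per channel.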

#### 2.1.7. Single Trial SVM-Based WM Load Estimation

The goal of the classification was to discriminate on-line whether the last trial was a high WM load trial or a low WM load trial. For each pilot, we used the first 10 trials to train the pilot's classifier (phases A and B, see section 2.1.5). From trial 11 to 20, we used the pilot's classifier to discriminate trial difficulty, without any further training. An accuracy score (number of correct predictions divided by the total number of predictions) of the pilot's classifier was provided at the end of the experimental session.

The filtered Δ[HbO2] and Δ[hHb] signals of the sixteen optodes were segmented into trials, in real time, according to the task synchronization module (**Figure 5**). Each trial starts when an ATC message is played, and lasts 30 s. All data points of a trial (two different inputs per optode, 16 optodes, 30 s of data with a 2 Hz sampling, corresponding to 1920 features) were considered as the input of the machine learning process. A 30 s sliding window was chosen to be consistent with the literature regarding inter-subject variability (Jasdzewski et al., 2003; Sato et al., 2005). Note that the transition from the "Response" phase to the "Rest" one was unnoticeable, as it was anticipated that participants started to rest as soon as they completed the instruction.
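As an illustration of how the feature count arises, the sketch below cuts one 30 s trial out of the filtered channels and concatenates it into a single vector. The function name and the ordering of the features (all oxygenated channels first, then all deoxygenated channels) are assumptions made for the example; the text only specifies which data points are included.

```python
def trial_feature_vector(hbo2, hhb, onset_idx, fs_hz=2, trial_s=30):
    """Concatenate one trial of filtered data into a flat feature vector.

    hbo2, hhb : lists of per-optode sample lists (filtered signals)
    onset_idx : sample index at which the ATC message starts
    """
    n = int(round(fs_hz * trial_s))  # samples per channel in one trial
    features = []
    for channel in list(hbo2) + list(hhb):  # assumed ordering of the inputs
        segment = channel[onset_idx:onset_idx + n]
        if len(segment) != n:
            raise ValueError("trial window extends past the available data")
        features.extend(segment)
    return features
```

With 16 optodes sampled at 2 Hz this yields 2 × 16 × 60 = 1920 values per trial; with 4 optodes sampled at 4 Hz the same function yields 2 × 4 × 120 = 960 values.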

As our number of features was large compared to the training sample, we used a linear Support Vector Machine (SVM) (Cortes and Vapnik, 1995). The principle of the SVM is to find the separating hyperplane that maximizes the distance between the hyperplane and the closest training points in each class. To avoid over-fitting, we chose to customize the SVM regularization parameter for each pilot's classifier. In a linear SVM, the regularization parameter C controls the trade-off between errors of the SVM on training data and margin maximization. During the training process of each participant, the parameter C was varied over a large range of values (from 10<sup>−3</sup> to 10<sup>4</sup>) in multiplicative steps of 10. A five-fold cross-validation on the first 10 trials was run with scikit-learn (Pedregosa et al., 2011) packages (sklearn.svm and sklearn.cross\_validation) to select the C parameter with the highest performance in terms of accuracy. The classifier training (phase B) was performed as soon as the data of the first 10 trials were available for online purposes (Aricò et al., 2016).
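The selection of C described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses `sklearn.model_selection`, which replaced the `sklearn.cross_validation` module in later scikit-learn releases, the function name `select_c` is hypothetical, the folds are assumed to be stratified with five training trials per class, and ties between values of C are resolved in favor of the smallest one.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def select_c(X_train, y_train, n_folds=5):
    """Pick C for a linear SVM by cross-validated accuracy on the training trials."""
    c_grid = [10.0 ** k for k in range(-3, 5)]  # 1e-3, 1e-2, ..., 1e4
    # Stratified folds need at least n_folds trials per class (assumed: 5 low, 5 high).
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    best_c, best_score = None, -1.0
    for c in c_grid:
        clf = SVC(kernel="linear", C=c)
        score = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy").mean()
        if score > best_score:  # strict comparison keeps the smallest C on ties
            best_c, best_score = c, score
    return best_c, best_score


def train_classifier(X_train, y_train):
    """Fit the final linear SVM on all training trials with the selected C."""
    best_c, _ = select_c(X_train, y_train)
    return SVC(kernel="linear", C=best_c).fit(X_train, y_train)
```

Here `X_train` would hold one row per training trial (10 rows) and `y_train` the corresponding load labels; the fitted classifier's `predict` method can then be called on each new trial's feature vector.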

#### 2.1.8. Experimental Components' Architecture

We implemented a WM load estimator that integrated different components (**Figure 5**):


Task monitoring, data acquisition, and computation were conducted on the same computer (core i5-3210M, 2.50 GHz, 4 GB RAM). During the experiment, the classifier training (phase B) was short (800 ms) and remained unnoticeable for the participant. The classifier testing phase lasted 10 ms and was also unnoticeable for the participant (**Figure 6**).

#### 2.2. Passive BCI in Real Flight

#### 2.2.1. Participants

Fourteen VFR pilots (one woman; mean group age: 23.07 ± 5.35 years; mean flight hours: 44.07 ± 37.52) completed the experiment. Please note that these volunteers were different from the ones who participated in the flight simulator experiment. The data from two participants were rejected due to light saturation issues and a device synchronization issue. After providing informed consent, they were instructed to complete task training on the ground. None of the recruited subjects had a neurological or psychiatric history or was on medication. Each of them gave written informed consent for the experiment. The experimental protocol was approved by the committee of the European Aviation Safety Agency (EASA permit to fly approval number: 60049235). The methods were carried out in accordance with approved guidelines and participants gave written informed consent approved by the EASA.

#### 2.2.2. Neurophysiological Measurements: Mini-fNIRS

We used the miniaturized and wireless fNIR Device Model 1200W (Biopac®) portable system (Ayaz et al., 2013) to record the hemodynamics of the pilots' prefrontal cortex (**Figure 7**). This device was chosen because it was wireless (i.e., the pilot's head was not attached to any cables) and, unlike the Model 1200S, did not require an external power supply. This was a prerequisite to facilitate its implementation and use in the aircraft for our experiment. This device had the same hardware design, and exactly the same LED light source components and detectors, as the fNIRS Model 1200S used in the flight simulator. Consistent with the previous device, on this continuous-wave system, the optode separation was about 25 mm and two wavelengths were used, 730 and 850 nm. The DPF value was 5.97. This four-optode device records hemodynamics at a frequency of 4 Hz in terms of oxygenation level variations relative to a baseline, in the same way as the 1200S desktop version. With a flexible circuit board and separation-adjustable split pads, the sensors were positioned to monitor brain areas similar to the ROIs extracted from the 1200S sensor. Changes in the concentrations of oxygenated (Δ[HbO2]) and deoxygenated hemoglobin (Δ[hHb]) can be calculated from changes in detected light intensity using the modified Beer-Lambert Law (Delpy et al., 1988).

Cognitive Optical Brain Imaging (COBI) Studio® software (Ayaz and Onaral, 2005; Ayaz et al., 2011) was used to collect data. The data stream was available on-line from a TCP/IP interface. Before recording, the signals of each optode were carefully checked for saturation with COBI Studio, which provided a visual representation of signal quality. An aluminum foil attached to a dark ski band and a cap were placed over the mini-fNIRS to shield it against infrared light from ambient sunlight.

Data were MACD-filtered and we used a similar on-line Experimental Components' Architecture, with the exception that we used a real plane instead of the flight simulator.

#### 2.2.3. Experimental Environment: DR400 Aircraft

The ISAE Supaero DR400 light aircraft was used for the purpose of the experiment (**Figure 8**). It was powered by a 180HP Lycoming engine and was equipped with classical gauges, radio and radio navigation equipment, and actuators such as rudder, stick, thrust, and switches to control the flight. The participant was placed in the left seat and was equipped with the mini-fNIRS system. The participant wore a Clarity Aloft® headset that was used to deliver task-related auditory stimuli from a PC via an audio cable. The participant could still communicate with the other crew members and real ATC while receiving the auditory stimuli (emulated ATC). The safety pilot was an ISAE flight instructor. He was seated on the right and had the authority to stop the experiment and take over control of the aircraft for any safety reason. The backseater was the experimenter: his role was to set up the sensor, to trigger the experimental scenario and to supervise data collection.

#### 2.2.4. Task Description

The experimental task and audio messages were similar to the previous protocol (see section 2.1), with the same experimental time course and the same instructions for the participant. A practice session on the ground was conducted prior to the experiment runs to familiarize the participants with the experiment protocol and the interface. After training was completed on the ground, the mini-fNIRS system was placed over the participant's forehead. The participant then took off from Lasbordes (LFCL, Toulouse, France) airfield and began a local flight. The experimental task per se started when the pilot left the Lasbordes traffic pattern and was stabilized at an altitude of 2500 feet. The participant was asked to fly as straight and stable as possible and to only perform slight avoidance maneuvers as necessary. Once the aircraft was stabilized, the baseline of ten seconds was recorded. After the completion of the WM task, the participant headed back to land at Lasbordes airfield. The total flight lasted one hour including the WM task.

As in the simulated condition, the backseater scored the volunteer's ability to read back each message in order to compute the total number of correct responses in the low and high load conditions. These data allowed us to compare WM performance across the conditions (i.e., low vs. high; simulated vs. real flight).

#### 2.2.5. Experimental Components' Architecture and WM Load Estimation

We implemented a similar WM load estimator in the airplane as in the flight simulator. Machine learning inputs were slightly adjusted to fit the data flow available with the mini-fNIRS wireless portable device. The filtered Δ[HbO2] and Δ[hHb] signals of the four (instead of 16) available optodes were segmented into trials, in real time, according to the task synchronization module (see section 2.1.8). Each trial starts when an ATC message is played, and lasts 30 s. All data points of a trial (two different inputs per optode, i.e., Δ[HbO2] and Δ[hHb], four optodes, 30 s of data with a 4 Hz sampling, corresponding to 960 features) were considered as the input of the machine learning process.

#### 2.3. Statistical Analyses

Off-line statistical analyses were performed with the "R" software (R Core Team, 2013) and the "EzANOVA" package (Anderson, 2001) to compare WM performance and prefrontal cortex activations in the flight simulator and in the real flight conditions during the 20 trials. Two-tailed unpaired t-tests were performed to compare the WM performance in the high and low load conditions across the two flight conditions (simulator and real flight). As the number of optodes was not equivalent between the two fNIRS devices (16 vs. 4), we defined four regions of interest (ROIs) for the fNIR100 device that was used in the simulator condition to allow for explorative comparisons with the real flight condition. ROI1, ROI2, ROI3, and ROI4 were derived respectively from the spatial averaging of optodes 1 to 4, 5 to 8, 9 to 12, and 13 to 16 (see **Figure 1**). The mean frontal Δ[HbO2] peak response and the mean frontal Δ[hHb] peak response (peak value within 30 s post-trial onset minus the 2 s average pre-trial onset) over the four ROIs of the PFC were computed for each trial and each pilot using the MACD-filtered data in both flight conditions (i.e., simulator and real flight). A multivariate analysis of variance for repeated measures (MANOVA) was conducted over the mean Δ[HbO2] data with the between-subject factor flight condition (simulator vs. real flight) and the within-subject factors WM load (High vs. Low) and ROI (#1, #2, #3, and #4; see **Figure 1**). A similar MANOVA over the mean Δ[hHb] data was then conducted. We then ran a two-tailed unpaired t-test to compare the classification accuracy in the two experimental conditions. Tukey's Honestly Significant Difference (HSD) test was used for all post-hoc comparisons.

FIGURE 8 | (Left) ISAE Supaero DR400 plane used for the experiment. (Right) The pilot subject with the mini-fNIRS headband (on the left) and the safety pilot (on the right).
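The ROI averaging and the peak-response measure used in the statistical analyses of section 2.3 can be sketched as below. The function names are hypothetical, and the sketch assumes that "peak" means the maximum of the filtered signal in the 30 s window; the text does not state whether a minimum is used for deoxygenated hemoglobin.

```python
def roi_average(optode_signals, first, last):
    """Sample-wise mean of optodes first..last (1-based, inclusive), e.g., 1 to 4 for ROI1."""
    selected = optode_signals[first - 1:last]
    return [sum(samples) / len(selected) for samples in zip(*selected)]


def peak_response(signal, onset_idx, fs_hz, post_s=30.0, pre_s=2.0):
    """Peak within post_s after trial onset minus the mean of the pre_s before onset."""
    n_pre = int(round(pre_s * fs_hz))
    n_post = int(round(post_s * fs_hz))
    if onset_idx < n_pre or onset_idx + n_post > len(signal):
        raise ValueError("trial window does not fit inside the signal")
    baseline = sum(signal[onset_idx - n_pre:onset_idx]) / n_pre
    peak = max(signal[onset_idx:onset_idx + n_post])  # assumed: peak is the maximum
    return peak - baseline
```

For the simulator recordings, each of the four ROI signals would be obtained with `roi_average` before calling `peak_response` with `fs_hz=2`; for the four-optode device, `peak_response` would be applied to each optode with `fs_hz=4`.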

### 3. RESULTS

### 3.1. Real Flight vs. Flight Simulator: Off-Line Behavioral and Neurophysiological Analyses

Participants committed on average 5.33 errors (SD = 1.95) in the WM task in the simulator condition and on average 8.25 errors (SD = 2.42) in the real flight condition, all occurring during the high load trials (see **Figure 9**). As no error was committed in the low WM load condition, we restricted the statistical comparison of the effect of the flight conditions on WM performance to the high load condition. An unpaired t-test revealed that the real flight condition led to a significantly higher number of errors on the WM task in the high load condition (p < 0.001, Cohen's d = 1.34). The MANOVA over the mean Δ[HbO2] data disclosed a significant WM load × Flight condition × ROI interaction [F(3,66) = 3.36; p = 0.039; see **Figure 10**]. Post-hoc analyses revealed that high load trials performed in the real flight condition led to higher Δ[HbO2] in ROI #2 than their counterparts performed in the simulator (p = 0.0001). The MANOVA over the mean Δ[hHb] data did not disclose any significant WM load × Flight condition × ROI interaction [F(3,66) = 0.69; p = 0.56].

### 3.2. Single Trial SVM-Based WM Load Estimation Results

#### 3.2.1. Simulator

During the testing phase, a mean of 76.66% (SD = 16.14%) of the trials were accurately classified (discriminated on-line into low WM load trials and high WM load trials). We obtained an 85.60% mean precision (SD = 19.36%) and a 73.33% mean recall (SD = 24.62%). Individual classifiers' accuracies are shown in **Table 1**.
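For reference, the three scores reported per classifier can be computed from the true and predicted labels of the test trials as sketched below. The function name is hypothetical, and the sketch assumes that the high WM load class is treated as the positive class, which the text does not state.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Precision and recall are undefined when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall
```

The group-level figures are then the mean and standard deviation of these per-pilot scores.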

#### 3.2.2. Real Flight

During the testing phase, a mean of 78.33% (SD = 11.93%) of the trials were accurately classified (discriminated on-line into low WM load trials and high WM load trials). We obtained an 84.14% mean precision (SD = 18.56%) and a 76.67% mean recall (SD = 22.29%). Individual classifiers' accuracies are shown in **Table 2**.

#### 3.2.3. Real Flight vs. Flight Simulator: Statistical Analysis

A t-test disclosed no statistical difference in classification accuracy between the two experimental conditions (p = 0.67, Cohen's d = 0.17).

### 4. DISCUSSION

The motivation of this study was to develop on-line tools to monitor pilots' cognitive performance under realistic settings. We followed a two-step methodological approach as we first implemented and tested an inference system in a flight simulator and then in a real aircraft. We designed a task known to elicit WM (Durantin et al., 2015; Gateau et al., 2015) as this executive function is particularly engaged when operating aircraft (Causse et al., 2011a,b).

#### 4.1. Summary of Findings

The behavioral results confirmed that these two levels of WM load were well contrasted, as the participants exhibited lower performance during the higher difficulty level. This result is in line with previous studies (Taylor et al., 2005; Durantin et al., 2015) and experiments (Gateau et al., 2015) showing that pilots' WM performance declines when four different ATC instructions have to be read back. Moreover, this drop in performance was more pronounced for the participants under actual flight conditions. Consistent with this finding, the real flight condition yielded higher PFC activation than the simulated one only when the pilots had to execute the difficult WM load task. Taken together, these findings suggest that the mental demand was higher when operating the actual aircraft, as the participants had not only to perform the WM task but also to monitor the flight path, the aircraft status and the airspace in a much more careful fashion than in the simulated condition.

Whereas this multitasking aspect of the real flight was not detrimental from a behavioral and neurophysiological point of view during the low WM load trials, it became critical during the high WM load ones. One could suspect a prioritization issue leading the pilots to focus more on flying the aircraft, thus leaving few resources available to face the demand of the high WM load stimuli. This could be one explanation for the higher levels of activation observed in the fNIRS measurements, which would reflect the higher load of concurrent cognitive tasks induced by the real flying task compared to the simulated one. Unfortunately, our aircraft was not equipped with a flight data recorder, preventing us from analyzing flight performance and investigating these prioritization and multi-tasking issues. Despite this limitation, our study is consistent with Dahlstrom and Nahlinder (2009), who found evidence of higher cardiac activity when flying under realistic settings than in a flight simulator. These results raise the question of the ecological validity of simulators. Their use is of undeniable interest (e.g., understanding cognitive performance, training pilots, assessing cockpit design) and they present several advantages in terms of cost and reproducibility. However, our findings and others (Philip et al., 2005; Dahlstrom and Nahlinder, 2009) suggest that simulators may need to be calibrated against real flying conditions to be more engaging.

1 - left to 4 - right) across WM loads and flight conditions.

TABLE 2 | Real flight experiment: machine learning results.

Several field studies have demonstrated the potential of fNIRS to measure cortical activity while walking outdoors (McKendrick et al., 2016), facing a prolonged stay at high altitude (Davranche et al., 2016), riding bikes (Piper et al., 2014) and motorcycles (Kawashima et al., 2014), and flying helicopters (Kikukawa et al., 2008). Our study was conducted in accordance with the recent neuroergonomics approach of measuring brain activity out of the laboratory. Indeed, beyond the offline analyses, we used machine learning techniques to perform single trial discrimination of the low WM load versus high WM load trials. The results of the classification process were displayed in a terminal to the experimenter after each trial: once the data of the trial were available, the SVM discrimination process never required more than 10 ms to provide its result. The mean accuracy in classifying low vs. high WM trials in the two experimental conditions exceeded the threshold of 70%, defined as a sufficient rate for pBCI (Kubler et al., 2006; Tai and Chau, 2009). These results compare well to the rare on-line studies such as the ones conducted by Naseer et al. (2014) (14 participants: 82.14% accuracy), Girouard et al. (2013) (9 participants: 83.5% accuracy), and Schudlo and Chau (2014) (10 participants: 77.4% accuracy). However, these and other (Kanoh et al., 2009; Hu et al., 2012; Power et al., 2012; Robinson et al., 2016) fNIRS-based BCIs were not implemented under realistic settings and describe experiments in controlled lab settings.

#### 4.2. Limitations and Avenues for Future Research

Despite the promising results presented in this paper for the development of fNIRS-based pBCI in ecologically valid environments, one could argue that the translation of the fNIRS-based pBCI from a real cockpit to day-to-day flight operations might not be straightforward. First, adding machine learning and this on-line classifier approach to standard aviation procedures remains a challenge, as the reliability of the classifier does not meet aviation certification criteria (10<sup>−3</sup> allowable failure probability). One approach to overcome this reliability problem would be to integrate complementary measurements such as EEG, which could significantly enhance classification performance when combined with fNIRS, as suggested by Khan et al. (2014).

Also, the accuracy score per subject must be interpreted with caution. With two classes and five testing trials per class (a number chosen to fit experimental constraints), classification performance should be higher than 75% to be statistically significant (p < 0.05) (Müller-Putz et al., 2008; Combrisson and Jerbi, 2015). Considering both groups in this study, 17 of 24 subjects were already above this threshold with our online classifier. Further improvements in machine learning methodologies would be needed to optimize classifier performance.

Secondly, the availability of the information about the WM level estimation is a key concern. One criterion for evaluating an on-line inference system is the delay of single trial classification. In our study, the diagnosis of the WM load was available less than 1.01 s after each pilot's response window. It could allow, for instance, automatic feedback to ATC that the pilot is currently facing a high workload situation and may have misinterpreted the last communication. This timing was comparable with the latency of other on-line fNIRS-based BCIs (for a review of on-line fNIRS-based BCI latency, please see Strait et al., 2014). However, solutions have to be explored to speed up response detection on the fNIRS signal, which could drastically reduce the latency in detecting a change in mental state (Cui et al., 2010; Hong and Naseer, 2016). Thirdly, our study was limited to monitoring WM load in a binary and discrete fashion. Further studies have to be conducted to continuously discriminate a gradient of WM levels from underload to overload (Unni et al., 2017). Finally, lingering issues remain regarding the effect of accelerations and headband motion on the fNIRS signal (Mackey et al., 2013). In other scenarios, accelerometer data with special processing could be used to eliminate any systemic effect of blood pooling.

Also, one should consider that fNIRS based pBCI could be first used for civilian application as highly automated modern aircraft prevents pilots from exceeding 1g maneuvers for passenger comfort and to avoid going against the flight envelope protection. Despite these limits, one can propose a progressive framework for the introduction of fNIRS in aviation. A first step is to consider the use of fNIRS based BCI to improve training via neurofeedback (Pope et al., 2014) and to tailor the flight sessions to the trainee (Chad et al., 2018). A second step is to use such inference system to monitor pilot's brain activity during each operational flight for quantified self purpose. These daily measures can be used to assess pilot's cognitive workload state and mental fatigue thus providing airlines with analyses tools for crew rostering. A third step is to stream the fNIRS data to the flight data recorder for accident analyses. These logged neurophysiological data would provide additional insights on the crew's cognition during these critical events and help accident investigators. A last step, when the reliability of the fNIRS-based inference system will meet the standard, would be to adapt the flight deck depending on the crew's changing WM load level. As previously demonstrated, stochastic decisional systems could be implemented to infer that human operators are engaged in demanding WM task and dynamically adapt interactions to prevent them from distraction (Gateau et al., 2016). The objective is to improve task allocation to enable better task switching, interruption management, and multi-tasking (Kohlmorgen et al., 2007; Solovey et al., 2011). Eventually, one should consider that such fNIRS based system could be applied to variety of contexts

#### REFERENCES

Anderson, M. J. (2001). A new method for non-parametric multivariate analysis of variance. Austral Ecol. 26, 32–46. doi: 10.1111/j.1442-9993.2001.01070.pp.x


whereby human operators interact with complex and critical systems (e.g., nuclear powerplant, train).

In summary, this study is the first report of the use of an online fNIRS based pBCI both in simulation (in silico) and in aircraft during flight (over the clouds) to measure pilot's WM . The implementation of this pBCI led to address several technical constraints, adapting and testing for instance a new wireless fNIRS that can be used by pilots and that has been approved for use during real flight. It also led to identify solutions to address potential sources of noise in signals such as the sunlight infrared shielding using aluminum based cover. Moreover, it provides important albeit preliminary information about fNIRS measures of the PFC hemodynamic response and its relationship to working memory workload, and in both simulation and actual flight environment. Level of immersion or realistic aspect of flight environment does appear to influence the performance as well as hemodynamic response in the anterior prefrontal cortex, at least for the air traffic control related working memory task. The measurements in simulator had larger fNIRS sensor coverage and future studies may compare simulation vs. actual flight or level of realistic aspect of environment with larger cortical coverage within the actual flight environment, for a more granular detailed comparison. Since fNIRS technology allows the development of mobile, nonintrusive and miniaturized devices, it has the potential to be deployed in future operational environments to monitor the pilot, adapt the complex system interface, and/or to assess the training of operators.

### AUTHOR CONTRIBUTIONS

Study conception and design: FD, TG, and HA. Data acquisition: TG and FD. Data analysis: TG, FD, and HA. Data interpretation and writing: FD, HA, and TG.

### ACKNOWLEDGMENTS

This study was supported by the AXA research fund (Neuroergonomics for flight safety) and MRIS DGA (French Defense Agency, MAIA Project). The authors wish to express their gratitude to Fabrice Bazelot (chief mechanic), Stephane Juaneda (chief pilot), Philippe Minier (flight instructor), Frederic Pierron (flight instructor), and all the pilots who participated in the experiments.


Ayaz, H., Shewokis, P. A., Curtin, A., Izzetoglu, M., Izzetoglu, K., and Onaral, B. (2011). Using mazesuite and functional near infrared spectroscopy to study learning in spatial navigation. J. Vis. Exp. e3443. doi: 10.3791/3443

Baddeley, A. (1992). Working memory. Science 255, 556–559.


Cutrell, E. and Tan, D. (2008). BCI for passive input in HCI. in Proc. CHI 8, 1–3.


International Conference of the IEEE Engineering in Medicine and Biology Society (Minneapolis: IEEE), 594–597.


**Conflict of Interest Statement:** fNIR Devices, LLC manufactures the optical brain imaging instrument and licensed IP and know-how from Science Health Systems, Drexel University. HA was involved in the technology development and thus offered a minor share in the startup firm fNIR Devices, LLC.

The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Gateau, Ayaz and Dehais. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Electroencephalography Based Analysis of Working Memory Load and Affective Valence in an N-back Task with Emotional Stimuli

Sebastian Grissmann<sup>1</sup>\*, Josef Faller<sup>2</sup>, Christian Scharinger<sup>3</sup>, Martin Spüler<sup>4</sup> and Peter Gerjets<sup>1,3</sup>

<sup>1</sup>LEAD Graduate School, University of Tübingen, Tübingen, Germany, <sup>2</sup>Laboratory for Intelligent Imaging and Neural Computing, Columbia University, New York, NY, United States, <sup>3</sup>Leibniz-Institut für Wissensmedien, Multimodal Interaction Lab, Tübingen, Germany, <sup>4</sup>Wilhelm-Schickard-Institute for Computer Science, University of Tübingen, Tübingen, Germany

Most brain-based measures of the electroencephalogram (EEG) are used in highly controlled lab environments and focus only on narrow mental states (e.g., working memory load). However, we assume that complex multidimensional mental states are evoked outside the lab. This could create interference between the EEG signatures used to identify specific mental states. In this study, we aimed to investigate more realistic conditions and therefore induced a combination of working memory load and affective valence to reveal potential interferences in EEG measures. To induce changes in working memory load and affective valence, we used a paradigm that combines an N-back task (for working memory load manipulation) with a standard method to induce affect (affective pictures taken from the International Affective Picture System (IAPS) database). Subjective ratings showed that the experimental task was successful in inducing working memory load as well as affective valence. Additionally, performance measures were analyzed, and behavioral performance decreased with increasing workload as well as with negative valence, showing that affective valence can affect cognitive processing. These findings are supported by changes in frontal theta and parietal alpha power, parameters used for measuring working memory load in the EEG. However, these EEG measures were also influenced by the negative valence condition, showing that the detection of working memory load is sensitive to affective contexts. Unexpectedly, we did not find any effects for the EEG measure typically used for affective valence detection, Frontal Alpha Asymmetry (FAA). Therefore, we assume that the FAA measure might not be usable if cognitive workload is induced simultaneously. We conclude that future studies should account for the potential context-specificity of EEG measures.

Keywords: electroencephalography (EEG), working memory, affective valence, emoback, IAPS

#### Edited by:

Fabien Lotte, Institut National de Recherche en Informatique et en Automatique (INRIA), France

#### Reviewed by:

Frederic Dehais, National Higher School of Aeronautics and Space, France

Mohammad Soleymani, Université de Genève, Switzerland

#### \*Correspondence:

Sebastian Grissmann sebastian.grissmann@lead.uni-tuebingen.de

Received: 11 August 2017 Accepted: 05 December 2017 Published: 19 December 2017

#### Citation:

Grissmann S, Faller J, Scharinger C, Spüler M and Gerjets P (2017) Electroencephalography Based Analysis of Working Memory Load and Affective Valence in an N-back Task with Emotional Stimuli. Front. Hum. Neurosci. 11:616. doi: 10.3389/fnhum.2017.00616

## INTRODUCTION

### Investigating Complex User States with the Electroencephalogram

In recent years, there has been increased interest in using the electroencephalogram (EEG) in the context of human-machine interaction (Frey et al., 2013). However, most studies using the EEG to measure mental states focus on very specific states like working memory (Klimesch, 1999) or affective valence (Ahern and Schwartz, 1985), which are investigated in well-controlled lab environments. Therefore, the indicators used in such studies might not provide robust measurements outside the lab, since real-world environments tend to evoke much more complex and multidimensional mental states that involve different cognitive, emotional and motivational components (Gerjets et al., 2014). Furthermore, many measures used to infer mental states from the ongoing EEG are known to have many-to-many relations, meaning that several physiological variables are associated with multiple psychological elements (Fairclough, 2009). Hence, it is necessary to systematically investigate the relationship between different mental states and the EEG measures that are widely used in neuroscientific studies, in order to determine whether such measures can be used outside the lab. In this article we systematically investigate the relation between two types of mental states that are widely used in the context of human-machine interaction, namely working memory load (e.g., Spüler et al., 2016) and affective valence (e.g., De Smedt and Menschaert, 2012), to study the interaction of the brain responses typically associated with these mental states.

### Working Memory Load

There are many different ways to induce mental states characterized by high levels of working memory load (Wilhelm et al., 2013). One way to induce working memory load that has been widely used in cognitive neuroscience is the N-back task (Kirchner, 1958; Scharinger et al., 2015). The N-back task is a continuous performance task in which subjects are presented with a series of stimuli and have to indicate whether or not the current stimulus is identical to the stimulus presented N steps before. Hence, the load factor "N" allows the difficulty of the task to be adjusted, thereby manipulating working memory load.
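The target rule of the N-back task can be made concrete with a minimal sketch. The function name `nback_targets` and the toy stimulus labels are illustrative assumptions and are not taken from the study.

```python
def nback_targets(stimuli, n):
    """Return one boolean per trial: True if the stimulus equals the one n steps before."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The first n trials can never be targets because no stimulus lies n steps back.
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]


# Hypothetical stimulus labels, used only for illustration.
sequence = ["table", "chair", "table", "chair", "lamp", "lamp"]
print(nback_targets(sequence, 1))  # 1-back: only the repeated "lamp" is a target
print(nback_targets(sequence, 2))  # 2-back: the second "table" and second "chair" are targets
```

The same sequence yields different targets for different values of N, which is how the load factor changes the memory demand without changing the stimuli.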

Working memory load is commonly defined as the interplay of controlled attentional processes and short term memory structures that handle different representational codes via various temporal storage components (Baddeley, 2003, 2012). The central attentional control system is assumed to be mainly located in frontal regions of the brain like the dorsolateral prefrontal cortex, while content in short term memory is thought to be maintained via parietal brain areas like the intraparietal sulcus (Klingberg, 2009; Scharinger et al., 2015). Accordingly, increases in mental workload usually result in increased frontal theta activity as well as decreased parietal alpha activity (Gevins et al., 1995; Klimesch, 1999; Smith and Gevins, 2005). However, previous research has shown that other mental states can also have an effect on measures used for workload detection. Roy et al. (2013), for example, induced fatigue and found that with increasing time on task the discriminability of working memory load was decreased.

### Affective Valence

There are multiple ways to induce affective experiences in lab settings (Gerrards-Hesse et al., 1994). One effective way is the use of affective picture stimuli. The most prominent picture database is the International Affective Picture System (IAPS; Lang et al., 1997) which comprises a large set of standardized and emotionally evocative color photographs. All stimuli of the database have been rated along the dimensions of valence and arousal as described in the two-dimensional circumplex model of emotion (Russell and Pratt, 1980).

The valence dimension reflects the pleasantness of a situation and ranges from sadness to happiness. The arousal dimension reflects the responsiveness of the organism and ranges from sleep to frenzied excitement. Several studies have used the EEG to study affective states (Olofsson et al., 2008; Kim et al., 2013). While there are many different approaches to infer affective valence from the EEG, some even using right hemispheric activity in the beta (Rowland et al., 1985) or gamma band (Müller et al., 1999), the two most widely used EEG measures to infer affective states are the late positive potential (LPP) and the so-called Frontal Alpha Asymmetry (FAA). The LPP is an EEG feature in the time domain and represents a positive deflection in the ERP curve, reflecting activity related to the arousal dimension (Schupp et al., 2000). The FAA captures the individual hemispheric contributions and is related to the affective valence dimension (Ahern and Schwartz, 1985; Tomarken et al., 1990; Huang et al., 2012). Increased right frontal activity is an indicator of a mental state characterized by negative affective valence. Usually, this measure is used during resting (with closed eyes) or passive viewing conditions (Davidson, 1992). Interestingly, previous work has shown that working memory load can result in lateralized activity as well. For instance, a study by Baldwin and Penaranda (2012) used several tasks to induce mental workload and found more left hemispheric activity during increased cognitive load. This might result in interference between affective valence and working memory load in the FAA measure. Analyzing this type of potential interference between working memory load and affective valence is the main goal of this article.

### Investigating Interactions between Working Memory Load and Affective Valence: The Emoback Task

Previous research on the use of neural signatures of mental states has largely ignored the problem of potential interference between working memory load and affective experiences. Only one recent study addressed the effect of an affective experience on the automatic identification of working memory load. In this study, Mühl et al. (2014) used an N-back task to manipulate working memory load while social stress was induced with a stress-induction protocol based on the Trier Social Stress Task (Kirschbaum et al., 1993). The authors attempted to detect mental workload during stress using features from the frequency domain as well as the time domain. They concluded that it is possible to transfer methods across affective contexts, but only with diminished performance. However, these results are limited to one specific affective context (social stress). Furthermore, the authors focused on the identification of working memory load alone. The current article aims to extend these findings using a more general affective response and to account for the potential interference of cognitive and affective components.

In order to collect suitable data for this objective, we used a combination of an N-back task with a standard affect induction, called the emoback task. Interactions between cognitive and affective processes in the EEG have previously been investigated with the affective flanker task (Alguacil et al., 2013). However, the affective flanker task can be seen as a simple stimulus-response task that only requires perceptual inhibition and therefore does not represent a genuine working memory task. The emoback task does involve memory components and, to our knowledge, has only been used twice in combination with the EEG. First, a study by MacNamara et al. (2011) used the emoback task with affective pictures as distractors and found that the LPP was modulated not only by affective responses toward emotional pictures, but also by working memory load. Second, a study by Kopf et al. (2013) used an N-back task with emotional words and recorded data from EEG and fNIRS. They found more errors during the negative condition, especially for high task difficulty. An ERP analysis also revealed that the LPP is influenced by the difficulty of the working memory task and that this influence is further modulated by affective valence. However, both of these studies used the LPP, a feature commonly associated with the arousal dimension. We want to investigate affective responses related to the valence dimension, which allows a more general discrimination of affective states into positive and negative states. Moreover, the study by MacNamara et al. (2011) only used affective stimuli as distractors, while we want to use a paradigm that inherently activates cognitive as well as affective processes. Finally, while the study by Kopf et al. (2013) used emotional words to induce affective reactions, we assume that affective pictures can elicit stronger affective reactions. We therefore decided to use an emoback task with affective pictures from the IAPS database to investigate potential interferences between working memory load and affective valence using frontal theta activity, parietal alpha activity as well as the FAA.

In previous analyses of the same dataset (Grissmann et al., in press), we found that classification of working memory load under affective valence can result in classification accuracies above 70%, which can be further improved via data integration over time. However, we also found that positive as well as negative valenced affective contexts led to decreased classification accuracies, when compared to a neutral affective context. Additionally, classifiers failed to generalize across affective contexts, which highlighted the need to better understand the interactions between working memory load and affective valence in such a context.

#### Research Questions and Hypotheses

We investigated the influence of working memory load and affective valence on subjective measures, behavioral measures as well as the corresponding EEG measures. Furthermore, we investigated potential interferences between cognitive processes and affective processes as reflected in EEG measures used to infer working memory load and EEG measures used to infer affective valence.

More specifically, we investigated potential effects of load levels in the emoback task on subjective ratings, accuracies and reaction times, which might also be reflected in the corresponding EEG measures.

Additionally, we investigated potential effects of the affective valence inductions in the emoback task on subjective measures, behavioral measures as well as EEG measures.

Beyond the main effects of working memory load and affective valence on behavioral measures and EEG measures, we also analyzed whether EEG measures used for mental workload detection are sensitive to different affective contexts and whether EEG measures used to infer affective valence are sensitive to working memory load.

### MATERIALS AND METHODS

#### Sample

For this study, we collected data from 27 female subjects. Female subjects were used in this study because they tend to show stronger reactions toward affective stimuli (Lang et al., 1993) and also exhibit more stable responses (Ahern and Schwartz, 1985). Three subjects were removed due to the low quality of the signals. All of the participants were university students, aged above 18 years (mean: 23.0 years; range: 19–32 years), right-handed and had no blood phobia, to avoid extreme responses toward the experimental stimuli. All subjects provided written informed consent and were paid 20 € for participation in the experiment. The study was approved by the ethics committee of the Knowledge Media Research Center Tuebingen.

### Recording of Physiological Data

Sixty channels of EEG were recorded using an ActiCHamp amplifier and active Ag/AgCl electrodes (Brainproducts GmbH, Gilching, Germany). Electrodes were placed according to the extended 10-20 system. The electrooculogram (EOG) was recorded with four EEG electrodes located at the left and right canthi as well as above (channel Fp1) and below the left eye. All channels were referenced to channel FCz. Impedances were kept below 10 kΩ. The data were sampled at 1 kHz. This electrode layout was chosen to allow for potential source localization approaches. For later processing the data were downsampled to 250 Hz. During the recordings subjects were instructed to sit in a relaxed posture to avoid artifact contamination of the data.

#### Preprocessing

To automatically reject time windows contaminated by artifacts, a time window was removed if channel power exceeded five times the standard deviation at more than six channels. After computing independent components using the CUDAICA implementation (Raimondo et al., 2012) of the Infomax independent component analysis (Bell and Sejnowski, 1995), eye movement artifacts were removed (reduced) using the ADJUST approach (Mognon et al., 2011), which is implemented as a plug-in in the EEGLAB toolbox (Delorme et al., 2011). Finally, the EEG signal was bandpass filtered between 1 Hz and 45 Hz using two-way least-squares finite impulse response filtering.
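The window rejection rule can be sketched as follows. The text does not specify how the standard deviation was estimated or what reference level was used, so this sketch assumes a per-channel threshold of mean plus five standard deviations computed across all windows. The function name `reject_windows` and the toy data are illustrative only and do not reproduce the authors' implementation.

```python
from statistics import mean, stdev


def reject_windows(power, sd_factor=5.0, max_bad_channels=6):
    """Return indices of windows in which too many channels show excessive power.

    power[w][c] is the power of channel c in time window w.
    Assumption: a channel counts as bad in a window if its power exceeds
    the channel mean plus sd_factor standard deviations across windows.
    """
    n_channels = len(power[0])
    thresholds = []
    for c in range(n_channels):
        values = [window[c] for window in power]
        thresholds.append(mean(values) + sd_factor * stdev(values))
    rejected = []
    for w, window in enumerate(power):
        n_bad = sum(1 for c in range(n_channels) if window[c] > thresholds[c])
        if n_bad > max_bad_channels:  # "more than six channels"
            rejected.append(w)
    return rejected


# Toy data: 50 windows, 60 channels, one window with a burst on seven channels.
toy = [[1.0] * 60 for _ in range(50)]
for channel in range(7):
    toy[10][channel] = 100.0
print(reject_windows(toy))
```

Requiring several channels to exceed the threshold at once keeps a single noisy electrode from discarding an otherwise usable window.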

#### Stimuli

For the positive, neutral and negative valence conditions we selected 96 stimuli each. The stimuli were taken from the IAPS database (Lang et al., 1997) and selected based on the valence and arousal ratings provided with the database. Valence and arousal ratings are usually confounded, meaning that stimuli with strong (positive or negative) valence ratings usually also have high arousal ratings. To improve discriminability between affective conditions we made sure that stimuli for the positive condition had the highest valence ratings while stimuli for the negative condition had the lowest valence ratings. Furthermore, stimuli for the neutral condition were selected based on the lowest arousal ratings. All selected stimuli were square and were presented centrally on a standard 23-inch display. To avoid unnecessary eye movements the stimuli were scaled to fill 60% of the height of the display.

### Study Design and Block Structure

The affective valence conditions (positive, neutral or negative) were presented in groups of four trial blocks with identical affective valence. Load levels of the emoback task were alternated block-wise. The load factor of the emoback task had two levels, one-back and two-back. We avoided the use of a zero-back condition, because this would have required repeatedly using the same affective target stimuli. This repetition might have resulted in affective habituation with regard to the target stimuli, thereby diminishing the affect induction (Leventhal et al., 2007). We also avoided a 3-back condition because some studies showed that this load level can lead to task disengagement (Ayaz et al., 2007). Target response hand as well as starting load level of the emoback task were balanced across the subject sample. See **Figure 1** for an illustration of the study design.

All participants performed 12 blocks in total. Each block started with a 10 s baseline during which subjects were instructed to relax and visually fixate a centrally presented light gray fixation cross. After the baseline the first trial started with a stimulus presentation. There were 72 trials in one block, consisting of 24 target stimuli and 48 distractors. The 24 target stimuli were randomly selected for each subject from 96 unique stimuli (four blocks with 24 target stimuli per affective condition). Targets and distractors were sometimes interleaved in the 2-back condition, meaning that two target stimuli could appear right after each other. Here is an example: distractor (e.g., table), distractor (e.g., chair), target (table) and target (chair). After the last trial, another baseline phase, also with a duration of 10 s, was recorded. Between blocks there were short breaks of between 1 min and 3 min. See **Figure 2** for an illustration of the block structure.
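The block structure described above can be written down as a small sketch that also checks the trial counts. The order of the valence conditions is not stated in the text, so `valence_order` is an assumed parameter, and the function name `block_schedule` is hypothetical.

```python
TRIALS_PER_BLOCK = 72
TARGETS_PER_BLOCK = 24
DISTRACTORS_PER_BLOCK = 48
BLOCKS_PER_VALENCE = 4


def block_schedule(valence_order, start_load, blocks_per_valence=BLOCKS_PER_VALENCE):
    """Return a list of (valence, load) tuples, one per block.

    Valence is held constant for groups of blocks, while the load level
    alternates from block to block, starting with start_load.
    """
    loads = ["1-back", "2-back"]
    start = loads.index(start_load)
    schedule = []
    for valence in valence_order:
        for _ in range(blocks_per_valence):
            load = loads[(start + len(schedule)) % 2]
            schedule.append((valence, load))
    return schedule


# Assumed valence order, for illustration only.
schedule = block_schedule(["positive", "neutral", "negative"], "1-back")
for number, (valence, load) in enumerate(schedule, start=1):
    print(number, valence, load)

# Consistency checks against the counts given in the text.
assert len(schedule) == 12
assert TARGETS_PER_BLOCK + DISTRACTORS_PER_BLOCK == TRIALS_PER_BLOCK
assert BLOCKS_PER_VALENCE * TARGETS_PER_BLOCK == 96  # unique targets per valence
```

With four blocks per valence and alternating load, every combination of valence and load level is covered by two blocks.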

#### Subjective Measures

After each run, subjects were asked to rate their subjective experience of the last run. Subjective experience of working memory load was measured using one (modified) item taken from the NASA task load index (Hart and Staveland, 1988). The item asked participants how cognitively demanding they had found the last experimental run. The rating scale ranged from 0 (absolutely no mental demand) to 100 (highest possible mental demand).

Emotional experiences are commonly judged with the help of rating scales. The most widely used is a visual analog scale called the Self-Assessment Manikin (SAM; Bradley and Lang, 1994). It enables fast and reliable judgments about the current emotional state.

#### Trial Structure

Each trial started with a stimulus presentation phase of 1500 ms. The duration of the stimulus presentation was selected based on tests with pilot subjects and was intended to ensure that enough trials could be recorded to obtain reliable estimates for the EEG measures and that the desired affective responses were evoked. During the stimulus presentation phase the subjects had to indicate whether the current stimulus was a target (i.e., identical to the stimulus one step before in the 1-back condition and two steps before in the 2-back condition) or a distractor. The left and right control keys of a standard keyboard were used as inputs. The subjects were instructed to react as quickly as possible to ensure an effect in performance measures. Between stimuli there were interstimulus intervals (ISI) of 1500 ms duration, including 1 to 500 ms of jitter at the end of the ISI to avoid periodic responses in the EEG data. During the ISI the same gray crosshair as in the baseline conditions was presented on the screen.

From a psychological perspective, the N-back task may conceptually be divided into two phases. The first phase starts with the stimulus presentation and ends after the stimulus response. In this phase we assume that subjects had to perform a simple matching task by comparing the current stimulus with the stimulus stored in short-term memory. In the second phase, ranging from the stimulus response to the next stimulus onset, we assume that multiple executive functions are required for a correct response. Inhibition, shifting and updating represent the core executive functions as identified by Miyake et al. (2000). Inhibition is seen as the ability to suppress automatic responses that might arise during task processing. Shifting refers to the concept of cognitive flexibility, which makes us capable of switching between different tasks. Updating refers to the continuous monitoring and changing of content in working memory. Content in short-term memory needs to be updated via inhibition of the last stimulus in the stimulus list. Furthermore, the participants need to switch from a simple matching task to a working memory task and back. We therefore assumed that the second phase was more relevant for the measurement of working memory load. Concerning the affective reaction, we first assumed that it would be stronger right after stimulus onset, since the stimulus was fresh. However, pilot measurements revealed that subjects tend to focus first on the response toward the stimulus and only then direct their attention toward its affective content. We therefore decided to focus our analyses on the second phase. Additionally, we wanted to avoid contamination of the data by muscular artifacts originating from the keyboard input.

#### Analysis

The window of analysis was 1400 ms wide. It started 1100 ms after stimulus onset to exclude post-motor responses in the EEG and ended 2500 ms after stimulus onset. The time window also included 1000 ms of the ISI.
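The window arithmetic can be checked with a short sketch. Expressing the window in sample indices relative to stimulus onset is an assumption about how the epochs were indexed. The 250 Hz rate is the downsampled rate given in the recording section, and the constant names are illustrative.

```python
STIMULUS_MS = 1500        # stimulus presentation phase
WINDOW_START_MS = 1100    # analysis window start, relative to stimulus onset
WINDOW_END_MS = 2500      # analysis window end, relative to stimulus onset
SAMPLING_RATE_HZ = 250    # rate after downsampling


def ms_to_sample(ms, rate_hz=SAMPLING_RATE_HZ):
    """Convert a latency in milliseconds to a sample index (integer division)."""
    return ms * rate_hz // 1000


window_ms = WINDOW_END_MS - WINDOW_START_MS          # total window width
stimulus_part_ms = STIMULUS_MS - WINDOW_START_MS     # part still within stimulus presentation
isi_part_ms = WINDOW_END_MS - STIMULUS_MS            # part within the ISI
first_sample = ms_to_sample(WINDOW_START_MS)
last_sample = ms_to_sample(WINDOW_END_MS)

print(window_ms, stimulus_part_ms, isi_part_ms)      # 1400 400 1000
print(first_sample, last_sample, last_sample - first_sample)  # 275 625 350
```

The window therefore covers the last 400 ms of the stimulus presentation and the first 1000 ms of the ISI, or 350 samples per trial at 250 Hz.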

All spectra for the EEG analysis were computed using the Welch method (Welch, 1967) implemented in the EEGLAB toolbox (Delorme and Makeig, 2004). Theta bands were computed between 4 Hz and 7 Hz and alpha bands were computed between 8 Hz and 12 Hz. Power of the frontal (AFz, Fz) and parietal (CPz, Pz, POz) electrodes was averaged. Power for FAA computation was averaged across channels AF3, F3 and FC1 for the left hemisphere and across channels AF4, F4 and FC2 for the right hemisphere. We followed the approach from Allen et al. (2004) and computed FAA as a difference score (see Equation 1).
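The band and region averaging can be sketched on spectra that have already been computed. The sketch takes per-channel power spectra as given (the Welch step itself is left to the toolbox). It assumes inclusive band edges and simple averaging over frequency bins, neither of which is specified in the text. The function names and the toy spectra are hypothetical.

```python
from statistics import mean

THETA = (4.0, 7.0)    # Hz
ALPHA = (8.0, 12.0)   # Hz
FRONTAL = ["AFz", "Fz"]
PARIETAL = ["CPz", "Pz", "POz"]


def band_power(freqs, psd, band):
    """Mean spectral power over all frequency bins inside the band (edges inclusive)."""
    low, high = band
    return mean(p for f, p in zip(freqs, psd) if low <= f <= high)


def region_band_power(spectra, freqs, channels, band):
    """Average the band power across a group of electrodes."""
    return mean(band_power(freqs, spectra[ch], band) for ch in channels)


# Toy spectra with 1 Hz resolution, used only to show the call pattern.
freqs = [float(f) for f in range(1, 16)]
spectra = {
    "AFz": [f for f in freqs],
    "Fz": [2.0 * f for f in freqs],
    "CPz": [1.0] * len(freqs),
    "Pz": [2.0] * len(freqs),
    "POz": [3.0] * len(freqs),
}
print(region_band_power(spectra, freqs, FRONTAL, THETA))   # frontal theta
print(region_band_power(spectra, freqs, PARIETAL, ALPHA))  # parietal alpha
```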

Equation 1: Frontal Alpha Asymmetry Index (Allen et al., 2004).

$$\mathrm{FAA} = \text{right alpha power} - \text{left alpha power} \tag{1}$$
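A minimal sketch of the difference score in Equation 1, using the left and right channel groups named in the text. The per-channel alpha power values are made up for illustration, and the function name is an assumption. A common variant in the asymmetry literature log-transforms power before subtraction; the sketch follows the plain difference as stated here.

```python
from statistics import mean

LEFT_CHANNELS = ["AF3", "F3", "FC1"]
RIGHT_CHANNELS = ["AF4", "F4", "FC2"]


def frontal_alpha_asymmetry(alpha_power):
    """Right minus left frontal alpha power, each averaged over three channels."""
    left = mean(alpha_power[ch] for ch in LEFT_CHANNELS)
    right = mean(alpha_power[ch] for ch in RIGHT_CHANNELS)
    return right - left


# Hypothetical alpha power values (arbitrary units).
example = {"AF3": 2.0, "F3": 2.5, "FC1": 1.5, "AF4": 3.0, "F4": 3.5, "FC2": 2.5}
print(frontal_alpha_asymmetry(example))  # 1.0
```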

To evaluate the influence of affective valence and working memory load on the EEG, we performed repeated-measures analyses of variance (ANOVAs) with two factors. The first factor was affective valence with three levels (positive, neutral and negative). The second factor was working memory load with two levels (1-back and 2-back). All analyses were conducted at the group level.
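As a sanity check on this design, the uncorrected degrees of freedom of a two-factor repeated-measures ANOVA can be sketched as below. The sketch assumes the 24 subjects retained after exclusion and no sphericity correction. The function name is hypothetical.

```python
def rm_anova_df(n_subjects, levels_valence, levels_load):
    """Uncorrected (effect, error) degrees of freedom for a fully within-subject design."""
    df_valence = levels_valence - 1
    df_load = levels_load - 1
    df_interaction = df_valence * df_load
    df_subjects = n_subjects - 1
    return {
        "valence": (df_valence, df_valence * df_subjects),
        "load": (df_load, df_load * df_subjects),
        "interaction": (df_interaction, df_interaction * df_subjects),
    }


print(rm_anova_df(n_subjects=24, levels_valence=3, levels_load=2))
# {'valence': (2, 46), 'load': (1, 23), 'interaction': (2, 46)}
```

Fractional degrees of freedom in a reported F test would indicate that a sphericity correction was applied, which this sketch does not model.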

### RESULTS

### Subjective Measures—Working Memory Load

As expected, using workload ratings as dependent variable, we found a significant main effect for working memory load, F(1,23) = 22.7, p < 0.001, η<sup>2</sup> = 0.50. Higher working memory load resulted in an increased subjective experience of working memory load. Additionally, there was a significant main effect for affective valence, F(2,46) = 11.6, p < 0.001, η<sup>2</sup> = 0.34. A post hoc test using the Šidák correction revealed that the negative valence condition resulted in an increase of subjective working memory load when compared to the positive valence condition (p < 0.001) as well as the neutral valence condition (p < 0.02). There was no significant interaction between affective valence and working memory load, F(2,46) = 0.8, n.s. Box plots showing the different conditions can be seen in **Figure 3A**.
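The text does not state which variant of η<sup>2</sup> is reported. The two effect sizes above are consistent with partial eta squared, which can be recovered from the F value and its degrees of freedom. The sketch below makes that assumption explicit. It is a plausibility check only and does not reproduce the authors' computation.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Workload ratings: main effects of load and valence as reported above.
print(round(partial_eta_squared(22.7, 1, 23), 2))  # 0.5
print(round(partial_eta_squared(11.6, 2, 46), 2))  # 0.34
```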

#### Subjective Measures—Affective Valence

Since this manuscript focuses on the effect of the experimental manipulation on the EEG measure used to infer affective valence, we do not report results for the arousal dimension here. As anticipated, using the affective valence ratings as dependent variable, there was a significant main effect for affective valence, F(1.4,33) = 55.7, p < 0.001, η<sup>2</sup> = 0.71. Decreased affective valence due to the emotion induction resulted in decreased subjective valence ratings. The neutral condition resulted in more positively valenced ratings than the negative condition (p < 0.001) and more negatively valenced ratings than the positive condition (p < 0.01). However, there was neither a significant main effect for working memory load, F(1,23) = 0.1, n.s., nor a significant interaction between affective valence and working memory load, F(2,46) = 1.0, n.s. Median values for all conditions are shown via box plots in **Figure 3B**.

### Behavioral Performance Measures—Accuracies

Using accuracies as dependent variable we found a significant main effect for working memory load, F(1,23) = 32.70, p < 0.001, η<sup>2</sup> = 0.59. Higher working memory load resulted in decreased accuracies. Additionally, there was a significant main effect for affective valence, F(2,46) = 3.73, p < 0.035, η<sup>2</sup> = 0.14. A post hoc test using the Šidák correction revealed that the negative condition resulted in decreased accuracy when compared to the positive condition (p < 0.03), but not when compared to the neutral condition (n.s.). Interestingly, there was a significant interaction between affective valence and working memory load, F(2,46) = 3.68, p < 0.035, η<sup>2</sup> = 0.14. The negative condition had a negative impact on accuracy, but only under high working memory load. Box plots showing the different conditions can be seen in **Figure 4A**.
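For readers unfamiliar with the Šidák correction used in the post hoc tests, a minimal sketch of the adjustment for a family of comparisons follows. With three valence conditions there are three pairwise comparisons. The raw p-values below are hypothetical and are not taken from the study.

```python
def sidak_adjust(p_values):
    """Šidák-adjusted p-values for a family of m comparisons: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical unadjusted p-values for the three pairwise valence comparisons.
raw = [0.01, 0.20, 0.04]
for p, adjusted in zip(raw, sidak_adjust(raw)):
    print(p, round(adjusted, 4))
```

The adjusted value is always at least as large as the raw one, which keeps the family-wise error rate at the nominal level when the comparisons are independent.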

### Behavioral Performance Measures—Reaction Times

There was a significant main effect for working memory load, F(1,23) = 60.53, p < 0.001, η<sup>2</sup> = 0.73. Higher load resulted in longer reaction times. Moreover, there was a significant main effect for affective valence, F(2,46) = 10.94, p < 0.001, η<sup>2</sup> = 0.32. The negative condition resulted in longer reaction times than the positive condition (p < 0.01) as well as the neutral (p < 0.005) condition. However, there was no significant interaction between affective valence and working memory load, F(2,46) = 1.02, n.s. Median values for all conditions are shown via box plots in **Figure 4B**.

FIGURE 3 | Box plots showing median values for workload ratings (A) and affective valence ratings (B). Blue boxes show working memory load ratings (A) and affective valence ratings (B) for the 1-back conditions. Green boxes show working memory load ratings (A) and affective valence ratings (B) for the 2-back conditions. Median values are indicated by black horizontal lines within the boxes. Top and bottom borders of the boxes represent the middle 50% of the data. Whiskers represent the smallest and largest values not classified as outliers (between 1.5 and 3 times the height of the boxes) or extreme values (more than three times the height of the boxes). Circles indicate outliers.

FIGURE 4 | Box plots showing median values for accuracies and reaction times. Blue boxes show accuracies (A) and reaction times (B) for the 1-back conditions. Green boxes show accuracies (A) and reaction times (B) for the 2-back conditions. Median values are indicated by black horizontal lines within the boxes. Top and bottom borders of the boxes represent the middle 50% of the data. Whiskers represent the smallest and largest values not classified as outliers (between 1.5 and 3 times the height of the boxes) or extreme values (more than three times the height of the boxes). Circles indicate outliers and stars show extreme values.
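The outlier convention described in the box plot captions can be sketched as a small classifier. The sketch assumes that distances are measured from the nearest box edge and that the quartiles are supplied by the caller, since the quartile method used for the figures is not stated. The function name is hypothetical.

```python
def classify_point(value, q1, q3):
    """Classify a value as 'regular', 'outlier' or 'extreme' relative to the box.

    Outliers lie between 1.5 and 3 box heights beyond the box edge;
    extreme values lie more than 3 box heights beyond it.
    """
    box_height = q3 - q1
    distance = max(q1 - value, value - q3, 0.0)  # 0 for values inside the box
    if distance > 3 * box_height:
        return "extreme"
    if distance > 1.5 * box_height:
        return "outlier"
    return "regular"


# Toy quartiles: box from 4 to 6, so the box height is 2.
for value in (5, 8, 10, 13, 0):
    print(value, classify_point(value, q1=4, q3=6))
```

The whiskers in the figures then extend to the smallest and largest values classified as regular.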

### EEG Working Memory Load Measures

#### Frontal Theta

Using frontal theta activity as dependent variable there was a significant main effect for working memory load, F(1,23) = 9.00, p < 0.01, η<sup>2</sup> = 0.28. Frontal theta power was higher in the 2-back conditions. **Figure 5A** illustrates this effect. Interestingly, there was a significant main effect for affective valence, F(2,46) = 8.28, p < 0.001, η<sup>2</sup> = 0.27. A post hoc test using the Šidák correction revealed that frontal theta power was lower in the negative condition when compared to the neutral condition (p < 0.005) as well as the positive condition (p < 0.01). **Figure 6A** shows a topographic plot displaying this frontal theta effect. However, there was no significant interaction between affective valence and working memory load, F(2,46) = 0.87, n.s. Median values for all conditions can be seen in **Figure 7A**.

FIGURE 6 | Topographic plots showing difference in EEG workload measures between negative and neutral conditions. (A) Frontal theta showing decreased frontal theta activity for the negative conditions when compared to the neutral conditions. (B) Parietal alpha showing widespread decrease in parietal alpha power for the negative conditions when compared to the neutral conditions. Nose is at the top. Values are averaged across both N-back levels. Electrodes used for analysis are marked with black rectangles.

FIGURE 7 | Box plots showing median values for frontal theta activity, parietal alpha activity and Frontal Alpha Asymmetry (FAA). Blue boxes show frontal theta activity (A), parietal alpha activity (B) and FAA (C) for the 1-back conditions. Green boxes show frontal theta activity (A), parietal alpha activity (B) and FAA (C) for the 2-back conditions. Median values are indicated by black horizontal lines within the boxes. Top and bottom borders of the boxes represent the middle 50% of the data. Whiskers represent the smallest and largest values not classified as outliers (between 1.5 and 3 times the height of the boxes) or extreme values (more than three times the height of the boxes). Circles indicate outliers. Please note the strong variability in the data due to large inter-individual differences.

#### Parietal Alpha

Using parietal alpha activity as dependent variable we found a significant main effect for working memory load, F(1,23) = 4.22, p = 0.05, η <sup>2</sup> = 0.16. Higher working memory load resulted in decreased parietal alpha power. This effect can be seen in **Figure 5B**. Interestingly, there was also a significant main effect for the factor affective valence in the post-motor window, F(2,46) = 5.27, p < 0.01, η <sup>2</sup> = 0.19. The negative condition exhibited lower parietal alpha power than the neutral condition (p < 0.03), but not lower than the positive condition (n.s.). **Figure 6B** shows a topographic plot displaying this parietal alpha effect. There was no significant interaction between affective valence and working memory load, F(2,46) = 1.52, n.s. **Figure 7B** shows box plots for all conditions.
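The effect sizes can be related to the F statistics with a short sketch. It assumes that the reported η <sup>2</sup> is partial eta squared, which this section does not state explicitly; the function name is illustrative only.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom.

    Assumption: the eta squared values in the text are partial eta squared.
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Parietal alpha, main effect of working memory load: F(1,23) = 4.22
load_effect = partial_eta_squared(4.22, 1, 23)
# Parietal alpha, main effect of affective valence: F(2,46) = 5.27
valence_effect = partial_eta_squared(5.27, 2, 46)
```

Under this assumption, both values round to the effect sizes given for parietal alpha (0.16 and 0.19).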

### EEG Affective Valence Measures—Frontal Alpha Asymmetry

Unexpectedly, there were no significant main effects with regard to FAA, neither for affective valence, F(2,46) = 1.87, n.s., nor for working memory load, F(1,23) = 0.96, n.s. There was also no significant interaction between affective valence and working memory load, F(2,46) = 0.08, n.s. Median values for all conditions are summarized in **Figure 7C**.

### DISCUSSION

Using the emoback paradigm, we found that increased working memory load had a negative impact on task performance, as reflected in decreased accuracies and increased reaction times. These effects were also reflected in corresponding EEG measures. Increased working memory load was accompanied by increases in frontal theta activity as well as decreases in parietal alpha activity.

We found that negative affective valence had a negative impact on accuracies as well as reaction times. Interestingly, measures used for working memory load estimation also appeared sensitive to changes in affective valence. In contrast, FAA, a measure typically used to infer affective valence, did not show any effects in our paradigm, neither with regard to affective valence nor with regard to increases in working memory load. In the following sections we will discuss these results in detail.

#### Subjective Measures

As expected, we found that increased working memory load, due to an increase in the load factor of the emoback task, resulted in higher subjective working memory load ratings. This finding is reassuring, especially since some subjects reported difficulties judging the working memory load with regard to the difficulty of the emoback task. Some subjects even reported that they experienced the 1-back conditions as more demanding, due to their monotonous nature.

As hypothesized, we also found a significant effect of affective valence on ratings from the valence dimension of the SAM. Notably, the successful induction of positive affective valence shows that the emoback paradigm worked in the intended way. The induction of positive emotions can be very difficult to achieve in laboratory environments, since positive emotions usually arise in a specific context which is difficult to represent in a single picture (Kim and Hamann, 2007).

#### Behavioral Performance Measures

Our analyses revealed that increased working memory load induced via the emoback task resulted in decreased performance. The 2-back conditions had reduced accuracy as well as increased reaction times. We also found that inducing negative affective valence had detrimental effects on accuracies as well as reaction times. Similar results have been found in a study by Passarotti et al. (2011). The authors used face stimuli in an affective N-back task and found slower reaction times for angry faces. These negative effects of affective stimuli on cognitive processing might be explained within the theory of hot and cold cognition, concepts related to executive function (Zelazo and Müller, 2002). Hot cognition refers to a process where a person's thinking is influenced by their affective state (Brand, 1986), while cold cognition is based more on rational thinking and critical analysis (Roiser and Sahakian, 2013). Hot cognition seems to be able to overpower cold cognition in certain situations and has been shown to impair decision making (Huijbregts et al., 2008). Processing of negative stimuli can divert cognitive resources from the primary task, and thereby lead to decreased performance. A study by Levens and Gotlib (2010) found that strongly valenced stimuli tend to stay active longer in working memory, which might interfere with different core executive functions necessary during the emoback task (Miyake et al., 2000). In the emoback task, subjects constantly needed to update content in working memory by replacing (inhibition) previous stimuli. Additionally, the subjects seemed to need to shift between these updating tasks and a rather simple identity-matching task. Negative stimuli might catch the attention, slowing down reorganization of stimuli in working memory and thereby impairing task performance.

Interestingly, we also found an interaction for accuracy between working memory load and affective valence. Negative affective valence resulted in decreased accuracies, but only during increased working memory load. These results are in line with findings from the study by Kopf et al. (2013), who found more errors for the difficult task during the negative condition using an affective word N-back task. These findings seem to indicate that both cognitive and affective processing compete for limited cognitive resources and that this dual strain on mental resources leads to decreased performance. Similar results have also been found in the study by MacNamara et al. (2011). The authors used emotional pictures as distractors in an emoback task and found that the emotional content of negatively valenced stimuli increased the negative impact of working memory load on performance. These findings might be explained using the capacity model from Ellis and Ashbrook (1989). Their model assumes that there is a limited pool of attentional resources that can be used to complete a certain task. Affective states can influence the allocation of available attentional resources toward the task, thereby potentially impairing task performance. However, based on performance measures alone, we can only vaguely assume how negative valence impaired task performance.

### EEG Working Memory Load Measures

#### Frontal Theta Activity

Increased working memory load was reflected in EEG measures as well. Frontal theta activity was increased during high working memory load. Interestingly, the negative affective valence condition exhibited decreased frontal theta after the motor response. If negative affective valence impaired performance via the production of additional working memory load, one would expect to find increased frontal theta activity during negative affective valence. However, the decreased frontal theta activity under negative affective valence indicates that the detrimental effect on performance likely has different reasons than the performance decrease under high working memory load. Accordingly, we assume that the negatively valenced stimuli interfered with task processing through a reduction of activity in the frontal attentional control network.

#### Parietal Alpha Activity

Our analyses revealed that parietal alpha activity was decreased in the high working memory load (2-back) condition. We also found that parietal alpha activity was reduced for the negative condition. An fMRI study by Rämä et al. (2001) used different affective voices and found similar results. The authors discovered that the parietal cortex is bilaterally involved in active maintenance of emotional content. Affective stimuli are more salient and usually carry more relevant information (Carretié, 2014). We therefore assume that negatively valenced stimuli result in stronger processing in the storage areas of the parietal cortex, which seems to be reflected in increased parietal activity (i.e., reduced alpha power).

#### EEG Affective Valence Measures—Frontal Alpha Asymmetry

Unexpectedly, we did not find any effect using the FAA measure. We used the FAA measure during the induction of working memory load, while most other studies used brain laterality responses during rest or passive viewing (Ahern and Schwartz, 1985; Tomarken et al., 1992; Lin et al., 2009; Huang et al., 2012; Ramirez and Vamvakousis, 2012). Use of the FAA in combination with other tasks is further complicated because the FAA is also known to be influenced by other factors like seating position and working memory load (Briesemeister et al., 2013). A study by Baldwin and Penaranda (2012) found that increased task difficulty resulted in increased left frontal activity. It appears that the negative condition increased task difficulty, since it did impair task performance. Negative affective valence in passive conditions usually results in increased right frontal activity. It is therefore conceivable that both affective and workload-related processes acted on the same FAA measure, but in opposing directions. This could mean that both effects canceled each other out and thereby masked any potential effects.
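The proposed cancellation can be illustrated with a small numerical sketch. It assumes the common convention of computing FAA as the difference of log-transformed alpha power (right minus left), with alpha power taken as inversely related to cortical activity. The convention, the function name, and all power values below are assumptions for illustration and are not taken from the recorded data.

```python
import math

def frontal_alpha_asymmetry(alpha_left, alpha_right):
    """ln(right alpha power) - ln(left alpha power).

    Positive values indicate relatively greater left frontal activity,
    because lower alpha power is read as higher activity (assumed convention).
    """
    return math.log(alpha_right) - math.log(alpha_left)

# Hypothetical alpha power values in arbitrary units.
baseline = frontal_alpha_asymmetry(alpha_left=10.0, alpha_right=10.0)
# Negative valence alone: more right frontal activity, so less right alpha.
valence_only = frontal_alpha_asymmetry(alpha_left=10.0, alpha_right=8.0)
# Task difficulty alone: more left frontal activity, so less left alpha.
difficulty_only = frontal_alpha_asymmetry(alpha_left=8.0, alpha_right=10.0)
# Both influences at once: the two shifts offset each other.
combined = frontal_alpha_asymmetry(alpha_left=8.0, alpha_right=8.0)
```

In this toy case the two influences shift the index in opposite directions, and the combined condition is indistinguishable from baseline.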

#### Limitations and Outlook

We did not find any effects using the FAA measure. We assume that this was due to the motor response required in our paradigm. It would be interesting to test this assumption by developing a study design that contrasts an active condition with a passive viewing condition. However, it is also possible that the paradigm did not induce sufficient stress to induce changes in the FAA measure. A study by Goodman et al. (2013) used a working memory task and concurrently induced stress. The authors found that changes in the FAA measure (as indicated by left frontal activity) can be used to infer emotional processing, but only when the emotional induction reaches a certain intensity. Future studies could try to use other ways to induce emotional reactions along the valence dimension, like short video clips.

Additionally, our intent to induce the strongest possible valence effect had the drawback that the affective conditions also differed concerning the arousal dimension. This meant that the negative conditions were also experienced as more arousing. Since even this valence induction did not produce a valence effect that could be measured with the FAA, we recommend that future studies should control for arousal as well as other dimensions like affective dominance.

Furthermore, we based our choice of the window used for analysis on our understanding of the existing literature. However, our assumptions need further support, and therefore future studies should investigate the different sub-processes that contribute to the n-back task and their temporal evolution in greater detail.

While we decided to focus on the use of a frequency based approach, future approaches could also include different features for the measurement of affective valence and working memory load. One example would be the use of the ERP, since previous studies have already demonstrated that the ERP has some potential in this context (see Olofsson et al., 2008; Brouwer et al., 2012). Another approach could be the use of connectivity measures, which have already been shown to be of use in similar contexts (e.g., Lee and Hsieh, 2014). A study by Martini et al. (2012) has even combined frequency measures, ERP measures and connectivity measures to differentiate neutral from negative pictures.

Future studies should also explore how well these findings generalize to male individuals as well as other stimulus modalities.

Finally, in this study we focused on the interaction between working memory load and affective valence, using established measures that were already used in the context of human-machine interaction. Future studies should further investigate this issue to gain more insight about the processes underlying the results that were extracted from the EEG recordings. One example for such an integrative approach is the review by Schwabe et al. (2012), which integrates multiple findings into a framework that allows hypotheses to be formulated with regard to specific neural structures, something that is beyond the methodology used in this study.

#### CONCLUSION

We demonstrated that increased working memory load and negatively valenced stimuli can have an effect on performance measures. Additionally, when using established EEG measures we found that increased working memory load can be detected in the EEG, even when affective valence is induced at the same time. However, EEG measures used to infer working memory load were still influenced by negative affective valence. Furthermore, FAA did not prove useful for identification of emotional states when working memory load is induced at the same time. Therefore, future studies should further investigate the context sensitivity and applicability of EEG measures in various contexts to identify the conditions under which such measures can be used to identify different states and processes.

#### AUTHOR CONTRIBUTIONS

SG is the main author and responsible for all tasks. JF was involved in the planning of the experiment and also provided technical advice. CS and MS helped with general advice. PG is the main coordinator and also involved in all related tasks.

#### ACKNOWLEDGMENTS

This research was funded by the LEAD Graduate School and Research Network (GSC1028), a project of the Excellence Initiative of the German Federal and State Governments, the Deutsche Forschungsgemeinschaft (DFG, grant SP 1533/2-1), the Open Access Publishing Fund of University of Tübingen and the Leibniz ScienceCampus Tübingen "Informational Environments".



**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Grissmann, Faller, Scharinger, Spüler and Gerjets. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Measuring Mental Workload with EEG+fNIRS

#### Haleh Aghajani <sup>1</sup> \*, Marc Garbey <sup>2</sup> and Ahmet Omurtag <sup>1</sup> †

<sup>1</sup> Department of Biomedical Engineering, University of Houston, Houston, TX, United States, <sup>2</sup> Center for Computational Surgery, Department of Surgery, Research Institute, Houston Methodist, Houston, TX, United States

We studied the capability of a Hybrid functional neuroimaging technique to quantify human mental workload (MWL). We used electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) as imaging modalities with 17 healthy subjects performing the letter n-back task, a standard experimental paradigm related to working memory (WM). The level of MWL was parametrically changed by variation of n from 0 to 3. Nineteen EEG channels covered the whole head and 19 fNIRS channels were located on the forehead to cover the most dominant brain region involved in WM. Grand block averaging of recorded signals revealed specific behaviors of oxygenated-hemoglobin level during changes in the level of MWL. A machine learning approach was utilized for detection of the level of MWL. We extracted different features from EEG, fNIRS, and EEG+fNIRS signals as the biomarkers of MWL and fed them to a linear support vector machine (SVM) as train and test sets. These features were selected based on their sensitivity to the changes in the level of MWL according to the literature. We introduced a new category of features within fNIRS and EEG+fNIRS systems. In addition, the performance level of each feature category was systematically assessed. We also assessed the effect of the number of features and window size on classification performance. The SVM classifier was used to discriminate between different combinations of cognitive states in binary and multi-class problems. In addition to the cross-validated performance level of the classifier, other metrics such as sensitivity, specificity, and predictive values were calculated for a comprehensive assessment of the classification system. The Hybrid (EEG+fNIRS) system had an accuracy that was significantly higher than that of either EEG or fNIRS. Our results suggest that EEG+fNIRS features combined with a classifier are capable of robustly discriminating among various levels of MWL. Results suggest that EEG+fNIRS should be preferred to only EEG or fNIRS in developing passive BCIs and other applications which need to monitor users' MWL.

Keywords: functional near-infrared spectroscopy (fNIRS), electroencephalography (EEG), human mental workload, cognitive state monitoring, n-back, multi-modal brain recording, machine learning

### INTRODUCTION

Mental workload (MWL) affects people who are interacting with computers and other devices. The use of technology in everyday life may impose high cognitive demands as users navigate complex interfaces. Mental overload may compromise users' performance and sometimes safety, by increasing error rates and engendering fatigue, decline in motivation, higher reaction times (Xie and Salvendy, 2000; Young and Stanton, 2002), and neglect of critical information, known as cognitive tunneling (Thomas and Wickens, 2001; Dixon et al., 2013; Dehais et al., 2014). Taking into account the users' cognitive characteristics and limitations is thus critical in improving the design of human-machine interfaces (BMI) and for operating them efficiently by installing adaptive features that can respond to changes in the MWL (Kaber et al., 2000; Parasuraman and Wilson, 2008; Gagnon et al., 2012).

#### Edited by:

Stephen Fairclough, Liverpool John Moores University, United Kingdom

#### Reviewed by:

Noman Naseer, Air University, Pakistan

Mickael Causse, Institut Supérieur de l'Aéronautique et de l'Espace, France

\*Correspondence: Haleh Aghajani haghajani@uh.edu

#### † Present Address:

Ahmet Omurtag, Nottingham Trent University, Nottingham, United Kingdom

Received: 09 January 2017 Accepted: 23 June 2017 Published: 14 July 2017

#### Citation:

Aghajani H, Garbey M and Omurtag A (2017) Measuring Mental Workload with EEG+fNIRS. Front. Hum. Neurosci. 11:359. doi: 10.3389/fnhum.2017.00359

MWL has been defined as the proportion of the human operator's mental capabilities that is occupied during the performance of a given task (Boff et al., 1994). According to the prevalent Multiple Resources theory (Navon and Gopher, 1979; Wickens, 2002), performing different tasks requires a subject to tap into a set of separate resources, which are limited in capacity and distributable among tasks (Horrey and Wickens, 2003). In general, these resources can be categorized along four dimensions: processing stage (perception or cognition vs. response), perceptual modality (visual vs. auditory), visual channel (focal vs. ambient), and processing code (verbal vs. spatial; Wickens, 2002; Horrey and Wickens, 2003). Based on the Multiple Resources theory, two tasks with equal resource demands that both recruit one level of a given dimension will interfere with each other more than two tasks that recruit separate levels on the dimension (Wickens, 2002), and may create bottlenecks and consequent decrements in performance. Similar conclusions have been reached in the areas of aviation (Stanney and Hale, 2012; Causse and Matton, 2014; Durantin et al., 2014), education (Palmer and Kobus, 2007; Spüler et al., 2016), and a variety of clinical situations (Carswell et al., 2005; Stefanidis et al., 2007; Prabhu et al., 2010; Yurko et al., 2010; Byrne, 2013; Guru et al., 2015). In the case of driving while having a phone conversation, in addition to the interference of resources, the "engagement phenomenon" also controls the outcome of the multitasking scenario. This happens when one of the tasks attracts so much attention that the advantage of separate resource demands is eliminated (Strayer and Johnston, 2001; Strayer and Drews, 2007).

MWL is a construct that arises from the interaction of the properties of a task, the environment in which it is performed, and the characteristics of the human operator performing it (Longo, 2016). Task properties include the difficulty and monotony of the task and the types of resources that it engages. The environment may contain various degrees of distraction and noise. The subject characteristics involve training and expertise as well as changing levels of fatigue, motivation, and vigilance. Thus, the MWL can be systematically adjusted by tuning a subset of these variables while controlling for the rest.

Methods of determining MWL fall into three broad categories: (1) Self-reporting and subjective ratings using standard questionnaires such as the NASA-TLX (Hart and Staveland, 1988); (2) Behavioral measures, such as primary- and secondary-task performance; and (3) Measures based on the physiology of the user, including heart rate variability, oculomotor activity, pupillometry, electromyography, galvanic skin response, and brain activity (Xiao et al., 2005; Wickens, 2008; Sahayadhas et al., 2012). Self-reporting and behavioral based information tends to be delayed, sporadic, and intrusive to obtain. Performance based information, in addition, can be misleading since multiple degrees of MWL may accompany the same level of performance (Yurko et al., 2010). Physiological measures, on the other hand, do not require overt behavior, can be arranged to have little or no interference with task execution, and can supply information continuously without significant delay. Progress in miniaturization and wireless technology has amplified these advantages of physiological measures (Sahayadhas et al., 2012).

Most studies of MWL based on brain function have utilized electroencephalography (EEG), following a large number of studies using EEG for developing BMI (Wolpaw and Wolpaw, 2012). Functional near-infrared spectroscopy (fNIRS), a newer modality, has shown promising capabilities in BMI applications for discrimination of different motor tasks (Naseer and Hong, 2013) or decoding subjects' binary decisions (Naseer et al., 2014). The relationship between MWL and central nervous system activity is well-established (McBride and Schmorrow, 2005). BMIs that do not attempt to directly control a device but modulate its user interface based on real-time user status are referred to as passive BMIs (Gateau et al., 2015). In such applications, of paramount importance. Recently multi-modal techniques utilizing concurrent EEG and fNIRS have gained popularity due to their relatively richer information content (Hirshfield et al., 2009; Liu Y. et al., 2013; Liu T. et al., 2013; Keles et al., 2014b; Aghajani and Omurtag, 2016; Buccino et al., 2016; Omurtag et al., 2017). Available evidence indicates that brain activity measures of MWL are more informative than ocular or peripheral physiology measures (Hogervorst et al., 2014).

Concurrent EEG and fNIRS, which we refer to as EEG+fNIRS, is promising as a practical technique that is more accurate than the individual modalities alone. fNIRS provides information that is complementary to EEG, by measuring the changes in cerebral blood flow (CBF) and related hemoglobin concentrations through near-infrared light source/detectors on the scalp. It is comparable to EEG in portability. fNIRS does not have electromyographic (EMG) and blink artifacts and its signal closely correlates with the blood oxygen level dependent (BOLD) signal from functional magnetic resonance imaging (fMRI; Strangman et al., 2002; Huppert et al., 2006), which is a gold standard for measuring cerebral hemodynamics. In addition to the advantages of pooling different types of signals, EEG+fNIRS offers new types of features, ultimately based on neurovascular coupling (NVC), the cascade of processes by which neural activity modulates local blood flow and oxygenation. NVC-related features are not resolvable by a uni-modal signal sensitive to only neural activity (e.g., EEG) or only hemodynamics (e.g., BOLD).

Working memory (WM) is a brain system that provides transient holding and processing of the information necessary for complex cognitive tasks (Baddeley, 2003). It has been investigated in previous functional neuroimaging studies, which identified the prefrontal cortex (PFC) as the most relevant area of activation (Cohen et al., 1997; Smith and Jonides, 1997; Hoshi et al., 2003; Owen et al., 2005). MWL detection using WM load as an experimental paradigm has been studied using EEG (Berka et al., 2007; Dornhege et al., 2007; Grimes et al., 2008; Brouwer et al., 2012), fNIRS (Hoshi et al., 2003; Izzetoglu et al., 2003; Ayaz et al., 2012; Durantin et al., 2014; Herff et al., 2014), and concurrent EEG and fNIRS (Hirshfield et al., 2009; Coffey et al., 2012). We have previously shown that aspects of NVC are characterizable by EEG+fNIRS, by taking advantage of the synergistic interaction between the modalities (Keles et al., 2016). The potential of EEG+fNIRS for active BCIs has recently been investigated (Fazli et al., 2012; Liu Y. et al., 2013; Tomita et al., 2014; Buccino et al., 2016). In this study, we built on this work to explore the unique properties of EEG+fNIRS for MWL detection.

The n-back task was introduced by Kirchner (1958). The n-back is a continuous-performance task for measurement of WM capacity, which has been used frequently in the field of cognitive neuroscience. Gevins (Gevins et al., 1997, 1998; Smith and Gevins, 2005a) and Smith (Smith and Gevins, 2005b) showed that, during high task-load conditions of an n-back task, EEG theta activity increases in the frontal midline and alpha activity attenuates. In addition, fNIRS studies revealed that WM load during the n-back task activates the PFC (Ayaz et al., 2012; Sato et al., 2013; Fishburn et al., 2014; Herff et al., 2014; Mandrick et al., 2016b). The n-back task engages WM and becomes more demanding as the value of n increases. We have therefore used the n-back task as our experimental paradigm with n ranging from 0 to 3, allowing us to tune the task difficulty. Our study maintained all other conditions constant by employing only healthy adult volunteers with no previous experience with this task, all performing under the same laboratory conditions with no distractions or additional activities. Our experimental design is consistent with numerous BMI studies that take WM load as a proxy for MWL (Berka et al., 2007; Grimes et al., 2008; Coffey et al., 2012; Herff et al., 2014).
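The theta and alpha effects mentioned above are usually quantified as band power. The following minimal sketch, which is not the processing pipeline of this study, estimates band power from a plain discrete Fourier transform using only the standard library. The band limits (4 to 8 Hz for theta, 8 to 13 Hz for alpha), the 250 Hz sampling rate, and the synthetic test signal are assumptions for illustration. In practice a library routine such as Welch's method would typically be used instead.

```python
import cmath
import math

def band_power(signal, fs, f_low, f_high):
    """Mean periodogram power of `signal` in the band [f_low, f_high) Hz."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    powers = []
    for k in range(1, n // 2):
        freq = k * fs / n  # frequency of DFT bin k
        if f_low <= freq < f_high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centered))
            powers.append(abs(coeff) ** 2 / n)
    if not powers:
        raise ValueError("no frequency bins fall inside the requested band")
    return sum(powers) / len(powers)

# Synthetic 2 s signal: a strong 6 Hz (theta) and a weaker 10 Hz (alpha) component.
FS = 250  # Hz, assumed sampling rate
samples = [math.sin(2 * math.pi * 6 * t / FS) + 0.5 * math.sin(2 * math.pi * 10 * t / FS)
           for t in range(2 * FS)]
theta = band_power(samples, FS, 4.0, 8.0)
alpha = band_power(samples, FS, 8.0, 13.0)
```

For this synthetic signal the theta estimate exceeds the alpha estimate, which is the pattern the cited studies describe for high task load at frontal midline sites.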

The aim of this study was, first, to introduce and validate a state-of-the-art EEG+fNIRS setup in a single headpiece. Since PFC is the main region of interest in WM load detection (Owen et al., 2005), our design had the advantage of frontal lobe coverage by fNIRS. We had whole-head coverage by EEG. We used the term whole-head to refer to the fact that we placed EEG electrodes at all (except frontopolar) standard 10–20 sites bilaterally covering the frontal, central, temporal, parietal, and occipital areas. Having fNIRS optodes on the forehead not only improved the quality of acquired signal but also reduced the preparation time. The second aim of this study was to develop EEG+fNIRS measures that discriminate among levels of MWL and show that they are promising for the practical and accurate quantification of MWL in real-world settings. We developed such measures by extracting EEG, fNIRS, and Hybrid (EEG+fNIRS) based features from the full set of signals. The most discriminating features were selected and fed into support vector machines (SVM) to perform binary or multi-class classification. The handful of EEG+fNIRS studies currently available (Coffey et al., 2012; Putze et al., 2014; Buccino et al., 2016) have not systematically quantified the performance of subsets of a Hybrid system and how its features contribute to the accuracy of classification. Therefore, the third aim of this study was to rigorously compare the performance of uni-modal and Hybrid systems.
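For the binary case, the assessment metrics named in the abstract (sensitivity, specificity, and predictive values) follow from the confusion matrix. The sketch below is illustrative only: the function name and the label coding (1 for the higher workload class, 0 for the lower one) are assumptions, and it presumes that each class occurs and is predicted at least once.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical labels for eight test windows.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```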

### METHODS

#### Subjects

Seventeen healthy volunteers (16 males, 1 female) with a mean age of 26.2 and standard deviation of 7.7 years, recruited from among University of Houston students and employees, participated in the experiment. The experimental procedures involving human subjects described in this paper were approved by the Institutional Review Board of the University of Houston. The participants gave written informed consent prior to the experiments and were compensated for their effort with a gift card from a major retailer. During the performance of the verbal n-back task, subjects indicated target letters by pressing the space bar on the keyboard. All subjects were right-handed and used their dominant hand for performing the experiment. This reduced the variability of brain signals related to motor function across subjects. None of the subjects had ever taken part in an n-back study, thus no training effects were expected.

### Experimental Design

One of the most common WM paradigms for MWL assessment is the n-back task, which was first introduced by Kirchner (1958). In the letter n-back task, participants observe a sequence of single letters separated by a certain amount of time each; for each letter they decide whether it is a target, i.e., identical to the item that appeared n items back in the sequence. The value of n is kept constant during a segment of the experiment referred to as a session. As n increases the difficulty of the task becomes higher. In the literature, the 0-back task has usually been used as a control state. **Figure 1** illustrates how the letter n-back task works when n is 0, 1, 2, or 3. Depending on n, the subject had to find the target letter and respond via the user interface.
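The target rule can be written down in a few lines. In this sketch the function name and the example sequence are hypothetical. For the 0-back condition it is assumed that a target is any occurrence of a single predefined letter, which is a common convention but is not spelled out in the text above.

```python
def nback_targets(letters, n, zero_back_target="X"):
    """Return one boolean per letter, True where the letter is a target.

    For n >= 1 a letter is a target if it equals the letter n positions back.
    For n == 0 a target is any occurrence of `zero_back_target` (assumed rule).
    """
    if n == 0:
        return [letter == zero_back_target for letter in letters]
    return [i >= n and letters[i] == letters[i - n]
            for i in range(len(letters))]

sequence = list("ABACADDA")
targets_1back = nback_targets(sequence, 1)
targets_2back = nback_targets(sequence, 2)
targets_3back = nback_targets(sequence, 3)
```

The same letter sequence yields different targets for different n, which is why the session type has to be announced before each task block.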

FIGURE 2 | Experimental design for the letter n-back task. Each session includes the instruction, task, and rest blocks.

In each experiment, we had a total of 40 sessions. These sessions were presented in pseudorandom order, 10 sessions per n. Each session started with an instruction block that was displayed on the screen for 5 s and informed the subject about which type of n-back task was about to start. Then 22 randomly selected letters (out of a pool of 10 candidate letters) appeared in sequence on the screen (task block). Each letter stayed on the screen for 500 ms and the subject had 1,500 ms to press the space button if the letter was a target according to the type of session. At the end of each session there was a 25 s resting block. During this block the subject remained relaxed and fixated on a cross on the screen to let the brain activation return to its baseline and get ready for the next n-back session (Herff et al., 2014). **Figure 2** shows one sample session. Total recording time was 50 min. The program implementing this task was written using Presentation software (Neurobehavioral Systems, Inc.). All the information about the appearance time of each letter, the session type, the subject's response time, and whether the presented letter was a target or not was recorded by this software and stored as a text file for later processing. The objective performance of the subjects within each session was computed from this information. Subjects whose accuracy was too low (<90% in the 0-back or <80% in the 1-back) were deemed insufficiently focused on the task. Performance level was measured by computing the accuracy, defined as the fraction of correct responses. We considered a missed target as an incorrect response.
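
As a minimal sketch of the performance measure and the exclusion rule, the Python below computes the fraction of correct responses for one session and applies the two thresholds quoted above. The function names and the example data are hypothetical, and counting a key press on a non-target as incorrect is an assumption (the text only states that a missed target is incorrect).

```python
def session_accuracy(is_target, pressed):
    """Fraction of letters answered correctly in one session.

    A press on a target and no press on a non-target count as correct.
    A missed target counts as incorrect (stated in the text); a press on a
    non-target is also counted as incorrect (assumption).
    """
    correct = sum(t == p for t, p in zip(is_target, pressed))
    return correct / len(is_target)


def sufficiently_focused(acc_0back, acc_1back):
    """Apply the exclusion thresholds given in the text (90% and 80%)."""
    return acc_0back >= 0.90 and acc_1back >= 0.80


is_target = [False, True, False, True, False]
pressed = [False, True, False, False, False]   # the second target was missed
acc = session_accuracy(is_target, pressed)
```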

### Data Acquisition and Preprocessing

A quantitative meta-analysis has identified the cortical regions that were robustly activated during the letter n-back task (Brodmann areas 6, 7, 8, 9, 10, 32, 40, 45, 46, 47, and the supplementary motor area; Owen et al., 2005). We used this information together with the results of previous EEG studies to choose the optimum locations for our 19 EEG electrodes (F7, F8, F3, F4, Fz, Fc1, Fc2, T3, T4, C3, C4, Cp1, Cp2, P3, P4, Pz, Poz, O1, O2). We used Fpz as the ground and Cz as the reference electrode. Several different reference electrode positions are reported in the literature, each with its own strengths and weaknesses. Among them, linked ears and vertex (Cz) are the most common. A Cz reference is advantageous when it is located in the middle of the active electrodes; however, for nearby points it may result in poor resolution (Teplan, 2002). According to previous studies, the central brain region is less involved in the performance of a WM task than the frontal and parietal lobes, so choosing Cz as the reference may be more appropriate than any other electrode in the 10–20 system. microEEG (a portable device made by Bio-Signal Group Inc., Brooklyn, New York) was used to sample EEG at 250 Hz (**Figure 3a**). Electrode impedances were kept below 10 kΩ. A 128-channel electrode cap with Ag/AgCl electrodes (EasyCap, Germany) was used to physically stabilize the sensors and provide uniform scalp coverage. We located the fNIRS optodes on the subject's forehead to fully cover the PFC, which plays a significant role in WM (Fitzgibbon et al., 2013). Seven sources and seven detectors were located on the forehead, resulting in 19 optical channels, each consisting of a source–detector (S–D) pair separated by a distance of 3 cm. The 19 optical channels used in this study are shown in **Figure 3c**. The S–D placement starts from the left hemisphere and ends on the right hemisphere. S4 and D4 are located at the center of the forehead, where D4 is located at the AFz location and channel 10 is located at the Fpz location according to the standard international 10–20 system (**Figures 3b,c**). We used our triplet-holders (Keles et al., 2014a) on the forehead to keep each EEG electrode in the middle of an S–D pair and to fix the distances between the sensors. fNIRS signals were acquired at 8.93 Hz via a NIRScout extended (NIRx Medical Technologies, New York) device, which was synchronized with the EEG data by means of common event triggers (**Figure 3a**). NIRScout is a dual-wavelength continuous wave system. The EEG signal was band-pass filtered (0.5–80 Hz), and a 60 Hz notch filter was used to reduce the power line noise.

and data transmission to the acquisition platform. (b) Coronal view of the subject showing a close view of the placement of fNIRS optodes and EEG electrodes. (c) Topographical view of fNIRS sources (Si, black) and detectors (Di, red) and EEG electrodes (green). Each pair of source and detector separated by 3 cm creates a channel (CHi). We used the signals from F7, Fpz, and F8.

The spatial Laplacian transform is generally effective in removing muscle artifacts from the EEG signal (Fitzgibbon et al., 2013). We subtracted the mean EEG voltage of the neighbor electrodes from each EEG signal. **Figure 4** shows the configuration of neighbor electrodes for the 19 EEG channels. Each detector in the NIRScout device records the signal from each separate source at two different wavelengths (760 and 850 nm). Oxy- and deoxyhemoglobin concentration changes (HbO and HbR) were computed using the modified Beer-Lambert law (Sassaroli and Fantini, 2004) with standard values for the chromophore extinction coefficients and differential path-length factor (Keles et al., 2016). fNIRS signals might be contaminated with movement, heart rate, and Mayer wave artifacts. In order to reduce these artifacts while retaining the maximum possible amount of information, a band-pass filter of 0.01–0.5 Hz was applied to the fNIRS signals. After the preprocessing step, two subjects were excluded from the rest of the analysis due to poor signal quality and excessive noise. The processed signals were inspected visually for the presence of muscle, motion, eye movement, and other artifacts. Recordings that were contaminated in excess of 10% by artifacts were excluded as a whole (Keles et al., 2016). In addition, one subject was excluded since he was not sufficiently focused on the experiment according to the 0-back accuracy cut-off. **Figure 5** shows a segment of preprocessed data for one of the subjects. The figure indicates the temporal variations in the fNIRS signals and the EEG frequency bands, which are utilized in feature extraction. The first and second rows are the HbO and HbR of fNIRS channel 17, respectively. The third row is the EEG time-frequency map for channel O2.
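
The neighbor-mean subtraction used for the spatial Laplacian step can be sketched as follows. The three-channel montage and its neighbor map are made up for illustration; the actual neighbor configuration is the one shown in Figure 4, and the function name is hypothetical.

```python
def laplacian_rereference(signals, neighbors):
    """Subtract from each channel the mean of its neighbor channels.

    signals:   dict mapping channel name -> list of samples
    neighbors: dict mapping channel name -> list of neighbor channel names
    """
    out = {}
    for ch, samples in signals.items():
        nbrs = neighbors[ch]
        out[ch] = [
            x - sum(signals[nb][t] for nb in nbrs) / len(nbrs)
            for t, x in enumerate(samples)
        ]
    return out


# Hypothetical montage with two samples per channel (not the study's layout).
signals = {"Fz": [4.0, 6.0], "F3": [1.0, 2.0], "F4": [3.0, 4.0]}
neighbors = {"Fz": ["F3", "F4"], "F3": ["Fz"], "F4": ["Fz"]}
filtered = laplacian_rereference(signals, neighbors)
```

Activity that is common to a channel and its neighbors cancels out, so spatially broad components are attenuated relative to local ones.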

After preprocessing, each task block ({0, 1, 2, 3}-back) and rest block was divided into 5, 10, 20, or 25 s epochs in order to assess the effect of window size on classification results. **Figure 6** shows the four epoch types, with window sizes from 5 to 25 s. In most cases adjacent epochs overlap by half the epoch length. This overlap was used in order to capture the unique temporal response of each individual, as there could be inter-subject variability in the time required for the hemodynamic response to peak, and/or in the number of peaks (Power et al., 2012). In addition, during the classification phase, an imbalance in the number of feature vectors within each class biases the training procedure in favor of the class with the higher number of training examples (He and Garcia, 2009). In our experimental design we have 40 rest blocks and 10 blocks of each n-back task type. From each task block, 16, 8, 4, and 2 feature vectors (one per window) were extracted for window sizes of 5, 10, 20, and 25 s, respectively. From each rest block, 5, 4, 2, and 1 feature vectors were extracted for the same window sizes, respectively.
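
The class sizes implied by these numbers follow from simple arithmetic, sketched below. The per-block window counts and block counts are taken from the text; the variable and function names are hypothetical.

```python
WINDOW_SIZES_S = [5, 10, 20, 25]
WINDOWS_PER_TASK_BLOCK = {5: 16, 10: 8, 20: 4, 25: 2}   # values quoted in the text
WINDOWS_PER_REST_BLOCK = {5: 5, 10: 4, 20: 2, 25: 1}    # values quoted in the text
TASK_BLOCKS_PER_LEVEL = 10                              # blocks per n-back level
REST_BLOCKS = 40                                        # one rest block per session


def windows_per_class(window_s):
    """Total number of windows per class for a given window size (seconds)."""
    task = TASK_BLOCKS_PER_LEVEL * WINDOWS_PER_TASK_BLOCK[window_s]
    rest = REST_BLOCKS * WINDOWS_PER_REST_BLOCK[window_s]
    return {"each n-back level": task, "rest": rest}


counts = {w: windows_per_class(w) for w in WINDOW_SIZES_S}
```

With 5 s windows this gives 160 windows for each n-back level and 200 for rest, so the classes in a binary task v rest problem are of comparable size.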

#### Feature Extraction

We extracted from each window three main categories of features for all 19 EEG electrodes and 19 fNIRS channels: EEG (unimodal), fNIRS (uni-modal), and EEG+fNIRS (multi-modal or Hybrid).

EEG-based features were computed from the frequency band power (PSD), phase locking value (PLV), phase-amplitude coupling (PAC), and the asymmetry of frequency band power between the right and left hemispheres (Asym\_PSD). Initially, the spectrogram was calculated using the short-time Fourier transform with windows of 1 s, an overlap of half the window size, and a frequency resolution of 1 Hz. The power was calculated in eight frequency bands, each with a width of 4 Hz, in the range 0 to 32 Hz. The ranges are referred to by their conventional labels: delta (0–4 Hz), theta (4–8), alpha (8–12), followed by five intervals ranging from low beta (12–16) to high beta (28–32). We also used the labels f1, f2,..., f8 for these frequency bands. EEG frequency band power for each epoch was extracted by integration of the corresponding power over each frequency band. We imposed the 32 Hz cutoff since higher frequencies in scalp EEG are generally not considered informative about cortical activity (Goncharova et al., 2003; Muthukumaraswamy, 2013). PLV is a measure of phase synchrony between two distinct neuronal populations, which is computed between two selected EEG electrodes as an estimate of the inter-area synchrony (Vinck et al., 2011). PLV was estimated between electrode pairs that were selected to assess three different types of synchrony: intra-hemispheric (F3-P3, F4-P4, Fc1-Cp1, Fc2-Cp2, Fz-Poz), symmetric inter-hemispheric (F7-F8, F3-F4, Fc1-Fc2, C3-C4, T3-T4, Cp1-Cp2, P3-P4, O1-O2), and asymmetric inter-hemispheric (F3-P4, F4-P3, Fc1-Cp2, Fc2-Cp1). PLV was computed for four bands of interest ([3–5], [9–11], [19–21], [39–41] Hz). PAC measures the coupling between the phase of a low-frequency oscillation (here [4–7], [9–13] Hz) and the amplitude of a high-frequency oscillation ([15–35], [30–60] Hz) (Radwan et al., 2016). It provides an estimate of the local, multi-frequency organization of neuronal activity (Dvorak and Fenton, 2014). We chose 8 EEG channel pairs between the right and left hemispheres (F8-F7, F4-F3, Fc2-Fc1, T4-T3, C4-C3, Cp2-Cp1, P4-P3, and O2-O1) for the Asym\_PSD feature.
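
For illustration, a commonly used definition of the PLV is the magnitude of the average unit phasor of the phase difference between two signals. The sketch below assumes that the instantaneous phases of the two band-filtered signals are already available (for example from a Hilbert transform, a step the text does not specify); it is not necessarily the exact estimator used in this study, and the function name is hypothetical.

```python
import cmath
import math


def phase_locking_value(phase_a, phase_b):
    """PLV = |mean over samples of exp(i * (phase_a - phase_b))|, in [0, 1]."""
    phasors = [cmath.exp(1j * (a - b)) for a, b in zip(phase_a, phase_b)]
    return abs(sum(phasors) / len(phasors))


n = 8
phase_a = [2 * math.pi * k / n for k in range(n)]

# Constant phase lag of 0.5 rad: perfectly locked.
locked = phase_locking_value(phase_a, [p - 0.5 for p in phase_a])

# Phase differences spread evenly around the circle: no locking.
unlocked = phase_locking_value(phase_a, [0.0] * n)
```

A value near 1 indicates a stable phase relationship within the window, while a value near 0 indicates that the phase difference is spread around the circle.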

fNIRS features were based on HbO and HbR amplitude (HbO/R Amp.), slope of HbO and HbR (HbO/R slope), standard deviation of HbO and HbR (HbO/R Std.), skewness of HbO and HbR (HbO/R Skew.), and kurtosis of HbO and HbR (HbO/R Kurt.). The statistics of HbO and HbR are commonly used as features in fNIRS studies of MWL and BMIs (Naseer and Hong, 2015; Naseer et al., 2016a,b). Our inspection of the fNIRS data revealed patterns of correlation between HbO and HbR that were time- and area-dependent. Hence, we also included the zero-lagged correlation between HbO and HbR (HbO-HbR Corr.) as an additional feature. Hybrid features were based on the EEG and fNIRS features in addition to specifically Hybrid quantities that depend simultaneously on both systems.
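
The window statistics listed above can be sketched with the Python standard library alone. Several details are assumptions made for illustration and are not stated in the text: the amplitude is taken as the window mean, the slope as a least-squares fit against time, and the moments as population (biased) moments with non-excess kurtosis. All function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def central_moment(xs, k):
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def slope(xs, dt=1.0):
    """Least-squares slope of xs against time, with samples dt apart."""
    t = [i * dt for i in range(len(xs))]
    mt, mx = mean(t), mean(xs)
    num = sum((ti - mt) * (xi - mx) for ti, xi in zip(t, xs))
    den = sum((ti - mt) ** 2 for ti in t)
    return num / den


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


def fnirs_window_features(hbo, hbr, dt=1.0):
    """Per-window features for one fNIRS channel."""
    feats = {}
    for name, sig in (("HbO", hbo), ("HbR", hbr)):
        var = central_moment(sig, 2)
        feats[name + " Amp."] = mean(sig)
        feats[name + " slope"] = slope(sig, dt)
        feats[name + " Std."] = math.sqrt(var)
        feats[name + " Skew."] = central_moment(sig, 3) / var ** 1.5
        feats[name + " Kurt."] = central_moment(sig, 4) / var ** 2
    feats["HbO-HbR Corr."] = pearson(hbo, hbr)   # zero-lag correlation
    return feats


hbo = [0.0, 1.0, 2.0, 3.0, 4.0]        # made-up samples
hbr = [0.0, -0.5, -1.0, -1.5, -2.0]
features = fnirs_window_features(hbo, hbr)
```

At the 8.93 Hz sampling rate reported above, `dt` would be 1/8.93 s if the slope is wanted per second rather than per sample.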

We chose to focus on a straightforward quantity, which can be easily calculated within the time windows of interest: the zero-lagged correlation between the Hb (HbO or HbR) amplitude and the EEG frequency band power (in the eight separate bands described above). These neurovascular features based on HbO and HbR were denoted NVO (oxygenated neurovascular coupling) and NVR (deoxygenated neurovascular coupling), respectively. To calculate NVO/R for the left hemisphere, the correlation between each fNIRS channel (CH1 to CH9) and each frequency band of the F7 EEG channel (band-pass filtered within the specific frequency range) was calculated. For the right hemisphere, the correlation between each fNIRS channel (CH11–CH19) and each frequency band of the F8 EEG channel was calculated. For fNIRS channel 10, which is located at the center, we used the average of the F7 and F8 channels to find NVO and NVR. This resulted in 152 (19 × 8) NVO and 152 NVR features from each window. Each set of features extracted from one subject's data was dc-shifted and scaled in order to have a mean value of zero and a standard deviation of one.
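
The channel pairing behind the neurovascular features can be sketched as below. The sketch assumes that the Hb series and the EEG band-power series have already been brought onto a common time grid within the window, and that the "average of F7 and F8" is taken over their band-power series; neither detail is specified in the text. Function names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


def reference_band_power(fnirs_channel, f7_power, f8_power):
    """Pick the EEG band-power series paired with an fNIRS channel (1..19)."""
    if 1 <= fnirs_channel <= 9:          # left hemisphere -> F7
        return f7_power
    if 11 <= fnirs_channel <= 19:        # right hemisphere -> F8
        return f8_power
    if fnirs_channel == 10:              # center -> average of F7 and F8
        return [(a + b) / 2 for a, b in zip(f7_power, f8_power)]
    raise ValueError("fNIRS channel must be in 1..19")


def neurovascular_feature(hb, fnirs_channel, f7_power, f8_power):
    """Zero-lag correlation between an Hb series and the paired band power."""
    return pearson(hb, reference_band_power(fnirs_channel, f7_power, f8_power))


f7 = [1.0, 2.0, 3.0, 4.0]     # made-up band power at F7 in one band
f8 = [4.0, 3.0, 2.0, 1.5]     # made-up band power at F8 in the same band
hbo = [0.1, 0.2, 0.3, 0.4]    # made-up HbO samples for one channel
nvo_left = neurovascular_feature(hbo, 3, f7, f8)
nvo_right = neurovascular_feature(hbo, 15, f7, f8)
n_nvo_features = 19 * 8       # 19 fNIRS channels x 8 frequency bands
```

Passing an HbR series instead of HbO gives the corresponding NVR feature.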

#### Classification and Validation

FIGURE 5 | Sample preprocessed EEG+fNIRS data for one of the subjects. Vertical dashes separate different n-back task and rest blocks. (a) Concentration changes of oxy-hemoglobin (red curve) and deoxy-hemoglobin (blue) for channel 17. (b) EEG time-frequency map of channel O2.

TABLE 1 | The top R² ranked features for three representative subjects for the binary rest v 3-back classification.

Columns indicate the feature description (Descr.), corresponding frequency range if applicable, and PC index in order of descending energy (or the channel pair for PLV). The HbO-HbR corr. has been abbreviated as COR.

Following feature extraction, we implemented SVM classification and k-fold cross-validation with k = 10. SVM is a nonparametric supervised classification method, which has already shown promising results in medical diagnostics, optical character recognition, electric load forecasting, and other fields. SVM can be a useful tool when the data are irregular, for example when they are not regularly distributed or have an unknown distribution (Auria and Moro, 2008). A linear SVM constructs an optimal hyperplane, creating a decision surface that maximizes the margin of separation between the closest data points belonging to different classes (Aghajani et al., 2013). The observations were randomly partitioned into k groups of approximately the same size. One group was selected as the testing data and the rest as the training data. Principal component analysis (PCA) was applied to the training set. We applied PCA separately to each feature subgroup (11 subgroups, with the subgroups divided further by frequency bands as described in our feature extraction methods). For example, the EEG alpha frequency band power (8–12 Hz) consisted of the time series from 19 EEG channels. After PCA, these signals yielded 19 principal components (PCs) and their associated time series as the new set of features. A similar PCA was applied to each feature subgroup. The resulting PCs contained a set of weights (for the EEG channels), which could be interpreted as an activation map. PCA therefore allowed us to interpret the topographic distribution of activation associated with every feature. PCA also yielded an eigenvalue for each PC, corresponding to its variance. The eigenvalues typically decrease sharply, the sum of the first few accounting for almost all of the total energy of the 19 PCs. However, the most energetic PCs are not necessarily the most informative, as shown in Results (**Table 1**).

In order to estimate the features' discriminating ability, we used the Pearson correlation coefficient method (Mwangi et al., 2014). A reference time series was constructed by labeling each window with a distinct integer that represented the rest or the task difficulty level ({0 (rest), 1 (0-back), 2 (1-back), 3 (2-back), 4 (3-back)}). We used R², the square of the Pearson correlation between the feature time series and the reference signal, to rank the set of features. The testing data were projected into the PC space that was obtained from the training data and the testing features were ranked by using the same method. In part of our analysis, we chose to reduce the number of features of the systems (EEG, fNIRS, and Hybrid) by truncating all systems at the same fixed size, eliminating the lowest ranked features.

The labeled training examples were fed into a binary linear SVM. Each training example contained a vector of feature values in a given window and its label, which denoted one of the two classes of interest. In this study there were 10 possible pairs of binary classifications corresponding to our five distinct classes. In order to investigate the ability to discriminate WM loading against a baseline, we chose the pairs {1-back v rest}, {2-back v rest}, {3-back v rest}, {1-back v 0-back}, {2-back v 0-back}, and {3-back v 0-back} and performed binary classifications on them. We also investigated the ability to discriminate between degrees of MWL by using a multi-class scheme. For this purpose we utilized the error-correcting output code (ECOC) multiclass model, which employs a set of binary classifiers. We adopted an all-pairs ECOC model to train a binary classifier on each pair of classes in the training data and, for each new instance in the testing data, assigned the label that minimizes the aggregate Hamming loss from the predictions of all binary classifiers (Dietterich and Bakiri, 1995). In comparison to its alternatives, this approach has been shown to enhance accuracy while maintaining a low run-time complexity (Fürnkranz, 2002). We investigated four groups of multi-class sets that contained narrow gradations of MWL: {3-back v 2-back v 1-back}, {3-back v 2-back v 1-back v 0-back}, {3-back v 2-back v 1-back v rest}, and {3-back v 2-back v 1-back v 0-back v rest}.

The accuracy was computed as the fraction of labels in the testing data that were correctly identified by the SVM. Finally, the cross-validation was repeated k times with each group of observations being used exactly once as the testing data. The overall accuracy was calculated as the mean of the repetitions. In addition to the overall accuracy, confusion matrices yield a very detailed overview of a classifier's performance. Usually, the confusion matrix is further summarized by some proportions extracted from it. The main metrics are (a) the sensitivity of class A (Sens.A), which describes how well the classifier recognizes observations of class A; (b) the specificity of class A (Spec.A), which describes how well the classifier recognizes that an observation does not belong to class A; (c) the positive predictive value of class A (PPVA), i.e., the probability that an observation truly belongs to class A given that the prediction is class A; and (d) the negative predictive value of class A (NPVA), i.e., the probability that an observation truly does not belong to class A given that the prediction is not class A (Beleites et al., 2013). We pooled all the k confusion matrices of the k-fold cross validation to calculate Sens.A, Spec.A, PPVA, and NPVA. For all the calculations described in this paper we used Matlab v.8.6.0.267246 (R2015b) (The MathWorks, Inc., Natick, Massachusetts, United States).
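
The four per-class metrics and the pooling of the fold-wise confusion matrices can be sketched in Python as follows (the study itself used Matlab). The function names and the confusion counts are made up for illustration.

```python
def class_metrics(confusion, labels, cls):
    """One-vs-rest metrics for class `cls`.

    confusion[i][j] is the number of windows whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    k = labels.index(cls)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # true cls, predicted otherwise
    fp = sum(row[k] for row in confusion) - tp   # predicted cls, truly otherwise
    tn = total - tp - fn - fp
    return {
        "Sens.": tp / (tp + fn),
        "Spec.": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


def pool(matrices):
    """Element-wise sum of the per-fold confusion matrices."""
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(size)]
            for i in range(size)]


labels = ["rest", "3-back"]
folds = [[[18, 2], [1, 15]], [[17, 3], [3, 13]]]   # made-up counts for two folds
pooled = pool(folds)
rest_metrics = class_metrics(pooled, labels, "rest")
```

The same functions apply unchanged to the multi-class problems, since each class is scored against all the others combined.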

#### RESULTS

We initially investigated the relationship between the subjects' performance and task difficulty, in order to ensure that it was consistent with expectations. **Figure 7** shows the accuracy and response time of all subjects, with error bars showing the standard deviation of inter-subject variability. The figure shows that the fraction of accurate responses decreased with increasing task difficulty. There was little or no accuracy decrement between the 0- and 1-back tasks, as expected (Jonides et al., 1997).

Furthermore, the time it took subjects to produce a correct response increased (and eventually more than doubled) with task difficulty (**Figure 7**).

comparison of each two response accuracy (red) or response time (black) (\*p < 0.05, \*\*p < 0.001, \*\*\*p < 0.0001).

rest (dashed curves) and task (solid). Increasing thickness of solid curves corresponds to increasing task difficulty from 0- to 3-back. AU means arbitrary units.

We next examined the HbO and HbR patterns during changes in mental load. **Figures 8a–e** show the grand block average of HbO (red) and HbR (blue) amplitude. The shaded area shows the standard deviation of the inter-subject variability. In this paper, the term grand block averaging denotes the average over the blocks of the same class and over all channels and subjects. Following neural activation, local blood flow and volume typically increase on a time scale of seconds and, at the beginning of the task, there is a localized rise in oxygenation in the PFC (Huppert et al., 2006), which creates the positive peak of HbO. After a few seconds, due to the metabolic consumption of oxygen, the oxyhemoglobin concentration decreases, leading to a negative HbO amplitude. During the rest state, which comes after the task block, the oxyhemoglobin concentration starts to increase and HbO returns to baseline. Toward the end of the rest window there is an apparent task-anticipating rise in HbO. The range of changes of HbO is clearly larger than that of HbR during the task periods. From 0- to 2-back the positive peak of HbO increases, and it then decreases from 2- to 3-back. HbO and HbR have opposite signs and are hence negatively correlated over short time scales in the rest state. However, this appears to change during the task in ways that depend on the value of n. The range of HbO changes increases with n, although it appears to decrease slightly as n changes from 2 to 3.

In **Figures 8f,g** we show the grand block average of all tasks v the rest state for one specific fNIRS channel (channel 10, which is located at Fpz, near the center of the forehead). In this figure the curves corresponding to rest and to task with all values of n are shown in one plot to make the comparison easier. We show only the first 25 s of the n-back task block (out of the full 44 s). The shaded areas for the standard deviation are omitted for clarity. **Figure 8f** shows that the peak amplitude of HbO is positive for task performance and negative for rest. In addition, it decreases with increasing load for n > 0. The area under the curve clearly discriminates between rest and task, since it is negative during rest and positive for all n-back tasks. By contrast, the peaks of the amplitude of the grand block average of HbR (**Figure 8g**) that occur after 10 s have a positive correlation with the level of mental load. These patterns are not observed in the case of the 0-back task, since it is related to perception only and is less involved with the WM system. We also examined the time course of selected features that were extracted from the signals.

**Figure 9** shows how the PSD extracted from EEG, the HbO/R Amp. from fNIRS, and the NVO/R features from EEG+fNIRS change in relation to the degree of WM load. We use the term session to denote a task block and the following rest block. For each cognitive state, we then have 10 sessions per subject. In the case of 5 s windowing, for each feature, we have 21 values in each session (16 from the task block and 5 from the rest block). The curves in **Figure 9** were computed by first applying a simple triangular moving average filter covering three samples at each step, and then a cubic spline interpolation. The figure shows that the theta and alpha band powers of EEG are positive during 0- and 1-back, although they become negative for the 2- and 3-back tasks. The positive peak of HbO increases from 0- to 2-back and is slightly lower for 3-back than for 2-back. The figure also shows that the Hybrid features (such as NVO in the delta range) generally resemble the corresponding uni-modal features (such as HbO and PSD in the delta range); however, they were dominated by neither, suggesting that the Hybrid features contained additional information.
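
A three-sample triangular moving average of the kind mentioned above can be sketched as follows. The weights (1, 2, 1)/4 and the treatment of the first and last samples are assumptions, since the text does not give them; the subsequent cubic spline interpolation is not shown.

```python
def triangular_smooth(values):
    """Three-point triangular moving average with weights (1, 2, 1) / 4.

    Assumption: the first and last samples are left unchanged.
    """
    if len(values) < 3:
        return list(values)
    out = [values[0]]
    for i in range(1, len(values) - 1):
        out.append((values[i - 1] + 2 * values[i] + values[i + 1]) / 4)
    out.append(values[-1])
    return out


smoothed = triangular_smooth([0.0, 4.0, 0.0, 4.0, 8.0])
```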

**Table 1** shows the 10 highest ranked features (based on R²) for three subjects, obtained from the 3-back v rest training set. The features are characterized by the description (e.g., PSD, PLV, HbO, as described in the Methods section) and the particular frequency band, where applicable. The frequency band is applicable only to the EEG and neurovascular features. The table also indicates the order of the feature according to the magnitude of its eigenvalue [ordered from the most energetic (1) to the least (19)]. Since PCA was not used in the case of PLV, the channel label is given instead of the PC order. For example, the highest ranked feature for subject 1 was the third most energetic PC from the EEG frequency band power in the theta range (4–8 Hz). For subject 3, the highest ranked feature was the second most energetic PC from the neurovascular feature based on the correlation between HbR and the EEG frequency band power in the high beta range (28–32 Hz). The table illustrates that the types of features in the top ranked group may vary among subjects and that a high discriminating ability of a feature does not imply high energy in the sense of the PCA.
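
The R² ranking behind Table 1 can be sketched as below: each feature's time series is correlated with the integer-coded class labels, and the features are sorted by the squared correlation. The feature names and values are made up, and the function names are hypothetical; only the label coding (0 for rest, 4 for 3-back) follows the Methods.

```python
def r_squared(feature, reference):
    """Square of the Pearson correlation between a feature and the label series."""
    mf = sum(feature) / len(feature)
    mr = sum(reference) / len(reference)
    cov = sum((f - mf) * (r - mr) for f, r in zip(feature, reference))
    var_f = sum((f - mf) ** 2 for f in feature)
    var_r = sum((r - mr) ** 2 for r in reference)
    if var_f == 0 or var_r == 0:
        return 0.0                      # a constant series carries no information
    return cov * cov / (var_f * var_r)


def rank_features(features, reference):
    """Return (names sorted by descending R^2, dict of R^2 scores)."""
    scores = {name: r_squared(vals, reference) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


reference = [0, 0, 0, 4, 4, 4]          # three rest windows, three 3-back windows
features = {                            # made-up PC time series
    "PC-a": [0.1, 0.0, 0.2, 0.9, 1.1, 1.0],
    "PC-b": [0.5, -0.5, 0.4, -0.4, 0.5, -0.5],
    "PC-c": [1.0, 0.8, 0.9, 0.1, 0.3, 0.0],
}
order, scores = rank_features(features, reference)
top_two = order[:2]                     # truncation to a fixed number of features
```

Because the correlation is squared, a feature that decreases with load (such as the hypothetical `PC-c`) ranks as highly as one that increases with load by the same degree.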

Hybrid (green bars).

**Figure 10** shows the classification accuracies of various subsystems as well as the Hybrid system for the 3-back v rest using 5 s windows. The error bars represent the standard deviation of inter-subject variability. Within the EEG group (gray bars), the leftmost bar is the accuracy of a system based only on the PSD features. On its immediate right is the accuracy of the subsystem based only on PLV features, and similarly for PAC and other feature types. The rightmost bar in the EEG group shows the accuracy of the full EEG system, which includes all feature types based on EEG signals. Clearly the PSD is the primary contributor to the discriminating ability of the EEG; however, the accuracy appears to be slightly enhanced by including the other types of features. Among the fNIRS systems (red), the leftmost bar indicates that the Hb amplitudes together with the HbO-HbR correlation are the primary contributors to the accuracy of detection. Unlike in the EEG system, the other feature types, such as slope and higher order statistics, significantly enhance the accuracy of the fNIRS system. The overall accuracy of the fNIRS system is lower than the overall accuracy of the EEG system. The accuracy based only on the neurovascular features is indicated by the leftmost bar in the Hybrid group (green). The middle bar in the Hybrid group represents the pooling of all features from the EEG and fNIRS systems. Finally, the inclusion of the neurovascular features in the Hybrid system (rightmost green bar) appears to slightly enhance the accuracy.

**Figure 10** compared the accuracies of various systems, with each system containing the full set of features that belonged to it. The number of features in the full set was different for each system; for example, the EEG, fNIRS, and Hybrid systems contained 360, 209, and 873 features, respectively. This may raise the question of to what extent these systems' accuracies were influenced by the number of features they contained, rather than by the information content per feature. In order to examine this question, we computed the systems' accuracies after they had been truncated to contain the same number of features. This was done by selecting the top ranking group after the features had been sorted in order of descending values of R². The goal was to perform a comparison on an equal footing by truncating each system in the same way. The calculation was repeated by varying the number of features from two to the available maximum. **Figure 11a** indicates that the fNIRS system had the lowest accuracy over the entire range of the number of features. The EEG system performed better, while the Hybrid accuracy was consistently superior to that of either system, similar to the results shown in **Figure 10**. **Figure 11b** shows the cumulative sums of the R² index versus the number of features for the three systems, which qualitatively agree with **Figure 11a**. The calculations are for the 3-back v rest using 5 s windows and they qualitatively agree with the results (not shown) of binary classifications of other pairs of classes and window sizes. The shaded areas indicate the standard deviation of inter-subject variability.
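
The feature totals quoted above (360, 209, and 873) are consistent with the following breakdown, which is a reconstruction rather than a count stated by the authors. The PSD and neurovascular counts follow directly from the Methods; the PLV, PAC, and Asym\_PSD counts are inferred from the listed electrode pairs and frequency bands, and the PAC count in particular assumes one feature per electrode and per phase/amplitude band combination.

```python
N_EEG, N_FNIRS, N_BANDS = 19, 19, 8

eeg_counts = {
    "PSD": N_EEG * N_BANDS,        # 19 electrodes x 8 bands
    "PLV": (5 + 8 + 4) * 4,        # 17 electrode pairs x 4 bands
    "PAC": N_EEG * 2 * 2,          # assumed: 19 electrodes x 2 phase x 2 amplitude bands
    "Asym_PSD": 8 * N_BANDS,       # 8 right-left pairs x 8 bands
}
fnirs_counts = {
    "Amp./slope/Std./Skew./Kurt.": 5 * 2 * N_FNIRS,   # 5 statistics x (HbO, HbR) x 19 channels
    "HbO-HbR Corr.": N_FNIRS,                         # one correlation per channel
}
neurovascular = 2 * N_FNIRS * N_BANDS                 # NVO + NVR, 152 each

n_eeg = sum(eeg_counts.values())
n_fnirs = sum(fnirs_counts.values())
n_hybrid = n_eeg + n_fnirs + neurovascular
```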

TABLE 4 | Sensitivity (Sens.), specificity (Spec.), positive predictive value (PPV), and negative predictive value (NPV) are listed in percentage (%) for all classification cases (binary and multi-class) and all systems (EEG, fNIRS, Hybrid).

**Figure 10** provided the results for only one type of binary classification (3-back v rest) and the variability over subjects as a standard deviation. However, it is highly instructive to examine the result for each subject as well as for every binary and multi-class problem that was described previously in our Methods. **Tables 2, 3** break down the accuracy of classification for each subject (S1, S2,..., S14), system type (EEG, fNIRS, Hybrid), and type of classification problem. The mean as well as the minimum and maximum of the values for the subject population are provided as three separate columns on the left. The heights of the rightmost bars within the EEG (gray), fNIRS (red), and Hybrid (green) groups in **Figure 10** correspond in **Table 2** to the accuracy percentages 83.5, 75.3, and 90.1 shown under the column "Mean" and the row "3-back v rest." In the columns for individual subjects, **Table 2** shows the mean accuracy and the standard deviation from the trials in the 10-fold cross validation.

**Table 2** suggests that the mean accuracy of classifying task against a baseline increases with n, as expected. The accuracy of detecting 0-back v n-back appears to be slightly greater than that of detecting rest v n-back (n > 0). For example, the accuracy was 87.2% for 1-back v rest and 91.4% for 1-back v 0-back. **Table 3** shows the results for multi-class classification. In this case, the accuracy tends to decline slightly as more classes are included in the classification problem. In all subjects and classification problems, the Hybrid system has the greatest accuracy without exception. We investigated whether the observed superiority of the Hybrid system was statistically significant. A two-way ANOVA was performed on every classification problem (a row of **Table 2** or **Table 3**) by using as the two factors the type of system (EEG, fNIRS, Hybrid) and the subject. The analysis was repeated by taking the classification problems as, first, the binary types in **Table 2** and, second, the multi-class types in **Table 3**. In all cases, the differences in accuracy among the subjects were not significant and there were no interactions between system type and subject, while the differences in accuracy between the Hybrid and the uni-modal systems were significant (p < 0.001).

**Table 4** lists the sensitivity (Sens.), specificity (Spec.), positive predictive value (PPV), and negative predictive value (NPV) for each individual class within a classification case. For example, in the case of {rest v 3-back}, each of the rest and 3-back classes has its own Sens., Spec., PPV, and NPV. In addition, this table summarizes all these metrics for the EEG, fNIRS, and Hybrid systems in order to make it easier to compare their capabilities.

The foregoing results corresponded to 5 s windows but qualitatively agreed with the patterns we observed with other window sizes as well. We also assessed the effect of window length on classification accuracy for the EEG, fNIRS, and Hybrid systems. **Figure 12** shows the results of this assessment. We examined four different window lengths (5, 10, 20, and 25 s). A change of window length has the same effect on all three types of systems: as the length increases from 5 to 20 s the accuracy increases, and it declines thereafter.

### DISCUSSION AND CONCLUSION

The functional activity of the human brain can be observed with various imaging techniques including fMRI, fNIRS, and EEG. Each of these modalities has its advantages and disadvantages. The advantages of using a Hybrid EEG+fNIRS system fall into two main categories. First, each of these modalities measures the changes in a specific aspect of brain physiology. EEG results directly from the electrical activity of cortical and subcortical neurons, with a sub-millisecond temporal resolution. On the other hand, fNIRS yields local measures of changes in HbO and HbR concentration and is, therefore, an indicator of the metabolic/hemodynamic changes associated with neural activity. Second, the physics of measurement behind EEG and fNIRS are quite different. This property, for example, makes the EEG signal prone to blink and muscle artifacts, while this is not the case for fNIRS. Hence, by using a multimodal recording system we are able to assess brain behavior from different physiological perspectives, in addition to compensating for some weaknesses of one modality with the other.

Our results suggest that EEG+fNIRS combined with a classifier are capable of robustly discriminating among various levels of MWL. In our study, the Hybrid system had an accuracy higher than either EEG or fNIRS alone for every subject. The pooling of EEG and fNIRS features and the inclusion of neurovascular features resulted in a synergistic enhancement, rather than in a diluting effect (which would have given a performance intermediate between the two modalities). In mission-critical contexts such as aviation or surgery, even small improvements in MWL detection can translate into significant gains in safety and efficiency. Our experiments were designed to use WM load (adjusted through the value of n in the n-back task) as a correlate of MWL in general. Furthermore, EEG and fNIRS can be integrated without excessive cost, effort, or intrusiveness for the user. The combination of all these considerations suggests that EEG+fNIRS should be preferred to only EEG or fNIRS, in developing passive BCIs and other applications which need to monitor users' MWL.

Our preliminary analysis of the experimental data was consistent with expectations. For example, **Figure 7** indicated that the fraction of accurate responses declined more steeply as n increased. This can be explained by noting that temporal tagging is the cognitive process that imposes the greatest load in the n-back task, as compared to the other processes that are also involved, such as encoding, storage, matching, and inhibition to dampen the oldest memory traces (Jonides et al., 1997).

Temporal tagging, unused in the 0- and 1-back, begins to affect MWL substantially when n > 1. Another interesting preliminary result was the observation that HbO showed an anticipatory increase near the end of the rest sessions (**Figure 8E**). This is consistent with related fMRI findings (Sakai and Passingham, 2003) and with the fact that the PFC is involved in planning future action. A negative correlation between HbO and HbR has been reported in the literature. In **Figures 8A–E**, the shaded area, which is the standard deviation of the normalized HbO and HbR variations within the block across all subjects and all fNIRS channels, is relatively large. This indicates a high level of inter-subject variability, which might be the reason that we do not see such an anti-correlated pattern between HbO and HbR in **Figure 8**. The drop in the HbO peak in **Figure 8D** is consistent with the account of Izzetoglu et al. (2004): when a participant reaches maximum performance capacity, in other words starts to be cognitively overloaded, the participant loses concentration on the task and, as a result, both performance and the oxygenated hemoglobin changes decline.

**Figure 5** did not show any differences between rest and task states that were obvious to visual inspection of the preprocessed EEG or fNIRS signals. Subject and block averaging of various features shown in **Figure 9** did, however, indicate that such systematic variations existed. Examples of such visible variations were lower values of the EEG alpha band power in the 2- and 3-back tasks, and higher values of HbO at the beginning of the task period in the n-back (increasing with n), relative to those in the rest state. To take advantage of such variations we employed discrimination through a linear SVM. In a non-linear SVM, the kernel can help with data that are not linearly separable by mapping them into a new feature space in which they are separable with a linear SVM. Improving the accuracy of a non-linear SVM, however, requires choosing optimal kernel parameters. This can reduce the classifier's generalization potential for new subjects when the kernel parameters are not re-tuned. It also increases the probability of overfitting, since increasing the complexity of a classifier gives it the flexibility to match the training set exactly. In addition, whether to choose a non-linear SVM depends on the application, because it trades calculation speed for a slightly higher accuracy. We face the same trade-off in choosing the window size. The results shown in **Figure 10** and **Tables 2, 3** were highly promising for accurately discriminating among the rest and task states. As **Tables 2, 3** show, the subject-averaged accuracy of the Hybrid system in binary discrimination was lowest (87.2%) for 1-back v rest and highest (96.6%) for 3-back v 0-back. The corresponding lowest and highest results for uni-modal systems were fNIRS (71.6%) and EEG (92.0%), both for 3-back v rest. We calculated the overall average accuracy once for all of the binary cases and once for all of the multi-class cases.
We did this calculation for the EEG, fNIRS, and Hybrid systems separately. In binary classification, the EEG, fNIRS, and Hybrid systems had overall accuracies of 85.9, 74.8, and 90.9%, respectively. In multi-class classification, they had overall accuracies of 79.6, 57.0, and 86.2%, respectively. The accuracy of each of the three systems was thus higher for the binary cases. Note, however, that the chance-level accuracy for multi-class classification is lower than for binary classification (33% for 3-back v 2-back v 1-back; 25% for 3-back v 2-back v 1-back v rest and for 3-back v 2-back v 1-back v 0-back; and 20% for 3-back v 2-back v 1-back v 0-back v rest).
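The chance levels quoted above follow from simple arithmetic. The sketch below assumes balanced classes (equal numbers of samples per class), which the text does not state explicitly; the function name `chance_level` is introduced here only for illustration.

```python
def chance_level(n_classes: int) -> float:
    """Accuracy expected from random guessing, assuming balanced classes."""
    if n_classes < 2:
        raise ValueError("a classification problem needs at least two classes")
    return 1.0 / n_classes


# Number of classes in each problem type discussed in the text.
problems = {
    "binary (e.g., 3-back v rest)": 2,
    "3-back v 2-back v 1-back": 3,
    "3-back v 2-back v 1-back v rest": 4,
    "3-back v 2-back v 1-back v 0-back v rest": 5,
}

for name, k in problems.items():
    print(f"{name}: chance = {100 * chance_level(k):.0f}%")
```

Comparing an observed accuracy against the matching chance level, rather than against a fixed 50%, is what makes the binary and multi-class results comparable.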

**Table 4** reveals that, for all four metrics extracted from the confusion matrix (sensitivity, specificity, PPV, NPV), the Hybrid system always has a higher value than the EEG system, and the EEG system a higher value than the fNIRS system. In the binary cases that discriminate between task and rest, the sensitivity of detecting the rest state is significantly higher than the sensitivity of detecting the task state. Correspondingly, the specificity of detecting the task state is higher than that of detecting the rest state. In the binary cases that discriminate between two tasks, the sensitivity of detecting the task with the higher difficulty level exceeds that of the task with the lower difficulty level, although the difference is not very significant. The PPV and NPV are usually more important than sensitivity and specificity: patients and doctors want to know whether this particular patient is ill rather than whether the test can recognize ill people (Beleites et al., 2013). Here, our result (**Table 4**) shows that the Hybrid system has a very promising PPV and NPV at the same time for all of the classification cases. Except for the case of {1-back v rest}, the minimum sensitivity, specificity, PPV, and NPV for the Hybrid system are 81.4, 82.5, 81.8, and 84.6%, respectively.
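For readers less familiar with the four metrics, the sketch below shows how they are derived from a binary confusion matrix using their standard definitions. The counts in the example are hypothetical and are not taken from **Table 4**; the function name is ours.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard confusion-matrix metrics for the class treated as 'positive'.

    tp/fn: positive samples classified correctly/incorrectly
    tn/fp: negative samples classified correctly/incorrectly
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true positives detected
        "specificity": tn / (tn + fp),  # fraction of true negatives detected
        "ppv": tp / (tp + fp),          # how trustworthy a positive prediction is
        "npv": tn / (tn + fn),          # how trustworthy a negative prediction is
    }


# Hypothetical counts: 50 'rest' windows (positive) and 50 'task' windows.
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Note that swapping which class is called positive exchanges sensitivity with specificity and PPV with NPV, which is why the sensitivity for the rest state and the specificity for the task state mirror each other in the task-versus-rest cases.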

Selecting an optimal subset from the full set of features is crucial for achieving high accuracy and avoiding over-fitting. In some applications, e.g., those involving on-board real-time analysis, it may also be important to keep the system size small and avoid computational delays. **Figure 11b** shows the cumulative sum of R<sup>2</sup> v number of features for the three systems, which qualitatively agrees with **Figure 11a**, suggesting that R<sup>2</sup> ranking is an effective method of feature selection. We have not used an explicit artifact rejection step in our analysis. However, it is well known that PCA can segregate non-cerebral artifacts (typically of higher amplitude than contributions of cortical origin) into distinct PCs. Our feature selection based on R<sup>2</sup> then assigns a lower rank to such PCs and they are excluded from a truncated system.
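The ranking step can be illustrated with a minimal sketch. It assumes that the ranking score is the squared Pearson correlation between a feature (here, a PC score) and a binary class label, a common convention in BCI feature ranking; the exact definition is not restated in this section, and the toy data and function names are hypothetical.

```python
def r_squared(feature, labels):
    """Squared Pearson correlation between one feature and the class labels."""
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(labels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, labels))
    sxx = sum((x - mean_x) ** 2 for x in feature)
    syy = sum((y - mean_y) ** 2 for y in labels)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no class information
    return (sxy * sxy) / (sxx * syy)


def rank_features(columns, labels):
    """Return (feature_index, score) pairs sorted from most to least discriminative."""
    scored = [(idx, r_squared(col, labels)) for idx, col in enumerate(columns)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy example: 6 windows, labels 0 = rest, 1 = task, three candidate features.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1, 2, 3, 4, 5, 6],  # strongly related to the label
    [1, 0, 1, 0, 1, 0],  # weakly related
    [5, 5, 5, 5, 5, 5],  # constant, e.g., an uninformative component
]
ranking = rank_features(columns, labels)
top_k = [idx for idx, _ in ranking[:2]]  # truncated system keeps the top 2
print(ranking, top_k)
```

Truncating the ranked list is what removes low-scoring components, including artifact-dominated PCs, from the final feature set.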

One of the main considerations in developing an online system is computational speed. It is instructive to review the computational loads of particular feature types in conjunction with how effectively they discriminate among rest and task states. For example, **Figure 10** shows that PAC is the least discriminating EEG feature. This may be important in designing a compact and efficient detector, as PAC is also the most computationally time-consuming feature. By contrast, the most effective EEG feature (PSD) was also the fastest to compute. In our study, the central processing unit (CPU) time required for computing PSD, PLV, PAC, and Asym\_PSD were, respectively, 0.1, 14.3, 44.4, and 0.2 s. The CPU times required for other features were as follows: HbO/R Amp. and HbO-HbR Corr. (0.1 s), HbO/R slope (14.3 s), Std., Skew. and Kurt. collectively (3.3 s), and NVO and NVR (3.4 s).

Our results suggest that Hybrid outperforms the uni-modal systems for each subject (**Tables 2, 3**), every classification problem (**Tables 2, 3**), every number of features (**Figure 11**), and every window size (**Figure 12**). This could have been due to the neurovascular features that the uni-modal systems do not contain. NV features did have a higher classification performance than any of the fNIRS-based feature subgroups. However, **Figure 10** indicates that such features contribute little, if anything, to the accuracy (two rightmost bars) after the other EEG and fNIRS features have been pooled. The likely explanation is instead related to inter-subject variability. We have found that the top-ranked features (in terms of R<sup>2</sup>) tend to differ among subjects. Although the EEG frequency band power (especially in the alpha range) tended to play an important role for most subjects, for other subjects other feature types, fNIRS- or Hybrid-based, dominated the top ranks. An example of this is provided in **Table 1**, where the third subject's most discriminating PCs were neurovascular. EEG and fNIRS use different physical processes for detection, and the underlying physiology that they detect is different. Hence deficiencies such as artifacts, weak sensor coupling, or subject variability leading to a weak signal would selectively affect only one modality rather than both. The Hybrid advantage may be associated primarily with the complementary nature of the individual modalities.

**Figure 12** indicates that accuracy can be increased by using larger windows. But this presents a tradeoff between accuracy and rapid detection. The windows with the highest accuracy were 20 s long and may be impractical for some applications if the online response to rapid changes in MWL is desirable. As window size increases, although the amount of information per window likely increases, the number of windows available for training the classifier decreases. Fewer training data are expected to cause the classifiers to underperform (Grimes et al., 2008). The decline in accuracy in **Figure 12** for windows >20 s may be due to the excessively small number of data available for training.
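The trade-off between window length and the amount of training data can be made concrete with a short calculation. The sketch assumes non-overlapping windows cut from a recording of fixed length; the 200 s duration used in the example is hypothetical and is not the duration used in the study.

```python
def n_windows(record_s: float, window_s: float, step_s: float = None) -> int:
    """Number of windows of length window_s that fit in record_s seconds.

    step_s is the hop between window starts; by default windows do not overlap.
    """
    step = window_s if step_s is None else step_s
    if window_s > record_s:
        return 0
    return int((record_s - window_s) // step) + 1


# Hypothetical 200 s of labelled data per condition.
for w in (5, 10, 20, 25):
    print(f"{w:>2} s windows -> {n_windows(200, w)} training samples")
```

Doubling the window length halves the number of training samples, so beyond some length the loss of samples can outweigh the extra information per window.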

A handful of studies on concurrent EEG and fNIRS during WM tasks have been previously published. Hirshfield et al. (2009) combined an 8-channel fNIRS covering the forehead with a 32-channel whole-head EEG in N = 4 subjects as they performed a counting and mental arithmetic task with adjustable difficulty. They used separate classifiers for the fNIRS and EEG (k-nearest neighbor and Naive Bayes classification, respectively) and obtained maximum accuracies of 64% (fNIRS) and 82% (EEG). They did not attempt to use the multi-modal data concurrently. The generally higher accuracy of the EEG is consistent with our results. Their overall lower accuracies relative to our results may be due to the relatively short 2 s feature extraction window. Liu Y. et al. (2013) used a 16-optode fNIRS system covering the forehead and 28 EEG sensors at the standard 10–20 sites, with N = 16 subjects performing an n-back task. They found significant correlations between WM load and some EEG frequency band powers as well as HbO and HbR, but did not attempt classification. Their study focused on discovering neural correlates of the effects of practice time on performance. Coffey et al. (2012) recorded three fNIRS channels over the left forehead together with 8 EEG electrodes placed mainly in the frontal and central areas, from N = 10 subjects in an n-back task. They extracted EEG frequency band power and fNIRS Hb amplitude features from 5 s windows and employed them in linear discriminant analysis classifiers. They report maximum accuracies of 89.6% (EEG), 79.7% (fNIRS), and 91.0% (Hybrid). Their results differed from ours in that in some subjects all their systems had very low accuracies and their Hybrid accuracies were not always higher than those of both uni-modal systems. However, the fact that EEG generally had the higher uni-modal accuracy and that Hybrid could attain the highest observed accuracy were consistent with our findings. The differences from our results could be attributed to the relatively lower number of sensors and fewer types of features they employed.

Acquiring the very low-frequency (VLF) oscillations (<0.5 Hz) in the EEG signal requires highly specialized amplifiers (DC-coupled, high input impedance, high DC stability, and a wide dynamic range; Demanuele et al., 2007). In addition, VLF oscillations are known to be linked with specific pathologies such as epileptic seizures or attention deficit hyperactivity disorder (Steriade et al., 1993; Vanhatalo et al., 2004; Demanuele et al., 2007) that are not within the range of interest in this study. On the other hand, some studies (Gevins et al., 1997; Berka et al., 2007; So et al., 2017) demonstrated EEG within the gamma range as a biomarker for discriminating between different cognitive states. We defined the band-pass filter cutoff frequencies (0.5–80 Hz) based on these criteria, although we did not consider gamma-range features in the feature extraction stage and leave them for future work.

The present study had several limitations which we have not directly addressed due to constraints of available time or effort. Firstly, the group of subjects included only one female. This may have been due to the demographics of the subjects, who happened to be interested in volunteering for our study. In addition to the neural correlates of MWL, we have recorded the subjects' performance characteristics. However, it may prove insightful to collect data on the MWL by using additional techniques such as self-reporting, which was not done in this study. In some studies for assessment of MWL participants filled out the NASA Task Load Index (NASA TLX) questionnaire (Hart and Staveland, 1988) to provide a subjective evaluation of the mental demand induced by different levels of task difficulty. In this study, we implicitly used the assumption that an increase in the level of task difficulty will result in a higher MWL. This can be also considered in future studies. In addition, it is possible that during the course of an experiment the subjects' performance and MWL change through training effects. Studying the performance and neural correlates of MWL for subsets of our data could reveal differences in the beginning and at the end of the study. This would also require an additional investigation of statistical validity, and was not attempted. The statistical significance of the results of our study was demonstrated through a two-way ANOVA that showed significant differences in the accuracy of the Hybrid v uni-modal systems. However, we have not investigated whether a smaller group of subjects would still yield a significant result. We have investigated the capabilities of various subsets of the types of features that were available. It would also be illuminating to investigate the classification accuracy of subsets of the full array of our sensors. Such information can help design more compact headsets and is the subject of an ongoing study. 
The headset we used is lightweight and no discomfort was reported by any of the subjects. However, wearing it may nevertheless affect performance, and this could be revealed in a parallel set of experiments which we have not done. The primary goal of our study was to apply machine learning techniques in discriminating levels of MWL. We used multiple statistical techniques to ensure the statistical significance of the accuracy values that we obtained for such discrimination. Our observations regarding the range of changes of Hb are therefore only qualitative and observational, serving to ensure that our results are consistent with expectations.

In this study, we have taken steps toward investigating the EEG+fNIRS feature extraction and analysis methods by using a popular WM task. We anticipate and hope that converging efforts in Hybrid hardware integration (Safaie et al., 2013) and data analysis (Biessmann et al., 2011; Keles et al., 2016), potentially based on detailed knowledge of underlying physiology (Bari et al., 2012; Mandrick et al., 2016a), will lead to more effective passive BMIs and other applications in neuroergonomics.

#### AUTHOR CONTRIBUTIONS

HA and AO designed the study. HA collected and analyzed the data. HA, AO, and MG wrote the manuscript.

### FUNDING

This work is based partly on support by the National Science Foundation I/UCRC for Cyber-Physical Systems for the Hospital Operating Room under Grant no. IIP-1266334 and by industry partners. We would also like to thank the Department of Biomedical Engineering and the Cullen College of Engineering at University of Houston for its financial support (Award no. R413022).


**Conflict of Interest Statement:** AO participated in the development of the wireless portable EEG device (microEEG), which was used in this research. He holds a financial interest in Bio-Signal Group which is the maker of microEEG.

The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Aghajani, Garbey and Omurtag. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multisubject "Learning" for Mental Workload Classification Using Concurrent EEG, fNIRS, and Physiological Measures

Yichuan Liu1, 2, Hasan Ayaz 1, 2, 3, 4 \* and Patricia A. Shewokis 1, 2, 5

*<sup>1</sup> School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States, <sup>2</sup> Cognitive Neuroengineering and Quantitative Experimental Research Collaborative, Drexel University, Philadelphia, PA, United States, <sup>3</sup> Department of Family and Community Health, University of Pennsylvania, Philadelphia, PA, United States, <sup>4</sup> Division of General Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, United States, <sup>5</sup> Nutrition Sciences Department, College of Nursing and Health Professions, Drexel University, Philadelphia, PA, United States*

An accurate measure of mental workload level has diverse neuroergonomic applications ranging from brain computer interfacing to improving the efficiency of human operators. In this study, we integrated electroencephalogram (EEG), functional near-infrared spectroscopy (fNIRS), and physiological measures for the classification of three workload levels in an n-back working memory task. A significantly better than chance level classification was achieved by EEG-alone, fNIRS-alone, physiological-alone, and EEG+fNIRS based approaches. The results confirmed our previous finding that integrating EEG and fNIRS significantly improved workload classification compared to using EEG-alone or fNIRS-alone. The inclusion of physiological measures, however, did not significantly improve EEG-based or fNIRS-based workload classification. A major limitation of currently available mental workload assessment approaches is the requirement to record lengthy calibration data from the target subject to train workload classifiers. We show that by learning from the data of other subjects, workload classification accuracy can be improved, especially when the amount of data from the target subject is small.

Keywords: fNIRS, EEG, heart rate variability, respiration rate, n-back, mental workload, multimodal fusion, brain computer interface

#### Edited by:

*Stephen Fairclough, Liverpool John Moores University, United Kingdom*

#### Reviewed by:

*Noman Naseer, Air University, Pakistan*

*Jochem W. Rieger, University of Oldenburg, Germany*

\*Correspondence:

*Hasan Ayaz hasan.ayaz@drexel.edu*

Received: *01 May 2017* Accepted: *12 July 2017* Published: *27 July 2017*

#### Citation:

*Liu Y, Ayaz H and Shewokis PA (2017) Multisubject "Learning" for Mental Workload Classification Using Concurrent EEG, fNIRS, and Physiological Measures. Front. Hum. Neurosci. 11:389. doi: 10.3389/fnhum.2017.00389*

## INTRODUCTION

Mental workload refers to the cognitive and psychological effort required to complete given tasks. Continuous evaluation of mental workload enables real-time adjustment of the task load assigned to human operators so that their workload can be kept at a moderate level for improving human performance (Parasuraman et al., 1992; Parasuraman, 2003). Studies have thus far mainly decoded human workload levels from electroencephalogram (EEG) measures of brain activity (Gevins et al., 1998; Brouwer et al., 2012). Cerebral hemodynamics have recently gained attention for applications in brain-computer interfaces (Naseer and Hong, 2015) and for the decoding of mental workload level with the emergence of the portable measurement technique known as functional near-infrared spectroscopy (fNIRS) (Sassaroli et al., 2008; Ayaz et al., 2012; Herff et al., 2014). Previous studies have adopted a combination of EEG and non-brain measures such as heart rate variability, respiration rate, and eye movement (Hankins and Wilson, 1998; Wilson and Russell, 2003; Fairclough, 2009) for mental workload assessment. Moreover, results from our previous study suggest that combining EEG and fNIRS yields workload classification accuracies that outperform the EEG-alone and fNIRS-alone results (Liu et al., 2017).

Before mental workload can be decoded from brain and body signals, it is typical that a time-consuming calibration process is required to derive a decoder for each individual operator. This is primarily due to the challenge that psychophysiological signals vary considerably between different people and over time. In the traditional calibration approach, lengthy psychophysiological signals (i.e., calibration data) need to be recorded from an operator so that a decoder can learn both the signal patterns specific to this operator and the variations of these patterns over time.

This problem is not unique to mental workload decoding. A lengthy calibration process is also required to decode other types of mental activities such as motor imagery (Blankertz et al., 2006). To address this problem for motor imagery decoding, Lotte and Guan (2010) proposed an alternative calibration approach in which a decoder is derived using calibration data from both the target subject and some other subjects. They argued that, despite the large inter-subject variations, similar signal patterns can be found across some individuals, so that less calibration data from the target subject is required to derive a decoder. This approach has been further investigated by other researchers, with positive results (Devlaminck et al., 2011; Samek et al., 2013). An alternative approach to learning from other subjects is to identify which models incorporate the inter-subject variations from a large database (Fazli et al., 2009).

For mental workload decoding, only one preliminary study to date has explored the reduction of calibration time, using a simulated aviation task (Wang et al., 2012). The authors showed that calibrating decoders using data from both the target subject and a pool of other subjects did not degrade the decoding accuracies compared to using data only from the target subject. However, no benefit of including data from the other subjects was shown.

In this study, the integration of EEG, fNIRS, and physiological signals was investigated for the classification of three workload levels induced by the n-back working memory task. The objective was two-fold: first, to compare the classification performance using the different modalities and their combinations; and, second, to investigate learning in a workload decoder using data from other subjects as an approach to improve workload classification performance when the sample size of the target subject is small.

### MATERIALS AND METHODS

### Participants

A total of 25 volunteers were recruited for participation in this study. Two of the participants were unable to finish the protocol. Another two participants were rejected from the analyses due to excessive movement. Consequently, a total of 21 valid subjects [all right-handed, 12 female, ages 25.9 ± 4.9 (mean ± SD)] were included in the analysis. The Edinburgh Handedness Inventory (Oldfield, 1971) showed that participants were right-handed; the average Laterality Quotient (L.Q.) and Decile were 78.7 ± 22.2 and 6.2 ± 3.4, respectively. Participants self-reported that they had their vision corrected to 20/20, did not have any history of neurological or psychiatric disorders, and did not take any medication known to affect brain activity. Prior to the experiment, participants gave written informed consent for their participation in the study. The protocol was approved by the Institutional Review Board of Drexel University.

### Recording

EEG, fNIRS, Heart rate, R-R interval, breath rate, and breath depth were simultaneously recorded during data collection. **Figure 1** shows an overview of the recording setup.

**EEG** was recorded using a Neuroscan Nuamp amplifier by Compumedics Neuroscan (http://compumedicsneuroscan.com/) from 26 locations according to the International 10–10 system (see **Figure 2**). Three additional electrodes, one placed above the nasion and the other two placed below the left and right outer canthi, were used for electrooculography (EOG) artifact correction according to Schlögl et al. (2007). All 29 channels (26 EEG + 3 EOG) were band-pass filtered at 0.1–100 Hz, digitally sampled at 500 Hz, and referenced to linked mastoids.

**Prefrontal fNIRS** were recorded from the forehead at a 2 Hz sampling rate using a 16-optode continuous wave fNIRS system developed at Drexel University (Ayaz et al., 2012, 2013) and manufactured by fNIR Devices LLC (http://fnirdevices.com/). The sensor included 4 light sources (LED) that can emit 730 and 850 nm wavelength light and 10 photon detectors (See **Figure 3**). The distance between light sources and detectors was 2.5 cm which allowed for a ∼1.2 cm penetration depth. To ensure repeatable sensor placement, the center of the sensor was aligned to the midline and the bottom of the sensor touched the participant's eye brow.

**Systemic NIR** were recorded from the right cheek at a 4 Hz sampling rate using a 2-optode continuous wave wireless fNIRS system developed at Drexel University (Ayaz et al., 2013) and manufactured by fNIR Devices LLC. The systemic NIR was not used in the current study.

**Heart rate, R-R interval, breath rate, and breath depth** were recorded using a Zephyr Bioharness chest band (https://www.zephyranywhere.com/).

FIGURE 2 | EEG channels according to the International 10–10 system. The 26 recorded channels were highlighted.

### Experiment

Subjects sat comfortably in front of an LED screen. Sequences of capitalized letter stimuli (∼1.7° visual angle) were shown at the center of the screen. The BCI2000 software was employed for stimulus delivery and for the recording of EEG and behavioral data (Schalk et al., 2004). Each letter was displayed for a duration of 480 ms and the inter-stimulus interval (ISI) was 2,520 ms. Subjects were instructed to click a keypad button with their right index finger in response to a "match stimulus" and click another keypad button with their right middle finger in response to a "non-match stimulus" as fast as possible. There were three workload conditions. In the 0-back condition, the letter "X" was the match. In the 2-back condition, a letter was the match if it was shown two screens back. In the 3-back condition, a letter was the match if it was shown three screens back.

The letter stimuli were grouped into n-back blocks. Each block included 6 s of instruction, 45 s of task performance, and 15 s of fixation. The instruction period informed the subject which task (0-, 2-, or 3-back) to perform. During the task period, 15 letters were shown to the participants on the screen in a pseudo random order. Four of the letters were targets. No letters appeared more than twice in succession within a block. In the fixation period, subjects were instructed to focus their eye gaze on a white plus sign located at the center of the screen allowing fNIRS signals to return to the baseline. **Figure 4** shows the time line of a typical n-back block.

There were four recording sessions. Each session included 12 n-back blocks, 4 from each condition. Hence, there were 48 n-back blocks for the entire experiment, 16 from each condition. To reduce the correlation between adjacent samples and to balance time induced experimental factors such as fatigue across the three workload conditions, the 48 n-back blocks were grouped into 16 repetitions. Each repetition included one block from each workload conditions. The order of the blocks was further randomly shuffled so that no workload condition was repeated twice in succession within a session. Before the start of the first session, subjects practiced one block from each condition for familiarization with the procedure and an ∼5 min long EOG calibration session was performed during which subjects were instructed to rotate, blink and move (up/down, left/right) their eyes. A 5 min break was given to the subjects between the recording sessions. The entire recording time was about 1 h. **Figure 5** shows the outline of the experiment.
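The block ordering described above can be sketched in code. This is one possible realization under stated assumptions: each session holds four repetitions, each repetition is a random permutation of the three conditions, and a session order is redrawn until no condition occurs twice in succession (rejection sampling). The text does not specify how the shuffling was actually implemented, and the function names and seed are ours.

```python
import random

CONDITIONS = ("0-back", "2-back", "3-back")


def make_session(rng: random.Random, n_repetitions: int = 4) -> list:
    """Draw one session: n_repetitions shuffled triplets with no adjacent repeats."""
    while True:
        order = []
        for _ in range(n_repetitions):
            repetition = list(CONDITIONS)
            rng.shuffle(repetition)  # one block per condition in each repetition
            order.extend(repetition)
        # Accept only if no condition is immediately repeated within the session.
        if all(a != b for a, b in zip(order, order[1:])):
            return order


def make_experiment(seed: int = 0, n_sessions: int = 4) -> list:
    """Four sessions x 12 blocks = 48 blocks, 16 per condition."""
    rng = random.Random(seed)
    return [make_session(rng) for _ in range(n_sessions)]


for number, session in enumerate(make_experiment(seed=1), start=1):
    print(f"session {number}: {session}")
```

Because every consecutive triplet contains each condition exactly once, slow effects such as fatigue are spread evenly over the three workload levels.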

### EEG Signal Processing

In this work, we extracted for each EEG channel the band powers of the 1–3, 4–7, 8–12, 13–19, and 20–30 Hz bandwidths. This was performed at the single-stimulus level, forming a feature vector *f<sub>EEG</sub>* of length 6 bands × 26 channels = 156 for each of the 48 blocks × 15 stimuli = 720 sample epochs per subject.

Raw EEG and EOG signals were band-pass filtered at 1–35 Hz. A regression-based approach was adopted to reduce EOG contamination by using the calibration data recorded before the n-back sessions started (Schlögl et al., 2007). Epochs were extracted from 0 to 2.8 s and baseline corrected using −0.2 to 0 s with respect to stimulus onset. The power spectral density of each epoch was then estimated using the Multitaper method (Thomson, 1982) with 8 Discrete Prolate Spheroidal Sequences (DPSS) windows, each 3 s long, for subsequent analysis.

### fNIRS Signal Processing

The difference between the average oxygenated hemoglobin (oxy-Hb) and deoxygenated hemoglobin (deoxy-Hb) amplitudes in the (25 s, 45 s) and (−5 s, 5 s) intervals with respect to the block start was used as a feature. The features were extracted from 14 forehead optodes, forming a feature vector *f<sub>NIR</sub>* of length 14 × 2 (oxy-/deoxy-Hb) = 28 for each of the 48 sample blocks. Optodes 1 and 15 were rejected from analysis because they were over the hairline for most of the subjects. The average activation amplitude with respect to a baseline has been adopted as the feature for characterizing mental activities in many studies (Ayaz et al., 2007, 2012; Merzagora et al., 2009; Herff et al., 2012; Liu et al., 2013). This feature extraction strategy was shown to be more reliable than other possible feature choices in our preliminary analysis.
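The amplitude-change feature can be illustrated with a short sketch for one optode and one chromophore. It assumes a signal sampled at 2 Hz whose first sample lies 5 s before the block start, and half-open averaging intervals; these indexing details, the synthetic signal, and the function names are assumptions made for illustration.

```python
def window_mean(signal, fs, t_first, t_start, t_end):
    """Mean of signal over [t_start, t_end) seconds; t_first is the time of sample 0."""
    i0 = round((t_start - t_first) * fs)
    i1 = round((t_end - t_first) * fs)
    if i0 < 0 or i1 > len(signal) or i0 >= i1:
        raise ValueError("requested window lies outside the signal")
    segment = signal[i0:i1]
    return sum(segment) / len(segment)


def hb_amplitude_change(signal, fs=2.0, t_first=-5.0):
    """Late-task mean (25-45 s) minus block-onset baseline mean (-5 to 5 s)."""
    task = window_mean(signal, fs, t_first, 25.0, 45.0)
    baseline = window_mean(signal, fs, t_first, -5.0, 5.0)
    return task - baseline


# Synthetic oxy-Hb trace from -5 s to 45 s at 2 Hz (100 samples):
# flat baseline, an intermediate rise, then a plateau late in the task.
oxy_hb = [0.1] * 20 + [0.3] * 40 + [0.6] * 40
print(round(hb_amplitude_change(oxy_hb), 3))
```

Repeating this for oxy-Hb and deoxy-Hb over the 14 retained optodes would give the 28 values of the block-level feature vector.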

Raw light intensities recorded from prefrontal fNIRS were first visually inspected to reject those optodes with inadequate contact or those positioned over the hairline. Raw light intensities were converted into concentration changes in oxy-hemoglobin (oxy-Hb) and deoxy-hemoglobin (deoxy-Hb) using the modified Beer-Lambert law (Cope and Delpy, 1988). Oxy-Hb/deoxy-Hb signals are band-pass filtered at 0.005–0.1 Hz for reducing artifacts from physiological signals (e.g., heartbeat and respiration) before subsequent analysis.
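For reference, the conversion step named above can be written out. The following is a sketch of the modified Beer-Lambert law in its common two-chromophore form; the symbols (optical density change $\Delta OD$, extinction coefficients $\varepsilon$, source-detector distance $d$, differential path-length factor $DPF$) are notation introduced here for illustration, the base of the logarithm depends on the convention used for $\varepsilon$, and the coefficient values actually used are not given in the text.

$$\Delta OD(\lambda) = -\log\frac{I(\lambda)}{I_0(\lambda)} = \left[\varepsilon_{\text{oxy}}(\lambda)\,\Delta c_{\text{oxy}} + \varepsilon_{\text{deoxy}}(\lambda)\,\Delta c_{\text{deoxy}}\right] d \cdot DPF(\lambda)$$

With measurements at two wavelengths (here $\lambda_1 = 730$ nm and $\lambda_2 = 850$ nm), the two concentration changes follow from a 2 × 2 linear system:

$$\begin{pmatrix}\Delta c_{\text{oxy}}\\ \Delta c_{\text{deoxy}}\end{pmatrix} = \begin{pmatrix}\varepsilon_{\text{oxy}}(\lambda_1) & \varepsilon_{\text{deoxy}}(\lambda_1)\\ \varepsilon_{\text{oxy}}(\lambda_2) & \varepsilon_{\text{deoxy}}(\lambda_2)\end{pmatrix}^{-1}\begin{pmatrix}\Delta OD(\lambda_1)/\left[d \cdot DPF(\lambda_1)\right]\\ \Delta OD(\lambda_2)/\left[d \cdot DPF(\lambda_2)\right]\end{pmatrix}$$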

### Heart Rate Variability (HRV) Processing

Heart rate variability (HRV) was estimated according to Clifford (2002) and Gritti et al. (2013). The R-to-R interval recorded by the Bioharness was first interpolated to form a 4 Hz time series. Epochs were extracted for each n-back block using a (0 s, 45 s) time window with respect to the onset of the first stimulus, and the power spectral density (PSD) was estimated using a single 45 s long DPSS window (Thomson, 1982) to evaluate the variability of the R-to-R interval. The average PSD in the bandwidths 0.02–0.06 Hz (mainly originating from body temperature regulation), 0.07–0.14 Hz (related to regulation of blood pressure), and 0.15–0.5 Hz (momentary respiratory influences on heart rate) was extracted as suggested by Scerbo et al. (2001).
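The HRV pipeline can be sketched end to end in a few lines. Two simplifications are assumptions of this sketch and not the method described above: linear interpolation is used for resampling (the interpolation scheme is not specified in the text), and a plain untapered periodogram stands in for the single-DPSS-window estimate. Function names and the synthetic data are hypothetical.

```python
import math

BANDS = {"low": (0.02, 0.06), "mid": (0.07, 0.14), "high": (0.15, 0.5)}


def resample_rr(beat_times, rr, fs=4.0):
    """Linearly interpolate R-R intervals (s), stamped at beat_times (s), onto an fs-Hz grid."""
    n_out = int((beat_times[-1] - beat_times[0]) * fs) + 1
    out, j = [], 0
    for k in range(n_out):
        t = beat_times[0] + k / fs
        # Advance to the pair of beats that brackets time t.
        while j + 2 < len(beat_times) and beat_times[j + 1] <= t:
            j += 1
        t0, t1 = beat_times[j], beat_times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(rr[j] + w * (rr[j + 1] - rr[j]))
    return out


def periodogram(x, fs):
    """Untapered periodogram of the mean-removed series (stand-in for a DPSS estimate)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(xc))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(xc))
        freqs.append(k * fs / n)
        psd.append((re * re + im * im) / (fs * n))
    return freqs, psd


def band_means(freqs, psd):
    """Average PSD inside each of the three HRV bands."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        vals = [p for f, p in zip(freqs, psd) if lo <= f <= hi]
        out[name] = sum(vals) / len(vals) if vals else float("nan")
    return out


# Synthetic 45 s series at 4 Hz with a 0.1 Hz oscillation (blood-pressure band).
rr_uniform = [0.8 + 0.05 * math.sin(2 * math.pi * 0.1 * i / 4.0) for i in range(180)]
print(band_means(*periodogram(rr_uniform, 4.0)))
```

With a 45 s window the frequency resolution is about 0.022 Hz, so the lowest band is covered by only two frequency bins.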

In addition to HRV, the average of heart rate, breath rate, and breath depth for each n-back block recorded by Bioharness were extracted as features.

### Multimodality Workload Classification

We considered the three-class classification problem of 3- vs. 2- vs. 0-back. A multiclass linear discriminant analysis (LDA) was adopted for classification. To prevent a covariance matrix from becoming singular due to small sample size, an automatic shrinkage using the Ledoit-Wolf lemma (Schafer and Strimmer, 2005) was adopted. The following eight different classifications were considered dependent on the adopted modalities (See **Figure 6**):

1) **EEG-alone**. An LDA was trained to classify EEG features at the single stimulus level (3 s time window with respect to a single stimulus). At the block level (45 s time window, including 15 stimuli), the LDA predicted probabilities for each of the 15 stimuli were Naïve-Bayes combined (Kuncheva et al., 2001) to produce $P(L|f_{EEG})$, where L ∈ {0-back, 2-back, 3-back}, which determined the predicted workload levels. More specifically, in Naïve-Bayes fusion, the product of the predicted probabilities from the 15 stimuli was calculated and normalized as follows:

$$P\left(L|f_{EEG}\right) = \frac{\prod_{i=1}^{15} P\left(L|f_{EEG}^{i}\right)}{\prod_{i=1}^{15} P\left(\text{0back}|f_{EEG}^{i}\right) + \prod_{i=1}^{15} P\left(\text{2back}|f_{EEG}^{i}\right) + \prod_{i=1}^{15} P\left(\text{3back}|f_{EEG}^{i}\right)}\tag{1}$$

level (4 s epoch). The output probabilities from the 15 stimuli (of a block) were Naïve-Bayes combined to produce $P(L|f_{EEG})$. A second LDA was trained to classify fNIRS features extracted from each block (45 s epoch) to produce $P(L|f_{NIR})$. A third LDA was trained to classify PHY features extracted from each block to produce $P(L|f_{PHY})$. $P(L|f_{EEG})$, $P(L|f_{NIR})$, and $P(L|f_{PHY})$ were Naïve-Bayes combined for EEG+fNIRS+PHY classification. All of the above procedures were conducted on calibration data. The LDA classifiers were then applied to the testing data to evaluate the classification performance.


$$\mathcal{L} = \arg\max_{L} \left[ P\left( L | f_{EEG} \right) \cdot P\left( L | f_{NIR} \right) \right] \tag{2}$$
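Equations (1) and (2) can be sketched in a few lines. The probability values below are synthetic and the function name is hypothetical; summing log-probabilities instead of multiplying raw probabilities is an implementation choice of this example (it avoids numerical underflow) and yields the same normalized result.

```python
import numpy as np

LEVELS = ("0-back", "2-back", "3-back")

def naive_bayes_fuse(probs):
    """Normalized product of class probabilities over rows.

    probs: array-like of shape (n_sources, n_classes); each row is a
           probability vector over the workload levels.
    """
    log_p = np.log(np.clip(probs, 1e-12, 1.0)).sum(axis=0)
    fused = np.exp(log_p - log_p.max())
    return fused / fused.sum()

# Hypothetical single-stimulus LDA outputs for one block (15 stimuli x 3 levels).
rng = np.random.default_rng(0)
p_stimuli = rng.dirichlet([1.0, 2.0, 4.0], size=15)
p_eeg = naive_bayes_fuse(p_stimuli)                 # Equation (1)

# Hypothetical block-level fNIRS posterior.
p_nir = np.array([0.2, 0.3, 0.5])
p_joint = naive_bayes_fuse(np.vstack([p_eeg, p_nir]))
predicted = LEVELS[int(np.argmax(p_joint))]         # Equation (2)
print(predicted)
```

Normalizing the product does not change the arg max in Equation (2); it only keeps the fused output interpretable as a probability vector.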


#### Learning from Other Participants

We consider the following calibration approaches:


calibration time of the target subject (Lotte and Guan, 2010). For each subject, the features were first z-score transformed to reduce the between-subject variations. For the target subject, only the training data were used for estimating the mean and variance of each feature. The mean and covariance matrix of the feature vector of each subject were then estimated. Finally, the means and covariance matrices from all subjects were combined according to Equations (3) and (4).

$$\mu_i = (1 - \lambda)\,\mu_i^t + \lambda \frac{1}{|\mathbb{S}_t(\Omega)|} \sum_{j \in \mathbb{S}_t(\Omega)} \mu_i^j \tag{3}$$

$$\Sigma = (1 - \lambda)\,\Sigma^t + \lambda \frac{1}{|\mathbb{S}_t(\Omega)|} \sum_{j \in \mathbb{S}_t(\Omega)} \Sigma^j \tag{4}$$

where $\mu_i^t$ and $\Sigma^t$ are the mean and covariance estimated from the target subject, $\mathbb{S}_t(\Omega)$ is a set of subjects that does not include the target subject (leave-one-subject-out), and λ ∈ [0, 1] is a parameter representing the weight of other subjects. In this study, λ was empirically chosen to be 0.5.

When the sample size from the target subject is small, we expect that an improved classification performance can be achieved by incorporating the mean and covariance matrices estimated from other subjects.
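A minimal sketch of Equations (3) and (4) is given below. The function name and the toy numbers are hypothetical; the same convex combination is applied to the means and to the covariance matrices.

```python
import numpy as np

def pool_with_other_subjects(target_stat, other_stats, lam=0.5):
    """Blend a target-subject statistic with the average over other subjects.

    target_stat: mean vector or covariance matrix from the target subject.
    other_stats: list of the corresponding statistics from the other subjects.
    lam: weight of the other subjects, in [0, 1] (0.5 in this study).
    """
    other_mean = np.mean(np.stack(other_stats), axis=0)
    return (1.0 - lam) * target_stat + lam * other_mean

# Toy example with two features and two other subjects.
mu_target = np.array([0.0, 0.0])
mu_others = [np.array([2.0, 2.0]), np.array([4.0, 4.0])]
mu = pool_with_other_subjects(mu_target, mu_others)        # Equation (3)

cov_target = np.eye(2)
cov_others = [2.0 * np.eye(2), 4.0 * np.eye(2)]
cov = pool_with_other_subjects(cov_target, cov_others)     # Equation (4)
print(mu, cov, sep="\n")
```

With λ = 0 the estimate reduces to the target subject alone, and with λ = 1 it relies entirely on the other subjects.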

#### Performance Evaluation

A repeated learning-testing method (Burman, 1989) was adopted for performance evaluation. The procedure was done as follows: For subject j = 1, . . . , 21:

	- a. Data splitting:

The data of the target subject j were randomly split into a calibration set and a testing set three times with varying calibration sample size:

	- i. For traditional calibrations, the calibration sets were used to train the classifiers using LDA.
	- ii. For the proposed calibrations, the calibration sets and data from all other subjects were used to train the classifiers.
	- i. The testing sets were used to evaluate the classification accuracy.

#### Multiple Comparisons

To correct for multiple testing, we adopted false discovery rate (FDR) control with the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). Unless otherwise specified, we rejected null hypotheses for FDR q < 0.05.
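For reference, the Benjamini-Hochberg step-up procedure can be sketched as follows. The function name and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with q * k / m.
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject every hypothesis up to the largest k that passes.
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

# The two smallest p-values miss their own thresholds, but all three are
# rejected because the third one passes (step-up behavior).
mask = benjamini_hochberg([0.02, 0.03, 0.035, 0.5], q=0.05)
print(mask)
```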

#### RESULTS

#### Behavior Performance

To verify the successful manipulation of workload level with the adopted protocol, we evaluated the following three behavior measures:

1) **d-prime**, which was the key-press true positive rate minus false positive rate:

$$d\text{-prime} = \frac{N(\text{stim} = \text{match and responded} = \text{match})}{N(\text{stim} = \text{match})} - \frac{N(\text{stim} = \text{non-match and responded} = \text{match})}{N(\text{responded} = \text{match})}\tag{5}$$

where N(event) is the number of cases of an event, stim = match/non-match represents the true stimulus type, and responded = match/non-match represents the subject's response.
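The measure can be computed directly from trial-wise labels. The sketch below follows Equation (5) as printed, so the second term is normalized by the number of match responses; the function name and the toy trial sequence are hypothetical.

```python
import numpy as np

def d_prime_eq5(stim_is_match, responded_match):
    """Behavioral measure of Equation (5) for one block.

    stim_is_match: boolean sequence, True where the stimulus is a match.
    responded_match: boolean sequence, True where the subject responded "match".
    Assumes at least one match stimulus and at least one match response.
    """
    stim = np.asarray(stim_is_match, dtype=bool)
    resp = np.asarray(responded_match, dtype=bool)
    first_term = np.sum(stim & resp) / np.sum(stim)
    second_term = np.sum(~stim & resp) / np.sum(resp)
    return first_term - second_term

# Toy block: 4 match stimuli (3 detected) and 1 match response to a non-match.
stim = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
resp = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
value = d_prime_eq5(stim, resp)
print(value)
```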


For all three behavioral measures, one-way repeated measures ANOVAs revealed a significant effect of workload, and post-hoc tests revealed significant differences (FDR q < 0.05) between all three workload levels, suggesting the successful manipulation of workload level (**Figure 7**). The generalized eta-squared (η<sup>2</sup>) as reported by the ezANOVA library of R was used (Bakeman, 2005).
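For a one-way within-subject design, the F-ratio and the generalized eta-squared can be computed from the usual sums of squares, as in the sketch below. This is an illustration in plain NumPy, not the ezANOVA implementation; the function name and the toy data are hypothetical, and no sphericity correction is applied.

```python
import numpy as np

def rm_anova_oneway(data):
    """One-way repeated measures ANOVA.

    data: array of shape (n_subjects, n_conditions).
    Returns (F, df_effect, df_error, generalized_eta_squared).
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * np.sum((data.mean(axis=0) - grand) ** 2)
    ss_subj = k * np.sum((data.mean(axis=1) - grand) ** 2)
    ss_error = np.sum((data - grand) ** 2) - ss_cond - ss_subj
    df_effect, df_error = k - 1, (k - 1) * (n - 1)
    f_ratio = (ss_cond / df_effect) / (ss_error / df_error)
    # Generalized eta-squared for a single within-subject factor:
    # the effect relative to the effect plus all subject-related variance.
    eta_g = ss_cond / (ss_cond + ss_subj + ss_error)
    return f_ratio, df_effect, df_error, eta_g

# Toy data: 3 subjects x 3 workload levels.
scores = [[1, 2, 3], [2, 3, 5], [3, 5, 6]]
f_ratio, df1, df2, eta_g = rm_anova_oneway(scores)
print(f_ratio, df1, df2, eta_g)
```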

#### Effect of Workload on EEG Band Powers

**Figure 8** depicts the topographic map of EEG band powers. A repeated measures ANOVA was applied to assess the effect of workload on the six midline channels Fz, FCz, Cz, CPz, Pz, and Oz, and the results are shown in **Table 1**. For delta activity, a significant effect of workload was found at Cz and CPz (FDR q < 0.05), where the delta band power decreased with increased workload. Workload had a significant effect on theta band power at channels Fz and Cz (FDR q < 0.05). At Fz, theta band power increased with increased workload, whereas at Cz, theta band power decreased with increased workload. Workload had a significant effect on alpha band power at all six midline channels (FDR q < 0.05); at all six channels, alpha band power decreased with increased workload. Workload also had a significant effect on low beta band power at all six midline channels (FDR q < 0.05), where low beta band power decreased with increased workload. Workload had a significant effect on high beta band power at Fz, FCz, Cz, CPz, and Pz (FDR q < 0.05). At these five channels, high beta band power decreased with increased workload.

The significant effects found in the low and high beta bands may be confounded by motor responses, as the 13–30 Hz range is typically associated with motor responses (Pfurtscheller et al., 1996, 2006). To investigate the effect of motor responses, a 2 (key-press type: middle/index finger) × 3 (workload level: 0-/2-/3-back) ANOVA with repeated measures on both factors was conducted using the number of key-press responses as the dependent variable. No significant effect of workload level [F(2, 42) = 0.83, p = 0.44, η<sup>2</sup> < 0.01] or of the interaction between key-press type and workload level [F(2, 42) = 2.01, p = 0.15, η<sup>2</sup> = 0.03] was found. Means and standard deviations of the number of key-presses within each block across the 21 participants for each of the three workload conditions can be found in Supplementary Table 1.

#### Effect of Workload on fNIRS Measures

**Figure 9** shows the results from oxy-Hb. A common average reference approach was applied, in which the average oxy-Hb across all optodes was subtracted from each individual optode to reduce the effect of systemic physiological artifacts. Repeated measures ANOVA revealed a significant effect of workload at optodes 5, 7, 8, and 14. Post-hoc tests revealed a significant 3-back > 0-back and 2-back > 0-back at optode 14. At optode 7, there was a significant effect of 3-back < 0-back and 3-back < 2-back. At optode 8, there was a significant effect of 3-back < 0-back and 2-back < 0-back. No significant post-hoc test results were detected at optode 5.
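The common average reference step amounts to subtracting, at every time point, the mean across optodes. A minimal sketch with synthetic data (the array layout and names are assumptions):

```python
import numpy as np

def common_average_reference(oxy):
    """Subtract the across-optode mean from each optode at every time point.

    oxy: array of shape (n_samples, n_optodes).
    """
    return oxy - oxy.mean(axis=1, keepdims=True)

# Synthetic data: a systemic component shared by all 14 optodes plus local noise.
rng = np.random.default_rng(1)
systemic = rng.normal(size=(200, 1))
local = 0.1 * rng.normal(size=(200, 14))
referenced = common_average_reference(systemic + local)
```

After referencing, the component shared by all optodes is removed, so the remaining signal reflects deviations of each optode from the prefrontal average.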

### Effect of Workload on Physiological Measures

The effects of workload on the physiological measurements are shown in **Figure 10**. For each subject, the average heart rate in the 0-back, 2-back, and 3-back blocks was calculated, and the three heart rate values were z-score standardized before analysis. The same preprocessing procedure was applied to the other physiological measurements: breath rate, breath amplitude, and HRV. A repeated measures ANOVA revealed a significant effect of workload on breath amplitude, breath rate, heart rate, HRV 0.07–0.14 Hz, and HRV 0.15–0.5 Hz (FDR q < 0.05). Post-hoc tests revealed significant differences between 3-back and 0-back and also between 2-back and 0-back for breath amplitude, breath rate, heart rate, HRV 0.07–0.14 Hz, and HRV 0.15–0.5 Hz (FDR q < 0.05).

#### Workload Classification

Workload classification results are shown in **Table 2**, **Figures 11**, **12**, and Supplementary Tables 2, 3. For all investigated approaches and all calibration sample sizes, classification accuracy was significantly better than chance level (33.3%), as revealed by one-tailed Wilcoxon signed rank tests. **Figure 11** compares the accuracy of the traditional and proposed calibration approaches. The results of the repeated measures ANOVAs indicate that the proposed calibration approach significantly outperformed the traditional calibration approach for EEG-based classification, fNIRS-based classification, physiological-based classification, and EEG-fNIRS multimodal classification (p < 0.05). The effect sizes are shown in **Table 3**. Post-hoc analysis was conducted using Wilcoxon signed rank tests with FDR correction, and the results are shown in Supplementary Table 4. For the calibration sample size of 13 min, the proposed calibration approach significantly outperformed the traditional calibration approach for EEG-alone, fNIRS-alone, PHY-alone, and EEG+fNIRS (FDR q < 0.05). For the calibration sample size of 26 min, the proposed calibration approach significantly outperformed the traditional calibration approach for EEG-alone, fNIRS-alone, and EEG+fNIRS (FDR q < 0.05). For the calibration sample size of 39 min, no significant difference in classification accuracy was found between the proposed and traditional calibration approaches for any of the four modalities.

**Figure 12** compares the classification accuracy of EEG-alone and fNIRS-alone with that of combined EEG and fNIRS. A repeated measures ANOVA revealed that EEG+fNIRS significantly outperformed EEG-alone or fNIRS-alone for both the traditional and the proposed calibration approaches (p < 0.001). For the traditional calibration approach, effect sizes dz of 0.81 [t(20) = 3.70], 0.84 [t(20) = 3.83], and 0.85 [t(20) = 3.88] were obtained for calibration sample sizes of 13, 26, and 39 min, respectively, when comparing the EEG-alone and EEG+fNIRS approaches. For the proposed calibration approach, effect sizes dz of 0.89 [t(20) = 4.07], 1.18 [t(20) = 5.43], and 0.94 [t(20) = 4.33] were obtained for calibration sample sizes of 13, 26, and 39 min, respectively, when comparing the EEG-alone and EEG+fNIRS approaches. Post-hoc analysis was performed using a Wilcoxon signed rank test with FDR correction comparing EEG-alone and EEG+fNIRS, with the results reported in Supplementary Table 5. For all three calibration sample sizes and for both traditional and proposed calibration approaches, EEG+fNIRS significantly outperformed EEG-alone (FDR q < 0.05).
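The reported effect sizes are consistent with the standard relation between Cohen's dz and the paired t statistic, dz = t / √n, where df = 20 implies n = 21 paired observations. The short check below reproduces the reported values from the reported t statistics; the function name is hypothetical.

```python
import math

def cohens_dz_from_t(t_stat, n_pairs):
    """Cohen's dz for a paired design, recovered from the paired t statistic."""
    return t_stat / math.sqrt(n_pairs)

# (t statistic, reported dz) pairs from the text; n = 21 subjects.
reported = [(3.70, 0.81), (3.83, 0.84), (3.88, 0.85),
            (4.07, 0.89), (5.43, 1.18), (4.33, 0.94)]
for t_stat, dz in reported:
    print(t_stat, round(cohens_dz_from_t(t_stat, 21), 2), dz)
```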

The effect of including a physiological-based classifier and combining it with the EEG-alone, fNIRS-alone, and EEG+fNIRS classifiers was also studied, and no significant improvement in classification was found.

#### DISCUSSION

In this study, we investigated the integration of EEG, fNIRS, and physiological measures for the classification of three workload levels. To our knowledge, this is the first study that investigated the integration of fNIRS, EEG, and physiological signals for mental workload assessment. The n-back working memory task was adopted to induce three workload levels, and the behavioral results suggested successful manipulation of the workload levels.

We first showed that in our data the EEG delta, alpha, low beta, and high beta activities decreased with increased workload levels, whereas theta activity increased with increased workload level at the frontal site Fz. The suppression of alpha power in the posterior areas and the increase of theta power in the midline frontal areas under workload match the results reported in the literature (Gevins et al., 1997). It has been reported that beta activity decreased as workload increased at the midline central site Cz (Gevins et al., 1998). A previous study also suggested that the delta band power decreased with increased workload level and that the delta band carried information needed to characterize mental workload levels (Zarjam et al., 2011). Our results match those reported in the literature. A concern is that the workload effect on beta activities found in our study may be caused by motor responses. The effect of workload and key-press type (middle/index finger) was assessed based on the number of key-presses within each block, and no significant effect of workload or of the interaction between workload and key-press type was found. It is possible that motor activities other than key-presses could be affected by workload levels (e.g., subjects may be more restless in the low workload condition), which needs to be investigated in future studies.

*F-ratio statistics and the η<sup>2</sup> effect sizes are reported. Bold highlighted results are significant after correcting for multiple comparisons using FDR (Benjamini and Hochberg, 1995) (FDR q < 0.05).*

*The effect size η<sup>2</sup> is interpreted as 0.02 = small, 0.13 = medium, and 0.26 or greater = large (Bakeman, 2005).*

For the fNIRS data, three prefrontal sites were found to be sensitive to workload changes with the 3-back task showing the highest level of activations. Previous fNIRS-based mental workload studies suggested that fNIRS was sensitive to workload changes (Ayaz et al., 2012; Fishburn et al., 2014; Herff et al., 2014). Again, our findings are consistent with the results reported in the literature.

FIGURE 10 | Effect of workload on physiological measurements. One-way repeated measures ANOVA results and the η<sup>2</sup> effect sizes with workload as the independent variable are shown. All measures except for HRV (0.02–0.06 Hz) showed a significant effect of workload (FDR *q* < 0.05). Error bars represent the bootstrapped 95% confidence interval. BRAmplitude, breath amplitude. The effect size η<sup>2</sup> is interpreted as 0.02 = small, 0.13 = medium, and 0.26 or greater = large (Bakeman, 2005).

*Mean ± standard deviation of the classification accuracies are shown.*

For the physiological data, breath amplitude, breath rate, heart rate, HRV mid band (0.07–0.14 Hz), and HRV high band (0.15–0.5 Hz) were found to be sensitive to workload changes. The suppression of HRV spectral power in the 0.07–0.14 Hz and 0.15–0.5 Hz ranges under workload has been reported in the literature (Veltman and Gaillard, 1996; Nickel and Nachreiner, 2003). These studies suggested increased blood pressure and increased heart rate under high workload. It has also been reported that breath rate increased with increased workload (Wilson and Eggemeier, 1991). Our results reflected these phenomena.

For workload classification, significantly better than chance level classification was achieved by all investigated modalities: EEG-alone, fNIRS-alone, physiological-alone, and EEG+fNIRS hybrid classification. To improve the classification accuracy when the calibration sample size is small, we proposed to calibrate classifiers using data from both the target subject and a pool of other subjects. Our results indicate that the proposed calibration approach significantly outperformed the traditional calibration approach, which only used data from the target subject to calibrate classifiers, regardless of the modality adopted. To our knowledge, this was the first study to demonstrate that learning from the data of multiple subjects outperforms learning from a single subject for mental workload decoding accuracy. In the literature, various multisubject learning approaches have been proposed for the classification of different types of tasks. For example, Lotte and Guan (2010) investigated multisubject learning for the classification of motor imagery tasks using EEG. To account for the inter-subject variability, they adopted a data-driven approach to select, for each target subject, a relevant subset of other subjects whose data can be used to improve the classification of the target subject. Reichert et al. (2014) investigated the classification of the phenomenal content of perception using fMRI. To achieve cross-subject generalization, the weights of the classifiers trained from individual subjects were combined according to the individual classifier performance. Samek et al. (2013) found that the changes between training and testing data are similar across subjects and that transferring this non-stationary information between subjects can help improve classification. Our approach differs from these approaches in that the features for each subject were standardized before training to minimize the inter-subject variability. A further improvement to our approach may be achieved by performing subject subset selection as adopted by Lotte and Guan (2010), or by weighting the mean and covariance matrices of each subject by their classification performance before applying Equations (3) and (4). Finally, in this study, the hyperparameter λ in Equations (3) and (4) was empirically chosen to be 0.5. The effect of λ on classification accuracies is provided in Supplementary Figure 1. Estimating λ based on the individual classifier performance and the number of available samples from the target subject may further improve the performance of the proposed approach.

TABLE 3 | Effect size and t-statistics of the improvement of the proposed calibration approach over the traditional approach.

Our results also suggest that EEG+fNIRS hybrid classification significantly outperformed EEG-alone or fNIRS-alone workload classification. These findings are consistent with our recent study (Liu et al., 2017) and indicate that there is complementary information about workload in EEG and fNIRS. However, the improvement of EEG+fNIRS over EEG-alone is only about 1–2% in classification accuracy. One possible reason for this is the relatively low fNIRS-alone performance. It can be seen from **Table 2** that fNIRS-alone classification accuracy is about 10% lower than EEG-alone classification accuracy. A recent fNIRS-based workload estimation study reported that using only the forehead optodes resulted in a much-reduced workload estimation accuracy compared to using optodes from the whole head (Unni et al., 2017). We speculate that with whole head coverage, the fNIRS-alone and EEG+fNIRS performance could be much improved. Finally, integrating physiological measures with EEG or fNIRS did not significantly improve workload classification. One reason for the lack of improvement may be the reduced reliability of the physiological-based workload classification in comparison to the brain-signal-based approaches. Another possibility is that the physiological measurements do not provide information about workload beyond that already contained in the brain signal measurements.

In conclusion, the current study presented various approaches for mental workload classification and demonstrated that workload classification performance can be improved by integrating EEG and fNIRS and by training classifiers using the data from other subjects. The proposed approaches may have applications in neuroergonomics research and in applications such as adaptive aiding systems that are designed to improve the efficiency and safety of human-machine systems during critical tasks.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Institutional Review Board of Drexel University with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Institutional Review Board of Drexel University.

### AUTHOR CONTRIBUTIONS

YL made the largest contribution to all aspects of the work, including the design and programming of the experimental testing, data acquisition, data processing, analysis, and interpretation as well as drafting and editing the manuscript. HA contributed to all aspects of the work with particular emphasis on the fNIRS and EEG application, signal processing, data analysis and interpretation, and drafting and editing of the manuscript. PAS contributed to all aspects of the work with particular emphasis on the experimental design, data analysis and interpretation, drafting and editing the manuscript, as well as data acquisition. All authors agreed on the content and presentation of the submitted version of the manuscript.

#### FUNDING

This work was made possible, in part, by a research award from the National Science Foundation (NSF) IIS: 1064871 (Shewokis, PI). The content of the information herein does not necessarily reflect the position or the policy of the sponsors and no official endorsement should be inferred.

#### REFERENCES


#### ACKNOWLEDGMENTS

This work is part of the requirements for the PhD degree for YL. The authors acknowledge the helpful suggestions and insights into the work from committee members, Drs. Anna Rodriguez, Banu Onaral, and Karen Moxon.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2017.00389/full#supplementary-material



**Conflict of Interest Statement:** The optical brain imaging instrumentation utilized in the present research was manufactured by fNIR Devices, LLC. HA was involved in the development of the technology and was thus offered a minor share in fNIR Devices, LLC.

The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Liu, Ayaz and Shewokis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Online EEG-Based Workload Adaptation of an Arithmetic Learning Environment

Carina Walter <sup>1</sup> , Wolfgang Rosenstiel <sup>1</sup> , Martin Bogdan<sup>2</sup> , Peter Gerjets <sup>3</sup> and Martin Spüler <sup>1</sup> \*

<sup>1</sup> Department of Computer Engineering, Eberhard-Karls University Tübingen, Tübingen, Germany, <sup>2</sup> Department of Computer Engineering, University of Leipzig, Leipzig, Germany, <sup>3</sup> Knowledge Media Research Center, Tübingen, Germany

In this paper, we demonstrate a closed-loop EEG-based learning environment that adapts instructional learning material online to improve learning success in students during arithmetic learning. The amount of cognitive workload during learning is crucial for successful learning and should be held in the optimal range for each learner. Based on EEG data from 10 subjects, we created a prediction model that estimates the learner's workload to obtain an unobtrusive workload measure. Furthermore, we developed an interactive learning environment that uses the prediction model to estimate the learner's workload online based on the EEG data and adapts the difficulty of the learning material to keep the learner's workload in an optimal range. The EEG-based learning environment was used by 13 subjects to learn arithmetic addition in the octal number system, leading to a significant learning effect. The results suggest that it is feasible to use EEG as an unobtrusive measure of cognitive workload to adapt the learning content. Furthermore, they demonstrate that a prompt workload prediction is possible using a generalized prediction model without the need for a user-specific calibration.

#### Edited by:

Christian Mühl, Deutsches Zentrum für Luft- und Raumfahrt (DLR), Germany

#### Reviewed by:

Rodolphe J. Gentili, University of Maryland, College Park, United States Raphaëlle N. Roy, National Higher School of Aeronautics and Space, France

#### \*Correspondence:

Martin Spüler spueler@informatik.uni-tuebingen.de

Received: 13 December 2016; Accepted: 16 May 2017; Published: 30 May 2017

#### Citation:

Walter C, Rosenstiel W, Bogdan M, Gerjets P and Spüler M (2017) Online EEG-Based Workload Adaptation of an Arithmetic Learning Environment. Front. Hum. Neurosci. 11:286. doi: 10.3389/fnhum.2017.00286

Keywords: Passive brain-computer interface (BCI), Cognitive workload, Electroencephalography (EEG), Online Adaptation, Neurotutor, tutoring system, closed-loop workload adaptation

### 1. INTRODUCTION

Currently, there is an ongoing debate in mathematics education research on how to optimally support learners during arithmetic learning (e.g., Askew, 2015; Calder, 2015). Learning outcomes are most promising if the training program and learning content are tailored to the learner's specific needs (e.g., Gerjets and Hesse, 2004; Richards et al., 2007; for numerical interventions see Dowker, 2004; Karagiannakis and Cooreman, 2014). To optimally support the learner's efforts, learning content should be neither too easy nor too difficult. Therefore, it is crucial for successful learning to keep the cognitive workload in the individual optimal range for each learner (Sweller et al., 1998; Gerjets et al., 2009). This can be achieved by adapting the difficulty of the learning content to the individual competencies of the learner.

Computer-supported learning (Kirschner and Gerjets, 2006) seems specifically suited for implementing adaptivity, because it is easy to implement algorithms that change the difficulty of the presented material based on the learner's behavioral response. This allows for an easy personalization of the learning environment to the user's individual needs, which is assumed to be necessary for efficient learning. So far, adaptive computer-supported learning environments rely on the user's interaction behavior for adaptation, e.g., error-adaptive systems, which change the task difficulty based on the number of erroneous responses (Corbett, 2001; Graesser and McNamara, 2010; Käser et al., 2013). However, such behavioral measures are rather indirect and not very specific with respect to the cognitive processes required for performing the task at hand. For instance, more errors in a row may not only be caused by the difficulty of the task itself but also by task-unspecific processes (e.g., lapses of attention, fatigue, or disengagement).

It was recently proposed to measure cognitive processes directly to enhance human-computer interaction (Zander and Kothe, 2011). This approach, called passive brain-computer interface, could also be applied to improve arithmetic learning environments. Measuring neural correlates of specific cognitive processes allows for a more direct and implicit monitoring of the learner's cognitive state and should thereby allow for a better adaptation of the training content to improve the learning success of the user (Gerjets et al., 2014).

One cognitive state that is important within this context is the working memory load, or in short, workload. Throughout this paper, the term workload will be used to denote the amount of mental resources that are used to execute a specific task (based on Gevins and Smith, 2006). Working memory describes the small amount of information that can be stored and manipulated in mind simultaneously for the execution of a current cognitive task (Cowan, 2014). As the capacity for storing information at a time is limited, workload describes the extent to which this capacity is used, or the extent to which the working memory is filled. According to the cognitive load theory (Sweller et al., 1998), this is a bottleneck in learning, as the learning process is hindered if the amount of information the learner has to process exceeds the capacity of the working memory storage. Consequently, if the working memory load can be measured, the presented learning content can be adapted in such a way that the storage capacities are never exceeded and the workload is always in an optimal range.

The amount of cognitive workload can be measured by electroencephalography (EEG), as has been shown by multiple studies (Gevins et al., 1997; Murata, 2005; Berka et al., 2007; Wang et al., 2012). Basically, the amount of workload is reflected in two components of the EEG: the power spectrum and event-related potentials. Regarding the effect of workload on event-related potentials, two types of stimuli can be distinguished: task-independent and task-dependent stimuli. Using n-back tasks with task-dependent stimuli, an increase in workload leads to a diminished P300 amplitude (Scharinger et al., 2017). For more complex tasks that include task-dependent as well as task-independent stimuli, such as a piloting task, it was also shown that the P300 amplitude is lower in high workload conditions (Causse et al., 2015). Brouwer et al. (2012) reported this diminishing effect in the P300 amplitude for task-independent stimuli. Besides the P300, other components of the event-related potential were also shown to be sensitive to workload using task-independent stimuli, such as the N100 (Ullsperger et al., 2001) or the N200 and mismatch negativity (Kramer et al., 1995). Roy et al. (2016a) reported a significant decrease of the P200 component when workload increased. Using infrequent task-independent auditory probes that were to be ignored, they were able to assess mental workload efficiently.

Further, the oscillatory activity in EEG is also affected by workload. Pesonen et al. (2007) have shown that there are workload related changes in theta-, alpha-, and beta-band. Brouwer et al. (2012) also found an increase in frontal theta power and a decrease in occipital alpha power in an n-back task. Specific to arithmetic tasks, it was shown that the cognitive demand results in an increasing power of the theta band and a decreasing power in the alpha band (Harmony et al., 1999).

Regarding an EEG-based prediction of workload, Kohlmorgen et al. (2007) presented a real-time system in which the workload induced by a mental calculation task while driving could be predicted. Brouwer et al. (2012) classified high against low workload in an n-back task and achieved classification accuracies above 80% using either spectral features, ERPs, or a combination of both. Roy et al. (2016b) also compared the classification accuracy of power spectrum and ERP features during a Sternberg memory task and achieved a low performance (60%) with spectral data, while achieving high performance (91%) with ERP data. In a previous study (Walter et al., 2014), we tried to predict the difficulty of arithmetic tasks based on theta, alpha, and beta power and achieved an average correlation coefficient of up to 0.88. In a subsequent study (Spüler et al., 2016), we improved the cross-subject prediction and trained a prediction model that works across subjects, yielding an average correlation coefficient of 0.82.

While there is sufficient evidence that EEG-based workload prediction is possible and can be implemented in a real-time system, to the best of our knowledge it has not been used for closed-loop adaptation in an online learning environment. In this paper, we show how we developed a learning environment for arithmetic exercises that adapts the task difficulty based on the learner's cognitive workload as predicted from the EEG. To concentrate on the usability of the EEG-based learning environment, we wanted a system that works out-of-the-box without the need for a subject-specific calibration phase. As we also wanted a stimulus-independent system, only the power spectrum was used for prediction. In the following, we present the EEG-based workload prediction and describe its application in the EEG-based learning environment. To show the feasibility of the EEG-based learning environment, the learning effect for 13 subjects testing this environment is compared to that of a control group using an error-based learning environment.

#### 2. METHODS

For the development of an EEG-based learning environment, the work was split into two studies. In the first study, EEG data were collected while subjects solved arithmetic tasks of varying difficulty. Based on these data, a prediction model was created that predicts the amount of workload from the user's EEG. In the second study, which is the main part of this paper, the prediction model was used to estimate a learner's workload online and adapt the task difficulty accordingly. To evaluate the EEG-based learning environment, subjects used it to learn arithmetic addition in the octal number system (e.g., 3 + 5 = 10), and the learning success was compared to that of a control group, who learned the same task using a learning environment that adapts based on the number of correct responses.

#### 2.1. Task Design

The participants solved addition tasks with diverse levels of difficulty, which were presented on a desktop computer with a 19 inch TFT display. In the first study, addition tasks in the decimal number system were presented, while the participants had to learn addition in the octal number system in the second study. The difficulty level of the presented addition exercises was defined by their Q-value (Thomas, 1963), which reflects the information content of an arithmetic task. This difficulty measure takes into account both problem size and the need for a carry operation, which are the main parameters for problem difficulty in addition.

A more detailed description of how the Q-value is calculated can be found in Thomas (1963) or in Spüler et al. (2016), where some examples are also shown. The addition problems presented in this work ranged from Q = 0.6 (easy, single-digit, e.g., 1 + 1) to Q = 7.2 (difficult, four-digit, e.g., 3721 + 1452).

Each trial, in which one arithmetic problem was to be solved, consisted of four phases, which are depicted in **Figure 1**. First, in the calculation phase, the problem to be solved was shown for 5 s. Subsequently, subjects had a maximum of 3.5 s to type in their result. In the first study, with arithmetic in the decimal number system, the subjects did not receive feedback. In the second study, the subjects had to learn arithmetic in the octal number system and therefore received feedback in the form of the correct answer, which was presented for 3.5 s. Each trial ended with an inter-trial interval (ISI) of 1.5 s, resulting in a total session length of approximately 45 min. To prevent the prediction model from being based on perceptual-motor confounds, the time windows used for analyzing EEG data should not contain motor events. As typing in the answer leads to motor artifacts, only the calculation phase was used for EEG analysis.

### 2.2. EEG Recording

A set of 28 active electrodes (actiCap, BrainProducts GmbH) was used to record EEG signals. They were attached to the scalp and placed according to the extended international 10-20 electrode placement system (FPz, AFz, F3, Fz, F4, F8, FT7, FC3, FCz, FC4, FT8, T7, C3, Cz, C4, T8, CPz, P7, P3, Pz, P4, P8, PO7, POz, PO8, O1, Oz, and O2). Three additional electrodes were used to record an electrooculogram (EOG); two of them were placed horizontally at the outer canthus of the left and right eye to measure horizontal eye movements, and one was placed in the middle of the forehead between the eyes to measure vertical eye movements. Ground and reference electrodes were placed on the left and right mastoids. EOG and EEG signals were amplified by two 16-channel biosignal amplifier systems (g.USBamp, g.tec) and sampled at a rate of 512 Hz; the impedance of each electrode was less than 5 kΩ. EEG data were band-pass filtered between 0.5 and 60 Hz with a Chebyshev filter of order 8 during the recording. Furthermore, a notch filter (Chebyshev, order 4) was applied between 48 and 52 Hz to filter out power line noise. The signal processing pipeline is documented in detail in **Figure 2**.

### 2.3. Study 1: Workload during Decimal Arithmetic Tasks

#### 2.3.1. Study Design and Participants

Ten students (4 male and 6 female; range: 17 − 32 years, M = 24.9 years, SD = 5.3 years) participated voluntarily in this study and received monetary compensation for participation. The study was approved by the local ethics committee of the Medical Faculty at the University of Tübingen, and written informed consent was obtained from the participants. All participants reported normal or corrected-to-normal vision and no mathematical problems. Participants were chosen randomly. Nevertheless, all were university students (with different fields of study) and can thus be considered as having a high educational background.

Each participant had to solve 240 addition exercises in the decimal number system while EEG was measured. The exercises were presented in order of increasing difficulty to account for learning effects. If a subject learns and gets better at performing the tasks, exercises with the same objective difficulty (Q) can lead to a different perceived difficulty at the start (before learning) and at the end of the session (after learning). If exercises are presented in order of increasing difficulty, the relationship that a task with higher perceived difficulty also has a higher objective difficulty (Q) still holds.

However, it should be noted that increasing the task difficulty over time leads to potential confounds, as there might also be changes in the EEG over time that are not related to task difficulty. The only way to counter these confounds would be a randomization of the task difficulty, which in turn would corrupt the labels, as the relationship between Q and the perceived difficulty would change over time due to learning effects. Since we needed reliable labels for the training of the prediction model, we decided against randomization.

#### 2.3.2. Analysis of EEG Data

EEG data were corrected for eye movements using an EOG-based regression method (Schlögl et al., 2007). Furthermore, Burg's maximum entropy method (Cover and Thomas, 2006) with a model order of 32 was used to estimate the power spectrum from 1 to 40 Hz in 1 Hz bins. After calculating the power spectrum, the data were z-score normalized along the channels to correct for inter-subject variability in the subjects' baseline EEG power. To analyze which electrodes and frequencies change with increasing task difficulty, the signed squared correlation coefficients (sign(R) · R<sup>2</sup>) between the power at each frequency bin (for each electrode) and the Q-values of the corresponding trials were calculated.
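The signed squared correlation coefficient can be illustrated with a minimal Python sketch. It assumes that R is the Pearson correlation between the power values of one electrode and frequency bin and the Q-values of the same trials; the function name `signed_r2` and the example numbers are hypothetical and not taken from the study.

```python
import math

def signed_r2(power, q_values):
    """Return sign(R) * R**2, where R is the Pearson correlation
    between band power and Q-value across trials.

    Raises ZeroDivisionError if either input has zero variance.
    """
    n = len(power)
    mean_p = sum(power) / n
    mean_q = sum(q_values) / n
    cov = sum((p - mean_p) * (q - mean_q) for p, q in zip(power, q_values))
    var_p = sum((p - mean_p) ** 2 for p in power)
    var_q = sum((q - mean_q) ** 2 for q in q_values)
    r = cov / math.sqrt(var_p * var_q)
    # copysign keeps the squared magnitude but restores the sign of R
    return math.copysign(r * r, r)

# Hypothetical illustration: power that falls as Q rises gives a negative value.
example = signed_r2([4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0])
```

Keeping the sign makes it possible to tell a power decrease with rising difficulty apart from a power increase, which the plain R<sup>2</sup> value would hide.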

### 2.4. Study 2: EEG-Based Learning Environment for Octal Arithmetic Tasks

#### 2.4.1. Study Design and Participants

The participants in the second study were divided into two groups: an experimental group using the EEG-based learning environment and a control group using an error-adaptive learning environment. In both groups, the subjects learned arithmetic addition in the octal number system (e.g., 5 + 3 = 10), which was a completely new task to all subjects.
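For readers unfamiliar with base-8 arithmetic, the following short Python sketch reproduces the example above. The helper name `add_octal` is hypothetical and is not part of the learning environment described here.

```python
def add_octal(a: str, b: str) -> str:
    """Add two numbers written in octal and return the octal result."""
    total = int(a, 8) + int(b, 8)  # parse both operands as base-8 integers
    return format(total, "o")      # render the sum in base 8 again

print(add_octal("5", "3"))   # prints "10", since 5 + 3 = 8 in decimal
print(add_octal("17", "1"))  # prints "20", a carry occurs at 8 instead of 10
```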

To evaluate the learning success, each subject completed a pre-test and a post-test, before and after using the learning environment for approximately 45 min (180 exercises). The tests consisted of 11 exercises of varying difficulty. Although the difficulty of the exercises was the same for the pre- and post-test, the exercises themselves were different.

The participants of both groups were university students of various disciplines, reported normal or corrected-to-normal vision, and participated voluntarily in the EEG experiment. Thirteen subjects (7 male and 6 female; range 21−35 years, M = 28.1 years, SD = 4.3 years) participated in the experimental group using the EEG-based learning environment. The control group consisted of 11 subjects (7 male and 4 female; range 22 − 27 years, M = 23.4 years, SD = 1.4 years), who used an error-adaptive learning environment, which is state of the art. The study was approved by the local ethics committee of the Medical Faculty at the University of Tübingen, and written informed consent was obtained from the participants.

#### 2.4.2. Cross-Subject Regression for Online Workload Prediction

Based on the EEG data from study 1, we created a prediction model that predicts the cognitive workload in a timely manner, so that the learning environment can be adjusted accordingly. In terms of usability, we wanted the EEG-based learning environment to be usable out-of-the-box without the need for a subject-specific calibration phase, which is why a cross-subject regression method as presented by Walter et al. (2014) was applied.

Therefore, EEG data obtained in study 1 were used to train a linear ridge regression model with a regularization parameter of λ = 10<sup>3</sup>. For training the ridge regression, we used the MATLAB function ridge; the regularization parameter was found to be optimal in the previous study, where it was determined by cross-validation. The number of electrodes used for online adaptation was reduced to 16 inner electrodes (FPz, AFz, F3, Fz, FC3, FCz, FC4, C3, Cz, C4, CPz, P3, Pz, P4, Oz, and POz) to be consistent with the electrode positions used in the previous cross-subject study (Walter et al., 2014), where the outer electrodes were not used as they are more prone to artifacts and contain less relevant information. Furthermore, only trials with a Q-value smaller than 6 were used to train the regression model, since trials with a higher Q-value showed EEG patterns similar to those of very easy trials, which is most likely due to a disengagement of the subjects (Spüler et al., 2016). The power spectrum was calculated for the 5 s time frame of the calculation phase using Burg's maximum entropy method with a model order of 32. To correct for inter-subject variability in the subjects' baseline EEG power, the data were z-score normalized along the channels. For the final prediction output, we calculated the moving average with a window length of 6 trials in 1 trial steps, which still guaranteed a system response time smaller than 1 min. The moving average leads to a more robust prediction (Walter et al., 2014) but makes the system react more slowly to sudden changes in workload. This delay is acceptable for workload detection, since it is not advisable to adapt the difficulty of an online learning environment too rapidly (i.e., on a single-trial basis).
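The two processing steps described above, ridge regression and moving-average smoothing, can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the implementation used in the study: the closed-form solution on centered data is a textbook formulation and may differ from MATLAB's ridge function in how predictors are scaled, and the behavior for the first five trials (averaging whatever predictions are available) is an assumption. The names `fit_ridge` and `smooth` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e3):
    """Closed-form ridge regression on centered data.

    X: (n_trials, n_features) spectral features, y: (n_trials,) Q-values.
    Returns (weights, intercept). lam is the regularization parameter.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Solve (Xc^T Xc + lam * I) w = Xc^T (y - mean(y))
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def smooth(predictions, window=6):
    """Trailing moving average in one-trial steps.

    Assumption: before `window` predictions exist, the average is taken
    over the predictions available so far.
    """
    out = []
    for i in range(len(predictions)):
        start = max(0, i - window + 1)
        out.append(sum(predictions[start:i + 1]) / (i + 1 - start))
    return out

print(smooth([1, 2, 3, 4, 5, 6, 7]))  # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5]
```

A single-trial prediction would then be `features @ w + b`, and the smoothed sequence, rather than the raw one, would drive the adaptation.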

In contrast to study 1, we did not use an EOG-correction in this study. Although it is technically possible, it would not fit with our approach to build an out-of-the-box system, as training data for the EOG-correction would be needed for each subject. This would cost additional time before the user can use the system (and start learning) and therefore would decrease usability of the system.

The regression model trained in this way was then applied in the EEG-based learning environment to predict the amount of cognitive workload for novice subjects online.

#### 2.4.3. Online Adaptation of the EEG-Based Learning Environment

For the experimental group, the EEG data served as the workload indicator. We used the output of the previously trained regression model to predict the current workload state of each learner and differentiated three difficulty levels. If the predicted workload was less than Q = 0.8, the presented task was assumed to be too easy, and the Q-value of the following task was increased by 0.2. Conversely, if the predicted workload was greater than Q = 3.5, the presented task was assumed to be too difficult, and the target Q of the subsequent task was decreased by 0.2. If the predicted workload was between Q = 0.8 and Q = 3.5, the Q-value for the next presented task remained the same and the difficulty level was kept constant. These thresholds were defined based on the results in Spüler et al. (2016): trials with Q < 1 were solved correctly in all cases, none of the subjects were able to solve trials with Q > 6, and on average 50% of the trials with Q = 3.5 were solved successfully. The learning session for the experimental group started with an exercise of difficulty level Q = 2.
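The adaptation rule can be written as a small Python function. This is a minimal sketch of the thresholds stated in the text; the function name `next_q_eeg` is hypothetical, and rounding to one decimal place is an added assumption to avoid floating-point drift over many trials.

```python
def next_q_eeg(current_q, predicted_workload, low=0.8, high=3.5, step=0.2):
    """Return the target Q-value of the next task.

    predicted_workload is the (smoothed) regression output on the Q scale.
    """
    if predicted_workload < low:
        # Task assumed too easy: raise the difficulty.
        return round(current_q + step, 1)
    if predicted_workload > high:
        # Task assumed too difficult: lower the difficulty.
        return round(current_q - step, 1)
    # Workload within the target range: keep the difficulty constant.
    return current_q

print(next_q_eeg(2.0, 0.5))  # 2.2
print(next_q_eeg(2.0, 4.0))  # 1.8
print(next_q_eeg(2.0, 2.0))  # 2.0
```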

#### 2.4.4. Adaptation of the Error-Based Learning Environment

For the control group, an error-adaptive learning environment was used. The number of wrong answers served as the performance and adaptation measure. When subjects solved five consecutive tasks correctly, the difficulty level (Q) increased by 1. Conversely, the difficulty level decreased by 1 when participants made three errors in a row. Otherwise, the Q-value did not change and the difficulty level was held constant. The adaptation scheme for the control group was kept similar to that of common tutoring systems. The learning session for the control group started with an exercise of difficulty level Q = 2.
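The error-based scheme can be sketched in the same way. The class name `ErrorAdaptiveScheduler` is hypothetical, and the text does not state whether the streak counters restart after a difficulty change; the sketch assumes that they do.

```python
class ErrorAdaptiveScheduler:
    """Adapt Q from streaks of correct and incorrect answers."""

    def __init__(self, start_q=2.0, up_after=5, down_after=3, step=1.0):
        self.q = start_q
        self.up_after = up_after
        self.down_after = down_after
        self.step = step
        self.correct_streak = 0
        self.error_streak = 0

    def update(self, correct: bool) -> float:
        """Register one answer and return the Q-value of the next task."""
        if correct:
            self.correct_streak += 1
            self.error_streak = 0
        else:
            self.error_streak += 1
            self.correct_streak = 0

        if self.correct_streak == self.up_after:
            self.q += self.step
            self.correct_streak = 0  # assumption: streak restarts after a change
        elif self.error_streak == self.down_after:
            self.q -= self.step
            self.error_streak = 0    # assumption: streak restarts after a change
        return self.q
```

With these defaults, five correct answers in a row move a learner from Q = 2 to Q = 3, and three subsequent errors in a row move the learner back to Q = 2.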

#### 2.4.5. Evaluating Learning Success

To compare the learning success of the two groups, the learning effect after completing the learning phase served as the performance measure and was used as an indicator of how successfully each subject was supported during learning. Hence, each subject had to perform a pre-test before the learning phase started. This was used to assess the prior knowledge of each user. After the learning session, each participant had to solve a post-test, and the difference in score between the two tests served as an indicator of the learning effect. Results of the pre- and post-tests were compared statistically using a two-sided Wilcoxon rank-sum test.
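To make the test concrete, the following Python sketch computes a two-sided Wilcoxon rank-sum test with the normal approximation. It is not the implementation used in the study: it applies no tie or continuity correction, and for samples as small as the ones here an exact test may be preferable. The function names and the example scores are hypothetical and are not data from this work.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def ranksum_test(x, y):
    """Return (z, p) of a two-sided rank-sum test (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p

# Made-up pre- and post-test scores (out of 11), for illustration only.
pre_scores = [0, 1, 2, 1, 3]
post_scores = [4, 6, 5, 3, 7]
z, p = ranksum_test(pre_scores, post_scores)
learning_effects = [post - pre for pre, post in zip(pre_scores, post_scores)]
```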

### 3. RESULTS

### 3.1. Study 1: Workload Related Effects in EEG Data

As results from study 1 were already published in Spüler et al. (2016), only the most important results relevant for the implementation of the online learning environment are shown here. For a more detailed analysis, we refer to the original publication.

The results of the EEG analysis regarding the association between the Q-value and the power at each electrode and frequency bin, expressed as R<sup>2</sup>-values, are shown in **Figure 3**. In the delta frequency band, there is a small difficulty-related effect over the central electrodes, while the effect in the theta, alpha, and beta frequency bands is located over the parieto-occipital electrodes. This effect was strongest for the alpha band (8 − 12 Hz). While the lower beta band (13 − 24 Hz) still shows some effects related to task difficulty, they cannot be observed in the upper beta band (25 − 40 Hz).

### 3.2. Study 2: Task Performance Results

The behavioral results, i.e., how well the subjects performed the octal arithmetic task, are shown in **Table 1** for both groups. For the experimental group, 45.5% of the 180 exercises were solved correctly on average. Averaged over all subjects, a maximum Q-value of 5.85 was reached using the EEG-based learning environment. Each subject reached at least the difficulty level of Q = 3.2.

The control group answered 64% of all 180 assignments correctly on average. Since the error-rate was used for adapting the difficulty level of the presented learning material, the number of correctly solved trials was similar across subjects. On average, a maximum Q-value of 4.64 was reached (see **Table 1**). The best subjects reached a maximum Q-value of 6, whereas each participant achieved at least the difficulty level of Q = 4.

### 3.3. Study 2: Learning Effect Using Adaptive Learning Environments

To evaluate if the EEG-based learning environment works and how it compares to an error-adaptive learning environment, we analyzed the learning effect of each subject by pre- and post-tests. Furthermore, the learning effects between the experimental and the control group were compared.

**Table 2** reports the learning effect of each individual subject, as well as the group averages for both groups. After using the EEG-based learning environment and thus learning how to calculate in the octal number system, a learning effect can be recognized for almost every subject of the experimental group, except for subjects S02 and S05. On average, 5.08 of the 11 post-test tasks were solved correctly after completing the learning phase, 3.54 more than in the pre-test. On average, a significant learning effect can be verified between the pre- and post-test (p = 0.0026, two-sided Wilcoxon test).

The control group solved 1.55 of the 11 pre-test assignments correctly on average. As the difference to the experimental group is not significant (p > 0.05, two-sided Wilcoxon test), equal prior knowledge can be assumed for both groups. The best subject of the control group solved 36.36% of the pre-test tasks correctly, whereas the worst subjects gave no correct answers. For almost every subject, a learning effect for calculating in the octal number system is noticeable after the error-adaptive learning session, except for subject S15. On average, 3.45 more tasks were solved correctly in the post-test than in the pre-test, which also shows a significant learning effect between the pre- and post-test (p = 0.0016, two-sided Wilcoxon test).

FIGURE 3 | Topographic plots of the signed squared correlation coefficient averaged over all subjects between the frequencies in each band for each electrode and the task difficulty as indicated by the Q-value.

TABLE 1 | Task performance for all subjects.

The relative amount of correctly solved trials in % and the maximum Q-value for each subject, as well as the group average are shown. Results of the experimental group using the EEG-based learning environment are displayed at the top, while results from the control group using an error-adaptive learning environment are displayed at the bottom.

TABLE 2 | Number of correctly solved trials in the pre- and post-test, as well as the difference, which indicates the learning success for the individual subjects and the group means.

Results of the experimental group using the EEG-based learning environment are displayed at the top, while results from the control group using an error-adaptive learning environment are displayed at the bottom.

Although the learning effect for the experimental group using the EEG-based learning environment is higher than for the control group, this difference is not significant (p > 0.05, two-sided Wilcoxon test).

### 4. DISCUSSION

In the presented work, we have shown that EEG can be used to measure a learner's workload online and adapt a learning environment accordingly. By using a cross-subject regression method, a subject-specific calibration phase can be omitted. Although the cross-subject prediction model was built using data recorded while subjects did arithmetic in the decimal number system, it could be successfully applied to predict the workload during learning of arithmetic in the octal number system.

In the following, the benefits and drawbacks, as well as further ideas for adaptive learning environments will be discussed.

#### 4.1. Evaluating an EEG-Based Learning Environment

Commonly, if neural signals (like EEG) are used to estimate a user's mental state or a user's intention, the performance of the prediction method is assessed in terms of accuracy, correlation, or other metrics (Spüler et al., 2015) that try to quantify how well the prediction model is working. While we have also evaluated the prediction performance of our model in a previous publication (Spüler et al., 2016), this kind of assessment is no longer feasible in the scenario presented here, with an online learning environment.

The reason for this is the lack of an objective measure for the user's workload. For the creation of the prediction model, we used EEG data from a task that all subjects were able to do fluently (addition in the decimal system). As no learning effects are expected in this case, the difficulty of the task (measured by Q) was used as an objective measure of expected workload. For the online learning environment, the relationship between task difficulty and workload no longer holds, as the learning environment induces learning effects. At the beginning, when the task is unknown to the user, even easy exercises will induce a high workload. After using the learning environment, the user may have mastered arithmetic tasks in the octal number system, and even exercises with moderate difficulty will result in a low workload. As the relationship between task difficulty and workload changes in the course of learning, the predicted task difficulty cannot be used for performance evaluation in an online learning environment.

As the task difficulty measured by the Q-value is the only objective measure we have, and the relationship to workload is invalidated in the online scenario, we have no means to objectively assess the prediction performance of our model. If the reader is interested in an assessment of the model performance, we refer to our previous publication (Spüler et al., 2016), where the model was evaluated on offline data.

Although we cannot assess the performance of the prediction model in the online scenario, we can evaluate the EEG-based learning environment with regard to its effect on the learning success. Learning success was defined as the difference in score between the pre- and post-tests, which the subjects completed before and after using the learning environment; this difference is a common measure for the evaluation of learning environments (Chi et al., 2011). As it is not important for the learner how accurately the workload prediction works, but it is important how much the learning success can be improved, learning success is also the most user-centered metric.

Because other commonly used metrics are not applicable to the scenario of an online learning environment and learning success is the most user-centered metric, we used learning success as the primary outcome measure for this study.

### 4.2. Proving the Concept of an EEG-based Learning Environment

As subjects using the EEG-based learning environment showed a significant learning effect, this work is a successful proof-of-concept that an EEG-based learning environment works. So far, the use of EEG in a reading tutor has been investigated (Chang et al., 2013), and the cognitive and emotional state of the user has been modeled to improve a tutoring system (Graesser et al., 2007), but to the best of our knowledge, this work presents for the first time a closed-loop system using an EEG-based workload adaptation in an arithmetic learning environment.

When comparing the experimental group using the EEG-based learning environment with the control group using the error-based learning environment, the learning success was higher for the experimental group, but the difference was not significant. As this study only compared an EEG-adaptive system to an error-adaptive system, it would be interesting for future studies to compare an EEG-based system against non-adaptive learning environments and to test whether a significant performance difference can be achieved. Nevertheless, the results of this study indicate that an EEG-based learning environment is an alternative to the state-of-the-art approach, but the usability of the system is still an open issue.

Although we aimed at high usability for the presented system by using a cross-subject prediction model to omit a subject-specific calibration phase, gel-based EEG still takes time and effort to prepare, making it impractical to use an EEG-based learning environment on a wide basis. Recently developed dry EEG electrodes were shown to provide a lower signal-to-noise ratio than gel-based electrodes, but their signal quality is still good enough for brain-computer interface control (Spüler, 2017). While the usability of such a system could be improved by dry EEG electrodes, the increased cost can likely not be justified when using EEG-based learning environments in a broad population (e.g., in a classroom). However, this technique could prove helpful in special cases in which the user suffers from a learning disability or other problems. In the presented study, one subject of the experimental group suffered from test anxiety and reported feeling very comfortable using the EEG-based learning environment, as the system turned down the difficulty every time the subject was close to feeling overwhelmed, thereby providing a good learning experience.

#### 4.3. Improving the EEG-Based Learning Environment

As this work should merely serve as a proof-of-concept to show the feasibility of an EEG-based learning environment, it should be discussed how such a system could be potentially improved.

One possible improvement could be made by detecting not only workload but also other cognitive properties like vigilance, attention, or engagement. As we already mentioned in Spüler et al. (2016), arithmetic exercises with a very high difficulty show EEG patterns similar to those of exercises with very low difficulty, which is likely due to a disengagement effect (Chanel et al., 2008) where subjects do not even try to solve the task. Therefore, taking additional parameters into account when adapting the learning material seems advisable. Besides workload, vigilance and engagement are also important factors for solving a task correctly and learning efficiently. Decreasing vigilance is often specified as the decline in attention-requiring performance over an extended period of time. Furthermore, vigilance increases more steeply in the context of difficult tasks than of easy tasks. Tiwari et al. (2009) describe the interrelationship of vigilance and workload: an increasing workload is accompanied by a vigilance decrease. Engagement and workload increased as a function of task difficulty during learning and memory tasks (Berka et al., 2007). The results from previous studies (Oken et al., 2006; Berka et al., 2007; Tiwari et al., 2009) showed that the detection of various vigilance and engagement states is possible. Being able to detect these cognitive states based on the EEG could thereby improve the workload prediction and also add another layer to the learning environment, where not only the learning material is presented but other cues could also be integrated to motivate the learner when signs of disengagement or boredom are detected. Roy et al. (2013) have also shown that there is an interaction between fatigue and workload, which potentially decreases classification performance over time. As Roy et al. (2016b) have also shown that ERPs are a more robust indicator of workload than power spectral features, ERPs could be used as an additional feature to predict workload in future EEG-based learning environments. However, as ERP-based workload detection is stimulus-dependent, this approach is less flexible than using the power spectrum, which can be used for workload estimation independent of any stimuli.

### 5. CONCLUSION

In this paper, we presented an EEG-based learning environment, which unobtrusively detects the user's workload online and adapts the learning material accordingly to support each learner optimally. In a first study, we collected EEG data from subjects solving arithmetic exercises in the decimal number system. Based on these data, a cross-subject prediction model was created that makes it possible to predict the workload of other subjects without the need for a subject-specific calibration phase. This prediction model was used online to predict the workload of subjects learning arithmetic in the octal number system, and the learning material was adapted promptly to keep the learner's workload at an optimal level. Using the EEG-based learning environment led to a learning success similar to that of a state-of-the-art system, which suggests the feasibility of an EEG-based learning environment.

### AUTHOR CONTRIBUTIONS

CW and MS conceived and designed the studies. CW programmed the online learning environment. CW collected the data. CW and MS analyzed the data. CW and MS wrote the article, partly based on the doctoral dissertation by CW, and aided by MB, WR, and PG.

### FUNDING

This research is supported by the Leibniz ScienceCampus Tübingen Informational Environments and the Deutsche Forschungsgemeinschaft (DFG; grant SP-1533/2-1). CW was a doctoral student of the LEAD Graduate School [GSC1028]; her work is funded by the Excellence Initiative of the German federal and state governments. WR and PG are principal investigators of the LEAD Graduate School at the Eberhard-Karls University Tübingen funded by the Excellence Initiative of the German federal government. We further acknowledge support by Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of University of Tübingen.

#### ACKNOWLEDGMENTS

The authors would like to thank Hauke Glandorf for collecting data of the control group.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Walter, Rosenstiel, Bogdan, Gerjets and Spüler. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Dynamic Threshold Selection for a Biocybernetic Loop in an Adaptive Video Game Context

Elise Labonte-Lemoyne\*, François Courtemanche, Victoire Louis, Marc Fredette, Sylvain Sénécal and Pierre-Majorique Léger

Tech3Lab, HEC Montréal, Université de Montréal, Montreal, QC, Canada

Passive Brain-Computer interfaces (pBCIs) are a human-computer communication tool where the computer can detect from neurophysiological signals the current mental or emotional state of the user. The system can then adjust itself to guide the user toward a desired state. One challenge facing developers of pBCIs is that the system's parameters are generally set at the onset of the interaction and remain stable throughout, not adapting to potential changes over time such as fatigue. The goal of this paper is to investigate the improvement of pBCIs with settings adjusted according to the information provided by a second neurophysiological signal. With the use of a second signal, making the system a hybrid pBCI, those parameters can be continuously adjusted with dynamic thresholding to respond to variations such as fatigue or learning. In this experiment, we hypothesize that the adaptive system with dynamic thresholding will improve perceived game experience and objective game performance compared to two other conditions: an adaptive system with single primary signal biocybernetic loop and a control non-adaptive game. A within-subject experiment was conducted with 16 participants using three versions of the game Tetris. Each participant plays 15 min of Tetris under three experimental conditions. The control condition is the traditional game of Tetris with a progressive increase in speed. The second condition is a cognitive load only biocybernetic loop with the parameters presented in Ewing et al. (2016). The third condition is our proposed biocybernetic loop using dynamic threshold selection. Electroencephalography was used as the primary signal and automatic facial expression analysis as the secondary signal. Our results show that, contrary to our expectations, the adaptive systems did not improve the participants' experience as participants had more negative affect from the BCI conditions than in the control condition. 
We endeavored to develop a system that improved upon the authentic version of the Tetris game; however, our proposed adaptive system improved neither players' perceived experience nor their objective performance. Nevertheless, this experience can inform developers of hybrid passive BCIs on a novel way to employ various neurophysiological features simultaneously.

Keywords: dynamic adaptation, passive brain computer interface, hybrid brain computer interface, BCI, pBCI, hBCI, EEG, tetris

Edited by:

Stephen Fairclough, Liverpool John Moores University, United Kingdom

#### Reviewed by:

Tonio Ball, Albert-Ludwigs-Universität Freiburg, Germany; Emili Balaguer-Ballester, Bournemouth University, United Kingdom; Noman Naseer, Air University, Pakistan

\*Correspondence:

Elise Labonte-Lemoyne elise.labonte-lemoyne@hec.ca

Received: 07 September 2017; Accepted: 22 June 2018; Published: 17 July 2018

#### Citation:

Labonte-Lemoyne E, Courtemanche F, Louis V, Fredette M, Sénécal S and Léger P-M (2018) Dynamic Threshold Selection for a Biocybernetic Loop in an Adaptive Video Game Context. Front. Hum. Neurosci. 12:282. doi: 10.3389/fnhum.2018.00282

## INTRODUCTION

We have become quite adept at communicating with our computer systems and having them do what we wish. However, this type of communication lacks the subtleties that accompany human-human interactions through non-verbal signals. A promising new mode of communicating with technology has emerged that could help our computer systems understand more understated information about us: passive brain-computer interfaces (pBCIs) (Zander and Kothe, 2011). This paper presents suggested improvements to the current state of pBCI development.

With pBCIs, the user's state is communicated to the computer without any conscious effort from the user, and the system takes that state into account to adapt itself and ultimately lead the user to an optimal state. Passive BCIs can bring about many different types of adaptations, such as modifying the level of difficulty of a game to ensure it is as stimulating as possible for players (Van De Laar et al., 2013), setting off an alarm when truck drivers are too tired to be safe on the road (Hajinoroozi et al., 2016), or disengaging the autopilot to prevent airline pilots from getting so bored that they fail to notice important warning signs (Pope et al., 1995). Passive BCIs, as opposed to active BCIs, are mostly being developed for average users to improve their daily experiences rather than to replace other input devices such as using eye tracking instead of a mouse (Frisoli et al., 2012). As mentioned by Robert Jacob, "active BCIs have the potential to bring a great improvement to the lives of a small number of people, while passive BCIs can bring small improvements to the lives of a great number of people" (Jacob, 2017, p. 14).

While multiple research groups have been building functional pBCI systems (Prinzel et al., 2000; Scerbo et al., 2003; Berka et al., 2004; Lin et al., 2006; Ewing et al., 2016), many technical challenges remain to be addressed (Allanson and Fairclough, 2004; Fairclough, 2011; Brouwer et al., 2015), such as personalization. As biosignals vary greatly from one person to another, personalization is required for systems to function properly (Makeig et al., 2012). Developers will often use personalization tasks to individually tailor thresholds before the use of a system (Johnson et al., 2011). Their participants will thus have to be recorded twice, once to measure their responses to a task designed to induce a given state and a second time once the system has been personalized to actually use it. This requires time and resources and results in highly personalized systems that are, nonetheless, not responsive to changes over time, such as fatigue. This is the challenge we attempt to address within this paper.

A solution that has been proposed to many such BCI problems is the combination of multiple neurophysiological signals (hybrid BCI) to improve reliability, proficiency, bandwidth, convenience, and utility (see Banville and Falk, 2016 for a review; Nijholt et al., 2011). The type of system presented in this article can incorporate both neurophysiological and psychophysiological measures into the same system (Pfurtscheller et al., 2010; Zander et al., 2010; Banville and Falk, 2016). It is also commonly known as multimodal BCI. We look to the hybrid BCI to personalize our system in real time and diminish the time and resources that are required before a user can begin using a system for the first time. While most hybrid BCIs use a second signal to improve the robustness of the system, to our knowledge, no other group has used a secondary neurophysiological signal to adjust the parameters of the system's adaptation to the first signal. In essence, the second signal serves as an indicator of how successful the adaptation, which is driven by the first signal, has been, allowing that adaptation to be personalized. This paper presents a first attempt to do so.

Our approach takes a slightly different view of pBCIs as most systems intend the subject to be unaware of the adaptations taking place. We propose instead that subjects do not need to be completely conscious of the changes, but that they are probably, on some level, aware that changes are taking place. By measuring the impact of these changes with a secondary, affective signal, we intend to measure the user's reaction to the adaptation in real time. This should inform the system whether the adaptation improved their experience (and/or performance) or not, thus telling it whether the adaptation parameters were correct or not. Accordingly, we propose a new approach for dynamic threshold selection that performs a continuous calibration. The biocybernetic loop adapts to one physiologically inferred metric, in this case, cognitive load, using electroencephalography (EEG), while the thresholds are continuously adjusted based on a second metric, in this case, emotional valence, inferred using automatic facial expression analysis.

The goal of this paper is to evaluate if dynamic thresholding with a secondary neurophysiological signal can continuously personalize the adaptive system. We hypothesize that the adaptive system with dynamic thresholding will improve perceived game experience and objective performance.

## METHODS

#### Participants

This study was carried out in accordance with the Research Ethics Board of our institution and with written informed consent from all subjects. Participants were students at local universities. There were 16 in total (6 women). Average age was 27.75 (SD = 8.55). Participants had to be over 18 and have no known health issues. They received a \$30 gift card as compensation.

### Initial Baselining

We used a modified vanilla baseline task (Jennings et al., 1992) for baseline calibration. This task was chosen because it is emotionally neutral and has visual characteristics similar to those of the Tetris game, i.e., color and movement.

### Experimental Design

As Ewing et al. (2016) underlined, game periods need to be longer than 5 min to get an accurate representation of the system's adaptation. Thus, participants played each condition for 15 min. The order of the three conditions was counterbalanced between participants. For the adaptive conditions, the system managed the game difficulty as in the Dynamic Difficulty Adjustment framework (Hunicke and Chapman, 2004).

#### Game

Tetris is a game that is often used in research as it is easily adaptable and because gameplay is generally understood by most. In this instance, Tetris was also advantageous as there is no emotional component to the gameplay. This prevented us from having to tease out the effect of gameplay from the effects of the adaptive system on the measured valence, the latter being more relevant to our study.

To easily control game parameters, we developed an in-house version of Tetris. The game window consisted of a rectangle of 10 by 20 squares. Players controlled the game with a Logitech F310 gamepad (Logitech, Lausanne). They could move the pieces laterally with the left and right arrows and rotate the pieces with the "A" and "X" buttons. No other manipulation was allowed.

For each condition, the initial speed was 400 ms per line. The slowest allowed speed was 1,500 ms per line and the fastest allowed speed was 100 ms per line. Speed changed in steps of 100 ms between levels. These parameters were chosen after pilot testing was conducted. The only atypical parameter was the absence of the control that allows the user to manually make the blocks fall faster. The ability to force the blocks down might have prevented the conditions from being uniform between subjects; thus, it was not implemented in our game.

#### Condition 1

In the control condition, there was no adaptation taking place. The speed increased by one step (speed levels are defined in the next section) every time four rows were successfully cleared. This condition was chosen to resemble a traditional game of Tetris as closely as possible.

#### Condition 2

The second condition was developed to model Ewing et al.'s (2016) conservative condition as closely as possible. The system adapts to the user state as measured by the EEG. Any increase or decrease of 100% (i.e., two-fold) or more relative to the baseline value obtained in the benchmarking task triggers a speed change (e.g., for a starting baseline value of 4, a 100% increase would be a value of 8 and a 100% decrease would be a value of 2; see **Figure 1a**). Ewing et al. (2016) employed the upper alpha band (10.5–13 Hz) at the P4 location and the theta band (4–8 Hz) at the Fz location in the following manner: an increase of theta accompanied by an increase in alpha (both of more than 100% of baseline) corresponds to the user being in a state of boredom and the speed is increased; an increase of theta accompanied by a decrease in alpha (both of 100% or more) corresponds to the user being in a state of overload and the speed is decreased. Similarly, our system's speed adaptations are triggered by changes of 100% or more from the baseline level of the upper alpha band (10.5–13 Hz). This band was chosen from Ewing et al. (2016) following pilot testing.
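The fixed-threshold rule can be sketched as follows. This is a minimal illustration, not the implementation used in the study: all function and constant names are hypothetical, and the mapping from a crossed bound to a speed change is an assumption taken from the description of the upper and lower thresholds given for Condition 3.

```python
# Illustrative sketch only; names are hypothetical, not from the original system.
FOLD = 2.0          # a 100% (two-fold) change relative to baseline
SLOWEST_MS = 1500   # slowest allowed speed (ms per line)
FASTEST_MS = 100    # fastest allowed speed (ms per line)

def crossed_bound(alpha_power: float, baseline_alpha: float, fold: float = FOLD) -> str:
    """Return which fixed bound ('upper', 'lower' or 'none') alpha power has crossed."""
    if baseline_alpha <= 0 or fold <= 1:
        raise ValueError("baseline must be positive and fold greater than 1")
    ratio = alpha_power / baseline_alpha
    if ratio >= fold:
        return "upper"
    if ratio <= 1.0 / fold:
        return "lower"
    return "none"

# Assumed mapping: crossing the upper bound slows the game (more ms per line),
# crossing the lower bound speeds it up, in steps of 100 ms.
STEP_MS = {"upper": +100, "lower": -100, "none": 0}

def next_speed_ms(current_ms: int, alpha_power: float, baseline_alpha: float) -> int:
    """Apply one adaptation step and clamp to the allowed speed range."""
    step = STEP_MS[crossed_bound(alpha_power, baseline_alpha)]
    return min(SLOWEST_MS, max(FASTEST_MS, current_ms + step))
```

With a baseline of 4, a value of 8 crosses the upper bound and a value of 2 crosses the lower bound, matching the worked example in the text.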

#### Condition 3

The third condition is similar to the second condition (see **Figures 1a,b** for a side-by-side comparison) with the addition of adaptive thresholds. The thresholds adapt to the emotional valence of the subject, i.e., the extent to which an emotion is positive (Lang et al., 1995). Valence is measured with the use of facial expression analysis, which is explained in the next section. The emotional valence is measured over 2 s before and after each speed adaptation. If the after-before delta is positive, the subject is deemed satisfied with the adaptation, and the threshold value is considered appropriate and is maintained. If the delta is negative, the subject is deemed dissatisfied with the adaptation and the threshold value is modified as follows. There are two threshold values: the lower threshold, which, when crossed, leads to an increase in speed, and the upper threshold, which, when crossed, leads to a decrease in speed. When the subject is deemed dissatisfied with the adaptation, only the threshold that was crossed is changed. The thresholds move away from the middle, i.e., the upper threshold increases or the lower threshold decreases, leading to a less severe adaptation criterion (see **Figure 2**). For example, if the upper threshold value was 1 and the baseline alpha value was 4, the current alpha value crosses the threshold when it reaches past 8 for a given window (a 100% increase). Thus, the decision rule determines that the game speed must slow down. When the speed is modified, the delta valence is measured: the valence value from before the speed change is subtracted from the valence value after the speed change to see if the change is received positively or negatively. For example, if the valence before the speed change was 2 and the valence after the speed change is 1, the delta valence is negative (−1). The threshold value from our previous example, a value of 1 for the upper threshold, should then be modified.
It will move away from the center; therefore, the upper threshold value will increase to 1.1. In the next window, for the speed to change, the alpha value will need to reach 8.4 (an increase of 110% over the baseline value of 4). In addition, when, for a given 10 s epoch, no adaptation takes place while delta valence is negative, both thresholds are considered too loose and are brought closer to the middle, i.e., the upper threshold decreases and the lower threshold increases, leading to a more severe adaptation criterion.
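The update rule described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the study's code. The class and method names are hypothetical, both thresholds are expressed as ratios to baseline (so the threshold value of 1 in the example corresponds to a ratio of 2.0), the initial lower ratio of 0.5 and the reuse of the 0.1 step for tightening are assumptions, and a delta of exactly zero is treated as "maintain" because the text does not specify that case.

```python
from dataclasses import dataclass

@dataclass
class DynamicThresholds:
    """Hypothetical container for the two adaptation bounds (ratios to baseline)."""
    upper_ratio: float = 2.0   # 100% increase over baseline
    lower_ratio: float = 0.5   # assumed: two-fold decrease from baseline
    step: float = 0.1          # step size taken from the worked example

    def after_adaptation(self, crossed: str, valence_before: float,
                         valence_after: float) -> None:
        """Loosen only the crossed threshold when the adaptation was disliked."""
        if crossed not in ("upper", "lower"):
            raise ValueError("crossed must be 'upper' or 'lower'")
        if valence_after - valence_before >= 0:
            return  # adaptation accepted: thresholds are maintained
        if crossed == "upper":
            self.upper_ratio += self.step   # move away from the middle
        else:
            self.lower_ratio -= self.step   # move away from the middle

    def after_idle_epoch(self, delta_valence: float) -> None:
        """Tighten both thresholds after a 10 s epoch with no adaptation."""
        if delta_valence < 0:
            self.upper_ratio -= self.step
            self.lower_ratio += self.step

    def is_attainable(self) -> bool:
        """False once a bound is non-positive or the two bounds have crossed."""
        return 0.0 < self.lower_ratio < self.upper_ratio
```

Starting from the defaults, a valence drop from 2 to 1 after an upper-threshold crossing moves the upper ratio to 2.1, i.e., an alpha value of 8.4 for a baseline of 4. Because nothing in this rule bounds the updates, repeated loosening or tightening can make `is_attainable()` return false, so a real system would probably need to clamp the bounds.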

### Measurement Tools and Processing

#### EEG Recording and Processing

The EEG was recorded from 32 Ag-AgCl preamplified electrodes mounted on the actiCap and with a brainAmp amplifier (Brainvision, Morrisville). The recording reference was FCz and the acquisition rate was 500 Hz. All the EEG processing was performed in the NeuroRT software (Mensia, Rennes). The following steps were performed in this order for the online preprocessing of the data: down-sampling to 256 Hz, bandpass filtering with an infinite impulse response filter at 1–50 Hz, notch filtering at 60 Hz, blink removal through blind source separation, re-referencing to the common average reference, and artifact detection by computing the Riemannian distance between the covariance matrix and the online mean. The data were then filtered to keep only the upper alpha band (10.5–13 Hz) and squared. Only data from P3 and P4 were kept, in accordance with Ewing et al. (2016). Finally, each 0.5 s epoch was divided by the average of the 1-min calibration period to normalize the data. This value, which ranged from 0.008 to 6.87, was then sent to the adaptation system to be used in the decision rule-based process.
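The final normalization step can be illustrated with a short sketch. It assumes a single-channel signal that has already been band-pass filtered to the upper alpha band; the filtering, artifact handling, and the way P3 and P4 are combined are not reproduced here, and all names are hypothetical.

```python
from statistics import fmean

FS_HZ = 256      # sampling rate after down-sampling
EPOCH_S = 0.5    # epoch length used for the alpha power estimate

def epoch_powers(filtered, fs=FS_HZ, epoch_s=EPOCH_S):
    """Mean squared amplitude of each consecutive, non-overlapping epoch."""
    n = int(fs * epoch_s)
    return [
        fmean(x * x for x in filtered[i:i + n])
        for i in range(0, len(filtered) - n + 1, n)
    ]

def normalized_alpha(game_signal, calibration_signal):
    """Divide each game epoch's power by the mean calibration power."""
    baseline = fmean(epoch_powers(calibration_signal))
    if baseline <= 0:
        raise ValueError("calibration power must be positive")
    return [p / baseline for p in epoch_powers(game_signal)]
```

A value of 1.0 then means that alpha power equals its calibration average, which is the ratio scale on which the decision rule operates.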

#### Facial Expression Analysis

Emotional valence was obtained through the use of facial expression analysis performed in real time with the FaceReader software (Noldus, Wageningen). Automatic facial expression analysis uses the Facial Action Coding System (Ekman et al., 1980) that has traditionally been employed by human experts who manually code video playback for facial expressions. Specific combinations of muscle contractions are associated with certain emotions. Computer image recognition is now able to detect these same muscle contraction combinations and code the user's emotions in real time (Bartlett et al., 1999). We used the FaceReader valence ratio, which is calculated as the intensity of positive emotion minus the intensity of the negative expression with the highest intensity. The software maps the facial muscles of the user and derives valence at a rate of 30 Hz. The valence was streamed to the adaptive system, which used the 2 s before and the 2 s after an adaptation to calculate the change in valence caused by the adaptation. A positive change tells the system that the adaptation was appropriate, meaning that the threshold parameters were correct, while a negative change tells the system that the adaptation was wrong and that the thresholds should be modified.
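As a sketch of this computation, the snippet below derives a valence value from expression intensities and the change in valence around an adaptation. Averaging the samples within each 2 s window is an assumption (the text does not say how the window is summarized), and the function names are hypothetical rather than part of the FaceReader software.

```python
from statistics import fmean

VALENCE_HZ = 30   # valence sampling rate reported in the text
WINDOW_S = 2.0    # window before and after each adaptation

def valence_ratio(positive_intensity, negative_intensities):
    """Positive intensity minus the strongest negative expression intensity."""
    return positive_intensity - max(negative_intensities)

def delta_valence(stream, event_index, fs=VALENCE_HZ, window_s=WINDOW_S):
    """Mean valence after the adaptation minus mean valence before it."""
    n = int(fs * window_s)
    if event_index < n or event_index + n > len(stream):
        raise ValueError("not enough samples around the adaptation")
    before = stream[event_index - n:event_index]
    after = stream[event_index:event_index + n]
    return fmean(after) - fmean(before)
```

A negative return value would be read by the adaptive system as dissatisfaction with the speed change.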

### Game Performance

The behavior of each system (Conditions 1–3) was logged in order to compare them and understand what drove user experience. The game speed was controlled as the duration (ms) it took for a block to move down one line. This was converted post hoc into levels. The lower the level, the slower the game, and there was a 100 ms difference between levels (i.e., level 1 = 1,500 ms, level 15 = 100 ms). The number of "game deaths" per subject and the score of the subject at the end of each condition were also logged (**Table 1**). Participants scored 100 points each time they cleared a line, 300 points for two lines, 500 points for three lines, and 800 points for four lines.
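The conversion from speed to level and the scoring scheme can be written down directly from these definitions. The sketch below uses hypothetical function names and only the values given in the text.

```python
SLOWEST_MS = 1500   # level 1
FASTEST_MS = 100    # level 15
STEP_MS = 100       # difference between adjacent levels

# Points awarded for clearing 1, 2, 3 or 4 lines at once.
LINE_SCORES = {1: 100, 2: 300, 3: 500, 4: 800}

def ms_to_level(ms_per_line: int) -> int:
    """Convert a drop duration (ms per line) into a game level from 1 to 15."""
    if not FASTEST_MS <= ms_per_line <= SLOWEST_MS:
        raise ValueError("speed outside the allowed range")
    if (SLOWEST_MS - ms_per_line) % STEP_MS:
        raise ValueError("speed must be a multiple of the 100 ms step")
    return (SLOWEST_MS - ms_per_line) // STEP_MS + 1

def total_score(lines_cleared_per_event) -> int:
    """Sum the points over a sequence of line-clearing events."""
    return sum(LINE_SCORES[n] for n in lines_cleared_per_event)
```

Under this conversion, the initial speed of 400 ms per line corresponds to level 12.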

#### Perceived Measures

In addition to game performance, we also measured the perception of the participant at the end of each experimental condition. We used the In-game and Post-game Game Experience Questionnaire (GEQ), a shortened version of the core questionnaire which was developed to assess multiple iterations of an experiment (IJsselsteijn et al., 2008). These questionnaires have 14 and 17 items, respectively. The In-game GEQ yields scores on 6 dimensions: competence, flow, annoyance, challenge, positive affect, and negative affect. The Post-game GEQ yields scores on 4 dimensions: positive experience, negative experience, tiredness, and returning to reality. Each item has a 5-point scale ranging from 0 (Don't get the feeling described in the statement at all) to 4 (Extremely get the feeling described in the statement).

TABLE 1 | High and low speed comparisons for each experimental condition.

#### Statistical Analysis

Several statistical tests were performed between the three conditions (control, EEG, adaptive thresholds) and, subsequently, between clusters of participants. In order to compare conditions (GEQ, total scores, average game level, valence, alpha power), we used Wilcoxon tests to compare the 3 conditions and Mann–Whitney tests to compare the clusters of participants. Both these tests are non-parametric, as is appropriate considering the sample size. Also, a Kolmogorov–Smirnov test was used to test for normality and found only one variable to be normally distributed, the alpha power in conditions B and C (D = 0.12, p = 0.15; D = 0.16, p = 0.15), further justifying the use of non-parametric tests. We used Poisson regressions with random effects to account for repeated measures from the same participant to compare total number of deaths and total number of level changes between the 3 conditions.

## RESULTS

As can be seen in **Figure 3**, when compared to the control condition (Condition 1), the number of game deaths was significantly lower in the EEG only BCI (Condition 2) and in the adaptive thresholds condition (Condition 3). The score, when compared with the control condition, was lower in both BCI conditions. The average game level (seen in **Figure 4**) in each condition was 12.53 for control (Condition 1), 6.18 for EEG only (Condition 2), and 8.92 for adaptive thresholds (Condition 3). All differences were significant at p < 0.05. These results show that both BCI conditions seem to have "over adapted" the game, resulting on average in slower games (lower game level) and decreased player performance (score) compared to the control condition.

As for the GEQ, results and significant differences can be seen in **Figures 5**, **6**. Participants in the EEG condition (Condition 2) perceived the experience as less challenging (comparison 1 vs. 2, p = 0.004; comparison 2 vs. 3, p = 0.059), negatively charged (comparison 1 vs. 2 p = 0.012), and less conducive to flow (comparison 1 vs. 2, p = 0.019; comparison 2 vs. 3, p = 0.086) compared to the two other conditions. The fixed threshold condition (Condition 3) is perceived as less annoying (comparison 2 vs. 3 p = 0.032) and participants in this condition develop a higher perceived competence than in the two other conditions (comparison 1 vs. 2 p = 0.008; comparison 2 vs. 3 p = 0.033). Participants felt the experience in Condition 1 more negatively than in the two other conditions (comparison 1 vs. 2 p = 0.031; comparison 1 vs. 3 p = 0.041).

When explored further, our results appeared to show two different clusters of responses in Condition 3. Thus, we evaluated how many subjects spent 5 or more periods of 10 s at the fastest speed or at the slowest speed. Once these two groups were identified, we compared them with Wilcoxon tests to evaluate if they differed in their alpha power change from benchmark, valence, GEQ, number of game deaths, and game scores.
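The grouping criterion can be sketched as below. The sketch assumes that each participant's game level is logged once per 10 s period and that a period counts toward an extreme when the logged level equals that extreme; both points, and all names, are assumptions rather than details given in the text.

```python
FASTEST_LEVEL = 15   # 100 ms per line
SLOWEST_LEVEL = 1    # 1,500 ms per line
MIN_PERIODS = 5      # "5 or more periods of 10 s"

def extreme_speed_groups(levels_per_period, min_periods=MIN_PERIODS):
    """Return the set of extreme-speed groups a participant falls into."""
    groups = set()
    if sum(level == FASTEST_LEVEL for level in levels_per_period) >= min_periods:
        groups.add("high_speed")
    if sum(level == SLOWEST_LEVEL for level in levels_per_period) >= min_periods:
        groups.add("low_speed")
    return groups
```

Under this reading a participant can belong to both groups or to neither, depending on how their level evolved over the 15 min.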

Results suggest that in Condition 3, 9 participants spent a large portion of their time at the slowest speed, while 10 participants spent most of their time at the fastest speed. Only three participants spent <25 s at either extremity. When compared, these two groups (slow speed group vs. high speed group, Condition 3) are statistically different for their average EEG alpha power change from baseline (p = 0.01), where the high-speed group has more alpha (high speed = 0.99; low speed = 0.67). They are also statistically different in terms of valence values (p = 0.02; high speed = 0.01; low speed = 0.20).

We also evaluated if participants differed in Condition 2 to see if they were intrinsically different or if they were only different while using the dynamic threshold adaptive system. For those two groups (high and low-speed), their values did not differ in Condition 2 (alpha p = 0.35; valence p = 0.23; **Table 1**). Also, these groups differed on some dimensions of the GEQ. In Condition 2, they reported different annoyance perceptions (low speed = 2.78; high speed = 1.7; p = 0.02) and in Condition 3, challenge (low speed = 2.17; high speed = 3.65; p = 0.03) and competence perceptions differed (low speed = 3.22; high speed = 1.35; p < 0.01). Finally, the two groups significantly differed in game deaths in Condition 3 (high speed = 0; low speed = 19) and score in Conditions 1 and 2 (respectively, low speed = 4028.57, high speed = 5233.33; p = 0.001 and low speed = 2028.57, high speed = 2,700; p = 0.001).

## DISCUSSION

Our results show that the adaptive systems did not improve the participants' experience while playing Tetris. Contrary to our expectation, participants had more negative affect from the fixed threshold and the dynamic thresholds conditions than in the traditional Tetris game (control condition). In the EEG only condition, similarly to Ewing et al. (2016), participants had fewer game deaths, a lower score, and a lower average speed; showing that the system slowed down the game, leading to the participants feeling more confident, but less challenged and ultimately feeling more negative affect.

If we examine the adaptive threshold condition by separating the participants according to how much time they spent at each extreme level, we can see that participants have significantly different alpha and valence. However, they did not differ in EEG only condition, showing that they are not fundamentally different in their neurophysiology, but that their neurophysiological responses to the use of an adaptive system were different. Obviously, the high-speed group found the challenge higher and their competence lower as they were generally unable to keep up with the game and died many more times than the low speed group.

One potential explanation of our counterintuitive results could be found in Csikszentmihalyi's flow theory (Nakamura, 2014). This specific flow theory postulates that an optimal positive valence experience emerges in an autotelic context when one's competence is in alignment with a given challenge. In self-motivating contexts, such as games, the theory predicts that when the challenge is above one's competency, a state of anxiety and frustration is likely to emerge. As the competency of the player evolves in addressing this challenge, the theory suggests that the individual is likely to move back toward a positively valenced equilibrium state. In our study, it is possible that the parameters to capture the change in the valence might not have been optimized to account for the time it takes for the player to come back to this equilibrium. This would suggest that future research needs to not only personalize the values that trigger adaptation, but also needs to personalize the adaptation window duration. This could be done through a more complex baseline task that would be able to capture this idiosyncratic latency in players.

FIGURE 4 | Average game level for each experimental condition. Each line is the game level of a study participant over time in milliseconds. A higher game level means that the speed at which the blocks descend increases, thus diminishing the time available to position the block properly and increasing difficulty. In Condition 1, participants began at level 12 and the level increased progressively. In Condition 2, for most participants the level decreased over time. In Condition 3, a split appears, where some participants see their level increase irreversibly, while other participants have a result more similar to Condition 2. In a functioning system, there would be a lot of variations early on and after some time the levels would stabilize and adjust linearly. For Condition 3, we present only results for which the adaptive mechanism leads to thresholds that were mathematically attainable. This issue is discussed in further detail in the next section.

FIGURE 5 | Average answers to the in-game portion of the Game Experience Questionnaire (GEQ) showing the results for each dimension measured. Error bars represent standard deviation. \*p < 0.05.

With both these personalizations in place (threshold and adaptation time), we could hope to reach both subject independence and context independence. Subject independence and context independence are the big challenges currently facing passive BCIs in their quest for use in everyday life (BNCI Horizon 2020, 2015). The hope is to have a system that can be simply put on the user's head and work straight away in a variety of contexts without the need for lengthy calibration tasks. Our goal was to develop a system that continuously adjusts its action triggering threshold rules which would have been a step in the direction of subject independence, or at the least, a system that is subject dependent and context independent. While our system needs more testing and calibration to evaluate its feasibility, we trust that continuous adaptation and the use of multimodal neurophysiological signals hold the key to those challenges. Other researchers should keep in mind that the use of valence-based tools may differ in the context of games with more emotionally charged objects and not extrapolate our results to those contexts. Future works will explore more comprehensive EEG data than alpha alone, and various window durations to see if the affective response to the speed adaptation appears after a delay, or if it may simply take some time for the user to get comfortable at a given level thus the speed changes should happen less frequently. In addition, as speed changes in quick succession are not expected when playing Tetris, perhaps novelty played a role in the user's dissatisfaction, thus a variety of adaptation windows could be tested.

While evaluating a variety of neurophysiological data sources and time windows could lead to a solution, other possible avenues of exploration have emerged. As mentioned in the literature, machine learning would greatly improve this type of system (Scherer et al., 2013; Brouwer et al., 2015; Ewing et al., 2016). On one side, recent advances in deep learning techniques could help enhance the accuracy of cognitive state inference models. Deep learning models are well suited to find complex hierarchical patterns in high dimensionality data such as neurophysiological data. Research using such approaches for psychophysiological inference (Martinez et al., 2013) or EEG signal decoding (Schirrmeister et al., 2017) show promising results. On the side of the interaction loop, we would recommend the use of reinforcement learning in the case of hybrid BCIs. Reinforcement learning uses a problem space to determine where it is in relation to the optimal location. The system adapts within this problem space and gets rewarded as it gets closer to the goal. In the case of our proposed biocybernetic loop, the problem space would be defined as a grid of difficulty by neurophysiological state and the reward could be determined by a second neurophysiological signal. Consider the system presented in this paper. As a reinforcement learning system, the problem space in which the model would evolve and try to find an optimal path would be composed of the two dimensions of cognitive load and current speed of the game. Each action undertaken by the model would be rewarded by the local changes in the subject's valence following the adaptation. Doing so, the system should converge toward an optimal state of difficulty vs. cognitive load that is specific to each subject and that may change over time. In this approach, the adaptation thresholds would be replaced by the transitioning policy optimized over time that the system uses to decide which action to perform. 
We can imagine a similar system with a problem space composed of simulator challenge and vigilance level with a reward based on stress levels. The idea behind this is that an affective index could confirm or refute, in real time, whether the action performed by the system improved the user experience, instead of having to validate verbally with users after they have tested the system for a while and then adjust it manually. Our future work will explore this promising alternative.
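To make this proposal concrete, the sketch below uses tabular Q-learning, which is only one possible reinforcement learning method and is not specified in the text. The state is assumed to be a pair (cognitive load bin, game level), the actions are level changes, and the reward is the change in valence following the adaptation. All names and hyperparameter values are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # lower the level, keep it, or raise it

class ValenceQLearner:
    """Tabular Q-learning over (load_bin, level) states, rewarded by delta valence."""

    def __init__(self, learning_rate=0.1, discount=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)          # Q-values keyed by (state, action)
        self.learning_rate = learning_rate
        self.discount = discount
        self.epsilon = epsilon               # exploration probability
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of the next level change."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One Q-learning step; reward is the delta valence after the action."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.discount * best_next
        key = (state, action)
        self.q[key] += self.learning_rate * (target - self.q[key])
```

In such a design the learned action values would play the role of the adaptation thresholds, so there would be no explicit threshold that could drift to an unattainable value.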

We endeavored to develop a system that improved upon the authentic version of the Tetris game; however, our proposed adaptive system improved neither players' perceived experience nor their objective performance. In practice, for Condition 3, the way we modified the threshold values based on valence could eventually lead to values that were impossible to attain. For some participants, the thresholds crossed, so that some lower thresholds became higher than the upper thresholds, and some thresholds became negative, which is impossible to attain from the alpha ratio. Nevertheless, this experience can inform developers of hybrid passive BCIs on a novel way to merge various neurophysiological features. If the goal of a passive BCI system is to be widely adopted, it must be better than the current version of a similarly oriented system. Comparisons to other versions of a given BCI (Liu et al., 2009; Chanel et al., 2012) are very informative to better understand how parameters of the BCI influence the user-system relationship; however, this is only one part of the picture when it comes to the future adoption of passive BCIs in everyday lives. As mentioned by Nijholt et al. (2011), new types of human-computer interaction take years of research before they come into their own, thus all advances should be presented and discussed. Ultimately, the true test will be to compare any new system to the status quo in order to improve it.

#### AUTHOR CONTRIBUTIONS

EL-L and FC designed the study and developed the system. All authors contributed to the analysis and the writing.

#### REFERENCES

fuzzy neural networks. IEEE Trans. Circuits Syst. I Regul. Pap. 53, 2469–2476. doi: 10.1109/TCSI.2006.884408


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Labonte-Lemoyne, Courtemanche, Louis, Fredette, Sénécal and Léger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Single-Trial EEG Analysis Predicts Memory Retrieval and Reveals Source-Dependent Differences

Eunho Noh<sup>1</sup>, Kueida Liao<sup>1</sup>, Matthew V. Mollison<sup>2</sup>, Tim Curran<sup>2</sup> and Virginia R. de Sa<sup>3</sup>\*

<sup>1</sup> Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA, United States, <sup>2</sup> Department of Psychology and Neuroscience, University of Colorado Boulder, Boulder, CO, United States, <sup>3</sup> Department of Cognitive Science, University of California, San Diego, San Diego, CA, United States

We used pattern classifiers to extract features related to recognition memory retrieval from the temporal information in single-trial electroencephalography (EEG) data during attempted memory retrieval. Two-class classification was conducted on correctly remembered trials with accurate context (or source) judgments vs. correctly rejected trials. The average accuracy for datasets recorded in a single session was 61% while the average accuracy for datasets recorded in two separate sessions was 56%. To further understand the basis of the classifier's performance, two other pattern classifiers were trained on different pairs of behavioral conditions. The first of these was designed to use information related to remembering the item and the second to use information related to remembering the contextual information (or source) about the item. Mollison and Curran (2012) had earlier shown that subjects' familiarity judgments contributed to improved memory of spatial contextual information but not of extrinsic associated color information. These behavioral results were similarly reflected in the event-related potential (ERP) known as the FN400 (an early frontal effect relating to familiarity), which revealed differences between correct and incorrect context memories in the spatial but not color conditions. In our analyses we show that a classifier designed to distinguish between correct and incorrect context memories more strongly involves early activity (400–500 ms) over the frontal channels for the location distinctions than for the extrinsic color associations. In contrast, the classifier designed to classify memory for the item (without memory for the context) had more frontal channel involvement for the color-associated experiments than for the spatial experiments. Taken together, these results argue that location may be bound more tightly with the item than an extrinsic color association. The multivariate classification approach also showed that trial-by-trial variation in EEG corresponding to these ERP components was predictive of subjects' behavioral responses. Additionally, the multivariate classification approach enabled analysis of error conditions that did not have sufficient trials for standard ERP analyses. These results suggested that false alarms were primarily attributable to item memory (as opposed to memory of associated context), as commonly predicted, but with little previous corroborating EEG evidence.

Keywords: EEG, memory retrieval, old/new effect, multi-variate analysis, prediction

#### Edited by:

Felix Putze, University of Bremen, Germany

#### Reviewed by:

Dezhong Yao, University of Electronic Science and Technology of China, China; Jing Jin, East China University of Science and Technology, China; Marco Steinhauser, Catholic University of Eichstätt-Ingolstadt, Germany

\*Correspondence:

Virginia R. de Sa desa@ucsd.edu; desa@cogsci.ucsd.edu

Received: 11 September 2017 Accepted: 05 June 2018 Published: 10 July 2018

#### Citation:

Noh E, Liao K, Mollison MV, Curran T and de Sa VR (2018) Single-Trial EEG Analysis Predicts Memory Retrieval and Reveals Source-Dependent Differences. Front. Hum. Neurosci. 12:258. doi: 10.3389/fnhum.2018.00258

### INTRODUCTION

fnhum-12-00258 July 6, 2018 Time: 17:33 # 2

Previous recognition memory studies have used electroencephalography (EEG) to identify neural substrates of recognition memory. The 'parietal old/new effect' is a positive-going event-related potential (ERP) typically observed in the parietal electrodes between 500 and 800 ms and typically left lateralized. It shows greater amplitude for the correctly recognized old (hits) compared to the new (correct rejections) test items. It has been found that this effect correlates with the amount of information retrieved from the study episode (Wilding and Rugg, 1996; Curran, 2000; Wilding, 2000; Rugg and Curran, 2007; Tsivilis et al., 2015); hence, it is understood as a neural correlate of recollection. The 'frontal old/new effect' (or the FN400) is a frontally distributed and negative-going ERP which peaks earlier around 400 ms. The FN400 is interpreted as a neural correlate of familiarity since it shows a more negative peak for less familiar items while it typically does not vary for different amounts of recollected context information (Curran, 2000; Rugg and Curran, 2007; Tsivilis et al., 2015).

Pattern classification methods have been recently applied to EEG data to reveal novel findings during encoding of episodic memory (Jafarpour et al., 2014; Noh et al., 2014a; Anderson et al., 2015; Ratcliff et al., 2016). In Noh et al. (2014a), the classifier was used as a discriminative dimensionality reduction method to project the high-dimensional EEG data onto a discriminative space. These projections revealed neural correlates of levels of encoding in the pre- and during-stimulus periods of the study phase. This multivariate analysis directly controls for the multiple comparison problem (MCP) by effectively reducing the number of test variables. A major advantage of this approach is that it is possible to compare the brain activity across conditions even when the trial count is low, provided that a sufficient number of classifier training trials are used to establish the initial hyperplane(s) (Noh and de Sa, 2014). Hence conditions that divide subtle behavioral differences can be readily compared. In ERP studies, these data are usually ignored or combined with other conditions to acquire reasonable ERPs for analysis. This may result in losing the ability to reveal the neural mechanisms underlying subtle behavioral differences.

Our study aims to create classifiers to discriminate between the correctly identified old/new trials during the recognition phase of episodic memory experiments on a single trial basis. We also utilize pattern classifiers as multivariate analysis tools to analyze the brain activity during retrieval of recognition memory using the time domain information of the EEG data. The EEG data were collected from three separate visual memory task experiments with extrinsic source information. Two types of source information were considered in these experiments. Spatial information (the location of the item) was of interest in Experiment 1 and extrinsic color information (the color of an external frame) was of interest in Experiment 2. In Experiment 3, both source types were considered. Data collected from these experiments were used to conduct multivariate analysis via pattern classifiers. The data used were previously collected by Mollison and Curran (2012). In the experiments, subjects were asked to remember items as well as the contextual source information (side of the screen, or color of outlined box). In the test phase they were asked to indicate whether they believed they had seen the item before, and if so to give the associated source information as well as their confidence in that judgment by specifying whether they remembered the source information, any other information, or whether the item was just familiar. Mollison and Curran (2012) found that even familiar judgments were associated with above-chance source judgments and that the FN400 distinguished between the source-correct and source-incorrect responses only for the location-source information but not the box-color source information. In this work, we specifically train separate classifiers to extract information related to item memory (without correct source memory) and source memory (for correctly remembered items) to observe any source-dependent differences that the classifiers extract between the experiments with different source types.

The average projection values (or classifier scores) of the different source retrieval conditions and different subjective rating conditions are also compared to reveal the relationship between the different conditions and memory retrieval strength. Furthermore, data from the error conditions (incorrectly identified new trials, incorrectly rejected old trials) are projected onto the discriminative vector characterized by the different classifiers. The average projection values of these error trials are compared to those given by the other conditions and across the different projection directions.

#### MATERIALS AND METHODS

Electroencephalography for the current study was previously recorded in three separate visual memory task experiments (Mollison and Curran, 2012). All procedures were approved by the Institutional Review Board at the University of Colorado Boulder and were conducted in accordance with this approval. All participants gave written informed consent before the experiment.

#### Experiment 1 Participants

The subjects were right-handed University of Colorado undergraduate students (ages 18–28, mean = 21.4) who volunteered for paid participation (\$15 per hour) or course credit (17 male, 13 female). All subjects were native English speakers and had normal or corrected-to-normal vision.

#### Experimental Paradigm and EEG Acquisition

The experiment was divided into four blocks consisting of a study and recognition phase. The stimuli were color images of physical objects, animals, and people. A total of 1297 images were selected from http://www.clipart.com, the stimuli set by Brady et al. (2008), and image search on the Internet. All images were resized to 240 pixels × 240 pixels and presented on a square white background. For each subject, a total of 416 images were randomly selected as the study items (104 items per block). The test lists consisted of 100 old items from the preceding study list with 50 foil items given in random order. The first and last two stimuli in the study list were excluded from the test list to reduce primacy and recency effects.

During the study phase, the study items were presented on either the left or right side of the fixation cross. The subjects were instructed to memorize the side of the screen on which each study item was given. The spatial location of the item was considered as the source information in this experiment. A study item was shown for 1000 ms followed by an inter-stimulus interval with varying lengths (uniformly distributed within 625 ± 125 ms). A visual Gaussian noise image was shown at the locations of study item presentations whenever an item was not being presented to prevent after-image effects from the stimulus. The area containing the possible study image locations subtended a visual angle of 11.4° wide × 5.6° high.

In the recognition phase, a fixation cross appeared on the center of the screen for 750 ms. A test item was shown for 750 ms on top of the fixation cross followed by a 1500 ms long fixation cross. The visual angle of each test probe image was 4.3° wide × 4.3° high. Then the subjects were given two consecutive questions where the second question type depended on the subject's answer on the first one. An inter-stimulus interval of 625 ± 125 ms followed each response. In the first question, subjects were asked to make a source/new judgment where source was the location of the item in the study phase. The first question had three options: left, right (given as L and R, respectively) and a new judgment (given as N). If the subjects responded with L or R in the first question, they were asked to give a modified R-K judgment in the second question. The R-K judgment question had three options: remember side (given as RS), remember other (given as RO), and familiar (given as F). Subjects were instructed to respond with RS if they remembered the source information, RO if they remembered something other than the source information, and F if they could not remember any details of learning the item but it looked familiar. If the subjects responded with new in the first question, they were asked to give a confidence of that response: sure (given as S) or maybe (given as M) based on how confident they were about it being a new item. See **Figure 1A** for an illustration of the study and test tasks in the experiment. The keys for left responses were assigned to the left hand (z or x key), the keys for right responses were assigned to the right hand (. or / key), and the keys for new responses were assigned to one of the outermost keys (z or / key). For the confidence judgments, the keys were set up from left to right to follow memory strength in either descending or ascending order. The familiar (F) responses and remember (RS/RO) responses were always assigned to different hands. The key assignment was fixed for a given subject, but all possible key combinations were distributed to an equal number of subjects.

EEG was recorded with a 128-channel Geodesic Sensor Net™ [HydroCel GSN 200, v.2.1; Tucker (1993)] at 250 Hz sampling rate using an AC-coupled 128-channel, high-input impedance amplifier (300 MΩ, Net Amps™; Electrical Geodesics Inc., Eugene, OR, United States) with a 0.1–100 Hz bandpass filter. Initial common reference was the vertex channel (Cz) and the individual electrodes were adjusted until impedance measurements were lower than 40 kΩ. **Figure 2** shows the locations of the electrodes.

#### Experiment 2 Participants

The subjects were right-handed University of Colorado undergraduate students (ages 18–27, mean = 21.2) who volunteered for paid participation (\$15 per hour) or course credit (17 male, 13 female). All subjects were native English speakers and had normal or corrected-to-normal vision.

#### Experimental Paradigm and EEG Acquisition

The stimuli set used in Experiment 1 was used in Experiment 2. In the study phase, the study items were presented with a 48-pixel wide color frame with eight possible colors (purple, green, blue, pink, red, orange, yellow, and brown). The color of the frame was considered as the source information in this experiment. Two of the four study lists used six colors and the two other study lists used the two remaining colors. Half of the subjects received the two-color condition in the even blocks and the other half of the subjects received the two-color condition in the odd blocks. All colors were randomly and evenly distributed over the study items.

During the study phase, the subjects were instructed to memorize the frame color with each of the presented study items. A study item was shown for 1500 ms followed by an inter-stimulus interval with varying lengths (625 ± 125 ms). A visual Gaussian noise image was given at the location of study item presentation whenever an item was not being presented to prevent after-image effects from the stimulus.

In the recognition phase, a fixation cross appeared for 750 ms with a preview of the two colors the subject would be choosing from immediately following the test item presentation. The number of preview colors was set to two for both six- and two-color conditions. If the test item was old (i.e., given in the preceding study list), its corresponding frame color was given in the preview. After the color preview, a test item was shown for 750 ms followed by a 1500 ms long fixation cross. Then the subjects were given two consecutive questions where the second question type depended on the subjects' answer on the first one. In the first question, subjects were asked to make a source/new judgment where source was the frame color given with the item in the study phase. The first question had three options: two colors (given as solid color squares) and a new judgment (given as N). If the subjects responded with a color in the first question, they were asked to give a modified R-K judgment in the second question. The R-K judgment question had three options: remember color (given as RC), remember other (given as RO), and familiar (given as F). Subjects were instructed to respond with RC if they remembered the source information, RO if they remembered something other than the source information, and F if they could not remember any details of learning the item but it looked familiar. If the subjects responded with new in the first question, they were asked to give a confidence of that response: sure (given as S) or maybe (given as M) based on how confident they were about it being a new item. See **Figure 1B** for an illustration of the study and test tasks in the experiment.

EEG was recorded as for Experiment 1.

#### Experiment 3


#### Participants

The subjects were right-handed University of Colorado undergraduate students (ages 18–29, mean = 20.6) who volunteered for paid participation (\$15 per hour) or course credit (21 male, 17 female). All subjects were native English speakers and had normal or corrected-to-normal vision.

#### Experimental Paradigm and EEG Acquisition

The experiment was conducted in two separate sessions occurring on separate days. Each session consisted of four lists where two lists were the location source paradigm (as in Experiment 1) and two lists were the color source paradigm (as in Experiment 2). Only two frame colors (blue and yellow) were used for the color condition to match the number of location and color conditions across lists. For the first session, half of the subjects received the color condition in the even list numbers and the other half of the subjects received the color condition in the odd list numbers. The second session used the opposite order. The stimuli used in the two previous experiments were used for this experiment.

For both source conditions, a source indicator frame (color condition: blue/yellow frame, location condition: white frame on the left/right side of the screen) appeared on top of the visual Gaussian noise image prior to each study item presentation for 500 ms. Then the study item was presented inside the source indicator frame for 2000 ms followed by a slightly increased inter-stimulus interval (1125 ± 125 ms).

The timing of the recognition phase was the same as the previous experiments. However, a number of changes were made to the procedures. No color preview was given prior to test item presentation during the color condition lists. Also, the solid color squares used as source cues (as in Experiment 2) were changed to letters B and Y to better match the location conditions. Finally, both of the source responses (B and Y/L and R) were assigned to one hand and the new response (N) was assigned to the other hand. The key assignments were counterbalanced across subjects.

Electroencephalography was recorded with the same equipment as in the previous experiment except with a 500 Hz sampling rate and without the 0.1 Hz hardware high-pass filter.

#### Pre-processing

Electroencephalography epochs from the recognition phase of each experiment were extracted and recalculated to average reference. In order to address any possible deficiencies in the average reference method, a subset of analyses were repeated using the reference electrode standardization technique (REST) (Yao, 2001; Dong et al., 2017; Lei and Liao, 2017), which uses reconstructed equivalent sources to re-reference electrode signals relative to a reference at infinity. The lead field for using REST had 3000 potential sources corresponding to the 128-channel HydroCel Geodesic Sensor Net™ recording system used (Mollison and Curran, 2012).

Each epoch was filtered between 0.1 and 50 Hz using a 40-tap FIR filter and baseline corrected using data from −200 to 0 ms. Data from Experiment 3 were down-sampled to 250 Hz after the pre-processing procedure to match the sampling rate of Experiments 1 and 2.
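The re-referencing and baseline steps can be illustrated with a minimal sketch. It assumes an epoch is stored as a list of channels, each a list of samples, and that the epoch begins at −200 ms (the epoch start is not stated in the text). The function names `average_reference` and `baseline_correct` are hypothetical, and the FIR filtering step is not reproduced here.

```python
from statistics import fmean

def average_reference(epoch):
    """Subtract the across-channel mean at every time point.

    epoch: list of channels, each a list of samples (same length).
    """
    n_samples = len(epoch[0])
    mean_per_sample = [fmean(ch[t] for ch in epoch) for t in range(n_samples)]
    return [[ch[t] - mean_per_sample[t] for t in range(n_samples)] for ch in epoch]

def baseline_correct(epoch, sfreq=250.0, epoch_start_ms=-200.0, base_end_ms=0.0):
    """Subtract each channel's mean over the pre-stimulus baseline window.

    epoch_start_ms is an assumption: the first sample is taken to be at -200 ms.
    """
    n_base = int(round((base_end_ms - epoch_start_ms) * sfreq / 1000.0))
    corrected = []
    for ch in epoch:
        base = fmean(ch[:n_base])  # mean of the -200 to 0 ms samples
        corrected.append([v - base for v in ch])
    return corrected
```

For Experiment 3 the same functions would be called with `sfreq=500.0`, since down-sampling is described as happening after pre-processing.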

### Classification Problem

Classification analysis was conducted separately on Experiment 1, Experiment 2, location source blocks from Experiment 3 (denoted as Experiment 3-location or Exp 3-loc), and color source blocks from Experiment 3 (denoted as Experiment 3-color or Exp 3-col). The data from Experiment 3 were divided into the different source conditions in order to reveal any potential differences between the location and color conditions that may correspond to ERP differences observed in Mollison and Curran (2012). Before conducting classification, the trials were divided into five conditions (SC: source correct, SI: source incorrect, CR: correct rejection, M: miss, FA: false alarm) based on their source judgments (1st response) as illustrated in **Figure 3**. Note that in **Figure 3** and for the rest of the paper, RS refers to remember source which includes both remember side and remember color.
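As an illustration of the five-way split, the sketch below maps a trial's first response to a condition label. The mapping is inferred from the condition names and the surrounding text rather than copied from Figure 3, and the function name and argument names are hypothetical.

```python
def label_condition(item_is_old, response, studied_source=None):
    """Map the first (source/new) response of a test trial to a condition.

    response: "N" for a new judgment, otherwise the chosen source
              (e.g. "L"/"R" for location, "B"/"Y" for color).
    studied_source: the source shown at study, or None for new items.
    """
    if item_is_old:
        if response == "N":
            return "M"    # miss: old item judged new
        # old item judged old: split by whether the source was right
        return "SC" if response == studied_source else "SI"
    # new item: either correctly rejected or falsely recognised
    return "CR" if response == "N" else "FA"
```

A trial in which an item studied on the left is later given the response `"R"` would therefore be labeled `SI`.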

The classifiers were trained to find the projection function onto the vector perpendicular to the decision boundary (we sometimes refer to these vectors as planes) which is characterized by the choice of the training conditions. The behavioral conditions corresponding to correct item retrieval (SC and SI) and correct item rejection (CR) were selected for training. As a result, three different binary classifiers (SC-CR, SI-CR, and SC-SI) with probability outputs (0 ≤ p ≤ 1) were trained to discriminate between pairs of behavioral conditions. These probability outputs given by the classifiers are denoted as classifier scores in this paper. The classifiers were trained on each individual subject, and for each classifier only the subjects with a minimum of 25 trials in each of its two training conditions (drawn from SC, SI, and CR) were included in the analysis. For each classification problem, the classifier scores were also computed for the trials which were not included in the training procedure (non-training trials).

(1) SC-CR classifier

The SC-CR classifier (trained to discriminate between SC and CR) was expected to find the projection which maximizes the difference in the amount of information retrieved from the study episode.

(2) SI-CR classifier


This classifier (trained to discriminate between SI and CR) was designed to discriminate between correctly retrieved old items (with incorrect source judgments) and the correctly rejected new items.

(3) SC-SI classifier

The SC-SI classifier (trained to discriminate between SC and SI) was designed to distinguish the correctly retrieved old items with correct source judgments from those with incorrect source judgments. Hence the classifier would extract the information on source memory retrieval.

#### Classification

The spatio-temporal structure of the ERPs was extracted based on previous findings on the old/new effect. Six channel groups were selected for evaluation (LAS, RAS, CM, LPS, RPS, and PM) as given in **Figure 2**. The average voltage for each channel group was computed and the data between 300 and 800 ms after test item presentation were extracted to take advantage of the ERP effects related to memory retrieval. The dimensionality of these subsequences was reduced to 5 by averaging over 100 ms length non-overlapping windows. The features from all six channel groups were concatenated to build a 30-dimensional feature vector for each trial. A binary classifier using linear discriminant analysis (LDA) with automatic shrinkage (Ledoit and Wolf, 2004; Schaefer and Strimmer, 2005) was trained to classify these feature vectors (Lotte et al., 2007; Blankertz et al., 2011). In order to avoid any overfitting to the training data, the projections for the training conditions were computed using leave-two-out (one from each class) cross-validation. In order to train with balanced classes, trials from the majority class were randomly discarded (from training) to have equal numbers of trials in each class. These trials, however, were still used for evaluation of the classifier (using a classifier trained on all the selected balanced training data). The data from the remaining conditions (e.g., Misses and False Alarms) were not used to evaluate the classifier, but were still projected onto the discriminative vector (learned from the entire balanced training set) for interpretative analysis.

TABLE 1A | Classification results for Experiment 1.

Subject SC-CR SI-CR SC-SI 102 0.5538 0.4702 0.4693 103 0.4857 0.6875 0.5572 0.6540 0.6720 0.5455 0.5358 0.6550 0.4891 0.4949 109 0.5593 0.5947 0.4712 0.5 112 0.4953 0.6667 0.328 0.6271 0.6746 0.5575 0.6442 0.5741 0.5269 0.5517 116 0.5251 0.5183 0.4839 0.5944 0.5148 0.5025 0.65 0.4762 0.5398 0.6154 0.5756 0.5 0.6172 0.5108 0.6090 0.5977 0.5057 0.5356 122 0.5255 0.6224 0.5420 0.6585 0.5603 0.5368 0.5649 0.5577 0.6104 0.6518 0.5571 0.4373 0.6955 0.4419 0.5981 0.6048 0.5530 0.5194 0.7474 0.4633 0.5302 0.5542 0.4328 0.4714

| | SC-CR | SI-CR | SC-SI |
|---|---|---|---|
| Overall | 0.6231 | 0.5090 | 0.5383 |
| Overall with 50 trials/class cutoff | 0.6290 | 0.5368 | 0.5397 |

Overall accuracies given in the penultimate row are the accuracies over all trials from the relevant classes for subjects with 25 or more trials per class. Overall accuracies in the last row are computed over all trials from relevant classes for subjects with 50 or more trials per class. Bolded entries are significantly better than chance (p < 0.05). Results from subjects with less than 50 trials per condition are italicized.

TABLE 1B | Classification results for Experiment 2.

Overall accuracies given in the penultimate row are the accuracies over all trials from the relevant classes for subjects with 25 or more trials per class. Overall accuracies in the last row are computed over all trials from relevant classes for subjects with 50 or more trials per class. Bolded entries are significantly better than chance (p < 0.05). Results from subjects with less than 50 trials per condition are italicized.
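The feature construction and the balanced leave-two-out scheme described in the Classification section can be sketched as follows. This is a minimal illustration under several assumptions: the input already holds one averaged time course per channel group, the epoch starts at −200 ms and is sampled at 250 Hz, and held-out trials are paired at random (the pairing rule is not given in the text). The function names are hypothetical.

```python
import random
from statistics import fmean

def window_features(group_means, sfreq=250.0, epoch_start_ms=-200.0,
                    start_ms=300.0, stop_ms=800.0, win_ms=100.0):
    """Build one trial's feature vector.

    group_means: one list of samples per channel group (already averaged
    over the electrodes in that group). Returns the concatenated means of
    the non-overlapping windows between start_ms and stop_ms.
    """
    win = int(round(win_ms * sfreq / 1000.0))                      # samples per window
    first = int(round((start_ms - epoch_start_ms) * sfreq / 1000.0))
    n_win = int(round((stop_ms - start_ms) / win_ms))
    features = []
    for signal in group_means:
        for k in range(n_win):
            lo = first + k * win
            features.append(fmean(signal[lo:lo + win]))
    return features

def balanced_leave_two_out(n_class_a, n_class_b, seed=0):
    """Randomly drop trials from the larger class, then return
    (held_out_a, held_out_b) index pairs: one trial from each class per fold.
    Random pairing is an assumption of this sketch."""
    rng = random.Random(seed)
    n = min(n_class_a, n_class_b)
    keep_a = rng.sample(range(n_class_a), n)
    keep_b = rng.sample(range(n_class_b), n)
    return list(zip(keep_a, keep_b))
```

With six channel groups and the default arguments, `window_features` returns 6 × 5 = 30 values per trial, matching the dimensionality stated in the text.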

#### Statistical Methods

The average classifier scores (for a given classification problem) across all subjects were compared across different behavioral conditions (SC, SI, CR, M, and FA). The classifier score is a projection of the high-dimensional EEG data onto a 1-dimensional vector which is representative of the given classification problem. Paired t-tests were conducted on the trial-by-trial classifier scores separately for the four available datasets to compare the classifier scores of the different retrieval/subjective rating conditions. A comparison was considered to be significant only when all four separate datasets gave p-values below 0.05 for the conditions of interest.

It is advantageous to also visualize the EEG features utilized by the classifiers for interpreting any effects identified from the multivariate analysis using the pattern classifiers. This was done by analyzing the classifier activation patterns, which represent the (channel, time) pairs that were important for classification (Haufe et al., 2014). For each source type, the 30-dimensional classifier activation pattern vector for each subject was normalized to have length 1.
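For a single linear discriminant direction, the transformation of Haufe et al. (2014) amounts to multiplying the feature covariance by the weight vector; the remaining scalar factor is removed by the unit-length normalization. The sketch below illustrates this under the assumption that the covariance is estimated from the same trials that are passed in. The function name and the data layout are hypothetical.

```python
from math import sqrt
from statistics import fmean

def activation_pattern(X, w):
    """X: list of trials (each a list of features); w: classifier weights.

    Returns cov(X) @ w scaled to unit Euclidean length.
    """
    d = len(w)
    mu = [fmean(x[j] for x in X) for j in range(d)]
    # Classifier output for each mean-centred trial
    s = [sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mu)) for x in X]
    # cov(X) @ w equals the covariance of each feature with the output s
    a = [sum((x[j] - mu[j]) * si for x, si in zip(X, s)) / (len(X) - 1)
         for j in range(d)]
    norm = sqrt(sum(v * v for v in a))
    return [v / norm for v in a]
```

With two perfectly correlated features and weights `[1.0, 0.0]`, the pattern assigns equal magnitude to both features, which shows why patterns rather than raw weights are inspected.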

In order to identify features consistent across subjects, a cluster-based method for correction for multiple comparisons was used (Maris and Oostenveld, 2007). In this method, each spatiotemporal pixel significantly different from zero (p < 0.05) was first identified. Then the t-statistics of all significant neighboring pixels with the same sign were summed within each cluster, and the maximum absolute value over all clusters was taken. This value was compared to the distribution of maximum absolute cluster values obtained from 10,000 random permutations of class labels for each subject. Temporal neighbors were temporally adjacent time windows. Spatial groups were considered neighbors if they contained adjacent electrodes from the cap layout (see **Figure 2**). Using this rule, LAS, CM, and RAS were all mutual neighbors; CM was also a neighbor of LPS and RPS; LPS and RPS were also neighbors of PM.
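The cluster statistic can be sketched as below. The spatial neighbor map follows the rule stated in the text; treating two features as neighbors only when they share a channel group (adjacent windows) or share a window (neighboring groups) is an assumption of this sketch, as are the names `SPATIAL` and `max_cluster_stat`. The permutation step that builds the null distribution is not shown.

```python
GROUPS = ["LAS", "RAS", "CM", "LPS", "RPS", "PM"]
# Spatial neighbours as listed in the text
SPATIAL = {
    "LAS": {"CM", "RAS"}, "RAS": {"CM", "LAS"},
    "CM": {"LAS", "RAS", "LPS", "RPS"},
    "LPS": {"CM", "PM"}, "RPS": {"CM", "PM"}, "PM": {"LPS", "RPS"},
}

def max_cluster_stat(t_vals, significant):
    """t_vals / significant: dicts keyed by (group, window_index).

    Returns the largest absolute summed t over clusters of significant,
    same-sign, neighbouring features (0.0 if nothing is significant).
    """
    seen, best = set(), 0.0
    for start in t_vals:
        if start in seen or not significant[start]:
            continue
        positive = t_vals[start] > 0
        stack, total = [start], 0.0
        seen.add(start)
        while stack:  # depth-first growth of one cluster
            g, k = stack.pop()
            total += t_vals[(g, k)]
            neighbours = [(g, k - 1), (g, k + 1)] + [(h, k) for h in SPATIAL[g]]
            for nb in neighbours:
                if (nb in t_vals and nb not in seen and significant[nb]
                        and (t_vals[nb] > 0) == positive):
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, abs(total))
    return best
```

In a full analysis, the same function would be applied to each permuted dataset and the observed value compared against the resulting distribution of maxima.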

### RESULTS

#### Classifier Performance


Performance of the SC-CR classifier was computed based on classification of the SC and CR trials (SC-RS, SC-RO, SC-F, CR-SN, CR-MN). The significance of the performance of a classifier (whether it performs significantly over chance) was evaluated based on the number of test trials used for classification. The 95% confidence interval for the obtained accuracy was calculated using Wald intervals with small sample size adjustments (Agresti and Caffo, 2000) for each subject. Classification results were considered to be significantly over chance only when the interval did not include 50%. Results are given in **Table 1**. The overall classification accuracy for Experiment 1 (SC-CR) was 62% with 18 of 25 subjects having individual accuracies significantly over chance. When restricted to subjects with at least 50 trials in each class, the performance is somewhat better. The overall classification accuracy for Experiment 2 (SC-CR) was 59% with 17 of 28 subjects having individual accuracies significantly over chance. Experiment 3-loc (SC-CR) had an average accuracy of 57% and Experiment 3-col (SC-CR) had an average accuracy of 56%.
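The significance criterion can be illustrated with a short sketch. It assumes that the small-sample adjustment refers to the add-two-successes-and-two-failures form of the Wald interval associated with Agresti and Caffo (2000); the exact variant used is not spelled out in the text, and the function names are hypothetical.

```python
from math import sqrt

def adjusted_wald_interval(n_correct, n_trials, z=1.96):
    """Approximate 95% interval after adding two successes and two failures."""
    n_adj = n_trials + 4
    p_adj = (n_correct + 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

def above_chance(n_correct, n_trials, chance=0.5):
    """True when the whole interval lies above the chance level."""
    lower, _ = adjusted_wald_interval(n_correct, n_trials)
    return lower > chance
```

For example, `above_chance(62, 100)` holds while `above_chance(55, 100)` does not, which shows how strongly the verdict depends on the number of test trials at accuracies near those reported here.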

**Figure 4** gives the ROC (receiver operating characteristic) curves for choosing different thresholds (between 0 and 1) to make decisions between classes 1 and 2 for all three classification problems. **Table 2** gives the area under these ROC curves. All results were above 0.5; however, performance varied across the different classification problems. The SC-CR classifiers showed the highest performance on all four datasets. It was also found that the datasets with recordings from multiple days (Exp 3-loc and Exp 3-col) showed a slight decrease in performance compared to the single session datasets. The SC-SI classification performs better for the location source datasets relative to the color source datasets, in contrast with the SI-CR classifiers.
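The area under an ROC curve obtained by sweeping a threshold over classifier scores equals the probability that a randomly chosen trial from the first class scores higher than a randomly chosen trial from the second class. A minimal sketch of that computation, with a hypothetical function name and ties counted as one half, is:

```python
def roc_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) score pairs ranked correctly.

    scores_pos / scores_neg: classifier scores for the two classes.
    Ties contribute one half.
    """
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A value of 0.5 corresponds to scores that carry no information about the class, which is the reference level used in the comparison above.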

We redid some classifications using the reference electrode standardization technique (REST) (Yao, 2001; Dong et al., 2017). The performance of the SC-CR classifiers in Experiment 1 using REST for re-referencing showed comparable AUC (0.6571) and accuracy (0.6221) to those obtained with our usual average reference (AR) method (AUC of 0.6555 and accuracy of 0.6231). We then compared the REST method for the harder SC-SI classification in the two color source datasets, but this also resulted in no significant improvement in the classification results. Specifically, for SC-SI in Exp 2 (color) with REST we have an AUC of 0.5206 (vs. 0.5357 with AR) and an accuracy of 0.5191 (vs. 0.5263 with AR), and for SC-SI in Exp 3 (color) with REST we have an AUC of 0.5075 (vs. 0.5108) and an accuracy of 0.5034 (vs. 0.5024).

#### Analysis of the Classifier Scores

The projection weights for a given classification problem can be used to project the EEG data onto a discriminative vector. In this paper, these projection values are denoted as the classifier scores. The relationship between the average classifier scores for the different behavioral conditions represents the characteristics of the different discriminative hyperplanes (Noh and de Sa, 2014). As described in section "Statistical Methods," the representations of the EEG data on the three different discriminative vectors were compared across the different behavioral conditions. The classifier scores were computed for each classification problem (as described in section "Classification Problem") and the average scores corresponding to the different behavioral conditions were compared. The results were compared across the four datasets and effects with p < 0.05 consistently across the different datasets were considered to be meaningful (the individual comparison results are given in **Table 3**).
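For orientation, one standard way to write a two-class shrinkage LDA projection and its probability output is given below. The symbols $w$, $b$, $\gamma$, and $\nu$ are introduced here for illustration, and the logistic form assumes Gaussian classes with a shared covariance and equal priors; the text does not state how the probability outputs were actually computed.

$$
\tilde{\Sigma} = (1-\gamma)\,\hat{\Sigma} + \gamma\,\nu I, \qquad \nu = \frac{\operatorname{tr}(\hat{\Sigma})}{d}, \qquad
w = \tilde{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_2), \qquad
b = -\tfrac{1}{2}\, w^{\top}(\hat{\mu}_1 + \hat{\mu}_2),
$$

$$
p(x) = \frac{1}{1 + \exp\!\left(-(w^{\top}x + b)\right)} .
$$

Here $x$ is the $d = 30$ dimensional feature vector of a trial, $\hat{\mu}_1$ and $\hat{\mu}_2$ are the class means, $\hat{\Sigma}$ is the pooled sample covariance, and $\gamma \in [0, 1]$ is the shrinkage intensity chosen automatically. Under these assumptions the classifier score $p(x)$ is a monotonic function of the projection $w^{\top}x$, so comparing average scores across conditions compares where the conditions fall along the discriminative vector.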

The correct item memory conditions (SC, SI, and CR) showed similar patterns across the different projections where SC trials gave the highest scores and the CR trials showed the lowest scores. However, the relative distance between the three conditions varied across the different discriminative vectors. It was found that the SI condition was mapped closer to the CR condition on the SC-SI plane (see **Figure 5C**) while it was mapped closer to the SC condition on the SI-CR plane (see **Figure 5B**). It was also found that the difference between the SI and CR trials was significant (p < 0.05 for all four datasets) only on the SC-CR and SI-CR planes (see **Figure 5**).

The relative mapping of the error conditions (M and FA) with respect to the correctly retrieved/rejected conditions (SC, SI, and CR) gave different patterns for the different projection directions. Interestingly, the source correct (SC) trials and false

FIGURE 4 | The ROC curves for the three different classification problems (A: SC vs. CR; B: SI vs. CR; and C: SC vs. SI) are given separately for the four individual datasets.



Every AUC was computed using the projections of the trials, by leave-two-out training, in the selected balanced training data.

alarms (FA) were mapped to significantly different values on the SC-CR and SC-SI plane but not on the SI-CR plane (see **Table 4**). In contrast, the misses (M) gave values significantly lower than the two correct item retrieval conditions (SC and SI) when mapped onto the SC-CR and SI-CR plane.

A similar analysis was conducted considering the different subjective ratings given to the correct item retrieval/rejection trials (SC, SI, and CR). These responses consisted of remember source (RS), remember other (RO), and familiar (F) for the SC/SI conditions and sure (SN denoting sure new) and maybe (MN denoting maybe new) for the CR condition. The error conditions (FA and M) can be similarly projected. While the classifiers generally gave a monotonic decrease in classifier scores from the RS to SN conditions, there were interesting interactions with the memory retrieval conditions as illustrated in **Figure 6**.

#### Classifier Activation Patterns

The activation patterns, which represent the features used by the classifiers (or the characteristics of the projection weights), were compared across the three different classification problems (see **Figure 7**). The activation patterns were computed for each subject and the average activation patterns were computed by averaging the values across all four datasets. A t-test was conducted on each of the features to illustrate which features showed similar effects across the different subjects. Cluster-based analysis (Maris and Oostenveld, 2007) was then used to control for multiple comparisons. This revealed features with values significantly above/below zero across all the subjects available for analysis. The activation patterns are given as a 2-dimensional matrix with its corresponding channel groups and time segments (the times give the center of the interval) in **Figure 7** and the most significant clusters (with significance values) are shown in **Figure 8**.
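The per-feature test across subjects can be illustrated with a short Python sketch. It assumes a one-sample t-test against zero on each feature of the flattened per-subject activation patterns, and it computes only the t statistic: p-values and the cluster-based correction of Maris and Oostenveld (2007) are not implemented here. The function names and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values):
    """t statistic for testing whether the mean of `values` differs from zero."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))

def feature_t_map(patterns):
    """Compute one t statistic per feature.

    `patterns` holds one flattened activation pattern per subject
    (channel groups x time segments unrolled into a single list).
    """
    n_features = len(patterns[0])
    return [one_sample_t([p[j] for p in patterns]) for j in range(n_features)]

# Hypothetical toy data: three subjects, two features each.
patterns = [[1.0, -1.0], [2.0, 1.0], [3.0, 0.0]]
t_map = feature_t_map(patterns)
```

A feature whose values are identical across subjects has zero standard deviation and would raise a division error in this sketch, so real data would need that case handled explicitly.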

The SC-CR classifier utilized temporal features from 300 to 800 ms. The SI-CR classifier only showed consistent patterns between 300 and 700 ms and the SC-SI classifier showed consistent patterns between 400 and 800 ms for the two tasks with spatially presented contextual information (1 and 3-loc). In the two tasks with colored frames as context, there is not a strong activation pattern consistency across subjects for the SC-SI classifier. Interestingly the SC-SI (source memory) classifier has strong consistent activity across all spatial areas except PM when the source context is location. The SI-CR (item memory) classifiers have an early frontal activation when the source context is the colored outline, but a more parietal activation when the source context is the location.

The activation patterns for the three classifiers we created using the REST preprocessing (SC-CR Exp1, SC-SI Exp 2-col, SC-SI Exp 3-col) were similar to the analogous ones with average referencing.

#### Classifier Scores Evolution Over Time

In the activation patterns, the characteristics of projection weights in different time intervals and channel groups were shown. The classifier scores variation across time gives a clear insight about the evolution of the separation of classes over time. To obtain the scores only under the operation with weights between 300 and 400 ms in activation patterns, the grouped EEG data after 400 ms were set to zero, and the remaining computations remained the same. In brief, the data were set to zero after the considered intervals and the trained classifier was used to get the classifier scores.
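The zeroing procedure described above can be sketched as follows. This is an illustrative Python example under stated assumptions: features are arranged as a list of time windows, each holding one value per channel group, and the weights have the same layout. The function names and the toy numbers are hypothetical.

```python
def truncated_score(trial, weights, last_window):
    """Score a trial after setting all windows later than `last_window` to zero.

    `trial` and `weights` are lists of time windows; each window is a list
    with one value per channel group.
    """
    masked = [
        window if t <= last_window else [0.0] * len(window)
        for t, window in enumerate(trial)
    ]
    return sum(
        w * x
        for w_window, x_window in zip(weights, masked)
        for w, x in zip(w_window, x_window)
    )

def score_evolution(trial, weights):
    """Classifier score as successively more time windows are retained."""
    return [truncated_score(trial, weights, t) for t in range(len(weights))]

# Hypothetical toy data: three time windows, two channel groups.
trial = [[1.0, 2.0], [0.5, 0.5], [2.0, 0.0]]
weights = [[1.0, 0.0], [2.0, 2.0], [0.5, 1.0]]
evolution = score_evolution(trial, weights)
```

The trained weights stay fixed throughout; only the input data are truncated, so the last entry of `evolution` equals the score computed from the full trial.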

**Figure 9A** shows that the scores of SC and CR trials start to be discriminable around 500–600 ms and separate further afterwards. **Figure 9B** shows that with the SI-CR classifier, scores of SI and CR trials also start to separate around 500–600 ms. As for the SC-SI classifier in **Figure 9C**, the scores of SC trials become more separable from the scores of the SI trials after about 700 ms. Note that while the activation patterns for the SI-CR classifier show little significant activation that is consistent between subjects after 600 ms, the classifier scores continue to separate, indicating that the activation patterns causing this separation are less consistent between subjects.

#### DISCUSSION

The results show that it is possible to predict successfully identified old vs. new items based on single-trial scalp EEG activity recorded during the retrieval episode. The prediction rate was higher for the location-source datasets and the average accuracy of the single-session datasets was higher compared

TABLE 3 | Comparison results between the classifier scores for the SC-CR classifier.


Paired t-tests between all possible pairs are given with their corresponding uncorrected p-values.

TABLE 4 | The uncorrected pairwise comparison results for the five behavioral conditions across the four datasets [Exp 1 (loc), Exp 2 (col), Exp 3-loc, Exp 3-col].


The results for the different projections are given in separate rows.

to the multi-session datasets. The non-stationarity of the data between the two sessions (due to electrode position changes, impedance changes, or changes in brain-state) likely contributes to the drop in classification performance (Krauledat et al., 2007). Our analysis was restricted to time domain signals from specific channel groups known to be involved in frontal and parietal old/new effects. It is possible that accuracy could be increased by using frequency domain information from multiple electrode location and frequency bands (see for example, Hammon and de Sa, 2007; Hammon et al., 2008; Velu and de Sa, 2013; Noh et al., 2014a; Mousavi et al., 2017).

The current analysis found that the projections of the temporal information from the EEG data onto different hyperplanes

show different patterns. This was evident in the relationship between the behavioral conditions of interest. We focused on the patterns which were consistent across multiple subjects and multiple datasets to compare across the different classifiers. These results suggested that the classifier may be exploiting features which are more informative for discriminating between the two behavioral conditions selected for training. It was found that the SC-SI classifier performance on the two color source datasets (Experiment 2 and the color blocks from Experiment 3) was lower compared to the location datasets (Experiment 1 and the location blocks from Experiment 3). The activation patterns for the SC-SI classifier were also not significantly consistent across subjects for the color outline source. In Mollison and Curran (2012), it was found that accurate/inaccurate judgments to the familiar responses were affected by source type where the SC trials with familiar ratings and SI trials with familiar ratings were significantly different only for the location-source datasets when comparison was conducted on a ROI centered at FCz. This suggests that the temporal information in the EEG signal may be less separable between SC and SI trials for the color datasets compared to the location datasets resulting in a lower classification performance.

The relationship between the correctly remembered conditions (where the classifier scores showed CR < SI < SC on all three discriminative vectors) suggests that these classifier scores may reflect the amount of information retrieved from the study episode. The difference in the amount of information

retrieved from the study episode is maximal between conditions SC (when the correct item is retrieved from the study phase with the appropriate source information) and CR (when no information is retrieved from the study phase) which may be why the SC-CR classifier outperformed the other two classifiers. The drop in classifier performance for the SC-SI and SI-CR classifiers compared to the SC-CR classifier may be due to this innate relationship between the 3 behavioral conditions used for classifier training. The SI-CR classifier would primarily be able to utilize information related to differences between item retrieval vs. correct rejection to distinguish between the two classes. On the contrary, the SC-SI classifier would only be able to utilize information related to source memory differences between correct source retrieval vs. incorrect source retrieval in order to distinguish between the SC and SI conditions.

The activation patterns (see **Figure 8**) indicated that the classifiers used features mostly around 400 to 800 ms and gave these features higher weights. The spatiotemporal distribution of predictive features associated with the (SI - CR) classifiers (early and more frontal) were somewhat consistent with the timing and location of the FN400 only in the color-source experiments [2 and 3(col)]. Likewise the spatiotemporal

mid-point of the 100 ms window used to compute the features.

distribution of predictive features associated with the (SC-SI) classifiers (later and more parietal) were somewhat consistent with the timing and location of the parietal ERP old/new effect only in the color-source experiments. In the location-source experiments, the (SC-SI) classifier had significant contributions from both early (<500 ms) and late (500–800 ms) time periods and from both frontal and parietal locations. This suggests that while the SI-CR classifier may be representative of the early frontal old/new effect and the SC-SI classifier representative of the later parietal old/new effect when color is the source information, the mapping is not as appropriate when location is the source information. This is consistent with Mollison and Curran's (2012) observations suggesting that familiarity contributes to source recognition for location more so than for color. The activation patterns corresponding to the SC-CR classifier took advantage of the features across all time periods (see **Figure 8**), which most likely resulted in the largest distinction between the SC and CR conditions.

Additionally, the multivariate classification approach showed that trial-by-trial variation in EEG corresponding to these ERP components are predictive of subjects' behavioral

responses, which is consistent with the hypothesis that the underlying processes are influencing memory judgments. One previous study has similarly used logistic regression to predict performance on a city-size comparison task from single-trial EEG data corresponding to the FN400 (Rosburg et al., 2011). Their results showed that the relative familiarity of two cities, as indexed by single-trial FN400 measures, predicted which of the cities subjects judged as being more populous. Taken together with the current results, these classification approaches are important for establishing that EEG patterns which have been related to familiarity and recollection in ERP averages, can be shown to predict behavior on individual trials in both standard memory tasks as well as a decision making task that is influenced by memory. Overall, this strengthens the hypothesized links between these EEG patterns and behaviorally relevant memory processes.

ERP studies of recognition memory often exclude error trials from analyses because of insufficient trials for stable ERPs in these conditions. In their original study, Mollison and Curran (2012) excluded subjects with fewer than 15 artifact-free trials/condition/subject, and 24% of subjects would have been excluded if errors were included in the analyses. One approach for increasing the false alarm rate has been to use lures that are similar to studied items (e.g., Curran, 2000; Curran and Cleary, 2003; Nessler et al., 2001). In these cases subjects are presumed to have a high false alarm rate because similar lures are as familiar as studied items, and the familiarity-related FN400 responds similarly to hits and false alarms to similar lures. It is also common to hypothesize that false alarms to even non-similar lures are driven by familiarity. For example, the Yonelinas (1994, 1997) dual process model of ROC curves explicitly assumes that recollection does not contribute to false alarms, which are only driven by familiarity. Few ERP studies have assessed false alarms from lures that were not similar to the studied items. If familiarity differentiates "no" (CR) and "yes" (FA) responses to new items, the FN400 should be more



The difference between the average classifier scores for the SI and FA conditions are given in the bottom three rows. Negative values indicate FA had larger values. The table shows that FA is mapped closer to SC and SI in the SI-CR classifier than the SC-CR and SC-SI classifiers.

TABLE 6 | The uncorrected pairwise comparison results for the five subjective rating options across the four datasets [Exp 1 (loc), Exp 2 (col), Exp 3-loc, Exp 3-col].


Only the trials with correct item judgments were included in this analysis.

positive to FA trials than CR trials. Although early studies that did not clearly differentiate the FN400 reported no differences between hits and false alarms (Wilding et al., 1995; Wilding and Rugg, 1996, 1997; Rubin et al., 1999), two studies that specifically focused on the FN400 did observe more positive FN400s to FA than to CR trials (Finnigan et al., 2002; Wolk et al., 2006). Wolk et al. (2006) included a very large number of test items, which resulted in an average of 105 FA trials/subject, but Finnigan et al. (2002) only averaged 12 trials/subject. The current multivariate analysis approach using pattern classifiers addresses this trial count issue by projecting the high dimensional EEG data onto a one-dimensional vector which is meaningful with respect to the experimental paradigm. The SI-CR classifier responded more strongly to FA trials than to CR, with FA being more similar to item hits (SC and SI), as would be expected if FA trials were driven by familiarity.

The relationship between the SC and FA conditions was particularly interesting. The difference between the two conditions was consistently larger across all four datasets on the SC-CR and SC-SI planes compared to the SI-CR plane, as given in **Table 5** (and shown in **Figure 5**). This pattern was also evident between the SI and FA conditions; however, the distances between these two conditions were smaller. Hence the representations with respect to the different classification boundaries suggest that SC and FA are more similar to each other on the SI-CR (item memory) plane compared to the other two representations. In other words, false alarms (on item information) may include information related to item retrieval while they do not include much information related to source retrieval (recollection).

The other type of error, misses (M), was generally similar to CR in all three classifiers. Both of these conditions reflect low levels of familiarity and recollection that lead to "no" responses. Previous studies have found 300–500 ms FN400 or 500–800 ms parietal old/new differences between hits and misses, but not between CR trials and misses (Rugg et al., 1998; Curran and Hancock, 2007). Instead, Rugg et al. (1998) found differences between misses and CR over posterior channels between 300 and 500 ms. The latter differences were interpreted as reflecting the activity of an implicit memory process because subjects were giving the same explicit "no" response to both old and new items, but the brain was still differentiating their memory status [although others dispute this definition of implicit memory, Voss and Paller (2008)]. Because our classifiers were trained to differentiate different levels of explicit memory, it makes sense that no major differences were observed between misses and CR in any of our results. Future work could further investigate any differences by specifically involving misses in the classification training [see for example (Noh et al., 2014b)].

In summary, the present results showed that the classification analysis successfully extracts information related to retrieval strength from the EEG data. These results show that the classifier scores well represent the subjects' behavioral performance on source retrieval (the relationship between the SC, SI, and CR conditions in **Figure 5**) and indicate that EEG item-memory and source-memory responses may be more spatially widespread than previously thought and differ between source-types. The results also indicate that retrieval strength as reflected in the classifier scores follows the subjects' subjective ratings (**Figure 6** and **Table 6**). It was also found that the brain activity related to item memory/familiarity may be present during false item retrieval (FA trials) as well as during correct item retrieval (SC and SI trials).

#### AUTHOR CONTRIBUTIONS

TC and MM planned the EEG experiments. MM collected the data. EN, MM, TC, and VdS planned the initial analyses in this work. KL, EN, and VdS planned the temporal evolution and cluster analysis tests. EN and KL implemented the analyses.

All authors were involved in drafting and editing the work and are accountable for all aspects of the work.

#### FUNDING

This research was funded by NSF grants CBET-0756828, IIS-1219200, IIS-1528214, NIH Grant MH64812, NSF grants # SBE-0542013 and # SMA-1041755 to the Temporal Dynamics of Learning Center (an NSF Science of Learning Center), and the KIBM (Kavli Institute for Brain and Mind) Innovative Research Grant and by IBM Research AI through the AI Horizons Network.

#### ACKNOWLEDGMENTS

This project developed out of a meeting of the NSF Temporal Dynamics of Learning Center (TDLC).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Noh, Liao, Mollison, Curran and de Sa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spectral Entropy Can Predict Changes of Working Memory Performance Reduced by Short-Time Training in the Delayed-Match-to-Sample Task

Yin Tian\* † , Huiling Zhang† , Wei Xu, Haiyong Zhang, Li Yang, Shuxing Zheng and Yupan Shi

Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China

Spectral entropy, which was generated by applying the Shannon entropy concept to the power distribution of the Fourier-transformed electroencephalograph (EEG), was utilized to measure the uniformity of power spectral density underlying EEG when subjects performed the working memory tasks twice, i.e., before and after training. According to Signed Residual Time (SRT) scores based on response speed and accuracy trade-off, 20 subjects were divided into two groups, namely high-performance and low-performance groups, to undertake working memory (WM) tasks. We found that spectral entropy derived from the retention period of WM on channel FC4 exhibited a high correlation with SRT scores. To this end, spectral entropy was used in a support vector machine classifier with linear kernel to differentiate these two groups. Receiver operating characteristics analysis and leave-one-out cross-validation (LOOCV) demonstrated that the averaged classification accuracy (CA) was 90.0 and 92.5% for intra-session and inter-session, respectively, indicating that spectral entropy could be used to distinguish these two different WM performance groups successfully. Furthermore, the support vector regression prediction model with radial basis function kernel and the root-mean-square error of prediction revealed that spectral entropy could be utilized to predict SRT scores on individual WM performance. After testing the changes in SRT scores and spectral entropy for each subject by short-time training, we found that 16 of 20 subjects' SRT scores were clearly promoted after training and 15 of 20 subjects' SRT scores showed consistent changes with spectral entropy before and after training. The findings revealed that spectral entropy could be a promising indicator to predict an individual's WM changes by training and further provide a novel application of WM for brain–computer interfaces.

Keywords: spectral entropy, WM performance, SVR, prediction, classification, BCIs

## INTRODUCTION

Working memory (WM) was originally defined as a cognitive mechanism responsible for the temporary maintenance and manipulation of new and stored memory information (Baddeley, 2012). WM was considered a limited-capacity, short-term, information retention system (Ma et al., 2014). The original model of WM proposed by Baddeley included three subcomponents: the central

#### Edited by:

Felix Putze, University of Bremen, Germany

#### Reviewed by:

Benjamin Thürer, Karlsruhe Institute of Technology, Germany Martin Spüler, University of Tübingen, Germany

\*Correspondence: Yin Tian tiany20032003@163.com †These authors have contributed equally to this work.

Received: 30 April 2017 Accepted: 15 August 2017 Published: 31 August 2017

#### Citation:

Tian Y, Zhang H, Xu W, Zhang H, Yang L, Zheng S and Shi Y (2017) Spectral Entropy Can Predict Changes of Working Memory Performance Reduced by Short-Time Training in the Delayed-Match-to-Sample Task. Front. Hum. Neurosci. 11:437. doi: 10.3389/fnhum.2017.00437

executive, the visuospatial sketch pad and the phonological loop (Baddeley and Hitch, 1994). Later, an additional component, namely the episodic buffer, was added to WM. This component could take information from the other three components and from long-term memory, from which a single episodic representation was created and then temporarily preserved in the buffer (Baddeley, 2000). Individuals exhibited varying abilities in WM. For example, one person was able to memorize more information and manipulate the information more effectively than others (Baddeley, 2003). There is no doubt that if someone is confused, his or her memory ability will decline.

Working memory performance prediction has become an interesting topic that has received considerable attention in recent years (Rajji et al., 2015; Johannesen et al., 2016; Tumari and Sudirman, 2016). In previous attempts, numerous approaches were proposed to extract reliable features to predict individual WM performance and distinguish the difficulty levels of cognitive tasks, such as alpha power (Myers et al., 2014), absolute power (Klimesch et al., 2006), and wavelet entropy (Zarjam et al., 2013).

Previous studies related to WM using EEG found that upper alpha event-related desynchronization (ERD) and small power were associated with good performance during actual processing of the task (Klimesch et al., 2006). Autoregressive model (Nai-Jen and Palaniappan, 2004) and wavelet entropy (Zarjam et al., 2013) extracted from EEG signals were used frequently to measure and distinguish the levels of WM task difficulty. Similarly, time-frequency characteristics using wavelet transform were applied to evaluate mental workload in WM with EEG signals. It has been demonstrated that the appearance time and the total power extracted from wavelet analysis were effective features to measure mental workload (Murata, 2005). Combining magnetoencephalography and EEG (MEEG) recordings with source reconstruction techniques showed that synchrony was enhanced with increasing memory loads among the frontoparietal regions during memory retention while the individual WM capacity could be forecasted by phase synchronization in a network among frontoparietal and visual regions (Palva et al., 2010). An event-related functional magnetic resonance imaging (fMRI) study found that better WM performance in a Sternberg-type delayed match WM task could be predicted by greater temporo-parietal junction (TPJ) and default mode network (DMN) deactivation during the encoding period (Anticevic et al., 2010).

In recent years, the rapid development of neuroscience facilitated the improvement of brain–computer interfaces (BCIs), which are information transfer systems that transform brain intention into control commands without involving peripheral neural pathways (Mühl et al., 2014; Putze et al., 2014). Originally, BCIs were designed to provide a new way for patients with impaired motor functions to communicate with others. Generally, motor imagery and evoked visual potentials received a lot of attention in BCIs (Ferrez and Millán, 2008; George and Lécuyer, 2010). Lately, BCIs have become available to anyone who wants or needs them and have been envisaged to monitor cognitive states, such as attention, fatigue, and emotions (Frey et al., 2013; Mühl et al., 2014, 2015; Chanel and Mühl, 2015). Moreover, the WM-related EEG signal was utilized in BCIs (Putze and Schultz, 2010; Mühl et al., 2014).

To date, the above measures have been widely investigated in the analysis of WM loads (i.e., various difficulty levels) (Murata, 2005; Mühl et al., 2014; Dimitriadis et al., 2015), but the estimation or prediction of individual WM performance has rarely been addressed, especially in BCIs. Previous studies also tried to find ways to improve one's WM ability, which could be reflected by predicting performance in carrying out a wide range of cognitive tasks (Chouhan et al., 2015). To some extent, WM performance might be an indicator of an individual's WM ability when carrying out memory tasks.

Usually memory has been regarded as a personal constant character. However, recent studies revealed that it could be promoted by adaptive and extended training (Klingberg, 2010). The density of cortical dopamine receptors changed after training through test (Mcnab et al., 2009). Previous findings showed that using WM-related fMRI, the training-induced variations were linked to the increased activity in prefrontal and parietal regions (Olesen et al., 2003). Research on children with attention deficit hyperactivity disorder (ADHD) also suggested that WM impairments might be overcome by training and stimulant medication on WM (Holmes et al., 2010).

In this work, we attempted to use a proper and objective feature of the retention period in WM EEG as a biomarker to predict an individual's WM performance. The existing EEG-based studies, to the best of our knowledge, demonstrated that the power spectrum of low-frequency resting-state EEG was associated with an individual's WM performance. Changes in brain activity could be reflected by the changes of power spectral density (PSD) while performing cognitive tasks (Weiss, 1992; Pachou et al., 2008).

Recently, different entropy concepts have been applied to describe the order state of sequences (Bruhn et al., 2000, 2001; Xu et al., 2013; Zhang et al., 2015). Among them, Shannon entropy has been shown to be an effective measure for the predictability of EEG series in describing anesthetic drug effect (Bruhn et al., 2001; Zhang et al., 2015). However, Shannon entropy was not normalized to the total power of EEG. Consequently, the absolute value of Shannon entropy might vary among individuals, which hampered its application in clinical areas. To overcome this drawback, spectral entropy was developed by applying the Shannon entropy to the Fourier-transformed signals (Vanluchene et al., 2004). Therefore, in the current study the spectral entropies were used as features to distinguish high-performance from low-performance groups in WM with an SVM classifier, and further to predict subjects' individual WM performance after short-time training with a support vector regression (SVR) prediction model. We assume that spectral entropy could be applied to predict individual WM performance and provide a novel approach for the study of BCI in the future.

### MATERIALS AND METHODS

#### Ethics Statement

The experiment was approved by the ethical committee of Chongqing University of Posts and Telecommunications.

Written informed consent was signed prior to participating in the study. Subjects received monetary compensation after the experiment. None of them had cognitive impairments or mental or neurological disorders.

#### Subjects

Twenty healthy, right-handed male subjects (mean age 21 years) participated in the experiment. Subjects were asked to stay relaxed and to restrain large movements as much as possible during the whole experiment. Subjects were requested to perform a continuous delayed match task consisting of two sessions at three levels of task difficulty. The two sessions were identical; the only distinction between them was that subjects carried out the first session without training and the second session after WM training.

### Stimuli and Design

**Figure 1** shows an example of the stimulus sequence. A fixation cross (0.5° × 0.5°; at the center of the monitor) was displayed throughout the entire block of trials. Each trial started with the fixation cross flashing for 50 ms. Following that, the memory array, which consisted of two, four, or eight letters (0.5° × 0.5°), was presented for 200 ms; the three array sizes were selected at random and occurred equally often. After a 3000 ms retention interval, the test array was presented on the screen for 100 ms as a probe item. Subjects then responded with a button press, indicating whether the probe item had been in the memory array. Subjects pressed the "F" key with their left index finger when the probe item was absent from the memory array, and pressed the "J" key with their right index finger when the probe item was present in the memory array. The numbers of probe-absent and probe-present trials were equal. Response accuracy (RACC) and speed were equally stressed in the instructions.

Subjects were required to maintain central fixation and minimize eye blinks and body motion throughout the recordings. The experiment consisted of two sessions per subject and each session was composed of 60 trials with three kinds of memory loads (two, four, or eight letters). Stimuli were presented and behavioral results were recorded using E-prime software<sup>1</sup> .

### EEG Recordings and Preprocessing

EEG recording was accomplished by using a 64-channel NeuroScan system (Quik-Cap, band pass: 0.05–100 Hz, sampling rate: 1000 Hz, impedances < 5 kΩ) with a vertex reference. To monitor ocular movements and eye blinks, Electro-Oculogram (EOG) signals were simultaneously recorded from four surface electrodes, one pair placed over the upper and lower eyelids and the other pair placed 1 cm lateral to the outer corner of the left and right orbit.

The data were re-referenced to the infinity reference (IR) (Yao, 2001; Tian and Yao, 2013) using the software REST<sup>2</sup> . In the study, EEG was segmented from 100 ms before the onset of the memory array to 100 ms after the subjects' response onset. In other words, the retention and retrieval stages of WM were extracted for the following preprocessing. EOG and Electromyography (EMG) artifacts were removed by Blind Source Separation (BSS) (Negro et al., 2016). Other noise was removed by automatic artifact rejection (±100 µV). The data were baseline corrected using the 100 ms of EEG signal before the memory array onset. Then the EEG recordings were filtered with a band-pass of 0.5–45 Hz. After the above preprocessing, 60 trials were obtained for each subject under three memory loads (two, four, and eight items). The retention stage (3000 ms) of each trial was extracted for subsequent analysis.
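Two of the preprocessing steps above, amplitude-based artifact rejection and baseline correction, are simple enough to sketch in Python. The example below is illustrative only and operates on single-channel epochs stored as plain lists of microvolt values; the function names, the toy epochs, and the choice of a two-sample baseline are hypothetical, and re-referencing, BSS, and band-pass filtering are not covered.

```python
from statistics import mean

def reject_artifacts(epochs, threshold_uv=100.0):
    """Keep only epochs whose absolute amplitude never exceeds the threshold."""
    return [e for e in epochs if max(abs(x) for x in e) <= threshold_uv]

def baseline_correct(epoch, n_baseline):
    """Subtract the mean of the first `n_baseline` samples from the whole epoch."""
    baseline = mean(epoch[:n_baseline])
    return [x - baseline for x in epoch]

# Hypothetical toy epochs (microvolts); the second and third exceed +/-100.
epochs = [[1.0, 3.0, 5.0, 7.0], [150.0, 0.0, 2.0, 1.0], [-101.0, 5.0, 0.0, 0.0]]
clean = reject_artifacts(epochs)
corrected = [baseline_correct(e, n_baseline=2) for e in clean]
```

At the 1000 Hz sampling rate reported above, a 100 ms pre-stimulus baseline would correspond to `n_baseline=100` samples.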

<sup>1</sup>http://www.pstnet.com/eprime.cfm

<sup>2</sup>www.neuro.uestc.edu.cn/rest

#### Behavioral Analyses

When performing time-limited tasks incorporating both reaction time (RT) and RACC, subjects either sacrificed accuracy in exchange for response speed, or exchanged response speed for high accuracy (Mordkoff and Egeth, 1993; Breukelen, 2005). Here, we utilized the Signed Residual Time (SRT) scores (Maris and Han, 2012), which account for the speed-accuracy trade-off (Schmitt and Scheirer, 1977), to represent subjects' behavioral performance. The SRT scoring rule was defined as follows (Schmitt and Scheirer, 1977; Maris and Han, 2012; van Rijn and Ali, 2017):

$$\text{SRT} = \sum_{i} (2\,\text{RACC}_i - 1)(MT - t_i) \tag{1}$$

where the subscript i denoted the item index, RACC<sub>i</sub> ∈ {0, 1} was the response accuracy for a single trial (0 for incorrect and 1 for correct), MT denoted the maximum allowable response time for item i, and t<sub>i</sub> denoted the response latency. The total score was simply the sum of the scores over all items. In other words, for a correct response subjects earned the residual time as score, and for an incorrect response they lost the residual time.
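The scoring rule in Equation (1) can be illustrated with a minimal sketch. The function name `srt_score`, the three-trial data, and the value MT = 2.0 s are hypothetical and chosen only for illustration; they are not the values used in the study.

```python
def srt_score(correct, latencies, max_time):
    """Signed Residual Time score following Equation (1).

    correct   -- list of 0/1 accuracy flags (RACC_i), one per item
    latencies -- list of response latencies t_i
    max_time  -- maximum allowable response time MT (same unit as t_i)
    """
    if len(correct) != len(latencies):
        raise ValueError("correct and latencies must have the same length")
    # (2 * RACC_i - 1) is +1 for a correct and -1 for an incorrect response,
    # so the residual time (MT - t_i) is earned or lost accordingly.
    return sum((2 * c - 1) * (max_time - t) for c, t in zip(correct, latencies))


# Hypothetical example: three trials, times in seconds, MT = 2.0 s.
example_score = srt_score([1, 0, 1], [0.5, 1.0, 1.5], 2.0)
print(example_score)
```

In this example the two correct trials earn 1.5 s and 0.5 s, and the incorrect trial loses 1.0 s.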

To test the effects of response hand and memory load on behavioral WM performance, 2 (hand: left vs. right) × 3 (memory load: 2 items vs. 4 items vs. 8 items) repeated measures analyses of variance (ANOVAs) were conducted. The dependent variables were the SRT scores within sessions (session 1, session 2) and the change rates of subjects' SRT scores between sessions (from session 1 to session 2). To protect against Type I errors, Greenhouse–Geisser correction was used when a factor had more than two levels and the sphericity assumption was violated (Greenhouse and Geisser, 1959; Kisley et al., 2005). In addition, false discovery rate (FDR) correction was applied for multiple comparisons.
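The text does not state which FDR procedure was applied. As an illustration only, the sketch below assumes the widely used Benjamini–Hochberg step-up procedure; the function name, the p-values, and the level q = 0.05 are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumed procedure (not confirmed by the source): Benjamini-Hochberg.
    Sort the m p-values, find the largest rank k with p_(k) <= (k / m) * q,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four comparisons.
example_rejections = benjamini_hochberg([0.001, 0.04, 0.03, 0.2], q=0.05)
print(example_rejections)
```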

#### Spectral Entropy

Spectral entropy, which is based on Shannon entropy, quantifies the regularity/randomness of the power spectrum during a given period of time. It was used to establish the biomarker for WM performance in the current study. The methodological details were similar to those adopted in a previous study on the prediction of BCI performance (Zhang et al., 2015).

The retention period of each trial was extracted to calculate the power spectral density (PSD), Psd(f), via Welch's method (Akbar et al., 2016). The PSD of a time series was defined as the distribution of power as a function of frequency. The normalized PSD was defined as Psd(f) divided by the total power, which yields a probability density function:

$$\widehat{\text{Psd}}(f) = \frac{\text{Psd}(f)}{\sum_{f=0.5}^{f=45} \text{Psd}(f)} \tag{2}$$

where $\widehat{\text{Psd}}(f)$ was the normalized PSD of $\text{Psd}(f)$. We estimated spectral entropy based on the PSD within 0.5–45 Hz. The entropy of $\widehat{\text{Psd}}(f)$ was computed using the following equation:

$$\text{SEn} = -k \sum_{f=0.5}^{f=45} \widehat{\text{Psd}}(f) \log\left(\widehat{\text{Psd}}(f)\right) \tag{3}$$

where k = 1. The base of the logarithm was 10 and the unit of SEn was dit (i.e., Decimal Digit) in the current study (Schneider, 2007). In effect, spectral entropy reflected the uniformity of the power spectrum distribution. The greater the spectral entropy was, the more uniform the power spectral distribution was.
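Equations (2) and (3) can be sketched as follows. The sketch assumes that the PSD has already been estimated (for example by Welch's method) and restricted to the 0.5–45 Hz bins; the function name and the two toy spectra are hypothetical.

```python
import math


def spectral_entropy(psd, k=1.0):
    """Spectral entropy in dits (base-10 logarithm) of a list of PSD values.

    psd -- non-negative power values for the frequency bins of interest
    k   -- scaling constant (k = 1 in the text)
    """
    total = sum(psd)
    if total <= 0:
        raise ValueError("total power must be positive")
    entropy = 0.0
    for power in psd:
        p = power / total              # Equation (2): normalized PSD
        if p > 0:                      # a bin with zero power contributes 0
            entropy -= k * p * math.log10(p)   # Equation (3)
    return entropy


# Toy spectra: a perfectly flat spectrum and a single spectral line.
flat_entropy = spectral_entropy([1.0, 1.0, 1.0, 1.0])
peaked_entropy = spectral_entropy([0.0, 10.0, 0.0, 0.0])
print(flat_entropy, peaked_entropy)
```

The flat spectrum reaches the maximum possible value for four bins, log10(4), whereas the single-line spectrum gives zero, in line with the statement that a larger spectral entropy indicates a more uniform power distribution.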

#### Scalp Spatial Distribution for Highest Relationship

Spearman's rank correlation coefficient was widely applied to measure the monotonic relationships. It was defined according to the following equation (Gautheir, 2001; Sedgwick, 2014):

$$r = 1 - \frac{6\sum_{i=1}^{m} d_i^2}{m\,(m^2 - 1)} \tag{4}$$

where each variable was ranked separately from lowest to highest, d<sub>i</sub> was the difference between the two ranks of the paired variables x<sub>i</sub> and y<sub>i</sub>, and m was the number of data pairs. In the current study, the relationship between SRT and spectral entropy was measured separately on each channel by Spearman's correlation, which yielded 64 Spearman's coefficients. We then constructed the fingerprint figures and the scalp spatial distribution from the Spearman's coefficients on each channel (the processing steps are shown in **Figure 2**). On this basis, the electrode site with the highest correlation between SRT and spectral entropy was identified for the subsequent steps (see below).
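The channel-wise use of Equation (4) can be sketched as below. The sketch assumes there are no tied values, since Equation (4) holds exactly only in that case; the channel labels `chA` and `chB`, the data, and the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from lowest (1) to highest (m); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_r(x, y):
    """Spearman's rank correlation following Equation (4)."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (m * (m ** 2 - 1))


def best_channel(entropy_by_channel, srt_scores):
    """Return (channel, r) for the channel with the highest coefficient."""
    coeffs = {ch: spearman_r(vals, srt_scores)
              for ch, vals in entropy_by_channel.items()}
    best = max(coeffs, key=coeffs.get)
    return best, coeffs[best]


# Hypothetical data: five subjects, two channels.
srt = [10, 20, 30, 40, 50]
entropy = {"chA": [0.1, 0.2, 0.3, 0.4, 0.5],
           "chB": [0.5, 0.1, 0.4, 0.2, 0.3]}
selected = best_channel(entropy, srt)
print(selected)
```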

#### Grouping Rules and Classification

Subjects were divided into two groups according to their standard z scores of SRT. Here, z-score was defined as:

$$z_{i} = \frac{x_{i} - \mu}{\sigma} \tag{5}$$

where x<sub>i</sub> (i = 1, 2, 3, …, 20) was the SRT score of the ith subject, µ was the mean of the 20 subjects' SRT scores and σ was the standard deviation of the SRT scores across the 20 subjects in the current study. Subjects whose z values of SRT were above zero were allocated to the group with high WM performance, and the rest were assigned to the group with low WM performance. The high and low memory performance groups were defined as positives and negatives, respectively.
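The grouping rule based on Equation (5) can be sketched as follows. Whether the sample or the population standard deviation was used is not stated in the text; the sketch assumes the sample standard deviation, which does not affect the grouping because only the sign of z matters. The function name and scores are hypothetical.

```python
import statistics


def group_by_z(scores):
    """Return (z_scores, labels); label 1 = high WM performance (z > 0)."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)   # assumption: sample standard deviation
    z_scores = [(x - mu) / sigma for x in scores]
    labels = [1 if z > 0 else 0 for z in z_scores]
    return z_scores, labels


# Hypothetical SRT scores of four subjects.
z_values, group_labels = group_by_z([2.0, 4.0, 6.0, 8.0])
print(group_labels)
```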

After the SVM was used to separate the high from the low memory performance group, the generalization performance of the SVM classifier had to be evaluated. SVM was developed by Vapnik based on statistical learning theory (SLT) (Netherlands, 2008). Owing to its excellent generalization performance, SVM has been applied to a wide variety of problems, such as text classification, image classification, handwriting recognition and gene classification. Furthermore, SVM had the feature of empirical risk minimization (ERM) and a global optimum solution (Netherlands, 2008). If the SVM classifier reflected the relationship between the features and the class labels well, it was considered able to predict the classes of new samples with good performance. Therefore, classification accuracy (CA), sensitivity (SE), specificity (SP), and area under the receiver operating characteristic (ROC) curve (AUC) were utilized to evaluate the classification performance of the SVM classifier (Galar et al., 2012). At the same time, leave-one-out cross-validation (LOOCV) was applied to evaluate the generalization performance of SVM given the relatively small sample size in the present study.

(1) The percentage of the number of samples predicted correctly in the test set over the total samples, CA, was calculated as follows:

$$CA = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

where true positive (TP) was the number of high-performance samples correctly predicted and true negative (TN) was the number of low-performance samples correctly predicted. False positive (FP) denoted the number of low-performance samples incorrectly predicted as high-performance, and false negative (FN) denoted the number of high-performance samples incorrectly predicted as low-performance.

(2) SE and SP were calculated by the following formulae, respectively:

$$SE = \frac{TP}{TP + FN} \tag{7}$$

$$SP = \frac{TN}{TN + FP} \tag{8}$$

SE referred to the ratio of correctly classified high-performance samples to the total population of high-performance samples, whereas SP was the ratio of correctly classified low-performance samples to the total population of low-performance samples.

(3) AUC was defined as the area under the ROC curve, and it has been shown to be better than CA for evaluating the predictive performance of classification learning algorithms (Jin and Ling, 2005). Moreover, AUC was a statistically consistent and more discriminating measure than CA (Ling et al., 2003). The ROC curves were introduced to evaluate machine learning algorithms (Provost et al., 1997). On a ROC curve, the true positive rate was plotted on the Y axis and the false positive rate on the X axis. The curve described a classifier's performance across the entire range of operating points, independent of class distributions (Provost et al., 1997; Jin and Ling, 2005). However, there was often no clear dominating relation between two ROC curves over the entire range. Therefore, AUC was introduced to provide a good "summary" of the performance of the learning algorithms based on ROC.
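The four evaluation indexes can be sketched as follows, with CA computed as the proportion of correctly predicted samples and SE and SP as in Equations (7) and (8). The AUC is computed here with the rank-based (pairwise comparison) formulation, which is an assumption about the implementation rather than the authors' stated method. Labels use 1 for the high-performance group and 0 for the low-performance group; both classes are assumed to be present. All names and data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 = positive (high performance)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return (CA, SE, SP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    ca = (tp + tn) / (tp + tn + fp + fn)   # proportion predicted correctly
    se = tp / (tp + fn)                    # Equation (7)
    sp = tn / (tn + fp)                    # Equation (8)
    return ca, se, sp


def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, predictions and decision scores for five subjects.
labels = [1, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1]
decision_scores = [0.9, 0.8, 0.4, 0.3, 0.6]
ca, se, sp = classification_metrics(labels, predictions)
auc = auc_from_scores(labels, decision_scores)
print(ca, se, sp, auc)
```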

To examine the stability of the SVM classification performance, classification was performed under both intra- and inter-session conditions. In the intra-session condition, LOOCV was utilized separately for the first and second sessions (Vehtari et al., 2016). One subject was picked out as the test sample, while the rest of the subjects served as training samples for obtaining the classification model. These steps were repeated until every subject had served as the test sample once. For the inter-session prediction, subjects in the first session were first regarded as test samples and subjects in the second session as training samples. Then the roles were reversed: subjects in the first session were used as training samples and subjects in the second session as test samples.
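The intra-session LOOCV loop and the inter-session train/test split can be sketched as follows. To keep the example free of external dependencies, a nearest-class-mean rule on a single feature is used as a hypothetical stand-in for the SVM classifier; it is not the authors' classifier. The feature values and labels are invented for illustration, and inputs are assumed to be Python lists.

```python
def fit_nearest_mean(features, labels):
    """Stand-in for SVM training: store the mean feature value per class."""
    model = {}
    for cls in set(labels):
        values = [f for f, l in zip(features, labels) if l == cls]
        model[cls] = sum(values) / len(values)
    return model


def predict_nearest_mean(model, feature):
    """Predict the class whose stored mean is closest to the feature."""
    return min(model, key=lambda cls: abs(feature - model[cls]))


def loocv_accuracy(features, labels):
    """Intra-session: each subject serves as the test sample exactly once."""
    hits = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_nearest_mean(train_x, train_y)
        hits += predict_nearest_mean(model, features[i]) == labels[i]
    return hits / len(features)


def cross_session_accuracy(train_x, train_y, test_x, test_y):
    """Inter-session: train on one session, test on the other."""
    model = fit_nearest_mean(train_x, train_y)
    hits = sum(predict_nearest_mean(model, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Hypothetical single-channel features for six subjects in two sessions.
s1_x, s1_y = [0.2, 0.3, 0.4, 0.8, 0.9, 1.0], [0, 0, 0, 1, 1, 1]
s2_x, s2_y = [0.25, 0.35, 0.5, 0.7, 0.95, 1.1], [0, 0, 0, 1, 1, 1]
intra_s1 = loocv_accuracy(s1_x, s1_y)
inter_s2 = cross_session_accuracy(s1_x, s1_y, s2_x, s2_y)  # train on session 1
inter_s1 = cross_session_accuracy(s2_x, s2_y, s1_x, s1_y)  # train on session 2
print(intra_s1, inter_s2, inter_s1)
```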

#### Predicting WM Performance

The subjects' spectral entropies derived from the WM data in the retention period were used to predict the corresponding SRT scores. As in the Section "Grouping Rules and Classification," the spectral entropies of the channel with the highest correlation between SRT and spectral entropy were fed to support vector regression (SVR) to develop a predictive model suited to a relatively small sample size (Ju et al., 2014), separately for session 1, session 2, and the merged session (i.e., session 1 + session 2). LOOCV was then utilized to test the stability of the model and the performance of the predictions. The LOOCV procedure for prediction was as follows:

We assumed that there were n samples in the dataset. One sample was picked out as the test set, and the rest of the samples served as the training set to develop the predictive model. These steps were repeated until every subject had served as the test sample once, so that n SVR models were eventually obtained. The correlation coefficients between predicted and actual SRT scores, together with the root-mean-square error of prediction (RMSEP), were calculated to evaluate the prediction performance of the SVR model (Ju et al., 2014; Kichonge et al., 2015).

FIGURE 3 | Scalp spatial distribution for correlations between the normalized spectral entropy and subjects' SRT scores. (A) Fingerprint maps. (B) Topographic maps. The color bar denoted the correlation values between SRT scores and spectral entropy. (C) Correlations between SRT scores and spectral entropy on channel FC4 separately in session 1, session 2, and merged session.

"Groups" denoted the grouping results about the number of subjects separately in session 1 and session 2 according to standard z scores of subjects' SRT. "Changes" denoted the changes (i.e., SRT scores increased or SRT scores decreased) in the number of subjects after training (i.e., from session 1 to session 2). "Change Rates" denoted the number of subjects whose change rates of SRT showed the consistent and inconsistent change with spectral entropy after training.

#### TABLE 2 | The classification results of SVM classifier.

CA, classification accuracy; AUC, area under ROC curves; SE, sensitivity; SP, specificity. The session 1 with asterisk (<sup>∗</sup>) denoted the classification results of session 1 with WM data in session 2 as training samples. The session 2 with asterisk (<sup>∗</sup>) denoted the classification results of session 2 with WM data in session 1 as training samples.
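The LOOCV prediction procedure and its two evaluation measures can be sketched as follows. An ordinary least-squares line is used as a hypothetical stand-in for SVR so that the example needs no external library, and Pearson's coefficient is assumed for the correlation between predicted and actual scores because the text does not name the coefficient used at this step. All data are invented.

```python
import math


def fit_line(x, y):
    """Stand-in for SVR training: least-squares slope and intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_predictions(x, y):
    """Fit n models, each leaving one sample out, and predict that sample."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return predictions


def rmsep(predicted, actual):
    """Root-mean-square error of prediction."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def pearson_r(a, b):
    """Pearson correlation coefficient (assumed measure, see lead-in)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical, perfectly linear data: entropy feature vs. SRT score.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
actual_srt = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted_srt = loocv_predictions(feature, actual_srt)
print(rmsep(predicted_srt, actual_srt), pearson_r(predicted_srt, actual_srt))
```

Because the toy data lie exactly on a line, the left-out samples are predicted essentially without error; real data would give a non-zero RMSEP.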

### Change Rates of SRT Scores and Spectral Entropy between Sessions

For a single subject, behavioral scores usually varied by training. Thus we attempted to study whether the spectral entropy of WM data at retention stage could predict changes in memory performance before and after training.

Since the subjects' SRT scores and spectral entropy were on different scales, a new measure called Change Rate (CR) was defined as follows (Zhang et al., 2015):

$$CR = 2 \times (TA - TB)/(TA + TB) \times 100\% \tag{9}$$

where TB denoted the subjects' SRT scores or spectral entropy of WM data at retention period before training (i.e., session 1). TA was the subjects' SRT scores or spectral entropy predictors at retention period after training (i.e., session 2).
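Equation (9) and the check for consistent changes can be sketched as follows. Treating two change rates as "consistent" when they share the same sign is an assumption made for illustration, and the function names and values are hypothetical. Note that the expression is undefined when TA + TB equals zero.

```python
def change_rate(before, after):
    """Change Rate in percent following Equation (9).

    before -- value before training (TB, session 1)
    after  -- value after training (TA, session 2)
    """
    return 2 * (after - before) / (after + before) * 100


def same_direction(cr_a, cr_b):
    """Assumed consistency rule: both change rates have the same sign."""
    return cr_a * cr_b > 0


# Hypothetical values for one subject.
cr_srt = change_rate(10.0, 12.0)        # SRT score rises from 10 to 12
cr_entropy = change_rate(0.50, 0.55)    # spectral entropy rises as well
print(cr_srt, cr_entropy, same_direction(cr_srt, cr_entropy))
```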

### RESULTS

#### Behavioral Performance

For SRT scores, significant main effects of memory load were observed within each session (session 1: F = 218.99, P < 0.001, η<sub>p</sub><sup>2</sup> = 0.920; session 2: F = 162.32, P < 0.001, η<sub>p</sub><sup>2</sup> = 0.895), indicating that memory load affected behavioral performance on SRT scores both before and after training. Non-significant main effects of hand were found in both sessions (session 1: F = 0.435, P = 0.518, η<sub>p</sub><sup>2</sup> = 0.01; session 2: F = 1.51, P = 0.234, η<sub>p</sub><sup>2</sup> = 0.074), indicating that SRT scores were not affected by the response hand. There were non-significant interactions between hand and memory load (session 1: F = 4.24, P = 0.053, η<sub>p</sub><sup>2</sup> = 0.183; session 2: F = 0.92, P = 0.405, η<sub>p</sub><sup>2</sup> = 0.046).

For the change rates of SRT scores from session 1 to session 2, there were non-significant main effects of hand (F = 3.636, P = 0.072, η<sub>p</sub><sup>2</sup> = 0.161) and memory load (F = 0.028, P = 0.973, η<sub>p</sub><sup>2</sup> = 0.002), indicating that the change rates induced by training were affected by neither hand nor memory load. The interaction between hand and memory load was also non-significant (F = 3.99, P = 0.060, η<sub>p</sub><sup>2</sup> = 0.169).

### Relationship between SRT Scores and Spectral Entropy

The correlations between SRT scores and the spectral entropy on 64 channels were illustrated in the fingerprint map (**Figure 3A**) and the scalp topographic map (**Figure 3B**). Among them, the spectral entropy on channel FC4 showed the strongest correlation with subjects' SRT scores (session 1: r = 0.814, P < 0.001; session 2: r = 0.761, P < 0.001; and merged session: r = 0.698, P < 0.001; FDR correction, and also shown in **Figure 3C**).

Therefore, the spectral entropies on channel FC4 from WM EEG recording at the retention stage were then used as features for SVM classifiers to distinguish the high-performance group from the low-performance group, and further used to predict subjects' SRT scores with SVR prediction model.

#### Intra- and Inter-Session Classification for SRT

In the first session (i.e., before training), 11 subjects were assigned to the high-performance group and nine subjects to the low-performance group; in the second session (i.e., after training), 12 of 20 subjects were assigned to the high-performance group and eight subjects to the low-performance group (**Table 1**).

The CA was 95 and 85% for the first and second session with spectral entropy features, respectively. The sensitivity (SE) and specificity (SP) at the optimal operating point, as well as the resulting AUC, were also reported for the first and second session. The results were shown in **Table 2**.

For intra-session prediction, the resulting CA, AUC, SE, and SP were respectively 0.950, 0.976, 1.000, and 0.900 for session 1 with spectral entropy as classification feature. CA, AUC, SE, and SP were respectively 0.850, 0.918, 0.800, and 0.900 for session 2 with spectral entropy as classification feature.

For inter-session classification, CA, AUC, SE, and SP were respectively 0.950, 0.973, 0.898, and 1.000 for session 1 with session 2 as training samples. The resulting CA, AUC, SE, and SP were respectively 0.900, 0.942, 0.900, and 0.900 for session 2 with session 1 as training samples.

The classification results were illustrated in **Figure 4** and **Table 2**. The red line demonstrated the ROC curves of session 1, and the blue line showed the ROC of session 2.

#### SRT Predicted by the Spectral Entropy

Support vector regression prediction model and LOOCV revealed that the spectral entropy could be utilized to predict individual WM performance on SRT scores in the current study. The RMSEP values after LOOCV were 4.635 (session 1), 3.339 (session 2), and 6.972 (merged session). As illustrated in **Figure 5**, the predicted SRT scores were significantly correlated with the original SRT scores (session 1: r = 0.749, P < 0.001; session 2: r = 0.864, P < 0.001; merged session: r = 0.732, P < 0.001; FDR correction).

boundaries. (B) ROC curves for spectral entropy predictors for intra-session and inter-session in classifying the two different performance groups in WM tasks. The session 1 with asterisk (<sup>∗</sup>) denoted the classification results of session 1 with WM data in session 2 as training samples. The session 2 with asterisk (<sup>∗</sup>) denoted the classification results of session 2 with WM data in session 1 as training samples. The abscissa represented the false positive rate, and the ordinate denoted the true positive rate. The red line denoted the ROC curves of session 1, and the blue line represented the ROC curves of session 2. AUC: area under curve.

### Consistent Changes in Spectral Entropy and SRT Scores between Sessions

Within single subjects, we explored whether an increase (or decrease) in spectral entropy could be predictive of an increase (or decrease) in SRT scores. **Figure 6A** showed that the SRT scores of 16 of 20 subjects increased after short-time training. **Figure 6B** showed that the SRT scores of 15 of 20 subjects varied consistently with the spectral entropy predictor from before to after training. The results could also be seen in **Table 1**.

## DISCUSSION

The present study utilized spectral entropy to predict individual WM performance changes induced by short-time training while subjects carried out delayed-match-to-sample tasks. We found that: (1) the spectral entropy features extracted from channel FC4 during the retention stage of WM were strongly related to subjects' SRT scores; (2) the averaged CA for distinguishing the high-performance from the low-performance group in WM tasks was 90.0 and 92.5% for intra-session and inter-session classification, respectively; (3) SVR with LOOCV revealed that spectral entropy could predict individual WM performance; (4) the SRT scores of 16 of 20 subjects increased, and those of 15 of 20 subjects changed consistently with the changes in spectral entropy after short-time training.

### Spatial Distribution for WM Performance

As shown in **Figure 3B**, the spatial distribution of the r-values (i.e., the relationship between spectral entropies and behavioral scores) focused on the right frontal area, which was consistent with the previous findings that the frontal cortex might play a prominent part in WM tasks (Gentili et al., 2015; Thürer et al., 2016). The previous study showed that right inferior frontal junction (rIFJ) as a prefrontal cortex (PFC) control region mediated the causal connection between top–down modulation in the service of attentional goals and WM performance (Gazzaley and Nobre, 2012). Among these channels (**Figure 3A**), the spectral entropy of channel FC4 from WM at retention period generated the strongest correlation (r = 0.698) in comparison to other channels in the merged session. Likewise, in the two separate sessions, the highest correlation coefficients also appeared on channel FC4 before training (r = 0.814) and after training (r = 0.794), respectively. The good classification effect of SVM showed that the spectral entropies on channel FC4 from WM EEG recording at the retention stage might be a dependable biomarker to classify two memory groups successfully.

In the current study, there was no significant effect on the change rates of subjects' SRT scores before and after training when different WM items (i.e., two, four, and eight) were loaded. Thus, subjects' SRT scores were utilized to reflect individual WM performance while carrying out the WM tasks regardless of memory load. As illustrated by **Figure 3C**, the spectral entropy indexes of the subjects were significantly related to individual SRT scores in all sessions including separate sessions and merged session, indicating that spectral entropy indexes might be applied to predict individual WM performance (i.e., SRT scores).

For the further analysis, SVR prediction model combined with LOOCV was established to estimate the predictive ability of spectral entropy indexes on WM performance separately in session 1, session 2, and the merged session (**Figure 5**). The resultant RMSEP after using LOOCV, as well as the high correlation between original SRT scores and predicted SRT scores, demonstrated that prediction models constructed by SVR were effective. The spectral entropy obtained from channel FC4 could be a biomarker to predict individual WM performance.

### WM Training

Converging evidence revealed that one's memory ability was not innate and could be improved through proper training (Klingberg et al., 2005; Holmes et al., 2010; Klingberg, 2010). Moreover, this improved performance was related to training-induced plasticity, from the intracellular level to the functional organization of the cortex for WM (Klingberg, 2010). Vogt et al. (2009) conducted a WM training study on multiple sclerosis. They found that the patients' memory was improved effectively and the spread of bad mood was delayed after receiving the special training for memory (Vogt et al., 2009).

As illustrated by **Figure 6A** and **Table 1**, 16 subjects' behavioral scores clearly increased after training, indicating that individuals' WM performance could be improved effectively through training, consistent with previous studies (Klingberg et al., 2005; Holmes et al., 2010; Klingberg, 2010). The CR of the SRT scores of 15 of 20 subjects increased (or declined) with the increase (or decline) of the spectral entropy from before to after training, which demonstrated that variations in the spectral entropy predictor could be predictive of variations in WM performance. The findings revealed consistent changes in SRT scores and spectral entropy with training (**Figure 6B** and **Table 1**).

### WM in BCIs

Previously, P300, motor imagery tasks, and steady-state visual evoked potentials (SSVEP) were frequently applied in BCIs (Parra et al., 2003; Dal et al., 2009; George and Lécuyer, 2010). Recently, BCIs were also designed to recognize human emotions (Garcia-Molina et al., 2013; Chanel and Mühl, 2015) and to monitor WM load (Sánchez et al., 2015), while there is still little research on the detection of WM performance or individual memory in BCIs. Consequently, it is of great significance to find a reliable feature for monitoring individual WM performance in BCIs. The development of predictors of WM performance could help to identify subjects with potential memory impairment, assisting them in preparing for a positive life and avoiding the frustration caused by disordered memory. In addition, such work might in turn be heuristic for devising effective training strategies for subjects with low WM performance. Moreover, spectral entropy could be a potentially instructive biomarker for the detection of schizophrenia, depression, ADHD and bipolar affective disorder, and could meanwhile provide a new approach to WM feature extraction in BCIs.

### Limitations

As shown in **Figure 6A**, the SRT scores of 16 of 20 subjects clearly improved after training, while four subjects' SRT scores showed a downward trend, which might have been induced by the individuals' state: either their mental state, such as a tendency to feel tense when carrying out the unfamiliar WM tasks or difficulty adapting to the experimental environment, or their physical state, such as exhaustion from insufficient rest before the experiment or a tendency to tire during the "dull" experiment. Given the small sample size in the current study, we utilized LOOCV and four different evaluation indexes of the classifier's generalization to avoid overfitting as much as possible. It should be noted that selecting classification features (such as channels) outside of the LOOCV procedure could still lead to overestimation. We therefore verified the robustness of the selected channel: the correlation between spectral entropies and WM-related SRT scores of all subjects was highest on channel FC4 in session 1, session 2 and the merged session alike, suggesting that the relationship was robust. In addition, spectral entropy was used to predict the individual SRT scores and thus reflected WM performance only indirectly. In the future, a more direct approach, an online WM-related BCI, will be explored by optimizing classifiers, expanding the sample size and improving the experimental paradigm.

### CONCLUSION

In the current study, we proposed, for the first time, the use of spectral entropy as a feature for the classification of WM, distinguishing high-performance from low-performance groups in the delayed-match-to-sample task. The resulting RMSEP for the SVR prediction models, as well as the high correlations between original SRT scores and predicted SRT scores, demonstrated that the spectral entropies on channel FC4 could implicitly predict individual WM performance. The changes in the spectral entropy can be predictive of changes in behavioral scores for individual WM performance. This study could provide a theoretical foundation for researchers establishing enhanced training strategies for people with memory impairment, and for BCI feedback systems on memory state with spectral entropy as a feature.

#### AUTHOR CONTRIBUTIONS


Conceived, designed the experiments and wrote the manuscript: YT. Performed the experiments, analyzed the data and wrote the first draft: HuZ. Contributed reagents/materials/analysis tools: WX, HyZ, LY. Discussed the experiment design, analyzed the data and discussed the experiment results: SZ, YS.

#### ACKNOWLEDGMENTS

This research is supported by the National Natural Science Foundation of China (#61671097); the Chongqing Research Program of Basic Science and Frontier Technology (No. cstc2017jcyjBX0007; No. cstc2015jcyjA10024); the Chongqing Key Laboratory Improvement Plan (cstc2014pt-sy40001); and the University Innovation Team Construction Plan Funding Project of Chongqing (CXTDG201602009).

#### REFERENCES



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Tian, Zhang, Xu, Zhang, Yang, Zheng and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Different Topological Properties of EEG-Derived Networks Describe Working Memory Phases as Revealed by Graph Theoretical Analysis

Jlenia Toppi<sup>1,2</sup>, Laura Astolfi<sup>1,2</sup>, Monica Risetti<sup>2</sup>, Alessandra Anzolin<sup>1,2</sup>, Silvia E. Kober<sup>3,4</sup>, Guilherme Wood<sup>3,4</sup> and Donatella Mattia<sup>2</sup>\*

<sup>1</sup> Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy,

<sup>2</sup> Neuroelectrical Imaging and Brain-Computer Interface Laboratory, Fondazione Santa Lucia IRCCS, Rome, Italy,

<sup>3</sup> Department of Psychology, University of Graz, Graz, Austria, <sup>4</sup> BioTechMed-Graz, Graz, Austria

Several non-invasive imaging methods have contributed to shed light on the brain mechanisms underlying working memory (WM). The aim of the present study was to depict the topology of the relevant EEG-derived brain networks associated with distinct operations of WM function elicited by the Sternberg Item Recognition Task (SIRT), such as encoding, storage, and retrieval, in healthy, middle-aged (46 ± 5 years) adults. High density EEG recordings were performed in 17 participants whilst attending a visual SIRT. Neural correlates of WM were assessed by means of a combination of EEG signal processing methods (i.e., time-varying connectivity estimation and graph theory), in order to extract synthetic descriptors of the complex networks underlying the encoding, storage, and retrieval phases of the WM construct. The group analysis revealed that the encoding phase exhibited a significantly higher small-world topology of EEG networks with respect to storage and retrieval in all EEG frequency oscillations, thus indicating that during the encoding of items the global network organization could "optimally" promote the information flow between WM sub-networks. We also found that the magnitude of such configuration could predict subject behavioral performance when memory load increases, as indicated by the negative correlation between Reaction Time and the local efficiency values estimated during the encoding in the alpha band in both 4 and 6 digits conditions. At the local scale, the values of the degree index, which measures the degree of in- and out- information flow between scalp areas, were found to specifically distinguish the hubs within the relevant sub-networks associated with each of the three different WM phases, according to the different role of the sub-network of regions in the different WM phases. Our findings indicate that the use of EEG-derived connectivity measures and their related topological indices might offer a reliable and yet affordable approach to monitor WM components and thus theoretically support the clinical assessment of cognitive functions in presence of WM decline/impairment, as it occurs after stroke.

Keywords: EEG, brain networks, working memory, sternberg task, connectivity, graph theory

Edited by:

Felix Putze, University of Bremen, Germany

#### Reviewed by:

Márk Molnár, Institute of Cognitive Neuroscience and Psychology, Centre of Natural Sciences (HAS), Hungary Birgit Mathes, University of Bremen, Germany

> \*Correspondence: Donatella Mattia d.mattia@hsantalucia.it

Received: 31 August 2017 Accepted: 14 December 2017 Published: 12 January 2018

#### Citation:

Toppi J, Astolfi L, Risetti M, Anzolin A, Kober SE, Wood G and Mattia D (2018) Different Topological Properties of EEG-Derived Networks Describe Working Memory Phases as Revealed by Graph Theoretical Analysis. Front. Hum. Neurosci. 11:637. doi: 10.3389/fnhum.2017.00637

### INTRODUCTION

Working memory (WM) is a non-unitary construct that involves the temporary maintenance and manipulation of information either recently acquired or retrieved from long-term storage (Baddeley, 1996). Baddeley's model is one of the most recognized among the several current models describing the operating principles of WM (D'Esposito and Postle, 2015). It encompasses diverse separable but interacting subsystems: two unimodal storage sub-systems (phonological loop for verbal material and visuo-spatial sketchpad for visuo-spatial material), a flexible system (central executive) which is responsible for the control and regulation of the storage sub-systems, and a multimodal system with limited capacity storage (episodic buffer) that allows the interaction between the various components of WM and the interface with long-term memory (LTM) (Baddeley, 2000, 2010).

The Sternberg Item Recognition Task (SIRT; Sternberg, 1966) has been largely used in cognitive neuroscience to assess WM capacity in terms of storage and data retrieval (Nosofsky et al., 2011; Corbin and Marquer, 2013). It allows for a segregation of encoding, executive maintenance and retrieval processes (not manipulation) regarded as central within the multi-component model of WM. The SIRT is also relatively free from practice effects (Kristofferson, 1972). The SIRT was initially introduced to investigate the neurophysiological processes at the basis of WM by means of indirect behavioral measures (Sternberg, 1966, 1969). Its application was extended later into the field of neuroimaging techniques, functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and magnetoencephalography (MEG) aiming at directly measuring the neural correlates underpinning WM processes (Rypma et al., 1999; Cairo et al., 2004; Payne and Kounios, 2009; Keren-Happuch et al., 2012).

In this regard, several fMRI studies have shown that verbal WM processing in adult humans requires the involvement of a large network of areas which includes bilateral dorso-lateral prefrontal, left inferior frontal, middle and superior frontal areas, premotor and supplementary motor areas as well as inferior parietal and superior temporal areas, the insula and parts of the cerebellum (Smith and Jonides, 1998; Marvel and Desmond, 2010; Luis et al., 2015). Further studies using the SIRT found specific patterns of activation for each of the three phases of WM (encoding, storage, and retrieval) which were also sensitive to WM load levels (Rypma and D'Esposito, 1999; Cairo et al., 2004; Chen and Desmond, 2005; Marvel and Desmond, 2010; Thürling et al., 2012; Vergauwe et al., 2015).

Evidence for specific brain oscillatory responses elicited during the different phases of WM emerged from EEG and MEG studies using the SIRT. In particular, the maintenance (storage) phase of the verbal SIRT was associated with oscillatory power in theta (4–8 Hz) predominantly over the frontal midline and left temporo-parietal sites (Payne and Kounios, 2009; Brookes et al., 2011; Kottlow et al., 2015) as well as with alpha (8–13 Hz) power over the parietal midline, the parieto-occipital and left temporo-parietal regions (Jensen et al., 2002; Scheeringa et al., 2009; Heinrichs-Graham and Wilson, 2015; Xie et al., 2016). Changes in EEG power spectra such as an increase of bilateral frontal delta, frontal-midline theta, and temporo-parietal alpha and a decrease of beta and gamma activities in frontal and occipital areas have been observed as a function of WM load (Jensen and Tesche, 2002; Hwang et al., 2005; Payne and Kounios, 2009; Axmacher et al., 2010; Brookes et al., 2011; Roux and Uhlhaas, 2014; Zakrzewska and Brzezicka, 2014; Maurer et al., 2015; Gurariy et al., 2016). Delta power also varies as a function of stimulus type (so-called old-new effect; Kayser et al., 2007; Mathes et al., 2012).

To fully understand brain functions, functional neuroimaging methods have also been applied to investigate the dynamics within networks of brain areas that underlie specific cognitive processes (such as WM), and how a brain damage-induced disruption of neural circuits could account for behavioral impairments (Honey and Sporns, 2008; Cramer et al., 2011; Grefkes and Fink, 2011, 2014). In this regard, functional connectivity estimation was applied to track age-related changes in brain connectivity in a group of children and adolescents performing a modified version of the SIRT (van den Bosch et al., 2014). Task-related networks were identified for the encoding phase (including left motor area, right prefrontal, parietal, and occipital cortex, and cerebellum) and the recognition phase (including anterior and posterior cingulate cortex, right motor area, cerebellum, left parietal, and prefrontal cortex), and their load-induced modulation also correlated with age (Woodward et al., 2013; van den Bosch et al., 2014).

In this study, we take advantage of the high temporal resolution of the EEG technique, and of its relatively low cost and ease of use, to isolate salient descriptors of WM processes as elicited by the SIRT in healthy middle-aged adults. To this purpose, a combined approach based on EEG-derived connectivity patterns and graph theory (Baccalá and Sameshima, 2001; Milde et al., 2010; Rubinov and Sporns, 2010; Astolfi et al., 2013) was adopted. We expected such a combined approach to return quantitative measurements specific for the three different WM phases (encoding, storage and retrieval) and sensitive to different memory workloads. The relationship between the extracted neurophysiological indices and subject memory performance was also assessed to explore to what extent the estimated EEG network topology would account for memory behavior. Here, we targeted a middle-aged population (i.e., between 40 and 50 years) to limit possible confounding effects on the stability of EEG network measures due to changes in memory task-related neural activity that may emerge from middle age (i.e., fourth decade) onward (Aine et al., 2006, 2011; Grady et al., 2006; Mattay et al., 2006; MacPherson et al., 2014).

The ultimate goal is to provide affordable (EEG-based) computational instruments to measure the electrophysiological dynamics at the level of brain networks that underpin theoretical models of WM processes. As such, this EEG-based network approach could also serve as an objective counterpart of the behavioral assessment of WM impairments which often occur in acquired brain lesions (e.g., stroke) and to ground future cognitive rehabilitative strategy design (Kober et al., 2015, 2017).

## MATERIALS AND METHODS

### Participants

Seventeen healthy subjects (age: 46 ± 5 years old; 6 males; education: 14.8 ± 3 years; see **Table 1**) were enrolled in the study. All participants except one were right-handed with normal or corrected-to-normal vision. No participant reported a history of neurological or psychiatric diseases; in addition, they were all screened for medication intake and none was receiving any pharmacological treatment affecting cognitive functions. Participants underwent some subtests (Similarities, Information, Coding, Picture Completion, Mosaic Test) from the German adaptation of the Wechsler Adult Intelligence Scale (WAIS III, Von Aster et al., 2006), for a general screening of the cognitive functions and also an in-depth evaluation of the memory functions. In particular, for the evaluation of verbal and visuospatial memory, subjects performed the Corsi Block Tapping Test (CBTT) (Corsi and Michael, 1972), the Visual and Verbal Memory Test (VVM 2) (Schellig and Schächtele, 2009), the Digit Span (Härting, 2000), the Verbal Learning Memory Test (VLMT), the Nonverbal Learning Test (NVLT), and the Verbal Learning Test (VLT) (Sturm and Willmes, 1999). All the subjects obtained normal scores in all the investigated cognitive domains (**Table 1**). This study was carried out in accordance with the recommendations provided in the Declaration of Helsinki. All participants provided written informed consent according to the Declaration of Helsinki. The ethics committee of the University of Graz approved the study. All participants received a monetary reward for their participation in the study.

### Data Acquisition and Experimental Paradigm

The EEG potentials were recorded from 60 scalp electrodes embedded in a lycra cap, with a left mastoid reference and a ground at Fpz. Horizontal and vertical electro-oculogram (EOG) signals were recorded from 3 electrodes in total, two placed on the outer canthi of the eyes and one below the right eye, respectively. EEG signals were amplified (BrainAmp; Brain Products GmbH, Munich, Germany) and filtered by means of a [0.01–100] Hz band-pass filter prior to digitization at 500 Hz. Electrode impedances were kept below 5 and 10 kOhms for the EEG and EOG recording, respectively.

After 2 min of resting EEG acquisition (with eyes open and closed), each subject performed the Sternberg task (Sternberg, 1966). The experimental procedure was as follows (**Figure 1**): first, a series of digits was visually presented to the participants, who had to memorize it (encoding phase); then, the participants had to retain the memorized information for a fixed period (storage phase); finally, participants had to retrieve the memorized content in a brief time interval (retrieval phase). In particular, participants were asked to remember a set of unique digits (between 1 and 9), and then a probe stimulus in the form of a digit was presented. Subjects were instructed to answer, as quickly as possible, whether the probe was in the previously presented set of digits or not. The size of the initial set of digits determined the WM load imposed on the subject (4 digits → easy, low workload; 6 digits → difficult, high workload). Each trial started with a 2 s presentation of a fixation cross in the middle of the screen. Afterwards, a "memory set" of 4 (e.g., 5682) or 6 digits (e.g., 146372) was presented (1 s) to allow for memorization (encoding phase). The presentation of the digit series was then followed by a fixation cross, displayed for 2 s (storage period). A single probe digit was then presented for 250 ms (retrieval phase) followed by a fixation cross presented for 1,250 ms. Afterwards, the question "yes or no?" appeared on the screen (maximum duration of 1,500 ms) and the participant had to answer: if the probe was a member of the preceding digit series, the participant had to press the left button (Target Condition), whereas the right button had to be pressed if the probe was not part of the series (NoTarget Condition). A new trial started upon the participant's answer. The probability of the Target condition (36 trials in total) was 0.5 and the digits contained in each trial were presented in a randomized order. The conditions 4/6 digits and Target/NoTarget were also randomized along the recording session.
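The trial timing described above can be laid out as a small sketch that computes event onsets relative to the appearance of the memory set. The event names and the helper `onsets_relative_to` are hypothetical labels introduced only for illustration; the durations are those stated in the text.

```python
# Durations in milliseconds, taken from the paradigm description.
# Event names are illustrative labels, not identifiers used by the authors.
TRIAL_EVENTS = [
    ("fixation", 2000),             # pre-stimulus fixation cross
    ("memory_set", 1000),           # encoding phase
    ("retention_fixation", 2000),   # storage phase
    ("probe", 250),                 # retrieval phase starts here
    ("post_probe_fixation", 1250),
    ("response_prompt", 1500),      # maximum duration of the "yes or no?" screen
]


def onsets_relative_to(events, reference):
    """Return {event name: onset in ms}, with the onset of `reference` set to 0."""
    onsets = {}
    elapsed = 0
    for name, duration in events:
        onsets[name] = elapsed
        elapsed += duration
    shift = onsets[reference]
    return {name: onset - shift for name, onset in onsets.items()}


ONSETS = onsets_relative_to(TRIAL_EVENTS, "memory_set")
```

Under these durations the probe appears 3,000 ms after the onset of the memory set and the response prompt at 4,500 ms, so the interval from 0 to 4,500 ms covers encoding, storage, and the probe period up to the response prompt.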

### Behavioral Data

We collected reaction times (RTs) and the percentage of correct answers for each subject and each condition (Target/NoTarget and 4/6 digits). To examine any effect of task complexity and trial type on the subjects' behavioral performance, two separate two-way repeated measures ANOVAs with digits number (DIGITS; 4 or 6) and target type (TYPE; Target/NoTarget) as within-subject main factors were performed, with the percentage of correct answers and the RTs as dependent variables.

### Data Pre-processing

EEG signals were downsampled to 100 Hz (with an anti-aliasing low-pass filter) to optimize the subsequent connectivity analysis and then band-pass filtered in the range (1–45) Hz in order to isolate the EEG spectral content of interest. Independent Component Analysis (ICA; fastICA algorithm) was used to remove ocular artifacts (i.e., the blink-related IC was removed on the basis of its temporal content and spatial distribution, mainly located over frontal scalp areas). EEG traces were segmented in relation to the specific timing of the paradigm, (0–4,500) ms (period of interest) from the onset of the first screen containing the digits series, and classified according to the different conditions (Target\_4digits, NoTarget\_4digits, Target\_6digits, NoTarget\_6digits). Only correctly executed trials were included in the analysis. Residual artifacts were then removed by means of a semi-automatic procedure based on a threshold criterion (±80 µV). Only the artifact-free trials were used for further analysis (no fewer than 30 trials per condition were considered for the final analysis). The entire pre-processing procedure was performed by means of Brain Vision Analyzer 1.0 software (Brain Products GmbH).
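As an illustration of the amplitude threshold criterion only, a minimal sketch is given below. It assumes a fully automatic rule (a trial is discarded if any sample on any channel exceeds ±80 µV), whereas the procedure described in the text was semi-automatic and run in Brain Vision Analyzer; the function name and data layout are hypothetical.

```python
def reject_artifact_trials(trials, threshold_uv=80.0):
    """Keep only trials whose samples all lie within +/- threshold_uv.

    `trials` is a list of trials; each trial is a list of channels;
    each channel is a list of amplitude samples in microvolts.
    """
    kept = []
    for trial in trials:
        # Largest absolute amplitude over all channels and samples of the trial.
        peak = max(abs(sample) for channel in trial for sample in channel)
        if peak <= threshold_uv:
            kept.append(trial)
    return kept
```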

### Time-Varying Connectivity Estimation

Pre-processed EEG signals were subjected to a time-varying connectivity estimation process for each subject and each experimental condition (4/6 digits, Target/NoTarget).



F, female; M, male; R, right; L, left; U, university; HS, high school; CS, compulsory school.

Several time-domain or frequency-domain measures have been developed to estimate connectivity from EEG data in terms of correlation/coherence, statistical dependencies, or causal interaction among data series. Some of these measures, such as imaginary coherence (Nolte et al., 2004) or phase slope index (Nolte et al., 2008), are deemed to be less prone to volume conduction effects (Haufe et al., 2013), but they return non-causal, undirected measures and are based on a pairwise approach, which results in high rates of false positives when the network complexity increases (Kus et al., 2004). Here, we employed a time-varying adaptation of Partial Directed Coherence (PDC), a spectral multivariate estimator which provides the directed influences between any given pair of signals in a multivariate data set (Baccalá and Sameshima, 2001; Astolfi et al., 2006; Toppi et al., 2016b). Such time-varying adaptation is based on the General Linear Kalman Filter (GLKF) (Milde et al., 2010), which is able to follow the temporal dynamics of brain networks with high temporal resolution in high-density EEG data (Toppi et al., 2016a). We used it to estimate the relationships between signals for all frequency samples in the range (1–45) Hz and for all the samples in the time interval (0–4,500) ms.

The connectivity patterns contrasted with the baseline period were estimated for each time sample and averaged in five frequency bands-of-interest and in three time intervals (periods-of-interest). The frequency bands were individually defined according to the Individual Alpha Frequency (IAF = 10 ± 0.9 Hz), as determined by means of the Fast Fourier Transform spectra of the 2 min resting EEG (recorded before SIRT execution) over posterior leads (parietal, parieto-occipital, and occipital). The following frequency bands were then considered: Delta (IAF-8/IAF-6), Theta (IAF-6/IAF-2), Alpha (IAF-2/IAF+2), Beta (IAF+2/IAF+14) and Gamma (IAF+15/IAF+30; Klimesch, 1999). The three periods-of-interest correspond to: (0–1,000) ms (encoding phase); (1,000–3,000) ms (storage phase) and (3,000–4,500) ms (retrieval phase). The analysis was conducted only on the Target condition.
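The IAF-based band definition can be sketched as a simple lookup of offsets around the individual alpha peak. The function name `iaf_bands` is hypothetical; the offsets are those listed in the text.

```python
# Band limits expressed as offsets (in Hz) from the Individual Alpha Frequency.
BAND_OFFSETS = {
    "delta": (-8, -6),
    "theta": (-6, -2),
    "alpha": (-2, 2),
    "beta": (2, 14),
    "gamma": (15, 30),
}


def iaf_bands(iaf_hz):
    """Return {band: (low_hz, high_hz)} for a subject with the given IAF."""
    return {
        band: (iaf_hz + low, iaf_hz + high)
        for band, (low, high) in BAND_OFFSETS.items()
    }
```

For a subject with an IAF of 10 Hz, this gives delta 2–4 Hz, theta 4–8 Hz, alpha 8–12 Hz, beta 12–24 Hz, and gamma 25–40 Hz.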

Any relevant changes in the time-varying connectivity matrices related to the different experimental conditions were evaluated by means of statistical comparisons (independent samples t-test) performed for each task condition (Target\_4digits, NoTarget\_4digits, Target\_6digits, NoTarget\_6digits) between each post-stimulus time window (encoding, storage, retrieval) and the baseline period. The baseline period was the time interval (−1,000–0) ms preceding the appearance of the digits series (subjects fixating a cross on the screen). Time samples were used as observations for the statistical test. The test was repeated for each frequency band and each subject. The significance level was set at 5%. A False Discovery Rate (FDR) correction was applied for multiple comparisons (Benjamini and Yekutieli, 2001).
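The text cites Benjamini and Yekutieli (2001) for the FDR correction without stating which variant was applied. The sketch below assumes the step-up procedure from that paper that remains valid under arbitrary dependency between tests; the function name is hypothetical.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up FDR procedure with the harmonic-sum correction c(m) = sum(1/i),
    which controls the FDR at level q under arbitrary dependency.
    """
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * q / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject
```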

The analysis pipeline was performed in Matlab environment (MATLAB, 2011).

#### Graph Indices

The main global and local properties of the estimated networks were quantified by means of indices derived from the graph theory. Such indices are defined on the basis of a binary adjacency matrix G, obtained by comparing each entry of the connectivity matrix A with its corresponding threshold as follows:

$$G_{ij}(f,t) = \begin{cases} 1 & \text{if } A_{ij}(f,t) \ge \tau_{ij}(f,t) \\ 0 & \text{if } A_{ij}(f,t) < \tau_{ij}(f,t) \end{cases} \tag{1}$$

where Gij and Aij represent the entry (i,j) of the adjacency matrix G and the PDC matrix A, respectively, and τij is the corresponding threshold. As mentioned above, we adopted a statistical approach for the threshold definition in order to avoid the attribution of false properties to the networks due to the application of an empirical threshold (Toppi et al., 2012). Accordingly, the threshold τij corresponds to the 95th percentile (corrected for multiple comparisons by FDR) of the PDC distribution obtained for the baseline condition.
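Equation 1 can be sketched for a single frequency band and time window as follows. This is a simplified illustration under stated assumptions: the threshold is the plain 95th percentile of the baseline values (the FDR correction mentioned in the text is omitted), the diagonal is excluded, and the function names and data layout are hypothetical.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def binarize(pdc, baseline_samples, pct=95.0):
    """Build a binary adjacency matrix G from a PDC matrix A.

    pdc[i][j] is the PDC value for the pair (i, j);
    baseline_samples[i][j] is the list of baseline PDC values for that pair.
    Self-connections are excluded (an assumption of this sketch).
    """
    n = len(pdc)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            tau = percentile(baseline_samples[i][j], pct)
            adjacency[i][j] = 1 if pdc[i][j] >= tau else 0
    return adjacency
```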

The estimated adjacency matrices were then used to extract local and global indices as described below. To avoid network-size effects, each global index was normalized by its corresponding value obtained from 100 random graphs generated by fixing the connection density of the original network. Random graphs were thus used as a reference level for the description of the global properties of WM networks (see below).
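The normalization step can be sketched as below. Two assumptions are made that the text does not spell out: random graphs are drawn by placing the same number of directed links uniformly at random (one way of fixing the connection density), and the normalized index is the ratio between the observed value and the mean over the random graphs. Function names are hypothetical.

```python
import random


def random_directed_graph(n, n_links, rng):
    """Random binary directed graph with n nodes and n_links links, no self-loops."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    graph = [[0] * n for _ in range(n)]
    for i, j in rng.sample(pairs, n_links):
        graph[i][j] = 1
    return graph


def normalize_index(adjacency, index_fn, n_random=100, seed=0):
    """Divide index_fn(adjacency) by its mean over density-matched random graphs."""
    n = len(adjacency)
    n_links = sum(adjacency[i][j] for i in range(n) for j in range(n) if i != j)
    rng = random.Random(seed)
    reference = [
        index_fn(random_directed_graph(n, n_links, rng)) for _ in range(n_random)
    ]
    mean_reference = sum(reference) / n_random
    if mean_reference == 0:
        return float("nan")  # index undefined on the random reference
    return index_fn(adjacency) / mean_reference
```

Any function that maps an adjacency matrix to a scalar (for example, a global efficiency routine) can be passed as `index_fn`.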

Local and global indices were computed by means of routines provided in Brain Connectivity Toolbox developed for Matlab environment (Rubinov and Sporns, 2010).

#### General Properties of the Network

The human brain can be viewed as a large-scale complex network that is simultaneously segregated and integrated via specific connectivity patterns (Bullmore and Sporns, 2009). We selected three indices (local efficiency, global efficiency, and small-worldness) that are widely used to describe the general topological properties of a network, thus reflecting the integration and segregation of the information flow between areas (Sporns, 2013b).

#### **Global Efficiency (GE)**

The GE is a global measure (considering all the connections in the whole-network) of how efficiently a network exchanges information internally. It is defined as the average of the inverse of the geodesic distance (shortest path between two nodes in the network) and it represents the efficiency of the communication between all the nodes within the network (Latora and Marchiori, 2001). It can be defined as follows:

$$GE = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \tag{2}$$

where N represents the number of nodes in the graph and dij the geodesic distance between i and j.

#### **Local Efficiency (LE)**

The LE is a measure of the fault tolerance of a network. It verifies whether the communication between nodes is still efficient when a node is removed from the network. The higher the LE, the greater the robustness of the network at local scale.

The LE is the average of the global efficiencies computed on each sub-graph S<sub>i</sub> belonging to the network and it represents the efficiency of the communication between all the nodes around the node i in the network (Latora and Marchiori, 2001). It can be defined as follows:

$$LE = \frac{1}{N} \sum_{i=1}^{N} GE(S_i) \tag{3}$$

where N represents the number of nodes in the graph and S<sub>i</sub> the sub-graph obtained deleting the ith row and the ith column from the original adjacency matrix.
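Equations 2 and 3 can be sketched for a binary directed graph as follows. Two assumptions are made: unreachable pairs contribute zero to the sum (the usual convention for 1/d when d is infinite), and the sub-graph S<sub>i</sub> is built exactly as worded in the text, by deleting row and column i. Toolbox implementations may define the sub-graph differently (for example, restricting it to the neighbours of node i), so values need not match. Function names are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first distances (number of links) from source along directed links."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(adjacency[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adjacency):
    """Average inverse geodesic distance over all ordered node pairs (Equation 2)."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        dist = shortest_path_lengths(adjacency, i)
        # Unreachable nodes are absent from dist and therefore add nothing.
        total += sum(1.0 / d for node, d in dist.items() if node != i)
    return total / (n * (n - 1))


def remove_node(adjacency, i):
    """Sub-graph obtained by deleting row i and column i."""
    keep = [k for k in range(len(adjacency)) if k != i]
    return [[adjacency[r][c] for c in keep] for r in keep]


def local_efficiency(adjacency):
    """Mean global efficiency of the sub-graphs S_i (Equation 3, as worded in the text)."""
    n = len(adjacency)
    return sum(global_efficiency(remove_node(adjacency, i)) for i in range(n)) / n
```

For a directed three-node cycle, each node reaches the other two at distances 1 and 2, so GE = (3 × 1.5)/6 = 0.75; removing any node leaves a single link between two nodes, so each sub-graph has GE = 0.5 and LE = 0.5.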

#### **Small-Worldness (SW)**

It has been suggested that human brain networks are organized to optimize efficiency, due to a small-world topology allowing simultaneous global and local parallel information processing (Bassett and Bullmore, 2006). SW is a measure of a network's global organization in terms of its integration and segregation properties. Small-world topology is typical of networks that are both highly segregated (nodes organized in clusters) and highly integrated (high communication speed between electrodes).

A network G is defined as a small-world network if L<sub>G</sub> ≥ L<sub>rand</sub> and C<sub>G</sub> ≫ C<sub>rand</sub>, where L<sub>G</sub> and C<sub>G</sub> represent the characteristic path length (Sporns et al., 2004) and the clustering coefficient (Fagiolo, 2007) of a generic graph and L<sub>rand</sub> and C<sub>rand</sub> represent the corresponding quantities for a random graph (Watts and Strogatz, 1998). On the basis of this definition, a measure of small-worldness of a network can be introduced as follows:

$$SW = \frac{C_G / C_{rand}}{L_G / L_{rand}} \tag{4}$$

with the network being small-world if SW > 1 (Humphries and Gurney, 2008).
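Equation 4 reduces to a ratio of two ratios once the clustering coefficient and characteristic path length are available. The sketch below assumes these quantities have already been computed for the observed network and for a set of random reference graphs, and that the reference values are averaged; the function name is hypothetical.

```python
def small_worldness(c_g, l_g, c_rand_values, l_rand_values):
    """Small-worldness index SW = (C_G / C_rand) / (L_G / L_rand).

    c_rand_values and l_rand_values hold the clustering coefficient and
    characteristic path length of each random reference graph; their means
    are used as C_rand and L_rand (an assumption of this sketch).
    """
    c_rand = sum(c_rand_values) / len(c_rand_values)
    l_rand = sum(l_rand_values) / len(l_rand_values)
    return (c_g / c_rand) / (l_g / l_rand)
```

With purely illustrative numbers (not data from the study), a network with C<sub>G</sub> = 0.6 and L<sub>G</sub> = 2.2 compared with random graphs having C<sub>rand</sub> = 0.2 and L<sub>rand</sub> = 2.0 gives SW = 3/1.1, which is greater than 1.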

#### Local Properties of the Network

The topology of the networks was further investigated by computing the degree index for each scalp electrode to characterize the (local) level of in- and out- information flows exchanged within the network.

#### **Degree**

The degree of a node is the number of connections involving it. As such the degree is the simplest index identifying hubs in graphs. In directed networks, the indegree is the number of inward links and the outdegree is the number of outward links (Sporns et al., 2004). It can be defined as follows:

$$k_f = \sum_{j \in N, j \neq f} g_{fj} + \sum_{i \in N, i \neq f} g_{if} \tag{5}$$

where k<sub>f</sub> is the degree of node f and gij represents the entry ij of the adjacency matrix G. The degree of a specific electrode was normalized with respect to the network density, in order to capture local changes and not a general increase/decrease of the network density.
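Equation 5 and the density normalization can be sketched as follows. The sketch assumes that entry (i, j) of the adjacency matrix denotes a link from node i to node j (the total degree does not depend on this choice) and that normalization means dividing the degree by the connection density; neither detail is stated explicitly in the text. Function names are hypothetical.

```python
def degrees(adjacency):
    """Return (in_degree, out_degree, total_degree) lists for a binary directed graph."""
    n = len(adjacency)
    out_degree = [sum(adjacency[f][j] for j in range(n) if j != f) for f in range(n)]
    in_degree = [sum(adjacency[i][f] for i in range(n) if i != f) for f in range(n)]
    total = [o + i for o, i in zip(out_degree, in_degree)]
    return in_degree, out_degree, total


def density(adjacency):
    """Fraction of existing links among the n * (n - 1) possible directed links."""
    n = len(adjacency)
    links = sum(adjacency[i][j] for i in range(n) for j in range(n) if i != j)
    return links / (n * (n - 1))


def density_normalized_degree(adjacency):
    """Total degree of each node divided by the network density."""
    d = density(adjacency)
    _, _, total = degrees(adjacency)
    return [k / d if d > 0 else 0.0 for k in total]
```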

All the extracted global and local indices were subjected to a two-way ANOVA with memory phases (PHASES: Encoding, Storage, Retrieval) and digits number (DIGITS: 4, 6) as main within-subject factors. Duncan's post-hoc test was used to verify differences between the ANOVA levels. FDR was further applied to correct for multiple ANOVAs. Furthermore, to explore the relationship between the indices extracted for each memory phase and the relative behavioral data (correct answers rate, reaction time) a Pearson correlation analysis was performed. FDR was applied to reduce type I errors due to multiple correlations.
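The correlation step can be sketched with a plain implementation of the Pearson coefficient; the function name is hypothetical and the significance test and FDR correction are not shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

In this setting, `x` would hold one graph index per subject (for a given band and memory phase) and `y` the corresponding behavioral measure, so that a negative coefficient indicates that higher index values go with shorter reaction times.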

#### RESULTS

#### Behavioral Results

The overall behavioral data obtained from each subject are reported in **Table 2**. All the participants showed a percentage of correct answers above 80% (except for subject 5 in 6 Digits who was removed from the analysis) and reaction times (RTs) ranging between 250 and 700 ms for the 4 SIRT conditions. The variability ranges observed for the two behavioral parameters are in agreement with the literature and comparable with those reported in other studies (Sternberg, 1966; Cummins and Finnigan, 2007; Tuladhar et al., 2007).

The two-way ANOVA revealed that the percentage of correct answers significantly decreased (94 ± 3 to 87 ± 5%) when the subjects were challenged with the 6 digits condition with respect to the 4 digits condition [main factor DIGITS p = 0.00007, F(1, 15) = 28.15, MSE = 672.8]. The RTs were also significantly longer in the 6 digits condition with respect to the 4 digits condition [main factor DIGITS p = 0.021, F(1, 15) = 6.54, MSE = 13336]. Furthermore, the NoTarget condition was associated with significantly longer RTs as compared to those obtained during the Target condition (470 ± 126 vs. 500 ± 130 ms) in the 6 digits case [main factor TYPE p = 0.013, F(1, 15) = 7.83, MSE = 12269]. Subject 5 was excluded from the EEG analysis because of low accuracy performance (30% error rate).

#### General Properties of the Networks

The results of the two-way ANOVA for the Local Efficiency (LE), Global efficiency (GE), and Small-Worldness (SW) indices with respect to memory phases and WM load are reported in **Table 3** for the five frequency bands.

As shown in **Figure 2**, the LE index mean value (n = 16) relative to the alpha band was significantly higher in Encoding with respect to both Storage and Retrieval phases (**Figure 2A**). An opposite trend was observed for the GE index, which was significantly higher in Retrieval as compared to both Encoding and Storage (**Figure 2B**). Finally, the SW index (**Figure 2C**) was significantly higher in Encoding as compared with Storage and Retrieval. Similar significant results were found for the three indices in the other frequency bands (**Table 3**).

We found significant differences between 4 and 6 digits conditions for the LE and the SW indices (**Figures 2A,C**). In particular, the LE and SW showed significantly higher values for the 6 with respect to 4 digits only during Encoding in alpha (**Figures 2A,C**). Similar results were found in gamma band (**Table 3**). No significant differences between 4 and 6 digits were found for the GE.

Furthermore, the LE index computed for alpha band and relative to the Encoding phase negatively correlated with RTs obtained from both 4 (r = −0.7026, p = 0.0024) and 6 (r = −0.7048, p = 0.0023) digits cases.

#### Local Properties of the Networks

The degree index was computed for each electrode and each subject and then averaged within the experimental group for the three PHASES and the two DIGIT conditions [Grand Average (GA) Degree Maps]. A spatial representation of such index is reported in the topographical maps of **Figure 3** for 4 digits (**Figure 3A**) and 6 digits (**Figure 3B**) cases.

The visual inspection of the GA Degree Maps relative to the 4 digits condition revealed that the three WM phases were associated with distinct connectivity networks for each frequency band oscillation (**Figure 3A**). During the Encoding, we observed a connectivity pattern which mainly included (high degree index) the central midline, the bilateral frontal areas and the bilateral posterior areas in the delta and theta frequency bands. In the alpha band, such patterns were mainly represented over the frontal midline, the left frontal areas and the right hemisphere from frontal to parietal areas. In beta and gamma oscillation ranges, the patterns were prevalent over the bilateral frontotemporal areas.

Storage was consistently associated with a high involvement (high degree) of the bilateral fronto-temporal areas, the frontal midline and the right posterior areas in the delta, theta and alpha bands. Bilateral fronto-temporal areas, left temporo-parietal areas and right central areas played an important role in the beta band. In the gamma band, we found an involvement of bilateral fronto-temporal areas and the frontal midline.

The Retrieval phase showed a connectivity pattern mainly involving (high degree) the frontal-central midline, left fronto-temporo-parietal areas, right frontal areas and occipital areas in the delta, theta, and alpha bands. In the beta band, we found a high involvement of bilateral fronto-temporo-parietal areas and the parieto-occipital midline. Bilateral fronto-temporal areas and central areas played an important role in the gamma band.

TABLE 2 | Mean values of the percentage of correct answers and relative reaction time (RTs) obtained from each participant.

Missing answers (RT = 0) were excluded.

The averaged patterns obtained for the 6 digits condition are illustrated in **Figure 3B**. The qualitative (visual inspection) analysis of 6 digits condition revealed a general superimposition with the areas mainly involved in the 4 digits condition.

On the basis of these findings (**Figure 3**), we selected eight scalp areas (macro-areas) symmetrically distributed over the left and right sides and computed the respective average degree index. The following macro-areas were considered: Left Frontal (Fp1, AF7, F7), Frontal Midline (AFz, Fz, FCz), Right Frontal (Fp2, AF8, F8), Left Temporal (FT7, T7, TP7), Right Temporal (FT8, T8, TP8), Left Parietal (PO7, O1), Right Parietal (PO8, O2), Occipital (Oz).
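The macro-area averaging can be sketched as a mapping from area names to electrode labels, as listed in the text, followed by a mean over the electrodes of each area. The function name and the input layout (a dictionary from electrode label to degree value) are hypothetical.

```python
# Electrode groups as listed in the text.
MACRO_AREAS = {
    "Left Frontal": ["Fp1", "AF7", "F7"],
    "Frontal Midline": ["AFz", "Fz", "FCz"],
    "Right Frontal": ["Fp2", "AF8", "F8"],
    "Left Temporal": ["FT7", "T7", "TP7"],
    "Right Temporal": ["FT8", "T8", "TP8"],
    "Left Parietal": ["PO7", "O1"],
    "Right Parietal": ["PO8", "O2"],
    "Occipital": ["Oz"],
}


def macro_area_degree(degree_by_electrode):
    """Average the degree index over the electrodes of each macro-area."""
    return {
        area: sum(degree_by_electrode[e] for e in electrodes) / len(electrodes)
        for area, electrodes in MACRO_AREAS.items()
    }
```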

The results of the two-way ANOVA on the degree index with respect to the memory phases and WM load are reported in **Table 4** for each macro-area and frequency band. The schematic representation of **Figure 4** summarizes the trends obtained for the macro-area degree across the three memory phases in the five frequency bands (irrespective of the DIGIT factor). In particular, the areas distinctive for Encoding were the bilateral frontal areas in the delta band, the right frontal and right parietal areas in the theta band, the right parietal area in the alpha band, and the left and right frontal and right temporal areas in the beta and gamma bands. Storage was instead characterized by the right frontal area in both delta and theta bands and the frontal midline in the gamma band. Retrieval involved the occipital area in the alpha band, the left parietal and occipital areas in the beta band, and the frontal midline, left parietal and occipital areas in the gamma band.

### DISCUSSION

This study applied a graph theory-driven approach to complex causality patterns derived from EEG recordings with the aim of identifying distinct topological properties of the neural networks associated with the encoding, storage, and retrieval phases of WM as elicited during a visual SIRT performed by healthy, middle-aged adults. We found that, during the encoding phase, the global network exhibited a small-world topology (in all frequency bands), thus indicating a network configuration accounting for both global information transfer and local processing. The requirement of such an optimal configuration specifically for item encoding appeared further corroborated by the negative correlation between local efficiency and reaction times. The small-world configuration of the whole network persisted across maintenance and rehearsal of encoded items but it showed a descending trend. At the local scale, the degree of information flow between scalp regions was selective for each of the three different WM phases, in that network hubs were consistent with the different role of brain regions in different WM phases. Overall, these EEG findings provide evidence that the complexity of functional networks underpinning the model of multicomponent WM (Baddeley, 2012; Eriksson et al., 2015) may be represented by synthetic EEG indices which preserve the selectivity of the dynamics occurring during encoding, maintenance and rehearsal of memory items.

TABLE 3 | Results of two-way repeated measures ANOVA on global indices (F-values, \*\*p < 0.001, \*p < 0.05).

FDR correction for multiple ANOVAs was applied. Significant results are highlighted in bold.

### Behavioral Results

The behavioral results obtained from our sample of healthy, middle age adults are in line with what was reported by previous studies conducted in healthy adults and where the SIRT was applied to investigate working memory (WM) processes (Sternberg, 1966; Cummins and Finnigan, 2007; Tuladhar et al., 2007). As expected, WM loads (4, 6 digits conditions) had a significant effect on the response time and accuracy for both Target and NoTarget probes in our sampled population. These WM load-related behavioral changes have been previously ascribed to a serial scanning of memorized elements required in order to recall the memorized material (Majerus et al., 2006).

### Global Organization of the WM Network

The complex human brain networks have been found to have a "small-world" topology which is characterized by a high local specialization and a high global integration to sustain a high efficiency at a low wiring cost (Sporns, 2013a).

The significant effect of the phase factor on all the global indices (global and local efficiency, small-worldness; **Table 3**) indicated that a small-world topology of the networks was present in all three WM phases, with the highest value associated with encoding with respect to storage and retrieval (**Figure 2**; in alpha band). Such a descending trend was evident from lower to higher frequency oscillations (**Table 3**).

The finding of a descending trend in network small-worldness in all frequency bands may reflect a general network tendency to reduce local segregation (see also local efficiency decrease) in favor of global integration (see global efficiency increase) of the information exchange between/within the different brain regions as cognitive processing evolves from encoding to retrieval. Such dynamics in the topological rearrangement is consistent with the recently released global workspace theory (Baars and Franklin, 2003; Baars et al., 2013) postulating that the network structure reorganizes across the temporal evolution of WM cognitive processing (Bola and Sabel, 2015). As such, the dynamics observed in the EEG-derived network(s) would reinforce the assumption at the base of the multi-store model revised by Baddeley (2012) that the several cognitive components (i.e., central executive, episodic buffer(s), phonological loop and visuo-spatial sketchpad) are not "crystallized" but have a tendency to be "fluid", as do the capacities they sub-serve (attention, temporary storage, ...; Baddeley, 2012). In particular, the dynamics observed for the network topology might reflect the operational mode of the "episodic buffer" component of the Baddeley model (Baddeley, 2010). This "buffer" serves as an intermediary between the storage subsystems with different codes (i.e., phonological loop and visuo-spatial sketchpad) whose content is bound by the buffer into unitary multi-dimensional representations. Thus, one can speculate that a tendency toward a more global vs. local integration network topology (i.e., the decrease of small-worldness across WM phases) would "optimally" serve the function of the episodic buffer by favoring the information flow between WM networks (i.e., the 2 storage subsystems).

FIGURE (caption fragment) | ... relative to Encoding, Storage and Retrieval phases. The asterisk indicates a significant difference (Duncan's post-hoc; p < 0.05).

TABLE 4 | Results of two-way repeated measures ANOVA on local indices (F-values, \*\*p < 0.001, \*p < 0.01).

FDR correction for multiple ANOVAs was applied. Significant results are highlighted in bold.

The dynamic changes toward a more globally integrated network with less specialized segregation could also lend support to more recent WM models (for a review see Eriksson et al., 2015). In this recent reappraisal of WM functioning, the content of working memory would be defined by an interaction between selective perceptual (visual, auditory, etc.) information processing (operated via selective attention) and LTM representations in a particular state of "accessibility" that requires largely persistent activity of specialized networks controlled by attentional processes. Thus, a whole-brain network with high global information transfer (integration) would better "sustain" an optimal interplay between locally specialized networks (see below, local organization of WM subnetworks).

The encoding process directly influences the precision and accuracy of subsequent WM representations (Awh and Vogel, 2008; Rutman et al., 2010). The well-known limitation in the capacity to simultaneously encode objects requires efficient mechanisms that select only the most relevant objects from the immediate environment to be represented in WM and prevent irrelevant ones from consuming capacity (Vogel et al., 2005; Chun, 2011). In this regard, several lines of evidence indicate that successful encoding of information into WM is the result of an interplay between brain circuits underlying selective attention processes and perceptual (e.g., visual) object representation (for a review see Gazzaley and Nobre, 2012) that, in a more recent view, would trigger LTM object representation (Eriksson et al., 2015). A small-world topology could well account for this complex network interplay by supporting both specialized and integrated information processing in the whole brain connectivity. It comprises both highly segregated (or modular) processing (high clustering) and distributed (or integrated) processing (short path length; Bassett and Bullmore, 2006). The observed high small-worldness of the network during WM encoding (with respect to maintenance and retrieval) would fit with the necessity of the brain to combine the functioning of specialized (segregated) modules with a number of inter-modular links integrating those modules.
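The two ingredients of a small-world topology mentioned above (high clustering, short path length) can be illustrated with a minimal sketch. It assumes a binary, undirected graph stored as a dictionary of neighbour sets; the function names, the `ring_lattice` helper and the ratio-based index `small_worldness` are illustrative assumptions and are not necessarily the estimators applied to the EEG-derived networks in this study.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Hypothetical toy graph: n nodes on a ring, each linked to k neighbours per side."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}


def clustering_coefficient(adj):
    """Mean local clustering coefficient (segregation) of an undirected graph."""
    values = []
    for node, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            values.append(0.0)  # convention: no triangle is possible
            continue
        # Count links that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(neigh, 2) if b in adj[a])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)


def characteristic_path_length(adj):
    """Mean shortest path length (integration) over connected node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")


def small_worldness(c, l, c_rand, l_rand):
    """Ratio-based index: clustering and path length normalised by a random reference."""
    return (c / c_rand) / (l / l_rand)
```

In this formulation, values of the index above 1 indicate more clustering than a comparable random graph at a similar path length; the random reference values `c_rand` and `l_rand` have to be supplied by the caller.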

To further corroborate this interpretation, it is worth noting that a disruption of an optimal small-world network organization has been described in schizophrenic patients (Fornito et al., 2012), who exhibited an impairment of WM performance accounted for by a decreased efficiency in item encoding (Cairo et al., 2006; Koch et al., 2009). Moreover, altered oscillatory dynamics during encoding of information have been reported in normal and pathological aging associated with

cognitive decline (Aine et al., 2011; Kirova et al., 2015; Proskovec et al., 2016).

We also found that the indices defining an optimal network topology (local efficiency and small-worldness) increased significantly as WM load increased (4 vs. 6 digits), and only during encoding. This WM load-induced modulation of network topology reinforces the above interpretation of a high network modularity required during encoding. Recent neurophysiological evidence supports the idea that visual WM capacity limitation (i.e., the so-called set size effect; Luck and Vogel, 1997) begins with neural attention resource allocation at encoding (Gurariy et al., 2016). Of note, the WM load-induced increase in the values of the network global indices was selectively observed in the alpha and gamma frequency bands. Alpha oscillations have been hypothesized to play an active role in protecting WM items from non-relevant information (Jensen and Mazaheri, 2010) by suppressing distracting sensory information (Romei et al., 2010). More specifically, the increase of WM load is associated with an increase of alpha-band coherence between midline parietal and left temporal/parietal sites during encoding (Payne and Kounios, 2009). From the behavioral viewpoint, we found that the network local efficiency estimated in the alpha band during encoding varied as a function of the RTs (negative correlation; **Figure 3**). The existence of such a correlation exclusively in the alpha band is in accordance with previous evidence of a correlation between changes in α-power spectrum and behavioral performance during encoding (Klimesch, 1999; Bashivan et al., 2014). The increase in the values of the indices describing the global network organization during encoding was also selective for the gamma frequency. As such, this finding is in line with evidence of a direct correlation between the changes in the gamma power spectrum amplitude and the number of items to be memorized (Howard et al., 2003; Roux et al., 2012; Roux and Uhlhaas, 2014).

### Local Organization of the Working Memory Networks

The computation of local degree maps (GA illustrated in **Figure 3**) allowed for the identification of hubs within the WM network(s) activated across the three phases elicited by the visual SIRT. As expected, the encoding, storage and retrieval WM phases were consistently characterized by a main involvement of bilateral frontal and temporal regions in all frequency bands, while an anterior-to-posterior midline pattern was prevalent in the low (delta/theta) EEG frequency range. In addition, a bilateral parieto-occipital connectivity was observed mainly in theta/delta oscillations during encoding, while the storage/retrieval phases were characterized by a prevalent left temporo-parietal and right fronto-parietal connectivity in alpha/beta bands. These patterns were sensitive to WM load increase.

Consistent with the view of WM as emerging from the dynamic interplay of several brain regions, recent evidence indicates that the integrity of white matter pathways connecting the dorsolateral frontal cortex, parietal cortex, and temporal cortex correlates with working-memory performance (Charlton et al., 2010). Within this large network of areas, the prefrontal cortex has been suggested to be crucial for executive demands such as the maintenance of resilient information during WM, updating WM content, and shifting (Nee et al., 2013). Together with the prefrontal areas, the parietal cortex is also causally involved in WM functioning, being associated with executive aspects (superior parietal cortex) of WM (Collette et al., 2005) and the implementation of selective attentional control (Awh et al., 2006). Interestingly, parietal cortex activity correlates with WM capacity in that its activity increases as the number of items to remember increases (Vogel and Machizawa, 2004). According to computational modeling (O'Reilly, 2006), the basal ganglia (striatum) would play a key role in controlling (filtering) when the prefrontal cortex representations should be maintained vs. updated. Moreover, the above-mentioned parietal load effect correlates negatively with basal ganglia activity (McNab and Klingberg, 2008).

Our connectivity patterns expressed as locally distributed hubs of information flow between scalp regions (i.e., local degree index) well reflect the main interpretational mapping of WM processes to brain regions, thus highlighting the accuracy of our EEG network estimation approach in providing indices which can specifically describe the distributed topography of the networks involved in WM task solving. The spectral features of the estimated local topological indices further corroborate their selectivity in capturing the (local) functional dynamics underlying WM processing.

Several lines of evidence suggest that an interplay between rhythmic activity at low (delta/theta) and high (beta/gamma) frequencies enables WM item encoding and maintenance in humans (for a review see Roux and Uhlhaas, 2014). In particular, the gamma band would be relevant for active maintenance of WM information, whereas the theta band would be involved in the temporal organization of WM items. The relevance of alpha oscillations would reside in filtering task-irrelevant information.

As schematically illustrated in **Figure 4**, we actually found that Encoding networks were mainly described by hubs (encoding related local degree indices contrasted against those of storage and retrieval time series) located within bilateral frontal and right fronto-temporal scalp area in low (delta/theta) and in the high frequency oscillation range (beta/gamma), respectively.

As discussed above, gamma band activity plays a role in the maintenance of visual (and other sensory) WM items (Tallon-Baudry et al., 1998; Kaiser et al., 2008), which spatially occurs within the prefrontal cortex in association with the parietal cortex (Eriksson et al., 2015). In addition, EEG/MEG source localization studies pointed out that gamma oscillatory activity changes (increased power) are mainly localized over frontal (and parietal) regions (Palva and Palva, 2012). The exclusive involvement of bilateral frontal areas for delta activity-related hubs would be consistent with the role of sustained delta activity in inhibiting interferences that might affect task performance (Harmony, 2013; Kleen et al., 2016).

Finally, the observed lateralization of network hubs toward the right parietal area in alpha/theta during encoding might be related to spatial WM (Owen et al., 2005) elicited by a visual SIRT. Contemporary views suggest that activity in the alpha/theta frequency range reflects the allocation of spatial attention to the memoranda, as well as the suppression of distracting information (Corbetta and Shulman, 2002; Asplund et al., 2010; Klimesch, 2012).

The storage hub map partially overlaps that of the encoding network, as it involved the (right) frontal area in the delta/theta frequency range. Although the correspondence in neural activity between encoding and maintenance still remains debatable (Gazzaley et al., 2004; Woodward et al., 2006; Chang et al., 2007), recent work by Cohen et al. (2014) provides empirical evidence for an overlap between encoding and maintenance processes as a critical element of WM (Postle, 2006; D'Esposito, 2007). A gamma-related frontal midline hub was also observed (in common with retrieval), which could reflect the interplay between subcortical structures (i.e., basal ganglia) and the (pre)frontal cortex responsible for maintenance of relevant WM items (Kaiser et al., 2009; Roux et al., 2012).

During retrieval we observed a selective distribution of network hubs within the occipital area in alpha, beta and gamma bands as well as within the left parietal region in high frequency oscillations (beta/gamma band). Such parieto-occipital engagement could account for visual stimulus presentation and visual information processing during retrieval (Voytek et al., 2010). Moreover, neuronal synchronization in the gamma band over occipital areas has been associated with subject ability during encoding and retrieval memory phases (Osipova et al., 2006).

As expected, local degree indices varied as a function of WM load. Specifically, the left temporal degree increased as a function of WM load in the alpha band. This finding is consistent with a role of the (left) temporal region in sub-lexical phonological processing of visual material (Howard et al., 1992; Price, 1998). During Sternberg tasks, sequential encoding would activate the phonological loop to support the maintenance of sequenced WM items by means of subvocal rehearsal (silent speech; Henson et al., 2000; Barry et al., 2011). We also found that frontal and parietal local degree indices were sensitive to WM load in gamma oscillations, which reflects both the direct correlation between gamma power magnitude and the number of items to be memorized and the role of sustained activity in frontal and parietal areas for maintenance of memoranda (Howard et al., 2003; Roux et al., 2012; Roux and Uhlhaas, 2014).

The present EEG-derived network findings still await further consolidation, which requires a large sample covering different age groups and a network model testing procedure that challenges the distributed network hub mapping both internally (e.g., a leave-one-out approach to verify the fault tolerance of the network to the removal of one node) and/or externally (e.g., with non-invasive techniques that induce virtual lesions, such as rTMS).
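The internal, leave-one-out check mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the graph is binary and undirected, fault tolerance is quantified as the drop in global efficiency after removing a node, and all function names are hypothetical rather than part of the pipeline used in this study.

```python
from collections import deque


def global_efficiency(adj):
    """Mean inverse shortest path length over all ordered node pairs."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable nodes contribute zero efficiency.
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def remove_node(adj, node):
    """Copy of the graph without the given node and its links."""
    return {u: {v for v in neigh if v != node}
            for u, neigh in adj.items() if u != node}


def node_removal_impact(adj):
    """Drop in global efficiency caused by removing each node in turn."""
    baseline = global_efficiency(adj)
    return {node: baseline - global_efficiency(remove_node(adj, node))
            for node in adj}
```

Under this measure, a node whose removal produces a large drop behaves as a hub, whereas a small or negative value suggests that the network tolerates the loss of that node.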

Upon consolidation, our EEG-derived network estimation approach may, on the one hand, break new ground in the theoretical modeling of WM function and, on the other, offer a valuable and affordable method to improve clinical assessment and evaluate treatment efficacy in cognitive disorders occurring after brain lesions (e.g., stroke).

### AUTHOR CONTRIBUTIONS

JT: EEG experimental data analysis management; manuscript writing; LA: experimental design definition, EEG derived brain network data analysis supervision and validation; MR: interpretation of behavioral data; AA: implementation of methodological pipeline (connectivity estimation and graph theory approach); SK: EEG and behavioral data collection; GW: experimental design definition and supervision of data collection; DM: responsible for study; study design and management; overall data interpretation; manuscript writing management.

### FUNDING

Research partially supported by the European ICT Program FP7-ICT-2009-4 Grant Agreement 287320 CONTRAST and by Project FIRB 2013 (Fondo per gli investimenti della Ricerca di Base—Futuro in Ricerca)—RBFR136E24, by Sapienza University of Rome–Progetti di Ateneo 2016 (MIME-BCI, PI1161550696379A) and 2017 (EMBRACING, RM11715C82606455).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer BM and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2018 Toppi, Astolfi, Risetti, Anzolin, Kober, Wood and Mattia. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# On the Relationship Between Attention Processing and P300-Based Brain Computer Interface Control in Amyotrophic Lateral Sclerosis

#### Angela Riccio<sup>1</sup>\*, Francesca Schettini<sup>2</sup>, Luca Simione<sup>3</sup>, Alessia Pizzimenti<sup>4</sup>, Maurizio Inghilleri<sup>5</sup>, Marta Olivetti-Belardinelli<sup>6,7</sup>, Donatella Mattia<sup>1</sup> and Febo Cincotti<sup>1,8</sup>

<sup>1</sup>Neuroelectrical Imaging and BCI Laboratory, NeiLab, Fondazione Santa Lucia (IRCCS), Rome, Italy, <sup>2</sup>Servizio Ausilioteca per Riabilitazione Assistita con Tecnologia (SARA-t), Fondazione Santa Lucia (IRCCS), Rome, Italy, <sup>3</sup>Institute of Cognitive Sciences and Technologies, Consiglio Nazionale delle Ricerche (CNR), Rome, Italy, <sup>4</sup>Crossing Dialogues, Rome, Italy, <sup>5</sup>Department of Neurology and Psychiatry, Sapienza University of Rome, Rome, Italy, <sup>6</sup>Centro Interuniversitario di Ricerca sull'Elaborazione Cognitiva in Sistemi Naturali e Artificiali (ECoNA), Rome, Italy, <sup>7</sup>ECONA Interuniversity Centre for Research on Natural and Artificial Systems, Sapienza University of Rome, Rome, Italy, <sup>8</sup>Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome, Rome, Italy

#### Edited by:

Fabien Lotte, Institut National de Recherche en Informatique et en Automatique (INRIA), France

#### Reviewed by:

Quentin Noirhomme, Maastricht University, Netherlands; Dennis J. McFarland, Wadsworth Center, United States; Jérémie Mattout, Lyon Neuroscience Research Center, France

\*Correspondence:

Angela Riccio a.riccio@hsantalucia.it

Received: 10 September 2017; Accepted: 09 April 2018; Published: 28 May 2018

#### Citation:

Riccio A, Schettini F, Simione L, Pizzimenti A, Inghilleri M, Olivetti-Belardinelli M, Mattia D and Cincotti F (2018) On the Relationship Between Attention Processing and P300-Based Brain Computer Interface Control in Amyotrophic Lateral Sclerosis. Front. Hum. Neurosci. 12:165. doi: 10.3389/fnhum.2018.00165

Our objective was to investigate the capacity to control a P3-based brain-computer interface (BCI) device for communication and its related (temporal) attention processing in a sample of amyotrophic lateral sclerosis (ALS) patients with respect to healthy subjects. The ultimate goal was to corroborate the role of cognitive mechanisms in event-related potential (ERP)-based BCI control in ALS patients. Furthermore, the possible differences in such attentional mechanisms between the two groups were investigated in order to unveil possible alterations associated with the ALS condition. Thirteen ALS patients and 13 healthy volunteers matched for age and years of education underwent a P3-speller BCI task and a rapid serial visual presentation (RSVP) task. The RSVP task was performed by participants in order to screen their temporal pattern of attentional resource allocation, namely: (i) the temporal attentional filtering capacity (scored as T1%); and (ii) the capability to adequately update the attentive filter in the temporal dynamics of the attentional selection (scored as T2%). For the P3-speller BCI task, the online accuracy and information transfer rate (ITR) were obtained. Centroid Latency and Mean Amplitude of N200 and P300 were also obtained. No significant differences emerged between ALS patients and Controls with regard to online accuracy (p = 0.13). In contrast, the performance in controlling the P3-speller expressed as ITR values (calculated offline) was compromised in ALS patients (p < 0.05), with a delay in the latency of P3 when processing BCI stimuli as compared with the Control group (p < 0.01). Furthermore, the temporal aspect of attentional filtering, which was related to BCI control (r = 0.51; p < 0.05) and to the P3 wave amplitude (r = 0.63; p < 0.05), was also altered in ALS patients (p = 0.01). These findings ground the knowledge required to develop sensible classes of BCI specifically designed by taking into account the influence of the cognitive characteristics of the possible candidates in need of a BCI system for communication.

Keywords: brain-computer interface, amyotrophic lateral sclerosis, attention, event-related potentials, P300, BCI, ALS, EEG

## INTRODUCTION

The non-invasive brain-computer interface (BCI) based on the visual event-related potential (ERP) known as P300 (P3; Farwell and Donchin, 1988) is by far the most extensively investigated BCI system to enhance or even allow communication when the latter is severely compromised due to different neurological disorders (Kleih et al., 2011; Riccio et al., 2016). Despite the large number of studies seeking methods to ultimately optimize P3-based BCI control accuracy, a reliable and yet flexible (to be customized for individual users) BCI system still requires some research efforts.

Within the range of users in need of a BCI for communication and control, those with amyotrophic lateral sclerosis (ALS) represent the target population due to a progressive muscular paralysis that leads to a loss of communication and interaction ability thus preventing persons from using conventional assistive technologies (ATs) at the later stage of the disease.

A number of studies reported that ALS patients can communicate by using a P3-based BCI (Marchetti et al., 2013) with stable performance over time (Sellers and Donchin, 2006; Nijboer et al., 2008; Silvoni et al., 2013). Communication and interaction could also be enhanced by means of a P3-based BCI combined with an AT device (Thompson et al., 2014; Schettini et al., 2015). Marchetti and Priftis (2015) reported the results of a meta-analysis (pooled studies from 2008 to 2013) indicating that the P3-speller (Farwell and Donchin, 1988) in ALS patients reached an overall classification accuracy of 73.72%. Further studies aimed at investigating predictors of P3-based BCI control in ALS patients showed that both external (the stimuli exploited) and internal factors (the user's motivation) could account for the BCI performance (Nijboer et al., 2010; Townsend et al., 2010; Kaufmann et al., 2013b). Conversely, the relation between BCI performance and the clinical-functional status of ALS patients (Nijboer et al., 2008; Silvoni et al., 2013; McCane et al., 2014; Thompson et al., 2014) has not been fully investigated yet. McCane et al. (2015) reported no significant differences in BCI accuracy between ALS patients and healthy age- (but not years of education-) matched subjects. However, they found differences in the target-related ERP characteristics: the ALS group presented a higher N2 wave peak amplitude and a latency delay in N2, P3 and late negativity (LN) with respect to the control group. Overall, studies comparing ALS patients with healthy participants are still scarce. Further investigations are needed on the possible impairment/alteration of brain processing in response to external inputs (such as visual stimuli) delivered within a BCI framework of stimulation to eventually unveil whether and how they could influence the BCI control.

In this regard, it is important to note that cognitive deficits have been described in ALS patients (Lomen-Hoerth et al., 2003; Ringholz et al., 2005; Christidi et al., 2012; Strutt et al., 2012; Volpato et al., 2016; Radakovic et al., 2017). Up to now, the influence of ALS patients' cognitive profile on the visual P3-based BCI control has not been fully investigated. Studies on visual P3-based BCIs for communication have often involved end-users with severe motor disabilities due to neurological disorders of various etiology (Piccione et al., 2006; Zickler et al., 2011; Kaufmann et al., 2013b; Kübler et al., 2014; Riccio et al., 2015), whereas other studies enrolled more homogeneous groups of participants, such as ALS patients only (Sellers and Donchin, 2006; Nijboer et al., 2008; Riccio et al., 2013; Silvoni et al., 2013; Thompson et al., 2014; Schettini et al., 2015). This inconsistency between studies does not allow for definitive inferences on how P3-based BCI control could vary in severely motor disabled end-users with ALS and how this variability could be related to cognitive processing. Along this line of reasoning, we previously showed (Riccio et al., 2013) how some aspects of attention processing such as the stimulus temporal filtering (i.e., the ability to keep the attentional filter active during the selection of a target) would be a predictor of the P3-speller control accuracy in ALS patients. Since we did not include a control (healthy) group, we could only speculate that such a temporal aspect of attention processing was impaired in the ALS population by comparing our results with those reported in other studies which included healthy participants (Kranczioch et al., 2005; Georgiou-Karistianis et al., 2007).

In the present study, we investigated whether the accuracy in mastering a P3-based BCI by an ALS population sample would be affected by the previously identified alterations in attention processing and whether these alterations would be exclusive to the ALS population. To this purpose, we compared a group of ALS patients with a group of healthy volunteers, both controlling a P3-speller (Farwell and Donchin, 1988). Groups were matched for age and years of education since both factors are known to deeply influence performance in executing cognitive tasks (Ardila et al., 2000). Moreover, the two groups underwent an identical BCI stimulation protocol (i.e., the sequences of stimuli were not customized) in order to avoid confounding factors due to the use of different stimulation protocols. Based on previous findings (Riccio et al., 2013), our present hypothesis was that the ALS patients would show an altered visual attention processing of the stimuli delivered during the P3-based BCI control, and this would, in turn, affect the ability to control the P3-based BCI (i.e., decrease in performance). We also investigated the possible relation between cognitive mechanisms and P3-speller control to further corroborate the role of cognitive dysfunctions in BCI control in ALS patients (Riccio et al., 2013).

### MATERIALS AND METHODS

#### Participants and Baseline Assessment

Thirteen participants (8 males; mean age 62.2 ± 13; years of formal education 13.7 ± 5.1) with ALS diagnosis (ALS group) and 13 age and years of education-matched participants (9 males; mean age 55.3 ± 9; years of formal education 13.3 ± 3) with no history of neurological/psychiatric disorders (Control group) were enrolled in the study. Seven out of 13 ALS patients participated in the previous study (Riccio et al., 2013).

The ALS patients were recruited through the ALS Center of the Policlinico "Umberto I", Sapienza University, Rome. The study was conducted at Fondazione Santa Lucia, IRCCS, Rome and approved by the Independent Ethics Committee of Fondazione Santa Lucia. All participants (or the legal representatives of ALS patients when required) provided written informed consent.

The inclusion criterion for the ALS patients was the ability (also with the help of an AT device if required) to clearly communicate (at least) a binary response (yes/no). Patients with other concomitant neurological or psychiatric disorders, any impediment in the acquisition of electroencephalography (EEG) data from the scalp (e.g., wounds, dermatitis), severe concomitant pathologies (fever, infections, metabolic disorders, severe heart failure), or episodes of reflex epilepsy were excluded from the study.

The level of physical disability was assessed by means of the "ALS Functional Rating Scale-Revised" (ALSFRS-R; Cedarbaum et al., 1999). ALSFRS-R scores range from 0 to 48 (the higher the score, the higher the functionality). The mean ALSFRS-R score was 31.2 ± 10.4 (range from 12 to 41). The ALS patients' demographic and clinical information is reported in **Table 1**.

Participants underwent a cognitive assessment focused on attention domains, in order to obtain individual baseline profiles. Two clinical neuropsychological tests were applied for the cognitive screening. The computerized test for attentional performance (TAP; Zimmermann and Fimm, 1995) was used to assess selective attention (SA) and working memory (WM), whereas the executive functions (EF) were assessed by means of the perseverative response scores obtained in the Wisconsin Card Sorting Test (WCST; Berg, 1948). Among its several subtests, the TAP includes a go-nogo task for SA (participants had to press a key when two target items were presented, while ignoring three distracter items) and a 2-back task for WM (numbers were presented on the screen and participants had to indicate the repetition of a number within an interval of three numbers by pressing a key). The WCST consists of a card sorting game according to either color, shape or number. The sorting rule changes over time. Participants then have to rely on the outcome or feedback after each of their choices in order to infer the new rule in effect. Eight of the 13 ALS patients completed the full protocol (psychological session and BCI session); the remaining five performed only the BCI session (one patient participated in the earlier study; Riccio et al., 2013). All Control subjects (n = 13) had both the psychological session and the BCI session.

#### Experimental Session

The experimental design consisted of two separate sessions (performed on two different days): the BCI session and the psychological session (see below for details).

#### BCI Session

Scalp potentials were acquired by means of a 16-channel amplifier (g.MOBILAB, g.tec, Austria) from eight active electrodes (g.Ladybird, g.tec, Austria) placed according to the 10–10 international standard (Fz, Cz, Pz, Oz, P3, P4, PO7, and PO8; right ear lobe reference, left mastoid ground). This experimental choice was dictated by a reasonable trade-off between an experimental procedure that would not exhaust ALS patients and a widely accepted eight-electrode configuration ensuring successful P300-based BCI control. Signals were digitized at 256 Hz. Stimulus paradigm and online delivery were managed by means of the BCI2000 framework (Schalk et al., 2004). A P3-speller (Farwell and Donchin, 1988) interface (6 by 6 matrix of alphanumeric items) was displayed full screen on a 15-inch computer screen, placed approximately at eye level and at a distance of 100 cm from the participant.

During the calibration phase (i.e., no feedback on performance), the subjects had to focus on 15 items forming three predefined words (3 runs; 5 items for each run). The target to focus on was shown to the participants by a single flash, after which rows and columns were randomly intensified for 125 ms, with an inter stimulus interval (ISI) of 125 ms. Participants were asked to mentally count how many times the target flashed. Calibration data were segmented into epochs lasting 800 ms (time 0 marked the stimulus onset) that were fed into a stepwise linear discriminant analysis (SWLDA) to determine the classifier coefficients (Krusienski et al., 2006) to be applied in the online BCI session. During the online phase (i.e., provision of feedback on performance), participants had to spell four


Demographic and clinical characteristics of the ALS group and Control group (means ± standard deviations, range). Abbreviations: F, female; M, male; SA, selective attention; EF, executive functions; WM, working memory; S, spinal; B, bulbar.

predefined (copy mode) words (4 runs; 5 items for each run; 20 characters in total). The classifier coefficients extracted from calibration data were applied to epochs grouped by stimulation classes (rows and columns) and averaged over stimulation sequences. The spelled letter was identified as the intersection between the row and the column exhibiting the maximum of the sum of scored features (Krusienski et al., 2006). As mentioned in the Introduction section, the stimulation sequences were not customized for each patient and thus we performed a static data collection (Mainsah et al., 2015). Specifically, a single item (e.g., a letter) was intensified 20 times (10 sequences) before the next set of stimuli started. This stimulation sequence was kept fixed for each subject.
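The selection rule described above (average the epochs of each row and column, score them with the linear classifier, and take the intersection of the best row and best column) can be sketched as follows. This is a simplified illustration: the matrix layout `MATRIX`, the flat feature vectors and the function names are assumptions, and the weights would in practice be the SWLDA coefficients obtained from the calibration data.

```python
# Hypothetical 6 x 6 layout; the actual item arrangement is not given in the text.
MATRIX = [
    "ABCDEF",
    "GHIJKL",
    "MNOPQR",
    "STUVWX",
    "YZ1234",
    "56789_",
]


def average_epochs(epochs):
    """Average feature vectors (one per stimulation sequence) element-wise."""
    n = len(epochs)
    return [sum(column) / n for column in zip(*epochs)]


def score(features, weights):
    """Linear classifier output: weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))


def select_item(row_epochs, col_epochs, weights, matrix=MATRIX):
    """Pick the item at the intersection of the best-scoring row and column.

    row_epochs[i] and col_epochs[j] hold the epochs recorded after each
    intensification of row i and column j, respectively.
    """
    row_scores = [score(average_epochs(e), weights) for e in row_epochs]
    col_scores = [score(average_epochs(e), weights) for e in col_epochs]
    best_row = max(range(len(row_scores)), key=row_scores.__getitem__)
    best_col = max(range(len(col_scores)), key=col_scores.__getitem__)
    return matrix[best_row][best_col]
```

Because the classifier is linear, scoring the averaged epoch gives the same ranking of rows and columns as summing the scores of the individual epochs.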

#### Psychological Session

The individual temporal pattern of attentional resource allocation was tested by means of a rapid serial visual presentation (RSVP) paradigm, which corresponds to the attentional blink (AB) paradigm used by Kranczioch et al. (2007). In brief, it consisted of two target stimuli (T1 and T2) which were embedded in a stream of 16 or 19 distracter stimuli; each stream of stimuli (the equivalent of a single trial) was presented in the center of a monitor (white background). All stimuli were capital letters (letters F, K, Q, X, Z were excluded) and were presented pseudo-randomly (with the constraint that the same letter was not presented within three sequential positions) at central fixation (1 stimulus/100 ms; presentation rate of 10 Hz). As for target stimuli, T1 was a green capital letter randomly occurring as the 4th, 5th, 6th or 7th item within a single stream. T2 was a black capital "X" which followed T1 on 80% of the trials according to four conditions (each occurring with a frequency of 20%): after no intervening distracters, or after one, three or five intervening distracters. In 20% of the trials, T2 was not presented (5th condition). The distracter stimuli were black capital consonants.

Upon the stimulus stream delivery, participants were asked to answer the following questions: (1) whether the green letter (T1) was a vowel (T1 was a vowel on 50% of the trials); and (2) whether the black X (T2) was contained in the stimulus stream. In the case of ALS patients, the answers to the questions were given according to their residual motor activity (e.g., verbal response, head movements, eye movements).

Twenty practice trials preceded a total of 160 experimental trials (32 trials for each of the five T2 conditions); these latter were fully randomized within two presentation blocks separated by a pause of 5 min.

### DATA ANALYSIS

All acquired data were preprocessed as follows. High-pass and low-pass filters (4th-order Butterworth) were applied with cut-off frequencies of 1 Hz and 20 Hz, respectively. EEG signals with peak amplitude higher than 70 µV or lower than −70 µV were removed. Data were then segmented into epochs (time 0 denoted the stimulus onset) lasting 800 ms and 1000 ms for the BCI and ERP analyses, respectively. Both target and non-target stimulus-related epochs were considered.
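The amplitude criterion and the segmentation step can be sketched as below. This is a minimal illustration rather than the authors' pipeline: it assumes a single, already filtered channel stored as a list of microvolt values, a sampling rate `sfreq` that is not stated in this section, and that the ±70 µV criterion is applied per epoch. The function name `extract_epochs` is hypothetical.

```python
def extract_epochs(signal, onsets, sfreq, duration_s, reject_uv=70.0):
    """Cut fixed-length epochs starting at each stimulus onset (time 0).

    signal     : samples in microvolts (one channel, already band-pass filtered)
    onsets     : sample indices of the stimulus onsets
    sfreq      : sampling rate in Hz (assumed; not given in the text)
    duration_s : epoch length in seconds (0.8 for BCI, 1.0 for ERP analysis)
    reject_uv  : epochs whose absolute peak exceeds this value are dropped
    """
    n_samples = int(round(duration_s * sfreq))
    epochs = []
    for onset in onsets:
        epoch = signal[onset:onset + n_samples]
        if len(epoch) < n_samples:
            continue  # incomplete epoch at the end of the recording
        if max(abs(v) for v in epoch) > reject_uv:
            continue  # peak amplitude outside the +/-70 uV window
        epochs.append(epoch)
    return epochs
```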

#### BCI Performance Analysis

The BCI online accuracy was expressed as the percentage of correct selections (i.e., the ratio between the number of correct selections and the total number of selections). Furthermore, an offline estimation of both the accuracy and the information transfer rate (ITR, Wolpaw et al., 2002) was performed in order to account for the online performance inter-subject variability that was ''hidden'' by the static modality of data collection.

To estimate the offline accuracy, a baseline correction was performed based on the mean amplitude of signal within the 200 ms pre-stimulus interval. The offline accuracy was then calculated for each stimulation sequence by means of a 7-fold cross-validation technique according to which six runs were used as training dataset to extract SWLDA classifier parameters and one run was used as testing dataset. A mean accuracy value for each stimulation sequence was obtained by averaging the values resulting from the seven iterations.
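The 7-fold scheme, in which each run serves once as the testing dataset, can be sketched as follows. The classifier itself is not implemented here: `evaluate` is a hypothetical callback standing in for training SWLDA on six runs and scoring the held-out run.

```python
def leave_one_run_out(n_runs=7):
    """Yield (train_runs, test_run) pairs so that each run is tested once."""
    for test_run in range(n_runs):
        train_runs = [r for r in range(n_runs) if r != test_run]
        yield train_runs, test_run


def cross_validated_accuracy(evaluate, n_runs=7):
    """Average the accuracy returned by evaluate(train_runs, test_run).

    evaluate is a placeholder for fitting a classifier on train_runs and
    returning its accuracy on test_run.
    """
    scores = [evaluate(train, test) for train, test in leave_one_run_out(n_runs)]
    return sum(scores) / len(scores)
```

Repeating the call once per number of stimulation sequences would give one mean accuracy value per sequence count, as described above.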

The ITR (bits/min) was estimated for each subject and each stimulation sequence based on the definition of bit-rate as in Wolpaw et al. (2002) and multiplying the bit-rate by the speed of selection (selections/minute). The individual highest ITR value was considered. The metric resulting from this computation is reported as ITR (0–800 ms).
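As a worked sketch of this computation, the bit-rate of Wolpaw et al. (2002) for N possible selections and accuracy P is B = log2(N) + P·log2(P) + (1 − P)·log2((1 − P)/(N − 1)), and the ITR is B multiplied by the number of selections per minute. The function names below are hypothetical, and the number of classes is left as a parameter because the matrix size is not restated in this section.

```python
import math


def bits_per_selection(accuracy, n_classes):
    """Wolpaw bit-rate (bits/selection) for accuracy P and N classes."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_classes < 2:
        raise ValueError("at least two classes are required")
    bits = math.log2(n_classes)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def itr_bits_per_min(accuracy, n_classes, selections_per_min):
    """ITR = bit-rate multiplied by the speed of selection."""
    return bits_per_selection(accuracy, n_classes) * selections_per_min


def best_itr(per_sequence, n_classes):
    """Highest ITR over (accuracy, selections_per_min) pairs, one pair per
    number of stimulation sequences."""
    return max(itr_bits_per_min(acc, n_classes, rate) for acc, rate in per_sequence)
```

Fewer stimulation sequences raise the selection speed but usually lower the accuracy, which is why the maximum over sequence counts is taken per subject.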

The relative contribution of the N2 and P3 ERP components to the BCI accuracy was investigated offline as follows. Differences in the amplitudes of ERPs elicited by the two stimulus types (target vs. non-target) were quantified using the coefficient of determination R<sup>2</sup>. We considered the epochs from all seven runs. The R<sup>2</sup> values range from 0 to 1, with higher values corresponding to larger explained variance. A signed R<sup>2</sup> index was introduced to account for the different polarity of the ERPs (N2 and P3). It was derived by multiplying R<sup>2</sup> by the sign of the slope of the corresponding linear model, which was positive when the amplitudes of the ERPs elicited by the target stimuli were higher than those elicited by non-target stimuli, and negative otherwise (as in Aloise et al., 2012). The mean R<sup>2</sup> values were computed within two temporal intervals: from 100 ms to 400 ms for the N2 (negative values) and from 250 ms to 550 ms for the P3 (positive values).
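A signed coefficient of determination of this kind can be sketched for a single time sample as follows. The sketch assumes the linear model regresses amplitude on a binary label (target = 1, non-target = 0), so that the sign of the slope equals the sign of the correlation; the function names are hypothetical.

```python
import math


def signed_r2(target_amps, nontarget_amps):
    """Signed R^2 between amplitude and stimulus type at one time sample.

    Positive when target amplitudes are higher than non-target amplitudes,
    negative when they are lower.
    """
    labels = [1.0] * len(target_amps) + [0.0] * len(nontarget_amps)
    amps = list(target_amps) + list(nontarget_amps)
    n = len(amps)
    mean_l = sum(labels) / n
    mean_a = sum(amps) / n
    sxy = sum((l - mean_l) * (a - mean_a) for l, a in zip(labels, amps))
    sxx = sum((l - mean_l) ** 2 for l in labels)
    syy = sum((a - mean_a) ** 2 for a in amps)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # no variance: the two classes cannot be separated
    r = sxy / math.sqrt(sxx * syy)
    return math.copysign(r * r, r)


def mean_in_window(values, times_ms, start_ms, end_ms):
    """Average the values whose time stamps fall inside [start_ms, end_ms]."""
    selected = [v for v, t in zip(values, times_ms) if start_ms <= t <= end_ms]
    return sum(selected) / len(selected) if selected else 0.0
```

Applying `signed_r2` sample by sample and then `mean_in_window` over 100 to 400 ms or 250 to 550 ms gives one summary value per component.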

As for the ITR, its values were computed by segmenting data into epochs between 0 ms and 550 ms after the stimulus onset—ITR (0–550 ms)—to ensure that the temporal interval would include both N2 and P3 ERP components.

#### ERPs Analysis

We focused the ERP analysis on the N2 as the earliest ERP that reliably correlates with visual awareness (Visual Awareness Negativity; Railo et al., 2011) and the P3 as associated with conscious access to the content of conscious vision (Raffone et al., 2014). Hence, these two ERP components can be considered a reliable reflection of attentional stimulus processing.

In this offline ERP analysis, the epochs in which a target stimulus occurred within the 500 ms preceding the stimulus onset were removed in order to reduce the contamination between consecutive epochs and the ERP overlapping (Treder and Blankertz, 2010). Target and non-target ERP waveforms were obtained by averaging the epochs relative to each run. The ERP waveforms analyzed were obtained from a sample-by-sample contrast between the target and non-target ERP waveform amplitudes (i.e., the difference between target and non-target).

The mean amplitude and the centroid latency (Luck, 2005) of both the N2 and the P3 were obtained from all data sets (seven BCI session runs). The P3 mean amplitude was calculated by averaging the voltage of all positive points preceded or succeeded by positive values between 250 ms and 550 ms after the stimulus onset, whereas for the N2 mean amplitude we averaged the voltage of all negative points preceded or succeeded by negative values between 100 ms and 400 ms after the stimulus onset (Clayson et al., 2013). The P3 and N2 centroid latencies were set as the time point at which the area under the curve was divided into two equal halves. Finally, the mean amplitude and the centroid latency of the P3 waves (P3-MA and P3-CL, respectively) recorded from Fz, Cz and Pz and the mean amplitude and the centroid latency of the N2 waves (N2-MA and N2-CL, respectively) recorded from PO7, PO8 and Oz were subjected to the statistical analysis.
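The two ERP measures can be sketched as below for a single difference waveform. Several details are assumptions made for illustration only: a neighbouring sample counts even if it lies just outside the time window, the area is computed on the rectified waveform of the requested polarity, and the centroid is reported at sample resolution without interpolation. The function names are hypothetical.

```python
def mean_amplitude(times_ms, wave, start_ms, end_ms, polarity=1):
    """Mean of the samples of the given polarity inside the window that have
    at least one adjacent sample of the same polarity.

    polarity=+1 selects positive points (P3); polarity=-1 negative points (N2).
    """
    kept = []
    for i, (t, v) in enumerate(zip(times_ms, wave)):
        if not start_ms <= t <= end_ms or v * polarity <= 0:
            continue
        prev_ok = i > 0 and wave[i - 1] * polarity > 0
        next_ok = i + 1 < len(wave) and wave[i + 1] * polarity > 0
        if prev_ok or next_ok:
            kept.append(v)
    return sum(kept) / len(kept) if kept else 0.0


def centroid_latency(times_ms, wave, start_ms, end_ms, polarity=1):
    """First sample time at which the cumulative rectified area reaches half
    of the total area inside the window; None if the area is zero."""
    points = [(t, max(v * polarity, 0.0))
              for t, v in zip(times_ms, wave) if start_ms <= t <= end_ms]
    total = sum(area for _, area in points)
    if total == 0.0:
        return None
    running = 0.0
    for t, area in points:
        running += area
        if running >= total / 2.0:
            return t
    return None
```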

#### Psychological Paradigm Data Analysis

As for the RSVP data set, the accuracy of T1 and T2 detection (T1%; T2%) was estimated (T2% was considered only in trials in which T1 had been correctly identified). T1% was considered an index of participants' temporal attentional filtering capacity and T2% was considered as an index of the capability to adequately update the attentive filter (Riccio et al., 2013) in the temporal dynamics of the attentional selection.
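The conditional scoring of T2 can be sketched as follows. The trial representation (dictionaries with boolean `t1_correct` and `t2_correct` fields) and the function name are hypothetical; how T2-absent trials enter the T2% score is not specified here, so the sketch simply scores every trial on which T1 was correct.

```python
def rsvp_accuracy(trials):
    """Return (T1%, T2%) for a list of RSVP trials.

    T1% is computed over all trials; T2% only over the trials in which
    T1 was correctly identified.
    """
    if not trials:
        return 0.0, 0.0
    t1_hits = [trial for trial in trials if trial["t1_correct"]]
    t1_pct = 100.0 * len(t1_hits) / len(trials)
    if not t1_hits:
        return t1_pct, 0.0
    t2_pct = 100.0 * sum(trial["t2_correct"] for trial in t1_hits) / len(t1_hits)
    return t1_pct, t2_pct
```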

#### Statistical Analysis

Between-group (ALS and Control) differences in terms of BCI control performance were evaluated as follows. A (non-parametric) Mann-Whitney U test was applied to assess the between-group difference in BCI online accuracy (accuracy scores not normally distributed). A Student's T-test was applied to assess the between-group difference in terms of the ITR scores. The contribution of N2-R2 and P3-R2 to the BCI performance was assessed for each group (ALS and Control) by means of two linear regression analyses with the ITR (0–550 ms) as the dependent variable and the N2-R2 and the P3-R2 as independent variables.

The (non-parametric) Spearman's rank order correlation was applied to investigate the possible correlation between ALSFRS-R scores (not normally distributed) and the ITR values.
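For readers unfamiliar with the rank-order coefficient, a minimal standard-library sketch is given below. It computes Spearman's coefficient as the Pearson correlation of average ranks (ties share the mean of their positions) and does not compute a p-value; the function names are hypothetical and this is not the statistical software used in the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank-order correlation (Pearson correlation of the ranks)."""
    return pearson(average_ranks(x), average_ranks(y))
```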

To investigate whether ALS patients showed differences with respect to Control group in attention processing during the BCI task, we conducted two MANOVAs to determine the effect of group (independent variable) on both P3-MA and P3-CL (dependent variables). The same analysis (two MANOVAs) was performed to determine the effect of group on both N2-MA and N2-CL.

To investigate whether the temporal pattern of attentional resource allocation would be correlated with the BCI performance level (ITR), we performed a correlation analysis (Pearson's correlation coefficients) between the ITR (0–800 ms), the P3-MA in Pz and T1% and T2%. Such correlation was sought both by pooling all data from ALS and Control groups and by considering only ALS group data.

The existence of alterations in attentional resources allocation in the ALS group was assessed by means of a MANOVA to test the effect of group (independent variable) on T1% and T2% (dependent variables).

### RESULTS

The ALS and Control groups did not show significant differences as regards demographic characteristics (Student's T-test; t(24) = 1.6; p = 0.13 and t(24) = 0.17; p = 0.85 for age and ''years of formal education'', respectively) and clinical assessment focused on selective attention (SA; error scores, χ<sup>2</sup> = 1.96; p = 0.16), working memory (WM; errors, χ<sup>2</sup> = 3.29; p = 0.07) and executive functions (EF; χ<sup>2</sup> = 0.119; p = 0.729).

We did not find a significant between-group difference in the online accuracy (**Figure 1A**; ALS group mean = 96.1% ± 5; Control group mean = 99.2% ± 2; U = 55.5; p = 0.13). On the contrary, ITR (0–800 ms) was significantly higher in the Control group (mean = 36.6 ± 14.5 bits/min) as compared to the ALS group (mean = 25.4 ± 12.1 bits/min; t(24) = 2.1, p ≤ 0.05; **Figure 1B**).

The linear regression analysis (p = 0.11) revealed that, in the Control group, ITR (0–550 ms) was significantly predicted neither by the N2-R2 values (β = 0.33; p = 0.34) nor by the P3-R2 values (β = 0.32; p = 0.34; **Figure 2B**). In contrast, in the ALS group the linear regression was significant (F(2,10) = 4.3526, p < 0.05). Specifically, only the N2-R2 was significantly predictive of the ITR (0–550 ms; N2-R2: β = 0.59, p < 0.05; P3-R2: β = 0.34, p = 0.17; **Figure 2A**).

We found that the ITR and ALSFRS-R scores showed a strong tendency to correlate, which did not reach significance (p = 0.06; r = 0.52). Such tendency, however, suggests that the degree of disability due to ALS could influence the BCI performance in a detrimental way.

No significant differences were found between ALS and Control groups (MANOVAs) in P3-MA (λ = 0.71; F(3,22) = 2.9, p = 0.05), N2-MA (λ = 0.88; F(3,22) = 0.9; p = 0.4) and N2-CL (λ = 0.89; F(3,22) = 1.0; p = 0.4). On the contrary, the MANOVA returned a significant between-group difference in P3-CL (λ = 0.59; F(3,22) = 5.0; p < 0.01) values over Fz (p < 0.001; Control group mean = 363.17 ± 20.14 ms; ALS group mean = 409.4 ± 38.07 ms; Bonferroni corrected) and Cz electrodes (p < 0.05; Control group mean = 368.60 ± 27.62 ms; ALS group mean = 396.68 ± 32.66 ms; Bonferroni corrected) with longer CL in ALS patients with respect to Controls. No significant differences were found in P3-CL values over Pz (p = 0.3; control group mean = 379.71 ± 37.83 ms; ALS group mean = 394.4 ± 38.82 ms; Bonferroni corrected).

As illustrated in **Figure 3**, the visual inspection of P3 topography indicates a prevalent frontal distribution of P3 in ALS whereas a parietal distribution is observed in Control group.

The analysis of the relation between cognitive substrates and BCI performance, as measured by means of RSVP and BCI data, returned a significant positive correlation between T1% and the ITR (r = 0.51; p < 0.05) and between T1% and the P3-MA in Pz (r = 0.63; p < 0.05) when considering all group data (ALS and Control data pooled). No significant correlation was found between T2% and ITR (r = 0.32; p = 0.15) and between T2% and P3-MA (r = 0.23; p = 0.31). When considering only the ALS group, the same analysis unveiled a significant positive correlation between T1% and ITR (r = 0.71; p < 0.05) and between T1% and P3-MA (r = 0.78; p < 0.05), whereas T2% and ITR (r = 0.66; p = 0.07) and T2% and P3-MA (r = 0.39; p = 0.31) did not show a significant correlation.

The MANOVA (F = 4.4; p < 0.05) revealed that the T1% values were significantly lower in the ALS group as compared to the Control group (Control group mean = 89.7 ± 0.8%; ALS group mean = 79.4 ± 10%; p = 0.01; Bonferroni corrected). No significant difference was found in T2% (Control group mean = 63.6 ± 2%; ALS group mean = 67.8 ± 23%; p = 0.6).

### DISCUSSION

This study aimed at investigating whether ALS patients showed differences in the ability to control a P3-speller BCI system with respect to healthy subjects. We focused on the attention processing involved in the delivery of the visual BCI stimulation paradigm, in order to further elucidate (Riccio et al., 2013) whether and how such cognitive abilities are altered in ALS patients and might account for patients' BCI control capacity. We hypothesized that the capacity to accomplish a P3-speller task was decreased in ALS patients and that they would show an alteration in the visual attention processing elicited during P3-based BCI control. To test our hypothesis, we compared two groups of participants (ALS patients vs. Control) in terms of performance in P3-speller control and with regard to the earliest ERP components, such as N2 and P3, which are correlated with visual awareness (Railo et al., 2011; Raffone et al., 2014).

First, we found that the ALS patients showed a significantly lower ITR in the P3-speller BCI task with respect to Controls whereas the online performance was comparable between the two groups.

This finding is not in line with what was reported by McCane et al. (2015). According to their study, severely disabled ALS patients and age-matched healthy controls showed similar P3-based BCI performance in terms of maximum accuracy, communication rate and bit rate. Several differences between these two studies might account for the apparent discrepancy on the ability to master a P3-based BCI by ALS patients. The first is the pattern of visual stimulation (checkerboard in McCane et al., 2015), which is well known to remarkably influence the P3-based BCI control performance (Townsend et al., 2010) and related ERP characteristics (Kaufmann et al., 2011). The second is the ALS clinical severity, which was higher in the ALS population of McCane et al. (2015) (ALSFRS-R scores = 9.4 ± 9.5 SD) than in our population (ALSFRS-R scores = 31.2 ± 10.4). In this regard, we found a remarkable (but not significant) correlation between the ITR and the ALSFRS-R scores that suggests a direct relation between the degree of clinical disability and the ability to use a P3-based BCI. Finally, in the McCane et al. (2015) study, the ALS and healthy participants were not matched for years of formal education, a variable that also accounts for the level of cognitive task performance (Ardila et al., 2000).

In addition to this, the overall methods (and metrics) to estimate the P3-based BCI performance are not directly comparable between the two studies. We ''only'' found the (offline) ITR as a distinctive metric of the ability to use a P3-based BCI in ALS with respect to Control group.

In the P3-speller task, the act of focusing attention on the target letter modulates the visual processing of the stimulus. Our ERP findings indicate that the P3 mean latency was significantly higher in ALS with respect to control group while no difference was found in the N2 parameters between the two groups. The finding of a delayed P3 associated with a ''normal'' N2 (i.e., physiological stimulus categorization process) in ALS can be interpreted as a delay occurring in the post-perceptual stage of the stimulus attentional processing (Duncan-Johnson and Kopell, 1981) that is, the variation in the attention modulation during the stimulus visual processing observed in ALS would occur when stimulus perception is complete, the target is categorized and its storing in WM has taken place. We found no significant between-group differences in P3 mean amplitude; this latter parameter can be considered as a measure of the attention allocation resources (Donchin, 1981). This finding allows us to speculate that in ALS patients the overall alteration of attention modulation during a P3-speller task is only related to the time of processing but not to the resource allocation.

The P3 wave component showed a frontal topography in the ALS group as compared to the parietal distribution observed in the Control group (**Figure 3**). This finding is in line with previous findings reported by McCane et al. (2015). We interpret this difference in P3 topography as possibly related to the P3a and P3b components that have different generators and thus, different topography (Courchesne et al., 1975): the frontal P3 component observed in the ALS group would represent the P3a whereas the parietal P3 present in Controls would better represent the P3b component.

Fabiani and Friedman (1995) suggested that during the process of learning a task, the P3a is elicited by novel stimuli that do not require a memory template. In contrast, the P3b is elicited when the stimulus memory template is created. One can speculate that the frontal P3a topography in ALS might reflect an increase in the frontal activity related to a more rapid decay of the memory templates (of the stimuli processed) that would make it more difficult to create and maintain an adequate template for the target stimulus (Fabiani and Friedman, 1995; Fabiani et al., 1998).

We found that the N2-R2 and not the P3-R2 coefficient significantly predicted the BCI accuracy only in the ALS group, accounting for 59% of the ITR variance. Based on the assumption that such coefficients mostly returned the contribution of N2 and of P3 waves to the BCI classification performance, these findings indicate that in ALS patients the N2 elicited during the P3-speller task would have a major role with respect to P3 in successful target selection.

The presence of a jitter in the P3 latency has been described in healthy subjects controlling a P3-based BCI system and its magnitude would be correlated with the online performance (Thompson et al., 2013; Aricò et al., 2014). Although it remains to be demonstrated that such P3 wave jitter exists and to what extent it might influence the P3-speller BCI task performance in ALS patients, one can speculate that our observed unbalance in the contribution of N2 and P3 components to successful BCI performance in favor of N2 might reside in the cross-relationship between the P3 latency jitter and delayed visual processing phenomena. Hence, further investigations in this regard are of utmost relevance to address sensible design of future ERP-based BCIs for ALS user candidates.

According to our previous findings (Riccio et al., 2013), the temporal aspect of the SA investigated by means of a RSVP task and measured as T1% (i.e., the ability to maintain the attentional filter active during the selection of a target within a range of time) was found to be related to both the BCI performance and P3 amplitude in ALS patients.

In the present study, we confirmed that the temporal filter in attention processing of visual stimuli was altered in ALS patients, by directly comparing the T1% and T2% values obtained from the ALS and Control groups. Specifically, we found that the capability to detect T1 (but not T2) was lower in the ALS group.

As such, this finding is consistent with that of a delay in the P3 latency, which reflects a deficit in the temporal aspect (i.e., post-perceptual stage of the stimulus attentional processing) of the context update. Taken together, these findings indicate the existence of an alteration in the temporal aspects of the visual stimulus processing as presented in a ''conventional'' P3-speller matrix in the ALS population and that this time-related alteration in the capacity to temporally process visual stimuli does influence the rate of success in BCI control.

Our findings might lay the groundwork for future clarification of some of the relevant issues in the actual deployment of ERPs-based BCI for communication to ALS users, such as the impact of end-users' cognitive profile in designing user-centered (Liberati et al., 2015; Nijboer, 2015) and reliable BCI systems (Kübler et al., 2013).

### Study Limitation

Some limitations pertaining to different aspects of this study deserve to be mentioned. First, our ALS population does not include ALS patients in a complete locked-in state (LIS). Although this restriction in the inclusion criterion was mandatory to allow the cognitive screening, it prevents any generalization of our findings to those patients with no means of communication (i.e., complete LIS). In this framework, our study suggests a possible role of the cognitive assessment to be performed in ALS patients before they reach a LIS condition, including the specific cognitive abilities identified here as critical for P3-based BCI usage.

Second, our findings relative to BCI factors influencing BCI performance allow us to make inferences only regarding the control of a P3-speller in a group of ALS patients and cannot be generalized to the control of other BCIs. It is conceivable that when exploiting different features to control other BCIs, temporal aspects of attention would not have a comparable role. An example of ''alternative'' features to P3 would be the N400 wave, involved in the elaboration of meaningful stimuli such as face recognition (Kaufmann et al., 2013b). In addition, several single-case studies have shown significant differences in classification performance depending on the sensory modality of the ERP-based BCIs that participants controlled (Kaufmann et al., 2013a; Schreuder et al., 2013).

Third, attention is a complex domain of the cognitive functions (Posner, 1975) and its substrates can be measured in different ways; this prevents a direct comparison of results obtained within different contexts and with different behavioral and neurophysiological approaches. For instance, attention substrates measured with different tests (e.g., Cognitrone; Schuhfried, 2007) were not found to be precursors of performance in controlling sensorimotor-based BCIs, differently from visuo-motor coordination (Hammer et al., 2012, 2014).

### CONCLUSION

This study involved a group of participants with ALS and a group of healthy participants matched for age and years of formal education. Our results showed that both the capacity to accomplish the P3-speller task and the timing of the allocation of attentional resources in the post-perceptual stage of stimulus processing were altered in ALS patients. Furthermore, we confirmed that a reduced capacity to temporally filter a target stimulus within a stream of stimuli was related to a lower capacity of ALS patients to control a P3-speller.

Developing AT devices that restore communication in people with severe motor disabilities is a central issue of BCI research (Millán et al., 2010). Current BCI systems do not address the heterogeneity of the end-users often due to the lack of customizability and adaptability to their cognitive capabilities. This study contributes to the knowledge needed for developing a new class of BCI specifically designed by taking into account the influence of the cognitive characteristics of end-users on BCI usage.

#### AUTHOR CONTRIBUTIONS

AR was responsible for experimental design, data collection, analysis of data and manuscript writing. FS was responsible for data acquisition and analysis. LS was responsible for behavioral assessment (attention tasks). AP and MI were responsible for the patients' enrolment. MO-B, DM and FC supervised the overall experimental design implementation, data interpretation and manuscript editing.

#### FUNDING

Research partially supported by Sapienza University of Rome—Progetti di Ateneo 2015 (C26A15N8LZ), Progetti di Ateneo 2016 (PI1161550696379A) and Progetti di Ateneo 2017 (RM11715C82606455).




**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Riccio, Schettini, Simione, Pizzimenti, Inghilleri, Olivetti-Belardinelli, Mattia and Cincotti. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Detecting and Quantifying Mind Wandering during Simulated Driving

Carryl L. Baldwin<sup>1</sup>\*, Daniel M. Roberts<sup>1</sup>, Daniela Barragan<sup>1</sup>, John D. Lee<sup>2</sup>, Neil Lerner<sup>3</sup> and James S. Higgins<sup>4</sup>

<sup>1</sup>Department of Psychology, George Mason University, Fairfax, VA, United States, <sup>2</sup>Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, WI, United States, <sup>3</sup>Center for Transportation, Technology and Safety Research, Westat, Rockville, MD, United States, <sup>4</sup>Office of Behavioral Safety Research, National Highway Traffic Safety Administration, U.S. Department of Transportation, Washington, DC, United States

Mind wandering is a pervasive threat to transportation safety, potentially accounting for a substantial number of crashes and fatalities. In the current study, mind wandering was induced through completion of the same task for 5 days, consisting of a 20-min monotonous freeway-driving scenario, a cognitive depletion task, and a repetition of the 20-min driving scenario driven in the reverse direction. Participants were periodically probed with auditory tones to self-report whether they were mind wandering or focused on the driving task. Self-reported mind wandering frequency was high, and did not statistically change over days of participation. For measures of driving performance, participant-labeled periods of mind wandering were associated with reduced speed and reduced lane variability, in comparison to periods of on-task performance. For measures of electrophysiology, periods of mind wandering were associated with increased power in the alpha band of the electroencephalogram (EEG), as well as a reduction in the magnitude of the P3a component of the event-related potential (ERP) in response to the auditory probe. Results support that mind wandering has an impact on driving performance and that the associated change in the driver's attentional state is detectable in underlying brain physiology. Further, results suggest that detecting the internal cognitive state of humans is possible in a continuous task such as automobile driving. Identifying periods of likely mind wandering could serve as a useful research tool for assessment of driver attention, and could potentially lead to future in-vehicle safety countermeasures.

#### Edited by:

Stephen Fairclough, Liverpool John Moores University, United Kingdom

#### Reviewed by:

Edmund Wascher, Leibniz Research Centre for Working Environment and Human Factors (LG), Germany

Chrysi Bogiatzi, McMaster University, Canada

#### \*Correspondence:

Carryl L. Baldwin cbaldwi4@gmu.edu

Received: 30 April 2017 Accepted: 25 July 2017 Published: 08 August 2017

#### Citation:

Baldwin CL, Roberts DM, Barragan D, Lee JD, Lerner N and Higgins JS (2017) Detecting and Quantifying Mind Wandering during Simulated Driving. Front. Hum. Neurosci. 11:406. doi: 10.3389/fnhum.2017.00406

Keywords: mind wandering, inattention, driving, EEG, alpha

### INTRODUCTION

Driver inattention is a frequent cause of automobile crashes and fatalities. This issue has received considerable attention from the scientific community in recent years. Methods of detecting episodes of driver inattention in real-time hold promise for alleviating the human and economic costs of this safety critical issue. Drivers can be inattentive for a variety of reasons, the most obvious being distraction from mobile devices (Caird et al., 2008) or other external factors. However, many distracted driving crashes occur in the absence of an obvious visual or manual distraction. Mind wandering has been suggested as a potential source of many of these distracted driving crashes.

For most people, driving is a highly-overlearned task. Consequently, many of the tasks of everyday driving—lane and speed maintenance, stopping at signaled intersections, etc.—tend to occur relatively automatically. In addition, many trips are routinized with drivers taking the same routes back and forth to work, the grocery store, or other frequently visited locations, which further promotes automaticity, allowing attention to be devoted to other activities. The routine nature of the driving task, particularly along familiar or monotonous routes, creates an environment ripe for internal distraction or mind wandering, as we will refer to it here. Nevertheless, to maintain safety, drivers must remain attentive to a wide variety of stimuli that may represent latent hazards (Fisher et al., 2002) and be able to swiftly and accurately respond to unexpected events.

There are many varieties of attention (Parasuraman, 1998), but a well-accepted theoretical distinction is between external and internal attention (Chun et al., 2011). External (or ''bottom-up'') attention is triggered reflexively by environmental events. Internal (or ''top-down'') attention is the voluntary or involuntary application of cognitive resources away from the external environment towards internal thoughts. Most empirical research on internal attention has investigated the pursuit of goals relevant to events in the environment, such as searching for a landmark while driving in an unfamiliar neighborhood. But, by definition, internal attention can also be devoted to thoughts and memories quite unrelated to any external event. Much less research has been devoted to this topic (Forster and Lavie, 2014), but in recent years a small but growing body of work has examined an aspect of such internal attention: mind wandering (Giambra, 1995; Smallwood and Schooler, 2006).

It is well documented that driving performance is modulated by factors such as effort, fatigue and time on task (Robertson et al., 1997; Grier et al., 2003; Helton and Warm, 2008; May and Baldwin, 2009; Langner et al., 2010; Baldwin et al., 2014). Furthermore, several studies have shown that external distractions, such as talking or texting on a mobile phone, impair driving performance (Strayer et al., 2003; Strayer and Drews, 2007; Caird et al., 2008). Less appreciated and understood is the threat to safety posed by mind wandering behind the wheel, though mind wandering has been associated with an increased risk of being responsible for an automobile crash (Galéra et al., 2012). Fatigue associated with increased time on task may exacerbate both the prevalence and potential risk associated with mind wandering as it is associated with withdrawal of attention away from the driving task and can be considered a form of internal distraction (Williamson, 2007).

Mind wandering has been defined as, ''a shift of attention away from a primary task toward internal information'' (Smallwood and Schooler, 2006, p. 946). It often occurs without intention and may even occur without explicit awareness, making it a particular challenge to observe and measure. People may continue to move their eyes across a page of text (or the forward field of view while driving) without overtly attending to the viewed stimuli (Smallwood, 2011). Mind wandering is associated with increased activity in the default mode network (DMN) and decreased activity in the dorsal attention network (DAN). The DAN is integrally involved in controlling eye movements and directing exogenous attention to salient stimuli through top-down goal directed processing (Carretié et al., 2013). The DMN is sensitive to the presence of biologically salient non-task relevant stimuli, but this sensitivity is generally thought to come at the cost of processing task-relevant stimuli (Smallwood, 2011; Carretié et al., 2013). Smallwood (2011) has referred to this interplay between the DAN and DMN as a decoupling process, meaning that as attention is shifted toward one system it is withdrawn from the other. This decoupling may have important implications for driving by suggesting that the more drivers are engaged in mind wandering (activation of the DMN) the less likely they are to process external perceptual cues (potential hazards), particularly if the perceptual cues involve non-biologically salient stimuli (e.g., artificial sounds such as alarms and brake lights).

Two methods have typically been used to detect and quantify mind wandering: self-caught detection and probe-caught detection. The self-caught method of detection involves participants reporting when they notice their mind wandering. In contrast to self-caught methods of detection, probe techniques allow for sampling participants' cognitive states throughout a task under experimenter control, thus allowing for the detection of mind wandering episodes that are uncaught by the participants themselves. Three types of probe techniques have been previously utilized. The first and most commonly used is intermittent presentation of questions such as, ''Just now, were you mind wandering?'' throughout an otherwise continuous task (e.g., Broadway et al., 2015). Another variation of this technique uses probe tones, prompting participants to indicate via a button press whether they were or were not mind wandering (Smallwood and Schooler, 2006). More recently, Seli et al. (2016) used a combination of these probe techniques to determine whether performance differs when participants were aware or unaware that they were mind wandering. Participants were asked to indicate whether their thoughts were on task or they were mind wandering just before they heard the probe tone. Additionally, if participants indicated that they were mind wandering, they were then asked to indicate if they were aware or unaware that they were mind wandering prior to the probe tone. The third probe-caught method involves participants indicating the content of their thoughts at the time of probe, leaving the experimenter to classify whether their thoughts constituted mind wandering (Smallwood and Schooler, 2006). However, by necessity, both self-caught and probe-caught methods likely disrupt participants' primary task performance. This may be especially true when probes require more fine-grained judgments of the degree of mind wandering.
An ideal mind wandering detection methodology would not require any response on the part of the participant, but rather would provide an on-line classification using some form of machine learning. However, currently such a technique awaits further exploration of the sensitivity and robustness of various metrics using some type of self-report technique.

Results of several recent studies utilizing measures derived from electroencephalography (EEG) substantiate this claim by observing reductions in perceptual sensitivity during periods of mind wandering (Smallwood et al., 2008; Braboszcz and Delorme, 2011). In a breath counting task, Braboszcz and Delorme (2011) report that in the 10 s prior to participants' self-detecting a state of mind wandering, there was a significant reduction of alpha band activity combined with a diffuse increase in theta band activity, relative to the 10 s period following the button press, during which participant thought had presumably returned to the breath counting task. Modulation of theta band activity may provide a means of distinguishing mind wandering from other types of internal distractions. Savage et al. (2013) found that when participants were given a riddle to solve while performing a simulated driving task, there was an increase in theta band activity similar to that reported by a study measuring frontal sites during mind wandering (Braboszcz and Delorme, 2011). However, in contrast to mind wandering, being internally distracted by a secondary cognitive task leads to a decrease in theta band activity over occipital electrode locations.

Oscillatory activity in the alpha band of the EEG is suggested to be related to attention processes (Klimesch, 2012), specifically the degree to which attention is allocated internally vs. externally. For example, greater alpha power at parietal and/or occipital electrode locations is associated with the failure to detect (Ergenoglu et al., 2004) or discriminate (Van Dijk et al., 2008; Roberts et al., 2014) visual stimuli, while spatial attention processes similarly modulate alpha power such that alpha is suppressed contralaterally and enhanced ipsilaterally to the attended location (Worden et al., 2000; Thut et al., 2006). In contrast, alpha power has been reported to be elevated for tasks in which attention is directed internally, such as working memory retention (Jensen et al., 2002). As periods of high alpha power are associated with lapses of attention to external stimuli (O'Connell et al., 2009; Zauner et al., 2012; Borghini et al., 2014), alpha power is expected to be greater during periods of mind wandering relative to periods of on task behavior.

In terms of event related potentials (ERPs), modulations of early perceptual and attentional components during mind wandering have been observed. Broadway et al. (2015) report that in a reading task the visual N1 component, thought to index orienting and enhancement of perceptual information (Hopfinger et al., 2004), was attenuated during mind wandering. The authors also report an attenuation of the P1 component, thought to index the inhibition of irrelevant information (Hopfinger et al., 2004), over the right hemisphere when participants reported mind wandering. Attenuation of the visual P1 was reported bilaterally by Baird et al. (2014). However, it should be noted that such early components are relatively small and short-lasting and thus may be difficult to utilize for on-line detection of mind wandering.

More promising as a means of future on-line detection of mind wandering is the modulation of longer latency components such as the P300, a large, long-lasting ERP component thought to reflect allocation of attentional resources (Nieuwenhuis et al., 2005). Smallwood et al. (2008) found that the P300 evoked by visual stimuli in the sustained attention to response task (SART) was significantly attenuated during periods of self-reported mind wandering relative to periods of self-reported on-task behavior. Similar P300 attenuation during mind wandering was reported by Kam et al. (2014) in response to evaluation of more complex visual stimuli. The relatively high amplitude of the P300 relative to the background EEG allows reasonable extraction of this waveform from single trials for brain-computer interfaces, and has been an area of significant research during the past decade (Cecotti and Rivet, 2014).

Driver behavioral metrics such as speed variability, lane deviation, standard deviation of lateral position (SDLP) and steering reversal rate (SRR) have been shown to serve as an additional method to detect mind wandering (He et al., 2011, 2014; Bencich et al., 2014; Yanko and Spalek, 2014). However, previous research using these behavioral metrics has found inconsistent results that vary depending on the mind wandering detection method. For example, He et al. (2011) used the self-caught detection method and report that speed variability was decreased during mind wandering relative to on-task states. When using the probe detection method to detect instances of mind wandering, Yanko and Spalek (2014) found that mean speed was greater during mind wandering than on-task states. Conversely, using this same technique, Bencich et al. (2014) found that mean speed and speed variability were reduced during mind wandering compared to a state of alertness.

Further, He et al. (2014) found that under low cognitive load, which is thought to be similar to a mind wandering state, SDLP and SRR were reduced compared to a high cognitive load condition. He et al. (2014) interpreted these findings as suggesting that under high cognitive load, more effort and attention are needed to maintain lateral control, which could reflect an alert state. For example, during an attentive state (relative to mind wandering) participants drove at significantly greater speeds, had greater speed variability, greater SDLP and increased SRR (He et al., 2011; Bencich et al., 2014; Yanko and Spalek, 2014). However, though it is well-understood that mind wandering is associated with decreased attention to a primary task, there is currently no consensus on the effects of mind wandering on driving performance.

The purpose of this research was to investigate the frequency of mind wandering over repeated exposure to the same driving route, as well as to identify the relationship between mind wandering and both driver behavior and electrophysiology.

### MATERIALS AND METHODS

#### Procedure

The current study assessed the relationship between mind wandering and driving across 5 days, with the time of day for participation maintained over days within each participant. While a given participant returned at the same time of day for each of their five sessions, the time of day varied between participants, from morning to afternoon, due to participant scheduling availability. Each experimental session lasted 2–3 h. Each experimental task (the SART and two simulated drives) took approximately 20 min to complete. All procedures were approved by the Institutional Review Board of George Mason University (protocol # 727867-5) and participants provided written informed consent in accordance with the Declaration of Helsinki.

The procedures were consistent across the five experimental days with the exception of the first day, which included additional procedures. On day 1 only, participants signed an informed consent form describing the study, performed the Rosenbaum and Snellen visual acuity tests, and completed the demographics and driving history questionnaire, Simulator Sickness Screening, and six additional questionnaires described in ''Questionnaires'' Section. Additionally, prior to the first experimental drive on the first day of participation, participants completed a 5-min practice drive to familiarize themselves with maneuvering the driving simulator, including steering and braking. As the majority of the questionnaires were expected to be stable across days of participation, participants only completed the KSS and a short questionnaire on sleep quality on days two through five. The remaining procedures were performed across all 5 days.

On each day of participation, a 1-min resting baseline was acquired, during which participants were instructed to relax, keep their eyes open, and look at the center monitor of the driving simulator. During the baseline, the three driving simulator displays (forward, left and right side screens) displayed visual noise, composed of screen shots of the driving scenario with all pixels randomized. Visual noise was used to prevent large changes in screen luminance between the resting baseline and performance of the drive. Participants were then presented with a definition of mind wandering, and were given a demonstration of how to respond to the probe tones. The mind wandering definition used in this study, adapted from Singer and Antrobus (1972) and Seli et al. (2016), was as follows: ''please note that for the purposes of this experiment, the words mind wandering, daydreaming and zoning-out are all synonymous. These are popular terms for which there is no official definition. Despite the subjective nature of the mind wandering experience, we define it as thinking about something unrelated to the immediate task. For example, when driving on a highway it is not unusual for thoughts unrelated to driving to enter your mind. For example, you may think about what you ate for dinner, plans you have later with friends, or an upcoming test. These thoughts are considered mind wandering or off task for the purposes of this experiment whether they occur spontaneously or intentionally. During the experiment, you will periodically hear probe tones. As soon as you hear the tone, please indicate whether you were thinking about the immediate task (either driving or SART) or were mind wandering, meaning you were thinking about something unrelated to the immediate task''.

For both experimental drives, participants were instructed to maintain a speed of 65 miles per hour (MPH), stay in the right lane of the roadway, and keep at least one hand on the steering wheel at all times. Additionally, participants were instructed to indicate on the touchscreen whether their thoughts were on-task or they were mind wandering right before they heard the probe tone by pressing the corresponding buttons. Additionally, if participants indicated that they were mind wandering, a second screen appeared on the touchscreen asking participants to indicate if they were aware or unaware that they were mind wandering before they heard the probe tone. After the first drive was complete, participants completed the SART and then the second driving task.

While the drive was primarily composed of straight highway segments, it also included four curved roadway segments. Due to the potential for differences in attention demand between the straight and curved roadway segments, which could potentially influence attentional state or driving performance, attentional probes that occurred during or within 10 s following a roadway curve were excluded from further analyses.
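The curve-exclusion rule described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the representation of each curve as a (start, end) pair in seconds, and the example times are all assumptions made for demonstration.

```python
def keep_probe(probe_time, curve_intervals, buffer_s=10.0):
    """Return True if a probe falls outside every curve and its 10-s aftermath.

    probe_time      -- probe onset in seconds (hypothetical time base)
    curve_intervals -- list of (curve_start, curve_end) times in seconds
    buffer_s        -- exclusion window after each curve ends (10 s in the text)
    """
    for start, end in curve_intervals:
        if start <= probe_time <= end + buffer_s:
            return False
    return True


# Hypothetical example: two curves and four probes.
curves = [(100.0, 120.0), (400.0, 430.0)]
probes = [50.0, 110.0, 128.0, 131.0]
kept = [t for t in probes if keep_probe(t, curves)]
```

In this example the probes at 110 s (during a curve) and 128 s (within 10 s after it) are dropped, while the probes at 50 s and 131 s are retained.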

Although not discussed further, in addition to EEG recording, participants were outfitted with a head-mounted (eyeglass) eye-tracker and a chest belt heart rate monitor, and were video recorded with a dashboard video camera during the experiment.

#### Participants

Nine participants (5 men, 4 women) were recruited from George Mason University via an e-mail announcement, and were compensated at a rate of \$15 per hour. All participants were at least 18 years of age, had normal or corrected-to-normal vision (verified with Rosenbaum card and Snellen eye chart), and possessed a valid United States driver's license. On average, participants were 24 years of age (SD = 3.37, range: 18–29) and had 8 years (SD = 3.19, range: 2–12) of driving experience.

None of the participants were taking any medications known to affect the central nervous system, none had sustained a major head injury such as a concussion, and all were right handed. None of the participants self-reported experiencing frequent motion sickness, and all were additionally screened using the Simulator Sickness Screening (Hoffman et al., 2003) prior to participation. Each participant signed an informed consent document after being briefed on the procedures of the study. Procedures of the study were approved by the George Mason University Institutional Review Board.

#### Questionnaires

On the first day of participation, participants completed eight questionnaires: a demographics and driving history questionnaire, Simulator Sickness Screening (Hoffman et al., 2003), Mindful Attention Awareness Scale (MAAS; Brown and Ryan, 2003), Cognitive Failures Questionnaire (CFQ; Broadbent et al., 1982), Mind Wandering Scale (MWS; Singer and Antrobus, 1970; Giambra, 1980), Attention-Related Driving Errors Scale (ARDES; Barragán et al., 2016), Attention-Related Cognitive Errors Scale (ARCES; Cheyne et al., 2006) and Karolinska Sleepiness Scale (KSS; Åkerstedt and Gillberg, 1990). The MAAS, CFQ, MWS, ARDES and ARCES have been shown to reliably identify individuals with a greater propensity for lapses in attention (Broadbent et al., 1982; Cheyne et al., 2006; Barragán et al., 2016) or mind wandering (Giambra, 1980; Brown and Ryan, 2003; Burdett et al., 2016).

#### Driving Task

Participants completed two simulated drives on each of the 5 days of participation. The start and end points of the two drives were reversed; on the second drive, the participants began at the destination from the first drive, and drove back to the first drive origin point (see **Figure 1**).

With the exception of the direction of travel, the driving scenarios were identical between the two drives. The 20 min drives consisted of leaving a parking lot and entering a straight highway with limited scenery and no ambient traffic, until a destination parking lot was reached. The drive did not require navigation; while the highway included curves, participants did not have to make any turns, with the exception of leaving the starting position and entering the destination position. While at the speed limit of 65 MPH, ambient road noise was presented at 60 dB A-weighted sound pressure level (SPL) from speakers integrated into the driving simulator. If participants were driving 15 MPH above or below the speed limit, a message appeared on the center monitor instructing them to slow down or speed up, respectively. Additionally, as participants approached the end of each drive, an auditory message was presented, which instructed the participants to turn into the destination parking lot.

#### SART

Between the first and second drives on each day of participation, participants completed the SART (Robertson et al., 1997), presented on a touchscreen display (7-inch diagonal, 800 by 480 pixels resolution) mounted inside the cab of the driving simulator. The SART was included in the experiment to roughly simulate cognitively demanding office work, which could potentially influence participant performance or mind wandering frequency on the second drive of each day via a depletion of executive resources that would otherwise maintain attention towards the primary task (Thomson et al., 2015). However, the purpose of including the SART was not to examine the effect of the SART per se, but rather to ensure that enough mind wandering instances occurred throughout the course of the study for comparison of mind wandering and on task states. Within each trial of the SART, participants were presented with a single digit between 1 and 9 on the center of the display. Participants were instructed to click a response button on a Logitech Wireless Presenter remote in response to the digits 1, 2, 4, 5, 6, 7, 8 or 9, while withholding any response to the remaining digit 3. Participants held the remote with their right hand for the duration of the task. Participants completed two blocks of the task on each day of participation, with each block containing 450 trials and taking approximately 9 min to complete. Each block contained 50 presentations of each of the digits 1 through 9, in a randomized order. Each stimulus remained on-screen for 250 ms, following which it was removed. Responses to each SART stimulus were collected for up to 1 s following stimulus onset on each trial. The inter-stimulus interval was jittered with a continuous uniform distribution between 1050 ms and 1250 ms, rounded to the next presentation frame.
Stimuli were presented as white digits on a black background. In addition, black and white tracking patterns were displayed in the four corners of the display to allow the forward-facing camera of the eye-tracker to track the location of the display. Although the SART was performed within the cab of the driving simulator for practical purposes, the participants did not perform a drive or interact with the simulator while performing it. During performance of the SART, the visual displays of the simulator displayed images of visual noise generated by randomizing the pixels of screenshots of the driving scenario, as used within each day's pre-drive baseline condition. The SART was implemented within the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007) for MATLAB.
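The block structure described above can be illustrated with a short sketch. The study itself used the Psychophysics Toolbox for MATLAB; the Python below is only an illustration of the trial-list logic, and the function name, the dictionary fields and the 60 Hz refresh rate are assumptions (the display refresh rate is not reported).

```python
import math
import random


def make_sart_block(n_per_digit=50, frame_rate_hz=60.0, seed=None):
    """Build one SART block: shuffled digits plus jittered inter-stimulus intervals.

    frame_rate_hz is an assumption; the display refresh rate is not reported.
    """
    rng = random.Random(seed)
    # 50 presentations of each digit 1-9 gives 450 trials, in random order.
    digits = [d for d in range(1, 10) for _ in range(n_per_digit)]
    rng.shuffle(digits)

    frame_ms = 1000.0 / frame_rate_hz
    trials = []
    for digit in digits:
        isi_ms = rng.uniform(1050.0, 1250.0)
        # Round up to the next whole presentation frame.
        isi_frames = math.ceil(isi_ms / frame_ms)
        trials.append({
            "digit": digit,
            "is_nogo": digit == 3,  # response must be withheld on the digit 3
            "isi_ms": isi_frames * frame_ms,
        })
    return trials


block = make_sart_block(seed=1)
```

With the default arguments each block contains 450 trials, 50 of which are no-go trials, matching the counts given in the text.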

#### Attentional Probes

Within both the driving scenarios and SART, participants were probed to self-report their current attentional state. Probes were initiated by an auditory tone, composed of a 440 Hz sine wave, 500 ms in length with 10 ms onset and offset ramps, presented at 70 dB SPL via speakers mounted behind the seat of the simulator. Concurrent with tone presentation, a touchscreen mounted on the dashboard of the vehicle presented a response screen displaying two buttons, labeled ''Mindwandering'' and ''On Task''. If participants responded ''Mindwandering'', a second screen was presented displaying two buttons labeled ''Aware of Mind Wandering before probe'' and ''NOT aware of Mind Wandering until probe''. The second response was collected as the effects of mind wandering may be particularly pronounced when participants are both off-task and unaware of their inattentiveness (see Smallwood and Schooler, 2015). The time interval to the next attentional probe was relative to the submission of the response to the current probe, with a response-to-stimulus interval jittered with a continuous uniform distribution between 30 s and 90 s. In the SART, the attentional probes were always presented between SART trials. Prior to participation, the probe procedure and response screen were explained to participants, who were allowed to familiarize themselves with the attentional probe tone. Attentional probe presentation and response collection were implemented within the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007) for MATLAB.
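As a rough illustration of the probe parameters given above, the sketch below synthesizes a 440 Hz, 500 ms tone with 10 ms onset and offset ramps and draws a response-to-probe delay between 30 s and 90 s. It is not the authors' MATLAB implementation: the 44.1 kHz audio sample rate, the linear ramp shape and the function names are assumptions, and calibration to 70 dB SPL is outside the scope of the sketch.

```python
import math
import random


def probe_tone(freq_hz=440.0, dur_s=0.5, ramp_s=0.01, sample_rate=44100):
    """Return a sine tone as a list of floats in [-1, 1] with on/off ramps.

    sample_rate and the linear ramp shape are assumptions; the text gives
    only the frequency, the duration and the ramp length.
    """
    n = int(round(dur_s * sample_rate))
    n_ramp = int(round(ramp_s * sample_rate))
    samples = []
    for i in range(n):
        gain = 1.0
        if i < n_ramp:
            gain = i / n_ramp                # onset ramp
        elif i >= n - n_ramp:
            gain = (n - 1 - i) / n_ramp      # offset ramp
        samples.append(gain * math.sin(2.0 * math.pi * freq_hz * i / sample_rate))
    return samples


def next_probe_delay(rng=random):
    """Seconds from the response to the current probe until the next probe."""
    return rng.uniform(30.0, 90.0)


tone = probe_tone()
delay_s = next_probe_delay()
```

Because the delay is timed from the participant's response and not from the previous tone, the number of probes per drive depends on how quickly participants respond.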

#### Electrophysiology

EEG was recorded using a BrainVision ActiChamp 32-channel active EEG system in conjunction with BrainVision PyCorder (v1.0.8) recording software. All electrode impedances were prepared to below 25 kΩ prior to data collection, the threshold recommended by the EEG system manufacturer for active electrodes. EEG was recorded at a sampling rate of 500 Hz, with an online reference of electrode TP9 (left mastoid process) and an online band-pass filter between 0.01 Hz and 100 Hz. Offline, EEG data was processed using MATLAB in conjunction with the EEGLAB toolbox (Delorme and Makeig, 2004). Data was re-referenced to the average of TP9 and TP10 (left and right mastoid process electrodes), low-pass filtered using a filter with 40 Hz cutoff and 10 Hz transition bandwidth, and high-pass filtered using a windowed sinc FIR filter with Blackman window, 0.1 Hz cutoff, and 0.2 Hz transition bandwidth, both as implemented in the EEGLAB function pop\_firws. An additional copy of the data was high-pass filtered at 1 Hz with a 2 Hz transition bandwidth, for independent components analysis (ICA) decomposition (Debener et al., 2010; Winkler et al., 2015). Following filtering, artifactual electrodes were identified and removed via the EEGLAB pop\_rejchan function, where artifactual electrodes are defined as those exceeding a probability threshold of ±3 standard deviations. Artifactual electrodes were identified using the 0.1 Hz high-pass filtered data (which is used for the final analysis); however, the same electrodes were subsequently also removed from the 1 Hz high-pass filtered data (which is used for ICA only). Across the 45 sessions within the experiment, an average of 1.29 electrodes were removed (SD = 0.82, min = 0, max = 3).

The copy of the data high-pass filtered at 1 Hz was epoched into 1-s intervals, with any 1-s epoch with activity exceeding ±500 µV on any electrode removed from further analysis. The remaining epochs were decomposed via ICA, using the Extended InfoMax algorithm as implemented in EEGLAB, following which the ICA weights were copied from the 1 Hz high-pass filtered data to the 0.1 Hz high-pass filtered data. Independent components (ICs) reflecting eye movements and eye blinks were identified via manual inspection of IC topography and spectra, and subtracted from the data. This ICA procedure was performed for each day of recorded data separately, as the electrodes are unlikely to be located in precisely the same location for the same participant across days.

Following rejection of artifactual ICs, the 10 s of EEG data preceding each auditory probe were divided into ten non-overlapping 1-s epochs, with each epoch labeled according to the corresponding probe response. Following epoching, any epoch with activity exceeding ±100 µV on any electrode was removed from further analysis. Any electrodes previously removed due to artifact were interpolated via spherical interpolation for the purpose of topographic plotting.
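The pre-probe epoching and amplitude-based rejection can be sketched as below. This is a simplified illustration, not the EEGLAB pipeline used in the study: the data layout (a dictionary of per-electrode sample lists in microvolts), the function name and the example recording are assumptions, and the threshold is interpreted as an absolute amplitude above 100 µV.

```python
def pre_probe_epochs(data, probe_sample, sample_rate=500, n_epochs=10,
                     reject_uv=100.0):
    """Split the 10 s before a probe into 1-s epochs and drop artifactual ones.

    data         -- dict mapping electrode name to a list of samples (microvolts)
    probe_sample -- sample index of probe onset
    Returns a list of (epoch_index, {electrode: samples}) for retained epochs.
    """
    epoch_len = sample_rate  # 1 s of samples
    first = probe_sample - n_epochs * epoch_len
    if first < 0:
        raise ValueError("not enough data before the probe")

    kept = []
    for k in range(n_epochs):
        start = first + k * epoch_len
        stop = start + epoch_len
        epoch = {ch: x[start:stop] for ch, x in data.items()}
        # Keep the epoch only if no electrode exceeds the amplitude threshold.
        if all(abs(v) <= reject_uv for x in epoch.values() for v in x):
            kept.append((k, epoch))
    return kept


# Hypothetical two-channel recording: 12 s of flat signal with one 150 µV
# spike on Pz, 0.5 s before a probe at sample 6000.
data = {"Fz": [0.0] * 6000, "Pz": [0.0] * 6000}
data["Pz"][5750] = 150.0
kept = pre_probe_epochs(data, probe_sample=6000)
kept_indices = [k for k, _ in kept]
```

Here only the final 1-s epoch contains the spike, so nine of the ten epochs are retained for that probe.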

For measurement of EEG spectra, the data from each of the remaining 1-s epochs were linearly detrended and converted from the time domain to the frequency domain via Welch's periodogram method, as implemented in the MATLAB function pwelch. Power values for each frequency bin within each epoch were converted into decibels (dB) power (10 × log10(power)) to more closely approximate a normal distribution. The spectral power preceding each auditory probe was then computed by averaging the dB scaled power from the remaining 1-s epochs preceding that probe. Of interest was activity in the theta and alpha bands of the EEG. A priori, electrode Fz was selected for investigation of theta power, while electrode Pz was selected for investigation of alpha power. Electrode Fz was selected for theta activity as it is the electrode where frontal midline theta is most prominent. Electrode Pz was selected for alpha activity as enhanced alpha activity at electrode Pz has been reported when attention is directed internally, such as to the contents of working memory, and has previously been reported to be sensitive to lapses of attention to external stimuli (O'Connell et al., 2009). The frequency bins representing the peak theta and alpha frequency across subjects were identified via visual inspection of the spectrum collapsed across all participants and conditions under study at the stated electrodes. The peak theta frequency was identified as the bin centered on 5.85 Hz, while the peak alpha frequency was identified as the bin centered on 8.78 Hz. The dB power within these frequency bins was used for statistical comparison across experimental conditions.
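The spectral measure can be approximated with the sketch below. It is an assumption-laden illustration, not a reproduction of the authors' pwelch call, whose window and segment settings are not reported: it uses a single Hamming-windowed segment per 1-s epoch (the case Welch's method reduces to when the segment spans the whole epoch) and a 512-point frequency grid. The 512-point grid is an inference from the reported bin centers, since at 500 Hz it places bins 6 and 9 at about 5.86 Hz and 8.79 Hz. All function names are hypothetical.

```python
import math


def bin_frequency(k, nfft=512, sample_rate=500.0):
    """Center frequency in Hz of FFT bin k."""
    return k * sample_rate / nfft


def bin_power_db(epoch, k, nfft=512, sample_rate=500.0):
    """Power in dB at FFT bin k for one epoch, via a windowed periodogram."""
    n = len(epoch)
    # Linear detrend: subtract the least-squares line.
    t_mean = (n - 1) / 2.0
    x_mean = sum(epoch) / n
    sxx = sum((i - t_mean) ** 2 for i in range(n))
    sxy = sum((i - t_mean) * (x - x_mean) for i, x in enumerate(epoch))
    slope = sxy / sxx
    detrended = [x - (x_mean + slope * (i - t_mean)) for i, x in enumerate(epoch)]

    # Hamming window (assumed; the window used in the study is not stated).
    win = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    xw = [w * x for w, x in zip(win, detrended)]

    # Single-bin DFT on an nfft-point grid (equivalent to zero-padding).
    re = sum(x * math.cos(2.0 * math.pi * k * i / nfft) for i, x in enumerate(xw))
    im = -sum(x * math.sin(2.0 * math.pi * k * i / nfft) for i, x in enumerate(xw))

    # One-sided power spectral density, then conversion to decibels.
    psd = 2.0 * (re * re + im * im) / (sample_rate * sum(w * w for w in win))
    return 10.0 * math.log10(psd)


def probe_power_db(epochs, k):
    """Average the dB-scaled power over the retained epochs before one probe."""
    values = [bin_power_db(e, k) for e in epochs]
    return sum(values) / len(values)


# Hypothetical test signal: a 1-s sinusoid at the alpha bin frequency.
sine = [math.sin(2.0 * math.pi * bin_frequency(9) * i / 500.0) for i in range(500)]
alpha_db = bin_power_db(sine, 9)
theta_db = bin_power_db(sine, 6)
```

Note that the averaging is done on dB values, as described in the text, so the probe-level value is a mean of log power and not the log of mean power.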

For measurement of time domain EEG activity (ERP), EEG was epoched from −200 ms to 800 ms relative to the onset of the attentional probe and baseline corrected to the mean of the activity within the −200 ms to 0 ms baseline period. Following baseline correction, any epoch with activity exceeding ±100 µV on any electrode was removed from further analysis. Any electrodes previously removed due to artifact were interpolated via spherical interpolation for the purpose of topographic plotting. Of interest was activity in the auditory N1 and P3a. A priori, electrode Cz was selected for analysis of the auditory N1; a plot of N1 topography collapsed across conditions confirmed a Cz maximal negativity during the time period of the auditory N1. The P3a component classically displays a fronto-central maximum. A plot of the P3a time window collapsed across conditions under study suggested a maximal positivity between electrodes Fz and Cz (near the location of electrode FCz, although this electrode was not present in the 32-channel montage used within the present study). For this reason, both electrodes Fz and Cz were selected for analysis of the P3a. Time windows of the auditory N1 and P3a components were identified via inspection of the ERP collapsed across all participants and conditions under study at the stated electrodes. The time window of the auditory N1 was identified as 110–160 ms following auditory probe onset. The time window representing the P3a component of the ERP was identified as 200–400 ms following auditory probe onset.
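The baseline correction and component windows can be illustrated as follows. This sketch assumes that each component is quantified as the mean amplitude within its window; the text specifies the windows but not whether mean or peak amplitude was used. The function names and the synthetic epoch are hypothetical.

```python
def sample_index(t_ms, epoch_start_ms=-200.0, sample_rate=500.0):
    """Index of the sample at latency t_ms within an epoch starting at -200 ms."""
    return int(round((t_ms - epoch_start_ms) * sample_rate / 1000.0))


def window_mean_amplitude(epoch, win_start_ms, win_end_ms):
    """Baseline-correct one epoch and return the mean amplitude in a window."""
    onset = sample_index(0.0)
    # Baseline is the mean of the -200 ms to 0 ms pre-probe period.
    baseline = sum(epoch[:onset]) / onset
    corrected = [v - baseline for v in epoch]
    a = sample_index(win_start_ms)
    b = sample_index(win_end_ms)
    window = corrected[a:b + 1]
    return sum(window) / len(window)


# Hypothetical single-electrode epoch (-200 to 800 ms at 500 Hz, 500 samples):
# a constant 5 µV offset with a 4 µV negative deflection from 110 to 160 ms.
epoch = [5.0] * 500
for i in range(sample_index(110.0), sample_index(160.0) + 1):
    epoch[i] = 1.0

n1_uv = window_mean_amplitude(epoch, 110.0, 160.0)   # auditory N1 window
p3a_uv = window_mean_amplitude(epoch, 200.0, 400.0)  # P3a window
```

The constant offset is removed by the baseline correction, leaving a −4 µV mean in the N1 window and 0 µV in the P3a window for this synthetic epoch.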

#### Synchronization

Data from the equipment used within the study were synchronized via the lab streaming layer (LSL) software library<sup>1</sup>. LSL synchronizes the timestamps between multiple devices and computers via a network connection. Specifically, the LSL library was integrated into the driving simulator in order to synchronize the simulator with the computer responsible for the auditory probes, which utilized the LSL MATLAB library. Parallel port trigger events were additionally sent from the computer responsible for the auditory probes directly to the EEG system.

### RESULTS

Data reduction was performed using MATLAB, with statistical analyses performed using R (R Core Team, 2015). Specifically, for measures of driver behavior and electrophysiology obtained during each drive, participant attentional state was labeled according to participant response following each attentional probe presentation. Further, the response to each probe was used to label the period of data that had occurred within the 10 s prior to the onset of that probe (see **Figure 2**).

<sup>1</sup>https://github.com/sccn/labstreaminglayer

For data from the driving simulator, measures of mean activity (e.g., speed) and variance (e.g., speed variability) were derived within each 10-s period. For data from the EEG, due to the potential of EEG artifact, each 10-s period was first split into 10, 1-s epochs, with any 1-s period containing artifact removed from further analysis and only the remaining 1-s periods averaged as representative of that 10-s period. As previously noted, attentional probes that occurred within or up to 10-s following a roadway curve were excluded from analysis. Finally, all data were winsorized to ±3 standard deviations from the mean of the corresponding outcome measure (Dixon and Tukey, 1968).
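The winsorizing step can be sketched as below. The text does not state whether the mean and standard deviation were computed over the whole data set or within participant, nor whether the sample or population standard deviation was used; this sketch assumes one pooled list of values and the sample standard deviation. The function name and the example values are hypothetical.

```python
import statistics


def winsorize_sd(values, n_sd=3.0):
    """Clip values lying more than n_sd standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (assumption)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lo), hi) for v in values]


# Hypothetical outcome measure with one extreme value.
raw = [0.0] * 19 + [100.0]
clipped = winsorize_sd(raw)
```

Unlike trimming, this keeps every observation in the analysis; the extreme value is replaced by the boundary value instead of being discarded.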

Driver behavior and physiology were analyzed using linear mixed effects modeling. The linear mixed effects analyses were performed using the lme4 package (Linear Mixed-Effects Models using ''Eigen'' and S4; Bates et al., 2015) and the lmerTest package (Tests in Linear Mixed Effects Models; Kuznetsova et al., 2016) for R. As SRR is a count variable (i.e., number of reversals per second), a Poisson generalized linear mixed model was fit using lme4 and lmerTest. Because participants varied in the number of mind wandering instances reported, Satterthwaite type III approximations were used to calculate the denominator degrees of freedom. The fixed effects for all models included state (mind wandering, on task) and drive (drive one, drive two) as categorical variables, entered as sum contrasts (−1, 1), and day as a continuous variable, mean centered across the data set. Model intercepts were set as a random effect, allowing participants to have varying intercepts.
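The predictor coding described above can be illustrated in a few lines. The models themselves were fit in R with lme4 and lmerTest; the sketch below only shows how sum contrasts and mean centering transform the predictors before model fitting. Which level of each factor receives −1 is not stated in the text, so the assignment here is an assumption, as are the function and field names.

```python
def code_predictors(rows):
    """Apply sum contrasts to state and drive, and mean-center day.

    rows -- list of dicts with keys 'state', 'drive' and 'day'
    """
    # Level-to-code assignment is an assumption; only the (-1, 1) scheme is given.
    state_code = {"mind wandering": -1.0, "on task": 1.0}
    drive_code = {1: -1.0, 2: 1.0}
    mean_day = sum(r["day"] for r in rows) / len(rows)
    return [
        {
            "state": state_code[r["state"]],
            "drive": drive_code[r["drive"]],
            "day_c": r["day"] - mean_day,  # centered across the whole data set
        }
        for r in rows
    ]


# Hypothetical probe-level observations.
rows = [
    {"state": "on task", "drive": 1, "day": 1},
    {"state": "mind wandering", "drive": 2, "day": 3},
    {"state": "mind wandering", "drive": 1, "day": 5},
]
coded = code_predictors(rows)
```

With this coding, each main effect is evaluated at the average of the other factor's levels and at the mean day, which makes main effects interpretable in the presence of interactions.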

#### Questionnaire Analyses

The descriptive statistics for the questionnaire data obtained on the first day of participation are displayed in **Table 1**. However, due to the limited sample size, further analyses were not performed.

#### SART Performance

The results of the SART are not reported here in depth; however, participants attempted to perform the task, correctly responding to over 99% of the non-''3'' stimuli and correctly withholding response to around 60% of the ''3'' stimuli, on average.

#### Prevalence of Mind Wandering

Responses to the attentional probes were analyzed to identify any changes in the frequency of mind wandering, or the frequency of aware relative to unaware mind wandering, over day or drive. The frequency of mind wandering, as well as the frequency of mind wandering awareness, were tested across days and drives via generalized linear mixed effects models with logit link function, using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2016) packages for R. For mind wandering frequency, the effect of drive reached significance, such that participants were more likely to respond ''mind wandering'' during the second drive (z = 2.36, β = 0.16, p = 0.019) relative to the first drive. Neither the effect of day, nor the day by drive interaction reached significance for mind wandering frequency. For mind wandering awareness frequency, the effect of day reached significance, such that participants were more likely to be aware of their mind wandering as days of the experiment progressed, z = 4.22, β = 0.23, p < 0.001. Neither the effect of drive, nor the drive by day interaction reached significance for mind wandering awareness frequency.

Note: ARDES, Attention-Related Driving Errors Scale; ARCES, Attention-Related Cognitive Errors Scale; CFQ, Cognitive Failures Questionnaire; MAAS, Mindful Attention Awareness Scale; MWS, Mind Wandering Scale.

For illustrative purposes, the percentage of mind wandering episodes was computed as the percentage of probes that were responded to as ''mind wandering'' during each drive: (''mind wandering'')/(''mind wandering'' + ''on task''). Mind wandering awareness percentage was computed as the percentage of ''mind wandering'' responses that were subsequently responded to as ''aware'' during each drive: (''aware mind wandering'')/(''aware mind wandering'' + ''unaware mind wandering''). In general, participants often self-reported that they were mind wandering (see **Figure 3**). Collapsing across day of participation and drive within day, participants responded ''mind wandering'' for 70.10% of probe responses (SD across participants = 17.00%). Additionally, participants often self-reported they were aware that they were mind wandering at the time of the probe. Collapsing across day of participation and drive within day, participants responded that they were aware of their mind wandering for 65.00% of mind wandering responses (SD across participants = 16.54%).
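The two percentages defined above reduce to simple ratios of response counts, as the following sketch shows. The response labels, the function name and the example counts are hypothetical and are not taken from the study data.

```python
def mind_wandering_percentages(responses):
    """Compute the two descriptive percentages for one drive.

    responses -- one label per probe: 'on task', 'aware' or 'unaware',
                 where 'aware' and 'unaware' are both mind wandering responses
    """
    n_aware = responses.count("aware")
    n_unaware = responses.count("unaware")
    n_on_task = responses.count("on task")
    n_mw = n_aware + n_unaware

    pct_mw = 100.0 * n_mw / (n_mw + n_on_task)
    # Awareness is undefined when no mind wandering was reported.
    pct_aware = 100.0 * n_aware / n_mw if n_mw else float("nan")
    return pct_mw, pct_aware


# Hypothetical drive with ten probes.
responses = ["on task"] * 3 + ["aware"] * 5 + ["unaware"] * 2
pct_mw, pct_aware = mind_wandering_percentages(responses)
```

Note that the awareness percentage is conditional on mind wandering having been reported, so its denominator differs from that of the mind wandering percentage.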

#### Driving Behavior

Data from the driving simulator, including speed, lane offset and steering wheel rotation, were recorded at 30 Hz. Within each 10-s window, the raw data from the simulator were used to derive speed variability, lateral position variability, lane deviation and the SRR. Lateral position variability was defined as the vehicle's lateral position standard deviation (SDLP) in meters. Lane deviation was defined as the root mean square of the distance in meters from the center of the lane. Lastly, SRR was measured as the number of reversals per second (Hz) when steering angle passed through zero with a degree offset greater than or equal to 2 (He et al., 2014). Since SRR is a count variable requiring a Poisson analysis, it was necessary to multiply all values by 10 for this analysis. Additionally, logarithmic transformations [log10(x) + 3] were performed for speed variability, lateral position and lane deviation to account for positively skewed data. The constant of 3 was added to the logarithmically transformed variables to shift the output scale back to a positive range, for convenience.
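The window-level measures described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample standard deviation is assumed for SDLP, and the reversal-detection step for SRR is omitted. The last helper reflects that a per-second rate over a 10-s window, multiplied by 10, gives a whole-number count suitable for a Poisson model.

```python
import math
import statistics

SAMPLE_RATE_HZ = 30
WINDOW_S = 10
SAMPLES_PER_WINDOW = SAMPLE_RATE_HZ * WINDOW_S  # 300 samples per window

def sdlp(lateral_positions_m):
    """Standard deviation of lateral position (sample SD is an assumption)."""
    return statistics.stdev(lateral_positions_m)

def lane_deviation(offsets_from_center_m):
    """Root mean square of the distance from the lane center."""
    squares = [x * x for x in offsets_from_center_m]
    return math.sqrt(sum(squares) / len(squares))

def log_shift(value, constant=3.0):
    """log10(x) + 3, the transformation given in the text."""
    return math.log10(value) + constant

def srr_to_count(srr_hz):
    """Reversals per second over a 10-s window, scaled by 10 to a count."""
    return round(srr_hz * WINDOW_S)

# Synthetic window: a 0.2 Hz weave of 0.3 m amplitude around the lane center.
window = [0.3 * math.sin(2 * math.pi * 0.2 * i / SAMPLE_RATE_HZ)
          for i in range(SAMPLES_PER_WINDOW)]
print(sdlp(window), lane_deviation(window), log_shift(lane_deviation(window)))
```

SDLP and lane deviation coincide only when the mean offset is zero; a driver who holds a steady position away from the lane center has a low SDLP but a high lane deviation.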

The results of the linear mixed model for speed variability showed that drive (1, 2) was not a significant predictor (drive 1: M = 0.27, SE = 0.017; drive 2: M = 0.26, SE = 0.018), p = 0.84. There was a significant interaction between attentional state and day such that speed variability decreased during ''on task'' periods across days, t(1158) = −2.61, β = −0.03, p = 0.009 (see **Figure 4**).

Additionally, lane deviation was significantly greater for drive 2 (M = 0.36, SE = 0.015) compared to drive 1 (M = 0.34, SE = 0.013), t(1157) = −2.05, β = −0.02, p = 0.041. There was also a significant effect of state such that lane deviation was greater during the ''on task'' compared to the ''mind wandering'' attentional state, t(1164) = 2.58, β = 0.02, p = 0.01 (see **Figure 5**). However, day did not significantly predict lane deviation, and there were no significant interactions, ps > 0.05.

SDLP was also significantly greater when ''on task'' compared to ''mind wandering'', t(1165) = 2.07, β = −0.02, p = 0.04 (see **Figure 6**). There was also a significant interaction between drive and day such that SDLP significantly increased for the second drive across days, t(1157) = 2.36, β = 0.02, p = 0.02.

A Poisson mixed model showed that SRR was also significantly greater when ''on task'' compared to ''mind wandering'', z = 6.77, β = 0.16, p < 0.001 (see **Figure 7**). However, day and drive did not significantly predict SRR, and there were no significant interactions, ps > 0.05.

#### EEG Spectra

Power in the theta frequency at frontal electrode Fz and power in the alpha frequency at parietal electrode Pz were selected a priori for analysis. No effects reached significance for frontal theta power. Alpha power was increased during ''mind wandering'' periods relative to ''on task'' periods, t(1146.4) = 4.41, β = 0.43, p < 0.001. Additionally, alpha power increased over days of participation, t(1144) = 3.36, β = 0.22, p < 0.001. The main effect of alpha power at Pz is illustrated in **Figure 8**, with the topography illustrated in **Figure 9**.

#### ERP to Probe Tone

ERPs to the onset of the auditory probe were analyzed in order to determine whether the auditory probes were processed differently with respect to subjective attentional state. The auditory N1 component was analyzed at electrode Cz; however, no main effects or interactions reached significance. The P3a component was analyzed independently at electrodes Fz and Cz. For both electrode locations, the P3a component in response to the auditory probe was reduced in magnitude for probes which were subsequently responded to as ''mind wandering'' relative to probes which were subsequently responded to as ''on task'', electrode Fz: t(1115) = −3.64, β = −1.45, p < 0.001, electrode Cz: t(1115) = −2.84, β = −1.09, p = 0.005. In addition, for both electrode locations, there was a significant effect of drive, such that the P3a during the second drive was diminished relative to the first, electrode Fz: t(1108) = −3.27, β = −1.24, p = 0.001, electrode Cz: t(1108) = −2.48, β = −0.91, p = 0.013. No effects of day of participation, nor any interactions, reached significance for the P3a at either electrode location. The main effects of attentional state and drive on P3a magnitude are illustrated in **Figure 10**. The topography of the effect of attentional state is illustrated in **Figure 11**.

### DISCUSSION

The experimental design was intended to roughly simulate drives to and from work, separated by a cognitively depleting work task. In this study, participants drove the same highway route twice a day for 5 days. Between the two drives on a given day, participants completed a task requiring sustained attention, the SART. Participants self-reported their current attentional state, indicating whether they were either ''on task'' or ''mind wandering'', in response to periodic probe tones. Additionally, following mind wandering responses, participants indicated whether they were aware or not aware of their mind wandering prior to the attentional probes.

The driving scenarios were designed to be rather repetitive and monotonous in order to increase the incidence of mind wandering (Berthié et al., 2015; Thomson et al., 2015) and ensure that enough instances of mind wandering would be generated for study. On average, 70.10% of the probes in the present study were responded to as ''mind wandering'' by the study participants. In contrast, Killingsworth and Gilbert (2010) probed individuals about the content of their thoughts throughout everyday life outside the laboratory, and report that participants were thinking about something other than what they were currently doing approximately 47% of the time. The high frequency of mind wandering in the present experiment would likely be lessened if the driving scenarios were made more demanding. For example, Lin et al. (2016) report a study measuring EEG activity in a motion-base driving simulator, asking participants to detect lane departures using either visual and motion information (the simulator could move in response to road conditions such as rumble strips), or only visual information (the simulator's motion capabilities were deactivated). The authors report less DMN activity when participants were performing the more demanding version of the task without motion information. The addition of ambient traffic, or a more complex driving task that also required navigation, would also likely decrease the prevalence of self-reported mind wandering.

With consideration of previous work that has suggested increases in driver inattention as route familiarity increases (Yanko and Spalek, 2014), it was expected that mind wandering frequency might increase over the 5 days of participation. Instead, mind wandering frequency did not significantly differ over days of participation, though mind wandering did increase for the second drive relative to the first drive within the same day of participation. It is possible that the repetitive nature of the drives induced the maximum amount of mind wandering at which participants could still safely operate the simulated vehicle (a ceiling effect), thereby reducing the potential to see increased mind wandering frequency across days. The SART was performed between drives to roughly simulate performing a work task between commutes, and to ensure that enough mind wandering instances were available for study. As all participants performed the SART, we cannot comment on whether the increase in mind wandering frequency in the second drive of the present study is due to increased route familiarity, resource depletion from the SART, or a time on task effect.

Participants became significantly more aware that they were mind wandering across days. Specifically, for probe responses that were labeled mind wandering, there was a significant increase over days of participation in the frequency that participants reported that they were ''aware'' of their mind wandering at the time of the probe. It should be noted that the work of Yanko and Spalek (2014) did not query participant subjective state, and instead measured participant inattention by assessing their ability to detect hazards in the environment. The present experiment, in contrast, did ask participants to report their subjective attentional state, but did not include roadway hazards. Thus the present results may not contradict Yanko and Spalek (2013) but instead suggest a more complex relationship between driving performance and participant awareness of their attentional state. In the present experiment, when examining the 10-s periods preceding the attentional state probes, EEG alpha power increased over days of participation, in addition to more generally being elevated during periods of mind wandering. Taken together, it is possible that participants did become somewhat less attentive to the driving environment over days of participation in the present study, despite the lack of increase in subjective mind wandering reports over days of participation.

Power in the alpha band of the EEG was of greater magnitude during periods of self-reported mind wandering, relative to periods of self-reported on-task performance. Greater alpha power during periods of task inattention has also been reported for a variety of non-driving tasks. For example, O'Connell et al. (2009) report that within the continuous temporal expectancy task (CTET), the magnitude of alpha power over parietal electrodes is elevated prior to trials in which participants missed the target of interest, relative to trials in which participants correctly detected the target. Importantly, the increase in alpha band activity for miss trials was detectable up to 20 s prior to trial onset, suggesting that a slow modulation of top-down control contributes to lapses of attention within the CTET. Alpha power has further been related to inattention within a driving context more specifically. Within a driving simulator task in which participants were provided auditory notifications of lane departures, notifications that were behaviorally successful were associated with decreased alpha power following the notification, while for ineffective notifications alpha power remained elevated (Lin et al., 2013).

Although consistent with other work on lapses of attention, an increase in alpha power during mind wandering is in contrast to what has been reported for self-detected mind wandering in a breath-counting task, wherein alpha power is suppressed immediately prior to self-detected mind wandering (Braboszcz and Delorme, 2011). This apparent difference may be resolved by considering that the alpha oscillation may serve a different role within the primary task used within the present study, simulated driving, in comparison to the primary task used within Braboszcz and Delorme (2011), eyes closed breath-counting. The possibility of using alpha power as a predictor of the mind wandering state, and thus of a greater probability of missing a potentially hazardous event, warrants further research. A supplementary method of examining the likelihood that participants will miss critical events can be found in analysis of the ERPs.

Participants were periodically presented with an auditory tone, notifying them to indicate their current attentional state. Analysis of the ERP to this probe tone suggested that the P3a component was larger in response to tones that were subsequently responded to as ''on task'', relative to tones that were subsequently responded to as ''mind wandering''. As the P3a component is thought to reflect the orienting of attention towards a novel stimulus (Polich, 2007, 2011), this result supports the decoupling hypothesis of mind wandering (Smallwood, 2011) and suggests that participant attention towards the external environment was diminished during periods that were subjectively labeled as ''mind wandering'' relative to ''on task''. Despite previous reports that mind wandering modulates the amplitude of early sensory components of the ERP (Baird et al., 2014; Broadway et al., 2015), we did not observe any statistically significant modulation of the auditory N1 component to the probe tones by attentional state. As early sensory components are known to be modulated by attention (Luck et al., 2000), the lack of an effect on the auditory N1 may reflect a lack of top-down attention towards the auditory stimulus in either the mind wandering or on task state. The auditory tone in the present experiment was supra-threshold, unpredictable with respect to onset, and did not require discrimination, only simple detection in order to provoke an attentional probe response. This is in contrast to influences of mind wandering on early sensory components in previous reports, for which the ERP is time-locked to a primary task stimulus (Baird et al., 2014; Broadway et al., 2015).

There are several limitations that should be noted. Our participants were restricted to young (18–29 years) individuals free from disease and other visual or health impairments that might compromise driving. It is not known whether mind wandering might manifest similarly in an older population or in individuals with certain health disorders. Further, none of our participants were shift workers and therefore it is unknown whether disruptions in circadian rhythms might impact performance or the underlying physiology. Additionally, the results were obtained in a driving simulator under carefully controlled conditions. More variability in both environment and behavior could be expected under naturalistic driving conditions.

In summary, both driving behavior and EEG activity were sensitive, in off-line analysis, to the difference between periods of self-reported mind wandering and periods of being on task. These results are largely in line with previous studies on mind wandering during driving, and on attentional processes as assessed with EEG, and support the view that mind wandering has an impact on both driving performance and the driver's underlying physiology. Future work could extend these results by examining more closely a driver's reaction to potential hazard situations when mind wandering vs. alert. Drivers may be expected to be less likely to react appropriately to a potential hazard, particularly if it occurs in peripheral vision, since during mind wandering gaze is focused more narrowly and centrally. Further, the current results suggest drivers may be less likely to detect an auditory or visual warning while mind wandering. Future work should examine the potential for advanced auditory warnings to aid hazard mitigation in differing attentional states.

### AUTHOR CONTRIBUTIONS

CLB, DMR, DB, JDL, NL and JSH contributed to the design of the experiment. DMR and DB collected and analyzed the data. CLB, DMR, DB, NL and JSH contributed to the writing of the manuscript.

### FUNDING

This research was supported by Contract #DTNH2214C00404 from the National Highway Traffic Safety Administration.

### ACKNOWLEDGMENTS

The authors thank Steven Chong for assistance with data collection and the Federal Aviation Administration William J. Hughes Technical Center for loaning the 32-channel BrainVision EEG system for the duration of this study.



**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Baldwin, Roberts, Barragan, Lee, Lerner and Higgins. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# EEG and Eye Tracking Signatures of Target Encoding during Structured Visual Search

Anne-Marie Brouwer <sup>1</sup> \*, Maarten A. Hogervorst <sup>1</sup> , Bob Oudejans <sup>1</sup> , Anthony J. Ries <sup>2</sup> and Jonathan Touryan<sup>2</sup>

<sup>1</sup>Perceptual and Cognitive Systems, Netherlands Organisation for Applied Scientific Research (TNO), Soesterberg, Netherlands, <sup>2</sup>U.S. Army Research Laboratory, Aberdeen, MD, United States

EEG and eye tracking variables are potential sources of information about the underlying processes of target detection and storage during visual search. Fixation duration, pupil size and event related potentials (ERPs) locked to the onset of fixation or saccade (saccade-related potentials, SRPs) have been reported to differ dependent on whether a target or a non-target is currently fixated. Here we focus on the question of whether these variables also differ between targets that are subsequently reported (hits) and targets that are not (misses). Observers were asked to scan 15 locations that were consecutively highlighted for 1 s in pseudo-random order. Highlighted locations displayed either a target or a non-target stimulus with two, three or four targets per trial. After scanning, participants indicated which locations had displayed a target. To induce memory encoding failures, participants concurrently performed an aurally presented math task (high load condition). In a low load condition, participants ignored the math task. As expected, more targets were missed in the high compared with the low load condition. For both conditions, eye tracking features distinguished better between hits and misses than between targets and non-targets (with larger pupil size and shorter fixations for missed compared with correctly encoded targets). In contrast, SRP features distinguished better between targets and non-targets than between hits and misses (with average SRPs showing larger P300 waveforms for targets than for non-targets). Single trial classification results were consistent with these averages. This work suggests complementary contributions of eye and EEG measures in potential applications to support search and detect tasks. SRPs may be useful to monitor what objects are relevant to an observer, and eye variables may indicate whether the observer should be reminded of them later.

Keywords: EEG, pupil size, fixation, SRP, FRP, visual search, target detection, BCI

### INTRODUCTION

Visual search is a common task that is performed when looking for a singing bird in the trees, when checking a poster for graphical errors or when searching an environment for suspicious objects. We are interested in EEG and eye variables as potential sources of information about the underlying processes of target detection and encoding during visual search, i.e., a task where the eyes move. As discussed in detail in the ''Application'' section, monitoring target detection and encoding on the basis of implicit variables could be useful in a number of applications.

#### Edited by:

Stephen Fairclough, Liverpool John Moores University, United Kingdom

#### Reviewed by:

Juan Esteban Kamienkowski, University of Buenos Aires, Argentina

Gijs Plomp, University of Fribourg, Switzerland

#### \*Correspondence:

Anne-Marie Brouwer anne-marie.brouwer@tno.nl

Received: 03 February 2017 Accepted: 03 May 2017 Published: 16 May 2017

#### Citation:

Brouwer A-M, Hogervorst MA, Oudejans B, Ries AJ and Touryan J (2017) EEG and Eye Tracking Signatures of Target Encoding during Structured Visual Search. Front. Hum. Neurosci. 11:264. doi: 10.3389/fnhum.2017.00264

#### Brain and Eye Correlates of Target Detection

The literature describes several EEG and eye indicators of target detection. Observers usually fixate longer on targets than non-targets in a search task (e.g., Brouwer et al., 2013; Jangraw et al., 2014; Wenzel et al., 2016). Also, several studies showed stronger pupil dilation responses to target compared to non-target stimuli (Nieuwenhuis et al., 2011; Hong et al., 2014). Finally, the P300 event related potential (ERP), a positive peak occurring in the EEG signal roughly 300 ms after a sensory event, indicates that an observer's attention has been drawn. It has been shown that the P300 reliably distinguishes between top-down defined ''targets'' and ''non-targets'', e.g., in cases where observers are asked to pay attention to the letter ''p'' presented in a sequence of successively flashed letters (e.g., Farwell and Donchin, 1988).

While in studies on pupil dilation and P300 participants usually kept their eyes still, some studies showed similar findings when participants were actively and purposefully moving their eyes. Jangraw et al. (2014) found that, for a realistic visual search scenario, pupil size locked to target fixation onset increased relative to non-target fixation. There is a growing body of research on ERPs that are not time locked to stimulus onset as determined within an experimental paradigm, but to ocular-based events such as saccades and fixations (saccade-related potentials, SRPs or fixation-related potentials, FRPs respectively) as a means to determine whether observers are looking at a target (i.e., a top-down defined relevant object). For some of the early studies (e.g., Hale et al., 2008; Luo et al., 2009), differences found between target and non-target SRPs could have been due to confounding factors such as systematic target vs. non-target differences in saccade length, low level visual features, or motor preparation to press a button. However, by now it is clear that ERPs following fixation of a target are different than ERPs following fixation of a non-target and that these differences are associated with top-down stimulus processing (e.g., Dandekar et al., 2012a,b; Kamienkowski et al., 2012; Brouwer et al., 2013).

#### Brain and Eye Correlates of Missed Targets

In the current study, we examine fixation duration, pupil size and SRPs in a structured visual search task. We are especially interested in encoding failures, i.e., failing to report a fixated target. Observers can fail to report fixated targets for different reasons. They may have ''really'' missed the targets, e.g., because the type of target was difficult to identify (perceptual identification failures). Another possible reason is that after target identification, observers forgot the target before it was time to report, for example, because they were involved in another task (memory encoding failures).

In the case that observers did not identify the target, we expect eye and SRP features for misses to be similar to those for non-targets: observers simply did not perceive the target as being a target. This hypothesis is supported by research by Dias et al. (2013), who studied SRPs following fixations of not-reported targets under circumstances in which misses were likely due to not identifying the target. In their task, participants searched a display filled with rectangular objects where the target was defined by a combination of visual features that changed every trial. For instance, a target could be a vertical bar consisting of a red bar on the left and a yellow one on the right that was presented between non-targets that also consisted of vertical colored bar combinations. After finding the target, participants had to immediately report finding it. The average miss SRP could not be distinguished from the average non-target SRP, while the average hit SRP stood out and was consistent with a P300. Dias et al. (2013) found that misses were associated with relatively high EEG alpha activity, which has also been linked to lapses of attention (e.g., Vázquez Marrufo et al., 2001).

In the case that targets are identified but not reported due to a memory encoding failure, we do not expect features for misses to be the same as for non-targets. Here, misses are detected as targets at the time of fixation. However, to the extent that differences in target vs. non-target SRPs reflect differences in allocated attention, where this difference matters for storage in memory, SRP waveforms following these missed targets could fall in between target and non-target SRPs. Evidence that is partly in line with this expectation has been found in previous P300 studies. In these studies, participants were asked to remember as many words as possible of a list of sequentially presented words. P300s following presentation of words that were later remembered were compared to those that were later forgotten. Remembered words tended to correspond with larger P300s. However, these effects were small and interacted with primacy and recency effects as well as the type of rehearsal or encoding strategy of the participants (Karis et al., 1984; Fabiani et al., 1990; Azizian and Polich, 2007; Kamp et al., 2012). For instance, Azizian and Polich found larger P300 amplitudes for recalled compared to forgotten words only for words at the beginning of the list, and Fabiani et al. (1990) only found a positive relation between P300 and recall when participants used a rote strategy (repeating the words to themselves) for remembering. As of yet, we do not know whether there is a relation between the P300 and later recall of targets in a visual search task.

Fixation duration may reflect or may enable more attention and deeper processing of the fixated object. Therefore, we not only expect fixation duration to be longer for targets than for non-targets, but also for hits compared to misses.

We discussed that SRPs and fixation duration as signatures for missed targets may be in between hit target and non-target values. The same may hold for pupil size, i.e., hit targets may be associated with larger pupil size than missed targets. However, the reverse may be found as well. In case of a search task where targets are likely to be forgotten because the observer is simultaneously attending to another task, we expect momentary high workload to be associated with misses. Since workload or memory load is strongly related to pupil size (Kahneman and Beatty, 1966; Beatty, 1982; Hogervorst et al., 2014), including memory load during visual search specifically (Porter et al., 2007), we expect pupil size to be larger for misses than for hits in our task.

#### Current Study

In the current study we examine the general association between SRP and eye features on the one hand and whether a fixated object is a target and is going to be missed on the other hand. In addition, we examine how well we can distinguish between targets and non-targets, and between hits and misses on a single fixation basis and which combination of these sources of information gives us the best result. A surge of recent studies shows that it is possible to distinguish between target and non-target SRPs above chance on a single SRP basis, also in rather challenging circumstances. For instance, Ušćumlić and Blankertz (2016) show a single trial distinction when moving stimuli are involved; single trial classification has been shown for a mixture of foveally and parafoveally identified stimuli (Brouwer et al., 2014; Wenzel et al., 2016); and it has been shown when using more natural stimuli such as looking for a face in a crowd (Kaunitz et al., 2014) or when viewing signs while navigating a virtual environment (Jangraw et al., 2014). Some previous studies have shown that combining SRP features and eye related features increased classification performance for targets and non-targets (Jangraw et al., 2014; Wenzel et al., 2016). Thus, for target and non-target distinction, eye and brain signals can potentially add complementary information. It remains to be seen how this works out for the distinction between hits and misses.

In our task, participants performed a structured visual search task consisting of scanning 15 locations on a screen. Target locations were reported after scanning all locations. An auditory math task was performed in the high load condition but ignored in the low load condition. In pilot experiments we verified that performing such a double task results in failures to report targets. With respect to the SRP, we expect a larger P300 for targets compared to non-targets, and possibly a higher P300 for hit compared to missed targets. We expect longer fixation duration for targets compared to non-targets, and longer fixation duration for hits than for misses. Pupil size may be larger for targets than for non-targets, and—through a general association between high workload and pupil size—larger for misses than for hits.

### MATERIALS AND METHODS

#### Participants

Twenty-one participants (nine males, 12 females) between the ages of 19 and 30 years (average age: 23) were recruited through the participant pool of the Netherlands Organization for Applied Scientific Research (TNO). None of the participants wore glasses. Each participant received a monetary reward for his or her time and travel costs. This study was carried out in accordance with the recommendations of the Human Research Protections Official (HRPO) and the TNO Institutional Review Board (TCPE) with written informed consent from all subjects. All participants signed an informed consent form in accordance with the Declaration of Helsinki. This study was approved by the HRPO and TCPE and conducted in accordance with the Army Research Laboratory's IRB requirements (32 CFR 219 and DoDI 3216.02).

#### Materials

The task was presented on a 19″ flat-screen monitor (Dell 1907FP Flatpanel 19″, display size 37.5 × 30 cm). The screen resolution was 1280 × 1024 and the refresh rate was set at 60 Hz. Participants' eyes were located approximately 40 cm from the screen. Audio was presented through a dual speaker set (TEAC PowerMax 60/2) placed left and right of the screen.

Gaze and pupil size were recorded at 60 frames per second using SmartEyePro V6.1.6 (Smart Eye AB, Göteborg, Sweden). This system consists of two cameras (Basler acA640-120gm, HR 8.0 mm lens) placed at the left and right side of the screen.

EEG and EOG signals were recorded using an ActiveTwo MK II system (BioSemi, Amsterdam, Netherlands) with a sampling frequency of 512 Hz. For EEG, 32 active silver-chloride EEG electrodes were placed according to the 10–20 system and were referenced to the Common Mode Sense (CMS) active electrode and Driven Right Leg (DRL) passive electrode. Four EOG electrodes (BioSemi Flat Active electrodes, Amsterdam, Netherlands) were used to record eye movement. Two EOG electrodes were placed approximately 0.5 cm from the lateral canthi of both eyes, and were used to record horizontal eye movement. Another two EOG electrodes were placed above and below the left eye to record vertical eye movement and blinks. The electrode offset of all electrodes was below 25.

#### Task and Design

The experiment featured two tasks: a monitoring task and an auditory math task. In the high load condition, participants performed both tasks. In the low load condition, they only needed to perform the monitoring task, even though the math task was still played to keep auditory stimulation constant across conditions.

#### Monitoring Task

Participants were asked to monitor 15 ''systems'', represented by strings of symbols on a screen and arranged in three rows of five columns. There were three different system conditions: hidden (''####''), working as intended (''#OK#'') or system failure (''#FA#''). At the start of a trial, all system conditions were hidden. Then, each of the systems was successively highlighted for 1 s (1027 ms) by displaying a square around it while its condition changed from ''####'' into either ''#OK#'' or ''#FA#'' (**Figure 1** shows an example of the stimulus display, and also presents the dimensions of the different stimulus elements). The systems were highlighted in random order, except that two successively highlighted systems were never further apart than two steps in the horizontal direction and one in the vertical direction, or two vertical and one horizontal. The next highlighted system was in peripheral vision, such that participants could not distinguish between ''#OK#'' and ''#FA#'' without making a saccade. After all system conditions had been shown, empty boxes appeared at the system locations and the participant had to indicate which systems had failed during the trial by clicking the appropriate boxes with the left mouse button. When finished, the participant pressed an OK button at the top left of the screen. On every trial, two, three or four ''#FA#''s were presented. The number (two, three or four) and the locations of the ''#FA#''s were chosen randomly.

#### Math Task

The math task was an aurally presented sum consisting of six numbers between 6 and 12. Only addition (+) and subtraction (−) operations were used. An example is ''8−6 + 10−12 + 11 + 7''. The first number was presented 1 s after the start of the monitor task, and every 2660 ms another number was presented. Thus, the last number was presented after 14.3 s. Performing this task involves attention and working memory. When participants had to perform the math task (i.e., in the high load condition), they were required to give the answer of the sum after having indicated where the ''#FA#''s were located. This was done by typing the answer and pressing enter. In order to motivate participants to perform the math task, they received feedback on their answer. If the answer was incorrect, the correct answer was shown.
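The presentation schedule described above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are ours, and the assumption that the monitoring sequence lasts 15 × 1027 ms is derived from the monitoring task description rather than stated explicitly.

```python
# Onset times of the six spoken numbers, relative to the start of the monitoring task.
FIRST_ONSET_S = 1.0   # first number is presented 1 s after trial start (from the text)
INTERVAL_S = 2.660    # a new number every 2660 ms (from the text)
N_NUMBERS = 6         # the sum consists of six numbers (from the text)

onsets = [FIRST_ONSET_S + i * INTERVAL_S for i in range(N_NUMBERS)]
last_onset = onsets[-1]

# Assumption: 15 systems, each highlighted for 1027 ms, shown back to back.
monitoring_duration_s = 15 * 1.027

print([round(t, 2) for t in onsets])
print(round(last_onset, 1), "s is before", round(monitoring_duration_s, 1), "s")
```

Under these assumptions the last number is spoken shortly before the final system is highlighted, so the math task spans nearly the whole monitoring sequence.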

For each of the load conditions, participants performed eight blocks of 11 trials. High and low load conditions were presented alternately, starting with the high load condition.

#### Procedure

After a general explanation and the signing of the informed consent form, the EEG and EOG electrodes were attached. During this time, the participant had time to read detailed instructions. Participants were asked to take a comfortable position in front of the screen. Even though they were able to move freely, they were instructed to minimize their head and body movements. Before the task began, a four-point calibration was performed to calibrate the SmartEye system. There was a break of a few minutes after eight blocks of trials, i.e., half-way through the experiment. The participants indicated when they were ready to continue.

### Analysis

#### Electrophysiological Data

EEG and EOG data were resampled to 256 Hz. Bad EEG channels were identified as channels with standard deviations exceeding five times the median standard deviation over all channels, after bandpass filtering between 0.5 and 32 Hz. This affected 1–3 channels for five participants. The unfiltered data from bad channels were replaced by the weighted average of unfiltered data from the surrounding channels. Next, the EEG data were re-referenced to the mean of all unfiltered data excluding the bad channels. The resulting signals were bandpass filtered between 0.5 and 32 Hz.
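The bad-channel criterion can be illustrated with a short sketch. It is not the analysis code used in the study: the function names, the synthetic data and the neighbour weights are hypothetical, and the bandpass filtering step that precedes the criterion is omitted.

```python
import math
import statistics

def find_bad_channels(data, factor=5.0):
    """Return indices of channels whose SD exceeds `factor` times the median SD."""
    sds = [statistics.pstdev(channel) for channel in data]
    threshold = factor * statistics.median(sds)
    return [i for i, sd in enumerate(sds) if sd > threshold]

def replace_with_neighbours(data, neighbour_weights):
    """Weighted average of neighbouring channels (weights are an assumption here)."""
    total = sum(neighbour_weights.values())
    n_samples = len(next(iter(data)))
    return [
        sum(w * data[j][t] for j, w in neighbour_weights.items()) / total
        for t in range(n_samples)
    ]

# Synthetic example: three well-behaved channels and one high-variance channel.
clean = [[math.sin(0.1 * t + k) for t in range(200)] for k in range(3)]
noisy = [20.0 * math.sin(0.37 * t) for t in range(200)]
data = clean + [noisy]

bad = find_bad_channels(data)
for idx in bad:
    # Hypothetical choice: channels 0 and 2 are the neighbours, equally weighted.
    data[idx] = replace_with_neighbours(data, {0: 1.0, 2: 1.0})
print(bad)
```

In practice the weights would depend on electrode distances, which the text does not specify.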

We extracted saccades to divide the data into saccade-locked segments. This was done as follows (see also **Figure 2**). First, noise was removed from the horizontal and vertical EOG by detecting values that exceeded five times the standard deviation; the signal around these peaks was cut out and interpolated. Next, blinks were detected and removed in the vertical EOG: after bandpass filtering between 2 Hz and 100 Hz, peaks exceeding a threshold of three times the standard deviation were considered to be blinks, removed and interpolated. Derivatives of the cleaned vertical and horizontal EOG signals were calculated using a derivative Gaussian filter with a standard deviation (sigma) of eight samples (about 31 ms). Values exceeding four times the standard deviation were associated with potential stimulus-to-stimulus saccades. We then looked at candidate saccades occurring between 100 ms and 800 ms after the next location was highlighted on the screen. The first saccade for which the sign of the HEOG signal matched the direction of the stimulus-to-stimulus transition was selected as the saccade of interest. If no match was found in the HEOG signal, we looked for saccades in the VEOG with a sign that matched the vertical saccade jump. In 10% of the data no matching saccade was found. The EEG and EOG data were split into segments starting from the point of the highest saccade speed to 1 s after, and baselined using the first 100 ms of the epoch. EEG epochs with extremely high variance (standard deviation exceeding 50 times the standard deviation) were discarded as outliers.

FIGURE 2 | Illustration of extracting saccades of interest for a sample trial. (A) Detecting noise (indicated in red) in EOG. This example shows the horizontal EOG. Noise was cut out and interpolated. (B) Detecting blinks in vertical EOG. Blinks were cut out and interpolated. (C) Detecting potential saccades of interest in the derivative of the cleaned EOG. This example shows the horizontal EOG.
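The saccade-candidate step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the peak-picking rule, the edge handling, the sign convention (positive meaning a rightward or upward jump) and all function names are ours. Only sigma (eight samples), the threshold (four standard deviations), the 256 Hz rate and the 100–800 ms window come from the text.

```python
import math
import statistics

def gaussian_derivative_kernel(sigma):
    """First derivative of a unit-area Gaussian, truncated at 4 sigma."""
    radius = int(4 * sigma)
    norm = sigma ** 3 * math.sqrt(2.0 * math.pi)
    return [-x / norm * math.exp(-x * x / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]

def smoothed_derivative(signal, sigma=8):
    """Convolve with the derivative-of-Gaussian kernel (edges are clamped)."""
    kernel = gaussian_derivative_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = min(max(i - (k - radius), 0), n - 1)
            acc += h * signal[j]
        out.append(acc)
    return out

def candidate_saccades(signal, sigma=8, n_sd=4.0):
    """Return (sample index, sign) of velocity peaks above n_sd standard deviations."""
    velocity = smoothed_derivative(signal, sigma)
    threshold = n_sd * statistics.pstdev(velocity)
    peaks = []
    for i in range(1, len(velocity) - 1):
        v = abs(velocity[i])
        if v > threshold and v >= abs(velocity[i - 1]) and v > abs(velocity[i + 1]):
            peaks.append((i, 1 if velocity[i] > 0 else -1))
    return peaks

def first_matching_saccade(peaks, onset, direction, fs=256, window_s=(0.1, 0.8)):
    """First candidate 100-800 ms after stimulus onset whose sign matches `direction`."""
    lo = onset + round(window_s[0] * fs)
    hi = onset + round(window_s[1] * fs)
    for idx, sign in peaks:
        if lo <= idx <= hi and sign == direction:
            return idx
    return None

# Synthetic horizontal EOG: a single positive step at sample 300 (arbitrary units).
heog = [0.0] * 300 + [100.0] * 300
peaks = candidate_saccades(heog)
saccade = first_matching_saccade(peaks, onset=250, direction=1)
print(peaks, saccade)
```

The sample returned by `first_matching_saccade` corresponds to the point of highest saccade speed, which is where the text places the start of each epoch.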

#### Eye Tracking Data

One participant was excluded from the eye data analyses, because of technical difficulties with the eye tracking hardware during this measurement. After the measurement, the fixation locations were recalibrated using the 15 displayed stimulus positions to obtain higher gaze localization accuracy. Fixations were considered to be on the stimulus when the fixation position was within a radius of 150 pixels (4.4 cm or 6.4°) from the center of the current stimulus location. Fixation duration was determined as the time that eye fixation was on the stimulus. Only valid samples in a window starting at stimulus onset until 2 s after were taken into account. Pupil size was determined as the mean of the pupil size values over these same samples.
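A minimal sketch of the two eye measures, under stated assumptions, is given below. The sample format, the 60 Hz sampling period and the function name are hypothetical. We also assume that "these same samples" refers to all valid samples in the 2 s window; the text could also be read as referring only to samples on the stimulus.

```python
import math

def fixation_metrics(samples, centre, sample_period_s, radius_px=150.0, window_s=2.0):
    """Return (fixation duration in s, mean pupil size) for one stimulus.

    samples: dicts with keys t (s after stimulus onset), x, y (pixels),
    pupil (arbitrary units) and valid (bool); evenly spaced in time.
    """
    in_window = [s for s in samples if s["valid"] and 0.0 <= s["t"] <= window_s]
    on_stimulus = [
        s for s in in_window
        if math.hypot(s["x"] - centre[0], s["y"] - centre[1]) <= radius_px
    ]
    duration = len(on_stimulus) * sample_period_s
    pupil = sum(s["pupil"] for s in in_window) / len(in_window) if in_window else None
    return duration, pupil

# Synthetic trial: gaze arrives on the stimulus at sample 30 and leaves at sample 90;
# the last 20 samples are flagged invalid (e.g., tracking loss).
period = 1.0 / 60.0
samples = []
for i in range(120):
    on = 30 <= i < 90
    samples.append({
        "t": i * period,
        "x": 500.0 if on else 0.0,
        "y": 300.0 if on else 0.0,
        "pupil": 4.5 if on else 4.0,
        "valid": i < 100,
    })

duration, pupil = fixation_metrics(samples, (500.0, 300.0), period)
print(round(duration, 3), round(pupil, 3))
```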

#### Classification

For classification, we used linear SVM models that were trained to distinguish between either targets vs. non-targets (for each of the load conditions) or hits vs. misses (in the high load condition) using 5-fold cross-validation. Classification was performed using the Donders machine learning toolbox developed by van Gerven et al. (2013) and implemented in the FieldTrip open source Matlab toolbox (Oostenveld et al., 2011). The features were standardized to have mean 0 and standard deviation 1 on the basis of data from the training set. Included features were EEG voltages over a time interval of 250–1000 ms after peak velocity of the stimulus saccade, in which all EEG electrodes were included. In order to examine potential information from EOG leaking into EEG, we also used EOG voltages over the same time interval as features. Different models were trained using different combinations of EEG (i.e., SRP) features, EOG features, fixation duration and pupil size. Classification was performed separately for each participant and each load condition. Random selections of non-targets and hits were used in the training sets in order to match the numbers of available target and miss epochs to ensure balanced training of the model. For each participant, each load condition, each type of distinction (targets vs. non-targets and hits vs. misses) and each model, we determined whether classification was above chance using a binomial test. An alpha level of 5% was used.
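To make the above-chance criterion concrete, the sketch below computes a binomial p-value for a given number of correctly classified epochs. It assumes a one-sided test against a 50% chance level, which the text does not specify. The epoch counts are hypothetical and chosen only to show how the outcome depends on the amount of data.

```python
from math import comb

def binomial_p_above_chance(n_correct, n_trials, chance=0.5):
    """One-sided p-value: P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(
        comb(n_trials, k) * chance ** k * (1.0 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )

# Hypothetical counts, for illustration only (not taken from the study).
p_many = binomial_p_above_chance(232, 400)  # 58% correct on 400 test epochs
p_few = binomial_p_above_chance(29, 50)     # 58% correct on 50 test epochs
print(p_many < 0.05, p_few < 0.05)
```

With these made-up counts, the same 58% accuracy passes the 5% criterion with 400 epochs but not with 50. The test is therefore less sensitive for distinctions with few available epochs.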

### RESULTS

### Behavioral Performance

Performance on the secondary (math) task was on average 62% correct (SD 17%) indicating that the secondary task was quite difficult and performance varied strongly between subjects. High workload data from both trials with correct and incorrect responses to the math task were included in the analyses. Note that performance on the math task is no direct indicator of workload. Performance could be high because a participant tried hard (high load) or, for that participant, the sum was easy (low load). Conversely, low performance could be caused by lack of trying (low load) or because the participant simply did not manage, despite trying hard (high load). There was no evidence for participants choosing to focus on the one rather than the other as indicated by the lack of (negative) correlation between participants' performance on the math task and performance on the monitoring task (Pearson correlation: r = 0.25, p = 0.27).

**Figure 3A** shows the hit rate of the primary task (defined as the proportion of ''#FA#'' targets whose location was correctly indicated) for successive blocks (of 11 trials each) in the high and low workload conditions. There seems to be some indication of a learning effect in the high load condition, with performance increasing up to block 5. **Figure 3B** shows the hit rate as a function of when the target was presented within a trial. For the high load condition, it is clear that targets presented at the beginning or the end of a trial are remembered better than the ones in between. This is consistent with primacy and recency effects.

As expected and intended, the hit rate in the high load condition was much lower (average hit rate of 0.73, SD 0.13) than in the low load condition (average hit rate of 0.96, SD 0.04). Except for one participant who missed only nine targets, all participants missed at least 22 targets in the high load condition, with an average of 72 missed targets (range: [9, 133]). In the low load condition, the average number of missed targets was eight (with a range of [0, 32]). We consider an average of eight missed targets in the low load condition too few for meaningful hit vs. miss comparisons in this condition.

Missed targets may or may not be accompanied by wrongly identified targets (i.e., false alarms); in the latter case, fewer targets are reported than were presented. We found that, in general, the latter is the case. In the high load condition, the number of false alarms was on average 42 (range of [5, 132]). Given the number of missed targets, this means that participants reported on average 30 fewer targets than were actually shown (with a range of [−103, 10]). While brain processes will differ between the case in which a target is not identified as a target (which necessarily leads to the target not being reported at all) and the case in which a target has been identified but not properly encoded (which, depending on the reporting strategy of the participant, could lead to not reporting it at all or to an accompanying false alarm), we do not distinguish between not reporting a target at all and indicating a wrong location in this study. Given the design of our experiment, and the results described in the following, we take both to mean that the target has not been properly encoded. However, it is important to keep in mind that treating these results as the same may not be appropriate under other circumstances.

#### Saccade Latencies

**Table 1** shows the saccade latencies, defined as the time of peak velocity of the stimulus saccade relative to stimulus onset. SRP onsets were determined relative to this point. Latencies toward targets that are missed are longer than toward hit targets (t(19) = 2.18, p = 0.04 for the low load condition; t(19) = 4.85, p < 0.01 for high load). Saccade latencies toward targets are shorter than toward non-targets. Although the difference is small (8 ms in both low and high load conditions), it is statistically significant (t(19) = 2.83, p = 0.01 for low load; t(19) = 2.12, p = 0.05 for high load). We think that this effect is mediated by a longer fixation duration for targets than non-targets (see ''Eye—Fixation Duration'' Section). The object fixated prior to a target is more likely to have been a non-target (which does not make the eyes linger for long and thus results in a short saccade latency for the next object) than the object fixated prior to a non-target.

TABLE 1 | Means and standard deviations of saccade latency (time of peak velocity of the stimulus saccade relative to stimulus onset), separately for each low and high load condition, targets, non-targets, hits and misses.

The gray font of the hits and misses in the low load condition signifies that because of few misses in the low load condition, the hit trials are about the same as the target trials and the misses are represented by few data points.

### EEG—SRP

**Figure 4** shows SRP traces associated with targets, non-targets, hits and misses for each of the two load conditions, averaged across participants and electrode locations around Pz (CP1, P3, Pz, PO3, PO4, P4, CP2). In the lower part of the figures we indicated individual time samples (3.9 ms) that were significantly different for target vs. non-target (light gray) and hit vs. miss (dark gray) as indicated by paired t-tests (alpha level of 5%). Around 500 ms, target SRPs are larger than non-target SRPs, which is consistent with a stronger P300 for targets than non-targets. The difference appears stronger in the low load condition: only in this condition did the difference between target and non-target traces reach significance for an uninterrupted interval of almost 300 ms. After correcting for multiple testing (Benjamini and Hochberg, 1995), this interval is reduced to around 150 ms (indicated by the bold, black line in **Figure 4**). When examining traces for each load condition separately, hit and miss traces did not differ for a substantial period of time. When collapsing across load conditions, the higher values for miss traces towards the end of the epoch become significant. However, it is clear that at least up to 450 ms, miss traces overlap with hit traces. Miss traces certainly do not lie in between hit and non-target traces.
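The effect of the multiple-testing correction on the per-sample tests can be illustrated with a sketch of the Benjamini-Hochberg step-up procedure. The function name and the p-values below are hypothetical and are not taken from the study. The sketch only shows why fewer time samples remain significant after correction.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

# Hypothetical per-sample p-values (e.g., ten consecutive time samples).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
uncorrected = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values)
print(sum(uncorrected), sum(corrected))
```

In this made-up example five samples fall below the uncorrected 5% level, but only the two smallest p-values survive the correction.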

Exploring average traces for all individual channels revealed no significant differences between targets and non-targets, or between hits and misses, in the math (high load) condition. For the non-math (low load) condition, we found significant effects as reflected in **Figure 4** (higher voltages for targets compared to non-targets around 500 ms) for channels CP1, Pz and P4. P7 showed a lower voltage for targets compared to non-targets around 250 ms, and F8 a lower voltage around 450 ms.

#### Eye—Fixation Duration

**Figure 5** shows average fixation duration. As expected, fixation duration was longer for targets than for non-targets, both in the high and in the low load condition (paired t-tests, t(19) = 11.47, p < 0.01; t(19) = 11.43, p < 0.01). Fixation duration was longer for hits than for misses in the high load condition (paired t-test, t(19) = 2.38, p = 0.03). The same trend was seen in the low load condition. No significant difference was found in fixation duration between high and low workload conditions (paired t-test, t(19) = 1.40, p = 0.18).

### Eye—Pupil Size

**Figure 6** shows pupil size. For both high and low load conditions, pupil size did not differ significantly between targets and non-targets (paired t-test, t(19) = 0.54, p = 0.59; t(19) = 0.12, p = 0.91). As expected, pupil size was significantly larger during the high load condition compared to the low load condition (paired t-test, t(19) = 12.5, p < 0.01). Additionally, pupil size was found to be larger for misses than for hits in the high workload condition (paired t-test, t(19) = 4.25, p < 0.01). The same trend was found in the low load condition.

#### Single Trial Analysis

**Figure 7** shows the classification accuracy between targets and non-targets (blue bars) separately for the low and high load conditions, and hits and misses (red bars) for the high load condition. As expected from the average SRPs as presented above and from previous studies, single fixation classification was possible for target vs. non-targets based on SRP in the low load condition (on average 65% correct, with performance significantly above chance for 13 out of 21 participants according to binomial tests). For the high load condition average classification performance on the basis of SRP reaches 59% correct (7 out of 21 participants above chance). For both load conditions, classification accuracy between target and non-targets is highest when SRP features are used. Adding fixation duration and pupil size does not (substantially) improve performance.

Another picture emerges for the distinction between hits and misses, which we could only meaningfully examine for the high load condition due to the very small number of misses in the low load condition. Classification based on SRPs is around chance level (52% correct), while classification based on pupil size results in an average single fixation classification accuracy of 58%. Adding fixation duration does not (substantially) improve performance; adding SRP rather decreases performance. Distinguishing between hits and misses did not reach above chance performance for single subjects (note that the power of the binomial tests is much weaker for hits vs. misses compared to targets vs. non-targets where more data was available).

FIGURE 7 | Classification results of models averaged over individual participants that distinguish between target and non-target fixations (blue bars) and hits and misses (red bars), separately for data from the low load condition (upper half of the figure; no results for hit and miss due to few misses) and the high load condition (lower half of the figure). On the left the features that the models were based on are indicated: SRP, fixation duration, pupil size, combined eye features, and a combination of all features. Error bars indicate standard errors of the mean.

FIGURE 8 | Classification results of models averaged over individual participants that distinguish between target and non-target fixations (blue bars) and hits and misses (red bars), separately for data from the low load condition (upper half of the figure; no results for hit and miss due to few misses) and the high load condition (lower half of the figure). On the left the features that the models were based on are indicated: SRP, EOG, and a combination of SRP and EOG features. Error bars indicate standard errors of the mean.

**Figure 8** shows results for classification based on EOG features, and EOG and SRP features combined, with the result of the SRP based models as a comparison. When using EOG, classification accuracy between either targets and non-targets or hits and misses does not rise over 52% and adding EOG to SRP does not improve classification.

### DISCUSSION

We investigated a structured search task where fixated targets that observers can easily identify are relatively likely to remain unreported due to a concurrent secondary task that is expected to interfere with working memory. We found that in such a case, SRPs differ between targets and non-targets (consistent with a target eliciting a P300), but not between hit targets and missed targets. Fixation duration was longer for targets than non-targets as expected, and also for hits compared to misses. Pupil size did not differ between targets and non-targets, but was larger for misses than for hits. In sum, EEG features appeared more suitable to distinguish targets from non-targets while eye features (especially pupil size) were suitable to distinguish fixated targets that were subsequently reported from those that were not. These results were also reflected in single trial classification analyses.

We interpret our findings as reflecting distinct underlying cognitive processes as discussed further below.

#### SRP P300

Differences between target and non-target SRPs are as expected, with a higher amplitude P300 for targets compared to non-targets. This difference is smaller between the average target and non-target SRP traces in the high compared to the low load condition. Previous studies also found the P300 to be less pronounced under high load conditions (Allison and Polich, 2008; Gherri and Eimer, 2011; Pratt et al., 2011). Also for ERPs following fixations, smaller target P300s during high workload compared to low workload have been reported before (Ries et al., 2016). This reducing effect of workload on the P300 is consistent with fewer attentional resources being available for the target detection task.

While the P300 reflects attentional processes, this did not translate to larger P300s for hit compared to missed targets. Clearly, SRPs associated with missed targets more closely resembled hit target rather than non-target SRPs. This suggests that, as we intended, there is no problem in target identification and that the amount of attention allocated to the fixated object around the time of fixation is not critical for its encoding in memory.

### Fixation Duration

As expected, fixation duration is longer for targets than for non-targets. We also found it to be longer for hits than for misses. A short fixation duration being associated with misses could in principle be caused by participants moving the eyes from the target to the next stimulus too quickly, or by participants lingering relatively long on the stimulus that was fixated before the target, which causes a late arrival on the target and leaves less time for target fixation. Comparing the hit-miss difference in saccade latency to the hit-miss difference in fixation duration indicates that misses are mostly due to arriving late rather than leaving too early.

Note that in the present structured search task, potentially relevant information appeared every second, encouraging rather long, and relatively invariable fixation durations. Thus, in different types of search tasks fixation duration effects can be expected to be stronger.

Also note that the rather strong association between fixation duration and hits and misses could potentially have led to confounding effects in the SRPs, i.e., differences between hit and miss SRPs that cannot be attributed to memory or attentional processes but are e.g., due to visual processes caused by systematic timing differences in the change of stimulus appearance. In this study, we did not find any significant difference between hit and miss SRPs, but this should be kept in mind for similar future studies.

#### Pupil Size

Pupil size is larger in the high load compared to the low load condition. An increased pupil size with an increase in memory load or workload is a robust finding in the literature (Kahneman and Beatty, 1966; Beatty, 1982; Hogervorst et al., 2014). We also observed the hypothesized effect of larger pupil size during missed targets, consistent with a momentary high workload being the cause of the miss. Such momentary increase in cognitive workload could have been caused by a difficult (part in the) math trial, or by having to store yet another target. Also, momentary fluctuations could be caused by participants sometimes giving up on the math task half way during a trial. Our data do not reflect differences in phasic pupil size responses—there was no significant difference in pupil size between targets and non-targets as determined by the median size over the fixation interval. While our dependent measure of pupil size was not optimal to capture differences in pupil dilation that start to diverge at longer latencies and reach their maximum difference at about 1.5 s post fixation (Hong et al., 2014; Jangraw et al., 2014), averaged pupil size traces in our data did not show the expected differential pattern for targets and non-targets. At present, we do not know why this difference was not observed. The average pupil size was quite large (around 4.5 mm diameter) which may have played a role in attenuating the phasic response.

In sum, SRP and eye features provide complementary information in the search task under study. Saccade-related P300s signify that a target has been detected but not that it has been encoded for successful recall. For that, fixation duration and pupil size are informative, likely because they respectively reflect time taken to store the target and variations in workload associated with the secondary task.

#### Application

Determining whether an observer is looking at a target (i.e., at an object that is of interest) using SRPs and eye related features may be useful in two general areas. If it is not known what visual information is important (in a particular situation, or for a particular person) such features provide a means to examine what is of interest to an observer without requiring conscious behavioral responses. If it is known what information is important, brain and eye signals can be used to judge whether this important information is properly recognized as being important. If we know whether an observer is gazing at relevant information (as judged from the brain and eye signals, or as a priori knowledge of the task), determining whether this information is going to be remembered could be used for deciding whether or not to present the information again or in another way. Target recognition and encoding indicators may also be used as indicators of (momentarily) suboptimal performance so that the system can advise the human observer or operator to take a break. The interplay between user state detection and responses to individual targets may be used in higher level classification strategies. If observers are detected to be in a high memory load situation (because of a large pupil size and EEG features), a model that classifies fixated targets (as detected through SRP) into hits and misses may be activated.

In addition to physiological or eye-based sources of information, task- and context-related information can be used directly in models for state monitoring and behavior prediction. For instance, knowledge of the presence of an additional or difficult task will add to the likelihood of misses, and indicate that a classifier distinguishing between hits and misses should be put to work. In some cases, knowledge of the task may strongly influence the interpretation of eye and brain signals that occur while searching and storing visual information. As mentioned before, low workload as may be indicated by a small pupil size, can be associated with misses in cases that a target is difficult to identify and there are no other tasks (the situation as in Dias et al., 2013) or it can be associated with hits in cases that a target is easy to identify and there is a concurrent task (current study). Also, the type of search task influences which physiological or eye based features can be expected to be informative. When there is no time pressure, fixation duration is expected to be more informative compared to when there is.

The certainty with which one can determine, at the level of an individual fixation, whether an observer is looking at a target or is going to miss a target can never be perfect. The highest fixation classification accuracy found in the current study is 65%, obtained when distinguishing targets from non-targets in the low load condition. While it is obviously difficult to compare classification performance across studies, a classification accuracy of around 65% (with a 50% chance level) has also been found in other studies distinguishing targets from non-targets using SRPs (Brouwer et al., 2013: 62%; Wenzel et al., 2016: 67%; Kaunitz et al., 2014: 63%). SRP studies that report a relatively good performance (Touryan et al., 2016: up to around 82%; Ušćumlić and Blankertz, 2016: a mean AUC of almost 80%) employed methods of subtracting out eye movement artifacts. Brouwer et al. (2014) found a decrease of the Equal Error Rate for distinguishing between target and non-target SRPs from 40% to 31% when removing eye artifacts. Amongst a range of possible techniques to boost classification performance, removing eye movement components from SRP traces seems to be an important one.

For real applications it is also important to consider the context and circumstances in which uncertain information about ''targetness'' or memory encoding, based on eye and brain signals, is expected to add to other available or easy-to-retrieve information such as behavioral data (Brouwer et al., 2015). Preferably, this would be in an environment with minimal noise affecting the brain and eye signals, but also in situations where the signal is strong; e.g., low workload scenarios are beneficial when exploiting the P300 signal to distinguish between targets and non-targets (current study, Thurlings et al., 2013; Ries et al., 2016). An example that may capture these elements is robust image triage. If a subset of rapidly presented images is identified for careful review and inspection (e.g., x-ray images of luggage), it would be prudent to include images with a relatively high risk of (missed) targets.

### AUTHOR CONTRIBUTIONS

All authors contributed to the conception of the study, analysis and writing.

#### FUNDING

This research was sponsored by the US Army Research Laboratory and was accomplished under Contract Number W911NF-10-D-0002.

#### ACKNOWLEDGMENTS

We would like to thank Kaleb McDowell, Oded Flascher and Jan van Erp for making this research collaboration possible. We also thank the reviewers for their help on improving the manuscript.


**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Brouwer, Hogervorst, Oudejans, Ries and Touryan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Detecting Pilot's Engagement Using fNIRS Connectivity Features in an Automated vs. Manual Landing Scenario

#### Kevin J. Verdière\*, Raphaëlle N. Roy and Frédéric Dehais

ISAE-SUPAERO, Institut Supérieur de l'Aéronautique et de l'Espace, Université Fédérale de Midi-Pyrénées, Toulouse , France

Monitoring pilots' mental states is a relevant approach to mitigate human error and enhance human machine interaction. A promising brain imaging technique to perform such a continuous measure of human mental state under ecological settings is Functional Near-InfraRed Spectroscopy (fNIRS). However, to our knowledge no study has yet assessed the potential of fNIRS connectivity metrics for passive Brain Computer Interfaces (BCI). Therefore, we designed an experimental scenario in a realistic simulator in which 12 pilots had to perform landings under two contrasted levels of engagement (manual vs. automated). The collected data were used to benchmark the performance of classical oxygenation features (i.e., Average, Peak, Variance, Skewness, Kurtosis, Area Under the Curve, and Slope) and connectivity features (i.e., Covariance, Pearson's, and Spearman's Correlation, Spectral Coherence, and Wavelet Coherence) to discriminate these two landing conditions. Classification performance was obtained by using a shrinkage Linear Discriminant Analysis (sLDA) and a stratified cross validation using each feature alone or by combining them. Our findings showed that the connectivity features performed significantly better than the classical concentration metrics, with a higher accuracy for the wavelet coherence (average: 65.3/59.9 %, min: 45.3/45.0, max: 80.5/74.7 computed for HbO/HbR signals respectively). A maximum classification performance was obtained by combining the area under the curve with the wavelet coherence (average: 66.9/61.6 %, min: 57.3/44.8, max: 80.0/81.3 computed for HbO/HbR signals respectively). In general, all connectivity measures allowed an efficient classification when computed over HbO signals. Those promising results provide methodological cues for further implementation of fNIRS-based passive BCIs.

Keywords: fNIRS, passive brain-computer-interface, classification, functional connectivity, wavelet coherence, engagement

#### Edited by:

Christian Herff, University of Bremen, Germany

#### Reviewed by:

Felix Scholkmann, University Hospital Zurich, Switzerland

Anne-Marie Brouwer, Netherlands Organisation for Applied Scientific Research (TNO), Netherlands

#### \*Correspondence:

Kevin J. Verdière kevin.verdiere@isae-supaero.fr

Received: 10 September 2017 Accepted: 08 January 2018 Published: 25 January 2018

#### Citation:

Verdière KJ, Roy RN and Dehais F (2018) Detecting Pilot's Engagement Using fNIRS Connectivity Features in an Automated vs. Manual Landing Scenario. Front. Hum. Neurosci. 12:6. doi: 10.3389/fnhum.2018.00006

### 1. INTRODUCTION

It is widely accepted that pilot error represents a major cause of aircraft crashes (Li et al., 2001), being more frequently cited than mechanical failure. Safety statistics show that the progressive introduction of automation in the cockpit since the 1960's has improved safety, with modern "computerized" cockpits taking pride in an accident rate half that of the previous generation of aircraft. However, it appears that such technologies have created a new category of potentially deadly incidents whereby crews are unable to comprehend the situation presented before them, and persevere in erroneous decision-making (Dehais et al., 2010, 2015). This is especially true for the final approach and landing phases that represent almost half the on-board accidents and fatal accidents (Myers and Arnold, 2016).

Without question, the development of automation has dramatically changed the role of the crew from "direct (manual) controllers" to "system supervisors/decision makers." Both increased trust in automation and the complexity of these computerized systems (Sarter et al., 1997; Dehais et al., 2012; Tessier and Dehais, 2012) reduce crews' basic flying abilities and leave them ill-equipped to cope with emergency situations when automation fails (Mumaw et al., 2001). Another drawback of automation is that it imposes long periods of inactivity and thus dramatically decreases pilots' vigilance (Wright and McGown, 2001). For instance, some recent surveys disclosed that 56% of British Airways pilots experienced sleep while on duty (Steptoe and Bostock, 2012; Reis et al., 2013). These operational situations show that automation can shift pilots' engagement from a very low engagement state (disengagement), which induces low vigilance and mind wandering, to a very high engagement state (over-engagement), which leads to perseveration and attentional tunneling (Wickens and Alexander, 2009). These extreme cognitive states may jeopardize safety and call for the introduction of monitoring solutions.

The idea of introducing physiological data into human-machine interfaces, called "Physiological Computing" (Fairclough, 2008), could allow the system to take the operator's state into account. Brain monitoring techniques such as passive Brain Computer Interfaces (pBCI) have shown their ability to detect and characterize several operator mental states such as workload, fatigue or, more generally, engagement (Zander et al., 2010; Zander and Kothe, 2011; Khan and Hong, 2015; Roy and Frey, 2016). A system capable of continuously monitoring the operator, or of detecting degraded states, could adapt to these changes to optimize both safety and performance. Such closed-loop systems are the ultimate goal of neuroadaptive technology.

While the real-time identification of these degraded mental states remains a challenge, a reasonable first step is to characterize brain activity when flying with and without automation. One possible way to meet this goal is to use functional near-infrared spectroscopy (fNIRS). Less popular in the BCI community than electroencephalography (Cutini and Brigadoi, 2014), mainly due to its low temporal resolution, this brain imaging technique presents several advantages for ecological settings, as its signal is less affected by electrical and motion artifacts. Moreover, its high spatial resolution gives direct access to specific brain structures without additional computational cost, as far as cortical areas are concerned. Thus, several studies have shown the potential of fNIRS to infer several mental states under laboratory settings or ecological settings such as flight simulators (Ayaz et al., 2012; Gateau et al., 2015).

Classically, authors have used the relative variation of local HbO and HbR concentration and related features (e.g., slope, area under the curve, skewness) to relate cerebral activation to specific cognitive tasks (Tai and Chau, 2009; Durantin et al., 2015; Gateau et al., 2015). Yet the goal is always to improve the estimation, especially in critical settings. A solution proposed by some authors is to use connectivity measures (Borghini et al., 2014) to account for brain dynamics (for a review on functional connectivity see Bastos and Schoffelen, 2015). Indeed, cognition cannot be reduced to the activation of specialized brain areas but should rather be seen as the cooperation among large-scale distributed neural networks (Siegel et al., 2012; Hutchison et al., 2013; van den Heuvel and Sporns, 2013). In other words, examining spontaneous hemodynamic fluctuations can provide a detailed picture of the functional architecture of the brain (Fox and Raichle, 2007).

Moreover, connectivity features have been used with success to estimate various mental states based on EEG data (Roy and Frey, 2016) in laboratory settings as well as in ecological settings. For instance, a recent study combined EEG connectivity analysis and crew monitoring in a simulator and showed differences in connectivity patterns during different flight phases (Toppi et al., 2016). A few studies combined optical brain imaging like fNIRS with connectivity analysis (Lu et al., 2010; Funane et al., 2011; Cui et al., 2012; Molavi et al., 2012; İşbilir et al., 2016), either to identify brain dynamics or brain-to-brain relationships, yet they did not perform mental state estimation. Hence, the contribution of connectivity measures to fNIRS-based mental state estimation is yet to be assessed.

Classical correlation/covariance measures were successfully used in EEG (Gevins et al., 1987). However, some spontaneous oscillations observed in blood-related imaging (fNIRS and fMRI) seem to be frequency specific, especially Low Frequency Oscillations (LFOs) around 0.1 Hz (Obrig et al., 2000; Tong and Frederick, 2010). For this reason, frequency-specific connectivity metrics such as coherence and wavelet coherence have been used and have gained some momentum in fNIRS signal analysis (Rowley et al., 2006; Cui et al., 2012; Holper et al., 2012; Mirelman et al., 2014).

The objectives of the present study are: (i) to evaluate the feasibility of estimating the pilot's engagement using fNIRS connectivity measures in an ecological setting such as a flight simulator, and (ii) to assess whether connectivity measures characterize engagement better than classical measures.

To meet these goals, a simplified task was designed whereby pilots had to perform different manual and automated landings. Parieto-occipital areas were targeted as they play a key role in visual attention, which is particularly involved while flying (Dehais et al., 2016). Prefrontal cortex activity was also measured as its activation reflects mental demands (Gateau et al., 2015; Moro et al., 2016) and top-down regulation. Off-line classification was performed over different classical metrics (average, peak, variance, skewness, kurtosis, area under the curve and slope) and connectivity metrics to identify the most predictive ones. Regarding connectivity features, classical dependency measures, namely Covariance, Pearson's correlation (Greenblatt et al., 2012) and Spearman's correlation (Spearman, 1904), and some spectral measures, namely magnitude squared coherence (Mandel and Wolf, 1976) and wavelet coherence (Torrence and Compo, 1998; Lachaux et al., 2002; Grinsted et al., 2004), were compared (for a review on connectivity metrics see Lachaux et al., 2002; Greenblatt et al., 2012).

### 2. MATERIALS AND METHODS

#### 2.1. Participants

Twelve visual flight rules (VFR) pilots (11 males, mean group age 24 ± 3) completed the experiment. Pilots had normal or corrected-to-normal vision, normal hearing, and no psychiatric disorders. They all had medical clearance to fly. After providing written informed consent, they were instructed to complete a 5 min task training. Typical total duration of a subject's session (informed consent approval, practice task and real task) was about 1 h. This work was approved by the local ISAE-SUPAERO committee (Approval Number: CERNI-Université fédérale de Toulouse-2017-057).

#### 2.2. Experimental Design

The protocol consisted of 8 scenarios in a flight simulator: 4 with manual landing and 4 with automated landing. The Airbus A320 full motion simulator at ISAE-SUPAERO (French Aeronautical Engineering School in Toulouse) was used to conduct the experiment in ecological conditions. It simulates a twin-engine aircraft flight mode. The user interface is composed of a Primary Flight Display (PFD), a Navigation Display and an upper Electronic Central Aircraft Monitoring Display page. The pilot also had a Flight Control Unit (FCU) to interact with the autopilot (**Figure 1**).

The scenarios were divided into 3 phases: a rest phase, a cruise phase and lastly a landing phase. The landing was performed either in manual mode (i.e., hard condition), in which the pilots controlled the aircraft speed and trajectory, or with the autopilot engaged (easy condition; **Figure 2**). Landing conditions (Auto vs. Manual) were pseudo-randomly distributed.

During the cruise phase, the autopilot was engaged and the pilots were asked to relax. This phase was mostly set to serve as a baseline. When approaching the ILS (Instrument Landing System) range (approximately 2 min) they were asked either to let the autoflight system perform the landing or to disengage the automation to manually land the aircraft. Autopilot and auto throttle deactivation was done by pushing a red button on the flight stick and the throttle respectively. Participants did not know in advance whether the landing would be automated or manually executed. Considering the whole spectrum of the landing task, our experimental conditions were designed to be contrasted in terms of mental demands. The landing phase ended 10 s after the pilots touched down on the landing ground. Before starting the experiment, the participants performed a 30-min training session to familiarize themselves with the simulator environment.

#### 2.3. Data Acquisition

#### 2.3.1. Subjective Workload Assessment

After the end of the experiment, the pilots were asked to complete a commonly used subjective workload level questionnaire, the NASA-TLX (Hart and Staveland, 1988) in order to compare the two conditions. This questionnaire combines 6 factors, i.e., mental demand, physical demand, temporal demand, overall performance, frustration level, and effort.

#### 2.3.2. fNIRS Recording

Two NIRSport acquisition devices (NIRx Medical Technologies) were used in tandem mode to increase the number of sensors. Each system has 8 sources and 8 detectors operating at wavelengths of 760 and 850 nm, with signals recorded at 7.8125 Hz. By using 2 systems, Frontal and Occipital areas were each covered with 8 sources and 8 detectors constrained mechanically by a plastic spacer at the appropriate distance (3 cm maximum), resulting in 42 optodes or channels. The probabilistic paths of photons through the cortex were estimated using the Monte-Carlo transport software tMCimg via the Atlas Viewer from Homer2 (Boas et al., 2002; Aasted et al., 2015). The optode placement and the results of the simulation are shown in **Figure 3**. Before starting the experiment, a calibration was performed in order to check each optode's signal quality.

#### 2.4. Data Analysis

#### 2.4.1. Pre-processing

fNIRS data were analyzed using MATLAB R2015b with several functions from the Homer2 software package (Dubb and Boas, 2016). The overall analysis pipeline is described in **Figure 4**. The landing phase was divided into epochs of 200 samples (∼25 s) overlapping by 60 samples (∼7.5 s). As the landing duration (152 ± 22 s) could slightly differ among participants depending on their performance, the fixed number of extracted epochs was based on the shortest landing, resulting in 12 epochs per landing and per subject.

Each epoch was processed independently so that the method could potentially be extended to online processing. Raw data were converted to optical densities; an artifact removal algorithm and a band pass filter were then applied to each epoch separately. A wavelet interpolation method was used for the artifact correction (Molavi and Dumont, 2012). This method has been shown to yield the greatest signal-to-noise ratio among the currently available artifact removal methods (Brigadoi et al., 2014). A Butterworth high pass filter (cutoff: 0.01 Hz, order 3) and a low pass filter (cutoff: 0.5 Hz, order 5) were applied for the band pass filtering step.

The filtered and artifact free data were then converted to oxyhemoglobin [HbO] and deoxy-hemoglobin [HbR] concentration variations.

For further analysis, only the 80 centered samples (∼10 s) of each epoch were kept by applying a boxcar function. This window was applied to avoid spectral leakage, specifically from the wavelet transform, and to obtain a 10 s window without overlap. At the end of this processing stage, for each landing (trial) we had 12 non-overlapping, filtered and artifact free epochs of 80 samples.
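The centered-window step can be illustrated with a minimal Python sketch. It is not the authors' MATLAB/Homer2 code: the function name `center_window` and the placeholder signal are hypothetical, and only the figures quoted above (200-sample epochs, 80 kept samples, 7.8125 Hz sampling rate) come from the text.

```python
FS_HZ = 7.8125     # sampling rate reported in Section 2.3.2
EPOCH_LEN = 200    # samples per processed epoch
KEEP_LEN = 80      # centered samples kept for feature extraction


def center_window(epoch, keep=KEEP_LEN):
    """Return the `keep` centered samples of an epoch (boxcar window)."""
    if keep > len(epoch):
        raise ValueError("window is longer than the epoch")
    start = (len(epoch) - keep) // 2
    return epoch[start:start + keep]


# Placeholder epoch: the sample indices themselves, so the kept range is visible.
epoch = list(range(EPOCH_LEN))
kept = center_window(epoch)
window_seconds = KEEP_LEN / FS_HZ

print(kept[0], kept[-1], len(kept))  # first index, last index, length of the kept window
print(window_seconds)                # duration of the kept window in seconds
```

With these numbers the kept window spans sample indices 60 to 139 of each epoch and lasts 10.24 s, which matches the "∼10 s" stated in the text.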

#### 2.4.2. Oxygenation Measures

Oxygenation measures were computed using both the [HbO] and [HbR] signals on each epoch separately, where x represents either the [HbO] or the [HbR] signal for one epoch (80 samples) and one optode. Seven oxygenation measures were computed (Peak, Mean, Variance, Kurtosis, Skewness, Area Under the Curve, and Slope).

The peak (maximum) and the first four moments (average, variance, skewness, and kurtosis) were computed as follows:

$$\text{Average}(\mathbf{x}) = E(\mathbf{x}) \quad \text{Var}(\mathbf{x}) = E[(\mathbf{x} - E(\mathbf{x}))^2] \tag{1}$$

$$\text{Skew}(\mathbf{x}) = \frac{E[(\mathbf{x} - E(\mathbf{x}))^3]}{(E[(\mathbf{x} - E(\mathbf{x}))^2]^{3/2})} \quad \text{Kurt}(\mathbf{x}) = \frac{E[(\mathbf{x} - E(\mathbf{x}))^4]}{(E[(\mathbf{x} - E(\mathbf{x}))^2]^2)} \tag{2}$$

The Area Under the Curve (AUC) was calculated by summing the absolute values of the signal.

$$AUC = \sum |\mathbf{x}|\tag{3}$$

The slope was computed by least-squares linear regression with the MATLAB `polyfit` function.
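As an illustration of Equations (1) to (3) and of the slope, the following Python sketch computes the seven oxygenation features for one epoch. It is a hedged re-implementation, not the authors' MATLAB code. Two assumptions are made: the moments are population moments (division by n, as the expectation notation suggests, whereas MATLAB's `var` divides by n − 1 by default), and the slope is fitted against the sample index, so it is expressed per sample. The function name `oxygenation_features` is hypothetical.

```python
def oxygenation_features(x):
    """Seven per-epoch features for one channel signal x (a list of floats)."""
    n = len(x)
    mean = sum(x) / n
    # Central moments, Eq. (1) and (2), with population (1/n) normalization.
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    # Skewness and kurtosis are undefined for a constant signal.
    skew = m3 / m2 ** 1.5 if m2 > 0 else float("nan")
    kurt = m4 / m2 ** 2 if m2 > 0 else float("nan")
    # Least-squares slope of x against the sample index 0..n-1 (assumption).
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (v - mean) for t, v in enumerate(x))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "peak": max(x),
        "average": mean,
        "variance": m2,
        "skewness": skew,
        "kurtosis": kurt,
        "auc": sum(abs(v) for v in x),  # Eq. (3)
        "slope": num / den,
    }


features = oxygenation_features([0.0, 1.0, 2.0, 3.0, 4.0])
print(features)
```

For the linear ramp used here the slope is 1 per sample and the skewness is 0, which gives a quick sanity check of the formulas.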

#### 2.4.3. Connectivity Measures

Connectivity measures were computed, as previously, using both the [HbO] and [HbR] signals on each epoch separately, where x and y represent two signals from two different channels. Five connectivity measures were computed (Covariance, Pearson's correlation, Spearman's correlation, Coherence, and Wavelet Coherence).

Covariance of two signals x and y can be described as a "measure of joint variability":

$$\text{COV}(\mathbf{x}, \mathbf{y}) = E[(\mathbf{x} - E(\mathbf{x})) \times (\mathbf{y} - E(\mathbf{y}))] \tag{4}$$

Where E represents the expected value. Intuitively, covariance characterizes the simultaneous variations of two signals. Covariance will be positive when the differences between the signals and their averages tend to be of the same sign and tend to be negative in the opposite case.

Pearson's correlation coefficient is the covariance of two signals normalized by the product of their standard deviations (std). It represents the linear correlation between two signals. Its values range from −1 to +1, meaning respectively a perfect negative and a perfect positive linear correlation, with 0 corresponding to no linear correlation at all.

$$Pearson(\mathbf{x}, \mathbf{y}) = \frac{COV(\mathbf{x}, \mathbf{y})}{std(\mathbf{x}) \times std(\mathbf{y})} \tag{5}$$

Spearman rank correlation coefficient is "defined as the Pearson correlation coefficient between the ranked variable" (Myers and Arnold, 2003).

$$\text{Spearman}(\mathbf{x}, \mathbf{y}) = \frac{\text{COV}(\text{rg}_{\mathbf{x}}, \text{rg}_{\mathbf{y}})}{\text{std}(\text{rg}_{\mathbf{x}}) \times \text{std}(\text{rg}_{\mathbf{y}})} \tag{6}$$

Where rg<sub>x</sub> and rg<sub>y</sub> are the ranked variables (of x and y respectively). Using the ranks instead of the values allows describing monotonic non-linear relationships between signals, whereas Pearson's coefficient only characterizes linear relationships.
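The three time-domain dependency measures can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the study: population (1/n) normalization is assumed throughout, tied values receive their average rank, and all function names are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def cov(x, y):
    """Covariance: expected product of the deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    """Covariance normalized by the product of the standard deviations."""
    return cov(x, y) / (cov(x, x) ** 0.5 * cov(y, y) ** 0.5)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson(ranks(x), ranks(y))


# A monotonic but non-linear relationship: y = x squared on positive values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 2 for v in x]
print(cov(x, y), pearson(x, y), spearman(x, y))
```

In this toy case Spearman's coefficient equals 1 because the relationship is perfectly monotonic, while Pearson's coefficient stays slightly below 1 because the relationship is not linear.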

Spectral Coherence Cxy(f) or Magnitude squared Coherence is defined as the absolute squared value of the cross-spectral density of two signals (x and y) for a frequency f, normalized by the product of their auto-spectral density:

$$C_{xy}(f) = \frac{|G_{xy}(f)|^2}{G_{xx}(f) \times G_{yy}(f)} \tag{7}$$

Where Gxy(f) represents the cross spectral density (being the spectrum of the cross correlation function) for a frequency f . Gxx(f) and Gyy(f) being the auto spectral density (i.e., the spectrum of the auto correlation function) respectively for x and y. Spectral coherence can be seen as a correlation coefficient in the frequency domain.

For the last one, a coherence measure based on the wavelet transform was used (Torrence and Compo, 1998): the wavelet coherence. The wavelet coherence power R<sup>2</sup><sub>n</sub>(s) can be defined as:

$$R_n^2(s) = \frac{\left|\mathcal{S}\left(s^{-1}\, W_n^{xy}(s)\right)\right|^2}{\mathcal{S}\left(s^{-1} |W_n^{x}(s)|^2\right)\, \mathcal{S}\left(s^{-1} |W_n^{y}(s)|^2\right)} \tag{8}$$

Where W<sup>x</sup><sub>n</sub>(s) and W<sup>y</sup><sub>n</sub>(s) represent respectively the wavelet transforms of x and y at time point n for a wavelet scale s. W<sup>xy</sup><sub>n</sub>(s) is the cross wavelet transform of x and y (being the wavelet transform of the cross correlation function). S is a smoothing operator (for more detail see Torrence and Compo, 1998).

This measure can be seen as "a localized correlation coefficient in time frequency space" (Grinsted et al., 2004). Coherence values range from 0 to 1, with 1 meaning that the two analyzed signals show perfectly phase-locked oscillations at a given frequency.

Connectivity measures were computed on each epoch separately and for each couple of channels, namely $C_n^k = 861$ couples (k = 2, n = 42). At this step, we had 42 measures for each oxygenation feature and 861 measures for each connectivity feature per epoch and per subject.
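The number of channel couples quoted above follows from the binomial coefficient, as this short Python check shows. The channel indices 0 to 41 are placeholders, not the labels used in the montage.

```python
from itertools import combinations
from math import comb

N_CHANNELS = 42

# Every unordered couple of distinct channels, e.g. (0, 1), (0, 2), ...
channel_pairs = list(combinations(range(N_CHANNELS), 2))

print(len(channel_pairs), comb(N_CHANNELS, 2))  # both equal 42 * 41 / 2
```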

#### 2.5. Data Classification

#### 2.5.1. Feature Extraction

#### **2.5.1.1. Region of interest**

In order to reduce the amount of data and the dimensionality, the 42 different channels were combined into 6 regions of interest (ROI): Frontal-Left and Right, Fronto-Central; Occipital-Left and Right and Occipito-Central.

For the oxygenation features, this was done by averaging the features from all channels included in each of these 6 regions. For the connectivity features, 15 connections were possible across the 6 ROI. Values were first evaluated for each pair of channels (861 couples) and then averaged across couples connecting the same regions. Couples of channels located inside the same ROI were also kept, which gave 15 + 6 = 21 connectivity measures.

At this step, we had 6 measures for each oxygenation feature and 21 measures for each connectivity feature per epoch and per subject.
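The reduction from 861 channel couples to 21 ROI-level connectivity values can be sketched as follows. The channel-to-ROI assignment below (7 consecutive channels per region) and the connectivity values are placeholders chosen only to make the example runnable; the actual montage is the one shown in Figure 3, and the function name `roi_average` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ROIS = [
    "Frontal-Left", "Fronto-Central", "Frontal-Right",
    "Occipital-Left", "Occipito-Central", "Occipital-Right",
]
N_CHANNELS = 42

# Hypothetical layout: 7 consecutive channels per ROI (not the study's montage).
channel_roi = {ch: ROIS[ch // 7] for ch in range(N_CHANNELS)}


def roi_average(pair_values, channel_roi):
    """Average channel-pair values over all pairs linking the same two ROIs."""
    groups = defaultdict(list)
    for (a, b), value in pair_values.items():
        key = tuple(sorted((channel_roi[a], channel_roi[b])))
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Placeholder connectivity values, one per channel couple.
pair_values = {
    (a, b): 1.0 / (1 + b - a) for a, b in combinations(range(N_CHANNELS), 2)
}
roi_values = roi_average(pair_values, channel_roi)
within_roi = [key for key in roi_values if key[0] == key[1]]

print(len(pair_values), len(roi_values), len(within_roi))
```

The 21 resulting keys split into 15 between-region connections and 6 within-region connections, as stated in the text.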

#### **2.5.1.2. Frequency specific measures**

For the two coherence measures (Magnitude Squared Coherence and Wavelet Coherence), the obtained coherence values were averaged over the frequency range between 0.08 Hz (1/12.8 s) and 0.3125 Hz (1/3.2 s), in accordance with the fNIRS literature (Cui et al., 2012).

#### **2.5.1.3. Normalization: z-score**

All features were normalized by z-scoring (i.e., transformed to have a mean of 0 and a standard deviation of 1; Toronov et al., 2001; Tsunashima and Yanagisawa, 2009; Sasai et al., 2011).
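A z-score transform can be written as below. The text does not specify over which set of values the mean and standard deviation were estimated, so this sketch simply assumes one list of values of a single feature (for instance, across the epochs of one subject) and a population standard deviation; both choices are assumptions, and the function name `zscore` is hypothetical.

```python
def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    if sd == 0:
        raise ValueError("z-score is undefined for constant values")
    return [(v - mu) / sd for v in values]


# Placeholder feature values for a handful of epochs.
z = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
print(z)
```

In a cross-validated pipeline, the mean and standard deviation are commonly estimated on the training examples only and then applied to the test examples, to avoid information leaking from the test set; whether this was done here is not stated in the text.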

#### 2.5.2. Classification and Cross-Validation

A Linear Discriminant Analysis (LDA) with regularization of the empirical covariance matrix by shrinkage, also known as shrinkage LDA (sLDA), was used (Friedman, 1989; Blankertz et al., 2011). This method has proved its robustness for BCI and passive BCI (pBCI) applications (Roy et al., 2016a,b), including with fNIRS (Herff et al., 2013; Bauernfeind et al., 2014; Hennrich et al., 2015).

Our paradigm was an intra-subject binary classification. Each subject performed 8 landings (4 of each of the 2 conditions). Data were processed to obtain 12 epochs of 10 s for each landing, which gives 12 × 8 = 96 epochs (examples) for each subject. Our model prediction performance was assessed by using a stratified cross validation, which is a good tradeoff between bias and variance estimation (Kohavi and Sommerfield, 1995; Friedman et al., 2001). The classifier was trained with examples that originated from 6 different landings (3 of each of the 2 conditions, i.e., 6 × 12 = 72 examples) and tried to predict examples from the last 2 landings (1 of each condition, i.e., 2 × 12 = 24 examples). This method was applied for every combination (16) of landings left out of the training set and the averaged performance was kept.
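The 16 train/test splits described above can be enumerated with a short Python sketch. The landing labels (`M1` to `M4`, `A1` to `A4`) are hypothetical identifiers introduced for illustration; only the counts come from the text.

```python
from itertools import product

EPOCHS_PER_LANDING = 12
manual = ["M1", "M2", "M3", "M4"]     # hypothetical labels, manual landings
automated = ["A1", "A2", "A3", "A4"]  # hypothetical labels, automated landings

folds = []
for manual_out, automated_out in product(manual, automated):
    # One landing of each condition is held out: the test set stays balanced.
    test_landings = [manual_out, automated_out]
    train_landings = [l for l in manual + automated if l not in test_landings]
    folds.append((train_landings, test_landings))

n_train = len(folds[0][0]) * EPOCHS_PER_LANDING
n_test = len(folds[0][1]) * EPOCHS_PER_LANDING
print(len(folds), n_train, n_test)
```

Holding out whole landings, rather than individual epochs, keeps epochs from the same landing from appearing in both the training and the test set.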

Regarding the features, 2 types of comparisons were done. Firstly, a single-feature comparison was performed, in which the classification performance of each feature was assessed separately. Secondly, features were merged to evaluate their combined potential. They were combined 2 by 2 and the classification performance obtained with each couple was assessed.
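The number of feature couples involved in the pairwise comparison can be checked by enumeration. The sketch below only counts couples; the feature names are those listed in the text, and the variable names are hypothetical.

```python
from itertools import combinations, product

oxygenation = ["Average", "Peak", "Variance", "Skewness",
               "Kurtosis", "AUC", "Slope"]
connectivity = ["Covariance", "Pearson", "Spearman",
                "Coherence", "WaveletCoherence"]

oxy_pairs = list(combinations(oxygenation, 2))     # couples of oxygenation features
conn_pairs = list(combinations(connectivity, 2))   # couples of connectivity features
mixed_pairs = list(product(oxygenation, connectivity))  # one feature of each type

print(len(oxy_pairs), len(conn_pairs), len(mixed_pairs))
print(len(oxy_pairs) + len(conn_pairs) + len(mixed_pairs))
```

The 21 oxygenation couples and 10 connectivity couples add up to 31 same-type couples, and adding the 35 mixed couples gives the 66 possible couples of the 12 features.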

#### 2.5.3. Statistical Assessment

#### **2.5.3.1. Subjective workload comparison**

A paired-sample t-test was performed in order to compare the average overall workload obtained for the 2 conditions among subjects.

#### **2.5.3.2. Classification performance significance**

For a 2-class problem like ours, the theoretical chance level for classification is 100/2 = 50 %, but this only holds for an infinite number of samples. To assess the significance of our classifier (decoding accuracy), the classification error was modeled by a binomial cumulative distribution (see Combrisson and Jerbi, 2015 for more details):

$$P(Z) = \sum_{i=Z}^{n} \binom{n}{i} \times \left(\frac{1}{c}\right)^{i} \times \left(\frac{c-1}{c}\right)^{n-i} \tag{9}$$

Where P(Z) is the probability of predicting the correct class at least Z times by chance, n is the number of samples, and c is the number of classes.

The performance of our classification pipeline was assessed by repeating the stratified cross validation 16 times and averaging it. As stated earlier, our classifier was trained with 72 samples and tested on the last 24 samples. By using the cumulative binomial distribution, it sets the 5% significance classification threshold at 58.3%.
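The significance threshold can be reproduced with the Python sketch below, which searches for the smallest number of correct predictions whose cumulative binomial probability reaches 1 − α. This mirrors the inverse cumulative binomial approach of Combrisson and Jerbi (2015) but is not their code. The value n = 96 (all epochs of one subject) is an assumption on our part: it is the sample count that yields the 58.3 % figure quoted above. The function name `chance_threshold` is hypothetical.

```python
from math import comb


def chance_threshold(n, c=2, alpha=0.05):
    """Smallest k such that P(X <= k) >= 1 - alpha for X ~ Binomial(n, 1/c)."""
    p = 1.0 / c
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p ** k * (1 - p) ** (n - k)
        if cdf >= 1 - alpha:
            return k
    return n


N_SAMPLES = 96  # assumption: 12 epochs x 8 landings per subject
k_min = chance_threshold(N_SAMPLES)
threshold_percent = 100.0 * k_min / N_SAMPLES

print(k_min, round(threshold_percent, 1))
```

Under this assumption the threshold corresponds to 56 correct predictions out of 96, i.e., 58.3 %.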

#### **2.5.3.3. Classification performance comparison**

In order to compare the classification performance for each feature, a repeated measure ANOVA was used considering FEATURES (or FEATURES COUPLE) and CHROMOPHORE (HbO/HbR) as within factors. A post-hoc Tukey's Honestly Significant Difference (HSD) procedure was applied to perform multiple comparisons.

### 3. RESULTS

#### 3.1. Subjective Workload Assessment

Participants rated their workload significantly higher for the manual landing condition (M = 66.6 ± 9) than the automatic landing condition (M = 18.7 ± 7; t(11) = −17.43, p < 10<sup>−8</sup>).

#### 3.2. Classification with Individual Features

**Figure 5** illustrates the classification performances for each feature computed over the HbR and HbO signals. In order to compare classification performance among features, a repeated measure ANOVA was done.

The statistical analysis showed a significant effect of feature type on classification performance [F(11,121) = 5.66, p < 10<sup>−3</sup>] and a significant effect of the chromophore used [F(1,11) = 8.73, p < 0.05]. Post-hoc comparisons revealed significant differences among features, mainly for HbO. In particular, Wavelet Coherence performed significantly better than the Average, Skewness, Kurtosis, and Slope. Also, every connectivity feature gave a significantly greater performance than the Skewness. Moreover, regarding HbR, the Wavelet Coherence and the Covariance gave a significantly greater performance than the Kurtosis. The connectivity features did not differ significantly from one another. Post-hoc comparisons did not show any significant effect of the chromophore on the classification performance for any individual feature. In other words, no feature gave significantly different results when computed over HbO rather than HbR signals.

Moreover, every connectivity feature computed over the HbO signals led to an average classification performance above chance level (>58.3 %). Furthermore, Pearson's, Spearman's correlation, and the Wavelet Coherence exceeded the chance level for both HbO and HbR. Concerning classical oxygenation features, the AUC and Variance were the only features to reach a classification performance above chance level but only when computed over HbO signals.

Regarding the best features, Wavelet Coherence achieved the best classification performance across subjects, with an average of 65.34 and 59.94% correct classification for HbO and HbR respectively. The second was the Covariance (62.93 and 56.03 %), followed by the Area Under the Curve (61.76 and 57.83%), for HbO and HbR respectively.

#### 3.3. Classification with Combined Features

**Figures 6**, **7** show the averaged classification performance for all the possible combinations of 2 oxygenation or 2 connectivity features respectively.

Following the same procedure as before, a repeated measure ANOVA was done with the data shown in **Figures 6**, **7**. It revealed a significant effect of the feature couple [F(30, 330) = 5.42, p < 10<sup>−3</sup>] but not of the chromophore [F(1, 11) = 2.47, p = 0.14] on the classification performance.

When evaluating multiple comparisons for HbO, the main observation is that the 7 best connectivity couples gave a significantly greater classification performance than the 7 worst oxygenation couples. It can also be noted that the connectivity couples did not differ significantly from one another.

For oxygenation features, 9 out of 21 couples of features led to a classification performance above chance level, namely AUC-Peak, AUC-Variance, AUC-Average, Average-Variance, AUC-Slope, Variance-Slope, Peak-Variance, AUC-Skewness, and Variance-Skew. The AUC-Peak couple reached a classification performance of 61.2 and 56.7% for HbO and HbR respectively. Moreover AUC is in 5 of these 9 best couples. Regarding combined connectivity features, every connectivity couple reached a classification performance above chance except the couple Covariance-Coherence when computed over HbR. The best couple (Covariance-WaveletCoherence) led to a classification performance of 66.4 and 59.8% (for HbO and HbR respectively).

Results for every feature couple, including couples mixing oxygenation and connectivity features, for every subject are given in **Tables 1**, **2**.


FIGURE 6 | Pilot's engagement classification performance as a function of the couple of fNIRS-based oxygenation features used (average across subjects). Blue and red bars represent features extracted from [HbR] and [HbO] signals respectively. Error bars represent the 95 % confidence interval.

TABLE 1 | Classification performance (HbO/HbR) for every possible combination of 2 features with an average performance across subjects under chance level (<58.3 %) computed over HbO.


Results are rounded to the closest integers and ordered by their average value (the last couple (row) is the best performing on average across subjects). Columns refer to subjects (S1–S12) and rows to each feature couple (Ave, Average; Pea, Peak; Var, Variance; Skew, Skewness; Kurt, Kurtosis; AUC, Area Under the Curve; Cov, Covariance; Pear, Pearson's; Spear, Spearman's; Coh, Coherence; WTC, Wavelet Coherence).

### 4. DISCUSSION

The main motivation of the present study was to assess the potential of connectivity measures to classify two different levels of task engagement with fNIRS under relatively ecological settings. We therefore designed a protocol whereby pilots had to perform several manual and automated landings. Our subjective measures confirmed that these two situations were contrasted as manual landing led to significantly higher subjective NASA-TLX scores than automated landing. Our overall classification results confirmed that the two different engagement levels could be discriminated in a flight simulator. This is in line with previous neuroergonomics studies showing that this brain optical imaging technique is well suited for mental state monitoring in ecological situations (Herff et al., 2013; Durantin et al., 2015; Gateau et al., 2015; Foy et al., 2016).

TABLE 2 | Classification performance (HbO/HbR) for every possible combination of 2 features with an average performance across subjects above chance level (>58.3 %) computed over HbO.

Results are rounded to the closest integers and ordered by their average value (the last couple (row) is the best performing on average across subjects). Columns refer to subjects (S1–S12) and rows to each feature couple (Ave, Average; Pea, Peak; Var, Variance; Skew, Skewness; Kurt, Kurtosis; AUC, Area Under the Curve; Cov, Covariance; Pear, Pearson's; Spear, Spearman's; Coh, Coherence; WTC, Wavelet Coherence).

The best classification accuracy reached 66.9 %, a result that does not compare favorably with recent studies at first glance. For instance, Hong et al. (2015) obtained a classification performance of 75.6 % on 10 subjects with a mental motor imagery and mental arithmetic paradigm using average and slope features over chromophore concentration. Holper and Wolf (2011) used a complex vs. simple imaginary movement paradigm with 12 subjects. By combining different features such as the average, variance, skewness and kurtosis computed over HbO and HbR, they reached a performance of 81.3 %. Naseer et al. (2016) obtained a 93 % classification performance with almost similar features to classify mental arithmetic vs. rest on 7 subjects. However, these studies did not consider a continuous assessment but rather an event-locked assessment of a specific cognitive activity, whereas our flying task involved different executive and attentional skills. Interestingly, and contrary to our results, Khan and Hong (2015) showed that classical oxygenation metrics could yield a high accuracy (84.9 %) when continuously monitoring drowsiness under ecological settings such as driving in simulated conditions. The comparison with our study remains challenging, as the construct of engagement is probably more subtle to capture. Finally, the limited number of trials did not allow us to optimize the training of our model to guarantee high classification accuracy.

Interestingly, the connectivity measures led to better classification performance than the classical oxygenation metrics (i.e., chromophore concentration variation). The better performance of the connectivity metrics over classical ones may have two main explanations. Firstly, one has to consider that the analysis of task-related concentrations (i.e., hemodynamic response) is time-locked to the event. It has been proposed that these task-related responses induce a small increase (<5%) in neural energy consumption compared to the overall brain energy consumption (Raichle and Mintun, 2006). Thus, by focusing only on a localized hemodynamic response, the majority of the brain activity is dismissed. It is now well accepted that cognition relies on the activation of several distributed brain areas rather than single dedicated processing units (Siegel et al., 2012; Hutchison et al., 2013; van den Heuvel and Sporns, 2013). Thus, the analysis of the interaction between neural networks provides more information on brain dynamics, especially when trying to understand complex real-life tasks (Cui et al., 2012; Leff et al., 2015; İşbilir et al., 2016). Secondly, some relevant studies disclosed that frequency or amplitude correlations among spontaneous LFOs (around 0.1 Hz) are tightly linked to cortical processes (Lowe et al., 1998; Xiong et al., 1999; Obrig et al., 2000, see Siegel et al., 2012 for a review). As a matter of fact, when considering continuous monitoring of brain activity, where no specific events are expected, connectivity features based on frequency or amplitude coupling can give an insight into the ongoing cognitive processes.

The comparison of the classification performance of the connectivity metrics revealed that covariance, correlation (Pearson's or Spearman's) and wavelet coherence led to significantly higher classification accuracies than 3, 2, and 4 classical oxygenation metrics, respectively. It is interesting to note that the former present complementary advantages. On the one hand, correlation and covariance are straightforward measures that are computationally cheap to implement, which is a great advantage where pBCIs are concerned. On the other hand, wavelet coherence takes into account both time- and phase-locked oscillations. While it has been used for some years by the fMRI community, the wavelet coherence metric has only recently been applied to fNIRS signals (Cui et al., 2012; İşbilir et al., 2016). Wavelet coherence also makes it possible to target specific and relevant frequency bands such as LFOs, as discussed previously. However, the implementation of wavelet coherence-based pBCIs remains challenging, as this metric requires a high number of wavelet convolutions and the computational cost could be critical in an online paradigm. One promising approach to overcome this issue is to consider dimensionality reduction (Guyon and Elisseeff, 2003). Taken together, our findings provide some methodological guidance for the implementation of fNIRS-based BCI metrics. To the best of our knowledge, this study is one of the few to benchmark different fNIRS connectivity metrics and to use them for classification purposes in ecological settings. It paves the way toward online mental state estimation in ecological aeronautical settings, but some challenges remain.
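As a rough illustration of why correlation and covariance are cheap to compute, the following sketch derives channel-pair features from a single epoch. It is a minimal example under stated assumptions and not the pipeline used in this study: the function name `connectivity_features`, the `(n_channels, n_samples)` array layout, and the random demo data are all hypothetical.

```python
import numpy as np


def connectivity_features(epoch, kind="correlation"):
    """Return one value per channel pair for a single epoch.

    epoch: array of shape (n_channels, n_samples), e.g. one chromophore
    concentration time series per channel (layout is an assumption).
    """
    epoch = np.asarray(epoch, dtype=float)
    if kind == "correlation":
        matrix = np.corrcoef(epoch)   # Pearson correlation between channels
    elif kind == "covariance":
        matrix = np.cov(epoch)        # covariance between channels
    else:
        raise ValueError("kind must be 'correlation' or 'covariance'")
    # The matrix is symmetric, so keep only the upper triangle (no diagonal).
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


# Hypothetical demo data: 4 channels, 200 samples of random noise.
rng = np.random.default_rng(0)
demo_epoch = rng.standard_normal((4, 200))
demo_features = connectivity_features(demo_epoch)
```

With 4 channels this yields 6 pairwise features per epoch, which could then be passed to a classifier. Both measures need only a single matrix product per epoch, in contrast to the many convolutions required by wavelet coherence.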

Despite its potential interest, our paper has several limitations. Firstly, this experiment involved 12 subjects who performed only four trials of each condition. This limited number of trials was a compromise, as the participants would experience fatigue and discomfort if wearing the cap for a long period (around 40 min). Secondly, the choice of the two contrasted conditions (automatic vs. manual) can be discussed since potential confounds such as motor responses could influence our measures. Yet we did not target motor areas, so the risk is low. However, our motivation is to monitor pilots' brain activity when facing realistic flying conditions. Designing well contrasted and controlled conditions in ecological environments such as flying remains challenging. This first experiment was meant to set the path to more refined protocols to characterize different tasks with a view to performing crew monitoring as achieved by Toppi et al. (2016). The third limitation concerns the fNIRS signal analysis. Indeed, the fNIRS signal is the result of a global component influenced by skin blood flow and a local neuronal component. Some algorithms based on spatial filtering and principal component analysis, such as the one proposed by Zhang et al. (2016), could have been used had the analysis not been done on each epoch separately. Moreover, fNIRS signals can also be influenced by other physiological activities such as heartbeats, respiration or changes in blood pressure. It would have been interesting to also record those activities to evaluate how they correlate with the engagement level. Regarding the paradigm settings and because of these limitations, despite the fact that our classification performance was very high and satisfying, it is not possible to make any claim regarding the underlying neurophysiological processes.

Finally, the performance of the classification pipeline needs to be improved before its implementation in the cockpit, as such a rate of false negative detections cannot be afforded as is in such critical systems, even though this accuracy level could still be usable when combined with multisensory fusion. A promising way to increase classification performance could be to use a bimodal EEG-fNIRS pBCI (Fazli et al., 2012).

### AUTHOR CONTRIBUTIONS

Study conception and design: KV, RR, FD. Data acquisition: KV, FD. Data Analysis: KV. Data Interpretation and Writing: KV, RR, FD.

#### FUNDING

This study was supported by a PhD grant delivered by the DGA (Direction Générale de l'Armement).

### ACKNOWLEDGMENTS

The authors would like to thank Carlos Cuenca-Ruza and Kevin Mandrick for their help during the experiment.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Verdière, Roy and Dehais. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Do Event-Related Evoked Potentials Reflect Apathy Tendency and Motivation?

Hiroyuki Takayoshi, Keiichi Onoda and Shuhei Yamaguchi\*

Department of Neurology, Shimane University, Izumo, Japan

Apathy is a mental state of diminished motivation. Although the reward system as the foundation of the motivation in the human brain has been studied extensively with neuroimaging techniques, the electrophysiological correlates of motivation and apathy have not been fully explored. Thus, in 14 healthy volunteers, we examined whether event-related evoked potentials (ERP) obtained during a simple number discrimination task with/without rewards reflected apathy tendency and a reward-dependent tendency, which were assessed separately using the apathy scale and the temperament and character inventory (TCI). Participants were asked to judge the size of a number, and received feedback based on their performance in each trial. The P3 amplitudes related to the feedback stimuli increased only in the reward condition. Furthermore, the P2 amplitudes related to the negative feedback stimuli in the reward condition had a positive correlation with the reward-dependent tendency in TCI, whereas the P3 amplitudes related to the positive feedback stimuli had a negative correlation with the apathy score. Our result suggests that the P2 and P3 ERPs to reward-related feedback stimuli are modulated in a distinctive manner by the motivational reward dependence and apathy tendency, and thus the current paradigm may be useful for investigating the brain activity associated with motivation.

#### Edited by:

Stephen Fairclough, Liverpool John Moores University, United Kingdom

#### Reviewed by:

Rolf Verleger, University of Lübeck, Germany Filippo Brighina, Università degli Studi di Palermo, Italy

#### \*Correspondence:

Shuhei Yamaguchi yamagu3n@med.shimane-u.ac.jp

> Received: 13 March 2017 Accepted: 11 January 2018 Published: 31 January 2018

#### Citation:

Takayoshi H, Onoda K and Yamaguchi S (2018) Do Event-Related Evoked Potentials Reflect Apathy Tendency and Motivation? Front. Hum. Neurosci. 12:11. doi: 10.3389/fnhum.2018.00011

#### Keywords: motivation, apathy, event-related potential, P2, P3

### INTRODUCTION

Apathy is one of the representative clinical symptoms of reduced motivation. Apathy is defined as diminished motivation that is not attributable to a disorder of consciousness, cognitive impairment, or emotional distress (Marin, 1990), and it is characterized by an absence of will which results in decreased self-initiated behavior (Berrios and Grli, 1995). It is difficult to judge clinically whether there is an absence of will or not, but Stuss et al. defined apathy as a status characterized by a decreased response to external stimuli (Stuss et al., 2000). Apathy is seen frequently in various neuropsychiatric disorders, but its mechanism has not been fully explored. If apathy can be assessed by physiological measures, the neural basis of motivation may be understood more deeply.

Electrophysiological indices such as the P3 event-related evoked potential (ERP), feedback-related negativity (FRN), and stimulus preceding negativity (SPN) are objective measures of cognitive processing, which have excellent temporal resolution for neural activities elicited by external and internal events. Several ERP studies have examined the motivational changes caused by monetary gain or loss, and it is known that some ERP components are particularly sensitive to the valence and size of a reward. One of these components is P3, and increases in its amplitude are associated with the gain of a larger reward (Yeung and Sanfey, 2004; Sato et al., 2005). However, few studies have examined the direct relationship between P3 and apathy. In particular, the P3 amplitude in a visual oddball task decreased in apathetic patients after stroke (Yamagata et al., 2004). A similar result was found in Parkinson's disease based on a visual oddball task (Kaufman et al., 2016). The P3 amplitude also decreased in patients with anhedonia (Dubal et al., 2000) and depression (Foti and Hajcak, 2009). This evidence suggests that P3 may reflect cognitive processes that are sensitive to an apathetic state.

Another component is the FRN, which was discovered as a negative potential generated by feedback stimuli signifying a false response (Takasawa et al., 1990). The FRN was also elicited by feedback signifying monetary loss (Gehring and Willoughby, 2002). The FRN amplitude was higher when the immediately preceding feedback represented monetary gain compared with loss (Masaki et al., 2006), thereby indicating that the FRN is affected by the motivation level on a trial-by-trial basis. The FRN is generated in the anterior cingulate cortex (ACC), and dysfunction of the ACC network is associated with apathy (Onoda and Yamaguchi, 2015). This evidence suggests that the FRN might also reflect the degree of apathy.

In addition, the SPN was associated with reward gain in motivation studies, including a task with feedback signals related to performance (Brunia and Damen, 1988). The SPN was studied in a time production task, and it had a larger amplitude in the case with monetary rewards (Bocker et al., 1994). Therefore, it is possible that the SPN also reflects motivation.

ERP components are known to correlate with personality traits and affective disorder (Gangadhar et al., 1993; Hansenne, 1999). To establish a physiological index of apathy, the effect of other motivation-related factors should be considered simultaneously. Here, we focused on reward dependence, novelty seeking, and depression. Reward dependence is characterized by being eager to help and please others, persistent, industrious, warmly sympathetic, sentimental, and sensitive to social cues and personal succor but able to delay gratification with the expectation of eventually being rewarded (Cloninger, 1987). These characteristics suggest that reward dependence could be treated as a motivational trait. Novelty seeking is a temperament associated with exploratory activity in response to novel stimulation, impulsive decision making, extravagance in approach to reward cues, quick loss of temper, and avoidance of frustration (Cloninger et al., 1993). Reward dependence and novelty seeking are related to the reward system (Krebs et al., 2009). Moreover, novelty seeking is associated with dopamine function (Cloninger et al., 1994) and its polymorphism (Lusher et al., 2001). Novelty seeking may therefore play a role in motivation. On the other hand, depression is associated with anhedonia and loss of motivation through functional impairment of the mesolimbic dopamine pathway (Martin-Soelch, 2009). Apathy and depression are often confused, and both can sometimes be seen simultaneously, particularly in neurological disorders (Hama et al., 2011). It would be desirable to distinguish apathy from depression to reveal the neural basis of each. Therefore, we investigated the relationships between the ERPs and not only apathy but also reward dependence, novelty seeking, and depression.

In this study, we developed a new simple task where the P3, FRN, and SPN components were evaluated in a single session, and motivation was modulated by changing a monetary reward. This task paradigm enabled us to examine the relationships among the electrophysiological measures, novelty seeking, reward dependence, depressive state, and apathy tendency.

### METHODS

#### Subjects

Fourteen neurologically healthy adult volunteers (eight males, six females) were recruited. Their mean age was 25.3 years (standard deviation = 4.1, range = 20–35 years). All subjects had normal vision or corrected to normal vision. This study was approved by the Ethics Committee of Shimane University, and was conducted in accordance with the Declaration of Helsinki.

#### Questionnaires

Participants completed the apathy scale (Starr et al., 1983; Okada et al., 1998), the temperament and character inventory (TCI) (Cloninger et al., 1993; Kijima et al., 1996), and Zung's self-rating depression scale (SDS) (Zung, 1965). All of these are self-report questionnaires. The TCI is a 125-item personality questionnaire developed by Cloninger et al. (4-point scale per item). We obtained scores for reward dependence and novelty seeking because both traits are closely related to motivation. A higher novelty seeking score indicates a stronger novelty-seeking tendency (Cloninger et al., 1993). A higher reward dependence score represents a more motivated state (Kijima et al., 1996). Higher scores on the apathy scale and SDS represent more apathetic and more depressive states, respectively. The mean ± standard deviation (range) of the scores was 47.1 ± 6.9 for novelty seeking, 41.8 ± 4.5 (33–49) for reward dependence, 51.6 ± 5.9 (41–63) for harm avoidance, 11.3 ± 5.5 (2–21) for the apathy score, and 36.4 ± 8.4 (20–52) for SDS. There were several correlations among the apathy scale, SDS, and temperaments. The apathy scale was positively correlated with harm avoidance and SDS (Supplementary Table 1).

#### Procedures

**Abbreviations:** ACC, anterior cingulate cortex; ANOVA, analysis of variance; EEG, electroencephalogram; ERP, event-related potentials; FRN, feedback-related negativity; SPN, stimulus preceding negativity; TCI, temperament and character inventory.

We developed an original task to measure SPN, FRN, and P3 in a single experimental session. Participants were asked to perform a number discrimination task. **Figure 1** shows the protocol for the number discrimination task. This task comprised three conditions (reward, non-reward, and control). In each trial, a number other than five was displayed and participants judged whether the number was smaller than five. Participants were asked to press the left button when the number was smaller than five and to press the right button when the number was larger than five. The feedback stimulus was presented 2.5 s after the response. When participants responded correctly and faster than the criterion time, a positive feedback stimulus was presented with a value of +10 to +90 in steps of 10. In contrast, when they responded correctly but slower than the criterion time, a negative feedback stimulus was presented with a value of −10 to −90 in steps of 10. The feedback value was determined by the response speed and accuracy in the trial, such that faster responses yielded higher values, and vice versa (see **Figure 1B** for details). If the previous response was correct and faster than the criterion, the next criterion was automatically shortened by 10 ms. Conversely, if the previous response was incorrect or too slow, the following criterion was automatically prolonged by 10 ms. In the reward condition, the positive value represented a monetary reward and was added to the total amount of money acquired. Even if the feedback was negative, the total amount of money acquired was not decreased because the expected total reward was manipulated to be positive in the reward condition. In the non-reward condition, the value of the feedback stimulus indicated the response speed, which did not affect the acquisition of money. In the control condition, the value of the feedback stimulus, ranging from +90 to −90 in steps of 10, was assigned randomly regardless of the response speed. The probabilities of positive and negative feedback were manipulated so that both were kept at 50%. When participants made a wrong response, "incorrect response" was presented as text in all conditions. If no response was made within 0.8 s after the presentation of the number, "no response" was presented as text. The duration of the feedback stimulus was 1.0 s. After feedback, the current total monetary reward was displayed for 1.0 s.

The stimulus color differed between conditions (reward: yellow; non-reward: green; control: white). The mean inter-trial interval was 2.5 s (range: 2.0–3.0 s). The task comprised five sessions and each session included three blocks (one block per condition; **Figure 1C**). Each block included 20 trials. A break of a few minutes was allowed between sessions. Participants were given an opportunity to practice 20 trials before the actual task. They were instructed to press a button as quickly as possible. The initial time criterion was calculated for each participant based on the mean reaction time for correct responses in the practice session. Participants were told that the positive feedback value would be larger if they pressed the button as quickly as possible and answered correctly, and that the negative feedback value would be larger if they responded slowly even with a correct response. We also told the participants that they could identify the ongoing condition based on the stimulus color.
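The adaptive response-time criterion used in the reward and non-reward conditions can be sketched as follows. This is a minimal illustration and not the authors' task code: the function names, the strict "faster than" comparison, and the treatment of a missing response as a too-slow response are assumptions, and the mapping from response speed to feedback magnitude (Figure 1B) is not modeled.

```python
def classify_response(correct, reaction_ms, criterion_ms, timeout_ms=800):
    """Return the feedback category for one trial.

    reaction_ms is None when no button press was recorded.
    """
    if reaction_ms is None or reaction_ms > timeout_ms:
        return "no response"
    if not correct:
        return "incorrect response"
    # Correct response: the sign of the feedback depends on the criterion.
    return "positive" if reaction_ms < criterion_ms else "negative"


def update_criterion(criterion_ms, outcome, step_ms=10):
    """Shorten the criterion after a fast correct response, else prolong it."""
    if outcome == "positive":
        return criterion_ms - step_ms
    # Assumption: a missing response is handled like a too-slow response.
    return criterion_ms + step_ms


def run_trials(trials, initial_criterion_ms):
    """Apply the staircase to a list of (correct, reaction_ms) trials."""
    criterion = initial_criterion_ms
    log = []
    for correct, reaction_ms in trials:
        outcome = classify_response(correct, reaction_ms, criterion)
        log.append((outcome, criterion))
        criterion = update_criterion(criterion, outcome)
    return log, criterion


# Hypothetical trials: (response correct?, reaction time in ms or None).
demo_trials = [(True, 380), (True, 395), (False, 350), (True, None)]
demo_log, final_criterion = run_trials(demo_trials, initial_criterion_ms=400)
```

Because the criterion moves by the same step in both directions, a staircase of this kind tends to settle where fast and slow outcomes are about equally frequent, which is consistent with the 50% feedback probabilities described above.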

#### ERP Data Acquisition and Signal Processing

Participants were seated ∼1 m from a monitor in a shielded room. Electroencephalographic (EEG) data were acquired using a BrainAmp system with 64-channel electrodes (Brain Products, Brain AMP DC, Germany) (**Figure 1D**). EEG signals were recorded continuously with the bandpass set at 0.01–250 Hz and a sampling frequency of 500 Hz. The reference channel was Cz, and re-referencing was performed offline based on the average of all recording sites. Noise components including ocular movement were removed by independent component analysis. The continuous EEG was segmented into epochs spanning 200 ms pre-stimulus to 800 ms post-stimulus for the target and feedback stimuli, with a bandpass filter of 2–16 Hz, to analyze the P2, P3, and FRN components. This filter setting was used to detect a more prominent FRN and to remove slow drift through the low-frequency cutoff (Onoda et al., 2010). P2 and P3 were identified as positive or negative components in latency windows of 100–250, 200–350, and 300–500 ms, respectively. FRN was measured as a negative component in the latency window of 250–400 ms for the subtraction waveform (negative-positive). The peak amplitude and latency for each component were determined in the same window. To analyze the SPN, epochs from 2,000 ms pre-stimulus to 500 ms post-stimulus were extracted from the EEG with a bandpass filter of 0.016–30 Hz. The baseline for the SPN was defined as the time window from −1,500 to −1,000 ms before the feedback stimulus. The mean amplitude of the SPN was measured between 1,000 ms pre-stimulus and stimulus onset.
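The two kinds of measurement described above, a peak within a latency window and a baseline-corrected mean amplitude, can be sketched as follows. This is an illustrative example on synthetic waveforms, not the analysis code used in the study: the function names, the window boundary conventions, and all demo values are assumptions.

```python
import numpy as np


def peak_in_window(waveform, times_ms, window_ms, polarity="positive"):
    """Return (peak amplitude, peak latency in ms) inside a latency window."""
    waveform = np.asarray(waveform, dtype=float)
    times_ms = np.asarray(times_ms)
    start, end = window_ms
    idx = np.flatnonzero((times_ms >= start) & (times_ms <= end))
    segment = waveform[idx]
    # Positive components use the maximum, negative ones the minimum.
    best = np.argmax(segment) if polarity == "positive" else np.argmin(segment)
    return float(segment[best]), float(times_ms[idx[best]])


def baseline_corrected_mean(waveform, times_ms, baseline_ms, window_ms):
    """Mean amplitude in window_ms minus the mean amplitude in baseline_ms.

    Both intervals are treated as half-open, [start, end) (an assumption).
    """
    waveform = np.asarray(waveform, dtype=float)
    times_ms = np.asarray(times_ms)
    in_base = (times_ms >= baseline_ms[0]) & (times_ms < baseline_ms[1])
    in_win = (times_ms >= window_ms[0]) & (times_ms < window_ms[1])
    return float(waveform[in_win].mean() - waveform[in_base].mean())


# A 500 Hz sampling rate gives one sample every 2 ms.
# Epoch for the peak measures: -200 to 800 ms around the stimulus.
times_ms = np.arange(-200, 800, 2)
# Synthetic waveform: a positive bump of 5 units centred at 360 ms (made up).
demo_wave = 5.0 * np.exp(-((times_ms - 360) / 40.0) ** 2)
p3_amplitude, p3_latency = peak_in_window(demo_wave, times_ms, (300, 500))

# Epoch for the SPN: -2,000 to 500 ms around the feedback stimulus.
spn_times_ms = np.arange(-2000, 500, 2)
# Synthetic waveform: flat at 1, dropping to -1 in the last second (made up).
spn_wave = np.where((spn_times_ms >= -1000) & (spn_times_ms < 0), -1.0, 1.0)
spn_mean = baseline_corrected_mean(
    spn_wave, spn_times_ms, baseline_ms=(-1500, -1000), window_ms=(-1000, 0)
)
```

For a negative component such as the FRN, the same function would be applied to the difference waveform with `polarity="negative"` and the 250–400 ms window.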

#### Statistics

Behavioral measures were subjected to one-way repeated-measures analysis of variance (ANOVA) with condition as the factor. The primary analysis models for the amplitude and latency of the ERP components comprised repeated-measures ANOVA with two or three factors (channel × condition, or channel × condition × feedback valence). The Greenhouse–Geisser correction was applied to the ANOVA. For post-hoc tests, the Bonferroni method was employed for multiple comparisons. The statistical significance threshold was set to p < 0.05.

### RESULTS

According to the behavioral data, the main effect of condition on the reaction time was significant [F(2, 26) = 7.70, ε = 0.68, p = 0.008, d = 0.37], where the reaction time to targets was faster for the reward condition compared with the non-reward and control conditions (ps < 0.05, **Table 1**). There was no significant main effect on the error rate [F(2, 26) = 0.72, n.s.]. The mean total monetary reward was 1912 ± 698 yen.

The grand average waveforms are illustrated in **Figure 2**. P2 and P3 were elicited for both the target and feedback stimuli, and SPN appeared to precede the feedback stimuli. These components differed in their amplitude and latency depending on the condition or feedback valence.

The target P2, which was largest at Cz, did not exhibit any significant main effects or interactions in terms of amplitude or latency (Fs < 2.4). The peak amplitude of the target P3, which was largest at Cz and Pz, was mainly affected by the condition [F(2, 26) = 4.79, ε = 0.75, p = 0.03, d = 0.27], where the amplitude for the reward condition was larger than that for the control condition (p = 0.04, **Figure 3A**). Similarly, the latency was significantly affected by the condition [F(2, 26) = 4.7, ε = 0.84, p = 0.024, d = 0.27], where the latency was shorter for the reward condition than for the control condition (p = 0.007). The mean amplitude of SPN was not significantly influenced by the main effects or interactions (Fs < 2.0, **Figure 3B**).

TABLE 1 | Response time and error rate in rewarded discrimination task.

R, reward; N, non-reward; C, control. P < 0.05; Statistics, repeated-measures ANOVA.

The ERPs for the feedback stimulus were analyzed using three-way ANOVA. The feedback P2 had a significant main effect of channel [F(2, 26) = 5.54, ε = 0.64, p = 0.01], where the amplitude was largest at Cz (**Figure 3C**). Regarding the FRN, there were neither main effects nor interactions (Fs < 1.3, **Figure 3D**). The feedback P3 had a significant main effect of condition [F(2, 26) = 52.9, ε = 0.84, p < 0.001, d = 0.80], where the largest amplitude was in the reward condition and the smallest amplitude in the control condition (ps < 0.05, **Figure 3E**). The interactions between channel × valence/condition were also significant (Fs > 3.45, ε = 0.70/0.87, ps < 0.04, ds > 0.21). The post-hoc test showed that the P3 amplitude at Fz was larger for the negative feedback compared with the positive feedback (p < 0.05). The feedback P3 amplitude was largest at Pz in the reward condition. There was a main effect of valence on latency, where the P3 peak latency was shorter for the positive feedback than for the negative feedback [F(1, 13) = 6.07, p = 0.03, d = 0.32].

Next, we examined the correlations between the ERP components and the individual psychological and affective characteristics (**Figure 4**, Supplementary Tables 1, 2). To limit the amount of ERP information in the supplementary table, we averaged the ERP measures across the three conditions. The target P3 amplitude showed a negative correlation with reaction time (ps < 0.05). The P2 amplitude at Cz had positive correlations with reward dependence for negative feedback in the reward condition, for positive feedback in the non-reward condition, and for positive feedback in the control condition (ps < 0.05). Furthermore, the P3 amplitude at Pz had negative correlations with the apathy scale for positive feedback in the reward condition, for positive and negative feedback in the non-reward condition, and for positive feedback in the control condition (ps < 0.02).

### DISCUSSION

In this study, we examined whether ERP components can be employed as objective measures of apathy and motivation by using a newly developed number discrimination task with or without rewards. According to the behavioral analysis, the reaction time to targets was faster in the reward condition than in the non-reward and control conditions, thereby indicating that the participants were relatively motivated by the monetary reward. We found larger ERP components for the target and feedback stimuli in the reward condition compared with other conditions, which suggests that increased neural activities are associated with enhanced motivation.

We demonstrated that the feedback P2 amplitude was positively correlated with reward dependence, and the feedback P3 amplitude was negatively correlated with the apathy score. These results imply that the feedback P2 and P3 reflected motivation. Other ERP components, i.e., SPN and FRN, had no significant relationships with the motivational measures.

[Figure caption, truncated in source: "… for feedback P2, and 364 ms for feedback P3. The topography of SPN was made from the mean amplitude between 1,000 ms pre-stimulus and stimulus onset (B)."]

The feedback P2 was clearly elicited in all conditions in this study. P2 is considered to be a stimulus-dependent component related to an early stage of information processing (Portella et al., 2012). Potts et al. reported that the frontal P2 was the largest when the reward was unpredictable and that its generator was the medial frontal cortex, which is associated with the reward system (Potts et al., 2006). This evidence indicates that a larger P2 is often observed when attention is preferentially allocated to a particular stimulus, such as an imperative stimulus or performance feedback (Lackner et al., 2014). In this study, feedback P2 was correlated with reward dependence. Our result suggests that P2 amplitude increases through heightened attention based on higher reward dependence. Moreover, close relationships between affective state/personality trait and the P2 component have also been reported. Regarding affective state, higher P2 amplitude was seen in shy adolescents, in individuals with anxiety disorder, and in individuals with depression (Kemp et al., 2010; Han et al., 2014; Lackner et al., 2014). The influence of affective state on attention has been explained by attention bias (Han et al., 2014) and disruption of selective attention (Kemp et al., 2010). In this study, depressive state was not associated with P2. This may be because the task does not markedly engage affective processes and the degree of depressive state was mild. On the other hand, several studies have addressed the association between the reward system and reward dependence. Reward dependence was correlated with gray matter volumes in the caudate nucleus (Iidaka et al., 2006), orbitofrontal cortex, and temporal lobe (Van Schuerbeek et al., 2011); BOLD activity of the substantia nigra/ventral tegmental area (Krebs et al., 2009); and opioid receptor availability in the striatum and nucleus accumbens (Schreckenberger et al., 2008). These results indicate that reward dependence is associated with the reward system based on the fronto–striatal circuit. The fronto–striatal circuit may modulate P2 activity via attentional deployment.

In addition, we examined whether the P3 component is modulated by individual temperament and affective state. We found a negative correlation between the feedback P3 amplitude and the apathy scale score. The P3 component is usually separated into P3a and P3b (Snyder and Hillyard, 1976). P3a is elicited by novel or salient stimuli, for example, in an oddball task (Courchesne et al., 1975; Knight, 1984), and is distributed over the fronto-central area (Conroy and Polich, 2007), suggesting its association with the frontal attention system. P3b is elicited by target stimuli in an oddball task. This component is generated partly from the temporo–parietal junction (Conroy and Polich, 2007) and relates to attention and memory processing. The P3 seen in a gambling task is related to motivational salience in feedback processing (Nieuwenhuis et al., 2005; Yeung et al., 2005). The feedback P3 amplitude changes depending on reward expectancy, reward size, and the feedback value (Wu and Zhou, 2009). We consider the target and feedback P3 to be P3b because of the task demands and the topography. Target P3 is associated with target evaluation, feedback anticipation, and encoding contextual valence. On the other hand, feedback P3 is enhanced for outcomes with a large value compared to a small value and is involved in the late stage of outcome processing for motivational salience rather than contextual valence (Zheng et al., 2017). In our study, target P3 amplitude was increased in the reward condition and was correlated with reaction time but not with temperament or affective state. According to Zheng et al., feedback P3 is related to outcome evaluation for motivational salience, and our results support this notion. We speculate that feedback P3 could be a physiological marker of motivational state.

Several studies have investigated the association between emotion/affect and the P3 component, demonstrating that the P3 amplitude decreased in individuals with anhedonia (Dubal et al., 2000) and depression (Foti and Hajcak, 2009; Mathis et al., 2014), which are often accompanied by apathy. In our study, we found no significant correlation between depressive state and the P3 amplitude, thereby suggesting that the P3 component may reflect apathy more directly than depression. Similar results were obtained for Parkinson's disease (Mathis et al., 2014), Alzheimer's disease (Daffner et al., 2001), and head trauma (Daffner et al., 2000), where these studies measured the ERP using a visual or auditory oddball task. The P3a arising mainly from the prefrontal area was also correlated with apathy in subcortical stroke patients (Yamagata et al., 2004).

Previous studies have suggested that the SPN and FRN are associated with reward expectation (Bocker et al., 1994; Pfabigan et al., 2011). The SPN amplitude depends on the amount of information with an affective or motivational value carried by the feedback stimulus (Bocker et al., 1994). The FRN is sensitive not only to unexpected negative feedback but also to unexpected positive feedback, which suggests that the FRN reflects the expectancy and valence of feedback. However, no meaningful results were obtained for the SPN and FRN in the current study. After the participants pressed a button, they made a prediction regarding the outcome, which would have been informed by the feedback received. The probabilities of positive and negative feedback were each fixed at 50% in this study. This probability could have influenced their surprise or disappointment in response to feedback. It is possible that no significant changes were found in the SPN and FRN because the anticipation and expectation of the outcome were attenuated by the uncertainty of the feedback stimuli.

There were some limitations in our study. Firstly, it was conducted with healthy volunteers; therefore, the degree of apathy was mild even if they were apathetic. Severe apathy is characterized by decreased mental or behavioral reactions; therefore, although the current task was simple and easy to perform, a task that requires responses might not be suitable for studying severe apathy. Thus, we cannot be certain that similar results would be obtained in subjects with severe apathy. Secondly, the number of participants was not adequate for the correlation analysis between subjective measures and ERPs. High reliability for the TCI was obtained in the English (Cloninger et al., 1994) and Japanese versions (Takeuchi et al., 2011). We also confirmed the reliability and validity of the apathy scale (Okada et al., 1998). Moreover, the stability of P2 and P3 is known (Thigpen et al., 2017) and high test-retest correlations have been reported (McEvoy et al., 2000; Williams et al., 2005). Therefore, because these indicators are robust, the results of the correlation analysis seem acceptable even though the number of participants is not adequate for the analysis. Thirdly, there were several correlations between temperaments and states. Therefore, it is difficult to judge whether temperaments affect ERP components or individual affective states. Further studies are necessary to validate our findings before the clinical use of this method. It is desirable to develop tasks that can evaluate intrinsic, extrinsic, and novel motivation to clarify the neural basis of motivation.

In summary, the P2 and P3 may have distinct associations with motivation, where P2 reflects attention that is modulated by motivation and P3 reflects apathy more directly. The current stimulus paradigm may be useful for investigating the brain activity associated with apathy.

### ETHICS STATEMENT

This study was approved by the Ethics Committee of Shimane University, and was conducted in accordance with the Declaration of Helsinki. All participants gave written informed consent.

### AUTHOR CONTRIBUTIONS

HT, KO, and SY made substantial contributions to conception and design, acquisition of data, and analysis and interpretation of data, and participated in drafting the article or revising it critically for important intellectual content. All authors gave final approval of the version to be submitted and any revised version. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

### ACKNOWLEDGMENTS

This study was supported by JSPS KAKENHI Grant Number 80135904.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnhum.2018.00011/full#supplementary-material

## REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Takayoshi, Onoda and Yamaguchi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Isolating Discriminant Neural Activity in the Presence of Eye Movements and Concurrent Task Demands

Jon Touryan<sup>1</sup> \*, Vernon J. Lawhern<sup>1</sup> , Patrick M. Connolly <sup>2</sup> , Nima Bigdely-Shamlo<sup>3</sup> and Anthony J. Ries <sup>1</sup>

<sup>1</sup> U.S. Army Research Laboratory, Future Soldier Technologies Division, Human Research and Engineering Directorate, Aberdeen Proving Ground, Aberdeen, MD, United States, <sup>2</sup> Teledyne Scientific Company, Durham, NC, United States, <sup>3</sup> Qusp Labs, San Diego, CA, United States

A growing number of studies use the combination of eye-tracking and electroencephalographic (EEG) measures to explore the neural processes that underlie visual perception. In these studies, fixation-related potentials (FRPs) are commonly used to quantify early and late stages of visual processing that follow the onset of each fixation. However, FRPs reflect a mixture of bottom-up (sensory-driven) and top-down (goal-directed) processes, in addition to eye movement artifacts and unrelated neural activity. At present there is little consensus on how to separate this evoked response into its constituent elements. In this study we sought to isolate the neural sources of target detection in the presence of eye movements and over a range of concurrent task demands. Here, participants were asked to identify visual targets (Ts) amongst a grid of distractor stimuli (Ls), while simultaneously performing an auditory N-back task. To identify the discriminant activity, we used independent components analysis (ICA) for the separation of EEG into neural and non-neural sources. We then further separated the neural sources, using a modified measure-projection approach, into six regions of interest (ROIs): occipital, fusiform, temporal, parietal, cingulate, and frontal cortices. Using activity from these ROIs, we identified target from non-target fixations in all participants at a level similar to other state-of-the-art classification techniques. Importantly, we isolated the time course and spectral features of this discriminant activity in each ROI. In addition, we were able to quantify the effect of cognitive load on both fixation-locked potential and classification performance across regions. Together, our results show the utility of a measure-projection approach for separating task-relevant neural activity into meaningful ROIs within more complex contexts that include eye movements.

Keywords: fixation-related potentials, EEG, eye tracking, target detection, cognitive load

## INTRODUCTION

Goal-directed eye movements are a ubiquitous component of everyday life and integral to our perception of the world. Over recent decades, numerous visual search studies have used eye movement patterns to better understand perceptual and attentional processes that underlie human vision (Kowler, 2011). In contrast, the majority of human electrophysiological studies of visual search continue to use fixation-constrained paradigms, artificially limiting the natural linkage between attentional shifts and subsequent eye movements. Thus, extending these paradigms into a framework of overt visual search would enable the validation of attentional models in a more natural context. However, a number of potential confounds and analytical challenges emerge when interpreting electroencephalography (EEG) in the presence of eye movements (Nikolaev et al., 2016). One of the primary confounds is the presence of large eye movement related signals around the events of interest, namely saccades and fixations. These include the corneo-retinal and saccadic spike potentials along with eyelid artifacts. Importantly, the magnitude of these signals systematically scales with the direction, amplitude, and velocity of the saccade. Furthermore, saccade features themselves can systematically vary with task design or conditions. To address this concern, a number of methods have been developed to account for the effects of saccade sequence (Dandekar et al., 2012b) or isolate eye movement related signals within the EEG record (Plöchl et al., 2012). These approaches have been successful to the degree that they were able to reveal task-relevant activity, such as the P3 component, that may otherwise have been conflated with eye movement related artifacts (Dandekar et al., 2012a; Devillez et al., 2015).

#### Edited by:

Fabien Lotte, Institut National de Recherche en Informatique et en Automatique (INRIA), France

#### Reviewed by:

Benjamin Blankertz, Technische Universität Berlin, Germany; Andrey R. Nikolaev, KU Leuven, Belgium

#### \*Correspondence:

Jon Touryan jonathan.o.touryan.civ@mail.mil

Received: 10 January 2017 Accepted: 21 June 2017 Published: 07 July 2017

#### Citation:

Touryan J, Lawhern VJ, Connolly PM, Bigdely-Shamlo N and Ries AJ (2017) Isolating Discriminant Neural Activity in the Presence of Eye Movements and Concurrent Task Demands. Front. Hum. Neurosci. 11:357. doi: 10.3389/fnhum.2017.00357

Robust saccade detection and quantification present another methodological challenge. Previously, EEG studies often relied on an explicit electrooculography (EOG) measurement, horizontal or vertical, for detecting the onset of a saccade (Gaarder et al., 1964; Thickbroom et al., 1991; Kazai and Yagi, 1999). The benefit of using this signal is both the high temporal resolution and the de facto alignment with the EEG record. Unfortunately, the lack of precision in determining the direction and distance of saccades limits these studies to paradigms with a small number of predetermined fixation locations. However, recent advances in the speed and accuracy of infrared eye-tracking technology have made it possible to link gaze position with neural activity at both high spatial and temporal resolution. This has led to a growing number of studies that explore the neural correlates of target detection during visual search, in both controlled (Brouwer et al., 2013) and free-viewing paradigms (Kamienkowski et al., 2012; Dias et al., 2013; Jangraw et al., 2014; Kaunitz et al., 2014; Ušćumlić and Blankertz, 2016; Wenzel et al., 2016).

In addition to the above measurement and signal processing challenges, there is the more nuanced task of interpreting brain activity in the context of planned and executed eye movements (Nikolaev et al., 2016). This remains a significant obstacle for studies focusing on both perceptual and cognitive phenomena. First, there is the task of quantifying or controlling for stimulus properties. When the eyes are free to move, stimuli impinging on the retina will necessarily vary across conditions and participants, even when gaze position is guided by the task sequence. In more controlled settings, experimental design can ensure that small differences in eye position do not significantly bias the statistics of the stimuli. However, this becomes more challenging for the ultimate goal of free-viewing in natural scenes where spatial frequency, orientation, and chromatic distributions can vary widely within a single image. Likewise, there is the challenge of separating saccade planning and execution from the perceptual or cognitive signal of interest. Even when utilizing high density EEG and source localization techniques, the spatial resolution of the saccadic preparatory signals is limited. Thus, accounting for these signals via subtraction across equated conditions (Nikolaev et al., 2013), regression (Dandekar et al., 2012b), or other techniques (Dias et al., 2013) is an important factor for the interpretation of para-saccadic neural activity.

Despite these recent advancements, there remains a need for development and validation of methods for the quantification of both perceptual and cognitive phenomena in the presence of eye movements. Part of this process is the evaluation of novel analytical approaches within paradigms that enable a more direct comparison to related fixation-constrained studies. Similarly, any particular method may only address some of the above challenges while still providing valuable insight when applied within the appropriate constraints or combined with other techniques. It is within this context that we propose the following approach for separating neural activity into meaningful regions of interest (ROIs) in the presence of eye movements. To evaluate our approach, we utilized data from a previously published study (Ries et al., 2016) that employed a dual-task paradigm, visual target detection and auditory N-back, to quantify the effect of working memory load on the lambda response. The primary observation from this study was a small but significant reduction in the lambda amplitude with increasing cognitive load.

Here, we were able to separate the neural response to each fixation into six ROIs by applying a technique that linearly combines activity from independent sources based on their equivalent dipole location. Within each ROI we show a distinct neural response that, to varying degrees, discriminated target from non-target fixations and was differentially modulated by cognitive load. While the task design mitigated the overlapping response from adjacent saccades, common in free-viewing visual search, this approach is a substantive step in the interpretation of fixation-related brain activity. When combined with GLM-based techniques for the deconvolution of overlapping FRPs, this approach can be applied to more natural contexts where the interplay between bottom-up and top-down neural activity is not well understood.

### MATERIALS AND METHODS

The experiment used in this study has been described in a previous publication (Ries et al., 2016). Here, we provide a summary of stimuli and procedure, followed by a more detailed description of the novel ROI analysis method.

### Participants

Fourteen participants volunteered for the study; all participants were right-handed males with an average age of 32.8 years. All participants had 20/20 vision or corrected to 20/20 vision. This study was conducted in accordance with the U.S. Army Research Laboratory's IRB requirements (32 CFR 219 and DoDI 3216.02). The voluntary, fully informed consent of research participants was obtained in written form. The study was reviewed and approved by the U.S. Army Research Laboratory's IRB before the study began.

### Stimuli and Procedure

Participants performed a guided visual target detection task on a 7 × 7 grid (23.9° × 23.9° visual angle) of equally spaced and variably oriented "T" or "L" characters (1.1° visual angle) presented on a low contrast 1/f noise background from a viewing distance of approximately 65 cm (**Figure 1**). Eye fixations were guided across the grid by a red annulus (2.3° visual angle) that randomly surrounded one of the characters for a duration of 1 s before moving to the next randomly selected character. Participants were instructed to saccade to and fixate on the character in the center of the red annulus and to press a button (left hand) only when a "T" (visual target) was present. Visual target characters appeared on 10% of trials. Participants were instructed to maintain fixation on the character until the next red annulus appeared. All red annuli surrounding a non-target "L" were at least two characters from any "T" present on the grid to minimize peripheral detection. The guided visual target detection task was performed in one of five conditions: visual alone (silent condition), while ignoring binaurally presented digits (numbers 0–9), or while using the auditory digits in a 0, 1, or 2-back working memory task. The digit "0" was only used in the 0-Back condition where it served as the auditory target. Auditory stimuli were presented every 2 s with a 500 ms offset from a shift in the red annulus location. Participants were instructed to make a button press (right hand) for auditory targets, which occurred on 20% of trials. Thus, the same number of targets appeared in both tasks during the 0-back, 1-back, and 2-back conditions. Participants performed two consecutive blocks of the same condition (silent, ignore, 0-back, 1-back, 2-back) with the condition order counterbalanced. Each of the 10 blocks had a duration of 200 s with self-paced rest periods between blocks.
Participants were given practice in each N-back condition, prior to experimental data collection, until they reached above chance performance.

### Eye Tracking

Eye-tracking data were sampled at 250 Hz using the SMI RED 250 system (Teltow, Germany). A 15-point calibration was performed prior to the practice and experimental blocks. A post-hoc model was fit to the eye-tracking data for each participant to increase the accuracy of the gaze position estimate. Briefly, we used the expected eye position (i.e., location of the red annulus) to fit a quadratic regression model for both the horizontal and vertical gaze position vectors (**Figure 2**). A temporal lag (250 ms) was applied to the expected location (red annulus) to account for the delay between annulus onset and subsequent fixation.
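As an illustration of the post-hoc gaze correction, the following is a minimal sketch of a per-axis quadratic regression in plain Python. The exact model form is not specified in the text, so the sketch assumes a polynomial b0 + b1·g + b2·g² mapping measured gaze g onto the expected (annulus) position for one axis, with the 250 ms lag already applied to the expected positions. The function names `solve3`, `fit_quadratic`, and `correct` are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_quadratic(measured, expected):
    """Least-squares fit of expected ~ b0 + b1*g + b2*g**2 (assumed model form)."""
    powers = [[1.0, g, g * g] for g in measured]
    ata = [[sum(p[i] * p[j] for p in powers) for j in range(3)] for i in range(3)]
    atb = [sum(p[i] * e for p, e in zip(powers, expected)) for i in range(3)]
    return solve3(ata, atb)


def correct(measured, coef):
    """Apply the fitted quadratic to raw gaze samples of one axis."""
    b0, b1, b2 = coef
    return [b0 + b1 * g + b2 * g * g for g in measured]
```

In this sketch the horizontal and vertical gaze vectors would each be fitted separately and then passed through `correct` before fixation detection.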

Saccades and fixations were detected in the eye-tracking data using a velocity-based algorithm (Engbert and Mergenthaler, 2006; Dimigen et al., 2011). Saccades and fixations were detected using a velocity factor of 6 (standard deviations of the velocity distributions), a minimum saccade duration of 20 ms, and a minimum fixation duration of 350 ms. If two saccades occurred within a 350 ms window, only the fixation corresponding to the largest saccade was preserved. Fixations were only considered task-relevant or "valid" if they were within 3 degrees of the current stimulus location. These criteria were chosen to focus analyses on the first saccade onto the new stimulus (red annulus) location.
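The velocity-based detection step can be sketched as follows. This is a simplified illustration in the spirit of the cited algorithm, not the authors' implementation: it assumes a five-sample velocity estimate, a median-based estimate of the velocity standard deviation, and an elliptic threshold in the horizontal and vertical velocity plane. The velocity factor (6) and minimum saccade duration (20 ms) follow the text; all function names are hypothetical, and the fixation-duration and 3-degree validity criteria are omitted for brevity.

```python
import math
from statistics import median


def velocities(pos, fs):
    """Five-sample moving-window velocity estimate; the two edge samples stay 0."""
    v = [0.0] * len(pos)
    for n in range(2, len(pos) - 2):
        v[n] = (pos[n + 2] + pos[n + 1] - pos[n - 1] - pos[n - 2]) * fs / 6.0
    return v


def robust_sd(v):
    """Median-based standard deviation estimate of the velocity distribution."""
    m = median(v)
    return math.sqrt(max(median([(a - m) ** 2 for a in v]), 1e-12))


def detect_saccades(x, y, fs, vel_factor=6.0, min_dur_s=0.020):
    """Return (onset, offset) sample indices of runs that exceed the threshold."""
    vx, vy = velocities(x, fs), velocities(y, fs)
    ex, ey = vel_factor * robust_sd(vx), vel_factor * robust_sd(vy)
    above = [(a / ex) ** 2 + (b / ey) ** 2 > 1.0 for a, b in zip(vx, vy)]
    min_len = int(round(min_dur_s * fs))
    saccades, start = [], None
    for i, flag in enumerate(above + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                saccades.append((start, i - 1))
            start = None
    return saccades
```

Fixations would then be taken as the intervals between detected saccades, retained only if they last at least 350 ms and fall within 3 degrees of the current annulus location.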

### Electroencephalography and Feature Extraction


Electrophysiological signal acquisition and analysis steps are outlined in **Figure 3**. EEG recordings were digitally sampled at 1,024 Hz from 64 scalp electrodes over the entire scalp using a BioSemi Active Two system (Amsterdam, Netherlands). External leads were placed on the outer canthi, and above and below the orbital fossa of the right eye to record electrooculography (EOG). EEG was referenced offline to the average of the mastoids, downsampled to 256 Hz (f<sub>s</sub>), and digitally high-pass filtered above 1 Hz using the EEGLAB toolbox (Delorme and Makeig, 2004). Large artifacts were detected using a previously described technique (Touryan et al., 2016). Briefly, EEG sessions were segmented into high-resolution 100 ms epochs, with a 10 ms step size. Epochs were marked as high noise if the average power between 90 and 120 Hz was greater than three standard deviations above the mean for all epochs. These epochs were then removed and the remaining EEG record was lowpass filtered below 50 Hz.
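The high-noise criterion can be illustrated with a short sketch. It assumes that the 90–120 Hz band power of each 100 ms epoch has already been computed (the spectral estimate itself is not shown) and that the sample standard deviation is used; both are assumptions, and the function names are hypothetical.

```python
from statistics import mean, stdev


def sliding_epochs(n_samples, fs, win_s=0.100, step_s=0.010):
    """(start, stop) sample indices of overlapping epochs covering the record."""
    win = max(1, round(win_s * fs))
    step = max(1, round(step_s * fs))
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]


def flag_high_noise(band_power, n_sd=3.0):
    """True for epochs whose band power exceeds mean + n_sd * SD over all epochs."""
    threshold = mean(band_power) + n_sd * stdev(band_power)
    return [p > threshold for p in band_power]
```

Flagged epochs would be removed from the record before the low-pass filter and ICA decomposition are applied.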

FIGURE 2 | Gaze location estimation. (A) Example X and Y gaze vectors before and after correction. Dashed red line indicates location of the red annulus. (B) Example search grid overlaid with aggregate gaze position across all blocks for one participant. Black circle illustrates the fixation area considered valid for that stimulus location.

Each "clean" EEG session was decomposed into independent components using the Extended Infomax ICA algorithm implemented in EEGLAB (Delorme and Makeig, 2004). The equivalent dipole locations of these independent sources were then estimated using the EEGLAB implementation of DIPFIT (Scherg, 1990; Pascual-Marqui et al., 1994). IC activation epochs were extracted around each valid fixation using a temporal window spanning 300 ms before and 1,000 ms after fixation onset. Time-frequency features were also calculated for each epoch using a wavelet transform (Torrence and Compo, 1998). Specifically, we used the Morlet wavelet function:

$$
\psi_0(t) = c\,\pi^{-1/4} e^{i\omega_0 t} e^{-t^2/2} \tag{1}
$$

where ω<sub>0</sub> is the central frequency and c the normalization constant. This function was used to create a basis set of 30 wavelets covering the available frequency range with a minimum scale of 2/f<sub>s</sub> and a discrete step size of 0.25 (wavelet transform software available at http://paos.colorado.edu/research/wavelets/). After the wavelet transform, the spectral power of each epoch was computed via multiplication with the complex conjugate of the corresponding epoch. While this time-frequency decomposition included frequencies from 1 to 128 Hz, only frequencies below 32 Hz were included in subsequent analyses.
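To make the wavelet basis concrete, the sketch below builds the 30 scales (minimum scale 2/f<sub>s</sub>, step 0.25 in powers of two) and evaluates the wavelet power at a single time point by direct summation. It is an illustration only: the central frequency ω<sub>0</sub> = 6 and c = 1 are assumed values (the text does not state them), the scale-to-frequency conversion is the standard Morlet relation from Torrence and Compo (1998), and the function names are hypothetical.

```python
import cmath
import math


def morlet(eta, omega0=6.0, c=1.0):
    """Morlet wavelet of Equation 1 evaluated at non-dimensional time eta."""
    return c * math.pi ** -0.25 * cmath.exp(1j * omega0 * eta) * math.exp(-eta * eta / 2.0)


def wavelet_scales(fs, n_scales=30, dj=0.25):
    """Scales s_j = s0 * 2**(j * dj) with minimum scale s0 = 2 / fs."""
    s0 = 2.0 / fs
    return [s0 * 2.0 ** (j * dj) for j in range(n_scales)]


def scale_to_frequency(s, omega0=6.0):
    """Equivalent Fourier frequency (Hz) of a Morlet wavelet at scale s."""
    return (omega0 + math.sqrt(2.0 + omega0 ** 2)) / (4.0 * math.pi * s)


def wavelet_power_at(signal, fs, s, n, omega0=6.0):
    """Wavelet power at sample n and scale s: coefficient times its conjugate."""
    dt = 1.0 / fs
    half = int(math.ceil(4.0 * s / dt))  # truncate the wavelet at +/- 4 scales
    acc = 0j
    for k in range(max(-half, -n), min(half, len(signal) - 1 - n) + 1):
        acc += signal[n + k] * morlet(k * dt / s, omega0).conjugate()
    acc *= math.sqrt(dt / s)
    return (acc * acc.conjugate()).real
```

With f<sub>s</sub> = 256 Hz and the assumed ω<sub>0</sub>, these scales correspond to roughly 124 Hz down to just under 1 Hz, which is consistent with the 1 to 128 Hz range described above; only scales whose equivalent frequency falls below 32 Hz would be kept for classification.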

To isolate activity in brain regions of interest (ROIs), the above IC activation epochs were linearly mixed based on equivalent dipole location using the initial steps of measure-projection analysis (Bigdely-Shamlo et al., 2013). ICs with equivalent dipoles outside of the MNI model brain volume were identified and excluded from analysis (see Supplementary Section 3). These ICs often corresponded to corneo-retinal potentials (i.e., EOG) or muscle artifacts (i.e., EMG). The remaining k IC processes were preserved and their corresponding fixation-locked activation epochs used as the "measure" for each dipole location in the mixing process. Specifically, the fixation-locked activations or measures can be indexed as M<sub>i</sub>, i = 1...k for each IC, and the equivalent dipole location x ∈ V ⊂ ℝ<sup>3</sup> indexed as D(x<sub>i</sub>), i = 1...k. Importantly, there exists uncertainty in dipole localization arising from errors in tissue conductivity parameters, electrode co-registration, noise in the IC estimate process, and between-subject variability in the location of equivalent functional cortical areas. To capture this uncertainty in the mixing process we can instead model each equivalent dipole as a spherical (3-D) Gaussian with uniform covariance σ<sup>2</sup>, centered at the estimated dipole location x<sub>i</sub>. The spherical Gaussian is truncated at t∗σ to minimize the erroneous influence of distant dipoles in sparsely populated regions. Thus, the probability of dipole D(x<sub>j</sub>) being located at position y ∈ V now becomes P<sub>j</sub>(y) = TN(y; x<sub>j</sub>, σ<sup>2</sup>, t), where TN is a truncated normal distribution centered at x<sub>j</sub>. Then for an arbitrary location y ∈ V, the expected value of the measure becomes:

$$
E\left\{M(y)\right\} = \left\langle M(y) \right\rangle = \frac{\sum_{i=1}^{k} P_i(y)\, M_i}{\sum_{i=1}^{k} P_i(y)} \tag{2}
$$

where ⟨M(y)⟩ is the combined fixation-locked activity at location y from all proximal ICs. We used this approach to calculate the aggregate measure ⟨M(y)⟩, either fixation-related potential (FRP) or time-frequency spectrum, for specified regions of the brain volume. For this study, six a priori ROIs (**Figure 4**) were defined using the Measure Projection Toolbox (http://sccn.ucsd.edu/wiki/MPT). Each ROI consisted of all regions of the LONI LPBA40 atlas (Shattuck et al., 2008) that included the corresponding anatomical label (e.g., "occipital"). The only additional parameter σ (standard deviation of the Gaussian distribution) in this calculation was set to 12 mm. This value produced smooth spatial distributions in each ROI given the relatively small number of participants (N = 14). The six ROIs included: occipital, fusiform, temporal, parietal, cingulate, and frontal cortices.
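Equation 2 amounts to a weighted average of the IC measures, with weights given by the truncated Gaussian density of each dipole at the location of interest. The sketch below shows this for a single location y. Because σ is uniform across dipoles, the Gaussian normalization constant cancels in the ratio and is omitted. The value σ = 12 mm follows the text, whereas the truncation factor t = 3 is an assumed placeholder (its value is not given here) and the function names are hypothetical.

```python
import math


def truncated_gaussian_weight(y, x, sigma=12.0, t=3.0):
    """Unnormalized P_i(y): spherical Gaussian around dipole x, zero beyond t * sigma."""
    d2 = sum((a - b) ** 2 for a, b in zip(y, x))
    if d2 > (t * sigma) ** 2:
        return 0.0
    return math.exp(-d2 / (2.0 * sigma ** 2))


def project_measure(y, dipoles, measures, sigma=12.0, t=3.0):
    """Expected measure at location y (Equation 2); None if no dipole is in range."""
    weights = [truncated_gaussian_weight(y, x, sigma, t) for x in dipoles]
    total = sum(weights)
    if total == 0.0:
        return None
    n = len(measures[0])
    return [sum(w * m[k] for w, m in zip(weights, measures)) / total for k in range(n)]
```

Here `dipoles` holds the equivalent dipole coordinates in mm and `measures` the corresponding fixation-locked epochs (FRP samples or flattened time-frequency power); an ROI-level measure would aggregate this quantity over the locations belonging to the ROI.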

#### Hierarchical Classification

For the classification step, we used a two-stage hierarchical approach to dissociate target and non-target fixation epochs. In the first stage, ridge regression (MATLAB® ridge function) was applied separately to the time-frequency epochs from each ROI. Specifically, we applied regularized regression to the entire temporal epoch and frequencies up to 32 Hz. The regularization parameter was determined via calculating the effective degrees of freedom as a function of lambda (λ):

$$
df(\lambda) = \mathrm{tr}\left(X\left(X^{T}X + \lambda I\right)^{-1}X^{T}\right) = \sum_{j=1}^{m} \frac{d_j^2}{d_j^2 + \lambda} \tag{3}
$$

where d<sub>j</sub> are the singular values of the n × m data matrix X. As these functions were roughly similar across ROIs (Supplementary Figure 3), we selected a hyperparameter value (Lemm et al., 2011) such that there were approximately 1 target and 10 non-target observations per degree of freedom (see Model Considerations for a discussion of the validity of this approach). However, the exact value had minimal effect on the results (see Supplementary Section 2). The second stage utilized the regression output, or latent variable estimate, from the six ROIs to provide a single classification score and label for each fixation epoch. In this second stage, we employed linear discriminant analysis (LDA; MATLAB® fitcdiscr function) and coefficients for both stages were fit within a single 5-fold cross-validation scheme. Area under the ROC curve (Az) was calculated for each ROI, as well as for the second-stage LDA classifier. Finally, for direct inference into the discriminant neural activity we calculated the forward model for each ROI (Haufe et al., 2014). Specifically, the regression weights (W) were used to estimate the forward model (A), such that A = Σ<sub>X</sub>WΣ<sub>ŝ</sub>, where Σ<sub>X</sub> and Σ<sub>ŝ</sub> are the empirical data and score covariance, respectively.
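The selection of the regularization parameter via Equation 3 can be sketched as follows. Given the singular values of X, the effective degrees of freedom decrease monotonically with λ, so a target value can be located by bisection. The bisection search and the function names are illustrative assumptions; the text only states that λ was chosen so that roughly 1 target and 10 non-target observations were available per degree of freedom.

```python
def effective_df(singular_values, lam):
    """Effective degrees of freedom of ridge regression (Equation 3)."""
    return sum(d * d / (d * d + lam) for d in singular_values)


def lambda_for_df(singular_values, target_df, iters=200):
    """Find lambda whose effective degrees of freedom match target_df by bisection."""
    rank = sum(1 for d in singular_values if d > 0)
    if not 0 < target_df < rank:
        raise ValueError("target_df must lie strictly between 0 and the rank of X")
    lo, hi = 0.0, 1.0
    while effective_df(singular_values, hi) > target_df:
        hi *= 2.0  # expand the bracket until df(hi) falls below the target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if effective_df(singular_values, mid) > target_df:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under the stated ratio, one plausible choice of `target_df` is the number of target epochs available in the training folds, although this reading is an assumption.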

To facilitate comparison with other approaches, we included two techniques commonly used for single-trial classification of EEG. Both methods were applied directly to the filtered EEG data (64 channels) using the same fixation epochs described above. First, Hierarchical Discriminant Components Analysis (HDCA) was applied with each epoch divided into 8 equal-sized temporal windows (Gerson et al., 2006). Second, we used the xDAWN algorithm (Rivet et al., 2009) to identify the 8 most discriminant spatial filters followed by a Bayesian linear discriminant analysis, collectively referred to as XD+BLDA (Cecotti et al., 2011). Area under the ROC curve was calculated for both of these classifiers on all participants.
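For reference, the area under the ROC curve used to score all classifiers can be computed directly from classifier scores and labels. The sketch below uses the rank-based (Mann-Whitney) formulation, which equals the probability that a randomly chosen target epoch receives a higher score than a randomly chosen non-target epoch. The function name is hypothetical, and this is not necessarily how Az was computed in the original analysis.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; labels are truthy for targets, falsy otherwise."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count target/non-target pairs ranked correctly; ties count as one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because this measure depends only on the rank order of the scores, it is insensitive to the roughly 1:10 imbalance between target and non-target epochs.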

### RESULTS

#### Behavioral and Ocular Measures

Detailed behavioral analysis of this study has been previously reported (Ries et al., 2016); however, the relevant statistics are summarized below for comparison with the classification results. Reaction time and accuracy were analyzed separately for the visual and auditory tasks using a one-way repeated measures ANOVA (Greenhouse-Geisser correction reported where appropriate). The primary factor was auditory task condition, which had five levels in the visual task (Silent, Ignore, 0-Back, 1-Back, 2-Back), and three levels in the auditory task (0-Back, 1-Back, 2-Back). There was a trend for decreased accuracy in the visual task as a function of cognitive load (i.e., auditory N-back level); however, this was not statistically significant (**Table 1**). We did observe a highly significant effect of cognitive load on reaction time (RT) in the visual task [F(2.73, 35.44) = 29.24, p < 0.001, η<sup>2</sup> = 0.69] showing that visual target RT increased as a function of cognitive load. Likewise, analysis of the auditory task showed both a significant decrease in accuracy [F(1.61, 20.96) = 6.74, p < 0.01, η<sup>2</sup> = 0.34] and increase in RT [F(1.60, 20.79) = 17.64, p < 0.001, η<sup>2</sup> = 0.58] with increasing auditory task demands. While the behavioral results showed that auditory working memory load had a significant negative impact on visual task performance, exhibited through increased RT, the near-ceiling accuracy likely mitigated any decline of this corresponding metric. Together, the behavioral results suggest that participants were not exclusively favoring one modality, as performance declined in both the visual and auditory tasks with increased cognitive load.

\*Auditory N-back level: Silent, Ignore, 0-Back, 1-Back, 2-Back.

Since eye movements were constrained by the nature of the visual task (guided target detection), the majority of ocular metrics did not significantly differ across blocks or conditions. As expected, we found no significant difference in fixation duration (0.962 ± 0.053 s; mean ± STD) or saccade distance (13.316 ± 1.072 degrees) across auditory task conditions. However, we did observe a large change in pupil dilation as a function of cognitive load (**Figure 4**). Specifically, we calculated the average pupil size in each condition relative to the average size across each participant's entire session. This relative pupil-size metric was significantly modulated by auditory task condition [F(3.04, 39.57) = 7.98, p < 0.001, η <sup>2</sup> = 0.38], exhibiting an increase in size with working memory load. Due to static luminance and counterbalanced condition order (see Materials and Methods), this modulation was unlikely to be a consequence of either changes in luminance or time-on-task (Beatty, 1982). Thus, our results indicate that the task-induced cognitive load increased the arousal level of participants, as has been shown in similar paradigms (Kahneman and Beatty, 1966).

### Fixation-Related Potentials (FRPs) by ROI

We calculated FRPs for each brain region by combining independent component activations within fixation epochs, using an ROI-based measure-projection approach (ROI-MPA). An IC's contribution to a given ROI was determined by the overlap between the anatomically defined region and the equivalent dipole Gaussian density function (see Materials and Methods). Importantly, by excluding equivalent dipoles located outside of the brain volume, this approach attempts to minimize the influence of non-brain signals, such as those generated by eye movements, on the ROIs (see Supplementary Section 3). **Figure 5** shows the grand average FRPs from each ROI: occipital, fusiform, temporal, parietal, cingulate, and frontal cortices. To account for the differing number of included ICs, FRPs from each participant were uniformly scaled by total variance and are shown in arbitrary units. All included epochs were from valid fixations (within 3 degrees of the current stimulus) and free of large artifacts. The average number of target and non-target epochs, by condition, is shown in **Table 2**.

The ROI FRP waveforms shown in **Figure 5** exhibit a clear distinction across brain regions. Both target and non-target FRPs show a temporal progression through the visual cortices (occipital, fusiform, temporal) and reflect known electrophysiological signatures, such as the P1 or lambda component. Importantly, the distinction between target and non-target FRPs is evident in most ROIs. To identify periods of significant difference in the FRP waveforms we used a paired t-test at causal time points in each ROI (255 time points × 6 ROIs). A single false-discovery-rate correction for multiple comparisons was then applied to all p-values (Benjamini and Hochberg, 1995). As expected, visual cortices show this distinction in earlier epochs, consistent with the visual mismatch negativity (vMMN: 150–250 ms), while the parietal and cingulate cortex exhibit a clear late positive deflection, indicative of the P3 component. In contrast, the frontal cortex shows little of the saccade-related EOG artifact that would otherwise be expected to dominate frontal electrodes (e.g., Fz).
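The statistical procedure above can be sketched in two steps: a paired t statistic at each time point, followed by a single Benjamini-Hochberg correction over the pooled p-values (255 × 6 in this study). The sketch assumes a false discovery rate of q = 0.05, which is not stated in the text, and leaves the conversion from t statistic to p-value (n − 1 degrees of freedom) to a statistics library. Function names are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic for matched samples (e.g., target vs. non-target FRP)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def benjamini_hochberg(p_values, q=0.05):
    """Boolean rejection mask controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Applying one correction to the pooled p-values, rather than one per ROI, matches the single false-discovery-rate correction described in the text.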

the ROI are shown as the inset. FRP response consists of a linear sum of IC activations weighted by their contribution to the corresponding ROI and are shown in arbitrary units. Note: ICs with equivalent dipoles located outside of the brain volume, such as those produced by EOG, are not aggregated in the ROI FRPs.


TABLE 2 | Number of epochs included in each condition.

Values represent the mean, standard deviation in parentheses.

For comparison to the standard approach, we also calculated target and non-target FRPs for electrodes that most directly correspond to each ROI. **Figure 6** shows the grand average FRP from these corresponding electrodes, using the same fixation epochs as above (**Table 2**). For occipital regions, the electrode and ROI FRPs are quite similar, as these electrodes are least affected by changes in the corneo-retinal potential and other saccade related activity. However, EOG artifact increasingly dominates the anterior regions, especially frontal electrodes (e.g., Fz). This can lead to difficulty in dissociating neural from EOG phenomena in more cognitive processes. For these grand averages, the number of target epochs was an order of magnitude lower than that for non-target epochs. However, a similar result was found for both ROI and electrode FRPs when these numbers were equated by randomly sampling a subset of non-target epochs (Supplementary Figures 1, 2).

Finally, to quantify the effect of auditory task condition (i.e., working memory load) on the ROI FRP we performed the following analysis. We measured the amplitude of the FRP for auditory conditions at either end of the difficulty spectrum: Ignore and 2-Back. These were chosen as representative of low and high cognitive load conditions, although similar results were found when comparing the Silent and 2-Back conditions. To capture the P3 waveform, we calculated the average amplitude within a 300–700 ms post-fixation window. We then applied a two-way repeated measures ANOVA, with factors ROI and condition, to quantify the effect of cognitive load on this component (**Table 3**). As expected, there was a strong effect of auditory task condition [F(1, 65) = 22.45, p < 0.001, η<sup>2</sup> = 0.15] with the amplitude of the P3 being significantly smaller during high, relative to low, working memory load.

#### Classification by ROI

For single-trial classification, we used fixation-locked time-frequency features from each ROI. Before linearly mixing IC activations, we first applied a Morlet wavelet transform to each epoch. We then calculated the spectral power of the wavelet transform before combining these time-frequency epochs. **Figure 7A** shows the grand average spectral FRPs for target epochs from each ROI. These average time-frequency responses, analogous to event related spectral perturbations (ERSP), show a similar time course as the FRPs above. Visual cortices have an early, mid-frequency (alpha band) component that is the spectral equivalent of the lambda response. Similarly, the parietal, cingulate, and frontal cortices are dominated by a later lower frequency (delta band) activity, reflecting the P3 component.

To classify target from non-target fixation epochs, we used ridge regression on these high-dimensional time-frequency features. We constructed separate classifiers for each ROI that utilized spectral information, below 32 Hz, from the entire fixation epoch. The forward models for each ROI are shown in **Figure 7B**. Again, the time course is similar to the grand average FRPs, where visual cortices have discriminant activity with smaller latencies and higher-frequency components. The marginal activations (**Figure 7C**) provide a more direct view of the temporal profile of the discriminant activity.
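As a rough illustration of this step, the sketch below fits a ridge regression to flattened, high-dimensional features and scores held-out epochs with the area under the ROC curve. The wavelet transform is omitted; the synthetic features, the regularization value, and all function names are assumptions made for the example, not the settings used in the study.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge weights via the dual form, convenient when features >> epochs."""
    mu = X.mean(axis=0)
    Xc = X - mu
    yc = y - y.mean()
    K = Xc @ Xc.T                                   # (n_epochs, n_epochs) Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), yc)
    w = Xc.T @ alpha                                # equals (Xc'Xc + lam I)^-1 Xc'yc
    b = y.mean() - mu @ w
    return w, b

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def make_epochs(n_target, n_nontarget, n_features, rng):
    """Synthetic stand-in for flattened time-frequency features (assumption)."""
    labels = np.r_[np.ones(n_target), np.zeros(n_nontarget)]
    X = rng.normal(size=(labels.size, n_features))
    X[labels == 1, :20] += 0.8                      # targets differ in a few features
    return X, labels

rng = np.random.default_rng(1)
X_train, y_train = make_epochs(60, 240, 500, rng)
X_test, y_test = make_epochs(40, 160, 500, rng)

w, b = fit_ridge(X_train, y_train, lam=500.0)       # lam chosen arbitrarily here
az = auc(X_test @ w + b, y_test)
```

The dual form solves a system whose size is the number of epochs rather than the number of features, which is the usual reason to prefer it for time-frequency data.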

The relative discriminant power of each ROI was quantified by using classifier performance in a two-way repeated measures ANOVA, with factors ROI and auditory task condition (**Table 4**). We found a significant modulation of the area under the ROC curve (Az) by region [F(2.41, 156.73) = 12.84, p < 0.001, η <sup>2</sup> = 0.15]. The average performance across all ROIs and participants was 0.741 ± 0.068 (**Figure 8**), substantially below behavioral performance in the visual detection task (average accuracy = 0.974 ± 0.030).

Integration across regions required a second-stage classifier applied to the output of the ROI regression step. For each epoch, the outputs from the ROI classifiers (i.e., a vector of six classification scores) were combined using a linear discriminant function. Not surprisingly, this hierarchical approach resulted in significantly better performance (Az: 0.851 ± 0.096) than the individual ROI classifiers (p < 0.001; Wilcoxon signed rank test). Interestingly, there was a wide range in classifier performance across participants, with Az values ranging from 0.708 to 0.947, indicating that for some individuals our approach was able to identify visual targets at an accuracy similar to behavioral performance. This was despite ongoing neural activity related to the concurrent auditory task as well as the planning and execution of eye movements.
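The second stage can be sketched as a Fisher linear discriminant applied to a vector of six ROI scores per epoch. The example below is only an illustration under stated assumptions: the scores are synthetic, the six ROIs are assumed to be equally and independently informative, and the helper names are hypothetical.

```python
import numpy as np

def fit_lda(S, labels):
    """Fisher linear discriminant on a (n_epochs, n_rois) score matrix."""
    S1, S0 = S[labels == 1], S[labels == 0]
    mu1, mu0 = S1.mean(axis=0), S0.mean(axis=0)
    # Pooled within-class covariance
    Sw = ((S1 - mu1).T @ (S1 - mu1) + (S0 - mu0).T @ (S0 - mu0)) / (len(S) - 2)
    w = np.linalg.solve(Sw, mu1 - mu0)
    b = -0.5 * w @ (mu1 + mu0)
    return w, b

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

rng = np.random.default_rng(2)

def make_scores(n_target, n_nontarget):
    """Synthetic first-stage outputs: six weakly discriminative ROI scores."""
    labels = np.r_[np.ones(n_target), np.zeros(n_nontarget)]
    scores = rng.normal(size=(labels.size, 6)) + labels[:, None]
    return scores, labels

S_train, y_train = make_scores(100, 400)
S_test, y_test = make_scores(100, 400)

w, b = fit_lda(S_train, y_train)
az_combined = auc(S_test @ w + b, y_test)
az_single = np.array([auc(S_test[:, k], y_test) for k in range(6)])
```

In this toy setting the combined score outperforms every single ROI because the noise is independent across ROIs; real ROI scores are correlated, so the gain would be smaller.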

This hierarchical approach compared favorably to other common classification techniques (**Figure 8B**). Specifically, we applied Hierarchical Discriminant Components Analysis (Gerson et al., 2006) to the EEG channel data using the same epochs as above. We also applied the xDAWN filtering algorithm (Rivet et al., 2009) followed by Bayesian linear discriminant analysis, or XD+BLDA (Cecotti et al., 2011). HDCA and XD+BLDA classification accuracies were similar to our hierarchical approach with HDCA having slightly higher overall performance (p = 0.013).

While we were able to classify visual target from non-target stimuli during a concurrent auditory task, there was a significant modulation of ROI classification performance as a function of cognitive load (**Table 4**). Here, this modulation was the inverse of that observed in the relative pupil size. Classification performance decreased with increasing auditory task demands, except in the Silent and Ignore conditions, in which performance was similar. At the hierarchical stage, while the target scores were significantly modulated by condition (**Figure 9B**), the classification performance was not (**Table 4**, **Figure 9A**). Much like behavioral accuracy in the visual task, the hierarchical classifier performance remained relatively constant across auditory task conditions. However, increasing working memory load clearly affected both the neural activity and pupil diameter in a manner consistent with increased arousal (Murphy et al., 2011).

### DISCUSSION

In this study, we used a novel approach to determine if the neural response associated with visual target detection could be separated into meaningful components in the presence of eye movements and concurrent task demands. Here, we employed a single framework to isolate neural from non-neural activity and to separate the FRP into cortical regions. Using common statistical techniques we were then able to classify target from non-target FRPs across ROIs on a single-trial basis at a level similar to state-of-the-art machine learning algorithms. By doing so, we were able to show a clear time-course of discriminant activity associated with target detection as well as the modulating effect of cognitive load. While our task design mitigated the overlapping response from previous or subsequent fixations, the results demonstrate the potential for separating task-relevant neural activity in more complex contexts that include eye movements and concurrent tasks.

TABLE 3 | ANOVA statistics for P3 amplitude in the visual task.

\*Auditory N-back level: Ignore, 2-Back.

The EEG analysis framework described here is both specific enough to separate activity by ROI and sensitive enough to evaluate the effects of cognitive load. While more traditional channel-based approaches of FRP analysis may be able to separate these effects by scalp location (e.g., Oz vs. Pz), the inference into the constituent neural sources remains more difficult. Likewise, channel-based approaches require an explicit EOG mitigation or removal process, using ICA (Plöchl et al., 2012) or other techniques (Parra et al., 2005). For example, a number of recent studies have utilized ICA for the identification and removal of EOG components (Nikolaev et al., 2011, 2013; Devillez et al., 2015). However, these studies typically included a manual or semi-manual step for the identification of ICs related to corneo-retinal potentials and eyelid artifacts (although see Mognon et al., 2011; Plöchl et al., 2012). In contrast, our approach uses equivalent dipole locations to include or exclude particular ICs. While there is ongoing debate as to the accuracy of source localization techniques such as LORETA, there is growing evidence that suggests independent sources are indeed dipolar (Delorme et al., 2012). Fortunately, eye movement related ICs typically explain large fractions of the total signal variance and resolve to equivalent dipoles outside the brain volume with relatively little residual error.

TABLE 4 | ANOVA statistics for classifier performance in the visual task.

\*Auditory N-back level: Silent, Ignore, 0-Back, 1-Back, 2-Back.

In comparison, IC clustering results (Makeig et al., 2002) are highly dependent on the choice of the clustering parameters (in many cases up to 12 tunable parameters without a clear physiological interpretation for each parameter) and provide no guarantees in terms of producing clusters at particular ROIs. However, the ROI-based measure projection approach (ROI-MPA) is able to focus the analysis on selected ROIs while utilizing only a single parameter with a physiological interpretation (i.e., the expected spatial uncertainty in IC dipole localization). Likewise, IC clusters are not well-suited for single-trial analysis. Simply averaging the single-trial activity of the ICs contained in each cluster would not properly account for the spatial distribution of dipole locations. For example, ICs adjacent to the cluster boundary would be excluded (weighted zero) while ICs just inside the cluster would be weighted at unity. ROI-MPA directly incorporates this spatial information and the ROI structure by forming a weighted sum based on IC spatial probability overlap with each ROI.
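The idea of a weighted sum based on spatial overlap can be illustrated with a simplified sketch. It assumes each IC dipole is modeled as an isotropic 3D Gaussian whose width is the spatial uncertainty, and that an ROI is given as a cloud of voxel coordinates; the normalization, the 12 mm width, and the function names are assumptions for illustration and may differ from the published ROI-MPA implementation.

```python
import numpy as np

def roi_weights(dipoles, roi_voxels, sigma_mm=12.0):
    """Weight of each IC for one ROI from Gaussian overlap with the ROI voxels.

    dipoles:    (n_ics, 3) equivalent dipole locations in mm
    roi_voxels: (n_voxels, 3) coordinates of voxels belonging to the ROI
    sigma_mm:   assumed spatial uncertainty of dipole localization
    """
    d2 = ((dipoles[:, None, :] - roi_voxels[None, :, :]) ** 2).sum(axis=2)
    overlap = np.exp(-0.5 * d2 / sigma_mm ** 2).sum(axis=1)
    total = overlap.sum()
    return overlap / total if total > 0 else overlap

def roi_activity(ic_epoch, weights):
    """Weighted sum of IC activations: (n_ics, n_samples) -> (n_samples,)."""
    return weights @ ic_epoch

# Toy ROI (assumption): a 20 mm cube of voxels centered at the origin
axis = np.arange(-10.0, 10.1, 5.0)
roi_voxels = np.array([[x, y, z] for x in axis for y in axis for z in axis])

dipoles = np.array([
    [0.0, 0.0, 0.0],      # inside the ROI
    [20.0, 0.0, 0.0],     # just outside the ROI boundary
    [0.0, 80.0, -40.0],   # far away, e.g., an eye-movement-related IC
])
weights = roi_weights(dipoles, roi_voxels)

rng = np.random.default_rng(0)
ic_epoch = rng.normal(size=(3, 100))
roi_epoch = roi_activity(ic_epoch, weights)
```

Unlike hard cluster membership, the weights fall off smoothly with distance, so an IC just outside the ROI still contributes while a distant IC contributes essentially nothing.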

## Eye Tracking, Pupillometry, and EEG

A growing number of studies are combining eye tracking with EEG to enable the exploration of neural activity during visual search. While the task employed here was not a visual search paradigm, our results demonstrate the ability to acquire and utilize gaze position to detect saccades and quantify evoked neural activity. Importantly, our experimental configuration employed a head-free tracking system (SMI RED 250) without a requisite chinrest. This configuration facilitates visual search paradigms or related tasks requiring a large field of view that may naturally engender small head movements. To improve the spatial accuracy of such a system, our analysis included a post-hoc calibration of gaze position. Specifically, we utilized task information to infer gaze position when adjusting the offline calibration model. While this type of information may not always be available, experimenters can and should use an opportunistic calibration approach during periods where gaze position can reasonably be inferred (e.g., prior to trial initiation or visual target detection).

An additional benefit derived from the inclusion of eye tracking is the coincident measure of pupil size. For example, the change in pupil diameter shown here suggests that increased working memory load resulted in an increase in arousal level, an important modulator of cognitive performance. Since arousal is largely regulated through the norepinephrine system via the locus coeruleus (LC), a nucleus within the dorsal pons, it cannot be measured directly via EEG. However, several studies have shown that pupil dilation can be used as a proxy for LC activity (Aston-Jones and Cohen, 2005; Murphy et al., 2011; Hong et al., 2014). Furthermore, the LC receives input from anterior cingulate and dorsolateral prefrontal cortex and some studies have suggested that the LC system underlies the parietal P3 ERP, specifically the P3b (Nieuwenhuis et al., 2005, 2011).

Overall, our results confirm the localization of the P3 to parietal ROIs and show a significant effect of arousal on both pupil diameter and P3 amplitude (Murphy et al., 2011). In addition, target FRPs in the posterior ROIs show a negative deflection, relative to non-target FRPs, beginning around 200 ms post-fixation. This difference is consistent with the visual mismatch negativity (vMMN), a negative posterior deflection elicited by an infrequent (deviant) visual stimulus presented in a homogeneous sequence of frequent (standard) stimuli (Czigler et al., 2002). In particular, this difference is consistent with the later components of the vMMN associated with memory-comparison-based change detection (Kimura et al., 2009). However, since the infrequent stimuli (Ts) are also task-relevant, it is difficult to dissociate this vMMN from the attentional orienting component of the P3 (Polich, 2007).

Interestingly, the frontal ROI and Fz electrode showed a small but significant difference between target and non-target fixations at an early latency (approximately 80 ms). Previous studies have shown activity associated with peripheral detection in frontal-parietal regions early in and even prior to target fixations (Dias et al., 2013; Devillez et al., 2015). In this paradigm, target stimuli were never immediately adjacent to the current grid location (red annulus), making the peripheral detection of an upcoming target unlikely. However, it would be reasonable for participants to anticipate a target fixation after a sequence of non-target stimuli were encountered. This phenomenon illustrates the manifold difficulties in the interpretation of eye movement related activity given the dependencies between eye movement behavior (e.g., saccade size, fixation duration) and the elicited response. Similarly, in free-viewing contexts there remains the additional challenge of separating overlapping responses from adjacent saccades and fixations. Fortunately, several techniques have now been proposed to address this potential confound using regression or GLM-based approaches (Burns et al., 2013; Smith and Kutas, 2015; Kristensen et al., 2017).

### Model Considerations

The ICA-based, hierarchical classification algorithm described here is not ideally suited for real-time application, nor is it meant as an alternative to other channel-based approaches (e.g., HDCA). Rather, the goal of this study was to identify the discriminant neural response in each ROI and to quantify the effect of cognitive load on that response. As such, we did not take additional steps to separate data in our cross-validation scheme. Due to data limitations, ICA was applied to the entire EEG record for each participant rather than independently to each training set. Since ICA is an unsupervised technique, however, the potential for overfitting is limited. Additionally, we selected the hyperparameter by balancing the effective degrees of freedom with the number of data points (fixation epochs) rather than through a separate cross-validation step. Here, the exact choice of parameter did not substantially influence the results (see Supplementary Section 2). The added separation of these additional cross-validation steps would significantly reduce the amount of data and the quality of the estimated discriminant functions without providing any additional insight into the neural processes.
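For ridge regression, the effective degrees of freedom have a standard closed form, df(λ) = Σ d_i² / (d_i² + λ), where d_i are the singular values of the centered feature matrix. The sketch below finds the λ that yields a requested df by bisection. The target value (here one tenth of the number of epochs) and the function names are assumptions for illustration; the exact balancing rule used in the study is not reproduced here.

```python
import numpy as np

def effective_dof(singular_values, lam):
    """Effective degrees of freedom of a ridge fit with penalty lam."""
    d2 = singular_values ** 2
    return float(np.sum(d2 / (d2 + lam)))

def lambda_for_dof(X, target_dof, lo=1e-8, hi=1e12, iters=200):
    """Find lam whose effective dof matches target_dof (bisection in log space)."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_dof(s, mid) > target_dof:
            lo = mid          # model too flexible: increase regularization
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Synthetic feature matrix (assumption): 100 epochs, 400 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))
target = X.shape[0] / 10.0                     # hypothetical balancing rule
lam = lambda_for_dof(X, target)

s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
dof = effective_dof(s, lam)
```

Because df(λ) decreases monotonically with λ, the bisection is guaranteed to converge as long as the target lies below the rank of the centered feature matrix.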

#### Implications for BCI

Importantly, our hierarchical classification scheme was shown to perform at a level similar to other state-of-the-art machine learning algorithms such as HDCA and XD+BLDA. This result suggests that our approach captured the majority of the task-related variance within the EEG record. While the average FRPs are useful and exhibit an effect of cognitive load, single-trial classification techniques can reveal additional discriminative activity (Brouwer et al., 2012). The forward model of each ROI (**Figure 7**) reveals both the time course and spectral characteristics of the discriminant neural response. Thus, while our approach imposes an additional computational burden compared with the above methods, it adds insight into the source of task-related neural activity.

Our results can likewise be used to guide BCI development and future applications. P3-based paradigms remain a key component of the BCI application space, such as the P300 Speller (Krusienski et al., 2006). Presently, these reactive BCIs typically classify the neural response to passively viewed stimuli, such as in a rapid serial visual presentation (Gerson et al., 2005; Touryan et al., 2011; Bigdely-Shamlo et al., 2013). In this passive condition, stimuli are presented to the user who detects the desired target (e.g., target object within an image). In contrast, for targets occurring in natural or ordered environments, a more ecologically valid approach for detection would be through goal-directed visual search (Jangraw et al., 2014; Ušćumlić and Blankertz, 2016). In this case, stimulus presentation is controlled through the user's search strategy, with fixation onset serving as a natural time-locking event. Thus, the growing body of work on single-trial classification of FRPs will support the improved performance of future FRP BCI technology.

### CONCLUSION AND FUTURE WORK

In this study, we provide a principled framework for interpreting EEG in the presence of eye movements and concurrent task demands by adapting a recently developed independent source aggregation technique (Bigdely-Shamlo et al., 2013). This approach enabled us to both quantify the discriminant information contained within each cortical region and measure the effect of cognitive load on the evoked response. While these phenomena have been previously observed, our results demonstrate the feasibility and utility of combining synchronous recordings of EEG and eye-tracking to measure both sensory and cognitive processes. Our approach can be extended to tasks that incorporate unconstrained eye movements; however, additional techniques would be needed to account for the overlapping activity from adjacent saccades and fixations.

In this experiment we did not explicitly manipulate top-down (goal-directed) or bottom-up (stimulus driven) components of the visual task beyond increasing the overall working memory load. However, our ROI mapping framework would be well suited for such an assessment. ROI analysis could be applied across conditions to identify what factors bias top-down vs. bottom-up neural activity in a visual search paradigm. Eye movements biased by top-down task influences may have greater pre- or post-saccadic activity in frontal cortices (Nikolaev et al., 2013). Likewise, eye movements driven by bottom-up stimulus influences may have greater post-saccadic activity in occipital cortex. This distinction may become essential for understanding free-viewing search in natural scenes where visual information leading to a detection event can be accumulated across fixations (Jangraw et al., 2014) rather than isolated to a single gaze position.

#### ETHICS STATEMENT

The voluntary, fully informed consent of the persons used in this research was obtained in written form. The document used to obtain informed consent was approved by the U.S. Army Research Laboratory's Institutional Review Board (IRB) in accordance with 32 CFR 219 and AR 70-25, and also in compliance with the Declaration of Helsinki. The study was reviewed and approved (approval# ARL 14-042) by the U.S. Army Research Laboratory's IRB before the study began.

### AUTHOR CONTRIBUTIONS

JT, AR, and PC developed the task and conducted the experiment. VL and NB developed the single-trial ROI method. AR and JT analyzed the data. JT wrote the manuscript.

### FUNDING

Research was sponsored by the U.S. Army Research Laboratory and was accomplished under Contract Numbers W911NF-09-D-0001, W911NF-10-D-0022, and ARL-74A-HR53. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

### ACKNOWLEDGMENTS

The authors would like to thank Barry Ahrens for the development of the task software and data synchronization scheme. The authors would also like to thank Stephen Gordon and Michael Nonte for help with HDCA and XD+BLDA classification and Anne-Marie Brouwer for helpful comments on the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2017.00357/full#supplementary-material

#### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Touryan, Lawhern, Connolly, Bigdely-Shamlo and Ries. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Assessing the Depth of Cognitive Processing as the Basis for Potential User-State Adaptation

Irina-Emilia Nicolae<sup>1,2</sup>\*, Laura Acqualagna<sup>2</sup> and Benjamin Blankertz<sup>2</sup>\*

<sup>1</sup> Department of Applied Electronics and Information Engineering, Politehnica University of Bucharest, Bucharest, Romania, <sup>2</sup> Department of Neurotechnology, Technische Universität Berlin, Berlin, Germany

Objective: Decoding neurocognitive processes on a single-trial basis with Brain-Computer Interface (BCI) techniques can reveal the user's internal interpretation of the current situation. Such information can potentially be exploited to make devices and interfaces more user-aware. In this line of research, we took a further step by studying neural correlates of different levels of cognitive processes and developing a method that makes it possible to quantify how deeply presented information is processed in the brain.

Edited by:

Felix Putze, University of Bremen, Germany

#### Reviewed by:

Michael Tangermann, Albert Ludwigs University of Freiburg, Germany Ricardo Chavarriaga, École Polytechnique Fédérale de Lausanne, Switzerland

\*Correspondence:

Irina-Emilia Nicolae irina.nicolae@aut.pub.ro Benjamin Blankertz benjamin.blankertz@tu-berlin.de

#### Specialty section:

This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience

Received: 20 July 2017 Accepted: 20 September 2017 Published: 04 October 2017

#### Citation:

Nicolae I-E, Acqualagna L and Blankertz B (2017) Assessing the Depth of Cognitive Processing as the Basis for Potential User-State Adaptation. Front. Neurosci. 11:548. doi: 10.3389/fnins.2017.00548

Methods/Approach: Seventeen participants took part in an EEG study in which we evaluated different levels of cognitive processing (no processing, shallow, and deep processing) within three distinct domains (memory, language, and visual imagination). Our investigations showed gradual differences in the amplitudes of event-related potentials (ERPs) and in the extent and duration of event-related desynchronization (ERD), both of which correlate with task difficulty. We performed multi-modal classification to map the measured correlates of neurocognitive processing to the corresponding level of processing.

Results: Successful classification of the neural components was achieved, which reflects the level of cognitive processing performed by the participants. The results show performances above chance level for each participant and a mean performance of 70–90% for all conditions and classification pairs.

Significance: The successful estimation of the level of cognition on a single-trial basis supports the feasibility of user-state adaptation based on ongoing neural activity. There is a variety of potential use cases, such as a user-friendly adaptive design of an interface or the development of assistance systems in safety-critical workplaces.

Keywords: electroencephalography (EEG), event-related potentials (ERPs), oscillatory activity, cognitive processes, single-trial classification

### INTRODUCTION

While Brain-Computer Interface (BCI) research primarily targets medical applications (Birbaumer, 2006; Dornhege et al., 2007; van Gerven et al., 2009; Wolpaw and Wolpaw, 2012; Guger et al., 2014; Hassanien and Taher Azar, 2015), more and more perspectives are being explored that go beyond communication and control paradigms (Blankertz et al., 2010, 2015, 2016; Zander and Kothe, 2011; Borghini et al., 2014; Zander et al., 2016). One of those perspectives is the development of systems that take the user's ongoing mental state into account and automatically adapt according to the user's mindset (Müller et al., 2008; van Erp et al., 2015) by exploiting implicit information (Gamberini et al., 2015).

Neural correlates of cognitive processes can be found in the time-locked Event-Related Potentials (ERPs), usually composed of different components, and in modulations of spontaneous brain rhythms. The phase of those background rhythms is not time-locked to external stimuli, but the modulation of their amplitudes (i.e., their envelopes) can be time-locked. In that case, the effect is called Event-Related Desynchronization or Synchronization (ERD/ERS) (Pfurtscheller and Lopes da Silva, 1999; Lemm et al., 2009), depending on whether it is a decrease or an increase in spectral power of the given frequency band.
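A common way to quantify ERD/ERS is the percentage change in band power in an activity interval (A) relative to a reference interval (R), here written as 100 × (A − R) / R so that negative values indicate desynchronization; note that sign conventions differ across papers. The sketch below is a simplified illustration: the FFT-based band-pass, the interval boundaries, and the synthetic 10 Hz data are assumptions and are not the processing used in this study.

```python
import numpy as np

def band_power_timecourse(epochs, fs, f_lo, f_hi):
    """Trial-averaged power in [f_lo, f_hi] Hz via FFT masking and squaring."""
    n = epochs.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(epochs, axis=1)
    spec[:, (freqs < f_lo) | (freqs > f_hi)] = 0.0
    filtered = np.fft.irfft(spec, n=n, axis=1)
    # Squaring before averaging keeps power that is not phase-locked
    return (filtered ** 2).mean(axis=0)

def erd_percent(power, times, ref=(-0.5, 0.0), act=(0.5, 1.0)):
    """Percentage power change; negative = ERD, positive = ERS."""
    R = power[(times >= ref[0]) & (times < ref[1])].mean()
    A = power[(times >= act[0]) & (times < act[1])].mean()
    return 100.0 * (A - R) / R

# Synthetic alpha rhythm (assumption): amplitude halves 200 ms after the stimulus
fs = 200.0
times = np.arange(-1.0, 2.0, 1.0 / fs)
rng = np.random.default_rng(3)
amplitude = np.where(times < 0.2, 1.0, 0.5)
phases = rng.uniform(0, 2 * np.pi, size=(60, 1))    # random phase per trial
epochs = amplitude * np.sin(2 * np.pi * 10.0 * times + phases)
epochs += rng.normal(0, 0.2, epochs.shape)

power = band_power_timecourse(epochs, fs, 8.0, 13.0)
erd = erd_percent(power, times)
```

Because each trial has a random phase, averaging the raw signal would cancel the rhythm, whereas averaging the squared band-passed signal preserves the amplitude modulation, which is the property the paragraph above describes.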

ERP components have been widely investigated, and previous studies involving cognitive activities in oddball paradigms show that the amplitude and latency of the ERPs are modified according to task difficulty (Donchin et al., 1973; Ullsperger et al., 1987; Polich, 2007; Kim et al., 2008). Specifically, increased P300 amplitude and longer latencies are usually found, relating to more complex processes and stronger attentional demand. These cognitive ERP components are reported to appear between 300 and 500 ms after stimulus onset, over the centro-parietal cortex (Polich, 2007).

Moreover, changes in the frequency domain are also reported to be modulated by the difficulty of cognitive processes. First, the presentation of a stimulus triggers a short synchronization followed by a prolonged desynchronization, mostly occurring in the α band at the frontal, temporal, central, and parietal locations of the scalp in accordance with the type of cognitive process (Klimesch et al., 1992, 1993, 1997; Klimesch, 1999), which follows the P300 potential (Yordanova et al., 2001). The α band desynchronizes during mental activity and cognitive judgment (Klimesch, 1999), and this desynchronization is reported to increase proportionally with more difficult cognitive processing, a change mostly encountered at the centro-parietal sites. The effect is visible in laboratory environments (Gevins et al., 1997) as well as in more realistic scenarios (Venthur et al., 2010).

In addition, β oscillations are linked to complex cognitive processes (Pesonen et al., 2007; Okazaki et al., 2008; Sheth et al., 2009): decreased β oscillations appearing at the central and parietal sites are correlated with complex reasoning (Basile et al., 2013) and decision making (Nakata et al., 2013), and are also related to the transition of cognitive states (Sheth et al., 2009).

Changes in θ activity are observed as synchronizations in relation to task difficulty (Klimesch, 1999), e.g., θ power increases proportionally with memory load (Gundel and Wilson, 1992; Gevins et al., 1997), and they also relate to the encoding of new information (Klimesch et al., 1996; Klimesch, 1999). Regarding localization, the θ changes are known to appear at the frontal midline scalp location (Gevins et al., 1997).

A pronounced ERD was found in relation to different types of cognitive processes. For example, in memory processes (Mecklinger et al., 1992; Klimesch et al., 1994; Stipacek et al., 2003; Pesonen et al., 2007), perceptual encoding and attentional processes, a stronger ERD is observed in the α band (Sergeant et al., 1987; Klimesch, 2003; Schack et al., 2005; Polich, 2007). In addition, the processing of semantic information, e.g., words, likewise shows ERD enhancements (Klimesch et al., 1997).

In a recent study, Naumann et al. (2017) investigated gradual differences in task difficulty by estimating the difficulty level of a video game from the ongoing neural activity of the user. They likewise found significant modulations in the θ (4–7 Hz) and α (8–13 Hz) frequency bands, associated with changes in task difficulty.

In the present work, we have studied the feasibility of quantifying how deeply presented information is processed in the brain by tapping the corresponding components of brain activity. To that end, we analyzed single-trial EEG data with respect to the discriminative value of ERPs and of modulations of brain rhythms. In the ERPs, the main effect was an increase of the P300 amplitude with task demand, along with domain-specific modulations in later components; see also Nicolae et al. (2015a). Spontaneous background oscillations showed a prolonged suppression of the α and β rhythms as a reflection of profound cognitive processing. For extracting the features for classification, we employed spatio-temporal features of the ERPs (Blankertz et al., 2011) and we exploited the ERD effect by combining spatio-spectral decomposition (SSD, Nikulin et al., 2011) in two frequency bands with Common Spatial Patterns (CSP) analysis (Fukunaga, 1990; Koles, 1991). Combining features of ERPs and ERDs in a multimodal classification approach (Fazli et al., 2015) leads to a performance increase compared to using a single modality only (Nicolae et al., 2015b). The current manuscript comprises a detailed investigation of the depth of cognitive processing, integrating the analysis of the modulations of brain rhythms (ERDs) and the discriminability of spectral features (power spectrum, CSPs). It extends previous work in Nicolae et al. (2015a), where the temporal and spatial evolution of the ERPs is briefly analyzed, and the abstract in Nicolae et al. (2016), where a short overview of the power spectrum is presented. In addition, the present work contributes an enhanced multivariate classification approach based on both the temporal and spectral features extracted from the EEG, as compared to the classification based on only the temporal features presented in Nicolae et al. (2015b).
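To make the CSP step more concrete, the following minimal sketch computes spatial filters for two classes by whitening the composite covariance and diagonalizing one class covariance in the whitened space, then extracts log-variance features. It covers CSP only (no SSD preprocessing), runs on synthetic data, and uses hypothetical names such as `epochs_deep` and `epochs_none`; it is not the BBCI toolbox implementation used in this work.

```python
import numpy as np

def csp_filters(epochs_a, epochs_b, n_pairs=1):
    """Common Spatial Patterns for two classes of band-pass filtered epochs.

    epochs_*: arrays (n_epochs, n_channels, n_samples)
    Returns W (n_channels, 2 * n_pairs); columns are spatial filters.
    """
    def mean_cov(epochs):
        return np.mean([e @ e.T / e.shape[1] for e in epochs], axis=0)

    Ca, Cb = mean_cov(epochs_a), mean_cov(epochs_b)
    # Whiten the composite covariance, then diagonalize class A in that space
    evals, evecs = np.linalg.eigh(Ca + Cb)
    P = evecs / np.sqrt(evals)                  # P.T @ (Ca + Cb) @ P = I
    lam, U = np.linalg.eigh(P.T @ Ca @ P)       # lam sorted in ascending order
    W = P @ U
    keep = np.r_[np.arange(n_pairs), np.arange(len(lam) - n_pairs, len(lam))]
    return W[:, keep]

def log_variance(epochs, W):
    """Log-variance of spatially filtered epochs: (n_epochs, n_filters)."""
    projected = np.einsum("cf,ncs->nfs", W, epochs)
    return np.log(projected.var(axis=2))

# Synthetic data (assumption): one source is stronger in class A than in class B
rng = np.random.default_rng(4)
n_ch, n_samp = 8, 400
mixing = rng.normal(size=(n_ch, n_ch))

def simulate(n_epochs, source_std):
    sources = rng.normal(size=(n_epochs, n_ch, n_samp))
    sources[:, 0, :] *= source_std
    return np.einsum("cs,nst->nct", mixing, sources)

epochs_none = simulate(30, 2.0)    # strong rhythm (no desynchronization)
epochs_deep = simulate(30, 0.5)    # suppressed rhythm (ERD-like)

W = csp_filters(epochs_none, epochs_deep)
feats_none = log_variance(epochs_none, W)
feats_deep = log_variance(epochs_deep, W)
```

The last filter maximizes variance for the first class relative to the second, so its log-variance feature separates the two synthetic conditions; such features would then be passed to a linear classifier.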

## MATERIALS AND METHODS

#### Experimental Setup

#### Participants

Fifteen healthy participants with no acute or chronic neurological and/or psychiatric disorders were considered for the study; pregnant women were excluded. Eleven participants were right-handed, ten were male, and all were aged between 22 and 35 years. Eleven participants had German as their mother tongue, one participant had English, and the others had different native languages, with a good command of English or German required in order to fulfill the task in the language condition. Seventeen participants were initially recorded, but the data of two participants were removed from the analysis due to strong artifacts, probably caused by improper recording. The experimental procedure was conducted in accordance with the Declaration of Helsinki and approved by the ethics committee of the Department of Psychology and Ergonomics of the Technische Universität Berlin, and written informed consent was obtained from each participant. Participants received financial compensation to support their motivation.

#### Material

The hardware equipment used for the acquisition of the Electroencephalography (EEG) was a BrainAmp amplifier with 64 active electrodes (Brain Products GmbH, Munich, Germany) positioned according to the 10–20 international system. One electrode, named EOG, was placed under the left eye and used to record eye movements. We used unipolar recording at a sampling frequency of 1 kHz with the ground placed on the scalp at position AFz and with the reference at the left mastoid. The data were re-referenced to the left and right mastoids. When mounting the electrode cap, the impedance was kept below 20 kΩ. The acquired data will be made available from the DepositOnce repository of Technische Universität Berlin (https://depositonce.tu-berlin.de/).

The stimuli were designed as vector graphics with the Inkscape software (version 0.91.0.1, https://inkscape.org), except for the images in the animals category, which were created by Freepik and taken from the Freepik database (http://www.freepik.com/free-photos-vectors/icon). The stimuli were presented on a 24" display with a 60 Hz refresh rate and 1,920 × 1,200 resolution (Dell U2410). For developing the experimental paradigm, the Processing software (version 3.0a4, https://processing.org/) was used in conjunction with MATLAB software for signal acquisition (release R2014a, The MathWorks, Inc., Natick, MA, USA). For the offline analysis, the data was processed with the BBCI MATLAB Toolbox (https://github.com/bbci/bbci\_public).

#### Experimental Scenario

We considered two degrees of cognitive processing, namely shallow and deep processing levels (Craik and Lockhart, 1972), in a visual stimulus paradigm. In our scenario, shallow processing involves basic information processing revealed by attention (color appearance), and deep processing requires a complex activity related to specific cognitive tasks in the memory, language, and visual imagination conditions (Ganis et al., 2013). Each visual stimulus consisted of a pair of cartoon-like drawings. Every image represented an object chosen from one of three categories (animals, fruits, and mobility) and was shown in one of four colors: red, green, blue, or magenta. Both images had the same color and the same category, and each category comprised a total of 10 objects (see **Figure 1**). To maintain the desired ratio without making the tasks too difficult, only two of the three categories were chosen for each run.

The experimental study took place in a laboratory environment. Participants performed the experiments seated and were requested to stay still, relaxed, and focused. They were allowed to freely explore the information presented but, in general, were asked to keep their gaze on the center of the screen. Each participant first completed a practice test (1–3 runs) in order to become familiar with the tasks.

The experimental paradigm is described below. Before the start of the sequence, and after a short self-evaluation of the participant's current state, the condition to be performed was displayed along with the target image pair (target color and category). Once the sequence started, participants had to check first the color and then the category of each stimulus. If the color did not match the target, no processing at all was required (**Non-Target, NT** case). If only the color matched, only the mental computation was required (**Shallow Target, ST** case). If both color and category matched the target (**Deep Target, DT** case), participants had to evaluate the requested cognitive task and perform the corresponding mental computations (addition). Following the example sketched in **Figure 2**, the procedure was as follows:

	- First, check if the stimulus color matches the target color:
		- → if not, do nothing (**NT**—non-target);
		- → if yes, count +1 and:
			- check if the category matches the target category:
				- → if not, do nothing (**ST**—shallow target);
				- → if yes, perform the cognitive task associated to each cognitive process and memorize the new images for the next trial in case of the memory condition:
					- if the answer is negative, do nothing additional (**DT**—deep target);
					- or in case of positive answer, additionally count +10 (**DT**—deep target);

Participants entered the resulting final number at the end of each run and then received feedback on the correct number.
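The counting procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Stimulus` class, its field names, and the function names are hypothetical, and the sketch assumes a fixed target within a run (it does not model the memory condition, where the comparison pair is updated from trial to trial).

```python
from dataclasses import dataclass


@dataclass
class Stimulus:
    # Hypothetical container for one stimulus pair (both images share these).
    color: str
    category: str
    answer_is_yes: bool  # outcome of the condition-specific yes/no question


def classify(stim, target_color, target_category):
    """Label a stimulus as 'NT', 'ST', or 'DT' following the listed procedure."""
    if stim.color != target_color:
        return "NT"  # color mismatch: no processing
    if stim.category != target_category:
        return "ST"  # color matches only: shallow target
    return "DT"      # color and category match: deep target


def expected_count(stimuli, target_color, target_category):
    """Number a participant should report at the end of a run."""
    total = 0
    for stim in stimuli:
        label = classify(stim, target_color, target_category)
        if label == "NT":
            continue
        total += 1  # any color match counts +1 (ST and DT)
        if label == "DT" and stim.answer_is_yes:
            total += 10  # positive answer to the cognitive task adds +10
    return total
```

For instance, with target "red animals", a run containing a red animal with a positive answer, a red fruit, a blue animal, and a red animal with a negative answer would yield 11 + 1 + 0 + 1 = 13.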

The cognitive conditions were conducted separately in this order: memory, language, and visual imagination, with five runs per condition. We did not alternate between the conditions after each run in order to avoid confusion between the tasks, which were quite demanding. In each condition, participants had to "answer" a yes-no question, which is detailed below.

#### **Memory**

The first condition considers memory retrieval by comparing a previously presented stimulus with the current stimulus (Kirchner, 1958; Chen et al., 2008). Specifically, the question is whether the current stimulus coincides with the previous target pair (the last pair with the target color and target category). Completing the task requires memorizing the current pair and retrieving the previous target pair. For an example, see **Figure 2**, left column. An additional example of the memory experimental paradigm is provided in **Supplementary Video 1** of the Supplementary Presentation 1.

#### **Language**

The language task considered comparisons based on phonemic representations between the words that represent the images. The task was to decide whether the number of syllables of the left image's word was greater than or equal to the number of syllables of the word for the object on the right side. This condition used English or German words, depending on the participant's native language. Examples are shown in **Figure 2**, middle column. Note that the chosen objects (**Figure 1**) map almost uniquely to their representing words. If a participant employed a different word, this could only affect the behavioral data, while the actual performance (DT vs. ST or NT) would not be affected. An additional example of the language experimental paradigm is provided in **Supplementary Video 2** of the Supplementary Presentation 1.

FIGURE 1 | Stimuli categories (animals, fruits, mobility) and examples of object representations. Each object could be presented in one of the four colors (red/green/blue/magenta).

#### **Visual imagination**

The visual imagination task required mental representations in order to perform a comparison based on real-world size. Participants judged whether three times the dimension of the left object (or of a part of the object) is greater than or equal to the right object's size, considering the average dimensions of the represented objects. The respective part of the object and the dimension type (length, height, or thickness) used for the comparison were emphasized with a marker on the stimulus image. Further examples are in **Figure 2**, right column.

When generating the stimulus pairs for the visual imagination condition, a small constraint was added in order to ensure similar complexity within the task. The constraint kept the difference in dimensions within a range: the absolute difference between three times the left object's dimension and the right object's dimension had to be less than or equal to the left object's dimension. Hence, the discrepancy between the object sizes was controlled: no large differences, which would imply an easy comparison, and no small ones either, which would result in a hard or ambiguous comparison. A short additional example of the visual imagination experimental paradigm is provided in **Supplementary Video 3** of the Supplementary Presentation 1.
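The size question and the pair-generation constraint can be written down directly. The following is a minimal sketch with hypothetical function names; `left` and `right` stand for the average real-world dimensions of the two objects in the same unit.

```python
def size_answer(left, right):
    """Yes/no answer: is three times the left dimension >= the right one?"""
    return 3 * left >= right


def satisfies_constraint(left, right):
    """Pair-generation constraint: |3 * left - right| <= left."""
    return abs(3 * left - right) <= left
```

For positive dimensions, the constraint is equivalent to requiring `2 * left <= right <= 4 * left`, so an object of 10 cm on the left could be paired with right-hand objects between 20 and 40 cm.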

#### Experimental Design

As a pilot study, four participants were asked to test the application without an EEG cap. The results of this pilot study were used to calibrate the speed and the complexity of the task. The time course of the experiment is presented in **Figure 3**. It consisted of a 2,500 ms inter-stimulus interval, divided as follows: 500 ms fixation, 1,250 ms stimulus presentation, and 750 ms relaxation period.

Each condition was evaluated in five runs with a total of 600 stimuli. The two images representing the stimulus were scaled to a common 480 × 480 resolution and presented close to the center of the screen, at a distance of 2" from one another, on a light gray background.

The same percentage of shallow and deep targets was chosen in order to avoid confounds caused by a different number of occurrences: 75 ± 2% for non-targets (NT) and 12.5 ± 2% each for shallow targets (ST) and deep targets (DT).
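As a quick plausibility check of the design figures, the nominal trial counts and run duration can be computed as below. This sketch rests on two assumptions not stated explicitly in the text: that the 600 stimuli are the total per condition and are split evenly over the five runs, and that the ± 2% tolerance and any pauses are ignored.

```python
FIXATION_MS = 500
STIMULUS_MS = 1250
RELAX_MS = 750
TRIAL_MS = FIXATION_MS + STIMULUS_MS + RELAX_MS  # inter-stimulus interval

N_STIMULI = 600  # assumed: total per condition
N_RUNS = 5
SHARES = {"NT": 0.75, "ST": 0.125, "DT": 0.125}  # nominal class proportions


def nominal_counts(n_stimuli=N_STIMULI, shares=SHARES):
    """Nominal number of stimuli per class for one condition."""
    return {label: round(n_stimuli * share) for label, share in shares.items()}


def run_duration_s(n_stimuli=N_STIMULI, n_runs=N_RUNS, trial_ms=TRIAL_MS):
    """Nominal duration of one run in seconds (assumes equal-sized runs)."""
    return (n_stimuli / n_runs) * trial_ms / 1000.0
```

Under these assumptions, each condition contains about 450 non-targets and 75 targets of each kind, and one run lasts about five minutes.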

### Behavioral Assessment

The subjects' behavioral responses were assessed by subjective and objective indicators. The subjective indicators, collected by means of a questionnaire, were personal feedback on the internal state/mood (good/ok/bad) and a personal rating of the difficulty of the conditions, which was acquired by scoring (0: easy; 1: medium; 2: hard). The objective indicator (the subjects' answers ratio) is derived from the actual responses: it is the absolute difference between the correct number and the number reported by the user, divided by the correct number.
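The answers ratio defined above amounts to a relative counting error. A minimal sketch follows (the function name is hypothetical, and the handling of a zero correct count is an assumption, since the text does not cover that case):

```python
def answers_ratio(correct, response):
    """Relative error of the reported count: |correct - response| / correct."""
    if correct == 0:
        # Assumption: the ratio is undefined when the correct count is zero.
        raise ValueError("correct count must be non-zero")
    return abs(correct - response) / correct
```

A value of 0 means the participant reported exactly the correct number; for example, reporting 38 when the correct number is 40 gives a ratio of 0.05.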

### Data Analysis

The signal processing and machine learning methods are described below, mainly the multi-modal analysis and the multivariate classification, which give us complementary information on the neurophysiological effects of the cognitive processes. The multi-modal analysis investigates the neural correlates in the temporal and spectral domains: ERPs and Event-Related (De)Synchronization (ERD/ERS). The brain sources of neural oscillations which substantiate cognitive activity (Varela et al., 2001; Buzsáki and Draguhn, 2004) were extracted by advanced decomposition methods: SSD (Nikulin et al., 2011) and Common Spatial Pattern (CSP) (Fukunaga, 1990). The two different types of neurophysiological information, temporal and oscillatory activity, are combined with the concatenation approach described in Dornhege et al. (2004) in order to improve performance, as shown for example in other studies (Mühl et al., 2014). Finally, the depth of cognitive processing of the external information is estimated using multivariate single-trial classification by Regularized Linear Discriminant Analysis (Friedman, 1989).

#### Filtering and Epochs Rejection

For dimensionality reduction, the data was downsampled to 100 Hz. Preliminary processing of the data was performed by a sequence of low-pass filtering for anti-aliasing and high-pass filtering for drift removal. The low-pass filter is a Chebyshev type II filter of order 10 with a 42 Hz pass-band edge frequency and 3 dB ripple, and a 49 Hz stop-band with 50 dB attenuation. For high-pass filtering, a 1 Hz FIR filter of order 300 (three times the downsampling frequency) was applied, designed by least-squares error minimization and applied with forward-backward (reverse) digital filtering to obtain a zero-phase effect (for future online classification, appropriate causal filters must be considered). Subsequently, the data was segmented into epochs according to the experiment timing detailed in **Figure 3**.

A rough pre-cleaning of the data was additionally performed to remove noisy channels and epochs. A criterion evaluated on band-pass filtered data in a broad frequency band (5–40 Hz) was applied to remove channels dropping to zero. In particular, channels with a variance smaller than 0.5 µV<sup>2</sup> in more than 10% of the trials were removed. Moreover, epochs with muscle artifacts were removed by rejecting trials with excessive variance in 20% of the channels. For strong eye movement artifacts, a max-min criterion was applied: epochs with more than 150 µV difference between the maximum and minimum voltage in channels F9, F10, AF3, and AF4 were removed. To remove the background noise of the remaining epochs, baseline correction was performed trial-wise by subtracting the mean amplitude computed over 100 ms of the pre-stimulus period from each time point of the trial.
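Three of the pre-cleaning rules above are simple threshold tests and can be sketched in plain Python. This is not the BBCI toolbox implementation; the function names and data layout (lists of samples in µV) are hypothetical, and only the thresholds are taken from the text. At the 100 Hz sampling rate, the 100 ms baseline corresponds to 10 samples.

```python
from statistics import mean


def is_flat_channel(trial_variances, var_threshold=0.5, max_fraction=0.10):
    """True if the channel variance (µV^2) is below the threshold
    in more than `max_fraction` of the trials."""
    n_flat = sum(1 for v in trial_variances if v < var_threshold)
    return n_flat / len(trial_variances) > max_fraction


def exceeds_max_min(epoch, threshold_uv=150.0):
    """True if any channel of the epoch has a max-min range above the threshold.
    `epoch` maps channel names (e.g., F9, F10, AF3, AF4) to sample lists in µV."""
    return any(max(x) - min(x) > threshold_uv for x in epoch.values())


def baseline_correct(samples, n_baseline):
    """Subtract the mean of the first `n_baseline` (pre-stimulus) samples."""
    base = mean(samples[:n_baseline])
    return [s - base for s in samples]
```

The excessive-variance rule for muscle artifacts is left out of the sketch because the text does not state the variance threshold used.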

#### Artifact Removal

As a next step for removing artifacts, including smaller eye movement artifacts, muscular artifacts, and loose electrodes, we performed Independent Component Analysis (ICA) with the artifactual components selected by the Multiple Artifact Rejection Algorithm (MARA, Winkler et al., 2011). For further verification, each component was visually inspected considering its power spectral density and topographic distribution.

#### Univariate Discriminative Analysis

In addition to the temporal ERP analysis in Nicolae et al. (2015a), we evaluated the differences among levels of cognitive processing in the spectral domain (Nicolae et al., 2016). Moreover, we investigated the power spectrum with a discriminative measure given by the signed and squared point biserial correlation coefficient (signed r<sup>2</sup>), which quantifies the discriminability between the two classes, and the time course of power modulations with the help of Event-Related De/Synchronization (ERD/ERS) curves (Pfurtscheller and Lopes da Silva, 1999) in selected frequency bands.
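For a single feature (e.g., band power at one channel and frequency bin), the signed r<sup>2</sup> value can be computed as sketched below. The sketch uses the common definition of the point biserial correlation coefficient with the standard deviation taken over the pooled samples of both classes; the function name is hypothetical, and whether the toolbox uses exactly this normalization is an assumption.

```python
from math import sqrt
from statistics import mean, pstdev


def signed_r2(x1, x2):
    """Signed squared point-biserial correlation between a feature and
    the class label. Positive if class 1 has the larger mean."""
    n1, n2 = len(x1), len(x2)
    sd = pstdev(list(x1) + list(x2))  # population std over pooled samples
    if sd == 0:
        return 0.0  # constant feature carries no class information
    r = sqrt(n1 * n2) / (n1 + n2) * (mean(x1) - mean(x2)) / sd
    return r * abs(r)  # sign(r) * r^2
```

The value lies between −1 and 1: it is close to 0 when the class distributions overlap and approaches ±1 when the feature separates the classes perfectly.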

#### Multivariate Classification and Validation

In order to join complementary information about the neural activity and therefore improve single-trial classification, we combined two different types of neurophysiological information. We considered spatio-temporal features reflecting ERPs and oscillatory activity features, which are described hereunder.

#### **Spatio-temporal features**

The spatio-temporal features (channels and time) were extracted as in Nicolae et al. (2015a). The method, further described in Blankertz et al. (2011), detects five temporal windows for each participant based on a heuristic selection of the intervals with maximum discriminability and a constant pattern between two classes, based on the signed and squared point biserial correlation coefficient (signed r<sup>2</sup>). The selected intervals contain the most significant spatio-temporal features, which were found to be effective in other studies (e.g., Acqualagna and Blankertz, 2013).

#### **Spatio-spectral features**

As the cognitive processes produce observable modulations in the oscillatory activity, we considered extracting this information from the respective frequency bands. Based on the discriminative analysis, we selected the most significant frequency bands: α (8–14 Hz) and β (16–20 Hz). To enhance the discrimination of activity in the frequency band of interest, we performed SSD (Nikulin et al., 2011) and Common Spatial Pattern (CSP) (Blankertz et al., 2008) for the corresponding frequency bands.

Spatio-spectral decomposition, SSD. Linear spatial filtering prior to CSP was used in order to reach an efficient differentiation of mental states depicted by ERD/ERS rhythm patterns. Neural activity can sometimes be concealed in background noise fluctuations; therefore, for a better discrimination of the neural oscillations related to cognitive processing, we applied SSD (Nikulin et al., 2011), which enhances the signal-to-noise ratio. In order to extract individual oscillatory sources, SSD finds the optimal spatial filters based on a generalized eigenvalue decomposition that favors high band power in the frequency band of interest (band-pass filter) and low band power in the adjacent noise frequencies. The noise in the adjacent frequencies is captured with two pass-band filters of a desired width (e.g., 1 or 2 Hz), one below and one above the frequency band of interest, each separated from it by a gap (stop-band filter of e.g., 1 Hz). Note that because SSD requires frequency filtering in advance, we used continuous data to avoid filter edge artifacts.

For the SSD decomposition, we consider only components with eigenvalues higher than 10<sup>−6</sup> times the highest eigenvalue (see low-rank factorization in Haufe et al., 2014). Typically, between 15 and 35 components per discrimination pair were further selected.
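One common way to write the SSD optimization described above is as a ratio of band powers, which leads to a generalized eigenvalue problem. The notation is ours and not taken from the original text: C<sub>s</sub> denotes the covariance matrix of the data filtered in the band of interest, C<sub>n</sub> the covariance matrix of the data filtered in the flanking noise bands, and w a spatial filter.

```latex
\mathbf{w}^{*} = \arg\max_{\mathbf{w}}
  \frac{\mathbf{w}^{\top}\mathbf{C}_{s}\,\mathbf{w}}
       {\mathbf{w}^{\top}\mathbf{C}_{n}\,\mathbf{w}}
\quad\Longleftrightarrow\quad
\mathbf{C}_{s}\,\mathbf{w} = \lambda\,\mathbf{C}_{n}\,\mathbf{w}
```

Each eigenvalue λ equals the signal-to-noise power ratio achieved by its filter, so the filters with the largest eigenvalues extract the components with the most prominent oscillations in the band of interest.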

Multi-band Common Spatial Patterns, mCSP. In order to obtain discriminative information about the cognitive processes based on oscillatory activity, we make use of the widely used CSP method as described by Fukunaga (1990) and Koles (1991), which was successfully applied in a similar context (Schultze-Kraft et al., 2016). CSP facilitates the binary discrimination of different brain states by spatial filtering, enhancing the signal of interest while suppressing the background activity, by maximizing the variance for one class whilst minimizing the variance for the other class and vice versa. In our case, it increases the variance of a higher level of cognitive processing while diminishing the variance of a lower level of processing and vice versa. The components reaching this goal were automatically selected (as in Blankertz et al., 2008) up to a maximum of three spatial filters per class.
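In a standard formulation of CSP (cf. Blankertz et al., 2008), the filters are again obtained from a generalized eigenvalue problem, and the feature of a trial is the log-variance of the spatially filtered signal. The symbols below are our notation: Σ<sub>1</sub> and Σ<sub>2</sub> are the class-wise average covariance matrices of the band-pass filtered epochs, w is a spatial filter, and X is the channels-by-time matrix of one epoch.

```latex
\boldsymbol{\Sigma}_{1}\,\mathbf{w}
  = \lambda\,(\boldsymbol{\Sigma}_{1} + \boldsymbol{\Sigma}_{2})\,\mathbf{w},
\qquad
f = \log \operatorname{var}\!\left(\mathbf{w}^{\top}\mathbf{X}\right)
```

Here λ lies between 0 and 1: filters with λ close to 1 yield high variance for class 1 and low variance for class 2, and filters with λ close to 0 do the opposite, which is why filters are taken from both ends of the eigenvalue spectrum.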

Based on the two relevant frequency bands, α (8–14 Hz) and β (16–20 Hz), we extracted the most discriminative spatial filters within each band and we combined them, such that the neural components referring to both frequency bands could be simultaneously exploited. The entire process was performed on the data after applying band-pass and SSD filters. The time interval considered was selected from the ERD/ERS phenomena, starting from 350 ms after the stimuli, which corresponds roughly to the peak time point of the P300, from which point the cognitive process should generally begin (Nicolae et al., 2015a).

#### **Combined spatio-temporal and spatio-spectral features and evaluation scheme**

Targeting the estimation of the user's cognitive processing, we followed the processing pipeline described in **Figure 4**. After appropriate preprocessing for each feature type (time or power), the spatio-temporal features were combined with the spatio-spectral features given by the mCSP process, and the combined features were then classified and evaluated by cross-validation. More specifically, in the band-power domain, the relevant spectral (log-variance) and spatial features were determined from the training data, and the corresponding CSP filters were used to spatially filter the test data; this was repeated in a cross-validation manner (10 folds with 10 repetitions). The separation was then performed by a regularized Linear Discriminant Analysis (Friedman, 1989; Lemm et al., 2011) with shrinkage of the covariance matrix (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005; Vidaurre et al., 2009; Blankertz et al., 2011), which has proved successful for this type of analysis (Müller et al., 2003; Bartz and Müller, 2013; Farquhar and Hill, 2013). The classification performance, i.e., how well the trials of the two classes are separated, was measured by the area under the Receiver Operating Characteristic (ROC) curve (Hanley and McNeil, 1982).
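The area under the ROC curve used as the performance measure has a simple rank interpretation (Hanley and McNeil, 1982): it is the probability that a randomly chosen trial of the positive class receives a higher classifier output than a randomly chosen trial of the negative class. A minimal sketch of this computation follows; the function name is hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 corresponds to chance level and 1.0 to perfect separation. Unlike plain accuracy, this measure is not inflated by the unequal class sizes of the design (75% NT vs. 12.5% ST or DT).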

Initially, no normalization was performed on the feature vectors because, in our case, the features are roughly on the same scale. However, z-score normalization of the feature vectors was also applied for comparison, by subtracting the mean and dividing by the standard deviation for each feature type.
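The z-score normalization mentioned above is sketched below for one feature across trials. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one feature across trials: (x - mean) / std."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    if s == 0:
        return [0.0 for _ in values]  # constant feature: map to zeros
    return [(v - m) / s for v in values]
```

In a cross-validated pipeline, the mean and standard deviation would normally be estimated on the training folds only and then applied to the test fold; the text does not state how this was handled.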

FIGURE 4 | Outline of the data analysis chain. The preprocessed neural data is analyzed in the temporal domain (ERPs based on signed r<sup>2</sup>) and the spectral domain (spectral filtering and decomposition in the α (8–14 Hz) and β (16–20 Hz) bands). All three types of feature vectors, namely the spatio-temporal features (the most important temporal points based on the intervals of maximum signed r<sup>2</sup>) and the spatio-spectral features (log-variance of the CSPs) in the α and β bands, are concatenated and given to the classifier within cross-validation. Because label information is used, the spatio-spectral filtering, which considers the optimal channels and frequency band using CSP analysis with automatic filter selection, is computed on the training set (CSP W) and applied to the test set by linear derivation (W). SSD and the interval selection method based on signed r<sup>2</sup> were applied to the whole dataset and not within cross-validation. While this aspect of the validation is not perfectly sound, the expected overestimation of the performance is limited. Finally, the classifier (regularized Linear Discriminant Analysis with shrinkage of the covariance matrix) decides the class membership of each trial (classifier output), representing the cognitive processing level.

### RESULTS

#### Behavioral Data

Regarding the subjective indicators of the difficulty of the conditions (**Figure 5**, right), the lowest score was attributed to the language condition: the 25th percentile corresponds to a difficulty score of 0 and the 75th percentile to a medium difficulty with score 1, meaning that the subjects mostly considered language the easiest condition. The memory and visual imagination conditions were considered equally difficult by the subjects (25% rated a medium difficulty with score 1 and around 75% rated a high difficulty with score 2). For the personal mood evaluation, a good mood was reported in 53% of all experiment runs and subjects, and in 47% the mood was specified as "ok." No bad mood was reported by the subjects during or after the experiment.

Considering the objective method given by answers ratio, we observe more accurate answers for the language condition with a ratio closer to zero (**Figure 5**, left—the answers ratio for all participants averaged over the runs for each condition). No improvement or decrease in performance was observed over time considering the answers ratio, showing insignificant correlations by the Spearman rank-order correlation (memory: p = 0.3367; language p = 0.3982; visual imagination p = 0.1211; and in total over all 15 runs: p = 0.6192). However, most of the participants showed engagement and enthusiasm throughout the experiment.

### Neurophysiological Data

The temporal and spatial distribution of the neural activity represented by the ERPs is shown in **Figure 6** (with more details in Nicolae et al., 2015a).

Looking at the spatial and temporal distribution of the ERPs in **Figure 6**, a gradual difference between the levels of cognitive processing is observed in the centro-parietal area reflecting a positive peak about 400 ms after the stimulus (∼P300). Earlier, a peak around 250 ms is observed, with a negative component more pronounced in the right-occipital cortex, discriminating between no processing and processing, and similarly in the visual imagination condition, but discriminating also between shallow and deep processing.

Second, we visualized the modulations of the signals' power spectrum computed for the entire trial timing (2 s) in a spatial-spectral representation (topographic maps). In order to focus on the discriminative aspects, the visualization was based on signed r<sup>2</sup> values.

For the distinction between shallow and deep processing (**Figure 7**, the bottom graphs), we investigated the average spectrum (mean over trials and participants) given by the signed r<sup>2</sup> over the parietal site (Pz) in the frequency range from 3 to 40 Hz. We observed a higher discriminative difference in the α band (8–14 Hz) and a smaller one in the β band (16–20 Hz). Due to their prominent difference, also in the scalp maps at frontal and parietal sites, both frequency bands were selected for the analysis in a multi-band approach. A modulation also appears in the θ band (5–7 Hz), visible when comparing shallow with no processing and deep with shallow processing (upper and bottom graphs). This effect was less marked than in the other frequency bands and was not encountered in all levels of processing; therefore, it was not considered for the analysis.

FIGURE 5 | Behavioral assessment indicators: answers ratio (left) and difficulty scores (right) for the three cognitive processes (memory: dark gray, language: light red, visual imagination: light gray). The upper and bottom whiskers of each box-plot correspond to the maximum and minimum values over all participants. The horizontal sides of the rectangular boxes represent the 25 and 75% percentiles of the data. The mean values are represented by the blue asterisk (\*) and the outliers are indicated by the red crosses. Left plot of answers ratio is taken from Nicolae et al. (2015b) with permission from Springer.

**Figures 8**, **9** depict the grand average desynchronization and synchronization effects in the 8–14 Hz and 16–20 Hz frequency bands, which start about 300 ms after the stimulus. The time evolution is initially similar for all levels and conditions until 300 ms, representing the same amount of evaluation of the stimulus information. After this point, the desynchronization appears, peaking around 500 ms, and is followed by a synchronization around 800–1,800 ms. For visualization, we chose the central parietal electrode Pz (for other electrode patterns, see Supplementary Figures S1, S2 in Supplementary Presentation 1). It can be noticed that shallow (ST) or deep (DT) processing elicits an attenuation of brain rhythms in comparison to the reference of no processing (NT). While the ERDs in the ST and DT levels start similarly, they are markedly more sustained in the DT level. The effect is more pronounced for the α band (−1 to 0.5 µV) and less for the β band (−0.4 to 0.4 µV). Looking at the scalp distributions, we clearly see higher synchronization (ERS) for shallow processing (0.1–0.35 µV) compared to the reference of no processing, and more pronounced desynchronization (ERD) for more complex processing (−0.3–0.1 µV). Comparing between processes, an ERS that is higher in amplitude and spatial distribution is found for the memory process, in contrast to a more pronounced ERD in the language and visual imagination cases.

Next, the CSP analysis provides a deeper view of the neural oscillations and activity related to cognitive processes. The patterns provide information about the presumed sources of the neural activity which are then optimally projected on the surface of the scalp.

FIGURE 7 | Grand-average spectrum discrimination given by sgn r<sup>2</sup> at location Pz, computed over the entire trial timing (0–2,000 ms) in the frequency range from 3 to 40 Hz. The discrimination pairs ST-NT, DT-NT, and DT-ST are shown from top to bottom, and the three conditions (memory, language, and visual imagination) from left to right. The four scalp plots show discriminative signed r<sup>2</sup> values corresponding to the θ (5–7 Hz), α (8–12 Hz; 12–14 Hz), and β (16–20 Hz) frequency bands (shaded in gray). Note the scale of the upper graphs, −0.01 to 0.01, compared to the other scales, −0.05 to 0.05.

Considering the selected frequency bands, the CSP for participant P4 are shown in the Supplementary Presentation 1 as scalp topographies, computed for all pairs of class combinations (Supplementary Figures S3, S4).

#### Classification

The evaluation of the binary multivariate classification based on the combined spatio-temporal (ERP) and multi-band CSP (mCSP with SSD) features is presented below. The results of classification based on ERP only were presented in Nicolae et al. (2015b).

The general classification performance across participants, given by the area under the ROC curve, is presented as boxplots in **Figure 10**. We observe good performance, on average above 70% for the ST-DT discrimination, around 75–80% for the NT-ST pair, and highest for the NT-DT discrimination, around 85–90%. The performances of all participants are significantly above chance level (t-test with alpha = 0.01). A two-way repeated measures ANOVA was performed over the AUC values with the factors condition and classification pair, which revealed a statistically significant difference between the classification pairs (p < 0.001, F = 64.99). Regarding the condition factor, the results in **Figure 10** show the highest average AUC for the language condition, but this observation was not statistically significant (p = 0.2112). The distribution of the data was verified using the one-sample Kolmogorov–Smirnov test, with the null hypothesis that the samples follow a standard normal distribution. The null hypothesis was rejected below the 1% significance level.

With standardization of the feature vectors, the performance is actually increased by 1–3%, compared to the results in **Figure 10** with no normalization, although this effect is not statistically significant (n-way ANOVA: p = 0.0601).

Relating the classification results to the signed r<sup>2</sup> discrimination for the ERD/ERS curves, it is important to notice that even when no substantial difference was encountered for the NT-ST pair, the combined classification still performed well, which was due to the ensemble feature approach that also integrated the temporal features. Moreover, when discriminating deep processing, the classification is significantly improved in the presented combined approach compared to the separate temporal classification results (Nicolae et al., 2015b): increasing from 83 to 87% on average for the NT-DT pair (p = 0.0027; F = 9.57) and from 68 to about 72% for the ST-DT pair (p = 0.0234; F = 5.33). No statistically significant difference was obtained for the NT-ST pair. Additionally, when considering only the spectral features, using the SSD method before CSP boosts the classification performance significantly for all cases: for example, from 61 to 74% on average for the ST-DT pair in the visual imagination condition (statistically significant: p = 0.0001 with the two-way ANOVA test).

FIGURE 10 | Pairwise classification mean performance over all trials for all cognitive processes (memory: dark gray, language: light red, visual imagination: light gray) given by the area under the ROC curve (AUC) based on ERP-mCSP. The bottom and upper whiskers of each box-plot correspond to the minimum and maximum values over all participants, the horizontal sides of the rectangular boxes represent the 25 and 75% percentiles of the data, the blue asterisk (\*) represents the mean values, and red crosses indicate outliers. All pairs show statistically significant AUC, marked with three black asterisks (\*\*\*) in the bar plot (p < 0.001).

### DISCUSSION AND CONCLUSION

We investigated the neural correlates of cognitive processing that indicate the depth of processing. Effects have been found in ERPs as well as in brain rhythms.

In the ERPs, two peaks arise: one at 250 ms and a prolonged one at 400 ms. The first peak relates to the first decision on the type of the stimulus (NT, ST, or DT), based on appearance (color and category), which is similar between different levels of processing. The second peak relates to no processing (NT), mild (ST), or intense (DT) processing, and is graded in amplitude in relation to the corresponding level of processing. The latencies are similar because the levels of processing involve the same decision task, mental computation. Further, the deep processing task involves another decision, namely whether the task is fulfilled or not, and hence different latencies will be observed from trial to trial, which are not clearly visible in grand averages.

The power spectrum analysis, over a wide frequency range from 3 to 40 Hz, provided an overview of the spectral components, which were pronounced for the α (8–14 Hz) and the β band (16–20 Hz) and reduced for the θ band (5–7 Hz). Analyzing the components specific to each frequency band, we observed that the activity in the θ band was more pronounced for the memory and language conditions, which reflects memory retrieval activity (Meyer et al., 2015), cognitive processes (Klimesch, 1999), and sustained attention (Huang et al., 2007). However, the signed r<sup>2</sup> differences in the θ band (see **Figure 7**) are substantially smaller than the differences in the α and β bands. Moreover, when testing the classification with the θ band only, the performances were around chance level for all discrimination pairs, as expected (e.g., mean AUCs of 0.51, 0.56, 0.51 for the language condition). They were not significant in a t-test at alpha = 0.0056 (Bonferroni correction for multiple comparisons), with p > 0.0195 for the NT-ST and ST-DT classification pairs, and were barely significant for the NT-DT discrimination with p = 0.0052. Therefore, the decision was made to discard this band from further analysis, since in our case it did not produce significant differences. The ERD/ERS curves of the remaining frequency bands display a marked ERD with a peak at a latency of about 500 ms. The duration of the ERD is modulated by the degree of cognitive processing. In addition, the strong discrimination in the α band reflects mental coordination (Palva and Palva, 2007), alert states (Klimesch, 1999), cognitive processing, and access to stored information (Klimesch, 2012), and is complemented by a strong β band differentiation that corresponds to complex mental processes and the analysis of the presented information (Lachaux et al., 2005).

Comparing across the conditions, the activity for memory is focused in the temporal and frontal areas. More lateralized activity is observed for language, and activity is more accentuated in the parietal and temporal areas for visual imagination, depicting memory access and interpretation (Ganis et al., 2004). Comparing the ERD/ERS effects between conditions, a stronger synchronization (higher DT in amplitude and power) is visible for the memory condition in comparison with the others, representing an easier process. This effect contrasts with the behavioral point of view, where, on average, more participants rated language as the easiest condition (difficulty scores in **Figure 5**), an effect found statistically significant with p = 0.0451 in a one-way ANOVA considering one factor (subjects) and three levels (conditions).

In order to improve the oscillatory discrimination, CSP filtering was employed to obtain spatial filters that distinguish the areas where the activity is differentiated between two processing levels (Supplementary Figures S3, S4 in the Supplementary Presentation 1).

For classification, we employed an approach which combines spatio-temporal features with spatio-spectral features. The average AUC is over 76% for the NT-ST pair, more than 86% for the NT-DT pair, and 70% for the ST-DT pair. Note that classification was challenged by the fact that participants moved their eyes freely between the two objects of each stimulus. In addition, it seems probable that the eye movements have different dynamics between the tasks (e.g., DT might induce more alternations of the gaze between the two objects of a stimulus). For that reason, appropriate artifact removal strategies (ICA with MARA) and careful verification were applied. The resulting ERD and CSP patterns (cf. **Figures 8**, **9** and Supplementary Figures S3, S4 in Supplementary Presentation 1) do not suggest influences from eye movements, such as strong lateralized activity in the frontal area, which would be expected to result from horizontal movements. This gives a strong indication that the classification performance was not based on eye movement dynamics but on neural correlates of the cognitive processing. Furthermore, the evaluation of potential contamination by artifacts becomes more robust when analyzing the level of decoding based on the EOG activity alone. Specifically, the classification was performed considering two feature channels (EOG and Fp1; F10 and F9 channels difference) corresponding to the vertical and horizontal eye movements. The results are at chance level for all conditions and classification pairs (e.g., mean AUCs for the language condition across the classification pairs: 0.51, 0.55, 0.46), with p > 0.0198, which is not significant at alpha = 0.0056 (Bonferroni corrected for the nine comparisons).
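The corrected significance level of 0.0056 used here and for the θ band analysis is consistent with a Bonferroni correction of a family-wise level of 0.05 over nine comparisons (three conditions times three classification pairs). The family-wise level of 0.05 is our assumption; the text only reports the corrected value. A minimal sketch with hypothetical function names:

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return family_alpha / n_comparisons


def is_significant(p_value, family_alpha=0.05, n_comparisons=9):
    """True if the p-value passes the Bonferroni-corrected threshold.
    The default family-wise alpha of 0.05 is an assumption."""
    return p_value < bonferroni_alpha(family_alpha, n_comparisons)
```

With these values the threshold is 0.05 / 9 ≈ 0.0056, so the p-values above 0.0195 reported for the chance-level results are not significant, whereas p = 0.0052 for the θ band NT-DT pair falls just below the threshold.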

In our experimental paradigm, different cognitive levels were externally imposed using task instructions. The graduated differences observed in the ERPs between shallow and deep processing correlate with the level of processing, as is also evident in the ERDs/ERSs. They are not generated by the rarity of target occurrence (the oddball effect), which was controlled by presenting ST and DT stimuli with the same frequency (12.5% each). When comparing the shallow and deep processing with no processing, it cannot be disentangled which differences in the ERPs are generated by the rarity of the occurrence (oddball effect) and which by the additional processing demands of the task. For the ERDs in **Figure 8** in particular, the time course (differences extending far past 500 ms) seems to suggest that the main difference is due to the additional cognitive processing in ST and DT. This design was chosen in order to have better control of the true level of cognitive processing. From the application perspective, however, we strive to estimate the momentary level of cognitive processing within the natural fluctuations.

In conclusion, this investigation of the depth of cognitive processing brings us closer to real scenarios. Compared to standard BCI research, our study induced different levels of cognitive processing by tasks that go beyond a simple target/nontarget discrimination. Moreover, the visual stimuli used had a higher variability, including the need for eye movements in the exploration of the complex stimuli, which were composed of two objects side-by-side. Our work also extends previous investigations of the effect of task complexity on ERPs and brain oscillations. Again, the set of stimuli used in our study is richer and more complex and, importantly, methods from machine learning have been employed for single-trial classification, in a binary approach.

Overall, the present study is a step forward toward applications that estimate the level of cognitive processing in realistic settings of human-computer interaction (Gamberini et al., 2015) and in safety critical workplaces (Venthur et al., 2010). A further step in order to discriminate the ongoing level of cognitive processing is to apply a regression approach, as in Naumann et al. (2017). A different perspective for the current study is the development of techniques suitable for adaptive learning environments based on user state monitoring regarding the depth of cognitive processing. In this regard, it would be interesting to augment the approach with predicting remembered vs. not remembered items (Klimesch et al., 1996) in memory tasks.

### DATA AVAILABILITY STATEMENT

The EEG signals and behavioral data acquired in this study will be made available from the DepositOnce repository of Technische Universität Berlin (https://depositonce.tu-berlin.de/).

### AUTHOR CONTRIBUTIONS

Design of the study by BB. Implementation and data acquisition by IN. IN performed data analysis and LA and BB reviewed and revised the analysis. IN wrote the manuscript draft which was reviewed and revised by LA and BB.

#### FUNDING

Regarding the research, we acknowledge financial support by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611570, by the German Federal Ministry of Education and Research (BMBF) under contract 01GQ0850 and by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398. Regarding publication, we acknowledge support by the German Research Foundation and the Open Access Publication Funds of Technische Universität Berlin.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2017.00548/full#supplementary-material

Supplementary Video 1 | Short example with 30 trials for the experimental paradigm considering the memory condition. First, the visual questionnaire is presented, evaluating the participant's current mental state (good, ok, bad). Next, the target pair (pear, banana) specifying the target category (fruits) and the color target (red) is presented along with the current condition that needs to be performed (memory). Once the participant has memorized the target, he/she presses a keyboard button and the sequence starts after a short countdown. At the end of the run, the participant enters the number counted and receives feedback on the correct number. In total, there were 4 DT (of which two fulfilled the memory question task), 7 ST, and 19 NT stimuli presented, accumulating a final number counted of 31. Note that the number of trials is higher in the actual run of the experiment (120 stimuli) and the ratio of the stimuli is also different (75:12.5:12.5 ± 2% for NT:ST:DT), while this example may seem harder (with a ratio of 63.3:23.3:13.3 (%) for NT:ST:DT).

Supplementary Video 2 | Short example with 30 trials for the experimental paradigm considering the language condition. The target pair here (submarine, submarine) refers to the mobility category and shows the color target (green). The current condition that needs to be performed is also specified (language). In total, 4 DT (of which two fulfilled the language question task), 7 ST, and 19 NT stimuli were presented, resulting in a final number counted of 31. As in the memory example, note that the number of trials and the ratio differ from those of the actual experiment.

Supplementary Video 3 | Short example with 30 trials for the experimental paradigm considering the visual imagination condition. The target pair here (cow, elephant) and the color target (red) are presented along with the specification of the current condition that needs to be performed (visual imagination). In total, 4 DT (of which two fulfilled the visual imagination question task), 5 ST, and 21 NT stimuli were presented, resulting in a final number counted of 29. As in the memory and language examples, note that the number of trials and the ratio (70:16.67:13.3 (%) for NT:ST:DT) differ from those of the actual experiment.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Nicolae, Acqualagna and Blankertz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.