N-Back Related ERPs Depend on Stimulus Type, Task Structure, Pre-processing, and Lab Factors

The N-Back, a common working memory (WM) updating task, is increasingly used in basic and applied psychological research. As such, an increasing number of electroencephalogram (EEG) studies have sought to identify the electrophysiological signatures of N-Back task performance. However, stimulus type, task structure, pre-processing methods, and differences in the laboratory environment, including the EEG recording setup employed, greatly vary across studies, which in turn may introduce inconsistencies in the obtained results. Here we address this issue by conducting nine different variations of an N-Back task manipulating stimulus type and task structure. Furthermore, we explored the effect of the pre-processing method used and differences in the laboratory environment. Results reveal significant differences in behavioral and electrophysiological signatures in response to N-Back stimulus type, task structure, pre-processing method, and laboratory environment. In conclusion, we suggest that experimental factors, analysis pipeline, and laboratory differences, which are often ignored in the literature, need to be accounted for when interpreting findings and making comparisons across studies.


INTRODUCTION
Working memory (WM), defined as a limited-capacity system responsible for the temporary storage and manipulation of relevant information (Baddeley, 2012), has been studied extensively in the last few decades because it correlates with a wide range of complex cognitive abilities such as problem-solving, reasoning, learning, and planning of goal-directed behaviors (Miyake and Shah, 1999). A considerable number of studies have addressed the behavioral and neurophysiological correlates, and the underlying hypothetical constructs, of WM using both single-session (Scharinger et al., 2015, 2017) and repeated-practice designs (Anguera et al., 2012; Buschkuehl et al., 2014; Jaeggi et al., 2014).
One of the commonly used techniques to probe WM is the N-Back task, a complex task that requires storage, maintenance, and manipulation of information (Chen et al., 2008; Jaeggi et al., 2008) as well as inhibitory and interference control (Oberauer, 2005; Kane et al., 2007). The N-Back task has been used in single-session behavioral (Jaeggi et al., 2010; Brouwer et al., 2012) and neurophysiological (Krause et al., 2000; Pesonen et al., 2007; Esposito et al., 2009; Scharinger et al., 2017) studies as well as in multi-session behavioral (Jaeggi et al., 2008; Minear et al., 2016; Blacker et al., 2017) and neurophysiological (Chen and Mitra, 2009; Dong et al., 2015; Pergher et al., 2018) training studies, to name a few. Many N-Back studies focus on task difficulty at different N-levels, indicating lower ERP amplitudes for more difficult tasks (Brouwer et al., 2012; Herff et al., 2014; Scharinger et al., 2017; Pergher et al., 2019b), and/or on stimulus type, such as the use of spatial (for instance, when the target stimulus occurs in different locations on the screen) vs. verbal (for instance, when the presented stimulus is a word or syllable) stimuli. This work indicates that stimulus and load factors play a significant role in modulating the P2, N2, and P3 components (Ross and Segalowitz, 2000; Polich, 2007; Chen et al., 2008; Chen and Mitra, 2009). However, there are many other task parameters, such as stimulus duration, inter-stimulus interval (ISI), and feedback, that, although previously explored, are rarely consistent across N-Back studies (for a review see Pergher et al., 2019a). Different combinations of these parameters may differentially affect the electrophysiological signatures associated with task performance and thus limit both the interpretation of the functional significance of ERP components related to the N-Back task and their comparison across studies.
Here we examine several candidate factors that may affect ERP and behavioral signatures during N-Back task performance, not only in terms of task parameters such as stimulus type (words, pictures, and colors) and task structure (stimulus duration, ISI, and feedback) but also in terms of different data pre-processing pipelines and laboratory effects, such as differences in room setup, computer testing stations, and electroencephalogram (EEG) hardware and software. While such variation is found in numerous areas of ERP research, the N-Back is particularly notorious for how it varies across studies (Owen et al., 2005; Kane et al., 2007; Mencarelli et al., 2019), and the data presented here are the first to detail the extent of these effects for a variety of N-Back variations.

MATERIALS AND METHODS
Three datasets involving the N-Back task were included in the current study. Dataset I was collected specifically for the current study at the University of California-Riverside (UCR), USA. Its purpose was to explore the potential factors that affect ERP morphology and behavioral signatures of the N-Back task and to replicate experimental procedures described in two published datasets collected in different labs (Datasets II and III). Dataset II was collected at KU Leuven, Belgium (Pergher et al., 2018) as part of a study that investigated near and far transfer effects after 10 N-Back training sessions using behavioral and EEG recordings; near transfer involves cognitive sub-processes similar to the one practiced during training, whereas far transfer calls upon other mental processes (De Ribaupierre and Lecerf, 2006). Dataset III was collected at the University of Maribor (UM), Slovenia (Pahor and Jaušovec, 2018) in a study that examined the effects of transcranial alternating current stimulation (tACS) on WM performance and EEG responses. Participants in each dataset were healthy young subjects who reported normal or corrected-to-normal vision and no history of psychiatric or neurological diseases, and who were not taking any medication known to interfere with cognitive functioning.

Dataset I: UCR
Participants
Thirty-six right-handed adults (27 females and nine males, mean age = 19.58, SD = 0.97), undergraduate students, were recruited from UCR. The experimental protocol was approved by the Institutional Review Board of UCR and all participants gave their informed consent before the experiment. They received course credit and a payment of $10 for participating in two sessions.

Stimuli and Task Structure
Nine variants of the N-Back task were obtained by crossing three task structures (see below) with three stimulus types: words (i.e., so, do, up), pictures (i.e., apple, fish, and bag), and colors (i.e., red, green, and blue). Task structures differed in terms of stimulus duration, ISI, response contingency, and feedback (see Figure 1) and were modeled after tasks used in previous studies, as mentioned above: task 1 (Pahor and Jaušovec, 2018), task 2 (Pergher et al., 2018), and task 3 (Mohammed et al., 2017).
Task 1 had a stimulus duration of 400 ms, ISI of 1,600 ms, and employed a two-alternative forced-choice design for responding to targets and non-targets during the ISI. A white fixation cross appeared during ISI, turning blue when a response was registered or red if no response was detected. Task 2 had a stimulus duration of 1,000 ms and ISI of 2,000 ms. During the ISI, participants viewed a white fixation cross and were instructed to press a button only for target trials. Task 3 had a stimulus duration of 2,500 ms and ISI of 500 ms. Participants were instructed to respond to targets during stimulus presentation and were given feedback for correct (green circle around the stimulus) and incorrect responses (red circle around the stimulus). For task 1 and task 2 no response was allowed during stimulus presentation.
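The three task structures can be summarized as a small configuration table. The following Python sketch is illustrative only: the class and field names are hypothetical, and the block-duration helper assumes that a trial lasts exactly stimulus duration plus ISI, with no additional gaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStructure:
    """Timing and response rules of one N-Back task structure (names are illustrative)."""
    name: str
    stimulus_ms: int      # stimulus duration
    isi_ms: int           # inter-stimulus interval
    respond_to: str       # which trials require a response
    response_window: str  # when responses are accepted
    response_cue: str     # what the participant sees after responding

    @property
    def trial_ms(self) -> int:
        # Assumption: one trial = stimulus duration + ISI.
        return self.stimulus_ms + self.isi_ms


TASKS = {
    1: TaskStructure("task 1", 400, 1600, "targets and non-targets", "ISI",
                     "fixation cross turns blue (response) or red (no response)"),
    2: TaskStructure("task 2", 1000, 2000, "targets only", "ISI",
                     "white fixation cross, no cue described"),
    3: TaskStructure("task 3", 2500, 500, "targets only", "stimulus",
                     "green (correct) or red (incorrect) circle"),
}


def block_duration_s(task: TaskStructure, n_level: int, extra_trials: int = 40) -> float:
    """Duration of one experimental block of N + 40 trials, in seconds."""
    return (n_level + extra_trials) * task.trial_ms / 1000


for task in TASKS.values():
    print(task.name, task.trial_ms, "ms per trial,",
          block_duration_s(task, n_level=2), "s per 2-back block")
```

Under these assumptions, task 1 trials are shorter (2,000 ms) than task 2 and task 3 trials (3,000 ms each), so blocks with the same number of trials differ in length across task structures.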

Procedure
Each participant performed four out of the nine N-Back variations across two sessions conducted on different days, with the same difficulty levels administered each day, for a total of approximately 90 min per session. This ensured that all combinations of conditions existed in a within-subject design, even though not all participants completed every condition. Participants were assigned to N-Back variants randomly based on subject number to ensure an equal number of participants (N = 16) in each variant (36 participants × 4 variants = 144 assignments, i.e., 16 per variant). Each session consisted of 11 blocks presented in the following order: a 1-back practice block, a 2-back practice block, four 2-back experimental blocks, a 3-back practice block, and four 3-back experimental blocks. Instructions were provided before each new N-level and 15-s breaks were given between blocks. Practice blocks consisted of sixteen trials during which the participant performed task 3 with color stimuli, whereas experimental blocks consisted of N + 40 trials (e.g., a 2-back block had 42 trials).
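The arithmetic behind the balanced design can be checked with a short sketch. The cyclic assignment rule below is a hypothetical stand-in: the text states only that assignment was random and based on subject number, so this shows one way to obtain 16 participants per variant, not the procedure actually used.

```python
from collections import Counter
from itertools import product

STIMULI = ("words", "pictures", "colors")
TASK_STRUCTURES = (1, 2, 3)
# Nine variants: every task structure crossed with every stimulus type.
VARIANTS = list(product(TASK_STRUCTURES, STIMULI))


def assign_variants(subject_number: int, per_subject: int = 4):
    """Hypothetical cyclic rule: take four consecutive variants, starting at
    an offset determined by the subject number."""
    start = subject_number % len(VARIANTS)
    return [VARIANTS[(start + k) % len(VARIANTS)] for k in range(per_subject)]


# Count how many of the 36 participants end up in each variant.
counts = Counter(v for subject in range(36) for v in assign_variants(subject))

for variant, n in sorted(counts.items()):
    print(variant, n)
```

Because 36 × 4 = 9 × 16, any scheme that spreads the 144 assignments evenly yields 16 participants per variant; the cyclic rule is simply the easiest such scheme to verify.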
The experiment took place in an electrically shielded room with DC lighting. An Apple Mac Mini with OS X 10.6.8 running MATLAB 2007b (Mathworks, Natick, MA, USA) and Psychtoolbox Version 3.0.8 were used to present the task and generate the stimuli (Brainard, 1997;Kleiner et al., 2007). The stimuli were displayed on a 22.5-inch wide Sony Trinitron (Sony Corp., Tokyo, Japan) CRT monitor with a resolution of 1,280 × 1,024 pixels and a refresh rate of 75 Hz. Also, to guarantee temporal precision of event-markers with experimental stimuli, a DATAPixx stimulus unit was used (VPixx, Vision Science Solutions, Saint-Bruno, QC, Canada) that ensured that triggers were sent precisely at the times of the vertical interrupt of the monitor and button presses.

Dataset II: KU Leuven
Participants
Twenty-three healthy adults (12 females and 11 males, mean age = 24.37, SD = 1.78) were recruited via advertisements and flyers; eight of these participants were included in the N-Back training study conducted by Pergher et al. (2018). We randomly selected 16 subjects from the first two sessions of Dataset II to have a comparable sample size for cross-laboratory comparison purposes (see Table 1). Before starting the experiment, all participants were informed about the experimental procedure and signed informed consent. They received a payment of 20 euros for participating in two experiments. The study was approved by the UZ KU Leuven ethical committee (S59475).

Stimuli and Procedure
Dataset II had a task structure similar to task 2 of Dataset I, mentioned above, where each stimulus was presented for 1,000 ms, followed by an ISI of 2,000 ms. The stimuli were generated using MATLAB 2007b (Mathworks, Natick, MA, USA) and Psychtoolbox Version 3.0.8 (Brainard, 1997;Kleiner et al., 2007) and displayed on a CRT monitor. Participants had to respond only to targets. The stimuli used were pictures (Pergher et al., 2018).

Dataset III: UM
Participants
Seventy-two healthy adults were recruited from the University of Maribor, Slovenia (Pahor and Jaušovec, 2018), 24 of whom were assigned to sham stimulation in the first session (all females, mean age = 20.42, SD = 1.56) and were thus not exposed to any active stimulation. Sixteen of these participants (session 1 data only) were randomly selected for Dataset III (see Table 1). The protocol was approved by the Commission for Ethics in Research at the Faculty of Arts. Participants gave written informed consent and received course credit as compensation.

Stimuli and Procedure
Dataset III had a task structure similar to task 1 of Dataset I, where each stimulus was shown for 400 ms, followed by an ISI of 1,600 ms. The stimuli were generated on STIM2 (Compumedics Neuroscan Systems, Charlotte, NC, USA) and displayed on a CRT monitor. Participants had to respond to both targets and non-targets. Two types of stimuli were used: two-letter words and colors (Pahor and Jaušovec, 2018).

EEG Recording
EEG was recorded over 19 scalp locations based on the 10-20 Electrode Placement System using a Quik-Cap (Quik-Cap Compumedics Neuromedical supplies, Charlotte, NC, USA) with sintered electrodes. The EEG data were recorded using a SynAmps RT system and had a band-pass filter of 0.15-100.0 Hz. The 19 EEG traces were sampled online at 1,000 Hz. Vertical eye movements were recorded using two external electrodes placed above and below the left eye and a ground electrode was applied to the forehead. Two ear lobe references (A1 and A2) were used for online referencing, followed by common average re-referencing.

ERP Pre-processing Pipelines
Two pre-processing pipelines were used to analyze Dataset I: pipeline I and pipeline II. For ERP comparison across the three datasets, only pipeline I was used. The pipelines were chosen as they represented different, but standard approaches to ERP analysis (Croft and Barry, 2000;Delorme and Makeig, 2004;Groppe et al., 2009).

Pipeline I
Pipeline I was conducted in EEGLAB (MATLAB 2015a, MathWorks Incorporation; EEGLAB v. 14.1.1; Delorme and Makeig, 2004): the data were resampled to 512 Hz and filtered using a Butterworth filter with lower and upper cut-off frequencies of 0.1 Hz and 40 Hz. Electrode recordings were re-referenced to the average of the mastoid recordings (average mastoid reference, TP9 and TP10). Manual inspection was first performed to locate and remove visible disturbances in the data. Epochs were created from 1,000 ms before to 2,000 ms after stimulus onset, and the pre-onset average was subtracted from the post-onset signal (baseline correction). Independent components analysis (ICA) was used to extract blinking and eye movements within the data. Independent components (ICs) that were identified by the data analyst as ocular artifacts were rejected. Finally, epochs were averaged for each N-Back variant and baseline corrected using 200 ms before stimulus onset.
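The epoching and baseline-correction steps can be illustrated for a single channel. This is a minimal plain-Python sketch under stated assumptions, not the EEGLAB implementation: the function names are hypothetical, the signal is a list of samples, and filtering, re-referencing, and ICA are omitted.

```python
def extract_epoch(signal, onset_idx, fs_hz, pre_ms, post_ms):
    """Cut one epoch around a stimulus onset.

    Returns the epoch and the index of stimulus onset within the epoch.
    """
    pre = round(pre_ms * fs_hz / 1000)
    post = round(post_ms * fs_hz / 1000)
    start, stop = onset_idx - pre, onset_idx + post
    if start < 0 or stop > len(signal):
        raise ValueError("epoch extends beyond the recording")
    return signal[start:stop], pre


def baseline_correct(epoch, onset_in_epoch, fs_hz, baseline_ms):
    """Subtract the mean of the pre-stimulus baseline window from every sample."""
    n_base = round(baseline_ms * fs_hz / 1000)
    window = epoch[onset_in_epoch - n_base:onset_in_epoch]
    baseline = sum(window) / len(window)
    return [x - baseline for x in epoch]


# Synthetic example: a 5 µV offset that steps to 7 µV at stimulus onset.
fs = 512
recording = [5.0] * 1000 + [7.0] * 2000
epoch, onset = extract_epoch(recording, onset_idx=1000, fs_hz=fs,
                             pre_ms=1000, post_ms=2000)
corrected = baseline_correct(epoch, onset, fs_hz=fs, baseline_ms=200)
print(len(epoch), corrected[0], corrected[onset])
```

At 512 Hz, the 1,000 ms pre-stimulus and 2,000 ms post-stimulus windows correspond to 512 and 1,024 samples, and the 200 ms baseline to about 102 samples.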

Pipeline II
Pipeline II was conducted using MATLAB R2016a (Mathworks, Natick, MA, USA). The data were resampled to 1,000 Hz and filtered in the 0.1-30 Hz range using a zero-phase 4th-order Butterworth filter. All electrodes were re-referenced offline to the average of the two mastoid signals (average mastoid reference, TP9 and TP10; Luck, 2014). Epochs were created from 200 ms before to 1,000 ms after stimulus onset, and baseline correction was performed by subtracting the average of the 200 ms pre-stimulus signal from the 1,000 ms post-stimulus signal. The EOG recorded before and during the experiment was used to correct the EEG signal for eye artifacts using Croft and Barry's aligned-artifact average (AAA) procedure (Croft and Barry, 2000). Finally, epochs with EEG signals greater than 50 µV were also excluded as they could signify motion artifacts (van Vliet et al., 2015; Wittevrongel and Van Hulle, 2016). This pipeline was developed by the computational neuroscience group at KU Leuven (van Vliet et al., 2014, 2015; Wittevrongel and Van Hulle, 2016) and has since been used in dozens of published studies from this group (http://lirias.kuleuven.be/cv?Username=U0013308). The method was developed because it accounts for eye artifacts using an automatic procedure (the AAA procedure in Croft and Barry, 2000) rather than relying on a post hoc ICA analysis in which the data analyst needs to identify which ICs contain those artifacts (as in EEGLAB).
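The amplitude-based epoch rejection at the end of pipeline II can be sketched as follows. The sketch assumes that "greater than 50 µV" refers to the absolute amplitude of any sample in a baseline-corrected epoch; the text does not spell this out, and the function names are hypothetical.

```python
def reject_epochs(epochs, threshold_uv=50.0):
    """Split epochs into kept epochs and the indices of rejected ones.

    Assumption: an epoch is rejected if any sample exceeds the threshold
    in absolute value.
    """
    kept, rejected = [], []
    for i, epoch in enumerate(epochs):
        if max(abs(x) for x in epoch) > threshold_uv:
            rejected.append(i)
        else:
            kept.append(epoch)
    return kept, rejected


def average_epochs(epochs):
    """Sample-by-sample mean across epochs of equal length (the ERP)."""
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]


# Toy data in µV: epochs 1 and 2 contain samples beyond ±50 µV.
epochs = [[1.0, -2.0, 3.0], [10.0, 60.0, 5.0], [0.0, -55.0, 0.0], [4.0, 4.0, 4.0]]
kept, rejected = reject_epochs(epochs)
print(rejected, average_epochs(kept))
```

A fixed threshold is fully automatic, which matches the stated motivation for pipeline II, but the number of surviving epochs then depends on the threshold chosen.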

Statistical Analysis
To assess the effect of N-Back task variations on behavioral responses (average of correct responses across trials) and ERP morphology (we considered the same three midline electrodes, Fz, Cz, and Pz, investigated by Watter et al., 2001), we used nonparametric permutation-based tests (Guo and Yuan, 2017; Derrick et al., 2019), as our datasets failed the Shapiro-Wilk test of normality (Shapiro and Francia, 1972) and the Levene test of equality of variances (Levene, 1960). Specifically, Dataset I utilized a mixed within/between design where each participant performed four out of nine variations. The rationale for using a mixed design was to obtain enough power (16 participants) for each of the nine variations while recruiting only 36 subjects. Therefore, we used a nonparametric permutation-based test to account for the mixed (within/between) design (Efron and Tibshirani, 1993; Farrell et al., 1998). The null hypothesis distribution is generated empirically, without any assumptions about the data distribution. The observed results were then assessed relative to the empirical null distribution (Collingridge, 2013), and the p-value was calculated by comparing the absolute difference between the observed values of two groups to the absolute values of the empirical null distribution (Cohen, 2017). The results were considered statistically significant when the p-value was less than 0.05. We ran 30,000 iterations for permutation testing of behavioral data and 3,000 for ERP data. We adopted the same statistical tests for the comparison between datasets (UCR, KU Leuven, and UM), for both ERP and behavioral data. We note that this p-value is monotonically related to other measures of reliability, such as differences in signal-to-noise ratio (SNR). Furthermore, we performed a power analysis for accuracy to ensure that our samples, considering the significant results of Figure 2, were large enough. Here, we report the comparison between task 1 and task 3, which revealed that 14 subjects were sufficient to reach a power of 80% for words, 16 subjects for colors, and 22 subjects for pictures. Although the latter indicates that a larger sample would be necessary for pictures, we believe that it does not significantly affect our results.
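The logic of the permutation test described in this section can be sketched for the simplest case of two independent groups. This is a simplified illustration, not the analysis code used in the study: it ignores the mixed within/between structure, uses the difference in means as the test statistic, and adds one to the numerator and denominator of the p-value, a common convention that is an assumption here.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_iter=3000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        # Shuffle group labels to sample from the empirical null distribution.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Assumed convention: count the observed labelling as one permutation.
    return (extreme + 1) / (n_iter + 1)


# Made-up accuracy values for two conditions, for illustration only.
cond_a = [0.91, 0.88, 0.93, 0.90, 0.95, 0.89, 0.92, 0.94]
cond_b = [0.71, 0.75, 0.68, 0.73, 0.70, 0.74, 0.69, 0.72]
print(permutation_test(cond_a, cond_b))
```

Because the null distribution is built by reshuffling the data themselves, no assumption about normality or equal variances is required, which is why this family of tests suits data that fail the Shapiro-Wilk and Levene tests.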

ERP Components
We investigated the following ERP components in the 0-800 ms post-stimulus time window:
P100 (P1): a positive deflection with a peak around 100 ms after stimulus presentation. It is distributed over the lateral occipital electrodes and reflects the early sensory processing of visual stimuli. P1 latency depends on stimulus contrast, such as luminance or SNR, while its amplitude is modulated by attention (Hillyard et al., 1998) and discrimination processes (Vogel and Luck, 2000).
N100 (N1): a negative deflection that peaks around 100-200 ms after stimulus onset. It is distributed over the entire scalp, but it peaks earlier over the frontal regions. Its amplitude is modulated by attention: larger amplitudes are associated with attended stimuli, while smaller amplitudes are associated with increasing stimulus presentation frequency (Luck et al., 1990). N1 latency is affected by cognitive processing effort: the greater the effort, the longer the latency (Callaway and Halliday, 1982).
P200 (P2): a positive deflection with a peak around 150-275 ms after stimulus presentation. It is distributed over the fronto-central and parieto-occipital regions of the scalp, with its maximum over the frontal area. It is elicited by visual stimuli and modulated by attention (Liu et al., 2013). Its amplitude is suppressed by increasing attentiveness (Kanske et al., 2011) and by more frequent target stimuli (Lu et al., 1992).
N200 (N2): a negative deflection detected around 200-350 ms after stimulus onset. It is distributed over the frontal regions of the scalp, and over posterior regions in visual attention tasks (Folstein and Van Petten, 2008). The N2 component reflects several functions such as stimulus identification, attentional shift, and motor response inhibition (Patel and Azzam, 2005).
P300 (P3): a positive deflection with a peak occurring around 250-600 ms after stimulus onset. It shows a stronger distribution over the centro-parietal electrodes. Its amplitude becomes larger with infrequent target stimuli and decreases with habituation and task difficulty. Its latency is modulated by the difficulty of discriminating the target stimulus (Picton, 1992; Polich and Kok, 1995).
N400 (N4): a negative deflection detected between 400 and 600 ms after stimulus onset. It is typically stronger over centro-parietal regions of the scalp and reflects the brain response to semantically meaningful stimuli, which can include visual and auditory words, sounds, pictures, and faces (Kutas and Federmeier, 2011). N4 amplitude is affected by priming and by the frequency of the stimulus (Van Petten and Kutas, 1990).
Positive Late Component (PLC): a positive deflection with a peak occurring around 500-1,000 ms after stimulus onset. It is most prominent at posterior scalp channels. The PLC amplitude is modulated by stimulus repetition: it is suppressed for stimuli that have already been presented and generally larger for new stimuli (i.e., the "old-new" effect), in both long- and short-term memory paradigms (Olichney et al., 2000; Danker et al., 2008). The PLC is believed to index the top-down allocation of attention in a memory recollection process (Mecklinger, 2000).
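Because the latency ranges listed above overlap, a given deflection can be compatible with more than one component. The sketch below encodes the ranges as a lookup table to make this explicit. The table and function names are hypothetical, and the P1 window of 80-130 ms is an assumption, since the text gives only a peak "around 100 ms".

```python
# (polarity, window start in ms, window end in ms), taken from the descriptions above.
# The P1 window is an assumed range around the stated ~100 ms peak.
COMPONENTS = {
    "P1": ("positive", 80, 130),
    "N1": ("negative", 100, 200),
    "P2": ("positive", 150, 275),
    "N2": ("negative", 200, 350),
    "P3": ("positive", 250, 600),
    "N4": ("negative", 400, 600),
    "PLC": ("positive", 500, 1000),
}


def candidate_components(latency_ms, polarity):
    """Return every component whose polarity and latency window match a peak."""
    return [name for name, (pol, start, end) in COMPONENTS.items()
            if pol == polarity and start <= latency_ms <= end]


print(candidate_components(300, "negative"))
print(candidate_components(260, "positive"))
print(candidate_components(550, "positive"))
```

A positive peak at 260 ms, for instance, falls inside both the P2 and P3 windows, so latency alone does not identify the component; scalp distribution and experimental manipulation are needed as well.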

Effect of Stimulus Type and Task Structure-Dataset I (UCR)
To investigate the effect of experimental features (stimulus type and task structure type), we performed a nonparametric permutation-based analysis on behavioral and electrophysiological data.

Behavioral Results
While pictures were associated with the highest accuracy level when holding task structure constant (Figure 2A; see Supplementary Tables 1-3 in Supplemental Material for means, standard deviations, and statistics per condition, respectively), there was no statistically significant effect of stimulus type on accuracy (p > 0.2 for all conditions other than task 1, words vs. pictures: p = 0.075). On the other hand, results revealed a robust overall effect of task structure (see Figure 2B), showing higher accuracy for task 3 vs. task 1 (p < 0.002 for all stimulus types), and for task 3 vs. task 2 for words (p < 0.001) and colors (p < 0.012) but only a trend for pictures (p = 0.067). However, there was no statistically significant difference between task 1 and task 2 (p > 0.4 for all conditions other than words: p = 0.074). These results show that while there is a highly significant effect of task structure, especially for task 3 vs. the others, the choice of stimulus has a lesser effect on task performance.

ERP Morphology
Overall, ERP morphologies changed substantially as a function of both stimulus type and task structure. This can be seen in Figures 3, 4 for channel Cz, while channels Fz and Pz are shown in Supplemental Material (Supplementary Figures 1-4). We also present topographies and report differences between them in the Supplemental Material (see Supplementary Figure 9 and Supplementary Tables 6, 7). In the following sections, we highlight some of the significant effects by running permutation tests that demonstrate the extent to which morphology differs across the time course as a function of condition. The significant differences discussed below refer to the shaded regions in the graphs, which indicate periods in the ERPs where the p-value is less than or equal to 0.05 for at least 12 consecutive bins with ∆t = 1/512 s (i.e., about 23 ms at the 512 Hz sampling rate).
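The consecutive-bin criterion can be sketched as a search for runs of significant p-values. This is an illustrative implementation under the stated criterion (p ≤ 0.05 for at least 12 consecutive bins at 512 Hz); the function names are hypothetical and the p-values in the example are made up.

```python
def significant_runs(p_values, alpha=0.05, min_bins=12):
    """Return half-open (start, stop) index ranges where p <= alpha
    for at least min_bins consecutive time bins."""
    runs = []
    start = None
    for i, p in enumerate(p_values):
        if p <= alpha:
            if start is None:
                start = i  # a new run of significant bins begins
        else:
            if start is not None and i - start >= min_bins:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the time course.
    if start is not None and len(p_values) - start >= min_bins:
        runs.append((start, len(p_values)))
    return runs


def run_duration_ms(run, fs_hz=512):
    """Duration of a run in milliseconds for a given sampling rate."""
    start, stop = run
    return (stop - start) * 1000 / fs_hz


# 12 significant bins (kept), 11 significant bins (too short), 15 at the end (kept).
p = [0.5] * 5 + [0.01] * 12 + [0.5] * 3 + [0.01] * 11 + [0.5] + [0.04] * 15
runs = significant_runs(p)
print(runs, [run_duration_ms(r) for r in runs])
```

Requiring a minimum run length discards isolated significant bins, which are more likely to reflect chance fluctuations than a sustained difference between conditions.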

Effects as a Function of Stimulus Type
For task 1, ERP morphologies differed more frequently for pictures compared to colors and words, as seen in Figure 3. While pictures vs. words differed more frequently in the N1, N2 and P2 components in channels Fz, Cz, and Pz (the latter only for P2), pictures vs. colors showed differences in the N2, P2, and P3 components in channels Fz and Cz. Additionally, words vs. colors showed differences in the P2 component in channels Fz, Cz, and Pz.
For task 2, we found that ERP morphologies differed more frequently for colors compared to pictures and words (see Figure 3). While both colors and words differed from pictures more frequently in the N1 component in channels Cz, Pz, and Fz respectively, colors vs. words and colors vs. pictures showed differences mostly in the P2 component in channels Fz and Cz. Additionally, words differed from pictures and colors in the N2 component for channel Cz, while colors compared to pictures differed more frequently in the P3 component for channels Fz and Cz.
For task 3, ERP morphologies differed more frequently for words compared to pictures (see Figure 3). Words and pictures showed differences in the N2, P2, and P3 components in channels Fz and Cz. Additionally, words differed from colors in the P3 component in channels Fz and Cz.

Effects as a Function of Task Structure
For words, ERP morphologies differed more frequently for task 3 compared to task 1 and task 2 (see Figure 4). While task 3 and task 1 showed differences in the N1, P1, and P2 components in channels Fz, Cz, and Pz, task 3 and task 2 showed differences in the N1 and P2 components in channels Fz, Cz, and Pz. We note that in task 1, where stimulus offset occurred at 400 ms, waveforms after 400 ms may have been affected by the stimulus offset event in addition to other task-related factors.
For pictures, ERP morphologies differed more frequently for task 1 compared to task 3 and task 2 (Figure 4). While task 1 and task 3 showed differences in the N1 component in channels Fz and Cz, task 1 and task 2 showed differences in the P2 component in channels Fz, Cz, and Pz. Additionally, task 3 differed from task 2 in the N1 component in channels Fz and Cz.
For colors, ERP morphologies differed more frequently for task 3 compared to task 1 and task 2 (Figure 4). While task 3 and task 1 showed differences in the N1 and P2 components in channels Fz, Cz, and Pz, task 3 and task 2 showed differences in the N1 and P2 components in channels Cz and Pz.

Effects as a Function of Task Load and Performance
To understand how other factors may have influenced the ERPs, we also examined the effects of memory load and performance on ERP waveforms (see Figure 5). Concerning N-back load (N = 2, N = 3), the main effect of load is shown in Figure 5A; this effect was significant (p < 0.05) for all the components mentioned in this paper except for P1 (see Supplementary Table 4 for statistics). However, this effect was largely independent of task and stimulus type (see Supplementary Figure 5 and Supplementary Table 4 for the break-down of ERPs and statistics across the different task and stimulus conditions). Likewise, we observed differences in the ERPs as a function of performance metrics (Figure 5B): hits (correctly responded targets), misses (incorrectly responded targets), correct rejections (correctly responded non-targets), and false alarms (incorrectly responded non-targets). There was a significant main effect of performance (p < 0.001) for all the components except for P1. However, again, this effect was largely independent of task and stimulus type (see Supplementary Figure 6 and Supplementary Table 5 for the break-down of ERPs and statistics across the different stimulus and task conditions).
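The four performance categories follow from crossing the trial type (target or non-target) with the participant's judgment. The sketch below marks targets in an N-Back sequence and classifies each trial. The function names are hypothetical, responses are simplified to a boolean "judged as target", and the first N trials, which cannot be targets, are treated as non-targets as an assumption.

```python
from collections import Counter


def mark_targets(sequence, n):
    """A trial is a target if its stimulus matches the one shown n trials earlier."""
    return [i >= n and sequence[i] == sequence[i - n] for i in range(len(sequence))]


def classify_trial(is_target, judged_target):
    """Map one trial onto hit, miss, correct rejection, or false alarm."""
    if is_target:
        return "hit" if judged_target else "miss"
    return "false alarm" if judged_target else "correct rejection"


# 2-back example using the word stimuli; the responses are made up.
words = ["so", "do", "so", "up", "so", "up"]
responses = [False, True, True, False, False, True]

targets = mark_targets(words, n=2)
outcomes = [classify_trial(t, r) for t, r in zip(targets, responses)]
print(targets)
print(Counter(outcomes))
```

In the go/no-go task structures (tasks 2 and 3), withholding a response corresponds to judging the trial as a non-target, whereas in the forced-choice structure (task 1) both judgments require a button press.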

Comparison Between Pre-processing Pipelines in Dataset I (UCR)
We next examined the extent to which differences in analysis pipelines used across labs resulted in changes in estimated ERP morphologies. Interestingly, early ERP components were relatively preserved across the pipelines, whereas later ERP components showed significant differences between pipeline I and pipeline II (see Figure 6). Further, these differences showed some interaction with task and stimulus. For example, the effect of the pipeline was found in all variations in task structure 1 (for channels Fz, Cz, and Pz). Moreover, the word N-Back variation with task structure 1 showed significant differences in P3 components between the two pipelines. For task structure 2 and words, Cz showed a significant difference in N2 and P3 components. For task structure 2 and pictures, Fz revealed significant differences in PLC and Cz in P3 and PLC. For task structure 2 and colors, Fz showed significant differences in P3, N4, and PLC signatures and Cz in P3 and PLC. For task structure 3 and words, Fz showed significant differences in N2, P3, N4, and PLC components. Further, Cz showed differences for N2 and P3 and Pz for PLC. For task structure 3 and pictures, Fz and Cz showed a significant difference in PLC. Finally, for task structure 3 and colors, Fz showed significant differences in P3, N4, and PLC components and Cz showed a significant difference in PLC.
Because pipelines I and II differ in several ways, ranging from the analysis toolbox and the eye artifact removal method to the reference electrodes, there are too many candidate parameters to causally relate any one of them to a specific difference in an ERP component. Nevertheless, these results are interesting as they highlight how the use of different pre-processing pipelines commonly used in the EEG literature can affect ERP morphology at an aggregate level; in particular, the choice of pipeline can affect the extent to which one correctly or incorrectly determines differences between conditions. While it would be interesting to unveil possible causal relations between these pipeline differences and the resulting ERPs, the goal of the present study is to illuminate the impact of common methodological differences between studies rather than to fully explain such differences, which would require a larger study. Furthermore, considering the few existing studies in the literature (Jiang et al., 2019; Yao et al., 2019) that demonstrated a significant role played by pre-processing factors, we think it is likely that the eye artifact removal method and the reference electrodes have the greatest impact on the resulting ERPs in our pipelines. Still, we note that our analysis of pipelines is merely illustrative of how the pipelines used in the previously published versions of these datasets give rise to different ERP morphologies, and that a complete characterization of how pipeline elements affect the signal and/or SNR (Robbins et al., 2020) is beyond the scope of the present manuscript.
FIGURE 3 | Grand average and SEM of ERP curve for UCR dataset at Cz electrode for target trials during variations of stimulus types (words, pictures, and colors). Gray shaded areas indicate significantly different data points (p < 0.05). P-values that are less than 0.0001 are thresholded to 0.0001 for viewing purposes, as shown by the black curve at the bottom of each graph where log p-values are reported.
FIGURE 4 | Grand average and SEM of ERP curve for UCR dataset at Cz electrode for target trials during variations of task structure (task 1, task 2, and task 3). Gray shaded areas indicate significantly different data points (p < 0.05). P-values that are less than 0.0001 are thresholded to 0.0001 for viewing purposes.

Laboratory Effects
Another potential source of variation is the experimental location, which may result in differences in behavior and ERP morphology. Specifically, we compared different laboratories to explore differences arising from several characteristics such as lab settings, stimuli, tasks, subject pools, subject instructions, and processing pipelines. Using pipeline I, we compared task 2 (pictures only, N = 16) as used in Dataset I (UCR) and Dataset II (KU Leuven), as well as task 1 (words only, N = 16) as used in Dataset I (UCR) and Dataset III (UM). We did not compare Datasets II and III as the stimuli were different (pictures vs. words, respectively), whereas Dataset I included both words and pictures and could therefore be compared to both datasets. We note that this analysis is far from comprehensive and that it would be ideal to collect data with identical procedures across the labs; nevertheless, it is at least illustrative of the other, unexplained, variance that can be expected from different labs conducting similar research without coordinating on the exact details of the studies, which is typical of the extant literature.

Dataset I vs. Dataset II (UCR vs. KU Leuven)
Behavioral results for task 2 showed significantly higher accuracy in Dataset II compared to Dataset I (p < 0.001; Figure 7A) and ERP morphology outcomes revealed larger ERP amplitudes in Dataset II compared to Dataset I (Figure 8). Namely, significant differences between Dataset I and Dataset II (p < 0.05) were found in P1, N1, P2, N2, and P3 components, in channels Fz and Cz.

DISCUSSION
The goal of the present study was to fill a gap in the extant literature by illuminating the extent to which common procedural differences related to N-Back task variants, EEG recording setups, and pre-processing pipelines affect behavioral and electrophysiological correlates of performance. To address this, we compared variants of the N-Back task used in three laboratories, two in Europe (Pahor and Jaušovec, 2018; Pergher et al., 2018) and one in the US where the behavioral and EEG datasets were replicated. Our findings suggest that stimulus type, task structure, pre-processing pipeline, and lab factors contribute to differences in behavioral and neurophysiological responses on the N-Back task.
Given that most meta-analyses overlook differences in the N-Back task adopted in each study (Glahn et al., 2005; Heishman et al., 2010; Redick and Lindsey, 2013; Brunoni and Vanderhasselt, 2014), we characterized some of the factors that might affect cognitive task outcomes. First, we examined task structure and showed differences in accuracy between tasks (task 1, task 2, and task 3), revealing higher accuracy for task 3 compared to the other two, perhaps because it had the longest stimulus duration (2,500 ms), which may facilitate the encoding of information. Indeed, Kunimi (2016) showed that increasing stimulus duration (from 500 to 5,000 ms) improves memory performance during retention of visuospatial information, whereas Fox et al. (2007) showed that a longer ISI was associated with increased accuracy. We also investigated stimulus type and observed better performance for pictures of objects compared to words and colors. In contrast, Nystrom et al. (2000) reported higher accuracy for letters compared to shapes.
Another important aspect of factors such as task structure and stimulus type is their impact on ERP morphology. To highlight this variance, we examined differences in several ERP components, namely N1, P2, N2, and P3, for both factors, and observed their spatial distribution, as previous studies suggested that ERP components are modulated by WM experimental features, particularly stimulus type. Mecklinger and Pfeifer (1996) reported that the encoding of object features was associated with modulation of the P2 component, whereas Ruchkin et al. (1992) showed variations of the N2 and P3 components for visuospatial stimuli compared to phonological stimuli, indicating that visuospatial stimuli were processed more quickly than phonological ones. Moreover, Rossion et al. (2003) observed N1 modulation in response to faces and objects compared to objects. Thus, while it is clear in the literature that both task and stimulus should influence ERPs in systematic ways, to date this has been largely overlooked in the literature examining ERP signatures of WM tasks such as the N-Back.
In addition to stimulus type and task structure, we suggest that different experimental laboratories and pre-processing procedures might also affect accuracy and ERP morphology. Seemingly arbitrary procedures are employed by different laboratories in terms of environment and equipment, and data are pre-processed and analyzed with different pipelines, which has been shown to produce different findings (Busch et al., 2006; for review, see Zimmer et al., 2001). Here we show that the same N-Back task performed in two laboratories produces different behavioral and ERP morphology results. However, we suggest interpreting these results carefully, as participants' individual differences and the skills of the EEG and analysis operators may also have affected these results. Green et al. (2019) observed that reward, motivation, and/or participant expectations, such as differences in task performance, researcher instructions, etc., could also count as factors for behavioral differences when comparing performances between different laboratories. Moreover, we highlight the impact of pre-processing pipelines on ERP data, supporting the recommendations provided by Smith and Kutas (2015) regarding the influence of the EEG data pipeline, including baseline correction, artifact rejection, and the filtering procedure (Acunzo et al., 2012), on ERP analysis. In line with the goal of this study, we did not associate a specific step of the pre-processing pipeline with an ERP component or cognitive process, since we aimed to show at a more general level the impact of stimuli, tasks, and laboratory environment on both accuracy and ERP responses.

[...] Supplementary Figures 7, 8). Gray shaded areas indicate significantly different data points (p < 0.05). P-values that are less than 0.0001 are thresholded to 0.0001 for viewing purposes. Data in Pipeline II were up-sampled to 512 to make the comparison possible.
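
Two display and alignment steps mentioned in the figure caption, up-sampling one pipeline's data to 512 and thresholding very small p-values at 0.0001, can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the interpolation method or the original sampling rate, so linear interpolation and a source rate of 256 are hypothetical choices, as are the function names `upsample_linear` and `floor_p_values`.

```python
def upsample_linear(samples, src_rate, dst_rate):
    """Resample evenly spaced samples to a higher rate by linear interpolation.

    Assumption: linear interpolation stands in for whatever resampling
    routine a given pipeline actually uses.
    """
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / src_rate
    n_out = int(round(duration * dst_rate)) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate  # position in source-sample units
        i = min(int(pos), len(samples) - 2)  # left neighbor index
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


def floor_p_values(p_values, floor=0.0001):
    """Clamp p-values below `floor` up to `floor`, for plotting only."""
    return [max(p, floor) for p in p_values]


if __name__ == "__main__":
    # Hypothetical waveform sampled at 256, brought to 512 for comparison.
    print(upsample_linear([0.0, 2.0, 4.0], src_rate=256, dst_rate=512))
    print(floor_p_values([0.03, 0.00002, 0.5]))
```

Once both waveforms share the same sampling grid, they can be compared sample by sample. The floor on p-values changes only how they are displayed and does not affect which samples count as significant at p < 0.05.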
Our study presents several limitations. We considered only accuracy during N-Back performance, because the three Datasets and the related tasks had different response requirements, and so it would have been very complex to compare them. As Dataset I utilized a mixed within/between design, individual differences might have affected ERP signatures attributed to laboratory effects. Indeed, a recent review paper highlighted the variety of features that may impact N-Back performance, including both task and individual features (Pergher et al., 2019a). The samples compared here were of similar age and had a similar educational level (undergraduates) and, in Datasets I and II, a similar distribution of gender. While Dataset III only consisted of data collected from females, a recent study by Pliatsikas et al. (2019) demonstrated that gender, age, and education level affect response accuracy after a single N-Back training session in healthy older individuals. Since the present study consists of N-Back performance across 1 or 2 sessions in young subjects, we do expect these variables to have a moderate effect on behavioral and electrophysiological results. Nevertheless, there might be other individual difference factors such as motivation, personality, and WM capacity (Dong et al., 2015) that were not accounted for but could have affected the results. Future studies will need to examine whether these individual differences, along with other factors such as time-of-day and environment, affect N-Back task performance and ERP signatures. Moreover, further studies should also consider the choice of words, pictures, and colors, as they may play an important role in affecting behavioral and ERP responses due to the different colors and shapes used, and familiarity with the objects presented. Finally, since Dataset III represents a sham condition in a brain stimulation study (Pahor and Jaušovec, 2018), placebo effects may have affected performance. Since we only retained data collected in session 1, i.e., before exposure to active stimulation, it is unlikely that these effects are large. Still, we suggest that, while more work can be done to clarify the effects presented here and other differences still exist in the extant literature, the present work is informative of how some of the most common differences in the N-Back between studies can impact observed behavioral and physiological measures.

FIGURE 8 | ERP responses during task 2, only for target stimuli recorded at different laboratories. Gray shaded areas show significant differences at p < 0.05. Both datasets were pre-processed with pipeline I.

FIGURE 9 | ERP responses during task 1 (mean and standard deviation of targets) recorded at different laboratories. Gray shaded areas show significant differences at p < 0.05. Both datasets were pre-processed with pipeline I.
In conclusion, the present data sets help clarify the extent to which common N-Back task variations in terms of stimulus type, task structure, laboratory, and processing pipeline give rise to differences in behavioral and physiological outcomes. While future research is needed to help us understand the mechanisms that underlie these observed differences, the present work can help readers appreciate the effect sizes to be expected from the many variations considered here. We note that while, in general, it is well acknowledged that any difference between studies can have an impact, the significance of these variations in the case of the N-Back has been largely overlooked, thus limiting understanding of their role in affecting accuracy and ERP morphology and of potentially important information related to the mechanisms that regulate WM processes. We suggest that for the field to move forward, experimental features, analysis pipeline, and laboratory differences need to be taken into consideration when interpreting findings and making comparisons across studies.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board of the University of California Riverside. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
VP, MS, and AS conceived the presented idea and developed the theory. MS, VP, and AP carried out the experiment. MS and VP performed the computations. All authors contributed to the article and approved the submitted version.

FUNDING
This research was supported by NIMH R01 MH111742 to AS, by a PostDoctoral Mandate (PDM) from KU Leuven to VP, and by research grants to MV from the European Union's Horizon 2020 research and innovation program under grant agreement No. 857375, the Financing Program (PFV/10/008), the special research fund of the KU Leuven (C24/18/098), the Belgian Fund for Scientific Research-Flanders (G088314N, G0A0914N, and G0A4118N), the Interuniversity Attraction Poles Programme-Belgian Science Policy (IUAP P7/11), and the Hercules Foundation (AKUL 043).