Reliability of P3 Event-Related Potential During Working Memory Across the Spectrum of Cognitive Aging

Event-related potentials (ERPs) offer unparalleled temporal resolution in tracing distinct electrophysiological processes related to normal and pathological cognitive aging. The stability of ERPs in older individuals with a wide range of cognitive ability has not been established. In this test-retest reliability study, 39 older individuals (age 74.10 (5.4) years; 23 (59%) women; 15 non β-amyloid elevated, 16 β-amyloid elevated, 8 cognitively impaired) with scores on the Montreal Cognitive Assessment (MOCA) ranging between 3 and 30 completed a working memory (n-back) test with three levels of difficulty at baseline and 2-week follow-up. The main aim was to evaluate stability of the ERP on grand averaged task effects for both visits in the total sample (n = 39). Secondary aims were to evaluate the effect of age, group (non β-amyloid elevated; β-amyloid elevated; cognitively impaired), cognitive status (MOCA), and task difficulty on ERP reliability. P3 peak amplitude and latency were measured in predetermined channels. P3 peak amplitude at Fz, our main outcome variable, showed excellent reliability in 0-back (intraclass correlation coefficient (ICC), 95% confidence interval = 0.82 (0.67–0.90)) and 1-back (ICC = 0.87 (0.76–0.93)), but only fair reliability in 2-back (ICC = 0.53 (0.09–0.75)). Reliability of P3 peak latencies was substantially lower, with ICCs ranging between 0.17 for 2-back and 0.54 for 0-back. Generalized linear mixed models showed no confounding effect of age, group, or task difficulty on stability of P3 amplitude and latency at Fz. By contrast, lower MOCA scores tended to be associated with greater test-retest variability in P3 amplitude at Fz (p = 0.07). We conclude that P3 peak amplitude, and to a lesser extent P3 peak latency, provide a stable measure of electrophysiological processes in older individuals.


INTRODUCTION
The aging process is characterized by gradual decline in physical, neurobiological and cognitive functions that may impact instrumental activities of daily living (iADL) such as driving, doing household chores, managing finances, medication adherence, or grocery shopping (Moon et al., 2018;Carmona-Torres et al., 2019). Deterioration in these iADL becomes more apparent with age-related neurodegeneration such as mild cognitive impairment (MCI) and Alzheimer's disease (AD) (Jekel et al., 2015). Executive functions in particular are paramount in carrying out numerous iADL, but are also vulnerable to the effects of normal and pathological cognitive aging (Overdorp et al., 2016;Tabira et al., 2020). Working memory is one core executive function that relates to the ability to temporarily store, process, and manipulate the information necessary for higher order cognitive tasks such as decision making, learning, and reasoning (Baddeley, 1992). Working memory stems from the interaction between attention, short-term retention and manipulation of information, carried out by the coordinated activation of many brain regions (Eriksson et al., 2015).
The prefrontal cortex has particularly been associated with working memory (Bahmani et al., 2019). However, the prefrontal cortex is also highly susceptible to the effects of aging and early neurodegeneration (West et al., 2002;Ranchet et al., 2017). A recent meta-analysis pooling functional magnetic resonance imaging studies suggested a gradual and linear decline in prefrontal cortex engagement in older individuals (Yaple et al., 2019). Similarly, electrophysiological processes also decline with age. The P3, a positive peak that appears with a latency between 250 and 500 ms in the event-related potential (ERP), has been implicated in attention and working memory processes across the lifespan (Van Dinteren et al., 2014). A previous study showed reduced positivity in P3 central-frontal and parietal ERPs in older adults (Lubitz et al., 2017), whereas others demonstrated frontal hyperactivity in P3 coupled with parietal or posterior hypoactivity (Fjell and Walhovd, 2001;Saliasi et al., 2013). Despite the ambiguity in ERP findings, most studies conclude that the abnormal ERP response in older individuals reflects inefficient or compensatory use of neural resources due to frontal cortex dysfunction (Saliasi et al., 2013;Lubitz et al., 2017). Therefore, electrophysiological responses to working memory tasks are convenient measures to test hypotheses related to frontal cortex function, normal cognitive aging, and early neurodegeneration.
The ability to distinguish natural variability and measurement error from biologically relevant cognitive changes due to aging or early neurodegeneration is valuable to provide informed decisions on diagnosis, monitoring, and treatment of cognitive impairments (Feinkohl et al., 2020). However, older adults show more intraindividual variability in performance measures of working memory compared to younger adults. The age-related changes in intraindividual variability of performance measures become even more apparent with increasing cognitive demand (West et al., 2002). This increased intraindividual variability may also stem from the heterogeneity of cognitive profiles in older individuals, especially when patients with MCI and AD are included (Troyer et al., 2016). The intraindividual variability observed in performance measures is believed to be linked to frontal cortex dysfunction (West et al., 2002), which may therefore also affect intraindividual variability of the ERP response in older adults (Robertson et al., 2006). To date, few studies have investigated test-retest reliability of P3 ERP in healthy older adults (Sandman and Patterson, 2000;Walhovd and Fjell, 2002;Behforuzi et al., 2019). The test-retest reliability of the P3 ERP in older individuals with a heterogeneous cognitive profile has yet to be established.
The main aim was to characterize test-retest reliability of the P3 ERP in a group of older adults with a wide range of cognitive function. Secondary aims were to investigate the impact of age, disease group (non β-amyloid elevated; β-amyloid elevated; cognitively impaired), cognitive status, and task difficulty on the reliability of the P3 ERP.

MATERIALS AND METHODS
Participants
This test-retest reliability study included 39 right-handed participants recruited from the KU Alzheimer's Disease Center between 05/03/2018 and 03/10/2020. Inclusion criteria were informed consent; age older than 65; ability to understand the instructions in English; and having previously undergone an amyloid PET scan of the brain. Cerebral amyloid burden was assessed using PET images, obtained on a GE Discovery ST-16 PET/CT scanner after administration of intravenous florbetapir F-18. Standard Uptake Value Ratio for six regions of interest was calculated using MIMneuro software (MiM Software Inc., Cleveland, OH, United States) by normalizing the Aβ PET image to the entire cerebellum. Diagnosis of cognitively normal pre-clinical AD followed the recommendations from NIA and the Alzheimer's Association workgroup (Sperling et al., 2011). The protocol for determination of amyloid elevation is detailed elsewhere (Vidoni et al., 2016). The average time between the PET scan and the EEG assessment was 1090 (479) days. Exclusion criteria were: currently taking steroids, benzodiazepines, or neuroleptics; history of any substance abuse; and history of a neurological disorder other than MCI or AD. Sixteen were cognitively normal older adults with non-elevated amyloid PET scans (Aβ−), 15 were cognitively normal with elevated amyloid PET scans (Aβ+), and eight had a clinical diagnosis of MCI or AD with positive amyloid PET scans. Participants completed their 2-week follow-up session 16 ± 8 days after the first session. Each session lasted about 60 minutes including rest breaks.

Demographic and Clinical Information
Age, sex, and education were recorded. General cognitive functions were evaluated with the Montreal Cognitive Assessment (MOCA) (Nasreddine et al., 2005). Scores on the MOCA range between 0 and 30.

N-Back Test
In the n-back test, participants are shown a series of letters and are instructed to press a button when the current stimulus is the same as the item presented n positions back. The cognitive demand of the n-back task increases with n, while the perceptual and motor demands remain constant. In this study, the 0-back, 1-back, and 2-back tests were administered. The 0-back test is essentially a memory search task of sustained attention and is often used as a control condition (Miller et al., 2009;Bopp and Verhaeghen, 2018). The 1-back test requires the participant to passively store and update information in working memory. Whereas in the 0-back and 1-back the stimulus on screen is held in the focus of attention, the 2-back test requires constant switching from the focus of attention to short-term memory (Bopp and Verhaeghen, 2018). Higher levels of difficulty require continuous mental effort to update information of new stimuli and maintain representations of recently presented stimuli (Gevins et al., 2011).
Participants sat in a comfortable chair 26 inches in front of the computer screen with the center of the screen at eye level. White letters appeared on a black screen. Prior to each test, participants were given a practice trial consisting of 7 non-targets and 3 targets. The practice trials were repeated until the participant felt comfortable with the instructions. Each test comprised 180 trials, including 60 trials that needed a response (target, 33.3%) and 120 trials for which a response was not required (non-target, 66.7%). Each letter was presented for 500 ms on the computer screen followed by a blank interstimulus interval of 1,700 ms, with a random jitter of ±50 ms. The maximum time to accept the response was 2,150 ms. The total task time was ∼7 minutes. In the 0-back test, participants were instructed to press the left mouse button as soon as the letter "X" (target) appeared on the screen while ignoring the other letters (non-target). In the 1-back test, participants were instructed to press the button if the current letter on the screen was the same as the letter previously shown (target). In the 2-back test, participants were instructed to press the button when the current letter was the same as the one presented two places before (target). The number of hits (accuracy) and response times to the hits were the main behavioral performance outcome measures.
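The trial structure above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than the presentation software used in the study; the function names are hypothetical, and it assumes the ±50 ms jitter is symmetric so that it averages out over a block.

```python
# Trial parameters taken from the task description.
N_TRIALS = 180
N_TARGETS = 60
STIMULUS_MS = 500
ISI_MS = 1700  # mean interstimulus interval; symmetric jitter assumed to average out


def task_duration_minutes(n_trials=N_TRIALS, stimulus_ms=STIMULUS_MS, isi_ms=ISI_MS):
    """Expected duration of one n-back block in minutes, ignoring jitter."""
    return n_trials * (stimulus_ms + isi_ms) / 1000 / 60


def target_proportion(n_targets=N_TARGETS, n_trials=N_TRIALS):
    """Fraction of trials that require a response."""
    return n_targets / n_trials


print(round(task_duration_minutes(), 1))      # 6.6
print(round(100 * target_proportion(), 1))    # 33.3
```

The 6.6 minutes obtained this way agrees with the reported total task time of about 7 minutes.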

P3 ERP
Continuous electroencephalogram (EEG) was acquired using a Philips EGI high-density system from 256 scalp electrodes, digitized at 1,000 Hz. Data were filtered from 0.50 to 30 Hz using EGI software. Data were online referenced to Cz and offline re-referenced to the averaged mastoids. All other EEG processing was done in EEGLAB (Delorme and Makeig, 2004) and in ERPLAB (Lopez-Calderon and Luck, 2014). Various artifacts unrelated to cognitive functions, including ocular and muscular movement or cardiovascular signals, were identified and removed using independent component analysis (ICA). Signals from bad electrodes were interpolated using surrounding electrode data. Stimulus-locked ERPs were extracted from the n-back tests and segmented into epochs of 100 ms before to 1,000 ms after stimulus onset, and baseline corrected using the prestimulus interval. Scalp locations and measurement windows for the P3 component were based on their spatial extent and latency after inspection of grand average waveforms (collapsed across the two sessions). P3 peak amplitude of the task effect was considered the main electrophysiological outcome measure, but we also used P3 peak latency as an outcome measure. The task effect was calculated by subtracting the average ERP elicited by the targets from the average ERP elicited by non-targets for each participant. The P3 component time window was established between 200 and 400 ms for all three tests. Because of the prefrontal cortex involvement in working memory, we identified a priori Fz as the main channel, but also calculated reliability of other preidentified electrode locations, i.e., Cz, Pz, F3, and F4. Cz was interpolated using the surrounding five channels. No participants were removed from the analyses because of artifacts. However, one participant disengaged during the 2-back test and was therefore excluded from the 2-back reliability analyses.
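To illustrate the baseline correction and peak measurement described above, the sketch below operates on a single-channel waveform sampled at 1,000 Hz. It is not the EEGLAB/ERPLAB pipeline used in the study. The function names are hypothetical, the peak is assumed to be the most positive sample in the 200 to 400 ms window, and the demonstration waveform is synthetic.

```python
import numpy as np

EPOCH_START_MS = -100   # epoch begins 100 ms before stimulus onset
EPOCH_END_MS = 1000     # epoch ends 1,000 ms after stimulus onset


def epoch_times_ms():
    """Time axis in ms; at 1,000 Hz one sample corresponds to 1 ms."""
    return np.arange(EPOCH_START_MS, EPOCH_END_MS)


def baseline_correct(wave, times_ms):
    """Subtract the mean of the prestimulus interval from the whole epoch."""
    return wave - wave[times_ms < 0].mean()


def p3_peak(wave, times_ms, window_ms=(200, 400)):
    """Return (peak amplitude, peak latency in ms) within the window.

    Assumption: the peak is the most positive sample in the window.
    """
    mask = (times_ms >= window_ms[0]) & (times_ms <= window_ms[1])
    idx = int(np.argmax(wave[mask]))
    return float(wave[mask][idx]), float(times_ms[mask][idx])


# Synthetic example: a 5 µV Gaussian deflection centered at 300 ms
# riding on a constant 2 µV offset that baseline correction removes.
times = epoch_times_ms()
synthetic = 5.0 * np.exp(-((times - 300) ** 2) / (2 * 40.0 ** 2)) + 2.0
amp, lat = p3_peak(baseline_correct(synthetic, times), times)
print(amp, lat)  # approximately 5 µV at 300 ms
```

In practice the input would be a participant's averaged task-effect waveform at a given channel, computed separately for each session.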

Data Analysis
Descriptive analyses, including mean (standard deviation) and frequency counts of participants' general characteristics, performance measures, and ERP data, were performed as appropriate. Intra-class correlation coefficients (ICC) were used to calculate test-retest reliability of performance measures and P3 amplitude and latency. ICCs reflect the consistency of a measure taking into account variance related to the time of testing (Shrout and Fleiss, 1979). ICC values less than 0.40 were considered poor, values between 0.40 and 0.59 fair, values between 0.60 and 0.74 good, and values between 0.75 and 1.00 excellent (Cicchetti, 1994). Bland-Altman plots were used to visualize the measurement precision of amplitude and latency across the test moments (Bland and Altman, 1986). Intersubject stability according to subject rankings was calculated using the Pearson r correlation coefficient. Generalized linear mixed models were employed to evaluate the effect of age, diagnosis (Aβ−; Aβ+; MCI/AD), MOCA scores, and task difficulty on stability of the P3 amplitude and latency. Stability of P3 amplitude (latency) was calculated as the squared difference of P3 amplitude (latency) at follow-up and baseline. The Kolmogorov-Smirnov test was employed to test the normality of our data distribution in addition to visualization of Q-Q plots. All analyses were done using SAS 9.4 software. The threshold of significance was set at p = 0.05.
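As an illustration of the reliability index and its interpretation bands, the sketch below computes an ICC from a participants-by-sessions matrix and labels it using the Cicchetti (1994) cut-offs listed above. The text does not state which ICC form was computed in SAS, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1) of Shrout and Fleiss (1979), is an assumption here, and the function names are hypothetical.

```python
import numpy as np


def icc_2_1(scores):
    """ICC(2,1) from an (n participants x k sessions) array.

    Assumption: two-way random effects, absolute agreement, single measure.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between participants
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between sessions
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def cicchetti_label(icc):
    """Interpretation bands used in this study (Cicchetti, 1994)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Labels for ICC values of the kind reported for Fz.
for value in (0.82, 0.53, 0.17):
    print(value, cicchetti_label(value))
```

With two sessions per participant, `scores` would have one row per participant and two columns (baseline, follow-up).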

RESULTS
Participant Characteristics
Participants (n = 39) were on average 74.05 (5.37) years old and scored 26.44 (4.76) on the MOCA scale. MOCA scores ranged between 3 and 30. No differences were observed for age and sex between groups. As expected, participants with MCI/AD scored worse on the MOCA compared to Aβ− and Aβ+ (Table 1).

Test-Retest Reliability of Performance Measures
All ICC values of hits (accuracy) and response times of each n-back test demonstrated excellent reliability (Supplementary Table 1). ICCs of hits ranged between 0.92 (1- and 2-back) and 0.99 (0-back) and were slightly higher than the ICCs of response times, which ranged between 0.76 (2-back) and 0.89 (1-back). Pearson r correlations ranged from 0.65 (0-back response time) to 0.99 (0-back hits).

Test-Retest Reliability of ERP Measures
Grand average waveforms of the task effect from all channels at baseline and follow-up are displayed in Figure 1. The 3D scalp map is embedded in the figure to demonstrate the task effect at P3. Considerable overlap in ERP response within the P3 time window (200-400 ms post-stimulus) was observed at baseline and 2-week follow-up.
The ICC values of P3 peak amplitude and peak latency of the key electrode locations are displayed in Table 2. Overall, P3 amplitude showed greater reliability compared to P3 latency across channels and task difficulty levels. Also, ICCs of the 0-back and 1-back were consistently higher than those calculated for the 2-back.
For the main channel location Fz, excellent reliability was found in P3 amplitude for 0-back (ICC = 0.82) and 1-back (ICC = 0.87). P3 amplitude of Fz for 2-back only showed fair reliability (ICC = 0.53). Reliability scores of P3 latency at Fz were fair for 0-back (ICC = 0.54) and 1-back (ICC = 0.47), but poor for 2-back (ICC = 0.17). Figure 2 shows the Bland-Altman plots for P3 peak amplitude and peak latency at the Fz channel. All plots demonstrated equal distribution of the data around zero, indicating no bias in the results and no heteroscedasticity within the data.
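The quantities behind a Bland-Altman plot can be sketched as follows. This is an illustrative computation, not the SAS code used for Figure 2; the function name is hypothetical, and the 1.96 standard deviation limits of agreement follow the usual Bland and Altman (1986) convention, which is assumed rather than stated in the text.

```python
import numpy as np


def bland_altman(baseline, follow_up):
    """Return pairwise means, differences, bias, and limits of agreement."""
    b = np.asarray(baseline, dtype=float)
    f = np.asarray(follow_up, dtype=float)
    diffs = f - b                 # follow-up minus baseline for each participant
    means = (f + b) / 2.0         # x-axis of the Bland-Altman plot
    bias = diffs.mean()           # systematic shift between sessions
    sd = diffs.std(ddof=1)        # sample standard deviation of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
    }


# Made-up amplitudes for four participants, for illustration only.
result = bland_altman([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 4.0, 4.0])
print(result["bias"], result["limits"])
```

A bias near zero with differences scattered evenly across the range of means corresponds to the pattern described for the Fz plots.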
Finally, generalized linear mixed models were employed to evaluate the effect of age, disease diagnosis (Aβ−; Aβ+; MCI/AD), cognitive status, and task difficulty on stability (the squared difference between follow-up and baseline) of P3 peak amplitude and latency at the Fz channel. Age (p = 0.74), disease diagnosis (p = 0.67), and task difficulty (p = 0.70) did not affect the stability of the P3 amplitude response, although individuals with lower MOCA cognitive scores tended to show more variability in P3 amplitude (p = 0.07).
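The stability outcome entered into these models is the squared difference between sessions defined in the Data Analysis section. A minimal sketch with a hypothetical function name and made-up values:

```python
import numpy as np


def squared_difference_stability(baseline, follow_up):
    """Per-participant stability: (follow-up - baseline) squared.

    Larger values indicate less stable measurements; the sign of the
    change between sessions is discarded.
    """
    b = np.asarray(baseline, dtype=float)
    f = np.asarray(follow_up, dtype=float)
    return (f - b) ** 2


# Made-up P3 amplitudes (µV) for two participants, for illustration only.
print(squared_difference_stability([4.0, 2.0], [5.0, -1.0]))  # [1. 9.]
```

Because the measure is squared, a participant whose amplitude rises by 3 µV and one whose amplitude falls by 3 µV contribute the same value.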
We recalculated ICCs for 0-back, 1-back, and 2-back in participants who scored 26 or higher on MOCA (n = 32) and those scoring lower than 26 (n = 7). ICC values showed more variance in 0-back and in 2-back in the group with lower MOCA scores, but ICC values were not worse across the n-back tests in this group (Supplementary Tables 2 and 3). Whereas ICCs were similar in the Aβ− and Aβ+ groups, lower ICCs were found for the MCI/AD group (Supplementary Tables 2 and 3).

DISCUSSION
This test-retest reliability study provides critical information on the stability of electrophysiological measures related to working memory in healthy older adults, older adults with increased risk of dementia, and those with MCI or AD. Our results showed that most P3 ERPs in the frontal channels provide fair to excellent reliability to measure electrophysiological processes of cognitive aging in older adults with and without cognitive impairments. Similar to previous studies, the reliability is superior in measures of amplitude compared to latency (Kinoshita et al., 1996;Walhovd and Fjell, 2002;Cassidy et al., 2012;Behforuzi et al., 2019). The robustness of P3 stability is not affected by age, disease diagnosis, or task difficulty. However, there is a trend suggesting that lower MOCA scores may reduce the stability of the P3 amplitude response.
The body of evidence related to reliability of P3 ERPs is sparse, and typically restricted to healthy young (Segalowitz and Barnes, 1993;Kinoshita et al., 1996;Cassidy et al., 2012;Brunner et al., 2013;Huffmeijer et al., 2014), middle-aged (Kinoshita et al., 1996), and older individuals (Sandman and Patterson, 2000;Walhovd and Fjell, 2002;Behforuzi et al., 2019). Few studies have reported reliability measures in neurological conditions (Lew et al., 2007). The reliability analyses in our study produced fair to excellent ICC values across the n-back tests. Whereas ICC values provide a single measure of the magnitude of agreement, Bland-Altman plots provide a graphical display of bias across the two test moments (Ranganathan et al., 2017). Visual inspection of the Bland-Altman plots showed an average difference in ERP responses between first and second testing close to 0, with equal spread of data points around the average difference line. These findings suggest that a 2-week follow-up interval is sufficient to wash out any potential adaptation, test, or practice effect of the n-back on ERPs in older individuals.
Comparison of our results with other test-retest studies of ERPs in older adults is complicated by lack of consistency in terms of the ERP components that are investigated, the tests of working memory, the choice of channel locations, the extracted P3 metric, the P3 window measurement, and the test-retest reliability intervals (Sandman and Patterson, 2000;Walhovd and Fjell, 2002;Behforuzi et al., 2019). Our research design most closely aligns with a study that compared ERPs to novel stimuli collected at baseline and 7-week follow-up in healthy older individuals (Behforuzi et al., 2019). Similar to our study, that study also found excellent reliability for P3 mean amplitude (ICC = 0.86, 95% CI, 0.78-0.92), and poorer reliability for P3 mean latency (ICC = 0.56, 0.30-0.73). Our study demonstrated larger confidence intervals in some of the amplitude and latency measures, which might have been due to the greater cognitive heterogeneity of our sample. Another study also reported considerably lower reliability in P3 amplitude (ICC = −0.02) and latency (ICC = −0.17) in seven individuals experiencing cognitive difficulties following traumatic brain injury compared to healthy peers (ICC = 0.84 for amplitude and 0.64 for latency) (Lew et al., 2007). Combined, these findings point toward a potential confounding effect of cognitive impairment on stability of ERPs in neurological conditions. No effect of age, task difficulty, or disease diagnosis was found on stability of the P3 ERP in the n-back task.
FIGURE 2 | Bland-Altman plots of (A) 0-back Fz peak amplitude; (B) 1-back Fz peak amplitude; (C) 2-back Fz peak amplitude; (D) 0-back Fz peak latency; (E) 1-back Fz peak latency; (F) 2-back Fz peak latency.
Most participants in our study were cognitively normal, either without (n = 15) or with (n = 16) elevated Aβ. The fair to excellent reliability of P3 amplitude and latency provides opportunities for studying the effect of Aβ on neural transmission in preclinical AD using ERP. Accumulation of Aβ deposits in the brain is known to increase the risk of developing AD (Klunk et al., 2004). P3 amplitudes are smaller in AD compared to controls (Hedges et al., 2016). ERPs have also proven useful in predicting conversion to AD, with accuracy rates ranging between 70 and 94% (Chapman et al., 2011). Patients with AD exhibit prolonged latency in P3 ERP compared to age-matched controls (Pedroso et al., 2012). These prolonged latencies observed in patients with AD become particularly apparent in the cognitive domains of executive function, memory, and language (Lee et al., 2013). The ability of P3 ERP to discriminate between MCI and AD (Bennys et al., 2007) opens avenues for investigation of ERP in detecting preclinical AD (Boutros et al., 1995;Rossini et al., 2020).
We established the reliability of P3 amplitude in a group of older adults with a wide range of cognitive ability. Yet, most were cognitively normal. Future studies should include a larger sample of participants with MCI and AD to confirm the confounding effect of impaired cognition on the stability of the P3 response. The results of the group analyses (non β-amyloid elevated; β-amyloid elevated; cognitively impaired), and the potential confounding effect of impaired cognition on ERP response, should be considered exploratory. The n-back is arguably the most ubiquitous working memory test used in ERP studies across the age spectrum (Bopp and Verhaeghen, 2018). However, previous studies have shown that the n-back test hosts an array of control processes, including speed of processing, storage, comparison processes, updating, keeping track, task mixing, task shifting, and resistance to interference (Miller et al., 2009;Schmiedek et al., 2009;Bopp and Verhaeghen, 2018). In addition, we did not establish reliability of ERP in other cognitive domains known to deteriorate in older age, such as memory and language, and this remains an opportunity for further investigation. Future research should also include multiple testing sessions over extended periods of time to evaluate the sensitivity of ERP to detect subtle neurobiological changes due to normal and pathological aging.

CONCLUSION
We set out to assess the test-retest reliability of the ERP response in older adults with a heterogeneous cognitive profile. Consistent with other studies, P3 peak amplitude at Fz showed fair to excellent reliability across different levels of task difficulty, whereas P3 peak latency was less reliable (poor to fair). However, impaired cognition may affect the stability of the P3 ERP response.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the University of Kansas Medical Center Institutional Review Board. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
HD, JB, JM, WB, and KG conceptualized the study. HD, KL, and KG worked out the EEG data processing steps. HD, PA, and KL administered the tests. HD and JM analyzed the data. HD wrote the initial manuscript. JB, KL, PA, JM, WB, and KG reviewed the manuscript and provided valuable comments. All authors contributed to the article and approved the submitted version.

FUNDING
Research reported in this publication was supported by the National Institute on Aging of the National Institutes of Health under Award Number K01 AG058785. This study was supported in part by a pilot grant of the KU Alzheimer's Disease Center (P30 AG035982). The Hoglund Biomedical Imaging Center is supported in part by S10 RR29577 and generous gifts from Forrest and Sally Hoglund. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.