Both Reaction Time and Accuracy Measures of Intraindividual Variability Predict Cognitive Performance in Alzheimer's Disease

Dementia researchers around the world prioritize the urgent need for sensitive measurement tools that can detect cognitive and functional change at the earliest stages of Alzheimer's disease (AD). Sensitive indicators of underlying neural pathology assist in the early detection of cognitive change and are thus important for the evaluation of early-intervention clinical trials. One method that may be particularly well-suited to help achieve this goal involves the quantification of intraindividual variability (IIV) in cognitive performance. The current study aimed to directly compare two methods of estimating IIV (fluctuations in accuracy-based scores vs. those in latency-based scores) as predictors of cognitive performance in AD. Specifically, we directly compared the relative sensitivity of reaction time (RT)- and accuracy-based estimates of IIV to cognitive compromise. The novelty of the present study centered on the patients we tested (a group of patients with AD) and the outcome measures we used (a measure of general cognitive function and a measure of episodic memory function). Hence, we compared intraindividual standard deviations (iSDs) from two RT tasks and three accuracy-based memory tasks in patients with possible or probable Alzheimer's dementia (n = 23) and matched healthy controls (n = 25). The main analyses modeled the relative contributions of RT- vs. accuracy-based measures of IIV toward the prediction of performance on measures of (a) overall cognitive functioning, and (b) episodic memory functioning. Results indicated that RT-based IIV measures were better predictors of neurocognitive impairment (as indexed by overall cognitive and memory performance) than accuracy-based IIV measures, even after adjusting for the timescale of measurement.
However, one accuracy-based IIV measure (derived from a recognition memory test) also differentiated patients with AD from controls, and significantly predicted episodic memory performance. The findings suggest that both RT- and accuracy-based IIV measures may be useful indicators of underlying neuropathology. The present study therefore contributes toward an understanding of the relative utility of RT- and accuracy-based IIV measures in detecting neurocognitive impairment in older adults, and also advances the empirical evaluation of sensitive markers of cognitive change in patients with AD.


INTRODUCTION
Twenty-three percent of the worldwide burden of disease occurs in individuals age 60 years and older, and up to 63% of individuals with age-related diseases such as dementia currently reside in low- and middle-income countries (LAMICs; Prince et al., 2015; World Health Organisation, 2015). In one such country, South Africa, the most recent census statistics indicate that 8% of the population (∼4.1 million individuals) are aged 60 years or older, and that this number will increase by as much as 40% over the next two decades (Statistics South Africa, 2014). Furthermore, community-based estimates suggest there is a higher prevalence of dementia in South Africa, and in LAMICs generally, compared to global estimates (Prince et al., 2013; de Jager et al., 2015). These epidemiological data underscore the urgency of conducting LAMIC-based dementia research.
Recently, the Alzheimer's Association's Research Roundtable (AARR), an interdisciplinary group of leading dementia researchers, prioritized the urgent need for sensitive measurement tools that can detect cognitive and functional change at the earliest (even prodromal) stages of Alzheimer's disease (AD; Snyder et al., 2014). Sensitive indicators of underlying neural pathology are important for the evaluation of early-intervention clinical trials, and may play a central role in alleviating the burden of age-related disease (Food and Drug Administration, 2013). One method that may be particularly well-suited to help achieve this goal involves the quantification of intraindividual variability (IIV; also known as inconsistency) in cognitive performance. Whereas conventional indicators of cognitive performance are based on measures of central tendency and involve assessment of an individual on a single measure administered on a single occasion, IIV indicators are based on measures of variability and involve assessing fluctuations in performance of an individual on a single measure administered on multiple occasions (Li et al., 2001; Hultsch et al., 2002; MacDonald et al., 2006).
Contemporary IIV research focuses primarily on inconsistency in performance on reaction time (RT) measures (see, e.g., Bielak et al., 2010; Saville et al., 2011; Bunce et al., 2013; Yao et al., 2016). Such latency-based measures are particularly well-suited to IIV research because they have larger ranges than traditional cognitive test scores, thus making them more sensitive than traditional cognitive tests to individual performance differences. RT tasks also (a) typically involve multiple trials, which allows for many samples of performance, and (b) are less sensitive to re-test effects (Allaire and Marsiske, 2005; Salthouse, 2012). Over the past two decades, a sizeable literature has established IIV in RT as an effective marker of general cognitive function in older adults: High levels predict impending cognitive decline, and are associated with a range of age-related neurological disturbances, with neurodegenerative disease (e.g., AD), and with mortality risk (Collins and Long, 1996; Hultsch et al., 2002; MacDonald et al., 2003; Burton et al., 2006; Shipley et al., 2006; Duchek et al., 2009; Bielak et al., 2010).
An alternative method for capturing IIV involves using accuracy-based measures. These measures are derived from tasks featuring stimuli to which the test taker makes either a correct or an incorrect response (e.g., Murphy et al., 2007; Tractenberg and Pietrzak, 2011). Because such tasks are used frequently in clinical practice, deriving an IIV score from them, and showing the predictive value of that score, is a useful undertaking. Although some studies report that accuracy-based IIV measures can, for instance, differentiate between patients with AD, those with Parkinson's disease, and healthy controls, and can aid in detecting prodromal AD (Darby et al., 2002; Burton et al., 2006; Murphy et al., 2007; Tractenberg and Pietrzak, 2011; Kälin et al., 2014), many researchers prefer latency-based measures. One reason for this preference is that statistically significant positive associations between accuracy-based IIV and age do not survive after controlling for mean performance. Associations of outcomes (e.g., age, clinical group status) with RT-based IIV measures are not affected by controlling for mean performance, and hence those measures are perceived to be superior in detecting underlying pathology (see, e.g., Li et al., 2001; Salthouse et al., 2006).
However, only one previous study in the aging literature provides a direct comparison of the relative sensitivity of RT- and accuracy-based IIV measures to cognitive compromise. Hultsch et al. (2000) measured trial-to-trial and session-to-session IIV in RT- and accuracy-based measures in three groups: healthy older adults, patients with arthritis, and patients with dementia (either mild AD or mild vascular dementia). They reported that, whereas there were no significant between-group differences in terms of accuracy-based IIV, RT-based measures differentiated the groups successfully, independent of mean-level predictors.
The current study seeks to systematically replicate and extend the findings of Hultsch et al. (2000). Specifically, we also compare directly the relative sensitivity of RT- and accuracy-based estimates of IIV to cognitive status. The novelty of the present study, however, centers on the patients we tested and the measures we used. Where Hultsch et al. (2000) used a mixed-dementia group, we use a group of patients with AD. Inclusion of this more homogeneous clinical group allows for improved sensitivity of accuracy-based tasks, which are typically designed to target specific domains of cognitive function (e.g., episodic memory). Furthermore, where Hultsch and colleagues' analyses were targeted toward categorical prediction of group membership (i.e., they asked whether RT- and accuracy-based IIV measures could distinguish healthy older adults from patients with arthritis and from patients with dementia), our analyses use continuous outcome measures (i.e., we ask not only whether RT- and accuracy-based IIV are significantly different in healthy older adults compared to patients with AD, but also whether those IIV measures are predictive of performance on a measure of general cognitive function and on a measure of performance in a cognitive domain that, typically, is sensitive to AD dysfunction). In summary, the specific aims of our analyses were to (a) use RT- and accuracy-based measures of IIV to differentiate between a clinical group of AD patients and a control group of demographically matched healthy individuals, (b) determine the relative contribution of RT- and accuracy-based measures of IIV to the prediction of overall cognitive functioning and episodic memory functioning, and (c) evaluate the effect of the timescale of measurement on that relative contribution.
Hence, the present study contributes toward an understanding of the relative utility of RT- and accuracy-based IIV measures in detecting neurocognitive impairment in older adults, and also responds to the AARR call for empirical evaluation of sensitive markers of cognitive change in patients with AD.

Design and Setting
The current study is the first report of data collected within an ongoing longitudinal investigation of AD progression taking place in Cape Town, South Africa. The parent study utilizes a measurement burst design (Nesselroade, 1991), in which each participant experiences three intervals of serial testing (or bursts; T1, T2, and T3) over the course of 12 months. Within each interval, each participant is tested three times (e.g., T1.1, T1.2, T1.3) over a 2-week period. The data we report here are from the first test interval (i.e., T1).
Clinical participants were recruited from a state hospital's Memory Clinic. Recruitment was monitored by health professionals, including a neurologist (MIC) and a neuropsychologist (KGFT), who provide clinical service delivery at the Clinic. Control participants were community-dwelling volunteers from the greater Cape Town area. They received notice of the study via word-of-mouth or flyers distributed to seniors' clubs, old age homes, and retirement villages.
Inclusion criteria were (a) availability of medical health history; (b) age 55 years or above; (c) English literacy (i.e., basic ability to speak, read, and write in that language); and (d) availability of a close relative or similar informant who could provide information about recent changes in cognitive function. Exclusion criteria included (a) a diagnosis of HIV/AIDS, uncontrolled hypertension, uncontrolled diabetes mellitus, or any other medical condition that, in the opinion of the research team, might have a long-lasting effect on cognitive function; (b) current or past psychiatric illness; (c) a Geriatric Depression Scale (GDS; Yesavage et al., 1982) score > 9/30; (d) the presence of any major neurological disorder (e.g., Parkinson's disease, Huntington's disease) or past stroke; (e) any history of alcohol or drug abuse, or heavy smoking (> 20 cigarettes per day); and (f) a Mini-Mental State Examination (MMSE; Folstein et al., 1975) score < 12.
The Research Ethics Committees of the University of Cape Town's Department of Psychology and Faculty of Health Sciences approved all study procedures. These procedures adhered to the guidelines published in the Declaration of Helsinki (World Medical Association, 2013).

Measures and Procedures
The data we report on here were gathered across four sessions (one screening and three test sessions). All study participants signed consent forms before screening. For clinical participants, the study was explained and consent was signed in the presence of a guardian, caregiver, or relative (who also signed the consent form). The screening session occurred no more than 30 days before the first test session (in most cases, the separation was a week or less). For all participants, the three test sessions took place over a 2-week period.
Sessions were held in a private research room at Groote Schuur Hospital or at the participant's home, depending on his/her preference and travel capabilities. All tests were administered by BUC, or by a graduate student trained and supervised by him.

Screening Session
This session included administration of (a) a detailed clinical interview that gathered information about biographical, medical, and psychiatric history, (b) the GDS, (c) the MMSE, and (d) the Cambridge Examination for Mental Disorders of the Elderly-Revised edition (CAMCOG-R; Huppert et al., 1995). The latter was developed as a cognitive screening measure for the early diagnosis of dementia in the elderly (Leeds et al., 2001). It consists of 67 items and measures cognitive performance within eight domains (orientation, language, memory, attention, praxis, calculation, abstract thinking, and perception). We used a version adapted for use with South African samples (James et al., 2014).
At the conclusion of the session, participants were invited back for repeated administration of a 10-test cognitive battery (see Supplemental Material for the full list of tests). Below, we describe the five tests for which data are reported.

Test Sessions
The order of test administration was varied for each session to prevent order effects (see Supplemental Material for the different test orders). All three test sessions were otherwise identical to one another. Each lasted ∼2 h.

Reaction Time Tasks
These tasks are part of the Cambridge Neuropsychological Test Automated Battery (CANTAB; Fray et al., 1996). All CANTAB tests are administered on a touch-screen computer. On the simple reaction time (SRT) task, a yellow dot appears inside a circle placed at the center of the computer screen. Participants are required to release a press pad and touch the dot as quickly as possible after its onset. On the choice reaction time (CRT) task, the yellow dot appears inside one of five circles located on the screen. Both tests include a 10-trial practice phase that precedes the test phase. Participants are required to obtain 90% accuracy on the practice trials before proceeding to the test phase. Those who fail to achieve this criterion are presented with a second practice phase. Thereafter, they proceed to the test phase regardless. The test phase for both the SRT and the CRT tasks consists of 30 trials. A single block of trials (e.g., 30 SRT trials) takes ∼5 min to complete.
We administered two SRT blocks and two CRT blocks in each session. Hence, after three test sessions and six blocks of administration we had collected data from 180 trials of SRT performance and 180 trials of CRT performance for each participant.

Accuracy-based Tasks
We used two subtests from the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS), a short screening battery for identifying and characterizing dementia in the elderly (Randolph et al., 1998). These subtests measure immediate and delayed episodic memory, a prominent domain of dysfunction in the cognitive profile of AD (Traykov et al., 2007). There are four parallel forms for each of the RBANS memory subtests, making them appropriate for repeat assessments and allowing for the tracking of cognitive decline within neurodegenerative processes (Randolph et al., 1998).
On the RBANS List Learning subtest, the participant is read a list of 10 words, and is instructed immediately thereafter to recall as many as possible. This process is repeated four times. After a 25-35 min delay, the participant is asked to recall the list, and immediately thereafter is administered a recognition task [i.e., to identify which words from a group of 20 (10 targets and 10 foils) were present on the original list]. On the RBANS Story Memory subtest, the participant is read a brief story, and is instructed immediately thereafter to recall as many elements of the story as possible. This process is repeated twice. After a 25-35 min delay, the participant is asked to recall the story.
Although we administered a different form of the List Learning and Story Memory subtests at each test session, the order of administration was the same for each participant (i.e., all participants received List A and Story A at the first test session, List B and Story B at the second session, and so on).

Data Preparation

RT Tasks: Filtering data
We examined the RT data for outliers because unusually fast or slow responses may reflect spurious performance (e.g., temporary distraction, interruption, or fast guesses). Following convention (see, e.g., Hultsch et al., 2000; Bielak et al., 2010; Garrett et al., 2012), we removed scores that were either (a) below a lower limit of 150 ms for authentic responses, or (b) above an upper limit of 3 SD above the group RT mean for each block of testing. Missing data were then imputed for the outlier trials using a regression-based multiple imputation method (Lachaud and Renaud, 2011). This method of filtering the data is thought to offer conservative estimates of performance variability (e.g., Hultsch et al., 2002).
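The filtering rule can be sketched as follows (an illustrative Python fragment, not the study's actual pipeline: in the study, the upper limit was based on the group mean for each block, whereas here it is computed over the supplied trials, and the regression-based multiple imputation step is replaced by simply marking flagged trials as missing):

```python
import numpy as np

def filter_rt_block(rts, lower_ms=150.0, n_sd=3.0):
    """Flag spurious reaction times within one block of trials.

    A trial is flagged if it is faster than `lower_ms` (likely a fast
    guess) or slower than `n_sd` standard deviations above the block
    mean (likely a lapse or interruption). Flagged trials are set to
    NaN so they can be imputed in a later step.
    """
    rts = np.asarray(rts, dtype=float)
    upper = rts.mean() + n_sd * rts.std(ddof=1)
    out = rts.copy()
    out[(rts < lower_ms) | (rts > upper)] = np.nan  # mark outliers as missing
    return out

# Hypothetical block: 20 plausible trials plus one fast guess and one lapse.
trials = [430, 445, 460, 455, 470, 440, 450, 465, 435, 448,
          452, 461, 444, 457, 449, 466, 438, 453, 447, 459, 95, 2100]
clean = filter_rt_block(trials)  # the 95 ms and 2100 ms trials become NaN
```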

RBANS and CAMCOG-R tasks: deriving variables
We derived three scores from the RBANS subtests. The List Learning score is the sum of the number of words recalled correctly across the four learning trials (range 0-40). The List Recognition score is the total number of correctly identified words on the recognition trial (range 0-20). The Story Memory score is the sum of the number of items recalled correctly across the two learning trials (range 0-24).
We derived two scores from the CAMCOG-R: the Total Score (assessing general cognitive function; range = 0-105), and the recent memory and learning subscale composite score (assessing episodic memory function; range = 0-21). We chose to use the latter because (a) episodic memory dysfunction is a key feature of AD (Peña-Casanova et al., 2012), and (b) the composite score is relatively resistant to the influence of education (James et al., 2014). This is an important consideration given that almost half of the variance in CAMCOG-R scores is accounted for by the effects of age and education (Pereiro et al., 2015).

Extracting intraindividual variability
Computing IIV scores first requires removing systematic effects in the data that are explained by mean performance. Specifically, although one might simply calculate the intraindividual standard deviation (iSD) from raw scores, raw SDs may carry systematic effects associated with mean RT, because slower mean RTs are strongly associated with higher SDs, and vice versa (Hale et al., 1988; Hultsch et al., 2008). Therefore, before computing iSDs it is important to partial out any factors (e.g., group and time-on-task effects such as practice and fatigue) that may influence mean RT performance.
To determine which factors significantly influenced the means of the RT- and accuracy-based variables, and thus to extract iSDs, we ran a random intercept model on the sample data for each of the SRT, CRT, List Learning, List Recognition, and Story Memory variables. We then added two sets of main effects: the first [featuring test order, blocks, and trials (or sessions for the accuracy-based tasks)] to evaluate the impact of time-on-task effects, and the second (featuring group status, sex, monthly household income, age, and level of education) to evaluate the impact of group effects.

Inferential Statistical Analyses
We conducted all inferential analyses using SPSS (version 24), with α set at 0.05.
The first part of the analysis involved analyzing between-group differences in demographic, cognitive, and affective variables. We used independent-samples t-tests for parametric data, chi-squared tests of contingency for categorical data, and Mann-Whitney U tests for non-parametric data; when the assumption of homogeneity of variance was not upheld, we used independent-samples t-tests with bootstrapping (1,000 replicates, bias-corrected and accelerated [BCa] confidence intervals). To estimate effect sizes, we used Cohen's d, phi (φ), and r for t-tests, chi-squared tests, and Mann-Whitney U tests, respectively. We interpreted these effect sizes following Cohen's (1988) guidelines: for Cohen's d, 0.2 = small, 0.5 = moderate, 0.8 = large; for φ and r, 0.1 = small, 0.3 = moderate, 0.5 = large.
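For reference, these effect-size computations follow standard formulas; the sketch below is an illustrative Python fragment (the study itself used SPSS):

```python
import numpy as np
from math import sqrt

def cohens_d(x, y):
    """Cohen's d: standardized mean difference with pooled SD."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_sd = sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

def phi_coefficient(chi2, n):
    """Phi for a 2x2 chi-squared test: sqrt(chi2 / N)."""
    return sqrt(chi2 / n)

def r_from_z(z, n):
    """Effect size r for a Mann-Whitney U test, from its z approximation."""
    return z / sqrt(n)

d = cohens_d([1, 2, 3, 4], [3, 4, 5, 6])  # negative: first group scores lower
```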
The second part of the analysis set the stage for subsequent regression modeling by examining bivariate associations (using Pearson's r correlation coefficient) between each candidate predictor (i.e., the iSD for each of the SRT, CRT, List Learning, List Recognition, Story Memory variables, and the mean for each of those variables) and each cognitive outcome variable (i.e., CAMCOG-R Total Score and CAMCOG-R Memory Composite).
The final part of the analysis involved creation of a series of sequential multiple regression models that sought to determine the relative contribution of RT-and accuracy-based IIV measures to the prediction of (a) overall cognitive functioning, and (b) episodic memory functioning.

Sample Characteristics
The groups were well matched in terms of age, sex distribution, monthly household income, and current depressive symptomatology, but there were significant between-group differences in terms of education, with participants in the control group having completed more years of formal schooling (see Table 1). As expected, the analyses also detected significant between-group differences (associated with large effect sizes) on the two CAMCOG-R outcome measures, with the control group scoring better in each case.

Extraction of iSDs
We followed the extraction approach described by Hultsch et al. (2008). Random intercept models identified the following fixed effects that contributed significantly to mean performance on each of the candidate predictors: for SRT, blocks, group status, and sex significantly predicted trial-to-trial performance; for CRT, test order, blocks, group status, and sex significantly predicted trial-to-trial performance; for List Learning, group status significantly predicted session-to-session performance; for List Recognition, group status and session significantly predicted session-to-session recognition performance; and for Story Memory, task session, group status, sex, and education level significantly predicted session-to-session scores. (See Supplemental Material for the full set of results.) Next, we entered, for each candidate predictor, the significant fixed effects and all their higher-order interactions into a random coefficient model with random slopes on trials (or sessions for the accuracy-based variables) in order to partial out time-on-task and group effects. Finally, we captured the residuals, converted them to T-scores, and calculated the SD across the T-score values to compute the iSDs.
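This residual-based extraction can be sketched as follows (an illustrative Python fragment using statsmodels rather than SPSS; the column names and synthetic data are hypothetical, and the model is simplified to a single group fixed effect with a random intercept, whereas the study entered all significant fixed effects, their interactions, and random slopes on trials):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def extract_isd(df):
    """Return one iSD per participant from trial-level scores.

    (1) Fit a random-intercept model partialling out systematic effects
        (here only group status); (2) take the residuals; (3) convert
        them to T-scores (M = 50, SD = 10); (4) compute each
        participant's SD across the T-score values.
    """
    fit = smf.mixedlm("rt ~ group", df, groups=df["pid"]).fit()
    resid = np.asarray(fit.resid)
    t = 50 + 10 * (resid - resid.mean()) / resid.std(ddof=1)
    return df.assign(t=t).groupby("pid")["t"].std(ddof=1)

# Synthetic demonstration data (hypothetical): 10 participants x 30 trials.
rng = np.random.default_rng(0)
pid = np.repeat(np.arange(10), 30)
group = (pid >= 5).astype(int)            # 0 = control, 1 = patient
rt = 400 + 60 * group + rng.normal(0, 40, pid.size)
data = pd.DataFrame({"pid": pid, "group": group, "rt": rt})
isd = extract_isd(data)                   # one iSD per participant
```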

Between-Group Differences: Predictor Variables
On both RT-based measures, iSD scores for controls were, on average, significantly lower than those for patients. In contrast, the same significant between-group difference was only present for one of the accuracy-based measures (List Recognition; see Table 2).
Regarding mean-level performance variables, control participants achieved significantly faster reaction times on the CANTAB tasks, and performed significantly more accurately on the RBANS subtests, than patients (see Table 2).

Regression Modeling

Bivariate associations between predictor and outcome variables
Among iSD scores, those for SRT, CRT, and List Recognition showed significant moderate-to-large negative associations with both outcome variables. Among mean scores, each predictor variable was significantly associated, with moderate-to-large magnitude and in the expected direction, with each outcome variable (see Table 3). Based on this set of findings, we excluded List Learning and Story Memory iSD scores from subsequent analyses.

Sequential regression models
To determine which demographic factors were significant predictors of overall cognitive functioning and episodic memory functioning, we regressed each of the two CAMCOG-R outcome variables on a set of demographic variables (age, sex, education, income), after controlling for group status. For CAMCOG-R Total Score, significant predictors were group (β = −0.72, t = −8.38, p < 0.001), sex (β = −0.27, t = −3.13, p < 0.01), and education (β = 0.27, t = 2.76, p < 0.01). For the CAMCOG-R Memory Composite, significant predictors were group (β = −0.80, t = −9.10, p < 0.001) and sex (β = −0.24, t = −2.77, p < 0.01). Then, we created a set of models that described how trial-to-trial variability on RT tasks, relative to session-to-session variability on accuracy-based tasks, predicted (a) CAMCOG-R Total Score, and (b) CAMCOG-R Memory Composite score. For each model, we entered the significant demographic factors identified above at the first step, iSD RT- and accuracy-based predictors at the second, and mean-based predictors at the third (see Table 4). The purpose of taking this third modeling step was to determine whether the significant iSD predictors identified at Step 2 would continue to make a unique contribution toward prediction of the outcome variable after controlling for (a) means of the iSD predictors entered at Step 2, and (b) means of the List Learning and Story Memory scores (entered because they are widely used mean-level predictors of episodic memory performance in the clinical setting).
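The three-step sequential structure can be sketched as follows (an illustrative Python fragment; the study used SPSS, and the column names and synthetic data here are hypothetical stand-ins for the demographic, iSD, and mean-level predictor blocks):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def sequential_r2(df, outcome, steps):
    """Fit nested OLS models, adding one block of predictors per step;
    return (R-squared, R-squared change) for each step."""
    r2s, preds = [], []
    for block in steps:
        preds += block
        fit = smf.ols(f"{outcome} ~ {' + '.join(preds)}", data=df).fit()
        r2s.append(fit.rsquared)
    deltas = [r2s[0]] + [b - a for a, b in zip(r2s, r2s[1:])]
    return list(zip(r2s, deltas))

# Synthetic demonstration (hypothetical columns mirroring the three steps:
# demographics, then iSD predictors, then mean-level predictors).
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "group": rng.integers(0, 2, n),
    "srt_isd": rng.normal(10, 2, n),
    "srt_mean": rng.normal(450, 40, n),
})
df["camcog"] = 90 - 8 * df["group"] - 1.5 * df["srt_isd"] + rng.normal(0, 5, n)
steps = [["group"], ["srt_isd"], ["srt_mean"]]
results = sequential_r2(df, "camcog", steps)  # R-squared grows across steps
```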
The most notable results at the second step were these: After controlling for demographic variables, iSDs for List Recognition and SRT contributed significantly to the prediction of CAMCOG-R Total Score. The most notable results at the third step, given the aims of the model, were these: SRT iSD scores continued to contribute significantly to the prediction of CAMCOG-R Total Score (Model 1) and CAMCOG-R Memory Composite score (Model 3), and CRT iSD scores continued to contribute significantly to the prediction of CAMCOG-R Memory Composite score (Model 4). Finally, to examine the influence of measurement timescale on the relative contribution of the different IIV measures to the prediction of cognitive performance, we created a set of models that described how session-to-session variability on RT- and accuracy-based tasks predicted (a) CAMCOG-R Total Score, and (b) CAMCOG-R Memory Composite score. For each model, we entered the same demographic factors as in the previous models at the first step, iSD RT- and accuracy-based predictors at the second, and mean-based predictors at the third (see Table 5).
The most notable result at the second step was that, for the RT data, the magnitude of variability decreased markedly from that observed in the trial-to-trial models: Although SRT iSD remained a significant predictor of CAMCOG-R Total Score at Step 2, its contribution was substantially attenuated. The most notable results at the third step were, again, that the predictive power of the RT iSD scores decreased markedly from that observed in the trial-to-trial models. Here, the only significant finding, given the aims of the model, was that SRT iSD scores continued to contribute significantly to the prediction of CAMCOG-R Total Score (Model 1).

DISCUSSION
This study provided a direct comparison of the relative sensitivity of reaction time- and accuracy-based estimates of intraindividual variability to cognitive compromise. We systematically replicated findings presented by Hultsch et al. (2000), showing that (a) RT-based measures of IIV differentiated a dementia group from a group of healthy older adults, (b) increasing the timescale of measurement (i.e., measuring on a session-to-session rather than a trial-to-trial basis) reduced the sensitivity of RT-based IIV, and (c) generally, RT-based IIV was a better predictor of cognitive status than accuracy-based IIV, even after adjusting for timescale of measurement. We extended previous findings by showing that accuracy-based IIV (a) could also differentiate patients with AD from healthy older adults, (b) correlated significantly with overall cognitive function and episodic memory performance in both patients and controls, and (c) was a significant predictor of episodic memory performance, even after controlling for sex and group status (AD patient vs. control).
Of the accuracy-based IIV measures that formed part of our investigation, only RBANS List Recognition was sensitive to between-group differences, correlated with CAMCOG-R Total Score and Memory Composite score, and predicted performance on the CAMCOG-R Memory Composite variable after controlling for sex and group status. Although Hultsch et al. (2000) also used measures of recognition memory to derive accuracy-based IIV, they found them to have no significant value in distinguishing dementia patients from controls. We argue that this cross-study difference is attributable to sample characteristics: Whereas we used a homogeneous group of patients with AD, Hultsch and colleagues used a heterogeneous clinical group [i.e., some of their patients had been diagnosed with vascular dementia (VaD) and others with AD]. Patients with VaD perform significantly better than those with AD on recognition memory tasks (Tierney et al., 2001; Román et al., 2002). Hence, including both VaD and AD patients in a single clinical group is likely to diminish the sensitivity of a recognition-based measure to neurological compromise.
We suggest, therefore, that accuracy-based IIV measures are useful in detecting neurocognitive impairment, but that there must be a careful match between the type of task from which the IIV measure is derived and the purportedly compromised cognitive domain. In other words, accuracy-based IIV measures have less utility when they are considered as indicators of diffuse cognitive or neurological dysfunction: They are best used as indicators of a specific type of cognitive impairment linked to a specifically damaged neuroanatomical site or system. Murphy et al. (2007) demonstrated this point empirically. They administered parallel forms of a list-learning task eight times over 4 days to young (M = 23.4 years) and older (M = 73.3 years) adults. The groups were differentiated by accuracy-based IIV scores derived from tasks assessing frontal lobe function (e.g., false memory tests), but not by those derived from tasks assessing medial temporal lobe (MTL) function (e.g., learning, delayed recall). The authors proposed that age-related changes in the integrity of the frontal lobes (changes not typically present in the MTL) explained this finding. These specific structural changes made it much more likely that there would be increased variability in the performance of the older adults relative to their younger counterparts on the frontal tasks, but not on the MTL tasks. A further, more modest demonstration of this regional-specificity consideration is that, among the patient group in the present study, the largest magnitude of variability we observed was on the List Recognition task (see Table 2).
Whereas List Recognition iSD scores differentiated between patients and controls, and were significantly associated with scores on the outcome measures, no such relationships were observed for the List Learning and Story Memory iSDs. Given that performance on all three tests requires participation from neural networks that are centered on the MTL and that are compromised by AD pathology (Traykov et al., 2007; Peña-Casanova et al., 2012), this result is unexpected: In this context, IIV on the three tasks should have been similar.
One possible reason for this unexpected result relates to the differing nature of the processing demands made by the List Learning, Story Memory, and List Recognition tasks. Although all three tasks require the participant to retrieve previously encoded information, the former two make heavier demands on cognitive resources because they are free recall, and not aided-recall, tasks. In other words, they present no cues to assist retrieval of the learned information, and therefore require more self-generated strategic processing (Moscovitch and Winocur, 1992; Dickerson et al., 2007). Tasks with greater strategic processing demands typically produce higher degrees of score variability, particularly when performance is measured across several learning trials (Allaire and Marsiske, 2005), as was the case with both List Learning and Story Memory. Hence, performance on those subtests may be more vulnerable than List Recognition performance to adaptive variability (Li et al., 2004). Because the presence of adaptive variability tends to increase IIV, a confound within the current design is that one of the three memory tasks we used to measure IIV featured lower strategic processing demands than the other two.
Nonetheless, there is clinical value in the finding that an accuracy-based measure of inconsistency can significantly predict episodic memory performance. In the clinic, accuracy-based assessment is far more prevalent than latency-based assessment, and the List Recognition task we used here is a standard element of many clinical neuropsychological test batteries. Of note, however, is that a current trend in IIV studies that use accuracy-based measures is to move away from operationalizing variability as inconsistency across time and toward dispersion across tasks (within and across cognitive domains) or across items within a test of global cognition (e.g., CAMCOG-R). Findings from IIV studies using this latter operationalization indicate successful prediction of cognitive decline and clinical dementia status above and beyond mean-level performance (Tractenberg and Pietrzak, 2011; Kälin et al., 2014). Accuracy-based measures of dispersion may be more practical than RT measures of IIV for clinicians, as the tests from which they are derived are already used frequently within standard neuropsychological test batteries, and they avoid the need for multiple trials of administration (Kälin et al., 2014).
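To make the inconsistency-vs.-dispersion distinction concrete: dispersion quantifies the spread of one person's standardized scores across different subtests at a single testing occasion, rather than across repeated administrations of the same task. The following minimal sketch uses hypothetical z-scores (not data from the present study):

```python
import statistics

def dispersion(z_scores):
    """Dispersion: the SD of one person's standardized subtest
    scores obtained at a single testing occasion."""
    return statistics.stdev(z_scores)

# Hypothetical z-scores on five subtests for two individuals who have
# the same mean level of performance but very different profiles.
flat_profile = [0.1, 0.0, -0.1, 0.1, -0.1]
uneven_profile = [1.2, -0.9, 0.3, -1.1, 0.5]

d_flat = dispersion(flat_profile)      # low dispersion
d_uneven = dispersion(uneven_profile)  # high dispersion
```

Both hypothetical profiles average to the same mean-level score, yet their dispersions differ markedly, which illustrates how dispersion can carry predictive information above and beyond mean-level performance.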

LIMITATIONS
Two limitations of the study's RT-based measures might have reduced their sensitivity to impairment on the cognitive outcomes we sought to measure. The first involves how engaging the RT tasks were for participants. We observed that, for healthy controls, bivariate associations between (a) RT means and CAMCOG-R Total Scores, and (b) variability scores and CAMCOG-R Total Scores were in the opposite direction from what might have been expected (see Supplemental Material). That is, participants with higher CAMCOG-R Total Scores showed slower and more variable performance on both the SRT and the CRT tasks. One explanation is that the repetitive nature of the serial assessments, combined with the relative ease of the RT tasks, may have resulted in a lack of engagement among higher-functioning individuals. This speculation is consistent with research indicating that a lack of task engagement (e.g., due to boredom) during prolonged repetitive tasks may worsen mean RT performance and increase RT variability (Pan et al., 1994; Langner et al., 2010; Garrett et al., 2012; Wang et al., 2014).
A second, and related, limitation involves the regional specificity of the RT measures. Performance on the kinds of SRT and CRT tasks used here activates a complex combination of cognitive control processes (including visual encoding, motor preparation, response selection, and execution), with common neural substrates located largely in the frontal lobes (Lo and Andrews, 2015). As noted above, accuracy-based IIV measures are most useful when the task from which they are derived taps into functioning of the area purportedly compromised in the samples under scrutiny. Such regional specificity considerations may also apply to RT-based IIV measures (MacDonald et al., 2008). Following this line of argument, an RT measure better suited to the purposes of the current study may have been one derived from tasks sensitive to episodic memory function [e.g., the recognition latencies from the list and story tasks used by Hultsch et al. (2000)].
Hence, future research in the field might consider adapting RT tests to make them more engaging, and to ensure that they meet considerations related to regional specificity. Using latency scores from tasks that are typically used to produce accuracy-based outcomes may, in fact, also improve task engagement because participants typically find such tasks more challenging than basic RT tasks (Allaire and Marsiske, 1999).

SUMMARY AND CONCLUSION
We set out to systematically replicate and extend important previous findings regarding the use of intraindividual variability measures in the detection of neurodegenerative disease (Hultsch et al., 2000). Our replication was successful: Results indicated that RT-based IIV measures are better predictors of cognitive compromise than accuracy-based IIV measures, even after adjusting for timescale of measurement. Our extension was also successful: Results indicated that, by using a homogeneous clinical sample (i.e., early-to-mid-stage Alzheimer's disease patients) and measuring overall cognitive function as well as performance within a targeted cognitive domain, accuracy-based IIV measures may be useful indicators of underlying pathology.
The present study therefore contributes toward understanding the relative utility of RT- and accuracy-based IIV measures in detecting neurocognitive impairment in older adults, and also responds to the AARR call for empirical evaluation of sensitive markers of cognitive change in patients with AD.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of the University of Cape Town Research Ethics Code for Research Involving Human Participants with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Research Ethics Committees of the University of Cape Town's Department of Psychology and the Faculty of Health Sciences.

AUTHOR CONTRIBUTIONS
BC: contributed to the conception and design of the study, participant recruitment and acquisition of data, data analysis and interpretation, drafting of the manuscript, and critical revisions of the manuscript for important intellectual content. KT: contributed to the conception and design of the study and critical revisions of the manuscript for important intellectual content. MC: contributed to participant recruitment and acquisition of data. All three authors (BC, KT, and MC) approved the final version of the manuscript to be submitted for publication.