Examining Test-Retest Reliability and Reliable Change for Cognition Endpoints for the CENTER-TBI Neuropsychological Test Battery

Objective: Seven candidate cognition composite scores have been developed and evaluated as part of a research program designed to validate a cognition endpoint for traumatic brain injury (TBI) research and clinical trials, but these composites have yet to be examined longitudinally. This study examined test-retest reliability and methods for determining reliable change for these seven candidate composite scores, using the neuropsychological test battery from the Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury (CENTER-TBI). Methods: Participants (18–59 years-old) with mild TBI (n = 124), orthopedic trauma without head injury (n = 67), and healthy community controls (n = 63) from the Trondheim MTBI follow-up study completed the CENTER-TBI neuropsychological test battery at 2 weeks and 3 months after injury. The battery included both traditional paper-and-pencil tests and computerized tests from the Cambridge Neuropsychological Test Automated Battery (CANTAB). Seven composite scores were calculated for the paper-and-pencil tests, the CANTAB tests, and all tests combined (i.e., 21 composites in total on each assessment): the overall test battery mean (OTBM); global deficit score (GDS); neuropsychological deficit score-weighted (NDS-W); low score composite (LSC); and the number of scores ≤5th percentile, ≤16th percentile, or <50th percentile. The OTBM was calculated by averaging T scores for all tests. The other composite scores were deficit-based scores, assigning different weights to low scores. Results: All composites revealed better cognitive performance at the 3-month assessment compared to the 2-week assessment and the magnitude of improvement was similar across groups. Differences, in terms of effect sizes, were largest on the OTBMs. 
In the combined composites, the test-retest correlation was highest for the OTBM (Spearman's rho = 0.87, in the community control group) and lowest for the number of scores ≤5th percentile (rho = 0.41). Conclusion: The high test-retest reliability of the OTBM appears to favor its use in TBI research; however, future studies are needed to examine these candidate composite scores in participants with more severe TBIs and cognitive deficits and the association of the composites with functional outcomes.


INTRODUCTION
Cognitive impairment is a core clinical feature of traumatic brain injury (TBI) (1). In the mildest of TBIs, it might resolve within hours or days (2); and in severe TBI, it can be permanent and disabling (3,4). Cognitive functioning is of considerable interest as an outcome measure in TBI research and clinical trials (5). Given that cognition is multifaceted and it is commonly measured with a variety of tests that index different cognitive domains, it would be useful to create a single composite score, or cognition endpoint, for TBI research and clinical trials (6). Our research team has recently examined seven candidate composite scores, derived from prior studies (7,8), using the neuropsychological test battery from the Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury (CENTER-TBI) (9). CENTER-TBI is a large-scale, multi-national, observational study that aspires to identify best practices, develop precision medicine, and improve outcomes for people with TBIs via comparative-effectiveness studies (10)(11)(12). In our prior study on the CENTER-TBI neuropsychological battery, data from the Trondheim MTBI follow-up study, in which the CENTER-TBI neuropsychological battery was administered, was used to calculate the composite scores separately for four traditional paper-and-pencil tests, five computerized neuropsychological tests, and a combined battery of all nine tests (9). Before determining which candidate composite score(s) might be most useful for clinical research in TBI, and with the CENTER-TBI battery in particular, it is important to examine these scores longitudinally for stability (in people without TBI) and sensitivity to change (in people with TBI). Test-retest reliability, in the present study, represents an estimate of the stability and consistency of neuropsychological test scores across two testing sessions. 
Test-retest reliability is influenced by the internal consistency of the test, measurement error related to time and situational variables (13), and normal variability in human cognition. Changes in cognitive scores from test to retest are related to several factors such as susceptibility to practice effects, measurement error, the test-retest interval between administrations, and regression to the mean. Moreover, person-specific factors can influence test to retest difference scores, such as initial level of performance, motivation, and effort. Some cognitive abilities can be measured precisely and reliably, such as a person's ability to read single words in his or her native and dominant language, whereas tests of other cognitive abilities, such as memory and executive functioning, usually have lower reliabilities (14) because people's test scores are more likely to be influenced by a variety of factors (e.g., practice effects, situational distractions, measurement imprecision, regression, and effort). A composite score with high test-retest reliability that is sensitive to changes in cognitive functioning coinciding with the natural history of TBI recovery is desirable in TBI research and clinical trials that use rate of change as the primary endpoint. The purpose of this descriptive study is to compare and contrast the test-retest reliabilities, and estimates of reliable change, for the seven candidate composite scores that have been recently applied to the CENTER-TBI battery (9) in patients with mild TBI, in trauma controls without head injury, and in healthy community controls.

Participants
The participants in the present study were part of the Trondheim Mild Traumatic Brain Injury (MTBI) Study (15). Patients with MTBI were recruited from April 2014 to December 2015. In the present study, patients were included if they were between ages 18 and 59 years and sustained a MTBI per the criteria described by the WHO Collaborating Center Task Force on Mild Traumatic Brain Injury: (a) mechanical energy to the head from external physical forces; (b) Glasgow Coma Scale (GCS) score of 13-15 at presentation to the emergency department; and (c) either witnessed loss of consciousness (LOC) <30 min, confusion, or post-traumatic amnesia (PTA) <24 h, or intracranial traumatic lesion not requiring surgery (16). Exclusion criteria were non-fluency in the Norwegian language; pre-existing severe neurological (e.g., stroke, multiple sclerosis), psychiatric, somatic, or substance use disorders, determined to be severe enough to likely interfere with follow-up; a prior history of a complicated mild, moderate, or severe TBI; or other concurrent major orthopedic trauma; or moderate/severe TBI. Recruitment took place at a level 1 trauma center in Trondheim, Norway, and at the municipal emergency clinic, an outpatient clinic run by general practitioners. LOC was categorized as present if witnessed. Duration of PTA, defined as the time after injury for which the patient had no continuous memory, was dichotomized to either <1 h or 1-24 h. Intracranial traumatic findings were obtained from magnetic resonance imaging (MRI), performed within 72 h, previously described in detail (17). Two control groups were recruited. One group consisted of patients with orthopedic injuries, free from trauma affecting the head, neck, or the dominant upper extremity (i.e., trauma controls). The trauma controls were recruited from the same emergency departments as the patients with MTBI. 
Fractures to the upper extremities (35.8%), lower extremities (25.4%), and soft tissue injuries to the lower extremities (26.9%) were the most common injuries among the trauma controls. Injuries commonly occurred during sports or recreational activities (38.8%) and 25.4% had an injury requiring surgery. The other group consisted of healthy community controls, not receiving treatment for severe psychiatric disorder (e.g., bipolar or psychotic disorder). The community controls were recruited among hospital and university staff, students, and acquaintances of staff, students and patients. The study was approved by the regional committee for research ethics (REK 2013/754) and was conducted in accordance with the Helsinki declaration. All participants gave informed consent.

Neuropsychological Assessment
Participants with MTBI underwent neuropsychological testing ∼2 weeks (M = 16.6 days, SD = 3.2 days) and 3 months (M = 95.0 days, SD = 6.6 days) after the injury. The trauma controls were also evaluated 2 weeks (M = 17.1 days, SD = 3.5 days) and 3 months (M = 95.3 days, SD = 10.5 days) after injury. The community controls were tested ∼3 months apart (M = 95.1 days, SD = 11.6 days). The tests were administered by research staff with at least a Bachelor's degree in clinical psychology or neuroscience who were supervised by a licensed clinical psychologist. The testing involved a larger battery, but only the tests included in the CENTER-TBI neuropsychological battery were analyzed in the current study. To calculate the composite scores (described below), normative data were required for each included outcome to convert raw scores into age-referenced T scores.
The traditional paper-and-pencil tests included in the CENTER-TBI battery are the Trail Making Test (TMT) Parts A and B and the Rey Auditory Verbal Learning Test (RAVLT). In TMT Part A (18), participants connect numbered circles in order as fast as possible, and in TMT Part B, participants alternate between numbered and lettered circles, switching between connecting them in numerical and alphabetical order as fast as possible. For both TMT Parts A and B, the outcome measure was time-to-completion, with normative data from Mitrushina et al. (19) used to calculate age-referenced T scores. The RAVLT (18) involves participants listening to and recalling a list of 15 words over five trials, and then recalling these words again following the introduction of a distractor list and after a 20-min delay. The RAVLT outcome measures included in composite calculation were the total number of words recalled across the five learning trials and the total number of words recalled following the 20-min delay. The 2-week and the 3-month assessments involved different word lists for the RAVLT. Age-referenced T scores were calculated based on normative data from Schmidt (20) published in Strauss et al. (18).
The CENTER-TBI battery includes six tablet-administered tasks from the Cambridge Neuropsychological Test Automated Battery (CANTAB): Attention Switching Task (AST), Paired Associates Learning (PAL), Rapid Visual Processing (RVP), Spatial Working Memory (SWM), Reaction Time Index (RTI), and Stockings of Cambridge (SOC). The CANTAB software generates age-referenced T scores for all of these tasks except for the AST (21), which was therefore not included in the present study (i.e., in total five CANTAB tests were included). Each CANTAB task generates multiple outcome measures. We selected one outcome measure for each task to be included in the composite score calculation based on the CANTAB's "Recommended Measures Report" (21). On the PAL visual memory task, participants briefly observe a series of boxes that contain different patterns. The patterns are then hidden, and the participant must match a target pattern to the box that contains that pattern. "Total errors" (adjusted for the number of trials completed) was chosen as the outcome measure, with more errors indicative of worse performance. On the RVP processing speed task, individual numbers appear rapidly on the screen (i.e., 100 presentations per minute) and participants respond to target sequences of digits presented in a specific order (e.g., 2-4-6). The outcome measure chosen was "A prime," which measures discriminability between target and non-target sequences. A higher score is indicative of better performance. On the Spatial Working Memory (SWM) task, participants search through a series of boxes for a token. Once the token is found, a new token is hidden in one of the remaining boxes. A token is never hidden in the same box twice, and participants must remember where tokens were previously presented in order to avoid errors. "Between errors" was chosen as the outcome, which is the number of times a participant revisits a box in which a token was previously found. 
More errors are indicative of worse performance. On the Reaction Time Index (RTI) task, the participant responds as quickly as possible when a yellow dot appears in one of five white circles, with response time in milliseconds chosen as the outcome measure (shorter response time equals better performance). On the SOC executive function task, two displays with three balls presented inside stockings are presented. Participants move the balls in one display to produce an identical arrangement to the other display. The outcome measure was the number of problems solved with the minimum possible moves, with a higher score indicative of better performance.

Composite Scores
Seven different composite scores, previously described in detail (7,8), were calculated for the present study. Each composite score was calculated for the traditional paper-and-pencil tests only, the CANTAB tests only, and all tests (i.e., a combined composite). All raw scores were converted to age-referenced T scores (M = 50, SD = 10, in the normative sample), with higher scores indicative of better performance, before the composites were calculated.
The Overall Test Battery Mean (OTBM) was calculated by averaging T scores for all tests (22,23). Lower scores indicate worse performance.
The Low Score Composite (LSC) is a new composite that has been calculated in previous cognition endpoint research only (7)(8)(9). T scores of 50 or higher are assigned a weight of 50, and T scores below 50 are assigned a weight that equals the T score (i.e., a T score of 40 would equal a weight of 40). The mean weight is then calculated for each battery. Lower scores indicate worse performance. This new composite score provides an even greater increase in gradation than the NDS-W.
The number of scores at or below the 5th percentile (#≤5th %tile) is calculated by assigning the value 1 to scores at or below the 5th percentile (T = 34) and a zero to scores above the 5th percentile. These values are then summed for each participant. Higher scores indicate worse performance. This score has been used in research calculating multivariate base rates for a range of neuropsychological test batteries (26)(27)(28)(29)(30)(31)(32)(33).
The number of scores at or below the 16th percentile (#≤16th %tile) is calculated by assigning the value 1 to scores at or below the 16th percentile (T = 40) and a zero to scores above the 16th percentile. These values are then summed for each participant. Higher scores indicate worse performance. This score has also been calculated in previous multivariate base rate research (26)(27)(28)(29)(30)(31)(32)(33).
The number of scores below the 50th percentile (#<50th %tile) is a new composite score, inspired by research on multivariate base rates, and previously calculated in cognition endpoint research only (7)(8)(9). It is calculated by assigning the value 1 to scores below the 50th percentile (T score 49) and a zero to scores at or above the 50th percentile. These values are then summed for each participant. Higher scores indicate worse performance.
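The scoring rules stated above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scoring code: the function name `composite_scores`, the dictionary keys, and the example T scores are hypothetical, and the GDS and NDS-W are left out because their weighting rules are not given in this section.

```python
from statistics import mean


def composite_scores(t_scores):
    """Compute five of the composites from one participant's T scores.

    Assumes all T scores are already age-referenced and coded so that
    higher values mean better performance (M = 50, SD = 10).
    """
    return {
        # OTBM: plain average of all T scores (lower = worse).
        "otbm": mean(t_scores),
        # LSC: T scores of 50 or higher count as 50, lower ones keep
        # their value; the weights are then averaged (lower = worse).
        "lsc": mean(min(t, 50) for t in t_scores),
        # Counts of low scores (higher = worse), using the cutoffs in the text.
        "n_le_5th": sum(1 for t in t_scores if t <= 34),
        "n_le_16th": sum(1 for t in t_scores if t <= 40),
        "n_lt_50th": sum(1 for t in t_scores if t < 50),
    }


# Hypothetical T scores for a nine-test combined battery.
example = [55, 48, 33, 40, 61, 50, 37, 45, 52]
scores = composite_scores(example)
print(scores)
```

In this made-up profile, the two high scores (55 and 61) raise the OTBM but not the LSC, which illustrates why the LSC and the count-based composites respond only to the lower half of the score distribution.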

Statistical Analyses
Wilcoxon Signed-Rank Tests were used to evaluate differences in the composite scores between the assessments, with r reported as the effect size (i.e., the z-statistic associated with the Wilcoxon Signed-Rank Test divided by the square root of the sample size) (34,35). This effect size can be interpreted as: 0.1 = small, 0.3 = medium, 0.5 = large (36). Cohen's ds [the mean difference between the assessments divided by the pooled standard deviation from the two assessments (37)] are also reported, but should be interpreted with caution because most composite scores had non-normal distributions. A Cohen's d of 0.2 is considered small, 0.5 medium, and 0.8 large (36). The effect sizes are coded so that a positive effect size indicates better performance at the 3-month assessment. It is important to note that these effect size interpretation criteria are guidelines, and that whether an effect of a certain size is important or not depends on the context (e.g., in the present study, the effect sizes of different composites should be compared against each other, rather than against Cohen's benchmarks) (38). Spearman's rho was used to examine test-retest reliability for the composite scores between the 2-week assessment and the 3-month assessment. Because most composite scores were, by design, zero-inflated and non-normally distributed, reliable change was calculated from the natural distribution of the difference scores. First, the difference scores were calculated by subtracting the 2-week score from the 3-month score. The natural distributions were then examined to identify "uncommon" and "very uncommon" difference scores. Those correspond to improvements or declines in performance that are experienced by 20% or fewer or 10% or fewer of each sample (i.e., the 10th, 20th, 80th, and 90th percentiles of the distribution of difference scores). The percentiles were identified with the default HAVERAGE procedure in IBM SPSS Statistics v.25. 
Of note, when using the HAVERAGE method in contexts where the exact percentile of interest in the natural distribution does not exist (e.g., no score would correspond exactly to the 10th percentile in a sample of 63 participants, such as the community control group), then the score is interpolated from scores surrounding the percentile of interest.
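The quantities described in this section can be illustrated with a short Python sketch. It rests on stated assumptions rather than the authors' SPSS syntax: the function names are hypothetical, the pooled standard deviation is taken as the root mean square of the two assessment standard deviations, and `haverage_percentile` follows the usual description of HAVERAGE as a weighted average at position (n + 1) × p in the sorted data. The example data are invented.

```python
import math
from statistics import mean, stdev


def wilcoxon_r(z, n):
    """Effect size r: the Wilcoxon z-statistic divided by sqrt(sample size)."""
    return z / math.sqrt(n)


def cohens_d(first, second):
    """Mean difference divided by the pooled SD of the two assessments.

    Assumption: pooled SD = sqrt((SD1^2 + SD2^2) / 2). For composites where
    higher scores are worse, the sign would need to be flipped so that a
    positive value indicates better performance at the second assessment.
    """
    pooled_sd = math.sqrt((stdev(first) ** 2 + stdev(second) ** 2) / 2)
    return (mean(second) - mean(first)) / pooled_sd


def haverage_percentile(values, p):
    """Percentile as a weighted average at position (n + 1) * p, 0 < p < 1."""
    xs = sorted(values)
    n = len(xs)
    pos = (n + 1) * p
    i = math.floor(pos)      # 1-based index of the lower neighbor
    g = pos - i              # fractional part used as interpolation weight
    if i < 1:
        return xs[0]
    if i >= n:
        return xs[-1]
    return (1 - g) * xs[i - 1] + g * xs[i]


# Invented difference scores for 63 people: 57 unchanged, 6 changed by one.
diffs = [0] * 57 + [1] * 6
p90 = haverage_percentile(diffs, 0.90)
print(round(p90, 2))
```

With 63 observations the 90th percentile sits at position 57.6, between the 57th and 58th sorted values, so the interpolated cutoff in this invented sample is a fractional value that no individual participant could actually obtain on a count-based composite.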

Participant Characteristics
There were 140 adults with MTBIs who completed the 2-week assessment and 124 of them (88.6%) completed the 3-month assessment. The MTBI sample (n = 124; 27.4% women) had an average age of 33 years. There were no significant differences in age (p = 0.464), years of education (p = 0.710), or gender representation (p = 0.136) between groups. The most common cause of MTBI was a fall (n = 48, 38.7%), and the majority of the MTBI group were discharged (i.e., not admitted to any other department, such as the neurosurgery department) from the emergency clinics (n = 90, 72.6%). Traumatic intracranial findings were found in 16 (12.9%) of the patients with MTBI, 87 (70.2%) had PTA present for <1 h, and 37 (29.8%) had PTA for 1-24 h.

Descriptive Statistics for the Scores and Composite Scores
Scores from the individual tests that constitute the composite scores are reported in Table 1. For all groups, all test scores appeared higher on the 3-month assessment than on the 2-week assessment, with the exception of the RAVLT delayed trial. The descriptive characteristics of the composite scores at 2 weeks and 3 months are shown in Table 2 for the MTBI group and in Table 3 for the control groups. All of the combined battery composite scores, except the #≤5th %tile, identified significantly better cognitive performances at the 3-month assessment compared to the 2-week assessment in all groups (Tables 2, 3). Differences, in terms of effect sizes, were largest on the OTBM composites (for the combined battery: MTBI: r = 0.42; trauma controls: r = 0.43; community controls: r = 0.41). The percentage of participants who had at least 1, 2, or 3 low scores on the test batteries (i.e., base rates of low scores) are shown in Table 4. Having at least one score at or below the 5th percentile was common in all three groups and especially when both the paper-and-pencil tests and the CANTAB tests were included (i.e., the combined battery). For the 2-week assessment, the percentage of participants with one or more scores at or below the 5th percentile was 41.1% for the MTBI group, 32.8% for the trauma control group, and 39.7% for the community control group. In general, the base rates of low scores were lower on the 3-month assessment compared to the 2-week assessment, and when the paper-and-pencil and CANTAB batteries were examined separately.

Test-Retest Reliability and Reliable Change
The test-retest correlations were higher in the combined composites than in the paper-and-pencil and CANTAB composites (Table 5). The OTBM composite had the highest test-retest correlation (i.e., 0.87 for the combined composite in the community control group). The lowest test-retest correlations in the community control group were observed for the paper-and-pencil #≤5th %tile (0.14) and GDS (0.28).
Reliable changes on each composite from 2 weeks to 3 months based on the natural distribution of composite change scores are shown in Table 6. As an example, the cutoff value for improvement at the 90th percentile (i.e., being among the 10% with greatest improvement) for the combined OTBM composite in the community control group was +5.94, which means that if an individual's change on the OTBM from 2 weeks to 3 months exceeds +5.94, that individual would have shown greater improvement than 90% of the community control group. Notably, if the exact percentile of interest in the natural distribution of difference scores does not exist (e.g., no score corresponds exactly to the 90th percentile in a sample of 63 participants, such as the community control group), then the score is interpolated from the scores surrounding the percentile of interest (i.e., the default HAVERAGE procedure for calculating percentiles in SPSS). Consequently, some of the scores in Table 6 do not exist in the natural distribution of the difference score, and some are even theoretically impossible scores for an individual participant (e.g., 0.6, 10th percentile, #≤16th %tile paper-and-pencil composite, community control group).

DISCUSSION
The present study examined longitudinally seven candidate composite scores for the neuropsychological test battery used in CENTER-TBI. Mean normative scores for the individual tests were mostly in the normal range at 2 weeks in all groups with modestly higher scores at 3 months in all three groups (Table 1). There was some variability in mean normative scores, with RAVLT Trials 1-5 being lower and some CANTAB scores being higher (e.g., Spatial Working Memory). There were small statistically significant improvements on nearly all of the composites from 2 weeks to 3 months across groups (Tables 2, 3). Somewhat larger improvements were seen on the OTBM composite scores, suggesting this score may be more sensitive to change in cognitive performance. The OTBM composite more directly aggregates improvements across the individual test scores and this may explain why improvement was larger on this composite. However, with all groups improving in similar magnitude, change from 2 weeks to 3 months likely reflects a common cause across groups (i.e., practice), whereas if the MTBI group had improved to a greater degree, that change would have likely corresponded to cognitive recovery following injury. Thus, the OTBM might be the most sensitive composite for detecting practice effects, but not necessarily for detecting cognitive recovery. It is important to note that low test scores were common in this study across all groups. A considerable portion of the trauma control group (32.8%) and the community control group (39.7%) had at least one individual test score at or below the 5th percentile at the 2-week assessment, when all nine scores were considered. Further, the portion of the control groups that had 3+ (out of 9 total) scores at or below the 16th percentile was about 22% at the 2-week assessment and 11-13% at the 3-month assessment. 
This finding aligns with previous studies on multivariate base rates, that have consistently demonstrated that low scores occur commonly among cognitively healthy individuals (26)(27)(28)(29)(30)(31)(32)(33)39). As prior studies have noted, it is essential to consider the base rates of low scores in control/normative samples when interpreting low scores in clinical samples.
The OTBM composite had the highest test-retest reliability (Table 5). There is no generally accepted cutoff for what constitutes adequate test-retest reliability (14). Using the guidelines from Strauss et al. (18) for individual tests, a test-retest correlation <0.60 is low, 0.60-0.69 is marginal, 0.70-0.79 is adequate, and ≥0.80 is high. With these reference values, the OTBM composite had a high test-retest correlation (rho = 0.87, for the combined composite in the community control group) and the #<50th %tile composite had adequate reliability (rho = 0.76, for the combined composite in the community control group). The test-retest correlations for the other composites fell in the low or marginal category. A test-retest correlation of 0.87 is considerably higher than the test-retest correlations for the individual tests in the present study (Table 5) and for most of the tests in the CANTAB battery (40)(41)(42)(43). The test-retest reliability was similar for the paper-and-pencil composite (e.g., OTBM rho = 0.70 in the community control group) and the CANTAB composite (e.g., OTBM rho = 0.71 in the community control group) and these particular results suggest that the CANTAB tests are not inferior or superior to the traditional paper-and-pencil tests. It is notable, although expected, that test-retest reliability increases in association with greater numbers of tests included in the composite scores. Even if the reliability is inadequate for many of the individual tests, the reliability is adequate for the paper-and-pencil composite and the CANTAB composite and high for the combined composite. This favors the use of composite scores in research and clinical trials and can, to some extent, compensate for low test-retest reliability observed for many individual neuropsychological tests. 
Further, the cognitive domains most affected by MTBI show great variability between studies (44), suggesting between-patient variability in cognitive deficits (e.g., some patients present with mainly attentional deficits and others with memory deficits). Under these circumstances, a cognitive composite score that sums deficit scores might be better suited than individual tests for detecting cognitive deficits in MTBI research and clinical trials. Using deficit-based scores (i.e., all the composites except the OTBM) in longitudinal studies is complicated by practice effects, which are expected on neuropsychological tests (41,45). Even in the absence of cognitive recovery, fewer participants are expected to fulfill a criterion for defining a cognitive deficit (e.g., having two or more scores at or below the 5th percentile) on a second assessment because of practice effects. An individual who obtains a low score on the first assessment (e.g., ≤5th percentile) and benefited from a practice effect may still obtain a low score, but this score may now exceed the threshold that would be quantified as a low score (e.g., on retest, the score falls at the 9th percentile). Because normative data does not typically consider practice effects (i.e., the normative sample has not been exposed to repeated neuropsychological testing), this is an inherent problem with using deficit-based composite scores that are based on normative data. In comparison to deficit scores, the test-retest correlation for the OTBM composite is less sensitive to practice effects because cutoffs are not used when the OTBM is calculated.
The problems associated with interpreting change on the deficit-based composites are also seen when inspecting the cutoffs for reliable change presented in Table 6. The cutoffs for the OTBM composites are straightforward to interpret, but the cutoffs for the deficit-based composites are in many cases less meaningful. For example, on the paper-and-pencil #≤5th %tile composite in the community control group, for an individual participant to be among the 20% with greatest improvement (i.e., the 80th percentile), the change from the 2-week to 3-month assessments must be >1. However, a change >1 is also required for being among the 10% with greatest improvement (i.e., the 90th percentile). Similarly, looking at the other tail of the distribution, where individuals who decline in performance are found, both the 10th and the 20th percentile correspond to a change score of zero. A closer inspection of this composite in the community control group shows that, out of 63 participants, one had a change score of 2 (i.e., a decline, because a higher number indicates worse performance on the deficit-based composites), four had a change score of 1, 44 had a change score of zero, 11 had a change score of −1 (i.e., an improvement), and three had a change score of −2. Thus, the range of change scores on this composite is narrow (i.e., there are only five different scores) and many participants have the same score. Because many participants share the same change score, most change scores correspond not to one percentile, but to a range of percentiles. For example, the 44 participants with a change score of zero dominate the distribution of change scores (i.e., their ranks are from rank 6 to rank 49) and a change score of zero on this composite means that the participant is somewhere between the 9th and 77th percentile, indicating that this composite may lack sufficient sensitivity to detect change among most individuals.
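The rank arithmetic in this example can be reproduced with a few lines of Python. The counts are the ones reported above for the community control group; the helper `rank_range` and the variable names are hypothetical, and ranks are assumed to run from the highest change score (largest decline) to the lowest.

```python
# Change scores on the paper-and-pencil #<=5th %tile composite and the
# number of community controls with each score, as reported in the text.
counts = {2: 1, 1: 4, 0: 44, -1: 11, -2: 3}


def rank_range(counts, score):
    """First and last rank held by `score`, ranking from highest to lowest."""
    higher = sum(c for s, c in counts.items() if s > score)
    return higher + 1, higher + counts[score]


n = sum(counts.values())              # total number of participants
first, last = rank_range(counts, 0)   # ranks shared by a change score of zero
pct_low = 100 * first / n             # rank expressed as percent of sample
pct_high = 100 * last / n

print(n, first, last, int(pct_low), int(pct_high))
```

A single observed value (zero) therefore spans roughly two thirds of the distribution, which is why percentile cutoffs for this composite cannot distinguish between most participants.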
Previous research on multivariate base rates has shown that among healthy individuals, the likelihood of obtaining at least one low score increases with the number of tests included in the test battery (26)(27)(28)(29)(30)(31)(32)(33)39). This is also the finding in the present study, in that the base rates of low scores were higher when all tests were considered, compared to the separated composites for the paper-and-pencil and CANTAB batteries. Thus, evaluating change on deficit-based composites may be more suitable on large test batteries, where more variability in scores is likely. Further, on the deficit-based composites, there is a general trend of higher test-retest correlation with higher cutoffs: the #<50th %tile composite had a higher test-retest correlation than the #≤16th %tile composite, which had a higher test-retest correlation than the #≤5th %tile composite. Taken together, these findings indicate that on deficit-based composites, there might be a tradeoff between acceptable test-retest reliability, the number of tests in the neuropsychological battery, and the cutoff chosen to define a low score (i.e., scores at or below the 5 or 16th percentile).
This study is part of a research program with the aim of identifying a cognitive composite score suitable for TBI research and clinical trials. So far, the composites have been evaluated on healthy participants and patients with MTBI on the CENTER-TBI neuropsychological test battery, the Automated Neuropsychological Assessment Metrics (Version 4) Traumatic Brain Injury Military (ANAM4 TBI-MIL), and on the Delis-Kaplan Executive Function System (D-KEFS) (7-9). Because MTBI-related cognitive deficits often are subtle (44), diminish rapidly over time, and might only be present in a small subgroup, identifying a sensitive composite is important, but challenging, in this patient population. The present sample, consisting of patients at the milder end of the MTBI spectrum (i.e., the vast majority were non-hospitalized), had few, if any, cognitive deficits at their first assessment (9), and this constitutes a limitation of the present study as we cannot conclude with certainty which of the composites are most sensitive to change after TBI. Future studies should evaluate the composites in samples where cognitive deficits are likely, such as in the acute phase after MTBI, or in patients with moderate-to-severe TBI. In such samples, it is possible that the deficit-based composites would have better capacity to detect change. Further, the zero-inflation of the composite score distributions led to non-normality, which limited the statistical methods available to calculate reliable change. Several methods for calculating reliable change exist (46), but because of the non-normal data, only the natural distribution of change scores was used in the present study. For example, when calculating the reliable change index (RCI), the standard deviations from the two assessments are used to calculate the standard error of measurement (47), and in a zero-inflated distribution, the standard deviation is a poor and biased measure of dispersion. 
The natural distributions were examined to identify "uncommon" and "very uncommon" difference scores. Those correspond to improvements or declines in performance that are experienced by 20% or fewer or 10% or fewer of each sample (i.e., the 10th, 20th, 80th, and 90th percentiles of the distribution of difference scores). If the exact percentile asked for in the natural distribution of difference scores does not exist (e.g., no score corresponds exactly to the 10th percentile in a sample of 63 participants, such as the community control group), then the score is interpolated from the scores surrounding the percentile asked for. The + and − signs indicate that a change needs to be greater or less than the specified amount, respectively, to be considered reliable change. When the cutoff score for change is 0, a deviation from 0 indicates a reliable change (e.g., having one more or one fewer low scores, at or below the 5th percentile, on retesting, is considered to be a reliable change for the CANTAB battery).
Frontiers in Neurology | www.frontiersin.org
Improvements were largest on the OTBM composite and this composite had the highest test-retest reliability. Although these findings appear to favor the use of the OTBM in research and clinical trials that analyze change trajectories, more research is necessary to replicate these findings in different test batteries and in more severely injured samples with varying degrees of cognitive impairment, and to assess the association between these composites and functional outcomes in people with moderate or severe TBIs. For clinical trials that compare two or more groups at a single time point (e.g., post-treatment), the OTBM may be less advantageous. Lastly, these candidate composite scores reflect a developing body of research on the best methods to summarize cognitive test data for use in research and clinical trials, but they are not exhaustive and other composites might ultimately prove to be preferred.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the regional committee for research ethics (REK 2013/754). The patients/participants provided their written informed consent to participate in this study.