Utility of Huntington's Disease Assessments by Disease Stage: Floor/Ceiling Effects

Introduction: An understanding of the clinimetric properties of clinical assessments, including their constraints, is critical to sound clinical study and trial design. Utilizing data from Enroll-HD—a global, prospective HD observational study and clinical research platform—we examined several well-established HD clinical assessments across all stages of disease for evidence of instrument constraints, specifically floor/ceiling effects, to inform selection of appropriate instruments for use in future studies/trials and identify gaps in instrument utility over the life-course of the disease. Material and Methods: Analyzing publicly available data from 6,614 HD gene-expansion carriers (HDGECs), we grouped participants into deciles based on baseline CAP score, which ranged from 26 to 229. We used descriptive statistics to characterize data distribution for 25 outcome measures (encompassing motor, function, cognition, and psychiatric/behavioral domains) in each CAP decile. A skewness statistic threshold of ±2 was defined a priori to indicate floor/ceiling effects. Results: We found evidence of floor/ceiling effects in the early premanifest stages of disease for most motor and function assessments (e.g., TMS, TFC) and select cognitive tasks (MMSE, Trail Making tests). Other cognitive assessments, and the HADS-SIS scales, performed well ubiquitously, with no evidence of floor/ceiling effects at any disease stage. Floor/ceiling effects were evident at every disease stage for certain assessments, including PBA-s measures. Ceiling effects were apparent for DCL from onset stages onwards, as expected. Discussion: Developing instruments sensitive to subtle differences in performance at the earlier stages of the disease spectrum, particularly in motor and function domains, is warranted.


INTRODUCTION
Huntington's disease (HD) is an autosomal dominant, progressive, neurodegenerative disease characterized by debilitating movement, cognitive and psychiatric disturbances (1). It is caused by a mutation in the CAG repeat region of the HTT gene, defined by the presence of ≥36 CAG repeats. Clinical diagnosis of HD is typically based on the unequivocal presence of extrapyramidal motor signs. Onset most commonly occurs in mid-adulthood, although subtle cognitive, motor and psychiatric symptoms may be detected many years prior to formal clinical diagnosis (2). Symptoms progressively worsen post-onset, leading to death, typically within 10-30 years of diagnosis (3).
Several now well-established clinical assessments have been developed to assess and track the evolution of HD symptomatology over time. These assessments measure impairments in one or more key disease domains: motor, function, cognition and psychiatric/behavioral. Given the progressive nature of HD, it follows that the utility of these assessments may vary by disease stage/severity, depending on instrument design, including range and sensitivity constraints.
Here we focus on ceiling and floor effects, which are commonly observed phenomena in data, particularly in clinimetric contexts. "Ceiling effect" describes a situation in which many values for a given variable are at or near the upper limit (ceiling) of the scale used to measure said variable (4). Distributions of values in such situations will typically be very heavily skewed and variance limited, which can prove problematic for many types of analyses and give rise to spurious conclusions. "Floor effect" describes the converse situation, in which many values are at or near the lower limit (floor) of the scale, and causes similar problems.
Utilizing publicly available data from Enroll-HD, a worldwide prospective observational study and clinical research platform, we sought to identify floor/ceiling effects for the most common clinical measures used in HD research at different stages of disease across the full life course.

MATERIALS AND METHODS

Enroll-HD
Enroll-HD (https://www.enroll-hd.org/) is a prospective cohort study and global clinical research platform designed to facilitate clinical research in HD (5,6). Enroll-HD encompasses over 150 sites, from 19 countries located in North America, Latin America, Europe, Australia and New Zealand. Data are collected from participants annually and are monitored for quality and accuracy using a risk-based monitoring approach. Sites are required to obtain and maintain local Ethics Committee approvals.

Analysis Dataset
The third Enroll-HD periodic dataset, released December 15, 2016, was used for analysis. Analyses were limited to cross-sectional Enroll-HD baseline visit data from HD gene-expansion carriers (HDGECs) only.
The Enroll-HD periodic dataset is available to any interested researcher for download through the Enroll-HD website (https://www.enroll-hd.org/for-researchers/access-data/).

Participants and Visits
A maximum of 6,614 participants were available for analysis. Table 1 provides an overview of participant demographic and disease stage characteristics.

Clinical Assessments and Outcomes
The Enroll-HD clinical assessment battery includes assessments for motor, function, cognition, and psychiatric/behavioral domains. These assessments are administered by a trained rater in a clinic setting. Certain Enroll-HD assessments are administered as standard at each study visit ("core" assessments), while others are completed at the discretion of the site investigator ("extended" assessments; denoted in Table 2).
We focused on 25 commonly used outcome measures drawn from these assessments, as listed in Table 2. Given the optional nature of the "extended" assessments, analysis of these outcomes was based on a more limited sample relative to those outcomes from "core" assessments.

Gene Carrier Status
Analyses were limited to HDGECs, defined as individuals with a CAG length ≥36 as determined at a central laboratory (Biorep Technologies, Inc.).

Disease Stage/Severity
CAP score, derived from CAG length and age, is indicative of cumulative exposure to mutant huntingtin (akin to "pack/years" for assessing tobacco exposure in smokers), and was used to approximate disease stage/severity. CAP score was calculated based on the Warner 1 formula, which is standardized to ensure that CAP = 100 at the expected age of diagnosis: CAP score = Age × (CAG − L)/K, where L = 30 and K = 6.49. Participants were subdivided into deciles based on CAP score at baseline for the purposes of HD staging using the quantile function in R (quantile()). This enabled approximation of a disease stage/severity gradient (CAP score decile 1 = least severe; 10 = most severe). Note that a CAP score of 100 fell within CAP decile 5 (Table 1). Participants were also characterized according to the Shoulson-Fahn I-V staging system (28) using Total Functional Capacity (TFC) assessment score (Table 1).
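The CAP calculation and decile grouping described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the function names, the example ages and CAG lengths, and the choice of right-closed bins are assumptions. NumPy's default quantile interpolation is used here; it matches the default of R's quantile(), but the text does not state which settings were applied.

```python
import numpy as np

L_CONST = 30.0   # L in the formula quoted in the text
K_CONST = 6.49   # K in the formula quoted in the text


def cap_score(age, cag):
    """CAP score = Age * (CAG - L) / K, as quoted in the text."""
    age = np.asarray(age, dtype=float)
    cag = np.asarray(cag, dtype=float)
    return age * (cag - L_CONST) / K_CONST


def cap_decile(cap):
    """Assign each CAP score to a decile (1 = lowest, 10 = highest).

    Assumption: the nine interior cut points come from the sample
    quantiles at 10%, 20%, ..., 90%, and a score equal to a cut point
    falls into the lower decile (right-closed bins).
    """
    cap = np.asarray(cap, dtype=float)
    edges = np.quantile(cap, np.linspace(0.1, 0.9, 9))
    return np.searchsorted(edges, cap, side="left") + 1


# Hypothetical participants (illustrative values only, not Enroll-HD data).
ages = np.array([25.0, 38.0, 45.0, 52.0, 64.9])
cags = np.array([40, 42, 43, 44, 40])
print(cap_score(ages, cags))
```

Because the bins are defined by sample quantiles, group sizes are roughly equal while the CAP range covered by each decile can differ widely.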

Statistical Methods
Clinical outcome measures were characterized by CAP score deciles, using descriptive statistics of central tendency and variability to characterize data distribution (mean, standard deviation, maximum, minimum, skewness). A skewness statistic threshold of ±2 was defined a priori to indicate substantial departure from normality/extreme positive or negative skew, indicative of floor/ceiling effects (29,30). In addition, a complementary method to assess such effects was applied in which the percentage of participants scoring minimum and maximum scores was calculated within each CAP score decile for each assessment with defined upper and lower score bounds. Data points outside of minimum or maximum scale thresholds (see Table 2) were excluded (Trail Making Test part B = 17 observations; Trail Making Test part A = 2 observations), as were extreme outliers from assessments with no maximum score [Stroop Color Naming Test = 1 observation (scores ≥400); Timed Up and Go = 2 observations (scores ≥120 s)]. Additional sensitivity analyses were performed post-hoc for assessments with no maximum score to evaluate the impact of outliers on skewness statistics.
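The two checks described above (skewness against the ±2 threshold, and the percentage of participants at the scale minimum and maximum) can be sketched for a single decile as follows. This is a Python illustration under stated assumptions: the text does not say which skewness estimator was used, so the moment-based coefficient (third central moment divided by the 1.5 power of the second) is assumed; the function names and example scores are hypothetical.

```python
import numpy as np

SKEW_THRESHOLD = 2.0  # a priori threshold of +/-2 stated in the text


def sample_skewness(x):
    """Moment-based skewness m3 / m2**1.5 (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]          # ignore missing scores
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return float("nan") if m2 == 0 else float(m3 / m2 ** 1.5)


def boundary_summary(scores, scale_min, scale_max):
    """Summarize one outcome within one CAP decile."""
    x = np.asarray(scores, dtype=float)
    x = x[~np.isnan(x)]
    skew = sample_skewness(x)
    return {
        "skewness": skew,
        "extreme_skew": abs(skew) > SKEW_THRESHOLD,
        "pct_at_min": 100.0 * np.mean(x == scale_min),
        "pct_at_max": 100.0 * np.mean(x == scale_max),
    }


# Hypothetical decile: 90 participants score 0 on a 0-10 scale and the
# remaining ten score 1 through 10, i.e., scores pile up at the scale floor.
scores = [0] * 90 + list(range(1, 11))
print(boundary_summary(scores, scale_min=0, scale_max=10))
```

The skewness statistic alone does not say which bound is involved; pairing it with the percentage at each bound shows whether the skew reflects clustering at the minimum or the maximum of the scale.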
Statistical analyses were performed using R version 3.0.3.

RESULTS

Participant Characterization
Participant demographic and disease stage characteristics, determined at Enroll-HD baseline visit, are presented in Table 1 and Figure 1.

Clinical Outcome Characterization by CAP Score
Descriptive statistics characterizing each outcome as a function of CAP score decile are presented in Table 3. Accompanying density plots, illustrating the observed distribution of data for each outcome as a function of CAP score decile, are presented in Figure 3. The degree of skewness observed for clinical outcome data in each CAP decile is illustrated in Figure 2. The percentage of participants scoring minimum and maximum scores within each CAP score decile for each assessment is presented in Supplementary Table 1.

Motor
For Total Motor Score (TMS), extreme skewness of data was observed for the three lowest CAP score deciles (encompassing CAP scores of 26 through 89), with density plots clearly illustrating floor effects in the lowest deciles. At higher deciles, data resembled a more normal distribution, although flattened curves with non-pronounced peaks were observed, indicating a somewhat even distribution of scores across the observed range, underscored by kurtosis statistics (not shown). A similar pattern was observed for SF-12 physical functioning, with data in the highest CAP deciles resembling a near uniform distribution across the scale. For Timed Up and Go (TUG), data demonstrated extreme skew across the full disease spectrum.
Post-hoc sensitivity analyses were performed for TUG, given the observation of outliers, illustrated in Figure 3. Several outlier removal thresholds were explored, the most extreme of which was a maximum threshold of 40 s, resulting in the removal of 20 data points from 2,660 total observations (i.e., 0.75% of data). Imposition of this threshold had a major impact on skewness statistics and data distribution relative to the original distribution observed; data resembled a somewhat normal distribution from CAP decile 2 on (see Supplementary Figures 1, 2).
Diagnostic Confidence Level (DCL) data were reasonably distributed up to and including CAP decile 4 (encompassing CAP scores of 26 through 97). Beyond this point, data became increasingly skewed as CAP score increased. Density plots provide a clear illustration of pronounced ceiling effects in the advanced phases of the disease (Figure 3), as does the percentage of individuals scoring the maximum score on this outcome (Supplementary Table 1).
Table 2 footnotes: *Optional ("extended") assessment in the Enroll-HD assessment battery. a Observations >120 excluded (n = 2) per Enroll Quality Control procedure; b Observations of >150 excluded (n = 1).
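The kind of outlier sensitivity analysis described for TUG can be sketched as follows: skewness is recomputed after excluding observations above each candidate maximum threshold. This is a Python illustration, not the authors' procedure; the function names, the synthetic times, the inclusive handling of the threshold, and the moment-based skewness estimator are all assumptions.

```python
import numpy as np


def skewness(x):
    """Moment-based skewness m3 / m2**1.5 (assumed estimator)."""
    d = x - x.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)


def skew_sensitivity(times, thresholds):
    """Recompute skewness after dropping values above each threshold.

    Assumption: a value exactly equal to the threshold is kept.
    """
    x = np.asarray(times, dtype=float)
    rows = []
    for t in thresholds:
        kept = x[x <= t]
        n_removed = x.size - kept.size
        rows.append({
            "threshold_s": t,
            "n_removed": n_removed,
            "pct_removed": 100.0 * n_removed / x.size,
            "skewness": skewness(kept),
        })
    return rows


# Synthetic times in seconds (not Enroll-HD data): 500 evenly spread
# values between 6 and 14 s plus three extreme observations.
times = np.concatenate([np.linspace(6.0, 14.0, 500), [60.0, 80.0, 110.0]])
for row in skew_sensitivity(times, thresholds=[np.inf, 120.0, 40.0]):
    print(row)
```

In this synthetic example, a handful of extreme values is enough to push the skewness well past the ±2 threshold, and removing them returns a near-symmetric distribution, which mirrors the pattern reported for TUG. The `pct_removed` field corresponds to figures such as the 20 of 2,660 observations (0.75%) quoted above.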

Function
For the functional assessments, i.e., Functional Assessment Score (FAS), Total Functional Capacity (TFC), and Independence Scale (IS), extreme skewness of data was also observed at lower CAP score deciles, indicative of ceiling effects in the early stages of the HD life-course, also clearly observed in the corresponding density plots. From decile 4 on (encompassing CAP scores of 89 and above), data began to resemble a more normal distribution, verging on uniform at the higher end of the spectrum (Figures 2, 3).

Cognition
Except for the Trail Making tests and the Mini-Mental State Examination (MMSE), all cognitive assessments examined demonstrated a relatively normal distribution of scores within each CAP score decile (Figures 2, 3).
Post-hoc sensitivity analyses were performed for Symbol Digit Modalities Test (total correct) and Stroop Interference Test (total correct) given the observation of six valid but implausible zero scores in the two initial CAP deciles (SDMT: n = 4; SIT: n = 2). Removal of these values had a negligible effect on the skewness statistics and did not alter conclusions (Supplementary Figure 3).
Trail Making Test Part A demonstrated extreme data skew up to and including CAP decile 6 (CAP scores ≤ 107), with clear floor effects apparent in the very lowest deciles, while Part B demonstrated extreme data skew in the 2 lowest deciles only (CAP scores ≤79). MMSE data were heavily skewed in the lowest deciles, with clear ceiling effects apparent in the corresponding density plots, while a more normal data distribution was observed from decile 4 onwards (CAP scores of ≥89).

Psychiatric/Behavioral
For the Hospital Anxiety and Depression Scale-Snaith Irritability Scale (HADS-SIS) depression, anxiety and irritability subscales, data distribution was relatively normal with broad variation across all CAP score deciles, indicating an absence of floor/ceiling effects at all disease stages. Most Problem Behaviors Assessment-Short (PBA-s) scores exhibited extreme data skew across the disease spectrum, of which the psychosis scale was the most extreme example. PBA-s depression was an exception (data were skewed but within acceptable range across almost all deciles), as was PBA-s apathy, which also demonstrated skew but within an acceptable range from decile 3 onwards (CAP scores ≥80).

DISCUSSION
Using data from Enroll-HD, a large and diverse observational study of HD, we examined distributions of scores on several well-established HD clinical assessments for evidence of instrument constraints, specifically floor/ceiling effects. These results are an important addition to existing clinimetric data for the outcome measures analyzed, which is limited, particularly for populations in the prodromal phase of the disease.
Most assessments demonstrated good utility (i.e., absence of floor/ceiling effects) across CAP deciles 5 through 10, equivalent to the period spanning clinical diagnosis/onset through to the most severe stages of the disease examined. The only exceptions to this rule were DCL, which (unsurprisingly) demonstrated poor utility from CAP decile 5 onwards (encompassing CAP score of 100, approximating onset), and assessments which demonstrated poor utility ubiquitously. Conversely, many assessments demonstrated poor utility in the lower CAP deciles, equivalent to the very earliest phases of the disease life course, prior to diagnosis/onset. This was true of the motor assessments, all functional assessments, and the two Trail Making Tests from the cognitive domain, indicating that instruments that are sensitive to the very earliest premanifest changes in motor and function performance are required. These are already under development (31,32).
In contrast, all the cognitive assessments (bar the Trail Making Tests and MMSE) and all the HADS-SIS subscales demonstrated good utility across the full disease life course.
However, certain assessments demonstrated poor utility across the full disease spectrum, including most PBA-s scales and TUG (although removal of extreme outliers from the TUG data suggests this task does show promise in the earliest stages of the HD life-course). This should be further explored, perhaps with digital versions of the task using smartphone sensors, as are currently being used in the Parkinson's disease field (33).
A thorough understanding of the clinimetric properties of assessments, including instrument constraints, is imperative in conducting robust and meaningful research. If there is no (or very limited) variability in a specific outcome measure in a cohort due to assessment constraints, then using that outcome to assess a candidate drug may lead researchers to conclude that the candidate has no effect on disease. This may, or may not, be true. For example, use of FAS as an outcome in a clinical trial targeting early premanifest HDGECs would be inadvisable given the floor effects observed in scores in this phase of the disease.
FIGURE 2 | Skewness of clinical assessment data by CAP score. Degree of data skewness observed within each CAP score decile is illustrated for each clinical outcome. Cells are color coded conditional on observed skewness statistic. A threshold of ±2, indicating extreme positive or negative skew, was defined a priori to identify extreme skew, indicative of floor/ceiling effects. Cells with values more extreme than these thresholds are colored white. All remaining cells, with skewness statistics ranging from −2 to +2, are color coded on a yellow-green-yellow gradient, centered at 0 (green), indicative of a perfect normal distribution of data, graduating to yellow as values become more extreme.
Here we describe the utility of commonly used HD clinical assessments by disease stage, for consideration when designing clinical studies and/or trials. We also expand on what is known with regard to psychiatric/behavioral assessments. Our results in relation to the PBA-s outcomes, which were highly skewed across most deciles, imply low utility of this assessment across the HD spectrum. In contrast, the HADS-SIS depression, anxiety and irritability outcomes demonstrated superior clinimetric performance, consistent with previous conclusions that HADS is a more appropriate assessment than PBA-s for the constructs mentioned (34). Note that we do not comment on the utility of these assessments to track disease progression.
The extremely large and diverse nature of the Enroll-HD data sample affords us confidence in the robustness of our results. This is further bolstered by the rigorous data quality control procedures implemented within Enroll-HD, from point of data entry through to onsite and remote data review, designed to maximize data integrity. We do, however, acknowledge several limitations. First, our approach to the identification of instrument constraints, specifically floor/ceiling effects, was relatively simple, based principally on assessment of data skewness in conjunction with other basic descriptive statistics. Other complementary methods may be applied to assess the presence and severity of floor/ceiling effects more comprehensively. Second, our consideration of both "core" and "extended" assessments from the Enroll-HD protocol assessment battery resulted in a somewhat limited sample size for certain assessments, given the optional nature of the extended measures. Third, we acknowledge that our results may have been mildly affected by missing data, should the reason for missingness relate to disease stage/assessment score. This is plausible both for optional assessments (identified in Table 2) and for assessments requiring active participation (e.g., TUG), where individuals at more advanced stages may be unable (or may not be asked) to attempt these tasks. If such situations have occurred, the most extreme/worst scores would have been masked. Fourth, we concede that our approach to disease staging was non-standard; we used a quantile grouping method to approximate equivalency of group size for statistical reasons, but acknowledge that this resulted in categories encompassing widely differing ranges of CAP score, with the widest categories noted at the lowest and highest deciles. A more standard approach may have aided interpretation and application of results and afforded increased resolution in identifying the boundaries of floor/ceiling effects in the earliest and latest phases of disease. Nevertheless, it is well-established that a CAP score of 100 coincides approximately with the occurrence of clinical diagnosis and it is therefore relatively simple to infer which deciles encompass prodromal and manifest phases of the disease. Fifth, we highlight that in assessments with no set maximum scale threshold (e.g., TUG), or assessments where the vast majority score within a limited scale range (e.g., PBA psychosis), the presence of even a few extreme scores (plausible outliers) can dramatically influence skewness statistics, leading to inaccurate conclusions regarding floor/ceiling effects. Consequently, our conclusions regarding these specific assessments should be considered with caution.
We applied a simple analytical approach to data ascertained from a very large and diverse clinical cohort to examine constraints of commonly used HD assessments as a function of disease stage. Most assessments demonstrated good utility from onset onwards, others across the full disease life course, while others performed poorly across the spectrum. Investigations to develop instruments sensitive to subtle differences in performance in earlier stages of the disease spectrum are warranted.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at: https://www.enroll-hd.org/for-researchers/access-data/.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
DA: conception, organization, execution of research project, design, execution, review and critique of statistical analysis, writing of the first draft, review, and critique of manuscript preparation. JW: conception, organization, execution of research project, design, review and critique of statistical analysis, writing of the first draft, review, and critique of manuscript preparation. NG-K and BL: conception, organization, execution of research project, review and critique of statistical analysis, writing of the first draft, review and critique of manuscript preparation.