Development of a Sensitive Outcome for Economical Drug Screening for Progressive Multiple Sclerosis Treatment

Therapeutic advance in progressive multiple sclerosis (MS) has been very slow. Based on the transformative role magnetic resonance imaging (MRI) contrast-enhancing lesions had on drug development for relapsing-remitting MS, we consider the lack of sensitive outcomes to be the greatest barrier for developing new treatments for progressive MS. The purpose of this study was to compare 58 prospectively acquired candidate outcomes in the real-world situation of progressive MS trials to select and validate the best-performing outcome. The 1-year pre-treatment period of adaptively designed IPPoMS (ClinicalTrials.gov #NCT00950248) and RIVITaLISe (ClinicalTrials.gov #NCT01212094) Phase II trials served to determine the primary outcome for the subsequent blinded treatment phase by comparing 8 clinical, 1 electrophysiological, 1 optical coherence tomography, 7 MRI volumetric, 9 quantitative T1 MRI, and 32 diffusion tensor imaging MRI outcomes. Fifteen outcomes demonstrated significant progression over 1 year (Δ) in the predetermined analysis and seven out of these were validated in two independent cohorts. Validated MRI outcomes had limited correlations with clinical scales, relatively poor signal-to-noise ratios (SNR) and recorded overlapping values between healthy subjects and MS patients with moderate-severe disability. Clinical measures correlated better, even though each reflects a somewhat different disability domain. Therefore, using machine-learning techniques, we developed a combinatorial weight-adjusted disability score (CombiWISE) that integrates four clinical scales: expanded disability status scale (EDSS), Scripps neurological rating scale, 25 foot walk and 9 hole peg test. CombiWISE outperformed all clinical scales (Δ = 9.10%; p = 0.0003) and all MRI outcomes. 
CombiWISE recorded no overlapping values between healthy subjects and disabled MS patients, had high SNR, and predicted changes in EDSS in a longitudinal assessment of 98 progressive MS patients and in a cross-sectional cohort of 303 untreated subjects. One point change in EDSS corresponds on average to 7.50 point change in CombiWISE with a standard error of 0.10. The novel validated clinical outcome, CombiWISE, outperforms the current broadly utilized MRI brain atrophy outcome and more than doubles sensitivity in detecting clinical deterioration in progressive MS in comparison to the scale traditionally used for regulatory approval, EDSS.

Introduction
Therapeutic progress in relapsing-remitting multiple sclerosis (RRMS) has been facilitated by the recognition that contrast-enhancing lesions (CELs) on brain magnetic resonance imaging (MRI) can serve as a predictive marker of multiple sclerosis (MS) relapses.
Utilizing this outcome allowed rapid, inexpensive screening of candidate agents. In contrast to RRMS, therapeutic development for progressive MS patients, who have few CELs and MS relapses, has been extremely slow. These patients relentlessly accumulate neurological disability, albeit at a pace that requires observation of hundreds of patients for a minimum of 2-3 years to reliably detect moderate (30-50%) therapeutic effects using the expanded disability status scale (EDSS) (1). Such studies utilize the majority of available patients and, therefore, allow screening of only a handful of therapeutic agents each decade. Consequently, more sensitive outcomes are necessary to facilitate broader therapeutic advances for progressive MS. While quantitative MRI (qMRI) measures have been promoted as candidate outcomes (2,3), a comprehensive comparison of qMRI markers with clinical outcomes and with each other is missing. Therefore, we integrated systematic comparisons of clinical, electrophysiological, optical coherence tomography (OCT) and a large number of qMRI measures as an adaptive part of the IPPoMS (double-blind, placebo-controlled Phase I/II clinical trial of Idebenone in patients with Primary Progressive Multiple Sclerosis; NCT00950248) and RIVITaLISe (Double Blind Combination of Rituximab by Intravenous and Intrathecal Injection Versus Placebo in Patients With Low-Inflammatory Secondary Progressive Multiple Sclerosis; NCT01212094) clinical trials and present the results.

Trial Design
IPPoMS and RIVITaLISe trials were randomized, double-blind, placebo-controlled trials with an adaptive design. The 2-year randomized treatment phase was preceded by a 1-year pretreatment period, which served a dual purpose: (1) to determine a final primary outcome, by comparing 58 measures in the first ≥30 subjects, and to perform a new power analysis/sample size re-calculation using the selected outcome (this is the adaptive part of the design and represents the work described in this paper) and (2) to collect individualized disease-progression data while off therapy, as the baseline-versus-treatment paradigm is expected to enhance power (4). The default primary outcome for both trials was progression of brain atrophy measured by Structural Image Evaluation using Normalization of Atrophy (SIENA) methodology (5).

Patient Population
Due to missing data, the first 35 primary progressive MS (PPMS; IPPOMS1 cohort) subjects who completed the IPPoMS pretreatment baseline (before randomization into placebo or active treatment arm) were included to yield a minimum of 30 subjects per outcome, as defined in the protocol. The problem of missing clinical data at the beginning of the trial was solved by launching a database system in September 2013 that allows the principal investigator to confirm in real time that all measures were acquired according to the protocol. Some MRI data were missing because of technical issues with MRI acquisition (e.g., MRI machine problems). MRI data could also be missing if computer programs failed to run on the particular patient's MRI, usually due to poor quality of the MRI caused by movement artifacts. The data loss in both instances should be considered missing completely at random.
The RIVITaLISe trial was recently terminated for futility, after interim analysis of the pharmacodynamic markers in the target organ showed that the pre-determined criteria for protocol continuation were not reached (6). Accrual of only 29 secondary progressive MS (SPMS) patients who completed the 1-year pre-treatment baseline prevented us from performing protocol-stipulated analysis of outcomes; therefore, we used this cohort (RIVITALISE cohort) as a validation cohort for the IPPOMS1 results. Finally, to avoid uncertainty as to whether SPMS and PPMS patients are comparable when it comes to clinical and MRI outcomes, we included all 34 remaining IPPOMS patients who completed the year-long pre-treatment baseline as of June 2015 and were not included in the IPPOMS1 as the second validation cohort (IPPOMS2 cohort).

Inclusion Criteria
Eligible patients had clinically definite PPMS (IPPoMS trial) or SPMS (RIVITaLISe trial), were aged 18-65 years (inclusive), and had disability ranging from mild to moderate (EDSS 1-7, inclusive). Patients must not have received any immunomodulatory/immunosuppressive therapies for a period of at least 3 months prior to enrollment and must not have had any exposure to idebenone, coenzyme-Q10, or other mitochondrial-function promoting supplements at more than three times the recommended daily dose for a period of at least 1 month before enrollment in the IPPoMS trial. Exclusion criteria included pregnancy, abnormal screening blood tests exceeding predefined limits, and/or clinically significant medical disorders that could expose the patient to undue risk or harm. A data safety monitoring board (DSMB) and institutional review board (IRB) approved a single patient exemption for a 70-year-old subject who otherwise fulfilled all inclusion criteria to be enrolled in the IPPoMS trial (Table 1).
Inclusion criteria for patients in the natural history protocol (cross-sectional cohort) were 18-75 years of age, presenting with a clinical syndrome consistent with immune-mediated central nervous system (CNS) disorder and/or neuroimaging evidence of inflammatory and/or demyelinating CNS disease. Healthy volunteer (HV) inclusion criteria were 18-75 years of age and vital signs within normal range at the time of the screening visit. HVs had to have no systemic disorder or CNS disease of any kind or other related risk factors.

Study Oversight
All subjects provided written informed consent. The trials were approved by the Combined Neuroscience Institutional Review Board of the National Institutes of Health and relevant regulatory agencies. Monitoring was provided by Data and Safety Monitoring Boards.

Pre-Defined Analysis of Outcomes
Because the SIENA methodology, the default primary trial outcome, calculated progression of brain atrophy as a percentage of baseline brain tissue, the same type of analysis was used for every other outcome; i.e., each biomarker quantified at month (Mo) 0 (before randomization) was expressed as a percentage of the Mo −12 value (considered to represent 100%). For each biomarker, we calculated a z-score as the average yearly change divided by the group standard deviation (SD). According to the protocol, z-scores (which are directly related to statistical power for a test of change) were designed to select the highest powered outcome. We observed a violation of the normality assumption in the analysis of some outcomes, questioning whether z-scores represented the best tool for outcome comparisons. As a compromise, in the IPPOMS1 cohort, we performed parametric statistical analysis after exclusion of outliers, without adjustment for multiple comparisons to obtain broad selection of candidate outcomes for validation. Outcomes from IPPOMS1 cohort at the 5% significance level were tested in two independent validation cohorts, RIVITALISE and IPPOMS2, using step-down Sidak (7) adjustments for multiple comparisons.
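The percent-of-baseline transformation and the z-score described above can be sketched in a few lines. This is a minimal illustration, not the protocol's code: the function names and the four-subject data are hypothetical, and the use of the sample (n − 1) standard deviation for the "group SD" is an assumption.

```python
from statistics import mean, stdev


def percent_of_baseline(mo_minus12, mo0):
    """Express each Mo 0 value as a percentage of the same subject's Mo -12 value."""
    return [100.0 * after / before for before, after in zip(mo_minus12, mo0)]


def z_score(mo_minus12, mo0):
    """Average yearly change (in % of baseline) divided by the group SD.

    Assumption: the group SD is the sample standard deviation (n - 1).
    """
    changes = [pct - 100.0 for pct in percent_of_baseline(mo_minus12, mo0)]
    return mean(changes) / stdev(changes)


# Hypothetical values for four subjects (not study data).
baseline = [100.0, 200.0, 50.0, 80.0]
followup = [102.0, 208.0, 50.5, 84.0]
print(percent_of_baseline(baseline, followup))
print(round(z_score(baseline, followup), 3))
```

A larger absolute z-score means that the average yearly change is large relative to its between-subject spread, which is why the protocol used it to rank outcomes by expected power.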

Electrophysiological Outcome
Single-pulse transcranial magnetic stimulation was performed using a Magstim 200. Motor evoked potentials were obtained using 130% resting threshold with mild activation of muscle. Central motor conduction time (CMCT) was calculated using the "F-wave method" (11). A total of six CMCTs were obtained: two from arms (at the abductor pollicis brevis) and four from legs (at the tibialis anterior and extensor digitorum brevis).

Optical Coherence Tomography
Optical coherence tomography was obtained using the ZEISS Cirrus (TM) HD-OCT Model 4000. The retinal nerve fiber layer (RNFL) thickness was quantified in four quadrants.

MR Imaging
Magnetic resonance imaging of the brain was performed on a 3T Signa HDx (3TA; GE, Milwaukee, WI, USA) equipped with a 16-channel head coil or on a 3T Skyra (3TD; Siemens, Malvern, PA, USA) with a 32-channel head coil. Follow-up MRIs were maintained on the same scanner as the first MRI. Seven HVs were scanned twice on each scanner to yield at least five technically adequate duplicates for test-retest reliability.
Percent change in brain volume was calculated using SIENA (V-SIENA (5); http://fsl.fmrib.ox.ac.uk/fsl), while volumes of brain (V-Brain), ventricles (V-Ventricles), cortical gray matter (V-CorticalGM) and thalamus (V-Thalamus) were calculated using LesionTOADS tissue segmentation. A cross-sectional area of the upper cervical spinal cord (SC) at the level of the dens (A-CS-Dens) was calculated from manually drawn ROIs on individual GRE or SPGR images (using OSIRIX and MIPAV).

Data Collection
The EDSS and SNRS were performed by the same clinician, who had no knowledge/intervention in collecting any other outcomes. 9HPT, 25FW, PASAT, and SDMT were performed by non-clinical investigators, who had no knowledge/intervention in collecting other outcomes. MRI analyses were performed by another set of non-clinical personnel, who had no knowledge/intervention in collecting clinical outcomes.

Development of Combinatorial Weight-Adjusted Disability Score
To mathematically optimize the new scale (i.e., CombiWISE) with relative weights of different subscales that are not distorted by individual observations, each clinical scale was re-scaled by its maximum achievable value so that all values lie between 0 and 1 making the different scales directly comparable. The three longitudinal cohorts of progressive MS patients (IPPOMS1, IPPOMS2, and RIVITALISE) were combined and the subjects were then randomly permuted multiple times between training and validation datasets with 70% of the subjects allocated to each training dataset. In order to efficiently estimate the contributions of the failure to complete 9HPT or 25FW, the randomization was constrained to balance the number of subjects in each training and validation dataset with failed attempts on non-dominant hand 9HPT (NDH-9HPT), dominant hand 9HPT (DH-9HPT), 25FW, and a combination of NDH-9HPT and 25FW. For each of the constructed training datasets, a genetic algorithm (GA) implemented in the GA package (17) in R (18) was used to construct a linear combination of EDSS, SNRS, log25FW (log2 of average of two attempts on 25FW, or 0 if at least one trial was unsuccessful), 25FWFAIL (1 if patient failed either attempt on 25FW; otherwise 0), logNDH-9HPT (log2 of average of two attempts on 9HPT with non-dominant hand; otherwise 0), NDHFAIL (1 if patient failed either attempt on 9HPT with non-dominant hand; otherwise 0), logDH-9HPT (log2 of average of two attempts on 9HPT with dominant hand; otherwise 0), DHFAIL (1 if patient failed either attempt on 9HPT with dominant hand, otherwise 0), PASAT, and SDMT that maximizes the evidence of a change over time (test statistic) from a linear mixed model (19) estimated using the nlme package (20). The model contained a random intercept for each subject (in order to account for three repeated measures on each subject; Mo −12, Mo −6, and Mo 0) and assumes a linear change over these times. 
The signs of the individual weights were constrained to be in the direction of disease progression (i.e., positive for EDSS since higher values of EDSS indicate more progression, negative for SNRS since smaller values of SNRS indicate more progression). The scale was optimized for 200 permutations, followed by dropping of four variables with weights routinely close to 0 (logDH-9HPT, DHFAIL, PASAT, and SDMT). The final weights after removing these unused variables were generated as an average of the selected weights from 500 permutations of the training data set (referred to hereafter as relative weights that allow comparison of overall contributions of individual components to the developed scale), followed by their linear re-scaling to generate the CombiWISE scale that ranges from 0 to 100 (referred to hereafter as computing weights that allow for construction of a scale with the range specified above, with higher values indicating more disease severity). The performance of CombiWISE against traditional clinical outcomes was measured in 500 permutations of the retained validation data sets. The R code for the development of the CombiWISE scale is available in Data S1 in the Supplementary Material.
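The constrained 70/30 allocation of subjects described above amounts to a stratified random split on the pattern of failed tests. The sketch below illustrates that idea only; it is not the authors' R code in Data S1. The function name, the stratum labels, and the rule of rounding 70% within each stratum are all assumptions.

```python
import random
from collections import defaultdict


def constrained_split(failure_pattern, train_fraction=0.7, seed=0):
    """Split subject IDs into training/validation sets, balanced by failure pattern.

    failure_pattern maps a subject ID to a label such as "none", "NDH-9HPT",
    "DH-9HPT", "25FW", or "NDH-9HPT+25FW" (labels are illustrative).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject, pattern in sorted(failure_pattern.items()):
        strata[pattern].append(subject)

    train, validation = [], []
    for pattern in sorted(strata):
        subjects = strata[pattern]
        rng.shuffle(subjects)
        # Assumption: each stratum contributes ~70% of its subjects to training.
        n_train = round(train_fraction * len(subjects))
        train.extend(subjects[:n_train])
        validation.extend(subjects[n_train:])
    return train, validation


# Hypothetical cohort: 10 subjects without failures, 10 who failed the 25FW.
patterns = {f"S{i:02d}": ("none" if i < 10 else "25FW") for i in range(20)}
train_ids, validation_ids = constrained_split(patterns, seed=1)
print(len(train_ids), len(validation_ids))
```

Calling the function with different seeds yields the repeated permutations; each resulting training set would then be passed to the optimizer and each validation set held back for performance assessment.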

Statistical Methods
For each biomarker, the relative percentage change over 1 year (= 100 × (score at period 0 − score at period −12)/score at period −12) was calculated and used as an outcome measure. For most biomarkers, the relative change had normal (based on the Shapiro-Wilk test) or near-normal (both kurtosis and skewness between −1 and 1) distributions after excluding a few extreme outliers [<Q1−3IQR, or >Q3+3IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range (Q3−Q1)]. One-sample t-tests were performed to test the null hypothesis μ = 0 (the mean relative change equals 0) for each biomarker in the IPPOMS1 cohort. The same test was performed for IPPOMS2 and RIVITALISE cohorts separately for any biomarkers with p < 0.05 in IPPOMS1, with each set of p-values corrected for multiple testing using the step-down Sidak method (7).
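The three steps in this paragraph (outlier exclusion at 3 × IQR, a one-sample t-statistic against μ = 0, and step-down Sidak adjustment) can be sketched as follows. This is an illustration under stated assumptions rather than the SAS analysis used in the study: the quartile convention is Python's default and may differ from SAS, the function names are hypothetical, and the t-statistic is not converted to a p-value because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, quantiles, stdev


def drop_extreme_outliers(values):
    """Keep values within [Q1 - 3*IQR, Q3 + 3*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # assumption: Python's default quartile method
    iqr = q3 - q1
    low, high = q1 - 3 * iqr, q3 + 3 * iqr
    return [v for v in values if low <= v <= high]


def one_sample_t(values, mu0=0.0):
    """t-statistic for H0: mean == mu0 (degrees of freedom = n - 1)."""
    return (mean(values) - mu0) / (stdev(values) / sqrt(len(values)))


def step_down_sidak(p_values):
    """Step-down Sidak adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is adjusted for the m - k + 1 remaining tests.
        candidate = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical yearly percentage changes (not study data).
changes = [1, 2, 3, 4, 5, 6, 7, 8, 100]
kept = drop_extreme_outliers(changes)
print(kept, round(one_sample_t(kept), 3))
print(step_down_sidak([0.01, 0.04, 0.03]))
```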
The power analysis was performed for both a two-group parallel design and a one-group baseline-versus-treatment design. Since IPPoMS is a 2-year randomized controlled trial, for each biomarker, the relative change over 2 years [= 100 × (score at 2 years − baseline)/baseline] was used as an outcome measure, which was assumed to follow a normal distribution, to change linearly over time, and to have homogeneous variance. The drug was assumed to have a 50%, 40%, or 30% effect. The mean and SD of the observed relative change in the first year without treatment were used for the calculations.
For a two-group design, a two-tailed two-sample t-test was used to test the null hypothesis: μ1 = μ2, where μ1 is placebo group mean change and μ2 is treated group mean change. For example, for an outcome variable with 10% measured relative change in the 1-year pre-treatment period, the placebo group mean was expected to change 20% over 2 years of treatment, and the treated group mean was expected to change 10% (12% and 14%) if the drug had 50% (40% and 30%, respectively) efficacy.
For the one-group design, a two-tailed one-sample t-test was used to test the null hypothesis μ = μ0, where μ0 is the null hypothesis mean and μ is the alternative hypothesis mean, with analogously estimated 50% (40% and 30%) drug effects.
For each biomarker, the sample sizes were estimated to reject the null hypotheses with 80% power at the 5% significance level (not adjusted for multiple outcomes) for the two designs.
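The worked example above (10% yearly change, 50%, 40%, or 30% drug effect) and the two sample-size calculations can be approximated with the standard normal-approximation formulas. This is a hedged sketch: the study used t-test-based calculations in SAS and GraphPad Prism, so exact numbers will be slightly larger than those produced here. The function names and the SD value of 20 are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def expected_two_year_means(yearly_change_pct, effect_pct):
    """Placebo and treated mean change over 2 years, assuming linear progression."""
    placebo = 2 * yearly_change_pct
    treated = placebo * (100 - effect_pct) / 100
    return placebo, treated


def _z_total(alpha, power):
    normal = NormalDist()
    return normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)


def n_per_arm_two_group(delta, sd, alpha=0.05, power=0.80):
    """Subjects per arm for a two-sided two-sample comparison (normal approximation)."""
    return ceil(2 * (_z_total(alpha, power) * sd / delta) ** 2)


def n_one_group(delta, sd, alpha=0.05, power=0.80):
    """Subjects for a two-sided one-sample comparison (normal approximation)."""
    return ceil((_z_total(alpha, power) * sd / delta) ** 2)


for effect in (50, 40, 30):
    placebo, treated = expected_two_year_means(10, effect)
    delta = placebo - treated
    # sd=20 is a made-up 2-year SD used only to show the call pattern.
    print(effect, placebo, treated, n_per_arm_two_group(delta, sd=20), n_one_group(delta, sd=20))
```

Because the required sample size grows with the square of SD/delta, halving the detectable drug effect roughly quadruples the number of subjects, which is the pattern seen across the 50%, 40%, and 30% scenarios.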
The individual biomarker and power analyses were performed using SAS 9.2 and GraphPad Prism 6 software.

Results
The results for all 58 measured variables are summarized in Table 2.

MRI Volumetric Outcomes
While change in the brain volume measured by LesionTOADS (V-Brain) was not statistically significant, the percent change in brain volume calculated by SIENA was significantly reduced. Both of these segmented volumetric outcomes achieved higher z-scores than whole brain atrophy (V-SIENA). None of the cross-sectional areas of the upper cervical SC changed significantly over 1 year; therefore, we only highlight the best-performing of these outcomes, measured at the level of dens (C1).

Optical Coherence Tomography
Because the temporal quadrant of retinal nerve fiber layer has the most-pronounced thinning in MS (21), we have used the sum of the temporal quadrants from both eyes as a single OCT measure. This biomarker did not show evidence of progression over 1 year.

MRI Tissue Integrity Measures: DTI
Initially, we used the identical co-registration method (i.e., JIST; DTIJ) for qT1 and DTI measures of CNS tissue integrity. However, DTIJ data demonstrated unacceptably high scan-rescan variability in HVs (at times > 100%), which led us to re-analyze DTI scans using two different technologies: (1) we drew separate ROIs for Mo −12 and Mo 0 on un-registered native DTI scans (DTIN) and (2) we co-registered DTI scans to anatomical images using TORTOISE algorithm (DTIT). The DTIT method outperformed DTIN for all DTI biomarkers and, therefore, only DTIT data are presented.

Validation of Biomarkers that Reached Statistical Significance in the IPPOMS1 Cohort in Two Additional Longitudinal Progressive MS Cohorts (RIVITALISE and IPPOMS2)
Fifteen outcomes that reached statistical significance based on unadjusted p-values in the IPPOMS1 cohort were evaluated for statistically significant progression over 1 year in two independent validation cohorts consisting of PPMS (IPPOMS2) and SPMS (RIVITALISE) patients with identical inclusion criteria. Eleven outcomes also showed statistical significance in the IPPOMS2 cohort after adjustment for multiple comparisons, but only six outcomes validated in the smaller RIVITALISE cohort (Table 2).
From clinical measures, SNRS barely missed the cut-off for statistical significance in the RIVITALISE cohort. From MRI volumetric measures, only ventricular volume (V-Ventricles) validated in both cohorts. Finally, from MRI measures of CNS tissue integrity, MD was the most successful DTI biomarker and it validated significant progression when measured in three ROIs: the head of the caudate nuclei, midbrain, and medulla. Additionally, radial diffusivity of the medulla and axial diffusivity of caudate nuclei also demonstrated statistically significant change in both validation cohorts.

Correlations between Validated Outcomes and EDSS
[…] MS patients from all three longitudinal cohorts) and longitudinal (i.e., correlations between yearly changes measured in the identical 98 patients) paradigms. We also included other clinical scales in the correlation matrix for instructive purposes [Figure 1 (exact correlation coefficients, p-values, and number of observations are in Table S2 in Supplementary Material)].
In the cross-sectional paradigm, we observed strong to moderate correlations with small p-values between all clinical outcomes, with the exception of the cognitive test PASAT, which was only moderately correlated with another cognitive test, SDMT. We also observed statistically significant correlations of relatively mild strength between ventricular volume and clinical biomarkers that capture cognition and fine finger movements/coordination (i.e., SDMT, MSFC, PASAT, and 9HPT), but not with EDSS. Finally, DTI measures were generally correlated with each other, but not with any clinical outcome.
As expected, fewer correlations were observed in the longitudinal paradigm: CombiWISE (see below) was the only scale that showed strong, statistically significant correlation of its yearly change with three out of four clinical outcomes (EDSS, SNRS, and 25FW) that contribute to its computation. By contrast, yearly change in MSFC, another composite scale, demonstrated significant, but weak correlation with only one (25FW) of its three components. Again, strong correlations were observed between different DTI measures. However, no statistically significant correlation was observed between MRI measures and clinical scales in the longitudinal paradigm.
Composite Clinical Score: CombiWISE
The discrepancy between strong correlations among clinical scales in the cross-sectional paradigm and the lack of correlations in the 1-year longitudinal study indicates that, like other outcomes, the tested clinical scales suffer from low sensitivity confounded by measurement noise. Repeated measurements can enhance signal-to-noise ratio (SNR). To the extent to which clinical scales capture overlapping elements of disability, they represent a form of repeated measures. For example, slight worsening in one clinical score but improvement in another may reflect performance noise rather than true disability. A structural substrate to observed change is expected to affect overlapping domains of several clinical scores congruently. Thus, using a combination of clinical scales with overlapping elements amplifies the true disability signal and limits measurement noise. However, differences in z-scores also indicate that clinical scales differ in sensitivity and specificity. Therefore, a combination of clinical scales should be based on their measured performance, giving greater weight to the measures that have higher sensitivity and lower measurement noise. We tested this hypothesis by first constructing a conceptual model of the combinatorial weight-adjusted disability score (CombiWISE v.0; see Figure S3 in Supplementary Material for details) based exclusively on the clinical data collected in the IPPOMS1 cohort. We then validated its sensitivity for longitudinal change and superiority against other clinical measures in IPPOMS2 and RIVITALISE cohorts (Figure S3 in Supplementary Material). Because CombiWISE v.0 represented only one possible model from measured data, we next employed statistical modeling using a GA (22-24) to numerically optimize CombiWISE for its ability to detect yearly changes across a suite of random permutations of the acquired data from all 98 progressive MS patients (see Materials and Methods).
The weights from 200 permutations of the training/validation data, generated using all 10 measured clinical variables [9HPT was evaluated independently for the dominant (logDH-9HPT) and non-dominant hand (logNDH-9HPT), and failure to perform either 9HPT (DHFAIL, NDHFAIL) or the 25FW (25FWFAIL) was captured as separate variables, leading to a total of four 9HPT and two 25FW measures tested], demonstrated that cognitive scales (PASAT and SDMT) were always at the boundary of 0, because their direction of change was often opposite to expected clinical progression (i.e., they either did not change or demonstrated a learning effect; Figure 2A). Surprisingly, permutations also revealed differences between logDH-9HPT and logNDH-9HPT, with the latter achieving higher weights, while logDH-9HPT weights were close to zero in most cases. Consequently, to reduce variability of CombiWISE, we removed the aforementioned four clinical outcomes that did not reliably capture disease progression. Five hundred additional GAs (using the same constrained permutation procedure of training/validation data splits) with the remaining six clinical variables (Figure 2B) achieved mean weights that were proportionally comparable to the weights utilized in the conceptually generated CombiWISE v.0 (i.e., weights based on measured z-scores in the IPPOMS1 cohort), with the following hierarchy: SNRS > EDSS > log25FW = 25FWFAIL > logNDH-9HPT = NDHFAIL (Figure 2C). The measured weights were rescaled so that optimized CombiWISE ranges from 0 to 100 (higher numbers correspond to increasing disability), calculated based on the following formula:

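$$\mathrm{CombiWISE} = c_0 + c_1\,\mathrm{EDSS} + c_2\,\mathrm{SNRS} + c_3\,\mathrm{log25FW} + c_4\,\mathrm{25FWFAIL} + c_5\,\mathrm{logNDH\text{-}9HPT} + c_6\,\mathrm{NDHFAIL}$$

where $c_0$ is the intercept and $c_1$ to $c_6$ are the computing weights of the six retained variables, as defined in Materials and Methods.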
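The linear re-scaling that turns relative weights into a score ranging from 0 to 100 can be illustrated as follows. This is a sketch of the general idea only: it assumes every component has already been scaled to lie between 0 and 1, the function names are hypothetical, and the numeric weights are invented for illustration and are not the CombiWISE weights.

```python
def computing_weights(relative_weights):
    """Convert relative weights into an intercept and weights for a 0-100 score.

    Assumption: every component lies in [0, 1]. Positive weights mark components
    that rise with disability, negative weights components that fall.
    """
    raw_max = sum(w for w in relative_weights.values() if w > 0)
    raw_min = sum(w for w in relative_weights.values() if w < 0)
    scale = 100.0 / (raw_max - raw_min)
    intercept = -raw_min * scale
    return intercept, {name: w * scale for name, w in relative_weights.items()}


def composite_score(components, intercept, weights):
    """Intercept plus the weighted sum of the (already 0-1 scaled) components."""
    return intercept + sum(weights[name] * components[name] for name in weights)


# Made-up relative weights; only the signs follow the constraints in the text.
relative = {"EDSS": 0.3, "SNRS": -0.4, "log25FW": 0.15, "25FWFAIL": 0.15}
intercept, weights = computing_weights(relative)

no_disability = {"EDSS": 0.0, "SNRS": 1.0, "log25FW": 0.0, "25FWFAIL": 0.0}
max_disability = {"EDSS": 1.0, "SNRS": 0.0, "log25FW": 1.0, "25FWFAIL": 1.0}
print(round(composite_score(no_disability, intercept, weights), 6))
print(round(composite_score(max_disability, intercept, weights), 6))
```

The intercept arises from the negatively weighted component: a subject with the best possible SNRS and no other deficits must map to 0, so the negative contribution has to be offset by a constant.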
To assess the performance of the optimized CombiWISE, we compared the t-statistics for the linear change over time in the mixed model across 500 permutations of the withheld validation data sets (Figure 2D). We observed that CombiWISE outperformed all other clinical scales. In 93.4% of the 500 permuted validation datasets, CombiWISE generated a larger test statistic than SNRS, which is the highest performing single clinical scale, with an average gain of 0.85 t-statistic units and a maximum gain of over 2.5 units. CombiWISE also outperformed EDSS and log25FW in over 97% of the permuted validation datasets with average gains of approximately 1.45 and 2 t-statistic units, respectively. This gain in t-statistic corresponds to higher power in detecting clinical changes over time, particularly if the changes are relatively small (Figure S4 in Supplementary Material), as would be expected in a Phase II trial.
CombiWISE correlates strongly with all clinical scales, including cognitive SDMT (which does not contribute to CombiWISE) in the cross-sectional evaluation of >300 untreated neurological patients and HVs (Figure 3A). Furthermore, CombiWISE can reliably detect linear progression of clinical disability in all three progressive MS cohorts, often even in intervals as short as 6 months (Figure 3B). MRI measures generally have high technical/biological variability: with SNR computed as the average absolute yearly change in patients divided by the average absolute scan-rescan difference in HVs (Table S3 in Supplementary Material), successful MRI biomarkers had SNR between 1.45 and 2.94. By contrast, CombiWISE has an SNR of 7.66, even when we used a more stringent definition of SNR, computed as the average absolute yearly change in patients divided by the average absolute yearly change in HVs. Finally, CombiWISE shows no overlap of values between HVs and progressive MS patients (Figure 3C), in contrast to MRI measures. For comparison, we selected the best-performing MRI variable, radial diffusivity of the medulla, which shows complete overlap of the values between HVs and moderately to severely disabled progressive MS patients (Figure 3D).
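Both SNR definitions used in this paragraph share one form: the mean absolute yearly change in patients divided by a mean absolute reference difference in HVs (scan-rescan differences for MRI, yearly changes for the more stringent clinical definition). A minimal sketch, with a hypothetical function name and invented numbers:

```python
from statistics import mean


def signal_to_noise(patient_yearly_changes, hv_reference_differences):
    """Mean absolute yearly change in patients over mean absolute HV difference."""
    signal = mean([abs(x) for x in patient_yearly_changes])
    noise = mean([abs(x) for x in hv_reference_differences])
    return signal / noise


# Invented numbers for illustration only (not study data).
patients = [2.0, -4.0, 6.0]
hv_scan_rescan = [1.0, -1.0]
print(signal_to_noise(patients, hv_scan_rescan))
```

Using yearly changes in HVs as the denominator is more stringent because it folds any biological drift and learning effects in healthy subjects into the noise term, in addition to pure measurement error.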

Power Analysis
In the power calculation, CombiWISE is the best-performing outcome in the IPPOMS1 cohort (Figure 4, Table S4 in Supplementary Material). In a parallel-group, 2-year treatment study with 1:1 randomization, 34, 53, and 95 subjects per arm are required to detect 50%, 40%, and 30% drug effects, respectively, with 80% power, 5% significance level and two-sided comparisons (Figure 4A). In a baseline-versus-treatment paradigm, 19, 28, and 49 subjects per arm are needed to detect 50%, 40%, and 30% drug effects on CombiWISE, respectively (Figure 4B). We included EDSS in Figure 4 for direct comparison.

Discussion
There remains a large unmet need for development of therapies for progressive MS. While multiple candidate therapies are available at any given time, the present bottleneck resides in the inability to screen them in small, but adequately powered, Phase II trials that can correctly predict efficacy on the FDA-accepted clinical endpoint utilized for Phase III trials. This study provides a comprehensive comparison of outcomes in the same patient group(s) within a real-world situation of Phase II clinical trials.
The reason for implementing a pre-specified comparison of a large number of candidate outcomes into progressive MS trials initiated 7 years ago was that such a comparison was lacking in the public domain then, and is still lacking today; surprisingly, the papers that describe candidate new outcomes do not compare these with the traditional outcomes, such as EDSS (2,10). Nevertheless, the excellent experience with MRI CELs as a highly sensitive and reproducible outcome in RRMS poised the MS field to trust the superiority of MRI outcomes over clinical outcomes for progressive MS, in the absence of factual evidence. This belief in the superiority of MRI outcomes is virtually universal, as evidenced by the fact that brain atrophy represents the primary outcome in the vast majority of currently ongoing Phase II trials in progressive MS (26).
Despite the fact that we validated six qMRI measures as reliably changing in three progressive MS cohorts over 1 year, we found that they had low to absent correlations with clinical scales and a strong overlap of values between HVs and MS patients. Among the tested volumetric measures, enlargement of ventricles measured by LesionTOADS was the most reproducible MRI outcome; it outperformed the SIENA brain volume change measurement, but did not outperform CombiWISE in any of the tested cohorts. This MRI outcome showed modest correlations with cognitive scales and the 9HPT, at least in the cross-sectional paradigm, which supports its biological relevance. However, its lack of correlation with EDSS (and SNRS, 25FW, or CombiWISE) makes it questionable whether the efficacy on brain atrophy observed in Phase II trials will correctly predict efficacy on clinical outcomes in Phase III studies. In fact, simultaneously measured changes in brain atrophy and clinical parameters were already contradictory in a Phase II progressive MS trial (27). For these multiple reasons, CombiWISE is a better outcome for Phase II trials of progressive MS than ventricular or brain atrophy.
The remaining MRI markers that reproducibly progressed in all three cohorts were DTI biomarkers measured in the head of the caudate nucleus, midbrain, and medulla. These putative biomarkers of CNS tissue integrity were all correlated with each other, but did not correlate with clinical outcomes, even when we re-analyzed data separately for each of the scanners to avoid the influence of the observed "scanner effect" (Figure S6 in Supplementary Material). The most concerning observation was a broad overlap of DTI-derived measurements between HVs, who lack neurological disability, and moderately to severely disabled MS patients. If the measured yearly increase in DTI outcomes in the MS cohort truly reflected a yearly increase in CNS tissue destruction, then extrapolating such a yearly rise in DTI parameters across the long disease duration of our MS cohort would position this cohort well above the HVs. Instead, the relatively high scan-rescan variability in HVs in comparison to the yearly changes measured in MS cohorts (i.e., low SNR), the prominent scanner effect (Figure S7 in Supplementary Material), and the statistically significant, scan-date-related longitudinal drift of DTI data measured across 3 years (Figure S8 in Supplementary Material) suggest that technical aspects of MRI scanning, rather than biological changes, are more likely the cause of the measured yearly increase in DTI parameters.
These observations caution against uncritical interpretations that changes in qMRI parameters unequivocally reflect structural alterations of CNS tissue. An informative review (28) highlights this erroneous assumption: MRI does not measure brain structure; instead, it infers brain structure from the radio-frequency signals of energized hydrogen protons, which are affected by both the technical parameters of magnetic fields and the magnetic properties of the surrounding tissue. Quantitative data derived from advanced imaging techniques, such as DTI, are computed from mathematical models whose parameters are influenced by scanner hardware, sequences, and post-processing methods (29). Furthermore, MRI, as a physical-chemical measure, is also influenced by biological phenomena that have nothing to do with the structural integrity of CNS tissue, such as changes in body weight, lipid levels, hydration, and use of alcohol or pharmaceutical agents (28). While these confounding factors are easily controlled in animal experiments from which pathological-MRI correlations have been derived (30,31), they are impossible to eliminate in the real-world experience of human clinical trials performed on multiple MRI scanners and spanning several years.
Thus, reliance on MRI markers as the primary outcome in progressive MS trials is currently not advisable because of their lack of surrogacy with clinical scales. Surrogacy (32) requires that the biomarker predicts results on clinical outcomes, and does so in a considerably shorter time period than clinical scales. While correlation with clinical scales is not sufficient, it is nevertheless a prerequisite for surrogacy, and none of the MRI markers tested in the current study fulfills this condition in relation to EDSS, SNRS, 25FW, or CombiWISE. We cannot generalize our conclusions to magnetic resonance spectroscopy or magnetization transfer biomarkers (33)(34)(35)(36), which we did not test. Hopefully, future studies of these potentially promising biomarkers will include longitudinal assessment of their variance, the influence of scanner(s) and sequences, overlap with data generated in HVs, and direct comparison with the simultaneously acquired clinical scales.
Strong to moderate correlations between different clinical scales indicate that these do reflect the evolution of the underlying disease. Nevertheless, their sensitivity for yearly disease progression is low, as none of them demonstrated statistically significant progression in all three longitudinal cohorts and yearly changes measured by different scales did not correlate. While SDMT correlated more strongly with the remaining clinical scales than PASAT did, the modeling permutations found both cognitive tests too insensitive to reliably detect yearly progression in small Phase II trials. The idea that a composite score could amplify changes in individual clinical scales has been tested before. Goodkin introduced a composite outcome consisting of designated changes in any of the four utilized clinical scales (37) and demonstrated superior sensitivity of such a composite (38). An analogous composite primary endpoint was used in recently announced negative trials of natalizumab (ASCEND trial, NCT01416181) and opicinumab (anti-Lingo-1 SYNERGY trial, NCT01864148). Unfortunately, the specificity of such an inclusive composite has not been published. The National MS Society Clinical Outcomes Assessment Task Force recommended development of a composite clinical measure in which individual components "should have high reliability" (39). The result of this effort was the MSFC (10), introduced without direct comparison to EDSS. In follow-up studies, MSFC change had considerably lower power than EDSS for detecting sustained disability in PPMS subjects (40). Our measurements concur with this conclusion.
In contrast to the aforementioned efforts, we used statistical modeling to optimize a composite clinical metric that "weights" simultaneously captured data from several clinical scales, selected based on their ability to detect MS disease progression in a majority of modeling cohort permutations. CombiWISE is based on the intersection of these scales, benefiting from the noise-limiting feature of combining partially overlapping measurements. Strong correlations of CombiWISE with traditional clinical outcomes observed in multiple cohorts and its excellent SNR fully support the stated conceptual advantages of this scale; because EDSS represents only 28% of the CombiWISE score, retaining strong, statistically significant correlations between changes in CombiWISE and EDSS in a small, year-long study is actually not intuitive (see Figure S5 in Supplementary Material for formal assessment of this statement). The reason why CombiWISE is more than twice as sensitive as EDSS in detecting progression of disability lies in the discreteness of EDSS: while a patient may remain on any given EDSS step for a long time, CombiWISE can detect continuous disease progression as measured by three alternative clinical scales. Yet, thanks to the strong correlation between CombiWISE and EDSS, one can calculate from the resulting regression slopes that a 1-point change in EDSS corresponds on average to a 7.50-point change in CombiWISE with a standard error of 0.10, allowing extrapolation of clinical meaning from the CombiWISE measurements. Finally, CombiWISE provides approximately normally distributed data (Figure S9 in Supplementary Material), permitting the use of parametric statistical techniques even in small cohorts. We observed that both the conceptually devised and the numerically optimized versions of CombiWISE detect significant disease progression in all three longitudinal progressive MS cohorts, in intervals as short as 6 months.
Thus, using CombiWISE as a continuous variable captured every 6 months should provide a further advantage over event-driven outcomes, such as the one used in a trial of ocrelizumab in PPMS (ORATORIO trial; NCT01194570). While the relatively low number of patients in each of the three independent longitudinal cohorts (i.e., N = 29-35) may be viewed as a limitation, it shows that CombiWISE reproducibly measures yearly disease progression in cohorts that correspond in size to the treatment versus placebo arms of economical Phase II trials.
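The regression-based correspondence between the two scales (a 1-point change in EDSS corresponding on average to a 7.50-point change in CombiWISE, standard error 0.10) can be used to express a CombiWISE change in EDSS-equivalent units. The sketch below is illustrative only: the function name is hypothetical, and the interval is an assumption that propagates only the uncertainty of the slope under a normal approximation. It describes an average relationship, not a prediction for an individual patient.

```python
# Values taken from the text: CombiWISE points per 1 EDSS point.
SLOPE = 7.50
SLOPE_SE = 0.10


def combiwise_to_edss_equivalent(delta_combiwise, z=1.96):
    """Convert a CombiWISE change into an average EDSS-equivalent change.

    Returns (point_estimate, lower, upper). The bounds come from dividing
    by slope +/- z * SE (an assumed, approximate 95% range for z = 1.96).
    """
    point = delta_combiwise / SLOPE
    bound_a = delta_combiwise / (SLOPE + z * SLOPE_SE)
    bound_b = delta_combiwise / (SLOPE - z * SLOPE_SE)
    lower, upper = sorted((bound_a, bound_b))  # handles negative changes
    return point, lower, upper


# A 7.5-point CombiWISE change corresponds on average to 1 EDSS point.
print(combiwise_to_edss_equivalent(7.5))
```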
In conclusion, CombiWISE has been validated as the most sensitive clinical outcome for progressive MS. It has consistently higher sensitivity for detecting longitudinal changes in progressive MS in comparison to MRI measures of brain atrophy, which are currently broadly utilized in Phase II progressive MS trials. In contrast to all tested MRI measures, CombiWISE can predict changes in EDSS, presently used for regulatory approval. Substituting EDSS with CombiWISE requires over 100 fewer subjects per arm (95 for CombiWISE versus 200 for EDSS) in a parallel-group design to detect a 30% drug effect in a 2-year study.

Author contributions
BB designed and supervised the study. PK, IC, GN, TW, CB, MG, and BB analyzed the data and drafted the manuscript and figures. DG, MT, IC, WK, BS, JO, KF, and TL collected or generated data and contributed to reviewing and editing the manuscript.