Interpretable Trials: Is Interpretability a Reason Why Clinical Trials Fail?

Background: Some clinical trials use composite measures, indices, or scales as proxies for independent variables or outcomes. The interpretability of such derived measures may be unsatisfactory, and adopting indices of poor interpretability in clinical trials may lead to trial failure. This study aims to understand the impact of using indices of different interpretability in clinical trials. Methods: The interpretability of indices was categorized as fair-to-poor, good, or unknown. Based on the literature, frailty indices were considered to have fair-to-poor interpretability, body mass index (BMI) was considered highly interpretable, and all other indices were considered to have unknown interpretability. Trials were searched on clinicaltrials.gov on October 2, 2018, for the use of indices as conditions/diseases or as other terms. The trials were grouped as completed, terminated, active, or other status. We tabulated the frequencies of frailty, BMI, and other indices. Results: There were 263,928 clinical trials found, of which 155,606 were completed or terminated. Among 2,115 trials adopting indices or composite measures as conditions or diseases, 244 adopted frailty indices and 487 used BMI without frailty indices. Significantly higher proportions of trials of unknown status used indices as conditions/diseases or other terms, compared to completed and terminated trials. The proportions of active trials using frailty indices were significantly higher than those of completed or terminated trials. Discussion: Clinical trial databases can be used to understand why trials may fail. Based on the findings, we suspect that using indices of poor interpretability may be associated with trial failure. Interpretability has not been conceived as an essential criterion for outcomes or proxy measures in trials. We will continue verifying the findings in other databases or data sources and apply this research method to improve clinical trial design. To prevent patients from experiencing trials likely to fail, we suggest further examining the interpretability of the indices in trials.


INTRODUCTION
There are a variety of diseases or conditions assessed in clinical trials. Many of them are direct measurements of physical conditions or pathological diagnoses, e.g., survival status and blood pressure. In contrast, several of them cannot be directly measured or quantified based on single variables. For example, frailty is theorized as a geriatric syndrome that can be defined by a composition of variables, from four to 70 depending on the theories used to support the definitions (1-6). Frailty is often calculated on a continuous scale and dichotomized to derive frailty status (1-3, 5, 7-11). It has also been associated with a variety of outcomes, such as falls and mortality (1-5, 8, 10, 11). In addition to serving as a proxy measure for health status, frailty itself has been used as an outcome of interventions (1-3, 8-12). One of the reasons is that frailty status has been linked to pathological changes, such as sarcopenia (13). Frailty status has been conceived as an opportunity to shift the aging trajectory and is actively used in various trials (1).
However, there are issues related to the use of composite measures or indices in clinical trials, such as frailty indices and three diagnoses of mental illnesses (1,14). One of the critical issues is interpretability. Three of the most widely used frailty indices have been found difficult to interpret for several reasons (1). First, they may be better explained by noise or bias introduced by inadequate data processing (1). Body mass index (BMI) and three of the widely used frailty indices have been approximated with their input variables and with bias variables (1). While 99.4% of the variance of BMI can be explained by its input variables, bias can explain 71.9% of the variance of one frailty index (1). For the same index, the bias variables also predict mortality better than frailty status does (1). Second, the frailty statuses are in fact the sum of the continuous frailty indices and the biases induced by variable dichotomization (1). It is well recognized that variable categorization can be associated with information bias (1,15) and misclassification bias (16). Third, the threshold for dichotomization is also problematic, and there are conflicting views about the choice of cut-off. One theory suggests not adopting common symptoms, whereas the other requires at least 20% of the population to meet the frailty criteria (1,8). The choice of threshold is also related to the direction of bias (1). Lastly, the frailty index consisting of 70 variables in the initial theory can be simplified to use fewer input variables (1). This is because many input variables are likely to be correlated with each other, and the sum of many correlated variables may not be more informative than a few of them (1).
The use of conditions or diseases that may not be fully interpretable in clinical trials warrants caution because the consequences can be severe. If these composite conditions are used as proxy measures to predict the outcomes, researchers may be misled and ignore other factors that can better explain the outcomes, especially the input variables of the composite measures or indices (1,7). It has been noticed that the input variables of the three frailty indices predicted mortality better than the indices themselves (1). In addition, it is very likely that there are numerous alternative indices that can be mined and that better predict outcomes (1,17). If the conditions defined by indices are used as outcomes, the danger is that unnecessary interventions may be designed and tested among patients who should be treated otherwise (16). Taking metabolic syndrome as an example, information on five conditions is required to confirm the diagnosis (18-20). However, it was later found not to be predictive of other important outcomes, such as cardiovascular disease and diabetes (21).
To understand the extent of the interpretability problem in clinical trials, we think it important to estimate the numbers of trials that may be involved in conditions that are not fully interpretable, as well as the impact of interpretability on the execution of clinical trials. This hypothesis-generating study aims to provide initial evidence regarding whether interpretability of terms in trials may be associated with early termination or suspension of clinical trials.

METHODS
The clinicaltrials.gov site was searched with the indices of different interpretability listed in Table 1. This site maintains information on clinical studies (22). It was established as a result of legislation passed in 1997 and was made public in 2000 (22). Its scope was expanded in 2007 and 2017 to include more types of trials (22). In addition to summary information on trials, the following sections of the trial protocols were included: diseases or conditions, interventions, titles, description, study design, requirements for participation (eligibility criteria), locations where the study was being conducted, contact information for the study locations, and links to relevant information on other health Web sites (22). Trial results, such as participant description, study outcomes, and adverse events, were sometimes available (22).
To demonstrate the interpretability issue, the numbers of trials that included the above-mentioned indices as conditions/diseases or other terms were tabulated. Based on the evidence, frailty indices were considered not adequately interpretable because they met the following criteria: (1) more than 25% of variance explained by biases or measurement errors and (2) excessive numbers of redundant variables (1). Based on these criteria, frailty indices were classified as having fair-to-poor interpretability (1). BMI was found to be highly interpretable (1). Other indices were considered to have unknown interpretability.
Two search strategies were available on this site: conditions/diseases or other terms. Conditions or diseases were defined as "the disease, disorder, syndrome, illness, or injury that is being studied" (23). Other terms were defined as a search feature that helped to narrow the search by looking for the name of a drug or the registration number of a clinical study (23).
Trials were classified into four categories based on recruitment status: completed, terminated, active, and other. The completed trials were the studies that had ended normally, and participants were "no longer being examined or treated (that is, the last participant's last visit has occurred)" (23). Terminated trials were those that had stopped early and would not start again. Active trials were those "not yet recruiting," "recruiting," "enrolling by invitation," and "active, not recruiting." Other statuses were "suspended," "withdrawn," and "unknown."
Table note: p < 0.001 for the distribution of indices of different interpretability across four types of trials regarding the search in both conditions/diseases and other terms according to Chi-square tests (Chi-squared = 105.9 and 1283.6, respectively). Search date on clinicaltrials.gov: October 2, 2018.
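The grouping of recruitment statuses into the four analysis categories can be sketched as a simple lookup. This is only an illustration: the status strings are taken from the quotations above, whereas the exact spelling in a registry export may differ, and the helper name `group_status` is hypothetical.

```python
# Recruitment statuses quoted in the text, mapped to the four analysis groups.
# Assumption: a real registry export may spell these statuses differently.
STATUS_GROUPS = {
    "completed": "completed",
    "terminated": "terminated",
    "not yet recruiting": "active",
    "recruiting": "active",
    "enrolling by invitation": "active",
    "active, not recruiting": "active",
    "suspended": "other",
    "withdrawn": "other",
    "unknown": "other",
}


def group_status(recruitment_status: str) -> str:
    """Return the analysis group for a recruitment status (hypothetical helper)."""
    key = recruitment_status.strip().lower()
    if key not in STATUS_GROUPS:
        # Fail loudly rather than silently assigning an unlisted status.
        raise ValueError(f"Unrecognized recruitment status: {recruitment_status!r}")
    return STATUS_GROUPS[key]


print(group_status("Enrolling by invitation"))
print(group_status("Withdrawn"))
```

Raising an error for unlisted statuses is a design choice of this sketch; it avoids quietly placing new registry statuses into the "other" group.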

Estimation of the Interpretability of Clinical Trials by Types of Indices Used
The numbers of clinical trials related to different types of indices were calculated based on the search terms in Table 1. The relationships between the different search strategies are shown in Figure 1. The number of trials that involved any type of index or composite measure was obtained by searching for indices, composite measures, frailty, or BMI. The total numbers of trials using BMI only, frailty indices, or any indices other than frailty or BMI were calculated based on these searches. The trials that used BMI only were considered to use a condition that was highly interpretable (1). Those using frailty indices with or without BMI were of fair-to-poor interpretability (24). Those using indices other than frailty or BMI were of unknown interpretability. The percentages of these trials in relation to all trials were calculated. The distribution of trials of different interpretability was compared between the two types of search fields (conditions/diseases or other terms) through chi-square tests. The associations between interpretability and trial status (completed, terminated, and other) were investigated with multinomial logit regression (25). With the use of good-interpretability indices as diseases/conditions as the reference category, the independent variables were (1) using fair-to-poor-interpretability indices as diseases/conditions, (2) using unclear-interpretability indices as diseases/conditions, (3) using good-interpretability indices for other terms, (4) using fair-to-poor-interpretability indices for other terms, (5) using unclear-interpretability indices for other terms, and (6) not using any indices or composite measures. The outcomes were trial statuses: completed, terminated, and other statuses. The effect sizes were estimated with odds ratios (ORs). P-values <0.05, two-tailed, were considered statistically significant. The statistical analysis was conducted with R (26) and RStudio (27).
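The trial-level classification rule described above can be sketched as follows. The function name and boolean flags are hypothetical, and the handling of a trial that mentions both BMI and another non-frailty index is an assumption, since the text does not state which group such a trial falls into.

```python
def classify_interpretability(uses_bmi: bool, uses_frailty: bool,
                              uses_other_index: bool) -> str:
    """Assign an interpretability category to one trial (illustrative sketch)."""
    # Frailty indices, with or without BMI, are rated fair-to-poor.
    if uses_frailty:
        return "fair-to-poor"
    # BMI without frailty indices is rated good.
    # Assumption: BMI takes precedence over other non-frailty indices.
    if uses_bmi:
        return "good"
    # Any other index or composite measure has unknown interpretability.
    if uses_other_index:
        return "unknown"
    # No index or composite measure was used.
    return "not applicable"


print(classify_interpretability(uses_bmi=True, uses_frailty=True, uses_other_index=False))
print(classify_interpretability(uses_bmi=True, uses_frailty=False, uses_other_index=False))
```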

RESULTS
There were 284,644 clinical trials identified on October 2, 2018. There were 80,290 active trials (28.2%), 151,605 completed trials (53.3%), 16,052 terminated trials (5.6%), and 36,153 trials of other status (12.7%). Among 3,000 trials that adopted any indices as conditions or diseases, 254 used BMI without frailty indices and were rated as trials involving interpretable conditions (0.09% of all trials). There were 291 trials using frailty indices as conditions that were rated as fair-to-poor interpretability (0.1%) and 2,455 using other types of indices of unknown interpretability (0.9%). The other 281,644 trials did not use any types of indices as conditions or diseases (98.95%). For terms other than conditions or diseases, 3,753 trials used BMI without frailty indices and 263 used frailty indices (1.32 and 0.1% of all trials, respectively). There were 39,194 trials using other indices (13.8%) and 241,434 without any indices as other terms (84.8%).
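The percentages for the conditions/diseases search can be recomputed from the counts reported above. The following is a minimal sketch; the dictionary labels and the helper name `tabulate_percentages` are illustrative choices, not part of the original analysis.

```python
# Counts of trials by index type used as conditions/diseases, as reported in the text.
counts = {
    "BMI without frailty indices": 254,
    "frailty indices": 291,
    "other indices": 2455,
    "no index": 281644,
}
total_trials = 284644


def tabulate_percentages(counts, total):
    """Return each count as a percentage of all trials (hypothetical helper)."""
    if sum(counts.values()) != total:
        raise ValueError("Category counts do not add up to the total")
    return {label: 100.0 * n / total for label, n in counts.items()}


percentages = tabulate_percentages(counts, total_trials)
for label, pct in percentages.items():
    print(f"{label}: {counts[label]:,} trials ({pct:.2f}%)")
```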

Interpretability and Active Trials
For the 80,290 active trials, 56, 145, and 696 included BMI without frailty indices, frailty indices, and other indices as conditions or diseases (0.1, 0.2, and 0.9% of all active trials, respectively). There were 79,393 active trials not involving any indices or composite measures (98.9%). There were 1,247, 144, and 13,772 active trials adopting BMI without frailty indices, frailty indices, and other indices as other terms (1.6, 0.2, and 17.2% of all active trials, respectively). There were 65,127 active trials not including any indices (81.1%).

Interpretability and Terminated Trials
Among 16,052 terminated trials, 7, 6, and 88 included BMI without frailty indices, frailty indices, and other indices as conditions or diseases (0.04, 0.04, and 0.6% of all terminated trials, respectively). There were 15,951 terminated trials that did not involve any indices or composite measures as conditions or diseases (99.4%). There were 116, 13, and 2,334 terminated trials that included BMI without frailty indices, frailty indices, and any other indices as other terms (0.7, 0.1, and 14.5% of all terminated trials, respectively). There were 13,589 terminated trials not involving any indices or composite measures as other terms (84.7% of all terminated trials).
[...] 1, 0.1, and 11.7% of all trials of other statuses, respectively). There were 31,507 trials of other statuses not including any indices (87.2%).

Use of Indices in Different Types of Trials
There was an increasing number of clinical trials, with 80,290 active trials vs. 167,657 completed or terminated trials since the initiation of the clinical trial registry. The distribution of three types of indices used was not the same across terminated, completed trials, and trials of other statuses (Chi-squared = 35.6 and 164.5, respectively, p < 0.001 for conditions/diseases and other terms).
FIGURE 1 | Distribution of clinical trials of different interpretability. Blue = trials involving the diseases or measures of poor to fair interpretability, red = trials using body mass index that is highly interpretable, black = trials involving the diseases or measures of unknown interpretability. Not applicable = no index or composite measure used in the trials, OR = odds ratio. Other statuses including "suspended," "withdrawn," and "unknown."
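For readers who want to see how a chi-square statistic of this kind is obtained from a contingency table, a minimal pure-Python sketch follows. It is not the authors' R code, and the example table uses only the active and terminated counts for conditions/diseases reported in the text, so it does not reproduce the statistics above, which compare other groups of trials.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Columns: BMI without frailty, frailty, other indices, no index (conditions/diseases).
observed = [
    [56, 145, 696, 79393],  # active trials, as reported in the text
    [7, 6, 88, 15951],      # terminated trials, as reported in the text
]
statistic, dof = chi_square_statistic(observed)
print(f"chi-squared = {statistic:.1f}, degrees of freedom = {dof}")
```

The p-value would then be read from the chi-square distribution with the returned degrees of freedom, for example with a statistics package.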

Multinomial Logit Regression
Table 2 lists the ORs of using indices of different interpretability for diseases/conditions or other terms. Compared to completed trials, using indices of fair-to-poor and unclear interpretability in clinical trials was associated with early termination (OR = 3.4 and 2.7, 95% CI = 1.3-8.8 and 1.3-5.8, respectively). The use of indices of other interpretability or no use of any indices was not associated with trial termination (p > 0.05 for all).
Compared to completed trials, not using any indices or composite measures in clinical trials was associated with a decreased likelihood of trial suspension, withdrawal, or unknown status (OR = 0.37, 95% CI = 0.25-0.53). The use of indices of any interpretability was not associated with trial suspension, withdrawal, or unknown status (p > 0.05 for all).
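As a rough illustration of how an odds ratio and its 95% confidence interval relate to the underlying counts, the sketch below computes a cross-product OR with a Wald interval from a 2 x 2 table. This is not the multinomial logit model fitted in R; when a model contains only a single categorical predictor, its ORs should coincide with such cross-product ratios, but that simplification is an assumption here. The counts in the example are hypothetical, and the function name is illustrative.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Cross-product odds ratio with a Wald confidence interval.

    a: index group, outcome of interest (e.g., terminated)
    b: index group, reference outcome (e.g., completed)
    c: reference group, outcome of interest
    d: reference group, reference outcome
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("All four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to show the calculation.
or_value, ci_low, ci_high = odds_ratio_wald(a=10, b=20, c=5, d=40)
print(f"OR = {or_value:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to a two-tailed p-value below 0.05 under the Wald approximation.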

DISCUSSION
There are several intriguing findings worthy of further investigation. First, not using any indices or composite measures is associated with a decreased likelihood of trial suspension, withdrawal, or unknown status. The reasons why clinical trials may fail have been discussed from many aspects, including measurement error, statistical power, and lack of efficacy (28-30). To our knowledge, our finding is the first to support the hypothesis that the use of indices and composite measures of different interpretability in clinical trials may be associated with the rates of trial completion or early termination. Several reasons may link indices to trial failure. Indices are possible sources of measurement error and an important factor in inferior outcome predictability (1,7). Indices themselves can also serve as illusory outcomes (1,16). The association between index use and trial failure needs further investigation, and we will assess whether it is causal in the future.
Second, recent and active trials are increasingly using frailty indices, whether for conditions/diseases or other terms. This shows the researchers' interest in novel opportunities to improve aging and human health (1). However, three commonly used frailty indices have been found to represent the frailty theories very poorly (1). There are also conflicts between frailty theories regarding assumptions and variable selection (1). Although some researchers have criticized the use of vaguely defined frailty indices (31), those that are likely to have fair-to-poor interpretability remain frequently used. This may place patients at risk of receiving unnecessary or even harmful interventions (1). To prevent harm to patients, we suggest further examination of the interpretability of the indices in trials based on published guidelines (1, 7).
Lastly, our research method using ClinicalTrials.gov provides a simple and feasible framework for future application. This study uses a classic epidemiologic method (32) to tabulate the distribution of clinical trials based on the use of indices of different interpretability. With the advances in text mining and machine learning (33, 34), we will continue using this method to screen other significant factors to improve the design of clinical trials.

Limitations
Although this study uses one of the most important sources of clinical trial information, ClinicalTrials.gov, limitations remain. First, it is unclear how these indices are used in the trials, especially for the trials that include indices or composite measures in other terms. Second, the trials of other status are those suspended, withdrawn, and of unknown status. Other completed or terminated trials may not have been successful in demonstrating a significant association between interventions and outcomes. Other advanced text mining methods may be required to further refine the definition of trial failure. Third, there are other important clinical trial registries to be studied (35). We think this study demonstrates the feasibility of analyzing trial databases. Fourth, indices are not used in a majority of the trials, so relatively few trials adopting indices were available. We will continue this analysis with more trials in the future. Fifth, frailty indices might not be an ideal measure of interpretability. We plan to test other terms for measuring interpretability. Lastly, one related issue is that compliance with clinical trial registration is incomplete (36). There are still trials that cannot be searched in public repositories.

CONCLUSION
Indices or composite measures of different interpretability have been used in clinical trials. Not using any indices or composite measures is associated with a decreased likelihood of trial suspension, withdrawal, or unknown status. The use of indices of fair-to-poor or unclear interpretability is associated with early trial termination. The proportion of trials using frailty indices as conditions/diseases or other terms is higher among active trials than among completed or terminated trials. Based on the findings, we hypothesize that using indices of poor interpretability may lead to trial failure. We will further test the hypothesis that the use of indices of inadequate interpretability causes trial failure and continue applying this research method to improve clinical trial design.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.