Large Sample Size Fallacy in Trials About Antipsychotics for Neuropsychiatric Symptoms in Dementia

Background: Atypical antipsychotics for neuropsychiatric symptoms in dementia have been tested in much larger trials than the older conventional drugs. The advantage of larger sample sizes is that false-negative findings become less likely and the effect estimates more precise. However, as sample sizes increase, the trials also get more expensive and time consuming while exposing more patients to drugs with unknown safety profiles. Moreover, a large sample size might yield a statistically significant effect that is not necessarily clinically relevant.
Objective: To assess (1) the variation in sample size and sample size calculations of antipsychotic trials in dementia, (2) the size of reported treatment effects and related statistical significance, and (3) general study characteristics that might be related to sample size.
Study Design and Setting: We performed a meta-epidemiological study of randomized trials that tested antipsychotics for neuropsychiatric symptoms in dementia. The trials compared conventional or atypical antipsychotics with placebo or another antipsychotic. Two reviewers independently extracted sample size, sample size calculations, reported treatment effects with p-values, and general study characteristics (drug type, trial duration, type of funding). We calculated a reference sample size of 83 and 433 per study group for the placebo-controlled and head-to-head trials, respectively.
Results: We identified 33 placebo-controlled trials and 18 head-to-head trials. Only 14 (42%) and 2 (11%), respectively, reported a sample size calculation. The average sample size per arm was 34 (range 6–179) in placebo-controlled trials testing conventional drugs, 107 (8–237) in such trials testing atypical drugs, and 104 (95–115) in such trials testing both drug types; it was 31 (10–88) in head-to-head trials. Thirteen out of 18 trials with sample sizes larger than required (72%) reported a statistically significant treatment effect, of which two (15%) were clinically relevant. None of the head-to-head trials reported a statistically significant treatment effect, even though some suggested non-inferiority. In placebo-controlled trials of atypical drugs, longer trial duration (>6 weeks) and commercial funding were associated with higher sample size.
Conclusion: Sample size calculations were poorly reported in antipsychotic trials for dementia. Placebo-controlled trials of atypical antipsychotics showed large sample size fallacy, while head-to-head trials were massively underpowered.


INTRODUCTION
Over the years, the sample sizes of antipsychotic trials in dementia have increased from as low as 18 in the 1960s to as high as 652 in the 1990s (Schneider et al., 2005; Cox Grad, 2009; Hulshof et al., 2015). The increase in sample sizes is generally viewed as a favorable development. Larger sample sizes provide more power to identify a treatment effect that is really present. In addition, the effect is estimated more precisely (narrower confidence intervals). Larger trials are also a natural consequence of head-to-head trials because the difference between two active drugs is generally expected to be small, and therefore, the required sample size needs to be relatively high.
However, larger sample sizes also make trials expensive and time consuming (Cox Grad, 2009). This can be a barrier for non-commercial investigators to perform a trial. Moreover, it can be ethically questionable to ask more patients to participate, especially when the safety of the tested drug has not yet been established (Schipper and Weyzig, 2008). Another disadvantage of a (very) large sample size is that a difference in outcomes between the groups will become (very) statistically significant, no matter how small or clinically meaningless it is (Sullivan and Feinn, 2012). If such results are nevertheless interpreted as clinically relevant, the 'large sample size fallacy' occurs (Lantz, 2013).
Sample size calculations for trials are based on four parameters if the response rate is the outcome. These are alpha, beta, the expected response rate in the active treatment groups, and the expected response rate in the comparison group (e.g. placebo) (Noordzij et al., 2010). Alpha is the probability of identifying a treatment effect that is not really present, which is usually set at 5%. Beta is the risk of not identifying a treatment effect that is really present, and is usually set at 20%. Sample size calculations for trials with continuous outcomes, such as the reduction of neuropsychiatric symptoms (NPS), are based on alpha, beta, the expected (difference between) means in the active and comparison group, and the population variance around the mean. Furthermore, the expected number of participants dropping out should be taken into account when determining the final target sample size of a trial.
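As a sketch of how these parameters combine, the standard normal-approximation formulas for two equally sized groups can be written as below. This particular form (unpooled variances, no continuity correction) is an assumption for illustration, not a formula stated in the text. The symbols are also introduced here: $p_1$ and $p_2$ are the expected response rates, $\mu_1 - \mu_2$ is the expected difference in means, $\sigma_1$ and $\sigma_2$ are the standard deviations in the two groups, and $d$ is the expected drop-out proportion.

```latex
% Per-group sample size for a difference in response rates
n_{\mathrm{rate}} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[p_1(1-p_1) + p_2(1-p_2)\bigr]}{(p_1 - p_2)^{2}}

% Per-group sample size for a difference in means
n_{\mathrm{mean}} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl(\sigma_1^{2} + \sigma_2^{2}\bigr)}{(\mu_1 - \mu_2)^{2}}

% Target sample size after allowing for drop-out
n_{\mathrm{target}} = \left\lceil \frac{n}{1 - d} \right\rceil
```

With alpha = 5% (two-sided) and beta = 20%, $z_{1-\alpha/2} \approx 1.96$ and $z_{1-\beta} \approx 0.84$, so the multiplier $(z_{1-\alpha/2} + z_{1-\beta})^2$ is roughly 7.85.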
A different expected treatment effect might explain why the sample sizes of antipsychotic trials increased over time. Perhaps atypical antipsychotics were expected to be less effective than conventional antipsychotics, even before it was shown in systematic reviews that they did not affect psychotic symptoms compared to placebo (Smeets et al., 2018). Alternatively, drop-out could have increased because recent trials lasted longer and participants have become more assertive.
On the other hand, general study characteristics that are not directly related to the sample size calculation might have contributed to the increase in trial sample sizes over the years. Large sample size is generally considered a sign of high trial quality, and this increases the probability of publication and citation (Dickersin et al., 1992). In addition, pharmaceutical companies will have more resources to fund larger trials than non-commercial organizations. Therefore, the aim of this meta-epidemiological study was to assess (1) the variation in sample size and sample size calculations of antipsychotic trials in dementia, (2) the size of the reported treatment effects and related statistical significance, and (3) general study characteristics that might be related to sample size.

Search Strategy
Two reviewers (TAH, HJL) used a list of conventional and atypical antipsychotics from the websites of the World Health Organization, Food and Drug Administration, and Wikipedia to search the literature (US Food and Drug Administration, 2013; World Health Organization, 2013; Wikipedia, 2015). First, we searched for studies in the electronic databases PubMed, CINAHL, EMBASE, and Cochrane Library with the string 'generic name of atypical/conventional antipsychotic' and trial and dementia (see online supplement). We restricted the position of the drug name to title and abstract. Subsequently, we manually searched the references of published systematic reviews, which were identified with the same electronic databases. Titles and abstracts of potentially eligible studies were retrieved from PubMed. In addition, we sought trials in trial registration websites with the abovementioned search terms if possible; otherwise we used only the term dementia. These three searches were last re-run in June 2019. Finally, we had used the databases of the Dutch Medicines Evaluation Board and the FDA to find unpublished trials as part of a previous search performed in 2015 (Hulshof et al., 2019).

Study Selection
We screened the title and abstract of the hits. Full texts of potentially eligible published studies and online protocols for unpublished studies were retrieved. Two reviewers used the full texts to determine definitive eligibility (TAH, HJL). The selected trials had to have been randomized and double-blind. They should have tested the efficacy of antipsychotics on NPSs in persons diagnosed with Alzheimer's disease or vascular dementia. The trial had to compare conventional or atypical antipsychotics with placebo or another antipsychotic (head-to-head trial). We excluded studies with multiple drugs in a single intervention arm, studies that were stopped early and thus did not reach the targeted sample size, and studies with a cross-over design, because this design requires non-standard sample size calculations. There were no restrictions with respect to publication date, language, and duration of the study.

Data Extraction
Two reviewers (TAH or SIMJ and HJL) independently extracted the following general study characteristics besides the sample size from the included studies: placebo-controlled or head-to-head trial, type of dementia (Alzheimer's disease, vascular dementia, mixed, unspecified), type of NPS (agitation, psychosis, diverse), setting (nursing home, hospital, outpatient clinic), active drug tested (conventional, atypical, or both), trial duration, type of funding (not-for-profit or commercial), and whether a sample size calculation was reported.
If the sample size calculation was reported, we extracted the input for sample size calculations: alpha, beta, expected treatment effects in the comparison groups (response rate, or mean symptom reduction with population variance at endpoint), and the expected drop-out rates. For trials that had been published in an abstract or online trial registration only, this data-extraction was considered inapplicable.
In addition, we extracted the reported treatment effects and related statistical significance. The primary outcome of trials that test antipsychotics for NPS in dementia is most often the difference in response rate or difference in reduction of target symptoms between the treatment groups. We extracted both for each trial with the related p-value. For the response rate, we extracted the number of patients with a clinically relevant improvement as defined by the authors. For reduction in symptoms, we extracted the difference in mean change from baseline to endpoint as measured with a symptom scale, such as the Cohen-Mansfield Agitation Inventory (CMAI) for agitation and Neuropsychiatric Inventory-Nursing Home (NPI-NH) for mixed symptoms. Initially, we also set out to extract standard deviations to calculate standardized mean differences, so that we could compare trial results. However, as many SDs turned out to be missing, we decided to extract the mean on the symptom scale at baseline as a reference instead (see data-analysis).
The primary source of extracted data was the published main results article. If that was not available, then conference abstracts or online published results were used. We received the individual patient data of two trials (Paleacu et al., 2008), and additional meta-data of two others for use in another study (De Deyn et al., 1999; De Deyn et al., 2005; Hulshof et al., 2019).

Data Analyses
First, we described the variation in sample sizes for the different types of trials by plotting the mean number of participants per comparison group against the publication year of the trial. We present these data for the conventional and atypical placebo-controlled trials and head-to-head trials separately.
To assess the adequacy of the reported sample sizes, we calculated reference sample sizes for trials with the response rate as outcome. For the placebo-controlled trials, we used an alpha of 0.05, beta of 0.20, a treatment response rate in the antipsychotic group of 55% and in the placebo group of 30%, and an expected drop-out of 30% (Brant, 2016). A treatment effect of 25% (NNT = 4) and drop-out rate of 30% is in line with previous literature and the reported response rates in antipsychotic trials in dementia (Jeste et al., 2008; Drouillard et al., 2013). We used a conservative drop-out rate of 30% (it was 26% on average in the included trials), so that the reference sample size would not be an underestimation. The required sample size per study group was 58 without loss to drop-out, and 83 with loss.
For the head-to-head trials (no placebo group), we used a treatment effect of 55% for the drug of interest and 45% for the control antipsychotic drug, because a 10% difference seems the upper limit of no difference. The expected drop-out rate was set at 10%, which is in line with the average drop-out rate in the included head-to-head trials. The required sample size was 389 per group without loss, and 433 with loss. We used the ssi command in Stata version 15.0 to calculate the reference sample sizes (StataCorp., 2017).
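The reference sample sizes above can be reproduced with a short calculation. The sketch below is not the Stata procedure used in the study; it assumes the simple unpooled normal-approximation formula without continuity correction, which happens to return the reported figures. The function names `n_per_group` and `inflate_for_dropout` are illustrative.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, beta=0.20):
    """Per-group n for comparing two response rates (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


def inflate_for_dropout(n, dropout_pct):
    """Enlarge n so that n completers remain after dropout_pct percent drop out."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-n * 100 // (100 - dropout_pct))


# Placebo-controlled reference: 55% versus 30% response, 30% drop-out.
placebo_n = n_per_group(0.55, 0.30)
placebo_target = inflate_for_dropout(placebo_n, 30)

# Head-to-head reference: 55% versus 45% response, 10% drop-out.
head_n = n_per_group(0.55, 0.45)
head_target = inflate_for_dropout(head_n, 10)

print(placebo_n, placebo_target)  # 58 83
print(head_n, head_target)        # 389 433
```

Formulas that pool the variance or apply a continuity correction would give somewhat larger numbers, so the agreement with 58/83 and 389/433 holds only under the stated assumption.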
To calculate reference sample sizes based on the outcome mean symptom reduction, the minimal clinically important difference (MCID) is required. However, the MCID is not known for most symptom scales used in this field (Shabbir and Sanders, 2014). The exception is the NPI, which was found to have an MCID of at least 8.0 (Howard et al., 2011; Zuidema et al., 2011). Nine of the included placebo-controlled trials in our study used this instrument, and we used the reported data to check our calculated reference sample size based on response rates. The reported mean reduction in symptoms was 19 (SD 14) for the placebo group (see Supplementary Table 1), and hence, assuming an MCID of 8.0, 27 (SD 16) for the antipsychotic group. We calculated a required sample size of 80 based on these data, and this finding confirms the reference sample size of 83 based on response rates. In addition, the MCID of 8.0 reflects an SMD of 0.5 given the SD of 16 reported in the included trials. This is in line with the lower limit for a visible (medium) treatment effect suggested by Cohen (2007).
The next step was to assess whether studies with larger sample size reported statistically significant treatment effects that were not clinically relevant (difference in response rate <25%; difference in symptom reduction < MCID or SMD <0.5), which would suggest the presence of large sample size fallacy. Treatment effects in terms of reported response rates can be compared between trials with varying sample sizes. However, it was not possible to use MCIDs or SMDs to compare reported reductions in symptoms across different symptom scales. Therefore, we calculated the relative symptom reduction as the ratio of the difference in symptom reduction between the study groups relative to the baseline mean in the groups. This approach has been used before (Smith et al., 1974). Moreover, the MCID of 8.0 on the NPI and a mean baseline of 39 (see Supplementary Table 1) would translate into a relative symptom reduction of 21%. Hence, a relative symptom reduction of ≥20% seems appropriate.
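The NPI-based check can be sketched in the same way. The code below assumes the standard normal-approximation formula for two means with unequal SDs, and it assumes that the reported figure of 80 includes the 30% drop-out allowance; neither detail is stated explicitly in the text. The function name `n_per_group_means` is illustrative.

```python
import math
from statistics import NormalDist


def n_per_group_means(diff, sd1, sd2, alpha=0.05, beta=0.20):
    """Per-group n for detecting a difference in means (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(1 - beta)
    return math.ceil(z ** 2 * (sd1 ** 2 + sd2 ** 2) / diff ** 2)


mcid = 8.0            # MCID on the NPI
sd_placebo = 14       # SD of symptom reduction, placebo group
sd_drug = 16          # SD of symptom reduction, antipsychotic group
baseline_mean = 39    # mean NPI score at baseline

n_completers = n_per_group_means(mcid, sd_placebo, sd_drug)
# Allow for 30% drop-out using integer ceiling division (assumed step).
n_target = -(-n_completers * 100 // (100 - 30))

smd = mcid / sd_drug                       # standardized mean difference
relative_reduction = mcid / baseline_mean  # MCID relative to baseline mean

print(n_completers, n_target)              # 56 80
print(smd, round(relative_reduction * 100))  # 0.5 21
```

Under these assumptions the calculation returns 80 per group, an SMD of 0.5, and a relative symptom reduction of about 21%, matching the values quoted above.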
Finally, we analyzed the association between other general study characteristics and mean sample size per group. The characteristics were type of drug tested (category: conventional, atypical, or both), trial duration (≤6 weeks, >6 weeks), and type of funding (not-for-profit, commercial). We calculated mean sample sizes of comparison groups per category, and used the two-sample t-test to determine whether the means differed between the first (reference) category and other categories. The analyses were performed for the placebo-controlled and head-to-head trials separately. All analyses were carried out with Stata version 15.0 (StataCorp., 2017).

Sample Size Variation and Calculations
We calculated a reference sample size of 83 patients per group for the placebo-controlled trials and 433 patients for the head-to-head trials, as explained above. The group sample size was lower than the reference sample size in 10 placebo-controlled trials of conventional antipsychotics (small sample size) and higher in one such trial (large sample size), whereas 5 of the 19 atypical antipsychotic trials and none of the 3 trials including both conventional and atypical antipsychotics had small sample sizes. At least four of the five underpowered atypical antipsychotic trials were investigator initiated, although one was performed with commercially acquired funds. All head-to-head trials had a small sample size that was lower than the reference sample size of 433.
Sixteen of 47 articles (excluding 2 abstracts and 2 reports on online trial registers) reported a sample size calculation (34%), which was often called a power analysis (Table 1). Fourteen were placebo-controlled trials and two were head-to-head trials (Table 2). Table 2 shows which input for these sample size calculations was reported. There were only three studies that reported sufficient information (Ballard et al., 2005; Mintzer et al., 2006; Schneider et al., 2006). Two studies reported an alpha that differed from 5% (2.5% and 7%). Eight studies reported a beta that differed from 20% and it varied between 1% and 15%. Except for the alpha of 2.5%, this input will yield higher sample sizes. Expected drop-out rates were reported in seven studies and varied between 10% and 30%.
There were seven placebo-controlled trials that postulated an expected treatment effect in terms of symptom reduction, four of which reflected a relative symptom reduction below 20%. The expected differences in relation to baseline means (relative symptom reduction) were: 10% (Ballard et al., 2005); 11% (Tariot et al., 2006); 12% (Brodaty et al., 2003); 14% (Street et al., 2000); 20%; 31% (De Deyn et al., 2004); 31% (Ballard et al., 2018). For a head-to-head trial, the expected relative risk reduction was 16% (Verhey et al., 2006).
AD stands for Alzheimer's disease; CBS for chronic brain syndrome; HOS for hospital; MIX for mixed dementia (Alzheimer/vascular); NH for nursing home; NPS for neuropsychiatric symptoms; OUTP for outpatients; Ph for pharmaceutical company; NR for not reported; and VAS for vascular dementia. ° abstract only; * reduction in NPI Psychosis items at 12 weeks was the original primary outcome (clinicaltrials.gov); ^ discontinuation rate at week 36 was the primary outcome, but as it is incomparable to other trials, we used response rate and reduction of symptoms at 12 weeks (see Table 3); † results of 0.5 mg group (n = 20) were not reported; # the term senile brain disease was also used.
Table 3 presents the reported treatment effects in order of sample size per study group. A positive difference in response rate and negative difference in symptom reduction means that the investigated drug performed better than the control group. Six trials did not report what the effect of treatment on the primary outcome was: four studies were old, published between 1974 and 1983, but two were relatively new, published after 2000 (Smith et al., 1974; Rada and Kellner, 1976; Vergara et al., 1980; Spagnolo et al., 1983; Herz et al., 2002; Mulsant et al., 2004). Five placebo-controlled studies reported only p-values without effect sizes in the abstract (Katz et al., 1999; Brodaty et al., 2003; Deberdt et al., 2005; Mintzer et al., 2006; Zhong et al., 2007).
Thirteen of 18 overpowered trials (72%) versus seven of 15 underpowered placebo-controlled trials (47%) yielded a statistically significant difference between the study groups in either response rate or symptom reduction. Two of 13 (15%) and four of seven (57%) of these treatment effects, respectively, were clinically relevant (difference in response rate ≥25%, or relative symptom reduction ≥20%). The statistically significant response rates were 10-22% and reported by studies with large sample sizes. The two studies with a difference in response rate of ≥25%, which is the difference deemed clinically relevant (Cohen, 2007), were underpowered and did not report a statistically significant result. In addition, large sample size trials reported statistically significant relative symptom reductions between 10% and 23%, and small sample size trials reported statistically significant relative symptom reductions varying between 17% and 55%.

Reported Treatment Effects In Relation To Sample Size
Many placebo-controlled trials had more than one intervention group, adding up to a total of 54 individual comparisons. Thirteen of the 33 overpowered comparisons (39%) from 18 trials yielded a statistically significant treatment effect on either response rate or symptom reduction, versus seven of the 21 underpowered comparisons (33%) from 15 trials.
Five of 18 head-to-head trials reported a difference in response rate of 10%, the lower limit that we set for non-inferiority in our reference sample size calculation, and four a relative symptom reduction of 10%. Yet, none of these results were statistically significant.
The reported treatment effect was lower than the expected treatment effect in the 14 studies that presented an expected treatment effect in a sample size calculation, except in two studies (Street et al., 2000; Brodaty et al., 2003). The reported drop-out rates varied between 6% and 37% (not shown), which was higher than the expected drop-out rate in most studies. Table 4 shows the mean sample size per comparison group by type of drug tested, trial duration, and type of funding. The mean sample size per study group was statistically significantly higher in placebo-controlled trials that tested an atypical antipsychotic drug (107.0) or both a conventional and an atypical drug (103.8) in comparison to placebo-controlled trials of conventional antipsychotics (34.4; p < .05). The mean sample size per study group was also statistically significantly higher in trials that lasted more than 6 weeks (109.2) compared to less than 6 weeks (28.9; p < .001), and that were commercially (100.3) versus non-commercially (18.1; p < .001) funded. Head-to-head trials that tested atypical drugs only had a significantly larger mean sample size (46.3) than trials that tested conventional drugs (22.3; p < .05). Trial duration and commercial funding did not seem to be related to the sample size of head-to-head trials.

DISCUSSION
We assessed the presence of large sample size fallacy in 51 antipsychotic trials in dementia. Most placebo-controlled trials of conventional antipsychotics had small sample size, i.e. smaller than the calculated reference sample size, but most trials of atypical antipsychotics had large sample sizes. All head-to-head trials had very small sample sizes. Only one third of trials reported a sample size calculation. Thirteen of 18 trials with large sample sizes (72%) reported a statistically significant treatment effect, of which two (15%) were clinically relevant. In contrast, seven of 15 placebo-controlled trials with small sample sizes (47%) yielded a statistically significant treatment effect, and four were clinically relevant (57%). None of the head-to-head trials reported a statistically significant treatment effect, even though some suggested non-inferiority.
2 is the weighted mean of baseline mean of all studies with CMAI total; ¶ reduction in NPI Psychosis at 12 weeks was originally the primary outcome (clinicaltrials.gov); † discontinuation rate at week 36 is primary outcome of trial, but as it is incomparable to other trials, we used response rate and reduction of symptoms at 12 weeks; ^ results of 0.5 mg group (n = 20) were not reported.

Large Sample Size Fallacy
Sample sizes need to be large enough to guarantee a minimum level of discriminative power to detect a real treatment effect. Moreover, precision of an estimate increases with sample size. Studies based on small sample size may yield a non-statistically significant but clinically relevant treatment effect. On the other hand, studies based on large sample size (larger than necessary) may yield statistically significant but clinically insignificant treatment effects (Roggla and Fortunat, 2004; Chan et al., 2008). Large sample size fallacy occurs when such results are interpreted as relevant for medical practice (Lantz, 2013; Lin et al., 2013). Nevertheless, pharmaceutical companies and academic scholars benefit from statistically significant treatment results being interpreted as clinically relevant (Dickersin et al., 1992). The emphasis on statistical significance was confirmed by six trials in our review that did not report effect sizes, and five trials that reported just p-values in the abstract.
The sample sizes of trials testing atypical antipsychotics versus placebo, whether or not simultaneously with a conventional antipsychotic, were generally larger than necessary. These trials were commercially funded by the manufacturer of the atypical antipsychotic drugs. Only investigator-initiated trials were too small. The majority of large trials reported a statistically significant treatment effect, despite lack of clinical relevance, which confirms the presence of large sample size fallacy. The mean sample size was also higher when the study lasted longer than 6 weeks and was commercially funded, but this might be explained by the fact that placebo-controlled trials of atypical antipsychotics were generally longer and often industry-initiated. The chance of statistically significant findings was further enhanced by the use of multiple comparisons per study and multiple measurement scales per outcome in a number of the larger trials.
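The mechanism behind the fallacy can be illustrated with a small sketch: the same observed difference in response rates moves from non-significant to highly significant purely because the groups get larger. The response rates of 40% and 30% are hypothetical values chosen for illustration (a 10% difference, below the 25% threshold for clinical relevance used in this study), and the pooled two-proportion z-test is an assumed choice of test.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(p1, p2, n):
    """Two-sided p-value of a pooled z-test for two proportions, n per group."""
    pooled = (p1 + p2) / 2                       # equal group sizes
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical observed response rates: 40% on drug, 30% on placebo.
group_sizes = (30, 83, 200, 500)
p_values = [two_proportion_p_value(0.40, 0.30, n) for n in group_sizes]

for n, p in zip(group_sizes, p_values):
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>3}: p = {p:.4f} ({verdict})")
```

The effect size is identical in every row; only the p-value changes with the group size, which is why the p-value alone says nothing about clinical relevance.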
Many placebo-controlled trials of conventional antipsychotics had small sample sizes. Most were relatively old (published before 1990) and seemed to be investigator-initiated. Some of these trials reported clinically relevant results, but most were not statistically significant. That small placebo-controlled trials yielded statistically significant and clinically relevant effects relatively often might reflect publication bias.
Head-to-head trials had sample sizes that were (much) smaller than required, and these studies yielded non-statistically significant results that sometimes suggested a substantial effect. Even if we had set the limit for non-inferiority at 15%, the required sample would have been a lot higher than the sample sizes of the included studies were (346 without loss, and 385 with loss). It is unclear why these trials were so clearly underpowered. Perhaps industry has little to gain from properly testing their own product against that of competitors. Non-commercial funds might not be interested in a trial with at least 2 × 433 patients to show that the tested drugs are non-inferior, even if patients might be quite willing to participate in a study that ensures treatment with an active drug.

Sample Size Requirements
It is generally agreed that a trial protocol and report should report a sample size calculation (CONSORT Group, 2010). Nevertheless, only a third of trials in our review reported a sample size calculation and just three were complete. Although some trials can be considered old, most were published in the 1990s or later when it had become common to report trial methods in detail. Sample size calculations are often not (completely) reported in randomized trials in other fields of research as well (Chan et al., 2008; Charles et al., 2009). One review found that articles about newer randomized controlled trials included sample size calculations more often, and showed positive results more often (76%) than older studies (55%) (Latif et al., 2011).
Some studies in our review reported a lower alpha (2.5%) or beta (5%) than is usual in sample size calculations (5% and 20% respectively). In addition, the MCID proposed in the sample size calculations seemed rather small: a difference in response rates of <25% in 3/6 trials, and a relative symptom reduction of <20% in 4/7 trials. The lower the alpha, beta, and MCID, the higher the calculated sample size will be and hence the power to detect a statistically significant but not clinically relevant treatment effect. Moreover, even if the expected difference is equal to the MCID, a proportion of the patients will not have a clinically relevant effect on the individual level. On the other hand, the expected drop-out rate in the sample size calculations was mostly lower than the (mean) reported drop-out, and this would have led to a spuriously smaller calculated sample size. Real drop-out might have been high because trial duration was long on average. Most trials lasted more than a month, even though in clinical practice, antipsychotics usually show an effect within 2 weeks, four at the most. It has been estimated that up to 64% of trials with continuous outcomes are underpowered or overpowered because of imprecise input (Tavernier and Giraudeau, 2015).
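How strongly the calculated sample size reacts to these inputs can be sketched as follows. The code assumes the unpooled normal-approximation formula for two response rates (an illustrative choice, not necessarily the formula the trials used). The first scenario uses the reference inputs of this study; the stricter alpha and beta combine the 2.5% and 5% mentioned above; the 15% expected difference is a hypothetical value.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha, beta):
    """Per-group n for comparing two response rates (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Reference inputs: alpha 5%, beta 20%, 55% versus 30% response.
n_reference = n_per_group(0.55, 0.30, alpha=0.05, beta=0.20)

# Stricter error rates, same expected difference.
n_strict = n_per_group(0.55, 0.30, alpha=0.025, beta=0.05)

# Conventional error rates, smaller expected difference (hypothetical 45% vs 30%).
n_small_effect = n_per_group(0.45, 0.30, alpha=0.05, beta=0.20)

print("reference inputs:       ", n_reference)
print("alpha 2.5%, beta 5%:    ", n_strict)
print("15% expected difference:", n_small_effect)
```

Either tightening the error rates or assuming a smaller difference raises the required sample size, and a trial sized this way gains power to flag differences well below the 25% considered clinically relevant here.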

Strengths and Limitations
To our knowledge, determinants of sample size in trials testing antipsychotics for NPSs in dementia have not been studied previously. Our study showed that sample size calculations in the reports of these trials were missing on a large scale, as was the correct interpretation of effect size. A limitation of our study is its focus on antipsychotic trials in dementia, which might be perceived as a small field of research. In addition, the interpretation of our results is limited by the possible presence of multiple testing. Many trials used multiple comparisons of different drugs or different dosages, multiple outcomes, and sometimes multiple measurement instruments per outcome. Such multiple testing might reinforce the large sample size fallacy. With our study, we do not want to suggest that large sample sizes should be avoided. It is important for clinical practice that study results are precise. Moreover, large sample sizes are very useful for identification of adverse effects. Small trials should not be avoided either, as long as they are published irrespective of results and available for pooling in meta-analyses.
The implication of our study is that researchers need to be encouraged to report and consider effect sizes in line with p-values to avoid the large sample size fallacy. Journals should probably mention this in their author instructions.

CONCLUSION
Placebo-controlled trials that tested atypical antipsychotics showed large sample size fallacy. Placebo-controlled trials of conventional antipsychotics and head-to-head trials had insufficient power to detect a real difference between the treatment groups. Sample size calculations in antipsychotic trials for dementia need to be reported adequately.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
TH, SJ, and HL extracted the data. TH and HL searched and selected the trials, performed the data analysis, and drafted the manuscript. SJ and SZ critically reviewed the manuscript and suggested revisions. HL designed the study.