Keeping Meta-Analyses Hygienic During the COVID-19 Pandemic

Despite the massive global distribution of different vaccines, the current pandemic has revealed the crucial need for an efficient treatment against COVID-19. Meta-analyses have historically been extremely useful for determining treatment efficacy, but recent debates about the use of hydroxychloroquine for COVID-19 patients have resulted in contradictory meta-analytical results. Different factors during the COVID-19 pandemic have impacted key features of conducting a good meta-analysis. Some meta-analyses did not evaluate or treat substantial heterogeneity (I² > 75%); others did not include additional analyses for publication bias; none checked for evidence of p-hacking in the primary studies nor used recent methods (i.e., p-curve or p-uniform) to estimate the average population effect size. These inconsistencies may contribute to contradictory results in the research evaluating COVID-19 treatments. A prominent example of this is the use of hydroxychloroquine, where some studies reported a large positive effect, whereas others indicated no significant effect or even increased mortality when hydroxychloroquine was used with the antibiotic azithromycin. In this paper, we first recall the benefits and fundamental steps of good-quality meta-analysis. Then, we examine various meta-analyses on hydroxychloroquine treatments for COVID-19 patients that led to contradictory results, and the causes of this discrepancy. We then highlight recent tools that help evaluate publication bias and p-hacking (i.e., p-curve, p-uniform) and conclude by making technical recommendations that meta-analyses should follow even during extreme global events such as a pandemic.

Keywords: COVID-19, meta-analysis, heterogeneity, publication bias, hydroxychloroquine

BACKGROUND

The 2020 COVID-19 pandemic has highlighted the urgent need for the development and administration of a new treatment for COVID-19. Despite the global rollout of several different vaccines, the need to find a treatment remains essential given the uncertainty and shortcomings of equitable vaccine distribution and availability. Meta-analyses have historically been used to establish the existence, size, and certainty of therapeutic effects or causes of particular diseases.
Meta-analysis is an important tool for determining the effectiveness of COVID-19 treatments, but it is essential that the strength of evidence be maintained by adhering to all components of the methodology. Various factors during the COVID-19 pandemic, including time pressure, have resulted in alterations and omissions of key aspects of meta-analysis that lower the quality of evidence. Some meta-analyses did not evaluate publication bias nor treat substantial heterogeneity (I² > 75%); none checked for evidence of p-hacking in the primary studies nor used recent techniques (i.e., p-curve or p-uniform) to estimate the average population effect size. Journals greatly favor publishing significant findings over non-significant findings, resulting in publication bias, which can overestimate effect sizes (the strength of the relationship between two variables). These discrepancies may contribute to opposing results in the research evaluating COVID-19 treatments. A prominent example of this is the use of hydroxychloroquine (HCQ), where some studies reported a large protective effect, whereas others indicated no significant effect or even increased mortality when HCQ was administered with the antibiotic azithromycin.
In this paper, we first highlight the benefits and fundamental steps of meta-analytical studies. Then, we analyze examples of meta-analyses of HCQ treatments for COVID-19 patients that led to contradictory results and causes for this discrepancy. We conclude by making recommendations that meta-analyses should follow even during extreme global events such as a pandemic (see Table 1).

Meta-Analysis: Principles and Procedures
Meta-analysis involves a set of statistical techniques to synthesize effect sizes of several studies on the same phenomenon (2, 3). Several benefits are expected from clinical meta-analyses:

• Identify, screen, select, and include studies based on systematic reviews of the literature (i.e., as recommended by the PRISMA Statement) (7).
• Compute the mean effect sizes across different studies (i.e., the average effect of a particular treatment on a specific condition).
• Evaluate the level of heterogeneity (i.e., the amount of variation in the outcomes detected by the different studies).
• Determine the impact of publication bias (i.e., the lack of publication of negative trials and the underrepresentation of unpublished data, which can lead to overestimated effect sizes).
• Run meta-regressions and subgroup analyses to control for the effects of study characteristics (e.g., design, procedure, measures) and sample characteristics (e.g., age/gender, BMI, clinical history).
A meta-analysis begins by identifying, screening, and evaluating potentially relevant studies, then collecting data from the included studies and evaluating their quality (through PRISMA, for instance) (7). The mean and variance of the estimates are collected from every included study to compute a global weighted mean based on the inverse variance (2, 3). Some recent meta-analyses on the effect of HCQ in COVID-19 patients have omitted basic practices to assess publication bias (8), resulting in massive untreated heterogeneity (8) (i.e., I² > 80%), or have meta-analyzed small sets of studies [k ≤ 3, implying low statistical power (9)]. None used recent tools to evaluate p-hacking or recent techniques for assessing publication bias, possibly leading to an overall biased estimate of the population effect size.

Abbreviations: AZ, azithromycin; COVID-19, coronavirus disease 2019; CQ, chloroquine; EUA, emergency-use authorization; FDA, Food and Drug Administration; HCQ, hydroxychloroquine; HR, hazard ratio; ICU, intensive care unit; IPD, individual patient data; OR, odds ratio; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses; RCT, randomized controlled trial; RR, risk ratio.
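The inverse-variance weighting described above is simple enough to sketch directly. A minimal illustration in Python (the three effect estimates and their variances are invented for illustration, not taken from any cited study):

```python
import numpy as np

def inverse_variance_pooled(effects, variances):
    """Fixed-effect pooled estimate: each study is weighted by the
    inverse of its sampling variance, so precise studies count more."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))  # SE of the pooled mean
    return pooled, pooled_se

# Three hypothetical log odds ratios with their variances
est, se = inverse_variance_pooled([0.30, 0.10, 0.25], [0.04, 0.02, 0.08])
```

Random-effects models follow the same template, with the weights replaced by 1/(vᵢ + τ²).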

Gathering the Studies
The first step in conducting a meta-analysis is to search the literature for studies investigating a specific predefined question, using predetermined inclusion and exclusion criteria based on theoretical or methodological considerations (7) to determine eligibility. Several methods exist to assess and correct for publication bias among a set of studies (e.g., the Egger or PET-PEESE tests, p-uniform, p-curve) (5, 10). A general issue is that studies likely differ significantly in design (e.g., randomized controlled trials vs. observational studies) and in the specific questions they investigate (e.g., viral load, shedding, mortality/ICU events, mild symptoms). For instance, investigators must decide whether observational or quasi-experimental studies should be included alongside experimental studies. The lack of randomization inherent to observational and quasi-experimental studies is problematic, as they are at risk of bias from uncontrolled confounding variables (11).
Recent methodological advances question the reliance on the estimates presented in the original studies, given their significant limitations (12). Systematic reviews and meta-analyses that use individual patient data (IPD) involve collecting, validating, and reanalyzing the raw data from all clinical trials included in the meta-analysis. Following Cochrane recommendations, IPD meta-analyses offer a multidisciplinary and cross-cultural perspective that decreases cultural and professional biases. IPD analyses also enable better assessment of the impact of moderator variables and improve statistical power (13). Integrating contextual variables helps ensure that main effects are not explained by sample or country characteristics (or any other contextual factors). Limitations of including such variables are potential collinearity with other variables or insufficient precision in their measurement. Although there are several advantages to conducting IPD meta-analyses, they also require significant organization and coordination, which can be challenging.

Statistical Power
A major strength of meta-analysis is the relatively high statistical power associated with compiling several studies (i.e., independent RCTs may have few participants per group, limiting their statistical power). The median number of studies in the Cochrane Database of Systematic Reviews is six, according to Borenstein et al. (2). This is a serious concern considering that (1) subgroup analyses and meta-regressions are routine procedures that require high levels of statistical power, and (2) many meta-analyses have high heterogeneity (I² > 75%), which negatively affects precision and thus statistical power. For example, the statistical power to detect a small effect size (d = 0.2) with 25 participants per group in 6 different studies, using a random-effects model with moderate heterogeneity (I² = 50%) and a 5% type I error rate, is only 26.7% (https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/power-calculator-tool.html).

TABLE 1 | Technical recommendations for conducting meta-analyses.

1. Include published and unpublished studies on the basis of inclusion/exclusion criteria (e.g., designs, measures, sample characteristics). Ideally, pre-register your meta-analysis on an accessible server (1) (e.g., PROSPERO database, Open Science Framework).
2. Systematically run heterogeneity tests: the Q statistic, the between-study variance (τ²), and the ratio of real heterogeneity to the total observed variation (I²). Some depend on the number of participants (Q) whereas others depend on the metric scale (τ²), so it is crucial to compare them to estimate true heterogeneity (2, 3).
3. In case of substantial heterogeneity (i.e., I² > 75%), create homogeneous subgroups based on theoretical or methodological justifications (4).
4. Estimate publication bias using funnel plots and inferential tests (i.e., Begg's/Egger's tests). In case of publication bias, run additional analyses comparing the main results with/without the affected studies (2, 3).
5. Evaluate p-hacking using the p-curve. If H0 is true (no effect), the p-distribution must be uniform, but right-skewed if there is an effect. In case of signs of p-hacking, exclude those studies and rerun the analysis to compare the results (5).
6. Conduct separate analyses for observational, quasi-experimental, and experimental studies and evaluate the risk of bias for each study (6).

Frontiers in Public Health | www.frontiersin.org

Assessing Heterogeneity
The objective of meta-analysis is not simply to calculate an average weighted effect estimate but also to make sense of the pattern of effects. An intervention that consistently reduces the risk of mortality by 30% in numerous studies is different from an intervention that reduces the risk of mortality by 30% on average, with risk reductions ranging from 10 to 80% across studies. We must determine the true variance using indices that provide different perspectives on the dispersion of the results: the Q statistic, the variance between studies (τ²), and the ratio of real heterogeneity to the total observed variation (I²).
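These three indices can be computed directly from the study-level estimates. A short sketch using the DerSimonian-Laird estimator for τ² (the five effect sizes and variances below are invented for illustration):

```python
import numpy as np

def heterogeneity(effects, variances):
    """Cochran's Q, DerSimonian-Laird tau^2, and Higgins' I^2 (in %)
    from per-study effect estimates and their sampling variances."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    ybar = np.sum(w * y) / np.sum(w)       # fixed-effect pooled mean
    Q = np.sum(w * (y - ybar) ** 2)        # weighted squared deviations
    df = len(y) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / c)          # between-study variance
    I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0
    return Q, tau2, I2

# Hypothetical log risk ratios from five studies
Q, tau2, I2 = heterogeneity([-0.5, 0.1, -0.2, 0.4, -0.6],
                            [0.05, 0.08, 0.04, 0.09, 0.06])
```

Comparing Q (sensitive to the number and size of studies), τ² (on the metric's scale), and I² (a unitless proportion) gives complementary views of dispersion, as recommended above.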

Publication Bias
Publication bias affects both researchers conducting meta-analyses and physicians searching for primary studies in a database. If the missing studies are a random subset of all studies, excluding them will result in less evidence, wider confidence intervals, and lower statistical power, but will not have a systematic influence on the effect size (2). However, whenever there are systematic differences between the published studies included in a meta-analysis and the unpublished ones, the weighted effect sizes are biased (e.g., a lack of studies reporting non-significant effects of HCQ in COVID-19 patients). Dickersin (14) found that statistically significant results are more likely to be published than non-significant findings, so when published studies are combined, they may lead to overestimated effects. Also, for any given sample size, a result is more likely to be statistically significant if the effect size is large. Studies with inflated effect estimates are therefore expected to be reported in the literature more frequently (i.e., the first studies on HCQ likely reported large effects). This trend has the potential to produce large biases both in effect size estimates and in significance testing (15).
Different techniques have been developed to detect publication bias (16). A widely used method, the funnel plot, consists of plotting effect sizes against their standard errors or precisions (the inverse of the standard errors). A skewed funnel plot is usually an indication of the presence of publication bias. However, subjective visual examination, as well as the coding of the outcome, the choice of the metric, and the choice of the weight on the vertical axis, all impact the appearance of the plot (17). Inferential tests such as Egger's regression test regress the standardized effect sizes on the corresponding precisions. Although widely used, the Egger test may suffer from an inflated type I error rate or low statistical power in certain conditions (16, 17).
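Egger's test is just a regression of the standardized effect on precision, so it is easy to sketch with plain NumPy (the five studies below are fabricated so that noisier studies report larger effects, i.e., a deliberately asymmetric funnel):

```python
import numpy as np

def egger_test(effects, ses):
    """Egger's regression: regress the standardized effect (z = y/se)
    on precision (1/se). An intercept far from zero suggests funnel
    asymmetry. Minimal OLS sketch, without small-sample refinements."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(ses, dtype=float)
    z = y / se
    prec = 1.0 / se
    X = np.column_stack([np.ones_like(prec), prec])
    beta, _, _, _ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / (len(z) - 2)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # OLS covariance matrix
    t_intercept = beta[0] / np.sqrt(cov[0, 0])     # t-stat for the intercept
    return beta[0], t_intercept

# Small, noisy studies (large SE) show inflated effects: asymmetric funnel
intercept, t = egger_test([0.9, 0.7, 0.5, 0.3, 0.25],
                          [0.40, 0.30, 0.20, 0.10, 0.05])
```

Here the intercept is large and its t-statistic is clearly significant, flagging the asymmetry that the fabricated data were built to contain; dedicated implementations add weighting and small-sample corrections.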

Hydroxychloroquine and COVID-19 Meta-Analysis
HCQ and chloroquine (CQ) have been used for decades to manage and treat malaria and several autoimmune conditions. At the beginning of the pandemic, preliminary studies (18-20) suggested that HCQ might have a positive effect on the treatment of COVID-19 patients. This led the U.S. Food and Drug Administration (FDA) to issue an emergency-use authorization (EUA) on March 28, 2020, allowing HCQ sulfate and CQ phosphate to be donated to the Strategic National Stockpile for use in hospitalized COVID-19 patients. Given its multiple antiviral effects, it is plausible that HCQ could be beneficial in COVID-19 patients (21). In vitro data have shown that HCQ/CQ blocks viral infection by inhibiting virus/cell fusion through increasing endosomal pH (22) and by reducing the production of inflammatory cytokines (23-25). Shortly after the EUA of HCQ/CQ, a group in France published a study describing viral load reduction/cure with HCQ (20). However, this study included a small sample size, was non-randomized, only reported viral load as an outcome, and excluded the most severely ill patients from the analysis. Numerous meta-analyses have already been published on the use of HCQ in COVID-19 patients, with some indicating a large protective effect for HCQ (8), and others reporting no effect (9) or increased mortality when HCQ was combined with azithromycin (AZ).
Ebina-Shibuya et al. (26) did not find any effect of HCQ/AZ in a meta-analysis of 29 studies including 11,932 patients. In both meta-analyses, the authors found large heterogeneity among the included studies (I² = 75% and I² = 83%), which suggests the presence of confounders not accounted for across studies, and neither study performed subgroup analyses to better explore the high heterogeneity. Study selection was problematic in both studies. For instance, Million and colleagues did not publish a flow diagram with the different phases of a systematic review as recommended by the PRISMA Statement (7). Several items fundamental to the PRISMA protocol were not followed, such as review protocol registration, detailed study selection criteria, the data collection process, risk of bias within and across studies, and additional analyses. Million et al. (8) grouped "clinical" studies together (studies that had direct access to patients) and "observational big data" studies together (which may present conflicts of interest and show no effect of HCQ) instead of running meta-regressions based on study design (i.e., RCT vs. observational study). Fiolet et al. (26) excluded several studies with HCQ and AZ combination therapy because of critical risk of bias (i.e., lack of statistical information, unclear assignment of treatment, unknown timing between measures, and confounders) (47-49).
Outcome selection is also concerning, as Million et al. (8) reported positive effects on the duration of symptoms such as cough and fever and on clinical care based on analyses of only one to seven small-sample studies (50). In , the type of estimate used for effect sizes was inconsistent and not clearly reported: no distinction was made between risk ratios (RR), which are usually used in cohort studies, and hazard ratios (HR) and odds ratios (OR), which are used in case-control studies. This can influence the analysis because ORs tend to overestimate effects compared to RRs when the selected outcome occurs frequently (51). In both meta-analyses, the selection of included studies, the degree of heterogeneity, and the calculation of effect sizes make the veracity of the estimates uncertain. Many of the published meta-analyses had low statistical power and untreated heterogeneity, and none used tools to evaluate potential risks of p-hacking (13, 27-46, 52).
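The OR-versus-RR point is easy to see numerically. With hypothetical 2×2 tables (counts invented for illustration), the OR tracks the RR when the outcome is rare but diverges when it is common:

```python
def or_and_rr(a, b, c, d):
    """2x2 table: a/b = events/non-events in the treated group,
    c/d = events/non-events in the control group."""
    rr = (a / (a + b)) / (c / (c + d))   # risk ratio
    odds_ratio = (a * d) / (b * c)       # odds ratio
    return odds_ratio, rr

# Common outcome (50% control risk): OR drifts away from RR
or_common, rr_common = or_and_rr(30, 70, 50, 50)   # RR = 0.60, OR ~ 0.43

# Rare outcome (5% control risk): OR approximates RR
or_rare, rr_rare = or_and_rr(3, 97, 5, 95)         # RR = 0.60, OR ~ 0.59
```

With a 50% baseline risk the OR (about 0.43) exaggerates the protective effect relative to the RR (0.60), whereas with a 5% baseline risk the two nearly coincide, which is why mixing the metrics across studies can distort a pooled estimate.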

A Proposal for Conducting Meta-Analyses in Clinical Research
As discussed above, one potential bias in meta-analyses is "selection bias," which may lead to inaccurate estimation of effect sizes. An important question is whether to incorporate unpublished pre-print studies, especially when the field has limited studies and there is urgency for reliable data. An argument against this approach is that unpublished studies might not be as rigorous as published studies; from this point of view, unpublished studies should not be included in meta-analyses because the inclusion of poorly conducted research also introduces bias. However, having access to both published and unpublished studies helps in deciding which studies to include in a meta-analysis based on a priori inclusion criteria [through pre-registered meta-analyses, for instance (1); see Table 2].
Readers typically focus on the forest plot, which depicts the quantitative effect and level of uncertainty for each study included in the meta-analysis. Forest plots are great tools to visually assess heterogeneity (coupled with quantitative indices such as I² or the Q-test) and pooled results (53). However, forest plots do not address publication bias and can thus mislead readers' conclusions if not presented with additional information such as a funnel plot.
One of the most widely used tools to assess publication bias is plotting the effect size of each study against an indicator of the precision with which that study estimated the effect size. In funnel plots, precise studies (e.g., large samples) will be plotted near the average effect size, while studies with low precision (e.g., small samples) will have effect estimates spread widely on either side of the average effect, creating a funnel-shaped plot.
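The funnel's coordinates are simple to construct: each study's effect against its standard error, plus pseudo 95% limits around the pooled estimate (the study values are hypothetical; calls to a plotting library are omitted):

```python
import numpy as np

def funnel_coordinates(effects, ses):
    """Data for a funnel plot: study effects vs. standard errors, plus
    pseudo 95% confidence limits around the fixed-effect pooled estimate.
    Without publication bias, points should scatter symmetrically
    inside the funnel."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(ses, dtype=float)
    w = 1.0 / se**2
    pooled = np.sum(w * y) / np.sum(w)
    se_grid = np.linspace(1e-6, se.max() * 1.05, 50)   # y-axis of the plot
    lower = pooled - 1.959964 * se_grid                # left funnel boundary
    upper = pooled + 1.959964 * se_grid                # right funnel boundary
    return pooled, se_grid, lower, upper

pooled, grid, lo, hi = funnel_coordinates([0.9, 0.7, 0.5, 0.3, 0.25],
                                          [0.40, 0.30, 0.20, 0.10, 0.05])
```

Plotting the studies over these boundaries (with the standard-error axis inverted, so precise studies sit at the top) reproduces the familiar funnel shape; asymmetry in how the points fill it is the visual cue for bias.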
Although funnel plots are a widely used and reliable way to evaluate publication bias, another useful tool is the p-curve (5). The p-curve plots the proportion of observed p-values for each value of p in a set of studies. Because true effects are more likely to produce small p-values (e.g., p < 0.01) than values around the arbitrary significance threshold of p < 0.05, a flat p-curve, or one indicating a higher proportion of p-values between 0.04 and 0.05, is more likely to be an indicator of questionable research practices, sometimes referred to as p-hacking. Van Assen et al. (4) propose another tool, p-uniform, to estimate the population effect size in the presence of small to moderate heterogeneity (I² < 50%).
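A bare-bones version of the p-curve logic can be sketched as follows (a didactic simplification, not the full published procedure; the p-values are invented). It bins the significant p-values and runs a crude binomial test of right skew, i.e., whether more p-values fall below .025 than above it, as expected under a true effect:

```python
from math import comb

def p_curve_summary(p_values):
    """Bin significant p-values (< .05) into 0.01-wide bins and test
    right skew against a flat (uniform) p-curve with a sign-style
    binomial test. Didactic sketch only."""
    sig = [p for p in p_values if 0 < p < 0.05]
    bins = [sum(1 for p in sig if lo < p <= lo + 0.01)
            for lo in (0.0, 0.01, 0.02, 0.03, 0.04)]
    k = sum(1 for p in sig if p < 0.025)   # "small" significant p-values
    n = len(sig)
    # One-sided binomial P(X >= k) under H0: p-values uniform on (0, .05)
    p_right_skew = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return bins, p_right_skew

# Hypothetical right-skewed set: most p-values are very small
bins, p_skew = p_curve_summary([0.001, 0.003, 0.004, 0.01, 0.02, 0.03, 0.44])
```

A mass of p-values piling up just under .05 instead would flatten or left-skew the bins, which is the p-hacking warning sign described above.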
Conducting subgroup analyses for observational, quasi-experimental, and experimental studies will also help evaluate the risk of bias of each study design (6). In cases of substantial heterogeneity, researchers can create homogeneous groups of studies based on theoretical or methodological criteria and then use the p-curve and p-uniform to estimate the average population effect size for each subgroup (4). Additional tools can be useful for detecting publication bias. For instance, selection models can adjust for suspected selective publication; Rosenthal's fail-safe N estimates the number of unpublished studies that would be necessary to overturn a significant result; and the Copas sensitivity approach uses regression models to evaluate publication bias (54).
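Rosenthal's fail-safe N has a simple closed form based on Stouffer's combined z. A sketch (the one-tailed p-values are invented for illustration):

```python
from statistics import NormalDist

def rosenthal_failsafe_n(p_values):
    """Rosenthal's fail-safe N: how many unpublished null studies would
    be needed to drag the Stouffer combined z below one-tailed
    significance (z_crit = 1.645). Large values suggest robustness to
    the 'file drawer'."""
    z = [NormalDist().inv_cdf(1 - p) for p in p_values]  # one-tailed z-scores
    z_crit = 1.645
    n_fs = (sum(z) ** 2) / (z_crit ** 2) - len(z)
    return max(0.0, n_fs)

n_fs = rosenthal_failsafe_n([0.01, 0.02, 0.005, 0.03])
```

For these four significant studies, roughly 25 averaged-null studies sitting in file drawers would be needed to render the combined result non-significant; the measure is easy to compute but is widely criticized as crude, which is why selection models and the Copas approach are preferred complements.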
The fact that statistically significant results are more likely to be published than non-significant results is a major source of publication bias. Additional sources of potential bias that should be addressed when possible include pipeline bias (non-significant results take longer to publish than significant results), subjective reporting bias (selective reporting of the results), duplicate reporting bias (results published in multiple sources), and language bias (non-native English speakers tend to publish non-significant findings in their native tongue) (54).

CONCLUSION
Tensions over the use of HCQ for COVID-19 patients have unfortunately led some authors to disregard basic meta-analytical protocols. Concern over the quality of studies included in meta-analyses has also emerged in a recent comparative psychological study of meta-analytical findings and registered replication studies. The authors found that meta-analytical effect sizes differed significantly from the replication effect sizes for 12 of the 15 meta-replication pairs, and meta-analytic effect sizes were almost three times larger than the replication effect sizes (15). These inconsistencies call for caution when running and interpreting meta-analyses of new clinical studies.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.