The impact of violating the independence assumption in meta-analysis on biomarker discovery

With rapid advancements in high-throughput sequencing technologies, massive amounts of “-omics” data are now available in almost every biomedical field. Because biological models and analytic methods vary, findings from clinical and biological studies often do not generalize when tested in independent cohorts. Meta-analysis, a set of statistical tools for integrating independent studies that address similar research questions, has been proposed to improve the accuracy and robustness of new biological insights. However, biomarker discovery studies using preclinical pharmacogenomic data commonly borrow molecular profiles of cancer cell lines from one study to another, creating dependence across studies. The impact of violating the independence assumption in meta-analyses is largely unknown. In this study, we review and compare different meta-analysis approaches for estimating variation across studies and discovering biomarkers using preclinical pharmacogenomics data. We further use simulation studies to evaluate the performance of conventional meta-analysis when the dependence among effects is ignored. Results show that, as the number of non-independent effects increased, the relative mean squared error increased and the coverage probability decreased. We also assess potential bias in effect estimates from established meta-analysis approaches when data are duplicated and the assumption of independence is violated. Using pharmacogenomics biomarker discovery, we find that treating dependent studies as independent can substantially increase the bias of meta-analyses. Importantly, we show that violating the independence assumption decreases the generalizability of the biomarker discovery process and increases false positive results, a key challenge in precision oncology.
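To make the effect of borrowed profiles concrete, the following is a minimal sketch of a conventional random-effects meta-analysis using the DerSimonian-Laird (DL) estimator, applied once to independent study effects and once after a study effect is reused. The function name `dersimonian_laird`, the effect sizes, and the variances are hypothetical values chosen for illustration; they are not taken from the study.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool study effects with the DerSimonian-Laird random-effects estimator.

    Returns (pooled effect, standard error, tau^2, I^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1.0 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0    # share of variation due to heterogeneity
    return pooled, se, tau2, i2

# Hypothetical gene-drug association estimates from four studies.
effects = [0.42, 0.55, 0.31, 0.60]
variances = [0.04, 0.05, 0.03, 0.06]

independent = dersimonian_laird(effects, variances)
# Assumed duplication scheme: study 1 is reused twice and the copies are
# (wrongly) treated as additional independent studies.
duplicated = dersimonian_laird(effects + [effects[0]] * 2,
                               variances + [variances[0]] * 2)

print("independent: estimate=%.3f se=%.3f" % independent[:2])
print("duplicated:  estimate=%.3f se=%.3f" % duplicated[:2])
```

In this toy case the duplicated analysis reports a smaller standard error even though no new information was added, which is one way ignored dependence can produce overconfident intervals.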


SUPPLEMENTARY TABLES AND FIGURES
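Table S1. Models to incorporate study results in a meta-analysis. The table compares the fixed-effects model with the random-effects model; in the random-effects model, the variance of a study effect is τ² + s_k², and η_k and e_k are independent.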
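For reference, the two models named in Table S1 are commonly written as follows. This is the standard textbook formulation, not a reproduction of the original table; the symbol y_k for the observed effect in study k is an assumption, while β, τ², s_k², η_k, and e_k follow the notation used in this supplement.

```latex
\begin{align*}
\text{Fixed-effects:}  \quad & y_k = \beta + e_k,
  & e_k &\sim N(0,\, s_k^2), \\
\text{Random-effects:} \quad & y_k = \beta + \eta_k + e_k,
  & \eta_k &\sim N(0,\, \tau^2), \quad e_k \sim N(0,\, s_k^2), \\
 & \operatorname{Var}(y_k) = \tau^2 + s_k^2,
  & & \eta_k \text{ and } e_k \text{ independent.}
\end{align*}
```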

SUPPLEMENTARY FIGURES
Figure S1. Pan-cancer cell line data. Bar plots illustrate the distribution of cell lines across studies and tissue types.

Figure S3. Comparison of different meta-analysis approaches. UpSet diagrams show the number of genes significantly associated with drug response (FDR < 0.05) under various meta-analysis methods for (A-C) breast and (D-F) pan-cancer data.

Figure S4. Stability metric to compare different meta-analysis approaches. Stability of the 100 top-ranked genes associated with each drug, measured with the Jaccard similarity index, using (A-C) breast and (D-F) pan-cancer data.
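The stability comparison in Figure S4 relies on the Jaccard similarity index between sets of top-ranked genes. A minimal sketch is given below; ranking genes by ascending p-value, the helper names `top_genes` and `jaccard`, and the toy p-values are assumptions made for illustration only.

```python
def top_genes(pvalues, n=100):
    """Return the n genes with the smallest p-values (assumed ranking rule)."""
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    return [gene for gene, _ in ranked[:n]]

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of genes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy p-values from two hypothetical meta-analysis methods for one drug.
method_a = {"ERBB2": 0.001, "ESR1": 0.010, "EGFR": 0.030, "TP53": 0.200}
method_b = {"ERBB2": 0.002, "TP53": 0.010, "EGFR": 0.040, "ESR1": 0.500}

# With n=2 the lists share one gene out of three distinct genes.
stability = jaccard(top_genes(method_a, n=2), top_genes(method_b, n=2))
print(round(stability, 3))
```

A value of 1 means the two methods select identical gene lists, and a value of 0 means the lists share no genes.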
Frontiers 5

Figure S5. Breast cancer independent meta-analyses not limited to common cell lines across studies. Volcano plots show genes associated with drug response using the RE meta-analysis model, and forest plots illustrate the overall effect estimate using the FE (red diamond) and RE (blue diamond) meta-analysis models. The DL approach was applied to estimate heterogeneity across studies.

Figure S6. Gene-drug association meta-analyses and estimated heterogeneity. Bar plots present the percentage of gene-drug association meta-analyses using RE meta-analysis across levels of estimated heterogeneity for (A) breast and (B) pan-cancer data.
Figure S7. Bayesian versus classical independent meta-analyses. Forest plots illustrate the overall effect estimate and 95% confidence and credible intervals, along with the heterogeneity estimate I², using the DL and Bayesian Jeffreys procedures for (A-C) breast and (D-F) pan-cancer data.

Figure S8. Bayesian versus classical independent meta-analyses to assess gene-drug association using breast cancer data. Violin plots illustrate (A-C) the length of the 95% confidence and credible intervals and (D-F) the heterogeneity estimate I² using the DL and Bayesian Jeffreys procedures. The black dot represents the median.
Figure S9. Bayesian versus classical independent meta-analyses to assess gene-drug association using pan-cancer data. Violin plots illustrate (A-C) the length of the 95% confidence and credible intervals and (D-F) the heterogeneity estimate I² using the DL and Bayesian Jeffreys procedures. The black dot represents the median.

Figure S10. Mean squared error and coverage probability of overall effect estimates. Scenarios span various within-study variances (rows) and levels of heterogeneity across studies (columns). The x-axis represents the number of studies and the y-axis shows the relative MSE (A-B) and the coverage probability of 95% confidence intervals (C-D). The overall effect is set to β = 0.5. Colors indicate the number of duplicated (non-independent) effects across studies.

Figure S11. Mean squared error and coverage probability of overall effect estimates. Scenarios span various within-study variances (rows) and levels of heterogeneity across studies (columns). The x-axis represents the number of studies and the y-axis shows the relative MSE (A-B) and the coverage probability of 95% confidence intervals (C-D). The overall effect is set to β = 0.8. Colors indicate the number of duplicated (non-independent) effects across studies.

Figure S12. Breast cancer non-independent Jeffreys Bayesian meta-analyses. (A-C) Bar plots show how increasing the number of duplications affects the estimated overall effect, measured with the MAD metric, across drugs and selected genes with substantial (blue) and non-substantial (gray) estimated heterogeneity. The x-axis shows the number of duplicated study effects. (D) Bar plots illustrate the average 95% confidence or credible intervals for specific genes as the number of duplications increases, using the DL and Bayesian Jeffreys procedures.
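The quantities plotted in Figures S10 and S11 can be approximated with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not the simulation design of the study: duplication is modelled by appending exact copies of the first `n_dup` study effects, relative MSE is taken as the MSE with duplication divided by the MSE without it, every study shares the same within-study variance, and the parameter values other than β = 0.5 are hypothetical.

```python
import math
import random

def pool_dl(y, v):
    """DerSimonian-Laird pooled estimate and standard error."""
    k = len(y)
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    est = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return est, math.sqrt(1.0 / sum(w_re))

def simulate(k, n_dup, beta=0.5, tau2=0.05, s2=0.1, n_rep=2000, seed=1):
    """Return (MSE, coverage of the 95% CI) when n_dup effects are duplicated."""
    rng = random.Random(seed)
    sq_err, covered = 0.0, 0
    for _ in range(n_rep):
        # Random-effects data: true effect + between-study + within-study noise.
        y = [beta + rng.gauss(0, math.sqrt(tau2)) + rng.gauss(0, math.sqrt(s2))
             for _ in range(k)]
        v = [s2] * k
        # Assumed duplication: copies are analysed as if they were new studies.
        est, se = pool_dl(y + y[:n_dup], v + v[:n_dup])
        sq_err += (est - beta) ** 2
        if abs(est - beta) <= 1.96 * se:
            covered += 1
    return sq_err / n_rep, covered / n_rep

mse_0, cov_0 = simulate(k=5, n_dup=0)
mse_dup, cov_dup = simulate(k=5, n_dup=3)
rel_mse = mse_dup / mse_0
print("relative MSE: %.3f" % rel_mse)
print("coverage without / with duplication: %.3f / %.3f" % (cov_0, cov_dup))
```

Because both runs use the same seed, they draw identical study effects, so the comparison isolates the contribution of the duplicated effects.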

Figure S13. Pan-cancer non-independent Jeffreys Bayesian meta-analyses. (A-C) Bar plots show how increasing the number of duplications affects the estimated overall effect, measured with the MAD metric, across drugs and selected genes with substantial (blue) and non-substantial (gray) estimated heterogeneity. The x-axis shows the number of duplicated study effects. (D) Bar plots illustrate the average 95% confidence or credible intervals for specific genes as the number of duplications increases, using the DL and Bayesian Jeffreys procedures.