METHODS article

Front. Genet.

Sec. Statistical Genetics and Methodology

Volume 16 - 2025 | doi: 10.3389/fgene.2025.1534726

Analysis of follow-up data in large biobank cohorts: a review of methodology

Provisionally accepted
  • University of Tartu, Tartu, Estonia

The final, formatted version of the article will be published soon.

This study focuses on key methodological challenges in genome-wide association studies (GWAS) of biobank data with time-to-event outcomes, analyzed using the Cox proportional hazards (CPH) model. We address four primary issues: left-truncation of the data, computational inefficiency of standard model-fitting algorithms, relatedness among individuals, and model misspecification. To manage left-truncation, the common practice is to use age as the timescale, with individuals entering the risk set at their age of recruitment. We assess how this choice of timescale influences bias and statistical power under realistic GWAS conditions of varying effect sizes and censoring rates. In addition, to alleviate the computational burden typical of large-scale data, we propose and evaluate a two-step martingale residual (MR) approach for high-dimensional CPH modeling. Our results show that the timescale choice has minimal effect on accuracy for small hazard ratios, though using time since birth as the timescale (ignoring recruitment age) yields the highest power for association detection. We find that relatedness, when ignored, does not substantially bias effect size estimates, while omitting key covariates introduces significant bias. The two-step MR approach proves to be computationally efficient and retains power for detecting small effect sizes, which makes it suitable for large-scale association studies. However, when precise effect size estimates are critical, particularly for moderate or larger effect sizes, we recommend recalculating these estimates using the conventional CPH model, with careful attention to left-truncation and relatedness. These conclusions are drawn from simulations and illustrated with data from the Estonian Biobank cohort.
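The two ideas described above, delayed entry on the age timescale and a two-step martingale residual screen, can be illustrated with a minimal sketch. It is not the authors' implementation: the first step here is a null model without covariates (so the residuals come from a Nelson-Aalen estimate rather than a fitted Cox model), the second step is a plain least-squares slope, and all function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def martingale_residuals(entry: Sequence[float], exit_: Sequence[float],
                         event: Sequence[int]) -> List[float]:
    """Step 1 (assumed simplification): null-model martingale residuals.

    Age is the timescale. Individual j is at risk at age t only if
    entry_j < t <= exit_j, which is how left-truncation is handled.
    residual_i = event_i - (cumulative hazard accrued while i was at risk).
    """
    event_ages = sorted({x for x, d in zip(exit_, event) if d})
    increments = []  # (age, Nelson-Aalen hazard increment at that age)
    for t in event_ages:
        n_events = sum(1 for x, d in zip(exit_, event) if d and x == t)
        n_at_risk = sum(1 for a, x in zip(entry, exit_) if a < t <= x)
        increments.append((t, n_events / n_at_risk))

    residuals = []
    for a, x, d in zip(entry, exit_, event):
        expected = sum(h for t, h in increments if a < t <= x)
        residuals.append(float(d) - expected)
    return residuals


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Step 2: slope from regressing the residuals on one genotype."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical toy cohort: recruitment age, age at event or censoring,
# event indicator, and allele count (0/1/2) for a single variant.
entry_age = [40, 45, 50, 42, 55, 48, 60, 52]
exit_age = [50, 62, 58, 70, 63, 75, 68, 80]
event = [1, 1, 1, 0, 1, 0, 1, 0]
genotype = [2, 1, 2, 0, 1, 0, 1, 0]

residuals = martingale_residuals(entry_age, exit_age, event)
beta = ols_slope(genotype, residuals)
print("residuals:", [round(r, 3) for r in residuals])
print("slope of residuals on genotype:", round(beta, 3))
```

In a GWAS setting the first step would be run once and only the inexpensive second step repeated for each variant, which is where the computational saving comes from. The slope is a screening statistic rather than a log hazard ratio, consistent with the recommendation above to re-estimate effect sizes with the conventional CPH model.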

Keywords: survival analysis, genome-wide association study, population-based biobank data, martingale residuals, Cox proportional hazards model

Received: 26 Nov 2024; Accepted: 15 May 2025.

Copyright: © 2025 Kolde, Koitmäe, Käärik, Möls and Fischer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Anastassia Kolde, University of Tartu, Tartu, Estonia
