A statistical boosting framework for polygenic risk scores based on large-scale genotype data

Polygenic risk scores (PRS) evaluate the individual genetic liability to a certain trait and are expected to play an increasingly important role in clinical risk stratification. Most often, PRS are estimated based on summary statistics of univariate effects derived from genome-wide association studies. To improve the predictive performance of PRS, it is desirable to fit multivariable models directly on the genetic data. Due to the large and high-dimensional data, a direct application of existing methods is often not feasible and new efficient algorithms are required to overcome the computational burden regarding efficiency and memory demands. We develop an adapted component-wise L2-boosting algorithm to fit genotype data from large cohort studies to continuous outcomes using linear base-learners for the genetic variants. Similar to the snpnet approach implementing lasso regression, the proposed snpboost approach iteratively works on smaller batches of variants. By restricting the set of possible base-learners in each boosting step to variants most correlated with the residuals from previous iterations, the computational efficiency can be substantially increased without losing prediction accuracy. Furthermore, for large-scale data based on various traits from the UK Biobank we show that our method yields competitive prediction accuracy and computational efficiency compared to the snpnet approach and further commonly used methods. Due to the modular structure of boosting, our framework can be further extended to construct PRS for different outcome data and effect types; we illustrate this for the prediction of binary traits.
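The batch idea described above can be illustrated with a minimal in-memory sketch. It is not the snpboost implementation: the function name `snpboost_sketch` and the parameters `steps_per_batch` and `n_batches` are hypothetical, a fixed number of boosting steps per batch stands in for the method's actual stopping rules, and validation-based early stopping is omitted. It assumes a complete (already imputed) genotype matrix that fits in memory.

```python
import numpy as np

def snpboost_sketch(X, y, batch_size=1000, learning_rate=0.1,
                    steps_per_batch=50, n_batches=10):
    """Component-wise L2-boosting restricted to batches of variants.

    X: (n, p) genotype matrix without missing values.
    y: (n,) continuous outcome.
    Returns (intercept, beta) so that predictions are intercept + X @ beta.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    col_mean = X.mean(axis=0)
    Xc = X - col_mean                      # centred variants
    col_ss = (Xc ** 2).sum(axis=0)         # per-variant sum of squares
    col_ss[col_ss == 0] = np.inf           # constant variants are never selected
    beta = np.zeros(p)
    resid = y - y.mean()

    for _ in range(n_batches):
        # Outer step: the batch holds the variants most correlated
        # (in absolute value) with the current residuals.
        score = np.abs(Xc.T @ resid) / np.sqrt(col_ss)
        batch = np.argsort(score)[-min(batch_size, p):]
        for _ in range(steps_per_batch):
            # Inner step: fit every linear base-learner in the batch to the
            # residuals and update only the one that reduces the RSS most.
            inner = Xc[:, batch].T @ resid
            gain = inner ** 2 / col_ss[batch]      # RSS reduction per variant
            j = int(np.argmax(gain))
            k = batch[j]
            step = learning_rate * inner[j] / col_ss[k]
            beta[k] += step
            resid -= step * Xc[:, k]

    # Undo the centring so the model applies to uncentred genotypes.
    intercept = y.mean() - col_mean @ beta
    return intercept, beta
```

With `batch_size` equal to the number of variants, every step searches all base-learners, which corresponds to ordinary component-wise L2-boosting; smaller batches trade a narrower search for less computation per step.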


APPLICATION TO THE UK BIOBANK
Out of the more than 500,000 individuals in the UK Biobank (Bycroft et al., 2018), we selected those with self-reported white British ancestry (UKBB field 21000) and available data for all chosen phenotypes, resulting in n = 284,342 observations. Additionally, the covariates age and sex as well as the first ten principal components of the genotype matrix are available. We randomly divided the data set into a training set (n_train = 170,557), a validation set (n_val = 56,801) and a test set (n_test = 56,984). We used genome-wide genotype data and filtered for variants with a genotyping rate of at least 90% and a minor allele frequency of at least 0.1%, resulting in p = 562,684 genetic variants. Missing genotypes were imputed by the corresponding mean of the complete observations.

For both the boosting and lasso approaches, we first estimated a PRS using only the genotyped variants as predictors. We used the training set to fit the model and the validation set to monitor the predictive performance during fitting, which determines the main tuning parameter of each algorithm (i.e., the number of iterations for boosting and the penalty parameter for the lasso). To fit the lasso we used the R package snpnet (Qian et al., 2020) with the provided default hyperparameters. Following the results of our simulation study, for the snpboost algorithm we chose a batch size of p_batch = 1,000 variants, a learning rate of ν = 0.1 and an outer stopping lag of b_stop = 2 batches.

The PRS were computed via --score in plink2 (Chang et al., 2015; Purcell and Chang, 2015). Using the resulting PRS, we then fitted two models on the training and validation sets: the first (M_PRS) incorporates only the PRS as a single predictor variable, and the second (M_f) additionally includes the first ten principal components, sex and age as covariates. The resulting estimates were applied to the test set.

Table S1 shows the prediction performance, measured as R² between predicted and observed phenotype on the test set, and Table S2 shows the number of genetic variants with non-zero coefficient in the corresponding PRS model. Our proposed snpboost algorithm showed highly competitive performance relative to the other prediction tools. While PRScs, LDpred2-inf, SBayesR and the LDAK-Predict models do not induce sparsity, snpboost and snpnet yield sparse models without losing predictive accuracy.

Table fragment (the label of the first row, the column headers and the remaining SBayesR entries are missing):
[label missing]: 117,203; 117,203; 117,203; 117,203; 117,203
LDpred2: 114,702; 114,702; 114,702; 114,702; 114,702
SBayesR: 117,391; 117,391; 117,391
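The preprocessing steps described in this section (variant filtering, mean imputation and a random three-way split) can be sketched for a small in-memory matrix as follows. This is an illustration only, not the pipeline used for the UK Biobank data. The function names and the NaN coding of missing genotypes are assumptions, the 60/20/20 split fractions are only approximately those implied by the reported set sizes, and the imputation mean is computed over all rows passed in, since the text does not say which subset it was computed on.

```python
import numpy as np

def filter_and_impute(G, min_call_rate=0.9, min_maf=0.001):
    """Filter variants by genotyping rate and MAF, then mean-impute.

    G: (n, p) float array of allele counts (0, 1, 2) with np.nan for missing.
    Returns the filtered, imputed matrix and the boolean mask of kept variants.
    """
    G = np.asarray(G, dtype=float)
    observed = ~np.isnan(G)
    n_obs = observed.sum(axis=0)
    call_rate = observed.mean(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Allele frequency from the observed genotypes only.
        freq = np.nansum(G, axis=0) / (2 * n_obs)
        maf = np.minimum(freq, 1 - freq)
        # Comparisons with NaN are False, so fully missing variants drop out.
        keep = (call_rate >= min_call_rate) & (maf >= min_maf)
    G_kept = G[:, keep].copy()
    # Replace each missing entry by the mean of the observed entries
    # of the same variant.
    col_mean = np.nanmean(G_kept, axis=0)
    rows, cols = np.where(np.isnan(G_kept))
    G_kept[rows, cols] = col_mean[cols]
    return G_kept, keep

def split_indices(n, fractions=(0.6, 0.2), seed=0):
    """Random split of range(n) into training, validation and test indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```

At biobank scale the genotypes would be read in chunks from disk rather than held as one dense array; the logic per variant stays the same.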

Figure S1. Results of 100 simulated phenotypes with varying heritability and sparsity s = 0.1% for p = 20,000 variants and n = 20,000 individuals (divided into 50% training, 20% validation and 30% test set). Boxplots of the evaluation metrics obtained after 1,500 boosting iterations are shown depending on the batch size. Batch size p_batch = 20,000 corresponds to the original L2-boosting (shown in grey).

Figure S7. Comparison of predictive performance of snpnet and snpboost for five continuous phenotypes from the UKBB. Results of the covariate-only model (M_c, grey bars) and multivariable polygenic models with and without inclusion of the covariates derived by lasso (snpnet, petrol-coloured bars) and statistical boosting (snpboost, red-coloured bars) for the prediction of five phenotypes from the UKBB. The barplots show the predictive performance (RMSEP) on the test set of 52,551 unrelated white British individuals. M_PRS corresponds to a linear model incorporating the PRS as a single predictor variable and M_f to a linear model incorporating sex, age and the first ten principal components as additional covariates. M_PRS,c includes the covariates already in the fitting process of the PRS. Bootstrapped 95% confidence intervals are indicated by error bars. Furthermore, information on the number of selected genetic variants (# variants) and the number of additionally included covariates (# covariates) is given.
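The quantities named in this caption can be computed as in the following minimal sketch: a least-squares linear model (usable for a PRS-only design or a PRS-plus-covariates design), the RMSEP on held-out data, and a bootstrap confidence interval. The helper names are hypothetical, and the percentile bootstrap over test-set individuals is an assumption, since the caption does not state which bootstrap variant was used.

```python
import numpy as np

def fit_linear(Z, y):
    """Least-squares fit with intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, Z):
    """Predictions of a model fitted with fit_linear."""
    return np.column_stack([np.ones(np.shape(Z)[0]), Z]) @ coef

def rmsep(y, y_hat):
    """Root mean squared error of prediction."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def bootstrap_ci(y, y_hat, metric=rmsep, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for a metric, resampling individuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample individuals with replacement
        stats[b] = metric(y[idx], y_hat[idx])
    alpha = (1 - level) / 2
    lo, hi = np.quantile(stats, [alpha, 1 - alpha])
    return float(lo), float(hi)
```

For a PRS-only model one would pass the score as a single column, for example `fit_linear(prs[:, None], y)`; for the model with covariates, the PRS, sex, age and the principal components would be stacked column-wise into `Z`. The variable names `prs` and `y` are placeholders.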

Figure S8. Absolute values of coefficient estimates for PRS models for LDL-cholesterol derived by boosting (snpboost) and lasso (snpnet), shown as a function of the genomic position of the variants. Variants that are included in both models are marked in black.

Figure S10. Absolute values of coefficient estimates for PRS models for BMI derived by boosting (snpboost) and lasso (snpnet), shown as a function of the genomic position of the variants. Variants that are included in both models are marked in black.

Figure S12. Absolute values of coefficient estimates for PRS models for height derived by boosting (snpboost) and lasso (snpnet), shown as a function of the genomic position of the variants. Variants that are included in both models are marked in black.

Figure S14. Absolute values of coefficient estimates for PRS models for lipoprotein A derived by boosting (snpboost) and lasso (snpnet), shown as a function of the genomic position of the variants. Variants that are included in both models are marked in black.

Table S1. Comparison of predictive performance of eight PRS methods for five phenotypes from the UKBB. Results of GWAS-based (PRScs, LDpred2-inf and SBayesR) and individual-level data-based (Bolt, Ridge, BayesR, snpnet and snpboost) polygenic models with and without inclusion of the covariates for the prediction of five phenotypes from the UKBB. The table gives the predictive performance (R²) on the test set of 55,221 unrelated white British individuals. M_PRS corresponds to a logistic regression model incorporating the PRS as a single predictor variable and M_f to a logistic regression model incorporating sex, age and the first ten principal components as additional covariates.

Table S2. Number of selected variants of eight PRS methods for five phenotypes from the UKBB. Results of GWAS-based (PRScs, LDpred2 and SBayesR) and individual-level data-based (Bolt, Ridge, BayesR, snpnet and snpboost) polygenic models with and without inclusion of the covariates for the prediction of five phenotypes from the UKBB. The table gives the number of genetic variants that are included in the PRS.