Considering strategies for SNP selection in genetic and polygenic risk scores

Genetic risk scores (GRS) and polygenic risk scores (PRS) are weighted sums of, respectively, several or many genetic variant indicator variables. Although they are being increasingly proposed for clinical use, the best ways to construct them are still actively debated. In this commentary, we present several case studies illustrating practical challenges associated with building or attempting to improve score performance when there is expected to be heterogeneity of disease risk between cohorts or between subgroups of individuals. Specifically, we contrast performance associated with several ways of selecting single nucleotide polymorphisms (SNPs) for inclusion in these scores. By considering GRS and PRS as predictors that are measured with error, insights into their strengths and weaknesses may be obtained, and SNP selection approaches play an important role in defining such errors.


Bias(PRS
where expectation is taken with (β) or without (β) considering the value of $X_i$. The notation $p_0$ represents the probability that $X_i = 0$, $p_1 = 1 - p_0 = P(X_i = 1)$, and $\beta_{jk}$ is the subgroup-specific true coefficient. The magnitude of the bias therefore depends on how different $\beta_{j0}$ and $\beta_{j1}$ are.
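
As a worked sketch of where a bias of this kind comes from, suppose (our assumption, not stated explicitly in the text) that the coefficient obtained without considering $X_i$ is the mixture average of the subgroup-specific coefficients, written here as $\bar{\beta}_j$ (our notation), and that the score for individual $i$ is the genotype-weighted sum $\sum_j g_{ij}\beta_j$. The difference between the pooled and the subgroup-specific score is then:

```latex
\begin{align*}
\bar{\beta}_j &= p_0\,\beta_{j0} + p_1\,\beta_{j1}, \\
\sum_{j} g_{ij}\,\bar{\beta}_j - \sum_{j} g_{ij}\,\beta_{j0}
  &= p_1 \sum_{j} g_{ij}\,(\beta_{j1} - \beta_{j0}) \qquad \text{if } X_i = 0, \\
\sum_{j} g_{ij}\,\bar{\beta}_j - \sum_{j} g_{ij}\,\beta_{j1}
  &= p_0 \sum_{j} g_{ij}\,(\beta_{j0} - \beta_{j1}) \qquad \text{if } X_i = 1.
\end{align*}
```

Under this assumption the difference vanishes when $\beta_{j0} = \beta_{j1}$ for every SNP, and it is largest for individuals in the smaller subgroup, because the pooled coefficient sits closer to the larger subgroup's value.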

Similarly, define where $N_0$ and $N_1$ are the sample sizes in the subgroups. It is likely that $\sigma^2_{j0}$ and $\sigma^2_{j1}$ will be similar to each other; the differences between the two variances will likely be driven mostly by the sample sizes.

Depending on the magnitudes of the sample sizes in the subgroups and these 4 parameters,

For a PRS summed over all SNPs, the estimate with the smallest mean squared error will also depend on how many of the SNPs have subgroup-specific parameters.
For example, suppose two equally sized subgroups exist in the data, each of size $N/2$. This implies that parameter estimation performed separately for each subgroup will lead to estimates with variances approximately two times larger (standard errors that will be larger by a factor of roughly $\sqrt{2}$). Therefore, to benefit from the subgroup analyses, we could argue that the subgroup-specific estimates must differ sufficiently that bias$^2$ is reduced two-fold.
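
The trade-off in this example can be made concrete with a minimal sketch. It assumes, beyond what the text states, that the pooled estimate is centred on the mixture average $p_0\beta_{j0} + p_1\beta_{j1}$, that both subgroups share a common per-observation variance `sigma2`, and that an estimate based on `n` individuals has variance `sigma2 / n`. All function and parameter names are hypothetical, and the numbers are illustrative only.

```python
def mse_pooled(beta0, beta1, sigma2, n_total, p0=0.5):
    """MSE of a single pooled estimate when applied to subgroup 0 and to subgroup 1.

    Assumption: the pooled estimate is centred on p0*beta0 + (1-p0)*beta1
    and has variance sigma2 / n_total.
    """
    beta_bar = p0 * beta0 + (1.0 - p0) * beta1
    variance = sigma2 / n_total
    return (
        (beta_bar - beta0) ** 2 + variance,  # squared bias + variance, subgroup 0
        (beta_bar - beta1) ** 2 + variance,  # squared bias + variance, subgroup 1
    )


def mse_subgroup(sigma2, n_total, p0=0.5):
    """MSE of subgroup-specific estimates, assumed unbiased (variance only)."""
    return sigma2 / (p0 * n_total), sigma2 / ((1.0 - p0) * n_total)


# Two equally sized subgroups; illustrative values.
pooled_same = mse_pooled(0.1, 0.1, sigma2=1.0, n_total=1000)  # no heterogeneity
pooled_diff = mse_pooled(0.0, 0.2, sigma2=1.0, n_total=1000)  # heterogeneous effects
split = mse_subgroup(sigma2=1.0, n_total=1000)
```

Under these assumptions the pooled estimate has half the mean squared error of the split estimates when the subgroup coefficients coincide, and the split estimates become preferable once the squared bias of the pooled estimate exceeds its variance.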

Suppose that $\beta_j$ depends on an exposure or demographic variable $X$, such that we have $\beta_j(X)$; for simplicity, suppose that $X$ is a binary variable with values 0 and 1. We assume that the genotypes $g_{ij}$ are known and not random.
For a single individual $i$, therefore, where expectation is taken with (β) or without (β) considering the value of $X_i$. The notation $p_0$ is the probability that $X_i = 0$, $p_1 = 1 - p_0 = P(X_i = 1)$, and $\beta_{jk}$ is the subgroup-specific coefficient.

The variance can be written as
Assume that there is independence of the genetic variants in the risk set $\{S\}$, achieved by use of appropriate pruning/clumping and thresholding, fine mapping, or haplotype construction. Hence, the variance can be written as
The first term contains
where these variances will depend very strongly on the sample sizes in the subgroups defined by $X$.
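
The dependence on subgroup sample size can be illustrated with a minimal sketch. It relies on two assumptions that are ours: with fixed genotypes and independent coefficient estimates, the variance of a score is the sum of squared genotypes times the coefficient variances, and each coefficient variance is approximately `sigma2 / n` for a subgroup of size `n`. The function names are hypothetical.

```python
def coefficient_variance(sigma2, n):
    """Approximate variance of one coefficient estimate from n individuals (assumed form)."""
    return sigma2 / n


def prs_variance(genotypes, coef_variances):
    """Variance of sum_j g_ij * beta_hat_j for fixed genotypes and independent estimates."""
    if len(genotypes) != len(coef_variances):
        raise ValueError("genotypes and coef_variances must have the same length")
    return sum(g * g * v for g, v in zip(genotypes, coef_variances))


# One individual with genotype dosages 0, 1 and 2 at three independent SNPs.
genotypes = [0, 1, 2]
var_small = prs_variance(genotypes, [coefficient_variance(1.0, 100)] * 3)  # subgroup of 100
var_large = prs_variance(genotypes, [coefficient_variance(1.0, 400)] * 3)  # subgroup of 400
```

In this sketch, quadrupling the subgroup size divides the score variance by four, which is the sense in which the variances are driven by the subgroup sample sizes.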

In the standard C+T approach, we retain only the predictor with the most significant association with the trait of interest among a region of correlated markers. Let $y_i$ be any continuous trait and assume the true causal model is given by
$$y_i = \beta X_i + Z_i^\top \gamma + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $X_i$ is a single genetic predictor, $Z_i$ is a vector of $p$ genetic predictors in the same region as $X$, $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, and $\epsilon_i$ is independent of $X_i$ and $Z_i$. Without loss of generality, we assume there is no intercept. If we ignore $Z_i$ from the model, the OLS estimator for $\beta$ is given by
$$\hat{\beta} = \frac{\sum_{i=1}^{n} X_i y_i}{\sum_{i=1}^{n} X_i^2}.$$
Assuming the predictor $X_i$ is the most significant marker from the region, the mean prediction error of the PRS for the $i$th subject is equal to
From (7) we have that the mean prediction error of the PRS is proportional to the strength of the correlation between $X$ and $Z$ and the size of the omitted predictors' effects $\gamma = (\gamma_1, \ldots, \gamma_p)$.
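
The role of the correlation between $X$ and $Z$ can be seen in a small numerical sketch. It assumes a linear trait of the form `beta * x + z . gamma` with the noise term set to zero, a simplification made here purely so that the effect of omitting $Z$ is visible exactly. The function names and data are hypothetical.

```python
def ols_no_intercept(x, y):
    """OLS slope for a single predictor with no intercept: sum(x*y) / sum(x*x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def noise_free_trait(x, z, beta, gamma):
    """Trait values beta*x_i + z_i . gamma, with the error term omitted for illustration."""
    return [
        beta * xi + sum(g * zij for g, zij in zip(gamma, zi))
        for xi, zi in zip(x, z)
    ]


beta, gamma = 1.0, [0.5]
x = [1.0, -1.0, 1.0, -1.0]

# Omitted predictor orthogonal to x: sum(x * z) == 0.
z_uncorrelated = [[1.0], [1.0], [-1.0], [-1.0]]
# Omitted predictor identical to x: perfectly correlated.
z_correlated = [[1.0], [-1.0], [1.0], [-1.0]]

est_uncorrelated = ols_no_intercept(x, noise_free_trait(x, z_uncorrelated, beta, gamma))
est_correlated = ols_no_intercept(x, noise_free_trait(x, z_correlated, beta, gamma))
```

When the omitted predictor is orthogonal to `x`, the estimate recovers `beta`; when it is perfectly correlated with `x`, the estimate absorbs the whole omitted effect and equals `beta + gamma[0]`.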

D. PRS FOR CARDIOVASCULAR DISEASE IN UK BIOBANK, SEPARATELY BY DIABETES
Using resources from the UK Biobank (Bycroft et al., 2018), we calculated three PRSs for coronary artery

In the UK Biobank, genotyping data and complete cardiovascular risk factor information were available for 322,230 participants of White British ancestry without prevalent coronary artery disease. Our analysis thus focused on these individuals. We found that there was very little difference in the PRSs' discriminative power between individuals with and without type 2 diabetes (Table S1).

Table S1. C-indices of PRSs and covariate-adjusted models in predicting incident coronary artery disease cases in the UK Biobank. T2D: type 2 diabetes. "+ covar." indicates that covariates (age, sex, genotyping array, recruitment centre, and first 10 genetic principal components) were included in models predicting incident coronary artery disease. Khera: PRS from Khera et al.