Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics

Genome-wide Association Studies (GWAS) result in millions of summary statistics (“z-scores”) for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N = 82,315) and putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We assess the degree to which effect sizes are over-estimated when based on linear-regression association coefficients. 
We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are approximately 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.

Here, Y is a vector of phenotypic values for the N subjects in the study, and G is the N × n genotype matrix for their n SNPs: g_ij = 0, 1, or 2, the number of reference alleles at the jth SNP for the ith subject; e is the residual effect, an N-vector of random errors and environmental contributions, assumed to be exogenous (zero-centered and uncorrelated with the regressors: E(e|G) = 0, where E denotes expectation). The data elements {y_i, G_i}, where G_i is the genotype n-vector (over SNPs) for the ith subject, are considered to be random and sampled from a population. This means that β, the vector of regression coefficients, is an n-vector of random genetic effects (a source of variance, in addition to e), whose means and variances are of primary interest. Thus, all three quantities on the right-hand side of Eq. 1 are considered random. Let R_e = var(e), the covariance matrix for e, and R_β = var(β), the covariance matrix for β; denote transpose with a prime, and let a hat (ˆ) denote estimates of true values from the data. Then the best linear unbiased predictor (BLUP) (Robinson, 1991) is given by

β̂ = (G′R_e⁻¹G + R_β⁻¹)⁻¹ G′R_e⁻¹ Y,  (2)
Ŷ = Gβ̂.  (3)

The first thing to do is to simplify. (1) Assume the residuals are uncorrelated and have constant variance: R_e = σ_e² I_N, where I_N is the N × N identity matrix and σ_e is a constant. (2) Assume the random effects β are uncorrelated and that, in a Bayesian sense, the priors on them are all equally and highly uninformative (they have very large variance): R_β = σ_β² I, with σ_β ≫ σ_e. Eqs. 2 and 3 then reduce to the familiar ordinary least squares (OLS) solution to Eq. 1 (minimizing only the error variance, σ̂_e²), β̂ = (G′G)⁻¹G′Y, with var(β̂) = σ̂_e²(G′G)⁻¹, and σ̂_e² = var(Y − Ŷ), where Ŷ = Gβ̂. With more regression coefficients than data points, n ≫ N (the usual situation in GWAS), there is no unique solution for β.
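As a quick numerical illustration of this reduction (a sketch with simulated data and n < N, so that OLS is unique, unlike the n ≫ N GWAS setting; the `blup` helper and all values are ours, not from the paper): making the prior variance σ_β² very large relative to σ_e² drives the ridge-form BLUP to the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 5                                       # subjects, SNPs (n < N: OLS unique)
G = rng.integers(0, 3, size=(N, n)).astype(float)   # 0/1/2 genotype matrix
beta_true = rng.normal(0, 0.5, size=n)
y = G @ beta_true + rng.normal(0, 1.0, size=N)      # phenotype = G*beta + residual

def blup(G, y, sigma_e2, sigma_b2):
    """Ridge-form BLUP estimate: (G'G + (sigma_e^2/sigma_b^2) I)^{-1} G'y."""
    lam = sigma_e2 / sigma_b2
    return np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ y)

beta_ols = np.linalg.solve(G.T @ G, G.T @ y)        # OLS: (G'G)^{-1} G'y
beta_blup = blup(G, y, sigma_e2=1.0, sigma_b2=1e8)  # sigma_beta >> sigma_e

# With a highly uninformative prior, BLUP coincides with OLS
assert np.allclose(beta_blup, beta_ols, atol=1e-4)
```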
A further simplification is to resort to simple linear regression, treating each SNP independently (the massively univariate approach, ignoring correlation between SNP genotypes): only the diagonal of G′G in Eqs. 4 and 5 is considered to be non-zero. Essentially, for each SNP j in turn (j = 1, . . . , n; no summation over repeated indices), solve the set of N equations

y_i = g_ij β_j + ε_ij,  (6)

where i indexes the subjects and ε_ij is the residual. In a case-control situation (y_i = 1 for cases and y_i = 0 for controls), one performs logistic regression, replacing the left-hand side of Eq. 6 with the log odds

ln[Pr(y_i = 1|g_ij) / (1 − Pr(y_i = 1|g_ij))],

where Pr(y_i = 1|g_ij) = E(y_i = 1|g_ij), with Pr denoting probability. To relate estimated quantities on the observed case-control (0/1) scale to an underlying continuous liability scale, taking into consideration case-control ascertainment and disease prevalence, see Lee et al. (2012). From Eq. 6, the estimate of the element β̂_j is given by the corresponding diagonal entry of the OLS estimate,

β̂_j = (G_j′G_j)⁻¹G_j′Y,

where now G_j denotes the N-vector (over subjects) of genotype values for the jth SNP, and

var(β̂_j) = σ̂_ε,j²(G_j′G_j)⁻¹

(again, no summation over repeated indices). For a polygenic phenotype each SNP will contribute only a small amount to the phenotypic variance. Therefore all the σ̂_ε,j² will be approximately equal to (but slightly less than) the phenotypic variance: σ̂_ε,j² ≃ var(y). If G is centered (for each SNP, the mean value over subjects is subtracted from the integral genotypic value for each subject), G_j = g_j − ḡ_j, with g_j = (g_1j, ..., g_ij, ..., g_Nj)′, and letting var(g_j) denote the scalar variance arising from the vector elements, then

G_j′G_j = N var(g_j),

a result that holds regardless of the centering (since for a ∈ R, var(a + g_j) = var(g_j) = var(G_j)), and β̂_j can be written as β̂_j = cov(G_j, Y)/var(G_j). Under Hardy-Weinberg equilibrium, g_ij is distributed binomially with allele frequency p_j: g_ij ∼ Bin(2, p_j). Then

var(g_j) = 2p_j(1 − p_j) ≡ H_j,

where H_j is the heterozygosity (allele count variance) for SNP j.
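The per-SNP estimate β̂_j = cov(G_j, Y)/var(G_j) and the HWE heterozygosity H = 2p(1−p) can be checked on simulated data (a minimal sketch; the allele frequency, effect size, and sample size are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100_000, 0.3                             # sample size, allele frequency (illustrative)
g = rng.binomial(2, p, size=N).astype(float)    # HWE genotypes: g ~ Bin(2, p)
y = 0.05 * g + rng.normal(0, 1.0, size=N)       # phenotype with a small per-SNP effect

# Simple-regression estimate: beta_hat = cov(g, y) / var(g)
beta_hat = np.cov(g, y, bias=True)[0, 1] / np.var(g)

# Under HWE, var(g) approximates the heterozygosity H = 2p(1-p)
H = 2 * p * (1 - p)

# Wald z-score using se(beta_hat) ~ sigma_eps / sqrt(N*H); for a weakly
# contributing SNP, sigma_eps is approximately sd(y)
z = beta_hat / (np.std(y) / np.sqrt(N * H))
```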
Therefore, one can write

se(β̂_j) ≃ σ̂_ε,j / √(N H_j),

se denoting standard error.

Basic Gaussian Model
The Wald statistic corresponding to β̂_j is given by

z_j = β̂_j / se(β̂_j).

Dropping the subscript j indicating a particular SNP, one can write β̂ = E(β̂) + γ, where γ is an environment and error term with E(γ) = 0. Note that OLS (and simple linear regression) estimates are unbiased estimates of β.
We assume E(β̂) ≃ β. Then

z = β̂/se(β̂) = δ + ε,

where δ is the genetic effect size and ε is due to environment and error and assumed to be independent of δ (thus ignoring any genotype-environment interactions). Under the null hypothesis of no effect, β = 0 and z is assumed to be normally distributed with mean 0 and variance σ_0² (which might be slightly different from 1 due to population structure (Devlin and Roeder, 1999)), and

var(z) = var(δ) + var(ε),

with

var(δ) ≡ σ_1² = N H σ_a²,  (17)

σ_a², the constant of proportionality (independent of the quantities N and H), being the per-allele variance, i.e., the variance component associated with each allele. Motivated by the central limit theorem, the essential component of the basic Gaussian model is to assume that δ has a normal prior distribution with mean 0 and variance σ_1²: δ ∼ N(0, σ_1²). In summary, the usual single-group Gaussian model is:

δ ∼ N(0, σ_1²), prior density of δ,  (20)
z|δ ∼ N(δ, σ_0²), likelihood of δ given z.  (21)

Then, using the familiar property of Gaussian integrals, and Bayes equation, Pr(δ|z) = Pr(z|δ)Pr(δ)/Pr(z), where the marginal distribution of z-scores is given by the convolution

f(z) ≡ Pr(z) = ∫ φ(z; δ, σ_0²) Pr(δ) dδ,  (23)

[Figure 1: Posterior estimates of effect size for schizophrenia, comparing the empirical estimates with those from explicit model fits with (a) three parameters for no ubiquitous component (no σ_a; red curve: π_1 = 0.141, σ_0 = 1.051, σ_b = 0.016), and (b) two parameters for no sparse component (no σ_b; green curve: σ_0 = 0.962, σ_a = 0.008). In the latter case, note that the model is degenerate with respect to π_1. Compare with Fig. 1(A) in the main paper.]
where φ(·; µ, σ²) is the normal PDF with mean µ and variance σ², one has

f(z) = φ(z; 0, σ_0² + σ_1²).  (25)

Thus, in the case of the single-group (non-mixture) model (Eq. 25),

E(δ|z) = [σ_1²/(σ_0² + σ_1²)] z,  (26)
var(δ|z) = σ_0²σ_1²/(σ_0² + σ_1²).  (27)

Noting that δ ∝ √N, for a z-score measured in a sample of effective sample size N_d, the expected posterior effect size δ_r in a new sample of effective sample size N_r is given by

E(δ_r|z) = √(N_r/N_d) E(δ|z).

That is,

E(δ_r|z) = √(N_r/N_d) [σ_1²/(σ_0² + σ_1²)] z.

Similarly,

var(δ_r|z) = (N_r/N_d) σ_0²σ_1²/(σ_0² + σ_1²).

For probability densities that are more intractable than simple Gaussians, the standard formalism of a one-parameter exponential family of probability densities can be used to estimate means and variances (Efron, 2013). As an illustration, Eq. 23 can be re-expressed in terms of an exponential family as

Pr(δ|z) = e^{z T(δ) − λ(z)} g₀(δ),

where

λ(z) = ln f(z) + z²/(2σ_0²) + ln √(2πσ_0²),  g₀(δ) = e^{−δ²/(2σ_0²)} Pr(δ),

with Pr(δ) given by Eq. 20, and f(z) given by Eq. 25. Writing T(δ) ≡ δ/σ_0², one has from the exponential family formalism

E(T(δ)|z) = λ′(z)

and

var(T(δ)|z) = λ″(z).

Letting l(z) ≡ ln f(z), one then has

E(δ|z) = z + σ_0² l′(z)

and

var(δ|z) = σ_0² + σ_0⁴ l″(z),

from which Eqs. 26 and 27 follow directly.
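The exponential-family route can be verified numerically: the Tweedie-type relation E(δ|z) = z + σ_0² (d/dz) ln f(z), applied to the single-group marginal, recovers the Gaussian posterior mean of Eq. 26. A sketch with illustrative σ values, using numerical differentiation in place of the analytic derivative:

```python
import numpy as np

sigma0, sigma1 = 1.0, 2.0                  # illustrative likelihood and prior SDs

def f(z):
    """Single-group marginal: f(z) = phi(z; 0, sigma0^2 + sigma1^2)."""
    v = sigma0**2 + sigma1**2
    return np.exp(-z**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def tweedie_mean(z, h=1e-5):
    """E(delta|z) = z + sigma0^2 * d/dz ln f(z), by central differences."""
    dlogf = (np.log(f(z + h)) - np.log(f(z - h))) / (2 * h)
    return z + sigma0**2 * dlogf

# Agrees with the analytic Gaussian posterior mean sigma1^2/(sigma0^2+sigma1^2) * z
z = 3.0
analytic = sigma1**2 / (sigma0**2 + sigma1**2) * z
assert abs(tweedie_mean(z) - analytic) < 1e-6
```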

Gaussian Mixture Model
The motivation for pursuing a two-groups mixture model, as discussed in the main text, comes from the empirical results (black curve in Fig. 1), which indicate that the z-scores show evidence of having both ubiquitous ("small") effects and sparse ("large") effects: small but non-zero slope around the origin, transitioning to steep slope further away from the origin. A single-group distribution like Eq. 25 does not well characterize the data, as shown by the green curve in Fig. 1, and even a two-groups mixture model that does not incorporate ubiquitous effects provides only an inaccurate description, as shown by the red curve in Fig. 1. The two-groups mixture model, an extension of Eq. 25 proposed here, is given by Eq. 1 in the main paper, i.e.,

f(z) = π_0 φ(z; 0, σ_0² + σ_1²) + π_1 φ(z; 0, σ_0² + σ_1² + σ_2²)  (37)

(a mixture of scaled Gaussians), where, analogous to Eq. 17,

σ_2² = N H σ_b²,

σ_b² being the additional variance component associated with alleles exhibiting sparse effects (the additional per-allele variance for sparse effects). The empirical PDF for z-scores and the model estimate, Eq. 37 (where the model parameters were estimated as described in the main paper), are plotted in Fig. 2 for five allele count variance windows, showing an excellent match between the empirical and model PDFs. The underlying two-groups model for the effect size is

δ = δ_a with prior prob. π_0; δ = δ_a + δ_b with prior prob. π_1,  (42)
δ_a ∼ N(0, σ_1²), prior density of δ_a,  (43)
δ_b ∼ N(0, σ_2²), prior density of δ_b,  (44)

with δ_a and δ_b independent. As before, z|δ ∼ N(δ, σ_0²), but now the prior density for δ, Pr(δ), is

Pr(δ) = π_0 φ(δ; 0, σ_1²) + π_1 φ(δ; 0, σ_1² + σ_2²).
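The mixture marginal of the form of Eq. 37 is straightforward to evaluate; a minimal sketch (`model_pdf` is our name, and the parameter values are illustrative, only loosely echoing the magnitudes reported in the paper) checks that it integrates to 1:

```python
import numpy as np

def model_pdf(z, pi1, sigma0, sigma_a, sigma_b, N, H):
    """Two-groups mixture PDF for z-scores (the form of Eq. 37):
    f(z) = (1-pi1) phi(z; 0, s0^2 + N H sa^2) + pi1 phi(z; 0, s0^2 + N H (sa^2 + sb^2))."""
    def phi(z, var):
        return np.exp(-z**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    v_null = sigma0**2 + N * H * sigma_a**2          # sigma_0^2 + sigma_1^2
    v_sparse = v_null + N * H * sigma_b**2           # sigma_0^2 + sigma_1^2 + sigma_2^2
    return (1 - pi1) * phi(z, v_null) + pi1 * phi(z, v_sparse)

# Sanity check: the mixture density integrates to 1 (Riemann sum on a fine grid)
z = np.linspace(-40.0, 40.0, 200001)
pdf = model_pdf(z, pi1=0.037, sigma0=1.0, sigma_a=0.008, sigma_b=0.016, N=50_000, H=0.3)
area = pdf.sum() * (z[1] - z[0])
assert abs(area - 1.0) < 1e-4
```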
Additionally, the posterior distribution of z-scores z_r in a new "replication" sample of effective sample size N_r, given z_d in a "discovery" sample of effective sample size N_d, is

Pr(z_r|z_d) = (1 − tdr) φ(z_r; √(N_r/N_d) µ_u, σ_0² + (N_r/N_d)σ_u²) + tdr · φ(z_r; √(N_r/N_d) µ_s, σ_0² + (N_r/N_d)σ_s²),  (50)

where the posterior density of δ given the discovery z-score,

Pr(δ|z_d) = (1 − tdr) φ(δ; µ_u, σ_u²) + tdr · φ(δ; µ_s, σ_s²),  (49)

has sparse-component posterior weight and component means and variances

tdr = π_1 φ(z_d; 0, σ_0² + σ_1² + σ_2²) / f(z_d),
µ_u = [σ_1²/(σ_0² + σ_1²)] z_d,  σ_u² = σ_0²σ_1²/(σ_0² + σ_1²),
µ_s = [(σ_1² + σ_2²)/(σ_0² + σ_1² + σ_2²)] z_d,  σ_s² = σ_0²(σ_1² + σ_2²)/(σ_0² + σ_1² + σ_2²).

From Eq. 50, one immediately has the posterior expectation of effect size in a replication sample, given a z-score in the discovery sample:

E(δ_r|z_d) = √(N_r/N_d) [(1 − tdr)µ_u + tdr · µ_s],

which also immediately follows from Eq. 35. Since variance is given by the second moment minus the square of the first moment, the variance of δ given z, var(δ|z), can be calculated straightforwardly from Eq. 49, which itself is in the form of a mixture of two Gaussians (with constant weights 1 − tdr and tdr for given z, H, and N). Thus, for fixed z, write g(δ) ≡ Pr(δ|z) ≡ (1 − tdr)g_u(δ) + tdr · g_s(δ), with g_u(δ) ≡ φ(δ; µ_u, σ_u²), and similarly for g_s. Let q_k denote the kth moment of δ with respect to the mixture distribution g:

q_k = E_g(δ^k) = (1 − tdr)E_u(δ^k) + tdr · E_s(δ^k),

with E_g denoting expectation with respect to g, and E_u and E_s denoting expectation with respect to g_u and g_s, respectively. Then

q_1 = (1 − tdr)µ_u + tdr · µ_s,

and (for the individual mixture terms, recalling that the second moment is the square of the first moment plus the variance)

E_u(δ²) = µ_u² + σ_u²,  E_s(δ²) = µ_s² + σ_s².

Since var(δ|z) = q_2 − q_1², it follows that

var(δ|z) = (1 − tdr)(µ_u² + σ_u²) + tdr · (µ_s² + σ_s²) − q_1²,

which also immediately follows from Eq. 36.
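The posterior mean and variance of δ given z under a two-groups Gaussian prior can be cross-checked against brute-force numerical integration; a sketch in which `posterior_moments` and all parameter values are illustrative, not from the paper:

```python
import numpy as np

def phi(x, var):
    """Normal PDF with mean 0 and variance var, evaluated at x."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_moments(z, pi1, s0=1.0, s1=0.5, s2=2.0):
    """E(delta|z) and var(delta|z) under the two-groups prior.
    s0, s1, s2 play the roles of sigma_0, sigma_1, sigma_2 (illustrative values)."""
    vu, vs = s1**2, s1**2 + s2**2                  # prior variances of the two components
    fz = (1 - pi1) * phi(z, s0**2 + vu) + pi1 * phi(z, s0**2 + vs)
    tdr = pi1 * phi(z, s0**2 + vs) / fz            # posterior weight of the sparse component
    mu_u, var_u = vu / (s0**2 + vu) * z, s0**2 * vu / (s0**2 + vu)
    mu_s, var_s = vs / (s0**2 + vs) * z, s0**2 * vs / (s0**2 + vs)
    q1 = (1 - tdr) * mu_u + tdr * mu_s
    q2 = (1 - tdr) * (mu_u**2 + var_u) + tdr * (mu_s**2 + var_s)
    return q1, q2 - q1**2

# Cross-check the analytic mean against numerical integration of the posterior
z, pi1 = 3.0, 0.1
d = np.linspace(-20.0, 20.0, 400001)
dd = d[1] - d[0]
prior = (1 - pi1) * phi(d, 0.25) + pi1 * phi(d, 4.25)  # 0.5^2 and 0.5^2 + 2.0^2
post = phi(z - d, 1.0) * prior
post /= post.sum() * dd
mean_num = np.sum(d * post) * dd
mean_an, var_an = posterior_moments(z, pi1)
assert abs(mean_an - mean_num) < 1e-4
```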

Effective Sample Size
Sample size explicitly enters Wald statistics or z-scores through the denominator term, se(β̂) ∝ 1/√N, where N is the effective sample size. For binary qualitative traits (e.g., disease / no disease), z-scores calculated through logistic regression are equivalent to those calculated using the Cochran-Armitage trend test (Wellek and Ziegler, 2012), where one is explicitly interested in testing the null hypothesis that the allele frequencies in the affecteds (p_A) and unaffecteds (p_U) are equal, H_0: p_A = p_U. Equivalently, let X be a scaled (0,1,2) random variable giving the number of alleles of interest, and compare the means in affecteds and unaffecteds using the test statistic

z = (X̄_A − X̄_U) / se(X̄_A − X̄_U).
Under H_0, assume X_A and X_U have the same underlying variance, σ² = var(X), and mean, µ. If there are N_A affecteds, then under H_0, X̄_A is normally distributed with mean µ and variance σ²/N_A. Similarly, if there are N_U unaffecteds, then under H_0, X̄_U is normally distributed with mean µ and variance σ²/N_U. Therefore, because the two samples are independent, under H_0, (X̄_A − X̄_U) is normally distributed with mean 0 and variance

var(X̄_A − X̄_U) = σ²(1/N_A + 1/N_U) = 2σ²/N_eff,  (61)

where the effective sample size, N_eff, is defined as

N_eff ≡ 2/(1/N_A + 1/N_U) = 2N_A N_U/(N_A + N_U).  (62)

With this definition of N_eff, if there are equal numbers of affecteds and unaffecteds, N_eff = N_A = N_U; i.e., N_eff is the number of subjects per arm of a balanced-design study. The factor 2 in Eqs. 61 and 62 does not affect the calculation of the meta-analysis z in Eq. 65 below. In the model PDF, Eq. 37, N_eff appears only as a multiplicative factor for the "per allele" (or per subject, with N_eff subjects) variance components σ_a² and σ_b². Thus, for example, if N_eff is redefined to be twice or half its value above (replacing 2 with 4 or 1), σ_a² and σ_b² will simply halve or double, retaining their "per allele" meaning. With N_eff defined as in Eq. 62, for balanced designs the total sample size for a qualitative trait is 2N_eff; total sample sizes for quantitative and qualitative traits can then be directly compared.
Thus, for binary qualitative traits, given K sub-studies, for the kth sub-study with N_{A,k} affecteds and N_{U,k} unaffecteds, the effective sample size N_k was defined as

N_k = 2/(1/N_{A,k} + 1/N_{U,k}).

The total effective sample size over the K sub-studies was then

N = Σ_{k=1}^{K} N_k.

For quantitative traits, the effective sample size over K sub-studies is just the sample size given by the sum of the subjects in each sub-study.
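The effective-sample-size computation is a one-liner (a sketch; the helper name and the sub-study counts are made up for illustration):

```python
def n_eff(n_cases, n_controls):
    """Effective sample size per arm: N_eff = 2 / (1/N_A + 1/N_U) = 2 N_A N_U / (N_A + N_U)."""
    return 2.0 / (1.0 / n_cases + 1.0 / n_controls)

# Balanced design: N_eff equals the per-arm count
assert abs(n_eff(500, 500) - 500.0) < 1e-9
# Unbalanced design: N_eff is pulled toward the smaller arm
assert abs(n_eff(100, 900) - 180.0) < 1e-9

# Total effective sample size over sub-studies: sum of per-study values
studies = [(1000, 1200), (350, 700)]          # (affecteds, unaffecteds), made-up counts
total = sum(n_eff(a, u) for a, u in studies)
```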

Meta Analysis z-scores
For a given SNP, z-scores z_k, k = 1, . . . , K, from K substudies with, respectively, effective sample sizes N_k, were combined in a weighted Z-test (Liptak, 1958) to estimate the meta-analysis z-score for the SNP. Under the null hypothesis H_0 (and ignoring inflation), z_k ∼ N(0, 1) for each k. Writing the weighted z in the form (Liptak, 1958; de Bakker et al., 2008; Zaykin, 2011)

z = Σ_k w_k z_k / √(Σ_k w_k²),  (65)

we chose the weights to be the square root of the sample size, w_k = √N_k. When the substudies are from similar populations, these are optimal weights (Zaykin, 2011), and the test is equivalent to weighting by the inverse standard deviation of the statistic used in the kth sample. (Inverse variance weighting is equivalent to choosing w_k = N_k, in which case the test is called the Mosteller-Bush test (Mosteller and Bush, 1954).)
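The weighted Z-test is easily sketched (hypothetical helper name and made-up inputs; with equal effective sample sizes it reduces to the unweighted Stouffer combination):

```python
import math

def meta_z(z_scores, n_eff):
    """Sample-size weighted Z-test: z = sum(w_k z_k) / sqrt(sum(w_k^2)), w_k = sqrt(N_k)."""
    w = [math.sqrt(n) for n in n_eff]
    return sum(wk * zk for wk, zk in zip(w, z_scores)) / math.sqrt(sum(wk * wk for wk in w))

# Equal effective sample sizes: reduces to the unweighted Stouffer combination
z = meta_z([1.0, 2.0, 3.0], [5000, 5000, 5000])
assert abs(z - (1.0 + 2.0 + 3.0) / math.sqrt(3)) < 1e-12
```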

Confidence Intervals for the Model Parameters
To estimate the 95% confidence intervals of the four model parameters, K = 20 pairs of parameter estimates were made from K split-half random divisions of the SNPs. Because the three model standard deviations (σ_0, σ_a, and σ_b) are constrained to be positive, and the polygenicity (π_1) is constrained to lie between 0 and 1, the parameters were first transformed to unconstrained forms (ranging from −∞ to +∞). Specifically, each standard deviation σ was transformed as σ′ = ln(σ), and the polygenicity was transformed as π′_1 = logit(π_1) = ln(π_1/(1 − π_1)). K pairs of these were then estimated with the minimization scheme described in the main paper. Split-half random divisions of the SNPs were generated by dividing the entire set of SNPs into 1000 non-overlapping blocks (the number of SNPs per block then well exceeded the average LD block size; using a threshold of r² ≥ 0.1 for the squared correlation, the number of independent SNPs was 339,161 out of a total of 9,279,485, giving on average fewer than 30 SNPs per LD block). Half of the blocks were randomly chosen to compose the data set for estimation of one set of the model parameters; the remaining blocks then provided an independent data set from which also to estimate the parameters. K = 20 such complementary random divisions were made. Note that there is data overlap, and therefore correlation, between different pairs of divisions, while the two elements of any one pair are independent. The variance of the difference between the pair of estimates from complementary independent data sets then provided an estimate for the variance of the corresponding model parameter, from which the standard error immediately follows.
Assuming the transformed estimates are distributed multi-normally, the 95% confidence interval for a (transformed) parameter was given by the point estimate from the full data set ±1.96× the corresponding standard error; these limits were then inverse transformed to the native parameter representation.
Let A and B be random variables for a pair of independent estimates of, say, π′_1, with realizations a and b; since only half the data were used in each of the independent estimates, one has var(A) = var(B) = 2var(π̂′_1). Then, var(π̂′_1) can be estimated from the K pairs of parameter point estimates as

v̂ar(π̂′_1) = (1/(4K)) Σ_{k=1}^{K} (a_k − b_k)²,

and the standard error, being the standard deviation of the test statistic, is then simply

se(π̂′_1) = √(v̂ar(π̂′_1)).

Letting π̂′_1 denote the point estimate from the entire set of SNPs of the (transformed) polygenicity parameter, the (transformed) 95% confidence interval CI′ is

CI′ = [π̂′_1 − 1.96 × se(π̂′_1); π̂′_1 + 1.96 × se(π̂′_1)].  (66)

The confidence interval CI in the native parameter representation is then given by inverse-transforming each of these limits in turn: CI = logit⁻¹(CI′) for the polygenicity, and CI = exp(CI′) for the standard deviation parameters.
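The split-half variance calculation can be sketched as follows (hypothetical helper name, and made-up pair estimates on the transformed scale; each half-data estimate has twice the full-data variance, so the variance of the pair difference is four times the full-data variance):

```python
import math

def split_half_se(pairs):
    """se of a full-data estimate from K independent split-half pairs (a_k, b_k):
    var_hat = (1/(4K)) * sum((a_k - b_k)^2), se = sqrt(var_hat)."""
    K = len(pairs)
    var_hat = sum((a - b) ** 2 for a, b in pairs) / (4.0 * K)
    return math.sqrt(var_hat)

# Made-up split-half estimates of a transformed parameter (e.g., logit(pi_1))
pairs = [(-3.28, -3.31), (-3.25, -3.33), (-3.30, -3.27), (-3.26, -3.29)]
se = split_half_se(pairs)
point = -3.29                                   # hypothetical full-data point estimate
lo, hi = point - 1.96 * se, point + 1.96 * se   # 95% CI on the transformed scale
```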