1 Supplementary Material : Empirical Bayes estimation of semi-parametric hierarchical mixture models for unbiased characterization of polygenic disease architectures

Genome-wide association studies (GWAS) suggest that the genetic architecture of complex diseases consists of unexpectedly numerous variants with small effect sizes. However, the polygenic architectures of many diseases have not been well characterized due to lack of simple and fast methods for unbiased estimation of the underlying proportion of disease-associated variants and their effect-size distribution. Applying empirical Bayes estimation of semi-parametric hierarchical mixture models to GWAS summary statistics, we confirmed that schizophrenia was extremely polygenic [~40% of independent genome-wide SNPs are risk variants, most within odds ratio (OR = 1.03)], whereas rheumatoid arthritis was less polygenic (~4 to 8% risk variants, significant portion reaching OR = 1.05 to 1.1). For rheumatoid arthritis, stratified estimations revealed that expression quantitative loci in blood explained large genetic variance, and low- and high-frequency derived alleles were prone to be risk and protective, respectively, suggesting a predominance of deleterious-risk and advantageous-protective mutations. Despite genetic correlation, effect-size distributions for schizophrenia and bipolar disorder differed across allele frequency. These analyses distinguished disease polygenic architectures and provided clues for etiological differences in complex diseases.


Main Simulation General settings
We evaluated the performance of our method using the SP-HMM to estimate the proportion of disease-associated SNPs, , and the effect-size distribution, g, through Monte Carlo simulations under realistic situations in GWAS. We considered a case-control study where n = 3000, 5000, 7000, 10000, 30000, or 100000 cases and controls, such that 2n = 6000, 10000, 14000, 20000, 60000, or 200000 in total, were sampled from a general population with disease prevalence K = 0.01. We set the number of SNPs as m = 20000, 50000, and 100000. We considered the 'less', 'moderate' and 'highly polygenic' settings to cover a wide range of polygenicity, in which the proportion of disease-associated SNPs, , was set to 3%, 10%, and 30%, respectively.
In generating genotype counts for SNPs with the effect size  in the case and control samples, the Hardy-Weinberg equilibrium for genotype frequencies in the general population and linkage equilibrium among SNPs (i.e., independence of alleles) were assumed. The for no-null SNPs are apart from 0 cannot be completely excluded. Thus, we also conducted the simulations using gamma distributions (Table S1b). We generated positive effects, ′s, from gamma distributions with expected value and standard deviation of (0.03, 0.015), (0.03, 0.015), and (0.018, 0.005) for the less, moderate, and highly polygenic settings, respectively. We specified the same distributional forms for negative effects, but we set the ratio of positive to negative effects as 2:1. We also considered other forms of the effect-size distribution. For the less or moderate polygenic settings with large effects, we generated positive effects according to a mixture distribution with two components: a gamma distribution with expected value of 0.03 and standard deviation of 0.015, and a fixed very large effect,  = 0.1, with mixing proportions of 95% and 5%, respectively. Similarly, in the highly polygenic with large effects setting, we generated positive effects according to a mixture model with two components: a gamma distribution with expected value of 0.018 and standard deviation of 0.005, and a fixed very large effect,  = 0.08, with mixing proportions of 98% and 2%, respectively. The numbers of SNPs were set to m = 100000, and, DAF's were generated from a uniform distribution U(0.01, 0.99).

Evaluation of estimation
From the simulated genotype data, the set of summary statistics ̂ and ̂̂ for m SNPs was obtained by a univariate logistic regression analysis. We conducted 100 simulation repetitions for each configuration. For each simulated dataset, we adopted the estimation method for  and g using the SP-HMM. Boxplots were used for assessing estimation accuracy and precision for Accuracy of estimating g was confirmed by average of estimated effect-size distribution across simulations. With respect to estimation of for comparison with our method, we also showed the results obtained by 'qvalue' R package (Bioconductor v.3.4) with the default setting.
For g, simple estimates of log-odds ratio, ̂, with/without some qvalue threshold calculated by the R package were shown. Boxplots of estimates of the liability-scale variance, V, were also shown 1.1.2. Simulations using estimated genetic architecture from real data In addition to the above artificial simulations, we conducted simulations using ′s based on the real data, i.e., the estimates of  and g by the SP-HMM from the P-value-based SNP sets ( using the -r --ld-window 21 option in PLINK 2 . LD between distant SNP pairs apart from each other more than 20 SNPs were neglected (set to r(1000G ) = 0). The covariance (equaling to correlation due to normalization) between the normalized log-odds ratios in a sample, ̂/ √̂̂, for the j-th and k-th SNPs is approximately equal to the sample LD, , between the two SNPs 3 .
The values of could not be obtained from each GWAS data. Thus, substituting (1000 ) for , we generate ̂′ s for the 1000 SNPs using the multivariate normal distribution: ̂~( , ), where the (j, k) element of is √̂̂√̂̂. We tested the case of more strong LD where is replaced 1.5 × (1000 ) (If 1.5 × (1000 ) >1 or < -1, is replaced by 1 or -1, respectively). The true log-odds ratios, ′s, were generated using normal distributions in the same way as described above (see also Table S1a). The process generating ̂′ s using ̂~( , ) for the 1000 SNPs was repeated 1000 times and we got ′s for the hypothetical 100000 pruned SNPs. Estimation performances for g, and V were assessed in the same manner as those for 'Main simulations' (see section 1.1.1).
With respect to the SNPs used in the DAF-stratified analysis, which are not pruned but randomly sampled from all SNPs, using the SNP set with 0.4 < DAF ≤ 0.6 for schizophrenia GWAS, we conducted simulations to investigate how much influence the LD had on the SP-HMM in the same way as the P-value pruned SNPs. In addition, the simplified case where = √0.03 for all pairs of SNPs (uniform correlation structure for genotypes), meaning weak but long range LD, was investigated.

Main simulations Simulations using normal distributions
Figures S1 and S3 show the simulation results for the estimates of  and g, respectively, in the settings using normal distributions for effect-size distributions, with DAFs ranging from 0.01 to 0.99. The results for the moderate polygenic setting with symmetric effect sizes using m=100,000 were also provided in Figure 1 in the main body. We also provided a summary of these results in Table S2. When the numbers of SNPs (m) were ≥50000 and the case and control sample sizes (n) were ≥7000 (i.e., ≥14000 in total), our estimation method generally yielded nearly unbiased estimates for  in the less, moderate, and highly polygenic settings ( Figure S1). In these cases, our method also captured well the range and the forms of the true effect size distributions ( Figure S3). For the real P-value-based and random-pruned SNP sets, the estimates of  and g by the SP-HMM should be reliable since there are about 80000～ 100000 SNPs and sufficient sample sizes (>16000 total sample size in case and control; Table   S3). For the real non-eQTL SNP sets with about 50000～75000 SNPs, estimates of  and g by the SP-HMM should be reliable, with the exception that the estimate of for Asian rheumatoid arthritis, 1.1%, is out of the range of simulated values.
Even under relatively small sample sizes, n = 3000 or 5000, the estimates of  and g were still unbiased for the less or moderate polygenic settings. However, for the highly polygenic settings,  was underestimated and the estimate of g was biased with higher frequency for large effects under small sample size situations, n = 3000 or 5000 (Figures S1 and S3).
In cases with small numbers of SNPs, m = 20000 (this is approximately corresponding to that of eQTL SNP sets),  was overestimated and the estimate of g was biased with higher frequency for small effects under the less polygenic settings (the overestimation of  should relate to estimation bias of g with higher frequency for small effects). On the other hand, under moderate or highly polygenic settings, even in the case of m = 20000, the method provided good estimates for  and g under n ≥7000. In the real eQTL SNP sets, the smallest estimate of  wase 18.4% in Asian rheumatoid arthritis GWAS, suggesting that even with more than the 'moderate' degree of polygenicity, the SP-HMM should be reliable. Particularly when n = 3000 in the less and moderate polygenic settings, or, when n ≤7000 in the highly polygenic settings, the SP-HMM underestimated  In the real analysis, the estimation of using SNPs with DAF = 0.01-0.2 for bipolar GWAS might likely be underestimated due to its sample size and its high polygenicity. The other results for DAF stratified analysis should not suffer from serious bias.
Exceptionally, in DAF = 0.4-0.6 and n ≥30000 (in the case ̂ should have the smallest sampling variance),  were unbiasedly estimated by the qvalue method. We also confirmed that the true effect-size distribution, g, cannot be estimated by simple estimates of log-odds ratio, ̂, even with some qvalue threshold ( Figure S6).

Simulations using gamma distributions
Figures S7 and S9 show the simulation results for the estimates of  and g, respectively, in the settings using gamma distributions for effect-size distributions. Estimates of were almost unbiased in n = 100000 ( Figure S7). Whereas, in the smaller sample sizes, especially in the moderate sample size (n=5000-1000), the SP-HMM tended to overestimate . In particular, in the highly polygenic setting was estimated as 0.4 on average for true 0.3 in n = 10000.
This overestimation for  should be connected to the overestimation of effects near 0 in g ( Figure S9). In reality, it is more natural that there are assumed to be more SNPs around zero effects in g, as are described in the settings using normal distribution. Note that even in the settings using gamma distribution, the SP-HMM detected the effect-size range and approximate form of true g. It was also remarkable that our method could properly estimate asymmetric effect-size distributions, possibly with a small peak for large effects ( Figure S9). In the same way as the settings using normal distributions, the SP-HMM estimated the liability-scale variance, V, almost unbiasedly, irrespective of the numbers of SNPs and the sample size in all the cases investigated ( Figure S8).

Simulations using estimated genetic architecture from real data
In the results of simulations using estimated genetic architecture from real data, the assigned number of associated SNPs and distribution of effect sizes are generally well reconstructed by the SP-HMM ( Figures S10-15). For the estimates of effect size distribution, the SP-HMM captured the assigned effect size distribution flexibly. These results are consistent with those of the polygenic architectures of complex diseases using the GWAS summary data, which are sufficiently well estimated by the SP-HMM. Note that the estimates of effect size distribution for rheumatoid arthritis GWAS, particularly for the non-eQTL SNP set, tended to vary from simulation to simulation compared with those of other GWAS because of relatively small sample size.

Simulations for the case with weak LD
Figures S16 and S17 show the simulation results for the estimates of  and g, respectively, taking account for weak LD among SNPs. For the cases using LD from the 1000 Genomes

Exploring natural selection
We explored the relationship between variants affecting complex diseases and natural selection using the random-pruned sets for each GWAS. Under the Wright-Fisher model with N diploids and irreversible mutation, the fitness of the genotypes AA, Aa, and aa in a locus were assumed to be 1, 1+s, and 1+2s, respectively. The stationary distributions of DAF, p, is given by where = 2 and is the mutation rate per site per generation (Sawyer and Hartl 1992 4 ; see Figure S20 for the distributions with various S). The variants in the genome were assumed to be independent of one another and the derived alleles were classified into three types: null alleles (= (0) = 0) with S = S(0) = 0 (i.e., selectively neutral), risk alleles (= (1)> 0with S = S(1), and protective alleles (= (2)< 0) with S = S (2). We assumed that |(2)| = |(1)|. When we draw a variant with a particular DAF from the genome, the probability that the variant has the effect (t) is expressed as where ( ) is the mutation rates for null (t = 0), risk (t = 1) and protective (t = 2) mutations standardized by 0 , i.e., 0 = 1. We applied the present model to random-pruned sets for each GWAS. For the j-th SNP the likelihood of the (the estimate of log-odds ratio) is given by , assuming the standard asymptotic normality for . The likelihood for the random-pruned set was calculated as the product of the likelihood over all variants in the set.

Supplemental Data
Empirical Bayes estimation of semi-parametric hierarchical mixture models for characterization of polygenic disease architectures

Contents Supplemental Tables
• Table S1a . Settings of the simulation studies using normal distributions.
• Table S1b. Settings of the simulation studies using gamma distributions.
• Table S2. Summary of the estimated proportions of disease-associated SNPs, ො, and effect-size distributions, ො, in the settings using normal distributions (m=100000, 50000, 20000) • Table S3. GWAS data used in this study.
• Table S4. Study type used in this study.

Supplemental Figures (continue)
Simulations using estimated genetic architecture from real data • Figure S10. Estimated proportions of disease-associated SNPs, ො, under the estimated genetic architectures from the P-value-based pruned SNPs. • Figure S11. Estimated proportions of disease-associated SNPs, ො, under the estimated genetic architectures from the eQTL/not eQTL SNPs. • Figure S12. Estimated proportions of disease-associated SNPs, ො, under the estimated genetic architectures from the DAF stratified SNPs. • Figure S13. Estimated effect-size distributions, ො, under the estimated genetic architectures from the P-value-based pruned SNPs. • Figure S14. Estimated effect-size distributions, ො, under the estimated genetic architectures from the eQTL/not eQTL SNPs. • Figure S15. Estimated effect-size distributions, ො, under the estimated genetic architectures from the DAF stratified SNPs.

Simulations for the case with weak linkage disequilibrium (LD)
• Figure S16. Estimated proportions of disease-associated SNPs, ො, in the settings with SNPs in weak LD. • Figure S17. Estimated effect-size distributions, ො, in the settings with SNPs in weak LD.

Real data analyses
• Figure S18. Raw effect size estimates, መ , for the P-value-based pruned SNPs in the GWAS summary data . • Figure S19. Estimated effect-size distributions for disease-associated SNPs, ො, using the random-pruned SNPs compared with those using the P-value-based SNPs. • Figure S20. The stationary distributions of DAF for various values of the selection parameter. • Figure S21. Result of fitting the three selection models.              (Table S2).   For each DAF bin, we used 100,000 SNPs with the exception that in C4D GWAS the numbers of SNPs in 0.4 < DAF ≤ 0.6, 0.6 < DAF ≤ 0.8, and 0.8 < DAF were 94506, 70170, and 49116, respectively, since the SNPs of C4D GWAS was limited (Table S2).