Original Research ARTICLE
SNP imputation bias reduces effect size determination
- Department of Neurology, University of California San Francisco, San Francisco, CA, USA
Imputation is a commonly used technique that exploits linkage disequilibrium to infer missing genotypes in genetic datasets, using a well-characterized reference population. While there is agreement that the reference population has to match the ethnicity of the query dataset, it is common practice to use the same reference to impute genotypes for a wide variety of phenotypes. We hypothesized that using a reference composed of samples with a different phenotype than the query dataset would introduce imputation bias. To test this hypothesis we used GWAS datasets from Amyotrophic Lateral Sclerosis (ALS), Parkinson Disease (PD), and Crohn's Disease (CD). First, we masked and then performed imputation of 100 disease-associated markers and 100 non-associated markers from each study. Two references for imputation were used in parallel: one consisting of healthy controls and another consisting of patients with the same disease. We assessed the discordance (imprecision) and bias (inaccuracy) of imputation by comparing predicted genotypes to those assayed by SNP-chip. We also assessed the bias on the observed effect size when the predicted genotypes were used in a GWAS study. When healthy controls were used as reference for imputation, a significant bias was observed, particularly in the disease-associated markers. Using cases as reference significantly attenuated this bias. For nearly all markers, the direction of the bias favored the non-risk allele. In GWAS studies of the three diseases (with healthy reference controls from the 1000 genomes as reference), the mean OR for disease-associated markers obtained by imputation was lower than that obtained using original assayed genotypes. We found that the bias is inherent to imputation as using different methods did not alter the results. In conclusion, imputation is a powerful method to predict genotypes and estimate genetic risk for GWAS. However, a careful choice of reference population is needed to minimize biases inherent to this approach.
In genome-wide association studies (GWAS), SNP data are used to find genetic loci associated with various traits, particularly common diseases. Due to the large number of tests performed (more than 2.5 million for the latest chips), correction for multiple hypothesis testing is necessary to avoid type I errors. However, after correction, there is typically not enough power to detect small effects (OR < 1.5), even with sample sizes exceeding 1000 cases and 1000 controls. A usual approach used in order to increase n, consists in merging datasets from two or more sources in a meta-analysis. However, if datasets were generated using different genotyping platforms (a likely scenario), a small minority of the total number of markers will be represented in both. In this case, non-overlapping genotypes are imputed using one or more reference populations (Guan and Stephens, 2008; Huang et al., 2009; Nothnagel et al., 2009; Zheng et al., 2011). Imputation methods are used to infer missing or untyped SNP genotypes based on known information (e.g., linkage disequilibrium between missing or untyped SNPs and their flanking typed SNPs) and can provide partial solutions for recovering missing or untyped genotype data (Stephens et al., 2001; Greenspan and Geiger, 2004; Browning and Browning, 2007). Several imputation methods using various statistical models such as the haplotype-clustering algorithm (Scheet and Stephens, 2006), the hidden Markov model (HMM) (Marchini et al., 2007), and the Markov Chain model (Li et al., 2006), have been proposed. Imputed genotypes, generated with these methods, have been used, successfully, to improve power in association analyses (Scott et al., 2007; Servin and Stephens, 2007; Sandhu et al., 2008; Sanna et al., 2008), to facilitate meta-analyses (Patsopoulos et al., 2011), and to replicate significant findings in follow-up studies (Willer et al., 2008).
Specifically, SNP imputation uses knowledge about haplotype structure in a densely genotyped population [often healthy controls from the HapMap International Consortium (2005), Jostins et al. (2011) or as more recently proposed, the 1000 genomes Project (Abecasis et al., 2010)] to infer unknown genotypes in the query population. Though popular SNP imputation algorithms vary in their details, they are based on the same general principle. To impute genotypes for a given individual, that individual's genotype is compared to the genotypes in the reference population (e.g., the 1000 genomes data). At each genomic region, a set of individuals from the reference population which closely matches the individual is chosen. The individual's genotype is assigned (i.e., “imputed”) based on a consensus using matching individuals from the reference panel. Thus, the imputed genotype is derived from and highly dependent on the genotypes in the reference population. This strategy can increase the power of a GWAS, enable replication of findings from different array types, and allow testing on a large number of SNPs to reveal the fine structure of an association peak (Marchini et al., 2007; Sanna et al., 2008; Willer et al., 2008; Zeggini et al., 2008; Becker et al., 2009; De Jager et al., 2009; Hao et al., 2009).
Here we analyze publically available data on three complex diseases and reveal a bias in SNP imputation that may confound this approach. Our results suggest that when solely healthy controls are used as reference for imputation, “risk” variants in the target population are more likely to be mistakenly imputed as “non-risk” alleles resulting in a deflation of the effect size in a GWAS.
To test the hypothesis that imputation introduces a systematic bias that ultimately results in a deflation of the effect size, we used three publically available datasets in Amyotrophic Lateral Sclerosis (ALS), Crohn's Disease (CD), and Parkinson's Disease (PD). For each dataset, the following steps were performed. We randomly split the dataset into two groups of equal age, gender, and case/control distribution. The two halves were termed split A and split B. We conducted a GWAS on split A. We ranked all SNPs by GWAS effect size and selected the top 100 disease associated markers (DAM) and the bottom 100 non-associated markers (NAM). We then proceeded to perform three imputations on the SNPs of interest. Each time, the SNP of interest was masked and an independent reference was used to impute the SNP. First, the cases from split A were imputed using controls from split B as reference for imputation, in order to measure the bias and error of imputation. Next, the cases from split A were imputed using cases from split B as reference for imputation, in order to see if the use of cases for imputation can improve accuracy and decrease bias. Finally, both cases and controls from the entire dataset were imputed using a commonly used publically available reference panel, in order to see if the measured effect size of a case control study is affected by using imputed genotypes vs. true genotypes.
Discordance Rate of Imputation
As most imputation approaches use a healthy control population as reference, we first conducted imputation of the masked genotypes using an independent set of healthy controls as reference. When imputing integer genotypes, discordance (Dint) at each SNP was defined as the percentage of samples where genotype was mistakenly inferred by imputation, and Dint = DM + Dm(where DM is the percent of genotypes where imputation over-estimated the major allele by one or two copies, and Dm is the percent of genotypes where imputation mistakenly over-estimated the minor allele by one or two copies). In other words, Dint is percent of genotypes which do not match imputation, composed of the cases where minor allele is mistakenly predicted Dm and cases where the major allele is mistakenly predicted DM. When imputing fractional genotypes, discordance (Dfrac) at each SNP was defined as the average of the absolute difference between the imputed fractional genotype and the “true” genotype (coded as 0, 1, or 2; where 2 = homozygous for the major allele) across all samples, where the “true” genotype is given by SNP-chip. In general, the average Dint = 15–20% across all diseases and Dfrac = 0.19 and 0.24 (Table 1). Interestingly, the lowest overall discordance rate was found in CD, the dataset with the largest sample size.
We observed that the imputation discordance rate (either Dint or Dfrac) was significantly higher for DAM than for NAM. For example, in the ALS dataset, Dint = 19.65% ± 1.16 (mean ± 2 × standard error) for DAM and Dint = 17.4% ± 1.11 for NAM, a statistically significant difference (p = 0.005) (Table 1). Similarly, imputation of PD samples was significantly less accurate (p = 0.0002) for the 100 DAM (Dint = 19.12% ± 1.3) than for 100 NAM (Dint = 15.74% ± 1.23). We found no difference in Dint between DAM and NAM in the CD dataset (p = 0.95). A similar pattern was found when using fractional genotypes (Dfrac) (Table 1).
When independent cases were used as a reference for imputation (instead of controls), a significant reduction in Dint or Dfrac was observed. For example, in the ALS dataset, the average discordance at DAM (Dint = 18.04% ± 1.15) using cases as reference was significantly lower (matched pairs t-test, p = 10−4) than the average discordance at the same SNPs using controls as reference (Dint = 19.65% ± 1.16), suggesting that more accurate imputation is obtained when matched cases are used as reference for imputation. The average Dint in the ALS dataset using independent ALS cases as a reference was 18.04% (± 1.15) for DAM and 17.6% (± 1.2) for NAM, a non-significant difference (p = 0.6) (Table 1).
Similar results were obtained in the PD dataset, in which the average discordance at DAM using controls as reference (Dint = 19.12% ± 1.3) was significantly higher than that observed when matched cases were used instead (Dint = 18.22% ± 1.4) (matched pairs t-test, p = 0.002). Although using independent cases as a reference reduces the discordance in imputation, a significant drop in imputation accuracy was still observed in the PD dataset for DAM (Dint = 18.22% ± 1.4) compared with NAM (Dint = 15.89% ± 1.21) (p = 0.01) (Table 1). We observed similar results using fractional imputation metrics.
In summary, the average total discordance rate of imputation on each data set ranged from 15 to 20% and SNPs with large effect sizes were significantly more discordant than SNPs with small effect sizes (up to 3.5%) in two of three datasets (PD and ALS). This discordance was attenuated when cases, instead of controls, were used as reference for imputation.
SNP Imputation Bias
In order to determine whether the discordance observed was random or systematic, we looked at bias. For integer imputation, we defined bias (B) as the difference in imputation discordance when predicting the major (M) and minor (m) alleles, Bint = DM − Dm. When using fractional imputation, bias (Bfrac) is defined at each SNP as the average of the signed difference between the imputed fractional genotype and the true genotype (coded as 0, 1, or 2; where 2 is homozygous for the major allele) across all samples. According to these definitions, a positive bias means imputation favors the major allele, and a negative bias means imputation favors the minor allele.
Before discussing the difference between using cases or controls as a reference, or between DAM and NAM, it is important to note that imputation is inherently biased toward the minor allele. For integer genotypes, when using controls as a reference to impute NAM (markers not associated with disease), there is a small significant bias toward the major allele of 1–2% in all three datasets (ALS p = 0.0067; PD p < 10−6; and CD p < 10−7), Table 2. This effect was reduced by using fractional imputation (Table 2). This inherent bias is the reason that we defined bias in terms of major or minor allele, and the reason that we split the DAM (disease markers) into two groups for further analysis: markers where the major allele is the susceptibility (“risk”) allele, and markers where the minor allele is the risk allele.
When controls were used as a reference for imputation DAM, a consistent bias against the risk allele was observed for all three diseases. For example, when the major allele was the risk allele, integer imputation was biased toward the minor allele (p < 10−7 for each disease) and when the minor allele was the risk allele, imputation was biased toward the major allele (p < 10−15 for each disease). Similar results were observed when fractional imputation was used.
Interestingly, when independent cases were used as a reference to impute DAM, the bias against risk alleles remained. Similarly to what we observed using controls as a reference, when the major allele was the risk allele, imputation was significantly biased toward the minor allele (p < 10−8 in ALS; p = 3 × 10−4 in PD; and p = 0.012 in CD). Conversely, when the minor allele was the risk allele, imputation was significantly biased toward the major allele (p < 10−6 for each disease).
Figure 1 shows the bias (Y-axis) when controls (left) or cases (right) are used as a reference to impute SNPs in the ALS dataset. With either reference population the bias is consistently against the risk allele and can be observed for all DAM (circles) including the most significantly associated SNPs (dark gray) as well as for more modestly associated (light gray). However, the magnitude of the bias is lower when cases are used as reference. We observed similar results in the PD and CD data sets (Supplementary Figures S1, S2). Of note, results were largely unchanged when the call tolerance parameter T was changed from 0.5 to 0.3 or 0.1 (data not shown), or when fractional genotypes were used (Supplementary Figures S3–S5).
Figure 1. Imputation bias vs. odds ratio of association in ALS. Each circle represents one of the 100 DAM in ALS. For each SNP, the odds ratio (OR) of association (x-axis) indicates whether the minor allele (OR > 1) or the major allele (OR < 1) is the susceptibility allele (the allele more prevalent in cases than controls). The imputation bias (y-axis) indicates whether imputation error favors the major allele (positive values) or the minor allele (negative values). When controls were used as the reference for imputation, imputation is biased against the susceptibility allele. When an independent set of cases was used as the reference for imputation, the bias is significantly decreased. For reference, the 100 NAM (OR ≈ 1) are shown as boxes. Points are shaded by the log10 p-value of association with disease. The odds ratios of NAM are exaggerated for visual clarity.
To evaluate the potential effect of imputation on a genome-wide association study (GWAS), the same three datasets (PD, ALS, and CD) were used to perform three parallel association studies. For this analysis, imputation was performed using the CEU subset of the 1000 Genomes as a reference for imputation (Abecasis et al., 2010). For any SNP, the odds ratio (OR) was defined by logistic regression (OR = exp(b), where b is the estimate of the logit coefficient in the logistic regression). Odds ratios computed from imputed genotypes (imputed OR) were compared to those obtained using data from experimentally determined genotypes (from SNP chip; termed “true” OR). Imputed OR were compared to true OR for each DAM. Figure 2 shows the distribution of imputed/true OR for the top 100 DAM in each dataset. Using integer imputation, the magnitude of the mean imputed OR was only 64.2% ± 2.8 (mean ± 2 × standard error) as high as the magnitude of the true OR in ALS, 71.1% ± 3.2 in PD, and 73.8% ± 2.4 in CD. In all cases, it is evident that the distribution of OR is shifted toward the left of 1.0, a vertical dotted line which represents equality between imputed OR and true OR. Using fractional genotypes yielded similar results (Figure 2). Using another commonly used imputation software algorithm (Beagle) yielded similar results (Figure 2). However, in contrast to Mach imputation, the magnitude of the mean imputed OR was closer to the magnitude of the true OR in each dataset. In summary, the observed effect size after imputation is considerably lower for both types of imputation although in the second analysis using Beagle and a larger reference dataset improves the accuracy of imputation. It should be noted that for a few SNPs, the imputed OR is nearly equal or greater than the true OR (imputed OR/true OR ≥ 1). That is, for these SNPs, there is no decrease in observed magnitude of association with disease when imputed genotypes are used.
Figure 2. The distribution of (imputed OR/true OR) for 100 DAM in each dataset. In each of three datasets, 100 DAM were selected and the odds ratio of association with disease was estimated using both genotyped (true) data and imputed data. The ratio of imputed odds ratio to true odds ratio (x-axis) takes a similar distribution across the 100 SNPs in each disease. The odds ratio (OR) of association was generally lower for imputed data than for true data (imputed OR/true OR < 1). This hold true whether we use whole number imputation by Mach (top), fractional imputation by Mach (middle), or whole number imputation by Beagle (bottom). In contrast to Mach imputation, the magnitude of the mean imputed OR was closer to the magnitude of the true OR for all three datasets when using Beagle imputation. ALS, Amyotrophic Lateral Sclerosis; PD, Parkinson's Disease; CD, Crohn's Disease.
We have shown that imputation of DAM is consistently and significantly biased against the risk allele. This was seen in three diseases, and using two different software algorithms of imputation. The number of top SNPs selected as DAM or NAM was arbitrary and we acknowledge that many of these SNPs are likely to be false positives. However, since this arbitrary threshold was used for all datasets, we deemed this as a valid strategy. Furthermore, the fact that using independent cases as a reference (rather than controls) reduces the imputation bias indicates that at least some of these SNPs may be truly associated with disease.
While imputation bias is reduced when matching cases were used as reference for imputation, the effect is still present. We offer three possible explanations for the persistence of bias in this scenario. First, as in most common diseases, significant genetic heterogeneity could result in reference cases not carrying the same “risk” haplotype structure as the original cases, thus leading to underestimation of “risk” alleles during imputation. Second, the disease variants/mutations may occur in relatively small genetic windows which are not spanned by enough SNPs to make imputation effective. Third, the disease alleles are rare. If the frequency of the “non-risk” allele greatly outweighs the frequency of the “risk” allele, then the inherent bias for common alleles will add to the apparent bias for the “non-risk” allele. Future analyses aimed at describing the relative contribution of these hypotheses in diseases of varying genetic complexity are needed.
Although the Beagle imputed odds ratios are closer to the real odds ratios than the Mach imputed odds ratios, they are still significantly less than the real odds ratios. The reduced bias for Beagle imputation can be explained by at least two reasons. First, Beagle may be using a more accurate algorithm for imputation though this is unlikely given previously published head-to-head analysis of Beagle vs. Mach, and, second, the reference panel may be more up to date and consists of more individuals.
In conclusion, while combining datasets by imputation can lead to a more powerful GWAS (Becker et al., 2009; Hao et al., 2009) by allowing successful identification of SNPs associated with various phenotypes (Sanna et al., 2008; Willer et al., 2008; Zeggini et al., 2008; De Jager et al., 2009), the described decrease in signal inherent to imputation can partially offset any gain in power resulting from the combination of studies. Important implications of this finding include the fact that some truly associated variants may not be detected, and that some genome-wide significant findings may have larger true effect sizes than estimated. Since the imputation error of any given SNP cannot be known a-priori, individual genotyping of candidate SNPs by imputation should always be performed as a follow-up (Halperin and Stephan, 2009). In summary, imputation is a powerful method to estimate genetic risk at the population and individual level, but a careful choice of control population is required to minimize biases inherent to the approach. A plausible strategy is to consider deeper genotyping or whole genome sequencing of a small panel of ethnically matched cases and controls to be used as a reference for imputation.
Datasets: Three Case-Control GWAS
Quality controlled, genotype-level data from three previously published independent case-control GWAS in individuals of European ancestry in Amyotrophic Lateral Sclerosis (ALS), Crohn's Disease (CD), and Parkinson's Disease (PD) were obtained from dbGAP (Mailman et al., 2007) (Supplementary Table S1). In CD (Rioux et al., 2007) and PD (Fung et al., 2006), cases and controls were matched by sex, age (or year of birth), and ancestry (Rioux et al., 2007; Simon-Sanchez et al., 2009). For ALS cases (Schymick et al., 2007), a sample from neurologically normal controls (Simon-Sanchez et al., 2007) were matched for age and gender and ancestry (Schymick et al., 2007). After quality control done by the original authors, we performed a second layer of quality control on markers (MAF > 0.05, HWE p-value > 10−6 in controls, genotype success rate > 95%).
For each disease dataset, half of the patients and a matched number of controls were extracted (split “A”) while maintaining original ratios of gender and age. This half of the data (split A) was used to perform a association study (using software Plink Purcell et al., 2007), using a genotypic model or a linear dose model. Here we report results using the genotypic model; results for the linear trend model were largely similar and not shown. The 100 markers with the largest absolute effect size (DAM) were considered for imputation (OR > 1.88 in ALS, OR > 1.53 in PD, OR > 1.44 in CD). As a control, 100 SNP markers with the smallest absolute effect size (NAM) were imputed with the same procedure (1 ≤ OR < 1.001).
For each marker of interest (DAM or NAM), genotypes of cases of split A were masked with the goal of predicting them using imputation. Next, imputation of the masked genotypes was carried out using all markers within a 1 Mb window centered on each query SNP. The imputation window size was chosen large enough so as to include neighborhood SNPs that have r2 > 0.2 with the query SNP. The imputation was repeated with each of two reference sets: (a) the previously unused healthy controls from split B of the same study, (b) an equal number of the previously unused affected case individuals from split B of the study.
Next, for each marker of interest (DAM or NAM) the genotypes of all cases and all controls from each study (split A and split B) were masked with the goal of predicting them with imputation. Imputation was performed using the CEU subset of the 1000 Genomes as a reference for imputation (Abecasis et al., 2010).
Imputation was carried out using Mach (Li et al., 2006) (see Nothnagel et al., 2009 for an evaluation of the relative performance of this algorithm compared to others), with default settings and default quality control criteria for SNPs (MAF > 0.05, HWE p-value > 10−6, genotype success rate > 95%). For each imputed genotype, Mach outputs a fractional genotype (Gimp) between zero and two that corresponds to the inferred number of copies of the minor allele, where minor allele is determined using dataset frequencies (cases and controls). The fractional genotype is compared to the genotype by SNPchip (G0), or “true” genotype.
Discordance and Bias using Integer Genotypes
Using a call tolerance parameter (T), the integer imputed genotype (Gint) is homozygous for the major allele if Gimp < T, heterozygous if 1 − T ≤ Gimp < 1 + T, or homozygous for the minor allele if Gimp ≥ 2 − T; otherwise the genotype is considered missing. Results using T = 0.5 are shown here. Results using stricter tolerance thresholds T = 0.3 and T = 0.1 were largely similar and are not shown. When using integer genotypes, discordance Dint = DM + Dm is the percent of imputed genotypes that do not match genotypes from SNP chip, where DM is the percent of samples where rounded imputation mistakenly over-predicts the major allele in the genotype (by one or two copies), and Dm is the percent of samples where imputation mistakenly over-predicts the minor allele in the genotype (by one or two copies). The bias B is defined as Bint = DM − Dm. The bias Bint is positive when the major allele is over-predicted and negative when the minor allele is over-predicted. The bias cannot exceed the discordance: |Bint| ≤ Dint.
The discordance (or bias) calculated at individual SNPs is the sum (or difference) of two proportions. To compare groups of SNPs, a Gaussian approximation of the discordance (or bias) distribution was employed and t-statistics compared the difference in means between two groups of SNPs. Normality of the distributions were tested by the Anscombe–Glynn test of kurtosis. Further, non-parametric tests of differences in medians were used in parallel to confirm results.
Discordance and Bias using Fractional Genotypes
For each SNP, we computed across samples the mean discordance Dfrac = mean |Gimp − G0|, and the mean bias Bfrac = mean (Gimp − G0). The bias Bfrac is positive when the major allele is over-predicted and negative when the minor allele is over-predicted across all samples. The bias cannot exceed the discordance: |Bfrac|≤Dfrac.
The discordance (or bias) calculated at individual markers is a mean across the samples. To compare groups of markers (e.g., DAM vs. NAM), a Gaussian approximation of the discordance (or bias) distribution was employed and t-statistics compared the difference in means between two groups of SNPs. Normality of the distributions were tested by the Anscombe–Glynn test of kurtosis. Further, non-parametric tests of differences in medians were used in parallel to confirm results (not shown).
Sergio E. Baranzini, Pouya Khankhanian, and Pierre-Antoine Gourraud conceived and designed the study. Pouya Khankhanian and Lennox Din performed the data analysis. Stacy J. Caillier performed the assays. Sergio E. Baranzini, Pouya Khankhanian, Lennox Din, and Pierre-Antoine Gourraud wrote the paper. All authors read and approved the final manuscript.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Sergio E. Baranzini is a Harry Weaver Scholar and is funded by a grant from the National Multiple Sclerosis Society (RG4051A2/1). Funding support for Genome-Wide Genotypic Association of Amyotrophic Lateral Sclerosis was provided by NIH Intramural funding from NIA and NINDS, with additional funding from the ALS Association and the Packard Center for ALS Research at Johns Hopkins and the genotyping of samples was provided by the National Institute of Neurological Disorders and Stroke (NINDS). Funding support and the genotyping of samples for the NINDS Ischemic Stroke Genetics Study was provided by the National Institute of Neurological Disorders and Stroke (NINDS) (Singleton, PI). Funding support for NINDS Parkinsons Disease was provided by the National Institute for Neurological Disease and Stroke and the genotyping of samples was provided by the Singleton Lab (NIA Laboratory of Neurogenetics) with support from NINDS. The NIDDK IBDGC Crohn's Disease Genome-Wide Association Study was conducted by the NIDDK IBDGC Crohn's Disease study investigators and supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This manuscript was not prepared in collaboration with investigators of the NIDDK IBDGC Crohn's Disease study and does not necessarily reflect the opinions or views of the IBDGC or the NIDDK. The dataset(s) used for the analyses described in this manuscript were obtained from the dbGAP, NINDS, and NIDDK databases found at http://www.ncbi.nlm.nih.gov/ through accession numbers phs000004, phs000101, phs000102, phs000089, and phs000130. Sarah B. Hill provided editorial assistance.
The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2015.00030/abstract
Supplementary Figure S1. Bias vs. log odds ratio of association in Parkinson's Disease. Analog of Figure 1 for Parkinson's Disease.
Supplementary Figure S2. Bias vs. log odds ratio of association in Crohn's Disease. Analog of Figure 1 for Crohn's Disease.
Supplementary Figure S3. Analog of Figure 1 using fractional imputed genotypes.
Supplementary Figure S4. Analog of Figure S1 using fractional imputed genotypes.
Supplementary Figure S5. Analog of Figure S2 using fractional imputed genotypes.
Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., et al. (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073. doi: 10.1038/nature09534
Becker, T., Flaquer, A., Brockschmidt, F. F., Herold, C., and Steffens, M. (2009). Evaluation of potential power gain with imputed genotypes in genome-wide association studies. Human Hered. 68, 23–34. doi: 10.1159/000210446
Browning, S. R., and Browning, B. L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097. doi: 10.1086/521987
De Jager, P. L., Jia, X., Wang, J., De Bakker, P. I., Ottoboni, L., Aggarwal, N. T., et al. (2009). Meta-analysis of genome scans and replication identify CD6, IRF8 and TNFRSF1A as new multiple sclerosis susceptibility loci. Nat. Genet. 41, 776–782. doi: 10.1038/ng.401
Fung, H. C., Scholz, S., Matarin, M., Simon-Sanchez, J., Hernandez, D., Britton, A., et al. (2006). Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 5, 911–916. doi: 10.1016/S1474-4422(06)70578-6
Hao, K., Chudin, E., McElwee, J., and Schadt, E. E. (2009). Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genet. 10:27. doi: 10.1186/1471-2156-10-27
Huang, L., Li, Y., Singleton, A. B., Hardy, J. A., Abecasis, G., Rosenberg, N. A., et al. (2009). Genotype-imputation accuracy across worldwide human populations. Am. J. Hum. Genet. 84, 235–250. doi: 10.1016/j.ajhg.2009.01.013
Jostins, L., Morley, K. I., and Barrett, J. C. (2011). Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets. Eur. J. Hum. Genet. 19, 662–666. doi: 10.1038/ejhg.2011.10
Marchini, J., Howie, B., Myers, S., McVean, G., and Donnelly, P. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913. doi: 10.1038/ng2088
Patsopoulos, N. A., Esposito, F., Reischl, J., Lehr, S., Bauer, D., Heubach, J., et al. (2011). Genome-wide meta-analysis identifies novel multiple sclerosis susceptibility loci. Ann. Neurol. 70, 897–912. doi: 10.1002/ana.22609
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795
Rioux, J. D., Xavier, R. J., Taylor, K. D., Silverberg, M. S., Goyette, P., Huett, A., et al. (2007). Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet. 39, 596–604. doi: 10.1038/ng2032
Sandhu, M. S., Waterworth, D. M., Debenham, S. L., Wheeler, E., Papadakis, K., Zhao, J. H., et al. (2008). LDL-cholesterol concentrations: a genome-wide association study. Lancet 371, 483–491. doi: 10.1016/S0140-6736(08)60208-1
Sanna, S., Jackson, A. U., Nagaraja, R., Willer, C. J., Chen, W. M., Bonnycastle, L. L., et al. (2008). Common variants in the GDF5-UQCC region are associated with variation in human height. Nat. Genet. 40, 198–203. doi: 10.1038/ng.74
Scheet, P., and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644. doi: 10.1086/502802
Schymick, J. C., Scholz, S. W., Fung, H. C., Britton, A., Arepalli, S., Gibbs, J. R., et al. (2007). Genome-wide genotyping in amyotrophic lateral sclerosis and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 6, 322–328. doi: 10.1016/S1474-4422(07)70037-6
Scott, L. J., Mohlke, K. L., Bonnycastle, L. L., Willer, C. J., Li, Y., Duren, W. L., et al. (2007). A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345. doi: 10.1126/science.1142382
Simon-Sanchez, J., Scholz, S., Fung, H. C., Matarin, M., Hernandez, D., Gibbs, J. R., et al. (2007). Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum. Mol. Genet. 16, 1–14. doi: 10.1093/hmg/ddl436
Simon-Sanchez, J., Schulte, C., Bras, J. M., Sharma, M., Gibbs, J. R., Berg, D., et al. (2009). Genome-wide association study reveals genetic risk underlying Parkinson's disease. Nat. Genet. 41, 1308–1312. doi: 10.1038/ng.487
Willer, C. J., Sanna, S., Jackson, A. U., Scuteri, A., Bonnycastle, L. L., Clarke, R., et al. (2008). Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat. Genet. 40, 161–169. doi: 10.1038/ng.76
Zeggini, E., Scott, L. J., Saxena, R., Voight, B. F., Marchini, J. L., Hu, T., et al. (2008). Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 40, 638–645. doi: 10.1038/ng.120
Keywords: genomics, genome-wide association study, SNP imputation, haplotype estimation
Citation: Khankhanian P, Din L, Caillier SJ, Gourraud P-A and Baranzini SE (2015) SNP imputation bias reduces effect size determination. Front. Genet. 6:30. doi: 10.3389/fgene.2015.00030
Received: 20 October 2014; Accepted: 21 January 2015;
Published online: 09 February 2015.
Edited by:Eduardo Manfredi, Institut National de la Recherche Agronomique, France
Reviewed by:Yulia A. Medvedeva, King Abdullah University of Science and Technology, Saudi Arabia
Wei Guo, National Institute of Health, USA
Copyright © 2015 Khankhanian, Din, Caillier, Gourraud and Baranzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Sergio E. Baranzini, Department of Neurology, University of California San Francisco, 675 Nelson Rising Lane, Suite 215, San Francisco, CA 94158, USA e-mail: email@example.com