Replication of GWAS “Hits” by Race for Breast and Prostate Cancers in European Americans and African Americans

In this study, we assessed association of genome-wide association studies (GWAS) “hits” by race with adjustment for potential population stratification (PS) in two large, diverse study populations: the Carolina Breast Cancer Study (CBCS; N total = 3693 individuals) and the University of Pennsylvania Study of Clinical Outcomes, Risk, and Ethnicity (SCORE; N total = 1135 individuals). In both study populations, 136 ancestry informative markers and GWAS “hits” (CBCS: FGFR2, 8q24; SCORE: JAZF1, MSMB, 8q24) were genotyped. Principal component analysis was used to assess ancestral differences by race. Multivariable unconditional logistic regression was used to assess differences in cancer risk with and without adjustment for the first ancestral principal component (PC1) and for an interaction effect between PC1 and the GWAS “hit” (SNP) of interest. PC1 explained 53.7% of the variance for CBCS and 49.5% of the variance for SCORE. European Americans and African Americans were similar in their ancestral structure between CBCS and SCORE, and cases and controls were well matched by ancestry. In the CBCS European Americans, 9/11 SNPs were significant after PC1 adjustment, but after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant (rs1219648 in FGFR2); for CBCS African Americans, 6/11 SNPs were significant after PC1 adjustment, and after adjustment for the PC1 by SNP interaction effect, all six SNPs remained significant and an additional SNP became significant. In the SCORE European Americans, 0/9 SNPs were significant after PC1 adjustment and no changes were seen after additional adjustment for the PC1 by SNP interaction effect; for SCORE African Americans, 2/9 SNPs were significant after PC1 adjustment, and after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant (rs16901979 at 8q24). We show that genetic associations by race are modified by interaction between individual SNPs and PS.

While it is now standard practice to test for population stratification (PS) and remove ancestral outliers from analysis in GWAS, further adjustment for PS may be necessary when studying a recently admixed population such as African Americans or Hispanic Americans (Tiwari et al., 2008). In addition, because risk allele frequencies can vary by ancestral group, interaction effects between a SNP of interest and PS may be needed in order to fully understand differences in potential genetic associations by race. The objective of the current study was to validate potential GWAS "hit" associations in two large, diverse study populations of breast cancer and prostate cancer, and to evaluate adjustments for potential PS in association testing of GWAS "hits" within each study.

CAROLINA BREAST CANCER STUDY STUDY POPULATION
The CBCS is a population-based case-control study of breast cancer conducted in North Carolina (as described in Millikan et al., 2003). Briefly, eligible cases included women ages 20-74 who were diagnosed with primary invasive breast cancer from 1993 to 2001 and lived within a 24-county study area. Cases were identified using rapid case ascertainment in cooperation with the North Carolina Central Cancer Registry. Randomized recruitment was used to oversample African Americans and women younger than 50 years of age. Women diagnosed with breast carcinoma in situ (CIS) from 1996 to 2001 were also enrolled in the study. Eligible controls were women aged 20-74 years, residing within the study area, with no history of breast cancer, and were identified using Division of Motor Vehicles lists (for women under 65) and Medicare records (for women age 65-74). Controls were frequency matched to cases according to race within 5-year age categories. Women who agreed to participate in the study provided informed consent and completed an in-home interview regarding known and suspected breast cancer risk factors. Women were also asked to provide a 30-ml blood sample. DNA was extracted from the blood samples and stored at −80˚C. The interview participation rates for invasive cases and controls were 76 and 55%, respectively, and for CIS cases and controls were 83 and 65%, respectively. The final study population consisted of 3693 individuals, 2319 European Americans (62.8%), and 1374 African Americans (37.2%); 1946 (52.7%) cases and 1747 (47.3%) controls. Fifty-three (N = 53; <2%) individuals who reported "other" race (Hispanic, mixed race or other) were excluded. Age was defined as age in years at breast cancer diagnosis for cases or at the time of sampling for controls. Self-identified race was reported by each study participant during the study interview. An offset variable to account for the sampling design was included in the analysis.
All study procedures involving human subjects were approved by the University of North Carolina (UNC) at Chapel Hill Institutional Review Board.

STUDY OF CLINICAL OUTCOMES, RISK, AND ETHNICITY AT THE UNIVERSITY OF PENNSYLVANIA
Incident prostate cancer cases were identified through Urologic Oncology Clinics at multiple hospitals of the University of Pennsylvania Health System (UPHS) between 1995 and 2008 and included in the SCORE study. Controls were men attending UPHS general medicine clinics who were ascertained concurrently with the prostate cancer cases (i.e., between 1995 and 2008) and were frequency matched to cases according to race. All study participants provided informed consent. Three hundred five (N = 305; 20%) individuals were excluded because they were of "other" race or had missing genotype data for ancestry estimation. Our final study population consisted of 1135 individuals, 713 European Americans (62.8%), and 422 African Americans (37.2%); 808 (71.1%) were cases and 327 (28.9%) were controls. Age was defined as age at consent for both cases and controls. Self-identified race was reported by each study participant during the study interview. All study procedures involving human subjects were approved by the University of Pennsylvania Institutional Review Board.

GENOTYPING RESULTS
A panel of 200 ancestry informative markers (AIMs) was genotyped as part of a multiplex, custom candidate gene SNP panel assay using the Illumina GoldenGate platform for all 3693 CBCS and all 1165 SCORE individuals (Illumina, Inc., San Diego, CA, USA; Barnholtz-Sloan et al., 2010). Previous studies have shown that at least 50-100 AIMs are needed to accurately assign an individual's ancestry; fewer markers are needed when the average allele frequency difference between ancestral populations is 0.6 or above (Risch et al., 2002; Tsai et al., 2005; Choudhry et al., 2006). AIMs were selected to maximize the difference in allele frequencies between ancestral populations and the Fisher's information criterion (FIC; Pfaff et al., 2004) for distinguishing between African and European ancestry, based upon ancestral allele frequencies from African (YRI) and European (CEU) populations in HapMap (www.hapmap.org). AIMs were prioritized based on having the highest FIC values in the following order: 90% European/10% African, 10% European/90% African, and 50% European/50% African. This prioritization allowed the AIMs to be chosen to represent the whole expected ancestral distribution of this population. In the CBCS dataset, 42 AIM SNPs were dropped by Illumina and 14 failed genotyping, resulting in 144 genotyped AIMs. In the SCORE dataset, 42 SNPs were dropped by Illumina and 9 failed genotyping, resulting in 149 genotyped AIMs. There were 136 AIMs in common between the final CBCS and SCORE datasets, and these AIMs were used in the analysis.
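As a rough illustration of the first selection criterion only (the allele frequency difference between ancestral populations), the sketch below ranks markers by the absolute difference between their YRI and CEU allele frequencies. The marker names and frequencies are hypothetical and are not taken from the study, and the FIC-based prioritization described above is not implemented here. The final two lines simply restate the panel bookkeeping reported in the text.

```python
# Hypothetical allele frequencies (NOT study data): marker -> (freq in YRI, freq in CEU).
example_freqs = {
    "aim_A": (0.95, 0.10),
    "aim_B": (0.60, 0.45),
    "aim_C": (0.05, 0.80),
    "aim_D": (0.50, 0.48),
}


def rank_by_delta(freqs):
    """Return (marker, delta) pairs sorted by absolute allele frequency
    difference between the two ancestral populations, largest first."""
    deltas = {name: abs(p_yri - p_ceu) for name, (p_yri, p_ceu) in freqs.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_delta(example_freqs)
for name, delta in ranked:
    print(f"{name}: delta = {delta:.2f}")

# Panel bookkeeping as reported in the text: 200 designed AIMs,
# minus those dropped by the vendor and those that failed genotyping.
cbcs_genotyped = 200 - 42 - 14
score_genotyped = 200 - 42 - 9
```

Markers with a large difference in either direction (here the hypothetical `aim_A` and `aim_C`) are the most informative for separating the two ancestral groups, which is why fewer of them are needed.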
In addition, for CBCS the custom Illumina SNP panel included many SNPs in breast cancer candidate genes, including SNPs in known GWAS hit genes/regions (FGFR2 and 8q24), and for SCORE the custom Illumina SNP panel included many SNPs in prostate cancer candidate genes, including SNPs in known GWAS hit genes/regions (MSMB, rs7920517 only, and 8q24). Some SNPs for SCORE were genotyped using TaqMan (JAZF1, MSMB). For CBCS all 1946 cases and 1747 controls were genotyped for 11 breast cancer GWAS SNPs. For SCORE, only 597 cases and 322 controls were genotyped for 9 prostate cancer GWAS SNPs (88.8% of the total SCORE sample). For each of these GWAS hit genes/regions, 10/11 SNPs in FGFR2 and 1/1 SNPs in 8q24 for CBCS, and 1/1 SNPs in JAZF1, 5/5 SNPs in MSMB, and 3/3 SNPs in 8q24 for SCORE, passed each study's respective quality control criteria for successful genotyping and were included for analysis. Further details regarding the genotyping call rates for both datasets have been published elsewhere (Barnholtz-Sloan et al., 2010; Chang et al., 2011).

STATISTICAL ANALYSIS
Population stratification for each study population was assessed using a multi-pronged approach based on the 136 AIM panel, stratified by study site (CBCS or SCORE). First, principal components analysis was used to assess overall similarities of European Americans and African Americans by study site via scatter plots of the first principal component (PC1) versus the second principal component (PC2) using R. Second, principal components analysis was used to assess PS by race and case-control status via scatter plots of PC1 versus PC2 using R-based programs. Finally, we conducted a set of analyses using multivariable unconditional logistic regression for top breast cancer GWAS "hits" in the CBCS case-control data, and for top prostate cancer GWAS "hits" in the SCORE case-control data, with and without adjustment for PC1 and also for PC1 by GWAS "hit" interaction effects, generating odds ratios (OR) and 95% confidence intervals (95% CI) using R-based programs and SAS version 9.2.
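The regression models described above can be written out explicitly. The following is a sketch under assumptions not stated in the text: G denotes the SNP genotype under an additive (0/1/2) coding, Z collects any other covariates (for example age, or the sampling offset in CBCS), and all symbols are illustrative.

```latex
\begin{align}
\text{Unadjusted:}\quad
  \operatorname{logit} P(\text{case}) &= \beta_0 + \beta_1 G + \gamma^{\top} Z \\
\text{PC1-adjusted:}\quad
  \operatorname{logit} P(\text{case}) &= \beta_0 + \beta_1 G + \beta_2\,\mathrm{PC1} + \gamma^{\top} Z \\
\text{Interaction:}\quad
  \operatorname{logit} P(\text{case}) &= \beta_0 + \beta_1 G + \beta_2\,\mathrm{PC1}
      + \beta_3\,(G \times \mathrm{PC1}) + \gamma^{\top} Z \\
\text{SNP odds ratio at } \mathrm{PC1} = c:\quad
  \mathrm{OR}_G(c) &= \exp(\beta_1 + \beta_3 c)
\end{align}
```

In the interaction model, exp(β1) is the SNP odds ratio only at PC1 = 0, so the estimated SNP main effect and its standard error can differ from the PC1-adjusted model even when β3 itself is not statistically significant.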

RESULTS
We first compared the allele frequencies for the tested GWAS "hits" in the CBCS and SCORE studies by race to the HapMap African Yorubans (YRI; as compared to African Americans) and CEPH Europeans (CEU; as compared to European Americans; Table 1).

Frontiers in Genetics | Applied Genetic Epidemiology
For CBCS African Americans, allele frequencies for six of the SNPs varied from YRI by more than 5%, while for CBCS European Americans, none of the allele frequencies varied from CEU by more than 5%. For SCORE African Americans, allele frequencies for six of the SNPs varied from YRI by more than 5%, while for SCORE European Americans, allele frequencies for three of the SNPs varied from CEU by more than 5%. PC1 explained 53.7% of the variance in ancestry for CBCS and 49.5% of the variance in ancestry for SCORE. As seen with the joint analysis, the remaining PCs each accounted for <1% of the variance for each study site individually. Figure 1 shows the ancestral structure for CBCS and SCORE separately by race via principal components analysis. PC1 in both samples can be interpreted as the African axis of ancestral variation. Both panels of Figure 1 showed that in this joint analysis, European Americans and African Americans were similar in terms of their ancestral structure between CBCS and SCORE. Figure 2 shows the ancestral structure for CBCS and SCORE separately by case-control status via principal component analysis, showing that cases and controls seemed to be fairly well matched by ancestry. Eleven breast cancer GWAS "hits" were tested in CBCS overall and stratified by race with and without PS adjustment (Table 2).
Overall CBCS results are shown in Table A1 in the Appendix. In the CBCS European Americans, nine of 11 GWAS "hits" were significant after adjustment for PC1 only, but after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant [rs1219648 in FGFR2; OR = 1.93, 95% CI (1.12, 3.36)]; for CBCS African Americans, six of 11 GWAS "hits" were significant after adjustment for PC1 only, and after adjustment for the PC1 by SNP interaction effect, all six SNPs remained significant (although the p-value for significance changed from 0.003 to 0.02-0.04 for some SNPs) and an additional SNP became significant. Interestingly, only six of the nine SNPs significant in European Americans after adjustment for PC1 were significant in African Americans; there was no overlap by race for the GWAS "hits" that were statistically significant after adjustment for the PC1 by SNP interaction effect. In general for CBCS, however, the p-values for the PC1 by SNP interaction effects themselves were non-significant, with the exception of one SNP in African Americans (rs13281615; p-value for interaction = 0.01).
Nine prostate cancer GWAS "hits" were tested in SCORE overall and stratified by race with and without PS adjustment (Table 3).
Overall SCORE results are shown in Table A1 in the Appendix. In the SCORE European Americans, none of the nine GWAS "hits" were significant after adjustment for PC1 only and no changes were seen after additional adjustment for the PC1 by SNP interaction effect; for SCORE African Americans, two of the nine GWAS "hits" were significant after adjustment for PC1 only, and after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant [rs16901979 at 8q24; OR = 3.24, 95% CI (1.19, 8.78)]. In general for SCORE, however, the p-values for the PC1 by SNP interaction effects themselves were all non-significant.
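For readers who want to relate the reported odds ratios and confidence intervals to the underlying regression coefficients, the sketch below converts between the two scales. It assumes the intervals are standard Wald intervals (symmetric on the log-odds scale), which the text does not state; the function names are illustrative. Under that assumption, the midpoint of each reported interval on the log scale should land close to the reported OR, up to rounding.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, se, z=Z_95):
    """OR and Wald confidence limits from a log-odds coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def implied_beta_se(ci_low, ci_high, z=Z_95):
    """Log-odds coefficient and SE implied by a Wald interval on the OR scale."""
    log_low, log_high = math.log(ci_low), math.log(ci_high)
    beta = (log_low + log_high) / 2.0
    se = (log_high - log_low) / (2.0 * z)
    return beta, se


# Intervals reported in the text after interaction adjustment.
reported = {
    "rs1219648 (CBCS European Americans)": (1.93, 1.12, 3.36),
    "rs16901979 (SCORE African Americans)": (3.24, 1.19, 8.78),
}

implied_or = {}
for snp, (or_reported, low, high) in reported.items():
    beta, se = implied_beta_se(low, high)
    implied_or[snp], _, _ = odds_ratio_ci(beta, se)
    print(f"{snp}: reported OR {or_reported}, implied OR {implied_or[snp]:.2f}, SE {se:.2f}")
```

The wide interval for rs16901979 corresponds to a large standard error on the log-odds scale, which is consistent with the smaller number of African American participants in SCORE.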

DISCUSSION
We investigated the importance of adjustment for PS effects using principal components analysis on two large epidemiologic datasets of breast and prostate cancer that include both European and African Americans. In addition, we show the importance of additionally adjusting for a potential SNP by PS interaction effect. Our results show that PS, in particular PS by GWAS "hit" interaction effects, can greatly change the significance of these GWAS "hits" by racial group. We also show that the first principal component explains the majority of the ancestral variation by race in each study population; for CBCS PC1 explained 53.7% of the variance and for SCORE PC1 explained 49.5% of the variance, and the remaining PCs each accounted for <1% of the variance for each study site. In a completely random matrix with no structure, having 136 dimensions (i.e., 136 AIM SNPs), each PC would account for ∼0.7% of the variance, which is what we see for all PCs beyond PC1 for the joint analysis and for each specific study site individually. PS can cause both false positive and false negative associations in epidemiologic studies of individuals with mixed ancestry. There are now multiple examples in the literature of the importance of adjustment for PS and of how PS can affect study inference in recently admixed populations such as African Americans or Hispanic Americans (e.g., Kittles et al., 2002; Ziv et al., 2006).
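The ∼0.7% figure follows from a short calculation, sketched here under the assumption that the PCA is run on p = 136 standardized markers, so that the eigenvalues λ_k of the correlation matrix sum to p. The symbols are illustrative.

```latex
\begin{align}
\sum_{k=1}^{p} \lambda_k &= \operatorname{tr}(R) = p = 136, \\
\text{average share per PC} &= \frac{1}{p}\sum_{k=1}^{p} \frac{\lambda_k}{\sum_{j} \lambda_j}
  = \frac{1}{p} = \frac{1}{136} \approx 0.0074 \;(\approx 0.7\%).
\end{align}
```

With no structure the eigenvalues are all of similar size, so each PC sits near this average; a PC explaining roughly half of the variance, as PC1 does here, is far above what unstructured data would produce.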
Multiple GWAS have now been performed for many complex diseases including breast and prostate cancers. The vast majority of these studies have been performed in individuals of European ancestry only: breast cancer (Easton et al., 2007; Hunter et al., 2007; Stacey et al., 2007; Gold et al., 2008; Ahmed et al., 2009; Thomas et al., 2009; Zheng et al., 2009c) and prostate cancer (Amundadottir et al., 2006; Freedman et al., 2006; Duggan et al., 2007; Gudmundsson et al., 2007, 2009; Haiman et al., 2007; Yeager et al., 2007, 2009; Sun et al., 2008; Thomas et al., 2008; Al Olama et al., 2009; Eeles et al., 2009; Hsu et al., 2009; Lou et al., 2009; Zheng et al., 2009a). We showed in this study that association statistics for different racial groups can be impacted when adjustment for PS and an interaction between PS and the SNP of interest is performed. Further replication studies of the GWAS "hits" included in this analysis and others have been performed in African Americans with breast cancer (Zheng et al., 2009b; Barnholtz-Sloan et al., 2010) and prostate cancer (Hooker et al., 2009; Waters et al., 2009; Xu et al., 2009; Chang et al., 2011; Table 4). One of these studies was based on the CBCS population (Barnholtz-Sloan et al., 2010) and another included the SCORE population as one of many studies used for analysis (Chang et al., 2011). Although these studies used different AIMs panels and different techniques for ancestry/PS estimation, all studies showed that PS adjustment did affect the magnitude of association statistics and was therefore a necessary adjustment factor. Additionally, Chang et al. (2011) showed that, for those study sites included in their analysis that had available PS ancestry information, the average ancestry varied significantly by study site. Hence, they concluded that adjustment for study site would also serve as a partial proxy for PS adjustment.

In this study we show that the additional adjustment for a PS by SNP interaction effect changes the magnitude and significance of most association statistics in both racial groups studied, particularly for the CBCS study, although the p-values for this interaction were non-significant. For CBCS European Americans, all but one SNP were non-significant after adjustment for the PS by SNP interaction effect; for the SNP that remained significant, the interaction-adjusted OR of 1.93 was much higher than previously reported in other studies. For six of the seven SNPs that remained significant in CBCS African Americans after adjustment for the PS by SNP interaction effect, the association statistics were consistent with previously published studies in African Americans (Table 4) in terms of direction and significance of effect; however, the magnitude of the ORs was much higher. Interestingly, rs13281615 has been shown to be non-significant in African Americans in previous studies, whereas in this study it showed a significant protective effect for breast cancer development after interaction adjustment. In the SCORE study none of the SNPs were significant in European Americans, while in African Americans only one SNP remained significant after PS by SNP interaction effect adjustment (rs16901979). Interestingly, in previous prostate cancer case-control studies of African Americans, rs16901979 was the only SNP that showed a consistent significant effect in multiple studies (Table 4).
The SNPs used in this analysis were all highly selected, given that they had been shown to be associated with breast or prostate cancer in previous GWAS; irrespective of this fact, this analysis showed that PS can cause inflation and deflation of the significance of association statistics. Additionally, some of the effects on association statistics after adjustment for PS or for the PS by SNP interaction effect could also have been due to issues related to study design, since cases and controls cannot be perfectly matched for allele frequency within each racial group. We did not perform haplotype analyses in this study as our goal was limited to single SNP replication of GWAS hits previously reported in European and African Americans; however, haplotype analyses could have been informative to show further differences in genetic associations by race. There may be other relevant SNPs within these loci that are associated with breast or prostate cancer, particularly for different racial groups, that were not examined. In addition, our sample size was relatively small for African Americans, particularly in the SCORE study. We also realize that there are many additional factors that may potentially influence a GWAS replication study that were not adjusted for in this analysis; we adjusted for the same factors as previous studies in order to replicate previous findings. Future research will require pooling of data from different breast and prostate cancer studies for different racial groups in order to gain a more complete understanding of differences in risk alleles by race and in order to study gene-gene and gene-environment interaction.
In conclusion, we demonstrated that genetic associations by race are modified by interactions between individual SNPs and PS and that significance of particular GWAS "hits" is not the same between racial groups. Our results and results of previously published studies in African Americans as shown in Table 4 highlight the need to conduct GWAS and GWAS replication studies in a variety of racial groups.

ACKNOWLEDGMENTS
We thank all participants in the CBCS and SCORE studies. We also thank Priya Shetty, Yanwen Chen, PhD, MS, and Lynette Phillips, PhD, for their technical assistance. This work was supported by the Case Comprehensive Cancer Center Core Grant (NIH/NCI P30-CA043703 to Jill S. Barnholtz-Sloan); the Specialized Program of Research Excellence (SPORE) in Breast Cancer (NIH/NCI P50-CA58223 to Robert C. Millikan); Center for Environmental Health and Susceptibility (NIEHS P30-ES10126 to Robert C. Millikan); Lineberger Comprehensive Cancer Center Core Grant (NIH/NCI P30-CA16086 to Robert C. Millikan); and NIH/NCI grants R01-CA085074 and P50-CA105641 (both to Timothy R. Rebbeck).