Original Research ARTICLE
Genome-wide analysis of the role of copy-number variation in pancreatic cancer risk
- 1Department of Medicine, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- 2Program in Cancer Biology and Genetics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- 3Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- 4Genomics Core Laboratory, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
Although family history is a risk factor for pancreatic adenocarcinoma, much of the genetic etiology of this disease remains unknown. While genome-wide association studies have identified some common single nucleotide polymorphisms (SNPs) associated with pancreatic cancer risk, these SNPs do not explain all the heritability of this disease. We hypothesized that copy number variation (CNVs) in the genome may play a role in genetic predisposition to pancreatic adenocarcinoma. Here, we report a genome-wide analysis of CNVs in a small hospital-based, European ancestry cohort of pancreatic cancer cases and controls. Germline CNV discovery was performed using the Illumina Human CNV370 platform in 223 pancreatic cancer cases (both sporadic and familial) and 169 controls. Following stringent quality control, we asked if global CNV burden was a risk factor for pancreatic cancer. Finally, we performed in silico CNV genotyping and association testing to discover novel CNV risk loci. When we examined the global CNV burden, we found no strong evidence that CNV burden plays a role in pancreatic cancer risk either overall or specifically in individuals with a family history of the disease. Similarly, we saw no significant evidence that any particular CNV is associated with pancreatic cancer risk. Taken together, these data suggest that CNVs do not contribute substantially to the genetic etiology of pancreatic cancer, though the results are tempered by small sample size and large experimental variability inherent in array-based CNV studies.
Pancreatic adenocarcinoma is the fourth-leading cause of cancer mortality in the United States for both men and women (Siegel et al., 2012). Despite recent advances in screening methods and surgical treatment, it is a rapidly fatal disease with a poor 5-year survival rate of 5–6%. Thus, a challenge exists to develop new and more effective therapeutic interventions.
Inherited genetic predisposition to pancreatic cancer is hypothesized to play a role in both familial and non-familial forms of the disease. In large-scale genome-wide association studies, common single-nucleotide polymorphisms (SNPs) on chromosomes 9q34, 13q22, 1q32, and 5p15 were associated with pancreatic cancer risk (Amundadottir et al., 2009; Petersen et al., 2010); however, the true causal variants underlying these associations and their functional mechanisms remain unclear. Additional studies have focused their analyses on SNPs within candidate genes (Jiao et al., 2006, 2008; Li et al., 2007; McWilliams et al., 2008). Under this approach, SNPs within DNA damage response and repair genes—particularly ATM, LIG3, XRCC1, and XRCC2 genes—were associated with increased risk, suggesting the involvement of inherited genetic variants within these pathways in pancreatic tumorigenesis.
Importantly, such efforts to identify inherited genetic variants that contribute to pancreatic cancer susceptibility may lead to novel biological insights about the disease and useful biomarkers for risk prediction. However, while these efforts have primarily focused on the analysis of SNPs, the additional contribution of germline copy number variations (CNVs) remains unclear.
CNVs are generally defined as inherited or de novo deletions or duplications of the genome ranging in size from 100 bp to 3 Mb (Zhang et al., 2009). Such variations may lead to changes in gene dosage and/or expression (Diskin et al., 2009). As of August 2012, approximately 179,450 human CNVs have been reported in the Database of Genomic Variants (DGV) (Iafrate et al., 2004; Zhang et al., 2006). Although there are substantially fewer reported CNVs than SNPs, it is estimated that more than 30% of the human genome is covered by at least one CNV (compared to the <1% covered by SNPs). Thus, CNVs are hypothesized to be of functional significance.
The specific role of CNVs in familial forms of pancreatic cancer has been investigated previously. Lucito et al. used representational oligonucleotide microarray analysis (ROMA) to identify a total of 56 germline CNVs that were present in patients with a family history of pancreatic cancer and absent from a cohort of healthy controls (Lucito et al., 2007). Al-Sukhni et al. followed a similar approach by analyzing the germline DNA of 91 familial pancreatic cancer patients and a 950 healthy controls using high-resolution Affymetrix 500 K and SNP 6.0 platforms (Al-Sukhni et al., 2012). There, investigators found a total of 93 germline CNVs that were unique to familial pancreatic cancer patients.Taken together, these studies nominate several CNVs as putative risk loci for familial pancreatic cancer. However, additional studies are needed to confirm these findings. More importantly, evidence for the broader role of CNVs outside of the familial pancreatic cancer setting is still unclear.
Here, we report a genome-wide analysis of CNVs in a hospital-based, European ancestry cohort of pancreatic cancer cases and controls. Germline CNV discovery was performed using the Illumina Human CNV370 platform in 223 pancreatic cancer cases (both sporadic and familial) and 169 controls. Following stringent quality control, we explored whether global CNV burden was a risk factor for pancreatic cancer. Finally, we performed in silico CNV genotyping and association testing to discover novel CNV risk loci.
Materials and Methods
Sample Collection and SNP Array Genotyping
Participants were part of an ongoing hospital-based case-control project conducted in conjunction with the Familial Pancreatic Tumor Registry (FPTR) at Memorial Sloan-Kettering Cancer Center (MSKCC; New York, NY) as described previously (Mukherjee et al., 2011; Willis et al., 2012). Briefly, patients were eligible if they were 21 years or older, spoke English, and had pathologically or cytologically confirmed adenocarcinoma of the pancreas. Patients were recruited between June 2003 and July 2009 from the surgical and medical oncology clinics at MSKCC at the time of their initial diagnosis or during follow-up. Controls were spouses of patients or visitors accompanying patients with other diseases, had the same age and language eligibility requirements as the cases, had no personal history of cancer (except for non-melanoma skin cancer), and were not blood relatives of the cases. The participation rate among approached and eligible individuals was 76% among cases and 56% among controls. The study was approved by the MSKCC Institutional Review Board, and all enrolled participants signed informed consent.
Consented participants provided a blood or buccal (mouthwash or saliva) sample to the MSKCC FPTR research study assistant and completed risk factor and family history questionnaires administered by the research study assistant in person or via telephone. Biospecimens were subsequently delivered for genomic DNA extraction and banking to the Molecular Epidemiology Laboratory. DNA was isolated from mouthwash specimens using the Puregene DNA Purification Kit (Qiagen, Inc.), from saliva samples with the Oragene saliva kits (DNA Genotek), and from whole blood using the GentraPuregene Blood Kit (Qiagen Inc.). DNA samples were hydrated in 1 × TE buffer. The quality and quantity of the DNA was assessed by spectrophotometry and agarose gel electrophoresis.
A total of 464 individuals (263 cases and 201 controls) recruited at MSKCC were available for inclusion in downstream analyses. DNA samples were genotyped in 28 batches on the Illumina Human CNV370 bead array (either the Illumina Human CNV370-Duo or Illumina Human CNV370-Quad format) at the Genomics Core Laboratory of MSKCC according to the manufacturer's protocol. Normalization and SNP genotype calling was performed using the Illumina BeadStudio software package. Ten individuals had their DNA analyzed twice for quality control during the course of the genotyping experiments, yielding a total of 474 samples. Normalized probe intensities were exported for downstream CNV discovery and genotyping.
SNP genotype calls for 474 individual samples (including 10 duplicate pairs) were exported to PLINK (version 1.07; Purcell et al., 2007) for processing. Identity-by-descent (IBD) analysis was performed to confirm that none of the genotyped individuals were blood relatives. For each known duplicate sample pair, priority was given to the sample that passed CNV-level quality-control (described below). We removed SNPs with call rates <95%, minor allele frequency <5%, or Hardy–Weinberg equilibrium (HWE) test p-value < 1 × 10−7 in controls, leaving a total of 315,136 SNPs.
Population structure was examined by principal component analysis (PCA) of the SNP genotype calls. As reference, we obtained whole-genome SNP data for Utah residents with northern and western European ancestry (abbreviated CEU) and individuals living in Toscani in Italia (abbreviated TSI) from the International HapMap project (phase 3, draft release 2) (International Hapmap 3 Consortium et al., 2010). The top four principal components of genetic structure identified by EIGENSOFT were used as covariates in downstream CNV association testing (Patterson et al., 2006).
CNV Discovery and Quality Control
CNV discovery was performed for each MSKCC sample using two parallel methods. First, we applied a hidden Markov model (HMM)-based algorithm implemented in the PennCNV package (2009 Aug 27 release, Wang et al., 2007). PennCNV makes use of normalized probe intensity (R) and allelic intensity ratio (BAF) values measured across different probes on the bead array to detect regions of copy number variation in the sample. For each probe, a ratio of the observed R value to the expected R value is calculated (here, the expected value is pre-defined as the average intensity observed at the locus in a pool of healthy HapMap individuals from CEU, YRI, and CHB-JPT populations). Positive or negative deviations of the log Robserved/Rexpected ratio (LRR) from zero indicate that the locus may be either duplicated or deleted, respectively. The algorithm incorporates spatial information as well, such that the probability of transitioning between different copy number states is dependent upon the physical map distance between two adjacent loci.
PennCNV was applied to 474 individual samples (including 10 duplicate pairs) using default parameters and GC-wave correction (Diskin et al., 2008). From each duplicate pair, one sample was kept for downstream analysis on the basis of having the lowest LRR value. Quality-control (QC) was then applied at the sample-level to PennCNV output by excluding samples on the basis of: (1) LRR standard deviation >0.28 (mean LRR plus 3 times the interquartile range); (2) BAF standard deviation >0.13 (approximately the mean BAF plus three times the interquartile range); or (3) a total number of CNV calls >124 (approximately the mean call rate plus 1.5 times the interquartile range). Additional QC was applied at the CNV-level by excluding CNV calls <5 kb in length and spanning <5 probes. After QC, we derived a total of 11,635 CNV calls from 417 unique samples using PennCNV.
Secondly, we applied an Objective Bayes HMM-based algorithm implemented in QuantiSNP (version 2.3 beta, Colella et al., 2007). QuantiSNP was run using default parameters and GC-wave correction on 474 individual samples (including 10 duplicate pairs). One sample from each duplicate pair was kept for downstream analysison the basis of having the lowest LRR value. QC was then applied at the sample-level by excluding samples on the basis of: (1) LRR standard deviation >0.21; (mean plus 3 times the interquartile range); (2) BAF standard deviation >0.102 (approximately the mean plus three times the interquartile range); (3) a total number of CNV calls >160 (approximately the mean call rate plus two times the interquartile range). Additional QC was applied at the CNV-level by excluding CNV calls with logBF confidence scores <15. After QC, we derived a total of 5422 CNV calls from 414 unique samples using QuantiSNP.
Lastly, as an additional QC procedure, we retained only those CNV calls that were made in the same individual by both PennCNV and QuantiSNP. Any sample or CNV call that was present in just one of the result-sets was excluded. The boundaries of each merged CNV call were defined using the smallest starting coordinate and largest end coordinate from either algorithm. This procedure yielded 3520 merged CNV calls from 392 unique individuals.
The start and end coordinates of each CNV in our dataset were based on the March 2006 human genome build (NCBI36/hg18). For comparison to previously-reported CNV loci, we obtained the 2012-03-29 release of the Database of Genomic Variants (DGV).
Definition of CNV Regions (CNVRs)
CNVRs were defined as any contiguous segment of the genome spanned by a CNV call from any sample. To identify CNVRs, we applied an iterative clustering procedure to the QC-filtered CNV calls, whereby CNV calls with a mutual overlap of ≥40% were considered to be members of the same CNVR cluster. The boundaries of the CNVR clusters were “relaxed,” such that the starting and ending coordinates were based by the smallest and largest coordinates of any CNV that was a member of the cluster, respectively.
In Silico CNVR Genotyping
In silico CNVR genotyping was performed using the CNVtools package (Barnes et al., 2008). For each CNVR of interest, we systematically evaluated the parameter space of data summarization methods, number of copy-number components, and signal/variance model specifications:
- Data summarization. As starting input for CNVtools, we extracted the normalized signal intensities of probes on the array that mapped within the boundaries of a given CNVR of interest. The probe intensities for a given region and sample were summarized using either one of two methods: principal component analysis or simply taking the mean. For each CNVR under investigation, we selected the method which gave the best separation between different copy-number clusters by visual inspection.
- Copy-number components. For each CNVR under investigation, we used the Bayesian Information Criterion (BIC) and our subjective visual assessment of clustering quality to select the optimal number of copy-number classes used for genotyping.
- Signal and variance model selection. For each CNVR under investigation, we explored different combinations of linear models to describe the signal mean and variance of each copy-number class (model.mean and model.var parameters, respectively). We considered models for signal mean that were “free” (stratified by copy-number class) or proportional to copy-number. Similarly, we considered models for signal variance that were either free (stratified by copy-number class), proportional to copy-number, or constant for each copy-number. Selection was based on successful convergence of the model fitting procedure and visual assessment of the clustering quality.
Individual CNVR loci were excluded from downstream risk association testing on the basis of being too problematic for in silico CNVR genotyping—e.g., rare or singleton (detected in only one sample) events, and noisy or insufficient separation between different copy-number clusters.
Statistical Methods to Evaluate Risk Association
Comparison of CNV burden
In this study, “CNV burden” was estimated on a per-individual basis by counting the number of CNV calls made in a given individual. We compared the estimated CNV burden between cases and controls using either univariate or multivariate logistic regression models in R. Notably, as reported by others, estimates of CNV burden are susceptible to non-specific sources of bias, including batch effects, DNA source effects, and age (International Hapmap 3 Consortium et al., 2010). Therefore, we analyzed the effects of DNA source, SNP array platform (Illumina Human CNV370-duo vs. CNV370-quad), and experimental batch on CNV burden using univariate linear regression models (Table 1). Variables that were associated with CNV burden (p < 0.1) were used as covariates in a multivariable logistic regression model to test for the association of CNV burden with risk. Our final multivariate model adjusted for age, gender, the top four principal components of genetic ancestry, experimental batch, and SNP array platform. The case-control comparison was made for (1) both deletion and duplication calls together, (2) deletion calls alone, or (3) duplication calls alone. In either univariate or multivariate models, we estimated the odds ratio per-unit-of-CNV burden, and significance was determined using the 1-degree-of-freedom (df) Wald test.
Single-locus risk association testing
We evaluated individual CNVRs for association with risk using two approaches. First, we used the approach implemented in CNVtools and described previously (Barnes et al., 2008). Briefly, for each CNVR locus, the approach is to jointly fit two models: (1) a Gaussian mixture model describing the relationship of signal intensity to copy-number genotype and (2) a generalized logit linear model describing the log-linear relationship of case-control status to copy-number phenotype. The models are fit twice, once under the null hypothesis of no risk association and again under the alternative hypothesis. A likelihood ratio test is then performed comparing the likelihood of the two fits with 1 df.
Our second approach was a multivariate logistic regression model that adjusted for age, gender, the top four principal components of genetic ancestry, and DNA source. For each locus of interest, copy-number genotypes were obtained based on in silico genotyping (described above); we estimated the per-copy odds ratio (OR) and significance was determined using the 1-df Wald test.
Characteristics of the Study Participants
Table 2 describes the characteristics of 223 pancreatic cancer cases and 169 healthy controls included in our analyses. The majority of samples were processed on the Illumina Human CNV370-duo platform using DNA collected from buccal sources (mouthwash or saliva). We observed a significant association of case-control status with gender (p-value = 0.003), family history (p-value < 0.001), DNA source (p-value < 0.001), and array platform (p < 0.001). Following SNP array genotyping, principal component analysis revealed that the majority of our cases and controls clustered into northern and southern European genetic ancestry groups (Figure 1). We also observed a smaller subset of individuals that clustered separately into a group identified as Ashkenazi Jewish.
Figure 1. Population structure of the study samples revealed by principal component analysis. Following SNP array genotyping, we applied the EIGENSTRAT package to 43,909 pairwise independent (r2 < 0.1) SNPs with minor allele frequency (MAF) >0.05 and call rates >95% among the 223 pancreatic cancer cases, 169 controls, and 253 individuals from reference HapMap CEU and TSI populations. A plot of the top two principal components of genetic variation (PC1 and PC2) is shown with cases as red dots, controls as black dots, CEU reference samples as blue triangles, and TSI reference samples as green triangles. As expected for our New York-based population study, the majority of cases and controls clustered with either the CEU reference samples (i.e., central European ancestry) or TSI (southern Italian ancestry). A subset of cases and controls (representing those with Ashkenazi Jewish ancestry) clustered separately.
Our approach to CNV discovery is summarized in Figure 2. After sample-level and CNV-level quality control filtering, we derived a total of 3520 high-confidence CNV calls from 223 cases and 169 controls. Of the total 3520 CNV calls, 1912 (54.3%) were deletions and 1608 (45.7%) were duplications. The median CNV length was 50.3 kb and 135.2 kb for deletions and duplications, respectively.
Figure 2. Schematic overview of the CNV discovery pipeline. Whole-genome SNP array genotyping was applied to the germline DNA of 263 pancreatic cancer patients (cases) and 201 healthy individuals (controls). Following normalization, probe intensities were analyzed separately by two CNV detection algorithms, PennCNV and QuantiSNP. Quality-control filtering was applied to the outputs of these algorithms by removing low-quality samples and/or low-confidence CNV calls. This resulted in a final set of 3520 putatively high-quality, high-confidence CNV calls across 223 cases and 169 controls.
Notably, using a minimum overlap threshold of 40%, we found that 3407 (96%) of the CNVs discovered in our study were overlapped by a CNV previously reported in the DGV. Of the remaining 113 putatively novel CNVs, 17 (1.5%) were observed among study participants with Ashkenazi Jewish ancestry. Furthermore, 377 CNVs (194 deletions, 183 duplications) were found to be “singletons” in our study (i.e., detected in only one study sample).
To determine whether non-specific technical factors influenced our CNV discovery results, we first compared the distributions of CNV call rates across different genotyping batches (Figure 3). Indeed, significant batch-to-batch variation was observed, suggesting that experimental “batch effects” may have played a role. Similarly, we observed significant differences in the distributions of CNV call rates for samples genotyped on the Illumina Human CNV370-duo vs.-quad platform (Figure 4). In contrast, no significant differences were observed when comparing the CNV call rate across different sample DNA sources (Figure 5).
Figure 3. Box-and-whisker plots of the number of CNV calls made within different experimental batches. Whole-genome SNP array genotyping was performed in 28 batches. Based on the final derived set of 3520 QC-filtered CNV calls, the interquartile range and median number of calls in a given genotyping batch are represented by a white box and black bar, respectively. The whiskers are drawn to 1.5 times the interquartile range; circles are drawn to represent individuals with a total number of CNV calls beyond that range.
Figure 4. Box-and-whisker plots of the number of CNV calls made across different DNA sources. Germline DNA was extracted from either a blood, buccal, or saliva biospecimen offered by each individual in our study. Based on the final set of 3520 QC-filtered CNV calls, the interquartile range and median number of calls derived from a given DNA source are represented by a shaded box and black bar, respectively. The whiskers are drawn to 1.5 times the interquartile range; circles are drawn to represent individuals with a total number of CNV calls beyond that range. We observed moderate (but non-statistically significant) variation in the number of CNVs detected between the DNA sources.
Figure 5. Box-and-whisker plots of the number of CNV calls made between different configurations of the genotyping platform. Whole-genome SNP array genotyping was performed on two different configuration of the Illumina HumanCNV370 bead array: the duo and quad. Based on the final set of 3520 QC-filtered CNV calls, the interquartile range and median number of calls derived from each configuration are represented by a shaded box and black bar, respectively. The whiskers are drawn to 1.5 times the interquartile range; circles are drawn to represent individuals with a total number of CNV calls beyond that range. We observed statistically significant differences in the number of CNVs detected between the two configurations.
Finally, using an iterative clustering procedure (described in Materials and Methods), we collapsed the 3520 individual CNVs into 809 unique CNV regions (CNVRs), i.e., continuous segments of the genome spanned by one or more CNVs (Figure 6).
Figure 6. Clustering of CNVs from different samples to identify common CNV regions (CNVRs). In this illustrated example, each green bar represents a CNV call detected in a single individual (either a case or control). CNVs with reciprocal overlap of at least 40% were clustered into the same CNVR.
Comparison of CNV Burden between Cases and Controls
Global CNV burden
Under the hypothesis that CNV burden is a risk factor for pancreatic cancer, we first sought to compare CNV burden between cases and controls. Here, considering all 3520 CNVs discovered in our study regardless of frequency, we defined CNV burden as the number of CNV calls made in an individual. This measure was derived on a per-individual basis by counting (1) deletions and duplications together, (2) deletions only, or (3) duplications only and then averaged across case and control groups (Table 3).
Table 3. Comparison of global CNV burden in cases and controls based on the final derived set of 3520 QC-filtered CNV calls.
The average case/control CNV burden ratio was observed to be 1.07 counting all CNV types together, 0.98 counting deletions only, and 1.17 counting duplications only. To assess whether these differences in CNV burden were significant, we employed a logistic regression model and estimated the odds ratio (OR) per unit of CNV burden. Under a univariate model, we observed no significant association between pancreatic cancer risk and CNV burden when counting all CNV types together (OR = 1.05, p = 0.12) or deletions only (OR = 0.99, p = 0.73). A nominally significant (p-value < 0.05) association was observed when counting duplications only (OR = 1.10, p = 0.02). However, under a multivariate model controlling for age, gender, genetic ancestry, and the non-specific effects of experimental batch and SNP array platform, we observed no statistically significant associations.
We further hypothesized that CNV burden would be enriched in patients with a family history of pancreatic cancer or early-onset disease. To evaluate this, we compared the CNV burden between controls and cases (n = 24) who reported a history of pancreatic cancer in at least one first-degree relative. Similarly, we compared CNV burden between controls (n = 33) aged 50 or younger and cases (n = 58) diagnosed at or prior to age 50. Again, although we observed minor differences in case/control CNV burden, these differences were not statistically significant in either univariate or multivariate analysis.
Putative rare or de novo CNV burden
To explore whether putative rare or de novo CNV burden is associated with pancreatic cancer risk, we extended the above analysis by considering only the subset of 377 CNV calls detected in a single individual in our study (Table 4). In both univariate and multivariate analyses, no statistically significant case-control differences in CNV burden were detected.
Table 4. Comparison of putative rare or de novo CNV burden in cases and controls based on a subset of 377 CNV calls observed in a single individual.
Analysis of CNV Loci Previously Discovered in Familial Pancreatic Cancer
Next, we compared our CNV discovery results to loci that have been previously implicated in familial pancreatic cancer by scanning for CNVs that overlapped with 28 duplications and 25 deletions identified by Lucito et al. and 40 duplications and 53 deletions identified by Al-Sukhni et al. (Table 5). Seven overlapping CNV loci of the same type were identified, including 4 deletions and 3 duplications. At five of these loci, CNV events were observed in both our cases and controls together, or our controls alone. However, for the remaining two CNV loci (a duplication at chr18:6838462-7291170 and deletion at chr9:2235919-2351848) we observed a single CNV event exclusively in one of our cases.
Table 5. Overlap between discovery CNVs (this study) and CNVs previously implicated in familial pancreatic cancer.
Association of Individual CNVR Loci with Pancreatic Cancer Risk
Lastly, we evaluated whether specific CNVR loci were associated with pancreatic cancer risk. To facilitate robust in silico CNVR genotyping and to avoid potential biases in signal characteristics between the Illumina CNV370-duo and Illumina CNV370-quad platforms, we focused this analysis on the subset of 295 samples (144 cases, 151 controls) genotyped using Illumina Human CNV370-duo. In silico copy-number genotyping was attempted across 706 CNVR loci that were derived from samples in this subset (Figure 7, described in Materials and Methods). Of those loci, only 176 were successfully genotyped with high quality.
Figure 7. Schematic overview of in silico genotyping of 809 CNVRs across pancreatic cancer cases and controls. (A) Using SNP array data for each of the 223 cases and 169 controls, we extracted signal intensity information for probes that overlapped with each of the CNVRs. Probe intensities for a given CNVR region were summarized on a sample-by-sample basis by taking the mean or by use of principal component analysis. (B) In silico genotyping for each CNVR was then performed using the CNVtools package, which assigns cases and controls to copy number classes by jointly fitting a Gaussian mixture model and a log-linear regression model to the observed distribution of summarized probe intensities. An example fit is shown, overlaid with the estimated locations of individuals who have normal copy number class, a heterozygous deletion, or a homozygous deletion (blue, green, and black lines, respectively) for this CNVR. (C) Quality-control was applied by removing CNVRs with low-quality genotyping, low minor allele frequency, or CNVRs that were derived solely from samples genotyped on the Illumina CNV370-quad platform.
Each CNVR that could be successfully genotyped was then analyzed for association with pancreatic cancer risk by use of a likelihood ratio test (Table 6). We observed a total of seven loci associated with p-values < 0.05. Considering the number of loci tested, only one association (PA-CNVR46.1, likelihood ratio test p = 6.41 × 10−5) met the Bonferroni threshold of significance. However, in a multivariate logistic regression model adjusted for age, gender, experimental batch and the top four principal components of genetic ancestry, this region did not remain statistically significantly associated with risk (per-copy OR = 0.86, 95% CI = 0.58–1.26, p = 0.44).
In this study, we sought to investigate the roles of CNV burden and individual CNV loci in pancreatic cancer susceptibility. Toward this end, we first performed genome-wide CNV discovery within a hospital-based cohort of 223 pancreatic cancer cases and 169 healthy controls using a SNP array platform. To help minimize the proportion of false-positive CNVs in our data set, we took the approach of analyzing whole-genome SNP array data using two separate CNV discovery algorithms (PennCNV and QuantiSNP) followed by stringent QC filtering.
A small proportion (n = 113) of the CNV loci detected in our study were not overlapped by CNVs previously reported in the DGV. One likely explanation is that these regions are platform-specific artifacts. Indeed, because we did not experimentally validate the CNV loci discovered in our study, we cannot exclude the presence of artifacts in our downstream analyses despite the application of rigorous filtering methods. In addition, 17 of the previously unreported CNV loci were observed among subjects with Ashkenazi Jewish ancestry. Therefore, we further speculate that at least some of the CNV calls unique to our study may be true population-specific CNVs from populations (i.e., Ashkenazi Jewish) that are underrepresented in the DGV.
Experimental batch effects are well known to the genomics field and require proper consideration when performing case-control analyses. In this study, we observed significant variation in distributions of CNV call rates across different sample batches, which is likely the result of technical variation during the course of batch processing. In support of this hypothesis, we also observed variations in CNV call rates at the individual sample-level by comparing CNVs from 10 duplicate pairs of samples that were genotyped twice on the same array platform but in two different batches (data not shown). Furthermore, the distribution of CNV call rates for samples genotyped on the Illumina Human CNV370-duo vs. -quad platforms were significantly different. While batch effects may have contributed to this observation, we speculate that it also reflects differences in probe-level characteristics between the duo and quad bead array configurations.
Following CNV discovery, we explored several hypotheses that CNV burden is a risk factor for pancreatic cancer. Different formulations of “CNV burden” have been employed in the literature, including: a simple CNV count, the total CNV length, and the total number of genes overlapped by a CNV. Here, we regarded CNV burden as the number of CNV calls made per individual—a metric that was evaluated across a spectrum of different CNV types and frequency, and different case-control subgroups.
When we considered all 3520 CNVs discovered in our study across all 223 cases and 169 controls, we found no strong evidence of an association between CNV burden and pancreatic cancer risk. Similarly, no evidence was found when we restricted our analyses to cases with early-onset (age ≤ 50) disease or with a family history of pancreatic cancer. Notably, a recent study by Stadler et al. found that a significant proportion of men with early-onset testicular cancer harbored de novo CNVs (Stadler et al., 2012). Although the pathogenicity of these specific de novo CNVs has yet to be confirmed, this finding suggested a novel framework for understanding the genetic basis of sporadic cancers. We explored this hypothesis by restricting our analyses to singleton CNVs (i.e., putative rare or de novo CNVs), but found no evidence of association.
Although our results do not support the role of CNV burden in pancreatic cancer risk, we emphasize that our conclusions are tempered by the presence of experimental variability in the CNV discovery scheme. It remains possible that true differences in case/control CNV burden might have been masked by the presence of varying DNA sources, genotyping platforms, and experimental batch performance.
We next attempted in silico validation of CNV loci that were previously implicated in familial pancreatic cancer risk. By comparing our CNV discovery results against those reported by Lucito et al. and Al-Sukhni et al. we identified two loci in which the CNV events were present exclusively in our cases—and, thus, consistent with the original hypotheses that each locus confers a strong risk of pancreatic cancer. The first such locus, on chromosome 9p24.2, was reported by Al-Sukhni et al. as a deletion event in a single case, but harbors no known RefSeq genes. The second locus, on chromosome 18p11.31, was reported by Al-Sukhni as a duplication event in a single case and harbors four known genes including ARHGAP28, LINC00668, LAMA1, LRRC30. Notably, LAMA1 (laminin subunit alpha-1 precursor) encodes a subunit of the extracellular protein laminin. While the specific role of laminin in pancreatic tumorgenesis remains poorly understood, a recent study by Vincent et al. found that LAMA1 was among several genes that were hypermethylated and underexpressed in pancreatic tumor samples compared to normal pancreas (Vincent et al., 2011). Our results add further support to the hypothesis that inherited duplications at the LAMA1 locus may be involved in pancreatic cancer risk. However, as we did not experimentally validate the duplication involving LAMA1 in our data set, we cannot exclude the possibility of either false positives or negatives.
Lastly, to determine whether specific CNV loci play a role in pancreatic cancer risk, we performed in silico copy-number genotyping and association testing across 176 CNVR loci identified in our discovery experiment. Using a likelihood ratio testing approach, we identified seven regions putatively associated with pancreatic cancer risk at the nominal threshold of p < 0.05. Only one locus (PA-CNVR46.1) reached the Bonferroni level of significance. This locus maps to a non-genetic region of chromosome 4q26 and was found to be multiallelic in our samples (i.e., both deletion and duplication genotypes were detected). However, its effect on risk was not significant in a multivariate logistic regression model adjusted for gender, age, ancestry, and DNA source. Follow-up studies of these seven putative loci would be necessary to validate their associations with pancreatic cancer risk.
Yet, prior to the conclusion of our study, work performed by Conrad et al. suggested that nearly 77% of all common CNVs in the human genome are tagged (r2 > 0.8) by SNPs through linkage disequilibrium (Conrad et al., 2009). Hence, one could speculate that any common CNV locus with weak-to-moderate effects on pancreatic cancer risk would have been detected already through large-scale GWAS involving several thousand cases and controls (Amundadottir et al., 2009; Petersen et al., 2010). Thus, in this context, we strongly emphasize that our inability to detect such CNV loci is not unexpected given the relatively small sample size of our study.
As a corollary, one could further hypothesize that CNVs with large contributions to pancreatic cancer risk are likely to be individually rare and poorly tagged by SNPs on commercially available genotyping arrays. Hence, we also emphasize that such CNV loci were likely to have been missed in our study due to not only small sample size, but also the resolution of the Illumina Human CNV370 bead array.
In conclusion, based on the results of genome-wide CNV discovery in a hospital-based case-control cohort, our study found no evidence that CNVs contribute substantially to the genetic etiology of pancreatic cancer. However, in light of recent population-wide CNV data and the challenges faced by our study, future efforts to address the role of CNVs in pancreatic cancer will require larger case-control groups and high-resolution discovery platforms.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We are grateful to the Geoffrey Beene Cancer Research Center at MSKCC and the Emerald Foundation for supporting this work.
Al-Sukhni, W., Joe, S., Lionel, A. C., Zwingerman, N., Zogopoulos, G., Marshall, C. R., et al. (2012). Identification of germline genomic copy number variation in familial pancreatic cancer. Hum. Genet. 131, 1481–1494. doi: 10.1007/s00439-012-1183-1
Amundadottir, L., Kraft, P., Stolzenberg-Solomon, R., Fuchs, C., Petersen, G., Arslan, A., et al. (2009). Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat. Genet. 41, 986–990. doi: 10.1038/ng.429
Barnes, C., Plagnol, V., Fitzgerald, T., Redon, R., Marchini, J., Clayton, D., et al. (2008). A robust statistical method for case-control association testing with copy number variation. Nat. Genet. 40, 1245–1252. doi: 10.1038/ng.206
Colella, S., Yau, C., Taylor, J. M., Mirza, G., Butler, H., Clouston, P., et al. (2007). QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 35, 2013–2025. doi: 10.1093/nar/gkm076
Conrad, D., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., et al. (2009). Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712. doi: 10.1038/nature08516
Diskin, S. J., Hou, C., Glessner, J. T., Attiyeh, E. F., Laudenslager, M., Bosse, K., et al. (2009). Copy number variation at 1q21.1 associated with neuroblastoma. Nature 459, 987–991. doi: 10.1038/nature08035
Diskin, S. J., Li, M., Hou, C., Yang, S., Glessner, J., Hakonarson, H., et al. (2008). Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126. doi: 10.1093/nar/gkn556
International Hapmap 3 Consortium, Altshuler, D. M., Gibbs, R. A., Peltonen, L., Altshuler, D. M., Gibbs, R. A., et al. (2010). Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58. doi: 10.1038/nature09298
Jiao, L., Bondy, M., Hassan, M., Wolff, R., Evans, D., Abbruzzese, J., et al. (2006). Selected polymorphisms of DNA repair genes and risk of pancreatic cancer. Cancer Detect. Prev. 30, 284–291. doi: 10.1016/j.cdp.2006.05.002
Jiao, L., Hassan, M., Bondy, M., Wolff, R., Evans, D., Abbruzzese, J., et al. (2008). XRCC2 and XRCC3 gene polymorphism and risk of pancreatic cancer. Am. J. Gastroenterol. 103, 360–367. doi: 10.1111/j.1572-0241.2007.01615.x
Li, D., Li, Y., Jiao, L., Chang, D., Beinart, G., Wolff, R., et al. (2007). Effects of base excision repair gene polymorphisms on pancreatic cancer survival. Int. J. Cancer 120, 1748–1754. doi: 10.1002/ijc.22301
Lucito, R., Suresh, S., Walter, K., Pandey, A., Lakshmi, B., Krasnitz, A., et al. (2007). Copy-number variants in patients with a strong family history of pancreatic cancer. Cancer Biol. Ther. 6, 1592–1599. doi: 10.4161/cbt.6.10.4725
McWilliams, R., Bamlet, W., Cunningham, J., Goode, E., De Andrade, M., Boardman, L., et al. (2008). Polymorphisms in DNA repair genes, smoking, and pancreatic adenocarcinoma risk. Cancer Res. 68, 4928–4935. doi: 10.1158/0008-5472.CAN-07-5539
Mukherjee, S., Simon, J., Bayuga, S., Ludwig, E., Yoo, S., Orlow, I., et al. (2011). Including additional controls from public databases improves the power of a genome-wide association study. Hum. Hered. 72, 21–34. doi: 10.1159/000330149
Petersen, G., Amundadottir, L., Fuchs, C., Kraft, P., Stolzenberg-Solomon, R., Jacobs, K., et al. (2010). A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nat. Genet. 42, 224–228. doi: 10.1038/ng.522
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795
Stadler, Z. K., Esposito, D., Shah, S., Vijai, J., Yamrom, B., Levy, D., et al. (2012). Rare de novo germline copy-number variation in testicular cancer. Am. J. Hum. Genet. 91, 379–383. doi: 10.1016/j.ajhg.2012.06.019
Vincent, A., Omura, N., Hong, S.-M., Jaffe, A., Eshleman, J., and Goggins, M. (2011). Genome-wide analysis of promoter methylation associated with gene expression profile in pancreatic adenocarcinoma. Clin. Cancer Res. 17, 4341–4354. doi: 10.1158/1078-0432.CCR-10-3431
Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S., et al. (2007). PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674. doi: 10.1101/gr.6861907
Willis, J. A., Olson, S. H., Orlow, I., Mukherjee, S., Mcwilliams, R. R., Kurtz, R. C., et al. (2012). A replication study and genome-wide scan of single-nucleotide polymorphisms associated with pancreatic cancer risk and overall survival. Clin. Cancer Res. 18, 3942–3951. doi: 10.1158/1078-0432.CCR-11-2856
Zhang, F., Gu, W., Hurles, M., and Lupski, J. (2009). Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet. 10, 451–481. doi: 10.1146/annurev.genom.9.081307.164217
Zhang, J., Feuk, L., Duggan, G. E., Khaja, R., and Scherer, S. W. (2006). Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res. 115, 205–214. doi: 10.1159/000095916
Keywords: pancreatic cancer, copy number variation, cancer risk, SNP microarrays, CNVs
Citation: Willis JA, Mukherjee S, Orlow I, Viale A, Offit K, Kurtz RC, Olson SH and Klein RJ (2014) Genome-wide analysis of the role of copy-number variation in pancreatic cancer risk. Front. Genet. 5:29. doi: 10.3389/fgene.2014.00029
Received: 25 September 2013; Paper pending published: 28 October 2013;
Accepted: 26 January 2014; Published online: 13 February 2014.
Edited by:Hakon Hakonarson, University of Pennsylvania. USA
Reviewed by:Peter A. Kanetsky, University of Pennsylvania, USA
Lynette Phillips, Kent State University, USA
Copyright © 2014 Willis, Mukherjee, Orlow, Viale, Offit, Kurtz, Olson and Klein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Robert J. Klein, Department of Medicine, Memorial Sloan Kettering Cancer Center, 1275 York Ave, Box 337, New York, NY 10065, USA e-mail: firstname.lastname@example.org