A Method to Assess Linkage Disequilibrium between CNVs and SNPs Inside Copy Number Variable Regions

Since the discovery of the ubiquitous contribution of copy number variation to genetic variability, researchers have commonly used metrics such as r² to quantify linkage disequilibrium (LD) between copy number variants (CNVs) and single nucleotide polymorphisms (SNPs). However, these reports have been restricted to SNPs outside copy number variable regions (CNVRs), as current methods have not been adapted to account for SNPs displaying variable copy number. We show that traditional LD metrics inappropriately quantify SNP/CNV covariance when SNPs lie within a CNVR. We derive a new method for measuring LD that solves this issue and defaults to traditional metrics otherwise. Finally, we present a procedure to estimate CNV–SNP allele frequencies from unphased CNV–SNP genotypes. Our method allows researchers to include all SNPs in SNP/CNV LD measurements, regardless of copy number.

Introduction
Examination of linkage disequilibrium (LD) between single nucleotide polymorphisms (SNPs) has played a key role in our understanding of worldwide patterns of genetic variation, including determining the extent of haplotype diversity, detecting regions of positive selection (Sabeti et al., 2007), and guiding the design of most current genotyping arrays through the selection of appropriate haplotype tagging SNPs. Traditional pairwise metrics of LD, including r², D, and D′, were designed to quantify the degree of non-independence between neighboring genetic polymorphisms (Lewontin and Kojima, 1960; Lewontin, 1964; Hill and Robertson, 1968). With the current understanding that copy number variation (CNV) also contributes significantly to genetic variation (Redon et al., 2006), research has turned to the role of CNV in disease risk (Gonzalez et al., 2005; Aitman et al., 2006; McCarroll and Altshuler, 2007; Sebat et al., 2007), particularly as a partial explanation for the so-called missing heritability (Manolio et al., 2009; Eichler et al., 2010). Recently, genome-wide CNV surveys such as that performed by the Wellcome Trust Case Control Consortium (WTCCC) have concluded that common CNVs are adequately tagged by SNPs and are thus unlikely to contribute substantially to the genetic basis of common human diseases (Conrad et al., 2010; Wellcome Trust Case Control Consortium et al., 2010). However, current methods have restricted these studies to SNPs that fall outside of copy number variable regions (CNVRs), with the consequence that additional tagging SNPs are being missed, particularly in DNA segments of higher copy number.
In this paper, we explicitly derive the covariance between SNPs and CNVs under a range of scenarios where SNPs either fall inside (interior) or outside (exterior) of a CNVR. We find that traditional LD metrics are sufficient for exterior SNPs; however, these same metrics inappropriately quantify covariance for interior SNPs.
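Assuming that the joint frequencies (f_{X,Y}) are known, the covariance between X and Y can be written as:
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = \sum_{x}\sum_{y} x\,y\,f_{X,Y}(x, y) - \Big(\sum_{x} x\,f_{X}(x)\Big)\Big(\sum_{y} y\,f_{Y}(y)\Big)$$
where f_X and f_Y denote the marginal frequencies. We consider this covariance between CNVs and SNPs in the following four scenarios.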
Scenario 1a: The SNP is outside a CNVR (exterior SNP) that contains a normal (one copy) variant and a deletion (zero copies). Then:
Scenario 1b: The SNP is outside a CNVR (exterior SNP) that contains a normal (one copy) variant and a duplication (two copies). Then:
In both of the above scenarios, the covariance between the CNV and SNP will appropriately be zero when X and Y are independent (i.e., the joint frequency is equivalent to the product of the marginal frequencies). Also, any inference concerning the relationship between the CNV and SNP does not depend on the choice of reference allele, since only the direction of the covariance differs. Given these features, traditional measurements of LD between CNVs and SNPs are sufficient for exterior SNPs.
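The two properties claimed for exterior SNPs can be checked numerically. The following is a minimal sketch, assuming X is coded as the haploid copy number and Y as an indicator for the reference SNP allele; the function names and the frequency tables are hypothetical illustrations, not values from this study.

```python
def covariance(joint):
    """Covariance of (X, Y) from a joint frequency table {(x, y): frequency}."""
    total = sum(joint.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint frequencies must sum to 1")
    e_x = sum(x * f for (x, _), f in joint.items())
    e_y = sum(y * f for (_, y), f in joint.items())
    e_xy = sum(x * y * f for (x, y), f in joint.items())
    return e_xy - e_x * e_y


def flip_reference(joint):
    """Recode Y so that the other SNP allele becomes the reference (y -> 1 - y)."""
    return {(x, 1 - y): f for (x, y), f in joint.items()}


# Scenario 1b with hypothetical numbers: duplication frequency 0.2 and
# allele A frequency 0.3, with copy number and allele independent.
independent = {(x, y): fx * fy
               for x, fx in ((1, 0.8), (2, 0.2))
               for y, fy in ((1, 0.3), (0, 0.7))}

# Hypothetical table in which allele A is enriched on duplicated haplotypes.
linked = {(1, 1): 0.20, (1, 0): 0.60, (2, 1): 0.15, (2, 0): 0.05}

print(covariance(independent))             # approximately zero
print(covariance(linked))                  # positive
print(covariance(flip_reference(linked)))  # same magnitude, opposite sign
```

Under independence the covariance vanishes up to floating-point error, and switching the reference allele changes only its sign, which is why the squared correlation is unaffected for exterior SNPs.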
Scenario 2a: The SNP is inside a CNVR (interior SNP) that contains a normal (one copy) variant and a deletion (zero copies). Table 1 provides definitions of CNV-SNP allele frequencies based on a haploid model with three copy number states (zero to two copies per haploid). In situations where the SNP lies within the CNVR, SNP allele counts depend on copy number state. For example, whenever a deletion is present, both X and Y must equal zero. Thus,
Scenario 2b: The SNP is inside a CNVR (interior SNP) that contains a normal (one copy) variant and a duplication (two copies). This final scenario represents the most complex case. The sample space of Y must change to reflect the possibility of zero to two copies of the A allele. Namely:
The covariance then becomes,
Based on the covariances calculated in scenarios 2a and 2b, we find two undesirable features of current metrics when used to assess LD between interior SNPs and CNVs: (1) polymorphic SNPs inside CNVRs will never be uncorrelated with the CNV; and (2) the correlation between variants will differ depending on which SNP allele is considered the reference. In these scenarios the use of traditional LD measurements could affect association results.
Consider a population where a monomorphic SNP lies within a CNVR that includes a moderately frequent deletion (for instance, f_0 = 0.1 and f_{1,A} = 0.9). Traditional metrics would conclude that the SNP and CNV are in perfect LD, and that any inference based upon the SNP would apply to the CNV. However, in the absence of copy number data, an association analysis based upon the SNP would be completely uninformative, perhaps leading to the incorrect conclusion that the CNV is also not associated with the trait. In general, we show that high values of r² between an interior SNP and a deletion are obtained whenever the SNP minor allele frequency is low (Figure 1). However, in the absence of CNV data, the same incorrect conclusion would again be applied to the CNV.
In these situations, we would hope that LD measurements would conclude independence; however, that is not the case. We also note that the correlation between the SNP and CNV depends on which SNP allele is considered the reference. An example illustrating this property is provided in the Results section. Together, these features demonstrate that traditional LD metrics are inappropriate when applied to interior SNPs and CNVs.
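Both undesirable features can be reproduced for the deletion-only case (Scenario 2a) with a few lines of arithmetic. This is a minimal sketch under an assumed coding, with X the haploid copy number (0 or 1) and Y equal to 1 when the haplotype carries the reference allele; under that coding the covariance works out to f_0 multiplied by the reference allele frequency. The function name is hypothetical, and the second set of frequencies is invented for illustration.

```python
import math


def traditional_ld_interior_deletion(f0, f1A, f1B, reference="A"):
    """Traditional covariance and r^2 for an interior SNP in a deletion-only CNVR.

    Assumed coding (an illustration, not taken verbatim from the paper):
    X is the haploid copy number (0 or 1); Y is 1 when the haplotype
    carries the reference SNP allele and 0 otherwise.
    """
    if abs(f0 + f1A + f1B - 1.0) > 1e-9:
        raise ValueError("f0 + f1A + f1B must equal 1")
    f_ref = f1A if reference == "A" else f1B
    e_x = f1A + f1B          # P(X = 1) = 1 - f0
    e_y = f_ref              # a deleted haplotype can never carry the allele
    e_xy = f_ref             # X * Y = 1 only on one-copy haplotypes with the allele
    cov = e_xy - e_x * e_y   # algebraically equal to f0 * f_ref
    var_x = e_x * (1.0 - e_x)
    var_y = e_y * (1.0 - e_y)
    if var_x == 0.0 or var_y == 0.0:
        return cov, math.nan  # r^2 is undefined when either variable is constant
    return cov, cov ** 2 / (var_x * var_y)


# Monomorphic SNP from the text: f_0 = 0.1, f_{1,A} = 0.9.
cov_mono, r2_mono = traditional_ld_interior_deletion(0.1, 0.9, 0.0)

# Hypothetical polymorphic SNP: the answer depends on the reference allele.
_, r2_ref_a = traditional_ld_interior_deletion(0.1, 0.8, 0.1, reference="A")
_, r2_ref_b = traditional_ld_interior_deletion(0.1, 0.8, 0.1, reference="B")

print(r2_mono, r2_ref_a, r2_ref_b)
```

The monomorphic SNP yields r² of essentially 1 even though it carries no information beyond the deletion itself, and the polymorphic example gives markedly different r² values for the two choices of reference allele.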

Figure 1 | Linkage disequilibrium (r²) between copy number variants (CNVs) and single nucleotide polymorphisms (SNPs) within the copy number variable region as a function of deletion frequency (f_0) and SNP minor allele frequency (MAF).
We note that in the case of interior SNPs, a deletion should not provide any information on the relationship between the copy number state and the SNP allele(s) present. Therefore, let X be the integer haploid copy number state and Y represent the presence of a particular SNP allele, conditional on the haploid copy number state not being equal to zero, so that:
and f_B = f_{1,B} = f_{2,BB} + 1/2 f_{2,AB} according to the CNV-SNP allele frequencies listed in Table 1. The covariance between X and Y then becomes,
which does not depend on the particular choice of the reference allele. We denote the inner factor in Equation (9) as D_C, noting its equivalence to Lewontin's D (Lewontin and Kojima, 1960) for exterior SNPs. Specifically, let
where f f for exterior SNPs, for interior SNPs.
Similar to D, the range of values for D_C is difficult to interpret without proper scaling. Therefore, we propose a method nearly identical to the construction of D′ (Lewontin, 1964). Define the maximum value that D_C can take, given the allele frequencies, as D_C max. Then:
Finally, let
Meanwhile, we can also calculate the correlation between X and Y to be:
To address these deficiencies, we now propose a new metric to quantify LD between CNVs and SNPs that functions equivalently to traditional measures for exterior SNPs and solves these issues for interior SNPs.
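For reference, the classical quantities that the scaling above mirrors can be sketched for two biallelic loci, which is the exterior-SNP case in which D_C is stated to coincide with Lewontin's D. The sketch uses the standard textbook definitions of D, D′, and r²; it does not implement D_C max for interior SNPs, and the function name and inputs are hypothetical.

```python
def classical_ld(p_ab, p_a, p_b):
    """Lewontin's D, |D'| and r^2 for two biallelic loci.

    p_ab: frequency of haplotypes carrying both alleles of interest.
    p_a, p_b: marginal frequencies of those alleles, each strictly in (0, 1).
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the marginal frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max  # reported here as a non-negative value
    r2 = d ** 2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


# Hypothetical example: a rare variant found only on haplotypes carrying a
# common SNP allele gives |D'| = 1 but a much smaller r^2.
print(classical_ld(p_ab=0.1, p_a=0.1, p_b=0.5))
```

The example shows why both scalings are reported: D′ reaches its maximum whenever one of the four haplotypes is absent, whereas r² additionally requires the allele frequencies to match.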

Derivation of New Proposed Statistic
We consider a bi-allelic SNP present within a CNVR with three potential haploid copy number states: zero, one, or two copies, although the methods here can be extended to higher copy number or to multiple SNP alleles (Kalinowski and Hedrick, 2001). Define the CNV-SNP allele at this locus to be a combination of the haploid copy number state and the nucleotide(s) present, with the two SNP alleles generically labeled A and B. This model can then be treated similarly to a multiallelic locus with alleles 0, A, B, AA, AB, and BB, where 0 represents a deletion (Table 1). Combined in pairs, these alleles form a CNV-SNP genotype, which provides information on the total number of copies of each nucleotide (Table 2). This model is consistent with those in the majority of copy number calling algorithms for array-based CNV detection (Korn et al., 2008; Coin et al., 2010). Note, however, that while CNV-SNP genotypes can be inferred from common genotyping platforms (Korn et al., 2008; Coin et al., 2010), the phase, particularly in duplicated regions, may be ambiguous. For example, an AAB genotype may have either of the phased haploid configurations AA/B or AB/A.
Table 3 note: r_A² and r_B² are the traditional metrics of LD using SNP allele A or B as the reference allele, respectively. Note how vastly different results can be obtained depending on which allele is used as the reference. The value of r_C² is the same irrespective of the SNP allele considered as the reference allele.
We theoretically demonstrated how current metrics of LD are inappropriate in certain cases and proposed a new method that solves these issues. Note that the CNV-SNP allele frequencies are critical in calculating r_C². Using a simulation procedure, we evaluated our EM-based method for estimating CNV-SNP allele frequencies, as described above in the Methods section. These results are provided in Table 4. In summary, our metric accurately and precisely measures SNP/CNV covariance, regardless of the location of the SNP and the type of CNV. In particular, high values of r_C² will always lead to a proper conclusion about the role of CNVs from the study of SNPs. In CNVRs that only include a deletion, our proposed method will always correctly assign independence between interior SNPs and the CNV. Meanwhile, in duplicated regions our metric will provide a value that appropriately quantifies the correlation between SNP allele(s) and the number of copies present.
or, alternatively:
We again note that D_C′ and r_C² are identical to the traditional LD measurements D′ and r², respectively, for exterior SNPs, and both are appropriate measurements for interior SNPs.

Estimation of CNV-SNP Allele Frequencies
Calculation of D_C′ and r_C² is straightforward when the CNV-SNP haplotype frequencies are known. However, current methods for array-based genotype/CNV calling do not directly infer the haploid configuration (phase), though methods for estimating this configuration have recently been proposed (Kato et al., 2008; Su et al., 2010). Here we present a novel method to estimate CNV-SNP allele frequencies from unphased CNV-SNP genotypes. The method is a direct application of an EM algorithm and is nearly identical in construction to the gene-counting allele frequency estimation procedure of Ceppellini et al. (1955) and Smith (1957). Consider a CNVR with CNV-SNP haploid configurations S/T such that S, T ∈ {0, A, B, AA, AB, BB}. In the E-step, haploid configuration counts are estimated based on the expected counts from the current estimates of the CNV-SNP allele frequencies. That is, for each CNV-SNP haploid configuration S/T:
The algorithm assumes that haploid configurations occur in their Hardy-Weinberg equilibrium proportions. As a result, this approach may perform poorly in de novo mutation hotspots and for CNVs found only in somatic cells.
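The gene-counting idea described above can be illustrated with a short sketch. It is not the authors' R implementation: it assumes the three-state haploid model with alleles 0, A, B, AA, AB, and BB, Hardy-Weinberg proportions, and unphased genotypes encoded as a hypothetical pair (number of A copies, number of B copies). All function names are illustrative.

```python
from collections import Counter
from itertools import combinations_with_replacement

ALLELES = ("0", "A", "B", "AA", "AB", "BB")


def allele_content(allele):
    """Return (copies of A, copies of B) carried by one CNV-SNP allele."""
    return (allele.count("A"), allele.count("B"))


def compatible_pairs(genotype):
    """Unordered haploid configurations S/T consistent with a genotype (nA, nB)."""
    pairs = []
    for s, t in combinations_with_replacement(ALLELES, 2):
        n_a = allele_content(s)[0] + allele_content(t)[0]
        n_b = allele_content(s)[1] + allele_content(t)[1]
        if (n_a, n_b) == tuple(genotype):
            pairs.append((s, t))
    return pairs


def em_allele_frequencies(genotypes, max_iter=500, tol=1e-10):
    """Gene-counting EM estimate of CNV-SNP allele frequencies under HWE."""
    counts = Counter(tuple(g) for g in genotypes)
    if not counts:
        raise ValueError("at least one genotype is required")
    options = {g: compatible_pairs(g) for g in counts}
    for g, pairs in options.items():
        if not pairs:
            raise ValueError(f"genotype {g} is impossible under the three-state model")
    n_haplotypes = 2 * sum(counts.values())
    freqs = {a: 1.0 / len(ALLELES) for a in ALLELES}  # uniform starting values
    for _ in range(max_iter):
        expected = dict.fromkeys(ALLELES, 0.0)
        for g, n_g in counts.items():
            # E-step: HWE weight of each configuration (2 * f_S * f_T when S != T).
            weights = [freqs[s] * freqs[t] * (1.0 if s == t else 2.0)
                       for s, t in options[g]]
            total = sum(weights)
            for (s, t), w in zip(options[g], weights):
                share = n_g * w / total
                expected[s] += share
                expected[t] += share
        # M-step: expected allele counts divided by the number of haplotypes.
        updated = {a: expected[a] / n_haplotypes for a in ALLELES}
        change = max(abs(updated[a] - freqs[a]) for a in ALLELES)
        freqs = updated
        if change < tol:
            break
    return freqs


# The AAB genotype from the text is encoded here as (2, 1).
print(compatible_pairs((2, 1)))
print(em_allele_frequencies([(1, 0), (0, 1), (2, 1), (0, 0), (1, 1), (4, 0)]))
```

Enumerating the compatible configurations makes the phase ambiguity explicit: the genotype (2, 1) resolves to either A/AB or B/AA, and the E-step splits each such individual between the alternatives in proportion to the current frequency estimates.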

Results
We provide calculations of r_C² for various CNV-SNP allele frequencies and compare them to the traditional measurements for SNPs inside CNVRs (Table 3). Here, r_A² and r_B² are the traditional metrics of LD computed using SNP allele A or B, respectively, as the reference allele.

Discussion
We have provided a rationale for why current metrics used to assess LD between CNVs and interior SNPs are inappropriate. Given that difficulties arise only for these SNPs, one potential solution, as adopted in previous studies, would be to rely upon exterior SNPs for tagging CNVs. Though this approach has been successful for deletions, duplications tend to be in very low LD with exterior SNPs (Kato et al., 2010). It is possible that duplicate copies are not simply positioned in tandem next to a neighboring SNP in relation to the reference genome. The more extreme case arises when a duplicate copy has been translocated onto a different chromosome. In this situation an exterior SNP will be completely unlinked to the translocated duplicate. However, interior SNPs will segregate within the duplicate, particularly if this copy is not suitably matched for recombination. A similar argument can be made for duplicated segments inserted downstream of their reference location. A larger physical distance between the duplicated copies and an exterior SNP allows a greater probability that recombination will eliminate LD. However, this distance is irrelevant with regard to the allelic content within the duplicated genomic segment.
We have included a new method to quantify LD between CNVs and SNPs which provides accurate estimates for interior SNPs and defaults to the traditional measurements otherwise. As our methods require knowledge of CNV-SNP allele frequencies, we have provided an estimation procedure that performs well under a wide range of scenarios. We hope CNV researchers, particularly those hoping to draw conclusions about CNVs from SNPs, will use this method to identify tagging SNPs which may or may not exist within the CNV boundary.

Web Resources
R code to measure r_C² and CNV-SNP allele frequencies from CNV-SNP genotypes is available from the corresponding author upon request.

Acknowledgments
The work is supported in part by the University of Alabama at Birmingham's Alumni Associations' Marie and Emmett Carmichael Fund for Graduate Students in Biosciences, and NIH grants T32 HL-079888 and T32 HL-072757. The opinions expressed herein are those of the authors and not necessarily those of the NIH or any organization with which the authors are affiliated.
Mean difference represents the mean difference between the true and estimated CNV-SNP allele frequencies. 0*: Less than 1 × 10⁻⁵ for each allele. Haploid configurations can nearly be unambiguously assigned based upon the given three-state haploid model.