A method for calling copy number polymorphism using haplotypes

Single nucleotide polymorphism (SNP) and copy number variation (CNV) are both widespread characteristic of the human genome, but are often called separately on common genotyping platforms. To capture integrated SNP and CNV information, methods have been developed for calling allelic specific copy numbers or so called copy number polymorphism (CNP), using limited inter-marker correlation. In this paper, we proposed a haplotype-based maximum likelihood method to call CNP, which takes advantage of the valuable multi-locus linkage disequilibrium (LD) information in the population. We also developed a computationally efficient algorithm to estimate haplotype frequencies and optimize individual CNP calls iteratively, even at presence of missing data. Through simulations, we demonstrated our model is more sensitive and accurate in detecting various CNV regions, compared with commonly-used CNV calling methods including PennCNV, another hidden Markov model (HMM) using CNP, a scan statistic, segCNV, and cnvHap. Our method often performs better in the regions with higher LD, in longer CNV regions, and in common CNV than the opposite. We implemented our method on the genotypes of 90 HapMap CEU samples and 23 patients with acute lung injury (ALI). For each ALI patient the genotyping was performed twice. The CNPs from our method show good consistency and accuracy comparable to others.


INTRODUCTION
DNA copy number variation (CNV) refers to differences in genomic DNA with varying numbers of gene copies including segmental amplification, deletion, and loss of heterozygosity. CNVs are found widespread in the human genome, covering approximately 18% of the genome (Freeman et al., 2006;Redon et al., 2006;McCarroll and Altshuler, 2007;Database of Genomic Variates). Increasing evidence shows that CNVs accounts for a significant portion of phenotypic variation (Iafrate et al., 2004;Sebat et al., 2004;Tuzun et al., 2005;Conrad et al., 2006;McCarroll et al., 2006;Redon et al., 2006) yet are far underestimated for human diseases and conditions (Sebat et al., 2007). A comprehensive study suggested that the total amount of sequence variation involving CNVs between two healthy subjects was actually higher than that for Single nucleotide polymorphisms (SNPs) , which was supported by the increasing number and resolution of CNV discoveries (Korbel et al., 2007). A systematic evaluation of five widely used array-based CNV detection programs suggested that existing methods have conservative sensitivity in CNV detection.
Recently, high-density SNP genotyping arrays have gained substantial attention for CNV detection and analysis. Although originally designed for genome-wide SNP association studies, they contain signal intensities that can be borrowed to identify regions with deletions or duplications (Komura et al., 2006;Peiffer et al., 2006). Multiple softwares and programs have been developed for these arrays, and their performance is being evaluated (Winchester et al., 2009;Zhang et al., 2011). With a few exceptions, the existing approaches can be roughly classified to two types: single-locus pooled-sample approach and singleindividual cross-genome approach. The single-locus pooledsample approaches use the distributions of signal intensities of multiple samples at a fixed locus to derive reference values and clusters for each CNV value and then determine individual CNV by their belonged cluster, such as TriTyper (Franke et al., 2008) and CNVtools (Barnes et al., 2008). These methods proceed locus by locus and generally ignore inter-marker correlations. The single-individual cross-genome approaches either use partitioning approach [such as, DNAcopy (Olshen et al., 2004), CnvPartition by Illumina, segCNV (Shi and Li, 2012)] or the hidden Markov Models (HMM) [such as, Birdseye (Korn et al., 2008), QuantiSNP (Colella et al., 2007) and PennCNV (Wang et al., 2007)] to call CNV individual by individual. The HMM considers the dependency between copy number states at two adjacent markers by assuming the CNV underlying observed signal intensities is a first-order Markov process (Gelfond et al., 2009). The inter-marker correlation is implied in the homogeneous transition probability that only depends on the intermarker distance. A noticeable exception to the two lines of methods is a novel Bayesian approach that combines the signal intensity distribution and heterozygosity information to infer individual CNV (Zőllner et al., 2009).
Because both SNPs and CNVs affect the signal intensities and may affect phenotypes separately or jointly, their coexistence will affect the identification of each other and the results of association studies. Ignoring SNPs in CNV analysis fails to incorporate allelespecific gains and losses and diminish the potential to exploit linkage disequilibrium (LD) between CNVs and nearby SNPs. How to combine SNPs and CNVs has been a challenge for geneticists. The common annotations of the copy numbers and SNP genotypes have been independent and the available methods for SNP and CNV calling had been separate until (Korn et al., 2008) presented a sequential approach to generate copy number polymorphisms (CNPs) using results from both genotype calls and CNV calls.
Methods were extended to accommodate the CNP along the same class of CNV calling approaches. In single-locus pooledsample approach, the CnvPartition reassumed that intensity and proportion of B alleles follow distinct bivariate Gaussian distributions given different CNP. In single-individual cross-genome approach, (Wang et al., 2009) allowed multiple CNP states in their HMM and incorporated the two-locus LD parameter in addition to the inter-marker distance in the transition probability, which increased the accuracy of the CNP calls. cnvHap (Coin et al., 2010) used a new transition probability in their HMM, which only relies on the two-locus haplotype frequencies.
The HMM of polyHap (Su et al., 2010) treated a CNV region as a region of variable ploidy and considered only two-locus haplotype.
At presence of CNV, all existing algorithms using haplotypes augmented the conventional haplotype definition by treating duplications and deletion as additional "alleles", distinct from two existing SNP alleles at each locus. Such defined "haplotypes" are not continuous physical pieces as traditionally perceived, which can cause conceptual confusions. In addition, the number of possible combinations over a region increases exponentially with the base of 4-6 instead of 2, largely increasing computational complexity and leading to infeasibility. There were a few methods to estimate the frequencies of such "haplotypes". MOCSphaser (Kato et al., 2008b) infers CNV-SNP haplotypes using an expectation-maximization (EM) algorithm but only accommodates integer copy numbers in CNV regions and SNP genotypes in non-CNV regions. CNVphaser employed a hierarchical partition-ligation strategy to break down a longer region into smaller blocks and used the EM algorithm to estimate the "haplotypes" frequencies just as for the regular haplotypes (Kato et al., 2008a).
It often occurs that some intensity values are apparent outliers that can be easily detected by routine quality control procedures. Other methods either exclude the whole samples with some poor quality values (Wang et al., 2007) or re-estimate the SNP at each locus with poor quality by imputation for subsequent calls.
In this paper, we developed a haplotype-based maximum likelihood method to call CNP, which takes account of valuable multi-locus LD information in the population. By posing practical assumptions for short CNV regions, we keep the same conventional haplotypes as originally defined for SNP genotype data and make corresponding inferences. We developed a computationally efficient algorithm that determines optimal CNPs for each individual and estimates haplotype frequencies in the population simultaneously. We consider our method as a natural merge of single-locus pooled-sample and single-individual cross-genome approaches for CNP calling. Our method can also recover CNPs even with missing data. We evaluated our methods through extensive simulations in terms of sensitivity, true positive rate, length of detectable CNV regions for different haplotype structure, frequencies and length of CNV regions. And we compared our method with a few available methods to assess the possible gain of using haplotypes. In addition, we checked how well our method can recover CNPs when there are missing or extreme values in raw data. Last, we applied these methods to the duplicated genotype samples of 23 individuals with acute lung injury (ALI) to check the consistency of our method in calling both CNV and SNP. Accuracy was assessed by comparing calls from all methods to the CNV regions identified through array CGH data.

NOTATIONS
We use c = (c A , c B ) to denote the copy numbers of A allele and B allele, or CNP at a bi-allelic marker locus. CNPs at normal states with two copies of alleles include (1,1), (2,0), and (0,2), which code regular SNP genotypes; (0,0) is the CNP at the double deletion state, (0,1) and (1,0) are the CNPs at single deletion states, and (1,2), (2,1), (2,2), (1,3), and (3,1) or more are for the duplication states. The copy numbers in most available methods are referred to as CN = c A + c B , which does not contain the allele specific information and cannot infer disease risks associated with the copy number change of a specific allele. Because the duplications with four or more copies are virtually indistinguishable on genotype platforms (Wang et al., 2009), we set the maximum copy number to 4.
We denote X A and X B as the normalized signal intensity values for allele A and allele B, respectively. X A and X B can be extracted from raw CEL files using the standardized normalization procedure, such as the BeadStudio software for Illumina platforms and the Affymetrix Power Tools for Affymetrix platforms. We use two measures log R ratio and B allele frequency, denoted as r and b, as the observed values in our models. r is the logarithm ratio of observed total intensity R = X A + X B to expected intensity (Peiffer et al., 2006), and b is the standardized proportion of samples carrying the B allele, i.e., a linear transformation of θ = arctan(X B /X A )/(π/2).
We assume our study includes N individuals and we start with a haplotype block of interest containing M markers. Within the haplotype block, there are a total of s possible haplotypes h 1 , . . . , h s with population frequencies ρ = (ρ 1 , . . . , ρ s ). Throughout the paper, we use subscripts for marker locations and superscripts for individuals.

HAPLOTYPE IDENTIFICATION GIVEN CNP
In this section, we illustrate how individual CNPs can help infer haplotypes under some practical assumptions. We assume either duplication or deletion occurs as a continuous piece on one chromosome and deletion/insertion regions are not immediately next to each other for each individual.
If a CNV region of an individual covers the whole region of interest, we can identify his haplotype(s) as follows: (1) Within a duplication region (CN = CN 1 = · · · = CN M = 3 or 4), the duplicated haplotype h 1 can be written as {S 1 . . . S M : S 1 = A if c k,A CN − 1, or S k = B otherwise, for k = 1, . . . , M}, and the other haplotype For example, if we know an individual's CNPs (3, 0)(1, 2)(2, 1)(0, 3) at four adjacent loci in a block, it is easy to see that there are three copies of allele A and 0 copy of allele B at the first locus and thus A must be on the duplicated haplotype. Similarly at loci 2, 3, and 4, B, A, and B are on the duplicated haplotype. Therefore the haplotype ABAB must be the duplicated piece and the other AABB is the normal non-duplicated haplotype. Similarly, given an individual's CNPs (1, 0)(0, 1)(0, 1)(1, 0), we will know that one haplotype of his was deleted and the other haplotype is ABBA. For those determined haplotypes given c = (c 1 , . . . , c M ), we call them "compatible" with c and denote as (h 1 , h 2 ) ∼ c. We say (ABAB, AABB) are compatible with (3, 0)(1, 2)(2, 1)(0, 3), i.e., (ABAB, AABB) ∼ (3, 0)(1, 2)(2, 1)(0, 3). If a CNV piece doesn't cover the whole region of interest, the haplotypes compatible with the CNP genotypes are determined by the CNPs within the CNV region and the regular SNP genotypes outside of the CNV region. As shown in Figure 1 for a region including five loci, the haplotype sections within the 3-locus CNV regions are uniquely identified and the sections in the normal region (loci 1 and 5) can be partially inferred given population haplotype distributions. In the left figure with a deleted piece, the middle section of one haplotype is ABA and that of the other is deleted following our rule listed previously. Combined with genotypes (1,1) at loci 1 and 5, the possible haplotype couples covering the whole region can be AABAA/B...B, AABAB/B...A, BABAA/A...B, and BABAB/A...A. Similarly in the right figure with a duplication piece, the possible haplotype couples are AABAA/BBBAB, AABAB/BBBAA, BABAA/ABBAB, and BABAB/ABBAA. Now we see that a CNV region of an individual can help infer his/her haplotypes and thus help estimate the population haplotype frequencies. On the contrary, the haplotype information in the population can be used to better infer individual CNPs. In this example, if in the population there is no AABAA, BABAA, or BABAB (i.e., π(AABAA) = π(BABAA) = π(BABAB) = 0), the haplotypes in the left figure can be uniquely determined as AABAB/B...A and in the right figure, the duplicated haplotype would be AABAB and the other BBBAA. In most cases, we know non-zero probabilities of AABAA, AABAB, BABAA, or BABAB in the population and can still make some inference about the haplotypes given CNP. This can be done by incorporating the haplotype distributions in the likelihood as we show next. In our method, we allow CNV regions vary from individual to individual, i.e., the CNV regions of different individuals can have

FIGURE 1 | CNP and haplotype configurations within a 5-locus block.
2-digit CNP at each locus was shown on top. Specific alleles A or B are shown in each circle at the corresponding loci aligned on a pair of chromosomes. The long lines between two loci denote deletion regions (no corresponding alleles). Both deletion (left) and duplication (right) occur from loci 2 to 4. completely different boundaries. Such flexibility would lead to extremely large numbers of haplotypes and likely small probabilities for them in the population in other "haplotype"-based CNV methods.
Please note that, as SNPs, CNPs were aligned according to the reference coordinate, but not to the actual physical locations. For duplications, LRR and BAF can only tell which piece of chromosome is duplicated but not where it is connected to.

MAXIMUM LIKELIHOOD METHOD USING HAPLOTYPE INFORMATION
Given observed log R ratio (r) and B allele frequency (b) at loci 1, . . . , M, the likelihood can be written as a function of CNP and population haplotype frequencies, i.e., , and CNP at all loci for individual i and C represents the CNP for all individuals, i.e., are the conditional probability of the BAF and LRR given the CNP at locus k, P(h , h ) is the probability of observing two haplotypes h and h , and P denotes the set of parameters in both conditional probabilities P(b i k | c i k , P) and P(r i k | c i k , P). The last equation holds when b and r are assumed conditionally independent given c and either b k or r k are conditionally independent of other b l or r l given c. If we assume Hardy-Weinberg Equilibrium (HWE), As in HMM models, we can assume that (1) with respect to C and π will yield optimal CNP estimates of each individual and haplotype frequency estimates. But the number of combinations of CNPs of all individual and haplotype is huge and searching over the whole parameter space can be computationally intensive. To reduce the computation burden, we derived an optimization algorithm to update individuals' CNP and haplotype frequencies iteratively.

THE CNP OPTIMIZATION ALGORITHM
We used the following iterative algorithm to maximize the likelihood Equation (1): (1) At step 0, assign initial values to CNPs of all individuals C (0) and population haplotype frequencies π (0) within a haplotype block.
(2) At step , given C ( ) and π ( ) , maximize the log-likelihood over possible c i for each individual i to obtain optimal c i,( + 1) , i.e., (3) Update π through collecting the conditional probability of haplotypes given all individuals' CNP. This is a revised expectation-maximization (EM) algorithm (Dempster et al., 1977) of updating haplotype frequencies given individuals' genotypes in the absence of CNV (Excoffier and Slatkin, 1995). Given current estimate of π ( ) and updated C ( + 1) , the haplotype frequency is updated according tô (2) and (3) until the inferred CNP doesn't change and the estimated parameterπ converges.
When the region is long, the path for convergence can be painfully long and thus makes the computation infeasible. To save the computation burden, we will call the initial CNP using a HMM and then apply our haplotype-based method. The HMM we used are similar to others' (Colella et al., 2007;Wang et al., 2007Wang et al., , 2009 and we refer readers to Appendix A for details.

MISSING DATA
When some b k 's and r k 's are missing, the genotype or haplotype at the missing loci can be inferred using the LD information around the missing loci. For a single individual, if b k , r k at loci k ∈ D are missing, the contributions of that individual to the overall likelihood is changed to The key difference between likelihoods Equations (1) and (4) is that Equation (1) does not have the components corresponding to possible haplotypes at the unobserved loci for this individual. Computationally, missing data affect initial parameter estimation slightly and may also increase the computational complexity because the number of possible CNPs and haplotypes increases for those individuals with missing data. If a locus is not genotyped at all, all individuals will not have the corresponding components, but the extended haplotypes in other reference populations can be used to infer the CNP within or outside of the CNV region.

POTENTIAL USES OF OUR PROPOSED METHOD
Our proposed method can serve as either an independent calling method or a refining procedure upon the results from other calls. Please note that the initial calls from HMM are not necessary and other initial values can work in our method as well. The haplotype frequency can be estimated as a byproduct in a decent-size sample or can be borrowed from public database such as HapMap to improve the inference in small samples.

SIMULATIONS
To evaluate the performance of our methods, we conducted a series of simulations to access: 1. sensitivity of detecting CNV intervals, i.e., how often we can correctly detect it when there is such a region; 2. true positive rate, i.e., how often our detected CNV region are true; and 3. length of truly and falsely detected CNV regions. The CNV length, frequency among population, and the haplotype structure, were varied to understand their influences on the CNV calls. Our method was compared with five existing methods. We also checked CNP recovery rate using our method under a missing data scenario.
For each data set, we simulated the LRR and BAF for 1000 unrelated individuals following three steps: First, 2000 independent chromosomes of 1000 individuals were generated containing 55,860 SNP loci, the same number of loci on Affymetrix Genome-Wide Human SNP Array 6.0 platform, along chromosome 4. We selected eight haplotype blocks with various length and LD structure, four medium-length and two long blocks with low to medium R 2 , two medium-length blocks with high R 2 , as shown in Table 1.
All genotypes were simulated using the allele frequencies in HapMap CEU population. In addition, two-locus LD and multilocus LD were reserved outside and within the selected haplotype blocks, respectively. Specifically, outside of the haplotype blocks, the alleles at the first locus were generated from Bernoulli(p 1 ) where p 1 is the population frequency of B allele in CEU; alleles at other loci were generated using the conditional probabilities given the previous alleles as observed in CEU. Within the selected haplotype blocks, the starting SNPs were simulated as before conditional on the alleles at previous locus only but the remaining alleles were simulated as haplotypes based on the conditional haplotype frequencies within the same block in the CEU population.
Second, within the eight selected haplotype blocks, deletion and duplication regions with fixed boundary were randomly chosen from populations and the haplotype piece within the CNV regions was deleted or inserted. At this stage, the true CNPs at all loci were generated.
We considered multiple scenarios with varying parameters, specified as follows: 1. different length of the CNV regions (3-5 in short blocks and 5-20 in long blocks); We fixed the left boundary of each CNV region so that the LD can remain the same within each block and we can use the LD across different haplotype blocks to understand the influences of the LD between the boundary and other SNPs. 2. more and less frequent CNV; For more frequent CNV regions, we let the population frequencies of deletion and duplication to be 20 and 5%; for less frequent CNV regions, the population frequencies of deletion and duplication were set to be 5 and 1%.
3. random missing of LRR and BAF; we assumed a relatively high missing rate 1% for all loci and all individuals.
For each scenarios, 100 datasets were generated. Based on observed LRR and BAF, we applied our method hap-CNP to call CNPs and compared them with the true CNP. Several CNV calling procedures were compared, including a HMM in Wang et al. (2009) (WHMM), PennCNV for unrelated individuals (Wang et al., 2007), a SCAN method (Jeng et al., 2010), an integrative segmentation method segCNV using both the joint distribution of LRR and BAF (Shi and Li, 2012), and cnvHap (Coin et al., 2010) that is another HMM using two-locus haplotype distribution in transition probabilities. PennCNV, SCAN, and segCNV only consider the copy numbers and can form good contrasts with ours for CNP. Comparing with WHMM and cnvHap can help us understand how much additional information we can gain using correlation within multi-locus haplotype and flexible boundary assumption.

DUPLICATED SAMPLES OF PATIENTS WITH ACUTE LUNG INJURY
In a genome-wide study to investigate the genetic effect on various trauma-induced clinical outcomes, cases with ALI and controls from at-risk trauma population at the University of Pennsylvania were recruited for genotyping. To control the quality of genotyping, 23 Caucasian ALI patients were randomly chosen to have duplicate serum samples, which were separately genotyped using Illumina HumanQuad610 BeadChip (Illumina, San Diego). Over 600,000 bin-tagging polymorphisms were included and normalized intensity data for each sample were loaded into Illumina Beadstudio 2.0. See Christie et al. (2012) for genotyping details.
To check the feasibility and reliability of our proposed method, we applied it to the 23 pairs of samples and compared our CNP calls with PennCNV, SCAN, and the normal genotypes called using Illumina's clustering algorithm without considering CNV. The summary statistics of raw signal data were checked prior to analysis. Samples with extreme values generally suggest low quality and were removed from the analysis. We used a sliding window of five SNPs through the whole chromosome to determine haplotype blocks. In each window, haplotype frequencies are estimated from the HapMap genotype data. Because the HapMap genotypes were generated mainly from Affymetrix platform, their strands sometimes were the complementary stands of our genotypes. We unified the strandness of two sets of genotypes and estimated the corresponding haplotype frequencies using the EM algorithm implemented in the haplo.stats R package.
We checked the concordance of the genotype of CNV calls from our and others' methods and the recovery of CNP calls among regular genotype calls.

HapMap CEU SAMPLES
We also applied PennCNV, SCAN, WHMM, segCNV, and our method in 90 HapMap CEU samples genotyped by Affymetrix Genome-Wide Human SNP Array 6.0. To assess the accuracy of various calling methods, we checked the overlap between CNV calls from these methods and the CNV regions validated through array-CGH (Conrad et al., 2010).

www.frontiersin.org
September 2013 | Volume 4 | Article 165 | 5  As the length of true CNV interval increased with the same haplotype block, the sensitivity increased and the true positive rate remained similar. With comparable LD, the longer CNV regions across different blocks tend to have higher sensitivity and true positive rates. This is because haplotypes are more likely to be separated with longer CNV regions and in return the more accurate haplotype information can lead to better CNV detection.

SIMULATION RESULTS
In addition, we found the majority of the false negatives resided on the boundary rather than within the true region, which was also consistently observed in the results from other methods.
The sensitivity remained stable in both common and rare CNV intervals, while the true positive rate was slightly higher for common CNV regions than that for rare CNV regions. Though we expected better performance in common CNV than rare CNV, the sample size we generated was large enough to give reliable haplotype inference. So the frequency of CNV doesn't play much role in studies with a decent sample size (n ∼ 1000). For smaller sample size (n < 200), common CNVs could lead to much better inference in Haplotypes than rare CNVs and the difference can be larger.
There was no clear trend of sensitivity change as LD gets stronger from CORIN-3, ADD1, NR3C2-1 to LOC285501-2 (overall average R 2 increases from 0.15 to 0.81). Even the LD is assumed to help the inference of CNP, the excess high LD may not necessarily lead to much accuracy gain and the little gain may be covered by the boundary LD and length of the CNV regions.
In truly detected CNV regions, the average length of the regions was close to the truth; while in falsely detected CNV regions, the average length was around 1, meaning the most of the falsely detected CNV were singletons. On the contrary, almost all detected singletons were false CNVs and the longer detected regions were more likely to be the true CNV regions. The average lengths of detected CNV intervals were similar for common and rare cases while the variability was bigger for rare case due to the small sample size of CNV intervals.
As a comparison, the results from other methods including PennCNV, SCAN, WHMM, cnvHap, and segCNV are summarized in Table 3. In general, PennCNV underestimated CNV regions and more often it happened when the true CNV regions were short. WHMM, SCAN and segCNV methods were more sensitive than PennCNV, but slightly less than our method. Interestingly, although PennCNV detected fewer CNV loci and regions, their detected CNV regions were often longer than the truth. WHMM yielded much more CNV regions than the truth, which resulted in high sensitivity with small true positive rate. SCAN had similar performance as ours for less frequent cases, while it has smaller sensitivity than ours for more frequent cases. SCAN showed slightly longer falsely detected CNV regions than our methods in both cases. segCNV, as a partitioning method, performs similar to SCAN. cnvHap, as the only method using haplotypes we are comparing with, has superior true positive rates than others in general. For short CNV regions, it has better sensitivity than others but this advantage quickly diminishes as CNV regions become longer. On one hand, cnvHap's results support the advantage of using haplotype information but its usage of two-locus haplotypes may not be ideal for a long block, which seems improved in our method. It is counter-intuitive to see that cnvHap perform worse for longer CNVs. This is partially because the results are sensitive to the choice of blocks, even with true values, in the customized normalization step of cnvHap.
Tables 2, 3 demonstrate that our method provides the most sensitive and accurate results among the methods considered.
When we randomly selected 1% loci to have missing b and r values, the CNP at missing loci were estimated using neighbor markers as described in section 2.5. As described in Table 4, recovery rate is generally higher for longer CNV regions. Missing loci in rare case are more likely to be correctly recovered than those in more frequent cases. For less frequent cases, there are more normal copy number loci which can be used for accurate haplotype frequency estimation.
For haplotype inference, we checked the haplotype frequency estimates from our method and from HaploView using all individuals with normal copies. Within the NR3C2 block, the sum of squared errors of estimated frequencies vs. the true frequencies from our method and from HaploView had a mean of 0.0007 and 0.0011, respectively. This shows a better accuracy of inferring haplotypes using CNP, as a byproduct of our method. We found that our method can give similar results no matter how long the CNV interval is but the estimates can be more accurate in common CNV regions than less frequent CNV regions (data not shown). That's not surprising because with common CNV regions, more individuals have CNV and those more informative haplotypes.
Based on 500,000 simulations, the average computing times for one chromosome were 10.5 s for PennCNV, 77.7 s for WHMM, 1.0 s for SCAN, 3.2 s for segCNV, and 81.5 s for our hap-CNP. cnvHap requires customized input for each haplotype block and thus we only ran it over our tested blocks, which took 137.7 s per simulation. As expected, SCAN is extremely fast; PennCNV is efficient using the forward-backward algorithm for HMM and it was developed for calling CNV only without allelic specifications. WHMM and our hap-CNP are comparable though we allow copy numbers to range from 0 to 4, more states than 1-3 in WHMM. cnvHap requires more computing time and would become challenging to run over a whole chromosome, mostly because the number of "haplotypes" increases considerably with all possible CNVs.

CNPs ON CHROMOSOMES 1 OF DUPLICATED SAMPLES
Without loss of generality, we reported the results of CNP calls on chromosome 1. Among 46 samples, two had either extremely large variance or median absolute deviation and thus were removed from further analysis. There are 1317 loci missing in LRR or BAF among all subjects. Among them 22 loci were recovered from our algorithm. Table 5 summarizes the total number of CNV calls on chromosome 1 from one set of samples in contrast of the calls from the other set of samples, using our method, PennCNV, and SCAN.
The copy number concordance rates were 98.6, 97.5, and 99.8% for our method, SCAN and PennCNV, respectively. PennCNV showed higher concordance rate due to its conservative detection of CNV while our method and SCAN detected more CNV loci. The copy number discordant rate was 1.00% in our method, mainly caused by three individuals whose r is further away from majority of individuals.  Table 6 summarizes the number of normal SNP genotype calls from one set of samples compared with the other set of samples. The concordance rate of regular genotypes (CN = 2) using our method was 99.95% and for BeadStudio was 99.99%. But our method had a no-call rate of 0.13%, much smaller than 4.35% from BeadStudio. BeadStudio provides high amount of no-calls to maintain the concordance rate almost perfect. While our method extracts more information from the BeadStudio's no-calls, that is, among 43,018 no-call loci from BeadStudio, 38,962 loci were genotyped concordantly by our method. Hence, there was a trade-off between no call rate and concordance rate. Table 7 summarizes CNV regions on chromosome 1 of CEU samples detected by PennCNV, SCAN, segCNV, WHMM, and our hap-CNP in overlap with the CNV regions validated through array CGH and with each other. Among a total of 26.5 Mb CNV regions detected by hap-CNV, more than 80% are in the validated regions. PennCNV and SCAN detected much less but most of their detected regions are covered by the validation set. Overall, we have detected 70, 57, 67, and 78% unique CNVs which cannot be found in PennCNV, SCAN, segCNV, and WHMM. Despite that the majority of CNV regions were among validated regions, the percentage of all validated regions covered    by each approach was tiny (0.13, 0.06, 0.10, 0.09, and 5.16%, respectively), suggesting genotyping platform may have limited sensitivity for CNV detecting compared with aCGH. When we checked individual calls, most of long CNV regions were called by all five algorithms but there is no persistent optimal choice of an algorithm. As example regions from three samples are shown in Figure 2, long deletion and duplication regions were detected by all five methods in (A) and (B) but a small deletion CNV region was detected only by our method. Computation time was in similar scales as in simulations. For 90 CEU individuals, the average computing time on chromosome 1 were 1.0, 16.9, 3.2, 48.2, and 28.8 s for SCAN, PennCNV, segCNV, WHMM, and our method, respectively.

DISCUSSION
SNPs and CNVs may affect phenotypes separately or jointly and the accuracy of their call can affect the results of association studies. Ignoring CNVs during SNP genotyping may lead to failure to capture the true underlying sequence at many sites and can create the appearance of violations of Mendelian inheritance or Hardy-Weinberg equilibrium where in fact none exists. Using only CNV while ignoring the allelic information in the association studies may fail to incorporate allele-specific gains and losses and diminish the potential to exploit LD between CNVs and nearby SNPs (ongoing study). Association analysis using copy number only without differentiating alleles can dilute the effect size and the power, as shown in both simulations and real studies of insulin and schizophrenia (Hu et al., under review;Irvin et al., 2011).
We didn't separate LOH from the CNV calls, but it can be checked as a special class from our call, i.e., regions with c A · c B = 0 and c A + c B = 2. Due to the limitation of genotyping platforms, our method can not detect interchromosome duplication and dispersed segmental duplication, which can be discovered using genomic sequence data.
Our assumption for haplotype-based CNV inference means that within a region, the deletion/duplication piece cannot end at locus T on one chromosome and then occur immediately again from T + 1 on the other. If in reality this occurs, one more parameter of the event probability can be incorporated in the likelihood, which will result in much longer computation time as a trade-off.
In duplication regions, our method also relies on the assumption of a nearby haplotype being duplicated. In reality, exceptions could occur, which may affect the performance or our method in uncertain ways. So users should be cautious about the inference on the duplication regions when the assumption is in doubt. Further investigations on how likely this would happen and what bias it leads to are warranted.
The genotypes at the loci with missing LRR/BAF values can be inferred using the neighborhood haplotype information. Depending on whether the loci are at the boundary of or within CNV regions, the copy numbers may not be accurately recovered. This can also be used for imputing CNP genotypes of some individuals that were genotyped using a platform different from others.
We used the estimates from an HMM as the initial values for our proposed haplotype-based method to expedite the computation. We also checked the robustness of our method using other initial estimate such as clusters based on arbitrary cut-off values for LRR and BAF. We found the performance of our method was consistent. For the long regions, the initial calls from HMM were generally reliable and had little space for improvement. But for short regions where our model assumptions are more likely to meet, our method yielded more reliable and accurate calls. These finding were consistent as reported in Wang et al. (2009).
Whether real chromosomes can be partitioned as unrelated haplotype blocks is still a question, early studies (Daly et al, 2001;Patil et al, 2001;Dawson et al, 2002;Gabriel et al, 2002;Zhang et al, 2003) has shown the rational and feasibilities of separated blocks' representation. So we adopted known haplotype blocks in our simulation. As a limitation of the algorithm, the data generated can only have similar local LD patterns as in the HapMap CEU population.
In analysis of HapMap data, we used sliding-window approach to avoid the selection of haplotype blocks. We have tested a few similar window sizes, which resulted little differences. But longer blocks can cause problem in haplotype estimates and slow down the algorithms even though they worked fine in our simulations. In addition, since there is more chances to detect longer CNV and less space for improvement, using long sliding windows may be not efficient in whole-genome scan.
The mutual benefits of haplotype and CNP inference can be applied to other data such as next generation sequence data, as in our ongoing work.