Quantifying the impact of inter-site heterogeneity on the distribution of ChIP-seq data

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) is a valuable tool for epigenetic studies. Analysis of the data arising from ChIP-seq experiments often requires implicit or explicit statistical modeling of the read counts. The simple Poisson model is attractive, but does not provide a good fit to observed ChIP-seq data. Researchers therefore often either extend to a more general model (e.g., the Negative Binomial), and/or exclude regions of the genome that do not conform to the model. Since many modeling strategies employed for ChIP-seq data reduce to fitting a mixture of Poisson distributions, we explore the problem of inferring the optimal mixing distribution. We apply the Constrained Newton Method (CNM), which suggests the Negative Binomial - Negative Binomial (NB-NB) mixture model as a candidate for modeling ChIP-seq data. We illustrate fitting the NB-NB model with an accelerated EM algorithm on four data sets from three species. Zero-inflated models have been suggested as an approach to improve model fit for ChIP-seq data. We show that the NB-NB mixture model requires no zero-inflation and suggest that in some cases the need for zero inflation is driven by the model's inability to cope with both artifactual large read counts and the frequently observed very low read counts. We see that the CNM-based approach is a useful diagnostic for the assessment of model fit and inference in ChIP-seq data and beyond. Use of the suggested NB-NB mixture model will be of value not only when calling peaks or otherwise modeling ChIP-seq data, but also when simulating data or constructing blacklists de novo.


INTRODUCTION
Chromatin Immunoprecipitation followed by sequencing (ChIPseq) is an experiment for the genome-wide location of events such as transcription factor binding sites and histone modifications (Park, 2009;Cairns et al., 2013). These events provide information on chromatin status, a topic of great interest in epigenetics. As with all sequencing assays, ChIP-seq can be represented as count data across the genome. Commonly, researchers need algorithms that find sites where the counts are larger than one would expect under a null noise model, a procedure known as "peak-calling." Consider read counts x i , where i indexes over genomic bins, for a single ChIP-seq sample. When modeling these counts as independent samples from some Poisson random variable X, a common problem is overdispersion-that is, the property that the observations have greater variance than allowed for by the model. In our case, the Poisson distribution requires that mean and variance are equal, an assumption that (even when ignoring between-sample variability) is violated by ChIP-seq data, impairing model fit (Spyrou et al., 2009).
The choice of distribution for X is important when peakcalling; in particular, underestimating the variance should decrease specificity. Indeed, many have commented on the presence of false positives in peak-calling and how this quantity varies depending on the choice of peak-caller (Landt et al., 2012).
A number of strategies exist to account for the extra variance. For example, we can allow the Poisson distribution to have site-specific means-that is, X i ∼ Pois(λ i ). This strategy requires some sort of smoothing criterion to make parameter estimation robust (Zhang et al., 2008a). It is also difficult to expand this model to account for between-sample heterogeneity.
Many researchers use more general distributions-for example, the Negative Binomial (NB) model is used by Spyrou et al. (2009) ;Wu et al. (2010); Cairns et al. (2011); Song and Smith (2011) and others. The log-normal Poisson model has been used to model data from other high-throughput sequencing experiments, such as Serial Analysis of Gene Expression (SAGE) (Thygesen and Zwinderman, 2006).
Other strategies involve regression. ZINBA (Rashid et al., 2011) uses an NB model whose mean is regressed against known covariates, though the dispersion is fixed. Such a strategy requires knowledge of all appropriate dependent variables to capture the full variability.
An increasingly common approach is to use blacklists (Myers et al., 2011)-regions that have unusual mappability and previously have been seen to accumulate artifacts across many ChIP-seq experiments. Reads that fall in blacklisted regions are removed.
Use of blacklists appears to improve peak-calling (Carroll et al., 2014), but has a number of downsides. Firstly, this strategy requires an organism-specific blacklist, and it is possible that the blacklist is non-exhaustive. Secondly, there is no reason to assume that enrichment cannot occur in these loci-a better alternative would be to model these regions appropriately, or perhaps separately. Thirdly, there may be sample-specific or copy-number specific events. For example, a transfected vector may artificially inflate copy number at a particular locus.
We consider the general form of the above models, taking a completely unsupervised random-effects approach. It is reasonable to retain the Poisson element in our model, since it has been shown that counts from a single site in a single library, sequenced repeatedly, follow a Poisson distribution (Marioni et al., 2008). However, by expanding to a mixture model setting, where x i is a sample from X i ∼ Pois( ) and the latent mixing variable satisfies ∼ f (λ), we can use f (λ) to capture the genome-wide variation in our sample. Indeed, if has gamma distribution, then X i has NB distribution.
However, there is no biological or statistical reason, other than convenience, to believe that is best modeled with the gamma distribution. We assess the fit of the NB model to the data; that is, investigate the appropriateness of the gamma distribution as a mixing distribution. Next, we use CNM to make inferences about the distribution of , and see that it suggests a mixture of two distributions. Thus, we consider the NB-NB distribution and show that it provides a better fit to the data. We also consider zero inflation, and show that the poor fit of the NB distribution can be mistaken for the presence of zero-inflation in the data.

DATA
We use four different data sets, all of which are input samples (that is, untreated data). We focus on input samples because, in order to detect sites where counts differ from noise, it is important to model the noise appropriately. However, the methods used can also be applied to treated ChIP-seq data-see Supplementary Material. Samples were chosen to represent a variety of species. In each case, we considered only the first chromosome to avoid inter-chromosome normalization issues (Schmidt et al., 2010;Ross-Innes et al., 2012). We partition the first chromosome into 100 bp bins, avoiding conservatively-chosen regions that span the telomeres and centromeres, as obtained from the UCSC Genome Browser at http:// genome.ucsc.edu/. We did not remove duplicates-however, we found that deleting duplicates did not substantially affect our results (see Supplementary Material).

Sample Species Description Read # Reads Aligner lengths
Code and data to reproduce the analysis are linked to in the Supplementary Material section.

POISSON
The Poisson model's Maximum Likelihood Estimate (MLE) occurs when its expected mean is equated to the observed mean: μ =x.

NEGATIVE BINOMIAL
The MLE for the Negative Binomial (NB) distribution does not have closed form. However, we can fit the model using standard techniques, using the BFGS method implemented as fitdistr() in the MASS R package (Venables and Ripley, 2002).
Though the MLE for the dispersion of the NB distribution is biased, the bias is of the order 1/n (Saha and Paul, 2005) and we have enough bins in our data sets for the bias to be negligible.

LOG-NORMAL POISSON
Here, is a log-normal random variable: log ( ) ∼ N(μ, σ 2 ). The full log-likelihood involves a complicated integral: and the MLEs have no known closed form. Therefore, to estimate the parameters of this distribution, we take the following Bayesian approach: 1. Assume a weak prior distribution for each of μ (the mean) and σ −2 (the precision). We use conjugate prior distributions: The prior distributions should have very little effect on the posterior, given that we have so many data. 2. Use the Metropolis-Hastings algorithm to sample from each parameter's posterior distribution. 3. Excluding a suitably-chosen burn-in period, take the mean of the posterior samples as an approximation to the maximum a posteriori (MAP) estimate.

Figure 1
shows an example of the fit of the above distributions to sample A. We see that the empirical distribution has a large tail that the fitted distributions cannot account for, which in turn affects their ability to model bins with low counts.
To explore this effect further, we investigate the properties of the underlying mixing distribution.

MIXTURE INFERENCE
As described in the introduction, suppose that we have multiple samples {x i ; i = 1, . . . , n} from a variable X distributed according to a Poisson mixture-that is,

Bin count
Logarithm of probability density Our aim is to estimate the density f (λ) from our observations {x i }.
A formulation of the general problem, where X i | i has arbitrary distribution, is given in Roueff and Rydén (2005). In the case of the Poisson distribution, it is known thatf (λ), the maximum likelihood estimator (MLE) of f (λ), consists of a finite number of point masses: that is, if we adopt the MLEf (λ), then is a discrete random variable, taking values from some vector θ with associated probabilities π, where θ and π both have length L. It is also known that L, the number of support points used in the MLE distribution, can never exceed the number of distinct values in our X samples (Laird, 1978).
The CNM algorithm takes an initial value for (L, θ, π ), then repeats the following steps until convergence: 1. Update (L, θ), as follows: Suppose that G(λ) is our current estimate for the mixing distribution, corresponding to the current value of (L, θ , π ). Now, consider some new candidate support point θ , and let H θ (λ) be the distribution that has a point mass at θ -in other words, H θ is the Dirac delta function H θ (λ) = δ(λ − θ ). Consider the "gradient function," the directional derivative which we can think of as the rate at which the likelihood increases, as we shift probability mass across to the new support point θ .
The value of θ that maximizes d(θ ; G) is added to θ , and L is increased by one. If multiple values of θ maximize d(θ ; G), then we add each of them to θ, increasing L by one for each value. 2. Update π by calculating the MLE for π given (θ, L). Wang (2007) show that this problem is equivalent to refers to the likelihood for a single data point, x j ). This minimization problem is solved using the Constrained Newton (CN) method. 3. If π i = 0 for some i * , then delete π i * and θ i * and reduce L by one.
Wang (2007) prove theoretical convergence and demonstrate that, in practice, convergence is dramatically faster than previous algorithms designed to find L, θ, and π.
CNM reported that it did not converge for data sets C or D, and we check the output in the next section. In practice, we found that the value of L ranged from around 4 to 11 in our input samples.

DISTRIBUTION RECOVERY
First, we show that the CNM estimatef (λ) is appropriate for our data, by demonstrating that the empirical distribution of X can be recovered from the CNM estimate according to the following procedure: 1. Calculate the empirical probability densityf (x) directly, from the data. 2. Calculate the mixing distribution MLEf (λ) from the data according to the CNM algorithm. 3. Estimate the density f (x) from the mixing distribution according tof where f P (x; θ i ) is the density of the Poisson distribution with mean θ i .
If the CNM algorithm retains the key features of the true distribution f (x), then we expectf (x) to be "similar" to the empirical distributionf (x). The results of this comparison are shown in Figure 2.

SMOOTHING
The MLEf (λ) is desirable in the sense that it is the "best fit" to our data. However, it does not make physical sense to usê f (λ) when modeling, which would make a discrete variable. In reality, drawing from represents a complicated chain of www.frontiersin.org November 2014 | Volume 5 | Article 399 | 3 FIGURE 2 | Distribution recovery in each data set. The empirical distribution obtained from the actual data (black line) appears to show a dramatic change in behavior around x = 9. While CNM reported failure to converge for samples C and D, it can be seen that sample C's distribution was adequately recovered.
library-preparation procedures, leading to wide variation across hundreds of millions of sites across the genome. Therefore, it is much more logical to model as a continuous random variable. As such, we adopt the following strategy to assess the fit of various distributions: 1. Fit a continuous mixing distributionf (λ) to the data. 2. Comparef (λ) with the discrete MLEf (λ), obtained from CNM.
Iff (λ) is flexible enough to fit our data, then it should be "similar" to the MLEf (λ). Thus, we need to compare a discrete distribution,f (λ), with a continuous distribution,f (λ).
Methodology for making such a comparison tends to be ad hoc, as the general question is ill-defined.
A common strategy is to smooth out the data through "kernel smoothing"-that is, replacing each point mass in the discrete distribution with an appropriately-scaled kernel distribution, often the normal distribution. The problem with this approach is that the support of the mixing distribution's MLE contains only a handful of points, and the points become dramatically less dense at higher values. Thus, by using this approach, we end up with a distribution that is highly sensitive to the choice of length scale for the kernel.

Frontiers in Genetics | Bioinformatics and Computational Biology
November 2014 | Volume 5 | Article 399 | 4 1. Take some partition P = (P 1 = 0, P 2 , P 3 , . . . , P L , ∞) of the interval [0, ∞]. 2. For each i ∈ {1, . . . , L}, replace the probability mass of f (λ) that lies in the interval [P i , P i + 1 ) with an equivalent point mass of size π i = P i+1 P i f (u)du placed at any point θ i within that interval. Now, consider the true distribution's CDF, F(λ) = λ 0 f (u)du. For i > 1, we have: For the case i = 1, we set Note that Equation 2 assumes that is not zero-inflated.
We can now place bounds on the value of F(θ i ). θ i must lie in [P i , P i + 1 ), by assumption. Therefore, F(θ i ) must lie in [F(P i ), F(P i + 1 )), since F(λ) is a CDF and is therefore an increasing function of λ. Thus, for i > 1, by Equation (1) and for i = 1, we have 0 ≤ F(θ 1 ) < π 1 (4) We can use these upper and lower bounds to assess the fit of various candidate mixing distributions to the observed mixing distribution. Figure 3 shows an example of this procedure, as applied to sample A. We see that all of the candidate mixing distributions considered thus far violate the CNM bounds early on, indicating that these distributions cannot cope with large counts. This motivates the selection of a mixing distribution that can stay within the bounds.

NB-NB MIXTURE
The curves plotted in Figure 3 have insufficient curvature to accommodate the sharp turn present in the shaded region. This suggests that a mixture of two distributions is requiredconsistent with the abrupt change in behavior seen in Figures 1, 2.
We consider the case where X is a mixture of two NB distributions, equivalent to modeling as a mixture of two gamma distributions. In this case, we have a mixing parameter τ , and two separate NB parametrizations (μ 1 , r 1 ) and (μ 2 , r 2 ). Thus, We could find the MLE of the parameters with the EM algorithm (Dempster et al., 1977). However, since the EM algorithm can converge very slowly, we accelerated the process using SQUAREM (Varadhan and Roland, 2008). SQUAREM assumes that each step of the EM algorithm can be approximated using a particular quadratic form, allowing us to estimate the cumulation of a large number of EM updates in one go. Note that we did not consider mixtures of Poisson distributions, since these cases are attended to by the CNM algorithm, and we did not consider mixtures of log-normal Poisson distributions due to the computational complexity.

MODEL FITS
We considered the following distributions: We again see a striking change of behavior, occurring at around x = 9. The only distribution that can model this change is the NB-NB mixture.

FIGURE 5 | Total variation.
We see from the plot that the NB-NB mixture outperforms all of the other distributions, achieving the closest fit to the CNM estimate in the cases where it exists.
Then we examine the similarity between the mixing distributions used and the observed mixing distribution as calculated by CNM, in Figure 6.
The NB-NB mixture is the only choice of distribution that can consistently model the higher counts.

ZERO INFLATION
Zero-inflated models have been proposed to improve model fit in ChIP-seq data (Rashid et al., 2011;Diaz et al., 2012). A zeroinflated model is one such that, with some probability ν, we set X = 0. Otherwise, we draw X from a candidate distribution as before.
Having accounted for some of the variation in our data with the NB-NB mixture model, we investigated whether or not zeroinflation can improve model fit further. To quantify this, we delete some proportion of zeros from the data, fit each model, then assess the fit with Total Variation as defined in Section 4.1. We also considered the case where we increase the number of zeros. Note that the log-normal Poisson fit was excluded due to computational complexity.
The results are given in Figure 7. For sample A (dog) we see evidence of zero inflation, perhaps due to a less-established genome assembly. In samples B-D, the NB-NB models showed no evidence of zero-inflation. However, when using distributions that cannot account for the heavy tail in the high bin-counts, there is an erroneous indication that a zero-inflation component is necessary.
The general problem of inferring the mixture distribution whilst accounting for zero-inflation is difficult, though similar approaches exist (Lim et al., 2014;Morgan et al., 2014). We envisage an altered version of CNM that accounts for zero-inflation by adding the parameter ν defined above, with its own update step. Additional constraints may be required to make this model identifiable.

DISCUSSION
We found that, of the distributions tested, a mixture of two NB distributions best modeled counts in input data, even after removing problematic centromeric and telomeric regions. This result, and the observation that the empirical distribution changes behavior at high counts, reflected two apparent sources of counts in the data-one large population of counts, and another smaller population of higher counts. The regions with higher counts have higher variance than lower counts, and could be mistaken for peaks if using a naïve peak-calling method.
Importantly, we saw that the heavy tail of large counts could cause inadequate models to suggest the need for a zero-inflation component. Our results suggest that researchers should check that large counts are not distorting model fit before assuming that a zero-inflation component is necessary. and after refitting each model, we reassess model fit with the Total Variation distance d TV , plotted as −log(d TV ) on the Y -axis against zero proportion on the X -axis. For severely zero-inflated data, deleting zeros should improve model fit, so we expect to see a "peak" somewhere to the right of the black vertical line.
The NB and Poisson models recommend the removal of most of the zero-valued data, a conclusion that seems biologically implausible. In contrast, the NB-NB mixture consistently recommends either no zero-removal, or a modest level of zero-inflation in the case of sample A. Note the dramatic change in the NB-NB mixture's fit in sample D as additional counts are added-this could be due to the accelerated EM algorithm encountering alternative local maxima.
Several aspects of experimental design could influence the abundance of large counts. Any biases in mapping that cause reads to align preferentially to the same locus could cause highcount artifacts. Thus, we might be able to reduce the abundance of large counts by improving mappability (that is, reducing the number of ambiguously-mapped reads). For example, better genome assembly, or use of a better aligner, or using longer reads could all accomplish this task.
Appropriate mixture modeling could apply when simulating ChIP-seq data-a simulation paradigm that fails to account for these different populations of counts cannot test peak-callers properly and overestimates their performance (Zhang et al., 2008b). A better understanding of the properties of the underlying mixture model allows us to simulate noise in ChIP-seq data sets more accurately.
Mixture modeling is also of importance when peak-calling in ChIP-seq data, allowing us to model large counts without removing them. Models that regress on genomic covariates, such as copy number or GC content, can help us explain some of the variation in the data (Rashid et al., 2011;Robinson et al., 2012). However, our unsupervised approach can model the noise, even in the absence of appropriate covariates to regress overfor example, if we have samples that are from less well-elucidated species or that have abnormal copy number events (such as after chromothrypsis). We may also be able to extend the approach to make inferences about the variation that remains after regressing out known covariates. Additionally, we note that Rashid et al. (2011) assume constant dispersion due to general linear modeling restrictions-in contrast, a mixture model permits multiple dispersions.
Alternatively, should we wish to adopt a blacklist strategy, we can construct a blacklist de novo by classifying bins into artifact and non-artifact regions (for example, by finding the probability that a bin belongs to each of the NB components in our mixture model). Again, this could be particularly useful for abnormal samples or species.
An alternative approach to smoothing the MLEf (λ) is to enforce continuity off (λ) during maximum likelihood estimation. For example, Liu et al. (2009) minimize a penalized log-likelihood: where α, the smoothing parameter, controls how smooth the output function is required to be. The methods we described have general applicability to other sequencing experiments based on genomic DNA, and not just ChIP-seq. As we increasingly see the emergence of large-scale epigenetic studies based on gDNA assays like ChIP-seq, it is important to choose models whose false discovery rates are robust. Properly accounting for the null variability in ChIP-seq data is vital to avoid false positives.

FUNDING
We acknowledge the support of the University of Cambridge, the Medical Research Council, Cancer Research UK, and Hutchison-Whampoa.