Methylated DNA is over-represented in whole-genome bisulfite sequencing data

The development of whole-genome bisulfite sequencing (WGBS) has resulted in a number of exciting discoveries about the role of DNA methylation leading to a plethora of novel testable hypotheses. Methods for constructing sodium bisulfite-converted and amplified libraries have recently advanced to the point that the bottleneck for experiments that use WGBS has shifted to data analysis and interpretation. Here we present empirical evidence for an over-representation of reads from methylated DNA in WGBS. This enrichment for methylated DNA is exacerbated by higher cycles of PCR and is influenced by the type of uracil-insensitive DNA polymerase used for amplifying the sequencing library. Future efforts to computationally correct for this enrichment bias will be essential to increasing the accuracy of determining methylation levels for individual cytosines. It is especially critical for studies that seek to accurately quantify DNA methylation levels in populations that may segregate for allelic DNA methylation states.


INTRODUCTION
The ability to produce genome-wide data sets has radically changed during the last decade. With the advent of highthroughput sequencing, billions of short read data can be generated within days. Currently, one of the primary uses of these short reads is to explore genetic variation within the human population with the goal of linking variants to diseases. Identifying these causal associations is the first major step to understanding how diseases arise, how they manifest and how they can be treated. Additionally, these short reads are used to study a wide range of model systems and an ever-growing list of non-model systems. Although much of the energy in the scientific community focuses on genetic variants there are a growing number of scientists that are fascinated by the epigenome. The NIH Roadmap Epigenomics Mapping Consortium defines the epigenome as a map of features distributed throughout and on top of the genome (Bernstein et al., 2010). These features can include RNAs such as mRNAs, long non-coding RNAs and small RNAs, DNA-protein interactions, chromatin accessibility, nucleosome positioning, covalent modifications to histone tails and DNA methylation, to name a few. Understandably, refinement of epigenomic techniques lagged behind development of genomic DNA library preparation due to their demand for use and due to the difficulty in execution. Part of the reason for this sudden increase and interest in epigenomics is due to the standardization of protocols and analysis pipelines used to produce and interpret data. No longer is data generation the rate-limiting step. Instead, the ability to accurately analyze and interpret these data sets has become the major bottleneck in the daily workflow.
One of the most exciting technical advances in field of epigenomics was the development of methods to analyze DNA methylation at single-base resolution (Cokus et al., 2008;Lister et al., 2008). Generally referred to as whole-genome bisulfite sequencing (WGBS), these methods combine sodium bisulfite conversion (Hayatsu et al., 1970;Shapiro et al., 1973Shapiro et al., , 1974Hayatsu, 1976;Frommer et al., 1992;Clark et al., 1994Clark et al., , 2006, the gold standard for determining DNA methylation states at individual cytosines, with high-throughput DNA sequencing. Using this technique, unmethylated cytosines are converted into uracils and then into thymines after PCR. The observation of cytosines in sequencing reads typically indicates that the cytosine was methylated, as methylcytosines are protected from conversion by treatment with sodium bisulfite. These methods are capable of testing approximately 90% of all cytosines in genomes studied to date. They also generate quantitative measurements of DNA methylation at each cytosine due to the multiple independent reads that align to a particular sequence. These methods were originally tested on Arabidopsis thaliana genomic DNA (Cokus et al., 2008;Lister et al., 2008) as this species has a relatively small genome size (∼150 Mb) compared to human genomes (∼3 Gb per haploid genome), which made testing and troubleshooting cost efficient. Since the first single base resolution DNA methylomes for A. thaliana surfaced, larger DNA methylomes have been sequenced such as humans (Lister et al., 2009), mice (Stadler et al., 2011), corn (Gent et al., 2013), and soybean (Schmitz et al., 2013a), but in general it is cost prohibitive to sequence large numbers of individual samples using this technique. As a result, reduced representation bisulfite sequencing (RRBS) was developed in which ∼1% of the genome is studied making it possible to investigate large numbers of individuals at the cost of surveying the entire genome (Meissner et al., 2005). Most recently, additional modifications to cytosines have been identified such as 5-hydroxymethylation, 5-formylcytosine, and 5-carboxylcytosine and methodologies have been developed to study the presence of some of these base modifications genome-wide (Booth et al., 2012(Booth et al., , 2014Yu et al., 2012).
The initial genome-wide maps of DNA methylation were invaluable for defining previously unexpected epigenomic signatures in a variety of systems, such as partially methylated domains and non-CG methylation in animals (Lister et al., 2009) and CHH islands in maize (Gent et al., 2013). Since these initial maps, WGBS has been used to study epigenome reprogramming of induced pluripotent stem cells (Lister et al., 2011), genomic imprinting in both plants (Gehring et al., 2009;Hsieh et al., 2009;Zemach et al., 2010;Calarco et al., 2012) and animals (Xie et al., 2012;Kobayashi et al., 2013;Shirane et al., 2013), germ line epigenome reprogramming in plants (Slotkin et al., 2009;Calarco et al., 2012;Ibarra et al., 2012;Creasey et al., 2014) and animals (Seisenberger et al., 2012;Jiang et al., 2013) and also to study the impact of genetic variation on DNA methylation (Schmitz et al., 2013a,b;Orozco et al., 2014). Clearly, mapping and analyses of DNA methylomes is a relatively young field with great potential for exciting discoveries in the next decade.
It is well established that there is a GC content bias in amplification of fragments of DNA used for high-throughput sequencing applications (Aird et al., 2011;Benjamini and Speed, 2012). As a result, methods were developed to avoid PCR amplification all together in the construction of DNA sequencing libraries (Kozarewa et al., 2009). However, this approach is not applicable to bisulfite-treated DNA as the uracils in the fragments would inhibit cluster formation on the instrument unless reagents could be customized to include a uracil-insensitive DNA polymerase. As a result, bisulfite-treated DNA is amplified for varying cycles of PCR to replace uracils in the DNA sequence and to amplify libraries as treatment with sodium bisulfite degrades DNA. The degradation of DNA by sodium bisulfite also creates limitations with regards to the types of samples that can be sequenced, as higher input amounts are often required for most protocols. Moreover, after bisulfite conversion, previously complementary strands are single stranded and will no longer anneal together. In fact, previous observations of bisulfite-conversion and PCR have uncovered strand biases that severely affects estimates of DNA methylation levels (Warnecke et al., 1997). Correction methods have been developed for targeted bisulfite PCR experiments, but these methods are not feasible for WGBS (Komori et al., 2011;Moskalev et al., 2011). Given that methylated DNA will retain higher GC content after bisulfite conversion and PCR compared to unmethylated DNA, over-representation of methylated DNA may occur in the construction of sequencing libraries. Although libraries can be constructed from lower input DNA samples such as those prepared from FFPE (formalin-fixed paraffin-embedded), sequence-capture or single cell approaches, there usually is a cost of increasing the total PCR cycles. This cost likely comes in the form of increasing any bias in read enrichment between methylated and unmethylated sequences at the same locus.
In this study, we explore the presence of an amplification bias created in fragments that retain higher GC content after sodium-bisulfite treatment with unmethylated sequences and with genomic DNA sequences. A strong positive correlation between DNA methylation levels and normalized read counts are observed in the WGBS data. Investigation of three different commercially available uracil-insensitive enzymes reveal differences in amplification bias of highly methylated and high GC content DNA sequences.

PLANT AND FUNGAL MATERIAL
The A. thaliana Col-0 accession was used for all. Plants were grown in 16 h day lengths and tissue was harvested from above ground rosette leaves 3 weeks after planting. DNA was isolated using the Qiagen (Valencia, CA, cat-69106, USA) DNeasy Plant Kit according to the manufacturer's protocol. All Neurospora experiments were performed with the wild-type "Oak Ridge" strain (FGSC# 2489) obtained from the Fungal Genetics Stock Center (McCluskey et al., 2010). ∼1 × 10 6 conidia/ml were germinated in 50ml of Vogel's Minimal Medium with 1.5% sucrose and grown for 5 h.

PREPARATION AND SEQUENCING OF Neurospora DNA
Chromatin immunoprecipitation (ChIP) and DNA isolation were performed as described previously (Sasaki et al., 2014). For Illumina sequencing, libraries were prepared from 10 ng of ChIP input DNA using an Illumina TruSeq kit (Illumina cat-FC-121-2002). Libraries were prepared according to manufacturer instructions except for the following modifications. DNA adaptors were diluted 1:100 prior to ligation. PCR was performed for 4, 8, or 15 cycles, as indicated in the text. In some experiments, a PCR reaction mixture containing Kapa HiFi polymerase (Kapa Biosystems, cat-KK2502), 1X Kapa HiFi buffer, 200 nM dNTPs, 1X TruSeq primer cocktail, and 60 mM tetramethylammonium chloride (TMAC; Sigma-Aldrich, catalog # T3411) was used in place of the TruSeq PCR master mix [as described in (Oyola et al., 2012)]. Illumina sequencing was performed using an Illumina Next-Seq500 instrument at the University of Georgia genomics facility.

METHYLC-SEQ DATA ANALYSIS
FASTQ files were trimmed for adapters, preprocessed to remove low quality reads and aligned to the TAIR10 reference genome as previously described in (Schmitz et al., 2013b). Inefficiencies in the Frontiers in Genetics | Bioinformatics and Computational Biology sodium bisulfite conversion reaction are calculated by measuring the fraction of methylated basecalls detected in the chloroplast genome (which is unmethylated). This non-conversion rate is used as the null hypothesis for a binomial test to determine if a cytosine is methylated and the resulting p-values are corrected for multiple testing using Benjamini-Hochberg with an FDR cut off of 5%.

CALCULATION OF METHYLATION LEVELS
Weighted methylation levels were performed as described in (Schultz et al., 2012).

CALCULATION OF CORRELATION COEFFICIENTS
Pearson correlation was calculated. In Figure 2B, the outliers were identified using Bonferroni test (Cook and Weisberg, 1982) and then Pearson correlation was calculated after the outliers were removed.

CALCULATION OF NORMALIZED READ COVERAGE
For Figures 1D-F normalized read coverage was calculated by the total number of reads in the DMR (Differentially Methylated Region) divided by the total number of reads sequenced from the library. This value was then multiplied by 10 5 . For Figure 1G, normalized read coverage was calculated as follows: genome sequencing reads from previously published A. thaliana Col-0 data (Becker et al., 2011) were used to determine a baseline level of mappability by the DNA that had not undergone sodium bisulfite treatment. In this case, all cytosines were converted to thymines in the FASTQ files and then the genome  Shaw et al. (2000). Pink shaded regions indicate DMRs in (A-C) and the location of data used to produce (D-F). In (A-C), gold lines = methylated CGs, blue lines = methylated CHGs (where H = A, C, or T) and pink lines = methylated CHHs. In (D-F), black dots = highly methylated alleles, red dots = lowly methylated alleles. Data used to produce Figure 1G were obtained from line 1 replicate 2 from (Schmitz et al., 2011). www.frontiersin.org WGBS data used to produce this figure were obtained from line 1 replicate 2 from (Schmitz et al., 2011). Genomic DNA sequencing data used to produce this figure were obtained from sample 119 in (Becker et al., 2011). sequencing data set was used in the same analysis pipeline as the one used to map and determine methylation levels for sodium bisulfite-treated libraries. A. thaliana mutation accumulation line 1 replicate 2 was selected as bisulfite sequencing sample (Schmitz et al., 2011). In each 10 kb window, read numbers were counted and then divided by the number of mapped reads in the first tile to normalize them to the same scale. Normalized read coverage was then computed by subtracting the normalized number of reads present in the genomic DNA sample from the normalized number of reads in the bisulfite sequencing sample. 5% outliers were identified, as described in the Calculation of correlation coefficients section, and removed before computing linear regression (Cook and Weisberg, 1982). For Figure 2B, the window size was set to 100 kb, the enrichment of mapped reads for each tile was computed using the same method as described for 1 G. For Figures 3-5, the methylation levels from Figure 1G were used to set a fixed horizontal axis. Therefore, data points were only able to move in a vertical direction depending on normalized read coverage. This method was used because higher read coverage resulted in higher methylation levels, which made it challenging to accurately reveal and compare the trends in each figure. For Figure 6, to calculate normalized read coverage for Neurospora, the genome was divided into 1 kb windows using bedtools, and a gtf annotation file was constructed using custom scripts. Cuffdiff software (Trapnell et al., 2013) and the custom annotation file were used to calculate the enrichment ratios at each 1 kb window. %GC was determined using a custom script.

DATA ACCESS
The data generated for this study have been deposited in the Gene Expression Omnibus 1 and are accessible through accession 1 http://www.ncbi.nlm.nih.gov/geo/ number GSE58217. For WGBS data in Figures 1 and 2 data were downloaded from the National Center for Biotechnology Information (NCBI) -Sequence Read Archive (SRA) 2 and for genomic DNA sequence used in Figure 2 data were downloaded from (http://www.ncbi.nlm.nih.gov/sra/?term==ERX386705).

METHYLATED DNA IS OVER-REPRESENTED IN BISULFITE SEQUENCING DATA
Our previous studies that applied MethylC-seq (a WBGS method) on a population of A. thaliana MA lines (Shaw et al., 2000) revealed the existence of spontaneous epialleles (Schmitz et al., 2011; Figures 1A-C). These differentially methylated regions (DMRs) of the population have identical underlying DNA sequences as revealed by whole genome sequencing (Ossowski et al., 2010), but vary significantly in DNA methylation levels (Figures 1A-C). Further examination of the underlying data revealed that reads from highly methylated DMRs were over-represented compared to unmethylated regions after bisulfite conversion and DNA sequencing (Figures 1D-F). In total, eight different MA lines were sequenced in duplicate for a total of 16 samples. In every case, there was a positive correlation between the methylation level of the DMR and the normalized number of aligned sequenced reads (Figures 1D-F). Similarly, a chromosome-wide test revealed a strong positive correlation between DNA methylation levels and normalized (aligned) read counts from WGBS data ( Figure 1G; R 2 = 0.55).

GC CONTENT INFLUENCES ENRICHMENT OF HIGHLY METHYLATED DNA
Although the sequences within DMRs are identical, GC contents were significantly different after sodium bisulfite conversion. Therefore, the possibility exists that the underlying GC content likely influences the enrichment bias of methylated DNA. To further explore this question, the GC content throughout chromosome five was determined prior to and post sodium bisulfite conversion (Figure 2A). Post bisulfite treatment, the GC content is significantly reduced by approximately 50%, which is expected as most of the cytosines are unmethylated and are converted to uracils and ultimately thymines after PCR. The lowest percentage reduction in GC content is observed in the pericentromeric region of the chromosome (Figure 2A) where the highest density of methylation has been observed ( Figure 2B; Cokus et al., 2008;Lister et al., 2008). To empirically test if sodium bisulfite-treated and amplified DNA is preferentially biased toward highly methylated reads, both WBGS and genomic DNA sequencing data from wild-type Col-0 were aligned using the same parameters. The read enrichment was calculated and plotted along chromosome five, which revealed an enrichment of WGBS reads compared to genomic DNA sequencing reads that paralleled DNA methylation levels along the chromosome (Figure 2B). Collectively, these data suggest that highly methylated sequences, which retain high GC content after bisulfite conversion and PCR amplification, are preferentially enriched in WGBS data compared to unmethylated sequences.

NUMBER OF PCR CYCLES INFLUENCES ENRICHMENT OF HIGHLY METHYLATED DNA
The introduction of altered GC content occurs during bisulfite treatment and subsequent PCR amplification. Bisulfite-converted DNA is rich with uracils, which impairs the ability of most DNA polymerases to efficiently amplify these fragments. The B-type DNA polymerase from Pyrococcus furiosus is able to polymerize uracil-rich DNA because it binds to deaminated bases (Horvath and Vertessy, 2010). This enzyme was engineered to polymerize uracil-rich DNA by mutating a single amino acid that blocks recognition of the deaminated bases (Horvath and Vertessy, 2010). These types of engineered uracil-tolerant enzymes are critical for single-base resolution studies of DNA methylation and as a result, a number of commercially available enzymes have been engineered to amplify uracil-containing templates. We performed a test on the Pfu Turbo Cx enzyme with biological replicates (Supplementary Table 1) using three different cycles (4, 8, and 15 cycles) of PCR ( Figure 3A) to assess the degree to which a uracil-insensitive enzyme is prone to an amplification bias of methylated DNA. A comparison between 4, 8, and 15 PCR cycles revealed that 15 cycles results in an exacerbation in the preferential enrichment of methylated DNA (Figures 3B-G).

COMPARATIVE ANALYSIS OF COMMERCIALLY AVAILABLE URACIL-INSENSITIVE ENZYMES
Based on the results from Figure 3, an experiment was designed to test two additional commonly used commercially available enzymes in biological replicates (Supplementary Table 1) for their ability to amplify bisulfite-converted DNA (Kapa HiFi Uracil + and EpiMark). Upon constructing the MethylC-seq libraries the concentration of the samples were determined using a Qubit. This revealed that the total library yield for the 4cycle/8-cycle PCR was similar between the three enzymes tested, www.frontiersin.org but for the 15-cycle PCR, the KAPA HiFi Uracil + enzyme yielded significantly more total library (Table 1). Regardless, the total amount of library yielded in all experiments is more than adequate to sequence to sufficient depth. Next, the methylation levels were plotted against normalized read counts and as observed with the Pfu Turbo Cx enzyme in Figures 1 and 2, the other two tested enzymes also preferentially enrich methylated DNA, although to a lesser extent (Figures 4 and 5). However, both of these enzymes are sensitive to highly methylated sequences as the normalized read numbers started to decrease with higher methylation levels compared to intermediate methylation levels (Figures 4 and  5). Interestingly, between the 4, 8, and 15 cycle experiments for the Kapa HiFi Uracil + enzyme, extremely high GC content regions did not result in an increase in amplification bias of DNA like it does with both the Pfu Turbo Cx and EpiMark enzymes (Figures 3-5), which would indicate this enzyme is likely preferred for most WGBS applications. Interestingly, the EpiMark enzyme revealed that at 15 PCR cycles a strong bias was associated with increasing GC content, but not with increasing levels of DNA methylation compared to the 4-and 8-cycle reactions (Figures 5A-F). Therefore, both PCR cycles and enzymes selected for amplification can influence enrichment of highly methylated DNA in sequencing data.

AT-RICH SEQUENCES ARE DEPLETED FROM DNA SEQUENCING LIBRARIES
ChIP-seq experiments carried out with the filamentous fungus Neurospora crassa also revealed a bias against AT-rich DNA sequences, demonstrating reduced coverage at AT-rich DNA is not restricted to uracil-tolerant polymerases. Neurospora heterochromatin domains have an average GC content of 30% compared to an average GC content of 51.5% for genes (Figure 6). ChIP input DNA was used to prepare Illumina sequencing libraries using two different polymerases and PCR was performed for 4, 8, and 15 cycles. Libraries prepared using a TruSeq PCR mastermix revealed equal representation of GC-rich and AT-rich regions after 4 and 8 cycles of PCR, but a clear amplification bias was observed after 15 cycles of PCR (Figures 6A,C). It was reported previously that the Kapa HiFi polymerase, a distinct enzyme from the Kapa HiFi Uracil + polymerase used throughout the rest of the manuscript, efficiently amplified AT-rich DNA when supplemented with 60 mM TMAC (Oyola et al., 2012). However, our experiments with Neurospora DNA revealed a significant underrepresentation of AT-rich DNA after only 8 cycles of PCR using the KAPA HiFi DNA polymerase + TMAC (Figures 6B,D). Therefore, the DNA polymerase used for PCR amplification and the number of PCR cycles selected can affect amplification of genomic DNA that contains large disparities in GC content.

CONCLUSION
Whole-genome bisulfite sequencing is an excellent method for determining base resolution DNA methylomes and although in most genomes studied to date, it has been able to determine methylation levels for >90% of cytosines, it does have some limitations. In this study, we show that there exists a strong amplification bias of methylated DNA compared to unmethylated sequences and that this bias can be affected by PCR cycle number and the enzyme used for PCR amplification. Interestingly, this change in amplification bias is not observed with the Kapa HiFi Uracil + enzyme in comparison to the others tested. This observation suggests that the high PCR cycle number that www.frontiersin.org is necessary to construct a sequenceable library for low input samples such as RRBS (Gu et al., 2011), single cells, sequencecapture or FFPE samples can be achieved without introducing major artifacts from the number of PCR cycles selected. Therefore, although an amplification bias will occur at these highly methylated sequences, in these types of libraries, it will not grow in strength to the point that it depletes lower methylated sequences from produced sequencing data. This effect is not true for all enzymes used in amplification of bisulfite converted DNA as the Pfu Turbo Cx enzyme resulted in a significant over-enrichment of reads associated with methylated DNA with increasing cycle number. Furthermore, even tests on genomic DNA from Neurospora revealed significant variation between different commercially available enzymes and selected PCR cycle number. To improve the quality and to reduce to bias associated with WGBS data it is recommended to start higher concentrations of genomic DNA when possible, limit the number of PCR cycles and optimize the polymerase used for amplification of the library. In general, it is difficult to precisely quantify the effect of this observed bias in WGBS data on percent methylation calculations. As we show here, bias depends on polymerase choice and PCR cycle number. We expect that other factors can also affect bias, not only the methylation level, which itself can be affected by biological and technical variation, but also factors like raw sequence read length, sequencing errors, efficiency of sodium bisulfite conversion, sequence structure itself, etc. In order to develop computational tools that correct for bias, additional experiments are needed to define how these factors contribute to the observed bias in WGBS data. We note that the bias introduced using current protocols is likely to influence DMR analysis. In particular, loci that are heterozygous for methylation state could be underrepresented by DMR finding algorithms.
The sodium bisulfite conversion and PCR amplification steps are impossible to remove from WGBS library preparation and these are the steps where a bias is introduced. In fact, efforts to correct for amplification bias in targeted bisulfite sequencing of genomic regions have been successfully performed by using a calibration method that requires a parallel analysis of input samples with known methylation levels (Moskalev et al., 2011). Additionally, RainDance Technology has also proved useful for correction of amplification bias in targeted bisulfite PCR (Komori et al., 2011), but it is unknown if this correction will be observed on WGBS libraries. Regardless, both of these methods are not feasible alternatives for routine correction of the observed amplification bias in WGBS, as they are technically cumbersome compared to current methods and they are cost prohibitive. This bias will be most pronounced in the study of allele-specific methylation. The results of this study make it clear that PCR enzyme and the PCR amplification cycle used can influence determined methylation levels, which undoubtedly could affect results of previously published studies that were not aware of these biases. Fortunately, most experiments that use this approach can be corrected for this bias by using SNP data in the bisulfite-sequencing reads. Caution should be taken when calculating methylation levels at loci with heterozygous methylation states, especially when experiments study regions that do not have SNPs or do not take into consideration SNPs linked to bisulfite-sequencing reads. Caution should also be taken when using platforms that do not detect SNPs. For example, many microarray platforms have been designed to detect DNA methylation states and although PCR is not generally used to amplify bisulfite-converted DNA, the use of Whole Genome Amplification (WGA) methods will still introduce a bias based on the GC content of DNA sequences (Bredel et al., 2005;Arriola et al., 2007). Because WGBS is primarily used on species with published reference genomes, it may be possible to predict this amplification bias based on GC content and the range of potential methylation levels at these regions. Therefore, future efforts to computationally model and predict expected read coverage for regions will be advantageous for studies where SNP information is not available or present at specific regions of interest.