Characterization of HPV DNA methylation of contiguous CpG sites by bisulfite treatment and massively parallel sequencing—the FRAGMENT approach

Invasive cervix cancer (ICC) is the third most common malignant tumor in women and human papillomavirus 16 (HPV16) causes more than 50% of ICC. DNA methylation is a covalent modification predominantly occurring at CpG dinucleotides and increased methylation across the HPV16 genome is strongly associated with ICC development. Next generation (Next Gen) sequencing has been proposed as a novel approach to determine DNA methylation. However, utilization of this method to survey CpG methylation in the HPV16 genome is not well described. Moreover, it provides additional information on methylation “haplotypes.” In the current study, we chose 12 random samples, amplified multiple segments in the HPV16 bisulfite treated genome with specific barcodes, inspected the methylation ratio at 31 CpG sites for all samples using Illumina sequencing, and compared the results with quantitative pyrosequencing. Most of the CpG sites were highly consistent between the two approaches (overall correlation, r = 0.92), thus verifying that Next Gen sequencing is an accurate and convenient method to survey HPV16 methylation and thus can be used in clinical samples for risk assessment. Moreover, the CpG methylation patterns (methylation haplotypes) in single molecules identified an excess of complete-and non-methylated molecules and a substantial amount of partial-methylated ones, thus indicating a complex dynamic for the mechanisms of HPV16 CpG methylation. In summary, the advantages of Next Gen sequencing compared to pyrosequencing for HPV genome methylation analyses include higher throughput, increased resolution, and improved efficiency of time and resources.

Invasive cervix cancer (ICC) is the third most common malignant tumor in women and human papillomavirus 16 (HPV16) causes more than 50% of ICC. DNA methylation is a covalent modification predominantly occurring at CpG dinucleotides and increased methylation across the HPV16 genome is strongly associated with ICC development. Next generation (Next Gen) sequencing has been proposed as a novel approach to determine DNA methylation. However, utilization of this method to survey CpG methylation in the HPV16 genome is not well described. Moreover, it provides additional information on methylation "haplotypes." In the current study, we chose 12 random samples, amplified multiple segments in the HPV16 bisulfite treated genome with specific barcodes, inspected the methylation ratio at 31 CpG sites for all samples using Illumina sequencing, and compared the results with quantitative pyrosequencing. Most of the CpG sites were highly consistent between the two approaches (overall correlation, r = 0.92), thus verifying that Next Gen sequencing is an accurate and convenient method to survey HPV16 methylation and thus can be used in clinical samples for risk assessment. Moreover, the CpG methylation patterns (methylation haplotypes) in single molecules identified an excess of complete-and non-methylated molecules and a substantial amount of partial-methylated ones, thus indicating a complex dynamic for the mechanisms of HPV16 CpG methylation. In summary, the advantages of Next Gen sequencing compared to pyrosequencing for HPV genome methylation analyses include higher throughput, increased resolution, and improved efficiency of time and resources.
So far, the most widely used method for HPV DNA methylation investigation is bisulfite treatment followed by sequencing (Bhattacharjee and Sengupta, 2006), MassArray (Ehrich et al., 2005), SNPshot (Kaminsky and Petronis, 2009), or in particular pyrosequencing (Tost and Gut, 2007a,b;Dejeux et al., 2009) (for review, see Clarke et al., 2012). Despite the accuracy of CpG quantitation by pyrosequencing, it can only provide a relatively short read for each assay per sample and thus is time and labor intensive for testing multiple sites in large numbers of samples, which limits the incorporation of DNA methylation in clinical studies. Moreover, all these approaches constrain the ability to detect the methylation pattern at single-DNA-molecule resolution, which is critical for investigating methyltransferase dynamics. Although the cloning-sequencing approach after bisulfite treatment can provide some insight into this issue, the typically low number of clones analyzed (<10) and the high costs limits this approach.
Next generation sequencing can yield millions of single molecule reads and has been used to determine DNA methylation (Taylor et al., 2007;Bibikova and Fan, 2010;Laird, 2010;Feng et al., 2011;Kim et al., 2011;Komori et al., 2011;Ku et al., 2011;Nejman et al., 2014). Supplemented with DNA barcoding technology, which incorporates a unique index sequence into each PCR segment, this approach can provide a rapid way to simultaneously determine DNA methylation at the single-molecule level in large numbers of samples. However, the accuracy and validity of this approach needs further evaluation, especially in viral genomes such as HPV16.
In the present study, we randomly chose 12 samples with quantitation of CpG methylation within the HPV16 genome by pyrosequencing and performed amplification with primers containing barcodes specific for each sample. After all samples were pooled and purified, the PCR products were deep sequenced, analyzed, and the results were compared with CpG methylation determined by pyrosequencing. The methylation ratio for most CpG sites was highly correlated with those from pyrosequencing, which indicated that Next Gen sequencing of bisulfite treated cervical cells infected with HPV16 was an accurate method of quantitating CpG methylation. Moreover, the single molecule analyses provides a "methylation haplotype" and indicated an excess of full and non-methylated molecules in nearly all samples and a lower proportion of partially methylated molecules in most samples, thus revealing a complex and mosaic methylation pattern in the HPV16 genome.

CERVICOVAGINAL SAMPLES
Twelve exfoliated cervical samples were randomly chosen from a previously reported nested case-control study of HPV16-positive cervical intraepithelial neoplasia grade 3 (CIN3) and HPV16positive cervical samples that cleared infection . The lab was blinded to all clinical information. All samples contained HPV16 and the quantitation of CpG methylation of the HPV16 genome had been determined by pyrosequencing, as described . The study was designed to test samples for quantitation of CpG methylation to evaluate Next Gen sequencing compared to pyrosequencing prior to embarking on using this method on precious well-characterized samples from epidemiological studies.

ASSAY DESIGN
Since L1, L2, and E2 ORF regions showed differential methylation among disease groups , these three regions and the most significant CpG sites within them were chosen for the current study. Primers for PCR were designed using MethPrimer (Li and Dahiya, 2002) (http://www.urogene. org/methprimer/index1.html). In total, 3 segments in L1, 4 in L2, and 1 in E2 were included in the current study (Table 1A), and in total, 31 CpG sites were surveyed. Each primer was labeled by a unique barcode and 5 and 3 padding sequence (Table 1B) and synthesized by Integrated DNA Technologies (IDT, Coralville, IA). A map of all 112 CpG sites in the HPV16 7906 bp reference genome is shown in Figure 1 in the review by Clarke et al. (2012). 169 ( ), These sites were present in the NGS data but not PSQ. Thus, in the CpG sequence a "C" represents a methylated CpG, whereas a "T" represents an unmethyaled CpG. All segments were amplified by HotStart-IT FideliTaq DNA polymerase (United States Biochemicals, Cleveland, OH). After validating size and intensity in a 3% agarose gel, each PCR product was pooled in equal proportions, separated by electrophoresis, and isolated from the gel. After precipitation by isopropanol and washed by 70% ethanol, the PCR products were ligated with adaptor FIGURE 1 | The summary correlation of 27 CpG methylation sites between pyrosequencing and Next Gen sequencing. The x-axis is percent methylation by pyrosequencing and the y-axis is percent methylation by Next Gen sequencing. Each CpG site is plotted for each subject.
(library construction) and sequenced on an Illumina HiSeq 2000 (NG sequencing) within the Albert Einstein College of Medicine, Epigenetics Core Facility (Bronx, NY).

METHYLATION RATIO ANALYSIS
The obtained sequences were filtered by prinseq (Schmieder and Edwards, 2011) with average Phred score (Cock et al., 2010) not less than 20. The barcodes for each sample were split and cut by FastX kit (http://hannonlab.cshl.edu/fastxtoolkit/index.html).
After alignment with the reference HPV16 genome by bowtie (Langmead et al., 2009), the methylation status for each molecule was determined by Bismark (Krueger and Andrews, 2011). For each CpG site, the methylation ratio is calculated by the formula: number of C reads divided by the sum of C and T reads at each CpG site. The pyrosequencing result for each site was determined on a PSQ96 ID Pyrosequencer (Qiagen, Valencia, CA) at the Albert Einstein College of Medicine, Genomics Core Facility (Bronx, NY) and the readout was percent methylation, as previously described . The correlation between NG sequencing and pyrosequencing was performed by linear regression in SPSS 16.0 (SPSS Inc., Chicago, IL) and the null hypothesis was rejected when P < 0.05. We have deposited the read sequences in the NCBI Sequence Read Archive (SRA) database, accession number SRP040981.

SINGLE MOLECULE ANALYSIS
The methylation pattern for each single molecule and the counts of each pattern were obtained by an in-house script (available on request). Briefly, the Bismark output, which gave the methylation state for each CpG site in each molecule, was parsed, and the methylation pattern of each DNA molecule was then reconstructed based on the unique read name it was assigned by the sequencer. Finally, the prevalence of each unique methylation pattern was counted, for each sample in each assay. The expected probabilities were constructed in two steps. First the singular probabilities of each site being methylated and unmethylated were calculated, by counting the proportions of molecules in each state at each site. Multiplying the appropriate singular probabilities, under the assumption that CpG sites would be methylated independently, produced an expected probability for each methylation pattern. A χ 2 goodness of fit test was then performed to compare observed and expected probabilities for methylation patterns, where the observed probabilities were the proportions of each detected pattern, calculated from their counts.

SEQUENCING DATA STATISTICS
In total, 192.2 million reads 95 bp long were obtained from NG sequencing and 53.4 million (27.8%) possessed an average Phred score above 20 and were used for this analysis (Ewing and Green, 1998). 41.7 million reads (78.1%) were observed to contain one of the incorporated barcodes without mismatch and assigned to a corresponding sample for further analysis. Except one segment in the L2 gene, most segments included ∼4-8 million reads ( Figure  S1A). For each sample, the read count varied from 2 to 6 million ( Figure S1B). Most CpG sites (21/27) were covered by 0.6 to 1.8 million reads ( Figure S1C), in total. Although three fragments did not amplify as robustly and had less reads (i.e., L1_2, L2_2, and L2_5 assays containing CpG sites 7034, 7091, 7136, 7145; 4427, 4437, 4441; and 5378, respectively), the correlation between CpG sites within these fragments between the PSQ and NGS assays had reasonable agreement (see Figure 2).

METHODOLOGICAL COMPARISON
Among these 31 CpG sites, 27 had been analyzed by pyrosequencng and the two results were compared. Using linear regression, most sites showed significant correlation between the two methods with an overall correlation of 0.92 (Figure 1). Only two CpG sites (positions 6650 and 7034) were poorly correlated (P = 0.069 and 0.19, respectively, Figure 2). These results indicated that next generation sequencing was an appropriate method for determining CpG methylation in the HPV16 genome, yielding results that were highly correlated with a well-established pyrosequencing technique.

SINGLE-MOLECULE RESOLUTION OF CPG METHYLATION
The methylation pattern for each single molecule was determined for each sample across six regions of the HPV16 genome. A substantial proportion (10-80%) of molecules in the six assays were partially methylated (possessing a mixture of methylated and unmethylated sites), and the distributions of patterns were varied. However, the site-wise proportions of methylation for a given sample in a given assay were similar and the distribution of patterns in each molecule did exhibit dependence on that methylation level. To address this issue, we calculated the expected frequencies of all possible methylation patterns in each assay and compared them with the observed patterns (Figure 3). In most samples, a relative excess of none-and/or fully methylated molecules was observed (Figure 3). In contrast, despite their high prevalences, there was a relative absence of partial methylated molecules (Figure 3). As a consequence, most of the samples yielded a significant P-value (p < 0.05), thus indicating that CpG sites are not likely to be methylated/demethylated in an independent fashion, but that a more complex process determines the methylation state within a region of the HPV16 genome.

DISCUSSION
In the present study, we used Illumina Next Gen single molecule sequencing to survey the methylation of the HPV16 genome in 12 samples and compared the result with pyrosequencing, which provides an average percent methylation for each CpG site. A consequence of using Next Gen sequencing technology was the ability to investigate the methylation pattern at the single-molecule level. Our results demonstrate that the Next Gen bisulfite sequencing protocol was comparable to the well-established pyrosequencingbased method for assaying HPV16 genome methylation (overall correlation = 0.92). The ability to analyze single molecules allowed us to test whether the process of CpG methylation at specific sites was independent. Thus with known percent CpG methylation at each site we compared the observed with the predicted methylation haplotypes. There was a significant difference indicating that CpG methylation of multiple CpG sites on a given fragment of the viral genome is not an independent process. In addition, utilizing DNA barcoding, multiple samples can be pooled together and run in a single sequencing reaction and the result for each single sample can be distinguished without ambiguity. Thus, the high-throughput nature of this technique facilitates large-scale clinical and epidemiological studies.
Several CpG sites presented a relatively low read count compared with others. A careful inspection indicated that these were located in the middle of the amplicon and would thus appear at the end of the read in both strand. These end regions usually have a low base call quality, due to constraints of the sequencing technology and are, therefore, often filtered out prior to analysis. The Bismark-based analysis method utilized in this study doesn't exploit the gain in base call quality that can be obtained by comparing overlapping regions of paired-end reads, therefore the latter problem could be surmounted by improvements in the bioinformatics pipeline. Alternatively, we could shorten the amplicon, or generate longer reads, to facilitate equal read counts for all sites in an assay. However, there are limitations on shortening or lengthening the amplicons, due to the composition of sequences surrounding CpG sites that are used for the primers. Nevertheless, it is anticipated that with advances in Next Gen sequencing technologies longer fragment reads and improved quality will facilitate the use of the methods described in this report.
Despite the high consistency between next generation sequencing and pyrosequencing results, two CpG sites, 6650 and 7034, failed to yield a significant correlation. Both CpG sites had a relatively high read count (>1.14 and 0.15 million, respectively), thus indicating that the low correlation was not due to stochastic effects. A detailed audit identified one sample showing remarkable discrepancy between the two approaches in both CpG sites (see red circle in Figure 2). When this sample was removed from analysis, significant correlations were obtained for both CpG sites (r 2 = 0.85, P = 0.00006 for 6650 and r 2 = 0.76, P = 0.001 for 7034), which further verified the strong consistency between the two approaches. However, the reason for the discrepancy in this sample remains unclear. Possibilities include a nucleotide variation at this position, which would skew the results, and/or bisulfite-and PCR-induced artifacts. In addition, potential PCR biases could also result in lower than expected read numbers.
Examining methylation patterns on individual molecules of DNA is essential to understand the methyltransferase dynamics and the mechanism(s) by which methylation interacts with oncogenic HPV viral natural history and progression to cervical cancer. Based on cloning approaches, most previous studies suggested that methylation in promoter regions was more likely to be one (fully methylated) or zero (fully unmethylated) in human genomic DNA (Oates et al., 2006) and HPV genomes (Kalantari et al., 2004(Kalantari et al., , 2009(Kalantari et al., , 2010Turan et al., 2006Turan et al., , 2007Ding et al., 2009;Fernandez et al., 2009). However, due to the low number of single molecules analyzed (i.e., clones sequenced), the real composition might be substantially different. Next Gen sequencing, which can survey millions of molecules at the same time, can provide more insight into this issue. In our results, despite the relative excess of fully methylated and fully unmethylated molecules, a substantial proportion of molecules displayed a partial methylation pattern, which verified previous observations (Taylor et al., 2007) and hinted at the complex regulation of the methylation process. Whether there are dynamic changes in CpG methylation patterns remains to be determined.

ACKNOWLEDGMENTS
This work was supported in part by the National Cancer Institute (CA78527) (Robert D. Burk), the Einstein-Montefiore Center for AIDS funded by the NIH (AI-51519) and the Einstein Cancer Research Center (P30CA013330) from the National Cancer Institute.