Selective Chemical Labeling and Sequencing of 5-Hydroxymethylcytosine in DNA at Single-Base Resolution

5-Hydroxymethylcytosine (5hmC), the oxidative product of 5-methylcytosine (5mC) catalyzed by ten-eleven translocation enzymes, plays an important role in many biological processes as an epigenetic mediator. Prior studies have shown that 5hmC can be selectively labeled with chemically modified glucose moieties and enriched using click chemistry with biotin affinity approaches. Besides, DNA deaminases of the AID/APOBEC family can discriminate modified 5hmC bases from cytosine (C) or 5mC. Herein, we developed a method based on embryonic stem cell (ESC) whole-genome analysis, which could enrich 5hmC-containing DNA by selective chemical labeling and locate 5hmC sites at single-base resolution with enzyme-based deamination. The combination experimental design is an extension of previous methods, and we hope that this cost-effective single-base resolution 5hmC sequencing method could be used to promote the mechanism and diagnosis research of 5hmC.

INTRODUCTION 5-Hydroxymethylcytosine (5hmC), which exists in various mammalian tissues and cell types, is an oxidative product of 5-methylcytosine (5mC) catalyzed by ten-eleven translocation (TET) enzymes (Kriaucionis and Heintz, 2009;Tahiliani et al., 2009). Many researches show that 5hmC not only is an intermediate of DNA demethylation but also plays an important role in many biological processes and human diseases as an epigenetic mediator Mellén et al., 2012;Colquitt et al., 2013;Taylor et al., 2016). The recent development of high-throughput sequencing technology has enabled whole-genome sequencing of 5hmC in mammalian systems. Generally, there are two strategies, which are the selective enrichment-based profiling and deamination-based single-base resolution sequencing methods Pastor et al., 2012;Tan et al., 2013;Thomson et al., 2013;Cui et al., 2014;Petterson et al., 2014;Sun et al., 2015;Han et al., 2016;Hu et al., 2019;Liu et al., 2019;Gibas et al., 2020;Wang et al., 2021). While the current application of these methods provides key information about the distribution and functional insights of 5hmC, there is major shortage for both strategies. The lack of single-base resolution information of the profiling strategy limited its application in detailed 5hmC location context, while the single-base resolution method is limited by its high sequencing cost. Therefore, a method that has both advantage of single-base resolution and enrichment would be highly valued in broad biological and clinical studies. A recent study developed a bisulfite-free strategy to detect 5hmC in single-base resolution using AID/APOBEC family DNA deaminase enzyme as deamination reagent, in which 5hmC is protected by a glucose motif catalyzed by T4 beta-glucosyltransferase (T4-βGT) Vaisvila et al., 2019;Sun et al., 2021a). This strategy has also been reported in a well-studied chemical selective profiling method back to year 2011 using a modified UDP-glucose .
Here, we introduce a cost-effective single-base resolution 5hmC sequencing method, which allows genome-wide chemical labeling and enrichment of 5hmC and a bisulfite-free single-base resolution detection of 5hmC. Our strategy (DIP-CAB-Seq) has three steps: 1) label and enrich the 5hmC containing fragments using the N 3 -UDP-glucose, click chemistry, and biotin-avidin interaction; 2) deaminate unprotected cytosines and 5mC using a deaminase; and 3) construct sequencing library in a single-stranded manner. This strategy is compatible with the labeling chemistry, enzyme-based deamination, and single-stranded DNA library construction. To demonstrate the superiority and effectiveness of this method, we have applied this approach to compare detected 5hmC signals among our new methods (DIP-CAB-Seq), 5hmC-Seal and ACE-Seq Liu et al., 2019). We found that we can detect reliable single-base 5hmC signals using DIP-CAB-Seq method with limited sequencing depth.

Library Construction
Nano-ACE-Seq Library Construction of 5-Hydroxymethylcytosine-Containing Genomic DNA (DIP-CAB-Seq) The mESC genomic DNA was sheared with KAPA Frag Kit (KK8600, Kapa Biosystems, Wilmington, MA, USA) to average 200 base pair and purified using the manufacturer's protocol (Ring et al., 2017). Fragmented mESC genomic DNA measuring 500 ng (average 200 bp) was treated with β-GT (New England Biolabs, Ipswich, MA, USA; Catalog #M0357S) in the presence of UDP-6-N 3 -Glu and labeled with cyclooctyne-biotin. Subsequently, 5hmCcontaining fragments were enriched using Streptavidin beads. The enriched fragments were mixed with control DNAs, CPG methylated pUC19, and unmethylated lambda and subjected to denaturation with NaOH and enzymatic deamination using APOBEC enzyme (New England Biolabs, Catalog #E7120S) to transform C/mC to U but not hmC. Then the converted DNA was tagged with Illumina compatible adapter and amplified to an appropriate library concentration using Accel-NGS Methyl-Seq DNA Library Kit (Swift BioSciences, Ann Arbor, MI, USA) combined with NEB Next Multiplex Oligos for Illumina. Libraries were checked for quality and quantified using Agarose Gel Electrophoresis and Qubit3.0 individually. 5-Hydroxymethylcytosine-Seal Library Construction of 5-Hydroxymethylcytosine-Containing Genomic DNA 5hmC-Seal Library was prepared as previously described . Briefly, the fragmented mESC genomic DNA (average 200 bp) was treated with UDP-6-N 3 -Glu in the presence of β-GT to form chemical modification, followed by labeling with DBCO-PEG 4 -Biotin via click reaction. The 5hmCcontaining DNA fragments were captured by C1 Streptavidin beads. The beads with enriched DNA fragments were resuspended in water and amplified with 12-17 cycles of PCR using an enzyme mixture in the Nextera kit. The PCR products were purified using AMPure XP beads. Sequencing was performed on the NextSeq instrument.

ACE-Seq Library Construction of 5-Hydroxymethylcytosine-Containing Genomic DNA
ACE-Seq Library was prepared as previously described . The mESC genomic DNA was sheared with KAPA Frag Kit (KK8600, Kapa Biosystems, Wilmington, MA, USA) to an average of 200 base pair and purified using the manufacturer's protocol (Ring et al., 2017). DNA fragment was prepared to a concentration of 15-20 ng/μl. The reaction was assembled in a total volume of 50 μl using the table below as per reaction. If multiple samples were processed, making a master mix of everything except the sample should be recommended.
The reaction was placed in a thermocycler and incubated at 37°C for 1 h. Subsequent reaction was cleaned separately by DNA Clean & Concentrator-5 Kit (Zymo Research, Irvine, CA, USA) and eluted in 25 μl of EB buffer. Glycosylated dsDNA was treated with a fresh 0.1 N NaOH solution and incubated in a thermocycler at 50°C for 10 min for denaturation. After that, the sample was transferred into an ice box and was left to stand at least 5 min to keep the DNA in its single-stranded form. Subsequently, the ssDNA was subjected to enzymatic deamination using APOBEC enzyme to transform C/mC to U but not hmC with NEB Next ® Enzymatic Methyl-seq Kit (New England Biolabs, Catalog #E7120S) Vaisvila et al., 2019;Sun et al., 2021a). Then the converted DNA was tagged with Illumina compatible adapter and amplified to an appropriate library concentration using Accel-NGS ® Methyl-Seq DNA Library Kit (Swift BioSciences) coupled with NEB Next ® Multiplex Oligos for Illumina ® (New England Biolabs). Libraries were checked for quality and quantified using Agarose Gel Electrophoresis and Qubit3.0 individually.
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 749211 2 Sequence Alignment and Peak Identification Sequencing reads were trimmed by trimmomatic with the trim tail option set true and mapped to the mouse genome (mm10) by bowtie2. The peak analysis was conducted by MACS2 call peaks, under the paired-end mode with bam file as input format. Afterward, the peaks were annotated by HOMER annotate Peak, and mm10 was used as the reference (Supplementary Table S1).

Single-Base Resolution Analysis
The numbers as well as site information of converted and unconverted cytosines in samples were recognized by the software Bismark, with the criterion that the C site cutoff is greater than zero; that is, the conversion was valid as long as any read indicates modification of the site. The process of running Bismark can be split into three steps. First, conduct bisulfite conversion on the genome of interest and build its index to allow bowtie2 alignments. Second, run Bismark alignment step, which means the actual bisulfite alignment and methylation calling part. Finally, extract the methylation information from the Bismark alignment output (Supplementary Table S2). Validation of genomic 5hmC at the CH sites was done on AbaSI-digested glucosylated ESC genomic DNA followed by real-time PCR (Sun et al., 2013;Sun et al., 2021b). Real-time PCR of digested DNA was done with iQ SYBR Green Supermix (Bio-Rad Laboratories, Hercules, CA, USA: 170-8882). All real-time PCR primers and the method are listed in Supplementary Table S3 for CH sites.

Definition of Enhancer Subgroups and Motif Analysis
Enhancers with evidence showing their interaction with distal genomic regions were considered as the most active enhancer subgroup (interacting enhancers). The genomic locations of these enhancers were obtained from published ChIA-PET dataset. The active and poised enhancers were defined by using histone modification markers. Both H3K4me1-and H3K27ac-enriched regions were obtained from previous publication. Liftover was used to convert the genomic location from mm8 to mm10. Regions that overlapped with interacting enhancers were discarded. Regions enriched with H3K27ac were considered as active enhancers. Regions only enriched with H3K4me1 but not H3K27ac were considered as poised enhancers. To detect TF motifs around 5hmC-modified cytosines, we performed de novo motif analysis with HOMER find Motifs Genome around the 5hmC sites in mESCs.

DIP-CAB-Seq Can Generate Deaminated DNA Library From the Pull-Down DNA Fragments
We present here a selective chemical labeling coupled with bisulfite-free sequencing (DIP-CAB-Seq) approach to generate the genome-wide, single-base resolution maps for 5hmC. In the current approach, we first selectively labeled and enriched 5hmCcontaining DNA fragments by using the glucosyltransferase and azido-UDP-glucose-based DNA profiling prior to the subsequent deamination ( Figure 1). We then performed the deamination on the enriched fragments by using deaminase-based bisulfite-free reaction. The normal cytosine and 5mC can be deaminated, while 5hmC remained as glucosylated 5hmC (N 3 -5gmC). Therefore, 5hmC will be read as C after PCR amplification, while normal C and 5mC will be read as T. After the single-stranded library construction and PCR amplification, the 5hmC containing fragments will be selectively amplified, while the 5hmC can be read in single-base resolution and generate the precise genomic locations of 5hmC.
The first challenge is whether we can generate high-quality next-generation sequencing library from the trace amount pull-down DNA. We validated that approach by generating 5hmC library in genomic DNA of mECS. We constructed ACE-Seq libraries and DIP-CAB-Seq libraries from two replicated mECS gDNA. The output data size was determined by library input. After getting the DNA library, we examined its concentration used Qubit. The first normalization was conducted through dilution between library's concentration and validated by quantitative PCR. The second normalization was conducted base on the quantitative-PCR data. A library mixture was prepared by pooling the same amount of diluted library. The library mixture was sequenced on Nestseq500 device by following the manufacturer's instruction. We observed 99.89% deamination of control DNA. The ACE-Seq library generated 35,067,558 reads with 68% mapping ratio, while the DIP-CAB-Seq library generated 32,769,029 reads with 61% mapping ratio, not far from the former result. The DIP-CAB-Seq methods have slightly lower mapping ratio than ACE-Seq method, which is fully acceptable considering the ultra-low input DNA amount in the library construction step in DIP-CAB-Seq method. Both DIP-CAB-Seq libraries and regular ACE-Seq libraries have similar total sequencing reads, which guaranteed us a fair comparison between the ability of the two methods in detecting the 5hmC sites in the genome. These data indicate that we can successfully construct the 5hmC singlebase resolution library from the enriched DNA fragments.

DIP-CAB-Seq Can Selectively Enrich 5-Hydroxymethylcytosine-Containing DNA Fragments
Using the DIP-CAB-Seq strategy, we performed both regular hmC-Seal profiling and single-base resolution mapping of 5hmC on mESC genomic DNA. The profiling analysis indicated that 5hmC from both libraries accumulate at the intergenic and intron regions in mESCs (Figure 2). This observation is consistent with previous findings . We barely observed any abnormal 5hmC signals, nor did we observe any noticeable changes of 5hmC at these genomic element regions between the two methods. Therefore, the DIP-CAB-Seq can totally inherit Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 749211 the profiling information that the traditional hmC-Seal method carried.

DIP-CAB-Seq Can Detect Single-Base Information With Limited Sequencing Depth
We next analyzed 5hmC enriched regions using the base-resolution data in ACE-Seq and DIP-CAB-Seq so as to evaluate the repeatability of both methods. We observed 5,936,200 5hmC sites in CpG context in the union set of two ACE-Seq samples, while there are only 291,527 5hmC sites, that is, only 4.9% of the union set in the intersection set of both samples. On the other hand, we found 7,821,453 5hmC sites in the CpG context in the union set of the DIP-CAB-Seq samples, and there are 1,273,953 5hmC sites in the intersection set of both samples, accounting for 16.3% of the union set. These data indicate that the DIP-CAB-Seq method has higher detection ability and consistency than ACE-Seq with the same sequencing depth. Considering that the total sequencing depth is much lower than that of a regular expensive whole-genome 5hmC sequencing, the DIP-CAB-Seq method shows much better application potential in large-scale sample studies. We also detect 5hmC signals in CHH and CHG, which may have 5hmC sites according to the previous study. As expected, DIP-CAB-Seq methods showed higher consistency than ACE-Seq. We observed many CHH and CHG sites from the ACE-Seq library, which may come from the false-positive signal of the ACE-Seq due to the nonperfect deamination. The amount of 5hmC sites in CHH and CHG is shown in the figure below (Table 1). Therefore, the enrichment of the 5hmC fragments can also decrease the false-positive signals even if the deamination reagents are not perfect in a bisulfite-free system.  Based on the above findings, we observed much better consistency from DIP-CAB-Seq method than ACE-Seq, indicating the DIP-CAB-Seq method could have better understanding of the biology with limited sequencing depth ( Figure 4).

DIP-CAB-Seq Can Detect Detailed Preferential Occurrences of 5-Hydroxymethylcytosine
With single-base resolution information available, we continued to investigate the detailed preferential occurrences of 5hmC sites in the genome to understand the potential of this method in biology study.
The previous base-resolution mapping of 5hmC allowed for the determination of the surrounding base composition of 5hmC sites in mESCs (Ring et al., 2017). We aligned our 5hmC sites in CG context and examined the base compositions ( Figure 5). Compared with reported results, our data possess a similar local sequence context, having increased guanine abundance with a depletion of thymine.
We then performed motif analysis with HOMER32 at ±100bp region around the 5hmC sites. We studied the well-known 5hmC containing motif Klf4, Oct4, Hif1a, Esrrb, and Sox2; our motif analysis successfully identified these motifs around the 5hmC sites in mESCs ( Figure 6). Our data indicated that our method can obtain these detailed 5hmC signals with very low sequencing depth as compared with the reported method.

DISCUSSION
The current dilemma between cost and sequencing resolution has hampered many biological and clinical studies of 5hmC. Generally, 5hmC has 10 times lower genome percentage than 5mC, and 5hmC is distributed broadly across the genome, which requires very deep sequencing for detecting low abundant 5hmC sites. Although the profiling methods provide a cheap manner to map the 5hmC peak, it is much more difficult to quantify the 5hmC signals than the single- Note. 5hmC, 5-hydroxymethylcytosine.
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 749211 5 base resolution method. The DIP-CAB-Seq method we developed can solve this problem; although we cannot quantify the 5hmC abundance for a specific 5hmC site, it provides much better resolution than the regular profiling method. We showed that the base-resolution data from the very limited sequencing depth can provide detailed 5hmC loci information. It is well known that 5hmC profiling has the potential to be a clinical tool in different clinical questions. The current profiling can only calculate the 5hmC score based on 5hmC read density, which needs much adjustment between the samples and cohorts. To further study the role of 5hmC in clinical application, our method can provide better quantification information and could be more reliable than the profiling method.

CONCLUSION
Herein, we developed a method based on ESC whole-genome analysis, which could enrich 5hmC-containing DNA by selective chemical labeling and locate 5hmC sites at single-base resolution with enzyme-based deamination. The combination experimental design is an extension of previous methods, and we hope that this cost-effective single-base resolution 5hmC sequencing method could be used to promote the mechanism and diagnostic research of 5hmC.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. These data can be found here: National Center for Biotechnology Information (NCBI) BioProject database under accession number PRJNA732757.