Investigating Different DNA Methylation Patterns at the Resolution of Methylation Haplotypes

Different DNA methylation patterns presented on different tissues or cell types are considered as one of the main reasons accounting for the tissue-specific gene expressions. In recent years, many methods have been proposed to identify differentially methylated regions (DMRs) based on the mixture of methylation signals from homologous chromosomes. To investigate the possible influence of homologous chromosomes on methylation analysis, this paper proposed a method (MHap) to construct methylation haplotypes for homologous chromosomes in CpG dense regions. Through comparing the methylation consistency between homologous chromosomes in different cell types, it can be found that majority of paired methylation haplotypes derived from homologous chromosomes are consistent, while a lower methylation consistency was observed in the breast cancer sample. It also can be observed that the hypomethylation consistency of differentiated cells is higher than that of the corresponding undifferentiated stem cells. Furthermore, based on the methylation haplotypes constructed on homologous chromosomes, a method (MHap_DMR) is developed to identify DMRs between differentiated cells and the corresponding undifferentiated stem cells, or between the breast cancer sample and the normal breast sample. Through comparing the methylation haplotype modes of DMRs in two cell types, the DNA methylation changing directions of homologous chromosomes in cell differentiation and cancerization can be revealed. The code is available at: https://github.com/xqpeng/MHap_DMR.


INTRODUCTION
In recent years, the revealing of the mechanisms behind the diseases has been performed from different angles, such as mutated genes, altered DNA methylation (Eden et al., 2003;Baylin, 2005), non-coding RNAs (Yan et al., 2017(Yan et al., , 2018Chen et al., 2019;Lan et al., 2020), microbes (Yan et al., 2019(Yan et al., , 2021, etc. Differentially methylated regions (DMRs) are the main explanation accounting for the diversity of gene expression in different cell types in a body. Differentiation-associated differential methylation profiles were observed on cell types under different stages of development and differentiation (Laurent et al., 2010). Recent studies show that altered DNA methylation has a very close relationship with diseases. In cancer genomes, the promoter regions of tumor suppressor genes are altered to be hypermethylated (Baylin, 2005), while the promoter regions of tumor genes are altered to be hypomethylated (Eden et al., 2003). Identifying DMRs can promote revealing the mechanisms in tissuespecific/diseases-specific gene expression (Scott et al., 2020) and tissue-specific DMRs can be used as feature markers in identifying the tissues-of-origin of cfDNAs in noninvasive diagnosis (Hu et al., 2019;Xiaoqing et al., 2020).
Infinium HumanMethylation450 BeadChip and Infinium MethylationEPIC BeadChip provide a convenient way to measure the methylation levels of CpG sites in CpG islands and gene regions. In BreadChips, the methylation level of a certain CpG site is estimated by using the ratio of intensities between methylated and unmethylated alleles. In recent years, due to the development of sequencing technology, bisulfite sequencing (BS-Seq) makes to reveal the methylation status of each cytosine site on a read become possible. The numbers of methylated cytosines and unmethylated ones of each cytosine site can be measured, respectively. Recently, by using deep-learning, DNA methylation status of each cytosine site can be deduced from Nanopore sequencing reads (Ni et al., 2019). In both BeadChip and BS-Seq, molecules derived from two homologous chromosomes are not discriminated and are always processed together.
Based on the methylation profiles extracted from BeadChips or BS-Seq data, many methods have been proposed to identify DMRs in different tissues or cell types. These methods can be roughly divided into two categories: differentially methylated cytosine site (DMC)-based methods and candidate region-based methods. In DMC based methods, methylation levels of CpG sites can be calculated based the raw methylation information of CpG sites (Catoni et al., 2018;Condon et al., 2018;Xu et al., 2020), estimated by beta-binomial distribution considering the biological variances and sample variances (Feng et al., 2014;Park et al., 2014;Lea et al., 2015;Wu et al., 2015;Park and Wu, 2016;Wen et al., 2016) or estimated by considering the spatial correlation (Hansen et al., 2012;Hebestreit et al., 2013;Wu et al., 2015;Sun and Yu, 2016). Then, DMCs are identified and DMRs are formed by merging the neighboring DMCs satisfying some defined criteria, such as DMCs with p-values less than a certain threshold, the distance between the DMCs less than a cutoff value, and the minimum number of DMCs required in a region.
In candidate region-based methods, there are two types of candidate regions, including sample-dependent candidate regions and sample-independent ones. The sample-independent candidate regions are predefined on the genome with a fixedsize or sliding window (Stockwell et al., 2014;Wang et al., 2015;Catoni et al., 2018). The sample-dependent candidate regions are generated according to the coverage, the depth of CpG sites, the methylation levels of CpG sites in samples, and the methylation changes of CpG sites among multi-samples. Then DMRs are identified by comparing the methylation of regions among different samples (Su et al., 2012;Lokk et al., 2014;Liu et al., 2015;Jühling et al., 2016;Gomez et al., 2019).
As we known, the allele-specific methylation is a special phenomenon of DNA methylation, which is that the methylation of an allele on two homologous chromosomes is not consistent. With the development of high-throughput sequencing technology, the region capture based sequencing and the genome-wide sequencing have been widely used for detecting allele-specific methylation sites. Some strategies and algorithms also contribute to improve the identification of allele-specific methylation (Cheung et al., 2017;Abante et al., 2020). However, the research on identifying allele-specific methylation is limited to the alleles, and the influence of homologous chromosomes on methylation analysis should be investigated genome wide.
In the methods of identifying DMRs, the reads from homologous chromosomes are processed together, and the methylation levels of CpG sites are calculated based on the mixture of reads from homologous chromosomes. The influence of homologous chromosomes on methylation analysis was not considered and investigated. To investigate the possible influence of homologous chromosomes on methylation analysis, we construct methylation haplotypes for homologous chromosomes on the sample-independent candidate regions. Then the methylation consistency of paired methylation haplotypes from homologous chromosomes in different cell types is compared. Further, DMRs are identified at the resolution of methylation haplotypes. The proposed method in this paper not only can be applied to methylation analysis, but also can provide a clear explanation for the methylation difference at the resolution of methylation haplotypes.

MATERIALS AND METHODS
In this paper, two methods, MHap and MHap_DMR, are proposed to construct methylation haplotypes and identify DMRs based on methylation haplotypes, respectively. MHap is a method for constructing methylation haplotypes, which consists of four steps. Firstly, it generates sample-independent candidate regions based on genomic information, such as CpG islands and CpG dense regions. Then, for the BS-seq data of each sample, it classifies CpG sites into homozygous and heterozygous ones, and then constructs methylation haplotypes for each candidate region. Finally, the paired methylation haplotypes of homologous chromosomes are represented by methylation haplotype modes (MHMs). MHap_DMR is the method designed to identify DMRs based on methylation haplotypes. The framework of MHap and MHap_DMR is shown in Figure 1 and the detail of each step in the proposed methods will be described in the following subsections.

Materials
To investigate the influence of homologous chromosomes on methylation analysis, 12 WGBS datasets of 10 different tissues/cell types are involved in this study, including two lower leg skin samples and two tibial nerver samples downloaded from the ENCODE project (The ENCODE Project Consortium, 2012) (access sample id: ENCSR930WUY, ENCSR128RMY, ENCSR752OCM, and ENCSR658MZU), breast cancer sample and normal breast sample in the GEO database under the accession number GSE29069 (Hon et al., 2012), adipose-derived stem (ADS) cells and mature fat cells (adipocytes derived from the ADS cells) in the NCBI SRA database under the accession number SRA023829.2 (Lister et al., 2011), embryonic stem cells (hESCs) and foreskin fibroblasts (hESC-Fibro cells) in the GEO database under the accession number GSE19418 (Laurent et al.,

2010), mature B cells and hematopoietic stem cells in the GEO
database under the accession number GSE31971 (Hodges et al., 2011). The WGBS datasets were aligned to the human reference genome (hg38) and the methylation statuses of cytosines on reads were called by using Bismark (Krueger and Andrews, 2011).

MHap: Methylation Haplotype Construction
Due to the limited read lengths and the uneven distribution of CpG sites, it is challenging to construct two complete methylation haplotypes for two homologous chromosomes. Thus, sample-independent candidate regions are predefined on CpG dense regions, and methylation haplotypes are constructed for homologous chromosomes in these regions. MHap is proposed to construct methylation haplotypes for homologous chromosomes based on the overlapping methylation statuses of heterozygous methylated CpG sites on reads. The details of MHap is described as following.

Generate Sample-Independent Candidate Regions
MHap generates sample-independent candidate regions based on the CpG island information and the distance between neighboring CpG sites. In order not to hide local methylation signals, CpG islands are usually divided into a number of candidate regions, each of which contains at least 7 CpG sites. For other regions with densely located CpG sites, a distance-based clustering algorithm is applied to generating candidate regions, which contains at least 7 CpG sites also and the distances between neighboring CpG sites are not >20 bp. As shown in Table 1, for each chromosome, the number of candidate regions and the corresponding averages of CpG numbers and region lengths are listed. Then, MHap will construct methylation haplotypes for homologous chromosomes on these candidate regions. of the candidate regions.

Classify CpG Sites Into Homozygous and Heterozygous Ones
The flow char of classifying CpG sites into homozygous and heterozygous ones is illustrated as in Figure 2. For each sample, the reads falling in candidate regions are collected. In these candidate regions, firstly, CpG sites with depth less than a threshold Th dp are filtered out. Then the remaining CpG sites are classified into homozygous sites and candidate heterozygous sites(CHSs) based on the types of methylation statuses and the corresponding depths. If a CpG site has only one methylation status with depth not less than Th dp , it is considered as a homozygous site. If it has two methylation statuses and the depth of each status is not less than half of Th dp , it is considered as a CHS. Due to the sequencing errors and the bisulfite conversion rates, the identified CHSs inevitably contain false-positives. The joint methylation statuses of neighboring CHSs on the same reads can help to distinguish true-positives from false-positives. Thus, the joint methylation statuses of two neighboring CHSs on the covering reads are extracted and can be represented as 00/11/01/10 patterns. In MHap, the frequency of each pattern on two neighboring CHSs is calculated, and patterns with frequency <2 are filtered. Then, one or two true-positive patterns are identified according to the ratios of the corresponding frequencies to the total frequency of all patterns or to the maximum frequency. If there is a pattern with the maximum frequency among other patterns and the ratio of its frequency to the total frequency of all patterns is above a threshold (recommended as 0.6), it is considered as the only one truepositive pattern on the two neighboring CHSs. Otherwise, if there are two patterns with higher frequencies than other patterns and the ratio of the second maximum frequency to the first maximum frequency is not less than a threshold (recommended as 0.4), it is considered that there are two true-positive patterns on the two neighboring CHSs. Then two neighboring CHSs are reclassified into homozygous or heterozygous ones based on the true-positive patterns.
Pairs of neighboring CHSs are processed sequentially. Assume there are three successive CHSs (u, v, w). During the processing of two successive pairs (u, v) and (v, w), the unbalance join depths may result in a conflict on the classification of the overlapped CHS v. To handle with this conflict, a confidence score is calculated for each pair of neighboring CHSs, computed as the ratio of the total frequency of true-positive patterns on two sites to the maximum depth among three CpG sites, as defined in Equation (1). If conf (u, v)> = conf (v, w), the class of v will be not changed, and the class of w will be determined based on the joint methylation statuses of (v, w) with the given class of v. If conf (u, v) < conf (v, w), the class of v will be revised based on the true-positive patterns of (v, w).
where TP denotes the set of true-positive patterns of (u, v), f (p) denotes the frequency of pattern p, and d(u), d(v), and d(w) denote the depths of u, v, and w, respectively.

Construct Methylation Haplotypes for Each Candidate Region
After classifying CpG sites into homozygous and heterozygous ones, the skeletons of two methylation haplotypes are constructed by linking the patterns of neighboring heterozygous sites sequentially. Then, a pair of methylation haplotypes are constructed by padding the homozygous CpG sites into the skeletons.

Definition of Methylation Haplotype Mode
Each methylation haplotype can be represented by a 0-1 string.
To simplify the comparison between methylation haplotypes, each methylation haplotype is converted into a label based on its 0-1 string, defined in Equation (2). Then, two labels of the paired methylation haplotypes on a candidate region, denoted as LL, HL, LN, LM, NN, MM, MN, HN, HM or HH, are termed as a methylation haplotype mode (MHM).
where MH(s) = len(s) , s represents the 0-1 string of a methylation haplotype, len(s) represents the length of s, and s i is the i-th character in s.

Map_DMR: DMR Identification Based on Methylation Haplotypes
Based on the MHMs of each candidate region among different samples, MHap_DMR identifies DMRs by comparing the MHMs directly. If the MHMs are identical, the candidate region is considered as a non-DMR. Otherwise, a methylation haplotype difference (MHD) between a pair of samples or groups is calculated, defined as in Equation (3). Then, the methylation difference among multi groups on the region can be defined as the maximum MHD among pairs of groups.
where g i and g j denote group i and j, g i1 and g j1 denote the 0-1 strings of methylation haplotypes with higher MH values in g i and g j , respectively, and g i2 and g j2 denote the 0-1 strings of methylation haplotypes with lower MH values in g i and g j , respectively. To investigate the influence of homologous chromosomes on methylation analysis, we applied MHap to construct methylation haplotypes for 12 WGBS datasets of 10 different tissues/cell types. MHap constructs methylation haplotypes for each sample based on the alignment file and candidate regions. Methylation haplotypes covering more than 3 CpG sites are defined as valid methylation haplotypes (VMHs). Table 2 lists the number of candidate regions with VMHs contained by each sample. It can be observed that the average number of CpG sites in these candidate regions is >10, and the average number of covered CpG sites in VMHs is ranging from 5.9 to 8.9.

Majority of Methylation Haplotypes Are Consistent Between Homologous Chromosomes
MHMs HH and LL denote that the paired methylation haplotypes of two homologous chromosomes are simultaneously hypermethylated (HH) or hypomethylated (LL). Both the HH and LL are considered as consistent MHMs. Then, the methylation consistency between two homologous chromosomes in a sample can be defined as the ratio of the number of CpGs in VMHs with consistent MHMs to that in all VMHs.
The methylation consistency of homologous autosomes in different tissues/cell types is compared, as shown in Figure 3. For normal tissues or cell types, the methylation consistency is above 90% on average, especially in hematopoietic stem cells. A lower methylation consistency can be observed in the breast cancer sample, which is about 86% on all the homologous chromosomes.
The methylation consistency of chromosome X indicates the gender of a sample. In Figure 4, it can be observed that three samples with methylation consistency above 94% are derived from male, while samples with methylation consistency ranging from 54 to 72% are derived from female which is much lower than that of other homologous autosomes. It coincides with the previous studies that the methylation between two homologous chromosome X in female are different, one of which is inactive and highly methylated (Mohandas et al., 1981;Goto and Monk, 1998).
Further, we compared the hypomethylation consistency in different samples. The hypomethylation consistency between two homologous chromosomes in a sample can be defined as the ratio of the number of CpGs in VMHs with consistent MHM LL to that in all VMHs. From Figure 5, we can observe that the hypomethylation consistency of derived cells is higher than that of the corresponding undifferentiated stem cells, which is consistent with the former studies that methylation decrease with the degree of differentiation increased (Laurent et al., 2010). In Figure 5, we can find that the mature fat cells are more hypomethylated than adipose-derived stem cells, mature B cells are more hypomethylated than hematopoietic stem cells, and foreskin fibroblasts are more hypomethylated than embryonic stem cells. It is also noted that the hypomethylation consistency of breast cancer sample is much lower than that of normal breast sample on homologous chromosome.
In addition, it is interesting to observe that the tissues/cell types can be roughly clustered into three groups according to the hypomethylation consistency, as shown in Figure 5. Lower leg skin and tibial nerve have similar hypomethylation consistency and they belong to the ectoderm. The hypomethylation consistency of mature fat cells, adipose-derived stem cells, mature B cells, hematopoietic stem cells and the normal breast sample are similar, and these tissues/cell types belong to the mesoderm. The hESCs and hESC-Fibro cell types have high hypomethylation consistency in chromosomes, which are higher than that of other tissues/cell types.

Identifying DMRs Between Two Samples
MHap_DMR was applied to identify DMRs in four pairs of samples, including breast cancer vs. normal breast, mature fat    Table 3.
To investigate the methylation changing directions at the methylation haplotype level, the number of some subtypes of DMRs in Type 1 and Type 2 DMRs is specified. For example, in Type 1 DMRs, the number of DMRs with hypomethylation consistent mode LL vs. hypermethylation consistent mode HH and the number of DMRs with hypomethylation unconsistent mode HL vs. hypermethylation consistent mode HH are listed.
In Type 1 DMRs, it can be observed that there is only 1 DMR with hypomethylation consistent mode LL vs. hypermethylation mode consistent HH in mature fat cells and adipose-derived stem cells. It may indicate that the methylation statuses of two homologous chromosomes are seldom changed simultaneously during the differentiation from adipose-derived stem cells to mature fat cells.
In Type 1 DMRs between breast cancer and normal breast, it can be observed that there are 13,173 DMRs with hypermethylation consistent mode HH in breast cancer and hypomethylation consistent mode LL in normal breast, while there are only 2,002 DMRs with hypomethylation consistent mode LL in breast cancer and hypermethylation consistent mode HH in normal breast. It suggests that many regions with hypomethylation consistent mode LL in normal breast become hypermethylated in breast cancer, while a small quantity of regions with hypermethylation consistent mode HH in normal breast become hypomethylated in breast cancer. Further, comparing the number of four types of DMRs between breast cancer and normal breast, it may indicate that, in breast cancer, the methylation statuses of homologous chromosomes changes in the same direction (hypomethylated or hypermethylated) simultaneously in many cases. The MHMs of DMRs among different samples can indicate the methylation changing directions of homologous chromosomes in cell differentiation and cancerization.

Compared With Comparative Methods
To further demonstrate the performance of MHap_DMR, four comparative tools were also applied to these four pairs of samples, including CpG_MPs (Su et al., 2012), DMRCaller (Catoni et al., 2018), SMART , and Metilene (Jühling et al., 2016). The default parameter settings were adopted when running these methods.
The numbers of DMRs identified by different methods are compared, as shown in Table 4. Metilene always predicts a larger number of DMRs with low methylation level differences than other methods. MHap_DMR predicts a smaller number of DMRs than other methods, because it works on candidate regions predefined on the CpG dense regions. All the methods report a largest number of DMRs between breast cancer sample and normal breast sample, and a second largest number of DMRs between embryonic stem cells and foreskin fibroblasts. This consistency indicates that DNA methylation is altered a lot in cancerization, and the methylation difference between embryonic stem cells and foreskin fibroblasts is larger than that between other types of stem cells and the cells derived from these stem cells.

CONCLUSION
In this paper, MHap is developed to construct methylation haplotypes for homologous chromosomes in CpG dense regions. Through the analysis based on methylation haplotypes of homologous chromosomes, we found that majority of methylation haplotypes are consistent between homologous autosomes, while a lower methylation consistency was observed in the breast cancer sample. Further, the hypomethylation consistency of derived cells is higher than that of the corresponding undifferentiated stem cells. The hypomethylation consistency can be used as a feature for cell clustering. DMRs identified by MHap_DMR based on methylation haplotypes can help to investigate the methylation changing directions of homologous chromosomes in cell differentiation and cancerization.