PCGIMA: developing the web server for human position-defined CpG islands methylation analysis

Introduction: CpG island (CGI) methylation is one of the key epigenomic mechanisms for gene expression regulation and chromosomal integrity. However, classical CGI prediction methods are neither easy to locate those short and position-sensitive CGIs (CpG islets), nor investigate genetic and expression pattern for CGIs under different CpG position- and interval- sensitive parameters in a genome-wide perspective. Therefore, it is urgent for us to develop such a bioinformatic algorithm that not only can locate CpG islets, but also provide CGI methylation site annotation and functional analysis to investigate the regulatory mechanisms for CGI methylation. Methods: This study develops Human position-defined CGI prediction method to locate CpG islets using high performance computing, and then builds up a novel human genome annotation and analysis method to investigate the connections among CGI, gene expression and methylation. Finally, we integrate these functions into PCGIMA to provide relevant online computing and visualization service. Results: The main results include: (1) Human position-defined CGI prediction method is more efficient to predict position-defined CGIs with multiple consecutive (d) values and locate more potential short CGIs than previous CGI prediction methods. (2) Our annotation and analysis method not only can investigate the connections between position-defined CGI methylation and gene expression specificity from a genome-wide perspective, but also can analysis the potential association of position-defined CGIs with gene functions. (3) PCGIMA (http://www.combio-lezhang.online/pcgima/home.html) provides an easy-to-use analysis and visualization platform for human CGI prediction and methylation. Discussion: This study not only develops Human position-defined CGI prediction method to locate short and position-sensitive CGIs (CpG islets) using high performance computing to construct MR-CpGCluster algorithm, but also a novel human genome annotation and analysis method to investigate the connections among CGI, gene expression and methylation. Finally, we integrate them into PCGIMA for online computing and visualization.


Introduction
CpG island (CGI) methylation is one of the key epigenomic mechanisms for gene expression regulation and chromosomal integrity (Dor and Cedar, 2018).Especially, recent studies have reported that position-sensitive CGI co-methylation mechanism is essential for such functions that are related to histone modification (Ming, et al., 2021).However, it is neither easy for current commonly used classical CGI island prediction methods (Gardinergarden and Frommer, 1987;Han et al., 2008;Takahashi et al., 2017) to locate those short and position-sensitive CGIs which called CpG islets (Hackenberg et al., 2006) due to the length limitation, nor investigate relationship among CGI density, methylation, and gene expression specificity.Therefore, it is urgent for us to develop such a bioinformatic algorithm that not only can locate short and position-sensitive CGIs (CpG islets), but also provide CGI methylation site annotation and functional analysis to investigate the regulatory mechanisms for CGI methylation (http://www.combio-lezhang.online/pcgima/home.html).
For CGI perdition method, we usually employ the unsupervised clustering methods such as CpGCluster (Hackenberg et al., 2006) and CPG_MI (Su et al., 2009) to locate CGIs with shorter length than the supervised (Bock et al., 2007;Ning et al., 2017), since these unsupervised algorithms do not need consider the constraints of CGI length and content ratio (Hackenberg et al., 2010).However, these methods are not only time-consuming for the big dataset, but also cannot investigate the genetic characteristics of CGIs under different CpG interval parameters.Therefore, our first scientific question is how to develop a novel CGI prediction method with CpG interval parameters selective feature and high-performance computing, and investigate the differences in genetic characteristics such as CpG coverage, CGI length, and GC content of CGIs under various CpG interval parameters.
Several previous studies have interrogated the connections between methylation and CGI (Reik, 2007;Smith et al., 2012;Liu et al., 2016;El-Maarri, 2019;Acton et al., 2021).For example, Ziller et al. (2013) have turned out that not only the hypermethylation of promoter CGI is related to gene expression, but also CGI methylation in the gene body is positively correlated with gene expression.However, these studies usually interrogate the methylation characteristics of CGI from partial sequence regions rather than genome-wide perspective.Meanwhile, although our previous studies (Zhang et al., 2018;Zhang et al., 2021a) have analyzed the relationship between CGI density and gene expression after annotating genome-wide CGIrelated genes (CGI+) into high-CGI (HCGI), intermediate-CGI (ICGI), and low-CGI (LCGI) genes based on the classification of CGI density (Weber et al., 2007;Zhu et al., 2008), we are still unclear the relationship between CpG methylation and gene expression.Thus, our second scientific question is how to build up a human genome-wide CGI-based methylation and gene expression annotation and analysis method to investigate the relationship among CGI density, methylation, and gene expression specificity.
Meanwhile, although several CpG methylation online service are already available (Raney et al., 2010;Di et al., 2018;Xiong et al., 2019), most of them only focus on CpG island prediction and data downloading, but not provide visualization and analysis for the distribution of CGI in different sequence regions and the connections between methylation status of CGIs and gene expression.Therefore, our third scientific question is how to establish an easy-to-use web service for fast CGIs prediction and visualization of the connections between CGIs and methylation.
For these reasons, we propose three major innovations to answer the above scientific questions.
Firstly, we develop an unsupervised clustering-based CGI prediction method (Human position-defined CGI prediction), which not only employs high performance computing technology to accelerate its predictive speed, but also offers a parameter selective option that can help us to locate short CGIs (position-defined CGIs) with unique location-or sequence-sensitive features and explore the differences in the genetic characteristics of CGIs under various CpG interval parameters.
Secondly, we build up a novel human genome annotation and analysis function (Human position-defined CGI annotation and analysis), which not only can study the methylation characteristics of CGIs from a genome-wide perspective by computing the methylation level of all CpG sites in the human genome, but also improve the previous CpG-Island-based human gene expression annotation and analysis method (Zhang et al., 2021a) by integrating genome-wide methylation annotation to further investigate the connections among CGI density, gene expression and methylation.
Thirdly, we establish an easy-to-use web service "Positiondefined CGI methylation analysis (PCGIMA)" with relevant CGI prediction, annotation, and data analysis functions, which provides us an online platform for further study on the regulation mechanism of CGI and methylation.
In conclusion, we develop a bioinformatic algorithm and web service to investigate the regulatory mechanism of CGI methylation.The main results include: 1) Human positiondefined CGI prediction method is more efficient to predict position-defined CGIs with multiple consecutive (d) values and locate more potential short CGIs than previous CGI prediction methods; 2) Our annotation and analysis method not only can investigate the connections between positiondefined CGI methylation and gene expression specificity from a genome-wide perspective, but also can analyze the potential association of position-defined CGIs with gene functions; (3) PCGIMA provides an easy-to-use analysis and visualization platform for human CGI prediction and methylation.
This study downloads human genome data from GRCh38 assembly (Schneider et al., 2017) at NCBI (Pruitt et al., 2005).To classify CGIs into density-defined and position-defined groups, we download human CGIs data and annotations from UCSC (Casper et al., 2018).Next, we use human genome annotated data (release 24) in GenBank GBFF format (Clark et al., 2016) from GENCODE (Wright et al., 2016) to define different sequence regions.Finally, to study the methylation level of CpG sites in different sequence regions, we obtain all CpG methylation data of 29 human tissues (Supplementary Table S1), including heart, spleen, lung and esophagus, from ENCODE databases (Harrow et al., 2006).In order to ensure data consistency, the above-listed annotation and methylation data are all annotated according to GRCh38 (Schneider et al., 2017).Figure 1 describes the workflow of the study with three essential steps: Human position-defined CGI prediction (left side of Figure 1), Data annotation (right side of Figure 1), and Human position-defined CGI methylation analysis (Bottom side of Figure 1).
Here, we describe the key equations as follows: (1) CGI prediction: We employed Eq. 1 to define CpGs clusters (Hackenberg et al., 2006) at the start.Next, we consider these CpG clusters with small p-values (Eq.2) as CGIs (Hackenberg et al., 2006).
Here, x and I represent the position and index of a CpG, respectively.

P d
( ) P(d) represents the probability to find a distance d between neighboring CpGs.p corresponds to the probability of CpGs in the sequence.Since our previous studies (Zhang et al., 2018;Zhang et al., 2021a) has led to a conclusion that LAUPs (Lineage-associated underrepresented permutations) are closely related to CGIs and the shortest LAUPs of mammals range from 10bp to 14bp in length, here we use the intermediate value of d = 12bp.
Workflow of the study.
Frontiers in Genetics frontiersin.org03 Here, N is the length of the CGI, CpGNum, CNum and GNum represent the number of CpG, number of C, number of G respectively.

Human position-defined CGI prediction method
Indicated by previous study (Hackenberg et al., 2006), we consider CGIs as potentially functionally short islands (CpG islets) if length of CGIs is less than 200bp.Here, Figure 2A demonstrates that Human position-defined CGI prediction method not only can locate the shortest average length (23.7bp) under CpG interval d = 12bp, but also the percentage of CGIs <200bp for Human position-defined CGI prediction method are greater than both CpGCluster method (Hackenberg et al., 2006) and density-defined CGI prediction method (Weber et al., 2007;Zhang et al., 2021a).
Also, since proportion distribution of CGI features is closely related to the regulatory mechanisms for CGI methylation (Hackenberg et al., 2010), Human position-defined CGI prediction method can describe the proportional distribution of the predicted CGIs at different CGI length (Figure 2B), GC content (Figure 2C), and O/E (Figure 2D).Here, we employ default setup for CpG interval, d = 12bp (Zhang et al., 2018;Zhang et al., 2021a).
It should be noted that Human position-defined CGI prediction method can parallel carry out position-defined CGI prediction and comparative analysis for multiple CpG intervals (d) by MR-CpGCluster.

Data annotation
Data annotation is described by the right side of Figure 1.Firstly, the position-defined CGIs are classified into different densities by Supplementary Eq.S2.And then, we classify each CpG methylation site of CGIs into different gene functional regions by Supplementary Table S2.Lastly, we classify the CpG sites into three methylation levels by Eq.5.
Data annotation can help us investigate the distribution of all CpG sites in different structural and functional categories of genome sequences (Figure 3; Supplementary Table S3).For example, we not only can compare the distribution of the number of CpG sites in each region of the predicted CGIs under different CpG interval(d) (Figure 3A), but also visualize the density of CpG sites in different functional regions (Figure 3B).

Human position-defined CGI methylation analysis
The position-defined CGI methylation analysis is described by the bottom side of Figure 1 with three functions.
First is "CGI density analysis" (Figure 4A), which is used to analyze the classification of position-defined CGI under various CGI density (Weber et al., 2007;Zhu et al., 2008) and CpG interval (d).
Second is "Methylation level analysis," which not only can analyze the specificity of methylation level for CpG sites under different annotation categories and CpG interval (d) (Figure 4B), but also allows the visualization and comparative analysis of methylation level of position-defined CGIs at the genome-wide perspective (Figure 4C).
The third is "GO enrichment analysis," which employs clusterProfiler (Yu et al., 2012) to make GO enrichment analysis (Liu et al., 2020) for the CGI + genes (Coding genes that at least one of its TSSs is located in the CGI) (Weber et al., 2007;Zhang et al., 2021a) of position-defined CGIs.Here, Figure 4D shows GO enrichment analysis for the CGI + genes under CpG interval d = 12bp.

Algorithm performance comparison
Firstly, As shown in Figure 5; Supplementary Figure S2, we compare the computing speed for Human position-defined CGI prediction method with MR-CpGCluster and this method without MR-CpGCluster with three commonly used standards: Speedup, Scaleup and Sizeup (Schatz, 2009).Figure 5 shows that the Speedup is positively related to the number of nodes and the size of dataset.For example, when using 8 nodes for a 3920 MB dataset, the ratio between the actual and ideal Speedup is 6.00/8 = 75%, while with 6 nodes for a 980 MB dataset, this ratio is 2.88/6 = 49.17%.
Next, we compare the computing efficiency for Human position-defined CGI prediction method with commonly used density-defined CGIs prediction method (Weber et al., 2007;Zhang et al., 2021a) and another two classical distance-based CGI prediction methods such as WordCluster (Hackenberg et al., 2011) and CpGProD (Ponger and Mouchiroud, 2002) by CGI length, GC content, and O/E ratio (Eq.3), which are three broadly used standards (Wang and Leung, 2004;Hackenberg et al., 2010).
Table 1; Supplementary Figure S5 not only demonstrate that the average length of CGIs of Human position-defined CGI prediction method (23.7 ± 11.5bp) is statistically shorter, but also the average GC content (89.3% ± 7.5%) and O/E value (1.54 ± 0.27) of Human position-defined CGI prediction method are statistically greater than other prediction methods by statistical test (Zhang et al., 2021b;Zhang et al., 2021d;Gao et al., 2021;Liu et al., 2021;Lai et al., 2022;Song et al., 2022).

Web service construction
Figure 6 shows the technical architecture of PCGIMA (http:// www.combio-lezhang.online/pcgima/home.html), which consists of three modules: "Human position-defined CGI prediction," "CpG sites annotation analysis," and "CGI methylation analysis." PCGIMA employs MR-CpGCluster to predict the positiondefined CGI for multiple consecutive (d) values.To compare and analyze the CpG methylation levels in different genome regions, we integrate the JavaScript version of IGV (Integrative Genomics Viewer) (Thorvaldsdottir et al., 2013) into our Web service.PCGIMA also imports the genome annotation information and analysis results into the MySQL database (Xia et al., 2010) and use eCharts (Bond and Goguen, 2002) to visualize CGI-related analysis results.
"Human position-defined CGI prediction" module provides two functions (Figure 2).One is "Position-defined CGI prediction," which can online predict position-defined CGI for the human genome or a particular chromosome with multiple consecutive (d) values.The other is "Position-  The speedup ratio of MR-CpGCluster.
Frontiers in Genetics frontiersin.orgdefined CGI features analysis," which can describe the connection between the proportion distribution of CGI and CGI features.
"CpG sites annotation analysis" module consists of two functions.
First is "Human CpG sites Distribution analysis," which can analyze the distribution of CpG methylation sites in different structural and functional categories of genomic sequences (Figure 3).Second is "Human CpG sites permutation analysis" module (Supplementary Method S4), which can analyze the CpG permutation patterns (Zhang et al., 2018) of density-and position-defined CGIs.
"CGI methylation analysis" module also provides two functions.
One is "Position-defined CGI methylation analysis," which can analyze the specificity of methylation level for CpG sites under different annotation categories (Figure 4B).The other is "GO enrichment analysis," which can make GO enrichment analysis for the CGI + genes of position-defined CGIs (Figure 4D).Meanwhile, PCGIMA also provides related source code and data download services.The function descriptions are detailed in Supplementary Method S4.

Discussion and conclusion
This study not only develops Human position-defined CGI prediction method to locate short and position-sensitive CGIs (CpG islets) using high performance computing to construct MR-CpGCluster algorithm (Figure 1), but also a novel human genome annotation and analysis method to investigate the connections among CGI, gene expression and methylation.Finally, we integrate them into PCGIMA for online computing and visualization.
For Human position-defined CGI prediction method, it not only can efficiently locate CpG islets (Figure 2A; Table 1), but also it can parallel predict position-defined CGIs with multiple consecutive (d) values and investigate the genetic characteristics of position-defined CGIs under different CpG interval parameters (Figures 2B-D; Supplementary Datas S1-S3).
For annotation method, it can investigate the connections between position-defined CGI methylation and gene expression specificity from a genome-wide perspective by considering functional regions (core promoters and gene bodies) and the distribution of methylation sites of genes for different expression breadth (Figure 3).Our annotation method (Figure 3A) reveals that the distribution proportion of methylation sites in TS genes for short positional-defined CGIs (d = 12) is 9.97%, which is less than that for long positional-defined CGIs (d = 50, 11.46%).The technical architecture of PCGIMA.
For Human position-defined CGI methylation analysis, not only CGI density analysis (Figure 4A) finds an interesting phenomena that short position-defined CGIs (CpG islets) are closer to LCGI by classifying the position-defined CGI under various CGI density (Weber et al., 2007;Zhu et al., 2008) and CpG interval (d), but also methylation levels analysis demonstrates that the average methylation levels are obviously low for CpG islets from overall scale and genome-wide perspective, respectively (Figures 4B, C) as well as Go enrichment analysis implies that the position-defined CGI-related genes could be associated with unique gene regulatory functions (Figure 4D; Supplementary Figure S4).
For Algorithm performance comparison, Figure 5 turns out that MR-CpGCluster method is faster than classical CpGCluster for the big dataset, which implies Human position-defined CGI prediction method can parallel process the big CGI data.
Moreover, previous studies indicate that CGIs with length less than 200 bp may be functional CpG islets (Hackenberg et al., 2006) and high GC content and O/E values represent enrichment of methylation sites (Gardinergarden and Frommer, 1987;Takai and Jones, 2002).Since Table 1 demonstrates that the average CGI length of the Human position-defined CGI prediction method is much less than 200bp (column 3 of Table 1), and the average GC content and O/E value are statistically greater than other prediction methods (column 4 and 5 of Table 1), we can conclude that Human position-defined CGI prediction method can locate more potential short CGIs with special functions than previous CGI prediction methods (Takai and Jones, 2002;Takahashi et al., 2017).
Lastly, Figure 6 shows that since we utilize the MR-CpGCluster to speed up CGI prediction and incorporate extensive visualization methods to increase user usability, PCGIMA provides an easy-touse analysis and visualization platform for human CGI prediction and methylation.It should be noted that since the human genome annotation and analysis results have been computed and imported into the database in advance, it is fast (about 2-3 min) for PCGIMA to show the analysis results except the "Human position-defined CGI prediction." Although our study already made great progress in CGI prediction, annotation, analysis, and visualization, it still needs further improving.Firstly, we should make detail annotations for human position-defined CGIs in terms of functional and structural features.Secondly, we should interrogate the lineage-based and function-based subsets for CGIs and their regulatory implications (Blackledge et al., 2013).Finally, we should employ advanced high performance computing technology (Jiang et al., 2015;Zhang et al., 2021c;Xiao et al., 2021) to improve PCGIMA in the distant future.

FIGURE 3
FIGURE 3 Position-defined CGI annotation results.(A) Distribution of all CpG sites in different structural and functional categories of genome sequences.(B) CpG density (Eq.4) of different gene and sequence categories.

FIGURE 4
FIGURE 4 Human position-defined CGI methylation analysis.(A) The classification of position-defined CGI under CGI density.(B) CpG density of different gene and sequence categories.(C) Comparison of the methylation ratio of position-defined CGIs.The horizontal and vertical axes represent genomic chromosome position and the methylation rate of the CpG site at corresponding position, respectively.(D) GO enrichment analysis.

TABLE 1
CGI prediction methods comparison.