REW-ISA V2: A Biclustering Method Fusing Homologous Information for Analyzing and Mining Epi-Transcriptome Data

Background: Previous studies have shown that N6-methyladenosine (m6A) is related to many life processes and physiological and pathological phenomena. However, the specific regulatory mechanism of m6A sites at the systematic level is not clear. Therefore, mining the RNA co-methylation patterns in the epi-transcriptome data is expected to explain the specific regulation mechanism of m6A. Methods: Considering that the epi-transcriptome data contains homologous information (the genes corresponding to the m6A sites and the cell lines corresponding to the experimental conditions), rational use of this information will help reveal the regulatory mechanism of m6A. Therefore, based on the RNA expression weighted iterative signature algorithm (REW-ISA), we have fused homologous information and developed the REW-ISA V2 algorithm. Results: Then, REW-ISA V2 was applied in the MERIP-seq data to find potential local function blocks (LFBs), where sites are hyper-methylated simultaneously across the specific conditions. Finally, REW-ISA V2 obtained fifteen LFBs. Compared with the most advanced biclustering algorithm, the LFBs obtained by REW-ISA V2 have more significant biological significance. Further biological analysis showed that these LFBs were highly correlated with some signal pathways and m6A methyltransferase. Conclusion: REW-ISA V2 fuses homologous information to mine co-methylation patterns in the epi-transcriptome data, in which sites are co-methylated under specific conditions.


INTRODUCTION
At present, researchers have identified more than 170 different chemical modifications in RNA (Frye et al., 2018). N 6 -methyladenine (m 6 A) is the most common and abundant post-transcriptional RNA modification in mRNAs and long non-coding RNAs , and its methylation occurs at the sixth position of nitrogen atoms of adenosine. Studies have shown that m 6 A is involved in some RNA metabolic processes such as mRNA transcription, translation, nucleation, splicing and degradation (Ping et al., 2014;Lin et al., 2016;Deng et al., 2018). Besides, m 6 A also plays an important role in the early development of eukaryotic cells, sex determination, antiviral immunity, brain development, and directed differentiation of hematopoietic stem cells (Zhang et al., 2017(Zhang et al., , 2019a. In addition to the above biological processes, m 6 A modification is also related to many pathological phenomena, such as leukemia, glioma and hepatocellular carcinoma (Lachén-Montes et al., 2016;Chai et al., 2019).
The m 6 A methylation in RNA is a dynamic and reversible process regulated by methyltransferases and demethylases. Since the main role of m 6 A methyltransferases is to catalyze RNA to produce m 6 A methylation modifications, these enzymes are often called "writers." The most common m 6 A writer is composed of core components METTL3, METTL14, WTAP, and other subunits Ping et al., 2014). On the contrary, m 6 A demethylases mainly mediate m 6 A demethylation, so these enzymes are also known as "erasers." The common erasers are FTO, AKLBH5, and so on (Jia et al., 2011). Studies have shown that m 6 A has a series of biological functions because many RNA binding proteins mediate it. These binding proteins can specifically recognize m 6 A methylated adenosine on RNA, so these proteins are often referred to as "readers." The common readers include protein YT521-B homologous (YTH) domain family (Meyer et al., 2015), etc. In recent years, with the development of methylated RNA immunoprecipitation sequencing (MeRIP-seq, or m 6 A-seq) technology (Dominissini et al., 2012;Meng et al., 2014), many m 6 A experimental data continue to emerge, which makes it possible to analyze m 6 A in the whole transcriptome. However, since there are a few enzymes, such as m 6 A writers, erasers and readers only, each enzyme may regulate a large number of m 6 A sites. In other words, the methylation level of m 6 A site regulated by the same enzyme may share the same pattern, which is called the co-methylation pattern of m 6 A.
Till this day, some researchers have used clustering methods to study the co-methylation patterns in epi-transcriptome data, trying to clarify the functional mechanism of m 6 A methylation. Based on MeRIP-seq data, Liu et al. used k-means clustering, hierarchical clustering, Bayesian factor regression model and non-negative matrix decomposition to cluster m 6 A sites (Liu et al., 2015). To better fit the distribution of epitranscriptome data, Zhang et al. proposed an infinite beta binomial mixture model based on Dirichlet Process (DPBBM) to reveal the co-methylation patterns (Zhang et al., 2019b). Besides, our previously proposed RNA Expression Weighted Iterative Signature Algorithm (REW-ISA) (Zhang et al., 2020) applied biclustering to the analysis of epi-transcriptome data for the first time. However, the above methods only used the read counts of the m 6 A sites of the IP sample and the input sample in MeRIP-seq data. They did not fully consider the homologous information of sites and experimental conditions. Homology is a central concept in comparative biology, in which the most basic meaning of homology is to have a common ancestor. The homologous information of MeRIP-seq data can be divided into two categories: the genes corresponding to the m 6 A sites and the cell lines (or environments) corresponding to the experimental conditions. Appropriate use of the abovementioned homologous information will help discover potential local functional blocks (LFBs) and better reveal the m 6 A regulatory mechanism. Besides, although some of the most advanced biclustering methods have been developed, such as runibic (Wang et al., 2016;Orzechowski et al., 2018a), EBIC (Orzechowski et al., 2018b), QUBIC2 (Xie et al., 2020) and RecBic , their goal is to identify the trend-preserving biclusters. However, when mining m 6 A co-methylation pattern, it is expected to obtain locally hyper-methylated biclusters, so these new methods are not applicable.
Therefore, we proposed an improved RNA expression weighted iterative signature algorithm (REW-ISA V2), which fuses the homologous information of sites and experimental conditions in the iterative search for LFBs. Consistent with the previous method, each potential LFB is identified by the row threshold (defined as T R ) and column threshold (defined as T C ) during the LFB searching strategy. It is important to note that REW-ISA V2 updates T R and T C 's selection process, optimizing the selection of thresholds through the built-in rich constraint framework. According to the previous study (Henriques et al., 2015(Henriques et al., , 2017, REW-ISA V2 is a nondeterministic greedy algorithm, which can be used to find hypermethylated biclusters. Besides, REW-ISA V2 can obtain these overlapping LFBs when there is overlap between the LFBs implied in the input data. To verify the effectiveness of the fusion of homologous information, REW-ISA V2 was applied to the collected MERIPseq data to find potential LFBs. The obtained LFBs were further analyzed by the Gene Ontology (GO) analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, and enzyme-specific experiments, in an attempt to reveal the possible regulatory mechanism of m 6 A. As a result, REW-ISA V2 can better find potential LFBs with high methylation levels in the epi-transcriptome data.

Pre-processing of Real Data
As is known, MeRIP-Seq data profiles the m 6 A epi-transcriptome by IP and input samples. Thus, we first need to follow  and  to quantify the information of m 6 A sites. Specifically, after downloading the sequencing data from Gene Expression Omnibus (GEO) in SRA format, the Tophat2 (Kim et al., 2013) needs to be used to compare the sequencing data reads with the human reference genome, and finally obtain the Fragments Per Kilobase of transcript per Million (FPKM) statistics.
To mine the potential LFBs in the epi-transcriptome data, only the FPKM statistical information of IP and input samples are not enough. It is necessary to calculate the m 6 A methylation level of each m 6 A site under each experimental condition. Let m denote the total number of m 6 A sites and n denote the total number of conditions. Therefore, according to the REW-ISA, the methylation level matrix P ∈ R m×n and the RNA expression level matrix W ∈ R m×n can be further calculated using the IP sample and the input samples, as shown in (1, 2).
In (1) and (2), t ij represents the FPKM of the i-th m 6 A site under the j-th condition in the IP sample, and h ij represents the FPKM of the i-th m 6 A site under the j-th condition in the input sample. Besides, α in (1) is a very small value, aiming to avoid NaN where FPKM of both IP and input samples are zeros. The purpose of introducing the RNA expression level is to provide a confidence level for m 6 A methylation level in further biclustering analysis.

REW-ISA V2
To eliminate the effect of global sites or conditions on P, REW-ISA V2 performs standard normalization on the whole, rows and columns of P in turn to eliminate the global effect, as shown in (3-5). P nw , P nr , and P nc represent the matrices obtained after whole normalization, row normalization, and column normalization, respectively.
In (3-5), mean(·) represents calculating the mean value, max(·) represents calculating the maximum value, and min(·) represents calculating the minimum value. P nw i· represents the i-th row in P nw , and P nr ·j represents the j-th column in P nr . Then minmax normalization is performed on the overall data to generate P t , which will facilitate subsequent combination with RNA expression level.
For the RNA expression level matrix W, since its distribution fluctuates with the MeRIP-seq data, it is necessary to perform the min-max normalization on W to generate W t , which acts as confidence matrix for P t .
Suppose that k-1 (2 ≤ k ≤ K) LFBs have been found, and the k-th LFB is currently being searched. Assuming that the k-th LFB is B k , the site indicator ρ k and the condition indicator κ k are used to indicate the sites and conditions contained in B k . Specifically, the site indication ρ ik is one if the i-th site is present in B k (zero otherwise). The condition indication κ jk is one if the j-th condition is present in B k (zero otherwise). The average methylation level µ p k and average expression level µ w k of B k can be further calculated, as shown in (8,9), respectively.
Each time a LFB is found, the average methylation level and average expression level of the LFB should to be removed from P t and W t . The purpose of removing is to prevent the algorithm from falling into a loop looking for a strong LFB. Let residual matrix P (k) represent the methylation level matrix after eliminating the µ p of the first k-1 LFBs, Then, P (k) turns into P R(k) after row min-max normalization and turns into P C(k) after column min-max normalization. Similarly, let W (k) represent the RNA expression level matrix after eliminating the µ w of the first k-1 LFBs, After obtaining the above P R(k) , P C(k) and W (k) , combined with the homologous information of sites and conditions, the algorithm begins to search for LFBs iteratively. The algorithm running from a randomly selected site's subset U ′ and updates the conditions' subset V ′ according to (12).
where V is the conditions set of P t , refers to the u-th site under the v-th condition in P R , is the RNA expression level of the u-th site under v-th condition, H C v represents the subset of homologous conditions corresponding to the v-th condition. ρ(·) represents to calculate Pearson similarity, |·| represents to calculate absolute value (or module). Besides, T C is a hyperparameter, and its function is to select the subset of conditions V ′ . In (12), e C U ′ v is calculated based on P R(k) and W (k) , which represents the average methylation level score of the v-th condition combined with the confidence of the expression level. t C U ′ v is calculated based on P t and W t , representing the average similarity score of the v-th condition relative to its homologous conditions subset. In the process of calculating e C U ′ v and, only the sites involved in U ′ are considered.
Then, the subsets of sites are updated following (13).
Frontiers in Genetics | www.frontiersin.org where U is the sites set of P t , refers to the u-th site under the v-th condition in P C , H R u represents the subset of homologous sites corresponding to the u-th site. Besides, T R is a hyperparameter, and its function is to update the subset of sites U ′ . e R u V ′ represents the average methylation level score of the u-th site combined with the confidence of the expression level. t R uV ′ represents the average similarity score of the u-th site relative to its homologous sites subset. In the process of calculating e R uV ′ and t R uV ′ , only the conditions involved in V ′ are considered.
Using the preset hyperparameters T R and T C , U ′ and V ′ are updated iteratively by (12, 13) until convergence is satisfied (or the maximum number of preset iterations is reached). The convergence condition is shown in (14).
where ε is the default convergence criteria, and its value is slightly <1. U" represents the site's subset in the previous iteration, and U ′ represents its subset in the current iteration. If the algorithm converges within the maximum number of iterations, it means that the k-th LFB, B k = {U ′ , V ′ } has been found. The flow chart of searching for the k-th LFB by REW-ISA V2 is shown in Figure 1.
Then the algorithm will return to (8) and continue to look for the next LFB. Conversely, if the convergence condition of (14) is not satisfied when the algorithm reaches the maximum number of iterations, REW-ISA V2 will automatically terminate and output all previously obtained LFBs. We recommend setting ε to 0.99 and the maximum number of iterations not <50. The closer the value of ε is to 1 and the greater the maximum number of iterations, the more accurate the LFBs obtained by REW-ISA V2. The REW-ISA V2 algorithm based on R language can be downloaded freely from https://github.com/ labiip/REWISAV2.

Enrichment Constraint Framework
It can be seen from (12, 13) that the selection of T R and T C will greatly affect the biological significance of the obtained LFBs. Therefore, based on Meng et al. (2009), we introduced a grid search-based enrichment constraint framework for the algorithm to optimize T R and T C selection further. For LFBs obtained under different T R and T C combinations, we need to extract the genes corresponding to the m 6 A sites in each LFB and then perform GO analysis based on "clusterProfiler" (Yu et al., 2012) for each LFB. For the range of T R and T C , we recommend setting it between 0 and 3, and the step size is 0.1. On this basis, the range of specific thresholds should be appropriately adjusted according to the input real data. Assuming that a LFB is obtained, the number of genes corresponding to the m 6 A site contained in it is M. The number of GO terms obtained by GO analysis of the LFB is l. Then the weighted enrichment score (WE_score) (Li et al., 2012) of this LFB can be calculated by (15).
where p i is the p-value of the i-th GO term, m i is the number of genes of the i-th GO term enriched, m non is the number of genes covered by LFB but not enriched by any GO term. The higher the WE_score, the stronger the biological significance of this LFB. However, as the number of genes corresponding to the sites in LFB increases, WE_score will also show an increasing trend, as shown in Supplementary Figure 1. Therefore, only using WE_score to evaluate the biological significance of obtained LFBs is not perfect, and the number of genes corresponding to the sites in LFBs also needs to be considered. Assume that the data analyzed contain a total of M all genes, and further assume an obtained LFB is containing M genes and record its WE_score as W m . We randomly select M genes from all genomes, and their WE_score is recorded as W rm . The relative promotion rate (RPR) of WE_score can be further calculated, as shown in (16).
The larger the RPR is, the larger the area of the obtained LFB is, and the more biological significance of the obtained LFB is. On the one hand, in the actual process of mining LFBs, we hope to get more LFBs. On the other hand, we hope to get LFBs with rich biological significance. Therefore, the number of LFBs obtained by each pair of threshold combinations is obtained by grid search under different T R and T C combinations. The threshold combinations corresponding to the maximum number of LFBs are selected. Then, the average RPR of the LFBs is calculated based on the selected combination of T R and T C . Finally, the optimal T R and T C are the threshold combinations corresponding to the maximum average RPR.

RESULTS
We collected 32 samples from 10 publicly human m 6 A MeRIPseq datasets (Dominissini et al., 2012;Meyer et al., 2012;Fustin et al., 2013;Batista et al., 2014;Schwartz et al., 2014;Wang et al., 2014;Barbieri et al., 2017;Li et al., 2017;Pendleton et al., 2017) to mine potential LFBs, most of which can be retrieved from the MeT-DB V2.0 database (Liu et al., 2018). Table 1 summarizes the MeRIP-seq real data set used in this project. Then, calculate the corresponding P and W through (1) and (2), and perform REW-ISA V2. Within the range of T R being 0.1-2 with step size 0.1, and T C being 0.1-2 with step size 0.1, T R and T C are optimized through the enrichment constraint framework. The experiments were repeated ten times for each parameter setting. Although optimizing T R and T C based on the gathered biological significance may produce biased results. However, this process provides guidance for the selection of T R and T C . Finally, under the optimal T R of 0.4 and the optimal T C of 0.7, a total of fifteen LFBs are obtained. The number of m 6 A sites, the number of genes corresponding to m 6 A sites and the number of conditions contained in these LFBs are shown in Supplementary Table 1.
For the above-mentioned real data set, Bimax (Prelić et al., 2006), Xmotifs (Murali and Kasif, 2003), Plaid (Lazzeroni and Owen, 2002), ISA (Bergmann et al., 2003), REW-ISA (Zhang et al., 2020), FBCwPlaid (Chen et al., 2021), runibic (Orzechowski et al., 2018a), and QUBIC2 (Xie et al., 2020) were all included for comparison with REW-ISA V2. To make the LFBs obtained by the above methods have significant biological significance, the parameters of these methods have been appropriately adjusted. For each LFB obtained by each method, the two enrichment indicators, WE_score and RPR, were both calculated for evaluation. The comparison results are shown in Figures 2A,B, respectively. As can be seen from Figure 2A, the average WE_score of the LFBs obtained by the REW-ISA V2 algorithm is higher than that of ISA and REW-ISA, which indicates that the fusion of homologous information is effective for mining LFBs. Although the average WE_score of LFBs obtained by REW-ISA V2 is lower than that of the FBCwPlaid algorithm, there are significant differences in RPR between the two methods. After further analysis of the LFBs, we found that this was caused by the size of LFBs found by REW-ISA V2 was smaller than that found by the FBCwPlaid algorithm. In other words, the LFBs found by REW-ISA V2 had higher enrichment scores with fewer corresponding genes. Besides, we can find that runbic and QUBIC2 do not perform well in the task of m 6 A hypermethylation pattern recognition. It may be due to the following two points. On the one hand, the two algorithms mainly identify the trend-preserving biclusters, which is different from the hyper-methylation bicluster. On the other hand, the LFBs obtained are generally small. This also reflects the need of developing biclustering methods for epi-transcriptome data. In a word, the average RPR of LFBs inferred by REW-ISA V2 is significantly higher than that of other biclustering algorithms, which means that the LFBs obtained by REW-ISA V2 may be more biologically significant.
To further explore the biological significance of the obtained LFBs, we selected four LFBs with more sites from the fifteen LFBs. As can be seen from Supplementary Table 1, for the four selected LFBs, they cover 1,256, 1,619, 824, and 1,148 genes, respectively. An important feature of any biclustering is the  identified subsets of conditions, so the conditions contained in the four selected LFBs are explored in detail, as shown in Supplementary Table 2. The methylation level heatmaps of the four selected LFBs are shown in Figure 3. For the KEGG pathway analysis, six KEGG pathways known to be regulated by RNA methylation were selected (Dominissini et al., 2012;Xiang et al., 2017), such as apoptosis, DNA repair, fatty acid metabolism, etc. Then, Fisher's exact test was used to verify whether each LFB was significantly enriched in some specific pathways. The output p-value shows the correlation between the four LFBs obtained and six biological pathways, as well as the importance of multiple hypothesis correction. We could see from Supplementary Table 3 that the four selected LFBs are significantly enriched in the ultraviolet (UV) response up. Although the enrichment degree of LFB2 is lower than that of the other three LFBs in the UV response up, its enrichment in the apoptosis is significantly higher than that of the other three LFBs, indicating that LFB2 may further affect apoptosis through some other m 6 A-related pathways. Besides, LFB1, LFB3, and LFB4 are also significantly enriched in DNA repair, which may be related to DNA damage caused by ultraviolet radiation. Since m 6 A has been proved to be related to stem cell differentiation and cancer progression (Batista et al., 2014), there is a reasonable explanation for enriching LFB1 and LFB3 in fatty acid metabolism. As the main components of neutral fat, phospholipids and glycolipids, fatty acids can meet various body needs and regulate metabolism, growth and development (Azain, 2004). The p53 pathway enriched in LFB4 indicates that LFB4 may be related to stress signal, regulation of intracellular homeostasis, chromosome segregation, and cell division (Harris and Levine, 2005). Through the above analysis, it is not difficult to see that the LFBs obtained by REW-ISA V2 have more significant biological significance than the randomly selected LFB. Therefore, an in-depth analysis of the LFBs obtained may help reveal the specific regulatory mechanism of m 6 A.
To check whether the detected LFBs have biological significance, we further conducted the enzymes substrate specificity experiments on the four selected LFBs. Since LFB covers hyper-methylated sites and conditions, the sites and conditions involved in each LFB are more likely to be the target sites of m 6 A methyltransferases. Therefore, we studied the association between each selected LFB and four m 6 A methyltransferases, including METTL3, METTL14, WTAP as well as KIAA1429. For this purpose, 38,845 METTL3 targeted gene sites, 19,099 METTL14 targeted gene sites, 35,144 WTAP targeted gene sites, and 1,784 KIAA1429 targeted gene sites included in the real data were first identified by TREW tool (Liu et al., 2018). After REW-ISA V2, we summarized the distribution of target RNA methylation sites involved in each LFB (Supplementary Table 4). Then, the association between the sites in each selected LFB and m 6 A methyltransferases target sites was further evaluated by Fisher's exact test. The experimental enrichment results are shown in Supplementary Table 5, where p-value indicates the significance of association between sites and methyltransferase target sites. The results showed that the sites contained in the four selected LFBs were significantly enriched in the target sites of the four methyltransferases. This means that under specific conditions, the LFBs obtained by REW-ISA V2 were indeed the collaboratively hyper-methylated sites, which will help biologists to further study the specific regulation mechanism of m 6 A. The detailed analysis process and results can be obtained in the Supplementary Materials.

DISCUSSION
Although more and more studies have shown that the modification of m 6 A in RNA is related to many important biological functions, the specific regulatory mechanism of m 6 A is still unclear. To quickly and effectively predict potential functional m 6 A sites from the epi-transcriptome data, it is important to develop some computational algorithms, which will help us have a more comprehensive understanding of m 6 Arelated life processes. Based on REW-ISA, in this article, we developed REW-ISA V2 to better reveal the potential local comethylation patterns across subsets of conditions. REW-ISA V2 was implemented on the real MeRIP-seq data, and a total of 15 LFBs were obtained. Further comparison and analysis show that, compared with other biclustering algorithms, the LFBs obtained by REW-ISA V2 has more significant biological significance.
REW-ISA V2 could obtain reliable biclustering patterns because of the use of homologous information. More specifically, the sites' methylation levels corresponding to the same gene will show a similar trend with a high probability. Similarly, conditions derived from the same cell line will have similar trends in all sites. Therefore, the rational use of homologous information will help to better mine local co-methylation patterns. Of course, REW-ISA V2 still has some deficiencies that need to be improved in the future. First of all, REW-ISA V2 uses simple multiplication to fuse homologous information, which inevitably introduces noise at the same time. Secondly, because the database on which GO analysis depends is incomplete, the enrichment constraint framework designed is prone to human error. Finally, the enrichment constraint framework built into REW-ISA V2 usually takes a long time. In the future, we will use BSig (Henriques and Madeira, 2018) to better evaluate the obtained LFBs and develop a new computational model to overcome these limitations.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
LZ and SC built the architecture for REW-ISA V2, designed and implemented the experiments, analyzed the result, and wrote the paper. JM conducted the experiments, analyzed the result, and revised the paper. ZL and HL supervised the project, analyzed the result, and revised the paper. All authors read, critically revised, and approved the final manuscript.

FUNDING
This work has been supported by Fundamental Research Funds for the Central Universities (Grant No. 2019ZDPY15 to LZ). The funding body did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.