Identifying RNA N6-Methyladenine Sites in Three Species Based on a Markov Model

N6-methyladenosine (m6A), the most common posttranscriptional modification in eukaryotic mRNAs, plays an important role in mRNA splicing, editing, stability, degradation, etc. Since the methylation state is dynamic, methylation sequencing needs to be carried out over different time periods, which brings some difficulties to identify the RNA methyladenine sites. Thus, it is necessary to develop a fast and accurate method to identify the RNA N6-methyladenosine sites in the transcriptome. In this study, we use first-order and second-order Markov models to identify RNA N6-methyladenine sites in three species (Saccharomyces cerevisiae, mouse, and Homo sapiens). These two methods can fully consider the correlation between adjacent nucleotides. The results show that the performance of our method is better than that of other existing methods. Furthermore, the codons encoded by three nucleotides have biases in mRNA, and a second-order Markov model can capture this kind of information exactly. This may be the main reason why the performance of the second-order Markov model is better than that of the first-order Markov model in the m6A prediction problem. In addition, we provide a corresponding web tool called MM-m6APred.


INTRODUCTION
To date, more than 160 types of RNA modifications have been discovered (Zhao et al., 2019). In these modifications, N6-methyladenosine (m6A) is the most common and abundant one existing in various species. It is closely associated with diverse biological processes, such as RNA localization and degradation (Wang et al., 2014), RNA structural dynamics (Roost et al., 2015), alternative splicing (Liu et al., 2015), and primary microRNA processing (Alarcón et al., 2015). Thus, identification of m6A sites is of great importance for better understanding their function and mechanisms . In the past few years, high-throughput experimental methods, such as MERIPP (Geula et al., 2015) and M6ASeq (Meyer et al., 2012), have been used to identify m6A modifications, but these methods have some limitations: (1) The location of the m6A site cannot be accurately located; (2) the cost is high; and (3) they are not applicable for the large-scale identification of m6A sites. Hence, it is highly desirable to develop a fast and accurate computational method for the identification of m6A sites (Dominissini et al., 2012).
Currently, there are several effective methods for predicting m6A sites based on machine learning, mainly including iRNA-Methyl , SRAMP (Zhou et al., 2016), M6AMRFS (Qiang et al., 2018), M6APred-EL , pm6A-CNN (Roost et al., 2015), and iN6-Methyl (Nazari et al., 2019) etc. The above methods actually use the physical and chemical properties of nucleotides in various species, such as the nucleotide frequency at specific locations and the chemical properties of nucleotides, to extract features and predict m6A sites. However, none of these methods can capture the correlation between adjacent nucleotides well, while the Markov model can model this kind of correlation. In fact, Pian et al. (2020) used a first-order Markov model to predict the DNA N6-methyladenine sites. Recently, we proposed a method to predict DNA 4mC sites based on the second-order Markov model (Yang et al., 2020). Later, we found that the second-order Markov model is more suitable for predicting the methylation sites of RNA m6A because of the biases of the triplet codons in mRNA. The main purpose of this article is to provide a more accurate prediction tool of m6A.
Based on this idea, we used a second-order Markov model to identify the m6A sites of RNA. The m6A data of the three species of Saccharomyces cerevisiae, mouse, and Homo sapiens, were used to evaluate our model. The results show that the prediction performances of the first-order Markov model and the second-order Markov model are significantly better than those of the other four existing prediction tools. In addition, the secondorder Markov model outperforms the first-order Markov model, which indicates that the second-order Markov model can capture the codon bias in mRNA well. This suggests that second-order Markov may be able to characterize the codon bias in mRNA.

Benchmark Datasets
In this study, we used three benchmark datasets from three different species: S. cerevisiae , mouse (Dominissini et al., 2012), and H. sapiens (Chen et al., 2017). The corresponding number of positive samples was 1,300, 725, and 1,130. There were as many negative samples as positive samples. Table 1 shows the details of these data. For the three benchmark datasets, the positives were the sequences centered with true m6A sites, while the negatives were the sequences centered with adenines but without any m6A peaks detected. The datasets can be downloaded from the following website 1 .

Model Construction
A Markov model is a stochastic process where the next variable depends on only the most recent variable(s) instead of all the previous variables. In this sequence information study, we first model a sequence as a first-order Markov chain, and the 1 http://server.malab.cn/M6AMRFS/ More specifically, for the m6A sequences of positive samples in the training data, we first calculate the initial probability P P S 1 (P P A , P P G , P P C , P P U ) of the initial state S1 nucleotide being A, G, C or U, respectively. Then, we need to calculate the transfer probability P P n S n −S n+1 of the current nucleotide state to the next state individually from the initial state S 1 (for example, P P 2 G−A represents the probability that nucleotide G in state S 2 transfers to nucleotide A in state S 3 ).
Thus, we can obtain the probability of the occurrence of the four nucleotides in the initial state and the transfer probability matrix of each state except the last one. Similarly, for the negative sequences of non-m6A, the probability of the occurrence of the corresponding four nucleotides in the initial state and the transition probability matrix can also be obtained. Therefore, two Markov models are trained based on the m6A sequences and non-m6A sequences in the training dataset.
In the process of prediction, we need to select the probability values according to the nucleotide arrangement of the sequence, including the initial state probability and the corresponding transfer probability from the positive and negative Markov models in the previous step. Then, we calculate the products of positive and negative probability values. Finally, we calculate the ratio of the positive product and negative product. If the ratio is greater than 1, the sequence is considered a m6A sample. Otherwise, it is considered a non-m6A sample.
Since there is a bias in the codon of mRNA (Kurland, 1991;Quax, 2015), we consider using a second-order Markov model to capture this bias. The flowcharts of the training and testing of the second-order Markov model are shown in Figure 1. For the m6A sequences, we first calculate the initial probability P P S 1 S 2 (P P AA , P P AG , ..., P P UU ) of the first dinucleotide. Then, we need to calculate the transfer probability P P n S n S n+1 −S n+2 of the current dinucleotide (S n S n+1 ) to the next nucleotide (S n+2 ) (for example, P P1 AA−A represents the probability of state S 1 S 2 transferring to S 3 , where the nucleotide of state S 1 S 2 is AA, and the nucleotide of state S 3 is A). Thus, 39 transfer probability matrices with 16 rows and four columns can be obtained. Similarly, the initial probability and transfer probability can be obtained for non-m6A sequences. Therefore, two Markov models (M P and M N ) are similarly trained based on the m6A sequences and non-m6A sequences in the training dataset. Taking the sequence "seq = GUAUAUAACUUUUUUCUUCAAGGAGCAGGUGUC UGCCUAA" as an example, the probabilities P(seq|M P ) and P(seq|M N ) of the sequence "seq" under models M P and M N are obtained, respectively. Then, the value of Ratio = P(seq|M P )/P(seq|M N ) can be used to determine the class of "seq, " where P(seq|M P ) = P P GU × P P 1 GU−A × P P 2 UA−U × ... × P P 39 UA−A , and P(seq|M N ) = P N GU × P N 1 GU−A × P N 2 UA−U × ... × P N 39 UA−A , The prediction for a test sequence. The sequence "GUAUAUAACUUUUUUCUUCAAGGAGCAGGUGUCUGCCUAA" is used as an example to explain the prediction process.   If the Ratio > 1, "seq" is classified as a m6A sequence; otherwise, it is classified as a non-m6A sequence.

Performance Evaluation
Ten-fold cross-validation was used to assess the reliability of the method. In the performance evaluation, the sensitivity (Sn), specificity (Sp), accuracy (ACC), and Mathew's correlation coefficient (MCC) were calculated. They are formulated as follows: S n = T P T P + F N , where T P , T N , F P , and F N denote true positive, true negative, false positive, and false negative, respectively. S n measures the predictive ability of a predictor for positive samples, while S p measures the predictive ability of a predictor for negative samples. ACC and MCC are two metrics measuring the overall performance of a predictor.

RESULTS AND DISCUSSION
Representation and Illustration of (P P n S n S n+1 −S n+2 For the second-order Markov model, the heat map of the quotient matrix (P P n S n S n+1 −S n+2 /P N n S n S n+1 −S n+2 ) of second-order transfer probability of m6A samples divided by the secondorder transfer probability of non-m6A samples is shown in Figure 2. In order to facilitate comparison, the results of The best performance in the respective part appears bold in the table.
heat map were standardized. The results show that there is a significant difference in the transfer probability of nucleotides at some positions between the positive and negative samples. This indicated that the second-order Markov chain is informative for predicting sequences containing m6A sites.
We also plotted the line charts of transfer probability of the second-order Markov model (Shown in Supplementary Material 1). Similar to the first-order Markov model, the transfer probability of positive samples is significantly different from that of negative samples in the second-order Markov model. Furthermore, the number of significant different sits in the second-order Markov model are obviously greater than that in the first-order Markov model from the line charts in Supplementary Material 2. It indicates that more information is provided in the second-order Markov model to help determine the type of sequences.

The Distribution of Ratios in the Positive and Negative Sample Sets
Probability density maps of ln(Ratio) values for three species based on the second-order transfer probability products are shown in Figure 3. It can be found that in each species, the distribution of ln(Ratio) is very different between positive and negative samples, except for a small amount of overlap in the probability density graphs. The Ratio value of positive samples is significantly greater than that of negative samples, which enables the positive and negative samples to be divided accurately.

Comparison and Analysis
To evaluate our Markov model, we compared the performance of the two methods based on the Markov model with those of other m6A classifiers, including iRNA-Methyl, SRAMP, M6AMRFS, and M6APred-EL. Table 2 and Figure 4 show the prediction results of various methods (10-fold cross validation was used in all methods).
It can be found that the two methods based on the Markov model in m6A types of sequence identification had better or equal classification effects than several kinds of classifiers and that the second-order Markov model performed much better than the first-order Markov model in each aspect. It is noteworthy that Sp in several other methods is 100% in the species of mouse and H. sapiens, while the Sp of our method is close to 90% on average. Therefore, we checked these non-m6A data and found that the selection of negative sample data in the original literature [12] is unreasonable. The states S 22 of the negative samples in mouse and human are all C, and the states S 20 are all A or G. This is the reason why Sp of other methods can reach 100%. To evaluate our method more fairly, we downloaded 725 m6A sequences of mice from the m6Avar database, and the same number of sequences were randomly selected from the non-m6A sequences of the dbSNP database as negative samples. We used these data to retrain new models and carried out 10-fold cross validation in all methods. The performance results of all the above methods are shown in Table 3 and Figure 5. The results indicate that all the performance metrics based on the two Markov model are high. And the secondorder Markov model still performed much better than the firstorder Markov mode.

Web-Server Implementation
To facilitate the use of the Markov model to identify RNA m6A sites, the user-friendly web server MM-m6APred has been provided. It is freely available at 2 . Our tool can handle RNA sequences of 41 bp or longer. Users can either paste RNA sequences into the text area or upload a FASTA format file.

CONCLUSION
Accurate identification of the m6A site is a necessary step in the study of its biological function. In this study, we used first-order and second-order Markov models to predict the m6A sites of three species. The results show that our method is better than the other four existing prediction tools. This shows that the Markov model can capture the correlation between neighboring nucleotides well. Considering the biases of the codons in mRNA, the second-order Markov model is used to capture these biases. The results show that the prediction performance of the second-order Markov model is significantly better than that of the first-order Markov model. In addition, we also provide the online prediction web tool of m6A, with code available to download (see text footnote 2). 2 http://www.pianlab.cn/m6APred/

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: http://server.malab.cn/M6AMRFS.

AUTHOR CONTRIBUTIONS
CP and ZY contributed equally to this work. All authors read and approved the final manuscript.