Identifying RNA N6-Methyladenosine Sites in Escherichia coli Genome

N6-methyladenosine (m6A) plays important roles in a branch of biological and physiological processes. Accurate identification of m6A sites is especially helpful for understanding their biological functions. Since the wet-lab techniques are still expensive and time-consuming, it's urgent to develop computational methods to identify m6A sites from primary RNA sequences. Although there are some computational methods for identifying m6A sites, no methods whatsoever are available for detecting m6A sites in microbial genomes. In this study, we developed a computational method for identifying m6A sites in Escherichia coli genome. The accuracies obtained by the proposed method are >90% in both 10-fold cross-validation test and independent dataset test, indicating that the proposed method holds the high potential to become a useful tool for the identification of m6A sites in microbial genomes.


INTRODUCTION
At present, ∼150 kinds of RNA modifications have been found in different RNA species (Boccaletto et al., 2018), which not only enrich the genetic information, but also play critical roles in a variety of biological processes as mentioned in a recent review (Roundtree et al., 2017). Among these modifications, the N 6 -methyladenosine (m 6 A) is the most abundant posttranscriptional modification and has been found in the three domains of life. m 6 A has been found to participate in various biological activities, such as mRNA splicing (Nilsen, 2014), mRNA translation , mRNA maturation (Hoernes et al., 2016), stem cell proliferation (Bertero et al., 2018), and even a series of diseases Cui et al., 2017;Li et al., 2017).
In order to reveal its biological functions, different kinds of high-throughput sequencing techniques have been proposed to map the locations of m 6 A on genome wide (Dominissini et al., 2013;Linder et al., 2015;Wan et al., 2015;Hong et al., 2018). Although these techniques promoted the research progress on understanding the biological functions and the identification of RNA modifications, they are still labor-intensive and cost-ineffective. In addition, the resolution of detecting m 6 A sites for most techniques is still not satisfactory. Therefore, it's necessary to develop novel methods to detect m 6 A sites.
Giving the credit to the experimental data yielded by these high-throughput sequencing techniques as reported in a recent work (Chen X. et al., 2017), some machine learning based computational methods have been proposed to identify m 6 A sites (Chen et al., 2015a(Chen et al., ,b, 2016aZhou et al., 2016). Although these methods are really good complements to experimental methods for detecting m 6 A sites, to the best of our knowledge, so far there is no computational tool available for detecting m 6 A sites in microbial genomes.
Stimulated by the successful applications of machine learning methods in computational genomics and proteomics (Chen et al., 2012;Feng et al., 2013;Cao et al., 2016Cao et al., , 2017aHu et al., 2018), in the present work, we presented a support vector machine (SVM) based method for identifying m 6 A sites in the Escherichia coli (E. coli) genome. By encoding the RNA sequences using nucleotide chemical property and accumulated nucleotide frequency, the proposed method obtained promising performances in 10-fold cross validation test. Moreover, we also validated the method on the independent dataset and obtained satisfactory results.

Benchmark Dataset
The m 6 A site containing sequences of E. coli genome were obtained from the RMBase database (Xuan et al., 2018). All the sequences are 41 bp long with the m 6 A site in the center. To overcome redundancy and reduce the homology bias, sequences with more than 80% sequence similarity were removed by using the CD-HIT program (Fu et al., 2012). After such a screening procedure, 2,055 m 6 A site containing sequences were retained and regarded as positive samples.
The negative samples (non-m 6 A site containing sequences) were obtained by choosing the 41-bp long sequences with the central adenosine that was not experimentally confirmed occurring methylation on its 6th nitrogen. By doing so, we could obtain a large number of negative samples. After removing sequences with identify >80%, the number of negative samples are still dramatically larger than that of positive samples. To balance out the numbers between positive and negative samples in model training, we randomly picked out the same number of negative samples and repeated this process 10 times. Therefore, 10 negative subsets were obtained, and each of them includes 2,055 non-m 6 A site containing sequences. The positive and negative samples thus obtained are provided in Supplementary Material.

Sequence Encoding Scheme
Inspired by recent studies (Chen et al., 2016b(Chen et al., ,c,d, 2017aFeng et al., 2017), in order to transfer the RNA sequences into discrete vectors that can be recognized and handled by machine learning methods, we encoded RNA sequences using nucleotide chemical properties and accumulated nucleotide frequency. Their brief descriptions are as following.
In order to include nucleotide composition surrounding the modification site as well, the accumulated nucleotide frequency of any nucleotide n j at position i was also used to represent RNA sequences and was defined as where |N i | is the length of the sliding substring concerned, l denotes each of the site locations counted in the substring, qǫ{A, C, G, U}. By integrating both nucleotide physicochemical properties and accumulated nucleotide frequency, an L nt long RNA sequence could be represented a 4L-dimensional vector (Chen et al., 2016b(Chen et al., ,c,d, 2017aFeng et al., 2017).

Support Vector Machine
As an efficient supervised machine learning algorithm, SVM has been widely used in the realm of bioinformatics (Cao et al., 2014;Li et al., 2017;Wang et al., 2017b;Zhang et al., 2017). Its basic idea is to transform the input data into a high dimensional feature space and then determine the optimal separating hyperplane.
In the current study, the implementation of SVM was performed by using the LibSVM package 3.18, available at http:// www.csie.ntu.edu.tw/~cjlin/libsvm/. The radial basis kernel function (RBF) was used to obtain the classification hyperplane. The grid search method was applied to optimize its regularization parameter C and kernel parameter γ .

Evaluation Metrics
The performance was evaluated by using the following four metrics, namely sensitivity (Sn), specificity (Sp), Accuracy (Acc), and the Mathew's correlation coefficient (MCC), which can be expressed as where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. To further evaluate the performance of the current method more objectively, inspired by recent works (Wang et al., 2017a), the ROC (receiver operating characteristic) curve was also plotted. Its vertical coordinate indicates the true positive rate (sensitivity) and the horizontal coordinate indicates the false positive rate (1-specificity). The area under the ROC curve (auROC) is an indicator of the performance quality of a binary classifier, i.e., the value 0.5 of auROC is equivalent to random prediction while the value 1 of auROC represents a perfect one.

Performance for m 6 A Site Identification
In statistical prediction, independent dataset test, K-fold crossvalidation test and jackknife test are often used to derive the metric values for a predictor (Chou, 2011). In order to saving computational time, the 10-fold cross-validation test was used to examine the performance of the proposed method. In 10-fold cross-validation test, the samples in the dataset are randomly partitioned into 10 equal sized sub-datasets. Of the 10 subdatasets, a single sub-dataset is retained as the validation data for testing the model, and the remaining 9 sub-datasets are used as training data. The process is then repeated 10 times, with each of the 10 sub-datasets used exactly once as the validation data. By encoding RNA sequences using nucleotide chemical property and accumulated nucleotide frequency, each sample in the dataset was represented by a (4 × 41) = 164-dimensional vector and used as the input of SVM. The 10-fold cross-validation test results for identifying m 6 A sites in E. coli were listed in Table 1. In addition, to demonstrate that whether its accuracy is sensitive to the selection of negative data, the method was also tested on the other nine negative datasets, respectively. Their predictive results of the 10-fold cross-validation were also provided in Table 1.
As indicated in Table 1, we found that the predictive accuracy is not affected by the selection of negative data. In addition, the 10 ROC curves obtained based on the 10 different negative datasets were also plotted in Figure 1. It was found that their auROCs are all higher than 0.98. These results demonstrate the reliability and robustness of the model developed in this study.

Comparison With Other Methods
In order to demonstrate the effectiveness of nucleotide chemical property and accumulated nucleotide frequency for identifying m 6 A sites in E. coli, we compared the performance of the proposed method with that of the method based on other commonly used RNA sequence   (Chen et al., 2014a,b), in which both the local and global sequence order information w included. Since it has been proposed in 2014, PseKNC have been used in in many branches of computational genomics (Guo et al., 2014;Lin et al., 2014Lin et al., , 2017. Therefore, we employed the SVM to perform the comparisons between the model based on nucleotide chemical property and accumulated nucleotide frequency features and that based on the PseKNC features (Chen et al., 2015a). The 10-fold cross-validation test results were listed in Table 2.
As indicated in a recent study (Schwartz et al., 2013), the m 6 A modification is also affected by RNA secondary structures. Therefore, we performed the prediction of m 6 A sites by using RNA secondary structure. To this end, all the sequences in the benchmark dataset were encoded by using their second structures. The details about the encoding scheme based on secondary structures can be found in a recent work (Xue et al., 2005). By doing so, each RNA sequence is converted to a 32 dimensional vector (Xue et al., 2005) and used as the input feature of SVM. Its 10-fold cross-validation test results were also listed in Table 2.
As shown in Table 2, the predictive performance of the method based on nucleotide chemical property and accumulated nucleotide frequency is dramatically higher than that based on PseKNC and RNA secondary structure.

Validation on Independent Dataset
The proposed method trained based on the benchmark dataset from the E. coli genome was further used to identify the m 6 A sites in the P. aeruginosa genome. For this purpose, we firstly collected the 5,814 experimentally confirmed m 6 A sites from the RMBase to form an independent dataset, which is given in Supporting Information S2. Of the 5,814 m 6 A sites in the P. aeruginosa, 5,809 were correctly identified, indicating that the proposed method is really quite promising for identifying m 6 A sites in microbial genomes.

CONCLUSION
In this study, we present a computational method to identify m 6 A sites in the E. coli genome by encoding the RNA sequences using nucleotide chemical property and accumulated nucleotide frequency. The results obtained based on the benchmark dataset and independent dataset demonstrate that the proposed method is powerful and promising in discovering m 6 A sites. We hope that the proposed method will be helpful for the future research on m 6 A sites in microbial genomes.
Since user-friendly and publicly accessible web-servers (Feng et al., 2018)and databases  represent the direction of developing new prediction method, we will make efforts in our future work to provide a web-server for the method presented in this paper.

AUTHOR CONTRIBUTIONS
HL and WC: conceived and designed the experiments; JZ and PF: performed the experiments; HL and WC: wrote the paper.