ORIGINAL RESEARCH article

Front. Genet., 27 May 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.670852

m6AGE: A Predictor for N6-Methyladenosine Sites Identification Utilizing Sequence Characteristics and Graph Embedding-Based Geometrical Information

  • 1. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China

  • 2. School of Artificial Intelligence, Jilin University, Changchun, China

Abstract

N6-methyladenosine (m6A) is one of the most prevalent RNA post-transcriptional modifications and is involved in various vital biological processes such as mRNA splicing, exporting, stability, and so on. Identifying m6A sites contributes to understanding the functional mechanism and biological significance of m6A. The existing biological experimental methods for identifying m6A sites are time-consuming and costly. Thus, developing a high confidence computational method is significant to explore m6A intrinsic characters. In this study, we propose a predictor called m6AGE which utilizes sequence-derived and graph embedding features. To the best of our knowledge, our predictor is the first to combine sequence-derived features and graph embeddings for m6A site prediction. Comparison results show that our proposed predictor achieved the best performance compared with other predictors on four public datasets across three species. On the A101 dataset, our predictor outperformed 1.34% (accuracy), 0.0227 (Matthew’s correlation coefficient), 5.63% (specificity), and 0.0081 (AUC) than comparing predictors, which indicates that m6AGE is a useful tool for m6A site prediction. The source code of m6AGE is available at https://github.com/bokunoBike/m6AGE.

Introduction

N6-methyladenosine (m6A) is one of the most prevalent RNA post-transcriptional modifications. It was first found in mammalian RNA in 1974 (Desrosiers et al., 1974). Subsequently, m6A modification was observed in various species, such as Saccharomyces cerevisiae (Schwartz et al., 2013), Arabidopsis (Luo et al., 2014), humans and mouse (Dominissini et al., 2012). Research shows that m6A sites are enriched in long internal exons and 3′UTRs around stop codons rather than randomly distributed in the genome (Dominissini et al., 2012; Meyer et al., 2012; Wan et al., 2015). It has been reported that m6A modification is associated with many biological processes, including but not limited to protein translation and localization (Meyer and Jaffrey, 2014), mRNA splicing and stability (Nilsen, 2014), RNA localization and degradation (Meyer and Jaffrey, 2014). Therefore, precisely identifying m6A sites contributes to understanding the regulatory mechanism and biological significance of m6A modification.

High-throughput techniques have enabled locating the m6A sites in genomes. MeRIP-Seq (or m6A-Seq), a combination of immunoprecipitation and next-generation sequencing technology, has successfully mapped m6A in several species genomes (Dominissini et al., 2012; Schwartz et al., 2013; Wan et al., 2015). In 2015, Chenet al. developed photo-crosslinking-assisted m6A-sequencing (PA-m6A-seq) which provided a high-resolution (about 23nt) mammalian map (Chen et al., 2015a). MeRIP-Seq and PA-m6A-seq can only locate the high methylation regions of m6A rather than the exact positions. In the same year, Linder produced a single-nucleotide resolution map of m6A sites using a new technology termed miCLIP (Linder et al., 2015). However, the current experimental methods face a lot of limitations and expensive costs. With the rapid development of computational methods, it is possible to use machine learning algorithms to predict m6A. Hence, building advanced models to predict m6A sites is significant for the following research of m6A.

Chen et al. (2015b) proposed the first predictor named iRNA-Methyl for m6A sites in Saccharomyces cerevisiae, using three physical-chemical properties of dinucleotide and SVM classifier. WHISTLE (Chen et al., 2019) integrates genomic features besides the sequence features to train a predictor with SVM classifier. Liu and Chen (2020) developed a computational method called iMRM for detecting different RNA modifications simultaneously with XGBoost classifier. Recently, deep learning methods show better performance trend in bioinformatics problems. DeepM6ASeq (Zhang and Hamada, 2018), BERMP (Huang et al., 2018), Gene2vec (Zou et al., 2019), DeepPromise (Chen et al., 2020), and im6A-TS-CNN (Liu et al., 2020) establish deep learning frameworks by using convolutional neural network (CNN) layers and gated recurrent unit (GRU) to seek the m6A sites on DNA/RNA sequence level on the same dataset as SRAMP (Zhou et al., 2016). In this study, seven kinds of sequence-derived features are employed to encode RNA sequences, including CTD (Tong and Liu, 2019), Pseudo k-tuple Composition (PseKNC) (Guo et al., 2014), nucleotide pair spectrum (NPS) (Zhou et al., 2016), nucleotide pair position specificity (NPPS) (Xing et al., 2017), nucleotide chemical properties and density (NCP-ND) (Golam Bari et al., 2013), electron-ion interaction pseudopotentials (EIIP) (Nair and Sreenadhan, 2006), and bi-profile Bayes (BPB) (Shao et al., 2009). Besides, graph embedding methods are innovatively introduced to distill the potential structure information. Firstly, a network is constructed by mapping each sample of the dataset to a node. Secondly, the three graph embedding methods SocDim (Tang and Liu, 2009), Node2Vec (Grover and Leskovec, 2016), and GraRep (Cao et al., 2015) are used to learn the distributed representation of the sample in an unsupervised manner. At last, all the feature vectors are merged as the input of model. The predictive results show that m6AGE improves the performance of identifying m6A sites.

Materials and Methods

Datasets

The m6A sites of different species share different consensus motifs. The adenosines lying within the consensus motif are considered to be the potential methylation sites. The samples in the dataset are RNA sequence segments with the potential methylation sites at their center. The samples with the m6A sites experimentally annotated are put into the positive dataset, whereas the other samples are put into the negative dataset.

There have been many datasets across multiple species for training m6A site predictors. We have collected four datasets that involve three species: Saccharomyces cerevisiae, Arabidopsis thaliana, and human. The following are details of these datasets.

A101. Wang extracted A.thaliana m6A sites from the m6A peak data of Luo et al. (2014) and Wan et al. (2015). The dataset (Wang and Yan, 2018) Wang built contains 2,518 positive samples and 2,518 negative samples. Every sample in the dataset is a 101nt RNA sequence segment.

A25. Luo obtained 4,317 m6A peaks detected both in Can-0 and Hen-16 strains. After removing the sequences with more than 60% sequence similarity, Chen et al. (2016) obtained 394 positive samples. The same number of negative samples were selected randomly from sequences without the m6A site. The length of every sample is 25nt.

S21. Chen further constructed this dataset (Chen et al., 2015c) based on the previous work (Chen et al., 2015b). They selected 832 RNA sequence segments as the positive samples in the training set whose distances to the m6A-seq peaks are less than 10nt. Then, 832 of 33,280 RNA sequence segments with non-methylated adenines were selected randomly as negative samples in the training set. The rest 475 RNA sequences with methylated adenine and 4750 of 33,280 RNA sequences with non-methylated adenine constitute the independent testing dataset. The length of every sample is 21nt.

H41. Chen obtained the m6A-containing sequences in Homo sapiens from RMBase (Chen et al., 2017). All the m6A sites in these sequences conform to the RRACH motif. The dataset contains 1,130 positive samples and 1,130 negative samples. The length of every sample is 41nt.

Construction of Input Feature

Conventional machine learning models require numerical vectors as input features. The feature extraction methods selected have an important impact on the performance of the model. To fully characterize the context of m6A sites, seven sequence-derived features were used. In addition, we build a network based on the whole dataset, by mapping each sample to node and the similarity between samples to edges in the network, and then use graph embedding (neighborhood-based node embedding) methods to extract features in an unsupervised manner. The computational framework of our predictor is illustrated in Figure 1. In the following, we will introduce the sequence-derived features and the graph embeddings, respectively.

FIGURE 1

Sequence-Derived Features

CTD Feature

CTD (Tong and Liu, 2019) is one of the global sequence descriptors. The first descriptor C (nucleotide composition) describes the percentage composition of each nucleotide in the sequence. The second descriptor T (nucleotide transition) describes the frequency of four different nucleotides present in adjacent positions. The third descriptor D (nucleotide distribution) describes five relative positions of each nucleotide along the RNA sequence which are the first one, 25%, 50%, 75%, and the last one.

PseKNC Feature

With the successful application of the pseudo component method in peptide sequence processing, its idea has been further extended to the study of DNA and RNA sequences feature representation. The Pseudo k-tuple Composition (PseKNC) combines the local and global sequence information of RNA (Guo et al., 2014) and transforms an RNA sequence into the following vector:

where,

where du(u = 1,2,…,4k) is the occurrence frequency of the u-th k-nucleotide in this RNA sequence; the parameter w is the weight factor; the parameter λ is the number of totals counted tiers of the correlations along an RNA sequence. The j-tier correlation factor θj is defined as follows:

The correlation function Θ(,) is calculated by the following formula:

where μ is the number of RNA physicochemical properties used. RiRi+1 is the dinucleotide at position i of this RNA. Pν(RiRi + 1) is the standardized numerical value of the ν-th RNA physicochemical properties for dinucleotide RiRi +1.

Six RNA physicochemical properties are considered: “Rise”, “Roll”, “Shift”, “Slide”, “Tilt”, “Twist”.

NPS Feature

The nucleotide pair spectrum (NPS) (Zhou et al., 2016) encoding method describes the RNA sequence context of the site by calculating the occurrence frequency of all k-spaced nucleotide pairs in the sequence. The k-spaced nucleotide pair n1{k}n2 means that there are k arbitrary nucleotides between n1 and n2, and its occurrence frequency is calculated as follows:

where C(n1{k}n2) is the count of n1{k}n2 in this RNA sequence, and L is the sequence length. The parameter k ranges from 1 to dmax. The parameter dmax is set to 3, so this encoding method transforms an RNA sequence into a vector DNPS with a dimension of 4×4×3=48.

NPPS Feature

The nucleotide pair position specificity (NPPS) (Xing et al., 2017) encoding method extracts statistical information by calculating the frequency of single nucleotide and k-spaced nucleotide pairs at specific locations. Based on the positive training dataset, we can get the frequency matrix

where the element of is the frequency of single nucleotide appearing at each location in the positive training dataset; the element of is the frequency of k-spaced nucleotide pair appearing at each location in the positive training dataset; and L is the sequence length. The frequency matrix and are calculated similarly on the negative training dataset.

Assuming that the i-th nucleotide is “A” and the (i + k)-th nucleotide is “C”, is calculated through conditional probability formula and frequency matrix:

NPPS encoding method transforms a sequence into a vector DNPPS = [pk + 2,…,pL] with a dimension of Lk−1, where .

NCP-ND Feature

Different nucleotides have different chemical properties. According to the difference of ring structure (purine or pyrimidine), hydrogen bond (strong or weak), and functional group (amino or keto), nucleotide A, U, C, and G can be represented by (1, 1, 1), (0, 1, 0), (0, 0, 1), and (1, 0, 0), respectively (Golam Bari et al., 2013).

The nucleotide density (ND) is used to measure the relevance between the frequency and position of the i-th nucleotide ni in the sequence:

where L is the sequence length. Combined with the chemical properties of nucleotides, each sequence is transformed into a vector DNCPND with a dimension of L×4.

EIIP Feature

This encoding method uses the electron-ion interaction pseudopotentials (EIIP) values (Nair and Sreenadhan, 2006) to represent the nucleotide in the sequence. The EIIP values of nucleotides A, T (we replace T with U), C, G are 0.1260, 0.1340, 0.0806, and 0.1335, respectively. Thus the dimension of the vector DEIIP is equal to the sequence length.

BPB Feature

The Bi-profile Bayes (BPB) encoding method was first proposed by (Shao et al., 2009), and then has been successfully applied in other fields of bioinformatics. This method uses the occurrence frequency fi,n of the i-th nucleotide n to estimate the posterior probability pi,n, and transforms a sequence into the following vector:

where n is the i-th nucleotide of the sequence; denotes the frequency of nucleotide n appearing at the i-th position of the sequence in the positive training dataset, while denotes the frequency of nucleotide n appearing at the i-th position of sequence in the negative training dataset. L is the sequence length. The dimension of the vector DBPB is 2×L.

Graph Embeddings

Network Construction

To extract the graph embedding feature of each sample, we construct a network based on the whole dataset. Each sample in the dataset is taken as a node, and the relationships between samples are taken as edges. Generally, edges exist two similar sample nodes. The fast linear neighbor similarity approach (FLNSA) (Zhang et al., 2017, 2019) is a method to extract “sample-sample” similarity, which has been successfully applied to many bioinformatics classification tasks. In this study, FLNSA is utilized to calculate the similarity between samples.

First, we extract sequence-derived features and use the feature fusion strategy to transform all the samples in the dataset into n-dimensional vector {x1,x2,…,xm}, where xi(0 < im) is the vector of the i-th sample. Then these vectors are concentrated into a matrix XRm×n, each row of which represents a sample vector. FLNSA tries to minimize the objective function:

where ⊙ is the Hadamard product operator; ||⋅||F represents the Frobenius norm and μ is the regularization coefficient. e is an m-dimensional column vector with all elements equal to 1. The element wi,j of matrix WRm×m represents the reconstruction contribution weight of the sample xj to the sample xi, and is used to quantify the similarity between two samples. The element of indicator CRm×m is

where N(xi) denotes the set of all neighbors of xi. The Euclidean distances between xi and other samples are calculated and the nearest c(0 < c < m) samples are selected to form N(xi). FLNSA uses the Lagrange method to get matrix W. After mathematical derivation, the Equation (13) is obtained.

Randomly generated matrix W was updated according to Equation (13) until convergence. Taking W as the adjacency matrix, an undirected weighted graph G is obtained. The graph embedding methods require a connected graph as input. Note that if G is not connected, we can increase c (the number of neighborhoods of a sample). Under the condition of ensuring the connectivity of the graph, the edges whose weights are lower than the threshold t are removed and the weights of the remaining edges are set to 1. Finally, an undirected unweighted graph is constructed based on the dataset.

SocDim

The social-dimension-based (SocDim) (Tang and Liu, 2009) method is proposed by Lei Tang and Huan Liu to solve the relational learning between nodes in social networks. This method extracts latent dimensions from networks and uses them as distributed representations, which involves community detection tasks.

SocDim uses Modularity (Newman, 2006) which measures community structure through degree distribution to extract potential dimensions. Modularity considers dividing the network into non-overlapping communities, measures the deviation between the network and uniform random graphs with the same degree distribution, and then obtains the modularity matrix B defined as follows:

where A is the interaction matrix of the network; d is a column vector composed of the degrees of each node; m is the number of nodes. Subsequently, SocioDim extracts the dimensions from the top eigenvectors of the modularity matrix B.

Node2Vec

Node2Vec (Grover and Leskovec, 2016) attempts to design a graph embedding model that can train efficiently and retain the neighborhood information of nodes to the maximum extent. The embedding vectors of nodes are learned through the skip-gram model. Different from DeepWalk, Node2Vec proposes biased random walk instead of truncated random walk to control the search space. Node2vec considers the homophily (nodes from the same community have similar embeddings) and structural equivalence (nodes that share similar roles have similar embeddings), thus there are two classic search strategies: Breadth-first Sampling (BFS) and Depth-first Sampling (DFS).

GraRep

GraRep (Cao et al., 2015) proposes a graph embedding model that can be learned from weighted graphs and integrate global structure information of the graph. GraRep forms k different vectors by separating k kinds of relationships. For a specific k, GraRep samples a set of k-step paths from the graph. The k-step path which starts with node vw and ends with node vc is denoted as (vw,vc). For all pairs, it increases the probability of the pairs come from the graph and decreases the probability of the pairs do not come from the graph. Based on the normalized adjacency matrix, GraRep obtains Wk for different values of k, and each column vector of Wk represents an embedding of the node. Finally, this method concatenates all the k-step representations W1,W2,…,Wk.

CatBoost Classifier

CatBoost (Dorogush et al., 2018; Prokhorenkova et al., 2018) is an improved implementation of gradient enhanced decision trees (GDBT) algorithm developed by Yandex. It has demonstrated excellent performance on many classification and regression tasks. Compared with other advanced gradient boosting algorithms such as XGBoost (Chen and Guestrin, 2016) and lightBGM (Ke et al., 2017), CatBoost has the following advantages: (1) It can better process categorical features. (2) To solve the problem of gradient bias and prediction shift, ordered boosting is proposed instead of the classic GDBT gradient estimation algorithm. (3) The requirement of super parameter tuning is reduced.

CatBoost uses oblivious decision trees (Langley and Sage, 1994) as base predictors. As oblivious decision trees are balanced, they can prevent overfitting. Moreover, it optimizes the traditional boosting algorithm which transforms the category features into numerical features, and the algorithm of calculating the leaf value to improve the generalization ability of the model. Since the CatBoost algorithm is running on GPU, the model is trained efficiently and parallelly.

Evaluation Metrics

Our predictor predicts whether the adenosine at the center of an RNA sequence segment is an m6A site. We used the following metrics to evaluate the performance of binary classification predictors: accuracy (ACC), Matthew’s correlation coefficient (MCC), sensitivity (SEN), specificity (SPE), and F1. These metrics are calculated as follows:

where TP is the number of true positive samples; TN is the number of true negative samples; FP is the number of false positive samples; FN is the number of false negative samples.

Additionally, the receiver operating characteristic (ROC) curve is also an important measurement to evaluate the performance of classifiers, and the area under receiver operating characteristic curve (AUC) is the quantitative indicator. High values of AUC indicate better performance of predictors.

Results

We redivided the four datasets introduced in section “Datasets” into the training sets and test sets with the ratio of 4:1, respectively. The training datasets were used to train models and the test datasets were utilized to evaluate model performance. Due to the difference between datasets, we selected suitable sequence-derived features for each dataset. For A101, the PseKNC, CTD, and NPS features were selected; For A25, the EIIP, NPPS, NPS, PseKNC, and NCP-ND were selected; For S21, the NPPS and NCP-ND features were selected; For H41, the NCP-ND, PseKNC, and NPPS features were selected.

Comparison With Existing Predictors

In this section, we compared the performance of our predictor m6AGE with several other state-of-the-art predictors, including M6A-HPCS (Zhang et al., 2016), Targetm6A(Li et al., 2016), RAM-NPPS (Xing et al., 2017), M6APred-EL (Wei et al., 2018), and DeepM6ASeq (Zhang and Hamada, 2018). M6A-HPCS uses PseDNC and DACC features and a support vector machine (SVM) classifier to identify m6A sites. Targetm6A utilizes position-specific kmer propensities (PSKP) feature and SVM classifier. RAM-NPPS uses the NPPS feature and SVM classifier to identify m6A sites. M6APred-EL creates an ensemble model with PseKNC, PSKP, and NCP-ND features. DeepM6ASeq develops a deep learning framework and uses one-hot encoding for the identification of m6A sites. The predictor M6A-HPCS, M6APred-EL, Targetm6A, and RAM-NPPS were reproduced faithfully, and their parameters were optimized by grid search with five-fold cross-validation. All predictors were trained and evaluated on the same dataset for fairness of comparison.

The evaluation results were summarized in Table 1. We employed ACC, MCC, SEN, SPE, and AUC as evaluation metrics, and compared the evaluation metrics of m6AGE with five other predictors on three datasets: A101, A25, and H41. As shown in Table 1, our predictor m6AGE achieved all optimal values on three datasets, except for SEN and SPE on the A25 dataset, and AUC on the H41 dataset.

TABLE 1

DatasetsPredictorsMetrics
ACC (%)MCCSEN (%)SPE (%)AUC
A101m6AGE89.110.782290.4987.680.9500
M6A-HPCS86.430.728686.6486.220.9284
Targetm6A87.360.747187.6587.060.9358
RAM-NPPS83.860.677786.4481.210.9077
M6APred-EL86.020.720585.6386.430.9055
DeepM6ASeq87.770.759593.3282.050.9419
A25m6AGE87.970.770874.6598.850.8867
M6A-HPCS68.350.357761.9773.560.7238
Targetm6A82.910.654276.0688.510.8370
RAM-NPPS82.910.653877.4687.360.8621
M6APred-EL87.340.764271.83100.000.8464
DeepM6ASeq77.850.551567.6186.210.8054
H41m6AGE90.930.832581.94100.000.9181
M6A-HPCS71.460.433664.7678.220.7765
Targetm6A90.490.824981.06100.000.9205
RAM-NPPS90.490.824981.06100.000.9051
M6APred-EL89.820.813679.74100.000.9132
DeepM6ASeq86.500.756673.5799.560.9051

The performance of m6AGE against other existing predictors.

The optimal value of each evaluation metric is marked in bold.

On the A101 dataset, m6AGE obtained the optimal ACC, MCC, SPE, and AUC with 89.11%, 0.7822, 87.68%, and 0.9500, which is 1.34%, 0.0227, 5.63%, and 0.0081 higher than the suboptimal predictor DeepM6ASeq, respectively.

On the A25 dataset, m6AGE obtained the optimal ACC, MCC, and AUC with 87.97%, 0.7708, and 0.8867. Its Acc and MCC is 0.63% and 0.0066 higher than the suboptimal value of predictor M6APred-EL. Its AUC is 0.0246 higher than the suboptimal value of predictor RAM-NPPS.

On the H41 dataset, m6AGE obtained the optimal ACC, MCC, SEN, and SPE with 90.93%, 0.8325, 81.94%, and 100%, which is 0.44%, 0.0076, 0.88%, and 0 higher than the predictor Targetm6A and RAM-NPPS, respectively.

The ROC curves of these predictors on three datasets were plotted in Figure 2. As shown in Figure 2, our predictor outperformed other predictors on the A101 and A25 datasets. Although the AUC of m6AGE on dataset H41 is lower than other predictors, m6AGE achieved the optimal value of ACC, MCC, SEN, and SPE. These evaluation results demonstrate that our predictor m6AGE is superior to other predictors in terms of these three datasets.

FIGURE 2

Performance on Imbalanced Dataset

The non-m6a sites on mRNA are much more than m6A sites, so testing the performance of our predictor on imbalanced datasets is of great importance. The imbalance ratio of the S21 dataset is about 1:4. We redivided the S21 dataset, and randomly selected 80% samples as the training set, and the remaining 20% samples as the test set.

CatBoost solves the imbalance data issues by setting weights for each class or sample. The weight of each class is generally inversely proportional to the number of its samples. The metrics F1 and MCC are usually used as the evaluation criteria for imbalanced datasets (Zhao et al., 2018; Wang et al., 2019; Dou et al., 2020). We compared m6AGE with five other predictors on the S21 dataset.

The evaluation results were summarized in Table 2. The optimal value of each evaluation metric is marked in bold. As shown in Table 2, our predictor m6AGE got the optimal values of F1, MCC, and AUC with 0.5723, 0.4593, and 0.8103.

TABLE 2

PredictorsMetrics
SEN (%)SPE (%)F1MCCAUC
m6AGE68.6883.020.57230.45930.8103
HPCS71.7046.630.36220.14590.6330
Targetm6A70.5776.730.52600.39840.7818
RAM-NPPS66.4281.490.54400.42180.7778
M6APred-EL78.5975.200.55540.44330.7899
DeepM6ASeq63.7783.380.54600.42530.8056

The performance of different predictors on S21 dataset.

The optimal value of each evaluation metric is marked in bold.

The ROC curves of these predictors on the S21 dataset were plotted in Figure 3. As shown in Figure 3, our predictor outperformed other predictors on the S21 dataset.

FIGURE 3

Comparison With Different Classifiers

To further demonstrate the effectiveness of CatBoost, we compared it with other popular classifiers, including Random Forest, Logistic Regression, and Decision Tree, which are commonly and widely used in bioinformatics classification tasks. All classifiers were trained and assessed under the same conditions for a fair comparison.

The prediction results were summarized in Table 3. We compared the prediction results with three other classifiers on the A101, A25, and H41 dataset. The evaluation metrics used are ACC, MCC, SEN, SPE, and AUC. As shown in Table 3, CatBoost achieved all optimal metrics on three datasets, except for SPE on the A101 dataset and SEN on the A25 and H41 dataset.

TABLE 3

DatasetsClassifiersMetrics
ACC (%)MCCSEN (%)SPE (%)AUC
A101CatBoost89.110.782290.4987.680.9500
Random forest87.670.753487.0488.310.9377
Logistic regression89.000.780089.0788.940.9489
Decision tree80.990.619782.3979.540.8096
A25CatBoost87.970.770874.6598.850.8867
Random forest87.340.764271.83100.000.8729
Logistic regression79.110.576774.6582.760.8562
Decision tree81.650.634984.5179.310.8191
H41CatBoost90.930.832581.94100.000.9181
random forest89.380.803179.7499.110.9098
Logistic regression86.950.742282.3891.560.9125
Decision tree86.280.725885.4687.110.8629

The performance of different classifiers.

The optimal value of each evaluation metric is marked in bold.

Feature Importance Analysis

CatBoost can output the scores of feature importance, which reflect the contributions of the features in specific feature space for identifying m6A sites. The first 20 important features and their scores on the four datasets were plotted in Figure 4.

FIGURE 4

On the A101 dataset, the first three important sequence-derived features are “PseKNC_44”, “PseKNC_59”, and “PseKNC_40”, which correspond to the occurrence frequency of “GUA”, “UGU”, and PseKNC_40 respectively, On the A25 dataset, the first three important sequence-derived features are “NCP_ND_58”, “NPPS_xi2_14”, and “NPPS_xi1_14,” which correspond to the position +1 (Assuming that the position of m6A site is 0), +2 and +4, +2 and +3, respectively; On the H41 dataset, the first three important sequence-derived features are “NPPS_xi1_20”, “NPPS_xi1_22”, and “NCP_ND_72”, which correspond to the position 0 and +1, +2 and +3, −3, respectively; On the S21 dataset, the first three important sequence-derived features are “NPPS_xi1_17”, “NPPS_xi2_17”, and “NPPS_xi1_18,” which correspond to the position +6 and +7, +6 and +8, +7 and +9, respectively.

In addition, graph embeddings account for 20%, 25%, 35%, and 50% of the top 20 important features in the four datasets, respectively, which indicates that graph embeddings could supplement the information of the sequence-derived features.

Discussion

The methods for extracting sequence features are indispensable for building a reliable predictor. Contributing sequence features, such as the physical and chemical properties of nucleotides, the frequency of k-nucleotides, and the frequency of specific positions, can fully reflect the information related to the m6A site recognition. In this study, we integrated and selected suitable sequence-derived features for each dataset. However, most of the feature encoding methods are based on the primary sequence, and only a few of them calculate the frequency of nucleotides in the training dataset, so it is difficult to obtain more helpful information from the whole dataset. This paper innovatively introduces a feature extraction method based on the graph embedding methods as a supplement to sequence-derived features. First of all, a network is constructed based on the whole dataset and sequence-derived features. Samples are abstracted as nodes of the network, and the similarity relationships between samples are abstracted as edges. This network reflects global information of the whole dataset. Then, graph embedding (neighborhood-based node embedding) methods are used to learn the feature representation of each node in an unsupervised manner. The graph embedding features of samples contain the related information with other samples. Finally, we integrate sequence-derived features and graph embeddings based with the feature fusion strategy. Therefore, the final input features can reflect the information of samples more comprehensively.

It is also significant to choose an appropriate classifier. CatBoost is a GBDT algorithm, which shows excellent performance in many classification tasks. Because of its good effect of restraining overfitting and fast running, the CatBoost algorithm is selected to train our predictor m6AGE.

To further prove the effectiveness of our predictor, we compare the evaluation results with that of other existing m6A site predictors. The results show that our predictor m6AGE outperforms other existing methods. In the future, we will apply m6AGE to more m6A site datasets and seek more suitable graph embedding methods. It is worth mentioning that the computational framework proposed in this study is possible to extend to other bioinformatics site identification tasks.

The source code of m6AGE is available at https://github.com/bokunoBike/m6AGE. Users can download and run it on the local machines. The data is imported through the file paths of the positive training set, negative training set, and test set. Then m6AGE is trained and generates prediction results. Note that the corresponding python packages need to be installed first (see GitHub page for details). For a new dataset, our predictor will automatically select the appropriate sequence-derived features (or specified by the users in the corresponding configuration file) according to the feature importance scores.

Conclusion

The identification of N6-methyladenosine (m6A) modification sites on RNA is of biological significance. In this study, a novel computational framework called “m6AGE” is proposed to predict and identify the m6A sites on mRNA. Our predictor combines sequence-derived features with the features extracted by graph embedding methods. The context information of sites is directly extracted from primary sequences by the sequence-derived features, and the global information is extracted by the graph embeddings. Experiments showed that the proposed m6AGE achieved successful prediction performance on four datasets across three species. It could be expected that m6AGE would be a powerful computational tool for predicting and identifying the m6A modification sites on mRNA.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. Codes and data are available here: https://github.com/bokunoBike/m6AGE which contains detailed steps to run m6AGE.

Author contributions

YW and RG conceived the algorithm and developed the program. RG, YW, and SY wrote the manuscript and prepared the datasets. YW and SY helped with manuscript editing, design. XMH, LH, and KH helped to revise the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62072212), the Development Project of Jilin Province of China (Nos. 20200401083GX, 2020C003), and Guangdong Key Project for Applied Fundamental Research (No.2018KZDXM076). This work was also supported by Jilin Province Key Laboratory of Big Data Intelligent Computing (No. 20180622002JC).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    CaoS.LuW.XuQ. (2015). “GraRep: learning graph representations with global structural information,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, (New York, NY:Association for Computing Machinery), 891900. 10.1145/2806416.2806512

  • 2

    ChenT.GuestrinC. (2016). “XGBoost: a scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY:Association for Computing Machinery), 785794. 10.1145/2939672.2939785

  • 3

    ChenK.LuZ.WangX.FuY.LuoG.-Z.LiuN.et al (2015a). High-Resolution N 6 -methyladenosine (m 6 A) map using photo-crosslinking-assisted m 6 a sequencing.Angew. Chemie12716071610. 10.1002/ange.201410647

  • 4

    ChenW.FengP.DingH.LinH.ChouK. C. (2015b). IRNA-Methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition.Anal. Biochem.4902633. 10.1016/j.ab.2015.08.021

  • 5

    ChenW.FengP.DingH.LinH. (2016). Identifying N 6-methyladenosine sites in the Arabidopsis thaliana transcriptome.Mol. Genet. Genomics29122252229. 10.1007/s00438-016-1243-7

  • 6

    ChenW.TangH.LinH. (2017). MethyRNA : a web server for identification of N–methyladenosine sites.J. Biomol. Struct. Dyn.110215. 10.1080/07391102.2016.1157761

  • 7

    ChenW.TranH.LiangZ.LinH.ZhangL. (2015c). Identification and analysis of the N6-methyladenosine in the Saccharomyces cerevisiae transcriptome.Sci. Rep.5:13859. 10.1038/srep13859

  • 8

    ChenK.WeiZ.ZhangQ.WuX.RongR.LuZ.et al (2019). WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach.Nucleic Acids Res.47:e41. 10.1093/nar/gkz074

  • 9

    ChenZ.ZhaoP.LiF.WangY.SmithA. I.WebbG. I.et al (2020). Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences.Brief. Bioinform.2116761696. 10.1093/bib/bbz112

  • 10

    DesrosiersR.FridericiK.RottmanF. (1974). Identification of methylated nucleosides in messenger RNA from novikoff hepatoma cells.Proc. Natl. Acad. Sci. U.S.A.713971L3975. 10.1073/pnas.71.10.3971

  • 11

    DominissiniD.Moshitch-MoshkovitzS.SchwartzS.Salmon-DivonM.UngarL.OsenbergS.et al (2012). Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq.Nature485201206. 10.1038/nature11112

  • 12

    DorogushA. V.ErshovV.GulinA. (2018). CatBoost: gradient boosting with categorical features support.arXiv [Preprint]arXiv:1810.11363,

  • 13

    DouL.LiX.DingH.XuL.XiangH. (2020). IRNA-m5C_NB: a novel predictor to identify RNA 5-methylcytosine sites based on the naive bayes classifier.IEEE Access88490684917. 10.1109/ACCESS.2020.2991477

  • 14

    Golam BariA. T. M.ReazM. R.ChoiH. J.JeongB. S. (2013). “DNA encoding for splice site prediction in large DNA sequence,” in Database Systems for Advanced Applications.DASFAA 2013. Lecture Notes in Computer Science, Vol. 7827edsHongB.MengX.ChenL.WiniwarterW.SongW. (Berlin: Springer), 10.1007/978-3-642-40270-8_4

  • 15

    GroverA.LeskovecJ. (2016). “Node2vec: scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY:Association for Computing Machinery), 855864. 10.1145/2939672.2939754

  • 16

    GuoS. H.DengE. Z.XuL. Q.DingH.LinH.ChenW.et al (2014). INuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition.Bioinformatics3015221529. 10.1093/bioinformatics/btu083

  • 17

    HuangY.HeN.ChenY.ChenZ.LiL. (2018). BERMP: a cross-species classifier for predicting m 6 a sites by integrating a deep learning algorithm and a random forest approach.Int. J. Biol. Sci.1416691677. 10.7150/ijbs.27819

  • 18

    KeG.MengQ.FinleyT.WangT.ChenW.MaW.et al (2017). LightGBM: a highly efficient gradient boosting decision tree.Adv. Neural Inf. Process. Syst.3031473155. 10.1016/j.envres.2020.110363

  • 19

    LangleyP.SageS. (1994). “Oblivious decision trees and abstract cases,” in Proceedings of the Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, (Seattle, WA:AAAI Press), 113117.

  • 20

    LiG. Q.LiuZ.ShenH. B.YuD. J. (2016). TargetM6A: identifying N6-Methyladenosine sites from RNA sequences via position-specific nucleotide propensities and a support vector machine.IEEE Trans. Nanobiosci.15674682. 10.1109/TNB.2016.2599115

  • 21

    LinderB.GrozhikA. V.Olarerin-GeorgeA. O.MeydanC.MasonC. E.JaffreyS. R. (2015). Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome.Nat. Methods12767772. 10.1038/nmeth.3453

  • 22

    LiuK.CaoL.DuP.ChenW. (2020). im6A-TS-CNN: identifying the N6-methyladenine site in multiple tissues by using the convolutional neural network.Mol. Ther. Nucleic Acids2110441049. 10.1016/j.omtn.2020.07.034

  • 23

    LiuK.ChenW. (2020). IMRM: a platform for simultaneously identifying multiple kinds of RNA modifications.Bioinformatics3633363342. 10.1093/bioinformatics/btaa155

  • 24

    LuoG. Z.MacqueenA.ZhengG.DuanH.DoreL. C.LuZ.et al (2014). Unique features of the m6A methylome in Arabidopsis thaliana.Nat. Commun.5:5630. 10.1038/ncomms6630

  • 25

    MeyerK. D.JaffreyS. R. (2014). The dynamic epitranscriptome : N 6 -methyladenosine and gene expression control. 1974.Nat. Rev. Mol. Cell Biol.15313326. 10.1038/nrm3785

  • 26

    MeyerK. D.SaletoreY.ZumboP.ElementoO.MasonC. E.JaffreyS. R. (2012). Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons.Cell14916351646. 10.1016/j.cell.2012.05.003

  • 27

    NairA. S.SreenadhanS. P. (2006). A coding measure scheme employing electron-ion interaction pseudopotential (EIIP).Bioinformation1197202.

  • 28

    NewmanM. E. J. (2006). Modularity and community structure in networks.Proc. Natl. Acad. Sci. U.S.A.10385778582. 10.1073/pnas.0601602103

  • 29

    NilsenT. W. (2014). Internal mRNA methylation finally finds functions stirring the simmering.Science34312071208. 10.1126/science.1249340

  • 30

    ProkhorenkovaL.GusevG.VorobevA.DorogushA. V.GulinA. (2018). “CatBoost: Unbiased Boosting with Categorical Features’, in Proceedings of the 32nd International Conference on Neural Information Processing Systems NIPS’18. (Red Hook, NY, Unites States: Curran Associates Inc.), 66396649. Availble online at: https://nips.cc/Conferences/2018

  • 31

    SchwartzS.AgarwalaS. D.MumbachM. R.JovanovicM.MertinsP.ShishkinA.et al (2013). High-Resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis.Cell15514091421. 10.1016/j.cell.2013.10.047

  • 32

    ShaoJ.XuD.TsaiS. N.WangY.NgaiS. M. (2009). Computational identification of protein methylation sites through Bi-profile Bayes feature extraction.PLoS One4:e4920. 10.1371/journal.pone.0004920

  • 33

    TangL.LiuH. (2009). “Relational learning via latent social dimensions,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, (New York, NY:Association for Computing Machinery), 817825. 10.1145/1557019.1557109

  • 34

    TongX.LiuS. (2019). CPPred: coding potential prediction based on the global description of RNA sequence.Nucleic Acids Res.47e43e43. 10.1093/nar/gkz087

  • 35

    WanY.TangK.ZhangD.XieS.ZhuX.WangZ.et al (2015). Transcriptome-wide high-throughput deep m6A-seq reveals unique differential m6A methylation patterns between three organs in Arabidopsis thaliana.Genome Biol.16:272. 10.1186/s13059-015-0839-2

  • 36

    WangH.MaY.DongC.LiC.WangJ.LiuD. (2019). CL-PMI: a precursor microRNA identification method based on convolutional and long short-term memory networks.Front. Genet.10:967. 10.3389/fgene.2019.00967

  • 37

    WangX.YanR. (2018). RFAthM6A: a new tool for predicting m6A sites in Arabidopsis thaliana.Plant Mol. Biol.96327337. 10.1007/s11103-018-0698-9

  • 38

    WeiL.ChenH.SuR. (2018). M6APred-EL: a sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning.Mol. Ther. Nucleic Acids12635644. 10.1016/j.omtn.2018.07.004

  • 39

    XingP.SuR.GuoF.WeiL. (2017). Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine.Sci. Rep.7:46757. 10.1038/srep46757

  • 40

    ZhangM.SunJ. W.LiuZ.RenM. W.ShenH. B.YuD. J. (2016). Improving N6-methyladenosine site prediction with heuristic selection of nucleotide physical–chemical properties.Anal. Biochem.508104113. 10.1016/j.ab.2016.06.001

  • 41

    ZhangW.ChenY.TuS.LiuF.QuQ. (2017). “Drug side effect prediction through linear neighborhoods and multiple data source integration,” in Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), (Shenzhen: IEEE), 427434. 10.1109/BIBM.2016.7822555

  • 42

    ZhangW.TangG.WangS.ChenY.ZhouS.LiX. (2019). “Sequence-derived linear neighborhood propagation method for predicting lncRNA-miRNA interactions,” in Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Vol. 1 (Madrid: IEEE), 5055. 10.1109/BIBM.2018.8621184

  • 43

    ZhangY.HamadaM. (2018). DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning.BMC Bioinformatics19(Suppl. 19):524. 10.1186/s12859-018-2516-4

  • 44

    ZhaoZ.PengH.LanC.ZhengY.FangL.LiJ. (2018). Imbalance learning for the prediction of N6-Methylation sites in mRNAs.BMC Genomics19:574. 10.1186/s12864-018-4928-y

  • 45

    ZhouY.ZengP.LiY. H.ZhangZ.CuiQ. (2016). SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features.Nucleic Acids Res.44:e91. 10.1093/nar/gkw104

  • 46

    ZouQ.XingP.WeiL.LiuB. (2019). Gene2vec: gene subsequence embedding for prediction of mammalian N 6 -methyladenosine sites from mRNA.RNA25205218. 10.1261/rna.069112.118

Summary

Keywords

m6A, machine learning, graph embedding, feature fusion, CatBoost

Citation

Wang Y, Guo R, Huang L, Yang S, Hu X and He K (2021) m6AGE: A Predictor for N6-Methyladenosine Sites Identification Utilizing Sequence Characteristics and Graph Embedding-Based Geometrical Information. Front. Genet. 12:670852. doi: 10.3389/fgene.2021.670852

Received

22 February 2021

Accepted

29 April 2021

Published

27 May 2021

Volume

12 - 2021

Edited by

Wilson Wen Bin Goh, Nanyang Technological University, Singapore

Reviewed by

Panagiotis Alexiou, Central European Institute of Technology (CEITEC), Czechia; Jia Meng, Xi’an Jiaotong-Liverpool University, China

Updates

Copyright

*Correspondence: Sen Yang,

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics