Edited by: Gajendra PS Raghava, Indraprastha Institute of Information Technology Delhi, India
Reviewed by: Leyi Wei, Tianjin University, China; Zhi-Ping Liu, Shandong University, China
*Correspondence: Hui Liu,
This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Protein-RNA interactions play essential roles in many biological processes. Quantifying the binding affinity of protein-RNA complexes helps in understanding protein-RNA recognition mechanisms and in identifying strong binding partners. Because experimentally measured protein-RNA binding affinity data remain limited, there is a pressing demand for accurate and reliable computational approaches. In this paper, we propose a computational approach, PredPRBA, which predicts protein-RNA binding affinity using gradient boosted regression trees. We build a dataset of protein-RNA binding affinity that includes 103 protein-RNA complex structures manually collected from the related literature. Then, we generate 37 kinds of sequence and structural features and explore the relationship between the features and protein-RNA binding affinity. We find that the binding affinity mainly depends on the structure of RNA molecules. According to the type of RNA bound to the protein in each complex, we split the 103 protein-RNA complexes into six categories. For each category, we build a gradient boosted regression tree (GBRT) model based on the generated features. We perform a comprehensive evaluation of the proposed method on the binding affinity dataset using leave-one-out cross-validation. We show that PredPRBA achieves correlations ranging from 0.723 to 0.897 among six categories, which is significantly better than other typical regression methods and the pioneer protein-RNA binding affinity predictor SPOT-Seq-RNA. In addition, a user-friendly web server has been developed to predict the binding affinity of protein-RNA complexes. The PredPRBA webserver is freely available at
Protein-RNA interactions play a crucial role in many biological processes, such as gene expression and its regulation (
In the past decade, many methods have been developed to identify protein-RNA interactions
Although the protein-RNA docking benchmark has played an important role in studying multiple aspects of protein-RNA interactions, it is still of limited use for quantifying the binding affinity of protein-RNA interactions. A standard non-redundant dataset of protein-RNA complexes is a prerequisite for the development and validation of protein-RNA binding affinity studies. Since the lack of protein-RNA binding affinity datasets has become a bottleneck in the development of more accurate scoring functions,
In this work, we have developed a method, referred to as PredPRBA, to predict the quantitative binding affinity of protein-RNA complexes. The flowchart of our method is shown in
The flowchart of the PredPRBA method for predicting the binding affinity of protein-RNA complexes. It involves four steps:
We initially collect 173 protein-RNA complexes to extract quantitative protein-RNA binding affinity, among which 73 complexes come from a non-redundant protein-RNA binding benchmark dataset (
Where
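As a hedged illustration of how a measured dissociation constant maps onto a free energy in kcal/mol, the sketch below applies the standard thermodynamic relation ΔG = RT ln(Kd). The temperature of 298.15 K, the function name `kd_to_delta_g`, and the sign convention are assumptions made for this example; the affinities in this paper are quoted as positive magnitudes, which would correspond to the absolute value of the result.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) for a dissociation constant given in mol/L.

    Assumption: the standard relation dG = R * T * ln(Kd) with a 1 M reference state.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

# A 1 nM binder: the result is negative; its magnitude is the reported affinity.
delta_g = kd_to_delta_g(1e-9)
print(round(abs(delta_g), 2))
```

A tenfold change in Kd shifts the result by about 1.36 kcal/mol at this temperature, which gives a sense of scale for the prediction errors reported later.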
It is worth noting that previous findings have demonstrated that the structure of RNA molecules greatly influences the binding affinity between proteins and RNAs (
We extract a total of 37 kinds of features to predict the binding affinity of the protein-RNA complexes. These features fall into four categories: features based on protein sequences, protein structures, RNA sequences, and RNA structures.
We extract the protein sequences from the PDB files and then calculate the total molecular mass of the protein fraction based on the molecular weight of each amino acid. Also, the total number of hydrogen bonds (
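The mass calculation described above can be sketched as follows. This is a minimal example under stated assumptions: it uses average molecular weights of the free amino acids and subtracts one water per peptide bond, which may differ from the exact convention used in this work. The names `AA_MASS` and `protein_mass` are illustrative only.

```python
# Average molecular weights (Da) of the 20 standard free amino acids.
AA_MASS = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "E": 147.13, "Q": 146.15, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}
WATER = 18.015  # mass of the water released per peptide bond

def protein_mass(sequence):
    """Approximate mass (Da) of a protein chain from its one-letter sequence."""
    seq = sequence.upper()
    unknown = set(seq) - set(AA_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not seq:
        return 0.0
    # Sum the free amino acid masses, then remove one water per peptide bond.
    return sum(AA_MASS[aa] for aa in seq) - (len(seq) - 1) * WATER

print(round(protein_mass("GG"), 3))  # glycylglycine
```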
We use the DSSP algorithm (
We use the RNA sequences in the protein-RNA complexes to obtain the molecular mass of the RNA molecules. The computational formula is as below.
in which
A number of features based on the RNA structure are derived to predict protein-RNA binding affinities. We use RNAfold from the ViennaRNA package (
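Secondary-structure predictors such as RNAfold report a fold in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The sketch below shows one way simple base-pair counts could be derived from such a string. The helper names are hypothetical, and these counts are not claimed to be the exact structural features used in this work.

```python
def paired_positions(dot_bracket):
    """Return sorted (i, j) index pairs for the base pairs in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)

def paired_fraction(dot_bracket):
    """Fraction of nucleotides that take part in a base pair."""
    if not dot_bracket:
        return 0.0
    return 2 * len(paired_positions(dot_bracket)) / len(dot_bracket)

hairpin = "((((....))))"  # illustrative 12-nt hairpin with a 4-nt loop
print(len(paired_positions(hairpin)), paired_fraction(hairpin))
```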
Ensemble learning algorithms are a family of powerful machine-learning techniques that have shown considerable success in many applications (
Without loss of generality, the features and the real-valued binding affinities can be described as an
Let
where
where
However, it is not straightforward to solve Eq. (5). Therefore, GBRT separately and approximately estimates (ǀ
where
When the
where
Then, a new additive function
where 0 <
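The additive, stage-wise fitting described in this section can be illustrated with a short, dependency-free sketch. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses depth-1 regression trees (stumps) as base learners. Both choices, along with the names `fit_stump`, `fit_gbrt`, `n_trees`, and `learning_rate`, are assumptions made for illustration and not necessarily the configuration used in PredPRBA.

```python
from statistics import mean

def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the single split minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # every feature is constant: fall back to the mean residual
        constant = mean(residuals)
        return lambda row: constant
    _, j, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[j] <= threshold else right_mean

def fit_gbrt(X, y, n_trees=100, learning_rate=0.1):
    """Gradient boosting for squared-error loss with shrinkage."""
    initial = mean(y)                      # constant initial model
    predictions = [initial] * len(y)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of 0.5 * (y - F(x))^2 with respect to F(x).
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(X, residuals)
        trees.append(tree)
        # Additive update, damped by the learning rate (0 < rate <= 1).
        predictions = [pi + learning_rate * tree(row)
                       for pi, row in zip(predictions, X)]
    return lambda row: initial + learning_rate * sum(t(row) for t in trees)

X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 1.0, 3.0, 3.0]
model = fit_gbrt(X, y, n_trees=200)
print(round(model([1.0]), 3), round(model([4.0]), 3))
```

A smaller learning rate generally needs more trees to reach the same training fit, which matters for classes that contain only a handful of complexes.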
The performance is evaluated using the Pearson correlation coefficient (
in which
In addition, the mean absolute error (MAE) (
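For reference, the three evaluation measures named here can be computed as in the sketch below. It uses the textbook definitions of the Pearson correlation coefficient, the mean absolute error, and the coefficient of determination (one minus the ratio of residual to total sum of squares). These definitions are assumed to match the ones used in this work, and the function names are illustrative.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation between experimental and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated but badly scaled
print(pearson_r(y_true, y_pred), mean_absolute_error(y_true, y_pred),
      r_squared(y_true, y_pred))
```

The toy data illustrate that the correlation coefficient ignores scale and offset errors while the coefficient of determination does not, so the two measures can diverge for the same set of predictions.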
We conduct iterative feature selection independently for each class of protein-RNA complexes, because the binding affinity of different classes of complexes is influenced by the structure of the RNAs and proteins. In particular, we build a protein-RNA binding affinity prediction model on each feature individually and compute its Pearson correlation coefficient. Next, we sort the features in descending order of correlation coefficient and select the top 10 features for each class of complexes. Finally, we adopt a greedy algorithm that adds one feature to the optimal feature set at each step until the performance stops increasing. The selected features are shown in the
Selected features to predict protein-RNA binding affinity of each class of protein-RNA complexes.
Feature | Class I | Class II | Class III | Class IV | Class V | Class VI |
---|---|---|---|---|---|---|
molecular weight of RNA | √ | |||||
total value of the relative solvent accessible surface area | √ | √ | ||||
number of hydrophilic residues in the protein | √ | √ | ||||
number of hydrophobic residues in the protein | √ | |||||
% of hydrophilic residues in the protein | √ | |||||
% of hydrophobic residues in the protein | √ | √ | √ | √ | ||
% of the aromatic and positively charged residues in the protein | √ | |||||
number of the aromatic and positively charged residues in the protein | √ | |||||
number of the charged residues in protein | √ | √ | ||||
number of the polar residues in protein | √ | √ | ||||
molecular weight of |
√ | √ | ||||
molecular weight of |
√ | |||||
number of cWW | √ | |||||
relative frequency of cWW | √ | √ | √ | |||
frequency of the MFE structure | √ |
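The selection procedure described before the table (rank features individually, keep the top 10, then add features greedily while the score improves) can be sketched as below. The scoring callable `score_fn` is a hypothetical stand-in; in this setting it would return the cross-validated Pearson correlation of a model trained on the given feature subset.

```python
def select_features(feature_names, score_fn, top_k=10):
    """Greedy forward selection over the top_k individually best features.

    score_fn(list_of_feature_names) -> float, higher is better (hypothetical
    callable supplied by the caller).
    """
    # Step 1: rank every feature by the score of a single-feature model.
    ranked = sorted(feature_names, key=lambda f: score_fn([f]), reverse=True)
    candidates = ranked[:top_k]

    # Step 2: repeatedly add the feature that raises the score the most.
    selected, best_score = [], float("-inf")
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        score, feature = max(
            ((score_fn(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if score <= best_score:   # performance stopped increasing
            break
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy scorer: each feature contributes a fixed amount to the score.
gains = {"rna_mass": 5, "pct_hydrophobic": 3, "noise": -1}
chosen, score = select_features(list(gains), lambda fs: sum(gains[f] for f in fs))
print(chosen, score)
```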
We first conduct an experiment to check the significance of classifying protein-RNA complexes by RNA type. For each class of complexes, we use the top one and top two features to train GBRT prediction models and compute the performance measures, respectively. For comparison, we pool all the complexes to train the prediction model using the top one and top two features. The results are shown in
Performance of models built on the best one and two features for six classes of protein-RNA complexes.
Class | Number of complexes | Maximum correlation coefficient (r), single property | Maximum correlation coefficient (r), two properties |
---|---|---|---|
Class I | 21 | 0.565 | 0.725 |
Class II | 34 | 0.452 | 0.546 |
Class III | 8 | 0.567 | 0.669 |
Class IV | 9 | 0.616 | 0.663 |
Class V | 11 | 0.422 | 0.521 |
Class VI | 20 | 0.511 | 0.615 |
All | 103 | 0.178 | 0.332 |
For each class of protein-RNA complexes, we train the GBRT model using the selected features to predict binding affinities. The correlation coefficients, together with MAE and R2 measures, are shown in
Performance measures of PredPRBA in leave-one-out cross-validation.
Class | Correlation coefficient (r) | Mean absolute error (MAE) | Coefficient of determination (R2) |
---|---|---|---|
Class I | 0.818 | 1.215 | 0.623 |
Class II | 0.731 | 1.145 | 0.518 |
Class III | 0.894 | 1.270 | 0.288 |
Class IV | 0.803 | 0.749 | 0.489 |
Class V | 0.768 | 1.425 | 0.255 |
Class VI | 0.762 | 0.879 | 0.531 |
Average value | 0.796 | 1.114 | 0.451 |
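The leave-one-out protocol behind these measures can be written compactly: each complex is held out once, a model is trained on the remaining complexes, and the held-out affinity is predicted. The sketch below assumes a hypothetical `fit` callable that returns a prediction function, and uses a trivial mean predictor as a placeholder for the actual regression model.

```python
def leave_one_out_predictions(X, y, fit):
    """Predict each sample with a model trained on all the other samples.

    fit(X_train, y_train) must return a callable mapping one feature row to a
    prediction (a hypothetical interface chosen for this sketch).
    """
    predictions = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]      # drop the held-out row
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(model(X[i]))
    return predictions

def fit_mean(X_train, y_train):
    """Placeholder learner: always predicts the training mean."""
    average = sum(y_train) / len(y_train)
    return lambda row: average

X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 3.0]
print(leave_one_out_predictions(X, y, fit_mean))
```

With only 8 to 34 complexes per class, this protocol uses nearly all of the data for training in every round, at the cost of fitting one model per complex.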
Scatterplot in the coordinate of experimental
Next, we further evaluate the performance of the method for predicting the binding affinity in different classes and reveal the features that dominate the prediction of binding affinity of protein-RNA complexes. The predicted and actual values of binding affinities for each complex in six classes of complexes are shown in
The predicted and actual binding affinities, represented by Δ
In this class of complex, proteins interact with single-stranded RNA molecules that are very common
The interacting partners in this class of protein-RNA complexes are proteins and double-stranded RNA. The binding affinities range from 6 to 14 kcal mol-1. Three selected features are used to build the prediction model, which obtains a correlation coefficient of 0.731. The physicochemical properties of the protein play the most important role in predicting the binding affinity of this class of complexes. In particular, the number of hydrophobic residues and the number of polar residues in the protein are important features, which demonstrates that the physicochemical properties of the interacting proteins have a major impact on the interaction between proteins and double-stranded RNA.
This class of complexes is composed of proteins and tRNA molecules, and four features enable our model to achieve a correlation coefficient of 0.872. From
RNA loop structure includes many types, such as hairpin loops, internal loops, etc. (
One interacting partner of this class of protein-RNA complexes is the small RNA fragment. There are 11 complexes in our dataset, and the average binding affinity is 9.78 kcal mol-1. As shown in
The complexes that do not fall into the above five categories are assigned to the miscellaneous class. Because the structure of the RNA in this class of complexes is uncertain and available software cannot determine it, we assume that the features influencing the binding affinity of this class might differ from those of the other classes. This class consists of 20 complexes, and the binding affinities range from 6 to 15 kcal mol-1. Four features are included in our model to predict the binding affinity, and the correlation coefficient is 0.76 in leave-one-out cross-validation. The molecular weight of α-helix and the number of aromatic and positively charged residues in the protein are identified as important factors influencing the binding affinity. Moreover, among the protein sequence-based features, the percentages of hydrophilic and hydrophobic residues in the protein also play a vital role.
To verify that using both protein-derived and RNA-derived features improves the performance of our prediction models, we build two other GBRT prediction models, referred to as the protein-based and RNA-based prediction models, using only protein-derived features or only RNA-derived features. Next, we compare their performance to that of PredPRBA, which takes advantage of both protein-derived and RNA-derived features.
Performance comparison of PredPRBA to protein-based and RNA-based prediction models.
Class | Protein-based model | RNA-based model | PredPRBA |
---|---|---|---|
Class I | 0.562 | 0.818 | 0.818 |
Class II | 0.652 | 0.436 | 0.731 |
Class III | 0.894 | 0.634 | 0.894 |
Class IV | 0.642 | 0.621 | 0.803 |
Class V | 0.768 | 0.547 | 0.768 |
Class VI | 0.762 | 0.635 | 0.762 |
Average | 0.71 | 0.62 | 0.796 |
Inspired by the study of protein-RNA interactions by Liu et al. (
Performance comparison of PredPRBA to sequence feature-based and structure feature-based models.
Class | Sequence-based model | Structure-based model | PredPRBA |
---|---|---|---|
Class I | 0.661 | 0.711 | 0.818 |
Class II | 0.618 | 0.635 | 0.731 |
Class III | 0.883 | 0.765 | 0.894 |
Class IV | 0.696 | 0.735 | 0.803 |
Class V | 0.661 | 0.697 | 0.768 |
Class VI | 0.736 | 0.665 | 0.762 |
Average | 0.71 | 0.70 | 0.796 |
We evaluate PredPRBA by conducting performance comparison with several other typical regression methods, such as Linear Regression (LR) (
Comparison of correlation coefficients between PredPRBA and other regression algorithms.
Class | SVR | DTR | LR | KNNR | ERRT | RFR | PredPRBA |
---|---|---|---|---|---|---|---|
Class I | 0.541 | 0.356 | 0.604 | 0.411 | 0.760 | 0.641 | 0.818 |
Class II | 0.356 | 0.621 | 0.456 | 0.476 | 0.685 | 0.695 | 0.731 |
Class III | 0.708 | 0.449 | 0.634 | 0.628 | 0.458 | 0.535 | 0.894 |
Class IV | 0.389 | 0.669 | 0.696 | 0.602 | 0.588 | 0.724 | 0.803 |
Class V | 0.366 | 0.395 | 0.432 | 0.492 | 0.215 | 0.343 | 0.768 |
Class VI | 0.157 | 0.377 | 0.374 | 0.636 | 0.519 | 0.400 | 0.762 |
Average | 0.42 | 0.52 | 0.53 | 0.54 | 0.54 | 0.56 | 0.796 |
Comparison of mean correlation coefficients over six classes of protein-RNA complexes between PredPRBA and typical regression methods.
The SPOT-Seq-RNA (
Comparison of correlation coefficients between the SPOT-Seq-RNA method and PredPRBA.
Class | Number of complexes | Correlation coefficient (r), SPOT-Seq-RNA | Correlation coefficient (r), PredPRBA |
---|---|---|---|
Class I | 21 | 0.442 | 0.818 |
Class II | 34 | -0.044 | 0.731 |
Class III | 8 | -0.038 | 0.894 |
Class IV | 9 | 0.172 | 0.803 |
Class V | 11 | 0.756 | 0.768 |
Class VI | 20 | 0.386 | 0.762 |
Average | 17 | 0.276 | 0.796 |
In this paper, we propose a method for predicting the binding affinities of protein-RNA complexes using sequence-based and structure-based features. To the best of our knowledge, the dataset of binding affinities of 103 protein-RNA complexes that we built is the largest to date. For each class of protein-RNA complexes, we have conducted a systematic analysis of the importance of features in predicting the binding affinity and found that structural features play a vital role in governing protein-RNA binding affinity. We also compared our method with several typical regression methods and an existing binding affinity prediction method, and the comparison shows that our method achieves the best performance. In addition, we have developed a web server for predicting the binding affinity of protein-RNA complexes, which is free and open to the academic community.
The datasets for this study can be found in the
LD, WY, and HL designed the study and conducted experiments. LD and WY performed statistical analyses. LD and HL drafted the manuscript. WY prepared the experimental materials and benchmarks. All authors have read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China under grant no. 61672541 and no. 61672113, and Natural Science Foundation of Hunan Province under grant no. 2017JJ3412.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: