Recognizing Ion Ligand–Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle

Liu, Liu; Hu, Xiuzhen; Feng, Zhenxing; Wang, Shan; Sun, Kai; Xu, Shuang

doi:10.3389/fbioe.2020.00493

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 12 June 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00493

Recognizing Ion Ligand–Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle

Liu Liu

Xiuzhen Hu^*

Zhenxing Feng

Shan Wang

Kai Sun

Shuang Xu

College of Sciences, Inner Mongolia University of Technology, Hohhot, China

The prediction of ion ligand–binding residues in protein sequences is a challenging work that contributes to understand the specific functions of proteins in life processes. In this article, we selected binding residues of 14 ion ligands as research objects, including four acid radical ion ligands and 10 metal ion ligands. Based on the amino acid sequence information, we selected the composition and position conservation information of amino acids, the predicted structural information, and physicochemical properties of amino acids as basic feature parameters. We then performed a statistical analysis and reclassification for dihedral angle and proposed new methods on the extraction of feature parameters. The methods mainly included applying information entropy on the extraction of polarization charge and hydrophilic–hydrophobic information of amino acids and using position weight matrices on the extraction of position conservation information. In the prediction model, we used the random forest algorithm and obtained better prediction results than previous works. With the independent test, the Matthew's correlation coefficient and accuracy of 10 metal ion ligand–binding residues were larger than 0.07 and 52%, respectively; the corresponding evaluation values of four acid radical ion ligand–binding residues were larger than 0.15 and 86%, respectively. Further, we classified and combined the phi and psi angles and optimized prediction model for each ion ligand–binding residue.

Introduction

The protein is the foundation of life and plays an important role in the life activities. Proteins carried out various functions by interactions, such as protein–protein interaction, protein–DNA interaction, protein–RNA interaction, and protein–ion ligand interaction (Lin et al., 2005; Ebert and Altman, 2008; Pan and Shen, 2018; Al-Mugotir et al., 2019; Emamjomeh et al., 2019; Robin et al., 2019). Because more than half of the proteins required binding with ion ligands for functions, research of ion ligand–binding residues on proteins was of great significance. However, it was difficult to accurately predict the ion ligand–binding residues on the protein sequence because of the small size and high versatility of ion ligands.

Current theoretical prediction methods of ligand-binding residues can be roughly classified into sequence-based method and three-dimensional (3D) structure–based method. The experiments showed that the accuracy of the 3D structure–based prediction was higher than that of sequence-based prediction (Yang et al., 2013, 2018). However, the number of proteins with known amino acid sequence was far more than that with known 3D structure. Although the prediction accuracy of sequence-based method is not as satisfactory as 3D structure–based, sequence-based method is still generally used.

In general, there were three key points in predicting the ion ligand–binding residues with theoretical methods: research objects selection, the selection, and extraction of feature parameters, and the selection of algorithms. In recent decades, researchers conducted different studies on binding residues of metal and acid radical ion ligands. In these studies, most researchers were working on the development of features to improve the prediction results of ion ligand–binding residues. Among the feature parameters used to predict ion ligand–binding residues, position conservation information, and composition of amino acids were two commonly used basic feature parameters (Sodhi et al., 2004; Komiyama et al., 2015; Hu et al., 2016a; Liu et al., 2019; Wang et al., 2019). Besides, based on the biological background of interaction between ion ligands and proteins, researchers added physicochemical properties of amino acids, secondary structure, and relative solvent accessibility (RSA) to identify ion ligand–binding residues (Lin et al., 2006; Jiang et al., 2016; Cao et al., 2017; Li et al., 2017). Using these features obtained improved prediction results. In this work, we further improved prediction result by optimizing the extraction method of features. In the previous work, on the extraction of position conservation information of amino acids, some researchers used the position specific score matrix (PSSM) method to extract it (Sodhi et al., 2004; Hu et al., 2016a), whereas the other ones used the position weight scoring algorithm to extract its score values (Liu et al., 2019; Wang et al., 2019). However, the dimension of PSSM is excessively high, which will potentially lead to the overfitting problem, and the dimension of the score values is too low, which will lose a lot of information. Thus, we constructed the position weight matrices to extract the 2L-dimensional position conservation information of amino acids. In terms of extraction of the hydrophilic–hydrophobic information of amino acids, autocross covariance formula was attempted (Jiang et al., 2016). However, the method did not take into account the different number of amino acid species contained in each class of hydrophilic–hydrophobic properties. Same problem also exists in the classification of polarized charge. To settle these problems, we used the information entropy to extract the polarization charge and the hydrophilic–hydrophobic information of amino acids.

Dihedral angle of amino acid sequence can specify protein backbone conformation. Therefore, some researchers selected dihedral angle as feature and obtained improved results. But the extraction method, using two-dimensional real values of phi and psi angles as features (Hu et al., 2016b; Cui et al., 2019), ignored character of dihedral angle of each ion ligand–binding residue. In this work, the phi and psi angles were performed by using statistical analysis and reclassification, and they were extracted as feature parameters. The random forest (RF) algorithm is a strong classifier integrated with multiple weak classifiers. Fewer parameters are needed to be set, and it will cause less overfitting phenomenon in general. So, we finally used the RF algorithm to make the prediction model on the basis of combined feature parameters. By adding reclassified dihedral angle feature, we obtained improved prediction results and optimized the prediction model for each ion ligand–binding residue. Further, we compared our prediction results with the results of Artificial Neural Networks (ANNs) and Support Vector Machine (SVM) and turned out that the RF algorithm had a better prediction result.

Materials and Methods

Dataset

The dataset of ion ligand–binding residues constructed in this article was from the BioLip database. On the basis of literatures (Hu et al., 2016a; Cao et al., 2017), we selected the protein chains with the length longer than 50 residues and resolution below 3Å and then removed the protein chains whose pairwise sequence identity was higher than 30% by using the CD-HIT software (Li and Godzik, 2006). Inspired by the literature (Hu et al., 2016a), we finally selected the binding residues of 10 metal ion ligands (Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Co²⁺, Ca²⁺, Mg²⁺, Mn²⁺, Na⁺, K⁺) and four acid radical ion ligands ( ${NO}_{2}^{-}$ , ${CO}_{3}^{2 -}$ , ${SO}_{4}^{2 -}$ , ${PO}_{4}^{3 -}$ ) as research objects.

The interaction between the protein and ion ligand was related to both binding residues and their surrounding residues. Thus, we used the sliding window method to cut the protein chain into the corresponding sequence segments. If sequence segment center was an ion ligand–binding residue, it was defined as positive segment; otherwise, it was defined as the negative segment. In order to make each amino acid residue on the protein chains appear in the center of the sequence segment, we added (L-1)/2 virtual amino acids to the end of the protein chains. L is the sequence segment length. The non-redundant datasets of 14 ion ligands are shown in Table 1. By a number of calculations, the obtained optimal window lengths of Zn²⁺, Fe²⁺, Fe³⁺, Cu²⁺, Mn²⁺, Co²⁺, Ca²⁺, Mg²⁺, Na⁺, K⁺, ${NO}_{2}^{-}$ , ${CO}_{3}^{2 -}$ , ${SO}_{4}^{2 -}$ , and ${PO}_{4}^{3 -}$ were 7, 9, 9, 13, 7, 11, 9, 9, 9, 11, 13, 15, 13, and 13, respectively. The binding residues of 10 metal ion ligands and the corresponding optimal window lengths selected in this article were the same as in the literature (Cao et al., 2017).

TABLE 1

Table 1. Benchmark datasets of 14 ion ligands.

Because the number of negative segments was greater than positive segments, we used random sampling method in mathematics to balance the segments number in positive and negative datasets. Specifically, we randomly selected negative segments with the equal number of positive segments to compose negative datasets. In order to ensure the stability of the result, the negative samples were randomly sampled 10 times, and the final prediction result was the average value of the 10 results.

The Selection and Extraction of Feature Parameters

Amino Acid Composition and Position Conservation Information

Our prestudy showed that amino acid composition of metal ion ligand–binding segments was different from amino acid composition of nonbinding segments (Cao et al., 2017). Besides, the literatures (Ansari and Raghava, 2010; Jiang et al., 2016; Li et al., 2017) also showed the amino acid composition was an important feature in the recognition work of ligand-binding residues. Therefore, the amino acid composition was selected as a feature parameter in this article.

According to the description in the literatures (Sodhi et al., 2004; Hu et al., 2016a,b; Cao et al., 2017; Li et al., 2017; Cui et al., 2019) and the statistical analysis of the position conservation in metal ion ligand–binding segments and nonbinding segments done by our group (Cao et al., 2017), we selected the amino acid position conservation information as a feature parameter. Because the dimension of PSSM (20^*L) that is commonly used to extract position conservation information is excessively high, we constructed position weight matrices to extract position conservation information of amino acids (Kel et al., 2003; Gao and Hu, 2014). First, we constructed position weight matrices for positive and negative samples in the training set, respectively. The matrix element m_{i, j} of position weight matrix M was as follows:

\begin{array}{l} m_{i, j} = ln (\frac{p_{i, j}}{p_{0, j}}) & (1) \end{array}

\begin{array}{l} p_{i, j} = \frac{(n_{i, j} + \frac{\sqrt{N_{i}}}{q})}{(N_{i} + \sqrt{N_{i}})} & (2) \end{array}

In the above equation, i represents position, and j runs for 20 different amino acids and virtual residue “X”. p_{o, j} represents the background probability of the amino acid j, and p_{i, j} represents the probability of the amino acid j at the i^th position. n_{i, j} represents the frequency of amino acid j at the i^th position, and N_i is total number of amino acids at the i^th position. q represents the total number of categories, where q = 21. L is the length of the amino acid sequence segments. Taking Mn²⁺ ligand as an example, the position weight matrices constructed for its positive and negative training samples are shown in Figures 1, 2, respectively. For arbitrary amino acid sequence segment, such as “ACFPQSW,” according to the corresponding position weight matrices, we can get a 14-dimensional amino acid position conservation information: (−1.8245, −1.9236, −1.9109, −1.9971, −1.822, −1.9681, −3.808, −1.8407, −1.8452, −1.9661, −1.9129, −1.9378, −1.7342, −1.7195). Finally, the 2L-dimensional amino acid position conservation information was used as a feature parameter.

FIGURE 1

Figure 1. Position weight matrix constructed by the positive training samples of Mn²⁺ ligand.

FIGURE 2

Figure 2. Position weight matrix constructed by the negative training samples of Mn²⁺ ligand.

The Predicted Structural Information

It is well-known that structure of proteins directly determines their function, but not all proteins have experimentally measured 3D structure information. Because the secondary structure, RSA, and dihedral angle can reflect the local structure information of the protein, we selected above information as basic features (Hu et al., 2016a; Cao et al., 2017). In this article, the predicted secondary structure, RSA, and dihedral angle were obtained by ANGLOR software (Yang Zhang Lab; https://zhanglab.ccmb.med.umich.edu/ANGLOR/) (Wu and Zhang, 2008).

The secondary structure types that were predicted by the ANGLOR software included α-helix (H), β-strand (E), and coil (C). In previous studies, its composition and position conservation information were already used to predict ion ligand–binding residues and obtained great prediction results (Hu et al., 2016a,b; Cao et al., 2017; Li et al., 2017). Therefore, composition and 2L-dimensional position conservation features were extracted from secondary structure as feature parameters.

For the predicted RSA, its threshold value 0.25 was usually chosen to indicate whether the residue was exposed (RSA > 0.25) or buried (RSA < 0.25). In the literature (Cao et al., 2017), the four-classification of RSA was used as the basic feature parameter, and prediction results were obviously improved. Therefore, we selected four-classification of RSA in the literature (Cao et al., 2017) as a basic feature parameter and extracted its composition and 2L-dimensional position conservation information as predicted feature parameters. The four-classification of RSA in the literature (Cao et al., 2017) was as follows:

\begin{array}{l} r (x) = {\begin{matrix} I, x \in (0, 0.2] \\ J, x \in (0.2, 0.45] \\ M, x \in (0.45, 0.6] \\ N, x \in (0.6, 0.85] \end{matrix} & (3) \end{array}

Protein backbone dihedral angle specifies the backbone conformation of protein and is important for describing the local conformation of amino acids. Therefore, the dihedral angle information was selected as a basic feature parameter to predict the binding residues of the ion ligands. In a previous prediction work on binding sites of ligand–proteins, Chen et al. used 20 regions of the Ramachandran plot to calculate the value of propensity for ligand binding (Chen and Xu, 2016); Cui et al. used the real values of phi and psi angles to predict the ligand-binding residues on proteins (Cui et al., 2019). However, prediction results obtained by the above two methods did not achieve expectation. Therefore, the phi and psi angles of amino acid residues on protein chains binding with ion ligands were statistically analyzed and reclassified in this article. The degree of phi and psi angles predicted by ANGLOR software retained one decimal point, and the value range of phi and psi angles all was [−180°, 180°]. To simplify statistics, every 15th degree was divided into an interval; [−180°, 180°] was divided into 24 intervals. Then, we performed statistical analysis for the phi and psi angles. Taking Mn²⁺ ligand as an example, the distributions of phi and psi angles for binding and nonbinding residues are shown in Figures 3, 4, respectively.

FIGURE 3

Figure 3. Distribution of phi angle for binding and nonbinding residues of Mn²⁺ ligand. The abscissa represents 24 classification intervals; the ordinate represents the percentage of predicted phi angle that appears in each interval.

FIGURE 4

Figure 4. Distribution of psi angle for binding and nonbinding residues of Mn²⁺ ligand. The abscissa represents 24 classification intervals; the ordinate represents the percentage of predicted psi angle that appears in each interval.

In Figures 3, 4, the abscissa represents 24 classification intervals; the ordinate represents the percentage of predicted phi/psi angle appearing in each interval. We chose thresholds based on the difference of the predicted phi/psi angle between the binding and nonbinding residues. The threshold of phi angle was defined by the function g(x), and the threshold of the psi angle was defined by the function h(x). As can be seen from Figure 3, the difference of phi angle on the Mn²⁺ ligand-binding residues and nonbinding residues was mainly concentrated in two different intervals. Therefore, g(x) was defined as follows:

\begin{array}{l} g (x) = {\begin{matrix} A, x \in [- 18 0^{\circ}, - 7 5^{\circ}] \\ B, x \in [- 7 5^{\circ}, 18 0^{\circ}] \end{matrix} & (4) \end{array}

Similarly, it can be seen from Figure 4 that the difference of psi angle on the Mn²⁺ ligand-binding residues and nonbinding residues was mainly concentrated in three intervals. Therefore, h(x) was defined as follows:

\begin{array}{l} h (x) = {\begin{matrix} A, x \in [- 18 0^{\circ}, 1 5^{\circ}] \\ B, x \in [1 5^{\circ}, 13 5^{\circ}] \\ C, x \in [13 5^{\circ}, 18 0^{\circ}] \end{matrix} & (5) \end{array}

In summary, the reclassification information of phi and psi angles was taken as the basic feature parameters, and its composition and the 2L-dimensional position conservation information were extracted as the feature parameters.

Physicochemical Properties of Amino Acids

Because the ion ligand was charged, it was easy to bind to residues with opposite charge. Therefore, we selected the polarization charge information of amino acids as a basic feature parameter. The 20 amino acids were divided into three categories in this article (Taylor, 1986), as shown in Figure 5.

FIGURE 5

Figure 5. Polarization charge classifications of amino acids.

Actually, ion ligand mainly interacted with the amino acids exposed on the surface of the protein pocket. These amino acids generally showed hydrophilicity, so the hydrophilic–hydrophobic property of amino acids was also selected as a basic feature parameter. The 20 amino acids were divided into six categories according to hydrophilic–hydrophobic property (Pánek et al., 2005), as shown in Figure 6.

FIGURE 6

Figure 6. Hydrophilic–hydrophobic classifications of amino acids.

It can be seen from Figure 5 that the number of uncharged amino acids was many times more than charged amino acids, which made the information of charged amino acids unavailable. To solve this problem, information entropy was applied to extract the polarization charge information. Similarly, in the hydrophilic–hydrophobic classification of amino acids, P, G, and C were respectively divided into one category; the number of amino acids in these categories was far less than that in the other three categories. So, information entropy was also used on the extraction of hydrophilic–hydrophobic information.

Information entropy was proposed by Shannon (1948) to describe the uncertainty of information sources. Although the signal is uncertain, it can be measured according to the probability of its occurrence. For the state space X: {x₁, x₂, …, x_q}, n_j is the occurrence number of x_j (j = 1, 2, …, q), and information entropy was defined as follows:

\begin{array}{l} H (x) = - \sum_{j = 1}^{q} p_{j} {log}_{2} p_{j} & (6) \end{array}

\begin{array}{l} p_{j} = \frac{(n_{j} + \frac{\sqrt{N}}{q})}{(N + \sqrt{N})} & (7) \end{array}

where, $N = \sum_{j = 1}^{q} n_{j} . x_{j}$ represents hydrophilic–hydrophobic (or polarization charge) classifications of amino acids and virtual residue “X”. p_j represents the probability of the information symbol x_j. When x_j represents hydrophilic–hydrophobic classifications of amino acids, q = 7, whereas when it represents polarization charge classifications of amino acids, q = 4. For arbitrary sequence segment, we can get the one-dimensional (1D) information entropy value according to Equation (6). Therefore, we used the 1D information entropy value as a feature.

The RF Algorithm

The RF algorithm is a machine learning algorithm proposed by Breiman (2001). It is composed of multiple independent decision trees, each of which is a classifier. The basic principle is to integrate weak classifiers into one strong classifier, and the final result is determined by voting. The RF algorithm has been successfully applied to the prediction of β-hairpin and protein fold (Jia and Hu, 2011; Chen et al., 2014; Feng and Hu, 2014). The advantage of the RF algorithm is that fewer parameters need to be adjusted (Kandaswamy et al., 2010); in comparison with other algorithms, when the dimension is relatively high, the overfitting problem of RF algorithm is not so serious. The RF algorithm has two important parameters, one is the m ( $m = \sqrt{M},$ and M is the number of the feature initially selected), which is the size of a random feature subset for splitting the nodes; the other is the k that is the number of decision trees in the RF, generally k = 500 (Liaw and Wiener, 2002). In this article, we established our prediction model using the 4.6–12 RF algorithm package of R software version 3.4.3 (Vienna, Austria; https://www.R-project.org/).

The Validation Methods and Evaluation Metrics

The Validation Methods

The constructed model was tested by the 5-fold cross-validation and independent test, which were commonly used on ligand–binding residues predictions (Sodhi et al., 2004; Lin et al., 2005; Hu et al., 2016a,b; Jiang et al., 2016; Cao et al., 2017; Li et al., 2017).

In the 5-fold cross-validation, the dataset was randomly divided into five subsets with same size. Each subset was regarded as test dataset in turn, and the rest of the four were accordingly as training dataset. After prediction process was repeated five times, the average value of five results was adopted as the final result.

In the independent test, the protein chains binding to each ion ligand were divided into two parts: one part accounted for 80% of the total protein chains, which was the training dataset for training the model; the other part accounted for 20% of the total protein chains, which was the independent test dataset for testing the model. Because the protein chains that generated the independent test dataset and the training dataset were different, the data of independent test dataset and training dataset were independent.

The Evaluation Metrics

The following four measures were used to evaluate the prediction performance of ion ligand–binding residues: sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthew's correlation coefficient (MCC) (Hu et al., 2016a,b; Lin et al., 2016; Cao et al., 2017; Li et al., 2017). These measures were defined as:

\begin{array}{l} S_{n} = \frac{T P}{T P + F N} \times 100 % & (8) \end{array}

\begin{array}{l} S_{p} = \frac{T N}{T N + F P} \times 100 % & (9) \end{array}

\begin{array}{l} A c c = \frac{T P + T N}{T P + T N + F P + F N} \times 100 % & (10) \end{array}

\begin{array}{l} M C C = \frac{(T P \times T N) - (F P \times F N)}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}} & (11) \end{array}

where TP represents the number of correctly predicted ion ligand–binding residues, TN represents the number of correctly predicted nonbinding residues, FP represents the number of nonbinding residues predicted as binding residues, and FN represents the number of binding residues predicted as nonbinding residues.

Results and Discussion

The Predicted Results of 5-Fold Cross-Validation

Because of the heavy imbalance in the original dataset, we took the number of positive segments as the standard and randomly sampled the equal number of negative segments 10 times, which generated 10 negative subsets. For each negative subset and the positive set, we extracted related feature parameters. Then, with five-fold cross-validation, we combined the feature parameters and inputted them into the RF algorithm. In this way, the progress was repeated 10 times, and the average result was taken as the result of one subset. Finally, we averaged the above results of 10 subsets as the prediction results.

The Prediction Results Obtained Without Adding Dihedral Angle Information

The composition and 2L-dimensional position conservation information of amino acids, secondary structure, RSA, and information entropy of polarization charge and hydrophilic–hydrophobic were used as feature parameters, and then we inputted them into the RF algorithm to predict the ion ligand–binding residues, and the predicted results obtained by the 5-fold cross-validation are shown in Table 2.

TABLE 2

Table 2. Prediction results without adding dihedral angle information.

As can be seen from Table 2, comparing to the results of Ca²⁺, Mg²⁺, Na⁺, K⁺ ligands, the results of transition metal ions Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Co²⁺, and Mn²⁺ were better, with MCC value greater than 0.630, Acc value >81%, Sn value >76%, and Sp value >86%. This is because transition metal ion binding residues have a preference for conservation. For example, Zn²⁺ ligand–binding residues prefer to use C, H, D, E, and so on; Cu²⁺ ligand–binding residues prefer to use H, C, E, and so on (Cao et al., 2017). It was visualized in Figure 7.

FIGURE 7

Figure 7. Position-specific conservation of amino acid residues in the binding and nonbinding sequence segments for Zn²⁺ (A), Cu²⁺ (B). Each ligand contains two subfigures, where the labeled subfigure is the position-specific conservation in binding residue sequence segments; the other is the position-specific conservation in nonbinding residue sequence segments [from literature (Cao et al., 2017)].

The Prediction Results Obtained by Adding Dihedral Angle Information

The dihedral angle has an important influence on local structure of the protein backbone. The literatures (Chen and Xu, 2016; Hu et al., 2016a,b; Cui et al., 2019) showed that the dihedral angle played an important role in predicting the ligand binding sites on proteins. Therefore, composition feature and 2L-dimensional position conservation feature extracted from the phi and the psi angles were added to predict the binding residues of ion ligands. The flowchart of the proposed method was displayed in Figure 8. And the results obtained by the five-fold cross-validation are shown in Table 3.

FIGURE 8

Figure 8. Flowchart of the proposed method. AA is amino acids; DH is polarization charge of amino acids; QS is hydrophilic and hydrophobic of amino acids; SS is predicted second structure; RSA is relative solvent accessibility; and DA is dihedral angle; 2L is 2L-dimensional position conservation feature; PS is composition feature; and SH is information entropy value.

TABLE 3

Table 3. Prediction results with adding dihedral angle information by 5-fold cross-validation.

As shown in Table 3, prediction results were improved after adding dihedral angle information; the MCC values of Ca²⁺, Na⁺, K⁺, ${SO}_{4}^{2 -}$ , and ${PO}_{4}^{3 -}$ ligands increased by at least 20 percentage points; the MCC values of Mn²⁺, Co²⁺, Mg²⁺, and ${CO}_{3}^{2 -}$ increased by at least 10 percentage points; and the values of Acc, Sn, and Sp of these nine ion ligands were also significantly improved. The above values of other ion ligands were also slightly improved. The results showed that the ion ligand–binding residues were sensitive to the information of the reclassified dihedral angle. Namely, the information of the reclassified dihedral angle was effective for identifying ion ligand–binding residues.

The Comparison of the Predicted Results of 10 Metal Ion–Ligand Binding Residues

Because the datasets of metal ion ligand–binding residues used in this article and the optimal window length selected for each metal ion were the same as used in literature (Cao et al., 2017), the prediction results of 10 metal ion ligands obtained in this article were compared with prediction results obtained by SVM in literature (Cao et al., 2017).

In order to facilitate comparison, prediction results obtained by SVM were putted into brackets in Table 3. Apparently, the predicted results of Zn²⁺, Fe²⁺, and Cu²⁺ were lower than those obtained by SVM, but prediction results of the other seven metal ion ligands obtained by RF algorithm were all more accurate than those obtained by SVM. In particular, the MCC values of Ca²⁺ and Mg²⁺ were increased more than 20 percentage points comparing to SVM.

The Prediction Results of Independent Test

In order to test the practicability of the model established in this article, independent test was performed. The independent test datasets of 14 ion ligands are shown in Table 4.

TABLE 4

Table 4. Data of the training dataset and independent test dataset.

In the independent test, composition and 2L-dimensional position conservation information of the amino acids, secondary structure, RSA, and dihedral angle, as well as information entropy of polarization charge and hydrophilic–hydrophobic, were used as feature parameters and inputted into the RF algorithm to predict ion ligand–binding residues. The independent test results are shown in Table 5.

TABLE 5

Table 5. Prediction results of the independent test.

The predicted results of Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Co²⁺, Mn²⁺, ${NO}_{2}^{-}$ , and ${PO}_{4}^{3 -}$ were improved, MCC values were higher than 0.220, the values of Acc and Sp were higher than 87%, and the Sn values were higher than 72%. The result of Cu²⁺ ligand was the highest, in which the MCC value was 0.399, and the values of Sn, Sp, and Acc were higher than 87%.The obtained results of Ca²⁺, Mg²⁺, Na⁺, K⁺, ${CO}_{3}^{2 -}$ , and ${SO}_{4}^{2 -}$ ligands were lower, the MCC values were lower than 0.200, in which the predicted result of Na⁺ was the lowest, with the MCC value of 0.076 and the Acc value of 73.5%.

Because the independent test datasets of 10 metal ion ligands constructed in this article were the same as those used in the literature (Cao et al., 2017), the results obtained by independent test in this article were compared with those obtained by SVM in the literature (Cao et al., 2017). The results of the independent test in the literature (Cao et al., 2017) are shown in the brackets of Table 5.

Comparing MCC value, it is obvious that the independent test results obtained by the RF algorithm were better than those obtained by the SVM. In terms of values of the Acc and Sp, the RF algorithm achieved better predicted results of metal ion ligands, except for Na⁺ and K⁺ ligands. As for the Sn value, the prediction results of Co²⁺, Mg²⁺, Na⁺, and K⁺ ligands were better than those obtained by SVM, whereas other ligands were slightly lower. In general, the model constructed in this article was practical on the metal ion ligand–binding residues prediction.

Comparison With SVM and ANN Algorithms

It is objective to compare our proposed methods with previous model using the same dataset. From Tables 3, 5, we found that most of the prediction results of this work were better than those of previous work (Cao et al., 2017).

In order to test the performance of our proposed method, we further made a comparison between RF and SVM and ANN algorithms on the same dataset, feature parameters, classification strategy, and evaluation methods. We inputted the same feature parameters extracted in this article into SVM and ANN algorithms to identify ion ligand–binding residues. The results obtained by the independent test are shown in Table 6. As seen, the prediction results of RF algorithm were best, although the results of Na⁺ and K⁺ ligands were slightly lower. At the same time, we can see that, when using the SVM on the same dataset, the prediction results obtained by selecting the feature parameters of this article were also better than those obtained in literature (Cao et al., 2017).

TABLE 6

Table 6. Comparison of independent test results obtained by RF, SVM, and ANN.

The Optimization of Model

Because the reclassified phi and psi angles have a good effect on the prediction of ion ligand–binding residues, we further classified the phi and psi angles in the other way to optimize the prediction model for each ion ligand. At first, we classified the phi and psi angles by distribution. Therefore, we tried to classify the phi and the psi angles according to peak value. Taking Mn²⁺ ligand as an example, according to Figure 3, phi angle was divided into four categories and defined by function f(x):

\begin{array}{l} f (x) = {\begin{matrix} A, x \in [- 18 0^{\circ}, - 10 5^{\circ}] \\ B, x \in (- 10 5^{\circ}, - 7 5^{\circ}] \\ C, x \in (- 7 5^{\circ}, - 6 0^{\circ}] \\ D, x \in (- 6 0^{\circ}, 18 0^{\circ}] \end{matrix} & (12) \end{array}

According to Figure 4, psi angle was divided into three categories and defined by function p(x):

\begin{array}{l} p (x) = {\begin{matrix} A, x \in [- 18 0^{\circ}, - 1 5^{\circ}] \\ B, x \in (- 1 5^{\circ},7 5^{\circ}] \\ C, x \in (7 5^{\circ},18 0^{\circ}] \end{matrix} & (13) \end{array}

The two classifications of phi angle and the two classifications of psi angle were combined, respectively (Figure 9). Their composition and 2L-dimensional position conservation information were extracted as the characteristic parameters.

FIGURE 9

Figure 9. The combination ways of phi and psi angles.

The composition and 2L-dimensional position conservation information of amino acids, secondary structure, RSA, dihedral angle, and information entropy of polarization charge and hydrophilic–hydrophobic were used as characteristic parameters and inputted into the RF algorithm to predict ion ligand–binding residues. The optimal prediction model and the corresponding prediction result of each ion ligand obtained by the five-fold cross-validation are shown in Table 7.

TABLE 7

Table 7. Optimal Predicted models of ion ligands and corresponding predicted results.

As seen, not all ion ligand–binding residues were sensitive to the first combination way of phi and psi angles. For example, the optimal prediction model of Mn²⁺ ligand corresponded to the second combination way of phi and psi angles, and the optimal prediction model of Zn²⁺ ligand corresponded to the third combination way of phi and psi angles. It can be seen that the different reclassification and combination ways of phi and psi angles have an important impact on the prediction of ion ligand–binding residues. Therefore, when using the reclassified dihedral angle information to predict ion ligand–binding residues, different classifications and combination ways of dihedral angle (phi and psi angles) should be considered to optimize the prediction model of binding residues of each ion ligand.

Conclusion

Many proteins perform their functions by interacting with ion ligands. To illustrate the protein functions, it is a significant work to recognize the ion ligand–binding residues. In this article, based on optimized dihedral angle, we predicted the 14 ion ligand–binding residues from the BioLip database by RF algorithm and obtained improved results. During the progress, the dihedral angle information was statistically analyzed and reclassified. Besides, the new extraction methods of feature parameter were proposed, in which the position weight matrices were constructed to extract the 2L-dimensional position conservation features; the polarization charge information and hydrophilic–hydrophobic information of amino acids were extracted by using information entropy. These changes in extraction methods improved the predicted results of ion ligand–binding residues. In particular, the reclassification information of the dihedral angle has significantly improved the prediction results of the ion ligand–binding residues, indicating that the reclassification information of the dihedral angle is an important feature parameter for the identification of the ion ligand–binding residues. By classifying and combining phi and psi angles, we optimized the prediction model for each ion ligand–binding residue. Thus, with different classification standards and combined methods of the dihedral angle (phi and psi angles), it can further improve the prediction results of ion ligand–binding residues.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Author Contributions

LL performed the experiments and wrote the paper. XH designed the experiments and analyzed the results. ZF, SW, SX, and KS gave guidance on the writing of the paper. All authors read and approved the final manuscript.

Funding

This work was supported by National Natural Science Foundation of China (61961032), Natural Science Foundation of the Inner Mongolia of China (2019BS03025).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Al-Mugotir, M., Kolar, C., Vance, K., Kelly, D. L., Natarajan, A., and Borgstahl, G. E. O. (2019). A simple fluorescent assay for the discovery of protein-protein interaction inhibitors. Anal. Biochem. 569, 46–52. doi: 10.1016/j.ab.2019.01.010

PubMed Abstract | CrossRef Full Text | Google Scholar

Ansari, H. R., and Raghava, G. P. S. (2010). Identification of NAD interacting residues in proteins. BMC Bioinformatics 11:160. doi: 10.1186/1471-2105-11-160

PubMed Abstract | CrossRef Full Text | Google Scholar

Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

CrossRef Full Text | Google Scholar

Cao, X., Hu, X., Zhang, X., Gao, S., Ding, C., Feng, Y., et al. (2017). Identification of metal ion binding sites based on amino acid sequences. PLoS ONE 12:e0183756. doi: 10.1371/journal.pone.0183756

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, C., and Xu, S. (2016). Improving the performance of the PLB index for ligand-binding site prediction using dihedral angles and the solvent-accessible surface area. Sci. Rep. 6:33232. doi: 10.1038/srep33232

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, P., Huang, J., and Gao, X. (2014). LigandRFs: random forest ensemble to identify ligand-binding residues from sequence information alone. BMC Bioinformatics 15(Suppl.15):S4. doi: 10.1186/1471-2105-15-S15-S4

PubMed Abstract | CrossRef Full Text | Google Scholar

Cui, Y., Dong, Q., Hong, D., and Wang, X. (2019). Predicting protein-ligand binding residues with deep convolutional neural networks. BMC Bioinformatics 20:93. doi: 10.1186/s12859-019-2672-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Ebert, J. C., and Altman, R. B. (2008). Robust recognition of zinc binding sites in proteins. Protein Sci. 17, 54–65. doi: 10.1110/ps.073138508

PubMed Abstract | CrossRef Full Text | Google Scholar

Emamjomeh, A., Choobineh, D., Hajieghrari, B., MahdiNezhad, N., and Khodavirdipour, A. (2019). DNA–protein interaction: identification, prediction and data analysis. Mol. Biol. Rep. 46, 3571–3596. doi: 10.1007/s11033-019-04763-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Feng, Z., and Hu, X. (2014). Recognition of 27-class protein folds by adding the interaction of segments and motif information. Biomed. Res. Int. 2014, 871–882. doi: 10.1155/2014/262850

PubMed Abstract | CrossRef Full Text | Google Scholar

Gao, S., and Hu, X. (2014). Prediction of four kinds of super secondary structure in enzymes by using ensemble classifier based on scoring SVM. Hans J. Comput. Biol. 4, 1–11. doi: 10.12677/HJCB.2014.41001

CrossRef Full Text | Google Scholar

Hu, X., Dong, Q., Yang, J., and Zhang, Y. (2016a). Recognizing metal and acid radical ion-binding sites by integrating ab initio modeling with template-based transferals. Bioinformatics 32, 3260–3269. doi: 10.1093/bioinformatics/btw396

CrossRef Full Text | Google Scholar

Hu, X., Wang, K., and Dong, Q. (2016b). Protein ligand-specific binding residue predictions by an ensemble classifier. BMC Bioinformatics 17:470. doi: 10.1186/s12859-016-1348-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Jia, S., and Hu, X. (2011). Using random forest algorithm to predict β-hairpin motifs. Protein Pept. Lett. 18, 609–617. doi: 10.2174/092986611795222777

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, Z., Hu, X., Geriletu, G., Xing, H., and Cao, X. (2016). Identification of Ca(2+)-binding residues of a protein from its primary sequence. Genet. Mol. Res. 15, 1–13. doi: 10.4238/gmr.15027618

PubMed Abstract | CrossRef Full Text | Google Scholar

Kandaswamy, K. K., Chou, K. C., Martinetz, T., Moller, S., Suganthan, P. N., Sridharan, S., et al. (2010). AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties. J. Theor. Biol. 270, 56–62. doi: 10.1016/j.jtbi.2010.10.037

PubMed Abstract | CrossRef Full Text | Google Scholar

Kel, A. E., GoBling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O. V., and Wingender, E. (2003). MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31, 3576–3579. doi: 10.1093/nar/gkg585

PubMed Abstract | CrossRef Full Text | Google Scholar

Komiyama, Y., Banno, M., Ueki, K., Saad, G., and Shimizu, K. (2015). Automatic generation of bioinformatics tools for predicting protein–ligand binding sites. BMC Bioinformatics 32, 901–907. doi: 10.1093/bioinformatics/btv593

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659. doi: 10.1093/bioinformatics/btl158

CrossRef Full Text | Google Scholar

Li, S., Hu, X., Sun, L., and Zhang, X. (2017). “Identifying the sulfate ion binding residues in proteins,” in International Conference on Biomedical and Biological Engineering (Guilin). 4, 209–216. doi: 10.2991/bbe-17.2017.34

CrossRef Full Text | Google Scholar

Liaw, A., and Wiener, M. (2002). Classification and regression by random forest. R News 2, 18–22.

Google Scholar

Lin, C. T., Lin, K. L., Yang, C. H., Chung, I. F., Huang, C. D., and Yang, Y. S. (2005). Protein metal binding residue prediction based on neural networks. Int. J. Neural Syst. 15, 71–84. doi: 10.1142/S0129065705000116

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, H., Han, L., Zhang, H., Zheng, C., Xie, B., Cao, Z., et al. (2006). Prediction of the functional class of metal-binding proteins from sequence derived physicochemical properties by support vector machine approach. BMC Bioinformatics 7(Suppl.5):S13. doi: 10.1186/1471-2105-7-S5-S13

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, Y. F., Cheng, C. W., Shih, C. S., Hwang, J. K., Yu, C. S., and Lu, C. H. (2016). MIB: metal ion-binding site prediction and docking server. J. Chem. Inf. Model. 56, 2287–2291. doi: 10.1021/acs.jcim.6b00407

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, L., Hu, X., Feng, Z., Zhang, X., Wang, S., Xu, S., et al. (2019). Prediction of acid radical ion binding residues by K-nearest neighbors classifier. BMC Mol. Cell Biol. 20:52. doi: 10.1186/s12860-019-0238-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Pan, X., and Shen, H. B. (2018). Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics 34, 3427–3436. doi: 10.1093/bioinformatics/bty364

PubMed Abstract | CrossRef Full Text | Google Scholar

Pánek, J., Eidhammer, I., and Aasland, R. (2005). A new method for identification of protein (sub)families in a set of proteins based on hydropathy distribution in proteins. Proteins 58, 923–934. doi: 10.1002/prot.20356

PubMed Abstract | CrossRef Full Text | Google Scholar

Robin, ÖZ., Sriram, K. K., and Westerlund, F. (2019). A nanofluidic device for real-time visualization of DNA-protein interactions on the single DNA molecule level. Nanoscale 11, 2071–2078. doi: 10.1039/C8NR09023H

PubMed Abstract | CrossRef Full Text | Google Scholar

Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656. doi: 10.1002/j.1538-7305.1948.tb00917.x

CrossRef Full Text | Google Scholar

Sodhi, J. S., Bryson, K., McGuffin, L. J., Ward, J. J., Wernisch, L., and Jones, D. T. (2004). Predicting metal-binding site residues in low-resolution structural models. J. Mol. Biol. 342, 307–320. doi: 10.1016/j.jmb.2004.07.019

PubMed Abstract | CrossRef Full Text | Google Scholar

Taylor, W. R. (1986). The classification of amino acid conservation. J. Theor. Biol. 119, 205–218. doi: 10.1016/S0022-5193(86)80075-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, S., Hu, X., Feng, Z., Zhang, X., Liu, L., Sun, K., et al. (2019). Recognizing ion ligand binding sites by SMO algorithm. BMC Mol. Cell Biol. 20:52. doi: 10.1186/s12860-019-0237-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Wu, S., and Zhang, Y. (2008). ANGLOR: a composite machine-learning algorithm for protein backbone torsion angle prediction. PLoS ONE 3:e3400. doi: 10.1371/journal.pone.0003400

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J., Roy, A., and Zhang, Y. (2013). Biolip: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 41, D1096–D1103. doi: 10.1093/nar/gks966

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J., Roy, A., and Zhang, Y. (2018). Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29, 2588–2595. doi: 10.1093/bioinformatics/btt447

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: binding residues, dihedral angle, information entropy, ion ligands, protein

Citation: Liu L, Hu X, Feng Z, Wang S, Sun K and Xu S (2020) Recognizing Ion Ligand–Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle. Front. Bioeng. Biotechnol. 8:493. doi: 10.3389/fbioe.2020.00493

Received: 20 December 2019; Accepted: 28 April 2020;
Published: 12 June 2020.

Edited by:

Yang Zhang, University of Michigan, United States

Reviewed by:

Qiwen Dong, East China Normal University, China
Qian Liu, Children's Hospital of Philadelphia, United States

Copyright © 2020 Liu, Hu, Feng, Wang, Sun and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiuzhen Hu, aHh6QGltdXQuZWR1LmNu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.