The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility

Many proteins realize their special functions by binding with specific metal ion ligands during a cell’s life cycle. The ability to correctly identify metal ion ligand-binding residues is valuable for the human health and the design of molecular drug. Precisely identifying these residues, however, remains challenging work. We have presented an improved computational approach for predicting the binding residues of 10 metal ion ligands (Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Ca2+, Mg2+, Mn2+, Na+, and K+) by adding reclassified relative solvent accessibility (RSA). The best accuracy of fivefold cross-validation was higher than 77.9%, which was about 16% higher than the previous result on the same dataset. It was found that different reclassification of the RSA information can make different contributions to the identification of specific ligand binding residues. Our study has provided an additional understanding of the effect of the RSA on the identification of metal ion ligand binding residues.


INTRODUCTION
Proteins act as an indispensable material in the life process. However, many special functions of protein are realized by binding with specific ligands, and more than one-third of the proteins need to bind with metal ion ligands. Thus, depending on the interaction between the metal ion ligands and specific binding residues, many metal ion ligands can affect the special protein functions (Caspers et al., 1990;Supek et al., 1997;Selvarengan and Kolandaivel, 2005). For instance, Mn 2+ is used as catalyst in photosynthesis (Degtyarenko, 2000;Reed and Poyner, 2000), Ca 2+ can lead to anxiety and Alzheimer's disease (Jiang et al., 2015;Cao et al., 2017), and Cu 2+ can cause Coronary Heart Disease (Sodhi et al., 2004;Lin et al., 2005). The basic principle of molecular drug design is that the interaction between the receptor and ligand must conform to the "Lock and Key Model, " and the interaction between the protein and ion ligands we studied also conforms to the "Lock and Key Model." In the experiment of molecular drug design, protein crystallization, structure confirmation, and the interaction between ligands and protein residues are required. Thus, the experimental method is a timeconsuming and expensive process, and it cannot be processed in batches, however, theoretical prediction of binding residues between proteins and ligands can overcome these shortcomings, and accurate prediction can provide theoretical information for drug design experiments. Therefore, correctly identifying metal ion ligand-binding residues is helpful for the human health and the design of molecular drug.
In the past two decades, experimental methods have been developed to identify metal ion ligand-binding residues, such as the Nuclear Magnetic Resonance Spectroscopy (Sletten, 1997) and fluorescence method (Kawahashi, 2003). However, due to the time-consuming nature and other limitations of experimental methods, the high-throughput computational methods were developed to predict the binding residues of metal ion ligands. Among the computational methods, many efforts were made to improve the databases, feature parameters, and algorithms. First, the databases were generally acquired from Protein Data Bank (PDB) (Tainer et al., 1991;Bernstein et al., 1997;Sodhi et al., 2004;Lin et al., 2005;Bordner, 2008;Babor et al., 2010;Lu et al., 2012), Structural Classification of Protein (SCOP) (Hubbard et al., 1997;Sodhi et al., 2004;Chauhan et al., 2010;Sobolev and Edelman, 2013), Ligand Protein Contact (LPC) (Sobolev et al., 1999;Chauhan et al., 2010), and BioLip (Yang et al., 2013a,b;Hu et al., 2016a,b, Wang et al., 2019. Second, the feature parameters generally contained the composition information of the amino acid (Cao et al., 2017;Wang et al., 2019), hydrophilicity-hydrophobicity (Lin et al., 2005;Lin et al., 2006;Cao et al., 2017), charge (Lin et al., 2005;Cao et al., 2017;Wang et al., 2019), position specific score matrix (PSSM) (Hu et al., 2016a), relative solvent accessibility (RSA) (Lin et al., 2006;Hu et al., 2016a;Cao et al., 2017;Wang et al., 2019) and three-dimensional structure information (Babor et al., 2010;Roy et al., 2012;Yang et al., 2015;Hu et al., 2016a). Finally, the classification algorithms used were artificial neural network (ANN) (Lin et al., 2005), Support Vector Machine (SVM) (Lin et al., 2006;Jiang et al., 2015;Cao et al., 2017;Hu et al., 2016a), Naïve Bayes (Ebert and Altman, 2010), COFACTOR (Lin et al., 2006;Yang et al., 2015), TargetSeq, TargetCom (Hu et al., 2016b), COACH (Yang et al., 2015), and SMO (Wang et al., 2019). Among the three aspects in the prediction mentioned above, the key step of feature extraction was generated by one of two ways: (1) the three-dimensional structure information or (2) primary sequence information of the protein. However, the precise three-dimensional structure information of many proteins was not available in the recent databases. Thus, feature extraction from sequence information is more popular in current research. Among the sequence information, RSA is one of the important parameters. In the previous works, researchers only divided it to burial and exposure by a certain threshold. However, the effects of different classifications of the RSA on prediction results have not been explored. In this paper, based on the semi-manually curated database of BioLip for biologically relevant ligand-protein interactions, we performed a statistical analysis for RSA and further reclassified the RSA. By integrating the optimized sequence information, we mainly used the Gradient Boosting Machine (GBM) algorithm and obtained better predicted results by using fivefold cross-validation and an independent test.

Benchmark Dataset
We selected non-redundant datasets of metal ion-binding proteins that were constructed in our group (Cao et al., 2017;Wang et al., 2019). The benchmark datasets were entirely from the BioLip database (Yang et al., 2013a). The proteins were filtered with a resolution less than 3 Å, the length of sequences was greater than 50, and the sequence identity was below 30%. Among the ∼250 ligands, there were only 10 ligands that could meet the above conditions to contribute to our further statistical analysis and prediction. The statistical information of the datasets containing ten metal ion ligands is shown in Table 1. In the protein sequence, residue binding with ion ligands was not only determined by the residue itself but also by how this was affected by the surrounding residues. Thus, a sliding window method was used to cut the protein sequence into overlapping residue segments with different sizes ranging from 5 to 21. In order to ensure that each residue was in the center of the segments, we added (L-1)/2 dummy residues "X" at both terminals of the proteins, where L was the window length. The optimal window length for each ligand was determined based on the evaluation results of the proposed computation method. If a binding residue was located at the segment center, it was defined as a positive sample; otherwise, it was defined as a negative sample. The number of non-binding segments was much larger than that of the binding segments, which led to a heavy imbalance in the datasets ( Table 1). According to the methods of previous works (Yen and Lee, 2006;Roy et al., 2015), we took the number of positive samples as the standard and randomly extracted the equal number of negative samples. In this way, the negative samples were randomly selected 10 times to ensure the credibility of the results. Finally, we averaged the 10 results to calculate our overall accuracy.

Selection and Extraction of Feature Parameters
According to the biological background of protein-ligand interactions and the statistical analysis of protein sequences, we extracted features of the position conservation information, which was acquired from the protein backbone and side chains.

Secondary Structure and Relative Solvent Accessibility
Analyzing the three-dimensional (3D) structure of a protein is critical to the understanding of its function. However, 3D models of only a small fraction of the sequenced proteins were made. The prediction of a secondary structure and RSA is a crucial step from the sequence to the 3D structure, reflecting the spatial structure information of the backbone and side chains, respectively. We therefore selected the predicted secondary structure information and RSA information. The prediction was helpful when simplifying the problem from the 3D structure to sequence information (Chen and Zhou, 2005;Lin et al., 2005;Hu et al., 2016a,b;Cao et al., 2017;Wang et al., 2019). In this paper, they were predicted by using ANGLOR software (Wu and Zhang, 2008). We obtained three secondary structure types, including alpha-helix (H), beta-strand (E), and coil (C). The relative solvent accessibility (RSA) was generally represented as a Boolean value, indicating whether the residue was buried (RSA < 0.25) or exposed (RSA > 0.25).

Physicochemical Properties of Amino Acids
Physicochemical properties affected the protein-ligand interactions, and different physicochemical properties of amino acids were caused by their different side chains (Lin et al., 2005(Lin et al., , 2006Cao et al., 2017;Wang et al., 2019). Metal ion ligands bind to a residue, probably preferring to bind to a specific side-group of this residue. The information from the side chains is therefore important for the prediction of metal ion ligand-binding residues. Since different standards can cause different classifications, the amino acids were divided into six categories according to the hydrophilicity and hydrophobicity (Panek and Eidhammer, 2010) (Supplementary Figure S1) and three categories according to the charge (Taylor, 1986) (Supplementary Figure S2).

Construction of Position Weight Matrix
The ion-binding residues tend to be more conserved than others during the process of evolution, and the residue conservation is a crucial indicator for the presence of functionally important residues. The PWSM has been successfully used in the prediction of transcription factor binding sites and ligand binding sites (Kel et al., 2003;Hu et al., 2016a). Thus, the position weight scoring matrix (PWSM) was used to extract the position conservation information of the basic feature parameters, and the scoring matrix based on amino acid residues was constructed from the sequence segments with a specific window length. The positionspecific occurrence frequency of an amino acid is calculated as follows: where i is the position index in the sequence segment, j is one of the 20 kinds of amino acids or vacancy, n ij is the frequency of the jth amino acids at the ith position, and N i is total number of all amino acids occurring at the ith position. The position weight matrix is then calculated as follows: where P oj is background probability of the jth amino acid. Therefore, based on the positive and negative training sets, two standard scoring matrices can be obtained. In a testing set, we got 2 * L dimensional values for every sequence segment. Finally, the 5 * 2L dimensional values from the above five features can be used as the input parameters in the subsequent algorithm.

Gradient Boosting Machine
The Gradient Boosting Machine (GBM) is an improved Boosting algorithm proposed by Friedman (2001Friedman ( , 2002, Rawi et al. (2018) and Jain et al. (2018). The GBM algorithm is different from the original Boosting algorithm. The core of the Boosting algorithm is to set different weights to different samples during the iterative process. Based on the results of the previous iteration, the Boosting algorithm will increase the weight of wrong classification samples and reduce the weight of correct classification samples. Then, a weak classifier will be generated in each iterative process; after m iterations, a strong classifier an improved performance will be obtained by setting weight for each weak classifier. In the iterative process, GBM algorithm classifies the sample residual of the previous iteration and not the sample itself. After the end of the iteration, our classifier F m (x) was obtained as Equation (3), where m is the number of iterations in the calculation process, ρ m is the weight value and also the distance of the loss function decreases in its gradient direction, and h m (x) is the function that fits the sample residuals in the iterations.
In addition, the GBM algorithm can handle mixture data and its robustness against outliers in the output space is very strong. In this paper, we implemented the GBM algorithm in the R platform by using the "gbm" package. In the classifier, parameters were optimized: "n.trees" ranged from 1 to 500, "n.minobsinnode" ranged from 10 to 50, "interaction.depth" ranged from 3 to 9, and "shrinkage" ranged from 0.01 to 0.1.

The Validation and Evaluation Metrics
As general validation methods, cross-validation and independent tests have been commonly used in previous literature (Hu et al., 2016a,b;Sun et al., 2016;Cao et al., 2017;Wang et al., 2019). In the five cross-validations, the dataset was randomly divided into five equal subsets. Four subsets were then used as training sets, and the remaining subset was used as a testing set. This process was repeated five times in such a way that each subset was used once for testing, and the average performance of the five subsets was then taken as the final performance.
We used several following metrics to evaluate our proposed method: sensitivity (Sn), specificity (Sp), False positive rate FIGURE 1 | Flowchart of the method for the identification of metal ion ligand-binding residues.
Frontiers in Genetics | www.frontiersin.org (FPR), accuracy of prediction (Acc), and Matthew's correlation coefficient (MCC). They are defined as follows: where TP is the number of correctly predicted metal ion ligand binding residues, FN is the number of binding residues predicted as non-binding residues, TN is the number of correctly predicted non-binding residues, and FP is the number of non-binding residues predicted as binding residues. To explain the above prediction method more directly and clearly, see our detailed flowchart in Figure 1.

The Classification of Relative Solvent Accessibility
For each metal ion ligand, based on the optimized window length, we gradually added the parameters from the position Predicted Results for K + Ligand Binding Residues Table 2 shows the prediction results of the K + ligand by gradually adding parameters to the model. By gradually adding parameters to the model, we found that the different parameters had different effects on the predicted results. In this work, we used the initial classification of Boolean value thresholds (marked as SA_2) and added it to the model; the predicted result was significantly improved, and the Acc and MCC increased by nearly 12 and 24%, respectively. However, the predicted results did not change much by adding other parameters. It indicated that the RSA played an important role in the whole parameters for identifying the metal ion ligand-binding residues.
FIGURE 2 | The statistical distribution of relative solvent accessibility in positive and negative set for K + ligand. Note: the abscissa axis is the values of the relative solvent accessibility, and the ordinate is the number of amino acids corresponding to each predicted value. The solid red line represents the positive set, and the dotted blue line represents the negative set.

Statistical Analysis of the Relative Solvent Accessibility
Due to the importance of RSA and the particularity of metal ion ligands, we performed the statistical analysis of the RSA information for different metal ion ligands. Then, we found that the classification was not the same for different metal ion ligands. Therefore, we reclassified the thresholds of the Boolean value for different metal ion ligands. For instance, Figure 2 shows the statistical distribution of the RSA in a positive set and negative set for the K + ligand (the statistical distribution of other metal ion ligands is shown in Supplementary Material 1). In Figure 2, the abscissa indicates the predicted values of amino acid RSA; the ordinate indicates the number of amino acids corresponding to each predicted value in the positive and negative samples.   When "i" is an odd number, WAi, DHi, QSi, SSi, and SAi indicate the matrix value of amino acid, charge, hydrophilic-hydrophobic, secondary structure, and relative solvent accessibility at the ((i + 1)/2)th position calculated from the positive training set. When "i" is an even number, WAi, DHi, QSi, SSi, and SAi indicates the corresponding values at the (i/2)th position calculated from the negative training set.
Besides, we also used the previous four regions (Cao et al., 2017), which were suitable for most metal ion ligands (marked as SA_4), namely

The Predicted Results of Four General RSA Classifications
Then, for each metal ion ligand, four different classification groups of RSA were added to the parameters, and four general prediction models were obtained. The four different predicted results of K + ligand binding residues are shown in Table 3.
We found that the predicted results of the same metal ion ligand were different for the four general prediction models, and the optimal predicted results of ten metal ion ligand-binding residues were from the differently specific prediction model. An additional file shows this in more detail (see Supplementary Material 3). For example, the K + ligand obtained the optimal predicted result from the specific classification namely SA_V, but the Fe 2+ ligand obtained this from SA_4.

The Optimal Predicted Results of Ten Metal Ion Ligand-Binding Residues
By comparing the four general prediction models, the optimal predicted results for ten metal ion ligand-binding residues were obtained and listed in Table 4.
Based on the different classifications of RSA, we obtained the optimal predicted results of ten metal ion ligand-binding residues and corresponding specific prediction models.

The Predicted Results (by Use of the Boruta Algorithm)
We used the 5 * 2L dimensional features in the above calculations. However, different features made varied contributions to the predicted results, and the combination of different features did not necessarily result in a good classification performance. Therefore, we used a Boruta algorithm Li, 2017, Feng et al., 2018) to make a main feature selection. The algorithm iteratively removed the features that were less relevant than random probes. From this we could obtain the optimal features combination. The algorithm was implemented by the "Boruta" package in R environment. In this way, after a large-scale computation, the confirmed features were obtained, and the rejected features were removed from the combination of all the features. The rejected features are shown in Table 5.
When using the Boruta algorithm to reduce the dimension of the features, it was found that the reduced dimensions of different metal ion ligands were different. For example, the dimensions of the Ca 2+ and Mg 2+ ligands were not reduced, the dimension of the Zn 2+ ligand was reduced by 5 dimensions, the dimension of the Mn 2+ ligand was reduced by  13 dimensions, etc. In order to prove the justifiability of the features eliminated by the Boruta algorithm, we analyzed the importance of the features by using the "randomForest" package in R environment. The larger the MeanDecreaseAccuracy and MeanDecreaseGini values, the higher the importance of the feature parameters. Taking the Zn 2+ ligand as an example, it can be seen from Figure 3 that the important features of the first 30 dimensions were consistent with the confirmed features by the Boruta algorithm. The obtained subset features were then input into the GBM, and the predicted results were shown in Table 6. Table 6 shows that we obtained similar results based on subset features. This suggested that, under the premise of ensuring the accuracy, the Boruta algorithm was efficient in its ability to reduce the dimensions of features for predicting metal ion ligand-binding residues. The decline of the subset predicted results showed that all the selected features had certain contributions to the recognition of the binding residues. In addition, the predicted results of the subset were still higher than those of SVM. Our The bold values represent the best Acc and MCC values.
Frontiers in Genetics | www.frontiersin.org FIGURE 4 | The comparison of prediction performances between several machine learning methods based on the same features by using five fold cross-validation test.
method was therefore relatively reliable for predicting the metal ion ligand binding residues.

The Predicted Results of GBM by Using an Independent Test
We used equal samples of positive and negative in the previous calculations. However, the positive and negative samples were not equal when we intercepted segments by using the sliding window method. In order to verify the practicability of the proposed method, we divided the total dataset into two parts: the training dataset was used to construct the predicted methods by fivefold cross-validation, and the independent testing dataset was used to test the extrapolation ability of the predicted methods. The protein chains in the independent testing dataset accounted for 20% of the total dataset, which was consistent with the published work (Cao et al., 2017). The statistical information of the datasets is shown in Table 7.
In the independent test, the 5 * 2L dimension position information was input into the GBM algorithm to obtain the predicted ligand-specific models, and the testing dataset was input into the predicted model to test. The number of positive and negative samples was not balanced, and the MCC values in Table 8 therefore reflect the stability of the predicted model. In order to compare these results more obviously, we added them to Table 8. The comparative results indicated that the selected features and algorithm had better identification abilities for predicting metal ion ligand-binding residues.

Comparison With Other Methods
It is necessary to compare our proposed methods with previous models using the same dataset, classification strategy, and evaluation methods. For the purposes of comparison with the previous results (Cao et al., 2017;Wang et al., 2019), our predicted results of fivefold cross-validation and independent test are displayed in Tables 8, 9, respectively. Comparing the previous results in our group (Cao et al., 2017;Wang et al., 2019), most of the metal ion ligands were improved to different degrees. With the same dataset, the same feature parameters, classification strategy, and evaluation methods, we further made a comparison between the GBM algorithm and several other machine learning methods, including SVM, Random Forest, and Artificial Neural Network. Using the same features, the ACC and MCC values of each classifier for ten ligands are displayed in Figure 4. The results showed that accuracies of the GBM classifier were higher than other machine learning methods, indicating that the GBM classifier was a powerful tool for predicting metal ion ligand binding residues.

CONCLUSION
The interactions between metal ion ligands (e.g., Na + , Mn 2+ , Ca 2+ , K + , and Cu 2+ ) and proteins perform key biological functions in many important life processes. Research into these metal ion ligands and functions is of significant biological import. In particular, the prediction of ligand binding residues is of great significance to the understanding of the biological functions of proteins and drug design. In this work, we predicted the binding residues of 10 metal ion ligands in the BioLip database, and we obtained improved results. According to the biological background of proteins, we selected hydrophobic polarized charges, predicted secondary structures, and RSA information as the basic information. From the statistical analysis of RSA information, we found that the reclassified RSA information has important effects on recognition of metal ion ligandbinding residues. Therefore, on the basis of primary sequence information, we extracted the important features of RSA by reclassifying the RSA as four different classifications (i.e., SA_2, SA_V, SA_P, and SA_4). Using the GBM algorithm and an overall classification strategy, we further improved the prediction success rate of metal ion ligand binding residues in the crossvalidation and independent test. In the best performance, MCC values were higher than 0.558, the FPR values were lower than 20.8%, and the Acc values were higher than 77.9%. In comparison with previous results (Cao et al., 2017), our best accuracy of fivefold cross-validation was about 16% higher on the same dataset. In this research, we identified the specific contributions of different reclassified RSA to the identification of 10 ligandbinding residues. However, for the prediction performances of different ligands, there are different improvements that can indicate the differences in the ligand-binding residues. Our next step is to prove this specialty. To make our models available for other researchers, we provide our database in Supplementary Material 4 and full feature parameters in the additional material. In our future work, we will make efforts to provide a web server for the analysis method presented in this paper, which can be manipulated by readers according to their need.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.