OnionNet-2: A Convolutional Neural Network Model for Predicting Protein-Ligand Binding Affinity based on Residue-Atom Contacting Shells

One key task in virtual screening is to accurately predict the binding affinity ($\Delta G$) of protein-ligand complexes. Recently, deep learning (DL) has significantly improved the accuracy of scoring functions, owing to its extraordinary ability to extract useful features from raw data. Nevertheless, further improvements in prediction accuracy and computational efficiency are still needed. In this study, we propose a simple convolutional-neural-network-based scoring function, called OnionNet-2, to predict $\Delta G$. The protein-ligand interactions are characterized by the numbers of contacts between protein residues and ligand atoms in multiple distance shells. OnionNet-2 outperforms previously published models on two widely used benchmarks, CASF-2016 and CASF-2013. The model was further validated on non-experimental decoy structures generated by a docking program and on the CSAR NRC-HiQ data set (a high-quality data set provided by CSAR), with great success. Thus, our study provides a simple but efficient scoring function for predicting protein-ligand binding free energy.


INTRODUCTION
Protein-ligand binding is the basis of almost all processes in living organisms, [1] and predicting the binding affinity ($\Delta G$) of a protein-ligand complex has therefore become a research focus of bioinformatics and drug design. [2-4] Theoretically, molecular dynamics (MD) simulations and free energy calculations (for instance, thermodynamic integration and free energy perturbation) can provide accurate predictions of $\Delta G$, but they rely on extensive configurational sampling and computation, leading to a large demand in computational cost. [4-7] Therefore, developing simple, accurate and efficient scoring methods to predict protein-ligand binding will greatly accelerate the drug design process. [8] To this end, several theoretical methods (scoring functions) have been proposed. Typically, scoring functions are based on calculations of the interactions between protein and ligand atoms. [1,8-10] These include quantum mechanics calculations, molecular dynamics simulations (electrostatic interactions, van der Waals interactions, hydrogen bonds, etc.), and empirical interaction models. [1,9-11] In recent years, approaches based on machine learning (ML) have been applied to scoring functions with great success. [12-15] For instance, RF-Score [16] and NNScore [17] are two pioneering ML-based scoring functions. Compared with classical approaches, these ML-based methods allow higher flexibility in selecting configurational representations or features for the protein and the ligand. More importantly, they have been demonstrated to be more accurate and more efficient than classical approaches. [18,19] Recently, deep learning (DL) approaches have provided an alternative solution. Compared with conventional ML, DL models are better at learning features from raw data and extracting the relationship between those features and the labels. [20-22] Thus, DL algorithms have been introduced to model structure-activity relationships.
[23-25] One of the most popular DL methods is the convolutional neural network (CNN), a multi-layer perceptron inspired by the neural networks of living organisms. [26] Inspired by the great success of DL and CNN techniques, several models applying CNNs to virtual screening and $\Delta G$ prediction have been reported. [27-35] For example, Öztürk and co-workers reported DeepDTA, a model based on one-dimensional (1D) convolution that takes protein sequences and simplified molecular input line entry specification (SMILES) codes of ligands as inputs to predict the drug-target $\Delta G$. [30] Using 3D CNNs, two independent groups developed scoring functions, named Pafnucy [31] and Kdeep, [29] which model the complex in a cubic box centered on the geometric center of the ligand to predict the $\Delta G$ of the protein-ligand complex. More interestingly, Russ et al. employed Graph-CNNs to automatically extract features from protein pockets and 2D ligand graphs, and demonstrated that the Graph-CNN framework can achieve superior performance without relying on protein-ligand complexes. [34] Our group has proposed a 2D convolution-based predictor, called OnionNet, based on element-pair-specific contacts between ligand and protein atoms. [32] These DL- and CNN-based approaches achieved higher accuracy in $\Delta G$ prediction than most traditional scoring functions, such as AutoDock, [36,37] X-Score [38] and KScore. [39] For DL scoring functions, how to handle the high-dimensional structural information encoded in 3D structures and convert it into low-dimensional features for ML (or DL) training is critical. For most structure-based ML/DL models, the features are derived from the atomic information of proteins and ligands, such as element types, spatial coordinates and other atomic properties.
[29,31] We noticed that the same elements in different residues can have quite different physical and chemical properties, which might greatly affect the performance of a scoring function. The twenty types of amino acids can be treated as an intrinsic classification of protein building blocks that captures their key features, such as polar, apolar or aromatic character. We therefore anticipate that it may be more beneficial to encode the protein at the residue level, rather than the atom level, when developing DL scoring functions.
In this work, we propose OnionNet-2, a simple 2D CNN-based regression model that adopts rotation-free residue-atom-specific contacts in multiple distance shells to describe the protein (residue)-ligand (atom) interactions. We demonstrate that our method improves the prediction power by about 3.7% over previous models, thus providing an efficient and accurate approach for predicting protein-ligand interactions and pointing to a new trend of training DL models on massive biological structure data for drug design.

Descriptors
The features employed are the numbers of specific residue (protein)-atom (ligand) contact pairs in multiple distance shells. The minimum distance between an atom in the ligand and a residue of the protein is treated as the representative residue-atom distance. First, around each atom in the ligand, we defined N contiguously packed shells. The thickness of each shell is δ, except that the first shell is a sphere of radius $d_0$. The outer boundary $K_i$ of the ith shell is

$$K_i = d_0 + (i-1)\,\delta, \quad i = 1, 2, \ldots, N. \eqno(1)$$

Meanwhile, we classified the atoms in the ligand into eight types, namely C, H, O, N, P, S, HAL and DU, where HAL represents the halogen elements (F, Cl, Br and I), and DU represents all element types not included in the other seven:
$$T_e \in \{\mathrm{C, H, O, N, P, S, HAL, DU}\}$$

When pre-processing the structure files, water molecules and ions were treated explicitly, because crystal water and ions can affect protein-ligand binding. [40,41] In addition to the twenty standard residues, we added an extra type named "OTH" to represent water, ions and any other non-standard residues. It is worth mentioning that the residue-atom distance is defined as the distance between the ligand atom and the nearest heavy atom in the residue. A 2D visual representation is depicted at the upper left of Fig. 1. For each shell, the number of contacts for each residue-atom pair is calculated and used as a feature. Each shell has 8 × 21 = 168 residue-atom combinations, which means that there are 168 features per shell. Thus, for a total of N shells, 168 × N features are generated.
Within each shell, the contact indicator $c^i_{r,e}$ equals 1 when $K_{i-1} < d_{r,e} \le K_i$, and 0 otherwise, where $d_{r,e}$ is the minimum distance between residue r in the protein and atom e in the ligand. The number of contacts of a specific residue-type/element-type combination in the ith shell is then

$$n^i_{T_r,T_e} = \sum_{r \in T_r} \sum_{e \in T_e} c^i_{r,e}, \eqno(2)$$

where the sums run over the R residues of the protein belonging to type $T_r$ and the E atoms of the ligand belonging to type $T_e$. Following our previous study, [32] we used $d_0$ = 1 Å and δ = 0.5 Å. Interestingly, such shell-like (radial) representations of protein environments have been demonstrated to be superior features in protein function prediction. [35] The preparation of the datasets and the CNN architecture [42,43] can be found in the SI. The source code of OnionNet-2 is available at https://github.com/zchwang/OnionNet-2/.
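For concreteness, the featurization above can be sketched as follows. This is a minimal illustration under our own naming (`shell_features`, `RESIDUE_TYPES`, `LIGAND_TYPES`), not the released OnionNet-2 code; it assumes a precomputed matrix of minimum residue-atom distances as input.

```python
import numpy as np

# 21 residue types (20 standard amino acids + "OTH") and 8 ligand atom types.
RESIDUE_TYPES = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY",
                 "HIS", "ILE", "LEU", "LYS", "MET", "PHE", "PRO", "SER",
                 "THR", "TRP", "TYR", "VAL", "OTH"]
LIGAND_TYPES = ["C", "H", "O", "N", "P", "S", "HAL", "DU"]

def shell_features(dist, res_types, atom_types, n_shells, d0=1.0, delta=0.5):
    """Count residue-atom contacts per (residue type, element type) pair
    in each distance shell; outer boundaries K_i = d0 + (i - 1) * delta."""
    feats = np.zeros((n_shells, len(RESIDUE_TYPES), len(LIGAND_TYPES)))
    outer = d0 + delta * np.arange(n_shells)        # K_1 ... K_N
    inner = np.concatenate(([0.0], outer[:-1]))     # K_0 = 0, K_1 ... K_{N-1}
    for r, rt in enumerate(res_types):
        for e, at in enumerate(atom_types):
            d = dist[r, e]                          # min residue-atom distance
            in_shell = (d > inner) & (d <= outer)   # c^i_{r,e} for every shell i
            feats[in_shell,
                  RESIDUE_TYPES.index(rt),
                  LIGAND_TYPES.index(at)] += 1
    return feats.reshape(-1)                        # 168 * n_shells features
```

With the paper's parameters ($d_0$ = 1 Å, δ = 0.5 Å), a residue-atom distance of 1.2 Å falls into the second shell (1.0 < d ≤ 1.5), contributing one count to that shell's residue-type/element-type feature.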

Evaluation metrics
To evaluate the performance of OnionNet-2, we adopted the loss function defined in our previous work, [32]

$$\mathrm{Loss} = \alpha\,(1 - R) + (1 - \alpha)\,\mathrm{RMSE},$$

where R and RMSE represent the Pearson correlation coefficient (Eq. 3) and the root-mean-squared error, respectively,

$$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}, \eqno(3)$$

where $x_i$ is the predicted pKa of the ith complex, $y_i$ is the experimental pKa of that complex, and $\bar{x}$ and $\bar{y}$ are the averages of all predicted and experimental values. [44] The α (0 ≤ α ≤ 1) value is an adjustable factor that balances the weights of R and RMSE, and was finally set to 0.7. For each independent training task, we adopted early stopping (patience = 20, i.e., if the loss on the validation set does not improve by more than 0.001 within 20 epochs, training is terminated) and saved the model that performed best on the validation set. For each prediction, five independent trainings were conducted and the mean predicted value was reported.
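The loss above can be sketched in plain NumPy as follows (function names are ours for illustration; in the actual model the loss is computed inside the training graph):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between predictions x and labels y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

def rmse(x, y):
    """Root-mean-squared error."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def joint_loss(pred, true, alpha=0.7):
    """Loss = alpha * (1 - R) + (1 - alpha) * RMSE, with alpha = 0.7."""
    return alpha * (1.0 - pearson_r(pred, true)) + (1.0 - alpha) * rmse(pred, true)
```

A perfect prediction gives R = 1 and RMSE = 0, hence a loss of 0; the α = 0.7 weighting makes the correlation term dominate.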

RESULTS AND DISCUSSION
The predictive power of OnionNet-2

Firstly, we explored the effect of the shell number N on the predictive capability of the OnionNet-2 model. A range of total shell numbers 10 ≤ N ≤ 90 was tested with an interval of 2. According to our definition of the distance shells, this covers residue-atom separations from 0.55 nm to 4.55 nm. Fig. 2 depicts the trend of the R value with the shell number N when testing on core set v.2016. [44] For N from 10 to 20, R quickly increases as the total number of shells increases. This is expected, because as the number of shells increases, the interactions between the ligand and the protein are gradually captured by the model. The R value reaches a first peak at N = 30, which means that OnionNet-2 can already achieve high prediction accuracy at a relatively low computational cost. R then fluctuates within a range of 0.01 until it reaches its global maximum at N = 62 (Fig. 3d), where the predicted and experimental pKa are highly correlated for the two test sets and the validation set. Beyond this point, R decreases as N increases; we attribute this to the excess data introducing noise into the training. Unless otherwise specified, we adopted N = 62 in the following discussions. In addition, we also re-trained the model with two earlier versions (v.2016 and v.2018) of the PDBbind database, and the R values of the re-trained models are almost the same (Fig. S1 and Table S1).
The performance of several published scoring functions [48,49] and OnionNet-2 on CASF-2016 and CASF-2013 is shown in Fig. 4 and Fig. S2, respectively. The corresponding R and RMSE (or SD) values achieved by these representative scoring functions can be found in Table S2 in the SI. Firstly, our OnionNet-2 model achieved the highest R of 0.864 with an RMSE of 1.164 on core set v.2016, and R = 0.821 with RMSE = 1.357 on core set v.2013, significantly better than the other scoring functions. The 2nd best scoring function was AGL, which adopts the gradient boosting trees (GBTs) algorithm and uses multiscale weighted labeled algebraic subgraphs to characterize protein-ligand interactions. [48] The two 3D convolution-based scoring functions Kdeep [29] and Pafnucy [31] adopt a 3D voxel representation to model the protein-ligand complex and explicitly take physical properties of atoms (hydrophobicity, hydrogen-bond donor or acceptor character, aromaticity, etc.) into consideration. It is interesting that although OnionNet-2 only employs residue-atom contacts to describe the interactions between the protein and the ligand, its predictive power is higher. This further reveals that the selected features have a great impact on the predictive power of CNN-based scoring functions. Secondly, as expected, the introduction of ML/DL techniques into the models has systematically enhanced the prediction accuracy. Generally, DL models display good generalization behavior in practical applications. [50] To verify the generalization ability of OnionNet-2, the CSAR NRC-HiQ data set provided by CSAR [51] was used as an additional test set in this study. This data set contains two subsets of 176 and 167 protein-ligand complexes, respectively. For the two previous ML models, Kdeep and RF-Score, the researchers used 55 and 49 complexes from the two subsets, respectively, as test data.
[29] To provide a direct comparison, we adopted the same data for the OnionNet-2 test. It is worth mentioning that the two test subsets from CSAR NRC-HiQ share only two complexes with core set v.2013, namely 2jdy and 2qmj, and do not overlap with the training set, validation set or core set v.2016. The performance of Kdeep, RF-Score and OnionNet-2 on these two subsets is shown in Table I, and the scatter plots of the pKa predicted by OnionNet-2 against the experimental pKa can be found in Fig. S3 in the SI. As expected, our model achieved higher performance than Kdeep and RF-Score. For subset 1, OnionNet-2 achieved an R of 0.89, considerably higher than that of Kdeep (0.72) and RF-Score (0.78). The same holds for subset 2; notably, the R value of the Kdeep model is only 0.65 on subset 2, indicating weak predictive capability on these data. These results demonstrate that OnionNet-2 has good generalization ability.

Evaluations on subsets of non-experimental decoy structures
As all the training and validation sets in previous studies are composed of well-validated native structures, it is largely unknown whether a DL method is capable of recognizing "bad data" incorporated in such data sets, for instance non-native binding poses. To verify this, we tested OnionNet-2 on non-experimental structures generated by docking programs. Technically, non-native binding poses (called decoys) were generated from the core set v.2016 complexes by AutoDock Vina. [52,53] Detailed information on the generation of the decoys can be found in the SI. [54,55] The prediction accuracy was evaluated by calculating the RMSE between the predicted pKa of each decoy complex and the pKa of the corresponding native receptor-ligand complex, as shown in Fig. 5. Clearly, the RMSE quickly increases with increasing RMSD. This is expected, because decoys with larger RMSD cause more severe changes of $\Delta G$. These results reveal that OnionNet-2 responds accurately to changes in the ligand binding pose and can distinguish the native structure.

Effects of hydrophobic scale, buried solvent-accessible area and excluded volume inside the binding pockets on the prediction accuracy
Principally, the physical interactions between protein and ligand determine the $\Delta G$. The dominating contributions to the overall $\Delta G$ involve electrostatic interactions, van der Waals interactions, hydrogen bonds, and hydration/dehydration during complexation. However, such mechanistic interactions were not directly input as DL features. At the molecular level, they involve the size and shape of the binding pocket and the nature of the residues around it, which determine its physicochemical characteristics. [56] Whether DL models can accurately represent the structural specificity of the binding pocket is poorly documented. The entire CASF-2016 test set can be divided into three subsets by each of three descriptors reflecting physical classifications of the binding pocket on the target protein. [44] The three descriptors are H-scale (hydrophobic scale of the binding pocket), SAS (buried percentage of the solvent-accessible area of the ligand after binding) and VOL (excluded volume inside the binding pocket after ligand binding). The protein-ligand complexes in CASF-2016 were grouped into 57 clusters, and the authors sorted all 57 clusters in ascending order by each descriptor. These clusters were then divided into three subsets according to the chosen descriptor, labeled H1, H2 and H3, S1, S2 and S3, or V1, V2 and V3. These subsets were used to validate our OnionNet-2 model. For comparison, previous scoring functions were also tested on these three series of subsets by Su et al., [44] and the results are summarized in Table II.
As can be seen in Table II, OnionNet-2 achieved higher prediction accuracy than the other scoring functions on the H-, S- and V-series subsets. This indicates that features based on the number of residue-atom contacts in multiple shells are capable of capturing the hydrophobic scale of the binding pocket. The numbers of contacts in different shells (specifically the shells within the binding pocket) may also reflect the buried solvent-accessible surface area and the excluded volume of the ligand.
We noticed that the R value of OnionNet-2 on the V2 subset is clearly lower than on the other subsets (although it is still higher than that of the other scoring functions). This may indicate that our model is less sensitive to medium-sized binding pockets, and that recognizing the size and shape of the binding pocket remains challenging for current scoring functions.
Furthermore, we plotted the detailed scatter plots of predicted versus experimental pKa in Fig. S4 according to the specific H, S and V ranges. Interestingly, the pKa shows almost no dependence on the values of H, S or V. We therefore speculate that a more realistic descriptor of the ligand's characteristics in the binding pocket is essential to guide protein-ligand $\Delta G$ prediction.

CONCLUSION
To summarize, a 2D convolution-based CNN model, OnionNet-2, is proposed for predicting the protein-ligand binding free energy. The numbers of contact pairs between protein residues and ligand atoms were used as features for DL training. Using CASF-2013 and CASF-2016 as benchmarks, our model predicted $\Delta G$ with higher accuracy than previous scoring functions. In addition, the performance of OnionNet-2 is nearly the same when different versions of the PDBbind database are employed for training. The generalization ability of the model was verified on the CSAR NRC-HiQ data set and on decoy structures. Our results also indicate that OnionNet-2 can recognize the physical nature of the ligand-binding pocket (specifically, the hydrophobic scale of the binding pocket, the buried percentage of the solvent-accessible area of the ligand upon binding, and the excluded volume inside the binding pocket upon ligand binding).

This work is supported by the Natural Science Foundation of Shandong Province (ZR2020JQ04), the National Natural Science Foundation of China (11874238) and Singapore MOE Tier 1 Grant RG146/17.

SUPPORTING INFORMATION

Datasets

We mainly used the protein-ligand complexes of the PDBbind database v.2019 (http://www.pdbbind-cn.org/) for training. This database consists of two overlapping subsets, the general set and the refined set. The general set includes all available complexes, and the refined set comprises protein-ligand complexes with high-quality structures and binding data selected from the general set. For each protein-ligand complex, the binding affinity is represented by the negative logarithm (pKa) of the dissociation constant (Kd), inhibition constant (Ki) or half-maximal inhibitory concentration (IC50). To evaluate the predictive ability and compare with other scoring functions, OnionNet-2 was evaluated on the CASF-2016 test set (core set v.2016) [1] and the CASF-2013 test set (core set v.2013). [2-4] It should be noted that core set v.2016, the latest update of the CASF benchmark, contains 285 high-quality complexes, while core set v.2013 is a subset of PDBbind database v.2013 consisting of 195 protein-ligand complexes classified into 65 clusters, with binding constants spanning nearly 10 orders of magnitude. Besides, the CSAR NRC-HiQ data set, consisting of two subsets of 176 and 167 complexes respectively, [5] was employed as a third test set. For the previous models Kdeep and RF-Score, 55 and 49 complexes from the two subsets were used as test data. [6] To provide a direct comparison with Kdeep and RF-Score, we adopted the same data for the OnionNet-2 test. To perform training and testing properly, the remaining complexes in PDBbind database v.2019 need to be redistributed. First, we excluded the complexes contained in the three test sets from PDBbind database v.2019 (general set and refined set).
Then, as a common practice (following Pafnucy [7] and OnionNet [8]), 1000 complexes were randomly sampled from the v.2019 refined set (after filtering out all complexes used in the test sets described above) as the validation set. Finally, the remaining complexes (i.e., those included in neither the three test sets nor the validation set) were adopted as the training set. This ensures that no protein-ligand complex is shared among the training set, validation set and test sets.
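The redistribution described above can be sketched as follows. This is a minimal illustration with our own function name and placeholder complex identifiers; the actual split operates on PDBbind index files.

```python
import random

def split_pdbbind(general_set, refined_set, test_sets, n_valid=1000, seed=42):
    """Exclude test-set complexes, sample a validation set from the refined
    set, and use the rest of the general set for training."""
    test_ids = set().union(*test_sets)
    # Refined-set complexes not used in any test set are validation candidates.
    candidates = [c for c in refined_set if c not in test_ids]
    valid = set(random.Random(seed).sample(candidates, n_valid))
    # Everything else in the general set becomes the training set.
    train = [c for c in general_set if c not in test_ids and c not in valid]
    return train, sorted(valid)
```

By construction the three resulting groups are pairwise disjoint, matching the no-overlap requirement stated above.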

Architecture
We adopted a CNN model based on 2D convolution to learn the relationship between the contact features and the $\Delta G$. The model was constructed using the Keras API in TensorFlow. [9] The workflow and architecture are shown in Fig. 1.
The raw data are pre-processed before being input into the CNN model; the features are standardized using the scikit-learn package. [10] Three convolutional layers with 32, 64 and 128 filters, respectively, were used; the filter sizes were all set to 4 with a stride of 1. After preliminary tests, two fully connected layers with 100 and 50 neurons were placed before the output layer, which is capable of capturing the nonlinear relationship between the features and the pKa values.
To further increase the nonlinearity of the model, a rectified linear unit (ReLU) activation was added after each convolutional layer and fully connected layer. Also, a batch normalization layer was used after the fully connected layer. The stochastic gradient descent (SGD) optimizer was adopted with a learning rate of 0.001. To reduce overfitting, L2 regularization with weight decay λ = 0.01 was applied to each fully connected layer. The number of samples processed per batch was 64.
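As a sanity check on the stack above, the feature-map sizes through three kernel-4, stride-1 convolutions can be tracked with the usual no-padding formula. The input height/width used in the usage note are placeholders; the exact 2D reshaping of the 168 × N feature vector is not restated here.

```python
def conv_out(size, kernel=4, stride=1):
    """Output size of a 'valid' (no-padding) convolution along one axis."""
    return (size - kernel) // stride + 1

def stack_shapes(height, width, n_convs=3):
    """Trace (height, width) through the three convolutional layers."""
    shapes = [(height, width)]
    for _ in range(n_convs):
        height, width = conv_out(height), conv_out(width)
        shapes.append((height, width))
    return shapes
```

For example, a hypothetical 84 × 124 input loses 3 rows and 3 columns per layer, ending at 75 × 115 before the fully connected layers.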

PART II. THE STATISTICAL INFORMATION OF THE TRAINING SET AND VALIDATING SET WHEN USING DIFFERENT VERSIONS OF PDBBIND DATABASE
Generally, the size of the dataset has an impact on the predictive ability of a DL model. [11,12] To make a fair comparison with previous scoring functions, we re-trained the model with two earlier versions (v.2016 and v.2018) of the PDBbind database; the numbers of samples in the training and validation sets are shown in Table S1. We only re-trained the models for N = 58, 60 and 62. The performance on core set v.2013 and core set v.2016 is summarized in Fig. S1. Clearly, although the three versions of the PDBbind database differ greatly in size, the R values of the re-trained models are almost the same. This suggests that the differences between these three databases (namely PDBbind v.2016, v.2018 and v.2019) have a rather limited impact on our OnionNet-2 model.

The recoverable entries of the accompanying scoring-function comparison are:

X-Score [14]       0.614  1.78
ChemScore [14]     0.592  1.82
AutoDock Vina [15] 0.54   1.90

PART IV. THE DETAILED INFORMATION OF THE GENERATION OF DECOYS
In the CASF-2016 benchmark, the similarity between two binding poses is measured by the root-mean-square deviation (RMSD). [4] Following previous studies by Allen et al., [16] we adopted the Hungarian algorithm, as implemented in spyrmsd, [17] to calculate the RMSD between each decoy ligand and the native structure. In this study, we used AutoDock Vina to generate non-native binding poses (called decoys). The sampling space was a 27 Å × 27 Å × 27 Å cubic box centered on the geometric center of the native ligand binding pose, and the exhaustiveness value was set to 12. It is worth noting that AutoDock Vina ignores the effects of water molecules and ions when calculating the protein-ligand binding energy. In addition, changes in the ligand binding pose are accompanied by changes in the surrounding microenvironment. Therefore, the receptor-decoy complexes here do not include water molecules and ions.
For each receptor, up to 20 decoy ligands were generated by AutoDock Vina; the actual number may be less than 20 because of the limited size and shape of the binding pocket in the target protein. For each decoy, the RMSD with respect to the native structure was calculated.
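The assignment-based matching behind such an RMSD can be illustrated with a generic Hungarian-algorithm sketch using SciPy. This is not the spyrmsd implementation used in the study, and a real calculation would also restrict matching to atoms of the same element and account for molecular symmetry.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_rmsd(coords_a, coords_b):
    """RMSD under the one-to-one atom matching that minimizes the total
    squared distance (solved with the Hungarian algorithm)."""
    a = np.asarray(coords_a, float)
    b = np.asarray(coords_b, float)
    # Pairwise squared distances between every atom of a and every atom of b.
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].mean()))
```

For identical coordinates listed in a different atom order, the optimal matching recovers an RMSD of zero, which is exactly the atom-renumbering problem the Hungarian approach addresses.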