Edited by: Sergio Decherchi, Italian Institute of Technology (IIT), Italy
Reviewed by: Oscar Mendez Lucio, Bayer, France; Zhili Zuo, Kunming Institute of Botany (CAS), China
This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Protein-ligand binding affinity is a key pharmacodynamic endpoint in drug discovery. Sole reliance on experimental design, make, and test cycles is costly and time-consuming, providing an opportunity for computational methods to assist. Herein, we present results comparing random forest and feed-forward neural network proteochemometric models on their ability to predict pIC50 measurements for held-out generic Bemis-Murcko scaffolds. In addition, we assess the ability of conformal prediction to provide calibrated prediction intervals in both a retrospective and a semi-prospective test using the recently released Grand Challenge 4 data set as an external test set. Overall, the random forest and deep neural network proteochemometric models show strong retrospective performance but suffer in the semi-prospective setting. However, the conformal prediction intervals prove to be well calibrated both retrospectively and semi-prospectively, showing that they can be used to guide hit discovery and lead optimization campaigns.
One of the most important phases of a drug discovery campaign is the discovery of a potent inhibitor of the target driving the disease phenotype. Experimental design, make, test cycles seek to advance initial hits to lead compounds by optimizing protein-ligand binding affinity. However, this process is frequently slow and costly, adding to the already large cost of drug discovery. As such, computational methods that can accelerate this optimization phase by predicting protein-ligand binding affinity values are highly sought after. Fully atomistic simulation approaches model the physics of protein-ligand binding by integrating Newton's equations of motion over time in molecular dynamics simulations (Jorgensen and Thomas,
We first compare the performance of random forest (RF) and feed-forward neural network (FFN) proteochemometric (PCM) models trained using the recently released ChEMBL25 data set (Gaulton et al.,
The recently released ChEMBL25 (Gaulton et al.,
RF models were trained using the Scikit-learn (Pedregosa et al.,
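The PCM setup can be sketched with scikit-learn as follows. This is an illustrative sketch, not the authors' exact pipeline: the random ligand and protein features below are hypothetical stand-ins for the ECFP6/CDDD ligand descriptors and protein sequence descriptors used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real descriptors: binary ligand
# fingerprint bits and a continuous protein sequence descriptor.
n_samples = 200
ligand_fp = rng.integers(0, 2, size=(n_samples, 128))
protein_desc = rng.random((n_samples, 32))

# PCM feature vector = ligand features concatenated with protein
# features, so one model is trained across many targets at once.
X = np.hstack([ligand_fp, protein_desc])
y = rng.random(n_samples) * 6 + 4  # synthetic pIC50-like values

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```

The key design point is the concatenation step: unlike a single-target QSAR model, the protein block lets the model pool binding data across related targets.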
Rigorous quantification of model confidence is essential in fields such as drug discovery, where chemical space is essentially infinite and models are trained on only a small fraction of possible compounds. Conformal prediction is a state-of-the-art method for producing prediction intervals, i.e., regions predicted to contain the true value, whose size is determined in part by a user-defined confidence level (Shafer and Vovk,
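The core mechanics of split (inductive) conformal regression can be sketched in a few lines; the synthetic calibration residuals below stand in for those of a trained model, and the simple absolute-error nonconformity score is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic calibration set: truths and a model's predictions.
y_calib = rng.normal(0.0, 1.0, 1000)
pred_calib = y_calib + rng.normal(0.0, 0.5, 1000)

alpha = 0.05  # 95% confidence level
scores = np.abs(y_calib - pred_calib)  # nonconformity scores

# The (1 - alpha)-quantile of the calibration scores (with the +1
# finite-sample correction) sets the interval half-width.
k = int(np.ceil((1 - alpha) * (len(scores) + 1)))
half_width = np.sort(scores)[min(k, len(scores)) - 1]

def predict_interval(point_pred):
    # Interval expected to contain the truth ~95% of the time.
    return point_pred - half_width, point_pred + half_width

lo, hi = predict_interval(5.0)
```

Raising the confidence level widens the intervals; validity (coverage) is guaranteed on exchangeable data, which is why the semi-prospective test later in the paper is informative.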
It has been shown that overly simplistic (i.e., random) training/validation/test splits lead to overestimates of the accuracy of machine learning models (Wallach and Heifets,
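A scaffold-based split can be implemented as a grouped split, assuming each compound's Bemis-Murcko scaffold has already been computed (e.g., with RDKit's MurckoScaffold module, not shown here); the scaffold labels below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 compounds belonging to 4 hypothetical scaffolds.
X = np.arange(10).reshape(-1, 1)
y = np.linspace(4.0, 9.0, 10)
scaffolds = ["s0", "s0", "s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"]

# GroupShuffleSplit keeps every compound sharing a scaffold on the
# same side of the split, so held-out scaffolds are truly unseen.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=scaffolds))

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
```

Because whole scaffolds are held out, the test set probes generalization to new chemotypes rather than interpolation within a series, which is the harder and more realistic setting.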
In
ChEMBL25 validation set performance metrics for both the RF and FFN models as well as the SMILES featurization method used.
RF (ECFP6) | 0.38 | 0.80 | 0.61 |
RF (CDDD) | 0.47 | 0.75 | 0.55 |
FFN (entity embeddings) | 0.39 | 0.79 | 0.60 |
FFN (ECFP6) | 0.41 | 0.78 | 0.58 |
FFN (CDDD) | 0.42 | 0.78 | 0.58 |
Ensemble averaging is a strategy to improve prediction performance by averaging the individual predictions of multiple models. This wisdom-of-the-crowd approach works best when the individual models are uncorrelated, allowing errors to average out. The residuals of the RF and FFN models are correlated with an
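The residual-correlation check described above can be sketched with synthetic predictions; the shared-noise term below is an assumption used to mimic two models that make partially similar mistakes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two models with correlated errors: a shared error component plus
# model-specific noise (synthetic stand-ins for RF and FFN outputs).
y_true = rng.random(500) * 6 + 4
shared = rng.normal(0.0, 0.4, 500)
pred_rf = y_true + shared + rng.normal(0.0, 0.3, 500)
pred_ffn = y_true + shared + rng.normal(0.0, 0.3, 500)

# High residual correlation means averaging helps less, because the
# shared error component does not cancel.
res_rf = pred_rf - y_true
res_ffn = pred_ffn - y_true
residual_corr = np.corrcoef(res_rf, res_ffn)[0, 1]

# Ensemble average: the uncorrelated error parts partially cancel.
pred_ens = 0.5 * (pred_rf + pred_ffn)

def rmse(p):
    return float(np.sqrt(np.mean((p - y_true) ** 2)))
```

By the triangle inequality, the ensemble RMSE never exceeds the mean of the individual RMSEs; the gain shrinks as the residual correlation approaches 1.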
The number of data points per gene is heterogeneous in ChEMBL25, allowing a few genes to contribute more to performance metrics than others. To remove this bias, we calculated performance metrics across each individual gene with more than 100 data points in the validation set (
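The per-gene averaging described above can be sketched with pandas; the gene names and the >100-point threshold filter mirror the text, while the data themselves are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy validation set: three hypothetical genes with unequal
# numbers of measurements.
df = pd.DataFrame({
    "gene": ["GENE_A"] * 150 + ["GENE_B"] * 120 + ["GENE_C"] * 30,
    "y_true": rng.random(300) * 6 + 4,
})
df["y_pred"] = df["y_true"] + rng.normal(0.0, 0.5, 300)

# Per-gene Pearson r, restricted to genes with >100 points, then
# averaged so data-rich genes do not dominate the summary metric.
counts = df["gene"].value_counts()
eligible = counts[counts > 100].index
per_gene_r = (
    df[df["gene"].isin(eligible)]
    .groupby("gene")[["y_true", "y_pred"]]
    .apply(lambda g: g["y_true"].corr(g["y_pred"]))
)
mean_r = per_gene_r.mean()
```

Reporting the mean and standard deviation of `per_gene_r` (as in the table below) exposes heterogeneity that a single pooled metric would hide.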
Performance metrics averaged across individual UniProt IDs.
RF | 0.37 ± 0.17 | 0.65 ± 0.18 | 0.46 ± 0.15 |
FFN | 0.38 ± 0.15 | 0.61 ± 0.22 | 0.43 ± 0.17 |
Analysis of the training distribution of pIC50 values shows that substantial publication bias exists in the ChEMBL25 dataset (
Feature importance analysis allows us to determine whether the protein sequence component of the feature vector contributes to model performance. This analysis indicates that the feature importance for the RF model plateaus at approximately the 1,500th ranked feature out of 4,104 total features (
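The plateau analysis amounts to ranking the RF's impurity-based importances and inspecting the cumulative curve; a minimal sketch on synthetic data (where only the first three features carry signal) follows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Toy regression where only the first few features matter.
X = rng.random((300, 50))
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + rng.normal(0.0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort importances in descending order; a plateau in the cumulative
# curve marks where additional features stop contributing.
ranked = np.sort(rf.feature_importances_)[::-1]
cumulative = np.cumsum(ranked)
```

In the PCM setting, checking where the protein-descriptor features fall in this ranking indicates whether the sequence block is actually used by the model.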
ChEMBL25 validation set performance metrics for both the RF and FFN models as well as the SMILES featurization method used.
RF (ECFP6) | 0.96 | 0.41 | 0.27 |
RF (CDDD) | 0.89 | 0.46 | 0.31 |
FFN (entity embeddings) | 0.80 | 0.51 | 0.35 |
FFN (ECFP6) | 0.79 | 0.49 | 0.34 |
FFN (CDDD) | 0.78 | 0.48 | 0.33 |
Molecular fingerprints are the most commonly used technique for encoding molecules for ML model training. This technique hashes the atomic neighborhood of each atom to bits, representing molecules as 1D vectors. The approach can suffer from bit collisions. In addition, proximity between bits carries no meaning (Feinberg et al.,
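The hashing-and-collision idea can be illustrated with a deliberately simplified toy; real ECFP fingerprints hash canonical atom-environment identifiers (e.g., via RDKit) rather than the hypothetical fragment name strings used here.

```python
# Toy illustration of a hashed fingerprint. N_BITS is kept tiny to
# make collisions plausible; real fingerprints use 1,024+ bits.
N_BITS = 64

def fragment_bit(fragment_id: str) -> int:
    # Deterministic stand-in hash (Python's built-in hash() is
    # salted per process, so we avoid it here).
    h = 0
    for ch in fragment_id:
        h = (h * 31 + ord(ch)) % N_BITS
    return h

def fingerprint(fragments):
    bits = [0] * N_BITS
    for f in fragments:
        bits[fragment_bit(f)] = 1  # colliding fragments share a bit
    return bits

fp = fingerprint(["benzene", "amide", "pyridine"])
n_set = sum(fp)
```

When two distinct fragments hash to the same index, the bit vector cannot distinguish them, and neighboring bit positions say nothing about chemical similarity; both limitations motivate the learned entity embeddings discussed above.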
t-SNE plot of the FFN entity embeddings for the category 1 variables of the Morgan fingerprint vector. Category 1 was selected for plotting as it denotes the presence of a chemical fragment. The category 1 weights of bits 0, 100, 1,000, 2,000, 2,500, and 4,050 are plotted in black to illustrate how the FFN model groups bits into distinct clusters.
The ChEMBL25 validation set was also used to assess the retrospective performance of conformal prediction. Analysis of the RF prediction interval sizes shows that they span a larger range of values than those from the FFN (
The true test of model performance is a prospective one as this is how ML models are used in practice. Here, we use the recently released GC4 dataset (Parks et al.,
GC4 BACE-1 and CatS performance metrics.
RF | 0.25 (0.71) | 0.19 (0.51) | 0.25 (0.68) | 0.38 (0.49) |
FFN | 0.50 (0.77) | 0.29 (0.59) | 0.19 (0.80) | 0.26 (0.61) |
In addition to the model predictions, this semi-prospective test allows us to analyze the validity and efficiency of the conformal prediction intervals for both the RF and FFN. Both models achieve excellent validities on the CatS data set: the prediction intervals contained the true measured value 95 and 97% of the time for the RF and FFN models, respectively. The validities of both models suffer slightly on the BACE-1 data set relative to CatS, but still achieve excellent performance overall, with 81 and 80% validities. As a possible explanation for the validity degradation between CatS and BACE-1, we note that the distribution of nearest-neighbor Tanimoto coefficients shows that the CatS GC4 compounds are more similar to the ChEMBL25 training data than the BACE-1 GC4 compounds, hence providing an easier test for the conformal predictor (
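Given a set of prediction intervals, validity and efficiency reduce to two simple summary statistics; the intervals below are synthetic placeholders for the RF/FFN conformal outputs.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic truths, predictions, and a fixed illustrative 95%
# interval half-width standing in for conformal output.
y_true = rng.random(400) * 6 + 4
pred = y_true + rng.normal(0.0, 0.4, 400)
half_widths = np.full(400, 1.0)

lower, upper = pred - half_widths, pred + half_widths

# Validity: fraction of intervals that contain the measured value.
validity = float(np.mean((y_true >= lower) & (y_true <= upper)))

# Efficiency: summarized here by the median interval size; smaller
# intervals at the same validity are more useful in practice.
median_size = float(np.median(upper - lower))
```

A well-calibrated 95% predictor should report a validity near 0.95; validity alone is not enough, since arbitrarily wide intervals are trivially valid but useless.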
Box-whisker plots of the 95% confidence interval (CI) sizes (scaled pIC50) for CatS/BACE-1 predictions for
We next sought to compare the performance of the PCM models against a standard QSAR model trained using only CatS or BACE-1 ChEMBL25 data, respectively. Here, an RF QSAR model achieved approximately the same performance metrics on both targets as the RF PCM model, with a 0.26 Pearson correlation and 0.37 Kendall's tau for CatS and a 0.26 Pearson correlation and 0.18 Kendall's tau for BACE-1. This indicates that the RF model may benefit from additional protein sequence descriptors, such as those from unsupervised training (Kim et al.,
Finally, we sought to test the impact of deleting all CatS and BACE-1 data and retraining the models using the same procedure. As shown in
Protein-ligand binding affinity is a key variable during hit discovery and lead optimization in drug discovery. Experimental design, make, test cycles that seek to optimize this property are costly and time-consuming, and hence limit the rate at which novel therapies enter the clinic. Computational methods seek to accelerate these cycles by producing reliable protein-ligand binding affinity predictions. Traditional QSAR models are trained using only chemical compound data for a specific target. Alternatively, PCM models featurize both the protein sequence and the ligand to create the final feature vector. This allows ML models to be trained on protein-ligand binding affinity data from multiple proteins at once, augmenting the size of the training set and potentially allowing the model to learn from related proteins (Lapins et al.,
Here we first analyze the performance of PCM models trained using the most recent ChEMBL25 database (Gaulton et al.,
Finally, we analyze the utility of conformal prediction to provide prediction intervals to assess model confidence. Conformal prediction was implemented using the standard deviation across the trees of the RF model and Bayesian dropout (Cortés-Ciriano and Bender,
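For the RF branch, the per-compound uncertainty mentioned above comes from the spread of the individual tree predictions; a minimal sketch (on synthetic data, using scikit-learn's `estimators_` attribute) follows. The FFN's Bayesian-dropout counterpart is omitted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)

# Synthetic regression data standing in for PCM features/pIC50s.
X = rng.random((300, 10))
y = X @ rng.random(10) + rng.normal(0.0, 0.2, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-sample uncertainty from the spread of individual tree
# predictions; this standard deviation can normalize conformal
# nonconformity scores so intervals widen for uncertain compounds.
tree_preds = np.stack([t.predict(X) for t in rf.estimators_])
tree_std = tree_preds.std(axis=0)
```

Dividing each calibration residual by the corresponding `tree_std` (plus a small constant) before taking the quantile yields per-compound interval widths rather than one global width.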
The datasets generated for this study can be found in the
CP, ZG, and RA planned the research, performed the analysis, and wrote the paper. CP and ZG performed the research.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at:
1. RDKit. Available online at:
2. MolVS: Molecule Validation and Standardization, MolVS 0.1.1 documentation. Available online at: