Improving Small Molecule pK a Prediction Using Transfer Learning With Graph Neural Networks

Enumerating protonation states and calculating microstate pK a values of small molecules is an important yet challenging task for lead optimization and molecular modeling. Commercial and non-commercial solutions have notable limitations such as restrictive and expensive licenses, high CPU/GPU hour requirements, or the need for expert knowledge to set up and use. We present a graph neural network model that is trained on 714,906 calculated microstate pK a predictions from molecules obtained from the ChEMBL database. The model is fine-tuned on a set of 5,994 experimental pK a values significantly improving its performance on two challenging test sets. Combining the graph neural network model with Dimorphite-DL, an open-source program for enumerating ionization states, we have developed the open-source Python package pkasolver, which is able to generate and enumerate protonation states and calculate pK a values with high accuracy.

April 30, 2022 test set is shown. 50 training runs with different training/validation splits were performed and for each training run the best model was selected based on its performance on the validation set (shown here is a single, randomly selected training run). Panel A shows the performance of the GNN model on the Literature data set. Panel B shows the performance of the GNN model on the Novartis data set. The solid red line in the scatter plot indicates the ideal behavior of the reference and predicted pK values, the dashed lines mark the ±1 pk unit interval. Mean absolute error (MAE) and root mean squared error (RMSE) are shown, the values in bracket indicate the 90% confidence interval calculated from 50 repetitions with random training/validation splits.
indicates the number of investigated samples. 50 training runs with different training/validation splits were performed and for each training run the best model was selected based on its performance on the validation set (shown here is a single, randomly selected training run). Panel A shows the validation set performance of the best GNN model trained on the ChEMBL data set. Panel B shows the validation set performance starting from the same pre-trained model after fine-tuning on the experimental training set. The solid red line in the scatter plot indicates the ideal behavior of the reference and predicted pK values, the dashed lines mark the ±1 pk unit interval. Mean absolute error (MAE) and root mean squared error (RMSE) are shown, the values in bracket indicate the 90% confidence interval calculated from 50 repetitions with random training/validation splits. indicates the number of investigated samples. . The accuracy of the fine-tuned GNN model only decreases slightly when molecules from the ChEMBL data set are used for regularization. 50 fine-tuning runs with different training/validation splits were performed, each initialized using the parameters of 50 pre-training runs, and for each training run the best model was selected based on its performance on the validation set. In order to generate a single plot we selected randomly a single fine-tuning run and generated the scatter plot with the best performing model on the validation set. The solid red line in the scatter plot indicates the ideal behavior of the reference and predicted pK values, the dashed lines mark the ±1 pk unit interval. Mean absolute error (MAE) and root mean squared error (RMSE) are shown, the values in bracket indicate the 90% confidence interval calculated from 50 repetitions with random training/validation splits. indicates the number of investigated samples.

Predicting pK values
The protonation states and pK values discussed below are shown separately in the corresponding figures for each molecule. pK values are calculated as the average value of 50 trained models with the standard deviation in parenthesis. Values are rounded to one significant digit. Predictions are shown for pkasolver-epic (using transfer learning and the large ChEMBL data set for which pK values were calculated using Epik) and pkasolver-light (using only the fine-tuning data set for training). pkasolver-epic performs with higher accuracy than pkasolver-light for molecules with multiple protonation states. pkasolver-light performs well for monoprotic molecules, but its use is not recommended for molecules with multiple protonation states and/or sites.
Ethylenediaminetetraacetic acid 6 protonation state were identified and the calculated pK values are 1.5 (0.3), 1.9 (0.2), 2.3 (0.2), 2.6 (0.2), 6.1 (0.8), and 9.4 (0.4), shown in Figure S.I.9. Comparing this with the experimentally determined pK values of 0.0, 1.5, 2, 2.66, 6.16, and 10.24 the values are in good relative agreement and the correct protonation states are identified [? ]. Only the first and last pK estimates are noticeable too high and too low respectively. This highlights a limitation of pkasolver-epic. The range of pK values present in the training data was between zero and 14, and predictions of pK values will stay in this range.
pkasolver-light identified the same first 5 protonation states (but misses the last one) and calculates the pK values with 2.8 (0. MolGpKa identifies three protonation sites with pK values 8.3, 10.3, 13.8 (all of which agree qualitatively and quantitatively very well with the predicted values for pkasolver-epic). pkasolver-epic identified one additional protonation state, the charged state of the secondary amine with a pK value of 2.46 and a very high standard deviation of 1.4 pK units. While the protonation state is certainly possible at low pH, the high standard deviation points to uncertainty in the trained models.
pkasolver-light identified the same states and calculates the pK values with 7.10 (1.16), 8.38 (0.47), 9.15 (0.32) and 9.67 (0.42). The first state is significantly different than the value calculated with pkasolver-epic. Also the pK value of the last state is significantly too low. Cocaine A single protonation state was identified and pkasolver-epic calculates the pK values with 8.4 (0.1), shown in Figure S.I.11. This is in good agreement with the experimental pK value of 8.6 for the same protonation site [? ].
pkasolver-light identified the same state and calculates the pK value with 8.5 (0.2). Tyrosine 3 protonation states were generated and pkasolver-epic calculates the pK values with 2.4 (0.1), 9.1 (0.3), and 10.21 (0.2), shown in Figure S.I.12. The experimental pK values are 2.2, 9.21, and 10.5 1 . This is one of the examples in which the sequential pK prediction seems to perform much better than the pK prediction for discrete groups, as performed e.g. by MolGpka (which calculates the pK values for the same protonation states with 2.3, 6.5, and 8.8). pkasolver-light identified the same state and calculates the pK value with 9.0 (0.6).
Aspergillic acid 2 protonation states were identified and pkasolver-epic calculates the pK values with 2.8 (0.5) and 4.6 (0.9), shown in Figure S.I.14. The experimental pK value is 5.5 [? ]. Here, pkasolver provides the correct protonation states but for the protonation site with experimental pK value, the estimate is off by 1 pK unit.
pkasolver-light identified the same protonation states and calculates the pK values with 4.9 (0.6) and 5.9 (1.0). The experimentally determined value for the acid group is predicted accurately by pkasolver-light.  3, 8.7, 9.7, 13.42 shows good agreement 2 . This example shows again that when approaching extreme values (0 or 14) the model tends to perform worse than for values in the physiologically relevant range.
pkasolver-light identifies only the carboxylic acid and amine group as possible protonation sites and calculates the pK values with 5.6 (0.7) and 9.2 (0.5). The estimate for the carboxylic group is too low while the prediction for the amine is reasonably accurate.
Furosemide 5 protonation states are identified and pkasolver-epic calculates the pK values with 1.1 (0.4), 3.4 (0.3), 6.2 (1.1), 9.7 (0.1) and 13.1 (0.4), shown in Figure S.I.17. The pK value for the carboxylic group closely matches the experimentally measured pK of 3.52, also the pK value of the sulfonamide is in proximity to the experimental value of 7.5 [? ]. There are no experimental values available for the more extreme pKa values but compared with e.g. MolGPka, the pK values for the protonation states are in good agreement.
pkasolver-light identifies the same protonation sites but different protonation states, which are shown in Figure  pkasolver-light identified the same states and calculates the pK values with 6.65 (0.5) and 8.6 (0.6). Pyridine A single protonation state was identified and pkasolver-epic calculates the pK value with 5.2 (0.2), closely matching the experimental pK value of 5.25 [? ]. Protonation state and pK value is shown in Figure S.I.20.
pkasolver-light identified the same state and calculates the pK value with 5.0 (0.2).
Testing pkasolver-epic on the SAMPL6 molecules The experimental and calculated pK values are shown in Table S.I.2. For 21 of the 24 molecules pkasolver-epic proposes too many protonation states in the investigate pK range [? ]. This makes a direct comparison difficult. If the protonation state nearest to pH 7.4 is matched to the experimental pK values (shown in Table  S.I.2 in red) the MAE of the predictions is 1.0 and the RMSE 1.4 pK units.
This set of molecules shows that the interplay between pkasolver and the sequential protonation state generation using Dimorphite-DL has to be improved. Currently, pkasolver tries to de-and protonate each protonation site proposed by Dimorphite-DL, irrespective of whether it is an acidic or basic site. . pK values were calculated using pkasolver-epic. pK values and standard distribution (shown in parenthesis) are rounded to one significant digit. The pK value used to match the experimental pK value is shown in red.
Figure S.I.9. Results are shown for a sequential pK prediction using pkasolver-epic for ethylenediaminetetraacetic acid (EDTA). For each protonation state the base-acid pair is shown and the consensus prediction for the pK value with the standard deviation is shown. The protonation site is highlighted for each protonation state. 14 of 21   Results are shown for a sequential pK prediction using pkasolver-epic for aspergillic acid.  Results are shown for a sequential pK prediction using pkasolver-epic for pyridine.