Improved Prediction of Aqueous Solubility of Novel Compounds by Going Deeper With Deep Learning

Aqueous solubility is an important physicochemical property of compounds in anti-cancer drug discovery. Artificial intelligence solubility prediction tools have achieved impressive performances by employing regression, machine learning, and deep learning methods. The reported performances vary significantly, partly because of the different datasets used. Solubility prediction on novel compounds needs to be improved, which may be achieved by going deeper with deep learning. We constructed deeper-net models of ~20-layer modified ResNet convolutional neural network architecture, which were trained and tested with 9,943 compounds encoded by molecular fingerprints. Retrospectively tested on 62 recently-published novel compounds, one deeper-net model outperformed four established tools, shallow-net models, and four human experts. Deeper-net models also outperformed the others in predicting the solubility values of a series of novel compounds newly-synthesized for anti-cancer drug discovery. Solubility prediction may thus be improved by going deeper with deep learning. Our deeper-net models are accessible at http://www.npbdb.net/solubility/index.jsp.

AI solubility prediction tools may be critically tested by newly-published novel compounds. Tested on 62 novel compounds published since November 2017 (Methods section), four established tools, MOE V2016.0802, QikProp QP18 and CIQP18, and AlogGPS V2.1, scored significantly lower R² (<0.2) and higher RMSE (0.814-1.162) values (Results section) than the typically-reported values (5,6,8,14,15). Our own deep learning model of typically-employed shallow-net architecture (Methods section), trained and tested with 9,943 compounds, also scored a lower R² (0.307) and a higher RMSE (0.739) value (Results section). Hence, there is a need for improved solubility prediction, particularly on novel compounds, to promote oral anti-cancer drug development. In the AI field, deep learning methods with distinguished learning capabilities (16) [as demonstrated by the prediction of CRISPR-Cpf1 guide RNA activity (17) and of protein-ligand binding affinity (18)] are useful for this task, but their potential has yet to be fully realized.
The published deep learning solubility prediction models are primarily shallow-nets (3-7 layers) (10)(11)(12). Deep learning performances have been routinely enhanced by going deeper (adding more layers to shallow-nets) (19)(20)(21). Although performances can also be enhanced by going wider (22), it may be practically easier to develop deeper-nets by tapping into the well-established architectures that require fewer parameters (19)(20)(21). The depth of deeper-nets or the width of wider-nets is constrained by the limited number of compounds with experimental solubility data. The architecture with fewer parameters, the convolutional neural network (CNN), is therefore preferred. A key question is whether the superior local-feature learning capability of CNNs can adequately learn the molecular features of compounds. To fit the local-feature learning capability of CNNs, compounds are better represented by substructure-encoded molecular fingerprints (23) instead of the molecular descriptors used for solubility prediction by previously-developed deep learning models (10)(11)(12). Molecular fingerprints are vectors whose individual components encode specific sub-structures of molecules. Hence, the superior local-feature learning capability of CNNs is expected to be useful for capturing the key sub-structural elements, and their combinations, contributing to the solubility of molecules.
We constructed N-layer CNN models (N = 14, 20, and 26) using 9,943 compounds based on a residual network (ResNet) architecture (20); these are significantly deeper than the previously-developed 3-7 layer shallow-net models (10)(11)(12). The solubility prediction capability of our deeper-net models was tested by retrospective prediction of the experimental solubility of 62 recently-published novel compounds beyond the training and testing compounds. These performances were compared with those of four established tools, the shallow-net models, and four human experts. Our deeper-net models and the others were further tested in a real anti-cancer drug discovery project with a series of novel compounds newly-synthesized for discovering FLT3 inhibitors. These compounds were considered difficult for solubility estimation by medicinal chemistry experts, making them ideal for a rigorous test of solubility prediction models. Our models are accessible at http://www.npbdb.net/solubility/index.jsp for supporting broader tests.

Data Collection and Processing
A total of 10,166 compounds with experimental aqueous solubility values were collected from the ChemIDplus database (24) and PubMed (9,25,26). The 62 novel compounds (Figure 1) were collected from the PMC database (27)(28)(29)(30)(31) by searching with the keyword combination of "novel", "new", and "solubility" under the following criteria: published between November 2017 and May 2018, and solubility measured at room temperature and around pH 7.0. For the 10,166 compounds, their SMILES strings (which encode sub-structures), InChIKeys (chemical structure identifiers), and aqueous solubility values were collected from the searched sources. For the 62 novel compounds, their structures were drawn from literature-reported structures using ChemDraw 18.0 and then converted to SMILES strings using RDKit. Solubility (S) values in different units (e.g., µg/mL, mg/mL, and mg/L) were converted to mol/L and transformed into logS (logarithmic) values. The SMILES strings were converted to canonical SMILES strings for consistency using Open Babel (32). Duplicates were removed by InChIKey comparisons. The canonical SMILES of the remaining non-redundant 9,943 compounds (Supplementary Table 1, with basic physical properties detailed in Supplementary Table 2) and of the 62 novel compounds were converted into PubChem molecular fingerprints (which encode sub-structures as 881 bits) using PaDEL (33).
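The unit conversion described above can be sketched in a few lines; the molecular weights and measured values below are illustrative placeholders, not entries from the datasets:

```python
from math import log10

def to_logs(value, unit, mol_weight):
    """Convert a measured solubility to logS (S in mol/L).

    value: measured solubility in the given unit
    unit: one of 'ug/mL', 'mg/mL', 'mg/L'
    mol_weight: molecular weight in g/mol
    """
    grams_per_liter = {
        "ug/mL": value * 1e-3,   # 1 ug/mL = 1 mg/L = 1e-3 g/L
        "mg/mL": value,          # 1 mg/mL = 1 g/L
        "mg/L": value * 1e-3,    # 1 mg/L = 1e-3 g/L
    }[unit]
    mol_per_liter = grams_per_liter / mol_weight
    return log10(mol_per_liter)
```

For example, a compound of molecular weight 100 g/mol measured at 1 µg/mL corresponds to 10⁻⁵ mol/L, i.e., logS = -5.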

Established Tools and a Deep Learning Model of Typically-Employed Shallow-Net Architecture for Solubility Prediction

Solubility prediction performances were comparatively evaluated with respect to four established software tools [MOE V2016.0802, QikProp 2018-4 QP18 and CIQP18, and AlogGPS V2.1, which is based on an artificial neural network method (5)]. The deep learning model was developed based on a typically-employed shallow-net deep neural network (DNN) architecture for solubility prediction (11), a 4-hidden-layer DNN (Supplementary Figure 2) with the network architecture and parameter sets re-constructed from the literature descriptions (11) with the following minor variations: the activation function was changed from SReLU to ReLU, and the compounds were represented by PubChem molecular fingerprints instead of fp6 molecular fingerprints. The numbers of nodes of the hidden layers are 512, 1,024, 2,048, and 4,096. The parameters of L2 regularization and dropout regularization are 0.001 and 0.5, respectively. The 9,943 compounds were randomly divided into 90% training and 10% testing datasets for training the DNN model. The deeper-net CNN models were trained by the 10-fold cross validation method used for the development of two shallow-net deep learning solubility prediction models (10,12). In 10-fold cross validation, the 9,943 compounds were randomly divided into 10 sets of approximately equal size, with each set used once as the testing dataset and the remaining 9 sets as the training dataset for training the CNN models. The CNN hyperparameters were optimized based on the overall performance over the 10 training/testing datasets. These hyperparameters include the loss function, kernel sizes, number of filters, stride lengths, number of fully-connected hidden layers, number of neurons of the fully-connected layer, activation function, optimizer, learning rate, weight initialization, regularization, batch size, and epochs.
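For illustration, a minimal NumPy sketch of the inference-time forward pass of such a 4-hidden-layer DNN follows. The layer widths are those stated above; the He-style random initialization and the mock fingerprints are placeholders for illustration only, not the published model's parameters (dropout and L2 weight decay apply only during training and are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths: 881-bit PubChem fingerprint input, four hidden layers
# (512, 1024, 2048, 4096), and one output neuron for the logS value.
SIZES = [881, 512, 1024, 2048, 4096, 1]

# Placeholder He-style initialization (illustrative only).
weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
           for m, n in zip(SIZES[:-1], SIZES[1:])]
biases = [np.zeros(n) for n in SIZES[1:]]

def forward(x):
    """Inference-time forward pass: ReLU hidden layers, linear output."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)           # ReLU activation
    return h @ weights[-1] + biases[-1]          # linear logS output

fp = rng.integers(0, 2, size=(4, 881)).astype(float)  # 4 mock fingerprints
pred = forward(fp)                                    # shape (4, 1)
```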
Multiple activation functions (Sigmoid, ReLU, Softmax) were evaluated in both the activation layers and the activation arguments of all forward layers. The weight initialization was uniform. L2 regularization was added by small amounts of L2 weight decay. A solubility value regression model was trained by least squares, using the mean squared error, MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)², between the predicted (ŷᵢ) and experimental (yᵢ) solubility values of the n training compounds as the loss function of the output of our deeper-net models.

Performance Evaluation Metrics
The solubility prediction performances of the developed deep learning models were assessed by two metrics used in the evaluation of previously-developed shallow-net deep learning models (10,12). One is the R² value, where R is the Pearson correlation coefficient between the predicted and experimental solubility values:

R = Σᵢ (ŷᵢ − mean(ŷ))(yᵢ − mean(y)) / √[Σᵢ (ŷᵢ − mean(ŷ))² · Σᵢ (yᵢ − mean(y))²]

The second is the root mean squared error:

RMSE = √[(1/n) Σᵢ (ŷᵢ − yᵢ)²]

where ŷᵢ is the predicted and yᵢ the experimental solubility value of the i-th compound. In statistics, R², the coefficient of determination, is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statistical measure used in a regression model to indicate how well the model fits the data. Theoretically, it is a goodness-of-fit indicator that can vary from -∞ to 1; the closer the R² value is to 1, the better the model fits the data, and vice versa. The other metric, RMSE, is the square root of the average of squared errors; it measures the differences between the values predicted by a model and the true values. RMSE is always non-negative, and a value closer to 0 indicates a better fit to the data.
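The two metrics can be sketched directly from their definitions; the logS values below are toy numbers, not data from this study:

```python
import numpy as np

def r_squared(y_pred, y_true):
    # Squared Pearson correlation coefficient between predicted and
    # experimental values, as defined above.
    r = np.corrcoef(y_pred, y_true)[0, 1]
    return r ** 2

def rmse(y_pred, y_true):
    # Root mean squared error between predicted and experimental values.
    diff = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

y_true = np.array([-3.2, -1.5, -4.8, -2.0])   # toy experimental logS values
y_pred = np.array([-3.0, -1.7, -4.5, -2.4])   # toy predicted logS values
```

Note that R² computed as a squared correlation lies in [0, 1]; only the regression-residual form of the coefficient of determination can become negative.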

Chemical Synthesis and Experimental Aqueous Solubility Determination
In one of our drug discovery projects toward anti-tumor therapeutics, a series of novel FLT3 inhibitors were designed and synthesized using structure-based drug design methods. The aqueous solubilities (pH = 7) of these compounds were measured using the modified shake flask method and RP-HPLC (34,35). Each compound was added into a 1.5 mL Eppendorf tube containing Milli-Q water (1 mL) to form the precipitates.

(Displaced figure caption) (A) The architecture of the CNN ResNet-like deep learning model. "conv1d x,y" denotes a 1D convolution layer with kernel size x and y filters, and the curvy arrows are the shortcut connections. A shortcut connection with a parameter layer increases dimensions. The different colors denote different layer classes in the architecture: green the first layer, white the last layer, gray the parameter layer of a shortcut connection, and the others the residual layers. The color change of the residual layers from purple to blue to yellow indicates the tensor dimension change from 9 to 18 to 36. (B) The shortcut connection in the architecture of the CNN ResNet-like deep learning model. Shortcut connections simply perform identity mapping by skipping one or more layers (20); their outputs are added to the outputs of the stacked layers without extra parameters or computational complexity.
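As an illustration of the identity shortcut described in the caption, a minimal 1D residual block might look like the following NumPy sketch (the delta kernels are toy values chosen to make the behavior easy to verify, not trained parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_same(x, w):
    # 1D cross-correlation with zero padding ("same"-length output);
    # w must have odd length for exact center alignment.
    return np.convolve(x, w[::-1], mode="same")

def residual_block(x, w1, w2):
    # Two stacked conv layers F(x) plus an identity shortcut:
    # out = relu(F(x) + x). The shortcut adds no extra parameters.
    f = conv1d_same(relu(conv1d_same(x, w1)), w2)
    return relu(f + x)

x = np.array([1.0, 2.0, 3.0])
delta = np.array([0.0, 1.0, 0.0])    # identity kernel for demonstration
y = residual_block(x, delta, delta)  # F(x) == x here, so y == 2 * x
```

With the delta kernels, each convolution is the identity map, so the block output is simply relu(x + x); a trained block would learn w1 and w2, while the shortcut path remains parameter-free.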

RESULTS
The Training of the Deeper-Net Models and Solubility Prediction Performance Evaluation

The 10-fold cross validation performances of our deeper-net models are summarized in Supplementary Table 4. In spite of their different depths, these models performed similarly well, possibly because the superior predictive capability of these deeper-net models cannot be fully tested by the 1-fold (1/10) testing datasets. The test by novel compounds may be better for probing the predictive capabilities. The reported 10-fold cross validation performances of the two previously-developed shallow-net models are R² = 0.86-0.92 and 0.90-0.92, and RMSE = 0.58-0.79 and 0.45-0.50, respectively (10,12), which are substantially better than those of our deeper-net models. It is noted that our datasets (994 testing compounds, 8,949 training compounds) are significantly larger than those of the two previously-developed shallow-net models (102-287 and 129-154 testing compounds, 923-2,586 and 1,161-1,537 training compounds, respectively) (10,12). Caution is therefore needed in a direct comparison of the performance statistics of these models. The significantly more diverse testing datasets may partly contribute to the lower performance statistics, but the more diverse training datasets likely lead to a more robust prediction capability than less diverse training datasets. Because of the inaccessibility of the previously-published shallow-net models, it is impossible to test these models on a common set of diverse compounds. Therefore, these models were tested on the 62 newly-published novel compounds and on a series of novel compounds from our anti-cancer drug discovery project, with solubility measured for the first time in this work.

Prediction of the Solubility Values of Literature-Reported Novel Compounds by the Deeper-Net Models in Comparison With the Established Tools and Shallow-Net Models
The solubility prediction capability of our deeper-net models was tested by the 62 newly-published novel compounds. We also trained a 1-layer DNN model, a 6-layer DNN model, and an 8-layer ResNet-like model as our shallow-net models. The testing results of these models are included in Table 1, and the predicted logS values of these models with respect to the experimental logS values are in the Supplementary Material.

Comparison With Human Experts in Coarse-Grained Classification of the Solubility Categories of the Literature-Reported Novel Compounds
Four human experts in medicinal chemistry were selected from the China Pharmaceutical University using the criterion of a recent machine vs. human comparative solubility prediction study (9), i.e., a human expert is someone with medicinal chemistry expertise working or studying in a university. These four experts include one assistant professor and three PhD students. They were tasked to conduct coarse-grained classification of the aqueous solubility of the 62 novel compounds at room temperature into one of the following categories: practically insoluble or insoluble (<0.1 g/L), slightly soluble (0.1∼10 g/L), soluble (10∼100 g/L), and freely soluble (>100 g/L). The classification performances of these four experts, together with those of the established tools and the shallow- and deeper-net models, are in Table 2. All tools and models achieved high classification accuracies of 79.0-91.9%, significantly outperforming the human experts (6.5-74.2%). This indicates the superior capability of both the established tools and the deep learning models over human experts in coarse-grained classification of the solubility categories of novel compounds. However, no definite conclusion could be drawn on which was better between the established tools and the deep learning models, and no improving trend was found with increasing depth of the deep learning models.

Prediction of the Solubility Values of Novel Compounds Newly-Synthesized for an Anti-Cancer Drug Discovery Project

The aqueous solubility values of the newly-synthesized compounds (Supplementary Figure 4) were experimentally measured using the method described in the Methods section. We were unable to determine the exact solubility values for 12 compounds because they are insoluble below 1.0000E-2 mg/mL in neutral water. Hence, only the remaining five compounds (Figure 3) with exact experimental solubility values were used for testing our deeper-net models and the other models. Partly because of the novelty and low number of compounds, the R² values of all models are well below statistically meaningful values.
Hence, only the RMSE values and the percent of predicted logS values within 10-fold of the experimental value were used for performance evaluation (Table 3). Judged by the RMSE values, the deeper-net models substantially outperformed all other models, with the 26-layer model as the best one in spite of a minor level of overfitting. This further indicated the advantage of going deeper for improved solubility prediction. Judged by the percent of predicted logS values within 10-fold of the experimental value, the majority of the models (including the 14- and 20-layer deeper-net models) achieved equally good performances (60%), with the 26-layer model as the best one (80%). This again showed that both the established tools and the deep learning models are useful for rough estimation of the solubility values of novel compounds.
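The within-10-fold criterion corresponds to a predicted solubility within one log unit of the experimental value (a factor of 10 in S). A minimal sketch of this metric, using toy logS values rather than the study's data:

```python
import numpy as np

def percent_within_10_fold(logs_pred, logs_true):
    # "Within 10-fold" of the experimental solubility S means the
    # predicted logS differs from the experimental logS by at most 1.
    pred = np.asarray(logs_pred)
    true = np.asarray(logs_true)
    return 100.0 * np.mean(np.abs(pred - true) <= 1.0)

# Toy example: 4 of the 5 predictions fall within one log unit.
logs_true = [-4.0, -3.5, -2.0, -5.1, -3.0]
logs_pred = [-4.8, -3.4, -2.9, -5.0, -1.2]
```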

DISCUSSIONS
Like successful applications of deep learning methods in other fields (19)(20)(21), the superior learning capability of deeper-net models may be exploited to improve solubility prediction of novel compounds, including compounds considered by medicinal chemistry experts as difficult for solubility estimation. To better explore the learning capability of deeper-net architectures, the molecular representations of the compounds may be selected to conform to these architectures. Specifically, the superior local-feature learning capability of the CNN architectures may be better exploited by using substructure-encoded molecular fingerprints to represent compounds. Our studies consistently showed substantially better solubility prediction performances of the deeper-net deep learning models on novel compounds than the established tools and shallow-net models. Nonetheless, the prediction performance of the deeper-net models on novel compounds is affected by the limited number of 9,943 compounds available for training these models. The solubility prediction capability of the deeper-net methods may be further enhanced with expanded experimental solubility data and by means of algorithm development. Our approach may find broader applications in the development of high-performance deep learning models for the prediction of various pharmacodynamic, pharmacokinetic, and toxicological properties.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
YC and HZ conceived and designed this research and contributed to the preparation and write-up of this manuscript. HZ designed and conducted the computation. QC processed the data, built the models, generated and analyzed the prediction results, and participated in the manuscript writing. SL synthesized the compounds, experimentally measured their aqueous solubility values, and participated in the manuscript writing. BN collected and processed information on the 62 recently-published novel compounds and assisted in optimizing the models. XZ predicted the aqueous solubility values of the novel compounds using commercial software. YT assisted in the design of this study.