Edited by: Ping Gong, Engineer Research and Development Center (ERDC), United States
Reviewed by: Yun Tang, East China University of Science and Technology, China; Shengyong Yang, Sichuan University, China
This article was submitted to Toxicogenomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
For diverse reasons, most drug candidates never become marketed drugs. Developing reliable computational methods to predict the drug-likeness of candidate compounds is therefore of vital importance for improving the success rate of drug discovery and development. In this study, we used fully connected neural networks (FNNs) to construct drug-likeness classification models, with deep autoencoders used to initialize the model parameters. We collected datasets of drugs (represented by ZINC World Drug), bioactive molecules (represented by MDDR and WDI), and common molecules (represented by ZINC All Purchasable and ACD). Compounds were encoded with Mold2 two-dimensional structure descriptors. The classification accuracies of the drug-like/non-drug-like models are 91.04% on the WDI/ACD databases and 91.20% on MDDR/ZINC, respectively, outperforming previously reported models. In addition, we developed a drug/non-drug-like model (ZINC World Drug vs. ZINC All Purchasable), which distinguishes drugs from common compounds, with a classification accuracy of 96.99%. Our work shows that, by using high-dimensional molecular descriptors, deep learning can be applied to establish state-of-the-art drug-likeness prediction models.
Over the past several decades, various novel and effective techniques, such as high-throughput screening (HTS), fragment-based drug discovery (FBDD), and single-cell analysis, have been developed and have led to remarkable progress in the field of drug discovery. However, the number of new chemical entities (NCEs) approved by the FDA has not grown as rapidly as expected.
About 40% of candidate compounds fail to reach the market because of poor biopharmaceutical properties, commonly referred to as poor drug-likeness, which includes poor chemical stability, solubility, permeability, and metabolic stability.
A drug-likeness prediction model introduced by Wagener et al. used molecular descriptors based on counts of different atom types, together with decision trees, to discriminate between potential drugs and nondrugs. The model was trained using 10,000 compounds from the ACD and the WDI, and its prediction ACC on an independent validation set of 177,747 compounds was 82.6%.
Deep learning is a new wave of machine learning based on artificial neural networks (ANNs).
In this study, the whole chemical space was divided into drug, drug-like, and non-drug-like regions. Marketed drug molecules were represented by ZINC World Drug, bioactive drug-like molecules by MDDR and WDI, and common non-drug-like molecules by ZINC All Purchasable and ACD; the resulting dataset pairs are summarized below.
Detailed information on the dataset pairs.
| Dataset pair | Number of positives | Number of negatives | Total |
|---|---|---|---|
| WDI/ACD | 38,260 | 288,540 | 326,800 |
| MDDR/ZINC | 171,850 | 199,220 | 371,070 |
| WORLDDRUG/ZINC | 3,380 | 199,220 | 202,600 |
Data cleaning is a crucial step in cheminformatics calculations, as has been expounded in previous work. The preprocessing and post-processing steps applied in this study are listed in the table below.
Data preprocessing and post-processing steps used in this study.
| Step name/Software | Step description |
|---|---|
| Data preprocessing | |
| Element filter/KNIME | Hydrocarbons are removed. Molecules containing elements other than C, H, O, N, P, S, Cl, Br, I, and Si are removed. |
| Remove mixture/KNIME | All records containing more than one molecule are removed. |
| Standardize/ChemAxon Standardizer | Neutralize, tautomerize, aromatize, and clean 2D. |
| Remove duplicates/OpenBabel | Two molecules with the same InChI (including stereochemistry) are considered duplicates. If a molecule appears in both the drug set and the nondrug set, it is removed from the nondrug set. Among duplicates within the same set, only the first occurrence is kept. |
| Data post-processing | |
| Remove error values/Python | If a descriptor has a value of N/A or "infinity", the molecule it belongs to is removed. |
| Remove constant descriptors/Python | If a descriptor has the same value across all molecules, the descriptor is removed from the descriptor list. |
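The two Python post-processing steps above are simple row/column filters; a minimal sketch, assuming the computed descriptors are held in a pandas DataFrame (the variable and function names are illustrative, not the authors' code):

```python
import numpy as np
import pandas as pd

def postprocess_descriptors(desc: pd.DataFrame) -> pd.DataFrame:
    """Apply the two post-processing steps listed in the table above."""
    # Remove error values: drop any molecule (row) whose descriptors
    # contain N/A or infinity.
    desc = desc.replace([np.inf, -np.inf], np.nan).dropna(axis=0)
    # Remove constant descriptors: drop any column that takes the same
    # value across all molecules.
    return desc.loc[:, desc.nunique() > 1]
```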
We used 2D descriptors to encode the molecules. Descriptors of the preprocessed molecules were calculated with Mold2.
Because of the nature of this classification task, the positive and negative samples we collected were imbalanced. Predictive models developed from imbalanced data can be biased and inaccurate. We therefore adopted two methods to balance our data sets so that the ratio of positive to negative samples was approximately equal: the first was to duplicate the minority class until the ratio reached 1:1; the second was to use SMOTE (synthetic minority over-sampling technique) to generate synthetic minority samples, as sketched below.
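As an illustration of the second balancing strategy, the SMOTE implementation in the imbalanced-learn package could be used; a sketch on toy data standing in for the real descriptor matrix:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced data (10% positives) standing in for real descriptors.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# SMOTE synthesizes new minority samples by interpolating between a
# minority sample and its nearest minority-class neighbours.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))  # ratio becomes roughly 1:1
```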
An autoencoder is an unsupervised learning algorithm that trains a neural network to reconstruct its input, which makes it more capable of capturing the intrinsic structure of the input data instead of merely memorizing it. Intuitively, it attempts to build an encoding-decoding process such that the output approximates the input as closely as possible.
A schematic architecture of a stacked autoencoder. (Left) The architecture of the autoencoder; layers can be stacked one by one. (Right) A pre-trained autoencoder used to initialize a fully connected network with the same structure for classification.
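A minimal Keras sketch of the left half of the figure, training a single-layer AE to reconstruct its own input with MSE loss (random data stands in for the real descriptor matrix; the layer settings follow the hyper-parameter table below):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Stand-in for the z-scored Mold2 descriptor matrix
# (Mold2 yields 777 2D descriptors before cleaning).
X_train = np.random.randn(1000, 777).astype("float32")
n_features = X_train.shape[1]

# Encoder-decoder trained to reproduce the input (unsupervised).
inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(512, activation="relu",
                       kernel_initializer="truncated_normal",
                       kernel_regularizer=regularizers.l2(1e-4))(inputs)
decoded = layers.Dense(n_features)(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, batch_size=128, epochs=20, verbose=0)
```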
According to the partition of chemical space into drug, drug-like, and non-drug-like, two kinds of classification models are possible: drug-like/non-drug-like and drug/non-drug-like. The first matches the traditional definition of drug-likeness. The second also bears considerable practical value, but no model had been published to address it. In this study, to address drug-like/non-drug-like classification, we proposed two models, MDDRWDI/ZINC (MDDR plus WDI as the positive set, ZINC as the negative set) and WDI/ACD. To address drug/non-drug-like classification, we proposed the WORLDDRUG/ZINC model (ZINC World Drug as the positive set, ZINC All Purchasable as the negative set).
In this study, we used the open-source software library Keras to implement the models. The hyper-parameter settings of the stacked autoencoder are listed below.
Hyper-parameter settings of the stacked autoencoder.
| Hyperparameter | Setting |
|---|---|
| Initializer | TruncatedNormal |
| Number of hidden layers | 1 |
| Number of hidden layer nodes | 512 |
| L2 regularization term | 1e-4 |
| Dropout rate | 0.14 |
| Activation | ReLU |
| Batch size | 128 |
| Optimizer | Adam |
| Loss | MSE for AE; binary cross-entropy for classifier |
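Continuing the sketch above, the classifier side of the figure can be assembled with the settings from this table, copying the pre-trained encoder weights into the hidden layer (a sketch under the same assumptions, not the authors' exact code):

```python
# Fully connected classifier with the same hidden layer as the AE.
clf_in = keras.Input(shape=(n_features,))
hidden = layers.Dense(512, activation="relu",
                      kernel_initializer="truncated_normal",
                      kernel_regularizer=regularizers.l2(1e-4))(clf_in)
hidden = layers.Dropout(0.14)(hidden)
output = layers.Dense(1, activation="sigmoid")(hidden)
classifier = keras.Model(clf_in, output)

# Initialize the hidden layer with the pre-trained encoder weights.
classifier.layers[1].set_weights(autoencoder.layers[1].get_weights())
classifier.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])
```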
Although the data sets had been balanced, the models could still overfit, so we optimized the weights of the positive- and negative-sample terms of the log-likelihood loss function:

$$L = -\frac{1}{N}\sum_{k=1}^{N}\left[w\, y_k \log a_k + (1-w)(1-y_k)\log(1-a_k)\right] \tag{1}$$
where $y_k$ is the label of the $k$th compound: $y_k = 1$ or $0$ means the $k$th compound is drug-like or non-drug-like, respectively; $a_k = P(y_k = 1|x_k)$ is the model-predicted probability that the $k$th compound is drug-like; and $w$ is the weight of the positive-sample loss. For each case, we chose the most suitable $w$ from the range (0.5, 1.0) to avoid overfitting. We then trained all models with 5-CV and enforced early stopping based on the classification ACC on the test set. Each case therefore yielded five trained models, whose average output was taken as the final prediction.
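Formula 1 can be written as a custom Keras loss; a sketch, with the positive-sample weight w as a parameter and an illustrative early-stopping callback:

```python
import tensorflow as tf

def weighted_bce(w):
    """Log-likelihood loss with weight w on the positive-sample term (Formula 1)."""
    def loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        a = tf.clip_by_value(y_pred, eps, 1.0 - eps)  # a_k = P(y_k = 1 | x_k)
        return -tf.reduce_mean(w * y_true * tf.math.log(a)
                               + (1.0 - w) * (1.0 - y_true) * tf.math.log(1.0 - a))
    return loss

# Early stopping on validation accuracy, as enforced during 5-CV training.
early = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                         restore_best_weights=True)
classifier.compile(optimizer="adam", loss=weighted_bce(0.7),  # w tuned per model
                   metrics=["accuracy"])
```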
All models were evaluated by five indexes: ACC, SP, sensitivity (SE), MCC, and the area under the receiver operating characteristic curve (AUC). The first four are defined, in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), as follows:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$SE = \frac{TP}{TP + FN}$$

$$SP = \frac{TN}{TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
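All five indexes can be computed from the confusion matrix with scikit-learn; a sketch assuming binary labels y_true and predicted probabilities y_prob:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ACC, SE, SP, MCC, and AUC for binary predictions."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SE":  tp / (tp + fn),   # sensitivity
        "SP":  tn / (tn + fp),   # specificity
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }
```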
After pre-training trials on the validation set with 5-CV, we found that adding more layers or neurons did not improve predictive power; in all cases, one hidden layer was sufficient for our classification objective. Comparing the two over-sampling methods used to balance the datasets, duplicating the minority class and SMOTE, we found that the latter achieved better prediction accuracies (see the table below).
Performance on the training sets with 5-CV.
| Model | Balancing method | ACC | SE | SP | AUC |
|---|---|---|---|---|---|
| WDI/ACD | Copy the minority class | 0.8923 | 0.8991 | 0.8859 | 0.9598 |
| WDI/ACD | SMOTE over-sampling | 0.9265 | 0.9244 | 0.9286 | 0.9783 |
| MDDR/ZINC | Copy the minority class | 0.9095 | 0.8855 | 0.9302 | 0.9701 |
| MDDR/ZINC | SMOTE over-sampling | 0.9116 | 0.9141 | 0.9092 | 0.9719 |
| WORLD/ZINC | Copy the minority class | 0.9910 | 0.9961 | 0.9859 | 0.9986 |
| WORLD/ZINC | SMOTE over-sampling | 0.9906 | 0.9937 | 0.9874 | 0.9990 |
With the same dataset, the ACC of an SVM model built by Li et al. was 92.73%.
We observed that when the models were evaluated on the independent external validation sets pre-segmented from the original data, the prediction ACC tended to be slightly lower than during training, while the SE value was significantly lower and the SP value higher (see the table below).
Performance of the models on the validation sets.
(Using SMOTE over-sampling.)

| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.9014 | 0.7683 | 0.9191 | 0.6014 | 0.9271 |
| MDDR/ZINC | 0.9025 | 0.9012 | 0.9036 | 0.8043 | 0.9669 |
| WORLD/ZINC | 0.9800 | 0.7544 | 0.9838 | 0.5690 | 0.9707 |
The underlying reason may be that the positive-sample ratio in the original data was too low. We randomly split the positive and negative samples of the original data 9:1 into the training and validation sets. Even though SMOTE was used to balance the positive and negative samples in the training set, the new positive samples generated by SMOTE depend entirely on the positive samples already in the training set, so little information about the positive samples of the external validation set was captured.
To overcome the over-fitting on the negative samples, we increased the weight of the positive-sample loss in the loss function to enhance the model's ability to learn from the positive samples. We tested weight values (see Formula 1) from 0.5 to 1.0 in 20 steps and recorded the ACC, SE, and SP on the validation set as the weight varied, as shown in the figure below.
Evaluation metrics of the different models varying with the weight of the positive-sample loss.
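The weight scan can be sketched as a loop over 20 values in (0.5, 1.0), rebuilding and retraining the classifier at each step and recording the validation metrics. Here build_classifier is a hypothetical factory wrapping the model definition above, and X_bal, y_bal, X_val, y_val are the balanced training set and the held-out validation set:

```python
import numpy as np

results = []
for w in np.linspace(0.5, 1.0, 20):
    model = build_classifier()  # fresh weights for every w (hypothetical helper)
    model.compile(optimizer="adam", loss=weighted_bce(w), metrics=["accuracy"])
    model.fit(X_bal, y_bal, batch_size=128, epochs=100,
              validation_data=(X_val, y_val), callbacks=[early], verbose=0)
    m = evaluate(y_val, model.predict(X_val).ravel())
    results.append((w, m["ACC"], m["SE"], m["SP"]))
# The optimal w for each model is read off near the SE/SP intersection.
```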
For each model, we chose the weight at which the SE and SP curves intersect as the optimal value; the resulting performance on the training and validation sets is reported in the two tables below.
Performance on the training set after optimizing the weight of loss function.
(Using SMOTE over-sampling.)

| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.9104 | 0.9694 | 0.8515 | 0.8270 | 0.9757 |
| MDDR/ZINC | 0.9120 | 0.9219 | 0.9020 | 0.8243 | 0.9726 |
| WORLD/ZINC | 0.9699 | 0.9985 | 0.9414 | 0.9416 | 0.9955 |
Performance on the validation set after optimizing the weight of loss function.
(Using SMOTE over-sampling.)

| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.8458 | 0.8524 | 0.8449 | 0.5286 | 0.9253 |
| MDDR/ZINC | 0.9046 | 0.9174 | 0.8935 | 0.8095 | 0.9699 |
| WORLD/ZINC | 0.9366 | 0.8804 | 0.9376 | 0.4049 | 0.9622 |
Although the MCC is generally regarded as a balanced measure, it is strongly affected by the gap between the numbers of positive and negative samples in a data set, through the confusion matrix produced by the model. The MCC is satisfactory on the balanced training sets; on the validation sets, however, the data become more unbalanced and the MCC inevitably becomes smaller.
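A small numeric illustration of this effect (values chosen for illustration only): holding SE and SP fixed at 0.9, the MCC shrinks simply because the class ratio grows more extreme.

```python
import math

def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient from the confusion matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Balanced set (100 positives, 100 negatives), SE = SP = 0.9:
print(round(mcc(tp=90, fn=10, tn=90, fp=10), 3))    # 0.8
# Imbalanced set (100 positives, 1,000 negatives), same SE and SP:
print(round(mcc(tp=90, fn=10, tn=900, fp=100), 3))  # 0.608
```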
In image recognition, where the AE originated, several AE layers are often stacked to form an SAE. Although the SAE has proved more powerful than a single-layer AE in that domain, we found it flawed for drug-likeness problems: a multi-layer SAE performs much worse here than a single-layer AE.
When a layer of an AE is trained, it is expected to produce output as close as possible to its input, and the reconstruction error can be defined as the mean difference between output and input. In this study, we found that the ACC obtained with normalized (z-score) input was much higher than with input scaled to [-1, 1]. After standardizing the data, however, the reconstruction error of the AE was 0.8, an order of magnitude higher than typical values in image recognition. Stacking AE layers further amplifies this error, making the SAE-initialized network perform poorly in classification.
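The two input scalings compared here can be sketched with scikit-learn on stand-in data, with the reconstruction-error measure shown as a comment:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.lognormal(size=(1000, 50))  # stand-in for Mold2 descriptors

X_z = StandardScaler().fit_transform(X)                       # z-score normalization
X_mm = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # scale to [-1, 1]

# AE reconstruction error as discussed above (mean deviation of
# the autoencoder output from its input):
# err = np.mean(np.abs(autoencoder.predict(X_z) - X_z))
```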
We propose that this flaw of the AE stems from how the dimensions of the input data are interrelated. In image recognition, each pixel is a dimension; in drug-likeness prediction and related areas, each descriptor is a dimension. The training goal of an AE is to learn the relationships among dimensions, so as to encode the input information into the hidden-layer dimensions. It is therefore likely that an AE performs worse when those relationships are intrinsically more chaotic and irregular. The relationships among pixels are regular in that pixels are organized on a 2D grid and neighboring pixels bear some similarity and complementarity. Such properties are absent among descriptors, causing the AE input-reconstruction process to fail. Although the AE reconstruction error is large, our models still perform well in classification. In our opinion, this is due to the regularization effect of AE pre-training: with unsupervised pre-training, the model is more capable of truly learning the data and less prone to simply memorizing it.
Imbalanced data sets are a common problem. Although methods such as SMOTE can generate new data to balance a data set, such generation depends heavily on the distribution of the existing samples: if the samples are very sparse, the new data are likely to deviate from the space in which the original data lie. Developing methods that find data-mapping spaces based on the distribution of existing data, such as the currently popular deep generative models, is critical for generating data to balance a data set. Developing new algorithms for training on unbalanced data sets is also an important research direction.
In this study, DL has once again shown its capacity for improving prediction models. Despite this success, we believe there is still much room for further development. A key aspect is adapting current DL methods to specific problems, which requires a better understanding of those methods: knowing which parts can be applied universally and which parts should be modified according to the nature of the data. For example, in this study we believe that the regularization effect of AE pre-training is a universal part, whereas the AE input-reconstruction part should be dropped or modified when the input data are irregular.
In this study, we manually built two larger data sets, drug-like/non-drug-like and drug/non-drug-like. Using the AE pre-training method, we then developed drug-likeness prediction models. The classification ACC based on the WDI and ACD databases was improved to 91.04%, and our model achieved a classification ACC of 91.20% on the MDDRWDI/ZINC dataset, making it the state-of-the-art drug-likeness prediction model and showing that the predictive power of DL models outperforms traditional machine learning methods. In addition, we developed a drug/non-drug-like model (ZINC World Drug vs. ZINC All Purchasable), which distinguishes drugs from common compounds, with a classification ACC of 96.99%. We propose that AE pre-training served primarily as a regularization method in this study. The failure of multi-layer SAE reconstruction indicates that, owing to the specific nature of the data, some modifications may be needed when applying DL to different fields. We hope machine learning researchers and chemists will collaborate closely on such problems in the future, bringing deeper understanding and broader applications of DL methods to chemical problems.
QH and MF wrote the codes and analyzed the data. LL and JP conceived the work. All authors wrote the paper.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations: 5-CV, 5-fold cross-validation; ACC, accuracy; AE, autoencoder; AUC, area under the receiver operating characteristic curve; DL, deep learning; DNN, deep neural network; FNN, fully connected neural network; MCC, Matthews correlation coefficient; SAE, stacked autoencoder; SE, sensitivity; SP, specificity; SVM, support vector machine.