Deep Learning on High-Throughput Transcriptomics to Predict Drug-Induced Liver Injury

Drug-induced liver injury (DILI) is one of the most cited reasons for the high drug attrition rate and drug withdrawal from the market. The accumulated large amount of high throughput transcriptomic profiles and advances in deep learning provide an unprecedented opportunity to improve the suboptimal performance of DILI prediction. In this study, we developed an eight-layer Deep Neural Network (DNN) model for DILI prediction using transcriptomic profiles of human cell lines (LINCS L1000 dataset) with the current largest binary DILI annotation data [i.e., DILI severity and toxicity (DILIst)]. The developed models were evaluated by Monte Carlo cross-validation (MCCV), permutation test, and an independent validation (IV) set. The developed DNN model achieved the area under the receiver operating characteristic curve (AUC) of 0.802 and 0.798, and balanced accuracy of 0.741 and 0.721 for training and an IV set, respectively, outperforming the conventional machine learning algorithms, including K-nearest neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). Moreover, the developed DNN model provided a more balanced sensitivity of 0.839 and specificity of 0.603. Besides, we found the developed DNN model had a superior predictive performance for oncology drugs. Also, the functional and network analysis of genes driving the predictions revealed their relevance to the underlying mechanisms of DILI. The proposed DNN model could be a promising tool for early detection of DILI potential in the pre-clinical setting.


INTRODUCTION
Drug-induced liver injury (DILI) has been recognized as a significant cause of drug attrition, resulting in drug withdrawal from any stage of the drug development processes and post-marketing (Hoofnagle and Björnsson, 2019;Weaver et al., 2020). DILI accounts for approximately 50% of acute liver failure cases in the United States (Ostapowicz et al., 2002). Notably, DILI risk covers more than 750 approved drugs (Thakkar et al., 2018). The late stage of DILI identification poses a serious challenge to pharmaceutical professionals as well as regulatory agencies. Therefore, early elimination of DILI concerns in the pre-clinical setting is of great importance for DILI management (Weaver et al., 2020).
Drug-induced liver injury is a very composite endpoint in predictive toxicology, which interplays between drug properties and host factors (Chen et al., 2015). Besides the clinical-driven approaches initialized by consortiums such as DILI Network (DILIN) (Fontana et al., 2009), considerable efforts have been made for enhancing DILI prediction in a pre-clinical setting (Walker et al., 2020). Poor extrapolation from pre-clinical animal models to clinical hepatotoxicity exists. Thus, the paradigm has been shifted to developing in vitro (Chan and Benet, 2018;Tolosa et al., 2018;Aleo et al., 2020) and in silico (Mulliner et al., 2016;Hong et al., 2017;Shin et al., 2020) models for DILI prediction. Notably, toxicogenomics (TGx), integrating genomics and computational approaches, becomes a new trend to enhance DILI prediction . Wang et al. (2019) developed Deep neural networks (DNNs) by using rat transcriptomic profiles from the Open TG-GATEs database (Igarashi et al., 2014) to predict three DILI endpoints including biliary hyperplasia, fibrosis, and necrosis. The model yielded the Matthews correlation coefficients (MCC) of 0.56∼0.89. Although it is still elusive of the model performance in human DILI endpoints, it is a beneficial attempt. Kohonen et al. (2017) developed a "predictive toxicogenomics space (PTGS)" tool composed of 1, 331 genes based on Connectivity Map (CMap) and NCI-60 in vitro assays. The developed PTGS tool was applied for human DILI prediction with different DILI annotation datasets and achieved a sensitivity of 72-86% without a loss of specificity.
It is challenging to evaluate and cross-compare the performance metrices such as sensitivity and specificity of the reported DILI models directly. First, there is a lack of a consistent DILI classification scheme to standardize and unify the existing DILI annotation strategies (Walker et al., 2020). In the past decade, the FDA National Center for Toxicological Research (NCTR) research team continue to minimize the discrepancies among the different reported DILI classifications. Such efforts were incorporated into the FDA's Liver Toxicity Knowledge Base (LTKB) with two developed DILI classification datasets -LTKB Benchmark Dataset (Chen et al., 2011) and DILIrank (Chen et al., 2016). Recently, the FDA team released DILI severity and toxicity (DILIst) dataset . The DILIst provides access to the largest binary DILI classification dataset composed of 1279 drugs, accurately categorized for DILI potential by the assimilation of highly concordant information on DILI from clinical evidence, literature evidence, case registry and the FDA Adverse Event Reporting System (FAERS). Second, the sample size varies among the reported model is another hurdle to compare their model performance. Furthermore, the number of samples used in the published models is usually very limited. For example, the large-scale of TGx, such as the Open TG-GATEs database consists of 170 compounds with four different types of assays, i.e., rat/human in vitro, and rat in vivo single/repeated doses. However, the small number of DILI negative compounds limits its application for predictive DILI model development. Benefited from high-throughput screen technologies such as L1000 (Subramanian et al., 2017) and TempO-Seq (Bushel et al., 2018), transcriptomic profiles for millions of compounds designed to monitor hundreds to thousands of genes could be generated at once in high throughput at an incredibly lower cost. Accumulated transcriptomic profiles generated from these technologies could potentially improve the generalization of the DILI model. Last, there is a lack of a comprehensive assessment of machine learning (ML) algorithms for their DILI predictive ability. So far, there have been limited investigations that evaluated the superiority of deep learning (DL) over conventional ML in the context of toxicity prediction. Some observations were made with only small datasets and were based just on chemical structure data (Muratov et al., 2020). Further evaluation of DL algorithms in diverse biomedical profiles is urgently needed for better understating of its potential in DILI prediction.
In this study, we developed a DNN model for DILI prediction based on the largest binary DILI classification dataset-DILIst as well as the transcriptomic profiles from the Library of Integrated Network-based Cellular Signatures (LINCS) L1000 dataset. To our best knowledge, it is the first attempt in the community. The developed DNN models were comprehensively evaluated by using Monte Carlo cross-validation (MCCV), a permutation test, and an independent validation (IV) set. The performance of the developed DNN models was also intensively compared to the conventional ML algorithms, including K-nearest neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). Moreover, the applicability domain and biological relevance of the proposed DNN model were also defined and investigated through various approaches such as therapeutic class analysis, pathway analysis, and network analysis.

Drug-Induced Liver Injury (DILI) Annotation Data
Drug-induced liver injury severity and toxicity developed by the FDA's NCTR was employed, which is the largest binary DILI annotation dataset . DILIst integrated the human hepatotoxicity-related evidence derived from approved drug labeling, hepatotoxic case registries, FAERS, and literature reports. The current version of DILIst contains a total of 1279 drugs, where 768 are DILI positives, and 511 are DILI negatives.

High-Throughput Transcriptomic Profiles
Drug-induced transcriptome profiles used for this study were curated from the NIH LINCS L1000 dataset (GEO accession number: GSE92742) (Subramanian et al., 2017). LINCS L1000 dataset consists of over 1.3 million transcriptomic profiles across human cell lines. The data generated from LINCS L1000 has been subjected to a data processing pipeline established by the LINCS consortium and to produce data at five levels. In this study, we used Level 5 moderated Z-scores data that contained 473,647 collapsed transcriptomic profiles derived from the weighted averages of the individual replicates on 978 landmark genes. Here, each transcriptomic profile of Level 5 represents the treatment effect of a drug/dose/duration/cell line combination.

Transcriptomic Profiles for Model Development
We mapped LINCS L1000 Level 5 data onto the DILIst based on their shared drug names and related synonyms through PubChem Identifier Exchange Service 1 . As a result, we obtained 23,791 transcriptomic profiles across 69 human cell lines. Here, we hypothesized the DILI signal was embedded in transcriptomic profiles with different drug/dose/duration/cell line combinations. However, based on our previous study, the correlation between liver-related cell lines and various cancer cell lines varied . Therefore, we used a modified Kennard-Stone algorithm (Kennard and Stone, 1969) for sample selection to extract transcriptomic profiles with the most explained variance. Accordingly, we calculated a average transcriptomic profile based on all the 23,791 transcriptomic profiles. Then, we calculated Euclidean distances between average transcriptomic profile and each of 23,791 transcriptomic profiles. Transcriptomic profiles with Euclidean distances located in two sides of the quantile value 0.05 (e.g., a confidence interval 0.95) were selected (see the Supplementary Figure S1). Consequently, 6,000 transcriptomic profiles were obtained for model development, of which 3,568 were DILI positives and 2,432 were DILI negatives (e.g., positive/negative ratio = 1.47). The detailed information, including drug name, concentration, cell line, and DILIst label, was listed in the Supplementary Table S1.

Description of Classifiers
The study aims to develop a DNN model based on transcriptomic profiles for DILI prediction. Furthermore, we compared the DNN model performance with three state-of-the-art ML algorithms, including KNN, SVM, and RF.

Deep Neural Network
A DNN is an artificial neural network (ANN) consisting of multiple layers between the input and output layers (Schmidhuber, 2015). The DNN aims to find the correct mathematical manipulation to transform the input into the output through a linear or a non-linear relationship. The network moves through the layers by calculating the probability of each output. In this study, we developed a DNN model with seven hidden layers between the input and output layers (Supplementary Figure S2). The nodes in the input layer were loaded with the feature information of transcriptomic profiles. Then, the input layer information was forwarded through the network of seven hidden layers and then finally to the output layer (i.e., probability of DILI classification). For each node in the hidden layers, and the output layer, the weighted sum (S l j ) was calculated using the formula below: Where x l−1 k is the kth node in the layer of l − 1, W l jk is the weight between x l−1 k and x l j , b l j is on the bias for neuron x l j . Then the weighted sum was activated through a non-linear function as 1 https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi shown in the following formula: Here, x l k denotes the output of an activation function. The activation for the hidden layers was optimized during the model construction. In this study, we investigated four activation functions, including Hyperbolic tangent activation (Tanh), Rectified Linear Unit (ReLU), Scaled Exponential Linear Unit (SELU), and Exponential Linear Unit (ELU). For the output layer, we used the sigmoid activation function for DILI classification prediction (binary where DILI negative prediction = 0 and DILI positive predictions = 1) (if the sigmoid output value ≤ 0.5, classify to DILI negative, else classify to DILI positive). The binary cross-entropy loss function was applied to calculate the loss of the predicted value Y l and the actual value Y i . The loss function was presented by the following formula: Four optimizers (i.e., Adam, Adadelta, RMSProp, and SGD) were employed and applied to minimize the loss function by backward updating the weights and bias on the DNN training process.

KNN
K-nearest neighbors is a non-parametric algorithm that determines each test object class label as the most frequent one among the k-nearest training objects (Altman, 1992).
In this study, we used Euclidean distance to measure the distance between the test object and the training object. The hyperparameter (the number of neighbors K) was optimized.

SVM
Support vector machine is a supervised ML algorithm developed by Cortes and Vapnik (1995). It separates objects by constructing a hyperplane based on support vectors. In addition to performing linear classification, SVM can also be used to perform nonlinear classification using kernel functions. The kernel function can project the original feature space into a higher dimensional feature space, which may help to separate objects into different classes.
In this study, we tested polynomial, and radial basis function (RBF) kernel functions with hyperparameters, i.e., Gamma for the non-linear kernel, and penalty parameter C.

RF
Random forest is an ensemble learning model used for classification by building multiple decision trees, and the output of the decision is the ensemble result of all the individual trees. The hyperparameters, including the number of trees, the max depth of the tree, minimum sample split, and minimum sample leaf, were optimized. Figure 1 provides a schematic overview of the model development procedure performed in this study:

Model Development
Step 1: we split the 6000 transcriptomic profiles into training (80%), and IV (20%) sets using a stratified splitting strategy, where the same DILI positive/negative (P/N) ratio was kept in both sets. Consequently, we obtained a total of 4800 transcriptomic profiles (P/N ratio = 2854/1946) for the training set and 1200 ones (P/N ratio = 714/486) for the IV set.
Step 2: The MCCV was employed to optimize the model hyperparameters. First, the training set data was stratified split into two subsets with a fixed ratio (i.e., 80% than 20%). Then, the model was developed using 80% of the training set with investigated hyperparameters to predict the remaining 20%. This procedure was repeated 100 iterations. Finally, we used the average predictive results of 100 iterations to select optimized hyperparameters (Result 1).
Based on our previously developed models for DILI prediction, the models tended to provide unbalanced sensitivity and specificity (Chen et al., 2013;Hong et al., 2017). Considering the prevalence of DILI positives and negatives in our dataset, we designed a score function V to balance model performance for hyperparameter optimization, as list below: Step 3: The model with optimized parameters yielding the highest predictive performance in MCCV was considered as the final model to evaluate the IV set (Result 2).
Step 4: We further evaluate the model by performing a permutation test (i.e., Y-scrambling) to investigate whether the predictive performance is better than chance. Specifically, we scrambled the DILI labels of training set 100 times to generate 100 of the pseudo training set. For each pseudo training set, we developed a model based on 80% of data with optimized parameters obtained in Step 2 to predict the remaining 20%. Next, we compared the 100 predictive performances from the pseudo training set with Result 1 from the original training set. Subsequently, we calculated the adjusted p value by using Student's t-test. If the adjusted p value less than 0.05, we considered the developed model is statistically significant than the random correlation. Moreover, we also used the Cohen's measure to confirm the findings further.

Performance Metrics
We evaluated the predictive performance of developed models by using the area under the receiver operating characteristic (ROC) curve analysis. A ROC curve exhibits the performance of the classification model with a plot by the true predictive rate (TPR) against the false positive rate (FDR). We calculated the area under the ROC curve (AUC) for each developed model. Moreover, three other metrics, including MCC, F1, Cohen's kappa, accuracy, balanced accuracy, sensitivity, and specificity, were employed to further evaluation of model performance as listed below: balanced accuracy = sensitivity+specificity 2 (9) Where TP, TN, FP, FN denotes true positive, true negative, false positive, and false negative, respectively. P(o) is the probability of oberserved agreement. P(e) is the probability of random agreement.

Applicability Domain
Since drugs in the DILIst dataset covers a wide spectrum of different therapeutic categories, a defined applicability domain is helpful further to apply the developed DNN model in the realworld application. Therefore, we investigated the performance of developed DNN model on each therapeutic category using the WHO Anatomical Therapeutic Chemical (ATC) Classification System (Skrbo et al., 2004). The WHO ATC Classification System is a hierarchical ontology with five different levels. In the study, the first level of ATC was employed, which represents 14 different main therapeutic categories the drug acts on. We first mapped the drugs in the IV set (1200 transcriptomic profiles) onto the first level of ATC. Finally, we calculated performance metrics, including AUC, sensitivity, specificity, and balanced accuracy for each therapeutic category.

Functional Analysis
To determine the association between the predictive performance of the developed DNN model and its biological relevance, we performed the Ingenuity Pathway Analysis (IPA), network analysis, and Gene Ontology analysis. We collected the signatures (i.e., top 100 up/down-regulated genes) from each correctly predicted transcriptomic profiles in the IV set. Then, we extracted the top 200 high-frequent genes amongst these signatures. For IPA, we uploaded the 200 high-frequent genes into the IPA web server 2 to enrich canonical pathways and hepatotoxicityrelated functions. For the Gene Ontology analysis, we uploaded the 200 high-frequent genes into the DAVID web server 3 to enrich GO terms with Benjamini-Hochberg adjusted p values less than 0.05. For network analysis, we used the following steps: (1) We queried the STRING version 11 (i.e., a Protein-Protein Interaction (PPI) Networks Functional Enrichment Analysis database) to determine PPIs amongst the 200 high-frequent genes (Szklarczyk et al., 2018); (2) We kept PPIs with confidence scores > 0.7 were considered; (3) We employed the MCODE plug-in for Cytoscape version 3.7.1 to extract and visualize the PPI sub-networks that are densely connected to represent the biological complexes (Bader and Hogue, 2003;Shannon et al., 2003). We set the default parameters in MCODE with Node Score Threshold = ?0.2, K-core Threshold = ?2, and MaxDepth = ?100; (4) the genes involved in the enriched subnetworks were input to the IPA software to enrich the pathways and hepatotoxicityrelated functions.

Model Robustness
To further investigate the robustness of the proposed DNN model, we rebuilt the DNN models based on two different strategies, including balanced sampling and drug-based data splitting.

Balanced Sampling
Considering the unbalanced DILI positive and DILI negative transcriptomic profiles, we employed a downsampling strategy to rebuilt the DNN models. Specifically, there are 6000 transcriptomic profiles (DILI positive/DILI negative = 3568/2432) used in this study. First, We randomly selected 2432 of 3568 DILI positive transcriptomic profiles with the same number of negative ones. Then, the 4864 (DILI positive/DILI negative = 2432/2432) transcriptomic profiles dataset was stratified split into 80% (3891) for training and 20% (973) for the IV set. Finally, the model was developed by using the training set and test on the IV set. The whole process was repeated 50 times. The average and standard deviation of model performance metrics were calculated.

Drug-Based Splitting
Considering the developed DNN model was based on the transcriptomic profiles, the same drug's transcriptomic profiles may have a chance to present in the training and IV sets. To avoid the potential information leaking occurs, we split the training and the IV sets by drugs to guarantee no common drugs between the two datasets.

Code Availability
All the models were built using the open-source Python (version 3.6.5). The DNN model was developed with the Keras library version 2.0 on top of TensorFlow version 1.14 as the backend. The models based on conventional ML algorithms (KNN, SVM, and RF) were developed using scikit-learn package version 0.22 (Pedregosa et al., 2011). The model scripts and processed data in this study are available at https://github.com/TingLi2016/ L1000_DILI.

Model Development
To obtain the optimized model, we comprehensively evaluated model performance under different hyperparameter combinations by using the proposed V score function (see "Materials and Methods" section). For DNN, a total 16 of hyperparameter combinations of four activation functions (i.e., Tanh, ReLU, SELU, ELU) and four optimizers (Adam, Adadelta, RMSProp, and SGD) and were investigated. For KNN, we tested the number of neighbors k from 3 to 11 with an increased step 2. For SVM, We used a grid search on a polynomial function, and RBF kernel functions with regularization parameter C and kernel coefficient gamma γ. The C values of 0.01, 0.1, 1, 10, 100 and the γ values of 0.1, 0.01, 0.001, 0.0001 were tested. For RF, We investigated hyperparameter combination with n_estimators, max_depth, min_samples_split, and min_samples_leaf. The n_estimators was tested from 100 to 500 increased by 100. The max_depth of 8, 10, 12, and None were employed None means no depth was specified. The min_samples_split (i.e., 2, 5, and 10) and the min_samples_leaf (1, 2, and 4) were used. Figure 2 depicted the distribution of V scores of DNN models from the 100-iteration MCCV. The average and standard deviation of V scores under 16 hyperparameter combinations were ranked. We observed the V score distributions of the top five hyperparameter combinations were no statistically significant difference. We chose the hyperparameter combination with activation ELU and optimizer Adam since they yield the highest average V score of 0.295. The detailed V score distributions of three conventional classifiers, including KNN, SVM, and RF, were listed in the Supplementary Table S2. For KNN, the optimized number of neighbors K was 3. For SVM, the optimized hyperparameter combination of kernel, C, and γ was the RBF kernel, 0.0001, and 10, respectively. For RF, the best hyperparameter combination of estimators, max depth, min_samples_split, and min_samples_leaf was 500, None, 2, and 1, respectively. The average and standard deviation of V scores for the four classifiers were ranked in the following order: DNN (0.295 ± 0.023) > SVM (0.256 ± 0.023) > RF (0.237 ± 0.022) > KNN (0.224 ± 0.025). Figure 3 illustrated the distribution of AUC values from 100iteration MCCV of the four classifiers. Dots in the violin plot denoted the 100 individual models generated in MCCV. The average and standard deviation of AUC values were decreased in the order DNN (0.772 ± 0.015) > SVM (0.751 ± 0.015) > RF (0.743 ± 0.018) > KNN (0.723 ± 0.016), highlighting the better generalization ability of DNN. Furthermore, more DNN models were distributed around 1.5× interquartile range of violin plot (dots with black color), indicating the higher probability of DNN generating models with better performance. As a result, we selected the model with the highest AUC value for each classifier as the final model. Table 1 listed the performance metrics (i.e., AUC, MCC, F1, Cohen's kappa, accuracy, balanced accuracy, sensitivity, and specificity) in the training set and IV set across the four classifiers. DNN yielded the highest AUC values (i.e., 0.802 for the training set, and 0.798 for the IV set), Cohen's kappa (i.e., 0.493 for the training set, and 0.453 for the IV set), and balanced accuracies (i.e., 0.741 for the training set, and 0.721 for IV set) among the four classifiers. Moreover, DNN provided more balanced sensitivity (i.e., 0.851 and 0.839 for the training and IV sets, respectively) and specificity (i.e., 0.630 and 0.603 for the training and IV sets, respectively) than the other three classifiers. RF classifier yielded the highest sensitivity and the lowest specificity in both training and IV sets among the four classifiers, indicating the model suffered a high falsepositive rate and tended to predict transcriptomic profiles as DILI positives.

DNN Model Provided More Balanced Model Performance
We further compared the absolute difference of performance metrics between MCCV and IV, denoted as | MCCV-IV| (Figure 4). A large | MCCV-IV| value indicates either potentially overfitting (MCCV > IV) or an unreliable extrapolation (IV > MCCV) of a model. We found that low | MCCV-IV| values (<0.03) of AUC, F1, balanced accuracy, accuracy, and sensitivity for all the four classifiers. RF and SVM had a relatively higher | MCCV-IV| values of specificity (i.e., 0.052 and 0.073) than that of DNN and KNN. RF had a relatively higher | MCCV-IV| values of MCC (0.047) and Cohen's kappa (0.055) than that of DNN, KNN, and SVM. In general, we did not observe obvious overfitting or underfitting phenomena among the four classifiers. Thus, in the following analysis, we only focused on the DNN classifier because of its superior performance to other classifiers.

DNN Model Is Significantly Better Than Random Chance
We further investigated whether the predictivity of the developed DNN models yielded by chance. Figure 5 depicted the results of permutation tests for the four performance metrics, including AUC, sensitivity, specificity, and balanced accuracy. In the permutation test, we compared the 1000-iteration MCCV results between the real data and permutated data. The average performance metrics of AUC, MCC, F1, Cohen's kappa, accuracy, balanced accuracy sensitivity, and specificity (0.772, 0.422, 0.772, 0.419, 0.723, 0.707, 0.790, and 0.625, respectively) derived from real data were much larger than that of permutated data (i.e., 0.512, 0.038, 0.604, 0.038, 0.535, 0.519, 0.603, and 0.434, FIGURE 2 | Hyperparameters optimization of DNN models: we investigated a total of 16 hyperparameter combinations of four activation functions (i.e., Tanh, ReLU, SELU, and ELU) and four optimizers (i.e., Adam, Adadelta, RMSProp, and SGD). Each hyperparameter combination was evaluated with a 100-iteration MCCV. The mean and standard deviation of V scores is illustrated for each hyperparameter combination. The hyperparameter combination of Elu and Adam was chosen as the best hyperparameters for developing an optimized DNN model.  The best performance values among the classifiers.
FIGURE 4 | Absolute differences, | MCCV-IV|, of the four performance metrics: absolute differences of the eight performance metrics between the 100-iteration MCCV and the independent validation (IV) set were calculated for the four classifiers.
respectively). The p values of eight performance metrics were all less than 2.2E-16, indicting the results yielded from real data are significantly better than random chance. Moreover, we employed the Cohen's d measures to verify further the mean values generated based on real data is statistically different from permutated data. Generally, Cohen's d values more than 0.8 were considered as "large effect" between two distributions.

DNN Model Yielded a Better Performance for Oncology Drugs
To define the applicability domain of our developed DNN model, we calculated the performance metrics separately for each therapeutic class in the IV set. Figure 6 showed the distribution of the four performance metrics, including AUC, MCC, F1, Cohen's kappa, accuracy, balanced accuracy, sensitivity, and specificity for each of 14 ATC therapeutic classes. The predictive power in each therapeutic class varied. L-antineoplastic and immunomodulating agents outperformed other therapeutic classes. The higher AUC, MCC, F1, Cohen's kappa, accuracy, balanced accuracy, sensitivity, and specificity of 0.943, 0.749, 0.864, 0.744, 0.872, 0.836, 0.877, and 0.919 were obtained. We also observed the developed DNN model yielded lower predictive performances (AUC < 0.5) in some therapeutic categories such as V-various, B -blood and blood forming organs, and D-Dermatologicals.  Figure S3). Moreover, two hepatoxicity-related functions, including liver cancer and Hepatoblastoma, were also enriched. Similar findings also were revealed from network analysis (Figure 7), where two high densely interacted subnetworks were extracted by using the MCODE Cytoscape plug-in. We found 6 (CCNB1, KIF14, CCNB2, TOP2A, BIRC5, and CDC45) of a total of 17 genes (35.2%) in the subnetwork one, and two (TP53, MYC) of seven genes (28.6%) in subnetwork two overlapping with hepatotoxicity-related functions obtained from the IPA (highlighted in green color in Figure 7). A total of eight GO terms were enriched with an adjusted p value less than 0.05. Seven of the eight GO terms belonged to the Cellular Component (CC) level, such as the nucleus, cytoplasm, nucleoplasm. The other enriched GO term, positive regulation of apoptotic process, was from the Biological Process (BP) level.

DNN Provided Robust Models
To further investigate the proposed DNN model's robustness, we redeveloped the models based on two sampling strategies, including balanced sampling and drug-based splitting. Figure 8A illustrated the distribution of the eight performance metrics of 50 IV sets based on the balanced sampling strategy. The average and standard deviation of AUC, MCC, F1, Cohen's kappa, accuracy, and balanced accuracy were 0.789 ± 0.009, 0.423 ± 0.025, 0.711 ± 0.019, 0.422 ± 0.023, 0.711 ± 0.011, and 0.711 ± 0.011, respectively. As expected, the balanced sampling method provided a more balanced sensitivity (0.714 ± 0.045) and specificity (0.707 ± 0.029) compared to the original dataset (see Supplementary Table S4). Furthermore, we observed that the average AUC (0.789) yielded based on balanced sampling was slightly lower than the AUC (0.798) generated with the original dataset, indicating the unbalanced positive and negative sample did not influence the performance of the developed DNN model. Figure 8B showed the performance metrics of 50 IV sets using the drug-based splitting strategy. The average and standard deviation of AUC, MCC, F1, Cohen's kappa, accuracy, balanced accuracy, sensitivity, and specificity were 0.769 ± 0.038, 0.420 ± 0.057, 0.651 ± 0.041, 0.391 ± 0.059, 0.689 ± 0.038, 0.718 ± 0.030, 0.818 ± 0.061, and 0.619 ± 0.066, respectively (see Supplementary Table S5). We found that 11 of 50 IV datasets with drug-based splitting methods had a better AUC value (i.e., AUC ≥ 0.798), suggesting no obvious information leaking exists in the DNN model developed with the original datasets.

DISCUSSION
Drug-induced liver injury remains one of the largest safety concerns in the drug development characterized by the complicated intrinsic and idiosyncratic mechanisms (Andrade et al., 2019). Although significant efforts have been made to improve DILI prediction ability, the performances of reported models are still suboptimal. Thus, in this study, we developed an 8-layer DNN model to investigate how to take advantage of accumulated transcriptomic profiles to enhance DILI prediction. It is the first attempt to use the largest binary DILI classification data -DILIst and transcriptomic profiles from the LINCS L1000 dataset to develop the DNN model. The developed DNN model yielded the best predictive results among the four different ML classifiers with AUCs of 0.802 and 0.798 and balanced accuracies of 0.741 and 0.721 for the training and an IV set, respectively. Furthermore, Our DNN model provided more balanced sensitivity and specificity ( Table 1).
A defined applicability domain of the developed model is of help to its real-world application. The therapeutic class analysis demonstrated that the developed DNN model was specifically effective (e.g., 0.943 AUC) in predicting DILI potential of L-antineoplastic and immunomodulating agents. It would be FIGURE 7 | Cytoscape network analysis of protein-protein interactions (PPIs): 482 high confidence PPIs were extracted from the STRING database version 11.0 based on 200 high-frequent genes derived from the DNN model. Panels (A,B) are the top 2 subnetworks of the PPI network obtained by using MCODE plug-in for Cytoscape. The hepatoxicity-related genes were highlighted in green color. The size of code is projected based on frequency of genes. valuable to improve patient quality of life during drug treatment by reducing the potential for hepatotoxicity in this class of drugs. However, the diversity in chemical class within the antineoplastic and immunomodulating agents, together with the underlying complex biology, is a challenge for the identification of the potential for clinical hepatotoxicity. The model described here has great potential to overcome this during the early stages of drug development as soon as efficacy data generated in cancer cell lines become available. Once in clinical development, the potential for hepatotoxicity is routinely monitored alongside other parameters to ensure patient safety.
An explainable in silico models is a pre-requisite for a better understanding of the underlying mechanism and causal relationship between data and biological endpoints. Through the functional analysis, we found five canonical pathways and two hepatotoxicity-related functions by using IPA based on 200 highfrequent genes derived from the DNN model. The results were also confirmed by using the network analysis. Several canonical pathways have been reported to play essential roles in DILI pathogenesis and etiology. For example, it was reported ATM signaling pathway plays an important protective function of causing liver failure in mice (Bandi et al., 2011). Furthermore, it was reported that Estrogen receptor alpha (ESR1)-mediated signaling inhibits liver regeneration by downregulation of Wnt signaling resulting in lower cyclin D1 activation in ESR1 knockout rats (McGreal et al., 2017).
It is worthwhile to consider some additional studies to confirm our findings in this study and further verify our developed DNN in the real-world application. First, we did not investigate the model performances based on transcriptomic profiles generated from different genomics platform. Based on our previous study, the extrapolation ability between assay platforms are multiple-factorial, and a careful selection of cell lines is strongly recommended (Liu et al., 2017. However, considering the ongoing declining cost of the emerging genomics technology such as L1000 technologies, it is affordable for most of the researches to screen the investigated drug candidates. Second, DL is growing at a very rapid rate. In the current study, we just employed the DNN algorithm for a proof-of-concept purpose. Other DL algorithms, such as transfer learning (Pan and Yang, 2009) and multi-task learning (Ruder, 2017), were worth investigating for further improving DILI perdition. Finally, in our current study, we only finetune the essential parameters, including activation functions and optimizers in the DNN model. Other parameters, such as the number of hidden layers and node numbers, were arbitrarily selected. With that said, the prediction performance of the developed model may be further improved by extra fine-tuning.
Drug-induced liver injury is a multifactorial endpoint that cannot easily be predicted by current pre-clinical animal models. The results presented illustrate the utility of a DL algorithm combined with transcriptomics information for predicting DILI. The developed DNN model could be a promising tool for DILI potential detection in the early stages of drug development.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.