Deep Neural Network Models for Predicting Chemically Induced Liver Toxicity Endpoints From Transcriptomic Responses

Improving the accuracy of toxicity prediction models for liver injuries is a key element in evaluating the safety of drugs and chemicals. Mechanism-based information derived from expression (transcriptomic) data, in combination with machine-learning methods, promises to improve the accuracy and robustness of current toxicity prediction models. Deep neural networks (DNNs) have the advantage of automatically assembling the relevant features from a large number of input features. This makes them especially suitable for modeling transcriptomic data, which typically contain thousands of features. Here, we gaged gene- and pathway-level feature selection schemes using single- and multi-task DNN approaches in predicting chemically induced liver injuries (biliary hyperplasia, fibrosis, and necrosis) from whole-genome DNA microarray data. The single-task DNN models showed high predictive accuracy and endpoint specificity, with Matthews correlation coefficients for the three endpoints on 10-fold cross validation ranging from 0.56 to 0.89, with an average of 0.74 in the best feature sets. The DNN models outperformed Random Forest models in cross validation and showed better performance than Support Vector Machine models when tested in the external validation datasets. In the cross validation studies, the effect of the feature selection scheme was negligible among the studied feature sets. Further evaluation of the models on their ability to predict the injury phenotype per se for non-chemically induced injuries revealed the robust performance of the DNN models across these additional external testing datasets. Thus, the DNN models learned features specific to the injury phenotype contained in the gene expression data.


INTRODUCTION
Toxicity prediction is a key element in evaluating the safety of drugs and chemicals (Raies and Bajic, 2016). Currently, the standard methods of toxicity evaluation are based on animal experiments to identify mechanisms of action and potential toxic effects (Benigni, 2016;Raies and Bajic, 2016). Recent advances in biological and computational modeling techniques are accelerating the development of a large number of animal-free assays and in silico models for toxicity testing (Blomme and Will, 2016). However, extensive work on these assays and models is still needed if they are to adequately address the central challenge of providing accurate prediction of toxicity endpoints and thus become valid replacements for traditional methods (Benigni, 2016).
Gene expression (transcriptomic) data have been widely used since the introduction of microarrays to elucidate the links between chemical exposures and the adverse effects they induce (Brockmeier et al., 2017). In particular, transcriptomic data can be used to discover genes and pathways associated with adverse effects and provide mechanistic insights, using multiple computational methods (Brockmeier et al., 2017), including those that involve machine learning (Thomas et al., 2001;Hamadeh et al., 2002;Steiner et al., 2004;Huang et al., 2008;Kohonen et al., 2017). Increasingly, research is focused on using gene expression signatures as predictors of a chemical's toxicological class (Zidek et al., 2007;Low et al., 2011;Kim and Shin, 2014;AbdulHameed et al., 2016). A recent work by Su et al. (2018) suggests that careful feature engineering could achieve high accuracy in predictions of liver injury using transcriptomic data. An advantage of the transcriptomic approach is that gene expression data may provide an early indication of toxicity, given that toxicant-induced changes in gene expression are often detectable before chemical, histopathological, or clinical observations (Ulrich and Friend, 2002). These studies have typically employed traditional supervised or unsupervised machine-learning methods, such as Support Vector Machines (SVMs) or Random Forests (RFs).
Recently, Deep Neural Networks (DNNs) (Hinton et al., 2006) have achieved notable success in many domains of machine-learning applications, including the biomedical sciences (Schmidhuber, 2015;Webb, 2018). In the context of toxicity prediction, DNN models have been assessed in several comprehensive studies for building quantitative structure-activity relationship models of different absorption, distribution, metabolism, excretion, and toxicity (ADMET)-properties based on chemical structural features (Ma et al., 2015;Xu et al., 2015;Goh et al., 2017). Aliper and colleagues developed both DNNs and SVMs based on in vitro transcriptomic data generated by the Connectivity Map (CMap) at the Broad Institute (Subramanian et al., 2017) to classify therapeutic categories for chemicals, and found DNNs to have superior classification performance (Aliper et al., 2016).
Transcriptomes contain thousands to tens of thousands of input variables that can be used as features in machine learning. These range from the most granular, of individual features representing the expression levels of single genes, to more focused sets or combinations of genes or pathways. The performance of traditional machine-learning methods typically depends critically on manually selecting and tuning these features to find the appropriate transcriptomic feature sets for model construction (Steiner et al., 2004;Huang et al., 2008). A key advantage of DNNs is their capability of discovering the representations that are relevant to solving a classification problem from the input features automatically, with little if any manual intervention (LeCun et al., 2015). In principle, DNNs should be able to handle large numbers of transcriptomic features, provided that ample training data are available. In practice, however, this advantage is diminished by the lack of training data, and some consideration is still required to construct or select the appropriate input variables.
Here, we used whole-genome DNA microarray data to construct DNN models of three histopathological endpoints (biliary hyperplasia, fibrosis, and necrosis), based on liver toxicity studies available in the Open Toxicogenomics Project-Genomics Assisted Toxicity Evaluation System (TG-GATEs) (Igarashi et al., 2015) and DrugMatrix (Ganter et al., 2006), and evaluated their performance on multiple, independent testing datasets. Because the liver is the major site where drugs and exogenous toxins are metabolized, these extensive dataset compilations make DNN modeling feasible for complex in vivo liver disease phenotypes induced by chemical exposures. To gage the performance levels of our models and evaluate the influence of feature selection, we built both single-and multi-task DNNs for the three liver toxicity endpoints, compared the performance of DNNs to that of SVMs and RFs, and examined the robustness of the models in predicting the liver injury phenotypes in experimental datasets derived from non-chemically induced injuries. We further investigated the impact of feature selection strategy on prediction accuracy, by building models using multiple sets of gene-and pathway-level features. Overall, our results suggest that DNNs offer a practical and robust modeling strategy to predict chemically induced liver injury from transcriptomic data.

Transcriptomic Data
We developed machine-learning models for three commonly evaluated histopathology endpoints liver endpoints biliary hyperplasia, liver fibrosis, and liver necrosis using data publicly available from two large-scale toxicogenomics databases, DrugMatrix (Ganter et al., 2006) 1 and Open TG-GATEs (Igarashi et al., 2015) 2 . These databases contain data that match chemical exposures with transcriptomic changes in multiple tissues of Sprague-Dawley rats to graded histopathology assessments. We downloaded the rat in vivo liver microarray datasets based on the Affymetrix GeneChip Rat Genome 230 2.0 Array from TG-GATEs and DrugMatrix. The raw dataset contains whole genome microarray expression data for liver and kidney from 6,765 and 2,218 rats, respectively. According to our previous protocol (Tawa et al., 2014;Te et al., 2016), we assessed the quality of the arrays and removed outlier arrays and renormalized the remaining data. For both datasets, we further removed samples where histopathology scores for the three studied endpoints were missing. For some chemicals, all of the exposure conditions, i.e., chemical-time-dose combinations, did not induce any of the three endpoints, and we removed samples related to these chemicals to mitigate the problem of data imbalance. The final training data ( Table 1) included all samples of all available exposure conditions. We built and evaluated the models using 10-fold cross validation. The detailed information of used samples, chemical exposure conditions, and histopathology outcomes are given in Supplementary Table S1.
Our models were evaluated on independent external datasets. The first dataset was obtained from rats repeatedly exposed to four chemicals at multiple doses and time points ; Gene Expression Omnibus accession number, GSE70559] exhibiting liver injury endpoints contained in the training data. The four chemicals are part of the TG-GATEs and DrugMatrix dataset, but with different doses and exposure durations. In order to assess endpoint predictions using this data, we first removed the training samples in TG-GATEs and DrugMatrix related to the four chemicals and built DNN and SVM models on the remaining training data.
The first two datasets involved the use of non-chemical treatments, i.e., heat shock (Stallings et al., 2014) and bile duct ligation (Sutherland et al., 2018). The third was obtained using two chemicals, nevirapine and galactosamine (Brown et al., 2016), of which only galactosamine was present in the training data. The fourth involved five chemicals, i.e., pyrazinamide, ranitidine, enalapril, carbamazepine, and chlorpromazine (Eun et al., 2015), of which only carbamazepine was present in the training data. The final dataset, involved the use of methapyrilene at a dose and exposure duration different from those used in the training data (Slopianka et al., 2017). In short, these datasets thus represented physiological and chemical perturbations independent of the training data.
Four external datasets (Ippolito et al., Stallings et al., Brown et al., and Eun et al.) had sample-level histopathology annotations, i.e., the transcriptome of each sample was directly linked to its histopathological injury score. Therefore, standard contingency tables allowed us to evaluate the models based on these data. In contrast, Sutherland et al. and Slopianka et al. data did not include sample-level histopathology annotations, although they did include mean injury scores at multiple time points and/or doses (exposure conditions). These data allowed us to investigate the correlation and consistency between experimental results and model predictions. Table 1 summarizes all training datasets and four external testing datasets. The detailed information of samples, chemical exposure conditions, and histopathology outcomes of these external datasets are given in Supplementary Tables S1, S2.

Feature Selection
To investigate the impact of varying the input features on the performance of our model, we constructed gene-and pathway/co-expression module-level feature sets, each of which represented a specific way of extracting liver toxicity information from the transcriptome. Table 2 shows the 13 constructed datasets that we used as the model inputs.
For gene-level feature sets, we defined a feature by the fold-change value calculated from the difference between the mean log-transformed gene expression values for samples in the treatment and control cohorts. Feature sets were chosen arbitrarily or according to previous data-mining results. We generated eight gene-level feature sets. The first three were based on data-mining techniques, two were formed by reducing the number of genes in these sets, and the remaining three were created by arbitrarily (randomly) selecting genes so that the number of genes was similar to that of the first five feature sets. The first three sets comprised the (1) predictive toxicogenomics space [PTGS (all)], composed of 1,331 genes considered related to cytopathology and drug-induced liver injury (Kohonen et al., 2017), (2) L1000, composed of 978 genes and considered as an adequate reduced representation of the whole-genome expression profile (Subramanian et al., 2017), and (3) Toxicity Module Gene, a set composed of genes belonging to 89 co-expression modules we previously identified as being associated with chemically induced liver injuries (Tawa et al., 2014;Te et al., 2016). To investigate the effect of varying the number of features on model performance, we generated two additional feature sets based on the first three: (4) PTGS (core), a subset of PTGS (all) containing the 199 genes occurring in all 14 overlapping components (gene sets) of the latter (Kohonen et al., 2017), and (5) Toxicity Module (L1000), composed of genes occurring in both Toxicity Module and L1000. Finally, to evaluate the impact of knowledge-based gene selection on the ability of predicting liver injury, we generated three non-curated gene sets: (6) A200, (7) A600, and (8) A1200, containing 200, 600, and 1,200 randomly selected genes, respectively. The number of genes in these sets spanned a range similar to that of the first five datasets. For our pathway/module-level feature sets, we defined a feature as the expression score of an entire pathway/module (Schyman et al., 2018). Briefly, we first calculated the fold-change values for all genes occurring in the pathway/module set. Subsequently, we calculated the absolute value of each gene's log-transformed fold-change value, as well as its average (µ 0 ) and standard deviation (σ ) across all genes. For a gene set (pathway), we calculated the average score (X) of the absolute values. We estimated the significance of a gene set by its p-value, i.e., the probability of having a score (X) more extreme than the calculated value. According to the Central Limit Theorem, the probability distribution for an average value is approximately normal with parameters µ 0 and σ √ n, where n is the number of genes in the gene set. The p-value can be calculated as the upper tail of the N µ 0 , σ √ n distribution. The z-score, which is defined as has a normal probability distribution, N (0, 1). We used this score as the expression score of the individual pathway/module.
We used the following four pathway and module feature sets: (1) the Molecular Signatures Database C2 collection [MSigDB (C2)], downloaded from the Molecular Signatures Database 3 and containing 1,329 curated gene sets, including canonical pathways and gene sets representing expression signatures of genetic and chemical perturbations, (2) MSigDB (hallmark), a collection of 50 gene sets generated by computationally mining all of the MSigDB data to reduce noise and redundancy, and considered to represent specific well-defined biological states or processes that display a coherent gene expression pattern (Liberzon et al., 2015), (3) Toxicity Module, a set of 89 co-expression modules associated with chemically induced liver injuries (Tawa et al., 2014;Te et al., 2016), and (4) MSigDB (C2) L1000, a set containing only MSigDB (C2) pathways in which at least one gene also occurred in L1000, and constructed to investigate how gene reductions in pathways affect model performance.
For each feature set, we calculated expression scores for samples in the training sets to obtain the input data matrices. Given an endpoint, histopathology annotations of samples provided the output vectors. The output vector represented the occurrence of the liver histopathology endpoint in the samples: 1 if the endpoint was found in the sample; and 0 otherwise.
The eight gene-and four pathway/module-level feature sets described above constituted the 12 training sets. To test for over-fitting of the model, we generated additional pathway-level datasets, Toxicity Module Gene Y-Shuffle and L1000 Y-Shuffle, by retaining the input matrix of Toxicity Module Gene and L1000 while randomly shuffling the values in the output vectors for the training set (Table 2). Thus, we obtained 14 training sets for each histopathology endpoint associated with liver injury.

Data Imbalance
All datasets showed strong class imbalance, i.e., negative samples greatly outnumbered positive samples (Table 1). Classifiers may be biased towards major classes and, hence, show poor classification performance for minor classes (Lemaitre et al., 2017). To address this problem, we applied the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) to process the training data, and used multiple metrics, such as F1 and the Matthews correlation coefficient (MCC), to evaluate model performance. The MCC and F1 can be calculated from a confusion matrix as follows: where TP, TN, FP, and FN refer to the number of true positives, true negatives, false positives, and false negatives, respectively.

Deep Neural Networks
The single-task DNNs used in this study were standard, fully connected multilayer perceptrons with a single neuron in the output layer. For each toxicity endpoint, we built 13 single-task DNNs based on the 13 feature sets. For the DNN calculations, we used the open source Python library Keras 4 on top of TensorFlow 5 as the backend, a ReLU activation function for the hidden layers, a sigmoid activation function for the output layer, the Adam optimizer, a binary cross-entropy loss function, a kernel initializer with a normal distribution, early stopping and a dropout technique for all input and hidden layers. For each single-task DNN, we optimized the hyperparameters (i.e., the number of hidden layers, number of nodes in the layers, batch size, and dropout rate; Supplementary Table S3) by a grid search technique with cross validation, using the F1 score as the objective metric. Figure 1 shows the diagram of our single-task neural network, and the values of training loss. In all investigated cases, the loss values became flat within 200 epochs.
Multi-task learning try to solve the classification of three endpoints at the same time. To test the performance of multi-task learning strategy in predicting liver injury, we constructed multitask DNNs with hard parameter sharing (Caruana, 1997). For our multi-task model, the three endpoints shared the first several hidden layers, while each endpoint retained task-specific layers to generate the relevant liver injury outputs. The DNN calculations were performed as described for single-task DNNs. We built the multi-task DNNs and optimized the hyperparameters (i.e., the number of shared and task-specific hidden layers, number of nodes in the hidden layers, relative weighting between each task's loss, and dropout rate; Supplementary Table S3) by a grid search technique with cross validation, using the F1 score as the objective metric. The codes that implement the neural network are available at https://github.com/BHSAI/DNN-liverTox.

Random Forest
To evaluate the performance of the deep-learning method in relation to other methods, we also built RF classifiers. We constructed the RF models using Scikit-learn 6 . We optimized the hyperparameters (i.e., the number of trees, minimum impurity decrease; see Supplementary Table S3 for final values) by a randomized search technique with cross validation, using the F1 score as the objective metric.

Support Vector Machine
We also built SVM classifiers, using Scikit-learn (See footnote 6). We optimized the hyperparameters (i.e., the kernel used, value of locality parameter Gamma, and value of regularization parameter C; see Supplementary Table S3 for final values) by a randomized search technique with cross validation, using the F1 score as the objective metric. Figure 2 shows the distribution of mean MCCs of the four machine-learning algorithms across the 12 feature sets. All algorithms achieved high MCC scores across these feature sets. Single-task DNN and SVM outperformed RF and multi-task DNN for the three endpoints. Tables 3, 4 show the performances on different endpoints for single-task DNN and SVM using different feature sets (the results for RF and multi-task DNN are shown in Supplementary Tables S4, S5). The single-task DNN and SVM perform similarly in these studies. Based on the cross validation studies (Tables 3, 4), the best (i.e., highest average MCC) feature sets for single-task DNN were Toxicity Module Gene for the biliary hyperplasia and fibrosis endpoints and L1000 for the necrosis endpoint; the best feature sets of SVM were Toxicity Module Gene for the biliary hyperplasia endpoint and A1200 for the fibrosis and necrosis endpoints.

Single-Task DNN Had More Consistent Performance Than SVM for Data Not Seen in Training
Because SVM and single-task DNN models performed similarly in the cross validation studies, we further compared their performance on the Ippolito et al. data (see MATERIALS AND METHODS, Transcriptomic Data). We used this data because it was the largest external dataset we could find and contained more balanced endpoints annotation than all other external validation sets. To make an objective estimation of the performance, we removed the training samples related to the four chemicals occurring in the training dataset used to build the models (see MATERIALS AND METHODS, Transcriptomic Data). We compared the performance of single-task DNN and SVM algorithm using the models built on the best feature sets and found that single-task DNN outperformed SVM (Table 5). This result indicate that DNN models potentially exhibit better predictive ability for data not seen in training. Based on the above results, we selected the single-task DNN models built based on the Toxicity Module Gene feature set (for biliary hyperplasia and fibrosis) and L1000 (for necrosis) for further studies.

Specificity of Single-Task DNN Models
We partitioned the training samples into different categories by the occurrence of the liver injury endpoints, i.e., three where only one endpoint was present (single-endpoint), three where exactly two endpoints were present (double-endpoint), and one where all three endpoints were present (Table 6, columns 1-3). We investigated the specificity of DNN models by evaluating their performance to predict single and multiple injury endpoints. Table 6 shows multiple metrics for the two best feature sets which had highest mean MCCs for the three endpoints. The results for the other feature sets are shown in Supplementary Table S6. We calculated these metrics using model predictions on all 10 validation sets. Figure 3 shows the MCCs of all 12 feature sets for single-and double-endpoint categories. In cross validation, most feature sets showed comparable levels of performance, exhibiting relatively high MCCs for most categories. All zero MCCs came from the category where only fibrosis occurred. This category had 2,809 negative samples, but only 5 positive   samples. The extremely small size of the minor class, which only supports sparse sampling in the data space, may underlie the poor performance in this category. Overall, these results suggest that the single-task DNN models were specific to the injury endpoints.

External Validation Using Independent Data
We built the final single-task DNN models for the three endpoints and all training data using their corresponding best feature sets (i.e., Toxicity Module Gene for biliary hyperplasia    Figures 4A,B, respectively) showed recall rates of 0.50 and 0.67, precision rates of 1.00 and 1.00, and F1 scores of 0.67 and 0.80, respectively, for the two datasets. For Eun et al., the contingency table was degenerative because there were no positive samples. Table 7 shows that the model correctly predicted almost all samples (4 erroneous predictions out of 807) for the three endpoints. Because the two other external datasets of Sutherland et al. and Slopianka et al. did not provide any endpoint injury annotations, we could not directly evaluate our predictions through standard contingency tables. However, these studies did provide average injury scores for multiple time points and/or doses. In addition, our predictions allowed us to derive the number of injured animals-another measure of the injury level-for the endpoints at these exposure conditions. Hence, if our predictions are accurate, the two measures should be positively correlated. Consistent with this expectation, the Spearman's correlation coefficients (rho-values) were 0.72 (p = 7.8 × 10 −5 ) for the dataset of Sutherland et al. (2018) and 0.73 (p = 0.006) for the dataset of Slopianka et al. (2017) (Figures 4C,D, respectively).

Label-Shuffled Models Show No Predictive Power
We performed 10 random shufflings to construct 10 label-shuffled Y-Shuffle sets for Toxicity Module Gene and L1000 (see MATERIALS AND METHODS, Feature Selection). Table 8 shows that these models performed poorly in both cross validation and for the external dataset of Ippolito et al. No meaningful model could be built from the label-shuffled dataset.

Models With Arbitrary Gene Sets Show Poor Performance in External Validation
We also tested the performance of arbitrarily selected gene sets in predicting outcomes in the datasets of Stallings et al. and Brown et al. We used these datasets because they used treatment

DNN Accurately Predict Liver Toxicity Endpoints From Transcriptomic Responses
Our single-task DNN model achieved high performance scores in cross validation using various feature selection methods (Table 3 and Figure 2). Our aim was to develop a model that learns signals associated not with specific chemical-exposure conditions but with the endpoints. The high MCCs for the three endpoints and various patterns of present/absent endpoints (Table 6 and Figure 3) indicate that our DNN models are endpoint-specific. We further validated our single-task DNN model with data from six additional independent experiments, most of which induced the target endpoints by treatments different from those used in the training data. The power of the single-task DNN model was highlighted by the fact that it performed satisfactorily across these datasets (Figures 3, 4 and Table 7). In summary, the single-task DNN model provided robust predictions for the intended injury phenotype.
To ascertain that the signals learned by our model was not spurious, we also applied y-randomization (Rucker et al., 2007) to the best feature sets for the three endpoints. In this method, the performance of the original model is compared to that of a model trained with the same original input variables and model-building procedure but with the output variable randomly shuffled. The underlying rationale is that a useful model should describe the given data better than by chance alone (Rucker et al., 2007); i.e., the model based on original data should outperform a model based on randomized data. We found that randomizing the class labels generated a model with no predictive ability ( Table 8). All together, these results suggest that the DNN model captured true signals of the target endpoints.

DNN Provides Robust Predictions for the Three Endpoints
Although DNNs have been extensively adopted in various fields of study in recent years, they are far from general panaceas. On many tasks, they perform no better than other methods. For example, we recently showed that the overall performance level of a DNN is quite similar to that of a variant of the nearest neighbor classifier (arguably the simplest machine-learning method) . Therefore, careful comparisons of DNNs with other machine-learning methods is important for identifying the strengths and weaknesses of this modeling approach. Here, we specifically chose RF and SVM models as a reference because, ever since their introduction at the turn of the century (Cortes and Vapnik, 1995;Breiman, 2001), they have proven successful in many fields, including the biomedical sciences (Denisko and Hoffman, 2018) and are strong competitors to DNN models in many areas. To ensure a fair comparison, we built RF and SVM models following the same procedure as that for constructing DNNs: searching the hyperparameter space to identify the best parameter settings, and then using these settings to build 10 models with 10fold cross validation. Single-task DNN and SVM models showed comparable performance in cross validation studies (Tables 3, 4 and Figure 2), whereas, DNN models showed better performance when evaluated in the independent testing data ( Table 5).
Comparing the performance of single-task DNN and SVM models using the external validation data also showed that single-task DNN models were less sensitive to the selection of feature set than SVM models for fibrosis (Supplementary Figure S1). SVM had mean MCCs of less than 0.05 (0.00, 0.15, 0.04, and 0.05) for feature set PTGS core, A200, A600, and PTGS (all), while the corresponding MCC values for single-task DNN were 0.36, 0.60, 0.58, and 0.82, respectively. Furthermore, single-task DNN outperformed SVM in 27 out of 36 cases (Supplementary Figure S1).
Overall, DNN achieved consistent robust performance, which indicates that high-throughput in vivo toxicological expression data deposited in TG-GATEs and DrugMatrix contain rich information for predicting of these endpoints and DNN is a powerful method to extracting such information.

Feature Selection and Model Performance
The curated feature sets can be classified into two types according to how they were selected. In gene-level features, Toxicity Module Gene, PTGS (all) and PTGS (core) are generated by data mining techniques to isolate the features associated with liver toxicity. In contrast, L1000 is a reduced representation of genes sufficient to predict cell-wide gene expression patterns. In pathway level features, Toxicity Module features are generated by their statistical correlation with liver toxicity, while MSigDB (C2) and MSigDB (C2) L1000 represent all known canonical pathways. The two types of feature sets shared only a small fraction of genes: the Jaccard index, an index of the similarity between two sets, was 0.11 for L1000 and PTGS (all) and 0.08 for MSigDB (C2) and Toxicity Module. This finding indicates that the two types of feature sets are largely independent of each other. The comparable performance for the two types of features (Table 3) suggests that in constructing DNN model, providing input variables with enough diversity is more important than selecting the toxicity-specific features in advance. This notion is consistent with the ability of DNNs to automatically discover the relevant representations from the input features (LeCun et al., 2015).
In cross validation arbitrary gene sets did exhibit relatively high predictive power in the training data. For example, the mean MCCs for A1200 the three endpoints are comparable to the best curated feature sets (Table 3). However, arbitrary gene sets performed poorly on external testing sets ( Table 9). In contrast, the F1 scores for the best curated feature sets on the two external testing sets were 0.80 and 0.67 (Figures 4A,B), which were much higher than the corresponding scores of the models with arbitrary gene sets. The poor generalization of the arbitrary gene-set − based models indicated that they did not learn true signals associated with the endpoints as well as did the curated feature-set − based models.
For all three endpoints, the feature set Toxicity Module Gene (L1000) showed consistent performance in both cross validation and external validation for both single-task DNNs and SVMs (Tables 3, 4 and Supplementary Figure S1). This feature set contained only 154 features, but performed better than the other feature sets with similar number of features [A200 and PTGS (core)] in the external Ippolito et al. data. In cross validation, small feature sets such as Toxicity Module and MSigDB (Hallmark) had in general lower mean performance than the best feature sets consisting of about 1,000 features. However, it is unlikely that the lower average performance for the two feature sets can be attributed to its small size, because as discussed above, the Toxicity Module Gene (L1000) feature set, which contained only 154 features, achieved F1 scores that were comparable to those of the best feature set. The correlation between feature set size and performance was rather low (r 2 = 0.02, p = 0.43; Table 3), indicating that the relationship between feature selection and model performance is not straightforward.
In summary, our findings show that gene-and pathway-level feature sets with diverse functional information perform on par with toxicity-specific feature sets; and single-task DNN have potential better performance better than SVM and RF; and carefully selection of features results in better performance than arbitrary selection in data not seen during training. Importantly, our DNNs exhibit good generalization of phenotype prediction in independent external testing datasets.