Development of Prediction Models for Drug-Induced Cholestasis, Cirrhosis, Hepatitis, and Steatosis Based on Drug and Drug Metabolite Structures

Drug-induced liver injury (DILI) is one of the major reasons for termination of drug development. Due to the importance of predicting DILI in early phases of drug development, diverse in silico models have been developed to filter out DILI-causing candidates before clinical study. However, no computational models have achieved sufficient prediction power for screening DILI in early phases because 1) drugs often cause liver injury through reactive metabolites, 2) different clinical outcomes of DILI have different mechanisms, and 3) the DILI label on drugs is not clearly defined. In this study, we developed binary classification models to predict drug-induced cholestasis, cirrhosis, hepatitis, and steatosis based on the structure of drugs and their metabolites. DILI-positive data was obtained from post-market reports of drugs and DILI-negative data from DILIrank, a database curated by the Food and Drug Administration (FDA). Support vector machine (SVM) and random forest (RF) were used in developing models with nine fingerprints and one 2D molecular descriptor calculated from drug (152 DILI-positives and 102 DILI-negatives) and drug metabolite (192 DILI-positives and 126 DILI-negatives) structures. Models were developed according to Organisation for Economic Co-operation and Development (OECD) guidelines for quantitative structure-activity relationship (QSAR) validation. Internal and external validation was performed with a randomization test in order to thoroughly examine model predictability and avoid random correlation between structural features and adverse outcomes. The applicability domain was defined with a leverage method for reliable prediction of new chemicals. The best models for each liver disease were selected based on external validation results from drugs (cholestasis: 70%, cirrhosis: 90%, hepatitis: 83%, and steatosis: 85%) and drug metabolites (cholestasis: 86%, cirrhosis: 88%, hepatitis: 86%, and steatosis: 83%) with applicability domain analysis. Compiled data sets were further exploited to derive privileged substructures that were more frequent in DILI-positive sets compared to DILI-negative sets and in drug metabolite structures compared to drug structures with a Morgan fingerprint level 2.


INTRODUCTION
Drug-induced liver injury (DILI) is caused by almost all classes of drugs and covers diverse clinical manifestations such as hepatitis, cholestasis, cirrhosis, and steatosis depending on the duration of injury and the histological location of damage (David and Hamilton, 2010). DILI is typically derived from an initial hepatocellular injury; however, the mechanisms for each clinical outcome are distinct. Drug-induced cholestasis (DICH) is caused when drug disturbs bile acid homeostasis by inhibiting hepatic transporters that mediate biliary secretion of bile acids and other organic solutes at the sinusoid (Padda et al., 2011;Sagnik and Pieter, 2018). Drug-induced cirrhosis (DIC) is developed when drugs trigger fibrogenesis, through which excessive extracellular matrix molecules (ECM) are produced. Normally, ECM is removed by matrix metalloproteinases (MMPs); however, chronic liver damage upregulates tissue inhibitors of MMPs, and thus, ECM accumulates throughout the liver (Schuppan and Afdhal, 2008;Tsochatzis et al., 2014). Hepatitis is an immune-mediated liver injury, and drug-induced hepatitis (DIH) is caused by drugs or their metabolites binding to cellular proteins (e.g., cytochrome P450) which are recognized as antigens by the immune system (Björnsson et al., 2010) and triggering an adverse immune response that resembles viral infection (Hodgson and Levi, 2004). Hepatic steatosis is characterized by fatty acid accumulation in hepatocytes due to increases in lipogenesis or decreases in fatty acid secretion (Rabinowich and Shibolet, 2015). Since mitochondria play a significant role in lipid metabolism in hepatocytes, drugs accumulating in mitochondria cause drug-induced steatosis (DIS) by interfering with mitochondrial respiration or b-oxidation (Patel and Sanyal, 2013). As each clinical outcome of liver injury is derived from distinctive mechanisms, prediction models for each disease are needed to improve predictive power and interpretability of the results.
DILI is a challenging endpoint in predictive toxicology even though numerous studies have already been conducted . Particularly, DILI classification models based on drug structures faced limitations in improving their performance since these are based on the assumption that similar structures have similar properties, while certain hepatotoxicants damage hepatocytes through their reactive metabolites (Cherkasov et al., 2014). There are diverse data sets available in research articles or databases that provide information on whether drugs cause DILI, yet DILI induction by reactive metabolites is only partially reported. Even in the LiverTox (NIDDK, 2012), which listed DILI causing drugs with information on their reactive metabolites, drug metabolism information is mentioned for certain drugs while it is missing for others, and thus it is impossible to judge whether the drugs are not actively metabolized, metabolites of the drugs are not reactive, or simply reactivity of drug metabolites was not examined. Due to lack of DILI labels on drug metabolites, development of prediction models for drug metabolites has been a challenging task.
As DILI has become one of the leading causes for the termination of premarket studies or drug withdrawal from the market (Fisk et al., 2018), diverse hepatotoxicity prediction models have been developed to filter out DILI causing molecular structures in early phases of drug discovery (Yang et al., 2018). In silico models, which predict DILI causation based on the molecular structure of drugs, are developed either with knowledge of experts to give structural alerts (SAs) or with computation based on statistical modeling (Greene and Naven, 2009). SAs are also referred to as structure-activity relationships (SARs) in which patterns of the molecular substructure of DILI causing drugs are defined according to expert knowledge and experience. SARs have been developed based on reactive drug or metabolite structures, or pharmacological mechanisms such as redox cycling, mitochondrial dysfunction, and disruption of hepatic transporters (Fisk et al., 2018). Derek hepatotoxicity module is one example of a hepatotoxicity SAR (Greene et al., 2010). The statistical model is also termed quantitative structureactivity relationship (QSAR), in which machine learning algorithms were used to train the model with chemical structure as an input and their activity as an endpoint . Few in silico models were developed to predict specific DILI endpoints such as cholestasis and steatosis based on their molecular mechanism, while numerous in silico models were developed to predict DILI or hepatotoxicity overall (Hewitt and Przybylak, 2016;Hong et al., 2018;Yang et al., 2018), although the molecular mechanism of each liver disease is different. For cholestasis, the bile salt export pump (BSEP) inhibition model was developed since BSEP inhibition is one molecular mechanism of hepatic cholestasis (Sagnik and Pieter, 2018). For steatosis, QSAR models were developed for cellular targets of molecular initiating events in liver steatosis adverse outcome pathways such as the pregnane X receptor, the liver X receptor, the aryl hydrocarbon receptor, the nuclear factor (erythroid-derived 2)-like 2, and the peroxisome proliferatoractivated receptors (Gadaleta et al., 2018).
In this work, binary classification models that predict four types of DILI (i.e., cholestasis, cirrhosis, hepatitis, and steatosis), were developed with 257 drugs. DILI caused by drug metabolites were considered by developing models with 318 drug metabolites of the compounds. Structural differences of the compounds were considered alone in the prediction models of this work, since structural motifs in each drugs and drug metabolites are assumed to be associated with clinical outcomes. Prediction models were developed with molecular fingerprints and 2d descriptors. Support vector machine (SVM) and random forest (RF) were applied in training, and the models achieved sound prediction performance. Models were developed, validated, and analyzed according to the Organisation for Economic Co-operation and Development (OECD) guidelines for QSAR validation (OECD, 2007). Compiled data sets for drugs and drug metabolites were further examined to find privileged substructures in DILIpositive compared to DILI-negative groups, and in drug metabolites compared to drugs.

Drug-Induced Liver Injury Label Assignment on Drugs
Two databases were combined in this work to prepare data sets for model development: PharmaPendium (Elsevier, 2008) and DILIrank (Chen et al., 2016). Drugs that have been reported to cause cholestasis, cirrhosis, hepatitis, and steatosis were collected from PharmaPendium, and drugs with no-evidence-of-DILI (noDILI) were obtained from DILIrank. Since four clinical outcomes were the target endpoints in this work, four data tables for DICH, DIC, DIH, and DIS were separately prepared. In each data set, drugs were labeled as DILI-positive when they had over 50 post-market reports for DICH, DIC, DIH, and DIS, while drugs were labeled as DILI-negative when they met two conditions: 1) drugs were defined as noDILI in DILIrank, and 2) no post-market reports were found for the drugs in PharmaPendium. Further description of DILI label assignment is illustrated in the DILI label assignment section of the Results and Discussions.

Structure Curation and Data Preprocessing
In the structure curation process, salts were removed so that active ingredients were used alone in model development. Drugs were removed when they were metallo-organics, proteins, large peptides, or mixtures. Duplicated structures were also excluded from the data set. Drugs whose molecular weight (MW) was higher than 800 Da or lower than 100 Da were removed because 1) data points were sparse within the MW range, 2) molecules with MW higher than 800 Da were commonly found to be peptides, and 3) molecules with MW lower than 100 Da were often organic solvents rather than drugs. One hundred fifty-two DILI-positive drugs and 105 DILI-negative drugs were obtained after preprocessing.

Drug Metabolite Set Preparation
As DILI labels for drug metabolites were not available, metabolites of drugs that cause DICH, DIC, DIH, and DIS were identically labeled as DILI-positive metabolites, and noDILI as DILI-negative, respectively. Metabolite structures for the DILI-positive set were obtained from the ADME database (Fujitsu, 1996), in which 78,703 entries of phase I drug metabolism and 17,692 entries of phase II drug metabolism were available (updated July 1, 2019); however, those for the DILI-negative set were relatively sparse. Therefore, DILInegative drug metabolite sets were compiled with DILInegative drug metabolite structures and DILI-negative parent drug structures based on the assumption that drugs whose metabolite structures were not found in the ADME database were not actively metabolized, since the ADME database covers drug metabolism of diverse compounds. Given that phase I drug metabolism produces majority of metabolic products, drug metabolites produced by human cytochrome P450 were used in this study. One hundred ninety-two DILI-positive drug metabolites were obtained from 77 DILI-positive drugs. The DILI-negative set for the drug metabolite-based prediction model was composed of 36 DILI-negative drug metabolites obtained from 20 DILI-negative drugs and 90 DILI-negative drugs whose metabolite structures were not available in the ADME database. Summary of data composition is in Figure 1, and the distribution of the number of metabolites per parent compound is presented in Figure 2.

Feature Calculation and Curation
In this work, 2D molecular fingerprints (FP) and molecular descriptors were calculated during model development. RDKit version 2018.03.2 (Landrum, 2006) was used in FP calculation, and nine total FPs calculable in RDKit were prepared: MACCS FP (Durant et al., 2002), Avalon FP in a binary string based on presence of substructures and in an integer string based on the number of substructures (Gedeck et al., 2006), atom-pair FP FIGURE 1 | The number of drugs for each drug data set (A) and drug metabolite set (B). The hepatitis data sets were the largest in each group.  (Carhart et al., 1985), topological-torsion FP (Nilakantan et al., 1987), RDKit layered FP, and Morgan FP (Rogers and Hahn, 2010) with three different depths (level 2, 3, and 4). 2D descriptors were introduced from Mordred ( Moriwaki et al., 2018), which provides 1800 2D and 3D descriptors and is also implemented as a python package. 3D descriptors were not calculated in this study since an accurate 3D structure of drugs and drug metabolites were unavailable. Calculated features were removed from the data table when their standard deviation was lower than 0.01.

Training and External Test Set
Eight data sets were compiled from DILI-positive drugs for DICH, DIC, DIH, and DIS, and DILI-negative for drug and drug metabolite structures. In the DICH, DIC, and DIS sets, the number of drugs in the DILI-positive class is almost half of that of DILI-negative, while the DIH set is imbalanced with 1.5 times more DILI-positive drugs than DILI-negative drugs. In the drug metabolite set, the number of DILI-positives for DICH, DIC, and DIS is also less than that of DILI-negatives with a smaller difference in data size between two classes, while that of DIH has a relatively larger data size difference compared to DILInegatives. Models trained with imbalanced data tend to be improperly trained and give biased prediction outcomes on the majority class in the data. In order to prevent biased training on the model, under sampling was applied to reduce data size imbalance between positive and negative data in the training set (Zakharov et al., 2014;Klimenko et al., 2019). In this work, data imbalance was handled by designing the number of DILIpositives and DILI-negatives for each set, specifically. First, minor classes from the data set were separated with a ratio of 80 and 20% as a training set and external test set, respectively. Second, the major class was separated with a ratio that provided a similar amount of data to the minor class in the training set. This  approach makes the ratio of DILI-positive to DILI-negative almost one to one in the training set of each data set. Data composition is summarized in Table 1 with the number of positive and negative sets in training and external sets, with ratio. Even though data was randomly split with the ratio designed for each data set, it was confirmed that randomly split test set was evenly distributed within chemical space represented by randomly split training set. Chemical space similarity between the training set and the test set implied that the models trained with the training set can be successfully evaluated by the test set.

Model Development
Support vector classification (SVC) (Cortes and Vapnik, 1995), implemented in scikit-learn (Pedregosa et al., 2011), was applied during model development. The "SelectFromModel" function in the scikit-learn "feature_selection" package was applied in this work, which chooses significant features based on the weights of the model. Since SVC with linear kernel (linear SVC) defines hyperplane in the linear formula (i.e., w·x+b, where w is weight vector, x is feature vector, and b is bias vector), linear SVC was used first to subset significant features with "SelectFromModel," and then SVC with radial basis function (RBF) kernel was applied to develop prediction models on the data set with selected features. RBF kernel projects inputted feature space into high-dimensional space in which linearly inseparable cases in the original feature space became separable with linear classifier. Random forest (RF), implemented in scikit-learn, was also used in model building. RF model is an ensemble approach, which is composed of diverse number of decision tree (DT) models. In RF, unseen data is predicted based on majority vote from DT models. (Svetnik et al., 2003). In this study, the number of DT was set to 10 due to small size of data, and information gain was used.

Model Performance Validation
The predictive power of the model was evaluated based on five metrics: accuracy (ACC), sensitivity (SEN), specificity (SPE), Matthew's correlation coefficients (MCC), which is regarded as a balanced binary classification performance measure on imbalanced data, and area under the curve (AUC) of recursive operating characteristic (ROC) curve.
where TP, TN, FP, FN are true positive, true negative, false positive, and false negative, respectively. Binary classification models are designed to calculate a probability score of the instances, and the instances are assigned into certain groups if the score is higher or lower than a threshold value. Better classification performance is expected when positive data obtains a higher score while negative data a lower score, which leads to higher AUC. The "Metrics" package implemented in scikit-learn was used to calculate AUC, which was measured from the ROC curve based on SEN and false positive rate (FPR), calculated by changing the threshold values.
For internal validation, five fold cross validation and bootstrapping were performed. In bootstrapping, data set was sampled 100 times since recommended sampling iteration was between 25 and 200. Sampled data is used to train the model, and the model is validated with un-sampled data. Statistically, 63.2% of data is sampled on average (Braga-Neto and Dougherty, 2004); therefore, a 0.632 estimator scheme is applied to each metric to evaluate model robustness and predictability as, where metric refers to ACC, SEN, SPE, MCC, and AUC on sampled and un-sampled data during bootstrapping. Y-randomization tests examine the correlation between selected features and target endpoint in the model and whether it is achieved by mere chance. In the test, model development protocols were repeated after end point values were randomly shuffled. The models developed from Y-randomization are termed random models, and the model developed with correctly labeled data is an original model. Y-randomization testing was repeated 10 times to develop 10 random models, and Z-score was calculated based on MCC of random models and the original model as, where MCC ori is MCC of the original model on external test set, MCC mean random is averaged MCC of 10 random models on the external test set, and s is standard deviation of MCC of 10 random models. Models were considered to achieve statistically valid correlation between selected features and the endpoint when Z-score exceeded 3.

Applicability Domain Analysis
Since reliability of the model prediction outcome is dependent on the data used to train and select the model, applicability domain (AD) of the models were defined for giving reliable prediction results under the consideration of the chemical space. Leverage (h i ) is a method to detect outliers or novel entities among data sets (Tropsha et al., 2003), which is calculated as, where x i is a feature vector of i-th compound, X is a feature matrix, and T is a transpose. When the leverage of certain compounds was lower than the warning leverage (h*), the prediction outcome was reliable since input data for the prediction is not an outlier based on the chemical space of the training data.
where p is the number of features selected in the model, and n is the number of training sets.

Analysis of Privileged Substructures
In this study, substructures were called privileged substructures when their presence is higher in a certain class of data. Frequency of substructures (F) were calculated as, where class is the label (i.e., DILI-positive or DILI-negative), N fragment, class is the number of drugs that have the fragment belonging to one of the binary class, N class is the total number of drugs labeled in the class, N fragment, total is the number of drugs that have the fragment in total data set, and N total is the total number of data set.
Frequency of substructures in each data set were examined based on Morgan FP in this analysis. The draw package implemented in RDKit was used to visualize substructures represented by each bit of Morgan FP level 2. In the first analysis, DILI-positive sets and DILI-negative sets in DICH, DIC, DIH, and DIS for drugs and drug metabolites were compared to find privileged substructures in DILI-positive data sets. Next, drugs and drug metabolites were compared in DICH, DIC, DIH, and DIS to find privileged substructures that were significantly influenced by drug metabolism. DILI-positive drug metabolites and drugs were compared to check which fragments were increased for bioactivation or decreased for detoxification. DILI-negative drug metabolites and drugs were also analyzed in the same way to find the detoxification effects of drug metabolism on certain molecular fragments.

Drug-Induced Liver Injury Label Assignment
Assigning DILI labels on drugs is a challenging task due to the wide spectrum of severity and differing injury mechanisms. Inconsistent DILI labels were often found among large DILI datasets since evidence for estimating DILI risk is differently weighted (Thakkar et al., 2018). Due to ambiguity of DILI labels, the noDILI drug list was examined by comparing it with the lists for DICH, DIC, DIH, and DIS. In DILIrank, 259 drugs were found to be noDILI; however, some noDILI drugs were found to have post-market reports on DICH (51 drugs), DIC (70 drugs), DIH (94 drugs), and DIS (73 drugs). In order to securely obtain DILI-negative drugs, drugs overlapping with DICH, DIC, DIH, and DIS were removed.
Distributions of post-market reports were examined from DICH, DIC, DIH, and DIS drugs overlapping with no DILI drugs ( Figure 3). A majority of the overlapped drugs have less than 10 post-market reports; however, some of them were found to have a high number of post-market reports on DICH, DIC, DIH, and DIS. In order to define the number of post-market reports for assigning a DILI-positive label to drugs, the highest and second highest post-market reports on DICH (highest: 47 and second highest: 26), DIC (highest: 76 and second highest: 34), DIH (highest: 67 and second highest: 56), and DIS (highest: 51 and second highest: 47) of the overlapped drugs were averaged, and the mean value (50.5) became standard for DILI-positive assignment. Hence, drugs having over 50 post-market reports were defined as DILI-positive drugs in the DICH, DIC, DIH, and DIS datasets.

Chemical Space Analysis
QSAR study for drug discovery normally focuses on certain classes of chemicals since it intends to predict activity variation due to moiety modification. Such approaches give high predictability to the model with relatively narrow AD. In this study, the data set is composed of drugs from diverse classes since it aims to develop a prediction model that covers heterogeneous chemical structures to secure a broader AD of the model. The chemical spaces of the data sets were visualized based on molecular weight (MW) and Wildman-Crippen octanol/water partition coefficient (logP) (Wildman and Crippen, 1999) of DILI-positive and DILInegative sets for drugs ( Figure 4) and drug metabolites ( Figure 5). Chemical space analysis revealed that chemical spaces of DILI-positive drug metabolites are slightly changed compared to DILI-positive drugs. In Figure 4, several DILIpositive drugs were found to have logP values less than 0; however, the number of DILI-positive drug metabolites with logP values lower than 0 was decreased ( Figure 5). Independent t-test indicated that change of mean logP value between drugs and drug metabolites were not statistically significant except cholestasis data set (Table S1); however, decrease of hydrophilic molecular structures in drug metabolites could potentially influence on feature selection and model training process. The shift in the chemical space of drug metabolite data sets was due to the fact that drug metabolites for drugs with low logP values were absent in the Fujitsu database. As DILI-positive drug metabolite data sets were composed of metabolite structures of DILI-positive drugs having a high logP, the mean of logP values for DILIpositive drug metabolite data sets is estimated to be higher than that for DILI-positive drug data sets even though the logP of individual drug metabolites was decreased compared to that of its parent compound. Given that drug metabolism converts lipophilic xenobiotics into hydrophilic products for rapid excretion, the chemical space shift in the drug metabolite data sets was reasonable as lipophilic drugs are more actively metabolized compared to relatively hydrophilic drugs. Moreover, the drug metabolite chemical space with increased logP was considered a more appropriate chemical space for modeling DILI causation through drug metabolites due to associations between drug lipophilicity and DILI (McEuen et al., 2017).

Model Developments and Analysis
Binary classification models were developed with nine FPs and 2D molecular descriptors on drug and drug metabolite data sets for four types of DILI (cholestasis, cirrhosis, hepatitis, and steatosis); therefore, 80 models were trained with the SVC algorithm. Performance of the models are summarized in Tables 2 and 3.

Drug Based Models
In every model development process, randomization tests were applied to avoid a situation where correlation between endpoint FIGURE 3 | Drugs with post-market reports on cholestasis (A), cirrhosis (B), hepatitis (C), and steatosis (D) were compared with drugs labeled as no-drug-induced liver injury (DILI) in the DILIrank database. Even though drugs were labeled as no-DILI in DILIrank, a number of drugs were reported to have adverse hepatic outcomes in the post-market phase. Most no-DILI drugs have less than 10 post-market reports while a few of them have a high number of post-market reports. The highest number of post-market cholestasis reports among no-DILI labeled drugs was 47 and the second highest was 26. The highest number of post-market cirrhosis reports among no-DILI labeled drugs was 76 and the second highest was 34. The highest number of post-market hepatitis reports among no-DILI labeled drugs was 67 and the second highest was 56. The highest number of post-market steatosis reports among no-DILI labeled drugs was 51 and the second highest was 47. This analysis showed that uneven distribution of data points was found throughout the space when drugs were represented by atom-pair FP ( Figure S2) and topological torsion FP ( Figure  S10). The ADs of the models were visualized based on the leverage and molecular weight range of the drugs. According to the leverage of atom-pair FP and topological-torsion FP data sets, several data points were out of the domain in every data set ( Figures S22 and S30) while Avalon FP ( Figure S24), layer FP ( Figure S28), and MACCS FP ( Figure S29) models didn't have any data points that exceeded the warning leverage.
The prediction results of the models were reliable when the inputted data point was included in the AD of the model; therefore, the AD of the models should be considered together with their accuracy when the best model is selected. In this study, performance in the external test set, randomization test results, and AD analysis were considered in selection of the best models. In DICH, MACCS FP model (RF) was considered the best model (external test ACC: 70%, Z-score: 3.14, AD: Figure S29A). In DIC, layer FP model (RF) was best (external test ACC: 90%, Zscore: 4.23, AD: Figure S28B). In DIH, 2D descriptor model (SVM) was best (external test ACC: 83%, Z-score: 5.06, AD: Figure S21C). In DIS, layer FP model (SVM) was best (external test ACC: 85%, Z-score: 5.10, AD: Figure S28D).

Drug Metabolite Based Models
In terms of ACC and MCC from the external test set, the best models were developed with Morgan (lv3) FP in DICH (SVM), Morgan (lv4) FP in DIC (RF), Morgan (lv2) FP in DIH (SVM), and topological-torsion FP in DIS (RF). In accuracy comparison, it was found that 63.8% of the models developed with drug metabolite data sets achieved higher accuracy compared to that developed with drug data sets (51 models out of 80 models). Moreover, only one model was failed in randomization test in all metabolite data sets (SVM: Avalon count FP). Due to the lack of DILI-labels on drug metabolites, the metabolites were labeled according to their parent drugs' DILI label based on the assumption that metabolites from DILI-positive drugs cause DILI and those from DILI-negatives do not. Improvement of the general performance of the model and randomization test results showed that assumption for the DILI labeling of metabolites was reasonable; however, it is still possible that models for drug metabolites contain a certain degree of structural noise introduced by the ambiguous DILI labeling of metabolites, particularly for DILI-positive metabolites since all metabolites were prepared according to the assumption. DILInegative metabolites were composed of DILI-negative drugs and metabolites of DILI-negative drugs; hence, the correlation between structural features and endpoint is potentially not solely based on drug metabolite chemical space.
Feature space visualization was performed for drug metabolite data sets. In general, data distribution of the feature space was similar between drug data sets and drug metabolite data sets, except Morgan FPs where more outliers were found in The majority of drug-induced liver injury (DILI)-positive drug metabolites had MWs less than 500. In particular, DILIpositive drug metabolites whose logP values were less than 0 had significantly decreased compared to DILI-positive parent drugs (Figure 4).  feature space generated from drug metabolite sets (Figures S15-S17). In AD analysis of the drug metabolite set, models developed with Avalon FP ( Figure S34), layer FP ( Figure S38, and MACCS FP ( Figure S39) showed that no data points exceeded the warning leverage. Based on the prediction performance in the external test set, randomization test results, and AD analysis, the best models for drug metabolites were selected. AD of the models was considered more important than accuracy in model selection since models with narrow AD cannot be applied for prediction of heterogeneous structures. In cases where accuracy in the external test was equal, MCC was used to evaluate model performance. In DICH, the layer FP model (RF) was best (external test ACC: 86%, Z-score: 5.64, AD: Figure S38A). In DIC, the Avalon FP model (SVM) was best (external test ACC: 88%, Z-score: 7.98, AD: Figure S34B). In DIH, the MACCS FP model (RF) was best (external test ACC: 86%, Z-score: 10.85, AD: Figure S39C). In DIS, the Avalon FP model (RF) was best (external test ACC: 83%, Z-score: 7.42, AD: Figure S34D).

Privileged Structure Analysis for Drug-Induced Liver Injury
Privileged substructures, which were more frequent in DILIpositive sets, were found using Morgan FP in level 2 based on the number of substructures and their frequency as calculated with equation 10. Privileged substructures for drugs are listed in Tables S2-S5 and for drug metabolites in Tables S6-S9 in simplified molecular-input line-entry system arbitrary target specification (SMARTS) pattern, were increased in DILI-positive drug metabolite sets compared to that of drug sets while the number of substructures, [#6]-[#7] in SMARTS pattern, was relatively decreased in DILI-positive drug metabolite sets compared to drug sets and slightly increased in DILI-negative drug metabolites sets compared drug sets in all data sets. Substructures of the SMARTS patterns are available in Table 4. In DILI, drugs often damage hepatocytes through their reactive metabolites; therefore, privileged substructures were derived by comparing DILI-positive and DILI-negative sets and also comparing drug and drug metabolite data sets to consider metabolic activation of compounds. Even though privileged substructures were suggested based on cholestasis, cirrhosis, hepatitis, and steatosis data for drugs and drug metabolites, SAs based on the privileged substructures suggested in this work may have limited predictability for DILI since the substructures were obtained from small data sets, and uncertainty in drug metabolite labeling may introduce a certain degree of structural noise in the analysis. These substructures in Tables S1-S12 may still need further exploration to confirm their predictability on DILI in SAs.

CONCLUSION
In the current study, binary classification models were developed to predict four types of DILI; cholestasis, cirrhosis, hepatitis, and steatosis. 2D fingerprints and molecular descriptors were calculated from drugs and their metabolite structures to find significant structural features from both chemical spaces since DILI is caused by drugs or their reactive metabolites. Due to uncertainty in DILI labeling, we curated the DILI label on drugs by integrating DILIrank labels and post-market reports on drugs. Drug metabolites were labeled according to the label of their parent drugs based on the assumption that drug metabolites produced from DILI-positive drugs also caused DILI and drug metabolites from DILI-negative drugs did not. Models were developed and analyzed according to OECD guidelines for QSAR validation. Sufficient performances were observed from the models in internal and external validation results. In particular, improved performance in models built with drug metabolite sets implied that consideration of drug metabolism was significant in DILI prediction. Privileged substructure analysis identified frequent substructures in DILI-positive sets *V N , variation of the number of fragments in DILI-positive. It is calculated by substracting the number of fragment in DILI-positive drug sets from that in DILI-positive drug metabolite data sets. *V F , variation of the frequency of fragments in DILI-positive. It is calculated by substracting the frequency of fragment in DILI-positive drug sets from that in DILI-positive drug metabolite data sets. *DICH, drug-induced cholestasis; DIC, drug-induced cirrhosis; DIH, drug-induced hepatitis; DIS, drug-induced steatosis. and drug metabolite sets that could be responsible for DILI induction or metabolic activation of drugs.

DATA AVAILABILITY STATEMENT
The datasets analyzed in this article are not publicly available.
Requests to access the datasets should be directed to HS, hyunkil.shin@kitox.re.kr.

AUTHOR CONTRIBUTIONS
HS designed the study, which was validated by SY. TP collected drug metabolite structure and preprocessed data sets. DP analyzed chemical space, feature space, and privileged structures. Model development and validation, and applicability domain analysis were conducted by HS and M-GK. Manuscript was prepared and edited by HS, M-GK, DP, TP, and SY. All the authors approved the final version of the manuscript.

ACKNOWLEDGMENTS
MarvinSketch was used for drawing and visualizing chemical structures and substructures. It was also used to convert molecular structures from SMILES code to a 2D structure and from a 2D molecular fragment structure to a SMARTS pattern, MarvinSketch 18.16, ChemAxon (https://www.chemaxon.com). This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (Project No. NRF-2019R1F1A1061955). This study was carried out with the support of the R&D Program for Forest Science Technology (Project No. 2017030A00-1919-BA01) provided by Korea Forest Service (Korea Forestry Promotion Institute).