Deep Learning Prediction of Adverse Drug Reactions in Drug Discovery Using Open TG–GATEs and FAERS Databases

Machine learning techniques are being increasingly used in the analysis of clinical and omics data. This increase is primarily due to the advancements in Artificial intelligence (AI) and the build-up of health-related big data. In this paper we have aimed at estimating the likelihood of adverse drug reactions or events (ADRs) in the course of drug discovery using various machine learning methods. We have also described a novel machine learning-based framework for predicting the likelihood of ADRs. Our framework combines two distinct datasets, drug-induced gene expression profiles from Open TG–GATEs (Toxicogenomics Project–Genomics Assisted Toxicity Evaluation Systems) and ADR occurrence information from FAERS (FDA [Food and Drug Administration] Adverse Events Reporting System) database, and can be applied to many different ADRs. It incorporates data filtering and cleaning as well as feature selection and hyperparameters fine tuning. Using this framework with Deep Neural Networks (DNN), we built a total of 14 predictive models with a mean validation accuracy of 89.4%, indicating that our approach successfully and consistently predicted ADRs for a wide range of drugs. As case studies, we have investigated the performances of our prediction models in the context of Duodenal ulcer and Hepatitis fulminant, highlighting mechanistic insights into those ADRs. We have generated predictive models to help to assess the likelihood of ADRs in testing novel pharmaceutical compounds. We believe that our findings offer a promising approach for ADR prediction and will be useful for researchers in drug discovery.


Introduction
Adverse drug reaction (ADR) or event is defined as any unintended or undesired effect of a drug [1,2].ADRs are responsible for a high number of visits to emergency departments and in-hospital admissions.The Japan Adverse Drug Events (JADE) study reported around 17 adverse drug events per 1000 patient days.1.6% were fatal, 4.9% were life-threatening, and 33% were serious [3]; these observations underscore the importance of toxicity assessment of any medication, especially in the early stages of drug discovery.
In-silico approaches to ADR prediction have been promising in terms of accuracy and underlying mechanisms understanding.Researchers have variously utilized multiple data types ranging from chemical structures to literature mining to predict ADRs [4].
Machine learning methods can play a significant role in the interpretation of various data types to predict ADRs.Deep learning [5], a subset of machine learning in Artificial intelligence (AI), has emerged as a promising and highly effective approach that can combine and interrogate diverse biological data types to generate new hypotheses.Deep learning methods are used extensively in the field of drug discovery and drug repurposing; however, their applications in ADR prediction are rather limited [6,7,8].
Open TG-GATEs [9] is a large-scale toxicogenomics database that collects gene expression profiles of in vivo as well as in vitro samples that have been treated with various drugs.These expression profiles are an outcome of the Japanese Toxicogenomics Project [10], which aimed to build an extensive database of drug toxicities for drug discovery.It also collects physiological, biochemical, and pathological measurements of the treated animals.Similar databases that aim to profile compund toxicities have also been developed [11,12].
In contrast with other databases, such as (LINCS) [13], which have been used to predict multiple ADRs in a single study [14], Open TG-GATEs has been used to investigate individual/specific toxicities [15].To the best of our knowledge, no multiple ADR predictions have been attempted using Open TG-GATEs.
The design of Open TG-Gates has several advantages over the LINCS database, chiefly the inclusion of in vivo samples with different doses and durations of administration.Therefore, we designed our analysis to encompass multiple samples with different dosages and duration for each compound, necessitating additional noise-removal steps in the data set.
In this study, we describe our deep learning-based, systematic ADR prediction models that combine ADR occurrence data, including frequency details, from the FAERS (FDA Adverse Event Reporting System) database, with the gene expression profiles from Open TG-GATEs.We aim to improve the models' performance by applying recently developed feature selection and hyper-parameter optimization algorithms.The methodologies and models described in our study offer useful tools for assessing the likelihood of ADRs and for incorporation into new drug development pipelines.

Overview
An overview of this study's methodology is shown in Figure : 1.First, we retrieved the relevant data from the above-described two databases (open TG-GATEs and FAERS).We pre-processed the gene expression data to filter out noisy profiles by using a simple classification model, and retained only the significant ADR-drug associations (p¡0.05;Fisher's exact test).Next, gene expression profile datasets were created by assigning positive and negative compounds for each ADR.We split the datasets into training and validation sets five times.We used the training set data to perform feature selection and build deep neural network models with hyper-parameter tuning using the Optuna package (see below).Finally, we evaluated the performances of the individual models on the validation set.We discuss these steps in detail below: 2.2.Data retrieval and processing: 2.2.1.Open TG-GATEs database: We extracted the in-vivo gene expression profiles of rat liver samples from the Open TG-GATEs database [9,10].We selected the rat in-vivo data for our analysis chiefly because the in-vivo dataset included more compounds and a greater number of time points as compared with the in-vitro data (rat and human).However, our methodology can be easily extended to the other datasets.
This dataset was comprised of single-dose experiments and repeated-dose experiments.Single-dose experiments included injection-to-sacrifice periods of 3, 6, 9, or 24 hours, whereas, in the repeated dose experiments, rats were injected once daily for 4, 8, 15, or 29 days.In the repeated-dose experiments, all rats were sacrificed 24 hours after the last dose [9].In Open TG-GATEs, gene expression profiles were measured using Microarray technology (Affymetrix GeneChip).
The Affymetrix CEL files were downloaded from http://toxico.nibiohn.go.jp, and were preprocessed using the affy package [16] from R Bioconductor (https: //bioconductor.org/);Affymetrix Microarray Suite algorithm version 5 (mas5) was applied with the default parameters provided in affy, wherein normalization = TRUE.The resulting normalized dataset -hereafter referred to as "the raw dataset"was used for all the subsequent analyses.Next, the fold change values were calculated for each probe set by dividing the raw dataset by the mean intensities of corresponding control samples; these values were then log2 transformed, hereafter referred to as the "log2FC dataset".Since the experiment design contained multiple dosages and durations of exposure, the drugs had varied effects on the gene expression profiles.To reduce the noise, we clustered all the samples to classify them as either treated or control using general linear modeling with Lasso penalty machine learning model (GLMNET package from R [17].).We used the whole raw dataset as the training set with a two-class classification (treated and control).We fed through all the microarray data of the same duration to a single model, creating one model for each exposure set duration.Next, we estimated the probability of being classified as a treated sample for all the training sets.Only those samples with a probability of higher than 92% were included in our analysis.The remaining samples were considered to fall within the gray zone between treated and control, and they were discarded.We chose a cut-off of 92% because, at this threshold, no control samples were misclassified as treated.

Standardized FAERS data:
FAERS (FDA Adverse Event Reporting System) is "a database that collects adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to FDA" (https://open.fda.gov/data/faers/).Since the terms used in the FAERS database are left to the reporter to decide, inaccurate descriptions can often be incorporated, such as using general, vague terms to describe adverse events or treatments [18].To surmount this issue, we used the portion of the FAERS dataset standardized by Banda et al. [19].They had curated and standardized the entries of the FAERS database for 11 years (2004 to 2015) following Medical Dictionary for Regulatory Activities (MedDRA) terms [20].
We extracted all the compound-ADR combinations (70,553,900) from the total number of reports (4.8 million).Among the difficulties of using the FAERS database in ADRs' prediction models is the presence of reports with multiple drugs used (Multipharma), which is expected in patients with chronic diseases.Such cases introduce unreliable associations added to the data noise.To solve this issue, we used only the association in which the drug is assigned as the primary suspect (PS) (15,377,900).We calculated the number of reports for each compound-adverse drug event combination and calculated the total number of reports of the compound in question and also the total number of reports of the adverse event.
Using this data, we performed the Fisher test [21] using "fisher.test"function from R with the parameter (alternative = "greater").This option returns a significant P-value only in the event of a positive association, in contrast to "two.sides" test, which assesses both positive and negative associations Table 1.
Table 1: Fisher exact test: a: the number of reports of that the compound cause the ADE, b: the number of reports of the compound that does not report the cause of ADE, c: the number of all positive reports of the ADE for all compounds other than the specific compound, e: the number of all negative reports of all compound other than the specific compound.

Model building and training
The training set included only treated samples (excluding all controls).For each model, we designated the compounds with the most significant associations as positive compounds, (p-value threshold < 0.05) and the least significantly associated compounds as negative compounds.We evenly balanced the number of positive and negative compounds for each model.
When assembling the training and validation sets, we imposed two criteria: 1) the data-sets were balanced, i.e., the number of positive and negative samples were equal in both the sets and; 2) no compounds were commonly shared between training and validation.Instead of following a standardized cross-validation method, we created five models for each side effect by exploring all combinations of compounds in training and validation sets.However, we kept only those that comply with the previous conditions.
To prevent information leakage between the validation set and the training set, feature selection was performed using the training set data only, validation set data was not involved at all.Moreover, the hyper-parameter optimization was also done using the information from the training set only, and validation set accuracy was only used to identify the best model.
For feature selection, we used Boruta [22] implementation in Python https:// github.com/scikit-learn-contrib/boruta_py, that is based on the random forest classifier from scikit-learn [23] python package with default parameters.Boruta algorithm utilizes the random forest classifier variable importance feature.To remove the effect of randomness and improve the accuracy of important feature detection, Boruta creates extra shadow variables by shuffling the values of the original features, then the feature importance is measured.Features with significantly higher importance than the shadow variables are considered important, while those with less importance are considered less important.This procedure was repeated 100 times to detect the important features more accurately.[22] Then we used TensorFlow 2 [24] to construct deep learning models.The input of the deep learning model is constructed of two dimensional matrix with samples in rows and genes on columns.Each model consists of three groups of layers, input, output, and hidden layers.We applied Optuna [25] for hyperparameters tuning.Optuna uses the trial and error method for optimization, by randomly assigning values to the model hyperparameters from a range of values or choices offered by the author for a pre-determined number of trials.

Input layer
• number of nodes is determined by the selected features

Hidden layers
• number of hidden layers are determined by Optuna optimizer; • number of nodes in each layer is also determined by Optuna optimizer; • between each layers there is drop layer with percentage of drop detremined by optuna Output layer  The maximum number of epochs was set at 800; however, we applied the "early stopping" strategy if the accuracy did not improve for 75 epochs.The best models were saved for each Optuna trial.The best parameters are shown in the Supplementary Table: S1.

Evaluation and enrichment analysis
Model performances were evaluated by testing the performance of the validation set prediction.We estimated the accuracy of the validation set and the area under the ROC (Receiver operating characteristic) curve using the scikit-learn [23] package from Python.
TargetMine data analysis platform was used for enrichment analysis and gene annotation [27].Databases used for enrichment analysis were KEGG, Reactome, and NCI.P-values were calculated in TargetMine using one-tailed Fisher's exact test.Multiple test correction was set to Benjamini Hochberg, with p-value significance threshold of 0.05.

Data processing
To reduce the data dispersion caused by multiple dose levels and injection durations (sacrifice period) in Open TG-GATEs, we filtered out low quality/unsuitable samples.To do that, we used Lasso to classify the samples to either treated or control classes.A total of 6619 of 10573 treated samples, chiefly belonging to the "Low" dose level category, were classified as controls and eventually excluded.Samples that were correctly classified as treated (3953 samples) remained for subsequent analysis.(Table 3).

Model building and training
We created a total of 14 models (Table : 4).The number of compounds used to create each model ranged from 10 to 18.We equalized the number of positive compounds and negative compounds to generate balanced models.

Model evaluation
The Average accuracy for all models was 85.71% [minimum=67.9%,and maxi-mum=100%, standard error=0.1].The validation accuracies of the models are shown in Figure 3.The area under the Receiver Operating Characteristic (ROC) curve is shown in Figure : 4.

Case study: Duodenal Ulcer
To highlight the effectiveness of our approach, we describe below our observations on the development of the duodenal ulcer ADR prediction model.Duodenal ulcer is a type of peptic ulcer disease characterized by the emergence of open sores on the duodenum's inner lining of the duodenum [28].It is mainly caused by the failure of gastrointestinal system inner coating protection, and the most common causing agents are Helicobacter Pylori infection and NSAIDs (Non-steroidal anti-inflammatory drugs).It can lead to serious bleeding or perforation.

Duodenal ulcer model description
Using the FAERS database, the six most significantly associated drugs with duodenal ulcer were identified using Fisher test (positive drugs: Aspirin, Diclofenac, Ibuprofen, Indomethacin, Meloxicam, Naproxen) and the least associated drugs were also identified (negative drugs: Acetaminophen, Amiodarone, Bortezomib, Carbamazepine, Ciprofloxacin, Cyclophosphamide).Subsequently, the open TG-Gates samples of these drugs were used to build the prediction models.Details of the compounds used, the number of the gene expression samples associated with these compounds, and the results of the Fisher test (see Methods) are shown in Table 5.The ROC curves and the area under them for the five models of duodenal ulcer (each trained on a different training set) are shown in Figure 5.The performance of these models showed that the   area under the curve ranges from 0.94 to 0.99.The number of the features (genes) commonly selected among the five duodenal ulcer models were 108.

Enrichment Analysis
Pathway enrichment analysis (see Methods) using the duodenal ulcer-selected features (Table: S2) clearly highlighted the involvement of bleeding cascade and complement function (Table: 6).
The manifestation of a duodenal ulcer activates the complement cascade, which is probably due to the inflammation caused by the acid effect on the intestinal mucosa.Bleeding is also linked with the duodenal ulcer disease.The enrichment of the Fatty acid degradation pathway is consistent with the fact that the majority of the compounds that cause duodenal ulcers are NSAIDs that inhibit Arachidonic acid metabolism, which is a part of the Fatty acid metabolism pathway.

Discussion
We have described a novel approach that combined toxicogenomics gene expression profiles extracted from Open TG-GATEs and ADRs reports extracted from FAERS to predict the likelihood of ADRs.This integration of two highly distinct data types allowed us to predict ADRs successfully.Moreover, it led to creating a novel dataset that associated drug-induced gene expression profiles with ADRs.
To overcome the significant challenges in combining the two datasets, we first sought to extract the individual drug-induced gene expression signature from Open TG-GATEs.Next, we extracted the ADR occurrence frequencies for these drugs and estimated their statistical significance to eventually combine the two datasets.
Moreover, due to multiple dose-levels and sacrifice periods and the presence of repeated and single injection events, the drug-induced gene expression profiles were fairly noisy.We generated a simple model to classify all the samples as either control or treated classes using Lasso to filter out this noise.We performed a rigorous statistical assessment to narrow down suitable samples for subsequent analyses (see Methods for details).
Recently, deep learning has gathered an increasing usage in the field of drug discovery [8,29].In this study we have used deep learning together with feature selection to reduce the data dimensionality and avoid overfitting due to limited samples.
Previously, Wang et al. [14] utilized multiple cell lines to develop predictive models for multiple ADRs.Another study [30] demonstrated that blood transcriptomics could be used to examine other organ toxicities.Our results have supported this notion by exhibiting robust prediction models with high accuracy using liver samples.Liver is a vital organ for drug metabolism and receives a significant amount of blood, and hence, it is widely used in drug toxicity studies.Moreover, even in the absence of pathological responses to the compound toxicity, cells still display differences in gene expression profiles.
We did not apply the leave-p-out or k-fold cross-validation protocols to the gene expression samples because: 1) the training and validation sets should be assigned based on compounds segregation, i.e., the same compound should not span training and validation sets, and 2) the number of samples differed from compound to compound and thus, creating balanced sets was impossible.We instead adopted the approach of creating five models with different training and validation sets.We also performed feature selection for each of the set combinations.
This study utilized in vivo gene expression data in contrast with another study [14] that utilized the data from the LINCS database [13], a collection of in vitro gene expression profiles from human cell lines.Our approach is easily applicable to other publicly available collections of toxicogenomics data, such as those from Drug Matrix [31].Another difference is that Wang et al. [14] combined chemical structure and Gene Ontology (GO) term associations in their models.They also selected only a single gene expression profile to represent the compound effects; in contrast, our method analyzed multiple samples with different doses and durations of each compound's exposure.Hence, our method is better equipped to account for the biological variations that are inherent in drug-induced physiological and phenotypic responses.Indeed, our models performed better than Wang et al. 's gene expression data-only models [14].This difference in performances may probably be attributed to our utilization of multiple samples for each compound.
Using the Optuna optimization package [25] made these prediction models' creation computationally expensive; hence, only a limited number of models were built.However, our approach can help generate models to serve specific applications using other data resources such as DrugMatrix, depending on the user's needs.
In conclusion, we have developed 14 Deep learning models to predict adverse drug events utilizing the public available Open TG-Gates and FAERS databases.These models can be used to examine if a new drug candidate can cause these side effects.Moreover, following the same feature selection and model building and tuning steps, other models can be created for other ADRs.

Declaration of Competing Interest
The authors state that there is no conflict of interest

Figure 3 :
Figure 3: Validation accuracy of the created models

Figure 4 :
Figure 4: Area under ROC curves for the created models.

Figure 5 :
Figure 5: Area under ROC curves for duodenal ulcer models.Each color corresponds to a different model.

Table 3 :
The number of Open TG-GATEs samples included in the analysis after clustering using Lasso, dose level and sacrifice period details are shown.

Table 4 :
The number of compounds, and samples used to create ADRs prediction models.(AGEP: Acute

Table 5 :
Details of the data set for the duodenal ulcer model (compounds, the number of the samples, the p-value of the Fisher test and the class in the training or test sets).Positive class compounds are those that can cause doudenal ulcer, while Negative class compounds are controls.Entries were ordered alphabetically

Table 6 :
The enrichment analysis results of duodenal ulcer model, showing the involvement of both complement and coagulation functions