# ARTIFICIAL INTELLIGENCE FOR TRANSLATIONAL PHARMACOLOGY

EDITED BY: Zhi-Liang Ji, Lixia Yao, Kartick Chandra Pramanik and Zhaohui John Cai

PUBLISHED IN: Frontiers in Pharmacology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-888-8 DOI 10.3389/978-2-88963-888-8

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# ARTIFICIAL INTELLIGENCE FOR TRANSLATIONAL PHARMACOLOGY

Topic Editors: Zhi-Liang Ji, Xiamen University, China; Lixia Yao, Mayo Clinic, United States; Kartick Chandra Pramanik, University of Pikeville, United States; Zhaohui John Cai, Celgene (United States), United States

Citation: Ji, Z.-L., Yao, L., Pramanik, K. C., Cai, Z. J., eds. (2020). Artificial Intelligence for Translational Pharmacology. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-888-8

# Table of Contents


*DeCoST: A New Approach in Drug Repurposing From Control System Theory*

Thanh M. Nguyen, Syed A. Muhammad, Sara Ibrahim, Lin Ma, Jinlei Guo, Baogang Bai and Bixin Zeng


Hsuen-Wen Chang, Min-Ju Wu, Zih-Miao Lin, Chueh-Yi Wang, Shu-Yun Cheng, Yen-Kuang Lin, Yen-Hung Chow, Hui-Ju Ch'ang and Vincent H. S. Chang

*Exploring the Mechanism of Flavonoids Through Systematic Bioinformatics Analysis*

Tianyi Qiu, Dingfeng Wu, LinLin Yang, Hao Ye, Qiming Wang, Zhiwei Cao and Kailin Tang

*A Hybrid Interpolation Weighted Collaborative Filtering Method for Anti-cancer Drug Response Prediction*

Lin Zhang, Xing Chen, Na-Na Guan, Hui Liu and Jian-Qiang Li

*Prediction of Potential Small Molecule-Associated MicroRNAs Using Graphlet Interaction*

Na-Na Guan, Ya-Zhou Sun, Zhong Ming, Jian-Qiang Li and Xing Chen


Cédric Bousquet, Julien Souvignet, Éric Sadou, Marie-Christine Jaulent and Gunnar Declerck

# Searching Synergistic Dose Combinations for Anticancer Drugs

Zuojing Yin, Zeliang Deng, Wenyan Zhao and Zhiwei Cao\*

Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China

Recent developments have enabled the use of synergistic drug combinations in treating a wide range of cancers. Because synergy is highly context-dependent, however, identifying successful combinations often requires screening combinational doses on different testing platforms in order to obtain the best anticancer effects. To facilitate the development of effective computational models, we review the latest strategies for searching optimal dose combinations from three perspectives: (1) mainly experimental-based approaches; (2) computational-guided experimental approaches; and (3) mainly computational-based approaches. In addition to introducing each strategy, we critically discuss their advantages and disadvantages, with a strong focus on current applications and future improvements.

#### Edited by:

Zhi-Liang Ji, Xiamen University, China

#### Reviewed by:

Qi Liu, Vanderbilt University Medical Center, United States Feng Zhu, Zhejiang University, China

> \*Correspondence: Zhiwei Cao zwcao@tongji.edu.cn

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 06 January 2018 Accepted: 03 May 2018 Published: 22 May 2018

#### Citation:

Yin Z, Deng Z, Zhao W and Cao Z (2018) Searching Synergistic Dose Combinations for Anticancer Drugs. Front. Pharmacol. 9:535. doi: 10.3389/fphar.2018.00535

Keywords: synergistic combination, optimized dose combination, computational model, feedback system control scheme, regression model

### INTRODUCTION

Combinational drugs are increasingly used in the clinic to treat various cancers. Compared with the traditional single-drug approach, combinational strategies often enhance therapeutic effects or delay drug resistance, and synergistic combinations are the most desired (Chou, 2006). The past few years have witnessed computational progress in analyzing and predicting synergistic components qualitatively (Han et al., 2017; Sarah, 2017; Sheng et al., 2017). However, the optimal dose of each component needs to be identified before a formula is applied clinically, as different dose combinations may lead to different effects even for the same formula (Tallarida and Raffa, 2010). To avoid potential adverse or antagonistic effects, large-scale experiments have to be run across a huge combinational space of drug concentrations, which is highly time-consuming and laborious. Thus, smart methods, whether experimental or theoretical, are urgently needed to facilitate synergistic drug design.

To date, the general experimental criteria for evaluating drug synergy mainly include the Loewe isobologram (Chevereau and Bollenbach, 2015), the combination index (CI) from the Median Effect Principle (Chou, 2010), the Bliss independence (BI) model (Bansal et al., 2014), and the Loewe Additivity (LA) model (Lee et al., 2007), among others. Under these criteria, substantial data have accumulated, which initiated computational efforts to predict the dose effects of drug combinations. Despite a few algorithmic and statistical methods (Calzolari et al., 2008; Deharo and Ginsburg, 2011; Caglar and Pal, 2014; Weiss et al., 2015a), constructing quantitative models to predict synergistic doses remains highly challenging for combinational therapy. To promote future improvements, we review the latest progress in this area, covering (1) mainly experimental-based approaches; (2) computational-guided experimental approaches; and (3) mainly computational-based approaches.
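As a brief illustration of how two of these criteria are evaluated in practice, the sketch below computes the Bliss independence expectation and a combination index for a single dose pair. The function names and all numeric values are hypothetical and chosen only for illustration; effects are assumed to be expressed as fractions between 0 and 1.

```python
def bliss_expected(effect_a: float, effect_b: float) -> float:
    """Expected fractional effect of A + B if the two drugs act independently."""
    return effect_a + effect_b - effect_a * effect_b


def bliss_excess(observed: float, effect_a: float, effect_b: float) -> float:
    """Observed minus expected effect; positive values suggest synergy."""
    return observed - bliss_expected(effect_a, effect_b)


def combination_index(dose_a: float, dose_b: float,
                      alone_a: float, alone_b: float) -> float:
    """Combination index at one effect level.

    dose_a, dose_b: doses used together to reach the effect.
    alone_a, alone_b: dose of each drug alone that reaches the same effect.
    CI < 1 suggests synergy, CI = 1 additivity, CI > 1 antagonism.
    """
    return dose_a / alone_a + dose_b / alone_b


# Hypothetical measurements, for illustration only.
expected = bliss_expected(0.4, 0.5)
excess = bliss_excess(0.85, 0.4, 0.5)
ci = combination_index(2.0, 5.0, alone_a=8.0, alone_b=20.0)
print(expected, excess, ci)
```

In this toy case the observed effect exceeds the Bliss expectation and the combination index falls below 1, so both criteria would point toward synergy.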

#### Mainly Experimental-Based Approach

Drug efficacy can normally be tested roughly via cell viability assays, such as the MTT assay, and in various animal models. However, experimentally exploring drug combinations at every dose ratio is unrealistic. High-throughput technologies and heuristic designs can save considerable time and experimental cost by purposely choosing promising candidate doses.

#### High Throughput Experimental Screening

To identify effective combinations of therapeutic compounds, Borisy et al. (2003) developed a high-throughput method and systematically screened ∼120,000 pairwise combinations for antifungal effects. The systematic testing began by defining the activity of each compound as a single agent in the assay system. Each active compound was then tested against all other compounds in dose matrices comprising six concentrations based on EC50. Finally, possible synergistic dose ratios between the drug pairs were detected. In this way, the study offered a practical approach to the systematic screening of compound combinations in disease-relevant phenotypic assays (Borisy et al., 2003). This method has also been proposed for detecting synergistic effects between constituents of natural products (Isgut et al., 2017).

Then, in 2007, a series of concentration ratios for each drug pair was tested on 10∼20 tumor cell lines via high-throughput screening technology (Mayer and Janoff, 2007). After analyzing the cytotoxicity curves for each pair, the authors found that certain dose ratios of combinational drugs can be synergistic, while other ratios of the same agents may be antagonistic (Mayer and Janoff, 2007). Interestingly, high-throughput screening has been applied to tumor organoid systems in recent years (Ivanov and Grabowska, 2017; Ivanov et al., 2017; Shahi Thakuri and Tavana, 2017). For instance, colon cancer spheroids, instead of traditional cell lines, were used to test drug synergy between 25 compounds under multiple IC50s (Shahi Thakuri and Tavana, 2017). A zebrafish animal model was also established for this purpose with the assistance of automated image analysis technology (Todd et al., 2017).

#### Fixed Dose Method

To avoid random high-throughput screening, the fixed dose/ratio method may serve as a starting point for exploration when no prior information is available. The doses may be set according to the maximum tolerated doses (MTD) and fractions of the MTD (Cao and Rustum, 2000; Azrak et al., 2004, 2007; Cao et al., 2005). As early as 2000, the synergistic effect of Irinotecan and 5-Fluorouracil was studied in a rat model of colon cancer at the MTD and at 12.5%, 50%, and 75% of the MTD (Cao and Rustum, 2000). Another study searched for synergistic effects among 200 pairs of antifungal drugs within a dose range from 0 to the minimal inhibitory concentration (MIC) in brewer's yeast (Cokol et al., 2011). It is worth noting that, besides the dose combination, the time interval, the treatment sequence, and even the pharmaceutical packaging may influence the effects of a drug combination (Azrak et al., 2007; Mohan et al., 2014).

Instead of fixing doses, some studies fixed dose ratios based on IC50 when prior information was unavailable (Hatakeyama et al., 2014; Zhang et al., 2014). Occasionally, the dose ratio may also start from 1:1 to explore the synergistic spectrum of different drugs in different cancer types (Liu et al., 2011).

#### Computational-Guided Experimental Approach

To avoid exhaustive searching of the dose combinational space, computer algorithms are often adopted as a feedback control that suggests the next round of experimental design based on preliminary experimental results. Current algorithms for this purpose mainly follow the feedback system control (FSC) scheme, which helps the search converge quickly in a huge space of multiple drugs at multiple doses. This scheme has been applied to identify the best dose combinations of multiple drugs in various cancers (Liu et al., 2015) and in viral infection (Wong et al., 2008).

The FSC procedure (Tsutsui et al., 2011; Liu et al., 2015; Weiss et al., 2015b) usually includes the following steps: (1) input a number of drugs (usually 5 to 10) with several doses (e.g., 0, IC25, IC50, IC75) for a specific disease; (2) combine all drugs and their doses to form a large search space; (3) randomly select a subset of combinations from this space and test them experimentally; (4) update the drug doses with the differential evolution (DE) algorithm; (5) repeat (3) and compare the latest experimental results with the previous ones; and (6) choose the better experimental results for the next iteration.
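To give a sense of scale for steps (2) and (3), the following sketch enumerates the search space for a small panel and draws a random first batch. The choice of six drugs, the batch size of 20, and the random seed are assumptions made for illustration; only the four dose levels come from the text.

```python
import itertools
import random

dose_levels = ["0", "IC25", "IC50", "IC75"]  # dose levels named in the text
n_drugs = 6                                   # assumed; the text gives 5 to 10

# Step (2): every combination assigns one dose level to each drug.
search_space = list(itertools.product(dose_levels, repeat=n_drugs))
print(len(search_space))  # 4 ** 6 = 4096 combinations

# Step (3): draw a random subset to test experimentally.
rng = random.Random(0)
first_round = rng.sample(search_space, k=20)
print(first_round[0])
```

With 10 drugs the same enumeration would exceed one million combinations, which is why an iterative, feedback-driven search is preferred over exhaustive testing.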

The heuristic DE algorithm (Tsutsui et al., 2011) is illustrated in **Figure 1**: (1) choose a drug-dose combination xji according to a random algorithm; (2) examine its effect through experiments, E(xji); (3) mutate the currently selected drug dose to vji; (4) cross over the current and mutated drug-dose combinations (xji × vji) to obtain a new drug-dose combination uji; (5) examine the effect of the new drug-dose combination, E(uji); (6) compare E(uji) with E(xji). If E(uji) > E(xji), uji replaces xji and enters the next iteration cycle.
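The mutate, cross over, and select cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published FSC implementation: the wet-lab readout is replaced by a hypothetical function `measure_effect` with a hidden optimum, doses are coded as level indices 0 to 3, and the population size, mutation factor `f`, and crossover rate `cr` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N_DRUGS, N_LEVELS = 5, 4              # dose levels coded 0..3 (e.g. 0, IC25, IC50, IC75)
OPTIMUM = np.array([2, 0, 3, 1, 2])   # hidden optimum of the toy assay (assumption)


def measure_effect(x):
    """Hypothetical stand-in for the experimental readout E(x); higher is better."""
    return -float(np.sum((x - OPTIMUM) ** 2))


def de_step(pop, scores, f=0.8, cr=0.7):
    """One DE generation: mutate, cross over, keep the better of x and u."""
    for i in range(len(pop)):
        others = [k for k in range(len(pop)) if k != i]
        a, b, c = pop[rng.choice(others, size=3, replace=False)]
        # Mutation: combine three other members, then snap to valid dose levels.
        v = np.clip(np.rint(a + f * (b - c)), 0, N_LEVELS - 1).astype(int)
        # Crossover: take each dose from v with probability cr (at least one).
        mask = rng.random(N_DRUGS) < cr
        mask[rng.integers(N_DRUGS)] = True
        u = np.where(mask, v, pop[i])
        # Selection: u replaces x only if its measured effect is better.
        e_u = measure_effect(u)
        if e_u > scores[i]:
            pop[i], scores[i] = u, e_u
    return pop, scores


pop = rng.integers(0, N_LEVELS, size=(12, N_DRUGS))
scores = np.array([measure_effect(x) for x in pop])
initial_best = scores.max()
for _ in range(30):
    pop, scores = de_step(pop, scores)
print(pop[scores.argmax()], scores.max())
```

In a real FSC loop each call to `measure_effect` would correspond to one experiment, so the number of evaluations per generation, not the computation itself, is the quantity to minimize.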

FSC has several advantages (Nowak-Sliwinska et al., 2016). First, it is phenotypically driven, simpler than genotype-driven methods, and does not require any mechanistic information. Second, it can converge quickly by using the DE algorithm. Nevertheless, the experimental testing is still substantial because all input drugs are treated equally in the combination. An improved version of FSC therefore incorporates a regression model to identify potentially synergistic drugs from the input list before searching for the optimized dose (Wang et al., 2015; Weiss et al., 2015a).

Recently, FSC was used to screen nanodiamond-modified drugs among 57 dose combinations, and a therapeutic dose window was proposed that could optimally inhibit cancer cell lines while protecting normal cell lines (Wang et al., 2015). Further applications of FSC can be found in prostate cancer and hepatocellular carcinoma (Mohd Abdul Rashid et al., 2015; Jia et al., 2017).

#### Mainly Computational-Based Approach

Apart from the above approaches, a few mathematical models have been constructed; they are summarized below.

#### Stochastic searching model

To minimize the search space for the optimal dose combination, a few stochastic search algorithms based on heuristic ideas have been reported recently (Calzolari et al., 2008; Caglar and Pal, 2014). **Figure 2A** shows an example of a stochastic search algorithm, with ideas similar to those of the stack sequential algorithm (Jeline, 1969). An alternative version of the **Figure 2A** tree (**Figure 2B**) eliminates nodes representing redundant drug-dose combinations. The stochastic search algorithm works as follows: within the search tree structure, the biological score is evaluated at the first level of the tree and the best single drug, Cbest, is extracted (Calzolari et al., 2008). Then, the biological scores of Cbest combined with each of the other drugs are measured and compared with that of Cbest to decide whether to move upward or downward. The current best combination is chosen for further searching of sub-nodes in pursuit of the globally optimal combinations. In this way, only one-third of the possible tests were actually performed in a Drosophila model with 4 drugs (Calzolari et al., 2008).
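The tree-descent idea can be sketched with a simplified greedy forward search. This is not the published algorithm, which can also move back up the tree; the drug names, the scoring table, and the `score` function are hypothetical stand-ins for a measured biological score.

```python
from itertools import combinations

DRUGS = ["A", "B", "C", "D"]

# Toy scoring table standing in for the measured biological score (assumption).
SINGLE = {"A": 0.30, "B": 0.45, "C": 0.20, "D": 0.10}
PAIR_BONUS = {frozenset("BC"): 0.25, frozenset("BD"): -0.05, frozenset("AB"): 0.05}


def score(combo):
    """Hypothetical score: best single-drug effect plus pairwise interaction terms."""
    base = max(SINGLE[d] for d in combo)
    bonus = sum(PAIR_BONUS.get(frozenset(pair), 0.0)
                for pair in combinations(sorted(combo), 2))
    return base + bonus


def greedy_search(drugs):
    """Descend the tree one level at a time, keeping only the best node."""
    evaluated = 0
    best, best_score = None, float("-inf")
    for d in drugs:                      # first level: single drugs
        evaluated += 1
        s = score((d,))
        if s > best_score:
            best, best_score = (d,), s
    while True:                          # deeper levels: extend the current best
        improved = False
        for cand in [best + (d,) for d in drugs if d not in best]:
            evaluated += 1
            s = score(cand)
            if s > best_score:
                next_best, best_score, improved = cand, s, True
        if not improved:
            break
        best = next_best
    return best, best_score, evaluated


best, best_score, evaluated = greedy_search(DRUGS)
print(best, best_score, evaluated, "of", 2 ** len(DRUGS) - 1, "possible combinations")
```

Because only the children of the current best node are scored, the number of evaluations grows far more slowly than the number of possible combinations, at the cost of possibly missing an optimum that lies off the greedy path.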


Meanwhile, a diversified stochastic search algorithm has recently been proposed to find optimal drug concentrations efficiently without prior normalization of the search space (Caglar and Pal, 2014). This algorithm is composed of an initial parallel part and an iterative part. The former is used to generate rudimentary knowledge of the search space, while the latter searches the space repeatedly to update knowledge of new hills that previous iterations could not locate. After a relatively small number of iterative steps, the optimized dose combination could be detected for antibacterial and anticancer effects (Caglar and Pal, 2014).

#### Statistical model

In addition to stochastic searching, statistical models have also been applied to screen for the optimal drug-dose combination based on cellular responses (Deharo and Ginsburg, 2011; Weiss et al., 2015a). The logistic regression model shown in equation (1) (Deharo and Ginsburg, 2011) was proposed to predict the EC50s of drugs alone and in combination. With it, the synergistic effects of six different ergosterol together with the pyrethroid in five selected dose ratios were detected (Deharo and Ginsburg, 2011).

$$f(x, (b, c, d, e)) = c + \frac{d - c}{1 + (x/e)^{b}} \tag{1}$$

f(x): drug effect; x: drug dose; b: a measure of the steepness of the curve at the dose equal to the ED50 value; c, d: the lower and upper asymptotes of the S-shaped curve; e: the ED50 value.
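The behavior of the four-parameter curve can be checked numerically with a short sketch. The parameter values below are arbitrary and serve only to show that the response equals the upper asymptote d at zero dose, reaches the midpoint between c and d at x = e, and approaches c at high dose when b is positive.

```python
import numpy as np


def log_logistic(x, b, c, d, e):
    """Four-parameter log-logistic curve: c + (d - c) / (1 + (x / e) ** b)."""
    x = np.asarray(x, dtype=float)
    return c + (d - c) / (1.0 + (x / e) ** b)


# Arbitrary illustrative parameters: steepness, lower and upper asymptote, ED50.
params = dict(b=2.0, c=0.05, d=1.0, e=1.0)

doses = np.array([0.0, 0.5, 1.0, 2.0, 10.0])
response = log_logistic(doses, **params)
for dose, resp in zip(doses, response):
    print(f"dose={dose:5.1f}  response={resp:.3f}")
```

Fitting the four parameters separately for each drug and for each fixed-ratio mixture yields the e values that are then compared to judge synergy.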

Unlike the logistic regression model, the second-order linear regression model screens for the optimal drug-dose combination by first shortlisting drugs that might produce synergistic effects (Chen et al., 2010; Xu et al., 2014; Weiss et al., 2015a; Silva et al., 2016). The approach mainly consists of the following steps: (1) establish a stepwise linear regression model describing the relationship between drug doses and effects; (2) select the drugs most likely to produce synergistic effects according to the model coefficients; (3) repeat the regression analysis on the drugs selected in (2); (4) determine the final optimal drug combination and dose ratio. Through several cycles, an optimal drug combination for inhibiting the viability of renal carcinoma cells was identified from an initial pool of 10 drugs with 4 doses each (Weiss et al., 2015a).

The second-order linear regression model (Weiss et al., 2015a) is shown in equation (2):

$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i=1}^{k} \sum_{j=i+1}^{k} \beta_{ij} x_i x_j + \varepsilon \tag{2}$$

y: the response variable (i.e., cell viability as percent of control); β<sub>0</sub>, β<sub>i</sub>, β<sub>ii</sub>, β<sub>ij</sub>: the intercept and the coefficients of the linear, quadratic, and bilinear terms, respectively; x<sub>i</sub>, x<sub>j</sub>: independent variables (i.e., drugs in the combination at the designed doses); ε: an error term.
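A second-order model of this form can be fitted by ordinary least squares once the linear, quadratic, and bilinear columns are assembled. The sketch below uses simulated data: the three drugs, the coded dose levels, the "true" coefficients, and the noise level are all assumptions for illustration, and plain least squares replaces the stepwise procedure described in the text.

```python
import itertools
import numpy as np


def design_matrix(X):
    """Columns: intercept, x_i, x_i^2, and x_i * x_j for i < j."""
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] ** 2 for i in range(k)]
    cols += [X[:, i] * X[:, j] for i, j in itertools.combinations(range(k), 2)]
    return np.column_stack(cols)


rng = np.random.default_rng(1)
levels = np.array([0.0, 1.0, 2.0, 3.0])      # coded dose levels (assumption)
X = rng.choice(levels, size=(60, 3))          # 60 simulated experiments, 3 drugs

# Simulated coefficients: [b0, b1, b2, b3, b11, b22, b33, b12, b13, b23].
true_beta = np.array([100.0, -8.0, -5.0, -2.0, 1.0, 0.5, 0.0, -3.0, 0.0, 0.0])
y = design_matrix(X) @ true_beta + rng.normal(0.0, 0.5, size=60)

beta_hat, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
print(np.round(beta_hat, 2))
```

When the response is cell viability, a clearly negative bilinear coefficient (here the simulated value for drugs 1 and 2) indicates that the pair lowers viability more than the single-drug terms predict, which is the kind of signal used to shortlist drugs in step (2).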

#### Multi-Scale Agent-Based Model

In recent years, multi-scale agent-based models have been established to evaluate synergistic dose ratios by controlling the fate of cells under different drug combinations (Wang et al., 2013; Qiao et al., 2015). Such a model simulates the growth process of tumor cells, including apoptosis, proliferation, and migration, based on specialized biological rules, in order to screen for the dose combination with maximal lethality. Furthermore, the model can not only describe the multicellular interaction system and microenvironment in cancer, but also detect synergistic doses with limited experimental data. Usually, the model is built from the effects of discrete dose combinations and then simulates continuous effects over a wide range of dose combinations. The fate of cells is usually described at the intracellular, intercellular, and tissue scales to illustrate the 'phenotypic' switches shown in **Figure 3** (Qiao et al., 2015), cell–cell interaction, and cell–microenvironment interaction, respectively (Wang et al., 2013; Qiao et al., 2015).

In 2015, this model was first used to choose optimal combinations that restore the balance between osteoclasts and osteoblasts while killing cancer cells in multiple myeloma (Qiao et al., 2015). Based on the pathogenesis, the behaviors of myeloma cells and the two normal cell types under the action of multiple cytokines and drug combinations were simulated. Ultimately, the optimal dose ratio of the combination was selected according to the simulation results.

In addition, artificial intelligence (AI) has begun to have an impact on the drug synergy area. Recently, Preuer et al. (2017) developed a novel deep learning method based on neural networks, termed DeepSynergy, to model drug synergy qualitatively using chemical and genomic information. This mechanism-free, data-driven method outperformed previous methods within the space of 38 drugs on 39 cell lines. However, DeepSynergy was not compared with other previously reported models, such as RACS (Sun et al., 2015) and the methods in the DREAM Challenge (Bansal et al., 2014). RACS, which is semi-supervised, mechanism-guided, and context-dependent and combines both genomic and network characteristics, showed a probability concordance of 0.78, compared with 0.61 for the best algorithm reported in the DREAM Challenge, within the space of 14 compounds on the OCI-Ly3 cell line. Further computational approaches for qualitatively identifying synergistic drug combinations are summarized by Sheng et al. (2017). AI methods have not yet been applied to quantitatively screening synergistic dose combinations, which is worth further exploration.

#### PERSPECTIVE

We have summarized the latest developments in searching synergistic dose combinations for anticancer drugs. The work accumulated so far has paved the way toward comprehensive predictive models of optimal dose combination. It should be noted that current search methods are still limited to local optimization, and more experimental results are needed to validate the computational models. Although this remains challenging, considering the following factors may contribute to more effective algorithms. For instance, cancer heterogeneity should be seriously considered in order to achieve better results. Meanwhile, considering the drug response of multiple cell types and tissues may minimize the potential side effects of combined drugs on normal tissues. This is particularly important when the drugs are administered at different times and in different orders. Coupled with future developments in AI and hardware, more concrete models are expected to assist clinical decisions on combinational drug dosage for cancer patients.

#### AUTHOR CONTRIBUTIONS

ZY collected the main papers and wrote the manuscript. ZD and WZ collected the related studies. ZC supervised the whole project and revised the manuscript. All authors read and approved the final manuscript.

#### FUNDING

This work has been supported by the Fundamental Research Funds for the Central Universities and National Natural Science Foundation of China (No. 31671379).

## REFERENCES



study combination of 5 prostate cancer drugs. Comput. Biol. Chem. 67, 234–243. doi: 10.1016/j.compbiolchem.2017.01.010



carcinoma cell line MGC803. J. Med. Food 17, 955–962. doi: 10.1089/jmf.2013.2967

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Yin, Deng, Zhao and Cao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# DeCoST: A New Approach in Drug Repurposing From Control System Theory

Thanh M. Nguyen<sup>1</sup> \*, Syed A. Muhammad<sup>2</sup> \*, Sara Ibrahim<sup>3</sup> , Lin Ma<sup>4</sup> , Jinlei Guo<sup>4</sup> , Baogang Bai <sup>4</sup> and Bixin Zeng<sup>5</sup> \*

*<sup>1</sup> Department of Computer and Information Science, Indiana University-Purdue University Indianapolis, Indianapolis, IN, United States, <sup>2</sup> Institute of Molecular Biology and Biotechnology, Bahauddin Zakariya University, Multan, Pakistan, <sup>3</sup> Department of Biology, School of Science, Indiana University-Purdue University Indianapolis, Indianapolis, IN, United States, <sup>4</sup> The 1st School of Medicine and School of Information and Engineering, Wenzhou Medical University, Zhejiang, China, <sup>5</sup> Institute of Lasers and Biomedical Photonics, Wenzhou Medical University, Wenzhou, China*

#### Edited by:

*Lixia Yao, Mayo Clinic, United States*

#### Reviewed by:

*Yanshan Wang, Mayo Clinic, United States Fuhai Li, The Ohio State University, United States Zhichao Liu, National Center for Toxicological Research (FDA), United States*

#### \*Correspondence:

*Thanh M. Nguyen thamnguy@iupui.edu Syed A. Muhammad aunmuhammad78@yahoo.com Bixin Zeng z\_bixin@163.com*

#### Specialty section:

*This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology*

Received: *15 March 2018* Accepted: *15 May 2018* Published: *05 June 2018*

#### Citation:

*Nguyen TM, Muhammad SA, Ibrahim S, Ma L, Guo J, Bai B and Zeng B (2018) DeCoST: A New Approach in Drug Repurposing From Control System Theory. Front. Pharmacol. 9:583. doi: 10.3389/fphar.2018.00583*

In this paper, we propose the DeCoST (Drug Repurposing from Control System Theory) framework, which applies the control system paradigm to drug repurposing. Drug repurposing has become one of the most active areas in pharmacology over the last decade. Compared with traditional drug development, drug repurposing may provide more systematic and significantly less expensive approaches to discovering new treatments for complex diseases. Although drug repurposing techniques have rapidly evolved from "one: disease-gene-drug" to "multi: gene, drug" and from "lazy guilt-by-association" to "systematic model-based pattern matching," the mathematical system and control paradigm has not been widely applied to model the system biology connectivity among drugs, genes, and diseases. Within this paradigm, our DeCoST framework, which is among the earliest approaches to drug repurposing based on control theory, applies biological and pharmaceutical knowledge to quantify rich connective data sources among drugs, genes, and diseases and to construct disease-specific mathematical models. We use the linear–quadratic regulator control technique to assess the therapeutic effect of a drug in disease-specific treatment. The DeCoST framework could distinguish FDA-approved drugs from rejected/withdrawn drugs, which is the foundation for applying DeCoST to recommend potentially new treatments. Applying DeCoST to Breast Cancer and Bladder Cancer, we reprofiled 8 promising candidate drugs for Breast Cancer ER+ (Erbitux, Flutamide, etc.), 2 drugs for Breast Cancer ER- (Daunorubicin and Donepezil), and 10 drugs for Bladder Cancer repurposing (Zafirlukast, Tenofovir, etc.).

Keywords: drug repurposing, system control, breast cancer, bladder cancer, pathway, expression profile

## INTRODUCTION

Drug repurposing (also called drug repositioning) has become one of the most active areas in pharmacology over the last decade (Oprea et al., 2011) because this approach could significantly reduce the cost and time needed to develop a new treatment. Before drug repurposing research became active, bringing a new drug to market was expected to take about 15 years and \$0.8–\$1 billion (Dimasi, 2001), owing to the many tests and clinical trials required for approval by the Food and Drug Administration (FDA) (USFDA, 2016). The probability of failure during clinical trials is estimated at about 91.4% (Thomas et al., 2016). One of the key reasons for low productivity in traditional drug development is the lack of systematic evaluation of additional indications (Dudley et al., 2011), which may lead to unexpected side effects and low efficacy. Briefly, drug repurposing finds new indications for known drugs and compounds (Gupta et al., 2013) to reduce the risk of failure and shorten the time of discovery. Drug repurposing applies modern computational techniques to digitized genomic (Power et al., 2014), bioinformatics, and chemical informatics data (Bisson, 2012) and to patients' individual health records (Xu et al., 2014), offering more systematic evaluation of a chemical compound before it enters laboratory testing and clinical trials. In addition, drug repurposing can explore the large set of known chemical compounds, estimated at more than 90 million by PubChem statistics (Wang et al., 2014), to reduce the cost of synthesizing new compounds. Prominent successful examples of drug repurposing include Viagra, Avastin, and Rituxan (Dudley et al., 2011).

System biology (Pujol et al., 2010) has played an important role in the evolution of drug repurposing from "one: disease-gene-drug" (Durrant et al., 2010) to "multi: gene, drug" (Chou, 2010; Medina-Franco et al., 2013) and from "lazy guilt-by-association" (Campillos et al., 2008; Keiser et al., 2009; Iorio et al., 2010; Gottlieb et al., 2011) to "systematic model-based pattern matching," such as the Broad Institute's Connectivity Maps (CMAP), C2MAP, etc. (Lamb et al., 2006; Hu and Agarwal, 2009; Huang et al., 2012; Jensen et al., 2012; Li and Lu, 2013; Subramanian et al., 2017). System biology reveals connectivity among drugs, genes, and diseases (**Figure 1**). In this figure, the green connections show the types of connectivity that drug repurposing can use to answer the key question: could drug A be re-indicated to treat disease B? The literature and public data sources for these types of connectivity have been thoroughly developed over the last two decades, such as DrugBank (Law et al., 2013) and SFINX (Andersson et al., 2015) for drug-drug interaction; DrugBank (Law et al., 2013) and STITCH (Kuhn et al., 2012) for drug-gene/protein interaction; BioGRID (Chatr-Aryamontri et al., 2013), STRING (Szklarczyk et al., 2015), HAPPI (Chen et al., 2017), KEGG (Kanehisa et al., 2017) and Reactome (Croft et al., 2011) for protein-protein interaction and human pathways; OMIM (Baxevanis, 2012) and GEO (Barrett et al., 2013) for disease-specific gene curation and analysis; the human disease network (Goh et al., 2007) for disease-disease connectivity; and SIDER for drug side-effect data (Kuhn et al., 2016). The integration of these rich data sources enables mathematical system modeling and analysis in system biology to deepen our understanding of, and predictive capability for, biological processes, disease ontology (Hannon and Ruth, 2014; Goel and Richter-Dyn, 2016; Woodhead et al., 2016), and personalized medicine (Weston and Hood, 2004).

From the mathematical system-model-control point of view, there exists a mechanism regulating the gene expression profile. In the healthy condition, the gene expression stays in the stable equilibrium region such that **x**(t) = f (**x**(t−1)) ≈ **x**(t−1), where f indicates the expression-regulating mechanism computed from data integration, **x** stands for expression and t stands for time. In the disease state, the expression of critical genes strays outside the stable region. In this case, without a control (treatment), the expression will be unbounded. System control algorithms, such as linear control (Willems, 1971; Chen et al., 2016), nonlinear control (Bardi and Capuzzo-Dolcetta, 2008; Falcone and Ferretti, 2013), and adaptive neural networks (Rovithakis and Christodoulou, 1994; Tong et al., 2014), aim to find the sequence of control-treatments that optimally stabilizes the expression back to the original equilibrium point. By comparing the real drug treatments with the optimal control-treatment (also called hypo-treatment), we can evaluate the potential efficacy of a drug before it is repurposed.

However, applying mathematical system modeling and control in drug repurposing is still at a very early stage. There are three key challenges in applying the system control approach. First, it is difficult to quantify the gene expression and the real drug treatment, as there is very little literature discussing the "normal range" of each gene's expression. Second, constructing a comprehensive and accurate mathematical model to simulate gene expression change is complicated by the diversity of gene-gene interaction mechanisms, mutations, and undiscovered data. Third, biological systems are large in scale for system control: there may be hundreds to thousands of genes of interest in a specific disease or biological process.

In this paper, we propose DeCoST (Drug Repurposing from Control System Theory) to apply the control system paradigm to drug repurposing, with source code available at https://github.com/thamnguy/DeCoST. The DeCoST framework tackles the challenges above as follows. First, although we could not completely solve the "normal range" challenge, we discretized the gene expression and the connectivity data so that the control-system algorithm could be executed logically without the "normal range" impact. Second, to overcome the comprehensiveness challenge, we utilized biological and pharmaceutical knowledge and public data sources to quantify the drug-protein interactions and the disease-specific gene expression profile. We used comprehensive public protein-protein interaction databases to set up the mathematical model for the repurposing problem. Third, to reduce the complexity and high dimensionality of the repurposing problem, we applied the linear-quadratic-regulator method, which is practical in large-scale system control, to compute the hypo-treatment and evaluate the drug therapy. We apply DeCoST in Breast Cancer and Bladder Cancer case studies. Among cancers, Breast Cancer causes the largest number of deaths among women (Centers for Disease Control Prevention, 2013). Breast Cancer is also the most comprehensively studied cancer, with nearly 20 drugs approved by the Food and Drug Administration (FDA). In addition, Breast Cancer has many subtypes, which makes it ideal for personalized drug repurposing. In contrast, the FDA has approved only 4 drugs for Bladder Cancer treatment although Bladder Cancer is the fourth most commonly diagnosed cancer in the United States (American Cancer Society, 2017). Therefore, drug development in Bladder Cancer is still an open and attractive research area. Building on the good performance in classifying approved drugs vs. withdrawn drugs, we find 7 compounds that may be promising in the Breast Cancer ER-positive subtype, 3 compounds in the Breast Cancer ER-negative subtype and 10 compounds in Bladder Cancer for further in-vivo drug repurposing studies.

#### METHODS

We developed our drug repurposing framework from the system modeling and control point of view (**Figure 2**). The framework integrates three types of data. First, from the disease-specific expression profile, we quantified the expression as the system initial condition vector, where each vector element specifies whether the corresponding gene is overexpressed (red), underexpressed (green) or normally expressed (white). Second, from the protein-protein interaction database, we built the mathematical system model in order to apply the system-control algorithm. The red arrows denote activating interactions and the green arrows denote inhibiting interactions. Third, from the chemical-protein interaction data, we quantified the treatment vector for each drug for later ranking. Using the initial condition vector and the mathematical model, we computed the optimal hypo-treatment. By matching the pattern of the optimal hypo-treatment against the drugs' treatment vectors, we could rank the drugs and suggest repurposed drugs.

### Retrieve the Expression Profile as the Initial Condition Vector

We used the GEO2R service (https://www.ncbi.nlm.nih.gov/geo/geo2r/) to analyze the GEO datasets for the initial condition vector. The GEO2R service runs on the R 3.2.3 platform and utilizes the well-known bioinformatics packages Biobase 2.30.0 (Huber et al., 2015), GEOquery 2.40.0 (Davis and Meltzer, 2007), and limma 3.26.8 (Ritchie et al., 2015). In GEO2R's result, we filtered out genes whose adjusted p-values exceed 0.05. The filtered-out genes were marked with 0 in the initial condition vector. For genes whose adjusted p-values are less than 0.05, we used the sign of the base-2 logarithm of the fold change (logFC) in the initial condition vector. In other words, genes with logFC > 0, which implies that the genes are overexpressed in the disease condition, were marked by 1. Genes with logFC < 0, which implies that the genes are underexpressed in the disease condition, were marked by −1.
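The discretization rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the input layout (a dictionary of `(logFC, adjusted p-value)` pairs) and the gene labels are assumptions, and an adjusted p-value exactly equal to 0.05, which the text does not cover, is treated here as not significant.

```python
from typing import Dict, List, Sequence, Tuple


def discretize_expression(log_fc: float, adj_p: float, alpha: float = 0.05) -> int:
    """Map one differential-expression result to -1, 0, or +1."""
    # Assumption: adj_p == alpha is treated as "not significant".
    if adj_p >= alpha:
        return 0
    if log_fc > 0:
        return 1   # overexpressed in the disease condition
    if log_fc < 0:
        return -1  # underexpressed in the disease condition
    return 0


def initial_condition_vector(
    results: Dict[str, Tuple[float, float]],
    gene_order: Sequence[str],
    alpha: float = 0.05,
) -> List[int]:
    """Build x(0) in a fixed gene order; genes without a result are marked 0."""
    vector = []
    for gene in gene_order:
        if gene in results:
            log_fc, adj_p = results[gene]
            vector.append(discretize_expression(log_fc, adj_p, alpha))
        else:
            vector.append(0)
    return vector


# Hypothetical GEO2R-style output: gene -> (logFC, adjusted p-value).
example_results = {
    "GENE_A": (2.1, 0.001),
    "GENE_B": (-1.4, 0.020),
    "GENE_C": (0.8, 0.300),
}
x0 = initial_condition_vector(example_results, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"])
print(x0)  # [1, -1, 0, 0]
```

Keeping the gene order explicit matters because the same ordering has to index the interaction matrix and the drug treatment vectors.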

We chose the GSE10886 dataset for the expression profile in the Breast Cancer case study. GSE10886 is among the largest and most comprehensive Breast Cancer microarray datasets in GEO at the tissue level. After the latest update in January 2013, GSE10886 has 226 samples, including 97 ER-positive-subtype samples, 69 ER-negative-subtype samples, and 32 control samples. We chose the GSE31189 dataset for the Bladder Cancer expression profile. This dataset contains 52 cancer samples and 40 control samples.

### Build Disease-Specific Mathematical System Model From Interactome Data

Because the availability of public data sources for disease-specific pathway models differs, we built the disease-specific system models for Breast and Bladder Cancer differently. To avoid potential false positives, which are a well-known issue in predictive data sources, we preferred using pathway data to construct the mathematical model. For Breast Cancer, we searched the public curated pathway databases Reactome (Croft et al., 2011) and WikiPathways (Pico et al., 2008) for human disease pathways. In these databases, we only selected pathways where the disease name appears in the pathway's title or description. As a result, we found the Integrated Breast Cancer Pathway (Ibrahim et al., 2015) on WikiPathways. This pathway is among the most comprehensive human Breast Cancer pathways in the literature, covering 239 genes and 467 interactions. The pathway also integrates 24 Breast Cancer-related pathways, including several signaling networks. Full details of this pathway can be found in Supplemental Table 2. However, we could not find any pathway having more than 50 genes for Bladder Cancer, which implies low coverage. Therefore, for the Bladder Cancer model, we queried Bladder-Cancer-associated genes from PubMed Gene (https://www.ncbi.nlm.nih.gov/gene), one of the most comprehensive literature collections in biomedical and life sciences. To filter possible noise during the retrieval process, we used a specific query in the format: <Disease Name> AND "Homo sapiens"[porgn: \_\_txid9606]. After retrieving the Bladder-Cancer-associated genes, we converted the gene identifiers to UniProt Knowledge Base Reviewed identifiers (UniProt, 2013) to filter possible aliases. We queried the STRING database v10 (Szklarczyk et al., 2015), one of the most comprehensive interactome databases, to retrieve the interaction information among the candidate disease-specific proteins, especially the directionality and mechanism of the interactions. To filter out possibly noisy information, we limited the search results to interactions with a confidence score of at least 500. The STRING database covers 7 types of mechanism: activation, expression, inhibition, catalysis, ptmod, binding, and reaction.

After retrieving the disease-associated genes and interactions from the models above, we quantified the interactome to finalize the mathematical systems for these diseases. Among the interactions, activation and inhibition are the mechanisms with the clearest and most unambiguous impact and directionality. Thus, we quantified the activation mechanisms by +1 and the inhibition mechanisms by −1. We quantified all other mechanisms by the default value of 0. The results of this step can be represented by adjacency matrices, as shown in Supplemental Figure 1.
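As a rough illustration of this quantification step, the sketch below turns a list of directed interactions into a signed adjacency matrix. It is a minimal sketch under stated assumptions: the tuple layout `(source, target, mechanism, score)`, the protein labels, and the function name are hypothetical, and the row/column convention (row = acting protein, column = affected protein) follows the definition of the temporary matrix given later in Equation (4). If a protein pair carries both an activation and an inhibition record, the last one read wins here, which the paper does not specify.

```python
from typing import Dict, List, Sequence, Tuple

# Quantification described in the text: activation -> +1, inhibition -> -1,
# every other mechanism (expression, catalysis, ptmod, binding, reaction) -> 0.
MECHANISM_VALUE: Dict[str, int] = {"activation": 1, "inhibition": -1}

# (source protein, target protein, mechanism, confidence score) -- assumed layout.
Interaction = Tuple[str, str, str, int]


def build_adjacency(
    proteins: Sequence[str],
    interactions: Sequence[Interaction],
    min_score: int = 500,
) -> List[List[int]]:
    """Return the signed adjacency matrix as a list of rows."""
    index = {name: k for k, name in enumerate(proteins)}
    size = len(proteins)
    matrix = [[0] * size for _ in range(size)]
    for source, target, mechanism, score in interactions:
        if score < min_score:
            continue  # drop low-confidence interactions
        if source not in index or target not in index:
            continue  # ignore proteins outside the disease model
        value = MECHANISM_VALUE.get(mechanism, 0)
        if value != 0:
            matrix[index[source]][index[target]] = value
    return matrix


# Hypothetical three-protein model.
proteins = ["P1", "P2", "P3"]
interactions = [
    ("P1", "P2", "activation", 900),
    ("P2", "P3", "inhibition", 700),
    ("P3", "P1", "binding", 950),     # non-directional mechanism -> 0
    ("P1", "P3", "activation", 300),  # below the confidence threshold
]
adjacency = build_adjacency(proteins, interactions)
print(adjacency)  # [[0, 1, 0], [0, 0, -1], [0, 0, 0]]
```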

### Retrieve Chemical-Protein Interaction for Treatment Vector

For each disease, we curated the literature for two sets of drugs. The positive set, denoted by D1, includes all drugs that are approved for treatment by the Food and Drug Administration (FDA). The negative set, denoted by D2, includes drugs that were withdrawn from disease treatment, or withdrawn/terminated from disease-specific clinical trials, because of toxicity or lack of efficacy. We queried https://clinicaltrials.gov/ for clinical trial information. To avoid the complexity of multi-drug and multi-disease treatment, we ignored literature mentioning more than one drug or disease during curation. We also ignored biotech drugs, since this type of drug does not target the molecular level and it is therefore difficult to set up a treatment vector for them. **Table 1** summarizes the list of D1 and D2 drugs we curated for Breast Cancer and Bladder Cancer. For Breast Cancer, we found 16 D1 drugs and 7 D2 drugs. In addition, to examine possible new therapeutic drugs for Breast Cancer, we took the 24 drugs proposed by Huang et al. (2011) as D3; these drugs have been approved for other diseases but have never been in trials for Breast Cancer. For Bladder Cancer, we found 3 D1 drugs and 2 D2 drugs. Since we could not find any repurposed drug list for Bladder Cancer in the literature, we selected all of the 421 FDA-approved drugs for non-Bladder-Cancer diseases that have at least one drug-gene interaction with genes in the Bladder Cancer model as D3 for Bladder Cancer. The entire D3 drug lists for both Breast Cancer and Bladder Cancer can be found in Supplemental Table 1.

*D1, FDA-approved drugs (positive/good drug set); D2, FDA-rejected/withdrawn drugs (negative/bad drug set).*

We queried the DrugBank (Law et al., 2013) and DMAP (Huang et al., 2015) databases for the list of drug-protein interaction mechanisms. DMAP and DrugBank cover 38 mechanisms of drug action. In DMAP, we filtered out interactions with a confidence score of less than 800 (out of 1,000) to avoid noisy information. From biological knowledge, we quantified these mechanisms as shown in **Table 2**. Similar to the quantification of the protein-protein mechanisms of action, an inhibiting or similar action is mapped to −1, and an activating or similar action is mapped to +1.
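The mapping from drug-protein mechanisms to a treatment vector might look like the following sketch. The four action labels and their values are a hypothetical stand-in for Table 2, which lists the full quantification; the input layout, the gene labels, and the function name are likewise assumptions made only for illustration. A score of `None` marks a record without a confidence score, which is kept.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical subset of a mechanism-of-action quantification:
# inhibiting-like actions -> -1, activating-like actions -> +1, others -> 0.
ACTION_VALUE: Dict[str, int] = {
    "inhibitor": -1,
    "antagonist": -1,
    "activator": 1,
    "agonist": 1,
}

# (target gene, mechanism of action, confidence score or None) -- assumed layout.
DrugTarget = Tuple[str, str, Optional[int]]


def treatment_vector(
    gene_order: Sequence[str],
    drug_targets: Sequence[DrugTarget],
    min_score: int = 800,
) -> List[int]:
    """Quantify one drug as a vector over the genes of the disease model."""
    index = {gene: k for k, gene in enumerate(gene_order)}
    vector = [0] * len(gene_order)
    for gene, action, score in drug_targets:
        if gene not in index:
            continue  # target is not part of the disease model
        if score is not None and score < min_score:
            continue  # drop low-confidence records
        vector[index[gene]] = ACTION_VALUE.get(action.lower(), 0)
    return vector


# Hypothetical drug with three annotated targets.
genes = ["GENE_A", "GENE_B", "GENE_C"]
targets = [
    ("GENE_A", "antagonist", None),
    ("GENE_B", "agonist", 900),
    ("GENE_C", "inhibitor", 650),  # below the confidence threshold
]
u_d = treatment_vector(genes, targets)
print(u_d)  # [-1, 1, 0]
```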

### Construct Disease-Specific Drugs' Therapeutic Scoring for Drug Repurposing Purpose

The key principle in applying system control to evaluate drug therapy relies on the following assumption: in the disease condition, the gene expressions are driven away from the balanced level of 0. Therefore, a good treatment should reverse the gene expressions in the disease condition and stabilize the expressions at the balanced level. In **Figure 2**, we illustrate this principle and explain several mathematical notations in a toy example. Based on the system biology literature (Alberghina, 2007), we assume that there exists a model governing the gene expressions, which allows us to model the expression from a time-series perspective

$$\mathbf{x}(t) = f(\mathbf{x}(t-1), \mathbf{u}(t-1))\tag{1}$$

where **x** ∈ ℝ<sup>N</sup> stands for the quantified gene expression of N genes, **u** ∈ ℝ<sup>N</sup> stands for the quantified treatment, t is the iteration, and f is an arbitrary function controlling the expression change. The initial **x**(0) is the quantified gene expression in the disease condition. In this paper, we choose a linear model for f.

$$\mathbf{x}(t) = \mathbf{A}\mathbf{x}(t-1) + \mathbf{u}(t-1) \tag{2}$$

TABLE 2 | Quantification of drug-protein mechanism of action in drug-protein interaction databases.

*The Mechanism of Action terminologies are retrieved from drug-target annotation in DrugBank database. Quantification stands for the numerical representation of the Mechanism of Action in the modeling and computing steps.*

We chose the linear model not only because it is simple but also because it has an equilibrium point at the origin: if **x**(t−1) = **u**(t−1) = 0 then **x**(t) = 0. This fact implies that when the gene expressions are already at the balanced level, treatment is no longer needed. In addition, it is easier to set up a linear system with stability (Chui and Chen, 2012)

$$\text{if } \|\mathbf{x}(0)\| < \varepsilon \text{ and } \mathbf{u} = \mathbf{0} \text{ then } \|\mathbf{x}(t)\| < \varepsilon \ \ \forall t \tag{3}$$

where ||**x**|| stands for the 2-norm (Euclidean norm) of **x** and ε is an arbitrarily small number. This fact implies the self-adjustment of the gene expression at the control level. We set up matrix **A** from the quantification of the protein-protein mechanisms of interaction (section Methods). Let the temporary matrix **A**<sup>∗</sup> be the result of that quantification:

$$\mathbf{A}^{*}(i,j) = \begin{cases} -1 & \text{if protein } i \text{ inhibits protein } j \\ 1 & \text{if protein } i \text{ activates protein } j \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Let λ be the eigenvalue of **A**<sup>∗</sup> with the largest magnitude. By setting up **A** as

$$\mathbf{A} = \left(1/\lambda\right) \mathbf{A}^{*} \tag{5}$$

we can guarantee the stability of system (2) (Chui and Chen, 2012).
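A short NumPy sketch of the scaling in Equation (5) is given below. It is an illustration under assumptions rather than the published implementation: because the largest-magnitude eigenvalue of a non-symmetric matrix can be negative or complex, this sketch divides by its magnitude |λ| so that the result stays real, and it returns the matrix unchanged when all eigenvalues are numerically zero. After the scaling, no eigenvalue of the result exceeds 1 in magnitude.

```python
import numpy as np


def scale_by_largest_eigenvalue(a_star: np.ndarray, tol: float = 1e-12) -> np.ndarray:
    """Divide A* by the magnitude of its largest-magnitude eigenvalue."""
    eigenvalues = np.linalg.eigvals(a_star)
    radius = float(np.max(np.abs(eigenvalues)))
    if radius < tol:
        # Assumption: nothing to scale when every eigenvalue is (numerically) zero.
        return a_star.astype(float).copy()
    return a_star / radius


# Hypothetical 2-protein example whose eigenvalues are +2 and -2.
a_star = np.array([[0.0, 2.0],
                   [2.0, 0.0]])
a = scale_by_largest_eigenvalue(a_star)
print(a)
print(np.max(np.abs(np.linalg.eigvals(a))))
```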

The objective of the linear control is to find a sequence of **u**(t) such that

$$\mathbf{x}(t) \to \mathbf{0} \text{ as } t \to \infty \tag{6}$$

Optimal control considers not only how quickly **x** is stabilized but also the cost-effectiveness of the treatment **u**. Accordingly, the optimal linear control aims to minimize

$$J(\mathbf{x}(0)) = \sum_{t=0}^{\infty} \left( \mathbf{x}(t)^T \mathbf{x}(t) + \mathbf{u}(t)^T \mathbf{u}(t) \right) \tag{7}$$

To solve the optimization problem (2–7) we solved the corresponding Riccati equation (Arnold and Laub, 1984)

$$\mathbf{A}^T \mathbf{P} \mathbf{A} - \mathbf{P} - \mathbf{A}^T \mathbf{P} \left(\mathbf{P} + \mathbf{I}\right)^{-1} \mathbf{P} \mathbf{A} + \mathbf{I} = \mathbf{0} \tag{8}$$

using the DARE algorithm (Arnold and Laub, 1984) in Matlab (https://www.mathworks.com/help/control/ref/dare.html). In (8), **P** is an intermediate result with no biological interpretation. We compute the treatment vector **u**(t) as follows

$$\mathbf{u}\left(t\right) = -\left(\mathbf{I} + \mathbf{P}\right)^{-1}\mathbf{P}\mathbf{A}\mathbf{x}(t)\tag{9}$$
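For readers without Matlab, Equations (8) and (9) can be sketched with SciPy, whose `scipy.linalg.solve_discrete_are(a, b, q, r)` solves the same discrete-time algebraic Riccati equation when the input, state-cost, and control-cost matrices are all set to the identity. This is a minimal sketch, not the authors' implementation; the function names and the small 3-gene matrix are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_discrete_are


def solve_riccati(a: np.ndarray) -> np.ndarray:
    """Solve Equation (8): the DARE with B = Q = R = I."""
    identity = np.eye(a.shape[0])
    return solve_discrete_are(a, identity, identity, identity)


def control(a: np.ndarray, p: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Equation (9): u = -(I + P)^{-1} P A x."""
    identity = np.eye(a.shape[0])
    return -np.linalg.solve(identity + p, p @ a @ x)


# Hypothetical 3-gene system matrix and disease-state expression vector.
a = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, -0.5],
              [0.5, 0.0, 0.0]])
x0 = np.array([1.0, -1.0, 0.0])

p = solve_riccati(a)
u0 = control(a, p, x0)  # first control vector

# Simulate the controlled system x(t) = A x(t-1) + u(t-1) for a few steps.
x = x0.copy()
for _ in range(20):
    x = a @ x + control(a, p, x)

print("u(0) =", u0)
print("norm of x after 20 steps =", np.linalg.norm(x))
```

Using `np.linalg.solve` instead of forming the inverse of **I** + **P** explicitly is a numerical convenience; it computes the same quantity as Equation (9).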

In system control practice, since **u**(t) often converges to 0 quickly (Bemporad et al., 2002), the first treatment vector **u**(0) = −(**I** + **P**)<sup>−1</sup>**PAx**(0) often plays the most important role in optimally stabilizing system (2). Therefore, we can consider **u**(0) as the optimal hypo-treatment. We measure the similarity between the real drug treatment (**u**<sub>d</sub>) and the hypo-treatment as the therapeutic score T<sub>d</sub> for each drug d as follows

$$T_d = \frac{\mathbf{u}_d^T \, \text{sign}(\mathbf{u}(0))}{\text{abs}(\mathbf{u}_d)^T \, \text{abs}(\text{sign}(\mathbf{u}(0)))} \tag{10}$$

where abs stands for the element-wise absolute value function. Here, T<sub>d</sub> ranges between −1 and 1. The numerator **u**<sub>d</sub><sup>T</sup> sign(**u**(0)) is the matching function between drug d and the optimal hypo-treatment, which is incremented when **u**<sub>d</sub>(i) and **u**(0)(i) are non-zero with the same sign, and decremented when **u**<sub>d</sub>(i) and **u**(0)(i) have opposite signs. We measured the discriminative power of the T<sub>d</sub> score by the receiver operating characteristic when we use T<sub>d</sub> to classify D1 drugs vs. D2 drugs.
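The therapeutic score can be written out in plain Python as below. This sketch follows the description of the numerator as a signed matching function, so that scores can be negative; the function names and the example vectors are hypothetical, and returning 0.0 when the drug has no target overlapping the non-zero entries of the hypo-treatment is an assumption, since the text does not define that case.

```python
from typing import Sequence


def sign(value: float) -> int:
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def therapeutic_score(drug_vector: Sequence[float], hypo: Sequence[float]) -> float:
    """Signed agreement between a drug's treatment vector and the hypo-treatment."""
    numerator = 0.0
    denominator = 0.0
    for u_d, u_0 in zip(drug_vector, hypo):
        s = sign(u_0)
        numerator += u_d * s              # +1 on agreement, -1 on opposition
        denominator += abs(u_d) * abs(s)  # counts overlapping non-zero entries
    if denominator == 0:
        return 0.0  # assumption: no overlap is scored as neutral
    return numerator / denominator


# Hypothetical hypo-treatment over four genes and three hypothetical drugs.
hypo = [-0.4, 0.2, 0.1, 0.0]
print(therapeutic_score([-1, 0, 1, 0], hypo))   # 1.0  (agrees on both targets)
print(therapeutic_score([1, 0, 0, 0], hypo))    # -1.0 (opposes on its only target)
print(therapeutic_score([-1, -1, 0, 1], hypo))  # 0.0  (one match, one mismatch)
```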

#### RESULTS

### Therapeutic Scores for Breast Cancer Drugs

From the Integrated Breast Cancer Pathway (Ibrahim et al., 2015) on Wikipathway (section Methods) and the Breast Cancer drug list in Supplemental Table 3, we queried 222 drug-protein interactions for the drugs' treatment vectors (Supplemental Table 4). Supplemental Table 5 contains the initial condition vector from GEO2R expression analysis.

**Figure 3** shows that the T<sub>d</sub> score gives an appropriate ranking for most of the well-known therapeutic drugs and suggests candidate drugs for repurposing in the Breast Cancer ER-positive case. The T<sub>d</sub> score reflects the difference between the D1 and D2 drugs with a receiver operating characteristic (Hanley and McNeil, 1982) area under the curve (AUC) of 0.76. This result is comparable to the overall result queried from the Broad Institute CMAP (Subramanian et al., 2017) on MCF-7, the Breast Cancer ER+ cell line, using the Touchstone tool (https://clue.io/touchstone). In particular, on the drugs covered in CMAP, DeCoST achieves an AUC of 0.91, which is much higher than the AUC achieved by CMAP (0.79), as shown in Supplemental Text 1. We did not set up a training set and a test set for classification because the model construction and the T<sub>d</sub> calculation do not need the drug categories. The T<sub>d</sub> scores for D1 drugs in the Breast Cancer ER-negative case are relatively lower than the scores in the ER-positive case (**Figure 4**). Details of the comparison are shown in Supplemental Table 5. Using T<sub>d</sub> to classify D1 and D2 drugs yields an AUC of 0.68. In fact, clinical trials and the literature have shown that several drugs are effective in ER-positive treatment but have little or no impact in ER-negative treatment. For example, Tamoxifen (T<sub>d</sub> ER-positive: 0.294, T<sub>d</sub> ER-negative: 0.176), a selective estrogen receptor modulator, does not prevent ER-negative Breast Cancer, in which the estrogen receptor genes are not expressed (Fabian, 2007; Uray and Brown, 2011).

### Therapeutic Scores for Bladder Cancer Drugs

Since we could not find any human pathway with sufficient coverage for Bladder Cancer, our Bladder Cancer system model retrieved the Bladder-Cancer-specific genes from the PubMed Gene server. The model contains 738 proteins and 1,241 protein-protein interactions. From 6 drugs in the Bladder Cancer case study, we retrieved 48 drug-protein interactions for the drugs' treatment vectors. From the GSE31189 gene expression dataset, we found 221 genes whose expression differs from the balanced level. Details about the Bladder Cancer system can be found in Supplemental Tables 6–8.

We observed an AUC of 1.0 (**Figure 5**) when we used the T<sub>d</sub> score to classify D1 vs. D2 drugs in Bladder Cancer. Here, all of the D1 drugs receive non-negative T<sub>d</sub> scores: Cisplatin receives a score of 0.2, Doxorubicin Hydrochloride receives a score of 0.0 and Thiotepa receives a score of 1.0. All of the D2 drugs receive negative T<sub>d</sub> scores: Mitomycin C receives a score of −0.2 and Gemcitabine receives a score of −0.09.
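The AUC reported here can be checked by hand from the five scores listed above. The sketch below uses the pairwise (Mann-Whitney) form of the AUC, that is, the fraction of (D1, D2) pairs in which the D1 drug scores higher, with ties counted as one half; the function name is hypothetical and this is not necessarily how the authors computed the curve.

```python
from typing import Sequence


def roc_auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Scores quoted in the text for the Bladder Cancer case study.
d1_scores = {"Cisplatin": 0.2, "Doxorubicin Hydrochloride": 0.0, "Thiotepa": 1.0}
d2_scores = {"Mitomycin C": -0.2, "Gemcitabine": -0.09}

auc = roc_auc(list(d1_scores.values()), list(d2_scores.values()))
print(auc)  # 1.0
```

Every D1 score is at least 0.0 and every D2 score is below 0.0, so all six pairs are ranked correctly, which gives the AUC of 1.0.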

### Potential Drugs for Breast Cancer Studies and Biological Insights

From the T<sub>d</sub> scores for D3 drugs, our framework suggests 8 drugs (Erbitux, Flutamide, Medrysone, Methylprednisolone, Norethindrone, Prednisolone, Prednisone, and Vandetanib) with high potential efficacy in Breast Cancer ER+ drug repurposing. Notably, these drugs do not directly target the estrogen receptor, which is the most well-known approach in Breast Cancer ER+ drug design. Tamoxifen is a typical example of a Breast Cancer drug that slows cancer progression by blocking estrogen hormone receptors, preventing hormones from binding to them. About 80% of all breast cancers are ER+: the cancer cells grow in response to the hormone estrogen (Bulut and Altundag, 2015). About 65% of the ER+ cases grow in response to another hormone, progesterone (Hefti et al., 2013). Tumors in ER/PR-positive cases are much more likely to respond to hormone therapy than tumors that are ER/PR-negative. ER+ breast cancer depends entirely on estrogen for growth and propagation, involving genomic and non-genomic pathways. Epidermal growth factor receptor (EGFR) is a receptor found on both normal and tumor cells that is important for cell growth (Herbst, 2004; Khoo et al., 2015). The ER-positive (ER+) drugs recommended for repurposing in this framework block the activities and growth of EGFR (**Figure 6A**). These drugs have different mechanisms of action with the common objective of inhibiting the growth of cancerous cells. By adjusting and modifying the known biases of the interactomic networks, our procedure would help to reveal the therapeutic effect of drugs along with effective treatments.

For the Breast Cancer ER- case, our framework suggests Daunorubicin and Donepezil as the repurposing candidates. These drugs are independent of estrogen and usually inhibit cell growth by either interacting with DNA or inhibiting cholinesterases. Daunorubicin interacts with DNA by intercalation and inhibition of macromolecular biosynthesis (Momparler et al., 1976). This inhibits the progression of the enzyme topoisomerase II, thereby stopping the process of replication. Donepezil belongs to the class of cholinesterase inhibitors and improves mental function and fatigue in cancer. The current research focused on recent large-scale efforts to systematically find repositioning candidates and elucidate individual disease mechanisms in cancer (Bruera et al., 2007). Personalized medicine and repositioning both aim to improve the productivity of current drug discovery pipelines. Standard drug discovery strategies can also lead to repositioning opportunities. D1, D2, and D3 drugs (**Table 1**) found to potently modulate the desired activity are repositioning candidates.

### Potential Drugs for Bladder Cancer Studies and Biological Insights

From the list of 143 FDA-approved drugs with high T<sub>d</sub> scores, we found 10 candidate drugs (with T<sub>d</sub> = 1) whose mechanisms are promising for Bladder Cancer repurposing. The T<sub>d</sub> scores for all Bladder Cancer drugs can be found in Supplemental Table 9. The prevalence of drug-repositioning studies has resulted in a variety of innovative computational methods for the identification of new opportunities for the use of old drugs. We sorted the list of potential drugs against bladder cancer. The reprofiling of these drugs followed the same biological mechanisms. For example, Zafirlukast antagonizes ATP-binding cassette and may improve the efficacy of anticancer effects (Sun et al., 2012). Similarly, Tenofovir may reduce the risk of bladder or other cancers, while the dopamine receptor antagonist Thioridazine inhibits tumor growth (Yin et al., 2015). Losartan is an angiotensin II receptor (AT-II-R) blocker that is widely used in humans for blood pressure regulation, but it also shows antitumor properties (Barreras and Gurk-Turner, 2003). Ciclopirox was first marketed in 1982 as an antifungal agent found in several topical drug products. However, further research demonstrated that it was able to kill bladder cancer cells (Weir et al., 2011). Atezolizumab, Cisplatin, Doxorubicin, Nivolumab, Opdivo, Thiotepa, and others (**Figure 6B**) are FDA-approved drugs that are recommended for bladder cancer.

#### DISCUSSION

Drug-repositioning studies have brought a variety of new in silico approaches to drug design and development. In most studies, the anticancer effect of newly designed drugs has usually been demonstrated in vitro, as clinical trials are very expensive and time consuming but remain the only way to validate drug efficacy in vivo. Therefore, establishing an accurate and effective drug-repositioning framework requires the development of new computational techniques. In this work, we discuss and demonstrate the application of control system theory as a computational method to evaluate drug efficacy and repurposing from integrated system biology data. The capability to classify approved vs. withdrawn drugs is the foundation of our framework for drug repurposing. It is important to note that although our AUCs of 0.76 and 0.68 in Breast Cancer are inferior to those of the state-of-the-art methods (Cheng et al., 2012; Zheng et al., 2015), our validation is conducted from the pharmaceutical knowledge of a drug's efficacy in treatment at the system-pathway level, whereas the other methods often validate at the targeted molecular level. In addition, we set strict criteria for the negative set by only choosing drugs that were rejected or withdrawn from disease-specific clinical trials and treatments. The state-of-the-art methods tend to be more relaxed about the negative set, choosing drugs that are not used as disease-specific drugs, which may limit the repurposing options. In addition, the appropriate assessment of Tamoxifen efficacy in Breast Cancer ER+ vs. Breast Cancer ER- highlights the potential advantages of our framework in personalized drug repurposing. Compared with the approved drugs, the candidate drugs suggested in this work show different promising drug mechanisms, which may be useful in future drug design.

In our work, although the number of targets may be among the key differences between the D1 drugs and the D2 drugs, our analysis shows that the number of a drug's targeted genes and the targeted genes themselves are not the only factors affecting the clinical outcome and the predictive results in drug repurposing. As shown in Supplemental Table 3, D1 drugs, on average, have more targets than D2 drugs. However, D1 drugs for Breast Cancer (average number of targets: 4.8) include both single-target drugs (such as Anastrozole, Exemestane, and Fluorouracil) and multi-target drugs (such as Tamoxifen, Paclitaxel, and Cycloheximide). D2 drugs (average number of targets: 3.3) also contain single-target drugs (such as Ixabepilone and Avastin) and multi-target drugs (such as Imetelstat and Diethylstilbestrol). In the Results section, DeCoST's evaluation of the drugs listed above is consistent with their clinical outcomes. In addition, drugs targeting the same marker genes do not necessarily have the same outcome. For example, both Tamoxifen and Diethylstilbestrol target the estrogen receptors ESR1 and ESR2, which are the markers in Breast Cancer ER+ (Yip and Rhodes, 2014). However, their clinical outcomes and DeCoST's evaluations are opposite, primarily because they have opposite mechanisms on the same estrogen receptor targets: Tamoxifen is an estrogen inhibitor while Diethylstilbestrol is an estrogen activator. Since Breast Cancer ER+ is strongly associated with the overexpression of estrogen receptors (Yip and Rhodes, 2014), Tamoxifen could have a therapeutic outcome because it reverses the disease signature. Meanwhile, Diethylstilbestrol should have a poor outcome because it acts in the same direction as the disease signature.

In this work, we have compared the results of DeCoST with those of the Broad Institute CMAP, which is among the most well-known and comprehensive platforms for drug repurposing. In addition, our repurposing strategy is similar to that of CMAP. Although Supplemental Text 1 shows that DeCoST has a higher AUC than CMAP does, it is inappropriate to conclude that DeCoST is better than CMAP. There are fundamental differences in how the experiments were conducted, which make the comparison not entirely solid. First, the expression profiles acquired by CMAP are at the cell line level, whereas in this work DeCoST acquires the expression profile at the tissue level, which is closer to in-vivo studies. Second, due to several factors in experimental design, CMAP does not contain cell lines for Breast Cancer ER- and Bladder Cancer. CMAP also covers fewer drugs than the drug list evaluated in this work. Therefore, the key point in a comparative evaluation should be the repurposing hypotheses suggested by these platforms for future in-vivo studies and the biological insights behind these hypotheses. In our results, we have offered several biological explanations of why the drugs recommended by DeCoST could be repurposed. Unfortunately, we could not compare CMAP and DeCoST on this point. DeCoST focuses primarily on recommending drugs that have never been in disease-specific clinical trials, whereas CMAP (https://clue.io/repurposing-app) primarily reports on drugs that have been in early phases of clinical trials. Therefore, we believe that DeCoST could provide complementary advantages in addition to CMAP.

The advantages of our framework are established not only by an advanced computational method but also by two layers of personalized systems (Li and Jones, 2012). In the first layer, the disease-specific gene expression could differ among patients and subtypes, which results in different initial state conditions. In the second layer, different types of disturbance among molecule-molecule interactions could be discovered and represented differently in the system modeling step. In our results, we show that Tamoxifen, which is approved to treat Breast Cancer, may not be effective in treating Breast Cancer ER-. The strong support in the literature for this evaluation is a good example of the framework's personalized medicine characteristics. In addition, our framework could easily integrate the results of many other state-of-the-art repurposing approaches, such as molecular docking and gene-set enrichment analysis, to refine the efficacy prediction. The main idea of this framework, which is based on control system theory, could be applied to many other bioinformatics problems, such as target prioritization and discovering new combinations of treatments. In addition, our framework could easily be extended to evaluate combinations of treatments, with careful preprocessing of the drug-drug interaction data (Ayvaz et al., 2015; Wang et al., 2017).

In addition, our framework shows repurposing capacity at both the target level and the pathway level. At the target level, we show typical examples for EGFR-targeted and ACHE-targeted drugs. Patients being considered for anti-EGFR therapy are often screened for mutations in the oncogene KRAS (Hoorens et al., 2010) because a constitutively active KRAS gene downstream of EGFR would not be affected by EGFR inhibition. Many diseases have approved combination regimens, such as metastatic colorectal and bladder cancer and its four-drug regimen of FOLFIRI (folinic acid, 5-fluorouracil, irinotecan) with cetuximab (Raoul et al., 2009). Losartan is an angiotensin II receptor (AT-II-R) blocker, and angiotensin-converting enzyme (ACE) inhibitors may have a protective role in bladder and other cancers (Yazdannejat et al., 2016). On the other hand, a typical example at the pathway level is Thioridazine. Thioridazine-induced effects are associated with inhibition of the canonical NFκB pathway.

The limitations of this work are the method used to quantify the categorical data from public genomic/proteomic databases and the simplicity of linear system control. First, all of the data are discretized into only three values: −1, 0, and 1, which could lower the resolution of the final drug therapeutic score. Second, the linear system control approach needs to assume that the gene expression transition can be closely approximated by a linear equation, which is still unverified due to the scarcity of time-series gene expression data. Therefore, when applying the framework to another repurposing problem, biologists and pharmacologists should apply deeper domain knowledge to increase the resolution of the discrete quantification. Furthermore, mathematical nonlinear system identification and reinforcement learning, which are popular approaches in unknown system control, could be used to increase the accuracy of the system modeling and make the system more personalized. Integration of other resources, such as drugs, genes, and systems associated with side effects (Kuhn et al., 2016; Maier et al., 2018) and high-throughput screening (Deftereos et al., 2011; Macarron et al., 2011), would also be a valuable expansion of this work in the future. Also, the computational complexity of DeCoST is generally high (expected O(n<sup>8</sup>), where n is the number of genes in the model). This complexity is manageable for most of the existing biological pathway models (expected to contain about 400 genes). However, it could become a bottleneck if the number of genes rises to several thousand.

In addition, the advantages of our framework in personalized medicine may be associated with reproducibility issues (Draghici et al., 2006; Frye et al., 2015). As mentioned, the disease-specific gene expression could differ among different patients and subtypes. Therefore, we cannot completely guarantee that applying our framework to different gene expression data and different interactome data sources (Chatr-Aryamontri et al., 2013; Szklarczyk et al., 2015) would return the same result. By reproducibility, we can only guarantee that, given a specific gene expression profile and an interactome data source, we always produce the same result. In this work, we have tried to tackle the reproducibility issue by using tight criteria to select the positive/negative drug set, by maintaining the relevance and coverage of the disease-specific model, and by choosing the expression data set with a high number of samples.

#### CONCLUSION

In this work, we have developed DeCoST, one of the first techniques from the system control paradigm, to tackle drug repurposing challenges. We showed that DeCoST could appropriately retrieve the clinical outcomes of drugs treating personalized Breast Cancer and Bladder Cancer. Based on the good retrieval results, DeCoST suggests repurposing 8 candidate drugs for Breast Cancer and 10 drugs for Bladder Cancer, with biological insights. This framework is promising for discovering new therapeutic strategies to treat other cancers.

#### AUTHOR CONTRIBUTIONS

TN designed the study (including the mathematical details), curated the Bladder Cancer drug dataset, and analyzed the computational results. SM validated and provided biological insights for the results. SI constructed the Breast Cancer pathway model and collected the drug clinical outcomes for Breast Cancer. LM processed the expression and protein-protein interaction data for Bladder Cancer. LM, JG, BB, and BZ implemented the system control algorithm used in the paper. All authors contributed to the writing and editing of the manuscript.

#### ACKNOWLEDGMENTS

The authors thank Dr. Jake Chen from the Informatics Institute, School of Medicine, University of Alabama at Birmingham, for helpful comments on the study design. The authors also thank Wenzhou Medical University, Zhejiang, China, for covering the publication cost of this paper.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.00583/full#supplementary-material

Andersson, M. L., Bottiger, Y., Bastholm-Rahmner, P., Ovesjo, M. L., Veg, A., and Eiermann, B. (2015). Evaluation of usage patterns and user perception of the drug-drug interaction database SFINX. Int. J. Med. Inform. 84, 327–333. doi: 10.1016/j.ijmedinf.2015.01.013

Arnold, W. F. III, and Laub, A. J. (1984). Generalized eigenproblem algorithms and software for algebraic Riccati equations. Proc. IEEE 72, 1746–1754. doi: 10.1109/PROC.1984.13083


networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452. doi: 10.1093/nar/gku1003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer YW and handling Editor declared their shared affiliation.

Copyright © 2018 Nguyen, Muhammad, Ibrahim, Ma, Guo, Bai and Zeng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Discovery of the Consistently Well-Performed Analysis Chain for SWATH-MS Based Pharmacoproteomic Quantification

Jianbo Fu<sup>1</sup>, Jing Tang<sup>1,2</sup>, Yunxia Wang<sup>1</sup>, Xuejiao Cui<sup>1,2</sup>, Qingxia Yang<sup>1,2</sup>, Jiajun Hong<sup>1</sup>, Xiaoxu Li<sup>1,2</sup>, Shuang Li<sup>1,2</sup>, Yuzong Chen<sup>3</sup>, Weiwei Xue<sup>2</sup> and Feng Zhu<sup>1,2</sup>\*

<sup>1</sup> College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China, <sup>2</sup> School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China, <sup>3</sup> Bioinformatics and Drug Design Group, Department of Pharmacy, Center for Computational Science and Engineering, National University of Singapore, Singapore, Singapore

Sequential windowed acquisition of all theoretical fragment ion mass spectra (SWATH-MS) has emerged as one of the most popular techniques for label-free proteome quantification in current pharmacoproteomic research. It provides more comprehensive detection and more accurate quantitation of proteins compared with traditional techniques. The performance of SWATH-MS is highly susceptible to the selection of processing method. To date, ≥27 methods (transformation, normalization, and missing-value imputation) have been sequentially applied to construct numerous analysis chains for SWATH-MS, but it is still not clear which analysis chain gives the optimal quantification performance. Herein, the performances of 560 analysis chains for quantifying pharmacoproteomic data were comprehensively assessed. Firstly, the most complete set of publicly available SWATH-MS based pharmacoproteomic data was collected by comprehensive literature review. Secondly, substantial variations among the performances of various analysis chains were observed, and the consistently well-performed analysis chains (CWPACs) across various datasets were generalized for the first time. Finally, the log and power transformations sequentially followed by the total ion current normalization were discovered to be among the best performed analysis chains for the quantification of SWATH-MS based pharmacoproteomic data. In sum, the CWPACs identified here provide important guidance for the quantification of proteomic data and could therefore facilitate cutting-edge research in any pharmacoproteomic study requiring the SWATH-MS technique.

Keywords: pharmacoproteomics, SWATH-MS, processing method, transformation, normalization

### INTRODUCTION

Pharmacoproteomics has been widely applied to various aspects of current pharmaceutical research by discovering disease-related genes (Mrozek et al., 2013; Quiros et al., 2017; Zeng et al., 2017) or new drug targets (Li et al., 2018; Saei et al., 2018), constructing pharmacology screening models (Hauser et al., 2005), and revealing drug mechanisms of action (Yue et al., 2016; Zhu et al., 2018), resistance (Paul et al., 2016), and toxicity (Tan et al., 2017; Wang et al., 2017b). Recent findings uncover its potential to fulfill the promise that pharmacogenomics has not yet accomplished (D'Alessandro and Zolla, 2010; Chambliss and Chan, 2016; Yang et al., 2016).

#### Edited by:

Zhi-Liang Ji, Xiamen University, China

#### Reviewed by:

Dariusz Mrozek, Silesian University of Technology, Poland Qing-Chuan Zheng, Jilin University, China

#### \*Correspondence:

Feng Zhu zhufeng@zju.edu.cn; zhufeng.ns@gmail.com

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 27 April 2018 Accepted: 05 June 2018 Published: 26 June 2018

#### Citation:

Fu J, Tang J, Wang Y, Cui X, Yang Q, Hong J, Li X, Li S, Chen Y, Xue W and Zhu F (2018) Discovery of the Consistently Well-Performed Analysis Chain for SWATH-MS Based Pharmacoproteomic Quantification. Front. Pharmacol. 9:681. doi: 10.3389/fphar.2018.00681


As a newly emerging technique (Anjo et al., 2017), the sequential windowed acquisition of all theoretical fragment ion mass spectra (SWATH-MS) has been reported to provide much more comprehensive detection and accurate quantitation of proteins compared to the traditional techniques used in pharmacoproteomic analyses (Zhu et al., 2008b; Tao et al., 2015; Aebersold and Mann, 2016; Li et al., 2016a; Anjo et al., 2017), and it thus becomes one of the most popular techniques for target discovery (Li et al., 2016b; Xu et al., 2016; Anjo et al., 2017), drug/lead quantification (Roemmelt et al., 2015) and identification (Scheidweiler et al., 2015; Wang et al., 2015; Aratyn-Schaus and Ramanathan, 2016; Li B. et al., 2017), construction of assay library for targeted proteomic analysis (Schubert et al., 2015), and quantitative protein profiling (Krasny et al., 2018) for recognizing drug-induced alterations (Roemmelt et al., 2015; Xue et al., 2016).

However, due to the interdependent nature among multiple acquisition parameters (dwell time, duty cycle, precursor isolation window width, and mass range), the protein quantification based on SWATH-MS is reported to be limited in dynamic range (Anjo et al., 2017) and in turn low in accuracy (Gillet et al., 2012; Huang et al., 2015; Shi et al., 2016; Yang et al., 2017; Xue et al., 2018b). The problems above can be even worse considering the innate complexity of clinical samples (Jamwal et al., 2017), small amount of proteins (Sajic et al., 2015), and low abundance of drug-metabolizing enzymes (Jamwal et al., 2017). To cope with these problems, a variety of popular quantification tools, including DIA-Umpire (Sajic et al., 2015), OpenSWATH (Rost et al., 2014), Skyline (MacLean et al., 2010), Spectronaut (Bruderer et al., 2015), and SWATH2.0 (Li S. et al., 2017), and dozens of subsequent processing methods (transformation, normalization, and missing-value imputation) are developed to enhance the accuracy of SWATH-MS (Navarro et al., 2016). Recent reports further reveal that SWATH-MS' accuracies depend heavily on the specific quantification tool/processing method used in a particular study (Navarro et al., 2016), and the protein quantification can significantly benefit from comparative benchmarking of the performance of these tools and methods (Gatto et al., 2016; Zheng et al., 2016). Therefore, it is urgently needed to assess the performances of tools/methods for discovering the optimal one(s) for SWATH-MS based pharmacoproteomic studies.

The performance of various quantification tools has already been systematically evaluated by benchmark SWATH-MS data (Navarro et al., 2016). Among those tools, only 2 (OpenSWATH and Skyline) are non-commercial ones, and OpenSWATH (Rost et al., 2014) is the most popular one used to quantify SWATH-MS based pharmacoproteomic data (Rost et al., 2014; Parker et al., 2015; Weisser and Choudhary, 2017). So far, ≥4 transformation, ≥15 normalization, and ≥6 missing-value imputation algorithms (Guo et al., 2015; Li et al., 2016c; Ori et al., 2016; Wu et al., 2016; Tan et al., 2017; Wang et al., 2017a) have been sequentially applied to process pharmacoproteomic data. Among these algorithms, four for normalizing label-free proteomic data have been assessed to identify the best performed one (Callister et al., 2006) and six for missing-value imputation have been evaluated to discover the one enhancing proteomic quantifications in the differential expression analysis (Valikangas et al., 2017). Appropriate integrations of the processing methods into a sequential analysis chain are reported to improve the quantification accuracies (Karpievitch et al., 2012; Chawade et al., 2015; Valikangas et al., 2017) with some chains identified as highly accurate in particular pharmacoproteomic studies (Guo et al., 2015; Ori et al., 2016; Tan et al., 2017; Zheng et al., 2017). For example, log transformation followed by median normalization performs well in identifying the therapeutic target/pathway for Down syndrome (Sullivan et al., 2017), endogenous toxins inducing the haploinsufficiency of tumor suppressor (Tan et al., 2017), and the biological mechanism underlying the role played by proteins in Alzheimer's disease (Khoonsari et al., 2016).
Since the processing methods are sequentially used to form the integrated analysis chain (Guo et al., 2015; Ori et al., 2016; Tan et al., 2017), any performance assessment aiming solely at transformation, normalization, or imputation may not be able to reflect the overall performance of the whole analysis chain. Considering the large number of possible analysis chains [560 in total, taking into account the non-transformation, non-normalization, and non-imputation options adopted by previous studies (Guo et al., 2015; Liu et al., 2015; Wu et al., 2016)] formed by randomly integrating those processing methods, it is therefore essential to comprehensively evaluate the performance of all analysis chains to identify the optimal one for a specific pharmacoproteomic dataset. However, no such analysis has been conducted yet.

In this study, the performances of all possible analysis chains integrating 4 transformation, 15 normalization, and 6 imputation algorithms were comprehensively assessed by their precisions based on the proteomes among replicates (Kuharev et al., 2015; Navarro et al., 2016; Chignell et al., 2018; Muller et al., 2018). Firstly, a systematic literature review on the popular quantification tool OpenSWATH yielded seven SWATH-MS based benchmark pharmacoproteomic datasets of varied sample sizes (from 6 to 116). To the best of our knowledge, these seven provided the most complete set of publicly available pharmacoproteomic data based on the SWATH-MS technique. Secondly, the performance of the analysis chains was assessed on each dataset. Thirdly, the analysis chains that consistently performed well across all datasets were identified for the first time and compared with the popular chains frequently applied in current pharmacoproteomic studies. Finally, the consistently well-performed analysis chains were further discussed based on their performances. The analysis chains identified in this study and the corresponding findings provided important guidance to current pharmacoproteomic studies.

### MATERIALS AND METHODS

### Collection of SWATH-MS Based Benchmark Pharmacoproteomic Datasets

A systematic literature review on the popular quantification tool OpenSWATH and an analysis of the datasets provided in the PRIDE database (Navarro et al., 2016) were collectively conducted to find SWATH-MS based benchmark pharmacoproteomic datasets. Firstly, the PRIDE database was searched using the keyword "SWATH-MS." Together with the literature review on the resulting projects, 85 projects were identified as based on SWATH-MS, among which 76 and 9 projects were acquired by TripleTOF 5600 and 6600 instruments, respectively. Secondly, several criteria were used to guarantee the availability and processability of the raw proteomic data: (1) a complete set of raw data files, (2) well-defined parameters (isolation scheme, range of retention time, and transition settings), (3) availability of a spectral library and protein database to search against, and (4) a clear description of sample groups. Applying these criteria to the resulting PRIDE projects yielded seven SWATH-MS based benchmark pharmacoproteomic datasets of varied sample sizes (**Table 1**), which covered both TripleTOF instruments (5600 and 6600) used in all 85 projects. Therefore, these datasets can be recognized as representative of SWATH-MS based pharmacoproteomic data. To the best of our knowledge, these datasets provided the most complete set of SWATH-MS based pharmacoproteomic data.

### Processing Methods for Data Transformation, Normalization, and Imputation

So far, ≥4 transformation, ≥15 normalization, and ≥6 missing-value imputation algorithms (Guo et al., 2015; Li et al., 2016c; Ori et al., 2016; Wu et al., 2016; Tan et al., 2017; Wang et al., 2017a) have been reported to be sequentially and frequently used to process pharmacoproteomic data. Based on our comprehensive literature review, their corresponding applications to current proteomic research were discussed in Supplementary Method S1. These 25 methods include 4 **transformation** methods: Box-Cox (Sakia, 1992), Cube Root (Wen et al., 2017), Log (De Livera et al., 2012), and Power (Zhang, 2014); 15 **normalization** methods: Auto Scaling (Kohl et al., 2012), Cyclic Loess (Zhu et al., 2012b), EigenMS (Zhu et al., 2009), Locally Weighted Scatterplot Smoothing (Wilson et al., 2003), Mean (Andjelkovic and Thompson, 2006), Median (Bolstad et al., 2003), Median Absolute Deviation (Matzke et al., 2011), Pareto (Zhu et al., 2010), Probabilistic Quotient (Dieterle et al., 2006), Quantile (Callister et al., 2006), Robust Linear Regression (Hong et al., 2016), Total Ion Current (Gaspari et al., 2016), Trimmed Mean of M Values (Lin et al., 2016), VSN (Huber et al., 2002), and Z-score (Cheadle et al., 2003); and 6 **imputation** methods: Background (Chai et al., 2014), Bayesian Principal (Chai et al., 2014), Censored (Valikangas et al., 2017), K-nearest Neighbor (Zhu et al., 2008a), Singular Value Decomposition (Alter et al., 2000), and Zero Imputation (Gan et al., 2006). As shown in Supplementary Method S1, due to their popularity in current pharmacoproteomic studies, these 25 methods were included, sequentially applied, and analyzed in this study. Each method was abbreviated by a three-letter code, as listed in Supplementary Table S1.
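The count of 560 possible analysis chains follows from combining the methods listed above with an additional "skip this step" option at each stage (5 × 16 × 7). The Python sketch below enumerates them using the full method names from the text. The label `"None"` for a skipped step and the function name `enumerate_chains` are illustrative assumptions; the three-letter codes of Supplementary Table S1 are not reproduced here.

```python
from itertools import product

TRANSFORMATIONS = ["Box-Cox", "Cube Root", "Log", "Power"]
NORMALIZATIONS = [
    "Auto Scaling", "Cyclic Loess", "EigenMS",
    "Locally Weighted Scatterplot Smoothing", "Mean", "Median",
    "Median Absolute Deviation", "Pareto", "Probabilistic Quotient",
    "Quantile", "Robust Linear Regression", "Total Ion Current",
    "Trimmed Mean of M Values", "VSN", "Z-score",
]
IMPUTATIONS = [
    "Background", "Bayesian Principal", "Censored", "K-nearest Neighbor",
    "Singular Value Decomposition", "Zero Imputation",
]


def enumerate_chains(skip_label="None"):
    """Return every (transformation, normalization, imputation) triple,
    allowing each step to be skipped (skip_label is a hypothetical marker)."""
    steps = [[skip_label] + methods
             for methods in (TRANSFORMATIONS, NORMALIZATIONS, IMPUTATIONS)]
    return list(product(*steps))


chains = enumerate_chains()
print(len(chains))  # 5 * 16 * 7 = 560
```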

### Assessing Analysis Chain Using the Precision Based on Proteomes Among Replicates

Diverse methods for proteomic data processing (transformation, normalization, and imputation) profoundly affected the precision of protein quantification, which was frequently assessed using the pooled intragroup median absolute deviation (PMAD) of reported protein intensity among replicates (Chawade et al., 2014; Kuharev et al., 2015; Valikangas et al., 2018; Yu et al., 2018). Particularly, the PMAD was designed to demonstrate the capacity of each analysis chain to reduce the variation among replicates, and therefore to enhance the technical reproducibility (Chawade et al., 2014). A lower PMAD value denoted more thorough removal of experimentally induced noise and indicated better precision of the corresponding analysis chain (Valikangas et al., 2018). So far, PMAD values in the ranges of ≤0.3, >0.3 & ≤0.7, and >0.7 were generally accepted as indicating superior, good, and poor precision, respectively (Chawade et al., 2014; Valikangas et al., 2018), and the PMAD had gradually become a popular metric for assessing the precision of processing methods in OMICs (Chawade et al., 2014; Valikangas et al., 2018).

All datasets were from the PRIDE database (Navarro et al., 2016). Each method in the analysis chain was abbreviated by a three-letter code as demonstrated in Supplementary Table S1, and ??? indicated that the corresponding method was not specified in the corresponding study of the dataset.
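As a minimal sketch of how a PMAD value might be computed, the Python code below takes the median absolute deviation of each protein across the replicates of a group, summarizes each group by the median over proteins, and averages over groups. This pooling scheme, the function names `mad` and `pmad`, and the toy intensities are assumptions made for illustration; the cited references should be consulted for the exact definition used in the study.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation of a list of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])


def pmad(intensities, groups):
    """Pooled intragroup MAD (one possible reading of the metric).

    intensities: dict mapping protein -> list of processed intensities,
                 one value per sample
    groups:      list of group labels, one per sample, in the same order
    """
    per_group = []
    for name in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == name]
        protein_mads = [mad([vals[i] for i in idx])
                        for vals in intensities.values()]
        per_group.append(median(protein_mads))
    return sum(per_group) / len(per_group)


# Hypothetical log-scale intensities: three replicates in each of two groups.
groups = ["ctrl", "ctrl", "ctrl", "drug", "drug", "drug"]
data = {
    "P1": [10.0, 10.2, 9.9, 12.0, 12.1, 12.4],
    "P2": [8.0, 8.0, 8.3, 7.0, 7.2, 7.1],
    "P3": [5.0, 5.5, 5.1, 5.0, 5.0, 5.9],
}
print(round(pmad(data, groups), 3))
```

For this toy input the value is about 0.1, which would fall in the "superior" range (≤0.3) quoted above.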

### Performance Assessment Among Various Analysis Chains by Hierarchical Clustering

The PMAD values of the 560 possible analysis chains across the seven benchmark datasets were first calculated. Fifty-one of these 560 analysis chains reported errors when processing at least one of the benchmark datasets. Therefore, hierarchical clustering of the remaining 509 analysis chains, for which all seven PMADs were calculable, was conducted to identify the relationship among the performances of various analysis chains. Particularly, the PMAD values of a specific analysis chain on the 7 datasets were used to form a 7-dimensional vector. Then, hierarchical clustering was applied to investigate the relationship among those 509 vectors, and therefore among the corresponding analysis chains. To measure the distance between any 2 vectors, the Euclidean distance was adopted, which is defined as:

$$\text{Euclidean distance}(a, b) = \sqrt{\sum_{i=1}^{n} \left(a_{i} - b_{i}\right)^{2}}$$

where i indexed the dimensions of the PMAD vectors a and b of two analysis chains, and n = 7 was the number of benchmark datasets. The clustering algorithm applied here was Ward's minimum variance algorithm (Barer and Harwood, 1999), which was designed to minimize the total within-cluster variance. Ward's minimum variance module in R package (Tippmann, 2015) was used. To visualize the hierarchical tree among those 509 analysis chains, the tree generator iTOL was used to generate and display the tree structure (Letunic and Bork, 2016).
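The clustering step can be illustrated with a small, dependency-free Python sketch. It is not the R module used in the study: it is a naive greedy implementation of Ward's criterion, in which the two clusters whose merge least increases the total within-cluster sum of squares are joined at each step. The function names and the toy 7-dimensional vectors are hypothetical.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * dist(centroid(a), centroid(b)) ** 2


def ward_clustering(vectors, n_clusters):
    """Greedy agglomerative clustering with Ward's criterion.
    Returns a list of clusters, each a sorted list of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([vectors[k] for k in clusters[i]],
                                 [vectors[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical PMAD vectors (one value per benchmark dataset) for six chains.
pmad_vectors = [[0.05] * 7, [0.08] * 7, [0.5] * 7,
                [0.55] * 7, [3.0] * 7, [2.8] * 7]
print(ward_clustering(pmad_vectors, 3))
```

For hundreds of chains, a library routine would be preferable to this cubic-time loop; for example, SciPy's `scipy.cluster.hierarchy.linkage` offers `method="ward"` with Euclidean distances.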

### RESULTS AND DISCUSSION

#### Ranking the Analysis Chains Based on Their Performances on Each Benchmark

The performance of each analysis chain on the seven SWATH-MS based benchmark datasets (**Table 1**) was assessed by measuring the corresponding PMAD values. As shown in **Figure 1**, the performances of 509 analysis chains (log<sub>10</sub> PMAD, Y-axis) with calculable PMAD values were measured and ranked (X-axis). Because some analysis chains may not be able to produce a PMAD value, there were slight variations in the number of analysis chains for different benchmark datasets (from 530 to 560). Taking the dataset shown in the center of **Figure 1** as an example (Nat Med. 21:407-13, 2015), a total of 558 analysis chains were assessed and ranked, and the performance of different analysis chains varied significantly (PMAD from 1.8 × 10<sup>−15</sup> to 2.0 × 10<sup>5</sup>). With reference to the frequently adopted cutoff (PMAD = 0.7) for differentiating the analysis chains of good and poor precision (Chawade et al., 2014; Valikangas et al., 2018), 203 (36.4%) of these 558 analysis chains were ranked as well-performed. Similar to this dataset (Nat Med. 21:407-13, 2015), the performance of different analysis chains for the other datasets also differed substantially (PMAD from 1.7 × 10<sup>−16</sup> to 3.4 × 10<sup>5</sup>), with 38.8%∼49.7% of the analysis chains ranked as well-performed.

The specific analysis chains for each benchmark dataset adopted in the corresponding original studies were identified by literature review (**Table 1**). Particularly, 4 of these datasets had a clearly defined analysis chain (LOG-QUA-NON, LOG-MED-NON, LOG-QUA-NON, and LOG-MED-NON for PXD003278, PXD006106, PXD000672, and PXD004880, respectively), while the remaining 3 datasets had incomplete information on the adopted analysis chain (LOG-MED-???, LOG-???-???, and ???-RLR-BAK for PXD002952, PXD003972, and PXD001064, respectively). Taking the same dataset in the middle of **Figure 1** as an example (Nat Med. 21:407-13, 2015), the red dot indicated the PMAD of the analysis chain adopted by that study and its ranking among all 558 analysis chains. As shown, the adopted chain (LOG-QUA-NON) was ranked 156th (PMAD = 0.598), showing its capacity to reduce variations among replicates and thus enhance technical reproducibility (Chawade et al., 2014). However, 155 chains performed better than the adopted one (PMAD from 1.8 × 10<sup>−15</sup> to 0.595), with the POW-TMM-ZER chain performing the best. Similar to this example dataset, the analysis chains adopted by the corresponding studies of PXD003278, PXD006106, and PXD004880 were ranked 162nd, 154th, and 164th, which demonstrated appropriate selection of analysis chains in previous studies. However, more than a hundred chains still performed better than the adopted ones, and these may further enhance the accuracy of SWATH-MS based protein quantification. For the studies with incomplete information on the adopted chain (PXD002952, PXD003972, and PXD001064), the possible integrations based on the known information were highlighted by multiple red dots. 1 (20%) out of 5, 28 (25%) out of 112, and 7 (100%) out of 7 integrations were within the range of good performance for PXD002952, PXD003972, and PXD001064, respectively.
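The ranking and the cutoff-based labeling described in this section can be sketched in a few lines of Python. The thresholds (0.3 and 0.7) and the two PMAD values for POW-TMM-ZER and LOG-QUA-NON are taken from the text; the remaining entries, marked `HYPOTHETICAL`, and all function names are illustrative assumptions.

```python
def precision_label(pmad):
    """Map a PMAD value to the precision classes quoted in the text."""
    if pmad <= 0.3:
        return "superior"
    if pmad <= 0.7:
        return "good"
    return "poor"


def rank_chains(pmad_by_chain):
    """Return (rank, chain, pmad, label) tuples, lowest PMAD first."""
    ordered = sorted(pmad_by_chain.items(), key=lambda item: item[1])
    return [(rank, chain, value, precision_label(value))
            for rank, (chain, value) in enumerate(ordered, start=1)]


def fraction_well_performed(pmad_by_chain, cutoff=0.7):
    """Share of chains whose PMAD does not exceed the cutoff."""
    good = sum(1 for value in pmad_by_chain.values() if value <= cutoff)
    return good / len(pmad_by_chain)


toy = {
    "POW-TMM-ZER": 1.8e-15,   # best chain reported for the example dataset
    "LOG-QUA-NON": 0.598,     # chain adopted by the example study
    "HYPOTHETICAL-A": 0.25,   # made-up values for illustration only
    "HYPOTHETICAL-B": 1.9,
    "HYPOTHETICAL-C": 120.0,
}
for row in rank_chains(toy):
    print(row)
print(fraction_well_performed(toy))
```

Applied to the counts reported above, the same ratio gives 203 / 558 ≈ 36.4% of chains at or below the cutoff for the example dataset.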

### Analysis Chains Consistently Well-Performed Across All Benchmark Datasets

The performances of 20 representative analysis chains across different datasets were illustrated in **Figure 2**. PMAD values in the ranges of ≤0.3, >0.3 & ≤0.7, and >0.7 were generally accepted as indicating superior, good, and poor performance, respectively (Chawade et al., 2014; Valikangas et al., 2018), and were illustrated by circles of various diameters (a smaller diameter denoted a lower PMAD value). As shown, the performance of a specific chain varied significantly among datasets. Particularly, LOG-PQN-BPC performed superior, good, and poor in 3, 3, and 1 datasets, respectively, and POW-ZSC-ZER performed superior, good, and poor in 1, 5, and 1 datasets, respectively. These results demonstrated a certain level of variation among the seven datasets for each analysis chain. However, as shown in **Figure 2**, some chains performed consistently across different benchmark datasets. For instance, CUB-TIC-BAK and CUB-VSN-CEN performed superior in all datasets, while 2 other chains (NON-CYC-ZER and NON-MEA-SVD) performed poorly in all seven benchmarks. It was of great interest to explore the dataset-independent properties underlying the consistency across datasets, which inspired us to further investigate the similarity among the performances of different analysis chains.

Since the instrument types (TripleTOF 5600 and 6600) covered by the seven benchmark datasets were the same as those of the 85 SWATH-MS based projects, those datasets could be recognized as representative of SWATH-MS based pharmacoproteomic data. Thus, the discovery of analysis chains that performed consistently well across the various datasets might give great insights into the selection of the most appropriate analysis chain in SWATH-MS based proteomic studies. To identify such chains, hierarchical clustering with the Ward algorithm (Barer and Harwood, 1999; Zhu et al., 2011; Fu et al., 2018; Xue et al., 2018a) was applied to the PMAD values across different datasets, yielding the "consistently well-performed" analysis chains (CWPACs). Theoretically, there were 560 possible analysis chains formed by randomly integrating 5 transformation, 16 normalization, and 7 imputation algorithms (including non-transformation, non-normalization, and non-imputation). 51 (9.1%) of these 560 had at least one of the seven PMAD values unavailable due to calculation errors. Then, the PMAD values of the remaining 509 analysis chains were applied for clustering analysis. As illustrated in **Figure 3**, six partitions of the analysis chains (A1, A2, A3, B, C, and D) were identified. The PMADs meeting the "well-performed" criterion (≤0.7) were displayed in blue, with log<sub>10</sub> PMAD ≤ −5 set as pure blue and larger values gradually fading toward white (PMAD = 0.7). Meanwhile, the "poor-performed" PMADs (>0.7) were colored orange, with log<sub>10</sub> PMAD ≥ 5 set as pure orange and smaller values gradually fading toward white (PMAD = 0.7).

The analysis chains in partitions A1, A2, and A3 were "consistently well-performed" across all datasets (**Figure 3**). For partition A1, 320 (99.4%) out of 322 PMAD values were ≤0.1, and the remaining PMADs were ≤0.7 (Supplementary Figure S1). For partition A2, 288 (52.7%), 209 (38.3%), and 40 (7.3%) out of those 546 PMAD values were ≤0.1, ≤0.3, and ≤0.7, respectively (Supplementary Figure S2). In partition A3, 187 (46.1%) and 183 (45.1%) out of 406 PMADs were ≤0.3 and ≤0.7, respectively (Supplementary Figure S3). In summary, 608 (47.7%), 396 (31.1%), and 225 (17.7%) out of all 1,274 PMADs in the partition combined by A1, A2, and A3 were ≤0.1, ≤0.3, and ≤0.7, respectively, indicating an extremely high percentage (96.5%) of the PMAD values meeting the widely adopted cutoff (PMAD = 0.7) for differentiating the chains of good and poor performance (Chawade et al., 2014; Valikangas et al., 2018). A comprehensive literature review of the 85 SWATH-MS based proteomic projects further identified the analysis chains adopted by their corresponding studies (Supplementary Table S2). In total, there were 55 analysis chains previously applied in proteomic studies, which were mapped to and labeled on **Figure 3** (pink triangles). As illustrated, 7 (12.7%), 9 (16.4%), and 21 (38.2%) out of the 55 analysis chains previously adopted were within partitions A1, A2, and A3, respectively, which indicated that the majority (67.3%) of these analysis chains were CWPACs.
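The combined figures in this paragraph can be cross-checked from the per-partition counts. The Python sketch below assumes that each PMAD is counted once, in the lowest bin it satisfies (≤0.1, then ≤0.3, then ≤0.7), which is the reading under which the reported totals add up; the dictionary layout and variable names are illustrative.

```python
# Counts per partition as reported in the text, read as exclusive bins.
partitions = {
    "A1": {"total": 322, "<=0.1": 320, "<=0.3": 0, "<=0.7": 2},
    "A2": {"total": 546, "<=0.1": 288, "<=0.3": 209, "<=0.7": 40},
    "A3": {"total": 406, "<=0.1": 0, "<=0.3": 187, "<=0.7": 183},
}
bins = ("<=0.1", "<=0.3", "<=0.7")

total = sum(p["total"] for p in partitions.values())
combined = {b: sum(p[b] for p in partitions.values()) for b in bins}
within_cutoff = sum(combined.values())

print(total, combined)
for b in bins:
    print(b, round(100 * combined[b] / total, 1))
print("within cutoff:", round(100 * within_cutoff / total, 1))

# Previously adopted chains falling in A1, A2, and A3, out of 55 in total.
adopted_in_cwpac = 7 + 9 + 21
print("adopted chains that are CWPACs:", round(100 * adopted_in_cwpac / 55, 1))
```

Each partition total is a multiple of seven (one PMAD per dataset per chain), which corresponds to 46, 78, and 58 chains in A1, A2, and A3, respectively.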

As shown in Supplementary Figure S4, the percentage of each processing method adopted by previous proteomic studies was analyzed. Log Transformation was the only transformation method used in SWATH-MS based proteomic studies, and was widely recognized as powerful in quantifying thousands of proteins (Rao et al., 2011; De Livera et al., 2012; Wisniewski et al., 2012; Zhu et al., 2012a; Feng et al., 2014). For normalization, Median Normalization, Total Ion Current, and Quantile Normalization were the three most popular methods. Median and Quantile Normalization were frequently adopted in MS-based label-free proteomic analyses (Callister et al., 2006), while Total Ion Current was reported to be preferred in proteomic profiling based on MALDI- and SELDI-TOF mass spectra (Borgaonkar et al., 2010). For imputation, K-nearest Neighbor and Background Imputation accounted for >80% of the SWATH-MS based proteomic studies adopting imputation methods. Among the methods used in proteomic studies (4 transformation, 15 normalization, and 6 missing-value imputation), Supplementary Figure S4 showed that some were seldom adopted in SWATH-MS based proteomic studies (such as Box-Cox Transformation, Pareto Scaling, and Singular Value Decomposition). Therefore, it is of great interest to discover whether other methods are suitable for, or demonstrate enhanced performance in, SWATH-MS based proteomic analysis.

Fifty-three analysis chains that consistently performed poorly across datasets were also identified in **Figure 3** (partition D), none of which adopted any transformation method. In total, 101 of the 509 analysis chains (**Figure 3**) adopted non-transformation, and 53 (52.5%), 10 (9.9%), 11 (10.9%), 14 (13.9%), 6 (5.9%), and 7 (6.9%) of these 101 chains were within partitions D, C, B, A3, A2, and A1, respectively. These results demonstrated the important role played by transformation methods in the quantification performance of analysis chains.

## Contribution of Each Processing Method to the Performance of Analysis Chain

With the discovery of a variety of CWPACs based on those independent benchmark datasets, it was interesting to go back to each processing method integrated into these CWPACs, which might reveal the processing methods with significant contributions to the performance of CWPACs. Therefore, all CWPACs listed in Supplementary Figures S1–S3 were investigated by analyzing their corresponding processing methods. As shown in **Figure 4**, the percentage of each method appearing in 3 different partitions (A1 & A2 & A3, A1 & A2, and A1) was analyzed. For transformation, the percentage of Power Transformation increased from 7% to 10% to 29% with the gradual narrowing of partitions (from A1 & A2 & A3 to A1 & A2 to A1), which showed the significantly enhanced role played by this transformation in achieving good performance in protein quantification. However, Log Transformation decreased greatly from 41% to 25% to 26%. This indicated that Log Transformation contributed most to the CWPACs compared to other transformations, but when it came to superior performance (partition A1 with PMAD ≤ 0.1), its contribution decreased and ranked second. For normalization, the Total Ion Current method stood out among all methods as the one with the highest contribution to CWPACs. With the gradual narrowing of partitions (from A1 & A2 & A3 to A1 & A2 to A1), the importance of the Total Ion Current method was enhanced significantly from 19% to 27% to 74%. For imputation, methods were almost evenly distributed with no clear change among different partitions. This indicated that each imputation method contributed equally to CWPACs, and that the selection among those methods made no statistical difference in protein quantification. Due to the equal contribution of imputation methods, it was essential to focus on selecting appropriate combinations of transformation and normalization methods to achieve the optimal performance of analysis chains, which included POW-TMM, LOG-TIC, BOX-TIC, CUB-TIC, NON-TIC, POW-TIC, and LOG-VSN (Supplementary Figure S1).
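The per-method percentages discussed above amount to counting how often each three-letter code occurs at a given position of the chain names within a partition. The Python sketch below shows one way to do this. The codes are ones that appear in the text, but the membership of the example partition and the function name `method_shares` are hypothetical.

```python
from collections import Counter


def method_shares(chains, position):
    """Percentage of chains using each method at one position of the chain
    name (0 = transformation, 1 = normalization, 2 = imputation)."""
    counts = Counter(chain.split("-")[position] for chain in chains)
    total = sum(counts.values())
    return {method: 100.0 * n / total for method, n in counts.items()}


# Hypothetical membership of a partition, for illustration only.
example_partition = ["POW-TIC-BAK", "LOG-TIC-ZER", "CUB-TIC-BAK", "POW-TMM-ZER"]

print(method_shares(example_partition, 0))  # shares of transformation codes
print(method_shares(example_partition, 1))  # shares of normalization codes
```

Repeating the calculation for progressively narrower partitions would show how the share of a method changes as the performance criterion tightens.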

## CONCLUSION

Based on the most complete set of publicly available pharmacoproteomic data generated by the SWATH-MS technique, this study revealed substantial variation among the performances of the various analysis chains applied to pharmacoproteomic quantification, and identified the analysis chains that performed consistently well across a diverse set of publicly available pharmacoproteomic data. As a result, log and power transformations followed by total ion current normalization were found to be among the best-performing analysis chains for SWATH-MS-based pharmacoproteomic quantification. In summary, the identified analysis chains provide important guidance for current proteomic research and could thus facilitate cutting-edge research in any proteomic study requiring the SWATH-MS technique.

### AUTHOR CONTRIBUTIONS

FZ conceived the idea and supervised the work. JF, JT, and YW performed the research. JF, XC, QY, JH, XL, SL, YC, and WX prepared and analyzed the data. FZ and JF wrote the manuscript. All authors have read and approved this manuscript.

### FUNDING

Funded by the research support of the Precision Medicine Project of the National Key Research and Development Plan of China (2016YFC0902200); Innovation Project on Industrial Generic Key Technologies of Chongqing (cstc2015zdcyztzx120003); and Fundamental Research Funds for the Central Universities (10611CDJXZ238826, CDJZR14468801, and CDJKXB14011).

#### REFERENCES



### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.00681/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Fu, Tang, Wang, Cui, Yang, Hong, Li, Li, Chen, Xue and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Therapeutic Effect of Repurposed Temsirolimus in Lung Adenocarcinoma Model

Hsuen-Wen Chang<sup>1</sup>, Min-Ju Wu<sup>1</sup>, Zih-Miao Lin<sup>2</sup>, Chueh-Yi Wang<sup>1</sup>, Shu-Yun Cheng<sup>1</sup>, Yen-Kuang Lin<sup>3</sup>, Yen-Hung Chow<sup>4</sup>, Hui-Ju Ch'ang<sup>5</sup> and Vincent H. S. Chang<sup>2</sup>\*

<sup>1</sup> Laboratory Animal Center, Office of Research and Development, Taipei Medical University, Taipei, Taiwan, <sup>2</sup> The PhD Program for Translational Medicine, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan, <sup>3</sup> Biostatistics Research Center, Taipei Medical University, Taipei, Taiwan, <sup>4</sup> National Institutes of Infectious Diseases and Vaccinology, National Health Research Institutes, Zhunan, Taiwan, <sup>5</sup> National Institute of Cancer Research, National Health Research Institutes, Zhunan, Taiwan

Lung cancer is one of the major causes of cancer-related deaths worldwide. The poor prognosis and resistance to both radiation and chemotherapy have urged the development of potential targets for lung cancer treatment. In this study, using a network-based cellular signature bioinformatics approach, we repurposed temsirolimus, an mTOR inhibitor clinically approved for renal cell carcinoma, as a potential therapeutic candidate for lung adenocarcinoma. The PI3K-AKT-mTOR pathway is known as one of the most frequently dysregulated pathways in cancers, including non-small-cell lung cancer. By using a well-documented lung adenocarcinoma mouse model of human pathophysiology, we examined the effect of temsirolimus on the growth of lung adenocarcinoma in vitro and in vivo. In addition, temsirolimus combined with reduced doses of cisplatin and gemcitabine significantly inhibited lung tumor growth in the lung adenocarcinoma mouse model compared with temsirolimus alone or the conventional cisplatin–gemcitabine combination. Functional imaging techniques and microscopic analyses were used to reveal the response mechanisms. Extensive immunohistochemical analyses were used to demonstrate the apparent effects of combined treatments on tumor architecture, vasculature, apoptosis, and the mTOR pathway. The present findings urge further exploration of temsirolimus in combination with chemotherapy for treating lung adenocarcinoma.

#### Edited by:

Zhaohui John Cai, Celgene (United States), United States

#### Reviewed by:

Michele Samaja, Università degli Studi di Milano, Italy Michela De Bellis, Università degli Studi di Bari Aldo Moro, Italy

> \*Correspondence: Vincent H. S. Chang vinhschang@tmu.edu.tw

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 16 March 2018 Accepted: 26 June 2018 Published: 24 July 2018

#### Citation:

Chang H-W, Wu M-J, Lin Z-M, Wang C-Y, Cheng S-Y, Lin Y-K, Chow Y-H, Ch'ang H-J and Chang VHS (2018) Therapeutic Effect of Repurposed Temsirolimus in Lung Adenocarcinoma Model. Front. Pharmacol. 9:778. doi: 10.3389/fphar.2018.00778

Keywords: mTOR inhibitor, drug repositioning, temsirolimus, lung adenocarcinoma, chemotherapy

### INTRODUCTION

Lung cancer is one of the most common forms of cancer and remains the leading cause of cancer-related deaths worldwide among men and women. Based on histological differentiation, there are two major types of lung cancer: small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC). NSCLCs are further divided into squamous cell carcinomas (SCCs), pulmonary adenocarcinomas (ADCs), and large-cell carcinomas. Among them, lung ADC is the most prevalent form of NSCLC (Teng, 2005; Chang et al., 2017). Lung cancer has a dismal prognosis, with a 5-year survival rate of 15%, mainly attributed to ineffective early detection and a lack of therapeutic options for metastatic disease (Molina et al., 2008). This has spurred efforts to develop molecularly targeted therapies.

Drug repositioning is the identification of new indications for existing drugs or compounds so that they can be used to treat a different disease. In addition to being time- and cost-efficient, drug repositioning offers a more favorable risk-versus-reward tradeoff than the other available drug development strategies. Because existing drugs have already been tested in terms of safety, dosage, and toxicity, they can often enter clinical trials much more rapidly than newly developed drugs (Ashburn and Thor, 2004). Computational drug repositioning is regarded as an alternative and effective way to identify novel connections between diseases and existing drugs (Hurle et al., 2013). The increase in drug-target information and advances in systems pharmacology approaches have led to an increase in the success of in silico drug repositioning. In particular, large-scale genomics databases, such as the Connectivity Map, provide abundant information on the modes of action of drugs, which are reflected in the transcriptomic responses to chemical perturbation (Vempati et al., 2014). Recently, a similar but highly expanded chemical genomics dataset was publicly released by the National Institutes of Health Library of Integrated Network-Based Cellular Signatures (NIH LINCS) program. This dataset includes gene expression signatures and protein binding, cellular phenotypic, and phosphoproteomics profiles resulting from chemical or genetic perturbation. Specifically, it presents the gene expression profiles of approximately 1000 landmark genes (L1000) in response to more than 20,000 chemical perturbations across many cell lines. Additionally, transcriptome-level expression profiles of approximately 20,000 genes have been computationally inferred from the 1000 landmark genes (Vempati et al., 2014).

In this study, we compared the transcriptome profiles obtained from a well-documented mouse lung cancer model (Chang et al., 2017) and used the LINCS L1000 cellular signature bioinformatics approach to identify clinically approved candidate drugs to treat ADC. By using this strategy, we identified temsirolimus, an mTOR inhibitor approved for renal cell carcinoma, as a potential therapeutic agent for the treatment of lung tumors. In a study using mouse models xenografted with human NSCLC cells (A549, H1299, and H358), temsirolimus inhibited the growth of subcutaneous tumors and prolonged the survival of mice with pleural dissemination of cancer cells owing to its anti-proliferative effect (Ohara et al., 2011). Temsirolimus has also been described in a case report (Vichai and Kirtikara, 2006) of lung adenocarcinoma harboring a specific gene mutation, and it was noted to restore radiosensitivity in lung adenocarcinoma cell lines (Ushijima et al., 2015). Two recently updated phase II clinical trials involving temsirolimus, either as monotherapy or in combination with another drug, were found through a web search<sup>1</sup> (Study 1: neratinib with and without temsirolimus for patients with HER2 activating mutations in non-small cell lung cancer; Study 2: temsirolimus and pemetrexed for recurrent or refractory non-small cell lung cancer). Although more than 40 inhibitors of the PI3K-AKT-mTOR signaling pathway have reached different stages of clinical development, only a few have been approved for clinical use (Skehan et al., 1990). However, an in vivo systemic evaluation of the inhibitory effect of temsirolimus on lung tumors has been lacking. Here we assessed the combination of the mTOR inhibitor temsirolimus with the first-line chemotherapy for advanced NSCLC, cisplatin and gemcitabine, to reduce cytotoxicity and enhance the therapeutic response.

### MATERIALS AND METHODS

#### Microarray Analysis

Total RNA was extracted from tissue samples or cells by using TRIzol® Reagent (Sigma, St. Louis, MO, United States) according to the manufacturer's instructions. Total RNA (0.2 µg) was amplified as previously described (Chang et al., 2017) for microarray analysis by using a microarray scanner (Agilent Technologies, Santa Clara, CA, United States). A total of 155 differentially expressed genes were identified in the Tg-3m mice, of which 126 genes were upregulated (log2 fold change ≥ 0.6) and 29 genes were downregulated (log2 fold change ≤ −0.6). A total of 123 differentially expressed genes were identified in the Tg-6m mice, of which 105 genes were upregulated (log2 fold change ≥ 0.6) and 18 genes were downregulated (log2 fold change ≤ −0.6).
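The thresholding rule stated above can be illustrated with a short sketch. The cutoffs (≥ 0.6 and ≤ −0.6) come from the text; the function name, the dictionary layout, and the gene labels and values are hypothetical and are not data from this study.

```python
# Cutoffs from the text: log2 fold change >= 0.6 (up) or <= -0.6 (down).
UP_THRESHOLD = 0.6
DOWN_THRESHOLD = -0.6

def classify_genes(log2_fold_changes):
    """Split genes into upregulated and downregulated lists by log2 fold change."""
    up = [gene for gene, fc in log2_fold_changes.items() if fc >= UP_THRESHOLD]
    down = [gene for gene, fc in log2_fold_changes.items() if fc <= DOWN_THRESHOLD]
    return up, down

# Hypothetical values for illustration only.
example = {"GeneA": 1.2, "GeneB": -0.9, "GeneC": 0.3, "GeneD": 0.6}
up, down = classify_genes(example)
print("upregulated:", up)
print("downregulated:", down)
```

Both cutoffs are inclusive, matching the "≥" and "≤" signs used in the text, so a gene sitting exactly on a threshold is counted as differentially expressed.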

### LINCS Perturbagen Signature Comparisons

The LINCS L1000 is one of the most comprehensive drug treatment expression profile databases and currently contains more than a million gene expression profiles of chemically perturbed human cell lines (Blois et al., 2011; Li et al., 2014). First, to compare the gene expression signatures of the transgenic mice with the LINCS data sets, the gene expression data were ranked according to the log2 fold changes. We retrieved the top 100 and bottom 100 most differentially expressed genes as the gene expression signatures of both Tg-3m and Tg-6m mice. Then, we converted the mouse gene symbols to homologous human gene symbols by using the HomoloGene database (Coordinators, 2016). Next, we queried the homologous human genes against the LINCS database by using sig\_query and sig\_summly in the LINCS C3 server. Finally, we annotated the returned results by combining the DrugBank (Wishart et al., 2006) and PubChem (Kim et al., 2016) results to provide detailed perturbagen information.
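The ranking and symbol-conversion steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the functions `build_signature` and `to_human_symbols` are hypothetical, the ortholog table is a hand-written stand-in for a HomoloGene lookup, and the fold-change values are invented.

```python
def build_signature(log2_fold_changes, n=100):
    """Return the n most upregulated and n most downregulated genes by log2 fold change."""
    ranked = sorted(log2_fold_changes, key=log2_fold_changes.get, reverse=True)
    n = min(n, len(ranked) // 2)  # keep the two halves from overlapping on small inputs
    top = ranked[:n]
    bottom = ranked[-n:] if n else []
    return top, bottom

def to_human_symbols(mouse_genes, ortholog_map):
    """Map mouse symbols to human symbols, dropping genes without a known homolog."""
    return [ortholog_map[gene] for gene in mouse_genes if gene in ortholog_map]

# Hypothetical fold changes and a stand-in ortholog table for illustration only.
mouse_fc = {"Trp53": -2.0, "Kras": 1.5, "Egfr": 0.8, "Cdkn1a": -0.4}
orthologs = {"Trp53": "TP53", "Kras": "KRAS", "Egfr": "EGFR", "Cdkn1a": "CDKN1A"}

top, bottom = build_signature(mouse_fc, n=1)
human_up = to_human_symbols(top, orthologs)
human_down = to_human_symbols(bottom, orthologs)
print(human_up, human_down)
```

The resulting up and down lists are the kind of input a signature query takes; the actual query in the study was run with sig\_query and sig\_summly, which are not reproduced here.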

### Functional Annotation of Differentially Expressed Genes

To identify the Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes pathways involved in the transgenic mice, we analyzed the differentially expressed genes by using the application programming interfaces (APIs) of the Database for Annotation, Visualization and Integrated Discovery (DAVID, version 6.7<sup>2</sup>) (Huang da et al., 2009). A p-value of 0.05, calculated using Fisher's exact test, was set as the threshold.
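To make the enrichment test concrete, the sketch below computes a one-sided Fisher's exact test as a hypergeometric upper tail using only the standard library. It is an illustration under stated assumptions: DAVID applies its own variant of the test, which is not reproduced here, and the function name and all counts are hypothetical.

```python
from math import comb

ALPHA = 0.05  # significance threshold stated in the text

def enrichment_p_value(hits, list_size, term_size, background_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of observing at least `hits` term-annotated genes in a list of
    `list_size` genes drawn from `background_size` genes, of which `term_size`
    carry the annotation.
    """
    max_hits = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

# Hypothetical counts: 3 of 5 listed genes carry a term annotated on 5 of 20 background genes.
p = enrichment_p_value(hits=3, list_size=5, term_size=5, background_size=20)
print(p, p < ALPHA)
```

Exact integer arithmetic via `math.comb` avoids rounding issues for small tables; for genome-scale backgrounds a statistics library would be the more practical choice.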

<sup>1</sup>https://clinicaltrials.gov/beta/

<sup>2</sup>david.ncifcrf.gov

### Animals and Ethics Statement


Murine lung adenocarcinoma models were maintained as previously described (Chang et al., 2017) in a specific pathogen-free environment at the animal facility of Taipei Medical University. Experimental use of mice was approved by the Institutional Animal Care and Use Committee of Taipei Medical University (Approved Proposal No. LAC-2014-0217). All experiments were conducted in accordance with relevant guidelines and regulations. The mice were monitored daily for physiological conditions. Tumor growth was monitored weekly using micro-CT. Mice were anesthetized with 5% isoflurane, followed by 2% isoflurane by inhalation for maintenance during the imaging process. Total lung volumes were measured and analyzed using CTAn software (v.1.15), and mice were euthanized when the total lung volumes were less than 120 mm<sup>3</sup>. At the endpoint of the experiment (the 16th week), the tested mice were euthanized by administering 100% CO<sub>2</sub> through inhalation to minimize their suffering.

### Cell Cycle and Apoptosis Assays

The effects of temsirolimus and chemotherapy on the cell cycle and apoptosis were evaluated by seeding tumor cells into 6-well plates at a density of 5 × 10<sup>4</sup> per well. The cells were treated accordingly and incubated for 24 h, followed by a phosphate-buffered saline (PBS) wash. The cell cycle phases were determined using a Muse cell analyzer (Merck Millipore, Darmstadt, Germany) and a Muse Cell Cycle Assay Kit (Merck Millipore, Darmstadt, Germany) according to the manufacturer's instructions. Cell apoptosis was analyzed using Annexin V Dead cell reagent (Merck Millipore, Darmstadt, Germany) according to the manufacturer's instructions. An average of at least 10,000 cells was analyzed for each condition. Three independent experiments were conducted.

### Protein Preparation and Western Blotting

Protein extraction and Western blotting analysis were performed as previously described (Chang et al., 2017). The blots were immunostained with a 1:1000 dilution of anti-p-mTOR (Ser2448) antibody (2971, Cell Signaling, Danvers, MA, United States). After incubation with horseradish peroxidase-conjugated secondary antibody (1:4000 dilution of goat anti-rabbit IgG, GTX213110-01, GeneTex, Irvine, CA, United States), protein bands were visualized with an enhanced chemiluminescent reagent.

### Micro-CT

Mice were anesthetized with an induction flow of 3% isoflurane in an oxygen mixture, followed by a maintenance flow of 1%. The chest area was scanned in a single scan by in vivo micro-CT (Bruker SkyScan 1176, Kontich, Belgium). Images were acquired at a resolution of 35 µm. The instrument was set to a voltage of 50 kVp, a current of 500 µA, and an exposure time of 50 ms with a 0.5-mm aluminum filter. To prevent artifacts caused by cardiac and respiratory motion, images were captured using the synchronization mode. Sections were reconstructed using the graphics processing unit-based NRecon software. The tumor volume inside the lung area was separated and analyzed using CTAn software (Bruker SkyScan, Kontich, Belgium). The cross-sectional images were obtained using DataViewer software (Bruker SkyScan, Kontich, Belgium).

### Histology and Immunohistochemistry

Mouse lung tumors were removed and prepared for paraffin-embedded sectioning. Immunohistochemistry (IHC) staining was performed as previously described (Chang et al., 2017). After antigen retrieval, primary antibody dilutions were prepared in the blocking buffer (10% bovine serum albumin with 0.1% Triton X-100 in PBS) as follows: 1:200 of anti-Ki67 antibody (ab15580, Abcam, Cambridge, MA, United States), 1:250 of anti-CD34 antibody (ab81289, Abcam, Cambridge, MA, United States), 1:100 of p-mTOR antibody (ab109268, Abcam, Cambridge, MA, United States), and 1:400 of p-S6RP antibody (2211, Cell Signaling, Danvers, MA, United States). Immunochemical signals were detected using a MultiLink Detection Kit (BioGenex, Fremont, CA, United States). The peroxidase reaction was developed with diaminobenzidine, and sections were counterstained using Mayer's hematoxylin. The intensity of positive signal areas was measured using ImageJ software (IJ 1.46r).

### Statistical Analyses

SAS version 9.3 for Windows (SAS Institute, Cary, NC, United States) was used for data manipulation and visualization. Means are used to describe the central tendency of continuous variables, and standard deviations are used to describe the variation. One-way ANOVA and Bonferroni post hoc multiple comparison tests were used to compare the inhibitory effects among different treatments. All statistical analyses were two-sided, and p < 0.05 was considered statistically significant. p-Values are indicated using asterisks: ∗p < 0.05, ∗∗p < 0.01.
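The Bonferroni correction and the asterisk notation can be sketched in a few lines. This is only an illustration of the adjustment step, assuming raw pairwise p-values are already available from a post hoc comparison; it does not reproduce the SAS procedure used in the study, and the function names and p-values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def significance_mark(p):
    """Asterisk notation used in the text: * for p < 0.05, ** for p < 0.01."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Hypothetical raw p-values for three pairwise treatment comparisons.
raw = [0.004, 0.02, 0.001]
adjusted = bonferroni_adjust(raw)
marks = [significance_mark(p) for p in adjusted]
print(list(zip(adjusted, marks)))
```

With three comparisons, a raw p-value of 0.02 no longer passes the 0.05 threshold after adjustment, which is the conservative behavior expected of the Bonferroni method.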

## RESULTS

## Data Processing and Drug Repositioning

To compare the gene expression signatures from different stages of lung tumors, microarray results of Tg-3m and Tg-6m tumors (Chang et al., 2017) were compared against the LINCS L1000 data sets, and the gene expression data were ranked according to the log2 fold changes. The top 100 and bottom 100 most differentially expressed genes were retrieved as the gene expression signatures of both Tg-3m and Tg-6m mice. The mouse gene symbols were then converted to homologous human gene symbols by using the HomoloGene database (Coordinators, 2016). Next, the homologous human genes were queried against the LINCS database by using sig\_query and sig\_summly in the LINCS C3 server. The returned results were annotated by combining the DrugBank (Wishart et al., 2006) and PubChem (Kim et al., 2016) results to obtain detailed drug information (**Figure 1**). The drugs that negatively correlated (K score = −1) with the gene expression of both Tg-3m and Tg-6m lung tumor cell lines were selected for further screening. The data regarding the drugs were then manually curated from DrugBank and PubMed by searching for keywords and abstracts that explicitly described their association with cancers. The repositioned drug candidates are listed in **Table 1**. This list contained a wide range of drugs, including some antineoplastic agents used for cancers other than lung cancer, suggesting that the use of these agents in clinics may affect the gene expression signature of lung cancer (Kerr et al., 2007; Lin et al., 2007; Gallotta et al., 2010; Wynne and Djakiew, 2010; Endo et al., 2014; Li et al., 2016). We focused on the top-scoring candidates and clinically approved antineoplastic drugs. This analysis led to the identification of temsirolimus, a U.S. Food and Drug Administration (FDA)-approved mTOR inhibitor for renal cell carcinoma, which was identified from the signatures of both stages of lung tumor cells and has been tested in combination with thoracic radiation in NSCLC (Waqar et al., 2014).
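The selection rule described above (a K score of −1 against both the Tg-3m and Tg-6m signatures) amounts to a filtered intersection. The sketch below assumes, hypothetically, that the query results for each signature are available as a mapping from drug name to score; the function name, the drug labels, and the scores are placeholders and not results from the study.

```python
def select_candidates(scores_3m, scores_6m, target=-1.0):
    """Return drugs whose score equals `target` against both signatures."""
    shared = scores_3m.keys() & scores_6m.keys()
    return sorted(
        drug for drug in shared
        if scores_3m[drug] == target and scores_6m[drug] == target
    )

# Placeholder drug names and scores for illustration only.
tg3m_scores = {"drug_a": -1.0, "drug_b": -1.0, "drug_c": 0.4}
tg6m_scores = {"drug_a": -1.0, "drug_b": -0.7, "drug_d": -1.0}

candidates = select_candidates(tg3m_scores, tg6m_scores)
print(candidates)
```

Requiring agreement across both stages narrows the list to perturbagens that oppose the tumor signature early and late, before the manual literature curation step.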

### Temsirolimus Treatment Leads to G0/G1 Cell Cycle Arrest

To understand whether temsirolimus treatment is lethal to lung tumor cells at both early and late stages, we performed flow cytometry to analyze the cell cycle distribution in Tg-3m (**Figure 2A**) and Tg-6m (**Figure 2B**) cell lines treated with temsirolimus at different concentrations (2.5, 5.0, and 10 µM). Temsirolimus treatment increased the cell population in the G0/G1 phase in both Tg-3m and Tg-6m cell lines but did not cause significant cell death (**Figure 3**). Taken together, these results suggest that temsirolimus suppressed the proliferation of Tg-3m and Tg-6m cells through its cytostatic effect and not through cytotoxicity.

## Efficacy of Temsirolimus, Cisplatin, and Gemcitabine in mTOR Pathway and Cytotoxicity

The efficacy of temsirolimus, cisplatin, and gemcitabine (each at 10 µM) alone and in combination was evaluated in Tg-3m (**Figure 3A**) and Tg-6m (**Figure 3B**) cells. To evaluate the effect of temsirolimus on activation of the mTOR pathway, we examined the phosphorylation of mTOR (Ser2448) by using Western blot analysis. Gemcitabine or cisplatin treatment did not alter the phosphorylation of mTOR. However, treatment with temsirolimus alone suppressed the activation of mTOR more markedly in Tg-6m than in Tg-3m cells. When cells were treated with temsirolimus combined with cisplatin and gemcitabine, the effect of mTOR suppression was evident. The apoptotic cell death in the H1299 human NSCLC cell line is presented in **Supplementary Figure S3**.

To evaluate the cytotoxic effect of temsirolimus, we examined the total apoptotic rate by using annexin V staining. In the human NSCLC cell line H1299, treatment with temsirolimus alone caused about 25% cell death, and combination with cisplatin and gemcitabine enhanced cytotoxicity by approximately 10% in G + C and 15% in G + C + T (p = 0.02 and 0.003, respectively) (**Supplementary Figure S4**). Treatment with gemcitabine alone induced higher cytotoxicity in Tg-6m than in Tg-3m cells; however, treatment with cisplatin alone did not reveal any substantial difference. Treatment with gemcitabine plus cisplatin revealed similar apoptotic results in both cell lines. Although treatment with temsirolimus alone did not cause cytotoxicity, it significantly enhanced the cisplatin- and gemcitabine-induced apoptosis in both cell lines (p < 0.05; **Figures 3A,B**).

#### TABLE 1 | List of drug repositioning candidates.

<sup>+</sup>Information obtained from DrugBank (https://www.drugbank.ca/).

### Treatment Effects of Temsirolimus, Cisplatin, and Gemcitabine on Tumor Growth

To investigate the effect of temsirolimus, cisplatin, and gemcitabine on tumor growth, we used a therapeutic approach with a previously documented NSCLC mouse model (Chang et al., 2017). The mice were divided into three groups (n = 5): the control group (no treatment), the group that received a low dosage of cisplatin and gemcitabine (low-dose C + G), and the group that received temsirolimus combined with a low dosage of cisplatin and gemcitabine (mix T + C + G). Treatment began when the mice were 9 weeks old and continued for 8 weeks. Both treatments were administered weekly through the tail vein, and micro-CT imaging was performed to follow tumor growth (**Figure 4A**). The imaging on week 15 was postponed because of regular maintenance of the scanner. On week 16, the mice were sacrificed and their lungs were removed for histopathological analysis.

The tumor growth rate was calculated by normalizing each tumor volume to the baseline tumor volume of each mouse at the beginning of week 9. Tumor growth was slightly reduced in the low-dose C + G group, whereas it was markedly inhibited in the mix (T + C + G) group. In addition, tumor growth declined significantly after 4 weeks of treatment in the mix (T + C + G) group, with p ≤ 0.05 (weeks 13–16). Smaller and fewer lung tumors were also noted in hematoxylin and eosin (H&E)-stained lung sections (**Figures 4B,C**). Collectively, the weekly administration of temsirolimus combined with low doses of cisplatin and gemcitabine effectively reduced the growth of lung tumors.
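The normalization described above divides each weekly tumor volume by that mouse's own volume at the week 9 baseline. A minimal sketch follows; the function name, the data layout (week mapped to volume), and the volumes themselves are hypothetical and are not measurements from the study.

```python
def relative_growth(volumes_by_week, baseline_week=9):
    """Normalize each tumor volume to the same mouse's baseline-week volume."""
    baseline = volumes_by_week[baseline_week]
    if baseline <= 0:
        raise ValueError("baseline tumor volume must be positive")
    return {week: volume / baseline for week, volume in sorted(volumes_by_week.items())}

# Hypothetical tumor volumes (arbitrary units) for one mouse.
mouse_volumes = {9: 10.0, 10: 12.0, 13: 15.0}
growth = relative_growth(mouse_volumes)
print(growth)
```

Normalizing per mouse removes differences in starting tumor burden, so that group comparisons reflect growth over the treatment period and not the initial tumor size.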

#### Treatment Effects on General Tumor Characteristics and the mTOR-Pathway

At the end of the experiment (week 16), all lungs were dissected and immunohistochemically analyzed to assess and quantify the microscopic effects of combined therapies with or without temsirolimus on general tumor characteristics (H&E stain; Ki-67 and CD34) and to identify possible mechanisms for the observed differences in growth inhibition. H&E staining revealed viable tumor mass within the lung parenchyma in untreated tumors, with immune cell infiltration. Residual tumor mass within the lung parenchyma with congestion, hyaline deposition, and immune cell infiltration was observed in low-dose C + G-treated tumors. After treatment with combined T + C + G, scattered viable tumor cells with nuclear pleomorphism were observed within the lung parenchyma, together with foamy macrophages and giant cells (magnification: 100×; **Figure 5**). Ki-67 staining revealed condensed signals of proliferating tumor cells in untreated control tumors. Treatment with low-dose C + G resulted in a lower fraction of proliferating cells, whereas treatment with combined T + C + G showed diffuse proliferating signals (magnification: 300×). CD34 staining demonstrated disrupted angiogenic architecture in the low-dose and mix groups compared with the untreated control group (magnification: 400×). In addition to general tumor characteristics, we investigated specific treatment effects on the mTOR pathway by evaluating p-mTOR and p-S6RP in all lung tumors (magnification: 400×; **Figure 6**). The quantitative bar charts, which represent the positively stained areas of the whole image, revealed that both treatments inhibited the tumor proliferation marker Ki67. The combination treatment with temsirolimus markedly inhibited angiogenesis compared with low-dose chemotherapy. Quantification of the stained areas demonstrated reduced p-mTOR signaling in both treated groups, whereas the p-S6RP signal was higher. The statistical analysis of the intensity of positive signals from three selected views of each IHC-stained section demonstrated similar results (**Supplementary Figure S5**). Whether the p-S6RP signaling resulted from heterogeneous tumor cells remains to be investigated.

temsirolimus, G + C: gemcitabine + cisplatin, G + C + T: gemcitabine + cisplatin + temsirolimus.

### DISCUSSION

Chemotherapy is one of the most important treatment methods for advanced NSCLC, and cisplatin-based combinations are usually used as standard regimens. The combination of one or more agents with a platinum compound has resulted in high response rates and prolonged survival (Schiller et al., 2002; Ruiz-Ceja and Chirino, 2017). Gemcitabine, which inhibits DNA synthesis, was approved by the FDA in 1996. Gemcitabine is indicated in combination with cisplatin as the first-line treatment of patients with advanced NSCLC (Ruiz-Ceja and Chirino, 2017). Common adverse events related to cisplatin plus gemcitabine treatment are hematologic toxicity and gastrointestinal reactions. Hematologic toxicity mainly includes decreased white blood cells and platelets. Gastrointestinal reactions mainly include nausea and vomiting (Ai et al., 2016). However, the high toxicity induced by cisplatin-based doublets urges research on alternative treatments. In this study, we used the LINCS L1000 database and a well-characterized lung adenocarcinoma mouse model to repurpose existing drugs for lung adenocarcinoma. By using this approach, we identified the mTOR inhibitor temsirolimus, which has been approved by the FDA for renal cell carcinoma, as a potential therapeutic agent. In our results, both early-stage (Tg-3m) and late-stage (Tg-6m) lung tumor cell lines treated with temsirolimus demonstrated cell cycle arrest at the G0/G1 phase. Treatment with temsirolimus alone suppressed mTOR activation more markedly in Tg-6m than in Tg-3m cells. When temsirolimus was combined with cisplatin and gemcitabine, the effect of mTOR suppression was evident. Additionally, temsirolimus combined with gemcitabine and cisplatin not only suppressed the phosphorylation of mTOR but also significantly increased cell death in Tg-3m and Tg-6m cell lines compared with gemcitabine plus cisplatin.

depicted (A). Red arrow heads indicate the monitored tumors compared with the corresponding H&E-stained histopathologic sections at the endpoint (B). The endpoint H&E-stained sections are also displayed as insets in Figure 5. The tumor growth rate was calculated by normalizing each tumor volume to the baseline tumor volume of each mouse at the beginning of week 9. The effects of the different treatments (Tem, Low C + G, and Mix T + C + G) on lung tumor growth inhibition over time are displayed (C). The significance of tumor growth inhibition among treatments was analyzed, and inhibition was found to be significant after week 13 in the Mix (T + C + G) group compared with the control group (∗p ≤ 0.05).

from mice with lung tumors were assessed by pathologists blinded to the treatment and outcome. Magnification: 100× and 40× (inset).

As reported by Khuri and colleagues (Li et al., 2014), mTOR inhibition triggers rapid and sustained activation of the PI3K/Akt survival pathway in human lung and other types of cancer cells; therefore, the combination of mTOR-targeted therapy with drugs that block PI3K/Akt activation might also be reasonable. In a reported phase II study, temsirolimus was administered weekly as a single agent in 52 patients with untreated NSCLC. The clinical benefit rate was 35%, with a confirmed partial response in 8% and stable disease in 27%. Although these results did not satisfy the protocol-defined criteria for success, they provided evidence of the clinical activity of temsirolimus as a single agent in NSCLC (Reungwetwattana et al., 2012). In a phase I study, temsirolimus was combined with weekly thoracic radiation, and the combination was shown to be tolerable (Waqar et al., 2014). Because temsirolimus has demonstrated considerable activity in clinical studies, we hypothesize that it works synergistically with the first-line NSCLC chemotherapy cisplatin plus gemcitabine.

In animal studies, optimizing the combination of the cytostatic agent temsirolimus with cycle-active chemotherapy is important for maximizing the clinical benefit. Therefore, we designed concurrent and sequential administration of temsirolimus with either low or high doses of chemotherapy in a mouse lung adenocarcinoma model. In the concurrent schedule, administration of low-dose chemotherapy and temsirolimus (T + C + G) demonstrated greater inhibition of tumor growth compared with low-dose chemotherapy alone (C + G) in the mouse model. In the sequential schedule, in which temsirolimus alone was administered weekly for 3 weeks prior to the administration of high-dose chemotherapy (3 mg/kg cisplatin + 30 mg/kg gemcitabine) in the following weeks, the effect of tumor growth inhibition was less significant (**Supplementary Figures S1**, **S2**). Collectively, our study revealed that concurrent administration of low-dose chemotherapy and temsirolimus is more effective in suppressing lung tumor growth, which may be advantageous for reducing the cytotoxicity caused by standard chemotherapy. The histopathologic evaluation of endpoint H&E-stained lung tumor sections revealed that the tumors showed a more extensive response to the T + C + G treatment than to the low-dose C + G treatment. Common tumorigenic and angiogenic markers (Ki67 and CD34) were apparently inhibited after the T + C + G treatment compared with low-dose C + G treatment. These results demonstrated the tumor inhibition efficacy of temsirolimus combined with low-dose chemotherapy. The inhibition of mTOR phosphorylation was higher in the mixed treatment. Moreover, the phosphorylation of the ribosomal protein S6 (p-S6RP), one of the targets downstream of the mTOR pathway, was reduced after both treatments. The examination of phosphorylated mTOR and S6RP suggested their sensitivity to temsirolimus.

The clinical benefits of chemotherapy are limited by drug resistance and systemic toxicity. Temsirolimus was reported to restore cisplatin sensitivity in lung cancer cell lines by blocking the translation of proteins that are involved in cisplatin resistance (Blois et al., 2011). The cytostatic effect of temsirolimus was also demonstrated by introducing temsirolimus as a molecular-targeted agent with the potential for inhibiting tumor cell repopulation (Fung et al., 2009). However, pulmonary toxicity has been associated with mTOR inhibitors, as with many other drugs, including anticancer agents (Blois et al., 2011; Li et al., 2014). Proper management of the chemotherapeutic strategy and clinical diagnosis of pulmonary symptoms should be taken into account when administering mTOR inhibitors. Our study demonstrated that a combination of low-dose chemotherapy and temsirolimus treatment was more effective in inhibiting tumor growth than a doublet chemotherapy regimen in the mouse lung tumor model. In addition, the concurrent administration of the combined treatment was more efficacious than the sequential administration of these agents at a higher dose. Our study results suggest that the combination of low-dose chemotherapy and temsirolimus treatment might be beneficial in the treatment of lung adenocarcinoma, which warrants further investigation.

#### AUTHOR CONTRIBUTIONS

H-WC and VC conceived the experiments. M-JW and Z-ML conducted the experiments. C-YW conducted the micro-CT imaging. S-YC conducted the tissue embedding and histopathology. H-WC, H-JC, and VC analyzed the results. Y-KL assisted with the statistical analysis. H-WC wrote the manuscript. Y-HC and VC provided comments on the manuscript. All authors reviewed the manuscript.

#### ACKNOWLEDGMENTS

This work was partially supported by the foundation of the Ph.D. program for Translational Medicine, College of Medical Science and Technology from Taipei Medical University.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.00778/full#supplementary-material

#### REFERENCES


FIGURE S1 | Effects of high-dose chemotherapy and the sequential administration of temsirolimus followed by high-dose chemotherapy in lung tumor growth inhibition. The tumor growth inhibition was significant in the first 3 weeks of treatment with temsirolimus. However, the tumor growth inhibition efficacies between the two chemotherapy regimens were similar (A). Effect of various treatments on tumor growth rate (B). The tumor growth inhibition was most significant after the mixed treatment (T + C + G) with a lower dose of chemotherapy, which is beneficial in reducing the cytotoxic effect. (∗p ≤ 0.05; ∗∗p ≤ 0.01).

FIGURE S2 | Effect of temsirolimus alone (orange line) compared to different regimes of chemotherapy in lung tumor growth inhibition. The tumor growth inhibition of different treatments was depicted with statistical significance. While the tumor growth was inhibited moderately using temsirolimus alone (orange line), combined temsirolimus with low-dose chemotherapy (blue line, p-value = 0.019) inhibited the tumor growth significantly. Combined temsirolimus with low-dose chemotherapy treatment also showed significance over the treatment using temsirolimus alone or high-dose chemotherapy (black line), with p-value = 0.037 and 0.032, respectively.

FIGURE S3 | Temsirolimus combined with cisplatin and gemcitabine induced apoptotic cell death in H1299 human NSCLC cell line. Although temsirolimus treatment alone caused about 25% cell death, when combined with cisplatin and gemcitabine, it significantly enhanced the cytotoxicity by approximately 10% in G + C and 15% in G + C + T (p-value = 0.02 and 0.003, respectively). Con: control, Gem: gemcitabine, Cis: cisplatin, Tem: temsirolimus, G + C: gemcitabine+cisplatin, G + C + T: gemcitabine + cisplatin + temsirolimus.

FIGURE S4 | SRB cytotoxicity assay performed in Tg-3m (A) and Tg-6m (B) cell lines. The results showed either temsirolimus alone or combined with cisplatin and gemcitabine displayed significant cytotoxic effects in both Tg-3m and Tg-6m cell lines. The results of SRB assay demonstrated the in vitro correlation of cytotoxicity and cell death caused by chemotherapeutic agents tested.

FIGURE S5 | Statistical quantification of IHC-stained sections of Ki67, CD37, p-mTOR, and p-S6RP. The intensity of positive signals from each IHC-stained section were selected from three different views (red-lined squares) and analyzed using Image J software with IHC toolbox as plugins.


cancers: targeting DNA and cell cycle. Neoplasia 9, 830–839. doi: 10.1593/neo.07475


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chang, Wu, Lin, Wang, Cheng, Lin, Chow, Ch'ang and Chang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Exploring the Mechanism of Flavonoids Through Systematic Bioinformatics Analysis

Tianyi Qiu<sup>1</sup>†, Dingfeng Wu<sup>2</sup>†, LinLin Yang<sup>3</sup>, Hao Ye<sup>4,5</sup>, Qiming Wang<sup>2</sup>, Zhiwei Cao<sup>2</sup>\* and Kailin Tang<sup>2</sup>\*

<sup>1</sup> Institute of Biomedical Sciences, Fudan University, Shanghai, China, <sup>2</sup> School of Life Sciences and Technology, Tongji University, Shanghai, China, <sup>3</sup> Hebei Key Laboratory of Metabolic Diseases and Clinical Medicine Research Center, Hebei General Hospital, Hebei, China, <sup>4</sup> Sinotech Genomics Ltd., Shanghai, China, <sup>5</sup> East China University of Science and Technology, Shanghai, China

Flavonoids are the largest class of plant polyphenols. They share a common diphenylpropane structure, consisting of two aromatic rings linked through three carbons, and are abundant in both daily diets and medicinal plants. Fueled by the recognition that consuming flavonoids can improve health, researchers have become interested in deciphering how flavonoids alter the functions of the human body. Here, systematic studies were performed on 679 flavonoid compounds and 481 corresponding targets through bioinformatics analysis. Multiple pathways related to human diseases, including cancers, neurological diseases, diabetes, and infectious diseases, were significantly regulated by flavonoids. Specific functions of each flavonoid subclass were further analyzed at both the target and pathway levels. Flavones and isoflavones were significantly enriched in multi-cancer related pathways, flavan-3-ols were found to focus on cellular processing and lymphocyte regulation, flavones preferentially acted on cardiovascular-related activities, and isoflavones were closely related to cell multisystem disorders. The relationship between chemical constitution fragments and biological effects indicated that different side chains could significantly affect the biological functions of flavonoid subclasses. These results highlight the common and subclass-specific functions of flavonoids with respect to their pharmacological and biological properties.

Keywords: flavonoids, mechanism of action, pathway analysis, protein–protein interaction network, structure activity relationship

#### Edited by:

Zhi-Liang Ji, Xiamen University, China

#### Reviewed by:

Shi-Bing Su, Shanghai University of Traditional Chinese Medicine, China; Haifeng Chen, Shanghai Jiao Tong University, China

#### \*Correspondence:

Zhiwei Cao, zwcao@tongji.edu.cn; Kailin Tang, kltang@tongji.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 07 February 2018; Accepted: 26 July 2018; Published: 15 August 2018

#### Citation:

Qiu T, Wu D, Yang L, Ye H, Wang Q, Cao Z and Tang K (2018) Exploring the Mechanism of Flavonoids Through Systematic Bioinformatics Analysis. Front. Pharmacol. 9:918. doi: 10.3389/fphar.2018.00918

## INTRODUCTION

Flavonoids are a family of phenolic substances sharing the same backbone structure of 2-phenyl-1,4-benzopyrone. They are very abundant in nature and accumulate in regular human dietary sources including flowers (Zhang and Ma, 2018), fruits (Chang et al., 2018), vegetables, tea, wine (Matveeva et al., 2018), and so on (Szmitko and Verma, 2005). With this basic core scaffold, flavonoids have been demonstrated to exhibit relevant biological properties, including strong anti-oxidant (Pietta, 2000), anti-allergy (Kawai et al., 2007; Castell et al., 2014), anti-inflammatory (Nijveldt et al., 2001; Serafini et al., 2010; Matias et al., 2014), anti-microbial (Cushnie and Lamb, 2005), and anti-obesity (Hughes et al., 2008) effects. Flavonoids have also been reported to reduce the risk of cardiovascular disease (Hooper et al., 2008; Mulvihill and Huff, 2010; Feliciano et al., 2015) and cancers (Yao et al., 2011; Batra and Sharma, 2013), to improve cognition (Spencer et al., 2009; Williams and Spencer, 2012), and to provide neuroprotection in Alzheimer's disease (Bakhtiari et al., 2017; Mohebali et al., 2018). Moreover, flavonoids have been found to act as agonists or antagonists, depending on the estrogen concentration, to regulate estrogenic-like activity (Breinholt et al., 1999; Hwang et al., 2006).

On the basis of the common core scaffold, various combinations of substituent chemical groups at different positions lead to the structural diversity of flavonoids. This diversity is further increased by variations in functional groups, such as hydroxyl, methoxyl, carbonyl, and olefinic groups (Gontijo et al., 2017). According to these structural variations, flavonoids can generally be assigned to six main subclasses: flavones, flavonols, flavanones, flavanonols, flavan-3-ols, and isoflavones (Ross and Kasum, 2002), whose chemical properties depend on their structural class, degree of hydroxylation, substitutions, conjugation, and degree of polymerization (Kumar and Pandey, 2013). However, the functional similarities and differences among flavonoid subclasses, as well as the structural basis of their different functions, have not yet been fully revealed.

In this study, a comprehensive bioinformatics analysis was performed on a large-scale dataset including 679 flavonoids and 481 corresponding targets to decipher the mechanism of action (MOA) of flavonoids from a new perspective. The results illustrate the structure–activity relationships of different flavonoid subclasses, which hint at the protective roles of flavonoid subclasses in different human diseases. With the accumulation of data on flavonoids and their corresponding targets, it is possible to comprehensively investigate the MOA of flavonoids at a systematic level and to interpret their therapeutic mechanisms to guide drug discovery from natural flavonoid products.

## MATERIALS AND METHODS

### Dataset

#### Flavonoids and Corresponding Targets

A total of 5,006 chemical structures of natural plant products were derived from the Natural Product Activity and Species Source Database (NPASS) (Zeng et al., 2018). Among them, the main types of flavonoids, including flavones, flavonols, flavanones, flavanonol, isoflavones, and flavan-3-ols, were categorized according to scaffold structures derived with the cheminformatics software RDKit (Landrum, 2010), as illustrated in **Figure 1A**. Further, the corresponding direct targets of flavonoids were selected from the 5,337 targets of natural plant products in NPASS. In total, 679 flavonoids and 481 corresponding targets were selected and are listed in **Supplementary Table 1**. The number of targets for each flavonoid subclass is shown in **Figure 1B**.

### Enrichment Analysis of Flavonoids' Targets

#### Diversity Analysis of Natural Flavonoid Products' Targets

Targets of natural flavonoid products were mapped to the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2012) and Gene Ontology (GO) (Ashburner et al., 2000) through Metascape (Tripathi et al., 2015) to analyze their enriched pathways. Enriched pathways were then generated for each of the six flavonoid subclasses.
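Over-representation of a target list in a pathway is commonly scored with a one-sided hypergeometric test. The sketch below is not Metascape's implementation and is not taken from this study; it only illustrates the calculation with the Python standard library. The function name and the toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_targets, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_targets:    genes in the target list being tested
    n_overlap:    target genes annotated to the pathway
    """
    upper = min(n_pathway, n_targets)
    total = comb(n_background, n_targets)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_targets - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 5 in the pathway,
# 5 targets, all 5 of which fall in the pathway.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
print(f"p = {p_value:.5f}")
```

In practice, p-values from many pathways would also need a multiple-testing correction, which is omitted here for brevity.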

#### Specific Pathway Enrichment Analysis of Natural Flavonoids Products

To distinguish the pathways specific to flavonoids from those of other natural plant products, a permutation test with 1,000 permutations was performed to identify the specific pathways of flavonoids' targets, using the 4,327 other natural plant products as background.
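The text does not spell out the permutation procedure, so the following is only a minimal sketch of one common design: compound labels are shuffled by repeatedly drawing a random group of the same size from the pooled compounds, and the empirical p-value is the fraction of draws that hit the pathway at least as often as the real group. The function name, the boolean encoding, and the toy pool are assumptions for illustration.

```python
import random


def permutation_p_value(compound_hits, group_size, n_permutations=1000, seed=0):
    """Empirical p-value for pathway over-representation in a compound group.

    compound_hits: list of booleans, one per compound in the pooled set
                   (group of interest first, then background); True if the
                   compound has at least one target in the pathway.
    group_size:    number of leading entries that form the group of interest.
    """
    rng = random.Random(seed)
    observed = sum(compound_hits[:group_size])
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Draw a random group of the same size from the whole pool.
        sample = rng.sample(compound_hits, group_size)
        if sum(sample) >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy pool (hypothetical): 20 group compounds that all hit the pathway,
# followed by 80 background compounds that never do.
pool = [True] * 20 + [False] * 80
print(permutation_p_value(pool, group_size=20))
```

With 1,000 permutations the smallest attainable p-value is 1/1001, which is why the number of permutations bounds the resolution of the test.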

### Pharmacology Network Analysis

Protein–protein interaction (PPI) networks of flavonoids' targets were generated and modularized through Metascape (Tripathi et al., 2015). Further, the bio-functional similarities and differences between the networks of the six subclasses were compared based on the main functional modules. PPI enrichment analysis was then carried out with the following databases: BioGrid (Chatr-Aryamontri et al., 2017), InWeb\_IM (Li et al., 2017), and OmniPath (Turei et al., 2016). The densely connected network components were identified with the Molecular Complex Detection (MCODE) algorithm (Bader and Hogue, 2003) and viewed in Cytoscape (Shannon et al., 2003).
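MCODE weights vertices by local neighborhood density, and k-core decomposition is one of its building blocks. The sketch below is not MCODE itself. It shows only a plain k-core decomposition, as a simplified illustration of how densely connected regions of a PPI network can be located. The function name and the toy graph (with placeholder node labels) are assumptions.

```python
def core_numbers(adjacency):
    """Return the k-core number of every node in an undirected graph.

    adjacency: dict mapping node -> set of neighbouring nodes.
    """
    degrees = {node: len(neigh) for node, neigh in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the lowest remaining degree.
        node = min(remaining, key=lambda n: degrees[n])
        k = max(k, degrees[node])
        core[node] = k
        remaining.remove(node)
        for neigh in adjacency[node]:
            if neigh in remaining:
                degrees[neigh] -= 1
    return core


# Toy graph (hypothetical labels): a 4-clique A-B-C-D plus a pendant node E.
graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A"},
}
print(core_numbers(graph))
```

Nodes with a high core number sit inside tightly connected subgraphs, which is the kind of region a module-detection step would report.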

### Structure–Activity Relationship Analysis

In order to analyze the structure–activity relationship, basic physicochemical properties, including molecular mass (Weight), lipid-water partition coefficient (LogP), number of hydrogen bond acceptors (NumHAcceptors), number of hydrogen bond donors (NumHDonors), number of rotatable bonds (NumRotatableBonds), and topological polar surface area (TPSA), as well as compliance with Lipinski's Rule of Five, were calculated for the different natural flavonoid products with RDKit (Landrum, 2010).
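Once the descriptors are available, the Rule of Five check itself is simple arithmetic. The sketch below assumes the descriptor values have already been computed (for example with RDKit) and applies the standard thresholds: molecular mass at most 500 Da, LogP at most 5, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. The function names, the strict zero-violation default, and the example descriptor values are illustrative assumptions, not values from this study.

```python
def rule_of_five_violations(weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's Rule of Five."""
    checks = [
        weight > 500,      # molecular mass above 500 Da
        logp > 5,          # partition coefficient above 5
        h_donors > 5,      # more than 5 hydrogen bond donors
        h_acceptors > 10,  # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(weight, logp, h_donors, h_acceptors, max_violations=0):
    """True if the compound has at most max_violations violations."""
    n = rule_of_five_violations(weight, logp, h_donors, h_acceptors)
    return n <= max_violations


# Approximate, illustrative descriptor values (LogP values are placeholders).
quercetin = dict(weight=302.24, logp=1.5, h_donors=5, h_acceptors=7)
egcg = dict(weight=458.37, logp=1.2, h_donors=8, h_acceptors=11)

print("quercetin passes:", passes_rule_of_five(**quercetin))
print("EGCG violations:", rule_of_five_violations(**egcg))
```

Some analyses tolerate one violation; `max_violations` is exposed so that either convention can be tried.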

Also, the core scaffold and side chains of each natural flavonoid product were derived from its chemical structure. Since flavones, flavonols, flavanones, flavanonol, and flavan-3-ols share the same core scaffold, the structure–activity relationships of these five subclasses were analyzed. Then, the bio-functional annotation of each structure segment was obtained according to GO (Ashburner et al., 2000). Further, to identify associations between the chemical structure of flavonoid subclasses and biological function, the structure–activity relationship was analyzed with the Apriori algorithm (Agrawal and Srikant, 1994). Here, the minimum support was set to 0.01 and the minimum confidence to 0.5.
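To make the roles of support and confidence concrete, here is a minimal, unoptimized sketch of Apriori-style rule mining. Each transaction is assumed to describe one compound as a set of structure fragments and GO annotations. The item labels, function names, and toy data are hypothetical, and the encoding actually used in the study is not specified in the text.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori) search for itemsets with support >= min_support."""
    n = len(transactions)
    support = {}
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    size = 1
    while candidates:
        level = {}
        for cand in candidates:
            s = sum(1 for t in transactions if cand <= t) / n
            if s >= min_support:
                level[cand] = s
        support.update(level)
        size += 1
        # Join frequent itemsets into candidates one item larger and keep
        # only those whose every subset is frequent (Apriori pruning).
        candidates = {
            a | b
            for a in level for b in level
            if len(a | b) == size
            and all(frozenset(sub) in level for sub in combinations(a | b, size - 1))
        }
    return support


def association_rules(transactions, min_support=0.01, min_confidence=0.5):
    """Return (antecedent, consequent, support, confidence) tuples."""
    support = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, s in support.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            confidence = s / support[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, item, s, confidence))
    return rules


# Toy transactions (hypothetical labels), one set per compound.
transactions = [
    {"scaffold:flavonol", "OH@3", "GO:term_A"},
    {"scaffold:flavonol", "OH@3", "GO:term_A"},
    {"scaffold:flavonol", "OH@3"},
    {"scaffold:flavone", "GO:term_B"},
]
rules = association_rules(transactions, min_support=0.01, min_confidence=0.5)
for antecedent, consequent, s, c in rules:
    if consequent.startswith("GO:"):
        print(sorted(antecedent), "->", consequent, f"support={s:.2f} confidence={c:.2f}")
```

Only rules whose consequent is a GO term are printed, since the question of interest is which structural fragments imply which function.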

## RESULTS

### Pathway Enrichment Analysis of Flavonoids' Targets

The biological function of flavonoids' targets was deciphered through pathway enrichment analysis based on the background pathway dataset (**Figure 2** and **Supplementary Table 2**). The results showed that the targets of flavonoids were enriched in multiple essential pathways, including metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, and multiple pathways related to human diseases such as infectious diseases and cancer. For instance, in environmental information processing, flavonoids' targets were enriched in multiple cell signaling pathways, including the MAPK, PI3K-Akt, FoxO, and cAMP signaling pathways. In cellular processes, flavonoids can significantly regulate pathways such as apoptosis, focal adhesion, cell cycle, and autophagy. Further, flavonoids' targets were significantly enriched in several organismal systems, including the immune, endocrine, and nervous systems. Among immune system-related pathways in particular, flavonoids' targets were enriched in Th17 cell differentiation, the IL-17 signaling pathway, and the Toll-like and NOD-like receptor signaling pathways. Besides, multiple flavonoids' targets can be found in endocrine system pathways, such as progesterone-mediated oocyte maturation and the GnRH, oxytocin, and thyroid hormone signaling pathways. Also, nervous system-related pathways such as the serotonergic synapse and neurotrophin signaling pathways were enriched in the corresponding targets. Moreover, flavonoids' targets were present in pathways of major human diseases such as multiple cancers, insulin resistance, and infectious diseases including HTLV-1 infection, Epstein–Barr virus infection, and hepatitis B.

FIGURE 1 | Structures and targets information of flavonoids. (A) Core scaffold structures of six flavonoid subclasses. (B) Target number of different flavonoid subclasses.

Besides the commonly enriched pathways, different flavonoid subclasses showed different preferences. For instance, targets of flavanonol and flavan-3-ols were more significantly enriched in nitrogen metabolism pathways than those of other subclasses. Targets of isoflavones, flavanones, and flavonols were enriched in metabolism pathways such as the lipid, retinol, and drug metabolism pathways. Flavones' targets were significantly enriched in the MAPK signaling pathway and the neurotrophin signaling pathway, which suggests that natural flavone products may have therapeutic effects on neurological diseases. Previous research indicated that flavones such as apigenin and luteolin could activate Nrf2 antioxidant response element (ARE)-mediated gene expression and induce anti-inflammatory activities through the PI3K and MAPK signaling pathways (Paredes-Gonzalez et al., 2015). Also, both compounds could significantly increase the endogenous mRNA and protein levels of Nrf2 and Nrf2 target genes, with important effects on heme oxygenase-1 (HO-1) expression, thus leading to cytoprotective effects and neurite outgrowth (Lin et al., 2010; Zhao et al., 2013; Zhang et al., 2015). In addition, the corresponding targets of flavan-3-ols and flavanonol were enriched in cancer-related pathways. Natural flavan-3-ol products such as (-)-epigallocatechin gallate (EGCG), (-)-epicatechin gallate (ECG), (-)-epigallocatechin (EGC), and (-)-epicatechin (EC) are flavan-3-ols found in green tea that could possibly help prevent cancers (Henning et al., 2013; Yang C.S. et al., 2014). Although flavonoids have similar biological functions based on the same core scaffold, the above results indicate that flavonoid subclasses with different chemical structures have different biological functions. Thus, the therapeutic selection and clinical application of flavonoid subclasses differ from one another.

### Functional Difference Between Flavonoids and Other Natural Plant Products

To further discover the functional differences between flavonoids and other natural plant products, the specifically enriched pathways of flavonoids' targets were analyzed by setting other natural plant products as background. The results showed that flavonoids were enriched in cancer-related pathways compared with other natural products (**Figure 3** and **Supplementary Table 3**). Among them, isoflavones and flavones were enriched in multi-cancer related pathways, flavan-3-ols could regulate the pathway of microRNAs in cancer, and isoflavones were significantly enriched in the breast cancer pathway, indicating the potential anti-cancer preferences of flavonoid subclasses. Notably, natural flavan-3-ol products such as EGCG could alter epigenetic processes through DNA methylation, histone modification, and regulation of miRNAs such as miR-92, miR-93, miR-106, miR-7-1, miR-34a, and miR-99a (Chakrabarti et al., 2012), which could provide anti-cancer and cardiovascular protection (Henning et al., 2013; Yang C.S. et al., 2014). Also, soy isoflavones, including genistein, daidzein, and their corresponding glucosides, were reported through meta-analysis to reduce the risk of breast cancer (Yamamoto et al., 2003; Dong and Qin, 2011). In vitro, these isoflavones could significantly restrain the growth of human breast cancer cells (Peterson and Barnes, 1991). In addition to cancer-related pathways, flavonoids were enriched in metabolic, steroid hormone biosynthesis, replication and repair, adherens junction, insulin signaling, and several disease-related pathways. Meanwhile, discrepancies were found between different flavonoid subclasses. For example, although flavonoids showed biological functions in multiple nervous system diseases, only flavones were significantly enriched in Alzheimer's disease-related pathways.
Previous research showed that flavone derivatives acting on different targets could elicit varied pharmacological properties with various substitution patterns, including anti-oxidant, anti-cancer, and neuroprotective activities (Singh et al., 2014). Those derivatives also showed good binding affinity to Aβ aggregates and high brain penetration, which illustrates their potential therapeutic utility for Alzheimer's disease (Ono et al., 2005, 2007). Also, isoflavones were significantly enriched in Huntington's disease-related pathways, which may be related to isoflavone-mediated autophagy (Pierzynowska et al., 2017, 2018). Generally, both commonalities and discrepancies were found in flavonoid-enriched pathways, which represent the specific biological functions and potential therapeutic utility of different flavonoid subclasses.

### Network Pharmacology and Modularization Analysis

To globally view the enriched pathways of flavonoids, the network of all enriched targets for the six flavonoid subclasses was analyzed and decomposed into eight modules. In **Figure 4**, the size of each node represents the ratio (= number of target-related compounds / total number of compounds) for that target. Targets mapped to the same module are marked in the same color, and the networks of targets in the six major modules were analyzed through KEGG and GO to discover the functions of flavonoids (**Table 1**); the top 10 enriched pathways and GO terms are listed in **Supplementary Table 4**. The Entrez Gene IDs and symbols in each module are listed in **Supplementary Table 5**.
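The node-size ratio described for Figure 4 can be computed directly from a compound-to-target mapping. The sketch below is a minimal illustration under the assumption that each compound is stored with its set of target symbols. The function name and the toy compounds are hypothetical, and the gene symbols are used only as example labels.

```python
def target_ratios(compound_targets):
    """Fraction of compounds that act on each target.

    compound_targets: dict mapping compound name -> set of target symbols.
    """
    total = len(compound_targets)
    counts = {}
    for targets in compound_targets.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    # ratio = number of target-related compounds / total number of compounds
    return {target: count / total for target, count in counts.items()}


# Toy mapping (hypothetical compounds).
toy = {
    "compound_1": {"CYP1A1", "APEX1"},
    "compound_2": {"CYP1A1"},
    "compound_3": {"MAPT", "CYP1A1"},
    "compound_4": {"APEX1"},
}
ratios = target_ratios(toy)
for target, ratio in sorted(ratios.items()):
    print(target, ratio)
```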

Targets in module 1 were mainly enriched in the epoxygenase P450 pathway, the VEGF signaling pathway, and the fluid shear stress and atherosclerosis pathway, which are related to cardiovascular regulation such as vascular dilatation. Meanwhile, the enrichment of pathways for FoxO signaling, mitotic cell cycle regulation, cell cycle arrest, and negative regulation of cell cycle showed that cell cycle-related functions are also reflected in module 1. For module 2, targets were significantly enriched in pathways for T-cell receptor signaling, NF-kappa B signaling, inflammatory mediator regulation of TRP channels, immune response-regulating cell surface receptor signaling, immune response-activating cell surface receptor signaling, and immune response-activating signal transduction, which indicated that the function of module 2 was related to immune inflammation. Further, targets in module 3 were significantly enriched in pathways of insulin resistance, peptidyl-serine phosphorylation, peptidyl-serine modification, cellular response to nitrogen compound, cellular response to organonitrogen compound, positive regulation of kinase activity, and cellular response to hormone stimulus, which illustrated that module 3 was closely related to cell response to stimulation and hormone regulation. Moreover, the enrichment in neuroactive ligand-receptor interaction, cAMP signaling, calcium signaling, serotonergic synapse, cGMP-PKG signaling, and dopaminergic synapse pathways reflected that the targets in module 4 were related to neuromodulation and signal transduction. In addition, module 5 was found to be related to human diseases such as cancer and viral diseases, since most targets were enriched in the viral carcinogenesis pathway and in cancer- and infectious disease-related pathways. For module 6, targets were enriched in flavonoid glucuronidation, the glucuronate pathway, ascorbate and aldarate metabolism, and pentose and glucuronate interconversions, which meant that the functions of module 6 were mainly embodied in the process of glyoxylic acid metabolism.

#### Module Mapping of Different Flavonoid Subclasses

In order to understand the functional differences among flavonoid subclasses, targets of the six flavonoid subclasses were mapped onto the above modules. Major nodes reflected the common targets for each flavonoid subclass.

Starting with flavan-3-ols, the targets were mainly distributed in modules 1, 2, 3, and 5 and were generally enriched in pathways of cancer, fluid shear stress and atherosclerosis, and AGE-RAGE signaling in diabetic complications, which are related to cardiovascular function, cell cycle regulation, and cancer (**Supplementary Figure 1**). For example, MAPK14 in module 3 was found to participate in multiple cellular processes including cell proliferation, differentiation, transcriptional regulation, and development (Young, 2013). Also, MAPK14 may be related to atherosclerosis (Cheng et al., 2017). Further, BCL2 in module 2 is a therapeutic target for chronic lymphocytic leukemia, since it can regulate lymphocytes in blood by hindering cell apoptosis (Ruefli-Brasse and Reed, 2017; Tahir et al., 2017), and PGD in module 5 was related to human cervical carcinoma (Lee et al., 2014).

Targets of flavanones were mainly distributed in modules 1, 4, and 5, with several scattered in modules 2 and 3, which illustrated the anti-cancer and antioxidant pharmacological activities of flavanones (**Supplementary Figure 2**). For example, nodes such as CYP1A1, CYP1A2, and CYP1B1 in module 1 belong to the cytochrome P450 (CYP) family, which is enriched in the epoxygenase P450 pathway and related to cardiovascular functions such as vascular ectasia. APEX1 in module 2 was found to affect cancer RNA metabolism and triple-negative breast cancer (Antoniali et al., 2017; Chen et al., 2017).

Flavanonol had relatively fewer targets than the other subclasses, and these were spread across different modules, which means that the functions of flavanonol are quite scattered (**Supplementary Figure 3**). Similar to flavanones, nodes such as CYPs in module 1 and APEX1 in module 2 were also detected for flavanonol, which indicated its potential cardiovascular- and cancer-related functions. Meanwhile, MAPT in module 2 was found to be closely related to neurodegenerative diseases, including Parkinson's disease (Beevers et al., 2017).

Targets of flavones (**Supplementary Figure 4**) and flavonols (**Supplementary Figure 5**) were distributed in all six major modules, which indicated the broad functions of compounds from those two subclasses. Besides common nodes such as CYPs, APEX1, and MAPT, which reflect the same cancer, cardiovascular, and neurodegenerative functions as in other flavonoid subclasses, flavones contain other nodes such as ALDH1A1 in module 5, which reflects potential associations with cancer invasion (Yao et al., 2017; Li et al., 2018).

Isoflavones are a type of naturally occurring isoflavonoid that acts as a phytoestrogen in mammals; their targets were mainly distributed in modules 1, 2, and 5 (**Supplementary Figure 6**). Previous research indicated that BRCA1 in module 1 was associated with the risk of estrogen receptor-negative breast cancer (Milne et al., 2017), NFE2L2 in module 3 was related to cell multisystem disorder (Huppke et al., 2017), and TP53 in module 5 was related to human immunodeficiency virus-related head and neck squamous cell carcinoma (Gleber-Netto et al., 2018).

### Structure Activity Relationship Analysis of Flavonoids

In order to explore the causes of the functional similarities and differences among flavonoid subclasses, the structure–activity relationships of different flavonoids were analyzed. By calculating the structural and physicochemical properties of natural flavonoid products, structural differences among flavonoid subclasses can be identified and used to infer their potential effects on biological function (**Supplementary Figure 7**). The results showed that the number of rotatable bonds (NumRotatableBonds) and the molecular weight (Weight) are quite similar across subclasses; however, differences can be detected in H-bond acceptors (HAcceptor), H-bond donors (HDonor), the lipid-water partition coefficient (LogP), and the topological polar surface area (TPSA). LogP and TPSA can affect the absorption and distribution of a drug, which needs a certain degree of solubility and an appropriate lipid-water distribution to be effective. Further, for nervous system activity, drugs with greater liposolubility may pass the blood–brain barrier (BBB) more easily (Yang Y. et al., 2014). Non-polar structural fragments such as alkyl groups, halogen atoms, and aliphatic rings increase the liposolubility of molecules. Meanwhile, TPSA has a great impact on the cell penetration of drug molecules. Accordingly, the TPSA should be relatively low for drugs that need to cross the BBB and act on receptors of the central nervous system (Mehdipour and Hamidi, 2009). Natural products of flavones, flavanones, and isoflavones have larger LogP and lower TPSA than other flavonoids, which indicates their potential to cross the BBB (**Figure 5**). For example, apigenin (a flavone), genistein (an isoflavone), hesperidin (a flavanone), and rutin, quercetin, and kaempferol (flavonols) would be able to cross the BBB (**Figure 5**). Among them, genistein and apigenin could cross the BBB more readily because of their larger LogP and lower TPSA (Yang Y. et al., 2014), which suggests that other flavonoids with appropriate LogP and TPSA values may share this ability.

Y-axis represents the value of TPSA.

Further, in order to evaluate the drug-likeness of flavonoids, compliance with Lipinski's Rule of Five (ROF) was analyzed for the different flavonoid subclasses (**Supplementary Figure 8**). For flavones, flavonols, flavanones, flavanonol, and isoflavones, nearly half of the compounds pass the ROF, whereas for flavan-3-ols the percentage of ROF-passing compounds is extremely low, which indicates that drug-likeness differs among flavonoid subclasses.

By mining the relationship between chemical constitution fragments and biological effects with Apriori (Agrawal and Srikant, 1994), the results showed that the core scaffold and side chains of flavonoids can significantly affect biological functions (**Figure 6**). For example, in rules 01–07, a side chain such as hydroxyl at position 1 on the core scaffold of flavanones may assist the negative regulation of the PERK-mediated unfolded protein response. Also, in rules 09–13, hydroxyl side chains at positions 1, 3, 10, 11, and 12 on the core scaffold of flavonols were closely related to error-prone translesion synthesis. Among them, natural products such as myricetin, robinetin, and tricetin could protect against hydrogen peroxide-induced DNA damage and might reduce the risk of multiple cancers (Huang and Ferraro, 1992; Shelby et al., 1997). Meanwhile, natural products with the flavonol core scaffold and O-methyl side chains at different positions were associated with cellular iron ion homeostasis (rules 14–17) and microtubule-based process (rules 18–20). Besides the above general rules, individual rules can also be found in **Figure 6**. For example, the flavone core scaffold combined with hydroxyl side chains at positions 1, 3, 11, and 12 was related to base-excision repair and base-free sugar-phosphate removal (rule 21). Previous studies indicated that the number and position of glycoside and hydroxyl groups in flavonoids affect their permeation ability (Yang Y. et al., 2014). We also found that hydroxyl side chains at positions 1 and 3 combined with a hydrocarbyl side chain at position 2, which were related to the neurotransmitter receptor biosynthetic process (rule 22), increase liposolubility and enhance transmembrane ability. Notably, the bioactivity of molecules that meet rule 22 is more than 14.68 times that of other molecules in the flavonoid families (**Supplementary Table 6**). Analogously, flavonols that meet rule 23, which contain a pentose at position 1 and hydroxyl groups at positions 3 and 11, show bioactivity 32.37 times that of others for the function of DNA topological change. Natural products such as kaempferol glycoside could target DNA topoisomerase, which is closely related to DNA replication and the cell cycle (Vega et al., 2007; Baikar and Malpathak, 2010). Rule 24 indicates that multiple O-methyl groups in the side chains benefit the function of sphingolipid translocation. Moreover, rules 25 and 26 illustrate that different side chain components may have potential effects on the regulation of the prostaglandin biosynthetic process and vasoconstriction.

### DISCUSSION

In this article, a comprehensive analysis was performed to explore the MOA of natural flavonoid products. The results indicated that flavonoids could affect essential pathways in several categories, such as metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, and human disease-related pathways. Among them, the enrichment in human disease-related pathways illustrated the multifaceted therapeutic applications of flavonoids, which could affect multiple human diseases such as cancers, neurological diseases, diabetes, and infectious diseases. Compared with other natural plant products, flavonoids were significantly enriched in the pathways of breast cancer, Huntington's disease, Alzheimer's disease, insulin resistance, and drug resistance. Also, a systematic analysis of the targets of different flavonoid subclasses showed that targets such as MAPT, APEX1, and ALDH1A1, which are closely related to the nervous system and cancer, were significantly enriched in almost all flavonoid subclasses. This multifaceted therapeutic ability indicates the utility of flavonoids for cancer and nervous system-related drug discovery.

Besides common biological functions, specific functions of different flavonoid subclasses were also analyzed and detected at both the target and pathway levels. For example, flavones and isoflavones were more significantly enriched in multiple cancer-related pathways than other subclasses, which indicates their potential therapeutic utility in cancer treatment. Also, flavan-3-ols were found to act on cellular processing and lymphocyte regulation, flavones specifically acted on cardiovascular-related activities, and isoflavones were closely related to cell multisystem disorders. Differences in the structural and physicochemical properties of natural flavonoid products may relate to these functional differences, and can be detected in physicochemical properties including H-bond acceptors, H-bond donors, lipid-water partition coefficient (LogP), and topological polar surface area (TPSA). Notably, LogP and TPSA are closely related to the absorption and distribution of chemical components in drugs, since appropriate solubility and lipid-water distribution coefficient play essential roles in drug efficacy (Avdeef, 2001). For example, drugs that act in the central nervous system require higher liposolubility, which can be increased by non-polar structural fragments such as alkyl groups, halogen atoms, and aliphatic rings. Meanwhile, TPSA can affect the cell penetration of drug molecules. Previous research indicated that, in order to pass through the BBB and act on receptors in the central nervous system, the polar surface area of a drug should be less than 90 square angstroms (van de Waterbeemd et al., 1998). Thus, flavonoid natural products in the flavone, flavanone, and isoflavone subclasses, which have larger LogP and lower TPSA, have the ability to pass through the BBB with potential activities.

Since flavonoids share the same core scaffold, the functional differences were mainly related to the substituent groups. The relationship between chemical structure fragments and biological effects indicated that different side chains can significantly affect the activity of flavonoids on the same target. Flavonoids whose structures meet the corresponding rules can show bioactivities enhanced by dozens of times. For the 26 rules summarized in this article, the bioactivities were increased more than threefold. Among them, seven rules could enhance the bioactivities more than 10-fold, and two rules (rules 23 and 26) could increase the activities 30-fold (**Supplementary Table 6**). Considering the substituent groups and positions of side chains, the relationship between structure and bioactivity analyzed here may help to enhance the understanding of flavonoids and their potential for new drug discovery.

#### AUTHOR CONTRIBUTIONS

TQ and KT wrote the manuscript. DW and LY conceived and designed the experiments. HY and TQ analyzed and interpreted the results. TQ and QW modified the manuscript. ZC and KT supervised the project. All authors have read and approved the final version.

### FUNDING

This work was supported in part by National Key R&D Program of China (SQ2017YFC170310), National Natural Science Foundation of China (31671379), the National Postdoctoral Program for Innovative Talents (BX201600033), and the China Postdoctoral Science Foundation Funded Project (2017M611451).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.00918/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Qiu, Wu, Yang, Ye, Wang, Cao and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


## A Hybrid Interpolation Weighted Collaborative Filtering Method for Anti-cancer Drug Response Prediction

#### Lin Zhang<sup>1</sup> , Xing Chen<sup>1</sup> \*, Na-Na Guan<sup>2</sup> , Hui Liu<sup>1</sup> and Jian-Qiang Li <sup>2</sup>

*<sup>1</sup> School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China, <sup>2</sup> College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China*

#### Edited by:

*Lixia Yao, Mayo Clinic, United States*

#### Reviewed by:

*Yanshan Wang, Mayo Clinic, United States Chen Li, Xi'an Jiaotong University, China Chen Wang, Mayo Clinic, United States*

> \*Correspondence: *Xing Chen xingchen@amss.ac.cn*

#### Specialty section:

*This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology*

Received: *09 April 2018* Accepted: *22 August 2018* Published: *12 September 2018*

#### Citation:

*Zhang L, Chen X, Guan N-N, Liu H and Li J-Q (2018) A Hybrid Interpolation Weighted Collaborative Filtering Method for Anti-cancer Drug Response Prediction. Front. Pharmacol. 9:1017. doi: 10.3389/fphar.2018.01017*

Individualized therapies ask for the most effective regimen for each patient, while patients' responses may differ from each other. However, it is impossible to clinically evaluate each patient's response due to the large population. Human cell lines harbor most of the same genetic changes found in patients' tumors, and thus are widely used to help understand initial responses to drugs. Based on the more credible assumption that similar cell lines and similar drugs exhibit similar responses, we formulated drug response prediction as a recommender system problem, and then adopted a hybrid interpolation weighted collaborative filtering (HIWCF) method to predict anti-cancer drug responses of cell lines by incorporating cell line similarity and drug similarity derived from gene expression profiles and drug chemical structures, as well as drug response similarity. Specifically, we estimated the baseline based on the available responses. The similarity score for each cell line pair and each drug pair was then shrunk and weighted by the correlation coefficient drawn from the known responses of the pair. Before being used to find the K most similar neighbors for prediction, the scores went through a case amplification strategy to emphasize high similarity and neglect low similarity. In the last step, cell line-oriented and drug-oriented collaborative filtering models were carried out, and the average of the predicted values from both models was used as the final predicted sensitivity. Through 10-fold cross validation, this approach was shown to reach accurate and reproducible outcomes for the missing drug sensitivities. We also found that the drug response similarity between cell lines or drugs may play an important role in the prediction. Finally, we discussed the biological outcomes based on the newly predicted response values in the GDSC dataset.

Keywords: anti-cancer drug response, drug response prediction, recommender system, collaborative filtering, interpolation weighted method

## INTRODUCTION

One of the top challenges in individualized therapies is the choice of the most effective chemotherapeutic regimen for each patient, since the administration of ineffective chemotherapy may increase mortality and decrease quality of life in cancer patients (Chen et al., 2013). Thus, it is urgent to evaluate each patient's possible response to each chemotherapeutic regimen to make sure the regimens applied are most likely to be effective. To address this problem, extensive patient drug screening projects need to be carried out so as to unveil significant drug response patterns. However, the large population of cancer patients, combined with the large number of drugs, has become the bottleneck.

To circumvent this issue in the context of cancer, some large drug screening projects have been carried out using cancer cell lines instead of individual cancer patients. These include the NCI-60 panel, the Genomics of Drug Sensitivity in Cancer (GDSC) project, and the Cancer Cell Line Encyclopedia (CCLE) project (Boyd and Paull, 1995; Barretina et al., 2012; Yang et al., 2013). The NCI-60 study was pioneered by the US National Cancer Institute (NCI) to assemble the NCI-60 tumor cell line panel, which has been assayed for its sensitivity to over 130,000 compounds and has been extensively profiled at the biological level (Shoemaker, 2006). It has been useful for the development of computational approaches aiming at linking drug sensitivity with genotype profiles (Shoemaker et al., 1988; Weinstein et al., 1997; Garnett et al., 2012). The GDSC project is, to date, the largest public resource for information on drug sensitivity in human cancer cell lines and molecular markers of drug response. It pioneered the combination of drug and cell line information, including gene expression, gene copy number variations, and mutation profiles, for drug sensitivity prediction (Garnett et al., 2012; Yang et al., 2013). It systematically addressed the issue of predictive biomarker identification by collectively analyzing clinically relevant human cell lines and their pharmacological profiles for the corresponding cancer drugs. The other widely used database, CCLE (Barretina et al., 2012), collects gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines, coupled with pharmacological profiles for 24 anti-cancer drugs across 479 of the cell lines. It allows identification of genetic, lineage, and gene expression-based predictors of drug sensitivity.

Corresponding to the large-scale datasets screened on cultured human cell line panels, many computational methods have been developed for the elucidation of the response mechanism of anti-cancer drugs, most commonly multivariate linear regression (with LASSO and elastic net regularizations) and nonlinear regression (e.g., neural networks and some kernel-based methods; Barretina et al., 2012; Garnett et al., 2012; Heiser et al., 2012; Menden et al., 2013; Yang et al., 2013; Costello et al., 2014). Daemen et al. used least squares-support vector machine and random forest to identify drug response-associated molecular features in breast cancer (Daemen et al., 2013). Based on the NCI-60 panel, a weighted voting classification model, an ensemble regression model using Random Forest, and a simultaneous machine learning model of chemical and cell line information have been developed to predict anti-cancer drug sensitivity (Staunton et al., 2001; Riddick et al., 2011; Cortes-Ciriano et al., 2016). Based on the GDSC dataset, Ammad-ud-din et al. developed a kernelized Bayesian matrix factorization (KBMF) method to integrate genomic and chemical properties as well as drug target information for drug sensitivity prediction (Ammad-ud-din et al., 2014). Sheng et al. predicted unseen drug responses by calculating a weighted average of observed drug responses based on drug-specific cell line similarity and drug structure similarity (Breese et al., 1998). Liu et al. proposed a dual-layer cell line-drug integrated network (DLN) model, which integrated both cell line and drug similarity network data, to predict missing drug responses (Zhang et al., 2015). Wang et al. proposed the HNMDRP method, incorporating gene expression, chemical structure, drug target, and protein-protein interaction information to predict missing values of drug responses in cell lines (Zhang et al., 2018). Based on the transcriptomic data from both GDSC and CCLE, Kim et al. developed a network-based classifier for predicting the sensitivity of cell lines to anti-cancer drugs (Kim et al., 2016). Based on the same datasets, Wang et al. proposed a similarity-regularized matrix factorization (SRMF) method for drug response prediction, which incorporates similarities of drugs and of cell lines simultaneously (Wang et al., 2017). Stanfield et al. proposed a heterogeneous network-based method to predict the interaction between cell line-drug pairs (Stanfield et al., 2017). They classified the interaction between each cell line-drug pair as sensitive or resistant, thus turning the prediction problem into a classification problem. Current methods define similarity from genomic or transcriptomic profiles and from drug structure, typically by calculating the Pearson correlation coefficient for genomic profiles or the Jaccard coefficient for drug chemical fingerprints; these similarities are referred to as COEF in the following. However, the similarity exhibited through drug sensitivity, which can be defined by calculating the Pearson correlation coefficient based on drug responses, has not been considered yet; it is referred to as RPCC in the following. Nor has the combination of COEF and RPCC, which is referred to as MRPCC (Multiplication of COEF and RPCC) throughout the paper. Drug-target interaction and PPI networks have also been considered to improve the prediction performance (Chen et al., 2012; Stanfield et al., 2017).

Given the relatively more credible assumption that similar cell lines and similar drugs exhibit similar drug responses (Zhang et al., 2015), the prediction of missing drug responses can be treated as a typical Recommender System (RS) problem (Adomavicius and Tuzhilin, 2005). Typically, in a recommender system, there is a set of users and a set of items. Each user rates a set of items by some values. The recommender system attempts to profile user preferences and to model the interaction between users and items, which is exactly what is needed for drug response prediction. The cell lines correspond to users, while the drugs correspond to items. From the RS perspective, the similarity shown through drug sensitivity is also very important for missing value prediction. Thus, we improved an RS technique, Hybrid Interpolation Weighted Collaborative Filtering (HIWCF; the acronyms defined in this paper are listed in **Table 1**), for drug response prediction, which incorporates similarities of drugs and of cell lines in addition to the known drug responses. The key source code and ready-to-use CCLE and GDSC datasets are provided at https://github.com/laureniezhang/HIWCF. To demonstrate its effectiveness, we compared HIWCF with SRMF and KBMF, which have been shown to perform better than typical similarity-based methods. The evaluation metrics used were the averaged Pearson correlation coefficient (PCC) and the averaged root mean square error (RMSE) over all drugs. The results of 10-fold cross validation on the GDSC and CCLE drug response datasets showed that similarity defined from drug response is more dependable for unknown response prediction, and that the incorporation of gene expression profile, drug response, and drug structure similarity helps to improve the prediction performance. Finally, HIWCF was applied to impute the unknown drug response values in the GDSC dataset for further evaluation.

#### MATERIALS AND METHODS

#### Data and Preprocessing

In this paper, two datasets were used for performance evaluation, both consisting of large-scale genomic expression profiles, pharmacologic profiling of drug compounds, and experimentally determined drug response measurements, given either as IC50 values (the concentration of a drug compound that achieves 50% absolute inhibition in vitro, given as the natural log of µM) or as experimental activity areas. Large-scale genomic expression profiles were normalized across cell lines to draw the similarity matrix of cell lines. The chemical structures of drug compounds were used to draw the similarity matrix of drugs.

The first dataset is from the GDSC project (http://www.cancerrxgene.org/), consisting of 139 drugs and a panel of 790 cancer cell lines (release 5.0). We selected 652 cell lines for which both drug response data and gene expression were available, and 135 drugs whose SDF files (encoding the chemical structure of the drugs) were available. The drug response is given as IC50 values (70,676 data points, matrix 80.3% complete).

The second dataset is from the CCLE project (http://www.broadinstitute.org/ccle) and consists of 1,036 human cancer cell lines and 24 drugs. We selected 491 cell lines and 23 drugs following the same rule used for the GDSC dataset. The drug response is given as activity areas (10,870 data points, matrix 96.25% complete). Both ready-to-use datasets are available on GitHub at https://github.com/laureniezhang/HIWCF.
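As a quick consistency check, the reported matrix completeness can be recomputed from the counts given above. The following Python sketch is illustrative only; the function name `matrix_completeness` is hypothetical and is not part of the published HIWCF code.

```python
def matrix_completeness(n_known: int, n_cell_lines: int, n_drugs: int) -> float:
    """Fraction of cell line-drug pairs that have a measured response."""
    return n_known / (n_cell_lines * n_drugs)


# Counts taken from the dataset descriptions above.
gdsc = matrix_completeness(70_676, 652, 135)
ccle = matrix_completeness(10_870, 491, 23)
print(f"GDSC: {gdsc:.1%}, CCLE: {ccle:.2%}")
```

Both values agree with the completeness figures stated in the text when rounded to the same precision.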

#### Problem Formulation

We treat anti-cancer drug response prediction as an RS problem where each cell line-drug pair is a typical user-item pair. Based on the finding that cell lines with similar gene expression profiles exhibit similar responses to the same drug (Zhang et al., 2015), we proposed a weighted interpolation collaborative filtering method to approximate the sensitivity of cell line u to drug i. For convenience, we reserve special indexing letters for distinguishing cell lines from drugs: u, v for cell lines, and i, j for drugs. We are given drug responses for m cell lines and n drugs, arranged as an m × n matrix R = {r<sub>ui</sub>}<sub>1≤u≤m, 1≤i≤n</sub>, where a higher activity area or a lower IC50 value means a better sensitivity of a cell line to a given drug.

#### Baseline Estimate Strategy

Since typical CF data often exhibit large user and item effects, that is, systematic tendencies for some users to give higher ratings than others and for some items to receive higher ratings than others, we first adjusted the rating data by accounting for these effects, which we include in the baseline estimate strategy. Let µ denote the overall average drug response. We denote the estimated baseline for an unknown rating r̂<sub>ui</sub> as b<sub>ui</sub>, which accounts for the above-mentioned user and item effects:

$$b\_{\text{ui}} = \mu + b\_{\text{u}} + b\_{\text{i}} \tag{1}$$

The parameters b<sub>u</sub> and b<sub>i</sub> indicate the observed deviations of cell line u and drug i, respectively, from the average.

In order to get the baseline formulation, for each cell line u, we set:

$$b\_{u} = \frac{\sum\_{i \in U(u, i)} (r\_{ui} - \mu - b\_{i})}{\lambda\_3 + |U(u, i)|} \tag{2}$$

where the drug deviation b<sub>i</sub> is estimated beforehand, for each drug i, as:

$$b\_{i} = \frac{\sum\_{u \in U(u, i)} (r\_{ui} - \mu)}{\lambda\_2 + |U(u, i)|} \tag{3}$$

where U(u, i) is the set of drugs that have responses in cell line u (in Equation 2), or the set of cell lines that have responses to drug i (in Equation 3), and |U(u, i)| is the number of elements in U(u, i). λ<sub>2</sub> and λ<sub>3</sub> are regularization parameters that help to shrink the averages b<sub>i</sub> and b<sub>u</sub> toward zero. They are set to 5 and 2, respectively, in the following simulation process.
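The baseline estimate of Equations 1–3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the response matrix is assumed to be a list of lists with `None` marking unknown responses, and the function names `baseline_estimates` and `baseline` are hypothetical.

```python
def baseline_estimates(R, lam2=5.0, lam3=2.0):
    """Return (mu, b_u, b_i) for a response matrix R with None for unknowns."""
    m, n = len(R), len(R[0])
    known = [(u, i, R[u][i]) for u in range(m) for i in range(n)
             if R[u][i] is not None]
    mu = sum(r for _, _, r in known) / len(known)

    # Equation 3: drug deviations, shrunk toward zero by lam2.
    b_i = [0.0] * n
    for i in range(n):
        dev = [r - mu for (_, j, r) in known if j == i]
        b_i[i] = sum(dev) / (lam2 + len(dev))

    # Equation 2: cell line deviations, computed after b_i, shrunk by lam3.
    b_u = [0.0] * m
    for u in range(m):
        dev = [r - mu - b_i[i] for (v, i, r) in known if v == u]
        b_u[u] = sum(dev) / (lam3 + len(dev))
    return mu, b_u, b_i


def baseline(mu, b_u, b_i, u, i):
    """Equation 1: b_ui = mu + b_u + b_i."""
    return mu + b_u[u] + b_i[i]


# Toy 3 x 3 response matrix (assumed values, for illustration only).
R = [[1.0, 2.0, None],
     [2.0, None, 4.0],
     [3.0, 4.0, 5.0]]
mu, b_u, b_i = baseline_estimates(R)
print(baseline(mu, b_u, b_i, 0, 2))  # baseline for the unknown pair (0, 2)
```

The drug deviations are computed first because the cell line deviations in Equation 2 subtract them.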

TABLE 2 | The comparison results between HIWCF with different similarity definition (MRPCC/RPCC/COEF), SRMF, and KBMF obtained under 10-fold cross validation on CCLE dataset.


TABLE 3 | The comparison results between HIWCF with different similarity definition (MRPCC/RPCC/COEF), SRMF, and KBMF obtained under 10-fold cross validation on GDSC dataset.


FIGURE 1 | The drug similarity *RPCC* and *COEF* of 23 drugs in CCLE dataset. (A) The plot shows *RPCC* similarity for 23 drugs in CCLE dataset. (B) The plot shows *COEF* similarity for 23 drugs in CCLE dataset.


#### Similarity Definition

The similarity matrices are required for identification of the K nearest neighbors. The original similarity of cell lines was drawn from the Pearson correlation coefficient between the gene expression profiles of cell lines u and v, which is indicated as COEF<sub>c<sub>uv</sub></sub>. The c in the subscript refers to cell line-oriented. The similarity of drugs was drawn from the Jaccard coefficient between the chemical structures of drugs i and j, which is indicated as COEF<sub>d<sub>ij</sub></sub>. The d in the subscript refers to drug-oriented.
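The two COEF similarities can be illustrated with a short Python sketch. It assumes that expression profiles are plain lists of numbers and that each drug fingerprint is represented as the set of indices of its "on" bits; both representations, and the function names, are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def jaccard(fp_a, fp_b):
    """Jaccard coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


coef_cell = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # cell line COEF
coef_drug = jaccard({1, 2, 3}, {2, 3, 4})              # drug COEF
```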


However, to some extent, the similarity between cell lines u and v can also be shown from their drug responses. Thus, in this paper, we investigated the performance of different similarity definitions for drug response prediction. To be more specific, the similarity of cell lines u and v, indicated as MRPCC<sub>c<sub>uv</sub></sub>, was defined as the multiplication of COEF<sub>c<sub>uv</sub></sub> and RPCC<sub>c<sub>uv</sub></sub>, which helps the cell line pairs with consistent similarity in gene expression and drug response to get a higher rank for unknown response prediction.

$$\text{MRPCC}\_{\text{c}\_{\text{uv}}} \leftarrow \text{COEF}\_{\text{c}\_{\text{uv}}} \times \text{RPCC}\_{\text{c}\_{\text{uv}}} \tag{4}$$

where COEF<sub>c<sub>uv</sub></sub> was defined as the Pearson correlation of their gene expression profiles, while RPCC<sub>c<sub>uv</sub></sub> was defined as the correlation between the response IC50 values of cell lines u and v.

$$RPCC\_{\text{c}\_{\text{uv}}} = \frac{\sum \left( R\_{\text{u}\bullet} - \bar{R}\_{\text{u}\bullet} \right) \left( R\_{\text{v}\bullet} - \bar{R}\_{\text{v}\bullet} \right)}{\sqrt{\sum \left( R\_{\text{u}\bullet} - \bar{R}\_{\text{u}\bullet} \right)^{2} \sum \left( R\_{\text{v}\bullet} - \bar{R}\_{\text{v}\bullet} \right)^{2}}} \tag{5}$$

where R<sub>u•</sub> represents the response values of the u-th cell line, and R̄<sub>u•</sub> represents the mean of the u-th cell line's responses.

In the same way, the similarity between drugs i and j, indicated as MRPCC<sub>d<sub>ij</sub></sub>, was defined as the multiplication of COEF<sub>d<sub>ij</sub></sub> and RPCC<sub>d<sub>ij</sub></sub>.

$$\text{MRPCC}\_{\text{d}\_{\text{ij}}} = \text{COEF}\_{\text{d}\_{\text{ij}}} \times \text{RPCC}\_{\text{d}\_{\text{ij}}} \tag{6}$$

where COEF<sub>d<sub>ij</sub></sub> was defined as the Jaccard coefficient of their drug chemical fingerprints, while RPCC<sub>d<sub>ij</sub></sub> was defined as the Pearson correlation coefficient between the response IC50 values of drugs i and j.

$$RPCC\_{\text{d}\_{\text{ij}}} = \frac{\sum \left( R\_{\bullet i} - \bar{R}\_{\bullet i} \right) \left( R\_{\bullet j} - \bar{R}\_{\bullet j} \right)}{\sqrt{\sum \left( R\_{\bullet i} - \bar{R}\_{\bullet i} \right)^{2} \sum \left( R\_{\bullet j} - \bar{R}\_{\bullet j} \right)^{2}}} \tag{7}$$

where R<sub>•i</sub> represents the response values of the i-th drug, and R̄<sub>•i</sub> represents the mean of the i-th drug's responses.
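The response-based similarity RPCC and its combination with COEF into MRPCC can be sketched as follows. The text does not state how unknown responses are handled when computing the correlation; this sketch assumes that only entries observed in both response vectors are used, and that pairs with fewer than two shared entries get a similarity of zero. The function names are hypothetical.

```python
import math


def rpcc(resp_a, resp_b):
    """Pearson correlation over entries observed (not None) in both vectors."""
    pairs = [(a, b) for a, b in zip(resp_a, resp_b)
             if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0  # assumption: too little support to define a correlation
    ma = sum(a for a, _ in pairs) / len(pairs)
    mb = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - ma) * (b - mb) for a, b in pairs)
    den = math.sqrt(sum((a - ma) ** 2 for a, _ in pairs) *
                    sum((b - mb) ** 2 for _, b in pairs))
    return num / den if den else 0.0


def mrpcc(coef, resp_a, resp_b):
    """Equations 4 and 6: MRPCC = COEF x RPCC."""
    return coef * rpcc(resp_a, resp_b)


# Two response vectors with one entry missing in the first (toy values).
sim = mrpcc(0.5, [1.0, 2.0, None, 4.0], [2.0, 4.0, 9.0, 8.0])
```

The same two functions serve both the cell line-oriented case (rows of R) and the drug-oriented case (columns of R).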

In order to avoid the bias caused by the different levels of support (different numbers of known responses) for each cell line-drug pair, we also applied a shrinkage procedure to the similarity score, which is given by (Koren, 2010):

$$\boldsymbol{w}\_{i,j} \leftarrow \frac{|U(i,j)|}{|U(i,j)| + \lambda\_4} \boldsymbol{w}\_{i,j} \tag{8}$$

where |U(i, j)| is the number of cell lines that have responses to both drugs i and j, or the number of drugs that have responses in both cell lines i and j. w<sub>i,j</sub> is the similarity MRPCC<sub>c</sub> defined in (4) or MRPCC<sub>d</sub> defined in (6). λ<sub>4</sub> is a constant, which is set to 50 in the experiments.
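The shrinkage step of Equation 8 is a single multiplication. The sketch below is illustrative; the function name `shrink_similarity` is hypothetical.

```python
def shrink_similarity(w, n_common, lam4=50.0):
    """Equation 8: scale similarity w by its support n_common = |U(i, j)|."""
    return n_common / (n_common + lam4) * w


# A similarity of 0.8 backed by 50 shared responses is halved.
print(shrink_similarity(0.8, 50))
```

With λ<sub>4</sub> = 50, pairs with few shared responses are pulled strongly toward zero, while pairs with hundreds of shared responses keep most of their original score.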

In the following, we adopted a case amplification strategy, which refers to a transform applied to the weights used in the following collaborative filtering prediction, to reduce the noise in the data. The transform emphasizes high weights and punishes low weights by (Breese et al., 1998):

$$\boldsymbol{w}\_{i,j} \leftarrow \boldsymbol{w}\_{i,j} \cdot \left|\boldsymbol{w}\_{i,j}\right|^{\rho-1} \tag{9}$$

where ρ is the case amplification power, ρ ≥ 1, and we also followed the typical choice of ρ as 2.5 (Lemire, 2005).
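Case amplification as in Equation 9 keeps the sign of each weight and raises its magnitude to the power ρ. A minimal Python sketch, with the hypothetical function name `case_amplify`:

```python
def case_amplify(w, rho=2.5):
    """Equation 9: w <- w * |w|^(rho - 1); the sign of w is preserved."""
    return w * abs(w) ** (rho - 1)


for w in (1.0, 0.5, -0.5, 0.1):
    print(w, case_amplify(w))
```

Because similarities lie in [-1, 1], weights near 1 are almost unchanged while small weights shrink quickly, which is what lets the strongest neighbors dominate the prediction.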

### Drug Response Prediction Based on HIWCF Method

After removing the noise by the baseline estimate strategy, we need to predict the unknown sensitivity of cell line u to drug i, which is r̂<sub>ui</sub>. Based on the above-mentioned similarity measure w defined in (9), we first conducted drug-oriented CF, in which the k drugs with responses in cell line u that are most similar to drug i were identified. This set of k neighboring drugs is denoted by U(i; u). Then, based on w, we conducted cell line-oriented CF, in which the k cell lines with responses to drug i that are most similar to cell line u were identified. This set of k neighboring cell lines is denoted by U(u; i). Finally, the predicted value r̂<sub>ui</sub> is taken as the average of the weighted average of the responses of the neighboring drugs in U(i; u) and that of the neighboring cell lines in U(u; i), while adjusting for user and item effects through the baseline estimates:

$$\begin{split} \hat{r}\_{ui} &= b\_{ui} + \frac{1}{2} \left( \frac{\sum\_{j \in U(i;u)} \boldsymbol{w}\_{i,j} (r\_{uj} - b\_{uj})}{\sum\_{j \in U(i;u)} \boldsymbol{w}\_{i,j}} \right. \\ &\quad \left. + \frac{\sum\_{v \in U(u;i)} \boldsymbol{w}\_{u,v} (r\_{vi} - b\_{vi})}{\sum\_{v \in U(u;i)} \boldsymbol{w}\_{u,v}} \right) \end{split} \tag{10}$$

#### RESULTS

#### Similarity Exhibited in Drug Response Sensitivity Shows Leading Role in Prediction

FIGURE 5 | Scatter plots of observed and predicted drug activity area for four drugs in CCLE using HIWCF with MRPCC similarity. (A) Scatter plot of Irinotecan. (B) Scatter plot of PD-0325901. (C) Scatter plot of Panobinostat. (D) Scatter plot of Erlotinib.

We first conducted 10-fold cross validation to evaluate the performance of the different similarity definitions. With COEF, RPCC, and MRPCC as the similarity, the drug response prediction performance of HIWCF was evaluated on both the CCLE dataset and the GDSC dataset, with activity area or IC50 value as the drug response measurement, in comparison with KBMF and SRMF. The evaluation measures included the average PCC and RMSE between predicted and observed drug responses over all drugs. Considering that the sensitive and resistant cell lines of each drug are more valuable for unveiling mechanisms of drug action, we also included the PCC and RMSE from sensitive and resistant cell lines for each drug, which were denoted as PCC\_S/R and RMSE\_S/R (Wang et al., 2017).
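The drug-averaged PCC and RMSE can be computed as in the following sketch. It assumes that observed and predicted responses are stored per drug as equal-length lists over the same cell lines; the dictionary layout and function names are illustrative assumptions.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def rmse(x, y):
    """Root mean square error between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def average_over_drugs(observed, predicted):
    """Mean PCC and mean RMSE over all drugs (dict: drug -> responses)."""
    drugs = list(observed)
    mean_pcc = sum(pcc(observed[d], predicted[d]) for d in drugs) / len(drugs)
    mean_rmse = sum(rmse(observed[d], predicted[d]) for d in drugs) / len(drugs)
    return mean_pcc, mean_rmse


# Toy values: drug A is predicted perfectly, drug B in reverse order.
observed = {"A": [1.0, 2.0, 3.0], "B": [1.0, 2.0, 3.0]}
predicted = {"A": [1.0, 2.0, 3.0], "B": [3.0, 2.0, 1.0]}
avg_pcc, avg_rmse = average_over_drugs(observed, predicted)
```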

For each dataset, the drug response entries were randomly divided into 10 folds of almost the same size. Each time, one fold was used as the test set, while the remaining nine folds were used as the training set. The prediction was repeated 10 times such that each fold acted as the test set once. The whole cross-validation was run 100 times for each dataset, and the prediction performance is shown in **Tables 2**, **3**.
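The entry-wise split described above can be sketched as follows. The function name `k_fold_entries` and the fixed seed are assumptions for illustration; repeating the procedure with different seeds would correspond to the repeated runs of the whole cross-validation.

```python
import random


def k_fold_entries(known_entries, k=10, seed=0):
    """Randomly split known (cell line, drug) entries into k near-equal folds."""
    entries = list(known_entries)
    random.Random(seed).shuffle(entries)
    return [entries[f::k] for f in range(k)]


# Toy example: 25 known entries of a 5 x 5 response matrix.
entries = [(u, i) for u in range(5) for i in range(5)]
folds = k_fold_entries(entries, k=10)
for f, test_set in enumerate(folds):
    # Each fold is the test set once; the other nine form the training set.
    train_set = [e for g, fold in enumerate(folds) if g != f for e in fold]
    assert len(train_set) + len(test_set) == len(entries)
```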

As shown, the prediction performance of HIWCF with MRPCC/RPCC similarity was far better than that with COEF similarity, which suggests that the similarity exhibited in drug response may play a more important role than that of gene expression profiles or drug structures in drug response prediction. Thus, we used only the predicted values of HIWCF with the MRPCC similarity measure in the rest of the evaluation.

In **Table 2**, we can also see that on the CCLE dataset, the performance of HIWCF with RPCC and MRPCC was better than that of SRMF, not to mention KBMF. However, as shown in **Table 3**, on the GDSC dataset the performance of HIWCF with either RPCC or MRPCC was slightly worse than that of SRMF. This may be because the RPCC/MRPCC similarity score is based on the known drug responses of each cell line-drug pair. Since the GDSC dataset is much sparser than the CCLE dataset, the RPCC/MRPCC similarity scores of GDSC are less reliable than those of CCLE.

We further investigated the difference between COEF and RPCC. To be more specific, based on the CCLE dataset, we calculated the drug structure fingerprint similarity COEF for hierarchical clustering analysis. As shown in **Figure 1B**, it was surprising that the similarity scores for most drug pairs approached 1, which made them indistinguishable for neighbor selection. However, distinguishable similarity scores can be obtained from the drug response similarity RPCC, as shown in **Figure 1A**. Consider the drugs that clustered into the same group, such as "Lapatinib," "AZD0530," "ZD-6474," and "Erlotinib": it is well-known that they are EGFR inhibitors, and thus they are most likely to have higher similarity scores in drug response (Yuan et al., 2016). We also compared the gene expression similarity with the cell line response similarity. The cell line response similarity RPCC and the cell line gene expression similarity COEF were calculated for hierarchical clustering, and were comparable with each other (**Figure 2**). The results show that cell lines collected from the same tissue type may have higher similarity scores, which is consistent with previous studies. For example, most cell lines that clustered into the same group shown in **Figure 3** were collected from hematopoietic and lymphoid tissues. Hierarchical clustering was performed in both the row and column directions, with the original similarity scores normalized to zero mean.

### Cross-Validation on CCLE Drug Response Datasets

We then tested the prediction performance of HIWCF for the 23 drugs tested in the CCLE study, which was quantified by the PCC and RMSE between the predicted and observed activity areas.

As shown in **Figure 4**, the overall prediction performance of HIWCF throughout all the drugs was significantly higher than that of SRMF for the CCLE dataset. We believe that the improvement of HIWCF is most likely due to the involvement of similarity calculated from response matrix. The scatter plots of observed vs. predicted responses for four demonstrative drugs, Irinotecan, PD-0325901, Panobinostat, and Erlotinib are shown in **Figure 5**, which indicate the good correlations between existing response and predicted ones.

### Response Data Prediction in GDSC Data

Having validated the HIWCF method, we used all known data to predict the unknown responses in the GDSC dataset.

As in Wang et al. (2017), we also focused on lapatinib, an EGFR and ERBB2 inhibitor, for which more than half of the response values (342/652) were unknown. Previous studies had demonstrated that EGFR and ERBB2 amplification was associated with sensitivity to lapatinib, which has been licensed for the clinical treatment of HER2+ breast cancer (Petrelli et al., 2017; Zhao et al., 2017). Thus, we investigated whether the observed and predicted responses of EGFR/ERBB2 mutated cell lines exhibit sensitivity to lapatinib. All the 635 cell lines in GDSC were first grouped into mutated vs. wildtype by the total copy number variation in the exact gene (Garnett et al., 2012). Then, we found that both EGFR mutated and ERBB2 mutated cell lines were significantly more sensitive to lapatinib, as shown in **Figures 6A,B**, which was consistent with the previously mentioned conclusions.

We further investigated whether the newly predicted drug responses, combined with the known drug responses, were able to detect drug-cancer gene associations. Specifically, the oncogene BRAF has been found to be significantly associated with enhanced and selective sensitivity to the MEK inhibitor PD-0325901 (Solit et al., 2006) (p = 3.70e-11 for known drug responses; p = 6.20e-12 for the combination of predicted and known responses; **Figure 6C**).

The newly predicted drug responses of the GDSC dataset may also aid in drug repositioning. For example, non-small cell lung cancer (NSCLC) cell lines were observed to be sensitive to Sunitinib, a kinase inhibitor targeting VEGFR2 and PDGFRβ, based on newly predicted drug responses vs. available ones, as shown in **Figure 7**.

We further conducted hierarchical clustering analysis of genes based on the expression profiles of all the 652 cell lines. Before hierarchical clustering, the 80 percent of genes that showed the least variation were filtered out. As shown in **Figure 8**, the patterns of gene expression were related to the sensitivity of each cell line to Sunitinib. The group of genes marked in pink showed higher expression in cell lines that were sensitive to Sunitinib, while the group of genes marked in blue showed higher expression in cell lines that were resistant to Sunitinib.

We further conducted GO enrichment analysis for both groups of genes. The genes that were up-regulated in Sunitinib resistant cell lines were found to be related to several repair pathways, such as regulation of DNA repair (p = 1.1e-3), base-excision repair (p = 0.032), nucleotide-excision repair (p = 6e-3), interstrand cross-link repair (p = 0.01), and mismatch repair (p = 0.048), which are known to be important factors in drug resistance. The genes that were up-regulated in Sunitinib sensitive cell lines were found to be related to the mTOR signaling pathway (p = 1e-2) and NF-kappaB signaling (p = 4.1e-10). Inhibition of these signaling pathways helps to increase drug sensitivity (Cai et al., 2014).

#### DISCUSSION

In this paper, we used a recommender system-based method, HIWCF, to predict anti-cancer drug sensitivity in the GDSC and CCLE datasets. The idea of the method comes from the fact that similar cell lines exhibit similar responses to the same drug, which is exactly the motivation of a recommender system. The method first estimated the baseline, which helped to remove noise from the original drug sensitivity values. It then shrunk the similarity measure by integrating the gene expression profile and drug structure with the correlation between cell lines and between drugs exhibited in the drug response, which helped to weaken the influence of sparseness in the response matrix. Finally, it combined user-oriented and item-oriented interpolation weighted collaborative filtering to predict the unknown drug sensitivity values. Ten-fold cross validation demonstrated that similarity derived from known drug responses improves prediction performance more than similarity derived from cell line gene expression profiles and drug structure alone. At least from the perspective of recommender system methods, it is more reliable to predict unknown drug sensitivity from the similarity exhibited in known drug responses. We also applied HIWCF to predict the missing drug response values in the GDSC dataset. Specifically, our results were consistent with the conclusion that EGFR/ERBB2 mutated cell lines are more sensitive to lapatinib. We also found that gene expression profiles showed distinct patterns for Sunitinib sensitive and resistant cell lines. Genes that were up-regulated in Sunitinib resistant cell lines were enriched in repair pathways, while genes that were up-regulated in Sunitinib sensitive cell lines were enriched in signaling pathways whose inhibition helps to increase drug sensitivity.

In comparison with existing drug response prediction methods, HIWCF follows a neighbor based collaborative filtering approach to unknown drug response prediction, which is theoretically simple and intuitive. In contrast, matrix factorization based methods such as SRMF model both cell lines and drugs with latent factors to predict unknown drug responses.

However, this method has its own drawbacks. First, since HIWCF depends heavily on the known drug responses, its performance depends heavily on the sparseness of the response matrix: the sparser the matrix, the worse the performance. Second, the similarity of cell lines is calculated by combining the gene expression correlation coefficient with the Pearson correlation coefficient of their known drug responses. This similarity could be further improved by integrating epigenetic and epi-transcriptomic information, among others. Furthermore, pathway related information or other dynamic information may also help to improve the performance. Therefore, future work can address both the sparseness issue and multi-omics integration.

#### AUTHOR CONTRIBUTIONS

LZ developed the prediction method, designed and implemented the experiments, analyzed the result, and wrote the paper. XC conceived the project, designed the experiments, analyzed the result, revised the paper, and supervised the project. N-NG prepared the data, analyzed the result, and revised the paper. HL and J-QL analyzed the result and revised the paper.

#### FUNDING

LZ was supported by the Fundamental Research Funds for the Central Universities and National Natural Science Foundation of China under Grant No. 2014QNB47 and 61501466. XC was supported by National Natural Science Foundation of China under Grant No. 61772531.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewers YW and CW and the handling Editor declared their shared affiliation.

Copyright © 2018 Zhang, Chen, Guan, Liu and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Prediction of Potential Small Molecule-Associated MicroRNAs Using Graphlet Interaction

Na-Na Guan<sup>1</sup>, Ya-Zhou Sun<sup>1</sup>, Zhong Ming<sup>1,2</sup>, Jian-Qiang Li<sup>1</sup>\* and Xing Chen<sup>3</sup>\*

<sup>1</sup> College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, <sup>2</sup> National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China, <sup>3</sup> School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China

#### Edited by:

Zhi-Liang Ji, Xiamen University, China

#### Reviewed by:

Quan Zou, Tianjin University, China Feixiong Cheng, Cleveland Clinic, United States Francesco Pappalardo, Università degli Studi di Catania, Italy

#### \*Correspondence:

Jian-Qiang Li lijq@szu.edu.cn Xing Chen xingchen@amss.ac.cn; xingchen@cumt.edu.cn

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 08 May 2018 Accepted: 24 September 2018 Published: 15 October 2018

#### Citation:

Guan N-N, Sun Y-Z, Ming Z, Li J-Q and Chen X (2018) Prediction of Potential Small Molecule-Associated MicroRNAs Using Graphlet Interaction. Front. Pharmacol. 9:1152. doi: 10.3389/fphar.2018.01152

MicroRNAs (miRNAs) have recently been shown to be targeted by small molecules, which makes targeting miRNAs with small molecules a possible therapy for human diseases. Therefore, it is very meaningful to investigate the relationships between small molecules and miRNAs, a field that is still at an early stage of development. In this paper, we presented a prediction model of Graphlet Interaction based inference for Small Molecule-MiRNA Association prediction (GISMMA) by combining the small molecule similarity network, the miRNA similarity network and the known small molecule-miRNA association network. This model described the complex relationship between two small molecules or between two miRNAs using graphlet interaction, which consists of 28 isomers. The association score between a small molecule and a miRNA was calculated by counting the numbers of graphlet interactions throughout the small molecule similarity network and the miRNA similarity network, respectively. Global and two types of local leave-one-out cross validation (LOOCV) as well as five-fold cross validation were implemented on two datasets to evaluate GISMMA. For Dataset 1, the AUCs are 0.9291 for global LOOCV, 0.9505 and 0.7702 for the two local LOOCVs, and 0.9263 ± 0.0026 for five-fold cross validation; for Dataset 2, the AUCs are 0.8203, 0.8640, 0.6591, and 0.8088 ± 0.0044, in turn. In case studies for the small molecules 5-Fluorouracil, 17β-Estradiol and 5-Aza-2′-deoxycytidine, the numbers of top 50 miRNAs predicted by GISMMA and validated by experimental literature to be related to these three small molecules are 30, 29, and 25, in turn. The results from cross validations and case studies indicate the excellent performance of GISMMA.

Keywords: small molecule, microRNA, association prediction, graphlet interaction, similarity calculation

### INTRODUCTION

MicroRNAs (miRNAs) are a family of small non-coding RNAs, about 22 nucleotides in length, which regulate gene expression at the post-transcriptional level (Ambros, 2003). The first miRNA was discovered over 30 years ago in Caenorhabditis elegans. Subsequently, thousands of miRNAs have been discovered in many organisms, and there are currently 2588 annotated miRNAs in the human genome (Kozomara and Griffiths-Jones, 2014). MiRNAs can simultaneously regulate the expression of hundreds of genes because their nucleotide pairing by complementarity is imperfect (He and Hannon, 2004). In this manner, they play a critical role in a variety of crucial processes such as tissue development, morphogenesis, apoptosis, and signal transduction pathways (Esquela-Kerscher and Slack, 2006; Spizzo et al., 2009; Wang and Lee, 2009). This additionally implicates them in an array of disease associated processes. Large-scale expression screens have proven useful in identifying novel miRNAs involved in diseases, which could potentially become attractive therapeutic targets (Monroig and Calin, 2013; Chen et al., 2017a, 2018a,b,c; Matsui and Corey, 2017).

Regulation of miRNAs by small molecules is an efficient means to modulate endogenous miRNA function and to treat miRNA-related diseases (Xia et al., 2015). Small molecules have been used extensively in clinical applications for numerous diseases (Zhang et al., 2009). However, drug discovery and development is currently an extremely long process, taking approximately 10–15 years (Monroig Pdel et al., 2015). Drug production also imposes a heavy economic burden, and patients end up paying very high prices for their treatments (Chen et al., 2015; Monroig Pdel et al., 2015). Using chemical compounds that are already FDA approved for a specific disease would accelerate the toxicological studies and clinical trials needed to apply them to other diseases, reducing both cost and time.

As miRNAs have been associated with many diseases (Chen et al., 2017b), the development of small-molecule drugs targeting specific miRNAs seems to be a promising approach to meet this challenge (Monroig Pdel et al., 2015). Small molecules may modulate the expression of miRNAs by either activating or repressing their transcription (Xia et al., 2015). Transcriptional inhibitors were identified in a small molecule screen in which a 3′ UTR complementary to miR-21 was inserted into a luciferase mRNA reporter (Gumireddy et al., 2008). This study identified a type of diazobenzene as a miR-21 transcriptional inhibitor (Gumireddy et al., 2008). Small molecules were also discovered to modulate transcription of miR-122, a highly expressed and liver-specific miRNA whose aberrant expression is associated with hepatocellular carcinoma (Thomas and Deiters, 2013). Two small molecules that inhibit transcription and another small molecule that promotes transcription of pri-miR-122 were identified using a luciferase reporter system (Thomas and Deiters, 2013). The examples above show that miRNA expression can be altered with small molecules, which holds promise for expanding miRNAs from diagnostic signatures of disease to therapeutic targets. Therefore, the prediction of associations between small molecules and miRNAs could promote drug repurposing for miRNA-related diseases. Moreover, since miRNA expression can be regulated by targeting miRNAs directly (Zhang et al., 2010) or by targeting the related proteins (Lim et al., 2016), identifying small molecule-miRNA associations would be conducive to drug discovery. However, experimental methods for studying small molecule-miRNA associations are expensive and time-consuming, which makes it urgent to develop computational approaches that provide reliable predictions to guide experiments.

Recently, several computational models have been proposed to investigate the relations between small molecules and miRNAs. For example, Jiang et al. (2012) proposed a high-throughput method to investigate the biological connections between small molecules and miRNAs in 23 human cancers based on transcriptional responses, which was the first model to systematically study the associations between bioactive small molecules and miRNAs. They constructed a complex Small molecule and MiRNA Network (SMirN) for each cancer and explored the molecular and functional features of small molecule modules, as well as miRNA modules, for each cancer type. Each small molecule module was linked to a miRNA, and each miRNA module was connected with one small molecule. One advantage of this method is that it does not require information about small molecule structure or miRNA structure in advance. However, the reliability of the approach was limited by the small amount of data on transcriptional responses to genome-wide miRNA perturbations. Furthermore, Meng et al. (2014) built a bioactive Small molecule and miRNA association Network in Alzheimer's Disease (SmiRN-AD) by comparing the gene expression profiles after bioactive small molecule treatment with the expression changes regulated by AD-related miRNAs (ADMs), to obtain association scores between small molecules and ADMs. In addition, positive and negative associations were identified to investigate the biological insights of SmiRN-AD. Recently, Wang et al. (2016) developed another method to identify small molecule-miRNA associations based on their functional similarity. They searched for the functional link of each small molecule-miRNA pair by calculating Gene Ontology enrichment after identifying differentially expressed genes for small molecules and miRNAs. Compared with previous models based on transcriptional responses, this method is more repeatable because it uses functional associations. Additionally, Lv et al. (2015) presented a computational model to predict potential associations between small molecules and miRNAs. They implemented the random walk with restart algorithm on a comprehensive network, established by combining small molecule similarity, miRNA similarity, and known small molecule-miRNA associations. Notably, this model can predict related miRNAs for small molecules without any known associated miRNAs. However, it has many adjustable parameters that need to be determined. Moreover, Li et al. (2016) developed a network based framework called predictive Small Molecule-miRNA Network-Based Inference (SMiR-NBI) to investigate the underlying regulation of miRNAs by anticancer drugs. This model constructed a heterogeneous network composed of drugs, miRNAs and genes, on which a network based algorithm was run. It is worth noting that the accuracy of this method is quite high even though it depends only on network topology information. However, SMiR-NBI cannot be applied to isolated miRNAs that have no linked small molecules. In addition, it failed to predict potential miRNAs associated with small molecules that had different dose-responses, owing to a lack of known data.

So far, the number of computational models for predicting novel associations between small molecules and miRNAs is still unsatisfactory. Moreover, the previous models still have some limitations. In order to predict potential small molecule-miRNA associations more effectively and reliably, in this paper we present the Graphlet Interaction based inference for Small Molecule-MiRNA Association prediction (GISMMA). In this model, the similarity of small molecules and the similarity of miRNAs were combined with known associations between small molecules and miRNAs in two different datasets, labeled Dataset 1 and Dataset 2. In Dataset 1, only a fraction of small molecules and miRNAs were involved in known small molecule-miRNA associations, whereas in Dataset 2 all small molecules and miRNAs were implicated in known small molecule-miRNA associations. By measuring the graphlet interaction between any two nodes on the network of small molecules and on the network of miRNAs, respectively, we can compute the correlation scores of small molecule-miRNA pairs. We implemented leave-one-out cross validation (LOOCV) and five-fold cross validation to evaluate the performance of GISMMA. The AUCs of global LOOCV are 0.9291 and 0.8203 for Dataset 1 and Dataset 2, respectively; the AUCs of local LOOCV ranking the small molecules for each fixed miRNA are 0.9505 and 0.8640 for the two datasets, respectively; and the AUCs of local LOOCV ranking the miRNAs for each fixed small molecule are 0.7702 and 0.6591 for the two datasets, respectively. The average AUCs and standard deviations of five-fold cross validation are 0.9263 ± 0.0026 and 0.8088 ± 0.0044 for the two datasets, respectively. In each case study, a small molecule was treated as a new one by turning all of its known related miRNAs into unknown ones. GISMMA was then applied to predict potential related miRNAs for each small molecule based on Dataset 1. For the small molecules 5-Fluorouracil, 17β-Estradiol and 5-Aza-2′-deoxycytidine, 30, 29, and 25 of the top 50 predicted miRNAs, respectively, were validated by experimental literature to be associated with the small molecule. The results of both the cross validations and the case studies suggest that GISMMA is a powerful and reliable model for predicting novel associations between small molecules and miRNAs.

#### MATERIALS AND METHODS

#### Small Molecule-miRNA Associations

In this paper, we obtained the known small molecule-miRNA associations from SM2miR (Version 1) (Liu et al., 2013). The total number of known associations is 664. For comparison of model performance on different datasets, we have constructed two datasets. Dataset 1 consists of 831 small molecules extracted and integrated from SM2miR, DrugBank (Knox et al., 2011) and PubChem (Wang et al., 2009), and 541 miRNAs that were collected from SM2miR, HMDD (Lu et al., 2008), miR2Disease (Jiang et al., 2009) and PhenomiR (Jiang et al., 2009; Ruepp et al., 2010). In Dataset 1, there are only 39 small molecules and 286 miRNAs implicated in the 664 known associations, while 792 small molecules and 255 miRNAs are completely new ones without any known associations. Dataset 2 is only composed of those 39 small molecules and 286 miRNAs, which are involved in the known associations. Based on the known data, an adjacency matrix A was constructed to represent the relations between small molecules and miRNAs, in which A(i, j) was set to be 1 if there is an association between small molecule s(i) and miRNA m(j), 0 otherwise.
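As a minimal sketch of the adjacency matrix described above, the following builds A from a list of known pairs and then finds the rows and columns that carry at least one association, which is how a Dataset 2 style subset could be derived from a Dataset 1 style matrix. The identifiers and the helper names `build_adjacency` and `associated_indices` are hypothetical and are not taken from the authors' code.

```python
def build_adjacency(small_molecules, mirnas, known_pairs):
    """Return A as a list of lists with A[i][j] = 1 for known pairs, else 0."""
    s_index = {s: i for i, s in enumerate(small_molecules)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = [[0] * len(mirnas) for _ in small_molecules]
    for s, m in known_pairs:
        A[s_index[s]][m_index[m]] = 1
    return A

def associated_indices(A):
    """Indices of rows and columns that have at least one known association."""
    rows = [i for i, row in enumerate(A) if any(row)]
    cols = [j for j in range(len(A[0])) if any(row[j] for row in A)]
    return rows, cols

# Toy identifiers, purely illustrative.
small_molecules = ["sm_a", "sm_b", "sm_c"]
mirnas = ["mir_x", "mir_y"]
known_pairs = [("sm_a", "mir_x"), ("sm_b", "mir_y")]

A = build_adjacency(small_molecules, mirnas, known_pairs)
rows, cols = associated_indices(A)
```

Here `sm_c` has no known association, so it would be kept in a Dataset 1 style matrix but dropped from a Dataset 2 style one.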

#### Small Molecule Similarity

In this paper, following the method proposed by Lv et al. (2015), the small molecule similarity was calculated by integrating four commonly used small molecule similarities: side effect based similarity, computed as a Jaccard score on a small molecule side effect dataset (Gottlieb et al., 2011); functional consistency based similarity, obtained by comparing the functions of small molecule target genes (Lv et al., 2012); chemical structure based similarity, calculated by chemical structure comparison between any two small molecules (Hattori et al., 2003); and indication phenotype based similarity, constructed by identifying phenotype similarity between small molecule related diseases (Gottlieb et al., 2011). The integrated similarity of small molecules can then be computed with the following formula:

$$\text{SS} = \frac{\beta\_1 S\_S^D + \beta\_2 S\_S^T + \beta\_3 S\_S^C + \beta\_4 S\_S^S}{\sum\_{i=1}^4 \beta\_i} \tag{1}$$

where S<sub>S</sub><sup>D</sup>, S<sub>S</sub><sup>T</sup>, S<sub>S</sub><sup>C</sup>, and S<sub>S</sub><sup>S</sup> denote the four similarity types, i.e., indication phenotype based similarity, functional consistency based similarity, chemical structure based similarity and side effect based similarity, respectively, and β<sub>i</sub> (i = 1, 2, 3, 4) are the weights used to balance the contributions of the different similarities, whose default values were all set to 1.

#### MiRNA Similarity

The miRNA similarity we used in this paper was established using the method in (Lv et al., 2015), by combining functional consistency based similarity that was calculated by comparing the function of miRNA target genes (Lv et al., 2012) and indication phenotype based similarity that was computed by measuring phenotype similarity between diseases associated with miRNAs (Gottlieb et al., 2011). Similarly, to reduce the bias of each similarity measurement, the integrated similarity of miRNAs was defined as follows:

$$\text{SM} = \frac{\alpha\_1 S\_M^D + \alpha\_2 S\_M^T}{\sum\_{j=1}^2 \alpha\_j} \tag{2}$$

where S<sub>M</sub><sup>D</sup> is the indication phenotype based similarity, S<sub>M</sub><sup>T</sup> is the functional consistency based similarity, and α<sub>j</sub> (j = 1, 2) are the weights of the two similarity measurements, which were both set to 1.
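Equations (1) and (2) are both weighted averages of similarity matrices, which can be sketched as below. The function name `integrate_similarities` and the toy matrices are hypothetical; the default of equal weights mirrors the setting reported above.

```python
def integrate_similarities(matrices, weights=None):
    """Element-wise weighted average of equally sized similarity matrices."""
    if weights is None:
        weights = [1.0] * len(matrices)  # all weights set to 1, as in the text
    total = sum(weights)
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    return [
        [
            sum(w * m[r][c] for w, m in zip(weights, matrices)) / total
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]

# Two toy 2 x 2 similarity matrices (hypothetical values).
phenotype_sim = [[1.0, 0.2], [0.2, 1.0]]
functional_sim = [[1.0, 0.6], [0.6, 1.0]]

equal = integrate_similarities([phenotype_sim, functional_sim])
skewed = integrate_similarities([phenotype_sim, functional_sim], weights=[1.0, 3.0])
```

With equal weights the off-diagonal entry is the plain mean of 0.2 and 0.6; with weights 1 and 3 it moves toward the second matrix.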

#### GISMMA

In this study, by integrating small molecule similarity, miRNA similarity and known associations between small molecules and miRNAs, we developed a graphlet interaction based method to predict potential associations between small molecules and miRNAs, motivated by the study of Wang et al. (2014). The prediction code of our model is available at: https://github.com/AnnaGuan/GISMMA/tree/AnnaGuanpatch-1. The concept of graphlet interaction follows the definition in Wang et al. (2014), which describes the relationship between any two nodes in a graphlet, a type of subgraph in a large network. As in Wang et al. (2014), GISMMA uses only graphlets with 1 to 4 nodes, from which 28 graphlet interaction isomers were constructed, denoted by labels I<sub>1</sub> to I<sub>28</sub> in **Figure 1**. The graphlet interaction isomer depends on the positions of the two nodes involved, which means that the graphlet interaction between two nodes has two different sets of isomers. By counting the number of each isomer, we can represent the graphlet interaction between any two nodes in a network with a vector of 28 numbers (Przulj, 2007; Wang et al., 2014).

We created a network NS to represent the small molecule similarity and a network NM to represent the miRNA similarity, where each node denotes a small molecule or a miRNA, respectively. Any two nodes with a similarity between them are linked by an edge whose weight is their similarity value. The associations between small molecules and miRNAs were investigated in the two similarity networks NS and NM, respectively.

In the miRNA network NM, the number of isomers I<sub>k</sub> for the graphlet interaction from node m(i) to node m(j) can be calculated as follows (Wang et al., 2014):


$$N\_{ij}(I\_k) = \sum\_{l \in V(\text{NM})} \sum\_{m \in V(\text{NM})} b\_{ij} b\_{il} b\_{im} b\_{jl} b\_{jm} b\_{lm} \tag{3}$$

where V(NM) denotes the set of all nodes in network NM, l and m are two nodes different from nodes m(i) and m(j), and b is defined as:

$$b\_{st} = \begin{cases} a\_{st}, & \text{if } s \text{ and } t \text{ have a link in } I\_k \\ 1 - a\_{st}, & \text{if } s \text{ and } t \text{ have no link in } I\_k \end{cases} \tag{4}$$

where a<sub>st</sub> is the edge weight, given by the similarity value of m(s) and m(t). In particular, a<sub>st</sub> is 0 when nodes m(s) and m(t) have no connection. We then normalized the graphlet interaction as follows:

$$\text{norm}\left(N\_{ij}\left(I\_k\right)\right) = \frac{N\_{ij}\left(I\_k\right)}{\sum\_{m \in M} N\_{im}\left(I\_k\right)}\tag{5}$$

where M contains all other nodes but m(i). Based on the normalized form in equation (5), we can compute the association score of a small molecule-miRNA pair as follows:

$$S\_m\left(i,j\right) = \sum\_{k=1}^{28} \nu\_k \sum\_{p \in P(i)} \text{norm}\left(N\_{pj}\left(I\_k\right)\right) \tag{6}$$

where i denotes a small molecule s(i) and j denotes a miRNA m(j), v<sub>k</sub> is the weight of the kth isomer, and P(i) is the set of miRNAs with known associations with small molecule s(i). By defining the summation of norm in equation (6) as follows:

$$X\_m\left(k,j\right) = \sum\_{p \in P(i)} \text{norm}\left(\mathcal{N}\_{pj}\left(I\_k\right)\right) \tag{7}$$

we can rewrite equation (6) in matrix form as follows:

$$S\_m = X\_m^T V\_m \tag{8}$$

The weight coefficients V<sub>m</sub> can be learned from known associations by performing a simple linear regression (Wang et al., 2014), which gives:

$$V\_m = \left(X\_m X\_m^T\right)^{-1} X\_m \mathbf{S}\_m \tag{9}$$
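Equations (8) and (9) amount to ordinary least squares solved through the normal equations. A minimal pure-Python sketch follows, using a toy problem with two isomer weights instead of 28. The helper names are hypothetical, and the sketch assumes that the matrix X X^T is invertible; when it is not, a regularized or pseudo-inverse solution would be needed, which the text does not discuss.

```python
def solve_linear_system(G, b):
    """Solve G v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(G, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X X^T is singular; no unique least-squares solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - tail) / M[r][r]
    return v

def learn_weights(X, s):
    """V = (X X^T)^{-1} X S, with X of shape (isomers, samples)."""
    k = len(X)
    gram = [[sum(a * b for a, b in zip(X[p], X[q])) for q in range(k)] for p in range(k)]
    rhs = [sum(x * y for x, y in zip(X[p], s)) for p in range(k)]
    return solve_linear_system(gram, rhs)

def predict_scores(X, v):
    """S = X^T V: one score per sample (column of X)."""
    return [sum(v[k] * X[k][j] for k in range(len(X))) for j in range(len(X[0]))]

# Toy problem: 2 isomer features, 3 samples, scores generated from weights (2, 3).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
s = [2.0, 3.0, 5.0]

weights = learn_weights(X, s)
scores = predict_scores(X, weights)
```

Because the toy scores were generated exactly from the weights (2, 3), the regression recovers them and reproduces the scores.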

We computed the number of graphlet interaction isomers between two small molecules in a similar way, as described in equations (3–5). The association score between small molecule s(i) and miRNA m(j) can then be calculated in the small molecule network NS as follows:

$$\mathcal{S}\_s\left(i,j\right) = \sum\_{k=1}^{28} \nu\_k \sum\_{q \in Q(j)} \text{norm}\left(\mathcal{N}\_{qi}\left(I\_k\right)\right) \tag{10}$$

where Q(j) is the set of small molecules that have known associations with miRNA m(j). Also, the term of summation of norm in equation (10) can be defined with the matrix:

$$X\_s\left(k,j\right) = \sum\_{q \in Q(j)} \text{norm}\left(N\_{qi}\left(I\_k\right)\right) \tag{11}$$

Thus equation (10) was rewritten as S<sub>s</sub> = X<sub>s</sub><sup>T</sup>V<sub>s</sub>, and the undetermined matrix V<sub>s</sub> can be obtained by training the model with known association scores:

$$V\_s = \left(X\_s X\_s^T\right)^{-1} X\_s \mathbf{S}\_s \tag{12}$$

Finally, we calculated the association score between small molecule s(i) and miRNA m(j) by combining the scores from NM and NS as a simple average:

$$\mathcal{S}\left(i,j\right) = \frac{\mathcal{S}\_m\left(i,j\right) + \mathcal{S}\_s\left(i,j\right)}{2} \tag{13}$$
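Both network scores rest on the isomer counts of equations (3)–(5). The sketch below illustrates that counting step for a single four-node pattern, under stated assumptions: it takes equation (3) to be a product of one factor per node pair among i, j, l and m, with each factor chosen by equation (4), and it uses a hypothetical path pattern i–l–m–j rather than any specific isomer of **Figure 1**. Isomers on fewer than four nodes would need fewer free nodes, which this sketch does not cover.

```python
from itertools import permutations

# All six node pairs among the four roles of a four-node graphlet.
ROLE_PAIRS = [("i", "j"), ("i", "l"), ("i", "m"), ("j", "l"), ("j", "m"), ("l", "m")]

def isomer_count(a, i, j, linked_pairs):
    """Count of one four-node pattern between nodes i and j.

    a            -- symmetric weighted adjacency matrix with entries in [0, 1]
    linked_pairs -- set of frozensets naming the role pairs linked in the pattern
    """
    others = [n for n in range(len(a)) if n not in (i, j)]
    total = 0.0
    # Ordered choices of (l, m); for patterns symmetric in l and m this doubles
    # the count, a constant factor that cancels in the normalization below.
    for l, m in permutations(others, 2):
        node = {"i": i, "j": j, "l": l, "m": m}
        product = 1.0
        for s, t in ROLE_PAIRS:
            weight = a[node[s]][node[t]]
            product *= weight if frozenset((s, t)) in linked_pairs else 1.0 - weight
        total += product
    return total

def normalized_count(a, i, j, linked_pairs):
    """Count for (i, j) divided by the sum of counts from i to every other node."""
    denom = sum(isomer_count(a, i, m, linked_pairs) for m in range(len(a)) if m != i)
    return isomer_count(a, i, j, linked_pairs) / denom if denom > 0 else 0.0

# Hypothetical pattern: the path i - l - m - j.
PATH = {frozenset(("i", "l")), frozenset(("l", "m")), frozenset(("m", "j"))}

# Toy unweighted network containing exactly the path 0 - 2 - 3 - 1.
a = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]

raw = isomer_count(a, 0, 1, PATH)
normalized = normalized_count(a, 0, 1, PATH)
```

In this toy network the path pattern occurs once between nodes 0 and 1 and not at all between node 0 and any other node, so the normalized count for the pair (0, 1) equals 1.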

#### RESULTS

#### Performance Evaluation

In this work, two commonly used methods, LOOCV and five-fold cross validation, were implemented to evaluate the performance of GISMMA on Dataset 1 and Dataset 2. Three types of LOOCV were used: global LOOCV, local LOOCV ranking small molecules for a fixed miRNA, and local LOOCV ranking miRNAs for a fixed small molecule. In LOOCV, each confirmed association we collected was taken in turn as the test sample, and the remaining known associations were used as the training samples. Candidate samples in global LOOCV consist of all the small molecule-miRNA pairs that have no known association. In local LOOCV, we only consider as candidates those small molecules that are not related to the fixed miRNA, or those miRNAs not connected to the fixed small molecule, in the test sample. Association scores were computed with GISMMA for both the test sample and all candidate samples, which were then ranked for the corresponding type of LOOCV. The five-fold cross validation was performed in the following steps. First, all the known small molecule-miRNA associations were randomly split into five parts of equal size. Second, the five parts took turns as the test sample set, with the other four parts as the training sample set; as before, all small molecule-miRNA pairs with no known association served as candidate samples. Third, the test samples and the candidate samples were assigned association scores by GISMMA. Finally, each test sample was compared in turn with the candidate samples according to their scores. The model was considered to have successfully predicted a test sample only when its rank exceeded the given rank threshold.
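The difference between the three LOOCV variants lies in which candidate pairs the test sample is ranked against. A minimal sketch of that selection and ranking step follows. The function names, the mode labels and the toy scores are hypothetical; in an actual run the scores would come from a model trained with the test association removed.

```python
def loocv_candidates(A, test_pair, mode="global"):
    """Candidate pairs (no known association) for one held-out test pair."""
    i, j = test_pair
    n_s, n_m = len(A), len(A[0])
    if mode == "global":
        return [(s, m) for s in range(n_s) for m in range(n_m) if A[s][m] == 0]
    if mode == "fixed_mirna":
        # Rank small molecules for the miRNA of the test pair.
        return [(s, j) for s in range(n_s) if A[s][j] == 0]
    if mode == "fixed_small_molecule":
        # Rank miRNAs for the small molecule of the test pair.
        return [(i, m) for m in range(n_m) if A[i][m] == 0]
    raise ValueError("unknown mode: " + mode)

def rank_of_test_sample(scores, test_pair, candidates):
    """1-based rank of the test pair among the candidates (higher score is better)."""
    test_score = scores[test_pair]
    return 1 + sum(1 for c in candidates if scores[c] > test_score)

# Toy adjacency matrix (2 small molecules x 3 miRNAs) and hypothetical scores.
A = [[1, 0, 0],
     [0, 1, 0]]
test_pair = (0, 0)
scores = {(0, 0): 0.90, (0, 1): 0.95, (0, 2): 0.10, (1, 0): 0.30, (1, 2): 0.20}

global_rank = rank_of_test_sample(scores, test_pair, loocv_candidates(A, test_pair, "global"))
mirna_rank = rank_of_test_sample(scores, test_pair, loocv_candidates(A, test_pair, "fixed_mirna"))
molecule_rank = rank_of_test_sample(scores, test_pair, loocv_candidates(A, test_pair, "fixed_small_molecule"))
```

The same test pair can rank differently under each variant because the candidate pool changes, which is consistent with the three LOOCV variants reporting different AUCs.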

Based on the ranking, receiver operating characteristic (ROC) curves were used to illustrate the results of the three types of LOOCV described above, in which the abscissa is the false positive rate (FPR, 1-specificity) and the ordinate is the true positive rate (TPR, sensitivity) for different thresholds given in advance. Sensitivity is the proportion of positive samples that rank above the given threshold, while specificity is the percentage of candidate samples whose ranks are below the threshold. The area under the ROC curve (AUC) was then calculated to estimate the reliability of GISMMA. When the model correctly predicts all test samples, AUC = 1; if the model predicts at random, AUC = 0.5.

For comparison with a previous method, we implemented SMiR-NBI (Li et al., 2016) for global LOOCV, the two types of local LOOCV, and five-fold cross validation on the same datasets. The global AUCs of GISMMA for Dataset 1 and Dataset 2 are 0.9291 and 0.8203, respectively, which are shown in **Figure 2** in comparison with the previous model SMiR-NBI, whose results are 0.8843 and 0.7264, respectively. In the local LOOCV ranking small molecules for a fixed miRNA, the AUCs of GISMMA for Dataset 1 and Dataset 2 are 0.9505 and 0.8640, respectively, compared with 0.8837 and 0.7846 for SMiR-NBI, as can be seen in **Figure 3**. The results of the local LOOCV ranking miRNAs for a fixed small molecule are shown in **Figure 4**, from which we can see that the AUCs of GISMMA and SMiR-NBI are 0.7702 and 0.7497 for Dataset 1, and 0.6591 and 0.6100 for Dataset 2, respectively. In five-fold cross validation, the average AUCs with standard deviations of GISMMA and SMiR-NBI are 0.9263 ± 0.0026 and 0.8554 ± 0.0063 for Dataset 1, and 0.8088 ± 0.0044 and 0.7104 ± 0.0087 for Dataset 2. **Table 1** lists the AUC results of GISMMA and SMiR-NBI for the four types of cross validation on the two datasets. We conclude from these comparisons that the method proposed in this work is more reliable and more effective in predicting potential associations between small molecules and miRNAs.
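The AUC values reported above can be read as the probability that a held-out positive sample scores higher than a candidate sample. The sketch below computes the AUC in that pairwise form, which is a simplification: it pools all scores, whereas the evaluation described above ranks each test sample against the candidates separately. The function name and the toy scores are hypothetical.

```python
def auc_from_scores(positive_scores, candidate_scores):
    """Fraction of (positive, candidate) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for c in candidate_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(positive_scores) * len(candidate_scores))

# Toy scores (hypothetical): two held-out positives and three candidates.
positives = [0.9, 0.4]
candidates = [0.8, 0.3, 0.1]

auc = auc_from_scores(positives, candidates)
perfect = auc_from_scores([0.9, 0.8], [0.2, 0.1])
uninformative = auc_from_scores([0.5, 0.5], [0.5, 0.5])
```

The two limiting cases match the interpretation given in the text: a model that ranks every positive above every candidate yields 1, and a model that cannot separate them yields 0.5.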

### Case Study

Based on known databases and published references in the PubMed database, we studied three common small molecules to further evaluate the predictive ability of GISMMA. Each small molecule under study was treated as a new one by removing its known associations. We then counted the experimentally verified miRNAs among the top 50 miRNAs predicted to be related to each of the three small molecules.

The small molecule 5-Fluorouracil (5-FU) is a widely used chemotherapeutic drug in colorectal cancer (Windle et al., 1987). For a long time, the cytotoxic effects induced by 5-FU were thought to result exclusively from its impact on DNA metabolism (Andreuccetti et al., 1996; Airley, 2009). However, several lines of evidence indicate that the cytotoxic effect of 5-FU also results from its capacity to alter RNA metabolism and mRNA expression (Longley et al., 2003). Exposure to 5-FU promotes a profound transcriptional reprogramming that modifies mRNA and miRNA expression profiles and thereby contributes to modifying cell fate (Hernandez-Vargas et al., 2006; Rossi et al., 2007; Shah et al., 2011). After implementing GISMMA, we obtained the complete ranking of potential miRNAs associated with 5-FU. Among the top 10 and top 50 potential 5-FU-related miRNAs, 8 and 30 miRNAs, respectively, were confirmed by experiments (see **Table 2**). For instance, miR-21 and miR-23a were predicted as the first and fifth candidates for 5-FU, respectively, and both were significantly downregulated in 5-FU treated samples compared with control samples in a miRNA microarray analysis of 5-FU treated MCF-7 cells (Shah et al.,

FIGURE 3 | Performance of GISMMA was compared with SMiR-NBI in terms of ROC curve and AUC of local LOOCV of ranking small molecules for fixed miRNA on Dataset 1 (left) and Dataset 2 (right). As is shown, GISMMA achieves AUCs of 0.9505 and 0.8640 for Dataset 1 and Dataset 2, respectively, significantly superior to the previous model SMiR-NBI.


2011). In addition, miR-24-1, the third candidate in the ranking list, showed significant downregulation in HCT-8 colon cancer cells after exposure to 5-FU (Zhou et al., 2010). Moreover, miR-27b, which ranked fourth in the prediction list for 5-FU, was found to be consistently upregulated in human colon cancer cells HC.21 following exposure to 5-FU in vitro (Rossi et al., 2007).

The small molecule 17β-Estradiol (E2) is the principal intracellular human estrogen that exerts important effects on


the reproductive as well as many other organ systems in both men and women (Simpson and Santen, 2015). Analogs of estradiol exhibit significant anticancer activity against human breast cancer cell lines (Sathish Kumar et al., 2014). Estrogens are associated with cancer in target tissues because they share a phenolic ring structure with the carcinogenic hydrocarbons (Ryan, 1982). After implementing GISMMA, we obtained the complete ranking of the E2-associated miRNAs. Among the top 10 and top 50 potential E2-related miRNAs, 5 and 29 miRNAs, respectively, were confirmed by experiments (see **Table 3**). For example, miR-21, miR-27b, and miR-23a occupied the first, fourth, and fifth places, respectively, of the ranking list predicted for E2, and all were downregulated after treatment of MCF-7 cells with E2 (Bhat-Nakshatri et al., 2009; Tilghman et al., 2012). In addition, E2 showed a capacity to

TABLE 2 | Top 50 miRNAs associated with 5-Fluorouracil were predicted by GISMMA based on Dataset 1.

| miRNA | Evidence | miRNA | Evidence |
| --- | --- | --- | --- |
| hsa-mir-21 | 26198104 | hsa-mir-22 | 25449431 |
| hsa-mir-324 | unconfirmed | hsa-mir-409 | unconfirmed |
| hsa-mir-24-1 | 26198104 | hsa-mir-337 | unconfirmed |
| hsa-mir-27b | 26198104 | hsa-let-7a-3 | 26198104 |
| hsa-mir-23a | 26198104 | hsa-let-7a-2 | 26198104 |
| hsa-mir-638 | 26198104 | hsa-mir-155 | 28347920 |
| hsa-mir-27a | 26198104 | hsa-mir-181b-2 | unconfirmed |
| hsa-let-7b | 25789066 | hsa-mir-181b-1 | unconfirmed |
| hsa-mir-181a-1 | unconfirmed | hsa-mir-15b | 26198104 |
| hsa-mir-126 | 26062749 | hsa-let-7i | unconfirmed |
| hsa-mir-125b-2 | unconfirmed | hsa-mir-320a | 26198104 |
| hsa-mir-125b-1 | unconfirmed | hsa-mir-26a-2 | unconfirmed |
| hsa-mir-124-3 | unconfirmed | hsa-mir-328 | unconfirmed |
| hsa-mir-124-2 | unconfirmed | hsa-mir-16-2 | 26198104 |
| hsa-mir-124-1 | unconfirmed | hsa-let-7e | 26198104 |
| hsa-let-7a-1 | 26198104 | hsa-mir-34b | unconfirmed |
| hsa-mir-181a-2 | 24462870 | hsa-mir-145 | 24447928 |
| hsa-mir-24-2 | 26198104 | hsa-mir-200b | 26198104 |
| hsa-mir-17 | 26198104 | hsa-let-7c | 25951903 |
| hsa-mir-26a-1 | unconfirmed | hsa-mir-874 | 27221209 |
| hsa-mir-16-1 | 26198104 | hsa-mir-650 | unconfirmed |
| hsa-mir-518c | unconfirmed | hsa-mir-501 | 26198104 |
| hsa-mir-99b | unconfirmed | hsa-mir-500a | unconfirmed |
| hsa-mir-18a | 26198104 | hsa-mir-1226 | 26198104 |
| hsa-mir-663a | 26198104 | hsa-mir-200c | 26198104 |

The top 1–25 miRNAs are shown in the first column and the top 26–50 in the second. In total, 8 of the top 10 and 30 of the top 50 were confirmed by the published experimental literature.

down regulate the expression level of miR-21 in breast cancer cells (Selcuklu et al., 2012).

The small molecule 5-Aza-2′-deoxycytidine (5-Aza-CdR) is a nucleoside analog inhibitor of DNA methyltransferase (DNMT). It has been used to reverse methylation and reactivate the expression of silenced genes (Patra and Bettuzzi, 2009). 5-Aza-CdR is able to suppress the growth of various tumors, including prostate cancer, in vitro, in animal models, and in clinical trials (Hurtubise and Momparler, 2004; Issa et al., 2004; McCabe et al., 2006). We applied GISMMA to 5-Aza-CdR and obtained the complete ranking of the predicted miRNAs. Among the top 10 and top 50 potential 5-Aza-CdR-related miRNAs, 7 and 25 miRNA-5-Aza-CdR associations, respectively, were confirmed by experiments (see **Table 4**). For example, in the ranking list of miRNAs predicted for 5-Aza-CdR, miR-21 and miR-27b were

TABLE 3 | Top 50 miRNAs associated with 17β-Estradiol were predicted by GISMMA based on Dataset 1.


The top 1–25 miRNAs are shown in the first column and the top 26–50 in the second. In total, 5 of the top 10 and 29 of the top 50 were confirmed by known databases or the published experimental literature.



The top 1–25 miRNAs are shown in the first column and the top 26–50 in the second. In total, 7 of the top 10 and 25 of the top 50 were confirmed by known databases or the published experimental literature.

ranked in the first and fifth positions, respectively, and both showed significant downregulation after 5-Aza-CdR treatment in breast cancer cells (Radpour et al., 2011). Moreover, miR-24-1 was the fourth miRNA predicted to be associated with 5-Aza-CdR. Microarray analysis showed that miR-24-1 was upregulated upon 5-Aza-CdR therapy in pancreatic cancer PANC-1 cells compared with control cells (Lee et al., 2009).

The complete prediction list of all candidate small molecule-miRNA pairs in Dataset 1 is provided in **Supplementary Table 1**, ranked in descending order of the association scores produced by GISMMA. We hope that the ranked list will be useful in guiding biological experiments and that it will be verified by more experimental results in the future.
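The confirmation counts quoted in the case studies can be recomputed from any ranked list that carries an evidence column. The sketch below does this for the first ten rows of Table 2 (5-FU); the helper name `confirmed_in_top_k` is hypothetical, and `None` is used here to stand for the "unconfirmed" entries.

```python
# First ten rows of Table 2 (5-FU) in rank order; None marks "unconfirmed".
top10_5fu = [
    ("hsa-mir-21", "26198104"),
    ("hsa-mir-324", None),
    ("hsa-mir-24-1", "26198104"),
    ("hsa-mir-27b", "26198104"),
    ("hsa-mir-23a", "26198104"),
    ("hsa-mir-638", "26198104"),
    ("hsa-mir-27a", "26198104"),
    ("hsa-let-7b", "25789066"),
    ("hsa-mir-181a-1", None),
    ("hsa-mir-126", "26062749"),
]


def confirmed_in_top_k(ranked, k):
    """Count entries among the first k of a ranked list that carry evidence."""
    return sum(1 for _, evidence in ranked[:k] if evidence is not None)


n_confirmed = confirmed_in_top_k(top10_5fu, 10)
```

With these ten rows the count is 8, matching the figure reported for the top 10 candidates of 5-FU.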

#### DISCUSSION

This paper presented a graphlet interaction based method, GISMMA, to infer the potential associations between small molecules and miRNAs by combining small molecule similarity, miRNA similarity and known associations between small molecules and miRNAs. In GISMMA, we used one similarity network to represent the small molecules and another similarity network to represent the miRNAs. An edge weighted by the similarity value was drawn between two nodes whenever the two nodes were similar; otherwise no edge was drawn. We utilized graphlet interaction to measure the complex relationship between two nodes in the network, where a graphlet is defined as a type of non-isomorphic subgraph (Wang et al., 2014). We then counted each graphlet interaction isomer in a specific pattern from a node having known associations to a node without known associations, which yielded a vector describing the graphlet interaction between the two nodes. The correlation score between a small molecule and a miRNA can be computed by summing the weighted graphlet interaction isomers, where the weights can be learnt from the known associations. The performance of GISMMA in predicting novel small molecule-miRNA associations was evaluated with four validation approaches: global LOOCV, two types of local LOOCV, and five-fold cross validation. The cross validation results of GISMMA and SMiR-NBI were compared and showed the superior performance of GISMMA over SMiR-NBI. In addition, the ROC curves of SMiR-NBI in **Figures 2**, **3** are somewhat unusual, which may be attributed to the fact that SMiR-NBI cannot predict associated miRNAs (small molecules) for new small molecules (miRNAs). When ranking the test small molecule-miRNA pair against the candidate pairs for SMiR-NBI, we assigned a fixed, average rank to those pairs that contain new small molecules (miRNAs), which may cause the line segments seen in the ROC curve.
We implemented cross validations on two datasets of different sizes. The results showed that GISMMA performed better on Dataset 1 than on Dataset 2, which could result from two factors. The first is that Dataset 1 contains more similarity information. The second is that Dataset 1 contains small molecules and miRNAs without any known associations, which often receive lower association scores and lower rankings than the test sample; this could also make the AUCs higher. We further carried out case studies for three small molecules using Dataset 1. Among the top 50 miRNAs predicted by GISMMA, 30, 29 and 25 miRNAs, respectively, were validated by the experimental literature as related to these three small molecules. The cross validations together with the case studies show that GISMMA performs well and is reliable in predicting new associations between small molecules and miRNAs. Furthermore, a list of all predicted small molecule-miRNA associations was provided, which should benefit the development of miRNA-targeted therapy and drug repositioning. Specifically, for a given small molecule, we focused on the predicted miRNAs that are most likely to be associated with it. These miRNAs might be related to diseases that this small molecule has not yet been confirmed to treat. By regulating the expression of these miRNAs, the small molecule could potentially be used for the treatment of these diseases. Therefore, we believe that the prediction results of this work could offer some guidance for drug repositioning experiments.

The outstanding performance of GISMMA can be attributed to several factors. Firstly, we mapped the similarity between small molecules and similarity between miRNAs into two networks, in which the similarity values were fully exploited to investigate the complex relationship between two nodes by measuring their

graphlet interaction. Secondly, in GISMMA, not only direct but also indirect links were considered between the nodes in the counting of graphlet interaction isomers. Finally, the GISMMA is a bipartite method which combines miRNA network with small molecule network. It can be used to predict miRNAs associated with new small molecules without any known related miRNAs, as well as to predict small molecules associated with new miRNAs without any known related small molecules, because it computes the association score by combining the result calculated in the small molecule network with that in the miRNA network.
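The combination of the two network-level results can be pictured with a very small sketch. It assumes a plain arithmetic mean of the two scores and treats a score that cannot be computed in one network as 0.0; both the function name `combined_score` and the handling of missing scores are assumptions of this illustration, not details confirmed by the text.

```python
def combined_score(score_sm_network, score_mirna_network):
    """Average the scores obtained from the two similarity networks.

    None stands for "no score could be computed in this network" and is
    treated as 0.0 here, which is an assumption of this sketch.
    """
    a = 0.0 if score_sm_network is None else score_sm_network
    b = 0.0 if score_mirna_network is None else score_mirna_network
    return (a + b) / 2.0


both = combined_score(0.8, 0.6)       # pair scored in both networks
one_only = combined_score(None, 0.6)  # pair scored in only one network
```

Under these assumptions, a pair that can be scored in only one network receives at most half of the score it would obtain from that network alone, which is one way to picture how averaging could disadvantage such pairs.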

However, GISMMA still has some limitations. For example, the lack of known association data, especially the presence of many new small molecules or new miRNAs that have no known associations, affected the performance to a large extent. The model can be expected to perform better as more experimental datasets are produced in the future. In addition, the simple approach of averaging the scores from the two networks to compute the final association score may bias the results for pairs that can be predicted in only one network. Furthermore, GISMMA considered at most 4 nodes within a graphlet, which prevented it from capturing similarity information from more distant nodes. Finally, this model cannot be applied to predict an association in which both the small molecule and the miRNA are new. We anticipate that more network-based methods could be developed to improve the prediction of novel small molecule-miRNA associations. For example, Petri net based models have proved to be a useful tool for many prediction problems. Inspired by the work of Russo et al. (2017), we could construct an algorithm using



Petri nets for the inference of potential small molecule-miRNA association.

#### AUTHOR CONTRIBUTIONS

N-NG implemented the experiments, analyzed the result, and wrote the paper. Y-ZS analyzed the result and wrote the paper. XC conceived the project, developed the prediction method, designed the experiments, analyzed the result, and revised the paper. ZM analyzed the result. J-QL analyzed the result and revised the paper. All authors read and approved the final manuscript.

#### FUNDING

XC was supported by National Natural Science Foundation of China under grant nos. 61772531 and 11631014. JL was supported by National Natural Science Foundation of China under grant nos. U1713212 and 61572330, Natural Science foundation of Guangdong Province under grant no. 2014A030313554, and Technology Planning Project from Guangdong Province under grant no. 2014B010118005.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.01152/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Guan, Sun, Ming, Li and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Determining the Balance Between Drug Efficacy and Safety by the Network and Biological System Profile of Its Therapeutic Target

Xiao xu Li<sup>1,2</sup>, Jiayi Yin<sup>1</sup>, Jing Tang<sup>1,2</sup>, Yinghong Li<sup>1,2</sup>, Qingxia Yang<sup>1,2</sup>, Ziyu Xiao<sup>1</sup>, Runyuan Zhang<sup>1</sup>, Yunxia Wang<sup>1</sup>, Jiajun Hong<sup>1</sup>, Lin Tao<sup>3</sup>, Weiwei Xue<sup>2</sup> and Feng Zhu<sup>1,2</sup>\*

<sup>1</sup> College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China, <sup>2</sup> School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China, <sup>3</sup> Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China

#### Edited by:

Zhi-Liang Ji, Xiamen University, China

#### Reviewed by:

Enrique Hernandez-Lemus, Instituto Nacional de Medicina Genómica (INMEGEN), Mexico Qing-Chuan Zheng, Jilin University, China Yan Li, Army Medical University, China

\*Correspondence:

Feng Zhu zhufeng@zju.edu.cn; prof.zhufeng@gmail.com

#### Specialty section:

This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology

Received: 28 July 2018 Accepted: 12 October 2018 Published: 31 October 2018

#### Citation:

Li X, Yin J, Tang J, Li Y, Yang Q, Xiao Z, Zhang R, Wang Y, Hong J, Tao L, Xue W and Zhu F (2018) Determining the Balance Between Drug Efficacy and Safety by the Network and Biological System Profile of Its Therapeutic Target. Front. Pharmacol. 9:1245. doi: 10.3389/fphar.2018.01245

One of the most challenging puzzles in drug discovery is the identification and characterization of candidate drugs with a well-balanced profile between efficacy and safety. So far, extensive efforts have been made to evaluate this balance by estimating the quantitative structure–therapeutic relationship and exploring the target profile of adverse drug reactions. In particular, the therapeutic index (TI) has emerged as a key indicator of this delicate balance, and a clinically successful agent requires a sufficient TI suitable for its corresponding indication. However, TI information is largely unknown for most drugs, and the mechanism underlying drugs with narrow TI (NTI drugs) is still elusive. In this study, the collective effects of the human protein–protein interaction (PPI) network and biological system profile on the drugs' efficacy–safety balance were systematically evaluated. First, a comprehensive literature review of the FDA approved drugs confirmed their NTI status. Second, a popular feature selection algorithm based on artificial intelligence (AI) was adopted to identify key factors differentiating the target mechanism between NTI and non-NTI drugs. Finally, this work revealed that the targets of NTI drugs were highly centralized and connected in the human PPI network, and that the numbers of similarity proteins and affiliated signaling pathways of the corresponding targets were much higher than those of non-NTI drugs. These findings, together with the newly discovered features or feature groups, clarified the key factors indicating a drug's narrow TI, and could thus provide a novel direction for determining the delicate drug efficacy-safety balance.

Keywords: drug efficacy-safety balance, therapeutic index, artificial intelligence, protein-protein interaction network, biological system profile

### INTRODUCTION

One of the most challenging puzzles in drug discovery is the identification and characterization of candidate drugs of well-balanced profile between efficacy and safety (Muller and Milton, 2012; Li et al., 2018; Xue et al., 2018b). In other words, apart from extensive effort made to optimize drug affinity and selectivity (Wang et al., 2017a; Zheng et al., 2017), considerable investments


should be devoted to detecting adverse drug reactions (Huang et al., 2018) and revealing drug likeness (Benet et al., 2016; Yang et al., 2018). So far, the identification of drug toxicities in preclinical or clinical development has been accelerated by a variety of technological advances (Badders et al., 2018), including biomarker-guided safety assessment (Muller and Dieterle, 2009; Rzepecki et al., 2018), OMICs techniques (Iloro et al., 2013; Fu J. et al., 2018), and breakthroughs in computing capacity and bioinformatics methods (Zhu et al., 2011; Tao et al., 2015; Chen et al., 2016). To measure the level of correlation between a drug's maximum efficacy and confined safety in a given disorder, the therapeutic index (TI, typically considered as the ratio of the highest non-toxic drug exposure to the exposure producing the desired efficacy) has emerged as a key indicator of that delicate balance (Zaykov et al., 2016). The TI is essential for life-threatening diseases (such as cardiovascular and oncological diseases) with limited treatment options (Zhu et al., 2008b; Kimmelman and Federico, 2017). In particular, a tiny variation in the dosage of drugs with narrow TI (NTI drugs, TI ≤3) may result in therapeutic failure or serious adverse drug reactions (Tao et al., 2014; Ewer and Ewer, 2015; Zheng et al., 2016), and such drugs are only acceptable for the treatment of life-threatening diseases (Yu et al., 2015). Therefore, successful therapeutic agents require a sufficient TI (NNTI drugs, TI >3) suitable for their corresponding indication (Abernethy et al., 2011).
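The TI definition and the NTI/NNTI cutoff given above translate directly into a short calculation. The sketch below is illustrative only: the function names and the exposure values are hypothetical, and the exposures are assumed to be expressed in the same units.

```python
NTI_CUTOFF = 3.0  # the text treats TI <= 3 as narrow (NTI) and TI > 3 as sufficient (NNTI)


def therapeutic_index(highest_nontoxic_exposure, efficacious_exposure):
    """Ratio of the highest non-toxic exposure to the exposure giving the desired efficacy."""
    if efficacious_exposure <= 0:
        raise ValueError("efficacious exposure must be positive")
    return highest_nontoxic_exposure / efficacious_exposure


def classify_ti(ti):
    """Label a drug as NTI or NNTI from its therapeutic index."""
    return "NTI" if ti <= NTI_CUTOFF else "NNTI"


# Hypothetical exposures in arbitrary but identical units.
narrow = classify_ti(therapeutic_index(20.0, 10.0))      # TI = 2
sufficient = classify_ti(therapeutic_index(80.0, 10.0))  # TI = 8
```

A drug sitting exactly at TI = 3 is labelled NTI under this cutoff, following the "TI ≤3" convention in the text.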

However, TI characterization is too complicated to be achieved for many drugs (Yu et al., 2015), and TI is highly susceptible to subject variations in drug response (Jiang et al., 2015; Yang et al., 2017). To enhance the determination and interpretation of TI, a variety of in-silico studies have been performed to reveal the mechanism underlying NTI drugs (Muller and Milton, 2012). In particular, prediction models based on quantitative structure–activity (QSAR), structure–toxicity (QSTR), and structure–index (QSIR) relationships have been constructed to enable early assessment of TI (Zhu H. et al., 2008; Rodgers et al., 2010; Zhu et al., 2012a; Chen et al., 2016; Fu T. et al., 2018). These models are constructed and make their predictions primarily from the structures of the studied drugs, and are therefore severely limited in coping with TI's vulnerability to subject variations in drug response (Jiang et al., 2015). Compared with approaches based on drug structure, the target-based approach is more effective for characterizing the confined toxicity behind drug efficacy (Muller and Milton, 2012; Huang et al., 2018), since the population variation of a drug target is capable of reflecting, to some extent, subject variations in drug response (Fujimoto et al., 2014; Jiang et al., 2015). However, the target-based method is complicated by the involvement of the target in a complex protein–protein interaction (PPI) network (Rao et al., 2011; Li et al., 2016b; Xu et al., 2016; Wang et al., 2017b) and by the necessity of considering target biological system profiles (Zhu F. et al., 2009; Xue et al., 2016).

So far, PPI network properties (Ragusa et al., 2010; Guo et al., 2018) and biological system profiles (Zheng et al., 2006) have been adopted to analyze the drug likeness of candidate agents. On the one hand, target–protein interaction networks have been constructed, and the corresponding network features can be calculated to discover differential properties indicating disease status (Ragusa et al., 2010) and to identify candidate drug targets for a given indication (Guo et al., 2018; Xue et al., 2018a). On the other hand, the druggability of a candidate target has been found to be significantly determined by a variety of biological system profiles, which include the number of target-affiliated signaling pathways (Yang et al., 2016), the number of similarity proteins outside the target's protein family (Zheng et al., 2006), the number of human tissues in which the studied target is distributed (Zhu F. et al., 2009), and the differential level of target expression between patients and healthy individuals (Ernst et al., 2017; Li et al., 2018). Since the underlying theories of network- and biological system-based approaches are distinct from each other (Guo et al., 2018; Li et al., 2018), it is essential to consider these two types of properties simultaneously for understanding drug likeness. However, these properties have not yet been collectively considered in TI-related studies, and the mechanism underlying drugs' narrow TI is still elusive.

In this study, a comprehensive analysis of the network features and biological system profiles of the primary therapeutic targets of all FDA approved drugs was conducted, and various features differentiating drugs of narrow TI (NTI drugs) from those of sufficient TI (NNTI drugs) were identified. First, due to the limited information on both NTI and NNTI drugs, a systematic literature review was conducted to collect the TI data for all approved drugs. Second, the primary therapeutic targets of these drugs were classified into four groups based on the collected TI data. These four target groups include (a) targets of NTI drugs, (b) targets of both NTI and NNTI drugs, (c) targets of drugs without reported TI, and (d) targets of NNTI drugs. Third, a comparative analysis between target groups (a) and (d) identified several key features able to differentiate the two groups, and further study revealed three feature groups indicating the mechanisms underlying NTI drugs. In summary, these findings, together with the newly discovered features or feature groups, clarified key factors indicating a drug's narrow TI, which gave a new direction for determining the delicate balance between drugs' maximum efficacy and confined safety.

### MATERIALS AND METHODS

### Systematic Collection of Drugs and Their Corresponding Targets and TI Data

The TI data of FDA approved drugs were obtained in the following steps. First, FDA approved drugs were collected from the official website of FDA (Drugs@FDA), and their corresponding diseases were carefully confirmed. In total, 1,762 drugs were collected. Second, the primary therapeutic targets of these drugs were identified from the TTD database (https://db.idrblab.org/ttd/; Li et al., 2018), and 418 primary therapeutic targets of these 1,762 drugs were discovered (detailed information is provided in the following paragraphs). Third, TI data of these drugs were systematically collected by a comprehensive literature review. In particular, various keyword combinations were searched in PubMed and other academic resources, which included "drug name + therapeutic index," "drug name + therapeutic window," "drug name + critical dose," "drug name + therapeutic ranges," and "drug name + therapeutic ratio." As a result, 161 NTI and 29 NNTI drugs confirmed by clinical evaluations or experiments were identified, which act on 60 and 28 human targets, respectively. **Supplementary Table S1** provides a full list of the 161 NTI and 29 NNTI drugs together with their approved disease indications and corresponding targets. To the best of our knowledge, this is the first comprehensive literature review of the TI data of all drugs approved by FDA, and **Supplementary Table S1** provides the most complete information on the FDA approved drugs with available TI data. Moreover, the primary therapeutic targets of all FDA approved drugs were classified into four groups based on their TI: (a) 20 targets of NTI drugs, (b) 40 targets of both NTI and NNTI drugs, (c) 339 targets of drugs without reported TI, and (d) 19 targets of NNTI drugs.
In addition, among the drugs listed in **Supplementary Table S1**, four multi-target drugs were found with NTI data available: regorafenib (hepatocellular and colorectal cancer), sorafenib (renal cell and hepatocellular carcinoma), sunitinib (gastrointestinal cancer), and vandetanib (medullary thyroid cancer). All of these drugs are multi-kinase inhibitors for the treatment of cancer.
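The four-way grouping of targets can be sketched as a small lookup over drug-target pairs. Everything named in the block (`group_targets`, the drug and target identifiers) is hypothetical. The sketch also assumes that a target is grouped only by those of its drugs that have a reported TI, which is one plausible reading of the grouping rather than a rule stated in the text.

```python
def group_targets(drug_targets, drug_ti):
    """Assign each target to group 'a', 'b', 'c' or 'd'.

    drug_targets: dict mapping drug name -> iterable of primary target names.
    drug_ti: dict mapping drug name -> "NTI" or "NNTI"; drugs without a
             reported TI are simply absent from this dict.
    """
    statuses = {}
    for drug, targets in drug_targets.items():
        for target in targets:
            seen = statuses.setdefault(target, set())
            if drug in drug_ti:
                seen.add(drug_ti[drug])
    groups = {}
    for target, seen in statuses.items():
        if seen == {"NTI"}:
            groups[target] = "a"  # targets of NTI drugs only
        elif seen == {"NTI", "NNTI"}:
            groups[target] = "b"  # targets of both NTI and NNTI drugs
        elif not seen:
            groups[target] = "c"  # targets of drugs without reported TI
        else:
            groups[target] = "d"  # targets of NNTI drugs only
    return groups


# Hypothetical toy data.
drug_targets = {"drug_A": ["T1"], "drug_B": ["T1", "T2"], "drug_C": ["T3"], "drug_D": ["T4"]}
drug_ti = {"drug_A": "NTI", "drug_B": "NNTI", "drug_D": "NTI"}
groups = group_targets(drug_targets, drug_ti)
```

In the toy data, `T1` is hit by one NTI and one NNTI drug and therefore falls in group (b), while `T3` is hit only by a drug without reported TI and falls in group (c).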

### Identification of the Primary Therapeutic Target(s) of FDA Approved Drugs

The primary therapeutic target of each FDA approved drug was strictly determined by considering (1) the experimentally determined potency of drugs against their primary target or targets (Zhu et al., 2010), (2) the observed potency or effects of drugs against disease models (cell lines, ex-vivo, in-vivo models) linked to their primary drug targets (Zhu et al., 2012b), and (3) the observed effects in target knockout, knockdown, transgenic, RNA interference, antibody or antisense-treated in vivo models (Zhu et al., 2012b). As an example, CDK4 was confirmed as the primary therapeutic target of the FDA approved drug Palbociclib by considering: (1) the experimentally defined high potency (IC50 = 11 nM) of Palbociclib against CDK4 (Fry et al., 2004), (2) the clearly observed development of multiple tumors caused by a point mutation (R24C) in the first coding exon of the locus encoding CDK4 in mouse models (Sotillo et al., 2001), and (3) Palbociclib-induced G1-G2 arrest and apoptosis in breast tumor cell lines (IC50 <400 nM) and tumor growth reduction in human breast tumor xenografts (Lapenna and Giordano, 2009). In conclusion, only the targets with complete target determination data (including all three types of information above) were defined as the primary therapeutic targets of the corresponding FDA approved drugs.

### Deriving the Human PPI Network Properties for Each Studied Target

The human protein–protein interaction (PPI) network analyzed here included 15,554 proteins and 642,304 PPIs, and was constructed using the data provided in STRING (Szklarczyk et al., 2015). To ensure the reliability of the analyzed data, only those PPIs with a high confidence score (>0.95) were collected for the subsequent analyses (Ghosh et al., 2015; Wang S. et al., 2015). As a result, a sub-network with 8,509 proteins and 40,468 PPIs was generated and adopted for further analyses in this study. Moreover, the network properties for each studied target were generated by PROFEAT (Zhang et al., 2017a) and the NetworkAnalyzer tool of Cytoscape (Shannon et al., 2003; Thomas and Bonchev, 2010).
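The confidence filtering step can be sketched as a simple pass over scored edges. The function name and the protein identifiers below are hypothetical, and scores are assumed to lie in [0, 1]; if an interaction file stores scores on a 0–1000 scale, they would need to be rescaled first.

```python
def high_confidence_subnetwork(edges, cutoff=0.95):
    """Keep interactions whose confidence score exceeds the cutoff.

    edges: iterable of (protein_a, protein_b, score) with score in [0, 1].
    Returns (kept_edges, proteins) for the resulting sub-network.
    """
    kept = [(a, b, s) for a, b, s in edges if s > cutoff]
    proteins = {p for a, b, _ in kept for p in (a, b)}
    return kept, proteins


# Hypothetical toy network.
toy_edges = [
    ("P1", "P2", 0.99),
    ("P2", "P3", 0.96),
    ("P3", "P4", 0.95),  # not strictly above the cutoff, so dropped
    ("P4", "P5", 0.40),
]
kept_edges, kept_proteins = high_confidence_subnetwork(toy_edges)
```

Proteins that lose all of their interactions drop out of the sub-network, consistent with the reduction in protein count reported above after filtering.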

In total, 32 network properties were calculated and adopted in subsequent analysis. These properties were popular for analyzing a complex biological network, which included: (1) Average Closeness Centrality: the average number of steps required to reach the studied node from any node in a network (Ma et al., 2016); (2) Average Shortest Path Length: the average length of shortest paths between the studied node and all other ones (Zhang et al., 2014); (3) Betweenness Centrality: the number of times the studied node serving as a linking bridge along shortest path between any two nodes (Zeidán-Chuliá et al., 2015); (4) Bridging Centrality: the product of the bridging coefficient and betweenness centrality (Hwang et al., 2008); (5) Bridging Coefficient: the extent of the studied node lying between any other densely connected nodes in the network (Paladugu et al., 2008); (6) Closeness Centrality Sum: the reciprocal of the sum of the shortest paths between the studied node and all other nodes in the network (Costenbader and ValenteFontanesi, 2003); (7) Clustering Coefficient: the number of the connected pairs between all neighbors of node (Watts and Strogatz, 1998); (8) Current Flow Betweenness: a centrality index measuring the level of information travels along all possible paths within network (Paladugu et al., 2008); (9) Current Flow Closeness: the variant of current flow betweenness (Zhang et al., 2017b); (10) Degree: the number of edges linked to a node (Braeuning, 2013); (11) Degree Centrality: the number of links incident upon a studied node (Batool and Niazi, 2014); (12) Deviation: the variation between sum of node distances and network unipolarity (Zhang et al., 2017a); (13) Distance Deviation: the absolute difference between nodes' distance sum and network's average distance (Rogelj et al., 2013); (14) Distance Sum: the sum of all shortest paths starting from the studied node (Bolser et al., 2003); (15) Eccentric: the absolute difference between nodes' 
eccentricities and network's average eccentricity (Zhang et al., 2017a); (16) Eccentricity: the maximum non-infinite shortest path length between the studied node and all other nodes in the network (Bolser et al., 2003); (17) Eccentricity Centrality: the largest geodesic distance between the node and any other node (Batool and Niazi, 2014); (18) Eigenvector Centrality: the sum of its neighbors' centrality values (Solá et al., 2013); (19) Harmonic Closeness Centrality: the sum of the reciprocals of the average shortest path lengths of each node in network (Zhang et al., 2017b); (20) Interconnectivity: a connectivity index indicating the quality of the studied nodes being connected together (Emig et al., 2013); (21) Load Centrality: the fraction of all the shortest paths that pass through the studied node (Kivimäki et al., 2016); (22) Neighborhood Connectivity: the average connectivity of all neighbors (Carson and Lu, 2015); (23) Normalized Betweenness: the fraction of network shortest paths that a given protein lies on (Paladugu et al., 2008); (24) Number of Self Loops: the number of edges starting and ending at the same node (Garlaschelli and Loffredo, 2004); (25) Number of Triangles: the number of triangles that include the studied node as a vertex (Rubinov and Sporns, 2010); (26) Page Rank Centrality: an adjustment of Katz by considering the diluted issue (Li et al., 2013); (27) Radiality: the level of reachability of a studied node via various shortest paths within the entire network (Koschützki and Schreiber, 2008); (28) Residual Closeness Centrality: the closeness measured by removing the studied node (Dangalchev, 2006); (29) Scaled Degree: the degree of a studied node relative to the most connected node within the same module (Sormani, 2012); (30) Stress: the number of shortest paths passing through a given node (Shannon et al., 2003); (31) Topological Coefficient: the extent to which a node in network shares interaction partners with other nodes (Zhu M. 
et al., 2009); (32) Z Score: a connectivity index based on degree distribution of a network (Rubinov and Sporns, 2010).
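To make a few of these definitions concrete, the following minimal Python sketch computes degree, distance sum, average shortest path length, closeness centrality sum, eccentricity, and number of triangles for one node of a small undirected graph. The toy graph, the node labels, and the function names are illustrative assumptions and are not part of the study, which computed all 32 properties on the human PPI network.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def node_properties(graph, node):
    """Selected properties of one node in an undirected, connected graph."""
    dist = shortest_path_lengths(graph, node)
    others = [d for n, d in dist.items() if n != node]
    distance_sum = sum(others)
    neighbors = graph[node]
    # Every edge between two neighbors closes one triangle through the node.
    triangles = sum(
        1 for a in neighbors for b in neighbors if a < b and b in graph[a]
    )
    return {
        "degree": len(neighbors),
        "distance_sum": distance_sum,
        "average_shortest_path_length": distance_sum / len(others),
        "closeness_centrality_sum": 1 / distance_sum,
        "eccentricity": max(others),
        "number_of_triangles": triangles,
    }


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

props_c = node_properties(graph, "C")
props_e = node_properties(graph, "E")
print(props_c)
print(props_e)
```

In this toy graph the hub-like node C has a smaller distance sum and eccentricity than the peripheral node E, which is the kind of contrast the shortest-path-based properties are meant to capture.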

### Assessing the Biological System Profile for Each Studied Target

The biological system profile of each studied target comprised four components. (1) The number of target-affiliated and target immediate-downstream signaling pathways in the KEGG database (Kanehisa et al., 2017). The target-affiliated pathways were determined by requiring that (a) the pathways of the studied target be life-essential in both patients and healthy people and (b) the studied target lie upstream in the pathway, with the capacity to regulate the biological function of the pathway. (2) The number of human tissues in which each target is distributed, assessed using the TissueDistributionDBs (Kogenaru et al., 2010) and UniProt (UniProt Consortium, 2018) databases. A target was assumed to be distributed in a given tissue if >5% of the total proteins are distributed in that tissue or if the target concentration is higher than the average concentration of proteins in that tissue. (3) The number of human similarity proteins of a target outside the corresponding target family, used to probe off-target collateral effects (Zheng et al., 2006; Zhu F. et al., 2009). This was determined by BLAST similarity screening of the human proteome in the UniProt database (UniProt Consortium, 2018) with a cutoff of E-value < 0.005 (Song et al., 2006; Singh et al., 2007). (4) The differential expression of the studied target in the disease-specific tissue between patients and healthy individuals (Li et al., 2018). The relevant data were collected directly from TTD (Li et al., 2018) and calculated from the raw human gene expression data of the Affymetrix U133 Plus 2.0 platform in GEO (Barrett et al., 2013).

### Selecting the Differential Features Indicating NTI Drugs by Artificial Intelligence

Artificial intelligence (AI) has recently been proposed as a powerful technique for drug target discovery (Xu and Wang, 2014; Zhu et al., 2018), protein function prediction (Li et al., 2016a; Seo et al., 2018; Yu et al., 2018), and biomarker identification (Li B. et al., 2016; Li et al., 2017). It works by mimicking human thinking procedures, learning processes, and information extraction, and it includes machine learning algorithms (Zhu et al., 2008a; Wang P. et al., 2015), deep learning methods (van der Burgh et al., 2017; Seo et al., 2018), and cognitive computing (Krittanawong et al., 2017). As one of the most popular machine learning algorithms, the Boruta algorithm, a wrapper method built around a random forest classifier (Kursa, 2014), was adopted in this study. Boruta determines the relevance of a feature by comparing the importance of the real features with that of random probes (Pan et al., 2018). As a machine learning-based technique, it was considered the most powerful approach, with stable variable selection, and especially suitable for low-dimensional datasets among the available strategies (Degenhardt et al., 2017). In this study, the differential features between NTI and NNTI drugs were therefore identified using the R package Boruta (Shang et al., 2017). Specifically, the human PPI network properties and biological system features of each target were first calculated, and feature selection was then performed using the R package Boruta with a p-value threshold of 0.05, maxRuns = 100, and doTrace = 2. In addition, getImp was set to getImpRfZ, and mcAdj and holdHistory were set to TRUE.
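The idea of comparing real features against random probes can be illustrated with a short Python sketch. This is not the R package Boruta used in the study: as a loudly stated assumption, the random forest Z-score importance is replaced by a simple standardized difference of class means, no multiple-comparison adjustment is applied, and the data, feature names, and function names are hypothetical. Only the shadow-feature logic is shown: each run shuffles the features to create probes, a real feature scores a hit when it beats the best probe, and a binomial test over the runs decides whether it is kept.

```python
import math
import random
import statistics


def importance(values, labels):
    """Stand-in importance (NOT a random forest Z-score):
    absolute difference of class means, scaled by the column's spread."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = statistics.pstdev(values) or 1.0  # guard against constant columns
    return abs(statistics.mean(pos) - statistics.mean(neg)) / spread


def binomial_upper_tail(hits, runs, p=0.5):
    """P(X >= hits) for X ~ Binomial(runs, p)."""
    return sum(
        math.comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(hits, runs + 1)
    )


def shadow_feature_selection(columns, labels, max_runs=100, p_value=0.05, seed=0):
    """Return {feature name: True if it beats the shadow features often enough}."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(max_runs):
        # Shadow features: shuffled copies, so any link to the labels is destroyed.
        shadow_scores = []
        for values in columns.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(importance(shuffled, labels))
        best_shadow = max(shadow_scores)
        for name, values in columns.items():
            if importance(values, labels) > best_shadow:
                hits[name] += 1
    return {
        name: binomial_upper_tail(count, max_runs) < p_value
        for name, count in hits.items()
    }


# Hypothetical toy data: label 1 = NTI-like target, label 0 = NNTI-like target.
labels = [1] * 10 + [0] * 10
columns = {
    "informative": list(range(10, 20)) + list(range(0, 10)),
    "noise": [0, 1] * 10,
}
selected = shadow_feature_selection(columns, labels)
print(selected)
```

The parameters `max_runs` and `p_value` mirror the maxRuns and p-value settings quoted above. A real analysis would rely on the Boruta package itself, where importance comes from a random forest.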

## RESULTS AND DISCUSSION

### Network Properties and Biological System Profile of NTI and NNTI Drugs

As reported, the human PPI network properties and biological system profile were key factors determining the efficacy-safety balance (Zheng et al., 2006; Ragusa et al., 2010; Guo et al., 2018). Network properties were inherent features of a target in the human PPI network, while the biological system profile could reflect both on-target and off-target pharmacology (Bender et al., 2007; Han et al., 2018; Zhu et al., 2018). Herein, 32 features of the human PPI network together with 4 biological system properties were therefore calculated and adopted for further analyses. To the best of our knowledge, this was the most comprehensive set of features ever applied to TI-related analysis. **Table 1** listed the calculated values of ten properties based on connectivity and adjacency in the human PPI network. These connectivity/adjacency-based network properties were designed to describe the level of connectivity among human proteins or the neighborhood features of the studied proteins (Chen et al., 2016). The properties included bridging coefficient, clustering coefficient, degree, degree centrality, interconnectivity, neighborhood connectivity, number of triangles, scaled degree, topological coefficient, and Z-score (the corresponding definitions were provided in section Materials and Methods). As shown in **Table 1**, 8 (80.0%) of the 10 properties were significantly different (p-value < 0.05, highlighted in bold) between the targets of NTI and NNTI drugs, and half of the 10 properties showed the most significant differences (p-value < 0.01, highlighted in bold-underline).

TABLE 1 | The calculated values of 10 properties based on the connectivity and adjacency in the human PPI network.

The mean values (together with standard deviation) and median values of these properties between the targets of NTI and NNTI drugs were provided, and the statistical difference (p-value) for each property between targets of NTI and NNTI drugs was also calculated (p-values <0.05 and <0.01 were highlighted by bold and bold-underline, respectively).

Similar to the connectivity/adjacency-based network properties, the calculated values of 16 properties based on the shortest path length in the human PPI network were provided in **Table 2** (corresponding definitions of these properties were provided in section Materials and Methods). As shown in **Table 2**, all properties were found to be significantly different (p-value < 0.05, in bold font) between the targets of NTI and NNTI drugs, and 14 (87.5%) of the 16 properties showed the most significant difference (p-value < 0.01, bold-underline). Moreover, the calculated values of 4 human biological system properties were shown in **Table 3** (definitions of these properties were given in section Materials and Methods). As reported, these properties were frequently adopted to analyze the druggability of therapeutic targets not only for approved drugs but also for drugs in clinical trial development or withdrawn from the market (Li et al., 2018). Herein, two properties were identified as significantly different (p-value < 0.01, bold-underline) between the targets of NTI and NNTI drugs: the number of pathways affiliated with the targets of the studied drugs and the number of similarity proteins outside the target's functional family. It should be emphasized that the standard deviation of many properties was even larger than their mean value (such as bridging coefficient, clustering coefficient, and Z-score). These deviations indicated that the corresponding p-value alone may not be sufficient to measure the difference between the targets of NTI and NNTI drugs. Moreover, no individual feature (p-value < 0.05 in **Tables 1**–**3**) could satisfactorily differentiate the targets of NTI drugs from those of NNTI ones. This finding therefore inspired us to discover the differential features using a more advanced computational algorithm that collectively considers multiple properties.

TABLE 2 | The calculated values of 16 properties based on the shortest path length in human PPI network.

The mean values (together with standard deviation) and median values of these properties between the targets of NTI and NNTI drugs were provided, and the statistical difference (p-value) for each property between targets of NTI and NNTI drugs was also calculated (p-values <0.05 and <0.01 were highlighted by bold and bold-underline, respectively).

TABLE 3 | The calculated values of four human biological system properties.


The mean values (together with standard deviation) and median values of these properties between the targets of NTI and NNTI drugs were provided, and the statistical difference (p-value) for each property between targets of NTI and NNTI drugs was also calculated (p-values <0.05 and <0.01 were highlighted by bold and bold-underline, respectively).

TABLE 4 | 19 substantially overlapped network properties grouped into 5 property groups based on their innate mutual dependence.


### Discovering the Key Features of NTI Drug Targets by Artificial Intelligence

Based on the in-depth investigation of the 36 properties in **Tables 1**–**3**, several properties were found to be not fully independent or even duplicated in their descriptions (such as degree vs. scaled degree). In this study, all 36 properties were systematically reviewed, and 19 of them were found to overlap substantially with other properties (**Table 4**). Since there was significant dependence among these 19 properties, using all 36 properties for statistical feature selection might introduce strong biases. Thus, the 19 properties were grouped based on their innate mutual dependence. As shown in **Table 4**, five property groups were generated by considering the equations and descriptions of these 19 properties, and each group was named after the first property (ordered alphabetically) in the corresponding group. These five groups were: average closeness centrality, average shortest path length, betweenness centrality, degree, and eccentricity. To minimize the possible bias induced by the innate mutual dependence among properties, only these five representative properties, rather than all 19, were considered in the subsequent feature selection analysis. Together with the remaining 17 relatively independent properties, 22 properties in total for each target were selected for subsequent feature selection.

As one of the most popular AI-based feature selection strategies, the Boruta algorithm, a wrapper method built around a random forest classifier (Kursa, 2014), was adopted in this study. Boruta was considered the most powerful method, with stable variable selection, and especially suitable for low-dimensional datasets among the reported strategies (Degenhardt et al., 2017). The key differential features were thus selected from the 22 properties using the R package Boruta with a p-value threshold of 0.05. As a result, eight properties were selected as collectively reflecting the target mechanism underlying NTI drugs. As illustrated in **Figure 1**, the boxplots colored in red and green referred to the targets of NTI and NNTI drugs, respectively. Some key features increased from the targets of NTI drugs to those of NNTI drugs (such as average shortest path length), while others decreased (such as average closeness centrality). Based on a comprehensive literature review, some of those 8 key features had been reported to be indirectly relevant to the efficacy-safety balance of drugs. For example, a lower average closeness centrality of a target was reported to indicate a lower lethality risk (Chen et al., 2011), which was consistent with the findings of this study (a much higher average closeness centrality was observed for the targets of NTI drugs than for those of NNTI drugs, as shown in **Figure 1**). Moreover, a higher level (lower value) of interconnectivity was frequently observed in lethal diseases such as cardiovascular disorders and cancer (Muhammd et al., 2018). Oncological and cardiovascular disorders had been recognized as life-threatening diseases, and the majority of their drugs were reported to be NTI drugs (Muller and Milton, 2012; Yu et al., 2015). Thus, the result for interconnectivity in **Figure 1** was consistent with these previous reports, which further validated the effectiveness of the applied algorithm in identifying key target features underlying NTI drugs.

Moreover, there were four groups of targets as defined in section Materials and Methods: (a) targets of NTI drugs, (b) targets of both NTI and NNTI drugs, (c) targets of drugs without reported TI, and (d) targets of NNTI drugs. Apart from target groups (a) and (d), the remaining groups provided more complicated and informative data for illustrating the mechanism underlying NTI drugs. On the one hand, the targets in group (b) were affected by both NTI and NNTI drugs; they might reflect properties from both sides but might also be significantly affected by the properties of confirmed NTI drugs. On the other hand, no TI data were reported in the literature for the group (c) targets. It was possible that some NTI drugs had not yet been discovered for those targets. However, considering the large number of group (c) targets (339 in total), it was highly likely that most of them were targeted only by NNTI drugs, and that only a small fraction would be found to have new NTI drugs in the future. The values of the 8 properties for those 4 target groups were illustrated in **Figure 1**. Interestingly, all properties followed a clear descending or ascending trend from the targets of group (a) to those of group (d), which was in accordance with the analyses provided above. Thus, these findings provided another line of evidence validating the effectiveness of the feature identification algorithm applied in this study.

### Target Mechanism Underlying NTI Drugs Collectively Determined by Multiple Profiles

By collectively considering **Figure 1** and **Tables 1**–**3**, seven of the eight selected key features showed a significant difference (p-value < 0.05). However, a significant difference did not guarantee that the corresponding feature was a key differential one: 57.7% of the features with a significant difference (p-value < 0.05) were not selected as key differential features. Moreover, no significant difference was observed for the selected key feature bridging coefficient (p-value = 0.22). These findings indicated that the eight features collectively determined the target mechanism of NTI drugs, and that the TI-related mechanism might result from synergistic effects among those features. Moreover, the majority of these eight key features were identified for the first time in this study, and this work was also the first analysis of the collective effects of both PPI network properties and the biological system profile on the drug efficacy-safety balance.

Further analysis of these eight key features (shown in **Figure 1**) revealed that they belonged to three feature groups: connectivity of targets in the human PPI network, centrality of targets in the human PPI network, and human biological system features. Combined with the data in **Figure 1**, the key features within the same feature group (illustrated in **Figure 2**, where each group is marked by the same background color) followed the same ascending or descending trend. As shown in **Figure 2**, the targets of NTI drugs were highly centralized and connected, and their numbers of similarity proteins and affiliated pathways were substantially higher than those of the targets of NNTI drugs. Since the numbers of similarity proteins and affiliated pathways were reported to be good indicators of target druggability (Zhu F. et al., 2009; Li et al., 2018), the NTI profile identified in this study was in accordance with the reported target druggability.

### CONCLUSION

This work is the first study to conduct a comprehensive review of the TI data of all FDA-approved drugs (**Supplementary Table S1**) and to reveal the collective effects of both human PPI network properties and biological system profiles on the drug efficacy-safety balance. Eight key features were identified here as collectively differentiating the target mechanisms of NTI and NNTI drugs. These features revealed that the targets of NTI drugs were highly centralized and connected in the human PPI network, and that their numbers of similarity proteins and target-affiliated pathways were both much higher than those of the targets of NNTI drugs. These findings, together with the newly discovered features and feature groups, clarified the key factors indicating a drug's narrow TI and could therefore provide a novel direction for determining the delicate drug efficacy-safety balance.

### AUTHOR CONTRIBUTIONS

FZ conceived the idea and supervised the work. XL, JY, and JT performed the research. XL, JY, JT, YL, QY, ZX, RZ, YW, JH, LT, and WX prepared and analyzed the data. FZ wrote the manuscript. All authors have read and approved this manuscript.

### FUNDING

This work was funded by National Natural Science Foundation of China (81872798); Innovation Project on Industrial Generic Key Technologies of Chongqing (cstc2015zdcy-ztzx120003); and Fundamental Research Funds for Central Universities (10611CDJXZ238826, CDJZR14468801, CDJKXB14011, 2015CDJXY).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.01245/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Li, Yin, Tang, Li, Yang, Xiao, Zhang, Wang, Hong, Tao, Xue and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Quantitative Systems Pharmacological Analysis of Drugs of Abuse Reveals the Pleiotropy of Their Targets and the Effector Role of mTORC1

Fen Pei†, Hongchun Li†, Bing Liu\* and Ivet Bahar\*

*Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States*

#### Edited by:

*Zhi-Liang Ji, Xiamen University, China*

#### Reviewed by:

*Amit P. Bhavsar, University of Alberta, Canada*

*Huanmei Wu, Indiana University, Purdue University Indianapolis, United States*

#### \*Correspondence:

*Bing Liu liubing@pitt.edu*

*Ivet Bahar bahar@pitt.edu*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology*

Received: *08 November 2018* Accepted: *14 February 2019* Published: *08 March 2019*

#### Citation:

*Pei F, Li H, Liu B and Bahar I (2019) Quantitative Systems Pharmacological Analysis of Drugs of Abuse Reveals the Pleiotropy of Their Targets and the Effector Role of mTORC1. Front. Pharmacol. 10:191. doi: 10.3389/fphar.2019.00191*

Existing treatments against drug addiction are often ineffective due to the complexity of the networks of protein-drug and protein-protein interactions (PPIs) that mediate the development of drug addiction and related neurobiological disorders. There is an urgent need for understanding the molecular mechanisms that underlie drug addiction toward designing novel preventive or therapeutic strategies. The rapidly accumulating data on addictive drugs and their targets as well as advances in machine learning methods and computing technology now present an opportunity to systematically mine existing data and draw inferences on potential new strategies. To this aim, we carried out a comprehensive analysis of cellular pathways implicated in a diverse set of 50 drugs of abuse using quantitative systems pharmacology methods. The analysis of the drug/ligand-target interactions compiled in DrugBank and STITCH databases revealed 142 known and 48 newly predicted targets, which have been further analyzed to identify the KEGG pathways enriched at different stages of drug addiction cycle, as well as those implicated in cell signaling and regulation events associated with drug abuse. Apart from synaptic neurotransmission pathways detected as upstream signaling modules that "sense" the early effects of drugs of abuse, pathways involved in neuroplasticity are distinguished as determinants of neuronal morphological changes. Notably, many signaling pathways converge on important targets such as mTORC1. The latter emerges as a universal effector of the persistent restructuring of neurons in response to continued use of drugs of abuse.

Keywords: drug abuse, quantitative systems pharmacology, pleiotropic proteins, mTOR complex 1, drug-target interactions, neurotransmission, machine learning, cellular pathways

## INTRODUCTION

Drug addiction is a chronic relapsing disorder characterized by compulsive, excessive, and self-damaging use of drugs of abuse. It is a debilitating condition that potentially leads to serious physiological injury, mental disorder and death, resulting in major health and socioeconomic impacts worldwide (Nestler, 2013; Koob and Volkow, 2016). Substances with diverse chemical structures and mechanisms of action are known to cause addiction. Except for alcohol and tobacco, substances of abuse are commonly classified into six groups based on their primary targets or effects: cannabinoids (e.g., cannabis), opioids (e.g., morphine, heroin, fentanyl), central nervous system (CNS) depressants (e.g., pentobarbital, diazepam), CNS stimulants (e.g., cocaine, amphetamine), hallucinogens (e.g., ketamine, lysergic acid diethylamide), and anabolic steroids (e.g., nandrolone, oxymetholone).

The primary actions of drugs of abuse have been well studied. Despite their pleiotropy and heterogeneity, drugs of abuse share similar phenotypes: a progression from acute intoxication to chronic dependence (Taylor et al., 2013), and a shift in reinforcement from positive to negative through a three-stage cycle involving binge/intoxication, withdrawal/negative affect, and preoccupation/anticipation (Koob and Volkow, 2016). Notably, virtually all drugs of abuse augment dopaminergic transmission in the reward system (Wise, 1996). However, the detailed cellular pathways of addiction processes are still far from known. For example, cocaine acts primarily as an inhibitor of the dopamine (DA) transporter (DAT) and results in DA accumulation in the synapses of DA neurons (Shimada et al., 1991; Volkow et al., 1997). However, it has been shown that DA accumulation per se is not sufficient to account for the rewarding process associated with cocaine addiction; serotonin (5-HT) and noradrenaline (or norepinephrine, NE) also play important roles (Rocha et al., 1998; Sora et al., 1998). Another example is ketamine, a non-selective antagonist of the N-methyl-d-aspartate (NMDA) receptor (NMDAR), notably most effective in the amygdala and hippocampal regions of neurons (Collingridge et al., 1983). In addition to its primary action, ketamine affects a number of other neurotransmitter receptors, including sigma-1 (Mendelsohn et al., 1985), substance P (Okamoto et al., 2003), opioid (Hustveit et al., 1995), muscarinic acetylcholine (mACh) (Hirota et al., 2002), nicotinic acetylcholine (nACh) (Coates and Flood, 2001), serotonin (Kapur and Seeman, 2002), and γ-aminobutyric acid (GABA) receptors (Hevers et al., 2008). The promiscuity of drugs of abuse brings an additional layer of complexity, which prevents the development of efficient treatments against drug addiction.

In recent years, there has been significant progress in the characterization of drug/target/pathway relations driven by the accumulation of drug-target interactions and pathways data, as well as the development of machine learning, in silico genomics, chemogenomics, and quantitative systems pharmacology (QSP) tools. Several innovative studies started to provide valuable information on substance abuse targets and pathways. For example, Li et al. curated 396 drug abuse related genes from the literature and identified five common pathways underlying the reward and addiction actions of cocaine, alcohol, opioids, and nicotine (Li et al., 2008). Hu et al. analyzed the genes related to nicotine addiction via a pathway and network-based approach (Hu et al., 2018). Biernacka et al. performed genome-wide analysis on 1,165 alcohol-dependence cases and identified two pathways associated with alcohol dependence (Biernacka et al., 2013). Xie et al. generated chemogenomics knowledgebases focused on G-protein coupled receptors (GPCRs) related to drugs of abuse in general (Xie et al., 2014), and cannabinoids in particular (Xie et al., 2016). Notably, these studies have shed light on selected categories or subgroups of drugs. There is a need to understand the intricate couplings between multiple pathways implicated in the cellular response to drugs of abuse, identify mechanisms common to various categories of drugs while distinguishing those unique to selected categories.

We undertake here such a systems-level approach using a dataset composed of six different categories of drugs of abuse. Following a QSP approach proposed earlier (Stern et al., 2016), we provide a comprehensive, unbiased glimpse of the complex mechanisms implicated in addiction. Specifically, as shown in **Figure 1**, a set of 50 drugs of abuse with a diversity of chemical structures (**Supplementary Figure 1**) and pharmacological actions were collected as probes, and the known targets of these drugs as well as the targets predicted using our probabilistic matrix factorization (PMF) method (Cobanoglu et al., 2013) were analyzed to infer biological pathways associated with drug addiction. Our analysis yielded 142 known and 48 predicted targets and 173 pathways permitting us to identify both generic mechanisms regulating the responses to drug abuse as well as specific mechanisms associated with selected categories, which could facilitate the development of auxiliary agents for treatment of addiction.

A key step in our approach is to identify the targets of drugs of abuse. There exist various drug-target interaction databases (DBs), web servers, and computational models, as summarized recently (Chen et al., 2016). The DBs utilized in this work are the drug-target database DrugBank (Wishart et al., 2018) and the protein-chemical database STITCH (Szklarczyk et al., 2016). DrugBank is a bioinformatics and cheminformatics resource that combines drug data with comprehensive target information. It is frequently updated, with the current version containing 10,562 drugs, 4,493 targets, and the corresponding 16,959 interactions. Since most drugs of abuse are approved or withdrawn drugs, DrugBank is a good source of information on their interactions. STITCH, on the other hand, is much more extensive. It integrates chemical-protein interactions from experiments, other DBs, literature, and predictions, resulting in data on 430,000 chemicals and 9,643,763 proteins across 2,031 genomes. In the present analysis, we used the subset of human protein-chemical data supported by experimental evidence. The approach adopted here is an important advance over our original PMF-based machine learning methodology for predicting drug-target interactions (Cobanoglu et al., 2013). First, the approach originally developed for mining DrugBank has been extended to analyzing the STITCH DB, the content of which is 2–3 orders of magnitude larger than that of DrugBank (based on the respective numbers of interactions). Second, the information on predicted drug-target associations is complemented by pathway data on humans inferred from the KEGG pathway DB (December 2017 version; Kanehisa et al., 2017) upon pathway enrichment analysis of known and predicted targets. Third, the outputs are subjected to extensive analyses to detect recurrent patterns and formulate new hypotheses for preventive or therapeutic strategies against drug abuse.

FIGURE 1 | Workflow of the quantitative systems pharmacological analysis. (A) 50 drugs of abuse with a diversity of chemical structures and pharmacological actions were collected as probes. (B) 142 known targets of these drugs were identified through the drug-target interaction database DrugBank and the chemical-protein interaction database STITCH. (C) 48 novel targets were predicted using our probabilistic matrix factorization (PMF) method (Cobanoglu et al., 2013). (D) 173 human pathways were inferred from the KEGG pathways database by mapping the known and predicted targets. (E,F) The pathways were grouped into 5 clusters. The functioning of identified targets and pathways and their involvement in drug addiction were comprehensively examined.

## MATERIALS AND METHODS

### Selection of Drugs of Abuse and Their Known Targets

We selected as input 50 drugs commonly known as drugs of abuse using two basic criteria: (i) diversity in terms of structure and mode of action, and (ii) availability of information on at least one human target protein in DrugBank v5 (Wishart et al., 2018) or STITCH v5 (Szklarczyk et al., 2016). The selected drugs represent six different categories: CNS stimulants, CNS depressants, opioids, cannabinoids, anabolic steroids, and hallucinogens (see **Supplementary Table 1** and **Supplementary Figure 1**).

A dataset of 142 known targets, listed in **Supplementary Table 2**, was retrieved from the DrugBank and STITCH DBs for these 50 drugs. The list includes all targets reported for these drugs in DrugBank, and those reported in STITCH with a high experiment-based confidence score. Each chemical-target interaction in STITCH is annotated with five confidence scores, each ranging from 0 to 1: experimental, DB, text-mining, prediction, and a combined score of the previous four. We selected the human protein targets with experimental confidence scores of 0.4 or higher. **Supplementary Table 2** summarizes the 142 targets we identified as well as the associated 445 drug-target interactions.

Similarities between pairs of drugs were evaluated using two different criteria, one based on structure and the other on interaction patterns. Structure-based similarity was measured by the Tanimoto distance between the 2D structure fingerprints of the two drugs, evaluated using the Python RDKit suite (RDKit: Open-Source Cheminformatics Software. https://www.rdkit.org/). Interaction-pattern similarity was measured by target-based distances. To this aim, we represented each drug i by a 142-dimensional "target vector" **d**<sup>i</sup>, the entries of which correspond to the known targets and take values of 1 or 0 depending on whether or not an interaction has been observed between the corresponding target and drug i. Interaction-pattern similarities between drugs i and j were evaluated by calculating the cosine of the angle between these vectors, cos(**d**<sup>i</sup>, **d**<sup>j</sup>) = (**d**<sup>i</sup> · **d**<sup>j</sup>)/(|**d**<sup>i</sup>| |**d**<sup>j</sup>|), and the corresponding cosine distance is [1 – cos(**d**<sup>i</sup>, **d**<sup>j</sup>)]. Likewise, ligand-based distances between targets i and j were evaluated as the cosine distance between the 50-dimensional vectors **t**<sup>i</sup> and **t**<sup>j</sup> corresponding to the two targets, the entries of which are 0 or 1 depending on the absence or existence of an interaction between the target and the corresponding drug of abuse.
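The two distance measures can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: fingerprints are represented as sets of "on" bit positions rather than RDKit fingerprint objects, the vectors are 4-dimensional instead of 142-dimensional, and all values and function names are hypothetical.

```python
import math


def tanimoto_distance(bits_a, bits_b):
    """1 - |A ∩ B| / |A ∪ B| for fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(bits_a & bits_b) / union


def cosine_distance(u, v):
    """1 - cos(u, v) for two equally long 0/1 interaction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / norm


# Hypothetical fingerprints of two drugs (indices of bits set to 1).
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
structure_distance = tanimoto_distance(fp_a, fp_b)

# Hypothetical target vectors: entry k is 1 if the drug interacts with target k.
d_i = [1, 1, 0, 0]
d_j = [1, 0, 1, 0]
target_distance = cosine_distance(d_i, d_j)

print(structure_distance, target_distance)
```

Here the two fingerprints share 2 of 6 distinct bits, giving a Tanimoto distance of 2/3, and the two drugs share one of their two targets each, giving a cosine distance of 0.5.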

### Probabilistic Matrix Factorization (PMF) Based Drug-Target Interaction Prediction

Novel targets for each drug were predicted using our probabilistic matrix factorization (PMF) based machine learning approach (Cobanoglu et al., 2013, 2015). Briefly, we start with a sparse matrix **R** representing the known interactions between N drugs and M targets. Using the PMF algorithm, we decomposed **R** into a drug matrix **U** and a target matrix **V**, by learning the optimal D latent variables to represent each drug and each target. The product of **U**<sup>T</sup> and **V** assigns values to the unknown (experimentally not characterized) entries of the reconstructed **R**, each value representing the confidence score for a novel drug-target interaction:

$$\mathbf{R}_{N\times M} = \mathbf{U}^{T}_{N\times D}\,\mathbf{V}_{D\times M}$$

Using this method, we trained two PMF models, one based on 11,681 drug-target interactions between 6,640 drugs and 2,255 targets from DrugBank v5, and the other based on 8,579,843 chemical-target interactions for 311,507 chemicals and 9,457 targets from STITCH v5 human experimentally confirmed subset, respectively. We evaluated the confidence scores in the range [0, 1] for each predicted drug-target interaction, in both cases. We selected the interactions with confidence scores higher than 0.7 within the top 10 predicted targets for each input drug. This led to 161 novel interactions identified between 27 out of the 50 input drugs and 89 targets (composed of 41 known and 48 novel targets; **Supplementary Table 3**).
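To make the factorization and the selection rule concrete, the following is a minimal sketch, not the authors' implementation. It fits **U** and **V** by stochastic gradient descent on a squared-error loss with L2 penalties, which is one common way to obtain a PMF point estimate. The toy data, learning rate, regularization strength, and all function names are assumptions. Treating a few pairs as observed non-interactions is also an assumption, made only to keep the toy problem non-trivial, and the sketch does not reproduce how scores were mapped to the range [0, 1].

```python
import random


def train_pmf(observed, n_drugs, n_targets, n_latent=2, lr=0.05,
              reg=0.01, epochs=500, seed=0):
    """Fit U (n_latent x n_drugs) and V (n_latent x n_targets) so that
    U^T V approximates the observed entries. Returns U, V, loss history."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(n_drugs)] for _ in range(n_latent)]
    V = [[rng.gauss(0, 0.1) for _ in range(n_targets)] for _ in range(n_latent)]
    history = []
    for _ in range(epochs):
        loss = 0.0
        for (i, j), r in observed.items():
            err = r - sum(U[d][i] * V[d][j] for d in range(n_latent))
            loss += err * err
            for d in range(n_latent):
                u, v = U[d][i], V[d][j]
                # Gradient step on squared error plus L2 penalty
                U[d][i] += lr * (err * v - reg * u)
                V[d][j] += lr * (err * u - reg * v)
        history.append(loss)
    return U, V, history


def score_matrix(U, V):
    """Reconstructed R = U^T V as a list of rows (one row per drug)."""
    n_latent, n_drugs, n_targets = len(U), len(U[0]), len(V[0])
    return [[sum(U[d][i] * V[d][j] for d in range(n_latent))
             for j in range(n_targets)] for i in range(n_drugs)]


def select_novel(scores, known, top_k=10, cutoff=0.7):
    """Per drug: rank targets without a known interaction, keep the top_k,
    then keep only those scoring above the cutoff."""
    selected = {}
    for i, row in enumerate(scores):
        candidates = sorted(((s, j) for j, s in enumerate(row)
                             if (i, j) not in known), reverse=True)
        selected[i] = [j for s, j in candidates[:top_k] if s > cutoff]
    return selected


# Toy data: 4 drugs x 4 targets; 1 = known interaction, 0 = assumed negative
observed = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (2, 2): 1, (2, 3): 1, (3, 3): 1,
            (0, 2): 0, (1, 3): 0, (2, 0): 0, (3, 1): 0}
U, V, history = train_pmf(observed, n_drugs=4, n_targets=4)
known = {pair for pair, r in observed.items() if r == 1}
predictions = select_novel(score_matrix(U, V), known)
```

The selection step mirrors the rule stated above: the cutoff of 0.7 is applied only within each drug's top 10 ranked candidates, so a drug can contribute at most 10 predicted targets.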

#### Pathway Enrichment Analysis

We mapped the 50 drugs with 142 known and 48 predicted targets to the KEGG pathways (version December 2017, *Homo sapiens*) (Kanehisa et al., 2017). The 142 known targets mapped to 114 pathways, and all targets (both known and predicted) mapped to 173 pathways (see **Supplementary Table 4**). In order to prioritize enriched pathways, we calculated hypergeometric p-values based on the targets as the enrichment score, as follows. Given a list of targets, the enrichment p-value for pathway A (P<sup>A</sup>) is the probability of randomly drawing k<sub>0</sub> or more targets that belong to pathway A:

$$P^{A} = \sum_{k_0 \le k \le m} \frac{\binom{K}{k} \binom{M-K}{m-k}}{\binom{M}{m}}$$

where M is the total number of human proteins in the KEGG Pathway, m is the total number of proteins/targets we identified, and K is the number of proteins that belong to pathway A, while k<sub>0</sub> is the number of targets we identified that belong to pathway A. The obtained p-values are adjusted by a False Discovery Rate (FDR) correction to account for multiple testing, using the widely used Benjamini-Hochberg method (Benjamini and Yekutieli, 2001). The cutoff of the adjusted p-values gives us an upper bound of the false discovery rate, i.e., the maximal fraction of false positives expected among the pathways identified as significant. We sort the p-values from smallest to largest, with m here denoting the total number of pathways. The adjusted p-value, p<sup>∗</sup><sub>i</sub>, corresponding to the ith pathway is:

$$p_i^{*} = \min_{k=i,\ldots,m}\left\{\min\left(\frac{p_k\,m}{k},\,1\right)\right\}$$

**Supplementary Table 4** lists these p-values for pathway enrichments based on both known and predicted targets.
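The enrichment score and the multiple-testing correction can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the code used in the study: the function names and the toy numbers are assumptions, and the adjustment shown is the standard Benjamini-Hochberg step-up procedure (each sorted p-value is scaled by the number of tests divided by its rank, then made monotone from the largest p-value downward).

```python
from math import comb


def enrichment_p_value(M, K, m, k0):
    """Hypergeometric tail P(X >= k0): m proteins drawn from M in total,
    K of which belong to the pathway. math.comb returns 0 when k > K,
    so the sum can run up to m as in the formula."""
    total = comb(M, m)
    return sum(comb(K, k) * comb(M - K, m - k) for k in range(k0, m + 1)) / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: 10 proteins, 3 in the pathway, 4 identified, 2 of them in the pathway
p_toy = enrichment_p_value(M=10, K=3, m=4, k0=2)   # (3*21 + 1*7) / 210 = 1/3
adjusted_toy = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Using exact integer binomials avoids floating-point underflow for very small p-values, at the cost of speed for large M; a statistics library would be the usual choice at scale.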

The source code used for generating the results reported in this study is available at https://github.com/Fengithub/DA.

#### RESULTS

### Functional Similarity of Drugs of Abuse Does Not Imply Structural Similarity, Consistent With the Multiplicity of Their Actions

**Figure 2** presents a quantitative analysis of the functional and structural diversity of the examined n = 50 drugs of abuse, and the similarities among the m = 142 known targets of these addictive drugs. The n × n maps in **Figures 2A,B** display the drug-drug pairwise distances/dissimilarities based on their 2D fingerprints (**Figure 2A**), and their interaction patterns with their targets (**Figure 2B**). **Figures 2C**,**D** display the corresponding dendrograms. The drugs are indexed and color-coded as in **Supplementary Table 1** and **Supplementary Figure 1**. As expected, drugs belonging to the same functional category (same color) exhibit more similar interaction patterns (**Figure 2D**). However, we also note outliers, such as cocaine lying among opioids, as opposed to its categorization as a CNS stimulant, or promethazine, a CNS depressant, lying among hallucinogens (shown by arrows). The peculiar behavior of cocaine is consistent with its high promiscuity (see **Figure 3A** for the number of targets associated with each examined drug). This type of promiscuity becomes even more apparent when the drugs are organized based on their structure (or 2D fingerprints; see section Materials and Methods) as may be seen in **Figure 2A**. For example, opioids (cyan labels/arc; clustered together in **Figures 2B,D** based on their interactions) are now distributed in two or more branches of the structure-based dendrogram in **Figure 2C**; likewise, CNS depressants (blue) and cannabinoids (light brown), each grouped as a single cluster in the target-based dendrogram in **Figure 2D**, are now distributed into two or more clusters in **Figure 2C**.

Overall these results suggest that the functional categorization of the drugs does not necessarily comply with their structural characteristics. The similar functionality presumably originates from targeting similar pathways, but the difference in the structure suggests that either their targets, or the binding sites on the same target, are different; or the binding is not selective enough such that multiple drugs can bind the same site. Consequently, a diversity of pathways or a multiplicity of cellular responses are triggered by the use and abuse of these drugs.

### The Selected Drugs and Identified Targets Are Highly Diverse and Promiscuous

We evaluated the similarities between proteins targeted by drugs of abuse, based on their interaction patterns with the studied drugs of abuse. **Figures 2E,F** display the respective target-target distances and the corresponding dendrogram. **Supplementary Table 2** lists the full names of these targets, organized in the same order as the **Figure 2E** axes. We discern several groups of targets clustered together, consistent with their biological functions. For example, practically all GABA receptor subtypes (brown) are clustered together. This large cluster also includes the riboflavin transporter 2A (SLC52A2), which may be required for GABA release (Tritsch et al., 2012).

labels in (C,D) are color-coded based on their categories: CNS stimulants (*green*), CNS depressants (*blue*), opioids (*cyan*), cannabinoids (*light brown*), anabolic steroids (*black*) and hallucinogens (*magenta*). Note that the drugs of abuse in the same category do not necessarily show structural similarities or similar interaction patterns with targets. (E) Pairwise distance map for the 142 known targets based on their interaction patterns with the 50 drugs. The indices in (E) follow the same order as those listed clockwise in the dendrogram (F). The tree maps in (C,D,F) are generated based on the respective distance values in (A,B,E).

On the other hand, the different subtypes of serotonin (or 5-hydroxytryptamine, 5-HT) receptors (5HTRs) participate in distinct clusters pointing to the specificity of different subtypes vis-à-vis different drugs of abuse (labeled in **Figure 2F**).

The large majority of neurotransmitter transporters, such as the Na+/Cl−-dependent GABA transporter (SLC6A1) and the glycine transporter (SLC6A9), are in the same cluster (pink, labeled). Acetylcholine receptors also lie close to (or are even interspersed among) Na+/Cl−-dependent neurotransmitter transporters, presumably due to shared drugs such as cocaine. However, the three transporters playing a crucial role in developing drug addiction, DAT, NE transporter (NET) and serotonin transporter (SERT) (labeled SLC6A2: NET, SLC6A3: DAT, SLC6A4: SERT), are distinguished from all other neurotransmitter transporters as a completely disjoint group. The corresponding branch of the dendrogram (highlighted by the yellow circle) also includes vesicular amino acid transporters and trace amine-associated receptor 1 (TAAR1), known to interact with these transporters (Miller, 2011). We also note in the same branch two seemingly unrelated targets: flavin monoamine oxidase, which draws attention to the role of oxidative events; and α2-adrenergic receptor subtypes A-C, which use NE as a chemical messenger for mediating stimulant effects such as sensitization and reinstatement of drug seeking, and adenylate cyclase as another messenger to regulate cAMP levels (Sofuoglu and Sewell, 2009).

FIGURE 3 | Promiscuity of drugs of abuse and their targets, and major families of proteins targeted by drugs of abuse. Numbers of known (*gray*) and predicted (*white*) interactions are shown by bars for (A) drugs of abuse and (B) their targets. The examined set consists of 50 drugs of abuse and a total of 142 known and 48 predicted targets, involved in 445 (known) and 161 (predicted) interactions. (A) Displays the number of interactions known or predicted for all 50 drugs. (B) Displays the results for the targets that interact with at least 4 known drugs (36 targets). The colors used for names of drugs and targets are the same as those used in Figure 2. (C) Displays the distribution of families of proteins targeted by drugs of abuse.

**Supplementary Table 2** summarizes the 445 known interactions between these 50 drugs and 142 targets. We observe an average of 8.9 interactions per drug and 3.1 interactions per target. There are 23 promiscuous drugs that target at least 10 proteins as shown in **Figure 3A**. Cocaine, the most promiscuous psychostimulant, interacts with 45 known, and 3 predicted targets. It is known that cocaine binds DAT to lock it in the outward-facing state (OFS) and block the reuptake of DA. It similarly antagonizes SERT and NET (Heikkila et al., 1975; Sora et al., 1998), and also affects muscarinic acetylcholine receptors (mAChRs) M1 and M2 (Williams and Adinoff, 2008). Our PMF model also predicted a potential interaction between cocaine and M5. While this interaction is not listed in current DBs, there is experimental evidence suggesting that muscarinic AChR M5 plays an important role in reinforcing the effects of cocaine (Fink-Jensen et al., 2003), in support of the PMF model prediction.
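The promiscuity statistics quoted above follow directly from the interaction list. The sketch below shows one way to count interaction partners per drug and per target and to check the reported averages. The toy pairs and names are hypothetical placeholders, and the helper names are assumptions; only the totals (445 interactions, 50 drugs, 142 targets) and the threshold of 10 targets come from the text.

```python
from collections import Counter

# Hypothetical toy interaction list of (drug, target) pairs
interactions = [
    ("drug_A", "T1"), ("drug_A", "T2"), ("drug_A", "T3"),
    ("drug_B", "T1"), ("drug_C", "T1"),
]


def degree_counts(pairs):
    """Number of partners per drug and per target."""
    drug_degree = Counter(drug for drug, _ in pairs)
    target_degree = Counter(target for _, target in pairs)
    return drug_degree, target_degree


def promiscuous(degree, min_partners):
    """Names with at least min_partners interaction partners."""
    return sorted(name for name, n in degree.items() if n >= min_partners)


drug_degree, target_degree = degree_counts(interactions)
toy_promiscuous_drugs = promiscuous(drug_degree, min_partners=2)

# Averages implied by the totals reported in the text
n_interactions, n_drugs, n_targets = 445, 50, 142
per_drug = n_interactions / n_drugs        # 8.9
per_target = n_interactions / n_targets    # about 3.13, reported as 3.1
```

In the study, the analogous call would use a threshold of 10 partners to recover the promiscuous drugs shown in **Figure 3A**.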

The PMF model enables us to predict novel targets. For example, the anabolic steroid nandrolone has only two known interactions, and the cannabinoid cannabichromene has one. However, 10 new targets were predicted with high confidence scores for each of them (**Supplementary Table 3** and **Supplementary Figure 2A**). This is due to the data available in STITCH DB, which offers a large training dataset that enhances the performance of our machine learning approach. Overall, 89 new interactions were predicted for known targets, and 42 novel targets were predicted with 72 interactions. **Figure 3C** displays the distribution of all targets among different protein families. As will be further elaborated below, among the newly identified drug-target pairs, nandrolone-MAPK14 (mitogen-activated protein kinase 14, also known as p38α) and cannabichromene-IKBKB (inhibitor of NF-κB kinase subunit β) play a role in regulating mTORC1 signaling, which will be shown to be a potential effector of drug addiction.

Turning to targets, three opioid receptors (OPRM1, OPRD1, and OPRL1) exhibit the highest level of promiscuity (**Supplementary Figure 2B**). The µ-type opioid receptor (OPRM1) interacts with 14 known drugs including all opioids as well as ketamine and dextromethorphan. We also predicted a novel interaction between OPRM1 and the CNS stimulant methylphenidate. This is consistent with experimental observations that methylphenidate upregulates OPRM1's activity in the reward circuitry in a mouse model (Zhu et al., 2011). Furthermore, tissue-based transcriptome analysis (Uhlén et al., 2015) shows that 69% of our 190 targets are expressed in the brain, and 49 of them show elevated expression levels in the brain compared to other tissue types (**Supplementary Table 5**). Among all the targets, NMDA receptor 1 (GRIN1) shows the highest elevated expression. It is also one of the top 5 enriched genes overall in the brain (Uhlén et al., 2015).

Taken together, the 50 selected drugs of abuse and the 142 known and 48 novel targets we identified cover a diversity of biological functions, are involved in many cellular pathways, and are generally promiscuous. In order to reveal the common mechanisms that underlie the development and escalation of drug addiction and also distinguish the effects specific to selected drugs, we proceed now to a detailed pathway analysis, presented next.

### Pathway Enrichment Analysis Reveals the Major Pathways Implicated in Various Stages of Addiction Development

Our QSP analysis yielded a total of 173 pathways, including 114 associated with the known targets of the examined dataset of drugs of abuse, and 59 associated with the predicted targets. The detailed pathway enrichment results can be found in **Supplementary Table 4**. These pathways can be grouped in five categories (**Figure 4**; **Supplementary Figures 3**, **4**, and **Supplementary Table 4**):

#### Synaptic Neurotransmission (NT)

Six significantly enriched (with adjusted p-value < 0.05) pathways are associated with synaptic neurotransmission: dopaminergic, serotonergic, glutamatergic, synaptic vesicle cycle, cholinergic, and GABAergic synapses pathways. Sixty-eight known targets and 7 predicted targets are involved in these pathways. This is consistent with the fact that neurotransmission plays a dominant role in the rewarding system and is key to drug addiction (Volkow and Morales, 2015).

#### Signal Transduction (SG)

Forty-six intracellular signaling pathways were mapped by 92 targets comprised of 66 known and 25 predicted targets. Notably, many of these pathways have been reported to play a role in mediating the effects of drugs of abuse. These include the top five [calcium signaling (Li et al., 2008), retrograde endocannabinoid signaling (Mechoulam and Parker, 2013), cGMP-PKG signaling (Shen et al., 2016), cAMP signaling (Philibin et al., 2011), and Rap1 signaling (Cahill et al., 2016)] as well as some pathways with relatively low enrichment score (i.e., 0.2 < adjusted p-value), such as TNF signaling (Zhu et al., 2018), MAPK signaling (Sun et al., 2016), PI3K-Akt signaling (Neasta et al., 2011), NF-κB signaling (Nennig and Schank, 2017), and mTOR signaling (Neasta et al., 2014). We note that many receptors targeted by drugs of abuse take part in the KEGG neuroactive ligand-receptor interaction pathway. In the interest of focusing on intracellular signaling effects, we have not included these in the SG category; they are listed in the "Other Pathways" in **Supplementary Table 4**.

#### Autonomic Nervous System (ANS)-Innervation (ANS)

We also identified 10 pathways regulating ANS-innervated systems such as endocrine secretion, taste transduction, and circadian entrainment. Recent evidence suggests that drugs of abuse such as morphine (Al-Hasani and Bruchas, 2011) and cocaine (Moeller et al., 1997; Prosser et al., 2014) can influence ANS-innervated systems and may contribute to the withdrawal symptoms associated with drug addiction. Thirty-seven known and 9 predicted targets take part in these pathways.

#### Neuroplasticity (NP)

Eight enriched pathways with the potential to alter the morphology of neurons were found to be related to drug addiction. Among them, long-term potentiation (LTP) and long-term depression (LTD) are key to reward-related learning and addiction by modifying the fine tuning of dopaminergic firing (Jones and Bonci, 2005). The axon guidance pathway regulates the growth direction of neurons (Bahi and Dreyer, 2005). Regulation of the actin cytoskeleton plays an important role in the morphological development and structural changes of neurons (Luo, 2002). Gap junctions connect neighboring neurons via intercellular channels that allow direct electrical communication and regulate the efficiency of communication between electrical synapses (Belousov and Fontes, 2013). Nineteen known targets and 5 predicted targets are involved in these pathways. Insulin-like growth factor 1 receptor (IGF1R) is predicted as a target of the drug triazolam (**Supplementary Table 4**). IGF1R is involved in the LTP, adherens junction and focal adhesion pathways. It functions via canonical signaling pathways noted above in the SG category, such as the PI3K-Akt-mTOR and Ras-Raf-MAPK pathways, and it plays an important role in neuroplasticity (Lee et al., 2016). We note that the NP group involves many pathways directly relevant to drug addiction (Bahi and Dreyer, 2005; Kalivas and Volkow, 2011; Moradi et al., 2013; Rothenfluh and Cowan, 2013). There is no target unique to this particular group of pathways (**Figure 4B**). However, the fact that the targets belonging to the NP group are also shared by other groups consolidates the significance of these targets.

categories. See the complete list of pathways and targets in Supplementary Table 4.

#### Disease-Associated Pathways (DS)

Fifty enriched pathways mapped by 51 known and 17 predicted targets are associated with diverse diseases in different organs such as brain, liver, and lung. They also cover various drug addiction mechanisms including: nicotine addiction, morphine addiction, cocaine addiction, amphetamine addiction, and alcoholism. Additionally, there are "other pathways" such as those involved in cell migration, differentiation, immune responses, and metabolic events, which can be seen in **Supplementary Table 4**.

Taken together, the enrichment analysis reveals five major categories of pathways that regulate the three stages of drug addiction cycle: (1) binge and intoxication, (2) withdrawal and negative affect, and (3) preoccupation and anticipation (or craving) (Koob and Volkow, 2010). Drugs of abuse directly affect neurotransmission pathways: they increase the accumulation of DA and other neurotransmitters in the synaptic and extrasynaptic regions, which in turn results in the hedonic feeling (stage 1) and triggers the DA reward system. Dysregulation of ANS-innervation pathways may cause negative effects and feelings (stage 2) and feedback to the CNS. Addictive drugs impair executive processes by disrupting the reward system (neurotransmission pathways) and imparting morphological changes via neuroplasticity pathways (e.g., LTD and LTP), which then result in craving (stage 3). Below, we present an in-depth analysis of the role of these pathways or their shared targets in drug addiction.

#### Selected Targets Shared by Dominant Pathways Emerge as Common Mediators of Drug Addiction

We next analyzed the overlapping targets between the pathways in different functional categories.

First, we note that eight pleiotropic proteins are shared by all five categories (at the intersection of the five Venn diagrams in **Figure 4B**): AMPA receptor (subtype GluA2; GRIA2), NMDA receptors 1 and 2A-D (designated as GRIN1, GRIN2A, GRIN2B, GRIN2C, and GRIN2D) and voltage-dependent calcium channel Cav2.1 (or CACNA1A) as well as the predicted target phosphatidylinositol 3-kinase class 1A catalytic subunit α (PIK3CA) (**Supplementary Table 4**).

Second, 15 proteins are distinguished as targets shared by four of these major groups of pathways: serotonin receptors (5HTR2-A, -B and -C), GABA<sub>A</sub> receptors 1-6 (GABRA1-GABRA6), β-1 adrenergic receptor (ADRB1), Ras-related C3 botulinum toxin substrate 1 (RAC1; member of the Rho family of GTPases), mAChR M<sub>3</sub> (CHRM3) and DA receptor D<sub>2</sub> (DRD2), and two predicted targets: p38α (MAPK14) and DA receptor D<sub>1</sub> (DRD1).

AMPA receptor plays a crucial role in LTP and LTD, which are vital to neuroplasticity, memory and learning (Volkow et al., 2016). Serotonin receptors, expressed in both the CNS and the peripheral nervous system (e.g., gastrointestinal tract), are responsible for anxiety, impulsivity, memory, mood, sleep, thermoregulation, blood pressure, gastrointestinal motility, and nausea (Pytliak et al., 2011). They have been proposed to be therapeutic targets for treating cocaine use disorder (Howell and Cunningham, 2015). RAC1 is involved in five neuroplasticity pathways, including axon guidance, adherens junction and tight junction pathways (**Supplementary Table 4**), and 13 intracellular signal transduction pathways. It regulates neuroplasticity, as well as apoptosis and autophagy (Natsvlishvili et al., 2015). DA receptor D<sub>2</sub> is a target of 28 drugs of abuse (out of 50 examined here) and is involved in cAMP signaling and gap junction pathways, in addition to dopaminergic signaling. It is implicated in reward mechanisms in the brain (Blum et al., 1996) and the regulation of drug-seeking behaviors (Edwards et al., 2006). Finally, PI3K turns out to be the most pleiotropic target among those targeted by drugs of abuse, being involved in 61 pathways identified here, including neuroplasticity pathways such as axon guidance, and several downstream signaling pathways such as PI3K-Akt, mTOR, Ras and Jak-STAT pathways.

Overall, the above listed 23 proteins shared by at least four different groups of pathways are distinguished here as highly pleiotropic proteins involved in the large majority of pathway categories implicated in drug abuse. Most of them are ligand- or voltage-gated ion channels or neurotransmitter receptors, mainly AMPAR, NMDAR, Cav2.1, mAChR, and serotonin and DA receptors. However, it is interesting to note that the targets PI3K and p38α, not currently reported in DrugBank and STITCH, emerge as highly pleiotropic targets of the drugs of abuse. These are suggested by the current analysis to directly or indirectly affect addiction development and await future experimental validation. Finally, a number of proteins take part in specific drug-abuse-related pathways and might serve as targets for selective treatments. **Supplementary Table 6** provides a list of such targets uniquely implicated in distinctive pathways.
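The overlap analysis summarized above amounts to counting, for each target, the number of pathway categories in which it appears. A minimal sketch under stated assumptions follows: the category labels match those used in the text, but the target names and memberships are hypothetical placeholders, not the study's data, and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical memberships: category -> set of targets in its pathways
category_targets = {
    "NT": {"T1", "T2", "T3"},
    "SG": {"T1", "T2", "T4"},
    "ANS": {"T1", "T2"},
    "NP": {"T1", "T2"},
    "DS": {"T1", "T5"},
}


def categories_per_target(cat_map):
    """Invert the mapping: target -> set of categories containing it."""
    membership = defaultdict(set)
    for category, targets in cat_map.items():
        for target in targets:
            membership[target].add(category)
    return membership


def shared_by_at_least(cat_map, n):
    """Targets that appear in at least n categories."""
    membership = categories_per_target(cat_map)
    return sorted(t for t, cats in membership.items() if len(cats) >= n)


in_all_five = shared_by_at_least(category_targets, 5)
in_four_or_more = shared_by_at_least(category_targets, 4)
```

With the study's data, the first call would correspond to the eight proteins shared by all five categories, and the second to the 23 proteins shared by at least four.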

#### Pathway Enrichment Highlights the Interference of Drugs of Abuse With Synaptic Neurotransmission

It is broadly known that neurotransmitters such as DA, 5-HT, NE, endogenous opioids, ACh, endogenous cannabinoids, Glu, and GABA are implicated in drug addiction (Tomkins and Sellers, 2001; Everitt and Robbins, 2005; Parolaro and Rubino, 2008; Benarroch, 2012). Our analysis also showed that the serotonergic synapse (adjusted p-value p<sup>∗</sup><sub>i</sub> = 2.01E-18), GABAergic synapse (p<sup>∗</sup><sub>i</sub> = 1.19E-17), cholinergic synapse (p<sup>∗</sup><sub>i</sub> = 2.36E-07), dopaminergic synapse (p<sup>∗</sup><sub>i</sub> = 1.66E-06) and glutamatergic synapse (p<sup>∗</sup><sub>i</sub> = 1.86E-03) pathways were significantly enriched (**Supplementary Table 4**). A total of 34 drugs (across six different groups) target at least one of these pathways. However, the identification of a pathway does not necessarily mean that the drug directly affects that particular neurotransmitter transport/signaling. There may be indirect effects due to the crosstalk between synaptic signaling pathways. For example, the ionotropic glutamate receptors NMDAR and AMPAR are also downstream mediators in the dopaminergic synapse pathway. Likewise, GABARs are downstream mediators in the serotonergic synapse pathway.

In **Figure 5**, we highlight five major neurotransmission events that directly mediate addiction, and illustrate how eight drugs of abuse interfere with them. Despite the promiscuity of the drugs of abuse, some selectively map onto a single synaptic neurotransmission pathway. For example, psilocin [a hallucinogen whose structure is similar to 5-HT (Diaz, 1997)] interacts with several types of 5HTRs, regulating the serotonergic synapse exclusively (see **Figure 5** and **Supplementary Table 4**). In contrast, loperamide (not shown) affects all neurotransmission pathways by interacting with the voltage-dependent P/Q-type calcium channel (VGCC), regulating calcium flux at synapses. Cocaine targets four of these synaptic neurotransmission events (serotonergic, GABAergic, cholinergic, and dopaminergic synapses), through its interactions with 5-HT3R, sodium- and chloride-dependent GABA transporter (GAT), muscarinic (M1 and M2) and nicotinic AChRs, and DAT, respectively. Methadone affects three synaptic neurotransmission events (serotonergic, dopaminergic, and glutamatergic synapses) through its interactions with SERT, DAT, and glutamate receptors (NMDAR), respectively.

It is worth noting that the current analysis helps us generate new hypotheses, yet to be experimentally validated, on the ways drugs of abuse affect neurotransmission. In addition to the new role of the muscarinic AChR M5 suggested by the current analysis (see section The Selected Drugs and Identified Targets Are Highly Diverse and Promiscuous), our PMF model suggested that cannabichromene, a cannabinoid whose primary target is the transient receptor (TRPA1), could interact with DAT and thus regulate dopaminergic transmission, which will require further examination.

The above synaptic neurotransmission events act as upstream signaling modules that "sense" the early effects of drug abuse. In the next section, we focus on the downstream signaling events elicited by drug abuse.

#### mTORC1 Emerges as a Potential Downstream-Effector Activated by Drug Abuse

The calcium-, cAMP-, Rap1-, Ras-, AMPK-, ErbB-, MAPK-, and PI3K-Akt-signaling pathways in the SG category (**Supplementary Table 4**) crosstalk with each other and form a unified signaling network. As shown in **Figure 6**, ligand-binding to GPCRs modulates the production of cAMP, which leads to the activation of Rap1. Activated Rap1 modulates Ca<sup>2+</sup> signaling by inducing the production of inositol triphosphate (IP3) and also activates the PI3K-Akt signaling cascade. Stimulation of the ErbB family of receptor tyrosine kinases (related to epidermal growth factor receptor EGFR) as well as insulin-like growth factor receptor IGF1R triggers both PI3K-Akt and MAPK signaling cascades (proteins colored blue in **Figure 6**). Notably, all these pathways merge and regulate a group of downstream proteins (shown in dark yellow in **Figure 6**); and at the center of this cluster lies the mammalian target of rapamycin (mTOR) complex 1 (mTORC1), which is likely to be synergistically regulated by all these merging pathways.

drug-target interaction, *dashed red arrows* indicate predicted drug-target interactions. Other molecules shown in the diagram are: KA, kainate receptor; MAO, monoamine oxidase; HVA, homovanillate; 3-MT, 3-methoxytyramine; MOR, mu-type opioid receptor; AChE, acetylcholinesterase; and 5-HIAA, 5-hydroxyindoleacetate.

mTORC1 is not only a master regulator of autophagy (Rabanal-Ruiz et al., 2017), but also controls protein synthesis and transcription (Ma and Blenis, 2009). It has been reported to promote neuroadaptation following exposure to drugs of abuse including cocaine, alcohol, morphine and Δ<sup>9</sup>-tetrahydrocannabinol (THC) (Neasta et al., 2014). Our results lead to the hypothesis that mTORC1 may act as a universal effector of the cellular response to drug abuse at an advanced (preoccupation and anticipation, or craving) stage, controlling the synthesis of selected proteins and ensuing cell growth, which may result in persistent alterations in the dendritic morphology and neuronal circuitry.

In **Figure 6**, selected interactions between drugs from different substance groups and their targets are highlighted using gray arrows. The figure illustrates that not only many known drug-target interactions, but also predicted ones, are involved in the unified signaling network. For example, our PMF model predicted that diazepam would interact with PI3K to influence mTORC1 signaling (dashed gray arrows denote predictions). It has been reported that Ro5-4864, a benzodiazepine derivative of diazepam, suppresses activation of PI3K (Yousefi et al., 2013), which corroborates our prediction. We further predicted that cannabichromene may interact with IκB kinase β (IKKβ) to regulate mTORC1 by inhibiting TSC1/2. Interestingly, another cannabinoid, arachidonoyl ethanolamine, is known to directly inhibit IKKβ (Sancho et al., 2003). Taken together, our results suggest a unified network that underlies the development of drug addiction, in which mTORC1 appears to play a key effector role.

## DISCUSSION

In the present study we focused on the targets and pathways affected by drugs of abuse, toward gaining a systems-level understanding of key players and dominant interactions that control the response to drug abuse and the development of drug addiction. Using machine learning methods, we focused on 50 drugs of abuse that form a chemically and functionally diverse set, and analyzed their 142 targets as well as the corresponding cellular pathways and their crosstalk. Our analysis identified:

(i) 48 additional proteins targeted by drugs of abuse, including PIK3CA, IKBKB, EGFR, and IGF1R, are shown to be key mediators of downstream effects of drug abuse.


Overall, our comprehensive analysis led to new hypotheses on drug-target interactions and signaling and regulation mechanism elicited by drugs of abuse in general, along with those on selected targets and pathways for specific drugs. Below we elaborate on the biological and biomedical implications of these findings.

### Persistent Restructuring in Neuronal Systems as a Feature Underlying Drug Addiction

Enriched pathways in the neuroplasticity category include gap junction, LTP, LTD, adherens junction, regulation of actin cytoskeleton, focal adhesion, axon guidance, and tight junction (**Supplementary Table 4**). These are responsible for the changes in the morphology of dendrites. For instance, DA regulates excitatory synaptic plasticity by modulating the strength and size of synapses through LTP and LTD (De Roo et al., 2008; Volkow and Morales, 2015). The restructuring of dendritic spines involves rearrangements of the cytoskeleton and actin-myosin (Volkow and Morales, 2015). The axon guidance molecules guide the direction of neuronal growth.

Drugs of abuse can induce changes in the CNS through these pathways. For example, chronic exposure to cocaine increases dendritic spine density in medium spiny neurons (Russo et al., 2010). The disruption of the axon guidance pathway and alteration in synaptic geometry can result in drug-related plasticity (Bahi and Dreyer, 2005). The persistent restructuring in the CNS caused by drugs of abuse is responsible for long-term behavioral plasticity driving addiction (Volkow et al., 2003; Russo et al., 2010; Volkow and Morales, 2015). As will be further discussed below, mTORC1 plays a central role in the synthesis of new proteins (e.g., AMPARs) and thereby in neuronal (dendritic) growth, alteration of the synaptic geometry and, therefore, rewiring of the neuronal circuitry.

### ANS May Mediate the Negative-Reinforcement of Drug Addiction

The current study further points to pathways regulating the ANS-innervated systems. As the NP pathways influence the neuroplasticity in the ANS, we hypothesize that drugs of abuse might induce a persistent restructuring in the ANS as well. The drug-related plasticity in the ANS may lead to the dysregulation of ANS-innervated systems and cause negative effects and feelings during the second stage of drug addiction. Drug addiction is well known as a brain disease (Volkow and Morales, 2015). However, many drugs of abuse can disrupt the activity of the ANS and cause disorders in ANS-innervated systems (Al-Hasani and Bruchas, 2011; Huang, 2017). For example, opioids (e.g., morphine) alter neuronal excitability and neurotransmission in the ANS (Wood and Galligan, 2004), and induce disorders in the gastrointestinal system, smooth muscle, skin, cardiovascular, and immune system (Al-Hasani and Bruchas, 2011). Cannabinoids (e.g., THC) modulate the exocytotic NE release in ANS-innervated organs through presynaptic cannabinoid receptors (Ishac et al., 1996).

The pathways we identified in the ANS category regulate insulin secretion, gastric acid secretion, vascular smooth muscle contraction, pancreatic secretion, salivary secretion, and renin secretion (**Supplementary Table 4**). Their dysfunction may be associated with the autonomic withdrawal syndrome, such as thermoregulatory disorder (chills and sweats) and gastrointestinal upset (abdominal cramps and diarrhea), which has been observed in drug/substance users (Wise and Koob, 2014). In addition, the stress and depression caused by these negative effects may be part of the negative reinforcement of drug addiction (Self and Nestler, 1995; Koob and Le Moal, 2001). In other words, drug-induced ANS disorders can feed back to the CNS and mediate the negative reinforcement. Compared with the structural changes in the CNS, the disorder and persistent restructuring in the ANS are less studied, and they could be a future direction in the study of the development of drug addiction and related diseases.

#### mTORC1 Appears as a Key Mediator of Cellular Morphological Changes Elicited in Response to Continued Drug Abuse

The functioning and regulation of mTOR signaling have been elucidated over the past two decades. It became clear that mTORC1 plays a crucial role in regulating diverse cellular processes including protein synthesis, autophagy, lipid metabolism, and mitochondrial biogenesis (Saxton and Sabatini, 2017). In the brain, mTORC1 coordinates neural development, circuit formation, synaptic plasticity, and long-term memory (Lipton and Sahin, 2014). Dysregulation of the mTORC1 pathway is associated with many neurodevelopmental and neurodegenerative diseases such as Parkinson's disease and Alzheimer's disease. mTORC1 has been noted to be an important mediator of the development of drug addiction and relapse vulnerability (Dayas et al., 2012). Accumulating evidence shows that pharmacological inhibition of mTORC1 (often through rapamycin treatment) can prevent sensitization of methamphetamine-induced place preference (Narita et al., 2005), reduce craving in heroin addicts (Shi et al., 2009), attenuate the expression of alcohol-induced locomotor sensitization (Neasta et al., 2010), suppress the expression of cocaine-induced place preference (Bailey et al., 2012), protect against the expression of drug-seeking and relapse by reducing AMPAR (GluA1) and CaMKII levels (James et al., 2014), and inhibit reconsolidation of morphine-associated memories (Lin et al., 2014).

Our unbiased computational analysis based on a diverse set of 50 drugs of abuse supports the hypothesis that mTORC1 may act as a universal effector or controller of neuroadaptations induced by drugs of abuse (Neasta et al., 2014). The major signal transduction pathways we identified that involve targets of drugs of abuse interconnect and converge on the mTORC1 signaling cascade (**Figure 6**). Most drugs of abuse in our list target upstream regulators of mTORC1, including membrane receptors (e.g., GPCRs, RTKs and NMDAR), kinases (e.g., PI3K, p38α, and IKKβ), and ion channels (e.g., CaV2.1 and TRPV2). Notably, the impact of some of these known or predicted targets has been experimentally confirmed. For example, blockade of the known target NMDAR using MK801 reduces the amnesic-like effects of the cannabinoid THC (Puighermanal et al., 2009). Likewise, inhibition of PI3K (a predicted target) by LY294002 suppresses morphine-induced place preference in rats (Cui et al., 2010) and the expression of cocaine sensitization (Izzo et al., 2002). Our results thus provide a pool of candidate targets implicated in cellular responses to addictive drugs, which await consolidation by further tests.

The downstream effectors of mTORC1 that specifically mediate drug behavioral plasticity are far from known. mTORC1 can mediate the activation of S6Ks and 4E-BPs, which leads to increased production of proteins required for synaptic plasticity including AMPAR and PSD-95 (Dayas et al., 2012). EM reconstruction of hippocampal neuropil showed the variability in the size and shape of dendrites depending on synaptic activity (Bartol Jr et al., 2015), which in turn correlates with information storage. Recent studies have revealed that Atg5- and Atg7-dependent autophagy in dopaminergic neurons regulates cellular and behavioral responses to morphine (Su et al., 2017). Cocaine exposure results in ER stress-induced and mTORC1-dependent autophagy (Guo et al., 2015). Fentanyl induces autophagy via activation of the ROS/MAPK pathway (Yao et al., 2016). Methamphetamine induces autophagy through the κ-opioid receptor (Ma et al., 2014). These observations are consistent with the currently inferred role of mTORC1 as a downstream effector of cellular responses to drug addiction.

#### Drug Repurposing Opportunities for Combatting Drug Addiction

Autophagy-modulating drugs have been shown to have therapeutic effects against liver and lung diseases. The signaling network presented in **Figure 6** involves many targets of such drugs. For instance, carbamazepine affects IP<sub>3</sub> production and enhances autophagy via the calcium-AMPK-mTORC1 pathway (Hidvegi et al., 2010). It has been identified as a potential drug for treating α1-antitrypsin deficiency, hepatic fibrosis, and lung proteinopathy (Hidvegi et al., 2010, 2015). Rapamycin is a potential drug for lung diseases such as fibrosis (Abdulrahman et al., 2011; Patel et al., 2012). Other liver and lung drugs that facilitate the removal of aggregates by promoting autophagy may also affect drug-related neurodegenerative disorders. **Supplementary Table 7** summarizes 15 autophagy-modulating drugs for liver and lung diseases. Target identification and pathway analysis of this subset of drugs using the same protocol as that adopted for the 50 drugs of abuse indeed confirmed that drugs of abuse and liver/lung drugs share many common pathways (**Supplementary Figure 5**). Notably, neuroactive ligand-receptor interactions, calcium signaling, and serotonergic synapse pathways are among the top 10 enriched pathways of both drugs of abuse and liver/lung drugs. Amphetamine addiction and alcoholism are also enriched by targets of liver/lung drugs. Thus, an interesting future direction is to examine whether autophagy-modulating drugs for liver and lung diseases could be repurposed, if necessary by suitable refinements to increase their selectivity, for treating drug addiction.

In summary, our results invite attention to new targets of addictive drugs and pathways implicated in the development of addiction, as well as new therapeutic opportunities. Recent studies support the utility of such computationally driven QSP predictions. The validation of these predictions requires comprehensive wet-lab bioactivity assays (Pahikkala et al., 2015). In particular, establishing the proposed role of mTORC1 would require in vitro and in vivo longitudinal studies, given that our current study points to the involvement of mTORC1 at later stages of drug addiction. In a recent study, we identified the role of the protein kinase A (PKA) pathway in Huntington's disease using a QSP approach and verified it experimentally (Pei et al., 2017). A similar combined computational-experimental framework could be adopted to extend the current study and establish new strategies. Though these experiments are beyond the scope of the current paper, our unbiased computational study provides insights into the pleiotropy of the targets of addictive drugs as well as the common signaling platforms that may serve as mediators of drug addiction.

Knowledge of pathways implicated in drug addiction may be used, as a next step, to construct kinetic models to quantitatively assess the orchestration of signals induced by pathway crosstalk. Our previous studies on Toll-like receptors (Liu et al., 2016) and cell fate decision processes (Liu et al., 2014, 2017) have demonstrated the utility of identifying such crosstalk for detecting synergistic response mechanisms and designing polypharmacological strategies. Therefore, the computational data presented here represent a milestone toward developing new therapies against drug addiction by identifying new targets beyond those usually investigated by focused studies. Finally, our analysis framework is generic and could be adopted for characterizing the targets and pathways of other complex disorders by suitable redefinition of the input set of drugs of interest.

#### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

### AUTHOR CONTRIBUTIONS

FP, HL, and IB conceived and designed the research. FP and HL performed the research. FP, HL, BL, and IB analyzed the results and wrote the manuscript.

#### FUNDING

This work was supported by the National Institutes of Health awards P30DA035778 and P41GM103712.

#### ACKNOWLEDGMENTS

FP wishes to thank Dr. D. Lansing Taylor for his mentorship.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2019.00191/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Pei, Li, Liu and Bahar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Ontological and Non-Ontological Resources for Associating Medical Dictionary for Regulatory Activities Terms to SNOMED Clinical Terms With Semantic Properties

*Cédric Bousquet1,2\*, Julien Souvignet1,2, Éric Sadou1, Marie-Christine Jaulent1 and Gunnar Declerck3*

*Edited by: Lixia Yao, Mayo Clinic, United States*

#### *Reviewed by:*

*Yongqun Oliver He, University of Michigan Health System, United States Zhe He, Florida State University, United States*

*\*Correspondence:*

*Cédric Bousquet cedric.bousquet@chu-st-etienne.fr*

#### *Specialty section:*

*This article was submitted to Translational Pharmacology, a section of the journal Frontiers in Pharmacology*

*Received: 18 July 2018 Accepted: 31 July 2019 Published: 10 September 2019*

#### *Citation:*

*Bousquet C, Souvignet J, Sadou É, Jaulent M-C and Declerck G (2019) Ontological and Non-Ontological Resources for Associating Medical Dictionary for Regulatory Activities Terms to SNOMED Clinical Terms With Semantic Properties. Front. Pharmacol. 10:975. doi: 10.3389/fphar.2019.00975*

*1 Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, Inserm, Université Paris 13, Paris, France, 2 Unit of Public Health and Medical Informatics, University of Saint Etienne, Saint Etienne, France, 3 EA 2223 Costech (Connaissance, Organisation et Systèmes Techniques), Centre de Recherche, Sorbonne Universités, Université de technologie de Compiègne, Compiègne, France*

Background: Formal definitions allow selecting terms (e.g., identifying all terms related to "Infectious disease" using the query "has causative agent organism") and terminological reasoning (e.g., "hepatitis B" is a "hepatitis" and is an "infectious disease"). However, the standard international terminology Medical Dictionary for Regulatory Activities (MedDRA) used for coding adverse drug reactions in pharmacovigilance databases does not benefit from such formal definitions. Our objective was to evaluate the potential for reusing ontological and non-ontological resources to generate such definitions for MedDRA.

Methods: We developed several methods that collectively allow a semiautomatic semantic enrichment of MedDRA: 1) using MedDRA-to-SNOMED Clinical Terms (SNOMED CT) mappings (available in the Unified Medical Language System metathesaurus or other mapping resources, e.g., the MedDRA preferred term "hepatitis B" is associated with the SNOMED CT concept "type B viral hepatitis") to extract term definitions (e.g., "hepatitis B" is associated with the following properties: has finding site liver structure, has associated morphology inflammation morphology, and has causative agent hepatitis B virus); 2) using MedDRA labels and lexical/syntactic methods for automatic decomposition of complex MedDRA terms (e.g., the MedDRA system organ class "blood and lymphatic system disorders" is decomposed into blood system disorders and lymphatic system disorders) or automatic suggestions of properties (e.g., the string "cyclic" in the preferred term "cyclic neutropenia" leads to the property has clinical course cyclic).

Results: The Unified Medical Language System metathesaurus was the main ontological resource reusable for generating formal definitions for MedDRA terms. The non-ontological resources (another mapping resource provided by Nadkarni and Darer in 2010 and MedDRA labels) allowed few additional preferred terms to be defined. While the Ci4SeR tool helped the curator to define 1,935 terms by suggesting potential supplemental relations based on the parents' and siblings' semantic definitions, manually defining all MedDRA terms remains time-consuming.

Discussion: Several ontological and non-ontological resources are available for associating MedDRA terms with SNOMED CT concepts with semantic properties, but providing manual definitions is still necessary. The ontology of adverse events is a possible alternative but does not cover all MedDRA terms either. Future work includes implementing more efficient techniques to find more logical relations between SNOMED CT and MedDRA in an automated way.

Keywords: adverse drug reaction, Medical Dictionary for Regulatory Activities, SNOMED Clinical Terms, ontology, clinical terminology, pharmacovigilance

#### INTRODUCTION

Formal representations of semantics, as provided by computational ontologies and associated semantic Web techniques, have been extensively used in medical data integration systems in the last decade (Sheth et al., 2005). They now tend to be acknowledged as a powerful means to improve the quality of the processing chain of medical data, to automatically extract information and knowledge from large databases, or to ensure semantic interoperability between disparate data processing systems (Park and Hardiker, 2009; Schriml et al., 2012; Schulz and Jansen, 2013).

In the medical domain, classic terminologies are gradually giving way to clinical terminologies, in which terms are defined using knowledge representation languages (Rossi Mori et al., 1998). An example is SNOMED Clinical Terms (SNOMED CT), a general clinical terminology whose objective is to represent all possible terms required for coding the patient record and other applications for representation of biomedical information (Khorrami et al., 2018). SNOMED CT presents several advantages compared with classic terminologies, especially the ability to apply techniques of semantic reasoning in order to build new groups of terms, whereas classic terminologies are limited to default groupings (generally made manually by experts) that are already specified as part of the terminology (Bousquet et al., 2005).

Medical Dictionary for Regulatory Activities (MedDRA) is a classic terminology used by regulatory authorities and pharmaceutical companies for coding adverse drug reactions (ADR) in pharmacovigilance databases (Brown et al., 1999). MedDRA terms are not formally defined and search is therefore limited to existing categories (Bousquet et al., 2005). It is frequently difficult to identify the exact MedDRA category that represents a given medical condition under investigation in a sufficiently specific and exhaustive way, for example, during a pharmacovigilance database search (Brown, 2003).

For several years, we have performed studies showing that a knowledge-based approach is efficient for building new groups of ADR terms in an automated way with World Health Organization Adverse Reaction Terminology (WHO ART) (Alecu et al., 2006; Iavindrasana et al., 2006) and with MedDRA (Henegar et al., 2006; Declerck et al., 2012; Asfari et al., 2016; Souvignet et al., 2016a). This means that starting from a resource containing formal definitions of ADR terms, it is possible to make queries that correspond to a case definition in order to retrieve the related set of terms. This strategy was applied in the Pharmacovigilance Adverse Reaction Terminology Server (Alecu et al., 2007), where building a knowledge base for all WHO ART terms was a challenge. Indeed, all definitions had to be set manually, and we therefore focused on automated ways to enrich WHO ART (Iavindrasana et al., 2006). We found that mapping WHO ART to SNOMED CT by means of the Unified Medical Language System (UMLS) metathesaurus was a very efficient method to build formal definitions of WHO ART terms in an automated way (Alecu et al., 2008).

Difficulties we encountered for enriching WHO ART now appear at a larger scale in MedDRA due to a growing number of terms and a more complex organization of MedDRA. Indeed, only about 50% of MedDRA terms [excluding lowest level term (LLT)] were associated with a SNOMED CT concept in UMLS (Bodenreider, 2009). Therefore, the mapping method we applied to WHO ART was a fair starting point but proved to be insufficient for obtaining an exhaustive enrichment of MedDRA.

Our objective was to evaluate the potential for reusing ontological and non-ontological resources to define and/or enrich definitions of MedDRA terms. We present in this article several complementary methods that may benefit from different levels of automation and could also be reused in order to semantically enrich other terminologies. These include methods such as i) extracting SNOMED CT definitions based on MedDRA-to-SNOMED CT mappings available in the UMLS metathesaurus or other mapping resources and ii) developing lexical and syntactic methods using MedDRA term label information. The ability to reuse the selected ontological and non-ontological resources was measured by comparing the number of MedDRA terms associated with a formal definition after processing of these resources with the number of MedDRA terms to define. Additionally, we manually curated some term definitions using expert knowledge, which allowed us to evaluate the time necessary, and validated a sample of the formal definitions provided by the previous methods. We stored formal definitions of MedDRA terms in a semantic resource named OntoADR (Bousquet et al., 2014).

**Abbreviations:** ADR, Adverse drug reactions; AERS, Adverse Events Reporting System; Ci4SeR, Curation Interface for Semantic Resources; HLGT, Higher Level Group Term; HLT, High Level Term; ICD-10, International classification of diseases, 10th edition; LALR, Lexically assign logically refine; LLT, Lowest Level Term; LOINC, Logical Observation Identifiers Names & Codes; MedDRA, Medical Dictionary for Regulatory Activities; NEC, Not elsewhere classified; NOS, Not otherwise specified; OAE, Ontology of Adverse Events; OWL, Web Ontology Language; PT, Preferred Term; SMQ, Standardized MedDRA Queries; SNOMED CT, SNOMED Clinical Terms; SOC, System Organ Class; UMLS, Unified Medical Language System; WHO ART, World Health Organization-Adverse Reaction Terminology.

The organization of OntoADR and the results of semantic queries on OntoADR have already been published (Bousquet et al., 2014; Souvignet et al., 2016b). This article presents the methods we implemented for reusing ontological and non-ontological resources to formalize the semantics, the results obtained with each of these methods, how they automate the development of formal representations of MedDRA terms, the limits of these methods, and the additional developments that would be required for a more complete semantic enrichment of MedDRA.

#### BACKGROUND

#### Hierarchical Organization of Medical Dictionary for Regulatory Activities

The MedDRA hierarchy consists of five levels (from broad to narrow), among which four are depicted in **Figure 1**: System Organ Class (SOC), e.g., hepatobiliary disorders; higher level group term (HLGT), e.g., hepatic and hepatobiliary disorders; high level term (HLT), e.g., hepatic viral infections; preferred term (PT), e.g., hepatitis B; and lowest level term (LLT), not shown in the figure. The PT level is preferred for data analysis and retrieval. MedDRA was defined as multi-axial because one PT may be present in one primary SOC and also in several secondary SOCs. However, one PT may exist within only one HLT within a SOC. As the HLTs within a SOC constitute disjoint classes, it is seldom reliable to consider only one HLT or higher level category when searching for MedDRA terms related to a pharmacovigilance safety topic (Bousquet et al., 2005; Asfari et al., 2016).
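The multi-axial organization described above can be pictured with a small data structure. The following Python sketch is illustrative only: the `Path` class and the `check_single_hlt_per_soc` helper are hypothetical names, the first path is the hepatitis B example quoted in the text, and the second path (a second SOC axis) is an assumption added for illustration, not content taken from MedDRA files.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One SOC > HLGT > HLT route leading to a preferred term (PT)."""
    soc: str
    hlgt: str
    hlt: str


def check_single_hlt_per_soc(paths):
    """Return the SOCs in which a PT would sit under more than one HLT."""
    hlts_by_soc = defaultdict(set)
    for p in paths:
        hlts_by_soc[p.soc].add(p.hlt)
    return sorted(soc for soc, hlts in hlts_by_soc.items() if len(hlts) > 1)


hepatitis_b_paths = [
    # Path quoted in the text for the PT "hepatitis B".
    Path("hepatobiliary disorders",
         "hepatic and hepatobiliary disorders",
         "hepatic viral infections"),
    # Assumed second axis, for illustration only.
    Path("infections and infestations",
         "viral infectious disorders",
         "hepatitis viral infections"),
]

# A multi-axial PT appears in several SOCs...
socs = sorted({p.soc for p in hepatitis_b_paths})
# ...but must not appear under two different HLTs of the same SOC.
violations = check_single_hlt_per_soc(hepatitis_b_paths)
print(socs)
print(violations)
```

Because each SOC contributes at most one HLT per PT, a search restricted to a single HLT sees only one of these axes, which is the limitation the paragraph above points out.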

Moreover, it was recognized that HLTs are not always sufficient to represent clinical conditions involving several organs (e.g., anaphylactic shock involving the kidney, liver, cardiovascular, and respiratory systems) because they only group together terms belonging to the same SOC. When searching for signals associated with a drug, MedDRA terms representing the suspected ADR must thus be identified prior to running signal detection algorithms (Souvignet et al., 2012). For instance, if one suspects a given drug to cause acute renal failure, using the MedDRA term "renal failure acute" is generally not sufficient for the algorithms to extract a signal because the acute renal failure condition can be coded with several related MedDRA terms by health professionals (e.g., "renal impairment," "blood creatinine abnormal," or "dialysis"). Identifying clinically related terms in MedDRA is not an easy task, as those terms might exist in different locations of the MedDRA hierarchy.

For several years, the Maintenance and Support Services Organization, which is responsible for MedDRA maintenance and diffusion, has built standardized MedDRA queries (SMQs) to address these issues (Mozzicato, 2007). SMQs consist of sets of PTs from different branches of MedDRA that allow describing a particular medical condition and are intended to aid in case identification. SMQs are a way to describe safety topics relevant for pharmacovigilance that are not covered by the HLTs and HLGTs present in the MedDRA hierarchy. However, the development of SMQs raises significant difficulties.

SMQs are currently developed manually by experts from the Maintenance and Support Services Organization, which is time consuming. Furthermore, once defined, they should not be modified or customized, and the existing SMQs do not cover all possible drug safety issues. Because experts have (even slightly) different understandings of the medical condition targeted by an SMQ, the kind of terms and the rationale for their selection may differ from one SMQ to another. This means that from one group of experts to another, for the same safety topic, the list of MedDRA terms selected could be different. Because SMQs are manually implemented, they could also miss important MedDRA terms. For these reasons, the development of methods for automated selection of MedDRA terms on the basis of semantic information is desirable. Indeed, even a partial automation of the process of PT selection in SMQs could increase the quality and reproducibility of the SMQs and save considerable time.

#### Difficulties for Searching Terms in Medical Dictionary for Regulatory Activities

The performance of pharmacovigilance systems based on spontaneous reporting is dependent on the information systems in which case reports are stored. In particular, these systems are subordinated to the ability of users to retrieve and exploit case reports in order to 1) reinforce existing knowledge on drug safety, 2) make assumptions about the existence of a causal relationship between a drug and an adverse event, and 3) evaluate the available information to implement regulatory measures to secure drug therapies. The search for pharmacovigilance case reports is difficult because it is necessary to identify the medical terms indicating the safety topic that one wishes to evaluate. In general, a term is not sufficient to designate this safety topic, and it is preferable to look for all case reports in relation to a set of terms (Hauben et al., 2006; Hansen et al., 2007). According to the MedDRA® Data Retrieval and Presentation: Points To Consider (ICH Working Group, 2018), "clinically related PTs might be overlooked or not recognized as belonging together because they might be in different groupings within a single SOC or they may be located in more than one SOC."

**Figure 2** shows the problem of finding terms in MedDRA. Terms associated with a green tick are related to valvulopathy, while terms marked with a red cross do not correspond to valvulopathy. The search terms are located in different branches of the terminology, which demands more time and effort from the pharmacovigilance specialist carrying out the query. In addition, several HLTs or HLGTs must be combined to arrive at the final result, and irrelevant terms are present in these groups, which means that a search based on HLT or HLGT groupings will be associated with a large number of irrelevant PTs. Another method for searching MedDRA terms is the textual query, but the terminology seems complex and does not reveal discriminating strings in the search for valvulopathies. For example, "stenos\* aort\*" gives as results "stenosis of the aortic valve," "congenital aortic stenosis of the valve," and "stenosis of the mitral valve and insufficiency of the aortic valve," but it is necessary to perform additional text searches corresponding to findings for valvular involvement such as calcification or insufficiency.
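The limitation of textual queries can be illustrated with a minimal sketch. It assumes, without confirmation from the source, that each token ending in an asterisk is a word prefix and that all tokens must match in any order; this reading is consistent with the three results quoted above. The `matches` function is a hypothetical helper, and the last two labels are invented for illustration.

```python
def matches(query: str, label: str) -> bool:
    """True if every query token matches some word of the label.

    A token ending in '*' is treated as a word prefix; otherwise the
    word must be equal. Matching is case-insensitive and does not
    depend on word order (an assumption, see the lead-in).
    """
    words = label.lower().replace(",", " ").split()
    for token in query.lower().split():
        if token.endswith("*"):
            prefix = token[:-1]
            hit = any(w.startswith(prefix) for w in words)
        else:
            hit = token in words
        if not hit:
            return False
    return True


labels = [
    # Results quoted in the text.
    "stenosis of the aortic valve",
    "congenital aortic stenosis of the valve",
    "stenosis of the mitral valve and insufficiency of the aortic valve",
    # Hypothetical labels, for illustration only.
    "aortic valve calcification",
    "mitral valve insufficiency",
]

hits = [label for label in labels if matches("stenos* aort*", label)]
missed = [label for label in labels if label not in hits]
print(hits)
print(missed)
```

Under these assumptions the two valvular labels without the string "stenos" are missed, so further text searches (e.g., on calcification or insufficiency) would be needed to complete the selection.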

#### Interface, Aggregation, and Reference Terminologies

Schulz et al. (2017) recently evaluated how interface, aggregation, and reference terminologies may interact in the context of the new 11th version of the International Classification of Diseases and its relations with SNOMED CT. Interface terminology was defined by Rosenbloom et al. (2006) as a "systematic collection of health care–related phrases (terms) that supports clinicians' entry of patient-related information into computer programs [ … ]." According to Spackman (1997), the "main purpose of a reference terminology [ … ] is the retrieval and analysis of data." The term "aggregation terminology" was first introduced by Rogers (2005) to designate a classification system whose main purpose is to enable "statistical aggregation," further defined by Schulz et al. (2017) as consisting of single hierarchies and disjoint classes.

We consider that such clarification is useful in our case study to better explain what we intend to do with MedDRA and SNOMED CT. Within our approach, MedDRA is the aggregation terminology, and SNOMED CT is the reference terminology. As our purpose is to improve retrieval and not coding of MedDRA terms, we did not work on building an interface terminology. Such an interface terminology would be desirable to facilitate data entry in pharmacovigilance databases but is outside of our scope. In the following section, we show how a graphical user interface implementing SNOMED CT as a reference terminology could improve users' experience in selecting MedDRA terms and potentially improve search in pharmacovigilance databases.

#### Rationale for Supplementing Medical Dictionary for Regulatory Activities With Formal Definitions

We consider it possible to overcome the limitations associated with the organization of the MedDRA terminology and the difficulty of identifying related MedDRA terms by proposing an alternative method for grouping PTs based on their medical meaning rather than their position in the hierarchy. This new method is based on PT modeling in a form that allows logical inferences by a computer. From a technical point of view, the implementation of this method is based on knowledge engineering, a branch of artificial intelligence in which it is possible to describe MedDRA terms using a formal language (Bousquet et al., 2014). In the field of knowledge engineering, we define "ontology" as the set of objects of a domain and relations between these objects.

While McKnight (1999) recognized the need for user-directed composition of controlled health terminologies, and the required improvement of the user interface in the context of data entry, we believe that such a user interface is also of great importance to enable composition for data retrieval. In a previous work, we implemented OntoADR query tools, a graphical user interface that relies on OntoADR, and compared the performances of eight users in selecting MedDRA PTs with the MedDRA web browser in a pilot study on five medical conditions (Souvignet et al., 2019). Although the number of medical conditions was low, we observed a statistically significant improvement by using OntoADR query tools compared with the MedDRA web browser for selecting MedDRA PTs (+27% precision and +34% recall). Similar to Maedche and Staab (2001), we consider that the target application may serve as a measure for validating the implemented ontology and believe that such a criterion is more important than criteria relying only on the evaluation of the ontology without taking into account the context where it is used. This pilot study confirmed the validity of our approach and justifies continuing the implementation of our mappings between MedDRA and SNOMED CT.

In order to address a safety issue, it is necessary to identify case reports in pharmacovigilance databases relative to this issue. A safety issue may concern the causality assessment of drug D in the occurrence of medical condition C. **Figure 3** shows the example of three use cases where one is evaluating the causal role of suspected drug D in three medical conditions: a) upper gastrointestinal hemorrhage, b) medical conditions with symptom of erythema, and c) fungal infectious disorders. A single MedDRA term is usually not sufficient to characterize a given medical condition. OntoADR is intended to support selection of MedDRA terms according to different criteria. **Figure 4** depicts the query performed in OntoADR to retrieve MedDRA terms associated with these medical conditions, e.g., "finding site: upper gastrointestinal tract," and "associated morphology: hemorrhage," and the first 10 MedDRA terms retrieved by this query, e.g., anastomotic ulcer hemorrhage, aorto-esophageal fistula, chronic gastrointestinal bleeding, etc.

**Table 1** shows parts of the formal definitions for 10 of the 27 MedDRA terms associated with upper gastrointestinal hemorrhage, in particular, the SNOMED CT concepts that are fillers of the relations "finding site" and "associated morphology." Fillers that are not relevant to upper gastrointestinal hemorrhage are in italic, e.g., *Stenosis* for the MedDRA PT "anastomotic ulcer hemorrhage." In case the filler of the "associated morphology" relation is "hemorrhage," the query is immediately satisfied for this condition, e.g., three MedDRA terms (aorto-esophageal fistula, gastric antral vascular ectasia, and gastric hemorrhage) are defined as having "hemorrhage" as their associated morphology. When "hemorrhage" is not the filler of the "associated morphology" relation, other relevant fillers may be retrieved thanks to the subsumption mechanism that establishes hierarchical relations between a parent concept and its child concepts. For example, "hemorrhage" subsumes "acute bleeding ulcer," "bleeding varices," "chronic hemorrhage," and "hemorrhagic inflammation." "Upper gastrointestinal tract structure" subsumes "duodenal structure," "esophageal structure," "gastrojejunal junction structure," "pyloric antrum structure," and "stomach structure."
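The role of subsumption in such a query can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OntoADR implementation: the parent/child pairs are those listed in the paragraph above, whereas the finding sites, the two terms labeled "hypothetical," and the function names `is_subsumed_by` and `select_terms` are invented for the example.

```python
# Toy is-a hierarchy (child -> parent), using the pairs listed in the text.
PARENT = {
    "acute bleeding ulcer": "hemorrhage",
    "bleeding varices": "hemorrhage",
    "chronic hemorrhage": "hemorrhage",
    "hemorrhagic inflammation": "hemorrhage",
    "duodenal structure": "upper gastrointestinal tract structure",
    "esophageal structure": "upper gastrointestinal tract structure",
    "gastrojejunal junction structure": "upper gastrointestinal tract structure",
    "pyloric antrum structure": "upper gastrointestinal tract structure",
    "stomach structure": "upper gastrointestinal tract structure",
}


def is_subsumed_by(concept, ancestor):
    """Walk up the is-a chain; a concept is subsumed by itself."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


# Assumed definitions: the finding sites and the two "hypothetical"
# terms are made up for this example.
DEFINITIONS = {
    "gastric hemorrhage": {
        "finding site": ["stomach structure"],
        "associated morphology": ["hemorrhage"],
    },
    "aorto-esophageal fistula": {
        "finding site": ["esophageal structure"],
        "associated morphology": ["hemorrhage"],
    },
    "hypothetical duodenal varices term": {
        "finding site": ["duodenal structure"],
        "associated morphology": ["bleeding varices"],
    },
    "hypothetical colonic bleeding term": {
        "finding site": ["colon structure"],
        "associated morphology": ["hemorrhage"],
    },
}


def select_terms(criteria):
    """Return terms whose definition satisfies every (relation, ancestor) pair."""
    selected = []
    for term, definition in DEFINITIONS.items():
        if all(
            any(is_subsumed_by(f, ancestor) for f in definition.get(relation, []))
            for relation, ancestor in criteria.items()
        ):
            selected.append(term)
    return sorted(selected)


result = select_terms({
    "finding site": "upper gastrointestinal tract structure",
    "associated morphology": "hemorrhage",
})
print(result)
```

In this sketch the term whose morphology is "bleeding varices" is retrieved only because that filler is subsumed by "hemorrhage," while the colonic term is excluded because its finding site lies outside the upper gastrointestinal tract.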

### Semantic Enrichment

The traditional process of domain ontology construction is based on expert intervention (Bedini and Nguyen, 2007). Although this manual procedure guarantees a fair quality of the generated resource, it suffers from several difficulties; among those are the cold start problem (starting from scratch) and the lack of availability of domain experts (Qawasmeh et al., 2018). In fact, the high cost of experts' interventions is the major bottleneck identified early in the state of the art of ontology construction (Cullen and Bryman, 1988; Simperl et al., 2006; Balakrishna et al., 2010). This bottleneck justifies reusing and linking existing resources, when available, to create new ontologies (Alani, 2006). Reuse is not always possible because ontologies may not exist in the field of interest. For example, Mazo et al. (2017) describe a histological ontology of the human cardiovascular system and report that they "did not find in the State-of-the-Art an ontology of histology neither a similar organization of hierarchies of histology terms that [they] may be able to reuse." Conversely, when all the ontologies that are needed are available, it is sufficient to reuse and assemble them, as in the development of the orthology ontology (Fernández-Breis et al., 2016).

upper gastrointestinal hemorrhage, (B) MedDRA terms describing medical conditions associated with erythema, (C) MedDRA terms describing infectious diseases induced by fungi.

TABLE 1 | Finding site and associated morphology of 10 MedDRA terms describing upper gastrointestinal hemorrhage among 27.

Ontology enrichment is the task of extending an existing ontology with additional concepts and semantic relations and placing them at the correct position in the ontology (Petasis et al., 2011). Automatic ontology construction is often based on learning (Maedche and Staab, 2001; Buitelaar et al., 2005). Such approaches can be based on unstructured texts (Asim, 2018; Cimiano et al., 2006; Emani et al., 2015; Costa et al., 2016; Dasgupta et al., 2018), informal ontologies (Astrakhantsev and Turdakov, 2013), or linked data (Gavankar et al., 2012; Tiddi et al., 2012; Riga et al., 2017). A particular case of unstructured data corresponds to the labels of ontology identifiers, which may be very dense with information. Such information, described as "hidden semantics" by Third (2012), has received little attention, with the exception of the gene ontology identifiers (e.g., Quesada-Martínez et al., 2015). SNOMED CT identifiers may also benefit from such an approach, but this has been limited to taking into account the "acute" and "chronic" qualifiers for the evolution of diseases (Rector and Iannone, 2012) and the occurrence of congenital diseases (van Damme et al., 2018). Such "hidden semantics" were also detected by Nadkarni and Darer (2010) to build correspondences between MedDRA and SNOMED CT, but these correspondences were not associated with relations; this makes the work interesting for reuse, but reengineering is required to transform the correspondences into semantic relations.

One of the major difficulties of these approaches is the extraction of non-taxonomic relationships (Dahab et al., 2008; Sánchez and Moreno, 2008; Villaverde et al., 2009; Petasis et al., 2011; Serra et al., 2014). Furthermore, several automatic approaches for ontology reuse and engineering still require domain experts and knowledge engineers (Bobed et al., 2012). Thus, semiautomatic approaches could be a good alternative (Balakrishna et al., 2010).

Semiautomatic approaches employ intelligent methods to significantly reduce, without completely replacing, human efforts (Huang et al., 2014). In such approaches, the role of experts could be limited to validating the final results of automatic learning (Wächter and Schroeder, 2010) or suggesting improvements at the end of the ontology life cycle (Alobaidi, 2018). Expert intervention can be supported by a graphical user interface (Wächter and Schroeder, 2010), spreadsheets (Blfgeh et al., 2017; Judkins et al., 2018), or specified pipelines such as the eXtensible ontology development (He et al., 2018).

#### NeON Methodology

After comparing several methods for ontology development, we selected the NeON methodology (Suárez-Figueroa et al., 2012) because it was the most appropriate to illustrate the strategy that we followed for designing OntoADR. While other methodologies may also be considered for knowledge engineering and may be more relevant in other contexts, we considered that dimensions of reuse were the most important features when selecting NeON.

While other approaches for ontology engineering provide methodological guidance, the NeON Methodology does not prescribe a rigid workflow. It suggests a variety of pathways based on nine flexible scenarios that address common issues, such as reusing, reengineering, and merging ontological resources. These ontological resources also comprise ontology design patterns (Aranguren, 2008; Gangemi, 2005; Blomqvist, 2008; Presutti and Gangemi, 2008), which are generic templates or abstract descriptions proposed to enforce best practices in ontology implementation. One particularity of NeON matching well with our specific approach is that it also takes into account reusing and reengineering of non-ontological resources, which is not the case of other methodologies such as METHONTOLOGY (Fernández-López et al., 1997) and On-To-Knowledge (Sure et al., 2004). These non-ontological resources may consist of structured data such as terminologies (Jimeno-Yepes et al., 2009) or databases, unstructured data (e.g., articles), or semi-structured data (e.g., XML, JSON) (Qawasmeh et al., 2018). In addition to these nine scenarios, NeON also integrates support activities such as knowledge acquisition, documentation, and evaluation that should be carried out during the whole ontology development cycle.

### MATERIAL AND METHODS

#### Summary of the Method

#### Application of the NeON Methodology

We applied the following scenarios of the NeON methodology for implementing OntoADR (**Figure 5**).

• Scenario 1. From Specification to Implementation: this includes four steps (specification, conceptualization, formalization, and implementation). We presented these steps in previous work (Bousquet et al., 2014) and limit the scope of this presentation to scenarios that emphasize the reuse of ontological and non-ontological resources.


In addition to these scenarios, we also implemented two "ontology support activities": (1) "knowledge acquisition," which comprises capturing knowledge from the MedDRA labels and the work of a domain expert who added formal definitions using the Ci4SeR tool (Souvignet et al., 2014), and (2) "ontology validation," which consists of checking that the meaning of the ontology definitions complies with the meaning we intended the MedDRA terms to convey.

#### Flow chart of the method

**Figure 6** depicts a flow chart that gives an overview of the steps and tasks proposed in the article, so that the algorithm can be grasped at a glance. Although this diagram could suggest that all these steps were conducted in parallel, it is proposed only as a convenient way to apprehend the method as a whole. The previous section, where we applied the NeON methodology, gives a different perspective in which different scenarios were applied at different times. In the flow chart, each MedDRA term is considered one after the other and can go through several parallel paths according to different conditions.


Some manual definitions can optionally be added (see the section *NeON Methodology*). All partial definitions acquired with the different algorithms are then automatically combined into a merged definition of the MedDRA term. We used MedDRA version 17, which consists of 26 SOC, 334 HLGT, 1,720 HLT, 20,559 PT, and 72,637 LLT; SNOMED CT version March 2015; and UMLS version 2014AB. SNOMED CT concepts were extracted from the Concepts\_Core\_INT file (Release Format 1) and the hierarchy and semantic properties from the Relationships\_Core\_INT file. This version of MedDRA was applied to the following paths: "using UMLS metathesaurus mappings," "automatic enrichment methods," and "manual definition of concepts." MedDRA 13, which consists of 26 SOC, 335 HLGT, 1,709 HLT, 18,786 PT, and 68,258 LLT, was applied to the following paths: "Using other mapping resources" and "Using a decomposition algorithm and Metamap software to map complex MedDRA terms."

#### Problems With Mapping Other Layers Than the PT Level

We have tried to map other layers (SOC, HLT, and HLGT). For instance, the cardiac disorders SOC concept has the formal definition hasFindingSite some "Heart Structure." While some SOC, HLGT, and HLT were accurately defined, essentially thanks to mappings in UMLS, it was decided not to present the results in this article. We used formal definitions associated with the three MedDRA higher levels only in the Ci4SeR tool (Souvignet et al., 2014), where the curator could use them if she considered them relevant. We decided not to map the LLT because PT is the preferred level for case report analysis and search in pharmacovigilance databases.

We explained in previous work why the MedDRA hierarchy cannot be converted into a subsumption tree: doing so sometimes causes semantic inconsistencies (Bousquet et al., 2014). The reason is that most high level categories in MedDRA are intended to reflect the domain actors' practices (i.e., following the different medical specialties) and are not necessarily organized according to semantic criteria, as one would expect in a well-formed ontology. A first example concerns groups of symptoms (HLGT or HLT) that are placed under the general categories of disorders that they are the symptom of (SOC or HLGT). Such a hierarchical organization would not be authorized in an ontology, as the relation being-a-symptom-of does not imply an is-a relation. For instance, the PT "dyspnoea" and "dizziness or syncope" belong to the HLGT "cardiac disorder signs and symptoms," which is under the SOC cardiac disorders. While dyspnea, for instance, refers to conditions that may be associated with cardiac disorders, such a symptom cannot be considered a cardiac disorder. A second example is the MedDRA PT sudden death, which belongs to the following hierarchies: 1) ventricular arrhythmias and cardiac arrest (HLT)/cardiac arrhythmias (HLGT), and 2) death and sudden death (HLT)/fatal outcomes (HLGT). While "cardiac arrhythmias" is defined in OntoADR with the hasFindingSite some "Heart Structure" property, the sudden death PT should not inherit this property because a sudden death may not be of cardiac origin.
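The sudden death example can be made concrete with a short sketch that deliberately treats the MedDRA grouping as if it were an is-a hierarchy. The grouping and the single property come from the paragraph above; the dictionary layout and the function name `naive_inherited_properties` are assumptions made for illustration only.

```python
# MedDRA grouping for the "sudden death" PT as described in the text
# (PT -> HLT -> HLGT). A term may sit under several groupings.
GROUPED_UNDER = {
    "sudden death": [
        "ventricular arrhythmias and cardiac arrest",
        "death and sudden death",
    ],
    "ventricular arrhythmias and cardiac arrest": ["cardiac arrhythmias"],
    "death and sudden death": ["fatal outcomes"],
}

# Property attached to a high level term, as stated in the text.
DEFINED = {
    "cardiac arrhythmias": {("hasFindingSite", "Heart Structure")},
}


def naive_inherited_properties(term):
    """Collect properties from every ancestor, as if grouping meant is-a."""
    properties = set(DEFINED.get(term, set()))
    for parent in GROUPED_UNDER.get(term, []):
        properties |= naive_inherited_properties(parent)
    return properties


# The naive traversal asserts a cardiac finding site for every sudden death,
# which is exactly the unwanted inference discussed above.
print(naive_inherited_properties("sudden death"))
```

This is why inheritance from higher levels is offered to the curator as a suggestion rather than asserted automatically.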

#### Using Medical Dictionary for Regulatory Activities-to-SNOMED Clinical Terms Mappings From UMLS Metathesaurus

The UMLS may be used as a source of knowledge for adding formal definitions to medical terminologies (Schulz and Hahn, 2001). Based on our initial experience (Alecu et al., 2008), we assume that SNOMED CT is currently the best candidate for providing formal definitions to MedDRA. SNOMED CT terms are defined using description logic (DL) formalism, and a fair number of alignments between MedDRA and SNOMED CT are present in the UMLS metathesaurus. Therefore, most of the formal definitions attributed to MedDRA terms in OntoADR are based on semantic information extracted from the SNOMED CT clinical terminology. When a MedDRA term is mapped to a SNOMED CT concept, we reused the semantic information within SNOMED CT in order to build the formal definition of the MedDRA term. Identifying reliable MedDRA-to-SNOMED CT mappings is thus an essential step in our methodology to define MedDRA term semantics.

The UMLS (Lindberg et al., 1993) consists of a semantic network and a metathesaurus developed by the US National Library of Medicine to link terms from more than a hundred controlled vocabularies, including SNOMED CT and MedDRA. Terms from the different vocabularies are linked together by association to a unique UMLS concept defined by a concept unique identifier, e.g., "C0019163" that is mapped to both MedDRA term "hepatitis B" and SNOMED CT concept "type B viral hepatitis."

In OntoADR, the MedDRA term "hepatitis B" has the formal definition: hasFindingSite some "Liver Structure," hasAssociatedMorphology some "Inflammation Morphology," and hasCausativeAgent some "Hepatitis B Virus," where hasFindingSite, hasAssociatedMorphology, and hasCausativeAgent are OntoADR semantic relations inspired from SNOMED CT, and "liver structure," "inflammation morphology," and "hepatitis B virus" are SNOMED CT concepts we imported in OntoADR (**Figure 1**).

In order to map MedDRA and SNOMED CT terms, we developed an algorithm following these steps: i) search for the MedDRA PT in the UMLS using the MedDRA identifier; ii) for a MedDRA PT without SNOMED CT mappings in the UMLS, if the PT has one or more related LLT considered as synonymous, use the LLT identifier for a new UMLS search; iii) if this second search is unsuccessful, perform a last UMLS search using PT and LLT labels, seeking to pair these labels with SNOMED CT concepts by string matching.
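The three-step fallback above can be sketched as follows. This is a minimal illustration under stated assumptions: the identifiers ("PT-A", "LLT-B1", etc.), the two lookup dictionaries standing in for UMLS searches, and the use of exact case-insensitive comparison for the string matching step are all hypothetical; the article does not specify the matching technique.

```python
# Hypothetical stand-ins for UMLS lookups (identifiers are invented).
UMLS_BY_ID = {
    "PT-A": ["type B viral hepatitis"],
    "LLT-B1": ["renal calculus"],
}
SNOMED_BY_LABEL = {
    "spondylitis": ["spondylitis"],
}


def normalize(label):
    """Assumed normalization for the label search: trim and lowercase."""
    return label.strip().lower()


def map_pt_to_snomed(pt_id, pt_label, synonymous_llts,
                     umls_by_id=UMLS_BY_ID, snomed_by_label=SNOMED_BY_LABEL):
    """Return (step_used, candidate SNOMED CT concepts) for one MedDRA PT.

    synonymous_llts is a list of (llt_id, llt_label) pairs.
    """
    # Step i: search with the PT identifier.
    hits = umls_by_id.get(pt_id, [])
    if hits:
        return "pt_identifier", hits
    # Step ii: retry with the identifiers of synonymous LLTs.
    for llt_id, _ in synonymous_llts:
        hits = umls_by_id.get(llt_id, [])
        if hits:
            return "llt_identifier", hits
    # Step iii: last attempt, string matching on PT and LLT labels.
    labels = [pt_label] + [label for _, label in synonymous_llts]
    for label in labels:
        hits = snomed_by_label.get(normalize(label), [])
        if hits:
            return "label_match", hits
    return "unmapped", []


print(map_pt_to_snomed("PT-B", "kidney stone", [("LLT-B1", "renal calculus")]))
```

Returning the step that produced the candidates keeps track of how each proposition was obtained, which may help the experts who validate the mappings afterwards.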

All mapping propositions selected from the UMLS metathesaurus were validated, modified, or completed by knowledge engineers and pharmacovigilance experts of our team. i) All one-to-one mappings we decided to use were first validated by checking manually the correspondence between the meanings of terms.

Several SNOMED CT concepts may be proposed as synonyms (Fung et al., 2005) of the same MedDRA concept, although they have different meanings. Each MedDRA concept in OntoADR can have only one equivalent SNOMED CT concept. When several SNOMED CT concepts are proposed in UMLS as synonyms of a MedDRA term, only one was selected by an expert for building the formal definition. Such selection should be based on synonymy between a SNOMED CT concept and a MedDRA term. According to Fung, synonymy between terms X and Y may be defined according to linguistic criteria (Fung et al., 2005), such as requiring that X can be replaced by Y in any sentence without modifying the meaning. Examples of such synonyms are "celiac disease" and "gluten enteropathy," or "kidney stone" and "renal calculus." Selection of a SNOMED CT concept was performed first by comparing its label with the MedDRA term label. When both were identical, which occurred most of the time, the mapping was selected; in other cases, we took into account the medical relevance of the mapping and had to rely on expert evaluation. For example, three SNOMED CT concepts are mapped to the MedDRA term "Spondylitis": "inflammatory spondylopathy," "undifferentiated spondylitis," and "spondylitis," and the latter appears as a perfect match according to label comparison.

Such a validation process was necessary because UMLS mapping propositions are not always semantically valid. MedDRA terms and SNOMED CT concepts mapped together in UMLS can refer to different medical entities, even if they are homonyms. For instance, the MedDRA term "vascular disorders" and its SNOMED CT homonym "vascular disorder" are mapped together in UMLS; however, the former refers to disorders of blood and lymphatic systems and the latter only to disorders of blood vessels (lymphatic system disorders are caught by the concept "disorder of lymphatic system" in SNOMED CT). ii) In case of one-to-n mappings, a manual expert choice was made to select the SNOMED CT concept whose definition best fitted the meaning of the corresponding MedDRA concept. iii) When no SNOMED CT concept among the ones suggested by UMLS was satisfactory, the definition of the corresponding MedDRA concept was made manually. iv) Mapping a MedDRA term with a SNOMED CT concept does not ensure that the former gets a complete (or even a satisfying) formal definition of its semantics: the formal definition can be incomplete or even absent altogether, as it is common to find SNOMED CT concepts, for instance psychiatric concepts, that have no definitional properties. When necessary, the semantic properties from SNOMED CT attributed to MedDRA concepts through the mapping procedure were thus completed manually by additional assertions.

#### Using Another Medical Dictionary for Regulatory Activities-to-SNOMED Clinical Terms Mapping Resources

To complete the mappings selected from UMLS, we also made use of Nadkarni and Darer's propositions of mappings (Nadkarni and Darer, 2010). Using one year of data (recorded between July 1, 2008 and April 30, 2009) from the US Food and Drug Administration Adverse Events Reporting System (AERS) pharmacovigilance database, the authors identified 3,705 MedDRA PT that collectively accounted for 95% of case reports. The 3,705 selected MedDRA terms correspond to high-frequency terms in the US Food and Drug Administration database and potentially have a great added value. After eliminating terms already mapped to SNOMED CT concepts in UMLS, they attempted to map the remaining terms (786 in total) manually with software assistance. Most of those terms (733) could be mapped by Nadkarni and Darer to SNOMED CT concepts *via* one-to-one or one-to-n mappings.

Several problems were encountered when trying to reuse Nadkarni and Darer's propositions of mappings (Nadkarni and Darer, 2010). i) First, in the case of one-to-n mappings, the authors broke down a MedDRA term so as to associate it with several SNOMED CT concepts but did not specify which semantic relation was relevant. For example, they mapped the MedDRA concept "tongue discoloration" to the SNOMED CT concepts "abnormal color" and "entire tongue" but did not specify which semantic relation connected the former to the latter two. Obviously, it cannot be an equivalence or synonymy (Fung et al., 2005) ("same as") relation as in the case of one-to-one mapping. When SNOMED CT concepts belong to branches such as *body structure* or *morphologic abnormality*, the relationship to use is easy to deduce, and its creation can be automated: it will be the hasFindingSite relationship in the first case and the hasAssociatedMorphology relationship in the second. However, when it comes to SNOMED CT concepts from the branches *finding*, *qualifier value*, or *disorder*, the relationship to use is not obvious, and only a human expert can decide. A major part of our recovery work was to specify these relationships by making use of the set of relationships available in OntoADR.
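The branch-based deduction described above can be sketched as a simple lookup. The two automated cases are those named in the text; the function name `relations_for_mapping`, the input format, and the branch assigned to each example concept are assumptions made for illustration.

```python
# Branches for which the relation can be deduced automatically (from the text).
BRANCH_TO_RELATION = {
    "body structure": "hasFindingSite",
    "morphologic abnormality": "hasAssociatedMorphology",
}


def relations_for_mapping(snomed_concepts):
    """Split a one-to-n mapping into automatic relations and expert cases.

    snomed_concepts is a list of (label, branch) pairs. Concepts from any
    other branch (e.g., finding, qualifier value, disorder) are returned
    separately because only a human expert can choose the relation.
    """
    automatic, needs_expert = [], []
    for label, branch in snomed_concepts:
        relation = BRANCH_TO_RELATION.get(branch)
        if relation is not None:
            automatic.append((relation, label))
        else:
            needs_expert.append(label)
    return automatic, needs_expert


# "tongue discoloration" example; the branch tags are assumed here.
auto, review = relations_for_mapping([
    ("entire tongue", "body structure"),
    ("abnormal color", "finding"),
])
print(auto)    # relation deduced from the branch
print(review)  # left for expert decision
```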

ii) We were also occasionally forced to revise the mappings proposed by Nadkarni and Darer, partly for reasons of pure semantic accuracy and partly because of the purpose of OntoADR, which we illustrate here using three examples:


(combined site)" is set in an isolated portion of the SNOMED CT branch "body structure" (branch called "group of anatomical entities"). Nothing connects it to the concepts of the digestive system structures (e.g., no relationship part-of). Due to the SEP decomposition (structure, entire, part) of the anatomical branch of SNOMED CT, the concept "anal structure" has no relation to the concept "anus and rectum." It would have been impossible to use this localization by semantic reasoning, for example, to identify concepts located on part of the anorectal system (principle of subsumption reasoning: concepts that have a relationship of location on parts of the anorectal structure are considered by inference as siblings of the concept of diseases that are located on the whole anorectal structure). We therefore preferred to use the SNOMED CT concept "anorectal structure" to define the relationship hasFindingSite of the MedDRA concept "anorectal discomfort" in OntoADR. This SNOMED CT concept allows the semantic reasoning operation described previously. Moreover, we can assume that the MedDRA term "anorectal discomfort" is sometimes used to encode ADRs in a non-specified way that may be anal or rectal, and not both, as is implied by the use of the SNOMED CT concept "anus and rectum (combined site)." It is therefore important to locate by subsumption concepts that are located within a substructure of the whole anorectal structure.

#### Using a Syntactic Decomposition Algorithm on Complex Medical Dictionary for Regulatory Activities Terms

Among MedDRA terms that are not mapped with SNOMED CT terms in UMLS, there are many complex terms, i.e., corresponding to composed expressions. MedDRA complex terms are of several kinds: a) expressions composed with an AND logical operator or commas (e.g., "acute and chronic thyroiditis" or "pregnancy, labour, delivery, and postpartum conditions"); b) expressions composed with "NEC" (not elsewhere classified), "unspecified," or with a text between brackets, usually to specify exclusion clauses [e.g., "autoimmune disorders NEC," "laryngeal neoplasms malignancy unspecified," or "ocular neoplasms malignant (excl. melanomas)"]; c) they can also combine these different kinds of complexity [e.g., "gastrointestinal and abdominal pains (excl. oral and throat)" or "ocular structural change, deposit, and degeneration NEC"]. These terms are usually terms of level HLT, HLGT, and SOC in the MedDRA hierarchy. However, their definitions have a great added value because some terms they subsume may inherit their properties. Indeed, defining one high level term with a morphology property may amount to defining all child terms with this property within the limits of what we have indicated in the section *Problems With Mapping Other Layers Than the PT Level*.

The complex MedDRA terms present two kinds of difficulties: i) the difficulty of mapping with SNOMED CT, which tends to favor simple concepts, probably because most complex concepts correspond to pure classifying artifacts, e.g., "not elsewhere classified," without a real counterpart in the phenomena that are part of medicine; ii) difficulties in formalizing meaning: representing in OWL (Web Ontology Language) the meaning of a compound concept containing exclusions with logical operators is constrained by the expressiveness of the DL language used. In OntoADR, it is not possible to describe the exact MedDRA semantics due to computability constraints. To date, we have developed a technical solution for the first point, but no satisfactory solution of conceptualization (especially in terms of the human cost of modeling) has yet been developed to meet the second point. It should be noted that this issue mainly concerns high level terms and does not affect the progress of definitions for PT terms in OntoADR.

In order to map complex MedDRA terms, we developed an algorithm for syntactic decomposition. It consists of three routines: 1) a routine for "cleaning" terms; 2) a routine for identification of complex expressions; and 3) a routine for decomposition of an expression from a set of formal rules. Routine 1 begins by removing from the MedDRA labels unnecessary characters or characters that cannot be supported by the decomposition routine [stop words, content between brackets, terms such as "unspecified," "NOS" (not otherwise specified), etc.]. Routine 2 identifies decomposable expressions: it searches for keywords that indicate a probable composition of the expression ("AND," "OR," "WITH," ",", etc.). Finally, routine 3 decomposes the complex expression into a set of simpler expressions (cf. **Table 2**) by applying different rules, for example:

$$\begin{aligned} \text{(A AND B).q} &\rightarrow \text{ A.q} + \text{B.q} \\ \text{q.(A AND B)} &\rightarrow \text{ q.A} + \text{q.B} \end{aligned}$$
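The three routines and the two distribution rules above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the noise word list, and the simplifying assumption that A and B are the single words adjacent to "and" are ours, and comma-separated lists are not handled. The label "retinal deposit and degeneration" is a hypothetical example for the second rule.

```python
import re

# Routine 1 (assumed noise list): bracketed clauses and classifying markers.
BRACKETS = re.compile(r"\([^)]*\)")
NOISE = re.compile(r"\b(NEC|NOS|unspecified)\b", re.IGNORECASE)

# Routine 2: keywords suggesting a composed expression.
KEYWORDS = re.compile(r"\b(and|or|with)\b|,", re.IGNORECASE)


def clean(label):
    """Remove bracketed content and markers such as NEC, NOS, unspecified."""
    label = BRACKETS.sub(" ", label)
    label = NOISE.sub(" ", label)
    return " ".join(label.split())


def is_complex(label):
    """Return True if the label contains a composition keyword."""
    return bool(KEYWORDS.search(label))


def decompose(label):
    """Distribute the shared qualifier q over A and B around the first 'and'.

    Words before A act as q in q.(A AND B); words after B act as q in
    (A AND B).q. A and B are assumed to be single words.
    """
    words = label.split()
    lowered = [w.lower() for w in words]
    if "and" not in lowered:
        return [label]
    i = lowered.index("and")
    if i == 0 or i == len(words) - 1:
        return [label]
    head, a = words[:i - 1], words[i - 1]
    b, tail = words[i + 1], words[i + 2:]
    return [" ".join(head + [a] + tail), " ".join(head + [b] + tail)]


label = clean("gastrointestinal and abdominal pains (excl. oral and throat)")
if is_complex(label):
    print(decompose(label))
print(decompose("retinal deposit and degeneration"))  # hypothetical label
```

Each resulting expression would then be submitted separately to the concept mapping step.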

We then used the MetaMap software (Aronson, 2001) to map all new decomposed concepts to existing SNOMED CT concepts.

#### Automatic Lexical Enrichment Methods

We have used a rule-based algorithm for automatic suggestion of properties from the MedDRA label to enrich the formal definition of concepts. Two key procedures have been implemented in the algorithm:

1. When the algorithm detects a given string Sx in a MedDRA label, it automatically adds a corresponding property Px in the OWL concept definition. For example: if the string "pain" or "algia" is found in a MedDRA concept's label, the semantic property hasDefinitionalManifestation some Pain is automatically added to the concept's definition. Similarly, if the string "perforation" is found, the formal definition hasAssociatedMorphology some Perforation is suggested. All created properties are then validated by an expert. Illegitimate properties are rejected. For example, the algorithm proposed to add the formal property hasAssociatedMorphology some Hernia to the MedDRA concept "hernia repair," as the string "hernia" was found in the label. Semantically, this assignment is obviously illegitimate: a "hernia repair" is not a type of hernia and cannot be defined by this morphological property. The property was therefore rejected. The expert, however, took advantage of this suggestion to correct it to OccursAfter some Hernia. This validation step is also necessary because the expressions used for automatic generation of properties are occasionally polysemic. For example, the automatic generation of the hasClinicalCourse some Cyclic property when the algorithm detects the "cyclic" string in a MedDRA label is valid for concepts such as "cyclic neutropenia" or "cyclic vomiting syndrome," where the term "cyclic" indicates the clinical course of the disease. However, it is not valid for the concept "cyclic AMP," which refers to a clinical test (a measure of the presence or amount of cyclic adenosine monophosphate, e.g., in urine). We could have improved the automatic processing in order to detect these problematic cases, but formalizing these exceptions would have taken longer than using a manual approval process.

TABLE 2 | Examples of parsing of complex MedDRA terms using different rules.

A restriction is applied to prevent the property Px from being duplicated when it already exists in a MedDRA term definition: e.g., the detection of the "perforation" string in the label of a MedDRA concept CMed only results in the creation of the property hasAssociatedMorphology some Perforation if CMed does not already have a property hasAssociatedMorphology some <Morphology>. If it does, we assume that the relation Rx (in this case hasAssociatedMorphology) has already been filled in correctly.

2. A second procedure is implemented by the algorithm to automatically generate properties. Based on the same principles, but working with more complex patterns of recognition, it was designed to complete definitions of MedDRA concepts referring to investigations and their results (SOC "Investigations").

Two relationships are available in SNOMED CT to define the examination results (whether clinical observations or investigations): interprets, which refers to "the entity being evaluated or interpreted, when an evaluation, interpretation, or "judgment" is intrinsic to the meaning of a concept"; and hasInterpretation, which, grouped with the attribute Interprets, "designates the judgment aspect being evaluated or interpreted for a concept (e.g., presence, absence, degree, normality, abnormality, etc.)" (Rector and Brandt, 2008). It is important that these two relationships are filled in OntoADR in order to apply semantic reasoning not only to ADR concepts as such, but also to concepts referring to abnormal results of investigations that are the consequence of an ADR (for instance, "neutrophil count decreased" for the neutropenia condition), as such results are frequently used to describe ADRs in pharmacovigilance databases. However, it turned out that very few MedDRA concepts located in the investigations branch could be identified through the procedures described in the previous sections, in particular the mapping from UMLS. A large majority of MedDRA concepts in SOC investigations thus remained undefined in OntoADR.

To remedy this situation, we have integrated into the algorithm a module supporting the properties interprets and hasInterpretation for MedDRA concepts from SOC "investigations." Results of investigations are usually expressed in MedDRA using the following adjectives: abnormal, normal, absent, present, increased, decreased, positive, and negative. All these qualifiers are also used in SNOMED CT to fill the property hasInterpretation. The procedure followed by the algorithm was therefore as follows:

When the string Sx corresponding to one of these adjectives is detected in the label <lab1> of a MedDRA concept CMed1 from SOC "Investigations":


In the example of the concept CMed1 "Alpha hydroxybutyrate dehydrogenase decreased," this procedure gives the following results:


Once again, all of the created properties were reviewed and validated by an expert.
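The two procedures of this section can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used for OntoADR: the rule table contains only the strings quoted in the text, the function names are invented, and for investigations we assume that the qualifier is the last word of the label and that the remaining words are merely a candidate string for the interprets relation. Every suggestion is meant to be reviewed by an expert, as in the text.

```python
# Procedure 1: strings quoted in the text and the property each one suggests.
STRING_RULES = [
    ("pain", ("hasDefinitionalManifestation", "Pain")),
    ("algia", ("hasDefinitionalManifestation", "Pain")),
    ("perforation", ("hasAssociatedMorphology", "Perforation")),
    ("hernia", ("hasAssociatedMorphology", "Hernia")),
    ("cyclic", ("hasClinicalCourse", "Cyclic")),
]

# Procedure 2: result adjectives listed in the text.
QUALIFIERS = ("abnormal", "normal", "absent", "present",
              "increased", "decreased", "positive", "negative")


def suggest_properties(label, existing):
    """Suggest (relation, filler) pairs for expert review.

    existing maps relations already present in the definition to their
    filler; a relation that is already filled is never suggested again.
    """
    suggestions = []
    lowered = label.lower()
    for string, (relation, filler) in STRING_RULES:
        if string not in lowered or relation in existing:
            continue
        if (relation, filler) not in suggestions:
            suggestions.append((relation, filler))
    return suggestions


def suggest_interpretation(label):
    """Suggest hasInterpretation and a candidate string for interprets.

    Assumes the qualifier is the last word of the label (our assumption).
    """
    words = label.split()
    if len(words) > 1 and words[-1].lower() in QUALIFIERS:
        return {
            "interprets_candidate": " ".join(words[:-1]),
            "hasInterpretation": words[-1].capitalize(),
        }
    return None


# "hernia repair" yields the illegitimate suggestion an expert must reject.
print(suggest_properties("hernia repair", {}))
print(suggest_interpretation("Alpha hydroxybutyrate dehydrogenase decreased"))
```

Comparing only the last word avoids confusing "normal" with "abnormal," which a plain substring test would do.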

#### Manual Definition

Besides these semiautomatic methods for defining MedDRA concepts in OntoADR, we also performed the manual definition of about 1,935 concepts (Souvignet et al., 2016b). We had insufficient human resources to carry out the manual definition of all MedDRA terms that previous methods had failed to define, so we decided to focus on terms with high added value for pharmacovigilance. In the EU-ADR project, Trifirò et al. (2009) developed a ranked list of 23 adverse drug events of primary importance (e.g., cardiac valve fibrosis) based on a review of scientific literature, medical textbooks, and websites of regulatory agencies. To identify which MedDRA terms are related to those 23 topics, pharmacovigilance experts familiar with MedDRA chose for each topic an SMQ and/or MedDRA hierarchy-based grouping (HLT or HLGT) or a custom set of preferred terms (PT) fitting the definition of the targeted topic (see Declerck et al., 2012 for details). When no existing MedDRA grouping could be identified to fit the safety topic, *ad hoc* manual groupings of MedDRA PT were proposed by the experts. This work benefited from a dedicated tool we implemented, Ci4SeR (curation interface for semantic resources) (Souvignet et al., 2014).

### RESULTS

#### Using Other Medical Dictionary for Regulatory Activities-to-SNOMED Clinical Terms Mapping Resources

Once Nadkarni and Darer's mapping propositions were validated, modified, or completed, we applied the same procedure as described in the section *Using MedDRA-to-SNOMED CT Mappings From UMLS Metathesaurus* to pick up information from SNOMED CT and define the MedDRA concepts of OntoADR. Using the set of SNOMED CT relations available in OntoADR, we also manually defined those MedDRA terms (53 in total) for which no mapping could be found by Nadkarni and Darer. The use of the mappings proposed by Nadkarni and Darer, after verification and, where necessary, correction and complementation, allowed us to complete the definition of 786 supplementary MedDRA PTs in OntoADR.

#### Using a Syntactic Decomposition Algorithm on Complex Medical Dictionary for Regulatory Activities Terms

Among the 2,070 HLT, HLGT, and SOC in MedDRA 13.0, a total of 1,011 terms were decomposed by the algorithm, generating an average of 2.7 terms per decomposition. The consistency of the automatic decomposition was checked by an expert, and errors were corrected through a progressive adjustment of the decomposition algorithm. Only 30 complex terms that were not supported by the algorithm had to be decomposed manually. Once the decomposition was performed, we used the UMLS MetaMap 2010 AB mapping software, which returns, for a given string (in our case, a part of the decomposition), the UMLS concept unique identifiers of the syntactically nearest SNOMED CT concepts (fuzzy match). With this method, a total of 638 MedDRA concepts (9 SOCs, 131 HLGTs, and 498 HLTs) could be mapped to SNOMED CT concepts (one-to-one or one-to-n mappings).

This additional mapping method has the advantage of enabling the definition of high level terms in MedDRA. These definitions may then be inherited by the low level terms they subsume. However, the definitions also have the disadvantage of being broad and thus potentially not precise enough for specific preferred terms.
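To make the idea of syntactic decomposition more concrete, the following Python sketch splits a coordinated label into its parts. It is not the algorithm used for OntoADR: the splitting rule (commas and the coordinators "and"/"or"), the handling of parenthesised qualifiers, the function name `decompose`, and the example label are all assumptions made for illustration. In particular, the sketch does not redistribute a shared head noun across the parts.

```python
import re

# Assumed rule: commas and the coordinators "and"/"or" separate the parts.
_COORDINATOR = re.compile(r"\s*,\s*|\s+(?:and|or)\s+", flags=re.IGNORECASE)
# Assumed rule: parenthesised qualifiers such as "(excl ...)" are set aside.
_QUALIFIER = re.compile(r"\s*\([^)]*\)")


def decompose(label):
    """Split a coordinated label into candidate sub-terms (illustrative only)."""
    core = _QUALIFIER.sub("", label).strip()
    return [part.strip() for part in _COORDINATOR.split(core) if part.strip()]


# Illustrative label written in the style of a MedDRA grouping term.
print(decompose("Coagulopathies and bleeding diatheses (excl thrombocytopenic)"))
```

Each resulting part would then be submitted to a mapping tool; a label in which several modifiers share one head noun would need an additional step that this sketch leaves out.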

#### Automatic Lexical Enrichment Methods

This procedure was applied to 11 of the 25 SNOMED CT properties used in OntoADR, using 82 different matching strings. In total, this procedure has led to the creation of 8,194 properties, among which 7,691 were validated (i.e., 93.9%). A sample of the strings detected by the algorithm and properties created is shown in **Table 3**.
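The principle of this lexical enrichment can be sketched in Python as follows. The three rules, the property names (`hasClinicalCourse`, `hasOccurrence`), and the example labels are hypothetical and are not taken from the 82 matching strings actually used; only the two counts at the end come from the text above.

```python
import re

# Hypothetical rules: (matching string, property name, property value).
RULES = [
    ("acute", "hasClinicalCourse", "Acute"),
    ("chronic", "hasClinicalCourse", "Chronic"),
    ("congenital", "hasOccurrence", "Congenital"),
]


def propose_properties(label):
    """Propose properties for a label from whole-word string matches."""
    words = set(re.findall(r"[a-z]+", label.lower()))
    return [(prop, value) for string, prop, value in RULES if string in words]


print(propose_properties("Hepatitis acute"))        # one proposed property
print(propose_properties("Subacute endocarditis"))  # no whole-word match

# Validation rate reported in the text: 7,691 validated out of 8,194 created.
validation_rate = round(100 * 7691 / 8194, 1)
print(validation_rate)
```

Matching on whole words rather than substrings is a design choice of this sketch; it avoids proposing "Acute" for a label containing "subacute", which is the kind of error that manual validation would otherwise have to catch.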

#### Manual Definition of Concepts

**Figure 7** depicts as an example the formal definition associated to the term "Shwachman-Diamond syndrome," as it was described in OntoADR after application of the different algorithms that precede manual refinement.

TABLE 3 | Sample of the properties created automatically from the MedDRA label to enrich the formal definitions of MedDRA concepts in OntoADR.

FIGURE 7 | Formal definition associated to the preferred term "Shwachman-Diamond syndrome" before manual refinement.

The curation, which took approximately 750 h, allowed refining the definition of 1,935 MedDRA terms to validate and fully define these terms (Souvignet et al., 2016b). Among the 3,482 properties available in OntoADR for these terms, the curator validated 2,636 properties (76%), proposed 350 (10%) more precise terms (i.e., narrower terms in the SNOMED CT hierarchy), and removed 496 properties (14%). The curator also proposed 13,675 additional properties, but these should not be considered as errors related to missing properties but rather as the curator's desire to better document diagnoses with signs and symptoms and investigations that may be associated to a given disease but are not specific, as they may be absent in some occurrences of this disease.
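The curation figures reported above can be cross-checked with a few lines of Python. The counts are those given in the text; the variable names and the derived average time per term are ours and are only indicative.

```python
# Counts reported in the text for the 3,482 pre-existing properties.
total_properties = 3482
outcomes = {"validated": 2636, "refined": 350, "removed": 496}

# The three outcomes should partition the pre-existing properties.
partition_ok = sum(outcomes.values()) == total_properties

# Shares rounded to whole percentages, as in the text.
shares = {name: round(100 * n / total_properties) for name, n in outcomes.items()}

# Indicative average curation time: 750 h spread over 1,935 terms.
minutes_per_term = 750 * 60 / 1935

print(partition_ok, shares, round(minutes_per_term, 1))
```

Under these figures, the three outcomes add up to the 3,482 properties, and the curation effort averages a little over 23 min per term.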

**Figure 8** shows how the "Shwachman-Diamond syndrome" PT's formal definition was modified by the curator in the Ci4SeR tool. The lowest part of the screenshot contains the properties that were automatically proposed considering the parent's and siblings' formal definitions. **Table 4** depicts the results using each method.

#### DISCUSSION

#### Summary

We have described in this article several methods that collectively allow a better semantic enrichment of MedDRA. **Table 4** shows that using the UMLS Metathesaurus was the most efficient method in terms of the number of mappings and helped to add formal definitions for about half of MedDRA terms. As mapping resources other than UMLS are rare and concern only a few MedDRA terms, Nadkarni and Darer's resource made it possible to add properties to 4.2% of MedDRA terms but only to 2.4% of MedDRA terms that were not associated with mappings to SNOMED CT in UMLS.

TABLE 4 | Synthesis of mappings and properties found using all previously described methods.

*a Represents only the number of mappings/properties that were not found by other methods.*

*b Limited to SOC, HLGT, and HLT.*

Our proposal to decompose complex MedDRA terms was applied only to the SOC, HLGT, and HLT levels and accounted for 30.8% of these MedDRA terms above the PT level. Manual definitions and refinements of definitions obtained with other methods made it possible to process 9.1% of MedDRA terms, which is more than the proportion of terms that were defined using Nadkarni and Darer's mapping resource. However, this required a highly time-consuming effort from the domain expert, which confirms previous work, e.g., Giannangelo and Millar (2012), who observed that "map specialists on average mapped 6.5 SNOMED CT concepts an hour." **Table 5** summarizes the main characteristics of each method and indicates whether the proposed method reuses existing knowledge and whether it requires manual adaptations or may be performed in an automated way.

#### Related Work in Medical Informatics

He et al. (2014) have introduced the Ontology of Adverse Events (OAE). OAE was originally targeted for vaccine adverse events (Marcos et al., 2013) and now also includes adverse drug events. In practice, using OAE to select case reports in the Vaccine Adverse Event Reporting System proved difficult: "AE data stored in Vaccine Adverse Event Reporting System are annotated using MedDRA" (Marcos et al., 2013). Authors complained that "many disadvantages of MedDRA, including the lack of term definitions and a well-defined hierarchical and logical structure, prevent its effective usage in VAE (vaccine adverse event) term classification." Therefore, for an efficient analysis, they performed a mapping between MedDRA and OAE (Sarntivijai et al., 2012).

OAE contains about 2,300 AE entities but only 1,900 MedDRA mappings (9% of all MedDRA PT). For example, there is a single term for upper gastrointestinal hemorrhage in OAE (He et al., 2014), whereas one can cite several in MedDRA (see the section *Rationale for Supplementing MedDRA With Formal Definitions* where we identified 27 using OntoADR). Furthermore, OAE formal definitions are limited to anatomical and physiopathological descriptions. He and colleagues proposed extensions to OAE such as the Ontology of Drug Neuropathy Adverse Events (Guo et al., 2016), which suggests that providing supplementary MedDRA mappings is possible using the same methodology. One advantage of OAE is the possibility to use it in open access, which allows wide dissemination to users, while legal issues related to ownership of MedDRA and SNOMED CT should be solved before we can make OntoADR available.

Adverse Events Reporting Ontology aims to allow storing of pharmacovigilance data related to anaphylaxis according to guidelines defined by the Brighton collaboration (Courtot et al., 2014) but may also be extended to other safety topics, e.g., malaria (Courtot et al., 2013). Nevertheless, ADRs are not formally defined in Adverse Events Reporting Ontology.

While we did not find any available resource providing definitions for every ADR in MedDRA, there are more general resources with a formal representation of clinical terms. To avoid building the definitions of ADRs from scratch, we needed a trustworthy, standardized, and reliable formal resource. We chose SNOMED CT for three main reasons. First, pharmacovigilance concepts generally do not differ from those used in other medical fields. Second, SNOMED CT is the most complete and most detailed terminology of medicine with a formal semantic foundation currently available (Elkin et al., 2006), and it shares common fields with MedDRA (medical pathologies in all medical specialties, signs and symptoms, laboratory test results, some diagnostic and therapeutic procedures). Finally, SNOMED CT has the advantage of covering to a large extent, if not entirely, other standard medical terminologies such as the International Classification of Diseases, 10th edition (ICD-10); in particular, more than 50% of MedDRA terms (excluding LLT) are associated with a SNOMED CT concept in UMLS (Bodenreider, 2009), a degree of coverage that, to our knowledge, no other current medical ontology is able to match.

We found in the literature several examples of mappings from a terminology to SNOMED CT (Vikström et al., 2007; Merabti et al., 2009; Nyström et al., 2010; Dhombres and Bodenreider, 2016; Fung et al., 2017). However, the objective was usually to integrate a terminology into SNOMED CT or to map this terminology to SNOMED CT, not to enrich this terminology by means of formal definitions. The "lexically assign, logically refine" method is an example of an automated method in which Logical Observation Identifiers Names and Codes (LOINC) and SNOMED terms are first decomposed and then refined by means of knowledge-based methods, which made it possible to map LOINC and SNOMED together (Dolin et al., 1998). In another work, Adamusiak and Bodenreider (2012) developed an OWL version of both LOINC and SNOMED CT and made use of mappings between SNOMED CT terms to identify redundancy and inconsistencies in the LOINC multi-axial hierarchy. Roldán-García et al. (2016) implemented Dione, an OWL representation of ICD-10-CM in which formal definitions were obtained thanks to mappings between ICD-10-CM and SNOMED CT available in UMLS and the Bioportal. More recently, Nikiema et al. (2017) benefited from SNOMED CT logical definitions to find mappings between ICD-10 and ICD-O3 concepts in the domain of cancer diagnosis terminologies.

It is usually recommended to build medical terminologies following the model of clinical terminologies that obey Cimino's desiderata (Cimino, 1998; Bales et al., 2006). Such a model brings several advantages, such as improving the maintenance of large terminologies (Cimino et al., 1994), and formal definitions were implemented in several terminologies such as the NCI Thesaurus (Hartel et al., 2005). Our approach is more in line with what is recommended by Ingenerf and Giere (1998), that is to say, to keep terminologies with the disjoint classes required for statistics (in a clinical terminology, the same term may be present in several separate categories because of multiple inheritance and may be counted more than once) and instead implement a mapping of the terms of the first-generation system to a formal system. This allows keeping the MedDRA terminology in its current format, counting ADRs according to predefined categories that are standardized and replicable at the international level with MedDRA, and building new categories on demand by using knowledge engineering methods. This is what we have done in our implementation of OntoADR (Bousquet et al., 2014) in the form of an OWL-DL file and in the form of a database (Souvignet et al., 2016b).

We have no knowledge of other works in which the formalization of complex terms involving AND/OR relations has been performed in an automated way. We have not proposed formal definitions of LLTs because this level is reserved for the coding of case reports, where it improves the accuracy of coding, but it is not useful for grouping data for analysis (which is performed at the PT level). Although the analysis of pharmacovigilance databases is performed preferentially at the PT level, it could be important to also define the upper levels: SOC, HLGT, and HLT. This formalization would bring several advantages: i) preferred terms may inherit properties from their parents, which gives them a formal definition when the synonymous SNOMED CT concept has no definition or when no SNOMED CT concept is mapped to this PT in UMLS; ii) terminological reasoning could be used to calculate the high-level MedDRA categories in which PTs should be included and therefore restore multiple inheritance that does not exist in MedDRA. However, it is advisable to remain cautious insofar as the relations between a PT and the higher hierarchical levels to which it is attached are not always of a taxonomic nature.
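The inheritance of properties from higher levels, mentioned as the first advantage above, can be sketched as follows. This is a minimal illustration and not the OntoADR implementation: the function name, the term names, and the property `hasFindingSite` are hypothetical, and the sketch assumes that the parent links are taxonomic, which, as noted above, is not always the case in MedDRA.

```python
def effective_properties(term, own, parents):
    """Return a term's own properties, or else those inherited from its parents."""
    if own.get(term):
        return set(own[term])
    inherited = set()
    for parent in parents.get(term, ()):
        inherited |= effective_properties(parent, own, parents)
    return inherited


# Hypothetical toy hierarchy: a PT without any mapping, under a defined HLT.
parents = {
    "Example PT": ["Example HLT"],
    "Example HLT": ["Example HLGT"],
}
own = {
    "Example HLT": {("hasFindingSite", "Stomach structure")},
}

print(effective_properties("Example PT", own, parents))
```

Because inherited definitions are broader than definitions built for the term itself, they would still need expert review before being used to select case reports.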

#### Perspectives

Our aim is to add formal definitions to a larger number of MedDRA terms. Our approach may be improved by using more advanced natural language processing techniques (Iavindrasana et al., 2006; Deléger et al., 2009; Liu et al., 2011; Dupuch et al., 2014) than the basic semantic enrichment we performed on MedDRA labels. We estimate that the methods proposed here can be reused for other first-generation terminologies, provided that these terminologies have a mapping to SNOMED CT with fair coverage and that this mapping is available in accessible sources of knowledge such as the UMLS. The terminology can also be processed using natural language processing methods, as was done, for example, with LOINC in the "lexically assign, logically refine" method (Dolin et al., 1998). One can also consider cases in which the terminology would be defined by mapping to a clinical terminology other than SNOMED CT. This may be the case in other areas of application in which SNOMED CT is not the best choice.

As the manual approach was time consuming and requires human resources we do not have, we plan to rely on the development of complementary automated approaches. First, formal definitions could be extracted from textual definitions (Petrova et al., 2015) or directly using morphosemantic analysis on the term label, e.g., blepharitis, where "itis" stands for "inflammation" and "blephar" stands for "eyelid." Such an approach is limited to terms containing "compound forms" that have a medical meaning (Deléger et al., 2009). Second, formal definitions could be based on ontology design patterns, as implemented in tools like Ontorat (Xiang et al., 2015) or TermGenie (Dietze et al., 2014), which only partially automate the process, as they still rely on expert curation. Third, additional mappings between MedDRA and other terminologies could be obtained *via* improved mappings in the UMLS Metathesaurus (Bodenreider et al., 1998; Fung et al., 2007; Diallo, 2014). Fourth, semantic definitions may be audited by comparing definitions associated with terms that present lexical similarities (Agrawal and Elhanan, 2014). However, this presents an intrinsic limit: the terms to compare should consist of at least three words, which constrains this method mainly to MedDRA procedures.

Fifth, we plan to extract knowledge from sources other than SNOMED CT, such as the NCI Thesaurus (Sioutos et al., 2007), which could be useful to build definitions for MedDRA terms that describe cancer-related adverse reactions. A recent work by Oliveira and Pesquita (2018) reports that current ontology matching techniques and systems are mostly devoted to finding links between two equivalent entities from two distinct ontologies. However, different domains may be involved, which requires the implementation of matching techniques that allow linking more than two ontologies through more complex relations. An example is "aortic valve stenosis" (from the Human Phenotype Ontology), which is equivalent to the combination of "aortic valve" (from the Foundational Model of Anatomy) and "constricted" (from the Phenotype And Trait Ontology).
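Returning to the morphosemantic analysis mentioned as the first of these perspectives, the blepharitis example can be sketched in a few lines of Python. The two lexicon entries come from the example in the text; the function name, the output keys, and the suffix-stripping strategy are assumptions for illustration, and a real analyzer would need a far larger lexicon of compound forms.

```python
# Tiny illustrative lexicon limited to the example given in the text.
SUFFIXES = {"itis": "inflammation"}
ROOTS = {"blephar": "eyelid"}


def analyse(term):
    """Decompose a single-word term into a suffix meaning and a root meaning."""
    word = term.lower()
    for suffix, morphology in SUFFIXES.items():
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            if root in ROOTS:
                return {"morphology": morphology, "site": ROOTS[root]}
    return None  # no known compound form: the term cannot be analysed


print(analyse("Blepharitis"))
print(analyse("Nausea"))
```

The second call returns `None`, which reflects the limitation stated above: the approach only applies to terms built from compound forms that carry a medical meaning.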

#### CONCLUSION

The possibility of selecting terms using formal definitions and terminological reasoning is a major advantage of clinical terminologies with formal semantics, such as SNOMED CT, over classic terminologies. MedDRA, as a standard international terminology for the coding of ADRs in pharmacovigilance databases, could benefit from these knowledge engineering techniques, but MedDRA terms first have to be defined using formal languages. As manually defining MedDRA terms takes much time, it is important to reuse the available ontological and non-ontological resources as much as possible to expedite the generation of formal definitions. The collection of methods we present can collectively support a semiautomatic semantic enrichment of MedDRA. Future work will aim to implement more efficient techniques to find more logical relations between SNOMED CT and MedDRA in an automated way.

#### AUTHOR CONTRIBUTIONS

GD adapted Nadkarni and Darer's definitions to OntoADR, performed automatic lexical enrichment methods, and wrote the first draft of the manuscript. M-CJ provided significant advice on the design of the study and contributed to the evaluation. ES first, then JS, performed mappings between MedDRA and SNOMED CT using the UMLS metathesaurus and developed new versions of OntoADR. JS performed the syntactic decomposition algorithm on complex MedDRA terms and wrote the corresponding section of the manuscript. CB conducted the study, contributed to the evaluation, reviewed state-of-the-art related work, and wrote the final article. CB and M-CJ were responsible for submitting the PROTECT project to the IMI requests for proposal and for submitting the PEGASE project to the ANR request for proposal. All authors have made substantial contributions and approved the final manuscript and agreed to be accountable for all aspects of the work.

### FUNDING

This work was funded by the grant N° ANR-16-CE23-0011-01 from the ANR, the French Agence nationale de la Recherche, through the PEGASE project (*Pharmacovigilance enrichie par des Groupements Améliorant la detection des Signaux Emergents*). Some results described in this article were obtained as part of the PROTECT consortium (Pharmacoepidemiological Research on Outcomes of Therapeutics by a European Consortium, www.imi-protect.eu), which is a public–private partnership coordinated by the European Medicines Agency. The PROTECT project has received support from the Innovative Medicine Initiative Joint Undertaking (www.imi.europa.eu) under grant agreement no. 115004, resources of which are composed of financial contribution from the European Union's Seventh Framework Program (FP7/2007-2013) and EFPIA companies' in-kind contribution.

### ACKNOWLEDGMENTS

The authors express their gratitude to Adrien Fanet for his technical contribution to the conception of OntoADR, and to Anne Jamet and Hadyl Asfari for their medical expertise.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Bousquet, Souvignet, Sadou, Jaulent and Declerck. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*