# VIRTUAL DRUG DESIGN

Edited by: Daniela Schuster and Honglin Li

Published in: Frontiers in Chemistry

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-359-3 DOI 10.3389/978-2-88963-359-3

Topic Editors:

Daniela Schuster, Paracelsus Medical University, Austria

Honglin Li, East China University of Science and Technology, China

In the current drug research environment in academia and industry, cheminformatics and virtual screening methods are well established and integrated tools. Computational tools are used to predict a compound's 3D structure, the 3D structure and function of a pharmacological target, ligand-target interactions, binding energies, and other factors essential for a successful drug. This includes molecular properties such as solubility, logP value, susceptibility to metabolism, cell permeation, blood brain barrier permeation, interaction with drug transporters and potential off-target effects. Given that approximately 40 million unique compounds are readily available for purchase, such computational modeling and filtering tools are essential to support the drug discovery and development process. The aim of all these calculations is to focus experimental efforts on the most promising candidates and exclude problematic compounds early in the project.

In this Research Topic on virtual activity predictions, we cover several aspects of this research area such as historical perspectives, data sources, ligand treatment, virtual screening methods, hit list handling and filtering.

Citation: Schuster, D., Li, H., eds. (2020). Virtual Drug Design. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-359-3

# Table of Contents


Doris E. Braun and Ulrich J. Griesser

*Structure-Activity Relationship Analysis of 3-Phenylcoumarin-Based Monoamine Oxidase B Inhibitors*

Sanna Rauhamäki, Pekka A. Postila, Sanna Niinivehmas, Sami Kortet, Emmi Schildt, Mira Pasanen, Elangovan Manivannan, Mira Ahinko, Pasi Koskimies, Niina Nyberg, Pasi Huuskonen, Elina Multamäki, Markku Pasanen, Risto O. Juvonen, Hannu Raunio, Juhani Huuskonen and Olli T. Pentikäinen


Mariela Bollini, Emilse S. Leal, Natalia S. Adler, María G. Aucar, Gabriela A. Fernández, María J. Pascual, Fernando Merwaiss, Diego E. Alvarez and Claudio N. Cavasotto

*How Diverse are the Protein-Bound Conformations of Small-Molecule Drugs and Cofactors?*

Nils-Ole Friedrich, Méliné Simsir and Johannes Kirchmair

*Design, Synthesis, and Evaluation of Dihydrobenzo[cd]indole-6-sulfonamide as TNF-α Inhibitors*

Xiaobing Deng, Xiaoling Zhang, Bo Tang, Hongbo Liu, Qi Shen, Ying Liu and Luhua Lai

*Insights Into the Bifunctional Aphidicolan-16-β-ol Synthase Through Rapid Biomolecular Modeling Approaches*

Max Hirte, Nicolas Meese, Michael Mertz, Monika Fuchs and Thomas B. Brück

*How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors*

Pavel V. Pogodin, Alexey A. Lagunin, Anastasia V. Rudik, Dmitry A. Filimonov, Dmitry S. Druzhilovskiy, Mark C. Nicklaus and Vladimir V. Poroikov

*Discovery of Natural Products as Novel and Potent FXR Antagonists by Virtual Screening*

Yanyan Diao, Jing Jiang, Shoude Zhang, Shiliang Li, Lei Shan, Jin Huang, Weidong Zhang and Honglin Li

*Reverse Screening Methods to Search for the Protein Targets of Chemopreventive Compounds*

Hongbin Huang, Guigui Zhang, Yuquan Zhou, Chenru Lin, Suling Chen, Yutong Lin, Shangkang Mai and Zunnan Huang

*Structure-Based Design, Synthesis, Biological Evaluation, and Molecular Docking of Novel PDE10 Inhibitors With Antioxidant Activities*

Jinxuan Li, Jing-Yi Chen, Ya-Lin Deng, Qian Zhou, Yinuo Wu, Deyan Wu and Hai-Bin Luo

*Quantum Chemical Approaches in Structure-Based Virtual Screening and Lead Optimization*

Claudio N. Cavasotto, Natalia S. Adler and Maria G. Aucar


Ashutosh Kumar and Kam Y. J. Zhang


Shoude Zhang, Qiangqiang Jia, Qiang Gao, Xueru Fan, Yuxin Weng and Zhanhai Su

# Design, Synthesis, and Evaluation of Ribose-Modified Anilinopyrimidine Derivatives as EGFR Tyrosine Kinase Inhibitors

Xiuqin Hu<sup>1</sup>, Disha Wang<sup>1</sup>, Yi Tong<sup>1</sup>, Linjiang Tong<sup>2</sup>, Xia Wang<sup>1</sup>, Lili Zhu<sup>1</sup>, Hua Xie<sup>2</sup>, Shiliang Li<sup>1</sup>, You Yang<sup>1</sup>\* and Yufang Xu<sup>1</sup>\*

*<sup>1</sup> Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China, <sup>2</sup> Division of Anti-tumor Pharmacology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China*

#### Edited by:

*Daniela Schuster, University of Innsbruck, Austria*

#### Reviewed by:

*Johannes Kirchmair, University of Hamburg, Germany*

*Dharmendra Kumar Yadav, All India Institute of Medical Sciences Jodhpur, India*

#### \*Correspondence:

*You Yang, yangyou@ecust.edu.cn*

*Yufang Xu, yfxu@ecust.edu.cn*

#### Specialty section:

*This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *02 September 2017* Accepted: *30 October 2017* Published: *15 November 2017*

#### Citation:

*Hu X, Wang D, Tong Y, Tong L, Wang X, Zhu L, Xie H, Li S, Yang Y and Xu Y (2017) Design, Synthesis, and Evaluation of Ribose-Modified Anilinopyrimidine Derivatives as EGFR Tyrosine Kinase Inhibitors. Front. Chem. 5:101. doi: 10.3389/fchem.2017.00101*

The synthesis of a series of ribose-modified anilinopyrimidine derivatives was efficiently achieved using DBU- or *t*BuOLi-promoted coupling of ribosyl alcohols with 2,4,5-trichloropyrimidine as the key step. Preliminary biological evaluation of these compounds as new EGFR tyrosine kinase inhibitors targeting the EGFR L858R/T790M mutant, which is associated with drug resistance in the treatment of non-small cell lung cancer, revealed that the 3-*N*-acryloyl-5-*O*-anilinopyrimidine ribose derivative 1a possessed potent and specific inhibitory activity against EGFR L858R/T790M over WT EGFR. Based on molecular docking studies of the binding mode between compound 1a and EGFR, the distance between the Michael acceptor and the pyrimidine scaffold is considered an important factor for inhibitory potency and for the future design of selective EGFR tyrosine kinase inhibitors against EGFR L858R/T790M mutants.

Keywords: EGFR, tyrosine kinase inhibitors, anilinopyrimidine, glycosides, carbohydrate-based drugs

# INTRODUCTION

Epidermal growth factor receptor (EGFR), a transmembrane protein with tyrosine kinase activity, is essential for cell growth, differentiation, migration, adhesion, and proliferation under normal physiological conditions (Gschwind et al., 2004). However, overexpression of EGFR has been associated with tumor growth and progression in a variety of cancers including non-small cell lung cancer (NSCLC), head and neck squamous cell carcinoma, and pancreatic cancer (Huang and Harari, 1999; Kris et al., 2003; Moore et al., 2007; Harrington et al., 2009). Therefore, regulation of EGFR has been deemed as an important strategy for the development of cancer therapy (Huang and Harari, 1999; Gschwind et al., 2004; Steuer et al., 2015).

First-generation EGFR tyrosine kinase inhibitors (TKIs) such as gefitinib and erlotinib, which possess a 4-anilinoquinazoline scaffold, reversibly inhibit EGFR mutants (L858R and delE746\_A750) as well as wild-type (WT) EGFR, resulting in significant disease control in patients with NSCLC (**Figure 1**) (Cohen et al., 2005; Cheng et al., 2016). However, drug resistance driven by the gatekeeper T790M mutation, in which threonine is replaced by methionine, greatly reduced the clinical efficacy of first-generation TKIs against NSCLC (Ozvegy-Laczka et al., 2005; Balak et al., 2006; De Luca et al., 2008; Pao and Chmielecki, 2010). To address this issue, irreversible EGFR TKIs (afatinib, osimertinib, WZ4002, and CO-1686), which contain a Michael acceptor moiety that binds covalently to the thiol group of Cys797 in the ATP binding domain of EGFR, were developed to treat NSCLC via efficient inhibition of EGFR mutants (**Figure 1**; Castellanos and Horn, 2015). Among them, second-generation TKIs such as afatinib potently inhibited both EGFR mutants (L858R/T790M) and WT EGFR without mutant selectivity, thereby leading to side effects such as rash and diarrhea (Dungo and Keating, 2013). In contrast, third-generation TKIs such as osimertinib, WZ4002, and CO-1686, which bear an anilinopyrimidine core, showed high potency and selectivity for EGFR L858R/T790M over WT EGFR, therefore serving as mutant-selective TKIs targeting EGFR mutants involved in NSCLC (Zhou et al., 2009; Walter et al., 2013; Cross et al., 2014; Finlay et al., 2014; Gray and Haura, 2014).

Considering that drug resistance is rapidly emerging for third-generation TKIs (Eberlein et al., 2015; Niederst et al., 2015; Piotrowska et al., 2015; Thress et al., 2015), the design of EGFR inhibitors with new structural skeletons could lead to the discovery of novel types of TKIs against EGFR mutants such as the triple mutant L858R/T790M/C797S (Günther et al., 2016, 2017; Jia et al., 2016; Juchum et al., 2017; Park et al., 2017). Because most commercially available TKIs are ATP-competitive inhibitors that bind at the catalytic domain of the EGFR tyrosine kinase (Traxler and Furet, 1999; Grünwald and Hidalgo, 2003; Normanno et al., 2003), we envisioned that replacing the phenyl ring on the right side of WZ4002 with a chiral ribosyl moiety would provide compound **1** as a novel type of carbohydrate-based EGFR TKI against the drug resistance involved in NSCLC (**Figure 2**). Here we report the synthesis, preliminary biological evaluation, and molecular docking studies of ribose-containing anilinopyrimidine derivatives as EGFR TKIs against EGFR L858R/T790M.

# MATERIALS AND METHODS

Commercial reagents were used without further purification except where noted. Solvents were dried and redistilled prior to use in the usual way. All reactions were performed in oven-dried glassware with magnetic stirring under an inert atmosphere unless noted otherwise. Analytical thin layer chromatography (TLC) was performed on precoated plates of Silica Gel (0.25–0.3 mm, Shanghai, China). The TLC plates were visualized with UV light and by staining with sulfuric acid-ethanol solution. Silica gel column chromatography was performed on Silica Gel AR (100–200 mesh, Shanghai, China). NMR spectra were measured with a Bruker Avance III 400 or Bruker Avance III 500 spectrometer. The <sup>1</sup>H and <sup>13</sup>C NMR spectra were calibrated against the residual proton and carbon signals of the solvents as internal references (CDCl3: δ<sub>H</sub> = 7.26 ppm and δ<sub>C</sub> = 77.2 ppm; CD3OD: δ<sub>H</sub> = 3.31 ppm and δ<sub>C</sub> = 49.0 ppm). Multiplicities are quoted as singlet (s), broad singlet (br s), doublet (d), doublet of doublets (dd), triplet (t), or multiplet (m). All NMR chemical shifts (δ) are reported in ppm and coupling constants (J) in Hz. Mass spectra were recorded on an Agilent Technologies 6120 or LCT Premier XE FTMS instrument.

#### 1,2-O-Isopropylidene-3-N-acryloyl-3-deoxy-5-O-(2,5-dichloropyrimidin-4-yl)-α-D-ribofuranoside 7

To a solution of compound **5** (0.60 g, 2.47 mmol) in anhydrous CH2Cl2 (30 mL) at room temperature were added DBU (1.48 mL, 5.94 mmol) and 2,4,5-trichloropyrimidine **6** (0.57 mL, 4.94 mmol). After stirring at room temperature for 2 h, the reaction mixture was diluted with saturated aqueous NH4Cl and extracted with CH2Cl2. The organic layer was washed with brine, dried over Na2SO4, and concentrated in vacuo. The residue was purified by silica gel chromatography (petroleum ether/EtOAc: 80/1) to give **7** (0.88 g, 92%) as a pale yellow syrup: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.30 (s, 1 H), 6.29 (dd, J = 1.2, 17.2 Hz, 1 H), 6.12 (dd, J = 10.4, 17.2 Hz, 1 H), 6.07 (d, J = 8.8 Hz, 1 H, NH), 5.90 (d, J = 3.6 Hz, 1 H, H-1), 5.69 (dd, J = 1.2, 10.4 Hz, 1 H), 4.73 (dd, J = 2.4, 12.0 Hz, 1 H), 4.66 (t, J = 4.0 Hz, 1 H), 4.61–4.55 (m, 2 H), 4.16 (m, 1 H), 1.57 (s, 3 H), 1.35 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 165.4, 165.3, 157.4, 157.2, 130.1, 128.0, 117.0, 113.0, 104.7, 79.1, 78.2, 67.3, 52.1, 26.8, 26.5; MS (ESI) m/z calcd for C15H17O5N3Cl2Na [M + Na]<sup>+</sup> 412.0, found 412.0.

#### 1,2-O-Isopropylidene-3-N-acryloyl-3-deoxy-5-O-[5-chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-α-D-ribofuranoside 1a

To a solution of compound **7** (80 mg, 0.21 mmol) and aniline derivative **8** (91 mg, 0.41 mmol) in isobutanol (3 mL) was added TFA (0.12 mL, 1.55 mmol). The mixture was heated to 100 °C and stirred for 5 h. After cooling to room temperature, the mixture was quenched with Et3N (3 mL) and concentrated in vacuo to give a residue, which was purified by silica gel column chromatography (CH2Cl2/MeOH: 30/1) to give **1a** (82 mg, 69%) as a pale yellow powder: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.10 (d, J = 8.8 Hz, 1 H), 8.08 (s, 1 H), 7.35 (br s, 1 H), 6.56 (dd, J = 2.4, 8.8 Hz, 1 H), 6.52 (d-like, J = 2.4 Hz, 1 H), 6.30 (dd, J = 1.2, 17.2 Hz, 1 H), 6.11 (m, 2 H), 5.91 (d, J = 3.6 Hz, 1 H, H-1), 5.67 (dd, J = 1.2, 10.4 Hz, 1 H), 4.66 (m, 2 H), 4.59–4.49 (m, 3 H), 4.22 (m, 1 H), 3.86 (s, 3 H), 3.16 (t-like, J = 5.2 Hz, 4 H), 2.58 (t-like, J = 5.2 Hz, 4 H), 2.34 (s, 3 H), 1.58 (s, 3 H), 1.35 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 165.4, 164.1, 157.9, 156.5, 149.2, 147.5, 130.2, 127.8, 122.0, 120.0, 112.9, 108.3, 106.2, 104.8, 100.6, 79.1, 78.2, 66.6, 55.8, 55.3, 52.7, 50.2, 46.2, 26.8, 26.5; HRMS (ESI) m/z calcd for C27H36O6N6Cl [M + H]<sup>+</sup> 575.2385, found 575.2385.

#### 3-N-Acryloyl-3-deoxy-5-O-[5-chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-D-ribofuranose 1b

A solution of compound **1a** (58 mg, 0.10 mmol) in TFA/acetic acid/water (1/15/4, v/v/v, 5 mL) was stirred at 70 °C for 7 h. Concentration in vacuo and elution through a reverse-phase C-18 column (H2O/MeOH: 2/3) provided **1b** (44 mg, 83%) as a pale yellow syrup: <sup>1</sup>H NMR (400 MHz, CD3OD) δ 8.07 (s, 1 H), 8.05 (s, 2.4 H), 7.92 (m, 3.4 H), 6.67 (d-like, J = 2.4 Hz, 3.4 H), 6.58 (dd, J = 2.8, 8.8 Hz, 3.4 H), 6.35 (m, 3.4 H), 6.21 (m, 3.4 H), 5.67 (m, 3.4 H), 5.38 (d, J = 4.0 Hz, 1 H), 5.22 (s, 2.4 H), 4.67 (m, 2.4 H), 4.58 (m, 3.4 H), 4.49 (m, 4.4 H), 4.34–4.24 (m, 4.4 H), 4.00 (d-like, J = 4.4 Hz, 2.4 H), 3.87 (s, 10.2 H), 3.78 (m, 6.8 H), 3.58 (m, 6.8 H), 3.26 (m, 6.8 H), 3.03 (m, 6.8 H), 2.96 (s, 10.2 H); HRMS (ESI) m/z calcd for C24H32O6N6Cl [M + H]<sup>+</sup> 535.2072, found 535.2079.

#### 2,3-O-Isopropylidene-5-O-(2,5-dichloropyrimidin-4-yl)-α-D-ribofuranosyl acrylamide 14

To a solution of compound **13** (147 mg, 0.60 mmol) in anhydrous CH2Cl2 (25 mL) at room temperature were added DBU (0.21 mL, 1.42 mmol) and 2,4,5-trichloropyrimidine **6** (0.12 mL, 1.07 mmol). After stirring at room temperature for 2 h, the reaction mixture was diluted with saturated aqueous NH4Cl and extracted with CH2Cl2. The organic layer was washed with brine, dried over Na2SO4, and concentrated in vacuo. The residue was purified by silica gel chromatography (petroleum ether/EtOAc: 5/1) to give **14** (217 mg, 93%) as a colorless syrup: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.36 (s, 1 H), 6.58 (d, J = 9.2 Hz, 1 H, NH), 6.32 (dd, J = 1.2, 17.2 Hz, 1 H), 6.15 (dd, J = 10.4, 17.2 Hz, 1 H), 6.08 (dd, J = 4.0, 9.2 Hz, 1 H, H-1), 5.72 (dd, J = 1.2, 10.4 Hz, 1 H), 4.86 (d-like, J = 6.0 Hz, 1 H), 4.82 (dd, J = 4.4, 6.0 Hz, 1 H), 4.55 (m, 2 H), 4.45 (t, J = 2.8 Hz, 1 H), 1.57 (s, 3 H), 1.38 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 165.1, 165.0, 157.7, 157.4, 130.6, 128.2, 116.8, 113.4, 82.3, 81.6, 79.7, 79.4, 70.4, 26.4, 24.8; HRMS (ESI) m/z calcd for C15H17O5N3Cl2Na [M + Na]<sup>+</sup> 412.0443, found 412.0448.
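The "calcd" HRMS values quoted in these procedures can be checked by summing monoisotopic atomic masses over the ion formula. The following is a minimal sketch, not the authors' procedure: the function names are illustrative, the atomic masses are standard reference values, and the electron-mass correction for the cation (about 0.0005 u) is deliberately omitted, on the assumption that the reported values follow that convention.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes (standard reference values).
MONOISOTOPIC = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "Na": 22.9897692809,
    "Si": 27.9769265325,
    "Cl": 34.96885268,
}

def monoisotopic_mass(formula: str) -> float:
    """Sum monoisotopic masses for a flat formula such as 'C15H17O5N3Cl2Na'.

    Assumption: no parentheses or charges in the formula string, and the
    electron mass is not subtracted for the cation.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total

def ppm_error(calcd: float, found: float) -> float:
    """Mass error of the measured value relative to the calculated one, in ppm."""
    return (found - calcd) / calcd * 1e6

# [M + Na]+ ion formula reported for compound 14
calcd_14 = monoisotopic_mass("C15H17O5N3Cl2Na")
print(f"calcd m/z: {calcd_14:.4f}")                              # 412.0443
print(f"error: {ppm_error(412.0443, 412.0448):.1f} ppm")        # 1.2 ppm
```

Under these assumptions the sum reproduces the reported 412.0443, and the reported "found" value of 412.0448 corresponds to a deviation of roughly 1 ppm.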

#### 5-O-[5-Chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-D-ribofuranosyl acrylamide 1c

To a solution of compound **14** (58 mg, 0.15 mmol) and aniline derivative **8** (66 mg, 0.30 mmol) in isobutanol (3 mL) was added TFA (0.084 mL, 1.13 mmol). The mixture was heated to 100 °C and stirred for 5 h. After cooling to room temperature, the mixture was quenched with Et3N (3 mL) and concentrated in vacuo to give a residue, which was purified by silica gel column chromatography (CH2Cl2/MeOH: 30/1) to give **15** (60 mg, 70%) as a pale yellow oil: HRMS (ESI) m/z calcd for C27H36O6N6Cl [M + H]<sup>+</sup> 575.2385, found 575.2383. A solution of compound **15** (85 mg, 0.15 mmol) in TFA/acetic acid/water (1/20/4, v/v/v, 4 mL) was stirred at 50 °C for 5 h. Concentration in vacuo and elution through a reverse-phase C-18 column (H2O/MeOH: 2/3) provided **1c** (64 mg, 80%) as a pale yellow syrup. **1c** (α): <sup>1</sup>H NMR (400 MHz, CD3OD) δ 8.06 (s, 1 H), 7.88 (d, J = 8.8 Hz, 1 H), 6.63 (d-like, J = 2.4 Hz, 1 H), 6.53 (dd, J = 2.8, 8.8 Hz, 1 H), 6.34 (dd, J = 10.0, 17.2 Hz, 1 H), 6.27 (dd, J = 2.0, 17.2 Hz, 1 H), 5.83 (d, J = 4.4 Hz, 1 H, H-1), 5.70 (dd, J = 2.0, 10.0 Hz, 1 H), 4.57 (dd, J = 3.2, 12.0 Hz, 1 H), 4.40 (dd, J = 4.0, 11.6 Hz, 1 H), 4.25 (m, 3 H), 3.85 (s, 3 H), 3.16 (t, J = 4.8 Hz, 4 H), 2.61 (t, J = 4.8 Hz, 4 H), 2.34 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CD3OD) δ 168.0, 165.6, 159.5, 157.3, 151.6, 148.8, 132.1, 128.1, 123.0, 122.3, 109.3, 106.7, 101.9, 81.9 (C-1), 81.5, 73.3, 71.8, 68.2, 56.4, 55.8, 50.2, 45.6; **1c** (β): <sup>1</sup>H NMR (400 MHz, CD3OD) δ 8.05 (s, 1 H), 7.87 (d, J = 8.4 Hz, 1 H), 6.65 (d-like, J = 2.4 Hz, 1 H), 6.54 (dd, J = 2.4, 8.8 Hz, 1 H), 6.23 (m, 2 H), 5.67 (dd, J = 4.8, 6.8 Hz, 1 H), 5.48 (d, J = 4.4 Hz, 1 H, H-1), 4.58 (dd, J = 3.6, 12.0 Hz, 1 H), 4.41 (dd, J = 4.8, 12.0 Hz, 1 H), 4.24–4.14 (m, 2 H), 4.03 (t, J = 4.8 Hz, 1 H), 3.84 (s, 3 H), 3.76 (m, 2 H), 3.57 (m, 2 H), 3.23 (m, 2 H), 3.00 (m, 2 H), 2.93 (s, 3 H); HRMS (ESI) m/z calcd for C24H32O6N6Cl [M + H]<sup>+</sup> 535.2072, found 535.2087.

#### 5-O-tert-Butyldiphenylsilyl-3-O-(2,5-dichloropyrimidin-4-yl)-2-O-tert-butyldimethylsilyl-β-D-ribofuranosyl azide 21

To a solution of compound **19** (0.86 g, 1.63 mmol) in anhydrous CH2Cl2 (25 mL) at room temperature were added tBuOLi (1.83 g, 22.82 mmol) and 2,4,5-trichloropyrimidine **6** (0.37 mL, 3.26 mmol). After stirring under reflux for 36 h, the reaction mixture was diluted with saturated aqueous NH4Cl and extracted with CH2Cl2. The organic layer was washed with brine, dried over Na2SO4, and concentrated in vacuo. The residue was purified by silica gel chromatography (petroleum ether/EtOAc: 80/1) to give **21** (0.88 g, 80%) as a pale yellow syrup: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.34 (s, 1 H), 7.71–7.67 (m, 4 H), 7.45–7.34 (m, 6 H), 5.66 (t, J = 4.8 Hz, 1 H, H-3), 5.25 (d, J = 3.6 Hz, 1 H, H-1), 4.40 (dd, J = 3.6, 7.6 Hz, 1 H), 4.33 (t, J = 4.4 Hz, 1 H), 3.91 (dd, J = 4.0, 11.6 Hz, 1 H), 3.84 (dd, J = 3.6, 11.6 Hz, 1 H), 1.09 (s, 9 H), 0.75 (s, 9 H), 0.05 (s, 3 H), −0.15 (s, 3 H); HRMS (ESI) m/z calcd for C31H42O4N5Cl2Si2 [M + H]<sup>+</sup> 674.2152, found 674.2155.
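The difference between the DBU-promoted and tBuOLi-promoted couplings is easier to see when the reported millimole quantities are converted to equivalents relative to the ribosyl alcohol. The short sketch below does this arithmetic for the preparations of **7** and **21**; the helper name and the dictionary layout are illustrative choices, and only the quantities stated in the procedures are used.

```python
def equivalents(reagent_mmol: float, substrate_mmol: float) -> float:
    """Molar equivalents of a reagent relative to the limiting substrate."""
    return reagent_mmol / substrate_mmol

# mmol values as stated in the procedures for compounds 7 and 21
runs = {
    "7 (DBU, rt, 2 h)": {"alcohol": 2.47, "base": 5.94, "pyrimidine_6": 4.94},
    "21 (tBuOLi, reflux, 36 h)": {"alcohol": 1.63, "base": 22.82, "pyrimidine_6": 3.26},
}

summary = {}
for name, q in runs.items():
    base_eq = equivalents(q["base"], q["alcohol"])
    pyr_eq = equivalents(q["pyrimidine_6"], q["alcohol"])
    summary[name] = (round(base_eq, 1), round(pyr_eq, 1))
    print(f"{name}: base {base_eq:.1f} equiv, pyrimidine 6 {pyr_eq:.1f} equiv")
```

Both couplings use about 2 equivalents of 2,4,5-trichloropyrimidine, whereas the base loading rises from about 2.4 equivalents of DBU to about 14 equivalents of tBuOLi for the hindered alcohol **19**.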

#### 5-O-tert-Butyldiphenylsilyl-3-O-[5-chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-2-O-tert-butyldimethylsilyl-β-D-ribofuranosyl azide 22

To a solution of compound **21** (0.53 g, 0.79 mmol) and aniline derivative **8** (0.70 g, 3.16 mmol) in isobutanol (12 mL) was added TFA (1.47 mL, 19.75 mmol). The mixture was heated to 100 °C and stirred for 5 h. After cooling to room temperature, the mixture was quenched with Et3N (8 mL) and concentrated in vacuo to give a residue, which was purified by silica gel column chromatography (CH2Cl2/MeOH: 30/1) to give **22** (0.47 g, 69%) as a white powder: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.12 (s, 1 H), 8.08 (d, J = 9.2 Hz, 1 H), 7.69 (dd, J = 1.6, 7.6 Hz, 2 H), 7.63 (dd, J = 1.6, 8.0 Hz, 2 H), 7.43–7.25 (m, 6 H), 6.54 (m, 2 H), 5.58 (dd, J = 4.4, 7.2 Hz, 1 H, H-3), 5.29 (d, J = 1.6 Hz, 1 H, H-1), 4.47 (m, 1 H), 4.38 (dd, J = 2.0, 4.4 Hz, 1 H), 4.02 (dd, J = 2.8, 11.6 Hz, 1 H), 3.88 (s, 3 H), 3.83 (dd, J = 3.6, 12.0 Hz, 1 H), 3.17 (t, J = 5.2 Hz, 4 H), 2.61 (t, J = 4.8 Hz, 4 H), 2.37 (s, 3 H), 1.06 (s, 9 H), 0.78 (s, 9 H), −0.07 (s, 3 H), −0.25 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 163.6, 157.9, 156.9, 149.3, 147.6, 135.8, 135.7, 133.1, 133.0, 129.9, 127.9, 127.8, 121.8, 120.1, 108.4, 106.1, 100.6, 95.8, 81.6, 74.7, 74.2, 62.8, 55.8, 55.3, 50.1, 46.3, 26.9, 25.6, 19.3, 18.0, −4.9, −5.4; HRMS (ESI) m/z calcd for C43H60O5N8ClSi2 [M + H]<sup>+</sup> 859.3914, found 859.3920.

#### 5-O-tert-Butyldiphenylsilyl-3-O-[5-chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-2-O-tert-butyldimethylsilyl-α-D-ribofuranosyl acrylamide 23

A mixture of compound **22** (190 mg, 0.22 mmol) and Pd/C (50 mg, 10%) in EtOH (7 mL) was stirred under an atmosphere of H2 at room temperature overnight. The mixture was filtered through celite, washed with EtOH, and concentrated in vacuo to afford the corresponding amine, which was used in the next step without further purification. To a solution of the resulting amine in CH2Cl2 (7 mL) at room temperature were added DCC (69 mg, 0.33 mmol), DMAP (41 mg, 0.33 mmol), and acrylic acid (0.061 mL, 0.89 mmol). After stirring at room temperature for 4 h, the mixture was concentrated in vacuo to give a residue, which was purified by silica gel column chromatography (CH2Cl2/MeOH: 30/1) to afford **23** (76 mg, 39% over two steps) as a pale yellow syrup: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.17 (s, 1 H), 8.09 (d, J = 8.8 Hz, 1 H), 7.71 (m, 4 H), 7.43–7.36 (m, 6 H), 7.12 (d, J = 9.2 Hz, 1 H), 6.53 (d, J = 2.4 Hz, 1 H), 6.42 (m, 1 H), 6.36 (dd, J = 1.2, 16.8 Hz, 1 H), 6.14 (dd, J = 10.4, 17.2 Hz, 1 H), 6.02 (d-like, J = 4.8 Hz, 1 H), 5.98 (dd, J = 6.0, 9.2 Hz, 1 H), 5.70 (dd, J = 1.2, 10.4 Hz, 1 H), 4.72 (dd, J = 5.2 Hz, 1 H), 4.36 (br s, 1 H), 3.86 (m, 5 H), 3.07 (br s, 4 H), 2.54 (br s, 4 H), 2.35 (s, 3 H), 1.11 (s, 9 H), 0.68 (s, 9 H), −0.02 (s, 3 H), −0.09 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 165.9, 164.0, 157.8, 157.0, 149.3, 147.6, 135.8, 135.5, 133.4, 132.5, 131.0, 130.1, 130.0, 128.9, 128.0, 127.4, 121.5, 120.0, 108.1, 105.8, 100.4, 82.2, 80.4, 77.4, 71.0, 64.2, 55.7, 55.2, 49.9, 46.2, 27.0, 25.4, 19.5, 17.7, −5.1, −5.3; HRMS (ESI) m/z calcd for C46H64O6N6ClSi2 [M + H]<sup>+</sup> 887.4114, found 887.4116.

#### 3-O-[5-Chloro-2-N-(2-methoxy-4-(4-methylpiperazin-1-yl)phenyl)pyrimidin-4-yl]-2-O-tert-butyldimethylsilyl-α-D-ribofuranosyl acrylamide 1d

To a solution of compound **23** (91 mg, 0.11 mmol) in pyridine (3 mL) at room temperature was added HF·pyridine (0.19 mL). After stirring at room temperature overnight, the mixture was poured into saturated aqueous NaHCO3 and extracted with CH2Cl2. The combined organic layers were washed with brine, dried over Na2SO4, and concentrated in vacuo. The residue was purified by silica gel column chromatography (CH2Cl2/MeOH: 20/1) to afford **1d** (40 mg, 68%) as a pale yellow syrup: <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.15 (s, 1 H), 8.02 (d, J = 8.4 Hz, 1 H), 7.71 (m, 4 H), 7.30 (br s, 1 H), 7.06 (d, J = 8.4 Hz, 1 H), 6.54 (m, 2 H), 6.33 (d-like, J = 17.2 Hz, 1 H), 6.12 (dd, J = 10.0, 16.8 Hz, 1 H), 5.90 (dd, J = 6.0, 8.4 Hz, 1 H), 5.73 (m, 2 H), 4.51 (t, J = 5.6 Hz, 1 H), 4.30 (br s, 1 H), 3.87 (s, 3 H), 3.82 (d-like, J = 12.4 Hz, 1 H), 3.71 (d-like, J = 11.2 Hz, 1 H), 3.18 (br s, 4 H), 2.61 (br s, 4 H), 2.37 (s, 3 H), 0.72 (s, 9 H), −0.02 (s, 3 H), −0.10 (s, 3 H); <sup>13</sup>C NMR (100 MHz, CDCl3) δ 166.1, 163.8, 157.9, 157.0, 149.6, 147.5, 130.9, 127.7, 121.9, 120.5, 108.6, 106.1, 100.8, 82.1, 80.6, 76.3, 70.9, 62.5, 55.8, 55.0, 49.6, 45.7, 25.5, 17.9, −5.0, −5.2; HRMS (ESI) m/z calcd for C30H46O6N6ClSi [M + H]<sup>+</sup> 649.2937, found 649.2940.

#### Kinase Assay

Kinase domains of EGFR WT and EGFR L858R/T790M were expressed using the Bac-to-Bac™ baculovirus expression system (Invitrogen, Carlsbad, CA, USA) and purified on Ni-NTA columns (QIAGEN Inc., Valencia, CA, USA). Kinase activity was evaluated by enzyme-linked immunosorbent assay (ELISA). Briefly, 96-well ELISA plates were precoated with 20 µg/mL Poly (Glu, Tyr) 4:1 (Sigma, St. Louis, MO) as substrate. After addition of 50 µL of 10 µmol/L ATP solution diluted in kinase reaction buffer (50 mM HEPES pH 7.4, 20 mM MgCl2, 0.1 mM MnCl2, 0.2 mM Na3VO4, 1 mM DTT), each well was treated with 1 µL of compound at the indicated concentration (dissolved in DMSO). Each concentration was tested in duplicate. The reaction was initiated by adding tyrosine kinase diluted in kinase reaction buffer. After incubation at 37 °C for 1 h, the wells were washed three times with phosphate-buffered saline (PBS) containing 0.1% Tween 20 (T-PBS). Then 100 µL of anti-phosphotyrosine (PY99) antibody (1:1,000, Santa Cruz Biotechnology, Santa Cruz, CA) diluted in T-PBS containing 5 mg/mL BSA was added and the plate was incubated at 37 °C for 30 min. After the plate was washed three times, 100 µL of horseradish peroxidase-conjugated goat anti-mouse IgG (1:2,000, Calbiochem, San Diego, CA) was added and the plate was incubated at 37 °C for 30 min. The plate was washed, and 100 µL of citrate buffer (0.1 M, pH 5.5) containing 0.03% H2O2 was added. Then 2 mg/mL o-phenylenediamine was added, and samples were incubated at room temperature until color emerged. The reaction was terminated immediately by adding 50 µL of 2 M H2SO4. The plate was read at 492 nm using a multiwell spectrophotometer (VERSAmax™, Molecular Devices, Sunnyvale, CA, USA). The inhibitory rate (%) was calculated with the formula: [1 − (A<sub>492</sub> treated / A<sub>492</sub> control)] × 100%. IC<sub>50</sub> values were calculated from the inhibitory curves.
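The inhibitory-rate formula above, together with one simple way of reading an IC50 value off an inhibition curve, can be sketched as follows. The text does not state how the curves were fitted, so the log-linear interpolation used here is an assumption standing in for whatever fitting procedure was actually applied; the absorbance readings and the nanomolar concentrations are hypothetical and are not data from the study.

```python
import math

def inhibition_percent(a492_treated: float, a492_control: float) -> float:
    """Inhibitory rate (%) = [1 - (A492 treated / A492 control)] x 100."""
    return (1.0 - a492_treated / a492_control) * 100.0

def ic50_by_log_interpolation(concs, inhibitions):
    """Estimate IC50 by linear interpolation on log10(concentration)
    between the two tested concentrations that bracket 50% inhibition.
    Returns None when the data never cross 50%.
    """
    points = sorted(zip(concs, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            frac = (50.0 - i_lo) / (i_hi - i_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_c
    return None

# Hypothetical duplicate A492 readings keyed by concentration (nM); illustrative only.
control_a492 = 1.00
duplicate_readings = {
    1.0: (0.91, 0.89),
    10.0: (0.71, 0.69),
    100.0: (0.31, 0.29),
    1000.0: (0.11, 0.09),
}

concs = sorted(duplicate_readings)
inhibitions = [
    inhibition_percent(sum(duplicate_readings[c]) / 2, control_a492) for c in concs
]
ic50 = ic50_by_log_interpolation(concs, inhibitions)
print([round(i, 1) for i in inhibitions])
print(f"IC50 ~ {ic50:.1f} nM")
```

Duplicate wells are averaged before the inhibitory rate is computed, mirroring the duplicate measurements described in the assay. A four-parameter logistic fit would be the more usual choice when enough concentrations are available.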

#### Molecular Docking

The EGFR T790M/L858R structure (PDB: 3IKA) was retrieved from the Protein Data Bank, and covalent docking was performed with Maestro (Schrödinger, Inc., version 10.2). Compound **1a** was docked into the EGFR protein as an irreversible inhibitor using the Covalent Docking module. The docking procedure was validated by re-docking the co-crystallized ligand WZ4002 into the ATP binding site of the EGFR T790M/L858R structure. The details of the docking workflow are listed below:


# RESULTS AND DISCUSSION

The synthesis of 3-N-acryloyl-5-O-anilinopyrimidine ribose derivatives **1a** and **1b** commenced with 3-amino ribose derivative **3**, which can be readily prepared from D-xylose **2** in 41% yield over six steps (**Scheme 1**; Shie et al., 2007). Protection of the primary hydroxyl group in **3** with TBDPSCl followed by condensation with acryloyl chloride using Et3N as base gave ribose derivative **4** in 92% yield (two steps). Treatment of **4** with HF·pyridine afforded alcohol **5** in 71% yield. When **5** was reacted with 2,4,5-trichloropyrimidine **6** in the presence of K2CO3 or DIPEA, almost no desired product was observed. Gratifyingly, nucleophilic substitution of **6** by **5** with the stronger base DBU as promoter proceeded smoothly to provide ribose derivative **7** in excellent yield (92%). Reaction of **7** with the known aniline derivative **8** (Han et al., 2014), promoted by TFA in isobutanol at 100 °C, delivered **1a** in 69% yield. Removal of the 1,2-O-isopropylidene group in **1a** with TFA in acetic acid and water at 70 °C produced **1b** in 83% yield.
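The step yields quoted for this route can be multiplied to estimate the overall yield from amine **3**. The sketch below performs that arithmetic using only the percentages stated above; the overall figures it prints are derived values, not numbers reported in the article, and the step labels are shorthand.

```python
from math import prod

# Step yields for the route from 3 to 1a and 1b, as fractions of 1.
scheme1_steps = [
    ("3 -> 4 (TBDPS protection, then acryloylation; two steps)", 0.92),
    ("4 -> 5 (HF-pyridine desilylation)", 0.71),
    ("5 -> 7 (DBU, 2,4,5-trichloropyrimidine)", 0.92),
    ("7 -> 1a (aniline 8, TFA, isobutanol)", 0.69),
    ("1a -> 1b (isopropylidene removal)", 0.83),
]

def overall_yield(step_fractions) -> float:
    """Overall yield of a linear sequence is the product of its step yields."""
    return prod(step_fractions)

yield_to_1a = overall_yield(y for _, y in scheme1_steps[:4])
yield_to_1b = overall_yield(y for _, y in scheme1_steps)
print(f"3 -> 1a: {yield_to_1a:.1%}")
print(f"3 -> 1b: {yield_to_1b:.1%}")
```

On these numbers the sequence delivers **1a** in roughly 41% and **1b** in roughly 34% overall yield from **3**, with the desilylation and the aniline coupling as the lowest-yielding steps.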

For the synthesis of 1-N-acryloyl-5-O-anilinopyrimidine ribose derivative **1c**, β-ribosyl azide **10** was conveniently prepared from D-ribose **9** in three steps and 53% yield according to literature procedures (**Scheme 2**; Bonache et al., 2009). The acetyl group in **10** was then replaced with a TBS group to give compound **11** in 85% yield. Hydrogenolysis of the azide group in **11** over Pd/C followed by condensation with acrylic acid in the presence of DCC and DMAP afforded ribose derivative **12** in 43% yield as a mixture of anomers (α/β = 1:1; α-anomer: δ<sub>H</sub> = 6.00 ppm, δ<sub>C</sub> = 81.4 ppm; β-anomer: δ<sub>H</sub> = 5.95 ppm, δ<sub>C</sub> = 87.2 ppm), which were easily separated by silica gel column chromatography (Numao et al., 1981; Bonache et al., 2009). After removal of the TBS group in **12**α, the resulting alcohol **13** reacted with **6** in the presence of DBU to produce ribose derivative **14** in an excellent 93% yield. TFA-promoted reaction of **14** with aniline **8** led to **15** in 70% yield as a mixture of α/β anomers, probably arising from anomerization of the 1-N-acryloyl ribose derivative under strongly acidic conditions (Boschelli et al., 1989). Finally, acidic cleavage of the isopropylidene group of **15** in a mixture of TFA/HOAc/water provided **1c** in 80% yield.

Synthetic work toward 1-N-acryloyl-3-O-anilinopyrimidine ribose derivative **1d** started with replacement of the 2,3-isopropylidene group of azide **10** with a 2,3-orthoester group by treatment with TFA and subsequent protection with triethyl orthoacetate under TsOH·H<sub>2</sub>O catalysis, affording azide **16** in 76% yield over two steps (**Scheme 3**). Substitution of the acetyl group in **16** with a TBDPS group and subsequent acidic cleavage of the orthoester group led to an inseparable mixture of 2-acetyl and 3-acetyl ribose derivatives **17** and **18** (77% yield over three steps). Treatment of the mixture of **17** and **18** with TBSCl followed by removal of the acetyl groups gave alcohols **19** and **20** in 83% yield, allowing separation of the 3-hydroxyl ribose derivative **19** from the 2-hydroxyl ribose derivative **20** (**19**:**20** = 3:2). Nucleophilic attack of **19** on **6** required more strongly basic conditions because of the steric hindrance of the silyl groups on **19**. Accordingly, excess tBuOLi in dichloromethane under reflux was employed for this conversion, providing ribose derivative **21** in 80% yield. TFA-promoted coupling of **21** with aniline **8** generated ribose derivative **22** (69%), which was then subjected to hydrogenolysis over Pd/C and subsequent condensation with acrylic acid to afford ribose derivative **23** as a single anomer in moderate yield (39% over two steps). Exposure of **23** to HF·pyridine in pyridine resulted in cleavage of the TBDPS group without affecting the TBS group, providing **1d** in 68% yield.

To determine whether the Michael acceptor plays a significant role in the inhibitory activity of ribose-modified pyrimidine derivatives against EGFR tyrosine kinase, 1-azide-5-O-anilinopyrimidine ribose derivative **24**, 1-azide-3-O-anilinopyrimidine ribose derivative **25**, and 5-O-anilinopyrimidine ribose derivative **26** were readily synthesized following procedures similar to those described for **1a**–**1d** (**Table 1**; see Supplementary Material for details).

As shown in **Table 1**, compounds **1a** and **1b**, which contain a 3-N-acryloyl-5-O-anilinopyrimidine ribosyl moiety, inhibited the EGFR L858R/T790M mutant with IC<sub>50</sub> values of 0.62 and 2.64 µM, respectively, revealing selective inhibitory activity for EGFR L858R/T790M over WT EGFR, although they are considerably less potent than the positive controls osimertinib (IC<sub>50</sub> = 1.5 nM for EGFR L858R/T790M) and afatinib (IC<sub>50</sub> = 3.7 nM for EGFR L858R/T790M). In contrast, the other compounds (**1c**, **1d**, and **24**–**26**), which bear a 5-O-anilinopyrimidine ribosyl moiety, a 1-N-acryloyl-5-O-anilinopyrimidine ribosyl moiety, a 1-N-acryloyl-3-O-anilinopyrimidine ribosyl moiety, or their 1-azide counterparts, showed no inhibitory activity against EGFR tyrosine kinases.
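To put the potency gap to the reference inhibitors in numbers, the IC<sub>50</sub> values quoted above can be converted to a common unit and compared as ratios. The snippet below is only an illustrative sketch of this arithmetic; the helper name `to_nanomolar` and the unit labels are assumptions made for the example.

```python
def to_nanomolar(value, unit):
    """Convert a concentration to nM; only the two units used in the text are handled."""
    factors = {"nM": 1.0, "uM": 1_000.0}
    return value * factors[unit]

# IC50 values against EGFR L858R/T790M as quoted in the text
ic50_1a = to_nanomolar(0.62, "uM")
ic50_osimertinib = to_nanomolar(1.5, "nM")
ic50_afatinib = to_nanomolar(3.7, "nM")

# How many times weaker compound 1a is than each reference inhibitor
fold_vs_osimertinib = ic50_1a / ic50_osimertinib
fold_vs_afatinib = ic50_1a / ic50_afatinib

print(f"1a vs osimertinib: {fold_vs_osimertinib:.0f}-fold weaker")
print(f"1a vs afatinib:    {fold_vs_afatinib:.0f}-fold weaker")
```

On this basis, **1a** is a few hundred-fold less potent than either control in this assay.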

To better understand how this type of compound binds to EGFR T790M, molecular docking was used to predict the binding mode of the representative compound **1a**. The docking procedure was validated in advance by re-docking the co-crystallized ligand WZ4002 (Zhou et al., 2009) into the ATP binding site of the EGFR L858R/T790M structure (PDB ID: 3IKA). The root mean square deviation (RMSD) between the crystallographic and docked conformations of WZ4002 is 0.57 Å (Figure S1), indicating that the docking procedure reproduces the binding conformation accurately. As expected from the co-crystal structure of the anilinopyrimidine-derived inhibitor WZ4002, the anilinopyrimidine core of compound **1a** forms a bidentate hydrogen bonding interaction with the "hinge" residue Met793 (**Figure 3**). The chlorine substituent on the pyrimidine ring could form a hydrophobic contact with the mutant gatekeeper residue, Met790. The aniline ring is oriented to form hydrophobic interactions with Leu792 and Pro794 in the hinge region. Moreover, the acrylamide group attached to the sugar ring of compound **1a** could form a covalent bond with Cys797 to achieve irreversible binding. The sugar ring acts as a linker that tunes the orientation of the electrophilic acrylamide moiety so that it can covalently alkylate the conserved cysteine residue Cys797. The 1,2-O-isopropylidene moiety in compound **1a** could form favorable vdW interactions with residues Arg841, Asn842, and Thr854. Therefore, compound **1b**, which lacks that protecting group on the sugar ring, displayed less potent bioactivity against EGFR T790M/L858R than compound **1a**. Lacking the Michael acceptor, compounds **24**–**26** are unable to form a covalent bond with Cys797 and thus displayed sharply decreased inhibitory activity against EGFR T790M/L858R. Compound **1c** displayed no inhibitory activity against EGFR, probably because of the long distance between the Michael acceptor and Cys797. Although compound **1d** also has an acrylamide group attached to the sugar ring, it showed no inhibitory activity, probably owing to the conformational alteration of compound **1d** caused by the TBS protecting group. In brief, the distance between the Michael acceptor and the pyrimidine scaffold has a significant effect on the inhibitory potency of this type of compound. Employing the ribosyl moiety as a chiral building block for modulating the distance between the Michael acceptor and the pyrimidine scaffold could open a new avenue for the future design of EGFR inhibitors against EGFR mutants.

*<sup>a</sup>Kinase activity assays were performed using the ELISA-based EGFR-TK assay. Data are averages of at least two independent determinations and reported as the mean* ± *SD (standard deviation). <sup>b</sup>Reported data.*

Frontiers in Chemistry | www.frontiersin.org

# CONCLUSION

In summary, we have described a DBU- or tBuOLi-promoted coupling of ribosyl alcohols with 2,4,5-trichloropyrimidine as the key step in the synthesis of a series of ribose-modified anilinopyrimidine derivatives as EGFR TKIs. Preliminary biological evaluation indicated that compound **1a** displayed potent inhibitory activity against EGFR L858R/T790M, with an IC<sub>50</sub> value of 0.62 µM, and good selectivity for EGFR L858R/T790M over WT EGFR. Molecular docking studies suggested that the inhibitory activities of this type of compound are largely influenced by the distance between the Michael acceptor and the pyrimidine scaffold. As a novel type of EGFR inhibitor, the ribose-modified anilinopyrimidine derivative **1a** might serve as a promising lead compound for further development of selective EGFR inhibitors to overcome the EGFR L858R/T790M resistance mutation.

# AUTHOR CONTRIBUTIONS

YY and YX designed and guided this study. XH conducted the chemical synthesis. YT, LT, LZ, and HX performed the kinase activity assays. DW, XW, and SL performed the molecular docking studies. XH, YY, and YX analyzed the data and wrote the manuscript with input from all authors.

# ACKNOWLEDGMENTS

Financial support from the National Thousand Young Talents Program (YC0130518, YC0140103), the Shanghai Committee of Science and Technology (14431902100), and the Shanghai Pujiang Program (15PJ1401500) is gratefully acknowledged.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2017.00101/full#supplementary-material

### REFERENCES

Tetrahedron Lett. 30, 1599–1600. doi: 10.1016/S0040-4039(00)99530-3


resistance to EGFR inhibitors in lung cancer. Cancer Discov. 4, 1046–1061. doi: 10.1158/2159-8290.CD-14-0337


cancer institute of Canada clinical trials group. J. Clin. Oncol. 25, 1960–1966. doi: 10.1200/JCO.2006.07.9525


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Hu, Wang, Tong, Tong, Wang, Zhu, Xie, Li, Yang and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pharmacophore Modeling and in Silico/in Vitro Screening for Human Cytochrome P450 11B1 and Cytochrome P450 11B2 Inhibitors

Muhammad Akram<sup>1</sup>, Watcharee Waratchareeyakul<sup>2</sup>, Joerg Haupenthal<sup>3</sup>, Rolf W. Hartmann<sup>3,4</sup> and Daniela Schuster<sup>1</sup>\*

<sup>1</sup> Institute of Pharmacy - Pharmaceutical Chemistry and Center for Molecular Biosciences Innsbruck (CMBI), University of Innsbruck, Innsbruck, Austria, <sup>2</sup> Department of Chemistry, Faculty of Science and Technology, Rambhai Barni Rajabhat University, Chanthaburi, Thailand, <sup>3</sup> Department of Drug Design and Optimization, Helmholtz Institute for Pharmaceutical Research Saarland, Saarbrücken, Germany, <sup>4</sup> Department of Pharmacy, Pharmaceutical and Medicinal Chemistry, Saarland University, Saarbrücken, Germany

#### Edited by:

Marc Poirot, Institut National de la Santé et de la Recherche Médicale, France

#### Reviewed by:

Katarina Nikolic, University of Belgrade, Serbia Marco Tutone, Università degli Studi di Palermo, Italy

\*Correspondence: Daniela Schuster, daniela.schuster@uibk.ac.at

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 22 September 2017 Accepted: 03 November 2017 Published: 19 December 2017

#### Citation:

Akram M, Waratchareeyakul W, Haupenthal J, Hartmann RW and Schuster D (2017) Pharmacophore Modeling and in Silico/in Vitro Screening for Human Cytochrome P450 11B1 and Cytochrome P450 11B2 Inhibitors. Front. Chem. 5:104. doi: 10.3389/fchem.2017.00104

Cortisol synthase (CYP11B1) is the main enzyme for the endogenous synthesis of cortisol, and its inhibition is a potential way to treat diseases associated with increased cortisol levels, such as Cushing's syndrome, metabolic diseases, and delayed wound healing. Aldosterone synthase (CYP11B2) is the key enzyme for aldosterone biosynthesis, and its inhibition is a promising approach for the treatment of congestive heart failure, cardiac fibrosis, and certain forms of hypertension. CYP11B1 and CYP11B2 are structurally very similar and are both expressed in the adrenal cortex. To facilitate the identification of novel inhibitors of these enzymes, ligand-based pharmacophore models of CYP11B1 and CYP11B2 inhibition were developed. A virtual screening of the SPECS database was performed with our pharmacophore queries. Biological evaluation of the selected hits led to the discovery of three potent novel inhibitors of both CYP11B1 and CYP11B2 in the submicromolar range (compounds 8–10), one selective CYP11B1 inhibitor (compound 11, IC<sub>50</sub> = 2.5 µM), and one selective CYP11B2 inhibitor (compound 12, IC<sub>50</sub> = 1.1 µM). The overall success rate of this prospective virtual screening experiment is 20.8%, indicating good predictive power of the pharmacophore models.

Keywords: Cushing's syndrome, wound healing, hypertension, congestive heart failure, myocardial fibrosis, pharmacophore modeling, model validation, virtual screening

# INTRODUCTION

Cortisol is a glucocorticoid hormone that modulates many processes in the body, such as blood sugar levels, immune system activity, metabolism of proteins, carbohydrates and fats, and bone formation (Cain and Cidlowski, 2017). Hypercortisolism is an unwanted increase in the secretion of cortisol and is the cause of many diseases such as Cushing's syndrome, metabolic disorders, and suppression of the immune system leading to delayed wound healing (Zhu et al., 2016). Cushing's syndrome is a condition with symptoms such as obesity, facial plethora, round face, decreased libido, thin skin, easy bruising, impaired growth in children, menstrual irregularities, hypertension, hirsutism, depression, glucose intolerance, weakness, osteopenia, and nephrolithiasis in more than 50% of clinically observed patients (Newell-Price et al., 1998; Savage et al., 2001; Faggiano et al., 2003; Pecori Giraldi et al., 2003). A tumor of the pituitary or adrenal gland is the main reason for the over-secretion of cortisol. In most cases, surgical removal or radiation therapy of the tumor is not applied, and the patients are instead treated with drugs (Tritos et al., 2011). The use of glucocorticoid receptor antagonists for treating this condition often comes with an increased secretion of cortisol, potentially due to the pituitary feedback mechanism (Orth, 1978). An alternative treatment could be the reduction of cortisol formation by inhibiting cytochrome P450 11B1 (CYP11B1), which catalyzes the final step in the formation of cortisol by hydroxylating 11-deoxycortisol in the zona fasciculata of the adrenal cortex (**Figure 1**) (Sayers, 1950). This mechanism of action is expected not to cause the adverse effects observed for glucocorticoid receptor antagonists (Nieman, 2002).

Aldosterone is a potent mineralocorticoid hormone, which regulates blood pressure by increasing the reabsorption of sodium at the distal convoluted tubule in the kidney. Under normal conditions, aldosterone secretion is controlled by the renin-angiotensin-aldosterone-system (RAAS). In case of insufficient renal flow, excessive aldosterone is released by the activation of the RAAS pathway (Young and Funder, 2000). The increase in aldosterone levels causes an increase in blood volume that elevates blood pressure. An unwanted increase in plasma aldosterone levels results in various pathological conditions like hyperaldosteronism, congestive heart failure, myocardial fibrosis, cardiac hypertrophy, ventricular arrhythmia, and other adverse effects through triggering cardiac fibroblasts (Ramires et al., 1998; Brilla, 2000; Lijnen and Petrov, 2000; Briet and Schiffrin, 2010). CYP11B2 catalyzes the rate-limiting step in the formation of aldosterone from corticosterone in the zona glomerulosa of the adrenal cortex (Sayers, 1950; Lifton et al., 2001). The antimineralocorticoid spironolactone is used to treat hypertension and heart failure (Pitt et al., 1999). However, this therapy is accompanied by severe antiandrogenic adverse effects (Soberman and Weber, 2000). An alternative approach for the management of congestive heart failure and hypertension would be the inhibition of CYP11B2, probably leading to fewer adverse effects (Azizi et al., 2013).

Both CYP11B1 and CYP11B2 are mitochondrial enzymes and belong to the cytochrome P450 family. They use NADPH as a cofactor (Guengerich, 2007). After import into the mitochondrial matrix, the length of each enzyme is reduced to 479 amino acids, of which 450 (93%) are identical in the two enzymes (Belkina et al., 2001). The molecular mass of CYP11B1 is 50 kDa and that of CYP11B2 is 48.5 kDa (Ogishima et al., 1991). Although their primary sequences are highly similar, the two enzymes have different functions (Belkina et al., 2001).

Several potent inhibitors of CYP11B1 and CYP11B2 have been reported (**Figure 2**). Some of these compounds were discovered using rational SAR studies and molecular modeling approaches. In 2006, Ulmschneider et al. developed a ligand-based pharmacophore model for CYP11B2 inhibitors by superimposing previously synthesized active and inactive ligands for CYP11B2 from their research group (Ulmschneider et al., 2006). Their pharmacophore consisted of four points: three ring centroids and an aromatic nitrogen. The model had a steric inclusion area that mapped the active compounds and a steric exclusion area that was derived from inactive compounds. They validated their pharmacophore model by designing and synthesizing acenaphthalene-based inhibitors of CYP11B2, followed by in vitro testing. In another study, Lucas et al. (2008a) designed and synthesized potential lead compounds for CYP11B2 inhibition with the help of a ligand-based pharmacophore model containing hydrophobic and hydrogen bond acceptor features. After biological testing, the compounds were docked into a homology model of CYP11B2 (Lucas et al., 2008a). In 2011, the same group refined their previous ligand-based pharmacophore hypothesis based on diverse inhibitors. They added two hydrophobic features to their previous pharmacophore. Their final pharmacophore had four essential features, seven optional features, and five exclusion spheres. The refined pharmacophore was validated by synthesizing and testing predicted CYP11B2 inhibitors based on the tetrahydropyrroloquinolinone scaffold, which led to potent compounds (Lucas et al., 2011). In addition, Gobbi et al. designed and synthesized several xanthone-based inhibitors of CYP11B1 and CYP11B2 based on the pharmacophore models by Lucas et al. (Lucas et al., 2011; Gobbi et al., 2013). The rationally designed inhibitors of CYP11B1 and CYP11B2 had a hydrophobic part in addition to the imidazolylmethyl ring, which was assumed to form a complex with the heme iron of the CYP11B1 and CYP11B2 enzymes. This complexation is believed to play an important role in the inhibition of CYP11B1 and CYP11B2 (Gobbi et al., 2013).

**Abbreviations:** kDa, kilodalton; CYP11B1, cytochrome P450 11B1; CYP11B2, cytochrome P450 11B2; ACTH, adrenocorticotrophic hormone; RAAS, renin-angiotensin-aldosterone-system; XVOL, exclusion volume; PAINS, pan-assay interference compounds; PDB, protein data bank; RMSD, root mean square deviation; Å, angstrom; Thr, threonine; Phe, phenylalanine; Ile, isoleucine; Trp, tryptophan; Met, methionine; Ala, alanine; Arg, arginine; IC50, half maximal inhibitory concentration; 2D, two dimensional; CAS, chemical abstract service; VS, virtual screening; EtOH, ethyl alcohol; DMSO, dimethyl sulfoxide; min, minutes; rpm, revolutions per minute; HPLC, high performance liquid chromatography; HCl, hydrochloric acid; E. coli, Escherichia coli; NADPH, nicotinamide adenine dinucleotide phosphate.

second electron from NADPH reductase to the heme iron resulting in a peroxo-iron intermediate; (4) transfer of a proton producing its protonated form; (5) attachment of another proton to the intermediate and release of a water molecule producing a perferryl oxygen complex that immediately forms a free radical; (6) and (7) oxidation of 11-deoxycortisol to cortisol.

All the above mentioned pharmacophore models have been successfully used to optimize already known active compound classes. However, none of them has been used to prospectively screen large, chemically diverse 3D molecular databases and identify novel active scaffolds. Our goal was therefore to create and validate an in silico model for future virtual screening (VS) experiments to find diverse inhibitors of either CYP11B1 or CYP11B2 or both, which could be used as pharmacological tool compounds. For this purpose, ligand-based pharmacophore queries of CYP11B1 and CYP11B2 inhibitors were generated. This method was chosen because of its frequently higher retrieval of active hits compared to docking (Chen et al., 2009) and because ligand-based models can often be better trained to recognize structurally diverse compounds binding to the same target compared to structure-based models (Schuster et al., 2010).

# WORKFLOW

#### Datasets

#### Modeling Dataset

Data sets for model development were collected from the scientific literature (Table S1) (Dorr et al., 1984; Ulmschneider et al., 2005a,b, 2006; Voets et al., 2005, 2006; Heim et al., 2008; Lucas et al., 2008a,b, 2011; Adams et al., 2010; Roumen et al., 2010; Hille et al., 2011a,b; Stefanachi et al., 2011; Zimmer et al., 2011; Hu et al., 2012; Yin et al., 2012, 2013; Blass, 2013a,b; Emmerich et al., 2013; Ferlin et al., 2013; Gobbi et al., 2013; Meredith et al., 2013; Pinto-Bazurco Mendieta et al., 2013). For the training set, it is very important to select highly active compounds, because VS commonly renders hits that are less active than the training compounds (Scior et al., 2012). For inactive compounds of the test set, a very high activity cut-off value must be chosen so that it is justified to refine the model according to the inactives. Therefore, the activity cut-off for active compounds of the test set was an IC<sub>50</sub> of less than 2 µM, and for inactive compounds it was an IC<sub>50</sub> of more than 100 µM. Finally, a test set of 386 active compounds (Dorr et al., 1984; Ulmschneider et al., 2005a,b, 2006; Voets et al., 2005, 2006; Heim et al., 2008; Lucas et al., 2008a,b, 2011; Adams et al., 2010; Roumen et al., 2010; Hille et al., 2011a,b; Stefanachi et al., 2011; Zimmer et al., 2011; Hu et al., 2012; Yin et al., 2012, 2013; Blass, 2013a,b; Emmerich et al., 2013; Ferlin et al., 2013; Gobbi et al., 2013; Meredith et al., 2013; Pinto-Bazurco Mendieta et al., 2013) was collected for the theoretical validation of the models. This data set contained compounds with IC<sub>50</sub> values from 0.1 nM to 2 µM. Since no compound with an IC<sub>50</sub> > 100 µM was found in the literature, a decoy database representing the test set of putatively inactive compounds was assembled for theoretical validation purposes. Using the platform DecoyFinder (Cereto-Massagué et al., 2012), which extracts decoys from the ZINC database (Irwin and Shoichet, 2005), 36 decoys per compound were generated based on the active compounds in the dataset. After removing duplicates, 15948 decoys remained in the database. The 2D structures of all active compounds were constructed in ChemBioDraw Ultra 14.0 (Cambridgesoft, 1986–2015). For conformational analysis, LigandScout 3.12 (Wolber and Langer, 2005) generated up to 500 conformers for each compound in the dataset with OMEGA-BEST (Hawkins et al., 2010; Hawkins and Nicholls, 2012) settings.
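The activity cut-offs described above amount to a simple labeling rule. The sketch below illustrates it under stated assumptions: the compound names and IC<sub>50</sub> values are hypothetical, and the "excluded" label for compounds falling between the two cut-offs is an assumption of this example, since the text does not say how such compounds were handled.

```python
ACTIVE_MAX_UM = 2.0       # actives: IC50 below 2 µM (cut-off from the text)
INACTIVE_MIN_UM = 100.0   # inactives: IC50 above 100 µM (cut-off from the text)

def label_by_ic50(ic50_um):
    """Assign a test-set label from an IC50 value given in µM."""
    if ic50_um < ACTIVE_MAX_UM:
        return "active"
    if ic50_um > INACTIVE_MIN_UM:
        return "inactive"
    # Assumption of this sketch: intermediate compounds are left out
    return "excluded"

# Hypothetical compounds with IC50 values in µM
examples = {"cmpd_A": 0.0001, "cmpd_B": 1.5, "cmpd_C": 25.0, "cmpd_D": 250.0}
labels = {name: label_by_ic50(ic50) for name, ic50 in examples.items()}

for name, label in labels.items():
    print(name, label)
```

Leaving a wide gap between the two cut-offs keeps weakly active compounds from being treated as true inactives when a model is refined.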

# Pharmacophore Modeling

The espresso function of LigandScout was used to create ligand-based pharmacophores (Krautscheid et al., 2014). This workflow first assigns pharmacophore features to all of the conformations of the training compounds. Then, the features of the two most rigid training compounds are aligned to create intermediate common feature pharmacophore models. These intermediate models are ranked according to a selected scoring function. In this study, the default scoring function, "pharmacophore fit and atom overlap," was used. The generated pharmacophore models usually profit from manual refinement to optimize their sensitivity (Equation 1) and specificity (Equation 2) (Vuorinen et al., 2014). The sensitivity of a model can be improved by removing spatial restrictions, deleting features or marking them as optional, and adjusting the size of the features depending on the geometrical mapping of active compounds (Vuorinen et al., 2014).

$$\text{Sensitivity} = \frac{\text{actives found by model}}{\text{all actives in dataset}} \tag{1}$$

$$\text{Specificity} = \frac{\text{inactives not found by model}}{\text{all inactives in dataset}} \tag{2}$$

#### Prospective Virtual Screening

For prospective model validation, the commercial SPECS compound database was searched. The sd file containing 207976 compounds was downloaded from the SPECS webpage (www.specs.net, April\_2015). The conformational analysis was performed with the same program and settings as the modeling databases. VS of the SPECS database was performed using the default settings of LigandScout 3.12.

### PAINS Filtering

Pan-assay interfering substances (PAINS) appear as frequent hitters in many biological screening assays and are discussed as possible false positive hits in VS experiments for various reasons (Baell and Holloway, 2010). Therefore, PAINS filters were applied to the virtual hits obtained by the pharmacophore models. For this purpose, the sd files were submitted to the online server FAF-Drugs3 (Lagorce et al., 2015).

### Hit Selection

To select diverse virtual hits for biological evaluation, a total of 50 chemical clusters were generated from the hits obtained by model 1 using the cluster ligands protocol implemented in Discovery Studio 4.0 (Accelrys, 2015). For this purpose, we used the default predefined fingerprint set, the Feature-Connectivity Fingerprint FCFP\_6. With FCFP, clusters are generated on the basis of pharmacophoric features instead of functional groups; the six indicates the effective diameter of the largest feature, which equals twice the number of iterations performed (Rogers and Hahn, 2010). For further processing, the top two hits from each cluster were selected based on their pharmacophore fit value.
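The selection rule described above (the top two hits per cluster, ranked by pharmacophore fit value) can be sketched as follows. This is a minimal illustration and not the Discovery Studio protocol: the cluster assignments are taken as given, and the compound identifiers, cluster labels, and fit values are hypothetical.

```python
from collections import defaultdict

def top_hits_per_cluster(hits, per_cluster=2):
    """Return the best-scoring compound IDs for each cluster.

    hits: iterable of (compound_id, cluster_id, fit_value) tuples.
    """
    by_cluster = defaultdict(list)
    for compound_id, cluster_id, fit_value in hits:
        by_cluster[cluster_id].append((fit_value, compound_id))

    selected = {}
    for cluster_id, members in by_cluster.items():
        # Highest pharmacophore fit value first
        ranked = sorted(members, reverse=True)
        selected[cluster_id] = [cid for _, cid in ranked[:per_cluster]]
    return selected

# Hypothetical virtual hits: (compound ID, cluster, pharmacophore fit value)
hits = [
    ("hit_01", 1, 86.2), ("hit_02", 1, 84.9), ("hit_03", 1, 80.1),
    ("hit_04", 2, 79.5),
    ("hit_05", 3, 83.0), ("hit_06", 3, 85.4),
]
selection = top_hits_per_cluster(hits)
print(selection)
```

Picking a fixed number of hits per cluster trades some total fit score for chemical diversity, which matches the stated aim of the hit selection.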

# Biological Testing

#### Preparation of Inhibitor Solution

The selected potential inhibitors were dissolved in DMSO at a concentration of 10 mM to generate stock solutions. Various aliquots were then made from fresh stock solutions and each aliquot was tested only once. All the selected inhibitors were diluted with 100% ethanol (negative control) to the desired concentration to observe their inhibition of CYP11B1 and CYP11B2.

#### CYP11B1 and CYP11B2 Inhibition Assays

The selected hits were evaluated for their inhibition of human CYP11B1 and CYP11B2 enzymes expressed in hamster V79MZh cells. Approximately 8000000 V79MZh cells were cultured in 24-well cell culture plates for 24 h. The area of each well was 1.9 cm<sup>2</sup>. The cells were exposed to various concentrations of inhibitor solutions. The reactions were started by incubating the cells with [<sup>3</sup>H]11-deoxycorticosterone. The incubation time was 15–60 min for CYP11B1 cells and 50–120 min for CYP11B2 cells. The reactions were stopped by extracting the supernatant with cold ethyl acetate at 4 °C. Samples were mixed (10 min) and centrifuged (12,500 rpm). The organic (upper) layer was transferred into fresh Eppendorf tubes and dried. The steroids were re-dissolved in methanol-water (65:35) and analyzed by radio-HPLC (Denner et al., 1995a; Ehmer et al., 2002). Ketoconazole (Hille et al., 2011b) (CYP11B1 IC<sub>50</sub> = 120 nM, CYP11B2 IC<sub>50</sub> = 60 nM) was used as positive control and ethanol was used as negative control.

#### CYP17 Inhibition Assay

The inhibition of CYP17 was investigated using the 5,000 g sediment of homogenized Escherichia coli (Ehmer et al., 2000). Human CYP17 along with NADPH-P450 reductase was used to perform the assay as described previously. The incubation time for the reaction was 30 min at 37 °C. The reaction was started by adding [<sup>3</sup>H]-progesterone and was quenched with 1 M HCl. The reaction mixture was extracted twice with ethyl acetate at 4 °C in order to avoid impurities. The samples were dried, prepared with methanol, and analyzed with radio-HPLC. DMSO was used as negative control. Abiraterone (IC<sub>50</sub> = 100 nM) and ketoconazole (IC<sub>50</sub> = 4 µM) were used as reference inhibitors (Sergejew and Hartmann, 1994).

# Docking

The 2D structures were prepared for docking in ChemBioDraw Ultra 14.0 (Cambridgesoft, 1986–2015). The ChemBioDraw files were converted to structure data (sd) format using a protocol designed in Pipeline Pilot Client 2016 (Accelrys, 2011). The 3D starting conformation of each chemical structure was generated using OMEGA 2.3.2 from OpenEye (Hawkins et al., 2010; Hawkins and Nicholls, 2012). The X-ray crystal structure of CYP11B2 in complex with fadrozole (PDB entry 4FDH) (Strushkevich et al., 2013) was used for docking employing a genetic algorithm implemented in GOLD 5.2 (Jones et al., 1995, 1997). The binding site was defined by selecting the 6 Å space around the co-crystallized ligand. In order to obtain the best docking poses, the default docking template for CYP450 Goldscore P450 was used. Gold's Goldscore was used as a scoring function to rank the docked poses of inhibitor compounds. For validating the docking experiment, the co-crystallized ligand was re-docked into the binding site, which resulted in an RMSD of 0.223 Å.
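The re-docking check above compares the docked pose with the crystallographic pose by RMSD. A minimal sketch of that calculation is given below, assuming that both poses list the same atoms in the same order, that they share the receptor coordinate frame (no superposition), and that no symmetry correction is applied. The coordinates are hypothetical and are not taken from PDB entry 4FDH.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered sets of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))

# Hypothetical heavy-atom coordinates in Å (crystal pose vs. re-docked pose)
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.2, 0.0, 0.0), (1.7, 0.0, 0.0), (1.7, 1.5, 0.0)]

pose_rmsd = rmsd(crystal_pose, docked_pose)
print(f"RMSD = {pose_rmsd:.3f} Å")
```

A commonly used rule of thumb treats a re-docking RMSD below about 2 Å as a successful reproduction of the experimental pose; the values reported in this work are well below that.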

# RESULTS

# CYP11B1 and CYP11B2 Inhibitor Pharmacophore Models

Pharmacophore models for CYP11B1 and CYP11B2 inhibitors were derived from highly potent training compounds. These compounds are expected to form a complex of an aromatic nitrogen with the heme iron in the active site of the enzyme. This sort of complex inhibits the catalytic process of the enzyme by preventing oxygen binding to heme iron.

The ligand-based, common feature pharmacophore model 1 was generated from compounds **4** and **5** (**Figure 3A**) (Meredith et al., 2013). Of the 10 reported pharmacophore queries, the model with the highest pharmacophore-fit and atom overlap score (0.9084) and the highest pharmacophore-fit score for the training compounds was selected for further refinement. This pharmacophore model was composed of two aromatic ring features (AR-1 and AR-2), three hydrophobic features (H-1, H-2, and H-3), three hydrogen bond acceptors (HBA-1, HBA-2, and HBA-3), and 47 XVOLs (**Figures 3B,C**). HBA-1 represents the heterocyclic nitrogen of the training compounds, which is hypothesized to form a complex with the heme of the CYP enzymes. The remaining pharmacophore features represent various common features of the training compounds. Pharmacophore model 1 was made more sensitive by (1) increasing the feature tolerances of AR-1, AR-2, and HBA-3 from the default 1 Å to 1.6, 1.3, and 1.75 Å, respectively, and (2) marking the H-1, H-2, H-3, and HBA-2 features as optional. In the theoretical validation, model 1 found 76 of the 384 active compounds (excluding the two training compounds) and 77 of the 15946 decoys. The training compounds **4** and **5** mapped all the features of the refined pharmacophore model 1 with pharmacophore-fit scores of 87.50 and 87.59, respectively.

Ligand-based pharmacophore model 2 was generated from training compounds **6** and **7** (Ulmschneider et al., 2005b; Hille et al., 2011b) (**Figure 4A**) using the same settings as for model 1. The model which achieved the highest pharmacophore fit and atom overlap score (0.9174) and highest pharmacophorefit score for the training compounds was selected for further

FIGURE 3 | Pharmacophore model 1 with training compounds 4 and 5. (A) 2D training compounds with their IC50 values are drawn. (B) Training compounds mapped into the model. (C) Final pharmacophore model 1 with color-coded features (yellow—hydrophobic, blue rings—AR, red—HBA, dotted style—optional features). The model consisted of 3 hydrophobic features, 3 HBAs, 2 AR features, and 47 XVOLs.

FIGURE 4 | Pharmacophore model 2 with its training compounds 6 and 7. (A) Training compounds with their IC50 values are drawn. (B) Mapping of training compounds with the model are shown. (C) The pharmacophore model is shown. Pharmacophore features are marked by colors. Model 2 comprised of 2 hydrophobic features, 2 AR features, 1 HBA feature, and 33 XVOLs.

Akram et al. Discovery of CYP11B1 and -2 Inhibitors

optimization. It consisted of two AR features (AR-1, AR-2), two hydrophobic (H-1 and H-2) features, one HBA (HBA-1), and 33 XVOLs (**Figures 4B,C**). The shared HBA feature of both of the training compounds was derived from the nitrogen of pyrimidine and imidazole rings. This model was made more sensitive by marking the hydrophobic feature H-1 as optional. In the validation screening, the final model found 36 active hits among 384 active compounds excluding the two training compounds and 10 out of 15946 decoys. The

TABLE 1 | Inhibition of CYP11B1, CYP11B2, and CYP17 enzyme activity by the virtual hits.


<sup>a</sup>Human CYP11B1 and CYP11B2 enzymes expressed in hamster V79MZh cells. <sup>b</sup>Mean value of at least three experiments.

<sup>c</sup>Human CYP17 enzyme isolated from Escherichia coli.

<sup>d</sup>Inhibition was measured at 10µM concentration.

<sup>e</sup>n.i., not inhibited.

<sup>f</sup> n.d., not determined.

training compounds **6** and **7** mapped all the features of the refined model 2, and both obtained a pharmacophore-fit score of 58.66.

The sensitivity values for both models 1 and 2 were calculated, which were 0.20 for model 1 and for model 2, respectively.
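The sensitivity calculation can be retraced from the validation counts reported above. The following is a minimal sketch, assuming that the 384 active compounds and the 15946 decoys are the denominators and that decoys are treated as inactives; the function names are illustrative and are not part of the original workflow.

```python
# Counts taken from the validation screening described above.
# Assumption: 384 actives and 15946 decoys are the relevant denominators.
N_ACTIVES = 384
N_DECOYS = 15946


def sensitivity(actives_found: int, n_actives: int) -> float:
    """Fraction of known actives retrieved by a model."""
    return actives_found / n_actives


def specificity(decoys_found: int, n_decoys: int) -> float:
    """Fraction of decoys rejected by a model (decoys assumed inactive)."""
    return (n_decoys - decoys_found) / n_decoys


# (actives found, decoys found) for each model
validation = {"model 1": (76, 77), "model 2": (36, 10)}

for name, (tp, fp) in validation.items():
    print(name,
          "sensitivity:", round(sensitivity(tp, N_ACTIVES), 2),
          "specificity:", round(specificity(fp, N_DECOYS), 4))
```

Under these assumptions, the value obtained for model 1 agrees with the reported 0.20.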

# Virtual Screening and Removal of False Positive Hits

Both pharmacophore models were employed for the VS of the drug discovery database SPECS (207,976 compounds) to find novel CYP11B1 and CYP11B2 inhibitors. The VS campaign resulted in 1,120 hits in total: 1,023 hits found by model 1 and 97 hits found by model 2. A PAINS filter removed 65 compounds from the hit list obtained by model 1 and 4 from the hit list retrieved by model 2. First, we focused on consensus hits; only one compound (**10**) fitted both pharmacophore models. Second, we aimed to validate each pharmacophore with a similar number of virtual hits in the biological testing. Because many of the hits that remained after virtual screening and PAINS filtering were derivatives of the same or similar scaffolds, we additionally performed a structural clustering to group the hits according to their chemical structure. The final selection was based on high fit values, chemical diversity, and the presence of an aromatic nitrogen in a ring system. Finally, 24 hits were submitted to in vitro evaluation, including 11 hits found by model 1, 12 hits found by model 2, and 1 consensus hit (**Table 1**).

# Inhibition of Human CYP11B1 and CYP11B2 Enzymes

The selected 24 hits were analyzed for CYP11B1 and CYP11B2 inhibitory activities in a cell-based assay. In a first step, all hits were tested against both CYP11B1 and CYP11B2 at a concentration of 10µM. Three compounds (**8**, **9**, and **10**) amongst the 24 tested hits showed more than 50% inhibition of both CYP11B1 and CYP11B2 at a concentration of 10µM (**Table 1**) and were therefore dual inhibitors. Compound **11** inhibited CYP11B1 more potently than CYP11B2. Compound **12** selectively inhibited CYP11B2 (**Figure 5**). These five compounds were further evaluated for their IC<sub>50</sub> values (**Table 1**). All of the newly discovered compounds that inhibited human CYP11B1 and CYP11B2 had a pyridine or pyrazole ring in their structures. The tested inactive compounds are shown in **Figure 6**.

### Selectivity over Human CYP17 Enzyme

The four most active compounds **8**–**10** and **12** were analyzed for the inhibition of the steroidogenic enzyme CYP17. The inhibition values were measured at a concentration of 10µM of inhibitor. None of the tested compounds inhibited CYP17 (**Table 1**).

# Docking of Active Hits into CYP11B2 Binding Sites

Because a ligand-based virtual screening workflow was used for selecting the test compounds, a docking study was performed to propose binding modes for the inhibitors. Previous studies have suggested that binding affinity for the enzyme is highly dependent on the coordination geometry between the heme iron and the heterocyclic nitrogen of the inhibitor. Accordingly, an angle of 90° of the aromatic nitrogen-iron vector projected on the heme-porphyrin plane would lead to potent inhibition (Yin et al., 2014).
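To make the geometric criterion concrete, the angle between the iron-nitrogen vector and the heme plane can be computed from atomic coordinates. The sketch below is one possible reading of this measurement, not the procedure used in the docking study; the coordinates are hypothetical, and the heme plane is approximated by the iron atom and two in-plane porphyrin atoms.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def angle_to_plane(fe, ligand_n, plane_atom_1, plane_atom_2):
    """Angle (degrees) between the Fe->N vector and the plane through
    Fe and two in-plane atoms; 90 means perpendicular coordination."""
    normal = cross(sub(plane_atom_1, fe), sub(plane_atom_2, fe))
    v = sub(ligand_n, fe)
    sin_angle = abs(dot(v, normal)) / (norm(v) * norm(normal))
    return math.degrees(math.asin(min(1.0, sin_angle)))


# Hypothetical coordinates in angstroms (not taken from any PDB entry).
fe = (0.0, 0.0, 0.0)
pyrrole_n1 = (2.0, 0.0, 0.0)
pyrrole_n2 = (0.0, 2.0, 0.0)

perpendicular = angle_to_plane(fe, (0.0, 0.0, 2.1), pyrrole_n1, pyrrole_n2)
tilted = angle_to_plane(fe, (0.3, 0.0, 2.1), pyrrole_n1, pyrrole_n2)
print(round(perpendicular, 1), round(tilted, 1))
```

In practice, a least-squares plane through all porphyrin ring atoms would be a more robust choice than a plane defined by three atoms.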

The docked pose of compound **9** showed the binding interaction of an imidazole-nitrogen with the heme iron at the binding site in a perpendicular way with an angle of 92°. The linker formed hydrophobic contacts with Thr318, Phe130, Ile488, Phe487, Phe231, and Trp116. The phenyl ring contacted Trp116, Met230, Trp260, and Ala313. Finally, the fluorine formed a bifurcated hydrogen bond with Arg120 and hydrophobic interactions with Trp260, Met309, and Ala313 (**Figure 7A**).

The imidazole nitrogen of compound **10** interacted with the heme iron in a perpendicular manner with an angle of 87°. The oxygen atoms of the sulfate formed a hydrogen bond with Thr318. The other marked interactions included hydrophobic interactions of halogens with Ile488, Phe130, Trp116, Phe130, and the heme porphyrin (**Figure 7B**).

Compound **12** inhibited CYP11B2 more selectively than CYP11B1. Two of the triazole nitrogen atoms were complexed with the heme iron at angles of 84 and 77°, respectively. The biphenyl part interacted via hydrophobic interactions with Phe130, Ala313, Trp116, Trp260, Met230, Leu227, Phe231, and Thr318 (**Figure 7C**).

# DISCUSSION

This study was performed to generate and validate novel pharmacophore models for CYP11B1 and CYP11B2 inhibitors (**Figures 3**, **4**). The developed pharmacophore queries were experimentally validated by screening the SPECS database. After removing the 69 PAINS (Baell and Holloway, 2010) compounds from a total of 1,120 virtual hits, 24 were selected for in vitro testing. These hits were biologically evaluated on hamster V79MZh cells expressing human CYP11B1 or CYP11B2 (Denner et al., 1995a; Ehmer et al., 2002). Five out of 24 selected hits inhibited CYP11B1 and/or CYP11B2 (**Table 1**). The predictive power of both pharmacophore models was analyzed. Eleven out of 24 compounds were selected by model 1; of these, compounds **8** and **10** inhibited both CYP11B1 and CYP11B2 in vitro (**Table 1**). This implies a success rate of 18%. Among the 13 compounds selected by model 2, compounds **9**–**11** inhibited both CYP11B1 and CYP11B2, and compound **12** showed selective inhibition of CYP11B2. This results in a success rate of 31%. Compound **10** was a consensus hit and inhibited both CYP11B1 and CYP11B2. Thus, the overall success rate of both pharmacophore models was 21%. These findings showed that both models 1 and 2 had adequate prospective predictive power, with success rates quite typical for this virtual screening method. According to a search of the SciFinder database, none of the compounds discovered in this study had been reported as CYP11B1 and CYP11B2 inhibitors in the literature before. Because 93.9% of the amino acid residues in CYP11B1 and CYP11B2 are identical (Kawamoto et al., 1992; Taymans et al., 1998), it is challenging to generate selective pharmacophore models for CYP11B1 and CYP11B2 inhibition. Model 1 found compounds **8** and **10**, both of which are novel dual inhibitors of CYP11B1 and CYP11B2.
The IC<sub>50</sub> values for compounds **8** and **10** for CYP11B1 inhibition were 3.04 and 0.13µM, respectively, and for CYP11B2 inhibition were 2.77 and 0.11µM, respectively (**Table 1**). Model 2 found compounds **9–12**, of which **9** and **10** were dual inhibitors of CYP11B1 and CYP11B2. The IC<sub>50</sub> values for compound **9** were 0.21µM for CYP11B1 inhibition and 0.08µM for CYP11B2 inhibition. The IC<sub>50</sub> values of compound **11** (CYP11B1 = 2.52µM, CYP11B2 = 15.58µM) showed that it had a selectivity factor of 6 for CYP11B1 inhibition over CYP11B2. Compound **12** was a selective inhibitor of CYP11B2 with an IC<sub>50</sub> = 1.12µM, while it was a very weak inhibitor of CYP11B1 with an inhibition of 33% at a concentration of 10µM.
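The success rates and the selectivity factor quoted above follow from simple ratios. The short sketch below retraces them from the numbers given in the text; the function names are illustrative only.

```python
def hit_rate(n_active: int, n_tested: int) -> float:
    """Fraction of tested virtual hits confirmed as active in vitro."""
    return n_active / n_tested


def selectivity_factor(ic50_off_target: float, ic50_target: float) -> float:
    """Ratio of IC50 values; values above 1 favor the target enzyme."""
    return ic50_off_target / ic50_target


# Counts as stated in the discussion above.
rates = {
    "model 1": hit_rate(2, 11),   # compounds 8 and 10
    "model 2": hit_rate(4, 13),   # compounds 9-12
    "overall": hit_rate(5, 24),   # five actives among 24 tested hits
}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")

# Compound 11: IC50 of 15.58 µM (CYP11B2) versus 2.52 µM (CYP11B1).
sf_compound_11 = selectivity_factor(15.58, 2.52)
print(f"selectivity factor of compound 11: {sf_compound_11:.1f}")
```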

An X-ray crystal structure of CYP11B1 has not been published yet; however, the crystal structure of CYP11B2 was available from the PDB (Berman et al., 2000) (PDB ID = 4FDH) (Strushkevich et al., 2013). The positioning of the novel inhibitors in the binding pocket of CYP11B2, which is similar to that of the well-known inhibitor fadrozole, rationalizes their biological activities (**Figure 7**).

A close analysis of the mapping of the active hits and fadrozole into the pharmacophore models was performed. Combined aromatic ring-HBA features (AR-1 and HBA-1) of the respective pharmacophore models (**Figure 9**) mapped an aromatic nitrogen of all the novel inhibitors **8–12**. The angle and position of the aromatic nitrogen toward the heme iron are important for forming an inhibition complex at the binding site. In the docking analysis, all active hits formed this interaction at an angle of around 90° (**Figure 7**).

Based on the results obtained in this study, we compared our pharmacophore queries with previously reported pharmacophore models (Ulmschneider et al., 2006; Lucas et al., 2008a, 2011; Gobbi et al., 2013). These previously published studies used molecular modeling as a tool for designing optimized CYP11B1 and CYP11B2 inhibitors. Our pharmacophore queries were based on diverse training compounds (Ulmschneider et al., 2005a; Hille et al., 2011b; Meredith et al., 2013), and had different numbers and locations of pharmacophore features in space. In comparison to the previous models, our pharmacophores additionally include aromatic

TABLE 2 | Detailed analysis of pharmacophore features mapped by all novel inhibitors 8-12 of CYP11B1 and CYP11B2.


<sup>a</sup>Compound. <sup>b</sup>Optional feature.

FIGURE 8 | Comparison of pharmacophore models 1 and 2. The highlighted features (wireframe) are from pharmacophore model 2. The pharmacophore features are color-coded. Yellow represents hydrophobic, blue denotes AR, and red shows the HBAs. Four pharmacophore features of both pharmacophores are common. H-2\* is hydrophobic feature from model 2.

features (AR-1 and AR-2) (**Figure 9**, **Table 2**). The alignment of our pharmacophore models reveals that they have four features in common, including HBA-1, AR-1, AR-2, and H-1 (**Figure 8**). All the novel inhibitors found in this study were mapped to analyze the importance of different pharmacophore features. The alignment showed that HBA-1, HBA-3, AR-1, AR-2, H-1, and H-2 were essential features in mapping the active compounds during the virtual screening run (**Figure 9**, **Table 2**). All of the newly discovered inhibitors in this study have aromatic nitrogen-containing heterocycles and hydrophobic parts (**Figure 9**). The heterocyclic nitrogen part has a crucial role in forming an iron-binding interaction with the heme of these CYP enzymes and was mapped by the HBA-1 and AR-1 features of the pharmacophores. This type of interaction inhibits the catalytic process of the target enzymes and has been reported earlier (Denner et al., 1995a,b; Hartmann et al., 2003; Bureik et al., 2004; Ulmschneider et al., 2005b; Hoyt et al., 2015).

Both active hits from model 1 did not map the two optional features of the model. This suggests that these features may be deleted from the model without losing active hits. A model with fewer and no optional features is much faster in screening virtual compound libraries. In future studies, a refined model 1 without those optional features can be applied for screening millions of compounds in a still reasonable time.

To compare the ligand-based features of the models to the protein-ligand interactions observed in the available X-ray structures of CYP11B2, the co-crystallized inhibitor fadrozole (4FDH) was aligned to pharmacophore model 1. Fadrozole mapped five features of the model (Figure S1) but likewise did not map the two optional features, supporting the hypothesis that those are not advantageous. A comparison of structure-based pharmacophore models derived from the 4FDH (Strushkevich et al., 2013) and 4ZGX (Martin et al., 2015) co-crystallized structures is given in the supporting information (Figure S2). The general procedure for the generation of pharmacophore models has been previously outlined (Vuorinen et al., 2014; Akram et al., 2015; Kaserer et al., 2015).

During the validation of our pharmacophore models, three novel dual CYP11B1 and CYP11B2 inhibitors, one novel selective CYP11B1 inhibitor, and one novel selective CYP11B2 inhibitor were discovered. Compound **11** was a selective inhibitor of CYP11B1, the principal enzyme for the production of cortisol, whose inhibition may be a strategy for the treatment of Cushing's syndrome and delayed wound healing (Nieman, 2002). Compound **12** was a selective inhibitor of CYP11B2, the key enzyme for the production of aldosterone, whose inhibition is a potential approach to the treatment of congestive heart failure, myocardial fibrosis, and hypertension. Compounds **8**–**10** are potent dual inhibitors of CYP11B1 and CYP11B2, which makes them interesting lead compounds for the development of drugs that could achieve a complete blockade of adrenal corticoid formation. Compounds **8**–**12** could be further chemically optimized to enhance their biological efficacies and selectivities by bioisosteric replacements or substitution of rings.

Compounds **8**–**10** and **12** were also tested for inhibition of the human steroidogenic enzyme CYP17 (**Table 1**), because it belongs to the same class and has the same inhibition mechanism

hydrophobic, blue denotes AR, and red shows the HBAs. Optional features (dotted style) are not mapped by the virtual hits 8 and 10.

as other CYP enzymes (Devore and Scott, 2012). None of the novel inhibitors inhibited human CYP17 by more than 3% at a concentration of 10µM. This showed the selectivity of these novel inhibitors over CYP17.

The virtually selected hits **13**–**31** that showed no or only very weak inhibition during in vitro testing on human CYP11B1 and CYP11B2 might not be able to bind to the target, may have suffered from degradation or did not reach the binding site of the enzyme, and/or could have been pumped out of the cells via cellular efflux pumps (Johnstone et al., 2000). A precise conclusion for their inactivity is difficult to draw (**Figure 6**).

#### CONCLUSION

In the course of this study, ligand-based pharmacophore models for CYP11B1 and CYP11B2 inhibition were developed. For experimental validation of pharmacophore queries, the virtually selected hits were tested in vitro. This process resulted in the identification of new structural features advantageous for CYP11B inhibition (AR-1, AR-2, H-1, H-2, and HBA-3) and five novel CYP11B1 and/or CYP11B2 inhibitors. All of the novel inhibitors contained a heterocyclic nitrogen that is frequently present in CYP inhibitors. This project validated our pharmacophore model for future virtual screening campaigns. Regarding the quality of the pharmacophore models, model 2 gave more active hits than model 1. Both models will be refined further based on the biological testing to enhance their sensitivity and specificity.

### REFERENCES

Accelrys, I. (2011). Pipeline Pilot. San Diego, CA: Biovia Inc.


#### AUTHOR CONTRIBUTIONS

DS planned and supervised the study. WW and MA created the pharmacophore models. MA performed the virtual screening and hits selection along with DS. MA conducted biological experiments under supervision of JH and RH. MA, JH, DS, and RH analyzed the data. All authors were involved in the preparation of the manuscript and approved the final version.

#### ACKNOWLEDGMENTS

This study was funded by the Austrian Science Fund FWF (P26782) and a Young Talents Grants from the University of Innsbruck. DS is an Ingeborg Hochmair Professor at the University of Innsbruck. MA is also grateful to Standortagentur Tirol for facilitating the ERASMUS+ program for performing in vitro assays at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Germany. We are thankful to InteLigand and OpenEye Inc. for providing LigandScout and OMEGA academic licenses. We acknowledge the help of Philipp Schuster and Grant Alexander Begg for proofreading the manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2017.00104/full#supplementary-material


Sayers, G. (1950). The adrenal cortex and homeostasis. Phy. Rev. 30, 241–320.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Akram, Waratchareeyakul, Haupenthal, Hartmann and Schuster. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts

#### Hongbin Yang, Lixia Sun, Weihua Li, Guixia Liu and Yun Tang\*

*Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China*

During drug development, safety is always the most important issue; it covers a variety of toxicities and adverse drug effects, which should be evaluated in the preclinical and clinical trial phases. This review first briefly introduces the computational methods used to predict chemical toxicity for drug design, including machine learning methods and structural alerts. Machine learning methods have been widely applied in qualitative classification and quantitative regression studies, while structural alerts can be regarded as a complementary tool for lead optimization. The emphasis of this article is on the recent progress of predictive models built for various toxicities. Available databases and web servers are also described. Although these methods and models are very helpful for drug design, some challenges and limitations remain to be addressed for drug safety assessment in the future.

Keywords: drug safety, chemical toxicity, drug design, machine learning, structural alerts

# INTRODUCTION

Drug discovery and development is a long, high-risk journey. It is estimated that the attrition rate of drug candidates is up to 96% (Paul et al., 2010) and that the average cost of developing a new drug has reached 2.6 billion U.S. dollars in recent years (PhRMA, 2015). One of the major causes of the high attrition rate is drug safety, which accounts for 30% of drug failures (Giri and Bader, 2015). Even if a drug is approved for the market, it could later be withdrawn due to safety problems. Therefore, drug safety should be evaluated extensively as early as possible.

Usually, in vitro and in vivo tests are performed to investigate drug safety, including a variety of toxicities and adverse drug effects. In recent years, there have also been some efforts to develop in vitro models such as "organ on a chip" to reduce cost (Huh et al., 2010, 2011). However, those approaches are still costly and time-consuming. Compared with experimental approaches, computational methods have shown great advantages since they are green, fast, cheap, accurate, and, most importantly, can be applied before a compound is synthesized (Segall and Barber, 2014).

To date, many computational models have been developed for drug safety assessment, which can generally be divided into three categories: qualitative classification, quantitative regression, and read-across. In the first step of drug safety assessment, we only need to know whether a compound is toxic or non-toxic, or highly toxic or slightly toxic, rather than its exact toxicity value, so classification models can be used. For a small number of chemical analogs, quantitative structure-toxicity

#### Edited by:

*Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria*

#### Reviewed by:

*Huixiao Hong, United States Food and Drug Administration, United States Heebeom Koo, Catholic University of Korea, South Korea*

> \*Correspondence: *Yun Tang ytang234@ecust.edu.cn*

#### Specialty section:

*This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *11 January 2018* Accepted: *05 February 2018* Published: *20 February 2018*

#### Citation:

*Yang H, Sun L, Li W, Liu G and Tang Y (2018) In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts. Front. Chem. 6:30. doi: 10.3389/fchem.2018.00030*


Yang et al. *In Silico* Prediction of Toxicity

relationship (QSTR) models can be derived for prediction of exact toxicity values. For unique compounds, read-across is also a feasible approach to deduce a certain toxicity endpoint from structurally similar compounds with experimental toxicity values. These models have high accuracies, especially in a local chemical space, and sometimes they can replace in vitro or in vivo assays for certain endpoints. Furthermore, structural alerts (SAs) can be derived from the models as keys for a compound to cause adverse effects on organs (Pizzo et al., 2015), which chemists can use in structural modification to reduce the risk.

In recent years, we have worked on drug safety assessment and developed many predictive models for chemical toxicity with machine learning methods and structural alerts. A web server named admetSAR was also developed for free public access (Cheng et al., 2012b). In a previous paper published in 2013, we reviewed the advances and challenges of in silico prediction of chemical toxicity together with pharmacokinetic properties (Cheng et al., 2013a). Here, we review the progress of in silico chemical toxicity prediction over the past 5 years, including methodologies of machine learning and structural alerts, and major toxicity endpoints in drug discovery and development (**Figure 1**). Available data sources and web servers are also described. Finally, challenges and future directions in this field are discussed.

# MODEL BUILDING WITH MACHINE LEARNING METHODS

The general procedure to build a predictive model contains roughly four steps: data collection, data description, model building, and model evaluation. Each step has its own requirements to guarantee the reliability and accuracy of the models.

### Data Collection

The quality of experimental data is the most important factor in model building. Currently, numerous well-defined data sets are available online, which greatly facilitates the construction of computational models by machine learning methods. In **Table 1**, we list some widely used databases, including those linking chemical structures with safety outcomes, protein targets and/or biological pathways.

TOXNET is a comprehensive source that integrates several toxicity databases such as ToxLine and ChemIDplus (Fowler and Schnall, 2014). ACToR is a large database that aggregates data from thousands of public sources (Judson et al., 2008). DSSTox, a subset of ACToR, provides a high quality resource for toxicity prediction, including ToxCast and Tox21 data (Williams-DeVane et al., 2009). OECD established eChemPortal to provide chemical information including physicochemical properties, and toxicity. Many databases are contained in eChemPortal, such as ACToR and HSDB (Fonger et al., 2014). Some other toxicity databases include SuperToxic (Schmidt et al., 2009), T3DB (Wishart et al., 2015), and ToxBank (http://www.toxbank.net). We previously developed a web server admetSAR, which also contains toxicity data (Cheng et al., 2012b).

In addition to the phenotype data that are directly relevant to toxicity, databases on bioactivity, pathways and side effects are also important for toxicity prediction. Several bioactivity databases are freely available, such as PubChem (Wang et al., 2009), ChEMBL (Gaulton et al., 2017), and BindingDB (Gilson et al., 2016). We developed a web server named MetaADEDB that integrates CTD (Davis et al., 2017), SIDER (Kuhn et al., 2010), and OFFSIDES (Tatonetti et al., 2012) with regard to the adverse drug events (ADEs) of drugs (Cheng et al., 2013b,c).

# Data Description

There are two ways to represent chemical structures as numeric features which can be processed by machine learning methods. One way is to use molecular descriptors, which can be calculated from chemical structures, physicochemical or topological properties. Currently thousands of continuous and discrete molecular descriptors can be obtained via chemoinformatics toolkits such as PaDEL-Descriptor (Yap, 2011), OpenBabel (O'Boyle et al., 2011), CDK (Steinbeck et al., 2003), RDKit (Landrum, 2017), or web servers like E-Dragon (Tetko et al., 2005), ChemBCPP (Dong et al., 2017a), and ChemDes (Dong et al., 2015). Using a large number of features may result in overfitting when the training set is small (Xue et al., 2004). Hence, feature selection should be done before model building, to reduce the risk of overfitting and enhance the performance of the model (Sun et al., 2017).
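As an illustration of the feature selection step, the sketch below removes descriptor columns that carry no information because their variance is zero (or below a chosen threshold). This is only one simple filter among many and is not claimed to be the method used in the cited studies; the descriptor names and values are hypothetical.

```python
from statistics import pvariance


def low_variance_filter(rows, names, threshold=0.0):
    """Keep descriptor columns whose population variance exceeds threshold.

    rows  : list of descriptor vectors, one per compound
    names : descriptor names, in column order
    """
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Hypothetical descriptor table: three compounds, three descriptors.
descriptor_names = ["MolWt", "LogP", "ConstantFlag"]
descriptor_rows = [
    [180.2, 1.2, 1.0],
    [151.2, 0.5, 1.0],
    [206.3, 3.5, 1.0],
]

kept_names, kept_rows = low_variance_filter(descriptor_rows, descriptor_names)
print(kept_names)
```

Filters of this kind are usually followed by more selective procedures, for example removing highly correlated descriptors or ranking descriptors by their relevance to the endpoint.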

The other way is to use molecular fingerprints, which represent a molecule as a binary string, such as MACCS, PubChemFP, and KRFP (Klekota and Roth, 2008). In a molecular fingerprint, lists of substructures or other kinds of patterns are predefined. If a specified pattern is present in a molecule, the corresponding bit in the binary string is set to "1," otherwise it is set to "0." Compared with molecular descriptors, these binary features are more interpretable because each bit corresponds to a specific substructure. In addition to the common fingerprints, custom patterns can also be used to enhance the predictability of the models (Yang et al., 2017b).
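The bit-setting logic of a substructure fingerprint can be shown with a deliberately simplified toy example. In the sketch below, the pattern list is hypothetical and patterns are matched as plain text substrings of a SMILES string, which is not a real substructure search; chemoinformatics toolkits perform graph-based matching instead.

```python
# Hypothetical pattern list; real fingerprints such as MACCS use
# predefined substructure keys matched on the molecular graph.
PATTERNS = ["N", "O", "Cl", "c1ccccc1", "C(=O)O"]


def toy_fingerprint(smiles: str, patterns=PATTERNS) -> str:
    """Return a bit string; bit i is '1' if pattern i occurs in the SMILES.

    NOTE: naive text matching, used only to illustrate how bits are set.
    """
    return "".join("1" if p in smiles else "0" for p in patterns)


aspirin_bits = toy_fingerprint("CC(=O)Oc1ccccc1C(=O)O")
chlorobenzene_bits = toy_fingerprint("Clc1ccccc1")
print(aspirin_bits, chlorobenzene_bits)
```

Each position of the resulting string can be traced back to one pattern, which is the source of the interpretability mentioned above.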

# Single-Label Model Building

Machine learning methods are usually used to build the predictive models. There are many free and open access tools and development kits to fulfill this task. For example, Scikit-learn (Pedregosa et al., 2011) is a popular Python toolkit for machine learning and TensorFlow (https://www.tensorflow.org) is a widely used Python library for deep learning. WEKA (Frank et al., 2004), Orange (Demsar et al., 2013) and RapidMiner (https://rapidminer.com/) are machine learning toolboxes with a GUI (graphical user interface).

Support vector machine (SVM), random forest (RF), boosting tree (BT), and k-nearest neighbor (kNN) are classic machine learning methods that are widely used in classification and regression models. SVM, also known as support vector classifier (SVC) or support vector regression (SVR) in particular tasks, is well-known for its high predictive performance and lower risk of overfitting (Cortes and Vapnik, 1995). The basic idea of SVM is to construct a hyperplane in a high dimensional space with the largest distance to the nearest training data points (support vectors). RF and BT are derived from decision trees (Breiman, 2001; Elith et al., 2008). RF can be viewed as bagging many decision trees, each of which uses a random subset of features, and combining them via a voting system. Unlike RF, in which every tree carries equal weight, BT dynamically adjusts the weight of each tree according to the mean error of prediction. kNN is one of the simplest algorithms (Cover and Hart, 1967). The premise of kNN is that compounds with similar structures have similar biological properties. In kNN, a sample is classified by the votes of the categories of its neighbors.
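The voting principle of kNN can be sketched in a few lines. The example below assumes that each compound is represented by the set of "on" bits of its fingerprint and that similarity is measured with the Tanimoto coefficient, a common but not mandatory choice; the training data and labels are hypothetical.

```python
from collections import Counter


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(query: frozenset, training, k: int = 3) -> str:
    """Classify a query by majority vote among its k most similar neighbors.

    training : list of (on_bits, label) pairs
    """
    ranked = sorted(training,
                    key=lambda item: tanimoto(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: fingerprint bit positions and toxicity labels.
training_set = [
    (frozenset({1, 2, 3}), "toxic"),
    (frozenset({1, 2, 4}), "toxic"),
    (frozenset({7, 8, 9}), "non-toxic"),
    (frozenset({7, 8}), "non-toxic"),
    (frozenset({2, 3, 5}), "toxic"),
]

print(knn_predict(frozenset({1, 2, 3, 5}), training_set))
print(knn_predict(frozenset({7, 9}), training_set))
```

The choice of k and of the similarity measure are hyperparameters that would normally be tuned by cross-validation.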

Sometimes, to enhance performance of prediction models, combination of these algorithms is applied. We developed a combined method using an artificial neural network (ANN) model to generate the final combination decision probability, which showed that the combined methods would be superior to "single" methods (Cheng et al., 2011b; Du et al., 2017; Sun et al., 2017).

Recently, deep learning (DL) has been applied to such challenging problems as computer vision and speech recognition (Deng et al., 2013; LeCun et al., 2015). The multilayer neural network (MNN) is one of the DL techniques. Unlike a common ANN, which has only three layers (input layer, hidden layer, and output layer) (Shen et al., 2004), an MNN contains more than one hidden layer and is thus better suited to large toxicological data sets with complex mechanisms. When the training set is large, it can perform better than ANN and the above-mentioned classic machine learning methods (Mayr et al., 2016). However, a more complex network has more weights to fit and is therefore more prone to overfitting. Graph-convolutional networks (Duvenaud et al., 2015) and long short-term memory architectures (Altae-Tran et al., 2017) have recently been developed to extract features from molecules based on atom features and show better performance in handling thousands of compounds or even more (Goh et al., 2017). DeepChem (https://deepchem.io) is an open source Python library devoted to providing a high quality toolchain to facilitate the use of DL in drug discovery and other fields.

TABLE 1 | Data sources for prediction of chemical toxicity.


*<sup>a</sup>CTA, compound-toxicity association; MI, molecular interaction; SE, side effect; CPI, compound-protein interaction.*

#### Multi-Label Model Building

Unlike the aforementioned single-label classification or regression models, multi-label classification (MLC) is a data mining approach in which each data instance can be assigned to multiple categories at once (Tsoumakas et al., 2010; Zhang and Zhou, 2014; Gibaja and Ventura, 2015). The demand for multi-label techniques is constantly growing in biology and genomics (Diplaris et al., 2005; Avila et al., 2009). The current algorithms used for this task are fairly new and many of them are still in an early stage of development.

There are three major approaches for multi-label learning: data transformation, method adaptation and ensembles of classifiers. The first one, including Binary Relevance (BR) (Godbole and Sarawagi, 2004), classifier chains (CC) (Read et al., 2011), and Label Powerset (LP) (Boutell et al., 2004), is to transform original multi-label dataset (MLD) to a set of binary datasets (BIDs) or one multi-class dataset (MCD) first, and then process them with traditional classification algorithms (Barot and Panchal, 2014). With the development of these frameworks for MLC, classification algorithms available for binary and multiclass data can be utilized as the underlying base classifier including SVM, ANN, decision tree, kNN, and so on. The second alternative aims for adapting existent algorithms to deal with multi-label data, such as multi-label C4.5 (Al-Otaibi et al., 2014), multi-label back-propagation (Zhang and Zhou, 2006), Rank-SVM (Wang et al., 2014), and multi-label kNN (Zhang and Zhou, 2007). Finally, the classification ensemble is also a widespread technique in multi-label field. For example, Ensemble of Classifier Chain (ECC) (Read et al., 2011), which consists of a set of CC with diverse label orders and then votes for the final prediction, is proposed to allow for the effect of chain order. Some other MLC methods based on the ensemble of multi-class classifiers were also proposed, such as EPS (Read et al., 2008), RAkEL (Tsoumakas and Vlahavas, 2007), and HOMER (Tsoumakas et al., 2008).
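The data transformation approaches can be illustrated with a minimal sketch. The code below shows how Binary Relevance produces one binary dataset per label and how Label Powerset produces a single multi-class dataset; the compound identifiers and toxicity labels are hypothetical, and the base classifiers that would be trained on the transformed data are omitted.

```python
def binary_relevance(samples, labels):
    """Split a multi-label dataset into one binary dataset per label."""
    return {
        label: [(x, int(label in label_set)) for x, label_set in samples]
        for label in labels
    }


def label_powerset(samples):
    """Treat each distinct label combination as one class of a
    multi-class dataset."""
    return [(x, "+".join(sorted(label_set)) or "none")
            for x, label_set in samples]


# Hypothetical multi-label dataset: (compound id, set of toxicity labels).
mld = [
    ("cmpd_A", {"hepatotoxic", "cardiotoxic"}),
    ("cmpd_B", {"hepatotoxic"}),
    ("cmpd_C", set()),
]
all_labels = ["hepatotoxic", "cardiotoxic"]

bids = binary_relevance(mld, all_labels)
mcd = label_powerset(mld)
print(bids["cardiotoxic"])
print(mcd)
```

Binary Relevance ignores correlations between labels, whereas Label Powerset preserves them at the cost of a class count that can grow quickly with the number of labels.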

#### Model Evaluation

For regression models, three evaluation metrics, namely the squared Pearson product-moment correlation coefficient (R<sup>2</sup>), mean absolute error (MAE) and root mean squared error (RMSE), are frequently used to estimate the performance of models. These parameters are defined as follows:

$$R^2 = \left[\frac{\sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \overline{x})^2 \sum_{i=1}^{N} (y_i - \overline{y})^2}}\right]^2 \tag{1}$$

$$\text{MAE} = \frac{\sum_{i=1}^{N} |x_i - y_i|}{N} \tag{2}$$

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - y_i)^2}{N}} \tag{3}$$

where x<sub>i</sub> is the experimental value, y<sub>i</sub> is the predicted value, x̄ and ȳ are their corresponding means, and N is the number of samples.
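The three regression metrics defined in Equations (1)–(3) can be implemented directly. The sketch below uses hypothetical experimental and predicted values chosen so that every prediction is shifted by a constant offset, which shows that R<sup>2</sup> as defined here measures correlation only, while MAE and RMSE capture the systematic error.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation coefficient, Equation (1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy / math.sqrt(sxx * syy)) ** 2


def mae(x, y):
    """Mean absolute error, Equation (2)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean squared error, Equation (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical data: predictions are offset by +0.5 from the experiment.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

print(r_squared(experimental, predicted),
      mae(experimental, predicted),
      rmse(experimental, predicted))
```

For this reason, R<sup>2</sup> is best reported together with at least one error-based metric.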

For traditional single-label binary or multi-class classification models, most of the performance metrics are calculated based on the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Accuracy, sensitivity and specificity can be calculated with the following equations to represent the overall predictive ability, the predictive accuracy for positive samples and the predictive accuracy for negative ones, respectively:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{5}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$

In addition to these metrics, which are computed from a binary partition of the labels, metrics calculated from the predicted confidence of being positive are also used, such as the area under the receiver operating characteristic curve (AUC).
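The following minimal sketch illustrates Equations (4) to (6) and a pairwise formulation of the AUC. The function names, the 0.5 decision threshold, the half-credit treatment of tied scores, and the toy scores are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of positive-negative pairs in which the positive has the higher score.

    Ties are given half credit (an assumption of this sketch).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and predicted confidence of being positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # Equation (4)
sensitivity = tp / (tp + fn)                 # Equation (5)
specificity = tn / (tn + fp)                 # Equation (6)
auc_value = auc(y_true, scores)
```

Unlike the three threshold-based metrics, `auc_value` does not change if a different decision threshold is chosen, because it depends only on the ranking of the scores.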

Compared with single-label classifiers, multi-label classifiers can produce multiple outputs for an instance, and the predictions can be fully or partially correct. Multi-label performance metrics can be classified into two groups, i.e., example-based and label-based metrics (Tsoumakas et al., 2007; Zhang and Zhou, 2014). Here, five example-based metrics (subset accuracy, Jaccard similarity coefficient, Hamming loss, micro-precision, and micro-recall) are described with their mathematical formulations below.

$$\text{Subset Accuracy} = \frac{1}{n} \sum\_{i=1}^{n} [\![ Y\_i = Z\_i ]\!] \tag{7}$$

$$\text{Jaccard Similarity Coefficient} = \frac{1}{n} \sum\_{i=1}^{n} \frac{|Y\_i \cap Z\_i|}{|Y\_i \cup Z\_i|} \tag{8}$$

$$\text{Hamming Loss} = \frac{1}{n} \frac{1}{k} \sum\_{i=1}^{n} |Y\_i \Delta Z\_i| \tag{9}$$

$$Recall\_{micro} = \frac{1}{n} \sum\_{i=1}^{n} \frac{|Y\_i \cap Z\_i|}{|Y\_i|} \tag{10}$$

$$Precision\_{micro} = \frac{1}{n} \sum\_{i=1}^{n} \frac{|Y\_i \cap Z\_i|}{|Z\_i|} \tag{11}$$

where Y<sub>i</sub> represents the real label set of the ith instance and Z<sub>i</sub> the predicted one, n is the number of instances, k is the number of labels, Δ denotes the symmetric difference of two sets, and [[·]] equals 1 if the enclosed condition holds and 0 otherwise.
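Equations (7) to (11) can be sketched in Python with label sets represented as `set` objects. The function names, the handling of empty sets (a Jaccard value of 1 when both sets are empty, a ratio of 0 when a denominator is empty), and the toy label sets are assumptions of this sketch rather than definitions from the cited works.

```python
from typing import List, Set


def subset_accuracy(Y: List[Set[int]], Z: List[Set[int]]) -> float:
    """Equation (7): fraction of instances whose predicted label set is exactly right."""
    return sum(1 for y, z in zip(Y, Z) if y == z) / len(Y)


def jaccard(Y: List[Set[int]], Z: List[Set[int]]) -> float:
    """Equation (8): mean |Y_i & Z_i| / |Y_i | Z_i| (1.0 if both sets are empty)."""
    return sum(len(y & z) / len(y | z) if (y | z) else 1.0 for y, z in zip(Y, Z)) / len(Y)


def hamming_loss(Y: List[Set[int]], Z: List[Set[int]], k: int) -> float:
    """Equation (9): mean size of the symmetric difference, normalized by k labels."""
    return sum(len(y ^ z) for y, z in zip(Y, Z)) / (len(Y) * k)


def recall(Y: List[Set[int]], Z: List[Set[int]]) -> float:
    """Equation (10): mean |Y_i & Z_i| / |Y_i| (0.0 if Y_i is empty)."""
    return sum(len(y & z) / len(y) if y else 0.0 for y, z in zip(Y, Z)) / len(Y)


def precision(Y: List[Set[int]], Z: List[Set[int]]) -> float:
    """Equation (11): mean |Y_i & Z_i| / |Z_i| (0.0 if Z_i is empty)."""
    return sum(len(y & z) / len(z) if z else 0.0 for y, z in zip(Y, Z)) / len(Y)


# Hypothetical toy data: 3 instances, k = 4 possible labels (0..3).
Y_true = [{0, 1}, {2}, {1, 3}]
Z_pred = [{0, 1}, {2, 3}, {1}]

subset_acc = subset_accuracy(Y_true, Z_pred)
jaccard_value = jaccard(Y_true, Z_pred)
hamming_value = hamming_loss(Y_true, Z_pred, k=4)
recall_value = recall(Y_true, Z_pred)
precision_value = precision(Y_true, Z_pred)
```

In this toy case only the first instance is predicted exactly, so subset accuracy is low, while the partial-credit metrics reward the two partially correct predictions.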

Furthermore, another example-based metric, ranking loss, can be used. Ranking loss measures how often an irrelevant label is ranked above a relevant one according to the predicted probabilities of the labels. Among label-based metrics, micro-AUC is the most commonly used. It is also a ranking-based metric, similar to ranking loss. However, whereas ranking loss compares label ranks within each example, micro-AUC counts all relevant-irrelevant label pairs in which the relevant label is ranked above the irrelevant one, where the two labels do not necessarily belong to the same example.
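The difference between the two ranking-based metrics can be made concrete with a minimal sketch. It assumes that ties count as misordered pairs in ranking loss, that ties receive half credit in micro-AUC, and that examples without both relevant and irrelevant labels are skipped; the function names and toy scores are hypothetical.

```python
from typing import List, Sequence


def ranking_loss(Y: Sequence[Sequence[int]], S: Sequence[Sequence[float]]) -> float:
    """Mean per-example fraction of (relevant, irrelevant) label pairs that are misordered."""
    losses: List[float] = []
    for labels, scores in zip(Y, S):
        relevant = [s for l, s in zip(labels, scores) if l == 1]
        irrelevant = [s for l, s in zip(labels, scores) if l == 0]
        if not relevant or not irrelevant:
            continue  # no pairs to compare for this example (assumption: skip it)
        misordered = sum(1 for r in relevant for q in irrelevant if q >= r)
        losses.append(misordered / (len(relevant) * len(irrelevant)))
    return sum(losses) / len(losses) if losses else 0.0


def micro_auc(Y: Sequence[Sequence[int]], S: Sequence[Sequence[float]]) -> float:
    """Fraction of correctly ordered pairs after pooling all (example, label) scores."""
    relevant = [s for labels, scores in zip(Y, S) for l, s in zip(labels, scores) if l == 1]
    irrelevant = [s for labels, scores in zip(Y, S) for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if r > q else 0.5 if r == q else 0.0 for r in relevant for q in irrelevant)
    return wins / (len(relevant) * len(irrelevant))


# Hypothetical toy data: 2 examples x 3 labels, with predicted label probabilities.
Y = [[1, 0, 0], [0, 1, 1]]
S = [[0.8, 0.3, 0.9], [0.2, 0.7, 0.4]]

rl_value = ranking_loss(Y, S)
micro_auc_value = micro_auc(Y, S)
```

Here the second example is ranked perfectly, so it contributes nothing to the ranking loss, but its low score of 0.4 for a relevant label is still compared with the 0.9 of an irrelevant label from the first example when micro-AUC is computed.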

#### METHODS FOR DETECTING STRUCTURAL ALERTS

Structural alerts (SAs) are key substructures responsible for a certain type of toxicity. Because they are directly connected to toxicity, medicinal chemists can use them in structural optimization to reduce risk. In 1985, Ashby found strong associations between the occurrence of some substructures or patterns and chemical mutagenicity to Salmonella, which was the first appearance of the concept of SAs (Ashby and Tennant, 1988).

To date, many methods and software tools have been developed for detecting SAs, such as SARpy (Ferrari et al., 2013), MoSS, Gaston, and MolFea. ToxAlerts is a web server that collects SAs defined by experts or identified by computational tools, and it can predict toxicity according to the presence of SAs (Sushko et al., 2012). Automatic detection of SAs by computational tools has become a hot topic with the development of cheminformatics and the explosion of available data (Lepailleur et al., 2013; Floris et al., 2017).

In a previous paper, we evaluated several methods for the identification of SAs (Yang et al., 2017a). At present, the methods can be divided into three categories: fragment-based, graph-based, and fingerprint-based. Fragment-based methods, such as SARpy (Ferrari et al., 2013), first cut the bonds of the molecules in the dataset to obtain all possible fragments. Each fragment is then evaluated according to its occurrence in toxic and non-toxic compounds. These methods have been used to detect SAs for carcinogenicity (Golbamaki and Benfenati, 2016; Golbamaki et al., 2016). Graph-based approaches treat molecules as graphs consisting of vertices and edges and use subgraph-searching algorithms to find frequent patterns. MoSS uses depth-first search and association rules to mine substructures (Borgelt and Berthold, 2002). Gaston is a stand-alone tool that uses a graph-based approach to obtain substructures from a dataset (Kazius et al., 2006). Another graph-based method, proposed by Ahlberg et al. (2014), uses atom signatures, a linear representation of a compound, and mines sub-signatures as SAs. Fingerprint-based approaches do not derive fragments from the dataset. Instead, the fragments are predefined by molecular fingerprints such as MACCS and SubFP (Shen et al., 2010). The choice of fingerprint may affect the SAs that are finally identified. Fingerprints such as Morgan, used by Bioalerts (Cortes-Ciriano, 2016), might lead to redundant SAs that are very similar and related to the same mechanism.

Information gain (IG) can also be used to evaluate the significance of a substructure. Compounds containing the substructure are categorized as toxic and all others as non-toxic. IG is defined as the difference between the information entropy of the original dataset and the weighted average of the information entropies of the two subsets obtained by splitting the dataset on the presence of the substructure (Sokolova and Szpakowicz, 2010). We previously used IG to detect privileged substructures whose occurrence is strongly related to certain endpoints (Shen et al., 2010).
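The IG calculation described above can be sketched as follows. This is a minimal illustration that assumes binary toxicity labels and a precomputed presence/absence flag for one substructure; the function names and the toy data are hypothetical, and substructure matching itself is outside the scope of the sketch.

```python
import math
from typing import Sequence


def entropy(n_toxic: int, n_total: int) -> float:
    """Shannon entropy (in bits) of a set with n_toxic toxic compounds out of n_total."""
    if n_total == 0:
        return 0.0
    h = 0.0
    for count in (n_toxic, n_total - n_toxic):
        if count:
            p = count / n_total
            h -= p * math.log2(p)
    return h


def information_gain(has_substructure: Sequence[int], is_toxic: Sequence[int]) -> float:
    """Entropy of the whole dataset minus the weighted entropy after splitting on the substructure."""
    n = len(is_toxic)
    with_sub = [t for h, t in zip(has_substructure, is_toxic) if h]
    without_sub = [t for h, t in zip(has_substructure, is_toxic) if not h]
    h_parent = entropy(sum(is_toxic), n)
    h_split = (
        len(with_sub) / n * entropy(sum(with_sub), len(with_sub))
        + len(without_sub) / n * entropy(sum(without_sub), len(without_sub))
    )
    return h_parent - h_split


# Hypothetical toy data: 8 compounds, 4 of them toxic.
toxic = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_alert = [1, 1, 1, 1, 0, 0, 0, 0]   # present in exactly the toxic compounds
useless_alert = [1, 0, 1, 0, 1, 0, 1, 0]   # equally frequent in both classes

ig_perfect = information_gain(perfect_alert, toxic)
ig_useless = information_gain(useless_alert, toxic)
```

A substructure that separates toxic from non-toxic compounds completely reaches the maximum IG for a balanced dataset (1 bit), whereas a substructure that occurs equally often in both classes has an IG of zero.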

# PROGRESS IN TOXICITY PREDICTION

## Carcinogenicity and Mutagenicity

Chemical carcinogenesis is of increasing importance in drug discovery because of its serious effect on human health. Most predictive models use the Carcinogenic Potency Database (CPDB) as the data source, which contains more than 1,500 chemicals labeled as carcinogen or non-carcinogen according to their TD<sub>50</sub> values (Gold et al., 2005). Several recent publications have described protocols for constructing models to predict chemical carcinogenesis with methods including Naïve Bayes, kNN, probabilistic neural network, and SVM (Singh et al., 2013; Tanabe et al., 2013; Li et al., 2015; Zhang H. et al., 2016). Zhang et al. developed a web server, CarcinoPred-EL, for chemists to predict carcinogenicity online, in which ensemble XGBoost was used to build the model (Zhang et al., 2017).

Because of the complicated mechanisms and the limited data available, predictive models based on phenotypic assays are not yet precise and reliable enough. An alternative is to construct models based on in vitro assays. The mechanisms of chemical carcinogenesis can be categorized into (1) genotoxic mechanisms, which are primarily caused by the mutagenicity of chemicals that damage DNA (Fan et al., in press), and (2) non-genotoxic mechanisms, in which carcinogens act through different specific mechanisms that are more complicated (Golbamaki and Benfenati, 2016). The Ames test, devised by Bruce Ames, is a well-known in vitro assay for detecting the mutagenic effects of chemicals. Currently, Ames mutagenicity data are available for more than 8,000 compounds. Both predictive models and structural alerts have been developed with these toxicity data in recent years (Kazius et al., 2005; Hansen et al., 2009; Xu et al., 2012; Yang et al., 2017a).

## Acute Oral Toxicity

According to the exposure route, acute toxicity can be divided into oral, dermal, and inhalation toxicity, among which acute oral toxicity is the most widely studied in computational prediction. It is often the first endpoint assessed in drug discovery, because any compound causing acute toxicity will not be considered further owing to its serious hazard to human health. Zhu et al. collected 7,385 compounds with LD<sub>50</sub> values and built several models for the prediction of chemical acute oral toxicity (Zhu et al., 2009). Based on this data set, several machine learning methods were developed and applied to construct classification and regression models that predict LD<sub>50</sub> values or toxicity categories (Li et al., 2014; Lei et al., 2016; Xu et al., 2017). Notably, the models built by Xu et al. performed well on two test sets, with more than 95% accuracy for classification and an R<sup>2</sup> of 0.861 for regression, and the model is freely available as a web server (http://www.pkumdl.cn/DLAOT/DLAOThome.php).

## Cardiotoxicity

Blockade of the hERG (human ether-a-go-go related gene) potassium channel is the main adverse effect with regard to cardiotoxicity (Gintant et al., 2016). Several in silico models were developed according to the in vitro hERG blockage test in early screening assays. Our group recently developed an in silico model that used chemical category approaches to predict hERG blockage (Zhang et al., 2016b), in which 1,570 unique compounds were collected from ChEMBL database and early studies (Doddareddy et al., 2010; Wang et al., 2012). In addition to machine learning methods, combination with multiple pharmacophores can improve the predictive capabilities and the model would be more interpretable (Wang et al., 2016).

However, because the simplified in vitro approaches for detecting cardiac safety are less specific, the in silico models built on them will also produce false-positive predictions that may result in unwarranted attrition of novel drug candidates (Gintant et al., 2016). Other categories such as contractile and structural cardiotoxicity should be considered, and more in vitro or in vivo data should be used to construct more sophisticated models.

## Hepatotoxicity

Chemical hepatotoxicity in drug discovery, also termed "drug induced liver injury (DILI)," is the leading cause of drug failure or withdrawal from the market (Schuster et al., 2005). Owing to its complicated mechanisms and its variability among patients, experimental detection of hepatotoxicity in preclinical and clinical trials is difficult.

Computational approaches for predicting the DILI potential of compounds are widely applied because of their low cost and high efficiency. Hewitt and Przybylak reviewed the in silico models for DILI prediction published from 2000 to 2015, including statistics-based methods and expert systems (Hewitt and Przybylak, 2016). These models used chemical or hybrid descriptors as features and different machine learning methods, such as linear discriminant analysis and ANN, to predict general or specific endpoints related to hepatotoxicity (Hewitt and Przybylak, 2016). Zhu and Kruhlak constructed a human hepatotoxicity database for QSTR models using postmarket safety data from the FDA adverse event reporting system (Zhu and Kruhlak, 2014). Our group previously used molecular fingerprints and machine learning methods to build classification models with a data set containing 1,317 diverse compounds (Zhang et al., 2016a). Xu et al. used a deep learning method called undirected graph recursive neural networks (UGRNN), which encodes molecules as undirected graphs, to build QSTR models (Xu et al., 2015). The performance was excellent compared with other models, with an AUC of up to 0.955. More recently, Mulliner et al. classified the complex pathology of hepatotoxicity into 21 endpoints at three levels, using a large data set comprising 3,712 compounds. The specific models were then combined into an optimized global human hepatotoxicity model with a sensitivity of 68% and a specificity of 95% (Mulliner et al., 2016).

## Respiratory Toxicity

Respiratory toxicity is another toxicity category with complicated mechanisms. The endpoint of greatest concern is drug-induced interstitial lung disease (DILD), which can be classified into two categories in terms of mechanism: (1) cytotoxic lung injury and (2) immune-mediated lung injury (Matsuno, 2012). Another type of respiratory toxicity is respiratory sensitization, whose mechanism is more complicated. There are still no good models for the identification of respiratory sensitization (Mekenyan et al., 2014; Dik et al., 2015). Current QSTR studies tend to use phenotypic data such as LD<sub>50</sub> or LC<sub>50</sub> values, or symptoms such as asthma, as endpoints to represent the respiratory toxicity of a chemical, and the resulting models performed reasonably well (Jarvis et al., 2015; Lei et al., 2017).

## Irritation and Corrosion

Risk assessment of eye and skin irritation/corrosion (EI/EC, SI/SC) is important in the pharmaceutical and cosmetics industries. Although these endpoints might not be directly considered at the drug discovery stage, in silico models for them are still required, since many substances may cause irritation and corrosion and should be assessed, including ocular and dermal pharmaceuticals and final products used in manufacturing, agriculture, and warfare (Wilhelmus, 2001; Kolle et al., 2017).

Verheyen et al. evaluated the existing QSTR models in Derek Nexus, Toxtree, and Case Ultra for the prediction of skin and eye irritation/corrosion and found the performance of those models unsatisfactory because of their narrow applicability domains and low accuracy (Verheyen et al., 2017). However, machine learning methods have been reported to predict eye injury with high performance. For instance, Verma and Matthews built combined QSTR models with ANN and obtained 88% sensitivity and 82% specificity for EI (Verma and Matthews, 2015a), and 96% sensitivity and 91% specificity for EC (Verma and Matthews, 2015b). Our group recently developed in silico models for EI/EC using machine learning methods and molecular fingerprints (Wang et al., 2017). In that study, more positive data were manually collected from X-Mol (http://www.x-mol.com) and ChemIDplus, and the models performed very well, with overall accuracies of 94.6% for EI and 95.9% for EC.

## Endocrine Disruption

Chemicals that interact with nuclear receptors such as the estrogen and androgen receptors (ER and AR), either as off-targets of drugs or through environmental exposure, may cause endocrine disruption. These chemicals, called endocrine disrupting chemicals (EDCs), may interfere with the normal functions of the endogenous steroid hormones and lead to adverse health consequences such as tissue or organ proliferation, reproductive disorders, metabolic disorders, or even cancers (Colborn, 1995; Chawla et al., 2001; Grün and Blumberg, 2007).

For specific mechanisms such as binding to ER, using in silico models to predict the bioactivity of chemicals and evaluate their risk of being EDCs is preferred for its high accuracy and lower cost. We previously built in silico models for AR and ER binding using molecular fingerprints and machine learning methods, and the best performance on the test set was 0.84 and 0.79, respectively (Chen et al., 2014). The Tox21 project also includes nuclear receptor assays, which involve more diverse compounds (Hsieh et al., 2015). DeepTox, the winner of the "Tox21 Data Challenge," used deep neural networks and outperformed other machine learning methods such as SVM (Mayr et al., 2016).

Previous studies on EDCs mainly focused on nuclear receptors. However, chemicals that do not directly interact with these receptors may also cause endocrine disruption through related pathways. For instance, aromatase (CYP19A1) is an important enzyme in the biosynthesis of estrogen and plays a key role in maintaining the balance between estrogen and androgen in many EDC-sensitive organs (Sonnet et al., 1998). Therefore, we recently built in silico models for the prediction of aromatase inhibitors as potential EDCs using machine learning methods with molecular fingerprints (Du et al., 2017). The data used for training and testing were collected from Tox21, and the best model achieved an accuracy of 0.84 on the test set and 0.91 on the external validation set.

## Eco-Toxicity

Pharmaceuticals and their metabolites released into the environment may affect the ecosystem, since they are designed to be bioactive in living organisms (Halling-Sørensen et al., 1998). For instance, chemicals with binding affinities to hormone receptors may act as EDCs in fish, or may accumulate in fish and eventually reach animals higher in the food chain (He et al., 2017). To evaluate the environmental persistence of a chemical, the biodegradation half-life is widely used as a common criterion (Raymond et al., 2001). We previously categorized chemicals as readily biodegradable or not readily biodegradable according to their biological oxygen demand (BOD), with a threshold of 60%, and built several classification models. The best model used kNN with molecular descriptors and had an AUC of 0.873 on the test set (Cheng et al., 2012a).

Fish are usually used as model species to evaluate aquatic toxicity, and avian species are widely used to evaluate terrestrial toxicity. Our group previously collected LC<sub>50</sub> data for three fish species from the ECOTOX database and built several local and global models (Sun et al., 2015). Recently, we reported a model focusing on the aquatic toxicity of pesticides and found that the molecular fingerprints performed differently in local and global models (Li et al., 2017). For avian species, several in silico models have been developed, including classification (Zhang et al., 2015) and regression models (Mazzatorta et al., 2006; Toropov and Benfenati, 2006). In addition to the species mentioned above, another commonly used model species in eco-toxicology is Tetrahymena pyriformis (Sauvant et al., 1999). Cheng et al. collected 1,571 unique chemicals with toxicity data for Tetrahymena pyriformis and built several models, the best of which reached 92.6% on the validation set (Cheng et al., 2011a).

#### SOFTWARE AND WEB SERVERS

Currently, many software packages and web servers can predict chemical toxicity before synthesis. Drug design software suites such as Discovery Studio and Pipeline Pilot integrate toxicity prediction models to help filter out compounds with a risk of toxicity. However, their endpoints are not as diverse as those in toxicity-oriented commercial software such as ADMET Predictor, Leadscope, and Lhasa Derek, which focus primarily on predicting and flagging molecules with potential toxicity.

Free software and web servers are preferred in academia and can promote the development of high-quality models and algorithms and their application in various fields, including drug discovery. The OECD Toolbox is an official suite for toxicity prediction and modeling using QSTR. Web servers are easier and more lightweight to use and will be preferred by those outside computational toxicology, such as medicinal chemists. Lazar is such a tool; it predicts several toxicity endpoints and offers a user interface for drawing chemical structures (Maunz et al., 2013). ToxTree is an open source application that estimates toxic hazard by applying a decision tree approach (Patlewicz et al., 2008). Compared with QSTR-like models, ToxTree is more interpretable, and the fragments (SAs) can guide chemists in modifying molecules. The performance of ToxTree, the OECD Toolbox, and other commercial tools has been compared in the literature (Devillers and Mombelli, 2010; Mombelli and Devillers, 2010; Bhatia et al., 2015; Bhhatarai et al., 2016). Our group developed admetSAR, which can also predict the toxicity of compounds supplied in SMILES format (Cheng et al., 2012b).

Web servers such as ChemSAR (Dong et al., 2017b) and ChemBench (Capuzzi et al., 2017) enable users to build custom models for particular use with machine learning methods and molecular descriptors. For chemists who have in-house data for some particular endpoints, it will be convenient to use these web servers to build predictive models to prioritize or substitute in vitro or in vivo tests.

#### PERSPECTIVES

Although in silico prediction of chemical toxicity has made good progress in recent years, some challenges and limitations remain. First, data quality is still a major issue. Currently, many toxicity data are obtained from high-throughput in vitro assays or in vivo tests on animals. For example, Tox21 and ToxCast provide activity data for thousands of chemicals against hundreds of assays (Huang et al., 2016). False positive and false negative results are inevitable in those assays, and it is also questionable whether in vivo data from animals can be applied directly to humans. Therefore, more data from clinical trials and clinical applications are in high demand.

Second, more computational methods should be developed to enhance the accuracy of the predictive models. For instance, read-across has gained wide attention recently because it can fill data gaps (Shah et al., 2016). Meanwhile, for endpoints with complex mechanisms, such as hepatotoxicity and respiratory toxicity, computational systems toxicology has emerged, which uses comprehensive data sources from the gene to the organ level to understand the mechanisms of toxicity (Jack et al., 2013; Sauer et al., 2015). With the help of machine learning methods and cheminformatics techniques, more accurate models could be developed for toxicity prediction.

Third, medicinal chemists are more interested in the relationship between substructures and chemical toxicity, which can guide the optimization of lead compounds. Using computational tools to identify SAs is a promising approach. Current approaches to SA identification rank substructures only by their frequency of occurrence and thus generate numerous but redundant substructures, disregarding the chemical or biological mechanisms (Yang et al., 2017a). It is not difficult to obtain "potential" SAs for almost every endpoint supported by assay results, yet innovative protocols or frameworks are still required to further refine these substructures and explore the chemical mechanisms of toxicity.

### AUTHOR CONTRIBUTIONS

YT, GL, and WL contributed conception and design of the study; HY wrote the first draft of the manuscript; HY and LS wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

# ACKNOWLEDGMENTS

This work was supported by the National Key Research and Development Program of China (Grant 2016YFA0502304), the National Natural Science Foundation of China (Grants 81373329 and 81673356) and the 863 Project (Grant 2012AA020308).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Yang, Sun, Li, Liu and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts

Hongbin Yang, Lixia Sun, Weihua Li, Guixia Liu and Yun Tang\*

Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China

Keywords: drug safety, chemical toxicity, drug design, machine learning, structural alerts

#### **A corrigendum on**

**In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts** by Yang, H., Sun, L., Li, W., Liu, G., and Tang, Y. (2018). Front. Chem. 6:30. doi: 10.3389/fchem.2018.00030

In the original article, there was an error in Equation (6), which was given as:

$$\text{Specificity} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{6}$$

A correction has been made to Model Building With Machine Learning Methods, Model Evaluation, Equation (6):

$$Specificity = \frac{TN}{TN + FP} \tag{6}$$

The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way.

The original article has been updated.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Yang, Sun, Li, Liu and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Edited and reviewed by: Daniela Schuster, Paracelsus Medizinische Privatuniversität, Austria

\*Correspondence: Yun Tang ytang234@ecust.edu.cn

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 25 March 2018 Accepted: 04 April 2018 Published: 13 April 2018

#### Citation:

Yang H, Sun L, Li W, Liu G and Tang Y (2018) Corrigendum: In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts. Front. Chem. 6:129. doi: 10.3389/fchem.2018.00129

# Supramolecular Organization of Nonstoichiometric Drug Hydrates: Dapsone

Doris E. Braun\* and Ulrich J. Griesser

Institute of Pharmacy, University of Innsbruck, Innsbruck, Austria

The observed moisture- and temperature-dependent transformations of the dapsone (4,4′-diaminodiphenyl sulfone, DDS) 0.33-hydrate were correlated to its structure and the number and strength of the water-DDS intermolecular interactions. A combination of characterization techniques was used, including thermal analysis (hot-stage microscopy, differential scanning calorimetry and thermogravimetric analysis), gravimetric moisture sorption/desorption studies and variable humidity powder X-ray diffraction, along with computational modeling (crystal structure prediction and pair-wise intermolecular energy calculations). Depending on the relative humidity, the hydrate contains between 0 and 0.33 molecules of water per molecule of DDS. The crystal structure is retained upon dehydration, indicating that the DDS hydrate shows non-stoichiometric (de)hydration behavior. Unexpectedly, the water molecules are not located in structural channels but at isolated sites of the host framework, which is counterintuitive for a hydrate with non-stoichiometric behavior. The water-DDS interactions were estimated to be weaker than the water-host interactions commonly observed in stoichiometric hydrates, and the lattice energies of the isomorphic dehydration product (hydrate structure without water molecules) and form III differ by only ∼1 kJ mol<sup>−1</sup>. The computational generation of hypothetical monohydrates confirms that the hydrate with the unusual DDS:water ratio of 3:1 is more stable than a feasible monohydrate structure. Overall, this study highlights that a deeper understanding of the formation of hydrates with non-stoichiometric behavior requires a multidisciplinary approach including suitable experimental and computational methods, providing a firm basis for the development and manufacturing of high quality drug products.

Keywords: dapsone, hydrate, crystal structure prediction, temperature and moisture dependent stability, intermolecular energy

#### Edited by:

Honglin Li, East China University of Science and Technology, China

#### Reviewed by:

Ariel Fernandez, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

Mariya al-Rashida, Forman Christian College, Pakistan

\*Correspondence: Doris E. Braun doris.braun@uibk.ac.at

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 10 October 2017 Accepted: 09 February 2018 Published: 22 February 2018

#### Citation:

Braun DE and Griesser UJ (2018) Supramolecular Organization of Nonstoichiometric Drug Hydrates: Dapsone. Front. Chem. 6:31. doi: 10.3389/fchem.2018.00031

#### INTRODUCTION

The vast majority of drugs are formulated and administered in a solid (mostly crystalline) form, since this aggregation state assures the highest chemical and storage stability of the drug compound. However, a drug compound may occur in a variety of different solid state forms, which are subsumed under the general term "polymorphism," comprising one-component forms (polymorphs, amorphous form) and multicomponent phases (hydrates, solvates, co-crystals). The statement "Many people think that polymorphism and solid state chemistry is the hardest thing to get right in drug development" (Byrn, 2004) clearly reflects the challenges encountered and the efforts to be undertaken in pre-formulation to guarantee that the best solid form is used in a drug formulation. The molecular structure of a drug compound determines its biological/pharmacological properties and is thus an invariant, i.e., it cannot be changed in order to optimize the physicochemical and biopharmaceutical properties of a drug. The only strategy to improve such properties at the molecular level is the formation of bioreversible derivatives of the drug compound (prodrugs), which are transformed into the active molecules by metabolic processes in the organism (Rautio et al., 2008, 2017). The molecular features of a drug (molecular size, shape, flexibility, hydrogen bond donors/acceptors, etc.) determine its potential to occur in different "supramolecular" states (solid state forms), which may exhibit significantly different physicochemical properties that are critical for achieving optimal performance of a pharmaceutical product. The most critical parameters are equilibrium solubility and dissolution rate, but differences in density, hardness, melting point, mechanical strength, chemical stability, etc. may also affect manufacturing processes and are relevant for the shelf-life stability and ultimately the bioavailability of the final dosage form. Thus, identifying solid state forms of a drug and understanding their phase relationships, interconversion pathways and properties is a key concern in modern drug development (Byrn et al., 1999; Bernstein, 2002; Hilfiker, 2006; Brittain, 2009). Multiple solid forms, including salts, co-crystals and solvates, have been found for 90% of molecules (Stahly, 2007) and therefore considerably extend the range of solid form options available for delivering drugs.

The past experience of late-appearing, more stable forms, as in the case of ritonavir (Chemburkar et al., 2000) or rotigotine (Perez-Lloret et al., 2013), has not only raised awareness of the issue of solid forms but also led to the implementation of polymorphism screenings, i.e., surveys of crystallization conditions designed to find and identify solid forms of a drug substance, as a routine in the pre-formulation phase. Experimental solid form screens may encompass up to thousands of crystallization experiments and need to be tailored to the properties of the investigated molecule (Newman, 2013; Cruz-Cabeza Aurora et al., 2015). The wide range of methods that have led to the discovery of novel forms (Llinàs and Goodman, 2008) highlights, however, that there is no standard recipe for comprehensive experimental solid form screening. Furthermore, because there is no endpoint in experimental solid form screening, a computational method ensuring that all relevant forms have been found is in high demand. To this end, crystal structure prediction (CSP) on smaller pharmaceuticals has shown high promise in complementing experimental solid form screening, helping to rationalize and unify experimental observations on polymorphs, hydrates and solvates (Cruz-Cabeza et al., 2008; Campeta et al., 2010; Braun et al., 2011a, 2014a,b, 2016; Baias et al., 2013; Bhardwaj et al., 2013; Ismail et al., 2013; Kendrick et al., 2013; Price et al., 2014, 2016; Singh and Thakur, 2014; Braun and Griesser, 2016b; Price and Reutzel-Edens, 2016). The aim of an experimental polymorph screen is the identification of those solid state forms that are relevant for product development, and the main expectation of a CSP study is the confirmation that those forms are among the lowest energy structures. Yet, computing the crystal energy landscapes of larger drug molecules, including their hydrates and solvates, is still too complex and computationally very (time) demanding. For multicomponent systems, the host (drug molecule) and different guest molecules in different stoichiometric ratios would have to be considered separately.

Generating knowledge of how water (vapor) is associated with a specific material and how it affects the stability of a product is a crucial task in pre-formulation studies, because water inevitably appears in the manufacturing and storage process of pharmaceutical products. Knowledge about hydrate formation (water adducts) is of importance, as hydrates can be the most stable solid form at relevant production and storage conditions and it is well-known that at least one-third of organic (drug) molecules (Stahly, 2007; Braun, 2008; Cruz-Cabeza Aurora et al., 2015) form hydrates. A transformation to a hydrate may be unavoidable. In a hydrate the water molecules occupy regular positions in the crystal lattice of the parent substance. The water can either fill structural voids or be an integral part of the structure. Based on the moisture sorption/desorption behavior hydrates can be subdivided into two main classes (Gal, 1968; Griesser, 2006). "Stoichiometric" hydrates are regarded as molecular compounds. Dehydration always leads to a different structure or the amorphous state. "Non-stoichiometric" hydrates incorporate a range of water levels as a function of temperature and water vapor pressure. The latter often host water molecules in open structural voids that allow for reversible water uptake/release without significant changes in the crystal structure. The water in non-stoichiometric hydrates is often rather weakly bound and may interact with other components compromising the stability and performance of formulated products. Thus, knowledge of hydrate formation, moisture and temperature dependent stability is crucial for the development of a high quality fine chemical product.

Dapsone (4,4′-diaminodiphenyl sulfone; DDS, **Figure 1**) was chosen in this study as a model compound for evaluating the value of computational chemistry in solid form screening and characterization of a pharmaceutical hydrate. The compound was first synthesized over 100 years ago (Fromm and Wittmann, 1908), and its antimicrobial activity and therapeutic use for leprosy were already studied in the 1940s. DDS has reinvented itself as a drug many times and has been used for numerous indications, including the treatment of leprosy, dermatitis herpetiformis and malaria, and the prophylaxis of pneumocystosis (Wolf and Orni-Wasserlauf, 2000; Wozel and Blasum, 2014).


Today, it is mainly used as first-line drug in the treatment of leprosy in combination with rifampicin and clofazimine. As such, it is listed in the WHO's List of Essential Medicines (medications satisfying the priority health care needs in humans) (World Health Organization, 2017). The compound is known to be polymorphic (anhydrate forms **I**–**IV**), with form **III** being reported to be the most stable form (Brandstaetter-Kuhnert et al., 1963; Kuhnert-Brandstatter and Moser, 1979). Single crystal structures are known for anhydrate forms **III** (Dickinson et al., 1970; Deo et al., 1980; Bocelli and Cantoni, 1990; Su et al., 1992; Bertolasi et al., 1993) and **II** (Braun et al., 2017). It is also known that DDS forms solvates with dichloromethane, 1,4-dioxane and tetrahydrofuran (Babashkina et al., 2012; Lemmer et al., 2012). Furthermore, several crystal structure determinations of a hydrate with the unusual DDS:water stoichiometry of 3:1 (0.33 hydrate) have been reported (Kuz'mina et al., 1981; Bel'skii et al., 1983; Yathirajan et al., 2014). Apart from these structure reports no other information about the hydrate can be found in the literature.

The aim of this study was to unravel the molecular/structural reasons for hydrate formation in DDS, the structural and thermodynamic relationship between the **0.33-Hy** and water-free DDS forms and their interconversion pathways as a function of temperature and humidity/water activity through a combination of computational and experimental methods. A range of experimental techniques (crystallization, slurry experiments, thermal analysis and X-ray diffraction), along with CSP and pair-wise intermolecular energy calculations, were applied to explore the solid forms at an atomistic level. The method applied for estimating the intermolecular interaction energies (CE-B3LYP), best described as a hybrid method, performed surprisingly well compared to B3LYP-D2/6-31G(d,p) counterpoise-corrected energies, but in considerably less computation time (Turner et al., 2014). However, calculating water interactions in organic (drug) hydrates represents a big challenge, as the balance of host (organic molecule)-host, host-water, and water-water intermolecular interactions has to be modeled accurately. Most simple water potentials have been parametrized against a wide range of liquid properties (Guillot, 2002). A potential for studying ices and amorphous water, which reparametrized the TIP4P potential to reproduce the density of several forms of ice, has been developed by Abascal et al. (2005). Very recently it has been demonstrated (in lead optimization) that accurately modeling intermolecular interactions involving water requires the incorporation of three-body terms and nanoscale treatment of the dielectric response of confined frustrated water molecules (Fernández, 2016, 2017; Fernandez and Scott, 2017). Nevertheless, in our study we decided to test the applicability of the readily available and transferable CE-B3LYP method, which was not explicitly developed for hydrate structures. We address the role of CSP in hydrate screening and modeling and investigate whether it is possible to derive information about hydrate stability and dehydration mechanism based on structural classifications and simple intermolecular interaction energy estimations (i.e., estimating the strengths of host-host, water-host and water-water interactions).

# MATERIALS AND METHODS

## Computational Generation of the Monohydrate Crystal Energy Landscape

The global energy minimum of DDS, obtained using Gaussian09 (Frisch et al., 2009), was used in the CSP searches. A total of 350,000 Z′ = 1 monohydrate structures were generated using CrystalPredictor2.0 (Karamertzanis and Pantelides, 2005, 2007; Habgood et al., 2015) in 48 common space groups for organic molecules (Supplementary Material). The molecules were held rigid and the lattice energy was evaluated by an exp-6 potential with atomic charges derived using the CHELPG scheme (Breneman and Wiberg, 1990) and minimized. The 10,000 lowest energy crystal structures were used as starting points for optimizing the intermolecular lattice energy (Uinter), with an improved model for the intermolecular forces. This was calculated using the FIT exp-6 potential parameters (Coombes et al., 1996), the sulfur potential derived by Scheraga (Day et al., 2009) and the distributed multipoles (Stone, 2005) derived from the PBE0/6-31G(d,p) charge density using GDMA2 (Stone, 2010).

The optimal proton positions of the amino group and orientation of the phenyl groups, in all crystal structures within 15 kJ mol<sup>−1</sup> of the global minimum (116 structures), were determined using the CrystalOptimizer database method (Kazantsev et al., 2011). This was done by minimizing the lattice energy (Elatt), calculated as the sum of the intermolecular contributions (Uinter) and the conformational energy penalty paid for distortion of the molecular geometry to improve the hydrogen bonding geometries. Conformational energy penalties (ΔE<sub>intra</sub>, with respect to the pyramidal global conformational energy minimum) and isolated molecule charge densities were computed at the PBE0/6-31G(d,p) level, for each conformation considered in the minimization of Elatt. All isolated-molecule wave function calculations were performed using Gaussian09 (Frisch et al., 2009) and intermolecular lattice energies using DMACRYS (Price et al., 2010).

The 100 most stable structures (within 30 kJ mol<sup>−1</sup> of the global minimum) were used as starting points for periodic electronic structure calculations. The DFT-D calculations were carried out with the CASTEP plane wave code (Clark et al., 2005) using the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation (GGA) exchange-correlation density functional (Perdew et al., 1996) and ultrasoft pseudopotentials (Vanderbilt, 1990), with the addition of a semi-empirical dispersion correction developed by Tkatchenko and Scheffler (TS) (Tkatchenko and Scheffler, 2009). Brillouin zone integrations were performed on a symmetrized Monkhorst–Pack k-point grid with the number of k-points chosen to provide a maximum spacing of 0.07 Å<sup>−1</sup> and a basis set cut-off of 780 eV. The self-consistent field convergence on total energy was set to 1 × 10<sup>−5</sup> eV. Energy minimizations were performed using the Broyden–Fletcher–Goldfarb–Shanno optimization scheme within the space group constraints. The optimizations were considered complete when energies were converged to better than 2 × 10<sup>−5</sup> eV per atom, atomic displacements to 1 × 10<sup>−3</sup> Å, maximum forces to 5 × 10<sup>−2</sup> eV Å<sup>−1</sup>, and maximum stresses to 1 × 10<sup>−1</sup> GPa. Isolated molecule minimizations to compute the isolated DDS and water energy (Ugas) were performed by placing a single molecule in a fixed cubic 35 × 35 × 35 Å<sup>3</sup> unit cell and optimizing it with the same settings as used for the crystal calculations.

The experimental hydrate, lower hydrates and forms **II** and **III** of DDS, as well as other selected hydrates (Supplementary Material), were minimized with CASTEP using the same settings as described for generating the monohydrate crystal energy landscape.

## Crystal Explorer Calculations

The pair-wise energy contributions to **0.33-Hy** and other well-characterized hydrate structures have been calculated using CrystalExplorer V17 (Turner et al., 2014, 2015; Mackenzie et al., 2017). The optimized atomic positions (PBE-TS) have been used in all subsequent intermolecular interaction energy calculations. The model energies have been calculated between all unique nearest neighbor molecular pairs. The model used (termed CE-B3LYP) employs B3LYP/6-31G(d,p) molecular wave functions calculated for the molecular geometries extracted from the crystal structures. This approach uses electron densities of unperturbed monomers to obtain four separate energy components: electrostatic (EE), polarization (EP), dispersion (ED), and exchange-repulsion (ER). Each energy term was scaled independently to fit a large training set of B3LYP-D2/6-31G(d,p) counterpoise-corrected energies from both organic and inorganic crystals. The CE-B3LYP energies reproduced the training set energies with a mean absolute deviation of ∼1 kJ mol<sup>−1</sup> (Turner et al., 2014).
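The way the four scaled components combine into a single pair energy can be sketched in a few lines of Python. This is only an illustration, not the CrystalExplorer implementation: the scale factors are the values commonly quoted for the CE-B3LYP model from Mackenzie et al. (2017) and should be checked against that reference, and the component energies in the example are hypothetical placeholders rather than values from this study.

```python
# Scale factors for CE-B3LYP as commonly quoted from Mackenzie et al. (2017).
# Assumption: verify these against the original reference before relying on them.
CE_B3LYP_SCALE = {"ele": 1.057, "pol": 0.740, "dis": 0.871, "rep": 0.618}


def ce_total_energy(e_ele, e_pol, e_dis, e_rep, scale=CE_B3LYP_SCALE):
    """Return E_tot = kE*EE + kP*EP + kD*ED + kR*ER (all energies in kJ/mol)."""
    return (scale["ele"] * e_ele + scale["pol"] * e_pol
            + scale["dis"] * e_dis + scale["rep"] * e_rep)


# Hypothetical component energies (kJ/mol) for one molecular pair.
example = ce_total_energy(e_ele=-30.0, e_pol=-5.0, e_dis=-20.0, e_rep=25.0)
print(f"E_tot = {example:.2f} kJ/mol")
```

Keeping the scale factors in a dictionary makes it easy to swap in the factors of a different model energy without touching the summation.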

## Conformational Analysis

Conformational energy scans were performed at the B3LYP/6-31G(d,p) level of theory using Gaussian09 (Frisch et al., 2009), allowing the two torsion angles defining the position of the phenyl rings, C–C–S–C, to rotate by 360° in 20° steps.
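To give a sense of the size of such a scan, the following Python sketch enumerates the torsion-angle grid. It assumes that both C–C–S–C torsions are scanned jointly on a two-dimensional grid and that 360° is treated as equivalent to 0°; neither detail is stated explicitly above, and the function name is purely illustrative.

```python
from itertools import product

STEP_DEG = 20  # step size stated in the text


def torsion_grid(step=STEP_DEG):
    """Return all (tau1, tau2) pairs for a full-turn scan of two torsions.

    Assumption: 360 degrees is equivalent to 0 degrees, so each torsion
    takes the values 0, 20, ..., 340.
    """
    angles = range(0, 360, step)
    return list(product(angles, repeat=2))


grid = torsion_grid()
print(len(grid))  # number of constrained geometries in the 2D scan
```

Under these assumptions the scan comprises 18 × 18 = 324 constrained geometries.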

## Materials and Preparation of DDS Hydrate

Dapsone form **III** (purity 97%) was purchased from Aldrich. The obtained sample was recrystallized from a hot-saturated methanol solution. The solid product was isolated by filtration and consisted of form **III**. The organic solvents used were all of analytical grade and purchased from Aldrich or Fluka.

DDS **0.33-Hy** was prepared as follows: (i) a slurry of DDS form **III** in water was stirred in the temperature range from 10 to 30°C for 1 week. The suspension was filtered and the solid was stored at ambient conditions. (ii) A hot saturated solution of form **III** in water (close to the boiling point) was cooled to room temperature (RT). Within 2 days large elongated **0.33-Hy** crystals were obtained.

## Thermal Analysis

For hot-stage thermomicroscopic (HSM) investigations a Reichert Thermovar polarization microscope, equipped with a Kofler hot-stage (Reichert, A), was used. Photographs were taken with an Olympus DP71 digital camera (Olympus, A).

Differential Scanning Calorimetry (DSC) thermograms were recorded on a Diamond DSC (Perkin-Elmer, Norwalk, CT, USA) controlled by the Pyris 7.0 software. Using a UM3 ultramicrobalance (Mettler, Greifensee, CH), samples of ∼5–7 mg were weighed into perforated or sealed aluminum pans. The samples were heated using rates between 1 and 20°C min<sup>−1</sup> and cooled using a rate of 5 or 10°C min<sup>−1</sup> with dry nitrogen as the purge gas (purge: 20 mL min<sup>−1</sup>). The instrument was calibrated for temperature with pure benzophenone (mp 48.0°C) and caffeine (mp 236.2°C), and the energy calibration was performed with indium (mp 156.6°C, heat of fusion 28.45 J g<sup>−1</sup>). The errors on the stated temperatures (extrapolated onset temperatures) and enthalpy values were calculated at the 95% confidence interval (CI) and are based on at least five measurements.

Thermogravimetric Analysis (TGA) was carried out with a TGA7 system (Perkin-Elmer, Norwalk, CT, USA) using the Pyris 2.0 software. Approximately 7–10 mg of sample was weighed into a platinum pan. Two-point calibration of the temperature was performed with ferromagnetic materials (Alumel and Ni, Curie-point standards, Perkin-Elmer). Heating rates of 5 and 10°C min<sup>−1</sup> were applied and dry nitrogen was used as a purge gas (sample purge: 20 mL min<sup>−1</sup>, balance purge: 40 mL min<sup>−1</sup>).

## Powder X-Ray Diffraction (PXRD)

PXRD patterns were obtained using an X'Pert PRO diffractometer (PANalytical, Almelo, NL) equipped with a θ/θ coupled goniometer in transmission geometry, programmable XYZ stage with well plate holder, Cu-Kα1,2 radiation source with a focusing mirror, a 0.5° divergence slit, a 0.02° Soller slit collimator on the incident beam side, a 2 mm antiscattering slit, a 0.02° Soller slit collimator on the diffracted beam side and a solid state PIXcel detector. The patterns were recorded at a tube voltage of 40 kV and tube current of 40 mA, applying a step size of 2θ = 0.013° with 200 s per step in the 2θ range between 2° and 40°. For non-ambient RH measurements, a VGI stage (VGI 2000M, Middlesex, UK) was used.

The diffraction patterns were indexed using the first 20 peaks with DICVOL04 and the space group was determined based on a statistical assessment of systematic absences (Markvardsen et al., 2001) as implemented in the DASH structure solution package (David et al., 2006). Pawley fits (Pawley, 1981) and Rietveld refinements (Rietveld, 1969) were performed with Topas Academic V5 (Coelho, 2012). The background was modeled with Chebyshev polynomials and the modified Thompson-Cox-Hastings pseudo-Voigt function was used for peak shape fitting. For the Rietveld refinements the DDS and water molecules were treated as rigid body molecules using the PBE-TS optimized conformations of the **0.33-Hy** structure.

## Gravimetric Moisture Sorption/Desorption Experiments

Moisture sorption and desorption studies were performed with the automatic multisample gravimetric moisture sorption analyser SPS23-10µ (ProUmid, Ulm, D). Approximately 500–750 mg of sample was used for each analysis. The measurement cycles were started at 60% relative humidity (RH) with an initial stepwise desorption (decreasing humidity) to 0% RH, followed by a sorption cycle (increasing humidity) up to 90% RH, a desorption cycle to 0% RH and a final sorption cycle to 90% RH. The RH changes were set to 2% and the equilibrium condition for each step was set to a mass constancy of ± 0.001% over 60 min and a maximum time limit of 48 h per step.
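The humidity program described above can be written out as an explicit list of setpoints. The short Python sketch below is only an illustration of the cycle (60 → 0 → 90 → 0 → 90% RH in 2% steps); the function names are hypothetical and unrelated to the instrument software.

```python
def rh_ramp(start, stop, step=2):
    """RH setpoints from start to stop (both inclusive) in fixed steps (% RH)."""
    direction = 1 if stop >= start else -1
    return list(range(start, stop + direction, direction * step))


def measurement_program():
    """Setpoints for the cycle described in the text: 60->0, 0->90, 90->0, 0->90."""
    program = rh_ramp(60, 0)         # initial desorption
    program += rh_ramp(0, 90)[1:]    # first sorption (skip the repeated 0% point)
    program += rh_ramp(90, 0)[1:]    # second desorption
    program += rh_ramp(0, 90)[1:]    # final sorption
    return program


program = measurement_program()
print(len(program), program[:4], program[-3:])
```

With the 48 h limit per step, the number of setpoints gives an upper bound on the total duration of one measurement.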

## Water Activity Measurements (Slurry Method)

DDS form **III** was stirred (500 r.p.m.) in 1.5–2.5 mL of each methanol/water mixture [each containing a different mole fraction of water corresponding to a defined water activity (Zhu et al., 1996; Supplementary Material)] at 25.0 ± 0.1°C for 21 days. Samples were withdrawn, filtered and the resulting phase was determined using PXRD.

# RESULTS AND DISCUSSION

## Computational Screening for DDS Monohydrates

The fact that the asymmetric unit of **0.33-Hy** (Kuz'mina et al., 1981; Bel'skii et al., 1983; Yathirajan et al., 2014) consists of four crystallographically distinct molecules (three DDS and one water) makes CSP studies for the experimental stoichiometry too time-consuming. Therefore, we decided to generate the monohydrate crystal energy landscape (one DDS and one water molecule) with the aim to estimate whether water molecules can compete against the DDS-DDS intermolecular interactions and form strong DDS-water contacts. In **Figure 2** the computed monohydrate structures are plotted according to lattice energy, which equals the energy that would be required to separate the molecules to infinity, against packing index.

To estimate whether any of the hypothetical monohydrate structures is competitive in energy with form **III**, we compared the lattice energy of the hydrate (Elatt-Hy) to the lattice energies of the anhydrate (Elatt-III) and ice (Elatt-ICE). If Elatt-Hy < Elatt-III + Elatt-ICE (we assume that hydrate formation is thermodynamically driven), then the hydrate is more stable than the anhydrate. Using the lattice energies of **0.33-Hy** and form **III** (**Table 1**) and a value of −59 kJ mol<sup>−1</sup> (Whalley, 1957, 1976) for ice, because the functional used is known to overbind the ice crystal structures (Thierfelder et al., 2006; Beran and Nanda, 2010), only one structure, 01\_1963, was calculated to be more stable than form **III**. The most stable hypothetical monohydrate was estimated to be 12.38 kJ mol<sup>−1</sup> more stable than form **III**, which is a respectable potential energy difference (Δ<sub>trs</sub>U) for a monohydrate with respect to an anhydrate. Thus, the CSP study clearly indicates hydrate formation.

FIGURE 2 | Summary of crystal structure prediction for DDS monohydrate (Z′ = 1), with each symbol denoting a crystal structure by its lattice energy and packing index. The vertical red dotted line separates the monohydrate structure that was calculated to be more stable than form III and ice from other computed hydrate structures that are less stable.

Furthermore, Δ<sub>trs</sub>U for the **0.33-Hy** to form **III** was calculated to be 15.37 kJ mol<sup>−1</sup>, indicating that the experimental hydrate is 3 kJ mol<sup>−1</sup> more stable than the computed lowest energy monohydrate structure and rationalizing why the stable 0.33 hydrate and not a monohydrate is formed experimentally.
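The stability criterion used here amounts to simple bookkeeping of lattice energies. The Python sketch below follows the definition given in the footnote of Table 1 for a monohydrate; the lattice energies passed to the function in the example are hypothetical placeholders (the tabulated values are not reproduced here), whereas the ice value and the two Δ<sub>trs</sub>U values are those quoted in the text.

```python
E_LATT_ICE = -59.0  # kJ/mol, lattice energy of ice quoted in the text


def delta_trs_u(e_latt_hydrate, e_latt_anhydrate, e_latt_ice=E_LATT_ICE):
    """Potential energy difference of a monohydrate relative to anhydrate + ice.

    Follows the Table 1 footnote: -dU = E_hy - (E_anh + E_ice), so a
    positive return value means the hydrate is the more stable option.
    All energies are in kJ/mol per formula unit (one DDS, one water).
    """
    return (e_latt_anhydrate + e_latt_ice) - e_latt_hydrate


# Hypothetical lattice energies (kJ/mol), chosen only to exercise the function.
example = delta_trs_u(e_latt_hydrate=-200.0, e_latt_anhydrate=-130.0)
print(f"hypothetical dU = {example:.1f} kJ/mol, hydrate favoured: {example > 0}")

# Values reported in the text: 0.33-Hy vs. the best computed monohydrate.
gap = 15.37 - 12.38
print(f"0.33-Hy is {gap:.2f} kJ/mol more stable than 01_1963")
```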

## Experimental Screening for DDS Hydrate(s)

To confirm that **0.33-Hy** is indeed the stable DDS hydrate form and that 01\_1963 is not an as yet undiscovered monohydrate, we subjected DDS to an experimental hydrate screening program. Evaporative crystallization experiments of DDS from a saturated (20°C) aqueous solution, as well as cooling crystallization experiments from hot (boiling) saturated solutions in water at 5, 25, 50, and 75°C, resulted in **0.33-Hy** crystals in the form of elongated plates. In contrast, evaporation experiments of a hot-saturated solution of DDS in water resulted in a mixture of **0.33-Hy** and form **III**. Slurry experiments in water, isothermal or cycling between 5 and 50°C, always yielded the **0.33-Hy**.

Moisture sorption experiments are another successful way to produce hydrates. Therefore, forms **III** and **V** (see section Moisture Dependent Stability of the Hydrate) were subjected to automated and manual water vapor sorption experiments. Neither form **III** nor form **V** showed a transformation in the RH range up to 90%. Furthermore, no transformation was observed in long-time storage experiments of the two anhydrous DDS forms over saturated KOAc (24% RH), K<sub>2</sub>CO<sub>3</sub> (43% RH), NaCl (75% RH) and KNO<sub>3</sub> (92% RH) solutions or water (100% RH) within 3 months (end of experiments) at RT and 8°C. Similarly, **0.33-Hy** did not dehydrate or transform to another hydrate when stored under the same conditions over the same time period.

TABLE 1 | Lattice energy calculations (Elatt) of 0.33-Hy, 01\_1963, forms II and III and the isomorphic dehydrate structure (0.33-Hy without water molecules, Hy<sub>dehy</sub>) and potential energy differences (Δ<sub>trs</sub>U) with respect to form III.


<sup>a</sup>Calculated according to: −Δ<sub>trs</sub>U<sub>x−III</sub> = Elatt-x − (Elatt-III + Elatt-ICE). <sup>b</sup>−Δ<sub>trs</sub>U<sub>x−III</sub> = Elatt-x − Elatt-III.

## Characterization of the DDS Hydrate

### DDS Hydrate Structure and Intermolecular Interaction Energies

To identify the key interactions in the DDS hydrate the pair-wise CE-B3LYP intermolecular energies were estimated starting from the PBE-TS optimized hydrate with the CSD Refcode ANSFON02 (Yathirajan et al., 2014). The intermolecular energies are subdivided into classical electrostatic (EE), polarization (EP), dispersion (ED) and exchange-repulsion energies (ER) and can be graphically represented by their "energy frameworks" (Turner et al., 2014, 2015; Mackenzie et al., 2017).

The hydrate crystallizes in the monoclinic space group C2/c with three DDS and one water molecule in the asymmetric unit, rationalizing the 3:1 stoichiometry (Kuz'mina et al., 1981; Bel'skii et al., 1983; Yathirajan et al., 2014). The three crystallographically independent DDS molecules (color coded in the packing diagram shown in **Figure 3A**) exhibit very similar conformations. The first DDS molecule (mol A, shown in red in **Figure 3A**) does not show any interaction with the hydrate water (**Table 1**), but forms two strong intermolecular interactions with itself, denoted with 2 and 4, mediated by inversion and 2-fold symmetry, respectively (**Figure 3B**). Furthermore, mol A forms classical hydrogen bonded interactions with neighboring DDS molecules (interactions 5, 7, and 11; **Table 2**). In contrast to interactions 2 and 4, which have dispersion as the strongest contributor to the energy (π···π stacks), Coulomb interactions are the reason for the stability of the latter three (also true for 3, 6, 8, 9, 14, 15, 16, and 21). Molecule B (green) and C (blue) form the strongest pair-wise interaction (1, **Figure 3B**), which can be related to interaction 4. The third most stable interaction (3) involves mol C and is formed of four C–H···O close contacts. The strongest classical hydrogen bonded interaction (5, **Table 2**), N–H···O, is significantly less stable than interactions 1–4.

The water molecule, interacting only with mol B and mol C, forms three hydrogen bonds, two N–H···Owater and one Owater–H···O. Thus, the water molecule shows the water environment type DAA (Infantes et al., 2007), with A corresponding to hydrogen bonding acceptor and D to hydrogen bonding donor. According to Morris and Rodriguez-Hornedo (Morris and Rodriguez-Hornedo, 1993; Brittain et al., 2009) the DDS hydrate can be classified as an isolated-site hydrate, which retains water in segregated pockets in the crystal structure. The strongest pair-wise water-DDS interaction was calculated to be −22.5 kJ mol<sup>−1</sup>, which is distinctly weaker than the strongest pair-wise DDS-DDS interaction (−53.9 kJ mol<sup>−1</sup>).

The presence of an isolated-site hydrate could be confirmed by calculating the total energy of the hydrate and lower hydrate structures thereof, i.e., structures which were generated by systematically removing water molecules from the packing presented in **Figure 3** (using the P1 cell). **Figure 4** shows that a plot of the energy contributions from the water molecules to the hydrate structure vs. the water content gives a linear relationship. This clearly indicates that the water molecule interacts solely with DDS molecules in the **0.33-Hy**, which is a characteristic feature of an isolated-site hydrate.

### Temperature Dependent Stability of the Hydrate

Key information for handling and storing hydrates is knowledge about temperature- and moisture-dependent stability. The dehydration process of **0.33-Hy** was monitored with HSM (**Figure 5**), DSC and TGA (**Figure 6**). To investigate the impact of the atmospheric conditions on the dehydration behavior and associated processes, different experimental conditions were applied: dry and silicone oil preparations (HSM), heating of the sample in perforated or sealed DSC crucibles and using different heating rates. The obtained thermodynamic data are summarized in **Table 3**.

<sup>a</sup>Electrostatic (E<sub>E</sub>), polarization (E<sub>P</sub>), dispersion (E<sub>D</sub>) and exchange-repulsion energy (E<sub>R</sub>) contributions. E<sub>tot</sub> = k<sub>E</sub>E<sub>E</sub> + k<sub>P</sub>E<sub>P</sub> + k<sub>D</sub>E<sub>D</sub> + k<sub>R</sub>E<sub>R</sub>, with k being scale factors (Mackenzie et al., 2017). <sup>b</sup>Interaction ID; <sup>c</sup>molecule according to Figure 3A; <sup>d</sup>N, number of times the interaction is present; <sup>e</sup>centroid distances.

The dehydration of **0.33-Hy** occurs in the temperature range from 40 to 90°C. With HSM (dry preparation, **Figure 5**) hardly any change is observed during the dehydration process of the **0.33-Hy** to the isostructural dehydrate (**Hy**<sub>dehy</sub>). However, with TGA and DSC the dehydration is well observable. Under N<sub>2</sub> purge (TGA) the dehydration process starts immediately and a mass loss of ∼0.3 mol of water per mol DDS was determined. In DSC investigations (1a, 2a), the dehydration appears as a broad endothermic event which partly overlaps with a second endothermic process at a heating rate of 10°C min<sup>−1</sup>. Using lower heating rates (not shown) the two thermal events can be separated and the heat of dehydration, Δ<sub>dehy</sub>H<sub>Hy−dehy</sub>, of 10.60 kJ mol<sup>−1</sup> (sample contained 0.18 mol water per mol DDS) was determined. The known enthalpy value for the vaporization of water at the dehydration temperature [T<sub>dehy</sub> ≈ 60°C, Δ<sub>vap</sub>H<sub>H2O</sub> = 42.482 kJ mol<sup>−1</sup> (Riddick and Bunger, 1986)] can be subtracted from the measured heat of dehydration (Δ<sub>dehy</sub>H), according to Equation (1), resulting in an estimation of the heat change (Δ<sub>trs</sub>H) upon hydrate to anhydrate transformation. The enthalpy of this reaction was calculated to be 2.95 ± 0.09 kJ mol<sup>−1</sup> (**Table 3**).

FIGURE 4 | Energetic contribution (ΔE<sub>latt</sub>, see Supplementary Table 5) of the water to the DDS hydrate structure as a function of water occupancy (mol ratio water/DDS) in 0.33-Hy.

$$
\Delta_{\rm trs}H_{\rm Hy-dehy} = \Delta_{\rm dehy}H_{\rm Hy-dehy} - 0.18 \cdot \Delta_{\rm vap}H_{\rm H_2O} \tag{1}
$$
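As a quick numerical check of Equation (1), the following Python sketch inserts the values quoted in the text. The variable and function names are illustrative only.

```python
DEHY_H = 10.60        # kJ/mol, measured heat of dehydration (from the text)
VAP_H_WATER = 42.482  # kJ/mol, vaporization enthalpy of water at ~60 degrees C
WATER_PER_DDS = 0.18  # mol water per mol DDS in the measured sample


def transition_enthalpy(dehy_h, n_water, vap_h):
    """Equation (1): subtract the vaporization term from the heat of dehydration."""
    return dehy_h - n_water * vap_h


trs_h = transition_enthalpy(DEHY_H, WATER_PER_DDS, VAP_H_WATER)
print(f"Delta_trs H = {trs_h:.2f} kJ/mol")
```

The result, 10.60 − 0.18 × 42.482 ≈ 2.95 kJ mol<sup>−1</sup>, reproduces the value stated in the text.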

The second endotherm of the DSC traces (perforated crucibles) with an onset temperature of 103.6°C corresponds to the solid-solid phase transformation of **Hy**<sub>dehy</sub> to form **II** (1.08 kJ mol<sup>−1</sup>). In HSM investigations an increase in birefringence is observable during the transformation process. Upon further heating, form **II** melts at 177.2°C (1a) and concomitantly form **I** crystallizes, which then melts at 179°C. Upon cooling the melt of DDS (1b), spontaneous crystallization of form **II** occurs around 110°C. The presence of form **II** is confirmed by the occurrence of the exothermic event at 75°C (cooling curve), indicating the transformation of form **II** to form **III**. The measured enthalpy value of −2.02 kJ mol<sup>−1</sup> agrees with the enthalpy value of the transformation **III** → **II** (2.06 kJ mol<sup>−1</sup>), which can be determined on reheating. A more detailed study of this transformation has recently been reported by us (Braun et al., 2017). In a separate experiment, the DSC heating run of **0.33-Hy** was stopped above the **Hy**<sub>dehy</sub> → **II** transition peak (2a) and the subsequent cooling curve shows the exothermic **II** → **III** transition (2b). The temperature range and enthalpy of this spontaneous transition confirm unambiguously that mainly form **II** is present after the hydrate is heated to about 150°C. Form **III** transforms back to form **II** at 81°C (1c and 2c), just about 6°C above the **II** → **III** transition peak, highlighting the weak kinetic control of this reversible solid-solid transformation.

FIGURE 5 | Photomicrographs of DDS 0.33-Hy. Dehydration in the temperature range 40–100°C, Hy<sub>dehy</sub> to form II transformation in the temperature range 110–114°C, and peritectic dissociation (crystals embedded in high-viscosity silicone oil) of 0.33-Hy to form II in the temperature range 115–132°C.

By embedding the hydrate crystals into high viscosity silicone oil (HSM, **Figure 5**) or using hermetically sealed DSC crucibles (3, **Figure 6**), the peritectic dissociation process of **0.33-Hy** to form **II** can be observed or recorded around 125°C, respectively. A fast nucleation and growth process of form **II** occurs, thus no clear melting process is observable by HSM and the phase transition is mainly indicated by an increase in birefringence (**Figure 5**). The measured heat of ∼5 kJ mol<sup>−1</sup> can be related to the **0.33-Hy** to form **II** transformation, but also includes an unknown contribution from the enthalpy of solution of a fraction of the dehydration product in the liberated water. Due to the low water solubility of DDS and the low water stoichiometry of the hydrate, the measured **0.33-Hy** to form **II** enthalpy is only slightly higher than the sum of the heats of the **0.33-Hy** to **Hy**<sub>dehy</sub> and **Hy**<sub>dehy</sub> to form **II** transformations of ∼4 kJ mol<sup>−1</sup>. Thus, for DDS it is possible to estimate the **0.33-Hy** to form **II** transition enthalpy directly in a hermetically sealed DSC crucible.
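The comparison between the sealed-crucible measurement and the two-step route can be made explicit with a short Python sketch. It uses only the values quoted in the text; note that the first step refers to the sample with 0.18 mol water per mol DDS, and the ∼5 kJ mol<sup>−1</sup> figure is approximate, so the difference is indicative only.

```python
TRS_H_HY_TO_DEHY = 2.95  # kJ/mol, 0.33-Hy -> Hy_dehy (estimated via Equation 1)
TRS_H_DEHY_TO_II = 1.08  # kJ/mol, Hy_dehy -> form II (DSC, perforated crucible)
SEALED_PAN_HEAT = 5.0    # kJ/mol, approximate heat from the sealed-crucible run

# Sum of the two consecutive steps (Hess's law for the overall Hy -> II route).
two_step_sum = TRS_H_HY_TO_DEHY + TRS_H_DEHY_TO_II

# Excess heat attributed in the text to partial dissolution in the liberated water.
excess = SEALED_PAN_HEAT - two_step_sum

print(f"two-step sum = {two_step_sum:.2f} kJ/mol, excess = {excess:.2f} kJ/mol")
```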

### Moisture Dependent Stability of the Hydrate

The moisture sorption/desorption experiments of **0.33-Hy** (**Figure 7**) clearly indicate a non-stoichiometric hydration/dehydration behavior. The isotherm shows a continuous course and the water content of the hydrate adjusts quickly to a specific value if the RH is altered. It is particularly striking that the sorption and desorption isotherms are superimposable, i.e., that there is no hysteresis between the sorption and desorption curves. This fact and the short time to reach the equilibrium water content on changing RH suggest that the diffusion of water molecules into or out of the structure occurs without special constraints and without significant changes of the DDS framework. This observation is even more surprising because the water molecules are located at isolated sites in the **0.33-Hy** structure (**Figure 3A**) and not in open structure voids (channels, layers), which is the commonly expected feature for hydrates with a non-stoichiometric dehydration behavior.



<sup>a</sup>Referred to a hydrate with 0.18 mol water/mol DDS.

<sup>b</sup>Estimated form 0.18-Hy.

<sup>c</sup>13.85 kJ mol<sup>−1</sup> − 10.87 kJ mol<sup>−1</sup>.

The automated gravimetric moisture sorption/desorption analysis of **0.33-Hy** was complemented with longer-term drying experiments at 0% RH (storage over P<sub>2</sub>O<sub>5</sub>) at 25, 50, and 75°C to investigate whether **Hy**<sub>dehy</sub> transforms to another DDS anhydrate polymorph. **Hy**<sub>dehy</sub> is stable for at least 3 weeks at 25°C and for at least 10 days at 50°C. At 75°C the transformation to form **III** starts within 1 week at 0% RH. No new polymorph emerged in the drying studies.

The changes seen in the gravimetric moisture sorption/desorption studies (**Figure 7**) were correlated with structural changes to **0.33-Hy** using variable-humidity PXRD at 25°C (**Figure 8**; see Supplementary Material for the PXRD patterns). In the case of the DDS hydrate only slight changes in peak positions and peak intensities can be observed with varying RH. Changes in lattice parameters were quantified by indexation and Rietveld refinement of the **0.33-Hy** PXRD patterns recorded at different RH values. The lattice parameters changed by at most 0.2% in the range 90% to 1% RH, and the cell volume by only 0.66%. Such small changes are in the range one would expect from a non-stoichiometric hydrate and they are, for example, of similar magnitude as measured for the non-stoichiometric hydrate HyA of brucine (Braun and Griesser, 2016a). Plotting the **0.33-Hy** cell volume as a function of the RH (**Figure 8A**) perfectly reproduces the course of the sorption/desorption isotherms in **Figure 7**.

Using the optimized (PBE-TS) experimental structure as the starting model, Rietveld refinements were performed with PXRD patterns of samples recorded in 10% RH steps during a desorption and sorption cycle. The aim of this study was to unravel whether the water position in **0.33-Hy** varies with the RH conditions. **Figure 8C** shows, as an example, an overlay of the hydrate structures at 90 and 1% RH. The DDS molecules are superimposable, and the water also shows hardly any positional variation with RH. The structures solved at different RH values differ solely in the fractional occupancy factor to which the water molecule refined (**Figure 8B**). It is surprising that this method works so well even though the water is only a very minor contributor to the overall electron density of the hydrate structure. The lowest water content observed for **0.33-Hy** in the RH-dependent PXRD experiments was 0.005(12) mol of water per three moles of DDS, which is in reasonable agreement with the value obtained in the automated gravimetric sorption/desorption measurements (0.02 mol water per mol DDS) at the same RH. No phase change was observed in the moisture-dependent PXRD experiments.

The question remains why the water egress/ingress in **0.33-Hy** is fast, which is not expected for a hydrate in which the water molecules are located at isolated sites. **Figure 9A** illustrates a possible escape route of water molecules parallel to [011]. However, this route requires cooperative movement of the diaminophenyl moieties of the DDS molecules to temporarily open up diffusion pathways, similar to that seen in hydrates of β-cyclodextrin (β-CD) (Steiner and Koellner, 1994), ciprofloxacin (Mafra et al., 2012) or DB7 (Braun et al., 2015). The potential energy surface scans of DDS reveal (**Figure 9B**) that considerable movement of the diaminophenyl moieties is possible at low energy cost (ΔE<sub>intra</sub>), which we assume enables the local formation of the required diffusion pathways.

Sorption/desorption studies based on exposure of the solid material to various moisture conditions are controlled by kinetic parameters, which must be minimized in order to assess the thermodynamic equilibrium between the hydrate and a dehydrated state. This can be achieved, for example, by slurrying the substance in solvents with different water activities, as has been demonstrated in previous studies (Ahlqvist and Taylor, 2002a,b; Braun et al., 2013; Braun and Griesser, 2016a). The most obvious indicator of the kinetic barrier is the hysteresis between the sorption and desorption curves observed in moisture sorption/desorption isotherms. The hysteresis can be extreme in stoichiometric hydrates and is usually small in hydrates with non-stoichiometric behavior. The isotherm of DDS **0.33-Hy** shows no hysteresis, indicating that there is practically no kinetic barrier to the ingress or release of water molecules to/from the crystal structure, and also that the phase is maintained and no transformation to another form with different structural features occurs. To test the phase behavior under different "moisture conditions" (water activities) in solvent systems, we subjected DDS to a slurry study in methanol/water mixtures of various compositions, covering the water activity (a<sub>w</sub>) range from 0 to 1.0 (corresponding to 0 to 100% RH) in 0.01 steps and the range 0.6 to 0.7 in 0.001 steps, using form **III** as the starting form (**Figure 10**). Surprisingly, we obtained a new anhydrous form, named form **V** hereafter, which emerged as the only stable solid phase below a water activity of 0.64. At a<sub>w</sub> > 0.66, **0.33-Hy** was obtained, suggesting that this hydrate is the stable form at high water activities and that the equilibrium between DDS form **V** and **0.33-Hy** lies at an a<sub>w</sub> value of ∼0.655 at 25°C.
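The equilibrium water activity can be translated into a free energy of hydration with the textbook relation ΔG = n·R·T·ln(a<sub>w,eq</sub>), which refers to the reaction anhydrate + n H<sub>2</sub>O (liquid, a<sub>w</sub> = 1) → hydrate. The following minimal sketch is not part of the original analysis; it assumes a fixed stoichiometry of n = 0.33 (a simplification for a non-stoichiometric hydrate) and uses the a<sub>w</sub> value of ∼0.655 at 25°C quoted above. The function name is hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def hydration_free_energy(a_w_eq, n_water, temperature_k=298.15):
    """Free energy (kJ per mol host) of anhydrate + n H2O(l, a_w = 1) -> hydrate.

    Textbook relation dG = n * R * T * ln(a_w,eq); assumes fixed stoichiometry.
    """
    if not 0.0 < a_w_eq <= 1.0:
        raise ValueError("equilibrium water activity must lie in (0, 1]")
    return n_water * R_KJ * temperature_k * math.log(a_w_eq)


# Values taken from the text: a_w,eq ~ 0.655 at 25 degC, 0.33 mol water per mol DDS.
dg_per_mol_dds = hydration_free_energy(a_w_eq=0.655, n_water=0.33)
dg_per_mol_water = hydration_free_energy(a_w_eq=0.655, n_water=1.0)
print(f"dG per mol DDS:   {dg_per_mol_dds:.2f} kJ/mol")
print(f"dG per mol water: {dg_per_mol_water:.2f} kJ/mol")
```

Under these assumptions the driving force for hydrate formation at unit water activity is small (well below 1 kJ per mol DDS), in line with the narrow stability margin between form **V** and **0.33-Hy**.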

Thermal analysis and PXRD characterization confirmed that the new form **V** does not correspond to any of the four known polymorphs (Brandstaetter-Kuhnert et al., 1963; Kuhnert-Brandstatter and Moser, 1979). A thorough characterization of the new form, which is obviously the thermodynamically most stable anhydrate form at RT, and phase interrelations to the known polymorphs will be addressed elsewhere.

# Estimation of Host-Host, Host-Water, and Water-Water Interaction Energies in Organic Hydrates

To understand the nature and stability of a hydrate it is important to consider the location and interaction energies of the water molecules in the framework of the host structure. If water molecules are located in open structural voids (tunnels or connected pockets), the term channel hydrate (Brittain et al., 2009) is commonly used. In such hydrates the water molecules may be mobile and may readily escape through these tunnels on a modest increase in temperature or decrease in relative humidity (RH). In contrast, if the water molecules are located at isolated sites (Brittain et al., 2009), it is assumed that water egress is not as facile and requires a considerable rearrangement of the hydrate packing to allow the release of water molecules. This rearrangement mostly results in the formation of a different packing arrangement or in a partial or total collapse of the structure, yielding a disordered or amorphous state upon dehydration. The non-stoichiometric behavior of hydrates, where the water content in the structure depends on the water vapor pressure of the surrounding medium (atmosphere), is normally observed in channel hydrates and not in isolated-site hydrates. However, as demonstrated above, DDS **0.33-Hy** clearly shows the typical features of an isolated-site hydrate (**Figure 3A**) but also a non-stoichiometric (de)hydration behavior (see **Figure 7**), which is a contradiction and calls into question the common relation between structural features and the stability of hydrates.

To further clarify why the hydrate water can escape easily from the "isolated sites" in **0.33-Hy** without disrupting the structure, we estimated the pair-wise interaction energies for DDS and water molecules in the **0.33-Hy** structure (**Table 2**). Using the CE-B3LYP energies rather than a specific water potential, and neglecting specific effects (nanoscale dielectric responses of water) and three-body energy terms, was justified by the fact that the contribution of the water to **0.33-Hy** was found to be in reasonable agreement with experiment (**Table 3**). The water interactions contribute −13.85 kJ mol<sup>−1</sup> to the **0.33-Hy** lattice. The CE-B3LYP energies of **0.33-Hy** (−142.87 kJ mol<sup>−1</sup>) and **Hy**<sub>dehy</sub> (−132.00 kJ mol<sup>−1</sup>, optimized RT structures) differ by 10.87 kJ mol<sup>−1</sup>, ignoring the conformational changes, which are expected to account for < 0.2 kJ mol<sup>−1</sup> in the case of DDS. Thus, the sum of the interaction energies (E<sub>cluster</sub>) roughly corresponds to the lattice energy. Furthermore, we calculated the intermolecular energies for water-host, water-water and host-host molecules for a series of well-characterized organic hydrate systems (pharmaceuticals and model compounds) and contrasted the values with those of DDS **0.33-Hy** (**Figure 11**, Supplementary Material). The chosen test set consists of stoichiometric and non-stoichiometric hydrates, as well as of channel and isolated-site hydrates. This analysis should indicate whether it is possible to assess hydrate stability and/or the dehydration mechanism from general features of the hydrate structure and not from the location of the water molecules in the structure alone.
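The energy bookkeeping in this paragraph can be retraced with a few lines of arithmetic. The sketch below uses only the three CE-B3LYP values quoted in the text; the variable names are illustrative, and the interpretation of the residual as the mismatch between the two estimates of the water contribution is our reading rather than a statement from the original analysis.

```python
# CE-B3LYP energies quoted in the text, all in kJ mol^-1.
E_CLUSTER_HYDRATE = -142.87    # 0.33-Hy
E_CLUSTER_DEHYDRATE = -132.00  # Hy_dehy, optimized RT structure
E_WATER_PAIRS = -13.85         # sum of pair-wise interactions involving water

# Stabilization of the hydrate relative to the isomorphic dehydrate.
lattice_difference = E_CLUSTER_HYDRATE - E_CLUSTER_DEHYDRATE

# Residual between the pair-wise water contribution and the cluster difference.
residual = E_WATER_PAIRS - lattice_difference

print(f"hydrate - dehydrate: {lattice_difference:.2f} kJ/mol")
print(f"water pairs - difference: {residual:.2f} kJ/mol")
```

Both routes give a water contribution on the order of 11 to 14 kJ mol<sup>−1</sup>, which is the sense in which the sum of the pair-wise energies "roughly corresponds" to the lattice energy.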

In **Figure 11** the calculated interaction energies are grouped into contributions arising from host-host interactions and the interactions involving water molecules (host-water and water-water). Furthermore, the ratio between the host-host and water interactions has been calculated. The hydrates are ranked according to the compound:water ratio. For three of the chosen hydrates (dapsone, indinavir, and brucine) it is possible to remove the water molecules while maintaining the crystal structure, which is characteristic of a non-stoichiometric dehydration behavior. A grouping into channel and isolated-site hydrates is not always straightforward, in particular if more than one water molecule is present (di-, tri-hydrate, etc.). However, such a classification is facilitated by the energetic contributions (host-water vs. water-water), as it can be expected that host-water interactions predominate in isolated-site hydrates and water-water interactions in channel hydrates. Furthermore, the sum of the host-host interactions is stronger in isolated-site hydrates than in channel hydrates.

A requirement for maintaining the crystal lattice upon dehydration is that the hydrate structure exhibits strong host-host interactions and/or a predominance of them. Indeed, the three non-stoichiometric hydrates of the test set show the highest percentage of host-host interactions, i.e., ∼90% of the interaction energies for the DDS and indinavir hydrates. This value is lower for brucine (61.5%), but compared with dihydrates showing a stoichiometric behavior, brucine exhibits the most/strongest host-host interactions. The fact that the DDS and indinavir hydrates are isolated-site hydrates and brucine is a channel hydrate highlights that it is not possible to deduce from the location of the water molecules in the hydrate structure alone whether a stoichiometric or non-stoichiometric dehydration mechanism occurs. It is, however, possible to rationalize a non-stoichiometric dehydration behavior from the energy contributions of the intermolecular interactions, taking the compound:water ratios into account. On the other hand, the analysis shows that the moisture- or temperature-dependent stability of hydrates cannot be derived from the interaction energy calculations. For example, the 5-flucytosine monohydrate (I) already dehydrates at RH values < 40%, whereas 4-aminoquinaldine monohydrate (Hy1A) dehydrates only at RH values below 10% (RT), even though it exhibits smaller energetic contributions from the host-host interactions than the 5-flucytosine monohydrate (I).
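As an illustration of how such a host-host percentage can be obtained, the sketch below applies it to the DDS values quoted earlier (total CE-B3LYP energy of −142.87 kJ mol<sup>−1</sup>, of which −13.85 kJ mol<sup>−1</sup> involve water). It assumes that the host-host part is simply the total minus the water-involving part; the function name is hypothetical, and the exact partitioning used for **Figure 11** may differ.

```python
def host_host_share(e_total, e_water):
    """Percentage of the total interaction energy arising from host-host contacts.

    Assumes e_total = e_host_host + e_water, with all energies in kJ mol^-1.
    """
    if e_total == 0:
        raise ValueError("total interaction energy must be non-zero")
    e_host_host = e_total - e_water
    return 100.0 * e_host_host / e_total


# DDS 0.33-Hy values quoted in the text.
dds_share = host_host_share(e_total=-142.87, e_water=-13.85)
print(f"host-host share for DDS 0.33-Hy: {dds_share:.1f}%")
```

The result is consistent with the ∼90% stated for the DDS hydrate.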

In the case of **0.33-Hy** the latter analysis (**Figure 11**) strongly indicates that the water is only weakly bound, which rationalizes the facile moisture- and temperature-dependent dehydration behavior.

#### DISCUSSION

# Molecular Level Understanding of the Dehydration Mechanism Derived from the Hydrate Structure

How water vapor is sorbed by a hygroscopic material and how moisture affects the physical and chemical stability of a (pharmaceutical) product are crucial questions in developing drug products or preparations produced from other fine chemicals. Failures and time delays in product development can be minimized or avoided with knowledge compiled in thorough solid-state investigations. Hydrates require a thorough evaluation of their composition and stability under production-relevant conditions; in addition, the transformation pathways between the different solid-state forms of a compound, as well as their stability ranges, should be elucidated. This is mandatory for selecting the ideal solid-state form that guarantees optimal product performance and stability. In general, non-stoichiometric hydrates are undesired solid forms because any change in the water vapor pressure of the surrounding medium causes a change in the water content of the substance, which can be critical for weighing and dosing operations and may thus lead to errors in any analysis that requires exact sample amounts. Such variations in the water content are often difficult to avoid, as special efforts are required to precisely control temperature and humidity conditions during processing and storage. Furthermore, the water molecules released from such a hydrate may interact with other excipients in a drug formulation. Gravimetric moisture sorption/desorption studies (**Figure 7**), combined with environmental PXRD experiments (Supplementary Material), are the preferred analytical techniques for unraveling this non-stoichiometric behavior of a hydrate.

DDS **0.33-Hy** is a prime example of an isolated-site hydrate with non-stoichiometric dehydration behavior. Such behavior would normally be expected for hydrates where the water is located in open voids such as channels or layers. Thus, this study highlights that the popular structural classification of hydrates into isolated-site hydrates (water molecules are isolated from direct contact), channel hydrates (chains of water molecules) and ion-associated hydrates (metal ions are coordinated with water) cannot be directly related to the dehydration behavior or dehydration mechanism of a hydrate. However, by complementing the structural features with intermolecular energy calculations, the observed dehydration behavior can be rationalized. As shown for DDS **0.33-Hy**, the water molecules are only weakly bound (**Figure 11**), allowing facile water egress/ingress with changing environmental conditions. Furthermore, the energy difference between the isomorphic dehydrate structure and the anhydrate polymorphs is small. The lattice energy difference between **Hy**<sub>dehy</sub> and form **II** was calculated as 1.6 kJ mol<sup>−1</sup> (PBE-TS, at −273°C), and the transition enthalpy between **Hy**<sub>dehy</sub> and form **II** was determined to be 1.08 ± 0.05 kJ mol<sup>−1</sup> (experimentally measured at ∼100°C). Thus, the calculations rationalize and indicate the non-stoichiometric dehydration mechanism.

# Computational Modeling of Pharmaceutical Hydrates

Modeling and predicting hydrate structures of pharmaceuticals still represents a big challenge in computational chemistry. Numerous potentials have been developed for modeling water (Guillot, 2002); however, no method exists that can sufficiently model all its anomalies. In organic hydrates the water molecules may be described as confined at the nanoscale, implying frustration in their hydrogen-bonding coordination. Consequently, accurately modeling the water molecules in a hydrate lattice requires modeling efforts that go well beyond the methods applied for modeling the organic solid state (Reilly et al., 2016). In lead optimization it has been demonstrated that incorporating the three-body energy terms, and modeling frustration and frustration-related dielectric responses, significantly improves the results (Fernández, 2016, 2017; Fernandez and Scott, 2017). Considering and modeling the latter can be expected to significantly increase the accuracy of lattice and intermolecular energy calculations of water-containing species, albeit at increased computational cost.

A major difficulty with using CSP in hydrate solid form screening is the computational expense in time and resources of generating the crystal energy landscape for all possible hydrate stoichiometries (mono-, di-, etc.). However, solid form modeling at the electronic and atomistic level can provide vital support for unraveling the solid state of a compound, which may not be achieved with experiments alone. A CSP study answers the question of what types of crystal packings are favorable for a specific molecule, unraveling the compromises between close packing efficiency, conformational preferences and the different types of intermolecular interactions that can lead to feasible structures for a molecule (polymorph) or multi-component system (salt, solvate, hydrate, co-crystal). It should be stressed that CSP aids the interpretation of the experimental data (Price et al., 2014) and can guide experimentalists to find new solid forms (Arlin et al., 2011; Braun et al., 2014b, 2016; Neumann et al., 2015; Srirambhatla et al., 2016).

To significantly reduce the computational cost, and to make the calculations feasible, we did not attempt to computationally screen for different hydrate stoichiometries (Braun et al., 2011b) for the chosen model compound DDS, but used the crystal energy landscape of the 1:1 stoichiometry (monohydrate) as a guidance for hydrate formation. The monohydrate crystal energy landscape (**Figure 2**) shows only one hydrate structure that is more stable than the non-solvated form **III** and thus indicates hydrate formation.

# CONCLUSIONS

4,4′-Diaminodiphenyl sulfone (DDS) forms a non-stoichiometric hydrate with a water content of 0–0.33 mol of water per mol of DDS. The upper limit of this ratio is obvious from the features of the crystal structure, but it is surprising that the structurally isolated and hydrogen-bonded water molecules can easily leave and enter the structure, as indicated by the continuous change in water content when the hydrate is exposed to different RH values. This observation highlights that it is not advisable to make assumptions about the dehydration behavior based on the location of the water molecules in the structure alone. However, supported by intermolecular interaction energy calculations (host-host, water-host and water-water) and by comparing the lattice energies of the isomorphic dehydrate (hydrate without water) and the anhydrate polymorph(s) of the same compound, it is possible to rationalize and potentially predict a non-stoichiometric dehydration behavior. Furthermore, this study shows that even though CSP has been performed with only one hydrate stoichiometry (here the monohydrate), the outcome may be sufficient to gain insight into the hydrate formation potential of a compound. However, such a limited approach requires a thorough analysis of the computed structures.

In our opinion, a sound understanding of hydrates and their often complex behavior can only be achieved by a full multidisciplinary investigation, including structural, moisture- and temperature dependent studies combined with modeling. Such an understanding may be mandatory to avoid complications during processing, storing and handling of a hydrate.

# AUTHOR CONTRIBUTIONS

DB conceived and designed the research and headed, wrote and revised the manuscript, while UG contributed to the writing and the revision of the article.

### FUNDING

DB gratefully acknowledges funding by the Elise Richter program of the Austrian Science Fund (FWF, project V436-N34). The computational results presented have been achieved using the HPC infrastructure LEO of the University of Innsbruck.

#### ACKNOWLEDGMENTS

The authors are grateful to Profs. C. C. Pantelides and C. S. Adjiman (Imperial College London) for the use of the CrystalPredictor and CrystalOptimizer programs, Prof. S. L. Price (University College London) for the use of the DMACRYS program and Elisabeth Achammer for support in a few of the calculations.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00031/full#supplementary-material

#### REFERENCES

Bernstein, J. (2002). Polymorphism in Molecular Crystals. Oxford: Clarendon Press.

kinetic relationship between pyrogallol and its tetarto-hydrate. Cryst. Growth Des. 13, 4071–4083. doi: 10.1021/cg4009015


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Braun and Griesser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Structure-Activity Relationship Analysis of 3-Phenylcoumarin-Based Monoamine Oxidase B Inhibitors

Sanna Rauhamäki<sup>1</sup>, Pekka A. Postila<sup>1</sup>, Sanna Niinivehmas<sup>1</sup>, Sami Kortet<sup>1,2</sup>, Emmi Schildt<sup>1,2</sup>, Mira Pasanen<sup>1</sup>, Elangovan Manivannan<sup>1,3</sup>, Mira Ahinko<sup>1</sup>, Pasi Koskimies<sup>4</sup>, Niina Nyberg<sup>5</sup>, Pasi Huuskonen<sup>5</sup>, Elina Multamäki<sup>1</sup>, Markku Pasanen<sup>5</sup>, Risto O. Juvonen<sup>5</sup>, Hannu Raunio<sup>5</sup>, Juhani Huuskonen<sup>2</sup>\* and Olli T. Pentikäinen<sup>1,6</sup>\*

#### Edited by:

Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria

#### Reviewed by:

Outi Maija Helena Salo-Ahen, Åbo Akademi University, Finland Justin W. Hicks, Lawson Health Research Institute, Canada Julian Fuchs, University of Innsbruck, Austria

#### \*Correspondence:

Juhani Huuskonen juhani.s-p.huuskonen@jyu.fi Olli T. Pentikäinen olli.pentikainen@utu.fi

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 10 January 2018 Accepted: 14 February 2018 Published: 02 March 2018

#### Citation:

Rauhamäki S, Postila PA, Niinivehmas S, Kortet S, Schildt E, Pasanen M, Manivannan E, Ahinko M, Koskimies P, Nyberg N, Huuskonen P, Multamäki E, Pasanen M, Juvonen RO, Raunio H, Huuskonen J and Pentikäinen OT (2018) Structure-Activity Relationship Analysis of 3-Phenylcoumarin-Based Monoamine Oxidase B Inhibitors. Front. Chem. 6:41. doi: 10.3389/fchem.2018.00041

Frontiers in Chemistry | www.frontiersin.org 1 March 2018 | Volume 6 | Article 41

<sup>1</sup> Computational Bioscience Laboratory, Department of Biological and Environmental Science & Nanoscience Center, University of Jyväskylä, Jyväskylä, Finland, <sup>2</sup> Department of Chemistry & Nanoscience Center, University of Jyväskylä, Jyväskylä, Finland, <sup>3</sup> School of Pharmacy, Devi Ahilya University, Madhya Pradesh, India, <sup>4</sup> Forendo Pharma Ltd., Turku, Finland, <sup>5</sup> School of Pharmacy, University of Eastern Finland, Kuopio, Finland, <sup>6</sup> MedChem.fi, Institute of Biomedicine, University of Turku, Turku, Finland

Monoamine oxidase B (MAO-B) catalyzes deamination of monoamines such as the neurotransmitters dopamine and norepinephrine. Accordingly, small-molecule MAO-B inhibitors potentially alleviate the symptoms of dopamine-linked neuropathologies such as depression or Parkinson's disease. Coumarin with a functionalized 3-phenyl ring system is a promising scaffold for building potent MAO-B inhibitors. Here, a vast set of 3-phenylcoumarin derivatives was designed using virtual combinatorial chemistry or rationally de novo and synthesized using microwave chemistry. The derivatives inhibited MAO-B at 100 nM–1 µM. The IC<sub>50</sub> value of the most potent derivative 1 was 56 nM. A docking-based structure-activity relationship analysis summarizes the atom-level determinants of the MAO-B inhibition by the derivatives. Finally, the cross-reactivity of the derivatives was tested against monoamine oxidase A and a specific subset of enzymes linked to estradiol metabolism, known to have coumarin-based inhibitors. Overall, the results indicate that the 3-phenylcoumarins, especially derivative 1, present unique pharmacological features worth considering in future drug development.

Keywords: 3-phenylcoumarin, monoamine oxidase B (MAO-B), structure-activity relationship (SAR), virtual drug design, Parkinson's disease

# INTRODUCTION

During neuronal signaling, neurotransmitters are released from the presynaptic cell into the synaptic cleft, where they bind to their specific receptors embedded in the postsynaptic membrane. The membrane lipid bilayer, especially its anionic phospholipid constituents, has been suggested to play a role in the entry of small molecules into the receptors (Orłowski et al., 2012; Postila et al., 2016; Mokkila et al., 2017). Moreover, to ensure that the neurotransmission remains transient, the neurotransmitters are removed quickly from the synaptic cleft via enzymatic degradation and cellular uptake.

Once inside the neuron, monoamine neurotransmitters such as norepinephrine and dopamine are either recycled or destined for deactivation through oxidative deamination (RCH<sub>2</sub>NHR′ + H<sub>2</sub>O + O<sub>2</sub> = RCHO + R′NH<sub>2</sub> + H<sub>2</sub>O<sub>2</sub>) by monoamine oxidases A (MAO-A; E.C. 1.4.3.4) and B (MAO-B; E.C. 1.4.3.4). These enzymes are integral monotopic proteins that anchor themselves as dimers onto the mitochondrial outer membrane surface by protruding their α-helical C-termini into the lipid bilayer (**Figure 1A**). Moreover, subtypes A and B preferentially deaminate their respective substrates to aldehydes: MAO-A deaminates serotonin, norepinephrine, and to some extent dopamine, whereas MAO-B deaminates dopamine, phenethylamine, benzylamine and to a lesser extent norepinephrine (Shih et al., 1999; Edmondson et al., 2005; Gaweska and Fitzpatrick, 2011).

MAO-B, which is the target of this study, is connected to neurodegenerative disorders such as Alzheimer's disease but also to mental disorders such as schizophrenia, anorexia nervosa, depression and attention deficit disorder. In all of these conditions, the involvement of MAO-B in the metabolism of dopamine and other amines plays a key role (Youdim et al., 2006; Carradori and Silvestri, 2015). For instance, due to gliosis associated with Parkinson's disease, increased levels of MAO-B speed up the degradation of dopamine in the motor neurons. MAO-B inhibitors decrease the degradation and boost the dopamine concentration in the synapse. Thus, instead of introducing more dopamine, the neurotransmitter levels are elevated by inhibiting MAO-B. As a result, MAO-B inhibitors such as selegiline are used in the treatment of Parkinson's disease; moreover, their neuroprotective effects can benefit Alzheimer's disease patients (Youdim et al., 2006). Due to the hepatotoxic effects of irreversibly binding MAO inhibitors, reversible inhibitors such as moclobemide were developed (Youdim et al., 2006; Finberg and Rabey, 2016). MAO inhibitors can exhibit selectivity toward MAO-A (moclobemide) or MAO-B (pargyline, selegiline) or be non-selective (phenelzine, tranylcypromine). The selectivity, which can be lost at high dosages, is important for avoiding the cheese effect associated with MAO-A inhibition (Youdim et al., 2006; Finberg and Rabey, 2016).

A vast number of different types of MAO inhibitors are described in the literature; for example, the ChEMBL database lists inhibition data for thousands of compounds. The specific problem in the development of MAO-specific ligands is that promising compounds have the potential to be active on other amine oxidases such as vascular adhesion protein 1 (Nurminen et al., 2010, 2011). Here, the aim was to probe the effects of different substitutions on the coumarin core on MAO-B activity and selectivity, focusing especially on the 3-phenylcoumarin (or 3-arylcoumarin). Notably, there exist two X-ray crystal structures with structurally related coumarin analogs in which 3-chlorobenzyloxy groups are attached at the C7-position (**Figures 1B–D**). The set of 3-phenylcoumarin derivatives with different R1-R7 groups (**Figure 1E**) introduced in this study makes an important addition to the earlier studies in which the potential of the coumarin core, including 61 3-phenylcoumarin derivatives (Matos et al., 2009b, 2010, 2011a,b; Santana et al., 2010; Serra et al., 2012; Viña et al., 2012a,b), to block MAO-A and MAO-B has been explored (Borges et al., 2005; Catto et al., 2006; Matos et al., 2009a, 2010, 2011a; Serra et al., 2012; Ferino et al., 2013; Joao Matos et al., 2013; Patil et al., 2013). The compounds were designed using virtual combinatorial chemistry or rationally de novo, and binding was probed via molecular docking prior to synthesis or in vitro testing.

Initially, 52 derivatives of the 3-phenylcoumarin core were synthesized and tested here for the first time for MAO-B inhibition using a specifically tailored spectrophotometric assay (Supplementary Table S1) (Holt et al., 1997). Next, 24 of the derivatives (**Figure 2**, **Table 1**), producing >70% inhibition at 10 µM, were selected for further analysis. These derivatives inhibited MAO-B in the ∼100 nM to ∼1 µM range, while the most potent derivative **1** inhibited in the ∼50–60 nM range (**Table 1**, **Figure 2**). Finally, the potency of the derivatives for inhibiting estrogen receptor (ER), 17-β-hydroxysteroid dehydrogenase 1 (HSD1), aromatase (CYP19A1), and cytochrome P450 1A2 (CYP1A2), the topics of both our prior (Niinivehmas et al., 2016) and ongoing studies, was also considered. A docking-based structure-activity relationship (SAR) analysis (**Figure 2**) was performed with all of the synthesized 3-phenylcoumarins, focusing mainly on the 24 most potent compounds.
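To put the quoted potency ranges on a common logarithmic scale, IC<sub>50</sub> values are often converted to pIC<sub>50</sub> = −log<sub>10</sub>(IC<sub>50</sub> in mol/L). The short sketch below is an illustration, not part of the original analysis; it applies this standard conversion to the IC<sub>50</sub> of derivative **1** reported in the abstract (56 nM) and to the two bounds of the range given above. The function name is hypothetical.

```python
import math


def pic50_from_nanomolar(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, hence -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


for label, ic50_nm in [("derivative 1", 56.0), ("lower bound", 100.0), ("upper bound", 1000.0)]:
    print(f"{label}: IC50 = {ic50_nm:g} nM, pIC50 = {pic50_from_nanomolar(ic50_nm):.2f}")
```

On this scale the selected derivatives span roughly one log unit (pIC<sub>50</sub> 6 to 7), with derivative **1** about a quarter of a log unit above the upper end of that range.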

In short, this study explores thoroughly the pharmacological potential of 3-phenylcoumarin (**Figure 1E**) for blocking the MAO-B activity (**Table 1**, Supplementary Table S1) and, furthermore, explains the basis of the inhibitory effect on the atom level.

### MATERIALS AND METHODS

### Virtual Combinatorial Chemistry

The 3-phenylcoumarin was chosen as the scaffold of interest for building new MAO-B-specific inhibitors (see section The Alignment of the 3-Phenylcoumarin Scaffold at the Active Site). The analogs were designed using virtual combinatorial chemistry or virtual synthesis. In the initial stages, a methoxy group was included at the R1 or R2 position (**Figure 1E**) of the coumarin core due to its predicted favorability at the active site. The R4-R7 substituents of the 3-phenyl ring (**Figure 1E**) were designed by combining phenylacetic acid with either 6-methoxycoumarin or 7-methoxycoumarin. The preliminary combinatorial compound library was generated using MAESTRO version 9.3 CombiGlide (CombiGlide, version 2.8, Schrödinger, LLC, New York, NY, USA) and its Combinatorial Screening module. The compounds were docked with GLIDE and scored using GlideScore. Some of these derivatives with a promising potency and selectivity profile in this study (**8**, **10**, **25**, **37**) were eventually synthesized, albeit using different chemistry (see section Chemical Procedure), and tested in vitro. The majority of the final derivatives were designed de novo after performing the initial docking simulations with the virtual synthesis products.
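The idea behind such a combinatorial enumeration can be sketched in plain Python. This is a schematic illustration only and does not reproduce the CombiGlide workflow: the two cores are those named above, whereas the substituent and position lists are hypothetical placeholders, and the output is a list of text labels rather than chemical structures.

```python
from itertools import product

# Cores named in the text: the methoxy group sits at R1 or R2 of the coumarin.
CORES = ("6-methoxycoumarin", "7-methoxycoumarin")

# Hypothetical phenyl substituents and ring positions, chosen only for illustration.
SUBSTITUENTS = ("H", "F", "OCH3", "CF3")
POSITIONS = ("3'", "4'")


def enumerate_library(cores, positions, substituents):
    """Return one text label per unique core/position/substituent combination."""
    library = []
    for core, substituent in product(cores, substituents):
        if substituent == "H":
            # An unsubstituted ring is the same compound for every position.
            library.append(f"{core} | unsubstituted 3-phenyl")
            continue
        for position in positions:
            library.append(f"{core} | {position}-{substituent} on 3-phenyl")
    return library


virtual_library = enumerate_library(CORES, POSITIONS, SUBSTITUENTS)
for entry in virtual_library:
    print(entry)
```

In a real workflow each label would correspond to a 3D structure that is subsequently docked and scored; the enumeration step itself is just a Cartesian product with duplicates removed.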

**Abbreviations:** MAO-A, monoamine oxidase A; MAO-B, monoamine oxidase B; HSD1 or 17-β-HSD1, 17-β-hydroxysteroid dehydrogenase 1; ER, estrogen receptor; CYP1A2, cytochrome P450 1A2; CYP19A1, aromatase; SAR, structure-activity relationship.

established inhibitors in comparison to the docking-based pose of the scaffold. Moreover, the phenyl rings of C17 and C18 are attached via ether bonds to the coumarin's C7-position instead of C3-position used with the inhibitors introduced in this study. (E) The 2D structure of the 3-phenylcoumarin scaffold indicating the positions of the functional R1-R7 groups.

# Chemical Procedure

All reactions were carried out using commercial materials (Sigma-Aldrich, Mannheim, Germany) and reagents without further purification unless otherwise noted. Reaction mixtures were heated with a CEM Discover microwave apparatus. All reactions were monitored by thin layer chromatography (TLC) on silica gel plates. <sup>1</sup>H NMR and <sup>13</sup>C NMR data were recorded on a Bruker Avance 400 MHz spectrometer or a Bruker Avance III 300 MHz spectrometer. Chemical shifts are expressed in parts per million (ppm), and signals are designated as s (singlet), br s (broad singlet), d (doublet), dd (double doublet), and t (triplet). Coupling constants (J) are expressed in hertz (Hz). The mass spectra were recorded using Micromass LCT ESI-TOF equipment. Elemental analyses were done with an Elementar Vario EL III elemental analyzer. The coumarin derivatives were synthesized using the Perkin-Oglialoro condensation reaction. The method, adapted from earlier published procedures and transferred to a microwave reactor, was published previously by the authors (Niinivehmas et al., 2016).

A typical procedure: A mixture of salicylaldehyde derivative (2 mmol), phenylacetic acid derivative (2.1 mmol), acetic acid anhydride (0.6 ml), and triethylamine (0.36 ml) was placed in a microwave reactor tube and heated at 100–170 °C with the microwave apparatus (100–200 W) for 10–20 min. After cooling, 2 ml of 10% NaHCO3 solution was added and the precipitate was filtered, dried, and recrystallized from an EtOH/H2O or acetone/H2O mixture. The acetyl group(s) were removed by treating the compound with 2 M MeOH/NaOH(aq) (1:1) solution for 30–60 min at r.t. The solution was acidified with 2 M HCl(aq) and the precipitate was filtered and recrystallized if needed.

Based on the elemental analysis and/or <sup>1</sup>H-NMR the purity of compounds was >95%.

**6-methoxy-3-(4-(trifluoromethyl)phenyl)-2H-chromen-2-one (1).** Yield: 76%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 3.86 (s, 3H, CH3O-), 6.99 (s, 1H, H-5), 7.14 (d, 1H, J <sup>3</sup> = 7.7 Hz, H-7), 7.29 (d, J <sup>3</sup> = 8.9 Hz, H-8), 7.69 (d, 2H, J <sup>3</sup> = 7.9 Hz, H-2', H-6'), 7.58 (m, 3H, H-4, H-3', H-5'); <sup>13</sup>C-NMR (100.6 MHz, CDCl3) δ: 55.99, 110.24, 117.73, 119.78, 120.02, 125.51 (q, J <sup>C</sup>−<sup>F</sup> = 4 Hz), 127.37, 129.05, 130.85 (q, J <sup>C</sup>−<sup>F</sup> = 32 Hz), 138.41, 140.88, 148.33, 156.44, 160.42. HRMS(ESI): calc. for C17H11F3O3Na<sup>1</sup> 343.0558, found 343.0574; elemental anal. for C17H11F3O3, calc. C% 63.76, H% 3.46, found C% 63.25, H% 3.51.
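The calculated HRMS and elemental-analysis values listed for each compound follow from the molecular formula alone, so they can be cross-checked. The sketch below does this for the formula of compound **1** (C17H11F3O3); the function names and the rounded atomic masses are illustrative choices, not part of the original procedure.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, plus the electron mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
                "O": 15.99491462, "F": 18.99840316, "Cl": 34.96885268,
                "Na": 22.98976928}
ELECTRON = 0.00054858

# Rounded standard atomic weights, as used for elemental analysis.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007,
           "O": 15.999, "F": 18.998, "Cl": 35.45, "Na": 22.990}


def parse_formula(formula):
    """Turn a flat formula such as 'C17H11F3O3' into {'C': 17, 'H': 11, ...}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def sodium_adduct_mz(formula):
    """Monoisotopic m/z of the singly charged [M + Na]+ ion."""
    counts = parse_formula(formula)
    neutral = sum(MONOISOTOPIC[el] * n for el, n in counts.items())
    # One electron is lost when the sodium adduct carries a +1 charge.
    return neutral + MONOISOTOPIC["Na"] - ELECTRON


def mass_percent(formula, element):
    """Mass fraction of one element in percent, from average atomic weights."""
    counts = parse_formula(formula)
    total = sum(AVERAGE[el] * n for el, n in counts.items())
    return 100.0 * AVERAGE[element] * counts[element] / total


if __name__ == "__main__":
    formula = "C17H11F3O3"
    print("[M + Na]+ m/z:", round(sodium_adduct_mz(formula), 4))
    print("C%:", round(mass_percent(formula, "C"), 2),
          "H%:", round(mass_percent(formula, "H"), 2))
```

With the electron mass subtracted, the [M + Na]+ value agrees with the five-decimal calculated masses reported in this section; the four-decimal values appear to omit that small correction, which would account for a difference of about 0.0005.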

the chemical similarity of the R1-R7 substituents (Figure 1E). See Table 1 for the detailed activity data.


TABLE 1 | The activity data on the 24 most potent 3-phenylcoumarin derivatives.

N/A = not available. Controls: (1)pargyline, (2)clorgyline, (3) kit control. The compounds are grouped (A–F) based on the chemical similarity of the R1–R7 substituents (Figure 1E).

**6-methoxy-3-(4-(trifluoromethyl)phenyl)-2H-chromen-2-one (2).** Yield: 80%; <sup>1</sup>H-NMR (300 MHz, d<sup>6</sup> -DMSO) δ: 3.88 (s, 3H, CH3O-), 6.99 (s, 1H, J <sup>3</sup> = 8.7 Hz, J <sup>4</sup> = 2.4 Hz, H-6), 7.03 (d, 1H, J <sup>4</sup> = 2.4 Hz, H-7), 7.71 (d, J <sup>3</sup> = 8.6 Hz, H-8), 7.79 (d, 2H, J <sup>3</sup> = 8.3 Hz, H-2', H-6'), 7.93 (d, 2H, H-3', H-5'), 8.32 (s, 1H, H-4); <sup>13</sup>C-NMR (75.5 MHz, d<sup>6</sup> -DMSO) δ: 55.97, 100.25, 112.80, 121.57, 122.38, 120.02, 124.97 (q, J <sup>C</sup>−<sup>F</sup> = 4 Hz), 128.29 (q, J <sup>C</sup>−<sup>F</sup> = 32 Hz), 128.97, 129.99, 139.01, 142.10, 155.09, 159.62, 162.82. HRMS(ESI) calc for C17H11F3O3Na<sup>1</sup> [M + Na]+: 343.05525, found 343.05610.

**2-oxo-3-(4-(trifluoromethoxy)phenyl)-2H-chromen-7-yl acetate (3).** (Dobelmann-Mara et al., 2017) Yield: 54%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.27 (s, 3H, CH3C(O)O-), 7.20 (dd, 1H, J <sup>3</sup> = Hz, J <sup>4</sup> = Hz, H-6), 7.33 (d, 1H, J <sup>4</sup> = Hz, H-8), 7.47 (d, 2H, J <sup>3</sup> = Hz, H-3', H-5'), 7.81 (d, 1H, J <sup>3</sup> = 8.4 Hz, H-5), 8.32 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 20.86, 109.74, 117.23, 118.88, 120.75, 124.84, 129.42, 129.60, 130.52, 133.85, 140.73, 148.35, 152.90, 153.55, 159.40, 168.78; HRMS(ESI) calc. for C18H11F3O5Na<sup>1</sup> [M + Na]<sup>+</sup> 387.0457, found 387.0481.

**6-methoxy-3-(4-(trifluoromethoxy)phenyl)-2H-chromen-2-one (4).** Yield: 52%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 3.86 (s, 3H, CH3O-), 6.98 (d, 1H, J <sup>4</sup> = 3 Hz, H-5), 7.12 (dd, J <sup>3</sup> = 9.1 Hz, J <sup>4</sup> = 3 Hz, H-7), 7.27-7.30 (m, 3H, H-8, H-3', H-5'), 7.74 (d, 2H, J <sup>3</sup> = 8.9 Hz, H-2', H-6'), 7.77 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, CDCl3) δ: 55.99, 110.16, 117.70, 119.68, 119.92, 120.97, 127.41, 130.26, 133.51, 140.20, 148.21, 149.67, 156.41, 160.65. HRMS(ESI) calc for C17H11F3O4Na<sup>1</sup> [M + Na]+: 359.05071, found 359.05260; elemental anal. for C17H11F3O4·0.5H2O, calc. C% 59.14, H% 3.50, found C% 58.99, H% 3.25.

**6-methyl-3-(4-(trifluoromethyl)phenyl)-2H-chromen-2-one (5).** Yield: 54%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 7.27 (d, 1H, J <sup>4</sup> = 2.2 Hz, H-5), 7.35-7.38 (m, 2H, H-7, H-8), 7.70 (d, J <sup>3</sup> = 8.2 Hz, H-2', H-6'), 7.82 (m, 3H, H-4, H-3', H-5'); <sup>13</sup>C-NMR (100 MHz, CDCl3) δ: 20.92, 116.46, 119.22, 122.80, 125.53 (q, J <sup>C</sup>−<sup>F</sup> = 4 Hz), 126.98, 128.07, 129.05, 130.80 (q, J <sup>C</sup>−<sup>F</sup> = 33 Hz), 133.29, 134.62, 138.50, 141.08, 152.04, 160.53; HRMS(ESI) calc for C17H12Cl2O4Na<sup>1</sup> [M + Na]+: 373.0005, found 372.9998.

**2-fluoro-4-(7-methoxy-2-oxo-2H-chromen-3-yl)phenyl acetate (6).** Yield 75%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.35 (s, 3H, CH3C(O)O-Ph), 3.88 (s, 3H, CH3O-Ph), 6.99 (dd, 1H, J <sup>3</sup> = 8.6 Hz, J <sup>4</sup> = 2.4 Hz, H-6), 7.05 (d, 1H, J <sup>4</sup> = 2.4 Hz, H-8), 7.37 (dd, J <sup>3</sup> = 9.3 Hz, J <sup>H</sup>−<sup>F</sup> = 8.3 Hz, H-5'), 7.62 (ddd, 1H, J <sup>3</sup> = 8.5 Hz, J <sup>4</sup> = 2.1 Hz, J <sup>H</sup>−<sup>F</sup> = 0.8 Hz, H-6'), 7.68 (d, J = 8.6 Hz, 1H, H-5), 7.74 (dd, J <sup>H</sup>−<sup>F</sup> = 12.1 Hz, J <sup>4</sup> = 2.0 Hz, H-3'), 8.31 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 20.19, 55.97, 100.25, 112.79, 116.35 (d, J <sup>C</sup>−<sup>F</sup> = 20.3 Hz), 121.02, 121.03, 123.83, 124.79 (d, J <sup>C</sup>−<sup>F</sup> = 3.2 Hz), 129.86, 134.24 (d, J <sup>C</sup>−<sup>F</sup> = 7.7 Hz), 137.20 (d, J <sup>C</sup>−<sup>F</sup> = 13.1 Hz), 141.55, 153.00 (J <sup>C</sup>−<sup>F</sup> = 246.1 Hz), 154.92, 159.65, 162.69, 168.19. HRMS(ESI) calc for C18H13F1O5Na<sup>1</sup> [M + Na]+: 351.06447, found 351.06240; elemental anal. for C18H13F1O<sup>5</sup> C% 65.85, H% 3.99, found C% 65.28 H% 4.02.

**4-(6,8-dichloro-2-oxo-2H-chromen-3-yl)-2-fluorophenyl acetate (7).** Yield 58%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.36 (s, 3H, CH3C(O)O-), 7.43 (dd, J <sup>3</sup> = 9.3 Hz, J <sup>H</sup>−<sup>F</sup> = 8.3 Hz, H-5'), 7.67 (ddd, 1H, J <sup>3</sup> = 8.4 Hz, J <sup>4</sup> = 2.1 Hz, J <sup>H</sup>−<sup>F</sup> = 0.8 Hz, H-6'), 7.74 (dd, J <sup>H</sup>−<sup>F</sup> = 11.8 Hz, J <sup>4</sup> = 2.0 Hz, H-3'), 7.84 (d, 1H, J <sup>4</sup> = 2.4 Hz, H-7), 7.97 (d, 1H, J <sup>4</sup> = 2.4 Hz, H-5), 8.32 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 20.72, 117.23 (d, J <sup>C</sup>−<sup>F</sup> = 21 Hz), 121.13, 122.17, 124.65, 125.74 (d, J <sup>C</sup>−<sup>F</sup> = 3.3 Hz), 127.29, 128.80, 131.47, 133.70 (d, J <sup>C</sup>−<sup>F</sup> = 7.7 Hz), 138.50 (d, J <sup>C</sup>−<sup>F</sup> = 12.9 Hz), 140.12, 147.94, 152.30, 154.75, 158.73; HRMS(ESI): calc. for C17H9Cl2F1O4Na<sup>1</sup> [M + Na]+: 388.9760, found 388.9762.

**6-methoxy-3-(3-methoxyphenyl)-2H-chromen-2-one (8).** Yield 78%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 3.85 (s, 3H, CH3O-Ph), 3.86 (s, 3H, CH3O-Ph), 6.93-6.97 (m, 2H, H-4', H-5), 7.10 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 1.9 Hz, H-7), 7.25-7.29 (m, 3H, H-8, H-2', H-6'), 7.35 (t, 1H, J <sup>3</sup> = 8.2 Hz, H-5'), 7.76 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, CDCl3) δ: 55.69, 56.18, 110.28, 114.57, 114.88, 117.78, 119.55, 120.28, 121.26, 128.80, 129.79, 136.43, 140.13, 148.34, 156.47, 159.88, 160.91; HRMS(ESI): calc. for C17H14O4Na<sup>1</sup> [M + Na]+: 305.07898, found 305.07950; elemental anal. for C17H14O<sup>4</sup> calc. C% 72.33, H% 5.00, found C% 72.41, H% 4.88.

**3-(3,5-dimethoxyphenyl)-6-methoxy-2H-chromen-2-one (9).** (Vilar et al., 2006) Yield 59%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 3.79 (s, 6H, CH3O-Ph), 3.82 (s, 3H, CH3O-Ph), 6.56 (t, 1H, J <sup>4</sup> = 2.3 Hz, H-4'), 6.89 (d, 2H, J <sup>4</sup> = 2.3 Hz, H-2', H-6'), 7.20 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 3.0 Hz, H-7), 7.31 (d, 1H, J <sup>4</sup> = 3.0 Hz, H-5), 7.36 (d, 1H, J <sup>3</sup> = 9.0 Hz, H-8), 8.23 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 55.30, 55.66, 100.48, 106.71, 110.69, 116.90, 119.36, 119.78, 126.75, 136.44, 140.66, 147.33, 155.62, 159.53, 160.16; HRMS(ESI): calc. for C18H16O5Na<sup>1</sup> [M + Na]+: 335.08954, found 305.09010; elemental anal. for C18H16O<sup>5</sup> calc. C% 69.22, H% 5.16, found C% 68.80, H% 5.14.

**6-methoxy-3-(4-methoxyphenyl)-2H-chromen-2-one (10).** (Prendergast, 2001; Ferino et al., 2013) Yield 79%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 3.847 (s, 3H, CH3O-Ph), 3.852 (s, 3H, CH3O-Ph), 6.95-6.98 (m, 3H, H-5, H-3', H-5'), 7.07 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 2.9 Hz, H-7), 7.27 (d, 1H, J <sup>4</sup> = 8.8 Hz, H-5), 7.66 (d, 2H, J <sup>3</sup> = 8.9 Hz, H-2', H-6'), 7.70 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, CDCl3) δ:55.69, 56.16, 110.11, 114.23, 117.69, 119.05, 120.51, 127.47, 128.49, 130.18, 138.63, 148.11, 156.44, 160.49, 161.24; HRMS(ESI): calc. for C17H14O4Na<sup>1</sup> [M + Na]+: 305.07898, found 305.07910; elemental anal. for C17H14O<sup>4</sup> calc. C% 72.33, H% 5.00, found C% 72.34, H% 4.86.

**3-(3-methoxyphenyl)-2H-chromen-2-one (11).** (Kirkiacharian et al., 1999) Yield 81%; <sup>1</sup>H-NMR (400 MHz, CDCl3) δ: 3.86 (s, 3H, CH3O-Ph), 6.95 (ddd, 1H, J <sup>3</sup> = 8.2 Hz, J <sup>4</sup> = 2.3 Hz, J <sup>4</sup> = 2.5 Hz, H-4'), 7.26-7.37 (m, 5H, H-6, H-8, H-2', H-5', H-6'), 7.51-7.53 (m, 2H, H-5, H-7), 7.81 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, CDCl3) δ: 55.69, 114.56, 114.86, 116.76, 119.94, 121.24, 124.81, 128.26, 128.49, 129.80, 131.76, 136.35, 140.28, 153.85, 159.88, 160.76; HRMS(ESI): calc. for C16H12O3Na<sup>1</sup> [M + Na]+: 275.06841, found 275.06540; elemental anal. for C16H12O<sup>3</sup> calc. C% 76.18, H% 4.79, found C% 75.94, H% 4.67.

**7-hydroxy-3-(4-methoxyphenyl)-2H-chromen-2-one (12).** (Prendergast, 2001) Yield 81%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 3.79 (s, 3H, CH3O-Ph), 6.74 (s, 1H, H-8), 6.81 (d, 1H, J <sup>3</sup> = 8.5 Hz, H-6), 6.99 (d, 2H, J <sup>3</sup> = 8.3 Hz, H-3', H-5'), 7.57 (d, 1H, J <sup>3</sup> = 8.4 Hz, H-5), 7.65 (d, 2H, J <sup>3</sup> = 8.3 Hz, H-2', H-6'), 8.08 (s, 1H, H-4), 10.54 (s, 1H, HO-Ph); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 55.18, 101.66, 112.10, 113.29, 113.61, 121.84, 127.30, 129.48, 129.70, 139.73, 154.63, 159.14, 160.20, 160.88; HRMS(ESI): calc. for C16H12O4Na<sup>1</sup> [M + Na]+: 291.06333, found 291.06160.

**3-(4-methoxyphenyl)-2-oxo-2H-chromen-7-yl acetate (13).** (Bhandri et al., 1949) Yield 67%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 7.02 (d, 2H, J <sup>3</sup> = 7.8 Hz, H-3', H-5'), 7.17 (d, 1H, J <sup>3</sup> = 8.3 Hz, H-6), 7.29 (d, 1H, H-8), 7.69 (d, 2H, J <sup>3</sup> = 7.8 Hz, H-2', H-6'), 7.79 (d, 1H, J <sup>3</sup> = 8.2 Hz, H-5), 8.19 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d 6 -DMSO) δ: 20.85, 55.22, 109.57, 113.68, 117.49, 118.68, 125.69, 126.72, 129.19, 129.75, 138.62, 152.36, 153.15, 159.59, 168.81; HRMS (ESI): Calc for C18H14O5Na<sup>1</sup> [M + Na]+: 333.07389, found 333.07220. Elemental analysis for C18H14O<sup>5</sup> calc C% 69.67 H% 4.55, found C% 69.58 H% 4.52.

**3-(4-methoxyphenyl)-2-oxo-2H-chromen-6-yl acetate (14).** Yield 34%; 1H-NMR (300 MHz, d6-DMSO) δ: 2.31 (s, 3H, CH3C(O)O-), 3.81 (s, 3H, CH3O-), 7.03 (d, 2H, J3 = 8.7 Hz, H-3', H-5'), 7.37 (dd, 1H, J3 = 8.9 Hz, J4 = 2.5 Hz, H-7), 7.47 (d, 1H, J3 = 8.9 Hz, H-8), 7.54 (d, 1H, J4 = 2.5 Hz, H-5), 7.70 (d, 2H, J3 = 8.7 Hz, H-2', H-6'), 8.15 (s, 1H, H-5); 13C-NMR (75 MHz, d6-DMSO) δ: 20.73, 55.21, 113.68, 116.83, 120.07, 120.53, 125.00, 126.61, 127.03, 129.84, 138.30, 146.39, 150.18, 159.64, 159.71, 169.22. HRMS (ESI): Calc for C18H14O5 [M + H]+: 311.0914, found 311.0908.

**6-methoxy-3-(2,4,5-trifluorophenyl)-2H-chromen-2-one (15).** Yield 80%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 3.81 (s, 3H, CH3O-Ph), 7.26 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 3.0 Hz, H-7), 7.31 (d, 1H, J <sup>4</sup> = 3.0 Hz, H-5), 7.41 (d, 1H, J <sup>3</sup> = 9.0 Hz, H-8), 7.64-7.77 (m, 2H, H-2', H-6'), 8.18 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d 6 -DMSO) δ: 55.73, 106.31 (dd, J <sup>C</sup>−<sup>F</sup> = 21 Hz, J <sup>C</sup>−<sup>F</sup> = 22 Hz), 110.90, 117.25, 119.12, 119.39, 119.55, 120.07, 120.91, 143.74, 145.70 (d, J <sup>C</sup>−<sup>F</sup> = 242 Hz), 147.64, 149.34 (J <sup>C</sup>−<sup>F</sup> = 252 Hz), 155.13 (J <sup>C</sup>−<sup>F</sup> = 248 Hz), 155.79, 158.78. HRMS (ESI): Calc for C16H9F3O3Na<sup>1</sup> [M + Na]+: 329.04015, found 329.04090. Elemental analysis for C16H9F3O3: calc C% 62.75 H% 2.96, found C% 62.62 H% 3.15.

**7-methoxy-3-(2,4,5-trifluorophenyl)-2H-chromen-2-one (16).** Yield 85%; <sup>1</sup>H-NMR (300 MHz, d<sup>6</sup> -DMSO) δ: 3.88 (s, 3H, CH3O-Ph), 7.00 (dd, 1H, J <sup>3</sup> = 8.6 Hz, J <sup>4</sup> = 2.4 Hz, H-6), 7.06 (d, 1H, J <sup>4</sup> = 2.3 Hz, H-8), 7.61-7.6 (m, 3H, H-5, H-2', H-6'), 8.17 (s, 1H, H-4); <sup>13</sup>C-NMR (75.5 MHz, d<sup>6</sup> -DMSO) δ: 56.02, 100.49, 106.21 (dd, J <sup>C</sup>−<sup>F</sup> = 21 Hz, J <sup>C</sup>−<sup>F</sup> = 21 Hz), 112.24, 112.85, 116.85, 119.30, 119.57, 129.95, 144.06, 145.67 (d, J <sup>C</sup>−<sup>F</sup> = 242 Hz), 148.93 (d, J <sup>C</sup>−<sup>F</sup> = 250 Hz), 155.10 (d, J <sup>C</sup>−<sup>F</sup> = 245 Hz), 155.22, 158.89, 162.98; HRMS (ESI): Calc for C16H9F3O3Na<sup>1</sup> [M + Na]+: 329.04015, found 329.03980.

**3-(4-(dimethylamino)phenyl)-7-hydroxy-2H-chromen-2-one (17).** (Kirkiacharian et al., 2003) In the first step 7-acetoxy-3-(4-(dimethylamino)phenyl)-2H-chromen-2-one was obtained. Yield: 70%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.31 (s, 3H, CH3C(O)O-Ph), 2.95 (s, 6H, (CH3)2N-Ph), 6.77 (d, J <sup>3</sup> = 9.0 Hz, 2H, H-2', H-6'), 7.14 (dd, J <sup>3</sup> = 8.4 Hz, J <sup>4</sup> = 2.2 Hz, 1H, H-5), 7.26 (d, J <sup>4</sup> = 2.2 Hz, 1H, H-8), 7.63 (d, J <sup>3</sup> = 9.0 Hz, 2H, H-3', H-5'), 7.76 (d, J <sup>3</sup> = 8.5 Hz, 1H, H-5), 8.11 (s, 1H, H-4); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 20.85, 39.84, 109.44, 111.58, 117.76, 118.57, 121.57, 126.00, 128.82, 129.11, 136.46, 150.45, 151.90, 152.77, 159.74, 168.85. In the second step 7-hydroxy-3-(4-(dimethylamino)phenyl)-2H-chromen-2-one was obtained. Yield: 85% yellow solid; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.94 (s, 6H, (CH3)2N-), 6.72 (d, J <sup>4</sup> = 2.3 Hz, 1H, H-8), 6.75 (d, J <sup>3</sup> = 9.0 Hz, 2H, H-2', H-6'), 6.79 (dd, J <sup>3</sup> = 8.4 Hz, J <sup>4</sup> = 2.3 Hz, 1H, H-5), 7.55 (d, J <sup>3</sup> = 8.5 Hz, 1H, H-5), 7.58 (d, J <sup>3</sup> = 9.0 Hz, 2H, H-3', H-5'), 7.99 (s, 1H, H-4); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 39.92, 101.59, 112.33, 113.16, 122.30, 122.32, 129.34, 137.83, 150.07, 154.27, 160.30, 160.41; HRMS (ESI): Calc for C17H15N1O3Na<sup>1</sup> [M + Na]+: 304.09496, found 304.09480; elemental anal. for C17H15N1O3, calc. C% 72.58, H% 5.37, N% 4.98, found C% 72.45, H% 5.40, N% 5.15.

**3-(4-(dimethylamino)phenyl)-6-methoxy-2H-chromen-2-one (18).** Yield 55%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.96 (s, 6H, (CH3)2N-Ph), 3.81 (s, 3H, CH3O-Ph), 6.77 (d, 2H, J <sup>3</sup> = Hz, H-3', H-5'), 7.14 (dd, 1H, J <sup>3</sup> = 3.0 Hz, J <sup>4</sup> = 9.0 Hz, H-7), 7.28 (d, 1H, J <sup>4</sup> = 3.0 Hz, H-5), 7.33 (d, 1H, J <sup>3</sup> = 9.0 Hz, H-8), 7.63 (d, 2H, J <sup>3</sup> = 9.0 Hz, H-2', H-6'), 8.06 (s, 1H, H-4); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 39.93, 110.27, 111.59, 116.68, 118.16, 120.35, 121.73, 126.96, 129.15, 136.79, 146.79, 150.46, 155.59, 160.06; HRMS (ESI): Calc for C18H17N1O3Na<sup>1</sup> [M + Na]+: 318.11061, found 318.11050; elemental anal. for C18H17N1O3, calc. C% 73.20, H% 5.80, N% 4.74, found C% 72.75, H% 5.83, N% 4.45.

**3-(4-(dimethylamino)phenyl)-2-oxo-2H-chromen-7-yl acetate (19).** Yield 70%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.31 (s, 3H, CH3C(O)O-Ph), 2.95 (s, 6H, (CH3)2N-Ph), 6.77 (d, 2H, J <sup>3</sup> = 9.0 Hz, H-3', H-5'), 7.14 (dd, 1H, J <sup>3</sup> = 8.4 Hz, J <sup>4</sup> = 2.2 Hz, H-6), 7.26 (d, 1H, J <sup>4</sup> = 2.2 Hz, H-8), 7.63 (d, 2H, J <sup>3</sup> = 9.0 Hz, H-2', H-6'), 7.76 (d, 1H, J <sup>3</sup> = 8.5 Hz, H-5), 8.11 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 20.86, 39.84, 109.44, 111.58, 117.76, 118.57, 121.58, 126.00, 128.82, 129.11, 136.46, 150.45, 151.90, 152.77, 159.74, 168.85; HRMS (ESI): Calc for C19H17N1O4Na<sup>1</sup> [M + Na]+: 346.10553, found 346.10640.

**3-(3-fluoro-4-hydroxyphenyl)-7-methoxy-2H-chromen-2-one (20).** In the first step 2-fluoro-4-(7-methoxy-2-oxo-2H-chromen-3-yl)phenyl acetate was obtained. Yield 75%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.35 (s, 3H, CH3C(O)O-Ph), 3.88 (s, 3H, CH3O-Ph), 6.99 (dd, 1H, J <sup>3</sup> = 8.6 Hz, J <sup>4</sup> = 2.4 Hz, H-6), 7.05 (d, 1H, J <sup>4</sup> = 2.4 Hz, H-8), 7.37 (t, 1H, J = 8.3 Hz, H-6'), 7.62 (d, J = 8.5 Hz, 1H, H-5'), 7.68 (d, J = 8.6 Hz, 1H, H-5), 7.74 (dd, J <sup>H</sup>−<sup>F</sup> = 12.1 Hz, J <sup>4</sup> = 2.0 Hz, H-3'), 8.31 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -DMSO) δ: 20.19, 55.97, 100.25, 112.79, 116.35 (d, J <sup>C</sup>−<sup>F</sup> = 20.3 Hz), 121.02, 121.03, 123.83, 124.79 (d, J <sup>C</sup>−<sup>F</sup> = 3.2 Hz), 129.86, 134.24 (d, J <sup>C</sup>−<sup>F</sup> = 7.7 Hz), 137.20 (d, J <sup>C</sup>−<sup>F</sup> = 13.1 Hz), 141.55, 153.00 (J <sup>C</sup>−<sup>F</sup> = 246 Hz), 154.92, 159.65, 162.69, 168.19. In the second step 3-(3-fluoro-4-hydroxyphenyl)-7-methoxy-2H-chromen-2-one was obtained. Yield 70%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 3.87 (s, 3H, CH3O-Ph), 6.96-7.03 (m, 3H, H-6, H-8, H-5'), 7.41 (d, 1H, J <sup>3</sup> = 8.4, H-6'), 7.57 (dd, 1H, J <sup>H</sup>−<sup>F</sup> = 13.1 Hz, J <sup>4</sup> = 2.2 Hz (H-H), 1H, H-2'), 7.66 (d, 1H, J <sup>3</sup> = 8.4, H-5), 8.18 (s, 1H, H-4), 10.09 (s, 1H, Ph-OH). <sup>13</sup>C-NMR (75.5 MHz, d<sup>6</sup> -DMSO) δ: 55.91, 100.16, 112.61, 113.04, 115.95 (d, J <sup>C</sup>−<sup>F</sup> = 20 Hz), 117.37 (d, J <sup>C</sup>−<sup>F</sup> = 3.3 Hz), 121.78 (J <sup>C</sup>−<sup>F</sup> = 2.0 Hz), 124.54 (d, J <sup>C</sup>−<sup>F</sup> = 3.0 Hz), 126.08 (d, J <sup>C</sup>−<sup>F</sup> = 7.0 Hz), 129.49, 139.62, 145.0 (J <sup>C</sup>−<sup>F</sup> = 13 Hz), 150.46 (d, J <sup>C</sup>−<sup>F</sup> = 240 Hz), 154.52, 159.87, 162.19; HRMS (ESI): Calc for C16H11F1O4Na<sup>1</sup> [M + Na]+: 309.0539, found 309.0553.

**3-(4-fluorophenyl)-6-methoxy-2H-chromen-2-one (21).** Yield 58%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -acetone) δ: 3.87 (s, 3H, CH3O-Ph), 7.19-7.33 (m, 5H, H-5, H-7, H-8, H-3', H-5'), 7.83 (dd, 2H, J HF = 5.4 Hz, J <sup>H</sup>−<sup>H</sup> = 9.0 Hz, H-2', H6'), 8.12 (s, 1H, H-4); <sup>13</sup>C-NMR (100 MHz, d<sup>6</sup> -acetone) δ: 56.17, 111.34, 115.84 (d, J <sup>C</sup>−<sup>F</sup> = 22 Hz), 117.85, 120.04, 121.04, 127.79, 131.62 (d, J <sup>C</sup>−<sup>F</sup> = 8 Hz), 132.41 (d, J <sup>C</sup>−<sup>F</sup> = 3 Hz), 140.82, 148.82, 157.13, 160.64, 163.72 (d, J <sup>C</sup>−<sup>F</sup> = 247 Hz); HRMS (ESI): Calc for C16H11F1O3Na<sup>1</sup> [M + Na]+: 293.05899, found 293.05850; elemental anal. for C16H11F1O3, calc C% 71.11, H% 4.10, found C% 71.10, H% 4.10.

**3-(3-fluoro-4-hydroxyphenyl)-6-methoxy-2H-chromen-2-one (22).** In the first step 2-fluoro-4-(6-methoxy-2-oxo-2H-chromen-3-yl)phenyl acetate was obtained. Yield 66%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.33 (s, 3H, CH3C(O)O-Ph), 3.82 (s, 3H, CH3O-Ph), 7.23 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 3.0 Hz, H-7), 7.30 (d, 1H, J <sup>4</sup> = 3.0 Hz, H-5), 7.35 (d, 1H, J <sup>3</sup> = 9.2 Hz, H-8), 7.61 (d, 1H, J <sup>3</sup> = 8.5 Hz, H-5'), 7.75 (dd, 1H, J <sup>H</sup>−<sup>F</sup> = 12.0 Hz, J <sup>4</sup> = 1.7 Hz (H-H), 1H, H-3'), 8.30 (s, 1H, H-4); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 20.22, 55.69, 110.83, 116.67, 117.02, 119.66, 123.96, 125.10, 135.96, 141.18, 147.44, 151.78, 154.23, 155.70, 159.53, 168.21. In the second step 3-(3-fluoro-4-hydroxyphenyl)-6-methoxy-2H-chromen-2-one was obtained. Yield 71%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 3.81 (s, 3H, CH3O-Ph), 7.02 (dd, 1H, J <sup>3</sup> = 9.2 Hz, H-6'), 7.18 (dd, 1H, J <sup>3</sup> = 9.0 Hz, J <sup>4</sup> = 3.0 Hz, H-7), 7.28 (d, 1H, J <sup>4</sup> = 2.9 Hz, H-5), 7.42 (d, 1H, J <sup>3</sup> = 8.4 Hz, H-5'), 7.57 (dd, 1H, J <sup>H</sup>−<sup>F</sup> = 13.0 Hz, J <sup>4</sup> = 2.2 Hz (H-H), 1H, H-2'), 8.17 (s, 1H, H-4), 10.19 (s, 1H, Ph-OH); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 55.66, 110.59, 116.67, 117.02, 119.66, 123.96, 125.10, 135.96, 141.18, 147.44, 151.78, 154.23, 155.70, 159.53, 168.21. HRMS (ESI): Calc for C16H11F1O4Na<sup>1</sup> [M + Na]+: 309.0539, found 309.0521.

**3-(4-fluorophenyl)-6-methyl-2H-chromen-2-one (23).** (Chauhan et al., 2016) Yield 74%; <sup>1</sup>H-NMR (400 MHz, d<sup>6</sup> -DMSO) δ: 2.38 (s, 3H, CH3-Ph), 7.27-7.35 (m, 3H, H-3', H-5', H-8), 7.43 (dd, 1H, J<sup>3</sup> = 8.5 Hz, J<sup>4</sup> = 2.1 Hz, H-7), 7.55 (d, 1H, J<sup>4</sup> = 1.4 Hz, H-5), 7.77 (dd, 2H, J HF = 5.7 Hz, J <sup>H</sup>−<sup>H</sup> = 9.0 Hz, H-2', H-6'), 8.18 (s, 1H, H-4); <sup>13</sup>C-NMR (100.6 MHz, d<sup>6</sup> -DMSO) δ: 20.26, 115.11 (d, J <sup>H</sup>−<sup>F</sup> = 21.5 Hz), 115.64, 119.16, 125.76, 128.20, 130.70 (d, J <sup>H</sup>−<sup>F</sup> = 8.4 Hz), 131.10 (d, J <sup>H</sup>−<sup>F</sup> = 3.2 Hz), 132.61, 133.80, 140.48, 151.10, 159.82, 162.17 (d, J <sup>H</sup>−<sup>F</sup> = 245 Hz); HRMS (ESI): Calc for C16H11F1O2Na<sup>1</sup> [M + Na]+: 277.06408, found 277.06390; Elemental anal. for C16H11F1O2, calc C% 75.58, H% 4.36, found C% 75.42, H% 4.33.

**3-(4-fluorophenyl)-6-hydroxy-2H-chromen-2-one (24).** In the first step 3-(4-fluorophenyl)-2-oxo-2H-chromen-6-yl acetate was obtained and used as such for the next step. In the second step 3-(4-fluorophenyl)-6-hydroxy-2H-chromen-2-one was obtained. Yield 65%; <sup>1</sup>H-NMR (300 MHz, d<sup>6</sup> -DMSO) δ: 7.04 (dd, 1H, J <sup>3</sup> = 8.8 Hz, J <sup>4</sup> = 2.9 Hz, H-7), 7.09 (d, 1H, J <sup>4</sup> = 2.8 Hz, H-5), 7.24-7.29 (m, 3H, H-3', H-5', H-8), 7.75 (dd, 2H, J HF = 5.6 Hz, J <sup>H</sup>−<sup>H</sup> = 8.9 Hz, H-2', H6'), 8.13 (s, 1H, H-4), 9.72 (s, 1H, HO-Ph); <sup>13</sup>C-NMR (75.5 MHz, d<sup>6</sup> -DMSO) δ: 112.561, 115.03 (d, J <sup>H</sup>−<sup>F</sup> = 21.5 Hz), 116.71, 119.78, 119.93, 125.80, 130.70 (d, J <sup>H</sup>−<sup>F</sup> = 8.2 Hz), 131.18 (d, J <sup>H</sup>−<sup>F</sup> = 3.2 Hz), 140.50, 146.35, 159.92, 162.17 (d, J <sup>H</sup>−<sup>F</sup> = 246 Hz); HRMS (ESI): Calc for C15H9F1O3Na<sup>1</sup> [M + Na]+: 279.04334, found 279.0444.

# Monoamine Oxidase A and B

Both the monoamine oxidase A (MAO-A) and B (MAO-B) proteins and all reagents for the spectrophotometric assay were purchased from Sigma-Aldrich (St. Louis, MO, USA). These comprised the components of the chromogenic solution, namely vanillic acid (4-hydroxy-3-methoxybenzoic acid, 97% purity), 4-aminoantipyrine (reagent grade), and horseradish peroxidase, as well as the substrate tyramine hydrochloride (minimum 99% purity). The potassium phosphate buffer was prepared using potassium phosphate dibasic trihydrate (≥99%, ReagentPlus™) and potassium phosphate monobasic (minimum 98% purity, molecular biology tested).

The protocol for the continuous spectrophotometric assay (Holt et al., 1997) was followed in the activity measurements. The assay was performed in 0.2 M potassium phosphate buffer, pH 7.6, on 96-well plates (Nunc™ 96F microwell plate without a lid, Nunc A/S, Roskilde, DK) in a total volume of 200 µl. The chromogenic solution, containing 1 mM vanillic acid, 500 µM 4-aminoantipyrine, and 8 U/ml horseradish peroxidase in 0.2 M potassium phosphate buffer, pH 7.6, was mixed anew for each measurement. A 5 mM tyramine solution was used as the substrate. To determine the activity of both MAO-B and MAO-A, concentration series were prepared as duplicates. The protein was combined with the chromogenic solution and incubated for 30 min at 37 °C. The background signal was measured at A<sub>490</sub> using a multilabel reader (Victor™ X4, 2030 Multilabel Reader, PerkinElmer, Waltham, MA, USA), after which 20 µl of tyramine was added to reach the total volume of 200 µl and a final tyramine concentration of 0.5 mM on the plate. As a result, the final concentrations of the chromogenic solution components on the plate were 250 µM vanillic acid, 125 µM 4-aminoantipyrine, and 2 U/ml horseradish peroxidase. After adding the substrate, the plates were measured 300 times, once every 15 s, using a 1 s exposure time. The device was set to 37 °C for the duration of the experiment.
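The final on-plate concentrations quoted above follow from simple dilution arithmetic (C1·V1 = C2·V2). The sketch below reproduces the stated tyramine concentration and back-calculates the chromogenic solution volume implied by the stated fourfold dilution. That 50 µl volume is an inference from the reported concentrations, not a figure given in the text, and the function names are illustrative.

```python
import math


def final_concentration(stock, volume_added_ul, total_volume_ul):
    """Concentration after dilution, in the same unit as `stock` (C1*V1 = C2*V2)."""
    return stock * volume_added_ul / total_volume_ul


def volume_for(stock, final, total_volume_ul):
    """Volume of stock (in µl) needed to reach `final` in `total_volume_ul`."""
    return final / stock * total_volume_ul


if __name__ == "__main__":
    total_ul = 200.0

    # 20 µl of 5 mM tyramine in a 200 µl well.
    tyramine_mM = final_concentration(5.0, 20.0, total_ul)
    print(f"tyramine: {tyramine_mM} mM")

    # Stock and final values as reported for the chromogenic solution.
    components = [
        ("vanillic acid (µM)", 1000.0, 250.0),
        ("4-aminoantipyrine (µM)", 500.0, 125.0),
        ("horseradish peroxidase (U/ml)", 8.0, 2.0),
    ]
    for name, stock, final in components:
        # Inferred, not stated: the volume that yields the reported dilution.
        print(f"{name}: {volume_for(stock, final, total_ul)} µl of stock")

    assert math.isclose(tyramine_mM, 0.5)
```

All three chromogenic components give the same implied volume, which is a quick consistency check on the reported concentrations.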

Based on the activity measurements, suitable concentrations were chosen for both MAO-B and MAO-A to be used in the inhibition studies (Supplementary Figures S1, S2, and S5, **Table 1**, Supplementary Table S1). The experimental conditions should produce an absorbance change of ∼0.35 (Holt et al., 1997). With MAO-B, this was reached using 10 µl of the protein (equal to 50 µg of protein with an enzymatic activity of 3.2 units per well) and running the experiment for 2 h (Supplementary Figures S1, S2, and S5, **Table 1**, Supplementary Table S1). MAO-A was significantly more active, providing an absorbance change of >0.5 with 5 µl of protein (equal to 25 µg of protein with an enzymatic activity of 1.05 units per well); consequently, the reaction maximum was reached already in 30 min (Supplementary Figure S5, **Table 1**, Supplementary Table S1). A wide panel of coumarin derivatives was then analyzed at 10 µM (**Table 1**, Supplementary Table S1), and those 3-phenylcoumarin derivatives producing >70% inhibition were selected for further analysis (**Table 1**, **Figure 2**). The selected 24 candidates were measured as duplicates on a dilution series ranging from 50 µM to 1 nM, and IC<sub>50</sub> values were calculated from the normalized measurement results (**Table 1**). The same wide panel of coumarin derivatives was additionally used to analyze MAO-A inhibition at 100 µM (**Table 1**, Supplementary Table S1).

GRAPHPAD PRISM 5.03 (GraphPad Software Inc., CA, USA) was used to normalize the spectrophotometric assay data: the signal at the lowest inhibitor concentration (10<sup>−8</sup> or 10<sup>−9</sup> M, depending on the sample) was taken as the maximum, and the signal at the starting concentration of 5·10<sup>−5</sup> M as the minimum. The normalized data were then fitted by non-linear regression using the equation for log[inhibitor] vs. response, and the IC<sub>50</sub> values were determined from the fitted curves. The fitted curves are shown on a –log scale in Supplementary Figures S1, S2.
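To make the normalization and curve-fitting step concrete, the following sketch normalizes a dilution series to 0–100% and estimates log IC50 by least squares. It assumes the one-parameter normalized-response model with a fixed Hill slope of −1, Y = 100 / (1 + 10^(X − logIC50)). The exact Prism model variant is not stated in the text, and a simple grid refinement stands in here for Prism's non-linear regression routine. The absorbance values are synthetic.

```python
import math


def normalize(signals):
    """Scale signals so the lowest concentration is 100% and the highest is 0%.

    `signals` must be ordered from the highest to the lowest concentration.
    """
    bottom, top = signals[0], signals[-1]
    return [100.0 * (s - bottom) / (top - bottom) for s in signals]


def model(log_conc, log_ic50):
    """Normalized response (%) with a fixed Hill slope of -1 (an assumption)."""
    return 100.0 / (1.0 + 10.0 ** (log_conc - log_ic50))


def sse(log_ic50, log_concs, responses):
    """Sum of squared residuals between the model and the data."""
    return sum((model(x, log_ic50) - y) ** 2 for x, y in zip(log_concs, responses))


def fit_log_ic50(log_concs, responses, lo=-10.0, hi=-4.0, rounds=4, steps=200):
    """Least-squares estimate of log10(IC50 / M) by repeated grid refinement."""
    best = lo
    for _ in range(rounds):
        width = (hi - lo) / steps
        grid = [lo + i * width for i in range(steps + 1)]
        best = min(grid, key=lambda c: sse(c, log_concs, responses))
        lo, hi = best - width, best + width  # zoom in around the current best
    return best


if __name__ == "__main__":
    concs_molar = [5e-5, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]  # highest to lowest
    log_concs = [math.log10(c) for c in concs_molar]

    # Synthetic absorbance changes generated from a hypothetical 56 nM inhibitor.
    true_log_ic50 = math.log10(56e-9)
    raw = [0.02 + 0.35 * model(x, true_log_ic50) / 100.0 for x in log_concs]

    fitted = fit_log_ic50(log_concs, normalize(raw))
    print(f"fitted IC50: {10 ** fitted * 1e9:.1f} nM")
```

Because the end points of the series are forced to 100% and 0%, the fitted value is shifted slightly when the true curve has not fully plateaued at those concentrations, which is a known limitation of this kind of normalization for potent compounds.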

### 17-β-Hydroxysteroid Dehydrogenase 1

Inhibition of 17-β-hydroxysteroid dehydrogenase 1 (HSD1) was determined by HPLC using recombinant human HSD1 protein, produced in Sf9 insect cells, as described earlier (Messinger et al., 2009). The assay was performed in a final volume of 0.2 ml buffer (20 mM KH2PO4, 1 mM EDTA, pH 7.4) containing 0.1 mg/ml protein, 1 mM cofactor NADPH, 30 nM substrate estrone or estradiol, 800,000 cpm/ml of tritium-labeled estrone ([3H]-E1) or estradiol ([3H]-E2), and inhibitor concentrations in the range of 0.1–5 mM. Triplicate samples were incubated for 25 min at RT. The reaction was stopped by addition of 20 ml 10% trichloroacetic acid per sample. After incubation, the substrate and the product of the enzymatic conversion, [3H]-E1 and [3H]-E2, were separated and quantified by HPLC (Alliance 2790, Waters) connected to an online counter (Packard Flow Scintillation Analyzer). The ratio of [3H]-E1 converted to [3H]-E2, or vice versa, determines the sample conversion percentage. Inhibition efficiencies were calculated by comparing the conversion percentages of the samples including inhibitors with those of the conversion controls (without inhibitors).
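The comparison of conversion percentages can be written out explicitly. The sketch below assumes the common definition in which inhibition is the relative loss of conversion with respect to the uninhibited control. The text does not spell out the formula, so this is one plausible reading, and the peak counts are invented for illustration.

```python
import math


def conversion_percent(product_counts, substrate_counts):
    """Share of the radiolabel found in the product peak, in percent."""
    return 100.0 * product_counts / (product_counts + substrate_counts)


def inhibition_percent(sample_conversion, control_conversion):
    """Relative loss of conversion compared with the uninhibited control (%)."""
    return 100.0 * (1.0 - sample_conversion / control_conversion)


def mean(values):
    """Arithmetic mean, used here to average triplicate samples."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical HPLC peak counts as (product, substrate) for triplicates.
    control = [(300.0, 700.0), (310.0, 690.0), (290.0, 710.0)]
    treated = [(150.0, 850.0), (160.0, 840.0), (140.0, 860.0)]

    control_conv = mean([conversion_percent(p, s) for p, s in control])
    treated_conv = mean([conversion_percent(p, s) for p, s in treated])

    result = inhibition_percent(treated_conv, control_conv)
    print(f"control {control_conv:.1f}%, treated {treated_conv:.1f}%, "
          f"inhibition {result:.1f}%")
    assert math.isclose(result, 50.0)
```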

#### Aromatase

Aromatase (CYP19A1) activity was measured as described previously (Pasanen, 1985) by using human placental microsomes and 50 nM [3H]-androstenedione as a substrate and inhibitor concentrations in the range of 60–1,000 nM. Aromatase activities were measured as released [3H]-H2O in Optiphase Hisafe 2 scintillation liquid (Perkin Elmer, USA) with a Wallac 1450 MicroBeta Trilux scintillation counter (Perkin Elmer, USA). As a positive control for aromatase inhibition, 1µM finrozole (generous gift from Olavi Pelkonen, University of Oulu, Finland) was used.

### Cytochrome P450 1A2

Inhibition of CYP1A2 activity was determined with commercial heterologously expressed human CYP1A2 enzyme (Corning Inc., Corning, NY, USA) as described earlier (Korhonen et al., 2005). The metabolic activity was not in the scope of this particular study. The assay was adapted to the 96-well plate format. In each well, a 150 µL incubation volume contained 100 mM Tris-HCl buffer (pH 7.4), 4.2 mM MgCl2, 1 µM 7-ethoxyresorufin, 0.5 pmol of cDNA-expressed CYP1A2, 0-40 mM inhibitor, and an NADPH-regenerating system. All inhibitors were dissolved in ethanol, and the final concentration of ethanol was 2% in all incubations. The reaction was initiated by adding the NADPH-regenerating system after a 10 min preincubation at 37 °C, and after a 20 min incubation, the reaction was terminated by the addition of 110 µL of 80% acetonitrile/20% 0.5 M Tris base. The formed fluorescence was measured with a Victor2 plate counter (Perkin-Elmer Life Sciences Wallac, Turku, Finland) at 570 nm excitation and 616 nm emission.

# Estrogen Receptor

The pIC50 values for the derivatives (**Table 1**, Supplementary Table S1) were measured with the green PolarScreen™ ER Alpha Competitor Assay kit (Life Technologies, CA, USA), following the manufacturer's protocol as previously described (Niinivehmas et al., 2016). The final concentration of the compounds ranged from 0.0007 to 10,000 nM in the dilution series, which were performed as duplicates. The molecules were combined with 25 nM ERα and 4.5 nM fluormone in the assay buffer and placed on a black low-volume 384-well assay plate with NBS surface (Corning, NY, USA). After mixing, the assay plate was incubated for 2 h at RT. The fluorescence polarization was measured using an excitation wavelength of 485 nm and an emission wavelength of 535 nm with bandwidths of 25/20 nm on a 2104 EnVision® Multilabel Plate Reader running EnVision Workstation version 1.7 (PerkinElmer, MA, USA).
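For readers comparing the pIC50 values of this assay with the IC50 values reported elsewhere in the study, the two are related by pIC50 = −log10(IC50 in mol/l). A minimal conversion sketch follows; the function names are illustrative.

```python
import math


def pic50_from_nM(ic50_nM):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/l)."""
    return -math.log10(ic50_nM * 1e-9)


def ic50_nM_from_pic50(pic50):
    """Inverse conversion: pIC50 back to an IC50 in nanomolar."""
    return 10.0 ** (-pic50) * 1e9


if __name__ == "__main__":
    # Values spanning the nanomolar-to-micromolar range discussed in the text.
    for ic50_nM in (56.0, 1000.0, 10000.0):
        print(f"IC50 {ic50_nM:g} nM -> pIC50 {pic50_from_nM(ic50_nM):.2f}")
```

A tenfold drop in IC50 raises pIC50 by exactly one unit, so higher pIC50 values mean tighter binding.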

# Computational Methods

The small-molecule ligand structures were drawn in 3D and their tautomeric states at pH 7.4 were built using the LIGPREP module in MAESTRO 2016-3 (Schrödinger, LLC, New York, NY, USA, 2016). The derivatives were docked to the X-ray crystal structure of MAO-B (PDB: 2V60) (Binda et al., 2007) with PLANTS 1.2 (Korb et al., 2009) using a binding-site radius of 10 Å centered on the C8 atom of inhibitor C18 (PDB: 2V60). The R1-methoxy group rotamers of compounds **1**, **8**, **9**, **15**, **18**, **21**, and **22** were manually adjusted to indicate how the groups exploit the small hydrophobic niche in the cavity (green sector in **Figures 3A,B**). The 2D structures of the 3-phenylcoumarin scaffold and the 24 most potent inhibitor derivatives shown in **Figures 1E**, **2** were drawn with BIOVIA DRAW 2016 (Dassault Systèmes, San Diego, CA, USA, 2016). **Figures 1A–D**, **3–5** were prepared using BODIL (Lehtonen et al., 2004) and VMD 1.9.2 (Humphrey et al., 1996). The negative images of the MAO-B and MAO-A binding cavities shown in **Figures 3A,C** were outlined using PANTHER (Niinivehmas et al., 2011, 2015) and visualized with BODIL, MOLSCRIPT (Kraulis, 1991), and RASTER3D (Merritt and Murphy, 1994).
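The docking site definition (a 10 Å sphere centered on one ligand atom) can be illustrated with a short PDB-parsing sketch. It reads the fixed-column coordinate records of a PDB file, locates the center atom, and lists the protein atoms inside the sphere. The sketch assumes that the inhibitor is stored as a HETATM residue named `C18` with an atom named `C8`; the toy coordinates and the `pdb_line` helper are invented for demonstration, and the snippet is not part of the PLANTS workflow itself.

```python
import math


def parse_atoms(pdb_text):
    """Yield (record, atom_name, res_name, (x, y, z)) from ATOM/HETATM lines."""
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # PDB coordinates live in fixed columns 31-38, 39-46, and 47-54.
            coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[:6].strip(), line[12:16].strip(), line[17:20].strip(), coords


def site_center(pdb_text, res_name="C18", atom_name="C8"):
    """Coordinates of the ligand atom used as the binding-site center."""
    for record, name, res, coords in parse_atoms(pdb_text):
        if record == "HETATM" and res == res_name and name == atom_name:
            return coords
    raise ValueError(f"atom {atom_name} of residue {res_name} not found")


def atoms_in_site(pdb_text, center, radius=10.0):
    """Protein atoms (ATOM records) within `radius` angstroms of `center`."""
    return [(name, res, coords)
            for record, name, res, coords in parse_atoms(pdb_text)
            if record == "ATOM" and math.dist(coords, center) <= radius]


def pdb_line(record, serial, name, res, chain, seq, x, y, z):
    """Build a fixed-column PDB coordinate line (toy data for the demo only)."""
    return (f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


if __name__ == "__main__":
    toy = "\n".join([
        pdb_line("HETATM", 1, "C8", "C18", "A", 601, 0.0, 0.0, 0.0),
        pdb_line("ATOM", 2, "CA", "ILE", "A", 199, 3.0, 4.0, 0.0),   # 5 A away
        pdb_line("ATOM", 3, "CA", "GLY", "A", 10, 30.0, 0.0, 0.0),   # 30 A away
    ])
    center = site_center(toy)
    for name, res, coords in atoms_in_site(toy, center):
        print(res, name, round(math.dist(coords, center), 1))
```

With a real structure file, the returned center coordinates are what a docking program would take as the binding-site center, with 10 Å as the radius.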

# RESULTS AND DISCUSSION

#### Spectrophotometric Activity Measurements for Monoamine Oxidase B

All 52 derivatives were docked, synthesized, and tested experimentally. The 24 compounds that provided IC<sub>50</sub> values below 10 µM were tested more thoroughly (**Table 1**). The fact that 24 of the synthesized derivatives with a wide variety of R1-R7 groups (**Figure 2**) passed the 70% threshold indicates that 3-phenylcoumarin is indeed a highly suitable scaffold for building MAO-B inhibitors. Notably, eight of these tested derivatives (**3**, **9**–**13**, **17**, and **23** in **Figure 2**) had been synthesized previously (Bhandri et al., 1949; Kirkiacharian et al., 1999, 2003; Prendergast, 2001; Vilar et al., 2006; Ferino et al., 2013; Chauhan et al., 2016; Dobelmann-Mara et al., 2017); however, this is the first time they have been tested for MAO-B activity. The novel derivative **1** is the most potent inhibitor of the analog set, with an IC<sub>50</sub> value of 56 nM (**Figure 2**, **Table 1**), whereas the rest of the tested derivatives are evenly distributed within a range of 0.1–10 µM (**Figure 2**, **Table 1**).

By focusing solely on the R1-R7 constituents of the derivatives (**Figures 1E**, **2**) and the activity data (**Table 1**) it is possible to outline trends that determine which functional groups, positions or their combinations establish and weaken or improve the MAO-B inhibition.

Although the R1 and R2 groups in the coumarin ring are not necessarily required for establishing MAO-B inhibition (see **11**; **Figure 2**, Supplementary Figure S3F; **Table 1**), the activity measurements indicate that adding a methoxy, hydroxyl, acetoxy, methyl or even halogen group(s) into the ring can facilitate strong inhibition (**Table 1**). As a rule of thumb, introducing an R1-methoxy group produces strong MAO-B inhibition (e.g., **1**; **Figure 2**; **Table 1**). In contrast, inserting, for example, a bulky R3 substituent such as an acetoxy group weakens the inhibition considerably (26, 35, 47; Supplementary Figure S4; Supplementary Table S1). Whether the R1 or R2 position or any specific functional group in particular is favored depends on the composition of the 3-phenyl ring's R4-R7 constituents.

the active site, displays the contours (opaque surface) that roughly match the inhibitor shape and conformation. The colored sectors highlight specific sections of the cavity dedicated to different aspects of the 3-phenylcoumarin derivative binding: 3-phenyl ring (orange), the R4-R7 groups of the 3-phenyl ring (red), coumarin ring (yellow), the hydrophobic niche occupied by the R1/R2-groups of the coumarin ring (green). (C) A negative image of the MAO-A active site shows that only two residue changes (Ile199→ Phe208; Leu164→ Phe173) are enough to prevent 3-phenylcoumarin analog binding. (D) The docked poses of the 23 most potent 3-phenylcoumarin derivatives show what space is collectively occupied by the new inhibitors. See Figure 1 for details.

In fact, the activity data indicate that the R4-R7 substituents are vital for assuring strong MAO-B inhibition; without any 3-phenyl substituents, the activity is lost (41, 50, 52; Supplementary Figure S4, Supplementary Table S1). The most potent inhibitors were **1** (IC<sub>50</sub> of ∼56 nM; **Table 1**) and **2** (IC<sub>50</sub> of ∼138 nM; **Table 1**) housing R6-trifluoromethyl, but **3** (IC<sub>50</sub> of ∼141 nM; **Table 1**) with the structurally similar R6-trifluoromethoxy group is almost equally potent. The combination of the R6-acetoxy and R7-fluorine groups in **6** (IC<sub>50</sub> of ∼189 nM) produces relatively strong inhibition. Furthermore, housing just one methoxy group at the R7 position (**8**; IC<sub>50</sub> of ∼230 nM) or two methoxy groups at both R5 and R7 positions (**9**; IC<sub>50</sub> of ∼255 nM) assures < 300 nM inhibition.

The effects of the R4-R7 groups of the 3-phenyl ring and the R1-R3 groups of coumarin ring (**Figure 2**) for the derivative binding and inhibition are detailed below in a docking-based structure-activity relationship (SAR) analysis.

#### The Alignment of the 3-Phenylcoumarin Scaffold at the Active Site

The 3-phenylcoumarin derivative binding at the MAO-B active site is based on the premise that the coumarin and phenyl ring systems occupy roughly the same 3D space as the equivalent ring systems of the coumarin-based inhibitors co-crystallized with the enzyme (PDB: 2V60, 2V61; **Figures 1A–D**) (Binda et al., 2007). The fundamental difference between the 3-phenylcoumarin derivatives and those coumarin inhibitors with validated binding poses is that the coumarin alignment is reversed and the phenyl ring is attached to the C3-position instead of the C7-position (**Figures 1C,D**).

What is more, the "canonical" coumarin ring positioning inside the pocket is somewhat analogous to even simpler double ring constructs such as the indole of inhibitor isatin (PDB: 1OJA) (Binda et al., 2003). In fact, the hydrophobicity of the aromatic coumarin (yellow sector in **Figures 3A,B**) and 3-phenyl (orange sector in **Figures 3A,B**) rings is vital for establishing the MAO-B binding and it outweighs all other favorable interactions such as hydrogen or halogen bonding (via sigma hole) in importance (**Figure 4**). Thus, although the docking suggests variability in the coumarin and 3-phenyl ring positioning for the 3-phenylcoumarin derivatives due to different R1-R7 substituents, the hydrophobic interactions of the aromatic rings are highly similar between them (**Figure 3D**).

It is also noteworthy that the coumarin's C2-carbonyl is not facing the solvent based on the molecular docking simulations (**Figure 3D**). Paradoxically, this does not matter, because the carbonyl group finds an atypical interaction partner from the thiol group of Cys172 side chain (**Figure 4**). Although the C2 carbonyl cannot form a full-fledged H-bond with the proton of the thiol group, the hydrophobic environment of the cavity likely enhances this ordinarily weak interaction between the two groups.

#### R6-Trifluoromethyl Packing Produces the Strongest Inhibition

Halogen substituents in the 3-phenyl ring ensure strong MAO-B inhibition (**Figure 4**). This makes sense with MAO-B, because despite their apparent electronegativity the halogen substituents actually improve the steric packing of small molecules via persistent van der Waals interactions while also retaining the ability to act as a halogen bond donor. Both of these properties should assist inhibitor binding into the active site that is mostly hydrophobic (**Figures 3A,B**). Besides, the increased lipophilicity conveyed by the halogen substituents (logP values in **Table 1**) should assist the 3-phenylcoumarin derivatives in aggregating on the outer mitochondrial membrane en route to the MAO-B active site (**Figure 1A**).

The most potent derivative **1** (**Figure 2**, **Table 1**) has a trifluoromethyl group at the R6 position in the 3-phenyl ring. The derivative is relatively flat when bound at the active site and the proximal R6-group cannot flex out of this plane (**Figure 5A**). The trifluoromethyl of **1** fits very snugly into the hydrophobic end of the cavity (red sector in **Figures 3A,B**). The high shape complementarity of this cavity part and the R6-trifluoromethyl of **1** is typical for this bulky moiety in drug compounds. Thus, the R6-group alignment of **1** relies mostly on the collective potency of individually weak van der Waals interactions (**Figures 3A,B, 5A**).

Replacing the R6-trifluoromethyl of derivative **1** with a trifluoromethoxy in **4** (**Figure 2**) produces six times lower MAO-B inhibition (**Table 1**, Supplementary Figure S3B). This happens because the trifluoromethyl already fills the available space almost optimally (**Figures 3A,B**, **5A**) and elongating the substituent with an ether bond does not improve the fit (Supplementary Figure S3B). In fact, there is no extra wiggle room to fit the trifluoromethoxy (**Figures 3A,B**) if the 3-phenylcoumarin scaffold were kept at the "canonical" position (**Figures 1C,D**). Hence, the coumarin ring of **4** pushes slightly closer to the cofactor. Although the binding site residues can adjust slightly in response to this shift, the realignment or rather misalignment of the scaffold (Supplementary Figure S3B) imposes an energetic cost that is reflected in the MAO-B inhibition (**Table 1**). In addition, depending on the rotamer pose of the R6-trifluoromethoxy, a hydrogen bond could be bridged between a fluorine atom and the Pro102<sup>O</sup> by a water molecule (not shown).
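A six-fold potency difference can be put on an approximate energy scale. The sketch below assumes that the IC<sub>50</sub> ratio of two inhibitors measured in the same assay approximates the ratio of their dissociation constants, so that ΔΔG ≈ RT ln(IC<sub>50,weak</sub>/IC<sub>50,strong</sub>). The temperature of 298.15 K and the function name are assumptions made for illustration only, and the weaker IC<sub>50</sub> is simply set to six times 56 nM rather than taken from **Table 1**.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def ddg_from_ic50_ratio(ic50_weak_nm, ic50_strong_nm, temperature_k=298.15):
    """Approximate binding free energy difference (kcal/mol) from two IC50 values.

    Assumes both values come from the same assay, so that the IC50 ratio
    approximates the ratio of dissociation constants.
    """
    return R_KCAL * temperature_k * math.log(ic50_weak_nm / ic50_strong_nm)

# Illustrative input: a six-fold weaker inhibitor relative to 56 nM.
ic50_strong = 56.0
ic50_weak = 6 * ic50_strong
ddg = ddg_from_ic50_ratio(ic50_weak, ic50_strong)
print(f"ddG is roughly {ddg:.2f} kcal/mol")
```

Under these assumptions, a six-fold difference corresponds to roughly 1 kcal/mol, which is in the range of a single weak interaction or a modest misalignment penalty.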

#### The Effects of Halogenation on the 3-Phenyl Ring Alignment

The chlorine and fluorine substituents of prior coumarin-based inhibitors form a halogen bond with the Leu164<sup>O</sup> based on X-ray crystallography (PDB: 2V60, 2V61; **Figures 1A–D**; Binda et al., 2007). Accordingly, it is not surprising that those 3-phenylcoumarin derivatives with a single halogen substituent at their 3-phenyl rings are also capable of blocking the MAO-B activity (**Figure 4**, **Table 1**).

Although it is known that fluorine is the poorest halogen bond donor (Cavallo et al., 2016), the R7-fluorine groups of **20** and **22** (**Figure 2**) could form a halogen bond with the Leu164<sup>O</sup> (**Figures 6E,F**) similarly to the halogens of previously published inhibitors with validated binding modes (**Figures 1B–D**; Binda et al., 2007). In fact, the R7-halogen groups of **20** and **22** are inserted into the exact same position as the halogen groups of the established inhibitors (**Figure 1B** vs. **Figures 6E,F**). The MAO-B inhibition (**Table 1**) is reinforced further by the R6-hydroxyl group H-bonding with the Pro102<sup>O</sup> (magenta dotted lines in **Figures 6E,F**). Because both **20** and **22** are bonding simultaneously with the Leu164<sup>O</sup> and the Pro102<sup>O</sup>, they elicit equivalent or stronger inhibition than derivatives **21** (**Figure 5D**), **23** (Supplementary Figure S3K), and **24** (Supplementary Figure S3L) that do not retain either one of these two interactions. Docking suggests that replacing the R6-hydroxyl with an acetoxy group prevents **6** (**Figure 2**) from forming direct halogen or hydrogen bonds (**Figure 5C**), but the R6-acetoxy and R7-fluorine could potentially connect via a water bridge with the Pro102<sup>O</sup> (not shown). Despite this, the hydrophobic packing of the R6-acetoxy in **6** against the hydrophobic residues, mainly Phe103 (**Figure 5C**), is likely the reason behind doubling the inhibition in comparison to **20** (IC<sub>50</sub> value of 391 vs. 189 nM; **Table 1**, **Figure 6E**).

Introducing fluorine to the R6 position of the 3-phenyl ring in derivatives **21**, **23**, and **24** (**Figure 2**) produces MAO-B inhibition ranging from 433 to 1,060 nM (**Table 1**). Due to the overall planarity of the 3-phenylcoumarin scaffold (**Figures 1C,D**), the R6-fluorine (**Figure 5D**, Supplementary Figures S3K,L) cannot take on the equivalent site occupied by the halogens of validated coumarin-based inhibitors that form a halogen bond with the Leu164<sup>O</sup> (**Figure 1B**; Binda et al., 2007). In addition, the R6-fluorine is too limited in size to fill the end of the binding cavity as completely as, for example, the trifluoromethyl of **1** does (**Figures 3A,B**, **5A**). Moreover, although the R6-fluorine groups of derivatives **21**, **23,** and **24** (**Figure 5D**, Supplementary Figures S3K,L) reside within a suitable distance to form a halogen bond with the Pro102<sup>O</sup> (3.6 Å), the available angles seem to rule out actual bonding.
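The distinction between a suitable distance and a suitable angle can be made concrete with a small geometric check. The sketch below computes the halogen-to-acceptor distance and the C–X···O angle from three coordinates and applies cutoffs. The cutoff values (`max_distance=4.0` Å, `min_angle=140.0`°), the function names, and the coordinates are hypothetical choices for illustration, not criteria used in this study.

```python
import numpy as np

def halogen_bond_geometry(carbon, halogen, acceptor):
    """Return (X...O distance in Å, C-X...O angle in degrees)."""
    c, x, o = (np.asarray(p, dtype=float) for p in (carbon, halogen, acceptor))
    distance = np.linalg.norm(o - x)
    v_xc = c - x  # vector from the halogen to its bonded carbon
    v_xo = o - x  # vector from the halogen to the acceptor oxygen
    cos_angle = np.dot(v_xc, v_xo) / (np.linalg.norm(v_xc) * np.linalg.norm(v_xo))
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return distance, angle

def looks_like_halogen_bond(distance, angle, max_distance=4.0, min_angle=140.0):
    """Apply illustrative (hypothetical) distance and linearity cutoffs."""
    return bool(distance <= max_distance and angle >= min_angle)

# Hypothetical coordinates in Å: same 3.6 Å distance, two different angles.
carbon, fluorine = [0.0, 0.0, 0.0], [1.35, 0.0, 0.0]
bent_oxygen = [1.975, 3.545, 0.0]    # about 100 degrees at the halogen
linear_oxygen = [4.95, 0.0, 0.0]     # 180 degrees at the halogen

d_bent, a_bent = halogen_bond_geometry(carbon, fluorine, bent_oxygen)
d_lin, a_lin = halogen_bond_geometry(carbon, fluorine, linear_oxygen)
print(looks_like_halogen_bond(d_bent, a_bent), looks_like_halogen_bond(d_lin, a_lin))
```

Both hypothetical acceptors sit about 3.6 Å from the halogen, but only the near-linear arrangement passes the angle cutoff, which mirrors the argument that distance alone does not establish a halogen bond.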

Derivatives **15** and **16** (**Figure 2**) house three fluorine atoms at their 3-phenyl groups' R4, R6, and R7 positions (**Figures 6A,B**). In the case of **15** (**Figure 6A**), these halogen substituents assure an IC<sub>50</sub> value that is almost 150 nM stronger than what is seen with derivatives housing only a single fluorine moiety at the R6 or R7 position (**21**, **23,** and **24**; **Figure 5D**, Supplementary Figures S3K,L, **Table 1**). This is achieved by filling the hydrophobic cavity end (orange and red sectors in **Figure 3**) efficiently with the 3-phenyl ring and its fluorine moieties (**Figures 6A,B**). The fit is better for a 3-phenyl ring with the R6-trifluoromethyl than what is seen with the ring housing three separate fluorine substituents (**Figure 5A** vs. **Figure 6A**); accordingly, derivative **15** is not as potent an MAO-B inhibitor as **1** (IC<sub>50</sub> 292 vs. 56 nM; **Table 1**). In addition, depending on the 3-phenyl ring pose, the R4 or R7 fluorine groups could again potentially act as weak halogen bond donors to the Phe168<sup>O</sup> or the Leu164<sup>O</sup>, respectively (not shown).

#### The Effects of the Methoxy and Dimethylamine Groups for the 3-Phenyl Alignment

Derivatives with proximal methoxy groups (**Figure 2**), especially at the R7 position, assure relatively strong MAO-B inhibition (**Figure 4**) and produce at best 230 nM inhibition (e.g., **8** in **Figure 2**, **Table 1**).

Based on the docking, derivatives **8** and **11** (**Figure 2**) flip their R7-methoxy groups toward the Leu164<sup>O</sup> (**Figure 5E**, Supplementary Figure S3F), which is shielded from a clash with the methoxy group by forming an intra-protein H-bond with the Phe168<sup>N</sup> (not shown). Inserting an extra R5-methoxy into the 3-phenyl of **8** to produce the otherwise identical derivative **9** (**Figure 2**) weakens the inhibition slightly (IC<sub>50</sub> difference of 23 nM; **Table 1**), because the added methoxy group is unable to form particularly favorable interactions with the nearby Pro102<sup>O</sup> (**Figure 5F**). With derivatives **10** or **13** (**Figure 2**), the methoxy group is added to the phenyl ring's para position, and due to the planarity of the 3-phenylcoumarin scaffold, there is an energetic penalty for pushing the group toward either side of the cavity end (red sector in **Figures 3A,B**). Accordingly, to avoid a scaffold misalignment, the R6-methoxy group of **10** (and **13**) points directly toward the side chains of Phe103, Pro104, Trp119, and Ile199 (Supplementary Figures S3E,F), which, in turn, produces roughly 170 nM difference in the IC<sub>50</sub> values with the otherwise identical **8** (**Figure 5E**, **Table 1**) in favor of the R7-methoxy position.

A dimethylamine group at the 3-phenyl ring's para position (a.k.a. dimethylaniline; **Figure 2**) produces moderately strong MAO-B inhibition (**Table 1**) for derivatives **17** (**Figure 2**; IC<sub>50</sub> value of 400 nM), **18** (**Figure 2**; IC<sub>50</sub> value of 798 nM), and **19** (**Figure 2**; IC<sub>50</sub> value of 955 nM). This is due to the ability of the R6-dimethylamine to fill the cavity end (red sector in **Figures 3A,B**) similarly to the R6-trifluoromethyl of **1** (**Figures 5A,B** vs. **Figures 6C,D**, Supplementary Figure S3J). The downside is that the bulkier R6-substituent cannot form halogen or hydrogen bonds with water or residues nor push against either side of the cavity and, most importantly, it causes unfavorable coumarin alignment. Accordingly, the R6-dimethylamine of derivatives **17**–**19** packs directly against the side chains of Phe103, Pro104, Trp119, Leu164, and Ile316 (**Figures 4C,D**, Supplementary Figure S3J).

#### Refining the Alignment via the R1–R3 Substituents of the Coumarin Ring

Inserting a functional group such as methoxy to the R1/R2 position of the coumarin ring (**Figure 2**), capable of forming both hydrophobic and hydrophilic interactions, generally improves the MAO-B inhibition (**Figure 4**, **Table 1**).

The benefits of this sort of dual-purpose group are evident when comparing the activity of otherwise identical derivatives with and without the proximal group; i.e., **11**, which lacks only the R1-methoxy of **8** (Supplementary Figure S3F vs. **Figure 5E**), produces significantly lower inhibition (IC<sub>50</sub> value of 798 vs. 231 nM; **Table 1**). On one hand, the methyl of the R1-methoxy group of **8** (**Figure 5E**) packs into a hydrophobic niche formed by the side chains of Tyr60, Gln206, Tyr326, Leu328, Phe343, and Met341 (green sector in **Figures 3A,B**). On the other hand, the methoxy's oxygen increases the 3-phenyl ring's hydrophilicity and softens the clash of the coumarin ring with the solvent shielding the cofactor (**Figure 5E**).

Switching the R1-methoxy of **1** into the R2 position in **2** (**Figure 2**) makes the alignment of the coumarin ring more challenging, because the R2-methoxy is unable to occupy the same hydrophobic niche (green sector in **Figures 3A,B**) as the R1-methoxy (**Figure 4A** vs. Supplementary Figure S3A). Although the R1/R2 methoxy switch by no means prevents binding, it weakens the inhibition by ∼80 nM (**Table 1**). Paradoxically, the opposite and considerably larger difference in inhibition is produced by the R1/R2 switch when comparing the activity of derivatives **20** and **22** (**Figure 2**; **Table 1**). Accordingly, **20** with the R2-methoxy (IC<sub>50</sub> value of 391 nM; **Table 1**) provides twice as strong inhibition as **22** with the R1-methoxy (IC<sub>50</sub> value of 831 nM; **Table 1**). The vast difference is caused by the coordinated R6/R7 interactions of the 3-phenyl ring, which push the coumarin ring closer to the Tyr326 side chain, a critical shift that is stunted by the R1-methoxy of **22** (**Figure 5E** vs. **Figure 5F**).

Replacing the R2-acetoxy of **3** (**Figure 2**) with the R1-methoxy in **4** (**Figure 2**) weakens the inhibition by ∼180 nM (**Table 1**). The coumarin ring of **4** is pushed closer to the cofactor due to the addition of the R6-trifluoromethoxy into the 3-phenyl ring (**Figure 5B** vs. Supplementary Figure S3B) and, in this new pose, the methyl of the R2-acetoxy is able to occupy the small hydrophobic niche (green sector in **Figures 3A,B**) while exposing the acetoxy's oxygen atoms to the solvent (**Figure 3B**). However, substituting the R1-methoxy of **18** with the R2-acetoxy in **19** (**Figure 2**) does not improve the inhibition; instead, the inhibition weakens by ∼250 nM (**Table 1**). This happens because the R6-dimethylamine of **19** (Supplementary Figure S3J) is not forcing the scaffold to align close to the cofactor the same way as the R6-trifluoromethoxy does (**Figure 5B** vs. **Figures 6C,D**). In contrast, replacing the R1-methoxy of **18** with the R2-hydroxyl in **17** improves the inhibition (IC<sub>50</sub> improvement of 234 nM; **Table 1**) by promoting water solubility near the cofactor (**Figure 6C** vs. **Figure 6D**).

The R6 and R7 interactions of **7** (**Figure 2**) are expected to closely resemble those of **6** (Supplementary Figure S3D vs. **Figure 5C**), but its coumarin ring's R1- and R3-chlorine groups weaken the inhibition by ∼700 nM (**Table 1**). The R2-methoxy of **6** is able to play into the hydrophobic/hydrophilic dual nature of the cavity end facing the cofactor (**Figure 5C**) without occupying the small hydrophobic niche (green sector in **Figures 3A,B**). In this respect, the R1-chlorine is too bulky to occupy this specific niche although a methoxy group at the same position should be able to occupy the available space (e.g., **1** in **Figure 5A**).

#### Selectivity of the 3-Phenylcoumarin Derivatives

Determining the specificity and subtype selectivity of the 3-phenylcoumarin derivatives for MAO-B is needed to evaluate their true pharmacological potential. Unintended off-target effects with other proteins can render even the most promising drug candidates useless, ambiguous or even toxic. Here, the focus is put on MAO-A, which shares activity with MAO-B in the deamination of dopamine and the dietary amines tyramine and tryptamine. In addition, the effects of the derivatives are tested with a specific subset of proteins, including HSD1, aromatase, CYP1A2, and ER, whose function is linked to different stages of estradiol action and metabolism. These particular proteins were examined with the derivatives because they are known to have structurally similar ligands or even coumarin-based inhibitors based on prior studies and our upcoming study (Mattsson et al., 2014; Niinivehmas et al., 2016; Niinivehmas et al., unpublished results).

Monoamine oxidase A (MAO-A) is more prevalent than the subtype B in the gastrointestinal tract and, accordingly, the MAO-A inhibition can cause accumulation of tyramine from dietary sources. Because tyramine can displace neurotransmitters leading to a potentially fatal hypertensive crisis, it is highly desirable to design MAO-B-specific inhibitors lacking MAO-A activity. The vast majority of the novel derivatives do not produce MAO-A inhibition at 100 µM despite the fact that it is ten times the concentration used in this study to determine the MAO-B inhibition percentage (**Table 1**, Supplementary Table S1). Furthermore, in those few cases where inhibition was detected, especially with the most potent MAO-B derivatives, it remains at a moderate or close to non-existent level (**Table 1**). The strongest MAO-A inhibition was elicited by derivatives **42** and **43** (48.86 and 56.76%), but derivatives **27** and **45** (43.83 and 43.36%) are close runners-up and the next analogs down the list are already much weaker (Supplementary Figure S4, Supplementary Table S1). Notably, **1**, which is the most potent MAO-B inhibitor of the derivative set with the IC<sub>50</sub> value of 56 nM, does not produce MAO-A inhibition at 100 µM (**Table 1**). The molecular basis for the lack of MAO-A activity is evident when comparing the shape and size of the active sites of the two enzyme subtypes in the context of 3-phenylcoumarin binding (**Figure 3A** vs. **Figure 3B**).

17-β-hydroxysteroid dehydrogenase 1 (HSD1), which functions as the catalyst of the final reducing step in the estradiol biosynthesis, is often overexpressed in breast cancer and endometriotic tissue (Vihko et al., 2004; Dassen et al., 2007; Hanamura et al., 2014). Thus, specific inhibition of HSD1 has potential to reduce effective estradiol levels in the treatments. Although the synthesized 3-phenylcoumarin set contains several molecules that exhibit activity toward HSD1, the inhibition was generally very weak and the active compounds are not among the most potent MAO-B inhibitors. Of the 24 most potent MAO-B inhibitors, the strongest HSD1 inhibition could be recorded for **20** and **22** (46 and 54%; **Figure 2**, **Table 1**); however, considerably higher activity (48.20–83.90%) was seen with derivatives **30**, **31**, **33**, **38,** and **48** (Supplementary Figure S4, Supplementary Table S1). Modest HSD1 inhibition (12–33%) was also elicited by **6**, **15**, **16**, **23**, **24** (**Figure 2**, **Table 1**) and **51** (Supplementary Figure S4, Supplementary Table S1). Importantly, derivative **1**, which is the most potent MAO-B inhibitor of the derivative set, does not inhibit HSD1.

Aromatase (CYP19A1) inhibition, which is important for blocking local estradiol synthesis for example in breast cancer treatment (Pasqualini et al., 1996), was not detected with the derivatives (**Table 1**, Supplementary Table S1). Although 3-phenylcoumarin should be able to sterically mimic the steroidal positioning at the active site (not shown), it would have to house a clear-cut H-bond acceptor at the R5/R7-position in the 3-phenyl to facilitate aromatase binding. This is because X-ray crystallography shows that the Asp309 side chain is in a neutral state at pH 7.4 and donates a proton to the carbonyl group of inhibitor androstenedione (PDB: 3EQM) (Ghosh et al., 2009). Inserting a hydroxyl group to the R5/R7 position could put an H-bond acceptor to this same location with the 3-phenylcoumarins (see **31**, **38**, **40**, **42**, **43**; Supplementary Figure S4, Supplementary Table S1). However, because the hydroxyl always has a dual role as an H-bond donor as well, any aromatase binding by the derivatives remains theoretical as it is prevented by a proton donor clash. The issue is described more thoroughly in our upcoming study (Niinivehmas et al., unpublished results).

Estrogen receptor (ER) agonists/antagonists or selective modulators are developed for infertility, contraception, hormone replacement, and ER positive breast cancer therapies. If the MAO-B inhibitors also functioned as ER agonists, they could promote tumorigenesis in the breast tissue as a side effect. Unintended ER inhibition could also disturb natural estrogen levels or interrupt ER-targeted therapies. The measurements indicate that the 3-phenylcoumarin derivatives are either a hit or a miss when considering ER inhibition. Although the ER activity could not be measured for all of the analogs because the synthesis products ran out, the acquired results overwhelmingly support our prior findings stating that the R2-hydroxyl or the R6-hydroxyl/halogen is needed to prompt ER activity (Niinivehmas et al., 2016). This ER-specific effect is prominent with **12**, **20**, **22**, **27**, **28**, **29**, **30**, **39**, **40**, **41**, **44**, and **48** (**Table 1**, Supplementary Table S1, **Figure 2**, Supplementary Figure S4) and, moreover, ER activity is predicted for **17** and likely for **32** and **47** based on the well-established trend.

Cytochrome P450 1A2 (CYP1A2) catalyzes the oxidation of xenobiotics, especially polyaromatic hydrocarbons and steroid hormone-sized compounds such as 3-phenylcoumarins, into a more soluble form for excretion (Zhou et al., 2010). Accordingly, it was prudent to get a rough estimate of the CYP1A2 inhibition levels for the novel 3-phenylcoumarin derivatives as well. In general, all of the derivatives inhibited CYP1A2 at some level (**Table 1**, Supplementary Table S1); however, typically the most potent CYP1A2 inhibitors such as **21**–**24** were less potent MAO-B inhibitors (**Table 1**). Similar to MAO-A, HSD1, and aromatase, the most potent MAO-B derivative **1** displayed only low CYP1A2 activity (IC<sub>50</sub> value of 124 µM; **Table 1**).

#### Overall Assessment on the Druglikeness

As a whole, the selectivity analysis indicates that the cross-reactivity of 3-phenylcoumarins can be managed or even avoided via specific functional group substitutions without taking away the MAO-B activity. Coumarins in general do not belong to the PAINS (pan-assay interference compounds) category, as coumarin is a privileged scaffold structure. Only derivative **50**, which is not a potent MAO-B inhibitor (Supplementary Table S1, Supplementary Figure S4), was recognized as a potential PAINS ligand by the PAINS3 filter (or A filter) in the CANVAS module in MAESTRO (Baell and Holloway, 2010). In the ChEMBL database, ∼14,200 coumarin derivatives are included (observed online on 8.2.2018), which indicates that the scaffold can be tailored to target a multitude of proteins. Despite this, the literature does not raise widespread concerns that the coumarin-based compounds in particular would cause harmful cross-reactivity or selectivity issues. The 24 active derivatives presented in this study (**Table 1**, **Figure 2**) have lower potency than some of the prior 3-phenylcoumarin compounds (Supplementary Figure S6, Supplementary Table S2) (Matos et al., 2009b, 2011a,b; Santana et al., 2010; Viña et al., 2012a); however, one has to be aware of the fact that these results originate from different laboratories and activity assays and are, therefore, not fully comparable. To a degree this is the case even for the positive control pargyline (Fisar et al., 2010). Importantly, the new compounds follow closely the Lipinski rule of five regarding the logP value (logP < 5) and remain in the logP range of 2–4. Moreover, the ligand-lipophilicity efficiency (LiPE) values of the new analogs suggest reasonable druglikeness (Freeman-Cook et al., 2013). What is more, derivative **1** clearly has the most promising selectivity profile of the derivatives for future consideration, because it is not only the most potent MAO-B inhibitor of the set but it is also selective against the other tested enzymes.
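The lipophilicity-related criteria mentioned above can be sketched numerically. LiPE is commonly defined as pIC<sub>50</sub> − logP, where pIC<sub>50</sub> = −log<sub>10</sub>(IC<sub>50</sub> in mol/L). In the minimal example below, the IC<sub>50</sub> of 56 nM is the value reported for derivative **1**, whereas the logP of 3.0 is a hypothetical placeholder within the stated 2–4 range, because the actual logP values are listed in **Table 1**. The function names are likewise illustrative.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)

def lipe(ic50_nm, logp):
    """Ligand-lipophilicity efficiency, taken here as pIC50 - logP."""
    return pic50_from_nm(ic50_nm) - logp

def passes_lipinski_logp(logp, limit=5.0):
    """Check only the logP criterion of the Lipinski rule of five."""
    return logp < limit

ic50_nm = 56.0           # reported IC50 of derivative 1
hypothetical_logp = 3.0  # assumption: placeholder within the reported 2-4 range

print(f"pIC50 = {pic50_from_nm(ic50_nm):.2f}")
print(f"LiPE  = {lipe(ic50_nm, hypothetical_logp):.2f}")
print(f"logP criterion met: {passes_lipinski_logp(hypothetical_logp)}")
```

Because LiPE subtracts logP from potency, two compounds with equal IC<sub>50</sub> values are ranked in favor of the less lipophilic one.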

# CONCLUSION

A broad set of 3-phenylcoumarin derivatives was designed using virtual combinatorial chemistry or rationally de novo, synthesized and tested for MAO-B inhibition potency using spectrophotometry (Supplementary Table S1). The results further validate prior studies suggesting that the 3-phenylcoumarin is a suitable scaffold for building potent small-molecule MAO-B inhibitors by functionalizing its ring systems. A moderate MAO-B inhibition could be achieved by inserting a wide variety of functional groups into the coumarin (R1–R3; **Figure 4**) or 3-phenyl (R4–R7; **Figure 4**) rings (Supplementary Table S1). Twenty-four of the derivatives (**Figures 2**, **3D**) were found to elicit >70% inhibition (**Table 1**, Supplementary Figures S1, S2). These promising derivatives inhibit the MAO-B at a ∼100 nM to ∼1 µM range (**Table 1**), while the most potent derivative **1** produces ∼56 nM MAO-B inhibition. A molecular docking-based (**Figures 5**, **6**, Supplementary Figure S3) SAR analysis (**Figure 4**) describes the determinants of the MAO-B binding and inhibition at the atomistic level. Firstly, without any kind of 3-phenyl substituents, no inhibition was detected. Although both hydrogen and halogen bonding can assist the 3-phenyl alignment and facilitate inhibition (**Figures 6E,F**, **Table 1**), the ability of the functionalized ring to fill the hydrophobic end of the binding cavity (red sector in **Figures 3A,B**) is the most important property for ensuring strong MAO-B inhibition (e.g., R6-trifluoromethyl of **1**; **Figure 5A**). Secondly, the SAR analysis reveals that a spot-on placement and composition of the coumarin ring's substituents can further enhance the MAO-B inhibition (**Figure 2**, **Table 1**); however, these effects are ultimately dependent on the scaffold alignment, which, in turn, depends on the 3-phenyl ring substituents (**Figure 4**).
The cross-reactivity analysis focusing on MAO-A and a subset of estradiol metabolism-linked HSD1, aromatase, CYP1A2 and ER highlighted the potential of the 3-phenylcoumarins, especially the most potent MAO-B derivative **1**, for producing selective MAO-B inhibition. Finally, the most potent 3-phenylcoumarin analogs presented in this study are estimated to operate at close to optimal ligand-lipophilicity efficiency, a feature highlighting their overall druglikeness.

### AUTHOR CONTRIBUTIONS

SR: was responsible for the experimental testing regarding MAO-A and MAO-B; EMu: performed the MAO-A experimental analysis; SR: did the docking into MAO-B and prepared most of the figures; PAP: was responsible for the final SAR analysis; SK, ES, and JH: performed the organic synthesis; MA: did the PAINS screening; PK: performed the HSD1 measurements; NN, RJ, and HR: did the experimental analysis regarding CYP1A2; PH and MaP: executed the experimental analysis regarding aromatase; MiP: did preliminary screening for designing MAO-B ligands; SN, EMa, SK, and OTP: designed the molecules for the selected targets; SN, EMa, and OTP: designed the study. All the coauthors were involved in the manuscript preparation and approved the final version.

# FUNDING

Academy of Finland is acknowledged for funding (OTP. Project No. 250311).

#### ACKNOWLEDGMENTS

The Finnish IT Center for Science (CSC) is acknowledged for computational resources (OTP; Project Nos. jyy2516 and jyy2585).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00041/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Rauhamäki, Postila, Niinivehmas, Kortet, Schildt, Pasanen, Manivannan, Ahinko, Koskimies, Nyberg, Huuskonen, Multamäki, Pasanen, Juvonen, Raunio, Huuskonen and Pentikäinen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Efficient Implementation of the Nwat-MMGBSA Method to Rescore Docking Results in Medium-Throughput Virtual Screenings

Irene Maffucci, Xiao Hu, Valentina Fumagalli and Alessandro Contini\*

Dipartimento di Scienze Farmaceutiche, Sezione di Chimica Generale e Organica "Alessandro Marchesini," Università degli Studi di Milano, Milan, Italy

#### Edited by:

Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria

#### Reviewed by:

Alfonso T. Garcia-Sosa, University of Tartu, Estonia Pramod C. Nair, Flinders University, Australia

\*Correspondence: Alessandro Contini

alessandro.contini@unimi.it

Specialty section: This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 25 October 2017 Accepted: 19 February 2018 Published: 05 March 2018

#### Citation:

Maffucci I, Hu X, Fumagalli V and Contini A (2018) An Efficient Implementation of the Nwat-MMGBSA Method to Rescore Docking Results in Medium-Throughput Virtual Screenings. Front. Chem. 6:43. doi: 10.3389/fchem.2018.00043

Nwat-MMGBSA is a variant of MM-PB/GBSA based on the inclusion of a number of explicit water molecules that are the closest to the ligand in each frame of a molecular dynamics trajectory. This method demonstrated improved correlations between calculated and experimental binding energies in both protein-protein interactions and ligand-receptor complexes, in comparison to the standard MM-GBSA. A protocol optimization, aimed at maximizing efficacy and efficiency, is discussed here considering penicillopepsin, HIV1-protease, and BCL-XL as test cases. Calculations were performed in triplicate on both classic HPC environments and standard workstations equipped with a GPU card, showing no statistical differences in the results. Likewise, no relevant differences in correlation to experiments were observed when performing Nwat-MMGBSA calculations on 4 or 1 ns long trajectories. A fully automatic workflow for structure-based virtual screening, covering every step from library set-up to docking and Nwat-MMGBSA rescoring, has then been developed. The protocol has been tested against no rescoring or standard MM-GBSA rescoring within a retrospective virtual screening of inhibitors of AmpC β-lactamase and of the Rac1-Tiam1 protein-protein interaction. In both cases, Nwat-MMGBSA rescoring provided a statistically significant increase in the ROC AUCs of between 20 and 30%, compared to docking scoring or to standard MM-GBSA rescoring.

Keywords: MM-GBSA, explicit water, molecular dynamics, GPU, structure based virtual screening, protease, protein-protein interactions

# INTRODUCTION

Structure based virtual screening (SBVS) methods are widely applied in drug discovery (Enyedy and Egan, 2008; Sousa et al., 2013). In most cases, SBVSs are performed in the hit-to-lead development phase of the drug discovery process, with multiple successful outcomes (Enyedy et al., 2001a,b; Vangrevelinghe et al., 2003). In SBVS-related studies, scoring functions are mostly applied for potential hit selection. In general, scoring functions are based on empirical, knowledge-based, or molecular mechanics force field derived potentials (Wang et al., 2003; Raha et al., 2007). Additionally, to make the virtual screening process computationally inexpensive, scoring functions are usually simplified, so that some important contributions known to influence the binding affinity are neglected (Sousa et al., 2006; Moitessier et al., 2008). Inevitably, such simplified methods tend to fail in the hit optimization phase, where structurally similar compounds must be discriminated more carefully to better predict biological activity (Leach et al., 2006; Tirado-Rives and Jorgensen, 2006; Warren et al., 2006; Enyedy and Egan, 2008).

A better scoring can be achieved by averaging the energy evaluation over an ensemble of conformations from a molecular dynamics trajectory of the complex, which is the underlying concept of the molecular mechanics Poisson-Boltzmann / Generalized Born surface area (MM-PB/GBSA) analysis. Of course, MM-PB/GBSA methods come at an increased computational cost (Massova and Kollman, 2000). Nonetheless, they have been successfully applied to estimate binding energies (Kollman et al., 2000), or incorporated as a scoring method in SBVS applications (Lyne et al., 2006; Zhou et al., 2006; Ferrari et al., 2007; Xiong et al., 2007; Xu et al., 2010; Xu, 2012; Knight et al., 2014). The treatment of the solvent in MM-PB/GBSA calculations is implicit, providing an acceptable estimation of the energy contribution as long as bulk water is the only solvent-related concern (Wong and Lightstone, 2011; Yang et al., 2013). However, explicit water molecules might also be important in forming biomolecular complexes (Chong and Ham, 2017), particularly waters that bridge the ligand and the receptor (Wong et al., 2009; Abel et al., 2011; Ahmad et al., 2011; Wallnoefer et al., 2011; Maffucci and Contini, 2013; Mikulskis et al., 2014). Indeed, by analyzing several thousand crystallographic complexes, it was observed that at least one water molecule mediates contacts between the partners in about two thirds of all the considered systems (Hendlich et al., 2003). Thus, several computational methods were proposed to aid the identification of important water molecules in crystal structures (Raymer et al., 1997; García-Sosa et al., 2003; Amadasi et al., 2006).
Moreover, although replacing a water molecule in the binding site is a generally accepted strategy to increase drug potency, it has been shown that better pharmacodynamic properties might be obtained by keeping a tightly bound water as a bridge between the ligand and the receptor (García-Sosa, 2013). The effects of targeting or displacing binding site waters in drug design can be rigorously assessed by free energy calculations (García-Sosa and Mancera, 2010), which, however, are still too demanding when libraries of hundreds of molecules need to be evaluated. Therefore, several approaches have attempted to include the contribution of water-mediated interactions in the ligand docking score (Young et al., 2007; Ricchiuto et al., 2008; Forli and Olson, 2012; Ross et al., 2012; Kumar and Zhang, 2013; Murphy et al., 2016) or in the MM-PB/GBSA estimated binding energy (Checa et al., 1997; Wong et al., 2009; Genheden et al., 2011; Wallnoefer et al., 2011; Greenidge et al., 2013; Maffucci and Contini, 2013).

In this framework, we developed an MM-PB/GBSA variant, which we refer to as Nwat-MMGBSA, that provided good-to-excellent results in ranking the binding energies of different protein-ligand or protein-protein complexes (Maffucci and Contini, 2013, 2016). Nwat-MMGBSA is based on the inclusion of a number of explicit water molecules, which are selected as the closest to the ligand in each frame of the MD trajectory and are included as part of the receptor during the analysis. In addition to our work (Maffucci and Contini, 2013, 2016), Aldeghi and coworkers recently validated, by a thorough statistical analysis, the use of this approach on bromodomains (Aldeghi et al., 2017). Compared to other methods that include explicit water in MM-PB/GBSA calculations, Nwat-MMGBSA might have some advantages. For instance, relevant explicit waters might be selected from the crystal structure (Wong et al., 2009; Wallnoefer et al., 2011). However, this implies that high resolution crystal structures are available, while Nwat-MMGBSA calculations can be performed on receptor models obtained by other techniques, such as homology modeling or NMR. Moreover, crystallographic water sites might derive from the average electron density of several molecules competing for the same position (Schiffer and Hermans, 2003). Indeed, we previously observed that a water bridge between the ligand and the receptor found in the crystal structure of topoisomerase I in complex with topotecan (Staker et al., 2002) was described by the competition of three different waters in a 4 ns MD trajectory (Maffucci and Contini, 2013). It was also reported that explicit waters for MM-PB/GBSA calculations might be selected from MD simulations according to their distance from the ligand (Zhu et al., 2014). In this case, the distance from the ligand atoms is fixed, while the number of waters is different in each snapshot selected for MM-PB/GBSA analysis.
However, by comparing this method to Nwat-MMGBSA, where the number of selected waters is constant among all snapshots, we observed that Nwat-MMGBSA provided a better correlation with experiments and a better reproducibility among multiple repetitions of the same calculation (Maffucci and Contini, 2016). In this work, aiming to make Nwat-MMGBSA suitable for rescoring ligands in low- to medium-throughput SBVS experiments, we optimized the protocol to improve its efficiency without losing accuracy. We selected penicillopepsin (James et al., 1992; Ding et al., 1998; Hou et al., 2011a), HIV1-protease mutants (Shen et al., 2010; Olajuyigbe et al., 2011) and BCL-X<sub>L</sub> (Lessene et al., 2013) as test systems with known experimental data, either binding free energy (ΔG), inhibition constant (K<sub>i</sub>), or IC<sub>50</sub>. Our studies have shown improvements in the coefficient of determination to experimental data (r<sup>2</sup>) ranging from 10 to 60%, depending on the number of explicit water molecules considered in the energy evaluation. Moreover, we assessed the Nwat-MMGBSA approach for SBVS rescoring performance in a ligand-protein interaction and a protein-protein interaction (PPI) scenario (AmpC β-lactamase and Rac1-Tiam1, respectively). In both cases, improved outcomes were observed compared to either docking scoring or standard MM-GBSA rescoring. Furthermore, the complete SBVS workflow applied in this work, including Nwat-MMGBSA rescoring, is provided in the Supplementary Information as a set of bash and tcsh scripts that, together with working tutorials, should make it readily applicable to other biomolecular systems of interest.

**Abbreviations:** SBVS, structure based virtual screening; VS, virtual screening; ROC, receiver operating characteristic; AUC, area under curve; MD, molecular dynamics; MM-GBSA, molecular mechanics Generalized Born surface area; PPI, protein-protein interaction; SD, steepest descent; CG, conjugate gradient; NVT, constant number of particles, volume and temperature; NPT, constant number of particles, pressure and temperature.

# METHODS

#### Preparation of Complexes

Crystal structures of the penicillopepsin [PDB codes: 1APU, 1APV, 1APT, 1APW (James et al., 1992), 2WEA, 2WEB, and 2WEC (Ding et al., 1998)] and HIV1-protease [PDB codes: 3NU3, 3NU4, 3NU5, 3NU6, 3NUJ, 3NU9, 3NUO (Shen et al., 2010), 3NDW, and 3NDX (Olajuyigbe et al., 2011)] complexes (Table S1) were obtained from the RCSB Protein Data Bank (Figures S1, S2). However, for the BCL-X<sub>L</sub> system (Figure S3), only the 3ZK6, 3ZLN, 3ZLO, and 3ZLR complexes were available as crystal structures (Lessene et al., 2013). Therefore, the starting structures of the unavailable complexes were reconstructed from the available ones using the MOE software (Molecular Operating Environment, v2016.08, 2016). Ligand partial charges were derived with the AM1-BCC method using the antechamber (Wang et al., 2006) software of the AmberTools15 package (Case et al., 2014). All waters, ions and stabilizing agents present in the crystal structures were removed. The protonation state of every titratable residue within the complexes was assigned at physiological conditions using the Protonate-3D module of MOE.

#### MD Simulations

MD simulations were performed with the pmemd.MPI or pmemd.cuda (Götz et al., 2012; Salomon-Ferrer et al., 2013) modules, depending on the hardware (classical HPC environment or GPU equipped workstations, respectively), included in the Amber14 package (Case et al., 2014). The ff14SB (Maier et al., 2015) and gaff (Wang et al., 2004) force fields were adopted for the protein and the ligand, respectively, in all simulations. In each complex, the total charge was neutralized by adding Na<sup>+</sup> or Cl<sup>−</sup> ions, and the systems were solvated in an octahedral box of TIP3P water (Jorgensen et al., 1983) extending 10 Å from the solute.

The equilibration and production protocols were updated to optimize performance with respect to previous studies (Maffucci and Contini, 2013, 2016). The systems were initially relaxed by optimizing the positions of hydrogens [1,000 cycles of steepest descent (SD) and 5,000 cycles of conjugate gradient (CG), up to a gradient of 0.01 kcal·mol<sup>−1</sup>·Å<sup>−1</sup>; restraints of 100 kcal·mol<sup>−1</sup>·Å<sup>−2</sup> were applied on heavy atoms] and of ions and waters [2,000 cycles of SD and 5,000 cycles of CG, up to a gradient of 0.1 kcal·mol<sup>−1</sup>·Å<sup>−1</sup>; restraints of 50 kcal·mol<sup>−1</sup>·Å<sup>−2</sup> were applied on atoms other than ions and water]. The solvent box was then equilibrated at 300 K by 100 ps of NVT and 100 ps of NPT simulation using a Langevin thermostat with a collision frequency of 2.0 ps<sup>−1</sup> (restraints of 50 and 25 kcal·mol<sup>−1</sup>·Å<sup>−2</sup> were applied on the solute for NVT and NPT simulations, respectively). Subsequently, two cycles of restrained minimization (2,500 cycles of SD and 5,000 cycles of CG, up to a gradient of 0.1 kcal·mol<sup>−1</sup>·Å<sup>−1</sup>, with restraints of 25 and 10 kcal·mol<sup>−1</sup>·Å<sup>−2</sup> on backbone atoms, respectively) were performed. The systems were then heated up to 300 K in 6 steps (ΔT = 50 K) of 5 ps each, while backbone restraints were gradually reduced from 10.0 to 5.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>. An equilibration of 1.6 ns was then performed, initially using the NVT ensemble (100 ps, ligand and backbone restraints = 5.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>) followed by NPT (1 step of 200 ps with ligand and backbone restraints = 5.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>, then 3 steps of 100 ps each reducing the ligand and backbone restraints from 5.0 to 1.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>, and finally 1 step of 500 ps with ligand and backbone restraints of 1.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>). The last equilibration step consisted of 500 ps of unrestrained NVT simulation. Finally, production runs were conducted under NVT conditions at 300 K for 1 or 4 ns. An electrostatic cutoff of 8.0 Å, PME (Darden et al., 1993) for long-range electrostatic interactions, and the SHAKE (Ryckaert et al., 1977) algorithm were applied to all the calculations. Three independent simulations were performed for each hardware setup (GPU workstation or CPU HPC cluster). For the simulations performed on GPUs, the default single precision/fixed precision (SPFP) version of pmemd.cuda (Le Grand et al., 2013) was applied in all steps, except for geometry minimizations, where the double precision/fixed precision (DPFP) version was adopted.
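The staged equilibration that follows heating can be tabulated to check that its parts add up to the stated 1.6 ns. The Python sketch below is only illustrative: the tuple layout and the function name `total_ns` are assumptions, the intermediate restraint values of the three 100 ps NPT steps are not given in the text and are therefore left as a range, and the last restrained NPT stage is taken as 500 ps, which is the value that makes the total equal 1.6 ns.

```python
# Equilibration stages after heating: (ensemble, duration in ps, restraint note).
# Durations follow the protocol text; the layout itself is hypothetical.
STAGES = [
    ("NVT", 100, "5.0"),
    ("NPT", 200, "5.0"),
    ("NPT", 100, "between 5.0 and 1.0 (step 1 of 3)"),
    ("NPT", 100, "between 5.0 and 1.0 (step 2 of 3)"),
    ("NPT", 100, "between 5.0 and 1.0 (step 3 of 3)"),
    ("NPT", 500, "1.0"),
    ("NVT", 500, "none (unrestrained)"),
]


def total_ns(stages):
    """Sum the stage durations (given in ps) and return the total in ns."""
    return sum(duration_ps for _, duration_ps, _ in stages) / 1000.0


for ensemble, duration_ps, restraint in STAGES:
    print(f"{ensemble:>3}  {duration_ps:>4} ps  restraint: {restraint}")
print(f"total equilibration: {total_ns(STAGES)} ns")
```

Keeping the schedule as data rather than as hard-coded input files makes it easy to verify the total length before launching a batch of simulations.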

All MD production trajectories were processed with cpptraj for backbone RMSD (Figures S4–S11), solute-solvent hydrogen bond (donor-acceptor distance cutoff of 4.0 Å, angle cutoff of 150°), and water density (grid analysis over a cubic box of 50 Å × 50 Å × 50 Å, mesh = 0.5 Å, centered on the ligands) analyses. Images of water density plots were obtained by using UCSF Chimera (Pettersen et al., 2004).

#### Nwat-MMGBSA Analyses

MM-GBSA and Nwat-MMGBSA analyses were performed with the MMPBSA.py script (Miller et al., 2012) of the AmberTools15 package. The analyses were conducted on either the 1st or the 4th ns of the production runs by selecting 100 frames evenly spaced out. The GB-Neck2 implicit solvent model (Nguyen et al., 2013) was chosen for the GB calculations and the salt molar concentration in solution was set at 0.15 M. Entropy was neglected in all calculations, since the benefits of including its contribution still remain controversial (Weis et al., 2006; Hou et al., 2011a; Wallnoefer et al., 2011; Yang et al., 2011) and normal mode calculations are also extremely time consuming. It should be noted that neglecting entropy, although acceptable when comparing ligands of similar size and structure (Kollman et al., 2000; Wang et al., 2001; Wong et al., 2009), might lead to errors when the analysis involves ligands that are structurally rather different (Oehme et al., 2012).

The Nwat-MMGBSA script (**Figure 1**) uses the cpptraj module of AmberTools15 to process the solvated MD trajectory. When Nwat > 0, the water molecules closest to the ligand are preserved, while the remaining ones are stripped from the selected frames by using the cpptraj command closest. The total number of water molecules to be kept in the trajectory is given by the Nwat flag in the script input section (in this work, we evaluated Nwat = 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100). The number of frames selected from the original MD trajectory is defined by the r flag in the script input section, which corresponds to the interval keyword in the MMPBSA.py script (Miller et al., 2012). In this work, r was set to 10, meaning that one out of every 10 frames (i.e., 100 frames per nanosecond) was sampled. The preserved closest water molecules are considered as part of the receptor during the MM-GBSA analysis. In analogy with studies on the MM-PB/GBSA performance previously reported by us and by others (Hou et al., 2011b; Maffucci and Contini, 2013, 2016; Xu et al., 2013), the coefficient of determination (r<sup>2</sup>) between experimental data and calculated binding energies was used as the evaluation metric.
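The idea of keeping a fixed number of waters per frame can be illustrated with a short Python sketch. This is not the cpptraj implementation: the function names are hypothetical, each water is represented by its oxygen coordinates only, and periodic imaging is ignored, all of which are simplifying assumptions.

```python
import math


def closest_waters(ligand_xyz, water_oxygen_xyz, nwat):
    """Return the indices of the nwat waters nearest to any ligand atom.

    Each water is ranked by the minimum distance between its oxygen and
    the ligand atoms; no periodic imaging is applied (assumption).
    """
    if nwat <= 0:
        return []  # Nwat = 0 corresponds to a standard MM-GBSA analysis

    def min_distance(water):
        return min(math.dist(water, atom) for atom in ligand_xyz)

    ranked = sorted(range(len(water_oxygen_xyz)),
                    key=lambda i: min_distance(water_oxygen_xyz[i]))
    return ranked[:nwat]


def sample_frames(n_frames, interval):
    """Frame indices kept when one frame out of every `interval` is analyzed."""
    return list(range(0, n_frames, interval))


# Toy frame: one ligand atom at the origin and four waters along x.
ligand = [(0.0, 0.0, 0.0)]
waters = [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(closest_waters(ligand, waters, 2))
print(len(sample_frames(1000, 10)))
```

Because the selection is repeated independently in every frame, the identity of the retained waters can change along the trajectory while their number stays constant.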

#### Retrospective Virtual Screening

#### Preparation of the Receptor

The AmpC β-lactamase receptor was derived from the 2HDS PDB file (Babaoglu and Shoichet, 2006) as described on the DUD-E website (Mysinger et al., 2012). Starting from the crystal structure, only chain B was preserved, crystallographic water molecules were removed, and the "Structure preparation" module of the MOE software was used to check the protein structure and correct any errors. The receptor was then capped with acetyl (ACE) and methylamino (NME) groups at the N- and C-termini, respectively. Missing hydrogen atoms were added using the "Protonate 3D" function of the MOE package, considering a physiological pH. Partial charges were assigned according to the AMBER10:EHT force field and solvation was treated with the Born model. The system geometry was then optimized up to a gradient of 0.1 kcal·mol<sup>−1</sup>·Å<sup>−1</sup>, with protein backbone atoms restrained to their original positions.

The Rac1 receptor, used in the VS simulation, was prepared as described elsewhere (Ferri et al., 2009, 2013a; Ruffoni et al., 2014).

#### AmpC β-Lactamase Testing Library

The DUD-E database provides 48 experimentally determined active ligands and 2,850 decoy molecules for AmpC β-lactamase. However, considering the computational cost of rescoring by MD simulations followed by Nwat-MMGBSA analyses, we opted for a smaller library that adequately represents the original database. Hence, the fingerprint clustering methods included in the MOE package were applied to reduce the size of the test set. Multiple combinations of fingerprint and similarity metric were tested by trial and error. We found that the combination of the Typed Graph Triangle (TGT) fingerprint and the Tanimoto Superset/Subset (tanimoto-ss) similarity metric most closely reproduced the virtual screening results provided by DUD-E. However, the combination applied here might not be directly transferable to other biomolecular systems. The clustering process reduced the original database to 20 active ligands and 378 decoys (Table S2), with a docking AUC of 74.82% and a top 1% enrichment factor of 9.5, in comparison to the original 78.92% and 8.3 provided by the DUD-E database. The SMILES strings of the final database are reported in Table S2.
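For readers unfamiliar with the enrichment factor quoted above, a minimal Python sketch of one common definition is given here. The function name and the rounding of the top fraction (ceiling, with at least one compound) are assumptions; DUD-E or other tools may use a slightly different convention, so the sketch is not expected to reproduce the reported values exactly.

```python
import math


def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor at a given top fraction of a ranked library.

    ranked_labels: 1 for an active, 0 for a decoy, best-scored first.
    EF = (actives in top / size of top) / (all actives / library size).
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, math.ceil(fraction * n_total))  # rounding is an assumption
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Toy ranking: 20 compounds, 4 actives, 2 of them in the top 25%.
toy_ranking = [1, 1, 0, 0, 0] + [0] * 13 + [1, 1]
print(enrichment_factor(toy_ranking, 0.25))
```

With small libraries the top 1% contains only a handful of compounds, so the value changes in large steps when a single active enters or leaves that window.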

#### Rac1 Testing Library

By analyzing the literature and by using in-house data, we collected a set of 116 compounds, 10 of which were active and 106 inactive on the Rac1 protein (Table S3). The active compounds were selected among those able to inhibit at least 50% of Rac1 activity, as assessed by G-LISA biochemical assays (Ferri et al., 2009, 2013a). Conversely, the decoys were chosen among molecules that were designed (or identified by computational screening) as Rac1 inhibitors, but turned out to be completely inactive on experimental evaluation (Ferri et al., 2009, 2013a; Hernández et al., 2010; Surviladze et al., 2010; Shang et al., 2012; Rahimi et al., 2015; Lu et al., 2017). The selected compounds were built with MOE, minimized, and subjected to a conformational search (MMFF94x force field, Born solvation, with the other parameters as default). The lowest energy conformation of each compound was selected to form the final test set.

#### Virtual Screening

The workflow included in the VScreen script (see Supplementary Information) allows the following combinations for library processing:


The sixth library processing tandem was applied in this work. The UNICON software is used to generate tautomers and protonation states (Sommer et al., 2016). We chose the topscoring keyword to generate only the most favored tautomers and protomers because, in a preliminary evaluation, we observed that generating all tautomers and protomers (using the ensemble keyword) did not improve results and was significantly more time consuming. The SPORES software (ten Brink and Exner, 2009, 2010) is instead used to obtain stereoisomers and ring conformations, as well as for the final assignment of atom types, as required by the PLANTS software (Korb et al., 2006, 2007, 2009, 2010) used for all dockings. Specific docking parameters, including search speed and scoring function, can be set directly in the VScreen script. In the examples reported here, PLANTS was used in a low speed / high accuracy mode (search speed = speed1) and with the CHEMPLP scoring function (Korb et al., 2009). Additional PLANTS commands, such as H-bond or NMR constraints (Korb et al., 2010), can also be inserted in the input section of the VScreen script. For the Rac1 example, we requested an H-bond constraint of 3 kcal/mol between any H-bond donor of the docked ligand and the carbonyl oxygen of Leu 70, since the literature highlights the importance of this ligand-receptor interaction for activity (Montalvo-Ortiz et al., 2012; Ferri et al., 2013a; Ruffoni et al., 2014). Binding site radii were optimized to 16 Å for the Rac1 and 7 Å for the AmpC β-lactamase test sets, respectively. After the virtual screening process, the outcomes were ranked according to the total PLANTS<sub>CHEMPLP</sub> score, using the top-ranked pose of each ligand. Receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) were then generated at the end of each docking run by using an R script integrated in the VScreen program.
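The ROC AUC used to compare rankings can be computed directly from scores and activity labels. The Python sketch below is not the R script integrated in VScreen; it is an illustrative pairwise formulation in which the AUC is the fraction of active/decoy pairs where the active is ranked better, with ties counted as one half. The function name and the assumption that lower scores are better are choices made for this example.

```python
def roc_auc(scores, labels, lower_is_better=True):
    """ROC AUC as the probability that an active outranks a decoy.

    scores: one score per compound; labels: 1 for active, 0 for decoy.
    Ties between an active and a decoy contribute 0.5.
    """
    actives = [s for s, l in zip(scores, labels) if l == 1]
    decoys = [s for s, l in zip(scores, labels) if l == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a == d:
                wins += 0.5
            elif (a < d) == lower_is_better:
                wins += 1.0
    return wins / (len(actives) * len(decoys))


# Toy example with hypothetical scores (more negative = better).
toy_scores = [-90.0, -80.0, -70.0, -60.0]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
```

The quadratic loop is adequate for libraries of a few hundred compounds, such as those considered here; a rank-based formula would be preferable for much larger sets.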

#### Ligand Parameterization

Following the docking, automated parameterization of ligands for later MD simulations can be enabled by setting the doMD keyword to 1. The user can choose a "top percentage" of the ranked ligands to be subjected to parameterization by setting the fract keyword. We chose 100% (fract = 100), i.e., the full test set, as we were interested in a full assessment of the Nwat-MMGBSA method in terms of virtual screening rescoring. The antechamber software (Wang et al., 2006) of the AmberTools15 package is used to derive AM1-BCC partial charges for each ligand and to assign atom types according to the gaff force field (Jakalian et al., 2002; Wang et al., 2004). The quantum mechanical calculations necessary for the charge parameterization can be performed by using the default sqm software included in AmberTools15, or with MOPAC2016 (Stewart, 2016), by setting the qm keyword to 2 or 0, respectively. The topology and starting coordinate files of each complex are then generated by calling the tleap software, included in the AmberTools15 package. Each complex is neutralized and solvated by adding Na<sup>+</sup> or Cl<sup>−</sup> ions and a TIP3P water box extending 10 Å from the solute. MD simulations and Nwat-MMGBSA analyses can then be performed as detailed in previous sections.
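The selection of a top percentage of ranked ligands and the subsequent charge derivation can be sketched as follows. How the VScreen script implements these steps is not shown in the text, so the function names, the rounding rule, and the file names below are hypothetical; the antechamber options used (`-i`, `-fi`, `-o`, `-fo`, `-c bcc`, `-nc`, `-at gaff`) are standard ones, and the command is only assembled here, not executed.

```python
import math


def select_top_fraction(ranked_ligands, fract):
    """Keep the top `fract` percent of a best-first ranked ligand list."""
    if not 0 < fract <= 100:
        raise ValueError("fract must be in the interval (0, 100]")
    # Rounding up, with at least one ligand kept, is an assumption.
    n_keep = max(1, math.ceil(len(ranked_ligands) * fract / 100))
    return ranked_ligands[:n_keep]


def antechamber_command(mol2_in, mol2_out, net_charge):
    """Assemble an antechamber call for AM1-BCC charges and gaff atom types."""
    return [
        "antechamber",
        "-i", mol2_in, "-fi", "mol2",
        "-o", mol2_out, "-fo", "mol2",
        "-c", "bcc",             # AM1-BCC charge model
        "-nc", str(net_charge),  # net molecular charge
        "-at", "gaff",           # gaff atom types
    ]


ranked = list("abcdefghij")  # hypothetical ligand identifiers, best first
print(select_top_fraction(ranked, 30))
print(" ".join(antechamber_command("lig.mol2", "lig_bcc.mol2", 0)))
```

Returning the command as a list keeps it suitable for `subprocess.run` without shell quoting issues, should one wish to execute it.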

Nwat-MMGBSA analyses were performed on the obtained trajectories with number of closest water molecules set to 0, 30, 60, or 100. ROC curves and corresponding AUCs were evaluated by using the rankings derived from each Nwat-MMGBSA analysis.

All scripts applied in this work are available in the Supplementary Material. Any updates can also be requested from the authors.

# RESULTS

#### Optimization of the Nwat-MMGBSA Protocol

To optimize the Nwat-MMGBSA protocol for low- or medium-throughput virtual screening procedures, such as those applied in the hit-to-lead optimization phase of a drug discovery process, we worked on a significant reduction of the overall simulation time in comparison to our previous implementations (Maffucci and Contini, 2013, 2016). Then, we integrated Nwat-MMGBSA in a continuous workflow that includes the library setup, docking, and the preparation of complexes required for MD, as shown in **Figure 1**. The following steps of the protocol were redesigned for an optimal ratio between accuracy and speed:


Generalized Born (GB) implicit solvent model is used by default in Nwat-MMGBSA calculations. Indeed, several articles report that GB can provide outcomes comparable to the PB method, at a fraction of the computational cost, especially when relatively short MD trajectories are used for MM-PB/GBSA calculations (Hou et al., 2011a,b; Maffucci and Contini, 2013, 2015, 2016). However, the PB method can still be requested by the user by setting the solv keyword in the input section of the Nwat-MMGBSA script (see scripts and examples provided as Supplementary Information).

Moreover, the reproducibility between independent MD simulation repeats of the same system, especially when using GPU, was also improved. This required some protocol adjustments, including a longer equilibration of the solvent box, the use of geometric restraints instead of constraints, the use of the Langevin thermostat instead of the weak coupling algorithm, and a slightly extended final equilibration phase.

With these protocol modifications, each ligand requires approximately 1.5 h on a standard workstation equipped with a single GeForce GTX TITAN Black card, including parameterization, minimization, equilibration, 1 ns of production run and Nwat-MMGBSA analysis. This is roughly the same time required for the simulation on an HPC architecture using 12 nodes equipped with two 2.40 GHz octa-core processors under similar simulation settings.

Considering our interest in using Nwat-MMGBSA calculations to rescore docking results in a reasonable time, the following tests were also designed to evidence any statistical difference in the correlation to experiment when the analysis is performed on 1 ns or 4 ns long MD trajectories. All the energies computed for the discussed examples are reported in Tables S4–S39. Correlations to experiments and statistical analyses are shown in Tables S40–S42 and Table S43, respectively.

#### Test on Penicillopepsin

This system was already evaluated, although with a different protocol, in a previous work where the basis of the Nwat-MMGBSA approach was described (Maffucci and Contini, 2013). The results of the Nwat-MMGBSA analysis obtained with the new protocol agreed with those reported in the previous work in terms of correlation between predicted and experimental binding energies (**Figure 2A** and Table S40). This confirms the robustness of the Nwat-MMGBSA method toward modifications in the MD simulation conditions. However, the new protocol showed a benefit from the Nwat-MMGBSA method even when only the 10 closest water molecules were considered (Nwat = 10), while in the previous evaluation no significant improvement was observed under this condition. The water density plot around the binding site (**Figure 2B**) confirms the role of water in mediating ligand-receptor binding. Indeed, for this system, the use of the Nwat-MMGBSA method increased r<sup>2</sup> from about 0.3, obtained with the standard MM-GBSA approach (Nwat = 0), to about 0.8 (**Figure 2A**).

In addition, relatively low standard deviations of the r<sup>2</sup> values averaged over independent repetitions of the whole run were observed when higher numbers of closest water molecules were considered (Table S40 and **Figure 2A**), suggesting that the inclusion of explicit waters is likely to improve the reproducibility of results from individual runs.
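The evaluation metric discussed here, r<sup>2</sup> averaged over independent repetitions with its standard deviation, can be sketched in Python as follows. The sketch assumes that r<sup>2</sup> is the squared Pearson correlation between experimental and computed values, which coincides with the coefficient of determination of a simple linear fit; the function names and the toy numbers are hypothetical and are not taken from the Supplementary tables.

```python
import statistics


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def summarize_repeats(experimental, repeats):
    """Mean and sample standard deviation of r^2 over independent runs."""
    values = [r_squared(experimental, computed) for computed in repeats]
    return statistics.mean(values), statistics.stdev(values)


# Toy data: three "ligands" and two hypothetical repetitions.
toy_experimental = [1.0, 2.0, 3.0]
toy_repeats = [[2.0, 4.0, 6.0], [1.0, 3.0, 2.0]]
mean_r2, sd_r2 = summarize_repeats(toy_experimental, toy_repeats)
print(mean_r2, sd_r2)
```

A small standard deviation across repetitions indicates that a single run is likely to be representative, which matters when each ligand is simulated only once in a screening campaign.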

Moreover, the outcomes obtained by running simulations on GPU and CPU hardware were statistically equivalent (Table S43), and the same was true for the analyses performed on either the 1st or the 4th ns of MD simulations (**Figure 2A**). This suggests that Nwat-MMGBSA analysis is suitable for the analysis of short MD simulations run on GPU cards, with a great improvement in speed and no loss of accuracy.

#### Test on HIV1-Protease

Similarly to other aspartic proteases (Brik and Wong, 2003), HIV1-protease exhibits a close relationship with water-mediated bridging effects in the crystal structure (Shen et al., 2010). Consequently, the effects of explicit waters were also reflected in the outcome of the Nwat-MMGBSA workflow (**Figure 3A**). The high water density over a wide area of the binding site also supports the likely involvement of explicit water in the binding process (**Figure 3B**).

When considering the correlation between the experimental K<sub>i</sub> and the predicted binding energies, a significant improvement in r<sup>2</sup> was obtained with the inclusion of a hydration shell of 30–70 water molecules (**Figure 3A** and Table S41). Results with lower Nwat values, however, showed no significant difference from standard MM-GBSA analyses, suggesting that smaller hydration shells around the ligand might exclude certain solute-solvent interactions that are important for binding free energy estimation. Nevertheless, water-mediated H-bond analyses showed that only one or two stable (occupancy > 20%) water-mediated interactions involved the ligand, while the majority of the bridging water molecules were found between protein residues (Tables S44, S45). Furthermore, crystallographic data show that only 10–15 water molecules are generally present within 4 Å of the ligand molecules (Olajuyigbe et al., 2011). This implies an apparent conflict between the low number of observed "stable" bridging water molecules and the evidently better binding free energy estimations obtained when a higher number of closest explicit waters (up to 70) is included. Such a conflict can only be explained by considering transient water bridges. The averaged free energy contribution of these transient interactions is more likely to be captured by Nwat-MMGBSA calculations, although it is not necessarily detectable through population distribution or electron density analyses. Indeed, for example, the inclusion of crystallographic water molecules up to 3.5 Å from the ligand did not provide a clear benefit over the standard MM-PB/GBSA approach (Greenidge et al., 2013).

Similar to the penicillopepsin system, the outcomes did not show statistically significant differences between the 1st and 4th ns of MD simulations and were independent of the hardware. The 4 ns MD simulations performed on CPU apparently provided, on average, a higher correlation to experiments for Nwat ≤ 20, although the high standard deviations make this result not statistically significant (**Figure 3A** and Table S41).

#### Test on BCL-XL

The Nwat-MMGBSA trials with up to 50 closest water molecules showed no statistical difference from standard MM-GBSA calculations, regardless of the hardware environment (**Figure 4A**). A lower water density was indeed observed around the ligand for the BCL-X<sub>L</sub> system (**Figure 4B**). This implies that explicit water molecules play a less important role in ligand binding, as reflected by the relatively disappointing performance of Nwat-MMGBSA compared to MM-GBSA. A relatively high r<sup>2</sup> (∼0.7) was indeed consistent throughout the multiple trials for all of the conditions evaluated. These apparently ineffective results, however, suggest that the Nwat-MMGBSA method does not impair the statistical outcomes of the estimations for systems where explicit water molecules are deemed less important. Thus, it can be concluded that, even if system-specific tuning is necessary for optimal performance, Nwat-MMGBSA can be safely used for binding free energy estimation even without a priori knowledge of the bridging waters in the system of interest.

#### Retrospective Virtual Screening Test

To assess the performance of the Nwat-MMGBSA method in rescoring virtual screening results, we chose two case studies, one for protein-ligand interactions (PLI) and one for protein-protein interactions (PPI). The first system, AmpC β-lactamase (Usher et al., 1998), was selected from the DUD-E database to provide an example where water plays an active role in the target catalytic cycle. Indeed, the inclusion of an explicit water molecule was found to be beneficial in previously reported virtual screenings (Powers et al., 2002). Conversely, the second system, the Rac1 protein targeted at the Tiam1 binding site (Worthylake et al., 2000), was chosen because of the availability of reliable in-house activity data, including those of several inactive compounds that had nevertheless been selected as potential hits by previously conducted virtual screening studies (Ferri et al., 2009, 2013a).

The AUC of the ROC curves was chosen as the main metric of comparison, since enrichment values are not appropriate for databases of limited size (Enyedy and Egan, 2008). The full virtual screening workflow, followed by MD simulation and Nwat-MMGBSA rescoring (Nwat = 0, 30, 60 and, for Rac1 only, 100), was repeated twice for the AmpC system, while for Rac1 a third repetition was added due to a higher variance among the obtained correlations. Docking scores and Nwat-MMGBSA binding energies for the AmpC and Rac1 screenings are reported in Tables S46, S47 and Tables S48–S50, respectively.
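As an illustration of the metric, the following minimal sketch computes the ROC AUC as the probability that a randomly chosen active is ranked ahead of a randomly chosen inactive, with ties counted as one half. The function name, the `lower_is_better` flag and the example scores are assumptions made for illustration and are not part of the published workflow.

```python
def roc_auc(scores, labels, lower_is_better=True):
    """ROC AUC from pairwise comparisons of actives (label 1) and inactives (label 0)."""
    actives = [s for s, l in zip(scores, labels) if l]
    inactives = [s for s, l in zip(scores, labels) if not l]
    if not actives or not inactives:
        raise ValueError("both actives and inactives are required")
    wins = 0.0
    for a in actives:
        for d in inactives:
            if a == d:
                wins += 0.5  # a tie counts as half a win
            elif (a < d) == lower_is_better:
                wins += 1.0  # the active is ranked ahead of the inactive
    return wins / (len(actives) * len(inactives))


# Hypothetical binding energies (kcal/mol, more negative is better)
energies = [-50.0, -45.0, -40.0, -38.0, -30.0]
is_active = [1, 0, 1, 0, 0]
print(round(roc_auc(energies, is_active), 3))  # 0.833
```

Binding energies are ranked with lower values first, whereas docking fitness scores for which larger values are better would be evaluated with `lower_is_better=False`.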



Percentage of variation (Δ%<sub>docking</sub> and Δ%<sub>MMGBSA</sub>) and P-values with respect to ChemPLP scoring (P<sub>docking</sub>) and to standard MM-GBSA rescoring (P<sub>MMGBSA</sub>) are also reported. <sup>a</sup>Corresponding to a standard MM-GBSA calculation, with no explicit waters included. <sup>b</sup>Average of two full repetitions.

<sup>c</sup>0.00001.

#### AmpC β-Lactamase

The receptor in DUD-E includes an explicit water molecule. We performed preliminary docking evaluations including this water, using the "water\_molecule" function implemented in PLANTS, but comparable results were obtained (see Figure S12). For this reason, to simplify and standardize the procedure, we decided not to include any explicit water in the docking part of the virtual screening workflow.

We initially noticed that virtual screening already provided a decent discrimination of actives from decoys, with a ROC AUC that averaged 72.0% (**Table 1**). The application of the standard MM-GBSA method (Nwat = 0) provided only a barely significant increase in the ROC AUC value with respect to docking (**Table 1**), while improvements appeared once explicit water molecules were included, as shown by the Nwat = 30 and 60 scenarios (**Figure 5**, Figure S15). Considering that the ROC AUCs for Nwat = 30 and 60 were fully converged, no additional analyses at higher Nwat values were done. Additionally, ligand-to-ligand correlations in calculated free energies were evaluated between the two repeated runs. The standard MM-GBSA run provided an r<sup>2</sup> of 0.66 when correlating the energies obtained by the two repetitions, while Nwat-MMGBSA with Nwat = 30 and 60 resulted in r<sup>2</sup> values of 0.84 and 0.91, respectively (**Figure 6**). This implies that Nwat-MMGBSA rescoring is likely to provide better reproducibility between separate runs. Moreover, the good ligand-to-ligand inter-method correlation between Nwat = 60 and 30 (r<sup>2</sup> = 0.95 and 0.94 for runs 1 and 2, respectively; Figure S13) further confirmed the improvements in reproducibility. Interestingly, a positive binding energy was computed for two ligands by Nwat-MMGBSA calculations, but not by docking or standard MM-GBSA rescoring. The two ligands belong to the decoy set (ligands 088 and 179, Tables S46, S47) and are thus expected to be poorly ranked. Indeed, by analyzing the binding mode of decoy 088 (Figure S17), it can be observed that an isopropyl group overlaps with the position occupied by a water molecule that is present in the crystal structure (Babaoglu and Shoichet, 2006) but was not explicitly considered during docking (see Methods). Conversely, decoy 179 does not overlap with the crystallographic water site (Figure S18). However, it can be observed that a solvent-exposed chloropropyl group overlaps with a position occupied by a hydrophilic amino acid moiety of the crystallographic ligand. In both cases, it appears that Nwat-MMGBSA rescoring can correctly penalize compounds that do not offer an optimal orientation of hydrophobic groups.

#### Tiam1-Rac1 PPI Interface as the PPI Test Set

Virtual screening targeting PPIs has been described as a challenging task, especially when only traditional docking and scoring procedures are used (Bienstock, 2012; Scott et al., 2016). In the past, we applied standard computational methods to identify and design inhibitors of the Rac1-Tiam1 PPI, thus collecting data on compounds that were identified as potential hits but turned out to be inactive in experiments (Ferri et al., 2009, 2013a,b; Ruffoni et al., 2014). In addition, we searched the literature for compounds that were tested for Rac1 inhibition but turned out to be inactive (Hernández et al., 2010; Surviladze et al., 2010; Shang et al., 2012; Rahimi et al., 2015; Lu et al., 2017). In this way, the resulting ligand test set shared similar physico-chemical and structural features between the actives and the inactives, thus making this virtual screening a difficult discrimination task.

Percentage of variation (Δ%<sub>docking</sub> and Δ%<sub>MMGBSA</sub>) and P-values with respect to ChemPLP scoring (P<sub>docking</sub>) and to standard MM-GBSA rescoring (P<sub>MMGBSA</sub>) are also reported. <sup>a</sup>Average of three full repetitions.

The docking protocol was optimized to maximize the AUC by evaluating the effect of the different scoring functions available in PLANTS, by varying the binding site radius and the search speed, and by using hydrogen bond constraints with residues known to be essential for activity (i.e., Leu70 or Ser71) (Gao et al., 2004). The docking poses were visually inspected to check their consistency with the poses obtained in previous studies (Ferri et al., 2013a). With the optimized protocol and for each library processing condition, all the active compounds showed a similar binding pose, except for ligand109 (Figure S14).

The ROC computed on the scores obtained by docking showed a moderate ability of this procedure to discriminate active from inactive compounds, with AUCs of about 0.6 (**Table 2**, **Figure 7**). Considering the challenging nature of both the target and the database, this result is acceptable when compared to the ROC AUCs obtained in other benchmarks reported in the literature (Brozell et al., 2012; Liebeschuetz et al., 2012; McGann, 2012; Neves et al., 2012; Novikov et al., 2012; Repasky et al., 2012; Schneider et al., 2012; Lavecchia and Di Giovanni, 2013; Yuriev et al., 2015).

This time, the application of the standard MM-GBSA (Nwat = 0) rescoring did not provide any significant improvement in the AUC compared to docking (**Table 2**). Unexpectedly, Nwat-MMGBSA performed with 30 water molecules (Nwat = 30) behaved similarly. Conversely, the ROC AUCs improved by about 20 and 30% after rescoring with Nwat = 60 or 100, respectively (**Table 2**, **Figure 7**, and Figure S16). Since the difference in AUC between the last two scenarios was not statistically significant, no additional simulations were conducted at higher Nwat. An improvement in the ROC AUC of about 20–30%, although reproducible and significant (Zhang et al., 2014), might be questioned in view of the increased computational effort of rescoring with either MM-GBSA or Nwat-MMGBSA. However, in the framework of a lead optimization study, the payoff of a simulation that can easily be run on relatively inexpensive hardware can be an increased chance of synthesizing a good molecule. Considering the costs associated with the synthesis of new molecules, having even only a 20% higher probability of preparing an active compound can be considered a rather good result.
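To make the size of such an improvement concrete, the short sketch below computes a percentage of variation between mean AUC values. It assumes that the variation is the relative change of the mean AUC with respect to the reference method, which is not stated explicitly in the text, and the replicate AUC values are hypothetical.

```python
from statistics import mean


def percent_variation(auc_new, auc_ref):
    """Relative change (in percent) of an AUC with respect to a reference AUC."""
    return 100.0 * (auc_new - auc_ref) / auc_ref


# Hypothetical AUCs from three repetitions (not the values of Table 2)
docking_aucs = [0.60, 0.61, 0.59]
rescored_aucs = [0.72, 0.73, 0.71]

delta = percent_variation(mean(rescored_aucs), mean(docking_aucs))
print(round(delta, 1))  # 20.0
```

Under this assumption, a docking AUC of about 0.6 and a 20% improvement correspond to a rescored AUC of about 0.72.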

Statistical significance was calculated by t-test and is graphically reported only when a significant variation was observed (\*P < 0.05; \*\*P < 0.01).

# DISCUSSION

When developing new drugs, computational methods can help in identifying new hits in either the hit-to-lead or the lead optimization phase. While the first task is generally performed by using very fast computational methods to screen large databases, the lead optimization phase is generally carried out by applying more accurate, although more computationally demanding, methods. Indeed, starting from a lead, a virtual library of hundreds to thousands of congeneric molecules can be generated and evaluated computationally. However, prioritizing the synthesis of a few derivatives by computational methods might still be quite challenging. In this framework, we optimized a variant of the well-known MM-GBSA method, referred to as Nwat-MMGBSA (Maffucci and Contini, 2013, 2016). This approach consists of including, during the MM-GBSA analysis, a fixed number of water molecules which, in each frame of the MD simulation, are the closest to the ligand or to a binding interface, and are therefore potentially mediating interactions between the receptor and the ligand. We demonstrated that this approach might improve the correlation between predicted and experimental binding energies by up to 50%, compared to the standard MM-GBSA method (corresponding to Nwat = 0), with only a modest increase in computation time (Maffucci and Contini, 2016). Of course, the potential improvement in correlation depends on the role played by water in facilitating the ligand-receptor binding. However, we also found that when water does not play a specific role in mediating this interaction, the application of Nwat-MMGBSA is not detrimental to the quality of the correlation, compared to the default approach. In light of this, we automated the process and optimized the MD protocol for running simulations on standard workstations equipped with a GPU, on which a full calculation can be completed in about 1–2 h per complex, depending on system size. Indeed, the results obtained by using a single GPU card are comparable, in both quality and duration, with those obtained by running MDs in a relatively large HPC environment (12 nodes with 2 octa-core processors per node). Moreover, we also observed that Nwat-MMGBSA analyses provided comparable results when applied to 1 or 4 ns MD trajectories, thus making this simulation attractive for medium-throughput virtual screenings.
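The per-frame water selection described above can be sketched as follows. This is only an illustration under stated assumptions: each water is represented by its oxygen coordinates, the distance to the ligand is taken as the minimum distance to any ligand atom, and the function names and toy coordinates are hypothetical. The actual Nwat-MMGBSA scripts may implement the selection differently.

```python
import math


def closest_waters(ligand_atoms, water_oxygens, n_wat):
    """Return the indices of the n_wat waters nearest to any ligand atom."""
    def min_dist(oxygen):
        return min(math.dist(oxygen, atom) for atom in ligand_atoms)

    ranked = sorted(range(len(water_oxygens)),
                    key=lambda i: min_dist(water_oxygens[i]))
    return ranked[:n_wat]


def select_per_frame(frames, n_wat):
    """Apply the selection independently to each (ligand, waters) frame."""
    return [closest_waters(lig, wat, n_wat) for lig, wat in frames]


# Toy trajectory with two frames (coordinates in Å, illustrative only)
frames = [
    ([(0.0, 0.0, 0.0)], [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]),
    ([(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0), (6.0, 0.0, 0.0), (4.0, 0.0, 0.0)]),
]
print(select_per_frame(frames, n_wat=2))  # [[1, 2], [0, 2]]
```

Because the selection is repeated for every frame, the identity of the retained waters can change along the trajectory while their number stays fixed at Nwat.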

In the second part of this article, we described the integration of Nwat-MMGBSA as a method to rescore docking results in SBVS studies. By applying Nwat-MMGBSA rescoring (Nwat = 60 or 100) we obtained, in both examples, an increase in the ROC AUCs of between 20 and 30%, compared to the docking scores or default MM-GBSA (Nwat = 0), depending on the system. Under the adopted conditions, we were able to process more than 20 compounds per day using a standard octa-core workstation equipped with a single GPU. Although this throughput might appear low compared to the thousands of compounds that can be screened per day by docking, the investment becomes reasonable when considering the time and resources required for the synthesis of new molecules. Moreover, we can expect that the fast development of GPU hardware will soon make MD-based rescoring even faster. Indeed, in 2010 we could run an MD simulation on a Rac1 complex at a speed of 8.7 ns/day on a Tesla C1060 card, while a few years later the same simulation was run at a speed of 59.3 ns/day on a GeForce GTX TITAN Black card.
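The hardware comparison above translates into wall-clock time as shown in the following short calculation. Only the two simulation speeds and the 1 and 4 ns trajectory lengths come from the text; the helper name is an assumption, and the estimate covers production MD only, ignoring setup, equilibration and the MM-GBSA analysis itself.

```python
def md_hours(length_ns, speed_ns_per_day):
    """Wall-clock hours needed to simulate length_ns at a given speed."""
    return 24.0 * length_ns / speed_ns_per_day


old_speed = 8.7   # ns/day, Tesla C1060 (from the text)
new_speed = 59.3  # ns/day, GeForce GTX TITAN Black (from the text)

print(round(new_speed / old_speed, 1))      # 6.8-fold speedup
for length in (1, 4):
    print(length, "ns:",
          round(md_hours(length, old_speed), 1), "h ->",
          round(md_hours(length, new_speed), 1), "h")
```

A 4 ns trajectory that required roughly 11 h on the older card thus takes under 2 h on the newer one.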

Unfortunately, we were not able to find an ideal number of water molecules to be included during Nwat-MMGBSA rescoring. While Nwat = 30 appeared to be reasonable in most of the examples, including those reported previously (Maffucci and Contini, 2013, 2016), it failed in the Rac1 VS example. In this case, at least 60 waters were necessary to observe a significant improvement over docking and standard MM-GBSA, possibly due to the large and solvent-exposed nature of the Rac1 binding site. Conversely, it was recently reported that MM-PBSA calculations on a set of Mnk1 and Mnk2 inhibitors provided improved correlations to experiments only when including up to 10 water molecules (Kannan et al., 2017). This rather low number, compared to other examples, was justified by the small interface between the Mnk1/Mnk2 kinases and the respective ligands.

# AUTHOR CONTRIBUTIONS

AC coordinated the team, designed the scripts and performed the calculations on Rac1. IM performed the calculations on penicillopepsin, HIV1 and BCL-X<sub>L</sub>. XH provided important updates to the VScreen script. VF performed the calculations on AmpC β-lactamase. IM, XH, and AC wrote the article.

# FUNDING

This work was partially supported by the Italian Ministry of Education, University and Research (MIUR) through the FIRB - Programma 'Futuro in Ricerca' (grant No. RBFR087YAY), by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie ITN-European Joint Doctorate grant agreement No. 675527 MOGLYNET programme in Drug Discovery and Development, and by NVIDIA through the GPU Grant Program.

# ACKNOWLEDGMENTS

We acknowledge CINECA for the high-performance computing service and the MIUR for the Ph.D. scholarship that supported IM. We also acknowledge João Cavalheiro for having run calculations on the BCL-X<sub>L</sub> example during his Erasmus Programme. Daniela Albo and Cecilia Carvoli are also acknowledged for their work during their BS theses.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00043/full#supplementary-material

# REFERENCES

Characterizing the Roles of Water in Biomolecules. J. Mol. Biol. 358, 289–309. doi: 10.1016/j.jmb.2006.01.053

by flavonoids as a model system for congeneric series. J. Med. Chem. 40, 4136–4145. doi: 10.1021/jm970245v

in ritonavir and saquinavir mixtures. Cryst. Growth Des. 11, 4378–4385. doi: 10.1021/cg200514z

the simulation method and the force field. J. Med. Chem. 49, 6596–6606. doi: 10.1021/jm0608210


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Maffucci, Hu, Fumagalli and Contini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computer-Aided Drug Design in Epigenetics

Wenchao Lu<sup>1,2†</sup>, Rukang Zhang<sup>1,2†</sup>, Hao Jiang<sup>1,2</sup>, Huimin Zhang<sup>1,3</sup> and Cheng Luo<sup>1,2</sup>\*

*<sup>1</sup> Drug Discovery and Design Center, CAS Key Laboratory of Receptor Research, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China, <sup>2</sup> Department of Pharmacy, University of Chinese Academy of Sciences, Beijing, China, <sup>3</sup> School of Life Science and Technology, ShanghaiTech University, Shanghai, China*

Epigenetic dysfunction has been widely implicated in several diseases, especially cancers, which highlights the therapeutic potential of chemical interventions in this field. With the rapid development of computational methodologies and high-performance computational resources, computer-aided drug design has emerged as a promising strategy to speed up epigenetic drug discovery. Herein, we give a brief overview of the major computational methods reported in the literature, including druggability prediction, virtual screening, homology modeling, scaffold hopping, pharmacophore modeling, molecular dynamics simulations, quantum chemistry calculation, and 3D quantitative structure-activity relationship, that have been successfully applied in the design and discovery of epi-drugs and epi-probes. Finally, we discuss the major limitations of current virtual drug design strategies in epigenetic drug discovery and future directions in this field.

#### Edited by:

*Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria*

#### Reviewed by:

*Dharmendra Kumar Yadav, Gachon University of Medicine and Science, South Korea Stefano Alcaro, Magna Græcia University, Italy*

\*Correspondence:

*Cheng Luo cluo@simm.ac.cn*

*† These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *07 January 2018* Accepted: *23 February 2018* Published: *12 March 2018*

#### Citation:

*Lu W, Zhang R, Jiang H, Zhang H and Luo C (2018) Computer-Aided Drug Design in Epigenetics. Front. Chem. 6:57. doi: 10.3389/fchem.2018.00057*

Keywords: drug discovery, epigenetics, small-molecule inhibitor, computer-aided drug design, virtual screening

# INTRODUCTION

Covalent modifications on nucleosomes, the basic building blocks of chromatin, including methylation, acetylation, phosphorylation, and ubiquitination, specifically regulate downstream gene expression patterns in a context-dependent manner and form the fundamental molecular basis of epigenetics (Strahl and Allis, 2000; Berger, 2007). Dynamic regulation of epigenetic modification collections leads to different functional outcomes and plays a pivotal role in biological processes including genome reprogramming, gene transcription, DNA damage response and homeostatic regulation (Li, 2002; Vidanes et al., 2005; Kouzarides, 2007; Gut and Verdin, 2013). Epigenetic dysfunction is tightly related to the pathogenesis and progression of several diseases, including malignant diseases, especially cancers, and chronic diseases such as immune-mediated diseases, neurodegenerative disorders and diabetes, which underscores the importance of these covalent modifications (Best and Carey, 2010; Dawson and Kouzarides, 2012; Tough et al., 2016; Hwang et al., 2017).

Proteins responsible for modulating epigenetic marks on nucleosomes can be roughly divided into three categories based on their function: writers (enzymes that deposit covalent modifications), erasers (enzymes that remove covalent modifications), and readers (proteins that recognize specific modifications and recruit chaperones). Encouraging success has been achieved in recent decades in the development of epi-probes for dissecting the epigenome (Shortt et al., 2017). However, epigenetic drug discovery remains a formidable challenge in both academia and industry due to the complexity of the epigenetic regulatory network and the limits of assays and drug development technologies. So far, only seven epigenetic agents targeting two classes of epigenetic enzymes (DNA methyltransferases and histone deacetylases) have been approved for human use. The indications of approved epigenetic drugs are limited to malignant diseases such as myelodysplastic syndromes (MDS), acute myeloid leukemia (AML), chronic myelomonocytic leukemia (CMML), peripheral T-cell lymphoma (PTCL), and cutaneous T-cell lymphoma (CTCL), while the applications of epigenetic drugs in the treatment of chronic diseases have been less explored (Mann et al., 2007; Derissen et al., 2013; Laubach et al., 2015; Lee et al., 2015). Hence, there is an urgent need to develop novel epigenetic drugs through multidisciplinary efforts and extensive collaborations that may accelerate the pace of the drug discovery process.

With the advanced development of computational methodologies, computer-aided drug design (CADD) has emerged as a burgeoning research field (Zheng et al., 2013). Currently, many pharmaceutical companies and research institutions all over the world have established their own CADD departments, and continued efforts have been made toward the development and optimization of drug design methodologies and software (Kim et al., 2017). In silico druggability assessment helps researchers to identify more chemically tractable targets and to prioritize screening endeavors (Trosset and Vodovar, 2013). Based on the rapid advancement of crystallography and successful applications of homology modeling, structure-based virtual screening (SBVS) has proven a useful method to quickly identify bioactive hits in early-stage discovery activities (Lounnas et al., 2013). Ligand-based drug design (LBDD) strategies such as three-dimensional quantitative structure-activity relationship (3D-QSAR), 2D similarity-based searching, scaffold hopping and pharmacophore studies are also efficient approaches for hit enrichment and activity prediction based on available information on known inhibitors (Meena et al., 2011; Andrew et al., 2016; Yadav et al., 2017b). Moreover, quantum mechanical calculations and molecular dynamics (MD) simulations provide an in-depth understanding of protein catalytic mechanisms, which is quite useful for mechanism-based drug design (Scheraga et al., 2007). In silico assessment of pharmacokinetic properties allows the prediction of the absorption, distribution, metabolism, elimination, and toxicity (ADMET) of drug candidates and is an important cheminformatics tool in drug design (Gaur et al., 2015; Yadav et al., 2016). Collectively, combined with the increased availability of diverse compound databases, these cost-effective structure-based or ligand-based strategies significantly increase the efficiency of drug discovery and provide new horizons and promising avenues to conquer life-threatening diseases (**Figure 1**, **Table 1**).

Although these leading computational strategies have been successfully applied in the traditional drug discovery pipeline, there are relatively few reports focusing on their contribution to the epigenetic landscape (Li et al., 2015). In this review, we mainly focus on recent progress in the applications of these strategies and highlight representative studies and major contributions of computational approaches in this field. Other successful drug discovery studies using wet lab approaches, although they may also be of interest for epigenetics-related research, are beyond the scope of this review and are not covered here.

# WRITER

Epigenetic writers are the enzymes responsible for transferring methyl or acetyl groups to DNA, histones or other non-histone substrates from the cofactors S-adenosyl-L-methionine (SAM) or acetyl coenzyme A (Ac-CoA). Based on their distinct functions, writers are usually divided into three categories, namely DNA methyltransferases (DNMTs), protein lysine/arginine methyltransferases (PKMTs/PRMTs) and histone acetyltransferases (HATs). These enzymes alter chromatin organization and contribute to the regulation of downstream gene expression through site-specific modifications that are involved in multiple functional pathways (Gelato and Fischle, 2008). To elucidate their roles in physiological or pathological states, there has been increasing interest in the discovery of writer inhibitors through in silico approaches, and many success stories have been reported in the literature (**Figure 2**). In this section, we present an overview of the current applications of computational methods used in hit identification targeting epigenetic writers.

# DNA Methyltransferases

DNA methyltransferases catalyze DNA methylation by depositing a methyl group on the 5-position of cytosine (Robertson, 2001). In mammalian cells, five members have been identified so far: DNMT1, DNMT2, DNMT3A, DNMT3B, and DNMT3L. Among them, DNMT1 is characterized as the maintenance methyltransferase that shows a preference for hemimethylated DNA substrates, while DNMT3A and DNMT3B belong to the de novo methyltransferase subfamily, whose members function in complex form and catalyze the methylation of unmethylated DNA (Okano et al., 1999; Goll and Bestor, 2005). Among DNMTs, the founding member DNMT1 is the best studied. DNMT1 introduces a new methyl group into the newly synthesized DNA strand in the context of CpG dinucleotides, thereby maintaining the methylation patterns of the template strand during DNA replication (Bestor, 2000; Auclair and Weber, 2012). Aberrant promoter DNA hypermethylation leads to silencing of tumor suppressor genes, which has been frequently observed in various carcinomas (Feinberg et al., 2006; Zhang and Xu, 2017). Therefore, DNMTs have become one of the most promising targets for cancer therapy, and many computational approaches have evolved to fuel the development of epi-probes and epi-drugs targeting DNMTs (Medina-Franco et al., 2015).

#### Homology Modeling-Driven Studies

Homology modeling is a quite effective strategy, especially when crystal structures of the protein of interest are not available, and it functions as a most valuable research tool to fill the sequence-structure gap for structure-based drug design (Dwivedi et al., 2015). The accuracy of homology models mainly depends on the sequence identity or similarity between the template protein and the protein to be modeled (Chothia and Lesk, 1986). As commonly accepted, homology models based on more than 50% sequence identity with proteins whose structures have been experimentally determined are usually very accurate and can be used for drug discovery purposes (Hillisch et al., 2004).
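The sequence identity criterion can be illustrated with a minimal sketch that computes the percent identity of two pre-aligned sequences. The definition used here (identical residues divided by the number of aligned, non-gap columns) is one of several conventions and is an assumption, as are the function name and the toy sequences; only the 50% threshold comes from the text.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy aligned fragments (hypothetical, not real DNMT sequences)
target = "MKV-LSAGE"
template = "MRVTLSA-E"
identity = percent_identity(target, template)
print(round(identity, 1), identity > 50.0)  # 85.7 True
```

In practice the alignment itself would come from a dedicated alignment program, and the identity would be evaluated over the modeled domain rather than over a short fragment.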

Since no DNMT1 crystal structure was released until 2008, drug development against this therapeutic target progressed slowly (Syeda et al., 2011). To circumvent this, Siedlecki et al. built a homology model of the human DNMT1 catalytic domain based on the available structural information on M.HhaI, M.HaeIII, and DNMT2, using the MODELLER module of INSIGHT2000 (Siedlecki et al., 2003). In a follow-up study based on this homology model, Siedlecki and co-workers used DOCK version 5.1.0 to perform a docking-based virtual screening of a diversity set of 1,990 compounds from the National Cancer Institute (NCI), which represented more than 140,000 compounds. The screen resulted in the discovery of RG108 (compound 1 in **Figure 2**), which emerged as the top hit in biochemical assays (Brueckner et al., 2005; Siedlecki et al., 2006).

TABLE 1 | The online public and commercial databases and compound collections used for virtual screening in epigenetics.

*Data accessed on December 28, 2017.*

Similarly, Kuck et al. carried out virtual screening (VS) of a larger data set including more than 65,000 lead-like compounds based on the aforementioned DNMT1 homology model. Top-ranked compounds were re-scored by GLIDE, GOLD, and AUTODOCK, followed by experimental tests. Among them, NSC14778 (compound 2 in **Figure 2**) presented inhibitory activities against DNMT1 and DNMT3B with IC<sub>50</sub> values of 92 and 17 µM, respectively, while nanaomycin A (compound 3 in **Figure 2**) selectively inhibited DNMT3B with an IC<sub>50</sub> value of 500 nM. To explain the selective inhibitory activity of nanaomycin A, the authors established a homology model of the DNMT3B catalytic domain based on the DNMT3A crystallographic structure, which provided a structural basis for mechanism interpretation (Kuck et al., 2010a,b).
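Re-scoring with several docking programs is often summarized with a consensus rank. The sketch below shows one simple scheme, the mean rank across programs, purely as an illustration: the text does not say how the individual scores were combined, the compound names and scores are hypothetical, and the `higher_is_better` flag is an assumption introduced because scoring functions differ in sign convention.

```python
def ranks(scores, higher_is_better):
    """Map each compound to its rank (1 = best) for one scoring function."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: i + 1 for i, name in enumerate(order)}


def consensus(score_tables):
    """Order compounds by mean rank over (scores, higher_is_better) tables."""
    names = list(score_tables[0][0])
    mean_rank = {name: 0.0 for name in names}
    for scores, higher_is_better in score_tables:
        program_ranks = ranks(scores, higher_is_better)
        for name in names:
            mean_rank[name] += program_ranks[name] / len(score_tables)
    return sorted(names, key=mean_rank.get)


# Hypothetical scores from three programs (illustrative only)
program_a = ({"c1": -9.0, "c2": -7.0, "c3": -5.0}, False)  # lower is better
program_b = ({"c1": 60.0, "c2": 70.0, "c3": 40.0}, True)   # higher is better
program_c = ({"c1": -8.0, "c2": -6.0, "c3": -7.0}, False)  # lower is better
print(consensus([program_a, program_b, program_c]))  # ['c1', 'c2', 'c3']
```

Averaging ranks rather than raw scores avoids mixing scoring functions that use different units and scales.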

In order to further elucidate the mechanism of action (MOA) of nanaomycin A, Caulfield and co-workers performed >100 ns molecular dynamics simulations using the CHARMM27 force field in NAMD version 2.62. The previously established DNMT3B homology model bound to nanaomycin A was simulated in either the presence or the absence of the cofactor SAM. The results suggested that nanaomycin A and SAM could bind to DNMT3B in a cooperative manner. In addition, nanaomycin A could form long-lasting interactions with key residues involved in the methylation process, which further validated the hypothesis supported by the previous docking simulation (Caulfield and Medina-Franco, 2011).
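How long-lasting an interaction is along a trajectory is commonly expressed as an occupancy, i.e., the percentage of frames in which the contact is formed. The sketch below is a generic illustration and not the analysis of the cited study: the 3.5 Å cutoff, the function name and the distance series are assumptions.

```python
def occupancy(distances, cutoff=3.5):
    """Percentage of frames in which a distance (Å) is within the cutoff."""
    if not distances:
        raise ValueError("at least one frame is required")
    formed = sum(d <= cutoff for d in distances)
    return 100.0 * formed / len(distances)


# Hypothetical ligand-residue distances sampled along a trajectory (Å)
trajectory_distances = [2.9, 3.1, 3.0, 4.2, 3.3, 3.4, 5.0, 3.2]
print(occupancy(trajectory_distances))  # 75.0
```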

#### High Throughput Virtual Screening

In 2014, through docking-based virtual screening based on the complex structure of mouse DNMT1 bound to S-adenosyl-L-homocysteine (SAH) (PDB ID: 4DA4), Chen et al. reported a novel non-nucleoside DNMT1 inhibitor, DC\_05, that showed significant selectivity over other protein methyltransferases (Chen S. et al., 2014). Further medicinal chemistry optimization led to the discovery of the more potent compound DC\_517 (compound 4 in **Figure 2**) with an IC<sub>50</sub> value of 1.7 µM. The putative binding models were generated by molecular docking studies, which gave a detailed interpretation of the structure-activity relationship (SAR).

In 2017, in order to identify novel DNMT3A inhibitors, Shao et al. conducted a multi-step docking-based virtual screening in combination with pharmacophore mapping. Through initial screening and follow-up similarity-based analog searching, the authors discovered the novel DNMT3A inhibitor compound 40\_3 (compound 5 in **Figure 2**) with an IC<sub>50</sub> value of 41 µM, which may serve as a starting point to develop more potent DNMT3A inhibitors (Shao et al., 2017).
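Similarity-based analog searching can be sketched with the Tanimoto coefficient computed on fingerprint bit sets. This is a generic illustration: the text does not specify which fingerprints or similarity metric were used in the cited study, and the fingerprints, compound names and the 0.7 threshold below are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_analogs(query_fp, library, threshold=0.7):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


# Hypothetical fingerprints (sets of on-bit indices)
query = {1, 2, 3, 4}
library = {"cpd_a": {1, 2, 3, 4}, "cpd_b": {1, 2, 3}, "cpd_c": {9}}
print(find_analogs(query, library))  # [('cpd_a', 1.0), ('cpd_b', 0.75)]
```

In a real campaign the bit sets would be generated by a cheminformatics toolkit from the structure of the initial hit and of each library compound.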

#### Quantum Mechanical Calculation

In 2012, based on ab initio methods, Alcaro and coworkers developed force field parameters, implemented in the MacroModel package, for the treatment of the charge distribution and overall charge assignment of nucleic acids that undergo methylation. This work gives essential insights into correct charge treatment and force field parameterization, an important issue in the molecular modeling of epigenetic phenomena, and sheds light on nucleic acid-related epigenetic functional studies and on the development of DNA-intercalating, subtype-selective DNMT inhibitors (Alcaro et al., 2002).

#### Histone Methyltransferases

Histone methylation is one of the most important posttranslational modifications on histones and results in either activation or repression depending on the specific site. The methylation marks can recruit different methyl-binding proteins and mediate downstream signaling pathways, which are regulated by the dynamic interplay between histone methyltransferases (HMTs) and histone demethylases (HDMTs) (Martin and Zhang, 2005). Histone methyltransferases can be mainly divided into two categories based on their substrates: protein lysine methyltransferases and protein arginine N-methyltransferases (Li et al., 2012). Among them, PKMTs consist of SET domain-containing PKMTs (SUV, SET1, SET2, EZ, and RIZ) and a non-SET domain-containing PKMT (DOT1L) (Kouzarides, 2002). The PRMT family can be further classified into three subcategories: type I PRMTs (PRMT1, 2, 4, 6, 8), responsible for arginine monomethylation and asymmetric dimethylation; type II PRMTs (PRMT5, 9), responsible for arginine monomethylation and symmetric dimethylation; and the type III PRMT (PRMT7), with arginine monomethylation activity only (Wolf, 2009). Emerging evidence has demonstrated that deregulated alterations of methylation patterns are implicated in the pathogenesis of various cancers and other malignant diseases (Spannhoff et al., 2009; Jones et al., 2016). Consequently, continued efforts have been devoted to drug design for HMTs, which opens up a vast range of prospects for disease treatment (**Table 2**).

#### Pharmacophore-Based Drug Discovery

With the increasing knowledge of known active molecules available in databases, pharmacophore modeling methods, which can quickly extract the key steric and electronic features for ligand-receptor interactions, are receiving more attention in the era of rational drug design (Guner et al., 2004; Yadav et al., 2010, 2012). In 2007, Spannhoff et al. presented the first target-based virtual screening with the NCI diversity set to discover novel PRMT1 inhibitors. The GRID-based pharmacophore model, a methodology originally introduced by Ortuso et al. in 2006, was applied as a post-docking filter to analyze all preliminary docking solutions (Ortuso et al., 2006). The study resulted in the identification of allantodapsone (compound 6 in **Figure 2**) with an IC<sub>50</sub> value of 1.7 µM (Spannhoff et al., 2007a).

*<sup>a</sup>IC<sub>50</sub> values smaller than 1 µM.*

*<sup>b</sup>IC<sub>50</sub> values in the range of 1–10 µM.*

*<sup>c</sup>IC<sub>50</sub> values in the range of 10–100 µM.*

*The stars denote that the IC<sub>50</sub> value of the most potent compound against this target is in the corresponding range.*

In a follow-up study, Heinke et al. expanded this work to a larger compound collection, the ChemBridge database containing 328,000 molecules (Heinke et al., 2009). Based on the previously reported binding modes of allantodapsone, pharmacophore models were generated in LigandScout with one hydrogen bond donor (HBD), one hydrogen bond acceptor (HBA), two hydrophobic/aromatic features, one included volume, and 23 excluded volumes, leading to the identification of nine compounds with PRMT1 inhibitory activity below 35 µM.
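To make the idea of a feature-based pharmacophore query concrete, the following minimal sketch checks whether a ligand conformer reproduces the pairwise distances between query features. The feature coordinates, the tolerance, and the function name `matches_pharmacophore` are illustrative assumptions; the sketch does not reproduce the LigandScout models described above, and included and excluded volumes are omitted.

```python
from itertools import combinations
from math import dist

# Hypothetical query: feature type -> 3D position (angstrom).
QUERY = {
    "HBD": (0.0, 0.0, 0.0),
    "HBA": (3.0, 0.0, 0.0),
    "HYD": (0.0, 4.0, 0.0),
}

def matches_pharmacophore(ligand_features, query=QUERY, tol=1.0):
    """Return True if every pairwise feature distance in the ligand
    agrees with the query distance within `tol` angstrom."""
    if not set(query) <= set(ligand_features):
        return False  # a required feature type is missing
    for a, b in combinations(sorted(query), 2):
        d_query = dist(query[a], query[b])
        d_ligand = dist(ligand_features[a], ligand_features[b])
        if abs(d_query - d_ligand) > tol:
            return False
    return True

# Two hypothetical conformers (one feature of each type for simplicity).
hit = {"HBD": (1.0, 1.0, 0.0), "HBA": (4.2, 1.0, 0.0), "HYD": (1.0, 5.3, 0.0)}
miss = {"HBD": (0.0, 0.0, 0.0), "HBA": (7.0, 0.0, 0.0), "HYD": (0.0, 4.0, 0.0)}
```

Comparing distances rather than absolute coordinates keeps the check independent of how the ligand is positioned in space, which is why no alignment step is needed in this simplified form.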

Ligand-based pharmacophore modeling is also a powerful tool in drug discovery campaigns. In 2012, Wang et al. constructed four pharmacophore models in Discovery Studio version 2.1 with HBA, HBD, and ring/aromatic (RA) as key chemical features, based on 17 reported active molecules. The established models were then used as queries to search a library of theoretically soluble small molecules. Through cluster analysis in combination with biological assays, A9 and A36 (compounds 7–8 in **Figure 2**) were identified as PRMT1 inhibitors with IC<sub>50</sub> values of 41.7 and 12.0 µM, respectively (Wang et al., 2012). Kinetic analysis demonstrated that A9 was a peptide-competitive PRMT1 inhibitor whereas A36 was a non-competitive PRMT1 inhibitor; both could be used as parent compounds for further chemical optimization.

Pharmacophore modeling has also been widely applied in hit identification for other HMTs. In 2016, aiming to identify novel EZH2 inhibitors, Wu et al. conducted ligand-based pharmacophore modeling based on validated EZH2 inhibitors (Wu et al., 2016). The reliability of the constructed models was evaluated by enrichment analysis using the molecules in the test sets. Based on the established models, they identified the novel EZH2 inhibitor DCE\_254 (compound 18 in **Figure 2**) with an IC<sub>50</sub> value of 11 µM.
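Enrichment analysis of this kind is commonly summarized by an enrichment factor, the ratio between the fraction of actives in the top-ranked subset and the fraction of actives in the whole test set. The sketch below uses that common definition; the ranked list is hypothetical and is not data from the cited study.

```python
def enrichment_factor(ranked_labels, fraction=0.1):
    """Enrichment factor of actives in the top `fraction` of a ranked list.

    ranked_labels: sequence of 1 (active) / 0 (decoy), best-scored first.
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)

# Hypothetical test set: 20 molecules, 4 actives, ranked by model fit.
# Three of the top four are actives, versus 4 of 20 overall.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(ranked, fraction=0.2)
```

A value above 1 indicates that the model ranks actives ahead of decoys more often than random selection would.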

In 2015, through integrated structure-based pharmacophore modeling and molecular docking, Meng et al. discovered a SET7 inhibitor, DC\_S100, with an IC<sub>50</sub> value of 30.0 µM (Meng et al., 2015). Docking-based SAR analysis followed by structure optimization led to the identification of DC-S239 (compound 19 in **Figure 2**), with an IC<sub>50</sub> value of 4.6 µM. In addition, DC-S239 dose-dependently inhibited the proliferation of MCF7, HL60, and MV4-11 cells with IC<sub>50</sub> values in the micromolar range, supporting its potential use in a cellular context.

#### Molecular Dynamics Simulation

Molecular dynamics (MD) simulation is a useful theoretical technique for investigating the conformations and dynamic behavior of biomolecules over long time scales, providing atomic-level insight into regulatory mechanisms (Lindorff-Larsen et al., 2011; Okumura et al., 2018). To characterize the elusive roles of the N-terminal region and the dimerization arms in PRMT1 activity, Zhou et al. performed MD simulations with the GROMACS 4.3 package based on an hPRMT1 homology model in the monomer and dimer states (Zhou et al., 2015b). The simulations captured the dynamic correlations between the N-terminal region and the dimerization arms. Moreover, normalized covariance analysis and principal component analysis (PCA) were applied to analyze the energy landscape of different conformations in reduced dimensions. Through network topology analysis, a long-distance communication pathway was proposed and further validated by biochemical mutational experiments. The simulations disclosed the underlying molecular mechanism of allosteric communication between the two regions and provided the rationale for designing mechanism-based, PRMT subtype-selective inhibitors.
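As an illustration of what a normalized covariance between two atoms measures, the sketch below computes the usual cross-correlation coefficient from the displacement of each atom around its mean position. The definition used here is the common textbook one and is an assumption about the general technique, not a description of the exact analysis in the cited study; the trajectories are toy data.

```python
from math import sqrt

def cross_correlation(traj_i, traj_j):
    """Normalized covariance C_ij between two atoms.

    traj_i, traj_j: lists of (x, y, z) positions, one per frame,
    assumed to be already superimposed on a reference structure.
    """
    n = len(traj_i)
    if n == 0 or n != len(traj_j):
        raise ValueError("trajectories must be non-empty and equally long")
    mean_i = [sum(p[k] for p in traj_i) / n for k in range(3)]
    mean_j = [sum(p[k] for p in traj_j) / n for k in range(3)]
    dot_ij = dot_ii = dot_jj = 0.0
    for p, q in zip(traj_i, traj_j):
        dp = [p[k] - mean_i[k] for k in range(3)]
        dq = [q[k] - mean_j[k] for k in range(3)]
        dot_ij += sum(a * b for a, b in zip(dp, dq))
        dot_ii += sum(a * a for a in dp)
        dot_jj += sum(b * b for b in dq)
    # Undefined (division by zero) if either atom never moves.
    return dot_ij / sqrt(dot_ii * dot_jj)

# Toy trajectories: atom B moves with atom A, atom C moves against it.
atom_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atom_b = [(5.0, 1.0, 0.0), (6.0, 1.0, 0.0), (7.0, 1.0, 0.0)]
atom_c = [(9.0, 0.0, 0.0), (8.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
```

Values near +1 indicate correlated motion and values near −1 indicate anti-correlated motion; evaluating the coefficient for all residue pairs gives the correlation map from which communication networks are typically built.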

Molecular dynamics simulations can be widely applied not only in studies of protein dynamic regulation but also in mechanism-of-action (MOA) studies of small-molecule inhibitors. To uncover the molecular basis of selective PRMT1 inhibition by diamidine inhibitors, Yan et al. conducted extensive MD simulations and molecular mechanics/Poisson–Boltzmann surface area (MM/PBSA) calculations to analyze the interaction patterns in the binding cavity of the docked complex, which provided an avenue for designing more potent and specific inhibitors (Yan et al., 2014; Zhang et al., 2017a,b). A similar MD study was reported by Yang and co-workers to propose binding poses of identified PRMT1 inhibitors and circumvent the limitations introduced by the inaccuracy of molecular docking methods (Yang et al., 2017).
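For orientation, MM/PBSA end-point estimates are usually written in the following general form. This is the textbook decomposition, given here as an assumption about the general method rather than the exact protocol of the cited studies; the entropy term is often omitted in practice.

$$
\Delta G_{\mathrm{bind}} = \langle G_{\mathrm{complex}} \rangle - \langle G_{\mathrm{receptor}} \rangle - \langle G_{\mathrm{ligand}} \rangle, \qquad G = E_{\mathrm{MM}} + G_{\mathrm{PB}} + G_{\mathrm{SA}} - TS
$$

Here $E_{\mathrm{MM}}$ is the molecular mechanics energy, $G_{\mathrm{PB}}$ is the polar solvation free energy from the Poisson–Boltzmann equation, $G_{\mathrm{SA}}$ is the nonpolar term estimated from the surface area, and the angle brackets denote averages over MD snapshots.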

#### High Throughput Virtual Screening

The rapid growth of available structural information on HMTs has greatly facilitated the application of docking-based virtual screening (DBVS). In 2007, via virtual screening and 2D similarity-based analog searching, Spannhoff et al. identified RM65 (compound 9 in **Figure 2**) as a PRMT1 inhibitor with an IC<sub>50</sub> value of 55.4 µM (Spannhoff et al., 2007b). In 2010, a similar study was reported by Feng and co-workers, who performed structure-based virtual screening of 400,000 compounds. In this study, NS-1 (compound 10 in **Figure 2**) was identified as a PRMT1 inhibitor with an IC<sub>50</sub> value of 12.7 µM that directly targeted the peptide substrate instead of the enzyme (Feng et al., 2010). In 2014, Xie and co-workers combined docking methods, including GLIDE and DOCK, for in silico screening. The authors identified DCLX069 and DCLX078 (compounds 11–12 in **Figure 2**) with IC<sub>50</sub> values of 17.9 and 26.2 µM, respectively, in biochemical assays (Xie et al., 2014).
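A 2D similarity-based analog search of the kind mentioned above is typically built on the Tanimoto coefficient between molecular fingerprints. The sketch below assumes fingerprints are already available as sets of on-bit indices; the bit sets, the compound names, and the 0.7 threshold are hypothetical, and a real workflow would generate fingerprints with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def analog_search(query_fp, library, threshold=0.7):
    """Return (name, similarity) pairs at or above `threshold`, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, s) for name, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical on-bit sets for a query hit and a small library.
query = {1, 4, 7, 9, 12}
library = {
    "analog_1": {1, 4, 7, 9, 12, 15},   # 5 shared bits, 6 in the union
    "analog_2": {1, 4, 7, 20},          # 3 shared bits, 6 in the union
    "unrelated": {30, 31, 32},
}
hits = analog_search(query, library)
```

With these toy fingerprints only `analog_1` passes the threshold, which mirrors how an initial screening hit is expanded into a small set of close analogs for testing.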

A number of attempts have also been made to identify inhibitors of PRMTs other than PRMT1. In 2015, Alinari et al. used comparative modeling and structure-based virtual screening with the ChemBridge CNS-Set library of 10,000 small-molecule compounds, leading to the identification of the first-in-class PRMT5 inhibitor CMP5 (compound 13 in **Figure 2**). In a cellular context, CMP5 selectively inhibited the proliferation and transformation of EBV-driven B lymphocytes (Alinari et al., 2015). In 2016, Ferreira de Freitas et al. started from two basic amine tails that mimic the side chain of the substrate arginine and established a PRMT-focused virtual library. Through initial biochemical screening and structure-based optimization, the authors identified compound 27 (compound 14 in **Figure 2**) as a selective CARM1 inhibitor with an IC<sub>50</sub> value of 0.05 µM and a ligand efficiency of 0.43 (Ferreira de Freitas et al., 2016). In 2017, Ji et al. carried out molecular docking studies with semi-flexible docking in GOLD and identified the selective PRMT5 inhibitor P5i-6 (compound 15 in **Figure 2**) with an IC<sub>50</sub> of 0.57 µM (Ji et al., 2017). Similarly, Ye et al. identified a PRMT5 inhibitor named DC\_C01 (compound 16 in **Figure 2**) with an IC<sub>50</sub> value of 2.8 µM via docking-based virtual screening and structure modification (Ye et al., 2017). Concurrent with the two studies described above, through hierarchical docking strategies and chemical optimization, Chen et al. identified DCPR049\_12 (compound 17 in **Figure 2**), which showed promising inhibitory activity against type I PRMTs with an IC<sub>50</sub> value in the nanomolar range (Wang et al., 2017).
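Ligand efficiency, quoted above for the CARM1 inhibitor, normalizes binding free energy by the number of heavy atoms. The sketch below uses the common approximation in which the IC<sub>50</sub> stands in for the dissociation constant; the function name and the example compound are hypothetical and do not correspond to any compound in the cited studies.

```python
from math import log

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol*K)

def ligand_efficiency(ic50_molar, heavy_atoms, temperature=298.15):
    """Approximate LE = -RT ln(IC50) / N_heavy in kcal/mol per heavy atom.

    Using IC50 in place of a dissociation constant is an approximation.
    """
    if ic50_molar <= 0 or heavy_atoms <= 0:
        raise ValueError("IC50 and heavy-atom count must be positive")
    delta_g = R_KCAL * temperature * log(ic50_molar)  # negative for IC50 < 1 M
    return -delta_g / heavy_atoms

# Hypothetical compound: IC50 = 1 uM, 20 heavy atoms.
le = ligand_efficiency(1e-6, 20)
```

Because the metric rewards potency per atom, it is mainly used to compare small hits and fragments before optimization adds molecular weight.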

Besides PRMTs, docking-based virtual screening has also been applied in epi-probe design for other HMTs. In 2015, in an attempt to search for SMYD3 inhibitors, Peserico et al. performed high-throughput virtual screening of the CoCoCo database, containing nearly 260,000 molecules, in GLIDE version 5.7 (Peserico et al., 2015). The study led to the identification of BCI-121 (compound 20 in **Figure 2**) as the best candidate for SMYD3 inhibition, which reduced global H3K4me2/3 and H4K5me levels in colorectal cancer. Similarly, Chen et al. identified the DOT1L inhibitor DC\_L115 (compound 21 in **Figure 2**) with an IC<sub>50</sub> value of 1.5 µM via structure-based virtual screening of approximately 200,000 molecules in the SPECS database (Chen S. et al., 2016).

Very recently, Wang and co-workers developed a target-specific scoring function for SAM-dependent methyltransferases, named the SAM-score, based on epsilon support vector regression (ε-SVR). Using the resulting regression model, the authors identified compound 6 (compound 22 in **Figure 2**) as a DOT1L inhibitor with an IC<sub>50</sub> of 8.3 µM (Wang et al., 2017). Successful studies on the discovery of other HMT inhibitors (compounds 23–26 in **Figure 2**) for G9a and SETD8 based on in silico approaches have also been reported elsewhere (Chen W. L. et al., 2016; Kondengaden et al., 2016; Milite et al., 2016).
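ε-SVR fits a regression function while ignoring errors smaller than a tolerance ε. In its standard textbook form (the specific descriptors and parameters of the SAM-score are not given here, so none are assumed), the loss and the training objective can be written as:

$$
L_{\varepsilon}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\; |y - f(\mathbf{x})| - \varepsilon\bigr)
$$

$$
\min_{\mathbf{w},\, b}\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} L_{\varepsilon}\bigl(y_i, f(\mathbf{x}_i)\bigr), \qquad f(\mathbf{x}) = \mathbf{w}^{\top}\phi(\mathbf{x}) + b
$$

Here $y_i$ would be the measured binding affinity of complex $i$, $\mathbf{x}_i$ its interaction descriptors, $\phi$ the feature map induced by the kernel, and $C$ the penalty that balances model flatness against training error.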

#### Histone Acetyltransferases

Histone acetyltransferases (HATs) transfer acetyl groups onto the N-terminal tails of core histones and consequently give rise to DNA relaxation, which is closely related to gene activation (Brown et al., 2000). HATs can be divided into four categories on the basis of sequence similarity: the GNAT family (GCN5 and PCAF), the MYST family (MOZ/MORF, YBF2/SAS3, SAS2, and TIP60), p300/CBP, and RTT109 (Dancy and Cole, 2015). Emerging evidence indicates that deregulation of HATs is closely correlated with tumorigenesis, neurological disorders, and inflammatory diseases (Yang, 2004; Rajendrasozhan et al., 2009; Sheikh, 2014). Several types of HAT inhibitors have been reported, such as bi-substrate inhibitors, natural products, and small molecules. However, there is still a large gap between their in vitro activities and their potential application as therapeutic agents in vivo because current inhibitors lack potency and selectivity, which is a long-standing challenge in the field.

In 2010, Bowers et al. conducted structure-based in silico screening of ca. 500,000 commercially available compounds to identify p300 inhibitors (Bowers et al., 2010). The compounds were scored and ranked by ICM (Internal Coordinate Mechanics) score in the ICM-VLS software version 3.5. The top 194 compounds were then cherry-picked by visual inspection and purchased from ChemBridge for biochemical analysis. Among them, C646 (compound 27 in **Figure 2**) was identified with a K<sub>i</sub> value of 400 nM. Further in vitro assays demonstrated that C646 was a cofactor-competitive and selective p300 inhibitor. The detailed interaction patterns were confirmed by site-directed mutagenesis, in accordance with the predicted computational model.

Very recently, Lasko and co-authors performed a similar docking-based in silico screen of nearly 800,000 compounds, and 1,300 available compounds were tested in radioactive p300 acetylation assays (Lasko et al., 2017). Among them, a hydantoin and a conjugated thiazolidinedione were identified with IC<sub>50</sub> values of 5.1 and 11.5 µM, respectively. Further optimization of the hydantoin scaffold yielded A-485 (compound 28 in **Figure 2**) with an IC<sub>50</sub> value of 60 nM. A-485 is a first-in-class, highly potent, and selective p300/CBP catalytic inhibitor that displayed significant selectivity over other HAT family members. In addition, it inhibited proliferation across a broad range of cancer cell lines, with specificity for hematological and prostate cell lineages, and retarded tumor growth in xenograft models, which underscores the therapeutic potential of targeting p300/CBP. Another small molecule, discovered by virtual screening of the ChEMBL bioassay database, is C14 (compound 29 in **Figure 2**), with an IC<sub>50</sub> value of 225 nM on PfGCN5 in a parasite growth assay (Kumar et al., 2017). C14 displayed promising antimalarial activity and showed no effect on mammalian fibroblast cells, supporting its safety for further applications.

#### ERASER

Erasers are key modifying enzymes in charge of removing epigenetic marks and participate in the dynamic regulation of gene expression patterns (Mosammaparast and Shi, 2010). Based on their substrates and functions, erasers can be divided into different families such as histone deacetylases (HDACs), RNA demethyltransferases, histone demethylases (HDMs), and histone deubiquitinases (Arrowsmith et al., 2012). Among them, HDACs are the most studied targets for pharmacological intervention. So far, five epi-drugs targeting HDACs have been approved for clinical use, and other HDAC inhibitors such as entinostat and CUDC-907 have entered clinical trials for advanced cancer treatment (Falkenberg and Johnstone, 2014; Li and Seto, 2016). In the following section, we focus on representative computational work in drug discovery and related mechanism studies in this field (**Figure 3**).

#### Histone Deacetylases

In mammalian cells, HDACs comprise 18 isoforms and are broadly classified into four categories based on their distinct structural features and subcellular localization: Class I (HDAC1, 2, 3, and 8), Class II (Class IIa: HDAC4, 5, 7, and 9; Class IIb: HDAC6 and 10), Class III (NAD-dependent sirtuins; SIRT1–7), and Class IV (HDAC11) (Gregoretti et al., 2004; Li and Seto, 2016). HDACs catalyze the deacetylation of histone as well as non-histone substrates and are implicated in fundamental physiological processes including gene transcription, cell cycle regulation, DNA damage response, and metabolic homeostasis (Bode and Dong, 2004; Minucci and Pelicci, 2006). There is a growing body of evidence that deregulation of HDAC activity is strongly correlated with the pathogenesis of several diseases, including hematological malignancies and solid tumors, which underlines the significance of targeted intervention (Minucci et al., 2001; Zhu et al., 2004; Buurman et al., 2012). Significant progress has been made in the development of HDAC inhibitors (HDACis) over recent decades based on in silico approaches (Yanuar et al., 2016). Some representative studies using computational methods in this field are described below.

#### Quantum Mechanical Calculation

The hydroxamic acid moiety present in most common HDACis is usually regarded as a problematic fragment with a poor pharmacokinetic profile. To rationally design non-hydroxamic acid HDACis with more favorable physicochemical properties, Wang et al. performed density functional theory (DFT) calculations to investigate the binding modes and binding free energies of potential zinc-binding groups (ZBGs) (Wang et al., 2007). In the model active site, only the side chains of the zinc-coordinated residues were kept for the calculation: two formates and one imidazole, representing the functional groups of the two zinc-coordinated aspartic acid residues and the histidine, respectively. The calculations suggested alternatives with novel structural features that favor zinc binding, including 3-hydroxypyrones and β-amino ketones, which may be further utilized for medicinal chemistry optimization of current HDACis.

Apart from its application in novel hit discovery, quantum mechanical calculation can also provide precise and solid mechanistic interpretation, which facilitates the design of novel and specific HDACis. Finnin et al. proposed that the H143-D183 catalytic dyad was indispensable for HDAC8 enzymatic activity by abstracting a proton from the bridged water molecule, while Zhang et al. underscored the role of the H142-D176 dyad in the proton-shuttle process (Finnin et al., 1999; Wu et al., 2010). In addition, the function of the potassium ion near the active pocket in HDAC crystal structures is also under debate (Gantt et al., 2010; Werbeck et al., 2014). Based on quantum mechanics/molecular mechanics (QM/MM) simulations that included the complete set of catalytic residues in the quantum region, Chen et al. explained the disagreement among those observations and uncovered the unique catalytic mechanism of HDAC8. The results disclosed the inhibitory role of the potassium ion at the active site and the significance of the pK<sub>a</sub> values of the zinc-coordinating moiety in HDACis, which would be of great value in developing potent and subtype-selective mechanism-based HDACis (Chen K. et al., 2014).

#### Quantitative Structure-Activity Relationship Analysis

QSAR analysis is a well-established ligand-based computational methodology that describes the quantitative relationship between the biological activity of compounds and their physicochemical properties or structural features, and it represents a milestone in the era of rational drug design (Gupta, 2007; Yadav et al., 2013, 2017a). Since the therapeutic value of HDACis has been recognized in recent years and many potent HDACis have been identified, comprehensive QSAR studies have been conducted using different kinds of data sets to facilitate drug design and discovery against this drug-actionable target. In 2004, Wang et al. developed QSAR models based on hydroxamic acid-based HDACis and found a statistically significant relationship between the charge distribution, hydrophobicity, and geometrical shape of the compounds and their anti-proliferative activities in the PC-3 cell line (Wang et al., 2004). Since then, the number of QSAR modeling studies has increased dramatically (Xie et al., 2004; Guo et al., 2005; Juvale et al., 2006; Chen et al., 2008; Kozikowski et al., 2008; Ragno et al., 2008).
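In its simplest form, a QSAR model is a regression of activity on one or more descriptors. The sketch below fits a one-descriptor linear model by ordinary least squares; the descriptor values and activities are invented for illustration, and published HDACi models use many descriptors and more rigorous validation.

```python
def fit_linear_qsar(descriptor, activity):
    """Ordinary least-squares fit of activity = slope * descriptor + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(descriptor)
    if n < 2 or n != len(activity):
        raise ValueError("need at least two paired observations")
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    syy = sum((y - mean_y) ** 2 for y in activity)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical training set: a hydrophobicity descriptor versus pIC50.
logp = [1.0, 2.0, 3.0, 4.0]
pic50 = [5.1, 5.9, 7.1, 7.9]
slope, intercept, r2 = fit_linear_qsar(logp, pic50)
predicted = slope * 2.5 + intercept  # prediction for a new compound
```

The fitted coefficients can then be used to estimate the activity of untested compounds, which is the basis for using validated QSAR models as virtual screening filters.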

The first QSAR study used for virtual screening was reported by Tang and co-workers (Tang et al., 2009). Based on validated QSAR models, the authors screened an in-house library of ca. 9.5 million compounds and identified four novel scaffolds that favored HDAC inhibition (compounds 30–33 in **Figure 3**). In 2012, Xiang and colleagues developed pharmacophore and 3D-QSAR models on a series of (benz)imidazole inhibitors (Xiang et al., 2012). The results led to the discovery of 27 inhibitors with putative HDAC2 inhibitory activity. Later on, several groups employed QSAR modeling workflows to predict HDACi activity and selectivity (Silvestri et al., 2012; Zhao et al., 2013b). In 2014, Kandakatla et al. conducted ligand-based 3D-QSAR pharmacophore modeling and identified eight hit compounds from the Maybridge and NCI databases as potential HDAC2 inhibitors (Kandakatla and Ramakrishnan, 2014). In the same year, based on 79 previously published substrate-based SIRT1 inhibitors, Kokkonen and co-workers performed comparative molecular field analysis (CoMFA) studies, which were successfully applied to predict the bioactivity of 13 newly synthesized compounds (Kokkonen et al., 2014). Similarly, Cao et al. developed carefully validated QSAR models using support vector classification and regression based on published HDAC8 inhibitors, and these models were applied in the next round of drug screening (Cao et al., 2016).

#### High Throughput Virtual Screening

In 2007, Price and colleagues initiated virtual screening with an HDAC-focused library containing 644 hydroxamic acids (Price et al., 2007). The study resulted in the identification of ADS100380 (compound 34 in **Figure 3**) with an IC<sub>50</sub> value of 0.75 µM, followed by iterative optimization. Another successful application of structure-based virtual screening for the discovery of HDACis was carried out by Park et al. based on an HDAC1 homology model (Park et al., 2010). The newly identified inhibitors (compounds 35–36 in **Figure 3**) presented previously unreported chemotypes with IC<sub>50</sub> values in the micromolar range. In 2016, Yoo et al. rationally designed a selective HDAC6 inhibitor with an IC<sub>50</sub> value of 0.199 µM (compound 37 in **Figure 3**), inspired by preliminary virtual screening of the LeadQuest chemical database containing 80,600 entries (Yoo et al., 2016). Very recently, Hu et al. designed a versatile VS pipeline with better screening power for the rapid discovery of selective HDAC3 inhibitors (Hu et al., 2017). Many efforts have also been devoted to the discovery of inhibitors of sirtuins, the NAD-dependent class III histone deacetylases, which have been reported elsewhere (Salo et al., 2013; Kokkonen et al., 2015; Padmanabhan et al., 2016).

It is commonly accepted that individual computational approaches may not perform optimally when applied alone, owing to the complexity of the epigenetic network, which highlights the importance of combining in silico approaches in epigenetic drug discovery. Hou et al. developed a ZBG-based pharmacophore model with enhanced sensitivity for virtual screening, leading to the identification of the selective HDAC8 inhibitor H8-A5 (compound 38 in **Figure 3**) with an IC<sub>50</sub> value of 1.8 µM. Molecular docking followed by a 50 ns MD simulation was then performed to give detailed insight into the MOA of the identified hits (Hou et al., 2015). In 2017, Ganai and co-workers employed a top-down combinatorial strategy comprising molecular docking, molecular mechanics generalized Born surface area (MM-GBSA) calculations, MD simulation with trajectory clustering, and energetically optimized pharmacophore modeling. The authors identified distinct hot spots in the highly homologous HDAC1 and HDAC2 that shed light on the development of specific HDAC2 inhibitors against neurological diseases (Ganai et al., 2017). Hsu and co-workers employed a VS approach against the classified NCI database, leading to the identification of class IIa-selective HDACis (compounds 39–41 in **Figure 3**). Homology modeling was performed to generate HDAC5 and HDAC9 3D structures that provided atomic-resolution insight into the selectivity of these inhibitors (Hsu et al., 2017).

#### RNA Demethyltransferases

RNA methylation is one of the most important chemical marks in the epigenetic landscape, among which N<sup>6</sup>-methyladenosine (m6A) is the most abundant and conserved modification in eukaryotes (Desrosiers et al., 1974). Reversible N<sup>6</sup>-methyladenosine is dynamically regulated by writers and erasers and is involved in gene expression, RNA splicing, transport, and stability (Fu et al., 2014). The fat mass and obesity-associated (FTO) enzyme is one of the RNA demethylases and depends on Fe(II) and α-KG cofactors for its oxidative demethylation activity (Jia et al., 2011). Genetic variations of FTO are functionally associated with human obesity and metabolic disorders (Frayling et al., 2007). Recent studies demonstrate that FTO is highly expressed in MLL-rearranged AML and plays a pivotal role in leukemogenesis (Li et al., 2017). Collectively, these studies hold promise for drug design and development targeting FTO for therapeutic translation.

To gain detailed insight into the molecular mechanism of its catalytic specificity, the crystal structure of FTO in complex with the 3-meT substrate was solved, which laid the foundation for structure-based drug design (Yadav et al., 2010). Chen et al. employed a virtual screening strategy in an effort to identify inhibitors targeting the FTO active site. After initial screening against the drug-like SPECS database in DOCK version 4.0, the primary results were evaluated in Sybyl and re-examined with AutoDock version 4.0. The top 300 compounds were then selected for cluster analysis to ensure scaffold diversity. Finally, 114 compounds were picked for biochemical validation, leading to the identification of the natural product rhein (compound 50 in **Figure 3**) as a competitive FTO inhibitor. Further binding energy decomposition highlighted the electrostatic interactions between R316 and rhein, which were validated by follow-up biophysical studies (Chen et al., 2012; Aik et al., 2013). Later on, more efforts were devoted to the design and discovery of selective FTO inhibitors (Huang et al., 2015; Toh et al., 2015). These structurally diverse inhibitors may serve as parent templates in ligand-based drug design approaches. The small-molecule sets could also be used to establish focused and biased libraries that may be useful for rational drug design against other RNA demethylases.

#### Histone Demethyltransferases

Histone demethylation remained ambiguous until the landmark discovery of the first lysine-specific demethylase, LSD1, in 2004 (Shi et al., 2004). These demethyltransferases catalyze lysine/arginine demethylation and function as transcriptional corepressors that are tightly associated with the dynamic regulation of methylation patterns shaping the epigenome (Dimitrova et al., 2015). Since then, more histone demethylases have been identified and their biological relevance has been disclosed (Kooistra and Helin, 2012). Currently, histone demethylases can be categorized into two main subfamilies based on homology and substrate specificity: LSD demethylases (LSD1-2) and Jumonji C (JmjC) domain-containing demethylases (JHDMs) (Markolovic et al., 2016). Dysfunction of histone demethylases has been observed in malignant diseases, especially cancers such as colorectal cancer, bladder cancer, and lung cancer (Hayami et al., 2011; Højfeldt et al., 2013). Harris et al. delineated the potential oncogenic role of LSD1 (KDM1A) in leukemia using a mouse model of MLL-AF9 leukemia (Harris et al., 2012). In another study, He et al. showed that KDM2B was highly expressed in leukemia samples and played a central role in the etiology and progression of acute myeloid leukemia (He et al., 2011). Thus, histone demethylases are considered putative epi-targets for discovering anticancer agents. In the following section, we discuss successful applications of computational approaches in this field.

#### High Throughput Virtual Screening

In pursuit of novel LSD1 inhibitors, Hazeldine et al. undertook a virtual screening strategy against the Maybridge compound library. SiteMap was employed to assess the druggability of the potential active chamber. Through high-throughput virtual screening in GLIDE, the authors identified a total of 10 hits with GlideScores lower than −7.5 kcal/mol. The most effective compound (compound 42 in **Figure 3**), featuring an amidoxime moiety, displayed moderate in vitro activity with an IC<sub>50</sub> value of 16.8 µM (Hazeldine et al., 2012). Later on, Sorna and co-workers reported structure-based docking studies with a ligand library containing 13 million compounds. The High Throughput Virtual Screen (HTVS) protocol integrated in the Schrödinger suite was applied, and the database was subsequently refined with rule-of-five filters to weed out non-binders and compounds with undesirable physicochemical parameters. The top 15% of compounds were selected and re-ranked by combinatorial scoring with GLIDE, ICM, and GOLD to discard false positives. Based on chemical diversity analysis and visual inspection of the initial docking results, 121 compounds were selected for biochemical validation, and further medicinal chemistry optimization led to the identification of the novel LSD1 inhibitor 12 (compound 43 in **Figure 3**) with an IC<sub>50</sub> value of 0.013 µM (Sorna et al., 2013). Continued efforts have been made toward the discovery of potent, selective epi-probes against LSD1 and other histone demethylases (compounds 44–45, 47 in **Figure 3**) based on computational approaches (Schmitt et al., 2013; Kutz et al., 2014; Roatsch et al., 2016). Chu et al. utilized GEMDOCK to screen the NCI database (∼236,962 compounds) in silico and identified a selective KDM4A/KDM4B inhibitor (compound 48 in **Figure 3**) with an IC<sub>50</sub> value at the micromolar level (Chu et al., 2014). In 2016, Korczynska et al. performed molecular docking screens of the ZINC fragment library (∼600,000 commercially available fragments) in DOCK version 3.6, leading to the identification of 5-aminosalicylates as KDM4C inhibitors with good ligand efficiency. Further docking analysis and fragment-linking optimization yielded a more potent inhibitor with a K<sub>i</sub> value of 43 nM (compound 49 in **Figure 3**) against KDM4C, which highlights the viability of this approach in fragment-based drug discovery (FBDD) (Korczynska et al., 2016).
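Two steps recur in screening workflows like those above: a rule-of-five property filter and a consensus re-ranking across several scoring functions. The sketch below illustrates both under simplifying assumptions. The descriptor values, compound names, and program labels (`prog_a`, `prog_b`, `prog_c`) are hypothetical, scores are assumed to be pre-converted so that lower is better for every program, and rank averaging is only one of several possible consensus schemes.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # Da
    logp: float
    h_donors: int
    h_acceptors: int
    scores: dict        # program label -> docking score (lower is better)

def passes_rule_of_five(c):
    """Lipinski's rule of five, applied strictly (no violation allowed)."""
    return (c.mol_weight <= 500 and c.logp <= 5
            and c.h_donors <= 5 and c.h_acceptors <= 10)

def consensus_rank(compounds, programs):
    """Rank-by-rank consensus: order compounds by their average rank."""
    rank_sum = {c.name: 0 for c in compounds}
    for prog in programs:
        ordered = sorted(compounds, key=lambda c: c.scores[prog])
        for rank, c in enumerate(ordered, start=1):
            rank_sum[c.name] += rank
    return sorted(rank_sum, key=lambda name: rank_sum[name] / len(programs))

# Hypothetical library of four docked compounds.
library = [
    Compound("cpd_1", 350.0, 2.1, 2, 5, {"prog_a": -9.0, "prog_b": -30.0, "prog_c": -60.0}),
    Compound("cpd_2", 420.0, 3.5, 1, 6, {"prog_a": -7.5, "prog_b": -35.0, "prog_c": -55.0}),
    Compound("cpd_3", 610.0, 6.2, 6, 12, {"prog_a": -10.0, "prog_b": -40.0, "prog_c": -70.0}),
    Compound("cpd_4", 300.0, 1.0, 3, 4, {"prog_a": -6.0, "prog_b": -20.0, "prog_c": -40.0}),
]
drug_like = [c for c in library if passes_rule_of_five(c)]
ranking = consensus_rank(drug_like, ["prog_a", "prog_b", "prog_c"])
```

In this toy example `cpd_3` has the best raw scores but is removed by the property filter, which shows why the order of filtering and scoring matters when selecting compounds for purchase.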

#### 3D-QSAR Pharmacophore Modeling

In 2015, Zhou et al. presented a pharmacophore-based ligand mapping strategy against LSD1 using a refined SPECS database (∼171,143 small molecules) in Discovery Studio version 2.5. 3D conformations of 37 compounds with known activities (22 compounds in the training set and 15 compounds in the test set) were generated and used to build the pharmacophore in the HypoGen module. The reliability of the pharmacophore model was verified by a Fischer randomization test and decoy set prediction. Through combined pharmacophore mapping and optimized docking in database screening, the authors identified XZ-09 (compound 46 in **Figure 3**) as a selective LSD1 inhibitor with an IC<sub>50</sub> value of 2.4 µM that may serve as a lead compound for further optimization (Zhou et al., 2015a).
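The logic behind a Fischer randomization test is that a model built on the real activities should fit better than models built after the activities have been scrambled. The sketch below shows that idea with a simple permutation test on squared correlation. It is a generic illustration with invented data and hypothetical function names, not the HypoGen procedure, which regenerates full pharmacophore hypotheses for each randomized data set.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

def randomization_test(x, y, n_perm=199, seed=0):
    """Fraction of activity-scrambled data sets that fit at least as well
    as the real one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    at_least_as_good = 0
    shuffled = list(y)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_perm + 1)

# Hypothetical data: model-estimated versus measured activities (pIC50).
estimated = [4.8, 5.3, 5.9, 6.4, 6.8, 7.5, 7.9, 8.6]
measured = [5.0, 5.2, 6.1, 6.3, 7.0, 7.4, 8.1, 8.5]
p_value = randomization_test(estimated, measured)
```

A small returned fraction suggests that the observed fit is unlikely to arise from a chance correlation between structures and activities.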

#### READER

Posttranslational modifications on histone tails, in their different modification states, are recognized by specific epigenetic readers, which recruit effector modules to elicit different functions. To date, several epigenetic readers have been well characterized, including acetyl-lysine readers, methyl-lysine readers, methyl-arginine readers, and phospho-serine readers. Among them, readers of lysine acetylation and methylation have been studied extensively as drug targets in epi-drug design and discovery. The acetyl-lysine readers consist of bromodomains and tandem PHD domains (Lange et al., 2008; Filippakopoulos et al., 2012). The readers associated with lysine methylation include PHD zinc finger domains, WD40, Tudor, double/tandem Tudor, MBT, Ankyrin Repeats, zf-CW, PWWP, and chromodomains (Kim et al., 2006; Collins et al., 2008; Musselman and Kutateladze, 2009; He et al., 2010; Rona et al., 2016; Schapira et al., 2017). Emerging evidence has demonstrated that the dysfunction of epigenetic readers is implicated in various diseases such as cancer, intellectual disability, aging, autoimmune disease, inflammation, and acquired immune deficiency syndrome (Baker et al., 2008; Greer and Shi, 2012; Jung et al., 2015). So far, several compounds selectively targeting epigenetic reader domains have been reported, and some of them have entered clinical studies (Greschik et al., 2017). Herein, we focus on computer-aided drug discovery for epigenetic readers and review successful examples to illustrate the advantages and potential applications of computational drug design and discovery in this field (**Figure 4**).

#### Druggability Prediction

Based on the crystal structures of epigenetic readers in complex with their substrates or small-molecule inhibitors, the druggability of these targets can be readily predicted by computational methods. Many practical programs have been developed and applied to explore potentially druggable pockets and assess the druggability of these binding sites (Halgren, 2009; Fauman et al., 2011). In 2011, Santiago et al. conducted a systematic druggability prediction for methyl-lysine binding proteins (Santiago et al., 2011). Based on terms such as the steric volume, enclosure, and hydrophobicity of the pocket, the Dscores of potential pockets were calculated using SiteMap. The results revealed that the druggability of different methyl-lysine readers was highly variable and dependent on backbone motion and intramolecular interactions; chromodomains, WDR domains, and PWWP domains were more targetable by small-molecule inhibitors than others such as Tudor and PHD domains.

In 2012, to explore the druggability of bromodomains, the acetyl-lysine binders, Vidler et al. retrieved the available crystal structures of 33 human bromodomains from the Protein Data Bank (PDB) and evaluated their druggability in SiteMap (Vidler et al., 2012). Among them, the bromodomain and extra-terminal (BET) family was predicted to be highly druggable, which had already been borne out by studies of small-molecule inhibitors, but this family is not representative of all bromodomain families. The authors classified 49 bromodomains into eight categories based on common binding-site features and found that only one group, comprising CECR2, FALZ (A/B), GCN5L2, PCAF, TAF1 (A/B)(2), and TAF1L(2), showed druggability comparable to that of the BET family. The other groups were predicted to have low scores, suggesting that they are challenging targets for epi-drug discovery. Collectively, these studies uncovered novel druggable readers that had been less explored, providing new opportunities for drug discovery.

# Combinatorial in Silico Virtual Screen Approaches

With the rapid development of BET inhibitors, more co-crystal structures became available, which made structure-based virtual screening and chemical modification easier. Building on the well-characterized interactions between BET proteins and their inhibitors, many computational studies have sought novel chemotypes for the BET family.

In 2013, Lucas and colleagues performed a high-throughput virtual screen of more than 7 million small molecules from the Dictionary of Natural Products, the ChEMBL database, and the ZINC database to discover novel inhibitors of BRD4(1) (Lucas et al., 2013). After docking with the standard-precision and extra-precision algorithms in GLIDE version 5.6, the 500 top-ranked hits were clustered into 33 diverse categories. Based on computational predictions of physicochemical, pharmacokinetic, and toxicological properties and of binding promiscuity, 22 candidate compounds were selected for experimental validation. Finally, 7 compounds comprising 6 novel scaffolds (compounds 51–56 in **Figure 4**) were identified with significant binding affinity. The subsequently solved complex structures of BRD4(1) with XD14, XD1, and XD25 revealed binding modes consistent with the docking simulation.

In 2015, Allen et al. developed in silico screening approaches against kinases and bromodomains that integrated machine learning and structure-based drug design strategies. Several BRD4 inhibitors (compounds 57–58 in **Figure 4**) and one dual EGFR-BRD4 inhibitor (compound 59 in **Figure 4**) were identified (Allen et al., 2015). Similarly, Xue and coauthors performed a structure-based virtual screen against BET bromodomains (Xue et al., 2016). Approximately 10,000 compounds were first screened against BRD4(1) in GLIDE version 6.1. Through binding free energy assessment and cluster analysis, 15 representative compounds were chosen for biological evaluation, of which two compounds with a benzo[cd]indol-2(1H)-one scaffold were identified as novel BRD4(1) inhibitors. Before optimizing this scaffold, the authors predicted the binding modes of the two compounds by molecular docking to characterize the critical interactions. A subsequent 20 ns MD simulation indicated that the docked conformations were stable and a reasonable starting point for hit optimization. Further SAR analysis and solved co-crystal structures guided the optimization, leading to the discovery of the highly potent compound 85 (compound 60 in **Figure 4**).

Concomitantly, Tripathi et al. carried out a virtual screen against BRD2(2) using 1,700 compounds from the NCI Diversity Set III library (Tripathi et al., 2016). Candidates were selected according to their free energy values, critical binding conformations, and ligand efficiency. Among them, the crystal structure of compound NSC127133 (compound 61 in **Figure 4**) in complex with BRD2(2) was solved and displayed distinct structural features. In 2017, Ayoub et al. performed a high-throughput virtual screen of 6,000,000 compounds from the ZINC database using the crystal structure of BRDT(1) (Ayoub et al., 2017). A dihydropyridopyrimidine scaffold (compound 62 in **Figure 4**) was identified that is highly selective for the BET family, has submicromolar affinity for BRD4(1) and BRDT(1), and can be synthesized in one step.
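Ligand efficiency, one of the selection criteria mentioned above, normalizes the binding free energy by molecular size. The sketch below uses the common definition LE = −ΔG / N<sub>heavy</sub>; the candidate names and values are hypothetical and do not come from Tripathi et al.

```python
from typing import List, Tuple


def ligand_efficiency(delta_g_kcal_mol: float, heavy_atoms: int) -> float:
    """Ligand efficiency in kcal/mol per heavy atom: LE = -dG / N_heavy."""
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return -delta_g_kcal_mol / heavy_atoms


# Hypothetical candidates: (name, predicted dG in kcal/mol, heavy-atom count).
candidates: List[Tuple[str, float, int]] = [
    ("cand_1", -9.0, 30),
    ("cand_2", -7.5, 18),
    ("cand_3", -10.2, 41),
]

# Rank by ligand efficiency rather than by raw free energy.
ranked = sorted(candidates, key=lambda c: ligand_efficiency(c[1], c[2]), reverse=True)
for name, dg, n_heavy in ranked:
    print(f"{name}: dG = {dg:.1f}, LE = {ligand_efficiency(dg, n_heavy):.2f}")
```

In this toy set the smallest compound ranks first even though its free energy is the weakest, which is why ligand efficiency is often used alongside the raw score when choosing starting points for optimization.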

With many new scaffolds uncovered by high-throughput virtual screening, Raj et al. chose to screen flavonoids and their derivatives rather than a large general-purpose compound library (Raj et al., 2017). The subsequent ADMET analysis indicated good drug-likeness for the identified compounds (compounds 63–66 in **Figure 4**), suggesting potential applications in therapies for BET-related diseases. In another study, Deepak et al. designed three benzotriazepine analogs using in silico tools with the aim of improving selectivity among BET family members (Deepak et al., 2017). Guided by ensemble docking, MD simulation, and binding energy calculations, compound Bzt-W49 (compound 67 in **Figure 4**) was synthesized and showed about 10-fold selectivity for BRD4 over BRD2.

Besides the virtual screening efforts against the BET family, drug discovery for other readers has also progressed considerably in recent years. In 2016, structure-based pharmacophore modeling combined with molecular docking was carried out to identify small-molecule inhibitors of the methyl-lysine reader protein Spindlin1 (Robaa et al., 2016). Several hits (compounds 68–70 in **Figure 4**) were subjected to 2D chemical similarity searches and medicinal chemistry optimization, which improved the potency more than 10-fold.

In addition to structure-based virtual screening run directly against commercial libraries, ligand-based computational methods can improve screening efficiency. In 2013, Vidler et al. carried out substructure searches to enrich for promising chemotypes along two branches (Vidler et al., 2013). In the first, substructures that mimic the acetyl-lysine moiety were searched in the database. In the second, similarity searching with pharmacophore models, shape-based searches, and 2D fingerprint searches was used to identify chemotypes distinct from known inhibitors. The extensive set of substructures obtained was then used for molecular docking of compounds from the eMolecules database, followed by manual selection for experimental validation. Finally, six novel hits (compounds 71–76 in **Figure 4**), including four unprecedented acetyl-lysine mimetics, were identified. Structure-guided chemical modifications based on co-crystal structures were then performed to improve the potency. In 2016, Hügle et al. screened the PurchasableBoX library to select analogs of the previously identified bromodomain inhibitor XD14 (compound 52 in **Figure 4**) (Hügle et al., 2016). Several candidates were used to explore the SAR of XD14 and additional structural features of BRD4 through DFT calculations, atom-based QSAR, and ligand-based pharmacophore modeling, which offered guidance for the development of novel BRD4(1) inhibitors.
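The 2D fingerprint similarity searches mentioned above typically rank library compounds by the Tanimoto coefficient against a known inhibitor. The following minimal sketch represents each fingerprint as the set of its "on" bit indices; the bit sets and compound names are hypothetical, and a real workflow would generate the fingerprints with a cheminformatics toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two bit-set fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(
    query: Fingerprint, library: Dict[str, Fingerprint]
) -> List[Tuple[str, float]]:
    """Return library entries sorted from most to least similar to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints for a known inhibitor and three library compounds.
query_fp = frozenset({1, 4, 7, 9, 15})
library = {
    "lib_1": frozenset({1, 4, 7, 9, 20}),
    "lib_2": frozenset({2, 3, 15}),
    "lib_3": frozenset({1, 4, 7, 9, 15, 31}),
}
for name, score in rank_by_similarity(query_fp, library):
    print(f"{name}: Tanimoto = {score:.2f}")
```

Depending on the goal, the ranked list can be cut at a high similarity threshold to find close analogs or at an intermediate one to find distinct chemotypes that still share key features with the query.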

#### Fragment-Based Drug Discovery

Fragment-based drug discovery (FBDD) is widely practiced, and some FBDD-derived drugs have entered clinical trials (Erlanson et al., 2016). Many integrated CADD tools have been designed for scaffold replacement and fragment growing, such as the Molecular Operating Environment (MOE) developed by Chemical Computing Group, which can accelerate FBDD-guided drug discovery. In 2012, Chung et al. built a fragment library containing substructures with acetyl-lysine mimetic functional groups to identify novel BET inhibitors (Chung et al., 2012). The library was filtered to eliminate unsuitable substructures based on the "rule of three" and predicted pK<sub>a</sub> values. The remaining fragments were clustered, and representative members of each cluster were selected according to docking results. Coupled with follow-up experiments, Chung and colleagues identified several compounds (compounds 77–81 in **Figure 4**) with two novel fragment scaffolds, which significantly extended the chemotypes of current inhibitors.

In 2013, Zhao et al. built a fragment library to discover novel BRD4 inhibitors (Zhao et al., 2013a). Fragment compounds in the ZINC database were filtered by the following rules: molecular weight ≤ 250 Da, rotatable bonds ≤ 5, log P ≤ 3.5, and between 1 and 4 rings in the smallest set of smallest rings. Based on Tanimoto similarity calculated in Pipeline Pilot, 487 representative fragments were purchased to build the library. Through molecular docking of this in-house library followed by crystallization experiments, 9 fragments were observed in the binding pocket of BRD4(1) in solved crystal structures; four of them (compounds 82–85) are shown in **Figure 4**. A further pharmacokinetic study indicated great potential for drug development. In 2017, Ali et al. performed docking-based virtual screening of a fragment-like database containing nearly 800,000 compounds from the ZINC database in pursuit of BRD4 inhibitors (Ali et al., 2017). The authors reported a novel scaffold (compound 86 in **Figure 4**) containing [1,2,4]triazolo[4,3-a]quinoxaline as a BET inhibitor. Several rounds of chemical modification led to analogs with high potency and improved pharmacokinetic properties.
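The property cutoffs quoted for the fragment library of Zhao et al. can be written as a simple filter. The sketch below assumes the descriptors have already been computed elsewhere (for example with a cheminformatics toolkit); the `Fragment` class and the example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    name: str
    mol_weight: float      # molecular weight in Da
    rotatable_bonds: int
    logp: float
    ring_count: int        # rings in the smallest set of smallest rings


def passes_fragment_filter(f: Fragment) -> bool:
    """Apply the cutoffs quoted in the text: MW <= 250 Da, rotatable
    bonds <= 5, log P <= 3.5, and 1 <= ring count <= 4."""
    return (
        f.mol_weight <= 250.0
        and f.rotatable_bonds <= 5
        and f.logp <= 3.5
        and 1 <= f.ring_count <= 4
    )


# Hypothetical fragments with precomputed descriptors.
library: List[Fragment] = [
    Fragment("frag_a", 212.3, 2, 1.8, 2),  # passes every rule
    Fragment("frag_b", 268.4, 3, 2.1, 2),  # too heavy
    Fragment("frag_c", 145.2, 1, 0.9, 0),  # acyclic, fails the ring rule
]
kept = [f.name for f in library if passes_fragment_filter(f)]
print(kept)
```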

### Target-Specific Scoring Function

Considering the better druggability of the BET family, many efforts have been devoted to the discovery of novel BET inhibitors. However, the performance of both virtual screening and high-throughput screening varies, and both show high false-positive rates, which restricts their application in this field. To improve the enrichment factor in screening, Xing et al. developed a BRD4-specific score named BRD4LGR through a machine-learning-assisted approach (Xing et al., 2017). First-round virtual screening was performed in GLIDE version 5.6, and 453 compounds were selected for in vitro evaluation, resulting in a high false-positive rate of 95%. Based on the first-round screening results and other reported studies, structure and activity data for 814 compounds were collected to construct the specific scoring function. The authors identified critical molecular interaction features from reported complex structures and established a logistic regression model to correlate the interaction features with potencies. Compared with GLIDE and PMF, BRD4LGR discriminated BRD4 inhibitors from noninhibitors more effectively, with high specificity and sensitivity. A second-round virtual screen using BRD4LGR identified 15 new active compounds with a lower false-positive rate of 85%. Beyond this, BRD4LGR was able to interpret key structure-activity relationships of BRD4 inhibitors, which is valuable for chemistry optimization.
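To make the idea of a logistic-regression scoring function concrete, the sketch below shows the generic form of such a model together with the false-positive fraction used above, read here as the share of tested hits that failed experimental confirmation. The feature names, weights, and bias are hypothetical placeholders and are not the published BRD4LGR parameters.

```python
import math
from typing import Dict


def logistic_probability(
    features: Dict[str, float], weights: Dict[str, float], bias: float
) -> float:
    """Probability of activity from a linear combination of interaction features."""
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def false_positive_fraction(n_tested: int, n_confirmed: int) -> float:
    """Fraction of experimentally tested hits that were not confirmed as active."""
    if n_tested <= 0 or not 0 <= n_confirmed <= n_tested:
        raise ValueError("need 0 <= n_confirmed <= n_tested and n_tested > 0")
    return 1.0 - n_confirmed / n_tested


# Hypothetical weights for three binary interaction features.
weights = {"hbond_key_residue": 2.1, "water_bridge": 0.8, "hydrophobic_shelf": 1.2}
bias = -2.5
pose = {"hbond_key_residue": 1.0, "hydrophobic_shelf": 1.0}
print(f"P(active) = {logistic_probability(pose, weights, bias):.2f}")

# A 95% false-positive fraction among 453 tested compounds implies roughly
# 453 * 0.05, i.e. about 23, confirmed actives (a back-of-envelope value).
print(round(453 * (1 - 0.95)))
```

Because the model is linear in the features before the logistic transform, the sign and size of each weight can be read directly as the contribution of that interaction, which is one reason such target-specific scores are easy to relate to structure-activity relationships.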

In a follow-up study, Jiang et al. applied the BRD4LGR scoring model in a virtual screen of an in-house library containing 887 FDA-approved drugs (Jiang et al., 2017). Docking-based virtual screening coupled with similarity-based analog searching led to the discovery of nitroxoline (compound 87 in **Figure 4**), a drug previously used to treat urinary tract infections, as a potent and novel BET inhibitor. The successful application of BRD4LGR suggested the potential use of nitroxoline in the treatment of BET family-related diseases.

#### Quantum Mechanical Calculations

Quantum mechanical (QM) calculations are commonly used to understand noncovalent interactions, such as cation-π and hydrogen bond interactions. To explain the different affinities of 1,5-naphthyridine derivatives, Mirguet et al. carried out in vacuo QM calculations on the bound conformations of several derivatives in their complexes with BRD2 (Mirguet et al., 2014). The results showed that differences in internal geometric energy might account for the differences in relative bioactivity.

In addition, quantum mechanical calculations can be combined with other computational studies in epi-probe discovery. In 2014, Rooney et al. identified two weakly active CREBBP bromodomain inhibitors by in silico screening and biochemical assays (Rooney et al., 2014). Further structure-based chemical modifications led to compound (R)-1 (compound 88 in **Figure 4**) with an IC<sub>50</sub> value of 758 nM. The complex structure of (R)-1 with the CREBBP bromodomain revealed an induced-fit pocket that does not exist in the apo form, in which (R)-1 forms a cation-π interaction with R1173 that stabilizes the conformation. To rationalize the importance of this cation-π interaction, the authors ran an MD simulation in which the interaction was observed for 40% of the trajectory time. The strength of the cation-π interaction was then estimated by DFT calculations to be 3.2–4.7 kcal mol<sup>−1</sup>, in accordance with experimentally measured average strengths for interactions involving lysine or arginine. DFT calculations were also applied to confirm the significance of an intramolecular hydrogen bond for the ligand conformation, an approach that is applicable to other studies as well.
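An occupancy such as the 40% quoted above is usually obtained by counting the trajectory frames in which a geometric criterion for the interaction is met. The sketch below uses a single distance cutoff between the cationic group and the aromatic ring centroid; the 6.0 Å cutoff and the per-frame distances are assumptions for illustration, and Rooney et al. may have used a different criterion.

```python
from typing import Sequence


def contact_occupancy(distances_angstrom: Sequence[float], cutoff: float = 6.0) -> float:
    """Fraction of frames in which the distance is within the cutoff.

    Assumption: a single cation-to-ring-centroid distance per frame and a
    6.0 A cutoff; angular criteria are ignored in this sketch.
    """
    if not distances_angstrom:
        raise ValueError("no frames supplied")
    in_contact = sum(1 for d in distances_angstrom if d <= cutoff)
    return in_contact / len(distances_angstrom)


# Hypothetical distances (in angstroms) for ten trajectory frames.
frames = [4.2, 5.1, 7.8, 8.3, 5.6, 9.0, 7.1, 4.9, 8.8, 7.4]
print(f"occupancy = {contact_occupancy(frames):.0%}")
```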

# PROTEIN-PROTEIN INTERACTION

Epigenetic enzymes from the same protein subfamily often share similar catalytic core pockets and cofactors, which makes it difficult to discover and design selective inhibitors. A growing body of evidence suggests that a variety of protein–protein interactions (PPIs) are indispensable for the integrity and oncogenic function of epigenetic enzymes. These PPIs therefore appear to be alternative drug targets for modulating chromatin state in epigenetic drug discovery. Because PPI interfaces are typically large and flat and lack well-defined pockets, it remains challenging to find small-molecule inhibitors targeting the epigenetic interactome (Wells and McClendon, 2007). However, with the availability of high-resolution protein complex structures, advanced computational tools, and a renewed understanding of PPI mechanisms, great progress has been made in the development of small-molecule inhibitors (Scott et al., 2016). Here, we focus on the application of CADD methods, including structure-based virtual screening, scaffold hopping, structure-based pharmacophore modeling, and ligand-based pharmacophore profiling, in the discovery and design of small-molecule inhibitors targeting important epigenetic PPIs, including EZH2-EED, WDR5-MLL1, and Menin–MLL1 (**Figure 5**).

# EZH2-EED

Polycomb repressive complex 2 (PRC2) specifically trimethylates lysine 27 of histone H3, which is one of the cardinal marks of transcriptional repression (Simon and Kingston, 2009). Enhancer of zeste homolog 2 (EZH2) is the catalytic subunit of PRC2 and requires two additional subunits, embryonic ectoderm development (EED) and suppressor of zeste 12 (SUZ12), for full functional activity (Czermin, 2002; Cao and Zhang, 2004). Aberrant PRC2 activity has been reported in the initiation and progression of a wide range of cancers (Chang and Hung, 2012). Thus, targeting PRC2 complex formation represents a distinctive strategy for chemical intervention.

Drug repositioning is an increasingly attractive strategy, widely applied in biopharmaceutical companies, for identifying alternative therapeutic indications for approved drugs (Ashburn and Thor, 2004). In 2014, in pursuit of EZH2-EED inhibitors, Kong et al. used structure-based virtual screening to enrich hits from an in-house compound library containing ca. 1,000 existing drugs (Kong et al., 2014). The standard-precision and extra-precision modes in GLIDE version 5.5 were employed in succession for docking-based virtual screening, leading to the identification of astemizole (compound 89 in **Figure 5**), an FDA-approved antihistamine, as a moderate EZH2-EED inhibitor with a K<sub>i</sub> value of 23.0 µM. Further biophysical assays and cellular studies demonstrated the competitive mechanism of action of astemizole and its inhibition of intracellular PRC2 activity.

#### WDR5-MLL1

Mixed lineage leukemia 1 (MLL1) is the histone methyltransferase responsible for H3K4 methylation. MLL1 interacts with many partner proteins, including WD repeat-containing protein 5 (WDR5), a common subunit that is essential for the integrity of the catalytic core complex (Dou et al., 2006). Therapeutically targeting the WDR5-MLL1 interaction with peptidomimetic inhibitors has been demonstrated to be a promising strategy against MLL fusion-mediated acute leukemogenesis (Karatas et al., 2013).

In 2016, Getlik and co-workers designed a focused library in silico, guided by crystal structure information and initial SAR exploration of a previously identified benzamide scaffold (Getlik et al., 2016). An exhaustive virtual enumeration was performed in Pipeline Pilot to search all accessible building blocks containing benzamide moieties. Compounds with poor physicochemical properties were removed with OICR HTS filters. About 1,200 acyl halides and 9,000 acids/esters were enumerated and used for medium-throughput virtual screening. Subsequently, molecular docking was performed in GLIDE with one H-bond constraint to the side chain of S91 in WDR5. Taking into account docking score, binding pose, structural complexity, and synthetic difficulty, 50 representative compounds were selected by visual inspection and prioritized as candidates for synthesis and verification. Finally, the 4-(trifluoromethyl)pyridin-2(1H)-one moiety was discovered to be a better alternative to the benzamide moiety. Among the derivatives, the optimized antagonist 16d (compound 90 in **Figure 5**) was the most potent inhibitor of WDR5-MLL1, with a K<sub>disp</sub> value of 60 nM, which offers novel therapeutic options for leukemia harboring MLL fusion proteins.

#### Menin-MLL1

The oncoprotein MLL1 can directly associate with the cofactor Menin through its N-terminal 43 amino acids, which include two Menin-binding motifs (MBMs), MBM1 (K<sub>d</sub> = 53 ± 4.2 nM) and MBM2 (K<sub>d</sub> = 1.4 ± 0.42 µM) (Grembecka et al., 2010). The Menin-MLL1 interaction is required for the oncogenic function of MLL fusion proteins and contributes to related leukemia pathogenesis (Yokoyama and Cleary, 2008; Huang et al., 2012). Thus, the Menin-MLL1 PPI interface has been spotlighted as a potential target for epi-drug development against MLL-mediated leukemia.
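The two dissociation constants can be put on an energy scale with the standard relation ΔG = RT ln(K<sub>d</sub>/1 M). The sketch below assumes a temperature of 298.15 K, which is not stated in the text, and gives roughly −9.9 kcal mol<sup>−1</sup> for MBM1 and −8.0 kcal mol<sup>−1</sup> for MBM2 under that assumption.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy dG = RT ln(Kd / 1 M) in kcal/mol.

    Assumption: 298.15 K unless another temperature is supplied.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


dg_mbm1 = binding_free_energy(53e-9)   # MBM1, Kd = 53 nM
dg_mbm2 = binding_free_energy(1.4e-6)  # MBM2, Kd = 1.4 uM
print(f"MBM1: {dg_mbm1:.1f} kcal/mol, MBM2: {dg_mbm2:.1f} kcal/mol")
print(f"difference: {dg_mbm2 - dg_mbm1:.1f} kcal/mol")
```

The roughly 26-fold difference in K<sub>d</sub> corresponds to about 2 kcal mol<sup>−1</sup>, which gives a sense of how much binding energy a small molecule must recover to compete with the higher-affinity motif.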

In 2014, Li et al. employed structure-based pharmacophore modeling of the Menin–MLL1 interface based on the interaction patterns in the Menin–MBM1 complex structure (PDB ID: 4GQ6) (Shi et al., 2012; Li et al., 2014). The 10 best pharmacophore models were generated in Discovery Studio 3.0, considering HBD, HBA, and hydrophobic features. Based on the fitness scores of the generated models, excluded volumes, and hot spot analysis, one pharmacophore model with two hydrophobic groups and a hydrogen bond acceptor was selected as the query for follow-up virtual screening. An in-house library comprising 900 existing drugs was then built and queried with this pharmacophore model, and 29 compounds were finally selected for biochemical verification. Among them, two aminoglycoside antibiotics, neomycin and tobramycin (compounds 91–92 in **Figure 5**), were identified as Menin–MLL1 inhibitors in a fluorescence polarization competition assay, with binding affinities of 18.8 and 59.9 µM, respectively. Thermal shift assays and isothermal titration calorimetry validated the direct interactions between the two antibiotics and Menin. Molecular docking analysis indicated that these antibiotics competitively occupy the MLL1 binding site in the central cavity of Menin.
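A three-feature query of the kind described (two hydrophobic groups and one hydrogen bond acceptor) can be illustrated with a simple distance-based match. The sketch below is not the Discovery Studio algorithm: it ignores excluded volumes, chirality, and conformational sampling, and all coordinates, feature labels, and the 1.0 Å tolerance are hypothetical.

```python
from itertools import permutations
from math import dist
from typing import List, Tuple

Feature = Tuple[str, Tuple[float, float, float]]  # (feature type, xyz position)


def matches_pharmacophore(
    query: List[Feature], ligand: List[Feature], tolerance: float = 1.0
) -> bool:
    """True if some assignment of ligand features reproduces every pairwise
    distance of the query within the tolerance, with matching feature types."""
    k = len(query)
    for combo in permutations(ligand, k):
        if any(q[0] != lig[0] for q, lig in zip(query, combo)):
            continue  # feature types do not line up for this assignment
        if all(
            abs(dist(query[i][1], query[j][1]) - dist(combo[i][1], combo[j][1]))
            <= tolerance
            for i in range(k)
            for j in range(i + 1, k)
        ):
            return True
    return False


# Hypothetical query: two hydrophobic groups (HYD) and one acceptor (HBA).
query = [("HYD", (0.0, 0.0, 0.0)), ("HYD", (5.0, 0.0, 0.0)), ("HBA", (2.5, 4.0, 0.0))]
hit = [("HBD", (9.0, 9.0, 9.0)), ("HYD", (1.0, 1.0, 1.0)),
       ("HBA", (3.5, 5.2, 1.0)), ("HYD", (6.3, 1.0, 1.0))]
miss = [("HYD", (0.0, 0.0, 0.0)), ("HYD", (5.0, 0.0, 0.0)), ("HBA", (2.5, 9.0, 0.0))]
print(matches_pharmacophore(query, hit), matches_pharmacophore(query, miss))
```

Matching on inter-feature distances makes the test independent of how the ligand is positioned in space, at the cost of not distinguishing mirror-image arrangements.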

In 2016, Xu and co-workers combined structure-based molecular docking and ligand-based pharmacophore modeling to obtain Menin-MLL1 inhibitors (Xu et al., 2016). To establish the ligand data set, 74 previously reported inhibitors, classified into three categories, were collected, and 5,000 decoy compounds were generated with DecoyFinder from the 10 most potent compounds (Cereto-Massagué et al., 2012). On the structure-based side, molecular docking under various constraint conditions was performed in GLIDE. According to the Glide scores and enrichment factor (EF) values, unconstrained SP docking performed best at distinguishing known inhibitors from decoys and was therefore the most appropriate protocol for SBVS of Menin-MLL1 inhibitors. On the ligand-based side, pharmacophore models with 4–6 features (HBA, HBD, hydrophobic group, aromatic ring, and positively or negatively charged group) were generated from the collected inhibitors with pIC<sub>50</sub> > 5.0. 3D-QSAR models were then developed from these pharmacophore models through partial least-squares (PLS) regression analysis. Through the joint LBVS and SBVS strategy, five compounds with novel scaffolds were identified as Menin-MLL1 inhibitors and validated by fluorescence polarization assay. Among them, DCZ\_M123 (compound 93 in **Figure 5**) showed the most potent inhibitory activity in vitro, with an IC<sub>50</sub> value of 4.7 µM, and effectively inhibited the growth of MLL leukemia cells by impairing the Menin-MLL1 interaction in cell-based assays.
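The enrichment factor used to compare docking protocols measures how much more concentrated the known actives are in the top of a ranked list than in the whole data set. The sketch below uses the common definition with a toy ranking; the list is hypothetical and unrelated to the data set of Xu et al.

```python
from typing import Sequence


def enrichment_factor(ranked_is_active: Sequence[bool], n_top: int) -> float:
    """EF = (actives in top n / n) / (total actives / total compounds).

    ranked_is_active must be ordered from best to worst score.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if not 0 < n_top <= n_total or n_actives == 0:
        raise ValueError("need 0 < n_top <= len(list) and at least one active")
    hits_in_top = sum(ranked_is_active[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)


# Toy ranking: 20 compounds, 3 known actives, all within the top 4 positions.
ranking = [True, True, False, True] + [False] * 16
print(f"EF(top 4) = {enrichment_factor(ranking, 4):.1f}")
```

For a set of 74 inhibitors and 5,000 decoys, the best achievable EF for any top slice of at most 74 compounds is 5,074 / 74, or about 68.6, so reported EF values are best judged against that ceiling.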

Scaffold hopping was proposed as a promising strategy for finding novel molecular entities with similar three-dimensional conformations and properties (Schneider et al., 1999). As a shape-based three-dimensional structure superposition method, it has been used extensively to generate alternatives to known compounds through bioisosteric replacement of the core motif (Sun et al., 2012; Lamberth, 2017). In 2016, Yue et al. applied a shape-based scaffold hopping approach to reposition approved drugs against the Menin-MLL1 interaction (Yue et al., 2016). In this study, the reported bioactive conformations of the representative Menin-MLL1 inhibitors MI-2-2 and MIV-6R (PDB codes 4GQ4 and 4GO8, respectively) were used as queries (Shi et al., 2012; He et al., 2014). An in-house library comprising ∼1,600 existing drugs was aligned onto the queries for 3D similarity searching with SHAFTS (Liu et al., 2011; Lu et al., 2011). The 12 top-ranked compounds with SHAFTS similarity scores >1.2 (maximum 2.0) were selected for primary validation, which showed that loperamide, previously used as an antidiarrheal agent, was a weak inhibitor with an IC<sub>50</sub> value of 69 µM. Further molecular docking analysis and medicinal chemistry optimization led to more potent loperamide-derived analogs. Among them, DC\_YM21 (compound 94 in **Figure 5**) showed nanomolar inhibitory activity of the same order of magnitude as the reported inhibitor MI-2-2.

# FUTURE PERSPECTIVES

Computational methods are indispensable and credible tools in both academia and industry, and they clearly streamline the epi-drug and epi-probe discovery process. This review has focused on the state of the art of CADD methods in epi-drug design and discovery over the past decades. As described above, in silico approaches have enabled tremendous progress in epigenetic drug discovery, which paints a positive picture of the field. However, it is widely accepted that these hit-finding methodologies are far from perfect and are not applicable in all situations. Formidable challenges still limit the effective application of current computational methods. Firstly, current docking scoring functions rank compound collections with inherently poor accuracy for novel targets whose functions have only recently been unraveled (Sable and Jois, 2015). Secondly, traditional docking algorithms do not fully account for complicating factors such as protein flexibility, solvation, entropy, and the dynamic inclusion of water molecules (Clark, 2008; Lavecchia and Di Giovanni, 2013). It is therefore difficult to predict absolute binding energies for ligand-protein interactions precisely with current methodologies. Protein flexibility has been reviewed in detail elsewhere (Barril and Fradera, 2006), but the computational methods that account for it remain time-consuming and need further improvement. Thirdly, although epigenetic enzymes have been actively pursued as drug targets, there is still a conspicuous lack of potent chemical probes for many difficult targets, such as HATs and epigenetic protein-protein interactions, which need further exploration.
For these less well-studied epi-targets, few inhibitors have been reported and their scaffold diversity is limited, which hinders ligand-based drug design and development. For instance, PRMT5-MEP50 complex formation enhances the stability and activity of PRMT5, and this PPI is essential for cancer cell invasion in lung cancer and breast cancer (Chen et al., 2017). The structure of the hetero-octameric PRMT5-MEP50 complex has been solved, which enables structure-based drug design. Nonetheless, no chemical probes have been reported for such novel targets. Fourthly, the bioactivities of identified inhibitors vary considerably because of the different assay platforms used in different labs. Some of the reported inhibitors are pan-assay interference compounds and show non-specific interactions that have not been carefully examined (Dahlin et al., 2017). Overinterpretation of such results produces misleading conclusions and leads drug discovery into dead ends. Taken together, many problems remain unsolved, which should encourage researchers to devote more effort to filling the gaps in this field.

To tackle these issues, SBVS and LBVS approaches should be applied in parallel in virtual screening campaigns so that each offsets the limitations of the other. For novel targets with few reported inhibitors, computational methods should be applied in synergy with experimental approaches. Multidisciplinary efforts should be devoted to generating more diverse machine learning datasets for building target-customized scoring functions, which in turn will help to exploit the chemical space available in databases as thoroughly as possible. Meanwhile, researchers should carefully examine biological data before interpreting the results, which calls for reliable experimental platforms that standardize current biochemical assays. With the rapid development of computational power and methodologies, more epi-drugs and epi-probes can be expected in the near future, which will not only help to uncover the elusive role of each node in the epigenetic regulatory network but also guide the choice of therapeutic options in the treatment of epigenetic-related diseases.

### AUTHOR CONTRIBUTIONS

WL, RZ, HJ, HZ, and CL wrote the manuscript. WL, RZ, and CL organized and revised the manuscript. All authors were involved in the preparation of the manuscript and approved the final version.

## ACKNOWLEDGMENTS

We thank the referees for their comments on this review and apologize to those whose important work on this topic has not been cited herein because of the main focus of this review article. We gratefully acknowledge financial support from the National Natural Science Foundation of China (81625022, 21472208, and 81430084 to CL).

### REFERENCES

identification of a selective small molecule inhibitor. Chem. Biol. 17, 471–482. doi: 10.1016/j.chembiol.2010.03.006

are mono- and dimethyllysine binding modules. Nat. Struct. Mol. Biol. 15, 245–250. doi: 10.1038/nsmb.1384

chemotherapy against neurological disorders. Front. Mol. Neurosci. 10:357. doi: 10.3389/fnmol.2017.00357

identified by QSAR modeling of known inhibitors, virtual screening, and experimental validation. J. Chem. Inf. Model. 49, 461–476. doi: 10.1021/ci800366f


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lu, Zhang, Jiang, Zhang and Luo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Combined in Vitro and in Silico Studies for the Anticholinesterase Activity and Pharmacokinetics of Coumarinyl Thiazoles and Oxadiazoles

#### Edited by:

*Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria*

#### Reviewed by:

*Sobia Ahsan Halim, Kinnaird College for Women University, Pakistan; Suresh Reddy Chidipudi, Spiro Organics Private Limited, India*

#### \*Correspondence:

*Ajmal Khan, ajmalchemist@yahoo.com; Imtiaz Khan, kimtiaz@hotmail.co.uk; Ahmed Al-Harrasi, aharrasi@unizwa.edu.om*

#### Specialty section:

*This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *03 December 2017* Accepted: *26 February 2018* Published: *26 March 2018*

#### Citation:

*Ibrar A, Khan A, Ali M, Sarwar R, Mehsud S, Farooq U, Halimi SMA, Khan I and Al-Harrasi A (2018) Combined in Vitro and in Silico Studies for the Anticholinesterase Activity and Pharmacokinetics of Coumarinyl Thiazoles and Oxadiazoles. Front. Chem. 6:61. doi: 10.3389/fchem.2018.00061*

Aliya Ibrar<sup>1</sup>, Ajmal Khan<sup>2,3</sup>\*, Majid Ali<sup>2</sup>, Rizwana Sarwar<sup>2</sup>, Saifullah Mehsud<sup>1</sup>, Umar Farooq<sup>2</sup>, Syed M. A. Halimi<sup>4</sup>, Imtiaz Khan<sup>5,6</sup>\* and Ahmed Al-Harrasi<sup>3</sup>\*

*<sup>1</sup> Department of Chemistry, Abbottabad University of Science & Technology, Havelian, Pakistan, <sup>2</sup> Department of Chemistry, COMSATS Institute of Information Technology, Abbottabad, Pakistan, <sup>3</sup> UoN Chair of Oman's Medicinal Plants and Marine Natural Products, University of Nizwa, Nizwa, Oman, <sup>4</sup> Department of Pharmacy, University of Peshawar, Peshawar, Pakistan, <sup>5</sup> Department of Chemistry, Quaid-i-Azam University, Islamabad, Pakistan, <sup>6</sup> School of Chemistry, Cardiff University, Cardiff, United Kingdom*

In a continuation of our previous work on the exploration of novel enzyme inhibitors, two new series of coumarin-thiazole 6(a–o) and coumarin-oxadiazole 11(a–h) hybrids have been designed and synthesized. All the compounds were characterized by <sup>1</sup>H- and <sup>13</sup>C-NMR spectroscopy and elemental analysis. The new hybrid analogs were evaluated against acetylcholinesterase (AChE) and butyrylcholinesterase (BuChE) to assess their potential for the prevention of Alzheimer's disease (AD). In the coumarinyl thiazole series, compound 6b was the most active member against AChE, with an IC<sub>50</sub> value of 0.87 ± 0.09 µM, while compound 6j was the most active against BuChE, with an IC<sub>50</sub> value of 11.01 ± 3.37 µM. In the coumarinyl oxadiazole series, 11a turned out to be the lead candidate against AChE, with an IC<sub>50</sub> value of 6.07 ± 0.23 µM, whereas compound 11e was significantly active against BuChE, with an IC<sub>50</sub> value of 0.15 ± 0.09 µM. To understand the binding interactions of these compounds with AChE and BuChE, molecular docking studies were performed. Compounds from the coumarinyl thiazole series with potent AChE activity (6b, 6h, 6i, and 6k) were found to interact with the active site of AChE with MOE scores of −10.19, −9.97, −9.68, and −11.03 kcal mol<sup>−1</sup>, respectively. The major interactions include hydrogen bonding, π-π stacking with aromatic residues, and water-bridged interactions. The docking studies of the coumarinyl oxadiazole derivatives 11(a–h) suggested that the compounds with high anti-butyrylcholinesterase activity (11e, 11a, and 11b) gave MOE scores of −9.9, −7.4, and −8.2 kcal mol<sup>−1</sup>, respectively, in the active site of BuChE, forming π-π stacking with Trp82 and water-bridged interactions.

Keywords: coumarin thiazoles, coumarin oxadiazoles, cholinesterase inhibition, molecular docking, MOE score

# INTRODUCTION

Alzheimer's disease (AD), the most common cause of dementia, is a neurodegenerative disorder mainly characterized by progressive deterioration of memory and cognition (Terry and Buccafusco, 2003). One of the key, primarily symptomatic, therapeutic strategies for AD is based on the cholinergic hypothesis and targets the cholinesterase enzymes (acetylcholinesterase and butyrylcholinesterase; Cummings et al., 2007), two important enzymes from the group of serine hydrolases. Structurally, these serine hydrolases belong to the class of proteins known as the esterase/lipase family within the α/β-hydrolase fold superfamily (Cygler et al., 1993). The major role of AChE is the hydrolysis of acetylcholine (ACh) in cholinergic synapses. Blocking its metabolic activity therefore increases the ACh concentration, which offers a possible symptomatic treatment option for AD. The functional activity of butyrylcholinesterase (BuChE) is less well understood because it can hydrolyze ACh as well as other esters (Groner et al., 2007; Chiou et al., 2009). Butyrylcholinesterase has recently been considered as a potential target because it also plays an important role in regulating the ACh level (Mesulam et al., 2002). The AChE inhibitors currently approved as drugs for the treatment of Alzheimer's disease are donepezil, rivastigmine, galantamine, and tacrine (**Figure 1**). Although donepezil is the most commonly used AChE inhibitor, its inhibition of Aβ formation is weak (Bartolini et al., 2003). In view of the limited number of cholinesterase inhibitors currently available for the treatment of AD, the search for new and potent inhibitors is of significant interest and an active area of current research.

Among oxygenated heterocycles, coumarin compounds have sustained efficacy, as they inhibit both acetyl- and butyrylcholinesterase enzymes and help to slow down the formation of amyloid compounds (de Souza et al., 2016). Coumarins, both natural and synthetic, demonstrate a wide spectrum of biological functions because the benzopyran ring tolerates a wide range of structural changes. Activities such as antitubercular (Manvar et al., 2011), anti-tumor (Maddi et al., 2007), anti-HIV (Kashman et al., 1992), anti-inflammatory (Ronad et al., 2010), anti-cancer (Olmedo et al., 2012), and anticoagulant (Martin-Aragón et al., 2001) have been reported. In addition, thiazole and oxadiazole skeletons are fundamentally important and versatile five-membered heterocyclic scaffolds. They show a wide range of biological activities (Klimesová et al., 2004; Hang and Honek, 2005; Campiglia et al., 2009; Siddiqui et al., 2009; Jaishree et al., 2012; Romagnoli et al., 2012; Helal et al., 2013; Naveena et al., 2013; Venugopala et al., 2013; Yavari et al., 2014), in addition to being part of numerous complex natural products such as vitamin B1, penicillin (Shaker, 2006), and thiamine pyrophosphate, an important co-enzyme.

In the present study, two new coumarin-thiazole **6(a–o)** and coumarin-oxadiazole **11(a–h)** hybrids were synthesized and evaluated for their acetylcholinesterase (AChE) and butyrylcholinesterase (BuChE) inhibitory activity. Furthermore, the molecular docking studies on both series were also performed to explore their binding interactions.

## RESULTS AND DISCUSSION

#### Chemistry

Two series of coumarinyl thiazoles **6(a–o)** and oxadiazoles **11(a–h)** were prepared with the aim of identifying new and potent inhibitors of acetylcholinesterase and butyrylcholinesterase. Coumarinyl thiazole derivatives **6(a–o)** were accessed through a multi-component reaction approach which starts with the preparation of 3-(2-bromoacetyl)-2H-chromen-2-one **(3)** via base-catalyzed condensation of readily available starting materials (salicylaldehyde and ethyl acetoacetate) followed by bromination (**Scheme 1**; Ibrar et al., 2016). An acid-catalyzed one-pot reaction of intermediate **3**, different substituted acetophenones **(4)**, and thiosemicarbazide **(5)** provided the title compounds **6(a–o)** in good yields (Ibrar et al., 2016).

In a second series, coumarinyl oxadiazole-2(3H)-thione conjugates **11(a–h)**, the central intermediate 3-(5-thioxo-4,5-dihydro-1,3,4-oxadiazol-2-yl)-2H-chromen-2-one **(8)** was prepared by the reaction of coumarinyl hydrazide **(7)** with carbon disulfide in ethanolic solution of KOH in good yield (Pattan et al., 2009). A one-pot reaction of compound **8**, paraformaldehyde **(9)** and different (aliphatic and aromatic) amines **(10)** gave the desired compounds **11(a–h)** in good yields (**Scheme 2**). The compounds were characterized by various spectroscopic techniques and full spectro-analytical data is described in our recent report (Ibrar et al., 2016).

#### TABLE 1 | Inhibition potency of coumarinyl thiazoles 6(a–o) against AChE and BuChE.


*<sup>a</sup>SEM, standard error of the mean of three experiments.*

TABLE 2 | Inhibition potency of coumarinyl oxadiazoles 11(a–h) against AChE and BuChE.



*<sup>a</sup>SEM, standard error of the mean of three experiments.*

# Pharmacology

The target compounds, coumarinyl thiazoles **6(a–o)** and coumarinyl oxadiazoles **11(a–h)**, were screened for their inhibitory activity against AChE and BuChE by Ellman's method. All the assays were carried out at the micromolar level using neostigmine and donepezil as standard inhibitors, with IC<sub>50</sub> values of 28.2 ± 2.01 and 7.23 ± 0.13 µM against AChE, and 16.1 ± 1.13 and 0.03 ± 0.003 µM against BuChE, respectively. The results obtained for both series **6(a–o)** and **11(a–h)** are summarized in **Tables 1**, **2**. The IC<sub>50</sub> values revealed that most of the synthesized compounds displayed potent and selective inhibition toward cholinesterases.

Among them, **6b** of the coumarinyl thiazole series was found to be the strongest AChE inhibitor, with an IC<sub>50</sub> value of 0.87 ± 0.09 µM (**Table 1**, **Figure 2**). This compound inhibited AChE ∼32-fold more strongly than the standard neostigmine, and roughly eight-fold more strongly than the second standard donepezil (IC<sub>50</sub> = 7.23 ± 0.12 µM). The strong inhibitory potential of **6b** could be credited to the electron-donating amine group present at the meta-position of the aryl ring. The introduction of a bromo group at the meta-position (**6d**; IC<sub>50</sub> = 30.06 ± 1.73 µM) gave inhibition comparable to that of neostigmine. A slight decrease in inhibition (IC<sub>50</sub> = 1.08 ± 0.84 µM) was observed for compound **6h**, which has a methoxy group at the para-position, but the inhibition was still 26-fold stronger than that of the standard neostigmine (**Figure 2**). Compounds **6i** and **6k**, which bear a double substitution on the aryl ring, showed IC<sub>50</sub> values of 2.34 ± 1.34 and 5.86 ± 0.15 µM, respectively. These compounds incorporate a combination of different electron-donating and electron-withdrawing groups, which may account for AChE inhibition several-fold stronger than that of neostigmine and, in the case of **6k**, comparable to that of donepezil (**Figure 2**). In the same series **(6a–o)**, a slight decrease in inhibition was observed for compounds **6j**, **6n**, **6o**, and **6l** as compared to the potent analogs, but the inhibition was still stronger than that of neostigmine.

On the other hand, in the coumarinyl oxadiazole series **(11a–h)**, compound **11a** was found to be the most potent AChE inhibitor, with an IC<sub>50</sub> value of 6.07 ± 0.23 µM (**Table 2**, **Figure 3**). This inhibitory potency might be attributed to the aliphatic methyl group substituted on the amine moiety. A slight decrease in inhibition was observed when the methyl group was replaced by another aliphatic (n-Bu) group, as revealed by compound **11b** (IC<sub>50</sub> = 9.41 ± 0.55 µM). When these aliphatic groups were replaced by aromatic substituents, as in **11c–f**, reduced inhibition was observed (**Figure 3**).

Moreover, the oxadiazole compounds with a morpholine substituent (**11g**) and two phenyl groups **(11h)** were also found to be moderate inhibitors of AChE, with two- and three-fold higher inhibition than neostigmine (**Figure 4**). Overall, among the tested compounds, the coumarinyl thiazoles were more potent AChE inhibitors than the coumarinyl oxadiazoles.

All the synthesized analogs were also evaluated for butyrylcholinesterase inhibition, and several compounds were found to possess inhibitory activity higher than that of the standard neostigmine. Among the coumarinyl thiazoles, compound **6j** with dual electron-donating groups was the lead inhibitor, with an IC<sub>50</sub> value of 11.01 ± 3.37 µM. Compounds **6d** and **6h**, with meta-bromo and para-methoxy substituents, were also moderate inhibitors of BuChE (**Figure 5**). The other compounds in the series showed weak inhibition of BuChE.

TABLE 3 | MOE score of highly ranked coumarinyl thiazole and oxadiazole derivatives with the active site of AChE and BuChE.

However, the coumarinyl oxadiazoles were strong inhibitors of BuChE. Compound **11e**, bearing a meta-chloro substituent, inhibited BuChE with an IC<sub>50</sub> value of 0.15 ± 0.09 µM. This compound was about 107-fold more potent than neostigmine. The replacement of the chlorophenyl group with an aliphatic methyl or n-Bu group (**11a**; IC<sub>50</sub> = 0.341 ± 0.06 µM, **11b**; IC<sub>50</sub> = 0.77 ± 0.08 µM) led to a small decrease in inhibition, but the compounds were still several-fold more active than neostigmine (**Figure 6**). The other compounds in the series (**11c**, **11d**, **11f**, **11h**, and **11g**) also showed stronger inhibition than the reference standard.

In general, among the synthesized analogs, the compounds from the coumarinyl thiazole series **(6a–o)** were stronger AChE inhibitors than the two reference drugs neostigmine and donepezil, while the coumarinyl oxadiazoles **(11a–h)** showed stronger inhibition of BuChE than the standard neostigmine. All in all, the target analogs proved to be very potent inhibitors of cholinesterases, and in view of their strong inhibitory potential, these heterocyclic hybrid compounds hold great promise for the development of new agents for AD therapy.

FIGURE 7 | 2D binding pose representation of compound (A) 6b, (B) 6h, (C) 6k, and (D) donepezil with the active site of AChE (4EY7) chain A.
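The fold-potency figures quoted in this section follow directly from ratios of IC<sub>50</sub> values. The short sketch below reproduces that arithmetic for a few of the compounds discussed; the function name `fold_potency` is ours and purely illustrative, and the IC<sub>50</sub> values are the means quoted in the text (uncertainties are ignored).

```python
# IC50 values (µM) as quoted in the text; a lower IC50 means a more potent inhibitor.
NEOSTIGMINE_ACHE = 28.2
NEOSTIGMINE_BUCHE = 16.1
DONEPEZIL_ACHE = 7.23


def fold_potency(reference_ic50: float, compound_ic50: float) -> float:
    """Return how many times more potent a compound is than a reference inhibitor."""
    if reference_ic50 <= 0 or compound_ic50 <= 0:
        raise ValueError("IC50 values must be positive")
    return reference_ic50 / compound_ic50


if __name__ == "__main__":
    print(round(fold_potency(NEOSTIGMINE_ACHE, 0.87), 1))   # 6b vs neostigmine (AChE): 32.4
    print(round(fold_potency(DONEPEZIL_ACHE, 0.87), 1))     # 6b vs donepezil (AChE): 8.3
    print(round(fold_potency(NEOSTIGMINE_ACHE, 1.08), 1))   # 6h vs neostigmine (AChE): 26.1
    print(round(fold_potency(NEOSTIGMINE_BUCHE, 0.15), 1))  # 11e vs neostigmine (BuChE): 107.3
```

Because the ratios are taken between mean values only, they should be read as approximate; the reported standard errors would widen each figure.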

# Molecular Docking Studies

Possible binding modes of the coumarinyl thiazole and oxadiazole derivatives were explored with the MOE (Molecular Operating Environment) software. The molecular docking studies revealed that the thiazoles with high anti-acetylcholinesterase activity (**6b**, **6h**, **6i**, and **6k**) showed better interaction with the active site of AChE (4EY7), with MOE scores of −10.19, −9.97, −9.68, and −11.03 kcal mol<sup>−1</sup>, respectively, whereas the oxadiazoles with high activity against butyrylcholinesterase (**11e**, **11a**, and **11b**) showed better interaction with BuChE (4BDS), with MOE scores of −9.9, −7.4, and −8.2 kcal mol<sup>−1</sup>, respectively, as compared to the reference ligands (neostigmine and donepezil), as shown in **Table 3**. Compound **6b** (IC<sub>50</sub> = 0.87 ± 0.09 µM) showed conventional hydrogen bonding with Glu202 (1.48 Å) and Ser203 (1.69 Å) through the amino group of the aryl ring, π-π stacking with Trp86, Trp286, and Tyr341, and water bridging with Tyr337, Tyr124, and Ser125, as shown in **Figure 7A**. The impact of the amino group on the activity of this compound was already noted in the structure-activity relationship (**Figure 2**). Compound **6h** (IC<sub>50</sub> = 1.08 ± 0.84 µM) showed hydrogen bonding with Tyr72 (2.89 Å), π-π stacking with Trp86, Trp286, Tyr341, and Phe338, and water bridging with Tyr72, Tyr124, and Thr83, as shown in **Figure 7B**. Compound **6i** (IC<sub>50</sub> = 2.34 ± 1.43 µM) showed almost the same interactions as compound **6h**, except for the length of the hydrogen bond with Tyr72 (2.93 Å). Compound **6k** showed a hydrogen bond with Tyr124 (2.21 Å), π-π stacking with Trp86 and Tyr341 (but not with Trp286), and water bridging with Thr83, Ser125, and Asp74, as shown in **Figure 7C**. The reference compound (co-crystallized ligand) donepezil was also docked to compare and confirm our docking results; it makes only π-π stacking with Trp86 and Trp286 along with water bridging with Tyr337 and Tyr341, as shown in **Figure 7D**. In agreement with the in-vitro results, the molecular docking studies suggested that compounds **6b**, **6h**, **6i**, and **6k** interact better than donepezil because of (i) hydrogen bonding, (ii) additional π-π stacking with aromatic residues, and (iii) more interactions through water bridging, as shown in **Figure 7**. The ribbon model of the enzyme active site (4EY7) with a docking comparison of the most active compound **6b** (magenta) and the reference ligand donepezil (yellow) is depicted in **Figure 8a**, whereas the hydrophobic surface and active-site cavity of the enzyme docked with compound **6b** (magenta) are shown in **Figure 8b**.

According to the molecular docking results, the highest ranked anti-butyrylcholinesterase compound **11e** (IC<sub>50</sub> = 0.15 ± 0.09 µM) showed π-π stacking with Trp82 and interaction through water bridging with Asp70 and Ser79, as shown in **Figure 9A**. Compounds **11a** (IC<sub>50</sub> = 0.34 ± 0.06 µM) and **11b** (IC<sub>50</sub> = 0.77 ± 0.08 µM) showed conventional hydrogen bonding with Glu197 (1.5 Å) and Ser198 (3.0 Å), respectively, as shown in **Figures 9B,C**. However, their low-affinity binding poses also showed π-π stacking with Trp82 and water bridging. The reference ligand donepezil showed π-π stacking with Trp82 and interaction through water bridging with Asp70, Ser79, Thr120, Ser287, and Pro285, as shown in **Figure 9D**. A docking comparison of compound **11e** (pink) with the reference ligand donepezil (yellow) in the active site of BuChE (4BDS) is depicted as a ribbon model and a surface model in **Figures 10a,b**, respectively.

TABLE 4 | Pharmacokinetics prediction of top ranked compounds.

#### Pharmacokinetics Prediction

Early in-silico prediction of the ADMET properties of lead molecules is now recognized as an effective tool in the drug discovery and development process. Therefore, Lipinski's criteria and the oral rat LD<sub>50</sub> value were estimated for the top-ranked active compounds using TEST (Toxicity Estimation Software Tool) and the Molinspiration online software. Five top-ranked compounds from the coumarinyl thiazole and oxadiazole series (6b, 6h, 6i, 11a, and 11e) were selected for the analysis, and the results are summarized in **Table 4**. The polar surface area (tPSA) is important for the prediction of blood-brain barrier (BBB) penetration; according to Waterbeemd, the cutoff value is 90 Å<sup>2</sup> or less. All the compounds met this criterion except compound 6b, whose value (93 Å<sup>2</sup>) is slightly above the cutoff. The number of rotatable bonds (nROT) is an additional property that measures the flexibility of the molecule; drugs that are BBB-positive are usually reported to have fewer rotatable bonds. Another extension of the RO5 intended to improve the prediction of drug-likeness is the molar refractivity (MR), which should lie between 40 and 130. The oral rat LD<sub>50</sub> was also predicted, and the compounds were found to be slightly toxic according to the Hodge and Sterner scale. Apart from the tPSA of 6b, all the criteria were fulfilled by the compounds and no Lipinski violation was found, as shown in **Table 4**.
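The screening logic described above can be summarized in a few lines of code. The following is a minimal sketch, assuming the standard Lipinski thresholds (molecular weight ≤ 500, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 acceptors) together with the tPSA cutoff of 90 Å<sup>2</sup> and the MR window of 40–130 mentioned in the text. The class and function names are illustrative, and the descriptor values in the example are placeholders rather than the values of Table 4; only the tPSA of 93 Å<sup>2</sup> is taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Computed descriptors of one compound (values are supplied by the user)."""
    name: str
    mol_weight: float          # g/mol
    logp: float
    h_donors: int
    h_acceptors: int
    tpsa: float                # Å^2
    molar_refractivity: float


def lipinski_violations(d: Descriptors) -> List[str]:
    """List the rule-of-five criteria that a compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def within_tpsa_cutoff(d: Descriptors, cutoff: float = 90.0) -> bool:
    """tPSA criterion for BBB penetration quoted in the text (90 Å^2 or less)."""
    return d.tpsa <= cutoff


def within_mr_range(d: Descriptors, low: float = 40.0, high: float = 130.0) -> bool:
    """Molar refractivity window quoted in the text (40-130)."""
    return low <= d.molar_refractivity <= high


if __name__ == "__main__":
    # Placeholder descriptors for illustration only; not the Table 4 data.
    example = Descriptors("example", 380.0, 3.5, 2, 6, 93.0, 105.0)
    print(lipinski_violations(example))   # no rule-of-five violation
    print(within_tpsa_cutoff(example))    # tPSA of 93 is above the 90 cutoff
    print(within_mr_range(example))
```

Keeping the three checks separate mirrors the text, where a compound can satisfy the RO5 while still falling slightly outside the tPSA cutoff.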

# CONCLUSIONS

In summary, the present report shows that the new hybrid molecules are remarkable inhibitors of the AChE and BuChE enzymes. Compound **6b** from the coumarinyl thiazole series emerged as the most potent inhibitor of AChE, whereas **11e** from the coumarinyl oxadiazole derivatives inhibited BuChE with the highest potency. Both identified inhibitors follow Lipinski's RO5, are predicted to be slightly toxic, and lie near the range required for crossing the blood-brain barrier. In future, these compounds and their functionalized derivatives may be helpful in the development of potent drugs for Alzheimer's disease.

# EXPERIMENTAL

# Synthesis of Coumarinyl Thiazole 6(a–o) and Oxadiazole 11(a–h) Derivatives

The coumarinyl thiazole **6(a–o)** and oxadiazole **11(a–h)** analogs were prepared according to our recently published report (Ibrar et al., 2016).

# Pharmacological Protocols

#### Methodology for Determining AChE and BuChE Inhibitory Activity

For the determination of cholinesterase inhibition, electric eel and horse serum were used as the sources of AChE and BuChE, respectively. Ellman's spectrophotometric method was used, with a slight modification, to determine the AChE and BuChE inhibitory activity (Ellman et al., 1961). The compounds were prepared at 1 mM concentration in DMSO. The assay was carried out in 96-well plates in triplicate. The reaction mixture comprised 20 µL of buffer (Tris-HCl 50 mM, 0.02 M MgCl<sub>2</sub>·6H<sub>2</sub>O, 0.1 mM NaCl) at pH 8, 10 µL of the test compound, and 10 µL of enzyme (acetylcholinesterase or butyrylcholinesterase) at 0.03 U/mL (500 U of AChE and 700 U/mg of BuChE). The contents were incubated for 10 min at 25 °C, followed by the addition of 10 µL of 1 mM substrate (acetylthiocholine iodide for AChE and butyrylthiocholine iodide for BuChE) and a further incubation at 25 °C for 15 min. Then 50 µL of 3 mM DTNB was added as a coloring agent and the mixture was incubated at 25 °C for a further 10 min. The amount of product formed was measured with a microplate reader (Bio-Tek ELx 800, Instruments Inc., Winooski, VT, USA) at 405 nm. The enzyme dilutions were made in buffer of pH 8 (Tris base 50 mM containing 0.1% BSA). The compounds that showed more than 50% inhibition were further tested by making 9–12 serial dilutions in assay buffer, and IC<sub>50</sub> values were calculated with GraphPad Prism.
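To make the data reduction concrete, the sketch below shows one way the absorbance readings of such an assay could be converted into percent inhibition and an approximate IC<sub>50</sub>. It is not the authors' procedure: the IC<sub>50</sub> values in this work were obtained with GraphPad Prism, whereas this example uses a simple log-linear interpolation between the two dilutions that bracket 50% inhibition. All function names are illustrative, and the 100 µL final volume is our sum of the volumes listed above (20 + 10 + 10 + 10 + 50 µL).

```python
import math
from typing import Sequence


def final_concentration(stock: float, added_ul: float, total_ul: float) -> float:
    """Concentration in the well after dilution (same unit as the stock)."""
    return stock * added_ul / total_ul


def percent_inhibition(a_test: float, a_control: float, a_blank: float = 0.0) -> float:
    """Percent inhibition from absorbance, relative to an uninhibited control."""
    return 100.0 * (1.0 - (a_test - a_blank) / (a_control - a_blank))


def ic50_by_interpolation(concs: Sequence[float], inhibitions: Sequence[float]) -> float:
    """Estimate IC50 by linear interpolation on a log10 concentration axis.

    A simplified stand-in for nonlinear regression; it needs two neighboring
    dilutions whose inhibition values bracket 50%.
    """
    points = sorted(zip(concs, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            frac = (50.0 - i_lo) / (i_hi - i_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    raise ValueError("50% inhibition is not bracketed by the data")


if __name__ == "__main__":
    # 10 µL of a 1 mM stock in a 100 µL well (volume sum is our assumption).
    print(final_concentration(1.0, 10.0, 100.0), "mM in the well")
    # Hypothetical absorbance readings, for illustration only.
    print(percent_inhibition(a_test=0.30, a_control=0.60))
    print(ic50_by_interpolation([0.1, 1.0, 10.0], [10.0, 25.0, 75.0]))
```

A full dose-response fit would use all 9–12 dilutions at once and is more robust to noise than interpolation between two points.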

# Molecular Docking

The structures of the compounds were drawn using the ChemBioDraw Ultra 14 suite (PerkinElmer Inc.) and converted into 3D conformations with ChemBio3D (Mills, 2006). Molecular docking studies of the compounds were carried out using the MOE (Molecular Operating Environment) software (Chemical Computing Group, 2008). The structures of the compounds were energy minimized using the MMFF94x force field with a gradient of 0.05. Crystal structures of the enzymes acetylcholinesterase (PDB: 4EY7) and butyrylcholinesterase (PDB: 4BDS) were retrieved from the Protein Data Bank (Berman et al., 2006). The positions of the co-crystallized ligands in the active sites of AChE and BuChE, donepezil (PDB: E20) and tacrine (PDB: THA), were taken as the possible binding sites. The ligand neostigmine was taken from PubChem (CID: 4456) (Kim et al., 2015). The target proteins were prepared by the addition of hydrogen atoms. All other parameters were used with the default settings. Donepezil and neostigmine were taken as reference ligands for comparison purposes. For each ligand, 10 conformations were generated. The 2D images were captured through the MOE ligand binding interaction tool. 3D images were taken using the UCSF Chimera 1.11 software (Pettersen et al., 2004).

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

Generous support from Higher Education Commission of Pakistan (project No. 21- 978/SRGP/R&D/HEC/2016) is gratefully acknowledged.

#### REFERENCES


2, 6-substituted benzo [d] thiazole and 2, 4-substituted benzo [d] thiazole analogues against Anopheles arabiensis. Eur. J. Med. Chem. 65, 295–303. doi: 10.1016/j.ejmech.2013.04.061

Yavari, I., Malekafzali, A., and Seyfi, S. (2014). A synthesis of functionalized 2-imino-1, 3-thiazoles from tetramethylguanidine, isothiocyanates, and 2 chloro-1, 3-dicarbonyl compounds. J. Iran. Chem. Soc. 11, 285–288. doi: 10.1007/s13738-013-0299-0

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ibrar, Khan, Ali, Sarwar, Mehsud, Farooq, Halimi, Khan and Al-Harrasi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Discovery of Novel Bovine Viral Diarrhea Inhibitors Using Structure-Based Virtual Screening on the Envelope Protein E2

Mariela Bollini <sup>1</sup> \*, Emilse S. Leal <sup>1</sup> , Natalia S. Adler <sup>2</sup> , María G. Aucar <sup>2</sup> , Gabriela A. Fernández <sup>1</sup> , María J. Pascual <sup>3</sup> , Fernando Merwaiss <sup>3</sup> , Diego E. Alvarez <sup>3</sup> and Claudio N. Cavasotto<sup>2</sup> \*

<sup>1</sup> Laboratorio de Química Medicinal, Centro de Investigaciones en Bionanociencias, Consejo Nacional de Investigaciones Científicas y Técnicas, Ciudad de Buenos Aires, Argentina, <sup>2</sup> Laboratory of Computational Chemistry and Drug Design, Instituto de Investigación en Biomedicina de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas, Partner Institute of the Max Planck Society, Ciudad de Buenos Aires, Argentina, <sup>3</sup> Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, Consejo Nacional de Investigaciones Científicas y Técnicas, San Martín, Argentina

#### Edited by:

Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria

#### Reviewed by:

Rafaela Salgado Ferreira, Universidade Federal de Minas Gerais, Brazil

Chandrabose Selvaraj, United States Department of Health and Human Services, United States

#### \*Correspondence:

Mariela Bollini mariela.bollini@cibion.conicet.gov.ar

Claudio N. Cavasotto cnc@cavasotto-lab.net; ccavasotto@ibioba-mpsp-conicet.gov.ar

orcid.org/0000-0002-1372-0379

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 15 January 2018 Accepted: 08 March 2018 Published: 26 March 2018

#### Citation:

Bollini M, Leal ES, Adler NS, Aucar MG, Fernández GA, Pascual MJ, Merwaiss F, Alvarez DE and Cavasotto CN (2018) Discovery of Novel Bovine Viral Diarrhea Inhibitors Using Structure-Based Virtual Screening on the Envelope Protein E2. Front. Chem. 6:79. doi: 10.3389/fchem.2018.00079

Bovine viral diarrhea virus (BVDV) is a member of the genus Pestivirus within the family Flaviviridae. BVDV causes both acute and persistent infections in cattle, leading to substantial financial losses to the livestock industry each year. The global prevalence of persistent BVDV infection and the lack of a highly effective antiviral therapy have spurred intensive efforts to discover and develop novel anti-BVDV therapies in the pharmaceutical industry. Antiviral targeting of virus envelope proteins is an effective strategy for therapeutic intervention in viral infections. We performed prospective small-molecule high-throughput docking to identify molecules that likely bind to the region delimited by domains I and II of the envelope protein E2 of BVDV. Several structurally different compounds were purchased or synthesized, and assayed for antiviral activity against BVDV. Five of the selected compounds were active, displaying IC<sub>50</sub> values in the low- to mid-micromolar range. For these compounds, the possible binding determinants were characterized by molecular dynamics simulations. A common pattern of interactions between the active molecules and amino acid residues in the binding site of E2 was observed. These findings could offer a better understanding of the interaction of BVDV E2 with these inhibitors, as well as benefit the discovery of novel and more potent BVDV antivirals.

Keywords: BVDV entry inhibitors, structure-based virtual screening, molecular dynamics simulation, envelope protein, molecular docking

# INTRODUCTION

Bovine viral diarrhea virus (BVDV) is a pathogen of cattle that is distributed worldwide. Together with classical swine fever virus (CSFV) and border disease virus (BDV) of sheep, BVDV belongs to the genus Pestivirus of the Flaviviridae family. The pestiviral genome is a positive, single-stranded RNA molecule of about 12.3 kb in length encoding a single polyprotein that is processed into individual viral proteins: N<sup>pro</sup>-C-E<sup>rns</sup>-E1-E2-p7-NS2-NS3-NS4A-NS4B-NS5A-NS5B (Collett et al., 1988). Pestivirus particles consist of a lipid bilayer with the envelope glycoproteins E<sup>rns</sup>, E1, and E2 surrounding the nucleocapsid, which is composed of the capsid protein C and the RNA genome (Callens et al., 2016). BVDV infection results in major economic losses to the livestock industry. The virus is primarily a pathogen of cattle, and the clinical manifestations present as acute infection, fetal infection, or mucosal disease (Lanyon et al., 2014). Based on genetic and antigenic differences, BVDV is segregated into genotypes 1 and 2. For each of these genotypes, cytopathic and non-cytopathic biotypes are distinguished according to the capacity of virus infection to induce cell death in culture (Ridpath, 2003). Non-cytopathic (ncp) BVDV biotypes cause acute infections in adult animals and can be transmitted across the placenta to the fetus. Fetal infection is particularly relevant, as it can lead to congenital malformations and abortion, or to the birth of persistently infected (PI) calves that spread and maintain the disease in cattle populations (Lanyon et al., 2014). Cytopathic (cp) BVDV biotypes arise in PI cattle from recombination events in the infecting ncpBVDV genome, and are associated with the development of fatal mucosal disease (Becher and Tautz, 2011).

Control and prevention of BVDV infection should combine systematic vaccination with detection and culling of persistently infected cattle from herds (Newcomer and Givens, 2013). However, immunization is complicated due to the wide antigenic diversity of the virus, and fails to target the emergence of persistently infected animals (Fulton et al., 2003; Newcomer et al., 2017). Previous studies showed that antivirals directed against the pestivirus polymerase NS5B provide immediate protection from viral challenge (Newcomer et al., 2012), thus prophylactic treatment with antivirals represents an alternative for therapeutic intervention in outbreaks of BVDV.

Computer-aided drug design has become an integral part of drug discovery and development in the pharmaceutical and biotechnology industry, and is nowadays extensively used in lead identification and optimization (Cavasotto and Orry, 2007; Jorgensen, 2009; Spyrakis and Cavasotto, 2015). Virus envelope proteins are attractive targets for the development of antiviral agents, and structure-based drug design has been successfully used to identify small-molecule ligands of envelope proteins that block the entry of flaviviruses (Zhou et al., 2008; Kampmann et al., 2009; Leal et al., 2017). With the aim of finding novel targets for pestivirus drug design, we focused on the in silico identification of antivirals directed against the envelope protein E2 of BVDV. E2 mediates receptor recognition on the cell surface and is required for fusion of the virus and cell membranes after the endocytic uptake of the virus during entry (Ronecker et al., 2008; Wang et al., 2009). In this work, we expand on a structure-based approach to seek hit small molecules that dock into the druggable pocket at the interface between domains I and II of the envelope protein E2 of BVDV (Pascual et al., 2018). Around a million compounds from different chemical libraries were screened in a high-throughput docking (HTD) fashion. This led to the selection of nineteen lead candidates that were either purchased or synthesized, and evaluated in a reporter-based assay for antiviral activity. The likely interaction of the active compounds with the protein E2 was further characterized by molecular dynamics (MD) simulations. The approach presented here led to the identification of five novel compounds with anti-BVDV activity displaying IC<sub>50</sub> values in the low- to mid-micromolar range.

# MATERIALS AND METHODS

# Computational Chemistry

#### Molecular System Preparation

All simulations were based on the crystal structure of the pestivirus envelope glycoprotein E2 from BVDV (PDB 2YQ2) (El Omari et al., 2013). Protein domains were designated from the N- to the C-terminus of E2 as I, II, and III according to the nomenclature used by Li et al. (2013). The molecular system was described in terms of torsional coordinates using the ECEPP/3 force field (Nemethy et al., 1992) as implemented in the ICM program (version 3.7-2c, MolSoft LLC, La Jolla, CA; Abagyan et al., 1994), and prepared in a similar fashion as in earlier works (He et al., 2012; Brand et al., 2013; Leal et al., 2017; Pascual et al., 2018). Hydrogen atoms were added to the receptor structure, followed by local energy minimization. All Asp and Glu residues were assigned a −1 charge, and all Arg and Lys residues were assigned a +1 charge. Histidine tautomers were assigned according to the hydrogen bonding pattern.

#### High-Throughput Docking

As in an earlier work (Pascual et al., 2018), docking was performed within Site I located at the interface of domains I and II of E2. All water molecules and co-factors were deleted. A flexible-ligand:rigid-receptor docking methodology as implemented in ICM was used. The receptor was represented by six potential energy maps, while the docked molecule was considered flexible and subjected to global energy minimization within the field of the receptor using a Monte Carlo protocol (Abagyan et al., 1994; Cavasotto et al., 2006); thus, the intra- and inter-molecular energy of the molecule are minimized. Each molecule was assigned an empirical docking score according to its fit within the binding site (Totrov et al., 2001). Two independent runs of HTD were performed to improve convergence of the global optimization energy, while the best score per molecule was kept.
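The bookkeeping step of keeping the best score per molecule over independent docking runs can be sketched as follows. This is an illustration and not part of the ICM workflow itself: the function names and molecule identifiers are hypothetical, and the code assumes that a lower (more negative) docking score indicates a better fit.

```python
from typing import Dict, List, Tuple


def merge_best_scores(*runs: Dict[str, float]) -> Dict[str, float]:
    """Keep, for every molecule, the best (lowest) score over all runs.

    Assumes that a more negative docking score is better.
    """
    best: Dict[str, float] = {}
    for run in runs:
        for mol_id, score in run.items():
            if mol_id not in best or score < best[mol_id]:
                best[mol_id] = score
    return best


def top_ranked(scores: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n molecules with the best (lowest) scores."""
    return sorted(scores.items(), key=lambda item: item[1])[:n]


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    run_1 = {"mol_A": -28.4, "mol_B": -15.1, "mol_C": -31.0}
    run_2 = {"mol_A": -30.2, "mol_B": -14.7, "mol_D": -22.5}
    merged = merge_best_scores(run_1, run_2)
    print(merged)
    print(top_ranked(merged, 2))
```

Taking the best of several stochastic runs reduces the chance that a good binder is discarded merely because one Monte Carlo search failed to converge.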

#### Small-Molecule Libraries and Filtering

The ZINC (Irwin and Shoichet, 2005) (accessed Nov. 2014), Maybridge (http://www.maybridge.com/), and in-house databases were chosen for HTD. They were first filtered to remove compounds containing inorganic atoms, pan-assay interference compounds (PAINS), and other reactive groups. The complete virtual library was then pre-filtered for properties based on Lipinski's rules (Lipinski et al., 1997). In the end, a total of about one million small molecules were used. The PAINS filter was applied through the online server FAF-Drugs3 (Lagorce et al., 2015).

#### Molecular Dynamics

MD simulations were performed with the GROMACS v5.1 package (Abraham et al., 2015) using the Amber99SB force field (Hornak et al., 2006). The system was solvated with the SPC/E water model in a triclinic box extending 10 Å from the protein, and neutralized by adding sufficient NaCl counterions to reach a concentration of 0.15 M. Bond lengths were constrained using the LINCS algorithm, allowing a 2 fs time step. Long-range electrostatic interactions were taken into account using the particle-mesh Ewald (PME) approach. The non-bonded cut-offs for Coulomb and van der Waals interactions were both 10 Å, and the non-bonded pair list was updated every 25 fs. Energy minimization was conducted with the steepest-descent algorithm until the maximum force decayed to 1,000 kJ mol<sup>−1</sup> nm<sup>−1</sup>. The whole system was then equilibrated by 500 ps of NVT simulation followed by 500 ps of NPT simulation. Temperature was kept constant at 300 K using a modified Berendsen thermostat (Berendsen et al., 1984) with a coupling constant of 0.1 ps. A constant pressure of 1 bar was applied in all directions with a coupling constant of 2.0 ps and a compressibility of 4.5 × 10<sup>−5</sup> bar<sup>−1</sup>.
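As an illustration of how the settings above map onto a GROMACS parameter file, the sketch below converts the quoted values into GROMACS units (nm, ps) and renders them as `.mdp`-style lines for the NVT equilibration stage. It is a sketch under stated assumptions, not the authors' input file: the choice of `V-rescale` for the "modified Berendsen thermostat", the restriction of the constraints to bonds involving hydrogen, and the single `System` coupling group are our assumptions, and the barostat is omitted because the text does not name it.

```python
from typing import Dict, Union

TIMESTEP_FS = 2.0  # time step quoted in the text

Setting = Union[str, int, float]


def n_steps(duration_ps: float, timestep_fs: float = TIMESTEP_FS) -> int:
    """Number of integration steps needed to cover a given duration."""
    return round(duration_ps * 1000.0 / timestep_fs)


def angstrom_to_nm(value: float) -> float:
    """GROMACS expects lengths in nm."""
    return value / 10.0


def nvt_settings() -> Dict[str, Setting]:
    """Settings for the 500 ps NVT equilibration, in GROMACS units."""
    return {
        "integrator": "md",
        "dt": TIMESTEP_FS / 1000.0,        # ps
        "nsteps": n_steps(500.0),          # 500 ps of NVT
        "constraint_algorithm": "lincs",
        "constraints": "h-bonds",          # assumption: text only says bond lengths
        "coulombtype": "PME",
        "rcoulomb": angstrom_to_nm(10.0),  # nm
        "rvdw": angstrom_to_nm(10.0),      # nm
        "tcoupl": "V-rescale",             # assumption for the modified Berendsen thermostat
        "tc-grps": "System",               # assumption: one coupling group
        "tau_t": 0.1,                      # ps
        "ref_t": 300,                      # K
    }


def render_mdp(settings: Dict[str, Setting]) -> str:
    """Render settings as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_mdp(nvt_settings()))
```

The NPT stage would add the pressure-coupling entries (reference pressure 1 bar, coupling constant 2.0 ps, compressibility 4.5 × 10<sup>−5</sup> bar<sup>−1</sup>) once the barostat type is known.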

# Biological Evaluation

#### Cell Culture

MDBK cells (Bos taurus kidney, ATCC CCL-22) were purchased from ATCC and grown in Dulbecco's modified Eagle medium (DMEM) supplemented with 10% fetal bovine serum and antibiotics under 5% CO2 at 37 °C. For infections, cells were cultivated in DMEM supplemented with 2% horse serum and antibiotics under 5% CO2 at 37 °C.

#### Cytotoxicity Assay

Cell viability assays were performed on confluent cell cultures in 96-well plates (∼15,000 cells per well). For each compound, cells were treated with serial dilutions of the compound in quadruplicate and incubated at 37 °C for 3 days. Cell viability was then measured using crystal violet staining. Briefly, cells were fixed with 10% formaldehyde and stained with crystal violet solution (20% ethanol, 0.1% crystal violet); after washing, the absorbance at 595 nm was recorded for each well in a spectrophotometer. Assays were conducted at least in duplicate, and the cytotoxic concentration 50 (CC50) was estimated by nonlinear regression fitting of five data points as the compound concentration necessary to reduce cell viability by 50% compared to control non-treated cells.

#### TABLE 1 | Antiviral activity against BVDV.

<sup>a</sup>IC50: inhibitory concentration 50%. Data represent the mean and standard deviation of at least two independent experiments.

<sup>b</sup>CC50: cytotoxic concentration 50%. Data represent the mean and standard deviation of at least two independent experiments.

<sup>c</sup>Pascual et al. (2018).

NA indicates assayed but not active compounds; ND, not determined.

#### Reporter-Based Assay for Antiviral Activity

Antiviral activity was evaluated in a reporter-based assay using a recombinant virus expressing GFP, cpBVDV/Npro GFP (Pascual et al., 2018). MDBK cells were seeded onto 24-well plates and infected with cpBVDV/Npro GFP at a multiplicity of infection of 0.1 in the presence of increasing concentrations of compounds. At 48 h post-infection, cells were thoroughly washed, lifted with 0.05% trypsin, and fixed using 4% paraformaldehyde in PBS. The fluorescence signal was measured using a flow cytometer (CyFlow® Space, Partec) at a detection spectrum of 488 nm. Data were analyzed in the FlowJo 7.6.2 software package. The inhibitory concentration 50 (IC50) of each compound tested was calculated from curves constructed by plotting the percentage of infected cells versus the compound concentration, and was defined as the compound concentration necessary to reduce the number of infected cells by 50% compared to control non-treated cells.
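A half-maximal concentration (IC50 or CC50) can be estimated from such dose-response data in several ways. The sketch below is one simple possibility and is not necessarily the fitting procedure used here: it assumes a two-parameter Hill model with fixed limits of 100% and 0%, and replaces a full nonlinear regression with a least-squares line fitted to logit-transformed responses. The concentrations and responses are synthetic.

```python
import math


def fit_half_maximal(concs, responses_pct):
    """Estimate the 50% concentration from a two-parameter Hill model.

    Model: response = 100 / (1 + (c / c50) ** h). Taking the logit gives
    ln(r / (100 - r)) = -h * ln(c) + h * ln(c50), a straight line in ln(c).
    """
    xs, ys = [], []
    for c, r in zip(concs, responses_pct):
        if 0 < r < 100:  # the logit is undefined at 0% and 100%
            xs.append(math.log(c))
            ys.append(math.log(r / (100 - r)))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    hill = -slope
    c50 = math.exp(intercept / hill)
    return c50, hill


# Synthetic data generated from the model itself (c50 = 20 µM, h = 1.5).
concs = [2.5, 5, 10, 20, 40, 80]
responses = [100 / (1 + (c / 20) ** 1.5) for c in concs]
ic50, hill = fit_half_maximal(concs, responses)
```

With noisy experimental data, a nonlinear fit with free upper and lower plateaus is generally preferable; the linearized form is shown only because it is compact and dependency-free.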

#### Chemistry General Information

NMR spectra were recorded on Bruker Biospin 600 MHz AVIII600, Bruker Avance II 500 MHz, and Bruker 300 MHz spectrometers at room temperature. Chemical shifts (δ) are reported in ppm and coupling constants (J) in Hertz. Column chromatography was carried out employing Merck silica gel (Kieselgel 60, 63–200 µm). Precoated silica gel plates F-254 were used for thin-layer analytical chromatography. The mass spectrometer utilized was a Xevo G2S QTOF (Waters Corporation, Manchester, UK) with an electrospray ionization (ESI) source. The mass spectrometer was operated in positive and negative ion modes with probe capillary voltages of 2.5 and 2.3 kV, respectively. The purity (≥95%) of all final synthesized compounds was determined by reverse-phase HPLC, using a Waters 2487 dual λ absorbance detector with a Waters 1525 binary pump and a Phenomenex Luna 5µ C18(2) 250 × 4.6 mm column. Samples were run at 1 mL/min using gradient mixtures of 5–100% of water with 0.1% trifluoroacetic acid (TFA) (A) and 10:1 acetonitrile:water with 0.1% TFA (B) for 22 min followed by 3 min at 100% B. UV spectra were measured with a Shimadzu 3600 UV/vis/NIR spectrophotometer.

#### Synthetic Procedures of New Compounds From Our in House Library

#### **Synthesis of (E)-2-(4-(dimethylamino)benzylidene)-N-(4-(trifluoromethyl)phenyl)hydrazinecarbothioamide (11)**

Synthesis of N-(4-trifluoromethoxyphenyl)hydrazinecarbothioamide (**19**): Sodium hydroxide (0.14 g, 3.4 mmol) and carbon disulphide (0.2 mL, 2.8 mmol) were added to a solution of 4-(trifluoromethoxy)aniline **18** (0.50 g, 2.8 mmol) in DMF (5 mL). The mixture was stirred at room temperature for 1 h. Then, hydrazine hydrate (0.5 mL, 8.5 mmol) was added and stirring was continued at 70 °C for 1 h. After addition of water, compound **19** precipitated and the solid was filtered off. The crude product was recrystallized from ethanol:water (0.28 g, 39.1%). <sup>1</sup>H NMR (500 MHz, CDCl3) δ 9.30 (s, 1H), 7.82 (s, 1H), 7.68 (d, J = 7.1 Hz, 2H), 7.24 (d, J = 8.7 Hz, 2H), 4.01 (s, 2H). To a solution of **19** (0.10 g, 0.40 mmol) in ethanol (3 mL) was added 4-dimethylaminobenzaldehyde (0.065 g, 0.44 mmol). The mixture was stirred under reflux for 1 h. The reaction was then cooled to room temperature, and the precipitated solid was filtered and washed with cyclohexane to give **11**, which was recrystallized from ethanol (0.07 g, 43.5%). <sup>1</sup>H NMR (600 MHz, CDCl3) δ 9.37 (s, 1H), 9.20 (s, 1H), 7.79 (s, 1H), 7.76 (dd, J = 8.9, 2.1 Hz, 2H), 7.57 (dd, J = 8.9, 2.1 Hz, 2H), 7.27 (d, J = 8.5 Hz, 2H), 6.72 (dd, J = 8.9, 1.9 Hz, 2H), 3.07 (s, 6H). <sup>13</sup>C NMR (151 MHz, CDCl3) δ 175.2, 152.2, 146.5, 144.2, 144.1, 136.75, 129.0, 125.4, 125.3, 121.3, 121.2, 121.2, 120.1, 119.6, 111.8, 111.7, 40.1, 40.09. HR-MS (ES) calcd for C17H18F3N4OS [M+H]<sup>+</sup> 383.1153, found 383.1141.

#### **4-((5-methylisoxazol-3-yl)amino)-4-oxobutanoic acid (14)**

To a solution of isoxazol-5-amine **20** (0.20 g, 2.0 mmol) in dioxane (5 mL) was added succinic anhydride (0.20 g, 2.0 mmol). The mixture was stirred at 80–90 °C overnight. The solvent was evaporated and the resulting yellowish solid was suspended in water, collected by filtration, and crystallized from ethanol to give the pure product as a white solid (0.088 g, 0.44 mmol, 22%). <sup>1</sup>H NMR (600 MHz, DMSO-d6) δ 12.14 (s, 1H), 10.87 (s, 1H), 6.58 (s, 1H), 2.55 (t, J = 6.5 Hz, 2H), 2.34 (s, 3H). The other methylene group was determined by HSQC, owing to its overlap with the solvent signal (SI). <sup>13</sup>C NMR (151 MHz, DMSO-d6): δ 173.4, 170.3, 169.2, 158.1, 96.2, 30.4, 28.4, 12.1. HRMS (ES) m/z calc. for C8H10N2O4Na [M+Na]<sup>+</sup>: 221.0538; found: 221.0533; C8H11N2O4 [M+H]<sup>+</sup>: 199.0719; found: 199.0714.

SCHEME 1 | Synthesis of compound **11**. Reagents and conditions: (a) CS2, NaOH, DMF, 25 °C, 1 h; NH2NH2, 70 °C, 1 h. (b) 4-dimethylaminobenzaldehyde, EtOH, reflux, 1 h.

#### **4-chloro-N-(isoxazol-5-yl)benzamide (15)**

A mixture of p-chlorobenzoic acid (1.0 g, 6.4 mmol) and an excess of thionyl chloride (4.92 g, 3 mL, 41.7 mmol) was refluxed for 2 h. The excess thionyl chloride was distilled off in vacuo and the acyl chloride was used without further purification. To a solution of p-chlorobenzoyl chloride **23** in MeCN (10 mL) were added Cs2CO3 (3.7 g, 19 mmol) and isoxazol-5-amine **20** (0.63 g, 6.4 mmol) at 0 °C, and the resulting suspension was stirred at r.t. overnight. The reaction mixture was then concentrated in vacuo, and the residue was treated with water and extracted with EtOAc (4 × 20 mL). The organic layers were dried over Na2SO4, filtered, and concentrated in vacuo to give a residue that was purified by silica gel column chromatography eluting with cHex/EtOAc (95:5–70:30). The product was obtained as a white solid (0.146 g, 0.63 mmol, 10%).

<sup>1</sup>H NMR (600 MHz, DMSO-d6) δ 11.39 (s, 1H), 8.02 (d, J = 8.6 Hz, 2H), 7.60 (d, J = 8.6 Hz, 2H), 6.74 (s, 1H), 2.41 (s, 3H). <sup>13</sup>C NMR (151 MHz, DMSO-d6) δ 169.4, 164.2, 158.5, 137.1, 131.9, 129.9, 128.5, 96.9, 12.1. HRMS (ES) m/z calc. for C11H9ClN2O2Na [M+Na]<sup>+</sup>: 259.0250; found: 259.0247; C11H10ClN2O2 [M+H]<sup>+</sup>: 237.0431; found: 237.0428.

#### **N-(4-(trifluoromethoxy)phenyl)furan-2-carboxamide (16)**

Thionyl chloride (0.1 mL, 1.34 mmol) was added dropwise to a mixture of 2-furoic acid **20** (0.15 g, 1.34 mmol) and triethylamine (0.26 mL, 1.82 mmol) in DCM (5 mL) under an N2 atmosphere. The reaction mixture was stirred at room temperature for 5 h. The crude mixture was added to another flask containing 4-(trifluoromethoxy)aniline (0.16 mL, 1.22 mmol) and triethylamine (0.34 mL, 2.43 mmol) in DCM (5 mL). The reaction mixture was stirred at room temperature overnight. After complete reaction, the solvent was removed under reduced pressure, water was added, and the mixture was extracted with dichloromethane. The organic layer was washed with brine, dried over anhydrous Na2SO4, and concentrated in vacuo. The crude product was purified by column chromatography (SiO2, dichloromethane) to give **16** as a white solid (0.25 g, 75.6%). <sup>1</sup>H NMR (600 MHz, DMSO-d6) δ 10.38 (s, 1H), 7.96 (d, J = 0.9 Hz, 1H), 7.88 (dd, J = 9.1, 2.0 Hz, 2H), 7.40–7.32 (m, 3H), 6.72 (dd, J = 3.4, 1.7 Hz, 1H). <sup>13</sup>C NMR (126 MHz, CDCl3) δ 156.0, 147.4, 145.4, 144.3, 136.0, 121.8, 121.5, 121.0, 119.4, 115.6, 112.7. HR-MS (ES) calcd for C12H8F3NO3Na [M+Na]<sup>+</sup> 294.0354, found 294.0345. Synthetic procedures for compounds **5, 12, 13**, and **17** from the ZINC library are described in the Supporting Information (Schemes S1 and S2).
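The calculated HRMS values quoted in the synthetic procedures can be cross-checked with a short script. The sketch below sums standard monoisotopic atomic masses over the ion formula and reports the deviation of a measured value in ppm. It is an illustrative check only: the function names are hypothetical, and the electron mass (about 0.0005 u) is not subtracted, since the calculated values reported above appear to agree with the plain atom sum to four decimals.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146,
    "F": 18.9984032, "Na": 22.9897693, "S": 31.9720707, "Cl": 34.9688527,
}


def monoisotopic_mass(formula):
    """Sum monoisotopic atomic masses for a formula such as 'C17H18F3N4OS'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def ppm_error(found, calculated):
    """Relative deviation of a measured m/z from the calculated one, in ppm."""
    return (found - calculated) / calculated * 1e6


calc_11 = monoisotopic_mass("C17H18F3N4OS")  # [M+H]+ ion formula of 11
calc_14 = monoisotopic_mass("C8H11N2O4")     # [M+H]+ ion formula of 14
error_11 = ppm_error(383.1141, calc_11)      # measured value reported for 11
```

A deviation of a few ppm between the found and calculated values is the usual criterion for accepting an elemental composition from high-resolution data.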

### RESULTS AND DISCUSSION

### Computer-Aided Identification, Chemical Synthesis, and Biological Evaluation of Novel Inhibitors

We employed a multistep HTD screening framework to efficiently identify novel inhibitors of the E2 protein among commercially available compounds (ZINC and Maybridge chemical libraries) and synthetic drug-like compounds (from our in-house library), using the available structural data of the BVDV envelope protein E2. Initially, several chemical filters were applied to the chemical libraries to remove pan-assay interference compounds (PAINS) (Baell and Holloway, 2010), compounds containing inorganic atoms, unwanted functionalities, or reactive groups, and compounds having (i) MW <500 Dalton; (ii) more than one violation of the Lipinski rules (Lipinski et al., 1997); or (iii) more than two violations of the rule of three, or more than six #stars, as computed with the program QikProp (Jorgensen, 2005). The #stars parameter indicates the number of property descriptors computed by QikProp that fall outside the optimum range of values for 95% of known drugs.
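Part of this filtering logic can be sketched as below. The sketch covers only the PAINS/reactive-group flags, the Lipinski violation count, and the #stars cut-off, and it operates on precomputed descriptors. The record layout, field names, and descriptor values are hypothetical, and the Lipinski thresholds are the standard rule-of-five limits rather than values taken from this work.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count rule-of-five violations from precomputed descriptors."""
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


def passes_filters(compound, max_lipinski=1, max_stars=6):
    """Return True if a compound survives the property filters."""
    if compound["is_pains"] or compound["has_reactive_group"]:
        return False
    n_viol = lipinski_violations(
        compound["mw"], compound["logp"], compound["hbd"], compound["hba"]
    )
    return n_viol <= max_lipinski and compound["stars"] <= max_stars


# Hypothetical records; all descriptor values are invented for illustration.
library = [
    {"id": "A", "mw": 350, "logp": 3.1, "hbd": 2, "hba": 5, "stars": 1,
     "is_pains": False, "has_reactive_group": False},
    {"id": "B", "mw": 620, "logp": 6.2, "hbd": 3, "hba": 8, "stars": 3,
     "is_pains": False, "has_reactive_group": False},
    {"id": "C", "mw": 300, "logp": 2.0, "hbd": 1, "hba": 4, "stars": 0,
     "is_pains": True, "has_reactive_group": False},
    {"id": "D", "mw": 520, "logp": 4.0, "hbd": 1, "hba": 6, "stars": 2,
     "is_pains": False, "has_reactive_group": False},
    {"id": "E", "mw": 410, "logp": 3.5, "hbd": 2, "hba": 6, "stars": 8,
     "is_pains": False, "has_reactive_group": False},
]

kept = [c["id"] for c in library if passes_filters(c)]
```

In practice the PAINS flags and the #stars values would come from dedicated tools such as those cited above; here they are simply supplied as inputs.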

The selected molecules were subjected to independent parallel HTD cycles and the top 1,000 scoring compounds were further analyzed. To ensure diversity, these highly ranked compounds were clustered based on chemical similarity using ICM. For each cluster, several compounds were selected manually based on commercial availability, synthetic tractability for potential modifications, interaction with binding site amino acids, and adequate pharmacological characteristics for drug candidates. Finally, 19 compounds were purchased from vendors (**1**–**10**) or synthesized (**11**–**16**), and then evaluated in a reporter-based assay for antiviral activity (**Figure 1**, **Table 1**). Compounds **5** and **17** were obtained via reaction of the corresponding amine and dicyandiamide under acidic conditions to give the required phenylbiguanide (**5**, **17**) in high yields. 2-Guanidinobenzimidazole (**12**) was prepared by the cyclocondensation of o-phenylenediamine with cyanoguanidine (Scheme S1 of Supplementary Material) according to the method reported by King et al. (1948). Synthesis of new compounds is shown in **Scheme 1**. Compound **11** was obtained by condensation of thiosemicarbazide with 4-(dimethylamino)benzaldehyde. Compounds **14–16** were prepared by acylation of the corresponding amine with the appropriate carboxylic acid chloride (**15**, **16**) or by reaction with succinic anhydride (**14**) (see **Scheme 2**).
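Similarity-based clustering of top-ranked compounds can be illustrated with a greedy leader algorithm on Tanimoto similarity. This is a generic sketch and not the clustering implemented in ICM: the fingerprints are hypothetical sets of bit indices, the 0.6 threshold is arbitrary, and lower docking scores are assumed to be better.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints stored as sets of bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def leader_cluster(compounds, threshold=0.6):
    """Greedy leader clustering of (id, score, fingerprint) tuples.

    Compounds are visited from best to worst score; each one joins the first
    cluster whose leader is at least `threshold` similar, else starts a new one.
    """
    clusters = []  # list of lists; the first member of each list is the leader
    for cid, _score, fp in sorted(compounds, key=lambda c: c[1]):
        for cluster in clusters:
            if tanimoto(fp, cluster[0][1]) >= threshold:
                cluster.append((cid, fp))
                break
        else:
            clusters.append([(cid, fp)])
    return [[cid for cid, _ in cluster] for cluster in clusters]


# Hypothetical compounds: (identifier, docking score, fingerprint bits).
hits = [
    ("a", -30.0, {1, 2, 3, 4}),
    ("b", -28.0, {1, 2, 3, 4, 5}),
    ("c", -27.0, {10, 11, 12}),
    ("d", -25.0, {10, 11, 12, 13}),
]
clusters = leader_cluster(hits)
```

Because compounds are processed in score order, the leader of each cluster is also its best-scoring member, which is convenient when picking representatives for purchase or synthesis.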

First, we assayed selected compounds for cytotoxicity in cultured cells. Only compound **6** displayed high toxicity and was discarded from further analysis (**Table 1**). The remainder of the compounds were evaluated for antiviral activity in a reporter-based assay using a recombinant BVDV virus carrying GFP in its genome to infect MDBK cells (Pascual et al., 2018). Expression of GFP induced by BVDV infection was measured 2 days after infection using flow cytometry. Inhibition of BVDV infection was assessed by comparing the number of GFP-positive cells in non-treated control cells and in cells treated with increasing amounts of compound. The structurally different compounds **4**, **8**, **11**, **BI03**, and **PTC12** showed activity with IC50 values of 30.1, 20.2, 23.9, 17.6, and 0.30 µM, respectively, and no cytotoxicity was detected at 50 µM. In accordance with targeting of envelope protein function, we have previously shown that compounds BI03 and PTC12 specifically block BVDV cell entry (Pascual et al., 2018).

# Analysis of Binding Determinants Using Molecular Dynamics

To further characterize the likely interaction between the new molecules and protein E2, we performed 100 ns MD simulations on the most active compounds listed in **Table 1**. The docked poses of the ligands within the binding site between domains I and II were used as the initial conformations. For compound **8**, two conformationally different poses with very similar docking scores were used as starting conformations, and the most probable pose was assigned based on the molecular dynamics simulation results and the analysis of interactions (Liu and Kokubo, 2017). The protein and ligands remained stable in every simulation (Figure S1), with the ligands displaying the following RMSF values: **PTC12**, 0.4 Å; **BI03**, 0.4 Å; **8**, 0.3 Å; **11**, 0.2 Å. The analysis of the binding determinants of the most active compounds is described in the following paragraphs.
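For reference, a root-mean-square fluctuation (RMSF) can be computed from an aligned trajectory as sketched below. How the per-ligand values above were aggregated is not stated, so averaging the per-atom RMSF over all ligand atoms is an assumption of this sketch; the toy coordinates are invented.

```python
import math


def ligand_rmsf(frames):
    """Per-atom RMSF averaged over atoms.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples,
    already aligned to a common reference.
    """
    n_frames, n_atoms = len(frames), len(frames[0])
    per_atom = []
    for i in range(n_atoms):
        # Time-averaged position of atom i.
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation of atom i from its average position.
        msd = sum(
            sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        per_atom.append(math.sqrt(msd))
    return sum(per_atom) / n_atoms


# Toy trajectory: atom 0 is fixed, atom 1 oscillates by ±1 Å along x.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
rmsf = ligand_rmsf(frames)
```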

The predicted binding mode of **PTC12** within the E2 protein is shown in **Figures 2**, **3**. The 3,4-dimethoxybenzamide group remained exposed to the solvent, whereas the thiophene ring made contacts with Asp91, Thr60, and Arg61. The system also presented a strong hydrogen bond between the benzamide group of the ligand and the carbonyl O of Gln89, exhibiting an interatomic distance of ∼2 Å and an angle of 160° during the last 50 ns of the simulation (**Figure 4**). A moderate hydrogen bond between the NH of the thiophenecarboxamide and the carbonyl O of Gln89 was detected, with interatomic distances and angles close to 2.5 Å and 140°, respectively. This group was also intermittently exposed to the solvent through a narrow channel. A stable cation-π interaction between the aromatic ring of the 3,4-dimethoxyphenyl group and Arg154 was observed throughout the simulation, with an N+-ring centroid distance below 6 Å at all times and a favorable θ angle below 40° during half of the last 50 ns of the simulation (Marshall et al., 2009).
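The cation-π criteria used in this analysis (N+-ring centroid distance below 6 Å and θ below 40°) can be evaluated per frame with a few lines of vector geometry. In the sketch below, θ is taken as the angle between the ring normal and the centroid-to-cation vector, which is an assumed definition; the ring normal is estimated from the first two ring atoms, so a planar ring is also assumed. The coordinates are an idealized hexagon, not simulation data.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def cation_pi_geometry(ring_atoms, cation):
    """Return (distance in Å, theta in degrees) for a cation near a ring."""
    n = len(ring_atoms)
    centroid = tuple(sum(a[k] for a in ring_atoms) / n for k in range(3))
    normal = _cross(_sub(ring_atoms[0], centroid), _sub(ring_atoms[1], centroid))
    to_cation = _sub(cation, centroid)
    dist = math.sqrt(_dot(to_cation, to_cation))
    # abs() folds theta into 0-90°, since the normal's sign is arbitrary.
    cos_theta = abs(_dot(normal, to_cation)) / (math.sqrt(_dot(normal, normal)) * dist)
    return dist, math.degrees(math.acos(min(1.0, cos_theta)))


def is_favorable(ring_atoms, cation, max_dist=6.0, max_theta=40.0):
    """Apply the distance and angle cut-offs quoted in the text."""
    dist, theta = cation_pi_geometry(ring_atoms, cation)
    return dist < max_dist and theta < max_theta


# Idealized benzene-like ring in the xy-plane (C-C distance 1.39 Å).
ring = [(1.39 * math.cos(math.radians(60 * i)),
         1.39 * math.sin(math.radians(60 * i)), 0.0) for i in range(6)]
above_ring = is_favorable(ring, (0.0, 0.0, 4.0))   # cation on the ring axis
in_plane = is_favorable(ring, (5.0, 0.0, 0.0))     # cation in the ring plane
```

Applied to every frame of a trajectory, the fraction of frames for which `is_favorable` holds gives a simple measure of how persistent the interaction is.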

The predicted interaction of **BI03** is shown in **Figure 3**. This ligand also presented a strong hydrogen bond between N2 and the backbone amide H of Thr60, with an average interatomic distance of 2.1 Å (**Figure 4**) and an angle of ∼165° during the final half of the simulation. A stable cation-π interaction was also found in this case between the ligand ring and Arg154, again showing distances and θ angles below 6 Å and 40°, respectively, for most of the final 50 ns of the simulation. A moderate hydrogen bond was also formed between the ring and the HO atom of Thr60. The system was further stabilized by close contacts with Thr60, Gln87, Arg154, Val153, and Pro105, while the ligand ring, NH and OH groups were mainly exposed to the solvent.

Compound **11** is shown within its predicted binding site in **Figure 3**. Two stable hydrogen bonds occurred between the HN atoms of the ligand and the carbonyl O atoms of Val153 and Gln87. In both cases, interatomic distances and angles were very favorable, with average values of 2 Å and 155°, respectively. The cation-π interaction between Arg154 and the aromatic ring of the 4-trifluoromethylphenyl group was less favorable than for the other compounds, showing higher N+-ring centroid distances and θ angles, probably due to a moderate interaction between the CF3 group and the charged portion of Arg154. The ligand made contacts with Asp91, Arg61, Arg154, Gln87, Val153, Pro105, and Thr60, while the CF3 and NMe2 groups were mostly exposed to the solvent.

The predicted binding mode of compound **8** is shown in **Figure 3**. This pose was selected as the most probable one based on the analysis of the interactions and binding free energy estimations. In the last half of the simulation, a moderate hydrogen bond between N1 (N with no H) and the side chain of Arg154 was observed, with average interatomic distances of 2.5 Å and angles of ∼140°. No cation-π interaction was detected, and the charged portion of Arg154 seemed to interact strongly with the amide group of the ligand. The 3-chloro-4-fluorobenzamide remained exposed to the solvent and there were close contacts of the ligand with Ser57, Thr60, Gln87, Gln89, Pro105, and Arg154.

Overall, molecular dynamics simulations reveal a common pattern of interactions with the binding site in E2. Taken together with previous studies on the mode of action (Pascual et al., 2018), our data support binding of active compounds to E2. Further studies including in vitro binding to the recombinant protein are still required to confirm the interaction of active compounds with E2.

# CONCLUSIONS

We have undertaken a structure-based virtual screening approach to identify small molecules that dock into the druggable binding site at the interface between domains I and II of the E2 protein of BVDV, a virus responsible for both acute and persistent infections in cattle and for the consequent financial losses to the livestock industry each year. Around a million compounds were screened and, after chemical clustering, the top nineteen lead candidates were selected, either purchased or synthesized, and evaluated in a reporter-based assay for antiviral activity. Five of these compounds exhibited IC50 values in the low micromolar range. The likely binding determinants of these compounds are supported by molecular dynamics simulations, in which a common pattern of interaction with the binding site in E2 could be identified. These findings should benefit the design of novel and improved BVDV antivirals.

# AUTHOR CONTRIBUTIONS

MB and CNC: Designed and supervised the study; NSA, MGA, and CNC: Performed the computational simulations; MB and CNC: The virtual screening; ESL, GAF, and MB: Conducted chemical synthesis; MJP and FM: Performed biological experiments under the supervision of DEA; MB, DEA, and CNC: Analyzed data. All authors were involved in the preparation of the manuscript and approved the final version.

# ACKNOWLEDGMENTS

This work has been supported by the Agencia Nacional de Promoción Científica y Tecnológica, Argentina (PICT 2013-0778 and PICT 2014-1884 to MB; PICT 2011-2778 to CNC; PICT 2014-3599 to CNC and MB), CONICET (PIP 2014 11220130100721 to MB, CNC, and DEA), and FOCEM-Mercosur (COF 03/11). MB thanks William Jorgensen for providing an academic license for QikProp software. CNC thanks Molsoft LLC for providing an academic license for the ICM program. The authors thank the National System of High Performance Computing (Sistemas Nacionales de Computación de Alto Rendimiento, SNCAD) and the Computational Centre of High Performance Computing (Centro de Computación de Alto Rendimiento, CeCAR) for granting use of their computational resources.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article, including MD plots, compound synthesis data, tested compound codes, and NMR spectra, can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00079/full#supplementary-material

#### REFERENCES


has identified compounds with antiviral activity against multiple flaviviruses. Antiviral Res. 84, 234–241. doi: 10.1016/j.antiviral.2009.09.007


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Bollini, Leal, Adler, Aucar, Fernández, Pascual, Merwaiss, Alvarez and Cavasotto. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How Diverse Are the Protein-Bound Conformations of Small-Molecule Drugs and Cofactors?

#### Nils-Ole Friedrich<sup>1</sup>, Méliné Simsir<sup>1,2</sup> and Johannes Kirchmair<sup>1</sup>\*

<sup>1</sup> Department of Informatics, Center for Bioinformatics, Universität Hamburg, Hamburg, Germany, <sup>2</sup> Molécules Thérapeutiques In Silico, Université Paris Diderot, Sorbonne Paris Cité, Paris, France

#### Edited by:

Daniela Schuster, Paracelsus Private Medical University of Salzburg, Austria

#### Reviewed by:

Esther Kellenberger, Université de Strasbourg, France; Francesco Ortuso, Magna Græcia University, Italy; Sereina Riniker, ETH Zürich, Switzerland

\*Correspondence: Johannes Kirchmair kirchmair@zbh.uni-hamburg.de

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 16 January 2018 Accepted: 05 March 2018 Published: 27 March 2018

#### Citation:

Friedrich N-O, Simsir M and Kirchmair J (2018) How Diverse Are the Protein-Bound Conformations of Small-Molecule Drugs and Cofactors? Front. Chem. 6:68. doi: 10.3389/fchem.2018.00068

Knowledge of the bioactive conformations of small molecules or the ability to predict them with theoretical methods is of key importance to the design of bioactive compounds such as drugs, agrochemicals, and cosmetics. Using an elaborate cheminformatics pipeline, which also evaluates the support of individual atom coordinates by the measured electron density, we compiled a complete set ("Sperrylite Dataset") of high-quality structures of protein-bound ligand conformations from the PDB. The Sperrylite Dataset consists of a total of 10,936 high-quality structures of 4,548 unique ligands. Based on this dataset, we assessed the variability of the bioactive conformations of 91 small molecules (each represented by a minimum of ten structures) and found it to be largely independent of the number of rotatable bonds. Sixty-nine molecules had at least two distinct conformations (defined by an RMSD greater than 1 Å). For a representative subset of 17 approved drugs and cofactors we observed a clear trend for the formation of few clusters of highly similar conformers. Even for proteins that share a very low sequence identity, ligands were regularly found to adopt similar conformations. For cofactors, a clear trend for extended conformations was measured, although coiled conformers were also observed in a few cases. The Sperrylite Dataset is available for download from http://www.zbh.uni-hamburg.de/sperrylite\_dataset.

Keywords: bioactive conformational space, protein-bound ligand conformation, conformational variability, PDB, protein-ligand interaction, binding site, small-molecule drug, cofactor

### INTRODUCTION

The protein-bound ("bioactive") conformations of ligands can differ substantially from those observed in solution, the gas phase and small-molecule crystal structures (Boström, 2001; Perola and Charifson, 2004; Seeliger and de Groot, 2010). Bioactive conformations can be distributed over large regions of the ligand's conformational space and can have considerable strain energy (Nicklaus et al., 1995; Boström et al., 1998; Boström, 2001; Perola and Charifson, 2004; Günther et al., 2006). For the application of 3D computational approaches such as docking or de novo design methods in drug discovery, the protein-bound conformations of small molecules need to be known or at least determinable (Brameld et al., 2008).

The Protein Data Bank (PDB) is the most comprehensive resource of experimental structural data on biomacromolecules and their interaction with small molecules (Berman et al., 2000). Currently, the PDB contains more than 100k structures of biomacromolecules that include a bound ligand. While the structural data available from the PDB are extremely valuable for the research of biomacromolecules and their interactions with small molecules, these data represent only a very small fraction of (known) interactions.

Sturm et al. (2012) investigated the relationship between the promiscuity of drug-like molecules and the molecular properties of ligands and their binding sites. To do so, they compiled a dataset of more than 1,000 protein-ligand complexes in which drug-like molecules are bound to at least two distinct proteins. They identified two major drivers of ligand promiscuity: the structural similarities of ligand binding sites (largely independent of the similarities of the overall protein sequences or folds) and the ability of ligands to adopt distinct binding modes for different proteins. The latter is facilitated by the conformational flexibility of ligands and/or the specific characteristics of their pharmacophoric features. In related work, He et al. (2015) analyzed the structures of 100 pharmaceutically relevant ligands bound to at least two different proteins (to which they bind with comparable in vitro affinities). Contrary to the common belief that ligand flexibility and promiscuity are correlated, no evidence for a distinct correlation was found within their dataset. In fact, for 59 out of the 100 investigated ligands, no significant changes between the conformers of ligands bound to different proteins were observed.

The relative abundance of available structural data on the conformation of protein-bound cofactors, and nucleotide cofactors in particular, has made them a primary subject of investigation. For example, Moodie and Thornton (1993) analyzed 65 structures of nucleotides bound to proteins and found them to bind predominantly in an extended conformation. In more recent work, Stockwell and Thornton (2006) analyzed the conformational variability of adenosine triphosphate (ATP), nicotinamide adenine dinucleotide (NAD) and flavin adenine dinucleotide (FAD) in a preprocessed set of more than 2,000 structures extracted from the PDB. Dym and Eisenberg (2001) compiled a set of 150 structures of FAD bound to 32 nonredundant flavoproteins. They found a clear correlation between the FAD-family fold, the shape of the cofactor binding site and the conformation of FAD. Bojovschi et al. (2012) investigated the conformational diversity of ATP/Mg:ATP in motor proteins based on a set of 159 X-ray structures extracted from the PDB. They found that ATP adopts a wide range of different conformations, with a preference for extended conformations in tight binding pockets (e.g., F1-ATPase) and compact conformations in motor proteins such as RNA polymerase and DNA helicase. The incorporation of Mg2+ was found to increase the conformational flexibility of ATP. They clustered the conformations of the individual ligands based on the similarity of their binding pockets and, in the case of ATP for example, identified 27 clusters with a mean intercentroid RMSD of more than 2 Å. The authors concluded that, within the individual protein superfamilies, the investigated ligands generally bind in a fairly conserved manner, although several exceptions were identified. In the case of ATP, most structures were found to have the ligand bound in an extended conformation. 
In a few cases, however, a conformation bent such that the terminal phosphate atoms are almost in van der Waals contact with the adenine ring was observed. Stegemann and Klebe (2012) explored the structural properties of six cofactors including an adenosine diphosphate moiety bound to a variety of different proteins with low sequence identity. They found that common binding pocket patterns sometimes only recognize parts of the cofactor and thereby induce similar conformations.

These and further studies have contributed substantially to the understanding of protein-bound ligand conformations. However, a major bottleneck is the limited quality (Liebeschuetz et al., 2012; Reynolds, 2014), quantity and diversity of the structural data that these studies are based on, in particular with respect to the uncertainty of atom coordinates that is inherent to crystallographic structures. Only recently, a robust and fully automated method for the assessment of the support of individual atom coordinates (as well as molecules) by the measured electron density (EDIA) has become available (Meyder et al., 2017). This allowed, for the first time, extraction of a complete subset of high-quality structures of protein-bound ligands from the PDB (Friedrich et al., 2017b). Prior to the development of the EDIA method, time-consuming manual inspection by human experts was required to assure the high quality of structural data, which limited the size of available datasets (see e.g., Warren et al., 2012).

In this work we assess the conformational variability of small molecules based on a complete set of high-quality structures of protein-bound ligands extracted from the PDB, each of which is represented by at least ten high-quality X-ray structures. In total, the conformational variability of 91 approved drugs and cofactors represented by 4,574 protein-bound conformations was assessed. The bioactive conformational space of 17 representative molecules was studied in detail.

# MATERIALS AND METHODS

# Dataset Compilation

The Sperrylite Dataset was extracted from the PDB using a workflow described previously (Friedrich et al., 2017a). It consists of 10,936 conformers of 4,548 unique small molecules. Ninety-one ligands in this dataset are represented by at least 10 structures, and these served as the basis of this analysis.

To ensure that all ligands with the same PDB ligand ID have identical stereochemistry, their isomeric SMILES (generated with UNICON; Sommer et al., 2016) were compared in order to keep only the isomer with the most occurrences. The Approved Drugs subset of DrugBank (Wishart et al., 2017) was used to identify the approved drugs present in the Sperrylite Dataset.
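The selection logic described in this section can be sketched as follows. The record layout, ligand IDs, and structure identifiers are hypothetical, and applying the stereoisomer filter before the ten-structure cut-off is an assumption about the order of the two steps.

```python
from collections import Counter, defaultdict


def select_ligands(records, min_structures=10):
    """Keep, per ligand ID, only the most frequent isomer, then apply a cut-off.

    `records` is an iterable of (ligand_id, isomeric_smiles, structure_id).
    """
    by_ligand = defaultdict(list)
    for ligand_id, smiles, structure_id in records:
        by_ligand[ligand_id].append((smiles, structure_id))

    selected = {}
    for ligand_id, entries in by_ligand.items():
        # Most frequently observed isomeric SMILES for this ligand ID.
        top_smiles, _ = Counter(s for s, _ in entries).most_common(1)[0]
        kept = [sid for s, sid in entries if s == top_smiles]
        if len(kept) >= min_structures:
            selected[ligand_id] = kept
    return selected


# Hypothetical records: ligand "L01" has 11 structures of one stereoisomer
# and 2 of the other; ligand "L02" has only 4 structures in total.
records = (
    [("L01", "C[C@H](N)C(=O)O", f"s{i:02d}") for i in range(11)]
    + [("L01", "C[C@@H](N)C(=O)O", f"t{i:02d}") for i in range(2)]
    + [("L02", "CCO", f"u{i:02d}") for i in range(4)]
)
selected = select_ligands(records)
```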

# RMSD, Rotatable Bonds and Sequence Identity Calculations

All RMSD values were calculated with NAOMI (Urbaczek et al., 2011), which selects the minimum heavy-atom RMSD for the best superposition of each pair of conformers, taking molecular symmetry into account via complete automorphism enumeration.
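The idea of a symmetry-corrected RMSD can be illustrated with the simplified sketch below. It is not the NAOMI implementation: the conformers are assumed to be superposed already (no optimal superposition is computed), and the automorphisms are supplied by hand rather than enumerated from the molecular graph. The three-atom fragment is a toy example.

```python
import math


def rmsd(coords_a, coords_b, mapping):
    """Heavy-atom RMSD with atom i of A paired to atom mapping[i] of B."""
    sq = sum(
        sum((a - b) ** 2 for a, b in zip(coords_a[i], coords_b[j]))
        for i, j in enumerate(mapping)
    )
    return math.sqrt(sq / len(mapping))


def symmetry_corrected_rmsd(coords_a, coords_b, automorphisms):
    """Minimum RMSD over all supplied atom permutations (automorphisms)."""
    return min(rmsd(coords_a, coords_b, perm) for perm in automorphisms)


# Toy carboxylate-like fragment: atom 0 is C, atoms 1 and 2 are equivalent O.
conf_a = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, -1.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]  # O labels swapped
automorphisms = [(0, 1, 2), (0, 2, 1)]

naive = rmsd(conf_a, conf_b, (0, 1, 2))
corrected = symmetry_corrected_rmsd(conf_a, conf_b, automorphisms)
```

The naive value is inflated purely by atom labeling, whereas the corrected value recognizes that the two conformers are geometrically identical.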

The number of rotatable bonds was calculated with RDKit (RDKit: Open-Source Cheminformatics, version 2015.09.1, 2015). The default definition was used, meaning that amide and ester bonds were not counted as rotatable bonds.

All-against-all sequence identity was determined with NCBI BLAST (Altschul et al., 1990; BLAST, version 2.2.31. https://blast.ncbi.nlm.nih.gov (accessed Jan 14, 2018); Camacho et al., 2009) and the sequence identity of individual pairs of proteins was measured with the Molecular Operating Environment (Molecular Operating Environment (MOE), version 2016.08; Chemical Computing Group Inc.: Montreal, QC, 2016) based on sequence and structural alignments.

Principal component analysis (PCA)-derived score plots of the alignments with the minimum median RMSDs were generated with R for each ligand.

# Visualization

Visualizations of (i) the alignments of ligand conformers, (ii) the alignments of protein structures, and (iii) the interactions of proteins and ligands were generated with Maestro (Schrödinger Release 2016-2: Maestro, Schrödinger, LLC, New York, NY, 2016), MOE (Molecular Operating Environment (MOE), version 2016.08; Chemical Computing Group Inc.: Montreal, QC, 2016) and LigandScout (LigandScout, version 4.2; Inte:Ligand GmbH: Vienna, Austria, 2017; Wolber and Langer, 2005), respectively.

To avoid overcrowded figures, the depictions include all hydrogens, only polar hydrogens, or no hydrogens, as decided on a case-by-case basis.

# RESULTS

The Sperrylite Dataset is a collection of all high-quality X-ray structures of small molecules bound to biomacromolecules that are contained in the PDB. The dataset includes 10,936 structures of 4,548 unique protein-bound ligands and was compiled with a recently developed cheminformatics pipeline that automatically (i) prepares the chemical structures of small molecules by taking into account the protein environment (in order to determine, e.g., the most likely tautomeric and protonation states); (ii) removes undesirable molecules such as crystallization aids as well as structures with topological and/or geometrical errors; and (iii) rejects structures of low quality (Friedrich et al., 2017a,b). Importantly, the procedure not only includes checks for resolution and DPI (Cruickshank, 1999), but also employs the recently developed EDIA method (Meyder et al., 2017) to assess the support of individual atoms of a structure by the electron density.

In this study, the diversity of the protein-bound conformations of all ligands represented by at least 10 high-quality structures was investigated. This dataset consists of a total of 4,574 conformations of 91 unique ligands (an overview of all structures is provided in Scheme S1), including more than 30 nucleotides and 20 approved drug molecules. In an all-against-all comparison of the differences in conformation of each ligand as measured by RMSD, 81 of the 91 ligands had at least one conformer with an RMSD above 0.6 Å (which corresponds to the maximum positional uncertainty for atoms in the Sperrylite Dataset), and 69 had at least one conformer above 1 Å, meaning that they are clearly distinct. The correlation observed between the minimum median RMSD measured for all pairs of conformations and the number of rotatable bonds was (very) weak (R<sup>2</sup> = 0.126; Figure S1).
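The per-ligand statistics used here can be sketched from an all-against-all RMSD matrix. The sketch assumes one plausible reading of "minimum median RMSD": for each conformer, the median RMSD to all other conformers is computed, and the smallest of these medians is reported. The matrix values and function names are hypothetical; only the 0.6 Å and 1 Å thresholds are taken from the text.

```python
from statistics import median


def max_pairwise_rmsd(rmsd):
    """Largest RMSD between any two conformers (upper triangle of the matrix)."""
    n = len(rmsd)
    return max(rmsd[i][j] for i in range(n) for j in range(i + 1, n))


def minimum_median_rmsd(rmsd):
    """Smallest per-conformer median RMSD to all other conformers (assumed definition)."""
    n = len(rmsd)
    return min(median(rmsd[i][j] for j in range(n) if j != i) for i in range(n))


def summarize(rmsd, uncertainty=0.6, distinct=1.0):
    """Flag a ligand against the 0.6 A and 1 A thresholds used in the text."""
    top = max_pairwise_rmsd(rmsd)
    return {
        "max_rmsd": top,
        "min_median_rmsd": minimum_median_rmsd(rmsd),
        "above_uncertainty": top > uncertainty,
        "clearly_distinct": top > distinct,
    }


# Hypothetical symmetric RMSD matrix (in A): three similar conformers and one outlier.
rmsd_matrix = [
    [0.0, 0.2, 0.3, 2.4],
    [0.2, 0.0, 0.25, 2.5],
    [0.3, 0.25, 0.0, 2.45],
    [2.4, 2.5, 2.45, 0.0],
]
summary = summarize(rmsd_matrix)
print(summary)
```

A ligand counts toward the 81 (or 69) ligands mentioned above as soon as a single pair of its conformers exceeds the respective threshold, which is why the maximum pairwise RMSD is the relevant quantity for these flags.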

This work focuses on the analysis of the bioactive conformational space of a representative set of 17 approved drugs and cofactors (**Tables 1**, **2**; note that there is an overlap between cofactors and approved drugs). This set was compiled with the objective of including the most relevant and best-represented small molecules in a detailed analysis of individual ligands.

#### Definitions

In the following sections, "high-quality structures" refers to any structures matching the quality criteria defined in previous work (Friedrich et al., 2017b). Importantly, this term only refers to the quality of the protein-bound ligand, not the overall structure of the protein-ligand complex. Four-letter codes refer to PDB entries and three-letter codes in italics refer to PDB ligand identifiers.

# Small-Molecule Drugs

#### Imatinib

Imatinib (STI) is an approved anti-cancer drug targeting Bcr-Abl and several other tyrosine kinases. The drug binds to the ATP-binding site, spanning almost the entire width of the protein (Reddy and Aggarwal, 2012). Imatinib locks the protein in a closed conformation, thus arresting the enzyme's functionality. The PDB lists 11 high-quality structures with imatinib, 10 thereof with the drug bound to one of three different tyrosine kinases (ABL1: 1IEP, 1OPJ, 3K5V, 3MS9, 3MSS, 3PYY; ABL2: 3GVU; c-Src: 2OIQ, 3OEZ) or a synthetic construct of tyrosine kinase AS (4CSV), a common ancestor of Src and Abl.

The accessible conformational space of imatinib, which has seven rotatable bonds, is large. However, the conformations observed for imatinib bound to any of these tyrosine kinases are similar (**Figures 1A,B**), which is reflected by the low maximum pairwise RMSD of just 0.3 Å and is in agreement with the findings of He et al. (2015). This conformational similarity can be explained by the highly conserved nature of the residues that form the ligand binding sites of these tyrosine kinases (the minimum pairwise sequence identity between these proteins is 45%; **Figure 1D**).

One high-quality structure of imatinib is a complex with human quinone reductase 2 (3FW1). This enzyme exists as a dimer with two active sites, each located in a deep pocket at the interface between the monomers (Foster et al., 1999; Winger et al., 2009). Quinone reductase 2 is structurally dissimilar to protein kinases. Imatinib binds to the enzyme active site in proximity to the isoalloxazine ring of the FAD cofactor (**Figure 1C**), thereby adopting a distinct, "horseshoe-like" conformation (Winger et al., 2009) that differs by at least 2.4 Å from any of the conformations observed with tyrosine kinases (**Figure 1A**).



Note that imatinib is known to bind to spleen tyrosine kinase (SYK) in an orientation that is different from that observed for Bcr-Abl and other tyrosine kinases (Alton and Lunney, 2008). A crystal structure of the imatinib-SYK complex exists (1XBB; Atwell et al., 2004) but is not part of the Sperrylite Dataset because of poor electron density support for the parts of the ligand facing the bulk water phase (Figure S2). The conformer of imatinib in complex with SYK has an RMSD of 2.5 Å to any of the other kinase-bound conformers but is similar to the imatinib conformation observed in the complex with quinone reductase 2 (RMSD = 1.3 Å).

#### Darunavir

Darunavir (017) is an antiretroviral drug approved for the treatment and prevention of human immunodeficiency virus (HIV) infections. The compound inhibits HIV-1 protease at picomolar concentrations by forming strong polar interactions with the target enzyme (King et al., 2004). Fourteen out of the 54 available structures with darunavir are of high quality, all of them being structures with darunavir bound to wild type or mutant HIV-1 protease. The mutations observed in the 14 high-quality structures introduce only subtle changes to the shape and chemical properties of the ligand binding environment. This is reflected in the high similarity of the protein-bound conformations of darunavir, where, among the high-quality structures, a maximum pairwise RMSD of just 0.2 Å was measured (Figure S3).

#### Acetazolamide

Acetazolamide (AZM) is an inhibitor of carbonic anhydrase and is approved for the treatment of glaucoma, cardiac edema, idiopathic intracranial hypertension, epilepsy, and altitude sickness (Chakravarty and Kannan, 1994; Kaur et al., 2002). Ten out of the 29 structures of acetazolamide listed in the PDB are of high quality. In nine of these structures, acetazolamide is bound to one of six different human carbonic anhydrases (isoforms II, VII, IX, XII, XIII, and XIV, represented by PDB entries 3V2J, 3ML5, 3IAI, 1JD0, 3CZV, and 4LU3, respectively) or to carbonic anhydrases from three different extremophilic bacteria (Sulfurihydrogenibium sp., Thermovibrio ammonificans, and Sulfurihydrogenibium azorense, represented by PDB entries 4G7A, 4UOV, and 4X5S, respectively). The ligand binding pockets of all these carbonic anhydrase isozymes are highly similar (**Figure 2G**) and so are the conformations of acetazolamide observed for these complexes (**Figure 2A**). The protein-ligand complexes are stabilized by hydrogen bonds formed between the acetyl group of acetazolamide and the binding pocket (**Figure 2B**), with one exception, which is a complex with human carbonic anhydrase XII (1JD0). In that structure, the acetyl group of the ligand is rotated by about 140° as compared to any of the other structures (RMSD 0.9 Å; **Figure 2C**). A second, distinct conformation of acetazolamide is found in a complex with a different enzyme, endochitinase from Saccharomyces cerevisiae (2UY4), which has a fundamentally different binding pocket. In that structure, the carbon-sulfur bond of the ligand is rotated by 120° (**Figure 2D**). The moieties in question are oriented toward the bulk water phase, freely rotatable, and not engaged in directed interactions with the protein. Also, the electron density maps do not allow a definitive conclusion on the orientation of these moieties (**Figures 2E,F**). It is therefore entirely possible that in reality all conformers of acetazolamide in the Sperrylite Dataset are nearly identical.

#### TABLE 2 | Summary of cofactors and cofactor analogs investigated in this work.

<sup>a</sup>No. of distinct bioactive conformations.

FIGURE 2 | […] that the acetyl group in (C) and the sulfonamide moiety in (D) are present in the same orientation that is observed in any of the other crystal structures. (G) Superposed binding pockets of the nine human and three extremophilic bacterial carbonic anhydrases.

#### Triclosan

Triclosan (TCL) is an antibacterial and antifungal agent inhibiting enoyl-acyl carrier protein reductases (ENR), which are key enzymes in the fatty acid elongation cycle. Its wide use as a disinfectant in creams and consumer products (e.g., soaps, toothpaste, detergents) is controversial (Buth et al., 2010; Carey and McNamara, 2014).

In all 31 structures of triclosan contained in the PDB, the ligand is bound to an ENR. The conformers of triclosan observed among the 11 high-quality structures with ENR I and ENR III are very similar (median RMSD 0.1 Å; maximum pairwise RMSD < 0.6 Å; **Figure 3A**). These include the structures of Plasmodium falciparum ENR I (2O2Y) and Bacillus subtilis ENR III (3OID) which, despite a sequence identity of just 14% and a highly flexible binding site region (when in the unbound state), show almost identical structural features in the presence of triclosan (Kim et al., 2011).

In an X-ray structure of triclosan bound to Staphylococcus aureus ENR I (3GR6; not included in the Sperrylite Dataset because of low EDIA scores), the hydroxyl group of all four instances of triclosan is modeled in a different orientation (RMSD 1.4 Å measured to any of the other conformations present in the dataset). The EDIA score for the oxygen atom of the hydroxyl group of the four instances of this conformer is just 0.11–0.27, and visual inspection of the electron density map confirms a lack of support of this conformation (**Figure 3B**). The characteristic hydrogen bonds formed between the phenolic hydroxyl group of triclosan and Y156 as well as NAD(P) (Heath et al., 1999; Levy et al., 1999; **Figure 3C**) are also missing in this model (**Figure 3D**). All of these observations taken together indicate a likely error in this structural model.

The largest deviations between conformers of triclosan within the Sperrylite Dataset were observed for the complex with a triclosan-resistant G93V mutant (3PJF) of ENR I from Escherichia coli. These deviations are related to small conformational changes of a flexible α-helical turn in close proximity to the ligand (**Figure 3E**), resulting in the weakening of some edge-to-face aromatic interactions near the ligand (Singh et al., 2011). The high-level resistance of this mutant is not caused by a substantial loss in binding affinity of the drug but is a consequence of the inability of the G93V mutant to form the high affinity ENR-NAD+-triclosan ternary complex that inhibits the wild type (Heath et al., 1999).

#### Ubenimex, Bestatin

Ubenimex, also known as bestatin (BES), is a competitive protease inhibitor under investigation for the treatment of acute myelocytic leukemia and lymphedema (Tian et al., 2017). The molecule inhibits aminopeptidases and has shown immunomodulatory and host-mediated antitumor activities (Urabe et al., 1993; Inoi et al., 1995; Sakuraya et al., 2000). It has been approved in Japan as an adjunct to chemotherapy agents against acute non-lymphocytic leukemia for decades and has been reported to inhibit the growth of malaria parasites (Plasmodium falciparum) in vitro (Nankya-Kitaka et al., 1998).

Twenty-eight structures of bestatin are listed in the PDB. All of the 11 high-quality structures are with bestatin bound to aminopeptidases. The ligand conformations observed in eight of these high-quality structures are very similar to each other (maximum pairwise RMSD = 0.8 Å), even though the proteins originate from three different bacteria (E. coli, Pseudomonas putida and Vibrio proteolyticus), the unicellular protozoan parasite Plasmodium falciparum and mouse, and their minimum pairwise sequence identity is only 3.3%.

In contrast, the structure of bestatin bound to human aminopeptidase N (4FYR) shows an extended ligand conformation that has an RMSD of 2.0 Å to any of the ligand conformers observed for the bacterial proteins (**Figure 4A**). The conformations of the drug bound to human leukotriene A-4 hydrolase differ only slightly from the characteristic conformation observed for the aminopeptidases mentioned above, and the binding modes are similar (RMSD = 1.0 Å for both 3FUH and 3FTX; **Figures 4B–D**).

FIGURE 4 | (A) Superposition of all eleven conformers of bestatin in the Sperrylite Dataset. The carbon atoms of the conformers in complex with human aminopeptidase N (4FYR) and human leukotriene A-4 hydrolase (3FUH and 3FTX) are indicated in green, violet and cyan, respectively. The carbon atoms of all other structures are shown in gray. (B) Typical conformer of bestatin bound to aminopeptidase N from E. coli (2HPT). (C) A conformation that differs slightly from the characteristic conformation, observed in complex with human leukotriene A-4 hydrolase (3FUH shown here). (D) Uncommon, extended conformation of bestatin observed in complex with the human aminopeptidase N (4FYR).

#### Biotin

Biotin (BTN, vitamin B7) is a water-soluble coenzyme for carboxylase enzymes and an approved drug for the treatment of dietary shortage or imbalance. There are 99 crystal structures including biotin listed in the PDB. The biotin conformers observed for the 43 high-quality structures can be assigned to three distinct groups, indicated by gray, green and violet carbon atoms in **Figure 5A**. Twenty-four of the 43 structures are complexes with core streptavidin from different bacteria (both wild type and mutants). Streptavidin homotetramers have a very high affinity for biotin, one of the strongest non-covalent interactions known (Kd ≈ 10<sup>−14</sup> to 10<sup>−16</sup> M) (Laitinen et al., 2006). The protein-ligand complex is characterized by a high degree of shape complementarity and an extensive network of hydrogen bonds formed between both binding partners. One of the 24 structures of biotin bound to core streptavidin (4GD9) shows the impact of the cutting of a binding loop on the conformation of the bound ligand (Figure S4; Le Trong et al., 2013). Another structure (2IZJ) shows subtle structural changes of the streptavidin-biotin complex induced by a low pH that stabilizes intersubunit salt bridges (**Figure 5A**; orange carbon atoms; Katz, 1997).

Six crystal structures of avidin from chicken (wild type and mutants) and one of engineered avidin (2C4I) are also included in the dataset. Avidin is loosely related to streptavidin, with an equally high affinity to biotin and a very similar binding site (Figure S4). As expected, biotin binds to this protein in a conformation that is very similar to those predominantly observed for complexes with streptavidin.

Biotin-protein ligase (1WPY, 2EJ9, 2EJF, 2DTH, 2FYK, and 2ZGW) and biotin carboxylase (3G8C) share very low structural similarity with streptavidin and with each other. The conformations observed for biotin bound to biotin-protein ligase (**Figure 5A**; violet carbon atoms) are virtually identical to each other but differ by an RMSD of 1.1 Å from the predominant conformation observed in the Sperrylite Dataset. In particular, the angle of the alkyl chain leaving the ring system differs by around 103° from that observed for biotin bound to streptavidin. A third conformer of biotin is observed in complex with E. coli biotin carboxylase (3G8C; **Figure 5A**; green carbon atoms), with an RMSD of 0.9 Å measured against any of the streptavidin-bound conformers. Despite substantial structural differences observed among the various biotin-binding proteins, the non-covalent interactions formed between biotin and the target protein are largely conserved (**Figures 5B–D**).

#### Sapropterin

Sapropterin (tetrahydrobiopterin, H4B) is an approved drug for the treatment of tetrahydrobiopterin deficiency. It is an essential cofactor for the synthesis of nitric oxide and the hydroxylation of phenylalanine, tyrosine and tryptophan. The PDB counts 472 complexes with sapropterin, 188 of which are of high quality.

Of the high-quality conformers of sapropterin, all but three are extremely similar to each other (median RMSD of less than 0.1 Å; Figure S5A). All of these highly similar sapropterin conformers are bound to nitric oxide synthase from five different species (human, rat, mouse, cattle and Bacillus subtilis). The exceptions are the conformers bound to human phenylalanine hydroxylase (1MMK, 1MMT and 1J8U), which differ by an RMSD of 0.7 Å from the conformer in human nitric oxide synthase (4D1N, Figure S5B). The sequence identity between human phenylalanine hydroxylase and human nitric oxide synthase is less than 15%. The slightly different conformer bound to phenylalanine hydroxylase is stabilized by hydrophobic interactions (Figure S5C).

#### Cholic Acid

Cholic acid (CHD) is one of the major bile acids produced from cholesterol in the liver. It is approved for the treatment of bile acid synthesis disorders and as an adjunctive treatment of peroxisomal disorders.

Thirteen of the 74 available crystal structures that include cholic acid are of high quality. Twelve thereof are from eukaryotic proteins, including alcohol dehydrogenase, ferrochelatase, cytochrome c oxidase and bile acid-binding proteins; one structure is of choloylglycine hydrolase from Clostridium perfringens (2RLC).

Some pockets of cholic acid-binding proteins can accommodate more than a single cholic acid molecule, as observed, e.g., in structures of the chicken liver basic fatty acid-binding protein (1TW4) and the zebrafish liver bile acid-binding proteins (2QO5).

Given the rigid scaffold of steroids it is not surprising that, despite in part low sequence identity between the cholic acid-binding proteins, the observed ligand conformations (i.e., those bound to the deepest part of their respective binding pocket) are highly similar (median RMSD = 0.6 Å; **Figure 6A**). The maximum pairwise RMSD of 1.6 Å was measured between the conformation of cholic acid in the crystal structure of the G55R mutant of zebrafish liver bile acid-binding protein (2QO6) and in human mitochondrial ferrochelatase (3W1W).

#### Deoxycholic Acid

Deoxycholic acid (DXC), a metabolic byproduct of intestinal bacteria, is a steroid acid commonly found in the bile of mammals (Ridlon et al., 2016). Deoxycholic acid is a detergent that disturbs the integrity of biological membranes and is used to isolate membrane-associated proteins. Deoxycholic acid is approved for submental fat reduction, as a safer and less invasive alternative to surgical procedures for the treatment of lipomas (Duncan and Rotunda, 2011) and for improvements of aesthetic appearance.

Of the 29 entries deposited in the PDB, 18 are of high quality. Eleven of those structures are deoxycholic acid bound to cathepsin A and have a maximum pairwise RMSD of just 0.1 Å. Because of the rigid ligand core, deoxycholic acid also binds to structurally distinct proteins in very similar conformations (**Figure 6B**). Examples from the Sperrylite Dataset include two structures of Betula pendula Bet v1 (a major pollen allergen; 4A81 and 4A84), a structure of subunits I and II of cytochrome c oxidase (3DTU) from Rhodobacter sphaeroides, a structure of choloylglycine hydrolase from Clostridium perfringens (2BJF), a structure of the multidrug transporter MdfA (4ZP0) from E. coli, and even a conformer of deoxycholic acid bound to the interface of a dimer of the cell invasion protein SipD from Salmonella enterica (3O01; Chatterjee et al., 2011). The maximum pairwise RMSD (0.9 Å) was measured for the ligand conformers bound to a K9E mutant of cathepsin A (4HAJ) and Salmonella invasion protein D (3O01), indicated by violet carbon atoms in **Figure 6B**.

#### Cofactors and Cofactor Analogs

The most abundant small molecules in the Sperrylite Dataset are cofactors and their analogs. The cofactors represented by at least 10 high-quality structures can roughly be grouped into three categories: sinefungin and its analogs (S-adenosylmethionine, SAM, and S-adenosylhomocysteine, SAH; **Figure 7**), adenosine phosphates (AMP, ADP, ATP; **Figure 8**), and three cofactors without analogs listed in the dataset (glutathione, flavin mononucleotide and sapropterin). The RMSD distributions (all-against-all comparisons) for the most relevant cofactors are reported in **Figure 9**.

#### Sinefungin and Analogs

#### **Sinefungin**

Sinefungin (SFG), an analog of the cofactor substrate SAM, inhibits a wide range of methyltransferases, thereby interfering with DNA synthesis (Pugh et al., 1978). It is an antifungal antibiotic and also a known effective inhibitor of the transformation of chick embryo fibroblasts by the cancer-causing Rous sarcoma virus (Vedel et al., 1978).

FIGURE 6 | (A) Ligand-based alignment of 13 structures of cholic acid bound to different eukaryotic proteins and choloylglycine hydrolase from Clostridium perfringens (2RLC; gray carbon atoms) and human mitochondrial ferrochelatase (3W1W; violet carbon atoms). (B) Ligand-based alignment of 16 structures of deoxycholic acid bound to structurally distinct proteins, including salmonella invasion protein D (3O01; violet carbon atoms).

FIGURE 7 | […] carbon atoms), tRNA (guanine-N(1)-)-methyltransferase (4YVH; green carbon atoms), SMYDs and SET7 lysine methyltransferase (3CBP, 3PDN, 3N71, 3QWW and 3RU0; cyan carbon atoms); (C,D) 123 structures of SAM bound to different methyltransferases (gray carbon atoms), tRNA(m1G37)methyltransferase (1UAK; violet carbon atoms) and yeast ribosome synthesis factor Emg1 (2V3K; green carbon atoms); (E,F) 311 structures of SAH and (G,H) 74 conformers of glutathione (GSH).

The PDB lists 70 structures of sinefungin, all of them bound to methyltransferases. Thirty of these structures are of high quality. The observed conformers of sinefungin can be classified into three groups by an all-against-all comparison of their RMSDs (**Figure 9**). The largest group (**Figure 7A**; gray carbon atoms) includes 23 highly similar conformers (a representative example is given in **Figure 10A**) with a median RMSD of 0.5 Å, even though some of the proteins that these sinefungin molecules are bound to share low sequence identity (e.g., 30% for murine protein arginine N-methyltransferase 6 and the ribosomal protein L11 methyltransferase of Thermus thermophilus).
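How an all-against-all RMSD comparison can partition conformers into groups may be illustrated with a minimal sketch. It uses single-linkage grouping with a fixed cutoff, which is an assumption made for illustration and not necessarily the procedure used in this work; the cutoff of 1 Å, the matrix values and the function name are hypothetical.

```python
def group_conformers(rmsd, cutoff=1.0):
    """Single-linkage grouping: conformers closer than `cutoff` end up in one group."""
    n = len(rmsd)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if rmsd[i][j] <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Largest group first
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical symmetric RMSD matrix (in A) for six conformers.
rmsd_matrix = [
    [0.0, 0.5, 0.4, 1.7, 1.8, 3.1],
    [0.5, 0.0, 0.5, 1.7, 1.7, 3.1],
    [0.4, 0.5, 0.0, 1.8, 1.7, 3.1],
    [1.7, 1.7, 1.8, 0.0, 0.3, 2.6],
    [1.8, 1.7, 1.7, 0.3, 0.0, 2.6],
    [3.1, 3.1, 3.1, 2.6, 2.6, 0.0],
]
groups = group_conformers(rmsd_matrix)
print(groups)
```

Single linkage can chain loosely related conformers into one group, so the result depends strongly on the chosen cutoff.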

The second largest group consists of sinefungin conformers bound to the murine SET and MYND domains (SMYD) 1 (3N71) and 2 (3QWW), the human SMYD 3 (3PDN, 3RU0) and the SET7 lysine methyltransferase (3CBP), with RMSDs between 1.7 and 1.8 Å measured against the conformations representing the largest group (**Figure 7A**; cyan carbon atoms). Murine SMYD 1 (3N71) and human SET7 lysine methyltransferase (3CBP) have less than 15% sequence identity but bind sinefungin in very similar conformations (RMSD 0.3 Å).

Distinct conformations of sinefungin are observed for a complex with Haemophilus influenzae tRNA (guanine-N(1)-)methyltransferase (4YVH; **Figure 7A**; green atoms) and a complex with the ribosomal RNA small subunit methyltransferase NEP1 (3BBH, **Figure 7A**; violet carbon atoms, and **Figure 10B**) from Methanocaldococcus jannaschii, with RMSDs measured to the most abundantly observed conformation of 3.1 and 3.6 Å, respectively. In both cases the ligand conformation is stabilized by a hydrogen bond formed between the ligand's carboxyl group and the protein backbone.

#### **S-Adenosylmethionine**

SAM (SAM) is a cofactor that functions as a methyl donor in methyltransferases. It is essential for the methylation of proteins, DNA, lipids and small molecules. The bulk of SAM is generated in the liver, but all mammalian cells use it as an intermediate in the methionine-homocysteine cycle (Mato et al., 2013). SAM is also involved in the synthesis of many other endogenous metabolites. It has wide-ranging anti-inflammatory activity (Pfalzer et al., 2014) and, since its synthesis is depressed in chronic liver diseases, there has been considerable interest in its therapeutic use (Anstee and Day, 2012; Guo et al., 2015). S-adenosylmethionine is used as a drug for the treatment of depression, liver disorders, fibromyalgia, and osteoarthritis.

Four hundred ten structures listed in the PDB contain SAM. For example, almost all crystal structures of flavivirus methyltransferases contain SAM because the molecule copurifies with the enzymes (Noble et al., 2014). There are 119 high-quality SAM-containing structures present in the Sperrylite Dataset. Many of these conformers are similar, with an overall median RMSD of 0.6 Å (**Figures 7C,D**). Even conformers bound to proteins sharing a low sequence identity (e.g., 19% in the case of Aeropyrum pernix fibrillarin, 4DF3, and human NSUN5, 2B9E) have RMSDs of just 0.5 Å. The all-against-all RMSD comparison shows a partitioning into three groups that are mainly determined by the torsion angles between the adenine and the ribose and by the torsion angles including the sulfonium linkage (**Figure 7**). The highest RMSD measured between any pair of SAM conformers is 3.3 Å, which was measured for the ligand in complex with Haemophilus influenzae tRNA(m1G37)methyltransferase (1UAK; **Figure 7C**; violet carbon atoms) and with SAM methyltransferase from Ruegeria pomeroyi (3IHT).

FIGURE 9 | Violin plot including box plots of the RMSD distributions of high-quality, protein-bound conformations of sinefungin (SFG), SAM, SAH, AMP, ADP, ATP, GSH and FMN. The width of each violin plot for a certain RMSD value indicates how often the specific value occurs in the pairwise comparison of all conformers.

#### **S-Adenosyl-L-homocysteine**

The strong product inhibitor SAH (SAH) is released in all SAM-dependent methyltransferase reactions (Tehlivets et al., 2013). The ratio of SAM to SAH controls the activity of methyltransferase enzymes ("methylation ratio"; Schatz et al., 1977).

The PDB lists 784 structures including SAH, of which an unusually high proportion (40%; 311 structures) is of high quality (**Figure 7**). These represent a highly diverse set of proteins from all three domains of life. Most of the structures are of human (73 structures) and Pyrococcus horikoshii (72 structures) proteins.
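The proportion of high-quality structures can be recomputed from the counts reported in this work for the cofactors and cofactor analogs. The short sketch below only performs this arithmetic; the dictionary layout and variable names are illustrative.

```python
# (high-quality structures, structures listed in the PDB), as reported in the text
counts = {
    "SFG": (30, 70),
    "SAM": (119, 410),
    "SAH": (311, 784),
    "GSH": (74, 360),
    "AMP": (171, 575),
    "ADP": (462, 1810),
    "ATP": (218, 1079),
    "FMN": (367, 919),
}

# Fraction of high-quality structures per ligand
fractions = {name: high / total for name, (high, total) in counts.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```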

Many of the SAH conformations are highly similar, with an overall median RMSD of 0.6 Å. The all-against-all RMSD comparison shows three groups of conformations and an overall spread very similar to that observed for SAM (**Figure 9**). As shown in **Figure 7**, the conformations observed for SAM and SAH are similar. Also, all conformations of sinefungin are closely represented by at least one conformation of SAM and SAH.

The largest difference observed among the SAH conformations was measured between a coiled conformer bound to Haemophilus influenzae tRNA (Guanine-N(1)-) methyltransferase (1UAL) and a mostly stretched conformer bound to E. coli ribosomal RNA large subunit methyltransferase L (3V97) with an RMSD of 3.2 Å.

#### Glutathione

The tripeptide glutathione (GSH; GSH) is a cofactor of various different enzymes and a defensive reagent against toxic xenobiotics. Of the 360 entries with glutathione listed in the PDB, 74 structures are of high quality. These high-quality structures cover glutathione bound to 10 different proteins (**Figures 7G,H**). Most of the GSH conformers have a pairwise RMSD between 0.6 and 1.6 Å (**Figure 9**). The two most distinct conformers of glutathione observed in the Sperrylite Dataset are an unusually stretched conformer bound to a putative branched-chain amino acid ABC transporter from Chromobacterium violaceum (4PYR, **Figure 11A**) and an extremely coiled conformer bound to human mPGES-1 (4YL1, **Figure 11B**), with an RMSD of 3.6 Å. Nevertheless, their interaction patterns show similarities. Glutathione transferases are represented by 46 high-quality structures. These are mostly similar and have a median RMSD of less than 0.5 Å (**Figures 7G,H**).

#### Adenosine Phosphates

ATP functions as the most important molecule for intracellular storage and transport of chemical energy. It has many crucial roles in metabolism and is also a neurotransmitter. During metabolic processes, ATP is converted into adenosine diphosphate (ADP) and, subsequently, adenosine monophosphate (AMP), thereby releasing the stored energy.

#### **Adenosine monophosphate**

Out of the 575 complexes with AMP (AMP) found in the PDB, 171 conformers are of high quality. AMP has four rotatable bonds and the median RMSD measured between all high-quality conformers is 0.8 Å. The all-against-all comparison of AMP conformers results in a wide spread of the RMSD values (**Figure 9**). The flexibility of the molecule is mostly limited to the phosphate group (**Figures 8A,B**). The maximum RMSD of 2.5 Å was measured between an extremely coiled conformer bound to an adenylate kinase-related protein from Sulfolobus solfataricus (3LW7; **Figure 8A**, violet carbon atoms; **Figure 12A**) and the stretched conformer bound to NTPDase1 from Legionella pneumophila (4BRN; **Figure 12B**).

#### **Adenosine diphosphate**

Out of the 1,810 entries including ADP (ADP) in the PDB, 462 conformers are of high quality. Despite an additional phosphate group and a total of six rotatable bonds, the conformational space covered by ADP is very similar to that covered by AMP (**Figures 8C,D**). This similarity is reflected in the median RMSD of 0.9 Å between the conformers of ADP and a similar overall spread in the all-against-all comparison (**Figure 9**). The two most different ADP conformers in the Sperrylite Dataset are those bound to tryptophanyl-tRNA synthetase from Campylobacter jejuni (3TZL; **Figure 13A**) and an Stt7 homolog from Micromonas algae (4IX6; **Figure 13B**), with an RMSD of 2.9 Å.

#### **Adenosine triphosphate**

Only 218 conformers out of the 1,079 structures of the PDB containing ATP (ATP) were of high quality. In all structures of ATP included in the Sperrylite Dataset, the N-glycosidic bond is found in an anti-orientation. With its eight rotatable bonds ATP is more flexible than the previously discussed adenosine phosphates. This results in a median RMSD of 1.6 Å among the ATP structures of the Sperrylite Dataset (as compared to a median RMSD of 0.9 Å measured for ADP) and a distinct spread of the RMSD values in the all-against-all comparison (**Figure 9**). The maximum pairwise RMSD was 3.9 Å, measured between ATP conformers from human lysyl-tRNA synthetase (3BJU) and Drosophila melanogaster Wiskott-Aldrich syndrome protein homology 2 (3MN6).

ATP is observed in an extended conformation in most structures (**Figures 8G,H**), which is in agreement with earlier studies (Moodie and Thornton, 1993; Stockwell and Thornton, 2006; Bojovschi et al., 2012; Stegemann and Klebe, 2012). As also reported by Stockwell and Thornton (2006), some conformers are bent to such an extent that the terminal phosphate atoms are almost in van der Waals contact with the adenine ring. Examples of ATP in bent conformations include complexes with the aspartyl-tRNA synthetase from Pyrococcus kodakaraensis (1B8A; Figure S6) and the ribonucleotide reductase protein R1 from E. coli (3R1R).

#### Flavin Mononucleotide

Flavin mononucleotide (FMN; FMN) is the prosthetic group of various oxidoreductases (including NADH dehydrogenase), as well as a cofactor in biological blue-light photoreceptors (Froehlich et al., 2002; Schwerdtfeger and Linden, 2003). Blue-light receptors in plants (phototropins), for example, employ flavin mononucleotide as the chromophore for their light sensing function (He, 2002).

Its frequent occurrence as a prosthetic group and a cofactor results in flavin mononucleotide's presence in 919 structures deposited in the PDB, among which 367 conformers of FMN are of high quality. Despite having seven rotatable bonds, most structures show extended, similar conformations (**Figures 8G,H**), with a median RMSD of 0.9 Å. The all-against-all comparison reveals four groups of conformers, with peaks observed in the RMSD distribution around 0.3, 1.2, 1.7, and 2.4 Å (**Figure 9**). These peaks correspond to an accumulation of conformers with similar torsion angles of the side chain. The maximum RMSD of 2.9 Å was observed between the conformation of FMN in E. coli pyridoxine 5′-phosphate oxidase (1JNW) and in human glycolate oxidase (2RDU), with the side chain bent into opposing directions.

# CONCLUSIONS

The Sperrylite Dataset presented in this work is a complete subset of high-quality conformations of protein-bound ligands extracted from the PDB. This dataset resulted from a multi-step data processing and filtering procedure that, most importantly, also includes an automated approach for the evaluation of the support of individual atom positions by the electron density. The Sperrylite Dataset consists of a total of 10,936 high-quality structures of 4,548 unique ligands. Ninety-one of those ligands are each represented by a minimum of ten structures, and among these only a (very) weak correlation was observed between the number of rotatable bonds of a molecule and its overall variability (measured as the minimum median RMSD; R<sup>2</sup> = 0.126). Sixty-nine out of the 91 ligands had at least two distinct conformations (defined as RMSD above 1 Å).

A representative subset of 17 approved drugs and cofactors was analyzed in detail to determine the conformational variability of protein-bound conformations of small molecules. For all of the analyzed small-molecule drugs and some of the cofactors, a clear trend for the formation of few clusters of highly similar conformers was observed. Similar conformers were observed for proteins with similar binding sites, mostly independent of the overall protein sequence identity (which is in agreement with the findings of, e.g., Sturm et al., 2012). A particularly interesting example is imatinib, which was found to adopt highly similar conformations when binding to different tyrosine kinases (even to those sharing low overall sequence identity) but to adopt a distinct conformation upon binding to quinone reductase 2. For cofactors, a clear trend for extended conformations was observed, which is in agreement with previous works (Moodie and Thornton, 1993; Stockwell and Thornton, 2006; Bojovschi et al., 2012; Stegemann and Klebe, 2012). A few cases of strongly coiled conformers of cofactors were also observed. This result is well in line with earlier reports (Stockwell and Thornton, 2006).

It is clear that the currently available structural data on protein-bound ligands is still too limited to allow us to gain a full understanding of the bioactive space of small molecules. However, for several cofactors a large number of conformers observed in complex with dozens of proteins are available to date and provide valuable insight into the bioactive conformational space and the prevalence of bioactive conformations of small molecules. With an automated workflow for the extraction of high-quality ligand structures from the PDB in place, it is expected that the ever increasing amount of data will allow a more detailed understanding of, e.g., conformational preferences, ligand promiscuity, or the relationship between the bioactive conformational space of small molecules and the structural diversity of binding pockets.

#### DATA AVAILABILITY

The dataset generated for this study can be found at: http://www.zbh.uni-hamburg.de/sperrylite\_dataset.

#### AUTHOR CONTRIBUTIONS

JK and N-OF conceived the work; N-OF and MS conducted the computational studies. All authors contributed to the interpretation of the data and the writing of the manuscript. All authors have given approval to the final version of the paper.

#### ACKNOWLEDGMENTS

The authors thank Christina de Bruyn Kops for discussion and proofreading of the manuscript. MS was supported by the Erasmus+ Programme of the European Commission.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00068/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Friedrich, Simsir and Kirchmair. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Design, Synthesis, and Evaluation of Dihydrobenzo[cd]indole-6-sulfonamide as TNF-α Inhibitors

#### Xiaobing Deng<sup>1,2</sup>, Xiaoling Zhang<sup>2</sup>, Bo Tang<sup>3</sup>, Hongbo Liu<sup>1</sup>, Qi Shen<sup>2</sup>, Ying Liu<sup>2,3</sup>\* and Luhua Lai<sup>1,2,3</sup>\*

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Austria

#### Reviewed by:

Hyun Lee, University of Illinois at Chicago, United States Dharmendra Kumar Yadav, Gachon University of Medicine and Science, South Korea

\*Correspondence:

Ying Liu liuying@pku.edu.cn Luhua Lai lhlai@pku.edu.cn

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 10 January 2018 Accepted: 20 March 2018 Published: 04 April 2018

#### Citation:

Deng X, Zhang X, Tang B, Liu H, Shen Q, Liu Y and Lai L (2018) Design, Synthesis, and Evaluation of Dihydrobenzo[cd]indole-6-sulfonamide as TNF-α Inhibitors. Front. Chem. 6:98. doi: 10.3389/fchem.2018.00098

<sup>1</sup> Peking–Tsinghua Center for Life Sciences, Peking University, Beijing, China, <sup>2</sup> Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China, <sup>3</sup> BNLMS, State Key Laboratory for Structural Chemistry of Unstable and Stable Species, College of Chemistry and Molecular Engineering, Peking University, Beijing, China

Tumor necrosis factor-α (TNF-α) plays a pivotal role in the inflammatory response. Dysregulation of TNF can lead to a variety of disastrous pathological effects, including auto-inflammatory diseases. Antibodies that directly target TNF-α have proven effective in suppressing symptoms of these disorders. Compared to protein drugs, small-molecule drugs are normally orally available and less expensive. To date, peptide and small-molecule TNF-α inhibitors are still in the early stage of development, and much more effort is needed. In a previous study, we reported a TNF-α inhibitor, EJMC-1, with modest activity. Here, we optimized this compound by shape screening and rational design. In the first round, we screened a commercial compound library for EJMC-1 analogs based on shape similarity. Out of the 68 compounds tested, 20 compounds showed better binding affinity than EJMC-1 in the SPR competitive binding assay. These 20 compounds were tested in a cell assay, and the most potent compound was 2-oxo-N-phenyl-1,2-dihydrobenzo[cd]indole-6-sulfonamide (S10) with an IC<sub>50</sub> of 14 µM, which was 2.2-fold stronger than EJMC-1. Based on the docking analysis of S10 and EJMC-1 binding to TNF-α, in the second round, we designed S10 analogs, purchased seven of them, and synthesized seven new compounds. The best compound, 4e, showed an IC<sub>50</sub> value of 3 µM in the cell assay, which was 14-fold stronger than EJMC-1. 4e is among the most potent organic small-molecule TNF-α inhibitors reported so far. Our study demonstrates that 2-oxo-N-phenyl-1,2-dihydrobenzo[cd]indole-6-sulfonamide analogs can be developed as potent TNF-α inhibitors. 4e can be further optimized for its activity and properties. Our study provides insights into designing small-molecule inhibitors that directly target TNF-α and into protein–protein interaction inhibitor design.

Keywords: TNF-α inhibitor, dihydrobenzo[cd]indole-6-sulfonamide, virtual screening, synthesis, structure activity analysis


# INTRODUCTION

Tumor necrosis factor-α (TNF-α), an important cytokine mediator involved in inflammatory responses, is commonly used as a marker for many inflammatory disorders (Wajant et al., 2003). Antibodies that directly target TNF-α have achieved success in the treatment of inflammatory disorders such as rheumatoid arthritis, Crohn's disease, and ulcerative colitis (Bongartz et al., 2005; Jacobi et al., 2006). However, these biologics can cause anti-antibody immune responses and weaken the immune system's defense against opportunistic infections (Scheinfeld, 2004; Ai et al., 2015).

Thus, developing inhibitors to block TNF-α is still of great importance. Zhu et al. have reported several rationally designed proteins that directly bind to TNF-α. They grafted three key residues from the viral 2L protein onto a de novo designed small protein, DS119, and then optimized the residues at the interface, which provided small proteins that bind TNF-α with sub-micromolar affinities (Zhu et al., 2016). In addition to small proteins, bicyclic peptides and helical peptides have also been designed as peptidic antagonists of TNF-α (Lian et al., 2013; Zhang et al., 2013).

In addition to peptide inhibitors, small-molecule inhibitors that directly target TNF-α have also been discovered (Leung et al., 2012; Davis and Colangelo, 2013; Shen et al., 2014). Suramin was thought to be the first small-molecule inhibitor that directly disrupts the interaction between TNF-α and its receptor (TNFR) (Grazioli et al., 1992), but its potency was too low for clinical use (Alzani et al., 1993). No breakthrough was made until 2005, when SPD304 was reported as the first potent small-molecule inhibitor directly targeting TNF-α, with an IC<sub>50</sub> of 22 µM by ELISA, and the co-crystal structure of SPD304 in complex with the TNF-α dimer was solved (He et al., 2005). However, as the 3-alkylindole moiety of SPD304 can be metabolized by cytochrome P450s to produce toxic electrophilic intermediates, its further application in vivo is limited (Sun and Yost, 2008). Since then, several novel TNF-α inhibitors have been discovered using structure-based virtual screening (VS) of different chemical libraries. Chan et al. identified two compounds using high-throughput ligand-docking-based VS (**Figure 1**, **quinuclidine 1** and **indoloquinolizidine 2**), and their experimental tests showed that **quinuclidine 1** is more effective than **indoloquinolizidine 2** in inhibiting TNF-α-induced NF-κB signaling in HepG2 cells, with IC<sub>50</sub> values of 5 and >30 µM, respectively (Chan et al., 2010). Choi and colleagues discovered a series of pyrimidine-2,4,6-trione derivatives from a 240,000-compound library. The best compound (**Figure 1**, **Oxole-1**) showed 64% inhibition at 10 µM (Choi et al., 2010). Leung et al. reported a novel iridium(III)-based direct inhibitor of TNF-α (**Figure 1**, **[Ir(ppy)2(biq)]PF6**; Leung et al., 2012). Mouhsine et al. used combined in silico/in vitro/in vivo screening approaches to identify orally available TNF-α inhibitors with an IC<sub>50</sub> of 10 µM (**Figure 1**, **Benzenesulfonamide-1**; Mouhsine et al., 2017).

Other efforts to develop TNF-α inhibitors have also been reported (Mancini et al., 1999; Buller et al., 2009; Leung et al., 2011; Hu et al., 2012; Alexiou et al., 2014; Ma et al., 2014; Kang et al., 2016). However, owing to low potency and high cytotoxicity, small-molecule TNF-α inhibitors still have a long way to go before clinical application (Davis and Colangelo, 2013). Highly active TNF-α inhibitors with novel chemical structures need to be developed. In a previous study, we discovered a compound (**Figure 1**, **EJMC-1**) that directly binds TNF-α (Shen et al., 2014). The scaffold of this compound, 2-oxo-N-phenyl-1,2-dihydrobenzo[cd]indole-6-sulfonamide, has been reported in inhibitors of West Nile virus (Gu et al., 2006), RORγ inhibitors (Zhang et al., 2014), and BET bromodomain inhibitors (Xue et al., 2016; Mouhsine et al., 2017). Considering the good druggability of this scaffold, its analogs may be valuable for developing potent TNF-α inhibitors. In the present study, we used the scaffold of compound **EJMC-1** to perform a similarity-based virtual screen followed by experimental testing. Top-ranking compounds were first tested for their ability to reduce TNF-α binding to TNFR using surface plasmon resonance (SPR). A cell-based NF-κB reporter gene assay was then used to test the ability of the compounds to reduce TNF-α-induced signaling. New compounds were further designed, synthesized, and tested, and the structure-activity relationship of these compounds was analyzed.

# MATERIALS AND METHODS

### General Information

HEK293T cells were received as a gift from Professor Jincai Luo (Peking University, China). The extracellular domain of the TNF receptor 1 (TNFR1-ECD) was purchased from R&D Systems. The selected compounds were purchased from the SPECS database with purity higher than 90% and for most compounds >95% (confirmed by the supplier, using NMR or LC-MS data available through the website). Other biochemistry reagents were from Sigma Aldrich unless indicated otherwise. The organic reagents and solvents were commercially available and purified according to conventional methods. All reactions were monitored by thin layer chromatography (TLC), using silica gel 60 F-254 aluminum sheets and UV light (254 and 366 nm) for detection. All title compounds gave satisfactory <sup>1</sup>H NMR, <sup>13</sup>C NMR, and mass spectrometry analyses. The <sup>1</sup>H NMR and <sup>13</sup>C NMR spectra were measured on a Bruker-400 M spectrometer using TMS as internal standard. High resolution mass spectra were recorded on a Bruker Apex IV FTMS mass spectrometer using ESI (electrospray ionization).

# Synthesis

### Benzo[cd]indol-2(1H)-one 2

**2** was prepared by adapting the method of Kamal et al. (2012). Naphthalic anhydride (1.98 g, 10 mmol), hydroxylamine hydrochloride (0.69 g, 10 mmol), and dry pyridine (5 ml) were added to a dried three-necked flask. Heating was discontinued after reflux for 1 h, then benzenesulfonyl chloride (5 g) was added portionwise to cause controlled boiling. Finally, heating was resumed for 1 h, and the hot mixture was poured into water (30 ml). The crystalline precipitate was collected and washed with 0.5 N NaOH and water. The crystals were boiled with water (15 ml) and ethanol (5 ml) containing sodium hydroxide (5 g) for 2 h, during the second of which ethanol was allowed to distill out. The solution was acidified with concentrated hydrochloric acid (3 ml), carbon dioxide being evolved and yellow crystals deposited. The next day, the crystals were washed and dried to give light yellow needles (1.25 g, 74%). Mp 175–179 °C; <sup>1</sup>H NMR (300 MHz, DMSO-d6) δ 8.05 (d, 1H, J = 6.7 Hz), 8.01 (d, 1H, J = 8.3 Hz), 7.75–7.70 (m, 1H), 7.53 (d, 1H, J = 8.3 Hz), 7.40 (dd, 1H, J = 7.5, 6.7 Hz), 6.94 (d, 1H, J = 6.7 Hz).
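The quantities in this step can be cross-checked with a short calculation. The sketch below is illustrative only: the helper names are hypothetical, and the molar masses (naphthalic anhydride 198.17 g/mol, hydroxylamine hydrochloride 69.49 g/mol, benzo[cd]indol-2(1H)-one 169.18 g/mol) are standard values supplied here rather than figures stated in the procedure.

```python
def mass_g(amount_mmol, molar_mass):
    """Mass in grams for an amount in mmol and a molar mass in g/mol."""
    return amount_mmol * molar_mass / 1000.0


def percent_yield(isolated_g, limiting_mmol, product_molar_mass):
    """Isolated yield as a percentage of the theoretical mass of product."""
    return 100.0 * isolated_g / mass_g(limiting_mmol, product_molar_mass)


# Molar masses in g/mol (standard values, not taken from the procedure)
MW_NAPHTHALIC_ANHYDRIDE = 198.17
MW_HYDROXYLAMINE_HCL = 69.49
MW_BENZO_CD_INDOLONE = 169.18

print(round(mass_g(10, MW_NAPHTHALIC_ANHYDRIDE), 2))   # grams of anhydride for 10 mmol
print(round(mass_g(10, MW_HYDROXYLAMINE_HCL), 2))      # grams of NH2OH.HCl for 10 mmol
print(round(percent_yield(1.25, 10, MW_BENZO_CD_INDOLONE)))  # percent yield of 2
```

With these inputs, 10 mmol of each reagent corresponds to roughly 1.98 g and 0.69 g, and 1.25 g of product corresponds to a yield of about 74%.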

#### 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonyl Chloride 3

**3** was prepared by adapting the method of Talukdar et al. (2010). Chlorosulfonic acid (3.2 ml) was added slowly to **2** (1.0 g, 5.9 mmol). The reaction mixture was stirred at 0 °C for 1 h and at room temperature for 2 h. The mixture was then poured into ice water (20 ml). The precipitate was washed with water (2 × 10 ml) and dried to give the product as a yellow solid (0.66 g, 38%), which was used without further purification.

# General Procedure for N-Substituted 2-oxo-N-phenyl-1,2-dihydrobenzo[cd]indole-6-sulfonamides 4

A mixture of 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonyl chloride (100 mg, 0.37 mmol), aniline (0.37 mmol), Et3N (0.4 ml), and DMAP (20 mg) was dissolved in DMF (5 ml) and stirred at room temperature, with the reaction monitored by TLC. After the reaction was complete, the mixture was extracted with ethyl acetate (50 ml) and water (20 ml), and the organic layer was washed with water (3 × 20 ml), saturated aqueous NH4Cl (20 ml), and brine (20 ml). The organic layer was dried over Na2SO4, and the solvent was removed in vacuo. The residue was purified by column chromatography.

#### N-(5-aminonaphthalen-1-yl)-2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4a

Seventy-six milligrams, yield 54%. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 5.66 (s, 2H), 6.52 (d, J = 7.5 Hz, 1H), 6.88 (t, J = 8.1 Hz, 1H), 6.92 (d, J = 7.6 Hz, 1H), 7.04 (d, J = 8.4 Hz, 1H), 7.11 (d, J = 7.3 Hz, 1H), 7.18 (t, J = 7.9 Hz, 1H), 7.87 (dd, J = 7.9, 4.0 Hz, 3H), 8.07 (d, J = 7.0 Hz, 1H), 8.65 (d, J = 8.4 Hz, 1H), 10.18 (s, 1H), 11.07 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 168.70, 144.73, 142.65, 132.68, 132.01, 130.51, 130.23, 129.48, 128.50, 126.64, 126.46, 125.92, 124.71, 124.56, 123.29, 122.84, 122.76, 121.03, 110.30, 107.63, 104.54. HRMS (ESI): calcd for C21H16N3O3S, [(M+H)+], 390.0912, found 390.0896.

#### N-(3-aminonaphthalen-2-yl)-2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4b

Thirty-four milligrams, yield 26%. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 1.20–1.29 (m, 2H), 4.67–5.36 (m, 2H), 6.79 (s, 1H), 6.95 (d, J = 7.6 Hz, 1H), 7.05 (ddd, J = 8.1, 6.7, 1.2 Hz, 1H), 7.22 (ddd, J = 8.2, 6.8, 1.3 Hz, 1H), 7.35 (s, 1H), 7.41 (d, J = 8.2 Hz, 1H), 7.46 (d, J = 8.2 Hz, 1H), 7.85 (dd, J = 8.4, 7.0 Hz, 1H), 7.95 (d, J = 7.6 Hz, 1H), 8.08 (d, J = 7.0 Hz, 1H), 8.65 (d, J = 8.4 Hz, 1H), 11.12 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 169.10, 143.82, 142.21, 132.68, 132.21, 130.81, 130.01, 129.31, 128.65, 126.66, 126.45, 125.87, 124.63, 124.53, 123.39, 122.81, 122.77, 121.03, 110.60, 106.63, 103.59. HRMS (ESI): calcd for C21H16N3O3S, [(M+H)+], 390.0912, found 390.0896.

#### 2-oxo-N-(1,2,3,4-tetrahydronaphthalen-1-yl)-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4c

Sixty-eight milligrams, yield 48%. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 1.31–1.42 (m, 2H), 1.55–1.74 (m, 2H), 2.55–2.70 (m, 2H), 4.33 (dd, J = 9.7, 5.3 Hz, 1H), 6.93–6.97 (m, 2H), 7.02 (s, 1H), 7.08 (d, J = 7.4 Hz, 2H), 7.90–7.95 (m, 1H), 8.14 (t, J = 6.9 Hz, 2H), 8.35 (d, J = 8.4 Hz, 1H), 8.72 (d, J = 8.4 Hz, 1H), 11.16 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 169.31, 142.98, 137.57, 136.86, 132.65, 130.89, 130.37, 130.12, 129.17, 128.95, 127.41, 127.35, 126.67, 126.10, 125.31, 124.82, 105.17, 51.51, 30.68, 28.87, 19.68. HRMS (ESI): calcd for C42H36N4NaO6S2, [(2M+Na)+], 779.1974, found 799.1939.

#### N-(naphthalen-1-ylmethyl)-2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4d

Fifteen milligrams, yield 10%. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 4.43 (d, J = 5.7 Hz, 2H), 6.94 (d, J = 7.5 Hz, 1H), 7.25–7.36 (m, 3H), 7.41 (ddd, J = 8.1, 6.8, 1.2 Hz, 1H), 7.72–7.76 (m, 1H), 7.82 (d, J = 8.1 Hz, 1H), 7.85–7.90 (m, 2H), 7.98 (d, J = 7.5 Hz, 1H), 8.08 (d, J = 7.0 Hz, 1H), 8.40 (t, J = 5.9 Hz, 1H), 8.65 (d, J = 8.3 Hz, 1H), 11.10 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 169.29, 133.54, 132.94, 132.84, 131.08, 130.74, 130.04, 128.93, 128.69, 128.47, 127.29, 126.52, 126.33, 126.07, 125.47, 125.13, 124.74, 123.89, 104.98, 44.73. HRMS (ESI): calcd for C44H33N4O6S2, [(2M+H)+], 777.1842, found 777.1804.

#### N-(1H-indol-6-yl)-2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4e

Ninety-one milligrams, yield 68%. <sup>1</sup>H NMR (300 MHz, DMSO-d6) δ 6.21–6.27 (m, 1H), 6.67 (dd, J = 8.4, 2.0 Hz, 1H), 6.95 (d, J = 7.7 Hz, 1H), 7.05 (d, J = 1.7 Hz, 1H), 7.17–7.22 (m, 1H), 7.27 (d, J = 8.5 Hz, 1H), 7.91 (dd, J = 8.4, 7.0 Hz, 1H), 7.97 (d, J = 7.6 Hz, 1H), 8.08 (d, J = 6.9 Hz, 1H), 8.74 (d, J = 8.3 Hz, 1H), 10.20 (s, 1H), 10.91 (s, 1H), 11.11 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 169.15, 143.24, 136.18, 133.83, 131.41, 130.96, 129.77, 128.07, 127.28, 126.38, 125.87, 125.29, 124.90, 120.64, 114.44, 105.02, 104.61, 101.32. HRMS (ESI): calcd for C38H27N6O6S2, [(2M+H)+], 727.1433, found 727.1428.

#### N-(3-(1-methyl-1H-pyrazol-4-yl)phenyl)-2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide 4f

Seventy-six milligrams, yield 51%. <sup>1</sup>H NMR (300 MHz, DMSO-d6) δ 3.84 (s, 3H), 6.80 (dt, J = 5.4, 2.8 Hz, 1H), 7.04 (d, J = 7.6 Hz, 1H), 7.09–7.13 (m, 2H), 7.19 (s, 1H), 7.64 (s, 1H), 7.90–7.95 (m, 1H), 7.98 (s, 1H), 8.09 (d, J = 7.0 Hz, 1H), 8.19 (d, J = 7.7 Hz, 1H), 8.72 (d, J = 8.3 Hz, 1H), 10.59 (s, 1H), 11.16 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 169.25, 143.33, 136.18, 133.83, 133.56, 131.41, 130.96, 130.51, 129.77, 128.07, 127.28, 126.38, 125.87, 125.29, 124.90, 124.64, 119.44, 117.82, 117.61, 114.32, 40.51. HRMS (ESI): calcd for C21H17N4O3S, [(M+H)+], 405.1021, found 405.1086.

The red curve shows TNF-α binding to TNFR1-ECD alone, and the other curves show TNF-α binding to TNFR1-ECD in the presence of compounds at 100 µM. The reference compound EJMC-1 is colored blue and the best compound in the SPR assay, S10, is colored brown.

#### 6-((1H-benzo[d]imidazol-1-yl)sulfonyl)benzo[cd]indol-2(1H)-one 4g

Eighty milligrams, yield 62%. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 7.16 (d, J = 7.8 Hz, 1H), 7.31–7.37 (m, 2H), 7.69–7.75 (m, 1H), 7.82 (dt, J = 8.3, 0.9 Hz, 1H), 7.93–8.03 (m, 1H), 8.12 (d, J = 7.0 Hz, 1H), 8.71 (d, J = 7.8 Hz, 1H), 8.76 (d, J = 8.4 Hz, 1H), 9.19 (s, 1H), 11.35 (s, 1H). <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 168.46, 145.77, 143.47, 142.38, 136.22, 132.11, 129.83, 127.72, 127.05, 126.04, 125.57, 125.46, 124.71, 123.50, 122.77, 120.65, 112.18, 104.85. HRMS (ESI): calcd for C36H23N6O6S2, [(2M+H)+], 699.1120, found 699.1135.
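The calculated HRMS values listed for the ion formulas above can be reproduced from monoisotopic atomic masses. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the atomic masses are standard isotope values supplied here, and the electron mass is not subtracted for the cation, which appears to match the convention of the listed values.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
    "Na": 22.98976928,
}


def ion_mass(formula):
    """Sum of monoisotopic atomic masses for an ion formula such as 'C21H16N3O3S'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def ppm_error(found, calcd):
    """Deviation of the measured m/z from the calculated m/z in parts per million."""
    return (found - calcd) / calcd * 1e6


# Ion formulas as written in the characterization data
for label, formula in [
    ("[M+H]+ of 4a/4b", "C21H16N3O3S"),
    ("[2M+Na]+ of 4c", "C42H36N4NaO6S2"),
    ("[2M+H]+ of 4e", "C38H27N6O6S2"),
]:
    print(label, formula, round(ion_mass(formula), 4))

print("4e deviation (ppm):", round(ppm_error(727.1428, ion_mass("C38H27N6O6S2")), 2))
```

The regular expression handles two-letter symbols such as Na, so the same helper works for protonated and sodiated ions alike.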

### Competitive Binding Assay Using SPR

Binding interactions between TNF-α and TNFR1-ECD in the presence or absence of small-molecule inhibitors were examined on an SPR-based Biacore T200 instrument (GE Healthcare). TNFR1-ECD was immobilized on a CM5 sensor chip using standard amine coupling at 25 °C with 1X running buffer PBS-P (GE Healthcare). A reference flow cell was activated and blocked in the absence of TNFR1-ECD. All experiments were performed in phosphate-buffered saline (PBS)-EP buffer (10 mM NaH2PO4/Na2HPO4, 150 mM NaCl, 3.7 mM EDTA, 0.05% surfactant P20, pH 7.4) at 25 °C with a flow rate of 50 µl/min. TNF-α at a final concentration of 20 nM was mixed with each compound at various concentrations (as indicated in section Results) in PBS-EP and the mixture was injected. Equal amounts of TNF-α mixed with PBS-EP were used as a control. Regeneration was achieved by extended washing with glycine hydrochloride buffer (10 mM glycine-HCl, pH 2.1) after each sample injection.

#### Cell Based NF-κB Reporter Assay

FIGURE 3 | Inhibition of TNF-α induced NF-κB transcription activity. (A) Dose-response of compound S10 in the cell-based assay in the 293T cell line. (B) Dose-response of compound 4e in the cell-based assay in the 293T cell line. The data are reported as means ± errors from three independent experiments.

surface, the key residues were shown as sticks (green). (A) compound EJMC-1 (yellow). (B) Compound S10 (cyan). (C) EJMC-1 compared to SPD304 (gray). (D) compound 4e (magenta).

The cellular assays were carried out as described previously (Zhang et al., 2013). HEK293T cells were grown to 70% confluence in a 6 cm dish at 37 °C in Dulbecco's modified Eagle's medium supplemented with 10% fetal bovine serum (FBS; Gibco), then transfected with the purified plasmids pGL4.32 (luc2P/NF-κB-RE/Hygro plasmid; 0.6 µg) and pGL4.74 (hRluc/TK; 0.4 µg) using ViaFect transfection reagent (Promega). After 24 h, the transfected cells were seeded in 96-well plates at 40,000 cells per well. Twelve hours later, 100 µL of a pre-incubated mixture of TNF-α and small molecules was added to stimulate the cells for 6 h, and the luciferase assays were carried out using the Dual-Glo Luciferase Assay System (Promega) with a BioTek Synergy 4 Multi-Mode Microplate Reader. The final concentration of TNF-α in each well was 10 ng/ml. Equal amounts of TNF-α without small molecules were added to the cells as a negative control to calculate the percentage of activity inhibition.
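The percentage of inhibition derived from a dual-luciferase readout can be sketched as follows. This is an illustration under assumptions rather than the exact calculation used in the study: the firefly signal (NF-κB reporter, pGL4.32) is normalized to the Renilla signal (pGL4.74), inhibition is expressed relative to the TNF-α-only control, no background subtraction is applied, and the function names and readings are hypothetical.

```python
def normalized_signal(firefly, renilla):
    """Firefly (NF-kB reporter) luminescence normalized to the Renilla control signal."""
    if renilla <= 0:
        raise ValueError("Renilla signal must be positive")
    return firefly / renilla


def percent_inhibition(treated_ratio, tnf_only_ratio):
    """Inhibition relative to cells stimulated with TNF-alpha in the absence of compound."""
    return 100.0 * (tnf_only_ratio - treated_ratio) / tnf_only_ratio


# Hypothetical luminescence readings (arbitrary units)
tnf_only = normalized_signal(firefly=50000.0, renilla=10000.0)
with_compound = normalized_signal(firefly=20000.0, renilla=10000.0)
print(percent_inhibition(with_compound, tnf_only))
```

Normalizing to the Renilla signal corrects for well-to-well differences in cell number and transfection efficiency before the compound effect is computed.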

## Similarity-Based Virtual Screen

The crystal structure of the TNF-α dimer (PDB code: 2AZ5) was used for grid generation. The Standard Precision (SP) mode of Glide was used for the molecular docking studies (Friesner et al., 2004; Halgren et al., 2004). **EJMC-1** was first docked to the TNF-α dimer, and its conformation in the complex was used for Shape Screening of the SPECS library (May 2013 version for 10 mg; 197,276 compounds). The shape similarity index between each compound in the library and the reference compound was calculated. A total of 587 compounds with indexes between 0.8 and 0.99 were selected as candidates for a second round of manual selection with the following criteria: (a) containing at least one ring that provides hydrophobic interaction; (b) containing no metal atoms; and (c) shared in multiple structures. A total of 68 compounds were purchased from SPECS for experimental testing.

### Molecular Docking

The complex structure of TNF-α with SPD304 (PDB code: 2AZ5) was retrieved from the Protein Data Bank and docking was performed with Maestro (Schrödinger, Inc., version 10.2). Compounds **EJMC-1, S10,** and **4e** were docked into the TNF-α dimer using the Glide docking module (Friesner et al., 2004; Halgren et al., 2004). The details of the docking workflow are listed below: (1) The protein was prepared using the "Protein Preparation Wizard" workflow. All water molecules were removed from the structure of the complex. Hydrogen atoms and charges were added during a brief relaxation. After optimizing the hydrogen bond network, the crystal structure was minimized using the OPLS\_2005 force field with a maximum root mean square deviation (RMSD) value of 0.3 Å. (2) The ligand was prepared with the LigPrep module in Maestro, including adding hydrogen atoms, ionizing at a pH range from 7.2 to 7.4, and producing the corresponding low-energy 3D structure. (3) The pose prediction mode of the Glide docking module was adopted to dock the molecules into the SPD304-binding site with the default parameters. The center of the grid box was defined by SPD304. The top-ranking poses of molecules **EJMC-1, S10,** and **4e** were retained. The LigPrep mol2 format output was also docked using AutoDock Vina (Trott and Olson, 2010) with standard protocols. The computed binding free energies and structures for the top conformations were saved for post-docking analysis.

#### Statistical Analysis

The cell assay was repeated three times. Statistical analysis was performed using OriginPro 9.1, and data were fit with the DoseResp function, a three-parameter Hill equation. Results are expressed as mean ± SD (standard deviation).
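A three-parameter Hill curve of the kind mentioned above can be sketched in plain Python. The parameterization below (top plateau, IC50, Hill slope, with the bottom fixed at zero) is an assumption for illustration and is not claimed to be identical to the DoseResp function in Origin. The coarse grid search merely stands in for a proper nonlinear least-squares fit, and the data points are synthetic.

```python
def hill_inhibition(conc, top, ic50, slope):
    """Percent inhibition at concentration conc (same units as ic50)."""
    return top / (1.0 + (ic50 / conc) ** slope)


def fit_ic50(concs, responses, top=100.0):
    """Coarse grid search for (ic50, slope) minimizing squared error; top is held fixed."""
    best = None
    for i in range(1, 1001):          # ic50 from 0.1 to 100.0 in steps of 0.1
        ic50 = i / 10.0
        for j in range(5, 31):        # slope from 0.5 to 3.0 in steps of 0.1
            slope = j / 10.0
            sse = sum(
                (hill_inhibition(c, top, ic50, slope) - r) ** 2
                for c, r in zip(concs, responses)
            )
            if best is None or sse < best[0]:
                best = (sse, ic50, slope)
    return best[1], best[2]


# Synthetic dose-response data generated from known parameters (concentrations in µM)
concs = [1.0, 3.0, 10.0, 30.0, 100.0]
responses = [hill_inhibition(c, top=100.0, ic50=3.0, slope=1.0) for c in concs]
print(fit_ic50(concs, responses))
```

By construction, the response at a concentration equal to the IC50 is half of the top plateau, which is the property used to read an IC50 off a fitted curve.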

# RESULTS AND DISCUSSION

# Chemistry

Seven derivatives of dihydrobenzo[cd]indole-6-sulfonamide were synthesized using a three-step synthetic route (**Scheme 1**) with yields between 10 and 68%. Naphthalic anhydride was smoothly transformed to benzo[cd]indol-2(1H)-one by an aminolysis reaction, with a yield of 74%. Then, benzo[cd]indol-2(1H)-one underwent electrophilic aromatic substitution (chlorosulfonation) with chlorosulfonic acid to give the key intermediate 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonyl chloride (**3**), with a yield of 38%. Reactions of compound **3** with various amines in the presence of a catalyst system consisting of DMAP and Et3N afforded **4** and its derivatives in good yields. The original spectra of featured compounds are shown in Supplementary Image 1.

# Compounds From Similarity Search of EJMC-1 Block TNF-α Binding to TNFR

We used compound **EJMC-1** as the reference compound for similarity search (**Figure 1**). The binding conformation of **EJMC-1** with TNF-α was generated using molecular docking and used in pharmacophore based shape screening over the SPECS library.


TABLE 1 | Continued

4b >100 Synthesis

<sup>a</sup>Data shown represent the mean (n = 3).

Compounds with similarity index between 0.80 and 0.99 with **EJMC-1** were subjected to further manual selection. A total of 68 compounds were selected for experimental testing (Table S1). The chemical structures of these compounds fall into two classes, sulfonates and sulfonamides. The sulfonamides contain both N-aryl sulfonamides and N-alkylsulfonamides, with or without substituted aminocarbonyl group (Table S1).

We used an SPR competitive assay to test whether these compounds block TNF-α binding to TNFR more efficiently than **EJMC-1**. TNF-α with or without compounds was flowed over the chip surface on which the extracellular domain of TNFR was immobilized. At a concentration of 100 µM, 20 of the 68 compounds reduced the TNF-α binding signal more than **EJMC-1** (**Figure 2**). These 20 candidates were selected for further cell-based inhibition studies. The SPECS IDs of these 20 compounds are listed in Table S2, and the corresponding chemical structures are given in the supporting information. All the sulfonamide derivatives of **EJMC-1** competed with TNFR1 for binding to TNF-α, whereas the sulfonates did not.

### EJMC-1 Analogs Inhibit TNF-α Induced NF-κB Gene Expression

To explore whether these compounds with enhanced abilities to reduce TNF-α binding to its receptor were active in a cellular environment, we used a luciferase assay to monitor their influence on NF-κB transcriptional activity. In this assay, TNF-α induces NF-κB activation through TNFR1 in transfected cells, which then drives the expression of luciferase. The cell-level inhibitory effects of these 20 compounds were measured using the Dual-Glo Luciferase Assay System. In a two-dose screen, two compounds, **S3** and **S10**, showed better activity than **EJMC-1** (Table S2). The best compound, **S10**, suppressed NF-κB transcriptional activity dose-dependently (**Figure 3A**) with an IC<sub>50</sub> of 19.1 ± 2.2 µM. The positive control, SPD304, displayed an IC<sub>50</sub> of 6.4 ± 0.6 µM in the side-by-side experiment.

#### Docking Analysis and Compound Design

Molecular docking provided clues for the rational design of compounds with potentially enhanced activity and for understanding the SAR. In the complex structure of TNF-α with SPD304, SPD304 binds to a pocket in the TNF-α dimer (He et al., 2005; Shen et al., 2014). **EJMC-1** was shown to bind to the same site (He et al., 2005; Shen et al., 2014). Both Glide and AutoDock Vina were used in the docking study. We first tested whether the binding pose of SPD304 could be reproduced. With Glide, we tried many different parameter settings but were unable to obtain a binding pose close to that in the crystal structure (the minimum RMSD was as high as 4 Å). We then used AutoDock Vina to dock SPD304 to the TNF-α dimer, and the conformation with the lowest binding free energy was close to the crystal conformation, with an RMSD of 0.70 Å. Despite the different binding conformations obtained for SPD304 with the two docking programs, the top-ranking conformations of **EJMC-1** were almost the same in the docking runs with Glide and AutoDock Vina. The differences for SPD304 might be due to its flexibility, as it adopts a U-shaped conformation, and to the conformational sampling preferences of the docking software. As there were no essential differences in the docking poses of the compounds other than SPD304, we used the Glide docking poses of these compounds for comparison with SPD304 in the crystal structure. Compared to **EJMC-1**, **S10** had increased hydrophobic interaction with the Tyr59 residue (**Figures 4A,B**). In addition to nonpolar interactions with TNF-α like those of SPD304, the scaffold of **EJMC-1** and **S10** provides further polar interactions, strengthening specificity and activity (**Figure 4C**). As **EJMC-1** is smaller than SPD304 and leaves hydrophobic space in the pocket unoccupied (**Figure 4C**), several analogs of 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide with larger substituents on the sulfonamide were designed and docked to this site. Compound **4e**, with a larger hydrophobic group and an additional H-bond donor, interacts favorably with TNF-α and might be more potent (**Figure 4D**). Based on the docking analysis, the designed analogs were purchased or synthesized for the cell assay.
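The pose comparison described above rests on the root-mean-square deviation (RMSD) between matched atoms. As a minimal sketch, assuming both poses are given in the receptor frame with identical atom ordering (so that no superposition or symmetry correction is applied), the RMSD could be computed as follows. The function names and the 2.0 Å acceptance cutoff are illustrative assumptions, not values taken from this study.

```python
import math


def pose_rmsd(reference, docked):
    """RMSD (in Å) between two poses given as lists of (x, y, z) tuples.

    Assumes the atoms are listed in the same order in both poses and that
    both poses are expressed in the same (receptor) coordinate frame.
    """
    if len(reference) != len(docked) or not reference:
        raise ValueError("poses must be non-empty and have matching atom counts")
    squared_sum = sum(
        (a - b) ** 2
        for ref_atom, dock_atom in zip(reference, docked)
        for a, b in zip(ref_atom, dock_atom)
    )
    return math.sqrt(squared_sum / len(reference))


def reproduces_crystal_pose(reference, docked, cutoff=2.0):
    """Hypothetical acceptance test: RMSD at or below a chosen cutoff (Å)."""
    return pose_rmsd(reference, docked) <= cutoff
```

Under such a criterion, the AutoDock Vina pose of SPD304 (0.70 Å) would count as reproducing the crystal pose, whereas a pose at about 4 Å would not.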

#### Optimization of Compound S10 and Structure-Activity Analysis

As shown in the cell assay, the inhibitory activity of **S10** was about 2-fold higher than that of **EJMC-1**. The introduction of the naphthalene ring provides stronger hydrophobic interactions. Based on the docking analysis and the increased activity of **S10**, we tried to: (1) keep the naphthalene ring and change the N-substituted groups of dihydrobenzo[cd]indole; (2) keep the N-H of dihydrobenzo[cd]indole and optimize the hydrophobic R group; and (3) optimize both the N-substituted groups of dihydrobenzo[cd]indole and the hydrophobic R group (**Figure 5**). Seven commercially available analogs of **S10** were purchased for testing (**Table 1**). The SPECS IDs of these seven **S10** analogs are listed in Table S3. We further synthesized seven new compounds in three steps from 1,8-naphthalic anhydride through conventional reactions (**Scheme 1** and **Figure 5**). All compounds passed the PAINS (pan assay interference compounds) remover, which filters out compounds that appear as frequent hitters (promiscuous compounds) in many biochemical high-throughput screens (Baell and Holloway, 2010).

All compounds were tested using the TNF-α-induced NF-κB reporter assay. The structures and activities are listed in **Table 1**. For **S10**, methyl or ethyl substitution on the amide of 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide did not noticeably enhance inhibition (**S21** and **S22**), and α or β substitution of the naphthyl group did not affect inhibition (**S23**, **S24**, and **S25**). The size of the N-substituted group of the sulfonamide was important for inhibitory activity (**EJMC-1**, **S10**, and **S27**). The flexibility and aromaticity of the N-substituted two-ring group of the sulfonamide also played a dominant role: groups that were too rigid or too flexible dramatically reduced the activity (**4g**, **4d**). The loss of function of the N-(5-aminonaphthalen-1-yl) and N-(3-aminonaphthalen-1-yl) substituted compounds might be caused by a conformational change due to the additional amino group on the naphthalene ring. Introducing a heterocycle significantly increased the inhibitory activity (**4e** and **4f**). The N-(1H-indol-6-yl) substituted sulfonamide (**4e**) was 6-fold more potent than **S10** and even more potent than SPD304 (**Table 1**, **Figure 3B**). Although **S10** and **4e** have substituents of similar size on the sulfonamide, **4e** showed better inhibitory activity than **S10**, possibly because of the additional H-bond that **4e** forms with the backbone carbonyl of Gly121 (**Figure 4C**). Moreover, the indolyl group of **4e** sits deeper in the binding pocket than the naphthyl group of **S10** (**Figure 4D**).

#### CONCLUSION

We have optimized a previously reported TNF-α inhibitor, **EJMC-1**, using similarity-based VS and rational design. An analog of **EJMC-1**, **S10**, was found to have 2-fold higher TNF-α inhibitory activity. Based on the structures of **EJMC-1** and **S10** and their interactions with TNF-α, we designed derivatives of 2-oxo-1,2-dihydrobenzo[cd]indole-6-sulfonamide. Several commercially available ones were purchased and seven new compounds were synthesized for the SAR study. After two rounds of design, we obtained **4e** with an IC<sub>50</sub> of 3.0 ± 0.8 µM, which is one of the most potent TNF-α small-molecule inhibitors reported so far. Compound **4e** provides a good starting point for developing more potent TNF-α small-molecule inhibitors.


#### AUTHOR CONTRIBUTIONS

LL and YL designed and guided this study; XD designed the research, performed molecular docking and similarity search, and conducted the chemical synthesis; XZ performed the cell assay; HL and QS participated in the cell assay; BT performed the SPR binding assay; XD, LL, and YL analyzed the data and wrote the manuscript with input from all authors; XD and XZ contributed equally to this work.

#### ACKNOWLEDGMENTS

This work was supported in part by the Ministry of Science and Technology of China (2016YFA0502303, 2015CB910300) and the National Natural Science Foundation of China (21633001, 21573012).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00098/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Deng, Zhang, Tang, Liu, Shen, Liu and Lai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Insights Into the Bifunctional Aphidicolan-16-ß-ol Synthase Through Rapid Biomolecular Modeling Approaches

#### Max Hirte, Nicolas Meese, Michael Mertz, Monika Fuchs\* and Thomas B. Brück\*

Werner Siemens Chair of Synthetic Biotechnology, Department of Chemistry, Technical University of Munich, Munich, Germany

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Victor Guallar Guallar, Barcelona Supercomputing Center, Spain Dharmendra Kumar Yadav, Gachon University of Medicine and Science, South Korea Arnout Voet, KU Leuven, Belgium

#### \*Correspondence:

Monika Fuchs monika.fuchs@tum.de Thomas B. Brück brueck@tum.de

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 10 January 2018 Accepted: 20 March 2018 Published: 10 April 2018

#### Citation:

Hirte M, Meese N, Mertz M, Fuchs M and Brück TB (2018) Insights Into the Bifunctional Aphidicolan-16-ß-ol Synthase Through Rapid Biomolecular Modeling Approaches. Front. Chem. 6:101. doi: 10.3389/fchem.2018.00101

Diterpene synthases catalyze complex, multi-step C-C coupling reactions, thereby converting the universal, aliphatic precursor geranylgeranyl diphosphate into diverse olefinic macrocycles that form the basis for the structural diversity of the diterpene natural product family. Since catalytically relevant crystal structures of diterpene synthases are scarce, homology-based biomolecular modeling techniques offer an alternative route to study the enzyme's reaction mechanism. However, precise identification of catalytically relevant amino acids is challenging since these models require careful preparation and refinement prior to substrate docking studies. Targeted amino acid substitutions in this protein class can initiate premature quenching of the carbocation-centered reaction cascade. The structural characterization of those alternative cyclization products allows for elucidation of the cyclization reaction cascade and provides a new source for complex macrocyclic synthons. In this study, new insights into the structure and function of the fungal, bifunctional Aphidicolan-16-β-ol synthase (ACS) were achieved using a simplified biomolecular modeling strategy. The applied refinement methodologies could rapidly generate a reliable protein-ligand complex, which provides for an accurate in silico identification of catalytically relevant amino acids. Guided by our modeling data, ACS mutations led to the identification of the catalytically relevant ACS amino acid network I626, T657, Y658, A786, F789, and Y923. Moreover, the ACS amino acid substitutions Y658L and D661A resulted in a premature termination of the cyclization reaction cascade en route from syn-copalyl diphosphate to Aphidicolan-16-β-ol. Both ACS mutants generated the diterpene macrocycle syn-copalol and a minor, non-hydroxylated labdane-related diterpene, respectively. Our biomolecular modeling and mutational studies suggest that the ACS substrate cyclization occurs in a spatially restricted location of the enzyme's active site and that the geranylgeranyl diphosphate derived pyrophosphate moiety remains in the ACS active site, thereby directing the cyclization process. Our cumulative data confirm that amino acids constituting the G-loop of diterpene synthases are involved in the transition from the open to the closed, catalytically active enzyme conformation. This study demonstrates that a simple and rapid biomolecular modeling procedure can predict catalytically relevant amino acids. The approach reduces computational and experimental screening efforts for diterpene synthase structure-function analyses.

Keywords: homology modeling, aphidicolin, diterpene, diterpene synthase, homology model refinement

### INTRODUCTION

With more than 50,000 different molecules known to date, terpenes are the largest family of naturally occurring products, found in organisms from bacteria to fungi, mammals, and plants. They are all derived from the isoprene units dimethylallyl diphosphate and isopentenyl diphosphate. Condensation reactions of these molecules lead to phosphorylated linear terpenes of different lengths, which serve as substrates for terpene synthases. This enzyme family carries out stereochemically complex C-C coupling reactions, resulting in structurally complex macrocycles that contribute to the structural and functional diversity of terpenes (Christianson, 2017). Diterpenes are derived from the linear aliphatic precursor geranylgeranyl diphosphate (GGDP), which is cyclized by diterpene synthases. More specifically, diterpene synthases are classified into class II and class I enzymes based on the presence of the conserved motifs DXDD or DDXXD/E and NSE/DTE, respectively. While class II enzymes perform a protonation-initiated cyclization reaction to generate phosphorylated bicyclic structures, class I reactions are initiated by hydrolysis of the GGDP pyrophosphate moiety, which is coordinated by a Mg<sup>2+</sup> triad, thereby generating mono- or polycyclic structures.

The natural product Aphidicolin, initially isolated from the fungus Cephalosporium aphidicola, is a hydroxylated, tetracyclic diterpenoid that exhibits a broad range of biological activities and applications (Brundret et al., 1972; Dalziel et al., 1973). More specifically, it is a potent inhibitor of eukaryotic DNA polymerase α, with a commercial application as a cell synchronization agent. The compound is in pharmaceutical development due to its anti-tumor, anti-viral, and anti-leishmanial activity (Ikegami et al., 1978; Pedrali-Noy et al., 1980; Kayser et al., 2001; Edwards et al., 2013; Starczewska et al., 2016). Recently, other organisms, including the fungus Nigrospora sphaerica and the pathogenic fungus Phoma betae, have been identified as natural Aphidicolin producers. Current data suggest that Aphidicolin biosynthesis is exclusive to fungal metabolism and that natural sources for Aphidicolin are limited (Starratt and Loschiavo, 1974; Fujii et al., 2011; Lopes and Pupo, 2011). Nevertheless, elucidation of the Aphidicolin biosynthetic gene cluster in P. betae allowed for the identification of a bifunctional diterpene synthase that contains both a functional class I and a functional class II domain (Oikawa et al., 2001). The Aphidicolan-16-β-ol synthase (ACS) generates the stereochemically demanding Aphidicolan-16-β-ol (AD), the core structure of Aphidicolin, via a two-step reaction as depicted in **Figure 1** (Oikawa et al., 2002).

Initially, GGDP is rearranged in the class II active site cleft by protonation to the bicyclic syn-copalyl diphosphate (syn-CDP). Subsequently, syn-CDP is elaborated to AD in the class I active site (Adams and Bu'Lock, 1975; Oikawa et al., 2002). As depicted in **Figure 2**, the cyclization mechanism in the class I active site, initiated by the hydrolysis of the pyrophosphate group, results in 8-β-pimaradienyl carbocation formation. A subsequent attack of the vinyl group, bridging the C ring, directly undergoes a Wagner-Meerwein rearrangement and results in the formation of the aphidicolenyl carbocation. Eventually, this cation is quenched by water, thereby generating AD.

Terpene cyclization mechanisms are conventionally elucidated by radio labeling of protons and carbons (Dickschat, 2017). This substrate-specific labeling provides for identification of unusual hydride shifts and rearrangements. Alternatively, the enzyme's cyclization mechanism can be probed by altering amino acids in an attempt to terminate the reaction cascade at a specific transition state (Morrone et al., 2008; Janke et al., 2014; Schrepfer et al., 2016; Jia et al., 2017). To this end, random mutagenesis can be performed, but the screening effort for this methodology is considerable without an efficient high-throughput screening option (Lauchli et al., 2013). Biomolecular modeling allows for the rational identification and in silico modulation of amino acid networks that are involved in complex reaction cascades (Pemberton et al., 2015; Schrepfer et al., 2016; Christianson, 2017; Escorcia et al., 2018). This methodology provides for a knowledge-based approach to enzyme mutagenesis and screening. Nevertheless, a particular challenge for this strategy is the missing structural information for most terpene synthases. However, as their structural elements and domains are highly conserved (Christianson, 2017), homology modeling is a potential route to identify catalytically relevant amino acids despite the low primary sequence identities in this enzyme family (Xu and Li, 2003).

Unfortunately, most available crystal structures of terpene synthases are deposited in the open apo-enzyme configuration, which is catalytically inactive. This open enzyme conformation presents an additional obstacle when catalytically relevant amino acids have to be identified in silico. At present, only two diterpene synthase structures have been reported in the closed, catalytically active form (Liu et al., 2014; Serrano-Posada et al., 2015). Therefore, automated homology modeling approaches will almost always result in a catalytically non-relevant open enzyme configuration. Moreover, while prediction tools can place large cofactors (i.e., FAD, NADH, heme) correctly in the apo-protein framework, ligand-metal interactions are difficult to predict because of the multiple coordination geometries and the lack of sufficiently accurate force field parameters (Khandelwal et al., 2005). Hence, structure-function predictions that depend on the interplay between the amino acids of the protein framework and small metal ions cannot be conducted solely by application of automated software tools. In this context, a rational combination of structural information by superposition and extraction of cofactors is performed to prepare the protein structure for docking studies. Nevertheless, this approach often neglects reliable positioning of the cofactor-coordinating amino acids. Additionally, falsely predicted positioning of amino acid side chains in the active site cleft can lead to invalid interpretation of a homology model-based protein-ligand complex.

To improve this situation, this study established rapid and simple methodologies to refine diterpene synthase homology models for docking studies, thereby allowing for reliable structure-function predictions. In this context, an ACS class I homology model of the α-domain was predicted from the primary sequence. Subsequently, the predicted models were compared to catalytically relevant closed terpene synthase structures. The location of metals was refined and fitted against specifically selected structural templates, and multiple docking studies were carried out and validated. Our in silico results were experimentally evaluated by ACS mutagenesis studies. This led to the identification of essential amino acid side chains that are necessary for retaining the enzyme's activity. Additionally, we detected amino acid substitutions that abort the catalytic reaction cascade en route from syn-CDP to AD. Structural analyses and elucidation of these compounds revealed the formation of syn-copalol and a labdane-related, non-hydroxylated diterpene by the ACS mutants Y658L and D661A. Our approach of a protein homology model-based structure-function analysis can be easily adapted for other terpene synthases. This methodology allows for rapid and simple analysis of the catalytically relevant amino acid network, which helps in studying complex reaction cascades and developing new biocatalysts.

**Abbreviations:** ACS, Aphidicolan-16-β-ol synthase; AD, Aphidicolan-16-β-ol; GGDP, geranylgeranyl diphosphate; LRS, labdane related diterpene synthase; syn-CDP, syn-copalyl diphosphate.

### MATERIALS AND METHODS

#### Materials and Chemicals

All genes used were synthesized by Life technologies GmbH and the codon usage was optimized for E. coli if not stated otherwise. Primers were obtained from Eurofins Genomics GmbH. Strains and plasmids were obtained from Merck KGaA. All chemicals used were obtained at highest purity from Roth chemicals or Applichem GmbH. Enzymes were purchased from Thermo Fisher Scientific.

#### Software and Web-Tools

RaptorX was applied for homology modeling studies (http://raptorx.uchicago.edu; Källberg et al., 2012). The initial predicted structure was analyzed and further modified in the UCSF Chimera software package (Pettersen et al., 2004; http://www.cgl.ucsf.edu/chimera). Comparative modeling by spatial restraints was performed with MODELLER (Eswar et al., 2006), and all substrate docking studies were performed with AutoDock Vina (Trott and Olson, 2010; http://vina.scripps.edu). Chemical structures were drawn with PerkinElmer ChemBioDraw Ultra (http://www.cambridgesoft.com). For ligand preparation, the Avogadro software package (Hanwell et al., 2012; https://avogadro.cc/) was used. A syn-CDP toppar stream file was generated by the CHARMM General Force Field program version 1.0.0 for use with CGenFF version 3.0.1 (https://cgenff.paramchem.org; Vanommeslaeghe et al., 2010, 2012; Vanommeslaeghe and MacKerell, 2012). Molecular dynamics simulations (2 ns) of the docked ACS model B in a water sphere were performed with the CHARMM General Force Field using NAMD (Phillips et al., 2005; http://www.ks.uiuc.edu/Research/namd/). NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. For high-resolution pictures, the protein was prepared with Visual Molecular Dynamics (VMD; http://www.ks.uiuc.edu/Research/vmd/; Humphrey et al., 1996) and rendered with Tachyon as implemented in the VMD software package (Stone, 1998).

#### Docking

Ligand structures were downloaded from PubChem (https://pubchem.ncbi.nlm.nih.gov/) and geometrically optimized by 500 steps of steepest descent under the MMFF94 force field parameters included in Avogadro. Protein structures were prepared with Dock Prep, which is part of the Chimera software environment. The AMBER force field (AMBERff14SB) was applied to the receptor, while Gasteiger charges were added to the ligand and cofactors. As previously reported, docking can be improved by assigning partial charges to metal ions (Hu and Shelver, 2003). In this context, Mg-ion charges were set to +1 and the syn-CDP charge was set to −3. Docking was performed with AutoDock Vina using standard parameters. Docking poses were chosen based on a structural comparison to the pyrophosphate group that is co-crystallized in pdb 5A0J (see Figure S1B). The chosen pose was further validated by re-docking. For this purpose, the predicted syn-CDP pose was geometrically optimized de novo by 500 steps of steepest descent under the MMFF94 force field parameters included in the Avogadro software environment prior to repeating the docking (see Figure S1A).
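AutoDock Vina reads its search box and run settings from a plain-text configuration file. The sketch below, assuming the box is centered on a set of reference atoms (for example the transferred pyrophosphate group and Mg ions), shows one way such a file could be generated. The file names, the padding, and the coordinates are hypothetical and are not taken from this study; `exhaustiveness = 8` and `num_modes = 9` are the Vina defaults, in line with the standard parameters mentioned above.

```python
def docking_box(coords, padding=4.0):
    """Axis-aligned box (center, size) enclosing coords plus padding on each side (Å)."""
    xs, ys, zs = zip(*coords)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return center, size


def vina_config(receptor, ligand, center, size, exhaustiveness=8, num_modes=9):
    """Return the text of an AutoDock Vina configuration file."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    for axis, value in zip("xyz", center):
        lines.append(f"center_{axis} = {value:.3f}")
    for axis, value in zip("xyz", size):
        lines.append(f"size_{axis} = {value:.3f}")
    lines.append(f"exhaustiveness = {exhaustiveness}")
    lines.append(f"num_modes = {num_modes}")
    return "\n".join(lines) + "\n"


# Hypothetical reference atoms (Å); real values would come from the prepared model.
reference_atoms = [(10.0, 12.0, 8.0), (14.0, 15.0, 11.0), (12.0, 10.0, 9.0)]
center, size = docking_box(reference_atoms)
config_text = vina_config("acs_model_b.pdbqt", "syn_cdp.pdbqt", center, size)
```

The resulting text can be written to a file and passed to Vina with its `--config` option.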

#### Model Generation

An initial homology model of the ACS α-domain was predicted by RaptorX starting from amino acid 565. A model based on the pdb crystal structure 5A0J, a labdane-related diterpene synthase, was manually selected for further structure-function analyses. In order to prepare the model for docking studies, the coordinating Mg<sup>2+</sup>-ion triad and water molecules were introduced into the structure by different methods. Model A was generated by structural alignment to 5A0J. Cofactor positions were transferred from the template structure to Model A without any further adjustment prior to docking studies. Model B was created with MODELLER, as implemented in the Chimera software environment, using 5A0J as the template structure. In this model, hetero atoms and water molecules in the structure environment were included computationally. The pyrophosphate group was removed prior to docking with syn-CDP. Model C was prepared analogously to Model B, but syn-CDP was docked into the template structure 5A0J prior to refinement with MODELLER.
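Because the α-domain model starts at amino acid 565, residue numbers quoted for the full-length enzyme (for example Y658) may need to be converted when a modeling tool renumbers the chain from 1. The following sketch assumes a gap-free model whose first residue corresponds to position 565; the function name is hypothetical.

```python
MODEL_START = 565  # first full-length ACS position covered by the α-domain model


def model_index(full_length_position, start=MODEL_START):
    """Convert a full-length ACS residue number to a 1-based index in the model.

    Assumes the model is contiguous (no gaps) from `start` onward.
    """
    if full_length_position < start:
        raise ValueError("residue lies N-terminal of the modeled α-domain")
    return full_length_position - start + 1
```

With this convention, `model_index(658)` evaluates to 94, meaning Y658 would be the 94th residue of a renumbered α-domain model.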

#### Model Validation

The protein-ligand complex of Model B was validated by molecular dynamics studies. For this purpose, syn-CDP was initially extracted from Model B and parameterized with the CHARMM General Force Field program version 1.0.0 for use with CGenFF version 3.0.1. VMD was used to parameterize the protein and to merge ligand and protein. Subsequently, a water sphere was added around the protein-ligand complex. A 2 ns molecular dynamics simulation under the CHARMM General Force Field was run on the protein complex with NAMD. The calculated rmsd of the generated frames was plotted over time (Figure S2). A constant rmsd value was chosen as the criterion for an equilibrated protein-ligand complex. The last frame obtained was compared to the initial Model B (Figure S3).
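The equilibration criterion above (a constant rmsd over time) was assessed from a plot. As a hedged illustration only, one possible numerical version of such a plateau test compares the last two windows of the rmsd trace. The window length and the tolerance are arbitrary assumptions, not parameters used in this study.

```python
from statistics import mean, pstdev


def is_equilibrated(rmsd_series, window=50, tolerance=0.2):
    """Heuristic plateau test on an rmsd-versus-frame trace (values in Å).

    The trace counts as equilibrated when the mean of the last `window`
    frames differs from the mean of the preceding window by at most
    `tolerance`, and the spread within the last window is also at most
    `tolerance`.
    """
    if len(rmsd_series) < 2 * window:
        return False
    previous = rmsd_series[-2 * window:-window]
    last = rmsd_series[-window:]
    drift = abs(mean(last) - mean(previous))
    return drift <= tolerance and pstdev(last) <= tolerance
```

A trace that is still rising at the end of the run fails the test, which would suggest extending the simulation before comparing the final frame with the starting model.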

#### Plasmids for Diterpene Production

For all cloning procedures E. coli HMS 174 (DE3) was used. Clones were cultivated at 37°C in Luria-Bertani (LB) medium. Chloramphenicol (34 µg/L) and Kanamycin (50 µg/L) were added as required. For efficient production of the diterpene AD, E. coli's internal 1-deoxy-xylulose-5 phosphate pathway flux was increased by overexpression of deoxy-xylulose 5 phosphate synthase (dxs: GenBank: YP001461602.1) and isopentenyl diphosphate delta isomerase (idi: GenBank: AAC32208.1), and further extended by expressing geranylgeranyl diphosphate synthase (crtE: GenBank: KPA04564.1) and Aphidicolan-16-β-ol synthase (acs: GenBank: AB049075.1). For this purpose, dxs and acs were amplified from original sources by PCR. Polycistronic operons (**Table 1**) were constructed by the BioBrick cloning standard (Shetty et al., 2008).

Site directed mutations of acs were generated by PCR. Forward primers were designed exhibiting the respective mutation at the 5′ end while the corresponding reverse primers were phosphorylated at 5′ end (Table S1). PCR products were ligated by T4 Ligase prior to transformation. All amino acid exchanges were confirmed by sequencing.

#### Production of Diterpenes

All diterpene production experiments were performed in E. coli BL21 (DE3). To investigate the product outcome of ACS mutants, pACYC acs plasmids were co-transformed with pAX dic. Cultivation was performed in minimal medium supplemented with 6 g/L yeast extract and 30 g/L glycerol at 25°C. After 60 h, the culture was extracted with a mixture of hexane, ethanol, and ethyl acetate (1:1:1, v/v/v) for 1 h. The extract was centrifuged at 10,000 g for 2 min. The upper, organic phase was directly analyzed for diterpene products via GC-MS.

#### Diterpene Analytics

GC-MS analyses of diterpenes were performed with a Trace GC Ultra with DSQII (Thermo Fisher Scientific). For this purpose, a 1 µL sample was loaded (split 1/10) by a TriPlus AS onto an SGE BPX5 column (30 m, I.D. 0.25 mm, film 0.25 µm). The initial column temperature was set to 160°C and maintained for 5 min before a temperature gradient of 8°C/min up to 320°C was applied. The final temperature was kept for an additional 3 min. MS data were recorded at 70 eV (EI) and m/z (rel. intensity in %) as total ion current (TIC). The recorded m/z range was from 50 to 650.
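The oven program above implies a fixed run time per injection: a 5 min hold, a ramp of (320 − 160)/8 = 20 min, and a 3 min final hold, i.e., 28 min in total. The short sketch below reproduces this arithmetic and returns the nominal oven temperature at a given time; the function names are illustrative.

```python
def gc_run_time(initial_temp, initial_hold, ramp_rate, final_temp, final_hold):
    """Total oven program time in minutes (hold + linear ramp + hold)."""
    ramp_time = (final_temp - initial_temp) / ramp_rate
    return initial_hold + ramp_time + final_hold


def oven_temperature(t, initial_temp=160.0, initial_hold=5.0,
                     ramp_rate=8.0, final_temp=320.0):
    """Nominal oven temperature (°C) at time t (min) for the program above."""
    if t <= initial_hold:
        return initial_temp
    return min(final_temp, initial_temp + ramp_rate * (t - initial_hold))


total_minutes = gc_run_time(160, 5, 8, 320, 3)  # 5 + 20 + 3 = 28 min
```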

NMR spectra were recorded in CDCl<sub>3</sub> with an Avance III 500 MHz (Bruker) at 300 K. <sup>1</sup>H NMR chemical shifts are given in ppm relative to CDCl<sub>3</sub> (δ = 7.26 ppm). The 2D experiments (HSQC) were performed using standard Bruker pulse sequences and parameters.

### RESULTS AND DISCUSSION

#### Homology Model Refinement

The steady increase in published protein crystal structures provides for an accelerated improvement of computational homology prediction. Especially due to the high structural conservation of the terpene synthase enzyme families, biomolecular tools can predict structures solely based on the amino acid sequence. In this context, structure prediction of the bifunctional ACS was performed to analyze the highly complex conversion of GGDP via syn-CDP to the tetracyclic AD, which is the core structure of the cytostatic compound Aphidicolin. ACS belongs to the diterpene synthase family, and we identified three structurally highly conserved domains. The initial conversion of the universal diterpene precursor GGDP to syn-CDP occurs in the class II active site, located between the ACS β- and γ-domains. The subsequent syn-CDP cyclization to AD is then conducted in the class I active site, which is positioned in the middle of an α-helical bundle forming the ACS α-domain. Notably, the fungal ACS is structurally highly similar to the previously crystallized plant diterpene synthases Abietadiene synthase (pdb: 3S9V) and Taxadiene synthase (pdb: 3P5R). Homology prediction based on the full ACS sequence took those two structures into account, but for both, crystals could only be obtained in N-terminally truncated forms. Furthermore, these crystal structures have only been solved in an open conformation that is catalytically inactive. In order to avoid these catalytically inactive templates, only the ACS α-domain sequence was used for homology prediction. A model based on the labdane-related diterpene synthase (LRS) (pdb: 5A0J), which is provided as a catalytically active holo-complex (Serrano-Posada et al., 2015), was selected for ACS homology refinement. The structural superposition of Abietadiene synthase (pdb: 3S9V), LRS (pdb: 5A0J), and the ACS model, as depicted in **Figure 3**, demonstrates that there is a better fit between the ACS model and the LRS crystal structure. While the structural fit between LRS and the ACS model is visually apparent, we did not calculate an rmsd value as a qualifier, because structural domains that do not constitute the active site region are highly variable.

Co-crystallized cofactors (Mg<sup>2+</sup>-ion triad) and waters, both provided in the LRS structure, are also involved in the ACS reaction en route from syn-CDP to AD. Therefore, we transferred the positions of the Mg<sup>2+</sup> ions and waters into the ACS model in different ways, which resulted in three ACS models (A–C). Model A was prepared by adopting the cofactor positions from the template structure LRS after structural alignment. Initial evaluation of this model indicated that this unrefined modeling method results in clashes between cofactor positions and amino acid side chains. Generally, in homology prediction the active site's cavity is not specifically reserved for the substrate or cofactors. Therefore, we presume that amino acid side chains occupy this free space as a result of the applied energy minimization. This is demonstrated in our docking studies of Model A, where the ACS amino acid Y658 prevents syn-CDP from completely accessing the active site cavity. With the MODELLER package, which is based on comparative protein structure modeling by spatial restraints, a protein structure can be refined based on a template structure. Additionally, hetero-atoms and water molecules can be included directly in the model refinement. This refinement methodology, applied to our initial model structure, led to the generation of ACS Model B. Model B computationally included the three Mg<sup>2+</sup> ions, a pyrophosphate group (conventionally derived by Mg<sup>2+</sup>-based hydrolysis of the phosphorylated substrate syn-CDP), and water molecules directly, as they are all present in the LRS template structure. Model B provides reliable positioning of the conserved amino acids that constitute the class I diterpene synthase signature DDXXD/E and NSE/DTE motifs in relation to the adopted Mg<sup>2+</sup> ions, water, and pyrophosphate moieties. Subsequently, we removed the pyrophosphate group from the Model B structure to enable docking with the native syn-CDP substrate. Our docking data indicated that in Model B syn-CDP can completely access the active site's cavity. A specific syn-CDP conformation pointing toward the ACS G-helix was selected, as this flexible helix is proposed to be involved in terpene cyclization reactions (Yoshikuni et al., 2006; Baer et al., 2014; Jia et al., 2017). This docking pose was validated by multiple re-docking approaches (Figure S1A). Additionally, we validated the pose by comparing the position of its pyrophosphate moiety to the pyrophosphate group co-crystallized in LRS (Figure S1B). Finally, a third approach for structure-function analyses was performed by docking syn-CDP into LRS prior to ACS refinement with MODELLER. Again, a syn-CDP conformation in close proximity to the G-helix was chosen. On the basis of this LRS holo-protein complex, an ACS holo-complex Model C was generated. This method provided a protein model that was refined around the substrate and cofactors. This methodology also provided a precise specification of the amino acids involved in the AD cyclization reaction. For all three models, amino acids located within 5 Å of the docked substrate syn-CDP (neglecting the pyrophosphate moiety) were analyzed by mutational studies to elucidate their catalytic relevance (see **Figure 4**, Table S2).
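The 5 Å selection described above amounts to a distance filter between protein atoms and ligand atoms, with the pyrophosphate atoms left out of the probe set. A minimal sketch of such a filter is given below. The data layout, the atom names, and the toy coordinates are assumptions made for illustration and are not taken from the models in this study.

```python
import math


def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=5.0,
                         skip_ligand_atoms=()):
    """Residues with at least one atom within `cutoff` Å of the ligand.

    protein_atoms: list of (residue_label, (x, y, z))
    ligand_atoms:  list of (atom_name, (x, y, z))
    skip_ligand_atoms: ligand atom names to ignore (e.g., pyrophosphate atoms)
    """
    probes = [xyz for name, xyz in ligand_atoms if name not in skip_ligand_atoms]
    near = set()
    for residue, xyz in protein_atoms:
        if any(math.dist(xyz, probe) <= cutoff for probe in probes):
            near.add(residue)
    return sorted(near)


# Toy coordinates (Å), invented for illustration only.
protein = [("Y658", (0.0, 0.0, 0.0)), ("D661", (10.0, 0.0, 0.0)),
           ("F789", (3.0, 0.0, 0.0))]
ligand = [("C1", (1.0, 0.0, 0.0)), ("P1", (12.0, 0.0, 0.0))]
candidates = residues_near_ligand(protein, ligand, skip_ligand_atoms={"P1"})
```

In practice the atom lists would be read from the docked model with a structure parser, and the skipped names would be those of the diphosphate phosphorus and oxygen atoms.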

#### Mutational Validation of Catalytically Relevant ACS Amino Acids

Due to their stereochemical diversity, natural diterpene scaffolds are attractive research leads. The enormous stereochemical complexity of diterpene macrocycles renders them difficult to access via total chemical synthesis. Therefore, biosynthetic routes to generate these complex structures are currently an intense research focus (Dickschat, 2016; Bian et al., 2017; Jones, 2017). The ability to access new diterpene macrocycles via selective alteration of amino acids in diterpene synthases opens up a highly varied chemical space. For the class I cyclooctat-9-en-7-ol synthase, which naturally generates a tricyclic fusicoccin-type diterpene, amino acid mutations in the vicinity of the active site led to premature termination of the reaction cascade. Hence, alternative macrocyclic structures, such as the bicyclic dolabellane and the monocyclic cembrane, could be generated, thereby elucidating the reaction cascade (Görner et al., 2013; Janke et al., 2014). In this study, insights into the class I reaction of the ACS were achieved by mutational studies. In that respect, we intended to quench the reaction from syn-CDP to AD at previously proposed transitional states (Adams and Bu'Lock, 1975; Oikawa et al., 2002). Based on the proposed ACS transitional states, we presume that syn-labdatriene and syn-copalol (termination products of the syn-copalyl carbocation), stereoisomers of syn-pimaradiene (termination products of the pimaradienyl carbocation), or aphidicolene and stemodene (termination products of the aphidicolenyl carbocation) are potential abortion products (see **Figure 5**).

For an intermittent abortion of the reaction cascade from syn-CDP to AD, we selected amino acids within 5 Å of the docked ligands as prime targets for mutagenesis (see **Figure 4**). Preliminary studies revealed that sidechain substitutions involving amino acid exchanges that inherently change physico-chemical properties frequently resulted in inactive enzyme variants (Janke et al., 2014; Schrepfer et al., 2016). We therefore focused on changing the size of the respective amino acid sidechain while trying to preserve its physico-chemical characteristics. Alternatively, we chose amino acid sidechain substitutions that replace polar groups with amino acids of similar size (Table S2).
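The distance criterion used to pick mutagenesis targets can be illustrated with a minimal sketch. It assumes that a residue counts as "within 5 Å" if any of its atoms lies within 5 Å of any ligand atom, which is one common reading and not necessarily the exact rule applied in this study. The function name, the input layout and the toy coordinates are hypothetical; pyrophosphate atoms would simply be left out of the ligand atom list.

```python
import math

def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return sorted residue labels having at least one atom within `cutoff` Å
    of any ligand atom.

    protein_atoms: iterable of (residue_label, (x, y, z))
    ligand_atoms:  iterable of (x, y, z), e.g. syn-CDP without its pyrophosphate
    """
    ligand_atoms = list(ligand_atoms)
    near = set()
    for label, xyz in protein_atoms:
        if label in near:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            near.add(label)
    return sorted(near)

# Toy coordinates (hypothetical, not taken from the ACS model)
protein = [
    ("Y658", (1.0, 0.0, 0.0)),
    ("Y658", (9.0, 0.0, 0.0)),
    ("D661", (0.0, 4.5, 0.0)),
    ("A786", (0.0, 0.0, 7.5)),
]
ligand = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.5)]
selected = residues_near_ligand(protein, ligand)
print(selected)
```

In this toy case the third residue is 6.0 Å from the nearest ligand atom and is therefore not selected.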

ACS syn-CDP docking results pointed toward a strong interaction between the decalin core and surrounding hydrophobic sidechains. However, as the decalin structure of syn-CDP remains untouched in further cyclization steps, most of the implemented mutations near this particular moiety resulted in inactive (I626A, Y923L, F789L) or wildtype activity variants (F629L, Y658F, C831G, C831T, T920G, Y923F). Based on our modeling results, we also identified specific amino acids located in the ACS G-helix that in other studies have been proposed to be of catalytic relevance (Baer et al., 2014; Jia et al., 2017). While mutational changes in the G-helix of kaurene synthase-like diterpene synthases resulted in alternative product profiles (Jia et al., 2017), our analogous approaches with ACS only provided inactive (A786L, F789L) or wildtype active (A786G, F789Y) variants. Nevertheless, our results support previous findings that propose the G-helix as an essential flexible motif which is involved in the catalytically relevant structural change from the open to the closed enzyme configuration (Baer et al., 2014).

Only the ACS substitutions Y658L and D661A led to an altered product outcome. In addition to the amino acids that constitute the DXXDD/E and NSE/DTE signature motifs responsible for Mg<sup>2+</sup>-ion coordination, our combined in silico and experimental study identified only a few amino acids (see **Figure 6**, colored in pink) capable of terminating activity. Our successful mutations (D661A, Y658L) indicated that the unusual cyclization from syn-CDP to AD proceeds in a spatially restricted area of the active site's cleft. Additionally, our data suggest that the pyrophosphate group remains in the active site and coordinates the reaction cascade. This is in accordance with the recently postulated taxadiene synthase reaction mechanism (Schrepfer et al., 2016).

#### ACS Mutants D661A and Y658L

GC-MS analyses of the ACS mutants Y658L and D661A revealed that these mutations lead to the formation of two unknown diterpene products (see **Figure 7**). In contrast to the native AD, which had a GC retention time of 17.67 min, these new diterpenes had retention times of 12.79 and 13.46 min, respectively. The latter product, with a retention time of 13.46 min, showed a total mass of 290 m/z. Comparison of the MS spectral data suggests that this was a hydroxylated diterpene with a structure similar to syn-copalol (Hoshino et al., 2011). Subsequently, this compound was isolated and structurally characterized by NMR (Figures S4, S5). The results are in accordance with previous spectral data for syn-copalol (Yee and Coates, 1992). One plausible explanation for syn-copalol formation is the quenching of the syn-copalyl carbocation intermediate by water in the active site of the enzyme. The other diterpene product, with a retention time of 12.79 min, had a total mass of 272 m/z, indicating that this structure was non-hydroxylated. While we expected the formation of syn-labda-8(17),12E,14-triene, comparison with published MS spectra revealed significant differences (Morrone et al., 2011). Unfortunately, due to the low amounts produced and purification issues for this highly hydrophobic compound, we could not conduct NMR analysis. However, we presume that this compound also originates from the syn-copalyl carbocation and that a labdane-related diterpene with high structural similarity to syn-labda-8(17),12E,14-triene was generated by the ACS mutants. The newly generated diterpenes are of great interest as copalol derivatives display various biological activities analogous to aphidicolin (Hanson, 2015).

The structural changes (D661A and Y658L) still allowed syn-CDP binding in the active site with subsequent hydrolysis of the pyrophosphate group. The syn-copalyl carbocation was then quenched either by water (release of syn-copalol) or by an amino acid side chain (release of the non-hydroxylated diterpene). Furthermore, as we did not find other substitutions that stopped cyclization at the proposed transitional states, and as we could not even detect changes in the byproduct formation of the active mutants, we presume that the ACS cyclization occurs in a spatially restricted area and that the pyrophosphate group remains in the active site, which is in accordance with recent reports (Schrepfer et al., 2016).

Former diterpene centered production processes were limited by low target compound yields. However, optimization of recombinant diterpene production hosts has extensively progressed to provide gram per liter yields (Ajikumar et al., 2010; Schalk et al., 2012). Today, access to novel diterpene lead structures is limited by the effective identification of relevant enzyme systems from large scale genome sequencing projects. Therefore, rational alteration of known terpene synthase product profiles by using a combination of in silico prediction and knowledge based mutagenesis studies can allow for a more rapid and targeted expansion of the desired chemical space.

# CONCLUSION

A model of ACS synthase was computed that required the application of various methods for model refinement to improve the quality of in silico structure function analysis. A model of the

acid substitutions Y658L and D661A in the vicinity of the ACS active site lead to formation of the alternative cyclization products syn-copalol and a minor labdane related diterpene. Formation of these products was delineated by quenching of the syn-copalyl carbocation en-route to AD. Additional mutants leading to inactive enzyme variants (A786L, F789L) provided insights into catalytically relevant amino acid residues within the G-helix. The cumulative in silico and experimental data suggest that amino acids constituting the G-loop motif of class I terpene cyclases are involved in the transformation of the open to the closed, catalytically active enzyme conformation. Moreover, as we only obtained a limited number of alternative cyclization products in our mutational screens, we presume that AD formation occurs in a rather confined location of the ACS active site. With respect to our biomolecular modeling approaches, we demonstrated that simple and rapid computational methodologies can be employed for prediction and structure function analyses of class I diterpene synthases.

# AUTHOR CONTRIBUTIONS

TB and MF supervised this study. MH initiated this study and performed virtual modeling and docking studies. NM and MM conducted mutagenesis experiments and screening under supervision of MH and MF. Data was analyzed by MH, MF, NM, MM, and TB. All figures were created by MH. All authors verified the data, contributed to the manuscript, and approved the final version.

#### ACKNOWLEDGMENTS

MH, MF, and TB would like to acknowledge the financial support of the German ministry for Education and Research (BMBF) with the grant number 031A305A. TB gratefully acknowledges funding by the Werner Siemens foundation for establishing the field of Synthetic Biotechnology at the Technical University of Munich (TUM).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00101/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Hirte, Meese, Mertz, Fuchs and Brück. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors

Pavel V. Pogodin<sup>1</sup>, Alexey A. Lagunin<sup>1,2</sup>, Anastasia V. Rudik<sup>1</sup>, Dmitry A. Filimonov<sup>1</sup>, Dmitry S. Druzhilovskiy<sup>1</sup>, Mark C. Nicklaus<sup>3</sup> and Vladimir V. Poroikov<sup>1</sup>\*

<sup>1</sup> Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia, <sup>2</sup> Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia, <sup>3</sup> Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, NCI-Frederick, Frederick, MD, United States

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Victor Kuz'Min, National Academy of Sciences of Ukraine (NAN Ukraine), Ukraine; Alexandre Varnek, Université de Strasbourg, France

\*Correspondence: Vladimir V. Poroikov vladimir.poroikov@ibmc.msk.ru

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 29 January 2018 Accepted: 09 April 2018 Published: 26 April 2018

#### Citation:

Pogodin PV, Lagunin AA, Rudik AV, Filimonov DA, Druzhilovskiy DS, Nicklaus MC and Poroikov VV (2018) How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors. Front. Chem. 6:133. doi: 10.3389/fchem.2018.00133

Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the "active" and "inactive" compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.

Keywords: ChEMBL, bioactivity data, kinase inhibitors, SAR, PASS, virtual screening, classification, SAVI

# INTRODUCTION

Discovery of novel pharmaceutical agents with improved safety and efficacy is the perpetual task of medicinal chemistry (Pammolli et al., 2011). In addition to the traditional methods of chemical synthesis and pharmacological studies of various drug-like substances, in recent years substantial attention has been paid to the analysis of the general chemical-biological space (Lipinski and Hopkins, 2004; Baell and Holloway, 2010; Bon and Waldmann, 2010; López-Vallejo et al., 2012; Deng et al., 2013; Medina-Franco et al., 2013; Buonfiglio et al., 2015; Rodriguez-Esteban, 2016; Horvath et al., 2017). Such approaches significantly increase the diversity of the studied chemical libraries as well as the chances to identify the pharmaceutical agents interacting with multiple molecular targets and causing additive or synergistic desired pharmacological action (Sidorov et al., 2015; Lauria et al., 2016).

Nowadays, available chemical libraries can be divided into four categories: (1) databases containing information about structure and properties of publicly disclosed chemical compounds, e.g., PubChem (Li et al., 2010; Wang Y. et al., 2014) and ChEMBL (Bento et al., 2014); (2) databases containing information about structure of commercially available chemical samples, e.g., ZINC (Sterling and Irwin, 2015); (3) databases of virtually generated structures comprehensively covering the particular chemical space, e.g., GDB-17 (Ruddigkeit et al., 2012); (4) databases of virtually generated, synthetically accessible structures with data on starting materials and proposed synthetic routes, e.g., SAVI (Synthetically Accessible Virtual Inventory) (Pevzner et al., 2017). Although GDB-17 is one of the largest<sup>1</sup> currently known libraries of chemical structures, containing 166.4 billion possible molecules of up to 17 atoms of C, N, O, S, and halogens, SAVI looks more attractive for drug discovery because its molecules are synthetically accessible. Furthermore, it was shown (Pevzner et al., 2017) that the overlap between the 93 million structures from PubChem and the 238 million structures of the SAVI database is only about 0.03%. Thus, SAVI represents a significant, previously unexploited reservoir of novel structures, presumably useful for drug discovery.

To reveal the hidden pharmacological potential of the synthesizable molecules from SAVI, computer-aided virtual screening could be applied (Jorgensen, 2004; Nettles et al., 2006; Bajorath, 2014; Fujita and Winkler, 2016; Lee et al., 2016). Although structure-based methods are widely used now, ligand-based methods have important advantages (Leelananda and Lindert, 2016). In several case studies, machine learning approaches were shown to surpass the performance of both chemical similarity assessment and reverse docking (Anusevicius et al., 2015; Druzhilovskiy et al., 2016; Murtazalieva et al., 2017).

Thus, it is reasonable to analyze the probable biological activity of SAVI molecules using our computer program PASS that recently received high marks: "One of the earliest and most widely used examples of data-mining target elucidation is the continuously curated and expanded Prediction of Activity Spectra for Substances (PASS) software, which was assimilated from the bioactivites of more than 270,000 compound-ligand pairs" (Mervin et al., 2015). The PASS development started more than 25 years ago (Poroikov et al., 1993; Filimonov et al., 1995), and during this time its performance has continuously and significantly improved. PASS in its 2017 version predicts over 7,000 kinds of biological activity with an average accuracy of 94% based on the analysis of structure-activity relationships for more than 1 million known biologically active compounds.

Initially, in the PASS training set a molecule is designated as "active" if reliable information about some biological activity is found in an authoritative source (publication in a peer-reviewed journal, record in a curated database, etc.); otherwise, it is designated as "(conditionally) inactive." This would seem to be a reasonable approach, as it has been found that if the same set of chemical compounds is studied against the same molecular target in three different assays, only 35% of the active compounds completely coincide (Lipinski and Hopkins, 2004).

Since no chemical compound has been tested for all known biological activities, this may be an incorrect designation in some cases. However, it has been shown that PASS provides reasonable estimates of structure-activity relationships despite the incompleteness of information in the training set on both chemical structures and biological activities, due to the robustness of the Naïve Bayes approach in general (Rish, 2001; Rennie et al., 2003) and of the MNA descriptors and the biological activity representation used in PASS in particular (Poroikov et al., 2000).

Quantitative data on structure and activity of many chemical compounds freely available from ChEMBL and PubChem databases allow one to consider alternative approaches for creating training sets that may improve the performance of machine learning methods. Such possibilities were recently considered in several studies (Heikamp and Bajorath, 2013; Smusz et al., 2013; Kurczab et al., 2014; Afzal et al., 2015; Mervin et al., 2015).

In this work we evaluated the PASS performance in virtual screening for kinase inhibitors with training performed using three approaches, which differ with respect to what compounds were selected as inactives: (1) only experimentally validated ("true") inactives; (2) combining true and conditionally inactives; (3) only conditionally inactives. The first and second approaches have the drawback that they require enough data on true inactives.

These training strategies are both related to multi-label classification (Tsoumakas et al., 2010; Cherman et al., 2011; Afzal et al., 2015) and positive unlabeled learning (Kilic and Tan, 2012), because one and the same classifying object may simultaneously belong to several categories [have multiple labels, i.e., inhibit more than one kinase (Martin et al., 2011) in our case study] and the problem of inactives' selection may be solved using more than one method. In contrast to the various approaches to inactives' selection described by Kilic and Tan (2012), we used only straightforward approaches, since in chemoinformatics we are forced to deal with extremely sparse data about ligand-protein interactions and, thus, introduction of data about target-to-target relations during the training may lead to strong overfitting.

<sup>1</sup>The Danish biopharmaceutical company Nuevolution announced that it had created a library of 40 trillion unique molecules (C&EN, 2017, 95: 28–33); however, the web site (https://nuevolution.com/technology) states that the company enables DNA encoded synthesis of billions of chemically diverse drug-like small molecule compounds.

The kinases were chosen for this study because of the strong family ties among kinases that manifest themselves through common structural features and predispose kinase inhibitors to polypharmacological action (Knight et al., 2010; Gani et al., 2015; Sidorov et al., 2015). Thus, the aforementioned differences in the training set formation may lead to visible changes in the virtual screening performance. Although this class of protein targets has a privileged place in contemporary drug discovery and there are thus many compounds that have been assayed against several or even numerous kinases (Fedorov et al., 2007; Gao et al., 2013; Christmann-Franck et al., 2016; Elkins et al., 2016), multitarget action is found only for a small and diverse subset of the whole chemical-biological space (Jasial et al., 2016).

Therefore, kinases and their inhibitors represent an interesting and challenging case that provides useful insights into the influence of the multitarget action of chemical compounds on the success of virtual screening studies (Merget et al., 2017). Moreover, since the multitarget action is by definition an attribute of thoroughly studied compounds, such as FDA-approved drugs (Law et al., 2014), whereas most known compounds are not thoroughly studied, our results may be extrapolated to target classes (Barelier et al., 2015; Munoz, 2017) less extensively studied than kinases, to help achieve better results in virtual screening of a huge chemical library.

# MATERIALS AND METHODS

## Brief Description of PASS

PASS (Filimonov et al., 2014) is a computer program for analysis of structure-activity relationships (SAR) that allows users to perform ligand-based virtual screening for ligands of multiple targets and/or compounds with desired biological activities (Abdou et al., 2017; James and Ramanathan, 2017; Stasevych et al., 2017; Yildirim et al., 2017). Structures of chemical compounds are represented in PASS as a set of 2D atom-centric substructural descriptors called MNA (Multilevel Neighborhoods of Atoms). It was previously shown that MNA descriptors are suitable for implementation in a wide range of qualitative (classification) SAR studies and reflect structural features important for ligand–target interactions (Filimonov et al., 1999). PASS predicts biological activity profiles for chemical compounds in standardized representation: uncharged, single-component, containing at least three carbon atoms, with molecular mass not exceeding 1,250 Da. The majority of drug-like molecules fulfill these conditions, and clipping of the non-drug-like compounds allows us to avoid dealing with non-specific and atypical biological activities. The mathematical approach of PASS is based on a naïve Bayes classifier, and its particular realization in PASS has been described in detail elsewhere (Filimonov et al., 2014).

The result of PASS prediction is a list of probable biological activities arranged in descending order of P<sub>a</sub> − P<sub>i</sub> values, where P<sub>a</sub> is the probability of belonging to the class of "actives," while P<sub>i</sub> is the probability of belonging to the class of "inactives." By default, P<sub>a</sub> − P<sub>i</sub> > 0 is considered as the cutoff for discrimination between "active" and "inactive" molecules. The result of PASS-based virtual screening for a chemical library is the list of molecules predicted as "actives," which could be recommended for biological testing.
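The default selection rule can be illustrated with a minimal sketch that ranks compounds by the difference between the two probabilities and keeps those above the cutoff. This is not the PASS software itself: the function name, the dictionary layout and the probability values are hypothetical, and the probabilities are assumed to have been produced elsewhere for a single activity.

```python
def rank_predicted_actives(predictions, cutoff=0.0):
    """predictions: dict compound_id -> (Pa, Pi) for one activity.

    Returns (compound_id, Pa - Pi) pairs with Pa - Pi > cutoff,
    sorted in descending order of Pa - Pi.
    """
    scored = [(cid, pa - pi) for cid, (pa, pi) in predictions.items()]
    hits = [(cid, score) for cid, score in scored if score > cutoff]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical predictions for three library compounds
predictions = {
    "cmpd_A": (0.82, 0.01),
    "cmpd_B": (0.30, 0.45),
    "cmpd_C": (0.55, 0.20),
}
hits = rank_predicted_actives(predictions)
hit_ids = [cid for cid, _ in hits]
print(hit_ids)
```

Raising `cutoff` above zero shortens the list of compounds proposed for testing, at the cost of discarding some true actives.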

#### Training and Test Datasets

#### Data Acquisition

Every dataset used in this study was formed based on the data contained in the ChEMBL database. We chose ChEMBL because this is one of the largest freely available sources of experimental bioactivity data, its data are well-organized and documented, they are easy to acquire (via graphical web interface or API), and easy to manipulate by setting-up a local version of the database. We used the list of protein kinases and their IDs that is available via the ChEMBL web interface by browsing targets by assigned protein classes to select the subset of targets for this case study.

The training set of chemical structures and activities of chemical compounds tested for inhibition of protein kinases was extracted from the 20th version of the ChEMBL database. The ChEMBL SQL-format dump file (the dump and instructions are available from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl\_20/) was handled in MySQL; SQL queries and PHP scripts were used to manipulate the data and write them to external SD files. Basic validation and comparison of the virtual screening performance were executed using 5-fold cross-validation.

The external test sets contained data from the up-to-date 23rd version of ChEMBL on structures and activities not present in ChEMBL 20 (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl\_23/). ChEMBL 23 contains 1 154 583 new activity records, among which we searched for those related to the targets involved in our study using the following procedure:


#### Data Preparation

It is known that some noise and various contradictions are stored in, and migrate from one source of bioactivity data to another, along with correct records (Kramer and Lewis, 2012; Kalliokoski et al., 2013; Tiikkainen et al., 2013; Papadatos et al., 2015). Thus, it is necessary to filter the data before using them in order to eliminate incorrect data and records that are inconsistent with the goal of the virtual screening study (Fourches et al., 2016). To achieve this goal, we used the procedures described in our previous work (Pogodin et al., 2015) with slight differences, designed to reflect the peculiarities of the targets selected for this study.

#### **Training data preparation**

First, chemical structures were filtered to eliminate incorrect molecular representations and to provide PASS with unambiguous (in the given feature space of MNA descriptors) examples for training and validation. We used an in-house command-line utility (SDF-check) to check structures for PASS compatibility and remove unsuitable ones. In addition, we identified structures having different ChEMBL IDs but the same sets of MNA descriptors, i.e., equivalent structures. We treated such structures as a single one: data on their activities were joined together, and all structures except the first one encountered were deleted from the set.
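The merging of equivalent structures can be sketched as grouping records by their descriptor set and keeping the first identifier encountered. The sketch assumes the descriptors have already been computed and are available as plain strings; the placeholder strings `d1`, `d2`, `d3`, the identifiers and the record layout are hypothetical and do not reproduce the real MNA syntax or the in-house tooling.

```python
def merge_equivalent_structures(records):
    """records: list of (compound_id, descriptors, activities) in encounter order.

    Structures with identical descriptor sets are treated as one structure:
    the first identifier encountered is kept and all activity data are pooled.
    Returns a list of (kept_id, pooled_activities).
    """
    merged = {}  # frozenset of descriptors -> [kept_id, pooled_activities]
    for compound_id, descriptors, activities in records:
        key = frozenset(descriptors)
        if key in merged:
            merged[key][1].extend(activities)
        else:
            merged[key] = [compound_id, list(activities)]
    # dicts preserve insertion order, so encounter order is retained
    return [(kept_id, acts) for kept_id, acts in merged.values()]

# Hypothetical records: the first two share the same descriptor set
records = [
    ("ID_0001", ["d1", "d2"], [("KIN_A", 0.5)]),
    ("ID_0002", ["d2", "d1"], [("KIN_B", 3.0)]),
    ("ID_0003", ["d3"], [("KIN_A", 12.0)]),
]
unique = merge_equivalent_structures(records)
print(unique)
```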

After the filtering and preparation of the structures, data on bioactivities were processed to remove unreliable and inconsistent data points. In this study we used the following endpoints: K<sub>i</sub>, K<sub>d</sub>, IC<sub>50</sub>, and Potency, assessed as the concentration of compound that induces the given response; and Activity, Inhibition, and Residual Activity, assessed as the response of the kinase induced by the given concentration of the compound. In addition to duplicates and incomplete records, the following data were excluded:


Measurements assessed as response of the kinase (Activity, Inhibition, Residual Activity), induced by the given concentration of a chemical compound were transformed to Inhibition for convenience. The problem with the "Activity" records is their ambiguity. Such records may mean both Inhibition and Residual Activity. We clarified the meaning based on the content of the assay description field. Residual Activity and Inhibition are unambiguously connected (Residual Activity = 100 − Inhibition) and it was easier for us to deal with only one (Inhibition) type of measurement.

Records on the bioactivities were filtered semi-automatically, utilizing the content of the "Description" field from the "Assays" table. Distinct "Description" fields were reviewed and, when ambiguous data were detected, analogous records were found using a suitable set of words or regular expressions. Suspicious entries identified in this way were inspected using the original publications and deleted if the suspicions were confirmed.

To improve the validation reliability, we included in the study only those kinases that had at least 100 actives and 100 inactives (determined at the concentration of 1 µM). These limitations also help with the creation of accurate classifiers, which may be used for their primary purpose: to search for novel kinase inhibitors. We did not attempt to balance the sets in terms of the ratio of actives to inactives, not least because we were interested in how the quality of the classifiers depends on this ratio in the training data: two of the studied approaches to training set creation may be considered methods to counter a skewed training data distribution (Rennie et al., 2003).
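The inclusion rule for kinases amounts to counting actives and inactives per target and keeping targets that pass both thresholds. The following minimal sketch assumes the activity labels have already been assigned; the function name, the kinase identifiers and the toy counts are hypothetical.

```python
from collections import Counter

def select_kinases(labels, min_per_class=100):
    """labels: iterable of (kinase_id, is_active), one entry per tested compound.

    Returns the sorted kinase ids having at least `min_per_class` actives
    and at least `min_per_class` inactives.
    """
    actives, inactives = Counter(), Counter()
    for kinase, is_active in labels:
        if is_active:
            actives[kinase] += 1
        else:
            inactives[kinase] += 1
    return sorted(
        k for k in actives
        if actives[k] >= min_per_class and inactives[k] >= min_per_class
    )

# Toy data with a lowered threshold so the example stays small
labels = (
    [("KIN_A", True)] * 2 + [("KIN_A", False)] * 3
    + [("KIN_B", True)] * 5 + [("KIN_B", False)] * 1
)
selected_kinases = select_kinases(labels, min_per_class=2)
print(selected_kinases)
```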

After the filtering of the bioactivities, different measurements of the inhibitory activity were used to create an overall qualitative assessment for each compound, designating it as active or inactive against the particular kinase. As mentioned earlier, some compounds had different types of activity data in our set. Within each type (percentage of kinase inhibition and compound concentration producing a response), median values were calculated when a given kinase-ligand pair had multiple assessments. If concentration data were available for a compound and the value was less than or equal to 1 µM, we designated the compound as active against the particular kinase. Where concentration data were absent, we designated the compound as active if its inhibition of the particular kinase was greater than or equal to 50%. Otherwise the compound was designated as inactive.
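The labeling rule can be written as a short function: concentration-type data take precedence, medians summarize repeated measurements, and the 1 µM and 50% thresholds decide the class. This is a sketch under simplifying assumptions: all concentrations are assumed to be already converted to µM, the concentration at which a percentage was measured is ignored, and the function and argument names are hypothetical.

```python
from statistics import median

def residual_to_inhibition(residual_activity_pct):
    """Residual Activity = 100 - Inhibition, so Inhibition = 100 - Residual Activity."""
    return 100.0 - residual_activity_pct

def label_pair(concentrations_um=(), inhibitions_pct=(),
               conc_cutoff_um=1.0, inhibition_cutoff_pct=50.0):
    """Label one kinase-compound pair as 'active' or 'inactive'.

    concentrations_um: Ki, Kd, IC50 or Potency values in µM
    inhibitions_pct:   percent inhibition values
    Returns None when no measurement is available.
    """
    if concentrations_um:
        # concentration data decide whenever they exist
        return "active" if median(concentrations_um) <= conc_cutoff_um else "inactive"
    if inhibitions_pct:
        return "active" if median(inhibitions_pct) >= inhibition_cutoff_pct else "inactive"
    return None

label_1 = label_pair(concentrations_um=[0.2, 0.9, 5.0])          # median 0.9 µM
label_2 = label_pair(inhibitions_pct=[30.0, residual_to_inhibition(40.0)])  # median 45 %
label_3 = label_pair(concentrations_um=[2.0], inhibitions_pct=[90.0])
print(label_1, label_2, label_3)
```

The third call shows the precedence: a 2 µM concentration value yields "inactive" even though a 90% inhibition record exists for the same pair.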

Initially we extracted 458 863 records on kinase inhibition from ChEMBL. After completing all the procedures described above, we were left with 173 275 data points on kinase inhibitors evaluated relative to the cut-off value of 1 µM (62 309 on true actives and 110 966 on true inactives at the given cut-off). These data characterize interactions of 55 162 compounds with one or more of the 152 human protein kinases selected for this study. These kinases represent all major families of human kinases. Our data cover a significant portion of the human kinome and allow one to search for inhibitors for all kinase families (**Figure 1**).

#### **External test data preparation**

Preparation of the data for external test set was performed in the same way as for the training set data, except for the following differences:


In total, we were able to identify 81 563 new activities against the kinases involved in this study in the 23rd version of ChEMBL. After filtering, 35 317 activities describing the action of 23 004 compounds against kinases remained.

#### **Training set formation approaches**

Filtered training set data on kinase inhibitors were stored in the local MySQL database and used to create the three different training sets described below and presented in **Figure 2**. In addition, each training set was divided into five non-overlapping and equivalent subsets for subsequent stratified 5-fold cross-validation (5-f CV).

#### **Individual sets (I-sets)**

The tested compounds for each kinase were sorted from the most active to the most inactive and, in this order, written to five SD files: the first compound in the ranking was placed into the first subset, the second compound into the second subset, and so on up to the fifth compound in the fifth subset; the sixth compound was then placed into the first subset again, and so on, until every compound had been placed into its corresponding subset. The subsets were created in this way to be equivalent in terms of the total number of compounds and similar to each other in the degree of inhibitory activity of the placed compounds.
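This round-robin assignment is easy to express in code. The sketch below assumes the compounds for one kinase are already sorted from most to least active; the compound identifiers and the IC<sub>50</sub> values used for sorting are hypothetical.

```python
def round_robin_folds(ranked_compounds, n_folds=5):
    """Distribute compounds, already sorted from most to least active,
    over n_folds subsets in rotating order (1st -> fold 1, 2nd -> fold 2, ...)."""
    folds = [[] for _ in range(n_folds)]
    for position, compound in enumerate(ranked_compounds):
        folds[position % n_folds].append(compound)
    return folds

# Hypothetical compounds with IC50 values in µM (lower means more active)
ic50_um = [0.01, 0.05, 0.2, 0.8, 1.5, 3.0, 9.0, 20.0, 50.0, 100.0, 250.0]
compounds = [("c%02d" % i, value) for i, value in enumerate(ic50_um)]
ranked = sorted(compounds, key=lambda c: c[1])
folds = round_robin_folds(ranked)
fold_sizes = [len(fold) for fold in folds]
first_fold_ids = [cid for cid, _ in folds[0]]
print(fold_sizes, first_fold_ids)
```

Because consecutive ranks go to different subsets, each subset spans the whole activity range and the subset sizes differ by at most one compound.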

#### **Merged actives and inactives set (MAI-set)**

Then, we merged the first, second etc. subsets for each of the 152 kinases. If identical compounds were found in different subsets, only the structural formula was retained with all its kinase inhibiting activity data. As a result, we obtained 5 combined MAI-subsets, which were equivalent to the I-subsets because these subsets contained the same active compounds.

#### **Merged actives set (MA-set)**

This set was created in the same manner as MAI-set, but the true inactives were excluded.
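One possible reading of the two merged sets is sketched below for a single cross-validation subset. Each compound carries the set of kinases it inhibits; in the MAI-subset every tested compound is kept, so a compound is implicitly inactive for any kinase missing from its label set, whether it was measured as inactive or simply not tested. In the MA-subset, compounds with no active label at all are dropped. This interpretation, the function name and the toy data are assumptions made for illustration only.

```python
def build_merged_subsets(tested):
    """tested: dict kinase -> dict compound_id -> bool
    (True = true active, False = true inactive) for one CV subset.

    Returns (mai, ma), each a dict compound_id -> set of inhibited kinases.
    """
    mai = {}
    for kinase, results in tested.items():
        for compound, is_active in results.items():
            labels = mai.setdefault(compound, set())
            if is_active:
                labels.add(kinase)
    # MA: keep only compounds that are active against at least one kinase
    ma = {compound: set(kinases) for compound, kinases in mai.items() if kinases}
    return mai, ma

# Hypothetical I-subset contents for two kinases
tested = {
    "KIN_A": {"c1": True, "c2": False},
    "KIN_B": {"c1": False, "c3": True},
}
mai, ma = build_merged_subsets(tested)
print(mai)
print(ma)
```

Here `c3` was never tested against `KIN_A`, yet in both merged subsets it acts as a (conditionally) inactive example for that kinase.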

#### Quality Metrics

We used the following metrics to evaluate the results of our ligand-based virtual screening of kinase inhibitors:

$$\text{SENSITIVITY (RECALL)} = \frac{\text{TP}}{\text{TP} + \text{FN}}\tag{1}$$

$$\text{SPECIFICITY} = \frac{\text{TN}}{\text{TN} + \text{FP}}\tag{2}$$

$$\text{BALANCED ACCURACY} = \frac{1}{2} \left( \frac{\text{TP}}{\text{TP} + \text{FN}} + \frac{\text{TN}}{\text{TN} + \text{FP}} \right)\tag{3}$$

$$\text{PRECISION} = \frac{\text{TP}}{\text{TP} + \text{FP}}\tag{4}$$

$$F1 = 2 \cdot \frac{\text{PRECISION} \cdot \text{RECALL}}{\text{PRECISION} + \text{RECALL}}\tag{5}$$

$$\text{ROC AUC} = P\left(\text{Rank}_{\text{active}_i} < \text{Rank}_{\text{inactive}_i}\right)\ \text{in uniform distribution}\tag{6}$$

$$\text{BEDROC} = P\left(\text{Rank}_{\text{active}_i} < \text{Rank}_{\text{inactive}_i}\right)\ \text{in exponential probability density function (PDF) with parameter } \alpha,\ \text{if } \alpha \cdot R_a \ll 1\tag{7}$$

Metrics (1–6) are appropriate for evaluating the performance of the classification procedure, which determines the upper limit of virtual screening quality under the condition that every compound predicted as active is screened experimentally.
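As an illustration of metrics (1–6), the sketch below computes them from confusion-matrix counts and from prediction scores. The function names and the toy numbers are hypothetical, and ROC AUC is computed directly from its probabilistic definition (the chance that a randomly chosen active is scored above a randomly chosen inactive, ties counted as one half) rather than with any particular package.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Metrics (1)-(5); raises ZeroDivisionError if a denominator is zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def roc_auc(active_scores: Sequence[float], inactive_scores: Sequence[float]) -> float:
    """Metric (6): P(active is ranked ahead of inactive), by pairwise comparison."""
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Toy example with made-up counts and scores.
metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a sketch but not for large screening sets.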

Boltzmann-Enhanced Discrimination of Receiver Operating Characteristic (BEDROC) (Truchon and Bayly, 2007) (Equation 7) is an adaptation of the ROC AUC metric to conditions under which detecting the maximal number of TPs in a certain top fraction of the set is more important than general recognition. Thus, it is designed to evaluate the early detection rate, i.e., to assess the quality of virtual screening under the limitation that only a small fraction of top-rated compounds from the whole library can be evaluated experimentally. The parameter α in BEDROC is inversely related to the size of the top fraction that contributes 80% of the score value, while the other 20% comes from the assessment of the remaining part of the set. The values of α that were used in this study, and the corresponding top fractions of the sets, are given in **Table 1**.
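For readers who want to experiment with the metric, the following is a minimal pure-Python sketch of BEDROC, written as the robust initial enhancement (RIE) rescaled between its minimum and maximum, which is a common way of expressing the Truchon and Bayly definition. It is not the implementation used in this study (which relied on an R package); the function names are hypothetical. The helper `top_fraction_for_share` solves (1 − e^(−αz))/(1 − e^(−α)) = 0.8 for the top fraction z, which is the sense in which α is inversely related to the size of the top fraction.

```python
import math
from typing import Sequence


def bedroc(is_active_ranked: Sequence[bool], alpha: float) -> float:
    """BEDROC for a list ordered from best to worst predicted score."""
    n_total = len(is_active_ranked)
    n_actives = sum(1 for a in is_active_ranked if a)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive")
    ra = n_actives / n_total  # fraction of actives in the set
    # Sum of exponential weights over the 1-based ranks of the actives.
    weight_sum = sum(math.exp(-alpha * rank / n_total)
                     for rank, active in enumerate(is_active_ranked, start=1) if active)
    # Expected value of the same sum if the actives were ranked uniformly at random.
    random_sum = ra * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = weight_sum / random_sum
    rie_max = (1 - math.exp(-alpha * ra)) / (ra * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ra)) / (ra * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


def top_fraction_for_share(alpha: float, share: float = 0.8) -> float:
    """Top fraction of the list that carries `share` of the exponential weight."""
    return -math.log(1 - share * (1 - math.exp(-alpha))) / alpha


# Toy rankings: 5 actives among 100 compounds.
best = bedroc([True] * 5 + [False] * 95, alpha=20.0)   # all actives on top
worst = bedroc([False] * 95 + [True] * 5, alpha=20.0)  # all actives at the bottom
fraction = top_fraction_for_share(20.0)                # roughly 0.08
```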

TABLE 1 | Values of BEDROC parameter α and corresponding top fractions of sets.


# Performance Assessments

#### Stratified 5-Fold Cross-Validation

The training data had been divided into the five subsets in such a way that the average numbers of actives and inactives were approximately equal in all subsets (Refaeilzadeh et al., 2009). Four subsets from each set were used for the training, while one subset was used as the external test set. This procedure was repeated five times; each time a different subset was used as the external test set. The main differences from the standard 5-fold CV were:


The overall scheme for performance evaluation is given in **Figure 3**.

Such a validation procedure provides reliable quality assessments for classifiers, since every compound in the test sets had experimental test results against a particular kinase. In addition, this approach provides conditions for comparison that are close to those observed in real research projects, in which one tries to find a novel activity for a compound already included in the training set with some other activities. Such situations occur in drug repurposing projects or in in silico toxicological studies (Wang Y. J. et al., 2014).

The results of the predictions were assessed using the metrics described in the Materials and Methods section. Unfortunately, at least one of them, BEDROC, may suffer from saturation. To avoid this, the ratio of actives to inactives for a set (Ra in Formula 7) must be low enough to fulfill the condition given in Formula 7.

The condition of a low fraction of actives in the set seems acceptable and reasonable in the context of high-throughput screening, which typically yields a hit rate below 5% (Murray and Wigglesworth, 2017). However, the data on kinase inhibitors from our set do not fulfill this condition. Thus, the saturation effect on BEDROC was expected to affect the results of our study. To avoid BEDROC saturation, we applied random sampling with replacement, as implemented in the R package mlr (Bischl et al., 2016), to the prediction results. We undersampled the actives and oversampled the inactives for each kinase. The under- and oversampling factors were chosen in such a way that the numbers of actives and inactives in the resampled set became approximately 60 and 60 000, respectively (Formulae 8, 9). Thus, we maintained the same actives rate in all resampled sets, which was chosen to be approximately 0.001. This rate is low enough to calculate BEDROC values for each α level selected for this study without the risk of saturation.


The resampling procedure was repeated 5 000 times for each type of set and each kinase to achieve statistical significance in the subsequent assessment of differences between the results. BEDROC values were calculated on the resampled data using the R package enrichVS (http://cran.r-project.org/web/packages/enrichvs/index.html) for each resampled set. ROC AUC was also calculated using the R package pROC (Robin et al., 2011). To speed up the resampling, we performed the calculations in parallel using the R package "parallel" (https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf). The values of the classification quality metrics achieved in cross-validation and the training set composition can be found in Supplementary Table 1.
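The resampling step can be illustrated with Python's standard library. This is a stand-in for the R-based procedure described above, not a port of it: the function name `resample_predictions`, the toy scores, and the small number of repetitions are assumptions chosen to keep the example quick. Drawing with replacement covers both undersampling (more than 60 actives available) and oversampling (fewer available).

```python
import random
from typing import List, Sequence, Tuple


def resample_predictions(active_scores: Sequence[float], inactive_scores: Sequence[float],
                         rng: random.Random, n_actives: int = 60,
                         n_inactives: int = 60_000) -> Tuple[List[float], List[float]]:
    """Draw fixed numbers of actives and inactives with replacement."""
    return (rng.choices(active_scores, k=n_actives),
            rng.choices(inactive_scores, k=n_inactives))


# Toy prediction scores for one hypothetical kinase: 300 actives, 200 inactives.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_actives = [rng.random() for _ in range(300)]
toy_inactives = [rng.random() for _ in range(200)]

# The study repeated this 5 000 times; 3 repetitions are enough for illustration.
resampled_sets = [resample_predictions(toy_actives, toy_inactives, rng) for _ in range(3)]
actives_rate = 60 / (60 + 60_000)  # about 0.001
```

Each resampled pair of score lists could then be ranked and passed to a BEDROC routine.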

#### Virtual Screening of the External Test Set

Data prepared from version 23 of ChEMBL were used to form the test sets according to the procedure used for preparing the training I-sets. During the external validation (Chen et al., 2012) with these sets, we calculated BEDROC values for the resampled prediction results. The values of the classification quality metrics achieved in external validation and the training set composition can be found in Supplementary Table 2.

### Comparison of the Results Obtained Using Different Training Approaches

The Tukey honest significant difference (HSD) test was used along with the analysis of variance to compare the quality of the created PASS classifiers based on the different types of training sets. These quality parameters include BEDROC for the resampled results; sensitivity, specificity, balanced accuracy, precision, F1 score and ROC AUC for the original results. The analysis was performed at a P-value < 0.05 using the functions "aov" and "TukeyHSD" from the R standard library. This provides the ranked lists for three PASS classifiers, which allows one to evaluate their performance.
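To make the first part of this comparison concrete, the sketch below computes the one-way ANOVA F statistic for a metric grouped by training approach. It covers only the F statistic: the P-value and the Tukey HSD pairwise comparisons require the F and studentized range distributions, which the R functions named above provide and which are not reproduced here. The function name and the metric values are made up for illustration.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square divided by within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up metric values for three training approaches (not data from the study).
toy_scores = {
    "I": [0.7, 0.8, 0.9],
    "MAI": [0.6, 0.7, 0.8],
    "MA": [0.5, 0.6, 0.7],
}
f_statistic = one_way_anova_f(list(toy_scores.values()))
```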

# RESULTS

### Stratified 5-Fold Cross-Validation

All classification metrics values averaged over all kinases except the sensitivity values were slightly higher for the results achieved by classifiers trained on I-sets. Statistical analysis indicates that results obtained using the I-sets differ significantly from those obtained with the MA and MAI sets (**Figure 4**). The results of classifiers trained on the MA- and MAI-sets do not differ at the given level of significance from each other.

We used the resampled results to calculate values of BEDROC at different degrees of early recognition of TP (via varying values of α). These values were grouped according to the types of sets used for the training, and then averaged over the kinases in a manner similar to the way the original results were obtained. Statistical analysis of these data shows that classifiers trained on I-sets significantly outperform classifiers trained on MAI-sets and those, in turn, outperform classifiers trained on MA-sets (**Figure 5**) for any α value used in the study.

Also, using the resampled results, we were able not only to compare different training approaches by averaging the values of the selected metrics across kinases, but also to select the most adequate approach for each kinase individually. This was possible because, after the resampling procedure had been repeated 5,000 times, we had enough data points to estimate statistical significance. At the P-value threshold chosen earlier (less than 0.05), we found that for most of the kinases the best approach for training is to use I-sets; nonetheless, for some kinases it is better to use MA- or MAI-sets (**Figure 6**). In total, we identified 13 kinases for which the classifiers trained using MA- or MAI-sets performed better in early recognition of TP at three or more levels of α.

# Virtual Screening of External Test Set

Since we did not impose any limitations on the number of actives and inactives in our external test set, we were not able to calculate values for all the metrics for each kinase. We excluded such kinases before averaging the values of the classification metrics across the different training approaches; thus, only the results for 128 kinases were compared.

The main conclusions from the comparison of Specificity, Balanced Accuracy, and AUC values are similar to those obtained using 5-f CV: training approach I gave significantly better results than the approaches that introduce conditionally inactive compounds (MA and MAI). No significant difference was found for the other metrics (**Figure 7**).

To compare the earliness of actives detection achieved using the different training approaches, we resampled the results of the inhibitory activity prediction for each kinase and calculated BEDROC values. In this part of the study, only results for kinases having at least 20 actives and 20 inactives in the external test set were included. This restriction was imposed to exclude the influence of extreme cases, where only a few actives and inactives exist. Despite this restriction, we were forced to change the resampling protocol in some cases: if a kinase had fewer than 60 actives, we used oversampling instead of undersampling to make sure we had 60 actives.

The main result of the comparison of BEDROC values was concordant with that obtained using 5-f CV: at each value of α, training using I-sets led to better results than training using MA- or MAI-sets, while MAI-sets outperformed MA-sets (**Figure 8**).

# Correlations Between the Values of Metrics and Actives to Inactives Ratio in the Sets


We also analyzed the behavior of the employed accuracy metrics for different actives/inactives ratios, to be sure that they give an unbiased picture.

Values of Precision and F1-score were found to show correlations with the actives to inactives ratio in the test sets. Thus, we conclude that sets' imbalance affects Precision and F1 score values, while the other metrics are significantly more robust (see Supplementary Figure 1), especially AUC and Balanced Accuracy.

# Applicability Domain Estimation


To estimate the applicability domain, we calculated the values of the classification quality metrics for those cases where compounds had a certain number of new MNA-descriptors not found in the training set. In this case we merged the results over all kinases to obtain sufficient numbers of data points.

We showed that in the case of the results achieved using I-sets for training, the performance of the classifiers decreases linearly with increasing number of new MNA descriptors. In contrast to this, for the results achieved using MA- and MAI-sets for training, we were unable to find a strong dependence between the number of new MNA descriptors and the performance of the classifiers. Still, these results should be treated with caution, since the percentage of data points involved in this assessment decreases drastically with increasing number of new MNA descriptors, especially for the classifiers built using MAI- and MA-training sets (see **Figure 9**).

In the case of the classifiers built using I-sets for training we can judge that the applicability domain includes those compounds which have 4 or fewer new MNA descriptors, since the average balanced accuracy and AUC exceeded 0.7.
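The applicability-domain check described here amounts to counting the descriptors of a query compound that never occur in the training set. The sketch below shows that idea with plain Python sets. The descriptor strings are arbitrary placeholders rather than real MNA descriptors, the function names are hypothetical, and counting each new descriptor once per compound is an assumption; the threshold of 4 is the one reported above for classifiers trained on I-sets.

```python
from typing import Iterable, Set


def count_new_descriptors(compound_descriptors: Iterable[str],
                          training_descriptors: Set[str]) -> int:
    """Number of distinct descriptors of the compound that are absent from the training set."""
    return len(set(compound_descriptors) - training_descriptors)


def in_applicability_domain(compound_descriptors: Iterable[str],
                            training_descriptors: Set[str], max_new: int = 4) -> bool:
    """True if the compound has at most max_new previously unseen descriptors."""
    return count_new_descriptors(compound_descriptors, training_descriptors) <= max_new


# Placeholder descriptor strings (not real MNA descriptors).
training = {"d1", "d2", "d3", "d4", "d5"}
familiar_compound = ["d1", "d2", "d9"]                     # 1 new descriptor
novel_compound = ["d1", "n1", "n2", "n3", "n4", "n5"]      # 5 new descriptors
n_new_familiar = count_new_descriptors(familiar_compound, training)
familiar_ok = in_applicability_domain(familiar_compound, training)
novel_ok = in_applicability_domain(novel_compound, training)
```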

### DISCUSSION

In contrast to many contemporary studies in the field of virtual screening, no decoys (Irwin, 2008) were used in this work to assess the enrichment achieved in virtual screening of large datasets. Instead, validation and subsequent comparison of the different training approaches were performed using only experimentally tested compounds, both actives and inactives. Today, due to the constant growth of available computational resources and of the amount of bioactivity data, it is possible to do this using 5-f CV and true external test sets. Moreover, since a negative influence of the conditionally inactive compounds involved in training was shown, this makes us wonder: if conditionally inactive compounds can do harm during training, are decoys good for testing? The exact answer is not known yet, but the risk of reaching wrong conclusions may be mitigated by using resampling-based approaches in parallel with, or instead of, decoys.

Our study represents a quantitative assessment of the tradeoff between the initial requirements on the training data and the quality of PASS-based virtual screening. We have shown that the most efficient training approach for the ligand-based virtual screening system is to use the true actives and inactives for each target. This approach outperformed those where conditionally inactive compounds were introduced, in both classification quality and earliness of detection. Moreover, in this case we observe a strong dependence of the performance on the number of new descriptors in the structures of the test compounds.

According to the analysis of the data from our training set, the higher the number of kinases for which compounds are tested, the more activities are found. Thus, using MA and MAI sets for training, some unknown actives could be treated as conditionally inactives (**Figure 10**). This may shed some light onto the problem of promiscuity of kinase inhibitors, which are often discussed as polypharmacological drugs. However, analysis of the content of bioactivity databases such as ChEMBL has shown that the average degree of promiscuity of such compounds is not so high (Hu et al., 2014). According to our results there is no contradiction between these points of view: kinase inhibitors tend to show promiscuity, but at the moment most of them have been studied against only a rather limited number of kinases.

Nevertheless, using MA and MAI approaches, it is possible to achieve good virtual screening results too, despite the softer requirements on the amount and quality of the training data. These approaches may be implemented in cases when only few active compounds are known, even in the absence of inactives, which helps expand the druggable target space and find new modes of action for existing molecular targets.

From this perspective it is surprising that we also found 13 kinases for which virtual screening may be performed more efficiently using training approaches that introduce conditionally inactive compounds. This means that, using machine learning, it is easier to distinguish between inhibitors of these kinases and compounds tested against other kinases than between their inhibitors and inactives at the given concentration cutoff. This fact can possibly be explained by a systematic bias in the selection of compounds for testing against these kinases. It may also indicate the importance of small structural changes in related targets that lead to larger changes in inhibitor potency, since these 13 kinases are diverse, they belong to different families represented in our set and, in the case of other members of their families, the introduction of conditionally inactive compounds leads to the observed negative consequences. Thus, we show that virtual screening performance may benefit from the introduction of conditionally inactive compounds if these compounds are unfamiliar to the main target. Unfortunately, this knowledge is risky to apply to achieve better results in ligand-based virtual screening, since our knowledge of target-target relations mediated by common ligands is generally based on sparse training sets.

We obtained rather good results of both external (quasi prospective) and cross-validation. However, in case of data on kinase inhibitors extracted from ChEMBL, one initially deals with the pre-selected compounds studied in the appropriate biological activity area, which provides good predictivity, particularly using the approach based on individual sub-sets.

Big libraries like SAVI contain diverse and previously uninvestigated chemical structures, including compounds other than those possessing known ligand-related target signatures (Sidorov et al., 2015). To achieve the best predictivity for such a library, it seems reasonable to perform a pre-selection with the standard PASS approach using conditionally inactive compounds. As mentioned above, PASS provides satisfactory prediction results despite the incompleteness of data in the training set (Poroikov et al., 2000). Moreover, in this work, we showed that classifiers created using the merged training sets did not exhibit a significant dependence between the prediction quality and the number of new MNA descriptors contained in the predicted chemical structures.

Consequently, we propose a two-step procedure for analyzing large and diverse chemical libraries. In the first step, pre-selection is performed using the general classifier that takes the conditionally inactive compounds into account. In the second step, one may discriminate more thoroughly between the active hits and putatively inactive structures using the specific classifier that is based only on the real actives and inactives.
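The proposed two-step procedure can be pictured as two filters applied in sequence. The sketch below is purely schematic: the scoring callables stand in for a general classifier (trained with conditionally inactive compounds) and a specific classifier (trained on true actives and inactives), and the function name, the cutoffs, and the toy scores are all hypothetical.

```python
from typing import Callable, Iterable, List


def two_step_screen(library: Iterable[str],
                    general_score: Callable[[str], float],
                    specific_score: Callable[[str], float],
                    general_cutoff: float,
                    specific_cutoff: float) -> List[str]:
    """Step 1: pre-select with the general classifier; step 2: refine with the specific one."""
    preselected = [c for c in library if general_score(c) >= general_cutoff]
    return [c for c in preselected if specific_score(c) >= specific_cutoff]


# Made-up scores standing in for the outputs of the two classifiers.
general = {"c1": 0.9, "c2": 0.2, "c3": 0.7, "c4": 0.6}
specific = {"c1": 0.8, "c2": 0.9, "c3": 0.3, "c4": 0.7}
hits = two_step_screen(general.keys(), general.get, specific.get,
                       general_cutoff=0.5, specific_cutoff=0.5)
```

In this toy run, `c2` is discarded in the first step even though the specific classifier scores it highly, which is the intended behavior of a sequential filter.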

# CONCLUSIONS

In this study, we compared the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting "active" and "inactive" compounds for the training sets. We used the program PASS to build classifiers based on different subsets of kinase inhibitors extracted from ChEMBL 20 (for training and 5-f CV) and ChEMBL 23 (for external, quasi-prospective validation). The highest classification and early recognition quality was obtained by using individual training sets for each kinase containing only experimental data. Nevertheless, other training strategies can provide acceptable results even in the absence of data on known inactives, which is often the case with novel targets (Russ and Lampel, 2005; Nguyen et al., 2017). We assessed the applicability domain of our classifiers: while classifiers trained using individual sets show a strong dependence of the prediction quality on the novelty of the predicted compounds, training strategies employing merged sets are much less sensitive to this novelty.

Taken together these findings allow us to suggest that one can benefit most from using combinations of different training strategies when exploring huge chemical libraries containing diverse structures of unexplored chemical compounds.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### FUNDING

This work is supported by the Russian Foundation for Basic Research grant No. 17-54-30015-NIH\_a.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00133/full#supplementary-material

### REFERENCES

Abdou, W. M., Shaddy, A. A., and Kamel, A. A. (2017). Structure-based design and synthesis of acyclic and substituted heterocyclic phosphonates linearly linked to thiazolobenzimidazoles as potent hydrophilic antineoplastic agents. Chem. Pap. 71, 1961–1973. doi: 10.1007/s11696-017-0190-z

Afzal, A. M., Mussa, H. Y., Turner, R. E., Bender, A., and Glen, R. C. (2015). A multi-label approach to target prediction taking ligand promiscuity into account. J. Cheminform. 7, 1–14. doi: 10.1186/s13321-015-0071-9


and comparison with the other descriptors. J. Chem. Inf. Comput. Sci. 39, 666–670. doi: 10.1021/ci980335o


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Pogodin, Lagunin, Rudik, Filimonov, Druzhilovskiy, Nicklaus and Poroikov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Discovery of Natural Products as Novel and Potent FXR Antagonists by Virtual Screening

Yanyan Diao<sup>1</sup>, Jing Jiang<sup>1</sup>, Shoude Zhang<sup>1</sup>, Shiliang Li<sup>1</sup>, Lei Shan<sup>2</sup>, Jin Huang<sup>1</sup>, Weidong Zhang<sup>2</sup> and Honglin Li<sup>1</sup>\*

<sup>1</sup> Shanghai Key Laboratory of New Drug Design, State Key Laboratory of Bioreactor Engineering, School of Pharmacy, East China University of Science and Technology, Shanghai, China, <sup>2</sup> Department of Phytochemistry, School of Pharmacy, Second Military Medical University, Shanghai, China

#### Edited by:

Simone Brogi, University of Siena, Italy

#### Reviewed by:

Marco Tutone, Università degli Studi di Palermo, Italy Denis Fourches, North Carolina State University, United States Michal Brylinski, Louisiana State University, United States Francesco Ortuso, Università degli Studi Magna Græcia di Catanzaro, Italy

> \*Correspondence: Honglin Li hlli@ecust.edu.cn

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 08 January 2018 Accepted: 12 April 2018 Published: 30 April 2018

#### Citation:

Diao Y, Jiang J, Zhang S, Li S, Shan L, Huang J, Zhang W and Li H (2018) Discovery of Natural Products as Novel and Potent FXR Antagonists by Virtual Screening. Front. Chem. 6:140. doi: 10.3389/fchem.2018.00140

Farnesoid X receptor (FXR) is a member of the nuclear receptor family and is involved in multiple physiological processes through the regulation of specific target genes. The critical role of FXR as a transcriptional regulator makes it a promising target for diverse diseases, especially those related to metabolic disorders such as diabetes and cholestasis. However, the underlying activation mechanism of FXR remains unclear owing to the absence of proper FXR modulators. To identify potential FXR modulators, an in-house natural product database (NPD) containing over 4,000 compounds was screened by a structure-based virtual screening strategy and a subsequent hit-based similarity searching method. In the yeast two-hybrid (Y2H) assay, six natural products were identified as FXR antagonists that blocked the CDCA-induced SRC-1 association. The IC<sub>50</sub> values of compound 2a, a diterpene bearing a polycyclic skeleton, and compound 3a, named daphneone, with a chain scaffold, are as low as 1.29 and 1.79 µM, respectively. Compared to the control compound guggulsterone (IC<sub>50</sub> = 6.47 µM), compounds 2a and 3a displayed 5- and 3-fold higher antagonistic activities against FXR, respectively. Remarkably, the two representative compounds shared low topological similarities with other reported FXR antagonists. Based on the putative binding poses, the molecular basis of these antagonists against FXR was also elucidated in this report.

Keywords: FXR, antagonist, virtual screening, molecular docking, similarity searching, natural product

### INTRODUCTION

The farnesoid X receptor (FXR, NR1H4), a member of the metabolic nuclear receptor superfamily, regulates the expressions and activities of a broad spectrum of genes. Since 1995 when FXR was isolated from a rat cDNA library for the first time (Forman et al., 1995), the studies on the physiological functions of FXR have been appealing and challenging. FXR is conserved from teleost fish to human beings (Maglich et al., 2003) and is abundantly expressed in liver, intestine, and kidney. As the endogenous receptor of bile acids, FXR can be activated by chenodeoxycholic acid (CDCA), lithocholic acid (LCA), deoxycholic acid (DCA), and many other bile acids (Makishima et al., 1999). In addition, FXR is also reported to exert regulatory roles in lipoprotein and glucose homeostasis, fatty acid and triglyceride synthesis, liver regeneration, and bacterial growth in the intestine (Lee et al., 2006; Wang et al., 2008). All these accumulating data make FXR a promising pharmaceutical target for multiple diseases, especially those related to metabolic disorders such as diabetes and cholestasis (Schaap et al., 2014; Gonzalez et al., 2016; Yuan and Li, 2016; Filho et al., 2017).

As a typical nuclear receptor, FXR shares common structural characteristics with other members of this superfamily: it comprises a highly conserved DNA-binding domain (DBD), a moderately conserved ligand-binding domain (LBD), and a ligand-dependent transcriptional activation domain (AF-2) (Pellicciari et al., 2005). Upon the binding of a proper ligand to the LBD, FXR undergoes a conformational change, which determines whether a coactivator or a corepressor binds efficiently to the AF-2 motif. If FXR is activated by appropriate agonists, coactivators (such as SRC-1, DRIP, and PRMT) are recruited to FXR, which further up- or down-regulates the expression of certain target genes. Antagonists, in contrast, hinder the association of FXR with coactivators (Lew et al., 2004). Although it is widely accepted that FXR participates in many biological processes, the physiological functions of FXR have not been clearly defined, owing to the diversity and complexity of the target genes involved in the FXR signaling pathways (Zhang and Edwards, 2008). Therefore, it is still an essential step to identify potential FXR modulators, which may contribute to the elucidation of the physiological effects of FXR and provide novel opportunities for the treatment of metabolic diseases by targeting FXR.

Apart from the natural bile acid ligands with a steroidal skeleton, over 700 structurally diverse FXR modulators have also been identified (Gaulton et al., 2017), most of which function as agonists (Carotti et al., 2014). Obeticholic acid (6α-ethyl-chenodeoxycholic acid, 6-ECDCA), a semi-synthetic bile acid analog with highly potent FXR agonistic activity (EC<sub>50</sub> = 0.099 µM) (Pellicciari et al., 2002), is the first FDA-approved drug used for treating primary biliary cholangitis (PBC) (Nevens et al., 2016). In contrast, the development of FXR antagonists, which are also useful chemical tools for unraveling the physiological roles and relative clinical significance of FXR (Li et al., 2013), has been less satisfactory, because only a small number of potent FXR antagonists have been reported so far. Guggulsterone, a natural product extracted from the resin of the guggul tree, is the most frequently described FXR antagonist; it blocks agonist-induced coactivator recruitment and decreases hepatic cholesterol in wild-type mice (Urizar et al., 2002). However, research on guggulsterone is still confined to preclinical and academic studies because of the complexity of its mechanism of action (Fiorucci et al., 2010; Yamada and Sugimoto, 2016). Although other natural or synthetic FXR antagonists have also been developed (**Figure 1**; Wu et al., 2002; Dussault et al., 2003; Nam et al., 2007; Choi et al., 2011; Huang et al., 2012; Xu et al., 2015), further pharmaceutically relevant activities have rarely been reported. Herein, six natural products were identified as antagonists from an in-house natural product database (NPD) through a virtual screening strategy and subsequent validation by biological experiments. In the yeast two-hybrid (Y2H) assay (Fields and Sternglanz, 1994; Lin and Lai, 2017), these compounds abolished CDCA-induced FXR activation at micromolar concentrations. We hope that the natural products revealed in this study will offer novel scaffolds for uncovering new FXR regulatory mechanisms and provide insights for the further discovery of FXR modulators.

# MATERIALS AND METHODS

# Structure-Based Virtual Screening

#### Protein Preparation

The crystal structures of FXR-LBD in complex with 6-ECDCA (Mi et al., 2003) (PDB code 1OSV, a dimer, of which chain B was used) and fexaramine (Downes et al., 2003) (PDB code 1OSH, a monomer) were obtained from the Protein Data Bank. The synthetically modified bile acid ligand 6-ECDCA was derived from the structure of CDCA, but showed almost 100-fold more potent FXR agonistic activity than CDCA and did not activate other nuclear receptors. Fexaramine is a synthetic nonsteroidal FXR agonist, which was identified by optimization of a benzopyran-based combinatorially derived library. The coactivators and all the water molecules were removed. Hydrogen atoms and charges were added during a brief relaxation performed using the "Protein Preparation Wizard" workflow in Maestro 10.1. After the hydrogen bond network was optimized, the crystal structure was minimized with the OPLS\_2005 force field until the root-mean-square deviation (RMSD) between the minimized structure and the starting structure reached 0.3 Å.

#### Glide Docking

The grid-enclosing box was placed on the centroid of the crystallographic ligand in the optimized protein structure and defined to enclose residues located within 15.0 Å of the binding pocket. A scaling factor of 0.8 was applied to the van der Waals (VDW) radii of those receptor atoms with partial atomic charges of less than 0.15 to soften the nonpolar parts of the receptor. After addition of hydrogen atoms and ionization at a pH range of 5.0–9.0, the three-dimensional structures of the compounds in the NPD were generated with the Ligprep v3.3 module. The standard precision (SP) and extra precision (XP) approaches of Glide (Friesner et al., 2004; Halgren et al., 2004) were adopted in turn to dock the molecules into the binding site with the default parameters, and only the top-scoring pose for each molecule was retained. After parallel Glide SP scoring using the two different protein structures (PDB codes 1OSV and 1OSH), the top 500 docking poses from each docking calculation were retained and subjected to XP calculation with a more precise scoring function, and the top 200 docking poses from each were retained for further visual inspection.
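The SP-to-XP selection funnel described above can be summarized schematically as follows. The sketch does not call Glide or any Schrödinger API; the scores are made up, the rescoring step is a placeholder callable, and the function name `docking_funnel` is hypothetical. It assumes that a lower (more negative) docking score indicates a better pose, and it would be run once per receptor structure.

```python
from typing import Callable, Dict, List


def docking_funnel(sp_scores: Dict[str, float],
                   xp_rescore: Callable[[str], float],
                   n_sp: int = 500, n_xp: int = 200) -> List[str]:
    """Keep the n_sp best molecules by SP score, rescore them, keep the n_xp best by XP score."""
    # Assumption: lower (more negative) scores are better.
    sp_top = sorted(sp_scores, key=lambda m: sp_scores[m])[:n_sp]
    xp_scores = {m: xp_rescore(m) for m in sp_top}
    return sorted(xp_scores, key=lambda m: xp_scores[m])[:n_xp]


# Made-up scores for six molecules; small cutoffs keep the example readable.
toy_sp = {"m1": -9.0, "m2": -8.5, "m3": -7.0, "m4": -6.5, "m5": -5.0, "m6": -4.0}
toy_xp = {"m1": -6.0, "m2": -10.0, "m3": -9.5, "m4": -7.0, "m5": -12.0, "m6": -11.0}
selected = docking_funnel(toy_sp, toy_xp.get, n_sp=4, n_xp=2)
```

Note that `m5` has the best XP score in the toy table but is never rescored, because it did not pass the SP stage.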

# Hit-Based Similarity Searching

The similarity searching process was carried out in Pipeline Pilot v7.5, and the two most potent FXR antagonists, **2a** and **3a**, were each used as query molecules. The Tanimoto coefficient (Tc) of similarity between the query molecule and the target molecule was calculated using SciTegic functional-class fingerprints with a diameter of 4 (FCFP\_4) (Bender et al., 2009). The minimum Tc was set to a low value of 0.3 to maximize the number of analogs obtained.
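The Tanimoto-based filtering can be illustrated with a short sketch. Fingerprints are represented here as sets of integer feature identifiers; these toy sets are placeholders and are not FCFP\_4 fingerprints, which were computed in Pipeline Pilot. The function names are hypothetical; the 0.3 cutoff is the one stated above.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tc = |A ∩ B| / |A ∪ B| for two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_molecules(query: Set[int], database: Dict[str, Set[int]],
                      min_tc: float = 0.3) -> List[Tuple[str, float]]:
    """Database molecules with Tc >= min_tc to the query, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    return sorted((s for s in scored if s[1] >= min_tc), key=lambda s: s[1], reverse=True)


# Toy fingerprints (arbitrary feature identifiers).
query_fp = {1, 2, 3, 4}
toy_database = {
    "A": {1, 2, 3, 5},                       # Tc = 3/5
    "B": {1, 6, 7, 8},                       # Tc = 1/7
    "C": {3, 4, 9, 10, 11, 12},              # Tc = 2/8
    "D": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},    # Tc = 4/10
}
analogs = similar_molecules(query_fp, toy_database)
analog_names = [name for name, _ in analogs]
```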

# SRC-1 Recruitment Assay

#### Materials

The restriction and modification enzymes used in this work were obtained from New England Biolabs (Beijing, China). p-Nitrophenyl α-D-galactopyranoside (PNP-α-Gal), yeast nitrogen base without amino acids, agar, PEG3350, dimethyl sulfoxide (DMSO), lithium acetate, and glucose were all purchased from Sigma (Shanghai, China). The yeast expression plasmids pGADT7 and pGBKT7 were from Clontech (Palo Alto, CA), and CDCA was from Merck. The dropout supplement free from leucine and tryptophan (-Leu/-Trp DO supplement) was bought from Takara, and salmon sperm DNA was obtained from Invitrogen. The yeast strain AH109 was purchased from Clontech (Palo Alto, CA).

#### Plasmid Construction

Based on the genome sequence of FXRα (GenBank accession no. NC_000012.10), human FXRα-LBD (200–473 AA) was sub-cloned into the vector pGBKT7 using the NdeI and BamHI restriction enzyme sites. The primers used for PCR amplification were as follows: FXRα-LBD (sense) 5′-ATCATATGGAAATTCAGTGTAAATCTAAGCG-3′, (anti-sense) 5′-ATGGATCCTCACTGCACGTCCCA-3′. The recombinant plasmid pGADT7-SRC-1 was prepared as described previously (Lin et al., 2008), by amplifying with the following primers: (sense) 5′-CAGAATTCCATAACAATGACAGACTTTCA-3′ and (anti-sense) 5′-AAGGATCCCACCTTTACATCATCCAGGCT-3′.
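As a quick sanity check on primers of this kind, one can confirm that each carries the intended restriction site. The sketch below searches for the standard NdeI (CATATG) and BamHI (GGATCC) recognition sequences; the helper name is hypothetical, and hyphens or spaces in a primer string are assumed to be typesetting line breaks and are stripped.

```python
# Standard recognition sequences of the two enzymes named in the text.
SITES = {"NdeI": "CATATG", "BamHI": "GGATCC"}


def find_sites(primer, sites=SITES):
    """Return {enzyme: 0-based position} for each recognition site found."""
    seq = primer.replace("-", "").replace(" ", "").upper()
    return {name: seq.find(motif) for name, motif in sites.items() if motif in seq}


# FXRα-LBD primers as printed (5'→3').
sense = "ATCATATG-GAAATTCAGTGTAAATCTAAGCG"
antisense = "ATGGATCCTCACTGCA-CGTCCCA"

print(find_sites(sense))      # NdeI site expected in the sense primer
print(find_sites(antisense))  # BamHI site expected in the anti-sense primer
```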

#### Y2H System Construction

We constructed the Y2H system for FXR by co-transforming yeast with pGBKT7-FXR-LBD (BD) and pGADT7-SRC-1 (AD) according to the lithium acetate method. Briefly, 500 ng of BD and AD were added to 50 µL of competent yeast cells and mixed with 36 µL of lithium acetate, 240 µL of 50% PEG3350, and 50 ng of single-stranded DNA at 30 °C for 30 min, followed by heat shock (250 rpm) at 42 °C for 30 min. The mixture was subsequently spread on a dropout agar plate without leucine and tryptophan (6.7 g/L yeast nitrogen base without amino acids, 1.54 g/L -Leu/-Trp DO supplement, 20 g/L glucose, 20 g/L agar). The plates were incubated at 30 °C for 48 h for yeast growth, and PCR was used to confirm successful transformation.

#### Y2H Assay

We performed the Y2H assay to determine the agonistic or antagonistic activities of the compounds. For hFXR agonist testing, yeast transformants were incubated for 24 h with either the vehicle control (DMSO) or the indicated compounds; for antagonist testing, they were treated with the test compounds plus 10 µM CDCA. Quantitative α-galactosidase activity assays were carried out with PNP-α-Gal as the substrate according to the Clontech manual. Each experiment was repeated three times independently.


The α-galactosidase activity was calculated according to the following formula:

$$\begin{aligned} &\alpha\text{-galactosidase activity [milliunits/(mL} \times \text{cell)]} \\ &\quad = \frac{OD_{410} \times V_f \times 1000}{(\varepsilon \times b) \times t \times V_i \times OD_{600}} \end{aligned}$$

where t is the elapsed time of incubation (min), V<sub>f</sub> is the final volume of the assay (200 µL), V<sub>i</sub> is the volume of culture medium supernatant added (16 µL), OD<sub>600</sub> is the optical density of the overnight culture, and ε × b is the p-nitrophenol molar absorptivity at 410 nm × the light path (cm) = 10.5 mL/µmol (Yeast Protocols Handbook PT3024-1, Clontech).
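The formula translates directly into code. The sketch below uses the constants stated in the text as defaults (V<sub>f</sub> = 200 µL, V<sub>i</sub> = 16 µL, ε × b = 10.5 mL/µmol); the function name and the example readings are hypothetical.

```python
def alpha_gal_activity(od410, od600, t_min, v_f=200.0, v_i=16.0, eps_b=10.5):
    """α-Galactosidase activity in milliunits/(mL × cell).

    od410 : absorbance of the assay mixture at 410 nm
    od600 : optical density of the overnight culture
    t_min : elapsed incubation time in minutes
    v_f   : final assay volume (µL); v_i : supernatant volume added (µL)
    eps_b : molar absorptivity × light path (mL/µmol)
    """
    if od600 <= 0 or t_min <= 0:
        raise ValueError("od600 and t_min must be positive")
    return (od410 * v_f * 1000.0) / (eps_b * t_min * v_i * od600)


# Hypothetical plate reading: OD410 = 0.5, OD600 = 1.0, 60 min incubation.
print(round(alpha_gal_activity(0.5, 1.0, 60), 2))
```

Because V<sub>f</sub> and V<sub>i</sub> enter only as a ratio, any consistent volume unit gives the same result.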

The agonistic activation and inhibition rates (%) were calculated as follows:

$$\text{agonistic activation} = \frac{GA_{\text{treated}}}{GA_{\text{DMSO}}}$$

$$\text{inhibition rate (\%)} = \frac{GA_{\text{CDCA}} - GA_{\text{treated}}}{GA_{\text{CDCA}} - GA_{\text{DMSO}}} \times 100\%$$

where GA indicates α-galactosidase activity.
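A short worked example may help in reading these two quantities. The sketch assumes that the inhibition rate is reported as a percentage of the CDCA-induced window above the DMSO baseline; the function names and GA values are hypothetical.

```python
def agonistic_activation(ga_treated, ga_dmso):
    """Fold activation relative to the DMSO vehicle control."""
    return ga_treated / ga_dmso


def inhibition_rate(ga_treated, ga_cdca, ga_dmso):
    """Percent inhibition of the CDCA-induced signal.

    0% means the compound leaves the CDCA response untouched;
    100% means the signal is pushed back to the DMSO baseline.
    """
    window = ga_cdca - ga_dmso
    if window == 0:
        raise ValueError("CDCA and DMSO controls give the same activity")
    return (ga_cdca - ga_treated) / window * 100.0


# Hypothetical activities: DMSO = 2, CDCA = 10, CDCA + compound = 6.
print(agonistic_activation(6.0, 2.0))   # fold over vehicle
print(inhibition_rate(6.0, 10.0, 2.0))  # percent inhibition
```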

#### Chemistry

The NPD is our in-house collection of over 4,000 natural products isolated from about 100 plants; their structures were established by extensive spectroscopic analysis. The purities of all compounds were checked by NMR and HPLC (purities ≥ 95%). The detailed data for the natural products mentioned in this report are listed below.

**1a**: Abiesatrine B, isolated from Abies georgei; amorphous powder; ESI-MS: m/z 491 [M + Na]+; <sup>1</sup>H-NMR (600 MHz, CD3OD, δ): 2.28 (m), 1.83 (m), 1.95 (m), 1.58 (m), 3.39 (m), 1.45 (m), 1.92 (m), 5.66 (m), 1.42 (m), 2.25 (m), 1.81 (dd, J =14.7, 2.4 Hz), 5.56 (dd, J = 8.4, 2.4 Hz), 1.45 (m), 1.92 (m), 0.96 (s), 0.95 (s), 2.21 (m), 0.88 (d, J = 6.3 Hz), 2.92 (dd, J =14.1, 1.8 Hz), 2.23 (m), 6.86 (d, J = 1.5 Hz), 2.16 (d, J = 1.5 Hz), 0.94 (s), 0.92 (s), 1.20 (s). <sup>13</sup>C-NMR (150 MHz, CD3OD, δ): 30.7, 26.5, 77.2, 38.0, 39.4, 24.3, 120.1, 147.5, 52.6, 36.0, 29.1, 123.8, 157.4, 51.2, 37.9, 39.2, 47.6, 25.4, 22.8, 40.2, 16.2, 49.4, 205.5, 129.7, 149.1, 15.9, 174.6, 28.9, 23.6, 26.6.

**1b**: (24Z)-3,23-Dioxo-9βH-lanosta-7,24-dien-27-oic acid, isolated from Abies georgei; amorphous powder; ESI-MS: m/z 467 [M - H]−; <sup>1</sup>H-NMR (300 MHz, CD3OD, δ): 5.67 (1H, dt, J = 7.5, 2.7 Hz), 1.94 (3H, d, J = 1.2 Hz), 1.08 (3H, s), 1.07 (3H, s), 1.05(3H, s), 0.99 (3H, s), 0.95 (3H, d, J = 6.0 Hz), 0.82 (3H, s); <sup>13</sup>C-NMR (75 MHz, CD3OD, δ): 35.2, 35.3, 221.7, 48.1, 53.7, 24.0, 122.8, 149.9, 46.8, 37.0, 21.9, 35.6, 45.2, 53.2, 34.2, 29.5, 54.7, 22.9, 23.5, 34.2, 20.8, 50.1, 200.0, 128.1, 150.2, 173.6, 28.4, 21.7, 27.9.

**1c**: Abiesatrine D, isolated from Abies georgei; amorphous powder; ESI-MS: m/z 477 [M + H]+; <sup>1</sup>H-NMR (600 MHz, CD3OD, δ): 1.73 (m), 1.61(m), 2.50 (dt, J = 7.5, 1.8 Hz), 1.42 (dt, J = 12.0, 1.2 Hz), 1.94 (m), 1.89 (m), 5.65 (dt, J = 7.8, 2.7 Hz), 2.21 (m), 1.64 (m), 1.85 (m), 1.72 (m), 1.60 (m), 1.43 (m), 1.96 (m), 1.29 (m), 1.54 (m), 0.79 (m), 1.00 (s), 1.42 (m), 0.92 (d, J = 6.6 Hz), 1.63 (m), 1.58 (m), 2.24 (m), 2.15 (m), 6.19 (dt, J = 7.5, 1.2 Hz), 1.85 (brs), 1.10 (s), 1.11 (s), 1.03 (s). <sup>13</sup>C-NMR (150 MHz, CD3OD, δ): 34.2, 34.3, 219.1, 47.0, 52.4, 23.0, 121.5, 148.7, 45.5, 35.8, 20.9, 34.4, 44.0, 51.9, 33.1, 29.7, 53.0, 22.4, 23.1, 36.1, 18.2, 34.6, 26.0, 145.7, 126.6, 172.6, 12.0, 28.0, 21.3, 27.4.

**2a**: 15-Hydroxy-7-oxo-8,11,13-abietatrien-18-oic acid, isolated from Abies georgei; amorphous powder; ESI-MS: m/z 329 [M - H]−; <sup>1</sup>H-NMR (300 MHz, CD3OD, δ): 8.06 (1H, d, J = 2.1 Hz), 7.72 (1H, dd, J = 8.4, 2.1 Hz), 7.43 (1H, d, J = 8.4 Hz), 1.51 (6H, s), 1.31 (3H, s), 1.27 (3H, s); <sup>13</sup>C-NMR (75 MHz, CD3OD, δ): 39.1, 19.5, 38.2, 48.2, 45.7, 38.6, 201.5, 131.4, 155.9, 38.6, 124.9, 132.2, 149.1, 124.0, 72.6, 31.7, 31.7, 183.4, 17.4, 23.8.

**2b**: 17-Nor-7,15-dion-8,11,13-abietatrien-18-oic acid, isolated from Abies georgei; amorphous powder; ESI-MS: m/z 313 [M - H]−; <sup>1</sup>H-NMR (600 MHz, CD3OD, δ): 2.48 (m), 1.61 (dt, J = 6.9, 3.0 Hz), 1.81 (m), 1.80 (m), 2.68 (dd, J = 14.2, 3.0 Hz), 2.86 (dd, J = 17.6, 14.2 Hz), 7.63 (d, J = 8.4 Hz), 8.18 (dd, J = 8.4, 2.1 Hz), 8.52 (d, J = 2.1 Hz), 2.61 (s), 1.35 (s), 1.32 (s). <sup>13</sup>C-NMR (150 MHz, CD3OD, δ): 38.2, 19.1, 37.8, 47.5, 45.0, 38.8, 199.5, 132.0, 161.8, 39.5, 125.9, 134.5, 136.6, 128.5, 199.3, 26.7, 181.0, 16.9, 23.5.

**3a**: Daphneone, isolated from Daphne odora Thunb. var. marginata; white powder; ESI-MS: m/z 255 [M + H]+; <sup>1</sup>H-NMR (500 MHz, DMSO-d6, δ): 2.92 (2H, t, J = 6.0Hz), 1.61 (2H, t, J = 3.0 Hz), 1.61 (2H, t, J = 3.0 Hz), 2.60 (2H, t, J = 6.0 Hz), 7.84 (2H, d, J = 9.0 Hz), 6.85 (2H, d, J = 9.0 Hz), 6.85 (2H, d, J = 9.0 Hz), 7.84 (2H, d, J = 9.0 Hz), 7.18 (m), 7.27 (m), 7.18 (m), 7.27 (m), 7.18 (m), 10.28 (s); <sup>13</sup>C-NMR (125 MHz, DMSO-d6, δ): 23.7, 30.5, 34.9, 37.1, 115.1, 115.1, 128.3, 128.2, 128.1, 125.5, 128.1, 128.2, 130.3, 130.3, 142.0, 161.8, 198.1.

**3b**: Daphneolon, isolated from Daphne odora Thunb. var. marginata; white powder; EI-MS: m/z 270 [M]; <sup>1</sup>H-NMR (500 MHz, CD3OD, δ): 1.80 (2H, m), 2.72 (1H, m), 3.00 (1H, m), 3.08 (1H, m), 3.09 (1H, m), 4.14 (1H, m), 6.80 (2H, d, J = 7.0 Hz), 7.11 (1H, m), 7.15 (4H, m), 7.84 (2H, d, J = 7.0 Hz); <sup>13</sup>C-NMR (125 MHz, CD3OD, δ): 31.0, 38.3, 44.6, 66.9, 114.3, 124.8, 127.4, 127.5, 128.5, 130.1, 141.5, 161.9, 198.1.

**3c**: Daphnenone, isolated from Daphne tangutica Maxim; white powder; EI-MS: m/z 252 [M]+; <sup>13</sup>C-NMR (125 MHz, DMSO-d6, δ): 187.3, 125.9, 146.7, 33.5, 33.7, 128.7, 130.9, 115.3, 162.0, 115.3, 130.9, 141.0, 128.3, 128.3, 125.8, 128.3, 128.3.

**3d**: p-Coumaric acid, isolated from Incarvillea mairei var. grandiflora (Wehrhahn) Grierson; white amorphous powder; ESI-MS: m/z 164 [M]+; <sup>1</sup>H-NMR (600 MHz, CD3OD, δ): 6.32 (1H, d, J = 16.0 Hz), 7.37 (1H, d, J = 16.0 Hz), 7.35 (2H, d, J = 8.5 Hz), 6.77 (2H, d, J = 8.5 Hz), 5.0 (4-OH), 6.41 (-H), 6.77 (3-H,5-H), 7.35 (2-H,6-H), 7.37 (7-H), 11.0 (-H); <sup>13</sup>C-NMR (150 MHz, CD3OD, δ): 128.7 (s), 130.1 (d), 116.6 (d), 159.8 (s), 116.6 (d), 130.1 (d), 141.5 (d), 122.6 (d), 171.5 (s).

**3e**: Ethylparaben, isolated from Aeschynanthus bracteatus; colloidal; ESI-MS: m/z 175 [M + Na]+; <sup>1</sup>H-NMR (300 MHz, CD3OD, δ): 7.97 (1H, d, J = 9.0 Hz), 6.87 (1H, d, J = 9.0 Hz), 6.87 (1H, d, J = 9.0 Hz), 7.97 (1H, d, J = 9.0 Hz), 4.35 (2H, d, J = 7.2 Hz), 1.26 (3H, s); <sup>13</sup>C-NMR (75 MHz, CD3OD, δ): 123.1, 131.9, 115.1, 159.7, 115.1, 131.9, 162.9, 60.7, 14.4.

#### RESULTS AND DISCUSSIONS

#### Protein Structure Selection and Redocking Validation

Binding of a suitable modulator to the LBD is the molecular basis of FXR activation: it triggers a conformational change in FXR, the subsequent association of coactivators, and ultimately the regulation of target genes. Previous studies have shown that FXR can be activated by structurally diverse agonists, and an array of crystal structures of FXR-LBD complexed with agonists have been solved. The agonist-binding pocket is positioned in the interior of the LBD, and agonists with different skeletons display markedly distinct binding features (Maloney et al., 2000; Soisson et al., 2008; Flatt et al., 2009; Jin et al., 2013). Computational studies, along with crystallographic experiments (Xu et al., 2015), support the notion that the agonist-binding pocket can also be occupied by antagonists (Meyer et al., 2005). Given the variability of the binding pockets occupied by distinct modulators, it is necessary to use different FXR crystallographic models in the virtual screening process in order to maximize the diversity of hit compounds. When we initiated this study, the Protein Data Bank contained a total of 27 crystal structures of FXR in complex with different modulators. After removing the structures bound to structurally analogous ligands and those without published biological data, 10 unique protein structures were retained for further analysis.

Protein flexibility has a vital influence on ligand recognition, and even subtle protein conformational changes can significantly affect the results of docking simulations (Kitchen et al., 2004). Nevertheless, the receptor is usually held rigid in most docking procedures, including the Glide protocol used in this study, to speed up virtual screening of large databases. To compensate for the limitations of a rigid protein conformation, as well as to simplify the computational simulations, we decided to select two receptors whose conformations differ dramatically from those of the other reported crystal structures and to perform two independent docking calculations. Both the binding site similarity and the ligand similarity profiles were taken into consideration for docking model selection. On the one hand, pairwise binding pocket similarities among the 10 structures were calculated using our in-house program PocketShape, which is designed for computational evaluation of binding site similarity based on pocket shape and properties and can be accessed through the webserver SiteMapper (http://lilab.ecust.edu.cn/). Residues within 5 Å of the ligand were extracted as the binding site for each structure. Typically, two binding sites with a score above 0.8 are considered similar. On the other hand, pairwise molecular similarities among the 10 crystallographic ligands were calculated with SciTegic FCFP\_4 fingerprints in Pipeline Pilot v7.5, and a Tc cutoff value of 0.6 was set to define similarity.

The binding pocket and ligand similarity values of the 10 unique crystal structures are plotted in Figure S1. For each model, the average pocket and ligand similarity values were also calculated to evaluate its uniqueness. Four crystal structures (1OSH, 1OSV, 3OLF, and 4WVD) were considered, as each ranks among the five lowest average values in either the pocket or the ligand similarity calculations. Although both the pocket and ligand similarity values of 4WVD scatter in a low range, its ligand ivermectin is a large macrocyclic lactone (Jin et al., 2013) that expands the ligand binding pocket to a volume of 1081 Å<sup>3</sup> (Dundas et al., 2006), which is likely to cause artificial enrichment of large, high-molecular-weight molecules in the docking procedure (Verdonk et al., 2004). Therefore, we did not choose 4WVD. The Tc values of the FXR agonist 6-ECDCA in 1OSV to the other ligands are relatively low, predominantly scattering below 0.2, and its binding pocket is also unique, with an average similarity value of 0.62. Moreover, 6-ECDCA is the only FDA-approved FXR modulator so far. Therefore, the 1OSV model was selected first for docking simulation (Mi et al., 2003). After thorough analysis of the binding interactions, the biological activities against FXR, and the X-ray crystal parameters, 1OSH was preferred as the second model, as its ligand has a lower Tc value to 6-ECDCA than that of 3OLF (0.13 vs. 0.15).
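The uniqueness ranking described above amounts to averaging each structure's similarity to all the others and sorting in ascending order. The following sketch shows that calculation on a symmetric similarity matrix; the structure labels and similarity values are invented for illustration and are not the values plotted in Figure S1.

```python
def mean_similarity_to_others(sim):
    """For each row i of a square similarity matrix, average sim[i][j] over j != i."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


def most_unique(names, sim, k):
    """Return the k structures with the lowest average similarity to the rest."""
    means = mean_similarity_to_others(sim)
    ranked = sorted(zip(names, means), key=lambda pair: pair[1])
    return [name for name, _ in ranked[:k]]


# Hypothetical 3 × 3 pocket-similarity matrix (diagonal = self-similarity).
names = ["A", "B", "C"]
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(most_unique(names, sim, k=2))
```

Applying the same ranking to the pocket matrix and to the ligand Tc matrix, and taking structures that rank low in either, reproduces the kind of shortlist described in the text.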

Alignment of the 6-ECDCA and fexaramine binding sites revealed substantial differences in both shape and surrounding residues (Figure S2A). The marked dissimilarities of the two ligands in topological structure and induced binding conformation cause them to extend into non-overlapping space. To evaluate the docking accuracy of Glide, the two co-crystallized ligands were redocked into the active pocket of their respective structures using Glide XP scoring. Superimposition of the best redocked poses on the experimental structures (Figures S2B,C) gave RMSD values of 0.53 Å for 6-ECDCA and 0.65 Å for fexaramine, indicating that Glide accurately reproduces the bioactive conformations of the ligands in our two docking models. Cross-docking calculations were also carried out, in which each crystallographic ligand was docked into the binding pocket of the other structure. The ligand 6-ECDCA could be docked into 1OSH in Glide XP mode, but the predicted binding pose deviated greatly from the crystallized bioactive conformation (Figure S2D). In agreement with the large differences between the two binding sites, fexaramine could not be accommodated by the smaller 6-ECDCA-binding site of 1OSV; hence, no docking pose was obtained.
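The RMSD used to judge redocking accuracy can be illustrated as follows. This sketch assumes that both poses list the same heavy atoms in the same order and are already expressed in the receptor coordinate frame, so no additional fitting is performed; the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (Å) between two equally ordered
    lists of (x, y, z) atom coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must contain the same non-zero number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical two-atom "ligand": the docked pose is shifted 1 Å along z.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(crystal, docked))
```

A value below about 2 Å is a commonly used rule of thumb for a successful redocking, which the reported 0.53 Å and 0.65 Å satisfy comfortably.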

#### Virtual Screening

To search for potent FXR modulators, a structure-based virtual screening strategy, an effective method to identify novel ligands based on predicted binding poses and docking scores, was executed using Glide v6.6 (Maestro v10.1, Schrödinger Inc.). We speculated that if the docking pose of a compound in the agonist-binding pocket was computationally favorable, the compound could be an effective FXR modulator, either an agonist or an antagonist. The screened natural products database is a collection of over 4,000 natural products isolated from about 100 plants, the structures of which have been validated by our researchers. After hierarchical virtual screenings implemented independently with Glide (**Figure 2**) using two crystal structures of the receptor, a total of 400 top-ranking compounds were retrieved as candidates from the NPD. These candidates were then subjected to visual inspection to remove likely nonbinders. Considering the key interactions observed in the crystal structures, such as predominant VDW complementarity and some critical hydrogen bonds with polar residues, each docked pose was carefully checked and inappropriate compounds were discarded. Particular attention was paid to the sizes of the candidates, and compounds with relatively large groups protruding from the binding pocket were not considered. Additionally, at most three compounds sharing the same scaffold were retained, to maximize structural diversity. A total of 30 candidates were finally selected for bioactivity assays. In the coactivator-recruitment assay based on the Y2H system, none of the 30 compounds enhanced the association of SRC-1 with FXR-LBD; thus, no agonist was found. Intriguingly, four compounds (**1a, 2a, 3a, 3c**) strongly inhibited the CDCA-induced SRC-1 recruitment, with inhibition rates higher than 50% at a concentration of 25 µM, displaying apparent FXR antagonistic profiles. The IC<sub>50</sub> values of the four compounds were determined (**Table 1**, Figure S4), with guggulsterone as the reference compound.

Starting from the potent FXR antagonists **2a** and **3a** as hit compounds, the in-house NPD was re-screened by similarity searching to obtain more potent derivatives and to establish the underlying structure-activity relationships (SAR). An initial minimum Tc value of 0.6 was set to retrieve analogous compounds. Unfortunately, only a few derivatives were obtained for each hit compound, probably owing to the heterogeneity of the in-house NPD. We therefore chose a relatively low threshold of 0.3 to maximize the number of analogs obtained. Subsequently, the analogs were manually checked, and only compounds possessing the same skeleton as the hit compound and of suitable size were selected. In the subsequent in vitro Y2H assay, two compounds, **3d** and **3e**, with Tc values of 0.4 and 0.41, respectively, to compound **3a**, were found to display moderate antagonistic activities against FXR. As illustrated in **Table 1**, compounds **3d** and **3e** share the phenol moiety with compound **3a**, and their IC<sub>50</sub> values were 14.1 and 19.3 µM, respectively.

To evaluate the performance of the virtual screening strategy adopted in this study, the rankings and docking scores of the newly identified natural products were retrospectively examined. The distribution of docking scores of the top-ranking candidates retained from the Glide calculations is presented in Figure S3. The natural compounds bearing different scaffolds, including **1c**, **2a**, **2b**, **3b**, and **3c**, were ranked in the top 500 candidates in the Glide SP docking with both crystal structures 1OSV and 1OSH, whereas the Glide XP results are entirely different. As shown in **Table 2**, compounds whose chemical skeleton resembles that of the crystallographic ligand tend to score higher. Moreover, none of the identified FXR antagonists was ranked in the top 200 candidates by both Glide XP calculations simultaneously, which supports the rationale for using two distinct crystal structures in structure-based virtual screening. Despite their confirmed moderate antagonistic effects against FXR, the smaller compounds **3d** and **3e** were excluded after the initial Glide SP screening, which may be ascribed to the recognized bias of structure-based virtual screening toward high-molecular-weight compounds (Pan et al., 2003). The results also demonstrate that 2D molecular similarity searching is a powerful approach complementary to structure-based virtual screening, as it can retrieve biologically active compounds that docking simulations treat as false negatives.

TABLE 1 | Chemical structures and activities of FXR antagonists and their analogs reported in this study<sup>a</sup>.

<sup>a</sup>Data shown are the average values of triplicate measurements determined by Y2H assays. This system employs the interaction between hFXR-LBD and the coactivator SRC-1. <sup>b</sup>Attempts to determine IC<sub>50</sub> values were made if the inhibition rate at 25 µM was larger than 50%.



<sup>a</sup>Compounds that were ruled out by structure-based virtual screening process but recovered using similarity searching method are labeled with \*.

#### Structural Novelty Assessment

The six natural products are reported here for the first time to show FXR antagonistic activity. To evaluate their structural novelty with respect to known FXR antagonists, pairwise Tc values of chemical similarity were calculated based on FCFP\_4 fingerprints. Because the Tc value between the similar compounds **3a** and **3c** is 0.67, a Tc cutoff of 0.6 was set to define similarity. Fifteen structurally diverse FXR antagonists, including seven natural products (compounds **4**–**21**, Table S1), were collected from the literature, among which the maximum Tc value is 0.46. As shown in **Figure 3**, all Tc values of the six newly identified hits to the 15 known FXR antagonists were below 0.4, and the maximum Tc values of compounds **1a**, **2a**, and **3a** were 0.38, 0.25, and 0.31 (Table S2), respectively. Accordingly, the six natural products can be considered structurally novel FXR antagonists.
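The novelty criterion reduces to a nearest-neighbor test: a hit counts as novel if its highest Tc to any known antagonist stays below the similarity cutoff. The sketch below applies that test to the three maximum Tc values quoted above; the function name and the dictionary layout are assumptions.

```python
def is_structurally_novel(max_tc_to_known, cutoff=0.6):
    """True if the closest known compound is still below the similarity cutoff."""
    return max_tc_to_known < cutoff


# Maximum Tc of each hit to the known FXR antagonists, as quoted in the text.
max_tc = {"1a": 0.38, "2a": 0.25, "3a": 0.31}
novel = {name: is_structurally_novel(tc) for name, tc in max_tc.items()}
print(novel)
```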

#### Analysis of Predicted Binding Poses

From the structural point of view, the six antagonists can be simply categorized into two classes: terpenes possessing polycyclic skeletons and phenols with chain scaffolds. In order to better delineate SAR, their inactive analogs were also displayed and analyzed in this study.

FIGURE 3 | Heatmap presentation of topological similarities of the six natural products to the 15 reported FXR antagonists.

#### Terpenes

Compounds **1a** and **2a** are a triterpene and a diterpene, respectively (Yang et al., 2010), and both were isolated from Abies georgei, which grows exclusively in China. In a previous study, compound **1a** was reported to have a moderate agonistic effect on the estrogen receptor (ER). The polycyclic ring skeletons of the two compounds are very similar to that of the bile acids, especially for the tetracyclic triterpene **1a**. However, the Tc values of compounds **1a** and **2a** to 6-ECDCA are as low as 0.36 and 0.28, respectively. Intriguingly, despite their opposite activities against FXR, the proposed binding poses of the two compounds closely resemble that of 6-ECDCA when interacting with FXR (**Figure 4**). Similar to 6-ECDCA, compound **1a** adopts a cis-orientation at the A/B ring junction, which is considered a unique feature of bile acids. The ring skeleton fits the binding pocket well through favorable VDW contacts and hydrophobic effects with adjacent residues. The 3α-OH group extends into the space between helix 7 and helix 10/11 and putatively participates in hydrogen bond interactions with residues Tyr358 and His444. Additionally, the terminal carboxyl group could interact with residue Arg328, located at the entrance of the binding pocket, through salt bridge or hydrogen bond interactions. All the interactions described above favor the binding of compound **1a** to FXR. However, the hydrogen bond formed between the 7α-OH of 6-ECDCA and Tyr366 is absent for compound **1a** because of its structural variation on ring B. Under the physical-shape discrimination mechanism employed by FXR (Mi et al., 2003), bile acids without a 7α-OH group, such as LCA and DCA, show extremely weak affinity for FXR, which may also explain the moderate antagonistic activity of compound **1a** against FXR.

For the analogs in which the 3α-OH group is replaced by a carbonyl group (compounds **1b** and **1c**), no detectable antagonistic activity was found in the coactivator recruitment assay. The planarity of the double bond may restrict the carbonyl oxygen atom to a position distant from residues Tyr358 and His444, and the absence of the corresponding hydrogen bond interactions presumably results in the loss of antagonistic effects against FXR.

Compared with compound **1a**, compound **2a** has a relatively smaller volume but displayed 10-fold stronger antagonistic activity. Apparently, compound **2a** does not fit the canonical view that antagonists of nuclear receptors are usually more voluminous than agonists (Meyer et al., 2005). In the proposed binding pose with FXR, compound **2a** shows the same amphipathic properties as the bile acid ligands. The oxygen atom of the carboxyl group putatively forms hydrogen bond interactions with residues Ser329 and Tyr366. At the other end of compound **2a**, hydrogen bond interactions could also form between the hydroxyl group and Arg328. The hydroxyl group seems to be essential for the antagonistic activity of compound **2a**, as its analog with an acetyl group (compound **2b**) exhibited no observed activity against FXR. Owing to the methyl group located at the 10α-position, the carboxyl-substituted hydrocarbon ring A protrudes from the plane of the benzene ring, making VDW contacts and hydrophobic interactions with Met325 on helix 5. In addition, the fused ring structure could interact with loop H1-H2 (Met262), helix 3 (His291 and Met287), and helix 6 (Leu345) through favorable hydrophobic and VDW interactions.

Because of its relatively small volume, compound **2a** cannot extend into the pocket occupied by rings A and B of 6-ECDCA; hence, no direct interactions with helix 10/11 and helix 12 were observed. Previous studies have suggested that the π-cation interaction between His444 on helix 10/11 and Trp466 on helix 12 plays a critical role in the active conformation of helix 12 induced by endogenous bile acids (Mi et al., 2003; Pellicciari et al., 2005). Steroid agonists with a 3α-OH group could facilitate the π-cation interaction by positioning His444 appropriately through the steric restriction imposed by the hydrogen bonds formed between 3α-OH and residues His444 and Tyr358. Consequently, unable to establish the triad of Tyr358, His444, and Trp466, compound **2a** could not secure helix 12 in the active conformation, thus preventing the recruitment of coactivators.

#### Phenols

Compounds **3a** and **3c**, named daphneone and daphnenone, respectively (Zhang et al., 2006), are constituents of Daphne odora Thunb. var. marginata, an ornamental plant whose growth is restricted to the south of China. Given their simple structures and small volumes, it is remarkable that the two compounds antagonize the CDCA-induced SRC-1 association with FXR. Moreover, it is difficult to find common structural features between the two compounds and known agonists or antagonists. Accordingly, we turned to the initial docking poses to probe the structural basis of the FXR antagonistic profiles of this chemical series (**Figure 5**).

The two compounds were selected from the virtual screening process using the crystal structure of the FXR-fexaramine complex. Compound **3a** is sandwiched in the cleft enclosed by helices 5, 7, and 10/11, partially overlapping with the fexaramine-binding pocket. The hydroxyl group points toward helix 5 and may form hydrogen bonds with residues Ser336 and His298. At the other end, the benzene ring moiety extends into the aromatic residue-rich groove formed by Phe288 (helix 7), Trp458 (helix 10/11), and Phe465 (loop H10/11-H12), and probably contacts these residues through favorable π-stacking interactions. The linker between the two benzene rings fits the pocket through hydrophobic interactions and VDW contacts with surrounding residues such as Leu291, Met294, Ile356, and Ile361.

TABLE 3 | In silico predicted properties of the six FXR antagonists.

The recommended ranges by QikProp are as follows:

<sup>a</sup>Molecular weight, 130.0–725.0.

<sup>b</sup>Number of hydrogen bond donors, 0.0–6.0.

<sup>c</sup>Number of hydrogen bond acceptors, 2.0–20.0.

<sup>d</sup>Number of non-trivial rotatable bonds, 0–15.

<sup>e</sup>Predicted aqueous solubility, −6.5–0.5.

<sup>f</sup>Predicted octanol/water partition coefficient, −2.0–6.5.

<sup>g</sup>Predicted apparent Caco-2 cell permeability in nm/sec, <25 poor, >500 great.

<sup>h</sup>Number of violations of Lipinski's rule of five, maximum is 4.

<sup>i</sup>Number of violations of Jorgensen's rule of three, maximum is 3.

Previous molecular dynamics simulation studies suggested that the intrinsically unstable loop H10/11-12 controls the flexibility of helix 12 (Costantino et al., 2005). Through an offset face-to-face π-stacking interaction between the benzene ring and Phe465, both compounds **3a** and **3c** could contact the loop directly, which may perturb the conformation of the loop and push helix 12 away from its active conformation. For compound **3c**, which has a double bond in the linker region, the antagonistic effect in the SRC-1 recruitment assay is slightly weaker. Presumably, the more flexible saturated hydrocarbon linker is better suited to binding FXR-LBD, hence compound **3a** displayed stronger antagonistic activity. When a hydroxyl group was introduced into the hydrocarbon linker, the antagonistic activity markedly decreased, with compound **3b** displaying an IC<sub>50</sub> value higher than 25 µM. In contrast, the much smaller compounds **3d** and **3e**, which share the phenolic moiety with **3a**, exhibited moderate antagonistic effects against FXR. Compounds **3d** and **3e** were docked to FXR in Glide XP mode using the crystal structure 1OSH. As illustrated in Figure S5, the two small molecules occupy merely a fraction of the fexaramine-binding pocket. Hydrophobic effects and shape complementarity presumably dominate the interactions with FXR, as no hydrogen bond was detected in their proposed binding poses.

Notably, compounds **3a** and **3c** have previously been reported to show cytotoxic activities against a variety of human tumor cell lines, including K562, A549, MCF-7, LOVO, HepG2, and A375-S2, with IC<sub>50</sub> values ranging from 3.12 to 51.0 µM (Zhang et al., 2006; Wang et al., 2012). In addition, the agonistic profiles of compounds **3d** and **3e** against ER have been described in a previous study (Cao et al., 2013). The phenol FXR antagonists identified in this study are relatively small, especially compounds **3d** and **3e**, and therefore probably act on other targets in living cells. Further investigations are ongoing to elucidate the exact mechanisms of action of the newly identified natural FXR antagonists and their implications for in vivo pharmacological effects.

#### Druglikeness Evaluation

To assess the drug-like profiles of the six natural products, an in silico prediction of ADME properties was performed using the QikProp v4.3 module integrated into Maestro 10.1, with 6-ECDCA as the reference compound (**Table 3**). All physically significant descriptors and pharmaceutically relevant properties of the natural FXR antagonists, except for those of compound **1a**, fall within the ranges recommended for 95% of known drugs, suggesting remarkable druglikeness. The QPlogPo/w and QPlogS values of compound **1a** exceed the limits of Lipinski's rule of five or Jorgensen's rule of three, respectively; therefore, aqueous and lipid solubility should be taken into consideration if further structural optimization is carried out based on the tetracyclic triterpene **1a**.
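To make the rule-of-five count in Table 3 concrete, the sketch below tallies violations using the commonly cited Lipinski thresholds (molecular weight > 500, hydrogen bond donors > 5, acceptors > 10, logP > 5). It is not the QikProp implementation, and the descriptor values in the example are invented rather than taken from Table 3.

```python
def lipinski_violations(mol_weight, h_donors, h_acceptors, log_p):
    """Count violations of Lipinski's rule of five (0 to 4)."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
        log_p > 5,          # octanol/water logP above 5
    ]
    return sum(checks)


# Hypothetical lipophilic triterpene-like compound: only logP is out of range.
print(lipinski_violations(mol_weight=468.0, h_donors=2, h_acceptors=4, log_p=6.1))
```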

#### CONCLUSION

In summary, we have established a small NPD containing over 4,000 compounds that were previously isolated from about 100 medicinal plants. From this database, six FXR antagonists were identified by a virtual screening strategy, which validated the feasibility of virtual screening for exploring the potential targets of natural products. Although selected on the basis of the known agonist-binding pocket, the two most potent compounds, **2a** and **3a**, antagonized the CDCA-induced SRC-1 recruitment to FXR-LBD with IC<sub>50</sub> values of 1.29 µM and 1.79 µM, respectively. The predicted docking mode of the diterpene **2a** exhibited binding interactions partially similar to those of the crystallographic ligand 6-ECDCA bound to FXR, whereas daphneone (**3a**) showed a noncanonical proposed binding mode, which may directly contact the intrinsically unstable loop H10/11-12 through π-stacking interactions with the aromatic residue Phe465. Moreover, as assessed by QikProp, most of the natural FXR antagonists displayed drug-like properties comparable to those of 95% of known drugs. We hope our discovery will provide promising chemical scaffolds for further hit-to-lead optimization and for the study of FXR-related biological mechanisms.

### AUTHOR CONTRIBUTIONS

HL, WZ, and LS conceived the study. YD and SL performed molecular simulations. YD analysed the data and drafted the manuscript. JJ, SZ, and JH carried out experimental studies. HL reviewed and revised the manuscript. All authors have read and approved the final manuscript.

#### FUNDING

The research is supported in part by the National Key Research and Development Program (Grant 2016YFA0502304), Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501 and the Fundamental Research Funds for the Central Universities. SL is also sponsored by Shanghai Sailing Program (No. 18YF1405100).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00140/full#supplementary-material

orally active agonist of the farnesoid X receptor (FXR). J. Med. Chem. 52, 904–907. doi: 10.1021/jm8014124


receptor FXR. Mol. Cell 11, 1093–1100. doi: 10.1016/S1097-2765(03)00112-6 Nam, S. J., Ko, H., Ju, M. K., Hwang, H., Chin, J. W., Ham, J., et al. (2007). Scalarane sesterterpenes from a marine sponge of the genus spongia and their FXR antagonistic activity. J. Nat. Prod. 70, 1691–1695. doi: 10.1021/np070024k Nevens, F., Andreone, P., Mazzella, G., Strasser, S. I., Bowlus, C., Invernizzi, P.,

et al. (2003). Structural basis for bile acid binding and activation of the nuclear

agonist and induces GLUT4 expression in CHO-K1 cells. J. Steroid Biochem.

Maglich, J. M., Caravella, J. A., Lambert, M. H., Willson, T. M., Moore, J. T., and Ramamurthy, L. (2003). The first completed genome sequence from a teleost fish (Fugu rubripes) adds significant diversity to the nuclear receptor superfamily. Nucleic Acids Res. 31, 4051–4058. doi: 10.1093/nar/gkg444 Makishima, M., Okamoto, A. Y., Repa, J. J., Tu, H., Learned, R. M., Luk, A., et al. (1999). Identification of a nuclear receptor for bile acids. Science 284,

Maloney, P. R., Parks, D. J., Haffner, C. D., Fivush, A. M., Chandra, G., Plunket, K. D., et al. (2000). Identification of a chemical tool for the orphan nuclear receptor

Meyer, U., Costantino, G., Macchiarulo, A., and Pellicciari, R. (2005). Is antagonism of E/Z-guggulsterone at the farnesoid X receptor mediated by a noncanonical binding site? A molecular modeling study. J. Med. Chem. 48,

FXR. J. Med. Chem. 43, 2971–2974. doi: 10.1021/jm0002127

6948–6955. doi: 10.1021/jm0505056

110, 150–156. doi: 10.1016/j.jsbmb.2008.03.028

	- Wang, L. B., Dong, N. W., Wu, Z. H., and Wu, L. J. (2012). Two new compounds with cytotoxic activity on the human melanoma A375-S2 cells from Daphne giraldii callus cells. J. Asian Nat. Prod. Res. 14, 1020–1026. doi: 10.1080/10286020.2012.701206
	- Wang, Y. D., Chen, W. D., Moore, D. D., and Huang, W. (2008). FXR: a metabolic regulator and cell protector. Cell Res. 18, 1087–1095. doi: 10.1038/cr.2008.289
	- Wu, J., Xia, C. S., Meier, J., Li, S. Z., Hu, X., and Lala, D. S. (2002). The hypolipidemic natural product guggulsterone acts as an antagonist of the bile acid receptor. Mol. Endocrinol. 16, 1590–1597. doi: 10.1210/mend.16.7.0894
	- Xu, X., Xu, X., Liu, P., Zhu, Z. Y., Chen, J., Fu, H. A., et al. (2015). Structural basis for small molecule NDB (N-benzyl-N-(3-(tert-butyl)-4-hydroxyphenyl)-2,6 dichloro-4-(dimethylamino) benzamide) as a selective antagonist of farnesoid X receptor alpha (FXR alpha) in stabilizing the homodimerization of the receptor. J. Biol. Chem. 290, 19888–19899. doi: 10.1074/jbc.M114.630475
	- Yamada, T., and Sugimoto, K. (2016). Guggulsterone and its role in chronic diseases. Adv. Exp. Med. Biol. 929, 329–361. doi: 10.1007/978-3-319-41342-6\_15
	- Yang, X. W., Feng, L., Li, S. M., Liu, X. H., Li, Y. L., Wu, L., et al. (2010). Isolation, structure, and bioactivities of abiesadines AY, 25 new diterpenes from Abies georgei Orr. Bioorg. Med. Chem. 18, 744–754. doi: 10.1016/j.bmc.2009.11.055
	- Yuan, Z. Q., and Li, K. W. (2016). Role of farnesoid X receptor in cholestasis. J. Digest. Dis. 17, 501–509. doi: 10.1111/1751-2980. 12378
	- Zhang, W., Zhang, W. D., Liu, R. H., Shen, Y., Zhang, C., Cheng, H., et al. (2006). Two new chemical constituents from Daphne odora Thunb. var. marginata. Nat. Prod. Res. 20, 1290–1294. doi: 10.1080/1478641060 1101860
	- Zhang, Y., and Edwards, P. (2008). FXR signaling in metabolic disease. FEBS Lett. 582, 10–18. doi: 10.1016/j.febslet.2007.11.015

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Diao, Jiang, Zhang, Li, Shan, Huang, Zhang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Reverse Screening Methods to Search for the Protein Targets of Chemopreventive Compounds

Hongbin Huang<sup>1,2†</sup>, Guigui Zhang<sup>1,3†</sup>, Yuquan Zhou<sup>1,2</sup>, Chenru Lin<sup>1,3</sup>, Suling Chen<sup>1,2</sup>, Yutong Lin<sup>1,3</sup>, Shangkang Mai<sup>1,2</sup> and Zunnan Huang<sup>1,3\*</sup>

<sup>1</sup> Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China, <sup>2</sup> The Second School of Clinical Medicine, Guangdong Medical University, Dongguan, China, <sup>3</sup> School of Pharmacy, Guangdong Medical University, Dongguan, China

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Marco Tutone, Università degli Studi di Palermo, Italy

Francesco Ortuso, Università degli Studi Magna Græcia di Catanzaro, Italy

\*Correspondence: Zunnan Huang zn\_huang@yahoo.com

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 08 February 2018 Accepted: 09 April 2018 Published: 09 May 2018

#### Citation:

Huang H, Zhang G, Zhou Y, Lin C, Chen S, Lin Y, Mai S and Huang Z (2018) Reverse Screening Methods to Search for the Protein Targets of Chemopreventive Compounds. Front. Chem. 6:138. doi: 10.3389/fchem.2018.00138

This article is a systematic review of reverse screening methods used to search for the protein targets of chemopreventive compounds or drugs. Typical chemopreventive compounds include components of traditional Chinese medicine, natural compounds and Food and Drug Administration (FDA)-approved drugs. Such compounds are somewhat selective but are predisposed to bind multiple protein targets distributed throughout diverse signaling pathways in human cells. In contrast to conventional virtual screening, which identifies the ligands of a targeted protein from a compound database, reverse screening is used to identify the potential targets or unintended targets of a given compound from a large number of receptors by examining their known ligands or crystal structures. This method, also known as in silico or computational target fishing, is highly valuable for discovering the target receptors of query molecules from terrestrial or marine natural products, exploring the molecular mechanisms of chemopreventive compounds, finding alternative indications of existing drugs by drug repositioning, and detecting adverse drug reactions and drug toxicity. Reverse screening can be divided into three major groups: shape screening, pharmacophore screening and reverse docking. Several large software packages, such as Schrödinger and Discovery Studio; typical software/network services such as ChemMapper, PharmMapper, idTarget, and INVDOCK; and practical databases of known target ligands and receptor crystal structures, such as ChEMBL, BindingDB, and the Protein Data Bank (PDB), are available for use in these computational methods. Different programs, online services and databases have different applications and constraints. Here, we conducted a systematic analysis and multilevel classification of the computational programs, online services and compound libraries available for shape screening, pharmacophore screening and reverse docking to enable non-specialist users to quickly learn and grasp the types of calculations used in protein target fishing. In addition, we review the main features of these methods, programs and databases and provide a variety of examples illustrating the application of one or a combination of reverse screening methods for accurate target prediction.

Keywords: drug design, reverse screening, shape similarity, pharmacophore modeling, reverse docking, methodology, online service, screening databases

## INTRODUCTION

New drugs can be designed via traditional receptor structure-based virtual screening, which enables the discovery of bioactive compounds that bind the target protein, but they can also originate from reverse virtual screening, which finds the unknown protein targets of active compounds or additional targets of existing drugs (drug repositioning; Hurle et al., 2013). Among the 84 drug products introduced to the market in 2013, new indications of existing drugs accounted for 20%, implying that drug repositioning plays a key role in drug discovery (Graul et al., 2014; Li J. et al., 2016). The majority of drugs or bioactive compounds exert their functions by interacting with protein targets. With an increasing number of drugs showing the ability to target multiple proteins, target identification plays an important role in the fields of drug discovery and biomedical research (Wang J. et al., 2016). Many reverse screening methods can be used to search for the protein targets of molecules (Ziegler et al., 2013), although the earliest approaches involved expensive and time-consuming biological assays (Drews, 1997). However, with the continuous development of Big Data and computational techniques, computer-aided reverse screening methods are playing an increasingly important role in the prediction of the off-target effects and side effects of drugs as well as in drug repositioning (Rognan, 2010; Liu et al., 2014; Schomburg and Rarey, 2014).

These computational methods can be divided into three classes according to their underlying principles: shape screening, pharmacophore screening, and reverse docking. In the absence of receptor crystal structures, shape or pharmacophore screening facilitates the discovery of the potential targets of a query molecule by comparing its overall shape or key pharmacophore features with those of the compounds from a ligand database annotated with target information (Schuffenhauer et al., 2003; Hawkins et al., 2007; Chen et al., 2009). The annotated targets of the matched ligands can then be considered potential targets of the query molecule. Reverse docking, in contrast to the traditional molecular docking used to find the ligands of a target protein, refers to the successive docking of a query molecule into the active pocket of each protein from a protein 3D structure database based on spatial and energy principles to identify protein targets with strong binding affinity as potential targets of the query molecule (Li et al., 2013). Reverse screening methods are important computational techniques for identifying new macromolecular targets of existing drugs or active molecules and for analyzing their functional mechanisms or side effects (Patel et al., 2015). Based on the principles of the methods and the availability of existing large-scale small-molecule [e.g., ChEMBL, the European Molecular Biology Laboratory (Gaulton et al., 2017)] or macromolecule (e.g., the PDB; Rose et al., 2015) databases, researchers worldwide have developed a variety of software and online services for predicting the protein targets of small molecules. Representative examples include SEA (Keiser et al., 2007), PharmMapper (Liu et al., 2010) and INVDOCK (Chen and Zhi, 2001), which are among the earliest tools for shape screening, pharmacophore screening and reverse docking, respectively. In recent years, these three methods have been widely used in the prediction of protein targets to clarify the molecular mechanisms of active small molecules against various diseases (Kharkar et al., 2014; Cereto-Massagué et al., 2015). Many of these molecules are derived from Chinese herbal medicine, and while their pharmacological or biological activities are known, their cellular and molecular mechanisms remain unclear. For example, Lim et al. (2014) used shape screening to determine that curcumin (compound **1**, **Figure 1**), extracted from Zingiberaceae, suppresses the proliferation of human colon cancer cells by targeting cyclin dependent kinase 2 (CDK2). Marine compounds are another class of bioactive small molecules. For example, wentilactone B (WB, compound **2**) is a tetranorditerpenoid derivative extracted from the marine algae-derived endophytic fungus Aspergillus wentii EN-48. Zhang et al. (2013) used reverse docking to discover that this small molecule induces G2/M phase arrest and apoptosis of human hepatocellular carcinoma cells by co-targeting the Ras/Raf/MAPK proteins in their signaling pathways.

Here, we begin by introducing the basic principles of these three types of reverse screening methods, i.e., shape screening, pharmacophore screening and reverse docking, for the prediction of the protein targets of small molecules. Then, representative and classical software and online services for each method as well as the relevant databases are hierarchically categorized and systematically presented. Finally, we reviewed nearly all articles on the applications of these methods since 2000 and selected some typical examples to illustrate the use of these methods. By statistically analyzing these articles, we reveal the trends in the application of these three methods for computer-aided protein target prediction. In addition, we discuss their shortcomings and possible solutions as well as previous reviews of these reverse screening approaches for predicting the protein targets of small molecules.

**Abbreviations:** 4-HT, 4-Hydroxy-tamoxifen; 5-Aza-dC, 5′-Aza-2′-deoxycytidine; 5-HT1AR, 5-Hydroxytryptamine 1A receptor; 5-HT2A, 5-Hydroxytryptamine receptor 2A; 67DiOHC8S, 6,7-Dihydroxycoumarin-8-sulfate; 6BIO, 6-Bromoindirubin-3′-oxime; ABL, BCR-ABL tyrosine kinase; ACE, Angiotensin converting enzyme; ACHE, Acetylcholinesterase; ACM2, Muscarinic acetylcholine receptor 2; ADA, Adenosine deaminase; ADH, Alcohol dehydrogenase; AdrA3, Adenosine receptors A3; ADRB1, Human beta-1 adrenergic receptor; AGS-IV, Astragaloside IV; AKR1B1, Aldo-keto reductase family 1, member B1; ANXA5, Annexin A5; Apaf-1, Apoptotic protease activating factor-1; APH(2′)-Iva, Aminoglycoside-2′-phosphotransferase type Iva; AR, Aldose reductase; AR′, Androgen receptor; ASA, Aspirin; ASC, Astragalus Salvia compound; BACE1, Beta-secretase 1; BBR, Berberine; Braf, B-raf kinase; Bub1, Human spindle checkpoint kinase Bub1; CA, Carnosic acid; CA1, Carbonic anhydrase 1; CA2, Carbonic anhydrase 2; CaM, Calmodulin; CASP-3, Cysteinyl aspartate specific proteinase 3; CB1, Cannabinoid receptor 1; CB2, Cannabinoid receptor 2; CBS, Cystathionine beta-synthase; c-di-GMP, Cyclic diguanylate monophosphate; CDK2, Cyclin dependent kinase-2; CHS, Chitin synthase; CK2, Casein kinase 2; CN, Calcineurin; CO, Curculigo orchioides; Complex I, NADH:ubiquinone oxidoreductase; COMT, Catechol-O-methyltransferase; COX-1, Prostaglandin G/H synthase 1; COX1, Cyclooxygenase-1; COX2, Cyclooxygenase-2; CT, Cryptotanshinone; CYP450, Cytochrome p450; D2R, Dopamine D2 receptor; DAPDC, Diaminopimelate decarboxylase; DAPH, Dialkylphosphorylhydrazone; DHFR, Dihydrofolate reductase; DHODH, Dihydroorotate dehydrogenase; DIP, Dipyridamole; DPD, Dihydropyrimidine dehydrogenase; DPP-IV, Dipeptidyl peptidase IV; DRD2, Dopamine receptor D2; EB1, Microtubule-associated protein RP/EB family member 1; EGFR, Epidermal growth factor receptor; EphA7, Ephrin receptor EphA7; ErbB-1, ErbB-1 tyrosine kinase; ErbB-2, ErbB-2 tyrosine kinase; ERK1, Extracellular regulated protein kinases 1; ERR-γ, Estrogen-related receptor-γ; ERα, Human estrogen receptor alpha; ESR1, Estrogen receptor alpha; ESR2, Estrogen receptor beta; FAK, Focal adhesion kinase; FGFR-4, Fibroblast growth factor receptor 4; GAD, Ganoderic acid D; GAPDH, Glyceraldehyde-3-phosphate Dehydrogenase; GBA3, Cytosolic beta-glucosidase; GCN5, General control non-derepressible 5; GCR, Glucocorticoid receptor; GFW, Guizhi Fuling Wan; GK, Glucokinase; GMP reductase, Guanosine 5′-monophosphate oxidoreductase; GPX1, Glutathione peroxidase 1; GR, Glucocorticoid receptor; GR′, Glutathione reductase; GS, β(1,3)-Glucan synthase; GSH-S, Glutathione synthetase; GSK3β, Glycogen synthase kinase-3 beta; GST, Glutathione S-transferase; GSTA1, Glutathione S-transferase A1; GSTP1, Glutathione s-transferase PI-1; GTPase, Guanosine triphosphatase; GTs, Ganoderma triterpenes; HDAC2, Histone deacetylase 2; HDPR, 6-hydroxyl-1,6-dihydropurine ribonucleoside; HEXB, beta-Hexosaminidase; HGFR, Hepatocyte Growth Factor Receptor; HGPRT, Hypoxanthine-guanine phosphoribosyltransferase; HIV-1 PR, HIV-1 protease; HMGCR, 3-Hydroxy-3-methylglutaryl-coenzyme A reductase; HpPDF, H. pylori PDF; HPRT, Hypoxanthine phosphoribosyltransferase; HRAS, Harvey rat sarcoma; HRH4, Histamine receptor H4; HSD11B1, 11 beta-Hydroxysteroid dehydrogenase type 1; HSPA8, Heat shock protein family A member 8; HSYA, Hydroxysafflor yellow A; IDO, Indoleamine 2,3-dioxygenase; IDV, Indinavir; I-FABP, Intestinal fatty acid binding protein; IGF1-R, Insulin-like growth factor 1 receptor; IL-2, Interleukin-2; IMPDHII, Inosine 5′-monophosphate dehydrogenase II; JNK, c-Jun N-terminal kinase; LCN-2, Lipocalin-2; LDH, L-lactate dehydrogenase; LTA4H, Leukotriene A4 hydrolase; lysC, Lysozyme C; MAO-B, Monoamine oxidase B; MAP2K1, Mitogen-activated protein kinase kinase 1; MAPK-14, Mitogen-activated protein kinase 14; MCDF, 6-Methyl-1,3,8-trichlorodibenzofuran; MDM2, Mouse double minute 2 homolog; MEK1, Mitogen-activated protein kinase 1; MIF, Migration inhibitory factor; MMP3, Metalloproteinase 3; MMP8, Metalloproteinase 8; MPO, Myeloperoxidase; MTX, Methotrexate; NBP, DL-3-n-Butylphthalide; NF-kB, Nuclear factor kB; NK2 receptor, Neurokinin NK2 receptors; NMT, N-myristoyltransferase; NQO1, NAD(P)H quinone oxidoreductases; Nrf2, Nuclear factor erythroid 2-related factor 2; OBA, Obacunone; OPRK, Kappa opioid receptor; OSC, Oxidosqualene cyclase; p38 MAPK, p38 Mitogen-activated protein kinase; PARP1, Poly [ADP-ribose] polymerase 1; PAs, Pyrrolizidine alkaloids; PBP4, Penicillin binding protein 4; PDE4, Phosphodiesterase 4; PDF, Peptide deformylase; PDGFR, Platelet-Derived Growth Factor Receptor; PDK1, Phosphoinositide-dependent kinase-1; PEPCK, Phosphoenolpyruvate carboxykinase; PGS, Phenolic acid glycoside sulfate; PI-3K, Phosphoinositide 3-kinase; PKA, cAMP-dependent protein kinase; PLA2, Phospholipase A2; PLMF1, Periodic leaf movement factor 1; POLB, DNA polymerase beta; PPARγ, Peroxisome proliferator-activated receptor γ; PPARδ, Peroxisome proliferator-activated receptor delta; PRDX3, Thioredoxin-dependent peroxide reductase mitochondrial precursor; PRIMA-1, P53 reactivation and induction of massive apoptosis; PTP1B, Protein tyrosine phosphatase 1B; PTPNT1, Protein tyrosine phosphatase non-receptor type 1; RA, Rosmarinic acid; RARα, Retinoic acid receptor alpha; REN, Renin; SAA, Salvianolic acid A; SB, Salvianolic acid B; SFJD, Shufengjiedu Capsule; SHBG, Sex hormone-binding globulin; SND, Sini decoction; STAT3, Signal transducer and activator of transcription 3; SULT1E1, Estrogen sulfotransferase; TCDD, 2,3,7,8-tetrachlorodibenzo-p-dioxin; TDP1, Tyrosyl-DNA phosphodiesterase 1; THRa, Human thyroid hormone receptor alpha; TOP1, DNA topoisomerase 1; UA, Ursolic acid; TrxR, Thioredoxin reductase; VEGFR-2, Vascular endothelial growth factor receptor; VGKC, Voltage gated potassium channel; WB, Wentilactone B; XO, Xanthine oxidase.

## METHODS

Reverse screening to search for unknown targets, unintended targets, or secondary targets of small-molecule drugs can be achieved by shape similarity screening, pharmacophore model screening, or reverse protein-ligand docking (**Figure 2**). These three calculation approaches are complementary and can be used in conjunction with each other. By comparison, shape and pharmacophore screening are simpler and faster, while reverse docking is more complex and slower. We introduce these three methods in detail in the following sections.

### Shape Screening

The basic principle of shape screening, from a two-dimensional (2D) perspective, is that structurally similar molecules may have similar bioactivity by targeting the same proteins. From a three-dimensional (3D) perspective, the basic principle is that molecules with similar volumes may have the potential to bind effectively to spaces of the same or similar size (considering the ligand-induced fit effect; Koshland, 1958) in the active pockets of proteins (Shang et al., 2017). To use shape screening to predict the targets of small molecules, a small-molecule ligand database annotated with protein targets is necessary. Then, the overall shape similarity of a query molecule to each ligand in the database can be measured individually. Finally, the protein targets for matched molecules with high similarity scores can be considered potential targets of the query molecule (Schuffenhauer et al., 2003). Shape screening involves two levels of mapping: the first mapping between the query molecule and the ligands in the database and the second mapping between the matched ligands in the database and their annotated protein targets (**Figure 2**).
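The two levels of mapping described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation of any tool cited here: the similarity scores are assumed to be precomputed, and the ligand names, target annotations, and the 0.5 cutoff are hypothetical.

```python
def rank_targets(similarities, annotations, threshold=0.5):
    """Rank annotated targets by the best similarity of any matched ligand.

    similarities: {ligand: similarity of the query to that ligand}
    annotations:  {ligand: [annotated protein targets]}
    threshold:    hypothetical cutoff for calling a ligand "matched"
    """
    best = {}
    # First mapping: query molecule -> matched database ligands.
    for ligand, score in similarities.items():
        if score < threshold:
            continue
        # Second mapping: matched ligand -> its annotated protein targets.
        for target in annotations.get(ligand, []):
            best[target] = max(best.get(target, 0.0), score)
    # Highest score first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data, for illustration only.
similarities = {"ligand_A": 0.82, "ligand_B": 0.64, "ligand_C": 0.31}
annotations = {
    "ligand_A": ["CDK2"],
    "ligand_B": ["CDK2", "GSK3B"],
    "ligand_C": ["EGFR"],
}
for target, score in rank_targets(similarities, annotations):
    print(target, score)
```

In this toy case the target annotated only to the poorly matched ligand is not proposed, while a target shared by two matched ligands keeps the higher of the two scores.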

Shape similarity comparison is based on the 2D or 3D topological structures of small molecules. Notably, 2D methods were originally created to maximize the substructure shared between paired molecules, whereas 3D methods can be used to enhance scaffold diversity (Nettles et al., 2006). A universal descriptor for molecular similarity comparison in 2D methods is FingerPrint2D (FP2), which employs a simple bit vector to represent a variety of chemical characteristics and is encoded in a variety of software and databases (Bender et al., 2004). The most frequently used type of FP2 is extended-connectivity fingerprints (ECFPs), which are circular fingerprints. ECFPs encode circular atomic neighborhoods based on the Morgan algorithm and are designed especially for structure-activity modeling (Rogers and Hahn, 2010). They come in variants that differ in the diameter of the atomic neighborhood considered: for example, ECFP4 refers to a diameter of 4 and ECFP6 to a diameter of 6 (Glem et al., 2006), both of which are encoded in TargetHunter (Wang L. et al., 2013). Molecular ACCess System (MACCS; Durant et al., 2002) is another commonly used FP2. It is a structural key-based fingerprint and is encoded in the 2D approach of the ChemMapper server (Gong et al., 2013). In addition to FP2, other descriptors are based on 2D topologies or paths, including the Daylight fingerprint (http://www.daylight.com) encoded in ChemProt 3.0 (Kringelum et al., 2016) and the MDL structural key, another 2D descriptor (Durant et al., 2002). Structural matching based on 3D topology mainly compares the 3D geometries of the molecules, sometimes with the addition of pharmacophores (Lo et al., 2016), ElectroShapes (Armstrong et al., 2010), Spectrophores (Smusz et al., 2015), or other additional information. For example, WEGA (Yan et al., 2013) and gWEGA (Yan et al., 2014) compare only the volumes of two molecules, but SHAFTS (Lu et al., 2011), encoded in ChemMapper, incorporates pharmacophore matching when calculating the volume similarity.
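To make the idea of a circular fingerprint concrete, the following Python sketch hashes growing atomic neighborhoods into a set of on-bits. It is a deliberately simplified illustration of the Morgan-style iteration and is not the ECFP specification: atoms are described by their element symbol only, bond orders and duplicate-environment removal are ignored, and the hash function (`zlib.crc32`) and the 1,024-bit length are arbitrary choices made here. A radius of 2 bonds corresponds to the diameter of 4 used by ECFP4.

```python
import zlib


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of on-bit positions of a toy circular fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom is identified by its element alone.
    ids = [zlib.crc32(symbol.encode()) for symbol in atoms]
    on_bits = {identifier % n_bits for identifier in ids}

    # Each iteration extends every atom's environment by one bond.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            environment = (ids[i], tuple(sorted(ids[j] for j in neighbors[i])))
            new_ids.append(zlib.crc32(repr(environment).encode()))
        ids = new_ids
        on_bits |= {identifier % n_bits for identifier in ids}
    return on_bits


# Ethanol heavy atoms (C-C-O); the result does not depend on atom numbering.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(ethanol))
```

Because each identifier depends only on an atom's own identifier and the sorted identifiers of its neighbors, renumbering the atoms leaves the resulting bit set unchanged, which is the property that makes such fingerprints usable for database comparison.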

The similarity of the descriptors in both 2D and 3D methods can be measured by the Tanimoto coefficient, which is the ratio of the intersection to the union of the features or shapes of two molecules (Salim et al., 2003). For example, TargetHunter uses the Tanimoto coefficient to calculate the similarity among molecular fingerprints (Wang L. et al., 2013). The City-Block distance (CBD, also called the Manhattan or Hamming distance), which is the sum of the two molecular shapes minus twice their overlap, can also be used to calculate the molecular similarity (Awale and Reymond, 2014). For example, SwissTargetPrediction uses this formula to compare ElectroShape vectors in 3D comparisons (Gfeller et al., 2014).
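For binary fingerprints, both measures reduce to simple set arithmetic. The sketch below represents each fingerprint as the set of its on-bit positions; the bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here rather than a property of any cited tool.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| for two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def city_block(a, b):
    """City-Block distance for binary fingerprints: |A| + |B| - 2|A ∩ B|."""
    return len(a) + len(b) - 2 * len(a & b)


# Hypothetical on-bit positions of a query and a database ligand.
fp_query = {3, 17, 42, 101, 256}
fp_ligand = {3, 17, 42, 300}

print(tanimoto(fp_query, fp_ligand))    # 3 shared bits out of 6 distinct bits
print(city_block(fp_query, fp_ligand))  # number of bits set in only one fingerprint
```

The Tanimoto coefficient is a similarity (1.0 for identical fingerprints), whereas the City-Block distance is a dissimilarity (0 for identical fingerprints), so the two rank database ligands in opposite directions.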

Shape screening can be divided into two subclasses: indirect target prediction and direct target prediction. Indirect target prediction indicates that the potential targets of the query molecule are manually selected from the annotated protein targets of the matched database ligands. ROCS (Rush et al., 2005) and TargetHunter (Wang L. et al., 2013) are representative examples. These programs merely calculate the similarity scores between the query molecule and the matched ligands in the database but cannot reveal the complex relationships among the annotated protein targets of multiple matched ligands. In general, the annotated targets of any database ligand are not unique, and a protein target may also be annotated with multiple compounds (Rognan, 2010). Therefore, these programs can have high rates of false positives in target prediction and low accuracy in target searching.

Direct target prediction not only calculates the similarity score between the query molecule and the ligands in the database but also estimates the probability that the annotated targets of the matched ligands are targets of the query molecule. This extra process can reduce the false positive rate of target prediction and improve the accuracy of the target search. The probability that the annotated targets of the matched ligands are targets of the query molecule can be evaluated by multiple computational models or algorithms (the dotted line in **Figure 2**). For example, ChemMapper (Gong et al., 2013), which is based on a compound-protein network constructed from the top similar structures and their annotated targets, employs a random walk algorithm (Köhler et al., 2008) to calculate the probabilities of interaction between the query structure and the annotated targets of the hit compounds. In addition, SwissTargetPrediction (Gfeller et al., 2014) and CSNAP3D (Lo et al., 2016) use a cross-validation method and a network algorithm, respectively, to assess the probabilities that the annotated targets of the matched ligands are targets of the query molecule.
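As an illustration of how a network algorithm can turn similarity hits into target probabilities, the following Python sketch runs a generic random walk with restart on a small compound-protein network. It is not ChemMapper's actual implementation: the network, the edge weights, the restart probability of 0.3, and the fixed number of iterations are all assumptions made for this example.

```python
def random_walk_with_restart(edges, start, restart=0.3, n_iter=200):
    """Approximate steady-state visiting probabilities of a walker that
    jumps back to `start` with probability `restart` at every step.

    edges: list of (node_u, node_v, positive_weight), treated as undirected
    """
    neighbors = {}
    for u, v, weight in edges:
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight

    prob = {node: 1.0 if node == start else 0.0 for node in neighbors}
    for _ in range(n_iter):
        nxt = {node: (restart if node == start else 0.0) for node in neighbors}
        for u, nbrs in neighbors.items():
            total = sum(nbrs.values())
            for v, weight in nbrs.items():
                # Move along an edge in proportion to its weight.
                nxt[v] += (1.0 - restart) * prob[u] * weight / total
        prob = nxt
    return prob


# Hypothetical network: query-ligand edges carry similarity scores,
# ligand-target edges carry unit weight.
EDGES = [
    ("query", "ligand_A", 0.9),
    ("query", "ligand_B", 0.6),
    ("ligand_A", "target_1", 1.0),
    ("ligand_B", "target_1", 1.0),
    ("ligand_B", "target_2", 1.0),
]
probabilities = random_walk_with_restart(EDGES, "query")
for node in ("target_1", "target_2"):
    print(node, round(probabilities[node], 4))
```

In this toy network, the target supported by two matched ligands receives a higher probability than the target supported by one, which is the kind of evidence pooling that indirect prediction leaves to manual inspection.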

Because shape screening is based on the comparison of overall molecular shape, it may not be suitable for predicting the potential targets of molecules that are excessively large or small. Judging the potential targets of an oversized molecule is difficult because its best matched ligands usually show a low similarity score, and selecting the potential targets of an undersized molecule is difficult because its matched ligands are numerous with high similarity scores. Shape screening is suitable for predicting potential targets whose available inhibitors have sizes similar to that of the query molecule but is less fit for finding novel targets whose current inhibitors differ greatly in size from the query molecule but whose active pocket space is easily adjusted to bind diverse ligands due to a strong induced-fit effect.

### Pharmacophore Screening

The basic principle of pharmacophore screening is that the binding of certain drugs with their protein targets is primarily determined by key functional pharmacophores (Rognan, 2010). Thus, the matching of these important pharmacophores can be used to search for new targets of small-molecule drugs (Fang and Wang, 2002). A pharmacophore is the spatial arrangement of functional characteristics that allows molecules to interact with target proteins in a particular binding mode, such as a hydrophobic center (H), hydrogen bond acceptor vector (HBA), hydrogen bond donor vector (HBD), positively charged center (P), or negatively charged center (N) (Kurogi and Güner, 2001). A pharmacophore model is the combination of pharmacophores in a pattern of ligand-protein interaction that gives the final pharmacological effect (Leach et al., 2010). Similar to a ligand database for shape screening, a pharmacophore database also requires annotation with target protein information. In pharmacophore screening, the pharmacophore features of the query molecule are successively matched with the features of the pharmacophore models in the database. A higher matching degree indicates that the annotated protein target of the matched pharmacophore model has greater potential to be a target of the query molecule (Steindl et al., 2006). Pharmacophore screening also undergoes two levels of mapping: the first mapping is between the pharmacophore models of the query molecule and of the ligands in the database, and the second mapping is between the matched pharmacophore models of the ligands in the database and their annotated protein targets (**Figure 2**).

The pharmacophore database is built by pharmacophore modeling. The three construction methods are the use of ligands only, receptor structures only, or co-crystallized complex structures, which can be defined as ligand-based, structure-based and complex-based pharmacophore modeling, respectively. Ligand-based pharmacophore modeling was initially designed and is often used for traditional ligand-based virtual screening; an example is the quantitative structure–activity relationship (QSAR; Pulla et al., 2016). The most substantial common features shared by a group of active molecules can be easily extracted by using this method to form a good pharmacophore model to guide the further optimization of active compounds (Leach et al., 2010; Gaurav and Gautam, 2014). However, this approach is seldom used in reverse pharmacophore modeling due to the arbitrariness of pharmacophore models based on a single protein-annotated ligand.

The other two main methods, the use of only receptor structures and the use of protein-ligand complex structures, are forms of structure-based pharmacophore modeling (Gaurav and Gautam, 2014). In receptor-based methods, the pharmacophore features are first extracted from potential binding sites detected by specific protocols, and the pharmacophore models are then derived from the clustering of interaction point information and further refined or validated by using the input of the known ligands and their available or even calculated binding data (Chen and Lai, 2006). For instance, Pocket v.2 (Chen and Lai, 2006) and Catalyst SBP in Discovery Studio (DS) (BIOVIA, 2017) can both produce this type of pharmacophore database. In complex-based methods, pharmacophore models are simply generated via knowledge-based topological rules by using all features, such as hydrogen bonding information, charge, and hydrophobic contacts, based on the interactions between the co-crystallized ligands and receptor atoms (Sutter et al., 2011; Meslamani et al., 2012). Complex-based pharmacophore modeling is commonly used to construct pharmacophore databases, such as PharmaDB in Discovery Studio (Meslamani et al., 2012) and PharmTargetDB in PharmMapper (Liu et al., 2010), due to the stronger association between the built pharmacophore models and the experimentally verified ligand-protein interactions, which can improve the accuracy of target prediction.

The matching process between a pharmacophore model of the query molecule and the pharmacophore models in the pharmacophore database considers the alignment of two core components: pharmacophore feature types and the positions of the feature types (Wolber and Langer, 2005). The alignment of feature types is the matching between the pharmacophore features shared by the query molecule and the database ligands, such as matching between a hydrophobic feature in the molecular structure and those in database ligand pharmacophore models. The alignment of the feature positions is the pairwise matching of the distances between the fitted feature types in the pharmacophore models (Kabsch, 1976). For example, PharmMapper groups pharmacophores into triplets (e.g., H-H-H, H-HBA-HBD) and uses the vertices of a triangle to represent the pharmacophore feature types and the side lengths of the triangle to measure the relative positions of these feature types (Liu et al., 2010).
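The triplet idea can be sketched in a few lines of Python. This is only an illustration of comparing triangles by vertex types and side lengths; it does not reproduce PharmMapper's actual matching or scoring procedure, and the feature coordinates and the 1.0 Å distance tolerance are hypothetical.

```python
from itertools import combinations, permutations
from math import dist


def side_lengths(triangle):
    """Side lengths of a triangle of (feature_type, (x, y, z)) vertices."""
    a, b, c = triangle
    return (dist(a[1], b[1]), dist(b[1], c[1]), dist(a[1], c[1]))


def triangles_match(t1, t2, tol=1.0):
    """True if some vertex ordering of t2 has the same feature types as t1
    and all corresponding side lengths agree within `tol`."""
    types1 = tuple(feature[0] for feature in t1)
    sides1 = side_lengths(t1)
    for candidate in permutations(t2):
        if tuple(feature[0] for feature in candidate) != types1:
            continue
        if all(abs(x - y) <= tol for x, y in zip(sides1, side_lengths(candidate))):
            return True
    return False


def count_matched_triplets(query, model, tol=1.0):
    """Number of query triplets that match at least one model triplet."""
    model_triplets = list(combinations(model, 3))
    return sum(
        1
        for q in combinations(query, 3)
        if any(triangles_match(q, m, tol) for m in model_triplets)
    )


# Hypothetical features: (type, coordinates in angstroms).
query = [("H", (0, 0, 0)), ("HBA", (3, 0, 0)), ("HBD", (0, 4, 0)), ("H", (6, 6, 0))]
model = [("HBD", (10, 14, 0)), ("H", (10, 10, 0)), ("HBA", (13, 10, 0))]
print(count_matched_triplets(query, model))
```

Because only inter-feature distances are compared, the match is independent of how the two molecules are positioned in space; a full alignment step would still be needed to superimpose the matched features.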

In pharmacophore screening, the pairwise fitness score between pharmacophore models can be used directly as a basis for target evaluation. The fitness score includes the scores obtained from both the alignments between feature types and the alignments between the positions of each pair of pharmacophore models from the query molecule and database ligands. A higher fitness score indicates a higher probability that the corresponding protein is a target of the query molecule (Wang X. et al., 2016). In addition, other matching information, such as the number of matched features and overall shape similarity, can also be used as additional references for target evaluation (Khedkar et al., 2007). If the pharmacophore scoring process does not consider the overall shape of the query molecules, it will be more likely to find spurious protein targets with high fitness scores for a smaller query molecule because its limited pharmacophore features can be easily matched in the database (Wang X. et al., 2016). Thus, the target score must be recalculated to improve the prediction accuracy (the dotted line in **Figure 2**). For example, PharmMapper utilizes a normalized fitness score, obtained by standardizing the distribution of fitness scores, to re-rank the potential targets and achieve a higher accuracy (Wang X. et al., 2017).
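
The idea of re-ranking by a standardized score can be sketched as follows. This is only an illustration of a plain z-score, not PharmMapper's actual normalization; the function name `normalized_fit` and the notion of a per-model background of reference fit scores are assumptions introduced here.

```python
from statistics import mean, stdev


def normalized_fit(raw_scores, background):
    """Convert raw fit scores into z-scores, one per pharmacophore model.

    raw_scores: {model_name: fit score of the query molecule}
    background: {model_name: fit scores of reference ligands on that model}
                (a hypothetical background set, assumed for illustration)
    """
    return {
        name: (score - mean(background[name])) / stdev(background[name])
        for name, score in raw_scores.items()
    }


def rank(scores):
    """Model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

A model on which almost any ligand fits well has a high background mean, so a high raw score there is unremarkable; standardization pushes such models down the list relative to a raw-score ranking.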

Since the construction of the pharmacophore database by structure-based pharmacophore modeling is not easy, the development of corresponding tools based on this principle has been somewhat limited. However, compared with shape screening, pharmacophore screening can improve the accuracy of prediction because it focuses on matching the key pharmacophore functional groups. In addition, it can ignore the total size of the molecule. As a result, pharmacophore screening can be used to search for potential targets of a query molecule with a large or small volume and can also be employed to find novel protein targets capable of binding a large diversity of ligands. Although PharmTargetDB, the PharmMapper in-house repository, does incorporate protein structural information, a pharmacophore database can be built to use ligands only. That is, constructing a pharmacophore database based on ligands with currently unavailable target structures is also useful for pharmacophore screening.

# Reverse Docking

The basic principle of reverse docking is that the binding strength of a small-molecule ligand and a potential protein target is determined by their interaction energy (docking energy). To use reverse docking to predict the targets of a query molecule, a structure grid database of a large number of protein targets is normally required. Then, the query molecule is individually docked with each protein structure in the database. Each docking score is calculated. Finally, the protein targets are sorted according to their docking energy. Generally, a higher rank indicates a greater probability that the protein is a target of the query molecule. In contrast to shape screening and pharmacophore screening, reverse docking involves one level of mapping, which reflects the direct relationship between the query molecule and the target proteins (**Figure 2**). However, it is a complex process that includes recognition of a binding site, construction of the docking grid, a molecular docking algorithm, docking score calculation and target evaluation, among other steps (Lee et al., 2016).
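
The workflow described above reduces to a simple loop, sketched here under stated assumptions. The `dock` argument is a hypothetical callable standing in for a real docking engine; it is assumed to return a docking energy in which more negative values mean stronger predicted binding.

```python
def reverse_dock(query, targets, dock):
    """Dock one query molecule against every target and rank the targets.

    targets: {target_name: prepared structure grid}
    dock:    hypothetical callable dock(query, grid) -> docking energy
             (more negative = stronger predicted binding)
    Returns (target_name, energy) pairs, best-ranked first.
    """
    scored = [(name, dock(query, grid)) for name, grid in targets.items()]
    return sorted(scored, key=lambda item: item[1])
```

The cost of the procedure grows linearly with the number of target structures, since each target requires its own docking run.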

In most cases, the active site of a protein is already known and can be determined from its co-crystallized small-molecule ligand. However, for some apo-form structures without co-crystallized ligands, the docking program must first recognize the active binding site of these proteins. If the apo-form structure is from a protein for which other co-crystallized structures are available, its active site can also be identified from those protein structures with co-crystallized ligands. Otherwise, de novo detection of the active site of the apo-form structure is required. The literature describes multiple ways to achieve this task. For example, Wang et al. (2012) use the "divide-and-conquer" method in idTarget to search the surface structure of the entire protein and possible allosteric structures to find potential binding sites. Kuntz et al. (1982) describe a method that was later used in INVDOCK (Chen and Zhi, 2001) to define a binding site by a group of overlapping probe spheres of certain radii, which fill up a cavity and whose inward-facing surfaces cover the van der Waals surfaces of the protein atoms at the interface. Active site recognition is very useful in attempts to dock a query molecule into cavities other than the binding pockets of known ligands, which can increase the diversity of the binding between the query molecule and protein targets and improve the accuracy of reverse docking.

The database of protein targets used in reverse docking can be a library of protein crystal structure grids with recognized binding sites determined by co-crystallized ligands or available cavities. We can build these databases by continuously downloading a series of protein crystal structures from the Protein Data Bank (PDB); the time-consuming human-computer interaction processes (such as the deletion of water molecules, the addition of hydrogen atoms, and energy optimization) can be accomplished by using a molecular docking program, and the protein structure grids are finally generated. Traditional molecular docking programs, such as DOCK (Allen et al., 2015), AutoDock (Di Muzio et al., 2017), Schrödinger (Schrödinger, 2018) and Discovery Studio (BIOVIA, 2017), can be used to construct a custom target database for reverse docking to search for potential targets of a small molecule. Alternatively, the protein target database can also be a simply processed, automatically constructed protein structure database, and the grids can be generated after the programmed identification of active sites in the process of reverse docking; an example is the idTarget in-house database (Wang et al., 2012). Notably, the lack of a universal protein structure grid database and the need to build a new one for each docking program are the main reasons that reverse docking cannot be used as often as traditional structure-based virtual screening.

At present, reverse screening uses two main types of molecular docking techniques, originally developed in DOCK and AutoDock. DOCK (Ewing et al., 2001) adopts a "geometry matching method" to perform molecular docking by complementing the geometric shape of the docking ligands with that of the protein active site, usually including hydrogen bonding sites and locally accessible sites (Shoichet et al., 2010). The matching process is performed by an "anchor and grow algorithm," in which the anchor is a rigid portion of the ligand that is used to initialize a pruned conformation search, and grow refers to the generation of multiple conformations of the remaining segments to simulate the flexible docking of the ligand (Ewing et al., 2001). AutoDock uses a "docking simulation method" that employs the "genetic algorithm" to sample the conformations of a docking molecule inside a grid of the receptor binding pocket (Willett, 1995). In this algorithm, the molecule starts randomly at the receptor surface and undergoes orientation, translation and rotation to cause conformational changes until the ideal binding pose with the best binding energy is found (Morris et al., 2015). Among the three reverse docking programs, INVDOCK (Chen and Zhi, 2001) and TarFisDock (Li et al., 2006) use the DOCK geometry matching method for molecular docking, while idTarget uses the AutoDock genetic algorithm for reverse docking (Wang et al., 2012).
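
To make the genetic-algorithm idea concrete, the following toy sketch evolves a population of pose vectors (for example, translations, rotations and torsions flattened into a list of numbers) toward lower energy. It is a generic genetic algorithm with elitism, not AutoDock's implementation; the function name `genetic_search`, the population settings, and the `energy` callable are all assumptions for illustration.

```python
import random


def genetic_search(energy, n_genes, pop_size=30, generations=40, seed=0):
    """Toy genetic algorithm that minimizes energy(pose).

    A pose is a list of `n_genes` floats (n_genes must be at least 2).
    Returns the best pose found and the best energy seen per generation.
    """
    rng = random.Random(seed)
    # Random starting poses, analogous to random initial placements.
    pop = [[rng.uniform(-5, 5) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=energy)
        history.append(energy(pop[0]))
        parents = pop[: pop_size // 2]       # selection: keep the fitter half
        children = [pop[0][:]]               # elitism: best pose survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_genes)       # mutation of a single gene
            child[i] += rng.gauss(0, 0.3)
            children.append(child)
        pop = children
    best = min(pop, key=energy)
    history.append(energy(best))
    return best, history
```

Because the best pose is always carried into the next generation, the best energy never gets worse from one generation to the next; in a real docking program, `energy` would be the grid-based scoring function.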

Currently, almost all molecular docking programs can perform flexible-ligand docking due to the small size of the ligands; however, these programs still have difficulty in performing molecular docking with a fully flexible protein. Therefore, depending on the flexibility of the receptor proteins, reverse docking can also be classified into two types: rigid protein docking and semi-flexible protein docking. Although reverse docking with a rigid receptor is fast, it ignores ligand/receptor induced-fit effects. An example of a rigid protein docking program for reverse screening is TarFisDock (Li et al., 2006). Reverse docking with semi-flexible receptors can be achieved by various methods such as side-chain rotations (Liu H. et al., 2015), stretching of active pocket residues (Halgren et al., 2004), and ensemble docking (Lorber and Shoichet, 1998). For example, INVDOCK allows the amino acid residues of the receptor binding sites to rotate with the entry of the ligand, thereby simulating the ligand induced-fit conformational changes of receptors (Chen and Zhi, 2001). idTarget docks a query molecule into an ensemble of different receptor crystal structures after clustering (Wang et al., 2012) and thus simulates semi-flexible receptor docking, because the molecule can bind to the distinct positions that the active pocket residues adopt in the different structures of the receptor.

The docking score between a query molecule and receptors is an evaluation criterion for ranking its potential targets in reverse screening. Docking energy is a major method of scoring docking poses and normally refers to the interaction energy between the ligand and protein but may also include the energy of the ligand or the energies of both the ligand and the protein (or a part of the protein such as the binding pocket). For example, INVDOCK evaluates the docking structure by calculating the interaction energy between the ligand and receptor (Chen and Zhi, 2001), whereas idTarget scores the docking pose by calculating the energy of the ligand, the protein binding pocket and the interaction between them (Wang et al., 2012). According to the principle that the most stable structure has the lowest energy, a more negative docking energy results in stronger binding between the ligand and protein. The docking energy is calculated based on energy functions, which are mainly divided into three types: molecular mechanics energy functions, empirical energy functions, and semi-empirical energy functions. The molecular mechanics energy functions are more comprehensive and are rigorously defined by the sum of terms with clear physical meaning, including bond stretching, angle bending, torsion angles, van der Waals forces, electrostatic interactions, desolvation, or hydrophobic interactions, conformational entropy, and potentially others (Huang and Zou, 2010; Wang et al., 2011). In reality, the molecular mechanics energy functions used in the docking programs may include only some of these terms. For example, TarFisDock uses energy functions including only van der Waals and electrostatic interaction terms (Li et al., 2006). Empirical energy functions comprise weighted energy terms whose coefficients are obtained by reproducing the binding affinities of a benchmark data set of protein-ligand complexes (Gilson et al., 1997; Gilson and Zhou, 2007). 
For example, INVDOCK uses an empirical energy function based on simple contact terms, including hydrogen bond and non-bond terms, to calculate the ligand-protein interactive energy as the binding affinity (Chen and Zhi, 2001). Semi-empirical energy functions combine some molecular mechanics energy terms with empirical weights and/or empirical functional forms and have been widely used in computational docking methods (Raha and Merz, 2005). For example, idTarget follows the AutoDock 4 robust scoring functions (Huey et al., 2007) and employs a semi-empirical free energy function that includes hydrogen bonding, electrostatics, desolvation, and torsional entropy, whose weighting coefficients are derived from regression analysis of the experimental binding affinity information (Wang et al., 2011). In addition, reverse docking allows visual assessment of the docking poses by analyzing the number of hydrogen bonds, the presence or absence of critical hydrogen bonds and pi-pi conjugation, etc., as in traditional virtual screening, to further assist target evaluation for a more accurate prediction.
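
As an illustration of an energy function restricted to van der Waals and electrostatic terms, the sketch below sums a 12-6 Lennard-Jones term and a Coulomb term over all ligand-protein atom pairs. It is a simplification, not the function used by any of the programs above: a single `epsilon`/`r_min` pair is applied to every atom pair and a constant `dielectric` is used, whereas real force fields such as Amber assign parameters per atom type. All parameter values are assumptions.

```python
from math import dist

COULOMB = 332.06  # kcal*angstrom/(mol*e^2), converts q1*q2/r to kcal/mol


def interaction_energy(ligand, protein, epsilon=0.1, r_min=3.5,
                       dielectric=4.0):
    """Sum van der Waals and electrostatic energies over all atom pairs.

    ligand, protein: lists of ((x, y, z), partial_charge) tuples.
    epsilon (kcal/mol), r_min (angstrom) and dielectric are illustrative
    values shared by every pair, which is a deliberate simplification.
    """
    total = 0.0
    for xyz_l, q_l in ligand:
        for xyz_p, q_p in protein:
            r = dist(xyz_l, xyz_p)
            ratio6 = (r_min / r) ** 6
            vdw = epsilon * (ratio6 ** 2 - 2.0 * ratio6)  # minimum at r_min
            elec = COULOMB * q_l * q_p / (dielectric * r)
            total += vdw + elec
    return total
```

In this form the van der Waals term reaches its minimum of `-epsilon` at the separation `r_min`, and oppositely charged atoms lower the total energy, consistent with the convention that a more negative energy means stronger binding.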

Reverse docking considers key elements of both shape screening and pharmacophore modeling. It determines whether or not the size of a query molecule can fit inside the binding pocket of a protein target by docking and scores the interaction of the key pharmacophore groups in the molecule and the targets to perform target evaluation. Thus, reverse docking could be the most comprehensive of the three methods in principle. However, similar to traditional molecular docking, it also has the following shortcomings: incompleteness of the search space, inaccuracy of the scoring function, and extensive calculation (Lee et al., 2016). Relative to traditional docking, reverse docking has the additional problem that the sizes of the active pockets of proteins defined by co-crystallized ligands are inconsistent. Even if the docking pockets can be defined as being a universally equal size, the residue density of different protein binding pockets may vary, resulting in differences in the calculation ranges for the binding interaction energies. Therefore, reverse docking suffers from a rationality problem, as it is unable to normalize binding energies for the correct sorting of potential targets. Nevertheless, reverse docking can serve as an effective method to complement shape and pharmacophore screening when the protein structures are available.

### Software and Online Services

Many software programs, some of which are available as online services, can be used for reverse screening to predict protein targets of small molecules, but the numbers of online tools available for the three methods are quite different. Shape screening tools are the most numerous and include more than a dozen, such as ChemProt (Kringelum et al., 2016), ROCS (Rush et al., 2005), ChemMapper (Gong et al., 2013), and the SEA search server (Keiser et al., 2007). They are listed in the outer ring of **Figure 3**. By contrast, the only tool available for pharmacophore screening is PharmMapper (Liu et al., 2010), as shown in the inner ring of **Figure 3**. The main tools available for target searching by reverse docking are TarFisDock (Li et al., 2006), idTarget (Wang et al., 2012) and INVDOCK (Chen and Zhi, 2001), which are illustrated in the middle ring of **Figure 3**. A few large software packages, such as Schrödinger and Discovery Studio, also contain related modules that perform reverse screening, but they can be used only for the indirect prediction of potential targets of small molecules. These tools require users to build their own databases or perform other relevant processing steps. We have summarized the basic information on these tools, organized according to their characteristics, in **Table 1**. In addition, for each type of software and online service, we have provided more detailed descriptions of a few classic representatives.

TABLE 1 | Characteristics of reverse screening tools.

### Shape Screening

At present, many online services are available to search for targets of small-molecule drugs by shape screening. According to whether these services and software programs can directly sort the potential protein targets by probability or not, we classified them into direct target prediction tools, such as SuperPred (Dunkel et al., 2008), HitPick (Liu et al., 2013), ChemMapper, SEA search server, ReverseScreen3D (Kinnings and Jackson, 2011), TarPred (Liu et al., 2015a), and SwissTargetPrediction, and indirect target prediction tools, such as SwissSimilarity (Zoete et al., 2016), ChemProt, TargetHunter (Wang L. et al., 2013), CSNAP3D (Lo et al., 2016), and ROCS. These categories, respectively, are located on the inside and outside of the outer ring in **Figure 3**. A brief introduction to these services, including their input and output formats, shape similarity calculation methods, database information and website links, is provided in **Table 1**. Because indirect target prediction services require the manual selection of protein targets, we do not provide a more detailed overview of these tools here. We chose the SEA search server among the direct target prediction services as a representative for further description because it is the oldest and most widely used shape-screening service (Keiser et al., 2007).

As a web-based target prediction tool, SEA was developed in 2007, and it performs quantitative classification and target association based on the chemical similarity of protein-related ligands (Keiser et al., 2007). SEA supports only the SMILES format for the input of query molecules for target prediction. After receiving the relevant information about the query compound, SEA performs a pairwise comparison of a 2D similarity metric in a collection of ∼65,000 ligands annotated with drug targets, in which most annotations contain hundreds of ligands (Keiser et al., 2007). SEA then clusters the ligands based on their chemical similarity into hundreds of sets, relating their corresponding annotated targets to each other quantitatively, and further uses a model resembling that of BLAST (Mount, 2007) to link these sets together in a minimal spanning tree (Keiser et al., 2007). Next, a statistical model is used to rank the significance (E-value) of the resulting similarity scores of each set in the minimal spanning tree. Finally, SEA produces a list of Max Tanimoto coefficients (MaxTc) and E-values. A larger similarity score (MaxTc) combined with a smaller significance score (E-value) indicates a higher rank and a greater probability that the protein is a potential target.
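
The MaxTc part of this output can be illustrated with a short sketch. It computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit indices and reports, for each target, the maximum coefficient over that target's annotated ligands. The E-value statistics of SEA are not reproduced here; the function names and the set-based fingerprint representation are assumptions for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_tc_by_target(query_fp, annotated):
    """Rank targets by the best Tanimoto match among their ligands.

    annotated: {target_name: [fingerprint, ...]} with at least one
               fingerprint per target (a hypothetical annotated set).
    Returns (target_name, MaxTc) pairs, highest MaxTc first.
    """
    scores = {
        target: max(tanimoto(query_fp, fp) for fp in fps)
        for target, fps in annotated.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

MaxTc alone reflects only the single most similar ligand per target, which is one reason a set-level significance score is reported alongside it.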

### Pharmacophore Screening

PharmMapper, the only web server to screen the potential protein targets of a query molecule based on pharmacophore modeling (**Figure 3**), was developed in 2010 (Liu et al., 2010; Wang X. et al., 2017). PharmMapper uses a triangle hashing mapping method to match the pharmacophore models between the compound and the internal database ligands to predict potential protein targets of a query molecule (Liu et al., 2010). A brief introduction to this online tool is also given in **Table 1**, including the input and output formats, database information and website link. Its in-house database, PharmTargetDB, will be described in detail in the Databases section. PharmMapper supports the Tripos Mol2, MDL and SDF formats for the input of a 2D or 3D query molecule structure to begin a job. Next, PharmMapper flexibly aligns the molecule with each protein pharmacophore model in its database and calculates the fit score between the query molecule and the pharmacophore models (Liu et al., 2010). Subsequently, PharmMapper ranks candidate targets according to the fit score (Liu et al., 2010) or according to normalized fit scores standardized by using a two-dimensional Z-transformation algorithm on the ligand and pharmacophore target dimensions (Wang X. et al., 2016), and records the aligned pharmacophore pose for the query molecule and targets. With the default setting, the top 300 target hits of the prioritized list are output, and users can select candidate proteins based on both these fit scores and the aligned pose for further bioassay experiments (Liu et al., 2010; Wang X. et al., 2017).

### Reverse Docking

TarFisDock, idTarget, and INVDOCK (**Figure 3**) are three reverse docking programs that are currently widely used in predicting the targets and mechanisms of various active biomolecules. A brief introduction to these tools is given in **Table 1**, including the input and output formats, database information and website links. Here, we will also provide a slightly more detailed description of these three tools.

TarFisDock, a web-based tool for predicting the potential binding targets of a given ligand, was first released in 2006 (Li et al., 2006) and last updated in 2008 (Gao et al., 2008). TarFisDock uses reverse docking to search for all possible protein binding partners of small molecules from a potential drug target database called PDTD (Li et al., 2006), which will be described further in the Databases section. This program supports only the mol2 format for the input of the query molecule. TarFisDock (Li et al., 2006) uses the docking program DOCK 4.0 to perform molecular docking between the given molecule and each protein in the PDTD, and it calculates their binding energy based on van der Waals and electrostatic interaction terms by using the Amber force field (Weiner et al., 1984). TarFisDock can output a list of the top 2, 5, or 10% target hits according to binding energy. The main limitation of TarFisDock is the insufficient number of target proteins in the PDTD. The PDTD initially released in 2006 contained only 698 protein structures (Li et al., 2006) and was expanded to contain >830 protein targets in 2008 (Gao et al., 2008). TarFisDock considers the flexibility of the small molecules but has yet to consider the flexibility of the protein targets.

In 2012, the web server idTarget (Wang et al., 2012) was developed to predict the potential binding targets of small molecules via a divide-and-conquer reverse docking approach. To improve the efficiency of target prediction, idTarget uses a contraction-and-expansion strategy to differentiate the protein structures at different levels for molecular docking into the families of homologous structures by clustering almost all of the protein structures deposited in the PDB (Wang et al., 2012). idTarget supports multiple coordinate formats, including pdbqt and mol2, for the input of the given molecule. Then, the program uses MEDock (Chang et al., 2005) to initially generate a large number of conformations of the query molecule and directly orient them inside the grid box of the binding site for molecular docking (Wang et al., 2012). Subsequently, idTarget assesses the binding pose by using semi-empirical score functions derived from quantum chemical charge models and robust regression analysis (Wang et al., 2012). Finally, the program outputs two sets of results, both of which are ranked in ascending order of the predicted binding free energy (ΔG Pred). One set of results is listed according to the names of the proteins, while the other is listed according to the names of the homologous families (Wang et al., 2012). In addition, idTarget provides two modes for searching binding poses, scanning mode and fast mode (Wang et al., 2012). In "scanning mode," molecular docking is performed individually for each protein structure in the database. In "fast mode," the ligand is docked simultaneously to the binding sites of the superposed homologous protein structures, and the binding poses are further minimized by adaptive local sampling (Shindyalov and Bourne, 1998). The fast mode performs quick searches via docking between the ligand and the common binding sites after the protein structures of each homologous family are pre-aligned, but the scanning mode does not limit the docking conformation searches to these predetermined binding sites (Wang et al., 2012). Both modes use the strategy of ensemble docking (Lorber and Shoichet, 1998) to consider the flexibility of the receptor indirectly (Wang et al., 2012).

INVDOCK (Chen and Zhi, 2001), an online service for ligand-protein reverse docking that runs on both Windows and Unix platforms, was developed in 2001. INVDOCK has an in-house protein target database of 9000 protein and nucleic acid entries (Chen and Zhi, 2001). It supports standard 3D ligand structure files, such as the SDF and MOL formats. INVDOCK sets cavities on the protein surface that are covered by a large portion of spherical probes as active binding pockets. The automatic docking is performed by multi-configuration shape matching between the molecule and cavities. Then, torsion optimization and energy minimization are performed on the molecule and on the protein residues in the binding region (Chen and Zhi, 2001). Finally, the simplified DOCK scoring method is used to score the binding energy, and the protein targets are ranked in ascending order by the ligand-protein interaction energy function (ΔELP; Chen and Zhi, 2001). INVDOCK also considers the flexibility of the protein via a limited torsion space sampling of rotatable bonds in the side chains of the target residues at the binding site (Chen and Zhi, 2001).

#### Integrated Software Suites

Some drug design software suites, such as Schrödinger (2018) and Discovery Studio (BIOVIA, 2017), can also be used in shape screening, pharmacophore screening, and reverse docking to predict the protein targets of small molecules.

The Schrödinger modules for reverse screening are "Shape Screening," "Pharmacophore Modeling," and "Docking." However, Schrödinger does not provide any ligand or protein database that can be used for reverse screening, and thus, users must provide the databases themselves. "Shape Screening" and "Pharmacophore Modeling" require a protein-annotated ligand database that can be generated by a simple process from the ligand library Ligand Expo, which can be downloaded from the PDB database (Feng et al., 2004; Rose et al., 2015). Each small-molecule ligand in this library is annotated with its co-crystallized protein target information. The protein structure grid database for reverse docking can be constructed by users via grid generation by the "Docking" module after the "Protein Preparation Wizard" is used to pre-process the protein crystal structure coordinates from the PDB, for example to remove water molecules and add hydrogens. Then, reverse docking can be performed by using "Glide Cross Docking" to dock a given molecule with multiple proteins simultaneously. Although the number of proteins for simultaneous docking in "Glide Cross Docking" is limited (normally ∼50), users can write their own scripts to run reverse docking for one ligand and many proteins. We have performed several reverse screening tasks by using these modules in the Schrödinger software package (Kim et al., 2014; Lim et al., 2014; Wang Z. et al., 2016).

The "Pharmacophore" and "Receptor-Ligand Interaction" modules of Discovery Studio can be used for reverse screening. Users can select shape screening or pharmacophore screening by using the "Ligand Profiler" tool in the "Pharmacophore" module. This tool allows users to upload a database or use the Ligand Profiler Pharmacophore Database, PharmaDB, provided in the software. This database is generated from an annotated database of druggable binding sites called scPDB (Desaphy et al., 2015) based on the PDB and contains the molecular structure and corresponding pharmacophores for calculation of the shape and pharmacophore similarity. The protein crystal structure database for reverse docking must be prepared by the user. These crystal structures can be defined by the "Define and Edit Binding Site" in the "Tools" menu of the "Receptor-Ligand Interaction" module. "Libdock Batch Mode" in the "Protocols" menu of the "Receptor-Ligand Interaction" module can be used for batch docking and has the same effect as reverse docking. Because of the different algorithms and databases needed, the advantages and disadvantages of these two software suites for reverse screening have yet to be evaluated.

### Databases

Databases, whether protein-annotated ligand structure/pharmacophore databases or protein structure grid databases, are key elements of reverse screening. Although reverse screening has been under development for nearly two decades, no general or benchmarked database is available for use in different methods or programs. Here, we have classified the existing relevant databases at different levels (**Figure 4**). The first class of databases is associated with software built by software developers and used for program running. We call these databases "software databases" or "direct databases" (shown at the bottom layer of **Figure 4**). Each database in this class is named for its corresponding software or referred to as the software in-house database. The second class of databases provides resources describing annotated ligands or target structures, but the users must process and collect these resources to construct direct databases for reverse screening. We call these databases "indirect databases" (shown in the middle layer of **Figure 4**), and examples include the PDB (Rose et al., 2015) and ZINC (Sterling and Irwin, 2015). The third class of databases can provide a large amount of information about ligands or proteins that can be used for reverse screening, but collecting and organizing the information from these databases to construct direct databases is difficult. However, we can use them to search for various information resources, such as additional targets, the bioactivities of matched molecules, or the signaling pathways of potential target proteins for further analysis of the reverse screening results. We call these databases "reference databases" (shown in the upper layer of **Figure 4**); examples include PubChem (Kim et al., 2016) and UniProt (Pundir et al., 2015). Relevant information on several direct databases used for reverse screening can be found in **Table 1**. In addition, information on indirect and reference databases, including the database coverage and website links, is listed in **Table 2**. In the following paragraphs, we provide a slightly more detailed introduction to these three classes of databases and their relationships with each other.

# Direct Databases

#### Direct Databases Used in Shape Screening

Each online service for shape screening has its own in-house database, except ROCS (Rush et al., 2005), which requires users to prepare their own protein-annotated ligand databases (Mori et al., 2015). The database information for 12 shape-screening software programs is shown in the "coverage" column of **Table 1**. Among the 11 direct databases, TargetHunter, CSNAP3D and ReverseScreen3D do not give specific capacity data; for the former two, the information is not published, and the latter is noted only as updated with updates to the RCSB PDB, according to the literature (Kinnings and Jackson, 2011).

**Figure 4** shows 10 direct databases with clear sources, including HitPick, CSNAP3D, TargetHunter, SwissTargetPrediction, SuperPred, SwissSimilarity, ChemProt, ChemMapper, SEA, and ReverseScreen3D. These databases are basically constructed by extracting data from indirect databases (**Figure 4**). For example, ChemMapper is built from several public databases, including ChEMBL (Gaulton et al., 2017), DrugBank (Law et al., 2014), BindingDB (Gilson et al., 2016), KEGG (http://www.kegg.jp/kegg/) and the PDB. It collects bioactive targets and pharmacological information on small molecules, each of which has various pre-generated conformations for 3D similarity screening (Gong et al., 2013). ChemProt 3.0 (Kringelum et al., 2016) includes all chemical-protein interaction data from the available open source databases, including ChEMBL (version 19), BindingDB, the Psychoactive Drug Screening Program (PDSP) Ki database (Roth et al., 2000), and DrugBank, as well as clinical information from the Anatomical Therapeutic Chemical (ATC) Classification System (Wang Y. C. et al., 2013) and side effect data from Sider (Kuhn et al., 2016). The SwissSimilarity database gathers protein-annotated ligands mainly from four indirect databases, namely, HMDB (Wishart et al., 2009), ZINC, ChEMBL, and DrugBank, as well as from some reference databases, such as Chemical Entities of Biological Interest (ChEBI; Hastings et al., 2016). The SuperPred (Nickel et al., 2014) database consists of a large data set of ligand-target interactions from two indirect databases, ChEMBL and BindingDB. The CSNAP3D (Lo et al., 2016), TargetHunter (Wang L. et al., 2013), SwissTargetPrediction (Gfeller et al., 2014), and SEA (Keiser et al., 2007) databases are all constructed by taking ligands with protein target information from ChEMBL. The HitPick (Liu et al., 2013) database collects information from the STITCH database (Szklarczyk et al., 2015), and the information in ReverseScreen3D (Kinnings and Jackson, 2011) is extracted from the PDB database. In addition, the TarPred (Liu et al., 2015a) database, which is not shown in **Figure 4**, is a compound-target-disease database built by gathering information from the Comparative Toxicogenomics Database (CTD; Davis et al., 2017) and UniProt.

#### Direct Databases Used in Pharmacophore Screening

Two direct databases used for pharmacophore screening are shown in **Figure 4**. PharmTargetDB is the in-house database of the PharmMapper server (Liu et al., 2010), and PharmaDB (Meslamani et al., 2012) is the direct database deposited and used in Discovery Studio. Since these two databases are updated when the software updates, the different versions of PharmTargetDB and PharmaDB have different data capacities. The numbers of pharmacophore models in the newest versions are also shown in **Table 1**.

The pharmacophore models in PharmTargetDB are derived from the DrugBank, BindingDB, PDB, and PDTD databases. These models are built by extracting pharmacophore features within cavities by using the receptor-based pharmacophore modeling program Pocket 2.0 (Chen and Lai, 2006) after the binding sites of given protein structures are detected and ranked based on their druggability scores by using the software CAVITY (Yuan et al., 2013) for binding site detection (Wang X. et al., 2017). The original version of PharmTargetDB contained more than 7,000 pharmacophore models built from co-crystallized complex structures of protein targets (Liu et al., 2010). A new version of PharmMapper was published in 2017 (Wang X. et al., 2017), and the new PharmTargetDB is six times larger than the previous one, with a total of 23,236 proteins covering 51,431 pharmacophore models. PharmaDB is the pharmacodynamics database for Discovery Studio drug design software, and its pharmacophore models are constructed by using the Receptor-Ligand Pharmacophore Generation Protocol with default settings based on the binding information of ligand and protein complexes in the scPDB database. The original version of PharmaDB contained 68,056 pharmacophore models annotated with receptor information (Meslamani et al., 2012). The latest version of PharmaDB includes 140,000 pharmacophore models (BIOVIA, 2017), and users can utilize it in Discovery Studio to perform rapid reverse pharmacophore screening to search for protein targets of small molecules.

#### Direct Databases Used in Reverse Docking

The direct databases used in reverse docking are collections of target structure grids, which are usually generated from protein crystal structures by using docking programs or their auxiliary software tools. Before grid generation, the target crystal structures downloaded from the PDB must be preprocessed to remove ions and waters, add hydrogens and define the binding pocket. The original 3D protein structures can also be downloaded from some PDB derivative databases, such as the PDBbind-CN Database (Liu Z. et al., 2015), where all valid ligand-protein structures in the PDB are identified and collected.
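As a rough illustration of the first preprocessing step described above (removing ions and waters), the following Python sketch filters HETATM records out of PDB-format text. It relies only on the fixed-column layout of PDB records, in which columns 18-20 hold the residue name. The function name and the two residue-name sets are assumptions made for this example and are far from exhaustive; adding hydrogens and defining the binding pocket are not covered and would normally be handled by the docking program's own preparation tools.

```python
# Residue names treated as water or simple ions.
# Illustrative, incomplete lists chosen for this sketch only.
WATER_NAMES = {"HOH", "WAT", "DOD"}
ION_NAMES = {"NA", "K", "CL", "MG", "CA", "ZN", "MN", "FE"}


def strip_waters_and_ions(pdb_text: str) -> str:
    """Return PDB text with water and simple-ion HETATM records removed.

    ATOM records and all other HETATM records (for example a
    co-crystallized ligand) are kept unchanged.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # Columns 18-20 (1-based) of a PDB coordinate record
            # hold the residue name.
            res_name = line[17:20].strip()
            if res_name in WATER_NAMES or res_name in ION_NAMES:
                continue
        kept.append(line)
    return "\n".join(kept)
```

Only HETATM records are inspected, so protein atoms are never dropped even when an atom name (such as the alpha carbon, CA) coincides with an ion name.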

PDTD, INVDOCK, and idTarget are the three direct databases used for reverse docking shown in **Figure 4**. Among them, PDTD is the only open database, and it can be downloaded as a compressed file, which is then decompressed as a collection of two types of structure files. The first type is the preprocessed protein structure file in the PDB and mol2 formats, and the second type is the active site structure file in PDB format. These two types of structure files for any PDB entry can be downloaded independently and viewed using the molecular visualization tool plug-in (Gao et al., 2008). Currently, this database contains more than 1,100 protein entries with 3D structures in the PDB and covers 841 diverse drug targets associated with diseases, biological functions, and signaling pathways (Gao et al., 2008). The binding (active) sites of these protein structures were defined by a data set of amino acid residues within 6.5 Å of the bound ligand (Gao et al., 2008). The INVDOCK and idTarget databases are not published but are the in-house databases of the corresponding programs. In addition, in contrast to the binding sites defined by the co-crystallized ligands in the PDTD, the binding sites in the INVDOCK and idTarget databases are generated from the available cavities by a search performed with spherical probes. The protein structures in these two databases also originate from the PDB. The INVDOCK in-house database, constructed in 2001, collects 9000 proteins and nucleic acid structures from the PDB database (Chen and Zhi, 2001). Each receptor structure grid in this database is constructed by first calculating the inward-facing surface covering the interface of van der Waals surfaces of the receptor crystal structure with a probe sphere 1.4 Å in radius. The binding site is then defined as the surrounding space within 15 Å of the center of the cavity formed by the combination of neighboring spheres covered by protein atoms in more than 50% of directions. 
Finally, grid generation is performed (Chen and Zhi, 2001). The idTarget database collects all protein structures in the PDB and is regularly updated when the PDB updates (Wang et al., 2012). The binding sites of each protein in the database are dynamically determined, and the grids are constructed by the "divide-and-conquer" method according to the size of the query molecule (Wang et al., 2012). Theoretically, idTarget could be the most extensive and complete database among the three for reverse docking.
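The ligand-based binding site definition used in PDTD (residues within 6.5 Å of the bound ligand) can be sketched in a few lines of Python. This is a minimal illustration, not the PDTD implementation: the function name, the input layout (residue identifier paired with atom coordinates), and the choice to count a residue as soon as any one of its atoms lies within the cutoff are all assumptions of this example.

```python
import math


def binding_site_residues(protein_atoms, ligand_coords, cutoff=6.5):
    """Return sorted residue identifiers near the bound ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    ligand_coords: list of (x, y, z) tuples for the ligand atoms.
    cutoff: distance threshold in angstroms (6.5 follows the PDTD
            definition quoted in the text).
    A residue is included if at least one of its atoms lies within
    the cutoff of at least one ligand atom (an assumption of this sketch).
    """
    site = set()
    for residue_id, coord in protein_atoms:
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_coords):
            site.add(residue_id)
    return sorted(site)
```

The probe-sphere cavity search used by INVDOCK and idTarget is a different, ligand-independent procedure and is not represented here.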

#### Indirect Databases

Indirect databases are rich in ligand and target information and can be simply processed to build direct databases for reverse screening. Nine indirect databases are shown in the middle layer of **Figure 4**, and a brief introduction to these databases, including the coverage, update time and website link, is given in **Table 2**. Among these nine indirect databases, ZINC, ChEMBL, BindingDB, and DrugBank mainly include structural information on ligands and their target annotations, whereas the PDB, scPDB, and Therapeutic Target Database (TTD) mainly provide 3D structures of proteins with ligand binding information. The 15 direct and in-house software-associated databases used for reverse screening are essentially extracted or constructed from these indirect databases. These indirect databases are also extensively linked to reference databases. For example, DrugBank and ChEMBL have mutual data exchange with PubChem, ChEBI, and UniProt. BindingDB (Gilson et al., 2016) collects data from the PubChem and PDSP Ki databases. The scPDB and PDB share protein information with UniProt, including sequences and crystal structures. In addition, HMDB (Wishart et al., 2007) and ZINC share compound information with ChEBI, and the TTD database collects its information on therapeutic protein targets from UniProt.

Users can employ indirect databases to build their own databases when they use large commercial software suites for molecular drug design, such as Schrödinger and Discovery Studio, to perform reverse screening. For example, users can collect small molecules with explicit target annotation from the DrugBank, ChEMBL, and ZINC databases to construct a ligand database for shape screening. They can also use ligand-protein binding information from the BindingDB, scPDB, and PDB databases to build a pharmacophore model database for pharmacophore screening or a protein structure grid database for reverse docking. Notably, Ligand Expo (Feng et al., 2004) from the PDB can be easily used to build in-house program databases for shape or pharmacophore screening after non-ligand small molecules, including metal ions and solvent molecules, are removed. In addition, a collection of protein crystal structures downloaded from the PDB can be used to generate a protein structure grid database for reverse screening. In fact, we built our own databases for use in Schrödinger based on this Ligand Expo database and the protein structures in the PDB, and we then performed shape screening and reverse docking to search for the protein targets of several natural compounds (Kim et al., 2014; Lim et al., 2014; Wang Z. et al., 2016).
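The filtering step mentioned above, in which non-ligand small molecules are discarded and only ligands with explicit target annotation are kept, might look roughly like the following Python sketch. The record layout (`component_id`, `targets`), the function name, and the short exclusion list of PDB chemical component identifiers are hypothetical choices made for illustration; a real pipeline would need a much more complete exclusion list and its own parsing of the source databases.

```python
# A few PDB chemical component IDs for water, metal ions, buffer ions and
# common solvents/cryoprotectants. Illustrative only, not a complete list.
NON_LIGAND_IDS = {
    "HOH", "ZN", "MG", "CA", "NA", "CL",
    "SO4", "PO4", "GOL", "EDO", "DMS",
}


def build_ligand_library(records, excluded_ids=NON_LIGAND_IDS):
    """Group annotated targets by ligand, skipping non-ligand components.

    records: iterable of dicts with a 'component_id' string and an
             optional 'targets' list (hypothetical layout for this sketch).
    Returns a dict mapping ligand ID to the set of its annotated targets.
    """
    library = {}
    for rec in records:
        comp_id = rec["component_id"].upper()
        targets = rec.get("targets", [])
        # Drop ions/solvents and ligands without any target annotation.
        if comp_id in excluded_ids or not targets:
            continue
        library.setdefault(comp_id, set()).update(targets)
    return library
```

Merging the targets of repeated entries into one set per ligand reflects the fact that the same ligand often appears in several complexes.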

#### Reference Databases

Reference databases normally contain a very large amount of information on small-molecule compounds and proteins. However, extracting the rich information resources from these databases to establish direct databases for reverse screening can be difficult for users. Nevertheless, we can utilize these databases to search for additional information on the matched molecules and their potential protein targets from reverse screening results.

The four main reference databases, PubChem, ChEBI, PDSP Ki, and UniProt, which are closely associated with the nine indirect databases, are shown in the upper layer of **Figure 4**. A brief introduction to these four databases is also provided in **Table 2**. PubChem consists of three interrelated sub-databases: substances, compounds, and bioassays. The first two sub-databases provide information on the chemical structure and other properties of small molecules, and the sub-database of bioassays gives information on their pharmacological properties and biological targets (Kim et al., 2016). ChEBI is a freely available dictionary of molecular entities focused on "small" chemical compounds. ChEBI and ChEMBL are both resources of the European Molecular Biology Laboratory, and their data occasionally overlap. Compared with ChEMBL, ChEBI is more focused on "molecular entity" information, such as the chemical and biological roles and applications of a small molecule, rather than on biological target information (Hastings et al., 2013, 2016). The PDSP Ki database is a unique open resource that provides information on the ability of drugs to interact with increasing numbers of molecular targets. The key data in this database are the Ki activity data on ligands, either internally derived or taken from published articles, but it also includes information on protein-annotated ligand structures, protein-ligand affinities, and article sources (Roth et al., 2000). UniProt (Pundir et al., 2015) is a freely accessible resource of protein sequence and function information extracted from gene sequencing and published literature that undergoes quality assurance by curator-evaluated computational analysis (Poux et al., 2017).

The PubChem, ChEBI, UniProt, and PDSP Ki databases are the four major reference databases used by researchers. However, more reference databases containing information on proteins or ligands are available, and interested readers can consult two further database review papers by Zhang Y. et al. (2011) and Moura Barbosa and Del Rio (2012).

#### Applications

We reviewed nearly all articles published since 2000 on the applications of shape screening, pharmacophore screening and reverse docking in searching for the potential targets of small molecules and conducted a systematic analysis. In addition, we provide slightly more detailed descriptions of some representative examples of the application of existing services to drug discovery. In reviewing these previous studies, we found two approaches to method application: the use of a single method and the use of combined methods. In general, shape screening, pharmacophore screening and reverse docking have all been successfully applied individually. However, these methods have their own limitations in terms of features and application scopes, as mentioned above. The majority of newer examples involve the combined application of multiple methods. Finally, all compounds used as illustrative examples in this section are shown in **Figure 1**.

#### Shape Screening

Shape-screening services have a wide range of applications. **Table 3** shows 18 examples of the use of SEA, TargetHunter, ROCS, SwissTargetPrediction, etc. to perform shape screening for molecular target prediction. Here, even if an article involves several query molecules, we still present them as one example.

Reverse screening based on shape similarity has multiple types of applications. First, it is used to search for the targets of molecules from Chinese herbal medicine. For instance, using TargetHunter, Zhang et al. predicted human beta-1 adrenergic receptor (ADRB1) as the protein target of aconitine (compound **3**), an experimentally active component of Sini Decoction in the treatment of cardiovascular disease, and this prediction has been experimentally verified (Zhang H. et al., 2016). Shape screening is also helpful for drug repositioning and for clarifying the mechanisms of action of existing drugs. For example, Keiser et al. conducted shape screening using SEA to reposition 3,665 FDA-approved and investigational drugs, and they successfully predicted unintended targets of several drugs, such as the antagonism of the β1 receptor by the transporter inhibitor Prozac (compound **4**), the inhibition of the 5-HT transporter by the ion channel drug Vadilex (compound **5**), and the antagonism of the histamine H4 receptor by the enzyme inhibitor Rescriptor (compound **6**; Keiser et al., 2009).

#### Pharmacophore Screening

Reverse screening based on pharmacophore modeling is also widely used to search for the targets of components of various traditional Chinese medicines. **Table 4** shows 27 examples of the use of PharmMapper and Discovery Studio to perform pharmacophore screening for the prediction of molecular targets.

For example, Liu et al. used PharmMapper to predict p38, glucocorticoid receptor (GR) and dihydroorotate dehydrogenase (DHODH) as potential targets of berberine (BBR, compound **7**) and further elucidated the possible molecular mechanisms by which these protein targets participate in the anti-melanoma activity of BBR (Liu et al., 2017). Lei et al. employed the Pharmacophore module of Discovery Studio 3.5 in reverse screening and found that the isoquinoline alkaloids (such as compound **8**) extracted from Macleaya cordata (Bo Luo Hui) might target macrophage migration inhibitory factor (MIF), potentially leading to the broad-spectrum antitumor effects of the plant (Lei et al., 2015).

#### Reverse Docking

Reverse screening based on molecular docking is widely used to search for the targets of small molecules to elucidate their mechanisms of action. **Tables 5**, **6** show 25 and 20 examples with and without experimental validation, respectively, of the application of reverse docking to the prediction of molecular targets by using TarFisDock, idTarget, INVDOCK, and conventional virtual screening software such as DOCK, MDock, and AutoDock.

For example, several research groups used INVDOCK to predict that p53 (Lu et al., 2010), calmodulin (CaM; Ma et al., 2013), annexin A5 (ANXA5) and heat shock protein family A member 8 (HSPA8; Lu et al., 2012) might be protein targets of the broad-spectrum anticancer drug BBR (compound **7**).


\*Target prediction confirmed by the literature. Superscript values denote that the protein targets in the second column correspond to the query molecules in the first column, respectively.

TABLE 4 | Applications of pharmacophore screening in predicting protein targets of small molecules.



Zhang et al. employed TarFisDock in the reverse docking of 19 compounds extracted from the traditional Chinese medicines Bacopa monnieri (L.) Wettst and Daphne odora Thunb. var. marginata, and they concluded that five of the compounds (such as compound **9**) might target dipeptidyl peptidase IV (DPP-IV), thus accounting for the effectiveness of these medicines in the treatment of diabetes and their anti-inflammatory effects (Zhang S. et al., 2011). Scafuri et al. (2016) applied idTarget to predict that the proteins guanosine triphosphatase (GTPase), guanosine 5′-monophosphate oxidoreductase (GMP reductase) and hypoxanthine-guanine phosphoribosyltransferase (HGPRT) might be key targets of apple polyphenols (such as compounds **10** and **11**), resulting in their cancer-preventive effects. Grinter et al. used MDock to fish for the protein targets of the compound PRIMA-1 (compound **12**) in the PDTD database and discovered that PRIMA-1 could inhibit the cholesterol synthetic pathway by directly binding with oxidosqualene cyclase (OSC), considerably reducing the viability of BT-474 and T47-D breast cancer cells (Grinter et al., 2011).

#### Hybrid Applications

The combinations of methods include shape screening with reverse docking, shape screening with pharmacophore screening, pharmacophore screening with reverse docking, and the combination of all three methods. **Table 7** shows 22 examples of the use of these four combinations to predict molecular targets.

Eight target prediction examples were performed by combining shape similarity and molecular docking. Kozielewicz et al. employed ReverseScreen3D and TarFisDock to predict the targets of oxindole pentacyclic alkaloids (such as compound **13**) and found that the biological ability of these compounds to induce cancer cell apoptosis may potentially involve the inhibition of several important targets, including dihydrofolate reductase (DHFR) and mouse double minute 2 homolog (MDM2; Kozielewicz et al., 2014). The combination of shape screening and pharmacophore screening has been applied in three instances to predict molecular targets. Bhattacharjee and Chatterjee used PharmMapper and ReverseScreen3D to perform reverse screening and demonstrated that eucalyptol (compound **14**), the effective component of cardamom, might target CASP-3 and cAMP-dependent protein kinase (PKA), resulting in its anti-apoptosis, anti-inflammation, anti-proliferation, anti-invasion and anti-angiogenesis activities in cancer prevention (Bhattacharjee and Chatterjee, 2013). The combination of pharmacophore modeling and reverse docking has been used most frequently to predict the targets of small molecules, as shown by the 10 applications given in **Table 7**. For example, Ge et al. used PharmMapper and idTarget in reverse screening and predicted that dihydropyrimidine dehydrogenase (DPD) and human spindle checkpoint kinase Bub1 were the potential unintended or secondary targets of the antithrombotic agent dipyridamole (DIP, compound **15**), resulting in its anti-cancer activity (Ge et al., 2016). Finally, one application combined all three methods for reverse screening. Gao et al. used the Pharmacophore Modeling and Docking modules in Schrödinger together with ReverseScreen3D to predict molecular targets and found that baicalein (compound **16**), an anti-Parkinson's disease drug, played a protective role in the nervous system by targeting catechol-O-methyltransferase (COMT) and monoamine oxidase B (MAO-B); these targets were confirmed experimentally (Gao et al., 2013).

#### DISCUSSION

#### Comparison of the Applications of the Three Types of Reverse Screening Software and Online Services in the Prediction of Small-Molecule Targets

To provide a better understanding of the application of reverse screening for small-molecule target prediction, we collected as many reverse screening tools and databases as possible, most of which have been updated since 2016 (**Tables 1**, **2**). In addition, we counted the number of applications of these three methods since 2000, the number of applications of their representative software programs, and the application trends over the years. All information is shown in **Tables 3**–**7** and **Figures 5A,B**.

Different types of reverse screening tools have characteristic features. Many software programs and online services are based on shape similarity, and they have a rich supply of ligand databases. These shape screening tools have rapidly updated in-house databases, can screen large chemical databases rapidly, and therefore can be used for large-scale preliminary reverse screening. Although shape-screening methods have the smallest number of applications, a slow upward trend is evident in recent years (**Figure 5A**). Notably, few studies have applied TarPred, SwissSimilarity, and ChemMapper. The main software and online services based on pharmacophore modeling are PharmMapper and Discovery Studio, each with its own pharmacophore database. PharmMapper is a free online service, whereas Discovery Studio is commercial software, leading to more widespread use of PharmMapper. Among the three reverse screening methods, pharmacophore modeling had the fewest applications before 2016, but its application exhibited a significantly escalating trend in 2017 (**Figures 5A,B**). The reverse docking method has been applied the most; however, the applications of reverse docking have shown a downward trend in recent years. The possible reasons are that some online services, such as TarFisDock, have undergone limited expansion of the existing protein crystal structure grid database, while others, such as idTarget, have a long computational time and high computational cost. **Figure 5A** also illustrates the trend of the practical applications of hybrid methods, with a slow rise in the use of combinations of multiple reverse screening methods for target prediction in recent years.

TABLE 6 | Applications of reverse docking in predicting protein targets of small molecules without experimental verification.

In addition, we downloaded from PubChem or sketched in Schrödinger the structures of 57 small-molecule ligands whose targets were predicted by reverse screening and further verified experimentally, as reported in these application articles. We used the Cluster Analysis Module in Schrödinger (in Maestro, Scripts/Cheminformatics/Clustering of Ligands, which opens a window titled Clustering based on Volume Overlap) to conduct a cluster analysis of these small molecules. **Figure 6** shows representative compounds in the 28 clusters we obtained. Their targets were predicted separately by shape screening, pharmacophore screening and reverse docking. By comparing the structures of these molecules, we may be able to summarize some rules regarding the application ranges of each type of method according to the structures of the query molecules. For example, shape screening may be suitable for a query molecule whose structure appears three-dimensional even in a two-dimensional structural view and that is neither very large nor very small (compounds **3** and **17–23**). Pharmacophore screening is appropriate for query molecules whose structures contain diverse pharmacophore functional groups with a good balance between them (compounds **24–30**). Reverse docking is suitable for the most diverse range of query molecules (compounds **31–39**). We hope that this type of cluster analysis can provide some guidance for the effective use of reverse screening to predict small-molecule targets in the future.

#### Deficiencies in Current Reverse Screening Methods and Potential Solutions

Each reverse screening tool has its own characteristics and appropriate application scope in terms of principle, algorithm and program. However, we need to know the application scopes of these tools in order to select the most appropriate software for making accurate predictions. Thus, a horizontal comparison can provide a better understanding of the advantages and disadvantages of these reverse screening tools and their in-house databases. Some clear deficiencies are present in the programs and in-house databases of current reverse screening tools, making a comparison of the efficiency or the accuracy of target prediction by these tools difficult. None of the online services has a general interface module that can be used to upload and recognize user databases. Because these tools cannot use external databases, evaluating the methods or services based on benchmark databases is infeasible. However, researchers may be able to test the pros and cons of these tools in a way that does not require a benchmark database: we may not need to know the superiority of these tools over each other but may instead need to learn their practical uses and application scopes so that they can be better applied in real-life practice. This comparison requires studies to select some benchmarking query compounds whose known targets represent a large category and whose secondary targets or non-targets have also been studied thoroughly. We can use evaluation indexes such as the enrichment factor and receiver operating characteristic (ROC) curve (Truchon and Bayly, 2007) to assess the practical effects of reverse screening tools on the prediction of targets within this large category for other small molecules, thus achieving a horizontal comparison of existing software and online services. This approach is not yet fully developed, and successful examples remain lacking, but it may provide prospects for developing assessments of reverse screening methods and tools.

TABLE 7 | Applications of hybrid screening in predicting protein targets of small molecules.
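To make the two evaluation indexes named above concrete, the sketch below computes an enrichment factor and the area under the ROC curve for a single ranked list of predicted targets. It assumes the tool's output has already been reduced to a list of 0/1 labels ordered from best to worst rank, with 1 marking a known true target and no tied ranks; the function names and the default 10% cutoff are choices made for this example rather than anything prescribed by the tools discussed.

```python
def enrichment_factor(ranked_labels, fraction=0.1):
    """Enrichment factor of true targets in the top fraction of a ranking.

    ranked_labels: list of 0/1 values, best-ranked prediction first.
    fraction: portion of the list treated as the 'top' (assumed default 10%).
    EF = (hit rate in the top portion) / (hit rate in the whole list).
    """
    n = len(ranked_labels)
    total_hits = sum(ranked_labels)
    if n == 0 or total_hits == 0:
        raise ValueError("need a non-empty list with at least one true target")
    n_top = max(1, int(round(n * fraction)))
    top_hits = sum(ranked_labels[:n_top])
    return (top_hits / n_top) / (total_hits / n)


def roc_auc(ranked_labels):
    """Area under the ROC curve for a ranking without ties.

    Equals the probability that a randomly chosen true target is ranked
    above a randomly chosen non-target.
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true target and one non-target")
    positives_seen = 0
    concordant_pairs = 0
    for label in ranked_labels:
        if label:
            positives_seen += 1
        else:
            # Every true target already seen outranks this non-target.
            concordant_pairs += positives_seen
    return concordant_pairs / (n_pos * n_neg)
```

An enrichment factor above 1 or an AUC above 0.5 indicates that known targets are ranked better than random; comparing these values across tools for the same benchmarking query compounds is one way to carry out the horizontal comparison suggested in the text.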

Moreover, reverse screening servers also lack general-purpose databases, and their in-house databases are not publicly available. We cannot learn the inclusion and exclusion criteria for building these direct databases. Almost all direct databases are bound to the corresponding software, and we are unable to conduct potential data mining. The resources of different online services are also undisclosed, and the services cannot refer to each other's databases. Hence, we encourage the developers of all software and online services to disclose their own databases and their construction processes to facilitate user comprehension and utilization. Only in this way can these software databases be better applied in practice, and this approach could also promote the production of higher-quality protein-annotated ligand or target structure grid databases for reverse screening.

#### Previous Reviews and Prospective Studies on Reverse Screening in Molecular Target Prediction

To date, five reviews of reverse screening are available in the literature, which we will discuss briefly below. Readers can also peruse these reviews to deepen their understanding of molecule target prediction algorithms. We will not address other reviews that involve the use of experimental methods or a combination of computational and experimental methods to predict molecular targets (Schenone et al., 2013).

Three of the five reviews are similar to our work and provide broad overviews of in silico target fishing. They describe the principles, databases and software involved in computer-aided small-molecule target prediction in terms of different aspects, perspectives and levels (Rognan, 2010; Zheng et al., 2013; Cereto-Massagué et al., 2015). Cereto-Massagué et al. (2015) categorize the methods of target fishing into four classes according to computational principles: molecular similarity methods, protein structure-based methods, data mining/machine learning methods, and methods based on the analysis of bioactivity spectra. Our review covers the principles and applications of the first two classes, molecular similarity and protein structure, but does not address the latter two categories of machine learning and bioactivity spectra. Therefore, readers can review the article by Cereto-Massagué et al. carefully if they are interested in the calculation methods used in those latter two categories. Rognan (2010) describes only protein structure-based approaches and further classifies them into protein-ligand docking, structure-based pharmacophore searches, binding site similarity measurements, and protein-ligand fingerprints. Based on the principle of receptor structure-based screening, the review describes these four sub-methods in detail and discusses their pros and cons for target fishing and ligand profiling. That review features descriptions of protein pocket similarity matching and the molecular fingerprinting of protein-ligand interaction information, which is worth reading and comparing to our paper. In addition, Zheng et al. (2013) provide a comprehensive overview of computer-aided drug design methods, including conventional (forward) virtual screening and reverse screening, in terms of five aspects: drug target prediction, drug repositioning, protein-ligand interaction, virtual screening and lead optimization, and ADME/T (absorption, distribution, metabolism, excretion, and toxicity) property prediction. The three reverse screening methods we reviewed are closely related to drug target prediction, drug repositioning and protein-ligand interaction. The above review can help readers systematically study and understand the field of computer-aided virtual screening in drug design.

The other two reviews address only reverse docking and its applications. Specifically, Kharkar et al. (2014) give a detailed description of reverse docking programs and their target databases and further discuss the applications of reverse docking in target identification and the prediction of target functions and off-target effects. Lee et al. (2016) summarize target databases, software programs and services and discuss the application of reverse docking in small-molecule target recognition and drug discovery. They also discuss in depth four issues related to reverse docking that remain to be solved: the standardization of database construction, the inclusion of receptor flexibility, the time-consuming nature of flexible receptor docking, and the inaccuracy of binding free energy calculations and ligand binding pose prediction. Readers may refer to these two reviews for a more comprehensive understanding of reverse docking methods.

#### CONCLUSION

In this review article, based on previous studies, we selected the three most commonly used types of reverse screening methods, i.e., methods based on shape similarity, pharmacophore modeling and molecular docking, and provided a detailed and comprehensive introduction, including a description of the principles underlying each method and a systematic classification of software, online services, and databases. In addition, we collected nearly all the articles related to the application of computer-aided target reverse screening prediction published since 2000 and analyzed the possible relationships or correlations between compound structures and screening methods by using cluster analysis. The purpose of this review is to help readers quickly understand these three methods and the characteristics of the software and online services based on these methods, to familiarize readers with the status and applications of the different levels of ligand and protein databases used in reverse screening and to provide a better understanding of how existing tools can be applied to molecule target prediction. We strongly believe that more accurate predictions resulting from the familiarity of users with the existing online services and databases will increase the importance of reverse screening in drug repositioning and future research on the pharmacodynamics and pharmacological mechanisms of bioactive compounds.

#### AUTHOR CONTRIBUTIONS

HH, GZ, YZ, CL, SC, YL, and SM: Information retrieval and analysis and reverse tools identification and classification; HH, GZ, LC, YL, and ZH: Database classification; HH, GZ, YZ, and ZH: Reference classification; HH, GZ, YZ, LC, SC, YL, SM, and ZH: Manuscript writing; HH, GZ, and ZH: Manuscript revision; HH, GZ, SC, YL, and ZH: Figure processing and table processing; HH, GZ, CL, and ZH: Cluster analysis; ZH: Manuscript guidance, communication work, and financial support.

#### ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (31770774), the Natural Science Foundation of Guangdong Province, China (2015A030313518), the Provincial Major Project of Basic or Applied Research in Natural Science, Guangdong Provincial Education Department (2016KZDXM038), and the 2013 Sail Plan The Introduction of the Shortage of Top-Notch Talent Project (YueRenCaiBan [2014] 1). We also thank American Journal Experts (AJE) for their help in revising the English language.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Huang, Zhang, Zhou, Lin, Chen, Lin, Mai and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Structure-Based Design, Synthesis, Biological Evaluation, and Molecular Docking of Novel PDE10 Inhibitors With Antioxidant Activities

Jinxuan Li†, Jing-Yi Chen†, Ya-Lin Deng, Qian Zhou, Yinuo Wu\*, Deyan Wu\* and Hai-Bin Luo

*School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China*

#### Edited by:

*Honglin Li, East China University of Science and Technology, China*

#### Reviewed by:

*Mariya al-Rashida, Forman Christian College, Pakistan*

*Mingyue Zheng, Shanghai Institute of Materia Medica (CAS), China*

#### \*Correspondence:

*Yinuo Wu, wyinuo3@mail.sysu.edu.cn*

*Deyan Wu, wudeyan3@mail.sysu.edu.cn*

*†These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *04 January 2018* Accepted: *24 April 2018* Published: *15 May 2018*

#### Citation:

*Li J, Chen J-Y, Deng Y-L, Zhou Q, Wu Y, Wu D and Luo H-B (2018) Structure-Based Design, Synthesis, Biological Evaluation, and Molecular Docking of Novel PDE10 Inhibitors With Antioxidant Activities. Front. Chem. 6:167. doi: 10.3389/fchem.2018.00167*

Phosphodiesterase 10 (PDE10) is a promising target for the treatment of a range of central nervous system (CNS) diseases. The imbalance between oxidative stress and antioxidant defense systems, a common condition in neurodegenerative disorders such as Alzheimer's disease (AD), Parkinson's disease (PD), and amyotrophic lateral sclerosis (ALS), is widely studied as a potential therapeutic target for CNS diseases. To discover multifunctional agents for the treatment of neurodegenerative diseases, a series of quinazoline-based derivatives with PDE10 inhibitory and antioxidant activities were designed and synthesized. Nine of the 13 designed compounds showed good PDE10 inhibition at a concentration of 1.0 µM. Among these compounds, eight exhibited moderate to excellent antioxidant activity, with oxygen radical absorbance capacity (ORAC) values above 1.0. Molecular docking was performed to better understand the binding patterns of these compounds with PDE10. Compound 11e, which showed remarkable inhibitory activity against PDE10 as well as antioxidant activity, may serve as a lead for further modification.

Keywords: Phosphodiesterase-10A, papaverine, antioxidant activity, Alzheimer's disease, molecular docking

# INTRODUCTION

Phosphodiesterases (PDEs) are an enzyme superfamily responsible for hydrolyzing the intracellular second messengers 3′,5′-cyclic adenosine monophosphate (cAMP) and 3′,5′-cyclic guanosine monophosphate (cGMP) by cleaving their phosphodiester bonds (Liu et al., 2001; Mehats et al., 2002; Castro et al., 2005; Bender and Beavo, 2006; Conti and Beavo, 2007; Houslay, 2010). Because both cAMP and cGMP are involved in the response to various extracellular signals and in many biological processes, inhibiting PDEs prevents their degradation and can thereby correct abnormal physiological processes caused by low concentrations of cAMP and/or cGMP (Lugnier, 2006; Francis et al., 2011). Thus, PDEs have been considered promising targets for various diseases. Currently, up to 12 PDE inhibitors have been approved, including the PDE5 inhibitor sildenafil for erectile dysfunction and pulmonary arterial hypertension and the PDE4 inhibitor roflumilast for chronic obstructive pulmonary disease (Sung et al., 2003; Christie, 2005). PDEs are classified into 11 distinct families (PDE1–11) based on their amino acid sequences, substrate specificities, and pharmacological properties (Bender and Beavo, 2006). Because each family is expressed differently across organs and tissues, selective PDE inhibitors have different therapeutic effects.

Phosphodiesterase 10 (PDE10) is a dual-specificity PDE family that hydrolyzes both cAMP (K<sub>m</sub> = 0.05 µM) and cGMP (K<sub>m</sub> = 3 µM) (Soderling et al., 1999). It is highly expressed in the brain and has been considered a potential target for the treatment of several central nervous system (CNS) disorders such as schizophrenia and Huntington's disease (Hebb et al., 2004). Recent work has shown that blockade of PDE10A with selective inhibitors increases striatal cGMP and phosphorylated cAMP-response element binding protein (CREB), a downstream marker of cAMP production (Siuciak et al., 2006b). PDE10 inhibitors regulate the levels of cAMP and cGMP and activate the downstream dopaminergic and glutamatergic pathways, which may avoid the extrapyramidal side effects (EPS) caused by current anti-schizophrenia drugs. In conditioned avoidance responding (CAR), an animal model predictive of antipsychotic activity, PDE10A inhibitors exhibited dose-dependent inhibition (Jones et al., 2015; Suzuki et al., 2015). Great efforts have been devoted to the development of PDE10 inhibitors in the last decade. Up to seven candidates, such as **MP-10** and **TAK-063**, have entered preclinical or clinical trials (Kehler, 2013; Gentzel et al., 2015; Wilson et al., 2015). However, no PDE10 inhibitor has yet been approved as a drug.

Oxidative stress (OS) has been suggested as a possible element in the pathogenesis of neurodegenerative disorders (Ceballos et al., 1990; Islam, 2017). Studies have shown that neurodegenerative disorders are characterized by different levels of oxidative stress biomarkers and antioxidant defense biomarkers in the brain and peripheral tissues. Recently, some marketed pharmaceuticals with antioxidant activity have been shown to slow neurodegenerative processes and have improved the understanding of the role of OS in the pathobiology of these intractable conditions (Mecocci and Polidori, 2012; Danta and Piplani, 2014). Moreover, experimental studies have demonstrated elevated levels of OS biomarkers, accompanied by impaired antioxidant defenses, in central and peripheral tissues during the pathological processes of Parkinson's disease (PD), Alzheimer's disease (AD), and amyotrophic lateral sclerosis (ALS). Pharmaceuticals with antioxidant activity can rebalance oxidant/antioxidant biomarkers in animal models and are therefore widely studied as possible antineurodegenerative agents (Zhang et al., 2006; Niedzielska et al., 2016). Vinpocetine, a moderate PDE1 inhibitor with antioxidant activity, can significantly improve learning and memory in streptozotocin-infused AD rat models. It acts as a neuroprotective agent with good antioxidant activity and is widely applied in the treatment of CNS disorders; its observed cognitive effects and memory improvement are believed to be linked to its antioxidant mechanism and to elevations of cGMP levels (Hindmarch et al., 1991; Bönöczk et al., 2000). As noted above, PDE inhibitors with antioxidant activities have potential for the treatment of several CNS disorders.

To date, compounds with both PDE10A inhibitory activity and antioxidant activity have seldom been reported. Taking all of this into consideration, a strategy of designing lead compounds that combine the pharmacophores of PDE10A inhibitors and antioxidants appears attractive and challenging. In this study, a series of compounds expected to exhibit both PDE10A inhibition and antioxidant activity were designed and synthesized based on the chemical structure of the natural product papaverine. Five compounds showed moderate to good PDE10A inhibitory activities. Compound **11e** showed good antioxidant activity as well as PDE10A inhibitory activity.

#### MATERIALS AND METHODS

All starting materials and reagents were purchased from commercial suppliers (Adamas, Energy, Bide, Sigma-Aldrich, ShuYa, J&K, and Meryer) and used directly without further purification. Chemical HG/T2354-92 silica gel (200–300 mesh, Haiyang®) was used for chromatography, and silica gel plates with fluorescence F254 (0.25 mm, Huanghai®) were used for thin-layer chromatography (TLC) analysis. Reactions requiring anhydrous conditions were performed under argon or with a calcium chloride drying tube. <sup>1</sup>H NMR and <sup>13</sup>C NMR spectra were recorded at room temperature on a Bruker AVANCE III 400 instrument with tetramethylsilane (TMS) as an internal standard (**Presentation 1**). The following abbreviations are used: s (singlet), d (doublet), t (triplet), m (multiplet), dd (doublet of doublets), dt (doublet of triplets), td (triplet of doublets), and br (broad signal). Coupling constants are reported in Hz. Low- and high-resolution mass spectra (LRMS and HRMS) were recorded on a MAT-95 spectrometer. The purity of all compounds was determined by reverse-phase high-performance liquid chromatography (HPLC) and confirmed to be over 95%. HPLC conditions: SHIMADZU LC-20AT instrument; column, Hypersil BDS C18, 5.0 µm, 4.6 × 150 mm (Elite); detector, SPD-20A UV/VIS with UV detection at 254 nm; elution, MeOH in water (60–80%, v/v); T = 25 °C; flow rate, 0.8–1.0 mL/min.

# 7-methoxy-4-oxo-3,4-dihydroquinazolin-6-yl acetate (4)

Pyridine (4 mL) was added dropwise to a solution of 6-hydroxy-7-methoxyquinazolin-4(3H)-one **3** (1.92 g, 10.0 mmol) in acetic anhydride (20 mL). The reaction mixture was heated at 100 °C for 2 h and then cooled to room temperature. After the mixture was poured into ice water, a white solid precipitated. The precipitate was collected, washed with water, and dried to give compound **4** (2.32 g, 99%) as a white solid. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 8.09 (s, 1H), 7.76 (s, 1H), 7.28 (s, 1H), 3.92 (s, 3H), 2.30 (s, 3H).
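The isolated yields quoted in these procedures follow directly from the molar amounts. As a minimal sketch of that arithmetic (the molecular formula C11H10N2O4 for compound **4** is inferred here from its name and is not stated in the text, and the function names are illustrative), the 99% figure can be reproduced as follows:

```python
# Standard atomic weights (g/mol), rounded; adequate for yield estimates.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molar_mass(composition):
    """Molar mass in g/mol from an element -> atom count mapping."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in composition.items())


def percent_yield(isolated_g, limiting_mmol, product_mw):
    """Isolated yield (%) relative to the limiting reagent."""
    theoretical_g = limiting_mmol / 1000.0 * product_mw
    return 100.0 * isolated_g / theoretical_g


# Compound 4; formula assumed from the name (C11H10N2O4).
mw_4 = molar_mass({"C": 11, "H": 10, "N": 2, "O": 4})
yield_4 = percent_yield(isolated_g=2.32, limiting_mmol=10.0, product_mw=mw_4)
print(f"MW(4) = {mw_4:.2f} g/mol, yield = {yield_4:.0f}%")
```

With these inputs the sketch gives a molar mass of about 234.2 g/mol and a yield of about 99%, in line with the 2.34 g per 10.0 mmol used for **4** in the next step.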

# 4-chloro-7-methoxyquinazolin-6-yl acetate (5)

To a solution of **4** (2.34 g, 10.0 mmol) in SOCl2 (20 mL) was added DMF (0.1 mL) dropwise. The mixture was stirred at 80 °C for 2.5 h and then concentrated under vacuum, providing compound **5** (2.22 g, 88%), which could be used in the next step without further purification. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 9.02 (s, 1H), 8.02 (s, 1H), 7.65 (s, 1H), 4.03 (s, 4H), 2.36 (s, 3H).

# 7-methoxy-4-morpholinoquinazolin-6-ol (7)

A solution of compound **5** (2.52 g, 10.0 mmol) and morpholine (1.04 g, 12.0 mmol) in DMF (20 mL) was stirred at 80 °C for 6 h. The mixture was then poured into ice water, and the white solid that precipitated was collected and washed with ice water to afford compound **6**. Compound **6** was dissolved in methanol (20 mL), ammonia (2.5 mL) was added, and the mixture was stirred under reflux for 2 h. The solvents were evaporated under vacuum, and the crude product was recrystallized from methanol to afford compound **7** (1.87 g, 72%). <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 8.52 (s, 1H), 7.24 (s, 1H), 7.21 (s, 1H), 3.93 (s, 3H), 3.83–3.75 (m, 4H), 3.54–3.47 (m, 4H).

# 4-(6-(2-(1H-indol-3-yl)ethoxy)-7-methoxyquinazolin-4-yl)morpholine (8a)

To a solution of compound **7** (522 mg, 2.0 mmol) in DMF (20 mL) were added 3-(2-bromoethyl)-1H-indole (538 mg, 2.4 mmol) and potassium carbonate (690 mg, 5.0 mmol). The reaction mixture was refluxed for 3 h. After cooling to room temperature, the mixture was quenched with water, and the residue was diluted with CH2Cl2 (30 mL) and washed with saturated aqueous sodium bicarbonate and water. The organic layer was dried over anhydrous sodium sulfate and purified by silica gel column chromatography (petroleum ether/EtOAc, 3:1 to 1:1) to afford the title compound as a yellow solid. Purity: 97%; yield: 20%; <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.67 (s, 1H), 8.29 (s, 1H), 7.69 (d, J = 7.8 Hz, 1H), 7.38 (d, J = 8.0 Hz, 1H), 7.22 (t, J = 7.5 Hz, 1H), 7.17 (s, 1H), 7.15 (t, J = 7.4 Hz, 1H), 7.08 (s, 1H), 4.37 (t, J = 7.0 Hz, 2H), 4.02 (s, 3H), 3.82 (s, 4H), 3.60 (s, 4H), 3.39 (t, J = 7.0 Hz, 2H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 163.76, 155.11, 152.96, 149.17, 148.04, 136.26, 127.46, 122.49, 122.22, 119.51, 118.76, 111.85, 111.42, 111.28, 107.65, 104.44, 69.48, 66.63 × 2, 56.19, 50.21 × 2, 25.19; LRMS (ESI) m/z [M+H]<sup>+</sup> 405.2; HRMS (ESI) m/z calcd C23H24N4O3 [M+H]<sup>+</sup> 405.1927, found 405.1922.
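The calculated HRMS values reported for the final compounds can be checked from the molecular formulas. The following is a minimal sketch, not the procedure used by the authors: the helper names are illustrative, and it assumes that the calculated [M+H]<sup>+</sup> value is obtained by adding the mass of one hydrogen atom to the neutral monoisotopic mass without subtracting the electron mass, which is the convention that reproduces the value quoted for **8a**.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462,
    "F": 18.99840316, "S": 31.97207117, "Br": 78.9183376,
}


def parse_formula(formula):
    """Parse a simple formula such as 'C23H24N4O3' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def mh_plus(formula):
    """Calculated m/z of [M+H]+ for a neutral formula.

    Assumption: one hydrogen atom mass is added and the electron mass
    (about 0.0005 u) is not subtracted.
    """
    neutral = sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())
    return neutral + MONOISOTOPIC["H"]


def ppm_error(found, calculated):
    """Deviation of the measured from the calculated m/z in ppm."""
    return (found - calculated) / calculated * 1e6


calc_8a = mh_plus("C23H24N4O3")
print(f"calcd {calc_8a:.4f}, deviation {ppm_error(405.1922, calc_8a):+.1f} ppm")
```

For **8a** this gives a calculated value of 405.1927, and the measured value of 405.1922 deviates from it by roughly 1 ppm.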

# General Procedure for Synthesis of Compounds 8b-8c

To a solution of compound **7** (522 mg, 2.0 mmol) in DMF (20 mL) were added EDCI (575 mg, 3.0 mmol) and DMAP (12 mg, 0.1 mmol). The reaction mixture was stirred at room temperature for 0.5 h, the corresponding acid (2.4 mmol) was then added, and the mixture was refluxed overnight. After cooling to room temperature, the mixture was quenched with water, and the residue was diluted with CH2Cl2 (30 mL) and washed with saturated aqueous sodium bicarbonate and water. The organic layer was dried over anhydrous sodium sulfate and purified by silica gel column chromatography (petroleum ether/EtOAc, 3:1 to 1:1) to afford the title compound as a white solid.

# 7-methoxy-4-morpholinoquinazolin-6-yl 3-(4-hydroxyphenyl)acrylate (8b)

White solid; purity: 97%; yield: 10%; <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.70 (s, 1H), 7.86 (d, J = 16.0 Hz, 1H), 7.61 (s, 1H), 7.52 (d, J = 7.5 Hz, 2H), 7.35 (s, 1H), 6.90 (d, J = 7.5 Hz, 2H), 6.54 (d, J = 15.7 Hz, 1H), 5.35 (br, 1H), 3.96 (s, 3H), 3.91–3.85 (m, 4H), 3.78–3.73 (m, 4H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 177.80, 164.07, 155.98, 154.20, 151.68, 147.48, 139.24, 138.32, 130.45 × 2, 117.87, 116.23 × 2, 113.08, 110.55, 108.16, 94.86, 66.76 × 2, 56.28, 50.18 × 2; HRMS (ESI) m/z calcd C22H21N3O5 [M+H]<sup>+</sup> 408.1559, found 408.1554.

# 7-methoxy-4-morpholinoquinazolin-6-yl 5-(1,2-dithiolan-3-yl)pentanoate (8c)

White solid; purity: 97%; yield: 16%; <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.66 (s, 1H), 7.48 (s, 1H), 7.31 (s, 1H), 3.93 (s, 3H), 3.90–3.78 (m, 4H), 3.75–3.64 (m, 4H), 3.59 (dt, J = 12.9, 6.4 Hz, 1H), 3.14 (dtd, J = 17.9, 11.4, 6.8 Hz, 2H), 2.63 (t, J = 7.3 Hz, 2H), 2.46 (td, J = 12.4, 6.4 Hz, 1H), 1.91 (td, J = 13.7, 7.0 Hz, 1H), 1.76 (ddd, J = 29.4, 14.5, 8.2 Hz, 4H), 1.58 (ddd, J = 22.9, 14.4, 8.1 Hz, 2H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 177.02, 155.73, 153.99, 151.77, 145.91, 139.05, 117.65, 111.54, 106.70, 66.68 × 2, 56.36, 50.08 × 2, 40.19, 38.45, 34.62, 34.29, 33.71, 28.81, 24.78; HRMS (ESI) m/z calcd C21H27N3O4S2 [M+H]<sup>+</sup> 450.1521, found 450.1534.

# 4-chloro-6,7-dimethoxyquinazoline (10)

To a solution of **9** (10.0 mmol) in SOCl2 (20 mL) was added DMF (0.1 mL) dropwise. The mixture was stirred at 80 °C for 2.5 h and then concentrated under vacuum, providing compound **10** (2.22 g, 90%), which could be used in the next step without further purification. <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 8.88 (s, 1H), 7.46 (s, 1H), 7.40 (s, 1H), 4.01 (d, J = 6.0 Hz, 6H).

# General Procedure for Synthesis of Compounds 11a-11h, 12 and 13

To a solution of **10** (2.0 mmol) and the corresponding amine (3.0 mmol) in isopropanol (20 mL) was added triethylamine (6.0 mmol) dropwise. The reaction mixture was refluxed for 4 h and then concentrated under vacuum to give the crude product, which was purified by silica gel column chromatography (CH2Cl2/MeOH, 100:1 to 40:1) to afford the title compound as a white solid.

# N-(2-(1H-indol-3-yl)ethyl)-6,7-dimethoxyquinazolin-4-amine (11a)

White solid; purity: 97%; yield: 60%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.92 (s, 1H), 10.27 (br, 1H), 8.80 (s, 1H), 8.06 (s, 1H), 7.63 (d, J = 7.8 Hz, 1H), 7.35 (d, J = 8.1 Hz, 1H), 7.26 (d, J = 11.6 Hz, 2H), 7.07 (t, J = 7.5 Hz, 1H), 6.98 (t, J = 7.4 Hz, 1H), 3.97 (s, 1H), 3.95 (s, 7H), 3.93–3.91 (m, 1H), 3.13 (t, J = 7.4 Hz, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 159.44, 156.13, 150.20, 149.28, 136.72, 134.53, 127.62, 123.40, 121.49, 118.80, 114.08, 111.92, 111.57, 107.09, 104.42, 99.94, 57.29, 56.77, 42.80, 24.94; HRMS (ESI) m/z calcd C20H20N4O2 [M+H]<sup>+</sup> 349.1665, found 349.1670.

# N-(1-(1H-indol-3-yl)propan-2-yl)-6,7-dimethoxyquinazolin-4-amine (11b)

White solid; purity: 96%; yield: 55%; <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.59 (s, 1H), 8.37 (br, 1H), 7.70 (d, J = 7.8 Hz, 1H), 7.41 (d, J = 8.1 Hz, 1H), 7.28 (s, 1H), 7.21 (t, J = 7.5 Hz, 1H), 7.12 (t, J = 7.2 Hz, 2H), 6.56 (s, 1H), 5.74 (br, 1H), 4.96–4.85 (m, 1H), 3.97 (s, 3H), 3.75 (s, 3H), 3.24 (dd, J = 14.5, 5.9 Hz, 1H), 3.13 (dd, J = 14.4, 4.6 Hz, 1H), 1.40 (d, J = 6.4 Hz, 3H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 157.73, 154.22, 154.00, 148.80, 146.09, 136.30, 128.25, 123.26, 122.12, 119.79, 118.90, 111.41, 111.28, 108.60, 107.39, 99.58, 56.13, 56.08, 46.72, 31.11, 20.06; HRMS (ESI) m/z calcd C21H22N4O2 [M+H]<sup>+</sup> 363.1821, found 363.1806.

# 6,7-dimethoxy-N-(2-(5-methoxy-1H-indol-3-yl)ethyl)quinazolin-4-amine (11c)

White solid; purity: 97%; yield: 62%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.67 (br, 1H), 8.41 (s, 1H), 8.10 (t, J = 5.5 Hz, 1H), 7.59 (s, 1H), 7.24 (d, J = 8.7 Hz, 1H), 7.17 (d, J = 2.3 Hz, 1H), 7.13 (d, J = 2.4 Hz, 1H), 7.11 (s, 1H), 6.72 (dd, J = 8.7, 2.4 Hz, 1H), 3.89 (d, J = 9.2 Hz, 6H), 3.88 (s, 3H), 3.81 (dd, J = 14.3, 6.3 Hz, 2H), 3.71 (s, 3H), 3.10–3.00 (m, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 158.27, 153.68, 153.60, 152.96, 148.25, 145.80, 131.37, 127.70, 123.27, 112.00, 111.86, 111.10, 108.56, 106.91, 101.99, 100.20, 55.97, 55.62, 55.17, 41.43, 24.85; HRMS (ESI) m/z calcd C21H22N4O3 [M+H]<sup>+</sup> 379.1770, found 379.1766.

# 6,7-dimethoxy-N-(1-(5-methoxy-1H-indol-3-yl)propan-2-yl)quinazolin-4-amine (11d)

White solid; purity: 98%; yield: 51%; <sup>1</sup>H NMR (400 MHz, CDCl3) δ 8.61 (s, 1H), 8.23 (br, 1H), 7.31 (d, J = 8.8 Hz, 1H), 7.20 (s, 1H), 7.14 (d, J = 1.9 Hz, 1H), 7.10 (s, 1H), 6.87 (dd, J = 8.8, 2.1 Hz, 1H), 6.47 (s, 1H), 5.41 (br, 1H), 4.91 (dt, J = 12.5, 6.4 Hz, 1H), 3.99 (s, 3H), 3.73 (s, 3H), 3.70 (s, 3H), 3.69–3.64 (m, 1H), 3.22 (dd, J = 14.5, 6.0 Hz, 1H), 3.08 (dd, J = 14.4, 4.1 Hz, 1H), 1.38 (d, J = 6.5 Hz, 3H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 157.73, 154.25, 154.02, 148.91, 146.18, 131.36, 128.62, 124.05, 112.49, 112.18, 110.96, 108.61, 107.43, 100.45, 99.50, 56.14, 55.92, 55.52, 46.61, 31.10, 19.92; HRMS (ESI) m/z calcd C22H24N4O3 [M+H]<sup>+</sup> 393.1927, found 393.1916.

# 3-(2-((6,7-dimethoxyquinazolin-4-yl)amino)ethyl)-1H-indol-5-ol (11e)

White solid; purity: 97%; yield: 62%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.58 (s, 1H), 10.23 (t, J = 5.0 Hz, 1H), 8.79 (s, 1H), 8.71 (s, 1H), 8.09 (s, 1H), 7.30 (s, 1H), 7.14 (d, J = 8.5 Hz, 2H), 6.98 (d, J = 2.0 Hz, 1H), 6.62 (dd, J = 8.6, 2.2 Hz, 1H), 3.95 (d, J = 4.5 Hz, 6H), 3.90 (dd, J = 14.5, 6.7 Hz, 2H), 3.06–3.02 (m, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 159.03, 154.98, 152.15, 151.72, 150.69, 149.35, 131.28, 128.41, 123.65, 112.17, 111.80, 111.13, 108.23, 104.32, 103.27, 102.82, 56.79, 56.41, 42.20, 25.32; HRMS (ESI) m/z calcd C20H20N4O3 [M+H]<sup>+</sup> 365.1614, found 365.1611.

# N-(2-(6-fluoro-1H-indol-3-yl)ethyl)-6,7-dimethoxyquinazolin-4-amine (11f)

White solid; purity: 97%; yield: 44%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.90 (br, 1H), 8.39 (s, 1H), 8.09 (t, J = 5.5 Hz, 1H), 7.62 (dd, J = 8.6, 5.5 Hz, 1H), 7.59 (s, 1H), 7.21 (d, J = 2.1 Hz, 1H), 7.12 (dd, J = 11.1, 3.2 Hz, 2H), 6.85 (ddd, J = 9.8, 8.8, 2.3 Hz, 1H), 3.90 (s, 3H), 3.88 (s, 3H), 3.79 (dd, J = 14.3, 6.3 Hz, 2H), 3.08–3.02 (m, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 158.75, 154.27, 153.81, 148.81, 145.57, 136.57, 124.68, 123.73, 119.88, 112.71, 108.88, 107.32, 107.08, 102.48, 97.92, 97.67, 56.50, 56.17, 41.87, 25.22; HRMS (ESI) m/z calcd C20H19FN4O2 [M+H]<sup>+</sup> 367.1570, found 367.1766.

# 6,7-dimethoxy-N-(2-(5-methyl-1H-indol-3-yl)ethyl)quinazolin-4-amine (11g)

White solid; purity: 97%; yield: 68%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.67 (br, 1H), 8.39 (s, 1H), 8.05 (t, J = 5.4 Hz, 1H), 7.58 (s, 1H), 7.38 (s, 1H), 7.23 (d, J = 8.2 Hz, 1H), 7.15 (s, 1H), 7.10 (s, 1H), 6.90 (d, J = 8.2 Hz, 1H), 3.90 (s, 3H), 3.88 (s, 3H), 3.79 (dd, J = 13.5, 6.7 Hz, 2H), 3.04 (t, J = 7.5 Hz, 2H), 2.35 (s, 3H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 158.70, 154.14, 148.70, 146.49, 135.13, 128.07, 127.01, 123.19, 122.95, 118.54, 112.04, 111.54, 109.07, 107.54, 102.47, 99.99, 56.47, 56.12, 42.00, 25.38, 21.74; HRMS (ESI) m/z calcd C21H22N4O2 [M+H]<sup>+</sup> 363.1821, found 363.1816.

# N-(2-(5-bromo-1H-indol-3-yl)ethyl)-6,7-dimethoxyquinazolin-4-amine (11h)

White solid; purity: 97%; yield: 72%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 11.04 (br, 1H), 8.40 (s, 1H), 8.13 (t, J = 5.6 Hz, 1H), 7.82 (d, J = 1.7 Hz, 1H), 7.57 (s, 1H), 7.32 (d, J = 8.6 Hz, 1H), 7.27 (d, J = 2.1 Hz, 1H), 7.18 (dd, J = 8.6, 1.9 Hz, 1H), 7.10 (s, 1H), 3.90 (d, J = 6.6 Hz, 6H), 3.78 (dd, J = 13.5, 6.8 Hz, 2H), 3.05 (t, J = 7.3 Hz, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 158.85, 154.49, 153.24, 149.00, 144.50, 135.41, 129.75, 125.00, 123.79, 121.34, 113.87, 112.42, 111.43, 108.78, 106.27, 102.90, 56.67, 56.23, 42.06, 25.07; HRMS (ESI) m/z calcd C20H19BrN4O2 [M+H]<sup>+</sup> 427.0770, found 427.0759.

# N-(4-(1H-indol-2-yl)butan-2-yl)-6,7-dimethoxyquinazolin-4-amine (12)

White solid; purity: 97%; yield: 60%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 10.71 (s, 1H), 8.32 (s, 1H), 7.67 (s, 1H), 7.58 (d, J = 7.6 Hz, 1H), 7.49 (d, J = 7.6 Hz, 1H), 7.32 (d, J = 8.0 Hz, 1H), 7.09 (t, J = 7.1 Hz, 2H), 7.04 (d, J = 7.7 Hz, 1H), 6.94 (t, J = 7.2 Hz, 1H), 4.58–4.48 (m, 1H), 3.90 (dd, J = 9.9, 4.6 Hz, 6H), 2.77 (t, J = 6.6 Hz, 2H), 2.02–1.88 (m, 2H), 1.31 (d, J = 6.3 Hz, 3H); <sup>13</sup>C NMR (101 MHz, CDCl3) δ 157.79, 154.22, 154.19, 148.79, 146.52, 136.45, 127.22, 122.06, 121.39, 119.24, 118.75, 115.83, 111.19, 108.52, 107.79, 99.31, 56.22, 56.15, 47.07, 36.86, 22.15, 21.13; HRMS (ESI) m/z calcd C22H24N4O2 [M+H]<sup>+</sup> 377.1978, found 377.1972.

# N-(2-(1H-benzo[d]imidazol-2-yl)ethyl)-6,7-dimethoxyquinazolin-4-amine (13)

White solid; purity: 98%; yield: 68%; <sup>1</sup>H NMR (400 MHz, DMSO-d6) δ 9.31 (br, 1H), 8.56 (s, 1H), 7.84 (s, 1H), 7.52 (dd, J = 5.9, 3.2 Hz, 2H), 7.19 (s, 2H), 7.18 (d, J = 3.2 Hz, 1H), 4.17–4.01 (m, 2H), 3.92 (s, 3H), 3.89 (s, 3H), 3.34–3.26 (m, 2H); <sup>13</sup>C NMR (101 MHz, DMSO-d6) δ 159.33, 155.46, 152.92, 150.99, 150.93, 149.68, 149.66, 137.57, 137.46, 122.67 × 2, 114.82, 107.93, 103.66, 102.90, 56.97, 56.55, 31.18, 28.19; HRMS (ESI) m/z calcd C19H19N5O2 [M+H]<sup>+</sup> 350.1617, found 350.1616.

#### Protein Expression and Purification

The recombinant pET15b-PDE10A plasmid encoding the catalytic domain (residues 446–789) was subcloned and purified according to previously reported protocols (Li et al., 2015) and then transformed into E. coli strain BL21 (Codonplus, Stratagene). The E. coli cells carrying the recombinant plasmid were cultured in 2×YT medium (containing 100 µg/mL ampicillin and 30 µg/mL chloramphenicol) at 37 °C until OD<sub>600</sub> = 0.6–0.8. Then, 1 mM isopropyl-β-D-thiogalactopyranoside was added to induce PDE10A protein expression at 20 °C for 24 h. A nickel-nitrilotriacetic acid (Ni-NTA) column (Qiagen) was used to purify the PDE10A protein. The concentration of the PDE10A fractions was estimated from the absorbance at 280 nm, using the extinction coefficient calculated by the ProtParam software. A typical purification batch yielded 100–200 mg of PDE10A protein from 1.0 L of cell culture.
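Estimating protein concentration from the absorbance at 280 nm relies on the Beer-Lambert law, A = ε·c·l. A minimal sketch of the conversion is given below; the extinction coefficient and molecular weight used in the example are hypothetical placeholders, not the ProtParam values for the PDE10A catalytic domain, and the function name is illustrative.

```python
def protein_conc_mg_per_ml(a280, ext_coeff, mol_weight, path_cm=1.0, dilution=1.0):
    """Protein concentration (mg/mL) from A280 via the Beer-Lambert law.

    ext_coeff  : molar extinction coefficient at 280 nm (M^-1 cm^-1)
    mol_weight : molecular weight (g/mol)
    path_cm    : optical path length (cm)
    dilution   : dilution factor applied before the measurement
    """
    molar = a280 / (ext_coeff * path_cm)       # mol/L
    return molar * mol_weight * dilution       # g/L, which equals mg/mL


# Hypothetical placeholder values, NOT taken from the study.
EXT_COEFF = 40000.0    # M^-1 cm^-1
MOL_WEIGHT = 40000.0   # g/mol

conc = protein_conc_mg_per_ml(0.50, EXT_COEFF, MOL_WEIGHT, dilution=10)
print(f"{conc:.2f} mg/mL")
```

With these placeholder inputs, a tenfold-diluted sample reading A<sub>280</sub> = 0.50 corresponds to about 5 mg/mL in the undiluted fraction.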

## PDE10A Enzymatic Assays

The enzymatic activity of the PDE10A catalytic domain was assayed using <sup>3</sup>H-cGMP as the substrate in an assay buffer of 50 mM Tris (pH 7.5), 4 mM MgCl2, and 1 mM DTT, with <sup>3</sup>H-cGMP giving 20,000–30,000 cpm per assay after termination of the reaction. The PDE10A enzyme in assay buffer was added to DMSO solutions of the test compounds at different concentrations, and the enzymatic reaction was carried out at room temperature for 15 min. The reaction was then terminated by the addition of 0.2 M ZnSO4. Subsequently, 0.2 N Ba(OH)2 was added to precipitate the reaction product <sup>3</sup>H-GMP, whereas unreacted <sup>3</sup>H-cGMP remained in the supernatant. The radioactivity in the supernatant was measured in 2.5 mL of Ultima Gold liquid scintillation cocktail (PerkinElmer) with a liquid scintillation counter (PerkinElmer 2910). The IC<sub>50</sub> values of the test compounds against PDE10A were determined by nonlinear regression from three independent experiments. Papaverine, with an IC<sub>50</sub> of 0.1 µM, was used as the reference compound for the enzymatic assay.
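The text does not specify which regression model was used to obtain the IC<sub>50</sub> values. As an illustration only, the sketch below fits the simplest one-site inhibition model (Hill slope fixed at 1, full inhibition at saturating concentration) by least squares over log<sub>10</sub>(IC<sub>50</sub>); the model choice, the search bounds, and the synthetic data are all assumptions and not part of the original study.

```python
def inhibition(conc, ic50):
    """Percent inhibition for a one-site model with a Hill slope of 1."""
    return 100.0 * conc / (conc + ic50)


def fit_ic50(concs, inhibitions, lo=-3.0, hi=3.0, iters=200):
    """Least-squares estimate of IC50 (same units as concs).

    A ternary search over log10(IC50) is used; lo and hi are assumed
    bounds on log10(IC50) and should bracket the expected value.
    """
    def sse(log_ic50):
        ic50 = 10.0 ** log_ic50
        return sum((inhibition(c, ic50) - y) ** 2
                   for c, y in zip(concs, inhibitions))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if sse(m1) < sse(m2):
            hi = m2
        else:
            lo = m1
    return 10.0 ** ((lo + hi) / 2.0)


# Synthetic, noise-free data generated from IC50 = 0.5 uM (illustrative only).
concs = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]   # uM
data = [inhibition(c, 0.5) for c in concs]
fitted = fit_ic50(concs, data)
print(f"fitted IC50 = {fitted:.3f} uM")
```

In practice a four-parameter logistic model with free top, bottom, and slope is often preferred; the one-parameter form is used here only to keep the example short.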

### Antioxidant Assay

The modified oxygen radical absorbance capacity fluorescein (ORAC-FL) method was used to determine antioxidant activity (Ou et al., 2001; Dávalos et al., 2004). The reagents were diluted with 75 mM phosphate buffer (pH 7.4), and the final volume of the reaction mixture was 200 µL per well. Test compound (20 µL) and fluorescein (120 µL, 150 nM final concentration) were placed in the wells of a black 96-well optical-bottom plate. After the mixture was incubated at 37 °C for 15 min, AAPH solution (60 µL, 12 mM final concentration) was added rapidly. The plate was placed in a Spectrafluor Plus plate reader (Tecan, Crailsheim, Germany), and the fluorescence was recorded every minute for 4 h with an excitation wavelength of 485 nm and an emission wavelength of 535 nm. Trolox was used as the standard (1–8 µM, final concentration). A blank (fluorescein + AAPH) with phosphate buffer instead of test compound and a trolox calibration were included in each assay. The samples were measured at different concentrations (1–10 µM). All reaction mixtures were prepared in quadruplicate, and at least three independent assays were performed for each sample. The fluorescence time course was normalized to the blank (without antioxidant). The ORAC-FL values were calculated according to the reported method. Final ORAC-FL values were expressed in µM of trolox equivalents. Ferulic acid was used as the positive reference compound and showed an ORAC-FL value of 1.6 trolox equivalents.
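The text refers to the reported method for the calculation without restating it. A minimal sketch is given below under the assumption that the commonly used ORAC-FL formulation applies: the area under the normalized fluorescence decay curve is AUC = 1 + Σ f<sub>i</sub>/f<sub>0</sub>, and the net AUC of the sample is compared with that of trolox after correction for concentration. The function names, the single-point trolox comparison (instead of a full calibration curve), and the toy decay curves are illustrative.

```python
def normalized_auc(readings):
    """AUC = 1 + sum(f_i / f_0) for readings taken at equal time intervals."""
    f0 = readings[0]
    return 1.0 + sum(f / f0 for f in readings[1:])


def orac_fl(sample, blank, trolox, conc_trolox, conc_sample):
    """ORAC-FL value in trolox equivalents from fluorescence decay curves."""
    net_sample = normalized_auc(sample) - normalized_auc(blank)
    net_trolox = normalized_auc(trolox) - normalized_auc(blank)
    return (net_sample / net_trolox) * (conc_trolox / conc_sample)


# Toy decay curves in arbitrary fluorescence units (illustrative only).
blank = [100.0, 50.0, 0.0]
trolox = [100.0, 100.0, 50.0]
sample = [100.0, 100.0, 100.0]

value = orac_fl(sample, blank, trolox, conc_trolox=5.0, conc_sample=5.0)
print(value)  # 1.5 for these toy curves
```

A value above 1 means that, at equal concentration, the sample protects fluorescein from AAPH-derived radicals for longer than trolox does.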

# Molecular Docking Studies

The starting conformations of the synthesized compounds were generated using Accelrys Discovery Studio 2.5.5. The crystal structure of PDE10A in complex with an inhibitor possessing the same quinazoline core (PDB code: 3QPN) was used as the reference (Helal et al., 2011). The binding site was defined by the co-crystallized PDE10A inhibitor in PDB entry 3QPN. The Surflex-Dock module in Tripos Sybyl 1.2 (Jain, 2003) was used to obtain the dominant docking conformations in this study.
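Defining a binding site from a co-crystallized ligand amounts to selecting the protein residues that surround it. The sketch below illustrates this idea with a plain distance cutoff applied to PDB-format coordinates. It does not reproduce how Surflex-Dock itself represents the site; the 6.0 Å cutoff, the function names, and the ligand residue name passed by the caller are assumptions made for illustration.

```python
import math


def parse_atoms(pdb_text):
    """Yield (record, resname, chain, resseq, xyz) from PDB-format text."""
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield (line[0:6].strip(), line[17:20].strip(),
                   line[21], int(line[22:26]), xyz)


def binding_site_residues(pdb_text, ligand_resname, cutoff=6.0):
    """Protein residues with any atom within `cutoff` angstroms of the ligand."""
    atoms = list(parse_atoms(pdb_text))
    ligand = [xyz for rec, res, _, _, xyz in atoms
              if rec == "HETATM" and res == ligand_resname]
    site = set()
    for rec, res, chain, seq, xyz in atoms:
        if rec == "ATOM" and any(math.dist(xyz, lig_xyz) <= cutoff
                                 for lig_xyz in ligand):
            site.add((chain, seq, res))
    return sorted(site)
```

Applied to the coordinates of entry 3QPN with the residue name of its bound inhibitor, such a selection would be expected to include the residues discussed in this article, but the exact list depends on the cutoff chosen.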

# RESULTS AND DISCUSSION

### Rational Design of PDE10 Inhibitors

Papaverine, a natural drug used clinically for the prevention of vasospasm, has been shown to have good inhibitory activity toward PDE10A (IC<sub>50</sub> = 10–300 nM) (Siuciak et al., 2006a). Based on the structure of papaverine, several quinazoline compounds have been developed as PDE10A inhibitors, such as compounds **1** and **2** (**Figure 1**; Chappie et al., 2007; Helal et al., 2011). The crystal structure of the complex of **1** with PDE10A provided the following information for further structural modification. First, the quinazoline ring is located in a hydrophobic clamp formed by Phe719 and Phe686 in the PDE10A protein and forms a π-π interaction with Phe719. Second, the 6,7-dimethoxy group on the quinazoline ring forms a bidentate interaction with Gln716 in the pocket. As Gln716 is regarded as a conserved amino acid residue in PDE10A, the interaction with Gln716 is the main reason for the high affinity of these compounds for the PDE10A protein. Thus, the quinazoline ring was kept as the core in our designed compounds. Last but not least, the piperazine ring of **1** is located outside the catalytic site of the PDE10 protein, providing room for introducing a fragment with antioxidant activity. Furthermore, compound **2**, a PDE10 inhibitor developed from **1**, has a quinoline ring attached to the quinazoline core in order to completely fill the selectivity pocket composed mainly of Tyr683, Met703, and Gly715 in the PDE10A protein. Occupying this unique pocket in PDE10A may significantly enhance the selectivity of the inhibitors over other PDEs. In addition, the 6-position of the quinazoline core also provides space and synthetic accessibility for conveniently introducing a fragment. Based on this evidence, a strategy combining PDE10A inhibitory activity and antioxidant activity might improve the druggability of hits. Different fragments from antioxidants (such as melatonin, ferulic acid, and lipoic acid) were attached at the 6-position or 4-position of the quinazoline ring to form the designed compounds, which were carried forward into the following synthetic work.

#### Chemistry

The synthetic route to compounds **8a-8c** is outlined in **Scheme 1** (Chandregowda et al., 2009). 6-Hydroxy-7-methoxyquinazolin-4(3H)-one was reacted with acetic anhydride in pyridine to protect the phenolic hydroxyl group, providing intermediate **4**. Chlorination of **4** with SOCl2 gave 4-chloro-7-methoxyquinazolin-6-yl acetate **5**, which was used directly in the reaction with morpholine without further purification, providing 7-methoxy-4-morpholinoquinazolin-6-yl acetate **6**. Hydrolysis of the acetyl group of **6** with concentrated ammonia in MeOH gave **7**. The final products **8a-8c** were obtained by S<sub>N</sub>2 displacement with the corresponding bromide or by condensation with the corresponding acids in moderate yields.

Compounds **11a-11h**, **12**, and **13** were synthesized in three steps (**Scheme 2**). Starting from 6,7-dimethoxyquinazolinone, chlorination led to intermediate **10**, which was reacted with different melatonin derivatives to afford compounds **11a-11h**, **12**, and **13** in high yields (Garofalo et al., 2012; Min et al., 2016).

#### Structure-Activity Relationships

Two series of compounds with antioxidant fragments at the 4- or 6-position of the quinazoline core were synthesized. The inhibitory activities of these compounds toward PDE10A were evaluated with papaverine as the positive control. LogP (octanol-water partition coefficient) was calculated as a measure of lipophilicity; according to Lipinski's Rule of Five, most drug candidates have a LogP value below 5. In our study, all designed and synthesized compounds have good LogP values. TPSA (topological polar surface area) provides a good estimate of the ability of compounds to cross the blood-brain barrier (BBB), and high penetration is necessary for drug candidates targeting central nervous system (CNS) diseases. We were pleased to find that compound **8a**, with a 2-(1H-indol-3-yl)ethyl fragment at the 6-position, showed good inhibitory activity within the series **8a-8c**, although it was not as potent as papaverine and **1**, and also gave an ORAC value of 1.0. Compounds **8b** and **8c** showed only 26 and 24% inhibition of PDE10A at a concentration of 1 µM, respectively, even though **8b** showed a good TPSA value and good antioxidant activity, with an ORAC value of 3.3 (**Table 1**). We concluded that the ester group in the linker caused steric hindrance in the PDE10A pocket, resulting in the low inhibitory activities.

According to the structures of reported PDE10-inhibitor complexes, Gln716 and Tyr683 in the PDE10 catalytic domain are two key amino acid residues for the interaction between inhibitors and the PDE10 protein. On this basis, we hypothesized that introducing a long chain at the 6-position of our hit compounds to form an interaction with Tyr683 might improve the inhibitory activity toward PDE10. In the other series, compounds **11a-11h**, **12**, and **13**, different groups were placed at the 4-position of the quinazoline core while the methoxy group at the 6-position remained unchanged. The results were encouraging. Compounds **11a**, **11e**, **11f**, **12**, and **13** showed good PDE10A inhibitory activities, with IC<sub>50</sub> values below 1 µM. As depicted in **Figure 2**, compounds **11f** and **12** exhibited the best PDE10A inhibitory activities, with IC<sub>50</sub> values of 0.33 and 0.24 µM. However, they showed only moderate antioxidant activities, with ORAC (oxygen radical absorbance capacity) values of 1.0 and 1.3. Compound **11e**, with the third-best IC<sub>50</sub> of 0.64 µM on PDE10A and a good ORAC value of 2.3, reached a compromise between PDE10A inhibitory activity and antioxidant activity. From


#### TABLE 1 | Inhibitory activities against PDE10A, TPSA, LogP, and oxygen radical absorbance capacity of compounds 8a-8c, 11a-11h, 12, and 13.



*<sup>a</sup>IC<sub>50</sub> values are given as the mean of three independent determinations.*

*<sup>b</sup>PSA and LogP values are calculated by Accelrys Discovery Studio 2.5.5.*

*<sup>c</sup>ORAC results are expressed as trolox equivalents.*

**11a** to **11b** and **12**, it can be observed that the inhibition of PDE10 was affected by introducing an extra carbon atom and a methyl group in the linker, while the antioxidant activity was not much affected. In contrast, the antioxidant activity weakened from **11c** to **11d**, with the ORAC (oxygen radical absorbance capacity) value falling from 2.2 to 1.3. We concluded that the slight change of introducing a methyl group may cause steric hindrance (**11b**, **11d** and **12** compared to **11a**, **11c** and **11d**), leading to inappropriate occupation in the PDE10 protein and thus to the decreased inhibition of PDE10A. The predicted TPSA values of **8b** and **11e** seem favorable for BBB penetration. Taking everything into consideration, compound **11e**, which showed good PDE10 inhibitory activity and antioxidant activity, is a good lead for further modification.
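
The trade-off between potency and antioxidant capacity can be made explicit with a simple Pareto analysis. The sketch below is illustrative only and is not part of the original study: it uses the IC<sub>50</sub> and ORAC values quoted in the text for **11f**, **12**, and **11e**, assuming that the paired values for **11f** and **12** are listed in the same order as the compounds.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    ic50_um: float  # PDE10A IC50 in µM (lower is better)
    orac: float     # ORAC in trolox equivalents (higher is better)


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.ic50_um <= b.ic50_um and a.orac >= b.orac
    better = a.ic50_um < b.ic50_um or a.orac > b.orac
    return no_worse and better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep the compounds that no other compound dominates."""
    return [r for r in results
            if not any(dominates(o, r) for o in results if o is not r)]


# Values quoted in the text (pairing of 11f and 12 assumed to follow the listed order).
data = [
    Result("11f", 0.33, 1.0),
    Result("12", 0.24, 1.3),
    Result("11e", 0.64, 2.3),
]
front = [r.name for r in pareto_front(data)]
print(front)
```

Under these assumptions **11e** is not dominated by the more potent compounds, which is consistent with describing it as a compromise between the two activities.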

# Molecular Docking

Molecular docking was performed to better understand the binding modes between the inhibitors and the PDE10 protein. The structure of PDE10 in complex with papaverine was reported in 2009 (PDB code: 2WEY) and clarified the interaction between papaverine and PDE10 (Andersen et al., 2009). Compounds **8a**-**8c**, **11a-11h**, **12**, and **13** were all docked into the PDE10 catalytic domain using Surflex-Dock in Tripos Sybyl 1.2 (Jain, 2003). From the docking results (**Figure 3**), we found that the quinazoline core of this series of compounds occupied the same position as papaverine, and that the oxygen of the quinazoline core formed a hydrogen bond with Gln716. In addition, the quinazoline ring makes hydrophobic interactions with Phe719 and Phe686. These interactions have been reported to be critical determinants of the binding capacity of PDE10 inhibitors. In the series **8a**-**8c**, the groups introduced at the 6-position of the quinazoline core could fill the selective pocket of PDE10. However, no hydrogen bond was observed between the compounds and Tyr683. The substituents of **8b** and **8c** might be too bulky for the PDE10 selective pocket, so their PDE10 inhibitory activities were weaker than that of **8a**. Compounds **11a**-**11h** had similar docking patterns because they structurally resemble each other. Compounds **8a** and **11a** share the same 2-(1H-indol-3-yl)ethyl group, yet in their docked conformations in complex with PDE10A the side chains stretch in different directions. The side chain of **8a** resides in the selective pocket of PDE10A but forms no hydrogen bond with Tyr683. In contrast, the side chain of **11a** stretches out of the catalytic site, adopting the same pattern as papaverine (**Figure 3d**).

# CONCLUSION

In summary, a series of novel PDE10A inhibitors with antioxidant activities was designed and synthesized using a structure-based discovery strategy; these compounds are potential therapeutics for the neurodegenerative diseases PD, AD, or ALS. Starting from the lead compound papaverine, 13 new quinazoline-based derivatives were synthesized and evaluated in inhibitory assays. Nine of the 13 compounds showed good PDE10A inhibitory activities at a concentration of 1 µM. Among these compounds, eight exhibited moderate to good antioxidant activity, with ORAC values above 1.0. Particularly noteworthy is compound **11e**, which gave an IC<sub>50</sub> value of 0.64 µM on PDE10A and a good ORAC value of 2.3. This work describes a structure-based discovery strategy and quinazoline-based derivatives with both PDE10A inhibitory activity and antioxidant activity, and might provide a new perspective for the development of novel PDE10A inhibitors.

# AUTHOR CONTRIBUTION

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# ACKNOWLEDGMENTS

This work was supported by the Natural Science Foundation of China (81602955, 81522041, 21572279, and 81373258), Science Foundation of Guangdong Province (2016A030310144), the Fundamental Research Funds for the Central Universities (Sun Yat-Sen University) (17ykpy20), and Medical Scientific Research Foundation of Guangdong Province (A2016104). We cordially thank Prof. Hengming Ke from the Department of Biochemistry and Biophysics at the University of North Carolina, Chapel Hill, for his help with molecular cloning, expression, purification, crystal structure, and bioassay of PDEs.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00167/full#supplementary-material

Presentation 1 | <sup>1</sup>H NMR and <sup>13</sup>C NMR data for tested compounds.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Li, Chen, Deng, Zhou, Wu, Wu and Luo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Quantum Chemical Approaches in Structure-Based Virtual Screening and Lead Optimization

#### Claudio N. Cavasotto\*, Natalia S. Adler and Maria G. Aucar

Laboratory of Computational Chemistry and Drug Design, Instituto de Investigación en Biomedicina de Buenos Aires, CONICET, Partner Institute of the Max Planck Society, Buenos Aires, Argentina

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

F. Javier Luque, Universitat de Barcelona, Spain

Serdar Durdagi, Bahçeşehir University, Turkey

#### \*Correspondence:

Claudio N. Cavasotto cnc@cavasotto-lab.net; ccavasotto@ibioba-mpsp-conicet.gov.ar

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 18 January 2018 Accepted: 09 May 2018 Published: 29 May 2018

#### Citation:

Cavasotto CN, Adler NS and Aucar MG (2018) Quantum Chemical Approaches in Structure-Based Virtual Screening and Lead Optimization. Front. Chem. 6:188. doi: 10.3389/fchem.2018.00188

Today computational chemistry is a consolidated tool in drug lead discovery endeavors. Owing to methodological developments and to the enormous advances in computer hardware, methods based on quantum mechanics (QM) have gained great attention in the last 10 years, and calculations on biomacromolecules are increasingly being explored, aiming to provide better accuracy in the description of protein-ligand interactions and the prediction of binding affinities. In principle, the QM formulation includes all contributions to the energy, accounting for terms usually missing in molecular mechanics force-fields, such as electronic polarization effects, metal coordination, and covalent binding; moreover, QM methods are systematically improvable, and provide a greater degree of transferability. In this mini-review we present recent applications of explicit QM-based methods in small-molecule docking and scoring, and in the calculation of binding free energy in protein-ligand systems. Although the routine use of QM-based approaches in an industrial drug lead discovery setting remains a formidable challenge, it is likely that they will increasingly become active players within the drug discovery pipeline.

Keywords: quantum mechanics, semi-empirical methods, structure-based drug design, molecular docking, drug lead optimization, binding free energy, molecular dynamics

# INTRODUCTION

The drug discovery process relied for many years on the experimental high-throughput screening of large chemical libraries to identify and optimize new drug lead compounds. In spite of efforts to improve its efficiency, this remained an expensive and time-consuming process (Phatak et al., 2009). The availability of 3D structures of protein-ligand (PL) complexes has guided lead optimization for many years, paving the way to a more rational approach. Later on, theoretical developments, coupled with better computational algorithms and faster computing resources, allowed the routine use of in silico methods to model PL interactions, estimate binding affinity, and screen chemical libraries using structure-based approaches. Today, computational chemistry is a well-established and valuable tool in the drug discovery process (Cavasotto and Orry, 2007; Jorgensen, 2009).

The central quantity in PL association is the binding free energy (ΔG<sub>binding</sub>), a property of enormous relevance in the pharmaceutical industry, and no effort is spared to estimate it accurately in a computationally efficient way. Reliable prediction of receptor-small-molecule affinities in the early stages of the drug discovery pipeline would be instrumental to rationally design new, more potent, and safer drugs, saving precious effort, time and cost. The accurate calculation of ΔG depends on several factors: (i) the energy model of the system; (ii) the accounting for protein flexibility; (iii) the presence of water molecules within the binding site and the solvation model. The last two challenges have been thoroughly addressed in recent reviews [(Cavasotto, 2011, 2012b; Spyrakis et al., 2011), and (Spyrakis and Cavasotto, 2015), respectively]. The last 20 years have seen remarkable advances in theoretical and algorithmic developments for the calculation of binding affinities (Gohlke and Klebe, 2002; Gilson and Zhou, 2007; Mobley and Gilson, 2017), ranging from fast estimates, to be used in high-throughput docking and scoring (Cross et al., 2009; Cavasotto, 2012a), to much slower yet more accurate calculations using free energy perturbation or thermodynamic integration (Mobley and Klimovich, 2012; Hansen and Van Gunsteren, 2014), well suited to guide chemical synthesis for hit-to-lead optimization. Most of these applications have been rooted in molecular mechanics (MM) force-fields (FF), but recent years have seen the development and application of quantum mechanical (QM) methods to biomacromolecular systems in the context of drug lead discovery and design.
The recent blind challenges for ligand-pose and binding affinity predictions run by the Drug Design Data Resource (D3R) in 2015 (Gathiaka et al., 2016) and 2016 (Gaieb et al., 2018) highlight the critical relevance of method development and benchmarking in pose prediction and affinity ranking of bound ligands.
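
For orientation, the standard binding free energy is related to the dissociation constant by the textbook relation ΔG° = RT ln(K<sub>d</sub>/c°), with c° = 1 M. The short Python sketch below applies this relation and shows how an error in ΔG translates into a multiplicative error in K<sub>d</sub>; the temperature of 298.15 K and the function names are choices made here for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_to_dg(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from Kd in mol/L (c0 = 1 M)."""
    return R_KCAL * temp_k * math.log(kd_molar)


def dg_error_to_kd_factor(dg_error: float, temp_k: float = 298.15) -> float:
    """Multiplicative error in Kd implied by an error in dG (kcal/mol)."""
    return math.exp(abs(dg_error) / (R_KCAL * temp_k))


print(round(kd_to_dg(1e-9), 2))                 # a 1 nM binder
print(round(dg_error_to_kd_factor(1.364), 1))   # error of about 1.4 kcal/mol
```

At room temperature an error of roughly 1.4 kcal/mol corresponds to about one order of magnitude in K<sub>d</sub>, which gives a sense of the accuracy needed for affinity ranking.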

It should be highlighted that the QM formulation accounts for all contributions to the energy (including effects missing in FFs, such as electronic polarization, charge transfer, halogen bonding, and covalent-bond formation), and thus is, in principle, theoretically exact; moreover, it offers the advantage of being general across the chemical space, avoiding system-dependent parameterizations, so that all elements and interactions can be considered on equal footing. In fact, QM has been present since the early days of computer-aided drug design (cf. the pioneering work of W.G. Richards on quantum pharmacology; Richards, 1977), and it is routinely used to derive FF parameters [such as torsional potentials from high level ab initio data and partial atomic charges by fitting to electrostatic surface potentials (ESP) (Mucs and Bryce, 2013)], in QSAR methods (De Benedetti and Fanelli, 2014), to study reaction mechanisms (Blomberg et al., 2014), and small-molecule strain (Forti et al., 2012; Juárez-Jiménez et al., 2015).

The goal of this mini-review is to highlight the growing importance of quantum chemistry (QC) in the study of PL interactions, and to present the latest applications of explicit QM calculations to structure-based drug design in the context of lead identification and optimization [for a survey on the development rather than application of QM methods for ligand-binding affinity calculations the reader is referred to an excellent recent review (Ryde and Söderhjelm, 2016); the review by Korth also offers comprehensive coverage of the development of semiempirical QM and density functional theory (DFT) methods augmented by hydrogen-bonding and dispersion corrections (Yilmazer and Korth, 2016)].

## QUANTUM CHEMICAL APPROACHES IN PROTEIN-LIGAND DOCKING

In silico molecular docking has been widely used to determine the binding mode (pose) of small molecules within a binding site. However, the true potential of this technique is revealed when it is used in a high-throughput fashion to screen up to millions of molecules, aiming to generate a sub-library rich in potential binders, thus imposing a structural filter on a given chemical library to prioritize compounds for synthesis. In high-throughput docking (HTD), where the protein is usually considered rigid or with very few degrees of freedom, two stages can be identified: (i) the prediction of the binding modes of molecules within the binding site (docking stage); (ii) the calculation of a score which attempts to predict the likelihood that a molecule will actually bind to the target. Although docking accuracy depends on the program used, the fraction of ligand poses with RMSD < 2 Å compared to the native structure can reach up to 80% of the studied cases (Warren et al., 2006; Wang et al., 2016). In some docking programs the binding pose is assessed by searching for the global energy minimum ("docking energy") on the potential energy surface (PES) of the protein-molecule system. Other energetic contributions should be accounted for (such as the free energy of the unbound molecule, the entropy change, and desolvation effects) in order to assign a "docking score" to molecules of a chemical library; scoring functions are classified as force-field-based, empirical, and knowledge-based (Kitchen et al., 2004). It should be highlighted that the docking energy discriminates among poses of the same molecule, while the docking score is aimed at discriminating among different molecules of the set [usually docking scores are calculated on the best pose (or a few best poses) of each molecule]. In many docking programs, however, the docking score is used for both purposes.
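
The pose-accuracy criterion mentioned above can be written down in a few lines. The following Python sketch is a minimal illustration, assuming that docked and native poses share the receptor frame and the same atom ordering (no superposition and no symmetry correction are applied); the coordinates and RMSD values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def rmsd(pose: Coords, reference: Coords) -> float:
    """RMSD (Å) between two poses with identical atom ordering.

    No superposition is applied: both poses are assumed to be expressed
    in the same receptor frame.
    """
    if len(pose) != len(reference) or not pose:
        raise ValueError("poses must be non-empty and have the same atom count")
    sq = sum((a - b) ** 2 for p, r in zip(pose, reference) for a, b in zip(p, r))
    return math.sqrt(sq / len(pose))


def success_rate(rmsds: List[float], cutoff: float = 2.0) -> float:
    """Fraction of top-ranked poses with RMSD below the cutoff (Å)."""
    return sum(r < cutoff for r in rmsds) / len(rmsds)


# Hypothetical two-atom ligand, for illustration only.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0)]
print(rmsd(docked, native))
print(success_rate([0.8, 1.9, 3.4, 2.0]))
```

In practice, symmetry-equivalent atoms should be matched before computing the RMSD, otherwise chemically identical poses can be counted as failures.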

In the last 10 years there have been continuous efforts to enhance scoring functions by incorporating some type of QM-based calculations, especially deriving system-specific charges, such as the QM-polarized ligand docking approach (Cho et al., 2005). Some degree of improvement was observed using these tailored energy functions in terms of pose prediction. However, these advances will not be addressed here, and the reader is referred to a sound review covering these issues (Mucs and Bryce, 2013).

There are fewer works describing PL interactions with explicit calculations at the QM level. One should highlight the pioneering work of Raha and Merz (Raha and Merz, 2004, 2005), who introduced QMScore, a semiempirical QM (SQM) scoring function based on the Austin Model 1 (AM1) Hamiltonian (Dewar et al., 1985), complemented with a FF dispersion term and a Poisson-Boltzmann implicit solvent model, and calculated using the linear-scaling divide and conquer method (Dixon and Merz, 1996). QMScore was able to discriminate native and decoy poses and captured essential binding affinity trends in a set of 165 PL complexes; a series of QM/MM scoring functions were also studied to discriminate native from decoy poses in six HIV-1 proteases (Fong et al., 2009), showing in some of them improvements over MM empirical potentials.

Very recently, the SQM/COSMO energy filter was introduced, aimed at discriminating native from decoy ligand docking poses (Pecina et al., 2016). The SQM/COSMO filter is a simplified version of a general binding free energy function (Raha and Merz, 2005; Lepšík et al., 2013),

$$
\Delta G_{\text{binding}} = \Delta E_{\text{int}} + \Delta \Delta G_{\text{solv}} + \Delta G_{\text{conf}} - T\Delta S \tag{1}
$$

(where ΔE<sub>int</sub> is the gas-phase interaction energy, ΔΔG<sub>solv</sub> the solvation energy change upon complex formation, ΔG<sub>conf</sub> the change of conformational free energy, and −TΔS the entropy change upon binding). In this new filter, only the first two dominant terms in Equation (1) are retained, thus avoiding expensive SQM optimizations. ΔE<sub>int</sub> is calculated at the PM6 level (Stewart, 2007) with the D3H4X correction for dispersion, hydrogen- and halogen-bonding interactions (Rezáč and Hobza, 2011, 2012), and the implicit solvent model COSMO (Klamt and Schüürmann, 1993) is used to calculate the ΔΔG<sub>solv</sub> term (this filter was named PM6-D3H4X). It was shown that calculations on a small subsystem (the ligand and neighboring amino acids) do not deteriorate results compared to the whole system, with a clear benefit in terms of computational speed. The ability of this filter to discriminate binding-like poses from decoy poses was evaluated in four challenging systems [acetylcholinesterase (AChE), TNF-α converting enzyme (TACE), aldose reductase (AR), and HIV-1 protease (HIV PR)], and compared to seven well-known empirical scoring functions and a physics-based AMBER/GB scheme. The SQM/COSMO filter performed best by two metrics: the number of false-positive solutions, and the maximum ligand RMSD of all poses within a given range of a normalized score. The worst performance was on the TACE metalloprotein, which contains a Zn<sup>2+</sup>···S<sup>−</sup> interaction. As for the computational requirements, this filter is ∼100 times slower than the traditional scoring functions and ∼10 times slower than the AMBER/GB scoring scheme, but ∼100 times faster than the standard SQM filter calculated using the full Equation (1).
In a follow-up contribution (Pecina et al., 2017), the SQM/COSMO filter was evaluated on the same four systems (AChE, TACE, AR, HIV PR) using the self-consistent charge density functional tight-binding (SCC-DFTB) method (Elstner et al., 2001), complemented with the D3H4 corrections for dispersion and hydrogen-bond interactions (Rezáč and Hobza, 2012). This improved filter (named DFTB3-D3H4) retained its excellent performance in AChE, AR and HIV PR, and clearly improved the results on the TACE system at a moderately higher computational cost. To further validate the two variants of SQM filters, 17 diverse PL complexes were studied using PM6-D3H4X and DFTB3-D3H4X (extended in this case to account for halogen bonding), and compared to four standard docking programs (Ajani et al., 2017). The QM-based energy functions clearly outperformed the standard scoring functions in terms of the number of false positives.
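
The logic of a two-term energy filter can be sketched as follows. This Python example is illustrative only: the energies are hypothetical, in a real workflow they would come from an SQM program and an implicit solvent model, and the definition of a false positive used here (a decoy scoring at least as well as the best near-native pose) is an assumption made for the sketch rather than the exact metric of the cited work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pose:
    label: str
    e_int: float     # gas-phase interaction energy, kcal/mol
    ddg_solv: float  # change in solvation free energy on binding, kcal/mol
    rmsd: float      # ligand RMSD to the native pose, Å


def filter_score(p: Pose) -> float:
    """Two-term score: the first two terms of Equation (1)."""
    return p.e_int + p.ddg_solv


def false_positives(poses: List[Pose], rmsd_cutoff: float = 2.0) -> List[str]:
    """Decoys (RMSD above cutoff) scoring at least as well as the best near-native pose."""
    native_like = [filter_score(p) for p in poses if p.rmsd <= rmsd_cutoff]
    if not native_like:
        raise ValueError("no near-native pose in the set")
    best = min(native_like)
    return [p.label for p in poses
            if p.rmsd > rmsd_cutoff and filter_score(p) <= best]


# Hypothetical energies, for illustration only.
poses = [
    Pose("native", e_int=-95.0, ddg_solv=60.0, rmsd=0.0),
    Pose("decoy_1", e_int=-100.0, ddg_solv=70.0, rmsd=4.2),
    Pose("decoy_2", e_int=-90.0, ddg_solv=52.0, rmsd=6.1),
]
print(false_positives(poses))
```

The example shows why the solvation term matters: a decoy with a more favorable gas-phase interaction energy can still rank below the native pose once the desolvation penalty is included, and vice versa.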

Using MD simulations and QC energy evaluations, Burton and co-workers evaluated the preferred docking (binding) mode of the natural salpichrolide A and a synthetic analog with an aromatic D ring within the estrogen receptor α (ERα) binding site (Alvarez et al., 2015). The MM/QM-COSMO (Anisimov and Cavasotto, 2011; Anisimov et al., 2011) method with the PM6 Hamiltonian was used for the energy calculations. The MD simulations coupled with energy evaluations corresponding to different ligand-binding modes support the preferred inverted orientation of the steroids in the ERα binding site, in which the aromatic ring D occupies a similar position to the corresponding A ring of estradiol.

G protein-coupled receptors (GPCRs) present a challenging case for docking due to their solvent-exposed and polar binding sites (Cavasotto and Palomba, 2015). A new docking protocol was recently presented in which a QM/MM + implicit solvation model was used to rescore docked ligand poses (Kim and Cho, 2016). The gas-phase energy was calculated at the QM/MM level, considering the ligand and neighboring residues within 5 Å as the QM region, and the solvation energy was calculated using a Poisson-Boltzmann (PB) approach with partial charges derived from ESP fitting. Evaluating their protocol on 40 GPCR complexes including representatives of classes A, B, and F, the authors obtained an average RMSD of 0.78 Å, and a success rate of 40/40 for ligands with RMSD < 2 Å.
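
A distance-based definition of the QM region, such as the 5 Å criterion described above, can be sketched in plain Python. The data layout (a dictionary mapping residue identifiers to atom coordinates), the residue labels, and the coordinates are hypothetical, and the treatment of covalent bonds cut at the QM/MM boundary (e.g., link atoms) is deliberately left out.

```python
import math
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]


def qm_region(ligand: List[Atom],
              residues: Dict[str, List[Atom]],
              cutoff: float = 5.0) -> List[str]:
    """Residues with at least one atom within `cutoff` Å of any ligand atom."""
    selected = []
    for res_id, atoms in residues.items():
        if any(math.dist(atom, lig_atom) <= cutoff
               for atom in atoms for lig_atom in ligand):
            selected.append(res_id)
    return selected


# Hypothetical coordinates and residue labels, for illustration only.
ligand = [(0.0, 0.0, 0.0)]
residues = {
    "RES_A": [(3.0, 0.0, 0.0), (9.0, 0.0, 0.0)],  # one atom 3 Å away
    "RES_B": [(0.0, 6.0, 0.0)],                   # 6 Å away
    "RES_C": [(3.0, 4.0, 0.0)],                   # exactly 5 Å away
}
print(qm_region(ligand, residues))
```

Selecting whole residues rather than individual atoms keeps the QM/MM boundary at chemically sensible positions, at the cost of a larger QM region.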

Chaskar et al. (2014) developed an on-the-fly QM/MM approach combining the EADock DSS docking algorithm (Grosdidier et al., 2007) with calculations based on the SCC-DFTB model and the CHARMM FF (Brooks et al., 2009), and evaluated it on a dataset of high-quality X-ray structures of zinc metalloproteins. Their method significantly improved the success rate compared to classical docking programs for orthosteric ligands in terms of ligand pose RMSD. Recently, a similar approach (Chaskar et al., 2017), but coupled with the Attracting Cavities docking algorithm (Zoete et al., 2016), was applied to three different sets: (i) the Astex Diverse data set of 85 common non-covalent drug/target complexes; (ii) a zinc metalloprotein data set of 281 complexes; (iii) a heme protein data set of 72 complexes, where ligand/protein interactions are dominated by covalent ligand/iron binding. On the first set the performance was similar to that of the standard scoring functions, but on the other two QM/MM showed improved performance, especially on the third set.

#### CALCULATION OF LIGAND BINDING FREE ENERGY USING QUANTUM MECHANICS-BASED METHODS

The binding process of five classical AChE inhibitors was analyzed using free energy perturbation (FEP) and QM/MM MD simulations (Nascimento et al., 2017). The QM calculations were performed at the AM1 level. ΔG<sub>binding</sub> was obtained as the sum of two terms, introducing two parameters into the electrostatic and van der Waals QM/MM interaction terms in the total energy (Swiderek et al., 2014). The calculated values were in very good agreement with experiment (R<sup>2</sup> value of 0.96 for 100 ps of simulation time). Moreover, the calculated order of inhibition agreed qualitatively with the experimental one. The use of QM to describe these ligands was of great importance due to their polar nature and the high aromaticity of the enzyme binding site.

To analyze the efficiency of different approaches for calculating ΔG<sub>binding</sub> at the QM/MM level with MD simulations, Ryde and Olsson recently compared calculations of the binding of nine small carboxylate ligands to the octa-acid deep-cavity host (Olsson and Ryde, 2017), via reference-potential FEP calculations (Rod and Ryde, 2005) and full QM/MM FEP simulations. The ligand was described using the SQM PM6 Hamiltonian augmented by the DH+ empirical dispersion and hydrogen-bond corrections (Korth, 2010). The results showed that the reference-potential approach is approximately three times more effective than the direct approach, and that the convergence of the MM→QM/MM perturbations is improved by adding QM/MM MD simulations at a number of values of the coupling parameter between the MM and QM/MM energies.
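
In a reference-potential scheme, the MM→QM/MM step is commonly estimated by exponential averaging (the Zwanzig relation), ΔG = −kT ln⟨exp[−(U<sub>QM/MM</sub> − U<sub>MM</sub>)/kT]⟩<sub>MM</sub>. The Python sketch below implements this textbook estimator with a log-sum-exp shift for numerical stability; it is a minimal illustration with hypothetical energy differences, not a reproduction of the protocol used in the cited study.

```python
import math
from typing import Sequence

K_B = 1.987204e-3  # Boltzmann constant per mole, kcal/(mol*K)


def zwanzig_dg(delta_u: Sequence[float], temp_k: float = 300.0) -> float:
    """Free energy change A->B from samples of U_B - U_A collected in state A.

    dG = -kT ln < exp(-(U_B - U_A)/kT) >_A, evaluated with a
    log-sum-exp shift for numerical stability.
    """
    if not delta_u:
        raise ValueError("need at least one energy difference")
    kt = K_B * temp_k
    x = [-du / kt for du in delta_u]
    shift = max(x)
    log_mean = shift + math.log(sum(math.exp(v - shift) for v in x) / len(x))
    return -kt * log_mean


# Hypothetical energy differences (kcal/mol), for illustration only.
print(zwanzig_dg([1.5, 1.5, 1.5, 1.5]))   # constant shift is recovered exactly
print(zwanzig_dg([0.0, 2.0]) < 1.0)       # estimate lies below the plain average
```

Because the average is dominated by the few snapshots with the lowest energy difference, the estimator converges slowly when the MM and QM/MM ensembles overlap poorly, which is the motivation for inserting intermediate coupling-parameter values.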

Grimme and co-workers presented a full QM approach to evaluate absolute ligand binding free energies as the sum of three terms: the interaction energy, the solvation contribution, and the entropic term (Ehrlich et al., 2017). Calculations were performed on a reduced system consisting of the ligand and neighboring binding site atoms (∼1,000 atoms in total). For the interaction energy, two methods were used: the minimal-basis Hartree-Fock method HF-3c (Sure and Grimme, 2013), which includes a D3 dispersion correction (Grimme et al., 2010), and the lower-cost composite hybrid DFT method PBEh-3c (Grimme et al., 2015); entropic contributions were calculated using a semiempirical DFTB3-D3 Hessian (Gaus et al., 2011; Brandenburg and Grimme, 2014); the solvation contribution was calculated with the COSMO-RS method (Klamt, 1995, 2011). Two molecular systems were studied: the activated serine protease factor X (FXa) with 25 ligands and the non-receptor tyrosine-protein kinase 2 (TYK2) with 16 ligands. The mean absolute deviation (MAD) of ΔG<sub>binding</sub> at the HF-3c level was 2.8 and 2.7 kcal/mol, with Pearson correlation coefficients of 0.47 and 0.51, respectively, while a MAD of 2.1 kcal/mol was obtained on the FXa system using the PBEh-3c method, with a Pearson coefficient of 0.53. Although the results are clearly encouraging from a QC standpoint, this approach cannot yet be used in an industrial setting, and the errors stemming from the structural optimization level, conformational sampling and the solvation contribution need to be addressed.
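
The two statistics quoted above measure different things: the MAD reflects absolute accuracy, whereas the Pearson coefficient reflects how well the ranking trend is reproduced. The following Python sketch computes both for a hypothetical set of binding free energies; the numbers are placeholders and are not taken from the cited study.

```python
import math
from typing import Sequence


def mean_abs_dev(calc: Sequence[float], expt: Sequence[float]) -> float:
    """Mean absolute deviation between calculated and experimental values."""
    return sum(abs(c - e) for c, e in zip(calc, expt)) / len(calc)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical binding free energies (kcal/mol), for illustration only.
expt = [-9.0, -10.0, -11.0, -12.0]
calc = [-7.0, -9.0, -8.0, -11.0]
print(mean_abs_dev(calc, expt))
print(round(pearson_r(calc, expt), 3))

# A constant offset leaves the correlation perfect but the MAD non-zero.
shifted = [e + 2.0 for e in expt]
print(mean_abs_dev(shifted, expt), round(pearson_r(shifted, expt), 3))
```

For lead optimization, where relative affinities within a series matter most, a systematic offset is less harmful than a poor correlation, so both numbers should be reported together.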

Frush et al. performed a QM/MM-based evaluation of ΔG<sub>binding</sub> on four diverse protein targets of pharmaceutical relevance: beta-secretase 1 (BACE1), TYK2, heat shock protein 90 α (HSP90), and protein kinase R (PKR)-like endoplasmic reticulum kinase (PERK), using 22, 16, 70, and 32 ligands, respectively (Frush et al., 2017). Binding affinities were calculated using the linear interaction energy (LIE) protocol (Aqvist et al., 1994), with α and β LIE coefficients similar to those reported elsewhere (Su et al., 2007), but modified to fit the experimental affinities of the TYK2 set. Ensemble averages were calculated through QM/MM calculations on MD trajectories, describing the ligand at the SQM level using the AM1 Hamiltonian, and the rest of the system using MM. For the four systems, the MAE was 0.86, 0.42, 0.86, and 1.11 kcal/mol, and the correlation was 0.73, 0.71, 0.60, and 0.86, respectively. The authors concluded that their methodology reached a reasonable balance between accuracy and computational cost.
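
In its usual form, the LIE estimate reads ΔG<sub>binding</sub> ≈ α(⟨V<sub>vdW</sub>⟩<sub>bound</sub> − ⟨V<sub>vdW</sub>⟩<sub>free</sub>) + β(⟨V<sub>el</sub>⟩<sub>bound</sub> − ⟨V<sub>el</sub>⟩<sub>free</sub>) + γ, where the averages are ligand-surroundings interaction energies sampled along MD trajectories. The Python sketch below evaluates this expression; the coefficient values and the energy samples are hypothetical placeholders, not those of the cited study, in which the interaction energies were obtained from QM/MM calculations.

```python
from statistics import fmean
from typing import Sequence


def lie_dg(vdw_bound: Sequence[float], vdw_free: Sequence[float],
           el_bound: Sequence[float], el_free: Sequence[float],
           alpha: float, beta: float, gamma: float = 0.0) -> float:
    """LIE estimate: alpha*d<V_vdW> + beta*d<V_el> + gamma (kcal/mol)."""
    d_vdw = fmean(vdw_bound) - fmean(vdw_free)
    d_el = fmean(el_bound) - fmean(el_free)
    return alpha * d_vdw + beta * d_el + gamma


# Hypothetical interaction-energy samples (kcal/mol) and coefficients.
dg = lie_dg(
    vdw_bound=[-40.0, -42.0], vdw_free=[-20.0, -22.0],
    el_bound=[-30.0, -34.0], el_free=[-25.0, -27.0],
    alpha=0.18, beta=0.5,
)
print(round(dg, 2))
```

Because α and β are fitted parameters, an LIE model calibrated on one target (here, the TYK2 set) is an empirical model whose transferability to other targets has to be checked, as the authors did on the remaining three systems.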

In the context of the D3R grand challenge blind test competition (Gathiaka et al., 2016), Ryde and co-workers evaluated four different approaches for predicting the binding affinities of three sets of ligands of the HSP90 protein (Misini Ignjatovic et al., 2016): (i) induced-fit docking (Sherman et al., 2006) followed by calculations with three energy functions; (ii) MM/GBSA calculations on minimized docked structures; (iii) optimization of docked structures with QM/MM calculations followed by QM-based energy evaluation of a subset of ∼1,000 atoms using continuum solvent; (iv) calculations of relative binding affinities using free-energy simulations. Although the results were somewhat poor, the authors were able to identify the sources of error: in one case the ligand could displace water molecules (this could be found only after the experimental data were released), and in the other two, ligands might exhibit binding modes different from those in the crystal, or conformational changes of the system might be critical.

# CONCLUSIONS AND PERSPECTIVE

In this short review we presented the most recent applications of QM-based methods to molecular docking and ligand binding free energy prediction in the context of drug lead discovery, focusing on cases where QM is explicitly used to calculate at least some of the free energy contributions. The last 10 years have seen a remarkable interest in the development and application of QM-based methods in the field of drug discovery. This was triggered by the interest in modeling biomolecular systems more accurately, and made possible by the unprecedented growth of computational power. QM methods are theoretically exact, capturing the underlying physics of the system and accounting for all contributions to the energy; thus, effects missing in FFs (such as electronic polarization, covalent-bond formation, and coupling among terms) are de facto accounted for in QM formulations, which are also systematically improvable; being generally valid across chemical space, they offer greater freedom to deal with non-standard molecules, avoiding FF parameterizations.

Overall, the results obtained using QM approaches are very encouraging, but several sources of error still need to be addressed in order to improve the accuracy and predictive power of these methods: (i) they are still system-dependent; thus, further validation and benchmarking are needed; (ii) in spite of the progress in computational speed, most QM applications to drug discovery still cannot be used in industrial settings, highlighting the need for optimized codes, especially those using GPUs; (iii) conformational sampling and protein flexibility: due to computing time, in most approaches aimed at high-throughput use, only local energy minimization is performed, or even no minimization at all; this should be integrated with the possibility of system cutout, and optimal combinations of these thoroughly validated; (iv) the solvation contribution, especially in charged systems; (v) entropic considerations, usually omitted in this type of calculation. In spite of these limitations, it is clear that reliable QM methods for biomolecular systems would be a tremendous step forward toward predictive binding free energy calculations.

### REFERENCES

#### AUTHOR CONTRIBUTIONS

CNC conceived, designed, and supervised this review. All authors listed have made direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

This work was supported by the National Agency for the Promotion of Science and Technology (ANPCyT) (PICT-2014- 3599), and FOCEM-Mercosur (COF 03/11). The authors thank the National System of High Performance Computing (Sistemas Nacionales de Computación de Alto Rendimiento, SNCAD) and the Computational Centre of High Performance Computing (Centro de Computación de Alto Rendimiento, CeCAR) for granting use of their computational resources.


of small molecules. J. Chem. Theory Comput. 8, 1808–1819. doi: 10.1021/ct300097s


Zoete, V., Schuepbach, T., Bovigny, C., Chaskar, P., Daina, A., Rohrig, U. F., et al. (2016). Attracting cavities for docking. Replacing the rough energy landscape of the protein by a smooth attracting landscape. J. Comput. Chem. 37, 437–447. doi: 10.1002/jcc.24249

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cavasotto, Adler and Aucar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Chemical Synthesis Analysis and Pathway Design

Fan Feng<sup>1</sup>, Luhua Lai<sup>1,2,3</sup> and Jianfeng Pei<sup>2</sup>\*

<sup>1</sup> State Key Laboratory for Structural Chemistry of Unstable and Stable Species, Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing, China, <sup>2</sup> Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China, <sup>3</sup> Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China

With the idea of retrosynthetic analysis, which was raised in the 1960s, chemical synthesis analysis and pathway design have been transformed from a complex problem into a regular process of structural simplification. This review summarizes the developments of computer-assisted synthetic analysis and design in recent years, and how machine-learning algorithms have contributed to them. The LHASA system pioneered the encoding of semi-empirical reaction modes in computers; the rule-based and network-searching work that followed not only expanded the databases but also built new approaches to representing reaction rules. Programs like ARChem Route Designer replaced hand-coded reaction modes with automatically extracted rules, and programs like Chematica turned traditional design into network searching. Afterward, with the help of machine learning, two-step models that combine reaction rules and statistical methods became the mainstream. Recently, fully data-driven learning methods using deep neural networks, which do not even require any prior knowledge, have been applied to this field. Up to now, however, these methods still cannot replace experienced human organic chemists because of their relatively low accuracies. New algorithms, aided by powerful computational hardware, should make this a promising topic with good prospects.

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Mingyue Zheng, Shanghai Institute of Materia Medica (CAS), China Dharmendra Kumar Yadav, Gachon University of Medicine and Science, South Korea

#### \*Correspondence:

Jianfeng Pei jfpei@pku.edu.cn

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 29 January 2018 Accepted: 15 May 2018 Published: 05 June 2018

#### Citation:

Feng F, Lai L and Pei J (2018) Computational Chemical Synthesis Analysis and Pathway Design. Front. Chem. 6:199. doi: 10.3389/fchem.2018.00199

Keywords: chemical synthesis analysis, retrosynthesis, pathway design, deep learning, seq2seq

# INTRODUCTION

Although the concept of organic chemistry was proposed before the nineteenth century, the first steps of synthesis analysis took more than 100 years, from 1828, when the German chemist Friedrich Wöhler produced urea from potassium cyanate and ammonium sulfate (Leicester and Klickstein, 1951), to the mid-twentieth century, when chemists such as Robinson, Woodward, and Corey raised it to a qualitatively higher level of sophistication with the idea of retrosynthetic analysis (Corey, 1988). Since then, laboratories around the world have made remarkable achievements in total synthesis, biosynthesis and biomimetic synthesis. The standardized workflow of synthesis pathway planning has made it possible for scientists to design computer programs that deal with synthetic problems.

After the Dendral Project at Stanford University in the 1960s (which ultimately failed), experts in chemistry, biology and computer science showed great enthusiasm for developing relevant algorithms over the next 30 years, but few breakthroughs were made and more and more people came to view the task as a "mission impossible." In fact, the task was too complex for scientists at a time when machines could only deal with very simple molecules, for which humans did not need much assistance. However, after the 1990s, the development of new efficient algorithms and better-designed databases, including Reaxys and SciFinder (which provide chemists with structured data on chemical reactions), rekindled the passion for computer-assisted synthesis design. More cheminformatics tools were also proposed, including molecular descriptors and molecular encoding methods such as SMILES (Simplified Molecular Input Line Entry System) (Weininger, 1988).

Early retrosynthetic analysis systems were mainly based on reaction rules, such as LHASA (Corey et al., 1972a,b) and SYNLMA (Johnson et al., 1989). Different rule-based methods focused on different concepts, including reaction mechanisms, skeletal construction and some classic reactions between common groups. However, rule-based methods cannot cover the whole organic reaction space and may give incorrect results (e.g., the algorithms may propose a compound that cannot exist, or fail to protect highly reactive groups).

After 1990, many new methods using machine learning as an important tool were proposed, but most of them still followed the concepts of traditional reaction rules. We therefore define them as "two-step models": machine learning plays the role of decision making, while the generation of candidate decisions relies on reaction rules or structural rules. In recent years, deep learning (deep neural network) techniques have been applied to reaction prediction and retrosynthetic analysis. For example, by regarding reactions as translation between two languages ("reactants" and "products"), seq2seq (a model built from two recurrent neural networks) (Sutskever et al., 2014) was used in synthetic prediction. However, these modern tools still need essential improvements to meet the needs of organic chemists. In addition, negative samples are quite important in machine learning, but reaction databases seldom provide information of the form "A does not react with B," which is a severe limitation.

Recently, in the field of drug design, modern methods have turned time-consuming, trial-and-error lab work into a computational process. After designing molecules according to certain principles, medicinal chemists still have to synthesize them. With modern web resources (Khan et al., 2011; Yadav et al., 2016), computers can take the synthesis pathway into consideration. For example, databases such as the KEGG enzymatic reaction database and ChemBioFinder have greatly benefited both drug discovery and drug synthesis prediction.

Organic reactions are not like chess or Sudoku: they are full of exceptions and rarely follow fixed rules, so they present a great challenge for computer programs. With the rise of artificial intelligence (AI), scientists have realized that combining AI with synthetic planning will probably be the general trend in this field. Although we cannot guarantee the correctness of a computer-designed synthetic route, AI may come up with new ideas beyond human ones, and its comprehension of complex reaction patterns such as rearrangements and catalytic cycles may also become superior to that of humans. To sum up, we believe that computers will help scientists to a great extent in synthetic analysis and pathway design in the future.

# SYNTHESIS PREDICTION WITH NETWORK SEARCHING AND RULE MATCHING

# Building and Searching Reaction Networks

One decisive difference between humans and computers is memory capacity. Expert organic chemists typically memorize hundreds of classic reactions and rules, whereas modern computers can store and search chemical databases as large as the entire set of known molecules and reactions. From a computer scientist's point of view, chemical reactions are data describing relationships or connections between compounds, and such data can be represented by structures such as graphs or networks. Following these ideas, Grzybowski et al. carried out this kind of transformation in the early 2000s and eventually completed the Network of Organic Chemistry (NOC) (Fialkowski et al., 2005; Bishop et al., 2006; Grzybowski et al., 2009), which contains more than ten million organic reactions (edges) connecting a similar number of compounds (vertices) (**Figure 1**).
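The network view of reactions can be made concrete with a small data structure. The following is only an illustrative sketch, not the NOC implementation: the `Reaction` record, the `build_network` helper and the three toy reactions are hypothetical names and data introduced here. Each reaction is indexed both by the compound it produces and by the compounds it consumes, so that the network can be walked backward (retrosynthesis) or forward (reaction prediction).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical toy record: a named edge from reactants to one product."""
    name: str
    reactants: tuple
    product: str


def build_network(reactions):
    """Index reactions by the compound they make and by the compounds they consume."""
    made_by = defaultdict(list)   # product  -> reactions producing it
    used_in = defaultdict(list)   # reactant -> reactions consuming it
    for rxn in reactions:
        made_by[rxn.product].append(rxn)
        for reactant in rxn.reactants:
            used_in[reactant].append(rxn)
    return made_by, used_in


# Placeholder data for illustration only, not entries of a real database.
REACTIONS = [
    Reaction("esterification", ("acetic acid", "ethanol"), "ethyl acetate"),
    Reaction("oxidation", ("ethanol",), "acetic acid"),
    Reaction("hydration", ("ethylene",), "ethanol"),
]

made_by, used_in = build_network(REACTIONS)
print([r.name for r in made_by["ethyl acetate"]])      # backward step from the target
print(sorted(r.name for r in used_in["ethanol"]))      # forward steps from ethanol
```

Looking up `made_by[target]` corresponds to one retrosynthetic step, while `used_in[compound]` lists what a compound can be turned into.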

The searching process is not simple. Grzybowski's group tried different ways to perform global minimization in their program Chematica. They took two factors into consideration. One is the overall "cost" C<sub>tot</sub> of a pathway (including labor, purification costs, etc.) together with the cost of starting materials. The other is the popularity scoring function P<sub>tot</sub>, which prioritizes more popular reactions. For the searching algorithm, one approach is to minimize the scoring function at each "depth" of the search and gradually increase the "depth" to produce the synthetic pathway. Traditional BFS (breadth-first search) (Lee, 1961) has also been adapted to synthetic planning to generate many possible pathways. These searching algorithms can reduce the "combinatorial explosion" problem to simpler and more intuitive ones, which can be solved within a few seconds. In addition, owing to the specific data structure of NOC, Chematica offers the Synthesis Optimization with Constraints (SOCS) scheme, which supports constraints such as a maximum number of products and the avoidance of certain intermediates. This process is just like finding the minimum of a function subject to constraints. Satisfying any constraint will probably come at the price of a higher value of the cost function.
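The idea of minimizing a pathway cost under constraints can be sketched with a short recursive search. This is a hedged illustration under simple assumptions, not the Chematica algorithm: the toy network `ROUTES`, the prices in `PRICES`, and the function `cheapest` are invented for this example, the cost is taken to be a plain sum of reaction costs and starting-material prices, and the popularity score is not modeled.

```python
import math

# Toy network: product -> list of (reaction_cost, reactants). All values are made up.
ROUTES = {
    "D": [(5.0, ("B", "C")), (2.0, ("X",))],
    "B": [(1.0, ("A",))],
    "X": [(1.0, ("A",))],
}
PRICES = {"A": 1.0, "C": 2.0}   # purchasable starting materials


def cheapest(target, forbidden=frozenset(), max_depth=5):
    """Return (total_cost, plan) minimizing reaction plus starting-material cost.

    `forbidden` lists compounds that must be avoided and `max_depth` bounds the
    number of steps; both stand in for user-imposed constraints.
    """
    if target in forbidden:
        return math.inf, None
    if target in PRICES:
        return PRICES[target], []
    if max_depth == 0:
        return math.inf, None
    best = (math.inf, None)
    for rxn_cost, reactants in ROUTES.get(target, []):
        total, plan = rxn_cost, []
        for reactant in reactants:
            cost, sub_plan = cheapest(reactant, forbidden, max_depth - 1)
            if sub_plan is None:        # this reactant cannot be made
                break
            total += cost
            plan += sub_plan
        else:                           # every reactant was reachable
            if total < best[0]:
                best = (total, plan + [(reactants, target)])
    return best


print(cheapest("D"))                                 # unconstrained optimum
print(cheapest("D", forbidden=frozenset({"X"})))     # avoiding intermediate X
```

In this toy case the unconstrained optimum goes through X, and forbidding X forces the search onto the more expensive route through B and C, which illustrates the trade-off between satisfying a constraint and the value of the cost function.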

# The Development of Rule-Based Synthetic Design

Although reaction networks can guarantee the validity of predicted retrosynthetic reactions, collecting reaction databases as large as NOC was a very difficult task for the early pioneers. The first idea of chemists and computer scientists was to use reaction rules to predict retrosynthetic reactions, and to develop logic-based and knowledge-based searching strategies for designing reaction routes. By applying the retrosynthetic (backward generation) procedure proposed in the mid-twentieth century, computers can, in theory, generate reasonable starting materials and reaction pathways. However, although there are many rules and famous name reactions in organic chemistry, what really matters is choosing which reaction to use. One of the earliest attempts, the Dendral project (Lindsay et al., 1993) started by a Stanford team, did not achieve this goal. As one of the contributors to retrosynthetic analysis, Corey formulated rules for breaking bonds and planning the synthetic pathway, which can be taught to computers. Although far from mature by today's standards, Corey's ideas raised computer-assisted pathway design to a higher level. In 1969, Corey and Wipke presented the first computer-aided synthesis design software, called OCSS for Organic Chemical Simulation of Synthesis (Corey and Wipke, 1969). It then split into two directions: LHASA (Corey et al., 1972a,b) in Corey's group and SECS (Wipke et al., 1978) developed by Wipke. After that, many followers proposed different kinds of rule-based methods, which are described in detail in other recent reviews (Szymkuć and Gajewska, 2016). Here we only briefly list some of them in **Table 1**.

For rule-based de novo synthesis prediction, there are two main challenges. The first is the collection of reaction rules. Early pioneers like LHASA and SECS are relatively weak in the number and diversity of reaction rules, while later programs like Syntaurus can meet the requirement of basic coverage of reaction space. The other challenge is the ranking or scoring of pathways. To deal with this, different synthetic-planning programs have used various methods, ranging from bond disconnections in LHASA to the minimization of a combined scoring function in Syntaurus.

Perhaps the challenge has been tackled too early, as organic reactions are full of exceptions. Rule-based methods still cannot meet the full requirements of organic chemists. In practice, some relatively rare reactions can, paradoxically, be of vital importance in a particular synthesis, so generalized rules alone may not give computers sufficient knowledge; some specialized cases are also needed. Moreover, most algorithms could not handle stereo- and regiochemistry until the general adoption of SMILES and SMARTS (which can take these factors into consideration). Limitations of the search space and the lack of intelligent algorithms still call on scientists to explore revolutionary new ways to predict synthetic pathways; this is why machine learning has become more and more popular in the past decade.

# THE APPLICATION OF MACHINE LEARNING IN SYNTHETIC DESIGN

## Automatically Learning Reaction Rules

Manual encoding of organic reaction rules has some obvious disadvantages. Since it relies on the experience of a small number of chemists, it usually does not cover a sufficient fraction of the reaction space, and few manually encoded rule sets are as extensive as that of Syntaurus. Moreover, it is not realistic to exhaustively define the full substrate scope and incompatibilities for every possible reaction, and conflicting reactivity is rarely black and white; incompatibility depends on the exact nature of the reacting molecules. These factors motivate the development of an automated approach to the forward reaction evaluation.

Systems with machine-generated chemistry rules were first published in the early 1990s, for example SYNCHEM (Gelernter et al., 1990), which also uses machine learning to enlarge its knowledge base. The KOSP (Satoh and Funatsu, 1999) program (Knowledge Base-oriented System for Synthesis Planning) attempts to extract rules from reaction databases by clustering reactions based on characteristics of atoms within three bonds of a disconnection site. Similarly, RETROSYN (Blurock, 1990) provides an interactive search based on finding single disconnections by similarity with precedent reactions. The system ARChem Route Designer (Law et al., 2009), developed by SymBioSys, implements a systematic procedure for automatically extracting reaction rules and applies these rules in retrosynthetic design. However, like most rule-based systems, it has the limitation of not accounting for stereochemistry and/or regiochemistry. **Figure 2** illustrates how ARChem Route Designer learns reaction rules from reaction pools.

ARChem Route Designer provides a method to generate synthesis trees. This method still has some weaknesses. First, long-distance effects are neglected; for example, a hydroxyl group several bonds away can accelerate the departure of leaving groups such as –OSO<sub>2</sub>CH<sub>3</sub>. Second, conflicts may arise when there are two or more reactive groups in a molecule. Nevertheless, this approach has already shown that the ability of computers to learn reaction rules makes fully data-driven and automatic pathway design algorithms possible.
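The general idea of automatic rule extraction (find the reaction core by comparing reactants and products, extend it to neighboring atoms, then cluster the cores) can be sketched as follows. This is a minimal sketch under strong assumptions and not the ARChem procedure: reactions are assumed to be already atom-mapped, bonds are stored as `{(i, j): order}` dictionaries with `i < j`, and the helper names `changed_bonds`, `reaction_core` and `template` are hypothetical.

```python
from collections import Counter


def changed_bonds(r_bonds, p_bonds):
    """Atom pairs whose bond order differs between reactants and products."""
    keys = set(r_bonds) | set(p_bonds)
    return {k for k in keys if r_bonds.get(k, 0) != p_bonds.get(k, 0)}


def reaction_core(r_bonds, p_bonds, radius=0):
    """Atoms touching a changed bond, optionally grown by `radius` bond shells."""
    core = {a for bond in changed_bonds(r_bonds, p_bonds) for a in bond}
    all_bonds = set(r_bonds) | set(p_bonds)
    for _ in range(radius):
        core |= {a for bond in all_bonds if core & set(bond) for a in bond}
    return core


def template(atoms, r_bonds, p_bonds):
    """Element-level signature of the bond changes, used as the clustering key."""
    sig = []
    for i, j in changed_bonds(r_bonds, p_bonds):
        pair = tuple(sorted((atoms[i], atoms[j])))
        sig.append((pair, r_bonds.get((i, j), 0), p_bonds.get((i, j), 0)))
    return tuple(sorted(sig))


# Two toy atom-mapped substitutions (C-Br replaced by C-O); hydrogens are omitted.
RXN_1 = ({1: "C", 2: "Br", 3: "O"}, {(1, 2): 1}, {(1, 3): 1})
RXN_2 = ({1: "C", 2: "C", 3: "Br", 4: "O"},
         {(1, 2): 1, (2, 3): 1}, {(1, 2): 1, (2, 4): 1})

clusters = Counter(template(*rxn) for rxn in (RXN_1, RXN_2))
print(clusters.most_common(1))                        # both reactions share one template
print(reaction_core(RXN_2[1], RXN_2[2], radius=1))    # core extended by one shell
```

Both toy reactions fall into the same cluster because their bond changes are identical at the element level. A larger `radius` makes the template more specific, which is the usual trade-off between the generality of a rule and its reliability.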

# Two-Step Models—Combination of Rule-Based Model and Machine Learning

The methods summarized in section The Development of Rule-Based Synthetic Design emphasize the importance of reaction rules, as traditional organic chemists do. As statistical methods have become more and more popular over the last two decades, scientists have tried to combine reaction rules with data science skills, especially machine learning. We define these models as two-step models, which proceed in two separate steps: either (1) the first step provides an excess of possible reaction results, and the second ranks or scores them; or (2) the first step classifies the reaction, and the second applies certain pre-coded rules. In a two-step method, "reaction rules" play the role of important intermediates in the models (**Figure 3**).

TABLE 1 | Summary of some rule-based retrosynthesis models.

reactant to product) by comparing reactants and products, and extending the cores to contain neighboring atoms or functional groups. (D) Clustering the extracted reaction cores into common groups. (E) Producing a generalized rule template for each cluster group and completing the generalized rule templates.

SYNCHEM (Gelernter et al., 1990) was one of the earliest efforts to apply machine learning methods to chemical predictions. It relied on clustering similar reactions and learning when reactions could be applied based on the presence of key functional groups. While SYNCHEM uses active and non-active nodes to label the molecules, subsequent machine learning algorithms use molecular descriptors to characterize the reactants in order to guess the outcome of the reaction. Such descriptors combine experimental or physico-chemical measurements, such as the dipole moment, with theoretical or structural information, such as the number of rings, to represent the properties of the molecule. With descriptors serving as fingerprints of molecules or reactions, classification and similarity calculations become much easier for computer algorithms. The work of Schneider and collaborators (Schneider et al., 2015) is an example of using molecular descriptors to generate reaction fingerprints and classify organic reactions into 50 classes, with random forests, naïve Bayes, K-means and logistic regression. If the input is shortened to include only the reactants or only the products, this method can be applied to reaction prediction or pathway design.
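One simple way to turn molecular descriptors into a reaction fingerprint is to subtract the reactant fingerprints from the product fingerprints, so that unchanged parts of the molecules cancel out. The sketch below illustrates this difference construction with hand-written fragment counts. It is an assumption-laden toy: the fragment labels and the helpers `mol_fp` and `reaction_fp` are hypothetical, and we do not claim that it reproduces the descriptors of the cited work.

```python
from collections import Counter


def mol_fp(fragments):
    """Count-based fingerprint from a list of fragment labels (hypothetical labels)."""
    return Counter(fragments)


def reaction_fp(reactants, products):
    """Difference fingerprint: product counts minus reactant counts, zeros dropped."""
    diff = Counter()
    for mol in products:
        diff.update(mol_fp(mol))
    for mol in reactants:
        diff.subtract(mol_fp(mol))
    return {frag: n for frag, n in diff.items() if n != 0}


# Toy esterification described by hand-assigned fragments.
reactants = [["COOH", "CH3"], ["OH", "CH2", "CH3"]]
products = [["COOR", "CH3", "CH2", "CH3"], ["H2O"]]

print(reaction_fp(reactants, products))
```

Only the fragments that are created or destroyed survive the subtraction, so reactions of the same class tend to receive similar vectors, which can then be passed to a standard classifier.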

During the last 10 years, many algorithms have been published to predict the outcome of organic reactions; they still rely on reaction rules but use machine learning to judge which rule to choose. Although the ideas are similar, they differ in some details. Since outcome prediction is a forerunner of retrosynthetic analysis in this field, we briefly introduce some of the relevant algorithms. Carrera et al. used machine learning to predict the chemical reactivity of organic molecules (Carrera et al., 2009). They trained random forest models for certain molecules (such as BuNH<sub>2</sub> and NaCNBH<sub>3</sub>) to predict their reactivity. However, it is impractical to give every compound an independent model, so this is far from a generalized reaction prediction system. The CSB (Chemical Sense Builder) system (Fica and Nowak, 2005) proposed by Fica and Nowak can simulate and predict organic reactions. This system consists of two separate functional modules, which can be used individually or sequentially. The first contains four logic-based and knowledge-based models for generating and discovering reactions. The second mainly applies learning tools to the reaction simulation process. The CSB takes into account a set of mechanisms controlling the course of reaction generation, including a thermodynamic quantity (the reaction enthalpy) and common reactive sites, and searches for analogies in the reaction database.

Reaction Predictor (Kayala et al., 2011; Kayala and Baldi, 2012) by Kayala et al. is an algorithm that first identifies potential electron sources and electron sinks in the reactant molecules based on atom and bond descriptors. Its first component is a proposal model that analyzes the structures of the input molecules and proposes all possible reactions according to reaction mechanisms. Neural networks are then used to determine the most likely combinations in order to predict the true mechanism. The reported accuracy is 78.1% for polar reactions, 85.8% for pericyclic reactions and 77% for radical reactions. While this approach allows for the prediction of many reactions at the mechanistic level, many organic reactions have relatively complicated mechanisms with several elementary steps, which are costlier for this algorithm to predict. On the other hand, it does not require any reaction template.

Coley et al. also applied a two-step analysis, like Reaction Predictor, but their way of generating the set of possible products is different (Coley et al., 2017). First, they generated a set of chemically plausible products according to pre-defined reaction rules. In this process, like Segler and Waller, they stressed the importance of negative sampling and expanded existing reaction databases with negative reaction examples. Second, a softmax neural network layer (i.e., an exponential activation function that maps a list of numbers to a list of probabilities that sum to one) was applied to generate the probability of each product. The most creative part was the use of "edit-based" information as the learning feature. Four kinds of edits were used as input: (1) an atom a<sub>i</sub> loses a hydrogen; (2) an atom a<sub>i</sub> gains a hydrogen; (3) two atoms, a<sub>i</sub> and a<sub>j</sub>, lose a connecting bond b<sub>ij</sub>; (4) two atoms, a<sub>i</sub> and a<sub>j</sub>, gain a connecting bond b<sub>ij</sub>. The output is the probability of the candidate. Combining the edit-based model with a baseline model (which considers only the structure of the products), the hybrid model achieves an accuracy of 71.8% for top-1, 86.7% for top-3, and 90.8% for top-5. It can also be applied to predict retrosynthetic reactions.
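The two ingredients described above, the four edit types and the softmax ranking of candidates, can be illustrated with a few lines of code. This is a sketch under our own assumptions rather than the published model: atoms are assumed to be atom-mapped, bonds are stored as `(i, j, order)` triples, the candidate scores are invented numbers standing in for the output of a trained network, and the names `edits`, `softmax` and `top_k_hit` are hypothetical.

```python
import math


def edits(r_h, p_h, r_bonds, p_bonds):
    """Return the four edit types from atom-mapped hydrogen counts and bond sets."""
    return {
        "h_lost": sorted(a for a in r_h if p_h[a] < r_h[a]),
        "h_gained": sorted(a for a in r_h if p_h[a] > r_h[a]),
        "bond_lost": sorted(r_bonds - p_bonds),
        "bond_gained": sorted(p_bonds - r_bonds),
    }


def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_hit(probs, true_index, k):
    """True if the recorded product is among the k most probable candidates."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return true_index in ranked[:k]


# Toy acylation of an alcohol: atom 1 = O (alcohol), 2 = C (acyl), 3 = Cl.
toy_edits = edits(
    r_h={1: 1, 2: 0, 3: 0}, p_h={1: 0, 2: 0, 3: 1},
    r_bonds={(2, 3, 1)}, p_bonds={(1, 2, 1)},
)
print(toy_edits)

probs = softmax([2.0, 1.0, 0.1])                # made-up scores for three candidates
print(top_k_hit(probs, true_index=1, k=1), top_k_hit(probs, true_index=1, k=2))
```

The top-k accuracies quoted in the text are the fraction of test reactions for which a check like `top_k_hit` succeeds at the given k.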

Wei et al. (2016) used a graph-convolution neural network proposed by Duvenaud et al. (2015) to infer fingerprints of the reactants and reagents, and then predicted the outcome of reactions based on the reactant fingerprints. These fingerprints are generated from molecular graphs, in which nodes represent atoms and edges represent bonds. At each layer of the convolutional neural network, information flows between neighbors in the graph. Finally, the model generates a fixed-length fingerprint vector. In the subsequent prediction algorithm, Wei et al. classified organic reactions into 16 different types (for alkyl halides and alkenes) and used SMARTS transformations to describe the conversion between reactants and product molecules. This method achieves an accuracy of 85% on test set reactions and 80% on selected textbook questions from Wade problems (Wade, 2013). In fact, previously developed machine learning algorithms were also able to predict the products of these reactions with similar or better accuracy, but the structure of Wei et al.'s algorithm allows for greater flexibility. However, the 16 reaction types cover only a very narrow scope of possible alkyl halide and alkene reactions, which limits the application of the algorithm. Furthermore, the effect of secondary reactants or reagents was over-simplified, as only 50 common ones were taken into consideration.
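How information flowing between neighbors ends up in a fixed-length vector can be illustrated with a non-learned analogue of a graph fingerprint. In the sketch below, each atom repeatedly absorbs the labels of its neighbors, and every intermediate label is hashed into a count vector. This is a simplification chosen for illustration: neural graph fingerprints replace the hashing and concatenation by trainable, differentiable operations, and the function names and the 64-bin length are our own assumptions.

```python
import hashlib


def stable_hash(text, mod):
    """Deterministic hash (Python's built-in hash() of strings is salted per process)."""
    return int(hashlib.sha256(text.encode()).hexdigest(), 16) % mod


def graph_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Fixed-length count vector from iterated neighborhood aggregation.

    atoms: {index: element symbol}; bonds: iterable of (i, j) index pairs.
    """
    neighbours = {a: [] for a in atoms}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    labels = dict(atoms)                          # layer 0: the element symbol
    fp = [0] * n_bits
    for _ in range(radius + 1):
        for a in atoms:
            fp[stable_hash(labels[a], n_bits)] += 1
        # Each atom absorbs the sorted labels of its neighbors.
        labels = {
            a: labels[a] + "(" + ",".join(sorted(labels[n] for n in neighbours[a])) + ")"
            for a in atoms
        }
    return fp


# Ethanol written with two different atom numberings (hydrogens omitted).
fp_a = graph_fingerprint({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
fp_b = graph_fingerprint({0: "O", 1: "C", 2: "C"}, [(0, 1), (1, 2)])
print(fp_a == fp_b, len(fp_a))
```

The vector has the same length for every molecule and does not depend on the atom numbering, which is what allows it to be fed to a neural network with a fixed input size.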

Segler and Waller built a knowledge graph using reaction templates (Segler and Waller, 2017a), which resembles the NOC described in section Building and Searching Reaction Networks. With some additional network-based calculations, this model can find novel reactions by searching for missing nodes in the graph and can predict the catalysts of reactions. Although they did not include machine learning at that stage, one major advance is their idea of negative sampling. As they mentioned, while the positive evaluation of a reaction prediction system can easily be done with a test set of held-out known reactions, negative evaluation with reactions that are known not to occur is a difficult task, because failed reactions and the limitations of synthetic methodology are seldom published. This lack of data has been criticized by both the synthetic chemistry and the chemoinformatics communities. To obtain data on reactions that are unlikely to occur, Segler and Waller randomly selected 36,000 known reactions from their validation set and generated "wrong" (but in some cases still plausible) products with hand-coded reaction rules. The model can then identify the wrong products and label these reactions as unlikely to occur. This means that negative samples can be generated by computers, which has greatly helped the development of machine learning in the field of reaction prediction.
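The principle of computer-generated negative samples can be sketched in a few lines: apply every available rule to the recorded reactants and keep the outcomes that differ from the recorded product. The example below is a toy under our own assumptions, not the published procedure: molecules are replaced by plain text labels, and the three rules in `RULES` as well as the helper `negative_examples` are hypothetical.

```python
def negative_examples(reactants, recorded_product, rules):
    """Apply every rule; any outcome other than the recorded product is a negative."""
    negatives = []
    for name, rule in rules.items():
        candidate = rule(reactants)
        if candidate is not None and candidate != recorded_product:
            negatives.append((candidate, name))
    return negatives


# Hypothetical hand-coded rules acting on text labels instead of real structures.
RULES = {
    "carbonyl_reduction": lambda r: "secondary alcohol" if "ketone" in r else None,
    "alpha_halogenation": lambda r: "alpha-halo ketone" if "ketone" in r else None,
    "wittig_olefination": lambda r: "alkene" if "ketone" in r and "ylide" in r else None,
}

reactants = ("ketone", "hydride reagent")
negs = negative_examples(reactants, "secondary alcohol", RULES)
print(negs)
```

Every returned pair is a product that a rule could formally generate but that was not reported for these reactants, so it can be labeled as "unlikely to occur" when training a classifier.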

Although these methods are not designed specifically for retrosynthesis, some of them can be modified to meet the requirements of retrosynthetic prediction, such as Segler and Waller's reaction graph, Coley et al.'s edit-based model and Wei et al.'s graph-convolution neural network. These methods, together with earlier retrosynthesis methods related to machine learning, have in common that they divide the task into two separate steps and pass through an intermediate step: reaction rules. Programs specialized for reaction pathway prediction can adopt the same process. One important work is Segler and Waller's neural-symbolic approach (Segler and Waller, 2017b) for retrosynthesis and reaction prediction, as well as synthetic pathway design. Since it is specially designed for retrosynthetic analysis, it has a distinguishing feature: global information has to be considered to avoid conflicts. For example, for carbon-carbon coupling reactions, when there are carboxyl or aldehyde groups in the target molecule, the Kumada reaction should be abandoned because the Grignard reagent will react with these groups, so we can only choose the Suzuki reaction, which uses R-B(OH)<sub>2</sub> instead of RMgBr. In their neural-symbolic method, the computer has to learn which named reaction can be used to produce a molecule (or under which rule the starting materials reacted) from all the information about the molecule. After neural networks are trained with millions of examples of known reactions and the corresponding correct reaction rules, the computer assigns a reaction-type label to each input. Their reaction data are from the commercially available Reaxys database. The input is the ECFP4 fingerprint (Unterthiner et al., 2014) of the target molecule. Because this fingerprint is a fixed-length indicator, a neural network with one hidden layer (Clevert et al., 2015) or a deep highway network can be applied. The neural network, which prioritizes rules based on molecular fingerprints, is combined with Monte Carlo tree search to perform retrosynthetic reaction prediction. Applying retrosynthetic prediction repeatedly yields a synthesis pathway. Segler and Waller used 103 hand-coded reaction rules, such as Diels-Alder, Sonogashira and Kumada. Their model can predict retrosynthetic reaction rules with an accuracy of 78% (top-1) and 98% (top-3). They then replaced the 103 hand-coded reaction rules with 8,720 reaction rules automatically extracted from 4.9 million examples. Although the accuracy decreased to 64% (top-1) and 95% (top-3), this approach is fully end-to-end and data-driven. However, they reported an average of 44.5 matches per query, suggesting that the coverage might not be sufficient.

In 2018, Segler et al. published their updated model (Segler et al., 2018). In this work, they proposed the 3N-MCTS approach for chemical synthesis prediction, in which three neural networks are combined with Monte Carlo tree search (MCTS). As in their previous work, reactions published in Reaxys before 2015 were used to extract reaction rules (which contain the reaction-center information), and two separate neural-symbolic models were trained: a relatively slow "expansion policy" for selecting the best candidate transformations and a faster "rollout policy" for estimating the values of synthesis positions. Then, by generating negative examples as in their previous work, they trained a binary filter network for predicting whether reactions really occur. Every reaction proposed in the expansion process is thus evaluated and only feasible ones are kept, which greatly reduces the risk of wrong output. Following the cycle of selection, expansion, rollout and update, the 3N-MCTS model can give results much more quickly than other methods such as plain Monte Carlo and BFS. In a double-blind test, even chemists could not distinguish literature routes from 3N-MCTS results. However, quantitative prediction of enantiomerism is still an unsolved problem in this model. Because of the coverage of the training set, the accuracy of synthetic prediction for natural products is limited.
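The cycle of selection, expansion, rollout and update can be made concrete with a plain Monte Carlo tree search on a toy retrosynthesis problem. This sketch is not 3N-MCTS: the expansion step simply enumerates a hand-written rule table, rollouts are uniformly random, and there is no filter network. The molecules (strings such as "ABC"), the `RULES` table and all function names are invented for illustration.

```python
import math
import random

BUILDING_BLOCKS = {"A", "B", "C"}
# Hypothetical disconnections: product -> possible precursor sets ("AX" is a dead end).
RULES = {"ABC": [("AB", "C"), ("A", "BC"), ("AX", "BC")],
         "AB": [("A", "B")], "BC": [("B", "C")]}


def solved(state):
    """A state is solved when every molecule in it is a building block."""
    return all(m in BUILDING_BLOCKS for m in state)


def expand(state):
    """Successor states obtained by disconnecting the first unsolved molecule."""
    open_mols = [m for m in state if m not in BUILDING_BLOCKS]
    if not open_mols:
        return []
    first = open_mols[0]
    rest = tuple(m for m in state if m != first)
    return [rest + precursors for precursors in RULES.get(first, [])]


def rollout(state, rng, depth=10):
    """Random playout; reward 1.0 if every molecule ends up purchasable."""
    for _ in range(depth):
        if solved(state):
            return 1.0
        children = expand(state)
        if not children:
            return 0.0
        state = rng.choice(children)
    return 1.0 if solved(state) else 0.0


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0


def ucb(node, c=1.4):
    """Upper confidence bound; unvisited nodes are tried first."""
    if node.visits == 0:
        return math.inf
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def mcts(root_state, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                        # 1. selection
            node = max(node.children, key=ucb)
        if node.visits > 0 or node is root:         # 2. expansion
            node.children = [Node(s, node) for s in expand(node.state)]
            if node.children:
                node = node.children[0]
        reward = rollout(node.state, rng)           # 3. rollout
        while node is not None:                     # 4. update
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state


print(mcts(("ABC",)))
```

In this toy problem the search concentrates its visits on the disconnections that lead to purchasable building blocks and visits the dead-end branch only occasionally. In the approach described above, learned policies would take the place of `expand` and `rollout`.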

For all the methods mentioned in this section, reaction rules are still the most important guide for reaction prediction and pathway design, and machine learning plays more of an assisting role. The common limitation of this kind of system, as well as other rule-based ones, is that they do not take stereochemistry into account. We are curious whether this can be solved with more reaction examples or other descriptors, such as stereochemically aware descriptors (Carbonell et al., 2013). Nevertheless, machine learning has undoubtedly accelerated the development of retrosynthesis design greatly. Although these methods have not fully abandoned rule-guided design, their wide range of application and high accuracy are really impressive.

# Fully End-To-End Retrosynthesis Analysis With Deep Neural Networks

In recent years, deep neural networks have been applied to this field. One characteristic feature is that computers do not need to follow human-defined reaction rules; instead, they can learn chemical reactions afresh from nothing but millions of reaction examples. We therefore call these methods end-to-end: scientists only provide computers with the two ends, one being the reactants and the other the products. These methods are fully data-driven. One exception, mentioned in the section on two-step models, is the first template-free approach introduced by Kayala et al. (2011) and Kayala and Baldi (2012). Although it predicts a series of mechanistic steps to obtain one reaction outcome using fingerprints and handcrafted features, it is based on common reaction mechanisms and is therefore not fully data-driven. Many approaches using end-to-end analysis with deep neural networks have been proposed in recent years.

In order to implement the end-to-end methods, one kind of approaches is to define some special data structures to help computers understand the concept of reactions. These data structures are far from traditional "reaction rules" which can be understand by human beings. An important example is Jin et al.'s research (Jin et al., 2017) with a novel approach based on Weisfeiler–Lehman Networks (WLN) (Lei et al., 2017). They trained two independent networks on a set of 400,000 reactions extracted from US patents and their approach bypasses reaction templates by learning a reaction center identifier. In WLN, organic molecules are considered as a graph G = (V, E), where V is the set of atoms (vertices) and E is the set of associated bonds (edges), and a chemical reaction is a pair of molecular graphs (G<sup>r</sup> , Gp). Thus, a reaction center is defined as a minimal set of graph edits needed (change of bond type for certain atom pairs) to transform reactant graph to product graph. The WLN will give every node a vector by training it with the information of all the neighbor nodes, which captures the local chemical environment of the atom and involves a comparison against a learned set of reference environments. Then with the local or global information (taking important reagent into account), they trained the model to predict reactivity label. After generation of candidates according to the reactivity label, they trained another Weisfeiler–Lehman Difference Network (WLDN) to rank the candidates. Their method achieved a top-1 accuracy of 74.0% on a test set of 40,000 reactions. Jin et al. claimed to outperform template-based approaches by a margin of 10% after augmenting the model with the unknown products of the initial prediction to have a product coverage of 100% on the test set. 
Differing from methods summarized in section Two-Step Models—Combination of Rule-based Model and Machine Learning, this approach is not only end-to-end, but also gets rid of the dependence on reaction rules. Though it definitely undergoes an intermediate step of reaction center (defined with certain data structure), this method is more "computational" than "chemical," and the final model becomes more abstract than before.
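
The definition of a reaction center as a set of graph edits can be made concrete with a small sketch. It is only an illustration of the definition above, not the code of Jin et al.: the `BondMap` representation, the function name `reaction_center`, and the use of atom-map numbers as vertex identifiers are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical representation: a molecular graph is a mapping from an
# unordered atom pair (atom-map numbers) to a bond order; absent pairs
# are treated as "no bond" (order 0.0).
BondMap = Dict[FrozenSet[int], float]


def reaction_center(reactants: BondMap, products: BondMap) -> Set[Tuple[int, int, float, float]]:
    """Return the graph edits (atom_a, atom_b, order_before, order_after)
    that turn the reactant graph into the product graph."""
    edits = set()
    for pair in set(reactants) | set(products):
        before = reactants.get(pair, 0.0)
        after = products.get(pair, 0.0)
        if before != after:
            a, b = sorted(pair)
            edits.add((a, b, before, after))
    return edits


def bonds(*triples: Tuple[int, int, float]) -> BondMap:
    """Small helper to write bond lists compactly."""
    return {frozenset((a, b)): order for a, b, order in triples}


# Acetyl chloride + methylamine -> N-methylacetamide (heavy atoms only).
# Atom maps: 1 = CH3, 2 = C(=O), 3 = O, 4 = Cl, 5 = N, 6 = CH3 on N.
reactant_graph = bonds((1, 2, 1.0), (2, 3, 2.0), (2, 4, 1.0), (5, 6, 1.0))
product_graph = bonds((1, 2, 1.0), (2, 3, 2.0), (2, 5, 1.0), (5, 6, 1.0))

center = reaction_center(reactant_graph, product_graph)
# The C-Cl bond is broken and the C-N bond is formed.
```

In this toy case the reaction center consists of two edits; a learned model would have to rank such candidate edits among all atom pairs.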

Other end-to-end methods can even skip the step of the "reaction center" (or similar concepts). Nam and Kim (2016) first applied the seq2seq approach to reaction prediction. Seq2seq (Sutskever et al., 2014) is an algorithm that uses a multilayered Long Short-Term Memory (LSTM) network to map an input sequence (of variable length) to a vector, and then another deep LSTM to decode a target sequence (also of variable length) from that vector (**Figure 4**). It was designed for translation between English and French, with the advantage that only a large amount of parallel data needs to be provided; the deep neural network then automatically extracts the information and features of the two languages and finally performs the translation. Molecular structures can be represented as linear SMILES strings, which can be decomposed into a list of atoms, bonds, and several other kinds of symbols. Hence, from a linguistic perspective, SMILES can be regarded as a language with grammatical specifications. In this sense, the problem of predicting products can be regarded as a problem of translating "reactants and reagents" into "products." Nam and Kim used a reaction database collected from patents by Lowe (2012) and 2001–2013 USPTO data. Their model was based on the TensorFlow translate model (v0.10.0) (Abadi et al., 2016), from which they took the default values for most of the hyperparameters. When tested on Wade problems, the accuracy ranged between 0.35 and 0.85 across the different problem sets.
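
To illustrate how a SMILES string is decomposed into "words" before it is fed to a translation model, the following minimal sketch tokenizes SMILES with a regular expression. The pattern and the function name `tokenize` are assumptions made for this example; they cover common SMILES symbols only and are not the tokenizer used in any of the cited studies.

```python
import re

# Hypothetical token pattern covering common SMILES symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [N+] or [C@@H]
    r"|Br|Cl"               # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|[=#$:/\\\-+().~*>]"  # bonds, branches, dot separator, reaction arrow
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens; fail loudly on unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


# Reactants of an amide formation, separated by a dot.
source_tokens = tokenize("CC(=O)Cl.NC")
# -> ['C', 'C', '(', '=', 'O', ')', 'Cl', '.', 'N', 'C']
```

The round-trip check in `tokenize` guards against silently dropping characters, which would otherwise corrupt the training data.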

With more training data, seq2seq models perform much better in reaction prediction. Schwaller et al. from IBM Research, Zurich also published a seq2seq approach (Schwaller et al., 2017). They built on the idea of relating organic chemistry to a language and explored the application of state-of-the-art neural machine translation methods, which are seq2seq models. In addition to Lowe's data, they used data extracted from granted US patents and patent applications dating from 1976 to September 2016. The portion from granted patents comprises 1,808,938 reactions, described with SMILES. They kept only single-product reactions, corresponding to 92% of the dataset, to have distinct prediction targets. The accuracy is 80.3% for top-1, 84.7% for top-2, 86.2% for top-3, and 87.5% for top-5.
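
The top-k accuracies quoted here count a prediction as correct when the recorded product appears among the k highest-ranked outputs of the model. A minimal sketch of this metric is given below; the function name and the toy data are hypothetical, and the sketch assumes that predictions and targets have already been brought into the same canonical SMILES form so that string equality is meaningful.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked predictions.

    ranked_predictions: one list of SMILES per reaction, best guess first.
    targets: the recorded product SMILES, one per reaction.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one prediction list per target")
    hits = sum(
        1 for predictions, target in zip(ranked_predictions, targets)
        if target in predictions[:k]
    )
    return hits / len(targets)


# Toy example with four reactions (made-up strings, not real model output).
predictions = [
    ["CC(=O)NC", "CC(=O)N"],   # correct at rank 1
    ["CCO", "CCOC"],           # correct at rank 2
    ["c1ccccc1", "CC"],        # never correct
    ["CCN", "CCNC", "CCCl"],   # correct at rank 3
]
targets = ["CC(=O)NC", "CCOC", "CCC", "CCCl"]

top1 = top_k_accuracy(predictions, targets, 1)  # 1 of 4 correct
top3 = top_k_accuracy(predictions, targets, 3)  # 3 of 4 correct
```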

Retrosynthesis is the reverse of reaction prediction: given a product molecule, the goal is to find possible reactants. If the reaction direction is reversed, seq2seq can therefore also solve pathway design problems; such an algorithm was developed by Liu et al. (2017). They used a set of 50,000 reactions extracted and curated by Schneider et al. (2016). The accuracy is 34.1% for top-1, 56.5% for top-5, 62.0% for top-10, and 71.9% for top-50. An important difference from Schwaller et al.'s method is that they did not omit reactions with multiple reactants or products. Instead, such reactions are handled by placing a dot between the separate SMILES strings. In their approach, the dataset was classified into 10 reaction classes, including heteroatom alkylation and arylation, acylation and related processes, etc. The dataset was split into training, validation, and test sets (8:1:1), and the accuracies of the different reaction classes were calculated separately. Because reversed input can increase the accuracy of recurrent neural networks, they also reversed all the SMILES strings before training. Compared with rule-based algorithms (Law et al., 2009), seq2seq retrosynthetic analysis behaves much better for protection and deprotection reactions; that is, the algorithm can judge whether to introduce a protecting group to avoid side reactions. For common bond-forming and bond-breaking reactions, however, this retrosynthetic analysis program cannot outperform traditional rule-based methods. Liu et al. grouped all the errors into three types. First, the model outputs an invalid SMILES string, which means the data are not sufficient for the computer to learn the grammar of SMILES. Second, some reaction rules are wrongly predicted. Third, the overall reaction is chemically plausible but differs from the result in the test set, which means that the accuracy is underestimated to some extent. This is partly because the target molecule can contain multiple reaction sites that can be disconnected retrosynthetically, so that multiple reactant sets are chemically plausible.

The accuracy of retrosynthesis prediction is much lower than that of reaction outcome prediction. The difference between training and testing data is one reason, and the existence of multiple possible pathways in synthetic design is another. However, it is undeniable that none of the previous works achieved end-to-end learning to the level of seq2seq models, and the accuracy of reaction product prediction has reached its highest level so far. An obvious disadvantage compared with template-based methods is that the output strings are not guaranteed to be valid SMILES, which might decrease the prediction accuracy. Another limitation of the training procedure is the existence of multiple pathway choices. However, this problem only affects the apparent accuracy, and the algorithms can still give valuable retrosynthesis pathway predictions.

# PERSPECTIVE

It is now clear that high-quality synthesis analysis systems are required to meet various needs in chemistry. With the development of learning algorithms and databases, these needs are gradually being met or are the subject of active research, but many challenges remain, including regiochemistry and stereochemistry. Computational synthesis analysis and pathway design is a task full of contradictions: more reaction rules mean more matches for each query, but they are also likely to produce implausible suggestions; local scoring functions (for each step) may not give the best pathway, but designing functions that target the global optimum is very difficult. This is why scientists have recently shifted their attention to deep learning algorithms. However, methods like seq2seq are still not good enough for academic or commercial use.

From an organic chemist's point of view, synthesis design is a kind of art rather than a science: which intermediate to choose, whether to protect a functional group, and so on. Computer algorithms, however, whether rule-based methods or deep neural networks, mainly focus on the feasibility of each step (some new methods even address only the first step) and neglect the idea of "designing." To reach the level of intelligent design, algorithms other than seq2seq and datasets that contain multiple-step synthetic data should be developed. If we regard chemical space as a "compound surface," present methods can tell us "how to take a correct step," but what we need is the "shortest trajectory," which is a higher-level result.

Besides developing more methods for common chemical reactions, other fields also need the help of synthesis analysis. For example, biomimetic and biological synthesis is a tricky problem, and choosing proper enzymes can greatly reduce the complexity of a synthesis pathway. Projects like PathPred (Moriya et al., 2010) used methods similar to database searching, but the results are limited by the insufficient coverage of the database and a relatively poor ability to generalize. There are also learning-based methods like Dale et al.'s model (Dale et al., 2010) and rule-based methods like the University of Minnesota Pathway Prediction System (Gao et al., 2011) for biosynthesis pathway prediction. Predicting the conditions of unknown reactions is another extension of synthesis analysis systems.

In summary, the past decades have seen plenty of exciting breakthroughs in chemical synthesis analysis and pathway design. Today, computers can be used to predict viable syntheses leading to quite complex targets and, with further development of computational methods, they will become better. As these systems of many varieties become more widely known and studied, the trends in chemical synthesis analysis systems will become more apparent and will stimulate research and development in directions not yet envisioned.


#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### FUNDING

This work has been supported by the Ministry of Science and Technology of China (2016YFA05023032) and the National Natural Science Foundation of China (21673010, 21633001).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Feng, Lai and Pei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Conformational Sampling of Small Molecules With iCon: Performance Assessment in Comparison With OMEGA

#### Giulio Poli<sup>1</sup>, Thomas Seidel<sup>2</sup>\* and Thierry Langer<sup>2</sup>

<sup>1</sup> Department of Biotechnology, Chemistry and Pharmacy, University of Siena, Siena, Italy, <sup>2</sup> Department of Pharmaceutical Chemistry, University of Vienna, Vienna, Austria

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Austria

#### Reviewed by:

Pramod C. Nair, Flinders University, Australia

Dharmendra Kumar Yadav, Gachon University of Medicine and Science, South Korea

Ariel Fernandez, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

#### \*Correspondence:

Thomas Seidel thomas.seidel@univie.ac.at

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 10 January 2018 Accepted: 31 May 2018 Published: 19 June 2018

#### Citation:

Poli G, Seidel T and Langer T (2018) Conformational Sampling of Small Molecules With iCon: Performance Assessment in Comparison With OMEGA. Front. Chem. 6:229. doi: 10.3389/fchem.2018.00229

Herein we present the algorithm and performance assessment of our newly developed conformer generator iCon, which was implemented in LigandScout 4.0. Two data sets of high-quality X-ray structures of drug-like small molecules originating from the Protein Data Bank (200 ligands) and the Cambridge Structural Database (481 molecules) were used to validate iCon's performance in the reproduction of experimental conformations. OpenEye's conformer generator OMEGA was subjected to the same evaluation and served as the reference software in this analysis. We tested several setting patterns in order to identify the most suitable and efficient ones for conformational sampling with iCon; equivalent settings were also tested on OMEGA in order to compare the results obtained from the two programs and better assess iCon's performance. Overall, this study showed that iCon is able to generate reliable, representative conformational ensembles of drug-like small molecules, yielding results comparable to those shown by OMEGA, and is thus ready to serve as a valuable tool for computer-aided drug design.

Keywords: conformer generation, conformational analysis, drug design, pharmacophore modeling, virtual screening

# INTRODUCTION

Conformer generation still represents a remarkably important topic within the Computer-Aided Molecular Design (CAMD) field. The exploration of the conformational space of small molecules is a challenging task that is required for different applications, ranging from the search for the conformation of a molecule at its global energy minimum to the generation of conformational ensembles that properly represent all possible low-energy spatial arrangements that molecules are allowed to assume. In particular, the latter analysis constitutes a fundamental step in many in-silico studies comprising pharmacophore modeling and pharmacophore-based virtual screening (VS) (Güner et al., 2004; Wolber and Langer, 2005), shape-based similarity searches (Hawkins et al., 2007; Sastry et al., 2011), docking and other VS methods (Cross et al., 2010; McGann, 2012), as well as different approaches like 3D and 4D QSAR modeling (Shim and MacKerell, 2011). Moreover, these techniques require different levels and qualities of conformational sampling depending on their specific goals. Therefore, it is important to balance the speed and thoroughness of the conformational sampling process depending on the size of the database to be screened, in order to produce a conformational ensemble of appropriate size that still guarantees reliable results. In this context, automated clustering algorithms have recently been applied for resampling conformational ensembles of small molecules (Kim et al., 2017). Such clustering approaches based on RMSD matrices, which can also find application in post-processing docking results (Tuccinardi et al., 2014), were employed to filter out unrepresentative molecular conformers with the aim of reducing the size of ensembles and data while still providing a high coverage of the ligand's conformational space. These data highlight that the conformational sampling of small molecules is still a hot topic in CAMD. Given the different tasks conformer generators are used for, it is not surprising that a substantial number of programs based on different sampling algorithms (Hawkins, 2017), belonging to both stochastic (Chang et al., 1989; Treasurywala et al., 1996; Saunders, 1998; Güner et al., 2004; Watts et al., 2010) and deterministic methods (Smelliem et al., 2003; Renner et al., 2006; Li et al., 2007; Hawkins et al., 2010), have already been developed. In particular, deterministic sampling algorithms are used within several well-known conformer generators employed for VS applications, including CONAN (Smelliem et al., 2003), ROTATE (Renner et al., 2006), CAESAR (Li et al., 2007), and OMEGA (Hawkins et al., 2010; OpenEye Scientific Software, 2013), whose performance has been widely validated and compared to other software (Boström, 2001; Good and Cheney, 2003; Loferer et al., 2007; Schwab, 2010; Friedrich et al., 2017). Nevertheless, novel software continuously appears on the CAMD scene, where ever newer and more efficient tools are needed; these tools are tested for their ability to map the conformational space and to reproduce the conformations of experimentally determined crystal structures of drug-like small molecules (Miteva et al., 2010; O'Boyle et al., 2011; Ebejer et al., 2012; Friedrich et al., 2017).
Here we report the algorithm and the performance assessment of the novel conformer generator iCon implemented in LigandScout (Wolber and Langer, 2005) which uses a systematic, knowledge-based approach for the generation of conformational ensembles to be employed in the generation of pharmacophore models and in the creation of screening databases for pharmacophore-based searches. With the aim of best analyzing iCon's performance, we evaluated representative data sets of test compounds to be used in our study. Recently, Hawkins and co-workers reported an algorithm validation of the conformer generator OMEGA for its default settings by using two sets of high-quality crystallographic structures of small molecules originating from the Protein Data Bank (PDB) (Berman et al., 2000) and the Cambridge Structural Database (CSD) (Allen, 2002) that were selected by filtering larger data sets used in previous studies (Hawkins et al., 2010). These data sets were then further refined after an analysis aimed at better understanding their suitability for conformational sampling (CS) validation as well as identifying and studying OMEGA's failures, showing that they were able to well represent the torsion angle space of the parent sets (Hawkins and Nicholls, 2012). Stimulated by these analyses, we decided to use these data sets to validate the performance of iCon regarding the reproduction of crystallographic conformations of drug-like small molecules and to compare it to the corresponding results obtained with OMEGA. A wide panel of different settings has been tested for iCon in the attempt to identify the most suitable ones. In particular, we analyzed the impact of the main conformational sampling parameters on the size and quality of the conformational ensembles generated by iCon for the two data sets of small molecules using 20 different setting patterns. 
For each setting, the reliability of the conformers generated by iCon for the test ligands was evaluated based on the accuracy in the reproduction of their experimental conformations, which was assessed by using two different metrics of conformational similarity. The same analysis was also performed using the software OMEGA, which is one of the best conformer generators available today and thus served as the reference for iCon's performance evaluation. The quality of the conformational ensembles generated with OMEGA using 20 setting patterns corresponding to those tested with iCon was assessed, and the results produced by the two software packages were compared. Based on the whole analysis, the reliability of the new conformer generator iCon was demonstrated and the most suitable iCon setting patterns were identified.

### MATERIALS AND METHODS

#### Data Sets Preparation

Two different data sets comprising 200 X-ray ligand structures originating from the Protein Data Bank and 481 X-ray structures from the Cambridge Structural Database, representing the final data sets of structures used by Hawkins and co-workers in their reported analyses concerning OMEGA's performance (Hawkins and Nicholls, 2012), were employed in this study.

For the creation of the PDB data set, we analyzed the PDB complexes from which the ligands used in Hawkins' study were extracted (see Supplementary Material) to obtain the corresponding ligand three-letter codes. The structures of all ligands were downloaded from the RCSB Ligand Expo database (www.ligand-expo.rcsb.org) in sd-file format. Hydrogen atoms were added to the ligands by using LigandScout 4.0 (Inte:Ligand GmbH, 2015) and then the molecules were visually checked for correctness on the basis of their corresponding parent X-ray complexes. For the creation of the CSD data set, the CSD molecules used in Hawkins' study were directly downloaded from the CSD database (in sd-file format). The experimental ligand conformations obtained in this way served as reference structures in the computation of root mean square deviation (RMSD) and Tanimoto Combo (TC) score values for the corresponding conformers generated by iCon and OMEGA (vide infra). To avoid any bias that could affect the conformer generation by starting from 3D structures, the two data sets were converted into SMILES notation by using OpenEye's Babel 3.327 (OpenEye Scientific Software, 2010). The resulting 681 SMILES codes served as the input for iCon and OMEGA and could be processed without any issues by the two programs.

**Abbreviations:** CAMD, computer-aided molecular design; CS, conformational sampling; CSD, Cambridge structural database; HA, heavy atom; MMFF94, Merck molecular force field; NOC, number of conformers; PDB, protein data bank; RB, rotatable bond; RMSD, root mean square deviation; TC, Tanimoto combo; VS, virtual screening.

#### Conformer Generation With iCon

Since OMEGA's algorithm has been broadly discussed elsewhere (Hawkins et al., 2010) here we describe the conformer generation algorithm of iCon, which uses a systematic, knowledge-based approach for the generation of conformer ensembles similar to CAESAR (Li et al., 2007). The overall process is presented schematically in **Figure 1** and can be divided into four logical phases that are described in more detail below.

#### Phase 1: Input Molecule Analysis and Fragmentation

When iCon starts to process an input molecule, the first step is the perception of all the rotatable bonds within the molecule. A rotatable bond is any single bond that is not a member of a ring system and connects only non-terminal heavy atoms (e.g., a bond to a methyl group or chlorine is not considered rotatable). For each detected rotatable bond, a lookup in the built-in torsion rule database is performed to extract preferred relative torsions that are characteristic for the substituents of the bond. If a matching torsion rule cannot be found, one of the hard-coded fallback rules is applied, which provide default torsion angles depending on the hybridization state of the bonded atoms. The next step is the perception of any topological symmetry that may occur in the input molecule. The automorphism mappings of the heavy atoms obtained in this way are used in the conformer buildup stage for the detection of generated duplicate conformations that need to be discarded. The last step in phase 1 is the logical transformation of the input molecule into a tree-like hierarchy of structure fragments (see **Figure 1**). This is done by splitting the input molecule (which represents the root node of the tree) at its most central rotatable bond (green bond) into two smaller fragments of nearly the same structural complexity (fragments 1 and 2). The same procedure is then applied recursively to the two initial fragments until only fragments that cannot be partitioned any further remain. Those terminal fragments (fragments 3, 4, 5, and 6) represent the smallest conformational units of the input molecule and can be either simple heavy atom centers (e.g., -CH2-), rigid chain fragments (e.g., >C=C<), or various kinds of ring systems and combinations thereof.
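
The rotatable bond perception and the recursive fragmentation described above can be sketched on a plain heavy-atom graph. This is a simplified illustration and not iCon's implementation: all function names are hypothetical, "structural complexity" is approximated here by the heavy-atom count, the atoms are simply partitioned between the two child fragments, and the torsion rule lookup and symmetry perception are left out.

```python
from collections import deque


def make_mol(bonds):
    """Build adjacency lists and a bond-order table from (a, b, order) triples."""
    adj, order = {}, {}
    for a, b, bond_order in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
        order[frozenset((a, b))] = bond_order
    return adj, order


def side_atoms(adj, start, blocked, allowed=None):
    """Atoms reachable from `start` without crossing the bond `blocked`."""
    seen, queue = {start}, deque([start])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if frozenset((a, b)) == blocked or b in seen:
                continue
            if allowed is not None and b not in allowed:
                continue
            seen.add(b)
            queue.append(b)
    return seen


def rotatable_bonds(adj, order):
    """Single, acyclic bonds that connect two non-terminal heavy atoms."""
    result = []
    for bond, bond_order in order.items():
        a, b = tuple(bond)
        if bond_order != 1 or len(adj[a]) < 2 or len(adj[b]) < 2:
            continue
        if b in side_atoms(adj, a, bond):  # still connected: ring bond
            continue
        result.append(bond)
    return result


def build_tree(atoms, adj, rot):
    """Split `atoms` at the most central rotatable bond, then recurse.

    Leaves (terminal fragments) are sorted atom lists; inner nodes are
    two-element lists [left_subtree, right_subtree].
    """
    best = None
    for bond in rot:
        if not bond <= atoms:
            continue
        left = side_atoms(adj, min(bond), bond, allowed=atoms)
        imbalance = abs(len(atoms) - 2 * len(left))
        if best is None or imbalance < best[0]:
            best = (imbalance, left)
    if best is None:
        return sorted(atoms)
    left = best[1]
    return [build_tree(left, adj, rot), build_tree(atoms - left, adj, rot)]


# n-Hexane as a chain of six heavy atoms numbered 0..5.
adj, order = make_mol([(i, i + 1, 1) for i in range(5)])
rot = rotatable_bonds(adj, order)
tree = build_tree(set(adj), adj, rot)
```

For the hexane chain the central bond 2-3 is chosen first, and each half is then split once more at its remaining rotatable bond.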

#### Phase 2: Generation of Terminal Fragment Conformations

Initial conformations assigned to the structural units at the leaf nodes of the fragment tree serve as the primary building blocks for the recursive assembly of fragment conformer ensembles on higher tree levels. Conformer 3D coordinates are generated by the following procedure, which is based on a distance geometry approach: First, a distance bounds matrix is generated using the connection table of the fragment. The distance constraints are then augmented by volume constraints for defined chiral centers and any planar moieties of the fragment. In the next step, random 3D coordinates are assigned to each atom and then optimized to fulfill the distance and volume constraints. The raw coordinates obtained in this way are further refined using a modified version of the static Merck Molecular Force Field (MMFF94s) (Halgren, 1996a,b,c,d, 1999a,b; Halgren and Nachbar, 1996) in which electrostatic interactions are not considered in the energy calculation. In the case of terminal fragments representing flexible ring systems, multiple conformations of the system may be possible. If enabled (as by default, enum-rings option), the geometry optimization procedure is therefore repeated many times to obtain a set of multiple unique conformations of the ring system, until a maximum number of consecutive failed attempts to generate a new conformation or the timeout limit (max-frag-buildtime option) has been exceeded. Terminal fragments containing invertible nitrogen atoms are also treated specially (if enabled, as by default, with the enum-nitrogens option). For such fragments, the substituents of each invertible nitrogen atom are simply flipped and again refined in the force field to yield a second set of fragment 3D coordinates. The generation of terminal fragment conformations by the distance geometry/force field optimization procedure just described is quite simple but rather time consuming.
For the speedup of the overall process, calculated terminal fragment conformations get stored in a continuously growing (up to an internal maximum size) dedicated cache. Whenever a future input molecule with an already processed fragment is encountered, the lengthy calculations can be bypassed and the cached fragment conformations are used instead.

#### Phase 3: Generation of Flexible Fragment Conformers

Phase 3 is concerned with the recursive assembly of conformer ensembles, which starts at the terminal fragments. To explain the process, let us consider the assembly of two fragments FX and FY at level L+1 of the tree into the larger parent fragment FXY at level L. Fragments FX and FY are connected by the rotatable bond BXY of the parent fragment, and the conformations of both child fragments are available, either because they were assembled at a lower level or because they were generated in phase 2. At this stage, all conformers of FX and FY contain no duplicates, show no atom clashes, satisfy the user-specified energy window constraint (e-window option), and are ordered by increasing MMFF94 energy. The assembly of FX and FY comprises the following sub-steps: The first step is to align the bond BXY in all conformers of both FX and FY in such a way that the bond has the same standard orientation (e.g., in the direction of the x-axis). In the next step, one conformation from FX and one from FY are selected and their coordinates are combined with a relative torsion angle taken from the list of favorable torsions provided by the assigned torsion library entry. Afterwards, the MMFF94 energy of the new conformer candidate is calculated and compared with the energies of the previously generated conformations. If the difference between the candidate conformation energy and the energy of the lowest energy conformer so far is larger than the user-specified energy window, the new conformation is rejected because any parent fragment conformations generated from it would then also exceed the energy threshold. One thing to note is that there is no explicit check for atom clashes in iCon. Conformers with van der Waals clashes show a rather high MMFF94 energy that always exceeds the specified energy window, which in turn leads to their automatic exclusion from any further processing. The next step is to make sure that the generated conformer is not a duplicate of a previously generated conformation. Duplicates may arise due to local rotational symmetries and must be excluded from the final list of fragment conformations. If the candidate conformation is not a duplicate, it is inserted into the list of intermediate fragment conformers. If the inserted conformation is the new lowest energy conformation found so far, any previously generated conformations that now exceed the energy window are discarded. The number of fragment conformers stored at each node has an upper limit that is calculated dynamically depending on the number of rotatable bonds, the number of requested output conformations, and the tree level. For the root node the limit is set to max(PS, 5×N), where N is the number of requested output conformations (max-num-confs option) and PS is the value of the max-pool-size option. For an internal node the maximum ensemble size depends on the number of rotatable bonds in the subtree and on the number of requested conformers of the parent node. If the maximum ensemble size is exceeded by a new conformation, the highest energy conformer is simply discarded to keep the ensemble size at its upper limit.
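
The bookkeeping of a fragment conformer ensemble (energy window, re-filtering when a new minimum is found, and the size cap) can be sketched as follows. The class `ConformerPool` and the function `root_pool_limit` are hypothetical names introduced for this illustration; the duplicate check and the size formula for internal nodes are not reproduced because the text does not specify them in detail.

```python
import bisect


def root_pool_limit(max_pool_size, max_num_confs):
    """Root-node ensemble limit max(PS, 5 x N) as stated in the text."""
    return max(max_pool_size, 5 * max_num_confs)


class ConformerPool:
    """Energy-sorted conformer list with an energy window and a size cap."""

    def __init__(self, e_window, max_size):
        self.e_window = e_window
        self.max_size = max_size
        self.items = []  # (energy, conformer_id), ascending by energy

    def add(self, energy, conformer_id):
        """Try to insert a candidate; return True if it is kept."""
        # Reject candidates above the window of the current minimum.
        if self.items and energy - self.items[0][0] > self.e_window:
            return False
        bisect.insort(self.items, (energy, conformer_id))
        # A new minimum may push older members out of the window.
        lowest = self.items[0][0]
        self.items = [it for it in self.items if it[0] - lowest <= self.e_window]
        # Keep the ensemble at its upper limit by dropping the highest energy.
        if len(self.items) > self.max_size:
            self.items.pop()
        return (energy, conformer_id) in self.items


pool = ConformerPool(e_window=10.0, max_size=3)
kept = [
    pool.add(20.0, "a"),  # first conformer, always kept
    pool.add(35.0, "b"),  # 15 above the minimum: rejected
    pool.add(28.0, "c"),  # inside the window
    pool.add(15.0, "d"),  # new minimum, "c" (13 above) is discarded
    pool.add(16.0, "e"),
    pool.add(18.0, "f"),  # pool is full, highest-energy "a" is dropped
]
```

Energies are in the same unit as the window (kcal/mol in the study); the identifiers stand in for coordinate sets.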

#### Phase 4: Selection of Output Conformations

Once a pool of candidate low energy conformations of the input molecule has been obtained, the requested number of output conformations is selected under the specified RMSD constraints (rms-thresh option). The selection algorithm works as follows: First, the list of root fragment conformations is ordered by increasing MMFF94 energy value and the lowest energy conformer is put into the list of output conformations. Using this conformer as a reference, the list of fragment conformations is searched in order of increasing energy to find a new conformer whose heavy atom 3D coordinates differ at least by the specified RMSD threshold. If such a conformation could be found, it is put into the list of output conformers and the search for the next sufficiently different conformation continues. This process is repeated until the requested number of output conformations or the end of the list of fragment conformations has been reached.
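
The selection procedure can be sketched as a greedy filter over the energy-sorted pool. The sketch rests on two assumptions that go beyond the text: a candidate is compared against all conformers already selected (one possible reading of the description), and the RMSD is computed on coordinates that are taken to be already superposed, without alignment or symmetry handling. The names `select_output` and `plain_rmsd` are hypothetical.

```python
import math


def plain_rmsd(coords_a, coords_b):
    """Heavy-atom RMSD of two pre-aligned coordinate lists (no superposition)."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def select_output(pool, max_num_confs, rms_thresh, rmsd=plain_rmsd):
    """Greedy selection of diverse low-energy conformers.

    pool: list of (energy, coords) tuples in any order.
    Returns at most max_num_confs entries, lowest energy first.
    """
    selected = []
    for energy, coords in sorted(pool, key=lambda item: item[0]):
        if all(rmsd(coords, kept) >= rms_thresh for _, kept in selected):
            selected.append((energy, coords))
            if len(selected) == max_num_confs:
                break
    return selected


# Toy pool of two-atom "conformers" (coordinates in angstrom).
toy_pool = [
    (3.0, [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]),  # clearly different geometry
    (1.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),  # lowest energy, always kept
    (2.0, [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]),  # near-duplicate of the first
]
output = select_output(toy_pool, max_num_confs=5, rms_thresh=0.5)
```

Because the pool is traversed in order of increasing energy, a near-duplicate is always dropped in favor of its lower-energy neighbor.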

#### Conformational Model Generation

A total of 20 different setting patterns was used for the generation of conformational models of the two compound collections (**Table 1**). In each setting pattern, three parameters that have an analogous meaning in iCon and OMEGA (version 2.4.6.35) were systematically modified, while all other parameters were left unchanged. The parameters modified in the different settings are e-window, max-num-conf, and rms-thresh in iCon and ewindow, maxconfs, and rms in OMEGA. The e-window and ewindow parameters define the strain energy window allowed for conformers to be included in the final ensemble of conformers. Conformers with a strain energy higher than the sum of the energy of the global minimum conformer and the e-window/ewindow value are rejected. The default ewindow value for OMEGA is 10 kcal/mol. The max-num-conf and maxconfs parameters define the maximum number of conformers that can be included in the final ensemble of conformers (the default maxconfs setting for OMEGA is 200). If the number of conformers satisfying the energetic criteria is higher than the allowed limit, conformers with the highest strain energies are rejected until the threshold value is reached. The rms-thresh and rms parameters define the minimum RMSD of coordinates below which two conformers are considered duplicates. The default value for OMEGA's rms option is 0.5 Å. To simplify the analysis of the results, the setting patterns were divided into low, medium, and high accuracy settings depending on the average number of conformers (NOC) generated for the compounds of the PDB data set: up to 100 for low accuracy settings, from 100 to 200 for medium accuracy settings, and from 200 to 500 for high accuracy settings.

#### Computation of RMSD Values

TABLE 1 | Setting patterns tested for conformer generation with iCon and OMEGA.

<sup>a</sup>MC, max-num-conf/maxconfs; <sup>b</sup>EW, e-window/ewindow; <sup>c</sup>RT, rms-thresh/rms.

RMSD values between the experimental ligand conformations and the related ensembles of conformers generated by iCon and OMEGA employing the different setting patterns were calculated for each molecule. Only heavy atoms were considered in the RMSD computation, without including any mass-weighted term. For each molecule only the RMSD value between the crystallographic conformation and the best-fitting conformer was considered for performance analyses. For the actual calculation of the heavy atom RMSD of two conformations an alignment in 3D space is required. The alignment was performed by a Java implementation of Kabsch's algorithm (Kabsch, 1976, 1978) which calculates the optimal rotation matrix that minimizes the RMSD between two paired sets of points (positions of the heavy atoms). Rotational symmetries were considered in the alignment and RMSD calculation by trying all possible pairings of equivalent heavy atoms and then using only the lowest obtained RMSD for the comparison of the two conformations.
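
For reference, the superposition and the symmetry-aware RMSD can be written out explicitly. The following is the textbook form of Kabsch's method and is not taken from the Java implementation used in the study; the symbols are introduced here for illustration only: $x_i$ and $y_i$ are the heavy-atom positions of the generated and the experimental conformation, $\bar{x}$ and $\bar{y}$ their centroids, $N$ the number of heavy atoms, $\pi$ one pairing of equivalent atoms, and $A$ the set of all such pairings.

$$
\begin{aligned}
H_{\pi} &= \sum_{i=1}^{N} \left(x_{\pi(i)} - \bar{x}\right)\left(y_i - \bar{y}\right)^{\mathsf T}, \qquad H_{\pi} = U \Sigma V^{\mathsf T} \quad \text{(singular value decomposition)},\\
d &= \operatorname{sign}\!\left(\det\!\left(V U^{\mathsf T}\right)\right), \qquad R_{\pi} = V \,\operatorname{diag}(1, 1, d)\, U^{\mathsf T},\\
\mathrm{RMSD}_{\pi} &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert R_{\pi}\left(x_{\pi(i)} - \bar{x}\right) - \left(y_i - \bar{y}\right) \right\rVert^{2}},\\
\mathrm{RMSD} &= \min_{\pi \in A} \mathrm{RMSD}_{\pi}.
\end{aligned}
$$

The factor $d$ prevents the optimal transformation from being an improper rotation (a reflection), and the sum carries no mass weights, in line with the unweighted heavy-atom RMSD described above.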

#### Computation of Tanimoto Combo Scores

The Tanimoto Combo (TC) represents a metric complementary to the RMSD for comparing experimental and generated ligand conformations. It comprises two different scores: shape Tanimoto and color Tanimoto. Shape Tanimoto refers to the structural shape similarity, whereas color Tanimoto refers to the matching of the ligands' functional groups. Each score provides a contribution ranging from 0 to 1 to the TC score, which can thus assume values between 0 and 2. The TC scores for the superposition between the experimental conformations of the test compounds and the related ensembles of conformers generated by iCon and OMEGA were calculated by using the Shape Toolkit (Haigh et al., 2005) implemented in ROCS (Hawkins et al., 2007; OpenEye Scientific Software, 2012) from OpenEye Scientific Software. Shell scripts were employed to allow the automated calculation of the TC score values for the conformer ensembles generated with the different tested setting patterns. For each compound, only the TC score of the superposition between the crystallographic conformation and the best-matching generated conformer was used for performance analyses.
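
How the two contributions combine into a TC score can be illustrated with a short sketch. It uses the general Tanimoto form, overlap divided by the sum of the self-overlaps minus the overlap, and assumes that the overlap values are already available; computing the actual shape and color overlaps is the task of ROCS and is not reproduced here. All function names and numbers are hypothetical.

```python
def tanimoto(overlap_ab, self_overlap_a, self_overlap_b):
    """General Tanimoto coefficient from an overlap and two self-overlaps."""
    return overlap_ab / (self_overlap_a + self_overlap_b - overlap_ab)


def tanimoto_combo(shape_tanimoto, color_tanimoto):
    """TC score as the sum of the shape and color contributions (0 to 2)."""
    for value in (shape_tanimoto, color_tanimoto):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each Tanimoto contribution must lie in [0, 1]")
    return shape_tanimoto + color_tanimoto


def best_tc(ensemble_scores):
    """TC score of the best-matching conformer of an ensemble.

    ensemble_scores: iterable of (shape_tanimoto, color_tanimoto) pairs,
    one per generated conformer.
    """
    return max(tanimoto_combo(shape, color) for shape, color in ensemble_scores)


# Made-up overlap values for one conformer against the crystal structure.
shape = tanimoto(overlap_ab=80.0, self_overlap_a=100.0, self_overlap_b=100.0)
identical = tanimoto(100.0, 100.0, 100.0)  # perfect overlap gives 1.0

# Made-up scores for an ensemble of three conformers.
score = best_tc([(0.50, 0.25), (0.75, 0.5), (0.25, 0.75)])
```

Taking the maximum over the ensemble mirrors the evaluation protocol above, in which only the best-matching conformer of each compound enters the analysis.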

### Hardware Specifications

All calculations for which computation time was measured were performed on a single PC equipped with an Intel i7-3770K 3.50 GHz processor and 8 GB RAM, running CentOS Linux 5.8. All calculations were run in single-CPU mode.

# RESULTS AND DISCUSSION

In order to evaluate the performance of iCon in reproducing experimentally determined ligand conformations, two data sets of high quality X-ray structures originating from the PDB and CSD were created. These data sets comprise a total of 681 structures (200 for the PDB data set and 481 for the CSD data set) and were selected by Hawkins and co-workers to validate the performance of their conformer generator OMEGA (Hawkins and Nicholls, 2012). The choice of these structures as test set for iCon's validation was also driven by the intention to use OMEGA as a reference software, since it is one of the best conformer generators available today.

#### Data Set Properties

The two data sets show different distributions of heavy atoms and rotatable bonds among the test compounds. For the ligands belonging to the PDB data set a quite homogeneous distribution of the heavy atoms (HAs) was observed, especially for compounds with up to 30 HAs (**Figure 2A**). By contrast, about 95% of CSD compounds had between 15 and 30 HAs, and almost 45% of the molecules had 21–25 HAs. Regarding the distribution of the number of rotatable bonds (RBs) in the data set compounds, PDB ligands again showed a more homogeneous trend than the CSD structures (**Figure 2B**). In the CSD data set about 95% of compounds had fewer than 7 RBs and no molecules with more than 9 rotors were found, whereas 29% of PDB ligands had more than 7 RBs and 15% of compounds showed an average of 13 rotors. All these data indicate that the PDB data set comprises molecules with a larger range of molecular weight and a higher conformational freedom than the CSD structures. This makes the conformations of PDB ligands more challenging to reproduce than those of the CSD molecules.

#### Influence of the Sampling Parameters on the NOC

The NOC generated by conformational sampling strongly influences the performance of a conformer generator in reproducing experimentally derived conformations; the higher the NOC in a conformational ensemble, the higher the probability that a conformer closely fitting the experimental one can be found in that ensemble. On the other hand, the quality of the sampling process also depends on the way the conformational space of the molecules is sampled. For example, the generation of a large number of redundant conformers does not help in the exploration of all the dispositions that a molecule can assume according to its conformational freedom, but increases the calculation time and the data file size. This is the reason why the different parameters influencing the NOC should be reciprocally calibrated, so that the compounds' conformational space can be adequately covered according to the NOC generated through the sampling process. To this end, it is important to understand how the different parameters affect the conformational sampling of different compounds.
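To make the interplay of the three parameters concrete, the following sketch shows a generic greedy pruning scheme. It is an assumption for illustration only and not the actual algorithm of iCon or OMEGA; the argument names merely mirror max-num-conf, e-window, and rms-thresh, and `energy` and `distance` are hypothetical callables supplied by the caller.

```python
def prune_ensemble(conformers, energy, distance, max_num_conf, e_window, rms_thresh):
    """Greedy selection of a non-redundant, low-energy conformer subset."""
    if not conformers:
        return []
    # Visit conformers from lowest to highest energy.
    ranked = sorted(conformers, key=energy)
    e_min = energy(ranked[0])
    kept = []
    for conf in ranked:
        if energy(conf) - e_min > e_window:
            break  # all remaining conformers lie outside the energy window
        # Keep a conformer only if it differs enough from every kept one.
        if all(distance(conf, other) >= rms_thresh for other in kept):
            kept.append(conf)
        if len(kept) == max_num_conf:
            break  # ensemble size limit reached
    return kept
```

In such a scheme, lowering `rms_thresh` admits more (and more similar) conformers, while `max_num_conf` and `e_window` cap the ensemble from the size and energy side, respectively.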

**Figure 3** shows the average NOC generated with iCon for the two data sets by employing all the different setting patterns, together with some of the results obtained with OMEGA by using the same settings. As expected from the molecular properties analyzed for the two data sets, a higher NOC was always generated for the PDB ligands, and this difference increased along with the maximum NOC allowed for the ensembles. This trend can be best observed by comparing the results obtained with iCon for MedAcc\_3 and HighAcc\_2 settings, which differ only in the values of max-num-conf (200 and 500, respectively). In fact, the difference between the average NOC produced for CSD and PDB compounds increased more than 3.5 times on passing from MedAcc\_3 to HighAcc\_2. Therefore, the max-num-conf parameter proved to have a stronger influence on conformer generation for PDB ligands than for CSD compounds.

Conversely, when the NOC was increased due to a lower RMSD threshold allowed among the output conformers, the difference between the NOC for CSD and PDB compounds was found to be smaller. This is clearly shown by the comparison of HighAcc\_3 and HighAcc\_4 settings, for which a reduction of the rms-thresh value from 0.5 to 0.2 Å led to the generation of a much higher NOC for both data sets, but with a much smaller gap between them (PDB/CSD NOC with HighAcc\_3 settings = 232/171; PDB/CSD NOC with HighAcc\_4 settings = 343/339). Interestingly, increasing the max-num-conf value up to 800 in the HighAcc\_7 settings raised the NOC produced for the two data sets to almost 500 conformers per ensemble while maintaining such a small gap. These findings indicate that both max-num-conf and rms-thresh parameters have a strong influence on the NOC. However, for compounds with less conformational freedom a low RMSD threshold has a bigger impact on the production of large conformational ensembles, even though it can lead to the generation of overly similar conformers.

The value for the energy window seemed to have a lesser effect than the other two parameters on the NOC generated for the PDB ligands. Raising the e-window from 10 to 15 and 20 kcal/mol without changing max-num-conf and rms-thresh values (passing from MedAcc\_1 to MedAcc\_2 and MedAcc\_3 settings, respectively) produced a 12 and 18% increase in the NOC, respectively. Nevertheless, the energy window appears to have a greater influence on the size of the conformational ensembles produced for the CSD compounds, as the same changes resulted in a 27 and a 44% increase in the NOC for these molecules.

All the considerations reported above are also valid for OMEGA as the same trends relative to the variations of the NOC generated for the two data sets are observed. OMEGA always produced a higher NOC than iCon for all tested setting patterns, especially for high accuracy settings (on average a 9.3, 20.1, and 24.0% higher NOC for low, medium and high accuracy settings, respectively), with a corresponding wider gap between the NOC generated for the CSD and PDB compounds (see also Supplementary Figure 1).

#### Influence of Rotors on the NOC

The analysis of the variation of the average NOC as a function of the number of rotatable bonds (RBs) clearly highlighted the unsurprisingly strong dependence of the conformer generation process on the conformational freedom of the compounds. For molecules with 3 or fewer rotors, ensembles of up to 50 conformers were generated for all the tested settings except for those where the RMSD cutoff was set to 0.2 Å, which produced a nearly threefold higher NOC (**Figure 4**). For ligands with 8 or more rotors, ensembles comprising a minimum of 120 conformers (up to several hundreds) were generated for medium and high accuracy settings, where a wider conformational variability was allowed in the sampling process (see Supplementary Figure 2). As shown in **Figure 4**, the two conformer generators presented a similar trend in the NOC generated for the analyzed compounds with respect to their number of RBs. However, the increase in the number of rotors produced a slightly steeper increase in the NOC generated by OMEGA. This became even more evident when setting patterns producing a high average NOC were considered. Setting the RMSD threshold to 0.2 Å nevertheless reduced this difference, as shown by the comparison of the NOC generated with HighAcc\_3 and HighAcc\_4 settings.

#### Performance Assessment

FIGURE 4 | Average NOC generated by iCon and OMEGA, for some representative setting patterns, as a function of the number of rotatable bonds of PDB and CSD compounds. Due to the different rotor distribution in PDB and CSD molecules, different scales have been considered for the two data sets.

The ability of the software iCon to reproduce the crystallographic conformation of the data set compounds was studied by using two different metrics: the root mean square deviation (RMSD) and the Tanimoto combo (TC) score, which were calculated for the generated ligand conformers using the corresponding experimental conformation as reference. These analyses were carried out on the conformers generated by using all the 20 different setting patterns reported in **Table 1**. Only the values for the best-fitting conformers were taken into consideration, i.e., the lowest RMSD and the highest TC score obtained for each ligand conformational ensemble. The conformers generated by OMEGA using equivalent settings were analyzed in the same way, in order to compare the performance of the two software packages. To get a global overview of iCon's performance as a function of the various settings used and to compare it to OMEGA's performance, we calculated the average values of RMSD and TC scores obtained for the best-fitting conformers of the PDB and CSD compounds. Additionally, the number of ligands giving an RMSD value higher than 2 Å (RMSD failures) and the number of ligands giving a TC score lower than 1 (TC failures) were also reported and used as a secondary metric for performance assessment and comparison. The results obtained by applying low, medium, and high accuracy settings for the conformer generation of PDB and CSD ligands are reported in **Tables 2**–**4**, respectively. As expected, the PDB data set proved to be more challenging than the CSD data set, since for the CSD compounds both conformer generators gave significantly better RMSD and TC score values than those produced for the PDB ligands. Accordingly, the number of RMSD and TC failures yielded by the two programs for the CSD data set was consistently lower than that reported for the PDB data set, which in fact contained a higher percentage of large compounds with a higher conformational freedom (see section Data Set Properties).
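The per-data-set summary described in this section (best-fitting values per ligand, their means, and the two failure counts) can be sketched as below. The function name and the input layout are assumptions made for illustration; the thresholds are the ones stated in the text.

```python
from statistics import mean

RMSD_FAILURE = 2.0  # Å; a best RMSD above this counts as an RMSD failure
TC_FAILURE = 1.0    # a best TC score below this counts as a TC failure


def summarize(ensembles):
    """Summarize one data set.

    ensembles: one list per ligand, holding an (rmsd, tc) pair for each
    generated conformer (hypothetical input layout).
    """
    # The best RMSD and best TC may come from different conformers.
    best_rmsd = [min(r for r, _ in ens) for ens in ensembles]
    best_tc = [max(t for _, t in ens) for ens in ensembles]
    return {
        "mean_rmsd": mean(best_rmsd),
        "mean_tc": mean(best_tc),
        "rmsd_failures": sum(r > RMSD_FAILURE for r in best_rmsd),
        "tc_failures": sum(t < TC_FAILURE for t in best_tc),
    }
```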

## Influence of the Sampling Parameters on iCon's Performance

The influence of the sampling parameters on iCon's performance was in agreement with their effect on the NOC generated. The max-num-conf parameter showed the strongest impact on the quality of the conformational sampling outcome when low accuracy settings were used. In this case, the maximum number of conformers allowed was quite small and represented the main limit to the generation of larger ensembles and to sampling accuracy. The increase of max-num-conf from 25 in LowAcc\_2 to 50 in LowAcc\_4 settings gave a difference in mean RMSD and TC score values of −11.0 and +3.57%, respectively, for PDB ligands, while a difference of −11.86 and +2.38%, respectively, was obtained for CSD compounds (**Table 2**). Moreover, this settings change produced a strong reduction of the number of failures for both data sets (from −25% up to −70%). This suggested that a max-num-conf value lower than 50 is too restrictive even for the generation of small ensembles, rejecting valuable conformers for an adequate sampling of the molecule's conformational space. By doubling again the max-num-conf value in LowAcc\_6 settings a lower (although still substantial) improvement in performance was obtained, in terms of both mean RMSD (−5.62% for PDB and −5.77% for CSD compounds) and TC score values (+2.07% for PDB and +1.26% for CSD data set). Finally, passing from MedAcc\_3 (max-num-conf = 200, **Table 3**) to HighAcc\_2 settings (max-num-conf = 500, **Table 4**) even smaller improvements were obtained for both PDB (−3.95% in mean RMSD and +1.99% in mean TC score) and CSD compounds (−4.35% in mean RMSD and +0.57% in mean TC score).

TABLE 2 | Mean RMSD and TC score values obtained for PDB and CSD data set compounds by using iCon and OMEGA with low accuracy settings.

<sup>a</sup>LA, LowAcc; <sup>b</sup>MC, max-num-conf/maxconfs; <sup>c</sup>EW, e-window/ewindow; <sup>d</sup>RT, rms-thresh/rms.

An e-window value of 10 kcal/mol (OMEGA's default ewindow value) seemed to be too restrictive for iCon, since an increase of 5 kcal/mol led to a substantial improvement in the quality of the conformational ensembles generated with MedAcc\_2 with respect to MedAcc\_1 settings (**Table 3**), especially for the CSD data set. With MedAcc\_2 settings iCon gave a mean RMSD of 0.47 Å and a mean TC score of 1.75 for the CSD data set (−7.84% and +1.74% compared to the results obtained with MedAcc\_1 settings), while for the PDB data set mean RMSD and TC score values of 0.78 Å (−4.88%) and 1.71 (+1.34%) were obtained. This was in agreement with the stronger influence of this parameter on the NOC generated for CSD compounds than for PDB ligands (see section Influence of the Sampling Parameters on the NOC). The better results obtained with the LowAcc\_4 settings in comparison with LowAcc\_3 (**Table 2**), particularly for the CSD compounds (−5.45% in mean RMSD and +1.78% in mean TC score), suggested that an e-window of 15 kcal/mol might also be suitable for the generation of small conformational ensembles (depending on the molecular properties of the compounds to be sampled), even if at the price of a slightly higher calculation time. A further increase of e-window up to 20 kcal/mol was considered more appropriate for larger ensembles, since when used in MedAcc\_3 settings it did not seem to be worth the higher costs in machine time (see section Computational Resources) in light of the small improvements obtained in terms of mean RMSD and TC scores with respect to the MedAcc\_2 settings (**Table 3**).

As far as the rms-thresh value is concerned, it had a quite different impact on the results obtained for the two data sets. For the generation of small conformational ensembles an rms-thresh value of 0.8 Å allowed a substantial reduction of both RMSD and TC failures obtained for PDB ligands with LowAcc\_5 settings with respect to LowAcc\_6 (−43 and −26%, respectively), although accompanied by a marginal reduction of the mean TC score (**Table 2**). By contrast, the results obtained for CSD compounds with LowAcc\_5 settings were considerably worse than those given by LowAcc\_6 (mean RMSD = 0.56 Å, +14.29%; mean TC score = 1.69, −2.87%). When higher max-num-conf and e-window values were used, an RMSD cutoff of 0.8 Å had a more deleterious effect on the size and quality of the conformational ensembles generated for CSD compounds, especially in terms of mean RMSD values, for which an increment of 22.73% was obtained on passing from HighAcc\_1 (mean RMSD = 0.44 Å, **Table 4**) to MedAcc\_6 settings (mean RMSD = 0.54 Å, **Table 3**). This change gave worse results also for the PDB data set (mean RMSD = 0.79 Å, +8.22%; mean TC score = 1.50, −2.60%), although without affecting the number of failures. Finally, reducing the rms-thresh value to 0.2 Å for the generation of very large conformational ensembles produced improvements in the results for the CSD data set, as observed on passing from HighAcc\_3 to HighAcc\_4 settings (**Table 4**): the latter gave mean RMSD and TC score values of 0.40 Å and 1.80, respectively (−9.09% and +1.12% compared to the HighAcc\_3 values). For the PDB data set, instead, this settings change seemed to result in the generation of ensembles comprising overly similar conformers (with the consequent rejection of valuable ones, for some compounds), since it produced a higher number of failures and a higher mean RMSD value (0.75 Å, +4.17%), with only a marginal increase of mean TC score (1.55, +0.65%).
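The relative differences quoted throughout this section follow from a simple percent-change calculation. As a worked check, the sketch below reproduces the +22.73% figure from the two CSD mean RMSD values given above; the function name is illustrative.

```python
def percent_change(reference, value):
    """Relative change of value with respect to reference, in percent."""
    return (value - reference) / reference * 100.0


# Mean RMSD for CSD compounds quoted in the text:
# HighAcc_1 = 0.44 Å (reference), MedAcc_6 = 0.54 Å.
change = percent_change(0.44, 0.54)
print(f"{change:+.2f}%")  # prints +22.73%
```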

TABLE 3 | Mean RMSD and TC score values obtained for PDB and CSD data set compounds by using iCon and OMEGA with medium accuracy settings.

<sup>a</sup>MA, MedAcc; <sup>b</sup>MC, max-num-conf/maxconfs; <sup>c</sup>EW, e-window/ewindow; <sup>d</sup>RT, rms-thresh/rms.

Taken together, these results show that in order to obtain good quality conformational ensembles, independently of the accuracy level required, it is necessary not only to reciprocally adjust the different sampling parameters, but also to calibrate them based on the molecular properties of the compounds to be sampled. Among the medium accuracy settings, MedAcc\_2 proved to be a good setting pattern for both data sets, considering the results in terms of mean RMSD and TC scores with respect to the average NOC generated, even though, given the machine time required for the sampling, it might not be particularly efficient for PDB-like molecules. For the same reason LowAcc\_4 seems to represent a good compromise between accuracy and computational resources for CSD compounds, but not for PDB ligands (see section Computational Resources). Nevertheless, in order to improve the quality of the conformational sampling of compounds with molecular properties similar to PDB ligands, a small increase of max-num-conf should be accompanied by a less strict RMSD cutoff (as in LowAcc\_5 settings), while for CSD-like compounds this change would only have a negative effect. For a more exhaustive sampling, the use of a lower rms-thresh value seemed more important than a considerable increase of the e-window and max-num-conf parameters to improve the quality of conformational ensembles of CSD-like compounds. In fact, the best results for the CSD data set were obtained with HighAcc\_4 and HighAcc\_7 settings, while for PDB-like molecules the best results were obtained by using higher e-window and max-num-conf values without reducing the RMSD cutoff (with HighAcc\_5 and HighAcc\_6 settings).

### OMEGA and iCon: Overall Comparison of the Results for PDB and CSD Data Sets

In general, although the two conformer generators showed similar performance, OMEGA seemed to be slightly more effective in reproducing the bioactive conformation of PDB ligands, independently of the setting patterns used, since the mean RMSD values obtained with iCon were, on average, 3.23% higher than those shown by OMEGA and the TC scores were 3.11% lower (**Figures 5A,B**). Only with the LowAcc\_4 settings did iCon show the same mean RMSD value as OMEGA. The main difference between the results obtained with the two programs for the PDB data set concerned the number of TC failures, which was nevertheless high for both programs when low accuracy settings were used, reaching a maximum of 16% for iCon and 14.5% for OMEGA (with LowAcc\_2 settings, **Table 2**). Although the average gap between the mean TC scores given by the two programs was quite small, the number of OMEGA's TC failures was about 40% lower than iCon's for medium and high accuracy settings (**Figure 5D**, **Tables 3**, **4**). For low accuracy settings, instead, the number of TC failures was comparable between the two conformer generators, with iCon giving fewer failures than OMEGA with 4 out of these 7 settings (**Table 2**). The opposite situation is observed for the number of RMSD failures produced by the programs. By using low accuracy settings iCon gave a lower number of RMSD failures only with LowAcc\_4 and LowAcc\_5 settings (**Table 2**). By contrast, with almost all the medium and high accuracy settings the number of iCon's RMSD failures was lower than or equal to the number of OMEGA's (**Figure 5D**, **Tables 3**, **4**).

TABLE 4 | Mean RMSD and TC score values obtained for PDB and CSD data set compounds by using iCon and OMEGA with high accuracy settings.

<sup>a</sup>HA, HighAcc; <sup>b</sup>MC, max-num-conf/maxconfs; <sup>c</sup>EW, e-window/ewindow; <sup>d</sup>RT, rms-thresh/rms.

For the CSD data set, a different trend in the performance of the two programs was observed depending on the group of setting patterns tested. As reported in **Table 2**, by using low accuracy settings iCon showed a slightly better performance than OMEGA in terms of both mean RMSD (−2.01%, on average) and mean TC score (+0.59%, on average) values. Moreover, for these settings iCon produced a number of RMSD and TC failures corresponding, on average, to 50% of the failures shown by OMEGA (see also **Figures 5C,D**). By using medium accuracy settings the two conformer generators gave very similar results: the mean TC scores were practically identical and the differences in mean RMSD values minimal (**Table 3**, **Figure 5A**). Notably, for all these settings the number of iCon's RMSD and TC failures was always lower than or equal to the corresponding OMEGA failures (**Figures 5C,D**). Finally, with high accuracy settings the difference in performance of the two conformer generators seemed almost the opposite of what was observed for low accuracy settings. In fact, iCon showed mean RMSD values and mean TC scores that were, on average, 2.06% higher and 0.24% lower than those obtained with OMEGA (**Table 4**), although it never produced a higher number of RMSD or TC failures (**Figures 5C,D**).

The overall comparison of iCon's and OMEGA's results showed that iCon seemed more efficient in reproducing crystallographic ligand conformations through small conformational ensembles, since by using low accuracy settings it slightly outperformed OMEGA in terms of RMSD and TC scores (and corresponding failures) for the CSD data set. Moreover, for the PDB data set, the difference in performance with respect to OMEGA, which gave slightly better results, was lower than that observed for the other groups of setting patterns. When the generation of larger ensembles was allowed, as in medium and high accuracy settings, OMEGA seemed to perform relatively better than iCon, although the differences were still modest. This may be due to two factors: OMEGA always produced a considerably higher NOC than iCon for these setting patterns (see section Influence of the Sampling Parameters on the NOC), and the built-in torsion library employed by OMEGA is biased toward PDB ligand conformations (Hawkins et al., 2010). A reasonable explanation for the generally higher NOC generated by OMEGA is the input molecule fragmentation strategy that it adopts. In contrast to iCon, OMEGA allows flexible terminal chain fragments (see section Conformer Generation With iCon) with multiple rotatable bonds, which are in turn looked up in a built-in cache of precalculated refined fragment conformations upon overall molecule conformer assembly. This effectively reduces the number of rotatable bonds when dealing with highly flexible molecules and, as a consequence, speeds up the conformer generation process in general and also decreases the chance of producing high energy conformations of the overall molecule that are rejected due to steric clashes. However, it is worth noting that with medium and high accuracy settings OMEGA gave, on average, a higher number of RMSD failures for both PDB and CSD data sets and a lower number of TC failures only for PDB compounds.
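The lookup of precalculated fragment conformations mentioned above can be pictured as a keyed cache. The sketch below is purely illustrative: the class, the string fragment keys, and the `build_conformations` callable are hypothetical and do not reflect the internal data structures of either program. The optional `prebuilt` argument stands for a cache shipped with the software, whereas an empty start corresponds to a cache that fills up only during the run.

```python
class FragmentCache:
    """Maps a fragment key to its precomputed conformations."""

    def __init__(self, build_conformations, prebuilt=None):
        self._build = build_conformations   # expensive step, called only on a miss
        self._store = dict(prebuilt or {})  # optional prebuilt start cache
        self.misses = 0

    def get(self, fragment_key):
        """Return cached conformations, computing and storing them on a miss."""
        if fragment_key not in self._store:
            self.misses += 1
            self._store[fragment_key] = self._build(fragment_key)
        return self._store[fragment_key]
```

With an empty start cache every fragment is computed once on first encounter, so the benefit grows with the number of compounds processed; a prebuilt cache avoids those initial misses.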

### OMEGA and iCon: Deep Comparative Analysis of Representative Setting Patterns

#### Spreading of RMSD and TC Score Values

FIGURE 5 | Overall comparison of (A) mean RMSD values, (B) mean TC score values, (C) number of RMSD failures, and (D) number of TC failures obtained for PDB and CSD data set compounds by using iCon and OMEGA with the different setting patterns. Black vertical lines are used to separate the three different groups of settings.

For a better insight into iCon's performance and a more accurate comparison with OMEGA, we analyzed the spreading of the RMSD (**Table 5**) and TC score values (**Table 6**) that were obtained for the generated conformers of both data sets by using three representative setting patterns, one for each of the three different setting groups: LowAcc\_4, MedAcc\_2, and HighAcc\_4 (see Supplementary Tables 1, 2 for the analysis of other representative setting patterns). The results for both metrics were divided into different classes representing different levels of precision in the reproduction of the experimental conformations. An RMSD smaller than 0.5 Å, as well as a TC score higher than 1.75, corresponds to an excellent matching between two different conformations, thus denoting a perfect reproduction of the compound's crystallographic pose (TC scores higher than 1.95 and RMSDs of about 0.1 Å mean conformational identity). RMSD values between 0.5 and 1.0 Å correspond to a very good matching, where all the functional groups of the best-fitting generated conformers are correctly superposed on the experimental ones; the same is valid for TC scores between 1.75 and 1.50. When the RMSD lies in the 1.0–1.5 Å range and/or when the TC score is in the 1.50–1.25 range there is still a good matching between the overlaid conformations. For RMSDs between 1.5 and 2.0 Å, as well as for TC scores between 1.25 and 1.0, the representation of the crystallographic conformation is less accurate, since some of the compounds' chemical features in the generated conformers might not be correctly oriented with respect to the same moieties in the reference ligand pose, but the overall superposition is still sufficiently good. RMSDs above 2.0 Å and/or TC scores below 1.0 mean that the matching between the generated and experimental conformers is not good enough to consider the crystallographic pose as properly reproduced.
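The class boundaries listed above can be written as two small lookup functions. The class labels and the treatment of values lying exactly on a boundary (inclusive on the better side, matching the "≤" and "≥" notation used for the percentages reported in this section) are assumptions made for this sketch.

```python
def rmsd_class(rmsd):
    """Fit class for the best-fitting conformer's heavy-atom RMSD (in Å)."""
    if rmsd <= 0.5:
        return "excellent"
    if rmsd <= 1.0:
        return "very good"
    if rmsd <= 1.5:
        return "good"
    if rmsd <= 2.0:
        return "sufficient"
    return "failure"


def tc_class(tc):
    """Fit class for the best-fitting conformer's Tanimoto Combo score."""
    if tc >= 1.75:
        return "excellent"
    if tc >= 1.50:
        return "very good"
    if tc >= 1.25:
        return "good"
    if tc >= 1.0:
        return "sufficient"
    return "failure"
```

Because the two metrics rest on different superpositions, the same conformer can fall into different classes under `rmsd_class` and `tc_class`.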

TABLE 5 | Percentage spreading of RMSD values calculated for conformers of PDB and CSD compounds generated by iCon and OMEGA using three representative setting patterns.

TABLE 6 | Percentage spreading of TC score values calculated for conformers of PDB and CSD compounds generated by iCon and OMEGA using three representative setting patterns.

The analysis of the results reported in **Tables 5**, **6** clearly demonstrates a high performance for both programs using the three setting patterns considered, since more than 50% of the generated conformational ensembles produced a very good matching with the ligand reference poses, giving TC scores ≥1.50 and RMSD values ≤1.0 Å. Specifically, as far as the PDB data set is concerned, a TC score above 1.50 was obtained for a minimum of 51.5% (iCon\_LowAcc\_4) up to 68% (OMEGA\_HighAcc\_4) of the ligands (**Table 6**), while the percentage of molecules showing RMSD values below 1.0 Å (**Table 5**) ranged from 69.5 up to 81.5%. Compared to iCon, OMEGA always gave better results for the PDB data set in terms of TC score, consistent with what was observed in the overall comparison of the two programs' performance. OMEGA produced 58.0–68.0% of conformational ensembles with TC scores ≥ 1.50, on average 8% more than iCon (51.5–61.0%), for which a shift toward lower TC score values was observed. Moreover, OMEGA yielded 25.0% of ensembles with excellent fit (TC score ≥ 1.75) using LowAcc\_4 settings and 45.0% with HighAcc\_4, whereas those obtained with iCon for the same settings were 22.0 and 34.5%, respectively. The RMSD values revealed a slightly different situation (**Table 5**). Not only was the difference in the percentage of ensembles with RMSD ≤ 1.0 Å generated by iCon and OMEGA marginal (69.5–79.5 and 73.0–81.5%, respectively), but iCon also produced a higher percentage of ensembles with RMSD ≤ 0.5 Å (31.5–39.5%) than OMEGA (25.0–39.0%). In particular, with LowAcc\_4 settings iCon produced 6.5% more excellent-fitting conformers than OMEGA. Similar results were obtained by the analysis of LowAcc\_3, MedAcc\_1, and HighAcc\_3 setting patterns (see Supplementary Tables 1, 2). These results underline the complementarity of the two metrics, which are based on two different methods of structure superposition and thus gave different results, which seemed to diverge more as the size and the conformational freedom of the considered compounds increased.

Concerning the CSD data set, a higher number of compounds with a very good matching between generated and experimental conformations was obtained by the two programs than for the PDB ligands (consistent with the higher mean TC scores and lower mean RMSD values), but the results in terms of the two metrics were more similar to each other. For instance, the percentage of CSD compounds for which a TC score ≥ 1.50 and an RMSD ≤ 1.0 Å was obtained ranged from 80.2% (OMEGA\_LowAcc\_4) to 91.7% (OMEGA\_HighAcc\_4) and from 91.5% (iCon\_LowAcc\_4) to 97.1% (OMEGA\_HighAcc\_4), respectively. With LowAcc\_4 and MedAcc\_2 settings iCon generated a higher number of perfect-fitting conformers than OMEGA in terms of both metrics, with 0.8% of compounds showing an RMSD ≤ 0.1 Å and 7.5% presenting a TC score ≥ 1.95 (compared to 0.2% and 6.7–6.9%, respectively, as obtained for OMEGA), as well as a lower percentage of failures. For HighAcc\_4 settings, for which the two programs gave equal values of mean RMSD and TC scores, iCon produced fewer ensembles comprising perfect-fitting conformers than OMEGA in terms of TC score (14.1 vs. 16.4%) but more in terms of RMSD (1.2 vs. 0.4%). For all these settings iCon showed a small enrichment in compounds with RMSD ≤ 0.5 Å (ranging from 61.8 to 77.8%) with respect to OMEGA (59.5–75.5%), but gave a marginally higher number of molecules with TC score ≥ 1.75 only with HighAcc\_4 settings (71.7 vs. 71.3% reported for OMEGA).

Overall, the obtained data indicate that both software packages actually perform in a similar way, with OMEGA giving only slightly better results when a medium-to-high quality sampling of larger and more flexible compounds is carried out; also, these differences are mainly relative to the TC score.

#### Influence of Rotors in RMSD and TC Score Values

To better assess how the conformational freedom of the data set compounds influenced the performance of the two programs in reproducing experimental conformations, we plotted the results obtained in terms of RMSD and TC scores for PDB and CSD molecules by using MedAcc\_2 settings (as a reference setting) as a function of the number of rotors of the compounds (**Figure 6**). As expected, quite different trends were observed for the two data sets. For CSD molecules almost no correlation was found between the number of RBs and the corresponding RMSD and TC score values given by iCon and OMEGA, which showed an almost identical distribution of the results with respect to both metrics (**Figures 6A,C**). For PDB ligands an appreciable correlation between conformational freedom and sampling performance was identified for both programs, particularly in terms of RMSD values, for which RBs/RMSDs correlation coefficients of 0.42 and 0.54 were calculated for OMEGA and iCon, respectively (**Figure 6C**). iCon's performance seems to be more influenced by the number of rotors than OMEGA's, in accordance with what was observed in the previous analyses. However, the difference in R<sup>2</sup> values was quite small and the distribution of the results in terms of both metrics was quite similar for the two programs, which again showed a comparable behavior.
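A correlation of this kind between rotor counts and best-fit RMSDs can be computed as sketched below, using the standard Pearson formula. The function name is illustrative, and the sketch makes no claim about which exact statistic or software the authors used; squaring the returned value gives the coefficient of determination.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Here `xs` would hold the number of rotatable bonds per ligand and `ys` the corresponding best RMSD (or TC score) values.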

#### Computational Resources

To compare iCon's efficiency in terms of computation time with OMEGA's and to understand how it is affected by the different settings, we report the average time required by the two programs for the conformational sampling of PDB and CSD compounds by using the various setting patterns. Both conformer generators proved to be fast, especially in the sampling of the CSD data set (**Figure 7B**), which required <0.4 s per compound (s/cpd) for all the low accuracy settings and <0.6 s/cpd for all the medium accuracy settings. OMEGA was generally more efficient than iCon, even though for this data set the differences in the average elapsed time were substantial only for HighAcc\_4 and HighAcc\_7 settings, where an RMSD cutoff of 0.2 Å was used for saving conformers. Using these two settings OMEGA was particularly fast (0.371 and 0.478 s/cpd, respectively) considering the large number of conformers generated. By contrast, when an RMSD cutoff of 0.8 Å was used in MedAcc\_6 settings iCon was found to be faster than OMEGA (0.450 and 0.522 s/cpd, respectively), while for LowAcc\_7 settings the difference between the two programs was marginal. With the CSD data set, LowAcc\_4 and MedAcc\_2 proved to be efficient setting patterns for iCon (compared to the other low and medium accuracy settings), considering the performance in terms of mean RMSD and TC score with respect to the calculation time and the average NOC generated. The same can be said for HighAcc\_4 among the high accuracy settings, which proved to be particularly efficient also for OMEGA. In fact, OMEGA needed less than half of the sampling time required by iCon using these parameters and was faster even than with the MedAcc\_4-6 settings.

The sampling of the PDB data set took, in general, longer for both programs, in accordance with the higher conformational freedom of these ligands with respect to the CSD molecules. For this data set the gap between iCon and OMEGA was more evident, the latter being 26% faster on average. Nevertheless, this difference can be attributed to iCon's caching strategy, which was designed so that conformational sampling becomes faster as the number of compounds in the database to be sampled grows. Specifically, the conformations generated by iCon for the compounds' terminal fragments are continuously stored in a cache and therefore need not be recalculated when the same fragments are encountered in further input compounds during the sampling process (see section Conformer Generation With iCon). However, the performance results clearly show that it might be worth changing the current caching strategy toward a prebuilt start fragment cache (as OMEGA has) that is updated with newly encountered fragments. This would allow for overall faster calculations also for small compound libraries, where the current caching strategy does not provide any significant speedup of the conformational sampling process. The influence of the various setting parameters on the efficiency of the two programs was much stronger with the PDB data set (**Figure 7A**). OMEGA was remarkably affected by the rms parameter, again showing faster sampling for rms = 0.2 Å (HighAcc\_4 and HighAcc\_7 settings) and a substantial increase in computation time when an rms value of 0.8 Å instead of 0.5 Å was used (e.g., MedAcc\_6 vs. HighAcc\_1). In contrast, this effect was not observed for iCon, which was appreciably faster than OMEGA with the MedAcc\_6 settings and seemed to be mostly affected by the e-window value, in particular for the generation of medium- and small-sized conformer ensembles.
With the LowAcc\_1, LowAcc\_3, and MedAcc\_1 settings, for which an e-window value of 10 kcal/mol was used, iCon showed very similar calculation times (from 0.456 to 0.516 s/cpd) although the average NOC ranged from 17.5 to 102.2 conformers per ensemble (see also **Figure 3**). Similarly, in the LowAcc\_2, LowAcc\_4-6, and MedAcc\_2 settings the e-window was set to 15 kcal/mol and the sampling time only ranged from 0.650 to 0.733 s/cpd even though the average NOC rose from 22.2 to 114.4 conformers per ensemble. When larger ensembles were generated, the e-window seemed to have a smaller impact on iCon's efficiency compared to the other parameters. These results also showed that the improvement in iCon's performance obtained by increasing the e-window by 5 kcal/mol came at the cost of an increase in computation time of nearly 40% for the PDB data set. This makes the LowAcc\_4 and MedAcc\_2 settings less convenient for the sampling of PDB-like molecules than the LowAcc\_3 and MedAcc\_1 settings, respectively. MedAcc\_1 proved to be a very efficient setting for PDB ligands, requiring a sampling time just 12.7% longer than LowAcc\_3 but giving much better results in terms of both RMSD and TC score values, which makes it suitable not just for medium-sized databases but also for large ones, despite the higher NOC generated. Among the high accuracy settings, HighAcc\_3 seemed to have a good efficiency, giving results nearly as good as HighAcc\_5-6 but in less time (31.6% less on average).

FIGURE 6 | Distributions of TC score values as a function of the number of rotatable bonds for CSD (A) and PDB compounds (B). Distributions of RMSD values as a function of the number of rotatable bonds for CSD (C) and PDB compounds (D).
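The two caching strategies discussed in this section, a cache filled during sampling and a prebuilt cache that is extended with newly encountered fragments, can be contrasted with a minimal sketch. This is not iCon's or OMEGA's implementation: the class name, the string fragment keys, and the placeholder builder are all hypothetical and only illustrate why a cold cache gives little benefit on small libraries.

```python
from typing import Callable, Dict, List, Optional, Tuple

Conformer = Tuple[float, ...]  # placeholder for real 3D coordinates


class FragmentConformerCache:
    """Stores fragment conformers so that each fragment is built only once."""

    def __init__(
        self,
        build: Callable[[str], List[Conformer]],
        prebuilt: Optional[Dict[str, List[Conformer]]] = None,
    ) -> None:
        self._build = build
        # A prebuilt cache starts warm; an empty one must be filled on the fly.
        self._store: Dict[str, List[Conformer]] = dict(prebuilt or {})
        self.hits = 0
        self.misses = 0

    def get(self, fragment_key: str) -> List[Conformer]:
        if fragment_key in self._store:
            self.hits += 1
        else:
            # Expensive step: only executed the first time a fragment is seen.
            self.misses += 1
            self._store[fragment_key] = self._build(fragment_key)
        return self._store[fragment_key]


def placeholder_builder(fragment_key: str) -> List[Conformer]:
    """Hypothetical stand-in for the costly fragment conformer generation."""
    return [(float(len(fragment_key)),)]


# Hypothetical fragment keys from a small library.
library = ["c1ccccc1", "CC(=O)O", "c1ccccc1", "C1CCNCC1", "CC(=O)O"]

cold = FragmentConformerCache(placeholder_builder)
for key in library:
    cold.get(key)
print(f"cold cache: {cold.hits} hits, {cold.misses} misses")

warm = FragmentConformerCache(
    placeholder_builder, prebuilt={"c1ccccc1": [(8.0,)], "CC(=O)O": [(7.0,)]}
)
for key in library:
    warm.get(key)
print(f"prebuilt cache: {warm.hits} hits, {warm.misses} misses")
```

With a cold cache every distinct fragment is built once, so the saving grows only with the number of repeated fragments; a prebuilt cache removes most of that initial cost even for short input lists.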

#### Notes on Using the Reproduction Ability of Crystallographic Conformations as a Performance Measure

Before concluding the performance assessment of iCon it is worth mentioning the induced folding problem, i.e., the structural adaptation of the target protein to the ligand in order to form an optimal complex. The ligand-induced folding of the target receptor, which can be observed particularly in flexible proteins such as tyrosine kinases after drug-target association, is a well-known issue in drug design (Fernández, 2016). Due to this effect, it is unlikely that the conformation of the target protein remains unchanged upon interaction with different ligands. Therefore, the target protein should not be considered as a rigid body in structure-based drug design studies: the flexibility of the target should be taken into account in combination with the conformational space of the ligand. However, the conformational flexibility of the protein is usually studied through computationally expensive molecular dynamics simulations, which allow a thorough evaluation of the conformational motion of both ligand and protein at the same time. In contrast, in docking studies the structure of the protein is normally treated as a rigid body, allowing at most a movement of residue side chains. Thus the conformational sampling of the docking algorithm only considers the ligand as flexible and largely neglects the adaptability of the receptor. Moreover, in pharmacophore modeling and pharmacophore-based virtual screening, as well as in ligand-based similarity approaches, the protein structure is not even considered except for the generation of receptor-based pharmacophore models, and in this latter case only a single conformation of the protein is usually used. Therefore, all common conformer generators, especially those used for virtual screening purposes such as iCon and OMEGA, sample the conformations of small molecules in a way that is totally independent of the structure of any possible target protein. Indeed, there is no need to consider the conformational variability of the protein because it is intrinsically taken into account: the output of the conformer generation is not a single conformer of a drug-like molecule but an ensemble of conformers that covers many structurally different protein conformations. For this reason, our performance assessment of iCon was only based on the reproduction of experimental structures of small molecules, a methodology that is widely used and reported in the literature (Hawkins et al., 2010; Miteva et al., 2010; O'Boyle et al., 2011; Ebejer et al., 2012; Hawkins and Nicholls, 2012; Friedrich et al., 2017).

# CONCLUSIONS

In this study we report the algorithm of the novel conformer generator iCon implemented in LigandScout 4.0 and the assessment of its performance in comparison to OMEGA by using two different data sets of high-quality crystal structures from the PDB and CSD databases. We evaluated iCon's efficacy in reproducing the experimentally determined conformation of the test compounds in terms of RMSD and TC score values for 20 different setting patterns and we compared the results with those obtained with OMEGA using equivalent settings. The three parameters changed in these setting patterns were shown to affect the size and the quality of the conformational ensembles generated by iCon for the two data sets in different ways. The results indicate that in order to obtain an adequate sampling of the conformational space, a max-num-conf lower than 50 should be avoided, even for the generation of small ensembles. Moreover, an e-window value not lower than 15 kcal/mol is recommended to improve iCon's performance, but this may come at the cost of an increase in computation time that might not be suitable for high-throughput conformational sampling. An rms-thresh value of 0.5 Å proved to be quite appropriate for all kinds of conformational ensembles, even though some small adjustments based on the molecular properties of the sampled compounds can lead to better results. The LowAcc\_3-4 and MedAcc\_1-2 settings proved to be good for high-throughput, average-quality sampling, while for a more thorough conformational analysis the HighAcc\_3-4 settings represent a better choice.

Compared to OMEGA, iCon showed its best performance in the reproduction of crystallographic poses of less flexible molecules through small conformational ensembles, slightly outperforming OMEGA in the results obtained for CSD compounds with low accuracy settings. With the CSD data set, iCon yielded high quality results also when larger ensembles were generated, showing a lower or equal number of failures with respect to OMEGA for most of the setting patterns. Also, the spreading of RMSD and TC score values proved to be extremely similar. OMEGA is more effective in the sampling of ligands with higher conformational freedom, since with the PDB data set it always produced better results than iCon, whose performance is more influenced by the number of rotors of the sampled compounds. However, the observed differences were still small, particularly when settings yielding small conformational ensembles were considered; also, such differences were primarily related to the TC scores. OMEGA always proved to be slightly faster than iCon, particularly in the conformer generation of PDB ligands but, owing to its algorithm, iCon's computation times decrease when larger databases are sampled. Moreover, iCon always generated smaller conformational ensembles than OMEGA for equivalent settings, which can speed up any analysis based on iCon's conformational sampling, such as pharmacophore modeling or virtual screening processes. Overall, the study reported herein shows that iCon represents a solid and well validated new conformer generator that comes free of additional charge with LigandScout 4.0 and is seamlessly integrated in all pharmacophore modeling and virtual screening related workflows of LigandScout. For a further improvement of iCon, the adoption of a different input molecule fragmentation and terminal fragment caching strategy is planned. This will not only speed up the conformer sampling process in general but will also lead to better results when it comes to the reproduction of bioactive conformations of larger and more flexible molecules.

# AUTHOR CONTRIBUTIONS

GP prepared the data sets, performed all computations, analyzed the results, and wrote the paper. TS implemented programs for data preparation and analysis, planned and supervised the study and contributed to writing the paper. TL contributed to writing the paper.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00229/full#supplementary-material

The names of the 200 PDB structures from which the ligands used in this study were extracted and their corresponding ligand codes. The codes of the 481 CSD compounds used in this study. A figure showing the different average number of conformations generated by iCon and OMEGA for the PDB and CSD data sets, using all the different setting patterns. A figure showing the average number of conformers generated by iCon as a function of the number of rotatable bonds in PDB and CSD compounds, using all the different setting patterns. Tables showing the percentage spreading of RMSD and TC score values calculated for conformers of PDB and CSD compounds generated by iCon and OMEGA using the LowAcc\_3, MedAcc\_1, and HighAcc\_3 settings.

#### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Poli, Seidel and Langer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# In Silico Workflow for the Discovery of Natural Products Activating the G Protein-Coupled Bile Acid Receptor 1

Benjamin Kirchweger <sup>1</sup> , Jadel M. Kratz <sup>1</sup> , Angela Ladurner <sup>1</sup> , Ulrike Grienke<sup>1</sup> , Thierry Langer <sup>2</sup> , Verena M. Dirsch<sup>1</sup> and Judith M. Rollinger <sup>1</sup> \*

<sup>1</sup> Department of Pharmacognosy, University of Vienna, Vienna, Austria, <sup>2</sup> Department of Pharmaceutical Chemistry, University of Vienna, Vienna, Austria

The G protein-coupled bile acid receptor (GPBAR1) has been recognized as a promising new target for the treatment of diverse diseases, including obesity, type 2 diabetes, fatty liver disease and atherosclerosis. The identification of novel and potent GPBAR1 agonists is highly relevant, as these diseases are on the rise and pharmacologically unmet therapeutic needs are pervasive. Therefore, the aim of this study was to develop a proficient workflow for the in silico prediction of GPBAR1 activating compounds, primarily from natural sources. A protocol was set up, starting with a comprehensive collection of structural information on known ligands. This information was used to generate ligand-based pharmacophore models in LigandScout 4.08 Advanced. After theoretical validation, the two most promising models, namely BAMS22 and TTM8, were employed as queries for the virtual screening of natural product and synthetic small molecule databases. Virtual hits were progressed to shape matching experiments and physicochemical clustering. Out of 33 diverse virtual hits subjected to experimental testing using a reporter gene-based assay, two natural products, farnesiferol B (27) and microlobidene (28), were confirmed as GPBAR1 activators reaching more than 50% receptor activation at 20 µM with EC<sub>50</sub> values of 13.53 µM and 13.88 µM, respectively. This activity is comparable to that of the endogenous ligand lithocholic acid (1). Seven further virtual hits showed activity reaching at least 15% receptor activation at either 5 or 20 µM, including new scaffolds of natural and synthetic origin.

#### Keywords: GPBAR1, TGR5, pharmacophore, virtual screening, natural product, triterpene

# INTRODUCTION

The G protein-coupled bile acid receptor 1 (GPBAR1), also commonly named M-BAR or Takeda G-protein-coupled receptor 5 (TGR5), is a rhodopsin-like G protein-coupled receptor (GPCR) expressed in various tissues. It is primarily present in the bile duct, digestive system, spleen, and placenta. It is a cell-surface receptor comprising an extracellular N-terminus, an intracellular C-terminus and seven trans-membrane helices connected by intra- and extracellular loops. Its endogenous ligands are bile acids and neurosteroids. The binding pocket is predicted to be located between the trans-membrane helices. Next to the transcription factor farnesoid X receptor (FXR), GPBAR1 was the second receptor discovered to be responsive to bile acids (Maruyama et al., 2002; Kawamata et al., 2003; Keitel et al., 2010; Gertzen et al., 2015).

#### Edited by:

Honglin Li, East China University of Science and Technology, China

#### Reviewed by:

Christian W. Gruber, Medizinische Universität Wien, Austria Shoude Zhang, Qinghai University, China

> \*Correspondence: Judith M. Rollinger judith.rollinger@univie.ac.at

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 09 February 2018 Accepted: 06 June 2018 Published: 02 July 2018

#### Citation:

Kirchweger B, Kratz JM, Ladurner A, Grienke U, Langer T, Dirsch VM and Rollinger JM (2018) In Silico Workflow for the Discovery of Natural Products Activating the G Protein-Coupled Bile Acid Receptor 1. Front. Chem. 6:242. doi: 10.3389/fchem.2018.00242

In the past decade, this receptor has attracted attention as a potential drug target for a variety of pathologic conditions (Hodge and Nunez, 2016), predominantly because GPBAR1 is a key receptor in the adjustment of energy expenditure and glucose metabolism with possible implications for the treatment of obesity and type 2 diabetes. Its activation in enteroendocrine L-cells leads to the release of the incretins peptide tyrosine tyrosine (PYY) and glucagon-like peptide 1 (GLP1), which promote insulin secretion in the pancreas and are important in the suppression of appetite (Woods and D'Alessio, 2008; Bala et al., 2014). GPBAR1 activation in pancreatic cells leads to enhanced insulin secretion and a recovery of β-cell mass and function, switching from glucagon to GLP1 (Kumar et al., 2012, 2016). In striated myocytes and brown adipocytes, GPBAR1 activation leads to thyroid hormone activation. In white adipocytes it mediates remodeling into beige cells and improves mitochondrial dynamics and cellular respiration rate (Watanabe et al., 2006; Velazquez-Villegas et al., 2018). Moreover, endothelium- and liver-protecting as well as immunosuppressing effects offer perspectives for new therapies for diseases like atherosclerosis and inflammatory liver diseases (Keitel et al., 2007, 2008; Keitel and Haussinger, 2011; Pols et al., 2011; Asgharpour et al., 2015). Unusually for a GPCR, GPBAR1 seems to transfer signaling only via G proteins and not via β-arrestins (Jensen et al., 2013).

In animal trials, GPBAR1 agonists showed promising results; however, difficulties have also been encountered, since GPBAR1 agonists may induce itching and gallbladder extension (Vassileva et al., 2006; Alemi et al., 2013). Interestingly, gallbladder extension upon GPBAR1 activation is mainly caused by smooth muscle relaxation via induction of the cAMP–PKA pathway, independent of the agonist scaffold (Lavoie et al., 2010; Li et al., 2011). The plethora of GPBAR1-mediated biological functions appears to be an obvious opportunity, but a major drawback at the same time (Vassileva et al., 2006; Alemi et al., 2013; Hodge and Nunez, 2016). Novel agonistic scaffolds may offer a different and possibly more favorable side effect profile in terms of receptor and functional selectivity as well as pharmacokinetic properties. In this sense there is a demand for new GPBAR1 ligands, as they may help to address pharmacologically unmet therapeutic needs in metabolic diseases.

Besides bioassay-guided fractionation of plant extracts (Sato et al., 2007), bioisosteric replacement (Park et al., 2014), and exploitation/lead optimization of bile acid scaffolds (Pellicciari et al., 2009), previous efforts in the discovery of GPBAR1 modulators have focused on high throughput screening (HTS) (Evans et al., 2009; Herbert et al., 2010; Londregan et al., 2013; Martin et al., 2013), leading to a broad range of agonists, some of which are depicted in **Figure 1**.

Chenodeoxycholic acid (CDCA, **3**) is a primary bile acid, which activates both GPBAR1 and the nuclear receptor FXR. By bacterial dehydroxylation, CDCA is transformed into the more potent lithocholic acid (LCA, **1**). The potency of bile acids on GPBAR1 can be further increased by conjugation with glycine or taurine, with taurine-conjugated lithocholic acid (TLC) being the most potent endogenous ligand (Sato et al., 2008). Lead optimization of the bile acid scaffold led to INT-747 (**4**), an approved drug for the treatment of primary biliary cholangitis and dual agonist of GPBAR1 and FXR, as well as INT-777 (**2**), a more selective GPBAR1 agonist (Pellicciari et al., 2009; Fiorucci et al., 2014; Floreani and Mangini, 2018). Besides bile acids, several secondary plant metabolites activate this receptor (**5**-**8**). The antidiabetic effect of, e.g., Olea europaea L. leaves may be linked to GPBAR1 activation by its major constituent oleanolic acid (**6**) (Sato et al., 2007). Moreover, HTS and extensive SAR efforts gave access to a large range of synthetic compounds activating this receptor in the nanomolar and low micromolar range (**9**-**12**).

The absence of a crystal structure of GPBAR1 has forced researchers to rely either on homology models or on ligand-based approaches for in silico studies, as is the case for most GPCRs (Peeters et al., 2011; Vaidehi et al., 2014). Several GPBAR1 homology models have been described up to now. They all represent fundamentally different bile acid binding poses, but none of them is able to cover all results from mutagenesis studies (Macchiarulo et al., 2013; D'Amore et al., 2014; Gertzen et al., 2015; Yu et al., 2015). This prompted us to develop a ligand-based approach using pharmacophore modeling for the identification of new GPBAR1 agonists.

Pharmacophore modeling with subsequent virtual screening (VS) is a well-established method in the early drug discovery process with some important benefits: (1) pharmacophore screening can retrieve ligands with structurally diverse scaffolds and allows for so-called "scaffold-hopping"; (2) it can automatically and rapidly filter large compound libraries; (3) ligand-based pharmacophore VS has been able to retrieve satisfactory results even without structural information on the target (Evers et al., 2005; Ha et al., 2015; Akram et al., 2017). Here, we report on the construction of two ligand-based 3D pharmacophore models, their in silico and in vitro validation, and the directed discovery of sesquiterpene coumarins as a new class of potent GPBAR1 agonists.

# MATERIALS AND METHODS

#### Software

The generation of pharmacophore models, their subsequent refinement and VS were performed with LigandScout 4.08 Advanced, available from Inte:Ligand GmbH (Wolber and Langer, 2005). The conformational libraries for both pharmacophore modeling and the VS process were created with i:Con, LigandScout's implemented conformer generator (Friedrich et al., 2017). Shape comparison was performed with OpenEye's ROCS 3.2.1.4 (Hawkins et al., 2007; OpenEye, 2016). 2D structures were drawn with ChemDraw Professional 15.0.

#### Data Sources

For model generation in LigandScout, structural data of GPBAR1 ligands with bioactivity annotations were collected. The data for GPBAR1 available in the ChEMBL database were extracted on March 15th, 2016, and comprised 24 different publications with 623 reported EC<sub>50</sub> values (Bento et al., 2014). The reliability of the content was checked against the original literature. This molecule set was extended by extracting data from another 18 publications, 24 patents and previous in-house projects, resulting in a total of 1025 activity annotations.

#### Decoy Set

For the experimental validation of pharmacophores, a set of inactive molecules and/or a set of decoys is necessary in addition to a set of active molecules (Schuster et al., 2006). In contrast to true inactives, which are molecules reported in the literature not to be active at the target, decoys are hypothetical structures that are unlikely to show activity at the target but have not yet been tested experimentally. Due to the shortage of published negative data, and therefore the availability of only a small set of reliably tested inactive compounds, a set of decoys was generated using the DUD-E decoys database (http://dude.docking.org/): the 338 molecules from the "High Actives" data set were submitted to the DUD-E online decoy generator (Mysinger et al., 2012) to obtain decoys with similar 1D physicochemical properties but dissimilar 2D topology in comparison to the active compounds. Using this strategy, a "Decoy" set comprising 18,043 substances was created.

#### Conformational Sampling, Ligand Set Clustering

The "High Actives," "Decoys," and "True Inactives" sets were converted into multi-conformational databases with i:Con using the default "BEST" settings [timeout (s): 600, RMS threshold: 0.8, energy window: 20.0, max. pool size: 4,000, max. fragment build time: 30, max. number of conformers: 200]. The 338 compounds of the "High Actives" set were clustered in LigandScout 4.08 using the implemented pharmacophore clustering tool. The tool clusters molecules with similar pharmacophore characteristics: it generates pharmacophores for each molecule in the data set for a desired number of conformations, and the similarity of these pharmacophores is measured as the cosine similarity (a value between 0 and 1) of their radial distribution function (RDF) score vectors. The options used were a distance of 0.9, the cluster distance calculation method "maximum," and three conformations for each molecule.
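The similarity measure and the "maximum" cluster distance named above can be sketched as follows. This is only an illustration under stated assumptions: the conversion of similarity into a distance as 1 minus the cosine similarity, the reading of "maximum" as the largest pairwise distance between members of two clusters, and the toy RDF vectors are all hypothetical and do not reproduce LigandScout's internal implementation.

```python
from itertools import product
from math import sqrt
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0 to 1 for non-negative input)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def maximum_linkage_distance(
    cluster_a: Sequence[Vector], cluster_b: Sequence[Vector]
) -> float:
    """Largest pairwise distance (assumed 1 - cosine similarity) between clusters."""
    return max(
        1.0 - cosine_similarity(u, v) for u, v in product(cluster_a, cluster_b)
    )


# Hypothetical RDF score vectors for three pharmacophores.
rdf_1 = [0.0, 1.0, 2.0, 1.0]
rdf_2 = [0.0, 1.1, 1.9, 1.2]
rdf_3 = [2.0, 0.5, 0.0, 0.1]

print(f"similarity 1 vs 2: {cosine_similarity(rdf_1, rdf_2):.3f}")
print(f"similarity 1 vs 3: {cosine_similarity(rdf_1, rdf_3):.3f}")
print(f"distance {{1, 2}} vs {{3}}: {maximum_linkage_distance([rdf_1, rdf_2], [rdf_3]):.3f}")
```

Under this reading, two clusters would be merged only if even their most dissimilar pair of members stays below the chosen distance threshold.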

#### Pharmacophore Generation

With LigandScout, pharmacophores can be generated as shared or merged feature pharmacophores. A shared feature pharmacophore contains only the common features observed during 3D alignment of the validation set molecules. A merged feature pharmacophore merges several shared pharmacophores; features that are not shared by the whole validation set are marked as optional. The initial pharmacophore model for cluster 7 was built as a shared feature pharmacophore with six molecules as templates using "pharmacophore fit and atom overlap" as the scoring function. The initial model for cluster 12 was built as a merged feature pharmacophore with four molecules as templates using "pharmacophore fit and atom overlap" as the scoring function. The models were refined and theoretically validated until the desired theoretical performance was achieved.

#### Theoretical Validation

For theoretical validation, the scoring function was set to "pharmacophore-fit" and the screening mode to "match all query features," with the maximum number of omitted features set to zero. To assess the performance of the individual models, the resulting hit lists were used to calculate common enrichment metrics, as comprehensively outlined in a review by Seidel and coworkers (Seidel et al., 2010).
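The text does not list which enrichment metrics were calculated, so the following minimal sketch shows three commonly used ones (sensitivity, specificity, and enrichment factor) as an illustration. The set sizes of 338 actives and 18,043 decoys are taken from the sections above; the class name and the hit counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningResult:
    """Outcome of screening a validation set of actives and decoys."""

    true_positives: int   # actives retrieved by the model
    false_positives: int  # decoys retrieved by the model
    n_actives: int
    n_decoys: int

    def sensitivity(self) -> float:
        """Fraction of all actives that the model retrieves."""
        return self.true_positives / self.n_actives

    def specificity(self) -> float:
        """Fraction of all decoys that the model correctly rejects."""
        return (self.n_decoys - self.false_positives) / self.n_decoys

    def enrichment_factor(self) -> float:
        """Ratio of the active rate in the hit list to that in the database."""
        n_hits = self.true_positives + self.false_positives
        if n_hits == 0:
            raise ValueError("enrichment factor is undefined for an empty hit list")
        hit_rate = self.true_positives / n_hits
        base_rate = self.n_actives / (self.n_actives + self.n_decoys)
        return hit_rate / base_rate


# Set sizes from the text; the hit counts below are hypothetical.
result = ScreeningResult(
    true_positives=120, false_positives=60, n_actives=338, n_decoys=18043
)
print(f"sensitivity = {result.sensitivity():.3f}")
print(f"specificity = {result.specificity():.3f}")
print(f"enrichment factor = {result.enrichment_factor():.1f}")
```

An enrichment factor above 1 indicates that the hit list is richer in actives than a random selection from the same database.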

#### Virtual Screening

Several freely available molecular structure databases with a strong focus on natural products (NP) were used for VS. The conformational libraries were generated with i:Con (Friedrich et al., 2017). Depending on the size of the database, the recommended "BEST" or "FAST" settings were used (**Table 1**). For VS, the same settings were used as in the theoretical validation, except that the retrieval method "get best matching conformation" was used.

#### Hit List Prioritization

A principal component analysis was performed using the chemGPS online tool (Larsson et al., 2007), and a hierarchical cluster analysis with SIMCA facilitated the assignment of the compounds to 9 groups, each occupying a different chemical space. For clustering, the default Ward's minimum variance agglomerative clustering algorithm was applied to the first three quantitative chemGPS principal components, PC1, PC2, and PC3. A molecule's size, polarizability and shape are characterized by PC1, while PC2 describes its aromatic and conjugation-related properties and PC3 corresponds to its lipophilicity, polarity, and hydrogen bond (HB) capacity. Shape-focused VS was performed with OpenEye ROCS to retrieve a TC score, which combines a shape-matching Tanimoto score with a chemistry-alignment Tanimoto score. This scoring function assesses the goodness of the alignment between the query and the candidate molecules. ComboScore puts exactly equal weights on both of its components, a shape-based scoring function and a function considering pharmacophore-like chemical pattern matching. Theoretically, the TC score can lie between 0 and 2 (Hawkins et al., 2007; OpenEye, 2016). The best fitting conformation of **13** derived from the alignment with model BAMS22 was used as the query with the ROCS default options. The PAINS filters of the FAF-Drugs 4 online tool were applied to identify potential promiscuous hitters.
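The stated range of 0 to 2 follows if the TC score is the unweighted sum of two Tanimoto coefficients that each lie between 0 and 1. The sketch below illustrates this under that assumption; it is not ROCS code, the function names are hypothetical, and the overlap values are invented for the example.

```python
def overlap_tanimoto(self_a: float, self_b: float, overlap_ab: float) -> float:
    """Tanimoto coefficient from self-overlaps of A and B and their mutual overlap."""
    denominator = self_a + self_b - overlap_ab
    if denominator <= 0.0:
        raise ValueError("overlap values do not define a valid Tanimoto coefficient")
    return overlap_ab / denominator


def tanimoto_combo(shape_tanimoto: float, chemistry_tanimoto: float) -> float:
    """Equal-weight sum of the two components, giving a score between 0 and 2."""
    for value in (shape_tanimoto, chemistry_tanimoto):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each Tanimoto component must lie between 0 and 1")
    return shape_tanimoto + chemistry_tanimoto


# Hypothetical overlap values for a query and one candidate molecule.
shape = overlap_tanimoto(self_a=420.0, self_b=390.0, overlap_ab=310.0)
chemistry = overlap_tanimoto(self_a=12.0, self_b=10.0, overlap_ab=6.0)
print(f"shape = {shape:.2f}, chemistry = {chemistry:.2f}, "
      f"combined = {tanimoto_combo(shape, chemistry):.2f}")
```

A candidate identical to the query in both shape and chemical features would reach the maximum of 2, while a candidate with a good shape match but few matching features stays closer to 1.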

TABLE 1 | Screened databases with content type (origin of molecules) and size (number of molecules), their source and the used standard settings for conformer generation with i:Con.

#### In Vitro GPBAR1 Activity and Statistical Analysis

In vitro evaluation of the selected hit list was performed with a reporter gene-based luciferase assay in HEK 293T cells (obtained from ATCC, USA), which was described previously (Ladurner et al., 2017). Cells were grown and maintained in Dulbecco's modified Eagle medium (DMEM) without phenol red with 10% heat-inactivated fetal bovine serum (FBS), 4.5 g/L glucose, 2 mM glutamine, 100 U/mL benzylpenicillin, and 100 µg/mL streptomycin. During the experiments, charcoal-stripped medium with 5% FBS was used. 6 × 10<sup>6</sup> cells were grown in 15 cm dishes for 19 h and then transiently transfected using the calcium phosphate method with 5 µg of a GPBAR1 expression plasmid and 5 µg of a CRE-Luc plasmid. For later normalization, 3 µg of an EGFP expression plasmid was co-transfected. Control experiments were performed with cells transfected only with 3 µg EGFP and 5 µg CRE-Luc plasmids. After 6 h, transfected cells were reseeded to 96-well plates (5 × 10<sup>4</sup> cells/well) and incubated with 5 µM and 20 µM compound dilutions, respectively, for 18 h. DMSO (0.1%) served as vehicle control and 10 µM LCA as positive control. After incubation the medium was removed, and the plates were immediately frozen at −80 °C. Plates were kept frozen for at least 1 h to facilitate lysis and measurements were performed within the following 10 days. For the measurement, cells were thawed, lysed and transferred to black 96-well plates. After addition of ATP and luciferin, emitted luminescence and fluorescence were measured with a Tecan Infinite 200 PRO plate reader (Tecan, Austria). GPBAR1 activity was expressed as fold activation compared with the solvent control (0.1% DMSO) or as % activation compared to the positive control 10 µM LCA (arbitrarily set to 100% activation). The measured luciferase units were normalized to the transfected cell mass, expressed as EGFP-derived relative fluorescence units (RFU), to give normalized relative luciferase units (nRLU); data are from at least three independent experiments (mean values ± standard error of the mean) performed in quadruplicate. Quantified EGFP-derived fluorescence was used as an indicator of transfected cell mass and thus used to assess the compounds' cytotoxicity. Compounds that resulted in significantly lower RFU values than the control were considered cytotoxic. For statistical analysis GraphPad Prism 4.03 was used. Statistical significance was assessed by one-way ANOVA and Bonferroni post-test (∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05; ns, not significant). Non-linear regression was used to calculate EC<sub>50</sub> values with the sigmoidal dose response (variable slope) settings.
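The data processing described above (normalization to EGFP fluorescence, fold and percent activation, and a variable-slope sigmoidal model) can be illustrated with a minimal sketch. The function names and raw readings are hypothetical; whether percent activation was computed with or without subtracting the vehicle signal is not stated, so the simple ratio used here is an assumption. The dose-response function follows the usual four-parameter form on a log10 concentration axis, and no curve fitting is performed.

```python
from math import log10


def normalized_signal(luminescence: float, egfp_fluorescence: float) -> float:
    """Luciferase signal normalized to EGFP fluorescence (transfected cell mass)."""
    if egfp_fluorescence <= 0.0:
        raise ValueError("EGFP fluorescence must be positive")
    return luminescence / egfp_fluorescence


def fold_activation(sample: float, vehicle: float) -> float:
    """Normalized sample signal relative to the vehicle (0.1% DMSO) control."""
    return sample / vehicle


def percent_activation(sample: float, positive_control: float) -> float:
    """Sample signal as a percentage of the positive control (assumed plain ratio)."""
    return 100.0 * sample / positive_control


def sigmoidal_dose_response(
    log_conc: float, bottom: float, top: float, log_ec50: float, hill_slope: float
) -> float:
    """Variable-slope sigmoidal response at a given log10 concentration."""
    return bottom + (top - bottom) / (1.0 + 10.0 ** ((log_ec50 - log_conc) * hill_slope))


# Hypothetical raw plate-reader values (luminescence, EGFP fluorescence).
vehicle = normalized_signal(1500.0, 3000.0)
lca = normalized_signal(9000.0, 3000.0)
compound = normalized_signal(5400.0, 2700.0)

print(f"fold activation: {fold_activation(compound, vehicle):.1f}")
print(f"% of positive control: {percent_activation(compound, lca):.0f}")

# Response predicted at the EC50 itself lies halfway between bottom and top.
log_ec50 = log10(13.53e-6)  # EC50 reported for farnesiferol B, in mol/L
print(sigmoidal_dose_response(log_ec50, bottom=0.0, top=100.0,
                              log_ec50=log_ec50, hill_slope=1.0))
```

Dividing by the EGFP signal before comparing wells corrects for differences in transfected cell mass, which is also why a markedly reduced EGFP signal can serve as an indicator of cytotoxicity.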

#### Compounds and Chemicals

Hederagenin (CAS#465-99-6) and bayogenin (CAS#80368) were ordered from Phytolab (Germany). 2,3-O-isopropylidenyleuscaphic acid (CAS#220880-90-0) was ordered from Proactive Molecular Research (USA). Phytolaccoside B (CAS#60820-94-2) and euscaphic acid (CAS#53155-25-2) were purchased from Cambridge Chemicals (USA). Phytolaccagenic acid (CAS#54928-05-1) and 16-dehydropregnenolone (CAS#1162-53-4) were obtained from Carbosynth (UK). Spironolactone (CAS#52-01-7) and methylhyoxycholate (CAS#2868-48-6) were purchased from TCI Deutschland GmbH (Germany). The screening compounds (CAS#1019061-83-6, CAS#303139-94-8, CAS#330636-58-3, CAS#353253-76-6, CAS#353779-79-0, CAS#432530-00-2, CAS#444931-63-9, CAS#496937-29-2, CAS#791840-52-3, CAS#902244-06-8, CAS#915930-57-3, CAS#932954-51-3, CAS#352644-32-7, CAS#500218-51-9, CAS#314757-83-0, CAS#380633-89-6, CAS#26179-09-9, CAS#664993-86-6, CAS#525577-20-2) were obtained from SPECS (Netherlands). Microlobidene (CAS#89783-66-4) and farnesiferol B (CAS#54990-68-0) were available from a previous project (Rollinger et al., 2008). Nordihydroguaiaretic acid (CAS#500-38-9) was purchased from Fluka (Switzerland). The positive controls LCA (CAS#434-13-9) and CDCA (CAS#474-25-9) were obtained from Sigma Aldrich (Austria). Alphitolic acid (CAS#19533-92-7) was obtained by hydrolysis from a previously isolated saponin (Mair et al., 2018). The purity was checked using UPLC-PDA-MS and determined as ≥ 98% for compounds **20**, **21**, **23**, **24**, **28**, **30**-**35**, **37**, **40**, **41**, **44**-**46**, **48**, **49**, and **52**. For all other compounds it was between 90 and 98%. MS and NMR data of all in-house compounds (**27**, **28**, **52**) are provided in the literature (Rollinger et al., 2008; Mair et al., 2018) and the Supplementary Information (Supplementary Figures 4–10).

## Cell Culture Reagents and Plasmids

DMEM, L-glutamine, benzylpenicillin and streptomycin were purchased from Lonza (Switzerland); FBS and trypsin were obtained from Gibco via Invitrogen (Austria). The GPBAR1 transcript variant 3 (NM_170699) plasmid was obtained from Origene via Biomedica (Vienna, Austria). The CRE-Luc plasmid (pGL4.29[luc2P/CRE/Hygro]), the luciferase assay system and the lysis buffer were ordered from Promega (Germany), and the EGFP (pEGFP-N1) plasmid was purchased from Clontech (USA).

# RESULTS AND DISCUSSION

#### Workflow

The workflow of this study is divided into 3 levels, as depicted in **Figure 2**: (1) Literature search for the compilation of a database of known GPBAR1 actives and inactives to be split and used as pharmacophore training set and a validation set for theoretical validation; The generation of a pharmacophore model collection with LigandScout and the subsequent theoretical validation. (2) VS of multi-conformational databases consisting of structures of natural and synthetic compounds (**Table 1**) using the two most promising models as queries; Evaluation of the hit list applying shape-based screening and physicochemical space clustering of virtual hits. (3) Selection of 33 virtual hits and their experimental validation in a HEK 293T cell based luciferase assay.

#### Pharmacophore Modeling

A pharmacophore model is the abstract three-dimensional representation of the molecular interactions between a target and a ligand structure, reduced to a collection of steric and chemical features that are necessary to trigger a desired pharmacological effect. The quality of a ligand-based pharmacophore model strongly relies on the selection of training set molecules. Therefore, it is mandatory to strictly select only highly potent activators for the training and validation sets (Seidel et al., 2010). In the case of GPBAR1, available bioactivity data were not only obtained by different working groups, but also with different cellular assays. This raised concerns about direct data comparison among the assays used. Therefore, only ligands that had been tested clinically or had a reported activity that was proven to be both potent and directly comparable to the respective positive controls were used in this study. Consequently, 428 of 815 compounds had to be discarded. Of the remaining compounds, 338 formed the "High Actives" dataset and 49 were categorized as "True Inactives." The data handling used as the basis for the generation of the ligand-based pharmacophore models is illustrated in **Figure 3**.

A pharmacophore model built of several query compounds binding at different ligand binding sites to the target protein would clearly distort the quality of such a model and devastate its predictive power. Therefore, the "High Actives" dataset compounds were divided into 12 clusters using the pharmacophore clustering tool implemented in LigandScout. Cluster 1 was discarded as it only consisted of one compound. The remaining 11 clusters contained between 11 and 80 molecules and were separated each into test and validation set.

Out of the retrieved 11 cluster sets, 12 pharmacophore models were generated. Altogether, in parallel screening, these models were able to predict 275 of 338 compounds (81%) in the "High Actives" database as true positives. However, a high number of false positives was retrieved when the models were screened against the "Decoys" and "True Inactives" databases. This resulted in a poor enrichment factor for the entire model collection (EF = 11.22). Two models, which were based on the pharmacophores of natural products, showed promising metrics and were selected for the prospective VS and experimental validation. The first model, BAMS22, was based on a training set of 6 molecules (depicted in **Figure 4**) resulting from cluster 7. They had been selected for covering nearly the whole physicochemical space and for incorporating most of the structure-activity information contained in cluster 7. BAMS22 was used for VS of the "Decoys"/"True Inactives" (n = 18,112) databases and the cluster 7 validation set (n = 20), which resulted in a specificity of approximately 1 (0.998785) and a sensitivity of 1, achieving an EF of 823.3. Along with the molecules from cluster 7, the potent ligand TLC from cluster 12 was retrieved as a highly ranked virtual hit.
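The validation metrics quoted above can be illustrated with a minimal sketch. The text does not state the exact formulas used, so the definitions below are common textbook forms and should be read as an assumption; the counts are hypothetical and are not taken from the study.

```python
def screening_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, enrichment factor) for one model.

    tp/fn: actives retrieved / missed; fp/tn: inactives or decoys retrieved / rejected.
    """
    n_actives = tp + fn
    n_total = tp + fn + fp + tn
    n_hits = tp + fp
    sensitivity = tp / n_actives
    specificity = tn / (tn + fp)
    # EF (assumed definition): share of actives in the hit list divided by
    # the share of actives in the whole screened database.
    ef = (tp / n_hits) / (n_actives / n_total) if n_hits else 0.0
    return sensitivity, specificity, ef


# Hypothetical counts, chosen only for illustration.
sens, spec, ef = screening_metrics(tp=13, fn=3, fp=3, tn=997)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} EF={ef:.1f}")
```

Because actives are rare in such databases, even a handful of false positives lowers the EF noticeably while leaving the specificity close to 1, which is why both values are usually reported together.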

The BAMS22 model consists of two mandatory hydrophobic features, two mandatory HB acceptor features, an optional HB donor, an optional hydrophobic, and an optional negatively ionizable feature, as well as a rigid exclusion volume coat. In agreement with the TLC binding predictions and experiments of Gertzen and coworkers (Gertzen et al., 2015), our pharmacophore model, although not based on the homology model's input information, depicts a very similar interaction pattern (**Figure 5**). Gertzen stated that the 3-hydroxyl moiety of TLC forms an HB to E169 and Y240, the sulfonic acid group forms a salt bridge to R79, and hydrophobic interactions appear with L244. All of these statements were supported by alanine-scanning experiments and are in accordance with our model. The model also suggests a second important HB interaction with the C-24 carboxamide group of TLC (**Figure 5C**), as well as with the C-24 hydroxyl group of **14**, or in the case of **16** with the C-20 keto group. The hydrophobic interactions were placed where hydrophobic alignment was possible. Although the model showed a very high specificity, it only consisted of four mandatory pharmacophore features, two HB accepting and two hydrophobic features, which is a widespread pattern of pharmacophore features. Therefore, not only steroid-like structures can putatively be retrieved in the prospective VS.

It is questionable whether triterpenes and bile acids share the same binding mode, as it was not possible to generate a restrictive pharmacophore model incorporating both scaffolds, although the binding modes appear to be very similar. It is likely that they have a different binding mode within the same binding position. Therefore, it was preferred to explain the steroidal structures with two highly specific local models and not with a single global model. It has previously been acknowledged for the identification of cyclooxygenase inhibitors that a set of highly specific local models leads to lower false positive hit rates, compared to one pleiotropic global model (Schuster et al., 2010).

The second model, TTM8, is based on a training set of 4 molecules (**Figure 4**) from cluster 12. The model consists of four mandatory hydrophobic features, two mandatory HB acceptor features, and a mandatory negatively ionizable feature (**Figure 5**). TTM8 was theoretically validated against the "Decoys"/"True Inactives" datasets (n = 18,112) and the cluster 12 validation set (n = 16), and showed a specificity of 1 and a sensitivity of 0.81, achieving an EF of 919.5.

Genet and co-workers (Genet et al., 2010) were the first to evaluate the SAR of triterpenes on the GPBAR1 receptor. They concluded that essential features for agonistic activity are a 3α-hydroxyl group, a carboxyl group in position 17α, and a rigid pentacyclic scaffold, in the best case a lupane backbone with high lipophilicity. Further publications regarding triterpenes are scarce, although some have shown higher selectivity over FXR and higher potency on GPBAR1 than bile acids. Therefore, a cherry-picking pharmacophore model, highly sensitive to these pentacyclic triterpene acids, was created. It can be considered a highly suitable filtering tool with high applicability in in silico assisted NP research, as previously reported, e.g., for pharmacological profiling of secondary metabolites or target identification of NPs (Schuster, 2010; Waltenberger et al., 2011; Grienke et al., 2015; Kratz et al., 2016).

FIGURE 5 | Representation of the pharmacophore model BAMS22 aligned to TLC in 3D with exclusion volume spheres (A), without exclusion volumes (B) and in 2D (C). Depiction of TTM8 aligned to oleanolic acid (6) in 3D with exclusion volume spheres (D), without exclusion volumes (E) and in 2D (F). The gray spheres in A,D depict so-called exclusion volumes reflecting steric hindrances. The colored spheres represent the pharmacophore features, explained at the bottom, whereby opaque spheres represent mandatory features and spheres with light shading optional ones. In the 2D graphs (C,F), HB features are illustrated as dashed arrows, hydrophobic features as yellow circles and negatively ionizable features as red marks with red bolts attached.

TABLE 2 | Results of the experimental validation of the virtual hits tested at 5 µM and 20 µM, the pharmacophore fit score (PF), the TC score (calculated in OpenEye ROCS with 13 as a query), the model with which they were predicted and their underlying database.

## Prospective Virtual Screening and Hit Selection

A prospective VS was performed with the two pharmacophores against over 350,000 molecules from nine different databases (**Table 1**). After removing duplicates, 1,069 virtual hits were obtained and clustered according to physicochemical diversity into 9 groups (**Figure 6**). As obvious from **Figure 6**, groups 1 and 2 differ from the other groups on a very early hierarchical level. The main structural difference of these two groups compared to the others is that they comprise synthetic compounds and NPs with aromatic rings, more conjugated double bonds and heteroatoms, while groups 3–9 consist of steroidal structures, reaching from cardenolides, pregnanes, bile acids to steroids and triterpenes. The most interesting molecules, in terms of structural diversity, are found in groups 1 and 2, as they comprise scaffolds dissimilar to the query molecules of the underlying pharmacophore models.

For prioritization of virtual hits to be experimentally tested a ranking was performed using shape-focused VS employing the ROCS Tanimoto Combo (TC) score (Hawkins et al., 2007; OpenEye, 2016). For this purpose best matching conformations derived from the pharmacophore-based VS were aligned with query molecule **13**. Hit selection considered a high TC score, but also compound availability in sufficient purity, and structural variance. Finally, 33 compounds were subjected to experimental validation (**Table 2**): 11 compounds had been clustered in groups 4–9 (Supplementary Figure 3), 9 compounds in group 1 (Supplementary Figure 1), and 13 compounds in group 2 (Supplementary Figure 2).

#### Biological Evaluation

GPBAR1 activity of selected hits (Supplementary Figures 1–3) was determined in a reporter gene-based luciferase assay performed in HEK 293T cells. This assay assesses the upregulation of the cAMP-PKA-CREB pathway upon GPBAR1 activation, and all conclusions are therefore limited to this receptor pathway. Compounds were considered active when they achieved at least 50% receptor activation. Compounds reaching at least 15% receptor activation were counted as weak activators. The response to 10 µM LCA was set to 100% receptor activation. Vehicle control with a final dimethylsulfoxide (DMSO) concentration of 0.1% was set to 0% activation. Initially, compounds were tested at two concentrations, i.e., 5 µM and 20 µM. Of the 33 compounds, only two (**47** and **50**) were cytotoxic at both concentrations tested. Of the remaining 31 compounds, two showed significant activity with more than 50% receptor activation at 20 µM, and six further compounds achieved more than 15% receptor activation at either 5 µM or 20 µM (**Table 2**). Compounds **22**, **24**, **30**, **34-36**, and **41** were identified as potential pan-assay interference substances (PAINS), but none of them showed activity in the experimental validation. At 5 µM, only one compound (**52**) achieved the arbitrary threshold of 15% receptor activation. Spironolactone (**46**), an approved drug for the treatment of heart failure, showed 17.3% receptor activation at 15 µM. **Table 2** gives an overview of the experimental results.
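The scaling and the activity thresholds described above can be sketched as follows. This is only an illustration under the assumption of a simple linear two-point normalization between the vehicle and the 10 µM LCA response; the function names and all signal values are hypothetical.

```python
def percent_activation(signal, vehicle, reference):
    """Scale a normalized reporter signal so that vehicle = 0% and reference = 100%."""
    return 100.0 * (signal - vehicle) / (reference - vehicle)


def classify(activation_pct):
    """Apply the thresholds stated in the text (>= 50% active, >= 15% weak activator)."""
    if activation_pct >= 50.0:
        return "active"
    if activation_pct >= 15.0:
        return "weak activator"
    return "inactive"


# Hypothetical fold-induction values: vehicle (0.1% DMSO) = 1.0, 10 µM LCA = 19.0.
for signal in (10.0, 4.0, 2.0):
    pct = percent_activation(signal, vehicle=1.0, reference=19.0)
    print(f"signal={signal:5.1f} -> {pct:5.1f}% ({classify(pct)})")
```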

As a result of this screening, the sesquiterpene coumarins **27** and **28** were discovered to be potent activators of the GPBAR1 receptor, which corroborated the scaffold-hopping competence of BAMS22. The two compounds are present in the gum resin of Ferula assa-foetida L., used in central Asia as a spice and medicine. The concentration-response curves for **27** and **28** were determined and are shown in **Figure 7**. Compounds **27** and **28** were cytotoxic in transfected HEK cells at concentrations higher than 27.5 and 22.5 µM, respectively. Due to this limitation, E<sub>max</sub> values could not be accurately determined in this assay. Accordingly, the analyzed fold activations and extrapolated EC<sub>50</sub> values of **27** and **28** are limited to the non-cytotoxic concentration-response range and may not be completely accurate. Although limited by these constraints, farnesiferol B (**27**) showed 10.54 ± 2.25 fold activation at 20 µM (60.85% ± 20.05) and an EC<sub>50</sub> of 13.53 µM. Microlobidene (**28**) achieved 16.21 ± 1.64 fold activation at 20 µM (83.81% ± 12.00) and an EC<sub>50</sub> of 13.88 µM. Whether the differences in sigmoidal slopes of LCA (**1**) and the newly identified ligands **27** and **28** are due to different interaction modes warrants further investigation. In the same assay, the endogenous ligand CDCA only reached a fold activation of 11 ± 1.05 at 50 µM. The positive control LCA (**1**) reached 18.59 ± 0.97 fold activation at 10 µM and 20.19 ± 3.77 at 30 µM. The activity of compounds **27** and **28** can therefore be regarded as being in the range of endogenous bile acids.
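For readers unfamiliar with the variable-slope sigmoidal model, the sketch below shows the usual four-parameter logistic form on a linear concentration scale. It is an illustration rather than the fitting routine used in the study, and the parameter values are hypothetical, not fitted values.

```python
def four_parameter_logistic(conc, bottom, top, ec50, hill):
    """Response of a variable-slope sigmoidal curve at concentration `conc` (> 0)."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


# Hypothetical parameters: fold activation rising from 1 to 20, EC50 = 13.5 µM.
params = dict(bottom=1.0, top=20.0, ec50=13.5, hill=2.0)
for conc in (1.0, 5.0, 13.5, 20.0, 50.0):
    print(f"{conc:5.1f} µM -> {four_parameter_logistic(conc, **params):5.2f}-fold")
```

At `conc == ec50` the response lies exactly halfway between `bottom` and `top`. If the upper plateau cannot be measured, for example because of cytotoxicity at high concentrations, `top` and therefore `ec50` are only weakly constrained by the data, which is consistent with the caution expressed above.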

induction compared with the solvent control (DMSO, 0.1%) as the mean with SEM of at least three independent experiments. The two highest concentrations of 27 are the mean of two independent experiments. GraphPad Prism's non-linear regression with the sigmoidal dose response settings (variable slope) was used to calculate curves. (B) Fold activation of compounds 1 (10 µM), 27 and 28 (20 µM) in comparison to vehicle control 0.1% DMSO in (left) GPBAR1 transfected cells and (right) GPBAR1 untransfected cells. HEK 293T cells were transfected with GPBAR1, EGFP and CRE-Luc expression plasmids (left), or with EGFP and CRE-Luc expression plasmids only (right). Cells were treated for 18 h with 20 µM of 27 and 28 as well as 10 µM LCA (1) as positive control and 0.1% DMSO as vehicle control. Luciferase activity was normalized to EGFP-derived fluorescence. Results are expressed as fold induction compared with the solvent control (0.1% DMSO). All given values are the mean of at least 3 independent experiments and the variance is given as SEM. Significance was evaluated with one-way ANOVA and Bonferroni post-test (\*\*\*p < 0.001; \*\*p < 0.01; ns, not significant vs. vehicle control).

Many NPs are well-known PAINS or frequent hitters (Baell, 2016). To exclude such false-positive results, the experiments with the two GPBAR1-activating NPs were repeated without transfecting GPBAR1. EGFP and CRE-Luc plasmids were transfected as usual at the same concentrations. In these control experiments, none of the compounds showed a significant increase in luminescence values. In contrast, the increase in luminescence in GPBAR1-transfected cells in response to the positive control LCA (**1**), as well as to compounds **27** and **28**, was significant, confirming a direct interaction with GPBAR1 (**Figure 7**).

#### CONCLUSION

The two presented 3D pharmacophore models have proven their quality as VS queries, both theoretically and experimentally. The combined computational and experimental efforts led to the successful identification of novel GPBAR1 agonists with unreported scaffolds derived both from nature (**27** and **28**) and from synthetic origin (**32**). They not only enlarge the chemical diversity of receptor activators, but can also be promising starting points for SAR and further optimization. It is also the first study reporting the activity of spironolactone (**46**) on GPBAR1, highlighting the possibility that already approved drugs may interact with GPBAR1. The elucidation of the mechanism underlying the GPBAR1 activation by these compounds may be an interesting starting point for further research. The physicochemical clustering process enabled a scaffold-rich hit selection and a solid predictive power, with 6.5% correctly predicted strong activators and 18.8% weak activators, recommending the presented workflow for future work. The study shows that the two models in combination are qualified for application in future assessments of a molecule's GPBAR1 activating profile, in particular for the assessment of NPs, as the models comprise scaffolds that are widespread in nature. This is particularly helpful for increasing our insight into the molecular mechanism of traditionally used herbal remedies with complex compositions of secondary metabolites. A fast appraisal of their pharmacological profile can give direction and fast-forward research (i.e., pinpointing the most promising constituents), alongside reducing expenses.

#### AUTHOR CONTRIBUTIONS

JR planned and supervised the study. BK and JK created the pharmacophore models under supervision of TL and JR. BK and JK performed the virtual screening and hit selection along with UG. BK and AL conducted biological experiments and analyzed the data under supervision of VD. The manuscript was written with contributions of all authors. All authors have given approval to the final version of the manuscript.

#### ACKNOWLEDGMENTS

The authors thank OpenEye for providing the ROCS software free of charge.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00242/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kirchweger, Kratz, Ladurner, Grienke, Langer, Dirsch and Rollinger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery

Ashutosh Kumar and Kam Y. J. Zhang\*

Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Japan

#### Edited by:

Honglin Li, East China University of Science and Technology, China

#### Reviewed by:

Pedro Ballester, Institut National de la Santé et de la Recherche Médicale (INSERM), France Antreas Afantitis, NovaMechanics Ltd., Cyprus Xiaofeng Liu, East China University of Science and Technology, China

> \*Correspondence: Kam Y. J. Zhang kamzhang@riken.jp

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

> Received: 10 April 2018 Accepted: 09 July 2018 Published: 25 July 2018

#### Citation:

Kumar A and Zhang KYJ (2018) Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery. Front. Chem. 6:315. doi: 10.3389/fchem.2018.00315

Molecular similarity is a key concept in drug discovery. It is based on the assumption that structurally similar molecules frequently have similar properties. Assessment of similarity between small molecules has been highly effective in the discovery and development of various drugs. In particular, two-dimensional (2D) similarity approaches have been quite popular due to their simplicity, accuracy and efficiency. Recently, the focus has shifted toward the development of methods involving the representation and comparison of three-dimensional (3D) conformations of small molecules. Among the 3D similarity methods, evaluation of shape similarity is now gaining attention for its application not only in virtual screening but also in molecular target prediction, drug repurposing and scaffold hopping. A wide range of methods have been developed to describe molecular shape and to determine the shape similarity between small molecules. The most widely used methods include atom distance-based methods, surface-based approaches such as spherical harmonics and 3D Zernike descriptors, and atom-centered Gaussian overlay-based representations. Several of these methods demonstrated excellent virtual screening performance not only retrospectively but also prospectively. In addition to methods assessing the similarity between small molecules, shape similarity approaches have been developed to compare shapes of protein structures and binding pockets. Additionally, shape comparisons between atomic models and 3D density maps allowed the fitting of atomic models into cryo-electron microscopy maps. This review aims to summarize the methodological advances in shape similarity assessment, highlighting advantages, disadvantages and their application in drug discovery.

Keywords: molecular similarity, virtual screening, shape similarity, drug discovery, gaussian overlay, spherical harmonics, 3D Zernike descriptors

# INTRODUCTION

Molecular similarity is a key concept in drug discovery and has been routinely used in the discovery and design of new molecules. It is based on the notion that two molecules often share similar physical properties and biological function if they are structurally similar. This similarity principle has been widely utilized in early phases of drug development to discover new molecules. Virtual screening has been used to filter large databases of compounds to a smaller number based on this similarity principle. Molecular similarity has been also employed to optimize the potency and pharmacokinetic properties of lead compounds based on their structure–activity relationships.

There are two components of molecular similarity analysis: (1) structural representations and (2) quantitative measurements of similarity between two structural representations. Many types of structural representations have been suggested to measure the similarity between two molecules. These include physicochemical properties, topological indices, molecular graphs, pharmacophore features, molecular shapes, molecular fields, etc. Further, there are various methods to quantify the similarity between two structural representations, e.g., Tanimoto coefficient, Dice index, cosine coefficient, Euclidean distance, Tversky index, etc. Among these, the Tanimoto coefficient (Rogers and Tanimoto, 1960) is the most popular and widely used similarity measure. Based on the structural representation, molecular similarity approaches can be broadly classified into 2D or 3D similarity methods. The 2D similarity methods rely only on the 2D structural information and are among the fastest, most efficient and most popular similarity search methods. Moreover, they do not rely on structural alignments for estimating the similarity between two molecules. These methods include substructure search, fingerprint similarity search and 2D descriptor-based methods. However, most of these methods are limited in their ability to enable scaffold hopping and provide no structural or mechanistic insights. To deal with the limitations associated with 2D similarity methods, several approaches were developed that account for the 3D conformations of a molecule while performing a similarity search. These methods include pharmacophore modeling, shape similarity, molecular field-based methods and 3D fingerprints, among others. In recent years, ligand 3D shape-based similarity analysis has become a method of choice in an increasing number of virtual screening campaigns. Several successful applications of shape similarity to discover new molecules have been published in the literature. The major advantage of shape-based virtual screening methods is that scaffold hopping can be conveniently accomplished and scaffolds other than that of the query can be identified.
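Several of the similarity coefficients named above can be illustrated for binary fingerprints, here represented as Python sets of "on" bit positions. This is a minimal sketch using the standard set-based definitions; the two fingerprints are hypothetical and do not correspond to real molecules.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    shared = len(a & b)
    union = len(a | b)
    return shared / union if union else 0.0


def dice(a, b):
    """Twice the shared bits divided by the total number of set bits."""
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def tversky(a, b, alpha, beta):
    """Asymmetric generalization; alpha = beta = 1 gives the Tanimoto coefficient."""
    shared = len(a & b)
    denom = shared + alpha * len(a - b) + beta * len(b - a)
    return shared / denom if denom else 0.0


# Hypothetical fingerprints given as sets of "on" bit indices.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}
print(tanimoto(query, candidate), dice(query, candidate), cosine(query, candidate))
```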

In this review, we will summarize the development and application of various 3D shape similarity methods and will comment on their utility in drug discovery. We will first outline the classification and various types of 3D shape similarity methods highlighting their advantages and disadvantages. Later, we will describe various applications of 3D shape similarity methods in drug discovery.

### 3D SHAPE SIMILARITY METHODS

The 3D shape has been widely recognized as a key determinant for the activity of small molecules and other biomolecules (Zauhar et al., 2003; Rush et al., 2005; Schnecke and Boström, 2006; Kortagere et al., 2009). Shape complementarity between ligand and receptor is necessary for bringing the receptor and ligand sufficiently close to each other so that they can form the critical interactions necessary for binding. Two molecules with similar shape are likely to fit the same binding pocket and thereby exhibit similar biological activity. Shape comparison methods can be broadly classified as (1) alignment-free or non-superposition methods and (2) alignment or superposition-based methods. Both classes have their own advantages and disadvantages. Alignment-free methods are independent of the position and orientation of molecules. As such, they are much faster and can be used to screen large compound databases. Alignment-based methods rely on finding the optimal superposition between the compounds. They are highly effective in identifying shape similarities among molecular structures, but they are computationally expensive. These methods enable comparison of surface properties such as hydrophobicity and polarity. Visualization is one of the advantages of the alignment-based methods, as the similarity between two molecules can be displayed. This information is useful in the design of new molecules and in guiding further optimization. However, a subpar molecular alignment may lead to errors in comparing two molecules. Apart from this broad classification, shape similarity methods can be classified based on the underlying representation of molecular shape. The similarity between these shape representations is evaluated by employing various similarity metrics. A schematic overview of the similarity calculation between a query and database molecules is given in **Figure 1**. In the following paragraphs, we will outline commonly utilized shape representations with their advantages and disadvantages. As this review is targeted toward a broader readership, we will only provide an overview of the methods. For algorithmic details and the mathematics behind each method, readers may refer to the original publications.

# Atomic Distance-Based Descriptors

These methods are based on the assumption that the shape of a molecule can be described by the relative positions of its atoms. The similarity between molecules can then be calculated by comparing the corresponding distributions of atomic distances. As these descriptors only require the computation of interatomic distances in compounds, these methods are faster than other shape comparison methodologies. Additionally, these methods do not require the alignment of two molecules for shape comparison. An overview of various atomic distance-based methods is given in **Table 1**, highlighting their availability as well as their advantages and disadvantages. One of the earlier atomic distance-based shape comparison methods was based on atom triplet distances (Bemis and Kuntz, 1992). This method considered each molecule as a collection of three-atom sub-molecules. The atom triplet triangle perimeters were used to generate shape histograms, which were then utilized to compare the shapes of molecules. This method, however, has a few limitations. It is difficult to select a bin size suitable for all molecules. Each molecule typically generates 300–500 atom triplets, and storing them requires large space, especially when comparing a large database of molecules. To deal with this limitation, another atom triplet-based molecular shape comparison method was developed in which a 2,048-bit single condensed triplet shape signature was employed to represent the entire set of triplets in each molecule (Nilakantan et al., 1993). A signature of the query molecule is first compared with the already stored signatures of database molecules. Then only the compounds with adequately similar signatures are compared in detail by generating all triplets. Although this method was efficient, there was a risk of missing similar compounds due to the use of a highly reduced signature representation. Another group developed molecular descriptors based on atom triplet triangles, angular information from surface point normals and local curvature to facilitate shape comparisons (Good et al., 1995). However, these descriptors have limited discriminating power and require large disk space for storage.
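The triplet-perimeter idea described above can be sketched in a few lines. This is an illustration of the general principle only, not a reimplementation of the published method; the bin width, the number of bins and the toy coordinates are arbitrary assumptions.

```python
import math
from itertools import combinations


def triplet_perimeter_histogram(coords, bin_width=1.0, n_bins=30):
    """Histogram of triangle perimeters over all atom triplets.

    `coords` is a list of (x, y, z) tuples. Perimeters beyond the last bin
    are counted in the last bin.
    """
    histogram = [0] * n_bins
    for a, b, c in combinations(coords, 3):
        perimeter = math.dist(a, b) + math.dist(b, c) + math.dist(a, c)
        index = min(int(perimeter / bin_width), n_bins - 1)
        histogram[index] += 1
    return histogram


# Toy coordinates (hypothetical, not a real molecule).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
hist = triplet_perimeter_histogram(atoms)
print(hist[:6])
```

The fixed `bin_width` argument makes the limitation mentioned in the text visible: a value that resolves small molecules well may be too coarse or too fine for larger ones.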

Ultrafast shape recognition (USR) (Ballester and Richards, 2007a,b; Ballester, 2011) is possibly the most popular atomic distance-based method developed to overcome the alignment and speed problems associated with shape similarity methods. This method also uses the relative positions of atoms to describe the shape of a molecule. The schematic overview of the USR method is given in **Figure 2** along with an example of the shape similarity evaluation. USR calculates the distribution of all atom distances from four reference positions: the molecular centroid (ctd), the closest atom to the molecular centroid (cst), the farthest atom from the molecular centroid (fct) and the atom farthest away from fct (ftf). Subsequently, the first three statistical moments (mean, variance, and skewness of the distribution) are calculated from each of these distributions. Hence, each molecule has a vector of twelve descriptors to describe its 3D shape. Finally, the similarity between the shapes of two molecules is calculated as the inverse of one plus the scaled Manhattan distance between these 12 values:

$$S_{qi} = \frac{1}{1 + \frac{1}{12} \sum_{l=1}^{12} \left| M_l^q - M_l^i \right|}$$

where M<sup>q</sup> and M<sup>i</sup> are vectors of shape descriptors for the query and the i-th molecule, respectively. The performance of USR was retrospectively compared with EigenSpectrum Shape Fingerprints (EShape3D), where better mean enrichment for USR was observed (Ballester et al., 2009). A retrospective comparison with three state-of-the-art shape similarity methods, EShape3D, shape signatures and ROCS, revealed that USR is 1,546, 2,038, and 14,238 times faster than these methods, respectively (Ballester and Richards, 2007a). A web implementation of USR (USR-VS) is an extremely fast way of carrying out shape similarity calculations (Li et al., 2016). USR-VS is capable of screening 55 million 3D conformers per second and can calculate similarity scores for 94 million 3D conformers in about 2 s. This extremely fast speed is achieved because the features for all 3D conformers are preloaded into memory. Moreover, the multi-threaded design of the webserver and the alignment-free nature of the USR method also contribute to such a high computational efficiency. A hardware implementation of USR has been shown to achieve two-fold speed gains over the standard CPU-based implementation of USR (Morro et al., 2018). In this implementation, a computing technique, Spiking Neural Networks, has been adapted utilizing Field-Programmable Gate Arrays to allow a highly parallelized implementation of USR. Prospective application of USR in the identification of inhibitors of arylamine N-acetyltransferases, protein arginine deiminase 4 (PAD4), falcipain 2, phosphatases of regenerating liver (PRL-3) and p53-MDM2, and for phenotypic targets such as colon cancer cell lines, established the real-world applicability of USR (Li et al., 2009; Ballester et al., 2010, 2012; Teo et al., 2013; Hoeger et al., 2014; Patil et al., 2014). As USR is an ultrafast, purely shape-based similarity method, several methods augmenting the original USR capabilities were developed. 
These include a method where USR was combined with MACCS keys encoding the topological information of small molecules (Cannon et al., 2008). To clearly distinguish between enantiomers, methods complementing USR with optical isomerism descriptors were developed (Armstrong et al., 2009; Zhou et al., 2010). Electroshape, a USR variant, appended partial charge and atomic lipophilicity (alogP) as additional molecular properties to account for electrostatics and lipophilicity along with shape recognition (Armstrong et al., 2010, 2011). A web implementation of Electroshape is available at SwissSimilarity (Zoete et al., 2016). AutoCorrelation of Partial Charges (ACPC) also utilized partial charges with atomic distances to measure similarity between two molecules (Berenger et al., 2014). The method uses an autocorrelation function and a point charge model to encode all atoms of a molecule into two vectors that are rotation and translation invariant. Another implementation of the USR method is Ultrafast Shape Recognition with Atom Types (UFSRAT), which introduced pharmacophoric constraints to USR by incorporating atom type information (Shave, 2010; Lim et al., 2011; Shave et al., 2015). UFSRAT is capable of very fast comparison of a query molecule with small molecule libraries from several major chemical vendors via its webserver (**Table 1**). Application of the UFSRAT method in the discovery of MDM2, PRL-3, FK506-Binding Protein 12, kynurenine 3-monooxygenase and 11β-hydroxysteroid dehydrogenase type 1 (11βHSD1) inhibitors demonstrated its utility in key areas of drug discovery such as cancer, Alzheimer's disease, inflammation and type-II diabetes (Hoeger et al., 2014; Houston et al., 2015; Shave et al., 2015, 2018). Another similar implementation, USRCAT, utilized CREDO atom types to add pharmacophoric information to USR (Schreyer and Blundell, 2009, 2012). USRCAT not only retained the ability of USR to retrieve hits with low structural similarity but also demonstrated improved performance over the original USR implementation.
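As a concrete illustration of the USR similarity score S defined earlier in this section, the following minimal Python sketch computes the score from two precomputed 12-element descriptor vectors and ranks a small library against a query. The function names `usr_similarity` and `rank_library` are hypothetical, and the calculation of the descriptors themselves is assumed to have been done elsewhere.

```python
import numpy as np


def usr_similarity(m_query, m_other):
    """USR score: 1 / (1 + mean absolute difference of the 12 descriptors)."""
    m_query = np.asarray(m_query, dtype=float)
    m_other = np.asarray(m_other, dtype=float)
    if m_query.shape != (12,) or m_other.shape != (12,):
        raise ValueError("USR expects two vectors of 12 shape descriptors")
    return 1.0 / (1.0 + np.abs(m_query - m_other).mean())


def rank_library(m_query, library):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [usr_similarity(m_query, m) for m in library]
    order = np.argsort(scores)[::-1]
    return [(int(i), float(scores[i])) for i in order]
```

Because the score only needs an element-wise difference of short vectors, no alignment step is involved, which is consistent with the very high screening speeds reported for USR.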

Atomic distance or descriptor-based methods are widely used due to their ability to quickly compare the shapes of query molecules with large small molecule libraries. A fast comparison of a wide range of chemical space increases the chances of finding novel hits. These methods are not only computationally efficient but have also produced excellent hit rates, as revealed by several successful prospective studies against a wide range of molecular and non-molecular targets. Moreover, they are also capable of retrieving chemical scaffolds which are different from the query molecule, thus allowing scaffold hopping. As atomic distance-based shape similarity approaches are alignment-free, the visual inspection of shape similarity may sometimes be challenging, especially for molecules which have low structural similarity. Selection of the right query compound is a key component of atomic distance-based shape similarity methods and their performance depends on optimal query selection. Hit rate can be improved by employing multiple queries and increasing the diversity of selected hits. Moreover, clustering based on shape similarity could be utilized to understand how different chemotypes arrange in binding pockets and thereby to generate consensus queries (Pérez-Nueno et al., 2008; Pérez-Nueno and Ritchie, 2011) that improve virtual screening performance and reduce false positives.

# Atom-Centered Gaussian-Based Shape Similarity Methods

Among many methods of describing the molecular shape of a molecule, hard sphere (Connolly, 1985; Masek et al., 1993) and Gaussian sphere (Grant and Pickup, 1995; Grant et al., 1996) are two most widely adopted models. Both of these models describe the shape in terms of the volume of a molecule. Two molecules will possess similar shape if they have similar volume. Hard sphere model represents a molecule by a set of merged spheres where each sphere serves as an atom with its van der Waals radius. The volume of a molecule can be calculated by a formula that describes the union of a number of sets and their intersection. Although the analytical expression of the volume and its derivatives is reported in the original publication (Masek et al., 1993), it is not easy to implement as the formulas become very complicated with increasing number of intersections. Gaussian sphere model (Grant and Pickup, 1995, 1997; Grant et al., 1996) represents a molecule using a set of overlapping Gaussian spheres and measures the integral volume over all overlapping Gaussians. In this model, each intersection is expressed as the integral of a set of overlapping atom-centered Gaussian spheres and the volume of a molecule is described based on the inclusion-exclusion principle. Analytical expression for the volume calculation is given in the original publication which describes highly accurate volume calculation up to sixth order intersections (Grant and Pickup, 1995). The authors also proposed comparing shapes of two molecules by numerically optimizing the overlap between two molecules (Grant et al., 1996).

Several methods based on Gaussian overlays were developed to measure the shape similarity between two molecules. An overview of these methods is presented in **Table 2**. Among these, Rapid Overlay of Chemical Structures (ROCS) is undoubtedly the most widely used method that utilizes Gaussian functions to measure the shape similarity between two molecules (Rush et al., 2005; Hawkins et al., 2007). ROCS algorithm is based on the original Gaussian overlay approach that finds and quantifies the maximum volume overlap between two molecules (Grant and Pickup, 1995; Grant et al., 1996). An overview of ROCS shape similarity calculation is given in **Figure 3**. However, to improve the efficiency of volume overlap calculations, it incorporated several modifications to the original implementation. ROCS ignores hydrogens for the volume calculations and uses equal radii for all heavy atoms. Furthermore, ROCS utilizes only the first order terms of shape density function. ROCS employs Tanimoto (Rogers and Tanimoto, 1960) and Tversky (Tversky, 1977) correlation coefficients as similarity metrics to calculate the overlap between two molecules which are defined as:

$$\begin{aligned} Tanimoto\_{a,b} &= \frac{O\_{a,b}}{O\_a + O\_b - O\_{a,b}}\\ Tversky\_{a,b} &= \frac{O\_{a,b}}{O\_{a,b} + \alpha O\_a + \beta O\_b} \end{aligned}$$

where O<sub>a,b</sub> is the volume overlap between molecules a and b, O<sub>a</sub> is the volume of molecule a and O<sub>b</sub> is the volume of molecule b. α and β are parameters of the Tversky index. ROCS also considers chemical complementarity by including chemical features to improve shape-based superposition. ROCS has been successfully employed in various drug discovery campaigns, such as to identify small-molecule inhibitors (Kumar et al., 2014b), to scaffold hop from one chemical class to another (Kumar et al., 2016), to rescore docking-generated poses (Kumar and Zhang, 2016a) and to predict binding poses and ranking of inhibitors (Kumar and Zhang, 2016b,c). ROCS can routinely perform shape and chemical feature comparisons of about 600–800 conformers per second on a modern CPU. Although this speed is reasonable for alignment-based shape similarity methods, it takes several hours to screen a moderately sized virtual screening library. To facilitate large scale shape comparison, e.g., to screen large small molecule libraries within minutes, FastROCS (https://www.eyesopen.com/molecular-modelingfastrocs), a GPU implementation of ROCS, has been developed that increased the shape comparison speed by about three orders of magnitude over its CPU implementation. FastROCS is capable of processing up to a million conformers per second on a single NVIDIA Tesla K20 GPU (https://docs.eyesopen.com/toolkits/python/fastrocstk/architecture.html). PAPER, an open source GPU implementation of the ROCS algorithm, also demonstrated speed acceleration of up to two orders of magnitude on an NVIDIA GeForce GTX 280 GPU over its open source CPU implementation on an Intel Xeon E5345 CPU (Haque and Pande, 2010). MolShaCS is another method that uses a Gaussian description of shape to evaluate molecular similarity between two molecules (Vaz de Lima and Nascimento, 2013). 
In addition to shape, MolShaCS utilizes a Gaussian description of charge distribution to optimize overlays and similarity computations using Hodgkin's index (Hodgkin and Richards, 1987; Good et al., 1992). It was able to process 21 compounds per second, which seems quite impressive for computers of that time. As Gaussian overlay based methods require precise alignment for the calculation of shape similarity, several groups employed approaches such as pharmacophore and field based methods to generate the initial alignment. SHAFTS (SHApe-FeaTure Similarity) (Liu et al., 2011) adopted pharmacophoric point triplets and least square fitting to generate the initial alignment. A weighted sum of pharmacophoric fit and volume overlap was then used to assess shape similarities. Phase Shape (Sastry et al., 2011) employed a similar concept of atom distribution triplets to generate initial alignments, which were then refined by maximizing the volume overlap. Phase Shape is capable of performing shape comparisons of about 500 conformers per second. Shape and Electrostatic Potential (ShaEP) (Vainio et al., 2009) resembles SHAFTS and Phase Shape, as it utilizes a hybrid approach that combines field-based methods with volumetric methods to estimate molecular similarity. ShaEP uses a graph matching algorithm to generate the initial superposition. Molecular graphs represent shape and electrostatic potential at points close to the molecular surface. The method then optimizes the initial alignment by maximizing the volume overlap calculated through Gaussian functions. Another similar method, SimG (Cai et al., 2013), adopted the downhill simplex method (Nelder and Mead, 1965) to evaluate the similarity in shape and chemical features of a molecule and a binding pocket or ligand. 
The SimG shape similarity method has an advantage over the other methods described here in that it is capable of performing shape similarity evaluations between a ligand and a binding pocket. The SABRE method (Hamza et al., 2012, 2013) introduced two modifications to the original Gaussian overlay based shape similarity implementation. First, it utilized reduced chemical structures, obtained by removing the functional groups not present in the query, to generate initial alignments. Reduced chemical structures were subsequently replaced by full structures and the initial alignments were refined by rigid-body translation and rotation using steepest descent to produce shape density overlap with the query. Secondly, to avoid bias toward large ligands when using the Tanimoto similarity metric, a new scoring function, the Hamza–Wei–Zhan (HWZ) score, was developed. An extension to the SABRE method enabled its use in the chemogenomics area (Wei and Hamza, 2014). Shapelets (Proschak et al., 2008) is unlike any other Gaussian overlay based shape comparison method. It describes the shape of a molecule by decomposing its surface into discrete patches. This 3D graph representation can then be used for either full or partial shape similarity evaluations.
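To make the Gaussian overlay idea more tangible, the sketch below evaluates a first-order Gaussian volume overlap and the resulting shape Tanimoto for two molecules in a fixed pose. It is only an illustration under stated assumptions: all heavy atoms share one radius, only pairwise (first-order) terms are used, the amplitude `P` and radius `RADIUS` are assumed values, and no overlay optimization is performed, whereas the methods described above search for the pose that maximizes the overlap.

```python
import numpy as np

P = 2.7        # Gaussian amplitude; an assumed value, implementations differ
RADIUS = 1.7   # single radius in angstrom for every heavy atom; assumed value


def gaussian_alpha(radius=RADIUS, p=P):
    """Width chosen so one atomic Gaussian integrates to the hard-sphere volume."""
    return np.pi * (3.0 * p / (4.0 * np.pi * radius ** 3)) ** (2.0 / 3.0)


def first_order_overlap(coords_a, coords_b, radius=RADIUS, p=P):
    """Sum of pairwise Gaussian overlap integrals between atoms of a and b."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    alpha = gaussian_alpha(radius, p)
    # Squared distance between every atom of a and every atom of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Analytic integral of the product of two equal-width Gaussians.
    pair = p * p * np.exp(-0.5 * alpha * d2) * (np.pi / (2.0 * alpha)) ** 1.5
    return float(pair.sum())


def shape_tanimoto(coords_a, coords_b):
    """Tanimoto of the overlap volumes for the poses exactly as given."""
    o_ab = first_order_overlap(coords_a, coords_b)
    o_a = first_order_overlap(coords_a, coords_a)
    o_b = first_order_overlap(coords_b, coords_b)
    return o_ab / (o_a + o_b - o_ab)
```

In a full overlay method, `shape_tanimoto` would be evaluated repeatedly inside a rigid-body optimizer started from several initial alignments, which is where most of the computational cost arises.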

In most Gaussian function based overlay methods, the shape density of a molecule is described as the sum of the shapes of individual atoms, which sometimes results in overestimation of the volume, for example, in molecules where some atoms highly overlap with others in the vicinity. The Weighted Gaussian algorithm (WEGA) method (Yan et al., 2013) puts forward a modification where a weight factor is introduced for every atom. This weight factor reflects the crowdedness of an atom with its neighbors. The shape density of a molecule is represented by the linear combination of weighted atomic Gaussian functions. Utilizing this modification, the WEGA method demonstrated improved shape similarity and virtual screening performance. The speed of WEGA shape similarity calculations varies with the size of query and database compounds. For an average drug-like query, WEGA can process 1,000–1,500 conformations per second (Yan et al., 2013). A GPU implementation of this method (gWEGA) has also been developed that reported a virtual screening speed increase of two orders of magnitude on one NVIDIA Tesla C2050 GPU over its CPU implementation on a quad-core Intel Xeon X3520 CPU (Yan et al., 2014). Another WEGA derivative, HybridSim, proposed a hybrid metric combining 2D fingerprints with WEGA shape similarity and demonstrated improved virtual screening performance over standalone 2D fingerprint and shape similarity methods (Shang et al., 2017).

Overall, atom-centered Gaussian-based shape similarity methods present many advantages over other shape similarity methods. Although not as fast as distance based methods, these methods are fast enough for large scale virtual screenings. The major advantage with atom-centered Gaussian-based shape similarity methods is the visualization. The visualization of shape similarity between two molecules is immensely helpful in deriving the structure activity relationship for the optimization and for scaffold hopping. A majority of these methods address the problem of ligand flexibility by utilizing conformational ensemble. However, in some cases it may not be trivial to sample all possible conformations, e.g., natural products. Moreover, several top performing conformational generation methods face difficulty in modeling the correct conformation of some molecules, e.g., macrocycles, peptidomimetics etc. Another limitation with these methods is that their performance highly depends upon the query molecule and choosing the right query is a critical component of a shape-based virtual screening campaign (Kirchmair et al., 2009). Despite these limitations, atom-centered Gaussian overlay based methods are the most widely used shape similarity methods. They have provided many successful examples demonstrating their utility in various areas of drug discovery which will be discussed later in this manuscript.

## Surface Based 3D Shape Similarity Comparison Methods

Molecular surface is another way of depicting the shape of a molecule. Comparison of molecular surfaces based on their shapes can reveal similarity in their physical and biological properties. There are many ways to describe the surface of a molecule. Precise definitions such as surfaces based on quantum mechanical wave functions are not practical, especially for large molecules (Mezey, 2007). Surface definitions such as the solvent-accessible surface (Lee and Richards, 1971; Connolly, 1983) and the van der Waals surface are more practical and much easier to calculate. Some studies employed alpha shapes (Edelsbrunner et al., 1983; Edelsbrunner and Mücke, 1994; Edelsbrunner, 1995), which are a coarse representation of the Connolly surface (Connolly, 1983), to describe the shape of a molecule (Wilson et al., 2009). Alpha shapes of a set of points "S" are a generalization of the convex hull and utilize a parameter, α, to describe the shape with varying levels of detail. For large α values, the alpha shape is equivalent to the convex hull, and shape feature details such as concavities and voids start to appear as α decreases. The alpha shape method has been applied to represent and compare shapes of 3D molecules (Wilson et al., 2009).

Shape signatures or shape histograms offer another representation of molecular shape that can be used to explore 3D volume of a molecule confined by the solvent accessible surface (Zauhar et al., 2003; Meek et al., 2006). Shape signatures are probability distribution histograms borrowed from a computer graphics technique, ray-tracing. In this method, a ray is initiated within a molecule bound by its solvent accessible surface. Propagation of a ray trace inside of the triangulated solvent accessible surface is recorded as probability distribution histograms. The histograms for query and any other molecule can be easily compared using the following metrics:

$$\begin{aligned} L\_1^{1D} &= \sum\_i |H\_i^1 - H\_i^2| \\ L\_1^{2D} &= \sum\_i \sum\_j |H\_{i,j}^1 - H\_{i,j}^2| \end{aligned}$$

where 1D represents the probability distribution of ray-trace lengths only, while 2D represents ray-trace lengths in combination with an additional molecular property such as electrostatic potential. A shape signature encodes the shape, molecular size and surface charge distribution of a molecule and can be utilized to compare the histogram of a query molecule with the pre-generated histograms of small molecule libraries. The utility of shape signatures as a virtual screening approach has been demonstrated in several studies (Nagarajan et al., 2005; Wang et al., 2006; Hartman et al., 2009; Ai et al., 2014; Werner et al., 2014). As shape signature based similarity comparisons are fast and do not require the alignment of molecules, they are capable of screening millions of molecules in a short time. In addition to shape similarity, shape signatures also allow shape complementarity comparisons against a receptor binding pocket. Although shape similarity calculations with shape signatures have been effectively used in many inhibitor discovery efforts, the high number of false positives is a concern, especially for large and complex queries. To cope with these drawbacks, a few modifications to the original method were reported. These include fragment-based shape signatures (FBSS) (Zauhar et al., 2013) and inner distance shape signatures (IDSS) (Liu et al., 2009, 2012). FBSS involves the generation and comparison of shape signatures for fragments of the molecules. IDSS utilizes the inner distance, which is the shortest path between landmark points within the molecular shape. IDSS has been shown to be especially useful for flexible molecules, as it is insensitive to their shape deformation.
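The L1 histogram comparison given above is simple enough to sketch directly. The example below is a minimal illustration that assumes the ray-trace histograms have already been generated and share the same binning; the helper names `normalise` and `l1_distance` are hypothetical. The same function covers the 1D case (ray-trace lengths) and the 2D case (lengths combined with a second property), because the sum runs over all bins.

```python
import numpy as np


def normalise(hist):
    """Convert raw bin counts to a probability distribution."""
    h = np.asarray(hist, dtype=float)
    total = h.sum()
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return h / total


def l1_distance(hist_1, hist_2):
    """Sum of absolute bin-wise differences between two normalised histograms."""
    h1, h2 = normalise(hist_1), normalise(hist_2)
    if h1.shape != h2.shape:
        raise ValueError("histograms must use identical binning")
    return float(np.abs(h1 - h2).sum())
```

A distance of 0 indicates identical signatures, and for normalised histograms the distance cannot exceed 2, which is reached when the two histograms occupy entirely different bins.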


Several methods employed local surface shape similarity to align and estimate the similarity between molecules. One such method applied subgraph isomorphism to molecular surface comparison (Cosgrove et al., 2000). In this method, the molecular surface was represented by patches of the same shape. Alignment between two molecules was obtained by using a clique-detection algorithm to find overlapping patches. Quadratic shape descriptors (Goldman and Wipke, 2000) exploited a similar concept where the molecular surface was divided into a series of patches. Each patch was represented by geometrically invariant descriptors such as the normal, the shape index and the principal curvatures, which were then used to identify similar patches. SURFCOMP (Hofbauer et al., 2004) further applied several filters such as surrounding shape and physicochemical properties to identify corresponding patches on the surfaces of two molecules (**Table 3**).

Spherical harmonics (SH) based representations, which are expansions in SH functions, also allow quantitative description of molecular shapes (Max and Getzoff, 1988). In this representation, shapes are expressed as functions on a unit sphere. Each surface point is described by its spherical coordinates (r, θ, φ), and the surface is defined by setting r = f(θ, φ), where f is a radial function encoding the distance of surface points from a chosen origin. This function can be approximated by an expansion in SH basis functions given by:

$$r\left(\theta,\phi\right) = \sum\_{l=0}^{L} \sum\_{m=-l}^{l} c\_{l,m} Y\_{l}^{m}(\theta,\phi)$$

where Y<sub>l</sub><sup>m</sup>(θ, φ) is the SH basis function of degree l and order m, and c<sub>l,m</sub> are the coefficients of the SH expansion. L is the limit chosen to obtain the desired resolution of the surface. The number of terms in the expansion depends on this limit: a given value of L yields (L+1)<sup>2</sup> terms. In general, SH are not rotation and translation invariant, as the magnitudes of c<sub>l,m</sub> change with the rotation of r(θ, φ). Hence, prior alignment is necessary before comparing the shapes of molecules. Efforts were also made to make SH rotation and translation invariant (Kazhdan et al., 2003; Mak et al., 2008); however, these modifications increase the number of terms, thereby increasing the complexity of SH.


About two decades ago, it was shown that SH functions could be applied to estimate the 3D molecular similarity between two macromolecules (Ritchie and Kemp, 1999). Since then, SH have been successfully applied in virtual screening (Cai et al., 2002; Mavridis et al., 2007), protein structure comparisons (Tao et al., 2005; Gramada and Bourne, 2006), protein-ligand docking (Ritchie and Kemp, 2000; Lin and Clark, 2005; Yamagishi et al., 2006), binding pocket similarity comparison (Morris et al., 2005) etc. Additionally, several groups utilized variations of SH to compare the shapes of small molecules. The first implementation of SH to compare shapes of small molecules opened the way for many applications ranging from virtual screening to quantitative structure-activity relationship (QSAR) model building (Lin and Clark, 2005). The SpotLight program utilizes SH to superpose and classify small molecules (Mavridis et al., 2007). To enable high throughput virtual screening, the vector interpretation of SH coefficients was used to construct rotation and translation invariant fingerprints (RIFs), which were compared using a distance score (Mavridis et al., 2007). In this method, rotational invariance was gained by binning together the SH coefficients of the same order. This method was later developed as ParaFit (http://www.ceposinsilico.de) (**Table 3**). In another study, the SH based molecular surface was decomposed and the norms of the decomposition coefficients were used to describe the molecular shape (Wang et al., 2011). Norms of decomposition coefficients are partially rotation and translation invariant, enabling large scale comparison. The performance of this method was retrospectively demonstrated and it was also prospectively applied in the discovery of cyclooxygenase-1 and cyclooxygenase-2 inhibitors. The SHeMS method utilizes a genetic algorithm to optimize the weights of SH expansion coefficients for a reference set (Cai et al., 2012). Through optimization of weights, SHeMS demonstrated improved performance over the original SH implementation and the USR method. To facilitate measurement of similarity between sets of compounds, many shape similarity methods were complemented with physicochemical properties. The harmonic pharma chemistry coefficient (HPCC) method combined SH shape representation with pharmacophoric features (Karaboga et al., 2013). In the HPCC method, SH surfaces are discretized as triangle meshes which are assigned pharmacophoric features. Tanimoto similarity for shape and for pharmacophore features is calculated separately between query and test molecules. A combo score is finally calculated by adding the Tanimoto scores for shape and chemical overlay. The HPCC method demonstrated improved performance for the combo approach over utilizing shape alone.
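The idea of gaining rotational invariance by pooling SH coefficients can be illustrated with a short sketch. It assumes, as an interpretation rather than a description of any published program, that coefficients sharing the same l are pooled into one norm, and that the coefficients are stored as a flat array ordered by l = 0..L and m = -l..l. All function names are hypothetical. Only rotation about the z axis is demonstrated, because in that case each coefficient simply acquires a phase factor.

```python
import numpy as np


def n_terms(L):
    """Number of coefficients in an expansion truncated at degree L."""
    return (L + 1) ** 2


def split_by_degree(coeffs, L):
    """Split a flat coefficient array into one block of 2l+1 values per l."""
    coeffs = np.asarray(coeffs, dtype=complex)
    if coeffs.size != n_terms(L):
        raise ValueError("expected (L + 1)**2 coefficients")
    blocks, start = [], 0
    for l in range(L + 1):
        blocks.append(coeffs[start:start + 2 * l + 1])
        start += 2 * l + 1
    return blocks


def invariant_fingerprint(coeffs, L):
    """One norm per l; these norms do not change when the shape is rotated."""
    return np.array([np.linalg.norm(b) for b in split_by_degree(coeffs, L)])


def rotate_about_z(coeffs, L, gamma):
    """Coefficients after rotating the surface by gamma about the z axis."""
    rotated = []
    for l, block in enumerate(split_by_degree(coeffs, L)):
        m = np.arange(-l, l + 1)
        rotated.append(block * np.exp(-1j * m * gamma))
    return np.concatenate(rotated)
```

The fingerprint has only L + 1 entries instead of (L + 1)<sup>2</sup> coefficients, so two molecules can be compared with a plain vector distance and without prior superposition, at the cost of discarding the information carried by the relative phases.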

In several studies, 3D-Zernike descriptors (3DZD) (Novotni and Klein, 2003), which are an extension of SH, were employed to compare the shapes of molecules and cryoEM maps (**Figure 4** and **Table 3**). 3DZD differ from SH in terms of their mathematical description. 3DZD can model molecular shape more precisely than SH, which can only model single-valued or star-shaped surfaces. They are rotation and translation invariant, whereas SH depend on the orientation of the molecule. Although rotation and translation invariant SH descriptors have been developed (Kazhdan et al., 2003), the number of terms is much higher in SH descriptors. 3DZD are also suitable to represent other properties on molecular surfaces such as hydrophobicity and electrostatic potential (Sael et al., 2008a). In the drug discovery area, 3DZD were initially applied to compare shapes of protein molecules (Sael et al., 2008b; **Figure 4A**). Later, the concept was extended to measuring shape similarity between small molecules (Venkatraman et al., 2009a) and between binding pockets (Kihara et al., 2009; Venkatraman et al., 2009b; **Figures 4B,C**). In the 3DZD method, the 3D Zernike function is described as:

$$Z\_{nl}^{m}(r,\theta,\phi) = R\_{nl}(r)Y\_{l}^{m}(\theta,\phi).$$

where Y<sub>l</sub><sup>m</sup>(θ, φ) is the SH basis function while R<sub>nl</sub>(r) is the radial function. Zernike moments are calculated using the following equation:

$$F\_{nl}^{m} = \frac{3}{4\pi} \int f\left(\mathbf{x}\right) \overline{Z\_{nl}^{m}\left(\mathbf{x}\right)} d\mathbf{x}$$

As Zernike moments are not rotationally invariant, they are expressed as the norm ‖F<sub>nl</sub><sup>m</sup>‖ to make them rotation and translation invariant; this norm is known as 3DZD. Shape similarity between two molecules based on 3DZD is compared using the following metrics:

$$\begin{aligned} \text{Euclidean distance} &= \sqrt{\sum\_{i=1}^{n} (X\_i - Y\_i)^2} \in [0, \infty) \\ \text{Pearson } r &= \frac{n \sum X\_i Y\_i - \sum X\_i \sum Y\_i}{\sqrt{n \sum X\_i^2 - (\sum X\_i)^2} \sqrt{n \sum Y\_i^2 - (\sum Y\_i)^2}} \in [-1, 1] \end{aligned}$$

$$\text{Manhattan distance} = \frac{1}{1 + \frac{\sum\_{i=1}^{n} |X\_i - Y\_i|}{n}} \in (0, 1]$$

Ligand 3D shape similarity comparison using 3DZD is fast and rotation translation invariant. As no alignment step is required for comparison, it can be utilized as a virtual screening tool to filter a database of compounds based on shape similarity with a query molecule.
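The three comparison metrics listed above can be sketched in a few lines. The example assumes that the 3DZD vectors X and Y have already been computed and have equal length n; the function names are hypothetical, and the third function implements the normalised Manhattan-based score exactly as written in the equation above, so that larger values mean greater similarity.

```python
import numpy as np


def euclidean_distance(x, y):
    """Square root of the summed squared differences; 0 for identical vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(((x - y) ** 2).sum()))


def pearson_r(x, y):
    """Pearson correlation coefficient between two descriptor vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = (np.sqrt(n * (x ** 2).sum() - x.sum() ** 2)
           * np.sqrt(n * (y ** 2).sum() - y.sum() ** 2))
    return float(num / den)


def manhattan_score(x, y):
    """1 / (1 + mean absolute difference); equals 1 for identical vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(1.0 / (1.0 + np.abs(x - y).mean()))
```

Note that the Euclidean distance decreases as shapes become more similar while the other two measures increase, so a screening script has to sort hit lists in the appropriate direction for the chosen metric.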

Overall, surface-based shape similarity methods present attractive options for comparing the shapes of small molecules and macromolecules. They were quite successful in estimating the global and local similarities between macromolecules. However, most of these methods are still in their infancy as far as small molecule shape comparison is concerned. Several reasons may have contributed to the lack of interest from researchers in accepting these methods as small molecule shape comparison tools. Surface-based methods such as SH and 3DZD are mathematically complex and involve inclusion of many terms to fully capture the shape of a molecule. Moreover, they are slow in comparison to atomic distance-based shape description and comparison methods, while their accuracy in retrieving compounds similar in shape to a query does not match that of Gaussian overlay-based shape similarity methods. Further, while these methods capture the global shape of a molecule very well, the local shape similarity, which is critical in comparing the shapes of small molecules, is not represented comprehensively. However, these methods open up several new areas of shape comparison, such as comparing the shape of ligands with that of binding pockets, which may be of immense utility for structure-based design.

#### Other Shape Similarity Approaches

There are many other approaches to shape representation and methods of similarity measurement in addition to those described above. Another way of representing molecular shape is to use molecular descriptors. Several shape-based descriptors have traditionally been used to compare small molecules and develop QSAR models. These descriptors mostly represent shape implicitly with other properties such as size, symmetry and atom distribution. These include Weighted Holistic Invariant Molecular (WHIM) descriptors of shape (Gramatica, 2006), shape indices, and descriptors for moments of the distribution of molecular volume (Mansfield et al., 2002). Most molecular descriptors are alignment independent; however, some, such as moments of the distribution of molecular volume, require superposition of molecules. Comparative Molecular-Field Analysis (CoMFA) (Cramer et al., 1988) is a widely used technique to develop QSAR models and understand SAR for a series of compounds. CoMFA compares a set of molecules by placing them on a grid and calculating potential energy fields. The differences and similarities between molecules are then correlated with differences and similarities in their biological activities. As CoMFA requires molecules to be pre-aligned, the 3D shape similarity of molecules can be obtained based on potential energy fields. A modification of the CoMFA approach, Comparative Molecular Moment Analysis (CoMMA), calculates geometric moments from the center of mass, center of charge and center of dipole of a molecule (Silverman and Platt, 1996). However, superposition of molecules is not required in this approach. The shape of molecules can also be inferred from structural descriptors such as molecular quantum numbers (MQNs) (Nguyen et al., 2009; van Deursen et al., 2010). The MQN system represents counts for 42 structural features such as atom, ring and bond types, polar groups and topology. It has been used to effectively classify and visualize large libraries of organic molecules such as ZINC, GDB, and PubChem.

The volumetric aligned molecular shapes (VAMS) method (Koes and Camacho, 2014) uses data structures to represent and compare shapes of 3D molecules. It applies inclusive and exclusive shape constraints to estimate the similarity in shapes of 3D molecules. In the VAMS method, the shape of a molecule is represented by the solvent-excluded volume calculated from its heavy atoms using a water probe of radius 1.4 Å. The volume is discretized on a grid of 0.5 Å resolution where each point on the grid represents a voxel, or 3D pixel. An octree data structure is used to store the voxelized volume. This method requires all the shapes to be pre-aligned to a standard reference coordinate frame. The conformations of the molecule are aligned using the moment of inertia of heavy atoms. Voxelized shapes are compared using Tanimoto similarity (Rogers and Tanimoto, 1960), which measures the ratio of the number of voxels common to the two shapes to the number of voxels present in either of the shapes. The performance of the VAMS method as a standalone virtual screening tool is not better than that of many other shape similarity methods, e.g., ROCS; however, VAMS is reasonably fast and could perform a million shape comparisons in about 10 s. Hence, it may be used as a pre-filtering tool for other shape similarity methods. Fragment oriented molecular shape (FOMS) is an extension of the VAMS method, where shapes are aligned using fragments (Hain et al., 2016).
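A voxel-based Tanimoto comparison of the kind described above can be sketched as follows. This is a simplified illustration and not the VAMS implementation: it assumes the molecules are already in a common reference frame, it marks voxels inside plain atom-centered spheres rather than computing a solvent-excluded volume, and it stores the grid as a dense boolean array instead of an octree. The radius, grid extent and function names are assumptions made for the example.

```python
import numpy as np


def voxelise(coords, radius=1.7, spacing=0.5, half_extent=10.0):
    """Boolean grid marking voxels whose centre lies inside any atom sphere."""
    coords = np.asarray(coords, dtype=float)
    axis = np.arange(-half_extent, half_extent + spacing, spacing)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.zeros(gx.shape, dtype=bool)
    for x, y, z in coords:
        grid |= (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2 <= radius ** 2
    return grid


def voxel_tanimoto(grid_a, grid_b):
    """Voxels occupied in both shapes divided by voxels occupied in either."""
    union = np.logical_or(grid_a, grid_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(grid_a, grid_b).sum() / union)
```

Since the comparison reduces to logical operations on pre-computed grids, the voxelization cost is paid once per library conformer, which helps explain why such a method is attractive as a pre-filter.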

# APPLICATION OF SHAPE SIMILARITY METHODS IN DRUG DISCOVERY

#### Application in Virtual Screening

Shape similarity attempts to quantify the resemblance between two molecules utilizing the several descriptions of molecular shape described previously. This approach has been successfully utilized as a virtual screening tool to identify molecules similar to a given query from a library of chemicals. Several retrospective studies have been published demonstrating the utility of shape based similarity methods over 2D and other 3D similarity methods (Nagarajan et al., 2005; Renner and Schneider, 2006; Ballester et al., 2009; Giganti et al., 2010; Venkatraman et al., 2010; Ballester, 2011; Hu et al., 2012, 2016). Several studies also presented computational approaches to improve the performance and efficiency of shape comparison methods. One study recommended the selection of a suitable query and incorporation of chemical information such as pharmacophoric features of the query molecule to improve the performance of shape-based virtual screening (Kirchmair et al., 2009). Another study demonstrated that the application of a machine learning method, Support Vector Machine (SVM), to shape comparisons can significantly improve virtual screening efficiency (Sato et al., 2012). The need for automation was further suggested, especially to carry out multiple query searches which ensure a diverse hit list (Kalászi et al., 2014).

Apart from retrospective tests, many prospective applications of shape similarity have been published in the literature. In numerous studies, it was employed as the only virtual screening approach to filter and prioritize compounds from a large library to a number small enough for biological testing (Rush et al., 2005; Boström et al., 2007; Freitas et al., 2008; Ballester et al., 2010, 2012; Kumar et al., 2012; Vasudevan et al., 2012; Sun et al., 2013; Hoeger et al., 2014; Patil et al., 2014; Temml et al., 2014; Chen et al., 2016; Bassetto et al., 2017). Among these studies, the shape based identification of a compound active on colon cancer cell line is quite interesting (Patil et al., 2014). This study employed USR to screen a database of approved drugs. The top virtual screening hit displayed dose dependent inhibition of a colon cancer cell line. This study not only repurposed a known drug but also demonstrated the applicability of shape similarity methods for phenotypic screens, e.g., anti-bacterial or antifungal drug discovery where molecular target is often unknown. This is especially important considering the fact that most approved drugs come from phenotypic screens (Swinney and Anthony, 2011). In other investigations, it was combined with other ligand-based virtual screening methods or structure based approaches such as molecular docking. Among ligand-based approaches, shape similarity was frequently used in combination with electrostatic similarity. As electrostatic comparison between two small molecules requires precise alignment between them, shape matching was first performed and then followed by the electrostatic potential similarity calculations. 
This hierarchical combination was utilized to discover a wide variety of binders including enzyme inhibitors (Hevener et al., 2011), mRNA binders (Kaoud et al., 2012), chemical probes (Naylor et al., 2009), protein-protein interaction inhibitors (Boström et al., 2013), SUMO activating enzyme 1 inhibitors (Kumar et al., 2016), and Aurora kinase A inhibitors (Kong et al., 2018).
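To make the USR screen mentioned above more concrete, the following is a minimal sketch of an atomic distance-based descriptor in the spirit of USR. It assumes the commonly described recipe: distances from every atom to four reference points (the centroid, the atom closest to the centroid, the atom farthest from the centroid, and the atom farthest from that atom), each distribution summarized by three moments. The moment definitions and the function names `usr_descriptor` and `usr_similarity` are illustrative assumptions, not the published implementation.

```python
import math


def _moments(dists):
    """Mean, standard deviation and signed cube root of the third central moment.

    Assumption: implementations differ in how the second and third moments
    are scaled; this variant keeps all three in distance units.
    """
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, math.sqrt(var), skew]


def usr_descriptor(coords):
    """Return a 12-number shape descriptor for a list of (x, y, z) atom positions."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor += _moments([math.dist(c, ref) for c in coords])
    return descriptor


def usr_similarity(a, b):
    """Inverse of (1 + mean absolute difference); 1.0 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))
```

Because the descriptor is built only from interatomic distances, it does not change when the molecule is translated or rotated, so no alignment step is needed before scoring.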

Although shape-based approaches demonstrated considerable success in ligand-based virtual screening studies, the true potential of the method was realized when it was combined with structure-based methods in a hierarchical or parallel manner. To use shape-based virtual screening effectively, several groups employed hierarchical virtual screening (Kumar and Zhang, 2015), in which it was coupled with molecular docking. As shape matching calculations are considerably faster than structure-based virtual screening methods, they are generally used during the initial steps of a hierarchical virtual screening protocol. This hierarchical combination of shape similarity with molecular docking has been successfully employed in the discovery of type II dehydroquinase inhibitors (Ballester et al., 2012), MDM2 inhibitors (Houston et al., 2015), 11β-hydroxysteroid dehydrogenase 1 inhibitors (Xia et al., 2011), PPARγ partial agonists (Vidović et al., 2011), inhibitors of chemokine receptor 5 (CCR5) N-terminus binding to the gp120 protein (Acharya et al., 2011), Grb7-based antitumor agents (Ambaye et al., 2013), fungal trihydroxynaphthalene reductase inhibitors (Brunskole Švegelj et al., 2011), non-steroidal FXR ligands (Fu et al., 2012; Wang et al., 2015), novel SIRT3 scaffolds (Salo et al., 2013), protein kinase CK2 inhibitors (Sun et al., 2013), SUMO conjugating enzyme inhibitors (Kumar et al., 2014a), and chemokine receptor type 4 inhibitors (Das et al., 2015). Combining shape similarity methods with structure-based methods such as docking provides several advantages. Ultrafast shape comparison methods such as USR can very quickly filter large libraries for compounds that are shaped similarly to known inhibitors. Hence, the time required for docking can be drastically reduced by eliminating compounds that do not fit in the binding pocket. Moreover, for some proteins the inhibitory activity is driven by key moieties in the compounds, e.g., metal-binding groups for metalloproteins, reactive functional groups for cysteine proteases, and hinge-binding groups for kinases. In these scenarios, docking helps to prioritize compounds based on the interactions they make with the binding pocket. Sometimes the differences in shape similarity scores between compounds are very small, which makes it difficult to select compounds for biological assay. Here, docking of the shape similarity hits can also help to prioritize compounds for purchase or chemical synthesis. However, the combination of shape similarity with molecular docking is not always advantageous, especially for proteins with highly flexible binding pockets, multiple pocket conformations, or homology models, where accurate docking is challenging. A virtual screening scheme in which USR hits were reranked using the AutoDock Vina score produced no active hits because docking was performed in a quite different pocket conformation (Hoeger et al., 2014). In another study, shape-based virtual screening alone produced better hit rates than the hierarchical combination of shape similarity and docking methods (Ballester et al., 2012). In numerous studies, shape similarity calculations along with molecular docking were complemented with other approaches such as 2D similarity search, pharmacophore modeling, electrostatic potential matching, machine learning, and the MM-PBSA method (Mochalkin et al., 2009; Alcaro et al., 2013; Poongavanam and Kongsted, 2013; Wiggers et al., 2013; Hamza et al., 2014a; Kumar et al., 2014b; Pala et al., 2014; Feng et al., 2015; Corso et al., 2016; Mangiatordi et al., 2017; Xia et al., 2017). The use of different virtual screening approaches in parallel has been suggested previously, as different methods tend to identify different sets of compounds and virtual screening hit rates could be improved by employing them in a parallel manner (Sheridan and Kearsley, 2002).
In parallel virtual screening, several methods are run independently and the top hits from each method are selected. Parallel combination of various ligand- and structure-based methods with shape similarity approaches was found to be productive, especially for challenging targets (Swann et al., 2011; Langdon et al., 2013; Hoeger et al., 2014). A parallel virtual screening campaign to identify inhibitors of PRL-3, employing several ligand- and structure-based methods against the same screening library, produced contrasting hit rates for the different approaches (Hoeger et al., 2014). Many prospective applications suggest the utility of hierarchical or parallel combination of shape similarity approaches with other ligand- and structure-based methods. However, no benchmark study demonstrating their utility has been published. A systematic study would help researchers to identify areas where the combination of several approaches performs better than shape-based virtual screening alone.
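The difference between the hierarchical and parallel protocols described above can be sketched in a few lines. This is an illustration under stated assumptions, not a protocol from any of the cited studies: the function names, the `keep_fraction` and `n_hits` parameters, and the convention that docking scores are energies (lower is better) are all hypothetical.

```python
def hierarchical_screen(library, shape_score, dock_score, keep_fraction=0.1, n_hits=10):
    """Fast shape filter first, then dock only the survivors."""
    by_shape = sorted(library, key=shape_score, reverse=True)
    n_keep = max(1, int(len(by_shape) * keep_fraction))
    survivors = by_shape[:n_keep]
    # Assumption: docking scores are energies, so lower values rank first.
    return sorted(survivors, key=dock_score)[:n_hits]


def parallel_screen(library, scorers, n_per_method=10):
    """Run each method independently and pool the top hits of every method.

    scorers: list of (score_function, higher_is_better) pairs.
    """
    hits = []
    for score, higher_is_better in scorers:
        ranked = sorted(library, key=score, reverse=higher_is_better)
        for compound in ranked[:n_per_method]:
            if compound not in hits:
                hits.append(compound)
    return hits
```

In the hierarchical variant a compound discarded by the shape filter is never docked, which saves time but also means that errors in the first stage cannot be recovered later; the parallel variant keeps the methods independent at the cost of running all of them on the full library.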

One application of shape similarity methods is to hop from one chemical scaffold to another in order to improve potency, selectivity, and physicochemical properties, and to create novel intellectual property positions (Hu et al., 2017). Shape similarity methods are capable of identifying several scaffolds that are structurally different from the query compounds, and each scaffold may be pursued separately. Scaffold hopping is highly effective in rescuing problematic leads that cannot be pursued further due to problems in selectivity, pharmacology, or pharmacokinetics. Both atomic distance-based and Gaussian overlay shape similarity methods can perform scaffold hopping effectively, as exemplified by several prospective studies. In one of the first prospective applications of shape similarity-based methods to scaffold hopping, small molecule inhibitors of the ZipA-FtsZ protein-protein interaction were identified (Rush et al., 2005). Some recent scaffold hopping applications include the identification of inhibitors of arylamine N-acetyltransferases (Ballester et al., 2010), type II dehydroquinase inhibitors (Ballester et al., 2012), inhibitors of sumoylation enzymes (Kumar et al., 2014b, 2016), anti-tubercular agents (Hamza et al., 2014b; Wavhale et al., 2017), anti-tumor agents (Ge et al., 2014), 11β-HSD1 inhibitors (Shave et al., 2015), leucine zipper kinase inhibitors (Patel et al., 2015), kynurenine 3-monooxygenase inhibitors (Shave et al., 2018), and a partial agonist of the inositol trisphosphate receptor (Vasudevan et al., 2014). In addition to prospective applications, rigorous benchmarking of shape similarity methods for their scaffold hopping capabilities is important. However, systematic benchmarking is challenging because there is no agreement on the definition of a scaffold. In one retrospective study, the scaffold hopping potential of the atomic distance-based shape similarity method USRCAT was demonstrated using the DUD-E dataset (Schreyer and Blundell, 2012).
For the tested benchmark dataset, USRCAT was capable of identifying structurally dissimilar active hits that could not be retrieved using topological similarities. Shape similarity was also used to repurpose existing drugs for previously unknown activities (Vasudevan et al., 2012). Another application is in silico target fishing, i.e., the identification of protein targets of orphan chemical compounds. In one recent study, the target of antifungal macrocyclic amidinoureas was identified following a shape similarity screening (Maccari et al., 2017). A representative structure from a series of macrocyclic amidinoureas was used as a query to retrieve the most similar crystallographic ligands from all solved crystal structures. A list of targets prioritized by similarity score, followed by docking and an enzymatic assay, revealed Trichoderma viride chitinase as the target of this class of compounds. Along the same lines, retrospective studies showed that the combination of molecular shape and chemical structure similarity can reliably achieve biological target prediction (Abdulhameed et al., 2012; Gfeller et al., 2013). Additionally, shape similarity comparison based on a spherical harmonics surface representation has been shown to predict drug promiscuity (Pérez-Nueno et al., 2011). Furthermore, shape similarity comparisons could also be used to predict the subtype selectivity of ligands (Kuang et al., 2016).
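The idea of a scaffold hop, a compound that resembles the query in shape but not in 2D topology, can be expressed as a simple filter. The sketch below assumes that a shape similarity score to the query has already been computed for each candidate and that 2D fingerprints are available as sets of on-bits; the thresholds `min_shape` and `max_2d` and the function names are hypothetical choices for illustration only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def scaffold_hops(query_bits, candidates, min_shape=0.7, max_2d=0.3):
    """Return names of candidates that are shape-similar but topologically dissimilar.

    candidates: iterable of (name, shape_similarity_to_query, fingerprint_bits).
    """
    hops = []
    for name, shape_sim, bits in candidates:
        if shape_sim >= min_shape and tanimoto(query_bits, bits) <= max_2d:
            hops.append(name)
    return hops
```

A close analog passes the shape threshold but fails the 2D dissimilarity threshold, so it is not reported as a hop; this mirrors the retrospective observation that shape methods can retrieve actives that topological similarity misses.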

One important application of shape similarity methods in drug discovery is the clustering of known inhibitors of a protein target. As the performance of most shape-based methods depends strongly on the selection of the right query for virtual screening (Kirchmair et al., 2009), special attention has been paid to the development of methods dealing with this problem. It has been reported that clustering known inhibitors based on their shapes can help to identify the optimal query for virtual screening (Pérez-Nueno and Ritchie, 2011). Clustering of spherical harmonics-based consensus shapes assisted in the identification of ligands that bind to different regions of the binding pocket of some protein targets such as CCR5 (Pérez-Nueno et al., 2008). Further, the clustering of molecular shapes also helped in the identification of promiscuous protein targets and ligands (Pérez-Nueno and Ritchie, 2011). Selection and use of high-quality compound libraries is an important aspect of high-throughput screening (HTS). However, testing a large number of compounds is not economically viable. In silico methods, mostly based on 2D similarity, are commonly employed to generate a subset or focused set for HTS (Huggins et al., 2011; Dandapani et al., 2012). The limitation of 2D similarity methods is that they ignore inherent properties such as the shape of a molecule. Using shape-based clustering of large compound libraries to create a quality HTS library presents several advantages. Clustering of molecular libraries with atomic distance-based methods such as USR can achieve computational efficiency similar to or significantly better than that of 2D fingerprint-based methods. Moreover, it ensures maximum diversity with fewer compounds in the HTS library.
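As an illustration of shape-based library clustering, the sketch below applies a simple leader (sphere-exclusion style) algorithm to precomputed shape descriptors. The choice of algorithm, the similarity threshold, and the function names are assumptions made for this example; the cited studies may use different clustering schemes. Descriptor vectors are shortened to three numbers in the usage test for readability.

```python
def usr_similarity(a, b):
    """Inverse of (1 + mean absolute difference) between two descriptor vectors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


def leader_cluster(descriptors, threshold=0.8):
    """Group molecules around leaders; returns {leader_id: [member_ids]}.

    descriptors: {molecule_id: descriptor vector}, processed in insertion order.
    A molecule joins the first leader it resembles at or above the threshold,
    otherwise it becomes a new leader.
    """
    clusters = {}
    for mol_id, desc in descriptors.items():
        for leader in clusters:
            if usr_similarity(desc, descriptors[leader]) >= threshold:
                clusters[leader].append(mol_id)
                break
        else:
            clusters[mol_id] = [mol_id]
    return clusters
```

The leaders then form a shape-diverse subset: `list(leader_cluster(descriptors))` yields one representative per cluster, and every descriptor comparison is a cheap vector operation.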

Apart from employing ligand 3D shape similarity as a virtual screening method, several groups adopted it to improve the performance of other virtual screening methods. Molecular docking is one such method widely used in drug discovery. Although there has been significant progress in the development of molecular docking methods, challenges still remain both in sampling and scoring of binding poses within protein binding pockets. In the last few years, several methods were developed that utilized ligand 3D shape similarity to improve both sampling and scoring performance of molecular docking. The shape overlap with known crystallographic ligands for the target protein was utilized to guide ligand conformational sampling toward critical regions of protein binding site (Wu and Vieth, 2004). Other methods used shape similarity based alignment for the selection of reliable poses among many docking generated poses (Fukunishi and Nakamura, 2008, 2012; Anighoro and Bajorath, 2016; Kumar and Zhang, 2016a). Ligand 3D shape similarity was also a key component of many pose prediction methods where shape similarity with existing ligand bound crystal structures was utilized to predict binding poses of unknown ligands (Kelley et al., 2015; Huang et al., 2016; Kumar and Zhang, 2016b,c). Several of these methods demonstrated excellent retrospective and prospective performance. Moreover, shape similarity also facilitated the improvement in scoring and rank-ordering performance of a docking method. Several methods have reported improved virtual screening performance of a docking method when shape overlap with crystallographic ligands was employed to select the best binding pose of ligands in a screening library (Roy et al., 2015; Anighoro and Bajorath, 2016). Consideration of protein flexibility in molecular docking is a challenging problem and several methods have been developed to tackle it (B-Rao et al., 2009). 
Among these, receptor ensemble based methods demonstrated reasonable performance (Bottegoni et al., 2011) where the receptor ensemble is selected either from many crystallographic structures or from those generated by in silico methods such as molecular dynamics simulation. It has been shown previously that the selection of receptor ensemble based on binding pocket shape similarity is an effective way of considering receptor flexibility in molecular docking (Osguthorpe et al., 2012). Further, one method suggested utilizing a single suitable receptor for each ligand in a screening library instead of docking all compounds to multiple receptor structures (Kumar and Zhang, 2018). It was also shown that single suitable receptor selection based on ligand 3D shape similarity is superior to 2D similarity based selection.
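The single-receptor selection idea can be sketched as follows, assuming that each receptor structure in the ensemble is represented by the shape descriptor of its co-crystallized ligand and that the library ligand is docked only to the receptor whose ligand it resembles most. This representation, the function name `pick_receptor`, and the `similarity` callable are assumptions for illustration and are not taken from the cited method.

```python
def pick_receptor(ligand_desc, receptors, similarity):
    """Return the id of the receptor whose bound ligand is most similar.

    receptors: {receptor_id: descriptor of the ligand bound in that structure}.
    similarity: callable taking two descriptors and returning a score
    where larger means more similar.
    """
    return max(receptors, key=lambda r: similarity(ligand_desc, receptors[r]))
```

With this selection step each library compound requires one docking run instead of one run per ensemble member.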

# Applications in Protein Structure Comparison

Evaluation of structural similarity between protein structures has many applications, including but not limited to classification of protein structures, inference of evolutionary relationships between protein structures, identification of templates for homology modeling, functional annotation, and analysis of protein-protein interactions. Conventional methods for protein structure comparison are based on the alignment of protein atoms or residues. These methods require extensive rotational and translational sampling, which limits their utility for large-scale protein structure comparisons. Several methods have been developed that use shape similarity to detect global or local similarity between protein structures. These methods follow the classification described previously, which includes Gaussian overlay-based methods and surface-based methods using spherical harmonic descriptors, 3D Zernike descriptors, etc. Among these, surface-based methods were first developed to measure similarity between protein structures and only later applied to small molecules. Several methods of protein structure comparison employed SH to represent the shapes of protein structures (Tao et al., 2005; Gramada and Bourne, 2006; Konarev et al., 2016). Like SH, 3D Zernike-based moments are also suitable for comparing the shapes of protein structures (Sael et al., 2008b; **Figure 4A**). Not only are they suitable for estimating the similarity between two proteins, but their rotation- and translation-invariant nature also allows fast, real-time searches for similar proteins in structural databases such as the PDB (La et al., 2009; Kihara et al., 2011; Xiong et al., 2014). A Gaussian mixture model-based protein shape similarity method (Kawabata, 2008) also allows large-scale comparisons of proteins with data from the PDB and EMDB. This method has been implemented as Omokage search in PDB Japan (Suzuki et al., 2016; Kinjo et al., 2017).
The server compares global shapes of proteins and results are obtained reasonably fast within 1 min after submission of a query. Large scale comparison of protein structures based on shape is useful in functional annotation, selection of templates for comparative modeling etc. An application of shape comparison method to protein classification has also been reported (Daras et al., 2006).

One important application of shape matching is the evaluation of similarity between protein binding pockets. This field is especially interesting as sequence and structural alignments are often not useful when comparing binding pockets of proteins with different folds. As protein binding pockets are much more conserved than protein structures (Gao and Skolnick, 2013), a reliable comparison between protein binding pockets is crucial for predicting protein functions, polypharmacology of ligands and for drug repurposing. Numerous methods based on distinct structural representations as described previously were developed in the last decade. One such method employed spherical harmonics to represent and compare the shapes of protein binding pockets (Morris et al., 2005). This method was later extended to compare the shape of protein binding pockets with that of binding ligands (Kahraman et al., 2007). PocketMatch compares two binding pockets based on the sorted list of distances that captured chemical nature and 3D shape of the binding pocket (Yeturu and Chandra, 2008). Another method based on property-encoded shape distributions (PESD) combines the concept of shape distributions with the chemical environment of the binding pocket surface to effectively capture binding pocket similarities (Das et al., 2009). Pocket-Surfer utilizes pseudo-Zernike descriptors and 3D Zernike descriptors to represent and compare properties and 3D shapes of binding pockets (Chikhi et al., 2010). An extension of this method, Patch-Surfer searches local similarity by representing a binding pocket as amalgamation of segmented surface patches which are described by properties such as shape, electrostatic potential, concaveness and hydrophobicity (Sael and Kihara, 2012). Similarity between protein cavities was also measured by representing the pockets by pharmacophoric grid points and aligning them by optimizing their volume overlap (Desaphy et al., 2012).

The concept of pocket similarity was also extended to complementarity between binding pockets and ligands. This gave rise to a new virtual screening methodology based on shape complementarity between binding pockets and ligands. The PL-Patch-Surfer2 program evaluates the compatibility between a ligand and a binding pocket by measuring the complementarity between the ligand surface and local surface patches in the binding pocket (Shin et al., 2016a,b; **Figure 4C**). The program uses 3DZD to represent molecular shape, while physicochemical properties are also mapped onto the surface. The method was evaluated on benchmark datasets and performed better than two docking programs. Spherical harmonics expansion coefficients have also been employed in the approximation and comparison of binding pockets and ligand surfaces (Cai et al., 2002). The complementarity was demonstrated using 35 protein-ligand complexes. Elekit adopted the concept of shape and electrostatic complementarity to discover small molecule inhibitors of protein-protein interactions (Voet et al., 2013). Elekit assesses the similarity between small molecules and protein ligands of a receptor protein based on the electrostatic potential values stored on a 3D grid.

# Applications in Fitting of Atomic Models Into Cryo-Electron Microscopy Maps

Recent developments in cryo-electron microscopy (cryo-EM) have helped researchers to overcome the resolution barrier and have provided structural and mechanistic insights into difficult proteins and large protein assemblies. Most of these improvements came from advances in sample preparation, electron detector technologies, microscopes, and computational data processing. Computational methods played an important part in particle picking, particle reconstruction, and the building and fitting of structures into cryo-EM maps. In recent years, several methods were developed to improve the building, fitting, and refinement of protein structures in cryo-EM maps (Esquivel-Rodríguez and Kihara, 2013). Among these, a few methods employed shape similarity to fit atomic structures of protein subunits into the cryo-EM maps of multi-subunit proteins. One method, Gaussian Mixture macromolecule FITting (gmfit), uses Gaussian mixture models (GMMs) to represent the shape of cryo-EM maps and atomic models (Kawabata, 2008). GMMs are probability distribution functions obtained by combining many 3D Gaussian functions. Both the cryo-EM map and the atomic models are first converted into GMMs; the GMM of a single subunit is then fitted into the GMM of the protein complex using random and gradient-based local search. Finally, the fit between the atomic models and the cryo-EM map is obtained from the position and orientation of the GMMs. This method is reasonably fast and can fit multiple subunits with reasonable accuracy. PDB Japan (https://pdbj.org) has implemented this method in its EM navigator utility to provide shape-based structural similarity search against protein databases (Kinjo et al., 2017). Another method adopted a surface-based approach in which 3DZD was used to represent and compare isosurfaces derived from low-resolution cryo-EM maps of protein structures (Sael and Kihara, 2010; **Figure 4D**).
It was demonstrated that 3DZD can distinguish proteins of different folds even at a low resolution of 15 Å. A web-based platform for comparing cryo-EM maps was also developed by the same group (Esquivel-Rodríguez et al., 2015; Han et al., 2017). A similar method used 3D Zernike moments to search a database of protein structures for structures matching a cryo-EM map (Yin and Dokholyan, 2011). The EMLZerD method also used 3DZD to fit multiple structures into a cryo-EM map (Esquivel-Rodríguez and Kihara, 2012). The method generates hundreds of putative configurations of subunit arrangement using a protein-protein docking method. These configurations are then compared with a cryo-EM map using 3DZD and the Euclidean distance. The biggest advantage of 3D Zernike moment methods is that they are rotation and translation invariant, so no computationally expensive rigid-body or flexible structural alignment step is required. Moreover, these methods enable the screening of proteins from structural databases such as the PDB to find models that can fit into a cryo-EM map.
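Because rotation- and translation-invariant descriptors need no alignment, a database search reduces to ranking precomputed vectors by distance to the query vector. The following sketch assumes that such descriptor vectors (for example, 3DZD-like vectors) have already been computed elsewhere; computing the descriptors themselves is outside its scope, and the function name and data layout are hypothetical.

```python
import math


def rank_by_descriptor(query, database, top_n=5):
    """Rank database entries by Euclidean distance to the query descriptor.

    database: {entry_id: descriptor vector of the same length as query}.
    Returns a list of (entry_id, distance), most similar shape first.
    """
    scored = [(math.dist(query, vec), entry_id) for entry_id, vec in database.items()]
    scored.sort()
    return [(entry_id, dist) for dist, entry_id in scored[:top_n]]
```

Each comparison costs one pass over a short vector, which is what makes real-time searches over large structural databases feasible.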

# CONCLUSION AND FUTURE DIRECTIONS

3D shape similarity methods have contributed immensely to the overall acceptance of computational virtual screening methods in drug discovery. Most methods for comparing the shapes of small molecules and macromolecules took inspiration from approaches developed to compare the shapes of 3D objects in the field of computational geometry. Several approaches have been developed, ranging from extremely fast atomic distance-based methods to comparatively slower, mathematically complex methods such as SH and 3DZD. Among all the 3D shape comparison methods, atomic distance-based and Gaussian overlay-based methods are the most widely used. These approaches possess several advantages over surface-based methods. Atomic distance-based methods offer an extremely fast way of comparing the shapes of small molecules. This has facilitated the screening of very large libraries of millions of compounds within a few seconds. Moreover, screening large libraries increases the probability of finding novel chemical scaffolds. Furthermore, as most of these methods depend on shape rather than the underlying chemical structure, scaffold hopping can be achieved conveniently. Another possible application of these fast shape similarity methods is the clustering of large chemical spaces to generate high-quality, shape-diverse HTS libraries. Although Gaussian overlay-based methods are slower than atomic distance-based methods, they are fast enough to allow high-throughput virtual screening. GPU implementation of these methods is not very difficult, as exemplified by the development of several GPU-compatible programs such as FastROCS, PAPER, and gWEGA, resulting in a further increase in processing speed. Another advantage of Gaussian-based methods is that they allow visualization, as they require alignment of the molecules prior to shape similarity calculations.
Visualization is helpful in understanding the features responsible for biological activity and is critical for the optimization of a molecule, especially for molecules with low structural similarity to the query compound. However, a suboptimal alignment can lead to errors in volume overlap calculations and thereby affect similarity scores and visualization. As alignment is the key component of Gaussian overlay methods, efforts should be focused on improving molecular alignment. Some of these methods employ chemical features to refine global overlays. As alignment is a global optimization problem, molecular alignment could also be improved by employing fast local optimization methods. Both atomic distance-based and Gaussian overlay-based shape similarity methods handle ligand flexibility by employing a conformational ensemble. The performance thus depends indirectly on conformation generation methods. Current state-of-the-art conformation generation methods still struggle to generate near-native conformations of ligands such as peptidomimetics and macrocycles. Development of novel conformation generation approaches that use knowledge from experimental databases such as the CSD and PDB will improve the performance of shape-based virtual screening approaches. Surface-based methods such as SH expansion coefficients and 3DZD are suitable for comparing macromolecules and atomic models with electron density maps; however, comparatively little effort has been made to apply them to small molecules. One advantage of surface-based methods is that a protein-ligand complementarity search is possible by comparing the enclosed shapes of binding pockets and ligands. This will be useful in cases where ligand-based virtual screening methods cannot be used due to the lack of active compounds. Finally, shape-based similarity can be used in combination with other ligand- and structure-based approaches, in either a hierarchical or a parallel manner, to improve hit rates, especially for difficult targets.
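The dependence on conformer generation noted above follows directly from how ensembles are typically scored. A minimal sketch, assuming each molecule is represented by a list of per-conformer shape descriptors and that the molecule-level score is the best score over all conformer pairs (a common convention, taken here as an assumption); the function name and the `similarity` callable are hypothetical.

```python
def ensemble_similarity(query_confs, candidate_confs, similarity):
    """Best similarity over all pairs of query and candidate conformers.

    query_confs, candidate_confs: lists of per-conformer descriptor vectors.
    similarity: callable returning a score where larger means more similar.
    """
    return max(similarity(q, c) for q in query_confs for c in candidate_confs)
```

If the conformer that matches the query is missing from the candidate's ensemble, the maximum is taken over poorer matches and the compound is ranked lower, regardless of how good the shape comparison itself is.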

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# ACKNOWLEDGMENTS

We acknowledge RIKEN ACCC for the supercomputing resources at the Hokusai GreatWave supercomputer. The research in our laboratory was partially supported by Platform Project for Supporting Drug Discovery and Life Science Research [Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)] from AMED under Grant Number JP18am0101082. We thank members of our lab for help and discussions.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer XL and handling Editor declared their shared affiliation.

Copyright © 2018 Kumar and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Structure, Function, and Modulation of γ-Aminobutyric Acid Transporter 1 (GAT1) in Neurological Disorders: A Pharmacoinformatic Prospective

Sadia Zafar and Ishrat Jabeen\*

Research Center for Modeling and Simulation, National University of Sciences and Technology, Islamabad, Pakistan

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Mariya al-Rashida, Forman Christian College, Pakistan Andrea Ilari, Istituto di Biologia e Patologia Molecolari (IBPM), Consiglio Nazionale Delle Ricerche (CNR), Italy Mariafrancesca Scalise, University of Calabria, Italy

#### \*Correspondence:

Ishrat Jabeen ishrat.jabeen@rcms.nust.edu.pk

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 02 February 2018 Accepted: 20 August 2018 Published: 11 September 2018

#### Citation:

Zafar S and Jabeen I (2018) Structure, Function, and Modulation of γ-Aminobutyric Acid Transporter 1 (GAT1) in Neurological Disorders: A Pharmacoinformatic Prospective. Front. Chem. 6:397. doi: 10.3389/fchem.2018.00397

γ-Aminobutyric acid (GABA) transporters (GATs) belong to the sodium- and chloride-dependent transporter family and are widely expressed throughout the brain. Notably, GAT1 is accountable for sustaining 75% of the synaptic GABA concentration and mediates its transport to the GABA<sub>A</sub> receptors to initiate the receptor-mediated inhibition of post-synaptic neurons. Imbalance in ion homeostasis has been associated with several neurological disorders related to the GABAergic system. However, inhibition of GABA uptake by these transporters has been accepted as an effective approach to enhance GABAergic inhibitory neurotransmission in the treatment of seizures in epilepsy and other neurological disorders. Here, we review computational methodologies, including molecular modeling, docking, and molecular dynamics simulation studies, to underscore the structure and function of GAT1 in the GABAergic system. Additionally, various SAR and QSAR methodologies are reviewed to probe the 3D structural features of inhibitors required to modulate GAT activity. Overall, the present review provides an overview of the crucial role of GAT1 in the GABAergic system and of its modulation to evade neurological disorders.

Keywords: γ-aminobutyric acid (GABA), GABA transporters (GATs), homology modeling, molecular dynamics (MD), QSAR

# INTRODUCTION

Transporters or solute carriers are membrane-bound proteins involved in the transport of signaling molecules such as ions, nutrients, and various amino acids. The transport of impermeant solutes against a concentration gradient is ATP mediated. Among these transporters, the solute carrier (SLC) transporters are one of the major classes of human transport proteins; they act as symporters, antiporters, and exchangers, and are classified into 55 families on the basis of variation in structural elements and biological functions (Hediger et al., 2013). In particular, SLC6 transporters (the sodium- and chloride-dependent neurotransmitter transporter family), including γ-aminobutyric acid transporters (GATs), the norepinephrine transporter (NET), the dopamine transporter (DAT), and the serotonin transporter (SERT), encoded by the SLC6A1-4 genes in humans, are known to be important for efficient neuronal synaptic transmission, thereby providing neurotransmitter homeostasis in the central nervous system (CNS) (Ben-Yona et al., 2011; Kristensen et al., 2011; Pramod et al., 2013).

NET, DAT, and SERT are further classified as monoamine transporters, whereas GATs are amino acid transporters, also known as GABA neurotransmitter transporters (Singh et al., 2007). The mammalian GATs are categorized into four subtypes, GAT1-3 and BGT1 (betaine GABA transporter), with respect to their amino acid sequence and pharmacological properties (Conti et al., 2004; Besedovsky et al., 2007; Parpura and Haydon, 2008). Briefly, the GAT1 and GAT3 subtypes account for the major proportion of GATs in the CNS. In particular, GAT1 is mainly expressed throughout the brain in neurons (Jin et al., 2011), specifically at the presynaptic terminals of the axons and also in minute concentrations in ganglia (Besedovsky et al., 2007; Rego et al., 2007; Wilson, 2011), whereas GAT3 is mainly localized at the perisynaptic astrocytes (Melone et al., 2015). In contrast, GAT2 and BGT1 are expressed in the liver, kidney, and meninges as well as at the blood-brain barrier (BBB) (Takanaga et al., 2001; Zhou and Danbolt, 2013).

The imbalance in homeostasis of various ions, including Na<sup>+</sup> and Cl<sup>−</sup>, due to dysregulation of monoamine transporters in neuronal cells is widely associated with the modulation of anxiety, appetite, mood, attention, depression, aggression, etc. (Singh et al., 2007). However, dysregulation of GATs (amino acid transporters) under pathological conditions results in excessive removal of the GABA neurotransmitter from the synapse, thereby leading to severe neurological and psychiatric illnesses such as Parkinson's disease, Alzheimer's disease, schizophrenia, and seizures in epilepsy (Hack et al., 2011; Schaffert et al., 2011). Generally, imbalance in GABAergic neuronal circuits due to lowered expression of glutamic acid decarboxylase (GAD), a key enzyme for the conversion of the excitatory neurotransmitter glutamate into the inhibitory neurotransmitter GABA in the presynaptic neuron, is associated with the onset of epileptic seizures (Gupta, 2011). Moreover, decreased levels of GABA transaminase (GABA-T), which catabolizes GABA into succinic semialdehyde, are also pronounced in choreoathetosis, encephalopathy, hypersomnolence, Alzheimer's disease, and epilepsy. Lower GABA-T levels lead to higher levels of GABA in the intraneuronal cytoplasm, which causes certain pathological/psychiatric and pharmacological effects (Sadowski, 2003). Markedly, as GATs are in direct contact with the GABA neurotransmitter in the extracellular space, of all the stated components of the GABAergic system, GATs have attained significant importance for maintaining the concentration gradient under abnormal conditions (Yamashita et al., 2005).

Notably, GAT1 is mainly involved in GABA binding and in its transport from the cytoplasm to the extracellular space (reverse mode) and from the extracellular space back into the cytoplasm (forward mode). Thus, malfunctioning of GAT1 may delay communication with the post-synaptic GABA receptors (Scimemi, 2014), which may result in various neurological disorders (Hack et al., 2011; Schaffert et al., 2011). Given the pivotal role of GAT1 in the GABAergic transport mechanism, it has been recognized as a potential therapeutic target for decades (Bialer et al., 2007). Therefore, inhibition of GABA re-uptake [either through clinically tested GABA reuptake inhibitors (GRIs) or through the GAT1-selective, FDA-approved antiepileptic drug Tiagabine (Trimble and Schmitz, 2011)] to block the excessive removal of GABA from the synapse is the most accepted strategy to maintain the concentration gradient and the normal activity of GABA at the synaptic cleft (Zhou et al., 2007, 2009; Krishnamurthy and Gouaux, 2012; Scimemi, 2014). This review therefore highlights the structural and functional properties of GAT1 and elucidates the important 3D structural features of its antagonists. Additionally, pharmacoinformatics strategies, including quantitative structure-activity relationship (QSAR), pharmacophore modeling, homology modeling, molecular docking, and molecular dynamics (MD) studies, are highlighted to underscore the overall binding hypothesis of human γ-aminobutyric acid transporter 1 (hGAT1) modulators.

#### Mechanism of Action of GATs

The GABAergic mechanism starts with the conversion of the excitatory neurotransmitter glutamate into the inhibitory neurotransmitter GABA by the enzyme glutamate decarboxylase (GAD) in the mature mammalian brain (Gropper and Smith, 2012). This conversion is followed by the packaging of GABA into synaptic vesicles and its release. Vesicular uptake gives priority to newly synthesized GABA over preformed GABA. However, the mechanism underlying this preferential supply of glutamate to the inhibitory synaptic terminals, which keeps the vesicles filled with freshly formed GABA, is not completely understood to date (Stafford et al., 2010). It has been proposed that GABAergic neuronal networks are mainly responsible for the synthesis and release of vesicle-packed GABA, along with the respective ions, from presynaptic nerve terminals into the synaptic cleft down their electrochemical gradient, as shown in **Figure 1** (Deidda et al., 2014). The regulation of GAT function depends on a wide variety of signaling cascades, including "second messengers" (such as pH, kinases, arachidonic acid, and phosphatases) and "synaptic proteins" (such as syntaxin), which play a crucial role in the functional modulation of GATs (Law et al., 2000). For instance, phosphorylation of tyrosine residues of GAT1 by tyrosine kinase helps mediate its GABA transport function (Law et al., 2000; Wang and Quick, 2005). Various reports indicate that, upon activation, vesicular GABA released from the presynaptic neurons is taken up by GAT1 and transported to the GABA receptors present on the post-synaptic terminals of the dendrites across the synapse, as synaptic GABA does not undergo enzymatic breakdown (**Figure 1**) (Gonzalez-Burgos, 2010).

GABA is known as a key player in regulating plasticity and inhibiting anxiety in eukaryotes (Brady et al., 2018). However, the action of GABA must be terminated to maintain its concentration in the synapse. To this end, GAT1 mediates the re-uptake of about 80% of GABA into presynaptic neurons (forward mode), from where it is released again (reverse mode) when required. Around 20% of the GABA molecules, however, are metabolized into glutamine after transport into glial astrocytes by GAT3 and are thus not available for neuronal release (**Figure 1**). Hence, the next cycle begins with the conversion of glutamine to glutamate, followed by its conversion into GABA once again (Parpura and Haydon, 2008).

Further, Rosenberg and colleagues described the translocation cycle of GAT1, as explained in **Figure 2**. Briefly, GAT1 adopts three distinct conformations, i.e., open-to-out, occluded-out, and open-to-in. When empty, GAT1 faces the extracellular medium (out T), to which two Na<sup>+</sup> ions bind (step 1). The Na<sup>+</sup> ions stabilize the binding of substrate in the protein core. In the following step, GABA (G) and a Cl<sup>−</sup> ion bind to the transporter, so that the transporter becomes loaded. Theoretical and computational studies have revealed that prokaryotes do not require a chloride (Cl<sup>−</sup>) ion for transport (Scimemi, 2014). In eukaryotes, however, it is necessary to compensate for the positive charge introduced by the co-transport of Na<sup>+</sup> ions during the GABA translocation step and thus to maintain the membrane potential (Rosenberg and Kanner, 2008). In step 3, the fully loaded transporter adopts a sealed conformation, which does not allow the release of ions and/or substrate to either the intracellular (cytoplasm) or the extracellular (synapse) medium until it changes its conformation. Subsequently, the loaded transporter becomes inward facing (step 4), and GABA and the co-transported ions are then released into the cytoplasm (step 5). The empty inward-facing transporter (in T) occludes its binding pocket again and thus resumes the outward-facing empty configuration (step 6). Hence, a new translocation cycle begins (Rosenberg and Kanner, 2008).
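The six-step cycle described above can be written down as a small ordered table of states, which makes it easy to check what is carried into the cytoplasm per cycle. The following is only an illustrative sketch: the names `State`, `CYCLE`, `next_state`, and `released_per_cycle` are hypothetical, the binding of GABA and Cl<sup>−</sup> is labeled as step 2 by assumption (the text only calls it a follow-up step), and the sketch encodes the forward mode as narrated here rather than any published kinetic model.

```python
from typing import NamedTuple


class State(NamedTuple):
    step: int          # step number as used in the text
    conformation: str  # open-to-out, occluded-out, or open-to-in
    cargo: tuple       # species bound to the transporter after this step


# Forward-mode cycle as narrated in the text (step 2 numbering is assumed).
CYCLE = (
    State(1, "open-to-out", ("Na+", "Na+")),
    State(2, "open-to-out", ("Na+", "Na+", "GABA", "Cl-")),
    State(3, "occluded-out", ("Na+", "Na+", "GABA", "Cl-")),
    State(4, "open-to-in", ("Na+", "Na+", "GABA", "Cl-")),
    State(5, "open-to-in", ()),
    State(6, "open-to-out", ()),
)


def next_state(state: State) -> State:
    """Return the state that follows `state`, wrapping from step 6 to step 1."""
    return CYCLE[state.step % len(CYCLE)]


def released_per_cycle(cycle=CYCLE) -> list:
    """Collect the cargo that leaves the transporter while it faces inward."""
    delivered = []
    previous = ()
    for state in cycle:
        # Cargo that disappears in an inward-facing state went to the cytoplasm.
        if state.conformation == "open-to-in" and len(state.cargo) < len(previous):
            delivered.extend(previous)
        previous = state.cargo
    return sorted(delivered)


if __name__ == "__main__":
    for s in CYCLE:
        print(s.step, s.conformation, s.cargo)
    print("released to cytoplasm:", released_per_cycle())
```

Walking the table shows that one forward cycle delivers one GABA molecule, one Cl<sup>−</sup> ion, and two Na<sup>+</sup> ions to the cytoplasm before the empty transporter returns to the outward-facing state.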

# STRUCTURAL AND FUNCTIONAL HOMOLOGS OF GAT1

Various attempts have been made to determine the crystal structure of human GATs. However, these efforts have remained unsuccessful owing to the unavailability of sufficient quantities of pure and stable transporter protein. In prokaryotes, the availability of the X-ray crystal structure of the Aquifex aeolicus leucine transporter (AaLeuT, PDB ID: 3F3A), which shares remarkable functional similarity and about 25% sequence similarity with eukaryotic GATs (Kristensen et al., 2011), has boosted research efforts to elucidate the structure and function of human GATs (hGATs). Moreover, the crystal structure of the Drosophila melanogaster dopamine transporter (dDAT, PDB ID: 4XP4) in the open-to-out conformation (Wang et al., 2015), which shares 46% sequence identity with GAT1, serves as a good template for molecular modeling of the tertiary structure of GATs.

FIGURE 2 | (continued) … concentration of Na<sup>+</sup> ions in the cytoplasm. To maintain the intracellular Na<sup>+</sup> concentration, the Na<sup>+</sup> pump present on the presynaptic neuronal membrane effluxes Na<sup>+</sup> ions into the synaptic cleft. Step 5 represents the open-to-in conformation of hGAT1 (also known as the reverse mode of reaction). In the reverse mode, the Na<sup>+</sup> pump functions in the direction opposite to that of the forward mode. In step 6, hGAT1 becomes empty again to begin a new cycle.

Detailed insights into the functional inhibition mechanism of GATs have remained elusive to date. Nevertheless, inhibition of the hGAT1 translocation cycle at any of its three distinct conformational states (open-to-out, occluded-out, or open-to-in) to prevent the excessive removal of the GABA neurotransmitter has been reported by various authors. The open-to-out and occluded-out conformations are the ones mostly targeted (Beuming et al., 2006; Skovstrup et al., 2010). It has been demonstrated that inhibition of the open-to-out conformation prevents hGAT1 from acquiring the occluded-out conformation responsible for the translocation of the substrate GABA and the co-transported ions (Baglo et al., 2013). Transport of the substrate is mediated through two binding sites, S1 and S2; however, none of the AaLeuT or dDAT crystal structures had been solved with a ligand bound at the S2 site until 2008. Briefly, the S2 site is known as a low-affinity, temporarily occupied region for ligands, as they finally move from the extracellular vestibule toward the S1 site (Quick et al., 2009).

Later, researchers succeeded in elucidating the importance of the S2 site through impairment of symporter activity (i.e., release of the substrate molecule to the S1 site) in a mutagenesis study conducted on the hGAT homolog AaLeuT. Binding of the detergent n-octyl-D-glucopyranoside (OG) at the S2 site in the presence of the substrate trapped the transporter in its open-to-out conformation owing to its inhibitor-like effect, thereby blocking the translocation of substrate to the S1 site and subsequently inhibiting the occluded transport conformation (Quick et al., 2009). Moreover, in the case of dDAT, binding of the inhibitor cocaine at the S2 site in the open-to-out conformation under pathogenic conditions induced a conformational change in the binding site of dDAT that facilitated its translocation to the S1 site. Binding of cocaine to the S1 site blocked the conformational shift from the open-to-out to the occluded-out conformation (Clementi and Fumagalli, 2015; Wang et al., 2015), which ultimately results in the inhibition of dopamine transport into the cytoplasm, eventually leading to neuromodulation to overcome anxiety and depression.

# Topology and Physiological Properties of GAT1

The topology of GAT1 was first determined with the help of hydropathy plots, which also assisted in the structure elucidation of the other GAT family members, as they share significant similarity (>50%) (Cummings et al., 2009). Hydropathy plots allow the identification of domains that are soluble or insoluble, i.e., regions of charged or uncharged amino acids, respectively, along the protein sequence. Sequence and structural inspection by electron microscopic, epitope, and X-ray crystallographic studies shows that GATs consist of 12 transmembrane (TM) segments with the N- and C-termini facing the cytoplasm, as shown in **Figure 3**. Overall, GATs encompass two pseudo-repeats of helices, i.e., TM1-TM5 and TM6-TM10. Moreover, TM segments 1, 3, 6, and 8 are mainly involved in holding the ions and substrate in GATs (**Figure 3**). Mutagenesis studies have provided detailed insights into some structural aspects of this topology, including the identification of N-glycosylation sites in the hydrophilic extracellular loop (EL2) between the TM3 and TM4 segments (Masuda et al., 2008), whereas phosphorylation occurs in the intracellular loops (IL) of GATs with the help of tyrosine kinases (Bennett and Kanner, 1997). Mutagenesis studies have also shown that removal of these glycosylation sites may reduce GABA uptake activity, whereas malfunctioning of tyrosine kinases leads to the redistribution of GATs from the cell surface to intracellular locations (Masuda et al., 2008; Jin et al., 2011). In general, GATs require the transport of an extracellular Cl<sup>−</sup> ion along with Na<sup>+</sup> ions and a GABA molecule (substrate) per transport step, as shown in **Figure 3** (Reichenbach and Bringmann, 2010). The stoichiometry of Na<sup>+</sup>:Cl<sup>−</sup>:GABA transport for GAT1, GAT2, GAT3, and BGT1 is 2:1:1, 2:1:1, ≥2:2:1, and 3:1(or 2):1, respectively (Loo et al., 2000; Dalby, 2003). Because the GABA molecule is a zwitterion, GATs produce a net influx of one positive charge per transport step (Lu and Hilgemann, 1999).
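The charge bookkeeping behind this statement is simple arithmetic and can be sketched as follows. The function name `net_charge` and the dictionary names are hypothetical; GABA is treated as net neutral because it is a zwitterion, as stated above, and only the unambiguous 2:1:1 stoichiometries are tabulated, since the values quoted for GAT3 and BGT1 are ranges or alternatives.

```python
# Formal charges of the co-transported species; zwitterionic GABA is net neutral.
CHARGE = {"Na+": +1, "Cl-": -1, "GABA": 0}

# Na+:Cl-:GABA stoichiometries quoted in the text that are unambiguous.
STOICHIOMETRY = {
    "GAT1": (2, 1, 1),
    "GAT2": (2, 1, 1),
}


def net_charge(na: int, cl: int, gaba: int = 1) -> int:
    """Net charge moved into the cell per transport step."""
    return na * CHARGE["Na+"] + cl * CHARGE["Cl-"] + gaba * CHARGE["GABA"]


if __name__ == "__main__":
    for subtype, (na, cl, gaba) in STOICHIOMETRY.items():
        print(subtype, net_charge(na, cl, gaba))
```

For the 2:1:1 case the two Na<sup>+</sup> ions contribute +2 and the single Cl<sup>−</sup> ion contributes −1, which gives the net influx of one positive charge per step. The same function can be applied to any candidate stoichiometry for the other subtypes.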

#### Substrate and Na<sup>+</sup> Ions Binding Sites in hGAT1

Briefly, the TM1 and TM6 segments contain unwound regions, which divide them into TM1a, TM1b, TM6a, and TM6b. The I62 and G63 residues in the unwound region adopt an extended conformation to link the TM1a-b segments, whereas G307 to G311 are involved in linking the TM6a-b segments. The TM1 and TM6 segments, which harbor the highest percentage of conserved residues, run in opposite directions. The unwound regions of these two TM segments, together with TM3 and TM8, form the inner cylindrical ring (S1 binding site), which holds the two Na<sup>+</sup> ions and the substrate (**Figure 3B**). Amino acid residues G59, Y60, A61, I62, L64, G65, N66 (of TM1), Y140 (of TM3), S305, G307 (of TM6), N327 (of TM7), L392, D395, S396, L402, and S406 (of TM8) are known to be involved in enclosing the Na<sup>+</sup> ions and substrate in the water-depleted binding site of hGAT1 (Yamashita et al., 2005). The S2 site, in contrast, is the preliminary allosteric site in the extracellular vestibule, at which either a substrate or an inhibitor molecule binds. It mediates the release of Na<sup>+</sup> ions and substrate to the primary site (S1) and thus enables the sodium-coupled GABA (substrate) symporter activity (Quick et al., 2009).

#### Cl<sup>−</sup> Ion Binding Site

The X-ray crystal structure of AaLeuT does not contain a Cl<sup>−</sup> ion. A rough estimate places its binding in AaLeuT in EL2, ∼20 Å away from the binding pocket (S1 site). Transport in AaLeuT is therefore considered Cl<sup>−</sup>-independent (Forrest et al., 2007). In comparison, eukaryotic neurotransmitter transporters are Cl<sup>−</sup>-dependent, and R69 is known to be a crucial residue for Cl<sup>−</sup> ion binding during transport. Moreover, replacement of R69 with any other residue, especially a charged one, abolishes Cl<sup>−</sup> ion binding and hence obstructs substrate transport (Lajtha and Reith, 2007).

Additionally, structural analysis of SERT, a member of the neurotransmitter transporter family that shares significant similarity with GATs, emphasized that Y121, S336, N368, and S372 interact with the Cl<sup>−</sup> ion through carbonyl oxygen and amide nitrogen atoms in eukaryotes. The corresponding residues in prokaryotes are Y47, T254, N286, and E290 (Krogsgaard-Larsen et al., 2016). Mutation of S372 (corresponding to E290 in prokaryotic AaLeuT) to alanine, cysteine, glutamate, or aspartate, and of N368 (corresponding to N286 in AaLeuT) to aspartate, inhibits Cl<sup>−</sup> ion-mediated transport (Forrest et al., 2007). Later on, Kristian identified that the Cl<sup>−</sup> ion is important for the translocation of substrate (GABA in eukaryotes) against the concentration gradient by compensating the positive charges (Na<sup>+</sup> ions). The specific residues of hGAT1 known for Cl<sup>−</sup> ion dependence and selectivity are Y86, Q291, S295, N327, and S331 (**Figure 3B**) (Krogsgaard-Larsen et al., 2016).

Along with substrate transport, the movement of ions through neurotransmitter transporters also plays a significant role in inducing conformational changes in the TM helical segments of the binding pocket. Generally, in the open-to-out conformation of the transporter, with Na<sup>+</sup> ions bound in the active site (S1), the extracellular gates are relatively thin and remain open. Substrate binding induces slight conformational changes in the extracellular regions of TM1, TM2, TM6, and EL4 (Krishnamurthy and Gouaux, 2012). The functional role of EL4 in sealing the binding site, thereby leading to the occluded-out conformation, is well-established (Gether et al., 2006). Upon release of Na<sup>+</sup> ions into the cytoplasm (open-to-in state), the re-shifting of TM segments 1, 2, 5, 6, and 7 induces a major conformational change in the transporter structure once again. Furthermore, pronounced changes occur in the hinge region of TM1a and the extracellular vestibule of EL4, i.e., bending and occlusion, respectively. This leads to the formation of thick extracellular and thin intracellular gates, thereby blocking the access of water to the binding cavity and permitting access to the binding site from the cytoplasmic face (Krishnamurthy and Gouaux, 2012).

Bio-physiologically, GAT1 has four basic properties, thoroughly determined by [<sup>3</sup>H] GABA uptake assays performed on rats: (i) GAT1 has a strong affinity for GABA as a substrate at low micromolar concentrations (Guastella et al., 1990); (ii) the increased rate of GABA uptake in the presence of the K<sup>+</sup>-selective ionophore valinomycin helped to establish that this transport is voltage dependent across the membrane (Kanner, 1978; D'adamo et al., 2013); (iii) replacement of the Na<sup>+</sup> ion with other cations, e.g., Li<sup>+</sup>, K<sup>+</sup>, or Tris<sup>+</sup>, may affect the transport mechanism, suggesting that the Na<sup>+</sup> ion is crucial for transport (Iversen and Neal, 1968; Nascimento et al., 2013); and (iv) GABA transport requires an electrochemical gradient of Na<sup>+</sup> ions, which is generated by Na<sup>+</sup>/K<sup>+</sup> ATPase activity (Guastella et al., 1990; Hertz et al., 2013).

FIGURE 3 | (continued) … ion is represented in green. An enlarged view of the GAT1 residue interaction profile with GABA and all three ions is presented on the right.

Although GABA is now established as a major inhibitory neurotransmitter in the vertebrate brain (Tritsch et al., 2016), its presence in the CNS was not fully determined until 1975. During the last 40 years, however, tremendous progress has been made in identifying its role in the CNS. In this regard, a number of experiments on mice and crustacean models, specifically crayfish, have helped to define the role of GABA in inhibition processes mediated by the GABAergic neuronal system (Bowery and Smart, 2006). For more than 65 years, mutagenesis studies and wet lab experiments have been carried out to understand the functional relevance of amino acid residues and to provide quantitative measures or qualitative assessments of the functional activity, presence, or amount of the target (site/protein/chemical). Several biochemical, pharmacological, and physiological studies have contributed measurably to the understanding of the GABAergic interneuron system and its use in the treatment of epilepsy. For example, numerous studies have been conducted on the activity of GAD enzymes, the binding of GABA to post-synaptic GABA<sub>A</sub> receptors, the percentage reduction in GABA-mediated inhibition, and the presence of GABA in brain tissue and cerebrospinal fluid (CSF) (Treiman, 2001). To date, a large number of acquired and genetic animal models have shown clear evidence of abnormalities in GABA regulation in the interneuron system (Horton et al., 1982; Olsen et al., 1985; Peterson et al., 1985; Roberts et al., 1985; King and Lamotte, 1988).

### PHARMACOINFORMATICS APPROACHES

Under pathological conditions, a low GABA concentration near a synapse induces weaker activation of its receptors (delaying communication between pre- and post-synaptic neurons), thus making the system more liable to the de-formation of new memories (Laviv et al., 2010). Accordingly, different biological assays, including the equilibrium binding assay, the GABA uptake assay, and the GAT1 transport assay, have been used to study GABA transport through GAT1 in the presence of various antagonists in several expression systems, including CHO cells, HEK cells, and Xenopus oocytes (Kragler et al., 2005, 2008; Pizzi et al., 2011).

Additionally, numerous attempts have been made to identify GAT inhibitors by using combined structure- and ligand-based strategies. The main focus has remained on GAT1, as GAT1 inhibitory compounds largely failed to enter the clinical phase owing to their impairment of motor activities and inability to cross the BBB (Falch et al., 1987). One successful inhibitor, Tiagabine, is the only FDA-approved, second-generation, GAT1-selective antiepileptic drug on the market with low toxicity; however, certain side effects such as tremor, ataxia, asthenia, and sedation are related to its pharmacological activity (Schwartzkroin, 2009). Tiagabine is a derivative of nipecotic acid in which a lipophilic chain is attached to the protonated nitrogen of the piperidine ring at one end and carries di-thiophene ring substituents at the other end (Genton et al., 2001). Various authors have used pharmacoinformatics approaches to design selective inhibitors of GAT subtypes; however, only a handful of compounds could meet the selectivity and affinity criteria. Thus, fewer data are available on potent inhibitors of GAT2, GAT3, and BGT1 than on those of GAT1 (Clausen et al., 2005a).

# Structure Based Studies

#### Homology Modeling

A brief overview of the different conformations studied by X-ray crystallography in the bacterial and fly homologs of hGAT1 is presented in **Table 1**.

It has been shown that all four isoforms of GATs (GAT1-3 and BGT1) share >50% sequence similarity, as shown in **Figure 4**. Moreover, hGAT1 shares 60% sequence similarity with dDAT, compared with 36% for AaLeuT, which makes dDAT a valuable template for structural modeling of hGATs and for resolving the nature and shape of the binding pocket and the opening and closing conformations of GAT1 through further docking and molecular dynamics (MD) simulation studies.
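Because template choice rests on percentages of this kind, it helps to see how identity and similarity are counted over a pairwise alignment. The sketch below is a simplified illustration and not the procedure used in the cited studies: the function name, the conservative-substitution groups in `SIMILAR_GROUPS`, and the convention of dividing by the number of gap-free aligned columns are all assumptions, and alignment programs differ on each of these choices.

```python
# Hypothetical conservative-substitution groups, used only for illustration.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def percent_identity_and_similarity(seq_a: str, seq_b: str) -> tuple:
    """Return (% identity, % similarity) for two pre-aligned sequences.

    Gaps are written as '-'; columns containing a gap are skipped, so the
    denominator is the number of gap-free aligned columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    group_of = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}
    aligned = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        aligned += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif a in group_of and group_of[a] == group_of.get(b):
            similar += 1
    if aligned == 0:
        raise ValueError("no gap-free aligned columns")
    return 100.0 * identical / aligned, 100.0 * similar / aligned


if __name__ == "__main__":
    # Toy aligned fragments, not real transporter sequences.
    print(percent_identity_and_similarity("ILKDS-A", "IVRDTGA"))
```

Similarity is always at least as high as identity under this counting, which is one reason why a similarity figure and an identity figure quoted for the same pair of proteins need not agree.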

In the last decade, homology models of different GAT isoforms have been developed to characterize their structure and function in humans. In this regard, Baglo and colleagues built homology models of hGAT1 using AaLeuT crystal structures as templates in three different conformations, i.e., open-to-out (PDB ID: 3F3A), occluded-out (PDB ID: 2A65), and open-to-in (PDB ID: 3TT3). However, owing to the difference in the number of amino acid residues in EL2 between prokaryotes and eukaryotes, the full length of EL2 was not considered for model building (Baglo et al., 2013). The residues A61, I62, G63, L64, N66, S295, L300, S396, Q397, and C399 were predicted to be involved in both GABA binding and transport. In addition, Dodd and Christie (2007) and Anderson et al. (2010) found that the residues Y60, L136, G297, and T400 were specifically involved in GABA transport activity. The homology models of GAT1 are discussed in detail in the section Docking and Molecular Dynamics Simulations (MD) Studies with respect to the amino acid residues involved in the docking of ligands.

#### Docking and Molecular Dynamics Simulations (MD) Studies

To date, only two investigations have computationally scrutinized the binding of the substrate, two Na<sup>+</sup> ions, and a Cl<sup>−</sup> ion in the S1 binding site of hGAT1 through molecular docking followed by molecular dynamics simulations. In addition, the binding of such small molecules in the GAT1 pocket allowed the researchers to predict the corresponding biological activities as well (Palló et al., 2007; Wein and Wanner, 2010).

Docking of small molecules into the binding pocket of hGAT1 therefore provides a way to understand their mechanism along with the shape and nature of the binding core. Notably, in hGAT1, one of the Na<sup>+</sup> ions was observed to coordinate with the carboxyl group of GABA. Moreover, GABA forms hydrogen bonds with the side chain hydroxyl group of Y140, the main chain nitrogen atom of G65, and the main chain oxygen of F294. In addition, the amine moiety of GABA forms ionic interactions with Y60 (Lovinger, 2010; Baglo et al., 2013).

In another study, the binding patterns of the substrates of hGAT1 and AaLeuT, i.e., GABA and leucine, respectively, were analyzed. As both substrates possess a carboxylic acid group that interacts with the Na<sup>+</sup> ion, they showed very similar binding patterns. In comparison to AaLeuT, the carbon chain of GABA adopted an extended conformation in the binding pocket, so that the –NH of GABA formed hydrogen bonds with Y60 and G297 of hGAT1, as shown in **Figure 5** (Wein and Wanner, 2010). Later on, small-molecule inhibitors such as nipecotic acid, guvacine, 4-amino-isocrotonic acid, taurine, and 4-amino-2-hydroxybutanoic acid were also docked into the hGAT1 model to probe their binding. The subsequent molecular dynamics (MD) calculations after flexible docking showed that the active site was not easily accessible from either the extracellular or the cytoplasmic face because of its very limited size, which suggested that large inhibitors bind in the open-to-out conformation only (Wein and Wanner, 2010).

In 2010, Skovstrup and colleagues studied the binding conformations of GABA, nipecotic acid, and Tiagabine in the occluded-out conformation of hGAT1. **Figure 6** shows a Venn diagram of the overlapping interacting amino acid residues in the substrate binding site of GAT1 identified by previous researchers. It illustrates that the amino acid residues Y60, L136, G297, and T400 play an important role in the binding of GABA and nipecotic acid derivatives (Yamashita et al., 2005; Gether et al., 2006; Dodd and Christie, 2007; Skovstrup et al., 2010; Baglo et al., 2013); Skovstrup et al. additionally reported a role of Y296 in GABA binding.

It was also hypothesized that the large aromatic moieties of GAT1 modulators are important for their inhibitory activity. The attachment of large hydrophilic chains to the aromatic moieties may allow the inhibitor Tiagabine to face the extracellular vestibule of GAT1, in contrast to nipecotic acid (devoid of hydrophilic chain), which orients toward the cytoplasmic face of GAT1 (**Figure 7**), i.e., a hydrogen bond forms between the protonated nitrogen of Tiagabine and F294(O) in the occluded-out conformation. Moreover, all three compounds (i.e., GABA, nipecotic acid, and Tiagabine) showed an electrostatic interaction with the sodium ion and shared common polar interactions with Y60, Y140, and S396. Specific polar contacts were seen with Y296 (in the case of GABA), G65 (nipecotic acid), and F294 (Tiagabine). MD simulations of these compounds in the open-to-out conformation were also in agreement with the observations for the occluded-out conformation (Skovstrup et al., 2010). This shows that the occluded-out conformation requires a major change in the binding cavity to accommodate large inhibitors such as Tiagabine.

Later on, a steered molecular dynamics (SMD) simulation approach was used to understand the complete mechanism of action of GABA in all three GAT1 conformational states. Skovstrup and colleagues succeeded in reorienting the occluded-out conformation into the open-to-out and open-to-in conformations (Skovstrup et al., 2012). In the case of reorientation to the open-to-out conformation, the amino acid residues involved in the transfer of GABA from the S1 site to the temporary binding site (S2), located in the extracellular vestibule of GAT1, were determined. Before dissociation of GABA from the S1 to the S2 site, the carboxylate group of GABA showed (i) an intra-molecular interaction with the amine of GABA, (ii) an ionic interaction with Na1, and (iii) hydrogen bonding with Y60(O), G65(NH), Y140(OH), and S396(OH) (Skovstrup et al., 2010). After 8 ns of simulation, however, the amine of GABA and D451 of the S2 site began a water-mediated interaction with each other. Moreover, R69 rearranged to form an ionic interaction with the GABA carboxylate through its guanidinium group, which inhibited the drifting of GABA. The residues Y72 (located one helical turn above R69) and K76 (located one helical turn above Y72 and two helical turns above R69) took part in GABA binding after ∼12 ns of MD simulation. The complex remained stable for about 6 ns more; afterwards, however, GABA was fully solvated (at the 19 ns time point, stuck in the extracellular vestibule), representing the open-to-out conformation of GAT1 (Skovstrup et al., 2010). In the case of the open-to-in GAT1 conformation, the conformational change of TM6 results in the displacement of the residue Y60, which in turn disrupts the interaction between the carboxylate of GABA and Na1 of hGAT1. The residues R44, W47, F53, Q106, Y309, N310, and N314 were observed to be involved in the formation of the intracellular gate. Additionally, E101 made ionic contact with the amine group of GABA and thereby released GABA into the cytoplasm. Therefore, the channels from S1 to S2 (dissociation and release of GABA into the extracellular space) and from S1 to the cytoplasm have been recognized as hydrophobic in nature. R-nipecotic acid showed a dissociation behavior similar to that of GABA, whereas Tiagabine showed hydrophobic interactions with the residues of TM1 and TM6 between the two binding sites, i.e., S1 and S2 (Skovstrup et al., 2012).

R-nipecotic acid is known to be a medium-to-strong inhibitor of hGAT1, whereas proline is known to be a weak inhibitor (Quandt et al., 2013). In 2016, Wein and colleagues synthesized a series of N-substituted 4,4-diphenylbut-3-en-1-yl (DPB) and 4,4-bis(3-methylthiophen-2-yl)but-3-en-1-yl (BTB) nipecotic acid and proline derivatives (examples shown in **Figure 8**). Interestingly, in comparison to the pure amino acids, the resulting BTB- or DPB-substituted amino acids showed similar binding affinities. On the other hand, docking of all these inhibitors in the hGAT1 pocket indicated that the nitrogen atom of the pure amino acids is oriented toward the intracellular face of hGAT1, whereas the nitrogen atom of the N-substituted BTB or DPB derivatives faces the extracellular vestibule. This led to the finding that, in order to enhance the locking of hGAT1 in the open-to-out conformation, the N-substituted amino acid derivatives are a better option than the pure amino acids (Wein et al., 2016). Therefore, DDPM-2571, an N-substituted derivative of pyridine, was synthesized later. DDPM-2571 (pIC<sub>50</sub> = 8.29 ± 0.02) showed affinity comparable to that of Tiagabine (pIC<sub>50</sub> = 7.43 ± 0.11) when subjected to the hot plate test, the formalin test, and mouse models. In addition, DDPM-2571 (shown in **Figure 8**) did not disrupt the motor skills of the mouse models; instead, it augmented the memory deficits. Thus, DDPM-2571 may also be regarded as a lead structure for the inhibition of seizures through hGAT1 (Sałat et al., 2017).

Recently, nipecotic acid derivatives with an alkyne-type spacer followed by an aromatic moiety have been synthesized. The comparison of Tiagabine and the newly synthesized nipecotic acid derivatives showed a hydrogen bond interaction between the protonated nitrogen and the carbonyl oxygen of F294 of hGAT1 (Lutz et al., 2017). Moreover, a binding mode hypothesis for nipecotic acid and N-diarylalkenyl piperidine analogs has been derived in a newly developed hGAT1 model (template: dDAT, PDB ID: 4XP4), which may provide a structural basis for understanding the binding and design of hGAT1 analogs. The identified binding site residues were in good agreement with the already known roof and base residues of the hGAT1 pocket (Sadia, 2018).


#### Ligand Based Studies

Since the early 1980s, several attempts have been made to optimize lead structures of GAT inhibitors. Researchers have employed amino acids, non-amino acids, and their respective derivatives to develop GAT antagonists (Andersen et al., 2001). Among these, bi-aromatic rings attached to a lipophilic moiety are of fundamental importance (Kragler et al., 2008); however, the molecular mechanism underlying the interaction of these lipophilic analogs with the GABA uptake system is unknown (Stromgaard et al., 2009). A breakthrough in our understanding of GAT pharmacology came with the development of a nipecotic acid derivative with a di-aromatic substituent attached to the lipophilic chain. The resulting analog, Tiagabine, was found to be a potent, subtype-specific, competitive inhibitor with high affinity (IC<sub>50</sub> = 0.049 µM) (Nakada et al., 2013). Later on, derivatives of these cyclic GABA analogs, such as 4,4-diarylbutenyl derivatives, aminomethylphenols, tetrahydrobenzo-isoxazoles, diaryloximes, pyrrolidine-2-acetic acid derivatives, and diarylvinyl ethers, were used to design and synthesize well-known specific inhibitors of GAT1 (Knutsen et al., 1999; Andersen et al., 2001; Zhao et al., 2005; Kragler et al., 2008; Pizzi et al., 2011).
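Potencies in this review are quoted on two scales, as IC<sub>50</sub> values in µM and as negative logarithms (pIC<sub>50</sub>, pK<sub>i</sub>). The standard relation pIC<sub>50</sub> = −log<sub>10</sub>(IC<sub>50</sub> in mol/L) connects them, and a minimal sketch of the conversion is given below. The function names are hypothetical, and the conversion says nothing about whether two values were measured in comparable assays.

```python
import math


def pic50_from_ic50_um(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 (-log10 of molar IC50)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)


def ic50_nm_from_pic50(pic50: float) -> float:
    """Convert a pIC50 back to an IC50 expressed in nanomolar."""
    return 10.0 ** (-pic50) * 1e9


if __name__ == "__main__":
    print(round(pic50_from_ic50_um(0.049), 2))  # Tiagabine IC50 quoted above
    print(round(ic50_nm_from_pic50(8.29), 1))   # pIC50 quoted for DDPM-2571
    print(round(ic50_nm_from_pic50(7.43), 1))   # pIC50 quoted for Tiagabine
```

On this scale, the IC<sub>50</sub> of 0.049 µM corresponds to a pIC<sub>50</sub> of about 7.31, while pIC<sub>50</sub> values of 8.29 and 7.43 correspond to roughly 5 nM and 37 nM, respectively. Values for the same compound taken from different assays need not coincide.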

Thorough investigation of the compounds guvacine, proline, and nipecotic acid revealed that the addition of lipophilic side chains to these compounds yields second-generation compounds able to penetrate the BBB. Examples include SK&F 89976A (Murali Dhar et al., 1996), SKF-100591A (Zhao et al., 2005), SK&F 100330-A, CI-966 (Borden et al., 1994), NNC 711 (or NO 711), Tiagabine (highly selective for GAT1), SNAP-5294 (highly selective for GAT2) (Hack et al., 2011), (S)-SNAP-5114 (moderately selective for GAT3), NNC 05-2045 (poorly selective for BGT1), and EF1502 (selective for GAT1/BGT1) (**Figure 9**). Normally, the lipophilic side chain is added onto the nitrogen atom of the parent molecule. The side chain addition significantly increased the potency of many of the derivative GAT inhibitors; however, these compounds have not reached the status of drugs (Pavia et al., 1992).

Andersen and colleagues synthesized novel tricyclic analogs from the **amino acids** nipecotic acid, guvacine, and homo-β-proline (**Figure 9**). The di-aryl groups were replaced with tricyclic ring moieties, which were attached to the parent amino acid through hydrophilic chains of variable length containing the electronegative moiety. However, this replacement decreased the potency of the newly synthesized compounds, with the exception of one derivative of homo-β-proline (HOM) that showed 3-fold higher potency (compound **1**), better ligand efficiency and better hydrophilicity than the parent compound. The Andersen group later extended the library of compound **1**-like compounds by modifying the "A" and "R" substituents, resulting in moderate and poor inhibitors with affinities of 0.18–40µM (**Table 2**). Later on, in vivo testing of compound **1** (IC<sub>50</sub> = 0.05µM) for inhibition of neuronal [<sup>3</sup>H]-GABA uptake in mice also revealed anticonvulsant activity higher than that of nipecotic acid and guvacine (Andersen et al., 2001) and approximately equivalent to that of Tiagabine (IC<sub>50</sub> = 0.049µM) (Nakada et al., 2013).

#### TABLE 2 | Derivatives of Compound 1 as modulators of hGAT1.

S 2 2 (R)-nipecotic acid 0.30 O 2 2 (R)-nipecotic acid 14.6 CH<sub>2</sub> – – (R)-nipecotic acid >40

The following section summarizes the pharmacology of GAT inhibitors, with emphasis on recent advances in deciphering their role in the hGAT1 binding pocket and their corresponding biological activities.

#### Aminomethylphenols

In 2008, Kragler and Wanner synthesized the **non-amino acid** aminomethylphenol derivatives and compared their affinities across all GAT subtypes. A lipophilic side chain was added to the nitrogen of the aminomethylphenol molecule to increase the flexibility of the compounds, e.g., 5-n-dodecylaminomethyl-2-methoxyphenol (compound **2**, **Figure 10**). Compound **2** significantly inhibited both neuronal and glial [<sup>3</sup>H]-GABA uptake, although it was not subtype specific (IC<sub>50</sub> values: GAT1 = 12.30µM, GAT2 = 12.58µM, GAT3 = 2.69µM, BGT1 = 8.70µM) (Kragler et al., 2008). Later on, Pizzi and colleagues investigated nipecotic acid analogs by incorporating methyl, chlorine, fluorine, and bromine substituents at the ortho positions of the di-aromatic moieties attached to the lipophilic chain. However, only the addition of methyl and fluoro groups produced a 4,4-diphenylbut-3-enyl derivative (compound **3**, pK<sub>i</sub> = 7.83 in a [<sup>3</sup>H]-Tiagabine radioligand binding assay, **Figure 10**) with affinity comparable to that of Tiagabine (pK<sub>i</sub> = 7.77), which also possesses methyl substituents at the ortho positions of its thiophene rings (di-aromatic moieties) (Pizzi et al., 2011).
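The subtype profile of compound **2** can be made concrete by expressing each IC<sub>50</sub> relative to the most sensitive subtype. The short Python sketch below does this with the values quoted above. The function name `fold_selectivity` is illustrative and not taken from the cited work, and the ratio of IC<sub>50</sub> values is used here only as a simple, commonly used proxy for selectivity.

```python
# IC50 values (µM) for compound 2 as quoted in the text (Kragler et al., 2008).
ic50_um = {"GAT1": 12.30, "GAT2": 12.58, "GAT3": 2.69, "BGT1": 8.70}


def fold_selectivity(ic50_by_subtype):
    """Return (preferred subtype, {other subtype: IC50 ratio vs. preferred}).

    Assumption: selectivity is approximated by the ratio of IC50 values,
    with the lowest IC50 taken as the preferred subtype.
    """
    preferred = min(ic50_by_subtype, key=ic50_by_subtype.get)
    reference = ic50_by_subtype[preferred]
    ratios = {
        subtype: value / reference
        for subtype, value in ic50_by_subtype.items()
        if subtype != preferred
    }
    return preferred, ratios


preferred, ratios = fold_selectivity(ic50_um)
print(f"Most sensitive subtype: {preferred}")
for subtype, ratio in sorted(ratios.items()):
    print(f"  {subtype}: {ratio:.1f}-fold weaker than {preferred}")
```

All three ratios fall below 5-fold, which is consistent with describing compound **2** as not subtype specific.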

Nipecotic acid, being a polar and hydrophilic compound, is not an ideal GAT1 blocker (Stella et al., 2007). Addition of the lipophilic N-(4,4-diphenyl-3-butenyl) moiety to nipecotic acid therefore gave the robust derivative SK&F89976A (**Figure 9**), with an approximately 20-fold improvement in affinity over nipecotic acid. However, replacing the N-(4,4-diphenyl-3-butenyl) lipophilic moiety with a 3,3-diphenylpropyl substituent gave no in vivo activity at physiological pH, although in vitro activity improved (Stella et al., 2007; Wermuth, 2011).

In another study, 1F9 cells were used to measure the potency of the blockers SK&F89976A, SK&F100844A (the 4-methoxyphenyl derivative of SK&F89976A), and SK&F100330A (a guvacine derivative) against GAT1 (**Figure 9**). Two of the three derivatives (SK&F100844A and SK&F89976A) possess saturated piperidine rings, whereas SK&F100330A contains an unsaturated piperidine ring; the biological activities were 10, 0.8, and 0.5µM, respectively (Corey et al., 1994). Likewise, Yunger et al. reported the anticonvulsant activity of SK&F89976A, SK&F100330A, and SK&F100844A (IC<sub>50</sub> = 0.20, 0.21, and 1.25µM, respectively) in rat brain using a [<sup>3</sup>H]-GABA uptake assay (Yunger et al., 1984). Later on, Braestrup synthesized the nipecotic acid derivative Tiagabine (NO 328) by adding the (R)-N-[4,4-bis(3-methyl-2-thienyl)but-3-en-1-yl] side chain to the nitrogen atom of the piperidine ring. Tiagabine was described as a potential selective GABA uptake inhibitor in astrocytes and neuronal cells and also as a potential radioligand for measuring GABA uptake (Braestrup et al., 1990). Tiagabine was later recognized as a GAT1-selective inhibitor (Madsen et al., 2011).

Moreover, Yang synthesized a series of lipophilic di-aromatic derivatives of 3-ethoxy-4,5,6,7-tetrahydrobenzo[d]isoxazol-4-one by reductive amination of the O-alkylated racemate to obtain astrocyte-specific GABA uptake blockers related to (R)-4-amino-4,5,6,7-tetrahydrobenzo[d]isoxazol-3-ol, also known as (R)-exo-THPO (**Table 3**). In addition, their binding affinities against induced convulsions were analyzed in vitro for all GAT subtypes, along with expression testing in three systems, i.e., HEK cell lines, neurons, and astrocytes. Surprisingly, the derivatives obtained were more selective for neuronal cells than for the other two systems; the most selective, compound **5** ((RS)-4-[N-(1,1-diphenylbut-1-en-4-yl)amino]-4,5,6,7-tetrahydrobenzo[d]isoxazol-3-ol), had a high binding affinity of 0.14µM (**Table 3**). Other examples include compound **6** (IC<sub>50</sub> = 34µM, attached nitrogen in S-conformation), a potent blocker of GAT2, whereas the R-conformation of the nitrogen atom in compound **6** (IC<sub>50</sub> = 4µM) showed subtype selectivity for GAT1 (**Table 3**) (Clausen et al., 2005b). Additionally, N-methyl-exo-THPO (4,5,6,7-tetrahydroisoxazolo[4,5-c]pyridin-3-ol), with a binding affinity of 28µM, acted as an astrocytic GABA transport blocker (**Table 3**, compound 7) (Yang and Rothstein, 2009).

TABLE 3 | Chemical structures of exo-THPO derivatives along with inhibitory potency (IC<sub>50</sub>) values against hGAT1.

(R)-exo-THPO \*R1 and R2 = H (R-configuration)

#### Azetidine Derivatives

The carboxylic acid group attached at different positions (i.e., ortho, meta, or para) of the polar moiety of GAT1 antagonists is known to play a crucial role in high inhibitory potency (Zheng et al., 2004, 2006). Nevertheless, another class of GAT inhibitors was synthesized in which the carboxylic acid group was bioisosterically replaced with a tetrazole ring, in order to evaluate the potential of the resulting azetidine derivatives. The resulting derivatives displayed no effect on GABA uptake, which made tetrazole rings equipotent substitutes for the carboxylic acid group. However, substitution of the piperidine ring in NNC-05-2045, a known GABA uptake blocker, with an azetidine ring resulted in moderately potent azetidine derivatives for GAT1, e.g., the 3-hydroxy-3-(4-methoxyphenyl) derivative (compound **8**, IC<sub>50</sub> = 26.6µM), and for GAT3 (compound **9**, IC<sub>50</sub> = 31µM), as shown in **Figure 11**. Additionally, N-alkylation with 4,4-diphenylbutenyl or 4,4-bis(3-methyl-2-thienyl)butenyl lipophilic side chains gave azetidin-2-ylacetic acid derivatives with the highest activity against GAT1 (compound **10**, IC<sub>50</sub> = 2.83µM and compound **11**, IC<sub>50</sub> = 2.01µM). The most active compound against GAT3 was the β-alanine analog 1-{2-[tris(4-methoxyphenyl)methoxy]ethyl}azetidine-3-carboxylic acid (compound **12**), with an IC<sub>50</sub> of 15.3µM (**Figure 11**) (Faust et al., 2010).

#### Aminomethyltetrazoles

In 2011, mono- and di-substituted aminomethyltetrazole derivatives of glycine were evaluated for biological activity against all four GAT subtypes in murine cells. 5-Monosubstituted tetrazoles did not inhibit GABA uptake, whereas 1,5-disubstituted tetrazoles showed remarkable potential for the GAT2, GAT3, and GAT4 (BGT1 in humans) subtypes. For example, the highly selective di-substituted tetrazole derivative for GAT3 (compound **13**, IC<sub>50</sub> = 8.12µM) showed 4- and 12-fold higher selectivity over the GAT4 and GAT1 subtypes, respectively (**Figure 12**). Until 2010, GAT1 and GAT2 inhibitors were not subtype specific owing to the lack of a detailed pharmacophore model of GAT2, which is still not completely resolved. In this context, Schaffert's study was a landmark in identifying two new selective GAT2 inhibitors with no impact on GAT1 activity, i.e., compound **14** and compound **15** (IC<sub>50</sub> = 15.48 and 10.23µM, respectively). Moreover, the biological activity of compound **15** was approximately similar to that of NNC-05-2090, i.e., IC<sub>50</sub> = 8.12µM (**Figure 12**) (Schaffert et al., 2011).

#### QSAR Studies

So far, only a limited number of three-dimensional quantitative structure–activity relationship (3D-QSAR) studies based on comparative molecular field analysis (CoMFA), and a 2D-QSAR study, have been conducted on GAT1. Zheng et al., in 2004 and later in 2006, developed 3D-QSAR models for N-diarylalkenyl-piperidinecarboxylic acid analogs. It was hypothesized that substituting one or both of the aryl rings with a bulky phenoxymethyl or benzyloxymethyl group at the ortho position might improve the GAT1 inhibitory activity. Moreover, negative groups, e.g., a carboxylic acid at the meta position with respect to the nitrogen atom of the piperidine ring, conferred greater potency for the interaction of inhibitors with GAT1, and both steric and electronic factors were also shown to be important (Zheng et al., 2004, 2006).

Later on, Jurik et al. performed a 2D-QSAR study on 162 nipecotic acid and guvacine derivatives with pIC<sub>50</sub> > 7.0. Four different sets of descriptors, including weinerPol, opr\_brigid, 16 physicochemical descriptors, and 32 van der Waals surface area (VSA) descriptors, were used to build the model. Of these, the contingency matrix and VSA descriptors turned out to be well suited to describing the dataset. Moreover, because 2D-QSAR is a versatile method for capturing SAR information, the active test compounds, which carry an ortho substitution in the linker region of the nipecotic acid derivatives, were easily distinguished from the inactive compounds (Jurik et al., 2013).
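Because this section reports potencies both as IC<sub>50</sub> values in µM and as pIC<sub>50</sub> thresholds, a short conversion may help relate the two scales. The Python sketch below uses the standard definition pIC<sub>50</sub> = −log<sub>10</sub>(IC<sub>50</sub> in mol/L); the function names are illustrative only.

```python
import math


def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)


def ic50_um_from_pic50(pic50):
    """Inverse conversion: pIC50 back to IC50 in micromolar."""
    return 10 ** (-pic50) * 1e6


# Tiagabine IC50 quoted earlier in this section (0.049 µM).
tiagabine_pic50 = pic50_from_ic50_um(0.049)
print(f"Tiagabine pIC50 = {tiagabine_pic50:.2f}")

# A pIC50 threshold of 7.0 corresponds to an IC50 of 0.1 µM.
print(f"pIC50 7.0 -> IC50 = {ic50_um_from_pic50(7.0):.2f} µM")
```

On this scale, a pIC<sub>50</sub> above 7.0 means an IC<sub>50</sub> below 0.1µM, so Tiagabine (0.049µM) lies above that threshold.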

In addition, Hirayama and colleagues used a pharmacophore approach to develop small-molecule hSGLT1 and GAT1 inhibitors. Nipecotic acid derivatives, baclofen, saclofen, nortriptyline and SKF89976A were used to develop the GAT1 pharmacophore model. The best pharmacophore model consisted of one hydrogen bond unfavorable region, three hydrogen bond donors and acceptors, and one hydrogen bond donor site that plays a critical role in the interaction between GAT1 and inhibitors. Moreover, it was demonstrated that the large aromatic or hydrophobic moieties of GAT1 inhibitors lie at a distance of 8Å from the protonated nitrogen atom in the polar moiety (**Figure 13**). Overall, the aromatic moieties of GAT1 inhibitors bind coplanar, ∼8Å from the substrate (GABA) binding site, and are responsible for inhibiting the translocation process (Hirayama et al., 2001).

Recently, a GRIND model of GAT1 antagonists was developed using flexible alignment by a pharmacophore mapping approach. The model showed good statistics at the second cycle of the Fractional Factorial Design algorithm (FFD2) (Palló et al., 2007), with a correlation coefficient (r<sup>2</sup>) of 0.75. According to the model, two hydrogen bond acceptors (N1), one hydrogen bond donor (O) and one hydrophobic region (DRY) at certain distances from each other play an important role in achieving high inhibitory potency against hGAT1 (Sadia, 2018).

Briefly, the past decade has witnessed a paradigm shift in drug discovery with the help of computer-aided drug design approaches. In this regard, the combined use of ligand-based and structure-based studies for the identification of GAT1 antagonists has bridged the gap in understanding ligand–transporter interactions. This review of GAT1 indicates that the hydrophobic region of the GAT1 pocket accommodates the aromatic moieties of GAT1 antagonists and that a sodium ion (Na1) of GAT1 is involved in an electrostatic interaction with the acidic group (most commonly a COOH group) attached to the polar moiety. In addition, the protonated nitrogen atom in the polar region of GAT1 antagonists also plays an important role in the interaction with F294/S295 of GAT1. In summary, given the recent advances in determining mechanistic models of hGAT1, this progress can be expected to accelerate in the coming years and to fuel detailed insights into membrane transporter proteins. This should include not only the availability of a high-resolution X-ray structure of hGAT1 but also the development of new experimental protocols, followed by structure determination of other members of the SLC6 family with more optimized computational models and methods.

# OUTLOOK SUMMARY

Knowledge of the structure and function of GABA transporters continues to increase due to recent advancements in structural biology. From a molecular mechanism perspective, efforts to understand the structure and function of GATs are hampered mainly by the lack of a crystal structure of mammalian GATs. However, the crystal structures of bacterial and fly homologs of GATs help in understanding the pharmacology of GATs. Until now, only a single GAT1-selective FDA-approved drug, Tiagabine, is available against epilepsy, one of the most notable neurological disorders, which is caused by dysregulation of GAT1. Various molecular modeling studies have reported that one of the sodium ions in the binding pocket of GAT1 forms electrostatic interactions with Tiagabine. This may reflect the importance of that sodium ion in the translocation cycle of hGAT1. Moreover, residues G65 and Y140 of GAT1 have been observed to be crucial for hydrogen bond formation with either the docked substrate or inhibitors. Overall, the binding hypothesis of Tiagabine and its derivatives suggests that the carboxylic acid moiety in the basic scaffold may contribute positively to achieving high inhibitory potency (IC<sub>50</sub>) against hGAT1. However, substitution of large functional groups on the thiophene rings (aromatic moieties) of Tiagabine may result in less potent GAT1 inhibitors. Therefore, this could provide a rationale for designing more potent GAT1 inhibitors to mediate fast inhibitory neurotransmission.

## AUTHOR CONTRIBUTIONS

SZ and IJ conceived and designed the paper, figures and/or tables, reviewed drafts of the paper.

#### FUNDING

Support was provided by HEC Indigenous Ph.D. Fellowship for 5,000 scholars Phase-II, Batch-I, 2012.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zafar and Jabeen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Dual-Specificity Phosphatase CDC25B Was Inhibited by Natural Product HB-21 Through Covalently Binding to the Active Site

Shoude Zhang<sup>1,2,3</sup>\*, Qiangqiang Jia<sup>1</sup>, Qiang Gao<sup>1</sup>, Xueru Fan<sup>2</sup>, Yuxin Weng<sup>2</sup> and Zhanhai Su<sup>1,2</sup>\*

<sup>1</sup> State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, China, <sup>2</sup> Department of Pharmacy, Medical College of Qinghai University, Xining, China, <sup>3</sup> School of Pharmacy, East China University of Science and Technology, Shanghai, China

#### Edited by:

Daniela Schuster, Paracelsus Medizinische Privatuniversität, Salzburg, Austria

#### Reviewed by:

Margherita Brindisi, Università degli Studi di Siena, Italy Jiangjiang Qin, University of Houston, United States

#### \*Correspondence:

Shoude Zhang shoude.zhang@qhu.edu.cn Zhanhai Su suzhanhai@foxmail.com

#### Specialty section:

This article was submitted to Medicinal and Pharmaceutical Chemistry, a section of the journal Frontiers in Chemistry

Received: 11 April 2018 Accepted: 12 October 2018 Published: 13 November 2018

#### Citation:

Zhang S, Jia Q, Gao Q, Fan X, Weng Y and Su Z (2018) Dual-Specificity Phosphatase CDC25B Was Inhibited by Natural Product HB-21 Through Covalently Binding to the Active Site. Front. Chem. 6:531. doi: 10.3389/fchem.2018.00531

Cysteine 473, within the active site of the enzyme Cdc25B, is catalytically essential for substrate activation. The most well-reported inhibitors of Cdc25 phosphatases, especially quinone-type inhibitors, function by inducing irreversible oxidation of this active-site cysteine. Here, we identified a natural product, HB-21, having a sesquiterpene lactone skeleton, that can irreversibly bind to cys473 through the formation of a covalent bond. This compound inhibited recombinant human Cdc25B phosphatase with an IC<sub>50</sub> value of 24.25µM. Molecular modeling predicted that HB-21 not only covalently binds to cys473 of Cdc25B but also forms six hydrogen bonds with residues at the active site. Moreover, HB-21 can inhibit the dephosphorylation of cyclin-dependent kinase 1 (CDK1), the natural substrate of Cdc25B, and inhibit cell cycle progression. In summary, HB-21 is a new type of Cdc25B inhibitor with a novel molecular mechanism.

Keywords: Cdc25B inhibitor, sesquiterpene lactone, anticancer, cell cycle progression, covalent binding to protein

# INTRODUCTION

Dual-specificity protein phosphatases (DSP) such as Cdc25s (Cdc25A, Cdc25B, and Cdc25C) play an essential role in cell cycle progression by controlling the phosphorylation state of their natural substrates, cyclin-dependent kinases (CDKs) (Kristjánsdóttir and Rudolph, 2004). Overexpression of Cdc25s and overactivation of CDKs are involved in cancer-associated cell-cycle aberrations (Kristjánsdóttir and Rudolph, 2004). Therefore, Cdc25s have been demonstrated as promising anticancer targets (Boutros et al., 2006, 2007; Xing et al., 2008). Cysteine 473, within the active site of the enzyme Cdc25B, is catalytically essential for activating its natural substrates. Most potent small-molecule inhibitors of the Cdc25 phosphatases are quinone-derived compounds. It has been reported that inhibition of Cdc25B activity can occur by the oxidation of the cys473 through the production of reactive oxygen species (Brisson et al., 2007; Lavecchia et al., 2010, 2012).

Natural products have historically served as a major source of new leads for pharmaceutical development, especially for cancer therapy (Newman and Cragg, 2016). Sesquiterpene lactones (SLs) are among the most prevalent secondary metabolites in plants, especially in Asteraceae (Chadwick et al., 2013). They have been the subject of a number of studies because of their outstanding biological activities, particularly anti-inflammatory and anticancer effects (Lyss et al., 1998; Whan Han et al., 2001; Gertsch et al., 2003; Chen et al., 2011). The α-methylene-γ-lactone group (αMγL) of SLs is regarded as the pharmacophore for these biological effects because of its alkylating power (Chadwick et al., 2013). Many studies have shown that the αMγL group can covalently bind to thiol groups of accessible cysteines through a Michael reaction, as shown in **Figure 1** (Liu et al., 2014; Wu et al., 2017). Here, we identified a natural compound, 6β-hydroxy-tomentosin (**Figure 2**), termed HB-21, which can bind to cysteine 473 of Cdc25B by forming a covalent bond. Furthermore, the related biological effects in cancer cells following treatment with this compound, such as changes in the phosphorylation state of substrates and cell cycle arrest, were confirmed.

### MATERIALS AND METHODS

# Reagents

HB-21 (catalog no. BBP04900) was purchased from BioBioPha Co., Ltd (Kunming, China) with a purity <95%.

# Gene Expression and Protein Purification

The Cdc25B (372–551) coding sequence with an N-terminal TEV cleavage site was inserted into a pCold-GST vector. Protein expression was performed as described previously (Lund et al., 2015). In summary, the Cdc25B catalytic domain (372–551) was expressed in E. coli BL21 (DE3) with an N-terminal GST tag in LB medium supplemented with 50µg/mL ampicillin. The cells were grown at 37°C, and protein expression was induced by adding 0.5 mM IPTG when the OD<sub>600</sub> reached approximately 0.6. The temperature was then reduced to 21°C, and expression continued for 20 h. The cells were collected by centrifugation at 4°C and suspended in lysis buffer (50 mM Tris, pH 8.0, 150 mM NaCl, 0.5 mM DTT, and 0.5 mM PMSF). The suspensions were lysed by ultrasonication, and the supernatant containing soluble protein was collected by centrifugation for 40 min at 19,000 rpm in a Beckman centrifuge at 4°C. The protein was captured on glutathione resin and eluted with lysis buffer containing 20–50 mM L-glutathione. The GST tag was removed by adding HRV 3C protease, and further purification was performed by S-200 size-exclusion chromatography. The purified protein was pooled and frozen at −80°C.

### In vitro Enzymatic Assay

The CycLex® Protein Phosphatase Cdc25B Fluorometric Assay Kit (CycLex, Cat. No. CY-1353) was used to screen for active compounds that inhibit the phosphatase activity of Cdc25B. The activities were measured using the substrate O-methyl fluorescein phosphate (OMFP) in a 96-well microtiter plate assay according to the manufacturer's protocol. In summary, 40 µL of assay mixture and 5 µL of test compound were combined in the wells and incubated for 15 min at room temperature with 5 µL of recombinant Cdc25B. Afterward, 25 µL of stop solution was added. Fluorescence was measured at an excitation wavelength of 485 nm and an emission wavelength of 530 nm using a fluorescence microplate reader (BioTek Instruments, Inc., Winooski, VT, USA).
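Endpoint fluorescence readings from an assay of this kind are commonly converted to percent inhibition relative to an uninhibited control and a no-enzyme blank. The Python sketch below shows that arithmetic under those assumptions. The readings are made-up illustrative numbers, not data from this study, the function name is hypothetical, and the kit protocol may prescribe a different normalization.

```python
def percent_inhibition(sample, control, blank):
    """Percent inhibition from endpoint fluorescence readings.

    sample  : well with enzyme + test compound
    control : well with enzyme + vehicle only (defines 0% inhibition)
    blank   : well without enzyme (defines 100% inhibition)
    """
    window = control - blank
    if window <= 0:
        raise ValueError("control signal must exceed blank signal")
    return 100.0 * (1.0 - (sample - blank) / window)


# Hypothetical readings in relative fluorescence units (RFU).
blank_rfu = 500.0
control_rfu = 10500.0
example_wells = {"well A": 10000.0, "well B": 5500.0, "well C": 1500.0}

for well, rfu in example_wells.items():
    inhibition = percent_inhibition(rfu, control_rfu, blank_rfu)
    print(f"{well}: {inhibition:.0f}% inhibition")
```

Subtracting the blank from both the sample and the control keeps background fluorescence from being counted as residual enzyme activity.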

# Molecular Modeling

The docking method used is described in previous work (Liu et al., 2014). In summary, molecular modeling was performed using Maestro 9.0. The X-ray structure of Cdc25B (PDB code: 1QB0) was downloaded from the Protein Data Bank (PDB, http://www.pdb.org) and prepared with the "Protein Preparation Wizard" workflow using default settings. The grid-enclosing box was generated within 10 Å of cys473 in the refined crystal structure. The structure of HB-21 was prepared using the Ligprep module. Docking was performed using the covalent docking module. The terminal carbon atom of the α-methylene moiety of HB-21 and the sulfur atom of cys473 were specified as the ligand reactive group and the receptor bond, respectively.
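The idea of restricting docking to the region within 10 Å of cys473 can be illustrated with a simple distance filter. The Python sketch below is not the Maestro grid-generation procedure; it only shows the geometric selection, using made-up coordinates and hypothetical atom labels.

```python
import math

CUTOFF_ANGSTROM = 10.0  # radius used in the text for the grid region


def atoms_within(center, atoms, cutoff=CUTOFF_ANGSTROM):
    """Return labels of atoms whose distance to `center` is <= cutoff.

    center : (x, y, z) of the reference atom (e.g., the cys473 sulfur)
    atoms  : dict mapping an atom label to its (x, y, z) coordinates in Å
    """
    return [
        label
        for label, xyz in atoms.items()
        if math.dist(center, xyz) <= cutoff
    ]


# Made-up coordinates for illustration only (not taken from PDB 1QB0).
cys473_sg = (0.0, 0.0, 0.0)
example_atoms = {
    "atom_1": (3.0, 4.0, 0.0),   # 5.0 Å from the reference atom
    "atom_2": (6.0, 8.0, 0.0),   # 10.0 Å away, on the boundary
    "atom_3": (9.0, 9.0, 9.0),   # about 15.6 Å away
}
print(atoms_within(cys473_sg, example_atoms))
```

In practice the coordinates would be read from the prepared structure file rather than typed in by hand.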

# Western Blot

The phosphorylation status of CDK1 was analyzed by Western blotting as described in our previous work (Zhang et al., 2014). In summary, tsFT210 cells (1 × 10<sup>6</sup>) were treated with HB-21 (0, 1, 5, 25µM) for 4 h, and the lysed protein was analyzed on 10% SDS polyacrylamide gels. The protein signals were detected with primary and secondary antibodies according to the manufacturer's instructions. β-Actin was used to normalize the target protein. All antibodies used in this paper were purchased from Cell Signaling Technology, Inc. (China). The data shown in **Figure 6** are representative of two independent experiments.

# Cell Cycle Analysis

Cell cycle analysis followed a previously published method (Tsuchiya et al., 2012). Briefly, tsFT210 cells (1 × 10<sup>5</sup> cells/well) were blocked at the G2/M phase by increasing the temperature from 32 to 39°C and holding for 17 h. Then, the cells were synchronized at 32°C and immediately treated with shikonin. The cells were stained (50µg/ml propidium iodide, 0.1% sodium citrate, and 0.2% NP-40) and analyzed by flow cytometry (BD Biosciences). The concentration of nocodazole used was 100 nM. The data shown in **Figure 7** are representative of two independent experiments.

# Cell Lines and Culture Condition

The cancer cell line tsFT210 was kindly provided by the lab of Dr. Rongcai Yue (School of Pharmacy, Second Military Medical University). The tsFT210 cells were kept in logarithmic growth at 37°C in RPMI-1640 medium supplemented with 10% FBS and 1% penicillin G-streptomycin, in a humidified chamber at 5% CO<sub>2</sub>.

# Mass-Spectrometric Analysis of HB-21 Binding

The molecular weights of the protein and the protein–ligand adducts were determined as described by others (Böth et al., 2013). In brief, 5 mg of Cdc25B (372–551) was incubated with or without 1 mM HB-21 at 20°C for 30 min. Subsequently, the samples were diluted in 0.5 ml denaturing buffer [5% (v/v) acetonitrile, 0.1% (v/v) formic acid, 0.5 mM TCEP], and the molecular weights were determined by ESI-Q-TOF (Waters Corp.).

# RESULTS

# The Inhibitory Effects of HB-21 on Recombinant Human Cdc25B Phosphatase

Using the protein phosphatase Cdc25 combo fluorometric assay kit, HB-21 was found to inhibit recombinant human Cdc25B in vitro in a concentration-dependent manner (**Figure 3**), with an IC<sub>50</sub> value of 24.8 ± 1.63µM. Caulibugulone A has been identified as an inhibitor of Cdc25s (Brisson et al., 2007) and was therefore included as a positive control. For comparison, this compound inhibited Cdc25B with an IC<sub>50</sub> value of 5.37 ± 0.45µM.

# Binding of HB-21 to Cdc25B

TABLE 1 | Covalent adducts formed by Cdc25B with HB-21.

The inhibitory effect of HB-21 on Cdc25B likely occurs through covalent binding to cysteine residues within the active site, as shown in **Figure 1**. Incubation of the truncated form of Cdc25B (372–551) with HB-21 led to the formation of covalent Cdc25B–HB-21 (×3) adducts according to ESI-MS (**Table 1** and **Supplementary Material**). The 21421.76 Da peak was assigned as the molecular mass of the truncated Cdc25B (residues 372–551) because it agreed with the mass calculated from the sequence (21420.52 Da). A new mass peak (22214.18 Da) appeared after incubation with 1 mM HB-21, and this mass corresponds to that of Cdc25B (372–551) plus three HB-21 molecules (**Table 1**). The truncated Cdc25B (374–551) contains five cysteines, as shown in **Figure 4**. However, two of them (Cys426 and Cys523) were not available for the formation of covalently bonded adducts with HB-21, as they were buried under the protein surface. Therefore, three HB-21 molecules covalently bound to Cdc25B (374–551).
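The assignment of three bound ligands follows from dividing the observed mass shift by the mass added per ligand. The Python sketch below reproduces that arithmetic from the two masses quoted above. It assumes that a Michael addition adds the full ligand mass to the protein, and that HB-21 (6β-hydroxy-tomentosin) has the formula C15H20O4; the formula is an assumption introduced here for illustration and is not stated in the text.

```python
# Standard average atomic masses (Da).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

# Assumed formula for HB-21 (6β-hydroxy-tomentosin); not given in the text.
HB21_FORMULA = {"C": 15, "H": 20, "O": 4}


def average_mass(formula):
    """Average molecular mass (Da) from an element-count dictionary."""
    return sum(ATOMIC_MASS[element] * n for element, n in formula.items())


def ligands_bound(apo_mass, adduct_mass, ligand_mass):
    """Estimate adduct stoichiometry from the mass shift.

    A Michael addition adds the whole ligand to the protein, so the mass
    shift divided by the ligand mass approximates the number of ligands.
    """
    shift = adduct_mass - apo_mass
    return shift, shift / ligand_mass


ligand_mass = average_mass(HB21_FORMULA)
# Masses (Da) reported in the text for Cdc25B without and with HB-21.
shift, ratio = ligands_bound(21421.76, 22214.18, ligand_mass)
print(f"Ligand mass: {ligand_mass:.2f} Da")
print(f"Mass shift:  {shift:.2f} Da")
print(f"Shift / ligand mass = {ratio:.2f} -> {round(ratio)} ligands")
```

Under these assumptions the ratio rounds to three, in line with the three accessible cysteines described in the text.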

# Molecular Model of HB-21 Interactions With Cdc25B Catalytic Domain

To evaluate the binding mode and affinity of HB-21 with Cdc25B, molecular modeling was performed using the Glide docking package. The crystal structure of Cdc25B has been solved at a resolution of 1.91 Å, showing that the catalytic domain of Cdc25B contains the canonical HCX<sub>5</sub>R PTPase catalytic-site motif (Reynolds et al., 1999). In this motif, C represents the catalytic cysteine 473, which forms a phosphate-binding loop with the five X residues and arginine 479. In the proposed binding mode of HB-21–Cdc25B, HB-21 covalently binds in the pocket containing cys473 with suitable shape complementarity. In addition, six hydrogen bonds were observed between HB-21 and residues surrounding cys473 (**Figure 5**). Although the covalent bond between HB-21 and cys473 of Cdc25B plays a crucial role in the inhibition process, non-covalent interactions such as hydrogen bonds have previously been proposed to increase the rate of initial site recognition and simultaneously increase binding affinity (Liu et al., 2014).

FIGURE 4 | The sequence of Cdc25B (374-551) and available cysteine residues (× not accessible).

# HB-21 Inhibits CDK1 Dephosphorylation and Delays the Entry Into Mitosis

Endogenous Cdc25 phosphatases control the cell cycle through dephosphorylating their natural substrate, cyclin-dependent kinases (CDKs) (Boutros et al., 2007). Therefore, the CDK1 protein will be hyperphosphorylated if CDC25s are inhibited. To confirm whether HB-21 could inhibit the activity of intracellular Cdc25 phosphatases, the phosphorylation status of CDK1 was analyzed by Western blotting. At concentrations of 5 and 25µM, HB-21 induced an accumulation of the tyrosine 15-phosphorylated form of CDK1 (**Figure 6**). These results suggested that HB-21 downregulated the activity of the Cdc25B phosphatase, leading to hyperphosphorylation of CDK1 in cultured cells. Hence, the effects of this compound on cell cycle progression were examined.

# HB-21 Inhibits Cell Cycle Progression

The inhibition of Cdc25s will result in hyperphosphorylation of CDKs and cell cycle arrest. Therefore, the effects of HB-21 on cell cycle progression were investigated. The tsFT210 cell line has been widely used for the study of cell cycle progression because it can easily be held at different cell cycle phases by changing the temperature (Th'ng et al., 1990). The cell cycle differences between synchronized tsFT210 cells treated with the indicated concentrations of HB-21, nocodazole (a potent mitotic blocker), and 1% DMSO were analyzed by flow cytometry and are shown in **Figure 7**. The positive control nocodazole significantly arrested the cells at the G2/M phase, and DMSO did not show a significant impact on cell cycle progression. By comparison, the HB-21-treated tsFT210 cells were blocked at the G2/M phase in a concentration-dependent manner. These results provide supplementary evidence that HB-21 can target Cdc25B and delay cell cycle progression at the Cdc25B-related G2/M phase.

# DISCUSSION

Cys473 is the cysteine crucial for catalysis by Cdc25B. The most important quinone-derived inhibitors are thought to inactivate the enzyme by oxidizing the thiolate group of cys473 (Cui et al., 2017). Until now, no inhibitors have been reported to bind directly to cys473. This study has shown that the natural product HB-21 can directly bind to cys473 by forming a covalent bond.

The structural diversity of natural products makes them a valuable source for drug development (Lund et al., 2015). Studies on SLs have increased due to their prevalence in plants and their diverse bioactivity (Chadwick et al., 2013). HB-21 is a xanthane-type sesquiterpene that possesses an αMγL moiety. The multiple bioactivities of SLs are attributed to the αMγL unit, which can bind covalently to thiol groups of proteins through a Michael reaction (García-Piñeres et al., 2001). Using mass spectrometry and molecular modeling, this investigation has shown that this mechanism also operates between HB-21 and Cdc25B. Modeling results showed that, in addition to the αMγL unit, other chemical groups of HB-21 are likely to influence the activity of Cdc25B through noncovalent interactions. These interactions might serve as an initial site-recognition step during the binding of HB-21 to Cdc25B. The shape and size of the binding pocket of target proteins are variable, and good shape complementarity between SLs and target proteins is crucial for activity. Therefore, SLs containing more flexible groups show increased activity (Chadwick et al., 2013). Moreover, the residues around the cysteine are thought to play a role in initial site recognition for ligand binding through other intermolecular forces, such as van der Waals forces and hydrogen bonds (Liu et al., 2014). This may also explain why HB-21 showed only moderate inhibitory activity against Cdc25B. Although HB-21 covalently binds to cysteine 473, the noncovalent interactions between Cdc25B and HB-21 may be weak, resulting in slow initial site recognition. Further research, in particular an HB-21:Cdc25B co-crystal structure, is necessary to fully understand the activity of HB-21 at the molecular level. This would allow new HB-21-based derivatives with much higher biochemical activity to be designed.

Further research also needs to address the selectivity of HB-21 for Cdc25B and whether there are additional molecular targets for HB-21 within the cell. In this study, HB-21 began to induce cell cycle arrest in the G2/M phase at a concentration of 5 µM. At this concentration, however, HB-21 has a relatively low inhibitory effect on Cdc25B (<5% inhibition). Three possible reasons may explain this result. (1) Only the reduced state of the cysteine's sulfhydryl group (-SH) can bind covalently to HB-21. This sulfhydryl group is, however, easily oxidized in air, which prevents HB-21 from binding to Cdc25B. (2) The Michael reaction needs time to reach completion, but Cdc25B and HB-21 were incubated for only 15 min in order to keep the protein in a reduced state. (3) In addition to these two reasons, the low inhibitory effect suggests that HB-21 has other intracellular targets, such as the other Cdc25 homologues (Cdc25A and -C), which share common structural properties with Cdc25B, especially the signature motif (HCxxxxxR). This will need to be confirmed in future research.
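The interplay between weak noncovalent recognition and a short incubation time can be illustrated with the standard two-step scheme for covalent inhibitors, E + I ⇌ E·I → E–I, in which the observed inactivation rate is k_obs = k_inact·[I] / (K_I + [I]). The sketch below is only an illustration of that textbook relationship: the values of `K_INACT` and `K_I` are hypothetical placeholders and were not measured in this study, and the model assumes that the inhibitor is in large excess over the enzyme and that the enzyme is not oxidized during the incubation.

```python
import math


def k_obs(inhibitor_uM, k_inact_per_min, K_I_uM):
    """Observed pseudo-first-order inactivation rate (1/min).

    Two-step covalent inhibition: E + I <-> E.I -> E-I,
    with k_obs = k_inact * [I] / (K_I + [I]).
    """
    return k_inact_per_min * inhibitor_uM / (K_I_uM + inhibitor_uM)


def fraction_inactivated(inhibitor_uM, minutes, k_inact_per_min, K_I_uM):
    """Fraction of enzyme covalently modified after a given incubation time."""
    rate = k_obs(inhibitor_uM, k_inact_per_min, K_I_uM)
    return 1.0 - math.exp(-rate * minutes)


# Hypothetical parameters chosen only for illustration (NOT from the study):
K_INACT = 0.05   # maximal inactivation rate, 1/min
K_I = 200.0      # noncovalent recognition constant, uM (weak initial binding)

if __name__ == "__main__":
    # 5 uM inhibitor, as in the text; incubation times are illustrative.
    for minutes in (15, 60, 240):
        frac = fraction_inactivated(5.0, minutes, K_INACT, K_I)
        print(f"{minutes:>4} min: {frac:.1%} of enzyme inactivated")
```

In this model, a large `K_I` (weak initial site recognition) lowers `k_obs` at a given inhibitor concentration, so a short incubation leaves most of the enzyme unmodified, whereas a longer incubation or a derivative with a smaller `K_I` would increase the inactivated fraction.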

In conclusion, this study has identified a new type of Cdc25B inhibitor, HB-21. By binding covalently to Cys473, located within the active site of Cdc25B, HB-21 inhibited the dephosphorylation of the natural Cdc25 substrate CDK1 and blocked cell cycle progression. Neither the in vivo nor the in vitro activity of HB-21 had been evaluated prior to this study. This is the first time that HB-21 has been found to have anticancer activity, which makes it a new molecular template for anticancer drug development. In vivo experiments will be part of our future studies.

### AUTHOR CONTRIBUTIONS

SZ, QG, and ZS designed the experiments. QJ, QG, XF, and YW performed the experiments and analyzed data. SZ wrote the manuscript. ZS edited the manuscript. All the authors read and approved the final manuscript.

### FUNDING

This work was supported by the Project of Qinghai Science & Technology Department (2016-ZJ-Y01, 2018-ZJ-948Q) and the Open Project of State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University (2017-ZZ-02).

### ACKNOWLEDGMENTS

The authors express their gratitude to the Analysis and Testing Center of Qinghai University for its support with mass spectrometry and to Dr. Rongcai Rue for providing tsFT210 cells.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00531/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhang, Jia, Gao, Fan, Weng and Su. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.