QUANTUM MECHANICAL/MOLECULAR MECHANICAL APPROACHES FOR THE INVESTIGATION OF CHEMICAL SYSTEMS – RECENT DEVELOPMENTS AND ADVANCED APPLICATIONS

EDITED BY: Thomas S. Hofer and Sam P. De Visser

PUBLISHED IN: Frontiers in Chemistry

#### Frontiers Copyright Statement

© Copyright 2007-2018 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88945-626-0 DOI 10.3389/978-2-88945-626-0


# QUANTUM MECHANICAL/MOLECULAR MECHANICAL APPROACHES FOR THE INVESTIGATION OF CHEMICAL SYSTEMS – RECENT DEVELOPMENTS AND ADVANCED APPLICATIONS

Topic Editors: Thomas S. Hofer, University of Innsbruck, Austria; Sam P. De Visser, University of Manchester, United Kingdom

The binding mode of the only known ligand robotnikinin to the Zn-site of the extracellular signalling protein Sonic Hedgehog as obtained from a hybrid QM/MM simulation (Hitzenberger et al., 2017). The comparably large QM zone, shown as a blue volume, contains the central peptide-like ring of robotnikinin, the Zn<sup>2+</sup> ion and the side chains of five amino acids critical for protein-ligand (H134, H135) and protein-ion interaction (D148, E177 and H183). A total of 7 QM/MM link bonds, indicated in purple, are present in the QM sub-system. Image: "Shh-Robotnikinin Complex" by Thomas S. Hofer.

The QM/MM method, short for quantum mechanical/molecular mechanical, is a highly versatile approach for the study of chemical phenomena, combining the accuracy of quantum chemistry to describe the region of interest with the efficiency of molecular mechanical potentials to represent the remaining part of the system. Originally conceived in the 1970s in the influential work of the Nobel laureates Martin Karplus, Michael Levitt and Arieh Warshel, QM/MM techniques have evolved into one of the most accurate and general approaches to investigate the properties of chemical systems via computational methods. Whereas the first applications focused on studies of organic and biomolecular systems, a large variety of QM/MM implementations have been developed over the last decades, extending the range of applicability to research questions relevant for solution and solid-state chemistry as well.

Although QM/MM methods approach their 50th anniversary in 2022, the formulation of improved variants is still an active field of research, with the aim to (i) extend the applicability to an even broader range of research questions in chemistry and related disciplines, and (ii) push the accuracy achieved in the QM/MM description beyond that of established formulations. While the QM/MM strategy is highly successful on its own, its combination with other established theoretical techniques greatly extends the capabilities of the computational approaches. For instance, the integration of a suitable QM/MM technique into the highly successful Monte Carlo and molecular dynamics simulation protocols enables the description of chemical systems on the basis of an ensemble that is in part constructed on a quantum-mechanical basis.

This eBook presents the contributions of a recent Research Topic published in *Frontiers in Chemistry* that highlight novel approaches as well as advanced applications of the QM/MM method to a broad variety of targets. In total, 2 review articles and 10 original research contributions from 48 authors are presented, covering 12 different countries on four continents. The range of research questions addressed by the individual contributions provides a lucid overview of the versatility of the QM/MM method and demonstrates the general applicability and accuracy that can be achieved for different problems in the chemical sciences. Together with the development of improved algorithms to enhance the capabilities of quantum chemical methods and the continuous advancement in the capacities of computational resources, the impact of QM/MM methods in the chemical sciences can be expected to increase further in the near future.

Citation: Hofer, T. S., De Visser, S. P., eds. (2018). Quantum Mechanical/Molecular Mechanical Approaches for the Investigation of Chemical Systems – Recent Developments and Advanced Applications. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-626-0

# Table of Contents

## CHAPTER I

#### EDITORIAL

*Editorial: Quantum Mechanical/Molecular Mechanical Approaches for the Investigation of Chemical Systems – Recent Developments and Advanced Applications*

Thomas S. Hofer and Sam P. de Visser

## CHAPTER II

#### REVIEW ARTICLES

*Chemical Reactivity and Spectroscopy Explored From QM/MM Molecular Dynamics Simulations Using the* LIO *Code*

Juan P. Marcolongo, Ari Zeida, Jonathan A. Semelak, Nicolás O. Foglia, Uriel N. Morzan, Dario A. Estrin, Mariano C. González Lebrero and Damián A. Scherlis

*Steady-State Linear and Non-linear Optical Spectroscopy of Organic Chromophores and Bio-macromolecules*

Marco Marazzi, Hugo Gattuso, Antonio Monari and Xavier Assfeld

## CHAPTER III

#### NOVEL METHODOLOGIES AND BENCHMARKING OF QM/MM TECHNIQUES


Mingyuan Xu, Tong Zhu and John Z. H. Zhang

*QM Cluster or QM/MM in Computational Enzymology: The Test Case of LigW-Decarboxylase*

Mario Prejanò, Tiziana Marino and Nino Russo

*Interfacing the Core-Shell or the Drude Polarizable Force Field With Car–Parrinello Molecular Dynamics for QM/MM Simulations*

Sudhir K. Sahoo and Nisanth N. Nair

## CHAPTER IV

#### ADVANCED APPLICATIONS OF QM/MM TECHNIQUES


Amy Timmins and Sam P. de Visser


Huimin Zhang, Tianqing Song, Yizhao Yang, Chenggong Fu and Jiazhong Li


Andrés M. Escorcia and Matthias Stein

# Editorial: Quantum Mechanical/Molecular Mechanical Approaches for the Investigation of Chemical Systems – Recent Developments and Advanced Applications

#### Thomas S. Hofer <sup>1</sup> \* and Sam P. de Visser <sup>2</sup>

*<sup>1</sup> Theoretical Chemistry Division, Institute of General, Inorganic and Theoretical Chemistry, University of Innsbruck, Innsbruck, Austria, <sup>2</sup> School of Chemical Engineering and Analytical Science, Manchester Institute of Biotechnology, The University of Manchester, Manchester, United Kingdom*

Keywords: hybrid QM/MM, ab initio methods, quantum chemistry, force fields, molecular mechanics, density functional theory

**Editorial on the Research Topic**

#### Edited and reviewed by:

*Hans Martin Senn, University of Glasgow, United Kingdom*

> \*Correspondence: *Thomas S. Hofer t.hofer@uibk.ac.at*

#### Specialty section:

*This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry*

Received: *29 June 2018* Accepted: *30 July 2018* Published: *13 September 2018*

#### Citation:

*Hofer TS and de Visser SP (2018) Editorial: Quantum Mechanical/Molecular Mechanical Approaches for the Investigation of Chemical Systems – Recent Developments and Advanced Applications. Front. Chem. 6:357. doi: 10.3389/fchem.2018.00357*

#### **Quantum Mechanical/Molecular Mechanical Approaches for the Investigation of Chemical Systems – Recent Developments and Advanced Applications**

With the advent of microprocessor technology in the late 1960s (Moore, 1965; Whitworth, 1979; Brinkman et al., 1997), the foundation for a novel interdisciplinary field of research known today as scientific computing was laid. Located at the intersection between mathematics and computational sciences on the one hand and scientific disciplines such as physics and chemistry on the other, computational methods have assumed a dominant role in modern science and engineering, enabling investigations of a broad variety of phenomena in effectively every sub-discipline of these vast fields of research.

One of the main challenges for the successful application of computational models to study chemical systems rests with the accurate description of the interaction between atoms and molecules, the two main approaches being quantum mechanics (QM) (Parr and Yang, 1994; Szabo and Ostlund, 1996; Helgaker et al., 2000; Koch and Holthausen, 2002; Cook, 2005; Sholl and Steckel, 2009) and molecular mechanics (MM) (Leach, 2001; Jensen, 2006; Ramachandran et al., 2008) methods. The latter employ empirical (i.e., parametrised), classical representations of the interactions, based on comparably simple potential formulations such as harmonic springs to describe bonds and valence angles as well as Coulomb and Lennard-Jones interactions to account for electrostatic and van der Waals-type non-bonded contributions, respectively. These approaches, often referred to as molecular force fields (FFs), provide a versatile and efficient description of chemical systems, provided that the large number of parameters involved is carefully adjusted and balanced. Typical applications of FF-based studies are simulations of biomolecules such as proteins and nucleic acids. Nevertheless, some systems relevant for material sciences can be treated equally well with these approaches, and studies in polymer chemistry in particular (where the systems are composed of the same elements and similar functional groups as found, for instance, in peptides and proteins) often include FF-based methods.
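As a minimal sketch of the potential forms named in this paragraph, the following Python snippet evaluates a harmonic bond term, a Coulomb term and a 12-6 Lennard-Jones term. The function names, the 1/2 k convention for the spring, the unit system (kcal/mol, Å, elementary charges) and all numerical parameters are illustrative assumptions and are not taken from any particular force field.

```python
import math

# Coulomb constant in kcal mol^-1 Å e^-2 (approximate value; assumed unit system).
COULOMB_CONSTANT = 332.06


def harmonic_bond_energy(r, r0, k):
    """Harmonic spring for a bond of length r (assumed convention: E = 1/2 k (r - r0)^2)."""
    return 0.5 * k * (r - r0) ** 2


def coulomb_energy(q_i, q_j, r):
    """Charge-charge interaction between two point charges at distance r."""
    return COULOMB_CONSTANT * q_i * q_j / r


def lennard_jones_energy(r, epsilon, sigma):
    """12-6 Lennard-Jones potential: short-range repulsion plus dispersion."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Hypothetical parameters, chosen only to exercise the functions.
bond = harmonic_bond_energy(r=1.05, r0=1.00, k=900.0)
elec = coulomb_energy(q_i=0.4, q_j=-0.8, r=3.0)
vdw = lennard_jones_energy(r=3.5, epsilon=0.15, sigma=3.2)
print(f"bond {bond:.3f}  Coulomb {elec:.3f}  LJ {vdw:.3f}  (kcal/mol)")
```

A full force field sums such terms over all bonds, angles and atom pairs; the quality of the result then depends on how well the parameters (here `k`, `r0`, the charges, `epsilon` and `sigma`) are balanced against each other.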

MM models have also been extended to include more complex phenomena such as polarization (Yu and van Gunsteren, 2005; Baker, 2015; Lemkul et al., 2016) and many-body contributions (Stone, 1995), extending their applicability to areas in which simplistic force field approaches are unreliable, such as studies of solid-state interfaces and semi-conducting systems as well as metals and alloys. However, a key shortcoming inherent to the majority of FF approaches is the inability to describe the formation and cleavage of covalent chemical bonds. While so-called reactive force fields (van Duin et al., 2001; Mahadevan and Garofalini, 2007; Hartke and Grimme, 2010; Liang et al., 2013) have been developed to make such processes accessible in the regime of molecular mechanics, a quantum description of the system is often the natural choice.

QM-based descriptions of chemical systems (Parr and Yang, 1994; Szabo and Ostlund, 1996; Helgaker et al., 2000; Koch and Holthausen, 2002; Cook, 2005; Sholl and Steckel, 2009) partition the atoms into the nuclei and the surrounding electrons, inherently taking all shifts in the electron density resulting from polarization, many-body contributions and even charge-transfer into account. Although a key challenge of QM methods is the accurate description of the correlated motion of electrons (Raghavachari and Anderson, 2010; Popelier, 2011; McDonagh et al., 2017), the hierarchy of quantum chemical approaches established over the last decades (Szabo and Ostlund, 1996; Helgaker et al., 2000; Cook, 2005) provides a versatile framework for the study of challenging chemical phenomena. Since no empirical parameters are required in a QM-based description, quantum chemical methods are not restricted to a particular class of molecules and are thus generally applicable to achieve first-principles descriptions of chemical systems. Unfortunately, these benefits come at a cost, namely a substantially higher computational effort than MM-based approaches, which dramatically limits the treatable system size.

In order to combine the advantages of MM and QM methods, hybrid QM/MM approaches (Gao, 1993; Bakowies and Thiel, 1996; Lin and Truhlar, 2007; Senn and Thiel, 2007, 2009; Metz et al., 2014; Pezeshki and Lin, 2015; Zheng and Waller, 2016) have been devised: In this framework the most relevant part of the chemical system is treated on the basis of a suitable quantum chemical method, while classical MM potentials are considered sufficiently accurate to model the remaining part of the system. This innovative idea was pioneered by Martin Karplus, Michael Levitt and Arieh Warshel (Warshel and Levitt, 1976; Field et al., 1990; Lyne et al., 1990; Aaqvist and Warshel, 1993; Warshel, 2002) in the 1970s, who were awarded the Nobel prize in chemistry for the development of multiscale models for complex chemical systems in 2013. Today, four decades after these influential developments, QM/MM methods are regarded as one of the most influential approaches for the description of challenging chemical phenomena. Initially conceived in the framework of biomolecular simulations (Friesner and Guallar, 2005; Hu and Yang, 2009; van der Kamp and Mulholland, 2013; de Visser et al., 2014; Cui, 2016; Lu et al., 2016; Quesne et al., 2016), the range of QM/MM methods has been substantially extended to include other areas accessing inter alia solid-state chemistry and material science (Gonis and Garland, 1977; Krüger and Rösch, 1994; Stefanovich and Truong, 1996; Jacob et al., 2001; Herschend et al., 2004; Keal et al., 2011; Bjornsson and Bühl, 2012; Golze et al., 2013, 2015; Hofer and Tirler, 2015) and solution chemistry (Staib and Borgis, 1995; Tuñón et al., 1995, 1996; Gao, 1996; Hofer et al., 2010, 2011, 2012; Weiss and Hofer, 2012; Hofer, 2014) as well. These QM/MM studies have given insight into how Nature works, and, for instance, explain regio- and stereochemical selectivities during substrate activation (Faponle et al., 2016, 2017; Timmins et al., 2017). 
Furthermore, using computational modeling, predictions can be made to engineer proteins and enzymes; in a recent example, the computationally proposed change led to a full reversal of enantioselectivity (Pratter et al., 2013a,b).

Despite their success and widespread recognition, the development of advanced QM/MM methodologies is still an active field of research, aiming to push the accuracy and applicability of this versatile approach even further. This article collection aims to present an overview of present research activities focused on the development and application of modern QM/MM formulations, demonstrating the versatile capabilities of this celebrated methodology.

A total of 12 exciting contributions by 48 authors from 12 different countries on four continents have been included in this article collection, which contains ten original research contributions and two review articles.

Scherlis and his team compiled a review article that covers recent applications of their graphics-processor-accelerated LIO code for density functional theory calculations (Marcolongo et al.). The presented examples include the decomposition of nitroxyl in aqueous solution and a comparison of the reactivity of thiols against peroxides in aqueous solution and in the active site of a peroxiredoxin enzyme. The latter studies are linked to molecular spectroscopy such as the vibrational spectra of the aqueous peroxynitrite anion, LiAlH<sub>4</sub> and AlH<sub>4</sub><sup>−</sup>. Furthermore, using their novel code they predict the UV/Vis spectra of (HO)NS<sub>2</sub> and the so-called NO/H<sub>2</sub>S "cross-talk" system.

The second review article by Marazzi et al. presents a comprehensive overview of recent research activities in studying electronic spectroscopy via QM/MM approaches. A broad variety of examples in the fields of linear absorption, non-linear optical properties and circular dichroism applied to organic molecules, proteins and nucleic acid systems is presented.

Prejanò et al. compared the application of a QM cluster model to the decarboxylation of 5-carboxyvanillate by LigW with results obtained from a more elaborate QM/MM setup. Their study indicates that the reaction is mainly influenced by the constituents of the active site and that the cluster model already delivers a reliable description for the presented system.

The article of Escorcia and Stein explored the influence of a conserved arginine residue of the E. coli Hyd-1 [NiFe]-hydrogenase on the H<sub>2</sub> oxidation reaction via DFT-based QM/MM calculations. This study highlights the key influence of this Arg residue in promoting both the access of molecular hydrogen to the catalytically active Ni atom and the associated proton transfer to nearby terminal cysteine residues.

Xu et al. present a novel force-balanced simulation approach, separating a protein system into hydrogen-bonded fragments that are computed quantum-mechanically, while the AMOEBA force field is employed to describe long-range non-bonded interactions. To conserve the total energy of the system, a force balancing of the hydrogen link atoms is carried out. The applicability of this approach is demonstrated for linear ACE-(ALA)<sub>9</sub>-NME as well as the 56-residue GB3 peptide.

The contribution of Sahoo and Nair presents a combination of polarizable Drude oscillators with the well-established Car-Parrinello Molecular Dynamics framework via an extended Lagrangian QM/p-MM method. The approach is demonstrated for a H<sub>2</sub>O(QM)+4H<sub>2</sub>O(p-MM) test system and applied to study an O-vacancy in α-cristobalite, the hydrogenation of ethene via Y-zeolite-supported Rh clusters and the H<sup>+</sup> exchange between methane and a H-ZSM-5 zeolite.

The difference between additive and subtractive QM/MM protocols has been highlighted in the contribution by Cao and Ryde, focusing inter alia on the different correction schemes to account for errors introduced by the application of link atoms. Three different systems of increasing complexity have been studied, namely an isolated ethanol molecule, sulfite oxidase and the conversion of oxophlorin to verdohaem by haem oxygenase.

Berraud-Pache et al. studied the keto-enol tautomerisation reaction of oxyluciferin, which represents the emissive species in the bioluminescent system of fireflies. By combining classical molecular dynamics studies of the active species in a polarisable continuum and explicit QM/MM calculations, the keto-OxyLH<sup>−</sup> species was identified as the most likely candidate to act as emitter in bioluminescence.

He and colleagues applied an automated fragmentation QM/MM protocol to study <sup>1</sup>H chemical shifts of the apo- and holo-neocarzinostatin-chromophore binding complex (Jin et al.). The calculated NMR data obtained by the fragmented QM/MM approach proved to be in good agreement with results of large-scale calculations as well as experimental data.

The research team of Lin studied the migration of Cl<sup>−</sup> through the transmembrane domain of a prototypical E. coli chloride channel (Wang et al.), thereby including the entire pore section into the quantum-mechanically treated zone. The obtained results demonstrate that the influence of electron delocalization, inherently taken into account at the QM level of theory, appears to be more critical than previously considered.

Timmins and De Visser investigated the impact of different mutations in prolyl-4-hydroxylase via a combined QM/MM and MD study. Based on the results of this extensive study, two mutants with the potential of displaying notable changes in the regio- and stereoselectivity could be identified.

Hitzenberger et al. provided a contribution combining docking and pharmacophore modeling with QM-based molecular dynamics simulations to investigate the binding of the only known inhibitor robotnikinin to the Zn-site of the extracellular signaling protein Sonic Hedgehog. Comparison to a purely classical molecular dynamics simulation highlights the substantially improved description of the binding observed in the QM/MM MD simulation.

In the contribution of Frau and Glossman-Mitnik the influence of different range-separated hybrid DFT methods in the prediction of chemical reactivity descriptors was evaluated.

Finally, Li and coworkers investigated the interaction mechanism between cyclopeptide DC3 and an androgen receptor via free energy calculations and extensive molecular dynamics simulations (Zhang et al.).

Clearly, the scope of topics covered by the contributions in this article collection demonstrates the widespread capabilities of the QM/MM technique as a general and versatile approach to address a broad spectrum of research questions. Moreover, the research field is as active as ever and moving in many different directions. In conjunction with the formulation of advanced theoretical approaches, efficient simulation programs and the development of improved computational infrastructure, QM/MM methods have proved to be an indispensable tool in modern chemical research, providing a highly successful alternative route for the study of complex chemical phenomena, which can be expected to play an even more dominant role in the coming years.

# AUTHOR CONTRIBUTIONS

The article collection was edited by SdV and TH. The respective editorial was compiled by TH.


P450 peroxygenase: what drives the reaction to biofuel production? Chem. Eur. J. 22, 5478–5483. doi: 10.1002/chem.201600739


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Hofer and de Visser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chemical Reactivity and Spectroscopy Explored From QM/MM Molecular Dynamics Simulations Using the LIO Code

Juan P. Marcolongo<sup>1†</sup>, Ari Zeida<sup>1,2†</sup>, Jonathan A. Semelak<sup>1</sup>, Nicolás O. Foglia<sup>1</sup>, Uriel N. Morzan<sup>1</sup>, Dario A. Estrin<sup>1</sup>, Mariano C. González Lebrero<sup>1</sup>\* and Damián A. Scherlis<sup>1</sup>\*

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Yong Wang, Lanzhou Institute of Chemical Physics (CAS), China; Nino Russo, Dipartimento di Chimica e Tecnologie Chimiche, Università della Calabria, Italy

#### \*Correspondence:

Mariano C. González Lebrero nanolebrero@qi.fcen.uba.ar Damián A. Scherlis damian@qi.fcen.uba.ar

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 02 February 2018 Accepted: 05 March 2018 Published: 21 March 2018

#### Citation:

Marcolongo JP, Zeida A, Semelak JA, Foglia NO, Morzan UN, Estrin DA, González Lebrero MC and Scherlis DA (2018) Chemical Reactivity and Spectroscopy Explored From QM/MM Molecular Dynamics Simulations Using the LIO Code. Front. Chem. 6:70. doi: 10.3389/fchem.2018.00070

<sup>1</sup> DQIAyQF, INQUIMAE-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina, <sup>2</sup> Departamento de Bioquímica and Center for Free Radical and Biomedical Research, Facultad de Medicina, Universidad de la República, Montevideo, Uruguay

In this work we present the current advances in the development and the applications of LIO, a lab-made code designed for density functional theory calculations on graphics processing units (GPUs) that can be coupled with different classical molecular dynamics engines. This code has been thoroughly optimized to perform efficient molecular dynamics simulations at the QM/MM DFT level, allowing for an exhaustive sampling of the configurational space. Selected examples are presented for the description of chemical reactivity in terms of free energy profiles, and also for the computation of optical properties, such as vibrational and electronic spectra in solvent and protein environments.

Keywords: QM/MM, DFT, GPU, free energy, TDDFT

#### INTRODUCTION

The accurate prediction of the physicochemical properties of molecules within a realistic description of the surrounding environment is essentially the driving force in the development of QM/MM methods. The evolution of this field is not only associated with theoretical progress, but is also closely related to, and often constrained by, the advances in computer power. The last few decades have witnessed the birth and growth of two large categories of QM/MM strategies: those based on expensive QM levels of theory [like Density Functional Theory (DFT) or perturbation methods], but only applicable to dozens of quantum-mechanical atoms and a few nuclear configurations; and those that may account for thermal effects, i.e., thousands or even millions of configurations, affordable at the expense of a reduction in the accuracy of the electronic structure calculations, mostly by means of semiempirical approaches.

In this contribution, we present and review the current advances in the development of LIO, a lab-made code designed for electronic structure calculations exploiting the computational advantages of graphics processing units. This code can be coupled to molecular dynamics (MD) engines to perform QM/MM simulations, allowing for exhaustive MD sampling at the DFT level. In the next section, we provide an overview of the program structure and its capabilities, accompanied by some benchmarks illustrating efficiency. Then, we review a series of representative applications of our methodology, namely the description of chemical reactivity in terms of free energy profiles, and the prediction of optical properties, such as vibrational and electronic spectra. Closing remarks are given in the final section, along with some observations concerning current and future directions.

# THE LIO CODE

LIO (https://github.com/MALBECC/LIO) is a highly efficient tool to solve the electronic structure problem in molecules using DFT and Time Dependent DFT (TDDFT) frameworks. LIO is designed to be used as a standalone application or as a library to be combined with other molecular dynamics packages to run QM/MM simulations. This versatility constitutes a design advantage to maximize its applicability and power. While its roots go back to the early 1990s, to the program originally authored by Estrin et al. (1993), with the first QM/MM applications on chemical reactivity dating from only a few years later (Elola et al., 1999), today LIO is a project in continuous evolution with a focus on performance, and the outcome of an interdisciplinary effort bringing together developers from chemistry and computer science backgrounds. LIO has been interfaced with the Amber package (Pearlman et al., 1995) to perform Born-Oppenheimer molecular dynamics and electron dynamics simulations in a hybrid QM/MM framework. In this context, the total energy is obtained according to the electrostatic embedding, additive QM/MM formulation:

$$E = E_{QM} + E_{MM} + E_{QM-MM} \tag{1}$$

where the first term on the right hand side corresponds to the QM Kohn-Sham energy, the second one to the MM force field potential, and the third one is the coupling energy between the classical and quantum regions.

$$\begin{aligned} E_{QM}\left[\rho\right] ={}& T_s\left[\rho\right] + \sum_{I} \int \frac{\rho(\mathbf{r}) Z_{I}}{|\mathbf{r} - \mathbf{R}_{I}|} d\mathbf{r} + \frac{1}{2} \int \int \frac{\rho(\mathbf{r}_1)\rho(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|} d\mathbf{r}_1 d\mathbf{r}_2 \\ &+ E_{xc}\left[\rho\right] + \sum_{I} \sum_{J > I} \frac{Z_{I}Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} \end{aligned} \tag{2}$$

$$\begin{aligned} E_{QM-MM} ={}& E_{LJ}^{QM-MM} \left( |\mathbf{R}_A - \mathbf{R}_I| \right) + \sum_{A \in MM} q_A \int \frac{\rho(\mathbf{r})}{|\mathbf{r} - \mathbf{R}_A|} d\mathbf{r} \\ &+ \sum_{I \in QM} \sum_{A \in MM} \frac{Z_I q_A}{|\mathbf{R}_A - \mathbf{R}_I|} \end{aligned} \tag{3}$$

In the equations above, ρ represents the electron density, T<sub>s</sub> the electronic kinetic energy, E<sub>xc</sub> the exchange-correlation term, Z<sub>I</sub> the atomic number of quantum atom I, and q<sub>A</sub> the partial charge of the classical atom A. E<sub>LJ</sub><sup>QM−MM</sup> is a non-electrostatic term, which describes dispersion and short-range repulsion effects between QM and MM atoms, using Lennard-Jones potentials, consistently with the MM force field.
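The two purely classical contributions to the coupling energy, the Lennard-Jones term and the interaction between QM nuclei and MM point charges, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LIO's implementation: the class and function names are hypothetical, atomic units are assumed throughout, the Lorentz-Berthelot combination rules are an assumption of the sketch, and the term involving the electron density is omitted because it requires the QM calculation itself.

```python
import math
from dataclasses import dataclass


@dataclass
class QMAtom:
    position: tuple      # Cartesian coordinates (bohr)
    atomic_number: int   # Z_I
    epsilon: float       # Lennard-Jones well depth (hartree)
    sigma: float         # Lennard-Jones diameter (bohr)


@dataclass
class MMAtom:
    position: tuple      # Cartesian coordinates (bohr)
    charge: float        # partial charge q_A (units of e)
    epsilon: float       # Lennard-Jones well depth (hartree)
    sigma: float         # Lennard-Jones diameter (bohr)


def nuclei_point_charge_energy(qm_atoms, mm_atoms):
    """Sum over I in QM and A in MM of Z_I q_A / |R_A - R_I| (atomic units)."""
    return sum(
        qm.atomic_number * mm.charge / math.dist(qm.position, mm.position)
        for qm in qm_atoms
        for mm in mm_atoms
    )


def lennard_jones_coupling(qm_atoms, mm_atoms):
    """Pairwise 12-6 Lennard-Jones energy between QM and MM atoms."""
    total = 0.0
    for qm in qm_atoms:
        for mm in mm_atoms:
            # Lorentz-Berthelot combination rules (an assumption of this sketch).
            epsilon = math.sqrt(qm.epsilon * mm.epsilon)
            sigma = 0.5 * (qm.sigma + mm.sigma)
            sr6 = (sigma / math.dist(qm.position, mm.position)) ** 6
            total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total


# Hypothetical two-atom example: one QM oxygen nucleus and one MM point charge.
qm_region = [QMAtom(position=(0.0, 0.0, 0.0), atomic_number=8, epsilon=2.0e-4, sigma=6.0)]
mm_region = [MMAtom(position=(6.0, 0.0, 0.0), charge=-0.8, epsilon=2.0e-4, sigma=6.0)]
classical_coupling = (nuclei_point_charge_energy(qm_region, mm_region)
                      + lennard_jones_coupling(qm_region, mm_region))
print(f"classical part of the coupling energy: {classical_coupling:.6f} hartree")
```

In an electrostatic embedding scheme the remaining term, the interaction of the electron density with the MM charges, enters the QM Hamiltonian, so the density is polarized by the classical environment.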

The kinetic energy and nuclear attraction contributions are calculated in terms of analytical one electron integrals, which are derived using Obara-Saika recursive equations (Obara and Saika, 1986). The electron repulsion term is computed in terms of two-electron repulsion integrals (ERI), which are also derived recursively. The exchange correlation energy is calculated using numerical spherically centered grids (Becke, 1988). The flow diagram of the computation scheme is depicted in **Figure 1.**

In the development of efficient algorithms for electronic structure calculations it is very important to consider the size of the systems to be treated. Hence, in LIO we put our focus on medium-sized systems (a few tens of atoms), which are the typical dimensions of the QM region in hybrid calculations. As can be seen in **Figure 1**, in this QM/MM scheme the major computational cost goes into the calculation of the exchange-correlation, electron repulsion (ERI), and QM/MM energy terms (in that order) and the corresponding force contributions. In medium-sized systems the cost of the computation is dominated by the exchange-correlation integral, which scales linearly, whereas for larger systems the calculation of ERIs (which scales quadratically) and the diagonalization of the Fock matrix (which scales cubically) become increasingly important.

Several optimization schemes were implemented to improve performance, such as a linear scaling algorithm for exchange-correlation (Stratmann et al., 1996) or the use of auxiliary basis functions for the ERIs (Stratmann et al., 1996). In addition, these terms are totally or partially computed on the GPU (Nitsche et al., 2014). The overall result is a code that performs QM/MM molecular dynamics with high efficiency, allowing for the computation of systems and/or properties at reduced computational cost. **Table 1** shows the computation timings for some selected QM/MM systems discussed in the following sections, and also for the case of the activation of copper-translocating P-type ATPase from Archaeoglobus fulgidus (AfCopA), an enzyme that couples the energy of ATP hydrolysis to catalyze Cu<sup>+</sup> translocation across cellular membranes (see **Figure 2**) (Tsuda and Toyoshima, 2009).

Aside from standard molecular dynamics simulations, LIO can propagate the electron density as a function of time for a fixed molecular configuration, becoming, to the best of our knowledge, the first real-time TDDFT (RT-TDDFT) implementation in a QM/MM setting (Morzan et al., 2014). The evolution of the density matrix ρ is performed according to the Liouville-von Neumann equation:

$$\frac{\partial \rho}{\partial t} = \frac{1}{i\hbar} \left[ H, \rho \right] \tag{4}$$

where H is the Kohn-Sham matrix. The integration of this equation can be realized with either a Verlet integration scheme,

$$
\rho \left( t + \Delta t \right) = \frac{2}{i\hbar} \left[ H(t), \rho(t) \right] \Delta t + \rho(t - \Delta t) \tag{5}
$$

or the Magnus expansion to first order,

$$\begin{aligned} \rho \left( t + \Delta t \right) &= \rho \left( t \right) - i \Delta t \left[ H \left( t + \Delta t / 2 \right), \rho \left( t \right) \right] \\ &- \frac{\Delta t^2}{2!} \left[ H \left( t + \Delta t / 2 \right), \left[ H \left( t + \Delta t / 2 \right), \rho \left( t \right) \right] \right] \\ &+ i \frac{\Delta t^3}{3!} \left[ H \left( t + \Delta t / 2 \right), \left[ H \left( t + \Delta t / 2 \right), \left[ H \left( t + \Delta t / 2 \right), \rho \left( t \right) \right] \right] \right] + \dots \end{aligned} \tag{6}$$

TABLE 1 | LIO performance in ground state molecular dynamics simulations, in terms of timings per QM/MM step.

Calculations were conducted on an Intel(R) Core(TM) i5-7400 CPU @ 3.00 GHz and a GeForce GTX 980 Ti GPU.

The advantage associated with the Magnus expansion is that it allows a greater Δt than the Verlet algorithm (usually in the order of 10 to 20 times larger), reducing the total number of steps needed for a fixed simulation time. On the other hand, the computational burden associated with the Magnus algorithm is much larger because it entails a greater number of matrix multiplications. However, these operations can be efficiently handled by GPUs, with a particularly high impact on the evaluation of the Magnus expansion, which is usually truncated at order 10 to 50. Therefore, the use of the Magnus integrator turns out to be much more efficient than the use of the Verlet scheme when running on GPU.
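The two integrators of Equations (5) and (6) can be compared on a toy problem. The following sketch is only an illustration under stated assumptions: a hypothetical, time-independent 2×2 Hamiltonian in atomic units (ħ = 1), so that H(t + Δt/2) = H, and an analytic reference obtained by diagonalization. It does not reproduce LIO's GPU implementation, and the function names are invented for the example.

```python
import numpy as np

def commutator(a, b):
    return a @ b - b @ a

def verlet_step(h, rho_now, rho_prev, dt):
    """Eq. (5) with hbar = 1: rho(t+dt) = (2/i) [H, rho(t)] dt + rho(t-dt)."""
    return (2.0 * dt / 1j) * commutator(h, rho_now) + rho_prev

def magnus_step(h_mid, rho, dt, order=10):
    """Eq. (6): nested-commutator series, truncated after `order` terms."""
    term = rho.astype(complex)
    result = term.copy()
    for n in range(1, order + 1):
        # term_n = (-i dt)^n / n! * [H, [H, ... [H, rho]]]
        term = (-1j * dt / n) * commutator(h_mid, term)
        result = result + term
    return result

def exact_density(h, rho0, t):
    """Reference solution rho(t) = U rho0 U^dagger with U = exp(-i H t)."""
    evals, evecs = np.linalg.eigh(h)
    u = evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T
    return u @ rho0 @ u.conj().T

# Hypothetical two-level model.
h = np.array([[0.0, 0.1], [0.1, 1.0]])
rho0 = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)
total_time = 10.0

# Magnus propagation with a time step 20 times larger than the Verlet one.
dt_magnus = 0.2
rho_m = rho0.copy()
for _ in range(int(round(total_time / dt_magnus))):
    rho_m = magnus_step(h, rho_m, dt_magnus)

# Verlet needs rho(t - dt) to start; here it is taken from the reference.
dt_verlet = 0.01
rho_prev = exact_density(h, rho0, -dt_verlet)
rho_now = rho0.copy()
for _ in range(int(round(total_time / dt_verlet))):
    rho_prev, rho_now = rho_now, verlet_step(h, rho_now, rho_prev, dt_verlet)

rho_ref = exact_density(h, rho0, total_time)
err_magnus = np.abs(rho_m - rho_ref).max()
err_verlet = np.abs(rho_now - rho_ref).max()
```

In this toy case each Magnus step costs `order` commutators instead of one, which is the trade-off described above: fewer but heavier steps, dominated by matrix multiplications.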

The TDDFT scheme involves the evaluation of the Fock matrix, so all the optimizations made for SCF calculations are exploited here as well. Moreover, when working with fixed nuclear positions, other optimizations can be applied, especially in the QM/MM coupling integral and ERIs, significantly reducing their cost and leaving a quasi-linear scaling method. The efficiency of TDDFT depends on both the cost of one simulation step and the maximum time step (Δt) that can be used. The latter varies with the type of atoms in the QM zone and can be extended by using better propagators and/or freezing the inner electron density through the use of pseudopotentials. For a more detailed discussion please refer to Foglia et al. (2017). **Table 2** illustrates the performance of TDDFT calculations comparing the full electron scheme with a pseudopotential approach.

# EXAMPLES

## Chemical Reactivity

The elucidation of the thermodynamics and kinetics of chemical reactions is one of the main goals of theoretical chemistry. In this context, the computation of free energy profiles assisted by QM/MM schemes has proved extremely useful to obtain mechanistic information (Kollman, 1993; Chipot and Pearlman, 2002; Hu and Yang, 2008; Carvalho et al., 2014). The two key ingredients for obtaining a meaningful free energy profile are the selection of the QM region and the QM level of theory on one hand, and the quality of the sampling on the other. Many sampling methodologies were developed and tested in order to optimize the tradeoff between cost and accuracy (Chipot and Pohorille, 2007). Each scheme presents advantages and caveats, but all of them rely on an extensive sampling of configurations of the system along the reactive process. The speedups described in the previous section, and the fact that LIO can be coupled to different MD engines, like Gromacs (Van Der Spoel et al., 2005) or Amber (Pearlman et al., 1995), allow us to attain this kind of sampling at the DFT level. It is worth noting that obtaining each of these profiles often requires ∼0.5–1 ns of QM/MM MD sampling. The following sections show selected examples of chemical reactions studied with our code, both in aqueous solution and in enzyme-catalyzed processes. In these particular cases, the QM region computations were performed at the generalized gradient approximation (GGA) level using the PBE combination of exchange and correlation functionals, with a dzvp basis set (Godbout et al., 1992). The MM region was treated with the Amber99 force field (Lindorff-Larsen et al., 2010).

#### Reactivity in Aqueous Solution: Nitrous Oxide Formation Upon Nitroxyl Decomposition

Modeling solvent effects on chemical reactivity, and the influence of aqueous solvation in particular, has been among the main goals of computational chemistry, since many of the most relevant processes in chemistry, biochemistry, and materials science take place in solution or at a solid-liquid interface. Here we illustrate the application of our code to aqueous chemistry through an example involving nitroxyl, a key species in redox biochemistry. Nitroxyl (HNO) plays different roles in nitrosative stress processes and is of great interest in pharmacology due to its potential use in heart failure treatment, as well as its vasodilator properties and its role in cellular metabolism (Ma et al., 1999; Miranda, 2005; Miranda et al., 2005). Nitroxyl rapidly decomposes in aqueous solution yielding nitrous oxide (N2O) (Shafirovich and Lymar, 2002). This is a fast reaction that competes with many other chemical processes in which HNO may be involved in a cellular context. Therefore, a detailed molecular description of this phenomenon is of general interest in biochemistry.

FIGURE 2 | Representation of the AfCopA QM/MM system. The QM region is defined by the atoms in the reactive region, which includes the phosphates of ATP, Mg<sup>2+</sup>, aspartate 424 (which is phosphorylated) and part of the atoms of lysine 600. Adapted with permission from Nitsche et al. (2014). Copyright 2014 American Chemical Society.

Great efforts have previously been made, from both experimental and theoretical points of view, to study the different steps leading to HNO decomposition (Shafirovich and Lymar, 2002; Fehling and Friedrichs, 2011). It was proposed that the mechanism involved a dimerization of HNO followed by a cleavage of one of the N-O bonds, to yield N2O and water. Our aim was to fully characterize the energetics and the molecular determinants of the reaction mechanism, taking into account the influence of the environment, which could be extremely important, especially because an acid-base equilibrium was proposed to be involved in the mechanism. Therefore, we studied both steps of the reaction mechanism, evaluating different protonation states and possible isomers, by means of multiple QM/MM MD simulations, using the umbrella sampling scheme to determine the corresponding free energy profiles (Bringas et al., 2016).

Our results showed that the dimerization is an exergonic process that occurs without a significant activation barrier (**Figure 3A**), and that the cis isomer of the dimer (HONNOH) is more stable than the trans one by ∼2 kcal/mol (data not shown). The dimer intermediate might be involved in different acid-base equilibria, so we investigated the second step of the reaction, starting from different protonation states (**Figure 3B**). The anionic path showed a much lower activation barrier (∼7 kcal/mol) than the one corresponding to the neutral path (∼14 kcal/mol). This effect is mainly explained by a more stable transition state for the anionic pathway, due to specific interactions with water molecules (Bringas et al., 2016).

Effective core potentials and full-electron calculations were carried out with CEP and DZVP basis sets, respectively. Data from Foglia et al. (2017).

The available experimental data and this analysis allowed us to propose an overall reaction mechanism for HNO decomposition that can be written as a scheme of consecutive reactions with fast acid-base equilibria connecting products and reactants of different steps:

$$\begin{aligned} 2\,\text{HNO} &\xrightarrow{k\_1} \text{cis-ON(H)N(H)O} \\ \text{cis-ON(H)N(H)O} &\rightleftharpoons \text{cis-HONNOH} \\ \text{cis-HONNOH} &\rightleftharpoons [\text{cis-HONNO}]^- + \text{H}^+ \\ \text{cis-HONNOH} &\xrightarrow{k\_2^{\text{NP}}} \text{N}\_2\text{O} + \text{H}\_2\text{O} \\ [\text{cis-HONNO}]^- &\xrightarrow{k\_2^{\text{AP}}} \text{N}\_2\text{O} + \text{HO}^- \end{aligned} \tag{7}$$

where k<sub>1</sub>, k<sub>2</sub><sup>NP</sup>, and k<sub>2</sub><sup>AP</sup> are the rate constants for the dimerization and for the dissociation via the neutral and anionic pathway elementary steps, respectively. Using transition state theory and the predicted pK<sub>a</sub> values for the dimer intermediate (Fehling and Friedrichs, 2011), we estimated that the anionic pathway will be the only one operative near physiological pH.
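The reasoning behind this estimate can be sketched with the Eyring equation and a simple acid-base population. The snippet below is a minimal illustration, not the analysis of the cited work: it uses the barriers quoted above (∼7 and ∼14 kcal/mol), assumes a transmission coefficient of 1 and T = 298.15 K, and treats the pK<sub>a</sub> as a free parameter because its value is not given here. All function names are hypothetical.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
PLANCK = 6.62607015e-34  # Planck constant, J s
R_KCAL = 1.987204e-3     # gas constant, kcal/(mol K)

def eyring_rate(dg_act_kcal, temperature=298.15):
    """First-order rate constant (1/s) from an activation free energy in kcal/mol."""
    prefactor = K_B * temperature / PLANCK
    return prefactor * math.exp(-dg_act_kcal / (R_KCAL * temperature))

def deprotonated_fraction(ph, pka):
    """Fraction of the intermediate present as the anion at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

def effective_rate(ph, pka, dg_neutral, dg_anionic, temperature=298.15):
    """Population-weighted decay rate of the dimer through both pathways."""
    f = deprotonated_fraction(ph, pka)
    return (f * eyring_rate(dg_anionic, temperature)
            + (1.0 - f) * eyring_rate(dg_neutral, temperature))

k_np = eyring_rate(14.0)  # neutral pathway barrier quoted in the text
k_ap = eyring_rate(7.0)   # anionic pathway barrier quoted in the text
ratio = k_ap / k_np
```

Under these assumptions the anionic step is about five orders of magnitude faster than the neutral one, so in this simple model the anionic channel carries most of the flux whenever the pH lies above roughly pK<sub>a</sub> − 5.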

In summary, the detailed molecular description of the reaction mechanism at the QM/MM level showed that specific interactions between the reactive species and the water molecules are decisive in the stabilization of transition states, thereby modifying the free energy barriers. We predicted a strong pH-dependence of the overall kinetics of N2O formation, related to the fraction of reactive species available in solution, in agreement with previous experimental data.

#### Protein Catalysis

The investigation of the molecular basis of enzyme catalysis has attracted the attention of many researchers on both the experimental and the computational sides. To illustrate the application of our code to this objective, we have chosen hydroperoxide reduction as a case study. The reduction of cellular endogenous or exogenous hydroperoxides is a key biochemical reaction associated not only with cellular redox homeostasis, but also with signaling and regulation processes (Hopkins, 2017). One of the most important antioxidant mechanisms in biological systems is the reduction of peroxides through the oxidation of low molecular weight (LMW) thiols and/or reactive cysteine (Cys) residues in proteins:

$$\text{RS}^- + \text{R}^\prime \text{OOH} \rightarrow \text{RSO}^-/\text{RSOH} + \text{R}^\prime \text{OH}/\text{R}^\prime \text{O}^-\tag{8}$$

where the reactive species are the thiolate and the hydroperoxide. Although the reaction of peroxides with LMW thiols is usually slow (∼10 M<sup>−1</sup> s<sup>−1</sup> for H2O2) (Winterbourn and Metodiewa, 1999), some enzyme thiols react several orders of magnitude faster in terms of second order rate constants. Among these Cys-based peroxidases, peroxiredoxins (Prx) are the most significant ones, given their reactivity, distribution, and concentration (Hofmann et al., 2002; Winterbourn, 2008).

In recent years, we have applied QM/MM techniques to shed light on the molecular determinants that govern this extremely important biochemical reaction, comparing the reactivity of LMW thiols with Prx, and also assessing different hydroperoxides (Zeida et al., 2012, 2013, 2014, 2015). In particular, aiming to gain microscopic insight into the properties of the Prx active site that could explain the catalytic effect of these systems, we performed QM/MM MD simulations to determine the free energy profiles of reaction 8 for H2O2, with the thiolate being CH<sub>3</sub>S<sup>−</sup> or the reactive Cys of the alkyl hydroperoxide reductase E from Mycobacterium tuberculosis (MtAhpE), the 1-Cys Prx of the mycobacteria genome, as a Prx model (Zeida et al., 2012, 2014). The umbrella sampling approach was applied, choosing the reaction coordinate as the difference between the O<sub>A</sub>-O<sub>B</sub> and S-O<sub>A</sub> distances (**Figure 4A**). We have also measured the activation thermodynamic parameters of the catalyzed reaction by means of temperature-dependent stopped-flow kinetics experiments and the Eyring formalism (**Table 3**).
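The reaction coordinate described above, and the harmonic restraint typically used to keep each umbrella sampling window near its target value, can be written compactly. This is a minimal sketch: the harmonic form of the bias, the force constant, the window spacing, and the toy coordinates are assumptions chosen for illustration and are not the settings used in the cited studies.

```python
import numpy as np

def reaction_coordinate(r_s, r_oa, r_ob):
    """xi = d(O_A - O_B) - d(S - O_A), in the units of the input coordinates."""
    return float(np.linalg.norm(r_oa - r_ob) - np.linalg.norm(r_s - r_oa))

def umbrella_bias(xi, xi0, k_bias):
    """Assumed harmonic restraint 0.5 * k * (xi - xi0)^2 centered on a window at xi0."""
    return 0.5 * k_bias * (xi - xi0) ** 2

# Hypothetical collinear geometry (angstrom): S ... O_A - O_B
r_s = np.array([0.0, 0.0, 0.0])
r_oa = np.array([2.0, 0.0, 0.0])
r_ob = np.array([3.5, 0.0, 0.0])

xi = reaction_coordinate(r_s, r_oa, r_ob)        # 1.5 - 2.0
window_centers = np.linspace(-1.5, 1.5, 31)      # hypothetical windows, 0.1 angstrom apart
bias = umbrella_bias(xi, xi0=0.0, k_bias=200.0)  # hypothetical k in kcal/(mol angstrom^2)
```

Negative values of the coordinate correspond to an intact O<sub>A</sub>-O<sub>B</sub> bond with the sulfur still far away, and positive values to a formed S-O<sub>A</sub> bond with the peroxide bond broken.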

**Figure 4A** shows the free energy profiles for the uncatalyzed and catalyzed processes, while **Figures 4B,C** provide a schematic representation of the QM subsystem for each case. The corresponding ΔG<sup>‡</sup> values of ∼8 and ∼4 kcal/mol turn out to be underestimated in comparison with those determined experimentally (see **Table 3**), which can possibly be attributed to the limitations of DFT at the GGA level for determining activation energies (Zhao and Truhlar, 2008). However, the catalytic effect, meaning the difference between the activation free energies (ΔΔG<sup>‡</sup>), is ∼4 kcal/mol, consistent with the experimentally determined ΔΔG<sup>‡</sup> = 5.4 kcal/mol and with the ∼5000-fold increase in reactivity observed (Luo et al., 2005; Hugo et al., 2009).
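The link between a difference in activation free energies and a ratio of rate constants follows from transition state theory, where the prefactors cancel and the ratio is exp(ΔΔG<sup>‡</sup>/RT). The helper functions below are a sketch under the assumption T = 298.15 K, which is not stated in the text; their names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol K)

def rate_enhancement(ddg_kcal, temperature=298.15):
    """Ratio of rate constants implied by a barrier difference (kcal/mol)."""
    return math.exp(ddg_kcal / (R_KCAL * temperature))

def barrier_difference(enhancement, temperature=298.15):
    """Barrier difference (kcal/mol) implied by a ratio of rate constants."""
    return R_KCAL * temperature * math.log(enhancement)

fold_from_experiment = rate_enhancement(5.4)  # experimental barrier difference
fold_from_qmmm = rate_enhancement(4.0)        # computed barrier difference
ddg_from_5000 = barrier_difference(5000.0)    # observed reactivity increase
```

At this temperature 5.4 kcal/mol corresponds to a ratio of roughly 9 × 10<sup>3</sup>, a 5000-fold increase to about 5.0 kcal/mol, and the computed ∼4 kcal/mol to a ratio slightly below 10<sup>3</sup>. Because the ratio depends exponentially on the barrier difference, an agreement within about 1 kcal/mol still translates into an order of magnitude in the rate.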

It is worth noting that these profiles are not exactly the same as those published earlier (Zeida et al., 2012, 2014), due to improvements in our computing capabilities and significant advances in sampling protocols. Both new profiles display lower free energy barriers, which is expected for better explorations of the free energy landscape. Nevertheless, the reaction mechanism and properties along the reaction coordinate do not show significant changes, and, most importantly, the ΔΔG<sup>‡</sup> remains unaffected, so the catalytic effect is still reproduced in fair agreement with the experimental data.

(B,C) QM subsystems for the reaction in water (B) and in MtAhpE (C). Modified with permission from Zeida et al. (2012, 2014). Copyright 2012 American Chemical Society.



Standard deviations in parentheses when available.

<sup>a</sup>pH-independent rate constants.

<sup>b</sup>data taken from Luo et al. (2005).

<sup>c</sup>data taken from Zeida et al. (2014).

The exploration of the reaction coordinate allows us to identify key events during the reaction that explain the differences in the ΔG<sup>‡</sup> values. Specifically, the strong interactions of the thiolate and the peroxide with the Arg<sup>116</sup> and Thr<sup>42</sup> residues, which are highly conserved among the Prx family (Soito et al., 2011; Perkins et al., 2015), are the main factors responsible for the transition state stabilization and the concomitant significant reduction in ΔH<sup>‡</sup>, which in turn results in a decrease in ΔG<sup>‡</sup> in spite of the unfavorable entropic contribution (**Table 3**). Our calculations support the idea of a bimolecular nucleophilic substitution (S<sub>N</sub>2-type) mechanism, with an internal proton transfer and no acid-base catalysis. The catalytic ability of Prxs lies in the stabilization of the transition state due to an active site design that configures a complex H-bond network activating both reactive species, the thiolate and the peroxide.

#### Spectroscopic Studies

UV-vis or FTIR spectrophotometers are ubiquitous tools in research laboratories, routinely employed to characterize new species, to establish the identity of a chemical system, to perform both thermal and photochemical reactivity studies, or even to analyze the presence of impurities in a given sample. Ambiguities in the characterization of unknown species are often present, and so experimental chemists are increasingly relying on theoretical calculations to complement their measurements.

The vast majority of the computational studies addressing electronic or vibrational spectra involve the geometry optimization of the molecule at a given level of theory, and the study of its spectroscopic properties at those frozen nuclear coordinates. Within this framework, the environment is emulated by different approaches, such as the Polarizable Continuum Model (PCM) (Tomasi et al., 2005), often with a few explicit solvent molecules placed at fixed positions in the first solvation shell interacting with the chromophore (Zuehlsdorff et al., 2016) or, more recently, a classical and explicit description of the solvent through a QM/MM hybrid Hamiltonian (Barone et al., 2010). The most common approach to calculate infrared spectra is the harmonic oscillator approximation, through the diagonalization of the Hessian matrix at different levels of theory. This process yields the vibrational modes, their characteristic frequencies, and an estimation of the intensity of those bands in the infrared spectrum (Bloino et al., 2016). However, this scheme may present flaws in cases in which there are strong anharmonicities or specific solute-solvent interactions. The IR spectrum can also be computed directly from a molecular dynamics simulation including explicit solvent molecules. The absorption spectrum can be determined as the Fourier Transform of the temporal autocorrelation function of the dipole moment (Futrelle and McGinty, 1971; McQuarrie, 1976):

$$I(\omega) = \frac{1}{2\pi} \int dt\, e^{i\omega t} \left\langle \overrightarrow{\mu}(0) \middle| \overrightarrow{\mu}(t) \right\rangle \tag{9}$$

In the previous equation, the brackets denote an equilibrium ensemble average, which can be obtained from molecular dynamics simulations:

$$
\left\langle \stackrel{\rightarrow}{\mu}(0) \middle| \stackrel{\rightarrow}{\mu}(t) \right\rangle \approx \frac{t - t\_i}{\Delta t} \sum\_i \stackrel{\rightarrow}{\mu}(t\_i) \stackrel{\rightarrow}{\mu}(t\_i + t) \tag{10}
$$

where Δt is the integration time-step. Similarly, the Fourier Transform of the temporal autocorrelation function of the polarizability yields the Raman spectrum of a given system.
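A minimal numerical version of this procedure is sketched below with NumPy. It is an illustration only: the dipole trajectory is a synthetic oscillation at 1600 cm<sup>−1</sup> plus noise rather than simulation output, the autocorrelation is normalized as a plain average over the available time origins, and a Hann window is applied before the Fourier transform. These choices, like the function names, are assumptions of the example and not a description of how spectra are produced with LIO.

```python
import numpy as np

C_CM_PER_FS = 2.99792458e-5  # speed of light in cm/fs

def dipole_autocorrelation(mu, max_lag):
    """C[k] = <mu(0) . mu(k dt)>, averaged over all available time origins."""
    n = mu.shape[0]
    acf = np.empty(max_lag)
    for k in range(max_lag):
        acf[k] = np.mean(np.sum(mu[: n - k] * mu[k:], axis=1))
    return acf

def spectrum_from_acf(acf, dt_fs):
    """Magnitude of the Fourier transform of the windowed autocorrelation function."""
    window = np.hanning(2 * len(acf))[len(acf):]   # decaying half of a Hann window
    intensity = np.abs(np.fft.rfft(acf * window))
    wavenumbers = np.fft.rfftfreq(len(acf), d=dt_fs) / C_CM_PER_FS  # cm^-1
    return wavenumbers, intensity

# Synthetic dipole trajectory: 4 ps sampled every 0.5 fs.
rng = np.random.default_rng(0)
dt_fs = 0.5
n_steps = 8000
t = np.arange(n_steps) * dt_fs
freq_per_fs = 1600.0 * C_CM_PER_FS
mu = rng.normal(scale=0.05, size=(n_steps, 3))
mu[:, 0] += np.cos(2.0 * np.pi * freq_per_fs * t)

acf = dipole_autocorrelation(mu, max_lag=4000)
wavenumbers, intensity = spectrum_from_acf(acf, dt_fs)
peak_wavenumber = wavenumbers[1:][np.argmax(intensity[1:])]  # skip the zero-frequency bin
```

The frequency resolution is set by the longest lag retained, here 2 ps, which gives bins about 17 cm<sup>−1</sup> wide. This is one reason why trajectories of at least several picoseconds are needed to resolve vibrational bands.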

In the case of electronic spectra, Time Dependent Density Functional Theory in the linear response formulation (LR-TDDFT) has become in the last two decades the most popular approach to predict the electronic excitation frequencies of medium-sized systems, due to its modest computational cost and relatively good performance in comparison with highly correlated schemes (Runge and Gross, 1984; Marques et al., 2006). Alternatively, it is also possible to compute an electronic spectrum from electron dynamics simulations, through real-time TDDFT. This methodology, which is incorporated in LIO, is less commonly used to calculate electronic excitation frequencies. The reason is that the electron dynamics simulations needed to recover the absorption frequencies with RT-TDDFT require exciting the electronic system and propagating its density matrix for tens or hundreds of femtoseconds, followed by an ad-hoc post-processing analysis. This scheme tends to be more demanding, and to some extent prevents RT-TDDFT from being used in a black-box fashion. However, the real-time approach exhibits some appealing features in comparison with LR-TDDFT, including the possibility of studying intense perturbations beyond the linear response regime (Marques et al., 2006; Lopata and Govind, 2011), a computational burden whose scaling with respect to the number of electrons can be made quasi-linear, and the avoidance of the computation of the exchange-correlation kernel. Therefore, RT-TDDFT could be a competitive alternative to the linear response method, especially in the case of large systems, since the computation time in typical LR-TDDFT implementations grows as N<sup>3</sup> or N<sup>4</sup>.

Beyond the applied methodology and the level of sophistication of the electronic structure calculations, one of the main causes of failure in the prediction of spectroscopic properties is the use of a frozen nuclear geometry, obtained by an initial optimization, which neglects thermal fluctuations. A proper exploration of configurational space produces spectra averaged over the degrees of freedom of both the chromophore and its environment, providing smooth lineshapes instead of the discrete, multiple-line spectra corresponding to the particular transition energies obtained from a single geometry. The use of molecular dynamics or Monte Carlo simulations in which the spectra are obtained from statistical averages within an ensemble has become an increasingly popular strategy for different types of systems (Valsson et al., 2013).

#### Vibrational Spectra of Simple Aqueous Species: Early Results and Benchmarks

In 2005 we investigated the structure and the vibrational spectrum of peroxynitrite anion in solution via QM/MM molecular dynamics simulations, calculating the temporal autocorrelation functions of the atomic velocities (Gonzalez Lebrero et al., 2005). At that time, the computational capabilities allowed us to perform production dynamics of 20 ps in a reasonable time window. This study provided a picture of the complex interactions between the ONOO<sup>−</sup> anion and the solvent, and how these were reflected in the vibrational spectrum in solution, as shown in **Figure 5**. Our results yielded frequency values much closer to the experimental ones than those obtained using standard methodologies, and also helped to assign a controversial band centered at 642 cm<sup>−1</sup> as corresponding to NO3 stretching.

The nature of the solute species present in ethereal solutions of LiAlH<sub>4</sub> is of crucial importance to understand the mechanism for the reduction of ketones and other functional groups by LiAlH<sub>4</sub>. In 2005, we employed a combination of experimental and theoretical techniques to investigate the structure of this system in ethereal solutions, using QM/MM simulations in which LiAlH<sub>4</sub> was modeled at the DFT PBE level using dzvp basis sets, and the solvent was described using a three-site potential (Bikiel et al., 2005). Our results were consistent with a dissociation equilibrium displaced toward the associated species. However, a significant amount of the dissociated species is also expected to exist. We calculated the infrared spectra for both LiAlH<sub>4</sub> and AlH<sub>4</sub><sup>−</sup> species from molecular dynamics simulations, reproducing the main features of the experimental spectra, as shown in **Figure 6**.

#### The Electronic Spectrum and the Sampling Issue

In 2014 we developed a powerful scheme to perform real-time TDDFT electron dynamics in a QM/MM framework (Morzan et al., 2014). This implementation can easily handle quantum subsystems in the order of 100 atoms surrounded by thousands of classical nuclei, enabling the investigation of the effect of a complex environment, such as a solvent or a protein matrix, on the UV-vis spectra of molecular systems. Our starting point was the validation on isolated species by comparison of the absorption maxima obtained using the real-time and the linear response methodologies (some of these results are presented in **Table 4**). These molecules were chosen as a benchmark to test our code and verify that the obtained spectra were in very good agreement with experimental results.

FIGURE 5 | Vibrational density of states computed as the Fourier transform of the velocity autocorrelation function, evaluated in the isolated species normal modes coordinates. Molecular dynamics simulations were done with the PBE functional and the TIP4P model for water. NO1 stretching, black line; O1NO3 bending, pink line; O3O4 stretching, green line; NO3 stretching, red line; OONO torsion, blue line; NO3O4 bending, dark green line. Adapted with permission from Gonzalez Lebrero et al. (2005). Copyright 2005 American Chemical Society.

The effect of the environment becomes crucial within proteins, and this is where QM/MM methodologies make the difference. In the work mentioned above, the spectrum of the CO hexa-coordinated heme group in the flavohemoglobin of Escherichia coli (EcFlavoHb) was analyzed. The results showed that the Soret band of the chromophore experiences a notable blue shift (Δλ ∼ 35 nm) when going from vacuum to the protein environment. Despite the lack of a direct experimental comparison, the observed shift in the EcFlavoHb heme with respect to the gas phase is in accordance with the expected trend.

The examples given so far correspond to electronic spectra extracted from electron dynamics simulations on a single geometry. To generate a realistic electronic spectrum that considers thermal effects, we calculate the spectra of a set of configurations extracted from a QM/MM molecular dynamics trajectory. These configurations must be separated from one another by a time interval η long enough to decorrelate the molecular vibrations (for simple molecular systems in solution, η = 5–10 ps is usually enough). The number of configurations needed to get a converged spectrum in solution is typically a few dozen. A healthy practice is to intercalate the QM/MM molecular dynamics simulations with some purely classical sampling, especially in systems with many degrees of freedom and multiple solvation structures, to broaden the exploration and prevent the system from getting trapped around a local minimum. **Figure 7** shows a typical convergence sequence in the computation of an electronic spectrum using QM/MM dynamics.

As seen in **Figure 7**, it is hugely important to perform an adequate sampling of the nuclear configurations visited along the dynamics to obtain a spectrum with the absorption maximum located in the correct position and with the correct band shapes. Unfortunately, there is no magic number associated with the required amount of nuclear configurations to reach convergence. This number is usually tied to several factors, mainly the rigidity of the molecular system under study (and therefore the number of accessible local minima) and the magnitude and lability of the interactions between the solute and the solvent.

Real-time and linear-response TDDFT calculations were performed with the PBE exchange-correlation functional and dzvp basis set.

**Figure 8** shows that the convergence of the electronic spectrum of the (HO)NS<sub>2</sub> molecule in acetonitrile is achieved by averaging fewer than 10 spectra taken at snapshots separated by 5 ps of QM/MM dynamics, while for more flexible species like S<sub>4</sub><sup>2−</sup> (with high rotational freedom) the process is not satisfactorily converged after averaging 70 spectra.
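One simple way to monitor this kind of convergence is to follow the running average of the snapshot spectra and the position of its maximum as more configurations are included. The sketch below uses synthetic Gaussian bands with randomly displaced centers as stand-ins for per-snapshot RT-TDDFT spectra; the band width, the spread of the centers, and the function names are hypothetical.

```python
import numpy as np

def running_average(spectra):
    """Row n holds the mean of the first n + 1 snapshot spectra."""
    counts = np.arange(1, spectra.shape[0] + 1)[:, None]
    return np.cumsum(spectra, axis=0) / counts

def lambda_max_trace(wavelengths, averaged):
    """Position of the absorption maximum after each additional snapshot."""
    return wavelengths[np.argmax(averaged, axis=1)]

def change_between_averages(averaged):
    """Largest pointwise change of the running average when one snapshot is added."""
    return np.abs(np.diff(averaged, axis=0)).max(axis=1)

# Synthetic data: 60 snapshots, each a Gaussian band with a displaced center (nm).
rng = np.random.default_rng(1)
wavelengths = np.linspace(300.0, 500.0, 401)
centers = 400.0 + rng.normal(0.0, 8.0, size=60)
spectra = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 15.0) ** 2)

averaged = running_average(spectra)
lmax = lambda_max_trace(wavelengths, averaged)
delta = change_between_averages(averaged)
```

Plotting `lmax` and `delta` against the number of snapshots gives a quick visual criterion: a spectrum can be regarded as converged once both stop drifting. A flexible solute would show a slower decay of `delta` than a rigid one.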

Ensuring that a spectrum is statistically converged is not only crucial to find the correct values of λmax; it is also important because, if not converged, the spectra may present shoulders similar to those found in experiments, thus leading to incorrect interpretations.

#### Solving Specific Questions: The Case of the "Crosstalk" Between NO and H2S

We will illustrate the capabilities of our code by showing results related to the crosstalk chemistry between NO and H2S. In addition to the diverse biological roles of nitric oxide (NO) and hydrogen sulfide (H2S), there is a growing appreciation that both molecules have interdependent biological actions resulting in either mutual attenuation or potentiation responses, the so-called NO/H2S "cross-talk" (Marcolongo et al., 2017).

The study of the interaction pathways between these two molecules has led to a large number of publications throughout this decade, and there have been many controversies around the appearance and assignment of spectroscopic signals in different experiments. In particular, there has been much debate around the appearance of a transient signal at ∼409 nm produced in the reaction of S-nitrosoglutathione (GSNO) with HS<sup>−</sup> at pH = 7.4 (Filipovic et al., 2012). Filipovic and colleagues argue that this signal corresponds to the presence of a mixture of polysulfides, while other groups claim that the yellow signal corresponds to perthionitrite (S2NO<sup>−</sup>), the sulfur analog of peroxynitrite (Seel and Wagner, 1988; Munro and Williams, 2000; Filipovic et al., 2012; Cortese-Krott et al., 2015; Bailey et al., 2016; Bolden et al., 2016). This anion has been well-characterized in solid salts and in organic solvents (Seel et al., 1985; Filipovic et al., 2012; Wedmann et al., 2015), though its clear identification and chemistry in aqueous solutions remain elusive.

In 2016, we tested the use of QM/MM dynamics in conjunction with RT-TDDFT as implemented in LIO to help elucidate this controversial signal, tilting the balance toward the proposal of SSNO<sup>−</sup> as the species responsible for the yellow signal (Marcolongo et al., 2016). Our calculations for this species (together with a large number of related benchmark molecules) showed that SSNO<sup>−</sup> is a good candidate to absorb in that region, and also reproduced quantitatively the solvatochromic shift of the absorption band when going from water to a set of organic solvents in which the species is well-characterized. The QM/MM dynamics performed by our group showed that the specific interactions between SSNO<sup>−</sup> and the solvent are responsible for the modulation of the θ(N1-S1-S2) angle, which strongly affects the spectroscopic properties of this molecule, as shown in **Table 5**. For these simulations, the TIP3P potential was used to describe the water molecules, and the force fields for acetonitrile, acetone and methanol were generated following the restrained electrostatic potential (RESP) technique and DFT calculations at the PBE/dzvp level. Equilibrium distances and angles, as well as force constants, were computed using the same electronic structure scheme.

The results obtained for this system showed how the statistical study of the interactions resulting from the QM/MM sampling is the key to obtain results that can be compared to the experimental values almost quantitatively.

## CONCLUDING REMARKS AND PERSPECTIVES

In this article, we have provided an overview of the basic features of the LIO program, focusing on some representative applications. This review has emphasized the fundamental importance of code performance in the study of chemical reactivity and free energy profiles, and in the simulation of vibrational and electronic spectra. Currently, most QM/MM DFT programs are capable of handling molecular systems having, in the quantum and classical domains, in the order of a few hundreds and a few thousands of atoms, respectively. However, the computational cost associated with systems of this size restricts their applicability to single-point calculations or partial geometry optimizations, which may be helpful to analyze some structural or energetic aspects at zero temperature, whereas the kind of applications considered in this article require extensive sampling. A data point along a reaction coordinate profile, or the construction of an auto-correlation function to extract a vibrational spectrum, typically requires molecular dynamics simulations in the order of at least several picoseconds. To reach these time-scales, the cost per iteration cannot exceed a few seconds. The goal in developing the LIO program is not to save computation time in the calculation of a given property, but to make such calculations feasible where they previously were not. Thanks to recent advances in the code, it is currently possible to study models containing about 30 atoms in the quantum region for time windows in the order of the nanosecond, using a single commercial GPU.
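The constraint on the cost per step follows from simple arithmetic, sketched below. The 1 fs nuclear time step is an assumption made for the example (it is not stated in the text), and the function name is hypothetical.

```python
def wall_time_days(sim_time_ps, timestep_fs, seconds_per_step):
    """Wall-clock time in days for a trajectory of a given length."""
    n_steps = sim_time_ps * 1000.0 / timestep_fs
    return n_steps * seconds_per_step / 86400.0

# 10 ps (one window or one autocorrelation run) at 3 s per QM/MM step.
days_short = wall_time_days(10.0, 1.0, 3.0)

# 1 ns of accumulated sampling at 1 s per QM/MM step.
days_long = wall_time_days(1000.0, 1.0, 1.0)
```

With these numbers a 10 ps run takes about a third of a day and a full nanosecond about 11.6 days, which is why a cost of a few seconds per step marks the practical limit for the applications discussed here.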

TABLE 5 | Calculated and experimental absorption maxima for S2NO<sup>−</sup> in different solvents, as well as the mean value of the θ(N1-S1-S2) angle.

In LIO, the most expensive parts of the QM/MM DFT scheme, including the calculation of the exchange-correlation energy, Coulomb interactions, and forces, have been thoroughly optimized, reaching an almost linear-scaling performance. As a consequence of these improvements, operations that are typically inexpensive start to become the new limiting steps in the overall speed. One example of this is the diagonalization of the Hamiltonian, which, except in the case of very large systems, takes up a negligible fraction of the SCF iteration cost in Gaussian basis codes. At the present stage, diagonalization, which scales as N<sup>3</sup>, represents the next potential bottleneck for big systems. For this reason, LIO is not a "linear-scaling" implementation, although it does behave approximately linearly within a certain size range, when the diagonalization still does not contribute appreciably to the computational load. However, the diagonalization dominates the overall performance in the case of very large systems exceeding 200 or 300 atoms in the quantum region. Thus, to go beyond these sizes, an order-N diagonalization algorithm should be made available.
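The interplay between a linear-scaling part and a cubic diagonalization can be illustrated with a toy cost model. This is a minimal sketch, not a fit to LIO timings: the prefactors `a_linear` and `b_cubic` are hypothetical and were chosen only so that the two contributions become equal at 250 atoms, within the range quoted above.

```python
import math


def diag_fraction(n_atoms, a_linear=1.0, b_cubic=1.6e-5):
    """Fraction of the step cost spent in an O(N^3) diagonalization,
    for a toy model cost(N) = a_linear * N + b_cubic * N**3."""
    linear = a_linear * n_atoms
    cubic = b_cubic * n_atoms ** 3
    return cubic / (linear + cubic)


def crossover_size(a_linear=1.0, b_cubic=1.6e-5):
    """System size at which the cubic term equals the linear term."""
    return math.sqrt(a_linear / b_cubic)


for n in (30, 100, 250, 500):
    print(f"N = {n:4d}: diagonalization share = {diag_fraction(n):.1%}")
print(f"crossover at N = {crossover_size():.0f}")
```

In this model the code looks linear for a few tens of atoms, because the cubic term is only a percent-level correction there, and it becomes diagonalization-bound once the crossover size is passed.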

As discussed in the Examples section, a linear-scaling behavior is particularly appealing for the computation of electronic spectra of large molecules. Presently, most UV-visible spectroscopy calculations involve TDDFT in the linear response formulation. This implementation, typically scaling as N<sup>3</sup> or N<sup>4</sup>, is often easier to use than the real-time approach, and more efficient for systems of small and moderate size. Nevertheless, for large systems in complex environments, real-time TDDFT simulations with an approximately linear scaling might be a smart alternative to surpass the size limitations of the linear response method. Besides, the possibility of sampling the structural degrees of freedom in solution or in a biomolecule with the same Hamiltonian employed for the electron dynamics constitutes an asset of the LIO code.

Aside from spectroscopic applications, real-time TDDFT offers the possibility to perform quantum transport simulations. Recently, we have implemented in LIO an approach to compute molecular conductance in open quantum conditions (Morzan et al., 2017). The combination of RT-TDDFT with a QM/MM framework provides a unique platform to investigate challenging phenomena related to electron dynamics in realistic environments, which are very difficult to address from the experimental side or even with other modeling strategies. In our group, we are studying the electron transfer dynamics to and from the CuA active site in cytochrome c oxidase. Charge transport between redox sites in an enzyme is a fascinating phenomenon that might be studied with our approach, provided that the number of atoms involved in the likely paths foreseen for the electrons is not too large to be treated within the QM region. The conductance of conjugated polymers in a disordered matrix, or the electron exchange at an electrochemical interface, are also problems that could be explored using the present approach, where the solvent, the counterions, or, more generally, the surrounding media, can be described at the MM level.

The use of pseudopotentials, recently made available in the code (Foglia et al., 2017), is fundamental to perform real-time TDDFT simulations with transition metals and heavy atoms. The time-propagation of the core electrons, with high characteristic frequencies, demands integration time-steps that may be orders of magnitude below those required for the integration of the valence density. Thus, all-electron TDDFT dynamics with transition metals or atoms below the second period are extremely costly because of the small time-steps necessary for energy conservation. The incorporation of pseudopotentials has relaxed this constraint, making electron dynamics simulations of the active sites of metalloenzymes or of metallic wires feasible. The pseudopotential implementation of the forces, necessary to perform MD simulations, is currently underway.

To summarize, the LIO code is a very active project and a valuable resource, open to the community via GitHub (https://github.com/MALBECC/LIO), to perform DFT molecular dynamics and real-time TDDFT simulations in a QM/MM framework. Beyond those objectives concerning performance, which have concentrated much of the efforts of recent years, there is presently a strong drive toward the development of new capabilities. Some of the features or implementations currently in progress are:


Hopefully, this review has made it clear how much realistic chemical and spectroscopic predictions rely on sampling efficiency, which is necessary to introduce both thermal and environment effects. This is the spirit that has guided the evolution of LIO, which has been optimized for QM/MM simulations where the typical size of the QM domain reaches a few tens of atoms. Perhaps the most important challenge for the development of molecular simulation methods is to keep up with the advances in HPC platforms, algorithmic optimization, and theory. Taking proper advantage of all these worlds is only possible through a collaborative endeavor involving theoretical chemists and computer scientists, with continuous feedback from the experimental side.



# AUTHOR CONTRIBUTIONS

AZ, JM, JS, NF, and UM selected and analyzed the data. AZ, JM, MG, DE, and DS wrote the paper. MG, DE, and DS supervised research.

# FUNDING

This research was supported by grants of the Universidad de Buenos Aires, UBACYT 20020130100097BA and Agencia Nacional de Promoción Científica y Tecnológica, PICT 2015- 0672, PICT 2014-1022, PICT 2015-2761, and CONICET grant 11220150100303CO. AZ, JS, NF, and JM gratefully acknowledge CONICET for fellowships.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Marcolongo, Zeida, Semelak, Foglia, Morzan, Estrin, González Lebrero and Scherlis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Steady-State Linear and Non-linear Optical Spectroscopy of Organic Chromophores and Bio-macromolecules

#### Marco Marazzi<sup>1,2,3</sup>, Hugo Gattuso<sup>1,2</sup>, Antonio Monari<sup>1,2</sup>\* and Xavier Assfeld<sup>1,2</sup>\*

<sup>1</sup> Laboratoire de Physique et Chimie Théoriques, Université de Lorraine–Nancy, UMR 7019, Vandoeuvre-lès-Nancy, France, <sup>2</sup> Laboratoire de Physique et Chimie Théoriques, Centre National de la Recherche Scientifique, UMR 7019, Vandoeuvre-lès-Nancy, France, <sup>3</sup> Departamento de Química, Centro de Investigación en Síntesis Química (CISQ), Universidad de La Rioja, Logroño, Spain

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Etienne Derat, Université Pierre et Marie Curie, France

Artur Nenov, Università degli Studi di Bologna, Italy

#### \*Correspondence:

Antonio Monari, antonio.monari@univ-lorraine.fr

Xavier Assfeld, xavier.assfeld@univ-lorraine.fr

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 31 January 2018

Accepted: 12 March 2018

Published: 03 April 2018

#### Citation:

Marazzi M, Gattuso H, Monari A and Assfeld X (2018) Steady-State Linear and Non-linear Optical Spectroscopy of Organic Chromophores and Bio-macromolecules. Front. Chem. 6:86. doi: 10.3389/fchem.2018.00086

Bio-macromolecules such as DNA, lipid membranes and (poly)peptides are essential compounds at the core of biological systems. The development of techniques and methodologies for their characterization is therefore necessary and of utmost interest, even though difficulties can be experienced due to their intrinsically complex nature. Among these methods, spectroscopies relying on optical properties are especially important to determine their macromolecular structures and behaviors, as well as their possible interactions and reactivity with external dyes (often drugs or pollutants) that can (photo)sensitize the bio-macromolecule, leading to eventual chemical modifications and thus to damage. In this review, we will focus on the theoretical simulation of electronic spectroscopies of bio-macromolecules, considering their secondary structure and including their interaction with different kinds of (photo)sensitizers. Namely, absorption, emission and electronic circular dichroism (CD) spectra are calculated and compared with the available experimental data. Non-linear properties will also be taken into account through two-photon absorption, a highly promising technique (i) to enhance absorption in the red and infra-red windows and (ii) to enhance spatial resolution. Methodologically, the implications of using implicit and explicit solvent, coupled to quantum and thermal samplings of the phase space, will be addressed. In particular, hybrid quantum mechanics/molecular mechanics (QM/MM) methods are compared with purely QM methods, in order to address the need for an accurate description of environmental effects on the spectroscopic properties of biological systems.

Keywords: nucleic acids, lipid membrane, (poly)peptide, circular dichroism, (two-photon) absorption, fluorescence, hybrid quantum mechanics/molecular mechanics, solvent and dynamics effects

## INTRODUCTION

The frontier between computational biochemistry and computational chemistry is becoming blurred thanks to the development of novel, more efficient modeling methods able to tackle very large systems, and to new, more powerful hardware architectures. For a long time, biological systems, due to the enormous number of particles (atoms) to deal with, were studied only by means of statistical simulations based on classical force fields. Hence electronic phenomena were explicitly excluded from such studies. In contrast, chemical systems were treated by means of quantum chemistry tools, since electronic motion was involved. However, most calculations were done placing the system in isolated (infinitely diluted gas phase) conditions. Nowadays, taking into account the effects of the surroundings, like solvent effects, on chemical reactivity is routinely achieved either by implicit methods (Ruiz-Lopez et al., 1993; Tomasi et al., 2005) or by explicit ones. Conversely, the study of electronic phenomena in biological macromolecules is also widespread. This broadening of both fields makes them quite similar from the methodological point of view, since one needs to combine high-accuracy quantum chemical calculations with statistical thermodynamics simulations in order to get meaningful information on electronic phenomena (chemical reactions, photochemical reactions, or electronic spectroscopy) in large systems. The appearance of hybrid quantum mechanics/molecular mechanics (QM/MM) methods in the 1990s is at the origin of this merging between chemistry and biology, as recognized by the 2013 Nobel Prize (The Nobel Prize in Chemistry, 2013)<sup>1</sup>. It opens the way toward a brand new unexplored field: tackling electronic excited states in complex systems.

In this review we gather several applications carried out mainly by our group during the past 5 years, or so, that deal with different electronic spectroscopies, namely linear absorption, non-linear spectroscopy [especially two-photon absorption (TPA)], and circular dichroism on chemical or biochemical systems. All the chosen applications have a very tight link to critical biological problems.

This review is organized as follows. In section Linear Absorption, the importance of the surroundings and of the dynamical effects on electronic absorption spectra will be presented for various DNA photosensitizers. This is particularly important since DNA, although quite stable (almost no absorption) with respect to visible light irradiation, can be damaged by energy/charge transfer from small neighboring molecules (photosensitizers) that absorb the visible light. Section Non-linear Spectroscopy collects mainly our studies on TPA. This method, arising from material science, is now starting to be widely employed in photodynamic therapy since it allows (i) a deeper penetration of the beam, thanks to the use of red or infrared wavelengths, and (ii) a more precise localization of the action, since the density of photons needed for a twin absorption is high enough only in the focal region of the laser beam. The various tenets of the simulations will be discussed and their validity highlighted by numerical results. Finally, in section Circular Dichroism Modeling, a recently developed tool to compute electronic circular dichroism spectra of large macromolecular systems will be presented. Its application to DNA and peptide conformations will be shown and compared to other existing methods.

# LINEAR ABSORPTION

Modeling linear absorption spectra of complex systems has been a crucial task, and many efforts have been devoted to its realization in the past. In particular, the role of the environment should be precisely taken into account when studying complex realistic systems.

One of the most straightforward ways to take the environment into account has been by means of continuum polarizable methods (Rinaldi and Rivail, 1973; Tomasi, 2004), such as COSMO (Klamt and Schüürmann, 1993) or PCM (Tomasi et al., 2005). Even though these methods have been, and still are, extremely popular and have proven efficient in modeling homogeneous media, more recently the use of QM/MM methods has gained appeal, in particular for the possibility to treat inhomogeneous environments that may include chromophores interacting with complex biological systems, such as nucleic acids and proteins, or materials.

However, going from an implicit to an explicit description of the molecular surroundings brings important conceptual problems that should be properly addressed, and that are partially different from those encountered for ground state problems. In that respect, when considering QM/MM methods one can distinguish a hierarchy in the treatment of the environment (Monari et al., 2013; Rivail et al., 2015). In particular, one may distinguish mechanical embedding (ME), in which only the geometrical constraints imposed by the MM partition on the QM geometry are taken into account; electrostatic embedding (EE), in which the polarization of the QM wave function by the MM point charges is allowed; and finally polarizable embedding (PE), in which the back polarization of the MM potential by the QM partition is included. While force field parameterization allows one to recover the polarization of the bulk when dealing with ground state problems, so that EE is usually sufficient to achieve a correct description of the complex environment, the situation is much more complex in the case of electronic excited state problems. Indeed, by definition, an electronic excited state involves a sudden, and in some instances important, change in the electronic density distribution; as a consequence, the response, i.e., the polarization, of the nearby molecular surroundings becomes important and can no longer be neglected. One straightforward strategy to include polarization effects is to switch from a fixed-charge force field to a polarizable one (Caprasecca et al., 2014; Shi et al., 2015). Different strategies exist, based on the inclusion of atomic multipole moments in addition to the charges, and have been interfaced with a number of widely used codes such as Dalton and Gaussian QM/MMPol (Orlando and Jorgensen, 2010; Loco et al., 2016b). However, the parameterization of a polarizable force field can in some instances be rather cumbersome, while the computational overhead can become important.

<sup>1</sup>Available online at: https://www.nobelprize.org/nobel\_prizes/chemistry/laureates/2013/ (Accessed Jan 27, 2018)

An alternative strategy developed about 10 years ago (Jacquemin et al., 2009; Monari et al., 2013), called the electrostatic response of the surrounding (ERS), proposes to tackle polarization issues by surrounding the QM partition with a polarizable cavity described by the fast component of the dielectric constant, i.e., the dielectric constant extrapolated to infinite frequencies. To be precise, let us recall that this continuum is embedded in the MM partition but does not interact with the MM point charges (**Figure 1**). The idea underlying this strategy is that the fast component of the dielectric constant represents the instantaneous rearrangement of the surroundings in response to the change in the QM electronic density; hence the use of a self-consistent reaction field (SCRF) approach, like in PCM, allows the charge distribution of the surroundings to be optimized with respect to the change induced by the electronic transition. The advantages of this approach are two-fold. First, no particular parameterization is required, since the fast component of the dielectric constant is almost a constant value between 1.50 and 2.00 a.u. for every non-conductive material. Secondly, the computational overhead compared to EE is negligible, and of exactly the same magnitude as a PCM calculation. These aspects may hence qualify ERS as an efficient and universal strategy to include PE in complex systems. Indeed, its performance has been extensively proven in a number of different systems, including chromophores embedded in proteins (Monari et al., 2012a,b), dyes interacting with nucleic acids (Chantzis et al., 2013, 2014; Etienne et al., 2013; Véry et al., 2014; Dumont and Monari, 2015) as well as native proteins (Etienne et al., 2014b), and its capacity to yield extremely accurate results has been confirmed. Even though conceptually generalizable to every quantum chemistry method able to provide electronic excitation energies, most of the applications have been performed in the framework of the time-dependent density functional theory (TD-DFT) approach. This fact is understandable considering, on the one hand, the good ratio between accuracy and computational cost and, on the other hand, the fact that TD-DFT provides a well-balanced description of a relatively large manifold of excited states independently of the choice of an active space. However, as in all DFT methods, the drawback is the need to choose beforehand an exchange-correlation functional, whose effect on the accuracy of the results can be rather important. The general good practices concerning the choice of the functional for excited state calculations that have been established for isolated systems also hold when the environment is taken into account. For example, while hybrid functionals represent an obvious improvement over the performance of pure LDA and GGA functionals, charge-transfer states necessitate the use of long-range corrected functionals. Furthermore, the use of diagnostic indexes (Peach et al., 2008; Le Bahers et al., 2011; Etienne et al., 2014c) giving a numerical representation of the amount of charge transfer is strongly encouraged.
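The narrow range quoted above for the fast dielectric component can be checked quickly, assuming the usual optical relation that the infinite-frequency dielectric constant is approximately the square of the refractive index. The sketch below is illustrative only: the refractive indices are approximate textbook values introduced here, not data from the studies cited in this review.

```python
# Approximate refractive indices at optical frequencies (assumed textbook
# values, used only to illustrate the eps_inf ~ n**2 relation).
REFRACTIVE_INDEX = {
    "water": 1.333,
    "methanol": 1.328,
    "acetonitrile": 1.344,
    "acetone": 1.359,
}


def fast_dielectric(refractive_index):
    """Fast (optical) dielectric constant estimated as n squared."""
    return refractive_index ** 2


fast_eps = {name: fast_dielectric(n) for name, n in REFRACTIVE_INDEX.items()}
for name, eps in fast_eps.items():
    print(f"{name:13s} eps_inf ~ {eps:.2f}")
```

All four solvents fall well inside the 1.50 to 2.00 interval, even though their static dielectric constants differ widely, which is why a single fast-component value needs no system-specific parameterization.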

Indeed, in a striking application (Jacquemin et al., 2009) involving the TD-DFT study of a caged dye, the necessity of including PE in the QM/MM calculation of absorption spectra has been clearly evidenced. It has been shown that the effects of EE and PE were of the same order of magnitude, but while the former induced a red-shift, the latter blue-shifted the absorption back toward the ME results (**Figure 2**). Hence, the inclusion of EE alone would have strongly deteriorated the quality of the computed results.

FIGURE 1 | … solvated chromophore. The QM chromophore is represented in balls and sticks and atom color, the explicit MM charges are indicated as van der Waals spheres. The QM chromophore is placed in a cavity created in the polarizable continuum, which is schematized by the cyan transparent surface.

Once the necessity to include ME, EE, and PE in the QM/MM spectra calculations has been firmly assessed, one has to take into account that, especially in complex systems and complex environments, a good representation of the electronic excited states, and hence of the absorption spectra, requires going beyond the usual vertical representation of the excitation energy, in which one considers that absorption or emission spectra can be obtained as vertical transitions from the equilibrium geometry of the starting state. Usually, and as a first approximation, the explicit calculation of the difference in energy between the first vibrational states (E<sub>0−0</sub>) can be used; however, this strategy does not take into account the full vibrational structure. An alternative strategy is based on the explicit calculation of the quantum vibronic coupling, i.e., the Franck-Condon and Herzberg-Teller factors, based on the overlap between the vibrational wavefunctions, whose convolution gives the vibronic electronic spectra, i.e., the exact coupling between the vibrational and electronic states (Improta et al., 2007; Santoro et al., 2007). Some codes that calculate such factors exist, and have shown considerable success, in particular in the description of the coupling with the high-frequency modes and the corresponding change in the spectral band shape (Cerezo et al., 2015, 2016). However, also because of the numerical calculation of the vibrational overlap integrals, they show limitations in the case of low-frequency large-amplitude vibrational modes, such as the out-of-plane bending of π-conjugated systems. Furthermore, the explicit inclusion of the quantum effects, provided by the Franck-Condon principle, is less important for the latter modes, which, due to their large amplitude, behave much closer to the classical limit. Hence an alternative strategy to take such vibrational modes into account is to perform an accurate sampling of the potential energy surface landscape, allowing for the extraction of snapshots from which vertical excited states can be obtained (Etienne et al., 2013). The final spectrum is then constituted by the convolution of all the vertical transitions, weighted by the corresponding intensities, for each of the snapshots. Even though the quantum chemistry calculation of the excited states has to be repeated in order to obtain a reasonable statistical sampling of the conformational space, the explicit calculation of the Franck-Condon factors is avoided. Furthermore, different sampling techniques can be used, ranging from classical or QM/MM molecular dynamics up to semi-classical distributions such as the Wigner one (Dahl and Springborg, 1988). Most notably, in the Wigner distribution the snapshots are obtained from the chromophore equilibrium geometry and its harmonic vibrational frequencies, taking into account the energies of a set of quantum harmonic oscillators.
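The construction of a spectrum from snapshots described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of the cited works: each vertical transition is broadened with a Gaussian of fixed width (the 0.3 eV FWHM is an arbitrary choice), the function name `snapshot_spectrum` is hypothetical, and the excitation energies and oscillator strengths are made-up numbers standing in for the output of the excited-state calculations.

```python
import numpy as np


def snapshot_spectrum(energies_ev, osc_strengths, grid_ev, fwhm_ev=0.3):
    """Average absorption profile over snapshots.

    energies_ev, osc_strengths: arrays of shape (n_snapshots, n_states).
    Every transition contributes a unit-area Gaussian centred at its
    excitation energy and weighted by its oscillator strength.
    """
    energies = np.asarray(energies_ev, dtype=float)
    strengths = np.asarray(osc_strengths, dtype=float)
    grid = np.asarray(grid_ev, dtype=float)
    n_snapshots = energies.shape[0]
    sigma = fwhm_ev / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    diff = grid[:, None] - energies.ravel()[None, :]
    gaussians = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return (gaussians @ strengths.ravel()) / n_snapshots


# Made-up data: 2 snapshots, 2 excited states each (energies in eV).
energies = [[3.0, 4.0], [3.2, 4.1]]
strengths = [[0.5, 0.1], [0.4, 0.2]]
grid = np.linspace(1.0, 7.0, 3001)

spectrum = snapshot_spectrum(energies, strengths, grid)
area = spectrum.sum() * (grid[1] - grid[0])
print(f"integrated intensity: {area:.3f}")
print(f"band maximum near {grid[np.argmax(spectrum)]:.2f} eV")
```

Because each Gaussian has unit area, the integrated intensity equals the mean total oscillator strength per snapshot, which gives a simple sanity check on the convolution. In practice, the number of snapshots is increased until the band shape no longer changes.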

We have shown that even for simple organic molecules such as harmane cations, whose spectrum was simulated at the TD-DFT level (Etienne et al., 2013, 2014d), the inclusion of vibrational effects via the calculation of the excited states on snapshots extracted from a classical molecular dynamics trajectory in a water box induces a red-shift of about 20 nm, and hence allows the experimental results to be recovered (**Figure 3**). Remarkably, the same red-shift, and hence the same agreement with experimental results, is obtained when the excited states of harmane are computed: (i) at the QM/MM level from snapshots extracted from a molecular dynamics trajectory; (ii) at the QM + PCM level from snapshots extracted from a molecular dynamics trajectory; and (iii) at the QM + PCM level from snapshots extracted from a Wigner distribution. Hence, it clearly appears that the crucial element to recover experimental results is the correct treatment of vibrational effects rather than the treatment of the environment. However, it has to be underlined that, even though to a lesser extent than for the encapsulated squaraine, EE and PE produce different shifts of the absorption wavelength (**Figure 3**). The same importance of the dynamic effects has also been found in the calculation of emission, i.e., fluorescence, spectra of harmane (Etienne et al., 2014d). On the same point, and also due to the local nature of the bright states involved in the spectrum, the choice of the functional does not appear to be crucial; generally, hybrid functionals may be considered accurate and provide vertical excitation energies comparable with those obtained at the equation of motion coupled cluster (EOM-CC) level (Etienne et al., 2014d).
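The Wigner sampling used in option (iii) can be sketched for the simplest case. The code below assumes uncoupled harmonic modes in their vibrational ground state and works in dimensionless normal coordinates Q and momenta P, for which the ground-state Wigner function is a Gaussian with variance 1/2 in each variable. The function name is hypothetical, and the conversion of Q to Cartesian displacements (which requires the normal modes, frequencies, and masses of the chromophore) is deliberately left out.

```python
import numpy as np


def sample_wigner_ground_state(n_modes, n_samples, seed=0):
    """Draw (Q, P) pairs from the ground-state harmonic Wigner distribution.

    In dimensionless coordinates W(Q, P) is proportional to
    exp(-Q**2 - P**2), i.e., independent normal distributions with
    variance 1/2 for every mode.
    """
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(0.5)
    q = rng.normal(0.0, sigma, size=(n_samples, n_modes))
    p = rng.normal(0.0, sigma, size=(n_samples, n_modes))
    return q, p


q, p = sample_wigner_ground_state(n_modes=3, n_samples=20000)

# Energy per mode in units of hbar*omega; its average should approach the
# zero-point value of 1/2.
mean_energy = np.mean(0.5 * (q ** 2 + p ** 2))
print(f"mean energy per mode: {mean_energy:.3f} hbar*omega")
```

Each sampled set of Q values defines one displaced geometry on which a vertical excited-state calculation can be run, so the sampling needs only the equilibrium geometry and the harmonic frequencies rather than a trajectory.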

Vibrational effects also play a crucial role in the case of other organic compounds such as the naturally occurring drug palmatine (Dumont and Monari, 2015), known to interact with DNA and to trigger singlet oxygen production (Hirakawa and Hirano, 2008). It has been shown (**Figure 4**) that the calculation of vertical transitions from the equilibrium ground state geometry provides an excellent agreement with experimental values when hybrid functionals, such as PBE0, are used. On the other hand, including dynamic effects via molecular dynamics sampling leads to an excessive red-shift of the spectrum. However, the agreement of the static calculation is only due to the unphysical stabilization of charge transfer states induced by the use of a hybrid functional, and hence is only related to error cancelation. Indeed, when using long-range corrected or meta-GGA functionals, the effects are reversed and, in the case of M06-2X, the spectrum modeled including the dynamic effects is in very good agreement with the experimental one, while the static one appears far too blue-shifted (**Figure 4**). This tendency can be rationalized considering that long-range corrected functionals solve the problem related to the overstabilization of the charge transfer states, and hence globally blue-shift the absorption spectrum, avoiding error cancelation with the dynamic effects. Hence, the theoretical spectroscopy of palmatine represents a very nice example of the subtle interplay of different interlocking effects in the treatment of excited states in complex systems, which makes it necessary to finely tune the computational strategy to avoid spurious errors and a bad description of the dominant states. Furthermore, the sampling of the palmatine conformational space through molecular dynamics has also made it possible to recover its interaction modes with DNA and to model the change in the absorption spectrum due to this interaction (Dumont and Monari, 2015).

In the same context, it has been shown that the inclusion of vibrational and dynamic effects may also be crucial in determining not only the absorption spectrum but also the photophysically allowed pathways of many chromophores. As an example, the sampling of the conformational space of the hypericin drug (Gattuso et al., 2017a) via the extraction of snapshots from Wigner and molecular dynamics distributions has also made it possible to calculate other crucial parameters, such as the spin-orbit coupling (SOC) between the low-lying excited singlet and triplet states. In particular, it has been shown that the SOC is generally increased due to the geometry distribution and may reach values up to some tenths of cm<sup>−1</sup>. This sort of vibrationally allowed SOC can rationalize the relatively efficient intersystem crossing experienced by hypericin, and hence its efficient production of singlet oxygen upon irradiation, which constitutes the basis of some of its pharmacological activity.

Remarkably enough those properties have been observed both in the case of aqueous solution and when interacting with lipid membrane bilayers.

Hence a proper treatment of the environment, combined with the inclusion of dynamic effects, allows the precise determination of spectroscopic properties, as well as of key photophysical parameters, for a number of complex systems including optical materials such as solar cell dyes (Sengul et al., 2017) and poly-thiophene units (Turan et al., 2016), as well as biological systems and drugs. Furthermore, the effects of solvent relaxation and their influence on the spectroscopic properties can also be assessed through similar protocols (Zvereva et al., 2018).

## NON-LINEAR SPECTROSCOPY

# Two-Photon Absorption and Second-Order Harmonic Generation: General Principles, Experimental Techniques, and Computational Methods

If linear absorption properties are essential to characterize bio-macromolecules, non-linear absorption processes lead to important features, especially when looking at possible applications. Although these processes were intensively investigated in the past with the goal of enhancing materials properties, scientific interest has recently turned toward bio-macromolecules, in some cases reaching the bio-medicine field. In particular, two phenomena are of utmost interest: multi-photon absorption and high-order harmonic generation. Among them, because of the lower symmetry usually found in biological media compared to condensed materials, two-photon absorption (TPA) and second-order harmonic generation (SHG) are the main processes of interest (Antoine and Bonačić-Koutecký, 2018). In both cases, two photons interact simultaneously with the same molecule resulting, in the case of TPA, in the absorption to an electronic excited state corresponding to the sum of the energies of the two incoming photons or, in the case of SHG, in the emission of a photon arising from the sum of the incoming photon energies. Moreover, we can distinguish between degenerate two-photon phenomena, in which both photons have the same energy (hence the same frequency and the same wavelength), and non-degenerate phenomena, in which the two photons differ in their energies (**Figure 5**).

Concerning applications in the biological field, TPA is an especially desired feature since it solves two problems at once:

i) The simultaneous absorption of two photons allows, in the case of degenerate TPA, the required photon energy to be divided by a factor of two, i.e., the vertical transition wavelength to be doubled. This permits a notable shift of the required incoming photon energy toward the red part of the visible spectrum (i.e., a bathochromic shift), eventually making it possible to enter the so-called near-infrared therapeutic window (from 650 to 1350 nm), where the penetration of biological tissues is maximal (Tsai et al., 2001; Mojzisova and Vermot, 2011).

ii) The probability for the two photons to be absorbed simultaneously is proportional to the square of the light source intensity. This will definitely increase the spatial precision, since TPA will most likely play a role only at the laser focal point, decreasing in intensity much faster than one-photon absorption (OPA), i.e., much faster than linear absorption processes discussed in the previous section. This results in a crucial advantage for bio-medical applications, since TPA can focus precisely on the lesion area, reducing the side effects (Sun and Dalton, 2008; Benninger and Piston, 2013).
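Both points can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 400 nm transition is an arbitrary example, and the window limits are the 650 to 1350 nm values quoted in point (i).

```python
def degenerate_tpa_photon_wavelength_nm(opa_wavelength_nm):
    """Wavelength of each photon in degenerate TPA.

    Halving the photon energy doubles the wavelength, since E = h*c/lambda.
    """
    return 2.0 * opa_wavelength_nm


def in_therapeutic_window(wavelength_nm, lower_nm=650.0, upper_nm=1350.0):
    """True if the wavelength lies in the near-infrared therapeutic window."""
    return lower_nm <= wavelength_nm <= upper_nm


def relative_rates(intensity_ratio):
    """Relative OPA and TPA rates at a point where the light intensity is
    intensity_ratio times the focal-point value (linear vs. quadratic)."""
    return intensity_ratio, intensity_ratio ** 2


photon_nm = degenerate_tpa_photon_wavelength_nm(400.0)
opa_rate, tpa_rate = relative_rates(0.1)

print(f"one-photon transition at 400 nm -> two photons at {photon_nm:.0f} nm")
print(f"inside therapeutic window: {in_therapeutic_window(photon_nm)}")
print(f"at 10% of focal intensity: OPA x{opa_rate:.2f}, TPA x{tpa_rate:.2f}")
```

A chromophore absorbing in the violet thus becomes addressable with near-infrared light, and moving to a region with one tenth of the focal intensity reduces the TPA rate by a factor of one hundred instead of ten.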

On the other hand, SHG is mainly used as a principle to build powerful microscopy techniques of interest in bio-imaging. Here, the key factor lies in the possibility to generate visible light at high resolution. This has allowed high-resolution imaging from deep inside biological tissues, such as the lipid bilayers of cell membranes, by designing active chromophores with the required balance between hydrophobic and hydrophilic moieties (Barsu et al., 2009; Reeve et al., 2010).

Finally, the use of TPA and/or SHG should also be considered as highly promising in theranostics, a fusion of therapeutics and diagnostics (Jeelani et al., 2014). Indeed, TPA alone, coupled to red/infrared emission, can constitute the basis of emissive drugs combining treatment and imaging properties. Likewise, specific chromophores, such as retinal analogs among others (Theer et al., 2011), can combine TPA and SHG to match absorption, emission, and reactivity purposes (Barsu et al., 2006). In the following, we will focus on non-linear absorption, hence on TPA.

Even though a detailed understanding of the underlying physical principles is beyond the scope of the present review, some fundamental concepts should be recalled in order to interpret experimental and theoretical results, and thus possibly predict TPA properties. In linear absorption spectra, the intensity values are obtained through the Beer–Lambert law, by which the absorbance (the logarithm of the ratio between incident and transmitted intensity) is proportional to the molar absorption coefficient and to two experimental parameters: the solution concentration and the optical path length, i.e., the size of the spectroscopic cell. On the other hand, when theoretically determining linear absorption spectra, the oscillator strength is calculated, a dimensionless quantity proportional to the square of the transition dipole moment between the ground and the excited state of interest. Hence, a straightforward comparison between experimental and theoretical linear spectra is not possible, and usually relies on normalization of the computed spectra. This is not the case for TPA, since the so-called molecular TPA cross-section, expressed in absolute Göppert-Mayer units (1 GM = 10<sup>−50</sup> cm<sup>4</sup> s photon<sup>−1</sup> molecule<sup>−1</sup>), can be both measured experimentally and computed by theory (Göppert-Mayer, 1931; Kaiser and Garrett, 1961). This makes, in principle, the comparison between theory and experiment easier, even though much less data are available in the literature compared to OPA, due to the more sophisticated experimental setups and computational implementations required.
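The quantities recalled above can be made concrete with a short sketch. It uses the textbook forms of the Beer–Lambert law (A = ε c l), of the oscillator strength in atomic units (f = 2/3 ΔE |μ|²), and the definition of the GM unit; the function names and the numerical inputs are illustrative assumptions, not values taken from the reviewed studies.

```python
from typing import Sequence

# Definition of the Goeppert-Mayer unit: 1 GM = 1e-50 cm^4 s photon^-1.
GM_IN_CM4_S = 1.0e-50


def absorbance(molar_absorptivity: float,
               concentration_mol_per_l: float,
               path_length_cm: float) -> float:
    """Beer-Lambert law, A = epsilon * c * l (epsilon in L mol^-1 cm^-1)."""
    return molar_absorptivity * concentration_mol_per_l * path_length_cm


def transmitted_fraction(absorbance_value: float) -> float:
    """Fraction of light transmitted, I / I0 = 10^(-A)."""
    return 10.0 ** (-absorbance_value)


def oscillator_strength(excitation_energy_hartree: float,
                        transition_dipole_au: Sequence[float]) -> float:
    """Dimensionless oscillator strength, f = (2/3) * dE * |mu|^2,
    with the excitation energy and transition dipole in atomic units."""
    mu_squared = sum(component * component for component in transition_dipole_au)
    return (2.0 / 3.0) * excitation_energy_hartree * mu_squared


def gm_to_cm4_s(cross_section_gm: float) -> float:
    """Convert a TPA cross-section from GM to cm^4 s photon^-1."""
    return cross_section_gm * GM_IN_CM4_S


# Hypothetical inputs, chosen only to exercise the functions.
print(absorbance(50000.0, 1.0e-5, 1.0))
print(oscillator_strength(0.15, (1.0, 0.0, 0.0)))
print(gm_to_cm4_s(1010.0))
```

The sketch highlights the asymmetry discussed in the text: the oscillator strength is dimensionless and needs normalization before comparison with a measured absorbance, whereas a computed cross-section in GM can be compared to experiment directly.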

of one of the photons), the brighter the TPA strength to the final excited state. The latter phenomenon takes the name of resonance-enhanced TPA.

Concerning experiments, after the first demonstration of TPA by organic dyes (Peticolas et al., 1963), different techniques were developed. An added requirement makes direct TPA measurements more complicated than OPA ones: the need to determine the source irradiance for each probed wavelength. This means that not only the energy of the laser beam, but also its spatial distribution (including possible changes of propagation through the sample) and its pulse width (including possible temporal modulation) need to be accurately monitored over the whole spectral window to be probed (Negres et al., 2002). The main technique presently in use for direct TPA measurements is the z-scan technique. It derives its name from the required movement of the sample along the z axis, defined as the distance between the focused laser and the detector (Sheik-Bahae et al., 1990). Nevertheless, as a way to overcome the difficulties of direct TPA measurements, an alternative technique has emerged, based on two-photon excited fluorescence (Xu and Webb, 1996): if the OPA and TPA spectra of a reference compound are known (de Reguardati et al., 2016), then a simple comparison of the one- and two-photon excited fluorescence spectra of the sample and the reference cancels out most of the variables required by the direct measurement, finally leading to the TPA spectrum of the sample. Thanks to this indirect approach, it is not necessary to manage all the issues raised by a direct laser focus, even though fluorescence is required in order to detect a signal, which limits the number and type of molecules that can be measured. Moreover, TPA measurements cannot be obtained in the spectral region where OPA is dominant.
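The cancellation at the heart of the indirect approach can be sketched under strongly simplifying assumptions, which are ours and not a description of the cited experimental protocols: the two-photon excited fluorescence signal is taken as proportional to σ<sub>2</sub> × C × φ, the one-photon excited signal as proportional to ε × C × φ, and the instrument factors are taken as identical for sample and reference. All names below are hypothetical.

```python
def tpa_cross_section_from_reference(sigma2_ref_gm: float,
                                     f2_sample: float, f2_ref: float,
                                     f1_sample: float, f1_ref: float,
                                     eps_sample: float, eps_ref: float) -> float:
    """Estimate the sample TPA cross-section (GM) relative to a reference.

    Assumed signal model (idealized):
        F2 = k2 * sigma2 * C * phi     (two-photon excited fluorescence)
        F1 = k1 * eps    * C * phi     (one-photon excited fluorescence)
    with the same k1 and k2 for sample and reference. Dividing the F2 ratio
    by the F1 ratio removes concentration (C) and quantum yield (phi).
    """
    two_photon_ratio = f2_sample / f2_ref
    one_photon_ratio = f1_sample / f1_ref
    return sigma2_ref_gm * (two_photon_ratio / one_photon_ratio) * (eps_sample / eps_ref)
```

Under this model, neither the concentrations nor the fluorescence quantum yields need to be known separately, which is the sense in which most variables of the direct measurement cancel out.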

As for theory, general rules can be followed to establish structure-property relationships, even though the inclusion of dynamic effects by molecular dynamics, coupled to explicit solvent and complex bio-chemical environments, can provide additional and so far unexplored information. In more detail, chromophores can be divided into centrosymmetric and non-centrosymmetric ones. While a π-conjugated backbone is necessary for an organic molecule to absorb light at long wavelengths, donor (D) and acceptor (A) groups can be added to the ends or in the middle of the π-conjugated chromophore, generating centrosymmetric (e.g., D-π-D, A-π-A, D-π-A-π-D, A-π-D-π-A) and non-centrosymmetric (e.g., D-π-A) structures. Indeed, symmetry plays a role in the photophysical transition selection rules (Heflin et al., 1988; Dixit et al., 1991): in centrosymmetric chromophores, a virtual state can be generated while the molecule experiences the field of the first photon, allowing the following photon, arriving with a delay of ca. 5 fs (Birge and Pierce, 1986), to reach the final state, hence favoring TPA over OPA. This is possible thanks to the presence of an intermediate state next to the virtual state, a resonance condition which is generally not fulfilled in non-centrosymmetric chromophores, where in principle TPA and OPA are both possible with the same probability (**Figure 5**; Pawlicki et al., 2009). While the presence of charge transfer states can indeed improve the TPA cross section, it is not strictly necessary to induce non-linear absorption (Beerepoot et al., 2014).
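The role of an intermediate state lying close to the virtual state can be illustrated with a simplified three-state picture, in which a single intermediate state k dominates the sum over states and the TPA strength scales as the square of μ<sub>0k</sub> μ<sub>kf</sub> divided by the detuning E<sub>k</sub> − ħω. This is a textbook approximation used here only as a sketch; prefactors, line widths, and orientational averaging are omitted, and the function name and numbers are hypothetical.

```python
def three_state_tpa_strength(mu_0k: float, mu_kf: float,
                             e_intermediate_ev: float,
                             e_final_ev: float) -> float:
    """Relative TPA strength in a three-state model (arbitrary units).

    For degenerate TPA each photon carries half of the final-state energy,
    so the virtual state sits at E_f / 2. The strength grows as the
    intermediate state approaches the virtual state (small detuning).
    """
    photon_energy_ev = 0.5 * e_final_ev
    detuning_ev = e_intermediate_ev - photon_energy_ev
    if detuning_ev == 0.0:
        raise ValueError("exact resonance: the undamped model diverges")
    return (mu_0k * mu_kf / detuning_ev) ** 2


# Halving the detuning (1.0 eV -> 0.5 eV) quadruples the relative strength.
far = three_state_tpa_strength(1.0, 1.0, 3.0, 4.0)
near = three_state_tpa_strength(1.0, 1.0, 2.5, 4.0)
print(near / far)
```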

This can partially explain why, in the past, most of the attention focused on centrosymmetric modified molecules based initially on trans-stilbene (Parthenopoulos and Rentzepis, 1989; Ehrlich et al., 1997; Makarov et al., 2008) and azo-aromatic compounds (Antonov et al., 2003; De Boni et al., 2005), among others. This led to the development of compounds more appealing to material scientists than to biochemists and biologists: natural and bio-inspired systems rarely fulfill the requirement of being totally centrosymmetric; moreover, strong D and/or A groups cannot always be added, given the frequently delicate balance between chemical structure and biological function.

Computationally, two-photon transition moments are available for the EOM-EE-CCSD (Equation-Of-Motion for Excitation Energies Coupled-Cluster with Single and Double substitutions; Krylov and Gill, 2013) and ADC (Algebraic Diagrammatic Construction; Knippenberg et al., 2012) methods within the Q-Chem package (Shao et al., 2015), as well as at the CC2 (second-order approximate Coupled-Cluster singles and doubles model; Christiansen et al., 1995) level of theory through the TURBOMOLE program package (Furche et al., 2014), and at the TD-DFT (Time-Dependent Density Functional Theory) level through the DALTON2016 program suite (Aidas et al., 2014). In the latter case, since TPA active molecules usually include D and A groups as explained above, it is important to choose functionals which can describe the displacement of charge during the transition from D-centered orbitals to A-centered ones, such as hybrid and long-range corrected exchange-correlation functionals.

# Theoretical Predictions: Level of Theory and Environmental Effects

Two benchmark studies compare EOM-EE-CCSD, CC2 and TD-DFT/CAM-B3LYP (Yanai et al., 2004) two-photon cross sections for the chromophores of the Photoactive Yellow Protein (PYP) and of the Green Fluorescent Protein (GFP) in its neutral form (HBDI) (**Figure 6**; Beerepoot et al., 2015; Nanda and Krylov, 2015). All calculations were performed in the gas phase, and the results point toward a general qualitative agreement between the three levels of theory, with quantitative discrepancies: TD-DFT/CAM-B3LYP TPA cross-sections were found to be 1.5–3 times smaller than the EOM-EE-CCSD and CC2 ones (Beerepoot et al., 2015). The divergence could lie in the different description of the excited state dipole moments (Bednarska et al., 2013). Nevertheless, also because the effect of the environment is not taken into account in these studies, conclusions cannot be drawn on the basis of a comparison with experimental results.

On the other hand, the computationally more affordable TD-DFT (compared to Equation-Of-Motion and Coupled Cluster approximations) allows the environment to be treated by PCM and QM/MM methods at a much lower cost, including dynamic effects, finally making it possible to establish how the TPA maximum intensity is affected by a realistic environment. Moreover, the spectral window and spectral shape can be recovered, determining the applicability of the chromophore for therapeutic purposes or as a constituent of bio-materials.

In more detail, the TPA properties of boron-containing arenes were evaluated at the TD-DFT level, including the solvent by linear response PCM (Turan et al., 2016). In this case, a vertical approach was used, meaning that all molecules were optimized in the ground state, followed by calculation of the vertical transition to the excited state. The results point toward S<sub>2</sub> as the optically bright state for TPA, while S<sub>1</sub> is the bright state for OPA (**Figure 7**). This can be explained in terms of molecular orbitals. In particular, the electronic density reorganization in the excited state can be efficiently monitored through Natural Transition Orbitals (NTOs) (Martin, 2003), in this case obtained with the Nancy\_EX code (Etienne et al., 2014a,c). The S<sub>0</sub>→S<sub>1</sub> transition is of local π,π∗ character, while the S<sub>0</sub>→S<sub>2</sub> transition clearly shows charge transfer from the E-dimesitylborylethenyl lateral groups toward the central part of the conjugated chromophore, resembling the A-π-A centrosymmetric systems mentioned in the previous section. Indeed, S<sub>0</sub>→S<sub>1</sub> TPA values are almost negligible (not more than 2.5 GM), while S<sub>0</sub>→S<sub>2</sub> TPA values reach up to 1010 GM. Interestingly, TPA values for this kind of compound are predicted to change by two orders of magnitude, depending on the type and number of arenes contained. In particular, thiophene-based compounds have the highest cross section values, compared to phenyl and fluorinated phenyl. Moreover, increasing the number of linearly linked thiophene rings induces a strong non-linear effect, passing from 46 GM (one thiophene) to 64 GM (two thiophenes) to 1010 GM (three thiophenes). In this last case, the experimental value is available [1500 GM (Entwistle et al., 2009)], denoting an underestimation by the TD-DFT/PCM prediction, while keeping a good qualitative agreement.
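The cross-sections quoted above can be put on a common logarithmic scale with a few lines of arithmetic. The numbers are those given in the text; the variable and function names are ours.

```python
import math

# S0->S2 TPA cross-sections (GM) quoted in the text, keyed by the number
# of linearly linked thiophene rings.
COMPUTED_GM = {1: 46.0, 2: 64.0, 3: 1010.0}
EXPERIMENTAL_THREE_THIOPHENES_GM = 1500.0
LARGEST_S1_VALUE_GM = 2.5  # upper bound quoted for the S0->S1 transition


def orders_of_magnitude(smaller: float, larger: float) -> float:
    """Base-10 logarithm of the ratio between two cross-sections."""
    return math.log10(larger / smaller)


# Growth from one to three thiophene rings.
print(orders_of_magnitude(COMPUTED_GM[1], COMPUTED_GM[3]))
# Gap between the S0->S1 upper bound and the brightest S0->S2 value.
print(orders_of_magnitude(LARGEST_S1_VALUE_GM, COMPUTED_GM[3]))
# Fraction of the experimental value recovered by the TD-DFT/PCM prediction.
print(COMPUTED_GM[3] / EXPERIMENTAL_THREE_THIOPHENES_GM)
```

The last ratio shows that the computed value recovers roughly two thirds of the measured cross-section, consistent with the underestimation noted in the text.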

An alternative method to obtain TPA properties is to perform classical MD, followed by QM/MM TPA calculations on single snapshots randomly selected along the MD trajectory. This approach, also explained in section Linear Absorption for OPA, has the advantage of including the solvent explicitly, even though force field parameters usually have to be derived for the chromophore under study. In particular, the proper parameterization of chromophores with an extended cyclic π-conjugated backbone, as is the case for porphyrin-like systems, is far from trivial (Li et al., 1989; Autenrieth et al., 2004; Zhang et al., 2012). Indeed, the correct description of low-frequency modes is crucial, since they are expected to notably impact the prediction of the TPA spectrum. An example is chlorin-e6, already reported as a photodynamic and antibacterial drug (Zenkevich et al., 1996; Fernandez et al., 1997; Nyman and Hynninen, 2004; Paul et al., 2013; Winkler et al., 2016). In this case, a careful validation of the force field was performed by comparison with the Wigner distribution approach when calculating OPA properties (Gattuso et al., 2017b). Specifically, the overlap between the force field and Wigner computed absorption spectra was maximized, while the Root Mean Square Deviation (RMSD) was minimized. This force field optimization further allowed the calculation of TPA properties in explicit water at the QM(TD-DFT)/MM level. The spectrum (**Figure 8**) shows two peaks corresponding, as for all OPA spectra of porphyrin-like systems, to the Soret band (at ca. 730 nm and 60 GM) and to the Q band (at ca. 1100 nm and 20 GM). TPA intensity values are therefore much lower than those of the aforementioned boron-containing thiophenes (Turan et al., 2016), even though still acceptable for bio-applications.
Lower TPA intensities can be expected since chlorin-e6 partially loses the centrosymmetry of porphyrins, and can be further rationalized based on its NTOs: both OPA and TPA are possible, with a Soret band (S<sub>0</sub>→S<sub>2</sub>) much more intense than the Q band (S<sub>0</sub>→S<sub>1</sub>) in both cases. Indeed, the NTOs describe for both electronic transitions a charge reorganization centered in the core ring, which also explains the lower change in TPA intensity compared to the three orders of magnitude observed for thiophene-based chromophores (**Figure 7**). In any case, only a relatively small overlap between OPA and TPA spectra is expected, thus justifying the use of the indirect, and more affordable, experimental setup, also thanks to the fluorescent properties typical of these molecules (Gattuso et al., 2017b; Liu et al., 2018). This could hence encourage more experimental groups to measure this type of molecule, to better assess theoretical methods and proposed advancements in various applications (Ryu et al., 2018).
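The force field validation described above compares two computed spectra through an overlap and an RMSD criterion. The exact definitions used in the cited work are not given here, so the sketch below rests on two assumptions of ours: a normalized (cosine-like) overlap between spectra sampled on the same grid, and an RMSD taken between the same two sets of intensities.

```python
import math
from typing import Sequence


def _check_same_grid(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("spectra must be non-empty and share the same grid")


def normalized_overlap(spectrum_a: Sequence[float],
                       spectrum_b: Sequence[float]) -> float:
    """Normalized overlap of two spectra on a common grid
    (1 for identical shapes, 0 for non-overlapping ones)."""
    _check_same_grid(spectrum_a, spectrum_b)
    numerator = sum(x * y for x, y in zip(spectrum_a, spectrum_b))
    norm = math.sqrt(sum(x * x for x in spectrum_a) *
                     sum(y * y for y in spectrum_b))
    return numerator / norm


def rmsd(spectrum_a: Sequence[float], spectrum_b: Sequence[float]) -> float:
    """Root mean square deviation between two spectra on a common grid."""
    _check_same_grid(spectrum_a, spectrum_b)
    squared = sum((x - y) ** 2 for x, y in zip(spectrum_a, spectrum_b))
    return math.sqrt(squared / len(spectrum_a))
```

A candidate parameter set would then be preferred when it raises `normalized_overlap` and lowers `rmsd` with respect to the Wigner reference spectrum; how the two criteria are weighted against each other is left open in this sketch.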

Nonetheless, several photo-active chromophores are totally non-centrosymmetric. Examples are retinal and the Donor-Acceptor Stenhouse Adduct (DASA). The former is known as the cis-trans photo-isomerizable switch of rhodopsins, G-protein coupled receptors responsible for the process of vision (Wald et al., 1968; Okada et al., 2004) and recently employed as optogenetic tools (Deisseroth, 2011; Tischer and Weiner, 2014; Guo et al., 2016; Hontani et al., 2017). The latter is a recently discovered type of photo-switch (Helmy et al., 2014a,b) which converts from an initial π-extended colored state into a final compact colorless state, accompanied by a notable polarity change (**Figure 9**). Both retinals and DASAs are characterized by a lowest-lying singlet excited state (S<sub>1</sub>) of partial charge transfer character, even though of different origin: retinal is a protonated Schiff base which, in biological media, is surrounded by the opsin pocket of rhodopsin, i.e., an active site where the charges and polarity of the amino acid side chains enhance the S<sub>1</sub> charge transfer character (a sort of induced D-π-A system) compared to a retinal molecule in solution. In the case of DASAs, no macromolecular entity is responsible for the efficiency of the photo-activity, since the molecule is itself a D-π-A system. These symmetry arguments explain why a low S<sub>0</sub>→S<sub>1</sub> TPA intensity is theoretically predicted for both systems [ca. 2 and 3.32 GM for 11-cis retinal and DASA, respectively (Palczewska et al., 2014; García-Iriepa and Marazzi, 2017)]. Even so, retinal isomerization by TPA is expected to play a role in triggering human infrared vision (Artal et al., 2017). Hence, additional QM/MM studies including polarizable embedding could be worthwhile.
Also, it should be mentioned that bio-mimetic photo-switches based on retinal are available, even though they are still limited to the blue/UV part of the spectrum (Sampedro et al., 2004; García-Iriepa et al., 2013, 2016), and could hence greatly benefit from computational design to improve TPA properties. In this context it is also important to specify that cross-sections of <10 GM can indeed be detected experimentally; however, their increase is fundamental in order to permit the use of TPA switches in practical applications.

Finally, the impact of the biological medium (in the specific case of B-DNA) on the TPA properties of an organic photosensitizer was established by QM(TD-DFT)/MM calculations (including electrostatic and mechanical embedding), after having determined the photosensitizer-DNA interaction modes by classical MD. In particular, a centrosymmetric A-π-D-π-A dication was selected as photosensitizer (the 3,6-bis[2-(1-methylpyridinium)-]9-methylcarbazole, abbreviated as BMEMC), as it was experimentally shown to cause DNA strand breaks upon infrared irradiation, also in hypoxic conditions (Zheng et al., 2015). The BMEMC-DNA interaction modes and the sensitization mode of action (Gattuso et al., 2016a; **Figure 10**) were then rationalized on a theoretical basis: BMEMC can undergo a photoinduced spontaneous ionization, leading to the production of a solvated electron and a radical cation. Since it is a centrosymmetric structure with relatively strong A groups, TPA intensity values are high, exceeding 1000 GM for the S<sub>0</sub>→S<sub>2</sub> transition when computed in water. We should note that in this specific case a proper spectroscopic description of TPA properties is possible only by averaging QM/MM results over a reasonable number of MD trajectory snapshots (20 in this case). Indeed, when compared with the static approach, i.e., considering only the Franck–Condon structure obtained with the PCM model, we note an apparent inversion of the S<sub>0</sub>→S<sub>1</sub> and S<sub>0</sub>→S<sub>2</sub> bands in the calculated TPA spectrum (**Figure 10**). This can be explained in terms of low-frequency modes that strongly affect the TPA active S<sub>2</sub> state, for which only a dynamical approach results in a proper description. Indeed, as shown by the NTOs, the S<sub>2</sub> virtual orbital is mainly responsible for the electron displacement toward the pyrimidinium A edges, which are also the most affected by low-frequency modes. When looking at the interactions with DNA, four stable binding modes are found by MD: two intercalation modes and two minor groove binding modes, depending on whether the carbazole D core or a pyrimidinium A edge first makes contact with the biomacromolecule. The QM/MM TPA spectra calculated in the four cases, by convoluting the same number of trajectory snapshots as in water, show that the non-linear response of BMEMC does not change significantly in the presence of DNA, regarding both spectral regions and peak intensities. Such a description qualitatively matches the experimental measurements (Zheng et al., 2015). Comparing the binding modes, the only notable difference regards the S<sub>0</sub>→S<sub>1</sub> band shape, broader and less structured in intercalation than in minor groove binding. Again, this is the result of electrostatic and mechanical embedding: in intercalation, the core of BMEMC is in the hydrophobic pocket in the middle of DNA, and it is more constrained by the environment.
Even if the present example shows that the biological medium affects the TPA cross-section only marginally, and is moreover more oriented toward photodynamic therapy than diagnosis, it is important to recall that possible applications of TPA fluorescence, and its modulation by the environment, are of great importance for bio-imaging and are paving the way to a new dimension in theranostics, i.e., combined treatment and diagnosis (Hu et al., 2018).
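The need to average over MD snapshots, rather than rely on a single static structure, can be illustrated with a small statistical sketch. It computes the mean and the standard error of per-snapshot cross-sections for one transition; the function name is ours and the input values used in any example are hypothetical, not results from the cited study.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values_gm: Sequence[float]) -> Tuple[float, float]:
    """Mean TPA cross-section over snapshots and its standard error.

    The standard error (sample standard deviation divided by sqrt(N))
    indicates how well N snapshots determine the average.
    """
    if len(values_gm) < 2:
        raise ValueError("at least two snapshots are needed")
    mean = statistics.mean(values_gm)
    standard_error = statistics.stdev(values_gm) / math.sqrt(len(values_gm))
    return mean, standard_error
```

A large spread between snapshots, as expected when low-frequency modes strongly modulate the bright state, is a sign that a single Franck–Condon structure may not be representative.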

# Effect of the Polarizable Embedding Scheme

The polarizable embedding scheme has recently become available to include not only the polarization of the QM region due to MM charges, but also vice versa, thus reaching an even more realistic description of the environment. In particular, TD-DFT can be applied to calculate multi-photon absorption properties in a modified version of the DALTON program (Steindal et al., 2016). This can improve the treatment of environmental effects, although at an increased computational cost. Indeed, the same neutral GFP chromophore previously studied in the gas phase (**Figure 6**; Beerepoot et al., 2015) was found to absorb with a bathochromic shift (of 0.09–0.16 eV) when including the polarizable embedding scheme for two-, three-, and four-photon absorption. Nevertheless, we should note that TPA strengths are more sensitive to the size of the QM region than OPA strengths (Steindal et al., 2016), which suggests two alternatives to maintain a reasonable balance between computational quality and cost: (i) notably increase the size of the QM region, keeping an electrostatic QM/MM embedding scheme; (ii) keep an acceptable size of the QM region, applying a polarizable QM/MM embedding scheme. Moreover, for future tests and benchmarks, QM/MM border treatments alternative to the hydrogen link-atom approach could be proposed, as a way to limit the size of the QM region and improve the quality of the predicted TPA properties.
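To give a feeling for what a bathochromic shift of 0.09–0.16 eV means on the wavelength scale, the following sketch converts between photon energy and wavelength using hc ≈ 1239.84 eV nm. The 400 nm starting wavelength is a hypothetical example, not the absorption maximum reported in the cited work.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV nm


def nm_to_ev(wavelength_nm: float) -> float:
    """Photon energy (eV) for a given wavelength (nm)."""
    return HC_EV_NM / wavelength_nm


def ev_to_nm(energy_ev: float) -> float:
    """Wavelength (nm) for a given photon energy (eV)."""
    return HC_EV_NM / energy_ev


def red_shifted_wavelength_nm(wavelength_nm: float, shift_ev: float) -> float:
    """Wavelength after lowering the transition energy by `shift_ev`."""
    shifted_energy = nm_to_ev(wavelength_nm) - shift_ev
    if shifted_energy <= 0.0:
        raise ValueError("shift exceeds the transition energy")
    return ev_to_nm(shifted_energy)


# Hypothetical transition at 400 nm, shifted by the two limits quoted above.
for shift in (0.09, 0.16):
    print(shift, red_shifted_wavelength_nm(400.0, shift))
```

Because energy and wavelength are inversely related, the same shift in eV corresponds to a larger shift in nm at longer wavelengths.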

# CIRCULAR DICHROISM MODELING

Even though linear and non-linear absorption spectroscopies have proved to be efficient in probing the binding propensity of organic compounds with complex bio-macromolecular systems (Jiang et al., 2014; Li et al., 2015), they are unable to provide information about the global structure of bio-macromolecular systems. Indeed, the sensitivity of these techniques is not sufficient to differentiate between different arrangements. In contrast, electronic Circular Dichroism (CD), which is a fast and sensitive spectroscopic method, provides a specific optical signature for each polypeptide secondary structure arrangement (alpha-helix, beta-sheet, random coil) (Greenfield, 2007) or nucleic acid macromolecular structure (A/B/Z-DNA, G-quadruplexes, i-motifs, and more; Kypr et al., 2009; Vorlíčková et al., 2012). Hence, it allows an unequivocal differentiation of the system configurations. Moreover, since the resulting optical signal is directly correlated to the global structure of the system, specific phenomena can be efficiently probed, such as the formation of aggregates between molecular compounds and proteins or DNA. CD can also be used to investigate structural reorganizations from one configuration to another induced by external effects.
Moreover, in the case of DNA/ligand and DNA/protein binding, CD is able to provide structural information at two levels: (i) on one side, considering the spectral fingerprint of well-characterized macromolecular arrangements, the structural modifications (local or global) induced by the binding can be probed by simply following the spectral features while increasing the concentration of the substrate (Basu and Suresh Kumar, 2014); (ii) on the other side, the induced CD (for non-chiral compounds) or band shape modifications (for chiral compounds) while bound to a biomolecular system can also provide insights in the binding configurations or be used to perform spectroscopic titration and hence recover binding free-energies (Holmgaard List et al., 2017).

In the case of nucleic acid aggregates, extensive experimental studies of the ability of CD to compare and differentiate most of the well-known conformations, such as A/B/Z-DNA, guanine and cytosine quadruplexes, or even i-motifs, have been reported (Kypr et al., 2009; Vorlíčková et al., 2012). Moreover, the results demonstrated the efficiency of CD in following induced transitions between two configurations. Indeed, the authors described the trifluoroethanol-induced transitions from B-DNA to A-DNA (sequence poly[GCGGCGACTGGTGAGTACGC]) and from B-DNA to Z-DNA (sequence poly[d(GC)]). They also described the evident spectral modifications occurring during transitions between quadruplexes or during the formation of i-motifs while varying the pH or increasing the ionic strength. This work demonstrated how robust the information given by CD can be and paved the way to more advanced studies on DNA/ligand and DNA/protein binding.

Considering natural nucleic and amino acid arrangements such as B-DNA or alpha helices, their CD spectral features arise from the interaction of optically active chromophores organized in a specific layout. The chromophoric units giving rise to the most prominent CD signals are the nucleobases in the case of DNA, and the -(N-H)-(C=O)- peptide bonds in the case of proteins. When these achiral molecular chromophores are assembled in larger macromolecular ensembles possessing intrinsic chirality (such as the B-DNA double helix or alpha helical peptides), their interaction induces circularly polarized optical signals. The interactions influencing the electronic properties of these units can be (i) hydrogen bonds, (ii) π-stacking interactions, or (iii) electrostatic interactions. Furthermore, upon optical excitation of the aggregate one can observe (iv) the formation of excitons or (v) the population of Charge-Transfer (CT) states. In the case of B-DNA, the delocalization of excitons does not extend over more than three stacked nucleobases, and doublets of stacked nucleobase pairs can be considered a model accurate enough to provide a reasonable description of the B-DNA CD spectra (Nogueira et al., 2017), as will be presented in the following sections.

Modeling the CD spectra of large macromolecular systems such as proteins and DNA is challenging since it requires overcoming numerous technical difficulties, some of them common to the calculation of linear and non-linear absorption but emphasized by the inherent extreme sensitivity of CD spectroscopy to subtle structural and electronic effects:

i) Sufficient and accurate sampling of the conformational space. The increase in computing resources nowadays allows extensively long Molecular Dynamics (MD) simulations, reaching from microsecond to millisecond time scales, directly comparable to experimental results (Galindo-Murillo et al., 2015). Moreover, the accuracy of current force fields describing these biological systems has been demonstrated, and the sampled potential energy surfaces can be considered close to in-vitro experiments (Dans et al., 2017). The recent parmbsc1 correction (Ivani et al., 2015) available for AMBER is currently one of the most robust force fields for nucleic acids.


The protocols employed by our group to model the CD spectra of biomolecular systems will be described in more detail in the following. Nevertheless, they can be summarized in four steps: (i) the starting structure is obtained from X-ray or NMR studies or from in-silico builders of biomolecules (NAB, tleap; Case et al., 2017); (ii) an MD simulation is run to sample the potential energy surface of the system; (iii) the excited states are computed at the ab-initio level and, where needed, their coupling over the full macromolecular system is obtained via semiempirical Hamiltonians; (iv) the excited state energies and transition strengths are convoluted using Gaussian or Lorentzian shaped functions. In particular, concerning the crucial point of the coupling between the individual chromophores, we chose to use the simplest dipole moment approach. Hence, the coupling is estimated simply by the scalar product of the transition dipole moments calculated for each of the individual chromophores, weighted by the distance between their centers of charge. This strategy differs from the one employed by other authors (Jurinovich et al., 2014), who chose to explicitly determine the coupling via the excited states' density matrices; although much simpler both conceptually and computationally, it is able to yield a quite accurate representation of the macromolecular CD signal. However, care should be taken in using the approximate dipole model, especially in the case of strongly coupled systems, for instance closely packed chromophores, for which the explicit calculation of the coupling via the density matrix approach may be necessary to avoid the breakdown of the model. Indeed, it is known that, for distances between the interacting units smaller than 5 Å, the dipole model can overestimate the excitonic coupling.
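The dipole-based coupling can be sketched with the standard point-dipole expression, V = [μ<sub>a</sub>·μ<sub>b</sub> − 3(μ<sub>a</sub>·r̂)(μ<sub>b</sub>·r̂)] / r³, in atomic units. This textbook form is an assumption made for illustration: the text only states that a scalar product weighted by the inter-chromophore distance is used, so the exact expression of the original protocol may differ. The 5 Å warning threshold is the one quoted above; all function names are hypothetical.

```python
import math
from typing import Sequence

DIPOLE_MODEL_MIN_DISTANCE_ANGSTROM = 5.0  # threshold quoted in the text


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def point_dipole_coupling(mu_a: Sequence[float],
                          mu_b: Sequence[float],
                          r_ab: Sequence[float]) -> float:
    """Point-dipole coupling between two transition dipoles (atomic units).

    mu_a, mu_b: transition dipole vectors of the two chromophores.
    r_ab: vector joining their centers of charge.
    """
    distance = math.sqrt(_dot(r_ab, r_ab))
    if distance == 0.0:
        raise ValueError("chromophores cannot share the same center")
    unit = [component / distance for component in r_ab]
    orientation = _dot(mu_a, mu_b) - 3.0 * _dot(mu_a, unit) * _dot(mu_b, unit)
    return orientation / distance ** 3


def dipole_model_is_questionable(distance_angstrom: float) -> bool:
    """Flag separations below which the dipole model may overestimate the coupling."""
    return distance_angstrom < DIPOLE_MODEL_MIN_DISTANCE_ANGSTROM
```

For two parallel unit dipoles, the expression gives a positive coupling in a side-by-side arrangement and a negative one, twice as large in magnitude, in a head-to-tail arrangement.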

# Nucleic Acids

#### Circular Dichroism of B-DNA

The specific CD signals of B-DNA are directly associated with its famous double helical structure. Modeling CD spectral features that can be directly matched with experiments is still far from straightforward and suffers from many drawbacks: (i) most experimental studies of phenomena in DNA are performed on non-specific sequences such as calf thymus DNA (Kankia et al., 2001). Since the CD signal is directly related to the sequence, it is impossible to provide a meaningful model without considering all the possible nucleobase sequences. Moreover, in the case of the interaction of DNA with an organic molecule, the binding will often be non-specific. Since the length of the B-DNA strands commonly modeled using MD ranges from 10 to 32 base pairs, the number of possible binding sites increases drastically, and reproducing every possible binding configuration becomes an unfeasible task. For this reason, model systems must be considered, made up of a selected sequence with a specific binding configuration (Basu and Suresh Kumar, 2014). That is why most of the CD modeling performed so far has concerned either well-characterized sequences (such as the X-ray crystal structure, pdb code: 1BNA, or the NMR structure, pdb code: 1K9H) or the simplest nucleic acid sequences, namely hetero- and homo-polymers of adenine (A)-thymine (T) and guanine (G)-cytosine (C) (Drew et al., 1981).

The first reported modeling of B-DNA CD employing ab-initio methods was performed by Miyahara et al. (2013) on the structure of a poly(dGCCCGGGC) double strand obtained by X-ray crystallography (Heinemann et al., 1987). In this study, the excited state energies were computed using the Symmetry-Adapted Cluster (SAC)-Configuration Interaction (CI) theory after pre-optimization of the system at the DFT/B3LYP/6-31G(d,p) level. Solvation was taken into account using the Polarizable Continuum Model (PCM) implemented in Gaussian. In this work, either eight excited states were computed in dimer models (two stacked nucleobases, to study the influence of π-stacking interactions, or a nucleobase pair including hydrogen bonding) or 14 excited states were considered in a tetramer model made up of two stacked nucleobase pairs (containing both π-stacking and hydrogen bonding interactions). For the smaller dimer models, it was demonstrated, on the one hand, that hydrogen bonding within base pairs is responsible for the change in excitation energies compared to single nucleobases and, on the other hand, that π-stacking is responsible for the sign of the CD signal. Finally, the computation performed on the tetramer model resulted in a CD spectrum in very good agreement with experiment.

Employing a Complex Polarization Propagator (CPP), Di Meo et al. modeled the CD spectra of a poly(dA.dT)<sub>20</sub> B-DNA double strand (Di Meo et al., 2015). Excited states of model base pair dimers and trimers were computed at the TD-DFT/CAM-B3LYP/aug-cc-pVDZ level of theory, considering 30 excited states. The sampling of the B-DNA conformational space was performed through 100 ns long MD simulations using the amberff12-bsc0 force field. Their results show a very good agreement with the experimental CD fingerprint of the adenine-thymine base pair homo-polymers in the near-ultraviolet region, corresponding to two positive bands at 260 and 283 nm and a more intense negative band at 249 nm. Furthermore, Norman et al. (2015) tackled the case of the 147 base-pair long nucleosomal DNA (pdb code: 1KX5) employing a similar procedure. The results demonstrated the influence of the super-helical configuration on the CD spectra, including a hypsochromic shift of the band at 269 nm and a strong decrease of its intensity.

Another approach, by Padula et al., to model the CD spectrum of two 10 base-pair long B-DNA guanine-cytosine hetero- and homo-polymers relied on the explicit calculation of the exciton coupling in the DNA double strand (Padula et al., 2016). Experimental structures obtained by X-ray crystallography or NMR were used as starting configurations. The excited states were computed at the TD-DFT/M06-2X/6-31G(d) level of theory, taking into account the environment effects and the interactions between chromophores using the MMPol method, which allows the environment to be treated through a polarizable embedding. Moreover, only bright π-π∗ excited states were selected in this protocol. Afterwards, the excitonic Hamiltonian was generated by coupling the full transition densities of each partition to obtain, after its diagonalization, the final CD spectra. The final spectrum (averaged over 270 snapshots extracted from a 90 ns long MD simulation employing the parmbsc0 force field for DNA) recovers the overall band shapes and is comparable with experiment.

Our protocol for the simulation of nucleic acids' CD signals also relied on the 4 steps described hereinbefore, starting with 10 ns of MD simulation on an in-silico built 15 base-pair long DNA double strand (bsc0 force field for nucleic acids). Since excited states in B-DNA are relatively local, they can be treated in the framework of Frenkel exciton theory. In this case, excited states can be computed on a series of decoupled partitions (for example single nucleobases), which are then recoupled by building an exciton coupling matrix. Each excited state of each partition is coupled with all the other excited states using the methodology presented in (Gattuso et al., 2015, 2016b). The excited states were obtained using TD-DFT/M06-2X/6-311+G(d) on 40 snapshots, divided into up to 8 different partitions made up of nucleobase pairs.
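The recoupling step can be illustrated with the smallest possible case, two partitions with one excited state each. The following is a minimal sketch, not the implementation used in the cited works: the function name `exciton_dimer`, the site energies and the coupling value are hypothetical, and the 2×2 Frenkel Hamiltonian is diagonalized analytically.

```python
import math


def exciton_dimer(e1, e2, coupling):
    """Diagonalize the 2x2 Frenkel exciton Hamiltonian [[e1, J], [J, e2]].

    e1, e2   : local excitation energies of the two partitions (hypothetical, eV)
    coupling : exciton coupling J between the two local states (eV)
    Returns ((lower_energy, lower_vector), (upper_energy, upper_vector)).
    """
    mean = 0.5 * (e1 + e2)
    half_gap = 0.5 * (e1 - e2)
    split = math.hypot(half_gap, coupling)
    # Mixing angle defined by tan(2*theta) = 2J / (e1 - e2)
    theta = 0.5 * math.atan2(2.0 * coupling, e1 - e2)
    upper_vec = (math.cos(theta), math.sin(theta))
    lower_vec = (-math.sin(theta), math.cos(theta))
    return (mean - split, lower_vec), (mean + split, upper_vec)


if __name__ == "__main__":
    # Two degenerate local states at 4.70 eV coupled by 0.05 eV (illustrative numbers)
    (lo, lo_vec), (hi, hi_vec) = exciton_dimer(4.70, 4.70, 0.05)
    print(lo, lo_vec)
    print(hi, hi_vec)
```

With more partitions and more states per partition, the same matrix is simply larger (local energies on the diagonal, couplings off the diagonal) and is diagonalized numerically.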

Moreover, QM/MM embedding was used to take the environment into account, in order to tune the electronic properties of each partition. Polarization was accounted for using ERS as presented before. The resulting rotatory strengths were then convoluted using Gaussian functions with a Full Width at Half Maximum (FWHM) of 0.2 eV. The results obtained on heteropolymers of AT and GC base pairs are presented in **Figure 11** and nicely underline the sequence effects on the DNA CD spectra.
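The convolution step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the stick-spectrum values are hypothetical, the Gaussians are area-normalized, and the physical prefactors that convert rotatory strengths into Δε are omitted, so the output is in arbitrary units.

```python
import math


def gaussian_broadened_spectrum(energies_ev, rotatory_strengths, grid_ev, fwhm_ev=0.2):
    """Broaden a stick CD spectrum with area-normalized Gaussians.

    energies_ev        : transition energies (eV)
    rotatory_strengths : signed rotatory strengths (arbitrary units)
    grid_ev            : energies at which the spectrum is evaluated (eV)
    fwhm_ev            : full width at half maximum of each Gaussian (eV)
    """
    # Convert FWHM to the Gaussian standard deviation
    sigma = fwhm_ev / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    spectrum = []
    for x in grid_ev:
        total = 0.0
        for energy, strength in zip(energies_ev, rotatory_strengths):
            total += strength * norm * math.exp(-0.5 * ((x - energy) / sigma) ** 2)
        spectrum.append(total)
    return spectrum


if __name__ == "__main__":
    # Hypothetical exciton couplet: two transitions with opposite rotatory strengths
    grid = [4.0 + 0.05 * i for i in range(31)]
    for x, y in zip(grid, gaussian_broadened_spectrum([4.6, 4.8], [1.0, -1.0], grid)):
        print(f"{x:.2f} {y:+.4f}")
```

Because the rotatory strengths are signed, neighbouring bands of opposite sign partially cancel after broadening, which is why the chosen width affects the apparent band intensities.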

#### Circular Dichroism of G-quadruplexes

G-quadruplexes (G4) can be sorted into three families (parallel, hybrid, and antiparallel), corresponding to their specific nucleic acid arrangements, with each family possessing sub-members (Randazzo et al., 2012). Their CD spectra have been described in depth (del Villar-Guerra et al., 2017) and can be easily decomposed into a series of nucleic acid interaction components, with a major contribution from the central guanine core. Each family can be differentiated by a specific sequence of spectral bands: (i) parallel, characterized by a positive band at 264 nm and a negative one at 245 nm; (ii) hybrid, characterized by positive bands at 295 and 260 nm and a negative one at 245 nm; (iii) antiparallel, characterized by a positive band at 295 nm and a negative one at 260 nm.

In (Loco et al., 2016a), a procedure similar to that of (Padula et al., 2016) was used to model the CD spectrum of an antiparallel G4 (pdb code: 143D). However, this study did not sample the conformations using MD simulations, but was instead performed directly on the experimental (crystallographic) structure. In the excitonic Hamiltonian, guanines were considered as individual partitions brought into interaction using the MMPol polarizable embedding (TD-DFT/M06-2X/6-311+G(d) level, 10 excited states calculated for each guanine). The resulting spectrum almost matches the experimental one.

Using the same protocol as for B-DNA, including sampling the conformational space through MD, and considering as partitions the four triplets of stacked guanines of the G4 core, we also showed (Gattuso et al., 2016b) that exciton coupling gives a satisfactory simulated CD spectrum for the three G4 families, with clear evidence of their spectral features (see **Figure 12**). Our starting structures were X-ray crystal structures (pdb code: 1KF1 for parallel, 2HY9 for hybrid and 143D for antiparallel configurations) which possess the main spectral features detailed before. Moreover, our results demonstrated that, to model the CD spectra of G4, the explicit environment has to be considered with great care, since these aggregates are stabilized by two central cations (K<sup>+</sup> for hybrid and parallel configurations and Na<sup>+</sup> for the antiparallel one) electrostatically bound to the guanines of the core.

#### Peptides

#### Circular Dichroism of Alpha-Helices

Circular dichroism of protein secondary structures also draws strong interest since, as for DNA, it can be directly related to the global structure of the bio-system (Greenfield, 2007). Indeed, it provides information about open and closed conformations in the case of binding with a ligand, or even protein-protein and protein-DNA binding (Carpenter et al., 2009). Moreover, CD can probe reorganizations in the case of environmental changes, such as an increase of ionic strength or temperature (Jirgensons, 1980). So far, the main attempts to model the CD spectra of macromolecular amino acid arrangements have targeted alpha helices and their specific spectral band shapes, consisting of two negative bands at 208 and 222 nm and an intense positive band at 193 nm.

Kaminský et al. (2011) performed the first CD modeling of relatively long alpha helices. Their protocol consisted of first sampling the conformational space of in-silico built alpha helices using MD simulations, followed by a computation of the excited states and rotatory strengths. Specifically, Ac-(Ala)n-NH-Me structures (with n going from 2 to 19) were studied by Transition Dipole Coupling (TDC) and TD-DFT, using the B3LYP functional and various basis sets, focusing on the difference between vacuum and the COSMO method to include solvent effects. Even though the resulting band shape is reasonably comparable with experimental CD, the MD simulations were performed on in-silico built starting configurations for a maximum of 50 ps. Considering the high flexibility and the need to allow the system (and solvent) to reorganize, this time length definitely cannot be considered sufficient to sample the ensemble of configurations of such a system. They mainly concluded that considering several consecutive amino acids is important to allow exciton delocalization.

Our protocol (Gattuso et al., 2017b), described hereinbefore, combined, as was the case for nucleic acids, conformational sampling through MD with the diagonalization of an excitonic coupling Hamiltonian whose elements are obtained by QM/MM calculations on individual subunits. In detail, in the case of a polypeptide comprising 27 amino acids and assuming a dominant alpha-helix conformation, we performed: (i) 80 ns of MD sampling using the amber99 force field; (ii) computation of excited-state properties on partitions at the TD-DFT/M06-2X/6-311++G(d,p) level of theory (each computation was performed using a hybrid QM/MM method including ERS); (iii) coupling of exciton transition dipole moments; (iv) convolution of rotatory strengths by Gaussian functions with a FWHM of 0.4 eV. In this study, we demonstrated that the final CD spectra are highly sensitive to the partitioning scheme, with convergence obtained for loop subunits (four consecutive amino acids, see **Figure 13**). Also, the simplicity of the Hamiltonian requires the arrangement of partitions to be considered with great care. Indeed, the most accurate results were obtained with a hybrid partitioning including loops of amino acids (quadruplets) and hydrogen-bonded dimers (couples). This made it possible to retain the higher accuracy obtained for bigger molecular partitions, while keeping an alpha-helical arrangement of the subunits' centers of charge. Most notably, the coupling of MD sampling and QM/MM calculations also made it possible to recover the modification of CD signals induced by the partial breaking of the alpha-helix, due to the coupling of the polypeptide with a photoactive switch undergoing cis/trans isomerization.

FIGURE 13 | Experimental (Miles and Wallace, 2006) and modeled (Gattuso et al., 2017b) CD spectra of an alpha helical peptide. On the left, the selected partitioning scheme to apply the Frenkel excitons scheme is shown.

#### GENERAL CONCLUSION

In this review we gather recent work done in our group dealing with the description of electronic excited states in complex molecular systems. We present three main applications, namely linear absorption, non-linear optical properties (mainly two-photon absorption) and circular dichroism, together with the specific methodological development that has been made. We show that when an adequate description of both surrounding and dynamical effects is chosen, our calculations agree nicely with the available experimental data and can thus be used for prediction when experimental data are missing, or at least, as in the case of the structurally sensitive electronic CD, they allow for a simple and straightforward one-to-one mapping with the system structural features. As a general take-home message, one should be aware that a good sampling of the conformational space and a polarizable embedding scheme are required to obtain reliable results.

# AUTHOR CONTRIBUTIONS

AM: was responsible for the part on linear absorption; MM: took care of the non-linear optical properties; HG: handled the section devoted to circular dichroism; XA: was responsible for the remaining parts of the manuscript.

## FUNDING

MM is thankful to the French National Research Agency (ANR) for a grant under the DeNeTheor project, and to the Universidad de La Rioja (UR) for a postdoctoral contract.

# ACKNOWLEDGMENTS

The authors would like to gratefully thank the Phi-Science student association for providing an adequate working space to the authors while they wrote this manuscript.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer, AN, declared a past co-authorship with several of the authors, XA, AM, MM, to the handling Editor.

Copyright © 2018 Marazzi, Gattuso, Monari and Assfeld. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# On the Difference Between Additive and Subtractive QM/MM Calculations

#### Lili Cao and Ulf Ryde\*

Department of Theoretical Chemistry, Chemical Centre, Lund University, Lund, Sweden

The combined quantum mechanical (QM) and molecular mechanical (MM) approach (QM/MM) is a popular method to study reactions in biochemical macromolecules. Even if the general procedure of using QM for a small, but interesting part of the system and MM for the rest is common to all approaches, the details of the implementations vary extensively, especially the treatment of the interface between the two systems. For example, QM/MM can use either additive or subtractive schemes, of which the former is often said to be preferable, although the two schemes are often mixed up with mechanical and electrostatic embedding. In this article, we clarify the similarities and differences of the two approaches. We show that inherently, the two approaches should be identical and in practice require the same sets of parameters. However, the subtractive scheme provides an opportunity to correct errors introduced by the truncation of the QM system, i.e., the link atoms, but such corrections require additional MM parameters for the QM system. We describe and test three types of link-atom correction, viz. for van der Waals, electrostatic, and bonded interactions. The calculations show that electrostatic and bonded link-atom corrections often give rise to problems in the geometries and energies. The van der Waals link-atom corrections are quite small and give results similar to a pure additive QM/MM scheme. Therefore, both approaches can be recommended.

#### *Edited by:*

Sam P. De Visser, University of Manchester, United Kingdom

#### *Reviewed by:*

Albert Poater, University of Girona, Spain; Jiayun Pang, University of Greenwich, United Kingdom

> *\*Correspondence:* Ulf Ryde Ulf.Ryde@teokem.lu.se

#### *Specialty section:*

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

*Received:* 31 January 2018 *Accepted:* 14 March 2018 *Published:* 03 April 2018

#### *Citation:*

Cao L and Ryde U (2018) On the Difference Between Additive and Subtractive QM/MM Calculations. Front. Chem. 6:89. doi: 10.3389/fchem.2018.00089

Keywords: QM/MM, haem oxygenase, sulfite oxidase, mechanical embedding, electrostatic embedding, additive QM/MM, subtractive QM/MM

# INTRODUCTION

Combined quantum mechanics and molecular mechanics (QM/MM) is a popular method to study biological macromolecules, as well as homogeneous catalysis and nanostructures, with computational methods (Balcells and Maseras, 2007; Lin and Truhlar, 2007; Ramos and Fernandes, 2008; Stoyanov et al., 2008; Senn and Thiel, 2009; Keal et al., 2011; Chung et al., 2015; Jover and Maseras, 2016; Ryde, 2016). In this approach, a small region of central interest (typically 20–300 atoms) is treated with quantum mechanical (QM) methods, whereas the remainder of the macromolecule, as well as a considerable amount of explicit solvent are treated by molecular mechanics (MM). This is supposed to combine the accuracy of QM methods with the speed of MM methods. Moreover, the entire macromolecule is included in the calculations [in contrast to the alternative QM-cluster approach (Blomberg et al., 2014), in which most parts of the macromolecule are omitted], reducing the risk of making a biased choice in the selection of the considered system and allowing for a detailed study of how the surroundings affect the properties of interest.

A problem with QM/MM approaches is that there exist so many variants and that the details of these are seldom discussed. For example, QM/MM approaches can use either subtractive or additive schemes (Senn and Thiel, 2009). In a subtractive scheme, three separate calculations are performed: one QM calculation for the QM region (system 1; $E\_1^{\rm QM}$) and two MM calculations, one for the entire system (systems 1 and 2; $E\_{12}^{\rm MM}$) and one for the QM region ($E\_1^{\rm MM}$) (Maseras and Morokuma, 1995; Ryde, 1996b; Svensson et al., 1996):

$$E\_{\rm QM/MM}^{\rm sub} = E\_1^{\rm QM} + E\_{12}^{\rm MM} - E\_1^{\rm MM} \tag{1}$$

The advantage with this approach is the simplicity: It automatically ensures that no interactions are double-counted and it can be set up for any QM and MM software (provided that they can write out energies and forces), without the need of any modification of the code. Thereby, the QM/MM software is updated every time the underlying QM or MM software is updated. Moreover, it can be easily extended to more than two computational methods and regions (Svensson et al., 1996). The typical example of a subtractive scheme is ONIOM (Svensson et al., 1996), but other software use similar methods, e.g., ComQum (Ryde, 1996b).

In the additive scheme, only two calculations are performed: the same QM calculation for the QM region, but only a single MM calculation ($E\_{2-1}^{\rm MM}$) (Sherwood et al., 2003; Senn and Thiel, 2009):

$$E\_{\rm QM/MM}^{\rm add} = E\_1^{\rm QM} + E\_{2-1}^{\rm MM} \tag{2}$$

although the latter is often formally divided into two terms, an MM energy of system 2 and a QM/MM interface energy ($E\_{2-1}^{\rm MM} = E\_2^{\rm MM} + E\_{12}^{\rm QM/MM}$). In this case, it is up to the developer to ensure that no interactions are omitted or double-counted. Therefore, an additive scheme requires special MM software, in which the user or developer can select which MM terms to include. The advantage of the additive QM/MM scheme is that no MM parameters for the QM atoms are needed, because those energy terms are calculated by QM.
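The bookkeeping behind Equations (1) and (2) can be compared with a toy model. The sketch below is only an illustration under strong assumptions: the "MM energy" is a sum of made-up pair terms, there are no link atoms and no bonded terms, and all names (`mm_energy`, `toy_pair_energies`, the atom indices and the QM energy) are hypothetical.

```python
from itertools import combinations


def mm_energy(atoms, pair_energy):
    """Toy MM energy: sum of pair terms over all pairs within `atoms`."""
    return sum(pair_energy[frozenset(pair)] for pair in combinations(sorted(atoms), 2))


def subtractive_energy(e_qm1, system1, system2, pair_energy):
    """Equation (1): E_QM(1) + E_MM(1+2) - E_MM(1)."""
    full = mm_energy(system1 | system2, pair_energy)
    inner = mm_energy(system1, pair_energy)
    return e_qm1 + full - inner


def additive_energy(e_qm1, system1, system2, pair_energy):
    """Equation (2): E_QM(1) plus only those MM terms that involve system 2."""
    outer = sum(
        pair_energy[frozenset(pair)]
        for pair in combinations(sorted(system1 | system2), 2)
        if not set(pair) <= system1  # skip pairs that lie entirely inside the QM region
    )
    return e_qm1 + outer


def toy_pair_energies(atoms, scale_inner=1.0, inner=frozenset()):
    """Made-up pair energies; pairs inside `inner` can be rescaled to mimic poor parameters."""
    table = {}
    for a, b in combinations(sorted(atoms), 2):
        value = -1.0 / (1 + abs(a - b))
        if {a, b} <= inner:
            value *= scale_inner
        table[frozenset((a, b))] = value
    return table


if __name__ == "__main__":
    qm_region, mm_region = {0, 1, 2}, {3, 4, 5, 6}
    table = toy_pair_energies(qm_region | mm_region)
    print(subtractive_energy(-10.0, qm_region, mm_region, table))
    print(additive_energy(-10.0, qm_region, mm_region, table))
```

In this simplified setting the two functions return the same number, and rescaling the pair terms that lie entirely inside the QM region changes neither result, because those terms cancel in the subtractive expression and are never evaluated in the additive one.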

Further differences may arise if the QM region is covalently connected to the MM region. Then, the QM system needs to be properly truncated. This can be done by special localized orbitals (Levitt, 1976; Théry et al., 1994; Gao et al., 1998; Murphy et al., 2000; Senn and Thiel, 2009), but it is more common that the QM system is simply truncated by hydrogen atoms, the hydrogen link-atom approach (Singh and Kollman, 1986; Field et al., 1990; Reuter et al., 2000; Senn and Thiel, 2009; Ryde, 2016). In the subtractive scheme, MM parameters for the link atoms are needed (Senn and Thiel, 2009).

The interaction between the QM and MM regions is typically dominated by electrostatics. This interaction can also be treated at different levels of approximation (Senn and Thiel, 2009; Ryde, 2016). In mechanical embedding, it is calculated at the MM level (Maseras and Morokuma, 1995; Svensson et al., 1996). In electrostatic embedding, the electrostatic QM–MM interaction is instead treated at the QM level by including a point-charge model (i.e., atomic partial MM charges) of system 2 in the QM calculations (Singh and Kollman, 1986; Field et al., 1990; Ryde, 1996b; Dapprich et al., 1999). Thereby, system 1 is polarized by system 2, but not vice versa. In polarized embedding, both systems are mutually and self-consistently polarized in the QM calculations (Poulsen et al., 2001; Söderhjelm et al., 2009; Olsen et al., 2010). This requires a polarizable MM force field for system 2 (Lopes et al., 2009) and a QM software that can treat polarizabilities, which are still rather unusual. Therefore, such calculations are less common and typically restricted to single-point calculations of accurate properties. Mechanical embedding is normally considered to be less accurate than electrostatic embedding (Senn and Thiel, 2009), and the latter has therefore been the most widely used approximation, although it involves polarization of only parts of the system and is more sensitive to the treatment of the link atoms (Hu et al., 2011b). Strictly, Equations (1, 2) apply only to mechanical embedding, but they can easily be adapted to electrostatic embedding by including a point-charge model of system 2 in the QM term ($E\_{\rm QM1+ptch2}$) and setting the charges of system 1 to zero in the MM calculations (Ryde, 1996b).
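To make the distinction concrete, the MM-level electrostatic QM–MM interaction of mechanical embedding can be sketched as a Coulomb sum over fixed partial charges. This is a minimal sketch under stated assumptions: the function name, the charges and the coordinates are hypothetical, the conversion constant is the commonly quoted approximate value for kcal/mol with distances in Å, and no exclusions, cut-offs or periodic images are considered.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2); individual MM packages
# use slightly different values.
COULOMB_KCAL = 332.0637


def coulomb_interaction(system1, system2):
    """Electrostatic interaction between two sets of point charges.

    Each system is a list of (charge_in_e, (x, y, z)) with coordinates in Angstrom.
    """
    energy = 0.0
    for q1, xyz1 in system1:
        for q2, xyz2 in system2:
            energy += COULOMB_KCAL * q1 * q2 / math.dist(xyz1, xyz2)
    return energy


if __name__ == "__main__":
    # Hypothetical charges: two "QM" atoms and two "MM" atoms
    qm = [(0.4, (0.0, 0.0, 0.0)), (-0.4, (1.2, 0.0, 0.0))]
    mm = [(-0.8, (4.0, 0.0, 0.0)), (0.8, (4.0, 1.0, 0.0))]
    print(coulomb_interaction(qm, mm))                         # mechanical embedding term
    print(coulomb_interaction([(0.0, xyz) for _, xyz in qm], mm))  # charges of system 1 zeroed
```

Zeroing the charges of system 1, as done in the MM calculations under electrostatic embedding, makes this MM term vanish, so the interaction is then carried entirely by the point charges included in the QM calculation.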

Unfortunately, the distinction between the subtractive and additive schemes in literature is often unclear and confused. In many cases, the subtractive scheme is equated with mechanical embedding and the additive scheme with electrostatic embedding (Senn and Thiel, 2009; Götz et al., 2014). In other cases, the subtractive scheme is equated with the ONIOM method (Roßbach and Ochsenfeld, 2017). We prefer the definition in Equations (1, 2), emphasizing that the subtractive scheme employs two MM calculations with an external MM program, whereas the additive scheme employs a single MM calculation with an internal MM program, allowing the developer to cherry-pick the MM terms actually needed. In particular, both additive and subtractive schemes may use either mechanical or electrostatic (or even polarized) embedding.

It is normally assumed that the subtractive scheme is harder to set up and requires accurate MM parameters for the QM region and link atoms. For example, Roßbach and Ochsenfeld state in a recent article comparing subtractive and additive QM/MM (Roßbach and Ochsenfeld, 2017): "The (additive) QM/MM approach has the advantage that parameters for QM and link atoms, saturating covalent bonds between QM and MM, are unnecessary, as these are never described by the force field. The subtractive ONIOM approach requires accurate parameters for all atoms, including link atoms, because an MM calculation of the QM region is also necessary to avoid double counting." On the other hand, Sousa et al. present the opposite view that the subtractive scheme is more advantageous, because of the "lack of a requirement for a parameterized expression describing the interaction of the various regions, and the fact that all systematic errors in the treatment of the inner regions by the lower levels of theory are canceled out" (Sousa et al., 2017).

In this article, we aim at clarifying the difference between the two schemes and comparing their performance. We will show that, with a proper setup, additive and subtractive schemes should give identical results with a similar effort and that they require the same set of MM parameters. However, the subtractive scheme may be tuned to correct errors introduced by the link atoms, and then additional parameters are needed.

# METHODS

## The ComQum QM/MM Software and Its Subtractive Scheme

A problem when comparing QM/MM methods is that there exist so many variants and that the details of the calculations are seldom discussed (Ryde, 1996b; Senn and Thiel, 2009). Therefore, we here give a thorough discussion of our QM/MM software and details of all QM/MM variants implemented. All QM/MM calculations in this article were performed with the ComQum software (Ryde, 1996b; Ryde and Olsson, 2001). ComQum is a modular program, combining the QM software Turbomole (Ahlrichs et al., 1989; Furche et al., 2014; TURBOMOLE version 7.1, 2016) and the MM software AMBER (Case et al., 2014). It consists of five small Fortran programs that read, write, manipulate, and transfer coordinates, energies, forces, and charges between the MM and QM software. It employs a subtractive scheme and it was developed in 1992–1995, concurrently and independently from the ONIOM software (Ryde, 1995, 1996a,b). It has always used electrostatic embedding, in contrast to ONIOM. Junctions are treated by the hydrogen link-atom approach (Ryde, 1996b; Reuter et al., 2000; Senn and Thiel, 2009).

To make the discussion clear, we use the following conventions (Senn and Thiel, 2009): The QM region is called system 1, whereas atoms in the MM region are called system 2. The QM region is terminated by hydrogen link atoms, called HL. They replace the corresponding carbon link (CL) atoms in the real system. HL and CL are different representations of the same atom and never appear in the same (MM or QM) calculation. This is illustrated in **Figure 1**. A superscript HL or CL shows which representation is used. We will use XL to denote either HL or CL. The HL atom is covalently bound to a single Q<sub>1</sub> atom in the QM region, whereas the CL atom is connected to (typically several) M<sub>2</sub> atoms in the MM region (we use this somewhat illogical notation, because several other QM/MM descriptions use M<sub>1</sub> to denote CL; we prefer our notation because in ComQum, HL and CL are two different representations of the same atom, which belongs to the QM region). The Q<sub>1</sub> atoms are covalently bound to Q<sub>2</sub> atoms, which are covalently bound to Q<sub>3</sub> atoms, and so on. Likewise, the M<sub>2</sub> atoms are covalently bound to M<sub>3</sub> atoms, and so on.

The HL atom is placed along the Q<sub>1</sub>–CL bond, with the Q<sub>1</sub>–HL bond length ($r\_{Q\_1-\text{HL}}$) calculated from (Ryde, 1996b):

$$r\_{Q\_1-\text{HL}} = r\_{Q\_1-\text{CL}} \frac{r\_{Q\_1-\text{HL}}^0}{r\_{Q\_1-\text{CL}}^0} \tag{3}$$

where $r\_{Q\_1-\text{CL}}$ is the current Q<sub>1</sub>–CL bond length, $r\_{Q\_1-\text{CL}}^0$ is the optimum Q<sub>1</sub>–CL bond length in the MM force field and $r\_{Q\_1-\text{HL}}^0$ is the Q<sub>1</sub>–HL bond length in a model of the isolated truncated residue, optimized with the current QM method and basis set. $r\_{Q\_1-\text{CL}}^0$ can be found in the MM force field libraries and $r\_{Q\_1-\text{HL}}^0$ is easily obtained by a simple QM geometry optimisation (which typically takes less than a minute). Equation (3) can also be used in reverse during the QM/MM geometry optimization to calculate the CL coordinates from the HL coordinates and it is also used together with the chain rule to obtain the MM forces on the HL atom (Ryde and Olsson, 2001). Thus, the HL atoms do not introduce any additional degrees of freedom.
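The geometric content of Equation (3) can be sketched in a few lines. This is an illustration only, not ComQum code: the function names are hypothetical, and the bond lengths used in the example (1.09 Å for Q<sub>1</sub>–HL and 1.53 Å for Q<sub>1</sub>–CL) are assumed illustrative values rather than parameters taken from the article.

```python
import math


def place_hl(q1_xyz, cl_xyz, r0_hl, r0_cl):
    """Place HL on the Q1-CL bond so that r(Q1-HL) = r(Q1-CL) * r0_hl / r0_cl (Equation 3)."""
    scale = r0_hl / r0_cl
    return tuple(q + scale * (c - q) for q, c in zip(q1_xyz, cl_xyz))


def recover_cl(q1_xyz, hl_xyz, r0_hl, r0_cl):
    """Use Equation (3) in reverse: obtain the CL position from the HL position."""
    scale = r0_cl / r0_hl
    return tuple(q + scale * (h - q) for q, h in zip(q1_xyz, hl_xyz))


if __name__ == "__main__":
    q1 = (0.0, 0.0, 0.0)
    cl = (0.9, 1.2, 0.0)                 # current Q1-CL distance of 1.5 Angstrom
    hl = place_hl(q1, cl, r0_hl=1.09, r0_cl=1.53)
    print(hl, math.dist(q1, hl))         # HL lies on the same bond vector, at a shorter distance
    print(recover_cl(q1, hl, r0_hl=1.09, r0_cl=1.53))
```

Because the HL position is a fixed function of the Q<sub>1</sub> and CL positions, moving either representation determines the other, which is the sense in which no additional degrees of freedom are introduced.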

The total QM/MM energy in standard ComQum (which uses a subtractive scheme with electrostatic embedding) is calculated from Equation (4) (Ryde, 1996b; Ryde and Olsson, 2001):

$$E\_{\rm QM/MM}^{\rm sub,EE} = E\_{\rm QM1+ptch2}^{\rm HL} + E\_{\rm MM12,q\_1=0}^{\rm CL} - E\_{\rm MM1,q\_1=0}^{\rm HL} \tag{4}$$

where $E\_{\rm QM1+ptch2}^{\rm HL}$ is the QM energy of the QM system truncated by HL atoms and embedded in the set of point charges modeling system 2 (but excluding the self-energy of the point charges). All atoms in system 2 are included in the point-charge model (but not the CL atoms, which do not belong to system 2 in our view). In our original development, charges of some additional atoms were excluded (Ryde, 1996b), but a comparison of several different charge distribution schemes did not show any advantage of more complicated schemes (Hu et al., 2011b). $E\_{\rm MM1,q\_1=0}^{\rm HL}$ is the MM energy of the QM system 1, still truncated by HL atoms, but without any electrostatic interactions. Finally, $E\_{\rm MM12,q\_1=0}^{\rm CL}$ is the MM energy of all atoms in the system with CL atoms and with the charges of the QM system set to zero (to avoid double counting of the electrostatic interactions).

In the original implementation of ComQum (Ryde, 1996b), it was necessary to set up two AMBER parameter and topology (prmtop) files for the MM calculations, one for the full system ($E\_{\rm MM12,q\_1=0}^{\rm CL}$) and one for the truncated QM region ($E\_{\rm MM1,q\_1=0}^{\rm HL}$). This involved development of MM parameters for all junctions, which is tedious and error prone, even if the same truncated residues can be used for several different proteins. Initially, some attempts were made to allow the calculations on the small system to compensate for the truncations and the introduction of link atoms (i.e., for the conversion of CL atoms to HL atoms). However, we did not see any consistent improvement in the results; instead, such a treatment often introduced instability in the calculations.

In particular, we soon realized that the parameters of the HL–Q<sub>1</sub> bond cannot be freely selected. The deterministic relation between HL and CL in ComQum (Equation 3) implies that we should use $r\_{Q\_1-\text{HL}}^0$ as the equilibrium bond length and the force constant must be

$$k\_{\rm Q1-HL} = k\_{\rm Q1-CL} \left(\frac{r\_{\rm Q1-CL}^{0}}{r\_{\rm Q1-HL}^{0}}\right)^2 \tag{5}$$

where $k\_{\rm Q1-CL}$ is the force constant of the Q<sub>1</sub>–CL bond in the MM force field. Otherwise, a spurious force will be introduced. Moreover, for the Q<sub>2</sub>–Q<sub>1</sub>–HL angle and the Q<sub>3</sub>–Q<sub>2</sub>–Q<sub>1</sub>–HL dihedral parameters we simply used the corresponding MM Q<sub>2</sub>–Q<sub>1</sub>–CL angle and Q<sub>3</sub>–Q<sub>2</sub>–Q<sub>1</sub>–CL dihedral parameters, obtained by copying these entries from the MM force field.

Even if this was a completely mechanical procedure, it was still somewhat tedious and error prone, because it had to be redone every time the MM force field, QM method, or basis set was changed. Therefore, in 2006 we implemented a program (changeparm) that performs this task automatically: reading the AMBER prmtop file of the full system and a file with the ideal QM bond length ($r\_{Q\_1-\text{HL}}^0$), it automatically generates the prmtop and coordinate files for the MM calculation of the QM system ($E\_{\rm MM1,q\_1=0}^{\rm HL}$) according to these rules. Thereby, only a single MM calculation needs to be set up (that of the full system, which typically is already done, because QM/MM studies of macromolecules normally start with an equilibration of the structure with molecular dynamics) and no special parameters need to be developed for the truncated system or the link atoms. This removes one of the disadvantages of the subtractive scheme (but it also discards a potential advantage of the method, as will be discussed below). However, the changeparm program required procedures to read and write the prmtop file, as well as a complete understanding of the meaning of all entries in it, a significant programming effort. Still, it has been so valuable that we recently have implemented the corresponding program (Cao et al., 2018) also for the crystallography and NMR system (CNS; Brunger et al., 1998; Brunger, 2007) software for quantum refinement (Ryde et al., 2002; Ryde and Nilsson, 2003).

Thus, with our implementation, the subtractive QM/MM scheme also requires only a single prmtop for the entire system. However, this file must contain parameters for all atoms, including those in the QM region. This is so because (the leap module in) AMBER refuses to print the file if any parameter is missing. It may be that other MM software is less restrictive. However, this is a rather minor restriction, because the parameters do not need to be accurate. On the contrary, for all interactions involving only QM atoms (except van der Waals interactions involving the link atoms), the MM energies cancel exactly (owing to the $E\_{\rm MM12,q\_1=0}^{\rm CL} - E\_{\rm MM1,q\_1=0}^{\rm HL}$ terms in Equation 4). Therefore, dummy (e.g., zeroed) parameters may be used. Moreover, with the general MM force fields available in most MM software (Vanommeslaeghe et al., 2010), including AMBER (Wang et al., 2004), most parameters already exist. The only problem may be metal sites, but if no explicit bonds are defined between the metals and any ligand atoms, only van der Waals parameters for the metal are needed and these are necessary also in the additive scheme. Thus, the need for parameters for the QM region is a very minor restriction of the subtractive scheme and the setup of dummy parameters can easily be automated (although we have never felt such a need).

#### Additive ComQum

Recently, we implemented a simple software (calcforce), which calculates MM energies and forces, based on AMBER coordinate and parameter–topology files. This was done to allow calculations without any cut-off for non-bonded interactions, even for very large systems (AMBER employs a non-bonded pair list that can become too large for the memory with more than ∼10<sup>5</sup> atoms) and to gain control over exactly what forces are written out by the program (for example, turning on the dumfrc option in the AMBER software changes the electrostatic energy). As a byproduct, it also gives us full control over the energy function and allowed us to implement an additive QM/MM scheme.

In our additive approach, we use the energy function:

$$E\_{\rm QM/MM}^{\rm add,EE} = E\_{\rm QM1+ptch2}^{\rm HL} + E\_{\rm MM2-1}^{\rm CL} \tag{6}$$

where the $E\_{\rm QM1+ptch2}^{\rm HL}$ QM term is identical to that used in the subtractive scheme. As shown by the superscript, all terms in $E\_{\rm MM2-1}^{\rm CL}$ employ CL atoms, coordinates and parameters, never any HL atoms. We employed the following rules to determine which MM terms to include in $E\_{\rm MM2-1}^{\rm CL}$ (note again that in our notation, the CL atoms belong to the QM region):


These rules are based on the simple philosophy that we should calculate by MM all terms that are not already considered in the QM calculations. This seems very natural and should represent a typical implementation of additive QM/MM (Sherwood et al., 2003), although the rules are seldom discussed explicitly. This selection is illustrated in **Table 1** for the simple ethanol model in **Figure 1** (with CB as the link atom). Note that the first four rules are identical to the rules we use for setting up the truncated prmtop file in the subtractive scheme, so that the two schemes should give identical bonded and electrostatic energies. In particular, the two schemes give identical energies for the XL–Q<sub>1</sub> bond term (assuming the harmonic term $E^{\rm bond}\_{\rm Q1-XL} = k\_{\rm Q1-XL}(r\_{\rm Q1-XL} - r^0\_{\rm Q1-XL})^2$ employed in AMBER and most other macromolecular force fields) even if the subtractive scheme uses HL coordinates and the additive scheme CL coordinates, because of the relations between the coordinates in Equation (3) and the force constants in Equation (5) (the force constants in Equation 5 were constructed with this aim).

TABLE 1 | Illustration of which terms are included in the various energies for the ethanol molecules in Figure 1 (atom names are shown in that figure, except that H1 indicates either H11 or H12 and H2 indicates H21, H22, or H23).

QM means that the term is included by the QM calculation, MM that it is included as a MM term, HL or CL that it is treated as a HL or CL atom, Sc that the non-bonded interaction is scaled down, and Ptch that it is treated by point charges.

The only difference between the two schemes is the van der Waals interactions involving the link atoms and another atom in the QM system (possibly other link atoms): In the additive scheme, no such interactions are calculated by MM (because both atoms belong to the QM system). However, in the subtractive scheme, all van der Waals interactions involving link atoms are calculated twice: In $E\_{\rm MM12,q\_1=0}^{\rm CL}$, they are obtained with MM parameters and coordinates for CL atoms, whereas in $E\_{\rm MM1,q\_1=0}^{\rm HL}$, they are obtained with MM parameters and coordinates for HL atoms. In contrast to the terms for all the other QM atoms, these two terms are not identical and therefore do not cancel. Instead, they provide a MM correction for the link atoms, i.e., for the fact that the HL atoms in the QM region are H atoms and not the correct C atoms (i.e., they have smaller van der Waals radii) and that they are at incorrect positions. Thus, these van der Waals terms in $E\_{\rm MM1,q\_1=0}^{\rm HL}$ can be seen as a correction to the corresponding energy in the QM calculation ($E\_{\rm QM1+ptch2}^{\rm HL}$), which also involves the incorrect HL coordinates and atoms. We will call it the van der Waals link-atom correction (VLAC). MM van der Waals parameters are normally quite accurate, so this approach is used in most subtractive schemes, but it cannot be included in a strict additive scheme, which may be a disadvantage. It should be noted that these interactions are only within the QM region, so with a small QM region, only a few interactions are involved and the correction is small.
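As a rough illustration of where this correction comes from, the following Python sketch evaluates one 12-6 Lennard-Jones pair interaction twice, once with carbon-like (CL) and once with hydrogen-like (HL) parameters and distance. The functional form and combination rules follow the usual AMBER convention; the numerical parameters, the two distances, and all function and variable names are illustrative assumptions and are not taken from this study.

```python
import math

def lj_pair(rmin_i, eps_i, rmin_j, eps_j, r):
    """AMBER-style 12-6 energy (kcal/mol) for one atom pair at distance r (Å).

    rmin_* are per-atom radii (Å) that are summed for the pair,
    eps_* are per-atom well depths combined by a geometric mean.
    """
    rmin = rmin_i + rmin_j
    eps = math.sqrt(eps_i * eps_j)
    x6 = (rmin / r) ** 6
    return eps * (x6 * x6 - 2.0 * x6)

# Assumed, roughly force-field-like parameters: (radius in Å, epsilon in kcal/mol)
CARBON = (1.908, 0.1094)    # link atom treated as a CL (carbon) atom
HYDROGEN = (1.487, 0.0157)  # link atom treated as a HL (hydrogen) atom
PARTNER = (1.721, 0.2104)   # some other atom of the QM system

r_cl = 3.8  # partner-CL distance (Å), assumed
r_hl = 4.1  # partner-HL distance (Å), assumed

e_cl = lj_pair(*CARBON, *PARTNER, r_cl)    # contributes to the full-system MM term
e_hl = lj_pair(*HYDROGEN, *PARTNER, r_hl)  # contributes to the truncated MM term
vlac = e_cl - e_hl                         # what survives in the subtractive energy

print(f"E_vdW(CL) = {e_cl:.4f}  E_vdW(HL) = {e_hl:.4f}  VLAC = {vlac:.4f} kcal/mol")
```

In a real calculation the sum runs over all such pairs that are not excluded or scaled by the force field; the single pair here only shows that the two terms differ in both parameters and geometry.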

In fact, we can exactly reproduce the additive QM/MM calculations within a subtractive scheme by replacing the $E\_{\rm MM1,q\_1=0}^{\rm HL}$ term in Equation (4) with an $E\_{\rm MM1,q\_1=0}^{\rm CL}$ term, in which parameters and coordinates corresponding to the CL atoms, rather than the HL atoms, are used. This was done manually to confirm that the implementations are correct, but it has never been implemented for production calculations (because the additive scheme gives the same results).
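This relation can be written compactly. The sketch below assumes that Equation (4) has the form implied by the three terms named above, and it uses the observation that the additive energy is recovered when the truncated MM term is evaluated with CL atoms; the symbol $\Delta E\_{\rm VLAC}$ is a label introduced here only for illustration.

$$\begin{aligned}
E\_{\rm QM/MM}^{\rm sub} &= E\_{\rm QM1+ptch2}^{\rm HL} + E\_{\rm MM12,q\_1=0}^{\rm CL} - E\_{\rm MM1,q\_1=0}^{\rm HL} \\
E\_{\rm QM/MM}^{\rm add} &= E\_{\rm QM1+ptch2}^{\rm HL} + E\_{\rm MM12,q\_1=0}^{\rm CL} - E\_{\rm MM1,q\_1=0}^{\rm CL} \\
\Delta E\_{\rm VLAC} &\equiv E\_{\rm QM/MM}^{\rm sub} - E\_{\rm QM/MM}^{\rm add} = E\_{\rm MM1,q\_1=0}^{\rm CL} - E\_{\rm MM1,q\_1=0}^{\rm HL}
\end{aligned}$$

Since the bonded terms are constructed to be identical in the two truncated MM calculations and the QM charges are zeroed, only the van der Waals interactions involving link atoms remain in the difference.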

#### Mechanical Embedding

For comparison, we have also implemented mechanical embedding (ME) in ComQum [we have calculated single-point ME energies before (Hu et al., 2011a,b), but not done full geometry optimizations]. ME calculations can be run by simply deleting the point-charge model from the QM calculations (removing the \$point\_charges keyword from the Turbomole control file) and (re-)inserting charges of the QM system in the prmtop files for the two MM calculations in Equation (4). This gives the energy function:

$$E\_{\rm QM/MM}^{\rm sub,ME} = E\_{\rm QM1}^{\rm HL} + E\_{\rm MM12}^{\rm CL} - E\_{\rm MM1}^{\rm HL} \tag{7}$$

Two issues need to be settled in this implementation. The first is how to treat the link atoms. If the charge of the HL atom is identical to that of the CL atom, the electrostatics within the QM system cancel exactly in the $E\_{\rm MM12}^{\rm CL} - E\_{\rm MM1}^{\rm HL}$ terms in Equation (7). However, then the total energy reflects electrostatics involving HL atoms, from the $E\_{\rm QM1}^{\rm HL}$ term. In analogy with the van der Waals energy correction described in the previous section, we prefer to have different charges for the HL and CL atoms, with those of the HL atoms being representative for a H atom and those of the CL atoms being representative for the true (typically C) atoms.

This leads us to the second issue, viz. how the charges are calculated for the QM system, including the HL and CL atoms. Since QM calculations are done for the QM system, it is natural to use some sort of QM-derived charges. Originally, ComQum employed Mulliken charges (these charges are used also with electrostatic embedding when parts of system 2 are optimized by MM; Ryde, 1996b). However, it is well-known that charges fitted to the electrostatic potential (ESP) give (by construction) more accurate electrostatic interaction energies (Sigfridsson and Ryde, 1998). Therefore, we have used such ESP charges [obtained with the Merz–Kollman scheme (Besler et al., 1990)] ever since ESP charges were implemented in Turbomole (note that in Turbomole, the point-charge model needs to be removed before the charges are calculated, without reoptimizing the wavefunction). These charges can be directly used in the $E\_{\rm MM1}^{\rm HL}$ term and therefore also for the HL atoms, since they were obtained for a QM system with HL atoms.

However, for the $E\_{\rm MM12}^{\rm CL}$ term, these charges need to be adapted so that the charge of the total system remains integer, meaning that the HL charges need to be adapted to apply for CL atoms instead. It is not evident how this should be done and it is seldom discussed, although this needs to be done for essentially all QM/MM and MD simulations employing QM charges calculated for QM systems with link atoms.

We have selected to follow this procedure:


The procedure is illustrated in **Table 2**. It is fully automatic, except that a possible integer charge outside the QM system needs to be specified. This way, charge transfer within the QM system is allowed, meaning that none of the QM residues has an integer charge. The modification of the charges on the CL atoms is also kept to a minimum. However, alternative approaches are conceivable, e.g., dividing the remaining charge equally over all link atoms, after the charges of the non-HL QM atoms have been set. Thus, the CL charges are somewhat ambiguous. Importantly, the procedure keeps all charges of the QM system equal between $E\_{\rm MM12}^{\rm CL}$ and $E\_{\rm MM1}^{\rm HL}$, except for the HL/CL atoms, allowing for a proper cancelation of those electrostatic terms in the ME approach, in analogy with the VLAC correction for the subtractive scheme.
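For the simplest case of a single link atom, the charge adaptation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `adapt_cl_charges`, the dictionary layout, and all numerical charges are hypothetical placeholders (they are not the values of Table 2), and systems with several link atoms would need an additional rule for distributing the remaining charge.

```python
def adapt_cl_charges(qm_charges, link_atom, outside_charges, total_charge=0.0):
    """Build charges for the full (CL) system from charges of the truncated system.

    qm_charges:      charges fitted for the truncated QM system (link atom as HL)
    link_atom:       name of the link atom whose charge is adapted
    outside_charges: charges of the atoms outside the QM system
    total_charge:    integer net charge required for the full system
    """
    # Keep every QM charge unchanged, except that of the link atom.
    full = {atom: q for atom, q in qm_charges.items() if atom != link_atom}
    full.update(outside_charges)
    # The CL charge absorbs whatever is needed to reach the required net charge.
    full[link_atom] = total_charge - sum(full.values())
    return full

# Placeholder charges for a methanol-like QM system; C2 is the link atom.
qm = {"C1": 0.12, "H11": 0.03, "H12": 0.03, "O": -0.65, "H": 0.41, "C2": 0.06}
# Placeholder charges for the terminal methyl hydrogens outside the QM system.
outside = {"H21": 0.04, "H22": 0.04, "H23": 0.04}

full = adapt_cl_charges(qm, "C2", outside)
print(full["C2"], sum(full.values()))
```

All non-link QM charges are identical in the two charge sets, so their mutual electrostatic interactions cancel between the two MM calculations, as required.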

#### Electrostatic Link-Atom Corrections

In the electrostatic-embedding variant of ONIOM (Vreven et al., 2006), as well as in our QTCP approach (QM/MM thermodynamic cycle perturbation; Rod and Ryde, 2005), further attempts are made to correct errors introduced by the link atoms. In the QM calculations, the HL atoms are of the wrong element, located at the incorrect position (compared to the CL atom) and they may make Coulombic interactions with point charges of nearby atoms that are not included or are scaled down in normal MM calculations (viz. interactions with the M2, M3, and M4 atoms). These errors can be compensated by calculating exactly the same interactions in the $E\_{\rm MM1}^{\rm HL}$ term and replacing them with the corresponding interactions in the $E\_{\rm MM12}^{\rm CL}$ term.

TABLE 2 | Illustration of the method to determine charges for methanol and ethanol (shown in Figure 1 with atom names; H1 is H11 and H12; H2 is H21, H22, and H23).

The methanol charges were obtained from a QM RESP calculation. The ethanol Set1 charges were obtained in the same way. For Set2, charges of the C1, H1, O, and H atoms were taken from methanol (marked in bold face) and the charges of H2 were taken from Set1 (marked in bold face and italics). Finally, the charge on C2 was determined to give a vanishing net charge.

In practice, this is accomplished by using QM charges for the QM system in both MM calculations and including the same point-charge model of the surroundings in the $E\_{\rm MM1}^{\rm HL}$ term. We call this approach electrostatic link-atom correction (ELAC) and it gives the following energy function:

$$E\_{\rm QM/MM}^{\rm sub,ELAC} = E\_{\rm QM1+ptch2}^{\rm HL} + E\_{\rm MM12}^{\rm CL} - E\_{\rm MM1+ptch2}^{\rm HL} \tag{8}$$

For these calculations, we used the same QM charges for the QM system and the same MM charges for the MM system as described for the mechanical-embedding calculations.

#### Bonded Link-Atom Corrections

Finally, the subtractive scheme allows for a third type of link-atom corrections, viz. for the bonded terms. It is likely that a HL atom will give rise to slightly different bonded terms than the corresponding CL atom, e.g., smaller XL–Q1–Q2 ideal angles. Again, we may try to use the $E\_{\rm MM1}^{\rm HL}$ calculation to correct for these errors (i.e., so that the HL bonded terms would cancel between the $E\_{\rm QM1+ptch2}^{\rm HL}$ and $E\_{\rm MM1}^{\rm HL}$ terms and the corresponding CL result in $E\_{\rm MM12}^{\rm CL}$ would remain, instead of the exact cancelation of these terms between $E\_{\rm MM1}^{\rm HL}$ and $E\_{\rm MM12}^{\rm CL}$ as in both the standard additive and subtractive approaches). This is done by using different parameters for the Q2–Q1–HL and Q2–Q1–CL angles and the Q3–Q2–Q1–HL and Q3–Q2–Q1–CL dihedral parameters (and possibly also for the Q1–HL and Q1–CL bond parameters). We will call this bonded link-atom corrections (BLAC).

Of course, the parameters need to be accurate for there to be a hope of any improved results. In this paper, we tried two approaches. In the first (BLAC1), we used standard AMBER parameters for both the CL and HL terms, the latter typically coming from the GAFF (Wang et al., 2004) force field. These parameters involved also the bonded Q1-XL terms.

In the second approach (BLAC2), we instead performed a parametrisation of both the full and truncated systems, based on a QM frequency calculation on each system. The bonded parameters were then extracted with the Seminario approach (Seminario, 1996), using the Hess2FF program (Nilsson et al., 2003; Hu and Ryde, 2011). We used the same atom types as in the AMBER files in BLAC1. This means that in principle all bonded parameters will differ in the two MM calculations, not only those involving the XL atom. Finally, BLAC1J and BLAC2J were obtained from BLAC1 and BLAC2 by changing the Q1–HL force constants according to Equation (5), keeping everything else the same.

#### Test Systems

All QM calculations were carried out using the Turbomole software (versions 7.1 and 7.2; Ahlrichs et al., 1989; Furche et al., 2014). They were performed using the TPSS (Tao et al., 2003) functional in combination with the def2-SV(P) (Schäfer et al., 1992) basis set, including empirical dispersion corrections with the DFT-D3 approach (Grimme et al., 2010) with Becke–Johnson damping (Grimme et al., 2011), as implemented in Turbomole. The MM calculations were performed with the AMBER ff14SB (Maier et al., 2015) force field for protein residues, GAFF (Wang et al., 2004) for non-protein molecules and TIP3P (Jorgensen et al., 1983) for water. In all QM/MM calculations, the MM system was kept fixed to simplify the interpretation of the results.

The various approaches were tested on three systems. The first was an isolated ethanol molecule. The reference system was ethanol, optimized with QM. In the QM/MM calculations, the QM system was methanol with C2 converted to a HL atom (**Figure 1**). The terminal methyl group was in the MM system.

The second test system was sulfite oxidase. The calculations were taken from our recent study of this enzyme (Caldararu et al., 2018). The active site contains a Mo ion coordinated to a molybdopterin (MPT) molecule, as well as a Cys residue and two oxo groups. All these groups were included in the QM system (Cys modeled as CH<sub>3</sub>S<sup>−</sup>), as well as the sulfite substrate [note that this QM system is smaller than in our previous study (Caldararu et al., 2018), in which nine additional residues and five water molecules were also included]. In this paper we compared the effects of either including the full MPT residue (in the reduced and protonated form, called MPH in our previous paper) or truncating it to a dimethyldithiolene molecule [DMDT, (CH<sub>3</sub>CS)<sub>2</sub><sup>2−</sup>; both shown in **Figure 2**]. Thus, the QM/MM calculations with the full MPT molecule were the reference structures and QM/MM calculations with DMDT in the QM system were run to compare the performance of the various QM/MM variants.

In DMDT, two of the CL atoms are covalently bonded. This gives a complication when automatically setting up the prmtop file for the truncated system: This bond needs to be explicitly removed from the file (together with the corresponding angles and dihedrals); otherwise, spurious MM forces will cause the calculations to crash with distorted structures. Normally, we discourage the use of covalently connected CL atoms, but in this study, it provides a hard test of the ability of the various methods to provide junction corrections.

Five states were studied in the reaction: The Mo<sup>VI</sup>=O + SO<sub>3</sub><sup>2−</sup> reactant state (RS) with sulfite in the second coordination sphere of Mo, the Mo<sup>IV</sup>–SO<sub>4</sub><sup>2−</sup> intermediate (Im) with sulfate coordinated to Mo, the Mo<sup>IV</sup> + SO<sub>4</sub><sup>2−</sup> product state (PS) with sulfate in the second sphere of Mo (shown in **Figure 2**), as well as the two transition states (TS1 and TS2) connecting these three states. The transition states were obtained from potential-energy scans along the S–O1 (TS1, 2.0 Å) and Mo–O1 (TS2, 3.7 Å) reaction coordinates. PS was also obtained with a restraint in the Mo–O2 distance of 3.85 Å, taken from calculations with an appreciably larger QM system (Caldararu et al., 2018). Besides the QM system, the setup was identical to that in our previous study (Caldararu et al., 2018).

The third test system was the conversion of oxophlorin to verdohaem by haem oxygenase. Again, the calculations were taken from a recent study of this enzyme (Alavi et al., 2017). The QM system consisted of the oxophlorin group (an oxidized haem molecule), as well as O<sub>2</sub> and His (truncated to imidazole) as axial ligands of the Fe ion [again, this QM system is smaller than in our previous study (Alavi et al., 2017), in which two additional residues and six water molecules were also included]. We compared the effects of including either the full oxophlorin ring (OXF) with its eight peripheral propionate, vinyl and methyl substituents or truncating all substituents to HL atoms (OXT; both shown in **Figure 3**). Thus, the calculations with the full OXF were the reference structures and QM/MM calculations with OXT were run to compare the performance of the various QM/MM variants.

Nine states were studied in the reaction as is shown in **Figure 3**: The Fe–O<sub>2</sub> reactant state (**1**), the first intermediate, in which O<sub>2</sub> is bridging between Fe and one of the OXF carbon atoms (C4B; **2**), the second intermediate (**3**) in which the O1–O2 bond is cleaved and the C4B–O2 bond is formed, the third intermediate (**4**) in which the O2 atom has formed a bond with the C1C OXF atom, giving a four-membered C4B–O2–C1B–CMC ring, the verdohaem product (**5**), in which CO has dissociated, but the C4B–O2–C1B bonds are kept, as well as the four connecting transition states (**T1**–**T4**). The transition states were obtained from potential-energy scans along the O2–C4B (**T1**, 1.8 Å), O1–O2 (**T2**, 1.6 Å), O2–C1B (**T3**, 1.9 Å), and the C4B–C (**T4**, 1.8 Å). Besides the QM system, the setup was identical to that in our previous study (Alavi et al., 2017).

## RESULTS AND DISCUSSION

In this paper, we clarify the difference between additive and subtractive variants of the QM/MM approach. In principle, the two approaches can be tuned to give exactly the same results, as the additive approach can freely pick almost any energy in the $E\_{\rm MM2-1}$ term in Equation (2). However, in a typical implementation, the primary difference between the two approaches is that the additive scheme employs only a single MM term for each interaction, whereas in the subtractive scheme, there are two MM terms for the QM system, one in $E\_{\rm MM12}$ and one in $E\_{\rm MM1}$. Depending on the implementation, these duplicate terms can either be selected to be identical (and therefore canceling, which would give exactly the same results as in the additive scheme) or they can be different, in particular with the aim of correcting the errors introduced by the HL link atoms in the QM system. We have investigated three different levels of link-atom corrections, involving van der Waals terms (VLAC), electrostatic terms (ELAC), and bonded terms (BLAC). In the following, we test the performance of the various correction schemes for three different systems: ethanol, sulfite oxidase, and haem oxygenase. The results are described in three separate sections. For each system, we study six different approaches: the additive scheme (Add, i.e., without any link-atom corrections), the subtractive scheme with van der Waals (VLAC), electrostatic (ELAC), and bonded link-atom corrections (BLAC), the latter in two variants (BLAC1 or BLAC2 and BLAC2J) and mechanical embedding (ME, using a subtractive scheme and VLAC). ELAC and the three variants of BLAC always also include VLAC. Therefore, we use the abbreviation Sub for the subtractive scheme involving only VLAC (emphasizing that this is the standard approach for the subtractive scheme in ComQum). For ethanol, we tested a few additional combinations.

#### Ethanol

We first tested all methods on a very simple model system, viz. ethanol, in which methanol was used as the QM system in QM/MM calculations and the results were compared to a QM calculation on the full ethanol molecule. Nine different calculations were run for this system: Add, Sub, two variants of ELAC, BLAC1, BLAC1J, BLAC2, BLAC2J, and ME, as well as ELAC combined with BLAC2J. The two variants of ELAC used the same ESP charges for the QM system. However, for the MM system, they used either the ESP charges of ethanol (ELAC2) or the ESP charges for methanol for all methanol atoms except HL, the ethanol ESP charges for the H21–H23 atoms on the terminal methyl group, whereas the charge on the CL atom was adapted to give a vanishing net charge (ELAC1). The latter approach is similar to what is used for the two enzyme systems, for which ESP charges for the full MM system are not available. The two sets of charges are shown in **Table 2**.

The results of the QM/MM calculations on ethanol are collected in **Table 3**. The first column gives the total QM/MM energies, relative to the Add calculations. This is only to illustrate that the QM/MM energy depends on the MM force field and therefore gives different results for the various calculations.

The second column gives the QM energy of the QM/MM optimized structure of ethanol. It can be seen that BLAC1J and ME give the lowest energy, 1.7 kJ/mol above the QM minimum. On the other hand, ELAC+BLAC1 gives the highest energy, 3.3 kJ/mol. Thus, the variation in energies is small, showing that all methods give excellent structures. A MM minimisation with the GAFF force field gives a slightly higher energy, 3.8 kJ/mol (row MM in **Table 3**).

The third column shows the root-mean-squared deviation (RMSD) of the coordinates from the optimized QM structure of ethanol. ELAC2 gives the lowest RMSD (0.012 Å), followed by the two BLAC1 variants (0.013 Å), as well as Add, Sub and ME (0.014–0.016 Å). The three variants involving BLAC2 give slightly higher values, 0.036–0.039 Å.

Finally, the three last columns give the mean absolute deviation (MAD) for the 8 bonds, the 13 angles and the 12 dihedrals in the molecule. For the bonds, the three BLAC2 variants give minimal errors (0.001–0.002 Å), whereas the other methods give slightly higher errors, 0.005–0.006 Å. The same applies for the angles and the dihedrals: The three BLAC2 variants are still the best (0.8° and 0.4°), whereas the other methods give slightly larger errors, 0.8–1.2° and 0.5–0.7°.
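The quality measures used in this comparison are simple to compute. The following Python sketch shows one way to obtain the RMSD of the coordinates and the MAD of the bond lengths; it assumes that the two structures are already superimposed and have the same atom order, and the three-atom coordinates and the bond list are made-up placeholders, not data from this study.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation (Å) between two pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    sq_sum = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq_sum / len(coords_a))

def bond_lengths(coords, bonds):
    """Distances (Å) for a list of (i, j) atom-index pairs."""
    return [math.dist(coords[i], coords[j]) for i, j in bonds]

def mad(values, reference):
    """Mean absolute deviation between two equally long lists of values."""
    return sum(abs(v - r) for v, r in zip(values, reference)) / len(values)

# Placeholder structures (Å): a reference and a slightly distorted copy.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.0, 0.0)]
optimized = [(0.0, 0.0, 0.0), (1.52, 0.0, 0.0), (1.5, 1.0, 0.0)]
bonds = [(0, 1), (1, 2)]

structure_rmsd = rmsd(reference, optimized)
mad_bond = mad(bond_lengths(optimized, bonds), bond_lengths(reference, bonds))
print(f"RMSD = {structure_rmsd:.4f} Å, MAD(bond) = {mad_bond:.4f} Å")
```

The same `mad` function applies to angles and dihedrals once those internal coordinates have been evaluated; for dihedrals, the periodicity of the angle needs to be taken into account before the differences are formed.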

In conclusion, the test calculations show that the BLAC approaches give the best results, but the outcome depends on the force field used. The best structures are obtained with the Hess2FF force field, which is tailored for the molecule and the QM method. ELAC sometimes improves the results, sometimes not. Sub and Add give similar results and the differences among the various methods are minimal for this small test molecule.

#### Sulfite Oxidase

Next, we studied a more realistic enzyme system, viz. sulfite oxidase. Based on our recent QM-cluster (Van Severen et al., 2014) and QM/MM (Caldararu et al., 2018) studies, we considered the S→OMo mechanism, in which the S atom of the sulfite substrate attacks the equatorial oxo group of Mo<sup>VI</sup>, directly forming a Mo<sup>IV</sup>–sulfate intermediate (Im), via a first transition state TS1 (**Figure 2**). The sulfate product then dissociates into the second coordination sphere of the Mo ion via a second transition state (TS2).

For all five states, we tested six different methods: Add, Sub, ELAC, BLAC2, BLAC2J, and ME. For all methods, we compare the QM/MM results obtained with the small DMDT model of the molybdopterin ligand with those obtained with a standard (subtractive with VLAC) QM/MM calculation with the full MPT ligand (**Figure 2**). Initially, we tested also BLAC1, but it failed for all systems. The RS state with ME could be obtained only if the S<sub>Sub</sub>–O2 distance was fixed at 2.43 Å (taken from the reference structure).

**Table 4** shows the RMSDs of the small QM system between the MPT and DMDT calculations. It can be seen that all calculations give similar results, with a RMSD of 0.02–0.10 Å. The RMSD is typically lowest for the Im and TS1 states.



E<sub>QM/MM</sub> is the total QM/MM energy (kJ/mol), E<sub>EtOH</sub> is the QM energy of the whole ethanol molecule (kJ/mol), RMSD is the RMS difference of the coordinates, MAD<sub>bond</sub> (Å), MAD<sub>angle</sub> (°), and MAD<sub>dihed</sub> (°) are the mean absolute deviation compared to the ethanol molecule optimized by QM. The best value in each column is marked in bold face.

TABLE 4 | RMS deviations (Å) of the various QM systems for sulfite oxidase, compared to the QM/MM structures optimized with the full MPT ligand.


The smallest RMSD is always found for Add and Sub methods, with an average RMSD of 0.05 Å for the five states. However, ELAC also gives a similar average, whereas that of the other three methods is somewhat larger, 0.06 and 0.08 Å for BLAC and ME, respectively. This increase in the RMSD is not caused by a single structure, but is seen for all structures.

In Table S1, the Mo–ligand and S<sub>Sub</sub>–O distances for all structures are listed. It can be seen that these key distances are well preserved in the truncated calculations. The best results are again obtained for Add and Sub, for which the average difference for the nine distances and five sets of structures is only 0.008 Å. The maximum deviation is 0.14 Å for Add and 0.16 Å for Sub, in both cases obtained for the non-bonded S<sub>Sub</sub>–O2 distance in the RS structure. Besides this distance, the largest deviation is only 0.02–0.03 Å. The results for the other four sets of calculations were slightly worse, with an average error of 0.017 Å for ME and 0.009 Å for the other three approaches. The maximum error is 0.14–0.17 Å, again for the S<sub>Sub</sub>–O2 distance in the RS structure, except for ME (Mo–O2 distance of the Im state).

**Figure 4** shows the energies (relative to the RS state) in the seven sets of calculations. It can be seen that the Add, Sub, BLAC2, and BLAC2J methods give very similar results. In fact, the two sets of BLAC2 energies differ by only 0.1 kJ/mol for all states and these methods differ by 0–2 kJ/mol from Sub, and slightly more from Add (2 kJ/mol on average). The Add and Sub results agree within 2 kJ/mol. All these curves follow quite closely the reference with a systematic underestimation that increases from 6–8 kJ/mol for TS1 to 11–13 kJ/mol for PS. On average, all methods give a MAD of 10 kJ/mol, lowest for BLAC2J.
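The energy comparison can be sketched as follows: each profile is first expressed relative to its own RS state, and the deviation from the reference profile is then summarized as a MAD and a maximum error. The function names and all energies in this Python sketch are arbitrary placeholders (not values from **Figure 4**), and excluding the RS state itself from the average is an assumption made here.

```python
def relative_energies(energies, zero_state="RS"):
    """Express a dict of state energies relative to one chosen state."""
    zero = energies[zero_state]
    return {state: e - zero for state, e in energies.items()}

def error_summary(method, reference, zero_state="RS"):
    """Return (MAD, state with largest error, that error) for relative energies."""
    rel_m = relative_energies(method, zero_state)
    rel_r = relative_energies(reference, zero_state)
    # The zero state has no error by construction, so it is left out (assumption).
    errors = {s: rel_m[s] - rel_r[s] for s in rel_r if s != zero_state}
    mad = sum(abs(e) for e in errors.values()) / len(errors)
    worst = max(errors, key=lambda s: abs(errors[s]))
    return mad, worst, errors[worst]

# Arbitrary placeholder profiles (kJ/mol).
reference = {"RS": 0.0, "TS1": 60.0, "Im": -20.0, "TS2": 40.0, "PS": 10.0}
truncated = {"RS": 5.0, "TS1": 58.0, "Im": -27.0, "TS2": 35.0, "PS": 3.0}

mad_value, worst_state, worst_error = error_summary(truncated, reference)
print(f"MAD = {mad_value:.2f} kJ/mol, largest error {worst_error:.1f} kJ/mol for {worst_state}")
```

Because every profile is shifted to its own RS energy, an error in the RS state translates the whole curve, which is why the choice of zero state matters when MADs of different methods are compared.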

On the other hand, ME and ELAC give appreciably larger errors, with MADs of 29 and 32 kJ/mol, respectively and maximum errors of 47 and 65 kJ/mol. For ME, the problem is related to the failure to find the RS state—different restraints for this state may translate the curve upwards and therefore reduce the error, but it always remains worse than the other four methods.

Thus, we can conclude that all methods give reasonable structures for sulfite oxidase, although ME has problems with one of the states. However, ME and ELAC give quite large errors for the energies. In general, Add and Sub seem to give the best (and similar) results.

#### Haem Oxygenase

The third test case is haem oxygenase, for which we studied the conversion of oxophlorin (OXF) to verdohaem (Alavi et al., 2017). As is shown in **Figure 3**, this involves five states and four transition states. It starts from the Fe<sup>III</sup>–OXF–O<sub>2</sub><sup>−</sup> complex in the doublet state (**1**), which has one unpaired electron on each of the three moieties, in analogy with previous studies (Alavi et al., 2017; Gheidi et al., 2017). In the first step of the reaction, the terminal oxygen atom in O<sub>2</sub> (O2) reacts with the C4B atom of OXF, forming a bridging intermediate (**2**). In the next step, the O–O bond is cleaved, giving intermediate **3**. Next, the O2 atom reacts with another atom in the OXF ring (C1C), forming a four-membered ring in intermediate **4**. Finally, the C4B–CMC and C1C–CMC bonds are cleaved, and CO dissociates, giving rise to verdohaem (**5**). These states are separated by four transition states (**T1–T4**). We study the effect of moving the side chains of the OXF ring from the QM to the MM system.

This test case is challenging for at least two reasons. First, the electronic structure is more complicated than for the other two test cases, with several antiferromagnetically coupled open-shell moieties. Second, the OXF ring, for which we test the effect of truncation, is involved in the reaction. In fact, the C1C and C4B atoms are both only two bonds away from the HL atoms in the truncated OXT model.

As for sulfite oxidase, we tested six different methods with the truncated OXT model, Add, Sub, ELAC, BLAC1, BLAC2J, and ME. In contrast to sulfite oxidase, the simplest BLAC1 approach, with standard GAFF parameters for both OXF and OXT worked well. Considering the similar results for BLAC2 and BLAC2J for sulfite oxidase, we did not test BLAC2.

As for sulfite oxidase, we used the QM/MM calculations (subtractive with VLAC) with the full OXF ligand as the reference and studied how the various QM/MM calculations with OXT reproduce these calculations in terms of the RMSD for the entire (truncated) QM system, key distances and energies. The RMSDs of the nine different systems are shown in **Table 5**. It can be seen that most methods give similar results with an average RMSD of 0.06 Å. Add gives the lowest RMSD for most systems, but that of Sub is very similar and sometimes lower. The largest RMSD (0.09–0.10 Å) is typically found for state **3**, for which ME actually gives the best results and the latter method also gives the lowest maximum RMSD.

However, for the BLAC1 method, the RMSD is slightly larger, with an average of 0.07 Å and a maximum of 0.11 Å (still for **3**). For BLAC2J, the results are even worse (0.08 Å on average and a maximum of 0.13 Å for **T3**). In particular, BLAC2J failed to converge to any reasonable structure for the product (**5**). This most likely reflects that the force fields, especially that of BLAC2J, were determined for the starting structure **1** (note that BLAC2J gives the second-best structure for that state) and were then used unchanged for the other states. This is clearly suboptimal for structures later in the reaction mechanism. Of course, we could have determined new force fields for each intermediate and transition state in the mechanism, but this would have required much larger computational and manual effort. Moreover, it would have given problems in the calculated energies, because the force field would be different for every state, making the energies not comparable. Thus, we do not recommend the BLAC approaches if the reaction is within three bonds of the XL atoms.

TABLE 5 | RMS deviations (Å) of the various QM systems of haem oxygenase, compared to the QM/MM structures optimized with the full OXF ligand.

These trends in the RMSD are reflected also in the individual distances in the complexes. In Table S2, we examine the six Fe– N/O distances, as well as distances involving the reacting O2, C1C, C4B, and CMC atoms. It can be seen that all methods give similar MAD and maximum deviations from the reference distances (0.02–0.03 and 0.29–0.32 Å, respectively). However, Add and Sub still give the best results on average and BLAC2J the worst. All methods give large errors for the Fe–NHis distances in the **3, T4** and **4** states (0.26–0.32 Å too short). All methods also give a large error for the O2–C1C distance in state **3** (0.18–0.27 Å too short).

In **Figure 5**, the relative energies of the various states are shown. It can be seen that the Add and Sub methods still give similar results, with a MAD of only 4 kJ/mol. However, the Sub and BLAC1 methods give even more similar results with a MAD of only 1.4 kJ/mol. This reflects that GAFF parameters for OXT and OXF differ only for a few bonds, angles, and dihedrals around the periphery of the ring, which apparently do not affect the results much. These three methods also reproduce the reference calculations fairly well, with MADs of 14–15 kJ/mol, with Sub giving the smallest error. The maximum error, 24–25 kJ/mol, is obtained for **T2**.

BLAC2J gives somewhat worse energies (MAD = 30 kJ/mol, with a maximum error of 47 kJ/mol for both **3** and **4**), especially for the later states in the reaction, reflecting that the force field is worse for these states. As for sulfite oxidase, the results for ELAC and ME are much worse, with MADs of 96–100 kJ/mol and maximum errors of 156–184 kJ/mol.

Finally, we checked also the electronic structures in the various calculations. However, they showed only a small variation between the various methods. For example, for the spin densities on Fe (shown in Table S3), Add, Sub, BLAC1 and BLAC2J give the same MAD from the reference calculations, 0.06 e, and the MADs of ELAC and ME are only slightly larger, 0.07 and 0.08 e, respectively. The spin density is lower for all calculations with OXT, except for the **1** state and the difference is largest for **3** and **T3** (0.09–0.12 e).

In conclusion, Sub and Add again give the best results, but those of BLAC1 are also good. Again, ELAC and ME give large errors in the energies and BLAC2J has problems with geometries of the later states in the reaction.

## CONCLUSIONS

In this paper, we have tried to clarify similarities and differences between the subtractive and additive QM/MM schemes. In our view, the primary difference is that the subtractive scheme allows for an attempt to correct for errors introduced by the link atoms. This correction can be introduced for three different types of interactions: van der Waals, electrostatic, and bonded interactions. Different software implements different corrections and it is also partly up to the user to define the level of correction (when setting up the force field for the QM system). For example, our ComQum software automatically implements the van der Waals correction, whereas ONIOM with electrostatic embedding implements also the electrostatic correction. Both programs can implement the bonded correction, but typically do not do so.

Of course, the corrections come at some extra cost. For the van der Waals correction (VLAC), the cost is minimal: It requires van der Waals parameters of the HL atoms, which can almost always be taken from standard parameters for hydrogen atoms in the MM force field used (using the rules for atom types of the force field). Besides these parameters, the calculations can be automatically set up from a prmtop file for the full system with no extra effort, as in the ComQum implementation.

For the electrostatic link-atom corrections (ELAC), QM charges of the QM region are required, both with HL and CL atoms. This can be obtained from most QM software at a small extra cost and be automatized. Moreover, such charges are normally already available, because most QM/MM studies start with a MD equilibration of the full system (including solvent), for which a proper charge model of the QM system is needed. However, the charges on the CL atoms are ambiguous and ESP charges of buried atoms are poorly defined, which becomes a serious problem for large QM regions. Therefore, we have not seen any advantage for the electrostatic correction, at least not when the link atoms are rather close to the reactive atoms (Hu et al., 2011b). Moreover, the test calculations in this article indicate that ELAC can give rise to problems with energies in a reaction sequence. ELAC is implemented in the ONIOM software, but in practice it gives often severe convergence problems and is therefore seldom used. Instead, alternative approaches have been implemented, based on iterative calculations with mechanical embedding and updated charges (Kawatsu et al., 2011; Dutta and Mishra, 2014; Wójcik et al., 2014). It could therefore be recommended that ONIOM implements electrostatic embedding with only VLAC, which is a very stable approach in ComQum.

For the bonded link-atom correction (BLAC), an accurate MM force field for the QM region is required. Nowadays, general MM force fields for organic and drug-like molecules often provide the required parameters. However, it is unclear whether these (together with the MM parameters of the full system) are accurate enough to give any advantage of this approach. The alternative is to make a tailored force field for the QM region, both when truncated and in the full protein. In this article, we have tested both approaches. For the simple ethanol test case, the best results were actually obtained with BLAC2, i.e., with the optimized Hess2FF force field. However, for the two enzyme systems, the results were worse, especially if the reactive site is close to the link atoms. Therefore, we cannot recommend BLAC for general use.

Besides the parameters needed for these link-atom corrections, there is no difference in the requirement of MM parameters for the subtractive and additive QM/MM schemes; without any link-atom corrections, the two schemes should give identical results if correctly implemented and require exactly the same MM parameters. However, in practice there may be differences because the subtractive scheme is typically based on standard (general-purpose) MM software, whereas the additive scheme is based on software tailored for QM/MM. In particular, most MM software refuses to run if any MM parameters are missing. Therefore, subtractive calculations normally require a full set of MM parameters, also for the QM region. However, these can be dummy (zeroed) parameters, because they cancel in the QM/MM calculations. Moreover, the additive scheme also requires these parameters for an initial MD equilibration of the full solvated system. The same parameters can normally be used throughout a reaction mechanism (again because these MM terms cancel in Equation 4).

Thus, our conclusion is that intrinsically, the subtractive and additive QM/MM schemes are equivalent if properly implemented. The subtractive scheme allows the introduction of various link-atom corrections, at the expense of requiring more MM parameters. For van der Waals and electrostatic corrections, the extra cost is minimal and fully automatic, whereas for the bonded corrections, significant extra effort may be needed. Of course, the same corrections may be implemented also in an additive scheme, by picking proper terms, but this goes outside a standard implementation (and also a strict definition) of the additive scheme and therefore should be thoroughly specified.

In practice, the subtractive scheme is easier to implement and maintain (standard QM and MM software are used). On the other hand, the additive scheme may be somewhat easier to set up and can be tailored for QM/MM calculations. Moreover, if (a major part of) the MM system is fixed, calculations can be somewhat sped up by not calculating the MM energy and forces for the fixed atoms. In our test calculations, the additive and subtractive calculations with VLAC, but no other corrections (i.e., Add and Sub) give closely similar results, showing that the VLAC has only a minor influence on the results. Thus, both approaches can be recommended for QM/MM calculations.

We have also included mechanical embedding (within a subtractive scheme with VLAC) in the comparison. However, this approach gave rather poor structures, especially for sulfite oxidase. Moreover, the energies were quite poor, although this may be partly attributed to the fact that the reference energies employed electrostatic embedding and not mechanical embedding. For a fairer comparison, reference energies obtained with very large QM systems should be used, as in our previous study (Hu et al., 2011b).

Finally, we want to emphasize the importance of specifying exactly what is done in the QM/MM calculations, owing to the many different implementations. Obviously, it is not enough to say that a subtractive or additive scheme is used. Instead, for a subtractive scheme, it must be specified what type of link-atom corrections is applied (van der Waals, electrostatic, or bonded). In addition, the treatment of QM–MM electrostatics (mechanical or electrostatic embedding) must be specified, together with a detailed account of what charges are included in the point-charge model, whether any charge redistribution scheme is employed, and how charges on CL atoms are obtained. Finally, the treatment of link atoms needs to be specified, as well as the relation between the coordinates of the HL and CL atoms.

# AUTHOR CONTRIBUTIONS

LC: Performed most of the calculations; UR: Did some calculations, designed the project, and wrote the article.

# ACKNOWLEDGMENTS

This investigation has been supported by grants from the Swedish research council (project 2014-5540), from Knut and Alice Wallenberg Foundation (KAW 2013.0022) and from COST through Action CM1305 (ECOSTBio). The computations were performed on computer resources provided by the Swedish National Infrastructure for Computing (SNIC) at Lunarc at Lund University.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00089/full#supplementary-material

#### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cao and Ryde. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Automated Fragmentation QM/MM Calculation of NMR Chemical Shifts for Protein-Ligand Complexes

Xinsheng Jin<sup>1</sup> , Tong Zhu1,2, John Z. H. Zhang1,2,3 and Xiao He1,2,4 \*

<sup>1</sup> State Key Laboratory of Precision Spectroscopy, School of Chemistry and Molecular Engineering, Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, East China Normal University, Shanghai, China, <sup>2</sup> NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai, China, <sup>3</sup> Department of Chemistry, New York University, New York, NY, United States, <sup>4</sup> National Engineering Research Centre for Nanotechnology, Shanghai, China

In this study, the automated fragmentation quantum mechanics/molecular mechanics (AF-QM/MM) method was applied for NMR chemical shift calculations of protein-ligand complexes. In the AF-QM/MM approach, the protein binding pocket is automatically divided into capped fragments (within ∼200 atoms) for density functional theory (DFT) calculations of NMR chemical shifts. Meanwhile, the solvent effect was also included using the Poisson-Boltzmann (PB) model, which properly accounts for the electrostatic polarization effect from the solvent for protein-ligand complexes. The NMR chemical shifts of the neocarzinostatin (NCS)-chromophore binding complex calculated by AF-QM/MM accurately reproduce the large-sized system results. The <sup>1</sup>H chemical shift perturbations (CSP) between apo-NCS and holo-NCS predicted by AF-QM/MM are also in excellent agreement with experimental results. Furthermore, the DFT calculated chemical shifts of the chromophore and residues in the NCS binding pocket can be utilized as molecular probes to identify the correct ligand binding conformation. By combining the CSP of the atoms in the binding pocket with the Glide scoring function, the new scoring function can accurately distinguish the native ligand pose from decoy structures. Therefore, the AF-QM/MM approach provides an accurate and efficient platform for protein-ligand binding structure prediction based on NMR derived information.

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Ahmet Altun, Max-Planck-Institut für Kohlenforschung, Germany Giovanni La Penna, Consiglio Nazionale Delle Ricerche (CNR), Italy

> \*Correspondence: Xiao He xiaohe@phy.ecnu.edu.cn

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 30 January 2018 Accepted: 16 April 2018 Published: 08 May 2018

#### Citation:

Jin X, Zhu T, Zhang JZH and He X (2018) Automated Fragmentation QM/MM Calculation of NMR Chemical Shifts for Protein-Ligand Complexes. Front. Chem. 6:150. doi: 10.3389/fchem.2018.00150

Keywords: AF-QM/MM, NMR chemical shift, protein-ligand binding, scoring function, structure prediction

# INTRODUCTION

Structure-based computational methods are useful tools for analyzing the binding modes and affinities of protein-ligand complexes (Grinter and Zou, 2014). With the development of X-ray crystallography and nuclear magnetic resonance (NMR) technology, more than 100,000 high-resolution three-dimensional protein structures have been determined, which is helpful for finding lead compounds and therapeutic targets (Ferreira et al., 2015). Compared with experimental methods, computational approaches such as molecular docking are fast and efficient routes to drug discovery. In molecular docking programs, scoring functions are used to approximate the binding free energy and hence to rank the simulated decoy poses (Sliwoski et al., 2014). Most scoring functions are roughly derived from force-field-based (Morris et al., 1998; Englebienne and Moitessier, 2009), empirical (Eldridge et al., 1997; Murray et al., 1998) or knowledge-based potentials (Gohlke et al., 2000; Huang and Zou, 2006). However, because they rely on parameterized functions, these scoring methods are usually not accurate enough to differentiate the experimental structure from the docked decoy structures, and the rankings from different software suites may sometimes be inconsistent (Śledź and Caflisch, 2018). To address this problem, much effort has been devoted to the development of docking methods that introduce experimental structural information or quantum mechanical (QM) calculations for scoring the native and predicted poses (Mohan et al., 2005; Grinter and Zou, 2014; Adeniyi and Soliman, 2017).

The chemical shift is one of the most effective and precise NMR parameters for reflecting the local chemical environment around an atom, and it plays an important role in structure determination and refinement (Zhu et al., 2014; Bratholm and Jensen, 2017). For NMR chemical shift calculations, the empirical chemical shift prediction programs include ShiftS (Xu and Case, 2001; Moon and Case, 2007), ShiftX (Neal et al., 2003), ShiftX2 (Han et al., 2011), CamShift (Robustelli et al., 2010), PROSHIFT (Meiler, 2003; Meiler and Baker, 2003), SHIFTCALC (Williamson and Craven, 2009), ProCS (Christensen et al., 2014; Bratholm and Jensen, 2017), CheShift (Vila et al., 2009; Garay et al., 2014) and Sparta+ (Shen and Bax, 2010). These empirical methods are computationally fast and have been successful in predicting backbone chemical shifts. Because the empirical formulas for these models are derived by fitting to databases of experimental or QM calculated chemical shifts and high-quality structures, these models are not well suited for accurate prediction of NMR chemical shifts for some complex systems such as protein-ligand complexes, nonstandard protein residues or non-canonical base pairs in nucleic acid systems (Swails et al., 2015). Quantum mechanical chemical shift calculations are in principle able to predict the NMR chemical shifts for any complex system (Lodewyk et al., 2012; Hartman and Beran, 2014; Merz, 2014). For protein NMR chemical shift calculations, Cui and Karplus proposed a very effective QM/MM approach (Cui and Karplus, 2000), Gao et al. developed the fragment molecular orbital (FMO) method (Gao et al., 2007, 2010), Exner and coworkers utilized the adjustable density matrix assembler (ADMA) approach (Frank et al., 2012; Victora et al., 2014), Tan and Bettens developed the combined fragmentation method (CFM) (Tan and Bettens, 2013), and He and coworkers developed the automated fragmentation quantum mechanics/molecular mechanics (AF-QM/MM) method (He et al., 2009, 2014; Zhu et al., 2012, 2013, 2014, 2015; Swails et al., 2015; Jin et al., 2016). These fragment-based QM methods have been successfully applied to NMR chemical shift calculations of proteins and nucleic acids.

On the basis that chemical shifts or chemical shift perturbations (CSP) are sensitive to the variations of chemical environment, these parameters are quite suitable for structure determination (Case, 1998; Cavalli et al., 2007; Shen and Bax, 2015). Many NMR-based methods have been developed for prediction of protein-ligand binding modes (Medek et al., 2000; Cioffi et al., 2008; Riedinger et al., 2008; Aguirre et al., 2014). McCoy and Wyss utilized proton CSP data, induced by aromatic ring current effect in the ligands, to locate the ring position of docking structures (McCoy and Wyss, 2002). Recently, Ten Brink et al. compared the experimental and simulated CSPs to verify protein conformational changes and developed the CSP-based docking method (Ten Brink et al., 2015). Merz et al. developed CSP-based scoring functions to determine the binding poses for protein-ligand complexes (Wang et al., 2007; Yu et al., 2017). However, most of these scoring functions only calculate the proton chemical shift on proteins. The scoring functions could be more efficient by taking ligand <sup>1</sup>H chemical shifts into consideration.

In this work, we applied the AF-QM/MM approach for NMR chemical shift calculations on protein-ligand binding complexes. Based on DFT calculations, the <sup>1</sup>H chemical shifts on both protein and ligand are available for structure determination and for improving the scoring functions. In the framework of the AF-QM/MM approach, the ligand is also divided into smaller fragments, which significantly reduces the computational cost of <sup>1</sup>H NMR chemical shift calculations on the large ligand. In this study, the neocarzinostatin (NCS) protein is selected as the test case for the AF-QM/MM method because of its importance in cancer therapy. NCS has experimental chemical shifts for both apo and holo forms (Myers et al., 1988; Mohanty et al., 1994; Schaus et al., 2001; Takashima et al., 2005; Wang and Merz, 2010). Furthermore, by comparing the AF-QM/MM calculated chemical shifts with experimental data, a chemical shift based scoring function was developed to rank the native and predicted protein-ligand binding poses.

This paper is organized as follows: first, a benchmark test was performed using the AF-QM/MM method for NMR chemical shift calculations of the protein-ligand complex. The computed results are compared to the large-sized system NMR chemical shift calculations. Subsequently, AF-QM/MM calculated chemical shifts are compared with the experimental results for both apo and holo NCS structures. Next, the performance of chemical shift based and conventional energy based scoring functions on the rankings of predicted protein-ligand binding poses is discussed. Finally, the hybrid scoring function, which combines the calculated NMR chemical shifts and binding energy, is applied to rank the experimental structure and other docked binding poses.

# COMPUTATIONAL APPROACHES

# Structure Preparation

The X-ray structures of apo and holo NCS were downloaded from the Protein Data Bank (PDB ids: 1NOA and 1NCO, respectively). The experimental chemical shift data of the apo protein and chromophore were obtained from previous studies (Myers et al., 1988; Mohanty et al., 1994). The holo NCS experimental chemical shifts were downloaded from the Biological Magnetic Resonance Data Bank (BMRB entry: 5969). The structure minimization of the protein X-ray structures was performed using the AMBER12 program (Case et al., 2012) with the ff99SB force field. The apo and holo NCS structures were solvated in a truncated octahedral periodic box of TIP3P water molecules with each side at least 10 Å from the nearest solute atom (Jorgensen and Jenson, 1998). After the entire system was neutralized with counter ions, 1,000 steps of the steepest descent algorithm followed by 4,000 steps of the conjugate gradient method were used to remove improper contacts in the system. The force field parameters of the ligand were obtained from the general AMBER force field (GAFF) (Wang et al., 2004) together with the AM1-bond charge correction (AM1-BCC) charge model (Jakalian et al., 2002). The molecular docking was performed using the Glide module in the Schrödinger program (Friesner et al., 2004; Halgren et al., 2004). The scoring function used in this study was Glide XP. The protein structure was fixed when the ligand was docked into the binding site; therefore, the flexibility of the protein was not included during molecular docking. Based on the experimental structure optimized with the molecular force field, 38 docking poses of the ligand predicted by Glide, whose RMSDs range from 1.5 to 10.5 Å with reference to the native position, were selected for subsequent chemical shift calculations at the QM level.

# The AF-QM/MM Method for NMR Chemical Shift Calculation of the Protein-Ligand Complex

In the AF-QM/MM approach (He et al., 2009, 2014; Zhu et al., 2012, 2013; Swails et al., 2015; Jin et al., 2016), the apo protein is divided into individual residues by cutting through the peptide bonds. The number of fragments is the same as the number of residues in the protein. Each fragment contains a core region (one amino acid) and a buffer region, which contains the nearby residues surrounding the core region. Both the core and buffer regions are treated with quantum mechanics (QM), while the residues outside the buffer region are described by molecular mechanics (MM). The details of the definition of the buffer regions are described in our previous work (He et al., 2009; Zhu et al., 2012, 2013). For the holo protein studied in this work, we developed a fragmentation scheme for the ligand and its surrounding protein residues. As shown in **Figure 1**, we also divided the chromophore into three parts by cutting the C-O single bonds. Fragment 1 contains the naphthoate group, fragment 2 includes the enediyne ring, and fragment 3 has the aminosugar group. For each fragment of the ligand (taken as a core region), the remaining part of the ligand and the protein residues surrounding the core region are taken as the buffer region (see **Figure 1**). The distance criteria for selecting the buffer region for each fragment of the ligand are the same as those for each residue in the protein.

In this study, we adopt the following distance-dependent criteria to include residues within the buffer region of each core region of the ligand: (1) if a heavy atom of the residue is less than 3.5 Å away from any atom in the core region, (2) if a hydrogen atom of the residue is less than 3.0 Å away from any atom in the core region. The cutoff enables a sufficient size of the buffer region for the convergence of chemical shift calculations on the core region (Flaig et al., 2012). The remaining residues beyond the buffer region are described by embedding charges to account for the electrostatic field outside the QM region. The protein charges were obtained directly from the ff99SB force field. For assigning the buffer region for each residue (where each residue is defined as the core region), the ligand is treated as a whole molecule (non-fragmented), and the definition of the buffer region follows the same criteria as for the apo protein. The dangling bonds are capped with hydrogen atoms to construct closed-shell fragments.
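The distance criteria above can be illustrated with a minimal Python sketch. It is not the AF-QM/MM implementation: the data layout (a dictionary of residue labels mapped to `(element, coordinates)` pairs) and the function name `select_buffer_residues` are hypothetical, and the sketch assumes that satisfying either criterion is sufficient to include a residue.

```python
from math import dist

HEAVY_CUTOFF = 3.5     # Å, criterion (1): heavy atom of the residue
HYDROGEN_CUTOFF = 3.0  # Å, criterion (2): hydrogen atom of the residue


def select_buffer_residues(core_coords, residues):
    """Return labels of residues that satisfy either distance criterion.

    core_coords: list of (x, y, z) tuples for all atoms of the core region.
    residues: dict mapping a residue label to a list of (element, (x, y, z)).
    Both layouts are illustrative assumptions, not a real file format.
    """
    selected = []
    for label, atoms in residues.items():
        for element, xyz in atoms:
            # Hydrogens use the tighter cutoff, all other elements the heavy-atom one.
            cutoff = HYDROGEN_CUTOFF if element == "H" else HEAVY_CUTOFF
            if any(dist(xyz, core_xyz) < cutoff for core_xyz in core_coords):
                selected.append(label)
                break  # one qualifying atom is enough for this residue
    return selected


# Toy geometry: a single core atom at the origin.
core = [(0.0, 0.0, 0.0)]
toy_residues = {
    "A": [("C", (3.4, 0.0, 0.0))],                          # heavy atom inside 3.5 Å
    "B": [("C", (3.6, 0.0, 0.0)), ("H", (3.2, 0.0, 0.0))],  # both atoms outside their cutoffs
    "C": [("H", (0.0, 2.9, 0.0))],                          # hydrogen inside 3.0 Å
}
buffer_labels = select_buffer_residues(core, toy_residues)
```

In this toy case residues A and C enter the buffer region, while residue B would be represented by embedding charges.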

FIGURE 1 | The fragmentation scheme for the ligand in AF-QM/MM. (A) The ligand is divided into three fragments. Each fragment of the ligand is taken as the core region. The buffer region contains the remaining part of the ligand and the protein residues within a certain distance threshold from the core region (see the text for more details). The core and buffer regions are calculated at the QM level, while the rest of the system is described by embedding charges. (B) The definition of each core region of the chromophore. Fragment 1: the naphthoate group; fragment 2: the enediyne ring part; fragment 3: the aminosugar group. The buffer region has the same color as the core region.

The fragment QM calculations were carried out in parallel at the B3LYP/6-31G∗∗ level. All QM calculations were performed using the Gaussian 09 package (Frisch et al., 2009). Only the NMR isotropic shieldings of the core region atoms were collected from each fragment QM calculation. The <sup>1</sup>H chemical shifts are obtained by referencing to the isotropic shielding of tetramethylsilane (TMS) computed at the same level, which is 31.66 ppm. The implicit solvation model was applied to approximate the solvent effect. The protein charge distribution polarizes the dielectric solution and creates a reaction field that acts back on the solute until equilibrium is reached. The reaction field acting on the solute can be effectively represented by the induced charges on the cavity surface. In this work, the surface charges are calculated by the Poisson-Boltzmann (PB) model using the Delphi program (Rocchia et al., 2001). The set of point charges of the MM environment and on the molecular surface, which represents the reaction field, is used as the background charges in the QM calculation. Because the computational cost of QM chemical shift calculations increases dramatically when multiple configurations are used to account for conformational sampling, in this study the X-ray structure optimized with the molecular force field was taken as a representative configuration of the ensemble-averaged structure.
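As a small illustration of the referencing step, the following Python sketch converts isotropic shieldings into chemical shifts using the usual relation, shift = reference shielding minus computed shielding, and keeps only the core-region atoms of each fragment. The fragment dictionaries, atom labels and shielding values are hypothetical placeholders rather than output of the calculations reported here; only the TMS reference of 31.66 ppm is taken from the text.

```python
TMS_SHIELDING_PPM = 31.66  # 1H reference shielding quoted in the text


def shielding_to_shift(sigma_iso, sigma_ref=TMS_SHIELDING_PPM):
    """Convert an isotropic shielding (ppm) into a chemical shift (ppm)."""
    return sigma_ref - sigma_iso


def collect_core_shifts(fragments):
    """Gather chemical shifts for core-region atoms only.

    Each fragment is a dict with a "core" list of atom labels and a
    "shieldings" dict covering core and buffer atoms (hypothetical layout).
    Buffer-atom shieldings are ignored.
    """
    shifts = {}
    for fragment in fragments:
        for atom in fragment["core"]:
            shifts[atom] = shielding_to_shift(fragment["shieldings"][atom])
    return shifts


# Placeholder shieldings, for illustration only.
example_fragments = [
    {"core": ["H15"], "shieldings": {"H15": 24.50, "HB2": 29.10}},
    {"core": ["H31"], "shieldings": {"H31": 25.00, "H15": 24.90}},
]
example_shifts = collect_core_shifts(example_fragments)
```

Here H15 also appears as a buffer atom of the second fragment, but only its value from the fragment in which it is a core atom is retained.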

#### Scoring Functions

To differentiate the native protein-ligand binding structure from decoy poses, here we propose a scoring function based on NMR chemical shifts (CSscore), which is simply the root-mean-square deviation (RMSD) of computed chemical shifts with reference to the experimental values,

$$\text{CSscore} = \sqrt{\frac{\sum_{i=1}^{N} \left(\delta_{\text{H}}^{i} - \delta_{\text{exp}}^{i}\right)^{2}}{N}} \tag{1}$$

where $\delta_{\text{H}}^{i}$ is the calculated chemical shift of the ith hydrogen atom on the ligand and nearby residues, and $\delta_{\text{exp}}^{i}$ is the experimental chemical shift of the corresponding atom in the native complex (holo NCS). N is the number of atoms whose chemical shifts were selected as molecular probes to characterize the NCS-chromophore binding structure. In this study, N was set to 31 for holo NCS, 21 of which are non-amide protons on the chromophore, and the other 10 hydrogen atoms are those with experimental chemical shift perturbations (CSP, between the bound and unbound complexes) greater than 0.5 ppm from residues in the binding site of the protein (see **Figure 2** and Table S1 of the Supplementary Materials).
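A minimal Python sketch of CSscore, understood as the root-mean-square deviation between calculated and experimental shifts of the N probe protons, is given below. The function name `cs_score` and the input values are illustrative assumptions; the sketch only assumes that the two lists are ordered so that matching entries refer to the same atom.

```python
from math import sqrt


def cs_score(calculated, experimental):
    """RMSD (ppm) between calculated and experimental chemical shifts.

    Both arguments are sequences of equal length N, with entry i of each
    referring to the same probe proton.
    """
    if len(calculated) != len(experimental) or not calculated:
        raise ValueError("need two non-empty shift lists of equal length")
    squared = [(c - e) ** 2 for c, e in zip(calculated, experimental)]
    return sqrt(sum(squared) / len(squared))


# Placeholder shifts (ppm) for two probe protons, for illustration only.
toy_score = cs_score([1.0, 2.0], [1.0, 4.0])  # sqrt((0 + 4) / 2)
```

A pose whose calculated shifts match experiment exactly gives a score of zero, and larger values indicate a poorer match.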

It is worth noting that, in Equation (1), we could also add the chemical shifts of the hydrogen atoms that experimentally do not change upon ligand binding. A false docking pose may cause significant deviations in CSPs for those hydrogen atoms. However, there are many such protons on the residues in the binding pocket, and including them would average out the final score and make the scoring function incapable of distinguishing the native structure from the decoy sets. Protons with experimental perturbations greater than 0.5 ppm are more sensitive to the binding pose; therefore, we took those atoms into account in the scoring function. Furthermore, although NMR chemical shifts of amide protons are also very sensitive to the local chemical environment of the binding pocket, these atoms were excluded owing to the lack of experimental data.

The second scoring function (CSGscore) we propose here is a linear combination of CSscore and Glide score,

$$\text{CS}_{\text{Gscore}} = \text{CSscore} + \alpha \cdot \text{Glide Score} \tag{2}$$

where α is a weighting factor. In this study, the ranges of CSscore and Glide Score are 0.42∼2.90 and −10.96∼10.98 (see Table S2 of the Supplementary Materials), respectively, and thus we choose α = (2.90 − 0.42)/(10.98 − (−10.96)) ≈ 0.1. By adding the Glide score to CSscore, unphysical structures with an unfavorable binding energy are avoided. Since both the CSscore and Glide score become smaller as the docking pose gets closer to the experimental structure, the native docking pose will give the lowest CSGscore value.
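The choice of the weighting factor and the hybrid score can be reproduced with a few lines of Python. The score ranges are those quoted above; the function names and the pose scores used in the ranking example are hypothetical and serve only to show how the two terms are combined.

```python
def weighting_factor(cs_range, glide_range):
    """Ratio of the CSscore span to the Glide score span."""
    return (cs_range[1] - cs_range[0]) / (glide_range[1] - glide_range[0])


def csg_score(cs_score_value, glide_score_value, alpha=0.1):
    """Hybrid score: CSscore plus alpha times the Glide score."""
    return cs_score_value + alpha * glide_score_value


# Ranges quoted in the text: 2.48 / 21.94, which rounds to the 0.1 used in the paper.
alpha_estimate = weighting_factor((0.42, 2.90), (-10.96, 10.98))

# Hypothetical (CSscore, Glide score) pairs, for illustration only.
toy_poses = {"native-like": (0.5, -10.0), "decoy": (0.7, -4.0)}
ranking = sorted(toy_poses, key=lambda name: csg_score(*toy_poses[name]))
```

Sorting in ascending order places the most favorable pose first, since lower values of both terms indicate a pose closer to the experimental structure.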

#### RESULTS AND DISCUSSION

# Benchmark Test of AF-QM/MM on the Native NCS-Chromophore Binding Complex

We first compared the calculated chemical shifts on the chromophore between AF-QM/MM and large-sized system calculations. Because the holo NCS contains more than 1500 atoms and was too large to perform full system QM calculations, we alternatively used the entire ligand and its buffer region for large-sized QM calculation. The other atoms beyond the buffer region are taken as background charges, and the PB surface charges for the entire complex are also placed to approximate the implicit solvent. The computed chemical shifts from such a model system (around 460 atoms in the QM region) are taken as the reference values. Here, we only compare the chemical shifts on the ligand between AF-QM/MM calculation and large-sized system calculation. As shown in **Figure 3**, the <sup>1</sup>H chemical shifts on the ligand calculated by the AF-QM/MM method (where the ligand is divided into three parts) are in good agreement with large-sized system calculation. The mean unsigned error (MUE) between AF-QM/MM results and chemical shifts from the large-sized system calculation is 0.046 ppm, and the RMSD between them is 0.051 ppm. The results demonstrate that the AF-QM/MM approach can accurately reproduce the large-sized system calculation. Furthermore, at the DFT level, the total computational cost was reduced by 36%, from 5,601 min (CPU time) by the large-sized system calculation to 3,585 min by dividing the ligand into 3 fragments. In addition, the 3 fragment-based QM/MM calculations were carried out in parallel. Therefore, the computational wall time could be further reduced by approximately 2/3.

Next, we compare the calculated chemical shifts for apo and holo NCS (31 protons in the binding pocket, as shown in **Figure 2**) with the experimental values. In this benchmark test, the AF-QM/MM results correlate well with the experiment (see **Figure 4**). For the bound complex, the MUE between the calculated and experimental chemical shifts is 0.44 ppm, and the RMSD is 0.57 ppm. For the unbound protein and ligand, the MUE and RMSD between calculated and experimental chemical shifts are 0.45 and 0.62 ppm, respectively.

The prediction of chemical shift perturbations (CSP) between apo and holo NCS could further validate the accuracy of the AF-QM/MM approach. Among the hydrogen atoms in the binding pocket, H15 and H31 of the ligand, HD2 of Leu45 and HB2 of Cys37 are significantly influenced by the ring current effect, because they are close to the aromatic rings in the native holo structure (see **Figure 5**). As a result, large upfield shifts upon ligand binding were observed for those atoms. The experimental CSPs of those four atoms (namely, H15, H31, Leu45:HD2, and Cys37:HB2) are −0.93, −0.86, −1.15, and −0.72 ppm, respectively, while the corresponding AF-QM/MM results are −0.45, −0.34, −1.25, and −0.71 ppm. The results show that large chemical shift perturbations between apo and holo protein-ligand systems could be accurately predicted by the AF-QM/MM approach.
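For reference, the comparison of these four CSPs can be tabulated with a short Python sketch. The values are the experimental and AF-QM/MM numbers quoted above; the helper name `csp`, the dictionary layout and the derived per-atom differences are added here purely as an illustration of the arithmetic.

```python
def csp(shift_holo, shift_apo):
    """Chemical shift perturbation (ppm): bound-state minus unbound-state shift."""
    return shift_holo - shift_apo


# CSP values (ppm) quoted in the text for the four ring-current-affected protons.
experimental_csp = {"H15": -0.93, "H31": -0.86, "Leu45:HD2": -1.15, "Cys37:HB2": -0.72}
calculated_csp = {"H15": -0.45, "H31": -0.34, "Leu45:HD2": -1.25, "Cys37:HB2": -0.71}

# Signed difference (calculated minus experimental) for each proton.
differences = {atom: calculated_csp[atom] - experimental_csp[atom] for atom in experimental_csp}

# Mean unsigned difference over the four protons.
mean_unsigned = sum(abs(d) for d in differences.values()) / len(differences)

# All four perturbations are negative, i.e., upfield shifts upon binding.
all_upfield = all(v < 0 for v in experimental_csp.values()) and all(
    v < 0 for v in calculated_csp.values()
)
```

The sign of every calculated perturbation agrees with experiment, with the closest numerical agreement for the two protein protons.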

# Performance of the CSscore Scoring Function

The Glide scores of the 38 docked decoys and the native holo NCS are shown in **Figure 6**. The energy based scoring function is capable of distinguishing the experimental structure from the docked poses whose structural RMSDs are larger than 4 Å with reference to the native pose. However, for docked poses whose RMSDs are between 2 and 4 Å, the Glide scores are sometimes very close to that of the experimental structure (see Pose 7 in **Figure 6A**). In contrast, the CSscore more readily discriminates the experimental structure from decoys whose structural RMSDs are around 2 Å. This is mainly because chemical shifts are quite sensitive to the local chemical environment at the binding site. In CSscore, protons in the ligand and the selected hydrogen atoms in protein residues serve as molecular probes to detect the binding environment. When the protons have different close contacts in the native and decoy structures, such as interactions with aromatic rings or hydrogen bonding, the calculated chemical shifts of certain protons may deviate substantially between binding modes. Therefore, changes in the proton NMR chemical shifts clearly reflect the corresponding binding interactions between the protein and ligand.

The comparison of calculated chemical shifts between the native and docking poses can probe the structural differences among the ligand binding poses. **Figure 7** shows that, for the positions of the naphthoate group and enediyne ring in Pose 7, the chromophore is very close to the native binding structure. However, the aminosugar group points in a different direction, which results in large chemical shift deviations for H13 and H15 on the ligand (see **Table 1**).

A weakness of CSscore is that, for decoys with larger structural RMSDs, the chemical shift based scoring function may be less effective than for those with low structural RMSDs. In the high structural RMSD range, some ligand poses might be close to the apo state (fewer interactions with the protein), and the CSscore for the apo state of the chromophore is 0.60. Therefore, with CSscore the rankings of those poses are not very sensitive to the structural RMSDs. The example cases of poses 10 and 18 are discussed in Section Improvement of the Hybrid CSGscore Scoring Function.

## Improvement of the Hybrid CSGscore Scoring Function

**Figure 8** shows that the CSGscore is capable of differentiating the experimental structure from the decoys for the protein-ligand complex. In this work, the weighting factor α in Equation (2) was set to 0.1 to put the CSscore and Glide scores on the same scale. As shown in **Figure 8**, for poses whose structural RMSDs are around 2–4 Å, the CSscore from NMR chemical shifts dominates the scoring function. Even though the Glide scores for the decoy structures are close to that of the experimental structure (see **Figure 6B**), the CSscore could discriminate the experimental structure from the decoys, so that the CSGscore (the combination of CSscore and Glide scores) ranked the native binding pose clearly as the most favorable structure. On the other hand, for decoy poses whose structural RMSDs are larger than 4 Å, the energy based scoring function (Glide score) has the major impact on the CSGscore, which makes the decoys with large structural RMSDs deviate more from the experimental structure than with the CSscore alone (**Figure 6B**).

Pose 10 was previously scored low in CSscore, with a chemical shift RMSD of 0.70 ppm between the calculated and experimental data. The naphthoate group of the ligand was flipped in the docking structure, while the locations of the other regions of the chromophore are similar to the native binding pose (see **Figure 9**). On the naphthoate group, H23 has the largest chemical shift deviation between the native binding structure and pose 10. In the native binding pose, H23 is beside the aromatic ring, which causes a downfield chemical shift on H23. In Pose 10, in contrast, the H23 atom moves away from the indole ring of Trp39, so that its calculated chemical shift is much lower than that of the experimental structure (see **Table 2**). Furthermore, **Figure 9a** shows that in Pose 10, H31 moves away from the phenyl ring of Phe52 and the corresponding chemical shielding decreases, resulting in a higher NMR chemical shift. In addition, because the aromatic ring of the naphthoate group moved in the docking structure, the <sup>1</sup>H chemical shifts on surrounding protein residues, such as Cys37:HB3, also changed significantly. Meanwhile, since the naphthoate ring strongly influences the <sup>1</sup>H chemical shifts from the unbound to bound states, the displacement of the naphthoate group in Pose 10 means that the chemical shifts are less affected upon ligand binding, which results in a slightly higher CSscore (see **Figure 6B**; the other groups remain close to the native state). Regarding the physical non-bonded interactions between NCS and the chromophore, the naphthoate group stays deep inside the binding pocket in the native state and contributes most to the NCS-chromophore binding energy. Therefore, the structural deformation in pose 10 weakened the interaction energy between them, and the corresponding Glide score gave a lower rank. In the hybrid scoring function CSGscore, the ranking of pose 10 is clearly separated from the experimental structure, because the score incorporates both the NMR chemical shift deviations from the experimental data and the physical interaction energy between the protein and ligand.

FIGURE 5 | X-ray structure of the holo NCS. The CSPs for protons of H15 (a) and H31 (b) in the chromophore, Leu45:HD2 (c) and Cys37:HB2 (d) in the NCS, are significantly influenced by the ring current effect.

FIGURE 6 | The Glide (A) and CSscore (B) scores with different structural RMSDs of the chromophore in the holo NCS. The green dot represents the score of the binding pose in the X-ray structure of the holo NCS. The red, yellow and violet dots denote the docking poses 7, 10 and 18, respectively.

Docking pose 18 is another case in which the energy function is more important than the CSscore in ranking the docking poses. **Figure 10** shows that the ligand in pose 18 is essentially translated in the direction away from the binding pocket. As the naphthoate ring moved away, the chemical shieldings of Cys37:HB2, Cys37:HB3 and Leu45:HD2 decreased as compared to the native state (see **Table 3**). The ligand pose 18 is closer to the apo state, which resulted in a high CSscore, but the interaction energy between the chromophore and NCS is clearly weaker than that of the native state, and the Glide score for pose 18 is substantially higher than that of the native binding pose (see **Figure 6A**). Therefore, in the hybrid scoring function CSGscore, the ranking of pose 18 is clearly lower than that of the native state, which correctly reflects that the rankings decrease as the structural RMSDs become larger.

the docking pose 7 as compared to the experimental structure.

TABLE 1 | The comparison between the experimental and calculated chemical shifts (in ppm).

"Native" and "Pose 7" denote the calculated chemical shifts for the native binding pose and the docking pose of Pose 7, respectively.

FIGURE 8 | The rankings of the native (green) and docking structures (red: pose 7; yellow: pose 10; violet: pose 18; and blue: other docked poses, predicted by Glide) calculated by CSGscore.

TABLE 2 | Comparison between the experimental and calculated chemical shifts on the native binding model and pose 10 (in ppm).

TABLE 3 | Comparison between the experimental and calculated chemical shifts on the native state and the docking pose 18 (in ppm).

# CONCLUSION

In this work, we applied the automated fragmentation method for QM/MM calculation of NMR chemical shifts for protein-ligand binding complexes. In the AF-QM/MM approach, the atomic NMR chemical shifts were obtained by dividing the protein automatically into residue-centric fragments. To reduce the computational cost for the ligand, the chromophore, which contains 81 atoms, was also divided into three smaller fragments so that the QM size for the ligand calculation is comparable to that of the protein fragments. The AF-QM/MM approach with the implicit solvation treatment is computationally efficient and linear-scaling with a low pre-factor. Moreover, the approach is massively parallel and can be applied to routinely calculate the ab initio NMR chemical shifts for protein-ligand complexes of any size.

The <sup>1</sup>H chemical shifts calculated by the AF-QM/MM approach at the DFT level are in good agreement with the large-sized system calculation, in which the entire ligand and its buffer region are treated by QM and the remaining atoms of the protein are described by background charges. The MUE between AF-QM/MM and the large-sized system calculation is 0.046 ppm. Furthermore, the MUEs between calculated and experimental <sup>1</sup>H chemical shifts in the binding pocket of apo and holo NCS are 0.45 and 0.44 ppm, respectively. Our results demonstrate that the AF-QM/MM approach is capable of reproducing the large-sized system ab initio calculations of NMR chemical shifts for protein-ligand complexes, and the calculated chemical shifts are in good agreement with the experimental results.

The CSscore results show that chemical shifts can be used as molecular probes to detect the binding conformation of the protein-ligand complex. The experimental structure has a clear leading score compared with the decoy binding poses. By investigating the CSP patterns of the decoy structures, changes in the position of the ligand can be detected through variations of the chemical shifts in different local chemical environments.

In this study, we further proposed the hybrid scoring function CSGscore, which combines the CSscore with the energy-based Glide score. The hybrid CSGscore can help to distinguish the native ligand structure from the decoy docking poses. CSGscore can also clearly separate the scores of decoy structures that have large structural RMSD values but relatively low CSscore values from that of the native docking pose. The CSGscore incorporates both the experimental NMR chemical shift information and the energy-based scoring method, and can therefore better determine the binding site structure of the protein-ligand complex. Therefore, the AF-QM/MM approach provides an accurate and efficient platform for protein-ligand binding structure prediction based on NMR-derived information.

# AUTHOR CONTRIBUTIONS

XH designed research; XJ, TZ, and XH performed research; XJ, TZ, JZ, and XH analyzed data; and XJ and XH wrote the paper.

# ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (Grant No. 2016YFA0501700), National Natural Science Foundation of China (Nos. 21673074, 21761132022 and 21433004), Youth Top-Notch Talent Support Program of Shanghai, NYU-ECNU Center for Computational Chemistry at NYU Shanghai, and Shanghai Putuo District (Grant 2014-A-02). We thank the Supercomputer Center of East China Normal University for providing us with computational time.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00150/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jin, Zhu, Zhang and He. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Force Balanced Fragmentation Method for ab Initio Molecular Dynamic Simulation of Protein

Mingyuan Xu<sup>1</sup>, Tong Zhu<sup>1,2\*</sup> and John Z. H. Zhang<sup>1,2,3,4\*</sup>

<sup>1</sup> State Key Lab of Precision Spectroscopy, Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, Shanghai Key Laboratory of Green Chemistry & Chemical Process, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai, China, <sup>2</sup> NYU-ECNU Center for Computational Chemistry at New York University Shanghai, Shanghai, China, <sup>3</sup> Department of Chemistry, New York University, New York, NY, United States, <sup>4</sup> Collaborative Innovation Center of Extreme Optics, Shanxi University, Taiyuan, China

A force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method is proposed for ab initio molecular dynamics simulation of proteins. In this approach, the energy of the protein is computed as a linear combination of the QM energies of individual residues and of molecular fragments that account for the two-body interactions of hydrogen bonds between backbone peptides. The atomic forces on the capping H atoms are corrected to conserve the total force of the protein. Using this approach, an ab initio molecular dynamics simulation of an Ace-(ALA)9-NME linear peptide conserved the total energy of the system throughout the simulation. Furthermore, a more robust 110 ps ab initio molecular dynamics simulation was performed for a protein with 56 residues and 862 atoms in explicit water. Compared with the classical force field, the ab initio molecular dynamics simulations gave a better description of the geometry of peptide bonds. Although further development is still needed, the current approach is highly efficient, trivially parallel, and can be applied to ab initio molecular dynamics simulation studies of large proteins.

#### Edited by:

Thomas S. Hofer, University of Innsbruck, Austria

#### Reviewed by:

Antonio Monari, Université de Lorraine, France

Hans Martin Senn, University of Glasgow, United Kingdom

#### \*Correspondence:

Tong Zhu, tzhu@lps.ecnu.edu.cn

John Z. H. Zhang, john.zhang@nyu.edu

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 06 March 2018 Accepted: 09 May 2018 Published: 30 May 2018

#### Citation:

Xu M, Zhu T and Zhang JZH (2018) A Force Balanced Fragmentation Method for ab Initio Molecular Dynamic Simulation of Protein. Front. Chem. 6:189. doi: 10.3389/fchem.2018.00189

Keywords: quantum fragment method, ab initio molecular dynamics, force balanced, GB3, protein dynamics, MFCC

# INTRODUCTION

Molecular dynamics (MD) simulation plays an increasingly important role in the study of structural and dynamical properties of biomolecules at the atomic level (Karplus and Petsko, 1990; Cheatham and Kollman, 2000; Karplus and McCammon, 2002). With the ever-increasing power of computer hardware and the development of enhanced sampling methods, significant advances toward MD simulations of larger systems over longer simulation times have been achieved over the past decades (Shaw et al., 2010; Prinz et al., 2011). However, the accuracy and reliability of MD results are highly dependent on the accuracy of the force field employed in the simulation (Weiner et al., 1984, 1986; Ponder and Case, 2003). Despite the widely successful applications of current force fields in biomolecular simulations, these simplified, predefined pairwise force fields have serious drawbacks. The most widely known deficiency is that the atomic charges in most of these force fields are fixed in advance, and there is no explicit treatment of electrostatic polarization and charge transfer (Duan et al., 2010; Tong et al., 2010; Ji and Mei, 2014). In the past few decades, significant efforts have been devoted to the development of polarizable force fields. However, although great achievements have been made, the accuracy of these polarizable force fields still has considerable room for improvement. In addition, many force fields have a bias toward a particular secondary structure of the protein. For example, the α-helical propensity of the AMBER03 force field is too high relative to experimental measurements, while that of AMBER99SB is arguably too low (Best et al., 2008).

Compared with classical force fields, QM calculations can provide a much more accurate potential energy function for the studied system and include all important quantum effects. The advantage of, or need for, so-called ab initio molecular dynamics (AIMD) simulation over classical force fields in the study of proteins has been reported by various researchers (Wei et al., 2001; Dal Peraro et al., 2005; Ufimtsev and Martinez, 2009; Ufimtsev et al., 2011; Isborn et al., 2013). In these AIMD calculations, the atomic forces of the studied protein were calculated by QM methods, normally at the DFT level, whereas the motion of the nuclei was handled by classical mechanics. However, QM calculations are computationally expensive, which means that they can only be used for relatively small proteins.

So far, considerable efforts have been made to extend the applicability of QM calculations to large systems. Among existing approaches, the fragment-based QM methods have attracted much attention (Gordon et al., 2012; Collins et al., 2014; Li et al., 2014; Pruitt et al., 2014; Ramabhadran and Raghavachari, 2014; Chung et al., 2015; Collins and Bettens, 2015; Raghavachari and Saha, 2015). These approaches are based on the chemical locality of molecular systems, which assumes that a local region of a molecular system is only weakly influenced by atoms that are far away from it (Xu et al., 1998; Fedorov et al., 2014; Gao et al., 2014; He et al., 2014; Pruitt et al., 2014). In this kind of method, the studied system is divided into small subsystems (fragments), and the properties of these fragments, such as the energy, are calculated separately by QM methods. The property of the whole system is then obtained by taking a proper combination of the properties of the individual fragments. The fragment-based QM method is attractive in several respects: it is easy to parallelize without extensive modification of existing QM programs, and it can be combined with all levels of ab initio electronic structure theory. In our previous study (Liu et al., 2015), a fragment-based approach was presented for AIMD simulation of proteins. In that approach, the potential energy and atomic forces of the studied protein are calculated by the recently developed electrostatically embedded generalized molecular fractionation with conjugate caps (EE-GMFCC) method (Wang et al., 2013). This AIMD approach was applied to MD simulation of the small benchmark protein Trp-cage both in the gas phase and in solution. Compared with the AMBER force field, this method gives a more stable protein structure in simulation and captures quantum effects that are missing in standard classical MD simulations.

To further improve the accuracy and efficiency of AIMD simulations for proteins, in this work we present a force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method and check its performance in AIMD simulations of several systems. The paper is organized as follows. The next section provides a description of the FB-GMFCC approach. In section Result and Discussion, we perform AIMD simulations on two selected proteins to validate the new method. Finally, a brief summary is given in section Conclusion.

#### THEORY AND METHOD

The FB-GMFCC method was developed within the framework of the molecular fractionation with conjugate caps (MFCC) approach (Zhang and Zhang, 2003). The computational procedure of FB-GMFCC can be roughly divided into two steps. Firstly, the given protein is cut into capped molecular fragments, including individual residues and residues that form backbone hydrogen bonds, and the energy and atomic forces of each fragment are calculated separately by QM methods. Secondly, the AMOEBA polarizable force field (Ren and Ponder, 2002; Ponder et al., 2010; Ren et al., 2011; Wu et al., 2012) is employed to describe the long-range non-bonded interactions. Thus, the total energy of the protein system is obtained by a summation of quantum and classical components,

$$\text{E}\_{\text{FB-GMFCC}} = \text{E}\_{\text{QM}} + \text{E}\_{\text{MM}} \tag{1}$$

Computational details of these energy components are described below.

# Calculation of E<sub>QM</sub>

To calculate the energy E<sub>QM</sub>, a given protein with N amino acids (defined as A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>...A<sub>N</sub>) is decomposed into N individual fragments by cutting through the peptide bonds (**Figure 1**). At every cut point, a pair of molecular caps is introduced to saturate each fragment in order to preserve the local chemical environment. To minimize the computational cost, we simply use the amine and formyl groups from the peptide bond as molecular caps, which are conjugate to each other (by forming a peptide bond) and are much smaller than those used in the EE-GMFCC approach. To avoid dangling bonds, hydrogen atoms are added to terminate the molecular caps; the positions of these extra H atoms are determined from the coordinates of the corresponding Cα atoms.

Hydrogen bonds are among the most important structural elements of proteins and the dominant factor stabilizing protein secondary structures. Many previous works demonstrated that the strength of hydrogen bonds in simulations with non-polarizable force fields is underestimated because polarization effects are missing (Ji et al., 2008; Gao et al., 2012). In the FB-GMFCC method, backbone hydrogen bonds are treated by two-body QM calculations. To reduce computational cost, only the H-saturated peptide bond that contains the donor or the acceptor (which is effectively a formamide, as shown in **Figure 2**) is kept in the two-body QM calculation; the positions of the extra H atoms are again determined from the coordinates of the corresponding Cα atoms. If the distance between the donor H atom and the acceptor O atom is <3.0 Å and the N-H<sub>N</sub>-O angle θ is larger than 120°, the two-body correction is included. Thus, E<sub>QM</sub> can be expressed by the following formula:

$$\begin{aligned} \text{E}\_{\text{QM}} &= \text{E}\_{\text{fragment}} - \text{E}\_{\text{concap}} + \text{E}\_{\text{two-body}} \\ &= \sum\_{i=2}^{N-1} \text{E} (\text{Cap}\_{i-1}^{\*} \text{A}\_{i} \text{Cap}\_{i+1}) - \sum\_{i=2}^{N-2} \text{E} (\text{Cap}\_{i}^{\*} \text{Cap}\_{i+1}) \\ &\quad + \sum\_{\substack{i,\, j > i+2 \\ |\text{R}\_{\text{H}\_{N}-\text{O}}| \leq \lambda \\ \angle\_{\text{N}-\text{H}\_{N}-\text{O}} \geq \theta}} \left( \text{E} (\text{A}\_{i}^{\text{P}} \text{A}\_{j}^{\text{P}}) - \text{E} (\text{A}\_{i}^{\text{P}}) - \text{E} (\text{A}\_{j}^{\text{P}}) \right) \end{aligned} \tag{2}$$

where i and j are the indices of the ith and jth residues, respectively. If the formyl or amide group of residue A is involved in a backbone hydrogen bond, A<sup>P</sup> represents the H-saturated peptide bond that contains this group. The first term, E(Cap<sup>∗</sup><sub>i−1</sub>A<sub>i</sub>Cap<sub>i+1</sub>), in Equation (2) represents the self-energy of fragment i (the ith residue A<sub>i</sub> capped with a left cap Cap<sup>∗</sup><sub>i−1</sub> and a right cap Cap<sub>i+1</sub>). The self-energy of the conjugate caps, E(Cap<sup>∗</sup><sub>i</sub>Cap<sub>i+1</sub>), is double counted in the first term of Equation (2) and must therefore be subtracted.
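As an illustration of how the terms of Equation (2) fit together, the following minimal Python sketch checks the geometric hydrogen-bond criterion and combines precomputed fragment energies. The function names, the argument layout, and the idea that all energies are already available as plain numbers are assumptions made for this example; the QM calculations themselves are not shown, and this is not the authors' implementation.

```python
import math


def hbond_qualifies(n, h, o, max_dist=3.0, min_angle_deg=120.0):
    """Geometric test for including a backbone hydrogen bond in the two-body term.

    n, h, o are (x, y, z) coordinates in Å of the donor N, the donor H and the
    acceptor O. The bond qualifies when the H...O distance is below max_dist
    and the N-H...O angle (measured at H) exceeds min_angle_deg.
    """
    hn = [a - b for a, b in zip(n, h)]  # vector from H to N
    ho = [a - b for a, b in zip(o, h)]  # vector from H to O
    d_hn = math.sqrt(sum(c * c for c in hn))
    d_ho = math.sqrt(sum(c * c for c in ho))
    cos_angle = sum(a * b for a, b in zip(hn, ho)) / (d_hn * d_ho)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return d_ho < max_dist and angle > min_angle_deg


def combine_qm_energy(fragment_energies, concap_energies, two_body_terms):
    """Combine precomputed energies as in Equation (2).

    fragment_energies: energies of the capped residues
    concap_energies:   energies of the conjugate-cap pairs (double counted above)
    two_body_terms:    (E_ij, E_i, E_j) tuples for each qualifying hydrogen bond
    """
    e_fragment = sum(fragment_energies)
    e_concap = sum(concap_energies)
    e_two_body = sum(e_ij - e_i - e_j for e_ij, e_i, e_j in two_body_terms)
    return e_fragment - e_concap + e_two_body
```

In practice the hydrogen-bond test would be re-evaluated at every MD step, since the set of qualifying residue pairs changes as the structure moves.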

#### Calculation of E<sub>MM</sub>

between the donor H atom and the acceptor O atom is <3.0 Å and the angle of N-H<sub>N</sub>-O is larger than 120°, the 2-Body correction will be considered. To reduce computational cost, only the H-saturated peptide bond which contains the donor or the acceptor was kept in the two-body QM calculation.

The E<sub>QM</sub> term includes the self-energy of the individual residues and the two-body correction for the interaction energy between residues that form backbone hydrogen bonds. To obtain the total energy expression for proteins, a classical force field is introduced to represent the long-range non-bonded interactions. In our previous study, we found that the electrostatic polarization arising from the environment also plays a critical role in including many-body effects in fragmentation methods (Wang et al., 2013). To describe the electrostatic polarization effect efficiently, we employed the polarizable atomic multipole-based AMOEBA force field (Ren and Ponder, 2002; Ponder et al., 2010; Ren et al., 2011; Wu et al., 2012). The expression for E<sub>MM</sub> is as follows:

$$\text{E}\_{\text{MM}} = \sum\_{\substack{i,j \notin \text{same} \\ \text{QM zone}}} \left[ \text{E}\_{\text{ele}}^{\text{perm}}(i,j) + \text{E}\_{\text{ele}}^{\text{ind}}(i,j) + \text{E}\_{\text{vdW}}(i,j) \right] \tag{3}$$

Details of the calculation of the van der Waals interactions and of the permanent and induced electrostatic energies in the AMOEBA force field can be found in Ren and Ponder (2002), Ponder et al. (2010), Ren et al. (2011), and Wu et al. (2012). For any two atoms that have not been treated together in the same QM zone, the non-neighboring interactions between them are added to the total energy expression.
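The selection rule in Equation (3) can be pictured as a simple bookkeeping step. The sketch below is an assumption-laden illustration: it treats each QM calculation as a set of atom indices (a hypothetical representation) and returns the atom pairs that never appear together in any of them. It does not evaluate any AMOEBA energy terms, and a real force-field implementation would apply further exclusion and scaling rules that are not covered here.

```python
from itertools import combinations


def mm_pairs(n_atoms, qm_zones):
    """Return atom pairs (i, j), i < j, that never share a QM zone.

    n_atoms:  total number of atoms, indexed 0 .. n_atoms - 1
    qm_zones: iterable of atom-index collections, one per QM calculation
    """
    covered = set()
    for zone in qm_zones:
        # Every pair inside a zone is already described quantum mechanically.
        covered.update(combinations(sorted(set(zone)), 2))
    return [pair for pair in combinations(range(n_atoms), 2) if pair not in covered]


# Example: four atoms, two overlapping QM zones.
pairs = mm_pairs(4, [[0, 1], [1, 2]])
```

Enumerating all pairs scales quadratically with the number of atoms, so this form is only meant to make the rule explicit, not to be efficient.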

#### Balance the Force

To obtain atomic forces, we need to compute the derivative of E<sub>FB-GMFCC</sub> with respect to the nuclear coordinates. For a given atom m, the atomic force can be expressed as follows:

$$\mathbf{F}\_m = -\nabla\_m \mathbf{E}\_\text{FB-GMFCC} = -\nabla\_m \mathbf{E}\_\text{QM} - \nabla\_m \mathbf{E}\_\text{MM} \tag{4}$$

It should be noted, however, that we employed extra hydrogen atoms to avoid dangling bonds (**Figure 3**) and that their coordinates were determined from those of the corresponding Cα atoms. Because the forces on these extra hydrogen atoms in the capped fragments, E(Cap<sup>∗</sup><sub>i−1</sub>A<sub>i</sub>Cap<sub>i+1</sub>), cannot be canceled exactly by subtracting those in the caps, E(Cap<sup>∗</sup><sub>i</sub>Cap<sub>i+1</sub>), the energy is not exactly conserved. To fix this problem, we balance the forces on the corresponding Cα atoms by adding the differences of the forces on these extra hydrogen atoms. For instance, the difference of the forces on the H atom added to the carbonyl group of residue i−1 (the H atom in the left blue circle of **Figure 3**) is

$$\Delta \mathbf{F} = \mathbf{F}\_{\text{Cap}\_i}^{\text{ex-H}} (\text{Cap}\_{i-1}^\* \mathbf{A}\_i \text{Cap}\_{i+1}) - \mathbf{F}\_{\text{Cap}\_i}^{\text{ex-H}} (\text{Cap}\_{i-1}^\* \text{Cap}\_i) \neq \mathbf{0}, \tag{5}$$

which is added to the force on the Cα atom of residue i−1,

$$\mathbf{F}\_{\mathbf{C}\_{\alpha}}^{\text{final}} = \mathbf{F}\_{\mathbf{C}\_{\alpha}} + \Delta \mathbf{F} \tag{6}$$

This approach will balance the forces of the fragments and conserve the total energy of the system.
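The correction in Equations (5) and (6) amounts to transferring the residual force on each extra cap hydrogen to the Cα atom from which its position was derived. The following Python sketch shows this bookkeeping under stated assumptions: the data layout (a dictionary of forces and a list of per-hydrogen records) and all names are hypothetical and chosen only for illustration.

```python
def balance_cap_forces(forces, cap_h_records):
    """Add residual cap-hydrogen forces to the corresponding C-alpha atoms.

    forces:        dict mapping atom id -> [fx, fy, fz] for the real protein atoms
    cap_h_records: list of (ca_id, f_in_fragment, f_in_concap) tuples, where the
                   two force vectors belong to the same extra H atom, evaluated
                   in the capped fragment and in the conjugate-cap calculation
    """
    balanced = {atom: list(f) for atom, f in forces.items()}
    for ca_id, f_frag, f_cap in cap_h_records:
        # Equation (5): the two forces on the extra H atom do not cancel exactly.
        delta = [a - b for a, b in zip(f_frag, f_cap)]
        # Equation (6): the leftover force is assigned to the C-alpha atom.
        balanced[ca_id] = [f + d for f, d in zip(balanced[ca_id], delta)]
    return balanced


# Example with one cap hydrogen attached to a hypothetical atom "CA1".
result = balance_cap_forces(
    {"CA1": [1.0, 0.0, 0.0], "N2": [0.0, 0.0, 0.0]},
    [("CA1", [0.5, 0.2, 0.0], [0.25, 0.2, 0.5])],
)
```

Because the leftover force is moved rather than discarded, the sum of all forces acting on the real atoms is unchanged by the removal of the fictitious hydrogens.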

#### RESULT AND DISCUSSION

#### Performance of FB-GMFCC on Pure Proteins

To validate the FB-GMFCC method, we checked its performance for four protein systems and compared the calculated results with those from conventional full-system QM calculations. An Ace-(ALA)9-NME linear peptide was constructed with the TLEAP software in the AMBER16 package, and three small proteins with different secondary structures were selected from the Protein Data Bank (ID: 2I9M, 1LE1, and 2OED). Energy minimization (using gradient descent and conjugate gradient algorithms with the Amber ff14SB force field) was performed to remove bad contacts in these structures before the QM calculations. The job CPU times of the FB-GMFCC and full-system QM calculations are compared in **Table 1**. The FB-GMFCC method is 4 or 5 times faster than the full QM calculation for a real protein with about 200 atoms. For the larger 2OED protein, the full-system QM calculation is not possible on our computer system due to its large size. It should be noted that the computational time of the present approach scales essentially linearly with the system size, as shown in **Table 1**.

TABLE 1 | Comparison of the computational cost of FB-GMFCC and full-system QM calculation on the Linux server with two Intel E5-2680v3 CPUs (14 cores, 2.50 GHz).

All calculations were performed with GAUSSIAN09 at M062X/6-31G\* level. \*Full-system QM calculation is not possible on our current machine due to large size of the system.

**Figure 4** compares the computed atomic forces with those from the full-system QM calculations. Overall, the atomic forces are in good agreement with the full-system calculations except for a few points. For example, there is one outlier in the calculated atomic forces of 2I9M (**Figure 4**), which corresponds to one H atom on the ε-amino group of LYS8. After carefully checking the structure, we found that a salt bridge is formed between this group and the side chain of GLU4. This salt bridge cannot be accurately described by the force field that is used in the present method to describe interactions between non-neighboring residues.

#### Ab Initio Molecular Dynamic in Gas Phase

To further check the performance of the FB-GMFCC method, we performed an AIMD simulation of the linear peptide ACE-(ALA)9-NME in the NVE ensemble in the gas phase. The simulation was performed by combining FB-GMFCC with the TINKER program. Before the AIMD simulation, an energy minimization (using gradient descent and conjugate gradient algorithms with the Amber ff14 force field), a 400 ps heating simulation that heated the system from 0 to 300 K, and a 5 ns equilibration MD simulation (using the velocity Verlet algorithm and the same force field) were performed. The AIMD simulation lasted 2 ps with a 1 fs time step and without any constraints. Another AIMD simulation without force balance was also performed as a reference. The total energy fluctuations in these two AIMD simulations are shown in **Figure 5**. As can be seen, the total energy in the AIMD simulation based on GMFCC without force balance gradually increases in the gas-phase NVE ensemble, which means that the energy is not conserved if the atomic forces on the extra cap hydrogen atoms are not transferred to the corresponding Cα atoms. In contrast, the total energy with FB-GMFCC is well maintained and conserved at 91 kcal/mol. On average, the extra H atoms can introduce extra forces as large as 11 kcal/(mol·Å) into the system, which perform additional work on the system. Thus it is necessary to balance the atomic forces on these extra H atoms in the AIMD simulation.
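One simple way to quantify whether the total energy of an NVE trajectory is conserved is to fit a straight line to the energy as a function of time and inspect the slope. The sketch below is a generic least-squares fit in plain Python; it is an illustrative diagnostic and an assumption of this example, not the analysis used to produce **Figure 5**, and the sample numbers are invented for demonstration.

```python
def energy_drift(times_ps, energies):
    """Least-squares slope of total energy versus time.

    times_ps: simulation times in ps
    energies: total energies in kcal/mol at those times
    Returns the drift in kcal/(mol*ps); a value near zero indicates conservation.
    """
    n = len(times_ps)
    mean_t = sum(times_ps) / n
    mean_e = sum(energies) / n
    sxx = sum((t - mean_t) ** 2 for t in times_ps)
    sxy = sum((t - mean_t) * (e - mean_e) for t, e in zip(times_ps, energies))
    return sxy / sxx


# Invented sample data: a steadily rising energy and a constant one.
rising = energy_drift([0.0, 1.0, 2.0, 3.0], [91.0, 91.5, 92.0, 92.5])
flat = energy_drift([0.0, 1.0, 2.0, 3.0], [91.0, 91.0, 91.0, 91.0])
```

A systematic positive slope, as in the first sample, corresponds to the behavior described for the simulation without force balance.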

# AIMD in Explicit Water

Since water plays an important role in protein structure and dynamics, proteins should be studied in a solvent environment. The FB-GMFCC approach can also be used to perform AIMD simulations of proteins in an explicit solvent environment. The total energy of the protein-solvent system with FB-GMFCC can be expressed by the following formula:

$$\mathbf{E}\_{\text{total}} = \mathbf{E}\_{\text{Protein}}^{\text{FB-GMFCC}} + \mathbf{E}\_{\text{water}}^{\text{MM}} + \mathbf{E}\_{\text{Protein-water}}^{\text{MM}} \tag{7}$$

To reduce the computational cost, the inter- and intramolecular interactions of the water molecules and their interactions with the protein are described by a classical force field (Amber ff14SB), corresponding to mechanical embedding in the QM/MM framework.

We performed a 110 ps AIMD simulation of the relatively large protein (2OED, 56 residues, 862 atoms) in explicit water. The protein was solvated in a water ball consisting of 3084 TIP3P water molecules. Before the AIMD simulation, energy minimizations were performed to remove bad contacts in the system, and a 25 ps heating simulation was performed to heat the system slowly to 300 K. A restraint of 50 kcal/mol was applied to the backbone to avoid large unphysical structural changes during the heating process. The system then underwent AIMD simulation with a 1 fs time step and without any constraints. The Langevin thermostat with a collision frequency of 2.0 ps<sup>−1</sup> was applied to control the temperature. In addition, a 20 kcal/mol half-harmonic restraint was applied at the boundary of the water ball to prevent water molecules from escaping.
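A half-harmonic boundary restraint of the kind described above acts only on atoms that move beyond the boundary of the water ball. The sketch below is a minimal illustration under explicit assumptions: the functional form k(r − R)², the interpretation of the quoted value as a force constant in kcal/(mol·Å²), and the boundary radius used in the example are all assumptions, since the text gives only the value 20 kcal/mol.

```python
import math


def half_harmonic_restraint(position, center, radius, k=20.0):
    """Energy and force of a half-harmonic spherical boundary restraint.

    position, center: (x, y, z) coordinates in Å
    radius:           boundary radius in Å (hypothetical; not given in the text)
    k:                force constant, assumed here to be in kcal/(mol*Å^2)
    Returns (energy, force); both vanish while the atom is inside the sphere.
    """
    r = math.dist(position, center)
    excess = r - radius
    if excess <= 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    energy = k * excess ** 2
    # The force is the negative gradient and points back toward the center.
    scale = -2.0 * k * excess / r
    force = tuple(scale * (p - c) for p, c in zip(position, center))
    return energy, force


# Example: an atom 1 Å outside a hypothetical 30 Å boundary.
energy, force = half_harmonic_restraint((31.0, 0.0, 0.0), (0.0, 0.0, 0.0), 30.0)
```

The one-sided form leaves the interior of the water ball unperturbed, which is why it is preferred over a full harmonic restraint for this purpose.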

This AIMD simulation was performed on a Linux server cluster with 30 nodes, each with dual Intel Xeon E5-2680v3 CPUs. To balance computational cost and accuracy, the combination of the BLYP functional and the 6-31G\* basis set was used in the calculation. It took 55 days to complete the simulation. The time evolution of the temperature in the simulation is shown in **Figure 6**. The temperature is very stable throughout the trajectory. The backbone RMSD with respect to the X-ray structure is no larger than 1.5 Å, which means that the structure of the 2OED protein is relatively stable during the 110 ps AIMD simulation.

Recently, many researchers have found that considerable deviations from planarity of the peptide bond (ω = 180°) can be identified in atomic-resolution X-ray structures, sometimes even exceeding 10–15° (Wlodawer et al., 1984; MacArthur and Thornton, 1996; Ulmer et al., 2003). As the resolution of the structure (2OED) used in this work is very high (1.1 Å) and the coordinates of the hydrogen atoms in this structure were further refined with NMR experiments (Ulmer et al., 2003), it is worthwhile to compare the planarity of the peptide bonds in the experimental structure and in the AIMD trajectory. Six peptide bonds were selected from the experimental structure because their O<sub>i−1</sub>-C<sub>i−1</sub>-N<sub>i</sub>-H<sup>N</sup><sub>i</sub> dihedral angles deviate the most from the peptide plane. The results can be found in **Table 2** and **Figure 7**. Four of the six peptide bonds maintained their large deviations. The results calculated by FB-GMFCC generally agree well with the experiments, especially for VAL21, TYR3, and PHE52. For comparison, we also tested the performance of MD with a classical force field (Amber ff14SB) under the same conditions. Not surprisingly, the Amber ff14SB force field prefers planar peptide bonds, a preference that is predefined in the force field. Being based on QM calculations, the AIMD simulation generally describes the intraprotein interactions more accurately.
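The planarity analysis above rests on computing a dihedral angle from four atomic positions. The following Python sketch uses a standard vector formula for the signed dihedral and reports the deviation from the planar value of 180°; the function names are hypothetical, the coordinates in the example are artificial, and this is not the analysis script used for **Table 2**.

```python
import math


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points, e.g. O, C, N, H."""
    b0 = [a - b for a, b in zip(p0, p1)]
    b1 = [a - b for a, b in zip(p2, p1)]
    b2 = [a - b for a, b in zip(p3, p2)]
    norm = math.sqrt(sum(c * c for c in b1))
    b1 = [c / norm for c in b1]
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    dot0 = sum(a * b for a, b in zip(b0, b1))
    dot2 = sum(a * b for a, b in zip(b2, b1))
    v = [a - dot0 * b for a, b in zip(b0, b1)]
    w = [a - dot2 * b for a, b in zip(b2, b1)]
    cross = [
        b1[1] * v[2] - b1[2] * v[1],
        b1[2] * v[0] - b1[0] * v[2],
        b1[0] * v[1] - b1[1] * v[0],
    ]
    x = sum(a * b for a, b in zip(v, w))
    y = sum(a * b for a, b in zip(cross, w))
    return math.degrees(math.atan2(y, x))


def planarity_deviation(p0, p1, p2, p3):
    """Deviation of the dihedral from the planar trans value of 180 degrees."""
    return 180.0 - abs(dihedral_deg(p0, p1, p2, p3))


# Artificial geometry twisted by 10 degrees away from planarity.
t = math.radians(10.0)
dev = planarity_deviation(
    (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
    (1.0, -math.cos(t), math.sin(t)),
)
```

Averaging this quantity over the frames of a trajectory gives values that can be compared directly with the angle measured in the experimental structure.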

#### CONCLUSION

In this study, a force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method was presented. In this approach, the fragment-based energies of individual residues and the interaction energies of residues that form backbone hydrogen bonds are calculated by quantum mechanics. Other non-bonded interactions are described by the polarizable AMOEBA force field. The atomic forces calculated with this method are in good agreement with those from conventional full-system QM calculations. A key element of the FB-GMFCC method is that the atomic forces on the capping H atoms are corrected to conserve the total force of the studied system.

We also demonstrated the applicability of the FB-GMFCC method for performing ab initio molecular dynamics (AIMD) simulations of proteins. The results for an Ace-(ALA)9-NME linear peptide showed that only the balanced forces conserve the total energy during the simulation. A 110 ps AIMD simulation was also performed for a relatively large protein with 56 residues and 862 atoms in explicit water. Compared with the classical force field, the AIMD simulations gave a better description of the geometry of the peptide bonds. It should be noted that the accuracy of the FB-GMFCC method still has room for improvement. Further development of this method will focus on the treatment of strong short-range interactions such as salt bridges and hydrogen bonds involving side chains; relevant work is underway in our laboratory.

TABLE 2 | Comparison of six selected O<sub>i−1</sub>-C<sub>i−1</sub>-N<sub>i</sub>-H<sup>N</sup><sub>i</sub> dihedral angles in both the AIMD and AMBER MD calculations with experimental measurements.

The calculated values were averaged from the trajectories.

These results show that the FB-GMFCC approach is potentially powerful and attractive for studying protein dynamics. As a fragment-based approach, the FB-GMFCC method is linear-scaling and trivially parallelizable. With further development and improvement, this method will become increasingly practical for AIMD simulation of larger proteins.

#### REFERENCES


#### AUTHOR CONTRIBUTIONS

MX performed theoretical calculation and analysis of result. TZ developed the theory and contributed to the writing and discussion of the paper. JZ organized the project and contributed to the discussion and writing of the paper.

#### ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (Grant no. 2016YFA0501700), the National Natural Science Foundation of China (Grant nos. 21433004, 91641116, 91753103), Shanghai Putuo District (Grant 2014-A-02), the Innovation Program of Shanghai Municipal Education Commission (201701070005E00020), and an NYU Global Seed Grant. We thank the Supercomputer Center of East China Normal University for providing us with computer time.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xu, Zhu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# QM Cluster or QM/MM in Computational Enzymology: The Test Case of LigW-Decarboxylase

Mario Prejanò, Tiziana Marino\* and Nino Russo

Dipartimento Di Chimica e Tecnologie Chimiche, Università della Calabria, Rende, Italy

The catalytic mechanism of the decarboxylation of 5-carboxyvanillate by LigW, which produces vanillic acid, has been studied by using QM cluster and hybrid QM/MM methodologies. In the QM cluster model, the environment of a small QM model is treated with a bulk potential, while the two QM/MM models include the partial and the full protein, without and with explicitly treated water solvent. The studied reaction involves two sequential steps: the protonation of the carbon of the 5-carboxyvanillate substrate and the decarboxylation of the resulting intermediate, which yields deprotonated vanillic acid as the product. The structures and energetics obtained by using the three structural models and two density functionals are quite consistent with each other. This indicates that the small QM cluster model is appropriate for the enzymatic reaction considered here and that the reaction is mainly influenced by the active site.

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Artur Nenov, Università degli Studi di Bologna, Italy Ahmet Altun, Max-Planck-Institut für Kohlenforschung, Germany

> \*Correspondence: Tiziana Marino tiziana.marino65@unical.it

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

> Received: 11 April 2018 Accepted: 08 June 2018 Published: 28 June 2018

#### Citation:

Prejanò M, Marino T and Russo N (2018) QM Cluster or QM/MM in Computational Enzymology: The Test Case of LigW-Decarboxylase. Front. Chem. 6:249. doi: 10.3389/fchem.2018.00249

#### Keywords: QM, QM/MM, decarboxylation, enzymatic catalysis, reaction mechanism, LigW

# INTRODUCTION

Enzymes are biological machines that efficiently catalyze a huge number of chemical reactions on the very short time scales required by physiological processes. In the last decades, computational enzymology has become a very useful tool for studying enzyme activity, since it allows the energies and structures of short-lived intermediates and transition states to be determined. Through computational enzymology, different reaction pathways can be analyzed, and their feasibility can be established by a careful analysis of the calculated energy barriers. A crucial issue in computational enzymology is the choice of the model to be used in the simulations. The choice is not obvious because it depends on the nature of the enzyme (without or with a metal cofactor), on the catalytic pocket, and on the amino acids implicated in the chemical reaction. In the fully quantum mechanical (QM) treatment (Himo, 2006; Ramos and Fernandes, 2008; Siegbahn and Himo, 2011; Merz, 2014), a cluster that contains all the residues around the active site is considered.

In metalloenzymes, the construction of the cluster model is facilitated by the presence of the metal ions and of all the residues of their inner coordination sphere. One may also need to include some other surrounding residues involved in the chemical process. Atoms at the periphery of the model, where the truncation is made, are normally frozen at their crystallographic positions to avoid artificial expansion or other rearrangements (Blomberg et al., 2014). The surrounding protein environment not directly implicated in the chemical transformation is modeled with implicit solvation models based on a dielectric constant (Warshel, 1991). This method is highly versatile and has been widely applied to a large variety of enzyme families and to different classes of enzymes (Ramos and Fernandes, 2008; Liao et al., 2010; Amata et al., 2011a; Himo, 2017; Piazzetta et al., 2017; Prejanò et al., 2017a). A different approach, developed in 1976 (Warshel and Levitt, 1976), is the hybrid quantum mechanics/molecular mechanics (QM/MM) method (Senn and Thiel, 2009; Quesne et al., 2016; Ryde, 2016). In this procedure, in addition to the QM portion, a large number of residues (or the whole enzyme sequence) is treated at the molecular mechanics (MM) level (Senn and Thiel, 2009). Convergence studies performed by different research groups indicated that QM-cluster models (Siegbahn and Himo, 2011; Ryde, 2017) give reliable energetics when the size of the model is large enough. Herein we perform a theoretical study using both QM cluster and QM/MM approaches on 5-carboxyvanillate decarboxylase (5CVA), the gene product of LigW (Vladimirova et al., 2016). The QM part in all the models has been treated in the framework of density functional theory (DFT) by using two different exchange-correlation functionals.

LigW belongs to the amidohydrolase superfamily (AHS), which includes a large number of enzymes catalyzing the hydrolysis of a wide range of substrates. In all AHS members, a mononuclear or binuclear metal binding site is found (Gerlt and Babbitt, 2001; Seibert and Raushel, 2005). All AHS members have a (β/α)<sub>8</sub>-barrel structural fold and catalyze the metal-dependent hydrolysis of phosphate and carboxylate esters (Jackson et al., 2005; Shapir et al., 2006; Elias et al., 2008; Khurana et al., 2009; Duarte et al., 2011; Tobimatsu et al., 2013). LigW catalyzes the C–C bond cleavage of 5-carboxyvanillate (5-CV) to vanillate (VAN) in an oxidant-independent fashion. 5-CV represents one of the final products of the multienzymatic degradation of biphenyl lignin derivatives. Lignin degradation of microbial origin is an interesting process from both the commercial and the biotechnological points of view, owing to the conversion of plant biomass into renewable aromatic chemicals and biofuels (Liu and Zhang, 2006). Furthermore, decarboxylation is a process of widespread occurrence in nature and is therefore of relevant biological interest (Faponle et al., 2016).

# COMPUTATIONAL METHODS

All the calculations were carried out with the Gaussian 09 program (Gaussian 09, Revision D.01, 2011)<sup>1</sup>. The QM portions were treated with the B3LYP (Lee et al., 1988; Becke, 1993) hybrid density functional. The 6-31+G(d,p) basis set was used for the C, N, O, and H atoms, whereas the SDD pseudopotential and the corresponding orbital basis set (Andrae et al., 1990) were employed for the Mn atom. Our own N-layered integrated molecular orbital and molecular mechanics (ONIOM) method was applied as the QM/MM method within the electronic embedding scheme, in which the effects of the fixed MM charges are incorporated into the QM Hamiltonian (Svensson et al., 1996; Vreven et al., 2006). As shown in Figure S1, the enzyme-substrate complex (ES) is a high-spin sextet species, while its low-spin doublet and intermediate-spin quartet states are energetically not accessible. The sextet state does not suffer from any spin contamination (⟨S<sup>2</sup>⟩ equal to 8.75). The optimized minima and transition states on the potential energy surfaces were confirmed by the analysis of the corresponding Hessian matrices. Zero-point-energy corrections were calculated and added to the final energies. In order to obtain more accurate energies, single point calculations on the optimized structures were performed with the larger 6-311+G(2d,2p) basis set. For the cluster simulations, the effects of the protein environment were taken into account by using the solvation model density (SMD) (Marenich et al., 2009) with a dielectric constant (ε = 4) representing the enzyme environment (Alberto et al., 2010; Liao et al., 2010; Amata et al., 2011a,b; Himo, 2017; Piazzetta et al., 2017; Prejanò et al., 2017a,b). The energetics presented include the D3 dispersion correction (Grimme et al., 2011). To evaluate the effect of the exchange-correlation functional, single point calculations on the B3LYP optimized geometries have been performed by using the M06-L functional, which was previously demonstrated to be accurate for describing the properties of metal-containing systems (Zhao and Truhlar, 2006, 2008) (see Table S1). NBO analysis (NBO, version 3.1, 2001)<sup>2</sup> was performed on all intercepted stationary points at the QM and QM/MM levels with the B3LYP functional. Furthermore, the noncovalent interactions in the minima of the PES have been assessed by using the NCIPLOT tool (NCIPLOT, version 3.0, 2011)<sup>3</sup>.
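The quoted ⟨S<sup>2</sup>⟩ value can be checked against the ideal eigenvalue S(S+1) of a pure spin state. The short sketch below is illustrative only and is not part of the authors' protocol; the function name `ideal_s_squared` is a hypothetical helper introduced here.

```python
from fractions import Fraction


def ideal_s_squared(multiplicity: int) -> Fraction:
    """Return S(S+1) for a pure spin state with multiplicity 2S+1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if multiplicity < 1:
        raise ValueError("multiplicity must be a positive integer")
    s = Fraction(multiplicity - 1, 2)  # total spin quantum number S
    return s * (s + 1)


# The three spin states discussed for the ES complex.
for name, mult in (("doublet", 2), ("quartet", 4), ("sextet", 6)):
    value = float(ideal_s_squared(mult))
    print(f"{name:8s} 2S+1 = {mult}   ideal <S^2> = {value:.2f}")
```

For the sextet (S = 5/2) the ideal value is 35/4 = 8.75, so a computed ⟨S<sup>2</sup>⟩ of 8.75 corresponds to a spin-pure state.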

# COMPUTATIONAL SETUP AND QM MODEL DEFINITIONS

The model of the LigW active site, used for both QM and QM/MM calculations, was obtained from the three-dimensional structure of wild-type LigW from N. aromaticivorans in the presence of the substrate-like inhibitor 5-nitrovanillate (5-NV) (PDB id: 4QRN, resolution: 1.07 Å) (Vladimirova et al., 2016). This structure was used because of the very small difference (one atom) between the inhibitor (5-NV) and the substrate (5-CV). This choice has already been shown to be sufficient when structurally compared with larger QM clusters (Sheng et al., 2017). In the active site (see **Figure 1**), the manganese ion is octahedrally coordinated to Glu-19, His-188, Asp-314, one water molecule (**w1**), and the substrate. Two water molecules (**w2** and **w3**), located at about 5 Å from the substrate, and other residues of the active site pocket not directly bound to the metal ion (Arg58, Phe212, His241, Arg252, and Tyr317) are retained in the QM region.

In the QM/MM models, the Amber ff14SB force field (Maier et al., 2015), as implemented in the AMBER16 software, was used. The missing MM parameters for the substrate 5-CV were created from a single molecule optimization at the HF/6-31G(d) level of theory with the Antechamber tool, as implemented in AMBER16 (AMBER version 16, 2016)<sup>4</sup>. For this purpose, the General Amber Force Field (GAFF) (Wang et al., 2004) and the Restrained Electrostatic Potential (RESP) (Bayly et al., 1993) methods were used to derive the intramolecular and Lennard-Jones parameters and the atomic charges, respectively (see Table S2).

<sup>1</sup>Gaussian 09, Revision D.01 (2011), Gaussian, Inc., Wallingford CT.

<sup>2</sup>NBO, version 3.1 (2001).

<sup>3</sup>NCIPLOT, version 3.0 (2011). Download: http://www.lct.jussieu.fr/pagesperso/contrera/nciplot.html

<sup>4</sup>AMBER 16 (2016), University of California, San Francisco.

# QM Cluster

All the amino acids of the QM region were truncated at the α-carbons, and hydrogen atoms were added manually. In order to avoid unrealistic movements of the groups during the geometry optimizations, the truncated α-carbons of the outer coordination shell, labeled by stars in **Figure 1**, were kept fixed at their crystallographic positions. The residues were modeled according to the standard procedure (Liao et al., 2010; Amata et al., 2011a,b; Siegbahn and Himo, 2011; Blomberg et al., 2014; Himo, 2017; Prejanò et al., 2017a), considering the protonation states suggested by the experimental evidence (Vladimirova et al., 2016). The resulting model consists of 126 atoms with a total charge equal to zero. The size of the cluster is adequate to represent the chemistry involved in the considered reaction mechanisms, i.e., bond formation and breaking.

# ONIOM-1

In this model, the QM region is surrounded by the residues lying within a radius of 15 Å of the metal ion center. In this way, the interactions between the α and β subunits of the homodimer were included. Inside the considered sphere, an outer shell of residues with a thickness of 2 Å was kept fixed, and only the inner 13 Å shell was allowed to move during the QM/MM geometry optimizations. This strategy is commonly used to avoid drifting through multiple minima unrelated to the reaction coordinate. This model also includes in the MM region a number of water molecules (20) present in the crystallographic structure. The resulting model consists of 2,154 atoms, with 118 atoms in the QM region (**Figure 1**).
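The layered selection described above (a mobile inner sphere, a frozen outer shell, and everything beyond the cutoff discarded) can be sketched as a simple distance test. This is a minimal illustration under stated assumptions: it classifies individual atoms, whereas the text selects whole residues, and the function name `partition_by_distance` and the toy coordinates are hypothetical.

```python
import numpy as np


def partition_by_distance(coords, center, r_mobile=13.0, r_total=15.0):
    """Split atoms into mobile, frozen, and excluded sets by distance (in Å).

    Assumption: atom-wise selection for illustration only; a production
    setup would normally keep residues intact.
    """
    coords = np.asarray(coords, dtype=float)
    center = np.asarray(center, dtype=float)
    dist = np.linalg.norm(coords - center, axis=1)
    mobile = np.flatnonzero(dist <= r_mobile)                        # free to relax
    frozen = np.flatnonzero((dist > r_mobile) & (dist <= r_total))   # fixed outer shell
    excluded = np.flatnonzero(dist > r_total)                        # not in the model
    return mobile, frozen, excluded


# Toy coordinates (hypothetical): four atoms placed along the x axis.
metal = [0.0, 0.0, 0.0]
atoms = [[0.0, 0.0, 0.0], [12.9, 0.0, 0.0], [14.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
mobile, frozen, excluded = partition_by_distance(atoms, metal)
print("mobile:", mobile, "frozen:", frozen, "excluded:", excluded)
```

With the 13 Å and 15 Å radii quoted in the text, the first two toy atoms are mobile, the third falls in the frozen 2 Å shell, and the fourth lies outside the model.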

# ONIOM-2

A rectangular box was used to solvate the system up to 12.0 Å of the metal center. During the optimizations, all the water molecules and protein atoms in the 18 Å from the active site were kept frozen, as proposed in a recent work (Medina et al., 2017). The final model contains 11,895 atoms, with 118 atoms in the QM region. In this case, the MM region includes the whole protein and a number of water molecules within 5 Å of the catalytic domain, as depicted in **Figure 1**.

# RESULTS AND DISCUSSION

The reaction can follow two paths, leading to the formation of **CO<sub>2</sub>** or **HCO<sub>3</sub><sup>−</sup>** as products (see **Figure 2**). After the formation of the **ES** complex, the reaction proceeds with the proton transfer from Asp314 to C5 of the substrate, generating the **INT1** species, which acts as the common intermediate for the formation of the **EP\_I** or **EP\_II** complexes, from which **CO<sub>2</sub>** or **HCO<sub>3</sub><sup>−</sup>** should be released. In both decarboxylation pathways, the enzyme must generate an adjacent electron sink (such as the ketone carbonyl at C4, following the formation of the new carbon–hydrogen bond) to stabilize the incipient carbanion at C5 prior to decarboxylation. This mechanism corresponds to that explored in the recent combined experimental and theoretical work (Sheng et al., 2017), in which a membrane inlet mass spectrometry (MIMS) based assay was applied to study the LigW mechanism. This MIMS-based strategy (Sheng et al., 2017) established **CO<sub>2</sub>**, and not **HCO<sub>3</sub><sup>−</sup>**, as the reaction product. We have also considered the path for bicarbonate release, but the PESs calculated with the three models give very high energy barriers (see Table S1) that are not compatible with the enzyme kinetics.

All the PESs obtained with the models used are depicted in **Figure 3**. Those obtained with the QM cluster model will be compared with the values from the previous study based on a larger QM-cluster model (Sheng et al., 2017). The B3LYP optimized structures of all the species obtained with the ONIOM-2 model are given in **Figure 4**, while those for the QM and ONIOM-1 models are given in Figures S2, S3.

In the **ES** complex, Asp314 is oriented, as in the original X-ray structure, in a suitable way to deliver the proton to C5 of the substrate (H<sub>Asp314</sub>–C5 distance of 3.103 Å). The **w2** and **w3** water molecules, originally bonded to the metal ion and displaced upon substrate entrance, lie in proximity to the reaction site and establish an H-bond network with the surrounding amino acid residues (see Figure S4). The bond lengths in the active site of the present (126 atoms) and the previous (Sheng et al., 2017) larger (308 atoms) QM cluster study agree very well.

The formation of **INT1** takes place through the transition state **TS1**, which describes the proton transfer from Asp314 to the C5 carbon atom of 5-CV. The related imaginary frequency (669i cm<sup>−1</sup>) accounts well for this process, since it is associated with the stretching vibrational motions of the proton transfer (O–H and H–C5). The analysis of the **TS1** optimized structure (**Figure 4** for ONIOM-2 and Figure S2 for the QM cluster) reveals that the formation of the C5–H bond (1.303 Å) is more advanced in the case of the ONIOM-2 calculation. In fact, the breaking bond between the hydrogen and the oxygen of Asp314 (1.611 Å) is more elongated than the usual sp<sup>3</sup> O–H bond. Furthermore, a major distortion of the –COO<sup>−</sup> moiety out of the plane of the phenyl ring of the substrate can be observed (76 degrees in ONIOM-2 vs. 19 degrees in the QM cluster). These geometrical differences may be responsible for the slight variations in the **TS1** barrier (14.7 kcal/mol and 16.3 kcal/mol for ONIOM-2 and the QM cluster, respectively). **INT1** (**Figure 4**) is characterized by a C5–C7 single bond that is slightly elongated (1.613 Å) with respect to a canonical C–C single bond and by an sp<sup>3</sup>-hybridized C5, which make it prone to the subsequent decarboxylation step. The barrier for the **CO<sub>2</sub>** formation (**TS2\_I**) is calculated to be 13.4 kcal/mol above the **ES** complex (only 6 kcal/mol relative to **INT1**). The present QM cluster model gives a barrier of 15.1 kcal/mol, similar to the result (14.4 kcal/mol) of the previous cluster study with a larger QM size (Sheng et al., 2017).

**TS2\_I** is characterized by a C5–C7 distance of 1.853 Å and by an imaginary frequency of 129i cm<sup>−1</sup> associated with the C–C stretching (**Figure 4**). The already formed carbon dioxide is still coordinated to the metal ion (2.240 Å), and the manganese ion is still hexa-coordinated in an octahedral geometry (**Figure 4**). This topology is present in all the models used here and in the previous larger QM cluster (Sheng et al., 2017). At the end of the decarboxylation process, one molecule of carbon dioxide is released and the **EP\_I** complex is generated (see **Figure 4**). The manganese ion assumes a trigonal bipyramidal geometry due to the loss of the sixth ligand (**CO<sub>2</sub>**). The created vacancy will be filled by one of the two water molecules present in the active site (**w2** and **w3**), which is essential to restore the catalytic cycle. ONIOM-2 offers a better value of the reaction energy (0.2 kcal/mol below the **ES** complex, see **Figure 3**), while at the QM level the reaction is exergonic (−3.5 kcal/mol, see **Figure 3**). In order to verify the role of the bulk potential in the cluster model, single point computations were performed on the previously optimized structures after removing all the environmental effects. The results, reported in **Figure 3**, show that the behavior of the PES is almost retained. The largest effect (−2.3 kcal/mol) concerns the **INT1** species.

The trend of the NBO charges illustrated in **Figure 5** confirms the nonoxidative nature of the decarboxylation process, as evinced from the average values of the charges of the Mn<sup>2+</sup> (1.117 e), C5 (−0.417 e), and C7 (0.918 e) atoms in all the species intercepted on the PES. Figure S5 also shows that the nonbonded interactions (between the amino acid residues of the inner coordination shell and the metal ion) as well as the stacking interactions between the substrate (product) and Tyr317 are retained during the reaction.

All the models identify **TS1**, which describes the formation of the C5-protonated intermediate, as the rate-limiting step (14.7 kcal/mol). Based on the experimental k<sub>cat</sub> value of 27 s<sup>−1</sup> for Sphingomonas paucimobilis LigW (Sheng et al., 2017), the reaction barrier is expected to be ∼16 kcal/mol. The closeness of the experimental estimate of the reaction barrier and the computed **TS1** barrier suggests the appropriateness of the present and previous computational protocols.
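The conversion between a measured rate constant and an activation free energy is commonly done with the Eyring equation of transition state theory, ΔG<sup>‡</sup> = RT ln(k<sub>B</sub>T / (h k)). The sketch below reproduces that conversion for the quoted k<sub>cat</sub>. The temperature of 298.15 K and a transmission coefficient of 1 are assumptions made here, since the text does not state which values were used, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23         # Boltzmann constant, J/K
H = 6.62607015e-34         # Planck constant, J s
R_KCAL = 1.98720425864e-3  # gas constant, kcal/(mol K)


def eyring_barrier(rate_s, temperature=298.15):
    """Free energy barrier (kcal/mol) from a rate constant (1/s).

    Assumes a transmission coefficient of 1 (illustrative choice).
    """
    return R_KCAL * temperature * math.log(K_B * temperature / (H * rate_s))


def eyring_rate(barrier_kcal, temperature=298.15):
    """Rate constant (1/s) from a free energy barrier (kcal/mol)."""
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-barrier_kcal / (R_KCAL * temperature))


barrier = eyring_barrier(27.0)  # experimental k_cat quoted in the text
print(f"Barrier for k_cat = 27 1/s at 298.15 K: {barrier:.1f} kcal/mol")
```

Under these assumptions the estimate is about 15.5 kcal/mol, in line with the ∼16 kcal/mol quoted in the text. Because the barrier depends only logarithmically on the rate, an uncertainty of a factor of ten in k<sub>cat</sub> shifts the estimate by roughly 1.4 kcal/mol at room temperature.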

The optimized species intercepted along the PES for the bicarbonate release (step II) are shown in Figure S6. The **w3** molecule comes into play in the reaction, since it performs a nucleophilic attack on the C7 carbon (O<sub>w3</sub>–C7 distance of 1.944 Å) to generate the **HCO<sub>3</sub><sup>−</sup>** species while simultaneously donating a proton to Asp314 (H<sub>w</sub>–O<sub>Asp314</sub> distance of 1.532 Å). The obtained energy barrier is 30.3 kcal/mol (see Table S1).

# CONCLUSION

In this work, we have investigated the reaction mechanism of LigW by using three different models and two exchange-correlation functionals. This allowed us to assess the influence of the employed model on the computed structures and energetics in comparison with the available experimental data. The models comprise the full structure, its partial solvation, and the reactive center alone, with the rest represented by a bulk potential and geometrical restraints at the border. Since the results of these three models, of the previous larger QM cluster, and of the experimental studies are consistent with each other, the amino acids and waters outside the reactive center act on the reaction energetics in an average way for the present enzyme system. A similar behavior was also observed in many other enzymes (Himo, 2006, 2017; Blomberg et al., 2014). However, one should keep in mind that every enzyme system acts differently, and thus one should avoid generalizing this result despite its validity for a large variety of enzyme systems.

# AUTHOR CONTRIBUTIONS

MP, TM, and NR analyzed the results and contributed equally to editing and reviewing the manuscript. MP, TM, and NR approved it for publication.

# ACKNOWLEDGMENTS

Financial support from the Università degli Studi della Calabria -Dipartimento di Chimica e Tecnologie Chimiche (CTC) is acknowledged.

# REFERENCES


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00249/full#supplementary-material


mechanism of 5-carboxyvanillate decarboxylase. J. Am. Chem. Soc. 138, 826–836. doi: 10.1021/jacs.5b08251


Zhao, Y., and Truhlar, D. G. (2008). The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theor. Chem. Acc. 120, 215–241. doi: 10.1007/s00214-007-0310-x

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Prejanò, Marino and Russo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Interfacing the Core-Shell or the Drude Polarizable Force Field With Car–Parrinello Molecular Dynamics for QM/MM Simulations

#### Sudhir K. Sahoo† and Nisanth N. Nair\*

Department of Chemistry, Indian Institute of Technology Kanpur, Kanpur, India

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Thomas S. Hofer, University of Innsbruck, Austria Hugo Gattuso, University of Liége, Belgium

#### \*Correspondence:

Nisanth N. Nair nnair@iitk.ac.in

#### †Present Address:

Sudhir K. Sahoo, Department Chemie, Universität Paderborn, Paderborn, Germany

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 22 February 2018 Accepted: 18 June 2018 Published: 10 July 2018

#### Citation:

Sahoo SK and Nair NN (2018) Interfacing the Core-Shell or the Drude Polarizable Force Field With Car–Parrinello Molecular Dynamics for QM/MM Simulations. Front. Chem. 6:275. doi: 10.3389/fchem.2018.00275

We report a quantum mechanics/polarizable–molecular mechanics (QM/p–MM) potential based molecular dynamics (MD) technique in which the core–shell (or the Drude) type polarizable MM force field is interfaced with the plane-wave density functional theory based QM force field, which allows Car–Parrinello MD for the QM subsystem. In the QM/p-MM Lagrangian proposed here, the shell (or the Drude) MM variables are treated as extended degrees of freedom along with the Kohn–Sham (KS) orbitals describing the QM wavefunction. The shell and the KS orbital degrees of freedom are then adiabatically decoupled from the nuclear degrees of freedom. In this respect, we also present here the Nosé–Hoover Chain thermostat implementation for the dynamical subsystems. Our approach is then used to investigate the effect of MM polarization on the QM/MM results. In particular, the consequences of MM polarization for reaction free energy barriers, defect formation energy, and structural and dynamical properties are investigated. A low point charge polarizable potential (p–MZHB) for pure siliceous systems is also reported here.

#### Keywords: QMMM simulations, MD, POLARIZED MM, catalysis, CPMD-GULP

# 1. INTRODUCTION

Hybrid quantum mechanical/molecular mechanical (QM/MM) calculations offer a powerful way to bridge the length scales in a chemically complex system: a small region of the system of interest is treated by QM techniques, while the rest is described by computationally cheap MM force fields. Widely used MM force fields employ a fixed point charge model to account for the electrostatic interactions between MM atoms. QM/MM implementations with fixed charge MM models enable the polarization of the QM charge density by the MM electrostatic potential. However, such approaches cannot take into account the polarization of the MM atoms by the QM electrostatic potential. Including the polarization of MM atoms in QM/MM simulations demands the use of polarizable MM force fields, i.e., QM/polarized-MM (QM/p–MM) methods. It was reported that the inclusion of the polarization of MM atoms has significant effects on various properties (Bakowies and Thiel, 1996; Illingworth et al., 2006; Geerke et al., 2007; Lu and Zhang, 2008; Boulanger and Thiel, 2014); for instance, free energy barriers of chemical reactions are affected by about 10% (Lu and Zhang, 2008; Boulanger and Thiel, 2014).

The shell model (Dick and Overhauser, 1958) (or the Drude oscillator model) is widely used to describe the polarization of MM atoms. Molecular dynamics (MD) simulations with shell model based MM force fields can be carried out in two ways. In the conventional scheme, the shell positions are relaxed to their energy minimum (Sangster and Dixon, 1976; Lindan and Gillan, 1993) at every MD step while keeping the core coordinates fixed. In an alternative scheme, the shells are treated as extended degrees of freedom and are propagated classically to avoid the minimization of their positions (Sprik, 1991; Mitchell and Fincham, 1993; Wilson and Madden, 1993), in the spirit of the Car–Parrinello MD method (Car and Parrinello, 1985). The shell variables are often assigned a mass that is small compared to the nuclear masses. For this reason, this approach requires a smaller time step than that used in conventional MD. In practice, the shell temperature is kept close to 0 K and, most importantly, the dynamics of the shell degrees of freedom is adiabatically decoupled from the rest of the system.
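The link between a light shell mass and a small time step follows from the oscillation period of a harmonically bound core–shell pair, T = 2π√(m<sub>red</sub>/κ), where m<sub>red</sub> is the reduced mass of the pair and κ the spring constant. The sketch below illustrates this scaling only; all numerical values are hypothetical and dimensionless, and `shell_period` is a helper introduced here rather than anything from the implementation described in this paper.

```python
import math


def shell_period(kappa, m_shell, m_core):
    """Period of the core-shell stretch for spring constant kappa.

    Any consistent unit system may be used; values below are hypothetical.
    """
    reduced = m_shell * m_core / (m_shell + m_core)  # reduced mass of the pair
    return 2.0 * math.pi * math.sqrt(reduced / kappa)


# Hypothetical, dimensionless parameters: a heavy core and two shell masses.
KAPPA, M_CORE = 1.0, 1.0e6
period_heavy = shell_period(KAPPA, 4.0, M_CORE)
period_light = shell_period(KAPPA, 1.0, M_CORE)
ratio = period_heavy / period_light
print(f"period ratio for a fourfold lighter shell: {ratio:.3f}")
```

When the core is much heavier than the shell, reducing the shell mass by a factor of four roughly halves the period, so the integration time step has to shrink accordingly to resolve the shell motion.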

Different QM/MM MD schemes have been proposed to interface the shell model with ab initio methods (Sulimov et al., 2002; Nasluzov et al., 2003; Woodcock et al., 2007; Geerke et al., 2007; Lu and Zhang, 2008; Lev et al., 2010; Boulanger and Thiel, 2012; Rowley and Roux, 2012; Boulanger and Thiel, 2014; Riahi and Rowley, 2014). Ideally, the wavefunction of the QM subsystem and the positions of the shells are optimized at every MD step. In the approach by Lu and Zhang (2008), the positions of the shells were either minimized or updated only once in every MD step. The shell variables are treated as extended degrees of freedom in the QM/MM scheme proposed by Boulanger and Thiel (2012). A similar method was also reported by Rowley and Roux (2012). Recently, Loco et al. (2016, 2017) have presented a QM/MM coupling method using the AMOEBA polarizable force field. Here, SCF (self-consistent field) calculations were carried out for the induced dipoles, while either SCF or the extended Lagrangian variant of the Born–Oppenheimer MD technique (Niklasson et al., 2009) was employed for the wavefunction update. Incorporating the shell model in the extended Lagrangian scheme for Car–Parrinello MD within a QM/MM implementation is, however, not straightforward and, to the best of our knowledge, has not been attempted before.

In this paper we present an extended Lagrangian scheme to carry out Car–Parrinello MD for a QM subsystem that is coupled to a polarizable shell model based MM force field. First, we discuss the theory and the technical details of our method. Next, the implementation is validated on a cluster of five water molecules. A new polarizable MM potential for silica with low point charges is then developed. Using our implementation and this new force field for silica, we study three problems: (a) Oxygen vacancy in α–cristobalite silica; (b) Hydrogenation of ethene catalyzed by a Rh cluster supported in Y–zeolite; (c) Proton exchange between methane and H–ZSM–5 zeolite.

# 2. THEORY

### 2.1. Formulation of the Extended Lagrangian QM/p–MM Method

The Lagrangian for the conventional QM/MM Car–Parrinello MD simulation is,

$$\begin{split} \mathcal{L}\_{\text{CP/QMMM}}(\mathbf{R}, \dot{\mathbf{R}}, \boldsymbol{\phi}, \dot{\boldsymbol{\phi}}) &= \sum\_{I} \frac{1}{2} M\_{I} \dot{\mathbf{R}}\_{I}^{2} + \sum\_{i} \frac{1}{2} \mu \left< \dot{\phi}\_{i} | \dot{\phi}\_{i} \right> \\ &- E\_{\text{KS}}(\mathbf{R}, \boldsymbol{\phi}) - E\_{\text{MM}}(\mathbf{R}) - E\_{\text{QM}/\text{MM}}(\mathbf{R}, \boldsymbol{\phi}) \\ &+ \sum\_{i,j} \Lambda\_{ij} \left( \left< \phi\_{i} \mid \phi\_{j} \right> - \delta\_{ij} \right), \end{split} \tag{1}$$

where {M<sub>I</sub>} and µ are the masses of the ionic and the orbital degrees of freedom, respectively, and {**R**<sub>I</sub>} and {φ<sub>i</sub>} are the nuclear coordinates (in Cartesian) and the Kohn–Sham orbitals, respectively. Here, E<sub>KS</sub>, E<sub>MM</sub>, and E<sub>QM/MM</sub> are the energy of the QM subsystem, the energy of the MM subsystem, and the energy due to the QM–MM non-bonding interactions, respectively. The last term in the Lagrangian imposes the orthonormality constraints during the classical evolution of the Kohn–Sham orbitals (Marx and Hutter, 2009). For details of this implementation, see Laio et al. (2002) and Sahoo and Nair (2016).
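The condition enforced by the Lagrange multipliers Λ<sub>ij</sub> in Equation (1) is that the orbital overlap matrix equals the identity. The sketch below illustrates that condition with finite real vectors standing in for the orbitals. Löwdin (symmetric) orthonormalization is used here purely for illustration; it is an assumed stand-in and not the constraint algorithm of the implementation described in this paper, and the function names and toy data are hypothetical.

```python
import numpy as np


def lowdin_orthonormalize(c):
    """Symmetric orthonormalization of the columns of a real matrix c."""
    overlap = c.T @ c                         # S_ij = <phi_i|phi_j>
    eigval, eigvec = np.linalg.eigh(overlap)  # S is symmetric positive definite
    s_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return c @ s_inv_sqrt


def constraint_residual(c):
    """Largest deviation of <phi_i|phi_j> from delta_ij."""
    return np.abs(c.T @ c - np.eye(c.shape[1])).max()


# Toy data (hypothetical): four "orbitals" expanded in a 50-function basis.
rng = np.random.default_rng(0)
orbitals = rng.normal(size=(50, 4))
fixed = lowdin_orthonormalize(orbitals)
print("residual before:", constraint_residual(orbitals))
print("residual after: ", constraint_residual(fixed))
```

After the transformation the residual drops to numerical noise, which is the state the constraint term is meant to preserve along the fictitious orbital dynamics.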

In the case of the QM/p-MM implementation, MM polarization is included by augmenting the degrees of freedom by the shells {**r**<sub>k</sub>}, which are connected to a selected set of polarizable atoms P. We propose the QM/p-MM Lagrangian,

$$\begin{split} \mathcal{L}\_{\text{CP}/\text{Shell}}\left(\mathbf{R}, \dot{\mathbf{R}}, \mathbf{r}, \dot{\mathbf{r}}, \phi, \dot{\phi}\right) &= \sum\_{I} \frac{1}{2} M\_{I} \dot{\mathbf{R}}\_{I}^{2} + \sum\_{i} \frac{1}{2} \mu \left< \dot{\phi}\_{i} | \dot{\phi}\_{i} \right> + \sum\_{k}^{n\_s} \frac{1}{2} m\_{k} \dot{\mathbf{r}}\_{k}^{2} \\ &- E\_{\text{KS}}(\mathbf{R}, \phi) - E\_{\text{MM}}(\mathbf{R}, \mathbf{r}) - E\_{\text{QM}/\text{MM}}(\mathbf{R}, \mathbf{r}, \phi) \\ &+ \sum\_{i,j} \Lambda\_{ij} \left( \left< \phi\_{i} \mid \phi\_{j} \right> - \delta\_{ij} \right), \end{split} \tag{2}$$

where n<sub>s</sub> is the number of shells, and m<sub>k</sub> is the fictitious mass of a shell k. Also,

$$E\_{\rm MM}(\mathbf{R}, \mathbf{r}) = E\_{\rm b}(\mathbf{R}, \mathbf{r}) + E\_{\rm nb}(\mathbf{R}, \mathbf{r}) + \sum\_{k}^{n\_s} \frac{1}{2} \kappa\_k s\_{k}^2, \tag{3}$$

where

$$s\_k = |\mathbf{s}\_k| = |\mathbf{R}l - \mathbf{r}\_k| \,,$$

having the shell k harmonically bound to the core atom I ∈ P. Here E<sub>b</sub> refers to the sum of all the bonding terms in the MM potential, which is conventionally defined over the shell variables. The total non–bonding interaction energy, E<sub>nb</sub>, is the sum of the dispersive and the electrostatic interactions within the MM subsystem. The dispersive interactions are defined over the shell degrees of freedom, while the electrostatic interactions span both the core and the shell degrees of freedom. Further,

$$\begin{split} E\_{\rm QM/MM}(\mathbf{R}, \mathbf{r}) &= E\_{\rm b}^{\prime}(\mathbf{R}, \mathbf{r}) + E\_{\rm vdw}^{\prime}(\mathbf{R}, \mathbf{r}) - \sum\_{I} q\_{I}^{\rm c} \int d\mathbf{\overline{r}} \rho(\mathbf{\overline{r}}) \frac{1}{|\mathbf{R}\_{I} - \mathbf{\overline{r}}|} \\ &- \sum\_{k} q\_{k}^{\rm s} \int d\mathbf{\overline{r}} \rho(\mathbf{\overline{r}}) \frac{1}{|\mathbf{r}\_{k} - \mathbf{\overline{r}}|}, \end{split} \tag{4}$$

where E′<sub>b</sub> and E′<sub>vdw</sub> are the energy contributions due to the bonding and the dispersive interactions between the QM and the MM atoms, respectively. The last two terms in the above equation account for the electrostatic interaction of the point charges of the core ({q<sup>c</sup><sub>I</sub>}) and the shell ({q<sup>s</sup><sub>k</sub>}) degrees of freedom, respectively, with the electronic density ρ. Electrostatic interactions are computed in real space with the modified Coulomb kernel as in Laio et al. (2002).

For the success of this approach it is crucial that the Lagrangian in Equation (2) leads to dynamics close to that on the Born–Oppenheimer surface. This is ensured by starting the MD simulations with the optimized {φ} and {**r**}, while maintaining the temperatures of the orbital (T<sub>φ</sub>) and the shell (T<sub>s</sub>) degrees of freedom close to zero, considering

$$\min\_{\{\phi,\mathbf{r}\}} \lim\_{T\_{\phi},T\_{s}\to 0} \mathcal{L}\_{\text{CP}/\text{Shell}}\left(\mathbf{R},\dot{\mathbf{R}},\mathbf{r},\dot{\mathbf{r}},\phi,\dot{\phi}\right) \to \mathcal{L}\left(\mathbf{R},\dot{\mathbf{R}}\right).\tag{5}$$

The physical temperature (T<sub>phys</sub>) is defined as,

$$T\_{\rm phys} = \frac{1}{N\_{\rm f} k\_{\rm B}} \sum\_{I} \mathcal{M}\_{I} \dot{\mathbf{S}}\_{I}^{2},\tag{6}$$

while the shell temperature is defined as,

$$T\_s = \frac{1}{3n\_s k\_B} \sum\_{k}^{n\_s} \overline{m}\_k \dot{\mathbf{s}}\_k^2. \tag{7}$$

In the above equations, **Ṡ**<sub>I</sub> is the velocity of the center of mass of a core–shell pair (I, k), and **ṡ**<sub>k</sub> is the relative velocity of the shell k connected to a core atom I. Here,

$$\mathbf{S}\_{I} = \frac{1}{\mathcal{M}\_{I}} \left( M\_{I} \mathbf{R}\_{I} + m\_{k} \mathbf{r}\_{k} \right),$$

and ℳ<sub>I</sub> and m̄<sub>k</sub> are the total mass of a core–shell pair (I, k) and the reduced mass of a shell k, respectively:

$$
\mathcal{M}\_I = M\_I + m\_k
$$

$$
\overline{m}\_k = \frac{M\_I \, m\_k}{M\_I + m\_k}
$$

In the above, k<sub>B</sub> is the Boltzmann constant, N<sub>f</sub> is the total number of nuclear degrees of freedom and n<sub>s</sub> is the total number of shell variables. We have implemented this method in the CPMD/GULP QM/MM interface program, as developed in Sahoo and Nair (2016), where the plane wave density functional theory (DFT) based CPMD (CPMD, 132) code is interfaced with the MM based GULP (Gale, 1997) program.
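As an illustration of Equations (6) and (7), the following Python sketch evaluates the physical and the shell temperatures from Cartesian core and shell velocities. It is not taken from the CPMD/GULP interface: the function name, the argument layout and the simplification that every atom carries exactly one shell are assumptions made for this example, and atomic units are assumed throughout.

```python
import numpy as np

K_B = 3.166811563e-6  # Boltzmann constant in Hartree/K (atomic units)


def core_shell_temperatures(M, m, V_core, v_shell, n_free=None):
    """Return (T_phys, T_s) for a set of core-shell pairs.

    M, m     : core masses and fictitious shell masses, shape (n,)
    V_core   : Cartesian core velocities, shape (n, 3)
    v_shell  : Cartesian shell velocities, shape (n, 3)
    n_free   : number of nuclear degrees of freedom N_f
               (assumed to be 3 n if not given)
    """
    M = np.asarray(M, dtype=float)[:, None]
    m = np.asarray(m, dtype=float)[:, None]
    V_core = np.asarray(V_core, dtype=float)
    v_shell = np.asarray(v_shell, dtype=float)

    M_tot = M + m                    # total mass of each core-shell pair
    m_red = M * m / M_tot            # reduced mass of each shell
    S_dot = (M * V_core + m * v_shell) / M_tot  # center-of-mass velocity
    s_dot = V_core - v_shell         # relative velocity, s = R - r

    n_s = M.shape[0]
    if n_free is None:
        n_free = 3 * n_s

    T_phys = np.sum(M_tot * S_dot**2) / (n_free * K_B)   # Equation (6)
    T_s = np.sum(m_red * s_dot**2) / (3 * n_s * K_B)     # Equation (7)
    return T_phys, T_s
```

With this decomposition, a shell that moves rigidly with its core contributes only to T<sub>phys</sub>, whereas a pair with zero net momentum contributes only to T<sub>s</sub>.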

At this stage, the following points are noted:


Accordingly, we have strategized the application of our implementation. The time step of integration and the masses of both the shell and the orbital degrees of freedom have to be chosen such that adiabatic separation between the nuclear subsystem and the subsystem containing shells and orbitals is maintained. We choose m<sub>k</sub> = µ (i.e., the masses of the shell and the orbital degrees of freedom are taken to be the same), which allows us to choose the same time step for integrating the equations of motion for all the subsystems.

#### 2.2. Implementation of Nosé–Hoover Chain Thermostat for Shell Dynamics

To obtain stable dynamics and to achieve a canonical ensemble, it is crucial to couple the dynamical subsystems with thermostats. We have implemented three separate sets of Nosé–Hoover Chain (NHC) thermostats (Martyna et al., 1992). The system temperature is maintained at T<sub>phys</sub> using one set of NHC thermostats, whereas the shell and the orbital variables are maintained close to 0 K using two separate thermostats. We coupled the nuclear and the shell NHC thermostats to the center of mass motion and the relative motion of the core–shell pairs, respectively, and the corresponding equations of motion are given by,

$$\begin{split} \mathcal{M}\_{I}\ddot{\mathbf{S}}\_{I} &= \mathbf{F}\_{I}^{(\text{S})} - \mathcal{M}\_{I}\dot{\mathbf{S}}\_{I}\frac{p\_{\eta\_{1}}}{q\_{1}} \\ \dot{p}\_{\eta\_{1}} &= \sum\_{I} \mathcal{M}\_{I}\dot{\mathbf{S}}\_{I}^{2} - N\_{\text{f}}k\_{\text{B}}T\_{\text{phys}} - \frac{p\_{\eta\_{2}}}{q\_{2}}p\_{\eta\_{1}} \\ \dot{p}\_{\eta\_{j}} &= \frac{p\_{\eta\_{j-1}}^{2}}{q\_{j-1}} - k\_{\text{B}}T\_{\text{phys}} - \frac{p\_{\eta\_{j+1}}}{q\_{j+1}}p\_{\eta\_{j}}, \quad j = 2, \dots, n\_{\text{c}} - 1, \\ \dot{p}\_{\eta\_{j}} &= \frac{p\_{\eta\_{j-1}}^{2}}{q\_{j-1}} - k\_{\text{B}}T\_{\text{phys}}, \quad j = n\_{\text{c}}, \\ \dot{\eta}\_{j} &= \frac{p\_{\eta\_{j}}}{q\_{j}}, \quad j = 1, \dots, n\_{\text{c}}. \end{split}$$

Here **F**<sub>I</sub><sup>(S)</sup> is the force acting on the center of mass coordinates **S**<sub>I</sub>, and {p<sub>η<sub>j</sub></sub>} and {q<sub>j</sub>} are the momenta and the masses of the thermostat variables {η<sub>j</sub>}. The number of thermostat variables, n<sub>c</sub>, is chosen to be more than one.
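The chain equations above can be sketched in Python as a function that returns the time derivatives of the thermostat momenta and positions. This is only an illustrative evaluation of the right-hand sides, not an integrator and not the CPMD/GULP code; the function name and argument names are hypothetical.

```python
import numpy as np


def nhc_derivatives(p_eta, q, twice_kinetic, n_dof, kT):
    """Right-hand sides of the Nose-Hoover chain equations.

    p_eta         : thermostat momenta p_{eta_j}, length n_c (n_c >= 2)
    q             : thermostat masses q_j, length n_c
    twice_kinetic : sum over particles of (mass * velocity**2)
    n_dof         : number of thermostatted degrees of freedom
    kT            : k_B times the target temperature
    Returns (dp_eta/dt, d eta/dt).
    """
    p = np.asarray(p_eta, dtype=float)
    q = np.asarray(q, dtype=float)
    n_c = p.size
    if n_c < 2:
        raise ValueError("the chain needs more than one thermostat variable")

    dp = np.empty(n_c)
    # first link: driven by the kinetic energy of the physical variables
    dp[0] = twice_kinetic - n_dof * kT - p[1] / q[1] * p[0]
    # intermediate links: driven by the previous thermostat variable
    for j in range(1, n_c - 1):
        dp[j] = p[j - 1] ** 2 / q[j - 1] - kT - p[j + 1] / q[j + 1] * p[j]
    # last link: no further thermostat is attached to it
    dp[-1] = p[-2] ** 2 / q[-2] - kT

    d_eta = p / q
    return dp, d_eta
```

The friction coefficient entering the particle equation of motion is `p_eta[0] / q[0]`. The same function applies to any chain of this form once the kinetic term, the number of degrees of freedom and the target temperature are replaced accordingly.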

In order to thermostat the shell dynamics, we write the shell equations of motion in relative coordinates as,

$$\begin{aligned} \overline{m}\_k \ddot{\mathbf{s}}\_k &= \mathbf{f}\_k^{(s)} - \overline{m}\_k \dot{\mathbf{s}}\_k \frac{p\_{\eta\_1^\*}}{q\_1^\*} \\ \dot{p}\_{\eta\_1^\*} &= \sum\_k \overline{m}\_k \dot{\mathbf{s}}\_k^2 - 3n\_s k\_{\mathrm{B}} T\_s - \frac{p\_{\eta\_2^\*}}{q\_2^\*} p\_{\eta\_1^\*} \\ \dot{p}\_{\eta\_j^\*} &= \frac{p\_{\eta\_{j-1}^\*}^2}{q\_{j-1}^\*} - k\_{\mathrm{B}} T\_s - \frac{p\_{\eta\_{j+1}^\*}}{q\_{j+1}^\*} p\_{\eta\_j^\*}, \quad j = 2, \dots, n\_c^\* - 1, \\ \dot{p}\_{\eta\_j^\*} &= \frac{p\_{\eta\_{j-1}^\*}^2}{q\_{j-1}^\*} - k\_{\mathrm{B}} T\_s, \quad j = n\_c^\*, \\ \dot{\eta}\_j^\* &= \frac{p\_{\eta\_j^\*}}{q\_j^\*}, \quad j = 1, 2, \cdots, n\_c^\*. \end{aligned}$$

Here **f**<sub>k</sub><sup>(s)</sup> is the force acting on the relative coordinate **s**<sub>k</sub> of a core–shell pair (I, k). The thermostat variables for the relative motion are {η<sup>∗</sup><sub>j</sub>}, having masses {q<sup>∗</sup><sub>j</sub>}, and their conjugate momenta are given by {p<sub>η<sup>∗</sup><sub>j</sub></sub>}. In our practical implementation, we transform the Cartesian coordinates of a core–shell pair to the corresponding relative and center of mass coordinates at every MD step. The transformation of forces for a core–shell pair (I, k) is achieved by,

$$\begin{aligned} \mathbf{F}\_I^{(\mathrm{S})} &= \mathbf{F}\_I + \mathbf{f}\_k^{(\mathrm{r})},\\ \mathbf{f}\_k^{(\mathrm{s})} &= \frac{1}{\mathcal{M}\_I} \left( m\_k \mathbf{F}\_I - M\_I \mathbf{f}\_k^{(\mathrm{r})} \right). \end{aligned}$$

Here, **F**<sub>I</sub> and **f**<sub>k</sub><sup>(r)</sup> are the Cartesian forces on an atom (or core) I and on a shell k connected to core I, respectively. A similar coordinate transformation from Cartesian to normal mode coordinates and vice versa was reported by Marx et al. (1999) and can also be found in the work of Lamoureux and Roux (2003).
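The coordinate and force transformations for a single core–shell pair can be sketched in Python as follows. The function names are hypothetical and the code is only an illustration of the relations given above (with **s** = **R** − **r**), not an excerpt from the CPMD/GULP interface.

```python
import numpy as np


def to_com_relative(M, m, R, r):
    """Cartesian core/shell positions -> center of mass S and relative s."""
    M_tot = M + m
    S = (M * R + m * r) / M_tot
    s = R - r
    return S, s


def to_cartesian(M, m, S, s):
    """Inverse transformation: (S, s) -> Cartesian core R and shell r."""
    M_tot = M + m
    R = S + (m / M_tot) * s
    r = S - (M / M_tot) * s
    return R, r


def transform_forces(M, m, F_core, f_shell):
    """Cartesian forces -> force on S and force on the relative coordinate."""
    M_tot = M + m
    F_S = F_core + f_shell                      # net force on the pair
    f_s = (m * F_core - M * f_shell) / M_tot    # force conjugate to s
    return F_S, f_s
```

A quick consistency check is the harmonic core–shell spring itself: with Cartesian forces −κ**s** on the core and +κ**s** on the shell, the transformed forces are zero on the center of mass and −κ**s** on the relative coordinate, as expected for an internal restoring force.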

#### 3. LOW POINT CHARGE POLARIZABLE FORCE FIELD FOR SILICA

Herein, the previously reported low point charge potential (MZHB) for silica (Sahoo and Nair, 2015) is extended to include polarization of O atoms using the core–shell model. This new potential is termed p–MZHB hereafter. We have re-parameterized k<sub>θ</sub> of the O–Si–O and Si–O–Si angles and optimized the core–shell coupling parameters {κ<sub>k</sub>}. During the parameterization, the point charges of the core and the shell of the O atoms were allowed to vary, while their sum was fixed to −0.35 e. The re-parameterization was done using the GULP (Gale, 1997) program to reproduce the experimental structure of α–quartz (Jorgensen, 1978). The final set of parameters is given in **Table 1**. The detailed validation of the p–MZHB potential is discussed in the Supporting Information.

#### 4. RESULTS AND DISCUSSION

#### 4.1. Validation of the Implementation in CPMD/GULP Interface Program

Total energy conservation during MD runs is tested to validate the method and the implementation. For benchmarking, we used a 1H<sub>2</sub>O(QM)+4H<sub>2</sub>O(p–MM) system (**Figure 1**), and carried out NVE MD runs for 36 ps using DFT/PBE (Perdew et al., 1996) to describe the QM subsystem and the polarizable de Leeuw–Parker (de Leeuw and Parker, 1998) potential to treat the MM subsystem; see Supporting Information for other technical details. In this MM potential, shells are added on the oxygen atoms of the water molecules.


For the cross Lennard–Jones parameters, the Lorentz–Berthelot (Leach, 2010; Schlick, 2010) combination rule is applied.

The total energy, the orbital kinetic energy and the shell temperature of the system were monitored; see **Figure 2**. The drift in total energy is only of the order of 10<sup>−7</sup> a.u. atom<sup>−1</sup> ps<sup>−1</sup>, indicating that the total energy conservation is fairly good. The plot of the kinetic energy of the orbitals shows only a small long–time drift, while the temperature of the shell variables is maintained at a low value. Such small drifts in kinetic energy could be controlled by connecting the orbital degrees of freedom with a thermostat, as demonstrated below.
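A drift of this kind can be estimated from a trajectory of total energies, for instance with a linear fit. The text does not state how the drift was extracted, so the least-squares fit below, the function name and the normalization by the number of atoms (without counting shells) are assumptions made for illustration only.

```python
import numpy as np


def energy_drift_per_atom_per_ps(time_ps, energy_au, n_atoms):
    """Slope of a least-squares line through E(t), divided by n_atoms.

    time_ps   : simulation times in ps
    energy_au : total (or conserved) energy in atomic units at those times
    n_atoms   : number of atoms used for normalization
    Returns the drift in a.u. per atom per ps.
    """
    # np.polyfit with degree 1 returns [slope, intercept]
    slope, _intercept = np.polyfit(time_ps, energy_au, 1)
    return slope / n_atoms
```

For a thermostatted run the same estimate would be applied to the conserved quantity of the extended system rather than to the total energy.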

Next, we repeated the same simulation in the NVT ensemble. Here, three thermostats were coupled to the nuclear, the orbital and the shell degrees of freedom, and these were thermostated to 300 K (T<sub>phys</sub>), 0.0007 Hartree and 1 K, respectively. Since the NHC thermostat possesses a conserved quantity, the drift in this conserved quantity allows us to verify our implementation further. The drift in the conserved quantity is only of the order of 10<sup>−7</sup> a.u. atom<sup>−1</sup> ps<sup>−1</sup>, confirming the correctness of our implementation (see **Figure 3A**). The orbital kinetic energy and the shell temperature plots (see **Figure 3B–D**) clearly show that the dynamics of the extended variables is stable and well thermostated.

#### 4.2. Benchmark Studies Using α–Cristobalite Silica

FIGURE 2 | NVE MD simulation using QM/p–MM implementation for 1H<sub>2</sub>O (QM) + 4H<sub>2</sub>O (p–MM) system: (A) total energy (violet) and potential energy (green), (B) orbital kinetic energy, and (C) shell temperature. All the energies are in Hartree and the temperature is in Kelvin. E<sub>Cons</sub> is the drift in total energy per atom per ps.

To further benchmark the performance of the developed method, we carried out MD simulation of α–cristobalite silica in the NVT ensemble at 300 K. Here we use the DFT/PBE (Perdew et al., 1996) level of theory for the QM part and the p–MZHB potential for the MM part. We used a supercell of 8 × 8 × 8 (Si<sub>2048</sub>O<sub>4096</sub>) for QM/p–MM calculations. The optimized structure and lattice parameters obtained with the p–MZHB MM potential were used here. Multiple QM/p–MM calculations were carried out with different QM sizes: 2T (Si<sub>2</sub>O<sub>7</sub>), 8T (Si<sub>8</sub>O<sub>25</sub>), 14T (Si<sub>14</sub>O<sub>40</sub>), and 26T (Si<sub>26</sub>O<sub>67</sub>), where T stands for a SiO<sub>4</sub> tetrahedral unit (see **Figure 4**).

The structure (inner QM atoms only) obtained from the QM/p–MM simulation was compared with QM data ("all–QM") and MM data ("all–MM"); see **Table 2**. The difference between the "all–QM" and the QM/p–MM data is only 0.01 Å for the Si–O bond length, 0.2° for the O–Si–O angle and 0.8° for the Si–O–Si angle. However, structures near the boundary deviate more from the "all–QM" data; see Supporting Information. This is expected due to the boundary effects in QM/MM calculations (as also seen in Sahoo and Nair, 2016).

Next, the vacancy formation energy (ΔE<sub>f</sub>) in α–cristobalite silica was computed as

$$
\Delta E\_{\rm f} = \frac{x}{2} \left[ E(\text{O}\_2) + E\_{\rm diss}(\text{O}\_2) \right] + E(\text{SiO}\_{2-x}) - E(\text{SiO}\_2) \tag{8}
$$

where E(O<sub>2</sub>), E<sub>diss</sub>(O<sub>2</sub>), E(SiO<sub>2−x</sub>), and E(SiO<sub>2</sub>) are the energy of the O<sub>2</sub> molecule (in the triplet electronic ground state), the dissociation energy of the O<sub>2</sub> molecule, the energy of bulk silica with an oxygen vacancy, and the energy of pure bulk silica, respectively. E(O<sub>2</sub>) was computed from "all–QM" calculations whereas E(SiO<sub>2−x</sub>) and E(SiO<sub>2</sub>) were computed either by "all–QM" or QM/p–MM calculations. E<sub>diss</sub>(O<sub>2</sub>) = 5.16 eV was taken from the available experimental data (Lide, 2005). ΔE<sub>f</sub> values computed from "all–QM" calculations with varying supercell sizes are listed in **Table 3**. The converged value of ΔE<sub>f</sub> (w.r.t. supercell size) is 8.73 eV.
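Equation (8) amounts to a short piece of arithmetic, sketched below in Python. The function and argument names are hypothetical; the only value taken from the text is the experimental O<sub>2</sub> dissociation energy of 5.16 eV, and all energies are assumed to have been converted to eV beforehand.

```python
def vacancy_formation_energy(e_defect, e_pristine, e_o2, e_diss_o2=5.16, x=1.0):
    """Vacancy formation energy following Equation (8), all energies in eV.

    e_defect   : energy of bulk silica with the oxygen vacancy, E(SiO2-x)
    e_pristine : energy of pure bulk silica, E(SiO2)
    e_o2       : energy of the triplet O2 molecule, E(O2)
    e_diss_o2  : O2 dissociation energy (5.16 eV, experimental value in the text)
    x          : number of removed O atoms per formula unit entering Equation (8)
    """
    # (x/2) [E(O2) + E_diss(O2)] is the reference energy of the removed oxygen
    oxygen_reference = 0.5 * x * (e_o2 + e_diss_o2)
    return oxygen_reference + e_defect - e_pristine
```

The defect and pristine energies must come from the same level of theory and the same supercell (or the same QM/p–MM partitioning) for the difference to be meaningful.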

ΔE<sub>f</sub> values computed from QM/p–MM calculations, listed in **Table 4**, show that ΔE<sub>f</sub> is nearly converged to 8.77 eV, which is in excellent agreement with the "all–QM" data. However, it may be noted that the values predicted by QM/p–MM calculations are sensitive to the size of the QM subsystem. The ΔE<sub>f</sub> value of 9.43 eV obtained using the 2T site is a very poor estimate compared with the "all–QM" data. Clearly, the computed value converges close to the "all–QM" data with increase in the size of the QM subsystem (see **Table 4**).

TABLE 2 | The average value of Si–O bond length (Å), O–Si–O, and Si–O–Si angles (°) of α–cristobalite computed from MD simulations in NVT ensemble at 300 K using p–MZHB, MZHB (Sahoo and Nair, 2015), QM/p–MM (14T), and QM potentials. These results are also compared with experimental data (Downs and Palmer, 1994). For details see text.

<sup>a</sup>Using p–MZHB MM potentials.

<sup>b</sup>Using PBE density functional.

<sup>c</sup>Using PBE density functional and p–MZHB MM force–field.

Atom colors: Si (yellow), O (red).

TABLE 3 | ΔE<sub>f</sub> computed from the single point energy calculations using "all–QM" with PBE density functional.

TABLE 4 | ΔE<sub>f</sub> computed from single point energy calculations using QM/p–MM potential.

The bond length distribution of the Si–Si bond (r<sub>SiSi</sub>) at the defect site is then computed from the NVT trajectory using "all–QM" (using a 3×3×3 supercell), QM/MM, and QM/p–MM potentials; see **Figure 5**. The average r<sub>SiSi</sub> value using the "all–QM" potential is 2.38 Å. Using the non–polarized MZHB QM/MM simulations, we obtained a slightly increased value of 2.41 Å, while this was 2.40 Å in the polarized case. Overall, **Figure 5** shows that the mean and the standard deviation of r<sub>SiSi</sub> from QM/p–MM MD agree better with the "all–QM" MD results than those from the QM/MM MD using the non–polarizable MM force–field.

# 4.3. Application: Hydrogenation of Ethene Catalyzed by Rh Clusters Supported in Y–Zeolite

In order to investigate the effect of polarization of MM atoms on the free energy barriers, we revisit the study of hydrogenation of ethene catalyzed by Rh clusters supported in Y–zeolites, as in our previous study (Sahoo and Nair, 2016). Based on the experimental results from the Gates group (Liang and Gates, 2008) and the previous study (Sahoo and Nair, 2016), we choose the hydrogenated cluster Rh<sub>3</sub>H<sub>7</sub> in Y–zeolite as the catalyst model. In order to model Y–zeolite, a supercell of size 2 × 2 × 2 of pure siliceous Y–zeolite (Si<sub>1536</sub>O<sub>3072</sub>) was taken. The metal cluster, its ligands, and the metal–coordinated T25 site of the Y–zeolite were treated in the QM region (at the level of DFT/PBE+D2; Perdew et al., 1996; Grimme, 2006), while the rest of the zeolite was treated using MZHB or p–MZHB MM potentials.

The free energy profile for ethene hydrogenation was computed employing metadynamics (Laio and Parrinello, 2002) using the QM/MM and QM/p–MM implementations; see **Figure 6** for the mechanism of the reaction studied here. For the details of the metadynamics simulation setup, see Supporting Information. In the metadynamics simulations using QM/MM and QM/p–MM methods, we observed the same reaction mechanism. Here, one of the hydrogen atoms from the Rh<sub>3</sub>H<sub>7</sub> cluster moved to one of the C atoms of ethene forming an intermediate **1.3**, which further reacted with a hydrogen atom on the Rh cluster to form ethane (**1.4**).

Interestingly, we observed differences of only about 1 kcal mol<sup>−1</sup> in the free energy barriers computed from QM/p–MM and QM/MM computations; see **Figure 6C**. The main difference is in the stability of the ethyl intermediate **1.3** and in the free energy barrier for **1.3**→**1.4**. However, the difference in free energy is only ∼1 kcal mol<sup>−1</sup>, which is close to the error associated with the metadynamics simulation. Thus, we conclude that for this specific reaction, the effect of MM polarization on the free energy estimates and on the reaction mechanism is negligible.

# 4.4. Application: Proton Exchange Between Methane and H–ZSM–5 Zeolite

Methane activation is one of the important steps in various industrially relevant processes such as the conversion of methane to higher hydrocarbons and methanol. In this process, the C–H bond, which is quite inert as evident from the high bond formation energy (> 100 kcal mol<sup>−1</sup>), is activated for further functionalization. Importantly, protonated zeolites, particularly H–ZSM–5, are known to activate the methane C–H bonds (Arzumanov et al., 2014; Chu et al., 2016). Here we look at the following chemical reaction (**Figure 7A**):

$$\mathrm{H}-\mathbf{ZSM} + \mathrm{H}'-\mathrm{CH}\_3 \rightarrow \mathrm{H}'-\mathbf{ZSM} + \mathrm{H}-\mathrm{CH}\_3. \tag{9}$$

d<sub>H1−C2</sub> − d<sub>C2−H3</sub>, which is the difference in the distances H1–C2 and C2–H3.

One of the acidic protons in the H–ZSM–5 zeolite is exchanged with one of the protons in the methane molecule during this reaction. The mechanism of C–H activation in zeolites has been studied previously in the literature (Vollmer and Truong, 2000; Zheng and Blowers, 2005; Truitt et al., 2006; Bučko et al., 2007; Gabrienko et al., 2011; Arzumanov et al., 2014; Tuma and Sauer, 2015; Chu et al., 2016). Two types of mechanisms have been proposed for this reaction: (a) direct and (b) bimolecular. In the direct mechanism, hydrogen is exchanged through a carbonium ion intermediate, i.e., via the formation of a penta–coordinated carbocation. In the bimolecular mechanism, the alkane molecule transfers one of its H atoms to the zeolite framework, forming an alkoxyl intermediate (Truitt et al., 2006), which subsequently undergoes reactions with other alkane molecules. Chu et al. (2016) have reported from both experimental and theoretical studies that higher alkanes react through the bimolecular mechanism whereas the lower alkanes react via the direct mechanistic route. Our interest here is the direct mechanism, as it involves the formation of a charged reactive intermediate, and we anticipate some effect of MM polarization on the free energy estimates.

We could successfully simulate the proton exchange reaction using the Temperature Accelerated Sliced Sampling (TASS) method (Awasthi and Nair, 2017). This method allowed us to explore a high–dimensional free energy landscape composed of five collective variables (CVs). More details about the CVs used here and other technical details of the TASS simulation are given in the Supporting Information. The computed free energy profiles (projected along one of the crucial CVs) are given in **Figure 7C**. In TASS+QM/MM and TASS+QM/p–MM simulations, the hydrogen exchange reaction was found to proceed through the formation of a carbonium ion (CH<sub>5</sub><sup>+</sup>; see structure **2.2** in **Figure 7B**). The free energy barriers (ΔF<sup>‡</sup>) for the reaction computed from QM/MM and QM/p–MM simulations are 39.5 and 36.0 kcal mol<sup>−1</sup>, respectively. As expected, we observe a significant difference in the free energy barriers when the polarized MM force–field is used. It is also noted in passing that the free energy barriers computed here are close to the potential energy barriers computed for similar reactions in zeolites in Vollmer and Truong (2000), Zheng and Blowers (2005), Bučko et al. (2007), and Tuma and Sauer (2015), which are in the range 29–38 kcal mol<sup>−1</sup>.

# 5. CONCLUSIONS

An extended Lagrangian based implementation of the QM/p–MM method that allows one to perform conventional Car–Parrinello MD for the QM subsystem is presented here. In particular, we have discussed a combined scheme where the extended Lagrangian dynamics of the shell (or Drude) variables is performed together with the Car–Parrinello dynamics of the KS orbitals. Inclusion of polarization does not increase the computational cost of the QM/MM Car–Parrinello MD simulations within our approach, mainly because we use the same time step as that of the conventional Car–Parrinello MD.

We find that invoking polarization of MM atoms in QM/MM calculations only marginally improves the predictions of equilibrium structure and dynamics of non–charged systems. On the other hand, when the transition state is charged, free energy barriers are considerably affected by the inclusion of MM polarization. We believe that the methods and the strategies developed here for a QM/polarized–MM implementation will be useful to study more complex problems in catalysis, reactions at solid–liquid interfaces, crystallization, etc.

Other than these, we also report here a new low point charge polarizable potential for silica, which is suitable for performing QM/MM calculations. The potential is shown to perform well and is able to reproduce bulk structures of various silica polymorphs. The developed MM potential has simple and commonly used potential functions with few parameters, making it easy to use in various simulation packages such as GULP, DL\_POLY (Todorov et al., 2006) and LAMMPS (Plimpton, 1995).

## AUTHOR CONTRIBUTIONS

NN planned the work and SS executed the implementation and computations. Both authors analyzed the results and wrote the paper.

# ACKNOWLEDGMENTS

Authors acknowledge the HPC facility, IIT Kanpur. SS thanks CSIR New Delhi and IIT Kanpur for the Ph.D. fellowship. Authors are thankful to Dr. Shalini Awasthi (IIT Kanpur) for her help in setting up the metadynamics and the TASS simulations. Authors are also grateful for the valuable inputs from Prof. Alessandro Laio (SISSA, Italy).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00275/full#supplementary-material

#### REFERENCES

alkanes on Ga-modified zeolite BEA studied with 1H magic angle spinning nuclear magnetic resonance in situ. J. Phys. Chem. C 115, 13877–13886. doi: 10.1021/jp204398r


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor and reviewer TH declared their involvement as co-editors in the Research Topic, and confirm the absence of any other collaboration.

Copyright © 2018 Sahoo and Nair. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Binding Mode of the Sonic Hedgehog Inhibitor Robotnikinin, a Combined Docking and QM/MM MD Study

#### Manuel Hitzenberger<sup>1,2</sup>, Daniela Schuster<sup>3</sup> and Thomas S. Hofer<sup>1</sup>\*

<sup>1</sup> Theoretical Chemistry Division, Institute of General, Inorganic and Theoretical Chemistry, University of Innsbruck, Innsbruck, Austria, <sup>2</sup> Department of Physics, Theoretical Biophysics (T38), Technical University of Munich, Munich, Germany, <sup>3</sup> Pharmaceutical Chemistry, Institute of Pharmacy, University of Innsbruck, Innsbruck, Austria

Erroneous activation of the Hedgehog pathway has been linked to a large number of cancerous diseases, and therefore many studies aiming at its inhibition have been carried out. One leverage point for novel therapeutic strategies targeting the proteins involved is the prevention of complex formation between the extracellular signaling protein Sonic Hedgehog and the transmembrane protein Patched 1. In 2009 robotnikinin, a small molecule capable of binding to and inhibiting the activity of Sonic Hedgehog, was identified; however, in the absence of X-ray structures of the Sonic Hedgehog-robotnikinin complex, the binding mode of this inhibitor remains unknown. In order to aid the identification of novel Sonic Hedgehog inhibitors, the presented investigation elucidates the binding mode of robotnikinin by performing an extensive docking study, including subsequent molecular mechanical as well as quantum mechanical/molecular mechanical molecular dynamics simulations. The attained configurations enabled the identification of a number of key protein-ligand interactions, aiding complex formation and providing stabilizing contributions to the binding of the ligand. The predicted structure of the Sonic Hedgehog-robotnikinin complex is provided via a PDB file as Supplementary Material and can be used for further reference.

#### Edited by:

Jean-Philip Piquemal, Sorbonne Universités, France

#### Reviewed by:

Jitrayut Jitonnom, University of Phayao, Thailand Albert Poater, University of Girona, Spain

> \*Correspondence: Thomas S. Hofer t.hofer@uibk.ac.at

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 25 July 2017 Accepted: 25 September 2017 Published: 23 October 2017

#### Citation:

Hitzenberger M, Schuster D and Hofer TS (2017) The Binding Mode of the Sonic Hedgehog Inhibitor Robotnikinin, a Combined Docking and QM/MM MD Study. Front. Chem. 5:76. doi: 10.3389/fchem.2017.00076

Keywords: sonic hedgehog (Shh), QM/MM, robotnikinin, sonic hedgehog inhibitor, metalloproteins, density functional theory, docking studies, molecular dynamics simulation

# 1. INTRODUCTION

The Hedgehog (Hh) family of proteins derives its name from the malformations that occur in larvae of Drosophila flies upon alteration of the respective gene (Varjosalo and Taipale, 2008). While for Drosophila and other invertebrates only one variant of this protein is known, at least three different forms occur in vertebrates, namely Indian Hedgehog (Ihh), Desert Hedgehog (Dhh) and Sonic Hedgehog (Shh) (Ingham and McMahon, 2001; Varjosalo and Taipale, 2008). Incidentally, Dhh is more closely related to the Drosophila variant of the protein, whereas Ihh and Shh share many similarities, implying a more recent gene duplication event (Ingham and McMahon, 2001). The Hedgehog signaling pathway plays an important role in several crucial events during embryogenesis, including patterning of the neural tube and limb and lung development; it also steers the segmentation of insect bodies (Ingham and McMahon, 2001; Jeong and McMahon, 2004; Varjosalo and Taipale, 2008). With the exception of adult stem cell differentiation (Palma et al., 2004), Hh signaling is mostly dormant in adults. However, aberrant activation of the pathway has been linked to a large number of cancerous diseases (Hahn et al., 1996; Goodrich et al., 1997; Berman et al., 2003; Hamed et al., 2004; Kubo et al., 2004), such as bladder cancer, medulloblastoma, breast cancer, esophageal cancer or rhabdomyosarcoma, and it has therefore increasingly been targeted as a leverage point for novel anti-cancer therapies.

Shh is the most widely expressed Hedgehog variant in vertebrates (Varjosalo and Taipale, 2008), and thus most information concerning the biochemical pathways involving Hedgehog proteins has been gathered through investigations of Shh (Hwang et al., 2011). Sonic Hedgehog is synthesized inside the cell as a 45 kDa precursor protein, consisting of a 20 kDa N-terminal signaling domain and a C-terminal auto-catalytically active processing domain. Upon cleavage of the C-terminus, the remaining signaling peptide is modified with an N-terminal palmitic acid moiety and a C-terminal cholesterol molecule in order to mature into the morphologically active ShhN form (also referred to as Shh from this point on). The processed protein is then secreted into the extracellular matrix, where it acts as a ligand for the transmembrane protein Patched 1 (Ptc1). The binding of Shh to Ptc1 mediates the release of another transmembrane protein, Smoothened (Smo), which in turn migrates to the cell's primary cilium, from where it activates the glioma-associated oncogene (Gli) transcription factors, thereby promoting the expression of Shh pathway-specific genes. An alternative binding partner on the cellular surface is the Hedgehog-interacting protein (Hhip), which is upregulated upon Shh binding and functions as a decoy for Shh, hence acting as an antagonist of pathway activation (Ingham and McMahon, 2001; Varjosalo and Taipale, 2008; Bosanac et al., 2009). Most approaches aiming to counteract the abnormal activation of the Hh pathway target the deactivation of Smo or of the transcription factors themselves (Varjosalo and Taipale, 2008; De Smaele et al., 2012). Another viable approach would be the inhibition of Shh binding to Ptc1, thereby increasing the therapeutic selectivity by minimizing the risk of unwanted deactivation of important biological pathways that are associated with Smo or Gli but independent of Shh (Rimkus et al., 2016). Known potential ligands include antibodies (Maun et al., 2010) as well as the small molecule robotnikinin (molecular weight = 454.95 g/mol; Stanton et al., 2009).

According to various X-ray crystallographic investigations, Shh possesses three divalent metal ions: two Ca(II) ions, bound in loop regions by residues E90, E91, E127, D96, D130, and D132 (McLellan et al., 2008), and a Zn(II) ion, coordinated by two histidine residues (H141, H183), an aspartate (D148) and a water molecule bridging the ion with glutamate E177 (Bishop et al., 1999, 2009; Bosanac et al., 2009). Structurally, the zinc site is analogous to those of zinc hydrolases such as thermolysin or bacterial carboxypeptidase A (Hall et al., 1995); however, extensive studies of Shh could not confirm any enzymatic activity (Fuse et al., 1999). Still, the existence of these ions suggests that they are important for Shh to carry out its role in the pathway, hence several experimental and theoretical studies have been undertaken to uncover their influence (Bishop et al., 1999, 2009; McLellan et al., 2008; Bosanac et al., 2009; Maun et al., 2010; Hwang et al., 2011, 2013).

This work is a follow-up on a recent computational study investigating the role of the metal ions of Shh, utilizing classical molecular mechanics (MM) as well as hybrid quantum mechanical (QM)/molecular mechanical molecular dynamics (MD) simulations (Hitzenberger and Hofer, 2016). One of the findings of this study was that simple MM based approaches are not sufficient to provide an accurate model for the complex interactions present in the Zn(II) binding site. The utilized DFT BP86-D3 (Perdew, 1986; Becke, 1988; Grimme et al., 2010), triple zeta (TZ) QM/MM (Warshel and Levitt, 1976; Lyne et al., 1987, 1990; Åqvist and Warshel, 1993) link bond (Hitzenberger and Hofer, 2015) approach, however, has been shown to reproduce the available experimental data very accurately.

Targeting extracellular proteins that serve as ligands of transmembrane proteins can be a very challenging task, as highlighted by the effort required to discover robotnikinin, which necessitated the screening of a set of 10,000 diverse compounds (Stanton et al., 2009). Rational drug design could aid in the development of novel compounds that are able to inhibit Shh. This, however, requires knowledge of the Shh-robotnikinin binding mode, which is still unknown since no experimental data on this complex have been published yet. For this reason, a QM/MM MD-refined docking study providing detailed and highly accurate information on the interactions of robotnikinin with Shh is presented in this work. The combination of docking, force field approaches, quantum mechanics and molecular dynamics enables an exhaustive investigation of the system. By explicitly considering the dynamical aspects of the complex at QM level, the chosen approach is able to account for small conformational adaptations concerning the binding geometry and interaction profile (De Vivo et al., 2016).

# 2. METHODOLOGY

# 2.1. Classical Simulation Setup

The starting point for the docking of robotnikinin to Shh was the equilibrated classical simulation box used in the previous investigations (Hitzenberger and Hofer, 2016), which are themselves based on an X-ray structure (Bosanac et al., 2009) (PDB: 3HO5). All histidine residues were protonated at the ε-nitrogen, with the only exception being Hid 183, which was protonated at the δ-position to enable a binding geometry akin to the one reported by X-ray investigations (Bosanac et al., 2009). The acidic and basic sidechains were all in the protonation state predominantly present at physiological pH. In order to generate a reasonable docking pose in which the ligand adequately occupies the binding groove, two iterations of docking with an intermediate MD simulation were necessary, since in the starting structures, stemming from simulations of empty Shh, the binding groove is not accessible to the ligand in its entirety. The first cycle of docking was performed using the software package MOE (Chemical Computing Group Inc., 2016), employing the AMBER-12:EHT force field and induced fit docking. The triangle matcher method was used to place the conformers of robotnikinin in the pseudo active site, and the London dG function was utilized for scoring prior to the refinement of the poses via the force field. After that, the poses were re-scored via the GBVI/WSA function, which was also used for the final ranking of the docking poses. Since all known interactions of Shh with its binding partners are mediated by amino acids present in the binding groove of Sonic Hedgehog (McLellan et al., 2008; Bosanac et al., 2009; Maun et al., 2010; Hwang et al., 2011, 2013), all residues in the respective region were selected as potential receptor sites in the docking step of the study. A set of diverse but highly ranked structures was selected for classical MD simulations in order to generate structures for the second docking cycle. 
The simulations were carried out using the AMBER-12SB (Zgarbova et al., 2011) force field in order to remain consistent with the settings used for docking. All ligand interactions were described by a GAFF force field generated via Antechamber (Case et al., 2014), a program that is part of the AMBER14 suite. Merz-Kollman partial charges (Singh and Kollman, 1984) were derived by performing Hartree-Fock (HF) calculations with a 6-31G<sup>∗</sup> (Hariharan and Pople, 1973; Krishnam et al., 1980; Francl et al., 1982; Clark et al., 1983; Gill et al., 1992) basis set using GAUSSIAN 09 (Frisch et al., 2009), as required by the AMBER force field. The Ca(II) and Zn(II) ions of Shh as well as the chloride counter ions were described by the parameters (Aaqvist, 1990; Li and Merz, 2014) provided with the AMBER14 (Case et al., 2014) simulation package. The complexes were placed in periodic, cubic simulation boxes with a volume of ∼540,000 Å<sup>3</sup> and solvated in approximately 17,000 rigid TIP3P (Jorgensen et al., 1983) water molecules. The non-bonded cutoff was set to 10 Å and the long range interactions were treated by the particle mesh Ewald (PME) (Darden et al., 1993) method. In order to satisfy the requirements of the chosen NpT ensemble, temperature coupling was carried out via Langevin dynamics with a collision frequency of 1.0 ps<sup>−1</sup>, while the pressure was controlled by the Berendsen manostat (Berendsen et al., 1984) with a relaxation time of 2 ps. The SHAKE (Ryckaert et al., 1977) algorithm was applied to constrain all bonds involving hydrogen, enabling a time step of 2.0 fs.
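
As a quick plausibility check of the box description above, the edge length of a cubic box and the number of water molecules that would fill it completely can be estimated from the stated volume. This is only a sketch: the bulk water number density of roughly 0.0334 molecules/Å³ (about 1 g/cm³) is an assumption introduced here, not a value taken from the simulation setup, and the function names are illustrative.

```python
# Sanity check of the reported simulation box (values from the text unless noted).
BOX_VOLUME_A3 = 540_000.0      # approximate box volume in cubic angstroms
WATER_NUMBER_DENSITY = 0.0334  # ASSUMPTION: bulk water at ~1 g/cm^3, molecules per cubic angstrom


def cubic_edge(volume_a3: float) -> float:
    """Edge length (angstrom) of a cube with the given volume."""
    return volume_a3 ** (1.0 / 3.0)


def water_capacity(volume_a3: float, density: float = WATER_NUMBER_DENSITY) -> float:
    """Number of water molecules that would fill the volume if no solute were present."""
    return volume_a3 * density


edge = cubic_edge(BOX_VOLUME_A3)
capacity = water_capacity(BOX_VOLUME_A3)
print(f"box edge ~ {edge:.1f} angstrom, pure-water capacity ~ {capacity:.0f} molecules")
```

Under these assumptions the box edge is a little over 81 Å and a box of pure water would hold about 18,000 molecules, which is compatible with the roughly 17,000 molecules reported, since the protein-ligand complex displaces part of the solvent.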

After an initial energy minimization of 60,000 steps, utilizing the sander module of AMBER14 (Case et al., 2014), the systems were heated for 2 ns to the target temperature of 300 K using the MPI version of pmemd (Case et al., 2014), with positional restraints keeping the protein-ligand complex fixed. Subsequently, the restraints were lifted and the systems were equilibrated for 15 ns at a pressure of 1 atm, again using the MPI version of pmemd. The 100 ns production run was performed using the CUDA (Nickolls et al., 2008) implementation of pmemd, thereby considerably speeding up the process.
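
To make the scale of the classical protocol explicit, the phase lengths given above can be converted into numbers of integration steps using the 2.0 fs time step. The sketch below is plain arithmetic on values from the text; the dictionary keys and the function name are illustrative only.

```python
# Convert the MM MD phase lengths from the text into integration step counts.
TIME_STEP_FS = 2.0  # time step enabled by SHAKE, from the text

PHASES_NS = {
    "heating": 2.0,         # restrained heating to 300 K
    "equilibration": 15.0,  # unrestrained equilibration at 1 atm
    "production": 100.0,    # production run
}


def n_steps(length_ns: float, dt_fs: float = TIME_STEP_FS) -> int:
    """Number of MD steps needed to cover length_ns (1 ns = 1e6 fs)."""
    return round(length_ns * 1.0e6 / dt_fs)


steps = {name: n_steps(length) for name, length in PHASES_NS.items()}
for name, count in steps.items():
    print(f"{name}: {count:,} steps")
```

The 60,000 minimization steps are not included, since they are optimization steps rather than time integration steps.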

After the first MM MD run, the simulation in which the ligand displayed the lowest root mean square deviation was selected for the preparation of the actual simulation system by redocking the ligand using identical settings as before. The highest scoring structure was then used for another MM MD simulation, following the same protocol as above.

# 2.2. QM/MM Setup

Choosing an appropriate QM method to describe the chemically most relevant part of the system is imperative to gain accurate and representative data from the simulation; therefore, a reasonable compromise between accuracy and computational demand has to be made. DFT methods have been found to work fairly well when employed for systems containing metal ions (Kuta et al., 2006; Lepšík and Field, 2007; Hierao, 2011; Ryde and Grimme, 2011), and even though there are examples where DFT fails to deliver accurate results (Schwenk et al., 2004; Radoń and Pierloot, 2008; Yoo et al., 2009; Rowley and Roux, 2012; Gillan et al., 2016), at the moment it still represents the best tradeoff between computational cost and reliability (Senn and Thiel, 2009). Alternatives like HF have been shown to be inadequate for such systems (Ryde and Grimme, 2011), and second-order Møller-Plesset perturbation theory (MP2), while computationally much more costly, is known to occasionally perform worse than DFT (Ryde and Grimme, 2011). More sophisticated methods, such as Coupled Cluster (CC) or Configuration Interaction (CI), are too demanding to be utilized for the description of systems of the size studied in this work. Moreover, the previously published QM/MM study of this protein has shown that the BP86 functional (Perdew, 1986; Becke, 1988), along with the cc-pVTZ (Dunning, 1989) (for C, H, N and O atoms) and def2-TZVP (Wiegend and Ahlrichs, 2005) (for Zn) basis sets and the D3 correction, is able to adequately describe the system while still being economical enough to enable acceptable trajectory lengths (Hitzenberger and Hofer, 2016). For this reason, and in order to produce data that are directly comparable, the same setup has been chosen for this work. However, for one of the simulations, the double-zeta (DZ) versions of the mentioned basis sets have been used. 
The resolution of identity (RI) (Ren et al., 2012) approach has been employed alongside the D3 correction (Grimme et al., 2010) to speed up the calculation of the four-center two-electron integrals and to improve the description of dispersion effects, respectively.

The system has been partitioned into a QM and an MM zone with the focus of attention on the Zn(II) coordination site and the ligand, since the classical part of the investigation (docking and MM MD) strongly suggested that the Zn(II)-robotnikinin interaction is of great importance to the stability of the resulting complex. An energy and structure adjusted link atom approach was applied to describe all bonds penetrating the interface between the QM and the MM zone (Amara and Field, 2003; Lin and Truhlar, 2007; Hitzenberger and Hofer, 2015; Messner, 2015). In order to cleanly terminate the QM region without the introduction of artifacts stemming from the QM/MM coupling, a set of suitable link atom parameters {ρ, r<sub>0</sub>, k<sub>L</sub>} (Hitzenberger and Hofer, 2015) was derived and can be found in **Table 1**.

TABLE 1 | Ideal link atom parameters for RI-BP86-D3, cc-pVTZ, or cc-pVDZ with embedding charges scaled by a factor of 0.666; ρ refers to the distance ratio between C<sup>α</sup> and C<sup>β</sup> on which the link atom is placed, r<sub>0</sub> (in Å) and k<sub>L</sub> (in kcal/mol/Å<sup>2</sup>) represent the minimum and the force constant of the harmonic energy correction potential of the link bond (Hitzenberger and Hofer, 2015), respectively.

<sup>a</sup>Histidine protonated at the δ position.

<sup>b</sup>Histidine protonated at the ε position.

<sup>c</sup>First robotnikinin link bond.

<sup>d</sup>Second robotnikinin link bond.

The embedding of the QM into the MM zone was handled via the electrostatic embedding method, where the QM atoms interact with their MM counterparts via inclusion of MM partial charges into the QM Hamiltonian. While the charges of the QM atoms were updated at every step and calculated via the Mulliken method (Mulliken, 1962), the MM charges used were provided by the force field. The problem with such a setup is that MM point charges used in most popular force fields and water models are usually not tailored to accurately represent the electron density of a molecule but to reproduce certain observables (such as the permittivity of water) when applied together with the other force field parameters (Senn and Thiel, 2009). In general, it can be assumed that the point charges of the force field are not compatible with the DFT-derived Mulliken charges. The oxygen atom of a TIP3P water molecule (Jorgensen et al., 1983), for example, possesses a charge of −0.83e, while for a water molecule in bulk conditions where the partial charge is calculated by the used QM setup it is only −0.55e. Similar over-polarization can be witnessed with the amino acids of the protein. In order to minimize the risk of sampling artificially strong interactions between QM and MM species at the QM/MM interface, a (QM method dependent) scaling factor has been applied to all MM charges used for embedding. At the same time, the charges used to calculate MM/MM interactions were not altered. The applied scaling factor is in principle a simplification of an embedding scheme where all charges next to a QM atom are described by a Gaussian distribution (Amara and Field, 2003), thereby providing a distance dependent charge scaling. This Gaussian scheme was confirmed to be a very accurate embedding method when compared to more traditional techniques (Amara and Field, 2003; König et al., 2005); however, it is not compatible with many popular QM programs and strongly depends on the proper parametrization of the blurring width. Therefore, for this study, a fixed scaling factor of 0.666 was used for all QM/MM MD simulations, as this ensured that the oxygen charges of water molecules in close proximity to the QM region are scaled down to −0.55e. 
Additionally, test calculations showed that the mean QM/MM charge deviation of the amino acids His, Asp, and Glu (all residues in very close proximity to the QM zone are of these types) is reduced from 0.17 to 0.08 as well when the scaling factor is applied. The use of this scaling factor already proved successful in a previous study of Shh (Hitzenberger and Hofer, 2016). All remaining interactions between the QM and MM species, like bending and torsional terms are handled via the force field. All terms, however, where the central atoms are exclusively QM species have been excluded in order to prevent extensive double counting of forces (Eurenius et al., 1996).
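
The effect of the fixed scaling factor can be illustrated with the water example given above. The following is a minimal sketch, not the QMCF implementation: it scales a copy of the MM charges for use in the embedding while leaving the original list (used for MM/MM interactions) untouched. The TIP3P charges of −0.834e on oxygen and +0.417e on each hydrogen are the standard published values (rounded to −0.83e in the text); the function name is hypothetical.

```python
# Sketch of fixed-factor charge scaling for electrostatic embedding.
EMBEDDING_SCALE = 0.666  # scaling factor from the text


def embedding_charges(mm_charges, scale=EMBEDDING_SCALE):
    """Return scaled copies of the MM charges for the QM Hamiltonian.

    The input list is not modified, so MM/MM interactions keep the
    unscaled force field charges.
    """
    return [q * scale for q in mm_charges]


# Standard TIP3P charges: oxygen, hydrogen, hydrogen (in units of e).
tip3p = [-0.834, 0.417, 0.417]
scaled = embedding_charges(tip3p)
print([round(q, 3) for q in scaled])
```

A uniform factor preserves charge neutrality of each water molecule, which is one reason it is simpler to apply than a distance dependent scheme.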

Altogether, four different QM/MM MD simulations have been conducted in the course of this study. **Table 2** presents an overview of the differences between the simulations. Since the MM derived data suggest that robotnikinin directly coordinates to the Zn(II) site, the QM zone for the first simulation (henceforth called "core simulation") was chosen so that all important interactions around the ion were described by quantum mechanics. Therefore, the Zn(II) ion, the sidechains of H141, H183, D148, E177, the macrocycle, plus the chain containing the second amide function of robotnikinin (see **Figures 1A,B** for detailed information regarding QM/MM partitioning), as well as all water molecules within a radius of 5.5 Å were considered to be QM species. If robotnikinin as a whole were included in the QM calculation, all residues in the vicinity of its phenyl and chlorophenyl rings would have to be described by QM as well, because interactions between aromatics in the QM and the MM zone are very sensitive to the QM/MM potential and thus very difficult to describe correctly. Consequently, the addition of only the aromatic rings would very likely have lowered the predictive power of the simulation. The inclusion of all potential interaction partners of the rings in question would have led to a very large QM system, making it impossible to sample a reasonable number of configurations. In consequence, robotnikinin was truncated in the QM system and the substituents were described by the force field. Since the parametrization of the resulting link bonds was very thorough and the ionic site is distant enough to justify an MM description of the aromatic moieties, this provides an adequate compromise between effort and accuracy.
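
The link bonds created by truncating the sidechains and the ligand rely on the parameters {ρ, r0, kL} introduced in the QM/MM setup. A minimal sketch of how such parameters might be used is given below. Several points are assumptions made for illustration and are not confirmed by the text: that ρ is measured from the QM atom toward the MM atom, that the harmonic correction is applied to the length of the bond crossing the boundary, and that it carries no factor of 1/2. The coordinates and parameter values in the example are hypothetical and are not taken from Table 1.

```python
import math


def place_link_atom(r_qm, r_mm, rho):
    """Place a link atom on the line from the QM atom to the MM atom.

    ASSUMPTION: rho is the fraction of the QM-MM distance measured from the QM atom.
    """
    return tuple(q + rho * (m - q) for q, m in zip(r_qm, r_mm))


def link_bond_correction(distance, r0, k_l):
    """Harmonic energy correction k_l * (distance - r0)^2.

    ASSUMPTION: no 1/2 prefactor; the convention of the cited work may differ.
    """
    return k_l * (distance - r0) ** 2


# Hypothetical example: C-beta (QM) at the origin, C-alpha (MM) 1.53 angstrom away.
c_beta = (0.0, 0.0, 0.0)
c_alpha = (1.53, 0.0, 0.0)
link = place_link_atom(c_beta, c_alpha, rho=0.71)  # rho value is illustrative only
print(link, math.dist(c_beta, link))
```

Tying the link atom position to the two boundary atoms in this way means that it introduces no additional degrees of freedom into the dynamics.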

The starting point of the core QM/MM simulation was the structure resulting from the final MM MD simulation. During the first 5 ps of the equilibration phase, the atoms coordinating Zn(II) according to the MM simulation were restrained to the ion via harmonic bonds. In the course of this pre-equilibration process, which was conducted at the target temperature of 300 K, the force constants of the bonds were lowered from 500 kcal/mol/Å<sup>2</sup> to zero in 3 steps (250 kcal/mol/Å<sup>2</sup>, 100 kcal/mol/Å<sup>2</sup>, 0 kcal/mol/Å<sup>2</sup>). This was followed by 10 ps of equilibration and an 80 ps sampling phase. Since the core simulation, in contrast to the MM simulation, suggested that robotnikinin forms additional hydrogen bonds with two histidine sidechains (H134 and H135) not described by QM, an additional simulation including those residues in the QM zone has been set up in order to confirm the existence of these interactions. The starting structure for this second simulation was taken from the core simulation.
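
The stepwise release of the Zn(II) restraints can be sketched as a simple lookup of the force constant as a function of simulation time. The text gives the sequence of force constants and the total length of 5 ps, but not how long each stage lasted; the equal stage lengths used below are therefore an assumption, as is the form of the harmonic restraint without a factor of 1/2. All names are illustrative.

```python
# Sketch of a stepwise restraint release during pre-equilibration.
FORCE_CONSTANTS = (500.0, 250.0, 100.0, 0.0)  # kcal/mol/angstrom^2, from the text
PRE_EQUILIBRATION_PS = 5.0                    # total length, from the text


def force_constant_at(t_ps: float) -> float:
    """Restraint force constant at time t_ps.

    ASSUMPTION: the four stages have equal length (1.25 ps each).
    """
    if t_ps < 0.0:
        raise ValueError("time must be non-negative")
    stage_length = PRE_EQUILIBRATION_PS / len(FORCE_CONSTANTS)
    stage = int(t_ps // stage_length)
    if stage >= len(FORCE_CONSTANTS):
        return 0.0  # restraints fully released after pre-equilibration
    return FORCE_CONSTANTS[stage]


def restraint_energy(distance: float, r0: float, k: float) -> float:
    """Harmonic restraint energy k * (distance - r0)^2 (ASSUMPTION: no 1/2 prefactor)."""
    return k * (distance - r0) ** 2


for t in (0.0, 1.3, 2.6, 4.0, 6.0):
    print(t, force_constant_at(t))
```

Lowering the force constant in stages rather than removing the restraints at once gives the QM description time to adapt the coordination geometry inherited from the MM model.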

However, this enlargement of the system results in a significant increase of the computing time by a factor of approximately 2; thus, a third simulation utilizing only a double zeta (DZ) basis has been conducted, reducing the time needed per simulation step by a factor of four. This simulation was used to gather additional configurations in order to improve the statistics on which the evaluated importance of the newly found interactions is based. After a 10 ps equilibration phase, a 90 ps evaluation trajectory was sampled for this simulation. In order to assess the accuracy of the results obtained at the DZ level, a further simulation, this time of empty Shh, was conducted. The starting point of this simulation was the equilibrated TZ simulation of empty Shh taken from the previously published study (Hitzenberger and Hofer, 2016). All settings remained the same; however, the basis set was switched to the double zeta variant and the link bond parameters were adjusted accordingly. After 20 ps of equilibration, a 120 ps long sampling trajectory was produced. The process flow used in this investigation is visualized in **Figure 2**.

TABLE 2 | Overview of the conducted QM/MM simulations.

The number of QM atoms includes hydrogen atoms but does not account for water molecules. The core system contains Zn(II), robotnikinin (excluding the aromatic rings), H141, D148, E177, and H183.

Calculation of MM forces, the QM/MM coupling and the MD simulation itself were all handled by the in-house-developed QMCF (Rode et al., 2006; Hofer et al., 2010) simulation package. All quantum mechanical calculations were carried out using TURBOMOLE (Turbomole, 2007) and the temperature was controlled by the Berendsen thermostat (Berendsen et al., 1984) with a relaxation time of 1.0 ps. The nonbonded interactions were calculated explicitly up to a distance of 10 Å, while long range interactions were dealt with by the reaction field method (Barker and Watts, 1973) assuming a permittivity of ε = 78.355. In order to allow for a time step of 2.0 fs, the SHAKE algorithm (Ryckaert et al., 1977) was applied and the equations of motion were solved using the velocity-Verlet integrator (Swope et al., 1982).
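
To illustrate how a reaction field correction modifies the Coulomb interaction inside the cutoff, the sketch below evaluates a commonly used shifted reaction field pair energy with the cutoff of 10 Å and the permittivity of 78.355 quoted above. The functional form (including the constant shift that makes the energy vanish at the cutoff) and the conversion factor of 332.0637 kcal Å/(mol e²) are assumptions made for this example; the text does not specify which variant is implemented in QMCF.

```python
# Sketch of a shifted reaction field pair energy (one common formulation).
COULOMB_CONSTANT = 332.0637  # kcal * angstrom / (mol * e^2)
EPS_RF = 78.355              # permittivity beyond the cutoff, from the text
CUTOFF = 10.0                # angstrom, from the text


def rf_constants(eps_rf: float = EPS_RF, r_c: float = CUTOFF):
    """Return (k_rf, c_rf) for a vacuum-like inner region (relative permittivity 1)."""
    k_rf = (eps_rf - 1.0) / ((2.0 * eps_rf + 1.0) * r_c ** 3)
    c_rf = 1.0 / r_c + k_rf * r_c ** 2
    return k_rf, c_rf


def reaction_field_energy(q_i: float, q_j: float, r: float,
                          eps_rf: float = EPS_RF, r_c: float = CUTOFF) -> float:
    """Pair energy in kcal/mol; zero at and beyond the cutoff."""
    if r >= r_c:
        return 0.0
    k_rf, c_rf = rf_constants(eps_rf, r_c)
    return COULOMB_CONSTANT * q_i * q_j * (1.0 / r + k_rf * r ** 2 - c_rf)


k_rf, c_rf = rf_constants()
print(f"k_rf * r_c^3 = {k_rf * CUTOFF ** 3:.4f}")
print(f"E(+1, -1, 3 angstrom) = {reaction_field_energy(1.0, -1.0, 3.0):.2f} kcal/mol")
```

For a high permittivity such as that of water, the dimensionless factor k_rf·r_c³ approaches 1/2, so the result is not very sensitive to the exact value of ε.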

# 3. RESULTS AND DISCUSSION

## 3.1. MM MD Simulations

The binding site of Shh is located at the surface and shaped like a groove; therefore, for a molecule like robotnikinin two general categories of binding poses are conceivable: one where the chlorine atom points toward the region binding the Ca(II) ions and a second one, rotated by roughly 180°, with the chlorine atom oriented in the opposite direction (see **Figures 3A,B**). During the docking phase two crucial properties were highlighted. Firstly, poses where an oxygen atom of robotnikinin coordinates to the Zn(II) ion were scored much higher than poses without direct robotnikinin-Zn(II) interactions. Nevertheless, MM MD simulations of complexes without direct Zn(II)-robotnikinin interactions have been conducted, all resulting in the dissociation of the complex. This indicates that the interaction with the Zn(II) site is very important for the stability of the complex, which is not surprising, since interactions between metal ions and polar sites are much stronger than simple hydrogen bonds. Furthermore, from experimental studies of the Shh-Hhip complex, it is well established that the Zn(II) ion is indeed accessible to ligands (Bosanac et al., 2009). The second finding was that the category containing the poses with the chlorine pointing away from the Ca(II) site is by far the prevalent one, since nearly every highly ranked pose that could be taken into consideration for a simulation was of that variant. Very likely, steric effects preventing the formation of stable hydrophobic interactions of the chlorophenyl ring with its environment play an important role. These steric clashes occur because the part of the binding groove connecting the Zn(II) and the Ca(II) sites is slightly too short to cleanly incorporate the chlorophenyl ring, which is separated from the macrocycle by 4 covalent bonds. In contrast, the phenyl ring is directly bonded to the macrocycle and thus can be easily positioned in this part of the binding site. 
Nevertheless, in order to make sure that the most likely binding pose is selected for the subsequent QM/MM MD simulations, the highest ranked docking structures of both robotnikinin binding pose families have been selected and MM MD simulations have been performed following the protocol from the previous section. The chlorophenyl moiety did not interact very strongly with any part of the protein when oriented in the direction of the Ca(II) ions. Instead, it switched positions very frequently, leading to many configurations in which the aromatic rings of the ligand interacted with each other and occasionally to robotnikinin dissociating from the protein. The only stable simulation of this binding mode category resulted in the robotnikinin heavy atom root mean square deviation (RMSD) shown in **Figure 3C**, which is very high compared to the RMSD derived from the simulation in which the ligand is rotated, as shown in **Figure 3D**. From this and a visual inspection of the trajectories it can be gathered that poses of the type depicted in **Figure 3B** seem to be the prevalent ones, and therefore all following simulations were based on them.
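
The heavy atom RMSD used above to compare the two binding pose families can be written down compactly. The sketch below is a generic implementation under the assumption that both coordinate sets contain the same atoms in the same order and that the frames have already been superimposed on a common reference (for example the protein backbone); it performs no fitting itself, and it is not the tool used to produce Figure 3.

```python
import math


def rmsd(coords, reference):
    """Root mean square deviation between two equally ordered coordinate sets.

    ASSUMPTION: the frames are already aligned; no superposition is performed.
    """
    if len(coords) != len(reference) or not coords:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(coords, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(coords))


# Toy example: two atoms, the second displaced by 2 angstrom along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
frame = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
print(f"{rmsd(frame, reference):.3f}")
```

Because the squared displacements are averaged over all atoms, a single mobile group such as a flipping aromatic ring can dominate the RMSD of a small ligand.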

FIGURE 2 | Chart depicting the process flow of the investigation. "Empty" refers to Shh without the ligand, TZ and DZ to QM/MM simulations utilizing triple or double zeta basis sets.

In order to elucidate the structural differences between loaded and empty Shh, heavy-atom RMSDs for all individual residues comparing each evaluation frame of loaded Shh (MM MD) to the first frame of a simulation of empty Shh (Hitzenberger and Hofer, 2016) (also MM MD) have been calculated. To eliminate high RMSDs stemming from natural residue fluctuations, per-residue RMSDs of empty Shh versus itself have been calculated as well. These "natural fluctuations" have then been subtracted from the raw data. The resulting "corrected," color coded per-residue RMSDs are shown in a heatmap plot in **Figure 4A**, which was generated using the "Heat Mapper" tool provided with the VMD (Humphrey et al., 1996) package. The regions showing a high deviation correspond to the coil region depicted in **Figure 4B**, functioning as a lid for the binding groove.

The radial distribution function (RDF) calculated from the 100 ns MM MD simulation trajectory shows a mean coordination number of 7.5 oxygen or nitrogen atoms around the Zn(II) ion (counting up to a Zn(II)-ligand distance of 3.18 Å where the RDF shows a minimum). The ion is coordinated by H141, D148 (bidentate), E177 (mono- or bidentate), H183 and the amide-oxygen atoms of robotnikinin. The amide function in the macrocycle shows the higher affinity and is coordinated throughout the whole simulation (see **Figure 5**). The aromatic phenyl and chlorophenyl rings interact with the nearby sidechains of T126, H181, and Y175 via hydrophobic interactions. There seem to be no interactions between the ester or amide groups of robotnikinin and any of the surrounding sidechains. Instead, the used classical model predicts the formation of a hydrogen bond between the ester function and the macrocyclic amide. If a hydrogen-donor distance of under 3.0 Å and a donor-acceptor-hydrogen angle lower than 35° are used as geometrical criteria for a hydrogen bond, then 66.1% of the 1,000 configurations chosen for evaluation feature an intramolecular hydrogen bond.
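
The geometric hydrogen bond criteria quoted above can be expressed as a short test on three atomic positions. The sketch below follows one possible reading of the criteria, which is an assumption: the distance is measured between the hydrogen and the acceptor atom, and the angle is taken at the acceptor between the directions to the donor and to the hydrogen, so that a perfectly linear hydrogen bond gives 0°. The text does not spell out the atom order, and analysis tools differ in their conventions.

```python
import math

MAX_DISTANCE = 3.0  # angstrom, distance cutoff from the text
MAX_ANGLE = 35.0    # degrees, angle cutoff from the text


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def angle_at(vertex, p1, p2):
    """Angle in degrees at `vertex` between the directions to p1 and p2."""
    v1, v2 = _sub(p1, vertex), _sub(p2, vertex)
    cos_t = sum(a * b for a, b in zip(v1, v2)) / (_norm(v1) * _norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def is_hbond(donor, hydrogen, acceptor,
             max_distance=MAX_DISTANCE, max_angle=MAX_ANGLE):
    """Apply the distance and angle criteria to one donor-H...acceptor triple."""
    # ASSUMPTION: the distance criterion refers to the hydrogen-acceptor separation.
    if _norm(_sub(hydrogen, acceptor)) >= max_distance:
        return False
    # ASSUMPTION: donor-acceptor-hydrogen angle, i.e. the vertex is the acceptor.
    return angle_at(acceptor, donor, hydrogen) < max_angle


linear = is_hbond((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0))
bent = is_hbond((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.2, 0.0, 0.0))
print(linear, bent)
```

With this convention, a small angle at the acceptor corresponds to a donor-hydrogen-acceptor arrangement close to 180°.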

## 3.2. QM/MM MD Simulations

The core QM/MM MD simulation, for which an 80 ps long evaluation trajectory has been sampled, paints a different picture than the purely classical approach: here, the Zn(II) ion is predominantly coordinated by D148 (bidentate), E177 (monodentate), H183 and the amide oxygen that is part of the chain connecting the chlorophenyl ring of robotnikinin to its macrocycle. The coordination polyhedron is of square pyramidal shape, as can be seen in **Figure 5B**, and according to an RDF (**Figure 5D**) that was calculated for the 4,000 frames of the evaluation trajectory, the mean coordination number is 5.1 when calculated up to a Zn(II)-ligand distance of 2.83 Å (where the RDF reaches its first minimum), with the most likely ion-ligand distance being 2.03 Å. Incidentally, H141, part of the coordination sphere in empty Shh (Hall et al., 1995; McLellan et al., 2008; Bishop et al., 2009; Bosanac et al., 2009; Hitzenberger and Hofer, 2016), is mostly situated at distances between 3 and 4 Å from the ion, making way for robotnikinin and E177 and enabling a bidentate binding mode of D148. However, there are also configurations where it directly binds the ion, thereby creating a square bipyramidal coordination polyhedron.

FIGURE 3 | (A) A snapshot from an MM MD simulation started from a docking pose with the chlorine atom of the ligand pointing toward the Ca(II) ions (depicted in orange). The snapshot shows a configuration in which the chlorophenyl ring left the binding groove and is oriented toward the ligand's second aromatic ring instead of the Ca(II) ions. This is caused by a lack of stabilizing hydrophobic interactions with the protein as well as steric clashes and indicates that the chosen starting structure is not a stable binding mode. (B) A snapshot from an MM simulation started from the preferred docking pose. (C) Heavy atom RMSD of robotnikinin, calculated from the trajectory of the simulation shown in (A). (D) Heavy atom RMSD of robotnikinin, derived from the trajectory of the simulation shown in (B). Each frame in the RMSD plots represents 100 ps.

Another interesting aspect of the coordination site around the ion is E177, which has been shown to be mostly bridged by a water molecule in studies of empty Shh (Hitzenberger and Hofer, 2016) or X-ray derived crystal structures of Shh bound to Hhip (McLellan et al., 2008; Bishop et al., 2009; Bosanac et al., 2009). However, the conducted QM/MM MD simulation predicts direct E177-Zn(II) coordination, surely aided by the MM derived starting structure also predicting such a coordination. In the previous study of empty Shh (Hitzenberger and Hofer, 2016), the MM MD trajectory used as starting point for the QM/MM simulation erroneously predicted E177 to directly interact with the ion during the entire simulation. However, the QM/MM model switched to the experimentally determined coordination sphere almost instantly after heating the system. This was not the case in the present QM/MM simulation. Here, however, E177-Zn(II) binding is not implausible because the second oxygen atom of E177 forms a very strong hydrogen bond with the amide function in the macrocycle of robotnikinin, thereby also directing the amide oxygen atom toward histidines H134 and H135, enabling potential hydrogen bonding. Therefore, the presence of E177 increases the number of possible protein-ligand interactions, thus stabilizing the Sonic Hedgehog-robotnikinin complex. Although the continued presence of a water molecule in the binding site would be possible, it is unlikely that it would aid the binding of robotnikinin as strongly as E177 does, because a water molecule is neutral and possesses just one atom that can function as an electron donor, whereas E177 bridges two positively charged sites via its two spatially separated oxygen atoms. Furthermore, the elimination of strongly bound water molecules aids ligand binding via a favorable entropy contribution (Ladbury, 1996; Li and Lazaridis, 2005; Huggins, 2015). In the previously conducted QM/MM simulation it has also been shown that while the bridged binding of E177 to Zn(II) is absolutely predominant, there are also configurations where it binds directly (Hitzenberger and Hofer, 2016), suggesting that this form of coordination is not entirely unfeasible even without the presence of a ligand. 
Upon closer investigation of the E177-robotnikinin interaction it becomes clear that it is a very stable one, as the mean distance between the E177 oxygen and the hydrogen atom of the amide is 2.29 Å and, applying the same H-bond criteria as before, 85.7% of all sampled configurations exhibit this particular hydrogen bond. Another argument for the exclusion of water molecules around the metal ion center is the fact that binding of robotnikinin to Shh renders Zn(II) practically inaccessible to the solvent. This is illustrated by the Zn(II)-water oxygen RDF of the core simulation: the first peak can be found at approx. 6.6 Å, and integration up to this point yields 1.5 water molecules. At a separation of 5 Å, integration of the RDF indicates a mean number of just 0.006 water molecules. This suggests that it is very unlikely that water re-enters the active site once it has left. If all these features are taken into consideration, it seems very plausible that robotnikinin binds to Zn(II) via a direct ionic bond, as well as indirectly via the residue E177.
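
Integrating an RDF up to a given distance yields the mean number of neighbors within that distance, so the quoted water counts can equivalently be obtained by counting atoms inside a sphere around the ion in every frame and averaging. The sketch below shows this direct counting approach on toy data. It ignores periodic boundary conditions, which is an assumption that only holds if the ion is far from the box edges or the coordinates have been re-imaged; the data layout and function names are hypothetical.

```python
import math


def count_within(center, atoms, cutoff):
    """Number of atoms closer to `center` than `cutoff` (no periodic images)."""
    return sum(1 for atom in atoms if math.dist(center, atom) < cutoff)


def mean_coordination_number(frames, cutoff):
    """Average neighbor count over frames.

    frames: iterable of (center_xyz, list_of_neighbor_xyz) tuples.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("at least one frame is required")
    return sum(count_within(c, atoms, cutoff) for c, atoms in frames) / len(frames)


# Toy trajectory: one water oxygen inside 5 angstrom in the first frame only.
toy_frames = [
    ((0.0, 0.0, 0.0), [(4.0, 0.0, 0.0), (7.0, 0.0, 0.0)]),
    ((0.0, 0.0, 0.0), [(6.0, 0.0, 0.0), (7.5, 0.0, 0.0)]),
]
print(mean_coordination_number(toy_frames, cutoff=5.0))
```

A mean count far below one, as reported for the 5 Å sphere, indicates that a water molecule is found inside the sphere in only a very small fraction of the sampled frames.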

The mean distances of residues H134 and H135 to the macrocyclic amide group of robotnikinin in the core QM/MM investigation are 4.41 and 2.53 Å, respectively. Additionally, there are some close contacts between H134 and the macrocyclic ester group (with an average distance of 4.09 Å), hinting at the existence of favorable interactions between these histidines and the ligand. Calculating the RMSDs for just the sidechains (considering only heavy atoms) of these two residues results in a mean deviation of 0.481 Å, which is a very low value compared to the RMSDs derived from the MM simulation (1.187 Å) or from all sidechains in the QM/MM described protein (0.874 Å), further suggesting the presence of stabilizing interactions. However, if the same hydrogen bond criteria used for the MM simulations are applied, H135-amide hydrogen bonds are present in only 19.1% of all frames, while 4.8% of all sampled configurations show H134-amide H-bonding and only 43 out of 4,000 configurations fulfill the criteria for a hydrogen bond between H134 and the ester functionality. Apparently, while the electrostatic interactions between the histidines and robotnikinin seem to be rather strong, the angle between donor, hydrogen atom and acceptor deviates quite far from 180° most of the time. One of the likely reasons for this behavior is the fact that the histidines in question are not part of the QM zone of this simulation; therefore, all observed interactions between them and robotnikinin are somewhat error prone and should be examined more closely by including H134 and H135 in the QM region. However, as QM calculations scale very unfavorably with system size, besides a larger triple zeta (TZ) simulation, the system was also simulated utilizing only a double zeta basis set in order to attain a larger number of configurations and thus gain data that are statistically more conclusive.
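
The frame counts and percentages quoted for the core simulation are related by simple arithmetic, which the following sketch makes explicit. It uses only numbers stated in the text (80 ps, 4,000 frames, 43 frames with an H134-ester hydrogen bond); the function names are illustrative.

```python
def occupancy_percent(n_hits: int, n_frames: int) -> float:
    """Hydrogen bond occupancy in percent from a frame count."""
    return 100.0 * n_hits / n_frames


def frame_spacing_ps(length_ps: float, n_frames: int) -> float:
    """Time between consecutive evaluation frames."""
    return length_ps / n_frames


h134_ester = occupancy_percent(43, 4000)  # frames fulfilling the H-bond criteria
spacing = frame_spacing_ps(80.0, 4000)    # core simulation sampling phase
print(f"H134-ester occupancy: {h134_ester:.2f}%  frame spacing: {spacing:.2f} ps")
```

The 43 frames thus correspond to an occupancy of about 1.1%, and with a 2.0 fs time step a spacing of 0.02 ps means that every tenth step was stored for evaluation.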

The more sophisticated TZ simulation suggests that the interaction between the histidines and the macrocycle of robotnikinin is more distinct than predicted by the core QM/MM simulation, as the mean amide-histidine distances are reduced to 2.77 Å (H134) and 2.08 Å (H135). In addition, the H-bond acceptor oxygen of the ester is on average separated from the donor hydrogen of H134 by 3.94 Å and is occasionally close enough for hydrogen bonding. The simulation predicts a hydrogen bond population of 80.7% for H135-amide, 40.6% for H134-amide and 1.4% for H134-ester. The mean distance between the closest E177 oxygen atom and the hydrogen of the macrocyclic amide is 2.14 Å, with a hydrogen bond abundance of 95.3%, implying that the E177-robotnikinin and the H134/H135-robotnikinin interactions promote each other by orienting the amide group in a favorable position. The statistically more robust (but less accurate) 90 ps DZ simulation predicts interactions that are even stronger than in the TZ case, yet only by a small margin. The mean distances between H134, H135 and the amide are predicted to be 2.56 and 2.05 Å, respectively, with the average separation between H134 and the ester being 3.27 Å. Of the 4,500 sampled configurations, 87.5% show a hydrogen bond between H135 and the amide, 50.7% between H134 and the amide and 5.4% between H134 and the ester. The mean E177-amide distance is predicted to be 2.13 Å, with a hydrogen bond abundance of 92.9%. A summary of the discussed H-bond results regarding the interactions of H134, H135, and E177 with robotnikinin is provided in **Table 3**. To further highlight the binding motif, robotnikinin-sidechain distance plots are depicted in **Figures 6A–C**, taking into consideration only the atoms closest to each other, as these are regarded to be the mediators of the respective H-bonds. 
The interactions found for the aromatic rings are the same as in the MM MD simulation, which is unsurprising since in both cases they were described solely by the force field. The predominant stacking geometries are depicted in **Figure 7**.

If all these findings are taken into consideration, one can conclude that, besides the very strong Zn(II)-robotnikinin ionic bond, the most stable interaction found by the QM/MM MD simulations is the hydrogen bond between E177 and the macrocyclic amide hydrogen, as this interaction has an occurrence ranging from 85.7% (core QM/MM simulation) to 92.9 and 95.3% in the DZ and TZ extended QM/MM simulations, respectively. The second most important hydrogen bond exists between H135 and the macrocyclic amide of robotnikinin, with an abundance of 80.7% (TZ) and 87.5% (DZ) in the extended simulations, which include H135 and H134 in the QM zone. Another frequently occurring interaction was identified between H134 and the same amide, which is present in roughly half of all sampled configurations of the extended QM/MM MD simulations. An H-bond between the ester and H134 is also predicted but is far less important than the other interactions, since it is observed in only 1 to 5% of all frames. The binding pose of robotnikinin in Shh as predicted by the extended TZ simulation is shown in **Figure 8A** and the respective interactions mediating this orientation are depicted in **Figure 8B**. In order to further confirm these findings and to construct a three-dimensional map of all important interactions between robotnikinin and Sonic Hedgehog, every fiftieth configuration of the extended TZ sampling trajectory was employed to create twenty different interaction models, utilizing the software tool LigandScout (version 4.09) (Wolber and Langer, 2005; Wieder et al., 2017). The resulting three- and


The occurrences of hydrogen bond formation are given in percent. Results stemming from purely QM described interactions are printed in bold font.

FIGURE 6 | Plots of the distances between H134 and robotnikinin's macrocyclic amide group (black), ester (red), as well as the separation of H135 and the macrocyclic amide function (blue) of robotnikinin. Each frame represents a time span of 0.02 ps. (A) Core simulation. (B) Extended simulation (TZ). (C) Extended simulation (DZ).

two-dimensional interaction maps are shown in **Figures 8C,D**, confirming the observations from the distance- and angle-based trajectory analysis. Not shown in the pictures is residue K88, which is very close to robotnikinin's phenyl ring with a mean distance of 5.48 Å (core QM/MM simulation). It is conceivable that if the phenyl ring were modified with an additional negatively charged moiety, a hydrogen bond interaction with K88 could be achieved, further stabilizing the robotnikinin-Shh complex.

In order to assess how robust the predictions based on the DZ basis are, a QM/MM MD simulation of empty Shh was conducted, utilizing the DZ basis (along with adjusted linkbond parameters) but otherwise the same settings as in the previous study (Hitzenberger and Hofer, 2016). The DZ simulation predicts a Zn(II) coordination number of 5.5 (calculated up to a distance of 2.78 Å), with the ion commonly coordinated by E177 along with H141, D148, H183 and one or two water molecules. This contrasts with the results of the TZ simulation (CN of 4.4 and very rare direct E177-Zn(II) bonding) and suggests that a double zeta basis might not be sufficient to accurately describe such a system when applied to a simulation utilizing the BP86 functional. Nevertheless, the obtained results

lines, while hydrophobic interactions are shown as yellow arrows. With the exception of hydrogen atoms that are part of H-bonds, only heavy atoms are shown. (C) Superposition of 20 interaction profiles constructed from every fiftieth frame of the extended TZ sampling trajectory. Green arrows represent H-bond donors, while acceptors are colored in red, hydrophobic interactions are shown in yellow and ligand-metal interactions are depicted in blue. (D) Two-dimensional interaction profile of robotnikinin with Shh, adapted from the output of LigandScout. The color-coding is analogous to the one used in (C).

are still much more accurate than those of the purely classical simulation, although they are not as close to the experimental data as those of the TZ simulation. For this reason, all structurally relevant data for this study were taken from the TZ investigations. However, as the DZ simulation is nearly four times faster than the TZ case and its results regarding hydrogen bond populations are very close to the data taken from the TZ investigations, they can be viewed as a statistically more robust confirmation of the behavior witnessed in the shorter TZ simulation. The calculation of the solvent accessible surface area (SASA) of robotnikinin from the 1,000 configurations sampled in the extended TZ simulation resulted in a mean value of 297.6 Å<sup>2</sup> (±0.7 Å<sup>2</sup>, P = 95%, SD = 12.0 Å<sup>2</sup>, n = 1,000), which is very close to the 300.5 Å<sup>2</sup> (±1.4 Å<sup>2</sup>, P = 95%, SD = 22.7 Å<sup>2</sup>, n = 1,000) obtained from the classical simulation. This finding is unsurprising because no major configurational changes have been witnessed; instead, the most notable differences between the QM/MM and MM structures concern areas which are buried in the binding groove of Shh in both cases.
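
The ± values attached to the SASA means are consistent with a 95% confidence interval of the mean under a normal approximation, i.e., 1.96 · SD / √n. The sketch below reproduces the two quoted half-widths from the reported SD and n. It is an assumption that the interval was obtained this way (the text does not say), and the formula treats the frames as statistically independent, which consecutive MD frames generally are not.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of a confidence interval of the mean (normal approximation).

    sd: sample standard deviation, n: number of samples,
    z:  1.96 corresponds to P = 95% (ASSUMED to match the article's procedure).
    """
    return z * sd / math.sqrt(n)


# SD and n as reported for the extended TZ QM/MM and the classical simulation.
qmmm_half_width = ci_half_width(12.0, 1000)
mm_half_width = ci_half_width(22.7, 1000)
print(f"QM/MM: +/- {qmmm_half_width:.1f} A^2, MM: +/- {mm_half_width:.1f} A^2")
```

Since the two means differ by 2.9 Å<sup>2</sup>, which is small compared with either standard deviation, the comparison supports the statement that the ligand exposure is essentially the same in both descriptions.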

For further reference, a representative PDB file containing the system (taken from the extended TZ simulation) is provided as Supplementary Material. In order to reduce the size of the file, all solvent molecules have been removed.

# 4. CONCLUSIONS

In the presented study, a reasonable binding mode for robotnikinin, the small molecule inhibitor of Sonic Hedgehog, is suggested. The interactions identified via a series of QM/MM MD simulations were sufficiently strong to stabilize the Shh-robotnikinin complex throughout the investigations and enabled a binding mode in which the ligand interacts with six amino acids and the Zn(II) ion present in the binding groove of Shh. The most important and stable interactions are the ionic bond between the Zn(II) ion of Sonic Hedgehog and one of the oxygen atoms of robotnikinin as well as a hydrogen bond between E177 and the macrocyclic amide group of robotnikinin, bridging the inhibitor with the ion. Other very important contributions include hydrogen bonds between H134/H135 and the macrocyclic amide group as well as hydrophobic interactions between the aromatic rings of the ligand and the sidechains of residues T126, Y175 and H181, predominantly forming π–π interactions.

In all conducted simulations K88, also known to bind complex partners of Shh (Bosanac et al., 2009), is sufficiently close to the phenyl ring of robotnikinin to form an H-bond if an electron donor function were present at a suitable position. Such a modification could further improve the affinity of Shh for the small molecule inhibitor and, in addition, prevent the contribution of K88 to the formation of complexes with other proteins.

Even though in the core simulation the QM/MM interface very likely prevented the donor-hydrogen-acceptor geometries commonly regarded as required to confirm the existence of hydrogen bonds between the classically described histidines H134/H135 and the QM described ligand, their behavior (low RMSD, close proximity to the inhibitor) was taken as a very strong hint for the presence of important interactions between those molecules, hence justifying the utilization of the applied embedding scheme. It is, however, still very important to stress that in order to obtain reasonable structural data a careful choice of the QM zone is necessary, making sure that all potential ligand-protein interactions are treated at the same level of theory. At the same time, as QM calculations scale very unfavorably with the number of atoms in the system, the size of the QM region has to be as small as possible in order to attain a sufficient number of configurations from which statistically robust data can be derived. Therefore, a reasonable compromise between method accuracy and system size is very important to sample a sufficient number of configurations that are also physically meaningful. In the presented work this could be achieved by conducting three different QM/MM simulations of the same system, differentiated by their respective basis set and system sizes.


The purely classical model of the studied system, on the other hand, predicts a vastly over-coordinated Zn(II) site due to an overestimation of the ion-ligand interaction strength, which is probably also the main reason for the absence of the hydrogen bonds between robotnikinin and E177 or H134 and H135 (as they seem to promote each other, presumably due to the partial double bond character of the C-N bond in the amide). This can be viewed as a further indication that a description of the system based solely on force fields of the kind employed in this study is not ideally suited for the physically correct description of systems involving transition metal ions. Overall, this work also highlights the capabilities of an iterative docking/(QM)MM MD cycle if used to improve the predictive power of in silico ligand-receptor binding studies.

# AUTHOR CONTRIBUTIONS

All simulations were executed and evaluated by MH. The simulations were performed using code written and implemented by TH who also contributed to the conceptualization of the investigation. The manuscript was drafted by MH and revised by TH and DS who was consulted due to her expertise in protein ligand docking.

# ACKNOWLEDGMENTS

The computational results presented have been achieved using the HPC infrastructure of the University of Innsbruck. The authors thank Inte:Ligand for providing LigandScout version 4.09 free of charge.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2017.00076/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Hitzenberger, Schuster and Hofer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How Are Substrate Binding and Catalysis Affected by Mutating Glu<sup>127</sup> and Arg<sup>161</sup> in Prolyl-4-hydroxylase? A QM/MM and MD Study

Amy Timmins and Sam P. de Visser\*

School of Chemical Engineering and Analytical Science, Manchester Institute of Biotechnology, The University of Manchester, Manchester, United Kingdom

Prolyl-4-hydroxylase is a vital enzyme for human physiology involved in the biosynthesis of 4-hydroxyproline, an essential component for collagen formation. The enzyme performs a unique stereo- and regioselective hydroxylation at the C<sup>4</sup> position of proline despite the fact that the C<sup>5</sup> hydrogen atoms should be thermodynamically easier to abstract. To gain insight into the mechanism and find the origin of this regioselectivity, we have done a quantum mechanics/molecular mechanics (QM/MM) study on wildtype and mutant structures. In a previous study (Timmins et al., 2017) we identified several active site residues critical for substrate binding and positioning. In particular, the Glu<sup>127</sup> and Arg<sup>161</sup> were shown to form multiple hydrogen bonding and ion-dipole interactions with substrate and could thereby affect the regio- and stereoselectivity of the reaction. In this work, we decided to test that hypothesis and report a QM/MM and molecular dynamics (MD) study on prolyl-4-hydroxylase and several active site mutants where Glu<sup>127</sup> or Arg<sup>161</sup> are mutated for Asp, Gln, or Lys. Thus, the R161D and R161Q mutants give very high barriers for hydrogen atom abstraction from any proline C–H bond and therefore will be inactive. The R161K mutant, by contrast, sees the regio- and stereoselectivity of the reaction change but still is expected to hydroxylate proline at room temperature. By contrast, the Glu<sup>127</sup> mutants E127D and E127Q show possible changes in regioselectivity with the former being more probable to react compared to the latter.

Keywords: quantum mechanics/molecular mechanics, enzyme mechanism, enzyme catalysis, mutations, density functional theory

# INTRODUCTION

Metalloenzymes play vital roles in nature and are involved in biosynthesis as well as biodegradation of compounds (Solomon et al., 2000; Costas et al., 2004; Abu-Omar et al., 2005; Kryatov et al., 2005; Bruijnincx et al., 2008; Kadish et al., 2010). Owing to the large natural abundance of iron, metalloenzymes often contain one or more iron centers; however, in this work we will restrict ourselves to mononuclear iron enzymes only and particularly those that utilize molecular oxygen. In general, iron containing dioxygenases and monoxygenases use one molecule of molecular oxygen in their catalytic cycle and either transfer both oxygen atoms to substrate(s) or a single one with a water molecule as by-product (Sub = substrate), Equations 1, 2.

$$\text{(nonheme)Fe}^{\text{III}} + \text{Sub} + \text{O}_2 \rightarrow \text{(nonheme)Fe}^{\text{III}} + \text{SubO}_2 \tag{1}$$

$$\text{(heme)Fe}^{\text{III}} + \text{Sub} + \text{O}_2 + 2\text{H}^+ + 2\text{e}^- \rightarrow \text{(heme)Fe}^{\text{III}} + \text{SubO} + \text{H}_2\text{O} \tag{2}$$

Edited by:

Nino Russo, Università della Calabria, Italy

#### Reviewed by:

Tiziana Marino, University of Calabria, Italy Xavier Assfeld, Université de Lorraine, France

\*Correspondence: Sam P. de Visser

sam.devisser@manchester.ac.uk

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 13 September 2017 Accepted: 24 October 2017 Published: 09 November 2017

#### Citation:

Timmins A and de Visser SP (2017) How Are Substrate Binding and Catalysis Affected by Mutating Glu127 and Arg161 in Prolyl-4-hydroxylase? A QM/MM and MD Study. Front. Chem. 5:94. doi: 10.3389/fchem.2017.00094

Thus, heme monoxygenases, like the cytochromes P450, react as monoxygenases and proceed through a catalytic cycle starting from an iron(III)-heme resting state with a protein cysteinate and a water molecule in the fifth and sixth iron ligand positions, respectively (Meunier et al., 2004; Ortiz de Montellano, 2004, 2010; Denisov et al., 2005; Kadish et al., 2010). The water molecule is released after substrate binding, which triggers a spin state change from low-spin to high-spin and enables molecular oxygen binding to the iron center. The iron-superoxo is subsequently reduced and protonated to form an iron(III) hydroperoxo(heme) complex, also called Compound 0 (Meunier et al., 2004; Denisov et al., 2005; Shaik et al., 2005; Ortiz de Montellano, 2010). A final protonation step gives water and an iron(IV)-oxo(heme cation radical) species called Compound I (CpdI) (de Visser et al., 2003; Rittle and Green, 2010; de Visser and Kumar, 2011). CpdI is the active species of P450 enzymes and reacts with substrates through oxygen atom transfer and hence converts aliphatic groups to alcohols (Ogliaro et al., 2000b; de Visser et al., 2004; Ji et al., 2015), C=C double bonds to epoxides (de Visser et al., 2001, 2002b; Sainna et al., 2015), sulfides to sulfoxides (Sharma et al., 2003; Kumar et al., 2005b, 2011a), and arenes to phenols (de Visser and Shaik, 2003; de Visser, 2006c; Cantú Reinhard et al., 2016b). The mechanism of these reactions has been established with computational modeling including density functional theory (de Visser, 2012; Blomberg et al., 2014). In recent years, calculations on full enzymatic structures identified the effect of the protein, substrate orientation and hydrogen bonding interactions on the kinetics and thermodynamics of the reaction and the product distributions of P450 catalyzed reaction mechanisms.
Thus, it was shown with small model complexes that hydrogen bond donations toward the axial thiolate ligand affected the electron affinity of this cysteinate residue, which led to a push-effect of electrons to the heme that influenced its redox potential and hence catalytic potential (Ogliaro et al., 2000a; de Visser et al., 2002a; Schöneboom et al., 2002). In the heme enzyme cytochrome c peroxidase, a model that included a cation binding site reproduced the experimentally characterized electronic configuration, and highlighted the importance of long-range electrostatic effects in enzyme models (de Visser, 2005). Because of these long-range effects, more and more computational studies are done on full enzymatic systems using the Quantum Mechanics/Molecular Mechanics (QM/MM) technique.

**Figure 1** displays the active site structures of (**Figure 1A**) cytochrome P450 and (**Figure 1B**) taurine/α-ketoglutarate dioxygenase (TauD) as a structural comparison (O'Brien et al., 2003; Guo and Sevrioukova, 2017). The P450s are heme enzymes, where the heme is linked to the protein backbone through an interaction of the metal with a cysteinate residue (the axial ligand). The substrate binds on the distal site of the heme; in the 5G5J protein databank (pdb) file it is the drug molecule metformin. The distal site of the heme has several hydrogen bonding and polar residues, such as Ser<sup>119</sup> and Arg<sup>212</sup>: the former has been proposed to be involved in the proton relay mechanisms during the catalytic cycle (Kumar et al., 2005a), whereas the latter holds the substrate in position through a salt bridge.

A second iron enzyme class that utilizes molecular oxygen is the nonheme iron dioxygenases (Bugg, 2001; Ryle and Hausinger, 2002; Solomon et al., 2013). These dioxygenases are found in all forms of life and are involved in the biosynthesis of antibiotics (Choroba et al., 2000; Higgins et al., 2005; Siitonen et al., 2016), DNA and RNA repair (O'Brien, 2006; Yi et al., 2009), as well as the metabolism of toxic natural compounds such as cysteine (Stipanuk, 2004; Straganz and Nidetzky, 2006; de Visser, 2009; Buongiorno and Straganz, 2013). These enzymes are structurally very different from the heme monoxygenases as they link the iron atom to the protein with only amino acid side chains such as His, Asp, or Glu residues. Usually, the nonheme iron dioxygenases contain a facial triad of amino acid ligands with two histidine and one carboxylate group, i.e., 2-His/1-Asp, Glu (Que, 2000; Kovaleva and Lipscomb, 2008). As an example of a dioxygenase with these ligand features we show in **Figure 1B** the active site of taurine/α-ketoglutarate dioxygenase (TauD). TauD is a dioxygenase involved in the metabolism of cysteine, whereby it converts taurine to hydroxy-taurine. The 1OS7 pdb file (O'Brien et al., 2003) is a substrate and α-ketoglutarate (α-KG) bound structure of TauD with the iron bound to the protein through the side chains of residues His<sup>99</sup>, Asp<sup>101</sup> and His<sup>255</sup>. Substrate taurine is located near the metal and is held in position through a salt bridge with residue Arg<sup>270</sup>. Co-substrate α-KG is bound to the metal as a bidentate ligand through the keto and acid groups.

The catalytic cycle of TauD has been established through a combination of experimental and computational studies (Borowski et al., 2004; Bollinger et al., 2005; de Visser, 2006a,b, 2007; Godfrey et al., 2008). **Figure 2** schematically depicts the catalytic cycle of TauD, which starts from the resting state structure where iron is bound to the 2-His/1-Asp ligand system and the other ligand positions of the metal are occupied by three water molecules (structure A). When α-KG enters the pocket two water molecules are displaced and replaced by the keto and acid groups of α-KG (structure B). In the next step, substrate taurine binds, which displaces the last water molecule from iron (structure C); this water is replaced by molecular oxygen, which binds as an iron(III)-superoxo (structure D). Subsequently, the superoxo group attacks the α-keto position of α-KG to form a bicyclic ring-structure (structure E). In the next step, the dioxygen bond breaks to form a peracid succinate with the release of CO<sub>2</sub>. Finally, the peracid bond breaks and splits into an iron(IV)-oxo species and succinate (structure F). Iron(IV)-oxo is known to be a powerful oxidant that abstracts a hydrogen atom from taurine to give an iron(III)-hydroxo group (structure G) and the hydroxyl radical is then rebound to form hydroxytaurine as product (structure H). Products hydroxy-taurine and succinate are released from the iron center and their positions are replaced by water molecules to bring the catalytic cycle back into the resting state.

Another nonheme iron dioxygenase with a catalytic cycle similar to TauD is prolyl-4-hydroxylase (P4H), which regio- and stereospecifically hydroxylates a proline residue in a protein to R-4-hydroxyproline, **Scheme 1**. Product R-4-hydroxyproline is a common amino acid in animals and plants and has functions in collagen, where it enables crosslinking between individual strands. In addition, it is relevant to the synthesis of the hypoxia-inducible factor in animals (McDonough et al., 2006).

A range of biochemical and spectroscopic studies on P4H established key details of the catalytic cycle. Thus, reactions of P4H with <sup>18</sup>O<sub>2</sub> provided evidence of the transfer of one atom of molecular oxygen to proline (Myllyharju and Kivirikko, 1997). Low-temperature Mössbauer, electron paramagnetic resonance (EPR) and UV-Vis absorption spectroscopic studies characterized several intermediates in the catalytic cycle, including the iron(IV)-oxo species (Hoffart et al., 2006). It was shown that the iron(IV)-oxo species has a quintet spin ground state and reacts with the substrate through a rate-determining hydrogen atom abstraction. In particular, rate constants for the reaction with taurine and taurine-d<sub>2</sub> gave a large kinetic isotope effect. To confirm the reaction mechanism, computational studies on the catalytic cycle of P4H were performed: one study using an active site model complex (Karamzadeh et al., 2010) and another using the full enzyme structure with QM/MM (Timmins et al., 2017). These studies established the technical details of the catalytic cycle and confirmed the mechanism shown above in **Figure 2**. Furthermore, key functions of several amino acids were identified related to substrate positioning and product release as will be described in more detail later.

The key step in the catalytic cycle of P4H is the hydrogen atom abstraction of substrate by the iron(IV)-oxo intermediate (Hoffart et al., 2006), which was shown to be rate-determining. In principle, substrate proline has six aliphatic C–H bonds at positions C<sup>3</sup>, C<sup>4</sup>, and C<sup>5</sup> that could lead to six different product isomers, **Scheme 2**. We label the two hydrogen atoms on C<sup>3</sup>, C<sup>4</sup>, and C<sup>5</sup> as front (f) or back (b). Also shown in **Scheme 2** are bond dissociation free energies (BDFE) of each of these C–H bonds as calculated at UB3LYP/6-311+G\* as the difference in free energy between proline and the sum of a hydrogen atom and [proline – H<sup>•</sup>]. As can be seen, the C–H bond strengths at the C<sup>3</sup> and C<sup>4</sup> positions in proline are comparable, while the one at the C<sup>5</sup> position is much weaker. Therefore, in the gas-phase proline hydroxylation should happen at the C<sup>5</sup> position as it is the weakest bond to break rather than at the thermodynamically unfavorable C<sup>4</sup> position. How P4H prevents hydroxylation of the weaker C<sup>5</sup> position in favor of hydroxylation at the C<sup>4</sup> position is the topic of this paper.
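
Written out, the verbal definition of the bond dissociation free energy given above corresponds to the following expression. The symbol $G$ for the free energy of each species and the sign convention (positive for bond breaking) are notational assumptions, since the section states the definition in words only.

$$\text{BDFE}(\text{C–H}) = G\big([\text{proline} - \text{H}^{\bullet}]\big) + G\big(\text{H}^{\bullet}\big) - G\big(\text{proline}\big)$$

With this convention a smaller BDFE means that the corresponding hydrogen atom is easier to abstract, which is why the C<sup>5</sup> position is expected to react first in the absence of the protein environment.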

In addition to studies on wildtype (WT) P4H, we looked at the structure and catalytic properties of two active site mutants where the Glu<sup>127</sup> and Arg<sup>161</sup> residues were mutated to alternative groups. A previous study (Timmins et al., 2017) identified these two amino acids as key for substrate positioning in the substrate binding pocket and hence mutating them to a different amino acid should have a considerable effect.

# METHODS

The calculations presented in this work follow previously described and benchmarked methods using QM/MM (Porro et al., 2009; Kumar et al., 2011b; Quesne et al., 2014, 2016a; Faponle et al., 2017; Li et al., 2017). Specifically, our previous work on the mechanism of the possible reaction channels of cytochrome P450 decarboxylase leading to decarboxylation of fatty acids or hydroxylation of fatty acids, predicted the correct regioselectivity of the reaction as compared to experiment and reproduced experimentally determined kinetic isotope effects (Faponle et al., 2016). Furthermore, QM/MM studies on 1-H-3-hydroxy-4-oxoquinaldine 2,4-dioxygenase focused on the rate-determining step of the co-factor independent reaction of substrate with molecular oxygen and predicted a rate constant in good agreement with experiment and explained how this enzyme functions without a metal cofactor present (Hernández-Ortega et al., 2014, 2015). Very recently, we used QM/MM modeling to predict spectroscopic fingerprints of short-lived catalytic cycle intermediates and applied this to cysteine dioxygenase enzymes. Calculated UV-Vis absorption spectra and Mössbauer and EPR parameters enabled the experimental characterization of a short-lived oxygen-bound intermediate (Fellner et al., 2016; Tchesnokov et al., 2016).

#### Model Set-Up

Our QM/MM starting point structures were set up using previously described methods and procedures (Quesne et al., 2016a; Timmins et al., 2017), which we will summarize briefly here. The crystal structure coordinates from the 3GZE pdb file were used as a starting point for all models (Koski et al., 2009). The 3GZE pdb file represents a resting state P4H structure with Zn<sup>2+</sup>, pyridine-dicarboxylate co-substrate mimic and the (Ser-Pro)<sub>5</sub> peptide chain bound. The active site zinc(II)-water(pyridine-dicarboxylate) was replaced manually with iron(IV)-oxo(succinate) with an Fe–O distance of 1.63 Å, a typical distance found for analogous nonheme iron(IV)-oxo complexes in enzymes and model complexes (de Visser, 2006a,b,d, 2007; Godfrey et al., 2008; Quesne et al., 2016b; Cantú Reinhard and de Visser, 2017). We retained the short peptide chain (Ser-Pro)<sub>5</sub> in the model as it has its proline residue tightly packed near the iron(IV)-oxo group.

Subsequently, hydrogen atoms were added to the protein structure using the PDB2PQR program package assuming pH = 7 (Dolinsky et al., 2007). Thus, all acid residues, i.e., Glu and Asp, were deprotonated whereas the basic residues, i.e., Arg and Lys, were protonated. The protonation state of each individual histidine residue was decided upon visual inspection of its local environment (donating/accepting hydrogen bonds) and we chose to assign all as singly protonated. Thereafter, the protein structure was solvated in a sphere with a radius of 40 Å and energy minimized with the Charmm forcefield (Brooks et al., 1983). The solvation procedure was repeated until fewer than 20 water molecules were added to the chemical system in a cycle (**Figure 3**). The saturated structure was then minimized without geometric constraints and heated to a temperature of 298 K. Finally, a full molecular dynamics (MD) simulation was run for 10 ns. The full set-up procedures were repeated for the mutant structures, whereby one amino acid was manually replaced. As follows from the MD simulations shown in **Figure 3B**, all converge well within 10 ns. For each of the structures, we started QM/MM calculations using the snapshots taken after 5 ns (Sn5ns).
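
The repeated solvation described above is a simple convergence loop: solvate, count the water molecules added, and stop once the count drops below 20. The sketch below captures only that control flow. The function `solvate_until_saturated`, the callable `solvate_once` and the toy counts are hypothetical placeholders for illustration and do not correspond to any command of the simulation packages used in the study.

```python
from typing import Callable, List


def solvate_until_saturated(
    solvate_once: Callable[[], int],
    threshold: int = 20,
    max_cycles: int = 50,
) -> List[int]:
    """Repeat a solvation cycle until fewer than `threshold` waters are added.

    solvate_once is a HYPOTHETICAL callable that performs one solvation and
    minimization cycle and returns the number of water molecules it added.
    Returns the number of waters added in each cycle, in order.
    """
    added_per_cycle: List[int] = []
    for _ in range(max_cycles):
        added = solvate_once()
        added_per_cycle.append(added)
        if added < threshold:
            return added_per_cycle
    raise RuntimeError("solvation did not saturate within max_cycles")


# Invented toy numbers standing in for successive solvation cycles.
toy_counts = iter([5200, 310, 45, 12, 3])
history = solvate_until_saturated(lambda: next(toy_counts))
print(history)
```

The `max_cycles` guard is a design choice of this sketch so that a system that never saturates raises an error instead of looping indefinitely.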

#### QM/MM Procedures

Density functional theory (DFT) methods were used to describe the QM region of the QM/MM calculation. In particular, we used the unrestricted hybrid density functional method B3LYP (Lee et al., 1988; Becke, 1993) in all cases as recent benchmark studies from our group showed this procedure to give rate constants in very good agreement with experiment (Cantú Reinhard et al., 2016a). By contrast, QM methods with dispersion included were shown to underestimate free energies of activation considerably (Cantú Reinhard et al., 2016a). Furthermore, B3LYP was shown previously to predict regioselectivities and bifurcation pathways well as compared to experiment (Kumar et al., 2004; Barman et al., 2016; Brazzolotto et al., 2017). Also, DFT calculated free energies of activation were shown to match experimentally determined ones of biomimetic model complexes containing iron and manganese very well and reproduced Hammett trends (Vardhaman et al., 2011, 2013; Kumar et al., 2014; Yang et al., 2016). Here, DFT calculations were run in Turbomole (Ahlrichs et al., 1989), and the MM ones in DL-Poly with the Charmm forcefield (Smith and Forester, 1996). The ChemShell software package (Sherwood et al., 2003) interfaced Turbomole and DL-Poly and was used to obtain QM/MM energies and derivatives. The link-atom approach was used to describe atoms on the border between the QM and MM regions and essentially replaced a covalent bond with a C–H bond (Bakowies and Thiel, 1996). All calculations use electronic embedding, in which the charges of the MM region are included in the QM Hamiltonian.
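
For orientation, electronic embedding is commonly written as the gas-phase QM Hamiltonian augmented by the Coulomb interaction of the QM electrons and nuclei with the MM point charges. The expression below is the generic textbook form in atomic units and is given as an illustration only; the section does not spell out the working equations, and the symbols ($q_J$ for MM charges at positions $\mathbf{R}_J$, $Z_A$ for QM nuclear charges at $\mathbf{R}_A$, $\mathbf{r}_i$ for electron coordinates) are notational assumptions.

$$\hat{H}_{\text{QM}}^{\text{emb}} = \hat{H}_{\text{QM}}^{0} - \sum_{i \in \text{el}} \sum_{J \in \text{MM}} \frac{q_J}{|\mathbf{r}_i - \mathbf{R}_J|} + \sum_{A \in \text{QM}} \sum_{J \in \text{MM}} \frac{Z_A\, q_J}{|\mathbf{R}_A - \mathbf{R}_J|}$$

Because the MM charges enter the one-electron part of the Hamiltonian, the QM electron density is polarized by the protein and solvent environment, in contrast to a mechanical embedding in which this interaction is evaluated only at the force field level.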

Geometry optimizations and reaction coordinate scans were done with an SV(P) basis set on all atoms: basis set BSI (Schafer et al., 1992). Reaction coordinate scans were run with one degree of freedom fixed and explored the potential energy surface between reactants, intermediates and products. The maxima of these scans were used as starting points for transition state searches. The energies of the stationary points were improved by running a single point calculation with an all-electron Wachters-type basis set on iron and def2-TZVP on the rest of the atoms: basis set BSII (Wachters, 1970).
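
Selecting the maximum of a constrained scan as the initial guess for a transition state search amounts to finding the highest-energy interior point of the scan. A minimal sketch of that selection step follows. The function name and the scan values (an O–H distance in Å and relative energies in kcal mol<sup>−1</sup>) are invented for illustration and are not data from this work.

```python
def scan_maximum(coordinates, energies):
    """Return (coordinate, energy) of the highest interior point of a scan.

    End points are excluded because a maximum at either end of the scan
    indicates that the scan range did not bracket the barrier.
    """
    if len(coordinates) != len(energies):
        raise ValueError("coordinates and energies must have the same length")
    if len(energies) < 3:
        raise ValueError("a scan needs at least three points")
    interior = range(1, len(energies) - 1)
    best = max(interior, key=lambda i: energies[i])
    return coordinates[best], energies[best]


# Invented toy scan: constrained O-H distance (angstrom) vs. relative energy.
toy_coordinates = [2.4, 2.0, 1.6, 1.3, 1.1, 1.0]
toy_energies = [0.0, 3.1, 9.8, 14.2, 11.0, 6.5]
ts_guess = scan_maximum(toy_coordinates, toy_energies)
print(ts_guess)
```

The geometry at the returned point would then be passed to an unconstrained transition state optimization; that step depends on the QM/MM package and is not sketched here.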

#### QM Region

As the active site region of the protein contains many hydrogen bonding and π-stacking interactions, we considered two QM regions: a minimal QM region A and an expanded QM region AB, see **Figure 4**. Thus, the minimal QM region A contains the iron(IV)-oxo group and its direct ligands (His<sup>143</sup>, His<sup>227</sup>, Asp<sup>145</sup>, and succinate) as well as the proline ring of the peptide substrate. The larger QM region AB was expanded with the indole ring of Trp<sup>243</sup>, the phenol group of Tyr<sup>140</sup> and three water molecules.

# RESULTS AND DISCUSSION

#### P4H WT Structure

The iron(IV)-oxo species (structure G in the catalytic cycle of **Figure 2**) was fully optimized with QM/MM methods using B3LYP/BSI and QM regions A and AB, see **Figure 5**. In agreement with experimental EPR and Mössbauer spectroscopic studies (Hoffart et al., 2006) the quintet spin state is the ground state, while we located the triplet and singlet spin states higher in energy by 16.0 and 33.1 kcal mol<sup>−1</sup>, respectively. Geometrically, the iron(IV)-oxo species is bound to the protein with two Fe–N<sub>His</sub> interactions of 2.06 and 2.08 Å, which is typical for metal-histidine interactions in proteins (de Visser et al., 2009). The carboxylate group of Asp<sup>145</sup> binds as a monodentate ligand at a distance of 2.01 Å, whereas the succinate carboxylate group binds as a bidentate ligand with distances of 2.20 and 2.30 Å. Again, these distances match previous calculations on similar complexes nicely (Pratter et al., 2013). Substrate proline is not bound directly to the iron center but its transferring C–H4b hydrogen atom is found at a distance of 2.86 Å from the oxo group and hence is positioned in the ideal orientation for oxidation. The triplet and singlet spin states give analogous ligand distances but are distinguished by their differences in iron(IV)-oxo bond length due to differences in molecular orbital occupation.

High-lying occupied and low-lying virtual orbitals of the iron(IV)-oxo species are shown in **Figure 6**, which gives the molecular z-axis along the Fe–O bond. The lowest lying orbital is the π<sup>∗</sup><sub>xy</sub> orbital that represents the interactions of the metal 3d<sub>xy</sub> orbital with the equatorial ligands, namely His143, Asp<sup>145</sup> and succinate (Succ). A bit higher in energy are the two orthogonal π<sup>∗</sup> orbitals along the Fe–O bond that correspond to the mixing of the 3d<sub>xz</sub> on iron with 2p<sub>x</sub> on oxygen, i.e., π<sup>∗</sup><sub>xz</sub>, and the 3d<sub>yz</sub> on iron with the 2p<sub>y</sub> on oxygen, i.e., π<sup>∗</sup><sub>yz</sub>. Higher in energy still are two σ-type orbital interactions. The first one lies along the z-axis and results from the mixing of the metal 3d<sub>z2</sub> with a 2p<sub>z</sub> on oxygen: σ<sup>∗</sup><sub>z2</sub>. The second one is located in the xy-plane and results from the interaction of the 3d<sub>x2−y2</sub> orbital on iron with orbitals on the ligands: σ<sup>∗</sup><sub>x2−y2</sub>. The quintet spin state has orbital occupation π<sup>∗1</sup><sub>xy</sub> π<sup>∗1</sup><sub>xz</sub> π<sup>∗1</sup><sub>yz</sub> σ<sup>∗1</sup><sub>x2−y2</sub> σ<sup>∗0</sup><sub>z2</sub>, while the triplet spin state has configuration π<sup>∗2</sup><sub>xy</sub> π<sup>∗1</sup><sub>xz</sub> π<sup>∗1</sup><sub>yz</sub> σ<sup>∗0</sup><sub>x2−y2</sub> σ<sup>∗0</sup><sub>z2</sub>. Generally, enzymatic iron(IV)-oxo species tend to have a quintet spin ground state (Latifi et al., 2009), while most synthetic biomimetic models have a triplet spin ground state (de Visser et al., 2012). Computational studies showed that this is the result of differences in coordination system, where biomimetic models are often in octahedral coordination, whereas enzymatic structures have the iron(IV)-oxo in pentacoordination (Latifi et al., 2013).
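As a small consistency check on the two configurations quoted above, the following Python sketch counts the singly occupied orbitals and derives the spin multiplicity. It assumes that all unpaired electrons are coupled with parallel spin; the dictionary keys and function names are illustrative only and do not come from the original work.

```python
# Occupations of the five metal-based orbitals as quoted in the text.
# Key names are illustrative labels, not identifiers from any program.
QUINTET = {"pi*_xy": 1, "pi*_xz": 1, "pi*_yz": 1, "sigma*_x2-y2": 1, "sigma*_z2": 0}
TRIPLET = {"pi*_xy": 2, "pi*_xz": 1, "pi*_yz": 1, "sigma*_x2-y2": 0, "sigma*_z2": 0}


def unpaired_electrons(occupations):
    """Count the singly occupied orbitals in a configuration."""
    for name, n in occupations.items():
        if n not in (0, 1, 2):
            raise ValueError(f"orbital {name} cannot hold {n} electrons")
    return sum(1 for n in occupations.values() if n == 1)


def spin_multiplicity(occupations):
    """Return 2S + 1, assuming all unpaired electrons are spin-parallel."""
    return unpaired_electrons(occupations) + 1


if __name__ == "__main__":
    for label, occ in (("quintet", QUINTET), ("triplet", TRIPLET)):
        # Print the label, the number of electrons in these orbitals, and 2S + 1.
        print(label, sum(occ.values()), spin_multiplicity(occ))
```

Both configurations place four electrons in these orbitals, as expected for iron(IV), and differ only in whether the fourth electron pairs up in π<sup>∗</sup><sub>xy</sub> or occupies σ<sup>∗</sup><sub>x2−y2</sub>.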

#### P4H Hydroxylation Mechanisms

Subsequently, we investigated proline hydroxylation at the C<sup>3</sup>, C<sup>4</sup> and C<sup>5</sup> positions by P4H and considered both hydrogen atoms at each of these positions. All reactions were found to be stepwise with an initial hydrogen atom abstraction via transition state **TS**<sub>HA</sub> to form a radical intermediate **I**<sub>H</sub>. A radical rebound step via **TS**<sub>reb</sub> then produced the alcohol product complexes **P**. In all cases the rebound barrier was small and the hydrogen atom abstraction was rate-determining. The QM/MM calculated energy landscapes are given in **Figure 7** for hydrogen atom abstraction from C4f, C4b, C5f, C5b, C3f, and C3b. The lowest energy barrier is the one for C4b, which after rebound will give the R-4-hydroxyproline product complex, the experimentally observed regio- and stereoselective product. Although the difference in relative energies is small, the calculations predict the correct regio- and stereoselectivity. Slightly higher in energy (1 kcal mol<sup>−1</sup> using basis set BSII and 2.9 kcal mol<sup>−1</sup> using basis set BSI) we find the pathway for hydrogen atom abstraction from C5b, the thermodynamically more favorable pathway. Indeed, the radical intermediate for C5b is much lower in energy than the one for C4b, in agreement with the thermodynamic predictions. About 5.5 kcal mol<sup>−1</sup> higher in energy than <sup>5</sup>**TS**<sub>HA,C4b</sub> is the barrier <sup>5</sup>**TS**<sub>HA,C4f</sub>, which implies that there are strong energetic differences between abstraction of the two hydrogen atoms on carbon center C<sup>4</sup>. The three barriers for hydrogen atom abstraction from C5f, C3f, and C3b are all higher in energy than <sup>5</sup>**TS**<sub>HA,C4b</sub> by at least 15 kcal mol<sup>−1</sup> and hence will play little role. As the singlet and triplet spin states were already considerably higher in energy at the reactant stage, they remain well higher in the hydrogen atom abstraction transition states as well. The triplet and singlet spin barriers for abstracting the C4b hydrogen atom are 34.0 and 40.2 kcal mol<sup>−1</sup>, respectively, at UB3LYP/BSI in QM/MM. As such, the reactivity takes place on a single spin state only, the quintet spin state, and other spin states play no role in the rate-determining pathway.
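To give a feeling for what a barrier difference of 1 to 3 kcal mol<sup>−1</sup> means for selectivity, the sketch below converts relative barriers into Boltzmann-weighted branching fractions. It is an illustration under stated assumptions, not part of the reported calculations: the QM/MM energy differences are treated as free energy differences, all channels are given the same pre-exponential factor, and a temperature of 298.15 K is assumed. Function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def rate_ratio(delta_barrier_kcal, temperature=298.15):
    """Ratio k_low / k_high for two channels whose barriers differ by
    delta_barrier_kcal (assumes identical prefactors)."""
    return math.exp(delta_barrier_kcal / (R_KCAL * temperature))


def branching_fractions(relative_barriers, temperature=298.15):
    """Map channel -> fraction of flux, from barriers relative to any common zero."""
    rt = R_KCAL * temperature
    weights = {name: math.exp(-b / rt) for name, b in relative_barriers.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    # C5b barrier relative to C4b, as quoted in the text for the two basis sets.
    for basis, gap in (("BSII", 1.0), ("BSI", 2.9)):
        fractions = branching_fractions({"C4b": 0.0, "C5b": gap})
        print(basis, round(rate_ratio(gap), 1), fractions)
```

Under these assumptions the smaller BSII gap still favors C4b abstraction several-fold, while the BSI gap corresponds to a preference of roughly two orders of magnitude.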


The optimized geometries of <sup>5</sup>**TS**<sub>HA,C4b</sub>, <sup>5</sup>**TS**<sub>HA,C4f</sub>, and <sup>5</sup>**TS**<sub>HA,C5b</sub> (right-hand side of **Figure 7**) give insight into their energetic ordering. Thus, in <sup>5</sup>**TS**<sub>HA,C4b</sub> the transferring hydrogen atom is almost midway between the donor and acceptor atoms and the Fe–O–C4b angle is about 125°. In <sup>5</sup>**TS**<sub>HA,C4f</sub>, by contrast, the substrate is oriented along a much larger angle of 139.0°. The <sup>5</sup>**TS**<sub>HA,C5b</sub> structure, on the other hand, has the transferring hydrogen atom at a relatively large distance from the accepting oxygen atom and hence is destabilized considerably.

#### P4H Mutations of Arg<sup>161</sup>

To find out the effect of active site mutations on substrate positioning and catalytic turnover, we investigated several P4H models in which amino acids were replaced. In our previous studies (Timmins et al., 2017) we implicated an important role of Arg<sup>161</sup> and Glu<sup>127</sup> through hydrogen bonding interactions. In this section we look into the structure and catalytic activity of P4H mutants with Arg<sup>161</sup> replaced by either Asp, Gln, or Lys. These changes could be dramatic, as the Arg161Asp (R161D) mutation replaces a positively charged residue with a negatively charged one. Similarly, the Arg161Gln (R161Q) mutation changes a cationic residue into a neutral one.

**Figure 8** displays an overlay of the structures of the iron(IV)-oxo species for WT and the R161K mutant after a full QM/MM geometry optimization, where the positively charged Arg residue is replaced by the positively charged Lys amino acid. As can be seen, the mutation disrupts the salt bridge between Arg<sup>161</sup> and Glu<sup>127</sup>, and Glu<sup>127</sup> bends outward. The space provided by the Glu<sup>127</sup> migration is filled up with extra water molecules. However, the removal of the strong hydrogen bonding interactions of the Arg<sup>161</sup>-Glu<sup>127</sup> couple toward the substrate has a major effect on the stability of the substrate and its positioning. Thus, the substrate is less tightly bound in the R161K mutant than in WT and hence its regio- and stereoselective substrate activation may be affected.

Subsequently, we studied the hydrogen atom abstraction mechanisms of the R161K mutant from the C<sup>3</sup>, C<sup>4</sup>, and C<sup>5</sup> positions of proline for the back and front protons. **Figure 9** displays relative energies and optimized geometries of selected hydrogen atom abstraction transition states for the R161K mutant. The hydrogen atom abstraction barrier from the C4b position (<sup>5</sup>**TS**<sub>R161K,C4b</sub>) has an energy of 14.0 kcal mol<sup>−1</sup> (UB3LYP/BSI), which is almost identical to the 14.3 kcal mol<sup>−1</sup> observed for WT. Indeed, the optimized geometries are very similar: the C–H and O–H distances are 1.24 and 1.40 Å for R161K, whereas they are 1.22 and 1.37 Å, respectively, for WT (**Figure 7** above). However, a much lower transition state of only 2.7 kcal mol<sup>−1</sup> is found for activation of the C4f position. Therefore, the R161K mutation will not impair the catalytic performance of the enzyme: it should react faster than WT, but will give a reversal of stereochemistry and predominantly produce the S-4-hydroxyproline product instead.

Comparison of the <sup>5</sup>**TS**<sub>C4b</sub> transition state structures in R161K and WT shows that the positions of the active site components change little; however, the Fe–O–C<sup>4</sup> angle and N<sub>His</sub>–Fe–O–C<sup>4</sup> dihedral angle are different, namely 137.5° and −73.4° for WT and 130.7° and −67.5° for R161K, respectively. Therefore, this mutation allows the orientation of the substrate relative to the iron(IV)-oxo to change in such a way that the C4f position becomes accessible. The barrier for hydrogen atom abstraction from the C<sup>3</sup> position is lowered as compared to WT but is significantly higher in energy than the barrier <sup>5</sup>**TS**<sub>R161K,C4f</sub>. The barrier for C5b hydrogen atom abstraction is similar to that for C4b but now slightly lower in energy.
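Angles such as Fe–O–C and dihedrals such as N–Fe–O–C are computed directly from the Cartesian coordinates of an optimized structure. The following self-contained sketch shows one common way to do this. The function names are hypothetical, and the sign convention for the dihedral (positive for a clockwise rotation when looking from the second to the third atom) is an assumption, since the text does not state which convention was used.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def angle(p1, p2, p3):
    """Angle p1-p2-p3 in degrees, with the vertex at p2."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral(p1, p2, p3, p4):
    """Signed dihedral p1-p2-p3-p4 in degrees, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(b2, _cross(n1, n2))
    x = _norm(b2) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Made-up coordinates (in Å) purely for illustration, not taken from the paper.
    n_his, fe, o, c4 = (2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.6), (0.0, 2.0, 3.0)
    print(angle(fe, o, c4), dihedral(n_his, fe, o, c4))
```

Applied to the Fe, O, C and N<sub>His</sub> positions of two optimized transition states, the same functions give the kind of angle and dihedral comparison reported above.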

Thereafter, we studied the R161Q and R161D mutants, and **Figure 10** gives structures of the iron(IV)-oxo species as compared to WT. As can be seen, both mutations have a dramatic effect on substrate binding and positioning as a result of changes in the hydrogen bonding network between Glu<sup>127</sup>, R161D and surrounding residues located in the βII-βIII and β3-β4 loops. We then attempted to abstract hydrogen atoms from proline in the R161Q and R161D mutants. **Table 1** gives calculated barrier heights for several hydrogen atoms of proline. None of these barrier heights, however, is low enough in energy to be accessible at room temperature. Therefore, the R161Q and R161D mutants will be catalytically inactive. Consequently, the Arg<sup>161</sup> residue has a critical function in P4H enzymes in positioning the substrate in the correct orientation. This is done in conjunction with the Glu<sup>127</sup> residue, which forms hydrogen bonds with the protein loop of the substrate and makes sure it can approach the iron(IV)-oxo species.

#### P4H Mutations of Glu<sup>127</sup>

In a final set of calculations we investigated P4H mutants where Glu<sup>127</sup> is replaced by either Asp or Gln. **Figure 11** displays the QM/MM optimized iron(IV)-oxo species of WT versus the E127D and E127Q mutants. In E127D the hydrogen bond between Asp<sup>127</sup> and Arg<sup>161</sup> is broken and as a result Asp<sup>127</sup> swings out, subsequently affecting the positions of surrounding residues and the substrate. Interestingly, the hydrogen bond between Arg<sup>161</sup> and the substrate is maintained; however, the position of this residue shifts along with the change in the substrate position. This suggests that the function of Glu<sup>127</sup> is to anchor Arg<sup>161</sup> in a fixed position, which is essential for proper substrate positioning.

TABLE 1 | Calculated barrier heights (kcal mol<sup>−1</sup>) for several hydrogen atom transfers from the substrate proline residue to the iron(IV)-oxo oxidant in prolyl-4-hydroxylase.

Glu127 mutants (in cyan) as optimized with QM/MM. (A) WT vs. E127D and (B) WT vs. E127Q.

TABLE 2 | Calculated barrier heights (kcal mol<sup>−1</sup>) for several hydrogen atom transfers from the substrate proline residue to the iron(IV)-oxo oxidant in prolyl-4-hydroxylase.


In E127Q, there are notable changes around the active site; e.g., Tyr<sup>140</sup> rotates out of its WT position, breaking its hydrogen bond to the iron(IV)-oxo, which has previously been shown to be important for correct substrate positioning and release. Previous research (Koski et al., 2009) has suggested that any disturbance of this "conformational switch" would result in inactivation of the enzyme, as shown experimentally by its mutation to alanine. The position of Trp<sup>243</sup> is also altered, affecting substrate positioning even more, and consequently the E127Q mutation is likely to lead to an inactive form of the enzyme. To test this hypothesis, we explored hydrogen atom abstraction from various positions of the proline residue; the resulting barriers are given in **Table 2**.

In E127D, hydrogen atom abstraction (HAT) from C4b is no longer the clearly favored pathway: HAT from C5b is within 1 kcal mol<sup>−1</sup> and hence the two pathways are competitive. Considering the energies of the intermediate radical species in the mutant, it becomes clear that the C5b product (radical intermediate at −31.8 kcal mol<sup>−1</sup>) will be the major product over the C4b product (−2.9 kcal mol<sup>−1</sup>); therefore the E127D mutation leads to a change in the regioselectivity of the reaction. This is similar to what has been seen in previous DFT calculations, which showed that, if given the choice, HAT from the C<sup>5</sup> position will always be favored over the C<sup>4</sup> position as the C–H bond dissociation energy (BDE) at the former position is lower than at the latter (Karamzadeh et al., 2010). Comparison of the <sup>5</sup>**TS**<sub>C5b</sub> transition state structures in WT and E127D reveals similar geometric parameters; however, the Fe–O–C<sup>5</sup> angle and N<sub>His</sub>–Fe–O–C<sup>5</sup> dihedral angle are different, namely 125.4° and −77.8° for WT and 132.1° and −88.2° for E127D, respectively. As such, the orientation of the substrate relative to the iron(IV)-oxo has changed, allowing the substrate in E127D to adopt a more favorable position for HAT from the C5b position as compared to WT. In E127Q, HAT from the C5b position is favored compared to that from C4b, as shown also for E127D. However, as the transition state barrier is 20.1 kcal mol<sup>−1</sup> for HAT from the C4b position, the mutant will react much more slowly than WT.
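A rough sense of how much slower a 20.1 kcal mol<sup>−1</sup> barrier is than the 14.3 kcal mol<sup>−1</sup> WT barrier can be obtained from the Eyring equation, k = (k<sub>B</sub>T/h) exp(−ΔG<sup>‡</sup>/RT). The sketch below is only an order-of-magnitude illustration: it assumes that the QM/MM barrier heights can be used in place of activation free energies, takes a transmission coefficient of one, and assumes T = 298.15 K. None of these choices is taken from the original work.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J K^-1
H = 6.62607015e-34     # Planck constant, J s
R_KCAL = 1.987204e-3   # gas constant, kcal mol^-1 K^-1


def eyring_rate(barrier_kcal, temperature=298.15):
    """First-order rate constant (s^-1) from the Eyring equation,
    treating barrier_kcal as the activation free energy."""
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-barrier_kcal / (R_KCAL * temperature))


if __name__ == "__main__":
    k_wt = eyring_rate(14.3)      # WT C4b barrier quoted in the text (UB3LYP/BSI)
    k_e127q = eyring_rate(20.1)   # E127Q C4b barrier quoted in the text
    print(k_wt, k_e127q, k_wt / k_e127q)
```

With these assumptions the 5.8 kcal mol<sup>−1</sup> increase in barrier height corresponds to a slowdown of about four orders of magnitude.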

In conclusion, previous QM/MM studies on P4H have elucidated the reasoning behind its observed regioselectivity and stereoselectivity in the WT through various mutations to the protein. The results of that study highlighted the role of Glu<sup>127</sup> and Arg<sup>161</sup> in substrate positioning and suggested how mutating those residues could alter the regioselectivity and stereoselectivity with minimal influence on the enzyme's stability and catalytic ability, an important consideration for future biotechnological applications. This study has revealed that E127D and R161K are possible mutation candidates that will result in a change in the regioselectivity and stereoselectivity. Additionally, the work highlights the importance of conserving the charge of the substrate binding residues around the substrate cavity to ensure that interactions between the substrate and protein are maintained when an amino acid is mutated.

#### CONCLUSION

Here we describe a detailed computational study into the activity of prolyl-4-hydroxylase enzymes and several Glu<sup>127</sup> and Arg<sup>161</sup> mutants. In particular, a comprehensive QM/MM study is presented, whereby we investigated hydrogen atom abstraction channels of each pair of hydrogen atoms bound to C<sup>3</sup>, C<sup>4</sup>, and C<sup>5</sup> of the proline residue of the substrate. Studies on WT predict the experimentally observed product distributions and give regio- and enantioselective R-4-hydroxyproline as a product. Analysis of the structure and electronic configurations shows the regioselectivity to be guided by substrate positioning and hydrogen bonding interactions. Mutations of Glu<sup>127</sup> and Arg<sup>161</sup> have major effects and lead to inactivity of the protein in several cases, in particular when the charge of the residue is not conserved. Only in the case of the E127D mutant does significant activity remain, although competitive C4b and C5b hydroxylation is predicted.

#### REFERENCES

#### AUTHOR CONTRIBUTIONS

AT and SdV designed and developed the project. AT performed the calculations. AT and SdV wrote the paper.

#### FUNDING

The work was funded through a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) under grant code BB/J014478/1.


they circumvent spin-forbidden oxygenation of their substrates. J. Am. Chem. Soc. 137, 7474–7487. doi: 10.1021/jacs.5b03836


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer, TM, and handling Editor declared their shared affiliation.

Copyright © 2017 Timmins and de Visser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chloride Ion Transport by the *E. coli* CLC Cl−/H<sup>+</sup> Antiporter: A Combined Quantum-Mechanical and Molecular-Mechanical Study

Chun-Hung Wang, Adam W. Duster, Baris O. Aydintug, MacKenzie G. Zarecki and Hai Lin\*

Department of Chemistry, University of Colorado Denver, Denver, CO, United States

#### *Edited by:*

Thomas S. Hofer, University of Innsbruck, Austria

#### *Reviewed by:*

Xavier Assfeld, Université de Lorraine, France Gerardo Andres Cisneros, University of North Texas, United States

> *\*Correspondence:* Hai Lin hai.lin@ucdenver.edu

#### *Specialty section:*

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

*Received:* 23 January 2018 *Accepted:* 26 February 2018 *Published:* 13 March 2018

#### *Citation:*

Wang C-H, Duster AW, Aydintug BO, Zarecki MG and Lin H (2018) Chloride Ion Transport by the E. coli CLC Cl−/H<sup>+</sup> Antiporter: A Combined Quantum-Mechanical and Molecular-Mechanical Study. Front. Chem. 6:62. doi: 10.3389/fchem.2018.00062

We performed steered molecular dynamics (SMD) and umbrella sampling simulations of Cl<sup>−</sup> ion migration through the transmembrane domain of a prototypical E. coli CLC Cl−/H<sup>+</sup> antiporter by employing combined quantum-mechanical (QM) and molecular-mechanical (MM) calculations. The SMD simulations revealed interesting conformational changes of the protein. While no large-amplitude motions of the protein were observed during pore opening, the side chain rotation of the protonated external gating residue Glu148 was found to be critical for full access of the channel entrance by Cl−. Moving the anion into the external binding site (Sext) induced small-amplitude shifting of the protein backbone at the N-terminal end of helix F. As Cl<sup>−</sup> traveled through the pore, rigid-body swinging motions of helix R separated it from helix D. Helix R returned to its original position once Cl<sup>−</sup> exited the channel. Population analysis based on the polarized wavefunction from QM/MM calculations revealed significant (up to 20%) charge loss for Cl<sup>−</sup> along the ion translocation pathway inside the pore. The delocalized charge was redistributed onto the pore residues, especially the functional groups containing π bonds (e.g., the Tyr445 side chain), while the charges of the H atoms coordinating Cl<sup>−</sup> changed almost negligibly. Potentials of mean force computed from umbrella sampling at the QM/MM and MM levels both displayed barriers at the same locations near the pore entrance and exit. However, the QM/MM PMF showed higher barriers (∼10 kcal/mol) than the MM PMF (∼2 kcal/mol). Binding energy calculations indicated that the interactions between Cl<sup>−</sup> and certain pore residues were overestimated by the semi-empirical PM3 Hamiltonian and underestimated by the CHARMM36 force fields, both of which were employed in the umbrella sampling simulations. In particular, CHARMM36 underestimated binding interactions for the functional groups containing π bonds, missing the stabilizations of the Cl<sup>−</sup> ion due to electron delocalization. The results suggested that it is important to explore these quantum effects for accurate descriptions of the Cl<sup>−</sup> transport.

Keywords: QM/MM, CLC, chloride transport, electron delocalization, conformational change, umbrella sampling, steered molecular dynamics, potential of mean force

# INTRODUCTION

CLC transport proteins are a ubiquitous group of Cl<sup>−</sup> ion channels and Cl−/H<sup>+</sup> antiporters that can be found in eukaryotes and bacteria. They are associated with many critical physiological and cellular processes such as aiding extreme acid-resistance response, cell-volume regulation, and muscle contraction (Maduke et al., 2000; Dutzler, 2006; Accardi and Picollo, 2010; Jentsch, 2015). Mutations in CLC proteins cause inherited diseases in humans, such as myotonia congenita, Dent's disease, Bartter's syndrome, osteopetrosis, and idiopathic epilepsy (Jentsch, 2008). CLC transport proteins assemble and function as homodimers, of which each monomer subunit contains an independent ion translocation pore (Chen, 2005). The architecture of the transmembrane domain is highly conserved across the CLC family (Estévez et al., 2003; Lin and Chen, 2003; Chen, 2005; Engh and Maduke, 2005). While CLC channels translocate Cl<sup>−</sup> ions passively, CLC antiporters mediate the coupled influx of Cl<sup>−</sup> and the efflux of H<sup>+</sup> with a 2Cl−/1H<sup>+</sup> ratio (Accardi and Miller, 2004; Picollo and Pusch, 2005; Scheel et al., 2005; Matulef and Maduke, 2007; Miller and Nguitragool, 2009; Feng et al., 2012; Accardi, 2015).

A prototype CLC antiporter is EcCLC from Escherichia coli, which has been subjected to extensive structural and functional studies (Dutzler et al., 2002, 2003; Accardi and Miller, 2004; Accardi et al., 2005, 2006; Nguitragool and Miller, 2006; Walden et al., 2007; Jayaram et al., 2008; Elvington et al., 2009; Lim and Miller, 2009; Miller and Nguitragool, 2009; Picollo et al., 2009, 2012; Robertson et al., 2010; Lim et al., 2012; Vien et al., 2017). Each subunit of EcCLC contains 18 α-helices, which form the ion-permeation path. In each subunit, there are three binding locations: the intracellular binding site Sint, the central binding site Scen, and the extracellular binding site Sext. In Sint, the Cl<sup>−</sup> ion is coordinated by the backbone amine groups from Gly106 and Ser107 and may be partially hydrated (Dutzler et al., 2003; Gouaux and MacKinnon, 2005). In Scen, the Cl<sup>−</sup> ion is coordinated by the side chain hydroxyl groups of Ser107 in helix D and of Tyr445 in helix R as well as the backbone amine groups from Ile356 and Phe357 of helix N (Dutzler et al., 2002, 2003). In the wild-type protein crystal structure, Sext is occupied by the side chain of a highly conserved residue Glu148 (Dutzler et al., 2003). However, mutation of Glu148 to a charge-neutral residue results in the trapping of one Cl<sup>−</sup> ion in Sext, where the ion coordinates with the backbone amine groups of Arg147 to Gly149 and Ile356 to Ala358 (Dutzler et al., 2003). This distinct arrangement, where the anions are extensively coordinated by the backbone amide groups, is known as the "broken-helix" architecture (Dutzler et al., 2002, 2003; Feng et al., 2010). Interestingly, the charge-neutralization mutation at Glu148 converts EcCLC to a Cl<sup>−</sup> channel (Dutzler et al., 2003; Accardi and Miller, 2004) and similar phenomena have been observed for mammalian ClC-4 and ClC-5 proteins (Zdebik et al., 2008).
It has been suggested that the wild-type crystal structure represents the closed state and the E148Q mutation structure, the open state (Dutzler et al., 2003).

The structural information provides an important starting point for comprehending the ion binding and permeation. A number of computational modeling and simulation studies of CLC transport proteins have been carried out to study the mechanisms of ion transfer (Bostick and Berkowitz, 2004; Cohen and Schulten, 2004; Corry et al., 2004; Faraldo-Gómez and Roux, 2004; Miloshevsky and Jordan, 2004; Yin et al., 2004; Bisset et al., 2005; Gervasio et al., 2006; Cheng et al., 2007; Engh et al., 2007a,b; Kuang et al., 2007; Wang and Voth, 2009; Coalson and Cheng, 2010, 2011; Ko and Jo, 2010a,b; Miloshevsky et al., 2010; Kieseritzky and Knapp, 2011; Smith and Lin, 2011; Zhang and Voth, 2011; Cheng and Coalson, 2012; Picollo et al., 2012; Yu et al., 2012; Church et al., 2013; Nieto-Delgado et al., 2013; Han et al., 2014; Pezeshki et al., 2014; Chen and Beck, 2016; Jiang et al., 2016; Khantwal et al., 2016; Lee et al., 2016a,b,c; Chenal and Gunner, 2017). Regarding Cl<sup>−</sup> transport, all studies agree that the protonation of the external gating residue Glu148 is key to antiporter activation and that Ser107 and Tyr445 form an internal gate. Moreover, the reported free-energy barriers for Cl<sup>−</sup> translocation are relatively low, ranging from 3 to 8 kcal/mol (Cohen and Schulten, 2004; Bisset et al., 2005; Gervasio et al., 2006; Ko and Jo, 2010a; Cheng and Coalson, 2012). However, some issues are still controversial, such as whether the crystal structures of the E148A and E148Q mutants correspond to the truly open state (Dutzler et al., 2003; Miloshevsky and Jordan, 2004) and how protonation of E148 is coupled to Cl<sup>−</sup> binding (Dutzler et al., 2003; Bostick and Berkowitz, 2004; Miloshevsky and Jordan, 2004; Yin et al., 2004; Gervasio et al., 2006; Ko and Jo, 2010b). The disagreements are partly due to differences in the employed methodology, constructed models, and adopted parameters.
While these calculations shed light on the operating mechanisms of CLC transport, complementing experimental measurements, many molecular details are yet to be elucidated.

One long-standing puzzle concerns the extent of protein conformational change associated with Cl<sup>−</sup> ion passage. The crystal structures of the protein are highly similar in the closed and open states (Dutzler et al., 2003). This observation prompted the proposal that only local side chain motions are involved in the EcCLC operation. This mechanism differs fundamentally from that of other antiporters, where significant motions of helixes or even domains are necessary (Feng et al., 2010, 2012). On the other hand, recent experiments on mutants that are restrained through cross linking between selected helixes suggested that rigid-body movements of certain helixes play a role (Basilio et al., 2014; Khantwal et al., 2016). Molecular dynamics (MD) simulations so far have not observed significant global conformational changes of the protein, but this could be because global rearrangements occur on time scales longer than those the simulations have covered. Indeed, normal-mode vibrational analysis suggested that large-amplitude swinging of helixes A and R may be one such conformational change (Miloshevsky et al., 2010). It is possible that local and global conformational changes both contribute to some extent.

Carrying a significant charge, Cl<sup>−</sup> can strongly polarize nearby solvent molecules and pore residues. Moreover, under certain circumstances, the excess charge of Cl<sup>−</sup> can easily delocalize to the ion's surroundings. Previous quantum-mechanical (QM) calculations using truncated models of EcCLC have indeed revealed substantial mutual polarization and partial charge transfer for the Cl<sup>−</sup> ions in the binding sites (Smith and Lin, 2011; Church et al., 2013; Nieto-Delgado et al., 2013). The most prominent manifestations of polarization and charge transfer were found among the π orbitals of the nearby protein atoms, e.g., the atoms of the backbone peptide links and of the side chains of Glu148 and Tyr445. Furthermore, energy decomposition analysis confirmed that the quantum effects of electron delocalization over Cl<sup>−</sup> and the atoms bearing π orbitals contributed significantly to the stabilization of the Cl<sup>−</sup> ions in the binding sites (Church et al., 2013). It has also been demonstrated that such induction effects can impact the ion binding and translocation in cation channels (Compoint et al., 2004; Allen et al., 2006; Huetz et al., 2006; Bucher et al., 2007, 2010; Dudev and Lim, 2009; Illingworth and Domene, 2009; Bostick and Brooks, 2010; Illingworth et al., 2010; Varma and Rempe, 2010; Roux et al., 2011; Wang et al., 2012). Do the extensive charge redistributions of Cl<sup>−</sup> affect its transport in EcCLC? It will be interesting to find out.

In this study, we investigate possible protein conformational changes and explore how electron delocalization affects the ion's passage through EcCLC. We carry out dynamics simulations at both the molecular mechanical (MM) and the combined quantum-mechanical/molecular-mechanical (QM/MM) levels (Warshel and Levitt, 1976; Field et al., 1990; Gao, 1996; Zhang et al., 1999; Rode et al., 2006; Lin and Truhlar, 2007; Senn and Thiel, 2007; van der Kamp and Mulholland, 2013; Pezeshki and Lin, 2015; Duster et al., 2017). We perform steered molecular dynamics (SMD) simulations (Izrailev et al., 1998) to escort a Cl<sup>−</sup> ion through the pore. We examine the changes in pore size and protein conformations as the ion moves through and estimate the associated potential of mean force (PMF) using umbrella sampling (Torrie and Valleau, 1977; Roux, 1995). By comparing and combining the MM and QM/MM results, this study will deepen our understanding of the Cl<sup>−</sup> translocation process operated by EcCLC.

# COMPUTATIONAL DETAILS

#### Model Preparation

The model system was constructed based on the crystal structure of the wild-type (WT) protein's transmembrane domain (PDB code: 1OTS) (Dutzler et al., 2003). Because the two subunits each function independently, only one subunit (chain A) was used to build the model system. The protonation states of the residues were set according to an earlier Poisson-Boltzmann calculation (Faraldo-Gómez and Roux, 2004) except for the external gate Glu148, which was protonated. The protonated Glu148 was manually rotated outwards to open the pore. Therefore, we were simulating the open state of the protein. The protein with the two bound Cl<sup>−</sup> ions in the Sint and Scen sites was embedded in a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphoethanolamine (POPE) lipid bilayer. The protein-membrane complex was then solvated by adding water molecules on both sides. The thickness of the slab of water on either side was about 20 Å. We then replaced 109 randomly selected water molecules by 51 K<sup>+</sup> and 58 Cl<sup>−</sup> ions to neutralize the total charge and to achieve an approximate 0.15 M physiological salt concentration for the solution. The primary cell of the model system consists of 6823 protein atoms, 18308 water molecules, 60 Cl<sup>−</sup> ions, 51 K<sup>+</sup> ions, and 292 POPE molecules. The protein, lipid, and ions were described by the CHARMM36 force fields (MacKerell et al., 1998; Mackerell et al., 2004; Klauda et al., 2010; Best et al., 2012), and water by the TIP3P model (Jorgensen et al., 1983).
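The quoted salt concentration can be checked against the composition of the primary cell with a short calculation. The sketch below estimates the molarity from the ion-to-water ratio, assuming a bulk water molarity of about 55.5 mol/L (an approximation that ignores the volume of the ions and the temperature dependence of the water density). Function names are illustrative.

```python
WATER_MOLARITY = 55.5  # mol/L, approximate value for liquid water (assumption)


def ion_molarity(n_ions, n_waters):
    """Estimate the ion concentration (mol/L) from the ion:water number ratio."""
    return n_ions / n_waters * WATER_MOLARITY


def net_ion_charge(n_cations, n_anions):
    """Net charge (in e) carried by monovalent cations and anions."""
    return n_cations - n_anions


if __name__ == "__main__":
    # Primary-cell composition quoted in the text.
    n_water, n_k, n_cl = 18308, 51, 60
    print("K+ concentration (M):", ion_molarity(n_k, n_water))
    print("net ionic charge (e):", net_ion_charge(n_k, n_cl))
```

The K<sup>+</sup> count corresponds to roughly 0.15 M, in line with the stated target. The net ionic charge is negative, so for the cell to be neutral the remainder of the system must carry the compensating positive charge.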

The system was equilibrated under the NpT ensemble at p = 1 bar and T = 310 K for 10 ns, followed by the NVT ensemble at T = 310 K for 45 ns, with final unit cell dimensions of 101.3 × 99.0 × 92.9 Å<sup>3</sup>. Temperature was controlled through Langevin dynamics (Phillips et al., 2005) with the temperature damping coefficient set to 1.0 ps<sup>−1</sup>, and pressure through the Langevin piston Nosé-Hoover method (Martyna et al., 1994; Feller et al., 1995) with a barostat oscillation period of 175 fs and a damping time of 150 fs. Periodic boundary conditions were employed, and long-range electrostatic interactions were computed by the particle mesh Ewald (PME) method (Darden et al., 1993; Essmann et al., 1995). A 14.0 Å cutoff was used for nonbonded interactions, with a smoothing switch at 13.0 Å and a pair list cutoff at 16.0 Å. The SHAKE algorithm (Ryckaert et al., 1977) was used to make waters rigid as well as to constrain all bonds between hydrogen and heavy atoms. A time step of 2 fs was used. The equilibrations were performed by using the NAMD (Phillips et al., 2005) program version 2.10. The saved trajectories were visually inspected using the program VMD (Humphrey et al., 1996).
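For readers who wish to set up a comparable NpT equilibration, the settings described above map onto standard NAMD configuration keywords roughly as sketched below. This is our reading of the text, not the authors' input file: the mapping of each quantity to a keyword is an assumption, and the fragment is incomplete (structure, parameter, cell, PME grid, and output settings are omitted). The Python helper merely collects the values and renders them as configuration lines.

```python
# Assumed mapping of the settings quoted in the text onto NAMD keywords.
# Not the authors' input file; many required entries are deliberately omitted.
NPT_SETTINGS = {
    "timestep": 2.0,               # fs
    "rigidBonds": "all",           # constrain bonds to hydrogen, rigid water
    "cutoff": 14.0,                # Å
    "switching": "on",
    "switchdist": 13.0,            # Å
    "pairlistdist": 16.0,          # Å
    "PME": "yes",
    "langevin": "on",
    "langevinTemp": 310,           # K
    "langevinDamping": 1.0,        # ps^-1
    "langevinPiston": "on",
    "langevinPistonTarget": 1.0,   # bar
    "langevinPistonPeriod": 175.0, # fs
    "langevinPistonDecay": 150.0,  # fs
    "langevinPistonTemp": 310,     # K
}


def check_cutoffs(settings):
    """The switching distance must lie below the cutoff, and the cutoff
    below the pair list distance."""
    return settings["switchdist"] < settings["cutoff"] < settings["pairlistdist"]


def render(settings):
    """Render the settings as aligned 'keyword value' lines."""
    width = max(len(key) for key in settings)
    return "\n".join(f"{key:<{width}} {value}" for key, value in settings.items())


if __name__ == "__main__":
    assert check_cutoffs(NPT_SETTINGS)
    print(render(NPT_SETTINGS))
```

For the subsequent NVT stage, the barostat entries would be dropped while the thermostat entries stay the same.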

#### Steered Molecular Dynamics

The Cl<sup>−</sup> transport by EcCLC is complicated by its coupling with the H<sup>+</sup> migration, for which a 2Cl−/1H<sup>+</sup> ratio has been established (Accardi and Miller, 2004; Picollo and Pusch, 2005; Scheel et al., 2005; Matulef and Maduke, 2007; Miller and Nguitragool, 2009; Feng et al., 2012; Accardi, 2015). However, exactly how the two Cl<sup>−</sup> ions are transferred in every cycle is still under debate. As such, we only considered the uncoupled translocation of one Cl<sup>−</sup> ion to simplify the analysis, which suffices for the purpose of this work.

The parameters and program used in the model system equilibration were used in the constant-velocity SMD (Izrailev et al., 1998) simulations, except those indicated below. Before the SMD simulations, the Cl<sup>−</sup> ion at Scen in the equilibrated model was relocated to and fixed at the extracellular side at z = 0, and the model system was re-equilibrated for 2 ns. The resulting geometry, where the Cl<sup>−</sup> ion stayed just outside the extracellular pore entrance, served as the starting geometry for the SMD simulation (see **Figure 1**). For the SMD simulations, the Cl<sup>−</sup> ion was dragged through the pore toward the intracellular side from z = 0.0 to −20.0 Å. Three steering speeds were tested: 2, 5, and 10 Å/ns. (As a comparison, the slowest speed applied in a previous study (Ko and Jo, 2010a) was 10 Å/ns.) The steering forces were applied along the –z direction, and the spring force constants were set to 10 kcal/mol/Å<sup>2</sup>. The trajectories were saved every 100 steps. The Cα atoms of randomly selected protein residues Gly141, Thr226, Ala325, and Leu421, which belong to helixes E, I, L, and Q, respectively, were restrained to their initial positions by harmonic potentials with force constants of 1.0 kcal/mol/Å<sup>2</sup> to prevent the protein from translating with the Cl<sup>−</sup> ion. The simulations were conducted with constant volume and constant temperature. The time step was set to 1.0 fs, as the SHAKE algorithm was not used (the same applied to umbrella sampling below).
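The constant-velocity steering described above can be illustrated with a minimal sketch. It assumes the usual harmonic-spring form of SMD, in which a virtual anchor moves at constant speed along −z and the ion feels a force proportional to its lag behind the anchor. The function names are hypothetical, and the convention F = −k(z − z_anchor) is an assumption rather than a statement of the exact NAMD implementation.

```python
def smd_anchor_z(t_ns, z_start=0.0, speed=2.0):
    """Anchor position (in Å) after t_ns nanoseconds of pulling along -z."""
    return z_start - speed * t_ns


def smd_force_z(z_ion, t_ns, k=10.0, z_start=0.0, speed=2.0):
    """Harmonic steering force on the ion along z (kcal/mol/Å).

    Assumed form: F = -k * (z_ion - z_anchor), with k in kcal/mol/Å^2.
    """
    return -k * (z_ion - smd_anchor_z(t_ns, z_start, speed))


# Time needed to pull the ion over 20 Å at each tested speed,
# and the corresponding number of 1 fs integration steps.
pull_length = 20.0  # Å
durations_ns = {speed: pull_length / speed for speed in (2.0, 5.0, 10.0)}
steps = {speed: round(t * 1_000_000) for speed, t in durations_ns.items()}

for speed, t in durations_ns.items():
    print(f"{speed:4.1f} Å/ns -> {t:4.1f} ns, {steps[speed]} steps")
```

At the slowest speed the pull takes 10 ns, which matches the 10 ns time axis discussed for the 2 Å/ns trajectory in the Results.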

# Umbrella Sampling

The starting geometries for umbrella sampling calculations were extracted from the trajectory of the SMD simulations at the slowest speed of 2 Å/ns. The z component of the Cl<sup>−</sup> ion's coordinate was chosen to be the reaction coordinate. In total, there were 41 sampling windows covering 0 ≥ z ≥ −20.0 Å, with equal spacing of 0.5 Å, and in each window the Cl<sup>−</sup> ion was initially located approximately at the window center. The simulations were done at both the QM/MM and MM levels. The PMF was obtained using the weighted histogram analysis method (WHAM) (Kumar et al., 1992).
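To make the window layout and the WHAM step concrete, the following is a minimal one-dimensional sketch, not the analysis code used in this work. It assumes a bias of the form ½k(z − z₀)², histogram-based WHAM with the bias evaluated at bin centers, and synthetic samples drawn for a flat underlying PMF; all function names are hypothetical.

```python
import numpy as np

KT = 0.0019872 * 310.0  # kcal/mol at 310 K


def window_centers(z_max=0.0, z_min=-20.0, spacing=0.5):
    """Equally spaced window centers from z_max down to z_min (inclusive)."""
    n = int(round((z_max - z_min) / spacing)) + 1
    return np.linspace(z_max, z_min, n)


def wham(samples, centers, k, bin_edges, kT=KT, tol=1e-7, max_iter=10000):
    """Minimal histogram WHAM; returns bin centers and the PMF (kcal/mol).

    Assumes the harmonic bias 0.5 * k * (z - center)**2 in every window.
    """
    mids = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    hist = np.array([np.histogram(s, bins=bin_edges)[0] for s in samples])
    n_samples = hist.sum(axis=1)                      # samples per window
    bias = 0.5 * k * (mids[None, :] - centers[:, None]) ** 2
    boltz = np.exp(-bias / kT)                        # shape (windows, bins)
    f = np.zeros(len(centers))                        # window free energies
    for _ in range(max_iter):
        denom = (n_samples[:, None] * np.exp(f[:, None] / kT) * boltz).sum(axis=0)
        prob = hist.sum(axis=0) / denom               # unbiased distribution
        f_new = -kT * np.log((boltz * prob[None, :]).sum(axis=1))
        f_new -= f_new[0]
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    with np.errstate(divide="ignore"):
        pmf = -kT * np.log(prob)
    pmf -= pmf[np.isfinite(pmf)].min()
    return mids, pmf


# Synthetic test data: for a flat PMF, each biased window is a Gaussian
# centered at the window center with variance kT / k.
rng = np.random.default_rng(0)
centers = window_centers()
k_bias = 5.0  # kcal/mol/Å^2, the value quoted for the MM windows
sigma = np.sqrt(KT / k_bias)
samples = [rng.normal(c, sigma, 20000) for c in centers]
edges = np.linspace(-20.0, 0.0, 81)
mids, pmf = wham(samples, centers, k_bias, edges)
print(len(centers), float(pmf.max()))
```

Because the synthetic data correspond to a flat free-energy profile, the recovered PMF should be close to zero everywhere, which gives a quick sanity check on the implementation before it is applied to real trajectories.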

For the QM/MM umbrella sampling, the QM subsystem covered the entire pore section that the Cl<sup>−</sup> ion traverses. This included the Cl<sup>−</sup> ion, the pore residues (Gly106 to Pro110, Glu111 backbone, Leu145 to Pro150, Thr151 backbone, Gly354 to Ala358, Leu444, and Tyr445), selected residues near the extracellular pore entrance (Ala188 to Phe190 and Gly314 to Gly317), and 7 water molecules in the vicinity of the Cl<sup>−</sup> ion (**Figure 2**). For simplicity, the mechanical-embedding scheme with H as link-atom was adopted (Bakowies and Thiel, 1996; Lin and Truhlar, 2005). For the first and last residues in a list of consecutive residues in the QM subsystem, the amine and carbonyl groups were replaced by H atoms, respectively. When the QM/MM boundary separated backbone and side chain, H atoms were used to cap the Cα (or Cβ) atom. To prevent MM water molecules from diffusing into the QM region, the MM water molecules adjacent to the QM subsystem (within 10.0 Å from the QM subsystem) were restrained by imposing harmonic potentials to their O atoms with force constants of 1.0 kcal/mol/Å<sup>2</sup>. The QM subsystem thus covered the entire pore section that the Cl<sup>−</sup> ion passed through. For computational efficiency, the PM3 (Stewart, 1989a,b) semiempirical Hamiltonian was chosen to model the QM subsystem. Higher-level electronic-structure methods are more accurate, but they were not employed for umbrella sampling due to their much higher computational costs. Periodic boundary conditions with a minimum-image convention were employed, where the cutoffs for electrostatic and van der Waals interactions were set to 14.0 Å with smoothing switches at 13.0 Å. For each window, equilibration of 10 ps and production of 200 ps were performed. A Nosé-Hoover thermostat (Hoover, 1985; Kreis et al., 2016) with a coupling constant of 4.0 fs was applied to control the temperature. The force constants of the bias harmonic potential were set to 14 kcal/mol/Å<sup>2</sup>. The QM/MM simulations were done using a local version of the QMMM program (Lin et al., 2017), which called the MNDO program (Thiel, 2005) for QM calculations and NAMD for MM calculations, synthesized the QM and MM gradients, and propagated the trajectory.

For the MM umbrella sampling, longer simulations could be afforded, and 0.5 ns equilibration followed by 2.0 ns production run was performed for each window. The force constants of the bias harmonic potential were set to 5 kcal/mol/Å<sup>2</sup> . The same setups and program used in the model system equilibration were again used here, unless otherwise noted.

# Pore Radius Calculations

The pore radii at the Cl<sup>−</sup> ion locations were calculated for representative geometries extracted from umbrella sampling simulations. This analysis was carried out for both the MM and QM/MM simulations. In either case, 7 snapshots were extracted from the production phase of the trajectory for each of the 41 windows. The pore radii were computed utilizing the program HOLE (Smart et al., 1996), where a Monte Carlo-based search algorithm was employed to identify the best route for a sphere with a given radius to pass through the protein. The "hard-core" radii in Turano et al. (1992), which implicitly account for the thermal motions of atoms, were adopted for the calculations. The results of the 7 representative geometries were then averaged for a given window.

# Electron Delocalization

To characterize the Cl<sup>−</sup> electron delocalization during the ion translocation along the pore, electrostatic-embedding QM/MM single-point calculations were carried out on one representative geometry for each window of QM/MM umbrella sampling. The QM subsystem was the same as the one in the umbrella sampling except that the selected residues near the pore entrance (Ala188 to Phe190 and Gly314 to Gly317) were excluded; these residues were treated by MM due to computational cost considerations. We employed the charge-redistribution scheme for the boundary treatment (Lin and Truhlar, 2005). The B3LYP (Becke, 1988, 1993; Lee et al., 1988) functional with the 6-31+G(d) basis set (Ditchfield et al., 1971; Hehre et al., 1972; Francl et al., 1982; Clark et al., 1983) was used for the QM description, and the MM force fields were the same as those in the umbrella sampling. All MM atoms within 12.0 Å from the QM subsystem were included as background point charges in the embedded-QM calculations. The calculations were performed using the local version of QMMM, which called Gaussian09 (Frisch et al., 2010) for the QM calculations. The Löwdin charges (Löwdin, 1950; Wang et al., 2013) and the natural charges based on Natural Bond Orbital analysis (Foster and Weinhold, 1980; Reed et al., 1985) were computed using the polarized electron density of the QM subsystem. We tested and found that including more MM background point charges in the embedded-QM Hamiltonian did not noticeably change the atomic charges for the QM atoms.

# RESULTS

# Equilibrated Model System

During equilibration, the protonated Glu148 side chain, which had been manually rotated out of Sext, moved toward Sext and partially obstructed the pore entrance, but it did not enter the pore to reclaim the binding site. Located at the transition area between the intracellular aqueous solution and the pore, the Cl<sup>−</sup> ion at Sint drifted away during the equilibration. This is in line with the weak binding affinity of Cl<sup>−</sup> at this site (Lobet and Dutzler, 2006), suggesting that Sint is less critical than Sext and Scen to the Cl<sup>−</sup> transport. Attracted by the positively charged Arg147, Cl<sup>−</sup> ions from the bulk often diffused to near the pore entrance on the extracellular side and stayed there for up to 1 ns. This additional weak binding site has been reported in previous computational studies, although its exact location varied in the literature (Bostick and Berkowitz, 2004; Cohen and Schulten, 2004; Faraldo-Gómez and Roux, 2004; Yin et al., 2004; Gervasio et al., 2006; Smith and Lin, 2011; Church et al., 2013). It is quite possible that this additional binding site plays a role in the Cl<sup>−</sup> ion recruitment.

# Conformational Changes Along Cl<sup>−</sup> Translocation

Our analysis of the protein conformational changes will put emphasis on the SMD trajectories, where a complete and continuous voyage was simulated for the Cl<sup>−</sup> ion through the pore.

#### Global Conformational Changes of the Protein

Generally speaking, the SMD trajectories of the three tested steering speeds showed rather similar changes in the protein conformation, but the variations became more prominent as the steering speed decreased. The overall conformational changes can be quantified by root-mean-square-deviation (RMSD) values of the protein backbones, which are plotted as functions of the simulation progress (Figure S1). All trajectories show a trend of increasing RMSD over time. While the trend was moderate at the speeds of 5 and 10 Å/ns, it was considerably enhanced at 2 Å/ns. While this could be incidental, we suspect that it is at least partly due to the longer relaxation time permitted for the model system at a slower steering speed. As such, we will present and discuss the data obtained with the slowest speed of 2 Å/ns, while bringing in the results of 5 and 10 Å/ns only when necessary.
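For readers unfamiliar with the RMSD metric, the sketch below shows one common way to compute a backbone RMSD between two frames after optimal rigid-body superposition (the Kabsch algorithm). Whether and how frames were superimposed in this work is not stated, so the alignment step is an assumption, and the function name is hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimally rotating P onto Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    P = P - P.mean(axis=0)          # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                     # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])      # guards against improper rotations
    R = Vt.T @ D @ U.T              # optimal rotation (acts on column vectors)
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))


# Toy check: a rotated and translated copy of a structure has RMSD ~ 0.
rng = np.random.default_rng(1)
ref = rng.normal(size=(20, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
moved = ref @ Rz.T + np.array([1.0, -2.0, 3.0])
print(kabsch_rmsd(ref, moved))
```

Applied frame by frame against the starting structure, such a function yields RMSD-versus-time curves of the kind discussed here.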

First, let us look at the global backbone movements induced by the Cl<sup>−</sup> passage. **Figure 3** shows the orientations of the four helixes (D, F, N, and R) that define the pore in selected snapshots extracted from the SMD trajectory. In **Figure 4**, we display, vs. the simulation time t, the Cartesian coordinates of the Cl<sup>−</sup> ion as well as the backbone RMSD values of the protein and of these four helixes. (See Figure S2 for the backbone RMSD values of the complete set of all helixes).

As evident from **Figure 4**, the z coordinate of the Cl<sup>−</sup> ion varied linearly as a result of constant-velocity steering. The journey of the ion can be divided into three stages. In the first 2 ns, which we call the "entrance" stage, the ion wandered around in the extracellular vestibule just outside the entrance (z ∼ −4 Å) before finding its way into the pore. In the second "central" stage of t = 2–8 ns, the ion walked essentially straight down the path, as manifested by the approximately constant x and y coordinates. During this section, the ion traveled through Sext at z ∼ −6 Å and Scen at z ∼ −10 Å. The ion's arrival at the kink (z ∼ −16 Å) at t = ∼8 ns marked the beginning of the last "exit" stage. The deviation in the y coordinate indicated that the pore is curved along this part. On its way out, the ion passed Sint (z ∼ −18 Å) at t = ∼9 ns.

The protein backbone RMSD increased slowly in the first 4 ns, implying that there were relatively small changes in the protein conformation during the entrance and early-central stages. The RMSD then grew rapidly in the next 4 ns from ∼1 Å to ∼2.5 Å, remained there for ∼1 ns, and dropped sharply down to 1.5 Å in the last 1 ns. The substantial rise and fall in the RMSD in 4–10 ns indicated significant conformational changes of the protein over this period. The conclusion is exemplified by the representative snapshots in **Figure 3**, where large-amplitude rigid-body displacements of part of the protein can be seen during this time. Three helixes were identified in the large-amplitude rigid-body displacements induced by the Cl<sup>−</sup> penetration: helix R most prominently and helixes D and F moderately, all of which are critical to the ion passage (**Figure 4B**).

• Helix R: This helix, which contains the internal gating residue Tyr445, exhibited marked swinging motions similar to a door opening and closing. This motion corresponds with a normal mode presented in an earlier study (Miloshevsky et al., 2010). It appeared that the movement of Tyr445 propagated through the helix, giving rise to the observed big changes in the backbone RMSD. Separating helixes R and D, the outward swinging reached its maximum at ∼8.7 ns. After that, the ion, which had been following the tilted path since t ∼ 7.9 ns, exited the pore, and helixes R and D returned to their original positions (though not completely restored in the next 2 ns). Because helix R extrudes into the intracellular solution, it is less restrained than the other helixes by the rest of the protein or by the membrane, thus possessing larger flexibility. However, one limitation of the current model is that it contains only one subunit. In the dimer structure, helix R may interact with helix A from the other subunit. Moreover, the C-terminal domain, which was not included in this model, is connected to helix R and may also reduce its mobility. Therefore, the motion amplitude of helix R was likely exaggerated here.


We note that helix A backbone RMSD also exhibited significant changes. Similar to helix R, helix A was exposed to the solvent and thus was quite mobile. However, the trend of its backbone RMSD change was not consistent across the three simulations with various steering speeds (Figure S3). Because helix A resides far from the rest of the protein in the current model, its motions were largely independent of the excursion of the Cl<sup>−</sup> ion in this study.

A small "bump" in the backbone RMSD of the protein from t = 1–2 ns was predominantly due to the small changes in the backbone RMSD of helixes D and R. Inspection of the trajectory unveiled that this was caused by the highly flexible Phe357 side chain, the fluctuation of which disturbed Tyr445 and, in turn, Ser107. However, this happened when the Cl<sup>−</sup> ion was still in the extracellular vestibule, and Tyr445 had roughly returned to its initial position by t = 2 ns when the ion began entering the pore. Moreover, such a bump was not observed in the simulations of 5 or 10 Å/ns speed. Therefore, this bump seemed irrelevant or insignificant to Cl<sup>−</sup> transport.

The protein conformations featured in the SMD trajectories were preserved in the trajectories generated by both the QM/MM and MM umbrella sampling simulations. Because the MM umbrella sampling allowed even longer equilibration (2.5 ns for each window), the observation of similar conformations in both the SMD and umbrella sampling simulations supports the hypothesis that other major conformational changes besides the helix R movement, if any, should occur at longer time scales.

#### Local Conformational Changes of Pore Residues

Next, we examined the conformational changes of individual residues. The first thing we looked at was how the Cl<sup>−</sup> ion entered the pore. In the entrance stage of its voyage, the Cl<sup>−</sup> ion was accompanied by the side chain carboxyl group of Glu148 as well as by the backbone amine groups of Arg147 and Glu148 (Figure S5). Inspection of the trajectory revealed that at t ∼ 1.8 ns the side chain carboxyl group of Glu148, which had obstructed the pore entrance, flipped, opening the pore for the Cl<sup>−</sup> ion to access. The side chain returned to its initial orientation after the ion passed through Sext. The conformational change of the Glu148 side chain was characterized by a 60° rotation of dihedral χ2, while χ1 and χ3 remained largely unchanged (**Figure 5A**).

The Cl<sup>−</sup> ion in Sext was held by the backbone amine groups of Arg147 to Gly149 and Ile356 to Ala358 (**Figure 5B**). The Gly146 backbone also stayed close, but the minimum distance from its amine H to the Cl<sup>−</sup> was 3.3 Å, much longer than for the other residue backbones considered here, for which the distances could reach 2.1 Å or shorter. We thus conclude that this interaction was not as critical as those of the other residue backbones. As the Cl<sup>−</sup> ion entered Sext, the backbones of Glu148 and Gly149 from helix F shifted closer to Cl<sup>−</sup> to better solvate the ion. In contrast, the backbones of Ile356 to Ala358 from helix N displaced less significantly, which can again be attributed to the tighter embrace of the N helix by the protein matrix. These local conformational changes were well captured by the variations in the backbone RMSD of these residues (Figure S4). Rapid increases can be seen in the backbone RMSD of Glu148 and Gly149 at ∼2.0 ns. (Note that the side chain RMSD of Glu148 increased ∼0.2 ns earlier than the rise in the backbone RMSD of Glu148 due to pore opening.) The backbone RMSD values of Glu148 and Gly149 remained high for ∼2.4 ns before appreciably reducing. This period of ∼2.4 ns corresponded to the ion cruising through Sext. Interestingly, although the Phe357 side chain was highly mobile all the time, its activities did not seem very relevant.

The shifts of Glu148 and Gly149 backbones were also reflected by the distances between the Cα atoms of the involved residues (**Figure 5B**), where distances reduced by ∼1 Å were observed for the Glu148-Phe357 and Gly149-Ile356 pairs near t = 2.0 ns and remained low for ∼2.4 ns. However, we note that, unlike the flipping of Glu148 side chain, the shifts of Glu148 and Gly149 backbones were not a prerequisite of the Cl<sup>−</sup> ion passage through Sext. Rather, they were a consequence. In other words, the ion could still have cruised through without the movement of the Glu148 and Gly149 backbones, because the pore sizes were large enough in this section. In fact, it has been known that even bigger anions such as Br<sup>−</sup> can also be transported rather efficiently (the relative permeability P<sub>Br−</sub>/P<sub>Cl−</sub> = 0.7; White and Miller, 1981; Gouaux and MacKinnon, 2005). The lack of high selectivity between Cl<sup>−</sup> and Br<sup>−</sup> is probably not problematic physiologically because of the much lower physiological abundance of Br<sup>−</sup> than Cl<sup>−</sup> ions (Gouaux and MacKinnon, 2005).

The Ser107 and Tyr445 side chains played a central role in the late central and exit stages of the Cl<sup>−</sup> ion's expedition. The side chain hydroxyl groups of Ser107 and Tyr445 accommodated the Cl<sup>−</sup> ion as soon as it left Sext (t ∼ 4.4 ns) and escorted it all the way through its exit from the channel. Maintaining the ion coordination by these two residues drove the movement of helixes D and R and led to the temporary separation of the two helixes. Consequently, the distance between the Cα atoms of Ser107 and Tyr445 increased until the ion exited from the pore (**Figure 5B**). Interestingly, we did not observe much rotation of the Ser107 side chain after 4.4 ns, during which χ1 was mostly constant (Figure S6). The attraction by the Cl<sup>−</sup> ion might have somewhat reduced the mobility of this side chain. The Tyr445 side chain χ1 and χ2 did not really change throughout the entire journey. The side chain hydroxyl groups of both residues remained quite flexible, though.

#### Pore Sizes

**Figure 6** displays the pore sizes at the locations of the Cl<sup>−</sup> ion in the umbrella sampling simulations. For each location, the radius was averaged over 7 representative geometries. Please note that this is different from the pore sizes along the pathway for a given snapshot of the protein. The pore size plot here reflects the binding environment of the Cl<sup>−</sup> ion as it travels down the channel. Nevertheless, the overall shapes of the MM and QM/MM pore-size curves agree reasonably well with those reported previously (Bostick and Berkowitz, 2004; Corry et al., 2004; Yin et al., 2004; Kuang et al., 2008; Ko and Jo, 2010a; Khantwal et al., 2016). The MM pore radius was in general larger by up to 0.5 Å than the QM/MM radius, especially at the binding sites.

# Cl<sup>−</sup> ion Charge Redistribution

In previous studies (Smith and Lin, 2011; Church et al., 2013; Nieto-Delgado et al., 2013), it was found that the charge carried by the Cl<sup>−</sup> ion is delocalized to its solvation shell when it is bound at Sext and Scen. In the present work, we examined the trend along the entire pore. The QM charges were computed using the electrostatic-embedding QM/MM scheme on representative geometries from umbrella sampling simulations. Compared with our earlier truncated-QM model calculations for the pore in the gas phase (Smith and Lin, 2011; Church et al., 2013), the electrostatic-embedding QM/MM calculations here more realistically incorporated the polarization effects due to the protein/solvent environment. The results for the Cl<sup>−</sup> ion and for selected atoms and functional groups belonging to three residue cohorts (described later in this section) are depicted in Figure S7 as functions of the reaction coordinate z. The total charges of each cohort are displayed in **Figure 7**. To assist visualization, we have divided the reaction pathway into 10 sections (each 2 Å long) and averaged the charges over each section. Importantly, both the natural and Löwdin charges gave qualitatively similar descriptions, although the Löwdin charges probably exaggerated the extents of charge variation. Therefore, our analysis will focus on the natural charges.
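The section averaging described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' script: it assumes the 10 sections tile 0 ≥ z ≥ −20 Å with the lower boundary z = −20 Å assigned to the last section, and it uses placeholder charge values, not results from this study.

```python
import numpy as np


def section_averages(z, q, z_top=0.0, z_bottom=-20.0, width=2.0):
    """Average charges q over consecutive sections of the reaction coordinate z."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    n_sections = int(round((z_top - z_bottom) / width))
    idx = np.floor((z_top - z) / width).astype(int)
    idx = np.clip(idx, 0, n_sections - 1)   # z = z_bottom joins the last section
    sums = np.bincount(idx, weights=q, minlength=n_sections)
    counts = np.bincount(idx, minlength=n_sections)
    return sums / counts


# One value per umbrella-sampling window (41 windows, 0.5 Å apart).
z_windows = np.linspace(0.0, -20.0, 41)
placeholder_charges = np.full(41, -0.9)     # hypothetical values, in e
avg = section_averages(z_windows, placeholder_charges)
print(avg.shape)
```

With 41 windows spaced 0.5 Å apart, each 2 Å section collects four or five window values, so every section average is well defined.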

The functional groups in a given cohort were those that critically participated in the Cl<sup>−</sup> coordination at the corresponding stage of ion transport.


Note that there are overlaps in the assignment of the functional groups, because one functional group might participate in the Cl<sup>−</sup> ion solvation in more than one stage. For example, the entire exit cohort was part of the central cohort. Because our earlier QM analyses had discovered that the transferred charge would be delocalized over all nearby atoms with π bonds (Church et al., 2013), we treated the entire backbone of a residue rather than just the amine group as one unit in the present analysis. The same was applied to the ring in the Tyr445 side chain.

There are significant charge redistributions between Cl<sup>−</sup> and the coordinating residues along the ion transport pathway. In the external vestibule, if the Cl<sup>−</sup> ion had no contact with any cohort residues, it retained a charge of nearly −1.0 e, despite being coordinated by water molecules. As soon as it interacted with the entrance cohort, notable partial charge transfer was observed. The amount of charge loss increased as the ion advanced through the pore and interacted with more residues. The maximal charge reduction was ∼20% and occurred when the ion traveled from Sext (−6 Å) to Scen (−10 Å). This is not surprising, because the central cohort has the largest number of coordinating functional groups and caused the most significant charge redistribution. Correspondingly, a dip appeared near z = −8 Å in the total-charge curve of the central cohort. Finally, the exit cohort withdrew some of the charge density from Cl<sup>−</sup> as the ion approached it. The charge transfer to the exit cohort was the greatest from Scen to the kink (−10 > z > −16 Å). After leaving the kink, the ion gradually regained its charge on the way out of the channel. Interestingly, the H atoms that directly coordinated the Cl<sup>−</sup> ion displayed almost negligible changes in their charges.

## Potential of Mean Force Through Umbrella Sampling

When analyzing the one-dimensional PMFs, we focused on the data in the range of z > −16 Å, because the pore is wide and considerably tilted for z < −16 Å, where multidimensional PMFs should provide better descriptions of the ion dynamics. First, we checked the convergence of the MM PMF with respect to the sampling time per window (Figure S8A). The MM PMF converged to within 0.2 kcal/mol from 1.5 ns/window to 2.0 ns/window, suggesting that the sampling time length was adequate. The final MM PMF is shown in **Figure 8** as the red curve. The lowest free energy in the entire PMF corresponded to Scen, but Sext was higher by only ∼0.5 kcal/mol. The well at the entrance was roughly as deep as that at Sext. The PMF possesses a barrier of ∼1.5 kcal/mol between the pore entrance and Sext. This was primarily due to the need to rotate the Glu148 side chain that blocked the channel so that the pore could be accessed. Because our model was constructed in the open state instead of the closed state, this barrier did not account for the changes needed to prepare the protein in the open state. Going from Sext to Scen, the ion experienced only a low barrier of ∼0.3 kcal/mol. A barrier of ∼2 kcal/mol had to be overcome when the ion departed from the channel. This was mainly caused by the internal gate opening by Tyr445 and Ser107, the movements of which propagated to the backbones of helixes R and D.

The convergence of QM/MM PMF was less satisfactory because of the shorter sampling time (Figure S8B). Although the QM/MM PMF is probably not fully converged, the shape of the curve was distinct. The final QM/MM PMF of 200 ps sampling time is also displayed in **Figure 8** as the blue curve. The QM/MM PMF agrees with the MM PMF very well in the locations of the wells and barriers. Although Scen still corresponded to the lowest free energy when the ion was in the pore (−4 > z > −16 Å), it was higher by ∼2 kcal/mol in energy than the entrance, and Sext is ∼5 kcal/mol above Scen. The QM/MM PMF exhibited markedly higher barriers for Cl<sup>−</sup> entering and exiting the pore (8–10 kcal/mol), and conversely a small (∼0.5 kcal/mol) barrier from Sext to Scen.

## DISCUSSION

#### External Gating by Glu148

As revealed by our analysis, rotation of the Glu148 side chain was necessary to make the pore entrance fully exposed to the Cl<sup>−</sup> ion. The average χ2 values were 104°, 167°, and 78° for 0–1.8, 1.8–4.4, and 4.4–10 ns, respectively. The corresponding values are 53°, 64°, and 64° in χ1 and 179°, 154°, and 201° in χ3, respectively (note that χ3 has been offset to be in the range of 0–360°). The ∼60° change in χ2 upon pore opening was much larger than the variations of ∼10° in χ1 and ∼25° in χ3. The first and third average values of χ2 (and of χ3) are similar, signifying the side chain returning to partially block the pore entrance after departure of the Cl<sup>−</sup>. We suspect that the remaining small difference of ∼20° was due to the presence of Cl<sup>−</sup> at the entrance during the first 1.8 ns.
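Averaging dihedral angles requires some care near the ±180° boundary, which is presumably why χ3 was offset into the 0–360° range. The sketch below shows two standard ways of handling this, wrapping into 0–360° and taking a circular mean; it is a generic illustration with hypothetical function names, not the procedure documented by the authors.

```python
import numpy as np


def wrap_0_360(angles_deg):
    """Map angles in degrees into the interval [0, 360)."""
    return np.mod(angles_deg, 360.0)


def circular_mean_deg(angles_deg):
    """Circular mean of angles in degrees, returned in [0, 360)."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean = np.arctan2(np.sin(a).mean(), np.cos(a).mean())
    return float(np.mod(np.rad2deg(mean), 360.0))


# Two conformations on either side of the +/-180 degree boundary.
angles = [170.0, -170.0]
naive = float(np.mean(angles))          # misleading arithmetic mean
wrapped = float(np.mean(wrap_0_360(angles)))
circular = circular_mean_deg(angles)
print(naive, wrapped, circular)
```

The arithmetic mean of the raw values places the average on the opposite side of the circle, whereas both the wrapped mean and the circular mean recover the physically meaningful value near 180°. An angle reported as 201° after the offset corresponds to −159° in the −180° to 180° convention.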

The pattern of dihedral evolution seems in line with what has been seen in the experimental structures of EcCLC. In the crystal structure of the wild-type (WT) protein, which is in the closed state, χ1–χ3 (averaged over chains A and B) are −25°, 70°, and −107°, respectively. For the E148Q mutant, which serves as a mimic of the open state, the average values of χ1–χ3 are −36°, −57°, and −84°, respectively. Thus, the changes are substantial (∼130°) in χ2 but insignificant in χ1 (∼10°) and χ3 (∼20°). It is encouraging that both the computations and experiments support the gating role of the Glu148 side chain.

Our observation of the side chain rotation as the external gating mechanism shares some similarities with what was found for a eukaryotic Cyanidioschyzon merolae CLC (CmCLC) transporter in a recent computational study (Cheng and Coalson, 2012). In that paper, constant-force steering was applied to guide Cl<sup>−</sup> ions through the pore. No large-amplitude motions of the helixes were found. The side chain rotation of Glu210 (Glu148 equivalent in CmCLC) was characterized by significant changes in both χ1 and χ2 (>100°). In contrast, only the change in χ2 was notable in the present study.

We note that an earlier paper (Bisset et al., 2005) based on the simulations of a pure Cl<sup>−</sup> channel CLC-0 suggested an alternative gating mechanism of Glu148: The Glu148 side chain was pushed back by the incoming Cl<sup>−</sup> ions into a more central position and pressed against the channel wall, such that it did not block the travel of Cl<sup>−</sup> ions. Such a conformation of the external gating residue was also detected in a recent experimental structure of CmCLC transporter (Feng et al., 2010). The discrepancies between this earlier and our current simulations may arise from the different proteins and/or the different starting geometries of Glu148 (the protonated Glu148 might stay initially deeper in the pore in the previous computation). On the other hand, it is certainly possible for the Glu148 side chain to move in either direction to escape from Sext. Therefore, while not observed in our simulations, we cannot rule out this alternative mechanism for pore opening. However, because EcCLC is an antiporter, the release of H<sup>+</sup> to the extracellular side will probably require Glu148 be exposed to the extracellular solution sometime during a transport cycle.

# Sext-Free vs. Sext-Occupied Open States

It is intriguing to see that the presence of the Cl<sup>−</sup> ion in Sext induced subtle local movements of Glu148 and Gly149 backbones. The approach of helix F to helix N can be attributed to the negative charge of Cl<sup>−</sup>, which "glued" the two helix N-terminal ends. Note that these N-terminal ends are partially positively charged. The departure of the ion from Sext thus widened Sext again. These open-state geometries with and without the anion in Sext can be labeled as Sext-occupied and Sext-free, respectively. Because the deprotonated carboxyl group of Glu148 also carries a negative charge, the closed-state geometry and the Sext-occupied open-state geometry should bear similarities in the Sext region. This is indeed the case for the crystal structures (Dutzler et al., 2002, 2003). As can be seen in **Table 1**, the distances between Cα atom pairs for Arg147-Ala358, Glu148-Phe357, and Gly149-Ile356 (**Figure 5B**), which characterize the size of Sext, are rather similar across the three crystal structures. Averaged over the three pairs and over chains A and B, the distances are 7.0, 6.9, and 7.0 Å for WT, E148A, and E148Q, respectively. In view of this, the "true" open state could be the Sext-free one, for which E148Q without an anion at Sext will be a realistic mimic.

We note that the distances in our SMD simulations are consistently larger than those from the crystal structure. Even for those of 2.0–4.4 ns corresponding to the Sext-occupied state, the distance average over these three pairs is 7.5 Å, ∼0.5 Å wider than the crystal structures. The average distance over the Sext-free state is 7.8 Å. Currently we do not know exactly what caused the difference, although the different environments in which the protein stays may have contributed to the disparity: the protein is more tightly packed in the crystal form but probably more expanded when embedded in membrane.

TABLE 1 | Average distances (in Å) between pairs of Cα atoms for selected residues: Arg147-Ala358, Glu148-Phe357, Gly149-Ile356, and Ser107-Tyr445.

In addition to the SMD data, the experimental data are also provided for wild-type (WT, PDB code 1OTS) and two mutants E148A (PDB code 1OTT) and E148Q (PDB code 1OTU) (Dutzler et al., 2003).

# Quantum Effects of Electron Delocalization on Cl<sup>−</sup> Translocation

Comparisons between the MM and QM/MM results obtained in this study provide an opportunity to gauge the quantum effects of electron delocalization on Cl<sup>−</sup> transport by EcCLC. First, we point out that the MM (2 kcal/mol) and QM/MM (10 kcal/mol) barriers estimated here for Cl<sup>−</sup> translocation are close to those (3–8 kcal/mol) obtained in previous computational studies (Cohen and Schulten, 2004; Gervasio et al., 2006; Ko and Jo, 2010a; Cheng and Coalson, 2012), again suggesting that the slow transport of Cl<sup>−</sup> ions (∼10<sup>3</sup> s<sup>−1</sup>) by EcCLC (Walden et al., 2007) arises from other factors not considered here. The same locations of the barriers and wells in both MM and QM/MM PMFs implied that electron delocalization was fine-tuning rather than dramatically changing the Cl<sup>−</sup> transport mechanism. However, the QM/MM barriers are much higher than the MM barriers, suggesting stronger binding, especially at the entrance and Scen, at the QM/MM level than at the MM level.
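The statement that barriers of this size cannot by themselves explain a turnover of ∼10<sup>3</sup> s<sup>−1</sup> can be made semi-quantitative with a crude transition-state-theory estimate. The sketch below applies the Eyring expression with a transmission coefficient of 1 at 310 K; this simple model is an assumption introduced here for illustration and is not an analysis performed in the original work.

```python
import math

KB = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J s
R = 1.987204e-3      # gas constant, kcal/mol/K


def tst_rate(barrier_kcal, temperature=310.0):
    """Eyring rate (s^-1) for a free-energy barrier in kcal/mol.

    Assumes a transmission coefficient of 1 (an upper-bound style estimate).
    """
    prefactor = KB * temperature / H
    return prefactor * math.exp(-barrier_kcal / (R * temperature))


rate_mm = tst_rate(2.0)     # MM barrier quoted in the text
rate_qmmm = tst_rate(10.0)  # QM/MM barrier quoted in the text
print(f"MM:    {rate_mm:.2e} s^-1")
print(f"QM/MM: {rate_qmmm:.2e} s^-1")
```

Under these assumptions even the higher QM/MM barrier corresponds to a rate well above 10<sup>3</sup> s<sup>−1</sup>, which is consistent with the suggestion that the slow transport arises from factors other than the Cl<sup>−</sup> translocation barrier.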

To gain a deeper insight into the differences, we computed the binding energies for a few model complexes formed between Cl<sup>−</sup> and residue backbone or side chains at the CHARMM36, PM3, and B3LYP/6-31+G(d) levels (**Figure 9**). Note that the Glu–Cl<sup>−</sup> complex has two different conformations, both of which had been observed for Glu148 in the umbrella sampling trajectories. The positively charged arginine side chain was also included because of the strong attraction between the Arg147 side chain and the Cl<sup>−</sup> in the entrance, although the Arg147 side chain is not part of the entrance cohort. Among the three levels of theory, the B3LYP calculations are the most accurate and serve as the reference for comparisons. Overall, the QM and MM optimized geometries were similar. However, in the PM3 calculations, the binding energies were overestimated by 7–17% and the bond lengths were 10–20% shorter than in the reference calculations. More specifically, the errors were 13 kcal/mol for the arginine side chain and 2–4 kcal/mol for the other models. In contrast, the H–Cl distances from the CHARMM36 calculations agreed almost perfectly with B3LYP for water, serine side chain, and arginine side chain, but were somewhat too long (by ca. 4–8%) for the backbone and the side chains of glutamate and tyrosine. Not surprisingly, the CHARMM36 performance in energy was mixed. While performing remarkably well (within 2 kcal/mol, or within 7%) for water, serine side chain, and the first (anti) conformation of glutamate side chain, CHARMM36 noticeably underestimated the binding energies for the backbone, tyrosine side chain, and the second (syn) conformation of glutamate side chain by 4–6 kcal/mol, or 23–28%.

Interestingly, the model complexes on which CHARMM36 performed poorly contain π bonds. As our earlier study revealed, these π bonds stabilize the Cl<sup>−</sup> ion substantially through mutual polarization and partial charge transfer (Church et al., 2013). The MM calculation failed to account for these quantum effects of electron delocalization in anion-π interactions. On the other hand, MM describes classical electrostatic interaction very well. Therefore, CHARMM36 significantly outperformed PM3 for the arginine side chain, which, although it contains π bonds, interacts with Cl<sup>−</sup> predominantly through electrostatic attractions. This also explains the different performances by MM for the two conformations of glutamate side chain. As the Cl<sup>−</sup> ion stands closer to the CH<sub>3</sub> group in the first conformation, electrostatic interactions contribute a larger share in the total binding energies. This was confirmed by natural orbital analysis, which revealed less electron delocalization in the first conformation than in the second: the Cl<sup>−</sup> ion lost 0.107 e in the first conformation but 0.140 e in the second conformation. As expected, CHARMM36 reproduced the interaction energy of the first conformation more accurately.

One may argue that the MM parameters were optimized for condensed-phase simulations, while these model complexes are gas-phase calculations. This is certainly true. However, this means that MM typically overestimates the binding energies for gas-phase models. In contrast, MM underestimates the binding affinities here. Thus, these inaccurate MM energies are most likely caused by the missing stabilizations due to quantum electron delocalization.

While the overestimation of binding energies at the PM3 level likely increased QM/MM barriers, the underestimation of binding energy at the CHARMM36 level probably lowered MM barriers. In particular, the significantly overestimated attraction by Arg147 side chain near the entrance at the PM3 level contributed to an artificially deep well in 0 > z > 2 Å in the QM/MM PMF. The dip at Scen, however, might be described more realistically in the QM/MM PMF curve because of the substantial under-binding with the Tyr445 side chain by CHARMM36. It is a bit puzzling that Sext, where the Cl<sup>−</sup> ion was extensively coordinated by backbone amine groups, was ∼5 kcal/mol higher than Scen in the QM/MM PMF, despite the over-binding with the backbone in the PM3 calculations. We do not know the reason, but it could be that the QM/MM PMF had not fully converged, as the barrier between Sext and Scen seemed to keep increasing with extended simulation time (Figure S8).

The differences between CHARMM36 and PM3 bindings also manifest in the pore-size plots (**Figure 6**). The stronger PM3 binding led to generally smaller pore sizes in QM/MM, especially in the binding sites. Interestingly, while the MM pore radii were ∼1.8 Å or larger, which corresponds to the generally accepted Cl<sup>−</sup> ionic radius of 1.81 Å (Shannon, 1976), the QM/MM radii could go down to as small as ∼1.4 Å. This smaller size indicated that the anion's size was effectively reduced as it lost a substantial amount of its electron density. For reference, we note that the radius of the charge-neutral Cl atom was determined to be 0.99 Å (Pyykkö and Atsumi, 2009).

## SUMMARY

We have carried out SMD simulations to escort a Cl<sup>−</sup> ion through the pore of EcCLC to identify the accompanying local and global conformational changes of the protein. We observed that the side chain rotation of the external gating residue Glu148 was essential for the full access of the pore entrance by Cl<sup>−</sup>. Occupation of Sext by the anion induced small but non-negligible shifting of the Glu148 and Gly149 backbones, which came closer to Cl<sup>−</sup> for better ion solvation. This local conformational change implied subtle differences between the Sext-free and Sext-occupied open states. The ion's passing through the internal gating residue Tyr445 prompted helix R, which extrudes into the intracellular solution, to swing away from helix D. Helix R returned to its initial position once the ion exited sideways from the pore. There was substantial electron delocalization of Cl<sup>−</sup> to its surroundings, especially to the π-bonds (e.g., the Tyr445 side chain), along the ion's journey through the channel. Umbrella sampling simulations at both the MM and QM/MM levels were performed to quantify the free-energy profile associated with Cl<sup>−</sup> transport. The results identified major barriers for Cl<sup>−</sup> moving into Sext from the extracellular vestibule and exiting the pore from Scen. The barrier heights were ∼2 kcal/mol in MM and ∼10 kcal/mol in QM/MM. The higher barriers in the QM/MM than in the MM free-energy landscapes were attributed to the differences in the interaction energies between Cl<sup>−</sup> and nearby residues, which were overestimated by PM3 but underestimated by CHARMM36. The QM/MM and MM results here probably provide the upper and lower bounds for the barrier heights. The weaker binding interactions predicted by MM were primarily caused by missing stabilization from electron delocalization between Cl<sup>−</sup> and functional groups containing π bonds. The findings here suggest that quantum effects of electron delocalization may be more important than previously considered and should be taken into account if more accurate descriptions are desired for Cl<sup>−</sup> transport proteins.

However, some of the above conclusions must be taken with caution, considering the intrinsic limitations in the model and methods employed in this study. The current model system consisted of only one independent subunit of the transmembrane domain, while EcCLC contains a C-terminal domain and usually forms dimers. The amplitude of motion for helix R may be decreased in the presence of the C-terminal domain and the other subunit. The slowest steering speed in the SMD simulations was 2 Å/ns, which, although quite slow, might still be too fast for near-equilibrium steering. The QM/MM umbrella sampling was conducted employing the semi-empirical PM3 Hamiltonian with relatively short (200 ps per window) sampling time due to computational cost considerations. More advanced electronic-structure theory and longer simulation times are certainly desired for more accurate calculations. Classical polarizable force fields that explicitly include many-body contributions offer another choice (Kaminski et al., 2004; Patel and Brooks, 2004; Patel et al., 2004; Jorgensen, 2007; Warshel et al., 2007; Xie and Gao, 2007; Cisneros et al., 2014, 2016; Vanommeslaeghe and MacKerell, 2015; Albaugh et al., 2016). Although polarizable force fields are more expensive than standard non-polarizable force fields, they are more efficient than QM calculations. The disadvantage is that force fields based on classical electrostatics do not explicitly account for quantum effects such as polarization-exchange coupling and charge transfer (Illingworth and Domene, 2009). The simulations translocated only one Cl<sup>−</sup> ion, leaving out the scenarios of two or more Cl<sup>−</sup> ions moving together. The existence and movements of other Cl<sup>−</sup> ions in the channel will likely impact the binding and translocation of a given Cl<sup>−</sup> ion. The possible coupling between Cl<sup>−</sup> ion transport and H<sup>+</sup> delivery was also missed. Finally, the model we constructed was an "open-state" model, so the conversion between the closed and open states has not been visited. Future studies will need to address these issues.

# AUTHOR CONTRIBUTIONS

HL designed the project. AD carried out the code implementation. C-HW, BA, MZ, and AD performed the calculations. C-HW, AD, BA, and HL analyzed the results. C-HW, AD, and HL wrote the manuscript, which was then revised by all authors, who have given approval to the final version of the manuscript.

## FUNDING

This work is supported by the National Science Foundation (CHE-1564349), Camille and Henry Dreyfus Foundation (TH-14-028), NVIDIA Corporation, and the University of Colorado Denver Undergraduate Research Opportunity Program. This work used XSEDE under grant CHE-140070, supported by NSF grant number ACI-1053575, and NERSC under grant m2495.

## ACKNOWLEDGMENTS

We thank Christal Davis and Christina Garza for insightful discussion.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00062/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Wang, Duster, Aydintug, Zarecki and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Modeling Chemical Reactions by QM/MM Calculations: The Case of the Tautomerization in Fireflies Bioluminescent Systems

#### Romain Berraud-Pache, Cristina Garcia-Iriepa and Isabelle Navizet\*

Université Paris-Est, Laboratoire Modélisation et Simulation Multi Echelle, MSME, UMR 8208 CNRS, UPEM, Marne-la-Vallée, France

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

José Pedro Cerón-Carrasco, Universidad Católica San Antonio de Murcia, Spain Xavier Assfeld, Université de Lorraine, France

> \*Correspondence: Isabelle Navizet isabelle.navizet@u-pem.fr

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 31 January 2018 Accepted: 29 March 2018 Published: 17 April 2018

#### Citation:

Berraud-Pache R, Garcia-Iriepa C and Navizet I (2018) Modeling Chemical Reactions by QM/MM Calculations: The Case of the Tautomerization in Fireflies Bioluminescent Systems. Front. Chem. 6:116. doi: 10.3389/fchem.2018.00116

In less than half a century, the hybrid QM/MM method has become one of the most used techniques to model molecules embedded in a complex environment. A well-known application of the QM/MM method is to biological systems. Nowadays, one can understand how enzymatic reactions work or compute spectroscopic properties, such as the emission wavelength. Here, we tackle the issue of modeling chemical reactions inside proteins. We have studied a bioluminescent system, that of fireflies, and investigated whether a keto-enol tautomerization is possible inside the protein. The two tautomers are candidates to be the emissive molecule of the bioluminescence, but no consensus has been reached. One hypothesis to address this issue is a possible keto-enol tautomerization, as already observed in water. A joint approach combining extensive MD simulations with the computation of key intermediates, such as TSs, using QM/MM calculations is presented in this publication. We also emphasize the procedure and the difficulties met during this approach in order to provide a guide for studying this kind of chemical reaction using QM/MM methods.

Keywords: oxyluciferin, TD-DFT, molecular dynamics, QM/MM, keto-enol tautomerization, emission spectra, bioluminescence

# INTRODUCTION

Chemical reactions are ubiquitous in biology, controlling a wide variety of fundamental biological processes such as photosynthesis, the process of vision, and bioluminescence, among others (Hales, 1976; Adam, 1982; Metzler and Metzler, 2001; Palczewski, 2012). In this regard, a full understanding of the underlying chemical mechanisms is crucial. Only then can the factors that govern the specificity and efficiency of these biological processes be determined.

In order to get insight into the mechanism and intermediates, some experimental methodologies can be used such as spectroscopic techniques, mutagenesis experiments and kinetic evaluations (Meister, 1983; Zscherp and Barth, 2001; Frey and Hegeman, 2007). However, in general many chemical questions remain unanswered from an experimental point of view and hence, the modeling of the system and the mechanism by computational chemistry is needed (Becker et al., 2001; Mulholland, 2005; van der Kamp et al., 2008).

Regarding the modeling, quantum mechanics/molecular mechanics (QM/MM) methods are the state-of-the-art computational techniques to study chemical reactions and to compute electronic properties in complex environments such as biomolecular systems (Senn and Thiel, 2007, 2009; Acevedo and Jorgensen, 2010; van der Kamp and Mulholland, 2013; Sousa et al., 2017). In QM/MM methods, the chemically active site (i.e., where the chemical reaction takes place or the molecule whose properties are going to be calculated) is treated at the QM level whereas the protein surroundings and the explicit solvent molecules are treated at the MM level. Many possibilities arise depending on the QM and MM methods used and on the QM/MM interface (Senn and Thiel, 2009). Up to now, QM/MM methods have been applied to get insight into different biological issues: (i) enzymatic reaction mechanisms (Friesner and Guallar, 2005; Senn and Thiel, 2007; Acevedo and Jorgensen, 2010), (ii) the calculation of spectroscopic properties (Gascón et al., 2005; Sabin et al., 2010; Gattuso et al., 2017), (iii) the investigation of electronically excited states (Navizet et al., 2010; Sabin et al., 2010; García-Iriepa et al., 2016; Gozem et al., 2017) or (iv) the calculation of pKa values (Jensen et al., 2005; Riccardi et al., 2006). It should be stressed that through QM/MM methods, transition state structures, intermediates and activation energies, which are central for reactivity, can be computed (Gao et al., 2006; Ramos and Fernandes, 2008; Lonsdale et al., 2012). For this reason, QM/MM methods have become in recent decades an essential tool to get insight into the mechanism of biochemical reactions and hence to confirm or discard different mechanistic proposals which cannot be elucidated from experimental data.

The main strengths of QM/MM methods are that large molecules can be modeled with the entire system explicitly included in the calculations, that they balance simulation cost and accuracy, and that they complement experimental data. In contrast, their main limitation is the system setup prior to the QM/MM calculations, as there are many possible conformations that the macromolecule can assume, also considering the presence of solvent. However, available experimental data, such as crystallographic structures or physiological measurements, ease the setup preparation.

In this work we present an application of QM/MM methods to study the bioluminescent system of fireflies, whose basis is still not fully understood. In particular, the chemical nature of the emissive species, so-called oxyluciferin, is still under debate as it can assume six different chemical forms due to the triple equilibrium of phenol and/or enol deprotonation together with a keto-enol tautomerization (Hirano et al., 2009; Navizet et al., 2010; Hosseinkhani, 2011). The aim of this work is to elucidate whether the keto-enol tautomerization (**Scheme 1**) of phenolate-oxyluciferin (OxyLH<sup>−</sup>) is feasible, both in the ground and in the excited state, inside the active site of the fireflies' luciferase enzyme. To this end, MD simulations have been performed to sample different possible conformations, combined with QM/MM calculations to find the transition states and qualitatively evaluate the energy barriers.

#### METHODS

#### Model Setup

The crystal structure used in the present publication comes from a chemically engineered North American firefly luciferase, obtained by the group of Prof. Branchini (Branchini et al., 2011; Sundlov et al., 2012) (PDB 4G37). It has been designed to reproduce the conformation of the protein when the dioxygen binds to an intermediate of the bioluminescent reaction. This conformation was obtained thanks to a disulfide bridge between two residues, Isoleucine 108 and Tyrosine 447. Moreover, it contains the 5'-O-[N-(dehydroluciferyl)-sulfamoyl]adenosine (DLSA) (**Scheme 2A**) substrate.

The missing loops in the crystallographic structure 4G37.pdb downloaded from the RCSB PDB website were added with the Disgro program (Tang et al., 2014). Then, the DLSA residue was replaced by protonated adenosine monophosphate AMPH, charged −1, and the phenolate-keto form of oxyluciferin, keto-OxyLH<sup>−</sup> (**Schemes 2B,C**). The two SO<sub>4</sub><sup>2−</sup> groups and the disulfide bridging residue were removed: the SO<sub>4</sub><sup>2−</sup> groups are used as precipitant for the protein and are not key to modeling it, and the S-S cross-linked molecule was removed because no accurate parameters are available. No structural changes between the domains of the luciferase were observed during the MD; the protein conformation remains similar to that of the crystallographic structure. Finally, in order to neutralize the system, we protonated some histidine residues using the H++ program (Anandakrishnan et al., 2012), keeping unchanged the ones close to the protein active site. In this case, the following histidines were protonated: 76, 171, 310, 332, 419, 461, and 489, each one yielding a +1 charge.

#### MD Simulations

Classical dynamics simulations were performed with the Amber14 program (Case et al., 2014). The AMBER99ff force field was used to model the residues of the protein. The parameters used for both OxyLH<sup>−</sup> forms (keto or enol) and AMPH were designed by our group (Navizet et al., 2010, 2011, 2013). The model was solvated with TIP3P (Jorgensen et al., 1983) water molecules within a cubic box, ensuring a solvent shell of at least 15 Å around the solute. The resulting system contained roughly 28,000 water molecules and 90,000 atoms in total. The system was heated from 100 to 300 K in 20 ps. Then, under NPT conditions with T = 300 K and P = 1 atm, 21 molecular dynamics simulations of 10 ns each were performed using periodic boundary conditions and a 2 fs time step. During these simulations, pressure and temperature were maintained using the Berendsen algorithm and SHAKE constraints were applied to all bonds involving hydrogen atoms (Ryckaert et al., 1977).
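The simulation protocol above implies a few derived quantities that are useful when planning comparable runs. The short sketch below only does arithmetic on the numbers quoted in the text; the variable names are illustrative and do not correspond to Amber input keywords.

```python
# Values quoted in the text
n_runs = 21            # independent MD simulations
run_length_ns = 10     # length of each production run
time_step_fs = 2       # integration time step
heat_start_k = 100     # initial temperature of the heating stage
heat_end_k = 300       # final temperature of the heating stage
heat_time_ps = 20      # duration of the heating stage

FS_PER_NS = 1_000_000

# Integration steps needed for one production run
steps_per_run = run_length_ns * FS_PER_NS // time_step_fs

# Aggregate production sampling over all runs
total_sampling_ns = n_runs * run_length_ns

# Average heating rate during the heating stage
heating_rate_k_per_ps = (heat_end_k - heat_start_k) / heat_time_ps

print(f"steps per run:     {steps_per_run}")
print(f"total sampling:    {total_sampling_ns} ns")
print(f"heating rate:      {heating_rate_k_per_ps} K/ps")
```

Each 10 ns run therefore corresponds to 5,000,000 integration steps, and the 21 runs together provide 210 ns of sampling.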

## QM/MM Setup

The QM/MM calculations have been carried out using a QM/MM coupling scheme (Ferré and Ángyán, 2002) between Gaussian09 (G09 D.01) (Frisch et al., 2010) and Tinker (Ponder, 2004). In particular, the interaction between the QM charge density (electrons and nuclei) and the external electrostatic potential of the MM part was computed by the electrostatic potential fitted (ESPF) method (Ferré and Ángyán, 2002). The microiterations technique (Melaccio et al., 2011) was used to converge the MM subsystem geometry for every QM minimization step. The QM part is composed of OxyLH<sup>−</sup>, either in the keto or in the enol form, the AMPH and water molecule no. 559 (hereafter named Wat1), placed between the oxyluciferin and AMPH. The rest of the system, that is, the protein and the other water molecules, is included in the MM part.

QM/MM calculations were used to search for the transition states (TS) in the protein, to define energetic profiles bridging the TS to the keto and enol forms, and to calculate the electronic transitions between the first singlet excited state (S1) and the ground state (GS). For the TS and energetic profile searches, unrestricted (broken-symmetry) and restricted DFT and TD-DFT calculations were performed using the M06-2X functional (Zhao and Truhlar, 2008), which includes a dispersive term and provides accurate results when studying chemical reactions (Chéron et al., 2012). Unrestricted and restricted calculations gave similar results in terms of energy and geometries, with differences of about 0.12 eV for the transition state (TS) and about 0.15 eV for the energetic profiles. Therefore, restricted results are presented in this publication. TD-DFT calculations were performed using 3 roots. The emission energy (Te) between the first singlet excited state (S1) and the ground state (GS), from geometries obtained in the S1 state, was computed using both the M06-2X and the B3LYP (Stephens et al., 1994) functionals. The B3LYP functional is known to give emission energy values close to experiment for fireflies (Berraud-Pache and Navizet, 2016). Finally, the basis set used is 6-311G(2d,p), as in previous publications (Laurent et al., 2015; Berraud-Pache and Navizet, 2016).
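Because this work quotes energies both in eV and in kcal/mol, and emission energies are usually compared with experiment as wavelengths, a small conversion helper can be convenient. The sketch below is illustrative only: the function names are hypothetical, and the emission energy of 2.2 eV used in the example is a placeholder, not a result of this study.

```python
EV_TO_KCAL_MOL = 23.0605   # 1 eV expressed in kcal/mol
HC_EV_NM = 1239.84         # Planck constant times speed of light, in eV*nm


def ev_to_kcal_mol(energy_ev):
    """Convert an energy from eV to kcal/mol."""
    return energy_ev * EV_TO_KCAL_MOL


def kcal_mol_to_ev(energy_kcal_mol):
    """Convert an energy from kcal/mol to eV."""
    return energy_kcal_mol / EV_TO_KCAL_MOL


def emission_wavelength_nm(te_ev):
    """Wavelength (nm) of a photon carrying the emission energy Te (eV)."""
    return HC_EV_NM / te_ev


# Restricted vs. unrestricted differences quoted in the text
print(f"0.12 eV = {ev_to_kcal_mol(0.12):.2f} kcal/mol")
print(f"0.15 eV = {ev_to_kcal_mol(0.15):.2f} kcal/mol")

# Placeholder emission energy, chosen only to demonstrate the conversion
print(f"Te = 2.2 eV -> {emission_wavelength_nm(2.2):.1f} nm")
```

The restricted versus unrestricted differences of 0.12 and 0.15 eV thus correspond to about 2.8 and 3.5 kcal/mol, respectively.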

# Transition State (TS) Search and Energetic Profiles

Once we have selected the MD snapshot for the QM/MM calculations, we can study the keto-enol tautomerization mechanism. Oxyluciferin, AMPH and Wat1 are included in the QM part (**Scheme 3**) whereas the protein and the other water molecules are treated at the MM level. To study the desired reaction, first the TS connecting the keto and enol forms of oxyluciferin has to be located. However, searching the TS structure inside the protein by QM/MM methods is a complicated task due to both the complexity of the system and the non-availability of certain optimization algorithms [e.g., the Synchronous Transit-Guided Quasi-Newton (STQN) algorithm (Peng and Schlegel, 1993)] in G09/Tinker.

To solve this issue, the QM part selected for QM/MM calculations was extracted (**Scheme 3**) and the TS was first computed in the GS without the protein surroundings, but with water considered as an implicit solvent through a Polarizable Continuum Model (PCM) (Cancès et al., 1997; Mennucci et al., 1997). In this setting the STQN algorithm can be used, for which both the keto and the enol structures have to be provided. In our case, the keto form is the one extracted from the selected MD snapshot whereas the enol one is built from the keto form by manually moving H1 to O1, H2 to O2, and H3 to O3 (see **Scheme 3** for atom numbering).

To refine the TS structure obtained with the STQN algorithm, a Berny optimization has been performed. During the Berny optimization, all the atoms not involved in the tautomerization reaction are frozen (see Supplementary Figure 1). In detail, these correspond to the benzothiazole cycle of OxyLH<sup>−</sup> and the ribose and adenosine parts of AMPH. A frequency calculation was then performed to validate the structure found as a TS, named TS<sup>PCM</sup><sub>GS</sub> (see Note 1 in Supplementary Material). Then, the TS in the first excited state, named TS<sup>PCM</sup><sub>S1</sub>, was computed while considering 3 excited electronic states, starting from the TS<sup>PCM</sup><sub>GS</sub> structure.
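The frequency check mentioned above rests on a simple criterion: a transition state is a first-order saddle point, so its Hessian has exactly one negative eigenvalue (one imaginary frequency), whereas a minimum has none. The sketch below illustrates this criterion on a hypothetical two-dimensional double-well surface; the model potential, the finite-difference Hessian and the function names are assumptions made for illustration and are unrelated to the Gaussian09 procedure used in this work.

```python
import math


def model_energy(x, y):
    """Hypothetical 2D double-well surface: minima at x = +/-1, saddle at the origin."""
    return (x * x - 1.0) ** 2 + y * y


def hessian(f, x, y, h=1e-4):
    """Central finite-difference Hessian elements (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy


def count_negative_curvatures(f, x, y):
    """Number of negative Hessian eigenvalues (closed form for a symmetric 2x2 matrix)."""
    fxx, fxy, fyy = hessian(f, x, y)
    mean = 0.5 * (fxx + fyy)
    spread = math.sqrt((0.5 * (fxx - fyy)) ** 2 + fxy * fxy)
    return sum(1 for eig in (mean - spread, mean + spread) if eig < 0.0)


# The origin is the saddle point separating the two wells; (1, 0) is a minimum
print("saddle candidate:", count_negative_curvatures(model_energy, 0.0, 0.0))
print("minimum:         ", count_negative_curvatures(model_energy, 1.0, 0.0))
```

In a molecular calculation the same test is applied to the mass-weighted Hessian of the optimized structure, where the single negative eigenvalue should correspond to the motion along the reaction coordinate.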

The keto-enol tautomerization was then studied inside the protein using QM/MM calculations. The TS<sup>PCM</sup><sub>GS</sub> structure was placed in the MD snapshot to replace the QM part coordinates. Then, a Berny optimization was performed to obtain the transition state in the ground state in the protein, TS<sup>prot</sup><sub>GS</sub>. Finally, a search for the TS in the first excited state in the protein, starting from TS<sup>prot</sup><sub>GS</sub>, was performed to obtain TS<sup>prot</sup><sub>S1</sub>.

To further validate the TS structures, intrinsic reaction coordinate (IRC) calculations have been performed, finding that these structures actually connect the keto and the enol forms of oxyluciferin. The IRC has been done for both PCM and QM/MM calculations using the local quadratic approximation (LQA) algorithm (Page and McIver, 1988; Page et al., 1990) and a step of 0.01 Bohr (see Note 2 in Supplementary Material and Supplementary Figure 2).

In the IRC graphs shown in the rest of the publication, the point at the reaction coordinate (RC) equal to 0 corresponds to the energy of the relaxed TS structure. Indeed, because some of the atoms are frozen during the Berny optimization, the first step of the IRC shows a large decrease in energy corresponding to the removal of the constraints on all the atoms of the QM part. However, no geometrical differences are observed between the TS geometry and the relaxed one.

We have also performed a QM/MM dihedral scan of the phosphate group of the AMPH substrate inside the protein in the GS to link the keto-OxyLH<sup>−</sup> structure from the MD and the one obtained after the IRC search. This scan has been performed along the O4-P-O5-C4 dihedral angle in the GS using 10 steps of −10° while the ribose, the adenosine and the OxyLH<sup>−</sup> were kept frozen.
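To make the scan coordinate concrete, the sketch below computes a dihedral angle from four Cartesian positions and lists the target values of a scan made of 10 steps of −10°. It is a generic geometry helper, not the authors' workflow: the coordinates are hypothetical placeholders standing in for O4, P, O5 and C4, and counting the starting geometry as an extra scan point is an assumption.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p1, p2, p3, p4):
    """Signed dihedral angle (degrees) defined by four points p1-p2-p3-p4."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


def scan_targets(start_deg, n_steps=10, step_deg=-10.0):
    """Target dihedral values for a scan, wrapped into [-180, 180)."""
    return [((start_deg + k * step_deg + 180.0) % 360.0) - 180.0
            for k in range(n_steps + 1)]


# Hypothetical coordinates (in angstrom) standing in for O4, P, O5 and C4
o4 = (1.0, 0.0, 0.0)
p = (0.0, 0.0, 0.0)
o5 = (0.0, 0.0, 1.6)
c4 = (0.0, 1.4, 2.1)

start = dihedral_deg(o4, p, o5, c4)
print(f"starting dihedral: {start:.1f} deg")
print("scan targets:", [round(t, 1) for t in scan_targets(start)])
```

With 10 steps of −10°, the scan covers a total rotation of −100° about the P-O5 bond.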

## RESULTS AND DISCUSSION

In the present paper, we aim to study the keto-enol tautomerization of oxyluciferin inside the luciferase protein both in the ground state (GS) and in the first singlet excited state (S1). In fireflies, it is well accepted that the final product of the biochemical reaction corresponds to the keto form of oxyluciferin (Liu et al., 2009). However, some experimental measurements (Naumov et al., 2009; Snellenburg et al., 2016) have detected the presence of the enol form inside the protein after the bioluminescence emission. This would imply that the keto-enol tautomerization of oxyluciferin takes place inside the protein. By computing both the GS and S1 energetic profiles of the tautomerization reaction, we can also provide insights into whether the reaction takes place before or after the emission. The first step is to describe the protein environment correctly, which has been done by constructing models from classical molecular dynamics (MD) simulations. We then have to find the geometry of the TSs joining the keto and the enol forms of oxyluciferin in the protein environment. As the search for TSs inside the protein using QM/MM calculations is complicated, we initially located the TSs in a simpler environment, implicit solvation with PCM. Once TSs were located inside the protein, energetic profiles connecting the keto and enol forms were computed.

## Classical MD Simulation

First, the system was built starting from the 4G37 luciferase structure as detailed in the Methods section. In this case, the phenolate keto form of oxyluciferin and the AMPH were selected for study.

As demonstrated in previous studies (Navizet et al., 2010; Berraud-Pache and Navizet, 2016), the use of MD simulations is mandatory to investigate the luciferase-oxyluciferin system. MD simulations allow the system to be equilibrated, which is especially important when water molecules are near or in the enzymatic cavity and could lead to a hydrogen-bond network between the substrates and residues of the protein. 21 MD simulations, each lasting 10 ns, were performed and are hereafter numbered from MD1 to MD21. In all the simulations, we have observed the displacement of water molecules inside the cavity of luciferase. In particular, in 5 out of the 21 MD simulations, a water molecule (hereafter called Wat1) positions itself between the keto moiety of oxyluciferin and AMPH. This conformation is stable along the simulation time as a hydrogen-bond network is created involving Wat1, oxyluciferin, AMPH and other protein residues, like Lys443. By analyzing the MD simulations, we have observed that Wat1 comes close to oxyluciferin at different simulation times depending on the trajectory. For example, Wat1 comes close to oxyluciferin at around 1 ns in MD1, whereas it takes more than 6 ns to reach this position in MD17.

The position of Wat1 suggests that this water molecule could be involved in the keto-enol tautomerization mechanism of oxyluciferin inside the protein. This way, the tautomerization could take place by the displacement of 3 protons: one from oxyluciferin, one from Wat1 and one from AMPH. We therefore decided to study the feasibility of this mechanism both in the GS and in S1 inside the protein.

For calculation purposes, we selected one representative snapshot corresponding to the lowest-energy structure found along the MD (snapshot taken at 3.708 ns from MD1). In this snapshot a water molecule, Wat1, was found between oxyluciferin and AMPH. By analyzing in more detail the structure of this snapshot (**Figure 1**), we observe that the nearby residues do not share any hydrogen bond with the substrates or Wat1. Only the hydrogen-bond network between oxyluciferin, AMPH and Wat1 is present. Thus, the choice of this snapshot, and hence of this hydrogen-bond network, was motivated by the wish to simplify the system and to avoid including more residues in the QM part, which would have increased the computational cost.

#### Description of the TS in Implicit Solvent

The optimized TS structure in the GS computed in PCM (TS<sup>PCM</sup><sub>GS</sub>) corresponds to a three-proton transfer reaction between 4 centers: O1, O2, O3 and C1 (see Supplementary Figure 3). Moreover, TS<sup>PCM</sup><sub>GS</sub> is structurally more similar to the keto form of oxyluciferin than to the enol form. For instance, H1 is closer to O2 of Wat1 (0.98 Å) than to O1 of oxyluciferin and H2 is closer to O3 of AMPH (1.03 Å) than to O2 of Wat1.

However, for the TS optimized in S1 in PCM (TS<sup>PCM</sup><sub>S1</sub>), the position of O4 leads to a 5-center TS, including O1, O2, O3, O4, and C1. It allows greater flexibility for the atoms and for the proton transfer. In TS<sup>PCM</sup><sub>S1</sub> (see Supplementary Figure 4), H1 is closer to O1 (1.01 Å) and H3 is still located between C1 and O3. The notable difference from TS<sup>PCM</sup><sub>GS</sub> is the presence of O4, another oxygen of the AMPH substrate, which is involved in the tautomerization reaction. Here H2 is not located between O2 and O3 as in TS<sup>PCM</sup><sub>GS</sub> but between O2 and O4. During the optimization of TS<sup>PCM</sup><sub>S1</sub>, the phosphate group of the AMPH is rotated by about 30°, bringing O4 close to the water molecule. Some details about the IRC calculations with implicit solvent can be found in Note 3 in Supplementary Material and Supplementary Figure 5.

In conclusion, when computing the TS in PCM both in the GS and in S1 (TS<sup>PCM</sup><sub>GS</sub> and TS<sup>PCM</sup><sub>S1</sub>), two different TS geometries are obtained. The tautomerization mechanism through TS<sup>PCM</sup><sub>GS</sub> is a 4-center reaction while for TS<sup>PCM</sup><sub>S1</sub>, 5 centers are involved. This study shows that a TS connecting the keto and enol forms of oxyluciferin can be obtained with PCM. The computed geometry of TS<sup>PCM</sup><sub>GS</sub> has served as the starting guess for the calculation of the TS in the GS inside the protein at the QM/MM level, which is the main goal of the present study.

## Keto-enol Tautomerization Inside the Protein

We have decided to use TSPCM GS , a 4 centers TS, as starting structure for the QM/MM calculations inside the protein. Indeed, in all the computed MD Wat1 adopts a position that suggests a 4 centers TS, as seen in **Scheme 3** and **Figure 1**.

The TS optimization inside the protein in the GS gives some unexpected results. First, two different TS geometries can be obtained, both corresponding to the keto-enol tautomerization. In the first one, the position of the 3 protons suggests that the TS is derived from the enol tautomer of oxyluciferin; it is thus named TSprot GS\_enol. Likewise, the second TS, named TSprot GS\_keto, shows protons close to the positions adopted in the keto form. Note that TSprot GS\_enol lies ca. 34 kcal/mol higher in energy than TSprot GS\_keto in the GS.

The second unexpected result concerns the geometries of all computed TSs. When using QM/MM calculations inside the protein, only a 5 centers tautomerization is observed in the GS even when starting with the 4 centers TSPCM GS . Within the protein active site, the substrates prefer to fill more space when performing the keto-enol tautomerization. All other attempts to compute a 4 centers TS in the GS failed, giving in all cases a 5 centers one (see **Scheme 4**).

Finally, the TSs in S1 have been located starting from both the structure of TSprot GS\_keto and of TSprot GS\_enol. As in the GS, two different TS structures have been found in S1, one structurally closer to the keto form, TSprot S1\_keto, and the other one closer to the enol form, TSprot S1\_enol. The TSprot S1\_keto is also lower in energy compared to TSprot S1\_enol. These differences are detailed in the Keto-enol tautomerization subsection.

#### Geometrical Description of the TSs

The TSsprot enol structures obtained in both GS and S1 are quite similar. They can be described by the transfer of 3 protons (H1, H2, and H3) between 5 centers (O1, O2, O3, O4, and C1) (see **Figure 2**). In particular, H1 is placed between O1 (oxyluciferin) and O2 (Wat1) but it is closer to O1 (1.00 Å) and so, resembles the enol form. Moreover, H2 is located close to O2 (Wat1) (0.99 Å) while H3 is situated between C1 and O3 (dH3−C1 = 1.46 Å and dH3−O3 = 1.28 Å).

Concerning the TSsprot keto, the structures computed in the GS and in S1 are also quite similar, corresponding to the transfer of 3 protons involving 5 centers (see **Figure 3**). However, in these cases H1 is closer to O2 (Wat1) (1.03 Å) than to O1 (oxyluciferin) (1.48 Å). Moreover, H2 is located close to O4 (AMPH) (1.04 Å) and H3 is nearly bonded to O3 (AMPH) (1.11 Å).

# Analysis of the Energetic Profiles of the Two TSs

After the optimization of the TSs, we computed the IRCs to check whether they really connect the keto and enol forms of oxyluciferin, both in the GS and in S1. The results presented in **Figures 4**, **5** superimpose the IRCs computed in the GS and in S1 for easier comparison. However, the reader should keep in mind that the reaction coordinates of the IRCs in the GS and in S1 are not the same. Moreover, the TS structure has been defined as the 0 value of the reaction coordinate. Moving to the right, toward positive values of the reaction coordinate, the keto tautomer is formed. Similarly, negative values of the reaction coordinate correspond to the formation of the enol form. The energy of the keto tautomer in the GS was taken as the reference.

# Energetic Profile of the TS\_enol

The shapes of the energetic profiles starting from TSprot GS\_enol and TSprot S1\_enol in the GS and in S1, respectively, are similar (see **Figure 4**). Moreover, the keto form is more stable than the enol form by about 5 kcal/mol in the GS and 11 kcal/mol in S1.

FIGURE 4 | Energetic profile of the keto-enol tautomerization starting from TSprot GS\_enol and TSprot S1\_enol at the reaction coordinate RC = 0. Superposition of the GS (blue) and S1 (red) profiles. Positive RC values lead to the keto form and negative ones to the enol form. The point at 0 eV corresponds to the lowest point of the energy profile, that is, to the keto form in the GS. The graph is a superposition of two graphs: the reaction coordinates of the IRCs in the GS and in S1 are not the same.

When following the energetic curve in the direction of the enol tautomer (i.e., negative values of the reaction coordinate RC), we observe a steep decrease of the energy followed by a gentler slope. The steep gradient corresponds mainly to the movement of the proton H3, as the protons H1 and H2 are already in their final enol positions in the TSprot enol, both in the GS and in S1. Thus, at about RC = −4, the enol tautomer is already formed. From this point, the system is stabilized by a rearrangement of oxyluciferin and AMPH leading to the final enol structure (**Figure 5**). Finally, the energy barrier for tautomerization starting from the enol form is 100 kcal/mol in the GS and 66 kcal/mol in S1.

On the side of the curve between the TS and the keto tautomer (i.e., positive values of the reaction coordinate), the profile can also be divided into two steps. In the GS curve, between RC = 0 and RC = 5, the proton H3 moves toward C1. The shoulder at RC = 5 corresponds to the breaking of the O1-H1 and O2-H2 bonds. From this point, the relaxation of the system can be described by the movement of these two protons, H1 and H2. The final structure is shown in **Figure 6**. The energy barrier for tautomerization starting from the keto form is 105 kcal/mol in the GS and 77 kcal/mol in S1.

Finally, in order to verify that the structures obtained at the end of the IRC calculations correspond to the keto and enol tautomers (**Figures 5**, **6**), we have computed their electronic transition energies (Te) between S1 and the GS (Tprot e\_keto\_IRC and Tprot e\_enol\_IRC, respectively). The electronic transition between S0 and S1 corresponds to a π-π∗ transition with a small charge transfer between the benzothiazole ring and the thiazolone ring for both tautomers. The computed energies have then been compared to reference values (Tprot e\_keto\_ref and Tprot e\_enol\_ref) obtained directly from the MD snapshots of these oxyluciferin forms (see Note 4 in Supplementary Material). Te has been computed with both the M06-2X and the B3LYP functionals, the latter giving emission energies closer to experiment (Berraud-Pache and Navizet, 2016) (see **Table 1**). The computed electronic transition energies of the keto and enol forms obtained from the IRC are in good agreement with the reference values: the energy difference is less than 0.1 eV, which is within the uncertainty of DFT.

# Energetic Profile of the TS\_keto

The energetic profiles found in the GS and in S1 starting from TSprot GS\_keto and TSprot S1\_keto, respectively, are similar (see **Figure 7**). The keto tautomer is again more stable than the enol one, by about 7 kcal/mol in the GS and 15 kcal/mol in S1. Concerning the formation of the enol form (negative values of RC), H3 moves first, which corresponds to the shoulder observed at RC = −7 in the GS and RC = −9 in S1. Then, the H1-O2 and H2-O3 bonds break and the two protons H1 and H2 move toward O1 and O2, respectively, leading to the enol tautomer. The computed energy barriers for the tautomerization starting from the enol form are 58 kcal/mol in the GS and 51 kcal/mol in S1.

For the formation of the keto form, positive RC values, a steep gradient is observed that corresponds to the movement of H3. The calculated energetic barriers from the keto form to the TS are 65 kcal/mol in the GS and 67 kcal/mol in S1.
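Because the two barriers of a given profile are measured up to the same TS on the same potential energy surface, their difference should reproduce the keto/enol energy gap quoted for that profile. The short Python sketch below is only a consistency check on the rounded values given in the text; the dictionary layout and the function name are illustrative and not part of the original protocol.

```python
# Barriers quoted in the text (kcal/mol), measured from each tautomer
# up to the TS on the same surface (GS or S1).
barriers = {
    ("TS_enol", "GS"): {"from_enol": 100, "from_keto": 105},
    ("TS_enol", "S1"): {"from_enol": 66, "from_keto": 77},
    ("TS_keto", "GS"): {"from_enol": 58, "from_keto": 65},
    ("TS_keto", "S1"): {"from_enol": 51, "from_keto": 67},
}


def keto_stabilization(from_enol, from_keto):
    """Energy of the enol minimum relative to the keto minimum.

    Both barriers end at the same TS, so the TS energy cancels out.
    """
    return from_keto - from_enol


gaps = {key: keto_stabilization(**value) for key, value in barriers.items()}
for (ts, state), gap in gaps.items():
    print(f"{ts} profile, {state}: keto more stable by {gap} kcal/mol")
```

With the rounded barriers, the gaps come out as 5 and 11 kcal/mol for the TSprot enol profiles and 7 and 16 kcal/mol for the TSprot keto profiles. The last value differs by 1 kcal/mol from the "about 15 kcal/mol" quoted above, presumably because of rounding.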

Finally, Te has also been computed for the structures reached by the IRC (see **Figures 5**, **6**) starting from TSprot S1\_keto (see **Table 2**). The electronic transition between S0 and S1 corresponds to a π-π∗ transition with a small charge transfer between the benzothiazole ring and the thiazolone ring for both tautomers. The computed values are again in good agreement with previous results (see Note 4 in Supplementary Material), with energy differences between 0.09 and 0.13 eV.

TABLE 1 | Electronic transition energies between S1 and GS (Te) in the protein at the TD-DFT/MM level for the resulting structures of the IRC, starting with TSprot S1\_enol.

<sup>a</sup>Data in SI.

#### Rotation

The study of the tautomerization inside the protein shows that the reaction involves 5 centers. As a consequence, the keto form from the MD snapshot (**Figures 1**, **8a**) does not match the one obtained by the IRC calculations in the GS starting from either TSprot GS\_keto or TSprot GS\_enol (**Figures 6**, **8b**). Indeed, by construction of the classical AMPH structure, the proton H2 is bound to the oxygen O3 in all MD simulations. In the 5 centers reaction inside the protein leading to the keto form, O4 of AMPH takes up the proton H2. The resulting keto system therefore shows an AMPH protonated on its O4 oxygen.

We have performed a dihedral scan to estimate the energy needed to rotate the phosphate group. We selected the last point from the IRC calculation and performed a scan in the GS around the dihedral O4-P-O5-C4 inside the protein. The same scan in S1 would take much longer to obtain, and we hypothesize that the results would not be very different from those in the GS. In the energetic profile, the energy reaches a minimum when the hydroxyl group O4-H2 has the same position as O3-H2 observed during the MD (see **Figure 8**). However, the difference in stability and the rotation barrier between the two structures are about 5 kcal/mol, which is rather small (see **Figure 9**).

TABLE 2 | Electronic transition energies between S1 and GS (Te) in the protein at the TD-DFT/MM level for the resulting structures of the IRC, starting with TSprot S1\_keto.

<sup>a</sup>Data in SI.
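As a reminder of the quantity being scanned, the dihedral O4-P-O5-C4 is the angle between the O4-P-O5 and P-O5-C4 planes. The Python sketch below illustrates this geometric definition only. The coordinates are hypothetical placeholders, not taken from the QM/MM structures, and the sign convention is just one of several in use.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p1, p2, p3, p4):
    """Signed dihedral angle p1-p2-p3-p4 in degrees (0 = cis, 180 = trans)."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)          # normal of the p1-p2-p3 plane
    n2 = _cross(b2, b3)          # normal of the p2-p3-p4 plane
    b2_len = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(x / b2_len for x in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


# Hypothetical coordinates (in Angstrom) standing in for O4, P, O5 and C4.
o4, p, o5, c4 = (1.2, 0.8, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.6), (1.0, -0.9, 2.2)
print(f"O4-P-O5-C4 dihedral: {dihedral_deg(o4, p, o5, c4):.1f} degrees")
```

In a relaxed scan, this angle would be constrained to a series of target values while the other coordinates are optimized; the step size used by the authors is not given in the text.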

## Keto-enol Tautomerization

Using QM/MM calculations, we have found that different TSs, TSsprot keto and TSsprot enol, can lead to the keto-enol tautomerization, both in the GS and in S1. Both TSs correspond to a 5 centers mechanism involving two oxygen atoms of the phosphate group of AMPH. In the TSsprot keto, oxyluciferin is close to the keto conformation, while it looks more like the enol form in the TSsprot enol. The energy of the TSsprot keto is always lower than that of the TSsprot enol, both in the GS and in S1. The energy difference between them is quite high, about 34 kcal/mol in the GS and 26 kcal/mol in S1. In all our calculations, keto-OxyLH<sup>−</sup> is more stable than the enol form; it is therefore not surprising to find the TSsprot keto more stable than the TSsprot enol.


This energy difference also plays a major role in the barrier heights. The barriers computed for the TSsprot keto are significantly lower than those for the TSsprot enol, both in the GS and in S1. However, the computed values remain high (the lowest one is ca. 51 kcal/mol, see **Table 3**) compared to previous studies of keto-enol tautomerization (ca. 40 kcal/mol) (Cucinotta et al., 2006; Alagona and Ghio, 2008).
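To give a feeling for what such barriers imply, the sketch below converts a barrier into a rate constant with the Eyring equation, k = (kB·T/h)·exp(−ΔG‡/RT). It is an order-of-magnitude illustration only. It assumes that the computed potential-energy barrier can stand in for the activation free energy, that the transmission coefficient is 1, and that the temperature is 298.15 K; none of these assumptions is stated in the text.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
H = 6.62607015e-34      # Planck constant, J*s
R = 1.987204259e-3      # gas constant, kcal/(mol*K)


def eyring_rate(barrier_kcal_mol, temperature_k=298.15):
    """Eyring rate constant in s^-1 for a barrier given in kcal/mol.

    Assumption: the barrier is used as if it were an activation free energy.
    """
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_kcal_mol / (R * temperature_k))


# 40 kcal/mol: literature value quoted in the text; the others are barriers
# reported above for the paths through the TSsprot keto and TSsprot enol.
for barrier in (40.0, 51.0, 58.0, 100.0):
    print(f"{barrier:6.1f} kcal/mol -> k = {eyring_rate(barrier):.1e} s^-1")
```

Under these assumptions, even the lowest computed barrier (ca. 51 kcal/mol) gives a rate constant many orders of magnitude below 1 s<sup>−1</sup>, which is consistent with the statement that the computed values remain high.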

The presence of the TSsprot keto and TSsprot enol also raises the question of a preferable pathway. Indeed, one possible hypothesis is that, because the TSsprot keto structures are closer to the keto-OxyLH<sup>−</sup> geometry, the tautomerization reaction from the keto tautomer to the enol tautomer might go through the TSsprot keto. The reverse reaction, enol to keto tautomerization, should also go through the TSsprot keto as they are lower in energy but because the TSsprot enol protons' arrangement is close to the reactant enol, the path through these TSs might also be considered.

With our calculations and under the hypotheses we have made, the energy barriers computed in the protein are very high and indicate that the tautomerization would not be easy inside the protein, either in the GS or in S1. Experimental results show that, after the reaction, a mixture of keto and enol forms has been detected inside the protein (Naumov et al., 2009). However, other experimental results show that the keto-OxyLH<sup>−</sup> tautomer is the only bioluminescence emitter in fireflies, as another recent study also shows (Pirrung et al., 2017). From all these results, we can deduce that the tautomerization is most probably difficult in S1 and that, for it to happen in the GS after light emission, the protein environment should change (for example, movement of the Cterm and rearrangement of the H-bonding network). This remains to be confirmed by further calculations, in particular by testing other hypotheses such as the protonation state of AMP, using more snapshots from the MD, or using a bigger QM region. Moreover, other details can be taken into account to refine the model. In the chosen snapshot, the QM part does not exhibit strong interactions with the protein. As mentioned before, some water molecules or residues can form hydrogen bonds with the atoms involved in the tautomerization reaction.

The use of QM/MM calculations has provided a better model of the keto-enol tautomerization than the implicit solvent model. The main finding concerns the characterization of the TSs. Inside the protein, the concerted displacement of 3 protons is described. The 5 centers TS geometry shows that the active site of the protein is quite flexible and can sustain a complex chemical reaction.

#### CONCLUSION

In this publication, we demonstrate that chemical reactions can be explored using QM/MM calculations by studying the keto-enol tautomerization of the emitter of firefly bioluminescence.

Extensive MD calculations show that a water molecule is recurrently present between oxyluciferin and AMPH, which allows the transfer of 3 protons during tautomerization. Preliminary QM calculations in PCM are necessary to obtain guess TS structures as models for further calculations at the QM/MM level.

The use of QM/MM calculations to study the chemical reaction unveils some unexpected results. First, the reaction is possible in the protein when 5 chemical centers are involved, in contrast to the PCM study, where only a structure compatible with the 4 centers reaction is observed in the GS. Second, we have found two different TSs through which the tautomerization reaction can proceed. Each of these TSs reflects a tautomer of the emitter: one is similar to the keto form, while the other is close to the enol one. In addition, these TSs have distinct energies, the TSsprot keto being the most stable ones in both the GS and S1.

TABLE 3 | Energy barriers in kcal/mol of keto-enol and enol-keto tautomerization paths, in the GS and in S1, through TSsprot keto and TSsprot enol.

The computed barriers are quite high, or even insurmountable. It is thus difficult to envisage that the keto-enol tautomerization takes place inside the protein in the excited state before emission. This gives further support to the role of keto-OxyLH<sup>−</sup> as the main emitter of the bioluminescence.

The results presented here can be improved in different ways. Several snapshots can be taken into account to improve the general picture of the system. For this purpose, we are presently simulating the emission spectra of different analogs of oxyluciferin using a statistically meaningful number of snapshots. Moreover, some residues can have an effect on the barrier height. The residue LYS 443, which is important in fireflies, seems a good candidate. Other methods that have already proved efficient, such as QM/MM dynamics or metadynamics (Cucinotta et al., 2006), can also be used.

We think that this protocol can be applied to other biological systems, like DNA (Cerón-Carrasco and Jacquemin, 2013), and bring new insights into the modeling of chemical reactions.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

All authors are grateful to the French Agence Nationale de la Recherche (grant ANR-BIOLUM ANR-16-CE29-0013). Cristina García-Iriepa acknowledges Fundación Ramón Areces for a postdoctoral fellowship.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00116/full#supplementary-material

#### REFERENCES

Acevedo, O., and Jorgensen, W. L. (2010). Advances in Quantum and Molecular Mechanical (QM/MM) simulations for organic and enzymatic reactions. Acc. Chem. Res. 43, 142–151. doi: 10.1021/ar900171c


constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23, 321–341. doi: 10.1016/0021-9991(77)90098-5


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Berraud-Pache, Garcia-Iriepa and Navizet. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Exploring the Interaction Mechanism Between Cyclopeptide DC3 and Androgen Receptor Using Molecular Dynamics Simulations and Free Energy Calculations

Huimin Zhang<sup>1†</sup>, Tianqing Song<sup>1†</sup>, Yizhao Yang<sup>2</sup>, Chenggong Fu<sup>1</sup> and Jiazhong Li<sup>1</sup>\*

*<sup>1</sup> School of Pharmacy, Lanzhou University, Lanzhou, China, <sup>2</sup> College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, China*

#### Edited by:

*Sam P. De Visser, University of Manchester, United Kingdom*

#### Reviewed by:

*Gerardo Andres Cisneros, University of North Texas, United States Jitrayut Jitonnom, University of Phayao, Thailand*

> \*Correspondence: *Jiazhong Li lijiazhong@lzu.edu.cn*

*†These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry*

Received: *29 January 2018* Accepted: *30 March 2018* Published: *19 April 2018*

#### Citation:

*Zhang H, Song T, Yang Y, Fu C and Li J (2018) Exploring the Interaction Mechanism Between Cyclopeptide DC3 and Androgen Receptor Using Molecular Dynamics Simulations and Free Energy Calculations. Front. Chem. 6:119. doi: 10.3389/fchem.2018.00119*

Androgen receptor (AR) is a key target in the discovery of anti-PCa (prostate cancer) drugs. Recently, a novel cyclopeptide, *Diffusa Cyclotide-3* (DC3), isolated from *Hedyotis diffusa*, has been experimentally demonstrated to inhibit the survival and growth of LNCaP cells, which typically express T877A-mutated AR, the most frequently detected point mutation of AR in castration-resistant prostate cancer (CRPC). However, the interaction mechanism between DC3 and AR is not clear. In this study, we aim to explore the possible binding mode of DC3 to T877A-mutated AR from a molecular perspective. First, homology modeling was employed to construct the three-dimensional structure of the cyclopeptide DC3 using 2kux.1.A as the template. Then, molecular docking, molecular dynamics (MD) simulations, and molecular mechanics/generalized Born surface area (MM-GBSA) methods were used to determine the binding site and explore the detailed interaction mechanism of the DC3-AR complex. The results suggest that the site formed by H11, loop888-893, and H12 (site 2) is the most probable position for DC3 binding to AR. In addition, hydrogen bonds and hydrophobic and electrostatic interactions play dominant roles in the recognition and binding of the DC3-AR complex. The essential residues dominant in each interaction were specifically revealed. This work improves our understanding of the interaction mechanism of DC3 binding to AR at the molecular level and contributes to rational cyclopeptide drug design for prostate cancer.

Keywords: Cyclopeptide DC3, androgen receptor, protein drug interaction, homology modeling, molecular docking, molecular dynamics simulations

# INTRODUCTION

Prostate cancer (PCa) has become the second most frequently diagnosed cancer in men worldwide (American Cancer Society, 2015). Prostate, lung and bronchus, and colorectal cancers account for about 44% of all cancer cases in men, with PCa alone accounting for 1 in 5 new diagnoses (Siegel et al., 2016). PCa is especially common in economically developed countries and regions such as Northern and Western Europe, Northern America, and Oceania (American Cancer Society, 2015). In America, prostate cancer is the most common cancer and was predicted to be the leading cause of male cancer-related death over the next decade (Siegel et al., 2016). In less developed countries, the incidence rate of prostate cancer has been increasing in recent years, with stable or increasing mortality trends (Center et al., 2012).

Androgen receptor (AR) (NR3C4, nuclear receptor subfamily 3, group C, gene 4), a member of steroid hormone group of nuclear receptor superfamily, plays an essential role in the development and proliferation of prostate cancer (Tsai and O'Malley, 1994; Mangelsdorf et al., 1995; Nuclear Receptors Nomenclature Committee, 1999). The survival and growth of PCa cells are dependent on the androgenic stimulation through AR. Firstly, 5α-dihydrotestosterone (DHT) binds to AR to promote the association of AR co-regulators. Then the activated AR migrates into nucleus and regulates the expression of target genes in prostate cells (Heinlein and Chang, 2004).

Clinically, PCa is commonly treated by perturbing the AR pathway, for example by androgen suppression via surgical or chemical [gonadotropin-releasing hormone (GnRH) analogs] castration (Palmbos and Hussain, 2013). AR antagonist drugs, such as flutamide, nilutamide, bicalutamide, and enzalutamide, act by suppressing the action of androgens through competition for AR binding sites (Yamamoto et al., 2012). These androgen blockade therapies are initially effective; however, a considerable proportion of patients ultimately develop castration-resistant prostate cancer (CRPC) after prolonged use of an AR antagonist (Schröder, 2008; Yamaoka et al., 2010). AR mutation is one of the leading causes of antiandrogen resistance (Tan et al., 2015). The mutated ARs bind to other steroid hormones, and AR transcriptional activity is activated in response to antiandrogens, which results in PCa growth (Tan et al., 2015). It is therefore of great significance to seek and explore novel anti-CRPC drugs targeting mutated AR.

DC3 (Diffusa Cyclotide-3) is a novel cyclopeptide isolated from the traditional Chinese medicine (TCM) Hedyotis diffusa (Hu et al., 2015), which has long been widely used in China for the treatment of various cancers and tumors, including prostate cancer (Lin et al., 2010, 2011; Liu et al., 2010; Lee et al., 2011). DC3 has been shown experimentally to exhibit potent cytotoxicity against LNCaP cells and to inhibit cell migration and invasion. In addition, it significantly inhibits tumor growth, in both weight and size, in the mouse xenograft model. All these findings lead to the conclusion that DC3 has evident anti-PCa effects both in vitro and in vivo (Hu et al., 2015). Moreover, in DC3 sensitivity experiments on three types of human prostate cancer cells, the androgen-dependent LNCaP cell line showed markedly higher sensitivity to DC3 than the androgen-independent PC3 and DU145 cell lines. Furthermore, LNCaP cells typically express T877A-mutated AR, which is the most frequently detected point mutation in CRPC (Veldscholte et al., 1992; Zhou et al., 2010; Yamada et al., 2013). All this evidence suggests that DC3 is a potential candidate for binding to T877A AR.

However, the interaction mechanism between DC3 and T877A-mutated AR is not clear, so research into this mechanism is worthwhile. Fortunately, the amino acid sequence of DC3 has been determined experimentally (Hu et al., 2015), which makes it possible to explore the interaction mechanism at the molecular level. Following published work, the present study consists of three parts: (1) constructing the three-dimensional structure of the cyclopeptide, in our case DC3 (Jitonnom et al., 2012); (2) determining the binding site and binding pose of the cyclopeptide on the protein (Punkvang et al., 2015); and (3) investigating the detailed interaction mechanism between the cyclopeptide and the protein (Liu et al., 2015; Hitzenberger et al., 2017).

In this study, homology modeling was first used to construct the three-dimensional structure of the cyclopeptide DC3 from its amino acid sequence. Then, molecular docking, all-atom molecular dynamics (MD) simulations, molecular mechanics/generalized Born surface area (MM/GBSA) calculations, and various MD trajectory analysis methods were combined to identify the most probable binding site of DC3 on AR, determine the key residues that dominate the binding process, and elucidate the detailed interaction mechanism. The results are expected to reveal the interaction mechanism of the DC3-AR complex and to promote the development of DC3 and related cyclopeptide AR antagonists, which will contribute to rational drug design for prostate cancer.

# METHODS

# Homology Modeling of Cyclopeptide DC3

Homology modeling is a common technique for constructing a three-dimensional structure from an amino acid sequence using homologous proteins of known structure as templates (Topham et al., 1990; Bordoli et al., 2009; Wang Z. et al., 2015). As the amino acid sequence of DC3 was confirmed by Edman degradation and gene cloning (Hu et al., 2015), homology modeling was adopted here to build the 3D structure of DC3 using SWISS-MODEL (Arnold et al., 2006; Guex et al., 2009; Kiefer et al., 2009; Biasini et al., 2014; Bienert et al., 2016). The SWISS-MODEL Template Library (SMTL) was searched with both BLAST and HHblits to identify templates and target-template alignments (Arnold et al., 2006). The template was then selected based on criteria such as sequence similarity, sequence identity, coverage, and the global model quality estimation (GMQE) score.

## Molecular Docking Analysis of DC3 to AR

Molecular docking (Benkert et al., 2011; Meng et al., 2011; Yuriev et al., 2015) was used to analyze the possible binding site and preferred orientation of DC3 on the androgen receptor by simulating the binding conformation and computing the binding affinity. The crystal structure coordinates of the T877A-mutated AR LBD were obtained from the RCSB Protein Data Bank (http://www.rcsb.org/pdb; PDB ID: 4OHA). The missing loop regions were refined with Discovery Studio 2.5 (Accelrys Inc. CA, 2009). Molecular docking was carried out with the ZDOCK module. The rigid-body protein–protein docking program ZDOCK uses the Fast Fourier Transform algorithm to enable an efficient global docking search on a 3D grid, and utilizes a combination of shape complementarity, electrostatics, and statistical potential terms for scoring. Finally, two simple scoring functions (ZRank Score and ZDock Score), the number of poses in each cluster, and the plausibility of the binding mode were taken into consideration to evaluate the docking results.

#### Molecular Dynamics Simulations

Molecular dynamics (MD) simulations were performed with the Amber12 package (Case et al., 2012). All simulations used the ff99SB force field (Hornak et al., 2006) and periodic boundary conditions. First, six chloride counterions were added to each system to maintain electroneutrality. Each system was then immersed in a cubic box of TIP3P (van der Spoel and van Maaren, 2006) water whose edges were at least 10 Å from the complex. Energy minimization was carried out in three stages with different harmonic restraints: all atoms restrained with 5.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>, only receptor backbone atoms restrained with 3.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>, and no restraint. Each minimization ran for 5,000 steps, the first 2,500 with the steepest descent method and the subsequent 2,500 with the conjugate gradient method. The systems were then heated to 310.0 K in the NVT ensemble over 100 ps with the receptor backbone atoms restrained with 5.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>. Next, each system was equilibrated for a total of 1.5 ns in the NPT ensemble: the first 800 ps were divided into four stages with decreasing restraints (4.0, 3.0, 2.0, and 1.0 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>, respectively), and the last 700 ps were run without any restraint. Minimization, heating, and equilibration were executed with the Sander program. Finally, a 150 ns production MD simulation of each system was performed with the PMEMD program at 310.0 K and 1 atm in the NPT ensemble without any restraint. The Langevin thermostat was used to control the temperature and the Berendsen barostat to maintain constant pressure. The time step was set to 2 fs, and the trajectory coordinates were recorded every 2 ps. During the simulations, the SHAKE algorithm (Ryckaert et al., 1977) was employed to constrain the bond lengths involving hydrogen, and the Particle Mesh Ewald (PME) method (Darden et al., 1993; Fischer et al., 2015) was adopted to calculate the electrostatic interactions with a 10 Å non-bonded cutoff.
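The bookkeeping implied by this protocol can be sketched in a few lines of Python. The sketch only restates the numbers given above (2 fs time step, frames saved every 2 ps, 150 ns of production); the equal 200 ps length of the four restrained equilibration stages is an assumption, since the text only says that 800 ps were divided into four stages, and all names are illustrative.

```python
def n_steps(length_ps, dt_fs=2.0):
    """Number of integration steps for a segment of the given length."""
    return round(length_ps * 1000.0 / dt_fs)


def n_frames(length_ps, save_every_ps=2.0):
    """Number of saved trajectory frames for a segment of the given length."""
    return round(length_ps / save_every_ps)


# (label, length in ps, restraint force constant in kcal/mol/A^2)
# Assumption: the four restrained NPT stages are 200 ps each.
equilibration = [
    ("NPT stage 1", 200.0, 4.0),
    ("NPT stage 2", 200.0, 3.0),
    ("NPT stage 3", 200.0, 2.0),
    ("NPT stage 4", 200.0, 1.0),
    ("NPT unrestrained", 700.0, 0.0),
]

total_equil_ps = sum(length for _, length, _ in equilibration)
production_ps = 150 * 1000.0

print(f"equilibration: {total_equil_ps / 1000.0} ns")
print(f"production steps: {n_steps(production_ps):,}")
print(f"production frames: {n_frames(production_ps):,}")
print(f"frames in the last 50 ns: {n_frames(50 * 1000.0):,}")
```

The production run thus corresponds to 75,000,000 steps and 75,000 saved frames per trajectory, of which 25,000 fall in the last 50 ns.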

#### Free Energy Calculations

To investigate the interaction of the DC3-AR systems from an energetic perspective, binding free energies were calculated from the MD trajectories with the MM-GBSA method (Hou et al., 2011a,b; Xu et al., 2013; Sun et al., 2014a,b; Chen et al., 2016). The binding free energy ΔGbind was calculated with the following equation:

$$
\Delta \mathbf{G}\_{\text{bind}} = \Delta \mathbf{H} - \mathbf{T} \Delta \mathbf{S} = \Delta \mathbf{E}\_{\text{gas}} + \Delta \mathbf{G}\_{\text{sol}} - \mathbf{T} \Delta \mathbf{S}
$$

where ΔH represents the enthalpy contribution, which is composed of the changes in the gas phase (ΔEgas) and in the solvent phase (ΔGsol), and −TΔS represents the entropy contribution. The entropy calculation is time-consuming, and its value fluctuates if only a small number of snapshots is used (Hou et al., 2011a; Wang Q. et al., 2015); it was therefore omitted in this study. ΔEgas was taken as the sum of the internal energy (ΔEint) from bonds, angles, and torsions, the van der Waals energy (ΔEvdw), and the electrostatic energy (ΔEele), as follows:

$$
\Delta \mathbf{E}\_{\text{gas}} = \Delta \mathbf{E}\_{\text{int}} + \Delta \mathbf{E}\_{\text{ele}} + \Delta \mathbf{E}\_{\text{vdw}}
$$

ΔGsol can be decomposed into polar and nonpolar contributions as follows:

$$
\Delta \mathbf{G}\_{\rm sol} = \Delta \mathbf{G}\_{\rm GB} + \Delta \mathbf{G}\_{\rm SA}
$$

Here, ΔGGB represents the polar solvation contribution, which is calculated by solving the GB equation (Kollman et al., 2000; Onufriev et al., 2004). ΔGSA, estimated from the solvent accessible surface area, represents the nonpolar solvation contribution.

To further explore the detailed interactions in the DC3-AR complex, free energy decomposition was performed with the MM-GBSA method to identify the key residues responsible for the binding energy. The contribution of each residue was calculated without considering entropy. It is defined as the sum of the van der Waals contribution (ΔEvdW), the electrostatic contribution (ΔEele), the polar solvation contribution (ΔGGB), and the nonpolar solvation contribution (ΔGSA):

$$
\Delta \mathbf{G}\_{\text{residue}} = \Delta \mathbf{E}\_{\text{vdW}} + \Delta \mathbf{E}\_{\text{ele}} + \Delta \mathbf{G}\_{\text{GB}} + \Delta \mathbf{G}\_{\text{SA}}.
$$

Snapshots, used for both binding free energy and free energy decomposition calculations, were extracted from the last 50 ns of MD trajectories at intervals of 2 ps.

## MD Trajectory Analysis

#### Hydrogen Bond Analysis

The number of hydrogen bonds was calculated as a function of simulation time to monitor the stability of the system during the simulation. A hydrogen bond was counted when the acceptor-donor distance was <0.35 nm and the angle was >120° (Fu et al., 2013); 0.35 nm is a common choice of hydrogen bond distance in the literature (Liu et al., 2014; Wang Q. et al., 2015). The frames used for this calculation were extracted from the whole 150 ns MD trajectories at intervals of 2 ps. In addition, to determine which hydrogen bonds play dominant roles in maintaining system stability in the last 50 ns of the MD simulations, the occupancy of each hydrogen bond formed in this period was calculated with the following equation (Liu et al., 2014):

$$P\_{hbond} = \frac{N\_{exist}}{N\_{total}} \times 100\%$$

where N<sub>exist</sub> is the number of frames in which the targeted hydrogen bond is present, and N<sub>total</sub> is the total number of frames. The occupancies range from 1 to 100%, and a higher percentage indicates a more persistent hydrogen bond.
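The geometric criterion and the occupancy formula can be illustrated with a short, self-contained sketch. It is not the analysis tool used in the study. The function names are hypothetical, coordinates are assumed to be in nm, and the angle is assumed to be the donor-hydrogen-acceptor angle measured at the hydrogen, which the text does not specify.

```python
import math


def _angle_deg(a, b, c):
    """Angle at vertex b formed by points a-b-c, in degrees."""
    v1 = [a[k] - b[k] for k in range(3)]
    v2 = [c[k] - b[k] for k in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def is_hbond(donor, hydrogen, acceptor, max_dist_nm=0.35, min_angle_deg=120.0):
    """Apply the distance (<0.35 nm) and angle (>120 degrees) criteria to one frame.

    Assumption: the angle is donor-hydrogen-acceptor, measured at the hydrogen.
    """
    close_enough = math.dist(donor, acceptor) < max_dist_nm
    straight_enough = _angle_deg(donor, hydrogen, acceptor) > min_angle_deg
    return close_enough and straight_enough


def occupancy_percent(present_flags):
    """P_hbond = N_exist / N_total * 100, from one boolean per frame."""
    flags = list(present_flags)
    return 100.0 * sum(flags) / len(flags)


# Hypothetical four-frame example: the bond is present in three frames.
print(occupancy_percent([True, True, False, True]))
```

In practice one boolean per frame would be produced by calling `is_hbond` on every frame of the trajectory for a given donor-acceptor pair, and the resulting list passed to `occupancy_percent`.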

#### Dynamic Cross Correlation Matrix

Dynamic cross-correlation matrix (DCCM) analysis of the Cα atoms during the last 50 ns of the first parallel trajectory was performed to explore the correlated motion between residues of the DC3-AR complex. The cross-correlation matrix C<sub>ij</sub>, which reflects the displacements of the Cα atoms relative to their average positions, was determined by the following equation (Lange et al., 2005; Ghosh and Vishveshwara, 2007):

$$C\_{ij} = \frac{\langle \Delta R\_i \, \cdot \, \Delta R\_j \rangle}{\sqrt{\langle \Delta R\_i \, \cdot \, \Delta R\_i \, \rangle \, \langle \Delta R\_j \, \cdot \, \Delta R\_j \rangle}}$$

where ΔR<sub>i</sub> and ΔR<sub>j</sub> represent the displacements of atoms i and j, respectively. The value of C<sub>ij</sub> ranges from −1 to 1: a positive value indicates correlated motion between residue i and residue j, while a negative value indicates anti-correlated motion.
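A direct, unoptimized transcription of the cross-correlation formula is sketched below. The function name `dccm` and the input layout are hypothetical. The sketch assumes that the frames have already been superposed onto a common reference and that every atom moves in at least one frame (otherwise the denominator is zero).

```python
import math


def dccm(trajectory):
    """Cross-correlation matrix C_ij from a list of frames.

    trajectory[f][a] is the (x, y, z) position of atom a in frame f.
    Assumption: frames are already aligned to a common reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])

    # Average position of every atom over the trajectory.
    mean = [
        [sum(frame[a][k] for frame in trajectory) / n_frames for k in range(3)]
        for a in range(n_atoms)
    ]
    # Displacement Delta R of every atom in every frame.
    disp = [
        [[frame[a][k] - mean[a][k] for k in range(3)] for a in range(n_atoms)]
        for frame in trajectory
    ]

    def avg_dot(i, j):
        # Time average <Delta R_i . Delta R_j>
        total = sum(sum(d[i][k] * d[j][k] for k in range(3)) for d in disp)
        return total / n_frames

    diag = [avg_dot(i, i) for i in range(n_atoms)]
    return [
        [avg_dot(i, j) / math.sqrt(diag[i] * diag[j]) for j in range(n_atoms)]
        for i in range(n_atoms)
    ]


# Hypothetical toy trajectory: atom 1 follows atom 0, atom 2 moves oppositely.
toy = [[(t, 0.0, 0.0), (5.0 + t, 0.0, 0.0), (-t, 0.0, 0.0)] for t in (-1.0, 0.0, 1.0)]
matrix = dccm(toy)
```

In the toy trajectory, atoms moving together give values close to +1 and atoms moving in opposite directions give values close to −1, which matches the interpretation given above.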

#### Clustering Analysis

Clustering analysis was conducted with the MMTSB toolset in Amber 12 to determine the representative structure of the DC3-AR complex during the last 50 ns of the MD simulations. First, similar conformations of the DC3-AR complex from the trajectory were grouped into clusters; the most populated cluster is the one containing the largest number of conformations. The centroids of the resulting clusters were then calculated. Subsequently, the RMSD of each structure was calculated with respect to the centroid of its cluster. Finally, the structure with the lowest RMSD to the centroid of the most populated cluster was defined as the representative structure of DC3-AR. The representative structure was then used to generate the hydrophobic and electrostatic interaction surfaces of the DC3-AR complex with the UCSF Chimera package (Pettersen et al., 2004; http://www.cgl.ucsf.edu/chimera).
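The selection of the representative structure can be sketched as follows, under the assumption that cluster labels have already been assigned by a clustering tool and that all frames are superposed on a common reference. The function names are hypothetical; this is not the MMTSB implementation, and the centroid is taken here as the plain average of the member coordinates.

```python
import math
from collections import Counter


def rmsd(coords_a, coords_b):
    """Plain RMSD between two coordinate sets (no superposition is performed)."""
    n_atoms = len(coords_a)
    total = sum(
        (p[k] - q[k]) ** 2 for p, q in zip(coords_a, coords_b) for k in range(3)
    )
    return math.sqrt(total / n_atoms)


def representative_frame(frames, labels):
    """Index of the frame closest to the centroid of the most populated cluster.

    frames[f][a] is the (x, y, z) position of atom a in frame f;
    labels[f] is the cluster label of frame f (assumed to be given).
    """
    top_label, _count = Counter(labels).most_common(1)[0]
    members = [i for i, lab in enumerate(labels) if lab == top_label]
    n_atoms = len(frames[0])
    # Centroid = average coordinates over the members of the largest cluster.
    centroid = [
        tuple(sum(frames[i][a][k] for i in members) / len(members) for k in range(3))
        for a in range(n_atoms)
    ]
    return min(members, key=lambda i: rmsd(frames[i], centroid))


# Hypothetical one-atom example: frames 0-2 form the largest cluster.
toy_frames = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
toy_labels = [0, 0, 0, 1]
best = representative_frame(toy_frames, toy_labels)
```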

#### Dynamical Correlation Network

The cross-correlation matrix C<sub>ij</sub> was also used to build a dynamical correlation network that displays the correlated motion between residues in different protein domains. The Cα atom of each residue was defined as a "node," and an "edge" connects each pair of nodes whose residues interact with each other (Liu et al., 2014). The edge weights were computed with the following equation:

$$d\_{ij} = -\log\left(|C\_{ij}|\right)$$

Here, each edge makes a specific contribution to the movement of the complex, and the motion of residue i can be used to predict the direction of motion of residue j. Cross-correlations between residues with |i − j| ≤ 10 were ignored to remove correlations that arise merely from proximity. Pairs with −0.3 ≤ C<sub>ij</sub> ≤ 0.3 were also removed to make the network plot more concise. The Network View plugin in Visual Molecular Dynamics (VMD) 1.9.2 (Humphrey et al., 1996; Hsin et al., 2008) was used to visualize the interaction network.
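The two filters and the edge-weight formula can be combined in a small sketch. The function name `network_edges` and its parameters are hypothetical, residue indices are assumed to run consecutively along the complex, and the natural logarithm is assumed because the text does not state the base.

```python
import math


def network_edges(c, min_separation=10, threshold=0.3):
    """Edge weights d_ij = -log|C_ij| for the residue pairs kept in the network.

    c is a square cross-correlation matrix (list of lists).
    Pairs with |i - j| <= min_separation or |C_ij| <= threshold are dropped.
    Assumption: natural logarithm.
    """
    edges = {}
    n = len(c)
    for i in range(n):
        for j in range(i + 1, n):
            if j - i <= min_separation:
                continue  # too close along the sequence
            if abs(c[i][j]) <= threshold:
                continue  # correlation too weak to display
            edges[(i, j)] = -math.log(abs(c[i][j]))
    return edges


# Hypothetical 13-residue matrix with a few non-zero entries.
n_res = 13
toy = [[0.0] * n_res for _ in range(n_res)]
for (i, j), value in {(0, 12): 0.8, (0, 11): 0.2, (0, 5): 0.9, (1, 12): -0.5}.items():
    toy[i][j] = toy[j][i] = value
kept = network_edges(toy)  # keeps only pairs (0, 12) and (1, 12)
```

With this weighting, strongly correlated or anti-correlated pairs (|C<sub>ij</sub>| close to 1) receive weights close to zero, so they appear as short edges in the network.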

# RESULTS AND DISCUSSION

#### Homology Modeling of Cyclopeptide DC3

A template search with SWISS-MODEL found 28 qualified templates for the DC3 sequence. According to the selection criteria and the basic structural characteristics of cyclopeptides (Craik et al., 1999; Sze et al., 2009), 2kux.1.A (Plan et al., 2010) was selected as the final template to construct the three-dimensional structure of DC3. As shown in **Figure 1A**, the amino acid sequence identity between template and target is 56.67%, the sequence similarity is 0.52, the Global Model Quality Estimation (GMQE) value is 0.94, and the target sequence is fully covered. The sequence alignment and structure comparison of target and template are shown in **Figure 1B**, where their high similarity can easily be observed. In addition, the Z-score information and the predicted local similarity of each residue to the target are shown in **Figures 1C,D**, respectively. All of this information supports the reliability of the constructed model. As the AR residues are numbered from 671 to 919, the DC3 residues were numbered from 641 to 670 for convenience.

## Binding Site Exploration

#### Molecular Docking Analysis

Molecular docking was then performed to explore the possible binding sites and binding mechanism of the DC3-AR complex. The ZDOCK module generated 2,000 poses in 60 clusters. The number of poses in each cluster, the ZRank Score, the ZDock Score, and the rationality of the binding mode were combined to assess the poses. The ZDock Score is calculated from the degree of shape matching between receptor and ligand, and a higher score represents a better pose. The ZRank Score represents the energy contribution to the system when a ligand binds to a receptor; it additionally takes the electrostatic interaction, van der Waals forces, and desolvation energy into consideration, and a lower score represents a better pose. Based on these criteria, the top four possible binding sites of DC3 on AR were selected. As shown in **Figures 2A,B**, the top four clusters of DC3-AR are located in Site 1, Site 2, Site 4, and Site 3, respectively. The score for binding to Site 2 was higher than those for the other possible binding sites, although the number of poses in this cluster is relatively small. However, it should be noted that scoring functions do not always yield the best predictions of binding affinity (Ramírez and Caballero, 2016). To further confirm the binding affinity prediction, molecular dynamics (MD) simulations were subsequently performed to obtain more conformational sampling of these four systems. Four poses of the DC3-AR complex (**Table 1**, **Figures 2C–F**), selected from the top four possible binding sites for their good ZRank Score, good ZDock Score, and rational conformations, were used as the initial structures for the MD simulations and named system 1, system 2, system 3, and system 4.

#### Root Mean Square Deviation

MD simulations of 150 ns were then performed on each of the four DC3-AR complex systems obtained from molecular docking. To obtain reliable and reproducible results, three parallel MD simulations were run for each system. The root mean square deviation (RMSD) of the backbone atoms of the DC3-AR complexes was then calculated relative to the initial structures to monitor the stability and overall convergence of each system during the simulation. As shown in **Figure 3**, all systems fluctuated to various degrees at first but gradually converged. The first and third trajectories of system 4 and all parallel trajectories of the other three systems reached equilibrium in the last 50 ns and were therefore suitable for the subsequent analyses of dynamic behavior. However, the second parallel trajectory of system 4 underwent large structural changes at about 50 ns, which suggests relatively poor stability of system 4. Because of this abnormality, this trajectory was excluded from the subsequent binding free energy analysis, which was carried out by averaging the values of the parallel trajectories.

#### Root Mean Square Fluctuation

The root mean square fluctuation (RMSF) describes how much each residue fluctuates around its average position during the simulation and is also a tool to assess the dynamic stability of a system. Here, RMSF values of the Cα atoms in the last 50 ns were calculated from the first parallel trajectory of each system. The RMSF values of the DC3 residues are shown in **Figure 4A** to explore the stability of DC3. In contrast to the large fluctuations in the other systems, the DC3 residues in system 2 underwent only minor motions, which demonstrates that DC3 in system 2 is clearly the most stable. The RMSF values of the AR residues in the four systems were compared with those of the apo-AR system (AR only) to determine whether the binding of DC3 affects the stability of AR. As shown in **Figure 4B**, the overall RMSF of system 2 is lower than that of the apo-AR system, especially for residues 840-870 (corresponding to H9, loop 843-849, H10) and 880-905 (corresponding to H10, H11, loop 888-893, H12). Moreover, only in the apo-AR system and system 2 were the RMSF values of all residues below 10 Å, whereas residues in the other systems showed clearly larger conformational changes than in the apo-AR system. These results demonstrate that the binding of DC3 to AR at site 2 (corresponding to Helix 11, loop 888-893, Helix 12) could stabilize the androgen receptor, whereas DC3 binding at the other sites could visibly reduce AR stability. These RMSF analyses indicate that site 2 is the most probable site for DC3 binding to AR.
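For reference, the RMSF of an atom is the square root of the time-averaged squared displacement from its mean position. The sketch below is a minimal illustration under the assumption that the frames are already superposed on a common reference. The function name `rmsf` and the toy data are hypothetical and unrelated to the trajectories analyzed here.

```python
import math


def rmsf(trajectory):
    """Per-atom RMSF from a list of frames.

    trajectory[f][a] is the (x, y, z) position of atom a in frame f.
    Assumption: frames are already aligned to a common reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for a in range(n_atoms):
        mean = [sum(frame[a][k] for frame in trajectory) / n_frames for k in range(3)]
        # Mean squared displacement from the average position.
        msd = sum(
            sum((frame[a][k] - mean[k]) ** 2 for k in range(3)) for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


# Hypothetical two-frame example: atom 0 oscillates, atom 1 is static.
toy = [
    [(-1.0, 0.0, 0.0), (3.0, 3.0, 3.0)],
    [(1.0, 0.0, 0.0), (3.0, 3.0, 3.0)],
]
print(rmsf(toy))
```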

#### Interaction Energetic Features

To explore the energetic features of the interaction in the DC3-AR complexes, the MM-GBSA method was employed to calculate the binding free energy of each system. The average binding free energies and the detailed energetic components for the last 50 ns of the parallel trajectories are shown in **Table 2**. The binding free energy of system 2 (−40.94 kcal/mol) is clearly lower than that of system 1 (−33.49 kcal/mol), system 3 (−18.99 kcal/mol), and system 4 (−19.73 kcal/mol). This demonstrates that DC3 has a higher binding affinity to AR in system 2 than in the other systems, which indicates that DC3 has a strong tendency to bind to AR at site 2 and that system 2 is more likely to remain stable. This result agrees with the conclusion obtained from the RMSF analysis. Moreover, the dominant components driving DC3 to bind to AR can be identified by dissecting the binding free energy into its contributing components. The electrostatic interaction (ΔE<sub>ele</sub>) in system 2 (−232.64 kcal/mol) makes a large contribution to the low binding free energy of the whole system, which suggests that significant electrostatic interactions exist between DC3 and AR and contribute greatly to the stability of the system.

MD simulations. DC3 is shown in marine and AR is shown in violet.

#### Dynamic Cross-Correlation Matrices Analysis

Dynamic cross-correlation matrix (DCCM) analysis was then carried out (**Figure 5**) to investigate the correlated conformational motions of the DC3-AR complexes. Here, highly positive regions (colored red and yellow) are associated with strongly correlated motions (residue pairs move in the same direction), while negative regions (colored blue) are associated with strongly anti-correlated motions (residue pairs move in opposite directions). Inspecting the DC3 domains of the four systems shows that relatively stronger correlations exist between the DC3 residues in system 2. Moreover, compared with the other systems, clearly more correlated and anti-correlated motions between DC3 residues and AR residues are found in system 2. These differences indicate that there were more and stronger cross-correlated motions between residues in system 2, which points to a more intense interaction and better stability of this DC3-AR complex.

#### Hydrogen Bonds Analysis

The stronger cross-correlation between DC3 and AR residues found in system 2 might also be due to the formation of hydrogen bonds during the MD simulations. Hydrogen bonds, as critical indicators of nonbonding interactions, play vital roles in the protein-ligand recognition process (Ramírez and Caballero, 2016). The number of hydrogen bonds formed between DC3 and AR was calculated as a function of simulation time and plotted in **Figure 6**. Although 0.35 nm was set as the hydrogen bond distance criterion in this study, almost all of the calculated hydrogen bond distances were around 3 Å, so long-distance hydrogen bonds were not present. As shown in this figure, the hydrogen bond interaction pattern formed in system 2 remained constant during the entire simulation. In the other systems, by contrast, hydrogen bonds were unstable and most of them disappeared at about 90 ns; even in the last 50 ns of simulation, the number of hydrogen bonds still fluctuated considerably. This result reflects the stability of system 2 and further supports the conclusion that DC3 tends to bind to AR at site 2, as the previous RMSF and binding free energy analyses indicated. In addition, the intermolecular hydrogen bonds formed in the last 50 ns of the MD simulations with an occupancy of more than 10% are listed in **Table 3**. Many more hydrogen bonds are stably formed in system 2. On the one hand, 13 hydrogen bonds have an occupancy above 10% in system 2, compared with only four in system 1, one in system 3, and none in system 4. On the other hand, the highest hydrogen bond occupancies are 23.51 and 12.55% in system 1 and system 3, respectively, while in system 2 the hydrogen bonds formed between K669-H885, D890-R666, S900-T647, and D890-E643 have occupancies of 85.23, 74.54, 72.49, and 64.38%, respectively.

TABLE 1 | The selected top four possible binding sites of DC3-AR complex and the corresponding poses scoring.

Based on these data, it can be concluded that site 2 (H11, loop 888-893, H12) is the most probable site for DC3 binding to AR.

## Exploration of the Binding Mechanism Between DC3 and AR Site 2

To fully explore the binding modes and interaction mechanisms of DC3 and AR, system 2 was further studied to reveal the complicated binding mechanism of DC3-AR complex.

#### Root Mean Square Fluctuation Analysis

The RMSF values of the DC3 residues and AR residues shown in **Figures 4C,D**, respectively, make it easy to identify the key residues that dominate the binding process of the DC3-AR complex. In DC3, residues 645-647 and 666 have low RMSF values and underwent only minor fluctuations, which indicates that these residues were relatively stable during the simulation. Comparing the AR residues in system 2 with the apo-AR system shows that the RMSF values of residues 840-870 (corresponding to H9, loop 843-849, H10) and 790-800 (corresponding to H7, loop 797-800) were lower in system 2 than in apo-AR, which reveals the role of these residues in maintaining the stability of the system. Moreover, the residues of binding site 2 (residues 880-903, corresponding to H11, loop 888-893, H12) also exhibited clearly lower fluctuations than in the apo-AR system, and the binding site has become one of the most stable regions in system 2. These results not only indicate the dominant role of these residues but also suggest that specific interactions formed between residues 880-903 and the DC3 residues, which constrained their mobility and stabilized the whole system.

TABLE 2 | Binding free energy and the detailed energetic contribution components of four systems of DC3-AR complex averaged by the last 50 ns of parallel trajectories (kcal/mol).

#### Clustering Analysis

The representative conformation of the DC3-AR complex during the MD simulation was extracted by clustering analysis. The first parallel trajectory of system 2 was grouped into 4 clusters based on conformational similarity. The most populated cluster contained 14,068 frames, which accounted for 56.27% of all frames extracted from the last 50 ns of the MD simulation. The conformation with the lowest RMSD value in the most populated cluster was then defined as the representative structure. The representative structure of the DC3-AR complex extracted from the MD trajectory and the initial structure obtained from molecular docking are plotted in **Figures 7A,B**, respectively, to show the interaction of the key residues visually. These two figures show that the residue pairs K669-H885, R666-D890, E643-D890, and R666-E893 were clearly drawn closer to each other during the MD simulations. This indicates that significant interactions might have formed between these residues, which further stabilized the complex, in agreement with the RMSF data. Similarly, the closeness of residues L654-I882, V657-I882, and L658-I882 in the MD simulation can easily be observed in **Figures 7C,D**. As residues L654, V657, L658, and I882 are nonpolar amino acids, this suggests that specific hydrophobic interactions may have formed, which needs to be further validated.
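As a consistency check on the cluster population quoted above, and assuming that the clustering used the same 2 ps sampling interval stated for the other analyses of the last 50 ns (the interval used for clustering is not stated explicitly), the frame count works out as follows:

$$N\_{total} = \frac{50\ \text{ns}}{2\ \text{ps}} = \frac{50{,}000\ \text{ps}}{2\ \text{ps}} = 25{,}000, \qquad \frac{14{,}068}{25{,}000} \times 100\% \approx 56.27\%$$

Under this assumption, the reported share of 56.27% agrees with the reported number of frames in the most populated cluster.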

#### Interaction Surface Exploration and Free Energy Decomposition

To understand the binding mechanism of the key residues in the binding process of the DC3-AR complex, hydrophobic and electrostatic interaction surfaces of the representative structure were generated. The hydrophobic interaction surface is plotted in **Figure 8A**, where dodger blue represents the hydrophobicity minimum, gray depicts a hydrophobicity of 0, and orange represents the largest hydrophobicity. The binding site of AR indeed exhibits strong hydrophobic interactions with DC3, which promoted the recognition and binding of ligand and receptor to a certain extent. Furthermore, two highly hydrophobic interaction domains (deep orange) can be found, formed by V649, L650, L651-V901 and by L654, V657, L658-L880, L881, I882, respectively, which played dominant roles in the development of the hydrophobic interaction.

The electrostatic interaction surface is shown in **Figure 8B**. Most DC3 residues carry a positive charge and together form a positively charged surface (blue) at the interface, whereas a certain number of negatively charged residues (red) are present in the binding site of AR. These oppositely charged residues at the interface attract each other and make a large contribution to the binding process. To investigate the energetic contribution, and especially the electrostatic contribution, of the key residues in more depth, free energy decomposition was performed based on the last 50 ns of the MD simulations of system 2. The energy contributions of the DC3 and AR residues are shown in **Figures 9A,B**, and the electrostatic contributions of the DC3 and AR residues are depicted in **Figures 9C,D**, respectively. The energy contributions show that residue R666 made a particularly large contribution to the free energy, which demonstrates that the interactions between R666 and AR residues played essential roles in the DC3-AR binding process. Meanwhile, residues H885, D890, and S900 dominated the energetic contribution. These results show that residues of both DC3 and AR in the binding site contribute to the decrease in free energy. Combined with the hydrogen bond analysis above, it can be concluded that the hydrogen bonds formed between K669-H885, D890-R666, S900-T647, and D890-E643 are especially critical components of the interaction between DC3 and AR. Based on all the results above, we conclude that K665 and R666 of DC3 and E706, E709, D890, E893, and E897 of AR make large contributions to the binding process.

#### Distance Analysis

To further validate the interactions formed between the key residues and to investigate how they formed, the distances between the key residue pairs mentioned above were calculated as a function of simulation time and plotted in **Figure 6B**. First, the distances of the residue pairs K669-H885, L654-I882, V657-I882, and L658-I882 decreased markedly at about 40 ns and remained stable for the rest of the simulation. Significant conformational changes of the binding site can be inferred from this distance variation. This result confirms the formation of a hydrogen bond between K669 and H885 and of hydrophobic interactions between L654-I882, V657-I882, and L658-I882. In addition, the residue pairs E643-D890, T647-S900, R666-D890, and R666-E893 stayed very close (about 5 Å) and remained stable throughout the simulation, and some of them even reached about 2.5 Å in the last 50 ns. Moreover, the hydrogen bond plot in **Figure 6A** shows that certain hydrogen bonds formed at about 40 ns. All these results support the stable existence of hydrogen bonds and electrostatic interactions during the MD simulations.

#### Cross-Correlation Networks

To characterize and visualize the underlying dynamical cross-correlations among different parts of the DC3-AR complex, the overall cross-correlation networks of system 2 and the apo-AR system were constructed. As shown in **Figure 10A**, large numbers of correlations and anti-correlations coexisted throughout the apo-AR system, which made it hard to identify a specific cross-correlation pattern; in other words, the interactions between different protein domains were mixed and disordered. However, when DC3 was bound to AR (**Figure 10B**), all anti-correlations among different AR regions decreased or disappeared, which indicates improved stability of the system. In addition, organized anti-correlations developed between DC3 and AR regions (loop 687-690, H3, loop 722-725, loop 727-730, H4, loop 822-825, H9, H11, loop 888-893, H12). This reflects opposite directions of movement of these regions and DC3, that is, they moved toward each other during the simulation. Moreover, distinct correlations also formed between DC3 and some AR regions (H9, loop 843-849, H10). Together with the earlier observation that residues in these regions had low RMSF values, it can be concluded that these correlation patterns decreased residue fluctuations and enhanced the stability of the system.

# CONCLUSION

In this work, the three-dimensional structure of the cyclopeptide DC3 was first constructed by homology modeling using 2kux.1.A as the template. Molecular docking was then carried out to predict the possible binding sites and preferred orientation of DC3 on AR. Finally, four systems with the best docking scores from the top four clusters were selected for 150 ns all-atom molecular dynamics (MD) simulations. The MM/GBSA method and a series of MD trajectory analyses were subsequently conducted. The analyses of RMSF, binding free energy, DCCM, and hydrogen bonds indicated that DC3 showed a higher binding affinity to AR at site 2 (corresponding to H10, H11, loop 888-893, H12) and that this system was clearly more stable than the other systems. In addition, many more intermolecular hydrogen bonds were persistently formed in system 2 with high occupancy. Stronger cross-correlation among DC3 residues and stronger anti-correlation between DC3 and AR residues also existed here. These results suggest that DC3 is most likely to bind to AR at site 2, encompassed by H10, H11, loop 888-893, and H12. Subsequently, the combined analyses of free energy decomposition, interaction surfaces, distances, and the cross-correlation network showed that hydrogen bonds, hydrophobic interactions, and electrostatic interactions play dominant roles in the recognition and binding of the DC3-AR complex. Hydrogen bonds frequently existed between K669-H885, D890-R666, S900-T647, and D890-E643. In addition, K665 and R666 of DC3 and E706, E709, D890, E893, and E897 of AR made large contributions to the electrostatic interaction. V649, L650, L651-V901 and L654, V657, L658-L880, L881 play essential parts in the hydrophobic interaction. These results elucidate the detailed interaction mechanism of the DC3-AR complex and the key residues that dominate specific interactions. These findings will significantly facilitate our understanding of the mode of action of DC3 on AR at the molecular level and contribute to the future rational design of cyclopeptide drugs for prostate cancer.

# AUTHOR CONTRIBUTIONS

JL: conceived and coordinated the study; HZ and TS: did the homology modeling, molecular docking, molecular dynamics simulations, and they wrote the paper; YY and CF: helped to do molecular simulations. All authors analyzed the results and approved the final version of the manuscript.

# ACKNOWLEDGMENTS

This research is supported by the Fundamental Research Funds for the Central Universities (lzujbky-2017-203). We would like to thank Prof. Yuguang Mu and Dr. Liangzhen Zheng from School of Biological Sciences of Nanyang Technological University for helping us to treat the data for the figure of cross-correlation network analysis.

#### Zhang et al. Mechanism Between DC3 and AR

#### REFERENCES

Accelrys Inc. CA (2009). Discovery Studio Version 2.5.


ranking poses generated from docking. J. Comput. Chem. 32, 866–877. doi: 10.1002/jcc.21666


tuberculosis protein kinase b inhibitors from molecular dynamics simulation. Chem. Biol. Drug Des. 86, 91–101. doi: 10.1111/cbdd.12465


and transcription activation. Biochemistry 31, 2393–2399. doi: 10.1021/bi00123a026


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhang, Song, Yang, Fu and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Conceptual DFT Study of the Local Chemical Reactivity of the Colored BISARG Melanoidin and Its Protonated Derivative

#### Juan Frau<sup>1</sup> and Daniel Glossman-Mitnik 1,2 \*

<sup>1</sup> Departament de Química, Universitat de les Illes Balears, Palma de Mallorca, Spain, <sup>2</sup> Laboratorio Virtual NANOCOSMOS, Centro de Investigación en Materiales Avanzados, Departamento de Medio Ambiente y Energía, Chihuahua, Mexico

This computational study assessed eight fixed RSH (range-separated hybrid) density functionals, namely CAM-B3LYP, LC-ωPBE, M11, MN12SX, N12SX, ωB97, ωB97X, and ωB97XD, together with the Def2TZVP basis set and the SMD solvation model, in the calculation of the molecular structure and reactivity properties of the BISARG intermediate melanoidin pigment (5-(2-(E)-(Z)-5-[(2-furyl)methylidene]-3-(4-acetylamino-4-carboxybutyl)-2-imino-1,3-dihydroimidazol-4-ylideneamino(E)-4-[(2-furyl)methylidene]-5-oxo-1H-imidazol-1-yl)-2-acetylaminovaleric acid) and its protonated derivative, BISARG(p). The chemical reactivity descriptors for the systems were calculated via Conceptual Density Functional Theory. The active sites for nucleophilic, electrophilic, and radical attacks were identified by means of the Fukui function indices, the electrophilic and nucleophilic Parr functions, and the condensed Dual Descriptor Δf(r). The study found the MN12SX and N12SX density functionals to be the most appropriate for predicting the chemical reactivity of the molecular systems under study starting from the knowledge of the HOMO, LUMO, and HOMO-LUMO gap energies.

#### Edited by:

Sam P. De Visser, University of Manchester, United Kingdom

#### Reviewed by:

Carles Curutchet, Universitat de Barcelona, Spain

Mark Earl Casida, Université Grenoble Alpes, France

\*Correspondence:

Daniel Glossman-Mitnik daniel.glossman@cimav.edu.mx

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 14 February 2018 Accepted: 09 April 2018 Published: 01 May 2018

#### Citation:

Frau J and Glossman-Mitnik D (2018) Conceptual DFT Study of the Local Chemical Reactivity of the Colored BISARG Melanoidin and Its Protonated Derivative. Front. Chem. 6:136. doi: 10.3389/fchem.2018.00136

Keywords: BISARG, conceptual DFT, chemical reactivity, dual descriptor, Parr functions

# 1. INTRODUCTION

Visual color in processed foods is largely due to colored products of Maillard or nonenzymic browning reactions. In spite of the longstanding aesthetic and practical interest in Maillard derived food coloring materials, relatively little is known about the chemical structures responsible for visual color (Rizzi, 1997). These chemical structures are known as Colored Maillard Reaction Products and can be isolated at intermediate stages during the melanoidin formation process.

Besides being of interest as dye molecules, which may be useful as food additives and as dyes for dye-sensitized solar cells (DSSC), these compounds also have antioxidant capabilities. Thus, they are amenable to study through the analysis of their molecular reactivity properties.

One of these isolated molecules is known by the acronym BISARG. Together with its protonated derivative, BISARG(p), it has been studied experimentally as part of a work on the formation of melanoidins (Hofmann, 1998). We believe that it is of interest to study the molecular reactivity of both species using the ideas of Conceptual DFT, in the same way as in our previous works (Alvarado-González et al., 2013; Cervantes-Navarro and Glossman-Mitnik, 2013; Glossman-Mitnik, 2013a,b; Martínez-Araya et al., 2013a,b; Salgado-Morán et al., 2013; Glossman-Mitnik, 2014a,b,c,d; Martínez-Araya and Glossman-Mitnik, 2015; Martínez-Araya et al., 2015; Soto-Rojo et al., 2015; Frau et al., 2016a,b,c; Mendoza-Huízar et al., 2016; Frau et al., 2017a,b,c,d,e; Frau and Glossman-Mitnik, 2017a,b,c,d,e,f,g; Sastre et al., 2017).

Interest in range-separated (RS) exchange-correlation functionals in KS DFT is on the rise (Gledhill et al., 2016). These functionals partition the r<sub>12</sub><sup>−1</sup> operator in the exchange term into long- and short-range parts, with a range-separation parameter ω that controls how quickly the long-range behavior is attained. The value of ω can be fixed, or it can be "tuned" non-empirically on a system-by-system basis by minimizing a tuning norm. Optimal tuning rests on the fact that, in exact KS as well as generalized KS theory, the HOMO energy of an N-electron system, ε<sub>H</sub>(N), should be exactly −IP(N), where IP is the vertical ionization potential calculated for a particular functional from the energy difference E(N−1) − E(N). With approximate functionals, ε<sub>H</sub>(N) and −IP(N) can differ considerably. Optimal tuning therefore consists of determining a system-specific range-separation parameter ω (and, optionally, several other parameters) non-empirically in an RS functional so that ε<sub>H</sub>(N) = −IP(N) is satisfied as closely as possible (Jacquemin et al., 2014). Although no equivalent prescription relates the electron affinity (EA) to the LUMO of the neutral species, one can write ε<sub>H</sub>(N+1) = −EA(N); that is, the electron affinity of the neutral system equals minus the HOMO energy of the anion (its SOMO). This provides a second condition, and ω can then be optimized to satisfy both conditions simultaneously. Some concerns were raised during the preparation of this paper regarding the validity of the ionization potential (IP) theorem within Generalized Kohn-Sham (GKS) theory. However, it must be stressed that Baer et al. (2010) and, more recently, Baerends et al. (2013) and Karolewski et al. (2013) have given arguments that the same criterion applies in GKS theories and with hybrid and range-separated hybrid functionals. This will make it easy to predict the Conceptual DFT descriptors. In the past, the simultaneous prescription has been referred to as the "KID procedure" (for Koopmans in DFT), by analogy with Koopmans' theorem in Hartree-Fock theory. The SOMO energy of the anion will not, in general, be equal to the LUMO energy of the neutral system, but if the difference between them, which we have called ΔSL, is small enough to be negligible for predictions of the Conceptual DFT descriptors, then the practical KID procedure will have computational support.
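The optimal-tuning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuning norm is taken as the sum of the squared deviations from the two conditions ε<sub>H</sub>(N) = −IP(N) and ε<sub>H</sub>(N+1) = −EA(N), the norm is assumed to have a single minimum in the search interval, and `evaluate` is a hypothetical user-supplied callable that would wrap the actual electronic-structure calculations for a given ω.

```python
import math
from typing import Callable, Tuple

# (eps_HOMO of neutral, vertical IP, eps_HOMO of anion, vertical EA), all in eV
Evaluation = Tuple[float, float, float, float]


def tuning_norm(eps_h_neutral: float, ip: float,
                eps_h_anion: float, ea: float) -> float:
    """Squared deviation from eps_H(N) = -IP(N) and eps_H(N+1) = -EA(N)."""
    return (eps_h_neutral + ip) ** 2 + (eps_h_anion + ea) ** 2


def tune_omega(evaluate: Callable[[float], Evaluation],
               lo: float, hi: float, tol: float = 1e-5) -> float:
    """Golden-section search for the omega that minimizes the tuning norm.

    `evaluate(omega)` is a hypothetical wrapper around the quantum-chemical
    calculations; the norm is assumed to be unimodal on [lo, hi].
    """
    def objective(omega: float) -> float:
        return tuning_norm(*evaluate(omega))

    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = objective(c)
        else:                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)
```

A fixed-ω functional skips this search entirely, which is why checking how closely it already satisfies both conditions is the relevant test in this work.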

This implies that the suitability of a particular density functional for predicting the Conceptual DFT descriptors directly from the properties of the neutral molecule can be estimated easily: one only needs to check how well it follows the KID procedure. However, optimal tuning is system dependent and must be performed for each molecule separately. It is therefore worthwhile to examine how density functionals with a fixed ω value, which have shown significant accuracy across several databases in physics and chemistry, perform with respect to this practical procedure.

Thus, in this computational study we assess eight density functionals for calculating the molecular structure and properties of the BISARG intermediate melanoidin pigment and its protonated derivative, BISARG(p). Following the same ideas as in previous works, we consider fixed RSH functionals instead of the optimally tuned RSH density functionals, which have attained great success and have also supported the validity of the IP theorem in the context of GKS theory (Stein et al., 2009a,b; Karolewski et al., 2011; Kuritz et al., 2011; Refaely-Abramson et al., 2011; Foster and Wong, 2012; Koppen et al., 2012; Kronik et al., 2012; Phillips et al., 2012a,b; Karolewski et al., 2013; Moore and Autschbach, 2013; Egger et al., 2014; Foster et al., 2014; Jacquemin et al., 2014; Niskanen and Hukka, 2014; Sun and Autschbach, 2014; Manna et al., 2015; Lima et al., 2016; Pereira et al., 2017).

# 2. THEORETICAL BACKGROUND

The theoretical background of this study is similar to that of our previous research and is presented here for completeness, because this work is part of a larger project that is in progress. If we consider the KID procedure mentioned in the Introduction together with a finite difference approximation, then the global reactivity descriptors can be written as:


where ε<sub>H</sub> and ε<sub>L</sub> are the energies of the highest occupied and the lowest unoccupied molecular orbitals (HOMO and LUMO), respectively.
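To illustrate how the KID procedure is used in practice, the following minimal Python sketch evaluates three global descriptors from the frontier orbital energies. It assumes the commonly used finite-difference forms χ ≈ −(ε<sub>H</sub> + ε<sub>L</sub>)/2, η ≈ ε<sub>L</sub> − ε<sub>H</sub>, and ω = χ<sup>2</sup>/(2η); factor-of-two conventions for η and ω differ between authors, so these definitions, like the function and class names, are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalDescriptors:
    electronegativity: float  # chi, eV
    hardness: float           # eta, eV
    electrophilicity: float   # omega, eV


def kid_descriptors(eps_homo: float, eps_lumo: float) -> GlobalDescriptors:
    """Descriptors from orbital energies (eV), i.e. I = -eps_H and A = -eps_L.

    Assumed conventions: chi = (I + A)/2, eta = I - A, omega = chi^2 / (2 eta).
    """
    chi = -0.5 * (eps_homo + eps_lumo)
    eta = eps_lumo - eps_homo
    if eta <= 0.0:
        raise ValueError("LUMO energy must lie above HOMO energy")
    omega = chi ** 2 / (2.0 * eta)
    return GlobalDescriptors(chi, eta, omega)


def vertical_descriptors(ip: float, ea: float) -> GlobalDescriptors:
    """Same descriptors from the vertical I and A (eV)."""
    return kid_descriptors(-ip, -ea)
```

Comparing the output of `kid_descriptors` with that of `vertical_descriptors` for the same molecule is the essence of the checks reported later in the paper.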

Applying the same ideas, the definitions for the local reactivity descriptors are:


where ρ<sub>N+1</sub>(**r**), ρ<sub>N</sub>(**r**), and ρ<sub>N−1</sub>(**r**) are the electronic densities at point **r** for the system with N + 1, N, and N − 1 electrons, respectively, and ρ<sub>s</sub><sup>rc</sup>(**r**) and ρ<sub>s</sub><sup>ra</sup>(**r**) are related to the atomic spin density (ASD) at the **r** atom of the radical cation or anion of a given molecule, respectively (Domingo et al., 2016).

# 3. SETTINGS AND COMPUTATIONAL METHODS

Following the lines of our previous works, the computational studies were performed with the Gaussian 09 (Frisch et al., 2018) series of programs with density functional methods as implemented in the computational package. The basis set used in this work was Def2SVP for geometry optimization and frequencies, while Def2TZVP was considered for the calculation of the electronic properties (Weigend and Ahlrichs, 2005; Weigend, 2006). All the calculations were performed in the presence of water as the solvent by doing Integral Equation Formalism-Polarized Continuum Model (IEF-PCM) computations according to the SMD solvation model (Marenich et al., 2009).

For the calculation of the molecular structure and properties of the studied systems, we have chosen eight density functionals which are known to consistently provide satisfactory results for several structural and thermodynamic properties: CAM-B3LYP, LC-ωPBE, M11, MN12SX, N12SX, ωB97, ωB97X, and ωB97XD.


In these functionals, GGA stands for generalized gradient approximation (in which the density functional depends on the up and down spin densities and their reduced gradient) and NGA stands for nonseparable gradient approximation (in which the density functional depends on the up/down spin densities and their reduced gradient, and also adopts a nonseparable form).

# 4. RESULTS AND DISCUSSION

The three-dimensional molecular structure of the BISARG system was built with the aid of a molecular graphics program, starting from the structure presented in the original article (Hofmann, 1998). From this, the molecular structure of its protonated derivative, BISARG(p), was built with chemical visualization software. Both systems were pre-optimized by random sampling of the torsional angles with molecular mechanics techniques, using the general MMFF94 force field (Halgren, 1996a,b,c, 1999; Halgren and Nachbar, 1996) through the Marvin View 17.15 program, an advanced chemical viewer suited to single and multiple chemical queries, structures and reactions (https://www.chemaxon.com). Afterwards, the resulting lower-energy conformers of both molecules were reoptimized using the eight density functionals mentioned in the previous section together with the Def2SVP basis set and the SMD solvation model with water as the solvent.

The results were analyzed to verify that the KID procedure was fulfilled. As in our previous work, several descriptors were used that relate the results obtained from the HOMO and LUMO calculations to those obtained for the vertical I and A with the ΔSCF procedure. The three main descriptors test the simplest conformity to Koopmans' theorem by linking ε<sub>H</sub> with −I, ε<sub>L</sub> with −A, and their combined behavior in describing the HOMO-LUMO gap: J<sub>I</sub> = |ε<sub>H</sub> + E<sub>gs</sub>(N−1) − E<sub>gs</sub>(N)|, J<sub>A</sub> = |ε<sub>L</sub> + E<sub>gs</sub>(N) − E<sub>gs</sub>(N+1)|, and J<sub>HL</sub> = √(J<sub>I</sub><sup>2</sup> + J<sub>A</sub><sup>2</sup>). Notably, the J<sub>A</sub> descriptor is an approximation that remains valid only when the HOMO of the radical anion (the SOMO) is similar to the LUMO of the neutral system. Consequently, we introduced another descriptor, ΔSL (the difference between the SOMO and LUMO energies), to check how accurate this approximation is.
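The descriptors defined above are simple to evaluate once the three total energies and the three orbital energies are available. The sketch below is a minimal illustration, assuming total energies in hartree and orbital energies in eV (as in the tables) and taking ΔSL as the absolute SOMO-LUMO difference; the function name and the returned dictionary keys are illustrative choices.

```python
import math

HARTREE_TO_EV = 27.211386245988  # CODATA 2018 conversion factor


def koopmans_descriptors(e_neutral: float, e_cation: float, e_anion: float,
                         eps_homo: float, eps_lumo: float,
                         eps_somo: float) -> dict:
    """J_I, J_A, J_HL and Delta-SL (all in eV).

    Total energies are in hartree; orbital energies are in eV.
    Delta-SL is taken here as |eps_SOMO - eps_LUMO| (an assumption).
    """
    ip = (e_cation - e_neutral) * HARTREE_TO_EV   # vertical I from Delta-SCF
    ea = (e_neutral - e_anion) * HARTREE_TO_EV    # vertical A from Delta-SCF
    j_i = abs(eps_homo + ip)
    j_a = abs(eps_lumo + ea)
    return {
        "J_I": j_i,
        "J_A": j_a,
        "J_HL": math.hypot(j_i, j_a),
        "Delta_SL": abs(eps_somo - eps_lumo),
    }
```

Values of J<sub>I</sub>, J<sub>A</sub>, and J<sub>HL</sub> close to zero indicate that the functional behaves in a Koopmans-compliant way for the molecule at hand.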

The results of the calculation of the electronic energies of the neutral, positive and negative molecular systems (in au) of BISARG and BISARG(p), the HOMO, LUMO, and SOMO orbital energies (in eV), and the J<sub>I</sub>, J<sub>A</sub>, J<sub>HL</sub>, and ΔSL descriptors calculated with the eight density functionals and the Def2TZVP basis set using water as solvent simulated with the SMD parametrization of the IEF-PCM model are presented in **Tables 1, 2**.

As presented in previous works, we considered four other descriptors that assess how well the studied density functionals predict the electronegativity χ, the global hardness η, and the global electrophilicity ω, and a combination of these Conceptual DFT descriptors, when they are obtained either from the energies of the HOMO and LUMO or from the vertical I and A: J<sub>χ</sub> = |χ − χ<sub>K</sub>|, J<sub>η</sub> = |η − η<sub>K</sub>|, J<sub>ω</sub> = |ω − ω<sub>K</sub>|, and J<sub>CDFT</sub> = √(J<sub>χ</sub><sup>2</sup> + J<sub>η</sub><sup>2</sup> + J<sub>ω</sub><sup>2</sup>), where CDFT stands for Conceptual DFT. The subscript K denotes the descriptor calculated by applying the KID procedure. The results of the calculations of J<sub>χ</sub>, J<sub>η</sub>, J<sub>ω</sub>, and J<sub>CDFT</sub> for the low-energy conformers of BISARG and BISARG(p) in water are displayed in **Tables 3, 4**, respectively.
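The four deviations just defined can be combined in a short helper. The following sketch assumes that χ, η, and ω have already been computed twice, once from the vertical I and A and once from the orbital energies (the K values); the function name and tuple layout are illustrative.

```python
import math
from typing import Sequence, Tuple


def cdft_deviations(vertical: Sequence[float],
                    kid: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (J_chi, J_eta, J_omega, J_CDFT) in eV.

    Both arguments are (chi, eta, omega) triples in eV: `vertical` from the
    vertical I and A, `kid` from the HOMO and LUMO energies.
    """
    if len(vertical) != 3 or len(kid) != 3:
        raise ValueError("expected (chi, eta, omega) triples")
    j_chi, j_eta, j_omega = (abs(v - k) for v, k in zip(vertical, kid))
    j_cdft = math.sqrt(j_chi ** 2 + j_eta ** 2 + j_omega ** 2)
    return j_chi, j_eta, j_omega, j_cdft
```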

As **Tables 1**–**4** show, the KID procedure is followed accurately by the MN12SX and N12SX density functionals, which are range-separated hybrid meta-NGA and range-separated hybrid NGA density functionals, respectively. The values of J<sub>I</sub>, J<sub>A</sub>, and J<sub>HL</sub> are not exactly zero; nevertheless, the results are impressive, especially for the MN12SX density functional. Likewise, the ΔSL descriptor reaches its minimum values when the MN12SX and N12SX density functionals are used in the calculations. This implies that there is sufficient justification to assume that the LUMO of the neutral approximates the electron affinity. The same density functionals also follow the KID procedure for the remaining descriptors, J<sub>χ</sub>, J<sub>η</sub>, J<sub>ω</sub>, and J<sub>CDFT</sub>.

TABLE 1 | Electronic energies of the neutral, positive and negative molecular systems (in au) of the BISARG molecule, the HOMO, LUMO, and SOMO orbital energies (in eV), and the J<sub>I</sub>, J<sub>A</sub>, J<sub>HL</sub>, and ΔSL descriptors (also in eV) calculated with the eight RSH density functionals and the Def2TZVP basis set using water as solvent simulated with the SMD parametrization of the IEF-PCM model.

TABLE 2 | Electronic energies of the neutral, positive and negative molecular systems (in au) of the protonated BISARG(p) molecule, the HOMO, LUMO, and SOMO orbital energies (in eV), and the J<sub>I</sub>, J<sub>A</sub>, J<sub>HL</sub>, and ΔSL descriptors (also in eV) calculated with the eight RSH density functionals and the Def2TZVP basis set using water as solvent simulated with the SMD parametrization of the IEF-PCM model.

TABLE 3 | J<sub>χ</sub>, J<sub>η</sub>, J<sub>ω</sub>, and J<sub>CDFT</sub> (in eV) of the BISARG intermediate melanoidin pigment.

TABLE 4 | J<sub>χ</sub>, J<sub>η</sub>, J<sub>ω</sub>, and J<sub>CDFT</sub> (in eV) of the protonated BISARG(p) intermediate melanoidin pigment.

Having verified that the MN12SX/Def2TZVP model chemistry is a good choice for the calculation of the global reactivity descriptors, we now present the optimized molecular structures of BISARG and BISARG(p) in water in Supplementary Figures 1, 2. Meanwhile, the calculated bond lengths and bond angles for both cases are shown in Supplementary Tables 1–4.

As a summary of the previous results, the global reactivity descriptors for the BISARG and BISARG(p) molecules calculated with the MN12SX/Def2TZVP model chemistry in water are presented in **Table 5**.

The calculations of the condensed Fukui functions and dual descriptor are done by using the Chemcraft molecular analysis program to extract the Mulliken and NPA atomic charges (Zhurko and Zhurko, 2012) beginning with single-point energy calculations involving the MN12SX density functional that uses the Def2TZVP basis set in line with the SMD solvation model, and water utilized as the solvent.
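Once the atomic charges of the N − 1, N, and N + 1 electron systems have been extracted, the condensed descriptors follow from simple differences. The sketch below assumes the usual charge-based finite-difference definitions, f<sub>k</sub><sup>+</sup> = q<sub>k</sub>(N) − q<sub>k</sub>(N+1) and f<sub>k</sub><sup>−</sup> = q<sub>k</sub>(N−1) − q<sub>k</sub>(N), with the dual descriptor Δf<sub>k</sub> = f<sub>k</sub><sup>+</sup> − f<sub>k</sub><sup>−</sup> and the radical Fukui function taken as their average; the function name and list-based input are illustrative, and the charges would come from the population analysis of choice.

```python
from typing import Dict, List, Sequence


def condensed_fukui(q_neutral: Sequence[float],
                    q_anion: Sequence[float],
                    q_cation: Sequence[float]) -> Dict[str, List[float]]:
    """Condensed Fukui functions and dual descriptor from atomic charges.

    All three sequences list the charge of each atom in the same order, for
    the N, N+1 and N-1 electron systems at the geometry of the neutral.
    """
    if not (len(q_neutral) == len(q_anion) == len(q_cation)):
        raise ValueError("charge lists must have the same length")
    f_plus = [qn - qa for qn, qa in zip(q_neutral, q_anion)]    # nucleophilic attack
    f_minus = [qc - qn for qc, qn in zip(q_cation, q_neutral)]  # electrophilic attack
    return {
        "f_plus": f_plus,
        "f_minus": f_minus,
        "f_zero": [0.5 * (p + m) for p, m in zip(f_plus, f_minus)],  # radical attack
        "dual": [p - m for p, m in zip(f_plus, f_minus)],
    }
```

With this sign convention, atoms with a positive dual descriptor are the sites favored for nucleophilic attack, and atoms with a negative value are favored for electrophilic attack.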

Considering the potential application of the studied molecules as antioxidants, it is of interest to gain insight into the active sites for radical attack. Graphical representations of the radical Fukui function f<sup>0</sup> calculated with the MN12SX/Def2TZVP model chemistry for both systems in water are presented in **Figures 1, 2**.

TABLE 5 | Global reactivity descriptors for the BISARG intermediate melanoidin pigment and its protonated derivative BISARG(p) calculated with the MN12SX density functional.

The condensed electrophilic and nucleophilic Parr functions P<sub>k</sub><sup>+</sup> and P<sub>k</sub><sup>−</sup> over the atoms of the BISARG and BISARG(p) molecules in water have been calculated by extracting the Mulliken and Hirshfeld (or CM5) atomic spin densities using the Chemcraft molecular analysis program (Zhurko and Zhurko, 2012), starting from single-point energy calculations of the ionic species with the MN12SX density functional and the Def2TZVP basis set in the presence of the solvent according to the SMD solvation model.

The results for the condensed dual descriptor calculated with Mulliken atomic charges Δf<sub>k</sub>(M) and with NPA atomic charges Δf<sub>k</sub>(N), the electrophilic and nucleophilic Parr functions with Mulliken atomic spin densities P<sub>k</sub><sup>+</sup>(M) and P<sub>k</sub><sup>−</sup>(M), and the electrophilic and nucleophilic Parr functions with Hirshfeld (or CM5) atomic spin densities P<sub>k</sub><sup>+</sup>(H) and P<sub>k</sub><sup>−</sup>(H) are displayed in **Tables 6, 7** for the BISARG and BISARG(p) molecules in water, respectively, while **Figures 3, 4** show schematic representations of the molecules with the numbering of the most important reactive sites according to the results in **Tables 6, 7**.

TABLE 6 | The condensed dual descriptor calculated with Mulliken atomic charges Δf<sub>k</sub>(M) and with NPA atomic charges Δf<sub>k</sub>(N), the electrophilic and nucleophilic Parr functions with Mulliken atomic spin densities P<sub>k</sub><sup>+</sup>(M) and P<sub>k</sub><sup>−</sup>(M), and the electrophilic and nucleophilic Parr functions with Hirshfeld (or CM5) atomic spin densities P<sub>k</sub><sup>+</sup>(H) and P<sub>k</sub><sup>−</sup>(H) for the BISARG melanoidin molecule.

Hydrogens and atomic sites where the absolute value of the dual descriptor is


From the results for the local reactivity descriptors in **Table 6**, it can be concluded that C2 and C23 will be the preferred sites for a nucleophilic attack and that these atoms will act as electrophilic species in a chemical reaction. In turn, N8 will be prone to electrophilic attack and will act as a nucleophilic species in chemical reactions that involve the BISARG molecule in water. For the BISARG(p) molecule in water, C2 and C17 will be the preferred sites for a nucleophilic attack while C15 and C29 will be the sites for electrophilic reactions.

# 5. CONCLUSIONS

Eight fixed RSH density functionals (CAM-B3LYP, LC-ωPBE, M11, MN12SX, N12SX, ωB97, ωB97X, and ωB97XD) were examined to determine whether they fulfill the empirical KID procedure, so as to provide computational support for this common practice. The assessment was conducted by comparing the values from HOMO and LUMO calculations with those generated by the ΔSCF technique for the BISARG molecule and its protonated derivative, BISARG(p). BISARG and BISARG(p) are intermediate melanoidin pigments of academic and industrial interest. The study found that the range-separated hybrid NGA and meta-NGA density functionals (N12SX and MN12SX) are the best suited to meeting this goal. Thus, they can be suitable alternatives to density functionals whose behavior is optimally tuned using a gap-fitting procedure. They are also promising for future studies aimed at understanding the chemical reactivity of the colored melanoidins of larger molecular weight that form when reducing sugars react with proteins and peptides.

It is not the goal of Computational Chemistry to perform studies that reproduce known experimental results, except when they can be used for the calibration of a particular technique. Instead, it can be useful for predicting in advance the structural and chemical reactivity characteristics of new or unknown molecular systems whose properties have not been reported, and as a guide for future research. As far as we know, there are no reports in the literature about the chemical reactivity properties of the molecular systems considered in this work, and it is therefore not possible to perform any kind of comparison. However, the present study shows that with an adequate choice of the model chemistry we have been able to predict the sites of interaction of the BISARG and BISARG(p) molecules with impressive accuracy starting from the knowledge of the HOMO, LUMO, and HOMO-LUMO gap energies of the studied systems. This involves DFT-based reactivity descriptors, including Fukui functions, Parr functions, and Dual Descriptor calculations. In conclusion, the Conceptual DFT descriptors are useful in characterizing and describing the preferred reactive sites and in comprehensively explaining the reactivity of the molecules.

# AUTHOR CONTRIBUTIONS

DG-M conceived and designed the research and headed, wrote and revised the manuscript, while JF contributed to the writing and the revision of the article.

# REFERENCES


#### ACKNOWLEDGMENTS

This work has been partially supported by CIMAV, SC, and Consejo Nacional de Ciencia y Tecnología (CONACYT, Mexico) through Grant 219566-2014 for Basic Science Research. DG-M conducted this work while a Visiting Lecturer at the University of the Balearic Islands from which support is gratefully acknowledged. This work was cofunded by the Ministerio de Economía y Competitividad (MINECO) and the European Fund for Regional Development (FEDER) (CTQ2014-55835-R).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00136/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Frau and Glossman-Mitnik. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# QM/MM Investigation of the Role of a Second Coordination Shell Arginine in [NiFe]-Hydrogenases

Andrés M. Escorcia and Matthias Stein\*

Molecular Simulations and Design Group, Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany

[NiFe]-hydrogenases are highly efficient catalysts for the heterolytic splitting of molecular hydrogen (H<sub>2</sub>). The heterobimetallic cysteine-coordinated active site of these enzymes is covered by a highly conserved arginine residue, whose role in the reaction is not fully resolved yet. The structural and catalytic role of this arginine is investigated here using QM/MM calculations with various exchange-correlation functionals. All of them give a very consistent picture of the thermodynamics of H<sub>2</sub> oxidation. The concept of the presence of a neutral arginine and its direct involvement as a Frustrated Lewis Pair (FLP) in the reaction is critically evaluated. The arginine, however, would exist in its standard protonation state and perform a critical role in positioning and slightly polarizing the substrate H<sub>2</sub>. It is not directly involved in the heterolytic processing of H<sub>2</sub> but guides its approach and reduces its flexibility during binding. Upon substitution of the positively charged arginine by a charge-conserving lysine residue, the H<sub>2</sub> binding position remains unaffected. However, critical hydrogen bonding interactions with nearby aspartate residues are lost. In addition, the H<sub>2</sub> polarization is unfavorable and the reduced side-chain volume may negatively affect the kinetics of the catalytic process.

#### Edited by:

Hans Martin Senn, University of Glasgow, United Kingdom

#### Reviewed by:

Etienne Derat, Université Pierre et Marie Curie, France Giampaolo Barone, Università degli Studi di Palermo, Italy

#### \*Correspondence:

Matthias Stein matthias.stein@mpi-magdeburg.mpg.de

#### Specialty section:

This article was submitted to Theoretical and Computational Chemistry, a section of the journal Frontiers in Chemistry

Received: 15 March 2018 Accepted: 23 April 2018 Published: 15 May 2018

#### Citation:

Escorcia AM and Stein M (2018) QM/MM Investigation of the Role of a Second Coordination Shell Arginine in [NiFe]-Hydrogenases. Front. Chem. 6:164. doi: 10.3389/fchem.2018.00164

Keywords: hydrogen conversion, enzyme, QM/MM, amino acid substitution, catalysis

# INTRODUCTION

Hydrogen (H<sub>2</sub>) is among the most important energy carriers in a post-fossil era. The generation of H<sub>2</sub> as a biofuel from sustainable sources is a versatile alternative to the standard generation process from electrolysis of water, which requires elevated temperature and expensive catalyst metals (Holladay et al., 2009; Rodionova et al., 2017). Enzymes from bacteria and microalgae are able to perform the same catalysis at room temperature and standard pressure in the absence of a precious noble metal, and can also catalyze the reverse reaction, the heterolytic cleavage of H<sub>2</sub>. Biological H<sub>2</sub> conversion has attracted much interest owing to its potential application in a post-carbon scenario employing H<sub>2</sub> as an energy storage compound and as a transportable fuel itself (Cammack et al., 2001).

These enzymes, the hydrogenases, are classified according to their active site composition as [NiFe]- and [FeFe]-hydrogenases (see **Figure 1**). The active site of [FeFe]-hydrogenases consists of a µ-carbonyl bridged iron-iron cluster with two additional terminal CO ligands. A bridging azadithiolate ligand acts as an intermediate proton acceptor during formation of H<sub>2</sub>. The "H-cluster" is connected to a [4Fe4S]-cluster via a bridging cysteine amino acid. The azadithiolate nitrogen in [FeFe]-hydrogenase enzymes acts as an initial site of protonation before the product molecule hydrogen is formed upon reacting with a terminal Fe-bound hydride.

Meanwhile, the active site of [NiFe]-hydrogenases involves two terminal and two bridging cysteine residues and three diatomic inorganic ligands at the iron atom (one carbonyl and two cyanides). The [NiFe]-hydrogenases have a bias toward the heterolytic splitting of H<sub>2</sub> into protons and electrons. The oxidation of H<sub>2</sub> makes them usable in a fuel cell ("microbial fuel cell").

H<sub>2</sub> can be used in a fuel cell to generate electric power from its oxidation and the reduction of O<sub>2</sub> to give water. The absence of noble metals and the operating conditions at room temperature make [NiFe]-hydrogenase enzymes a system of interest to scientists and engineers, and they may serve as an inspiration for developing novel bio-inspired catalysts (Du et al., 2007; Cracknell et al., 2008; Santoro et al., 2017). [NiFe]-hydrogenase adsorbed on a pyrolytic graphite electrode catalyzes H<sub>2</sub> oxidation at a diffusion-controlled rate matching that achieved by platinum (Jones et al., 2002).

In [NiFe]-hydrogenases, the "as-isolated" oxidized state contains a hydroxide anion (OH<sup>−</sup>) bridging the Ni(III) and Fe(II) ions. During the process of H<sub>2</sub> activation, nickel shuttles between the Ni(III) and Ni(II) oxidation states, whereas Fe remains redox-inactive in the 2+ oxidation state. The catalytic reaction intermediate, Ni-C, is a Ni(III) Fe(II) species with a µ-bridging hydride, but the exact site that acts as the proton acceptor has not been resolved yet. QM and QM/MM calculations have favored one of the terminal cysteines as the site of protonation (Niu et al., 1999; Lill and Siegbahn, 2009; Hu et al., 2013; Dong and Ryde, 2016; Dong et al., 2018). The ultra-high-resolution X-ray structure of the fully reduced state of the enzyme, Ni-R, indeed reveals a hydride in the bridging position and a protonated terminal cysteine (Ogata et al., 2015b). Recently, however, a non-coordinating amino acid residue was identified to play a major role in H<sub>2</sub> activation by E. coli Hyd-1 (Evans et al., 2016). Substitution of a strictly conserved arginine residue (R509) ∼4.4 Å above the active site nickel (see **Figure 2**) by a charge-conserving lysine led to a >100-fold lower activity in comparison to the wildtype enzyme. This led to the hypothesis that the arginine guanidine group acts as the general base in H<sub>2</sub> activation (Carr et al., 2016). This would require R509 to be at least fractionally deprotonated and neutral, in order to be able to play a functional role similar to that of a frustrated Lewis pair (FLP) (Stephan and Erker, 2010).

At neutral pH, only lysine, arginine and sometimes histidine possess side chains with a positive charge. The pKa value is the pH value at which the deprotonated and protonated forms are in equilibrium, and for arginine a pKa value of 12 is usually given in textbooks (Hunter and Borsook, 1924; Berg et al., 2002). At pH < 12, the guanidine group is protonated and the positive charge is delocalized over the nitrogen atoms Nη1, Nη2, and Nε (**Figure 2**). The protein environment can lead to local deviations of the pKa values of the amino acid side chains due to strong electrostatic interactions with other fully or partially charged groups as well as the polarity or dielectric constant of the medium that surrounds them. The pKa values of catalytic amino acids in or near the active sites of enzymes may be perturbed by more than 2 units due to structural details and the energetics of the reactions that they catalyze (Harris and Turner, 2002). For arginine, however, no detectable shifts in pKa values have ever been reported and, for example, all 25 buried arginine residues in staphylococcal nuclease remained in the charged state (Harms et al., 2011). Significant perturbations of the pKa values of arginine residues were only found in free energy perturbation calculations from MD trajectories of an arginine residue in a highly hydrophobic membrane environment (Yoo and Cui, 2008). Only when the residue was positioned close to the center of the bulk lipid membrane could an effective pKa value of 7.7 be obtained. Thus, significant populations of both the protonated and the neutral forms are only possible near the center of a strongly hydrophobic environment. The protein environment surrounding the conserved arginine in [NiFe]-hydrogenases is, however, far from hydrophobic, and strong (negative) electrostatics from aspartate residues dominate instead.
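The consequence of these pKa values for the population of neutral arginine can be made concrete with the Henderson-Hasselbalch relation. The short sketch below assumes ideal single-site titration behavior and uses pH 7 purely as an illustrative value; the function name is an arbitrary choice.

```python
def protonated_fraction(ph: float, pka: float) -> float:
    """Fraction of a titratable group in its protonated form.

    Henderson-Hasselbalch for a single, non-interacting site:
    [HA+] / ([HA+] + [A]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative comparison at an assumed pH of 7:
textbook = protonated_fraction(7.0, 12.0)   # textbook arginine pKa
membrane = protonated_fraction(7.0, 7.7)    # effective pKa near membrane center
print(f"neutral fraction, pKa 12:  {1.0 - textbook:.1e}")
print(f"neutral fraction, pKa 7.7: {1.0 - membrane:.2f}")
```

With the textbook pKa, only about one arginine in 10<sup>5</sup> is neutral at pH 7, whereas the membrane-center value of 7.7 gives roughly 17% neutral form, which is why an appreciable neutral population requires a strongly perturbed pKa.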

In this work, we investigate the possible involvement of both neutral and positively charged R509 in the heterolytic splitting of H<sub>2</sub> by E. coli Hyd-1, using QM/MM calculations with various sizes of QM regions. The charge distribution and the energetics of protonation of R509 are almost identical to those of a free arginine residue. Energetically, a simultaneous protonation of R509 and a proton transfer to a nearby aspartate residue is not favorable and the terminal cysteine residue C576 is the preferred proton acceptor. In the R509K mutant enzyme, structural parameters and the charge distribution are only affected to a minor degree. When substrate H<sub>2</sub> binds to the wildtype enzyme, it is slightly polarized by R509, which facilitates the heterolytic splitting and the proton transfer to the nearby terminal cysteine. This effect is absent in the mutant enzyme. Moreover, R509 is expected to exhibit very low flexibility, as it forms strong hydrogen bonds with surrounding aspartate residues. Thus, we can suggest a dual role of the arginine in the second coordination shell of the hydrogenase enzyme: (i) an electronic function by polarizing the substrate H<sub>2</sub>, and (ii) a structural role by strong electrostatic interactions with negatively charged aspartate amino acid residues and displaying reduced conformational flexibility. By contrast, K509 is not able to form such interactions and it is thus expected to display a higher flexibility than R509, which may kinetically hinder hydrogen conversion in the mutant enzyme.

# COMPUTATIONAL DETAILS

The initial coordinates of the wildtype and R509K variant of E. coli Hyd-1 (hereafter referred to as EH1 and K-EH1) were taken from the crystal structures 5A4M and 4UE3, respectively (Evans et al., 2016). In these structures, the active site is in the oxidized "ready" Ni-B state. The oxygen atom coordinated between the Ni and Fe ions (in both protein structures) as well as the oxidation of C576 (in 5A4M) were thus removed to meet the functional form of the active site in the Ni-SIa state (see **Figure 2**). The PDB2PQR suite of programs was used to check the orientation of the side chains of Asn, Gln, and His. We used the PROPKA module of PDB2PQR to assign the protonation states of titratable residues (Dolinsky et al., 2004, 2007; Li et al., 2005; Bas et al., 2008). Both subunits (L: large subunit, and S: small subunit) of the enzymes were taken into account during the PDB2PQR procedure. After protonation, the Ni and Fe ions of the active site of the L subunit (including the CO and CN<sup>−</sup> ligands of Fe) as well as the [FeS]-clusters of the S subunit were inserted at their respective crystal-structure positions. Then, the protonation states and side chain orientations were carefully checked by visual inspection, and potential shifts of the PROPKA-calculated pKa values due to the presence of the [NiFe] and [FeS] structural motifs were qualitatively assessed for a final assignment of the protonation states of the surrounding residues. Thereafter, the S subunit was removed, while the L subunit was retained for subsequent QM/MM calculations. Moreover, all crystal waters of the latter were deleted except for the active-site water molecules (EH1<sup>L</sup>: 9; K-EH1<sup>L</sup>: 10) (Evans et al., 2016). The protonation states of all EH1<sup>L</sup> and K-EH1<sup>L</sup> residues were identical. All acidic residues were negatively charged except for D67, D350, D574, and E73, which were protonated. All lysine and arginine residues were positively charged. Histidine residues were either singly protonated at Nε (H30, 83, 117, 119, 122, 189, 220, 229, 351, 364, 457, and 514), singly protonated at Nδ (H421, 571, and 582), or doubly protonated (H205). All cysteine residues coordinating to metals were deprotonated (C76, C79, C576, and C579).

EH1<sup>L</sup> and K-EH1<sup>L</sup> were used as starting structures in subsequent QM/MM calculations with different QM regions as described below.

# QM/MM Investigation of the R509 Protonation State and Its Involvement in Proton Transfer

Two QM/MM optimizations of EH1<sup>L</sup> were carried out with R509 being either in neutral (R509<sup>0</sup>) or protonated (R509<sup>+</sup>) form. Among the nitrogen atoms of the guanidinium group of R509, the Nη1 atom is proposed to be the one potentially involved in H<sub>2</sub> activation since it is closer to the NiFe active site (Evans et al., 2016). This nitrogen atom was therefore chosen as deprotonation target to give R509<sup>0</sup>. The QM region (hereafter referred to as QM1 region) consisted of the side chain of R509<sup>+/0</sup> as well as the Ni and Fe ions with their first coordination sphere ligands (CO, two CN<sup>−</sup> groups, and the side chains (thiolate groups) of the four metal-coordinating cysteine residues) (see **Figure 3A**). The rest of the system (including the active-site water molecules) was treated at the MM level. The total charge of the QM1 region was −1 for R509<sup>+</sup> and −2 for R509<sup>0</sup>. All atoms within 6 Å of the QM1 region were unconstrained during QM/MM optimization whereas the positions of the more distant atoms were kept fixed (see **Figure 3B**). The QM/MM calculations were performed with the ChemShell<sup>1</sup> package (Sherwood et al., 2003; Metz et al., 2014) (version 3.7). The TURBOMOLE (Ahlrichs et al., 2011) (version 6.6) and DL_POLY (Smith and Forester, 1996) (version 4.08) packages were used as QM and MM interfaces, respectively. The DL-FIND optimiser module of ChemShell was used for the optimizations (Kästner et al., 2009). The electrostatic interaction between the QM1 region and the surrounding partial charges was treated using the electrostatic embedding scheme with charge shift correction (Bakowies and Thiel, 1996; de Vries et al., 1999). Hydrogen link atoms were used to saturate the valencies at the covalent bonds crossing the QM/MM boundary (see **Figure 3A**; Sherwood et al., 1997). 
DFT was used to describe the QM1 region while the MM region was described by the CHARMM27 force field (MacKerell et al., 1998; Mackerell et al., 2004). Geometries were optimized using BP86 (Slater, 1951; Vosko et al., 1980; Perdew, 1986a,b; Becke, 1988) as the DFT functional with the def2-TZVP (Weigend and Ahlrichs, 2005) basis set. The calculations were sped up by using the resolution-of-identity (RI) approximation (Eichkorn et al., 1995, 1997). Single-point energy calculations at the BP86-D3 level were performed to check the influence of dispersion (Slater, 1951; Vosko et al., 1980; Perdew, 1986a,b; Becke, 1988; Grimme, 2006). For further validation of consistency, single-point calculations were also carried out using the density functionals TPSSh (Tao et al., 2003) and B3LYP (Slater, 1951; Vosko et al., 1980; Becke, 1988, 1993) with dispersion corrections (Grimme, 2006). From the QM/MM energies of EH1<sup>L</sup>-R509<sup>0</sup> and EH1<sup>L</sup>-R509<sup>+</sup>, the energy for the deprotonation of R509 was calculated.

<sup>1</sup>ChemShell, a Computational Chemistry Shell, see www.chemshell.org

FIGURE 3 | (A) QM regions (QM1-QM4) used in the QM/MM calculations. QM/MM boundary (Cα-Cβ) atoms are shown in orange. Reaction mechanisms investigated in this study are indicated by arrows; a unique color is used for each mechanism (e.g., black arrows are used to highlight an R509<sup>+</sup>-assisted H<sub>2</sub> activation). (B) Representative structure of the simulation system. The active region in the QM/MM calculations is enlarged at the bottom with the atoms of the QM1-QM4 regions highlighted. See the text for more details.

TABLE 1 | Definition of structural parameters of Arg509 coordination in the vicinity of the active site of the E. coli [NiFe]-hydrogenase.

QM/MM optimization results vs. X-ray structural data (PDB code 5A4M).

QM/MM calculations were also performed to investigate the ability of R509<sup>+</sup> to transfer a proton to residues in its immediate surroundings, and to gain insight into potential deprotonation mechanisms leading to formation of R509<sup>0</sup>. Two different mechanisms were considered: (i) a proton transfer from the R509<sup>+</sup>:Nη1 atom to D118:Oδ, and (ii) a water-mediated proton transfer from R509<sup>+</sup>:Nη2 to H122:Nδ. The reactant (R509<sup>+</sup>) and the expected products of these mechanisms were optimized to calculate the respective reaction energies. The calculations were carried out using the same QM/MM setup described above. The only difference was in the QM region. For the proton transfer to D118, the side chain of the latter was also included as part of the QM region, whereas for the water-mediated proton transfer the side chain of H122 and two active-site water molecules were included in the QM region. The resulting QM regions are referred to as QM2 (with a charge of −2) and QM3 (with a charge of −1), respectively (see **Figure 3A**). Initially, the reactant state was optimized. Then the optimized reactant structures were used as starting points to manually build the products (by relocation of the respective protons) and optimize them.
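The total charges quoted for the QM1 to QM3 regions can be cross-checked by simple bookkeeping of formal fragment charges. The sketch below is illustrative only: the formal charges (Ni(II), Fe(II), deprotonated cysteine thiolates, a deprotonated D118, and a neutral H122) are assumptions consistent with the totals stated in the text, and the names `FRAGMENT_CHARGES` and `total_charge` are hypothetical.

```python
# Formal charges per fragment. These are ASSUMED values chosen to be consistent
# with the totals given in the text; neutral hydrogen link atoms are ignored.
FRAGMENT_CHARGES = {
    "Ni(II)": +2,
    "Fe(II)": +2,
    "CO": 0,
    "CN-": -1,
    "Cys-S-": -1,   # thiolate side chain of a metal-coordinating cysteine
    "R509+": +1,    # protonated (guanidinium) arginine side chain
    "R509(0)": 0,   # neutral arginine side chain
    "D118-": -1,    # aspartate side chain
    "H122": 0,      # histidine side chain, assumed neutral
    "H2O": 0,
}

def total_charge(composition):
    """Sum formal charges for a {fragment name: count} composition."""
    return sum(FRAGMENT_CHARGES[name] * count for name, count in composition.items())

# Metal core shared by all QM regions: Ni, Fe, CO, two CN-, four Cys thiolates.
core = {"Ni(II)": 1, "Fe(II)": 1, "CO": 1, "CN-": 2, "Cys-S-": 4}

qm1_protonated = {**core, "R509+": 1}
qm1_neutral = {**core, "R509(0)": 1}
qm2 = {**qm1_protonated, "D118-": 1}
qm3 = {**qm1_protonated, "H122": 1, "H2O": 2}

charges = {
    "QM1 (R509+)": total_charge(qm1_protonated),
    "QM1 (R509 neutral)": total_charge(qm1_neutral),
    "QM2": total_charge(qm2),
    "QM3": total_charge(qm3),
}
```

Under these assumptions the sums agree with the charges given in the text: −1 and −2 for QM1 with R509<sup>+</sup> and R509<sup>0</sup>, −2 for QM2, and −1 for QM3.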

# QM/MM Investigation of the Thermodynamics of H<sub>2</sub> Activation

We also performed QM/MM calculations to compute the thermodynamic profiles of the EH1<sup>L</sup>-catalyzed H<sub>2</sub> oxidation with both R509<sup>0</sup> and R509<sup>+</sup> acting as a base in the reaction. The potential ability of R509<sup>+</sup> to mediate H<sub>2</sub> dissociation was considered to be assisted by a strong R509<sup>+</sup>:Nη1–D118:Oδ hydrogen bond, which could make R509<sup>+</sup>:Nη1 slightly nucleophilic and thus enable a double proton transfer (H<sup>+</sup> → R509<sup>+</sup> → D118) reaction. For comparison, we computed the reaction energies with C76 and C576 as the H<sub>2</sub> proton acceptors, for both EH1<sup>L</sup> and K-EH1L. The reactant [(K-)EH1L·H<sub>2</sub>] complexes were optimized first. These were built by manual docking of H<sub>2</sub> into the EH1<sup>L</sup> and K-EH1<sup>L</sup> active sites; H<sub>2</sub> was positioned between Ni and either R509 or K509 and then fully optimized without any additional constraints. Again, the optimized reactants served as starting points to build the products, which were then also fully optimized. This time the QM/MM optimizations were carried out using a QM region (referred to as QM4) which includes the QM2 components as well as the H<sub>2</sub> atoms (see **Figure 3A**). All geometry optimizations were carried out considering both Ni and Fe to be in a closed-shell low-spin singlet state. Though the spin state of Ni<sup>2+</sup> is a controversial topic, recent computational and experimental studies on [NiFe]-hydrogenases using advanced and accurate methods (e.g., coupled cluster calculations and subatomic resolution protein crystallography) indicate that the singlet state is preferred (Bruschi et al., 2014; Delcey et al., 2014; Ogata et al., 2015a,b, 2016; Dong et al., 2017).
Moreover, the protonation site of the enzyme during H<sub>2</sub> splitting, as identified by subatomic resolution X-ray crystallography (Ogata et al., 2015b), was reproduced by calculations carried out with Ni in a singlet state (Dong and Ryde, 2016). For further validation in this regard, we have carried out single-point calculations on the optimized geometries with Ni in a triplet state. Unless mentioned otherwise, all energies reported in this manuscript are given with respect to the reactant complex and correspond to the computed energies for Ni in a singlet state, which is shown to be the ground state (see below).

Atomic charge distributions in the QM region were calculated by the Mulliken population analysis approach (Mulliken, 1955a,b) in order to allow general statements about trends rather than absolute numbers.

#### RESULTS

#### Structural Parameters and Charge Distributions

**Table 1** compares the structural parameters of the relevant interactions of R509 with the active site as obtained from the QM/MM calculations using the QM regions 1 and 2 (see above) to those obtained from X-ray crystallography. The BP86 functional was shown to give reliable structural parameters in good agreement with experiment for the active site of [NiFe] hydrogenases (Stein et al., 2001; Stein and Lubitz, 2004) and other transition metal complexes (Bühl and Kabrede, 2006). Overall, the QM(BP86)/MM(CHARMM) calculations reproduce well the interatomic distances corresponding to the non-covalent interactions between R509 and the Ni-Fe catalytic core. This is more difficult to achieve compared to covalent bonding, since structural flexibility and weak packing interactions must be treated equally well.

The Nη1 amine group forms a hydrogen bond with one of the cyanide ligands of the Fe atom (CN···Nη1 distance ∼3 Å) as well as strong electrostatic interactions with the negatively charged aspartate residue D118 (see **Figure 4**). When D118 is incorporated into the QM region (QM2), the structural parameters are in better agreement with experiment. This shows that an appropriate choice of the size of the QM region is critical for an accurate description of long-range interactions in an enzyme.

When R509 is deprotonated, there are only minor structural differences to be seen (**Table 1**). The variations are within the accuracy of the computational method and give no additional information as to the protonation state of R509 in the crystal structure.

**Table 2** provides the partial charges of the R509 atoms as obtained from the QM/MM calculations using the Mulliken population analysis approach (Mulliken, 1955a,b) and the QM regions 1 and 2, and compares them to those calculated for a free arginine. The formation of a delocalized double bond (a partial charge character) makes the central Cζ atom positively charged with 0.45 in a free arginine residue and 0.33 and 0.37 in QM regions 1 and 2, respectively. The Nη1 and Nη2 atoms are overall chemically equivalent in the free arginine with charges of −0.41 and −0.42. In the calculation with the QM1 region, charges of −0.5 and −0.53 indicate a stronger polarization due to interactions with surrounding residues. When D118 is explicitly included in the QM region (QM2), charges of −0.40 and −0.46 are obtained. This shows that the explicit QM electrostatic interaction of residue D118 leads to a slight chemical nonequivalence of the Nη atoms of R509. The Nε is significantly less negatively charged (−0.22 in free arginine, −0.32 in QM1, and −0.31 in QM2) than the other nitrogen atoms. In the neutral form R509<sup>0</sup>, Cζ becomes significantly less positively charged (+0.25) in free arginine but the effect is less pronounced when embedded in the protein (+0.29 and +0.26 in the different QM regions). Upon deprotonation of Nη1, its charge changes only marginally both in the free residue (by 0.01) and the one in the protein (by 0.03–0.06). For Nη2, on the other hand, the negative charge increases by 0.05 in the free residue and 0.09–0.10 in the protein. Thus, Nη2 becomes more nucleophilic in neutral arginine. In the enzyme, however, this nitrogen atom is too far from Ni (>6 Å) for a direct involvement in hydrogen activation to be feasible.

TABLE 2 | Charge distributions of a free arginine residue and the residue Arg509 of the E. coli [NiFe]-hydrogenase in their standard and neutral protonation states.

TABLE 3 | Calculated deprotonation energies of R509 of E. coli Hyd-1 using QM/MM calculations in kcal/mol.

In conclusion, we cannot report a significant perturbation of the charge distribution in R509 with respect to a free arginine residue. We therefore do not expect this residue to display a significantly perturbed pKa-value.

On the other hand, we calculated the reaction energy for the deprotonation of R509 from the computed QM/MM energies of EH1L-R509<sup>+</sup> and EH1L-R509<sup>0</sup>. We assume that the free proton is able to diffuse out of the protein (EH1L-R509<sup>+</sup> ↔ EH1L-R509<sup>0</sup> + H<sup>+</sup>) by considering the free energy of solvation of H<sup>+</sup> in water [ΔG<sub>solv</sub>(H<sup>+</sup>) = −265.9 kcal/mol (Tawa et al., 1998; Tissandier et al., 1998; Topol et al., 2000; Kelly et al., 2007; Rebollar-Zepeda and Galano, 2012)]. The results are shown in **Table 3**, where a positive deprotonation energy indicates the thermodynamic preference for protonated R509<sup>+</sup>. All DFT calculations give consistent results for the thermodynamic equilibrium between R509<sup>+</sup> and R509<sup>0</sup>. The differences between different functionals are within 4 kcal/mol. The protonated form R509<sup>+</sup> in the protein is energetically favored by 154–158 kcal/mol. This large deprotonation energy shows that in the [NiFe]-hydrogenase the arginine residue R509 close to the active site is predominantly in its protonated form. Meanwhile, attempts to calculate the energy difference between R509<sup>+</sup> and R509<sup>0</sup> with either D118 or H122 as a proton acceptor (see **Figure 3**) were unsuccessful; reversion to the zwitterionic state of R509 occurred immediately in the QM/MM optimizations involving R509<sup>0</sup>.
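The way the proton solvation term enters the deprotonation energy can be sketched as follows. This is one plausible reading of the scheme described above, assuming that the released proton contributes only its solvation free energy; the function name and the input energies are hypothetical placeholders, not values from this study.

```python
HARTREE_TO_KCAL = 627.5095  # kcal/mol per Hartree
DG_SOLV_PROTON = -265.9     # kcal/mol, solvation free energy of H+ quoted in the text

def deprotonation_energy(e_protonated, e_neutral, dg_solv_proton=DG_SOLV_PROTON):
    """Energy (kcal/mol) of EH1-R509(+) -> EH1-R509(0) + H+(aq).

    e_protonated, e_neutral : QM/MM energies in Hartree.
    ASSUMPTION: the free proton enters only through its solvation free energy.
    A positive result means the protonated form is thermodynamically preferred.
    """
    return (e_neutral - e_protonated) * HARTREE_TO_KCAL + dg_solv_proton

# Placeholder energies, deliberately artificial (NOT data from this work).
example = deprotonation_energy(e_protonated=-1000.0, e_neutral=-999.0)
```

With the sign convention of **Table 3**, a positive value returned by this function corresponds to a thermodynamic preference for R509<sup>+</sup>.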

It should be noted that we are not attempting to calculate standard and perturbed pKa-values here, since this would require a more robust computational approach, including e.g., high-level electronic structure methods, explicit solvent coordination with a certain number of solvent molecules plus continuum solvation, consideration of entropic contributions, and conformational sampling (Ghosh and Cui, 2008; Rebollar-Zepeda and Galano, 2012; Uddin et al., 2013). It can only be stated that in E. coli Hyd-1, the amount of neutral R509 is negligible and the thermodynamic equilibrium lies far toward a positively charged residue in its standard protonation state. Thus, the direct involvement of the neutral form of R509 as a strong FLP in H<sub>2</sub> oxidation appears impossible.

#### The Substrate Bound Complex

Ni-SIa is the catalytically active species which performs hydrogen oxidation. In Ni-SIa, H<sub>2</sub> approaches the Ni site where it is heterolytically cleaved. The hydride occupies the µ-bridging position between the Ni and Fe atoms and one residue in the vicinity must act as a proton acceptor.

**Table 4** gives structural data for the H<sub>2</sub> Ni-SIa complexes corresponding to the EH1L-R509<sup>+</sup> wildtype enzyme and the R509K variant. Attempts to compute structural parameters and reaction energies for EH1L-R509<sup>0</sup> were unsuccessful, since the optimization of the respective reactant complex evolved spontaneously toward the product with a µ-hydride and a protonated arginine. This reinforces our conclusion on the preference of R509 to be in the protonated state.

In the reactant complex (RCEH1), H<sub>2</sub> is located above and very close to the Ni ion (at 1.6 Å), whereas the distance between the Fe ion and H<sub>2</sub> is longer (2.5 Å). This is in agreement with previous studies on [NiFe] hydrogenases which show that H<sub>2</sub> prefers to bind to Ni, rather than to Fe (Ogata et al., 2002; Dong et al., 2017). Upon H<sub>2</sub> binding, the relevant interatomic distances between the active site and R509 are overall unchanged compared to the Ni-SIa state (see **Tables 1**, **4**). Moreover, structural parameters at the reactant complex are similar for the wildtype and the mutant enzyme. Nickel-nitrogen and iron-nitrogen distances as well as the interactions of H<sub>2</sub> with nickel, iron, or cysteine residues do not change overall. This indicates that upon the arginine-to-lysine mutation, the active center remains fully assembled and structurally intact to perform the catalytic hydrogen oxidation. This is in agreement with structural and spectroscopic investigations (Evans et al., 2016).

TABLE 4 | Structural data of the H<sub>2</sub> Ni-SIa complexes of the wildtype E. coli Hyd-1 and the R509K variant optimized using the larger QM2 region.

<sup>a</sup>Distances to the N<sup>ζ</sup> atom of lysine. <sup>b</sup>Not applicable.

The electronic structure also changes only slightly in the R509K mutant (**Table 5**). The Nη1 atom becomes less negative (−0.31) and thus less nucleophilic, and will then be a weaker proton acceptor when H<sub>2</sub> is heterolytically split. Atomic charges at Ni, Fe and all other active site atoms remain overall unchanged (see **Table 5**). What becomes apparent by analyzing the partial charges is the fact that in the wildtype enzyme both hydrogen atoms of H<sub>2</sub> are slightly positively polarized (+0.07). In the lysine mutant, however, the hydrogen atoms become less charged and oppositely polarized, with the hydrogen atom pointing toward the bridging position positive (H<sub>A</sub>, +0.02) and the distal hydrogen between lysine and cysteine negative (H<sub>B</sub>, −0.02). Since H<sub>A</sub> will become the bridging hydride, H<sub>B</sub> ought to be accepted by a (negatively charged) proton acceptor. This indicates that the introduction of a lysine residue does not structurally impair the catalytic function, but it reduces the proton affinity of a potential proton acceptor nitrogen atom and at the same time leads to a partial negative charge on the putative protonic species.

# The Thermodynamics of H<sub>2</sub> Oxidation

The QM/MM energies calculated for the binding and heterolytic splitting of H<sub>2</sub> by EH1L-R509<sup>+</sup> and K-EH1 are shown in **Table 6** for a series of different functionals. Since we could not obtain a stationary intermediate for the EH1L-R509<sup>0</sup> H<sub>2</sub> complex, those energies cannot be reported. All DFT calculations give a very consistent picture of the energetics of H<sub>2</sub> binding and splitting. This provides a reliable insight into the thermodynamics of H<sub>2</sub> oxidation and an estimate of the uncertainty of the computed energies.

TABLE 5 | Charge distributions in the H<sub>2</sub> Ni-SIa complexes of the wildtype E. coli Hyd-1 and the R509K mutant optimized using the larger QM2 region.

<sup>a</sup>N<sup>ζ</sup> atom of lysine (K509). <sup>b</sup>Not applicable.

TABLE 6 | Substrate binding energies and thermodynamics of heterolytic H<sub>2</sub> splitting (in kcal/mol) in the wildtype and R509K mutant [NiFe]-hydrogenase from E. coli.

The hydride occupies the µ-bridging position between the Ni···Fe atoms.

As can be seen in **Table 6**, our calculations indicate that a potential heterolytic splitting of H<sub>2</sub> involving participation of R509<sup>+</sup> as a proton acceptor (H<sub>2</sub> → R509<sup>+</sup> → D118) is thermodynamically unfavorable by 13.9–19.3 kcal/mol. By contrast, C76 and C576 are found to be able to act as proton acceptors and mediate this process favorably, the reaction being exothermic by 3.7–9.9 and 7.0–15.3 kcal/mol, respectively. At all levels of theory evaluated, protonation of C576 is thermodynamically more favorable than that of C76; the reaction energy associated with C576 is 4.7–5.4 kcal/mol lower in comparison to C76. This is in line with previous computational and X-ray crystallography studies on [NiFe]-hydrogenase from Desulfovibrio vulgaris Miyazaki F, which show that protonation of C546 (the equivalent of C576 in EH1) is preferred over the other coordinating cysteines (Ogata et al., 2015b; Dong and Ryde, 2016).

The structures of the optimized H<sub>2</sub>···Ni-SIa complexes are shown in **Figure 5**. Apart from a different positioning of the H<sub>2</sub>:H<sub>B</sub> atom (see **Figure 5**), there are only a few structural differences between the optimized product complexes. In PC76EH1 (protonated cysteine C76), the orientation of the side chain of the residue E28 (located in the MM region) is different and is better stabilized by the surrounding residues (via hydrogen bonds) in comparison to both PC576EH1 (protonated cysteine C576) and PC509EH1 (product complex for a proton transfer to R509<sup>+</sup>). This is also true when comparing PC76EH1 and RCEH1. Therefore, the hydrogen bonding interactions of E28 in PC76EH1 are considered to be important for the exothermic formation of this product, which is supported by the lower value of the MM energy component with respect to that for RCEH1 (see Supplementary Material for a detailed analysis of the QM and MM energy contributions to the QM/MM energies; Senn and Thiel, 2009; Escorcia et al., 2017). Meanwhile, PC509EH1 differs from PC76EH1 and PC576EH1 regarding geometry and orientation of the side chain of R509<sup>+</sup>. The characteristic planar geometry of the guanidinium group is distorted in PC509EH1. In addition, the hydrogen bond interactions with the surrounding aspartate residues (D118 and D574) are overall weaker. Together these terms may contribute significantly to the endothermic formation of PC509EH1, as given by the higher value of the QM energy component in comparison to PC76EH1 and PC576EH1 (see Supplementary Material).

According to these results, R509 is not expected to be directly involved in the reaction mechanism of the H<sub>2</sub> activation by EH1. Instead, we propose this residue to be important for H<sub>2</sub> activation by guiding its access to the active site, promoting its binding to nickel and facilitating its polarization. Our conclusions are based on the QM/MM calculations with K-EH1L. As can be seen in **Table 6**, H<sub>2</sub> activation by K-EH1<sup>L</sup> is also thermodynamically feasible with either C76 or C576 acting as a base. The computed reaction energies also suggest the process to be thermodynamically comparable with respect to the EH1 wildtype enzyme. This shows that a purely thermodynamic argument cannot explain the high activity of EH1 and the >100-fold reduction in K-EH1L.

given and the electrostatic interactions with the aspartate residues are less pronounced.

In addition, the binding of H<sub>2</sub> is favored by 1.2 kcal/mol in the wildtype EH1. As described above, the charge analysis showed that the H<sub>2</sub>:H<sub>B</sub> atom is more polarized in RCEH1 than in RCK−EH1, with an atomic charge value of 0.07 and −0.02, respectively (**Table 5**). Considering this and the similar negative charge of the sulfur atoms of both C76 and C576, the H<sub>2</sub> splitting is expected to be kinetically favored (i.e., with a lower energy barrier) in EH1.

The most important structural differences between EH1 and K-EH1 are found to be in the immediate vicinity of the R509 and K509 residues. The former is strongly stabilized by the surrounding aspartate residues through hydrogen bond interactions (**Figure 5**), which hold R509 in place and may make this residue more rigid in comparison to lysine. The smaller spatial extension and a potentially higher degree of flexibility of the side chain of K509 may account for the weaker binding of H<sub>2</sub> in the mutant, in contrast to the correct positioning and polarization of H<sub>2</sub> in the wildtype enzyme. The investigation of the effect of the flexibility of the side chain on the kinetics of the H<sub>2</sub> splitting will require extensive QM/MM MD simulations with a sufficient degree of conformational sampling.

All energies discussed up to this point were obtained with Ni in a low-spin singlet state (S = 0).

The spin state of the EPR-silent Ni<sup>2+</sup> intermediate state is still a controversial issue. Recent computational studies on [NiFe]-hydrogenases using state-of-the-art methods (e.g., coupled cluster calculations and DMRG) indicate that the singlet state is preferred over the triplet in a cluster model of the active site (Dong et al., 2017). BP86 gave a low-spin Ni(II) state for Ni-SI (Stein et al., 2001; Stein and Lubitz, 2004) and was shown to be close to the DMRG results in terms of spin state splitting energies. Meanwhile, B3LYP gave reasonable thermodynamics for H<sub>2</sub> splitting but did not perform well for triplet vs. singlet Ni(II) spin state energies (Dong et al., 2017). With the B3LYP functional, a high-spin Ni(II) was found to be the ground state in an earlier study (Pardo et al., 2006). For Ni(II) tetrathiolate complexes, the singlet-triplet energy splitting is very sensitive to the amount of exact Hartree-Fock exchange. A reduction to 0.15 in B3LYP<sup>∗</sup> gave an improved description of the relative spin state ordering for Ni(II)S4 model complexes and [NiFe] hydrogenase active site models (Bruschi et al., 2004). On the other hand, the TPSSh functional with 0.1 of exact exchange gave reliable structural parameters and bond energies for a set of 80 transition-metal-containing complexes. Furthermore, TPSSh provided reliable energies when tested against typical bioinorganic reactions including spin inversion and electron affinity in iron–sulfur clusters, and breaking or formation of bonds in iron proteins and cobalamins (Jensen, 2008). Thus, we have additionally carried out QM/MM calculations for the thermodynamics of H<sub>2</sub> splitting with Ni(II) in a triplet state (S = 1).

We compare the relative ordering of the singlet and triplet spin states for the BP86, B3LYP, and TPSSh functionals. As shown in **Table 7**, the results from all functionals are consistent and indicate that the reactivity on the singlet state spin surface is favored over the triplet state surface for both the wildtype and the mutant enzyme, by 13–23 kcal/mol.
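The sign convention for the singlet-triplet splitting, in which a positive value marks the singlet as the ground state, can be made explicit with a short sketch. The function names and the input energies are hypothetical placeholders and do not correspond to values computed in this work.

```python
HARTREE_TO_KCAL = 627.5095  # kcal/mol per Hartree

def singlet_triplet_gap(e_singlet, e_triplet):
    """Return E(triplet) - E(singlet) in kcal/mol from energies in Hartree.

    Both energies are assumed to refer to the same geometry, e.g., a triplet
    single-point calculation on a singlet-optimized structure.
    """
    return (e_triplet - e_singlet) * HARTREE_TO_KCAL

def ground_state(gap_kcal):
    """A positive gap means the singlet lies below the triplet."""
    return "singlet" if gap_kcal > 0 else "triplet"

# Placeholder energies in Hartree (hypothetical, NOT data from this study).
gap = singlet_triplet_gap(e_singlet=-100.00, e_triplet=-99.98)
state = ground_state(gap)
```

Applying the same function to the reactant and product complexes for each functional would give a table of splittings with the layout of **Table 7**.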

## CONCLUSIONS

Hydrogen oxidation by E. coli Hyd-1 was investigated by QM/MM calculations. Substitution of a highly conserved arginine amino acid residue by a charge-conserving lysine does not affect the structural parameters and the electronic structure of the active site. The active site is fully assembled and pre-formed for catalysis. The introduction of lysine, however, leads to an unfavorable polarization of the substrate H<sub>2</sub> and makes proton transfer to a negatively charged terminal cysteine kinetically impaired. This explains the reduction of activity by two orders of magnitude in the K-EH1 mutant enzyme. It was initially suggested that a neutral arginine R509<sup>0</sup> might directly be involved in catalysis and act as a FLP for proton acceptance.

TABLE 7 | Singlet-triplet spin state splitting energies (in kcal/mol) from QM/MM calculations.

A positive energy indicates the singlet state of Ni(II) to be the ground state. RC: reactant complex (H2@Ni-SIa complex). PC76: protonated cysteine C76. PC576: protonated cysteine C576.

FLPs were found to be highly effective in activating a variety of small molecules, which prompted strong interest in their investigation; for example, the activation of molecular hydrogen by a FLP in the absence of a transition metal catalyst was reported (Stephan and Erker, 2010). The concept of FLPs has also been applied to the design of model systems for the active sites of the transition metal-containing hydrogenases. DuBois, Bullock, and co-workers (Raugei et al., 2012) developed enzyme model systems that combine a metal center with non-coordinating amine donor ligands. These pendant, neutral amine groups act in concert with the Ni center to give rise to electrochemical H<sub>2</sub> oxidation and the authors directly note the analogy to FLPs.

In the [NiFe]-hydrogenase enzyme, we cannot verify the existence of a neutral arginine amino acid residue close to the active site. This would be possible only if the pKa-value of that residue was strongly perturbed by the interactions with the protein environment. According to our findings, the charge distribution of R509 in the enzyme is very close to that of a free arginine amino acid residue and the deprotonation energy is too high to enable generation of a neutral arginine. In biological catalysis, it is the positively charged side-chain guanidinium group that is often utilized as an electrophilic catalyst with a very high pKa-value (≥12). In the heterolytic cleavage of H<sub>2</sub>, a hydride occupies the µ-bridging position between the Ni and Fe atoms. A nucleophilic proton acceptor, rather than an electrophile, is therefore necessary to take up the product proton. The positively charged R509 residue can still facilitate H<sub>2</sub> splitting via polarization of the latter due to interactions with the partially negatively charged (nucleophilic) Nη1 atom.

The role of the conserved arginine in hydrogenases is thus three-fold: (i) strong electrostatic interactions with nearby aspartate amino acid residues enable an easy H<sub>2</sub> access to the Ni atom with an access channel radius of ∼4 Å (see **Figure 6**); (ii) the arginine assists the positioning and polarization of H<sub>2</sub> to enable a swift proton transfer to one of the terminal cysteines; and (iii) the strong electrostatic interactions with the protein environment keep the arginine in a rigid position and obstruct any conformational changes which otherwise might impede catalysis (see **Figure 5**).

#### AUTHOR CONTRIBUTIONS

MS: designed and initiated the project; AE: performed the calculations; MS and AE: analyzed and interpreted the data, wrote the manuscript, and approved the final version.

#### REFERENCES


#### FUNDING

Financial support by the Max Planck Society for the Advancement of Science and the EU COST Action CM1305 ECOSTBio is gratefully acknowledged.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00164/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Escorcia and Stein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.