MULTISCALE MODELING FROM MACROMOLECULES TO CELL: OPPORTUNITIES AND CHALLENGES OF BIOMOLECULAR SIMULATIONS

EDITED BY: Valentina Tozzini, Giulia Palermo, Matteo Dal Peraro, Alexandre M. J. J. Bonvin and Rommie E. Amaro

PUBLISHED IN: Frontiers in Molecular Biosciences

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-109-1 DOI 10.3389/978-2-88966-109-1

## MULTISCALE MODELING FROM MACROMOLECULES TO CELL: OPPORTUNITIES AND CHALLENGES OF BIOMOLECULAR SIMULATIONS

Topic Editors:

Valentina Tozzini, Istituto Nanoscienze, Consiglio Nazionale delle Ricerche, Italy

Giulia Palermo, University of California, Riverside, United States

Matteo Dal Peraro, École Polytechnique Fédérale de Lausanne, Switzerland

Alexandre M. J. J. Bonvin, Utrecht University, Netherlands

Rommie E. Amaro, University of California, San Diego, United States

Citation: Tozzini, V., Palermo, G., Dal Peraro, M., Bonvin, A. M. J. J., Amaro, R. E., eds. (2020). Multiscale Modeling from Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-109-1

# Table of Contents

*Editorial: Multiscale Modeling From Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations*

Giulia Palermo, Alexandre M. J. J. Bonvin, Matteo Dal Peraro, Rommie E. Amaro and Valentina Tozzini

*Understanding Ligand Binding to G-Protein Coupled Receptors Using Multiscale Simulations*

Mercedes Alfonso-Prieto, Luciano Navarini and Paolo Carloni

*Building Minimalist Models for Functionalized Metal Nanoparticles*

Giorgia Brancolini and Valentina Tozzini

*A Multi-Scale Approach to Membrane Remodeling Processes*

Weria Pezeshkian, Melanie König, Siewert J. Marrink and John H. Ipsen

*Unraveling the Molecular Mechanism of Pre-mRNA Splicing From Multi-Scale Simulations*

Lorenzo Casalino and Alessandra Magistrato

*Computational Studies of Cardiac and Skeletal Troponin*

Jacob D. Bowman and Steffen Lindert

*Some Notes on the Thermodynamic Accuracy of Coarse-Grained Models*

Ewa Anna Oprzeska-Zingrebe and Jens Smiatek

*Modeling Crowded Environment in Molecular Simulations*

Natalia Ostrowska, Michael Feig and Joanna Trylska

*Global Dynamics of Yeast Hsp90 Middle and C-Terminal Dimer Studied by Advanced Sampling Simulations*

Florian Kandzia, Katja Ostermeir and Martin Zacharias

*MARTINI-Based Protein-DNA Coarse-Grained HADDOCKing*

Rodrigo V. Honorato, Jorge Roel-Touris and Alexandre M. J. J. Bonvin

*Definition of the Minimal Contents for the Molecular Simulation of the Yeast Cytoplasm*

Vijay Phanindra Srikanth Kompella, Ian Stansfield, Maria Carmen Romano and Ricardo L. Mancera

*Structural Transition States Explored With Minimalist Coarse Grained Models: Applications to Calmodulin*

Francesco Delfino, Yuri Porozov, Eugene Stepanov, Gaik Tamazian and Valentina Tozzini

*Enzymatic Polymerization of PCL-PEG Co-polymers for Biomedical Applications*

Pedro Figueiredo, Beatriz C. Almeida and Alexandra T. P. Carvalho

*Multiscale Solutions to Quantitative Systems Biology Models*

Nehemiah T. Zewde

*Improved Modeling of Peptide-Protein Binding Through Global Docking and Accelerated Molecular Dynamics Simulations*

Jinan Wang, Andrey Alekseenko, Dima Kozakov and Yinglong Miao

*Large-Scale Conformational Changes and Protein Function: Breaking the* in silico *Barrier*

Laura Orellana

*To Bud or Not to Bud: A Perspective on Molecular Simulations of Lipid Droplet Budding*

Valeria Zoni, Vincent Nieto, Laura J. Endter, Herre J. Risselada, Luca Monticelli and Stefano Vanni

Mahdi Bagherpoor Helabad, Senta Volkenandt and Petra Imhof

Brandon P. Mitchell, Rohaine V. Hsu, Marco A. Medrano, Nehemiah T. Zewde, Yogesh B. Narkhede and Giulia Palermo

*Computational Investigation of Voltage-Gated Sodium Channel β3 Subunit Dynamics*

William G. Glass, Anna L. Duncan and Philip C. Biggin

Bin Sun and Peter M. Kekenes-Huskey

# Editorial: Multiscale Modeling From Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations

Giulia Palermo<sup>1</sup>, Alexandre M. J. J. Bonvin<sup>2</sup>, Matteo Dal Peraro<sup>3</sup>, Rommie E. Amaro<sup>4</sup> and Valentina Tozzini<sup>5,6</sup>\*

<sup>1</sup> Departments of Bioengineering and Chemistry, University of California, Riverside, Riverside, CA, United States, <sup>2</sup> Faculty of Science - Chemistry, Bijvoet Centre for Biomolecular Research, Utrecht University, Utrecht, Netherlands, <sup>3</sup> Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, <sup>4</sup> Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA, United States, <sup>5</sup> Istituto Nanoscienze, CNR, Pisa, Italy, <sup>6</sup> Lab NEST, Scuola Normale Superiore, Pisa, Italy

Keywords: multiscale modeling, molecular dynamics simulations, advanced sampling methods, coarse grained models, macro-biomolecules, molecular crowding, system biology, bioinformatics

**Editorial on the Research Topic**

### **Multiscale Modeling From Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations**

### Approved by:

Mark Nicholas Wass, University of Kent, United Kingdom

> \*Correspondence: Valentina Tozzini valentina.tozzini@nano.cnr.it

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 10 July 2020 Accepted: 21 July 2020 Published: 28 August 2020

#### Citation:

Palermo G, Bonvin AMJJ, Dal Peraro M, Amaro RE and Tozzini V (2020) Editorial: Multiscale Modeling From Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations. Front. Mol. Biosci. 7:194. doi: 10.3389/fmolb.2020.00194

The wonderful complexity of biological systems is responsible for the emergence of life from the chemical world, but it is also the reason why living systems are so difficult to address in simulations. As recently demonstrated by the tremendous efforts directed to the study of SARS-CoV-2, even a relatively simple biological unit, such as a virus, needs to be addressed from multiple points of view: both as a whole, to study processes on the scale of microns and on times of microseconds to milliseconds, and deconstructed into its single parts at the molecular level (Agúndez et al., 2020; Durrant et al., 2020). From the point of view of simulations, this implies following in silico the fate of billions (or tens or hundreds of billions) of atoms over macroscopic time scales. This appears impractical at first sight, especially because of the computational cost, which, considering for instance Molecular Dynamics (MD) simulations, can be roughly estimated as $\propto N_D^{\alpha} \times N_t = (S/dV)^{\alpha}\,(T/dt)$, where $N_D$ and $N_t$ are the number of degrees of freedom and the number of timesteps needed to represent a system of size $S$ for a simulation time $T$; $dV$ and $dt$ represent the discretization levels in space<sup>1</sup> and time, and $\alpha$ is the exponent for the polynomial scaling of the computational cost with size<sup>2</sup>. Therefore, the history of molecular simulations is strongly interlaced with that of computing hardware development, both tracing back more than 50 years. The exponential increase of computing performance up to now has made it possible to address whole viruses or (portions of) cells at the atomistic level in simulations of hundreds of ns (Tarasova and Nerukh, 2018), while simulations of single proteins can extend over the millisecond scale (Shaw et al., 2009).

<sup>1</sup>dV is related to the resolution at which the system is treated. For instance, for the atomistic representation one can consider an average inter-particle distance of 1.5 Å, leading to a dV of the order of 10 Å<sup>3</sup>. For non-MD techniques, dt can be substituted with a parameter describing the precision of phase or conformational space sampling.

<sup>2</sup>This is usually between 1 and 2, but may depend on the model used.
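To make the scaling estimate concrete, the following minimal Python sketch evaluates the cost expression given above in arbitrary units. The function name `md_cost`, the 2 fs timestep, the box size, and the tenfold coarse-graining factors are illustrative assumptions rather than values from the text; only the dV of about 10 Å<sup>3</sup> comes from footnote 1.

```python
def md_cost(size, time, d_v, d_t, alpha=1.0):
    """Relative MD cost ~ (S/dV)**alpha * (T/dt), in arbitrary units."""
    n_dof = size / d_v      # N_D: number of degrees of freedom
    n_steps = time / d_t    # N_t: number of timesteps
    return n_dof ** alpha * n_steps


# Hypothetical example: a (100 Å)^3 box simulated for 1 ns.
box_volume = 100.0 ** 3          # Å^3 (assumed system size)
run_length = 1.0e6               # fs, i.e. 1 ns

# Atomistic resolution: dV ~ 10 Å^3 (footnote 1), dt = 2 fs (assumed).
atomistic = md_cost(box_volume, run_length, d_v=10.0, d_t=2.0)
# Assumed lower-resolution model with 10x larger dV and 10x larger dt.
coarse = md_cost(box_volume, run_length, d_v=100.0, d_t=20.0)

speedup = atomistic / coarse
print(f"estimated speed-up: {speedup:.0f}x")
```

With linear scaling (α = 1) the two tenfold factors simply multiply; a larger α would increase the gain from reducing the number of degrees of freedom.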

However, at present fully atomistic MD simulations cannot simultaneously access macroscopic sizes and time scales long enough for sufficient statistical exploration. Therefore, they are often coupled to techniques for evaluating thermodynamic quantities (typically free energy profiles), as in the original research paper by Bagherpoor Helabad et al., which combines Langevin Dynamics (LD) with entropy evaluation to identify the DNA binding domains of the androgen glucocorticoid receptor, or in that by Sun and Kekenes-Huskey, where the Potential of Mean Force (PMF) along the open-close transition of the Ca<sup>2+</sup> binding protein S100A1, involved in cardiomyocyte function, is calculated with the Weighted Histogram Analysis Method (WHAM) combined with Born surface area continuum solvation. With similar aims, a number of different techniques to expand the exploration of conformational and phase space are used, as reviewed by Bowman and Lindert focusing on skeletal troponin. In these studies, stochastic dynamics (e.g., Brownian dynamics, BD) is combined with Umbrella Sampling-like techniques or steered molecular dynamics (SMD) and Markov chain modeling, effectively enhancing the conformational sampling. Similarly, the Gaussian MD method accelerates dynamics using an external potential to push the system out of local minima, as in the simulations by Mitchell et al. of CRISPR-Cas9 in the presence of base pair mismatches. Also frequent is the combination of atomistic simulations and enhanced sampling techniques with bioinformatic methods, as in the template-based peptide sorting and docking algorithm (Peptidock), which aims at designing peptides that interfere with Protein-Protein Interactions (PPI) for therapeutic purposes, as reported by Wang et al.
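As a hedged illustration of what a free energy profile is, the sketch below converts a histogram of a collective variable, assumed to be sampled at equilibrium without bias, into a profile through F = −kT ln P. This is the simplest possible estimator and not the WHAM procedure used in the cited work; the function name and the bin counts are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def free_energy_profile(counts, temperature=300.0):
    """Return F_i = -kT ln(p_i), shifted so that the minimum is zero.

    Bins that were never visited get an infinite free energy.
    """
    kt = K_B * temperature
    total = float(sum(counts))
    profile = []
    for c in counts:
        if c > 0:
            profile.append(-kt * math.log(c / total))
        else:
            profile.append(math.inf)
    lowest = min(profile)
    return [f - lowest for f in profile]


# Hypothetical histogram along an open-close coordinate.
counts = [50, 400, 1000, 400, 50, 0]
for i, f in enumerate(free_energy_profile(counts)):
    print(f"bin {i}: {f:.2f} kcal/mol")
```

The empty last bin shows why enhanced sampling is needed: regions that plain MD never visits have no finite estimate at all.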

Besides the need to extend the simulation scales, there are other, more subtle reasons that call for new simulation strategies beyond conventional atomistic MD. One is that the first-generation atomistic Force Fields (FF), developed and tested over nearly six decades, are now starting to show their deficiencies, precisely because simulations have reached macroscopic scales. As highlighted in the Perspective by Melcr and Piquemal, one shortcoming is the lack of polarizability due to the use of fixed partial charges, which leads to a suboptimal representation of hydrogen bonds and, as a consequence, a poor description of the relative stability of secondary and tertiary structures, especially when long time scales and temperature variations come into play. Thus, a tremendous parallel effort to reparameterize atomistic FFs to include polarizability has been ongoing, as in the AMOEBA FFs.

The failure to reproduce effects involving electronic rearrangements was one of the main driving factors behind the development of multiscale approaches. The idea of multiscale modeling is to combine atomistic FFs (molecular mechanics, MM) with a higher resolution method that explicitly represents electrons, and therefore employs quantum mechanics (QM), in different space regions of the same system (hybrid QM/MM simulations, also called "parallel multiscaling"), in order to improve accuracy only in those regions where it is necessary. These regions are easily identifiable, for instance, in enzymes, where the active site is localized. This makes it possible to simulate reactions such as the synthesis of polycaprolactone-polyethylene glycol copolymers, realized by Figueiredo et al. by means of an interface between the Gaussian code for QM and the Amber code for MM. The authors additionally couple the QM and MM methods in a "serial way," i.e., performing FF-MD simulations of the entire protein (no QM part) and QM simulations of the active site only, to compare and pass structural parameters between the two. In fact, in hybrid QM/MM simulations the bottleneck of the calculation is the QM part, which also forces a reduction of the simulation timestep and consequently of the whole run length. Hybrid schemes therefore extend the size of the system that can be addressed with respect to QM-only methods at the same accuracy, but not the time scale. A very important issue to solve is thus the efficiency of the implementation, which is addressed in the Opinion by Bolnykh et al. Here the authors discuss the implementation realized in the MiMiC code by means of a multiple program-multiple data paradigm, which combines the flexibility of the so-called loose coupling, performed through an input/output interface between two different codes for QM and MM calculations, with the computational efficiency of the strong coupling typically implemented in single ad hoc codes for QM/MM. Additionally, to extend the time scales of simulations, MiMiC implements efficient multiple-time-step algorithms. We remark that, while hybrid schemes in principle also solve the problem of polarization, the accurate treatment of electrostatics remains a crucial issue even in QM/MM approaches, addressed in MiMiC with fully Hamiltonian electrostatic embedding. Hybrid QM/MM approaches can be coupled to methods for sampling enhancement, as shown in the Perspective by Casalino and Magistrato on the mechanism of the eukaryotic spliceosome, where combinations with thermodynamic integration, free energy calculations, principal component analysis of trajectories and electrostatic analysis are reviewed.
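Thermodynamic integration, one of the techniques mentioned above, estimates a free energy difference as the integral over a coupling parameter λ of the ensemble average of ∂U/∂λ. The sketch below applies the trapezoidal rule to a set of window averages; the function name and the numerical values are hypothetical, and in practice each average would come from a separate simulation at fixed λ.

```python
def thermodynamic_integration(lambdas, mean_du_dlambda):
    """Trapezoidal estimate of dF = integral of <dU/dlambda> over lambda."""
    if len(lambdas) != len(mean_du_dlambda) or len(lambdas) < 2:
        raise ValueError("need at least two matching (lambda, <dU/dlambda>) points")
    delta_f = 0.0
    for i in range(len(lambdas) - 1):
        width = lambdas[i + 1] - lambdas[i]
        delta_f += 0.5 * width * (mean_du_dlambda[i] + mean_du_dlambda[i + 1])
    return delta_f


# Hypothetical window averages (kcal/mol), one per fixed-lambda simulation.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
averages = [12.0, 8.0, 5.0, 3.0, 2.0]
delta_f = thermodynamic_integration(lambdas, averages)
print(f"dF = {delta_f:.2f} kcal/mol")
```

The trapezoidal rule is chosen here only for transparency; other quadrature schemes are equally possible when the integrand varies strongly with λ.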

In biological systems the idea of multiscale, or multiresolution, approaches emerges naturally from the intrinsically hierarchical organization of biological matter, in which different levels of organization are easily recognizable. For biopolymers, the first super-atomic level is that of the residue. Accordingly, the most popular super-atomistic (Coarse Grained, CG) models are those based on a residue-level representation. MARTINI and SDK FFs use, in fact, a slightly higher resolution (1 to 5 beads per residue) and explicit CG models for the solvent. This speeds up simulations by 200- to 400-fold with respect to atomistic ones, due in part to a direct reduction of $N_D$ and in part to the possibility of increasing dt, allowed by the elimination of the higher vibrational frequencies of the system, a secondary consequence of coarse graining. In practice, the reduction of resolution operates a coarse graining in both the space and time domains, allowing the simulation of slow and extended processes such as membrane budding and the formation of lipid droplets, as described in the Opinion by Zoni et al. MARTINI is among the more standardized CG FFs, and is often used in multi-scale approaches combined with atomistic simulations and, e.g., homology modeling, as in the study by Glass et al. on the structure, function, and clustering of voltage-gated sodium channels, or embedded within a flexible docking protocol to supplement atomistic rigid docking between proteins and nucleic acids, as in the paper by Honorato et al. reporting a modified version of the HADDOCK code.
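The spatial part of coarse graining can be sketched as a mapping from groups of atoms to beads placed at their centers of mass. The example below is a minimal illustration under stated assumptions: the fragment, its coordinates, and the two-bead mapping are invented and do not correspond to MARTINI, SDK, or any other published force field.

```python
def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)); returns the mass-weighted centre."""
    total_mass = sum(m for m, _ in atoms)
    return tuple(
        sum(m * pos[k] for m, pos in atoms) / total_mass for k in range(3)
    )


def coarse_grain(atoms, mapping):
    """Map atoms onto beads.

    mapping: dict bead_name -> list of atom indices belonging to that bead.
    Returns dict bead_name -> bead position.
    """
    return {
        bead: center_of_mass([atoms[i] for i in indices])
        for bead, indices in mapping.items()
    }


# Hypothetical four-atom fragment (masses in amu, coordinates in Å).
atoms = [
    (12.0, (0.0, 0.0, 0.0)),
    (12.0, (1.5, 0.0, 0.0)),
    (16.0, (3.0, 0.0, 0.0)),
    (1.0, (3.0, 1.0, 0.0)),
]
# Assumed two-bead mapping; not taken from any published force field.
mapping = {"B1": [0, 1], "B2": [2, 3]}
beads = coarse_grain(atoms, mapping)
print(beads)
```

Here four particles become two, which directly reduces the number of degrees of freedom; the larger timestep discussed above is a separate gain that follows from the smoother effective interactions.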

A further considerable reduction of computational cost is obtained with CG implicit solvent models, especially those with simplified parameterization. Alfonso-Prieto et al. review the atomistic-CG "hybrid" (parallel) approaches based on Go-like models, applied to G-Protein Coupled Receptors, and show that these models can be used in combination with homology modeling and docking techniques to dramatically improve the prediction of ligand binding affinity, especially thanks to the inclusion of the flexibility of the whole complex at low computational cost. Similarly, Delfino et al. use a Cα-based minimalist model to address the large conformational changes of calmodulin upon Ca<sup>2+</sup> binding/release, setting up a simulation paradigm that serially combines CG and atomistic representations with path searching, morphing, and minimum action path techniques, and that is extendable to all switching proteins. D'Annessa et al. review how atomistic and simplified CG representations such as elastic network (EN) models can be combined with docking algorithms, Monte Carlo and MD, possibly associated with enhanced sampling techniques (SD, WHAM, PMF) and implicit solvent treatments, focusing on applications to the design of peptide drugs that interfere with PPI.
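An elastic network model of the kind mentioned above can be sketched in a few lines: every pair of beads closer than a cutoff in a reference structure is connected by a harmonic spring whose rest length is the reference distance. The cutoff of 8 Å, the uniform spring constant, and the three-bead test geometry are assumptions chosen for illustration only.

```python
import math


def build_network(reference, cutoff=8.0):
    """Return [(i, j, d0)] for all pairs closer than `cutoff` in the reference."""
    springs = []
    for i in range(len(reference)):
        for j in range(i + 1, len(reference)):
            d0 = math.dist(reference[i], reference[j])
            if d0 < cutoff:
                springs.append((i, j, d0))
    return springs


def network_energy(coords, springs, k=1.0):
    """Harmonic energy E = (k/2) * sum (d_ij - d0_ij)^2 over all springs."""
    energy = 0.0
    for i, j, d0 in springs:
        d = math.dist(coords[i], coords[j])
        energy += 0.5 * k * (d - d0) ** 2
    return energy


# Hypothetical three-bead chain with 3.8 Å spacing (coordinates in Å).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
springs = build_network(reference)
# Displace the last bead by 1 Å along x and evaluate the strain energy.
deformed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
print(network_energy(deformed, springs))
```

Because the parameters are derived from a single reference structure, the model has its energy minimum at that structure by construction, which is what makes this class of models cheap to set up.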

A crucial point when considering CG approaches is the parameterization strategy. Besides the already mentioned simplified models (EN, Go-like, and minimalist), which are parameterized on reference structures, parameterization strategies involve either bottom-up approaches based on higher resolution models or higher level theories (also called "physics based" or "ab initio"), usually involving the matching of forces or energy surfaces, or top-down strategies (also called "knowledge based" or "data driven"), which incorporate experimental data, generally of different origin (thermodynamic, structural, vibrational). There is an ambivalent case: the "statistics based" parameterization, in which sets of structural data of any origin (measured or calculated) are used through Boltzmann Inversion (BI)-related procedures to fit the model parameters. The latter approach is particularly preferred when CG simulations are used to evaluate thermodynamic properties, because BI is the expression of thermodynamic consistency with the dataset. Oprzeska-Zingrebe and Smiatek show with a theoretical analysis that many subtle effects may arise at the bulk level in the evaluation of thermodynamic properties and equilibrium constants, depending on the specific choice of the size of the CG bead and its location, which therefore must be chosen very carefully. This is especially true when the coarse graining is pushed to very low resolution, e.g., a single bead per molecule or domain, sometimes called the mesoscale (MS) level, often used to represent the crowders in the cell cytoplasm. Ostrowska et al. nicely review the recent literature on representations of the crowded environment, which, incidentally, are usually "parallel" or hybrid multi-scale representations, since the system of interest, typically a protein, is represented at a higher resolution level than the crowders. The authors highlight the effects purely due to confinement and those due to the shape of the crowders or to the detail of their surface. A similar MS model decorated with CG beads is used by Brancolini and Tozzini to represent bio-functionalized metal nanoparticles designed as anti-aggregating therapeutic agents in degenerative diseases due to amyloidogenic proteins.
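As a minimal sketch of a Boltzmann Inversion-related procedure, consider a CG bond whose length distribution is approximately Gaussian. Inverting U(r) = −kT ln P(r) then gives a harmonic potential with rest length equal to the mean and force constant kT/σ². The sketch assumes this Gaussian form and neglects the r² Jacobian factor; the function name and the sample values are hypothetical.

```python
import statistics

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def harmonic_from_samples(bond_lengths, temperature=300.0):
    """Boltzmann-invert a Gaussian bond-length distribution.

    Returns (r0, k) for U(r) = 0.5 * k * (r - r0)**2, with k = kT / variance.
    """
    r0 = statistics.fmean(bond_lengths)
    variance = statistics.pvariance(bond_lengths, mu=r0)
    if variance == 0.0:
        raise ValueError("zero variance: cannot derive a finite force constant")
    return r0, K_B * temperature / variance


# Hypothetical bead-bead distances (Å) collected from a reference ensemble.
samples = [3.7, 3.8, 3.9]
r0, k = harmonic_from_samples(samples)
print(f"r0 = {r0:.2f} Å, k = {k:.1f} kcal/(mol Å^2)")
```

The same recipe applies whether the distances come from experimental structures or from higher resolution simulations, which is why the text calls this case ambivalent between top-down and bottom-up.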

Clearly, the possible combinations of different resolutions and different sampling or parameterization methodologies are limited only by the researchers' creativity. For instance, Kandzia et al. use an MS-level network model as an external biasing potential for replica exchange atomistic MD (with replicas differing in the level of bias) to study the slow motion and mechanism of action of the yeast Hsp90 chaperone, giving an original example of parallel multi-scaling. Pezeshkian et al. give a perspective on their methodology, which matches a continuum-like representation of the membrane with a particle-like representation. Their model represents the membrane by a dynamical triangulation that includes elasticity and the effect of membrane proteins or inclusions, which can modify the elasticity and curvature, dynamically changing its parameters via a Metropolis algorithm. The model parameters are calibrated using both atomistic and CG (MARTINI) simulations, with which the model is fully compatible thanks to a back-mapping algorithm. The multi-scale approach is also perfectly suited to represent chromatin, the system in which the hierarchical structural organization is most evident. In particular, compaction-decompaction transitions are events triggered at the level of the nucleosome by chemical changes in the histone proteins, and are reflected at the macroscopic level through a process in which electrostatics plays a major role. Electrostatics also plays a role in maintaining the delicate balance that keeps the DNA relatively compact, yet accessible for transcription and duplication. Bendandi et al. review the methods used to simulate these processes, involving all scales from atomistic to MS and using several methodologies, from MD to MC, implicit electrostatics, and statistical and mathematical modeling and analyses (e.g., topological and fractal models). The multi-scale approach is also combined with mathematical knot theory by Rosa et al., who use an interdisciplinary approach to analyze the paradox of packing, entanglement, and accessibility of DNA.
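The Metropolis algorithm mentioned above for the membrane model rests on a simple acceptance rule: a trial move that lowers the energy is always accepted, and one that raises it by ΔE is accepted with probability exp(−ΔE/kT). The sketch below shows only this generic rule, not the authors' triangulated-surface implementation; the function name and the `rng` parameter are assumptions.

```python
import math
import random


def metropolis_accept(delta_e, kt, rng=random):
    """Accept a trial move with probability min(1, exp(-delta_e / kt))."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kt)


# Estimate the acceptance rate for uphill moves with delta_e = kT.
rng = random.Random(1)  # seeded for reproducibility
trials = 20000
accepted = sum(metropolis_accept(1.0, 1.0, rng) for _ in range(trials))
print(f"acceptance rate: {accepted / trials:.3f} (exp(-1) = {math.exp(-1):.3f})")
```

In a membrane model of this kind the trial moves would be, for example, vertex displacements or edge flips of the triangulation, with ΔE computed from the elastic energy.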

Over the last decades low-resolution models have evolved, and it has become clear that combining top-down and bottom-up strategies in their parameterization can produce models with accuracy comparable to or exceeding that of atomistic FFs, especially in the evaluation of thermodynamic properties. The review by Orellana addresses the theme of cross-validation between in vitro and in silico approaches, showing that the best way to tackle the complexity of living matter is a multi-disciplinary combination of enhanced sampling simulation techniques and path sampling methods, applied to multi-scale approaches that mix simplified models such as EN with atomistic representations and experimental data such as cryo-EM. The application focus is here on switching proteins, which are ubiquitous and difficult to address because of their large conformational changes. However, a similar need for inclusion of experimental data and cross-validation of models emerges in the MS models for the cytoplasm, where, as shown in the brief report by Kompella et al., standardized data about the composition in mass, size and diffusivity and the intercrossing relations between the cell elements are needed to set up a model for eukaryotic cells that accurately reproduces crowding effects.

Indeed, elements from systems biology must be included when the level of simulation scales toward that of the cell. Widely used approaches in this case are those based on Kinetic Master Equations (KME) connecting a set of cell elements. KME are used, for instance, in the representation of the whole complement cascade of the immune system illustrated in the Opinion by Zewde, where the vertices of the network are proteins, NAs and other cell components, and the kinetic parameters are evaluated through BD, within a "serial" coupling between particle-based and systems biology methods. Similarly, Thornburg et al. address the processes of replication, transcription, and translation of a minimal synthetic cell, using atomistic data and genomic information for the parameterization. The model is able to predictively account for details such as ribosome production and activity. This should be considered a step forward in the representation of an entirely in silico cell.
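A kinetic master equation of the kind described above can be sketched as a set of coupled rate equations, dp_i/dt = Σ_j (k_ji p_j − k_ij p_i), integrated in time. The example uses explicit Euler integration on a hypothetical three-state cascade; the rate constants are invented, whereas in the cited work they would be derived from particle-based simulations.

```python
def integrate_master_equation(p0, rates, dt, n_steps):
    """Explicit Euler integration of dp_i/dt = sum_j (k_ji p_j - k_ij p_i).

    p0:    list of initial state probabilities
    rates: dict (i, j) -> rate constant for the transition i -> j
    """
    p = list(p0)
    for _ in range(n_steps):
        flux = [0.0] * len(p)
        for (i, j), k in rates.items():
            flow = k * p[i] * dt   # probability moved from i to j in one step
            flux[i] -= flow
            flux[j] += flow
        p = [pi + fi for pi, fi in zip(p, flux)]
    return p


# Hypothetical three-state cascade A <-> B -> C (rates in 1/s).
rates = {(0, 1): 2.0, (1, 0): 1.0, (1, 2): 0.5}
p_final = integrate_master_equation([1.0, 0.0, 0.0], rates, dt=1e-3, n_steps=20000)
print([round(x, 4) for x in p_final])
```

Explicit Euler is used only for clarity and requires a timestep much smaller than the inverse of the fastest rate; stiff networks would call for an implicit or stochastic scheme.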

The interdisciplinary character of multiscale approaches emerges clearly from the panoramic view of the methods illustrated in this collection, enriched by the contributions of the participants in the Workshop Multiscale Modeling from Macromolecules to Cell<sup>3</sup> (CECAM Lausanne, Feb 4-6, 2019), which was organized by us and inspired this collection. It is apparent that we are currently witnessing the historical moment in which the bottom-up computational approaches rising from the atomic and molecular level and the top-down experimental methods from the macroscopic level meet at the mesoscale, where new possibilities of discovery and comprehension are enabled. Finally, before closing, we would like to comment on COVID-19, the severe respiratory syndrome caused by the SARS-CoV-2 virus. COVID-19 continues to unexpectedly test many of the cross-disciplinary and multiscale approaches discussed in this collection (Swiderek and Moliner, 2020), with many ongoing efforts from this community aiming to understand viral mechanisms of action (Zhao et al., 2020) as well as to identify possible drugs and vaccines (Casalino et al., 2020). The urgency of the COVID-19 situation has led to a unique worldwide private-public coordination of governments, industries, and academic institutions offering computing resources (Zimmerman et al., 2020) and sharing methods, models, and data<sup>4</sup>. Although this terrible disease has not yet been defeated, the incredibly rapid and coordinated worldwide research effort can already be considered a successful example to follow.

<sup>3</sup>Multiscale Modeling from Macromolecules to Cell, workshop website https://www.cecam.org/workshop-details/241.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

GP was supported by the National Science Foundation under Grant No. CHE-1905374 and by the National Institutes of Health under Grant No. R01 EY027440. GP also acknowledges XSEDE (Grant No. TG-MCB160059) and the Covid-19 HPC Consortium (Grant No. MCB200150) for support through computational time. MD was supported by the Swiss National Science Foundation and EPFL. AB acknowledges financial support from the European Union Horizon 2020 projects BioExcel (675728 and 823830) and EOSC-hub (777536).

<sup>4</sup>Public sites collecting data on SARS-CoV-2: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00319, https://covid.molssi.org.

### REFERENCES


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Palermo, Bonvin, Dal Peraro, Amaro and Tozzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Understanding Ligand Binding to G-Protein Coupled Receptors Using Multiscale Simulations

#### Mercedes Alfonso-Prieto<sup>1,2</sup>, Luciano Navarini<sup>3</sup> and Paolo Carloni<sup>1,4,5,6</sup>\*

<sup>1</sup> Institute for Advanced Simulation IAS-5 and Institute of Neuroscience and Medicine INM-9, Computational Biomedicine, Forschungszentrum Jülich, Jülich, Germany, <sup>2</sup> Medical Faculty, Cécile and Oskar Vogt Institute for Brain Research, Heinrich Heine University Düsseldorf, Düsseldorf, Germany, <sup>3</sup> illycafè S.p.A, Trieste, Italy, <sup>4</sup> Institute for Neuroscience and Medicine INM-11, Forschungszentrum Jülich, Jülich, Germany, <sup>5</sup> Department of Physics, Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen University, Aachen, Germany, <sup>6</sup> VNU Key Laboratory "Multiscale Simulation of Complex Systems", VNU University of Science, Vietnam National University, Hanoi, Vietnam

#### Edited by:

Giulia Palermo, University of California, Riverside, United States

### Reviewed by:

Francesco Saverio Di Leva, University of Naples Federico II, Italy Antonella Di Pizio, Technical University of Munich, Germany

> \*Correspondence: Paolo Carloni p.carloni@fz-juelich.de

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 18 February 2019 Accepted: 11 April 2019 Published: 03 May 2019

#### Citation:

Alfonso-Prieto M, Navarini L and Carloni P (2019) Understanding Ligand Binding to G-Protein Coupled Receptors Using Multiscale Simulations. Front. Mol. Biosci. 6:29. doi: 10.3389/fmolb.2019.00029

Human G-protein coupled receptors (GPCRs) convey a wide variety of extracellular signals inside the cell and they are one of the main targets for pharmaceutical intervention. Rational drug design requires structural information on these receptors; however, experimental structures are scarce. This gap can be filled by computational models, based on homology modeling and docking techniques. Nonetheless, the low sequence identity across GPCRs and the chemical diversity of their ligands may limit the quality of these models, and hence refinement using molecular dynamics simulations is recommended. This is the case for olfactory and bitter taste receptors, which constitute the first and third largest GPCR groups and show sequence identities with the available GPCR templates below 20%. We have developed a molecular dynamics approach, based on the combination of molecular mechanics and coarse grained (MM/CG) representations, tailored to study ligand binding in GPCRs. This approach has been applied so far to bitter taste receptor complexes, showing significant predictive power. The protein/ligand interactions observed in the simulations were consistent with extensive mutagenesis and functional data. Moreover, the simulations predicted several binding residues not previously tested, which were subsequently verified by carrying out additional experiments. Comparison of the simulations of two bitter taste receptors with different ligand selectivity also provided some insights into the binding determinants of bitter taste receptors. Although the MM/CG approach has been applied so far to a limited number of GPCR/ligand complexes, the excellent agreement of the computational models with the mutagenesis and functional data supports the applicability of this method to other GPCRs for which experimental structures are missing. This is particularly important for the challenging case of GPCRs with low sequence identity with available templates, for which molecular docking shows limited predictive power.

Keywords: G-protein coupled receptor, molecular dynamics, multiscale simulations, molecular mechanics, coarse grained, chemosensory receptors, bitter taste receptors, olfactory receptors

## INTRODUCTION

G-protein coupled receptors (GPCRs) are one of the largest protein superfamilies, with more than 800 (4%) genes in humans (Venter et al., 2001; Fredriksson et al., 2003; Lagerstrom and Schioth, 2008; Tikhonova and Fourmy, 2010). They detect a wide variety of extracellular signals (from photons to hormones and neurotransmitters) and trigger a myriad of intracellular transduction cascades (using different G-proteins and second messengers) (Alexander et al., 2017). These pleiotropic receptors are involved in many physiological functions, from vision to chemical sensing and neurotransmission, and, hence, they are attractive targets for pharmaceutical intervention. Approximately 34% of currently FDA-approved drugs bind to GPCRs (Hauser et al., 2018) and they are used to treat disorders as diverse as pain, hypertension, diabetes, cancer or neurological diseases (Hauser et al., 2017). Given the physiological and pharmacological relevance of GPCRs, unraveling their ligand binding determinants can be extremely useful both for understanding receptor function and for designing new drugs.

Based on phylogenetic and sequence conservation analyses, GPCRs can be classified into 5 different families or classes (Fredriksson et al., 2003; Schioth and Fredriksson, 2005): rhodopsin (class A), secretin (class B1), adhesion (class B2), glutamate (class C), and frizzled/taste2 (class F). Nonetheless, taste 2 (or bitter taste) receptors have also been proposed to form part of class A (Nordstrom et al., 2011) or even constitute a sixth, additional family (class T) (Munk et al., 2016a). Since the appearance of the first crystal structure of rhodopsin in 2000, experimental structural characterization of GPCRs has been blossoming (Munk et al., 2019). As of February 2019, there are 59 unique receptor structures solved (https://gpcrdb.org/structure/statistics), most of them corresponding to the rhodopsin (or class A) family (**Figure 1**). Molecular dynamics (MD) simulations started from these experimental structures have provided very important insights into ligand binding and receptor activation (Miao and McCammon, 2016; Sengupta et al., 2016; Latorraca et al., 2017; Marino and Filizola, 2018; Torrens-Fontanals et al., 2018; Velgy et al., 2018).

Nonetheless, the experimental structural coverage is still very far from the total of 800 GPCRs. In particular, there are no experimental structures available for three receptor groups: olfactory receptors (ORs, which constitute half of class A), taste 2 receptors (TAS2Rs, which represent the third largest GPCR family) and adhesion (class B2) receptors. In silico modeling can help to fill this gap of ∼87% structurally uncharacterized GPCRs (Pándy-Szekeres et al., 2017). Indeed, the community-wide GPCR Dock assessment (Michino et al., 2009; Kufareva et al., 2011, 2014) has shown that homology modeling and ligand docking are able to provide valuable information on receptor-ligand interactions, in particular for those GPCR targets that have templates with sequence identity higher than 35–40% (Kufareva et al., 2011; Beuming and Sherman, 2012). Subsequent refinement of the bioinformatics-based models through molecular dynamics simulations (Kufareva et al., 2014; Cavasotto and Palomba, 2015; Lupala et al., 2018) and integration of experimental (mutagenesis and ligand structure-activity relationship) data (Munk et al., 2016b) further increase the model quality to values close to experimental accuracy. However, approximately half of GPCRs do not have a close template (i.e., an experimental structure of a receptor from the same family with a similar ligand). For instance, the sequence identity of 90% of GPCRs with the rhodopsin template (representative of the largest GPCR family, class A) is lower than 20% (Zhang et al., 2006). Therefore, in most cases the in silico modeling approach needs further improvement, typically using molecular dynamics (Kufareva et al., 2014; Cavasotto and Palomba, 2015; Lupala et al., 2018).

Chemosensory receptors (olfactory and bitter taste receptors) are among the GPCRs without close templates. Increasing evidence shows that these receptors are expressed not only in the nose and the tongue, but also in other parts of the body (Foster et al., 2014; Abaffy, 2015; Ferrer et al., 2016; Shaik et al., 2016; Lu et al., 2017; Behrens and Meyerhof, 2019; Lee et al., 2019) and thus they have become attractive novel targets for drug design campaigns (Lee et al., 2019). However, chemosensory receptors represent a major challenge for computational modeling. Their sequence identity with the available GPCR templates is lower than 20% (Fierro et al., 2017) and thus only low resolution homology models can be generated (Kufareva et al., 2011; Fierro et al., 2017). Hence, our lab has made a major effort to improve such low resolution homology models and to make valuable predictions of the ligand binding determinants of these receptors. In this review, we first explain the computational approach used in our group to study low resolution GPCR models, based on the combination of state-of-the-art bioinformatics techniques and multiscale molecular dynamics simulations, as well as its validation on a class A GPCR (the β2-adrenergic receptor) with a solved crystallographic structure. Then, we show that, although bioinformatics-based models can be a good starting point to study receptor-ligand interactions, multiscale simulations significantly improve the quality of the models for which MM/CG simulations have been run so far. A perspective on this multiscale approach concludes this review.

## MATERIALS AND METHODS

### Bioinformatics

Given the lack of experimental structures, the initial structures of the receptor/ligand complexes are generated using bioinformatics approaches. Although there are several webservers specialized in GPCR modeling (Launay et al., 2012; Zhang et al., 2015; Busato and Giorgetti, 2016; Esguerra et al., 2016; Pándy-Szekeres et al., 2017; Worth et al., 2017; Miszta et al., 2018), here we used the GOMoDo webserver (Sandal et al., 2013), which combines in a single pipeline homology modeling and molecular docking for GPCRs.

Since the sequence identity of any given olfactory or bitter taste receptor with the available GPCR templates is lower than 20% (Fierro et al., 2017), special care needs to be taken in the sequence alignment step. Hence, the alignment was done using profile Hidden Markov Models (HMMs) of the corresponding target receptor family and the GPCR template(s), which were generated with HHPred (Soding et al., 2005). This approach has been shown to improve the target-template alignment for distant homologs (Soding et al., 2005), in particular for GPCRs (Kufareva et al., 2014). This alignment was further improved by manual curation, taking advantage of the conserved seven transmembrane (7TM) helix topology and the presence of common conserved features across GPCRs (Lagerstrom and Schioth, 2008; Venkatakrishnan et al., 2013; Pydi et al., 2014, 2016; Tehan et al., 2014; de March et al., 2015; Di Pizio et al., 2016; Fierro et al., 2017). Moreover, since template selection is difficult with such low sequence identity, several models based on different templates were built using MODELLER (Webb and Sali, 2016), and the best model was selected considering also structural quality parameters (Melo et al., 2002; Shen and Sali, 2006).
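The sequence identity figures quoted above depend on how identity is counted over an alignment. The following is a minimal sketch, assuming identity is computed over alignment columns in which neither sequence has a gap (other conventions divide by the full alignment length or by the shorter sequence). The function name and the toy sequences are illustrative assumptions, not part of the GOMoDo, HHPred or MODELLER pipelines.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    # Keep only alignment columns without a gap in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a.upper() == b.upper())
    return 100.0 * matches / len(pairs)


# Toy aligned fragments, made up for illustration (not real receptor sequences).
target = "MLT-AVLFA"
template = "MKTWAV-YA"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity over ungapped columns")
```

With a target-template identity below 20%, small changes in the alignment shift which residues are paired, which is why the profile-based alignment and manual curation described above matter.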

The receptor structural model thus generated was then submitted to molecular docking using HADDOCK (Dominguez et al., 2003). Although other docking approaches were tested [based on AutoDock Vina (Trott and Olson, 2010) or Glide (Friesner et al., 2004)], no significant improvement in the quality of the models was observed (Fierro et al., 2017). The location of the ligand binding pocket inside the 7TM bundle is conserved (Venkatakrishnan et al., 2013), despite the low sequence identity among GPCRs. Moreover, the results of the GPCR Dock competitions (Michino et al., 2009; Kufareva et al., 2011, 2014; Cavasotto and Palomba, 2015; Munk et al., 2016b) seem to indicate that incorporating information about putative binding residues (from experimental data or computational predictions) helps to improve the docking results. Therefore, an information-driven approach was taken, in which the computationally predicted binding residues [using fpocket (Le Guilloux et al., 2009)] were used to guide the docking. Nonetheless, the fine details of the ligand binding site are expected to be highly variable across GPCRs (Venkatakrishnan et al., 2013), due to the chemical diversity of the GPCR ligands. Hence, in our HADDOCK-based docking approach both receptor and ligand were considered fully flexible in order to allow mutual readjustments. Other flexible docking approaches have also been successfully employed by other groups to predict the binding determinants of chemosensory receptors [see for instance (Di Pizio and Niv, 2014; Di Pizio et al., 2017; Xue et al., 2018)].

### Multiscale Molecular Dynamics Simulations

The results of the GPCR Dock competitions [reviewed in references (Cavasotto and Palomba, 2015) and (Ranganathan et al., 2017)] showed that refinement of the docked complexes using molecular dynamics simulations can significantly improve the prediction of receptor/ligand interactions. This is particularly important for GPCR models based on low sequence identity, as is the case for chemosensory receptors, where the low accuracy of the side chain prediction and the limited sampling of the docking algorithms may undermine the quality of the bioinformatics-based models. There are several studies in the literature applying molecular dynamics simulations to chemosensory receptors (Gelis et al., 2012; Lai and Crasto, 2012; Charlier et al., 2013; Lai et al., 2014; Chen et al., 2018; Jaggupilli et al., 2018; Liu et al., 2018; Bushdid et al., 2019). Here we focus on a hybrid, multiscale approach developed in our group (Neri et al., 2005, 2008; Leguèbe et al., 2012; Giorgetti and Carloni, 2014; Musiani et al., 2015; Tarenzi et al., 2017), which is tailored to study ligand binding in GPCRs.

As shown in **Figure 2**, the ligand, the surrounding protein residues (typically the extracellular half of the receptor) and water molecules are described with molecular mechanics (MM) using the GROMOS united-atom force field (Schuler and Van Gunsteren, 2000; Schuler et al., 2001; Oostenbrink et al., 2004). Instead, the rest of the protein (i.e., the intracellular half of the receptor) is treated at the coarse grained (CG) level using a Gō model (Go and Abe, 1981). Each amino acid is mapped into a single coarse grained bead corresponding to the alpha carbon atom and native contacts are mimicked by introducing two new potential terms. The bonded interactions between consecutive CG beads are taken into account using a quartic potential, whereas the non-bonded interactions between non-consecutive CG beads are described through a Morse-like potential. The MM and CG regions are connected by an interface (I) region, which ensures the continuity of the protein backbone by coupling the two levels of resolution. The MM-I interaction is treated at the atomistic level using the GROMOS force field, whereas the I-CG interaction is described using the Gō model. Namely, bonded interactions are calculated between the Cα atoms of the I residues and the consecutive CG beads, whereas non-bonded interactions are computed using both the Cα and Cβ atoms of the I residues and the non-consecutive CG beads.
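The two CG potential terms can be illustrated with a short sketch. This review does not give their functional forms, so the expressions below are assumptions: a quartic bonded term written as (k/4)(r² − b²)² and a Morse-like non-bonded term written as v0[1 − exp(−β(r − b))]², both with their minimum at the native distance b. All parameter values are placeholders in arbitrary units; the actual expressions and parameters should be taken from the cited MM/CG method papers.

```python
import math


def quartic_bond(r: float, b: float, k: float) -> float:
    """Assumed quartic bonded term between consecutive beads: (k/4) * (r^2 - b^2)^2."""
    return 0.25 * k * (r * r - b * b) ** 2


def morse_like(r: float, b: float, v0: float, beta: float) -> float:
    """Assumed Morse-like non-bonded term: v0 * (1 - exp(-beta * (r - b)))^2."""
    return v0 * (1.0 - math.exp(-beta * (r - b))) ** 2


# Placeholder parameters (hypothetical, arbitrary units).
b_bond, k_bond = 3.8, 1.0           # native distance between consecutive beads
b_contact, v0, beta = 6.0, 5.0, 0.7  # native contact distance, well depth, steepness

e_bond_native = quartic_bond(b_bond, b_bond, k_bond)        # zero at the native distance
e_bond_stretched = quartic_bond(4.0, b_bond, k_bond)        # positive when stretched
e_contact_native = morse_like(b_contact, b_contact, v0, beta)  # zero at the native contact
e_contact_far = morse_like(1000.0, b_contact, v0, beta)     # plateaus at v0 when far apart
print(e_bond_native, e_bond_stretched, e_contact_native, e_contact_far)
```

In this form both terms penalize departures from the distances found in the starting model, which is what biases the CG region toward its native contacts.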

The presence of the lipid bilayer is modeled implicitly, using three wall potentials: a "coating surface" wall that simulates the effect of the lipid hydrophobic tails embracing the protein surface and two "membrane plane" walls that mimic the presence of the lipid head groups. In addition, two "hemispheric" wall potentials are included to cap the extracellular and cytoplasmic ends of the protein and to prevent water evaporation. Water molecules, Cα atoms and aromatic residues Phe, Trp, and Tyr (the so-called "anchor residues") are affected by these boundary potentials, which are added to the MM/CG potential energy function as functions of the distance of an atom to the closest wall. Recently, a reservoir of CG water has been introduced around the MM water cap, permitting water molecules to freely diffuse between the MM and CG regions, changing their resolution on the fly. This allows simulations to be carried out in a statistically well-defined (grand canonical) ensemble in the higher-resolution MM region, resulting in a further improved description of the binding poses and the binding site flexibility (Tarenzi et al., 2019).

Compared to docking, these multiscale simulations make it possible to (i) sample protein flexibility and protein/ligand interactions more extensively (∼1 µs timescale) and (ii) include explicit water molecules, which may be involved in ligand binding in GPCRs (Pardo et al., 2007; Angel et al., 2009; Venkatakrishnan et al., 2019). Moreover, the use of the Gō model in the intracellular half of the receptor prevents possible unfolding problems due to the initially wrong orientation of the side chains in the low resolution homology model. For further details on the MM/CG implementation, we refer the reader to some recent reviews (Giorgetti and Carloni, 2014; Musiani et al., 2015; Schneider et al., 2018).

## RESULTS AND DISCUSSION

### Validation of the Molecular Mechanics/Coarse Grained (MM/CG) Approach

The reliability of the MM/CG approach was assessed using the β2-adrenergic receptor (β2-AR) in complex with either its inverse agonist S-carazolol or its agonist R-isoprenaline (Leguèbe et al., 2012; Marchiori et al., 2013). The availability of a crystal structure of the receptor (for the first complex) (Cherezov et al., 2007), as well as all-atom (AA) molecular dynamics simulations (for both complexes) (Vanni et al., 2011), makes it possible to compare the results of the MM/CG simulations with both static and dynamical data. Three different types of tests were carried out (Leguèbe et al., 2012; Marchiori et al., 2013), started from different initial structures: (i) the same initial structures of the β2-AR/S-carazolol and β2-AR/R-isoprenaline complexes as the atomistic simulations, (ii) an alternative initial structure of the β2-AR/S-carazolol complex built by displacing the ligand to a position where none of the crystallographic receptor/ligand interactions was present, and (iii) a low resolution model of the β2-AR/S-carazolol complex built using bioinformatics. Each of the test simulations was ∼0.8 µs long.

The first test (Leguèbe et al., 2012) showed that the MM/CG approach is able to preserve the receptor/ligand complex structure observed in the crystal structure, as well as to provide dynamical and hydration information similar to the AA simulations, but at a lower computational cost. Complementarily, the second test (Leguèbe et al., 2012) confirmed that the agreement between the MM/CG and AA simulations observed in the first test is not due to the use of a common initial structure and, furthermore, demonstrated the predictive power of the MM/CG approach even when starting from a wrong binding pose. Nonetheless, the two previous tests can be considered as redocking experiments: even though the system was converted from AA into hybrid MM/CG [test (i)] or the ligand was moved out of place [test (ii)], the binding residues were already positioned as in the correct binding pose. Instead, the third test (Marchiori et al., 2013) validated the reliability of the MM/CG approach applied to low resolution models, such as the ones used for the bitter taste receptors discussed in the next section. In such models, the orientation of the side chains is not expected to be accurate, due to the low sequence identity with the template used in the homology modeling (Chothia and Lesk, 1986; Baker and Sali, 2001; Eramian et al., 2008; Olivella et al., 2013; Piccoli et al., 2013; Busato and Giorgetti, 2016). Indeed, the homology model of the β2-adrenergic receptor (built using as template the experimental structure of squid rhodopsin) shares only 20% sequence identity with the target and thus docking of the ligand S-carazolol resulted in a wrong binding pose. However, the ∼0.8 µs MM/CG simulation was able to yield a binding pose showing receptor/ligand interactions similar to the crystallographic ones (Marchiori et al., 2013).

### Predictive Power of the Computational Models of Chemosensory Receptors

In order to investigate the performance of bioinformatics and multiscale simulations in predicting receptor/ligand interactions in chemosensory receptors, we selected those receptor/ligand pairs for which experimental data are available (Fierro et al., 2017). As of August 2017, these included seven olfactory receptor/odorant complexes and fifteen bitter taste receptor/bitter tastant complexes with available site-directed mutagenesis data and functional assays, typically agonist dose-response curves. The docked receptor/ligand complexes were obtained using the bioinformatics protocol described in the Materials and Methods section, whereas the three MM/CG simulations analyzed (for the complexes TAS2R38/6-n-propylthiouracil, TAS2R38/phenylthiocarbamide and TAS2R46/strychnine) were taken from previous studies from our group (Biarnés et al., 2010; Marchiori et al., 2013; Sandal et al., 2015).

In order to compare the computational models with the experimental data, we defined "computational binding" and "computational non-binding" residues, as well as "experimental binding" and "experimental non-binding" residues (Fierro et al., 2017). Computational binding and non-binding residues were determined based on the receptor/ligand distance (using a cutoff of 5.5 Å) and the presence or absence (respectively) of an actual chemical interaction (such as hydrogen bonds, salt bridges, hydrophobic or aromatic interactions, etc.). Experimental binding residues were inferred from experiments based on (i) the effect of the mutation on the half maximal effective concentration (EC<sub>50</sub>) value and (ii) their position in the upper extracellular part of the receptor, where the canonical binding site of class A GPCRs is located (Venkatakrishnan et al., 2013). Residues whose mutation does not change EC<sub>50</sub> and/or that are located in the lower intracellular part of the receptor are considered as experimental non-binding residues. Obviously, this is a simplistic definition of binding residue, as from the experimental data we cannot rule out that these residues might also be involved in receptor activation (see reference Fierro et al., 2017 for further discussion).
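The distance part of the "computational binding" definition can be sketched as follows. This is a minimal illustration that applies only the 5.5 Å cutoff; the second criterion (an actual chemical interaction such as a hydrogen bond or salt bridge) is not implemented here and would have to be checked separately. The function name, the input layout and the coordinates are hypothetical.

```python
import math


def residues_within_cutoff(protein_atoms, ligand_atoms, cutoff=5.5):
    """Return the sorted residue ids having at least one atom within `cutoff` Å of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    ligand_atoms:  iterable of (x, y, z) tuples, one per ligand atom.
    """
    ligand_atoms = list(ligand_atoms)
    close = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in close:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            close.add(residue_id)
    return sorted(close)


# Made-up coordinates in Å, for illustration only.
protein = [
    (101, (0.0, 0.0, 0.0)),
    (101, (1.5, 0.0, 0.0)),
    (102, (4.0, 0.0, 0.0)),
    (103, (20.0, 0.0, 0.0)),
]
ligand = [(6.0, 0.0, 0.0), (7.0, 1.0, 0.0)]
candidates = residues_within_cutoff(protein, ligand)
print(candidates)
```

Residues returned by such a filter are only candidates: under the definition above, a residue within the cutoff but without an identifiable interaction counts as computational non-binding.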

Comparison of the computational and experimental residues yielded four different possible test outcomes. "True positives" (TP) were amino acids identified as binding residues by both experiment and computation, "false positives" (FP) were amino acids identified as non-binding residues by experiment but as binding residues in computation, "true negatives" (TN) were amino acids identified as non-binding residues by both experiment and computation, and "false negatives" (FN) were amino acids identified as binding residues by experiment but not in computation. In order to assess the agreement of the computational models with the experimental data, two statistical parameters, precision (PREC) and recall (REC), were calculated:

$$\begin{aligned} \text{PREC} &= \frac{\text{TP}}{\text{TP} + \text{FP}}\\ \text{REC} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \end{aligned}$$

These parameters are close to one when the computational predictions are consistent with the experimental data, and close to zero when they are not. Precision and recall values were calculated for the best docking poses of the twenty-two complexes investigated and for a representative snapshot of each of the three MM/CG simulations analyzed (Fierro et al., 2017).
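The two statistics can be computed directly from sets of residues. The sketch below is a minimal illustration of the definitions above. The function name and the residue labels are hypothetical, and one assumption is made explicit: residues predicted as binding but never tested experimentally are counted neither as TP nor as FP.

```python
def precision_recall(computational_binding, experimental_binding, experimental_nonbinding):
    """Return (PREC, REC) from sets of residue labels."""
    comp = set(computational_binding)
    exp_b = set(experimental_binding)
    exp_nb = set(experimental_nonbinding)
    tp = len(comp & exp_b)   # binding in both experiment and computation
    fp = len(comp & exp_nb)  # non-binding in experiment, binding in computation
    fn = len(exp_b - comp)   # binding in experiment, missed by computation
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    rec = tp / (tp + fn) if (tp + fn) else float("nan")
    return prec, rec


# Made-up residue labels for illustration; "R7" is predicted but untested, so it is ignored.
computational = {"R1", "R2", "R3", "R7"}
exp_binding = {"R1", "R2", "R4"}
exp_nonbinding = {"R3", "R5"}
prec, rec = precision_recall(computational, exp_binding, exp_nonbinding)
print(f"PREC = {prec:.2f}, REC = {rec:.2f}")
```

In this toy case TP = 2, FP = 1 and FN = 1, so both precision and recall equal 2/3.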

We found that the predictive power of the bioinformatics approach varied from complex to complex. Nonetheless, the general agreement between the binding residues identified in the docking poses and those inferred from experiments was low, with only 36% of the predictions consistent with experiment (Fierro et al., 2017). Residues shown experimentally to be important for binding were not observed in the docked complexes (i.e., low recall) and/or residues not involved in protein/ligand interactions according to experiments were predicted as binding residues by computation (i.e., low precision). Most likely, this is due, among other factors, to the low sequence identity (<20%) between the chemosensory receptor targets and the available GPCR templates, as well as the limited sampling of the docking algorithms. Therefore, although the bioinformatics-based models are a good starting point to study ligand binding determinants in chemosensory receptors, they appear to require further refinement (Fierro et al., 2017). This finding is consistent with the results of the GPCR Dock competitions, which indicated that models based on sequence identity below 30% need substantial improvement in order to reach accuracy comparable to experimental structures (Kufareva et al., 2011, 2014).

Next, we compared the performance of molecular dynamics for the three bitter taste receptor complexes studied so far with (∼0.8–1 µs long) MM/CG simulations (Marchiori et al., 2013; Sandal et al., 2015). We found that the predictive power of the computational models improved dramatically, with 96–100% of the predictions in agreement with experiment (Fierro et al., 2017). Most residues shown to be involved in binding by experiments were captured by the MM/CG simulations and the number of wrong predictions was minimal (i.e., both recall and precision increased to values near or equal to one, see **Table 1**). Considering the nearly 20 mutants tested experimentally for either TAS2R38 (Biarnés et al., 2010; Marchiori et al., 2013) or TAS2R46 (Brockhoff et al., 2010; Born et al., 2013; Sandal et al., 2015), the agreement of the computational models with experiments seems really remarkable. Moreover, although in our analysis we used all the available mutagenesis data to validate a posteriori the MM/CG results, simulations were also able to predict new binding residues. Indeed, the simulations of the TAS2R38 and TAS2R46 complexes suggested several binding residues not tested previously and these predictions were subsequently verified by performing additional mutagenesis and functional assays (Marchiori et al., 2013; Sandal et al., 2015). Altogether, multiscale simulations seem to be a robust approach for identifying ligand binding residues in olfactory and bitter taste receptors, at least for the three bitter taste receptor complexes studied so far (Fierro et al., 2017). This is consistent with the conclusions of the GPCR Dock competitions, where molecular dynamics simulations and integration of experimental data (such as site-directed mutagenesis or ligand structure-activity relationships) were shown to improve the computational predictions (Kufareva et al., 2011, 2014; Cavasotto and Palomba, 2015; Ranganathan et al., 2017). Nonetheless, the application of MM/CG simulations to other chemosensory receptor complexes with available experimental data is still needed to firmly establish the reliability and transferability of this method.

TABLE 1 | Evaluation of the performance of the bioinformatics models and the multiscale MM/CG simulations for the three bitter taste receptors studied so far, i.e., TAS2R38/6-n-propylthiouracil (PROP), TAS2R38/phenylthiocarbamide (PTC) and TAS2R46/strychnine (Fierro et al., 2017).

Precision (PREC) and recall (REC) values are listed.

### Insights Into Ligand Selectivity Determinants in Bitter Taste Receptors

There are around 1,000 compounds characterized as bitter, which vary significantly in size, polarity and chemical structure (Meyerhof et al., 2010; Behrens and Meyerhof, 2018; Dagan-Wiener et al., 2019). To make things even more puzzling, three receptors (TAS2R10, TAS2R14, and TAS2R46) out of the 25 bitter taste receptors are able to recognize about half of these bitter compounds (Behrens and Meyerhof, 2018). In contrast to this broad agonist spectrum, there are two receptors, TAS2R38 and TAS2R16, that are specialized in detecting a specific chemical group (thiourea/isothiocyanate and β-D-glucopyranoside, respectively) (Behrens and Meyerhof, 2018). Here we discuss in detail the structural predictions described above to investigate whether they can help understand the molecular basis of this disparate ligand selectivity. In particular, the MM/CG approach has been applied so far to one receptor of each group, i.e., TAS2R46 (Sandal et al., 2015) and TAS2R38 (Biarnés et al., 2010; Marchiori et al., 2013).

The microsecond-long simulations of TAS2R46 in complex with its agonist strychnine (Sandal et al., 2015) showed that the ligand can explore not only one but two different binding cavities (**Figure 3**). The first one coincides with the canonical binding site of class A GPCRs (i.e., the so-called orthosteric site), whereas the second is located further toward the extracellular side and thus has been denoted as "vestibular." The mutagenesis data is compatible with this two-site architecture, as the residues experimentally inferred to be involved in binding (Brockhoff et al., 2010; Born et al., 2013; Sandal et al., 2015) are distributed between the two sites (**Figure 3**). Moreover, the identified vestibular binding cavity overlaps with the extracellular allosteric binding site observed for class A GPCRs (Dror et al., 2011, 2013; Kruse et al., 2012; Abdul-Ridha et al., 2014; Latorraca et al., 2017; Thal et al., 2018), further supporting its existence. This two-step binding architecture may constitute the molecular basis of the "access control" mechanism proposed by Meyerhof and coworkers (Brockhoff et al., 2010) and would help TAS2R46 to discriminate the wide range of ligands recognized by this promiscuous receptor (Sandal et al., 2015). Moreover, a bioinformatics analysis of the binding residues predicted for TAS2R46 across the bitter taste receptor family showed that half of these functionally relevant positions are conserved in two or more TAS2Rs, suggesting that the vestibular site might also be present in other receptors of this family (Sandal et al., 2015). However, the ∼0.8 µs simulations of TAS2R38 in complex with either PTC or PROP showed the ligand bound in a single site, corresponding to the orthosteric one (Marchiori et al., 2013). This hints at the possibility that the vestibular site is not as crucial for a group specific receptor such as TAS2R38 or even that the two-site architecture is not required for a more selective receptor (Suku et al., 2017). 
Naturally, given the crudeness of our models, further simulations and experimental studies on other members of the bitter taste receptor family are needed in order to confirm this proposal.
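Whether a ligand occupies one or two sites along the membrane normal can be assessed from the distribution of its center-of-mass z coordinate over the trajectory, as done for strychnine in Figure 3. The following is a minimal sketch of that analysis, assuming per-frame ligand z coordinates and atomic masses are already available. The function names, bin width and numerical values are illustrative assumptions, and the data are synthetic rather than taken from the simulations.

```python
import math
from collections import Counter


def center_of_mass_z(masses, z_coords):
    """Mass-weighted average of the z coordinates of the ligand atoms in one frame."""
    total_mass = sum(masses)
    return sum(m * z for m, z in zip(masses, z_coords)) / total_mass


def histogram(values, bin_width):
    """Count values per bin; keys are integer bin indices (lower edge = index * bin_width)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return dict(sorted(counts.items()))


# Single-frame example with two atoms (synthetic masses and coordinates).
com = center_of_mass_z([12.0, 16.0], [1.0, 2.0])

# Synthetic per-frame center-of-mass z values (Å) clustering around two positions.
com_z_per_frame = [1.2, 1.7, 1.9, 8.1, 8.4]
distribution = histogram(com_z_per_frame, bin_width=1.0)
print(com, distribution)
```

Two well-separated populated bins, as in this synthetic example, would correspond to two distinct positions along z, whereas a single peak would indicate one site.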

## CONCLUSIONS

Given the scarcity of experimental structural data (Munk et al., 2019), computational modeling of GPCRs is essential to understand ligand binding and design new drugs targeting this biologically and pharmacologically relevant family (Michino et al., 2009; Kufareva et al., 2011, 2014; Cavasotto and Palomba, 2015; Ranganathan et al., 2017; Lupala et al., 2018). These computational approaches (**Figure 4**) include homology modeling and molecular docking, often supplemented with experimental (mutagenesis and ligand structure-activity relationship) data. Subsequent refinement with molecular dynamics simulations has been shown to further improve the computational predictions (Kufareva et al., 2014; Cavasotto and Palomba, 2015; Ranganathan et al., 2017; Lupala et al., 2018). The accuracy of the models thus generated might reach values near the experimental ones for those GPCRs with a close structural template (i.e., with sequence identity larger than 35–40% and a chemically similar ligand) (Kufareva et al., 2011; Beuming and Sherman, 2012). However, for most GPCRs the closest structural template has sequence identity below this threshold, and thus computational predictions become challenging. This is the case for olfactory and bitter taste receptors, which constitute the first and third largest GPCR groups, respectively, as their sequence identity with the available GPCR templates is below 20%.

In this review, we have shown that molecular dynamics simulations, in particular the multiscale molecular mechanics/coarse grained approach developed in our group (Neri et al., 2005, 2008; Leguèbe et al., 2012; Giorgetti and Carloni, 2014; Musiani et al., 2015; Tarenzi et al., 2017), can overcome, at least in part, these limitations (Fierro et al., 2017) and successfully predict residues involved in ligand binding for the three bitter taste receptor complexes studied so far (Biarnés et al., 2010; Marchiori et al., 2013; Sandal et al., 2015). The natural extension of these previous works would be to other bitter taste and olfactory receptors for which experimental data are available. In addition, MM/CG simulations could be easily applied to other GPCRs. Although this approach has been used so far for a limited number of GPCR/ligand complexes (Leguèbe et al., 2012; Marchiori et al., 2013; Sandal et al., 2015), the excellent agreement of the computationally predicted binding poses with the experimental mutagenesis data [for the aforementioned three bitter taste receptor complexes (Marchiori et al., 2013; Sandal et al., 2015)] or the crystal structure [for the β2-adrenergic receptor (Leguèbe et al., 2012)] further supports the applicability of the MM/CG method to other GPCR/ligand complexes. Indeed, MM/CG simulations have been recently used to model the synthetic agonist diphenyleneiodonium chloride (DPI) bound to its target receptor GPR3 (Capaldi et al., 2018). Two of the predicted DPI binding residues were successfully validated a posteriori using mutagenesis and functional assays, as previously done for TAS2R38 (Marchiori et al., 2013) and TAS2R46 (Sandal et al., 2015).

FIGURE 3 | Two binding site architecture of TAS2R46. The agonist strychnine (in licorice representation) can bind in either the orthosteric site (left panel) or the vestibular site (right panel). Receptor residues interacting with the ligand in the orthosteric site, the vestibular site or both sites are shown with blue, yellow or green spheres, respectively. The central panel displays the distribution of the ligand center-of-mass z coordinate for the two ∼1 µs MM/CG simulations, showing that strychnine stabilized either in the orthosteric site or in a second (vestibular) site, located further toward the extracellular side.

## DATA AVAILABILITY

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

The authors acknowledge the Ernesto Illy Foundation (Trieste, Italy) for financial support. We are also grateful to the Jülich-Aachen Research Alliance High Performance Computing for computer time grants JARA0023, JARA0082, and JARA0165. We also thank the Central Library of Forschungszentrum Jülich for the open access publication fees.

### ACKNOWLEDGMENTS

We thank Alejandro Giorgetti, who, together with PC, has been the major driver of this long-term project for the past decade. We are also indebted to all the computational (Xevi Biarnés, Alessandro Marchiori, Luciana Capece, Massimo Sandal, and Francesco Musiani) and experimental (Wolfgang Meyerhof, Maik Behrens, Paolo Gasparini, Stephan Born, Anne Brockhoff, and Carmela Lanzara) collaborators in this project.



**Conflict of Interest Statement:** LN was employed by the company illycafè S.p.A (Trieste, Italy).

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Alfonso-Prieto, Navarini and Carloni. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Building Minimalist Models for Functionalized Metal Nanoparticles

Giorgia Brancolini <sup>1</sup> \* and Valentina Tozzini <sup>2</sup>

*<sup>1</sup> Istituto Nanoscienze–CNR-NANO S3, Modena, Italy, <sup>2</sup> Istituto Nanoscienze–CNR and NEST-Scuola Normale Superiore, Pisa, Italy*

Keywords: coarse grained models, molecular dynamics, brownian dynamics, multiscale simulations, gold nanocrystal, macromolecules aggregation

### THE LANDSCAPE OF COARSE GRAINED NP MODELS

Metal nanoparticles (NPs) have recently been proposed for an increasing number of applications in nano-medicine (Vlamidis and Voliani, 2018) and nanotechnology (Chen et al., 2015). For instance, gold NPs (Alex and Tiwari, 2015) allow versatile covalent functionalization via thiol chemistry (Hakkinen, 2012) with different biomolecules or functional groups to selectively favor interactions with proteins or other specific components of the cell milieu. In particular, thiol-protected gold NPs functionalized with phenyl groups, Au<sub>25</sub>L<sub>18</sub><sup>−</sup> (L = S(CH<sub>2</sub>)<sub>2</sub>Ph), were considered capable of interfering with protein aggregation, and were therefore viewed as possible therapeutic agents against degenerative diseases caused by the accumulation of amyloid fibrils (Brancolini et al., 2014, 2018; Marcinko et al., 2017; Torsten et al., 2018). The optimization of the size and decoration of the NP for therapy can benefit from computer simulations exploring aggregation under different environmental conditions (relative concentration, temperature, ionic strength). However, simulations on such large time and size scales call for the use of super-atomistic representations, i.e., low-resolution or coarse grained (CG) models (Brancolini and Tozzini, 2019).

A number of CG models for proteins are available (Seo et al., 2012), even minimalist ones, i.e., with single-bead per amino-acid resolution and implicit solvent (Di Fenza et al., 2009; Tozzini, 2010; Trovato and Tozzini, 2012; Trovato et al., 2013). Conversely, for the NPs, available CG models are rather sparse and diverse. The presence of the gold core suggests treating it at the meso-scale as a single spheroidal object (Vàcha et al., 2014), but the roughness of the surface (Radic et al., 2015) and the specificity of the chemical decoration (Tavanti et al., 2015a; Cantarutti et al., 2017) have fundamental roles in the interaction with proteins and must be treated at a higher resolution (Brancolini et al., 2015; Tavanti et al., 2015b; Charchar et al., 2016; Cardellini et al., 2019). Particular attention must be paid to the representation of the hydrophobic character of the chemical groups and to the presence of possible net charges, whose medium- and long-range character, respectively, determines the macroscopic aggregation properties of the system. Implicit solvent requires the use of accurate screened potentials to account for the ionic strength. Finally, for the NP model to be compatible with the protein counterpart, both the resolution and the parameterization of the force field (FF) should be well matched.

While these prescriptions are followed individually by specific models in the previous literature (Radic et al., 2015; Charchar et al., 2016), here we outline a general strategy to build NP models that includes all of them. In our view (Brancolini et al., 2018), these should contain the following ingredients: (1) minimalism, i.e., including the minimum possible number of degrees of freedom (DoF) and implicit solvent; (2) compatibility with the protein models; (3) transferability to different sizes and chemical decorations. Clearly, each of these characteristics involves one or more of the following actions: (i) choice of the model structure/topology, (ii) choice of the functional forms for the interactions, (iii) optimization of the parameterization. Actions (ii) and (iii) are complex tasks which have been addressed using a large number of different methodologies (Bauer et al., 2017; Lin et al., 2018; Brancolini et al., submitted). Combinations of bottom-up and top-down strategies (Leonarski et al., 2013; Mereghetti et al., 2016), including both atomistic simulations and experimental data (Trovato and Tozzini, 2014) from different sources (e.g., structural or thermodynamic), are usually particularly effective. Here we focus on a general strategy to address (i) (Brancolini et al., 2018).

### Edited by:

*Edina Rosta, King's College London, United Kingdom*

### Reviewed by:

*Bart De Nijs, University of Cambridge, United States*

\*Correspondence: *Giorgia Brancolini giorgia.brancolini@nano.cnr.it*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *11 March 2019* Accepted: *17 June 2019* Published: *02 July 2019*

#### Citation:

*Brancolini G and Tozzini V (2019) Building Minimalist Models for Functionalized Metal Nanoparticles. Front. Mol. Biosci. 6:50. doi: 10.3389/fmolb.2019.00050*

### RATIONAL BUILDING OF A MINIMALIST NP MODEL

The starting point is an atomistic structure of the functionalized NP (**Figure 1A**). The minimalism requirement suggests using a single large interacting center ("bead") for the gold core, which is, in fact, a feature common to most NP models (Charchar et al., 2016; Shao and Hall, 2017). The chemical decoration is accounted for in several models by covering the central bead with smaller beads (Radic et al., 2015). The compatibility criterion can be satisfied by choosing the number and location of the decoration beads in specific ways. For instance, when the functionalizing groups resemble the side chains of amino-acids in size and shape, this choice is rather straightforward: each functional group can be represented in the same way as the protein amino-acids, i.e., with 2–4 beads in MARTINI-like models (Seo et al., 2012), or with a single bead in the minimalist models (**Figure 1B**). Remarkably, the model will include a number of DoF proportional to the number of functional groups, i.e., the number of DoF will scale with the surface of the NP rather than with its volume.

An important point is how to choose the relative location of the decorating beads. Clearly, the thermal fluctuations of the groups that they represent will determine the spatial distribution of the bead locations, which can be evaluated by means of atomistic simulations of small NPs (Maccari et al., 2014) (**Figure 1C**). The volume map built using this spatial distribution will form lobes, whose centroid and dispersion can be determined by clustering procedures (Arkhipov et al., 2006) (**Figure 1D**). This information can be used to build the starting location and topology of the model, and to parameterize the force field (FF) describing its internal dynamics (**Figure 1E**). Those parameters will then be transferred to larger NPs, once an average position of the functionalizing groups is determined, either from an atomistic model or from structural data (**Figure 1F**).

Distributing masses and effective charges among the beads is a non-trivial point. Considering masses, for instance, an obvious way would be to assign to each bead the sum of the masses of its constituent atoms. This, however, might not preserve the rotational inertia of the NP: since the total mass of the metal core is then concentrated at the center, the core does not contribute to the rotational inertia, and the total rotational inertia comes out too small. The problem can be solved by attributing larger masses to the peripheral beads. The proper balance of masses can be found by imposing that the total mass and the total rotational inertia correspond to those of the atomistic NP (Bauer et al., 2017).
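The mass balance described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes a single core bead at the center of mass, `n_shell_beads` identical ligand beads at a common distance `shell_radius`, and it matches the scalar second moment S = Σ m<sub>i</sub>r<sub>i</sub><sup>2</sup> (half the trace of the inertia tensor) instead of the full tensor. The function name and the toy coordinates and masses are hypothetical.

```python
def balance_masses(atom_masses, atom_positions, n_shell_beads, shell_radius):
    """Return (core_mass, shell_bead_mass) for a core-plus-shell CG model.

    Illustrative assumptions: one core bead at the centre of mass, identical
    shell beads at distance shell_radius, and matching of the total mass and
    of S = sum_i m_i * r_i**2 (half the trace of the inertia tensor).
    """
    total_mass = sum(atom_masses)
    # Centre of mass of the atomistic structure.
    com = [
        sum(m * p[k] for m, p in zip(atom_masses, atom_positions)) / total_mass
        for k in range(3)
    ]
    # Second mass moment about the centre of mass.
    second_moment = sum(
        m * sum((p[k] - com[k]) ** 2 for k in range(3))
        for m, p in zip(atom_masses, atom_positions)
    )
    # The core bead sits at the centre, so only the shell beads carry inertia.
    shell_mass = second_moment / (n_shell_beads * shell_radius ** 2)
    core_mass = total_mass - n_shell_beads * shell_mass
    if core_mass <= 0:
        raise ValueError("shell radius too small to match the inertia")
    return core_mass, shell_mass


# Toy structure (hypothetical numbers): six heavy "core" atoms at distance 1
# and six light "ligand" atoms at distance 3 along the Cartesian axes.
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
atom_positions = [a for a in axes] + [tuple(3 * c for c in a) for a in axes]
atom_masses = [20.0] * 6 + [5.0] * 6

core_mass, shell_mass = balance_masses(
    atom_masses, atom_positions, n_shell_beads=6, shell_radius=3.0
)
print(core_mass, shell_mass)
```

In this toy case, simply summing atomic masses would give each ligand bead a mass of 5, whereas the balanced assignment gives the peripheral beads a larger mass, in line with the argument in the text.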

The problem of charges is analogous: in this case an accurate charge distribution might be adjusted to reproduce the electrostatic potential, besides the net charge. The reference electrostatic potential can be generated from the RESP-derived atomistic charges (Heaven et al., 2008), based on ab initio calculations (**Figures 1G,H**). Deriving the CG charges based on the atomistic components (Baker et al., 2001; Terakawa and Takada, 2014; McCullagh et al., 2016) results in effective charges depending on the bead type (gold or ligand) and symmetry (**Figure 1I**). The electrostatic potential generated by these can be compared with its atomistic counterpart, showing that the general shape of the iso-surfaces is preserved (**Figure 1J**): although of course the atomistic detail is lost, the CG model reproduces the global net prevalence of negative character (in blue), which however uncovers some positive areas (in red) for given directions, as in the atomistic case.
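One simple reading of "deriving the CG charges based on the atomistic components" is to sum the atomistic partial charges of the atoms mapped onto each bead and then average over beads of the same type. The sketch below illustrates only that reading and is an assumption, not the fitting used by the authors; the bead labels, the atom-to-bead mapping, and the charge values are hypothetical.

```python
from collections import defaultdict


def bead_charges(atom_charges, atom_to_bead):
    """Sum the atomistic partial charges of the atoms mapped onto each bead."""
    totals = defaultdict(float)
    for charge, bead in zip(atom_charges, atom_to_bead):
        totals[bead] += charge
    return dict(totals)


def symmetrize(charges, bead_types):
    """Give every bead of a given type the mean charge of that type."""
    groups = defaultdict(list)
    for bead, charge in charges.items():
        groups[bead_types[bead]].append(charge)
    means = {kind: sum(vals) / len(vals) for kind, vals in groups.items()}
    return {bead: means[bead_types[bead]] for bead in charges}


# Hypothetical toy system: three core atoms and two ligands of two atoms each.
atom_charges = [0.10, -0.25, 0.05, -0.40, 0.05, -0.50, -0.05]
atom_to_bead = ["core", "core", "core", "L1", "L1", "L2", "L2"]
bead_types = {"core": "gold", "L1": "ligand", "L2": "ligand"}

cg_charges = bead_charges(atom_charges, atom_to_bead)
cg_symmetric = symmetrize(cg_charges, bead_types)
print(cg_symmetric)
```

Both steps conserve the net charge by construction; whether the resulting electrostatic potential is close enough to the atomistic one still has to be checked separately, as the text describes.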

### SUMMARY AND PERSPECTIVES: THE NEXT STEPS

In our opinion, the presented strategy includes all the crucial elements of an optimal low resolution model: the choice of the minimal possible resolution, compatibility between different levels of resolution, and a parameterization including the specific coating present on the NP by means of superficial higher resolution interacting sites. The effective charges could be further optimized by directly adopting a RESP procedure for their fitting. This task and the validation of the model against the aggregation tendency at different concentrations and ionic strengths are currently under way (Brancolini et al., submitted). The next steps will be the use of the model in combination with protein models at the same (minimalist) CG level, to verify the effective capability of the NPs to prevent amyloid aggregation. Furthermore, the strategy outlined here is extensible to larger NPs and different functionalizations, which opens the possibility of in silico optimization of the NP size and chemistry for therapeutic use.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

The authors acknowledge SEED project granted by CNR-Istituto Nanoscienze, Italy (GAE PUSEED04), titled LOPE-DeveLopment of a Coarse Grained MOdel for Nanoparticle-Protein IntEractions for financial support.

### ACKNOWLEDGMENTS

The authors wish to thank Dr. Hender Lopez for useful discussions. Oak Ridge National Laboratory is acknowledged for the supercomputing project CNMS2018-338, through the Scientific User Facilities Division, Office of Basic Energy Sciences, U.S. Department of Energy. Facilities of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, are acknowledged.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Brancolini and Tozzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Multi-Scale Approach to Membrane Remodeling Processes

Weria Pezeshkian<sup>1</sup>, Melanie König<sup>1</sup>, Siewert J. Marrink<sup>1</sup> and John H. Ipsen<sup>2</sup>\*

<sup>1</sup> Groningen Biomolecular Sciences and Biotechnology Institute and Zernike Institute for Advanced Materials, University of Groningen, Groningen, Netherlands, <sup>2</sup> Department of Physics, Chemistry and Pharmacy, Center for Biomembrane Physics (MEMPHYS), University of Southern Denmark, Odense, Denmark

We present a multi-scale simulation procedure to describe membrane-related biological processes that span a wide range of length scales. At the macroscopic length scale, a membrane is described as a flexible thin film modeled by a dynamic triangulated surface, with its spatial conformations governed by an elastic energy containing only a few model parameters. An implicit protein model allows us to include complex effects of membrane-protein interactions in the macroscopic description. The gist of this multi-scale approach is a scheme to calibrate the implicit protein model using finer scale simulation techniques, e.g., all-atom and coarse-grained molecular dynamics. We previously used this approach to describe the formation of membrane tubular invaginations upon binding of the B-subunit of Shiga toxin. Here, we provide a perspective of our multi-scale approach, summarizing its main features and sketching possible routes for future development.

### Edited by:

Valentina Tozzini, Nanosciences Institute, National Research Council, Italy

### Reviewed by:

Riccardo Nifosì, Italian National Research Council (CNR), Italy Nicola Maria Pugno, University of Trento, Italy Jeffery Klauda, University of Maryland, College Park, United States

\*Correspondence:

John H. Ipsen ipsen@memphys.sdu.dk

### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 13 May 2019 Accepted: 08 July 2019 Published: 23 July 2019

#### Citation:

Pezeshkian W, König M, Marrink SJ and Ipsen JH (2019) A Multi-Scale Approach to Membrane Remodeling Processes. Front. Mol. Biosci. 6:59. doi: 10.3389/fmolb.2019.00059

Keywords: dynamic triangulated surfaces, Martini coarse-grain simulation, Shiga toxin, simulation of continuum model, membrane remodeling, implicit protein model

### INTRODUCTION

Many biological processes involve large scale changes in the lateral chemical organization and geometrical shapes of biological membranes (McMahon and Boucrot, 2015; Chavent et al., 2018). The modeling of these processes by computer simulation is a challenging task, since they typically involve a wide range of length and time scales that cannot be captured in full by any single current simulation technique (Enkavi et al., 2019; Marrink et al., 2019). At large length scales, computational and analytical techniques based on continuum models have played a great role in our understanding of these processes and have revealed many important generic phenomena (Seifert et al., 1991; Bozic et al., 1992; Ramakrishnan et al., 2013, 2015). Nevertheless, these predictions are often obscured by the simplicity of the model and by the approximations needed to make them mathematically tractable. In addition, such phenomenological models contain a few model parameters that are typically hard to relate to their molecular origin. At small length scales, particle-based computer simulation techniques, e.g., molecular dynamics (MD) and dissipative particle dynamics (DPD), are robust techniques to elucidate complex membrane behaviors, but have a limited capacity to predict large length scale cooperative phenomena (Gao et al., 2007; Li et al., 2016; Enkavi et al., 2019; Marrink et al., 2019). To overcome these limitations, we have used a multi-scale simulation procedure that bridges the gap between the particle-based and continuum-based models and allows the simulation of large biological membrane patches while retaining details from the atomistic length scale (Pezeshkian et al., 2016). Here, we summarize the main features of the method, extend its capacity to describe a wider range of processes, and sketch possible routes for further development.

### METHODS

In our multi-scale approach, the large-scale physical properties of a membrane are described by a coarse-grained model which captures the elastic energy of membrane conformations and the energetics of the lateral organization of its chemical constituents. Such a model only contains a few model parameters which are calibrated using atomistic and mesoscopic simulations (Marrink et al., 2007).

### Simulation of Continuum Model

A continuous membrane is discretized by a dynamical triangulated surface (DTS) containing N<sub>υ</sub> vertices, N<sub>T</sub> triangles, and N<sub>L</sub> links, which together form an irregular planar triangulated network (**Figure 1A**). The difference between dynamical and static triangulation is that the mutual link between two neighboring triangles can flip (Alexander moves). This allows sampling of all possible triangulations for a given N<sub>υ</sub>, N<sub>T</sub>, N<sub>L</sub>. Link flipping and positional updates of the vertices give the surface its fluid character, with full translational invariance in the plane of the surface (**Figure 1B**). In this representation, a vertex can be visualized as a segment of a bilayer containing hundreds of lipids, which means that the resolution of the model is limited to length scales above a few nanometers. To ensure self-avoidance of the surface, each vertex is equipped with a spherical bead. Using a set of discretized geometrical operations, each vertex is furthermore assigned a normal vector $\hat{\mathbf{N}}\_\upsilon$, a surface area A<sub>υ</sub> (one third of the area of its neighboring triangles), principal curvatures (c<sub>1υ</sub>, c<sub>2υ</sub>), and principal directions (**X**<sub>1</sub>(υ), **X**<sub>2</sub>(υ)) (Ramakrishnan et al., 2010) (**Figure 1A**). This suffices to construct an elastic energy function associated with membrane bending that allows us to obtain the surface equilibrium configurations using numerical update algorithms. In this work, we have employed the Metropolis Monte Carlo algorithm (Ramakrishnan et al., 2010; Bahrami et al., 2012; van der Wel et al., 2016), but many other updating schemes are possible (Noguchi and Takasu, 2001; Cooke et al., 2005; Noguchi and Gompper, 2006; Peng et al., 2013; Mauer et al., 2018).
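The link flip that distinguishes a dynamical from a static triangulation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that triangles are stored as consistently oriented vertex triples, and the function names `flip_link` and `links` are hypothetical. In a full simulation the flip would be proposed as a Monte Carlo move and accepted or rejected with the Metropolis criterion.

```python
def links(triangles):
    """Return the set of undirected links of a triangulation."""
    found = set()
    for tri in triangles:
        for i in range(3):
            found.add(frozenset((tri[i], tri[(i + 1) % 3])))
    return found


def flip_link(triangles, a, b):
    """Flip the link (a, b) shared by two consistently oriented triangles.

    The two triangles (a, b, c) and (b, a, d) are replaced by (a, d, c) and
    (d, b, c), so the link a-b is removed and the link c-d is created.
    """
    def has_directed_edge(tri, u, v):
        return any(tri[i] == u and tri[(i + 1) % 3] == v for i in range(3))

    first = [t for t in triangles if has_directed_edge(t, a, b)]
    second = [t for t in triangles if has_directed_edge(t, b, a)]
    if len(first) != 1 or len(second) != 1:
        raise ValueError("link is not shared by exactly two triangles")
    c = next(v for v in first[0] if v not in (a, b))
    d = next(v for v in second[0] if v not in (a, b))
    if frozenset((c, d)) in links(triangles):
        raise ValueError("flip would duplicate an existing link")
    untouched = [t for t in triangles if t not in (first[0], second[0])]
    return untouched + [(a, d, c), (d, b, c)]


# Two triangles sharing the link 0-1; vertices 2 and 3 lie on opposite sides.
before = [(0, 1, 2), (1, 0, 3)]
after = flip_link(before, 0, 1)
print(after)
```

The move conserves the numbers of vertices, triangles, and links, which is why repeated flips explore different triangulations at fixed N<sub>υ</sub>, N<sub>T</sub>, N<sub>L</sub>.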

### Elastic Energy

The Helfrich Hamiltonian (Helfrich, 1973) is the classic approach to describe membrane shape phenomena. The membrane elastic energy E<sub>b</sub> can be expressed in terms of two surface invariants, the mean curvature H = 0.5(c<sub>1</sub> + c<sub>2</sub>) and the Gaussian curvature K = c<sub>1</sub>c<sub>2</sub>. A discretized form of the Helfrich Hamiltonian can be written as:

$$E\_b = \frac{\kappa}{2} \sum\_{1}^{N\_{\upsilon}} \left( 2H\_{\upsilon} - \overline{C}\_0 \right)^2 A\_{\upsilon} + \kappa\_G \sum\_{1}^{N\_{\upsilon}} K\_{\upsilon} A\_{\upsilon} \tag{1}$$

The second term of this equation only depends on the surface topology and does not change under continuous membrane deformation (Gauss-Bonnet theorem). The mean curvature elastic constant κ is called the bending elasticity and carries the dimension of energy. The constant $\overline{C}\_0$ is called the spontaneous curvature and represents a possible asymmetry between the two monolayers, e.g., differing solvent conditions; $\overline{C}\_0 = 0$ for a symmetric membrane. Equation (1) can be expanded in numerous ways depending on the membrane process at play. For example, for processes where a significant part of the total membrane surface undergoes deformations much faster than the flip-flop rate of any monolayer chemical component, a monolayer area difference elastic term must be included (Seifert et al., 1991; Bozic et al., 1992). The difference in the area of the monolayers can be obtained as

$$
\Delta A = h \sum\_{\upsilon}^{N\_{\upsilon}} 2H\_{\upsilon} A\_{\upsilon} \tag{2}
$$

where h is the membrane thickness. Up to second order, the area-difference elastic energy is expressed as $E\_s = \frac{k\_r}{2h^2 A\_0}\left(\Delta A - \Delta A\_0\right)^2$, with $k\_r$ denoting the area compression modulus (Svetina and Žekš, 2014). Another relevant energy term that can be included is the elastic energy associated with a change in the volume (V) of a closed surface (vesicle), $E\_V = \frac{K\_V}{2V\_0}\left(V - V\_0\right)^2$, where both the volume compression modulus $K\_V$ and the equilibrium volume $V\_0$ are set by the osmotic conditions of the solvents in an experiment. For a triangulated surface, the volume can be easily obtained as

$$V = \frac{1}{3} \sum\_{T=1}^{N\_T} (\overrightarrow{\mathbf{R}}\_T \mathbf{\hat{N}}\_T) A\_T \tag{3}$$

Here, $\overrightarrow{\mathbf{R}}\_T$ is the position of any point on the triangle T, and $\hat{\mathbf{N}}\_T$ and $A\_T$ are the normal vector and area of the oriented triangle T, respectively. For the analysis of bounded membrane patches or semi-flat membranes in a periodic box, a contribution $\tau A\_p$ to the energy in Equation (1) becomes important, where $A\_p$ and τ are the projected area and frame tension of the membrane, respectively.
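Equations (1)–(3) translate directly into short sums over vertices and triangles. The Python sketch below is an illustration under stated assumptions, not the authors' code: the per-vertex curvatures and areas are taken as given inputs, the function names are hypothetical, and the two test surfaces (a sphere with uniform vertex data and a regular octahedron) are chosen only because their energy, area difference, and volume are known analytically.

```python
import math


def bending_energy(mean_curv, gauss_curv, areas, kappa, kappa_g, c0=0.0):
    """Discretized Helfrich energy of Equation (1), summed over vertices."""
    return sum(
        0.5 * kappa * (2.0 * h - c0) ** 2 * a + kappa_g * k * a
        for h, k, a in zip(mean_curv, gauss_curv, areas)
    )


def area_difference(mean_curv, areas, thickness):
    """Monolayer area difference of Equation (2)."""
    return thickness * sum(2.0 * h * a for h, a in zip(mean_curv, areas))


def enclosed_volume(triangles):
    """Volume of a closed, outward-oriented triangulation, Equation (3)."""
    volume = 0.0
    for a, b, c in triangles:
        u = [b[k] - a[k] for k in range(3)]
        v = [c[k] - a[k] for k in range(3)]
        # u x v equals 2 * A_T * N_T for the oriented triangle (a, b, c).
        cross = (
            u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0],
        )
        volume += sum(a[k] * cross[k] for k in range(3)) / 6.0
    return volume


# Sphere of radius R with uniform vertex data: H = 1/R, K = 1/R^2.
radius, n_vertices = 10.0, 500
H = [1.0 / radius] * n_vertices
K = [1.0 / radius ** 2] * n_vertices
A = [4.0 * math.pi * radius ** 2 / n_vertices] * n_vertices
sphere_energy = bending_energy(H, K, A, kappa=20.0, kappa_g=-10.0)
sphere_delta_a = area_difference(H, A, thickness=4.0)

# Regular octahedron with vertices at +/-1 on the axes, faces oriented outward.
octahedron = []
for sx in (1, -1):
    for sy in (1, -1):
        for sz in (1, -1):
            a, b, c = (sx, 0, 0), (0, sy, 0), (0, 0, sz)
            octahedron.append((a, b, c) if sx * sy * sz > 0 else (a, c, b))
octahedron_volume = enclosed_volume(octahedron)
print(sphere_energy, sphere_delta_a, octahedron_volume)
```

For the sphere with zero spontaneous curvature the bending energy reduces to 8πκ + 4πκ<sub>G</sub>, independent of the radius, which provides a convenient sanity check for any implementation of Equation (1).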

When we are dealing with membranes with highly curved regions, e.g., the formation of narrow necks prior to scission during a fission process, Equation (1) requires modification. In these regions, the curvatures of the two monolayers can be significantly different. A practical approach to include this mismatch is to treat the bending energy associated with each monolayer separately. Using mid-plane principal curvatures, the mean curvature of each of the monolayers can be determined as (Safran, 1994):

$$H\_{\text{up}} = \frac{H + 2Kh}{1 + hH + Kh^2}, \ H\_{\text{low}} = \frac{-H + 2Kh}{1 - hH + Kh^2} \tag{4}$$

### Implicit Protein Model

Membrane proteins can locally influence bilayer shape through direct and indirect couplings. Direct impacts include local rigidification (Zhang et al., 2015), local membrane curvature imprint (Pezeshkian et al., 2017b; Corradi et al., 2018; Wang et al., 2018), local change in membrane thickness (Corradi et al., 2018), etc. Indirect effects arise from their interactions with other proteins that have the capacity to affect the membrane shape through cooperative phenomena. In our multiscale simulation approach, these couplings are identified and quantified through atomistic and mesoscopic simulations and they are included in the system energy as new terms added to Equation (1). In the modeling, a protein or nanoparticle (an inclusion) is assigned to a vertex in the triangulation. Each vertex can be occupied by at most one inclusion, which naturally handles the in-plane excluded volume effect between inclusions. It also introduces a natural length scale to the model, since we can associate the smallest possible area of a vertex with the projected area of the inclusion in question. Inclusions can move laterally through updates of the triangulation or by jumps between neighboring vertices via Kawasaki moves (**Figure 1C**).

*Figure caption (partial):* two-dimensional vector in the plane of a vertex (bottom). The angle between the protein direction and the membrane main principal direction.

When an inclusion is situated in a vertex, it may change the elastic energy contribution from the vertex. For membrane proteins, the simplest model is to locally increase the membrane bending rigidity (Frolov and Zimmerberg, 2008; Schweitzer et al., 2015). The most important effect of membrane proteins, which greatly influences the large-scale membrane shape, is to induce a local membrane curvature (Kozlov et al., 2014). This induced curvature can be in-plane rotationally symmetric or asymmetric. As a consequence of Euler's curvature formula, vertex-based inclusions, except π-symmetric inclusions (symmetric upon rotation by 180 degrees in the plane of the membrane, **Figure 1D**), can only induce symmetric curvature (Peliti and Prost, 1989). It may seem that this is a shortcoming of the model. Nevertheless, highly asymmetric curvature imprints decay quickly in the membrane plane (Dasgupta et al., 2017; Corradi et al., 2018) and do not appear in a macroscopic membrane model. The impact of these inclusions can be modeled by adding a local energy contribution $e\_\upsilon = -\kappa H C\_0 A\_\upsilon$ to the bending energy per vertex, where $C\_0$ is the local curvature imprint of the protein and needs calibration from finer scale simulations. Notice that $C\_0$ can only be identified with $\overline{C}\_0$ in Equation (1) for a fully covered membrane. π-symmetric inclusions can locally bend the membrane differently in different directions (Frolov and Zimmerberg, 2008). Such inclusions can thus be given an orientation in the plane, along the direction with maximal directional curvature imprint ($C\_0^{\parallel}$), while the perpendicular direction in the plane gives the lowest directional curvature imprint ($C\_0^{\perp}$) (**Figure 1D**). The membrane curvature in these directions can easily be obtained from Euler's curvature formula, $C^{\parallel} = c\_{1\upsilon}\cos^2\theta + c\_{2\upsilon}\sin^2\theta$ and $C^{\perp} = c\_{1\upsilon}\sin^2\theta + c\_{2\upsilon}\cos^2\theta$, where θ is the angle between the orientation of the inclusion and the direction of the main principal curvature of the membrane. Such an inclusion gives rise to an additional local contribution to the total elastic energy in Equation (1), $e\_\upsilon = \left[\frac{\kappa\_1}{2}\left(C^{\perp} - C\_0^{\perp}\right)^2 + \frac{\kappa\_2}{2}\left(C^{\parallel} - C\_0^{\parallel}\right)^2\right]A\_\upsilon$, where $\kappa\_1$ and $\kappa\_2$ are the directional bending rigidities imposed by the inclusion on the membrane.

To complete the modeling, we need to include interactions between the inclusions. Here, we focus only on pair interactions, although the method can be extended to multi-body interactions. The pair interactions between inclusions can be divided into two types: (i) interactions that are a function of the distance between the proteins in 3-dimensional space, e.g., electrostatic and van der Waals forces, and (ii) interactions that are a function of the distance along the geodesic direction between two inclusions in the membrane, e.g., membrane-mediated interactions (Haselwandter and Wingreen, 2014; Johannes et al., 2018). The former type of interaction can be modeled simply by a constant interaction energy when two inclusions are in proximity in 3D space. This is a practical and valid choice, since the resolution of the model is too coarse to capture protein-specific interactions. The second type of interaction is more challenging, since it depends on the local curvature of the membrane. A particular consequence of this is that the interaction between two neighboring non-isotropic inclusions can only be calculated after parallel transport between them, in which the in-plane orientation of the inclusion is kept fixed along their mutual geodesic curve (Ramakrishnan et al., 2010). The interaction between two inclusions on neighboring vertices is only a function of the angle between their in-plane orientations along the geodesic direction, $\Delta\Theta = \Theta\_i - \Theta'\_j$, where $\Theta\_i$ is the orientation of the inclusion residing on vertex i and $\Theta'\_j$ represents the orientation of the inclusion residing on vertex j after parallel transport to vertex i. This energy function can be written in terms of a Fourier series as

$$\varepsilon\_{ij}\left(\Delta\Theta\right) = -\varepsilon\_0 - \mu\_0 \sum\_{k=1}^{M} \frac{a\_k}{M} \cos\left[kQ\_{ij}\Delta\Theta + \Xi\_k\right] \tag{5}$$

The first term (−ε<sub>0</sub>) models the isotropic part of the interaction between two inclusions, while the second term models anisotropic interactions, e.g., caused by steric factors and the distribution of the peptide groups in a protein (Domanski et al., 2017). M is a constant integer and its value depends on the chosen degree of coarse graining. A larger M allows more structural details of the protein shape to be included in the interactions with other proteins. Ξ<sub>k</sub> are the phase shifts and µ<sub>0</sub>a<sub>k</sub>/M are the amplitudes of the Fourier modes; both need fitting from finer simulation techniques. By setting $\sum\_{k=1}^{M} a\_k = M$, µ<sub>0</sub> can be defined as the lowest energy level of the anisotropic part of the interaction. Q<sub>ij</sub> is the least common multiple of the degrees of in-plane symmetry (N) of proteins i and j. Note that the interaction energy in Equation (5) can also be used to model lipid domain formation in multicomponent membranes (Ramakrishnan et al., 2010; Hansen et al., 2017).
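Equation (5) can be evaluated with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, the parameter values are placeholders rather than calibrated numbers, and the angle difference ΔΘ is assumed to have been computed already (i.e., after parallel transport).

```python
import math


def symmetry_factor(n_i, n_j):
    """Least common multiple of the in-plane symmetry degrees, Q_ij."""
    return n_i * n_j // math.gcd(n_i, n_j)


def pair_energy(delta_theta, eps0, mu0, amplitudes, phases, q_ij):
    """Orientation-dependent pair energy of Equation (5).

    amplitudes holds a_1..a_M and phases holds Xi_1..Xi_M; M is their length.
    """
    m = len(amplitudes)
    anisotropic = sum(
        (a_k / m) * math.cos(k * q_ij * delta_theta + xi_k)
        for k, (a_k, xi_k) in enumerate(zip(amplitudes, phases), start=1)
    )
    return -eps0 - mu0 * anisotropic


# Placeholder parameters: two five-fold symmetric inclusions, one Fourier mode.
q = symmetry_factor(5, 5)
e_aligned = pair_energy(0.0, eps0=2.5, mu0=1.0, amplitudes=[1.0], phases=[0.0], q_ij=q)
e_rotated = pair_energy(0.3, eps0=2.5, mu0=1.0, amplitudes=[1.0], phases=[0.0], q_ij=q)
e_shifted = pair_energy(
    0.3 + 2.0 * math.pi / 5, eps0=2.5, mu0=1.0, amplitudes=[1.0], phases=[0.0], q_ij=q
)
print(e_aligned, e_rotated, e_shifted)
```

Because every mode contains the factor Q<sub>ij</sub>, the energy is unchanged when one inclusion is rotated by 2π/Q<sub>ij</sub>, which is how the in-plane symmetry of the proteins enters the model.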

Different approaches can be used to model proteins on triangulated surfaces, e.g., introducing a curvature field and an additional length scale to the model (Tourdot et al., 2014). However, we prefer our procedure, since it allows the calibration of all parameters solely through a bottom-up approach. This increases the predictive power of the model without the need to tune the input parameters to reach the expected outcome.
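The single-inclusion terms introduced earlier in this section (Euler's curvature formula and the two local energy contributions) can be sketched as follows. The function names and numerical values are hypothetical, and the principal curvatures and vertex area are assumed to be supplied by the triangulation.

```python
import math


def directional_curvatures(c1, c2, theta):
    """Euler's formula: curvature along and perpendicular to the inclusion.

    theta is the angle between the inclusion orientation and the direction
    of the main principal curvature c1.
    """
    cos2, sin2 = math.cos(theta) ** 2, math.sin(theta) ** 2
    c_par = c1 * cos2 + c2 * sin2
    c_perp = c1 * sin2 + c2 * cos2
    return c_par, c_perp


def symmetric_inclusion_energy(kappa, mean_curv, c0, area):
    """Local term -kappa * H * C0 * A for a rotationally symmetric inclusion."""
    return -kappa * mean_curv * c0 * area


def anisotropic_inclusion_energy(c1, c2, theta, c0_par, c0_perp,
                                 kappa1, kappa2, area):
    """Local term for a pi-symmetric inclusion with directional imprints."""
    c_par, c_perp = directional_curvatures(c1, c2, theta)
    return (0.5 * kappa1 * (c_perp - c0_perp) ** 2
            + 0.5 * kappa2 * (c_par - c0_par) ** 2) * area


# Placeholder numbers: a vertex curved only along its first principal direction.
c1, c2 = 0.2, 0.0
aligned = anisotropic_inclusion_energy(c1, c2, 0.0, 0.2, 0.0, 10.0, 10.0, 1.0)
rotated = anisotropic_inclusion_energy(c1, c2, math.pi / 2, 0.2, 0.0, 10.0, 10.0, 1.0)
print(aligned, rotated)
```

In this toy case the energy vanishes when the inclusion is aligned with the direction in which the membrane curvature matches its imprint, and it is positive when the inclusion is rotated by 90 degrees, so the term favors alignment of the inclusion with the matching principal direction.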

### Calibration

To start a DTS simulation for a membrane containing different lipids and proteins, all the mentioned model parameters need to be calibrated using results from experiments or from simulations at finer scales. Below we discuss several of these parameters (κ, $C\_0^{\parallel}$, $C\_0^{\perp}$, ε<sub>0</sub>, µ<sub>0</sub>, N, a<sub>k</sub>, Ξ<sub>k</sub>).

Bending rigidity κ: Bending rigidity is known for many one-component lipid bilayers from both experiments and simulations. For new lipid bilayers, fluctuation spectrum analysis is a powerful technique to extract this parameter. Both coarse-grained and all-atom MD simulations can be used to calibrate it (Brandt et al., 2011; Watson et al., 2012; Venable et al., 2015).

Local curvature imprint ($C\_0^{\parallel}$, $C\_0^{\perp}$): All-atom MD simulation has proven successful for the calibration of these model parameters (Pezeshkian et al., 2016, 2017b; Kociurzynski et al., 2019). From an MD simulation trajectory, membrane curvature can be measured using different approaches. An accurate method is to use the first moment of the lateral membrane pressure profile, $\kappa C\_0 = \int z\,\Pi(z)\,dz$ (Safran, 1994). However, this approach has several problems. First, a converged lateral pressure profile requires very long simulations, even for pure membrane systems. Second, it only provides the mean value of the induced curvature ($C\_0^{\parallel} + C\_0^{\perp}$) unless the protein orientation is restricted (Bruhn et al., 2016; Ali Doosti et al., 2017). The second method is a geometrical approach and consists of fitting the upper and lower monolayers of the membrane to an analytical function and calculating the time-averaged curvature map on the surface of the bilayer. Note that, since the typical radius of the curvature induced by proteins is much larger than a feasible MD simulation box size, the total average curvature of the fitted surface is zero. Therefore, one should only average the curvature of the surface up to the distance from the center of the protein within which the presence of the protein changes the lipid density (Pezeshkian et al., 2016, 2017b; Corradi et al., 2018).
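The geometrical approach can be illustrated with a short Python sketch. It assumes that the fitted monolayer surface is already available as a height function z = h(x, y) centered on the protein, evaluates the mean curvature with the standard Monge-gauge expression using central finite differences, and averages it over grid points within a cutoff distance. The function names, the grid-based averaging, the cutoff value, and the spherical-cap test surface are assumptions made for illustration, not the published protocol; the sign of the curvature depends on the chosen normal direction.

```python
import math


def mean_curvature(height, x, y, step=1e-3):
    """Mean curvature of the surface z = height(x, y) in the Monge gauge."""
    hx = (height(x + step, y) - height(x - step, y)) / (2 * step)
    hy = (height(x, y + step) - height(x, y - step)) / (2 * step)
    hxx = (height(x + step, y) - 2 * height(x, y) + height(x - step, y)) / step ** 2
    hyy = (height(x, y + step) - 2 * height(x, y) + height(x, y - step)) / step ** 2
    hxy = (height(x + step, y + step) - height(x + step, y - step)
           - height(x - step, y + step) + height(x - step, y - step)) / (4 * step ** 2)
    numerator = (1 + hy ** 2) * hxx - 2 * hx * hy * hxy + (1 + hx ** 2) * hyy
    return numerator / (2 * (1 + hx ** 2 + hy ** 2) ** 1.5)


def average_curvature_within(height, cutoff, n_grid=15):
    """Average the mean curvature over grid points closer than cutoff to the origin."""
    values = []
    for i in range(n_grid):
        for j in range(n_grid):
            x = -cutoff + 2 * cutoff * i / (n_grid - 1)
            y = -cutoff + 2 * cutoff * j / (n_grid - 1)
            if x * x + y * y <= cutoff ** 2:
                values.append(mean_curvature(height, x, y))
    return sum(values) / len(values)


def cap(x, y, radius=10.0):
    """Test surface: upper cap of a sphere, with mean curvature -1/radius."""
    return math.sqrt(radius ** 2 - x ** 2 - y ** 2)


cap_curvature = average_curvature_within(cap, cutoff=3.0)
print(cap_curvature)
```

For a real trajectory the height function would come from a fit to the lipid positions of each monolayer, and the cutoff would be chosen from the range over which the protein perturbs the lipid density.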

Protein-protein interaction parameters ($\varepsilon_0$, $\mu_0$, $N$, $a_k$, $\Delta_k$): An efficient approach to calibrate these parameters is to use coarse-grained MD or DPD simulations. Two requirements make these simulations demanding. First, large simulation boxes are typically needed, because the system must be large enough that the proteins do not interact (including through membrane-mediated interactions) with their periodic images. Second, a long simulation is required to disentangle the diffusive approach of the proteins from their systematic interaction. In addition, mesoscale simulations allow us to derive a potential of mean force (PMF) profile that can be used to calibrate ($\varepsilon_0$, $\mu_0$) (de Meyer et al., 2008; Periole et al., 2012; Domanski et al., 2017). The in-plane symmetry of the protein structure ($N$) can be found from the crystal structure. $a_k$ and $\Delta_k$ can be calibrated either from the density map or from the free energy profile as a function of the angle between the proteins.
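One simple way to obtain a PMF from an unbiased trajectory is Boltzmann inversion of the protein-protein distance histogram; this is not necessarily the method used in the cited studies, which may rely on biased sampling instead. The sketch below assumes a quasi-two-dimensional membrane (hence the $2\pi r$ Jacobian), equal-width bins with non-zero counts, and synthetic data. The function name and the numbers are hypothetical.

```python
import math


def pmf_from_histogram(r_centers, counts, kT=1.0):
    """Boltzmann-invert a 2D distance histogram.

    W(r) = -kT * ln[counts / (2 * pi * r)], shifted so that W is zero
    at the largest sampled distance.
    """
    w = [-kT * math.log(n / (2.0 * math.pi * r))
         for r, n in zip(r_centers, counts)]
    offset = w[-1]
    return [value - offset for value in w]


# Synthetic histogram built from a known PMF with a 2.5 kT well.
r = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]      # bin centers, nm
w_true = [-2.5, -1.5, -0.5, -0.1, 0.0, 0.0]
counts = [1000.0 * 2.0 * math.pi * ri * math.exp(-wi)
          for ri, wi in zip(r, w_true)]

pmf = pmf_from_histogram(r, counts)
well_depth = -min(pmf)   # one possible estimate of the interaction strength
print(well_depth)
```

Reading the well depth as the interaction strength is itself a modeling choice; how the PMF is mapped onto ($\varepsilon_0$, $\mu_0$) depends on the model definition.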

### Example: Shiga Toxin Induced Tubular Membrane Invaginations

The bacterial Shiga toxin is a member of the AB<sub>5</sub> protein family and is composed of an enzymatically active A-subunit and a receptor-binding B-subunit (STxB). STxB is homopentameric and mediates intracellular toxin trafficking by binding to the glycolipid globotriaosylceramide (Gb3) at the plasma membrane of target cells. Shiga toxin can enter the cell by both clathrin-dependent and clathrin-independent endocytosis. The formation of tubular membrane invaginations is the first step in the clathrin-independent STxB uptake (Römer et al., 2007). Previously, we used this multi-scale simulation approach to describe the formation of tubular membrane invaginations upon STxB binding. Here we briefly discuss the scheme and results.


isotropic strength of the pair interaction, is around 2.5 $k_BT$ (**Figure 2A**). STxB is a pentamer ($2\pi/5$-symmetric), therefore $N = 5$. Based on these numbers, we defined the simplest form of the interaction energy as $\varepsilon_{ij} = -2.5 + (1 + \cos 5[\Theta_i - \Theta'_j])$ in units of $k_BT$.

• Using the above input parameters, we performed a Monte Carlo simulation of DTS in the constant frame tension ensemble ($\tau = 0$) and could reproduce the behavior seen in the experimental setups, namely the formation of a tubular membrane invagination (Römer et al., 2007). We also found the minimum requirements for the formation of tubular membrane invaginations: (1) the capacity of the individual proteins to induce local membrane curvature and (2) their ability to cluster, by any means, upon binding to the membrane (**Figure 2A**) (Pezeshkian et al., 2016).
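The pair interaction energy quoted above, $-2.5 + (1 + \cos 5[\Theta_i - \Theta'_j])$ in units of $k_BT$, can be written as a short function. This is a minimal sketch: the function and argument names are hypothetical, and the primed angle is treated simply as the in-plane orientation of protein j, which is an assumption because its definition is not given in this section.

```python
import math


def pair_energy(theta_i, theta_j, eps0=2.5, n_fold=5):
    """Pair interaction energy in units of kT.

    eps0 is the isotropic interaction strength and n_fold the in-plane
    symmetry of the protein (N = 5 for the pentameric STxB).
    """
    return -eps0 + (1.0 + math.cos(n_fold * (theta_i - theta_j)))


# The energy ranges from -2.5 kT (most favorable relative orientation)
# to -0.5 kT, and repeats every 2*pi/5 because of the five-fold symmetry.
e_min = pair_energy(0.0, math.pi / 5.0)
e_max = pair_energy(0.0, 0.0)
print(e_min, e_max)
```

With this form the interaction is attractive for every relative orientation, so the angular term only modulates how strongly two neighboring proteins bind.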

### BACK-MAPPING TO CG MODEL

The main assumption of this multi-scale simulation approach is that the local properties of the membrane are not strongly affected by large-scale changes in membrane configuration. However, the local lateral organization of the chemical constituents of complex membranes can change upon large-scale membrane deformations (Baoukina et al., 2018). To overcome this limitation, we have developed an algorithm that back-maps a DTS structure to its corresponding Martini CG model (Marrink et al., 2007; Marrink and Tieleman, 2013). This algorithm makes it possible to use DTS to equilibrate the slow, large-scale membrane conformational changes and to exploit the Martini model to equilibrate the local lipid distributions. As a first attempt to explore this procedure, we performed a DTS simulation on a vesicle with a smaller volume-to-surface ratio than a perfect sphere (0.7) and a spontaneous curvature of 0.025 nm<sup>−1</sup>. Under this condition, the DTS simulation predicted the formation of a vesicular bud (**Figure 2B**) (Seifert et al., 1991; Markvoort et al., 2009; Bahrami et al., 2017). We then back-mapped the DTS structure to its corresponding Martini model; after a short energy minimization, it was ready for an MD simulation. The details of this procedure are beyond the scope of this article and will be published elsewhere.
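The vesicle shape parameter used above can be made concrete with a short calculation. The sketch below assumes that the quoted value 0.7 is the usual reduced volume, i.e., the enclosed volume divided by the volume of a sphere with the same surface area; this reading is an assumption, and the function name and radius are hypothetical.

```python
import math


def reduced_volume(volume, area):
    """Volume relative to that of a sphere with the same surface area."""
    r_equivalent = math.sqrt(area / (4.0 * math.pi))
    return volume / ((4.0 / 3.0) * math.pi * r_equivalent**3)


# A perfect sphere has reduced volume 1; deflating it at fixed area
# to 70% of its volume gives 0.7.
radius = 10.0  # hypothetical vesicle radius, nm
area = 4.0 * math.pi * radius**2
sphere_volume = (4.0 / 3.0) * math.pi * radius**3

v_sphere = reduced_volume(sphere_volume, area)
v_deflated = reduced_volume(0.7 * sphere_volume, area)
print(v_sphere, v_deflated)
```

A value below 1 means the membrane has excess area relative to the enclosed volume, which is what allows non-spherical shapes such as buds to form.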

### SUMMARY AND PERSPECTIVES

We described an extended version of our multi-scale simulation procedure, which uses a bottom-up scheme to calibrate DTS model parameters (Pezeshkian et al., 2016). The approach is well-suited for investigating membrane-involving biological processes that take place over a wide range of time and length scales and that cannot be captured by any single current simulation technique.

One of the clear advantages of exploiting DTS at macroscopic length scales is speed. DTS allows us to simulate micron-sized vesicles, decorated with membrane proteins, on a single CPU core. This length scale is hardly reachable, even with much more computational power, by any particle-based computer simulation technique (Cooke et al., 2005; Ayton and Voth, 2009). Nevertheless, the approach still suffers from several limitations that need to be resolved. For example, DTS simulations with dynamic topology have only been developed for several special purposes (Jeppesen and Ipsen, 1993; Shillcock and Boal, 1996; Gompper and Kroll, 1998; Shillcock and Seifert, 1998), which limits the application of DTS as a generic method for describing processes that involve membrane topological changes, e.g., membrane scission and poration (Boye et al., 2017). Another limitation is the current implicit protein model, which is only applicable to membrane proteins. One possibility is to adopt a one-protein-to-few-beads strategy, e.g., essential dynamics coarse-graining (Zhang et al., 2008), to extend the range of the DTS protein mapping. Another route to increase the molecular-level detail is through dynamic coupling of macroscale and CG models. We briefly described a back-mapping algorithm that converts a DTS topology to a Martini structure. This algorithm opens up a new perspective for performing dual-resolution Martini/DTS simulations, in which DTS performs the large-scale moves while local moves of the chemical components are handled by the CG Martini model.

### DATA AVAILABILITY

The datasets generated for this study are available on request to the corresponding author.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

This work was funded by The Netherlands Organization for Scientific Research (NWO) within the framework of the BaSyC (Building a Synthetic Cell) Gravitation project.

### ACKNOWLEDGMENTS

The authors thank T. A. Wassenaar for constructive discussions and comments.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Pezeshkian, König, Marrink and Ipsen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Unraveling the Molecular Mechanism of Pre-mRNA Splicing From Multi-Scale Simulations

Lorenzo Casalino<sup>1</sup> and Alessandra Magistrato<sup>2</sup> \*

*<sup>1</sup> Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA, United States, <sup>2</sup> Consiglio Nazionale delle Ricerche–Istituto Officina dei Materiali, International School for Advanced Studies (SISSA), Trieste, Italy*

Keywords: splicing, spliceosome, group II introns, molecular dynamics, QM/MM

### INTRODUCTION

The removal of non-coding introns from a precursor messenger RNA (pre-mRNA) transcript is a key step of gene expression and regulation, occurring via two transesterification reactions mediated by at least two Mg<sup>2+</sup> ions (Kastner et al., 2019). In lower organisms this process is self-regulated by group II intron ribozymes (G2IRs), which perform their own excision from a pre-mRNA strand. In eukaryotes, owing to the increased complexity of the genome, these autocatalytic RNAs have evolved into a majestic protein/RNA machinery, the spliceosome (SPL), composed of hundreds of proteins and five small nuclear (sn)RNA filaments (Marcia and Pyle, 2012; Yan et al., 2019). The SPL, acting as a protein-directed metallo-ribozyme, promotes the conversion of pre-mRNAs into mature mRNAs. This massive architecture revolves around a central core constituted by the Spp42/Prp8 protein (in S. pombe and in S. cerevisiae or human, respectively) and a catalytic site closely resembling that of G2IRs (Yan et al., 2019). As the most eminent genome tailor, the SPL undergoes relentless compositional and conformational remodeling, repeatedly assembling and transforming at every splicing cycle into eight distinct complexes (A, B, B<sup>act</sup>, B<sup>∗</sup>, C, C<sup>∗</sup>, P, ILS) to achieve splicing with single-nucleotide precision.

Recent developments in single-particle cryo-EM have yielded a plethora of near-atomic resolution structures of SPL complexes from human and yeast, allowing decades of biochemical, structural and functional studies to be interpreted. In this context, multiscale simulations can contribute to deciphering the intricacies of the splicing mechanism by assessing the chemical details of pre-mRNA cleavage and the role of the extraordinarily convoluted protein/RNA environment in creating the structural scaffold that finely modulates intron removal (Yan et al., 2019). Nevertheless, the size and inner complexity of the SPL machinery require a judicious use of advanced multiscale simulations to tackle the many peculiarities of its mechanism, as the studies showcased below illustrate.

### CHEMICAL MECHANISM OF PRE-mRNA SPLICING IN PROKARYOTES

The structure of the SPL catalytic site, impressively similar to that of its evolutionary predecessor G2IRs, is well-preserved among the distinct structures that have been solved. A series of crystal structures from Oceanobacillus iheyensis captured a group IIC intron at sequential stages of the catalytic process, providing a first structural breakthrough for unraveling the chemical mechanism of pre-mRNA splicing (Marcia and Pyle, 2012). These crystallographic reconstructions revealed an active site containing a four-metal-ion cluster made of two Mg<sup>2+</sup> and two K<sup>+</sup> ions, the former being catalytically active and the latter most likely playing a structural role. Building on these structures, classical and hybrid quantum-classical (QM/MM) simulations enabled the investigation of the first and rate-determining step of the splicing reaction as promoted by G2IRs (Casalino et al., 2016). In particular, this work focused on the water-mediated 5′-exon cleavage mechanism (hydrolytic path). In G2IRs, hydrolytic catalysis can be as operative as the branching pathway, in which the reaction is instead started by a conserved bulged adenosine within the branch point sequence (BPS). The study used classical and QM(Car–Parrinello)/MM molecular dynamics (MD), with the QM part described at the Density Functional Theory (DFT)-BLYP level of theory and the MM part treated with the AMBER-ff12SB (ff99+bsc0+χOL3) force field (FF) (Pérez et al., 2007; Zgarbová et al., 2011; Maier et al., 2015), in combination with thermodynamic integration to enable the reaction event within the limited time scale of the QM/MM MD simulations. It unveiled a novel dissociative two-Mg<sup>2+</sup>-ion mechanism in which the bulk water acts as a general base (Casalino et al., 2016).

#### Edited by:

*Valentina Tozzini, Nanosciences Institute (CNR), Italy*

#### Reviewed by:

*Holger Kruse, Academy of Sciences of the Czech Republic, Czechia*

> \*Correspondence: *Alessandra Magistrato alessandra.magistrato@sissa.it*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *14 June 2019* Accepted: *11 July 2019* Published: *06 August 2019*

#### Citation:

*Casalino L and Magistrato A (2019) Unraveling the Molecular Mechanism of Pre-mRNA Splicing From Multi-Scale Simulations. Front. Mol. Biosci. 6:62. doi: 10.3389/fmolb.2019.00062*

The two-Mg<sup>2+</sup>-ion motif is a well-established catalytic cofactor shared by many enzymes that process nucleic acids. In these enzymes, phosphodiester bond hydrolysis is believed to occur according to the proposal of Steitz and Steitz. In its original formulation, confirmed by distinct computational studies, the two Mg<sup>2+</sup> ions act as Lewis acids, activating the nucleophile and stabilizing the leaving group and the transition state (Palermo et al., 2015; Sgrignani and Magistrato, 2015). In contrast, in G2IRs a dissociative mechanism takes place, with the reactive water detaching from the Mg<sup>2+</sup> ion and attacking the scissile phosphate while still in its non-deprotonated form. Only after the nucleophilic substitution has started does the catalytic water release its proton to the bulk water, terminating the reaction. In this mechanism one Mg<sup>2+</sup> ion activates the scissile phosphate group by making it more electrophilic, while the second Mg<sup>2+</sup> ion stabilizes the leaving group. Hence, in this chemical path the role of the two Mg<sup>2+</sup> ions differs remarkably from their role in protein enzymes performing two-metal-aided catalysis. It is tempting to speculate that this mechanism may be specific to ribozymes, where the catalytic site is formed exclusively by the RNA sugar–phosphate backbone, which promotes the reaction with lower specificity/efficiency than enzymes do. This peculiar mechanism may represent an ancestral version of the two-Mg<sup>2+</sup>-ion catalysis that later evolved in enzymes and in protein-directed ribozymes (spliceosome) (Casalino et al., 2016).

### SPLICING MECHANISM MODULATION BY THE PROTEIN ENVIRONMENT

Despite the large number of cryo-EM structures of the SPL published to date, no catalytically competent form has been trapped, which hampers the study of the chemical mechanism of splicing in eukaryotes. Moreover, the large size and complexity of the SPL pose serious challenges even when attempting to unravel its functional properties. Indeed, the deposited cryo-EM maps usually have a resolution between 3 and 4 Å in the core and a lower resolution in the peripheral regions of the macromolecular assembly, which often displays structural gaps (Kastner et al., 2019; Plaschka et al., 2019; Yan et al., 2019). For these reasons, all-atom simulations of the SPL require a compromise between system size and accuracy. In the first MD simulation study published to date, based on the first near-atomic SPL structure, solved from the yeast S. pombe and capturing the intron lariat spliceosome (ILS) complex (Yan et al., 2015), two explicitly solvated core model systems containing ∼1,000,000 atoms were built and simulated via multi-replica MD for a cumulative sampling of a few microseconds (**Figure 1**). In these simulations the AMBER-ff12SB FF was used for proteins (Maier et al., 2015), whereas the ff99+bsc0+χOL3 FF was adopted for RNAs (Pérez et al., 2007; Zgarbová et al., 2011).

Correlation analyses, principal component analysis (PCA), and electrostatic calculations disentangled the cooperative motions underlying the SPL functional dynamics and unraveled the role of electrostatics in modulating these movements (Casalino et al., 2018). The simulations provided unprecedented insights into the functional plasticity of the SPL, assigning to Spp42 (Prp8 in human) a central role in finely directing the motions of many distinct SPL components. Metaphorically, Spp42 emerges as the orchestra conductor of the gene regulation symphony. Consistent with the stage of the splicing cycle investigated, the essential dynamics extracted from the PCA revealed an electrostatically driven displacement and unrolling of the U2/intron-lariat branch helix, co-promoted by Cwf19 (CWF19L2 in human) and Spp42, both of which are involved in ILS disassembly (Casalino et al., 2018). Strikingly, the involvement of Cwf19 in branch helix unwinding was later corroborated by cryo-EM studies of the human SPL (Zhang et al., 2019). Despite its intrinsic limitations, which stem from the large size of the system and the well-known flaws of RNA (Šponer et al., 2018) and Mg<sup>2+</sup> (Casalino et al., 2017) FFs, this study has opened new avenues for probing this incredible machinery with atomic-level simulations.
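As an illustration of what a correlation analysis of an MD trajectory can look like, the sketch below computes a normalized cross-correlation coefficient between the displacements of two atoms, $C_{ij} = \langle \Delta r_i \cdot \Delta r_j \rangle / \sqrt{\langle \Delta r_i^2 \rangle \langle \Delta r_j^2 \rangle}$. This is one common choice and not necessarily the specific scheme used in the cited study. The coordinates are toy data, the function name is hypothetical, and the frames are assumed to be already aligned to a common reference.

```python
import math


def cross_correlation(traj_i, traj_j):
    """Normalized displacement correlation between two atoms.

    traj_i and traj_j are lists of (x, y, z) positions, one per frame.
    Returns +1 for fully correlated and -1 for fully anti-correlated motion.
    """
    n = len(traj_i)
    mean_i = [sum(frame[k] for frame in traj_i) / n for k in range(3)]
    mean_j = [sum(frame[k] for frame in traj_j) / n for k in range(3)]
    dot = var_i = var_j = 0.0
    for fi, fj in zip(traj_i, traj_j):
        di = [fi[k] - mean_i[k] for k in range(3)]
        dj = [fj[k] - mean_j[k] for k in range(3)]
        dot += sum(a * b for a, b in zip(di, dj))
        var_i += sum(a * a for a in di)
        var_j += sum(b * b for b in dj)
    return dot / math.sqrt(var_i * var_j)


# Toy trajectories: atom_b moves with atom_a, atom_c moves against it.
atom_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atom_b = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
atom_c = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

c_ab = cross_correlation(atom_a, atom_b)
c_ac = cross_correlation(atom_a, atom_c)
print(c_ab, c_ac)
```

Evaluating this coefficient for all residue pairs gives a correlation matrix in which blocks of strongly coupled components can be identified.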

### DISCUSSION

A detailed comprehension of the molecular terms of eukaryotic splicing has implications for revolutionary gene modulation therapies and for drug discovery studies aimed at fighting the more than 200 human diseases associated with splicing defects. Since the deposition of the first SPL structure from yeast in 2015, many human cryo-EM maps have been solved, opening new opportunities to dissect detailed aspects of this machinery (Kastner et al., 2019; Plaschka et al., 2019; Yan et al., 2019). Among the open questions that need to be addressed from an atomic-level perspective, the molecular recognition mechanism by which the SPL recruits key intronic sequences at the 3′ and 5′ splice sites, as well as the conserved BPS, stands out. The subtle molecular foundations that ensure the reliable identification of authentic consensus splice sites (constitutive splicing), while simultaneously providing some flexibility in the selection of non-consensus ones (alternative splicing), remain unclear. Deregulated constitutive/alternative splicing is well known to lead to aberrant mRNA transcripts, which may either induce nonsense-mediated decay or result in functionally altered proteins, deleteriously affecting cell functions. In this context, research efforts have been devoted to understanding the mechanism by which mutations of the splicing factor SF3B1 affect BPS recognition, leading to aberrant splicing and to the onset of distinct hematological malignancies (Cretu et al., 2018). Splicing modulators trapped in SF3B1 to date have been found to target the BPS recognition site, elucidating the structural basis of their inhibition mechanism (Cretu et al., 2018; Zhang et al., 2018). Large-scale genomics studies have recently indicated that splicing abnormalities and cancer onset are strongly entwined. Thus, while more structures are awaited in the coming years, we expect the SPL to become an increasingly important subject of drug design studies tackling distinct types of cancer.

FIGURE 1 | The intron lariat spliceosome complex. Proteins are shown with electrostatic surface (blue/red colors for positive/negative charges, respectively) together with the respective field lines. The intron lariat (yellow), U2 (orange) and U6 snRNA are represented as cartoon. Mg<sup>2+</sup> ions are depicted as orange spheres. The catalytic center is highlighted by light rays.

Although the reported results from all-atom simulations, and the possible future applications, appear very encouraging (Casalino et al., 2018; Palermo et al., 2019), several challenges need to be tackled, starting with the improvement of current RNA and protein/RNA FFs (Šponer et al., 2018). Moreover, even though we have witnessed a fast development of computer hardware and software allowing brute-force unbiased MD simulations, biologically relevant time scales remain computationally extremely demanding and out of reach for most computational labs. In this respect, enhanced sampling and free energy methods to study rare events taking place in complex biological contexts call for further improvements (Miao and McCammon, 2016; Valsson et al., 2016). The presence of metals within the catalytic core of the SPL, which makes it a protein-directed metallo-ribozyme, poses serious difficulties for a reliable fully classical prediction of its properties (Vidossich and Magistrato, 2014; Brunk and Rothlisberger, 2015). For this reason, highly parallel QM/MM MD schemes capable of better exploiting large computational infrastructures would be ideal (Bolnykh et al., 2019; Olsen et al., 2019). Timely communication between the QM and MM parts would allow more efficient QM(DFT)/MM MD calculations, accounting for larger QM regions and longer simulation times than the customary ∼100 atoms and hundreds of ps, respectively.

In this scenario, we expect that new methodological advances in computer simulations, modeling and analysis techniques will foster atomic-level studies of the SPL, contributing to a thorough comprehension of this fundamental step of gene expression. This will also help to better understand the allosteric signaling between distal sites, which occurs via the entangled protein/RNA networks characterizing the SPL, and to discover druggable allosteric sites (Palermo et al., 2017). On a final note, we hope that any related breakthrough might help to elucidate the role of splicing pathways in cancer, opening concrete opportunities for creating therapeutic approaches and innovative gene manipulation tools.

### AUTHOR CONTRIBUTIONS

LC and AM designed research and wrote the paper.

### FUNDING

This work has been supported by European Social Fund 2007/2013, Project DOCTOR EUROPAEUS and by the Italian Association for Cancer Research (AIRC: My first AIRC grant no. 17134).

### ACKNOWLEDGMENTS

We thank Profs. G. Palermo and U. Roethlisberger who contributed to the original works discussed in this opinion.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Casalino and Magistrato. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Studies of Cardiac and Skeletal Troponin

### Jacob D. Bowman and Steffen Lindert\*

*Department of Chemistry and Biochemistry, Ohio State University, Columbus, OH, United States*

Troponin is a key regulatory protein in muscle contraction, consisting of three subunits: troponin C (TnC), troponin I (TnI), and troponin T (TnT). Calcium association to TnC initiates contraction by causing a series of dynamic and conformational changes that allow the switch peptide of TnI to bind and, subsequently, cross-bridges to form between the thin and thick filaments of the sarcomere. Owing to its pivotal role in contraction regulation, troponin has been the focus of numerous computational studies over the last decade. These studies elegantly supplemented a large volume of experimental work and focused on the structure, dynamics and function of the whole troponin complex, individual subunits, and even on segments of the thin filament. Molecular dynamics, Brownian dynamics, and free energy simulations have been used to elucidate the conformational dynamics and underlying free energy landscape of troponin, calcium, and switch peptide binding, as well as the effect of disease mutations, small molecules and post-translational modifications such as phosphorylation. Frequently, simulations have been used to confirm or explain experimental observations. Computer-aided drug discovery tools have been employed to identify novel potential calcium sensitizing agents binding to the TnC-TnI interface. Finally, Markov modeling has contributed to simulating contraction within the sarcomere on the mesoscale. Here we review and classify the existing computational work on troponin and its subunits, outline current gaps in simulations elucidating troponin's role in contraction, and suggest potential future developments in the field.

### Edited by:

*Giulia Palermo, University of California, Riverside, United States*

#### Reviewed by:

*Vladimir N. Uversky, University of South Florida, United States Peter M. Kekenes-Huskey, University of Kentucky, United States*

### \*Correspondence:

*Steffen Lindert lindert.1@osu.edu*

### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *28 May 2019* Accepted: *25 July 2019* Published: *09 August 2019*

#### Citation:

*Bowman JD and Lindert S (2019) Computational Studies of Cardiac and Skeletal Troponin. Front. Mol. Biosci. 6:68. doi: 10.3389/fmolb.2019.00068*

Keywords: troponin, molecular dynamics simulation, free energy methods, Brownian dynamics, cardiac thin filament modeling

### INTRODUCTION

Troponin (Tn) is a three-subunit protein complex that resides on the thin actin filament in muscle cells. Its three subunits, troponin C (TnC), troponin I (TnI), and troponin T (TnT), have separate roles in facilitating muscle contraction (Greaser and Gergely, 1973). TnT anchors the complex to the actin filament and also interacts with the protein tropomyosin. TnI has an inhibitory region that interacts with actin and inhibits the movement of tropomyosin away from the myosin-binding sites on the actin filament. TnC is the calcium-binding subunit; it binds calcium in its regulatory domain, which allows TnC to bind to a region of TnI known as the switch peptide (Parmacek and Solaro, 2004). This interaction then leads to a sliding of the tropomyosin on the actin filament and exposes the myosin-binding sites for contraction to occur (Gordon et al., 2001). Understanding the interactivity between subunits within the complex is critical to understanding muscle contraction at a molecular level. An important area of study is the intrinsically disordered regions within the troponin complex, which play critical roles in functional regulation (Na et al., 2016; Papadaki and Marston, 2016; Marston and Zamora, 2019). Serious health conditions, such as cardiomyopathies, have been linked to proteins within the sarcomere, and especially to troponin (Hershberger et al., 2010; Seidman and Seidman, 2011). In addition to the significant experimental contributions to the study of troponin, a plethora of computational methods have been developed and utilized to study the structure, dynamics and function of troponin.
Here we review computational studies of cardiac and skeletal troponin, shown in **Figure 1**, including molecular dynamics simulations sampling the conformational dynamics of troponin, free energy simulations used to elucidate the underlying free energy landscape of troponin, modeling of small molecule interactions with TnC, as well as troponin's role in Markov state models of sarcomere contractility.

### MOLECULAR DYNAMICS SIMULATE THE CONFORMATIONAL DYNAMICS OF THE TROPONIN COMPLEX AND ITS SUBUNITS

### Cardiac Troponin Simulations

Dynamic motions of the cardiac troponin complex and its individual subunits have been extensively studied with molecular dynamics (MD) simulations, which have helped elucidate the functional importance of these motions. Molecular dynamics numerically integrates Newton's equations of motion and allows for the simulation of trajectories of biomolecular atoms and molecules (Karplus and McCammon, 2002). This technique can simulate dynamics on timescales of nanoseconds to milliseconds and systems of up to millions of atoms, limited only by the available computational resources. A wild-type cTnT subunit was simulated to investigate the hinge dynamics and develop a model for subunit interactions (Manning et al., 2012a). Conventional MD of the N-terminal regulatory domain of cardiac TnC (cNTnC) and the cTnI switch peptide has been used to measure the distance between key interacting residues over the course of 40 ns simulations, which revealed isoform-specific interactions (Thompson et al., 2014). Continuing their investigation into the role of isoform-specific interactions, the Metzger group subsequently simulated cTnC-cTnI switch-peptide systems at various protonation states (Palpant et al., 2012). Further study of the cTnC-cTnI complex showed that there are key structural differences between helix 4 of TnI and the switch-peptide region (Vetter et al., 2018). The intrinsically disordered region of TnI (C-terminal domain) was simulated with cNTnC, which provided evidence that the region is flexible and has structural preferences (Metskas and Rhoades, 2015). Insights from simulations of the I and T subunits depend largely on the effect these subunits have on cTnC; simulations of cNTnC are therefore critical to understanding muscle contraction at a molecular level. Simulations of cNTnC showed that calcium binding is driven, entropically, by desolvation of the calcium ion rather than by a structural entropy change in cNTnC (Skowronsky et al., 2013).
Long timescale simulations of wild-type, calcium-bound cNTnC, on the order of 10 µs, revealed sampling of a semi-open configuration of cNTnC that is not seen in the experimental structure (Lindert et al., 2012a). As a method to enhance sampling, accelerated MD simulations were performed on the cNTnC systems which, when projected onto a PCA space, sampled the open configuration exclusively in the calcium-bound state (Lindert et al., 2012b).
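To make the statement that MD numerically integrates Newton's equations of motion concrete, the following minimal sketch propagates a single particle in a harmonic potential with the velocity Verlet scheme, a widely used MD integrator. It is a toy example in reduced units and not the setup of any study reviewed here; the function name, the force constant and the time step are hypothetical.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one degree of freedom with the velocity Verlet scheme."""
    f = force(x)
    positions = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        positions.append(x)
    return positions, v


# Harmonic oscillator with unit mass and force constant (reduced units).
k_spring = 1.0
mass = 1.0
positions, v_final = velocity_verlet(
    x=1.0, v=0.0, force=lambda x: -k_spring * x,
    mass=mass, dt=0.01, n_steps=1000,
)

# Total energy should stay close to its initial value of 0.5.
energy = 0.5 * mass * v_final**2 + 0.5 * k_spring * positions[-1]**2
print(energy)
```

Production MD codes apply the same update to every atom, with forces derived from a molecular force field, and add thermostats, constraints and periodic boundaries on top.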

Developing a model of muscle contraction through computational methods requires going from the single subunits to a complete troponin complex and even beyond. The Li lab conducted 12 ns simulations on a full cTn complex (Varughese et al., 2010). This work was able to show that calcium coordination is altered between isolated cTnC and cTnC in complex. Longer timescale simulations of the core troponin complex were subsequently performed by the Gould lab, who were able to simulate the complex for hundreds of nanoseconds (Zamora et al., 2016). This model provided insight into interactions between the subunits and can be used for further mutational studies in the Tn complex. Experimental FRET data have been used by the Dong lab to restrain molecular dynamics simulations of the core of the cardiac troponin complex (Jayasundar et al., 2014). These experimental restraints provided a more direct method to relax the model of the troponin complex to a native minimum. In order to study the troponin complex in its natural environment on the thin filament, a full thin filament model was developed by the Schwartz group (Manning et al., 2011). This model was then simulated using unrestrained MD (Williams et al., 2016). These seminal simulations were able to show the influence of cTnT mutations on cTnC and provide insight into the mechanism of disease pathology.

### Fast Skeletal Troponin Simulations

To further understand the molecular basis of skeletal contraction, the fast-skeletal troponin complex (sTn) and fast-skeletal troponin C (sTnC) have been simulated using molecular dynamics. The Li group simulated a full troponin complex and demonstrated that the inter-linker region of sTnC was flexible in simulations, in contrast to what the static model suggested (Varughese et al., 2010). This work also highlighted correlated motions within the complex between the C-terminal domain of sTnC and helices of the sTnT subunit. The Lu group simulated both the core domain of sTn and an isolated sTnC subunit (Genchev et al., 2013), showing that the calcium-bound N-terminal region transitioned from the open state (which is observed in the experimentally derived structures) to a stable semi-open configuration. Closing of calcium-bound sTnC from the open state has been detected in other MD simulations as well. The isolated N-terminal domain of sTnC was used in conventional MD simulations for 1 µs, in which semi-open and open configurations were sampled, but not exclusively (Bowman and Lindert, 2018). This work further supported temporary closing of the sTnC N-terminal region, even in the calcium-bound state. The Ghosh lab modeled the missing residues of known sTnC crystal structures guided by thermodynamics (Sikdar et al., 2016). This work showed destabilization of key residues resulting from calcium binding and allowed the binding of sTnI to both domains of sTnC.

### Simulations of Disease State and Calcium Sensitivity Modulation Mutations

Mutations within the troponin complex and other sarcomeric proteins within cardiac muscle can lead to life-threatening cardiomyopathies, such as hypertrophic (HCM) and dilated (DCM) cardiomyopathy. A key use of MD is to study the dynamics of the cTn complex and its subunits in the presence of these mutations. HCM- and DCM-associated mutations that exist on the regulatory domain of cTnC have been the focus of various conventional MD simulations. Early short simulations were run for 5 ns on the DCM-associated mutation D75Y in calcium-free cNTnC and showed that the D75Y mutation would lead to a reduction in contraction through stabilization of the closed state (Lim et al., 2008). The designed calcium-sensitizing cNTnC mutation L48Q was simulated for up to 70 ns by the Regnier group (Wang et al., 2012), revealing an increase in the stability of the calcium binding site coordination and a disruption of the closed state. Additionally, the calcium-desensitizing mutations L57Q and I61Q were simulated using a similar protocol as for L48Q (Wang et al., 2013). This study, also by the Regnier group, showed the destabilization of the cNTnC site II calcium-binding site caused by these mutations. Long timescale simulations of several microseconds of the gain-of-function mutation V44Q and the loss-of-function mutation E40A were able to show distinct differences in the opening frequency imparted by these mutations (Lindert et al., 2012a). This modulation of opening frequency was suggested as a mechanism for calcium sensitization. An extension of this work showed that other gain-of-function and loss-of-function mutations altered the dynamic landscape of cNTnC (Kekenes-Huskey et al., 2012). This suggested that tuning the cNTnC dynamics would lead to tuning of the myofilament.
Microsecond simulations of DCM-associated mutations revealed that the C-terminal cTnC mutation G159D and the N-terminal mutation D75Y both greatly reduced time spent in the open configuration of cNTnC (Dewan et al., 2016). The Tibbits group performed simulations of four HCM-associated mutations, in addition to the designed calcium-sensitizing L48Q mutation and the DCM-associated mutation Q50R (Stevens et al., 2017). These simulations showed that HCM-associated mutations destabilized the closed state of cNTnC. This result was in agreement with our simulations, showing an overall lower free energy of opening for HCM mutations and especially for the designed calcium-sensitizing mutation L48Q, and a slightly larger free energy of opening for the DCM-associated mutations (Bowman and Lindert, 2018).

Mutations that impact calcium sensitivity and lead to cardiomyopathies are also found in cTnI and cTnT and have been studied computationally. The HCM-associated cTnI mutation R145G showed little change in the overall dynamic behavior of the cTn complex compared to wild-type (Lindert et al., 2015a). This study suggested that the mutation exclusively disrupted residue-residue contacts created by phosphorylation as a mechanism for the HCM-associated mutation. This study also created a model that was ideal for studying disease-associated cTnI mutations. In addition to R145G, the cTnI mutation R21C was simulated (Cheng et al., 2015). Similarly to the R145G mutation, R21C disrupted the contacts generated by phosphorylation. The putative HCM-associated mutation P83S, studied with the same cTn model, exhibited dynamics similar to wild-type (Cheng et al., 2016). This finding agreed with the studies of R145G and R21C, in that the contacts imparted by phosphorylation were only blunted rather than completely disrupted. The Regnier lab investigated the restrictive cardiomyopathy (RCM) cTnI mutation R145W with a full cTn complex as well (Dvornikov et al., 2016). This cTnI mutant, by itself, did not alter the interactions between cTnC and cTnI. But upon addition of the phosphomimic mutations S23D/S24D, the R145W mutant disrupted the phosphorylation-mediated decoupling of cTnI and cTnC, leading to the conclusion that the combination of phosphorylation and mutation leads to increased contractility. The Schwartz group has spearheaded investigations of the influence of cardiomyopathy-associated mutations on cTnT through MD. In an early iteration, residues 70–170 of murine cTnT with HCM mutations R92L and R92W were investigated in short simulations (Ertz-Berger et al., 2005). Both these mutations decreased helical stability of cTnT, as seen in disruption of hydrogen bonds, suggesting a mechanism for Tn destabilization.
This work was subsequently extended employing longer simulations, on the order of 300 ps, on the same mutations (Manning et al., 2012a). Much like the previous work, this study showed decreased helical stability, and additionally suggested a mechanism of disease pathology by disrupting troponin tail and tropomyosin binding necessary for typical contraction. A full atomistic model of the troponin complex was developed for studying these HCM-associated mutations (Manning et al., 2012b). The cTnT mutations R92L, R92W, ΔE160, E163K, and E163R were found to either induce changes in the flexibility of cTnT or change the calcium affinity for the cNTnC calcium-binding site. A full cardiac thin filament model was generated for further investigation of changes in dynamics and contacts induced by these HCM mutations (Williams et al., 2016). This work was critical in linking the cTnT mutations to allosteric effects on calcium binding within cTnC. Understanding this link has potential to target cardiomyopathies through means other than calcium-sensitivity modulating small molecules.

### Simulations of Post-translational Modifications in Troponin

Post-translational modifications, specifically PKA phosphorylation of cTnI, are crucial to function within the troponin complex, ultimately reducing calcium sensitivity and promoting muscle relaxation. Phosphomimic mutations of cTnI residues S23 and S24 to aspartic acid were found to increase the movement of the entire Tn complex while not altering the site II calcium-binding of cNTnC. These phosphomimics also led to intrasubunit interactions between the cNTnC and the inhibitory region of cTnI, a region before the switch peptide (Cheng et al., 2014). These phosphomimic mutations were then assessed in the presence of a known disease-associated cTnI mutation, R145G, to explore its impact on a phosphorylated system. This mutation interrupted the intrasubunit interaction observed in the wild-type phosphomimic simulations, which suggested a mechanism for the Tn modulation (Lindert et al., 2015a). In support of the validity of the phosphomimic model, these systems were also simulated with actual phosphoserine side chains at cTnI residues 23 and 24, and no distinguishable differences between the simulations were observed. An addition of the HCM-associated cTnI mutation R21C to the complex also lowered the intrasubunit contacts observed in the wild-type system with phosphomimics added (Cheng et al., 2015). In contrast to the previously described mutations, the HCM-associated cTnI mutation P83S only moderately disrupted the phosphorylation-mediated interaction between cNTnC and cTnI. This study showed that there are other possible mechanisms which are additive to the P83S mutation that led to hypertrophic cardiomyopathy (Cheng et al., 2016). These studies were further extended by the Gould group, who created a full troponin complex model and simulated on the order of 750 ns to investigate phosphorylation regulation of calcium-binding (Zamora et al., 2016).
Utilizing phosphoserine, instead of a phosphomimic, this work showed that the phosphorylation moved the S69 in cTnC to an out of coordination position in site II for calcium.

### COMPUTATIONAL STUDIES OF CALCIUM AND TNI BINDING TO TNC

Techniques such as Brownian dynamics and umbrella sampling have been used to investigate the binding of calcium and TnI to TnC. Brownian dynamics is a technique that simulates a system based on an overdamped Langevin equation of motion, as opposed to Newtonian motion in MD, to study diffusion dynamics and obtain association rates for a given process (Ermak and McCammon, 1978). Browndye was utilized to estimate an on-rate for calcium for wild-type cNTnC comparable to experimentally determined values (Lindert et al., 2012b). Because this technique was able to recapitulate experimental values for wild-type, it was subsequently extended to disease-associated mutations of cTnC (Dewan et al., 2016). This work demonstrated that the calcium on-rate was indeed impacted by these mutations, in agreement with experimental data. Milestoning, applied to cTnC calcium binding by the Amaro group, also generated k<sub>on</sub> rates in agreement with experiment (Votapka and Amaro, 2015). The Tibbits group further developed an umbrella sampling scheme to investigate calcium binding free energies in zebrafish cTnC and ssTnC at two temperatures (Stevens et al., 2016). This method has also been extended to cardiomyopathy-associated mutations, where it was able to ascribe differences in binding energies to these mutations (Stevens et al., 2017). Steered molecular dynamics techniques have been used by the Schwartz group to assess calcium binding to the cTn complex with cTnT mutations (Williams et al., 2016). These simulations were able to calculate the work required to pull calcium ions from cNTnC within the context of the core of cTn. Free energy perturbation calculations from the Metzger group were able to show an increase in calcium binding free energy for acidosis states of myocytes that agreed with experimental data (Thompson et al., 2014; Vetter et al., 2018). Finally, a four-state model was developed to explain, through investigation of the cTnI effective concentration, why calcium sensitivity varies from isolated TnC to the Tn complex to a full thin filament model (Siddiqui et al., 2016). In addition to the application of free energy methods to assess calcium binding, TnI binding has been explored. Both MM/PBSA (Stevens et al., 2017) and MM/GBSA (Lindert et al., 2015a) have been used to estimate the energy of binding of the cTnI switch peptide to cNTnC. While these values did not exhibit close agreement with experimental measurements, they were still instructive in ranking the approximate energies of binding for mutations of cTnI. Additionally, steered molecular dynamics and umbrella sampling methodologies have been developed to sample the free energy landscape of the troponin complex. We developed an umbrella sampling scheme for assessing the free energy of opening of the hydrophobic patch of the regulatory domain of sTnC, cTnC, and cardiomyopathy-associated mutations of cNTnC and provided insight into a potential mechanism of contraction modulation (Bowman and Lindert, 2018).
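To make the contrast between Brownian dynamics and Newtonian MD concrete, the following is a minimal sketch of a single overdamped Langevin propagation step in the spirit of Ermak and McCammon (1978), without hydrodynamic interactions. It is not the Browndye implementation and it does not compute association rates; the function names, the reduced units, and the free-diffusion check are assumptions chosen for illustration only.

```python
import numpy as np

def brownian_step(x, force, D, kT, dt, rng):
    """One overdamped Langevin step: drift from the force plus random displacement.

    x     : (N, 3) array of positions
    force : callable returning the (N, 3) force array for positions x
    D     : diffusion coefficient (assumed constant and isotropic)
    """
    drift = (D / kT) * force(x) * dt
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal(x.shape)
    return x + drift + noise

def simulate_free_diffusion(n_particles=2000, n_steps=500, D=1.0, kT=1.0,
                            dt=0.01, seed=0):
    """Propagate force-free particles and compare the MSD with 6*D*t."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n_particles, 3))
    zero_force = lambda pos: np.zeros_like(pos)  # hypothetical force field: none
    for _ in range(n_steps):
        x = brownian_step(x, zero_force, D, kT, dt, rng)
    msd = float(np.mean(np.sum(x**2, axis=1)))
    expected = 6.0 * D * n_steps * dt
    return msd, expected

if __name__ == "__main__":
    msd, expected = simulate_free_diffusion()
    print(f"MSD = {msd:.2f}, 6Dt = {expected:.2f}")
```

In a production setting the force would come from the electrostatic and steric interactions between the ion and the protein, and many trajectories would be run with reaction and escape criteria to estimate an on-rate.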

### SMALL MOLECULE INTERACTIONS WITH CTNC

Small molecules developed for the treatment of cardiomyopathies have been simulated bound to cNTnC to probe binding energies for these molecules through an MM/PBSA method. The Li group applied this approach to studying the well-known TnC binding molecule bepridil (Varughese et al., 2011). This work was able to show that bepridil enhanced calcium sensitivity by altering the calcium coordination residues in the isolated cNTnC, but decreased cTnC and cTnI interactions in the complex. As a result of the success of this method, it was further used in combination with drug discovery to validate calcium-sensitization of new compounds (Varughese and Li, 2011).

Treatment of the pathologies associated with diseased cardiac muscle has been of great interest. To this end, cNTnC has been a target for small molecule drug screens and drug development. High-throughput virtual screens (HTVS) on clusters derived from MD simulations were performed on cNTnC. This technique was able to identify a calcium-sensitizing compound, NSC147866, from the NCI II diversity set (Lindert et al., 2015b). A significantly improved version of this screening protocol, applied to structures of cNTnC from 100 ns simulations, found two additional calcium sensitizers, NSC600285 and NSC611817, from the entire NCI database (Aprahamian et al., 2017). Employing experimental intuition, instead of blind screens, compounds that were similar in structure to diphenylamine were docked into cNTnC (Cai et al., 2016). This allowed for the identification of the calcium sensitizer 3-methyldiphenylamine. Small molecules bound to cNTnC have also been studied with an umbrella sampling scheme to show their influence on the free energy landscape (Bowman et al., 2019). In contrast to studying the cNTnC hydrophobic patch as a target for drugs, recent work has targeted the interdomain linker between the N-domain and C-domain of cTnC (Szatkowski et al., 2019). Efficacy of these drugs was measured by the change in interaction between tropomyosin and cTnT.

### MARKOV MODELING HAS CONTRIBUTED TO SIMULATING CONTRACTION WITHIN THE SARCOMERE ON THE MESOSCALE

Isolated models and simulations of Tn and its subunits provide a valuable, yet small window into muscle contraction. There is, however, a need to correlate the energetics and kinetics found at the protein level to the sarcomere level. Through Markov state modeling, in which the next state depends only on the current state of the system, these individual studies can be linked together to create a picture of muscle contraction. A model of these processes was created based on calcium binding, tropomyosin movement, and then myosin binding (Campbell et al., 2010). This model accurately predicted steady-state force change. The Markov model from the Campbell group was subsequently updated to include the azimuthal angle between adjacent tropomyosin chains (Sewanan et al., 2016). This angle was unaccounted for in previous models, and its addition allowed for incorporation of tropomyosin mutations into the model. An updated model was proposed that included the opening of the hydrophobic patch and binding of the cTnI switch peptide (Dewan et al., 2016). This model used data from Brownian dynamics and molecular dynamics simulations. While unable to accurately predict the impact of cardiomyopathy-associated mutations on contraction, this model was able to show that small changes in these states can ultimately alter the pCa curves at the larger scale.
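The Markov property described above can be illustrated with a deliberately small toy model. The sketch below is not the Campbell, Sewanan, or Dewan model: the three states (calcium-free, calcium-bound, force-generating), the rate constants, and the function names are hypothetical and serve only to show how state-to-state rates determine a steady-state occupancy that depends on the calcium concentration.

```python
import numpy as np

def steady_state(rates):
    """Stationary distribution of a continuous-time Markov chain.

    rates[i][j] is the transition rate from state i to state j (i != j).
    """
    K = np.array(rates, dtype=float)
    np.fill_diagonal(K, 0.0)
    Q = K - np.diag(K.sum(axis=1))        # generator matrix, rows sum to zero
    n = Q.shape[0]
    # Solve pi Q = 0 together with the normalization sum(pi) = 1.
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def force_fraction(calcium, k_on=1.0, k_off=1.0, k_open=2.0, k_close=1.0):
    """Steady-state occupancy of the force-generating state (hypothetical rates).

    States: 0 = calcium-free, 1 = calcium-bound, 2 = force-generating.
    """
    rates = [
        [0.0, k_on * calcium, 0.0],
        [k_off, 0.0, k_open],
        [0.0, k_close, 0.0],
    ]
    return float(steady_state(rates)[2])

if __name__ == "__main__":
    for c in (0.1, 1.0, 10.0):
        print(c, round(force_fraction(c), 3))
```

Scanning the calcium concentration in such a model yields a sigmoidal occupancy curve, which is the toy analogue of a pCa curve; small changes in any single rate shift that curve.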

### CURRENT GAPS AND POTENTIAL FUTURE DEVELOPMENTS IN THE FIELD

Computational methods have already made a significant contribution to our understanding of the dynamics and function of troponin. However, several gaps in simulations elucidating troponin's role in contraction remain. The accuracy of free energy calculations, particularly with respect to calcium binding affinities, is currently insufficient, as a result of inaccuracies in force field descriptions of calcium, non-classical electronic effects, and a lack of robust sampling of the thermodynamic ensemble. Future efforts will have to focus on more accurately predicting calcium binding affinities, probably employing longer simulations, force field optimization, polarizable force fields or even QM/MM calculations. In the context of calcium binding, but not limited to it, the behavior of the troponin complex is distinct from that of its constituents (e.g., isolated cNTnC, isolated cTnC), challenging simulations to correctly account for those differences. In general, design of additional computational experiments that are verifiable in vitro/vivo will lead to more cohesion between models and experiments. Another current limitation is the disparity between physiologically relevant millisecond-scale conformational dynamics of the contractile system and the restriction of conventional simulations to tens of microseconds, often accompanied by simulation of very small sections of the contractile machinery. The Schwartz group has paved the way for extending the size of simulations to the thin filament and it is our prediction that the field will follow in the years to come. An alternative route to obtaining contractile information on the mesoscale is offered by the Markov models developed by Campbell and coworkers. Future work will likely focus on obtaining additional model input from computational simulations, such as the accurate predictions of calcium-binding affinities discussed above, as opposed to experimental measurements.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

This work was supported by the NIH (R01 HL137015 to SL).

### REFERENCES

biophysical experiment can develop small molecules that restore function to the cardiac thin filament in the presence of cardiomyopathic mutations. ACS Omega 4, 6492–6501. doi: 10.1021/acsomega.8b03340


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Bowman and Lindert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Some Notes on the Thermodynamic Accuracy of Coarse-Grained Models

Ewa Anna Oprzeska-Zingrebe and Jens Smiatek\*

Institute for Computational Physics, Theoretical Chemical Physics, University of Stuttgart, Stuttgart, Germany

Keywords: coarse-grained (CG) model, thermodynamics, Kirkwood-Buff theory, free energies, implicit solvent model

Over the last decades, multiscale molecular dynamics (MD) simulations including ab initio, atomistic as well as coarse-grained models have significantly expanded our understanding of biologically relevant macromolecules like DNA, RNA, or proteins and their properties in solution. Despite the broad applicability, we comment here on some general challenges for coarse-grained approaches, the most important being a reliable thermodynamic description at large time and length scales.

Due to a massive increase in computational power, classical atomistic MD simulations are nowadays the method of choice for the study of complex molecular mechanisms, thereby taking into consideration hundreds of thousands of atoms on time scales of several microseconds. Although classical atomistic models provide a higher level of detail when compared to coarse-grained approaches, it has to be noted that the simplification of electronic behavior in terms of potential functions, so called force fields, introduces some conceptual artifacts into the dynamic and structural properties of the simulated molecular species (Dommert et al., 2012). Furthermore, polarization and charge-transfer mechanisms are usually ignored, such that more sophisticated ab initio or empirical models have to be used for systems where these effects become of importance (Smiatek et al., 2018; Kohagen et al., 2019; Nandy and Smiatek, 2019; Smiatek, 2019).

However, some processes take place on time and length scales that are not accessible for atomistic MD simulations. Common examples are the formation of lipid bilayers and polyelectrolyte complexes, polymer and colloidal diffusion, charge transport or large scale DNA translocation (Smiatek and Schmid, 2011; Michalowsky et al., 2017, 2018; Smiatek and Holm, 2018). For the study of these and closely related problems, simple as well as more refined coarse-grained models offer a wide range of applications. Here, coarse-graining means the introduction of effective interaction sites (beads) instead of individual atoms, which reduces the degrees of freedom and thus also the number of necessary computations. In addition, the lower level of detail supports the straightforward use of implicit solvent approaches in combination with larger time steps (Marrink and Tieleman, 2013; Kleinjung and Fraternali, 2014; Onufriev and Case, 2019). Depending on the degree of coarse graining, one can differentiate between simple approaches such as reduced bead-spring models for polymers and advanced or semi-coarse-grained methodologies such as iterative Boltzmann inversion or the MARTINI method among others (Reith et al., 2003; Clark et al., 2012; Marrink and Tieleman, 2013; Noid, 2013; McCarty et al., 2014; Rudzinski and Noid, 2014; Dunn and Noid, 2015; Guenza et al., 2018; Smiatek and Holm, 2018). Although advanced coarse-graining approaches are often based on rather mild parameterization procedures, it should be noted that the consideration of effective interaction sites crucially affects the resulting size and the geometry of the molecular species (Vögele et al., 2015a; Michalowsky et al., 2017, 2018). With regard to this point, coarse-grained methodologies also reveal some generic drawbacks, thereby limiting the applicability of these approaches for the thermodynamic analysis of complex solutions.

### Edited by:

Valentina Tozzini, Nanosciences Institute (CNR), Italy

Reviewed by: Fabio Trovato, Freie Universität Berlin, Germany

\*Correspondence: Jens Smiatek smiatek@icp.uni-stuttgart.de

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 21 June 2019 Accepted: 27 August 2019 Published: 10 September 2019

#### Citation:

Oprzeska-Zingrebe EA and Smiatek J (2019) Some Notes on the Thermodynamic Accuracy of Coarse-Grained Models. Front. Mol. Biosci. 6:87. doi: 10.3389/fmolb.2019.00087

In terms of a specific example, many biologically relevant solutions, such as in mammalian or bacterial cells, are dense mixtures of various ions, co-solute and co-solvent species including a non-negligible concentration of solute components (Zhou et al., 2008). Among other effects, the individual components of the solution and their thermodynamic properties exert a tremendous influence on the structural stability of the dissolved biological species (Canchi and García, 2013; Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018a). For instance, it was shown (Zhang and Cremer, 2010; Canchi and García, 2013; Sukenik et al., 2013; Oprzeska-Zingrebe et al., 2018) that ions like SCN<sup>−</sup> or molecules like urea destabilize DNA or protein structures, whereas the presence of SO<sub>4</sub><sup>2−</sup>, trimethylamine-N-oxide (TMAO), or ectoine enhances the stability of native macromolecular states. Additionally, many molecular mechanisms are also dominated by intra- and intermolecular hydrogen bonds, polarization mechanisms as well as electrostatic and dispersion interactions. The presence of these mainly short-ranged interactions influences the radial distribution functions, potentials of mean force or the corresponding chemical potentials of the species, so that in the end, for non-negligible concentrations, there are more or less pronounced deviations from ideal solutions (Chandler, 1987; Smiatek, 2014, 2017; Dunn and Noid, 2015; Guenza et al., 2018; Oprzeska-Zingrebe and Smiatek, 2018a). The question now is whether coarse-grained models can reproduce these findings. Of course, one may wonder if the aforementioned properties need to be exactly reproduced, but we will illustrate by means of the following arguments that even slight deviations may have a decisive influence on the thermodynamic properties of the solution.

In more detail, modified interactions like in coarse-grained models under constant pressure $p$ and temperature $T$ result in variations of free energies, as defined by $G = H - TS$ with the enthalpy $H$ and the entropy $S$, and changes in the chemical potential via $\mu_\alpha = (\partial G/\partial N_\alpha)_{p,T}$, where $N_\alpha$ denotes the number of molecules of species $\alpha$. Due to changes in the enthalpy, the corresponding molecular arrangements are also affected, which often induces entropic variations as a second-order effect. Furthermore, changes of chemical potentials from the reference chemical potential $\mu^0_\alpha$, with the universal gas constant $R$, are directly related to changes in thermodynamic activities $a_\alpha = \exp((\mu_\alpha - \mu^0_\alpha)/RT)$, vapor pressures, solubilities or chemical reaction equilibria, as can be shown by relations from equilibrium thermodynamics and Kirkwood-Buff (KB) theory (Kirkwood and Buff, 1951; Ben-Naim, 2013). In consequence, it becomes obvious that even slight modifications of molecular interactions may establish a non-negligible variation of relevant thermodynamic properties, as will be discussed in more detail in the following.
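As a rough numerical illustration of why slight modifications matter, the sketch below evaluates the activity relation quoted above for a hypothetical shift of the chemical potential. The shift of 1 kJ/mol and the temperature of 298.15 K are assumed values chosen for the example, not results from the cited studies, and the function names are illustrative.

```python
import math

R = 8.314462618  # universal gas constant in J/(mol K)

def activity(mu, mu_ref, temperature):
    """Thermodynamic activity a = exp((mu - mu_ref) / (R T)); mu in J/mol."""
    return math.exp((mu - mu_ref) / (R * temperature))

def activity_ratio(delta_mu, temperature=298.15):
    """Factor by which the activity changes if mu is shifted by delta_mu (J/mol)."""
    return math.exp(delta_mu / (R * temperature))

if __name__ == "__main__":
    # Hypothetical model error of 1 kJ/mol in the chemical potential.
    print(f"activity changes by a factor of {activity_ratio(1000.0):.3f}")
```

Under these assumptions, an error of 1 kJ/mol, which is less than half of $RT$ at room temperature, already changes the activity by roughly 50 percent.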

For illustrative purposes, we develop our arguments for a binary solution under isobaric-isothermal conditions with two components, including only solvent (index 1) and co-solvent (index 3) species. It has to be noted that the corresponding expressions change for different ensembles and higher-component mixtures, such that we here focus on one of the simplest examples (Smith, 2006). In KB theory, the derivative of the chemical potential of the co-solvent $\mu_3$ is defined as

$$\frac{1}{RT} \left( \frac{\partial \mu_3}{\partial \ln \rho_3} \right)_{T,p} = \left( \frac{\partial \ln a_3}{\partial \ln \rho_3} \right)_{T,p} = \frac{1}{1 + \rho_3 (G_{33} - G_{31})} \tag{1}$$

where $\rho_3$ denotes the number density of co-solvent species and $G_{33}$ and $G_{31}$ the corresponding KB integrals. A detailed explanation of KB integrals, their relation to radial distribution functions and their central meaning in KB theory can be found in the literature (Kirkwood and Buff, 1951; Ben-Naim, 2013; Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018a). For our considerations, it is sufficient to know that the KB integrals rely on radial distribution functions and represent excess volumes, which can be transformed into excess particle numbers $N^{\text{xs}}_{\alpha\beta} = \rho_\beta G_{\alpha\beta}$ for arbitrarily chosen components $\beta$ around species $\alpha$. With regard to this definition, Equation (1) can also be written as

$$\left(\frac{\partial \ln a_3}{\partial \ln \rho_3}\right)_{T,p} = \frac{1}{1 + \left(N_{33}^{\text{xs}} - (\rho_3/\rho_1)N_{31}^{\text{xs}}\right)} \tag{2}$$

with the excess number of solvent $N^{\text{xs}}_{31}$ and co-solvent molecules $N^{\text{xs}}_{33}$ in combination with the corresponding number densities $\rho_1$ and $\rho_3$. In terms of implicit solvent approaches with a continuum dielectric background, it follows that $N^{\text{xs}}_{31} = 0$ by definition, which implies that Equation (2) approaches the outcomes of experiments and atomistic models only under nearly ideal conditions with $\rho_3 \to 0$ at infinite dilution. Further deviations can be observed for large and spherical coarse-grained solvent beads such that the resulting excess volumes are often not correctly reproduced (Vögele et al., 2015a), which implies a significant influence on bulk thermodynamic properties like solubilities or isothermal compressibilities (Pierce et al., 2008; Smiatek et al., 2018).
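The relations in Equations (1) and (2) can be checked numerically with a short sketch. It assumes the standard KB definition $G_{\alpha\beta} = 4\pi \int_0^\infty (g_{\alpha\beta}(r) - 1)\, r^2\, dr$, which is not spelled out in the text above, and it uses made-up densities and KB integrals (in nm<sup>−3</sup> and nm<sup>3</sup>) purely for illustration. The function names are hypothetical.

```python
import numpy as np

def kb_integral(r, g):
    """KB integral G = 4*pi * int (g(r) - 1) r^2 dr via the trapezoidal rule."""
    integrand = 4.0 * np.pi * (g - 1.0) * r**2
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r)))

def activity_derivative(rho3, G33, G31):
    """Equation (1): (d ln a3 / d ln rho3) from KB integrals."""
    return 1.0 / (1.0 + rho3 * (G33 - G31))

def activity_derivative_excess(rho1, rho3, N33, N31):
    """Equation (2): the same derivative expressed through excess numbers."""
    return 1.0 / (1.0 + (N33 - (rho3 / rho1) * N31))

if __name__ == "__main__":
    rho1, rho3 = 33.0, 1.0      # assumed number densities (nm^-3)
    G33, G31 = -0.05, -0.02     # assumed KB integrals (nm^3)
    N33, N31 = rho3 * G33, rho1 * G31
    explicit = activity_derivative_excess(rho1, rho3, N33, N31)
    implicit = activity_derivative_excess(rho1, rho3, N33, 0.0)  # N31 = 0
    print(activity_derivative(rho3, G33, G31), explicit, implicit)
```

With these assumed numbers, the explicit-solvent value is 1/0.97 from both equations, whereas setting the solvent excess number to zero, as in an implicit solvent model, gives 1/0.95. The two agree only as the co-solvent density goes to zero.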

Notably, the transfer free energies in ternary mixtures between the co-solvent "3" and the solute "2", as defined by $G^{\dagger} = N^{\text{xs}}_{23} - (\rho_3/\rho_1) N^{\text{xs}}_{21}$, also rely on accurate values for the number densities and the excess numbers of molecules (Smiatek, 2017; Oprzeska-Zingrebe and Smiatek, 2018b). Otherwise, the thermodynamic affinity between the considered species is crucially affected. In order to highlight some further inconsistencies, it can be shown that the chemical equilibrium between distinct chemical states in coarse-grained models also differs from experimental values and atomistic approaches. In contrast to the chemical equilibrium constant $K_0$ in presence of a neat solute-solvent mixture, the modified chemical equilibrium constant $K^*$ for denatured or native protein or DNA states (Oprzeska-Zingrebe and Smiatek, 2018a,b) or for associated and dissociated ion pairs (Krishnamoorthy et al., 2018) in presence of low co-solvent concentrations reads (Oprzeska-Zingrebe et al., 2019)

$$K^* = K_0 \exp(\Delta N_{23}^{\text{xs}}) \tag{3}$$

with $\Delta N^{\text{xs}}_{23} = N^{\text{xs}}_{23}(d) - N^{\text{xs}}_{23}(n)$, where $d$ denotes the denatured and $n$ the native state (Oprzeska-Zingrebe et al., 2019). With regard to the previous equation, a different value of $\Delta N^{\text{xs}}_{23}$ as obtained from the coarse-grained simulations when compared to the atomistic model or experimental values ($\Delta N^{\text{xs}}_{23,\text{exp}}$) modifies the chemical equilibrium constant, $K^* \neq K^*_{\text{exp}}$, and also the free energy difference in accordance with $\Delta G^* = -RT \ln K^* \neq \Delta G^*_{\text{exp}}$.
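The consequence of Equation (3) can be quantified with a small sketch. It assumes that the model and the reference share the same $K_0$, so that the error in the free energy difference reduces to $-RT$ times the error in the excess-number difference. The deviation of 0.5 and the temperature are assumed values and the function names are hypothetical.

```python
import math

R = 8.314462618e-3  # universal gas constant in kJ/(mol K)

def modified_equilibrium_constant(K0, dN23):
    """Equation (3): K* = K0 * exp(Delta N23^xs)."""
    return K0 * math.exp(dN23)

def free_energy_difference(K, temperature=298.15):
    """Delta G* = -R T ln K* in kJ/mol."""
    return -R * temperature * math.log(K)

def free_energy_error(dN23_model, dN23_reference, temperature=298.15):
    """Delta G*(model) - Delta G*(reference), assuming an identical K0."""
    return -R * temperature * (dN23_model - dN23_reference)

if __name__ == "__main__":
    # Hypothetical case: the coarse-grained model overestimates Delta N23^xs by 0.5.
    print(f"{free_energy_error(1.5, 1.0):.3f} kJ/mol")
```

Under these assumptions, a deviation of only half an excess molecule shifts the free energy difference by about 1.2 kJ/mol and the equilibrium constant by a factor of about 1.65.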

In consequence, incorrect sizes and geometries as well as simplified interactions or inaccurately parameterized coarse-grained interaction sites may induce significant deviations and spurious artifacts. A recent article revealed that specifically the number of interaction sites is of crucial importance (Dunn and Noid, 2015). Notably, most deviations are only relevant for small molecular species like organic solvent molecules or ions, whereas significant improvements of coarse-grained models for polymers were recently reported (McCarty et al., 2014; Dunn and Noid, 2015; Vögele et al., 2015a,b; Guenza et al., 2018; Michalowsky et al., 2018).

In terms of these challenges, why should one use coarse-grained models at all? To answer this question, one should keep in mind that everything should be made as simple as possible, but not simpler. As already discussed, deviations between atomistic and coarse-grained models are mainly relevant for small molecular or ionic species where coarse-graining means a significant change of size and geometry. With regard to this point, it was recently shown that improvements in the parameterization strategy, the functional form of the interaction potentials as well as the consideration of polarizabilities in coarse-grained models increase the validity of the results (Noid, 2013; Rudzinski and Noid, 2014; Dunn and Noid, 2015; Michalowsky et al., 2017, 2018; Zeman et al., 2017; Guenza et al., 2018; Uhlig et al., 2018). Moreover, variations in thermodynamic properties become visible even between united- and all-atom models, which highlights the importance of accurately parameterized molecular structures and interaction sites (Markthaler et al., 2017). Nevertheless, if the key features of interest can be reproduced through reduced models, nothing stands in the way of using these approaches. Otherwise, one must always be aware that uncontrollable artifacts may occur. In consequence, one should always keep the limits of the individual models in mind, such that the applicability of the approaches for certain research questions should be carefully reviewed.

### AUTHOR CONTRIBUTIONS

EO-Z and JS wrote, reviewed, and edited all versions of this article.

### ACKNOWLEDGMENTS

We thank CECAM and the organizers of the workshop Multiscale Modeling from Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations (February 2019) for their initiative. We thank the Deutsche Forschungsgemeinschaft (DFG) through the Sonderforschungsbereich 716 (SFB 716/C8) for funding.

### REFERENCES

structures: thermodynamic insights into molecular binding mechanisms and destabilization effects. Phys. Chem. Chem. Phys. 20, 25861–25874. doi: 10.1039/C8CP03543A


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Oprzeska-Zingrebe and Smiatek. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Modeling Crowded Environment in Molecular Simulations

Natalia Ostrowska<sup>1,2</sup>, Michael Feig<sup>3</sup> and Joanna Trylska<sup>1</sup>\*

*<sup>1</sup> Centre of New Technologies, University of Warsaw, Warsaw, Poland, <sup>2</sup> College of Inter-Faculty Individual Studies in Mathematics and Natural Sciences, University of Warsaw, Warsaw, Poland, <sup>3</sup> Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, United States*

Biomolecules perform their various functions in living cells, namely in an environment that is crowded by many macromolecules. Thus, simulating the dynamics and interactions of biomolecules should take into account not only water and ions but also other binding partners, metabolites, lipids and macromolecules found in cells. In the last decade, research on how to model macromolecular crowders around proteins in order to simulate their dynamics in models of cellular environments has gained a lot of attention. In this mini-review we focus on the models of crowding agents that have been used in computer modeling studies of proteins and peptides, especially via molecular dynamics simulations.

### Edited by:

*Valentina Tozzini, Nanosciences Institute, National Research Council, Italy*

#### Reviewed by:

*Jozef Adam Liwo, University of Gdansk, Poland Pavel Srb, Academy of Sciences of the Czech Republic (ASCR), Czechia*

> \*Correspondence: *Joanna Trylska joanna@cent.uw.edu.pl*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

Received: *03 July 2019* Accepted: *27 August 2019* Published: *11 September 2019*

#### Citation:

*Ostrowska N, Feig M and Trylska J (2019) Modeling Crowded Environment in Molecular Simulations. Front. Mol. Biosci. 6:86. doi: 10.3389/fmolb.2019.00086*

Keywords: protein dynamics, macromolecular crowding, coarse-grained models, molecular dynamics simulations, crowder models

### 1. INTRODUCTION

Intracellular organelles—in addition to water molecules, ions, metabolites, and other small solutes—typically contain between 200 and 400 g/L of macromolecules such as proteins, nucleic acids, ribosomes, and lipids. These complex environments may impact biomolecular function in vivo via crowding and confinement. The most obvious consequence is reduced diffusion. However, crowder molecules may also influence macromolecular folding and stability, internal dynamics and the sampling of functionally relevant conformations, complex formation, ligand binding and product release, catalytic activity, and other events (Zhou et al., 2008; Rivas and Minton, 2016).

In the majority of simulation studies, the functional dynamics of a given biomolecule has been investigated one molecule at a time and in the presence of only water and ions. However, when biomolecules experience crowding, the available volume is decreased and interactions with other biomolecules are unavoidable. This influences their diffusion and association pathways. Experiments increasingly study the function and dynamics of biomolecules under crowded conditions (e.g., Kuznetsova et al., 2014; Cheng et al., 2018; Fonin et al., 2018; Maximova et al., 2019). Thus, it is necessary to account for crowded conditions in simulations as well. Indeed, in the last decade, the number of studies of biomolecular interactions that consider not only water and ions but also other binding partners, metabolites or crowders has increased.

Crowding has the most pronounced effects on proteins with intrinsically disordered fragments or those that undergo significant conformational transitions as part of their function, for example during ligand binding. This applies to a vast majority of proteins. Therefore, it is time to establish standard protocols for how to include crowded environments in molecular simulations. This mini-review offers a brief guide through viable candidates. There are many reviews about the simulations of crowding, but we specifically focus on the crowder models used in molecular dynamics (MD) simulations. Other reviews cover the overall effects of crowding (Zhou et al., 2008; Christiansen et al., 2013), models of cellular environments at different scales (Feig and Sugita, 2013; Im et al., 2016; Feig et al., 2017), diffusion (Długosz and Trylska, 2011), and protein-protein interactions (Bhattacharya et al., 2013) in crowded environments.

### 2. REDUCED MODELS OF CROWDERS

A simple model for mimicking the excluded volume effect is geometric confinement where the physical volume available to a molecule is constrained. Typically, a spherical potential is applied, which restricts the conformational and diffusional freedom of the molecule similar to what explicit crowders would do. A similar approach is to penalize increased solvent-accessible surface areas (Tanizaki et al., 2008). More sophisticated models, which account for both the volume restraint and the presence of mobile crowders, include randomly placing explicit crowders around the molecule and applying appropriate boundary conditions. The most common model is to simulate a single molecule, in most cases a protein represented in a coarse-grained (CG) manner, surrounded by spherical crowders. By default, such a crowder is modeled as a single pseudo-atom with an enlarged radius to match the volume of a crowding agent of interest (**Figure 1A**). Typical crowder particle radii vary between 10 and 50 Å, with an average of 25 Å. Such sizes are appropriate to represent folded proteins or crowding polymers like Ficoll.
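The sizing of a single-particle crowder can be illustrated with a short calculation. The sketch below is not taken from any of the cited studies: it assumes a typical partial specific volume for globular proteins (0.73 cm³/g) and a cubic simulation box, and the function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23          # 1/mol
PARTIAL_SPECIFIC_VOLUME = 0.73    # cm^3/g, typical globular protein (assumption)

def equivalent_radius_angstrom(mass_da):
    """Radius (in Å) of a sphere with the same volume as a protein of the given mass."""
    volume_cm3 = PARTIAL_SPECIFIC_VOLUME * mass_da / AVOGADRO
    volume_a3 = volume_cm3 * 1e24  # 1 cm^3 = 1e24 Å^3
    return (3.0 * volume_a3 / (4.0 * math.pi)) ** (1.0 / 3.0)

def crowder_count(box_edge, radius, volume_fraction):
    """Number of spheres needed to occupy the target volume fraction of a cubic box."""
    sphere_volume = 4.0 / 3.0 * math.pi * radius ** 3
    return round(volume_fraction * box_edge ** 3 / sphere_volume)

if __name__ == "__main__":
    # A hypothetical 50 kDa crowder in a 200 Å box at 25% vol.
    r = equivalent_radius_angstrom(50_000)
    print(f"equivalent radius: {r:.1f} Å")
    print("crowders needed:", crowder_count(200.0, 25.0, 0.25))
```

Under these assumptions a 50 kDa protein maps onto a sphere of roughly 24 Å radius, which falls in the middle of the radius range quoted above.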

### 2.1. Single-Particle Spherical Crowders

Spherical crowders provide the excluded volume effect without requiring any specific interactions with the biomolecule. Therefore, crowder-molecule and crowder-crowder interactions are limited to van der Waals interactions via the Lennard-Jones potential and often only the repulsive part of the potential is considered (Minh et al., 2006; Kim et al., 2010). As the number of atoms in such systems is restricted to a minimum, the simulations become relatively fast, especially when an implicit water model is used. As a fast and simple solution, spherical crowders became popular and have been adopted in many types of simulations. They are often used in Brownian dynamics (BD) simulations (Cheung et al., 2005; Minh et al., 2006; Stagg et al., 2007; Wieczorek and Zielenkiewicz, 2008; Oh et al., 2014), but they can be also used in other methods such as MD (Kim et al., 2014; Miller et al., 2016) and Monte Carlo simulations (Kim et al., 2010).
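One common way to keep only the repulsive part of the Lennard-Jones potential is to truncate it at its minimum and shift it to zero (the Weeks-Chandler-Andersen form). The sketch below shows that construction; it is an assumption that this particular form matches the one used in any given cited study, and the parameter names are illustrative.

```python
def repulsive_lj(r, sigma, epsilon=1.0):
    """Purely repulsive Lennard-Jones energy (WCA form).

    The 12-6 potential is shifted up by epsilon and set to zero beyond its
    minimum at r = 2^(1/6) * sigma, so no attraction remains.
    For a crowder-bead pair, sigma would typically be derived from the
    sum of the two particle radii (an assumption of this sketch).
    """
    r_min = 2.0 ** (1.0 / 6.0) * sigma
    if r >= r_min:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6) + epsilon

if __name__ == "__main__":
    for r in (0.9, 1.0, 1.1, 1.5):
        print(r, repulsive_lj(r, sigma=1.0))
```

Because the energy and its derivative both vanish at the cutoff, the interaction range stays short, which is part of why such simulations are inexpensive.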

Simulations with spherical crowders can be classified as a mixed-resolution approach because typically the crowders are represented as single particles, whereas biomolecules of interest are represented at higher levels, with at least one bead per residue. If the protein is represented with a CG model, some details about its behavior may be lost, such as its internal dynamics or specific aspects of its interactions with the environment, such as explicit water, other biomolecules, or ligands. However, many questions can still be addressed with this simple approach, such as the impact on diffusion (Ridgway et al., 2008), the stability of proteins (Cheung et al., 2005; Stagg et al., 2007), protein complex formation (Kim et al., 2010, 2014; Latshaw et al., 2014), or inter-domain mobility (Minh et al., 2006). Some of these examples are briefly described below.

Single-particle crowders were shown to mildly stabilize some globular proteins, such as the native state of the WW domain (Cheung et al., 2005). The apoflavodoxin protein also favored more compact states at 25% vol. crowding (Stagg et al., 2007). A study on the HIV-1 protease (**Figure 1A**) showed that the frequency of opening of the protease flaps covering the active site is suppressed at high crowder fractions but low 5% vol. crowding was found to actually enhance the flap dynamics (Minh et al., 2006).

A simulation of amyloid aggregation suggested that crowding increases the rate of oligomer formation and fibril growth (Latshaw et al., 2014). These effects were found to depend on the size of the crowder particles, where smaller crowders enhanced the oligomerization rate to a greater extent. A similar enhancement was also seen in a simulation of antibody-antigen association under crowded conditions (Wieczorek and Zielenkiewicz, 2008).

The effect of crowding on the interactions between proteins forming complexes was also investigated. The binding free energy in two protein complexes (ubiquitin/UIM1 and cytochrome c/cytochrome c peroxidase) was shown to decrease in the presence of higher concentrations of repulsive crowders (Kim et al., 2010). Repulsive crowders also modestly stabilized the interactions in the pKID-KIX protein complex (Kim et al., 2014), but including an attractive term for protein-crowder interaction could destabilize the interaction in the protein complex.

Spherical crowders have also been used to study the impact of crowding on the conformational dynamics of intrinsically disordered proteins (IDP). In one study, the crowders were reported to induce compaction of disordered peptides (Miller et al., 2016). The compaction increased with decreasing radius of the crowders and with increasing volume fraction, but the effects also strongly depended on the peptide sequence.

### 2.2. Many-Particle Crowders

The resolution of the crowder particle can be increased by distributing a set of small pseudo-atoms on the surface of a sphere to form a bead shell (Elcock, 2003; Kurniawan et al., 2012). Such a model can match the higher resolution of a biomolecule of interest better and offer computational advantages as shorter non-bonded cutoffs can be used. Bead-shell crowders have been used in a BD simulation to calculate the free energy of the escape of a protein from the GroEL cage (Elcock, 2003) and later on in an MD simulation with explicit water molecules to observe the conformational changes of a short peptide (Kurniawan et al., 2012). In the latter work, crowding was found to facilitate folding of a β-hairpin by promoting compact structures and preventing unfolding of the intermediate conformations. Modeling crowders as CG proteins placed around a biomolecule represented with a CG model of similar resolution is also a computationally feasible option (**Figure 1B**). Such an approach was applied e.g., in BD simulations of ligands associating with HIV-1 protease in the presence of glutathione S-transferase P as a crowding agent (Kang et al., 2011).
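The bead-shell construction can be sketched as placing pseudo-atoms quasi-uniformly on a sphere. The example below uses a Fibonacci lattice, which is one convenient placement scheme and not necessarily the one used in the cited studies; the bead count and radius are hypothetical.

```python
import math

def bead_shell(n_beads, radius):
    """Return n_beads points spread quasi-uniformly over a sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n_beads):
        z = 1.0 - (2.0 * i + 1.0) / n_beads   # evenly spaced heights in (-1, 1)
        rho = math.sqrt(1.0 - z * z)          # radius of the circle at height z
        phi = golden_angle * i                # rotate each bead by the golden angle
        points.append((radius * rho * math.cos(phi),
                       radius * rho * math.sin(phi),
                       radius * z))
    return points

if __name__ == "__main__":
    shell = bead_shell(42, 25.0)  # hypothetical: 42 surface beads, 25 Å radius
    print(len(shell), shell[0])
```

Since every bead is small, the non-bonded cutoff can be set by the bead size instead of the full crowder radius, which is the computational advantage mentioned above.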

Other variants of the spherical model are dumbbell-shaped objects (Christiansen et al., 2010; Chen et al., 2012), where two spheres are linked by a harmonic bond, spherocylinders (O'Brien et al., 2011; Kang et al., 2015), and polymer chains (Nguemaha et al., 2018) with parameters adjusted to represent proteins, DNA or other polymers like polyethylene glycol (PEG). A good model of a cellular environment may require a mixture of spherical and cylindrical crowders, and it has been found that such a mixture leads to different results than with crowders of only one type (Kang et al., 2015). In this work, the simulation of a DNA fragment revealed that the DNA conformation "swells" under crowded conditions and that crowders of mixed shape affected the conformation to a greater extent than each of the homomorphic crowders.
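The excluded volume of a spherocylinder can be evaluated from the distance between a bead and the cylinder axis treated as a line segment. The following is a minimal hard-core overlap test under that geometric assumption; the function names are hypothetical and the cited studies may use soft potentials instead.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (3D coordinates)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab2 = sum(c * c for c in ab)
    # Projection of p onto the axis, clamped to the segment ends
    t = 0.0 if ab2 == 0.0 else sum(ap[i] * ab[i] for i in range(3)) / ab2
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def overlaps_spherocylinder(bead, bead_radius, a, b, cyl_radius):
    """True if a spherical bead penetrates a spherocylinder with axis a-b."""
    return point_segment_distance(bead, a, b) < bead_radius + cyl_radius

if __name__ == "__main__":
    axis_start, axis_end = (-5.0, 0.0, 0.0), (5.0, 0.0, 0.0)
    print(overlaps_spherocylinder((0.0, 3.0, 0.0), 1.0, axis_start, axis_end, 2.5))
```

Setting both axis end points to the same position recovers the plain spherical crowder, so one routine can serve mixtures of both shapes.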

In a recent study (Zegarra et al., 2019), a set of crowders with different shapes was used to reproduce and explain an NMR experiment and show that the unfolded apoazurin protein becomes more extended upon addition of dextran crowders. In this work, spheres and spherocylinders of various lengths were used along with a CG protein model. The crowders that best reproduced the experimental results were elongated rodlike structures, interacting with the protein with repulsive and attractive potential terms.

Other studies have also confirmed that spherocylindrical crowders may induce different effects than spherical crowders. It was noted that in the presence of spherical crowders, the compaction of a polymer increases with decreasing crowder radii, but the effects of spherocylindrical crowders are highly non-monotonic (Chen and Zhao, 2019). Spherocylinders were also shown to increase protein oligomer formation to a noticeably greater extent (O'Brien et al., 2011).

### 3. CAPTURING ATOMISTIC DETAILS

The most detailed information about the effects of a crowded environment can be obtained when the biomolecule and crowders are represented in atomistic detail (**Figure 1C**). In this case, not only the excluded volume effects can be explored but specific interactions between the molecule and crowders can also be considered. As the level of realism increases, the question arises as to what kind of crowder molecules are best suited. The most realistic option would be to use a full model of a cytoplasm, with different proteins, nucleic acids, and metabolites (Yu et al., 2016). However, such an approach is computationally demanding and requires specific knowledge about the composition of specific cells. Other choices for all-atom crowders are small, well-studied proteins, like villin (Harada et al., 2013), protein G or trypsin inhibitor (Bille et al., 2019). Another consideration in choosing a specific crowder protein is the kind of in vitro experiment with which a given simulation should be compared.

All-atom simulations focused on the stability of protein native state in the presence of protein crowders represented in atomistic detail have suggested that crowding can promote local unfolding of the SOD1 protein (Bille et al., 2019) and destabilize the native state of villin (Harada et al., 2013). Destabilization of a pyruvate dehydrogenase subunit was also observed in a simulation of a model cytoplasm fragment (Yu et al., 2016) and was attributed to protein-protein interactions with crowder proteins. These observations are in contrast to previous studies using CG crowders, focused on the excluded volume effect, that tended to emphasize a stabilizing effect on native protein structures. This suggests that a full account of crowding effects cannot neglect the specific nature of protein-crowder interactions.

The advantages of using both all-atom and CG crowders can be combined by using a multi-scale approach. Such schemes allow, for example, a central molecule of interest to be simulated in atomistic detail, while crowders are represented with a reduced CG model that still retains protein-like characteristics (O'Brien et al., 2011; Predeus et al., 2012; Bille et al., 2015). Such mixed-resolution approaches allow simulations to be more efficient while still providing a detailed picture of protein behavior under crowded conditions. However, multi-scale approaches present challenges, e.g., with respect to how interactions between different levels of resolution are treated.

One example of such a multi-scale approach was the sampling of Trp-cage and melittin peptides (Predeus et al., 2012) in implicit solvent and in the presence of protein crowders represented with the PRIMO CG model. It was found that for both peptides, the addition of crowder molecules resulted in a more diverse conformational ensemble, with a larger share of non-native states.

In a multi-scale approach, crowders can be represented either as CG proteins or via simpler spherical molecules. Another study (Bille et al., 2015) investigating the Trp-cage conformational sampling compared an atomistic simulation with a mixed-resolution approach where an all-atom peptide was combined with spherical crowders. It was found that while the spherical crowders had almost no effect on the peptide conformations, rigid atomistic BPTI proteins used as crowders promoted non-native conformations and as a result stabilized the helical fragment. Again, this study points to the important role of non-specific peptide-crowder interactions.

A multi-scale approach was also used to study the formation of oligomers by peptides known to be amyloidogenic (O'Brien et al., 2011). In this study, the all-atom peptide model was mixed with crowders represented as spheres or spherocylinders. The authors compared the effects of different sizes, shapes, and volume fractions of the crowders. The crowders had a destabilizing effect on dimers formed by the peptides, but, surprisingly, trimers were stabilized. Moreover, it was reported that increasing crowder sizes reduced the crowding effect, while spherocylindrical crowders had a greater destabilizing effect than spherical crowders.

Apart from direct simulations, where crowders are explicitly present in the simulations, post-processing techniques have also been proposed. In this method, the protein in all-atom or CG representation and the crowders are simulated separately. The conformations obtained for the protein are then randomly placed in the snapshots of the crowder-containing trajectory and weighted based on the fraction of successful insertions. The post-processing method was applied to study the effects of crowding on protein dynamics (Qin et al., 2010), protein folding and binding stability (Qin and Zhou, 2009), and the conformational sampling of disordered proteins (Qin and Zhou, 2013).
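The insertion-and-weighting idea can be sketched with a toy hard-sphere model. The code below is a strong simplification and not the algorithm of the cited works: it assumes bead-based conformations, translations only (no rotations), hard-core overlaps, and a periodic cubic crowder snapshot; all names and numbers are hypothetical.

```python
import random

def insertion_fraction(conformation, crowders, crowder_radius, bead_radius,
                       box, n_trials=2000, rng=None):
    """Fraction of random placements of one conformation that avoid every crowder."""
    rng = rng or random.Random(0)
    contact2 = (crowder_radius + bead_radius) ** 2
    accepted = 0
    for _ in range(n_trials):
        shift = [rng.uniform(0.0, box) for _ in range(3)]
        clash = False
        for bead in conformation:
            pos = [bead[i] + shift[i] for i in range(3)]
            for c in crowders:
                d2 = 0.0
                for i in range(3):
                    d = (pos[i] - c[i]) % box
                    d = min(d, box - d)        # minimum-image distance
                    d2 += d * d
                if d2 < contact2:
                    clash = True
                    break
            if clash:
                break
        accepted += not clash
    return accepted / n_trials

def crowding_weights(fractions):
    """Normalized statistical weights of conformations under crowding."""
    total = sum(fractions)
    return [f / total for f in fractions]

if __name__ == "__main__":
    box = 100.0
    crowders = [(x, y, z) for x in (25.0, 75.0) for y in (25.0, 75.0) for z in (25.0, 75.0)]
    compact = [(0.0, 0.0, 0.0)]
    extended = [(-15.0, 0.0, 0.0), (0.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
    f = [insertion_fraction(c, crowders, 20.0, 3.0, box) for c in (compact, extended)]
    print(f, crowding_weights(f))
```

In this toy setting the extended conformation is inserted successfully less often than the compact one, so its weight drops under crowding, which is the purely excluded-volume trend that such a reweighting captures.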

The most significant challenge with running atomistic simulations of crowded systems, including explicit water, is the high demand for computer resources. Another issue is related to detailed balance between protein-protein and protein-water interactions. Modern force fields have been found to overestimate the interactions between proteins resulting in too much aggregation (Petrov and Zagrovic, 2014). One proposed solution for the CHARMM force field is to strengthen water-protein interactions by scaling the Lennard-Jones interactions (Nawrocki et al., 2017). This has led to better agreement with NMR experiments. However, irreversible aggregation artifacts are not to be confused with transient cluster formation that has been noticed in several many-protein simulations (Nawrocki et al., 2017; von Bülow et al., 2019) and is believed to be an accurate reflection of crowded solutions.
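The scaling of protein-water Lennard-Jones interactions can be illustrated as a modified combining rule. The sketch below assumes a geometric-mean rule for the pair well depth and applies a multiplicative factor only to protein-water pairs; the scaling value and parameters shown are hypothetical placeholders and do not reproduce the published parameterization.

```python
import math

def pair_epsilon(eps_i, eps_j, is_protein_water, scale=1.0):
    """Pair well depth from a geometric-mean combining rule.

    For protein-water pairs the well depth is multiplied by `scale`
    (scale > 1 strengthens protein-water attraction); all other pairs
    keep the unmodified value.
    """
    eps = math.sqrt(eps_i * eps_j)
    return scale * eps if is_protein_water else eps

if __name__ == "__main__":
    # Hypothetical well depths in kcal/mol and a hypothetical scaling factor
    print(pair_epsilon(0.1, 0.4, is_protein_water=True, scale=1.1))
    print(pair_epsilon(0.1, 0.4, is_protein_water=False, scale=1.1))
```

Restricting the change to protein-water pairs leaves protein-protein and water-water parameters untouched, so only the relative balance between solvation and association shifts.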

### 4. DISCUSSION

Simulations of crowded environments can be performed at various levels of detail with respect to both the crowder and biomolecule. CG models of a biomolecule are often combined with a reduced crowder representation such as simple spheres. When biomolecules are represented in atomistic resolution, the range of models of the crowded environment becomes wider, ranging from spherical crowders and CG proteins in multi-scale approaches to all-atom protein crowders.

The crowders of choice for many researchers are spherical repulsive particles. They can be used in a variety of simulation methods and have been tested extensively. However, recent studies have shown that such models are likely oversimplified. The shape of the crowders and the way they interact can influence the effects that the crowders exert on the molecules (O'Brien et al., 2011; Kang et al., 2015; Chen and Zhao, 2019; Zegarra et al., 2019), including how crowding affects protein diffusion (Balbo et al., 2013).

Choosing a crowder model is a matter of finding a compromise between the allocated computational resources and simulation realism. Most detailed information about the impact of the crowded environment can be obtained if both the biomolecule and crowders are represented with atomistic details. Such models can provide insight into crowding effects well beyond the simple excluded volume effect. Including all-atom crowders may be especially vital to study peptides or IDPs since the interactions with the crowders can contribute significantly to the stabilization of their conformations other than those formed in bulk water or found in crystal structures.

The impact of crowding is a sum of often counteracting effects: the excluded volume effect and non-specific interactions of a biomolecule with the crowders. The importance of each component is not easy to predict as it may be case dependent (Rivas and Minton, 2016). With each level of reducing the representation of crowders, information about protein-crowder interactions is gradually lost, which is the main source of possible inaccuracies of simulations using CG crowders.

It has been shown that various sizes, concentrations, and shapes of CG crowders may differently influence the dynamics, interaction and diffusion of biomolecules. Therefore, the decision about the type of crowders is important and depends on the problem and questions that are being investigated, as well as the experiments with which simulations are being compared. Using atomistic representation is especially important while investigating the internal dynamics of biomolecules to compare with high-resolution structural experiments such as NMR spectroscopy. On the other hand, lower-resolution crowders may be sufficient to compare with experiments that emphasize non-biological space-filling crowders where the exact molecular nature is not as critical. One promising approach to account for both interaction details and reduce computational costs involves the use of mixtures of crowders with diverse properties. This may include crowders of different shapes, like spherical and spherocylindrical crowders (Kang et al., 2015), or a mixture of protein crowders such as the streptococcal protein G and the chicken villin head piece (Harada et al., 2013).

Finally, another question to consider while designing simulations of crowded environments is whether solvent needs to be accounted for explicitly. Explicit water typically requires fully atomistic simulations or high-level CG models although explicit water has also been combined with bead-shell crowders (Latshaw et al., 2014). On the other hand, if implicit solvent models are applied, questions about how to account for hydrodynamic effects arise (Ando and Skolnick, 2010; Długosz et al., 2011).

According to our benchmarks, surrounding a 236 amino-acid protein (in implicit solvent) with CG crowders has little to no effect on the simulation time. However, adding all-atom crowders (216-atom PEGs) to the same system slows down computations 3–5 times. For solvent treated explicitly, adding CG crowders can make the simulation faster because the crowders possess fewer atoms than the water molecules that occupy a similar volume. For example, 43-atom bead-shell crowders added to a protein-explicit solvent system (at 20% vol.) speed up the simulation by 20% as compared to simulations without crowders.
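The particle-count argument behind the explicit-solvent speed-up can be made concrete with a rough estimate. The sketch assumes the number density of liquid water (about 0.0334 molecules per Å³), a three-site water model, and a hypothetical crowder radius of 12 Å, none of which are stated in the benchmark description.

```python
import math

WATER_NUMBER_DENSITY = 0.0334   # molecules per Å^3, liquid water near 300 K

def displaced_water_atoms(crowder_radius, sites_per_water=3):
    """Approximate number of water atoms that fit in the volume of one crowder."""
    volume = 4.0 / 3.0 * math.pi * crowder_radius ** 3
    return WATER_NUMBER_DENSITY * volume * sites_per_water

def particle_ratio(crowder_radius, beads_per_crowder=43):
    """How many water atoms each crowder bead replaces on average."""
    return displaced_water_atoms(crowder_radius) / beads_per_crowder

if __name__ == "__main__":
    r = 12.0  # Å, hypothetical bead-shell crowder radius
    print(f"water atoms displaced: {displaced_water_atoms(r):.0f}")
    print(f"water atoms per crowder bead: {particle_ratio(r):.1f}")
```

Under these assumptions each crowder removes several hundred water atoms while adding only 43 particles, which is qualitatively consistent with the reported speed-up.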



### AUTHOR CONTRIBUTIONS

NO performed literature search. All authors wrote the manuscript, read, and approved the submitted version.

### FUNDING

The authors acknowledge support from the National Science Centre, Poland (UMO-2016/23/B/NZ1/03198 to NO and JT), the US National Institutes of Health (grant R35GM126948 to MF), and the US National Science Foundation (grant MCB 1817307 to MF). JT acknowledges support from the Kosciuszko Foundation.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ostrowska, Feig and Trylska. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Global Dynamics of Yeast Hsp90 Middle and C-Terminal Dimer Studied by Advanced Sampling Simulations

Florian Kandzia, Katja Ostermeir and Martin Zacharias\*

*Physics Department T38, Technical University of Munich, Garching, Germany*

The Hsp90 protein complex is one of the most abundant molecular chaperone proteins and assists in the folding of a variety of client proteins. During its functional cycle it undergoes large domain rearrangements coupled to the hydrolysis of ATP and association or dissociation of domain interfaces. In order to better understand the domain dynamics, comparative Molecular Dynamics (MD) simulations of a sub-structure of Hsp90, the dimer formed by the middle (M) and C-terminal domain (C), were performed. Since this MC dimer lacks the ATP-binding N-domain, it allows studying global motions decoupled from ATP binding and hydrolysis. Conventional (c)MD simulations starting from several different closed and open conformations resulted in only limited sampling of global motions. However, the application of a Hamiltonian Replica exchange (H-REMD) method based on the addition of a biasing potential extracted from a coarse-grained elastic network description of the system allowed much broader sampling of domain motions than the cMD simulations. With this multiscale approach it was possible to extract the main directions of global motions and to obtain insight into the molecular mechanism of the global structural transitions of the MC dimer.

### Edited by:

*Alexandre M. J. J. Bonvin, Utrecht University, Netherlands*

### Reviewed by:

*Carlo Camilloni, University of Milan, Italy Luca Monticelli, Centre National de la Recherche Scientifique (CNRS), France*

> \*Correspondence: *Martin Zacharias martin.zacharias@ph.tum.de*

### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

Received: *13 June 2019* Accepted: *11 September 2019* Published: *27 September 2019*

### Citation:

*Kandzia F, Ostermeir K and Zacharias M (2019) Global Dynamics of Yeast Hsp90 Middle and C-Terminal Dimer Studied by Advanced Sampling Simulations. Front. Mol. Biosci. 6:93. doi: 10.3389/fmolb.2019.00093*

Keywords: Hsp90 conformational dynamics, biasing potential REMD, advanced sampling simulations, sampling global dynamics, multi-scale dynamics sampling, Hsp90 chaperone function

### INTRODUCTION

The 90 kDa heat-shock protein (Hsp90) is an essential molecular chaperone protein that plays a vital role in the folding process of several client proteins (Hunter and Poon, 1997; Mayer and Bukau, 1999; MacLean and Picard, 2003; Pratt and Toft, 2003; Prodromou and Pearl, 2003; Pratt et al., 2004). It is found in bacteria as well as eukaryotes, is essential for cell viability, and plays a pivotal role in many signaling and regulation pathways (Echeverría et al., 2011). In its active conformation it forms a homodimer and its chaperone activity depends on ATP binding and hydrolysis. During its work cycle Hsp90 and its homologs (e.g., HtpG, Grp94, Trap1) can adopt different global conformations covering a range of tightly bound closed structures up to widely open conformations (Harris et al., 2004; Ali et al., 2006; Shiau et al., 2006; Dollins et al., 2007; Lavery et al., 2014; Verba et al., 2016). For example, the crystal structure of yeast Hsp90 bound to a non-hydrolysable ATP analog (AMPPNP) indicates a closed homodimer with domain contacts between the N-terminal (N)-domain and C-terminal (C)-domain of each monomer (the N- and C-domains in each monomer are connected by a middle (M)-domain) (Ali et al., 2006). A similar structure was obtained for a complex of yeast Hsp90 and a kinase client protein (Verba et al., 2016). Furthermore, structures of a paralog, Grp94, from the mammalian endoplasmic reticulum (ER) (Dollins et al., 2007) and of HtpG (bacterial homolog) (Shiau et al., 2006) are known. Grp94 adopts a slightly more open structure compared to yeast Hsp90, and the HtpG homolog is dramatically more open. Studies employing small-angle X-ray scattering (SAXS) indicate that the Hsp90 conformation depends significantly on the bound nucleotide. A recent single-molecule FRET (fluorescence resonance energy transfer) study used a large number of fluorescence donor and acceptor pairs to obtain ensembles of Hsp90 conformations in the apo state and in the presence of ADP or AMPPNP (Hellenkamp et al., 2016). These studies confirmed the known Hsp90 structure in the presence of AMPPNP and indicated a more open structure and reorientation of the N-domain compared to the closed conformation when ADP is bound or for the apo state (Hellenkamp et al., 2016). Based on restrained MD simulations, atomistic structural models compatible with the sFRET data for the open yeast Hsp90 state in the presence of ADP have been obtained (Hellenkamp et al., 2016). However, the mechanism by which ATP hydrolysis (or loss of a bound nucleotide) can trigger global domain rearrangements is still not clear. The observation that a loss of the interaction between the N-domains results in a global opening indicates that the closed (ATP-bound) form corresponds to a structure under global stress (the unfavorable global deformation away from the open form is stabilized by the N-domain binding). Hence, a removal of the N-domain interaction should result in an opening of the Hsp90 structure. Indeed, the crystal structure of a truncated form of Hsp90 (without the N-domains) still shows the same C-domain dimerization contacts as the full Hsp90 structure but an increased distance between the M-domains toward a more open global conformation (**Figure 2D**). However, the degree of opening is significantly smaller compared to the "open" ADP-bound structure based on the sFRET data (**Figure 2E**). The origin of this discrepancy could be crystal contacts that may stabilize only one type of global conformation among other global arrangements that are accessible in free solution.

In order to elucidate the global conformational flexibility of the yeast Hsp90 MC dimer (Hsp90 without N-domain) in solution we performed a series of comparative Molecular Dynamics (MD) simulations starting from different initial conformations. The initial structures corresponded to the known crystal structure of the yeast Hsp90 MC dimer, the crystal structure of the closed full Hsp90 (AMPPNP-bound), a start structure derived from single-molecule FRET, and an arrangement based on a bacterial homolog (in a wide open geometry) (Ali et al., 2006; Shiau et al., 2006; Hellenkamp et al., 2016). The transition between open (ADP-bound) and closed (ATP-bound) states of the full-length Hsp90 occurs on the µs to ms time scale (Hellenkamp et al., 2016). However, since the truncated N-domain is the primary interaction partner that stabilizes the closed form, global transitions and conformational relaxations in the MC dimer are expected to occur much faster than in the full structure, which allows the associated molecular details of the global domain motions to be identified.

In order to further enhance the sampling of global motions we also performed Hamiltonian replica exchange (H-REMD) simulations coupled with an elastic network model (ENM) description of the MC dimer. A low resolution representation of protein dynamics can be obtained using coarse-grained elastic network models (ENM) to extract directions of global mobility (Bahar and Rader, 2005; Bastolla, 2014). Recently, we have developed a H-REMD approach that uses information from an ENM analysis and combines it with atomistic MD simulations in explicit solvent (Ostermeir and Zacharias, 2014).

The approach forms an effective multi-scale methodology in which directions of large-scale global conformational transitions (extracted from a low-resolution technique) can guide and enhance the high-resolution atomistic sampling of the multidomain structure.
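How an elastic network model yields directions of global mobility can be sketched with a generic anisotropic network model built on Cα-like positions. This is a textbook construction and not necessarily the ENM variant used in the work described here; the cutoff, the uniform spring constant and the random test coordinates are assumptions.

```python
import numpy as np

def anm_modes(coords, cutoff=15.0, gamma=1.0):
    """Eigenvalues and eigenvectors of an anisotropic network model Hessian.

    Nodes closer than `cutoff` are joined by identical springs of strength
    `gamma`. Eigenvalues are returned in ascending order; the first six
    correspond to rigid-body translation and rotation, and the following
    low-frequency modes describe the softest collective motions.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / r2
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return np.linalg.eigh(hessian)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xyz = rng.uniform(0.0, 20.0, size=(20, 3))   # stand-in for Cα coordinates
    values, vectors = anm_modes(xyz, cutoff=100.0)
    softest = vectors[:, 6].reshape(-1, 3)       # per-node displacement of the first internal mode
    print(values[:8])
    print(softest.shape)
```

The displacement vectors of the softest internal modes are the kind of low-resolution information that can then be used to define a bias along collective directions in an atomistic simulation.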

Indeed, the unrestrained MD simulations starting from different initial conformations of the Hsp90 MC dimer with globally different initial domain arrangements sampled only conformations relatively close to the starting structures on a time scale of 200 ns. On the other hand, the ENM-coupled REMD methodology sampled a much wider range of domain arrangements including relatively close but also more open Hsp90 MC dimer structures.
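The exchange step that underlies Hamiltonian replica exchange can be sketched with the standard Metropolis criterion. The example assumes that all replicas run at the same temperature and differ only in their potential energy function (for instance by a bias term); the function name and the energies in the usage example are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def exchange_probability(u_i_xi, u_i_xj, u_j_xi, u_j_xj, temperature=300.0):
    """Acceptance probability for swapping configurations x_i and x_j.

    u_a_xb is the energy of configuration x_b evaluated with the
    Hamiltonian of replica a. When replicas differ only by a bias term,
    the unbiased force-field energies cancel and only bias energies
    need to be supplied.
    """
    beta = 1.0 / (K_B * temperature)
    delta = beta * ((u_i_xj + u_j_xi) - (u_i_xi + u_j_xj))
    return 1.0 if delta <= 0.0 else math.exp(-delta)

if __name__ == "__main__":
    # Hypothetical bias energies in kcal/mol
    print(exchange_probability(u_i_xi=0.0, u_i_xj=0.0, u_j_xi=2.0, u_j_xj=1.0))
```

Accepted swaps let conformations generated under a strong bias migrate to the unbiased reference replica, from which the reported ensemble is collected.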

The results indicate, firstly, that the ENM-REMD method is an efficient multi-scale enhanced sampling technique offering improved sampling compared to regular MD simulations. Secondly, our simulations demonstrate that in the absence of the N-domains the Hsp90 dimer can adopt a variety of closed and open domain arrangements that might be of functional importance for chaperone function. One function of the N-domain might be to limit these possible states by N-domain dimerization that is controlled by the bound ATP or ADP nucleotide.

### MATERIALS AND METHODS

Four model structures of the MC dimer of yeast Hsp90 were built corresponding to published structures of Hsp90 and its homologs [pdb2cg9 (Ali et al., 2006), pdb2cge (Ali et al., 2006), pdb2ioq (Shiau et al., 2006), and the mean open structure as determined by Hellenkamp et al. (2016)]. Since the 2cge structure corresponds to an Hsp90 MC domain dimer, the published structure served directly as start structure representing the 2cge-model. For the models based on the closed 2cg9 full length structure and the full length mean open structure, the N-terminal domain segments up to the start of the middle domain (residues 1–236) were removed, forming the 2cg9- and sFRET-models, respectively. A starting model based on the bacterial homolog 2ioq was generated by superimposing the M- and C-domains from the yeast 2cge structure onto the corresponding conserved elements of the 2ioq homolog using Pymol (Schrodinger, 2015), resulting in a start structure with an overall Cα-Rmsd value of 3.4 Å relative to the corresponding elements in the 2ioq structure. The structures were solvated in octahedral boxes with explicit water molecules (TIP3P) (Jorgensen et al., 1983) and neutralized with chloride and sodium ions up to an ion concentration of 0.1 M using the leap module and employing the parm14SB force field (Maier et al., 2015) of the Amber14 package. All unrestrained simulations were performed using the pmemd.cuda code of the Amber14 package (Case et al., 2014). The start structures were energy minimized using steepest descent and conjugate gradient methods (5,000 steps), and slowly heated up to 300 K in a 500 ps NVT run using Langevin dynamics, while restraining all heavy atoms with respect to the start structure. During another 150 ps the positional restraints were reduced in a step-wise manner, allowing the system to relax. The systems were further equilibrated for 200 ps at constant pressure using a Berendsen barostat, followed by a 200 ns data gathering period.
During all simulations the Particle Mesh Ewald (PME) method was used to calculate long range electrostatic interactions (Darden et al., 1993) with a real space cutoff radius of 9 Å. The Shake algorithm was used to constrain bonds involving hydrogen atoms (Ryckaert et al., 1977), which allowed employing a time step of 2 fs.
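
As a rough illustration of the setup arithmetic above, the following Python sketch estimates how many salt ion pairs correspond to 0.1 M in a given solvent volume and how many integration steps a 200 ns run requires at a 2 fs time step. The function names and the example box volume are hypothetical and not taken from this study, and the leap module may count and place ions differently.

```python
AVOGADRO = 6.02214076e23          # particles per mol
LITRE_PER_CUBIC_ANGSTROM = 1e-27  # 1 Å^3 = 1e-30 m^3 = 1e-27 L


def ion_pairs_for_concentration(volume_A3, molarity=0.1):
    """Number of salt ion pairs that gives `molarity` (mol/L) in `volume_A3` (Å^3)."""
    return molarity * AVOGADRO * volume_A3 * LITRE_PER_CUBIC_ANGSTROM


def md_steps(length_ns, timestep_fs=2.0):
    """Number of integration steps for a run of `length_ns` nanoseconds."""
    return round(length_ns * 1e6 / timestep_fs)  # 1 ns = 1e6 fs


if __name__ == "__main__":
    # Hypothetical solvent volume of 1e6 Å^3 (assumption for illustration only).
    print(ion_pairs_for_concentration(1.0e6))
    print(md_steps(200.0))
```

For a hypothetical solvent volume of 10<sup>6</sup> Å<sup>3</sup> this gives about 60 ion pairs, and a 200 ns production run corresponds to 10<sup>8</sup> steps.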

For the ENM-coupled H-REMD simulations we followed a published protocol. The H-REMD simulations involve a biasing potential that acts between domain centroids of the multi domain protein. In case of the Hsp90 MC dimer, 4 centroids representing the centers (based on the protein Cα atoms) of the four domains were used. The protein conformational fluctuations are calculated by means of an elastic network model for the protein Cα atoms based on Hinsen (1998) and following the protocol described in Ostermeir and Zacharias (2014). The first 50 normal modes were excited by a thermal energy of RT (R: gas constant; T: temperature = 300 K), which reflects the possible distance fluctuations between the domains from the ENM analysis. In the ENM-coupled REMD approach the biasing potential is generated to specifically enhance structural changes in the REMD simulations compatible with the fluctuations obtained from the ENM and to destabilize the domain arrangement along the centroid distances dij (i,j are centroid labels).
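
The centroid representation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each domain is given as a list of Cα atom indices; the function names and the domain definitions are hypothetical and are not the ones used in the published protocol.

```python
from itertools import combinations

import numpy as np


def domain_centroids(ca_coords, domain_indices):
    """Geometric center of the C-alpha atoms of each domain.

    ca_coords: array of shape (n_atoms, 3); domain_indices: list of index lists.
    """
    ca = np.asarray(ca_coords, dtype=float)
    return np.array([ca[list(idx)].mean(axis=0) for idx in domain_indices])


def centroid_distances(centroids):
    """All pairwise centroid-centroid distances d_ij, keyed by (i, j) with i < j."""
    c = np.asarray(centroids, dtype=float)
    return {
        (i, j): float(np.linalg.norm(c[i] - c[j]))
        for i, j in combinations(range(len(c)), 2)
    }
```

With four domain centroids, as used for the MC dimer, this yields six pairwise distances on which biasing potentials could act.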

$$\begin{aligned} V\left(d_{ij}\right) &= k \left( \left[d_{ij} - d_{ij,0}\right]^2 - \Delta d_{ij}^2 \right)^2, \text{ if } \left| d_{ij} - d_{ij,0} \right| \le \Delta d_{ij},\\ V\left(d_{ij}\right) &= 0, \text{ otherwise} \end{aligned}$$
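
The piecewise biasing potential given above can be written as a short Python function. This is a minimal sketch for illustration only: the parameter names (`d0` for the reference centroid distance, `delta_d` for the half-width, `k` for the force constant) are our own labels, and the numerical values in the usage note are arbitrary.

```python
import numpy as np


def biasing_potential(d, d0, delta_d, k):
    """V(d) = k * ((d - d0)^2 - delta_d^2)^2 for |d - d0| <= delta_d, else 0."""
    d = np.asarray(d, dtype=float)
    x = d - d0
    inside = np.abs(x) <= delta_d
    return np.where(inside, k * (x**2 - delta_d**2) ** 2, 0.0)


def barrier_height(delta_d, k):
    """Maximum of the bias, reached at d == d0: k * delta_d^4."""
    return k * delta_d**4
```

The potential is largest at the reference distance and decays smoothly to zero at the edges of the interval, so it penalizes the current domain arrangement and pushes the centroid distance away from its reference value within the range compatible with the ENM fluctuations. For example, with the arbitrary values `k = 0.1` and `delta_d = 2.0` the bias at the reference distance is 1.6 in the units of `k`.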

In the H-REMD one reference replica was run under the control of the original force field whereas the centroid-centroid distance dependent biasing potentials (BPs) were added with increasing amplitudes in each of the 11 other replicas (total replica number: 12). A replica exchange was attempted every 2 ps. The magnitudes and the width of the biasing potentials in the replicas were adjusted during the simulations, with a starting biasing level of 2.25 kcal/mol (corresponding to ∼4 RT) between replicas at the beginning. Centroid-centroid distances were updated every 0.2 ns from the running average of the last 0.4 ns. The BP-amplitude was also adjusted every 0.2 ns to optimize the acceptance rate of replica exchanges. If the acceptance probability for exchanges between neighbors dropped below 20% (or exceeded 60%) in any of the replicas, the BPs were lowered (or increased) by 10%. After the first 5 ns of the H-REMD the biasing levels stabilized at ∼0.45 RT between replicas and remained constant for the rest of the simulation within a standard deviation of ± 0.12 RT. The REMD simulations were extended for 25 ns. More details on the ENM coupled REMD methodology are given in Ostermeir and Zacharias (2014). Simulation results were analyzed by means of the cpptraj module of Amber14 (Case et al., 2014).
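
The adaptive adjustment of the biasing level described above can be illustrated with a small Python sketch. It assumes that a single biasing level is scaled by 10% whenever the neighbor acceptance rate leaves the 20–60% window; the function names are hypothetical, and the actual implementation in the published protocol may differ in detail.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def in_rt_units(energy_kcal, temperature=300.0):
    """Express an energy given in kcal/mol in units of RT."""
    return energy_kcal / (R_KCAL * temperature)


def adjust_bias(level, acceptance, low=0.20, high=0.60, step=0.10):
    """Lower the biasing level if exchanges are too rare, raise it if too frequent."""
    if acceptance < low:
        return level * (1.0 - step)
    if acceptance > high:
        return level * (1.0 + step)
    return level
```

At 300 K, RT is about 0.596 kcal/mol, so the starting level of 2.25 kcal/mol corresponds to roughly 3.8 RT, in line with the ∼4 RT quoted above.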

### RESULTS AND DISCUSSION

The Hsp90 chaperone homodimer undergoes dramatic global conformational changes during its working cycle that are accompanied by binding of the N-domains (if ATP is bound) or dissociation of the N-domains (ADP-bound and apo states). During the functional cycle the C-domains always stay in a bound state (**Figure 1**). How ATP hydrolysis triggers N-domain dissociation and the subsequent global opening is not fully understood. If N-domain dissociation triggers global opening, removal of the N-domains should also allow large scale global motions in the truncated Hsp90 MC dimer. Comparative MD simulations starting from closed, intermediate open, and fully open conformations were used to investigate the global mobility of the MC dimer. As described in the Methods section, the start structures corresponded to the crystal structure of the Hsp90 MC dimer (pdb2cge), which can be considered a semi-open conformation (**Figure 2D**) compared to the structure found in the closed state (pdb2cg9), which formed another start conformation (**Figure 2D**). More open states are based on the recent single molecule FRET analysis in the presence of ADP (Hellenkamp et al., 2016), termed sFRET-conformation (**Figure 2E**), and another open conformation based on the bacterial homolog (pdb2ioq) (**Figure 2F**). For each unrestrained MD simulation, we recorded the deviation with respect to each of the four reference structures (**Figure 3**).
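
For readers unfamiliar with this type of analysis, the deviation from a reference structure can be sketched as a root-mean-square deviation after optimal rigid-body superposition (Kabsch algorithm). The Python function below is a generic illustration, assuming equal-length coordinate arrays of matching backbone atoms; it is not the cpptraj routine used for the analysis reported here.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between coordinate sets P and Q (n_atoms x 3) after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)              # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                        # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff**2).sum() / len(P)))
```

Evaluating such a function for every trajectory frame against each of the four start structures would give time series of the kind described above.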

Starting from the closed conformation (2cg9-start) results in an almost constant root-mean square deviation (RMSD of the backbone) during the simulation time (**Figure 3**). On the time scale of the simulation no tendency for global opening is observed (the RMSD with respect to the more open start structures remains constant). Similarly, when starting from the 2cge reference only small shifts in the RMSD are observed, although the fluctuations in the RMSD are clearly larger compared to the simulations based on the 2cg9-start. Again, little tendency toward larger global changes (closer approach of the more open conformations) is observed. On the contrary, the RMSD with respect to the more closed 2cg9-model slightly decreases over the simulation time. Note that in both the 2cge and the 2cg9 conformations the M-domains are in close contact with the C-domains, mediated by two large loop regions that are part of the C-domains (**Figure 1**) and that may transiently stabilize the global arrangement.

For the simulations starting from the open conformations, 2ioq and sFRET(ADP), the Rmsd decreases with respect to the semi-open conformation of 2cge, while it increases with respect to the open conformations of 2ioq and sFRET. This is caused by a motion with an overall slight tendency toward closing, but without exactly reaching the 2cge structure (Rmsd > 8–10 Å). Note that in the open form the contact between the loop region of the C-domain and the M-domain is largely missing, which may allow greater global mobility.

In contrast to the unrestrained MD simulations (starting from the 2cge-structure) the reference replica of the ENM coupled REMD simulations showed much larger changes in the Rmsd over time (and on shorter time scale, **Figure 3**) with respect to all reference structures. Note, that the rapid changes observed in the [**Figure 3** (panel labelled ENM)] are due to the exchanges between conformations in neighboring replicas indicating that a great variety of different conformations is sampled in this scheme. During the ENM coupled REMD distance-dependent biasing potentials are derived from the ENM analysis that act between the domain segments (C1,..,C5; illustrated in **Figure 2**). These potentials promote motions in the soft collective directions of the system (in the replicas) and result in enhanced domain motion sampling (without providing any preset reaction coordinate for the global motions of the domains such as domain angles or dihedrals).

It is of interest to identify the origin of conformational fluctuations. One possible source is local conformational fluctuations within each MC monomer. One way to analyze the flexibility of each monomer is to investigate the mean fluctuations (RMSF) along each monomer (**Figure 3**) and the buried surface area along the sequence (**Supplementary Figure 1**). The pattern of conformational fluctuations looks qualitatively similar in all simulations, with 4 regions indicating enhanced local mobility compared to the mean of the structures (**Figure 3**). In each of the simulations the pattern is similar for both monomers (compare orange and blue lines in **Figure 3**). In general, the magnitude of fluctuations is smaller for the 2cg9- and 2cge-simulations compared to the simulations starting from the open Hsp90 MC dimers or the ENM coupled REMD simulations. The more flexible regions are highlighted in light and dark gray. In the M domain an amphipathic (III) and a flexible loop (IV) create strong interfaces with the opposite monomer that stabilize the closed conformations (**Figure 3**). In the simulations starting from the open conformations these loops are highly dynamic (**Figure 3**), whereas they are much less mobile in the simulations starting from more closed MC dimers. It is possible that the interaction of the loop IV region with the M-domain may initiate closing movements in Hsp90. Other regions that show significant differences in mobility are located closer to the N-terminus (regions III, IV, **Figure 3**). Also, the M-domain shows more fluctuations in the 2ioq, sFRET, and the ENM-coupled REMD simulations compared to the 2cg9 and 2cge MC dimer simulations. The C-terminal part of the dimer appears to be more rigid and exhibits only low mobility in all simulations and for all starting conformations.
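
The per-residue fluctuation measure discussed above can be illustrated with a short Python sketch. It assumes a trajectory array of Cα coordinates whose frames have already been superimposed on a common reference; the function name and the array layout are our own choices and do not reflect the cpptraj workflow used for the reported analysis.

```python
import numpy as np


def rmsf(trajectory):
    """Root-mean-square fluctuation per atom.

    trajectory: array of shape (n_frames, n_atoms, 3), frames pre-aligned.
    Returns an array of length n_atoms.
    """
    traj = np.asarray(trajectory, dtype=float)
    mean_structure = traj.mean(axis=0)
    sq_dev = ((traj - mean_structure) ** 2).sum(axis=2)  # (n_frames, n_atoms)
    return np.sqrt(sq_dev.mean(axis=0))
```

Plotting the returned values against the residue number for each monomer gives fluctuation profiles of the type compared in this paragraph.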

The magnitude of the local RMSF of residues along each monomer cannot explain the large RMSD shifts and changes observed especially in the ENM coupled REMD simulations. As a next step we analyzed the global conformational fluctuations observed in the complete MC dimer. The global opening angle and the torsional dihedral angles described by the four domains (illustrated in **Figure 2**) might be considered as most useful and intuitive variables to illustrate and analyze the global domain motion. With respect to these variables the 2cg9 simulation indicates the least global mobility on the present simulation time scale. Apparently, it is locked in a locally stable arrangement that allows only limited local as well as global motions (on the present 200 ns time scale). More global mobility is observed for the 2cge case and even broader distributions are found for the simulations starting from the sFRET and from the 2ioq start structures (**Figure 4**). However, only the latter unrestrained simulations show some overlap of sampled global variables but there is no overlap between states sampled in the 2cge and 2cg9 simulations and those starting from the open model structures. Apparently, there are barriers between states or a low diffusivity on the global energy landscape that prevents the observation of global transitions in the unrestrained MD simulations on the 200 ns time scale.

FIGURE 4 | Global domain sampling recorded during MD simulations and ENM coupled REMD simulations in terms of a Hsp90 MC dimer angle and dihedral torsion (defined using the center coordinates c1, c2, c3, c4 shown in Figure 2A; angle formed by c1, (c2, c3), c4; dihedral torsion formed by c1, c2, c3, c4). (A) Each data point represents a sampled domain arrangement. (brown dots) Reference replica of the ENM coupled REMD (25 ns), (green dots) unrestrained MD simulation starting from 2cge structure (200 ns), (orange dots) simulation starting from MC dimer extracted from 2cg9 structure (200 ns), (purple dots) MD simulation starting from sFRET derived open Hsp90 conformation (200 ns), (red dots) sampling obtained from MD simulation starting from arrangement in bacterial homolog pdb2ioq (200 ns). (B–D) characteristic structural domain changes observed during the ENM coupled REMD simulation indicated by a snapshot superimposed on the start structure. (B) Within the M-domain a hinge like motion of the upper part vs. lower part of the M-domain is observed (indicated by a red double arrow). (C) The change of one M-domain relative to the C-domain is associated to a movement of the connecting helix (red arrow) (D) within the C-domains the C helix 2 (VI) segments (see Figure 2) undergo large scale motions (highlighted in green and indicated as red arrows) relative to the position in the start structure.
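
The two global variables can be computed from the four domain centers with a few lines of Python. The sketch below rests on one interpretation of the Figure 4 definition, namely that the opening angle is measured at the midpoint of c2 and c3 between the directions to c1 and c4; this reading, the function names and the sign convention of the dihedral are assumptions made for illustration.

```python
import numpy as np


def opening_angle(c1, c2, c3, c4):
    """Angle (degrees) at the midpoint of c2 and c3 between the vectors to c1 and c4."""
    c1, c2, c3, c4 = (np.asarray(c, dtype=float) for c in (c1, c2, c3, c4))
    m = 0.5 * (c2 + c3)
    u, v = c1 - m, c4 - m
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))


def dihedral(c1, c2, c3, c4):
    """Signed dihedral angle (degrees) defined by the four points c1-c2-c3-c4."""
    c1, c2, c3, c4 = (np.asarray(c, dtype=float) for c in (c1, c2, c3, c4))
    b1, b2, b3 = c2 - c1, c3 - c2, c4 - c3
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    y = np.dot(b2, np.cross(n1, n2)) / np.linalg.norm(b2)
    x = np.dot(n1, n2)
    return float(np.degrees(np.arctan2(y, x)))
```

Applying both functions to the domain centers of every saved frame gives one point per frame in the angle/dihedral plane, which is the kind of scatter representation summarized in Figure 4A.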

However, a much broader range of the global opening angle and the global dihedral torsion angle of the domains is sampled in the ENM coupled REMD simulations. The sampling overlaps very well with the states sampled in the 2cge simulation (it also started from the 2cge MC dimer structure), and also at least partially overlaps with the sampling seen in the simulations that started from the sFRET-derived start arrangement and in the 2ioq-based simulations, but it also covers many more arrangements (**Figure 4**). Since the ENM coupled REMD technique involves an active driving force along the global variables (in the present case the domain distances, and in the higher replicas of the REMD), it can more easily overcome small energy barriers, and global motions are slowed down much less by low domain diffusivity. The low diffusivity can be caused by many transiently stable interactions of equivalent stability that need to be continuously disrupted and re-established during diffusive global motion. The broad sampling observed in the reference replica of the ENM coupled REMD indicates that large regions in the space of the two global variables, corresponding mostly to open conformational states, are in principle accessible (are of equivalent free energy).

Interestingly, even in the REMD simulations (but similarly also in the 2cge based unrestrained MD-simulations) no conformations that closely approach the 2cg9 structure were sampled. It is possible that the transition to the closed 2cg9 form involves a significant energy barrier and a simultaneous rearrangement of interface residues between the C-domain loop segment and the M-domain that was not sampled during the relatively short ENM coupled REMD simulations. This assumption is further supported by the relatively small and very confined region in the two global variables that is sampled when starting from the 2cg9 conformation. Such confined sampling indicates the existence of energy barriers that prevent dissociation processes from triggering global opening motions. Indeed, a comparison of the 2cg9 and 2cge start structures indicates several additional contacts in the 2cg9 case. The disruption of these contacts may cause the energy barrier. Vice versa, simulations starting from the 2cge (or other more open forms) face a penalty to form the correct contacts between M-domains before reaching the most closed 2cg9 state. Future ENM coupled REMD simulations or other advanced sampling techniques starting from the 2cg9 structure might be useful to investigate such putative energy barriers.

The simulation results can also be used to structurally characterize local changes that might be coupled to the observed global domain motions (**Figure 4**). The sampled open conformations of the MC dimer in the ENM-REMD indicate large local conformational changes especially in the helix connecting the M and C-domain (region V in **Figure 2**) that partially unfolds during domain opening motions (**Figure 4**). In addition, large motions of the C-helix 2 (region VI in **Figure 2**) are observed in the sampled states that represent more open domain arrangements in the REMD run (**Figure 4**). It is indeed this C-helix 2 region VI that mediates contacts between C-domains and between C-domains and the M-domains in the closed form (see **Figure 2C**). The interaction is partially broken in the 2cge form and largely lost in the open forms (compare **Figures 2C,F**) as well as in the ENM-coupled H-REMD simulations.

### CONCLUSIONS

Depending on the nucleotide-bound state Hsp90 can adopt different global domain arrangements. The stability of the domain arrangements is controlled by the binding of nucleotides to the N-domain. In the present simulations, different locally stable domain arrangements of the Hsp90 MC dimer (lacking the N-domain) were also observed that do not undergo transitions in standard MD-simulations on the time scale of 200 ns. This indicates that not only N-domain interactions but also interactions of the other domains influence the global Hsp90 structure. The ENM-REMD technique, which combines an atomistic description of the system with globally mobile directions observed in a coarse-grained ENM, was shown to sample the globally accessible space for the Hsp90 MC dimer more effectively. Future applications of the technique to the Hsp90 molecule including the N-domains could be useful to elucidate global motions in the full Hsp90 molecule.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the manuscript/**Supplementary Files**.

### AUTHOR CONTRIBUTIONS

MZ designed and supervised research. FK performed MD simulations and analyzed data. KO performed replica exchange simulations. All authors contributed to writing of the manuscript.

### FUNDING

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) through SFB863/project A10. We also acknowledge the Leibniz Supercomputing Centre (LRZ) for providing supercomputer support through grants pr48ko and pr74bi.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2019.00093/full#supplementary-material


### REFERENCES


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kandzia, Ostermeir and Zacharias. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# MARTINI-Based Protein-DNA Coarse-Grained HADDOCKing

Rodrigo V. Honorato<sup>1,2†</sup>, Jorge Roel-Touris<sup>1†</sup> and Alexandre M. J. J. Bonvin<sup>1\*</sup>

<sup>1</sup> Faculty of Science–Chemistry, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, Netherlands, <sup>2</sup> Brazilian Biosciences National Laboratory (LNBio), Brazilian Center for Research in Energy and Materials (CNPEM), Campinas, Brazil

Modeling biomolecular assemblies is an important field in computational structural biology. The inherent complexity of their energy landscape and the computational cost associated with modeling large and complex assemblies are major drawbacks for integrative modeling approaches. The so-called coarse-graining approaches, which reduce the degrees of freedom of the system by grouping several atoms into larger "pseudo-atoms," have been shown to alleviate some of those limitations, facilitating the identification of the global energy minima assumed to correspond to the native state of the complex, while making the calculations more efficient. Here, we describe and assess the implementation of the MARTINI force field for DNA into HADDOCK, our integrative modeling platform. We combine it with our previous implementation for protein-protein coarse-grained docking, enabling coarse-grained modeling of protein-nucleic acid complexes. The system is modeled using MARTINI topologies and interaction parameters during the rigid body docking and semi-flexible refinement stages of HADDOCK, and the resulting models are then converted back to atomistic resolution by an atom-to-bead distance restraints-guided protocol. We first demonstrate the performance of this protocol using 44 complexes from the protein-DNA docking benchmark, which shows an overall ∼6-fold speed increase and maintains similar accuracy as compared to standard atomistic calculations. As a proof of concept, we then model the interaction between the PRC1 and the nucleosome (a former CAPRI target in round 31), using the same information available at the time the target was offered, and compare all-atom and coarse-grained models.

Keywords: docking, biomolecular complexes, nucleic acids, coarse-graining, force field

## INTRODUCTION

Protein-DNA interactions play essential roles in cellular processes such as gene expression, regulation, transcription, DNA repair, or chromatin packaging in eukaryotes (Pandey et al., 2019). Computational docking, commonly referred to as prediction of the three-dimensional (3D) structure of a complex given the structures of its free constituents, has been extensively proven as an ideal complement to experimental structural methods in order to accurately model biomolecular complexes (Rodrigues and Bonvin, 2014). Even though computational modeling approaches have steadily progressed in the past decade (Janin, 2010), modeling large biomolecular assemblies still remains a challenge. In other words, applications to either large individual interactors or a high number of interactors are limited by the significant computational cost of thoroughly sampling the complex and intricate conformational landscapes and by the increased difficulty of identifying near-native structures from the large pool of generated models (Rout and Sali, 2019).

### Edited by:

Massimiliano Bonomi, Institut Pasteur, France

### Reviewed by:

Sophie Sacquin-Mora, UPR9080 Laboratoire de Biochimie Théorique (LBT), France

Carlo Camilloni, University of Milan, Italy

### \*Correspondence:

Alexandre M. J. J. Bonvin a.m.j.j.bonvin@uu.nl

†These authors have contributed equally to this work as joint first authors

### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 22 August 2019 Accepted: 17 September 2019 Published: 01 October 2019

### Citation:

Honorato RV, Roel-Touris J and Bonvin AMJJ (2019) MARTINI-Based Protein-DNA Coarse-Grained HADDOCKing. Front. Mol. Biosci. 6:102. doi: 10.3389/fmolb.2019.00102

Coarse-graining (CG) has been demonstrated to be a valuable alternative to standard atomistic (AA) approaches to alleviate some of those limitations and help the identification of the global energy minima by smoothing out the energy landscape (Hills et al., 2010; Roel-Touris et al., 2019). To this end, CG approaches group several atoms (either a few atoms or entire side chains) into larger "pseudo-atoms" or "beads," which results in a reduction in the number of degrees of freedom of the system (Kmiecik et al., 2016). Historically, the development of CG force fields has followed two directions: (1) physics-based, parametrized against its atomic counterpart, or (2) knowledge-based, taking advantage of the increasing growth of statistical information derived from experimentally determined structures (Hills et al., 2010). Protein and/or protein-nucleic acid coarse-grained approaches have been implemented in several docking/modeling software packages, for example: CABS-dock (Blaszczyk et al., 2016), RosettaDock (Gray et al., 2003), IMP (Russel et al., 2012), ATTRACT (Setny et al., 2012), NPDock (Tuszynska et al., 2015), PyRy3D (genesilico.pl/pyry3d), and more recently in HADDOCK (Dominguez et al., 2003; Roel-Touris et al., 2019), our integrative modeling platform.

MARTINI, a popular coarse-grained model for biomolecules, features lipids (Marrink et al., 2007), proteins (Monticelli et al., 2008), carbohydrates (López et al., 2009), and nucleic acids (Uusitalo et al., 2015, 2017) among others. Its DNA parametrization combines top-down (experimental data) and bottom-up (atomistic simulations) methodologies and is fully compatible with all other MARTINI models. On average, the nucleic acids' mapping follows a 1:6∼7 rule, which means that each nucleotide is mapped onto six or seven CG beads. Bead types are selected according to partition free energies from water to chloroform or hydrated octanol. Bonded interactions have been fitted to reproduce dihedral, angle and bond distributions from atomistic simulations of short single stranded DNAs (ssDNAs) (Uusitalo et al., 2015). The general design and parametrization of MARTINI make it easy to combine several types of biomolecules (high transferability) and allow a straightforward conversion to atomistic resolution.

In this manuscript, we describe and benchmark the integration of the MARTINI coarse-grained force field for DNA into HADDOCK. It builds upon our recent implementation of a MARTINI coarse-grained protein-protein docking protocol (Roel-Touris et al., 2019) and is further optimized to account for Watson-Crick interactions. Prior to the docking, the input structures are converted into their coarse-grained counterparts and hydrogen-bonding base pairs are automatically detected so that a special set of parameters and restraints is used for those during the docking. We evaluate the performance of coarse-grained protein-nucleic acid docking using 44 unbound-unbound complexes from the protein-DNA benchmark (van Dijk and Bonvin, 2010). The results show a similar performance in terms of success rate and model quality while reducing the computational costs by ∼6-fold compared to standard atomistic simulations. For 6 of those, we repeated the docking (both all-atom and coarse-grained) using experimental data to drive the docking as a demonstration that our coarse-grained protocol is also applicable for integrative modeling purposes. Finally, we showcase the potential of CG protein-DNA docking by revisiting the PRC1-nucleosome core particle complex (McGinty et al., 2014), which was offered as a CAPRI target (Target 95 in round 31; Lensink et al., 2017) and for which we failed at the time to select any near-native models.

## METHODS

### Integration of the MARTINI DNA Coarse-Grained Force Field Into HADDOCK

The integration of the MARTINI coarse-grained force field for nucleic acids into HADDOCK builds upon our recent HADDOCK-CG implementation for protein-protein docking (Roel-Touris et al., 2019). We converted the MARTINI topologies and interaction parameters into a format compatible with the computational engine of HADDOCK, CNS (Crystallography and NMR System) (Brünger et al., 1998). As in MARTINI, we represent the backbone of the nucleotide by three beads, one for the phosphate group, and two different beads for the sugar. Pyrimidines and purines are mapped onto three and four beads, respectively. A detailed list of the topologies and parameters as used in HADDOCK can be found in the Supplementary Information (**Tables SI-1, SI-2**).
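
The mapping described above implies a simple bead count per nucleotide, which the following Python sketch makes explicit. It assumes three backbone beads for every nucleotide and ignores possible differences at strand termini; the function and dictionary names are hypothetical and are not part of HADDOCK or MARTINI.

```python
BACKBONE_BEADS = 3                               # one phosphate bead + two sugar beads
BASE_BEADS = {"A": 4, "G": 4, "C": 3, "T": 3}    # purines: 4 beads, pyrimidines: 3 beads


def cg_bead_count(sequence):
    """Total number of CG beads for a single DNA strand given as a string."""
    return sum(BACKBONE_BEADS + BASE_BEADS[base] for base in sequence.upper())


def beads_per_nucleotide(sequence):
    """Average number of CG beads per nucleotide for the given strand."""
    return cg_bead_count(sequence) / len(sequence)
```

Under these assumptions the strand `"ACGT"` maps onto 26 beads, i.e., 6.5 beads per nucleotide, consistent with the 1:6∼7 mapping rule.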

The latest official release of the MARTINI force field for nucleic acids, 2.2 (Uusitalo et al., 2015), includes eight additional beads and corresponding parameters compared to previous versions. These beads specifically account for Watson-Crick base pairing and mimic, to some extent, the hydrogen bonds that are formed between complementary nucleotide base pairs. These contribute to stabilizing the DNA double helix structure. When converting atomic structures into coarse-grained models, we automatically detect base pairing by calculating the Euclidean distance between neighboring nucleic acid side-chain atoms. We also use the distance between phosphate groups to ensure that bases are paired with their counterpart on the opposite strand and not with their neighbor in the sequence. We define a base pair when heavy atoms of two opposite bases are within the well-accepted hydrogen bond length of 3.5 Å, as used for example in LIGPLOT (Wallace et al., 1995), and their phosphate groups are at least 10 Å apart. If the input structures do not contain any phosphate, we use the center of mass of the nucleotides instead. By doing so, we avoid defining coupling between neighboring bases in sequence. This information is used by the HADDOCK machinery to ensure that specific interacting beads are used when necessary, and the default HADDOCK DNA restraints were adapted to account for the CG beads and used to enforce correct DNA pairing (please see **Table SI-3**). As recommended in MARTINI, nonbonded interactions between CG beads are calculated using a 14 Å cutoff, whilst 8.5 Å is the default value for the united-atom OPLS force field (Jorgensen and Tirado-Rives, 1988) used in HADDOCK. Note that 8.5 Å is a reduced cutoff compared to the recommended one for OPLS, which was chosen as a compromise between accuracy and speed.
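
The base pair criterion described above can be sketched as a small Python function. This is an illustration under our own assumptions about the input (arrays of base heavy-atom coordinates and one phosphate position per nucleotide) and is not the HADDOCK pre-processing script; the function and argument names are hypothetical.

```python
import numpy as np


def is_base_pair(base_atoms_a, base_atoms_b, phosphate_a, phosphate_b,
                 hbond_cutoff=3.5, min_phosphate_dist=10.0):
    """True if any pair of base heavy atoms is within `hbond_cutoff` (Å)
    and the two phosphates are at least `min_phosphate_dist` (Å) apart."""
    a = np.asarray(base_atoms_a, dtype=float)
    b = np.asarray(base_atoms_b, dtype=float)
    # All pairwise distances between heavy atoms of the two bases.
    pair_dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    p_dist = np.linalg.norm(np.asarray(phosphate_a, dtype=float)
                            - np.asarray(phosphate_b, dtype=float))
    return bool(pair_dist.min() <= hbond_cutoff and p_dist >= min_phosphate_dist)
```

The phosphate criterion is what separates a Watson-Crick partner on the opposite strand from a stacked neighbor in the same strand, whose base atoms may also lie within 3.5 Å but whose phosphate is much closer.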

### Docking Procedure

Prior to the docking, we convert the atomic PDB coordinate files containing DNA/protein into a coarse-grained representation via an updated version of our in-house HADDOCK script for pre-processing CG input structures. During the vacuum part of the docking protocol (it0 and it1) we set the dielectric constant (epsilon) to 78.0 to screen the high DNA charge (in the all atom representation). Epsilon is set to 1.0 for the final refinement stage in explicit solvent (water) (van Dijk and Bonvin, 2010). In the CG runs, the final water refinement is replaced by the back-mapping from coarse-grained to atomistic resolution as described in Roel-Touris et al. (2019). Note that in our atomistic DNA force field implementation the charge on the backbone phosphate is reduced to 0.5 since no counter ions are included in the docking to screen its charge, while the phosphate bead in MARTINI is uncharged. The final resulting models are clustered based on the fraction of common contacts (FCC) (Rodrigues et al., 2012) using a 0.6 cutoff (i.e., two models belonging to the same cluster share at least 60% of contacts) and a minimum of four models per cluster, which is the default clustering protocol in HADDOCK. All docking calculations were made using the latest 2.4 version of HADDOCK (still in beta version and unpublished but available upon request).
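
To make the clustering criterion more concrete, the following Python sketch computes the fraction of common contacts between two models and groups models with a simplified greedy scheme. It assumes that each model is represented by a set of interface contacts and that two models are neighbors only if the fraction reaches the cutoff in both directions; the function names and this simplified procedure are our own and should not be read as the implementation of the FCC tool.

```python
def fcc(contacts_a, contacts_b):
    """Fraction of the contacts of model A that are also present in model B."""
    if not contacts_a:
        return 0.0
    return len(contacts_a & contacts_b) / len(contacts_a)


def cluster_models(models, cutoff=0.6, min_size=4):
    """Greedy clustering: repeatedly take the model with most remaining neighbors.

    models: dict mapping a model name to its set of contacts.
    Returns a list of clusters (sorted lists of model names).
    """
    names = sorted(models)
    neighbors = {
        n: {m for m in names if m != n
            and fcc(models[n], models[m]) >= cutoff
            and fcc(models[m], models[n]) >= cutoff}
        for n in names
    }
    remaining, clusters = set(names), []
    while remaining:
        center = max(sorted(remaining), key=lambda n: len(neighbors[n] & remaining))
        members = {center} | (neighbors[center] & remaining)
        if len(members) < min_size:   # no remaining model can seed a large enough cluster
            break
        clusters.append(sorted(members))
        remaining -= members
    return clusters
```

With the values quoted above (cutoff 0.6, at least four members), models that share fewer than 60% of their contacts with every cluster center remain unclustered.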

### Protein-DNA Docking Benchmark

To systematically test the performance of our coarse-grained implementation for protein-DNA docking, we used 44 unbound-unbound cases from the protein-DNA benchmark (van Dijk and Bonvin, 2008). Those are composed of 26 binary, 16 ternary, 1 quaternary (2c5r), and 1 pentameric (1ddn) complexes covering all major types of interactions (Luscombe et al., 2000). We removed three cases from the original dataset (PDB codes: 1diz, 1emh, and 4ktq) due to the fact that the MARTINI force field does not explicitly account for the modified nucleic bases P2U, NRI, and DOC. The benchmark is classified according to the amount of conformational changes that take place upon binding as measured by the interface positional root mean square deviation (i-RMSD) (i.e., unbound vs. bound structures) as follows:


This selection yielded 11 easy, 21 intermediate, and 12 difficult cases. For comparison purposes, we performed two different docking runs, one using the default atomistic force fields used by HADDOCK, and a second one with the parameters adapted from the MARTINI CG force field for both protein and DNA (Monticelli et al., 2008; Uusitalo et al., 2015). For the all-atom representation, OPLSX non-bonded parameters are used both for the protein (Jorgensen and Tirado-Rives, 1988) and DNA (Nozinovic et al., 2010). We used true interface information derived from the crystal structures translated into ambiguous interaction restraints (AIRs) to drive the docking calculations as previously defined in van Dijk and Bonvin (2010). The sampling parameters were kept to their default in HADDOCK: 1,000/200/200 models were generated for the rigid body (it0), simulated annealing (it1) and water refinement (itw) stages, respectively.

### Unbound Docking Using Experimental Data

We additionally modeled six complexes from the protein-DNA benchmark for which experimental data are available. The selected cases cover the different categories of the benchmark: "easy" (1by4, 3cro), "intermediate" (1azp, 1jj4), and "difficult" (1a74, 1zme). The available experimental information was collected from the literature and includes conserved residues, mutagenesis data, ethylation interference data, methylation interference data, NMR native state amide hydrogen exchange, and Raman spectroscopy as described in van Dijk and Bonvin (2010). As in the previous study (van Dijk and Bonvin, 2010), the sampling was slightly increased to 2,000/400/400 for the it0/it1/itw docking stages, respectively.

### Modeling of the PRC1 Ubiquitylation Module Bound to the Nucleosome

We modeled the interaction between the multimeric PRC1 ubiquitylation module and the nucleosome by performing both AA and CG docking runs. As a starting point for the docking, we used the unbound crystal structure of the enzymatic complex (PDB code: 3rpg) and the nucleosome particle (PDB code: 3lz0). We followed the same docking procedure as explained above (see Methods: Docking Procedure) except for the sampling parameters that were increased to 100,000, 400, and 400 for it0, it1, and water stages, respectively, because of the scarcity of the available information. The docking was driven by interaction restraints obtained from the literature at the time of CAPRI Round 31: one unambiguous distance restraint between the SG atom of the catalytic cysteine 85 of PRC1 and the NZ atoms of Lys119 or Lys118 on H2A, the ubiquitination target. In addition, we included mutagenesis data on PRC1 (K62A, R64A, K97A, and R98A) shown to be crucial for the interaction with the nucleosome (Bentley et al., 2011; Mattiroli et al., 2014). Ambiguous interaction restraints (AIRs) were defined for those (active) against all solvent accessible residues (passive) on the histones (those with either main chain or side chain relative accessibility >25% as calculated by NACCESS; Lee and Richards, 1971). The list of active and passive residues used to guide the docking and the specific distance restraint can be found in Supplementary Information (**Table SI-3**).
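The accessibility filter used to select passive residues (main chain or side chain relative accessibility above 25%) can be written as a short sketch. The function name `solvent_accessible`, the dictionary layout, and the example residue labels and values are hypothetical; parsing of actual NACCESS output is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: residue id -> (main-chain, side-chain) relative
# accessibility in percent, as could be read from an accessibility report.
Accessibility = Dict[str, Tuple[float, float]]


def solvent_accessible(rel_acc: Accessibility, threshold: float = 25.0) -> List[str]:
    """Return residues whose main-chain OR side-chain accessibility exceeds the threshold."""
    return sorted(
        residue
        for residue, (main_chain, side_chain) in rel_acc.items()
        if main_chain > threshold or side_chain > threshold
    )


# Illustrative values only (not taken from the nucleosome structure).
example = {
    "res_exposed_side_chain": (10.0, 60.0),
    "res_buried": (5.0, 12.0),
    "res_exposed_main_chain": (30.0, 0.0),
}
passive = solvent_accessible(example)
```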

### Metrics for the Evaluation of Model Quality

We evaluated the quality of the generated models following the standard CAPRI criteria (Janin, 2005). This includes the fraction of native contacts (Fnat) and the interface (i-RMSD) and ligand (l-RMSD) positional root mean square deviations from the reference crystal structures. Fnat is calculated from all heavy atom–heavy atom intermolecular contacts using a 5 Å distance cutoff. The i-RMSD is calculated on the interface backbone atoms after superimposition on the backbone of the interface residues, defined as those with any heavy atom within 10 Å distance of the partner molecule. The l-RMSD is calculated on the ligand backbone (usually the smallest molecule) after superimposition on the backbone atoms of the receptor (largest molecule). For both i-RMSD and l-RMSD, we considered only backbone heavy atoms for atomistic models (C-alpha, C, N, O/P, C1, C9 for protein/DNA) or backbone particles (BB<sup>∗</sup>) for coarse-grained models (in the it0 and it1 docking stages). The calculations were performed using ProFit (McLachlan, 1982) and the quality of the docking poses was classified as:


### Metrics for the Evaluation of Docking Success Rate

We analyzed the performance of the docking calculations as: (1) The percentage of cases in which at least one model of a given accuracy is found within the top N solutions ranked by HADDOCK (N = 1, 5, 10, 20, 25, 50, 100, 200), and (2) the percentage of cases in which at least one acceptable or higher quality model was found in the top T clusters (T = 1, 2, 3, 4, 5).
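The first success-rate metric can be sketched in a few lines. The sketch assumes each case is represented by the list of CAPRI quality labels of its models in HADDOCK ranking order; the label `"incorrect"`, the function name `success_rate`, and the case names in the usage example are assumptions for illustration.

```python
from typing import Dict, Sequence

# Ordering of quality labels; "incorrect" is an assumed name for models
# that do not reach the acceptable threshold.
QUALITY = {"incorrect": 0, "acceptable": 1, "medium": 2, "high": 3}


def success_rate(ranked: Dict[str, Sequence[str]],
                 top_n: int,
                 minimum: str = "acceptable") -> float:
    """Percentage of cases with at least one model of `minimum` quality or
    better among the first `top_n` ranked models."""
    if not ranked:
        return 0.0
    hits = sum(
        any(QUALITY[label] >= QUALITY[minimum] for label in labels[:top_n])
        for labels in ranked.values()
    )
    return 100.0 * hits / len(ranked)


# Hypothetical ranked labels for four cases.
cases = {
    "case1": ["incorrect", "acceptable"],
    "case2": ["medium"],
    "case3": ["incorrect", "incorrect"],
    "case4": ["incorrect", "incorrect", "high"],
}
rates = {n: success_rate(cases, n) for n in (1, 2, 3)}
```

The cluster-based metric follows the same pattern, with each case represented by the best quality label found in each of its top T clusters.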

### RESULTS AND DISCUSSION

We have integrated the MARTINI CG force field for nucleic acids into HADDOCK version 2.4 (see Methods), combining it with our previous implementation of the protein MARTINI CG force field (Monticelli et al., 2008), enabling full coarse-grained protein-DNA docking. The AA to CG conversion scripts have been adapted to automatically account for specific Watson-Crick base pairing, which requires special interaction parameters. In the following sections, we discuss the performance of our protocol for protein-DNA docking in terms of success rate and computational efficiency using 44 unbound-unbound complexes from the protein-DNA benchmark (van Dijk and Bonvin, 2008) with ideal interface information (see Methods; Protein-DNA docking benchmark). For six of them, we repeated the docking using experimental information to guide the docking. Finally, as a proof of concept, we revisited CAPRI Target 95 (Lensink et al., 2017), a protein-nucleosome complex for which we failed to identify near-native solutions in our original CAPRI submissions (although we did generate some). In this new modeling, our top ranked predictions are in excellent agreement with the crystal structure of the complex (not used for the docking) for both standard atomistic docking and the coarse-grained implementation described here.

### Overall Performance of Coarse-Grained Protein-DNA Docking

The docking was performed starting from the unbound structures of each molecule and driven by AIRs as defined in our previous study (van Dijk and Bonvin, 2010; see Methods; Docking Procedure). In order to evaluate the performance of our approach, we calculated the success rates of both sets of runs (AA and CG) as the percentage of cases for which a model of acceptable or better quality was obtained in the top N ranked models (for details see Methods; Metrics for the Evaluation of Model Quality and Metrics for the Evaluation of Docking Success Rate).

Overall, coarse-grained docking generates acceptable or higher quality models for 40 out of the 44 cases after the back-mapping stage, compared to 38 cases for the atomistic docking. No near-native models are generated for four complexes: two are classified as difficult (1dfm, 1o3t), one as intermediate (1z9c) and one as easy (1tro). Inspection of the failed easy case reveals that it is a ternary complex (homodimer) and, since no symmetry restraints were used in this case, its interface ambiguity was too high. In a previous benchmarking (van Dijk, 2006), acceptable models for this complex were obtained using a two-stage docking protocol in which a library of bent DNA conformations was given as input for the second docking run (a procedure not followed here). Among the successful CG cases, medium quality models are generated for 23 cases against 26 for the AA docking runs. Top one single structure-based ranking (best ranked structure) reaches 86.3% success rate for all-atom calculations vs. 81.8% for CG docking (**Figures 1A,B**). The overall success rates are similar for the top 5 and become higher for CG docking, reaching 90.9% in the top 200 while AA docking remains at 86.3% (which corresponds to 40 vs. 38 successful cases for CG and AA docking, respectively). In contrast, the quality of the models is slightly better for AA docking as measured by the success rates (**Figures 1A,B**) and rankings of medium quality models (**Figures 1C,D**). Notably, CG docking manages to generate acceptable models for two of the difficult cases that fail in standard atomistic HADDOCK runs (1zme and 1qrv). In 1zme, we find an acceptable model at position 176 (i.e., Top 200 according to our analysis) with 0.11/7.85 Å/9.94 Å for Fnat/i-RMSD/l-RMSD while the best AA model falls outside the acceptable CAPRI criteria (0.04/7.51 Å/10.3 Å).
For 1qrv, the fourth case with the largest conformational change, the docked models generated by the standard AA HADDOCK protocol failed to satisfy the quality metric thresholds (Fnat and i-RMSD or Fnat and l-RMSD). However, several models showed a satisfactory overlap in terms of Fnat with >20% of interface contacts. With coarse-graining instead, the first acceptable model is found at rank 44 with a l-RMSD of 8.8 Å and Fnat of 0.14 (i.e., Top 50 according to our analysis).

Coarse-graining approaches benefit from the reduction of the number of degrees of freedom of the systems under study and make the docking calculations computationally more efficient. The median computational time to generate one model via CG in HADDOCK is 8.6 s and 42.8 s for the it0 and it1 stages, respectively, vs. 16.5 s and 115.0 s for standard atomistic calculations. Overall, the use of the MARTINI force field for both proteins and nucleic acids leads to a ∼6-fold speed increase during the rigid-body docking and semi-flexible stages (see **SI-3**, **Table SI-5**).

### Unbound Docking Using Experimental Data

We evaluated the capabilities of our HADDOCK-CG implementation to model protein-DNA interactions when using real experimental information. We selected six representative cases (van Dijk and Bonvin, 2010) from the protein-DNA benchmark classified as "easy" (1by4, 3cro), "intermediate" (1azp, 1jj4), and "difficult" (1a74, 1zme) for which experimental information was available. The latter was translated into AIRs (see Methods; Unbound Docking Using Experimental Data) in the form of active and passive residues and two different sets of docking runs were performed using either the standard all-atom or the coarse-grained protocols.

FIGURE 1 | … protocol. (C,D) The quality of the docking models for all 44 cases as a function of the number of models considered. The complexes are ordered by increasing degree of difficulty (from top to bottom) for both all-atom and CG docking runs. The color coding indicates the quality of the docked models according to CAPRI criteria.

As shown in **Table 1**, which summarizes the quality of the generated clusters, AA docking generates better quality models for four out of the six cases. No good solution was found with any of the tested protocols for 1zme, which undergoes a large conformational change of 4.68 Å upon binding. In terms of sampling, the standard all-atom protocol, in combination with experimental data, generates ∼900 near-native models (i.e., acceptable or higher quality according to CAPRI) on average per case, while our CG approach generates around three times fewer (∼300). This is somewhat surprising as the smoother energy landscape derived from the reduction of degrees of freedom might help the sampling process, as previously demonstrated in our protein-protein CG implementation (Roel-Touris et al., 2019). Despite this difference in sampling, both approaches perform rather similarly in terms of structure quality, indicating that our CG protocol is also applicable for integrative modeling of complexes in combination with real experimental data. Recent studies have indicated that the interpretation of CG models using experimental data, and in particular SAXS data, can benefit from improved forward models as demonstrated by Paissoni et al. (2019) for protein-DNA complexes.

### Revisiting CAPRI Target 95: The PRC1 Ubiquitination Module Bound to the Nucleosome

The polycomb repressive complex 1 (PRC1) represses the expression of genes regulated by developmental processes and is responsible for the ubiquitylation of the nucleosomal histone (Mattiroli et al., 2014). This complex was offered as a blind target in the CAPRI experiment (Round 31, target 95), in which we participated but failed to correctly identify near-native models out of our pool of generated complexes. Using the same information derived from the literature as used in CAPRI Round 31 (see **Table SI-4**), we repeated the docking using our MARTINI implementation in HADDOCK2.4 and validated our predictions against the crystal structure of the complex (PDB-ID: 4r8p; McGinty et al., 2014).

When analyzing the i-RMSD of the top-ranked model according to the HADDOCK score, the CG one is slightly closer (3.0 Å) to the reference crystal structure than the corresponding AA model (3.14 Å; **Table 2A**). The same behavior is observed in the clustering statistics, in which the average i-RMSD for the top four models of the best cluster was 3.09 ± 0.08 Å for CG against 3.23 ± 0.23 Å for AA. A much larger difference between the two protocols is, however, clearly visible when comparing the number of acceptable or better models generated at the various docking stages (**Table 2B**), with CG docking resulting in ∼1.5 times more acceptable models than AA docking. This improvement in the sampling is in contrast to what was observed above for the protein-DNA benchmark. As already observed for protein-protein docking (Roel-Touris et al., 2019), the impact of coarse graining is more evident when little or no information (ab-initio docking) is available to drive the docking process. Finally, a view of the top ranked models superimposed onto the reference crystal structure is shown in **Figure 2**. Both satisfy the distance restraint imposed to model the interaction between Cys85 of PRC1 and Lys118/119 of Histone 2A (PRC1-H2A). The proximity of those two residues was proposed (Bentley et al., 2011) to be necessary to restrict the ligase complex to a single region of the nucleosome (the information we used in CAPRI), which was confirmed by the crystal structure (PDB-ID 4r8p; McGinty et al., 2014).

TABLE 2A | Sampling and quality assessment of the AA and CG PRC1 docking models.

Number of acceptable models and time necessary to generate one model for the rigid-body and semi-flexible stages for both all-atom and coarse-grained simulations.

<sup>a</sup>The first number is the total number of acceptable models within the 10,000 generated and the second corresponds to those in the top 400 selected for further semi-flexible refinement.

TABLE 2B | Ranking, i-RMSD comparison and time per model of all-atom and coarse-grained simulation of CAPRI Target 95.

TABLE 1 | Performance of the all-atom and coarse-grained protocols in HADDOCK on six representative cases of the protein-DNA benchmark using experimental data to drive the docking.

The RMSDs (Å) and Fnats correspond to the best model of the best cluster. The ranking of the best cluster is also reported. The CAPRI column indicates the number of models per quality threshold (\*acceptable, \*\*medium, \*\*\*high).

### CONCLUSION

In this work, we have presented the integration of the MARTINI coarse-grained force field for nucleic acids into our HADDOCK integrative modeling software. It builds upon our previous implementation for protein-protein docking, using a coarse-grained representation during the rigid-body and semi-flexible refinement stages, and converting the resulting models back to atomistic resolution following an atom-to-bead distance restraint-guided morphing procedure. We have shown that the performance of coarse-grained docking is similar to that of the standard all-atom protocol in terms of success rate, while the quality of the generated models remains rather similar according to standard CAPRI criteria. We demonstrated that our coarse-grained protocol is well suited for use with experimental or predicted data. In particular, we have revisited a challenging target of the CAPRI experiment, taking full advantage of the implementation described here and obtaining near-native models of the PRC1 ubiquitination module bound to the nucleosome in excellent agreement with the crystal reference. Further, by smoothing the energy landscape, it also allows the generation of more near-native models in cases where limited information is available to guide the modeling, which should also benefit the scoring stage since it becomes easier to identify them. It also brings a significant gain in computing performance, with a ∼6-fold speed increase compared to standard atomistic simulations. In conclusion, with this extension, HADDOCK has gained the capability to model significantly larger assemblies consisting of mixed protein and DNA components in a more efficient way without compromising its overall performance.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

## AUTHOR CONTRIBUTIONS

AB, RH, and JR-T conceived and designed the research and wrote the paper. RH and JR-T performed the computational analysis and interpreted the data.

## FUNDING

This work was supported by the Dutch foundation for Scientific Research (NWO) (TOP-PUNT Grant 718.015.001) and by the BioExcel CoE (www.bioexcel.eu), a project funded by the European Horizon 2020 program under grant agreements 675728 and 823830. RH acknowledges funding from FAPESP (2017/03191-2).

### ACKNOWLEDGMENTS

The authors acknowledge all members from the Computational Structural Biology group at Utrecht University for fruitful discussions. We thank the MARTINI group at Groningen University, the Netherlands, with special mention to Dr. Ignacio Faustino for his support in implementing MARTINI into HADDOCK. Finally, we acknowledge the use of software from the SBGRID consortium (Morin et al., 2013) for various analysis tasks.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2019.00102/full#supplementary-material

## REFERENCES


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Honorato, Roel-Touris and Bonvin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Definition of the Minimal Contents for the Molecular Simulation of the Yeast Cytoplasm

Vijay Phanindra Srikanth Kompella1,2, Ian Stansfield<sup>3</sup> , Maria Carmen Romano2,3 and Ricardo L. Mancera<sup>1</sup> \*

*<sup>1</sup> School of Pharmacy and Biomedical Sciences, Curtin Health Innovation Research Institute, Curtin Institute for Computation, Curtin University, Perth, WA, Australia, <sup>2</sup> Physics Department, Institute for Complex Systems and Mathematical Biology, University of Aberdeen, Aberdeen, United Kingdom, <sup>3</sup> Institute of Medical Sciences, University of Aberdeen, Aberdeen, United Kingdom*

The cytoplasm is a densely packed environment filled with macromolecules with hindered diffusion. Molecular simulation of the diffusion of biomolecules under such macromolecular crowding conditions requires the definition of a simulation cell with a cytoplasmic-like composition. This has been previously done for prokaryote cells (*E. coli*) but not for eukaryote cells such as yeast as a model organism. Yeast proteomics datasets vary widely in terms of cell growth conditions, the technique used to determine protein composition, the reported relative abundance of proteins, and the units in which abundances are reported. We determined that the gene ontology profiles of the most abundant proteins across these datasets are similar, but their abundances vary greatly. To overcome this problem, we chose five mass spectrometry proteomics datasets that fulfilled the following criteria: high internal consistency, consistency with published experimental data, and freedom from GFP-tagging artifacts. Using these datasets, the contents of a simulation cell containing a single 80S ribosome were defined, such that the macromolecular density and the mass ratio of ribosomal-to-cytoplasmic proteins were consistent with experiment and chosen datasets. Finally, multiple tRNAs were added, consistent with their experimentally-determined number in the yeast cell. The resulting composition can be readily used in molecular simulations representative of yeast cytoplasmic macromolecular crowding conditions to characterize a variety of phenomena, such as protein diffusion, protein-protein interactions and biological processes such as protein translation.

Keywords: macromolecular crowding, proteomics, protein translation, yeast, molecular dynamics

### Edited by:

*Valentina Tozzini, Nanosciences Institute, National Research Council, Italy*

#### Reviewed by:

*Chia-en Chang, University of California, Riverside, United States Abhigyan Satyam, Harvard Medical School, United States*

> \*Correspondence: *Ricardo L. Mancera r.mancera@curtin.edu.au*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

Received: *18 July 2019* Accepted: *11 September 2019* Published: *02 October 2019*

#### Citation:

*Kompella VPS, Stansfield I, Romano MC and Mancera RL (2019) Definition of the Minimal Contents for the Molecular Simulation of the Yeast Cytoplasm. Front. Mol. Biosci. 6:97. doi: 10.3389/fmolb.2019.00097*

### INTRODUCTION

The environment inside cells is densely packed, a condition termed macromolecular crowding, the extent of which varies throughout the different growth and differentiation stages of the cell, as well as according to its type and volume (Nakano et al., 2014). A typical cell has a macromolecular concentration in the range 100–450 g/L, with 5–40% of its volume being occupied by macromolecules (Feig et al., 2017). Therefore, the space available for the free diffusion of metabolites and other macromolecules is greatly reduced, leading to what is known as an excluded volume effect. This reduces diffusion and favors more compact protein conformations and protein association. Transient aggregation of proteins is favored in crowded systems and is correlated with slower diffusion (Nawrocki et al., 2017). Macromolecules reduce the amount of bulk-like water in the cell by reducing the amount of water molecules present beyond the second solvation layer (Harada et al., 2012). As a consequence, a 40% reduction in the dielectric constant of yeast cells compared to that of a dilute solution has been determined (Asami et al., 1976; Tanizaki et al., 2008), leading to an increase in electrostatic interactions between molecules. Hindered diffusion due to macromolecular crowding, on the other hand, increases the probability of ligands being in the vicinity of their receptors in what is termed the caging effect, which enhances reaction rates (Feig et al., 2017). Cells are believed to maintain their macromolecular concentration within a very small range in a process now termed "homeocrowding" (Van Den Berg et al., 2017). Moreover, it has been shown that the diffusion coefficient of molecules depends not only on the macromolecular concentration but also on the composition of the solution (Wang et al., 2010). Molecular crowding inside cells affects various biochemical processes such as protein translation. The diffusion of tRNA complexes in the cytoplasmic environment is hindered by crowding, in turn affecting the rate of translation (Klumpp et al., 2013).

Molecular dynamics (MD) simulations can be used to characterize the complex nature of the effects of macromolecular crowding, including effects on the diffusion of tRNAs and their binding to cytoplasmic ribosomes during translation. Two prior studies of the cytoplasm have focused on prokaryotic systems (E. coli). In one study, 118 protein molecules were chosen on the basis of their mole percentage in the cytosol, with the number of ribosomes being scaled down based on abundances reported at cell level and a total macromolecular density of 340 g/L (Ridgway et al., 2008). Each protein molecule was represented as a sphere, whilst tRNAs were not included at all (Ridgway et al., 2008). In a second study, 51 different types of macromolecules were considered, out of which 45 were proteins and which accounted for 86% of the total cytoplasmic protein mass reported by the proteomics dataset used with a macromolecular concentration of 275 g/L. The simulation cell also included three types of tRNAs (tRNA-Gln, tRNA-Phe, and tRNA-Cys) and 10 ribosomes in their corresponding subunits. The volume corresponding to lipids, lipopolysaccharides, mRNA, DNA, murein, and glycogen was accounted for by increasing the concentration of protein in the simulation cell (McGuffee and Elcock, 2010). In a more recent cytoplasmic model, developed for Mycoplasma genitalium, the simulation cell comprised more than 1,000 protein molecules, 275 tRNAs, nucleotides, metabolites, ions, and a total of 26 million water molecules represented atomistically with a macromolecular density of 291.5 g/L (Feig et al., 2015). To our knowledge, an equivalent representative definition of the eukaryotic cytoplasm has not been reported in the literature. The key challenges in defining such a simulation cell include identification of the required proteomics datasets and defining appropriate criteria to minimize the size of the cell whilst retaining the properties of the cytoplasmic environment.

In this study, we sought to address the lack of a standard molecular simulation environment for eukaryotes by defining the contents of a simulation cell based on the abundances of proteins, tRNAs and ribosomes in the yeast cytoplasm. A recent yeast proteomics dataset (Ho et al., 2018) unified abundance data from 21 different datasets, comprising a range of mass spectrometry (MS)-derived datasets, datasets based on green fluorescent protein (GFP)-tagging of yeast proteins and GFP flow cytometry and also a tandem affinity purification (TAP-tagging)-immunoblot dataset. We employed an in-depth proteomics survey of these datasets in order to define a molecular simulation environment for a model eukaryote cell. However, these datasets vary in terms of the growth conditions used to culture the cells, the cellular growth phase, the units in which abundances are reported, and the technique used to measure them. It was therefore necessary to investigate how these factors affect protein abundances reported across the range of datasets. We characterized the internal consistency amongst the datasets and their agreement with other published experimental data, leading to the selection of a proteome composition for the yeast cytoplasmic environment. Consideration of additional experimental data on the macromolecular density and the mass ratio of ribosomal-to-cytoplasmic proteins in the cytoplasm was also used, allowing the definition of the contents of a molecular simulation cell representative of the yeast cytoplasm.

## METHODS

### Definition of a Eukaryote Cell Simulation Environment

Previous reports of the number of ribosomes in yeast cytoplasm were taken from cell population scale experiments (Waldron and Lacroute, 1975) and from cell tomography experiments at single cell level (Yamaguchi et al., 2011), and were compared with the numbers calculated from proteomics datasets. The volume percentages of individual components of the yeast cell were also obtained from cell tomography studies (Yamaguchi et al., 2011), which are in agreement with other cell tomography experiments (Wei et al., 2012). Furthermore, we used the recently published unified yeast proteomics dataset that covers a total of 5,391 proteins (Ho et al., 2018).

Proteins associated with the nucleus, cell wall, ribosomes, mitochondria, endoplasmic reticulum, and vacuoles were removed from the dataset with the help of GO-slim annotations (http://www.yeastgenome.org/) to assign cellular location to a given protein. Gene ontology analysis of the function of encoded proteins was performed using the webserver Funcassociate 3.0 (http://llama.mshri.on.ca/funcassociate/) (Berriz et al., 2009).

### Statistical Analysis

The abundances reported for individual ribosomal proteins by any dataset were treated as multiple observations of the number of ribosomes (described in detail in the Results section). Based on this, pairwise statistical two-tailed t-tests for unequal variances between proteomics datasets were performed using an in-house code in MATLAB (https://github.com/BMMG-Curtin/FMOLB) to quantitatively understand the differences and similarities between datasets (**Figure S1**). Where multiple pairwise t-tests were conducted, the Bonferroni correction was applied to address type-I errors, whereby the critical alpha value is divided by the number of pairwise tests. In addition, p-values were adjusted using the Benjamini-Hochberg approach to address type-I errors and the results obtained were found to be qualitatively the same (**Figure S1**). The data was assumed to be normally distributed whilst conducting the above t-tests; therefore, a non-parametric Mann–Whitney U-test with the Bonferroni correction was also employed (**Figure S2**). The results of the U-test were also found to be qualitatively similar to the results obtained with the t-tests. Pairwise correlations between the functional ontological classes of proteins across different datasets were quantified using the Pearson's correlation coefficient. The Jaccard index was used to quantify the similarities between the ontological profiles obtained for each of the datasets.
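The two multiple-testing corrections mentioned above can be illustrated with a minimal Python sketch. It is not the in-house MATLAB code referenced in the text; the function names and the example p-values are assumptions. The Bonferroni step divides the critical alpha by the number of tests, and the Benjamini-Hochberg step returns adjusted p-values using the standard step-up procedure.

```python
from math import comb
from typing import List, Sequence


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Critical alpha after Bonferroni correction for n_tests comparisons."""
    return alpha / n_tests


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# If all 21 datasets were compared pairwise there would be C(21, 2) tests.
n_pairs = comb(21, 2)
alpha_corrected = bonferroni_alpha(0.05, n_pairs)

# Illustrative p-values only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```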

### RESULTS

### Analysis of Internal Consistency of Yeast Proteomics Datasets

In order to define the protein composition of a eukaryote molecular simulation cell, the recently published unified yeast proteomics dataset was used (Ho et al., 2018). This covers 5,391 genes with a total protein mass per yeast cell of 2.7 × 10<sup>12</sup> Da, which is in good agreement with the total protein mass of a yeast cell previously reported to be 3 × 10<sup>12</sup> Da (Sasidharan et al., 2012). This proteomics dataset comprises data integrated from 21 different datasets, which vary in the type of growth medium used to culture cells, their growth phase and the technique used to measure protein abundances.

The top 200 most abundant proteins were taken from each of the 21 datasets based on their mass (i.e., molecular mass multiplied by their abundance) and were found to account for ∼70% of the total cytoplasmic protein mass (**Figure 1**). In order to assess the possible influence of cell culture conditions, growth phase and the method used to measure protein abundance on the composition of the yeast cytoplasm, the ontological classes of these proteins were assessed. The systematic names of these proteins were submitted to the Funcassociate 3.0 webserver, which detects over-representation of gene ontologies in a gene list. The number of proteins associated with each gene ontology class was identified for every dataset. Each pair of datasets was then compared by calculating the Pearson's correlation coefficient between the number of proteins associated with each gene ontology class. The Jaccard index was used to quantify the similarities between the sets of gene ontology classes obtained for every dataset. Despite the above differences between the datasets, a similar ontological landscape for the top 200 proteins in each of the datasets was observed, except for one dataset that used N-terminal GFP tagging, YOF (Yofe et al., 2016; **Figure 2**).
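The two similarity measures used to compare datasets can be sketched as follows. The function names and the placeholder class labels are hypothetical, and treating a gene ontology class that is absent from one dataset as a count of zero is an assumption of this sketch rather than something stated in the text.

```python
from math import sqrt
from typing import Dict, Set


def jaccard(classes_a: Set[str], classes_b: Set[str]) -> float:
    """Jaccard index between two sets of gene ontology classes."""
    union = classes_a | classes_b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(classes_a & classes_b) / len(union)


def pearson_go_counts(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """Pearson correlation between per-class protein counts of two datasets."""
    classes = sorted(set(counts_a) | set(counts_b))
    if len(classes) < 2:
        raise ValueError("at least two ontology classes are required")
    # Assumption: a class missing from one dataset contributes a count of 0.
    x = [counts_a.get(c, 0) for c in classes]
    y = [counts_b.get(c, 0) for c in classes]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant counts")
    return cov / (sd_x * sd_y)


# Placeholder class labels and counts, for illustration only.
dataset_a = {"class_1": 10, "class_2": 20, "class_3": 30}
dataset_b = {"class_1": 20, "class_2": 40, "class_3": 60}
r = pearson_go_counts(dataset_a, dataset_b)
j = jaccard({"class_1", "class_2", "class_3"}, {"class_2", "class_3", "class_4"})
```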

Although the gene ontology profiles of the top 200 cytoplasmic proteins are similar across datasets, significant differences in protein abundances were observed. For example, the average coefficient of variation (CV) (measured across the 21 datasets) for the cytoplasmic proteins is 78%. The differences are more marked in the case of ribosomal proteins (CV = 106%).

In order to investigate the internal consistency of the proteomic datasets and their agreement with other published data, ribosomal proteins were examined separately. The protein composition of ribosomes can be assumed to be fixed (Perry, 2007) and there are 79 ribosomal proteins per ribosome. Since the stoichiometry for each ribosomal protein with respect to the ribosome (Warner, 1999) is 1:1, it should be expected that the numbers of each of these ribosomal proteins in a given dataset will lie within a very small range. The identity of the ribosomal proteins was taken from the crystal structure of the eukaryotic ribosome (PDB code 4V88) (Ben-Shem et al., 2011). The CV of these proteins was computed in every dataset and the average CV of all MS datasets is 69%, whereas the average CV of GFP datasets is 103%, indicating better internal consistency in MS datasets compared to GFP datasets.
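As an illustration of the consistency measure used here, the following sketch computes the CV of ribosomal protein abundances within a single dataset, together with the ribosome number implied by the 1:1 stoichiometry. The abundances are invented, and the use of the sample standard deviation is an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical copies per cell of four ribosomal proteins in one dataset.
abundances = [100_000, 120_000, 80_000, 100_000]

cv = coefficient_of_variation(abundances)
# With 1:1 stoichiometry, each abundance is an estimate of the ribosome count.
ribosomes_mean = statistics.mean(abundances)
ribosomes_median = statistics.median(abundances)
```

A perfectly consistent dataset would give a CV close to zero, since every ribosomal protein would report the same number of ribosomes.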

Depending on the consistency between datasets, the numbers reported for a given ribosomal protein across different datasets are expected to vary showing patterns in terms of experimental conditions. In order to test this, the abundances of different ribosomal proteins were compared across different datasets. Given the 1:1 stoichiometry for each ribosomal protein with respect to the ribosome (Warner, 1999), the abundance of each ribosomal protein in each dataset provided an estimate of the number of ribosomes per cell. The average number of ribosomal proteins was therefore calculated to derive an average ribosome per cell value for each dataset. The resulting values were then compared between datasets by performing multiple pairwise t-tests to determine any patterns arising from the growth media, growth phase or the technique used to measure protein abundance (**Figure 3**). High p-values were observed in the pairwise tests between the datasets derived from GFP-tagging of proteins, indicating consistency between them. On the other hand, no clear consistency was apparent within the MS datasets, and no patterns were observed that might be accounted for by the growth media or growth phase used during cell culture.

It has previously been reported that there are ribosomal proteins with extra-ribosomal functions in yeast (Lu et al., 2015). In order to test if the differences in the abundance (**Table S1**) of ribosomal proteins arise from the fact that some of them perform additional functions and might therefore be produced in excess of the requirements for ribosome synthesis, the mean of means and the mean of medians (across 21 datasets) of ribosomal proteins with extra functions (set I) and other ribosomal proteins (set II) were computed. If excess production of some ribosomal proteins was due to additional functions, their numbers might be expected to be higher than those of other proteins. However, the mean of means of set I is ∼88,400 units, whilst that of set II is ∼86,000 units. By contrast, the mean of medians of set I is ∼61,700 and that of set II is ∼53,157 units. Whilst ribosomal proteins with other functions seem to be abundant, it should be noted that the standard deviations of both sets of proteins are ∼25,000. A t-test carried out comparing the means reported for ribosomal proteins in set I and set II has a p-value of 0.85 and a similar calculation with medians showed a p-value of 0.23. These high p-values suggest that the differences in mean/median abundances do not have statistical significance, suggesting that the differences in the abundances of ribosomal proteins are not due to the extra-ribosomal functions carried out by some of them. The causal relationships of this phenomenon will need to be further investigated.

**Figure 1** | Proteins in the yeast proteomics dataset were ranked according to their mass, exhibiting a clear exponential decrease as a function of their mass rank in the cell. In the inset the cumulative percentage of mass is plotted as a function of rank. The top 200 cytoplasmic proteins contribute to ∼70% of the total cell protein mass.

### Selection of Datasets

Whilst the gene ontology profiles of the proteomics datasets are similar, they vary widely in the protein abundances reported. The ratio of the median of abundances reported by GFP datasets to the median of MS datasets was calculated for cytoplasmic and ribosomal proteins. We determined that for 74% of cytoplasmic proteins and 84% of ribosomal proteins the medians differ by more than 25%. The differences in the individual protein abundances between the GFP and MS datasets were reported to be possibly due to changes in protein or mRNA stability following GFP tagging (Ho et al., 2018). More specifically, in the case of ribosomal proteins, GFP tagging can alter their packing in the ribosome, thereby affecting their turnover dynamics and therefore their abundances (von der Haar, 2008).

The number of ribosomes, calculated by taking the median of all ribosomal proteins reported in the GFP datasets, revealed an estimated 51,800 ribosomes per cell, whereas previously reported figures are 150,000–300,000 (Waldron and Lacroute, 1975) and 169,000–265,000 (Yamaguchi et al., 2011) ribosomes per cell. As discussed earlier, the abundances of ribosomal proteins reported in the GFP datasets are also widely spread, with an average CV of 103%, in contrast to the average CV of 69% in the MS datasets. It was thus decided to omit the GFP datasets from further consideration.

The first five (LU, PENG, KUL, LAW, and LAHT) MS datasets report abundances in absolute numbers, whereas the other MS datasets report normalized abundances (with respect to the average of the five MS datasets) (Ho et al., 2018). When the median of the first five MS datasets was compared to the median of the other MS datasets individually for every protein, 78% of cytoplasmic proteins and 96% of ribosomal proteins showed more than 25% difference. These differences may potentially be an artifact of the normalization process. The number of ribosomes inferred from the median abundance of ribosomal proteins of the first five MS datasets was ∼130,000, whereas it was only 30,500 when calculated from the other MS datasets. This latter, lower figure is significantly different to previous reports (Waldron and Lacroute, 1975; Yamaguchi et al., 2011), as discussed above. The five MS datasets also showed high internal consistency in the pairwise t-tests performed on ribosomal protein abundance compared to the other MS datasets (**Figure 3**). The five MS datasets were originally reported to be highly correlated (with the Pearson correlation coefficient varying from 0.43 to 0.81) (Ho et al., 2018), which is consistent with our findings. Consequently, it was decided that only the first five MS datasets would be used for the definition of the contents of a molecular simulation cell.

### Constraints for the Definition of the Contents of a Simulation Cell

A molecular simulation cell should be designed to mimic the environment of the yeast cytoplasm. This requires the inclusion of three important constraints: macromolecular density, the mass ratio of ribosomal-to-cytoplasmic proteins, and the number of ribosomes in the simulation cell.

Macromolecular density is an indirect measure of the excluded volume and, therefore, crowding. The volume of a yeast cell has been reported to be 42 µm<sup>3</sup> (Jorgensen et al., 2002) and from the cell tomography determinations (Yamaguchi et al., 2011) we estimated the cytoplasm in yeast to be 65% of the total cell volume (27.3 µm<sup>3</sup>). The mass of all the 1,374 cytoplasmic proteins in the dataset, excluding ribosomes, was calculated using the mean abundances of all proteins from the five MS datasets chosen above. There are 3 million tRNAs in a yeast cell (Waldron and Lacroute, 1975) and, using an average mass of 25,500 Da per tRNA (calculated assuming that there are 75 nucleotides in tRNAs, each weighing an average mass of 340 Da), the total tRNA mass was calculated. The median number of all ribosomal proteins across the five MS datasets was determined to be 126,213, which was used to calculate the ribosomal mass in the yeast cell. The total masses of tRNAs, ribosomes and cytoplasmic proteins were then used to estimate the macromolecular density of the yeast cytoplasm as 90 g/L.
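The arithmetic behind the cytoplasmic volume and the tRNA mass, and the unit conversion from daltons per cubic micrometre to g/L, can be sketched as follows. Only the tRNA contribution is evaluated, because the protein and ribosome masses depend on the dataset and are not reproduced here; the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # number of daltons in one gram


def density_g_per_l(total_mass_da, volume_um3):
    """Convert a mass in Da contained in a volume in µm^3 to g/L."""
    grams = total_mass_da / AVOGADRO
    litres = volume_um3 * 1e-15  # 1 µm^3 = 1e-15 L
    return grams / litres


cytoplasm_volume_um3 = 0.65 * 42.0             # 65% of the cell volume
trna_mass_da = 75 * 340                        # nucleotides x Da per nucleotide
total_trna_mass_da = 3_000_000 * trna_mass_da  # 3 million tRNAs per cell

# Contribution of tRNAs alone to the cytoplasmic density (roughly 4.65 g/L).
trna_density = density_g_per_l(total_trna_mass_da, cytoplasm_volume_um3)
```

The masses of the cytoplasmic proteins and of the ribosomes would be added to `total_trna_mass_da` in the same units before the conversion.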

It has been reported that the fractions of ribosomal protein (R-protein), translation protein (T-protein), fixed protein (Q), the proportion of which is independent of growth rate, and metabolic protein (P-protein), denoted by Φ<sub>R</sub>, Φ<sub>T</sub>, Φ<sub>Q</sub>, and Φ<sub>P</sub>, respectively, are unique for a specific growth rate (Klumpp et al., 2013). Therefore,

$$
\Phi_{\mathrm{Q}} + \Phi_{\mathrm{P}} = \frac{\text{Q-protein}}{A} + \frac{\text{P-protein}}{A} = C(\text{growth rate}) \tag{1}
$$

where A is the total protein mass and C is the growth rate specific constant. The total Q- and P-protein content can be divided into cytoplasmic and non-cytoplasmic fractions. Therefore, the previous equation can be rewritten as

$$\begin{aligned} \Phi_{\mathrm{Q}} + \Phi_{\mathrm{P}} &= \frac{\text{non-cytoplasmic}_{(\mathrm{Q+P})}}{A} + \frac{\text{cytoplasmic}_{(\mathrm{Q+P})}}{A} \\ &= C(\text{growth rate}) \end{aligned} \tag{2}$$

$$\frac{\text{non-cytoplasmic}_{(\mathrm{Q+P})}}{A} : \frac{\text{cytoplasmic}_{(\mathrm{Q+P})}}{A} = k(\text{growth rate}) \tag{3}$$

Equation (3) states the assumption that the mass ratio of cytoplasmic to non-cytoplasmic proteins is constant at a given growth rate, from which it follows that the cytoplasmic fraction in Q- and P-proteins remains constant. Since the T-protein fraction is a growth rate-dependent constant, the mass ratio of ribosomal-to-total cytoplasmic proteins is constant at a given growth rate. This is the second constraint for the definition of the contents of a simulation cell. The mass ratio of ribosomal-to-cytoplasmic proteins (rib/cyt) was determined to be 0.2229.

The crystal structure of the ribosome is composed of 75 ribosomal proteins (Ben-Shem et al., 2011) and, at such a size, it would be computationally challenging to include multiple ribosomes in a single simulation cell. Equally, ignoring the contribution of the ribosome to the excluded volume and macromolecular density would affect the accuracy of a simulation. Therefore, the addition of a single ribosome to the simulation cell was chosen as the third constraint for the definition of its contents.

### Definition of the Contents of the Simulation Cell

The choice of five MS datasets reduced the number of cytoplasmic proteins with abundance data from 1,594 to 1,374; however, when calculating the macromolecular density of the cytoplasm, data from all 1,594 proteins was considered. The total mass of cytoplasmic proteins calculated using abundances in the unified dataset is 7.56 × 10<sup>11</sup> Da. The median of the number of molecules reported for a given protein by the five chosen MS datasets was taken as the measure of its abundance in a typical yeast cell. The total mass of a given type of protein was calculated by multiplying its abundance (number of proteins per cell) by its molecular mass, and the protein list was then sorted in descending order of total mass. The top 200 proteins contribute, as mentioned earlier, about 70% of the total cytoplasmic protein mass. The top proteins from the list were chosen due to their significant contribution to the protein mass in the cytoplasm and their abundances were subsequently scaled down to their corresponding value in proportion to only one ribosome (calculated as the abundance "n" of a protein divided by the 126,213 ribosomes predicted in the MS datasets).

Each of the less abundant cytoplasmic proteins does not contribute significantly to the overall protein mass. However, their collective removal results in a significant loss in protein mass which needs to be accounted for in order to maintain the desired macromolecular density of the simulation cell. Additionally, a number of proteins will contribute to the cytoplasm in fractional units that are lost due to rounding. The number of protein molecules of each of the cytoplasmic proteins was thus multiplied by a scaling factor aimed at maintaining the overall macromolecular density of the simulation cell. The number of protein types was chosen such that their total mass contribution reflects the expected value of the rib/cyt ratio. This was achieved by testing multiple scaling factors under the above-described constraints. Use of a large scaling factor (e.g., 3.0) meant that the rib/cyt ratio could be reached with just 20 different types of proteins, amounting to 119 protein molecules. By contrast, the rib/cyt ratio could not be reached with very low scaling factors (e.g., <1.8). Although the total number of protein molecules remained in the range 120–130 with all of the scaling factors tested, the observed protein composition was affected significantly with the use of large scaling factors. A range of scaling factors meet the constraints of macromolecular density, rib/cyt ratio and the presence of one ribosome in the simulation cell. However, in order to maintain the most representative composition of cytoplasmic proteins, the lowest possible scaling factor of 1.803 was chosen. This resulted in a final list containing 128 protein molecules belonging to 70 types of proteins (**Table S2**).
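The search for the lowest workable scaling factor can be sketched as below. This is a simplified illustration, not the authors' procedure: it assumes that protein types are added in descending order of total mass until the cytoplasmic mass implied by the rib/cyt ratio is reached, it uses invented toy data, and it does not model the macromolecular density constraint. All function and variable names are hypothetical.

```python
RIB_CYT = 0.2229  # ribosomal-to-cytoplasmic protein mass ratio (from the text)


def build_cell(proteins, n_ribosomes, scale, target_cyt_mass):
    """Pick protein types until the target cytoplasmic mass is reached.

    proteins: list of (name, copies_per_cell, molecular_mass_da).
    Returns (chosen, total_mass); chosen is None if the target is not reached.
    """
    ranked = sorted(proteins, key=lambda p: p[1] * p[2], reverse=True)
    chosen, total_mass = [], 0.0
    for name, copies, mass in ranked:
        # Copies per single ribosome, scaled up and rounded to whole molecules.
        n = round(scale * copies / n_ribosomes)
        if n == 0:
            continue  # fractional contribution lost to rounding
        chosen.append((name, n))
        total_mass += n * mass
        if total_mass >= target_cyt_mass:
            return chosen, total_mass
    return None, total_mass


def lowest_feasible_scale(proteins, n_ribosomes, scales, target_cyt_mass):
    """Return the smallest tested scaling factor that reaches the target."""
    for scale in sorted(scales):
        chosen, _ = build_cell(proteins, n_ribosomes, scale, target_cyt_mass)
        if chosen is not None:
            return scale, chosen
    return None, None


# Toy data: four protein types in a cell with 100 ribosomes.
toy_proteins = [("A", 1000, 50_000), ("B", 400, 40_000),
                ("C", 120, 30_000), ("D", 30, 20_000)]
toy_ribosome_protein_mass = 200_000  # Da, invented for illustration
target = toy_ribosome_protein_mass / RIB_CYT

scale, chosen = lowest_feasible_scale(
    toy_proteins, 100, [1.0, 1.1, 1.2, 1.3, 1.4, 1.5], target)
```

In this toy case the smallest factors fail to reach the target mass, while larger ones reach it with fewer protein types, which mirrors the trade-off described above.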

Based on the constraint that there should be only one ribosome, the size of the simulation cell was calculated. A total of 126,213 ribosomes are assumed to be present in the cytoplasm, which has a volume of 27.3 µm<sup>3</sup> . This volume was scaled down to one ribosome unit, which for a cubic simulation cell results in a length of 560 Å. The number of tRNAs was scaled down from 3 million units per cell to the volume of the simulation box, resulting in 22 tRNA units. With one 80S ribosome, 128 protein molecules and 22 tRNAs, the resulting simulation cell has the required total macromolecular density of 90 g/L.

### DISCUSSION

This study shows that the ontological profiles of the most abundant proteins in yeast remain constant despite differences in growth medium and growth phase, indicating that the most abundant proteins constitute the fundamental biochemical framework of the cell. The abundances reported in GFP datasets are affected by tagging, particularly in the case of ribosomal proteins. This has been explained previously on the basis that ribosomal proteins form a compact structure in a single ribosome molecule and the tag attached to them affects their packing. Although this explains the low numbers of ribosomal proteins reported, the cause of the high CV of ribosomal proteins in GFP datasets (CV = 103%), indicating a selective effect of tagging, compared with that of MS datasets (CV = 69%) remains unclear. Moreover, the average number of ribosomes calculated using MS datasets that report abundances in relative units is very low (30,500 units). The causes behind this remain undetermined, although normalization of the data is a possible factor.

Unlike prokaryotic cells, eukaryotic cells have a sophisticated organization of cellular machinery into different organelles with varying macromolecular environments. In order to study the influence of this macromolecular environment, an accurate description of its composition is needed. This was achieved by assigning the cellular location of a protein from its gene annotation data (GO-slim data) and determining the volume percentage of cytoplasm in yeast from cell tomography experiments. The macromolecular density of yeast cytoplasm was found to be 90 g/L, which is three times lower than that of the cytoplasm of E. coli. Measurements of the diffusion coefficient of GFP in eukaryotic and prokaryotic cells indicate that the eukaryotic cytoplasm is less crowded (Ellis, 2001), in line with our findings. Crowding in eukaryotic cells is also non-uniform. For example, in the nucleus we have calculated the protein density to be 346 g/L [using the 10–11 volume percentage obtained from cell tomography experiments (Yamaguchi et al., 2011) and nuclear protein abundances from the dataset (Ho et al., 2018)]. These large macromolecular density differences indicate that an accurate estimate of the macromolecular density of the organelle of interest is necessary.

In conclusion, a simulation cell was defined such that the yeast cellular composition of proteins, the ribosome-to-cytoplasmic protein mass ratio and the macromolecular density are retained. This was achieved by increasing the relative proportion of the most abundant proteins under specific constraints. The resulting simulation cell contains 128 protein molecules belonging to 70 protein types, 22 tRNAs and one 80S ribosome within a cubic cell of 560 Å in length. The simulation cell contents act as a generic representation of the cytoplasm that can be used to study the diffusion and interactions of molecules in the yeast cytoplasmic environment.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found at https://github.com/BMMG-Curtin/FMOLB.

## AUTHOR CONTRIBUTIONS

VK conducted all of the analyses. VK and RM wrote the manuscript, which was further proofread by MR and IS. All authors conceived and designed this study and performed the interpretation of data.

### FUNDING

VK gratefully acknowledges the receipt of a scholarship under the Aberdeen-Curtin Alliance collaborative Ph.D. program.

### ACKNOWLEDGMENTS

We thank Prof. Grant Brown (University of Toronto) for making the yeast proteomics datasets available to us.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2019.00097/full#supplementary-material

### REFERENCES


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kompella, Stansfield, Romano and Mancera. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Structural Transition States Explored With Minimalist Coarse Grained Models: Applications to Calmodulin

Francesco Delfino<sup>1,2\*†</sup>, Yuri Porozov<sup>1,3†</sup>, Eugene Stepanov<sup>4,5</sup>, Gaik Tamazian<sup>6</sup> and Valentina Tozzini<sup>2</sup>

*<sup>1</sup> I.M. Sechenov First Moscow State Medical University, Moscow, Russia, <sup>2</sup> Istituto Nanoscienze – CNR and NEST-Scuola Normale Superiore, Pisa, Italy, <sup>3</sup> ITMO University, St. Petersburg, Russia, <sup>4</sup> St. Petersburg Branch of the Steklov Mathematical Institute of the Russian Academy of Sciences, St. Petersburg, Russia, <sup>5</sup> Department of Mathematical Physics, Faculty of Mathematics and Mechanics, St. Petersburg State University, St. Petersburg, Russia, <sup>6</sup> Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia*

#### Edited by:

*Gennady Verkhivker, Chapman University, United States*

### Reviewed by:

*Marc Delarue, Institut Pasteur, France Peng Tao, Southern Methodist University, United States*

\*Correspondence: *Francesco Delfino delfinofrancesco90@gmail.com*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

Received: *10 May 2019* Accepted: *24 September 2019* Published: *15 October 2019*

#### Citation:

*Delfino F, Porozov Y, Stepanov E, Tamazian G and Tozzini V (2019) Structural Transition States Explored With Minimalist Coarse Grained Models: Applications to Calmodulin. Front. Mol. Biosci. 6:104. doi: 10.3389/fmolb.2019.00104*

Transitions between different conformational states are ubiquitous in proteins, being involved in signaling, catalysis, and other fundamental activities in cells. However, modeling those processes is extremely difficult, owing to the need to explore a vast conformational space efficiently in search of the actual transition path, for systems whose complexity is already high in the stable states. Here we report a strategy that simplifies this task by attacking the complexity on several fronts. We first apply a minimalist coarse-grained model to Calmodulin, based on an empirical force field with a partial structural bias, to explore the transition paths between the apo-closed state and the Ca-bound open state of the protein. We then select representative structures along the trajectory based on a structural clustering algorithm and build a cleaned-up trajectory with them. We finally compare this trajectory with that produced by the online tool MinActionPath, by minimizing the action integral using a harmonic network model, and with that obtained by the PROMPT morphing method, based on an optimal mass transportation-type approach including physical constraints. The comparison is performed at both the structural and the energetic level, using the coarse-grained and the atomistic force fields upon reconstruction. Our analysis indicates that this method returns trajectories capable of exploring intermediate states with physical meaning, retaining a very low computational cost, which can allow systematic and extensive exploration of the transition pathways of multi-stable proteins.

Keywords: proteins conformational transitions, classical molecular dynamics, coarse grained models, transition path sampling, minimal action path, PROMPT

### INTRODUCTION

Signaling is a core activity in cells. Most signaling processes are regulated by bi- (or multi-) stable proteins, which can undergo conformational transitions in response to changes in environmental conditions or stimuli of different origin (Grant et al., 2010). This class includes, among others, G-protein-coupled receptors (Weis and Kobilka, 2008) such as Rhodopsins (Tavanti and Tozzini, 2014) and other transducers, e.g., Calmodulin (Wenfei et al., 2014), and a vast number of enzymes undergoing conformational changes during their activity, such as the HIV-1 protease (Tozzini et al., 2007). The structural variations are usually quite large; therefore, atomistic molecular dynamics (MD) simulations might not be the most appropriate method to address them, because the slow transition kinetics requires simulations exceeding the currently reachable time and space scales. In addition, the atomistic representation with standard force fields (FF) is no guarantee of accuracy for the strongly distorted and out-of-equilibrium transition states (Best and Hummer, 2009).

Strategies to overcome these difficulties involve different actions. On one side, adopting simplified low-resolution descriptions of the system such as coarse-grained (CG) models (Tozzini, 2005) reduces the computational cost and allows more efficient sampling of the conformational space. This advantage comes at the cost of increasing the empirical content of the FF, and consequently reducing predictive power and transferability. A compromise between accuracy and predictive power (Tozzini, 2010) is reached by including some a priori knowledge of the system, in different forms, such as a (partial) bias (Tozzini and McCammon, 2005; Spampinato et al., 2014) toward reference structures. This appears to be a reasonable compromise, especially in the case of the search for the path between two given structures, when the system must in any case be forced to have them as stable states.

On the other side, one can act by simplifying the sampling algorithm, e.g., using morphing-related methods (Weiss and Levitt, 2009; Koshevoy et al., 2014; Tamazian et al., 2015) without relying on any specific FF. In particular, PROMPT (Koshevoy et al., 2014; Tamazian et al., 2015) employs an approach based on the optimal mass transportation problem including physical constraints of geometric nature (Evans and Gangbo, 1999). Methods based on the action minimization of simplified FFs, such as MinActionPath (Franklin et al., 2007), can be thought of as lying between the two approaches. The combination of the different sampling methods with the different representations of the system and its interactions has given rise in the last decades to a huge number of approaches, which has also posed the problem of their comparison and assessment (Seyler et al., 2015).

In this work, we first apply a minimalist CG model for proteins to the test case of Calmodulin, chosen because of its large conformational transition upon calcium binding. We perform molecular dynamics simulations in different conditions to sample the transition path. We then compare these results with those of the simplified path sampling methods.

### SYSTEM AND METHODS

### The Coarse Grained Model

The coarse graining procedure we consider in this work is schematized in **Figures 1A,B**, reporting the atomistic representation of a protein chain and the minimalist CG (MCG) representation in which only the Cα atoms are present. The choice of Cα as the representative atom of the amino-acid bead allows the secondary structure to be uniquely represented by the internal variables α, θ (Tozzini et al., 2006). The interactions are described by an empirical FF, derived from an energy potential U with a form similar to the atomistic ones, separated into bonded and non-bonded interactions

$$\begin{aligned} U &= \sum_{\text{bonds}} u_i^{b} \left( d_i \right) + \sum_{\text{bond angles}} u_i^{\theta} \left( \theta_i \right) \\ &+ \sum_{\text{dihedrals}} u_i^{\phi} \left( \phi_i \right) + \sum_{i > j} u^{nb} \left( r_{ij} \right) \end{aligned} \tag{1}$$

d<sub>i</sub>, θ<sub>i</sub>, and ϕ<sub>i</sub> being the bond distances, bond angles, and dihedrals describing the local geometry of connected beads, and r<sub>ij</sub> the distances between non-bonded ones (see **Figure 1B**). The functional forms (reported in **Table 1**) are somewhat more complex than those used in atomistic FFs: while the u<sup>b</sup><sub>i</sub> are holonomic restraints, the u<sup>θ</sup><sub>i</sub> and u<sup>ϕ</sup><sub>i</sub> take forms accounting for the anharmonicity of the CG interactions; in addition, the parameters are chosen to account for the different geometrical stiffness of the secondary structures, assigning different values to helices and sheets (see **Table 1**)<sup>1</sup>. The non-bonded interactions occur between pairs not already involved in a bond, bond angle or dihedral interaction and are separated into local and non-local parts

$$\sum_{i>j} u^{nb} \left( r_{ij} \right) = \sum_{i,j \mid r_{ij} < r_{cut}} u_{loc} \left( r_{ij} \right) + \sum_{i,j \mid r_{ij} > r_{cut}} u_{nl} \left( r_{ij} \right) \tag{2}$$

both represented by a Morse potential, with the local term retaining a bias toward a reference structure (see **Table 1**). In this work the local/non-local separation is based on a geometric criterion: all the non-bonded pairs whose distance is less than r<sub>cut</sub> = 8.5 Å in the reference structure are considered local, the others are considered non-local. The cutoff value used here was previously shown to include all the relevant H-bonds and other possible specific interactions such as disulfide or salt bridges (Trovato and Tozzini, 2012). The parameters of the Morse potential were optimized in our previous works, including a dependence on r<sub>0</sub> (the distance in the reference structure) in order to reproduce stronger interactions in the H-bonding range and weaker ones in the hydrophobic range (Di Fenza et al., 2009) (see **Table 1**). Since here we are not interested in the accurate simulation of the inter-protein interactions, the non-local part is represented by a generic amino-acid-independent potential reproducing an average level of hydrophobicity (**Table 1**), rather than by a complex matrix of amino-acid-dependent potentials (Trovato et al., 2013).
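The geometric criterion for the local/non-local separation can be illustrated with a short sketch. The Morse expression below is the generic textbook form, not necessarily the parameterization of **Table 1**; the minimum chain separation of four beads (pairs not already covered by bond, angle or dihedral terms) is our reading of the text, and the coordinates are a toy hairpin.

```python
import math

R_CUT = 8.5  # Å, cutoff applied to distances in the reference structure


def morse(r, r0, eps, alpha):
    """Generic Morse potential with minimum -eps at r = r0."""
    return eps * ((1.0 - math.exp(-alpha * (r - r0))) ** 2 - 1.0)


def split_pairs(ref_coords, min_separation=4):
    """Classify non-bonded bead pairs by their reference-structure distance.

    Pairs closer than `min_separation` along the chain are skipped, assuming
    they are already handled by bond, angle and dihedral terms.
    """
    local, non_local = [], []
    n = len(ref_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d0 = math.dist(ref_coords[i], ref_coords[j])
            (local if d0 < R_CUT else non_local).append((i, j, d0))
    return local, non_local


# Toy reference structure: an eight-bead hairpin with 4 Å spacing.
ref = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0),
       (12, 4, 0), (8, 4, 0), (4, 4, 0), (0, 4, 0)]
local, non_local = split_pairs(ref)
```

Each local pair would then carry a Morse term with its own r<sub>0</sub> taken from the reference structure, which is how the bias toward that structure enters.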

### Simulation Setup and Transition Path Extraction

MD simulations were performed in the canonical ensemble using the Langevin (stochastic) thermostat. The timestep was set at 0.01 ps. Simulations had different lengths, between 20 and 50 ns. The data dumping frequency was on average 0.1 ps<sup>−1</sup>. Simulations were performed with the two different CG FFs (hereafter FF<sub>A</sub> and FF<sub>B</sub>) generated with a bias toward closed and open states (A and B, respectively), and at different temperatures. Simulations were performed with DL\_POLY [version 4.08 (Bush et al., 2006; Todorov et al., 2006; Boateng and Todorov, 2015)] and the input was generated with proprietary software.

<sup>1</sup>The continuous dependence of the k<sub>θ</sub> elastic constant is a variant with respect to previous works using step-wise dependences (e.g., Di Fenza et al., 2009), which improves the numerical stability of the model.

In order to extract a transition path from the trajectory, we first define the parameter σ based on the root mean square deviation (RMSD<sub>A/B</sub>) of a configuration **r** = {x<sub>i</sub>, y<sub>i</sub>, z<sub>i</sub>} from the reference structures **r**<sup>A/B</sup> [after alignment (Humphrey et al., 1996)<sup>2</sup>, to eliminate roto-translations]

$$\begin{aligned} \text{RMSD}_{A/B}(\mathbf{r}) &= \sqrt{\frac{1}{N} \sum_{i} \left( \mathbf{x}_{i} - \mathbf{x}_{i}^{A/B} \right)^{2}} \\ \sigma(\mathbf{r}) &= \frac{1}{2} \left( \frac{\text{RMSD}_{A}(\mathbf{r}) - \text{RMSD}_{B}(\mathbf{r})}{\text{RMSD}_{A,B}} \right) + \frac{1}{2} \end{aligned} \tag{3}$$

σ ranges between 0 (in A) and 1 (in B) and is a rough measure of the transition advancement. Clearly, structures with the same σ(**r**) can have different conformations, with different distances from A and B, accounted for by RMSD<sub>A</sub>(**r**) and RMSD<sub>B</sub>(**r**) separately, since the calculation of σ in practice projects the two-dimensional path in the RMSD<sub>A</sub>/RMSD<sub>B</sub> plane onto a line connecting A and B. Therefore, the scatter plot of RMSD<sub>B</sub> vs. RMSD<sub>A</sub> will also be considered to obtain more specific information on the transition path. σ is used to compare the properties of structures with similar transition advancement from the three different methods.
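A minimal sketch of the progress parameter σ is given below. It assumes that all structures have already been superimposed, so no alignment step is included, and it takes RMSD<sub>A,B</sub> to be the RMSD between the two reference structures; the two-bead coordinates are purely illustrative.

```python
import math


def rmsd(conf, ref):
    """Root mean square deviation between two pre-aligned bead lists."""
    n = len(conf)
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(conf, ref)) / n)


def sigma(conf, ref_a, ref_b):
    """Transition advancement: 0 in state A, 1 in state B."""
    rmsd_ab = rmsd(ref_a, ref_b)
    return 0.5 * (rmsd(conf, ref_a) - rmsd(conf, ref_b)) / rmsd_ab + 0.5


# Toy two-bead "structures".
ref_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
midpoint = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

sigma_a = sigma(ref_a, ref_a, ref_b)
sigma_b = sigma(ref_b, ref_a, ref_b)
sigma_mid = sigma(midpoint, ref_a, ref_b)
```

Any structure equidistant from A and B gives σ = 0.5 regardless of how far it lies from both, which is why the RMSD<sub>B</sub> vs. RMSD<sub>A</sub> scatter plot carries additional information.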

<sup>2</sup>Alignment is performed by means of the built-in extension "RMSD Trajectory Tool" of the graphics software VMD.

In order to identify a limited number of relevant points along the trajectory, we applied the principal path (PP) clustering algorithm (Ferrarotti et al., 2018) to the MD trajectories and extracted reduced trajectories, which retain the salient properties of the original ones. The PP algorithm is a regularized version of the k-means clustering algorithm (Arthur and Vassilvitskii, 2007), based on the evaluation of a cost functional composed of two parts: the sum of the squared distances of each point from its respective representative structure, and the sum of the squared distances between adjacent representative structures. The relative weight of the two components (the regularization parameter) is obtained by Bayesian evidence maximization. The cost functional can be interpreted as an energy, thus the Bayesian posterior probability function is set proportional to the exponential of its negative. The result of the clustering is a "cleaned-up trajectory" of representative structures, used to evaluate σ and energy profiles.
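The two-part cost functional of the PP algorithm can be written down directly. The sketch below only evaluates the functional for a given set of representatives and a fixed, hypothetical weight `s`; it assumes each point is assigned to its nearest representative, and it does not perform the optimization or the Bayesian evidence maximization used to set the weight.

```python
import math


def pp_cost(points, waypoints, s):
    """Regularized k-means cost for an ordered list of representatives.

    fit:    squared distance of each point to its nearest representative.
    smooth: squared distance between consecutive representatives.
    """
    fit = sum(min(math.dist(p, w) ** 2 for w in waypoints) for p in points)
    smooth = sum(math.dist(a, b) ** 2 for a, b in zip(waypoints, waypoints[1:]))
    return fit + s * smooth


# Toy one-dimensional "trajectory" embedded in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
waypoints = [(0.5, 0.0), (2.5, 0.0)]

cost = pp_cost(points, waypoints, s=0.1)  # 1.0 (fit) + 0.1 * 4.0 (smooth)
```

With `s = 0` the functional reduces to the ordinary k-means objective; increasing `s` pulls consecutive representatives together and yields a smoother path.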

Energies were evaluated both with the CG FFs and at the atomistic level. To this end, the atomistic structures were rebuilt from the MCG models using Pulchra (Rotkiewicz and Skolnick, 2008) without any local optimization, then explicitly hydrated and locally optimized using the OPLSe (Harder et al., 2016) FF with explicit solvent and the Polak-Ribiere conjugate gradient algorithm (Polak and Ribiere, 1969), keeping the backbone frozen during the minimization. The calculations were performed with Schrodinger 2018-2, MacroModel (2019).

### PROMPT and MAP Path Search

The PP clustering trajectories are compared with the trajectories obtained from other transition analysis methods. The MinActionPath (MAP) method (Franklin et al., 2007) employs differential equations obtained by minimizing an action functional that includes a very simplified potential term, representing the protein as a network of harmonic interactions (the elastic network model, ENM) (Tirion, 1996). The equilibrium distances are taken from the reference structures, making the ENM the simplest completely biased model. The solutions to the pair of differential equations are merged by requiring continuity between them. The final result is a single trajectory connecting the two states, reproducing the energy profiles of the mono-stable ENMs near A or B, and with a continuous crossover region.
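The mono-stable character of each ENM can be made concrete with a short sketch of the harmonic network energy. The cutoff, the spring constant and the three-bead coordinates are hypothetical choices for illustration; the actual MAP parameters are not given here.

```python
import numpy as np

def pair_distances(x):
    """All pairwise distances of an (N, 3) coordinate array."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

def enm_energy(x, ref, cutoff=8.0, k_spring=1.0):
    """Harmonic network energy with rest lengths taken from the reference."""
    i, j = np.triu_indices(len(ref), k=1)
    d_ref = pair_distances(ref)[i, j]
    d = pair_distances(x)[i, j]
    bonded = d_ref < cutoff          # springs only between beads close in the reference
    return 0.5 * k_spring * float(((d[bonded] - d_ref[bonded]) ** 2).sum())

# Hypothetical 3-bead reference structures (coordinates in Å).
ref_a = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
ref_b = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0]])

print("E_A at A:", enm_energy(ref_a, ref_a))
print("E_A at B:", enm_energy(ref_b, ref_a))
print("E_B at A:", enm_energy(ref_a, ref_b))
```

Each network has its single minimum at its own reference structure, so the energy of A evaluated with the B network (and vice versa) is positive.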

On the other hand, PROMPT (Tamazian et al., 2015) [PRotein cOnformational Motion PredicTion<sup>3</sup>] connects states A and B avoiding relations to any specific FF, by using only structural information. The protein is represented at the CG level and each protein conformation is handled as a set of internal coordinates. The transition path is first guessed, e.g., using linear interpolation between the extremal configurations **r**<sup>A</sup> and **r**<sup>B</sup>. The "admissible motions" are defined as those preserving all the bond lengths b<sub>i</sub><sup>J</sup> and other physical constraints related to

<sup>3</sup> Implemented in a publicly available toolbox for MATLAB with its source code on GitHub (http://github.com/gtamazian/PROMPT) and MATLAB File Exchange (http://www.mathworks.com/matlabcentral/fileexchange/49054-prompt).


*An illustration of the statistics-based parameterization procedure is also reported in the plots. Upper plot: the dots represent the inverse bond angle fluctuations as a function of the bond angle, evaluated using atomistic simulations of different test proteins (yellow: a globular protein; blue: calmodulin itself; different symbols for different runs). This curve can be fitted as a damped sine (cyan line). Assuming statistical equilibrium, one has an angle-dependent effective elastic constant from the equation k′ = k<sub>B</sub>T/⟨θ<sup>2</sup>⟩. A further factor 1/sin<sup>2</sup>(θ<sub>0</sub>) accounts for the not exactly harmonic functional form used here (i.e., harmonic cosine), leading to the final functional form for k<sub>θ</sub> reported in the table, which accounts for the secondary structure dependence of the elastic constant (stronger for helices with θ<sub>0</sub> ∼ 90°, softer for strands with θ<sub>0</sub> > 110°). Red dots show the result from a simulation with the MCG model with this parameterization. The black line reports the previously used parameter dependence for comparison. For the dihedral term a similar secondary structure dependent parameterization is used, expressed through a simpler stepwise dependence on the dihedral value. The non-bonded interaction parameters are reported in the lower plot: dependence of the well depth (ε) and interaction range (1/α) on the equilibrium distance (the shorter the equilibrium distance, the stronger and shorter ranged the interaction). The plot also reports typical interactions included in the corresponding ranges. In all cases, the 0 subscript indicates the rest value of the corresponding variable; the superscripts i or i, j are the Cα indices (e.g., r<sup>ij</sup><sub>0</sub> is the rest value of the distance between the i and j Cαs).*

bond and dihedral angles (i is the index running along the internal coordinates, and J labels the configuration along the path, from A to B). The path connecting A and B is therefore found by minimizing a kinetic-only action integral within the space of admissible motions factorized by rigid roto-translations. The infinite-dimensional variational problem is addressed by discretizing the path between A and B and is solved by means of the gradient descent method. The admissible motions are searched by changing the internal free variables of the system, i.e., {θ<sub>i</sub><sup>J</sup>, φ<sub>i</sub><sup>J</sup>} in the MCG model; θ<sub>i</sub><sup>J</sup> is treated by interpolation when possible. The detailed description and formal comparison of the three methods are reported elsewhere (Delfino et al., in preparation). Energies along the MAP and PROMPT trajectories were compared using both atomistic (upon rebuilding and side chain optimization, as already explained) and MCG FFs.
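The need to restrict the search to admissible motions can be seen in a small sketch, which is not the PROMPT implementation. It builds a linear interpolation in Cartesian coordinates between two hypothetical three-bead structures, evaluates a discretized kinetic-only action, and checks the bond lengths along the path. Unit masses, a unit time step and all names are assumptions.

```python
import numpy as np

def linear_path(x_a, x_b, n_frames=11):
    """Cartesian linear interpolation, shape (n_frames, N, 3)."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None, None]
    return (1.0 - t) * x_a + t * x_b

def kinetic_action(path, dt=1.0, mass=1.0):
    """Discretized integral of the kinetic energy along a path."""
    steps = np.diff(path, axis=0)
    return 0.5 * mass * float((steps ** 2).sum()) / dt

def bond_lengths(path):
    """Distances between consecutive beads, shape (n_frames, N - 1)."""
    return np.linalg.norm(np.diff(path, axis=1), axis=2)

# Hypothetical 3-bead end states with 3.8 Å bonds.
x_a = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
x_b = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0]])

path = linear_path(x_a, x_b)
print("action:", kinetic_action(path))
print("shortest bond along the path (Å):", bond_lengths(path).min())
```

In this toy case the Cartesian interpolation shortens the second bond well below its rest length in the middle of the path, whereas a path built by varying only the bond and dihedral angles would keep the bond lengths fixed by construction.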

### RESULTS

### Molecular Dynamics of the Open-Closed Transition of Calmodulin

Calmodulin (Cam) displays two very different conformations (Wenfei et al., 2014), depending on the environmental calcium concentration. The two extremal structures of Cam, i.e., closed (A) and open (B) (see **Figure 1C**), correspond to the apo and Ca<sup>2+</sup>-bound states, respectively. Because these are, de facto, distinct proteins, having different ligands, it is conceptually correct to use two distinct FFs and to perform an LD simulation started from A using FF<sup>B</sup> to reproduce the A→B transition occurring upon Ca<sup>2+</sup> binding and, vice versa, one using FF<sup>A</sup> for the inverse B→A transition occurring upon Ca<sup>2+</sup> release. A few data are available for the difference in Gibbs free energy between the folded and denatured proteins, ranging between ΔG<sub>A</sub>∼1.5–3.5 kcal/mole for the A state (Masino et al., 2000; Rabl et al., 2002) and ΔG<sub>B</sub>∼4.5–6.5 kcal/mole for the B state (Masino et al., 2000). Energy alignment is not straightforward; however, one might take the denatured state as reference and infer that the B state is more stable than the A state by about 2–4 kcal/mole.

The A–B transition was simulated with LD, in both directions, at 300 K (RT) and at 130 K (complete simulation data in the **Supplementary Material**). **Figures 2A,B** report the energies along the LD simulations. In both cases the transitions are clearly visible in the evolution of σ, passing from 0 to 1 (A→B, green) or from 1 to 0 (B→A, red), though they occur at different times, depending on the simulation parameters and on the FF. In particular, the closed-to-open transition (green) occurs earlier and more directly, while the inverse open-to-closed transition appears to explore an intermediate conformation with σ∼0.4–0.5 for tens of ns before reaching the final state. This is better seen in the RMSD scatter plots reported in panels C and D: the intermediate state, located in the upper right off-diagonal part of the plot, persists also after the clustering procedure (joined dots in **Figures 2C,D**) and is present at both high and low temperature, although at low temperature it is pushed toward the diagonal. It corresponds to a compact globular conformation, favored over the completely open one by hydrophobicity, but in which the specific contacts of the closed conformation are not formed (see the inset in **Figures 2C,D**, red structures). In this work Cam is used only as an example, therefore exploring its transition in detail is beyond our scope. However, we remark that the presence of such misfolded transition intermediates was previously documented (Wenfei et al., 2014). The intermediate is visible neither in the A→B simulations (green), in which the system passes rapidly to B, nor in the PROMPT and MAP trajectories, which lie near the diagonal line joining A and B in the RMSD plot. These, additionally, display distorted conformations in the intermediate σ regions. An inspection of the structures with σ∼0.5 (reported in purple and cyan in **Figure 2C**) shows distortions in the central helix and overly contracted terminal regions in the PROMPT structure, and a broken chain in the MAP structure.

### Data Clustering and Comparison With PROMPT and MAP

While MAP and PROMPT return transition paths made of a few points, the MD simulations explore a large portion of the conformational space, returning thousands of conformations. Therefore, in order to compare the methods, we first performed a post-processing and clean-up of the MD trajectories to select a limited number of representative states along them. This can be done in several ways. **Figure 3A** reports a simple averaging procedure: the structures are first ordered according to their σ value (red and green dotted/dashed lines), so that the A→B transition is read from left to right and the B→A one from right to left. Once again, the formation of an intermediate cluster at σ = 0.4–0.5 is clearly visible in the B→A simulations, besides the large cluster of A-type structures, while the A→B simulations show a large cluster of B-type structures. The structures are then grouped according to their σ value in a given number of regular σ intervals; the average energy evaluated in each interval is reported in the plot for the A→B (green) and B→A (red) simulations at 300 and 130 K (dots with error bars). Interestingly, transitions occur in all cases with a gain of ∼20 kcal/mole (as measured from the starting state, i.e., in each case the opposite of the stable one), irrespective of the temperature and of the FF. As said, comparing the energies resulting from two different FFs is not straightforward. In this case, an inspection of **Figures 2C,D** shows that the simulation trajectories with FF<sup>A</sup> and FF<sup>B</sup> get particularly close in a region of the RMSD<sub>A</sub>–RMSD<sub>B</sub> plane corresponding to σ∼0.4, indicating that in that area structures belonging to different trajectories are similar. Aligning the energy values at that value of σ in the plot of **Figure 3A** generates a small shift, making the B structure more stable than the A one by about 3–4 kcal/mole, roughly corresponding to the experimental evaluation. The resulting "activated state structure" corresponds to the intermediate found in the B→A simulations, which turns out to be located ∼10 kcal/mole above the A/B states. This "barrier" value seems rather independent of the simulation temperature, whose effect appears to be a rigid shift of the average energies.
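The averaging and alignment procedure described above can be sketched as follows. This is an illustrative reading of the procedure, not the analysis script used for the figures: the bin count, the function names and the toy numbers are hypothetical.

```python
import numpy as np

def bin_by_sigma(sigma, energy, n_bins=10):
    """Mean and standard deviation of the energy in regular sigma intervals."""
    sigma = np.asarray(sigma, dtype=float)
    energy = np.asarray(energy, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of each frame; values at or beyond the edges go to the end bins.
    idx = np.clip(np.digitize(sigma, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_bins, np.nan)
    stds = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = energy[idx == b]
        if in_bin.size:
            means[b] = in_bin.mean()
            stds[b] = in_bin.std()
    return centers, means, stds

def shift_to_align(means_ref, means_other, centers, sigma_align=0.4):
    """Constant to add to the second profile so both agree near sigma_align."""
    b = int(np.argmin(np.abs(centers - sigma_align)))
    return means_ref[b] - means_other[b]

# Toy data: three frames, two sigma intervals.
centers, means, stds = bin_by_sigma([0.05, 0.05, 0.95], [1.0, 3.0, 10.0], n_bins=2)
print(centers, means, stds)
```

Empty intervals are returned as NaN so that they can be skipped when plotting, and the alignment shift is a single rigid offset between the two force-field energy scales.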

While the described procedure gives reasonable values of the energies, representative structures along the trajectories are more properly selected via the PP algorithm. This returns a user-defined number (20 in this case) of elements, which do not belong to the trajectories they represent, but rather optimize the structure variance within the trajectory. As a consequence, the energy profiles obtained by evaluating the FF<sup>A</sup> and FF<sup>B</sup> energies on them (**Figure 3B**, solid lines and square symbols) are rather regular and lie lower in energy than those of the parent trajectories, shown by lines connecting circle symbols (obtained by selecting the trajectory elements nearest to the optimal ones, filled and empty dots connected by dotted and dashed lines). Remarkably, even after post-processing, the main features of the simulation remain: the cluster located at σ∼0.4–0.5 is well represented in the FF<sup>A</sup> simulations, and is located about 10 kcal/mole above the A and B states.

The optimal element trajectories extracted from the low temperature runs are also reported in **Figure 3C**, to be compared with the energies evaluated from the MAP and PROMPT trajectories using the MCG FFs. Even after a local optimization, the energies from MAP and PROMPT rapidly increase, producing a very large energy barrier at intermediate σ values. An inspection of the structures (reported as insets in the plot) reveals that these high energies arise from severe distortion of the backbone (especially for MAP) and/or steric clashes (both). In particular, the high energy of the intermediate from PROMPT seems to be due to steric clashes in one of the two ends of the protein (highlighted with a yellow circle in **Figure 3C**).

Clearly, higher energies on the MAP/PROMPT paths evaluated with the MCG FFs are expected, since the low energy path extracted with PP from the simulations minimizes the MCG Hamiltonian. Therefore, in order to clarify whether this energy difference reflects a genuinely larger stability of the MCG-derived conformations, we rebuilt the atomistic structure of the paths

FIGURE 2 | Simulation results from Langevin dynamics at 300 K, γ = 8 ps<sup>−1</sup> (A) and 130 K, γ = 2 ps<sup>−1</sup> (B). Temperature (upper plots), total and potential energies (central plots) and σ are reported along the simulations from A to B (using FF<sup>B</sup> and starting from configuration A, green lines), and from B to A (using FF<sup>A</sup> and starting from configuration B, red lines). For the 300 K simulation the running averages of the potential energy are also reported, as yellow and blue lines, respectively. (C,D) Scatter plots of the LD simulations (same color coding as above) compared with the MAP and PROMPT path evaluations (color coding as in the legend of C). The connected dots are the representative elements of the PP clustering procedure. Sample configurations are reported in colors corresponding to the lines, and their approximate locations in the plots are indicated by arrows.

evaluated with all methods and compared their energies evaluated with the atomistic FFs (**Figure 3D**), after optimization of the side chain conformations keeping the backbone structure fixed. All methods give comparable energies for structures near the A and B states, where in some cases PROMPT and MAP seem to work better than the MCG models. However, the atomistic analysis confirms the strong instability of the MAP-derived structures, which display unphysical backbone conformations, as shown by the reported Ramachandran plot (upper right inset of **Figure 3D**). The instabilities of the PROMPT profile are confirmed in the central σ∼0.2–0.8 region, although the Ramachandran plot (central inset) is regular even there. In fact, in agreement with what was found in the MCG model, the instability is not due to a wrong backbone conformation, but to steric clashes in the highlighted area (yellow circle), displaying two sheets that are too close to each other and not correctly aligned. The complete set of structures and energy data is reported as **Supplementary Material**.

### SUMMARY AND CONCLUSIONS

In this work we set up a simulation paradigm for finding the transition path of proteins undergoing large conformational transitions, which is a long-standing problem in biophysics. Proteins are modeled by a Cα-based coarse-grained representation, while the transition path is explored via classical molecular dynamics simulations with FFs partially biased toward the reference structures. The selection of a representative trajectory among the huge number of configurations explored during molecular dynamics simulations is accomplished by means of the principal path clustering algorithm, which managed to single out trajectories close to those of minimum free energy, yet capable of exploring intermediate states, at a very low computational cost. The comparison with minimal action path and PROMPT can be summarized as follows: MAP returns structures which are reasonable in the near vicinity of the reference states, but is unable

FIGURE 3 | Simulation data analysis and comparison with PROMPT and MAP. (A) Potential energy vs. σ along the simulations at 300 K (dotted lines) and at 130 K (dashed lines), with the FF<sup>A</sup> (red) and FF<sup>B</sup> (green) force fields (scales for FF<sup>A</sup> and FF<sup>B</sup> are shifted by 3 kcal/mole to align the activated state, as explained in the text; both scales are reported on the left and right axes, in colors corresponding to the FF they refer to). Colored dots with error bars are averages over subsets of structures classified by σ intervals (error bars correspond to standard deviations of the data from the average values). Representative closed (σ = 0) and open (σ = 1) structures are reported under the plot. (B) Potential energies vs. σ evaluated over the representative structures of the clusters output by the PP procedure. Squares connected by solid lines: representatives optimized by the PP procedure (filled = from the 300 K simulation, empty = from the 130 K simulation, red with FF<sup>A</sup>, green with FF<sup>B</sup>). Circles connected by dashed/dotted lines: same as previous, but evaluated over a trajectory of structures extracted from the simulations, the nearest to the optimal ones (same color and empty/filled code as for squares; shift of scales as in A). (C) Comparison of the 130 K "optimal" energies with energies of trajectories from MAP (cyan) and PROMPT (magenta) evaluated with FF<sup>A</sup> (dotted) and FF<sup>B</sup> (dashed). Representative structures of the activated states are reported in corresponding colors. Same scale shift as in (A); the vertical scales are broken to zoom in on the low energies. (D) Potential energy evaluated with the atomistic FF over the same trajectories as in (C) (same color coding). Representative structures are reported in corresponding colors; the Ramachandran plots of the activated states of PROMPT and MAP are reported (yellow square dots superimposed on the standard map in colors). In both (C,D) the area with distorted sheets in the activated state of PROMPT is highlighted with a yellow circle.

to provide meaningful ones, even after local optimization, in the intermediate regions. This was somewhat expected: in fact, stronger post-processing methods, involving e.g., the generation of swarms of unbiased trajectories from the transition states, were proposed to solve this problem (Pan et al., 2008). PROMPT additionally returns good local backbone conformations along the whole path, but does not guarantee that amino acids separated along the chain do not get too close and cause steric clashes, which in fact happens in the intermediate regions. The MCG simulations guarantee physically sound structures along the whole path and can also explore intermediates far from the reference structures, but need appropriate post-processing and clustering techniques to extract a reaction path. We envision that a synergistic use of these methods might combine accuracy and efficiency in the path search. This possibility, and the application to a number of diverse proteins, are explored in a forthcoming paper (Delfino et al., in preparation).

### DATA AVAILABILITY STATEMENT

All data used for this work are included in the paper or in **Supplementary Material**.

### AUTHOR CONTRIBUTIONS

FD has produced data and performed analyses, and participated in writing. YP has produced data and analyses, designed work, contributed ideas, and participated in writing. ES and GT have designed the work, contributed ideas, and participated in writing. VT has contributed ideas, designed work and supervised it, and written the paper.

### FUNDING

The work of ES has been supported by the RSF grant 19-71-30020, Applications of probabilistic artificial neural generative models to development of digital twin technology for non-linear stochastic systems (HSE University). The work of YP has been supported by the Russian Academic Excellence Project 5-100 of Sechenov Medical University.

### REFERENCES


### ACKNOWLEDGMENTS

We wish to thank Dr. Walter Rocchia for useful discussions and for support in using the software for PP calculations, and Prof. Paolo Carloni, Dr. Giulia Rossetti, and Dr. Emiliano Ippoliti for useful discussions on Calmodulin. We also wish to thank Vladimir Kadochnikov for help with Gromacs calculations.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2019.00104/full#supplementary-material

Numerical raw data (trajectories and energies during simulations, clusters analysis etc.) are available as **Supplementary Material**.


to a minimal polypeptide model. J. Chem. Theory Comput. 2, 667–673. doi: 10.1021/ct050294k


Wenfei, L., Wang, W., and Takada, S. (2014). Energy landscape views for interplays among folding, binding, and allostery of calmodulin domains. Proc. Natl. Acad. Sci. U.S.A. 111, 10550–10555. doi: 10.1073/pnas.1402768111

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Delfino, Porozov, Stepanov, Tamazian and Tozzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Enzymatic Polymerization of PCL-PEG Co-polymers for Biomedical Applications

Pedro Figueiredo†, Beatriz C. Almeida† and Alexandra T. P. Carvalho\*

*Center for Neuroscience and Cell Biology, Institute for Interdisciplinary Research (IIIUC), University of Coimbra, Coimbra, Portugal*

Biodegradable polymers, obtained via chemical synthesis, are currently employed in a wide range of biomedical applications. However, enzymatic polymerization is an attractive alternative because it is more sustainable and safer. Many lipases can be employed in the ring-opening polymerization (ROP) of biodegradable polymers. Nevertheless, the harsh conditions required in an industrial context are not always compatible with their enzymatic activity. In this work, we have studied a thermophilic carboxylesterase and the commonly used Lipase B from *Candida antarctica* (CaLB) for the tailored synthesis of amphiphilic polyesters for biomedical applications. We have conducted Molecular Dynamics (MD) and Quantum Mechanics/Molecular Mechanics (QM/MM) MD simulations of the synthesis of Polycaprolactone-Polyethylene Glycol (PCL-PEG) model co-polymers. Our insights into the reaction mechanisms are important for the design of customized enzymes capable of synthesizing different polyesters for biomedical applications.

### Edited by:

*Giulia Palermo, University of California, Riverside, United States*

### Reviewed by:

*Lorenzo Casalino, University of California, San Diego, United States Tiziana Marino, University of Calabria, Italy*

\*Correspondence:

*Alexandra T. P. Carvalho atpcarvalho@uc.pt*

*†These authors have contributed equally to this work*

### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *23 July 2019* Accepted: *04 October 2019* Published: *17 October 2019*

### Citation:

*Figueiredo P, Almeida BC and Carvalho ATP (2019) Enzymatic Polymerization of PCL-PEG Co-polymers for Biomedical Applications. Front. Mol. Biosci. 6:109. doi: 10.3389/fmolb.2019.00109*

Keywords: MD calculations, QM/MM MD simulations, PCL-PEG co-polymers, biodegradable polymers, ROP

## INTRODUCTION

Aliphatic polyesters have attracted great attention in the medical field due to their biodegradability, biocompatibility, and drug permeability, which allow the use of these polymers in biomedical applications (Cameron and Shaver, 2011). However, the hydrophobicity of some of these polymers, such as Polycaprolactone (PCL), still hampers some of their applications (for example, their use as drug delivery vesicles). PCL nanoparticles can be easily absorbed by proteins or be identified and captured by reticuloendothelial cells (Huang et al., 2015). A good way to protect them from being absorbed by proteins is to modify the surface hydrophilicity (Huang et al., 2015). Polyethylene Glycol (PEG), one of the most interesting initiators for the synthesis of polyesters, can be used as the hydrophilic part of linear amphiphilic block co-polymers (Piao et al., 2003; Fairley et al., 2008; Yang et al., 2014). PEG is a non-ionic and water-soluble polymer with biological compatibility, non-toxicity, non-antigenicity, and non-immunogenicity (Panova and Kaplan, 2003). These properties make this hydrophilic polymer widely applied in the pharmaceutical industry and in biomedical applications (Hutanu et al., 2014; Grossen et al., 2017). Recently, it was also employed in the development of polymer-based drug delivery systems. These systems consist of polymers covalently attached to systemic drugs, increasing their molecular weight and thus their circulation time (Hutanu et al., 2014).

Pharmaceutical moieties produced by chemical synthesis usually contain traces of metal catalysts, which can be problematic for biomedical applications because of their toxicity. Thus, enzymatic synthesis of polymers is considered advantageous and has been extensively studied (Albertsson and Srivastava, 2008; Kobayashi, 2009, 2010; Kobayashi and Makino, 2009; Zhang et al., 2014). Enzymes present many advantages, e.g., they usually operate under mild reaction conditions, they can be highly selective, and they are biodegradable. The enzymatic synthesis of Polycaprolactone–Polyethylene Glycol (PCL–PEG) triblock co-polymers was reported for the first time in 2003, using Novozyme 435 (immobilized lipase B from Candida antarctica, CaLB) with fair to good yields (70 °C, 63–70% yield), but still with relatively low molecular weights (12,500–17,600 g/mol) (He et al., 2003). A few years later, Huang and colleagues again used Novozyme 435, with PEG as the hydrophilic initiator, to induce the ring-opening polymerization (ROP) of ε-caprolactone (ε-Cl). They were able to produce amphiphilic co-polymers with slightly higher molecular weights (M<sub>n</sub> = 11,900–19,000 g/mol at 70 °C, 1.28–1.59 polydispersity index). However, these M<sub>n</sub> values are still low, so approaches with other enzymes or modified enzymes are still required (Huang et al., 2015). Here, in the quest to better understand these processes at the atomic level and also to search for alternative enzymes (such as extremophile enzymes, which can withstand harsh industrial conditions), we have studied reaction mechanisms where **PEG** is the initiator in the ROP deacylation step of PCL-PEG co-polymers. We modeled the **PEG** initiator at two different chain sizes. The simpler model consists of a molecule of ethylene glycol and the larger one of a polymer with three molecules of ethylene glycol. The initial structure for the Quantum Mechanics/Molecular Mechanics Molecular Dynamics (QM/MM MD) calculations was the second tetrahedral intermediate structure, and the simulations were performed with two enzymes: the commonly used CaLB and the thermophilic esterase from the archaeon Archaeoglobus fulgidus (AfEST).

### COMPUTATIONAL METHODS

### Systems Initial Setup

The initial structures were modeled from the crystal structures of CaLB (0.91 Å resolution) and AfEST (2.2 Å resolution), PDB codes 5A71 (Stauch et al., 2015) and 1JJI (De Simone et al., 2001), respectively, and MolProbity (Chen et al., 2010) was used to assign the protonation states. The enzyme-activated monomer structures (**EAM** with one molecule of ethylene glycol, **MEG**, and **EAM** with an oligomer with three molecules of ethylene glycol, **PEG**), the second tetrahedral intermediate structures (**INT-2**) and the product complexes (**PC**, **Co-P** model compound, and **Co-3P** model compound) were geometry optimized in Gaussian09 (Frisch et al., 2009) at the B3LYP/6-31G(d) level (Ashvar et al., 1996) with the Polarizable Continuum Model (PCM) (Tomasi et al., 2005) solvent description. The Restrained Electrostatic Potential (RESP) (Bayly et al., 1993) method from HF/6-31G(d) single point energy calculations was used to assign the atomic partial charges. The structures were placed within a pre-equilibrated octahedral box of toluene (10.0 Å between the surface of the protein and the box) and the entire systems were neutralized with counter ions. The systems were subjected to two initial energy minimizations and 500 ps of equilibration in an NVT ensemble, using Langevin dynamics with small restraints on the protein (10.0 kcal/mol) to heat the system from 0 to 300 K. Production simulations were carried out at 300 K in the NPT ensemble, also using Langevin dynamics, with a collision frequency of 1 ps<sup>−1</sup>. Constant pressure periodic boundary conditions were imposed with an average pressure of 1 atm. Isotropic position scaling was used to maintain pressure with a relaxation time of 2 ps. The time step was set to 2 fs. SHAKE constraints were applied to all bonds involving hydrogen atoms (Ryckaert et al., 1977). All the simulations were performed with the Amber molecular dynamics program (AMBER18) (Salomon-Ferrer et al., 2013) using the parm99SB (Hornak et al., 2006) and GAFF (Wang et al., 2004) force fields. All reactant, product and intermediate structures were submitted to triplicate simulations of 20 ns each, with different initial velocities. The reference structures represented in the figures were the structures with the lowest root-mean-square deviation (RMSD) from the average of the simulations (Dourado et al., 2018).
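For orientation, the production settings listed above map onto standard Amber `&cntrl` keywords roughly as in the sketch below. This is a hypothetical reconstruction, not the input file used in this work: options that are not stated in the text (non-bonded cutoff, output frequencies, restart handling) are deliberately left out, and the helper names are illustrative.

```python
def production_settings(total_ns=20.0, dt_ps=0.002):
    """Amber-style &cntrl settings for one 20 ns NPT production run."""
    return {
        "imin": 0,                                    # run MD, not minimization
        "nstlim": round(total_ns * 1000.0 / dt_ps),   # number of MD steps
        "dt": dt_ps,                                  # 2 fs time step (in ps)
        "ntc": 2,                                     # SHAKE on bonds involving hydrogen
        "ntf": 2,                                     # skip forces of constrained bonds
        "ntt": 3,                                     # Langevin thermostat
        "gamma_ln": 1.0,                              # collision frequency, ps^-1
        "temp0": 300.0,                               # target temperature, K
        "ntb": 2,                                     # constant-pressure periodic box
        "ntp": 1,                                     # isotropic position scaling
        "pres0": 1.01325,                             # 1 atm expressed in bar
        "taup": 2.0,                                  # pressure relaxation time, ps
    }

def to_mdin(settings, title="NPT production (illustrative)"):
    """Format the settings as a namelist block."""
    body = ",\n".join(f"  {key} = {value}" for key, value in settings.items())
    return f"{title}\n &cntrl\n{body},\n /\n"

settings = production_settings()
print(to_mdin(settings))
```

With a 2 fs time step, each 20 ns replica corresponds to ten million integration steps.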

### Quantum Mechanical/Molecular Mechanical Molecular Dynamics (QM/MM MD) Calculations

The QM/MM MD calculations (Carvalho et al., 2014) were performed using the internal semi-empirical hybrid QM/MM functionality implemented in AMBER18 with periodic boundary conditions. The QM region was described by the PM6 semi-empirical method (Stewart, 2007; Jindal and Warshel, 2016) and the MM region by the Amber parm99SB force field (Hornak et al., 2006). The PM6 Potentials of Mean Force (PMFs) were later corrected with geometry optimizations of the high-level layer (QM) models at the B3LYP/6-31G(d) (Ashvar et al., 1996) and ωB97X-D/6-31G(d) (Chai and Head-Gordon, 2008) levels, according to Carvalho et al. (2017) and Bowman et al. (2008). Electrostatic embedding (Bakowies and Thiel, 1996) was also employed and the boundary was treated with the link atom approach. Long-range electrostatic interactions were described by an adapted implementation of the Particle Mesh Ewald (PME) method for QM/MM (Nam et al., 2005).

The QM region in the reactant complex for CaLB included the **MEG** molecule (during the study of **Co-P** production) or the **PEG** molecule (during the study of **Co-3P** production), the S105 residue, the side chains of H224 and D187, the amide groups of Q106 and T40, as well as the side chain of T40. For AfEST, besides the **MEG**/**PEG** molecules and the S160 residue, the QM region also included the side chains of H285 and D255 and the amide groups of G88, G89, and A161. The initial structure was the **INT-2**, which was obtained using a procedure similar to that of Escorcia et al. (2017). The reaction coordinate for both enzymes was the distance between the proton of the histidine and the oxygen of the leaving alcohol. The coordinate was scanned in 0.1 Å increments using the umbrella sampling method, except near the transition states, where 0.01 Å intervals were applied. The PMFs were computed using the Weighted Histogram Analysis Method (WHAM) (Grossfield, 2018). The total number of atoms in the high-level layer (QM region) in our initial structure (**INT-2**) was: 77 for CaLB during **Co-P** synthesis and 91 during **Co-3P** synthesis; 67 for AfEST during **Co-P** synthesis and 81 during **Co-3P** synthesis.
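The window spacing described above can be sketched as a small helper that combines a coarse 0.1 Å grid with a fine 0.01 Å grid around a guessed transition state. The scan range, the transition-state location and the width of the refined region used below are hypothetical, since they are not reported here.

```python
def window_centers(start, stop, ts_guess, coarse=0.10, fine=0.01, ts_halfwidth=0.05):
    """Umbrella window centers (Å): coarse grid plus a fine grid near the TS."""
    def to_ticks(value):
        # Work in integer multiples of the fine spacing to avoid float drift.
        return int(round(value / fine))

    lo, hi = to_ticks(start), to_ticks(stop)
    ticks = set(range(lo, hi + 1, to_ticks(coarse)))
    ts_lo = max(lo, to_ticks(ts_guess - ts_halfwidth))
    ts_hi = min(hi, to_ticks(ts_guess + ts_halfwidth))
    ticks.update(range(ts_lo, ts_hi + 1))
    return [round(t * fine, 2) for t in sorted(ticks)]

# Hypothetical scan of the H(His)...O(alcohol) distance from 1.0 to 2.0 Å,
# with the transition state guessed near 1.3 Å.
centers = window_centers(1.0, 2.0, ts_guess=1.3)
print(len(centers), centers)
```

Each center would then define one restrained simulation, and the resulting histograms of the reaction coordinate are the input to WHAM.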

## RESULTS

The catalytic cycle of CaLB and AfEST toward the synthesis of PCL through ROP of ε**-Cl** was previously studied (Ma et al., 2009; Elsässer et al., 2013; Ren et al., 2016; Zhao, 2018; Almeida et al., 2019; Pellis et al., 2019) (unpublished data). Both enzymes are able to produce PCL polymers, as already described, and the ability to produce co-polymers of PCL-PEG was outlined in some experimental works (He et al., 2003; Huang et al., 2015) and is further explored here via in silico methods. These enzymes have the same catalytic triad, composed of Ser-His-Asp residues (S105-H224-D187 for CaLB and S160-H285-D255 for AfEST), and a hydrogen bond donor region called the oxyanion hole, which stabilizes the negative charge developed during the cycle in the tetrahedral intermediate structure. In these enzymes the histidine residues act as an acid/base (transferring protons between the catalytic serine and the substrate) and are stabilized by the aspartate residue (Brady et al., 1990; Bezborodov and Zagustina, 2014; Douka et al., 2018). The stabilization of the protonated histidine by the aspartate is well-documented (Kobayashi, 2010; Douka et al., 2018). The oxyanion hole region for AfEST contains the backbone amides of G88, G89, and A161 as hydrogen bond donors (De Simone et al., 2001), whereas in CaLB the hydrogen bond donors are the backbone amides of T40 and Q106 and the side-chain hydroxyl group of T40 (Raza et al., 2001).

FIGURE 1 | Catalytic mechanism for the formation of the Co-P and Co-3P products. Above, acylation step (ring-opening of ε-Cl); Below, deacylation step.

The catalytic cycle for the synthesis of PCL co-polymers by CaLB and AfEST includes the acylation and deacylation steps. The first one (**Figure 1**) is the nucleophilic attack by the catalytic serine residue on a molecule of ε**-Cl**. This attack leads to the formation of the first tetrahedral intermediate (**INT-1**) structure, followed by ring-opening of **INT-1**, resulting in the **EAM** structure. The deacylation steps (**Figure 1**) comprise a nucleophilic attack on the **EAM** structure by the terminal alcohol function of the initiator (**MEG** when the expected product is **Co-P** and **PEG** when the expected product is **Co-3P**). The result of this attack is the formation of the second tetrahedral intermediate (**INT-2**) structure that, after product release (**Co-P** or **Co-3P**), yields the product complex (**PC**), and the free enzyme is regenerated.

The active site of AfEST is located at the interface of the α/β hydrolase fold with the cap domain, which shields the active site. There is one entrance channel to the active site and two pockets (a large and a medium one, with the latter more buried within the protein) (De Simone et al., 2001). In CaLB there is no cap, just two helices (α10 and α5) that line the active site. The α5 helix was proposed to act as a putative lid (Skjøt et al., 2009; Stauch et al., 2015). The CaLB enzyme pockets have been extensively described in the literature: one binds the acyl moiety of the ester (residues A141, L144, V149, D134, T138, and Q157) and the other binds the alcohol function (residues W104, L278, A281, A282, and I285) (Wu et al., 2013). The pockets of the two enzymes are oriented differently (De Simone et al., 2001; Stauch et al., 2015). In AfEST the large pocket has a more hydrophobic nature. Also, since there is just one entrance channel and the events are sequential (the **EAM** structure is first formed and then reacts with the initiator), the alcohol function must be located in the larger pocket.

We and others have studied the enzyme acylation step in CaLB. The rate-limiting step for these enzymatic ROP reactions is usually the formation of the **EAM** structure (Kobayashi, 2010; Huang et al., 2015), except with bulky or crowded initiators (Panova and Kaplan, 2003). The acylation step in CaLB requires around 10.0 kcal/mol (Elsässer et al., 2013), but for AfEST this barrier is significantly higher (Ma et al., 2009; Li and Li, 2011). As discussed above, the different orientation of the pockets leads to different orientations of the **EAM** structure and hence a different relative position of the attacking alcohol moiety. In the initial **EAM** structures, the incoming alcohol oxygen atom of the initiator (**MEG** or **PEG** molecules) lies 3.28–3.53 Å from the **EAM** carbon atom (**Figures 2**, **3**). The reaction proceeds through a transition state (here called **TS3** because of the preceding acylation steps), with concerted proton transfer from the alcohol moiety to the histidine residue and bond formation between the oxygen and the carbon atoms of the **EAM** structure. In the **INT-2** structure, the histidine is well positioned toward the scissile oxygen bond for product formation (the PMFs are represented in **Figure 4**). In all considered cases, the **TS3** barriers are quite low (3.0 ± 0.1 kcal/mol and 3.3 ± 0.3 kcal/mol for CaLB with **MEG** and **PEG** molecules as the initiator, respectively, and 1.5 ± 0.1 kcal/mol and 0.8 ± 0.1 kcal/mol for AfEST with **MEG** and **PEG** molecules as the initiator, respectively, with the B3LYP correction), and the formation of **INT-2** is always exothermic, significantly more so for AfEST (−3.6 and −5.0 kcal/mol for CaLB; −18.4 and −15.6 kcal/mol for AfEST; B3LYP). The fourth transition state (**TS4**) free energy barriers are, generally, significantly higher than those of **TS3** for both initiators.
With the **MEG** molecule as the initiator, the ΔG<sup>‡</sup> values are 3.0 ± 0.1 kcal/mol and 11.0 ± 0.2 kcal/mol in CaLB and AfEST, respectively (B3LYP). On the other hand, when the **PEG** molecule is the initiator, the **TS4** ΔG<sup>‡</sup> barriers are 8.6 ± 0.1 kcal/mol in CaLB and 9.4 ± 0.2 kcal/mol in AfEST (B3LYP).
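To give a feeling for what these free energy barriers mean kinetically, the sketch below converts a barrier into a rate constant with the Eyring equation of transition state theory. This conversion is not part of the original analysis: the temperature of 298.15 K, a transmission coefficient of 1, and the treatment of each PMF barrier as an elementary step are all assumptions made for illustration. The barrier values are those quoted above (B3LYP-corrected).

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J*s
R_KCAL = 1.987204e-3  # gas constant, kcal/(mol*K)


def eyring_rate(barrier_kcal_mol, temperature=298.15):
    """Eyring rate constant (1/s) for a free energy barrier in kcal/mol.

    Assumes a transmission coefficient of 1 (an assumption of this sketch).
    """
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-barrier_kcal_mol / (R_KCAL * temperature))


# Barriers (kcal/mol) quoted in the text for the MEG initiator.
barriers = {
    "CaLB TS3 (MEG)": 3.0,
    "AfEST TS3 (MEG)": 1.5,
    "AfEST TS4 (MEG)": 11.0,
}

for label, barrier in barriers.items():
    print(f"{label}: {eyring_rate(barrier):.2e} 1/s")
```

Under these assumptions, every 1.36 kcal/mol added to a barrier slows the corresponding step by roughly one order of magnitude at room temperature.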

denotes the free energies calculated with PM6/MM and corrected with DFT methods (B3LYP with empirical dispersion and wB97XD).

### DISCUSSION

We have studied the nucleophilic attack of **PEG** molecules of different sizes on the **EAM** of the CaLB and AfEST enzymes. We found that, despite the obvious differences in pocket size, orientation, and lining residues, both enzymes achieve these chemical steps with similar overall energies (with the exception of **MEG** in CaLB, which had a lower overall barrier), and that these energies are lower than the barriers in the acylation steps (**TS2**, **Figure 1**). In AfEST, the formation of **INT-2** is always more exothermic than in CaLB, independently of the substrate. The difference in energies of the **MEG** CaLB reaction in relation to the other reactions seems to be due to the fact that only in this case is there a hydrogen bond between ethylene glycol and the histidine in the reactant complex (**EAM**).

Detailed characterization of the intermediate structures will allow the identification of key residues in the catalytic cycle, opening the door for protein engineering approaches. Enhanced enzyme variants are a good option for industrial esterification reactions (e.g., polyester synthesis) and to improve the biological compatibility of the polymers.

## DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

## AUTHOR CONTRIBUTIONS

PF conducted the calculations for the enzyme CaLB and BA on enzyme AfEST. AC supervised the research. All authors contributed to the manuscript writing.

### FUNDING

This work was financed by Portuguese national funds via FCT – Fundação para a Ciência e a Tecnologia, under project[s] MIT-Portugal (MIT-EXPL/ISF/0021/2017), the grant IF/01272/2015 and UID/NEU/04539/2019.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Figueiredo, Almeida and Carvalho. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiscale Solutions to Quantitative Systems Biology Models

Nehemiah T. Zewde\*

*Department of Bioengineering, University of California, Riverside, Riverside, CA, United States*

Keywords: mathematical models, systems biology, complement system, molecular dynamics, Brownian dynamics, ordinary differential equations, innate immunity

## INTRODUCTION

Systems biology implements a variety of statistical, computational and mathematical techniques to understand how networks of biological systems work together to achieve a function (Westerhoff and Palsson, 2004; Wolkenhauer, 2014). Systems biology is a multi-scale field, as it has no fixed scale in the context of a biological response or cascade, where an ensemble of proteins, cofactors and small molecules concertedly act to achieve function. This is the case of fundamental pathophysiological networks, such as epidemiological responses with host and pathogens (Hillmer, 2015). Understanding the network of interactions that mediate these systems is of the utmost importance for deciphering the mechanisms associated with multifactorial diseases, as well as to address fundamental biological questions. This knowledge can be used for translational research and application in biomedicine (McGillivray et al., 2018). The multi-scale nature of systems biology calls for a multifaceted description to bridge the system scale at the cellular level to the molecular scale of individual macromolecules.

### Edited by:

*Valentina Tozzini, National Research Council, Italy*

Reviewed by:

*Francesco Cardarelli, Scuola Normale Superiore of Pisa, Italy*

> \*Correspondence: *Nehemiah T. Zewde nzewd001@ucr.edu*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *20 August 2019* Accepted: *14 October 2019* Published: *30 October 2019*

#### Citation:

*Zewde NT (2019) Multiscale Solutions to Quantitative Systems Biology Models. Front. Mol. Biosci. 6:119. doi: 10.3389/fmolb.2019.00119*

Among the important biological cascades responsible for severe diseases, we focus here on the complement system, which is an effector arm of the immune system that eliminates pathogens, helps in maintaining host homeostasis, and forms a bridge between innate and adaptive immunity (Bennett et al., 2017; Reis et al., 2019). Complement is composed of three pathways, known as alternative, classical and lectin, that work in concert to achieve its function (Schatz-Jakobsen et al., 2016a). The complex network of proteins and other macromolecular entities composing the complement system represents an ideal case for building a systems biology workflow that predicts the system's response in immunity against invading pathogens, and how under complement deficiencies this same system mediates different pathologies. Here, we report on the development of predictive systems biology models, which describe the intricate biochemical networks and the crosstalk with other elements of the immune system. We also show how the integration of multiscale modeling techniques can help improve the predictive model, while also providing mechanistic information at the molecular level.

Complement dysfunction is associated with several diseases. Among others, the complement components have been associated with neurodegenerative disorders, including Alzheimer's and Parkinson's diseases, as well as multiple sclerosis (Mastellos et al., 2019). Moreover, mutations of complement proteins have been linked to the etiology of renal diseases (De Vriese et al., 2015; Ricklin et al., 2016), while individuals with complement deficiencies develop severe infections, such as meningitis, bacteremia and pneumonia, caused by microorganisms such as Streptococcus pneumoniae, Neisseria meningitidis, and Staphylococcus aureus (Skattum et al., 2011). Clearly, while proper activation of the complement system is associated with a wide spectrum of beneficial effects, dysfunctional states are associated with severe consequences. Considering that the function of the complement system is regulated by a network of multiple components, whose concerted activity underlies a variety of diseases, accurate models of the interaction network would greatly help therapeutic strategies (Ricklin et al., 2018).

### MATHEMATICAL MODELS OF THE COMPLEMENT SYSTEM

The complexity of the complement system arises from the mechanistic function of numerous proteins and related biochemical reactions within the complement pathways (**Figure 1**). For instance, complement is composed of more than 60 proteins that circulate in plasma or are bound to the cellular membranes of host cells, and that work to mediate different phases (fluid and solid) of immunity (Liszewski et al., 2017). This multi-phasic interaction between complement proteins forms the basis of the intricate biochemical networks and numerous crosstalk with different compartments of the immune system, such as pentraxins (C-reactive protein, serum-amyloid P, and long pentraxin 3) and the coagulation cascade (Amara et al., 2008; Ma and Garred, 2018).

In this complex scenario, mathematical models using ordinary differential equations (ODEs) have emerged as a powerful tool to elucidate the dynamics of the complement system. Indeed, ODEs can be used to generate predictive models of complex biological processes involving metabolic pathways, protein-protein interactions, and tumor growth (Ilea et al., 2012; Dubitzky et al., 2013; Rohrs et al., 2018). By defining a biological network in a quantitative manner, ODE models make it possible to predict the concentrations, kinetics and behavior of the network components, building hypotheses on disease causation, progression and interference, which can be tested experimentally (Enderling and Chaplain, 2014). In line with this, models of the complement system based on ODEs have been designed to mechanistically deconstruct segments of the complement system under homeostasis and infection (Hirayama et al., 1996; Korotaevskiy et al., 2009; Liu et al., 2011; Zewde et al., 2016; Sagar et al., 2017; Lang et al., 2019).

To further these efforts, we recently generated an expanded ODE model that predicts the complement biomarker levels under the states of homeostasis, disease, and drug intervention (Zewde and Morikis, 2018). By using the reaction network in **Figure 1**, we generated a system of ODEs to describe the bi-phasic nature of the complement system: (i) initiation (fluid phase); (ii) amplification and termination (pathogen surface); and (iii) regulation (host cell and fluid phase). The ODE representation is shown below:

$$\frac{dC_i}{dt} = \sum_{j=1}^{x_i} \sigma_{ij} f_j$$

where the variable C<sub>i</sub> represents the concentration of an individual complement protein/complex and x<sub>i</sub> denotes the number of biochemical reactions associated with complement C<sub>i</sub>, indexed by j for the jth reaction. Moreover, σ<sub>ij</sub> denotes the stoichiometric coefficients and f<sub>j</sub> is a function that describes how the concentration C<sub>i</sub> changes with the biochemical reactions of the reactants/products and parameters, within the given timeframe.
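As an illustration of this formulation, the sketch below builds the right-hand side of the ODEs from a stoichiometric matrix and a vector of reaction rate functions, and integrates it with a fourth-order Runge-Kutta scheme. The two-reaction binding network (A + B ⇌ AB), the mass-action rate laws, and the rate constants are hypothetical choices made for this example. They are not taken from the complement model.

```python
# Stoichiometric coefficients sigma[i][j]: species i (rows), reaction j (columns).
# Reaction 0: A + B -> AB (association); reaction 1: AB -> A + B (dissociation).
SIGMA = [
    [-1, +1],  # A
    [-1, +1],  # B
    [+1, -1],  # AB
]


def reaction_rates(conc, k_on, k_off):
    """Mass-action rate functions f_j (hypothetical rate laws)."""
    a, b, ab = conc
    return [k_on * a * b, k_off * ab]


def dcdt(conc, k_on, k_off):
    """dC_i/dt = sum_j sigma_ij * f_j for every species i."""
    f = reaction_rates(conc, k_on, k_off)
    return [sum(s_ij * f_j for s_ij, f_j in zip(row, f)) for row in SIGMA]


def rk4_step(conc, dt, k_on, k_off):
    """Advance the concentrations by one fourth-order Runge-Kutta step."""
    def shifted(slope, h):
        return [c + h * s for c, s in zip(conc, slope)]

    s1 = dcdt(conc, k_on, k_off)
    s2 = dcdt(shifted(s1, dt / 2), k_on, k_off)
    s3 = dcdt(shifted(s2, dt / 2), k_on, k_off)
    s4 = dcdt(shifted(s3, dt), k_on, k_off)
    return [
        c + dt / 6 * (a + 2 * b + 2 * g + d)
        for c, a, b, g, d in zip(conc, s1, s2, s3, s4)
    ]


def simulate(conc0, k_on, k_off, dt, n_steps):
    """Integrate the network from conc0 and return the final concentrations."""
    conc = list(conc0)
    for _ in range(n_steps):
        conc = rk4_step(conc, dt, k_on, k_off)
    return conc


final = simulate([1.0, 1.0, 0.0], k_on=1.0, k_off=0.5, dt=0.01, n_steps=5000)
print(final)
```

Adding a species or a reaction only requires a new row or column in `SIGMA` and a new entry in `reaction_rates`, which is why this representation scales to networks with hundreds of equations.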

Building on this basic concept, we have designed a model of the complement system that incorporates pathological conditions by reducing the regulatory kinetic rate constants and lowering blood plasma concentrations (Zewde and Morikis, 2018). By applying this model, it is possible to perform in silico mutations by perturbing a complement protein and its binding partner and to examine how the perturbation translates into the global dynamics of complement pathway activation and regulation. As a consequence, patient-specific models can be generated when clinical data are available, predicting the effect of a specific mutation within the entire system. For instance, disorders such as C3 glomerulonephritis and dense-deposit disease are associated with a mutation that affects the complement regulatory protein factor H (FH) (Nester and Smith, 2016). This mutation results in low plasma levels of FH and subsequently leads to host cell damage due to under-regulation of the alternative pathway. A patient's measured FH level can be used to reparametrize the starting concentration of FH in the ODE model and, subsequently, to examine how the mutation affects activation and regulation of the alternative pathway (Zewde and Morikis, 2018). The ODE mathematical models can also be used to identify novel therapeutic targets, which can then be validated experimentally to assess their capability to interfere with the complement system. In this respect, one strategy, called "global sensitivity," identifies which set of kinetic parameters is important in the network of the complement system. In parallel, "local sensitivity" analysis can help in pinpointing critical complement components that mediate the output of activation or regulation (examples in Liu et al., 2011; Zewde et al., 2016; Sagar et al., 2017). ODE models are also useful if kinetic data are available for known inhibitors.
Indeed, ODEs can be used to perform comparison studies on how different therapeutic targets perform under disease-based perturbations. In our previous work (Zewde and Morikis, 2018), we incorporated two complement inhibitors, compstatin, a C3 inhibitor (**Figure 1**, magenta circle), and eculizumab, a C5 inhibitor (**Figure 1**, light blue circle), and examined how they regulated a disease state mediated by FH. Our model showed that the two inhibitors performed differently in regulating an over-active complement system (disease state). Compstatin was shown to potently regulate early-stage complement biomarkers, whereas eculizumab over-regulated late-stage biomarkers. From these results, our model indicated the need for patient-tailored therapies depending on how disease-associated mutations manifest in the complement cascade. Altogether, ODE models can be utilized to mechanistically translate convoluted biological reaction networks, can be reparametrized for patient-specific modeling, and can identify novel therapeutic targets under pathological conditions.
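The reparametrization and local sensitivity ideas discussed above can be sketched on a deliberately small toy model. Everything in this example is hypothetical: a single activation product X is generated at rate `k_act` and removed by a regulator R (a stand-in for a regulator such as FH) at rate `k_reg * R * X`. The parameter values and the one-at-a-time, finite-difference definition of the normalized sensitivity are assumptions made for illustration, not the published model or its analysis.

```python
def steady_output(params, dt=0.01, n_steps=3000):
    """Integrate dX/dt = k_act - k_reg * R * X by forward Euler; return final X."""
    x = 0.0
    for _ in range(n_steps):
        x += dt * (params["k_act"] - params["k_reg"] * params["R"] * x)
    return x


def local_sensitivity(model, params, name, rel_step=0.01):
    """Normalized local sensitivity (p / X) * dX/dp by central differences."""
    up = dict(params)
    up[name] = params[name] * (1 + rel_step)
    down = dict(params)
    down[name] = params[name] * (1 - rel_step)
    derivative = (model(up) - model(down)) / (2 * rel_step * params[name])
    return derivative * params[name] / model(params)


# Hypothetical "healthy" parameter set and a "disease" set with half the regulator.
HEALTHY = {"k_act": 1.0, "k_reg": 1.0, "R": 1.0}
DISEASE = dict(HEALTHY, R=0.5)

print("healthy output:", steady_output(HEALTHY))
print("disease output:", steady_output(DISEASE))
for name in HEALTHY:
    print(name, local_sensitivity(steady_output, HEALTHY, name))
```

In this toy model, halving the regulator concentration doubles the activation product, and the sensitivities show that the output responds with equal magnitude, but opposite sign, to activation and regulation parameters. A global analysis would instead vary all parameters simultaneously over their plausible ranges.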

### MULTISCALE SOLUTIONS TO THE CHALLENGES OF ODE MODELS

Building on ODE models that predict how molecular interactions mediate immunity and disease, our group has expanded the ODE approach to model the pathways of the complement system as a whole. In this respect, one of the main challenges is the lack of kinetic parameters, which significantly hinders our modeling efforts. For instance, we are currently building a comprehensive complement model that includes all three pathways (**Figure 1**), immunoglobulins (IgG and IgM) and pentraxins. This system, which comprises 670 differential equations with 328 kinetic parameters, is used to examine the interplay between complement activation and the immune-evasive bacterium Neisseria meningitidis. However, 140 of our kinetic parameters are unknown and estimation of these parameters is challenging, due to the limited availability of experimental data.

propagate to the surface and terminate by the formation of the membrane attack complex (MAC). This figure is adapted from Zewde and Morikis (2018). Structural representation of C3 (blue) with compstatin (cyan) shown in magenta circle (Janssen et al., 2005, 2007). Black circle denotes the surface representation of C5b in firebrick coloring and C6 in yellow (Hadders et al., 2012). Surface representation of C5 (red) and eculizumab (H- and L-chain in green) shown in light blue circle (Schatz-Jakobsen et al., 2016b).

To overcome these challenges, multi-scale approaches can alleviate some of these burdens by performing simulations to predict association rate constants. For example, Brownian dynamics (BD), milestoning and molecular dynamics (MD) can be used to predict the kinetic and conformational requirements of binding (Ermak and McCammon, 1978; Huber and McCammon, 2010; Votapka and Amaro, 2015). MD follows the motions of macromolecules over time by integrating Newton's equations of motion. In contrast, BD simulates a system based on an overdamped Langevin equation of motion, enabling the study of diffusion dynamics and the calculation of association rates for a given process (Ermak and McCammon, 1978). Novel hybrid schemes, such as SEEKR, combine the multiscale approaches of MD, BD, and milestoning to estimate association and dissociation rate constants (Votapka et al., 2017).
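The overdamped Langevin propagation underlying BD can be illustrated with a one-dimensional sketch. It uses an Ermak-McCammon style update without hydrodynamic interactions and with a constant diffusion coefficient. The harmonic test potential, the reduced units, and all parameter values are assumptions chosen for the example, and the code is not drawn from any specific BD package.

```python
import math
import random


def bd_step(x, force, diffusion, kT, dt, rng):
    """One overdamped Langevin step: drift from the force plus a random kick."""
    drift = diffusion / kT * force * dt
    noise = math.sqrt(2.0 * diffusion * dt) * rng.gauss(0.0, 1.0)
    return x + drift + noise


def simulate_harmonic(k_spring, diffusion, kT, dt, n_steps, seed=0):
    """Propagate a particle in the potential U(x) = 0.5 * k_spring * x**2."""
    rng = random.Random(seed)
    x = 0.0
    trajectory = []
    for _ in range(n_steps):
        x = bd_step(x, -k_spring * x, diffusion, kT, dt, rng)
        trajectory.append(x)
    return trajectory


def variance(values):
    """Population variance of a sequence of positions."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Reduced units (hypothetical values): kT = 1, D = 1, spring constant = 1.
traj = simulate_harmonic(k_spring=1.0, diffusion=1.0, kT=1.0, dt=0.01, n_steps=200000)
print("sampled variance:", variance(traj), "expected about kT / k_spring = 1.0")
```

For a harmonic well the sampled positional variance should approach kT divided by the spring constant, which gives a simple check that the drift and noise terms are balanced. In an association-rate calculation, the same update would be applied to the relative motion of two molecules until they either react or escape.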

We have already initiated this bridge between systems biology and multi-scale approaches by performing molecular dynamics and electrostatics studies on the complement complex C5bC6 (**Figure 1**, black circle) (Zewde et al., 2018). Our analysis identified three binding sites and critical salt bridges formed between C5b and C6. Building on this first study, Brownian dynamics simulations will aid in the prediction of kinetic parameters associated with C5bC6 complex formation, which will subsequently be inserted into our ODE model. As a further useful approach, in cases where complete structural data are absent, homology models built with computational tools such as MODELLER (Webb and Sali, 2016) or SWISS-MODEL (Waterhouse et al., 2018) can be used as a supplement. This step can be followed by the use of protein docking tools like HADDOCK (Dominguez et al., 2003) or ClusPro (Kozakov et al., 2017) to generate potential complement complexes. Finally, the top-ranked structures can then be subjected to the multi-scale approaches mentioned above to estimate unknown kinetic parameters.

### SUMMARY AND PERSPECTIVES

Here, we described the current efforts to model the complexity of systems biology by building predictive models based on ODEs. The multi-scale nature of this field, as characterized by a network of proteins, cofactors and small molecules concertedly acting to achieve function, calls for a multiscale description bridging the macromolecular level to the systems level. Here, we described our investigations aimed at modeling the complex biological response of the complement system, which plays a prominent role in host defense, homeostasis, and disease. We showed how ODE models can provide a description of the network of interactions at the system level, while multiscale simulation methods can complement this approach by providing a description at the macromolecular level.

ODE models of the complement system have elucidated key mechanisms of immune system function and regulation. These mathematical models show promise for the investigation of patient-specific diseases and for the identification of therapeutic interventions under pathological conditions. Despite these advantages, modeling efforts are continuously challenged by the lack of kinetic parameters needed to generate and simulate ODE models. A multi-scale approach that harnesses methods such as Brownian and molecular dynamics is promising to address some of these challenges by predicting unknown kinetic parameters to be utilized in quantitative models of the complement system. In addition to multi-scale estimations, high performance computing has made it possible to simulate large biological structures (Casalino et al., 2018; Palermo et al., 2018). This opens scientific avenues in the frontier of modeling entire biochemical networks, including the complement system, thus merging the molecular-level perspective with the system (i.e., cellular) scale.

### AUTHOR CONTRIBUTIONS

NZ designed the study and wrote the manuscript.

### FUNDING

This work was partially supported by NIH grant R01 EY027440.

### ACKNOWLEDGMENTS

I dedicate this article to my late advisor, Prof. Dimitrios Morikis.



**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zewde. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Improved Modeling of Peptide-Protein Binding Through Global Docking and Accelerated Molecular Dynamics Simulations

Jinan Wang<sup>1</sup>, Andrey Alekseenko<sup>2,3</sup>, Dima Kozakov<sup>2,3</sup> and Yinglong Miao<sup>1</sup>\*

*<sup>1</sup> Center for Computational Biology and Department of Molecular Biosciences, University of Kansas, Lawrence, KS, United States, <sup>2</sup> Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, United States, <sup>3</sup> Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States*

### Edited by:

*Alexandre M. J. J. Bonvin, Utrecht University, Netherlands*

### Reviewed by:

*Martin Zacharias, Technical University of Munich, Germany Ilpo Vattulainen, University of Helsinki, Finland*

> \*Correspondence: *Yinglong Miao miao@ku.edu*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *22 July 2019* Accepted: *09 October 2019* Published: *30 October 2019*

#### Citation:

*Wang J, Alekseenko A, Kozakov D and Miao Y (2019) Improved Modeling of Peptide-Protein Binding Through Global Docking and Accelerated Molecular Dynamics Simulations. Front. Mol. Biosci. 6:112. doi: 10.3389/fmolb.2019.00112*

Peptides mediate up to 40% of known protein-protein interactions in higher eukaryotes and play a key role in cellular signaling, protein trafficking, immunology, and oncology. However, it is challenging to predict peptide-protein binding with conventional computational modeling approaches, due to slow dynamics and high peptide flexibility. Here, we present a prototype of the approach which combines global peptide docking using *ClusPro PeptiDock* and all-atom enhanced simulations using Gaussian accelerated molecular dynamics (GaMD). For three distinct model peptides, the lowest backbone root-mean-square deviations (RMSDs) of their bound conformations relative to X-ray structures obtained from *PeptiDock* were 3.3–4.8 Å, being medium quality predictions according to the Critical Assessment of PRediction of Interactions (CAPRI) criteria. *GaMD* simulations refined the peptide-protein complex structures with significantly reduced peptide backbone RMSDs of 0.6–2.7 Å, yielding two high quality (sub-angstrom) and one medium quality models. Furthermore, the *GaMD* simulations identified important low-energy conformational states and revealed the mechanism of peptide binding to the target proteins. Therefore, *PeptiDock*+*GaMD* is a promising approach for exploring peptide-protein interactions.

Keywords: peptide-protein binding, peptide docking, PeptiDock, gaussian accelerated molecular dynamics (GaMD), peptide flexibility

### INTRODUCTION

Peptides mediate up to 40% of known protein-protein interactions in higher eukaryotes. Peptide binding plays a key role in cellular signaling, protein trafficking, immune response, and oncology (Petsalaki and Russell, 2008; Das et al., 2013). In addition, peptides have served as promising drug candidates with high specificity and relatively low toxicity (Ahrens et al., 2012; Fosgerau and Hoffmann, 2015; Kahler et al., 2018; Lee et al., 2019). The number of peptide-based drugs being marketed has been increasing in recent years (Ahrens et al., 2012; Fosgerau and Hoffmann, 2015; Kahler et al., 2018; Lee et al., 2019). Therefore, understanding the molecular mechanism of peptide-protein interactions is important in both basic biology and applied medical research.

Rational design of peptide-derived drugs usually requires structural characterization of the peptide-protein complexes. X-ray crystallography and nuclear magnetic resonance (NMR) have been utilized to determine high-resolution structures of peptide-protein complexes. These structures are often deposited into the Protein Data Bank (PDB) and also collected in specific databases focused on peptide-protein complex structures, including the PeptiDB (London et al., 2010), PepX (Vanhee et al., 2010), and PepBind (Das et al., 2013). Particularly, PeptiDB is a set of 103 non-redundant protein-peptide structures extracted from the PDB. The peptides are mostly 5–15 residues long (London et al., 2010). PepX contains 1,431 non-redundant X-ray structures clustered based on the binding interfaces and backbone variations. There are 505 unique peptide-protein interfaces, including those for the major histocompatibility complex (MHC) (14%), thrombins (12%), α-ligand binding domains (8%), protein kinase A (5%), proteases and SH3 domains (Vanhee et al., 2010). The PepBind contains a comprehensive dataset of 3,100 available peptide-protein structures from the PDB, irrespective of the structure determination methods and similarity in their protein backbone. More than 40% of the structures in PepBind are involved in cell regulatory pathways, nearly 20% in the immune system and ∼30% with protease or other hydrolase activities (Das et al., 2013). These databases have greatly facilitated structure-based modeling and drug design of peptide-protein interactions. However, the number of currently resolved structures is only a small fraction of the peptide-protein complexes, as limited by the difficulties and high cost of X-ray and NMR experiments.

Computational methods have been developed for predicting the peptide-protein complex structures. In this regard, modeling of peptide binding to proteins has been shown to be distinct from that of extensively studied protein-ligand binding and protein-protein interactions. Notably, small-molecule ligands are able to bind deeply buried sites in proteins, but peptides normally bind to the protein surface, especially in the largest pockets. On the other hand, protein partners usually have well-defined 3D structures before forming protein-protein complexes, despite possible conformational changes during association. In contrast, most peptides do not have stable structures before forming complexes with proteins (Petsalaki and Russell, 2008). The biggest and most immediate challenge for modeling of peptide-protein binding is that peptide structures are not known a priori. Furthermore, peptide-mediated interactions are often transient. The affinity of peptide-protein interactions is typically weaker than that of protein-protein interactions, because of the smaller interface between peptides and their protein partners. Therefore, new and robust computational approaches need to be developed to address the above challenges in the modeling of peptide-protein binding.

Molecular docking has proven useful in predictions of peptide-protein complex conformations (Ciemny et al., 2018). The commonly used approaches include template-based docking such as GalaxyPepDock (Lee et al., 2015), local docking of peptides to pre-defined binding sites such as Rosetta FlexPepDock (Raveh et al., 2011), HADDOCK (Trellet et al., 2013), and MDockPep (Xu et al., 2018), and global docking of free peptide binding to proteins such as CABS-dock (Kurcinski et al., 2015), PIPER-FlexPepDock (Alam et al., 2017), and PeptiDock (Porter et al., 2017). Template-based docking is highly efficient, but often limited by the availability of templates (Lee et al., 2015). Local docking is able to generate good quality models that meet the Critical Assessment of PRediction of Interactions (CAPRI) criteria (Janin et al., 2003). However, it requires a priori knowledge of the peptide binding site on the protein surface. In comparison, global peptide docking provides sampling of peptide binding over the entire protein surface without the need for predefined binding sites, but it is challenging to account for the system flexibility. In this regard, ClusPro PeptiDock has been developed for docking of motifs (short sequences) of peptides, which are found to sample only a small ensemble of different conformations (Alam et al., 2017). A structural ensemble of a peptide motif is built by retrieving motif structures from the PDB that are very similar to the peptide's bound conformation. Fast Fourier Transform (FFT) based docking is then used to quickly perform global rigid-body docking of these fragments to the protein. PeptiDock is thus able to alleviate the peptide flexibility problem through ensemble docking of the peptide motifs. Nevertheless, it remains challenging to account for the high flexibility of the peptides. Overall, peptide docking often generates poor predictions that require further refinement to obtain CAPRI-quality models.

Molecular dynamics (MD) is a powerful technique that enables all-atom simulations of biomolecules. MD simulations are able to fully account for the flexibility of peptides and proteins during their binding (Knapp et al., 2015; Wan et al., 2015; Salmaso et al., 2017; Yadahalli et al., 2017; Kahler et al., 2018). MD has been used to refine binding poses of peptides in proteins in the pepATTRACT (De Vries et al., 2017) and AnchorDock (Ben-Shimon and Niv, 2015) docking protocols. However, it is challenging to sufficiently sample peptide-protein interactions through conventional MD (cMD) simulations, due to the slow dynamics and limited simulation timescales. Computational approaches that combine many cMD simulations provide improved sampling of peptide-protein interactions, including supervised MD (Salmaso et al., 2017) and weighted ensemble (Zwier et al., 2016). Notably, weighted ensemble simulations totaling ∼120 µs of MD were performed to investigate binding of an intrinsically disordered p53 peptide to the MDM2 protein (Zwier et al., 2016). The simulation-predicted binding rate constant agrees very well with experiment. However, expensive computational resources would be needed for applications of cMD simulations in large-scale predictions of peptide-protein complex structures.

On the other hand, enhanced sampling MD methods have been developed to improve biomolecular simulations (Christen and Van Gunsteren, 2008; Gao et al., 2008; Liwo et al., 2008; Dellago and Bolhuis, 2009; Abrams and Bussi, 2014; Spiwok et al., 2015; Miao and Mccammon, 2016). Multi-ensemble Markov models (Paul et al., 2017), which combine cMD with Hamiltonian replica exchange enhanced sampling simulations, have been used to characterize peptide-protein binding and calculate kinetic rates of a nanomolar peptide inhibitor, PMI, to the MDM2 oncoprotein fragment (Paul et al., 2017). While cMD is able to simulate fast events such as peptide binding, enhanced sampling simulations can capture rare events such as peptide unbinding. Steered MD (Cuendet et al., 2011), temperature-accelerated MD (Lamothe and Malliavin, 2018) and MELD (Modeling by Employing Limited Data) using temperature and Hamiltonian replica exchange MD (Morrone et al., 2017) have also been applied to study peptide-protein binding. In comparison, more enhanced sampling methods have been applied in studies of protein-ligand binding and protein-protein interactions, including umbrella sampling (Torrie and Valleau, 1977; Kastner, 2011; Rose et al., 2014), metadynamics (Laio and Parrinello, 2002; Alessandro and Francesco, 2008; Saleh et al., 2017a,b,c), adaptive biasing force (Darve and Pohorille, 2001; Darve et al., 2008), steered MD (Cuendet and Michielin, 2008; Gonzalez et al., 2011), replica exchange MD (Sugita and Okamoto, 1999; Okamoto, 2004), accelerated MD (aMD) (Hamelberg et al., 2004; Miao et al., 2015), and Gaussian accelerated MD (GaMD) (Miao et al., 2015; Miao and Mccammon, 2017, 2018; Pang et al., 2017). Overall, enhanced sampling simulations of peptide binding to proteins have been underexplored. Peptide-protein binding shows distinct characteristics as described above and requires the development of improved enhanced sampling approaches.

Here, we present a prototype of a novel computational approach that combines global peptide docking using PeptiDock and all-atom enhanced sampling simulations using GaMD to model peptide-protein binding. Three model peptides have been selected from the PeptiDB database of non-redundant peptide-protein complex structures (London et al., 2010). They include peptide motifs "PAMPAR" (Peptide 1), "TIYAQV" (Peptide 2) and "RRRHPS" (Peptide 3), which bind to the SH3 domain, X-linked lymphoproliferative syndrome (XLP) protein SAP and human PIM1 kinase, respectively. Starting with the lowest-RMSD conformation selected from the top 10 models of PeptiDock, GaMD significantly refines the peptide-protein complex structures. Furthermore, the simulations provided important insights into the mechanism of peptide binding to target proteins at an atomistic level. Thus, PeptiDock+GaMD is a promising approach for exploring peptide-protein interactions.

### METHODS

### A Computational Approach Combining PeptiDock and GaMD

A new computational approach was designed to predict peptide-protein complex structures by combining peptide docking with PeptiDock and all-atom enhanced sampling simulation with GaMD (**Figure S1**). Initial peptide-protein complex structures were obtained using the ClusPro PeptiDock server. The first step in the PeptiDock protocol is fragment search: the PDB database is searched for fragments containing the target peptide motif. The templates are clustered and an FFT-based rigid docking is applied to the cluster centroids. Top-scoring poses are clustered again and the centroids of the largest clusters are chosen as the final results (Porter et al., 2017). For the purpose of this study, namely to show the viability of the protocol, only one pose within the top 10 models of PeptiDock, known to be near native, was selected for further refinement using GaMD simulations.
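
The FFT-based rigid docking step rests on the correlation theorem: the overlap score for every translation of the ligand grid can be obtained from a few Fourier transforms rather than from an explicit loop over all shifts. The following is a minimal one-dimensional sketch of that idea, not the ClusPro implementation. The toy grids, the function names and the pure-Python radix-2 FFT are illustrative assumptions; real docking codes work on 3D grids with several energy terms and also sample rotations.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("grid length must be a power of two")
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def correlation_scores(receptor, ligand):
    """score[t] = sum_i receptor[(i + t) % n] * ligand[i] for every cyclic shift t."""
    n = len(receptor)
    rec_f = fft([complex(v) for v in receptor])
    lig_f = fft([complex(v) for v in ligand])
    # Correlation theorem: multiply by the complex conjugate, then transform back.
    product = [r * l.conjugate() for r, l in zip(rec_f, lig_f)]
    return [v.real / n for v in fft(product, inverse=True)]

# Hypothetical toy grids: 1 marks favorable receptor cells / occupied ligand cells.
receptor = [0, 0, 0, 1, 1, 0, 0, 0]
ligand = [1, 1, 0, 0, 0, 0, 0, 0]
scores = correlation_scores(receptor, ligand)
best_shift = max(range(len(scores)), key=scores.__getitem__)
```

Here `best_shift` is the translation that places the two occupied ligand cells onto the two favorable receptor cells; in a docking run the same scan would be repeated for each sampled rotation.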

## System Setup

Three model peptides were selected from the PeptiDB database of non-redundant peptide-protein complex structures (London et al., 2010). They included peptide motifs "PAMPAR" (Peptide 1), "TIYAQV" (Peptide 2) and "RRRHPS" (Peptide 3), which bind to the SH3 domain, XLP protein and human PIM1 kinase, respectively. The free X-ray structures of the target proteins are 1OOT, 1D1Z and 2J2I, respectively. The corresponding bound structures are 1SSH, 1D4T (Poy et al., 1999) and 2C3I (Pogacic et al., 2007), respectively. The free X-ray structures of the target proteins were used in the peptide docking and GaMD simulations. Both capped/neutral and uncapped/zwitterion terminus models were investigated in the GaMD simulations. In the neutral terminus model, the N- and C-termini were capped with ACE and NHE, respectively.

### Peptide Docking

The standard ClusPro PeptiDock protocol was used for all three systems. In the first step, receptor structures were specified: 1OOT chain A (Peptide 1), 1D1Z chain A (Peptide 2) and residues 125-305 of 2J2I chain B (Peptide 3). The next step was specifying motifs, i.e., the templates for searching fragments in the PDB database. The motif was specified as a subsequence of the peptide with one or more wildcard symbols. Wildcards could be of two forms: "X," denoting any amino acid substitution, and "[...]," denoting substitution by any amino acid from the list. For example, "[FY]" means that either Phe or Tyr can take this place. It is recommended to adjust the motif to yield between 100 and 1,000 hits, while preserving the essential features for binding. For the studied systems, the following motifs were used for fragment search: "PXMPXR" for Peptide 1 [107 hits, see Ref. Hou et al., 2012], "TI[YF]XX[VI]" for Peptide 2 [686 hits, see Ref. Poy et al., 1999] and "RXRHXS" for Peptide 3 [198 hits, see Ref. Bullock et al., 2005]. Since the PDB contains bound structures of the studied systems, a number of PDB entries were explicitly excluded from the template search, as listed in **Table S4**. The next steps were performed automatically by the server (Porter et al., 2017), being the same for all systems. The extracted fragments were changed to the target peptide sequence using a backbone-dependent rotamer library (Dunbrack and Karplus, 1993). The extracted fragments (hits) were clustered using the greedy algorithm according to their pairwise root-mean-square deviation (RMSD), with a 0.5 Å cluster radius. The centroids of the top 25 clusters were docked to the receptor using rigid-body FFT docking (Kozakov et al., 2006), exhaustively sampling all possible mutual orientations of the receptor and ligand, and ranking them using a special scoring function with a mixture of physics-based and knowledge-based terms (Kozakov et al., 2006; Chuang et al., 2008). 
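
The motif syntax described above maps naturally onto regular expressions. The sketch below translates a motif into a Python pattern and scans a few sequences; it is only an illustration of the wildcard semantics, not the server's fragment search. The helper names and the toy sequences are hypothetical.

```python
import re

ANY_RESIDUE = "[ACDEFGHIKLMNPQRSTVWY]"  # the 20 standard amino acids

def motif_to_regex(motif):
    """Translate a motif such as "TI[YF]XX[VI]" into a compiled regex.

    "X" outside brackets matches any standard residue; a "[...]" list is
    already valid character-class syntax and is copied through unchanged.
    """
    parts = []
    in_list = False
    for ch in motif:
        if ch == "[":
            in_list = True
        elif ch == "]":
            in_list = False
        parts.append(ANY_RESIDUE if (ch == "X" and not in_list) else ch)
    return re.compile("".join(parts))

def find_hits(motif, sequences):
    """Return {name: start_index} for sequences containing the motif."""
    pattern = motif_to_regex(motif)
    hits = {}
    for name, seq in sequences.items():
        match = pattern.search(seq)
        if match:
            hits[name] = match.start()
    return hits

# Hypothetical toy sequences, not real PDB entries.
toy_sequences = {
    "hit_a": "GSTIYAQVQK",
    "hit_b": "AATIFEEIPP",
    "miss": "GSTIWAQVQK",
}
hits = find_hits("TI[YF]XX[VI]", toy_sequences)
```

Loosening or tightening the bracketed lists is the knob that moves the hit count toward the recommended range of 100 to 1,000.
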
The top-scoring poses of each fragment were pooled together and clustered based on their pairwise RMSDs, with a 3.5 Å cluster radius. The clusters were ranked according to their sizes (Kozakov et al., 2005). The centroids of the ten largest clusters were subjected to energy minimization with a CHARMM19-based force field using the ABNR algorithm. To demonstrate the protocol, only the lowest-RMSD conformation among the top 10 PeptiDock models of each peptide was selected for refinement using GaMD simulations. The ranks of the docking poses with the lowest peptide backbone RMSDs were 9, 5, and 10 for Peptides 1, 2 and 3, respectively. It is important to note that each of the top-10 docking models will be refined and scored in a full version of the protocol in further studies.
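
The radius-based clustering used for both the fragments (0.5 Å) and the docked poses (3.5 Å) can be sketched as follows. This is one common greedy scheme, assumed here for illustration: repeatedly pick the item with the most neighbors within the radius, make it a cluster center, and remove its members. The server's exact implementation may differ, and the one-dimensional "poses" are a hypothetical stand-in for a pairwise RMSD matrix.

```python
def greedy_cluster(dist, radius):
    """Greedy clustering on a symmetric pairwise-distance matrix.

    Returns a list of (center_index, member_indices), largest cluster first.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbors of i that have not yet been assigned (i itself included).
            members = [j for j in sorted(remaining) if dist[i][j] <= radius]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters

# Hypothetical toy data: positions on a line, distance = absolute difference.
positions = [0.0, 0.3, 0.4, 5.0, 5.2, 9.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_clusters = greedy_cluster(toy_dist, radius=0.5)
```

Because the most populated neighborhood is always extracted first, the output order already corresponds to ranking clusters by size.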

### GaMD Enhanced Sampling Simulations

GaMD was applied to refine the peptide-protein complex structures. Complexes were solvated in explicit water using tleap from the AMBER 18 package (Case et al., 2018). Na<sup>+</sup> and Cl<sup>−</sup> ions were added to neutralize the system charge. The AMBER ff14SB force field parameters (Maier et al., 2015) and the TIP3P model (Jorgensen et al., 1983) were used for the proteins/peptides and water molecules, respectively. Each system was minimized using steepest descent for 50,000 steps and conjugate gradient for another 50,000 steps. After minimization, the system was heated from 0 to 310 K in a 1 ns simulation by applying 1 kcal/(mol·Å<sup>2</sup>) harmonic position restraints to the protein and peptide heavy atoms in a constant number, volume and temperature (NVT) ensemble. Each system was further equilibrated in a constant number, pressure and temperature (NPT) ensemble at 1 atm and 310 K for 1 ns with the same restraints as in the NVT run. Another 2 ns of cMD simulation was performed to collect potential energy statistics (including the maximum, minimum, average, and standard deviation). Then, 18 ns of GaMD equilibration was performed after applying the boost potential. Finally, four independent 300 ns GaMD production simulations with randomized initial atomic velocities were performed on each peptide system. Simulation frames were saved every 0.2 ps for analysis. Snapshots of all four GaMD production simulations (1,200 ns in total) were combined for clustering to identify peptide binding conformations, for which the hierarchical agglomerative algorithm in CPPTRAJ (Roe and Cheatham, 2013) was applied. The cutoff was set to 3.5 Å for the peptide backbone RMSD to form a cluster. The PyReweighting toolkit (Miao et al., 2014) was applied to reweight the four GaMD simulations combined and recover the original free energy or potential of mean force (PMF) profiles of each peptide-protein system. The RMSDs of the peptide and protein backbone were used as reaction coordinates. 
Detailed descriptions of GaMD theory and energetic reweighting are provided in the **Supplementary Material**.
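
To make the role of the potential energy statistics concrete, the sketch below derives the harmonic boost parameters from the maximum, minimum, average and standard deviation of the potential energy, following the published GaMD formulation (Miao et al., 2015). It assumes the lower-bound threshold E = V_max; which threshold mode and which σ0 limit were used in this study is not stated here, so the values below are illustrative, as are the function names and the toy energies.

```python
import statistics

def gamd_boost_parameters(energies, sigma0):
    """Return (E, k) for GaMD with the threshold set to its lower bound E = V_max.

    k0 = min(1, (sigma0 / sigma_V) * (V_max - V_min) / (V_max - V_avg))
    k  = k0 / (V_max - V_min)
    """
    v_max = max(energies)
    v_min = min(energies)
    v_avg = statistics.mean(energies)
    sigma_v = statistics.pstdev(energies)
    k0 = min(1.0, (sigma0 / sigma_v) * (v_max - v_min) / (v_max - v_avg))
    return v_max, k0 / (v_max - v_min)

def boost_potential(v, threshold, k):
    """Harmonic boost dV = 0.5 * k * (E - V)^2, applied only when V < E."""
    return 0.5 * k * (threshold - v) ** 2 if v < threshold else 0.0

# Hypothetical potential energies (kcal/mol) standing in for the cMD statistics.
toy_energies = [-100.0, -98.0, -96.0, -102.0, -104.0]
threshold, k = gamd_boost_parameters(toy_energies, sigma0=6.0)  # sigma0 is illustrative
boost_at_average = boost_potential(-100.0, threshold, k)
```

Capping k0 at 1 keeps the boosted potential monotonic in the original one, so low-energy states are lifted more than high-energy ones without reordering them.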

### RESULTS

### Prediction of Peptide Binding Conformations Through Docking and GaMD Simulations

There were no significant conformational changes in the protein during binding of Peptides 1 and 3 (**Figures 1A,C**). In comparison, binding of Peptide 2 induced a large structural rearrangement of the loop involving residues 67–74 in the protein (**Figure 1B**). In addition, Peptide 3 is highly charged, as its first three N-terminal residues are all arginine. These features of Peptides 2 and 3 increased the difficulty of accurately predicting their peptide-protein complex structures. Peptide docking with PeptiDock showed different levels of accuracy: RMSDs of the peptide backbone compared with the bound X-ray structures were 3.3, 3.5, and 4.8 Å for the three peptides, respectively (**Figures 1A–C** and **Table 1**). The first two were acceptable-quality predictions according to the CAPRI peptide docking criteria (Janin et al., 2003), and the third one was slightly above the acceptability cutoff. It should be noted that our flexible protein-peptide docking protocol PIPER-FlexPepDock (Alam et al., 2017) mentioned above is successful in obtaining a high-quality model only in the case of Peptide 1, whereas the other two cases are challenging due to either significant receptor flexibility (Peptide 2) or remoteness of the rigid-body docking poses from the native conformation (Peptide 3).

Next, GaMD simulations were performed to refine the docking models. Analysis of simulation trajectories showed that the GaMD simulations were able to effectively refine the peptide binding pose. For Peptides 1 and 2, RMSDs of the peptide backbone relative to the X-ray structures decreased to <1 Å during the GaMD simulations (**Figures 2A,B**). Peptide 1 bound tightly to the protein target site throughout the four GaMD simulations. Peptide 2 reached the native conformation within ∼10, ∼90, ∼120, and ∼170 ns in the four GaMD simulations and stayed tightly bound during the remainder of the simulations. In comparison, Peptide 3 exhibited higher fluctuations and sampled the near-native conformation transiently during the GaMD simulations (**Figure 2C**). Nevertheless, the minimum RMSDs of peptide backbone compared with X-ray structures were identified to be 0.20, 0.22, and 0.73 Å for the three peptides, respectively (**Figures 2A–C**).
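
The quantity tracked throughout this analysis is the peptide backbone RMSD relative to the bound X-ray structure. A minimal sketch of that calculation is given below, assuming the model and reference have already been superposed (for example on the receptor) and list the same backbone atoms in the same order; the superposition step itself is not shown. The function names and the toy numbers are hypothetical.

```python
import math

def backbone_rmsd(model, reference):
    """RMSD (same units as the input) between two pre-superposed coordinate lists."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))

def first_time_below(rmsd_series, cutoff, dt_ns):
    """Time (ns) of the first frame whose RMSD drops below cutoff, or None."""
    for index, value in enumerate(rmsd_series):
        if value < cutoff:
            return index * dt_ns
    return None

# Hypothetical toy values: two atoms displaced by 1 Å along z.
toy_rmsd = backbone_rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
toy_time = first_time_below([3.3, 2.5, 0.8, 0.9], cutoff=1.0, dt_ns=10.0)
```

Applied to an RMSD time series from one trajectory, `first_time_below` gives the kind of "reached the native conformation within N ns" estimate quoted above.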

Furthermore, GaMD simulation snapshots of the peptide conformations were clustered using the backbone RMSDs relative to the X-ray structures. This procedure was similar to the analysis of the peptide docking poses. The 10 top-ranked clusters of peptide conformations with the lowest free energies were obtained. The 1st top-ranked cluster exhibited peptide backbone RMSDs of 0.94 and 0.61 Å for Peptides 1 and 2, respectively (**Figures 1D,E** and **Table 1**). For Peptide 3, the 3rd top-ranked cluster showed the smallest peptide backbone RMSD of 2.72 Å (**Figure 1F** and **Table 1**). According to the CAPRI criteria (Janin et al., 2003), the structural predictions were of high quality (sub-angstrom) for Peptides 1 and 2 and of medium quality for Peptide 3. Therefore, GaMD simulations significantly refined the docking conformations of the three peptide-protein complex structures. The simulation-predicted bound conformations of the peptides were in excellent agreement with the experimental X-ray structures, with peptide backbone RMSDs of 0.6–2.7 Å. In comparison, docking poses of the three peptides obtained from PeptiDock showed RMSDs of 3.3–4.8 Å (**Table 1**).

FIGURE 1 | Docking poses (red) of three peptide motifs obtained using *PeptiDock* are compared with X-ray structures (green): (A) Peptide 1 "PAMPAR", (B) Peptide 2 "TIYAQV," and (C) Peptide 3 "RRRHPS"; Binding poses (red) of three model peptides obtained using the "*PeptiDock*+*GaMD*" are compared with X-ray structures (green): (D) Peptide 1, (E) Peptide 2, and (F) Peptide 3.



*<sup>a</sup>Only nine clusters were obtained for Peptide 1 from the GaMD trajectories and thus there were no RMSD or PMF values (–) for cluster 10.*

### Peptide Binding Mechanism Revealed From GaMD

Free energy profiles were calculated from the GaMD simulations using the protein and peptide backbone RMSDs relative to the bound X-ray structures as reaction coordinates. For Peptide 1, only one low-energy minimum was identified near the native bound state (**Figure 2D**). This was consistent with the clustering result that the peptide backbone RMSD of the 1st top-ranked cluster was only 0.9 Å.
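
The reweighting behind such profiles can be illustrated in one dimension. The sketch below bins frames along a reaction coordinate and corrects each bin's biased population with a second-order cumulant expansion of the boost potential, ln⟨exp(βΔV)⟩ ≈ β⟨ΔV⟩ + β²σ²/2, which is one of the schemes available for GaMD reweighting. Whether exactly this scheme and these settings were used here is an assumption; the function name, bin width and toy data are hypothetical.

```python
import math
from collections import defaultdict

KB = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def reweighted_pmf(coords, boosts, bin_width, temperature=310.0):
    """1D PMF (kcal/mol, minimum shifted to zero) keyed by bin lower edge."""
    beta = 1.0 / (KB * temperature)
    bins = defaultdict(list)
    for a, dv in zip(coords, boosts):
        bins[math.floor(a / bin_width)].append(dv)
    pmf = {}
    for b, dvs in bins.items():
        p_biased = len(dvs) / len(coords)
        mean = sum(dvs) / len(dvs)
        var = sum((d - mean) ** 2 for d in dvs) / len(dvs)
        # Second-order cumulant approximation of ln <exp(beta * dV)> in this bin.
        log_weight = beta * mean + 0.5 * beta ** 2 * var
        pmf[b * bin_width] = -(math.log(p_biased) + log_weight) / beta
    f_min = min(pmf.values())
    return {edge: f - f_min for edge, f in sorted(pmf.items())}

# Hypothetical toy data: RMSD values (Å) and boost potentials (kcal/mol) per frame.
toy_pmf = reweighted_pmf([0.2, 0.3, 0.4, 1.2], [2.0, 2.0, 2.0, 0.0], bin_width=1.0)
```

Frames that received a larger boost count for more after reweighting, so a bin that looks only moderately populated in the biased trajectory can turn out to be the true free energy minimum.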

For Peptide 2, two low-energy minima were identified, corresponding to peptide backbone RMSDs of 0.5 and 4.2 Å, respectively (**Figure 2E**). As described above, the binding of Peptide 2 induced a significant conformational change in the protein loop of residues 67–74 (**Figure 1B**). Thus, the loop backbone RMSD and peptide backbone RMSD relative to the bound X-ray structure were also used as reaction coordinates to compute another two-dimensional free energy profile (**Figure 3A**). The protein loop was highly flexible, sampling a large conformational space. The loop backbone RMSD ranged from ∼0.2 to ∼8.0 Å. This loop sampled two low-energy conformations, including the "Open" (bound) (RMSD < 1 Å) and "Closed" (free) states (RMSD ∼3–6 Å) (**Figure 3**). Compared to the "Open" state, the "Closed" loop moved closer to the core domain of the protein (**Figure 3B**). GaMD simulations successfully captured the conformational change of this loop. The peptide and protein loop accommodated each other to form the final bound conformation (**Figure 3**), suggesting an "induced fit" mechanism.

For Peptide 3, GaMD sampled a broad low-energy well, centered at the ∼4.3 and ∼1.0 Å RMSDs for the peptide and protein backbone relative to the bound X-ray structure (**Figure 2F**). Overall, this peptide-protein complex underwent high fluctuations, visiting a large conformational space. Nevertheless, GaMD simulations sampled the native binding pose of Peptide 3, for which the peptide backbone RMSD decreased to ∼1 Å at ∼60 ns and 160 ns during one of the GaMD production runs (Sim1) (**Figure 2C**). In contrast to binding of Peptide 2 that involved induced fit of the protein receptor, binding of Peptides 1 and 3 did not induce significant conformational change of the receptors.

### Effects of the Terminal Residue Charges on Peptide Binding

In addition to the neutral terminus model described above, we simulated another model of the three peptides with charged (zwitterion) terminal residues. Compared with the neutral terminus models, larger fluctuations were observed in the zwitterion terminus models of the three peptides (**Figures S2–S4**). For Peptides 2 and 3, the backbone RMSDs could reach values as large as ∼40 and ∼20 Å, respectively. These results suggested that the peptides could dissociate from the initial near-native bound pose obtained from docking. Furthermore, the 10 top-ranked clusters of peptide conformations with the lowest free energies were also calculated through structural clustering and energetic reweighting (see Methods for details). For Peptide 1, the 1st top-ranked cluster exhibited the smallest backbone RMSD of 1.22 Å relative to the X-ray structure (**Figure S5A** and **Table S1**). The 2nd top-ranked clusters exhibited the smallest backbone RMSDs of 0.62 and 3.88 Å for Peptides 2 and 3, respectively (**Figures S5B,C** and **Tables S2–S3**). In summary, peptides with zwitterion terminal residues underwent higher fluctuations, and the simulation-predicted bound conformations deviated more from the native X-ray structures than those of the neutral terminus models.

### Improved Sampling Efficiency of GaMD Compared With Conventional MD

In addition to GaMD simulations, another set of cMD simulations of the same length was performed to compare sampling efficiency in refining peptide binding conformations. The peptides were simulated with neutral terminal residues. Compared with GaMD, cMD typically needed longer simulation time to refine the binding mode of Peptide 1 (**Figure S6A**). The cMD mostly failed to refine binding poses of Peptides 2 and 3, for which an RMSD decrease was not observed in 3 out of 4 cMD simulations of Peptide 2 (**Figure S6B**) and all 4 cMD simulations of Peptide 3 (**Figure S6C**). The 1st top-ranked cluster exhibited a peptide backbone RMSD of 0.96 Å for Peptide 1 (**Table S1**). For Peptide 2, the 2nd top-ranked cluster showed the smallest peptide backbone RMSD of 2.79 Å, suggesting that a medium-quality model similar to the docking pose was obtained (**Table S2**). For Peptide 3, the 6th top-ranked cluster showed the smallest peptide backbone RMSD of 4.68 Å, closely similar to the PeptiDock result (**Table S3**). Therefore, cMD was significantly less efficient in refining docking poses of peptides compared with GaMD.

### DISCUSSION

We have demonstrated that GaMD can successfully refine PeptiDock docking poses, and thus established the feasibility of the PeptiDock+GaMD combination for predicting peptide-protein complex structures and exploring the peptide binding mechanism. Three peptides with different difficulty levels were selected as model systems. Peptide 1 was the easiest one, as the peptide is rigid and there was no conformational change in the protein during peptide binding. Both Peptides 2 and 3 were challenging for predicting bound conformations accurately. The binding of Peptide 2 involved a significant structural rearrangement of the residue 67–74 loop in the protein. Peptide 3, with its dense residue charges, proved difficult for both docking and GaMD simulations. Nevertheless, the GaMD refinement achieved high-quality models for both Peptides 1 and 2, and a medium-quality prediction for Peptide 3. This approach showed promise to be widely applicable to other peptide-protein binding systems.

It is difficult for current docking programs to account for large conformational changes of proteins during peptide binding (Ciemny et al., 2018). Even in flexible docking calculations, often only movements of protein side chains are taken into account. This raised a challenge in the modeling of Peptide 2. On the other hand, cMD simulations could account for flexibility of the peptide and protein and had been applied to refine docking poses of peptides in proteins (Ben-Shimon and Niv, 2015; De Vries et al., 2017). However, cMD could suffer from insufficient sampling and limited simulation timescales. Indeed, cMD was significantly less efficient in refining docking poses of the peptides compared with GaMD, especially for Peptides 2 and 3. Thus, the GaMD enhanced sampling method has been used in this study. Remarkably, GaMD effectively captured the protein loop movement upon binding of Peptide 2 (**Figure 3**) and greatly refined the peptide docking poses (**Figure 1E**). In addition, high-performance GaMD simulations were performed using AMBER 18 on GPUs. With NVIDIA Pascal P100 GPU cards, each of the 300 ns GaMD simulations took about 38.1, 43.5, and 53.2 h for Peptides 1, 2 and 3, respectively.
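
The reported wall-clock times translate directly into simulation throughput. The short calculation below converts them to ns/day and estimates the GPU hours for the four production runs per peptide, under the assumption (not stated in the text) that each of the four runs took about the same time on a single card. The variable names are illustrative.

```python
def ns_per_day(simulated_ns, wall_hours):
    """Throughput in nanoseconds of simulation per 24 h of wall-clock time."""
    return simulated_ns / wall_hours * 24.0

# Wall-clock hours per 300 ns GaMD run, as reported above.
wall_hours = {"Peptide 1": 38.1, "Peptide 2": 43.5, "Peptide 3": 53.2}

throughput = {name: ns_per_day(300.0, hours) for name, hours in wall_hours.items()}
# Assumes four equally long runs per peptide, each on one GPU.
gpu_hours_four_runs = {name: 4 * hours for name, hours in wall_hours.items()}

for name in wall_hours:
    print(f"{name}: {throughput[name]:.1f} ns/day, "
          f"{gpu_hours_four_runs[name]:.1f} GPU h for 4 x 300 ns")
```

This works out to roughly 135 to 190 ns/day per system, which gives a rough basis for budgeting the planned refinement of all top-10 docking models.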

In summary, PeptiDock+GaMD has been demonstrated for predicting peptide-protein complex structures and revealing important insights into the mechanism of peptide binding to proteins, using three distinct peptides as model systems. In the future, all top-10 models of ClusPro PeptiDock will be refined with GaMD and a larger number of protein-peptide systems will be evaluated systematically. Furthermore, the effects of different force fields (e.g., CHARMM36m) and solvent models (e.g., TIP4P, implicit solvent, etc.) (Kuzmanic et al., 2019) are to be further investigated. Since excellent performance was obtained using the CHARMM19-based force field in the previous study of protein-peptide docking with ClusPro PeptiDock (Porter et al., 2017), we continued to use it as implemented in the ClusPro PeptiDock server for docking calculations in the present study. For refinement of the docking poses with GaMD, because AMBER 18 was used to run the simulations, the widely used AMBER ff14SB force field was selected instead. Nevertheless, it might be better to use a newer force field, and the same one, for the different stages of the modeling protocol, which will be tested in future studies. Development of novel protocols to increase the accuracy of peptide-protein structural prediction will facilitate peptide drug design. Advances in computational methods and computing power are expected to help us address these challenges.

### DATA AVAILABILITY STATEMENT

All datasets generated and analyzed for this study are included in the article/**Supplementary Material.**

### AUTHOR CONTRIBUTIONS

YM and DK designed research. JW and AA performed research. JW, AA, DK, and YM analyzed data and wrote the paper.

### FUNDING

This work was supported in part by the National Institutes of Health (R01GM132572), National Science Foundation (AF 1816314, DBI 1759277), Binational Science Foundation Grant (2015207), American Heart Association (Award 17SDG33370094), and the startup funding in the College of Liberal Arts and Sciences at the University of Kansas (KU). This work used the KU Center for Research Computing Cluster and supercomputing resources with the allocation award TG-MCB180049 through the Extreme Science and Engineering Discovery Environment (XSEDE), which was supported by National Science Foundation grant number ACI-1548562, and project M2874 through the National Energy Research Scientific Computing Center (NERSC), which is a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2019.00112/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Alekseenko, Kozakov and Miao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Large-Scale Conformational Changes and Protein Function: Breaking the in silico Barrier

#### Laura Orellana<sup>1,2</sup>\*

<sup>1</sup> Institutionen för Biokemi och Biofysik, Stockholms Universitet, Stockholm, Sweden, <sup>2</sup> Science for Life Laboratory, Solna, Sweden

Large-scale conformational changes are essential to link protein structures with their function at the cell and organism scale, but have been elusive both experimentally and computationally. Over the past few years, developments in cryo-electron microscopy and crystallography techniques have started to reveal multiple snapshots of increasingly large and flexible systems, deemed impossible only a short time ago. As structural information accumulates, theoretical methods become central to understanding how different conformers interconvert to mediate biological function. Here we briefly survey current in silico methods to tackle large conformational changes, reviewing recent examples of cross-validation of experiments and computational predictions, which show how the integration of different scale simulations with biological information is already starting to break the barriers between the in silico, in vitro, and in vivo worlds, shedding new light onto complex biological problems inaccessible so far.

### Edited by:

Alexandre M. J. J. Bonvin, Utrecht University, Netherlands

### Reviewed by:

Vittorio Limongelli, University of Lugano, Switzerland Elodie Laine, Université Pierre et Marie Curie, France

#### \*Correspondence:

Laura Orellana laura.orellana@scilifelab.se; doble.helix@gmail.com

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 20 July 2019 Accepted: 14 October 2019 Published: 05 November 2019

### Citation:

Orellana L (2019) Large-Scale Conformational Changes and Protein Function: Breaking the in silico Barrier. Front. Mol. Biosci. 6:117. doi: 10.3389/fmolb.2019.00117

Keywords: conformational change, proteins, molecular dynamics simulation, coarse-grained (CG) methods, structural ensemble

### CONFORMATIONAL CHANGES: LINKING SHAPE AND FUNCTION

Protein structure and dynamics are essential to understand living organisms at the molecular level. Already 60 years ago Feynman envisioned that life is, roughly speaking, not only about atomic organization, but also about the "jiggling and wiggling of atoms" (Feynman et al., 1963). The central paradigm of structural biology states that the 3D-fold of a protein is encoded in the sequence (Dill and Chan, 1997; Wright and Dyson, 2015); the explosion of structural data in the past decades has dramatically expanded this classical view, confirming Feynman's prediction. Far from being static structures, it is now clear that proteins behave rather like living entities (Henzler-Wildman and Kern, 2007), ever-changing on temporal and spatial scales spanning several orders of magnitude: from local loop fluctuations in enzyme active sites (Aglietti et al., 2013; Pal et al., 2016) to concerted beta-sheet motions (Fenwick et al., 2014) or large-scale allosteric motions in transmembrane receptors (Bugge et al., 2016). Importantly, growing evidence indicates that these large conformational changes are intrinsically encoded in the overall 3D-shape (Bahar et al., 2010), and that external stimuli (binding, post-translational modifications, electrochemical gradients, etc.) just drive these "natural" motions further to trigger output responses. From signal transduction to membrane transport or synaptic communication, almost every cell process relies on switches that cycle between distinct states to allow for bioregulation (**Figure 1A**). The way that proteins change to sense and respond to such stimuli is therefore central to connect the micro-, meso-, and macroscales in biology. However, their elucidation from atomic "jiggling and wiggling" is far from trivial.

During the past decade, structural determination techniques have made remarkable progress in resolving structures of increasing complexity and flexibility. Currently, high-throughput time-resolved X-ray crystallography (Levantino et al., 2015; Neutze et al., 2015; Ourmazd, 2019), cryo-Electron Microscopy (cryo-EM; Nogales and Scheres, 2015; Murata and Wolf, 2018; Shoemaker and Ando, 2018), and Nuclear Magnetic Resonance (NMR; Baker and Baldus, 2014; Jiang and Kalodimos, 2017; Opella and Marassi, 2017), together with complementary techniques like Small Angle X-ray Scattering (SAXS; Vestergaard, 2016), Förster Resonance Energy Transfer (FRET; Okamoto and Sako, 2017), double electron-electron resonance (DEER; Jeschke, 2012), mass spectrometry (Kahsai et al., 2014) or fluorescence microscopy (Lewis and Lu, 2019) are making it possible to resolve, and gain dynamic information from, extremely challenging systems. Despite such advances, the experimental study of protein transitions is still demanding. A complete understanding of equilibrium dynamics requires sampling both the structure space available and the underlying free-energy landscape (FEL; Frauenfelder et al., 1991; Zhuravlev and Papoian, 2010a; Nussinov and Wolynes, 2014; Röder et al., 2019) along its relevant dimensions (**Figures 1B,C**). Ideally, a completely rational and quantitative FEL characterization should stem from first principles, for example, using methods like Molecular Dynamics (MD; Karplus and McCammon, 2002; Orozco, 2014), in which Newton's equations are integrated over time for an atomistic model of the system based on physical potentials. In practice, atomistic-level sampling of the functional FEL of biomolecules poses by itself a huge conceptual and technical problem in silico. Collective rearrangements and allosteric events in proteins can involve timescales of µs to ms and length scales of up to 10<sup>2</sup> Å. Note that this is far beyond what classical MD can address in terms of time and size: roughly two orders of magnitude larger than average simulated interatomic distances (∼1–10 Å), and up to 9–12 orders of magnitude larger than the smallest simulated timestep (fs oscillations) (Sweet et al., 2013). Importantly, functional transitions often occur in this blurred frontier between theory and experimentation.
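
To make "integrating Newton's equations over time" concrete, here is a minimal sketch of the velocity Verlet scheme, one integrator commonly used in MD codes, applied to a single particle on a harmonic spring. The one-dimensional system, the reduced units and all names are illustrative assumptions rather than any specific package's implementation.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle for n_steps of size dt; returns final (x, v)."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v

def spring_force(x, k=1.0):
    """Hooke's law restoring force for a harmonic potential 0.5 * k * x^2."""
    return -k * x

# Reduced units: mass = 1, k = 1, so the oscillation period is 2*pi.
dt, n_steps = 0.01, 628
x_final, v_final = velocity_verlet(1.0, 0.0, spring_force, mass=1.0, dt=dt, n_steps=n_steps)
total_energy = 0.5 * v_final ** 2 + 0.5 * x_final ** 2
```

The timestep must resolve the fastest oscillation in the system, which is why femtosecond steps are needed for atomistic models and why reaching µs to ms events takes on the order of 10⁹ to 10¹² integration steps.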

Scalable codes, graphics processing units (GPUs), parallelization and optimized simulation algorithms (Pierce et al., 2012; Sweet et al., 2013; Kutzner et al., 2015; Páll et al., 2015; Pouya et al., 2017) are, however, making it increasingly feasible to simulate systems with millions of atoms for a few µs, or even whole bacterial cytoplasms in the submicrosecond range (Yu et al., 2016). Still, for most proteins, these timescales cover a small part of the structural landscape, and longer simulations are only accessible with special-purpose supercomputers like Anton (Shaw et al., 2009; Dror et al., 2012). Apart from these technical aspects, there is a fundamental "sampling problem," not efficiently addressed by long simulations: transition paths in a multidimensional landscape are intrinsically stochastic, meaning that there are multiple possible transition routes, subject to random fluctuations that unpredictably push over energy barriers. Multiple lines of evidence indicate that the way in which the configuration space is sampled is thus more critical than simulation length. For example, while in µs-long MD full transitions are still rarely observed, in certain conditions, e.g., upon relaxation after removing ligands (Nury et al., 2010; Calimet et al., 2013; Degiacomi, 2019) or introducing mutations (Smolin and Robia, 2015; Orellana et al., 2019b), they can occur in significantly shorter times. Similarly, coarse-grained (CG) methods like Elastic Network Models (ENMs; Mahajan and Sanejouand, 2015) are also capable of predicting with striking accuracy, just from the overall shape of a protein, not only the conformational changes observed experimentally but also entire sequences of on-pathway intermediates (Orellana et al., 2016). This suggests that large-scale motions like those defining protein functional FELs may be better understood as collective, supra-atomistic and higher-scale phenomena. Whatever the theoretical framework chosen to explore this issue, the validation of in silico predicted mechanisms is becoming a central question, as quantitative analyses become essential to rationalize the growing dynamical information from techniques like cryo-EM (Frank, 2018; Bonomi and Vendruscolo, 2019).

Let us now imagine that the reader wants to know how a series of conformations of a given protein are related, to gain insight into some biological mechanism. It is then appropriate to ask: Can in silico methods really predict conformational transitions? Have such in silico transitions been validated, and how? This review is intended to provide the non-specialist with some answers to these questions, first raised by Weiss and Levitt (2009). In the first part (**Table 1**), we will briefly review theoretical methods to predict transition pathways, focusing on the two most common approaches to explore the FEL between two states: either increasing atomistic MD sampling (Maximova et al., 2016) or coarse-graining the model of the system (Zheng and Wen, 2017). In the second part (**Table 2**), we will discuss recent examples from our group and others attempting cross-validation between theory and experiments in this context. This review does not aim to provide an in-depth description of specific methods, which can be found elsewhere (Bernardi et al., 2015; Maximova et al., 2016; Mori et al., 2016; Zheng and Wen, 2017; Harpole and Delemotte, 2018). We rather intend to provide general readers, and especially experimentalists, with a broad overview of the most accessible approaches to explore a transition for a typical protein, along with possible validation strategies. Our goal is to help the reader grasp the current potential of in silico methods to explain biological phenomena from microscopic scales, and the exciting boundaries we are reaching.

### FROM STATIC SNAPSHOTS TO MULTI-STATE STRUCTURAL ENSEMBLES

Since the first structure was determined by X-ray crystallography in the late 50s (Kendrew et al., 1958), the number of protein structures deposited every year in the Protein Data Bank (Berman et al., 2000) has been growing exponentially, from a few dozen in the 90s up to over 10,000 structures/year in the past 2 years. As of 2019, we know around 140 thousand native-like protein structures, with resolutions as low as 0.5 Å. For the majority of them, however, the conformers solved represent the equilibrium end-structures along their functional cycles, typically composed of at least two different

FIGURE 1 | biological cycles. (B) Experimental conformational landscapes for the hinge-bending transition of the Ribose Binding Protein (RBP) as computed from Principal Component Analysis: the open to closed RBP conformational change upon ribose binding (Left); RBP conformational landscape and eBDIMS coarse-grained (CG) transitions (Center) as projected onto the PCs derived from the 9 solved structures (Right). Note how eBDIMS paths approach the sequence of experimental intermediates. (C) Comparison of sampling strategies: NMs and path-finding CG-methods (Left); atomistic MD unbiased (500 ns from each unbound state) (Center) and 1 µs-biasing to the closed state (Right). Note how the first NM derived from both RBP end-states (Left) points to the experimental intermediates; note also how eBDIMS paths (gray) roughly follow the MD/X-ray sampled area. Adapted by the author with CC BY license from Orellana et al. (2016).

TABLE 1 | Summary of common in silico methods to explore protein conformational changes (\*CV-based, \*\*only for setup/short run).


meta-stable states (**Figure 1A**): active/inactive, bound/unbound, open/closed, etc. For such average proteins (**Figure 1B**), the native apo state frequently populates the deepest basin and spontaneously samples another of comparable or reduced depth, favored by stimuli like binding, post-translational modifications, etc. that shift the population (Nussinov and Wolynes, 2014). Structural determination techniques usually trap conformations near one of such low-energy basins, while the short-lived intermediates connecting them, which can be key to grasping mechanisms (see e.g., Machtens et al., 2015; Orellana et al., 2019b), are often elusive both experimentally and computationally.

To explore the conformational space, structures are typically solved in multiple conditions, e.g., introducing mutations, modulating pH or ions, or complexing with molecules ranging from ligands to antibodies, affibodies, or small drugs. This contributes to enormous redundancy in the PDB, but at the same time, it is a powerful approach to catch intermediates along transitions. For a growing number of intensely studied proteins, the multitude of conditions that have been used to determine their 3D-structures has gradually covered the entire conformational landscape. Cryo-EM in particular now routinely yields protein snapshots in multiple states with each data deposition [see e.g., the Glycine Receptor (GlyR; Du et al., 2015) in **Figure 1A**]

TABLE 2 | Examples of cross-validation of in silico-predicted properties with experiments to specifically probe conformational changes.



Integrating Simulations With Experiments


and, although limited to a few protein families, this is revealing the first glimpses into structural ensembles that cover nearly complete conformational landscapes (Frank, 2018; Bonomi and Vendruscolo, 2019; Hofmann et al., 2019).

Obtaining multiple snapshots of a protein is, however, just the first step to characterize its transitions. The second consists of understanding their relationships, which also implies identifying the relevant collective variables (CVs; Kitao and Go, 1999; Zhuravlev and Papoian, 2010b; Noé and Clementi, 2017) for each system. This task is comparable to taking multiple pictures of a moving animal in diverse situations, and then trying to reconstruct its biomechanics; one needs first to find a way to measure, classify and organize the images, so that an ordered sequence can be reconstructed. How are we going to efficiently describe the system? What are we going to measure to detect changes from one functional state to another? Fortunately, large-scale transitions can often be described by a remarkably low number of CVs (Henzler-Wildman and Kern, 2007). This is not surprising since, for most proteins, functional movements are collective: each level of protein motion translates into the next, creating wider and slower movements. For example, local atomic vibrations are transmitted via hydrogen bond networks that make up secondary structures, creating higher amplitude motions; as shown in Fenwick et al. (2011, 2014), the coupled movements of interacting atoms in beta-sheet motifs create collective bending and twisting motions, which propagate to higher collective movements linked to allosteric regulation. Another recurring motif in large-scale protein transitions is the open-to-closed motion upon binding (Flores et al., 2006; Amemiya et al., 2011), which in its simplest version consists of rigid-body displacements around a cracking hinge (**Figure 1A,** left). The hinge region, often located near a binding pocket, is typically an interdomain linker; in more complex transitions wider intra/inter-molecular surfaces can reshape as hinges, e.g., in the "rocker-switch" motions between tandem repeats of solute transporters (Drew and Boudker, 2016; **Figure 1A,** center). 
Linker or interface reshaping propagates across structures, triggering large-scale rearrangements. Usually, such rigid-body transitions are tracked with ad-hoc defined angles, distances, etc. However, while for simple hinge-bending transitions an angle defined by moving rigid bodies can render a fair description of the process, the situation changes when systems undergo complex concerted changes: to accurately describe, e.g., gating for ion channels like GlyR (**Figure 1A,** right) typically demands multiple variables describing extra- and intra-cellular motion features, which are much harder to define. In such cases, if the protein in question has solved structures in different basins, Principal Component Analysis (PCA; Jolliffe, 2002; Abdi and Williams, 2010) can provide a "natural" representation of the conformational landscape (Sankar et al., 2015) in the form of experimentally-encoded CVs. Compared to other approaches for semi-automated conformer annotation (e.g., based on machine learning; Ung et al., 2018), PCA does not need a priori system-tailored structural descriptors and requires minimal user intervention. A PC-projection approach was recently applied in spliceosome cryo-EM to perform conformer classification, understand its dynamics and obtain a first assessment of the FEL straight from experimental data (Haselbach et al., 2017, 2018). Moreover, PCs from multi-state ensembles behave as intrinsic complex coordinates that "contain" the heuristic CVs typically defined for each system. As we will discuss later, when such ensemble analyses are combined with path-sampling, they can illuminate relationships between multiple basins and accurately assign intermediate states, allowing reconstruction of the landscape and its transitions in terms of its experimentally defined CVs (**Figure 2**). 
For example, the Ca2<sup>+</sup> pump SERCA, with over 70 structures and at least four different states along its complex pumping cycle, constitutes an exceptional example of a multi-basin ensemble where such analysis is critical to unambiguously assign and order experimental on-pathway intermediates (see Orellana et al., 2016). Importantly, PCA of such "structurally-rich" or multi-state ensembles provides a much needed and stringent test for any modeling technique to explore protein FELs. In the next two sections, we will review the two most popular and accessible approaches to perform such in silico exploration to "connect" experimental basins and "fill in" the conformational landscape: first, sampling with classical MD and its many derivatives, and second, path-finding with computationally simpler methods.
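The ensemble PCA described above can be illustrated with a minimal sketch. It assumes that all conformers have already been superimposed onto a common reference and share the same atoms (e.g., Cα only); the toy "open-to-closed" ensemble, the function names `ensemble_pca` and `project`, and all numerical values are hypothetical and serve only to show how PCs act as ensemble-derived CVs.

```python
import numpy as np


def ensemble_pca(coords):
    """PCA of a pre-superimposed ensemble of shape (n_structures, n_atoms, 3)."""
    n_structures = coords.shape[0]
    flat = coords.reshape(n_structures, -1)   # one row per conformer
    mean = flat.mean(axis=0)
    centered = flat - mean
    # SVD of the centered data: the rows of vt are the principal components.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    variances = sing**2 / (n_structures - 1)
    return mean, vt, variances


def project(coords, mean, components, n_pcs=2):
    """Project conformers onto the first n_pcs principal components."""
    flat = coords.reshape(coords.shape[0], -1)
    return (flat - mean) @ components[:n_pcs].T


# Hypothetical toy ensemble: 9 conformers spread along one collective
# displacement ("opening"), plus a small amount of random noise.
rng = np.random.default_rng(0)
open_state = 5.0 * rng.normal(size=(50, 3))
opening = rng.normal(size=(50, 3))
ensemble = np.stack([
    open_state + t * opening + 0.01 * rng.normal(size=(50, 3))
    for t in np.linspace(0.0, 1.0, 9)
])

mean, pcs, variances = ensemble_pca(ensemble)
landscape = project(ensemble, mean, pcs)      # (9, 2) coordinates in PC space
explained = variances / variances.sum()       # fraction of variance per PC
```

In this toy case the first PC recovers the collective displacement used to build the ensemble; for real multi-state ensembles the same projection can be applied to simulated frames to compare them with the experimental conformers.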

### EXPLORING THE LANDSCAPE: SAMPLING LONG VS. TRICKED

MD simulations, based on the rigorous formalism of molecular physics, constitute possibly the most accurate and accessible approach to model protein motions with atomic detail. Although still an idealized description of reality (proteins diffusing in a crowded and complex cellular soup), MD is based on a careful parameterization of covalent and non-covalent forces on the atomic scale (Beauchamp et al., 2012; Lindorff-Larsen et al., 2012; Monticelli and Tieleman, 2013). Since the first, eye-blink-long 9.2 ps simulation of the small BPTI (McCammon et al., 1977), MD has evolved dramatically over the past 40 years to become almost a "computational microscope" (Dror et al., 2012): it is expected that for relatively small systems like GPCRs, MD will reach the second scale within 5 years (Martínez-Rosell et al., 2017). Nevertheless, for average protein machines, transitions are difficult to sample due to inherent stochasticity and high-energy barriers, involving challenging time and length scales. Although specialized computers like Anton allow simulations of ever-larger systems, longer than ever, and have indeed brought novel insights for key drug targets like GPCRs (Dror et al., 2015), voltage-gated channels (Jensen et al., 2012) or kinases (Shan et al., 2012, 2014), conformational changes are still hard to catch. As a rule of thumb, "everyday" simulations invariably require algorithmic "tricks" to explore transitions with reasonable efficiency. More than computational power or simulation length, efficient sampling remains a bottleneck.

The following brief enumeration of MD strategies to overcome this problem and explore large transitions should provide the reader with a clear picture of its complexity and its many potential pitfalls. Without aiming to be exhaustive (for detailed reviews see e.g., Bernardi et al., 2015; Maximova et al., 2016; Mori et al., 2016; Harpole and Delemotte, 2018), the most common "tricks" (Pietrucci, 2017) to explore transitions are broadly: first, to speed up or optimize exploration of the FEL, without modifying it; second, to actually change the FEL to easily move and jump across its "hills and valleys" (**Table 1**). In both cases, the search can be biased or directly pushed along some a priori "direction," i.e., a CV. Among the first group are many multi-replicate methods, well-suited for highly scalable software implementations thanks to their intrinsically parallel algorithms. Replica exchange MD (REMD), often called "parallel tempering" [first applied to MD in Sugita and Okamoto (1999), reviewed in Ostermeir and Zacharias (2013)], exchanges multiple trajectories run in parallel (typically at different temperatures) to escape local minima. Weighted ensemble methods (WEM), originally developed for simpler Brownian Dynamics (BD; Huber and Kim, 1996; see also Zuckerman and Chong, 2017), use quasi-independent trajectories in which individual runs spawn daughter trajectories upon reaching new "bins" of the configuration space. Adaptive-MD deserves separate mention: it is a general term that includes a wide array of multi-run schemes aimed at speeding up rare events without explicit biasing (Bowman et al., 2010; Pronk et al., 2011; Doerr et al., 2016). The main idea behind adaptive-MD is that simulations are guided toward underexplored FEL regions via iterative on-the-fly analysis; similarly, the WEM partition of the FEL into bins also needs prior CV reduction. 
Therefore, identifying meaningful CVs to check how simulations proceed becomes central, with the risk of generating overly smooth landscapes or distorting transition mechanisms (see Hruska et al., 2018; Zimmerman et al., 2018). One analysis approach used to guide sampling in adaptive-MD is Markov State Models (MSMs; Pande et al., 2010), a statistical method to describe dynamics as memory-less transitions between states. MSMs can infer long-timescale dynamics from sets of shorter simulations, providing yet another shortcut to the sampling conundrum (Chodera and Noé, 2014). In contrast to these costly multi-replicate schemes, biasing methods directly guide single simulations through relevant CVs. For example, Essential Dynamics (Amadei et al., 1993; Daidone and Amadei, 2012) extracts with PCA the "essential" CVs (Essential Modes), which are used to bias the sampling toward collective motions. In Dynamic importance sampling (DIMS; Zuckerman and Woolf, 2000; Perilla et al., 2011), a progress variable or CV is used to select the most productive movement toward the target in an MC-scheme, while in Temperature-Accelerated MD (TAMD; Maragliano and Vanden-Eijnden, 2006) temperature is increased specifically along selected CVs.
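To make the idea of "memory-less transitions between states" concrete, the following is a minimal sketch of how an MSM could be estimated from discretized trajectories. It assumes the conformations have already been assigned to a small number of discrete states (the hard part in practice), uses a simple row-normalized count matrix rather than the reversible estimators found in dedicated MSM packages, and all function names and the toy trajectory are hypothetical.

```python
import numpy as np


def count_matrix(dtrajs, n_states, lag=1):
    """Count transitions i -> j observed at the given lag time."""
    counts = np.zeros((n_states, n_states))
    for traj in dtrajs:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1
    return counts


def transition_matrix(counts):
    """Row-normalize counts (assumes every state was visited as a source)."""
    return counts / counts.sum(axis=1, keepdims=True)


def stationary_distribution(tmat):
    """Left eigenvector of the transition matrix with eigenvalue 1."""
    vals, vecs = np.linalg.eig(tmat.T)
    vec = np.real(vecs[:, np.argmax(np.real(vals))])
    return vec / vec.sum()


def implied_timescales(tmat, lag=1):
    """Relaxation timescales t_k = -lag / ln|lambda_k| for k >= 2."""
    vals = np.sort(np.abs(np.linalg.eigvals(tmat)))[::-1]
    return -lag / np.log(vals[1:])


# Hypothetical discretized trajectory hopping between two states (0 and 1).
dtrajs = [np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])]
counts = count_matrix(dtrajs, n_states=2, lag=1)
tmat = transition_matrix(counts)
populations = stationary_distribution(tmat)
timescales = implied_timescales(tmat, lag=1)
```

The stationary distribution gives the equilibrium populations of the states, and the implied timescales estimate how slowly populations exchange; both can be obtained from many short runs, which is why MSMs pair naturally with adaptive sampling.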

A completely different approach is taken in FEL-modifying approaches like Umbrella sampling (US; Torrie and Valleau, 1977), which introduces harmonic biasing potentials along CVs in overlapping "umbrella" windows. Accelerated MD methods (aMD; Hamelberg et al., 2004; Pierce et al., 2012) change the relative height of the basins by adding "boost" potentials when the system's energy falls below a threshold, locally flattening the FEL. In metadynamics (MTD), free energy wells are filled with "computational sand" to discourage returning to previously explored CV-regions (Laio and Parrinello, 2002; Laio and Gervasio, 2008). The accelerated weight histogram (AWH; Lindahl et al., 2014, 2018) adaptively biases simulations to fit a target distribution, filling up energy minima in a similar spirit as MTD (see **Figure 1C**), while in conformational flooding (Grubmüller, 1995), a destabilizing potential is added to the starting state, lowering the transition barrier. Of all the above methods, MTD has perhaps been the most widely applied to study large transitions in a number of pioneering works, from the opening/closing of kinases (Berteotti et al., 2009) or actin monomers (Pfaendtner et al., 2009) to flexible binding and dissociation events (Limongelli et al., 2010, 2012; Formoso et al., 2015).
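As an illustration of the "computational sand" picture, the sketch below runs a toy metadynamics-like simulation of a single particle in a one-dimensional double-well potential, using overdamped Langevin dynamics. This is a deliberately simplified assumption-laden example (one CV equal to the coordinate itself, fixed-height Gaussians, arbitrary reduced units, hypothetical function names and parameters), not the protocol of any of the cited works.

```python
import numpy as np


def potential_force(x):
    """Force from the double-well potential V(x) = (x**2 - 1)**2."""
    return -4.0 * x * (x**2 - 1.0)


def bias_energy(x, centers, height, width):
    """Sum of the Gaussian hills deposited so far, evaluated at x."""
    centers = np.asarray(centers, dtype=float)
    return float(np.sum(height * np.exp(-(x - centers) ** 2 / (2 * width**2))))


def bias_force(x, centers, height, width):
    """Force (minus the gradient) exerted by the deposited hills at x."""
    centers = np.asarray(centers, dtype=float)
    gauss = np.exp(-(x - centers) ** 2 / (2 * width**2))
    return float(np.sum(height * (x - centers) / width**2 * gauss))


def run_metadynamics(n_steps=40000, dt=0.005, kT=0.1, height=0.05,
                     width=0.2, stride=100, x0=-1.0, seed=0):
    """Overdamped Langevin dynamics with periodic hill deposition."""
    rng = np.random.default_rng(seed)
    noise_scale = np.sqrt(2.0 * kT * dt)
    x = x0
    centers = []
    traj = np.empty(n_steps)
    for step in range(n_steps):
        if step % stride == 0:
            centers.append(x)          # drop a hill at the current CV value
        force = potential_force(x) + bias_force(x, centers, height, width)
        x = x + force * dt + noise_scale * rng.normal()
        traj[step] = x
    return traj, np.array(centers)


traj, centers = run_metadynamics()
```

With these hypothetical parameters the barrier is about ten times the thermal energy, so an unbiased run of the same length would be expected to remain in the starting well, whereas the accumulated hills progressively fill it and push the particle over the barrier.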

Moreover, all these different approaches can be combined in virtually infinite ways, giving rise to hybrid methods like Bias-Exchange MTD (Piana and Laio, 2007), MSM-driven MTD (Sultan and Pande, 2017), and many others. The main shared concern for the above listed methodologies is that trajectories may not accurately reproduce the biologically relevant motions (i.e., trapped experimentally), since they either modify the way sampling is done by decreasing its randomness, or directly change the underlying landscape, which can require re-scaling to remove biasing. Particularly, the bias-introducing methods require extra caution to not produce unrealistic high-energy intermediates (Ma and Karplus, 2002; Ovchinnikov and Karplus, 2012). A tightly connected issue stems from the choice of CVs, which is critical (Pan et al., 2014) but nevertheless, is frequently defined ad-hoc for each system. Typically, CVs are defined in terms of e.g., radius of gyration, distances, angles, rMSDs changing across sets of trial trajectories, which are expected to correlate or "describe" the transition. MSMs (Sultan and Pande, 2017), or machine-learning (Chen et al., 2018) can also be applied to solve this "dimensionality reduction" problem and identify relevant CVs. Another alternative is to define CVs from experimental data e.g., NMR chemical shifts (Granata et al., 2013) or SAXS intensities (Kimanius et al., 2015). In summary, CV definition is a non-trivial problem. For all these reasons, unbiased long simulations, which neither perturb the FEL nor require previous CV knowledge, are often preferred alternatives in many studies aiming for experimental validation, as we will review in the last section.

### PATH-FINDING METHODS: THROWING ROPES OVER MOUNTAINS

Apart from the host of methods to enhance MD conformational sampling, there is another fundamental strategy to explore protein transitions: to simplify either the simulation algorithm or the system, in order to obtain just a feasible pathway between states. Finding transition paths has been compared to "throwing ropes over mountain passes in the dark" (Bolhuis et al., 2002; Dellago and Bolhuis, 2007), since indeed, such methods produce one-dimensional trajectories, like ropes in the conformation space (**Figure 1C**). Instead of sampling transition ensembles covering broad areas of the FEL, the goal of path-sampling methods is to generate sequences of structures connecting end-states. Such rope-like transitions, apart from providing first mechanistic insights, can serve as seeds for further MD (e.g., with US, MTD or "swarms-of-trajectories"; Pan et al., 2008; Maragliano et al., 2014) to reconstruct the FEL.

Very broadly, path-generating methods (Weiss and Koehl, 2014; **Table 1**) can also be classified into two groups: (i) geometric morphing algorithms, which generate stereochemically correct morphs between structures, without any potential function, and (ii) those methods based on some potential energy, which actually attempt to approach minimum energy paths (MEPs) connecting basins. Among the latter, there are path-finding schemes based on MD inspired by the same ideas of enhanced sampling, along with a series of CG-methods, which take an entirely different approach, simplifying the description of structures and their interactions.

The first online tool to compute transition pathways appeared within the MolMov Database (MolMovDB; Gerstein and Krebs, 1998; Krebs and Gerstein, 2000), and applied the simplest possible morphing: a linear interpolation in Cartesian coordinates, followed by energy minimization. As could be expected, MolMovDB paths project as perfectly straight lines in the experimental PC-landscape, and thus do not correspond at all to realistic transitions (**Figure 1C,** left). FATCAT also uses an interpolation of rigid-body motions (Ye and Godzik, 2004). More sophisticated are methods like FRODA (Wells et al., 2005) or geometric targeting (Farrell et al., 2010), which move atoms toward the target by enforcing geometric constraints to keep stereochemistry, while robot motion-planning algorithms (Cortés et al., 2005; Haspel et al., 2010; Al-Bluwi et al., 2012) exploit analogies between molecular bonds and robot links to perform fast molecular kinematics. Note that none of these geometric path-finding methods, which usually generate atomistic paths thanks to high computational efficiency, aims to provide a physical approximation to the FEL. This is not the case for MD-derived perturbation methods (Huang et al., 2009) like targeted (Schlitter et al., 1994), steered (Izrailev et al., 1997), or adiabatic MD (Marchi and Ballone, 1999; Paci and Karplus, 1999), where an MD simulation is directly pushed to the target by time-dependent potentials along a CV. In the so-called "chain-of-states" methods (Tao et al., 2012) like the nudged elastic band (Maragakis et al., 2002) or the string methods (Ren and Vanden-Eijnden, 2005; Ren et al., 2005; Ovchinnikov et al., 2011), serial images of the system are minimized to find MEPs; in the "path-method," a guess path coordinate and two CVs that are functions of it are introduced to locally explore and optimize pathways (Branduardi et al., 2007; Bonomi et al., 2008). 
Although all these enhanced MD-derived path-sampling methods can be faster than conventional MD, finding proper CVs, biasing definitions or initial paths is again critical, and thus their implementation is not straightforward.
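The simplest morphing mentioned above, a linear interpolation in Cartesian coordinates, can be sketched in a few lines, which also shows why such paths necessarily appear as straight lines in any linear projection such as a PC landscape. The sketch assumes two pre-superimposed end-states with identical atom ordering, omits the energy minimization step, and uses hypothetical random coordinates and a hypothetical function name.

```python
import numpy as np


def linear_morph(start, end, n_frames=11):
    """Cartesian linear interpolation between two superimposed structures.

    start, end: arrays of shape (n_atoms, 3); returns (n_frames, n_atoms, 3).
    No energy minimization or stereochemical correction is applied here.
    """
    fractions = np.linspace(0.0, 1.0, n_frames)[:, None, None]
    return start + fractions * (end - start)


# Hypothetical end-states (random coordinates stand in for real structures).
rng = np.random.default_rng(1)
start = 5.0 * rng.normal(size=(40, 3))
end = start + rng.normal(size=(40, 3))

path = linear_morph(start, end)

# All frames lie on one line in coordinate space: after centering, the
# flattened path has a single non-negligible singular value.
flat = path.reshape(len(path), -1)
singular_values = np.linalg.svd(flat - flat.mean(axis=0), compute_uv=False)
```

Because every frame is a point on the segment joining the end-states, intermediate bond lengths and angles are generally distorted, which is why such morphs need subsequent minimization and still cannot bend toward off-line experimental intermediates.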

In contrast with the MD-inspired methods, CG-approaches, more than alternative methods, should rather be considered a different way of looking at the sampling problem, literally, from a more collective scale. Coarse-graining simplifies the description of a system to capture its behavior with a minimum of parameters (Tozzini, 2010; Orozco et al., 2011; Saunders and Voth, 2013). By simplifying both potentials and structure description (Kmiecik et al., 2016), CG-methods accelerate computation, increasing the accessible scales by orders of magnitude; metaphorically speaking, they would be analogous to approaches like cryo-EM or SAXS, in which detail can be sacrificed to gain information from very large or flexible systems. Although hampered by loss of resolution regarding time and chemical properties, CG-methods can thus provide deep insights into the behavior of complex systems, as they distill multidimensional information to its very essential features. Although there are CG force fields like the popular MARTINI implemented into real MD schemes (Marrink and Tieleman, 2013; Ingólfsson et al., 2014), in general CG-models are used in the context of much simpler algorithms, typically produce one-dimensional pathways, and are often available as webservers (**Table 1**).

To generate quick and efficient transitions, CG path-finding methods (Zheng and Wen, 2017) use a host of conceptually diverse protein representations: from a few heavy atoms (e.g., CABS model; Jamroz et al., 2013; Kmiecik et al., 2016) to residue beads (typical of ENMs) or rigid domains; and the same holds true for algorithmic approaches, which span from matrix diagonalization to MC or BD simulations. The only thing they have in common is skipping MD computational limitations, at the cost of losing information about time and energy. Among CG-methods, ENMs (Chennubhotla et al., 2005; Bahar et al., 2010) stand out due to their conceptual simplicity and power to predict large changes through Normal Mode Analysis (NMA; Case, 1994). NMA is a molecular mechanics technique based on harmonic potentials, which was first used to predict infrared spectra and soon was also applied to analytically compute near-equilibrium protein atomic oscillations (Brooks, 1983; Levitt et al., 1985): by solving a simple eigenvalue problem, a vector describing the directions of movement for every atom could be obtained. Inspired by "beads-and-springs" polymer models (Flory et al., 1976; Go and Scheraga, 1976), further coarse-graining of the protein description up to the C-alpha backbone led to the minimalist ENM-NMA (Tirion, 1996; Bahar et al., 1997; Atilgan et al., 2001). Typically, ENMs reduce protein architecture to a network of Cα-carbons connected by springs, which model covalent and non-covalent interactions. In spite of this simplicity, it soon became evident that ENMs can not only predict residue fluctuations, but are also capable of guessing with striking precision the directions of large-scale transitions between e.g., X-ray open and closed pairs (Tama and Sanejouand, 2001). 
Later work has shown that ENMs reproduce as well the flexibility from experimental X-ray and NMR ensembles, or long MD simulations (Rueda et al., 2007; Orellana et al., 2010; Mahajan and Sanejouand, 2015; Sankar et al., 2018) and importantly, track the pathways for conformational change (Orellana et al., 2016; see NM projections, **Figure 1C**, left). Therefore, ENMs have been at the core of CG-strategies to find transition paths; however, being limited to an equilibrium basin, pathway generation requires iterative deformation along selected NMs, or implementation into some simulation scheme. Iterative ENMs range from simple interpolations like NOMAD-Ref and others (Kim et al., 2002; Lindahl et al., 2006; Seo and Kim, 2012) to more complex two-state approaches like iENM or ANMPathway (Yang et al., 2009; Tekpinar and Zheng, 2010; Das et al., 2014) or MinActPath (Franklin et al., 2007; Chandrasekaran et al., 2016), which assumes harmonic potential at the end-states and solves the action minimization problem to find the crossing points. A common issue for such CG-methods is that they typically produce stereochemical distortions, which can be reduced using internal coordinates like in iMODS (López-Blanco et al., 2014), structure corrections in NMSIMs (Ahmed et al., 2011; Krüger et al., 2012), or conjugate peak refinement like in the plastic network model (PNM; Maragakis and Karplus, 2005). In general, these approaches share the ENM power to capture allosteric transitions, but also display a shared weakness: a trend to reproduce similar one-dimensional paths rather than random pathway ensembles (**Figure 1C**, center). One solution to this problem is using NMs to bias simple e.g., Discrete Dynamics (dMD) simulations (Sfriso et al., 2012, 2013) in order to obtain a wider ensemble, although still, mode selection, as CV selection in enhanced MD schemes, poses a problem. 
Recently, we proposed an ENM-driven simulation approach, eBDIMS (Orellana et al., 2016, 2019a), also performing in parallel a thorough validation of path-finding algorithms against multi-state ensemble PCA. Based on a refined ENM force-field (Orellana et al., 2010), eBDIMS generates paths driven by interresidue distances, using a DIMS-Langevin scheme with a friction term mimicking solvent. This avoids unrealistic deformations, while ENM modes are spontaneously sampled, generating random and nonlinear trajectories.
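The "network of Cα-carbons connected by springs" and the associated eigenvalue problem can be sketched as follows. This is a generic anisotropic-network-style construction with a single distance cutoff and uniform spring constant, chosen here for brevity; it is not the refined force-field used by eBDIMS, and the helix-like toy coordinates, the bending displacement, and all function names and parameter values are hypothetical.

```python
import numpy as np


def anm_hessian(coords, cutoff=10.0, gamma=1.0):
    """Hessian of a harmonic network linking all Calpha pairs within cutoff."""
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            r = coords[j] - coords[i]
            d2 = r @ r
            if d2 > cutoff**2:
                continue
            block = -gamma * np.outer(r, r) / d2   # off-diagonal 3x3 block
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian


def normal_modes(hessian):
    """Eigenvalues (ascending) and modes (as rows) of the Hessian."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    return eigvals, eigvecs.T


def mode_overlaps(modes, displacement):
    """Overlap |mode . d| / |d| between each mode and a displacement."""
    d = displacement.ravel()
    return np.abs(modes @ d) / np.linalg.norm(d)


# Hypothetical toy structure: an ideal helix-like trace of 30 "Calpha" beads.
idx = np.arange(30)
angle = np.deg2rad(100.0) * idx
helix = np.column_stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * idx])

hessian = anm_hessian(helix)
eigvals, modes = normal_modes(hessian)

# Hypothetical "observed" change: a gentle bend of the helix axis along x.
bend = np.zeros_like(helix)
bend[:, 0] = 0.01 * (helix[:, 2] - helix[:, 2].mean()) ** 2
overlaps = mode_overlaps(modes[6:], bend)     # skip the 6 rigid-body modes
best_mode = int(np.argmax(overlaps)) + 6
```

The first six eigenvalues are numerically zero (rigid-body translations and rotations); the overlap between the remaining low-frequency modes and an experimentally observed displacement is the kind of quantity used to judge how well an ENM anticipates the direction of a transition.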

Hybrid methods like Climber (Weiss and Levitt, 2009) deserve separate mention: Climber iteratively pulls the interresidue distances by adding harmonic restraints to an internal energy function, based on the atomistic ENCAD force-field (Levitt et al., 1995). In our comparative studies we found that eBDIMS and Climber, starting from entirely different approaches (CG- vs. atomistic, Langevin integration vs. iterative pulling/minimization), generate surprisingly convergent, nonlinear, and asymmetric paths in PC-space. Remarkably, these paths closely overlap with solved experimental intermediates, which delimit the areas typically sampled by MD (see **Figure 1C**, center and right). Overall, our findings strongly suggested that these non-linear path-finding methods converge to actual MEPs, which are populated by trapped experimental intermediates. This raises an important question: how is it possible that such simple C-alpha based harmonic models like eBDIMS can predict the directions of non-equilibrium conformational changes, while MD often requires powerful computing or enhanced sampling? On one hand, it has been suggested that dynamical systems theory assures the conservation of quasi-periodic motions upon small perturbations (Bastolla, 2014), and thus, ENMs are valid beyond the equilibrium, and in a wider set of conditions than was previously thought. On the other hand, the evident power of CG-methods to predict large-scale transitions and intermediates trapped by cryo-EM and crystallography not only demonstrates such validity, but more importantly, suggests that the collective shape-encoded dynamics of proteins may be an essential determinant driving their underlying biologically functional transitions. 
Therefore, CG-methods are not just a quicker alternative to MD, but can provide an essential tool to dissect multi-scale problems like large protein transitions (Voth, 2009), especially in schemes where they are integrated with MD and experiments [see e.g., our recent experience (Orellana et al., 2019b) briefly discussed below].

### CROSS-VALIDATION OF SIMULATIONS AND EXPERIMENTS: TOWARD INTEGRATIVE BIOLOGY

Although the "raison d'être" of most theoretical methods to model protein transitions is to gain insight into molecular mechanisms and connect them to biology, attempts to validate them are still rare, and thus, in silico predictions usually remain in the computational realm as mere hypotheses and are looked at with suspicion by experimentalists (see Lowe, 2015 critique on Kohlhoff et al., 2014). Traditionally, MD provided dynamic information on microscopic scales often inaccessible to experimental probes (e.g., atomic details on hydrogen bonding, loop fluctuations, etc.), and thus its predictions were untestable. As larger scale events like conformational changes are simulated, MD can generate semi-quantitative estimates of observables that can be more easily measured experimentally. Therefore, current MD can significantly contribute to the understanding and interpretation of experimental data; and alternatively, it can also be driven by experiments (Hollingsworth and Dror, 2018). However, in comparison with the large efforts concentrated on pushing the simulation length and sampling, little has been done to systematize and validate in parallel the information obtained, especially when approaching the scales over which transitions happen and propagate.

Simulating the physical world always involves a degree of approximation and uncertainty (Berendsen, 2007); but the same is true for biological experiments. This constitutes maybe the core reason separating the in silico world from actual biology: the extraordinary difficulty posed by integration of atomic-level data on motion with higher-scale experiments, which typically average out dynamic properties over time and space. Recently, a thorough critical analysis of factors influencing the agreement of simulations and experimental data was presented by van Gunsteren et al. (2018). We will not discuss here related issues associated with force-field parameterization (Lindorff-Larsen et al., 2012), convergence of the simulations (Knapp et al., 2011; Sawle and Ghosh, 2016), prediction of microscopic observables (Childers and Daggett, 2018), or the multiple caveats of modeling more realistic conditions, e.g., crowded complex environments (Chavent et al., 2016), electrochemical gradients (Delemotte et al., 2008; Khalili-Araghi et al., 2013), etc. We aim rather to revisit some experimental approaches that have recently provided hands-on direct or indirect validation of in silico predicted large-scale transitions (**Table 2**).

While there has been extensive work on force-field parameterization, e.g., benchmarking predictions about microscopic properties, studies benchmarking the performance of atomistic simulation methods to sample conformational transitions are limited and often reduced to small proteins (Pan et al., 2016). A related issue with MD benchmarking is also the above-mentioned difficulty of identifying relevant CVs for complex systems, especially when only one of the conformational states is known. Note that, in contrast to MD, benchmarking against large-scale changes not only has routinely been done for CG-methods, but also constituted the main basis for their parameterization; in consequence, these methods are extremely effective at predicting transitions along with their CVs. Independent of the strengths and weaknesses of each method, however, the main issues in validating transition pathways are essentially two: on one hand, the scarcity of experimental data about on-pathway intermediates; on the other, the uncertainty in determining the relevant CVs to monitor changes and their associated observables. Although a transition pathway should ideally be supported by direct structural data (either crystallography, cryo-EM, NMR, or SAXS), this is often difficult and the only feasible option is to attempt indirect "soft" validation, either from distance parameters e.g., via single-molecule FRET, FACS, or from functional assays, which can test predictions about protein activity, as we briefly review next.

### DIRECT PATH-VALIDATION: PROTEIN DATA BANK ENSEMBLES AND LANDSCAPES

Classically, in silico pathways such as those generated by path-sampling were evaluated on the sole basis of stereochemical quality, or by tracking progression along ad-hoc system-defined coordinates (Das et al., 2014; Seyler and Beckstein, 2014). As mentioned above, the selection of heuristic CVs for dimensionality reduction is problematic (Seyler et al., 2015), and in practice, structural quality or progression along user-defined CVs does not ensure that a pathway samples the biologically relevant routes. Weiss and Levitt clearly stated this question a decade ago: "Can morphing methods predict intermediate structures?" (Weiss and Levitt, 2009), proposing for the first time to benchmark against proteins with at least three distinct solved states, and to assess how closely sampled pathways spontaneously approach known intermediates in terms of RMSD. Although this procedure definitely poses a more accurate test for in silico pathways, it cannot assess the feasibility of the movements or to what extent they correspond to biological motions. Based on these ideas, we proposed to go beyond two- or three-state benchmarking by introducing ensemble-level analyses that consider all structural information available in the PDB for a given protein, extracting at the same time their intrinsic CVs using PCA (Orellana et al., 2016). This kind of validation provides an extremely stringent test to evaluate sampling both by MD and path-finding algorithms and, thanks to the increasing amount of multi-state structural data available, we foresee that it could become widely applicable in the near future with cryo-EM. As a special case of "hard" pathway validation, it is necessary to mention the study on a SWEET transporter by Latorraca et al. (2017), in which the spontaneous transition toward the inward-open state was first observed in silico with Anton simulations, and subsequently validated by determining an X-ray structure trapped in the same conformation. Although such an approach is not feasible for routine pathway validation, it has provided perhaps the strongest evidence to date in favor of the power of MD simulations to accurately sample the conformational space of proteins.
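The two benchmarking ideas discussed in this section, closest approach of a pathway to a known intermediate in terms of RMSD and projection onto PCA-derived collective variables of a structural ensemble, can be illustrated with a minimal sketch. It is not the protocol of the cited studies: the function names are hypothetical, the coordinates are synthetic, and the ensemble is assumed to be already superposed on a common frame with matching atoms.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch algorithm: optimal rotation from the SVD of the covariance matrix
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))          # avoid improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

def ensemble_pca(ensemble, n_components=2):
    """PCA of an (M, N, 3) ensemble assumed to be pre-superposed."""
    X = ensemble.reshape(len(ensemble), -1)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2 / (len(X) - 1)
    return mean, Vt[:n_components], variances[:n_components]

def project(frames, mean, components):
    """Project (F, N, 3) pathway frames onto the ensemble principal components."""
    return (frames.reshape(len(frames), -1) - mean) @ components.T

# Synthetic stand-ins for two end states and one known intermediate (assumption)
rng = np.random.default_rng(0)
start = rng.normal(size=(20, 3))
end = start + rng.normal(scale=0.5, size=(20, 3))
intermediate = 0.5 * (start + end)

# A trivial "pathway": linear interpolation between the end states
path = np.array([(1 - t) * start + t * end for t in np.linspace(0, 1, 11)])

# Closest approach of the pathway to the known intermediate
closest = min(kabsch_rmsd(frame, intermediate) for frame in path)

# Ensemble-derived CVs and the pathway projected onto the first one
mean, components, variances = ensemble_pca(np.stack([start, intermediate, end]), 1)
scores = project(path, mean, components)
print(f"closest approach to intermediate: {closest:.3f}")
print("progress along PC1:", np.round(scores[:, 0], 2))
```

With real data, the ensemble would be built from all available experimental structures of the protein, and the projection shows whether a simulated pathway visits the regions populated by experimental states.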

### SOFT VALIDATION: FROM FRET AND ANTIBODY BINDING TO FUNCTIONAL ASSAYS AND ANIMAL MODELS

MD simulations have traditionally been validated against microscopic information on relatively local protein flexibility such as NMR couplings, B-factors, etc. In recent years, however, simulations have started to generate predictions on a scale that is suitable for experimental validation through non-structural techniques, finally crossing the boundaries with molecular biology. A quick glimpse into recent examples of cross-validation of conformational changes between simulation and experiments (**Table 2**) clearly shows how we are finally starting to break the barriers separating both, providing new insights into biomedically relevant systems, including key drug targets. Functional conformational changes usually involve either large rigid-body motions of structural elements or more local transitions such as unfolding or loop fluctuations. While the former can expose or bury molecular surfaces for dimerization, interaction with other proteins or ligands, etc., the latter may have more subtle effects on structure-function relationships, e.g., at enzyme active sites. Observing such changes in silico has given rise to quantitative or qualitative predictions that mainly fall into two categories: those concerning interactions with other proteins or small molecules (dimerization, binding), and/or those regarding activity (catalysis, phosphorylation, ion transport, etc.).

Perhaps one of the first examples attempting soft validation of in silico transitions arose from short simulations of open/closed changes in Hsp70, confirmed by Trp-fluorescence changes upon ATP binding (Woo et al., 2009). A more complex validation strategy was taken by Laine et al. (2010), who designed and tested a series of inhibitors against the different conformations of an allosteric site throughout an in silico transition path. In a groundbreaking study of the EGF receptor (EGFR) kinase domain (Shan et al., 2012), Anton simulations revealed a third intermediate state characterized by local αC-helix disorder; further simulations of mutations indicated that they suppress this disorder to enhance dimerization and activation. In this case, proving intrinsic disorder and mutation effects required Hydrogen/Deuterium (H/D) exchange mass spectrometry (Wales and Engen, 2006), while enhanced dimerization was shown by Blue Native Gel electrophoresis (Wittig et al., 2006). Later work by the Shaw group, cross-validating NMR data and simulations (Arkhipov et al., 2013; Endres et al., 2013), provided new insights into EGFR transmembrane dimerization. Shorter µs-simulations by Kaszuba et al. (2015) also led to predictions about the impact of glycans on EGFR conformation, which were tested by monitoring the accessibility of glycosylation-sensitive surface epitopes. Recently, we combined a mutational screening partly based on ENMs with subsequent MD simulations of "dynamically" hot EGFR ectodomain mutations (Orellana et al., 2014, 2019b) in a multiscale CG-MD scheme similar to that proposed by Saunders and Voth (2013). This approach highlighted how, as often happens experimentally, mutagenesis can help to trap intermediate states. In this case, the MD-trapped transition state happened to be the target for a therapeutically relevant antibody, mAb806, which had long been hypothesized to bind a third ectodomain conformer distinct from the known crystal structures and enriched in tumor cells. This provided a rare opportunity to directly extrapolate an MD prediction to animal models by testing the therapeutic impact of mAb806, with surprising success (Binder et al., 2018; Orellana, 2019); moreover, the integration of functional experiments, SAXS and MD revealed unsuspected functional and allosteric convergence of ectodomain deletions and missense mutations. A similar example, in which a protein is known to perform a certain biological activity but the corresponding conformation remains elusive, is illustrated by the work of Machtens et al. (2015), which extended previous MTD findings by Grazioso et al. (2012). In this case, excitatory amino acid transporters (EAATs) were known to transport anions but the specific conduction path was not obvious in end-state X-ray structures. ED simulations of a prokaryotic glutamate transporter homolog, GltPh, revealed a potential channel in an intermediate state (independently trapped with crystallography), and the predicted pore-lining residues were confirmed with Trp-scanning mutagenesis, fluorescence quenching, and electrophysiology. Another indirect approach to validate MD-predicted changes consists of assessing intra- or intermolecular distances with FRET, used e.g., to confirm the compaction of importin in apolar solvents (Halder et al., 2015), or with DEER, an approach that made it possible to demonstrate the opening/closing dynamics in heterotrimeric G-proteins (Dror et al., 2015) and their modulation by nucleotide binding.
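To make the distance-based comparison with FRET concrete, the following minimal sketch converts simulated donor-acceptor distances into an ensemble-averaged transfer efficiency using the standard Förster relation E = 1/(1 + (r/R0)^6). It is only an illustration: the distance distributions are synthetic, the Förster radius of 5 nm is an assumed value, and dye linkers and orientation effects, which matter in practice, are ignored.

```python
import numpy as np

def fret_efficiency(r, r0):
    """Foerster transfer efficiency for distance r and Foerster radius r0 (same units)."""
    r = np.asarray(r, dtype=float)
    return 1.0 / (1.0 + (r / r0) ** 6)

def mean_fret(distances, r0):
    """Efficiency averaged frame by frame over a simulated distance series."""
    return float(np.mean(fret_efficiency(distances, r0)))

# Synthetic donor-acceptor distances (nm) for two hypothetical states
rng = np.random.default_rng(1)
closed = rng.normal(4.0, 0.3, size=1000)
opened = rng.normal(7.0, 0.3, size=1000)

R0 = 5.0  # assumed Foerster radius in nm, dye-pair dependent
print(f"<E> closed state: {mean_fret(closed, R0):.2f}")
print(f"<E> open state:   {mean_fret(opened, R0):.2f}")
```

Averaging the efficiency over frames, rather than evaluating it at the mean distance, is a choice made here because the relation is strongly non-linear in r; which average is appropriate depends on the time scales of the experiment.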

Although not the subject of this review, it is worth mentioning the advances in simulations of spontaneous ligand binding events and protein-protein interactions, which constitute a special case regarding experimental validation and can occasionally provide indirect validation for conformational changes related to binding. For example, either long simulations or enhanced sampling techniques like aMD or MTD have captured spontaneous binding of small molecules to protein kinases or GPCRs (Dror et al., 2011, 2013; Shan et al., 2011, 2012; Kappel et al., 2015), dimerization in membranes (Lelimousin et al., 2016), or protein-protein interactions (Ma et al., 2019), approaching or reproducing crystallographic binding poses or NMR ensembles. In these cases, the PDB coordinates of known complexes, together with free energies of binding, drug efficacies, etc. (Shukla et al., 2015) can provide a hard validation for MD.

### CONCLUDING REMARKS

We have provided a brief overview of the multiple approaches that are used to explore the conformational landscapes of proteins and their transitions in silico, and reviewed different methods used for their validation. On one hand, it becomes clear that the accumulated structural information and flexibility-capturing techniques like cryo-EM are revealing first glimpses of functional landscapes. On the other hand, computational methods have reached maturity and are entering a stage in which they can start to contribute to real biology, modeling longer and larger scales. We have revisited the many approaches available to explore the FEL of proteins, optimizing hardware, software and algorithms in pursuit of the dream of seconds-long sampling. From a completely different standpoint, simulations in crowded cell-like soups of multiple copies of the same protein, although still in the ns-scale, are already a reality that holds promise to reveal dynamical complexity in local microenvironments, providing yet another approach to the sampling problem (Yu et al., 2016; Feig et al., 2018). We have also briefly mentioned machine learning algorithms, paradigmatic of a series of novel fast-developing non-physically based strategies which are gaining ground to study transitions, either alone or in combination with MD or CG methods: from co-evolution analysis (Morcos et al., 2013; Sutto et al., 2015; Sfriso et al., 2016) to cross-correlation, network and community approaches (Potestio et al., 2009; Morra et al., 2012; Rivalta et al., 2012; Papaleo, 2015; Negre et al., 2018), neural networks and deep learning (Ung et al., 2018; Degiacomi, 2019), or integrative sequence and structural analysis (Flock et al., 2015). These approaches, not primarily intended to generate conformational pathways or obtain a physical FEL, have shown their power to reveal new alternative conformations and dissect allosteric mechanisms, and thus are also greatly contributing to the exploration of protein flexibility space. We have reviewed some of the many flavors of CG models and algorithms, and how they can provide low-resolution but stunningly accurate pathways. Finally, we have discussed recent examples where simulations have trapped intermediate states before confirmation by X-ray crystallography (Latorraca et al., 2017), or by in vivo tumor models (Orellana et al., 2019b). Altogether, the explosion of structural data, along with the ever-expanding toolkit of in silico methods, computer capabilities and growing integration between simulations and experiments (driving or being driven by them), are beginning to fulfill the dream of connecting the micro-, meso-, and macro-scales in the study of life phenomena. It also becomes evident that this enterprise requires careful integration of a multitude of techniques and approaches, to connect the atomistic level with the emerging collective behaviors that rule conformational changes. The times ahead are exciting, as we are approaching a critical mass of information on protein structures, and experimental techniques allow exploring their dynamics with ever-increasing detail. The challenge will be to merge the ever-growing data into a coherent picture, which certainly has the potential to revolutionize biology, medicine and drug discovery.

### AUTHOR CONTRIBUTIONS

LO conceived and wrote the manuscript and prepared figures.

### FUNDING

LO was supported by grants from the Knut and Alice Wallenberg (KAW) Foundation and the O.E. and Edla Johanssons Stiftelse.




**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Orellana. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# To Bud or Not to Bud: A Perspective on Molecular Simulations of Lipid Droplet Budding

Valeria Zoni<sup>1</sup>, Vincent Nieto<sup>2</sup>, Laura J. Endter<sup>3</sup>, Herre J. Risselada<sup>3</sup>, Luca Monticelli<sup>2</sup>\* and Stefano Vanni<sup>1,4</sup>\*

<sup>1</sup> Department of Biology, University of Fribourg, Fribourg, Switzerland, <sup>2</sup> Molecular Microbiology and Structural Biochemistry, UMR 5086 CNRS, Université de Lyon, Lyon, France, <sup>3</sup> Department of Theoretical Physics, Georg-August University Göttingen, Göttingen, Germany, <sup>4</sup> CNRS, Institut de Pharmacologie Moléculaire et Cellulaire, Université Côte d'Azur, Valbonne, France

Keywords: lipid droplets, molecular dynamics simulations, budding, membrane asymmetry, lipid droplet proteome

### INTRODUCTION

Fat storage is an essential mechanism whereby cells store energy that can be later used to perform basal functions when food intake is reduced or insufficient. In cells, fat is deposited in organelles called lipid droplets (LDs). LDs are not mere inert storage pools, but they are active sites of lipid metabolism and remodeling. Furthermore, they are involved in numerous diseases, such as obesity, diabetes, cancer, and viral infection (Welte and Gould, 2017).

### Edited by:

Valentina Tozzini, National Research Council, Italy

### Reviewed by:

Giacomo Fiorin, Temple University, United States

### \*Correspondence:

Luca Monticelli luca.monticelli@inserm.fr Stefano Vanni stefano.vanni@unifr.ch

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 22 August 2019 Accepted: 24 October 2019 Published: 13 November 2019

### Citation:

Zoni V, Nieto V, Endter LJ, Risselada HJ, Monticelli L and Vanni S (2019) To Bud or Not to Bud: A Perspective on Molecular Simulations of Lipid Droplet Budding. Front. Mol. Biosci. 6:124. doi: 10.3389/fmolb.2019.00124

Despite this central role in important physiological and pathological processes, the general biology of LDs is poorly understood. This is due to the unique structure of LDs, featuring a core of neutral lipids (NLs), surrounded by a monolayer of phospholipids (PLs). As a consequence of this peculiar composition and organization, the mechanism of LD formation remains largely unclear.

The general consensus is that NLs are produced and stored between the two leaflets of the endoplasmic reticulum (ER) bilayer (**Figure 1A**); as the concentration of NLs exceeds a certain threshold, they aggregate in lenses (**Figure 1B**), that grow into nascent LDs (**Figure 1C**). Subsequently LDs bud from the ER bilayer toward the cytosol (**Figure 1D**) and, depending on the organism, they can either stay connected to the ER (**Figure 1E**) or detach in the cytosol (**Figures 1E,F**) (Wilfling et al., 2014b).

### LD BUDDING: EVIDENCES AND CHALLENGES

The budding step (**Figures 1D,E**) is crucial for proper LD maturation, and it has important physiological consequences. For example, a budded LD has a higher cytosolic surface that can thus be more efficiently exposed to enzymes, such as lipases, the proteins involved in the catabolism of NLs. Also, enrichment of NLs in the ER is toxic for the cell and formation and budding of LDs might provide an effective mechanism to remove NLs from the ER bilayer (Wilfling et al., 2014b). However, the main forces and molecular actors responsible for the regulation of LD budding are still unknown. Of note, the classical machineries for vesicle budding, such as COPI and COPII, have been ruled out, since, even if COPI can bind to LDs and detach nanodroplets in vivo (Thiam et al., 2013), its activity affects protein targeting rather than LD budding (Wilfling et al., 2014a).

On the other hand, regulation of both ER and LD surface tension has been shown to play a crucial role in modulating LD budding (Ben M'barek et al., 2017; Chorlay and Thiam, 2018; Chorlay et al., 2019). To this end, two main mechanisms have been demonstrated to modulate LD budding in vivo and in vitro by acting on surface tension: (i) protein binding to LDs (Chorlay et al., 2019) and (ii) PL composition (Ben M'barek et al., 2017; Choudhary et al., 2018) and abundance (Chorlay et al., 2019). For example, asymmetry in the PL coverage of the NL core has been shown to favor emergence of LDs, promoting budding toward the side with the higher number of PLs (Chorlay et al., 2019). However, potential mechanisms leading to PL asymmetry between the two ER leaflets, and specifically at sites where nascent LDs are present, are currently not well-understood. Alternatively, asymmetry can also be promoted by protein binding, whereby proteins inserting in the PL monolayer increase NL coverage and favor budding toward the side where binding occurs (Chorlay et al., 2019). At the same time, PL composition of the ER bilayer can modulate the emergence of LDs from the ER via two distinct mechanisms: PL shape and PL-induced membrane tension. In fact, PLs with intrinsic positive curvature have been shown to favor budded states (Choudhary et al., 2018), as do PLs that are able to reduce ER tension (Ben M'barek et al., 2017).

In parallel, several proteins localized at LDs have also been shown to regulate LD budding. Two such proteins are seipin and Pex30. Pex30 is a membrane-shaping protein that can tubulate the ER (Joshi et al., 2016) and that is present only transiently at LDs (Wang et al., 2018). Simultaneous deletion of Pex30 and seipin leads to an impairment in LD budding (Wang et al., 2018). Seipin is a transmembrane ER protein that forms ring-shaped homo-oligomers (Sui et al., 2018; Yan et al., 2018) that can be found stably at ER-LD contact sites (Salo et al., 2016). Cryo-EM structures (Sui et al., 2018; Yan et al., 2018) suggest that the luminal portion of seipin, by covering most of the inner LD monolayer, hinders binding of peripheral proteins toward that side. Therefore, the outer monolayer can be covered by a larger number of proteins, including possibly seipin cytosolic loops, and budding would be favored toward the cytosolic side (Chorlay et al., 2019). Furthermore, electron microscopy images reveal that LD-ER contact sites have a well-defined neck-like topology, and the size of the observed membrane neck is compatible with one ring-shaped seipin oligomer (Salo et al., 2019), suggesting that seipin is crucial to maintain this structure. At the same time, the tertiary structure of the ER domain of the protein is very similar to that of some lipid-binding proteins, and it has been shown that the luminal portion can bind phosphatidic acid (PA), suggesting that it could sequester it from the bilayer and possibly present it to metabolic enzymes to form either PLs or diacylglycerols (DAGs) (Yan et al., 2018).

Another family of proteins that is necessary for LD budding is the FIT family (Choudhary et al., 2015). FITs are phosphatases that convert PA into DAG (Hayes et al., 2017; Becuwe et al., 2018), a lipid that not only presents a very low energy barrier for bilayer flip-flop, but that can be also partially stored in the middle of the bilayer, like NLs (Campomanes et al., 2019). Since FITs act only on lipids in the luminal leaflet of the ER, production of DAGs could occur asymmetrically and consequently promote LD asymmetry and budding. At the same time, the high intrinsic curvature of DAG lipids, together with the presence of several transmembrane helices in FIT proteins (Gross et al., 2010), might lead to deformation in the ER cytosolic monolayer generating positive curvature (Thiam and Forêt, 2016). The relevance of deformations in the ER bilayer for LD budding has been proposed also for other proteins that target LDs through a hairpin domain and that, consequently, can impose high positive curvature to the bilayer. An example of this class of proteins is caveolins, also found at LDs (Ostermeyer et al., 2001) and known to deform the membrane at sites of vesicle formation (Parton and Collins, 2016).

### HOW CAN MOLECULAR DYNAMICS HELP UNDERSTANDING THIS PROCESS?

From the evidence in the literature, it appears that a combination of protein activity together with changes in membrane properties (such as surface tension, lipid composition, and surface coverage) is key in controlling LD budding. However, several aspects of this process are difficult to address with state-of-the-art experimental methods. Most notably, a detailed characterization of the molecular structures along the budding pathway remains unaddressed and difficult to achieve using current structural biology methods, due to the liquid nature of lipid aggregates, the small size of early-stage nascent LDs (well below optical resolution), and the transient nature of budding intermediates.

Molecular dynamics (MD) simulations are optimally suited to investigate the structural and dynamic properties of liquids, and they are particularly promising for the study of molecular mechanisms underlying LD budding (Soares et al., 2017). Notably, MD simulations have already been successfully applied to interpret and corroborate several experimental findings. For instance, MD simulations clarified how changes in bilayer surface tension alter the concentration of NLs stored in a LD lens (Ben M'barek et al., 2017). Also, MD simulations showed that asymmetry in monolayer coverage (hence asymmetry in surface tension) is able to control budding directionality independently of the lipid spontaneous curvature (Chorlay et al., 2019).

However, several questions remain on the mechanism and the energetics of budding, as well as on the role of different proteins in the process; we foresee that MD simulations will be instrumental in addressing such questions. First of all, MD simulations can be used to explore the structural role of PLs and how the distribution of different lipids influences budding. In particular, it will be interesting to understand the role of lipids such as phosphatidic acid, lysolipids, and DAG during all the stages of LD growth and budding, since they seem to largely influence budding and protein recruitment (Ben M'barek et al., 2017; Choudhary et al., 2018). Second, MD simulations may help elucidate the energetic requirements associated with various steps of the budding process (depicted in **Figure 1**). Theoretical studies of LD budding suggest that, in order to achieve LD fission, the NL phase should completely dewet from either the inner or the outer leaflet of the ER, a mechanism that requires external energy, possibly controlled by surface tension (Thiam and Forêt, 2016). We envisage that MD simulations may allow detailed predictions on the energetics of LD budding under different and controlled conditions, thereby clarifying which of the proposed budding stages are spontaneous and which ones require external energy. Third, for those steps requiring external energy input, MD simulations will enable predictions of the molecular mechanisms by which proteins regulate LD budding. For example, how Pex30 and seipin concertedly promote budding is not understood. While it has been shown that seipin imposes a distinct topology on LD-ER contact sites (Salo et al., 2019), it remains unclear if, in order to achieve a fully budded state with a well-defined neck (**Figure 1E**), the LD lens needs to reach a certain size or if this topology is already stable in the early stages of LD formation (Deslandes et al., 2017).

More generally, open questions remain on the relevance of protein-induced membrane deformations in LD budding as well as on the influence of LD-binding proteins, and MD simulations can greatly contribute to addressing such questions, particularly as high-resolution structures of the proteins involved become available. Overall, MD simulations can help unveil which morphologies are more energetically favorable for lipid aggregates with different compositions (e.g., different concentrations of NLs), and which transformations are more likely.

Finally, even though the mechanism of LD formation and budding shown in **Figure 1** is generally accepted, whether the final step of the process actually happens in vivo remains controversial. Of particular concern, no fission machinery leading to LD detachment from the ER has been identified so far, and it is unclear whether LD detachment could be promoted by membrane physical properties alone. MD simulations should be able to provide an estimate of the energetic cost of breaking the LD-ER neck and to clarify whether the process is driven only by surface tension or if protein activity is necessary to detach LDs from the ER.

### CONCLUSIONS

In this Opinion, we illustrate the main unanswered questions regarding LD budding that can be investigated using MD simulations. One of the challenges of simulating such systems is their computational cost, since LDs have diameters of hundreds of nanometers and their growth takes place on time scales of seconds (Salo et al., 2019). The employment of chemical-specific coarse-grained models, such as MARTINI (Marrink et al., 2007; Monticelli et al., 2008) and SDK (Shinoda et al., 2007), has recently allowed simulating some aspects of LD budding using realistic sizes and timescales. However, simulations representing the complexity of LD formation (which involves multiple lipid species and proteins throughout the process) might be beyond the current capabilities and accuracy of available CG models. Equilibrium CG simulations might not be sufficient to explore the key aspects of LD budding, and enhanced-sampling strategies might be required. Thus, even though pioneering simulations have started highlighting important aspects of LD biology (Khandelia et al., 2010; Bacle et al., 2017; Ben M'barek et al., 2017; Vanni, 2017; Pezeshkian et al., 2018; Chorlay et al., 2019; Zoni et al., 2019), we foresee that further developments in molecular modeling techniques will be required to advance our understanding of the mechanisms of LD biogenesis.

### AUTHOR CONTRIBUTIONS

VZ and SV wrote the article with help from VN, LE, HR, and LM.

### ACKNOWLEDGMENTS

VZ and SV acknowledge support from the Swiss National Science Foundation grant #163966 and from the Swiss National Supercomputing Centre (CSCS) under project ID s726 and s842. VZ and SV acknowledge PRACE for awarding us access to Piz Daint, ETH Zurich/CSCS, Switzerland. VN and LM acknowledge funding from the Agence Nationale de la Recherche (grant ANR-17-CE11-0003-01) and Grand Équipement National de Calcul Intensif (GENCI, grant A0060710138). LM acknowledges funding from the Institut National de la Santé et de la Recherche Médicale (INSERM).

### REFERENCES

Bacle, A., Gautier, R., Jackson, C. L., Fuchs, P. F. J., and Vanni, S. (2017). Interdigitation between triglycerides and lipids modulates surface properties of lipid droplets. Biophys. J. 112, 1417–1430. doi: 10.1016/j.bpj.2017.02.032

Becuwe, M., Bond, L. M., Mejhert, N., Boland, S., Elliott, S. D., Cicconet, M., et al. (2018). FIT2 is a lipid phosphate phosphatase crucial for endoplasmic reticulum homeostasis. bioRxiv 291765. doi: 10.1101/291765

Ben M'barek, K., Ajjaji, D., Chorlay, A., Vanni, S., Forêt, L., and Thiam, A. R. (2017). ER membrane phospholipids and surface tension control cellular lipid droplet formation. Dev. Cell 41, 591–604.e7. doi: 10.1016/j.devcel.2017.05.012


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zoni, Nieto, Endter, Risselada, Monticelli and Vanni. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Topological Constraints in Eukaryotic Genomes and How They Can Be Exploited to Improve Spatial Models of Chromosomes

#### Angelo Rosa<sup>1</sup> \*, Marco Di Stefano<sup>2</sup> \* and Cristian Micheletti <sup>1</sup> \*

<sup>1</sup> Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy, <sup>2</sup> Centre Nacional d'Anàlisi Genòmica-Centre de Regulació Genòmica, Barcelona, Spain

Keywords: DNA and chromosomes, structural models, genomic entanglement, topological constraints, physical knots and links

### 1. INTRODUCTION

### Edited by:

Valentina Tozzini, National Research Council, Italy

### Reviewed by:

Stephen Daniel Levene, The University of Texas at Dallas, United States

#### \*Correspondence:

Angelo Rosa anrosa@sissa.it Marco Di Stefano marco.distefano@cnag.crg.eu Cristian Micheletti michelet@sissa.it

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 12 September 2019 Accepted: 28 October 2019 Published: 15 November 2019

#### Citation:

Rosa A, Di Stefano M and Micheletti C (2019) Topological Constraints in Eukaryotic Genomes and How They Can Be Exploited to Improve Spatial Models of Chromosomes. Front. Mol. Biosci. 6:127. doi: 10.3389/fmolb.2019.00127

From viruses to eukaryotes, genomic DNA filaments are confined in spaces of linear dimension much smaller than their contour lengths. In bacteriophages, the µm-long genome is stored in 50 nm-wide viral capsids and the corresponding packing density is so high that viral DNA filaments that have little chance to be entangled in solution (knotting probability <3%) become almost certainly knotted (>95% probability) once confined inside the capsid (Rybenkov et al., 1993; Arsuaga et al., 2002; Marenduzzo et al., 2009, 2010). In humans, instead, the various cm-long chromosomes that make up the genome are kept inside 10 µm-wide nuclei (Alberts et al., 2014). Despite the major change of scale with respect to viruses, the volume fraction occupied by this eukaryotic genome is still large, about 10% (Rosa and Everaers, 2008).
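The order of magnitude of the quoted volume fraction can be checked with a back-of-the-envelope sketch. The nuclear diameter of 10 µm comes from the text; everything else is an assumption made only for illustration: a diploid genome of about 6 × 10^9 bp, and chromatin pictured as a uniform cylinder 30 nm wide holding about 100 bp per nm of length. The function name is hypothetical.

```python
import math

def chromatin_volume_fraction(genome_bp, bp_per_nm, fiber_diameter_nm, nucleus_diameter_um):
    """Volume of a cylindrical chromatin fiber divided by the volume of a spherical nucleus."""
    fiber_length_um = genome_bp / bp_per_nm / 1000.0      # nm -> um
    fiber_radius_um = fiber_diameter_nm / 2.0 / 1000.0    # nm -> um
    fiber_volume = math.pi * fiber_radius_um ** 2 * fiber_length_um
    nucleus_volume = (4.0 / 3.0) * math.pi * (nucleus_diameter_um / 2.0) ** 3
    return fiber_volume / nucleus_volume

# Assumed values (not taken from the article): 6e9 bp, 100 bp/nm, 30 nm fiber
fraction = chromatin_volume_fraction(6e9, 100.0, 30.0, 10.0)
print(f"estimated volume fraction: {fraction:.2f}")
```

Under these assumptions the estimate lands within a factor of order one of the 10% quoted above; other choices of fiber geometry would shift the number accordingly.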

These considerations pose several conundrums: How can chromosomal DNA be at the same time packed and yet accessible to the regulatory and transcriptional machineries? What is its typical degree of genomic entanglement and how much does it interfere with DNA transactions? To what extent are these aspects shaped by general passive physical mechanisms vs. active ones, e.g., involving topoisomerase enzymes?

### 2. INTRA- AND INTER-CHROMOSOME ARCHITECTURE

### 2.1. Phenomenology

Addressing these questions has proved challenging because of the wide range of length and time scales involved in genome architecture. Classical experimental tools provide details of chromosome architecture at two opposite scales (Marti-Renom and Mirny, 2011). At the smallest one (10–100 nm), X-ray crystallography revealed that DNA achieves local packing by wrapping around histones, while at the largest one (1–10 µm), fluorescence in situ hybridization (FISH) showed that each chromosome occupies a compact region of the nucleus, termed territory (Cremer and Cremer, 2001, 2010).

More recently, experimental breakthroughs such as super-resolution imaging, electron microscopy tomography plus selective labeling, and chromosome conformational capture (Hi-C) techniques have significantly extended our multiscale knowledge of genome architecture (Dekker et al., 2002; Lieberman-Aiden et al., 2009; Boettiger et al., 2016; Ou et al., 2017; Bintu et al., 2018; Nir et al., 2018).

These and other advancements helped establish various results that foster the present discussion of genomic entanglement.

Regarding inter-chromosome organization we recall that:


For intra-chromosome aspects we instead know that:


### 2.2. Relating Genomic Architecture and Relaxation Dynamics With Polymer Physics

The interpretation of these experimental results has been aided by an intense theoretical and computational activity that demonstrated how salient genomic architecture properties can be reproduced by a broad range of polymer models, and hence are likely governed by general physical mechanisms (Mirny, 2011; Rosa and Zimmer, 2014; Bianco et al., 2017; Haddad et al., 2017; Jost et al., 2017; Tiana and Giorgetti, 2018). This applies in particular to the aforementioned properties (i–iv) which can be rationalized as manifestations of the topological constraints that rule the behavior of semi-dilute or dense polymer systems, particularly their relaxation time scales (Doi and Edwards, 1986).

In fact, a solution of initially disentangled chains of contour length L<sub>c</sub> can reach the fully-mixed, homogeneous equilibrium state only via reptation, a slow and stochastic slithering-like motion with characteristic time scale τ<sub>rept</sub> ≃ τ<sub>e</sub>(L<sub>c</sub>/L<sub>e</sub>)<sup>3</sup>, where τ<sub>e</sub> is a microscopic collision time and L<sub>e</sub> is the typical contour length between entanglement points (De Gennes, 1971; Doi and Edwards, 1986).

Thus, based on this fundamental polymer physics result, it was estimated that the characteristic relaxation, or equilibration, time of mammalian chromosomes exceeds 100 years (Rosa and Everaers, 2008). The orders-of-magnitude difference between this time scale and the typical duration of the cell cycle (≈ 1 day) has several implications for genome organization, as it was realized even before Hi-C probing methods became available (Rosa and Everaers, 2008). It is clear, in fact, that mammalian chromosomes are never fully relaxed as they undergo the cyclic structural rearrangements from the separate compact rod-like mitotic architecture to the decondensed interphase one (Grosberg et al., 1993; Rosa and Everaers, 2008).
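As a rough illustration of how the cubic scaling of the reptation time produces such long time scales, the sketch below evaluates τ<sub>rept</sub> ≃ τ<sub>e</sub>(L<sub>c</sub>/L<sub>e</sub>)<sup>3</sup> for placeholder inputs. The values of `tau_e`, `L_e`, and `L_c` are assumptions chosen only to be of a plausible order of magnitude for a chromatin fiber; they are not taken from the text above, and the order-one prefactor of the scaling law is omitted.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def reptation_time(tau_e, L_c, L_e):
    """Scaling estimate tau_rept ~ tau_e * (L_c / L_e)**3 (prefactor omitted)."""
    return tau_e * (L_c / L_e) ** 3


# Placeholder inputs (assumptions for illustration only).
tau_e = 32.0   # microscopic time scale, in seconds
L_e = 1.2e5    # contour length between entanglements, in base pairs
L_c = 1.0e8    # contour length of one chromosome, in base pairs

tau_rept = reptation_time(tau_e, L_c, L_e)
cell_cycle = 24 * 3600.0  # about one day, as quoted in the text

print(f"reptation time: {tau_rept / SECONDS_PER_YEAR:.0f} years")
print(f"ratio to cell cycle: {tau_rept / cell_cycle:.1e}")
```

Because the dependence on L<sub>c</sub>/L<sub>e</sub> is cubic, a factor of two in either length changes the estimate by a factor of eight, so only the order of magnitude is meaningful.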

### 2.3. Implications for (Minimal) Intra- and Inter-chromosome Entanglement

From this standpoint, the emergence of chromosome territories is quantitatively explained as due to the kinetically trapped decondensation of the compact mitotic chromatin (Rosa and Everaers, 2008): interphase chromosomes retain the memory and limited mutual overlap of the earlier mitotic state, consistent with experimental results (Cremer and Cremer, 2001, 2010; Branco and Pombo, 2006). In addition, the ordered linear organization of the mitotic rods should also inform the intra-chromosomal architecture, making it more local than equilibrated polymers. This is consistent with the experimental fact that the effective scaling behavior of the contact probability with the genomic separation ℓ in interphase chromosomes has a more local character (∼ ℓ<sup>−1</sup>) than the one expected (∼ ℓ<sup>−3/2</sup>) for equilibrated polymers (Lieberman-Aiden et al., 2009). Intuitively, the same memory mechanism ought to facilitate the subsequent separation of interphase chromosomes and their recondensation upon re-entering the mitotic phase in the cell cycle (Rosa and Everaers, 2008).

For the present discussion, we stress that these out-of-equilibrium effects should impact not only the architecture but also the physical entanglement of eukaryotic genomes. In fact, mammalian chromosomes should be more unlinked (for the limited inter-chromosomal intermingling) and unknotted (for the enhanced intra-chromosomal local contacts) than at equilibrium. These heuristic conclusions are supported by various studies showing that the aforementioned scaling relationships obtained by FISH and Hi-C experiments can be ascribed to the topological constraints at play in solutions of unknotted and unlinked polymers (Khokhlov and Nechaev, 1985; Vettorel et al., 2009; Halverson et al., 2014; Rosa and Everaers, 2014).

### 2.4. Implications for Genomic Structural Modeling and Its Improvement

These considerations appear particularly relevant for the structural modeling of eukaryotic genomes based on phenomenological data, such as spatial proximity constraints, which are typically too sparse to pin down even coarse-grained models of interphase chromosomes (Lieberman-Aiden et al., 2009).

A key question is whether such structural models should additionally be informed by the notion that interphase chromosomes must originate and eventually return to the separate and condensed mitotic state.

Evidence presented in our earlier work helps shed some light on the matter. With our co-workers, we considered a model system of six copies of human chromosome 19 in a cubic simulation box with periodic boundary conditions to explore the connection between coregulation and colocalization of genes (Di Stefano et al., 2013). Each copy was initially prepared as a mitotic-like conformation (Rosa and Everaers, 2008), consisting of a polymer filament forming a solenoidal pattern with rosette-like cross-section featuring chromatin loops of about 50 kilo-basepairs, see **Figure 1A**. We then used a molecular-dynamics steering protocol to bring in proximity pairs of intra-chromosomal loci that were known to be significantly co-regulated. Importantly, topological constraints were accounted for by avoiding unphysical chain crossings during the steering process.

Remarkably, and consistently with the gene kissing hypothesis (Cavalli, 2007), we found that most (>80%) pairs of significantly coregulated genes could indeed be colocalized in space within the contact range of 120 nm and further showed that this colocalization compliance followed from the presence of gene cliques in the coregulatory network (Di Stefano et al., 2013).

Conversely, the same protocol applied to the same set of chains but initially prepared as generic self-avoiding random walks failed to give colocalization (Di Stefano et al., 2013). Physically, this happens because the intra- and inter-chain entanglements present in this system, which mimics an artificial set of equilibrated chromosomes, were too numerous and conflicting to be successfully negotiated on a viable simulation time scale (see **Figure 1A**).

Further elements come from the genome-wide structural modeling of human chromosomes of Di Stefano et al. (2016). In this study too, the model chromosomes were initially prepared in mitotic-like states and were then steered to bring in proximity those pairs of loci that corresponded to significantly enhanced entries of two independent Hi-C datasets (Dixon et al., 2012; Rao et al., 2014). The architecture of the final conformations was, as expected, significantly changed by the steering protocol. Yet, as illustrated in **Figure 1B**, we verified that each model chromosome could be brought to a condensed compact shape as needed for the interphase-mitotic transition without significant hindrance from intra- or inter-chromosomal topological constraints (Di Stefano et al., 2016).

We note that the limitedly-entangled architecture of models of long eukaryotic chromosomes has emerged lately (Di Pierro et al., 2016) as the consequence of microphase separation of regions of different chromatin types (Jost et al., 2014) in a block copolymer model with pair interactions tuned to reproduce the contact propensities of point (iv). The point is reinforced by studies on the yeast genome showing that knots and links have a generally low incidence especially in comparison to equivalent systems of equilibrated chains (Duan et al., 2010; Segal et al., 2014; Pouokam et al., 2019). Finally, besides the indication from structural models, other mechanisms such as loop extrusion have been advocated to be instrumental for maintaining a low degree of chromosomal entanglement (Racko et al., 2018; Orlandini et al., 2019).

To some inevitable extent though, physical entanglements are still expected to arise in eukaryotic chromosomes.

The recent work of Roca's lab showed that knots do occur in eukaryotic minichromosomes in vivo, for instance during transcription, due to transient accumulation of entanglement (Valdés et al., 2017, 2019). On broader scales, various knots (Siebert et al., 2017), and even links (Niewieczerzal et al., 2019), were found in model mouse chromosomes obtained from single cell Hi-C (Stevens et al., 2017). The genuineness of the entangled states was suggested by the systematic recurrence of certain knot types in independent instances of the reconstructed chromosomal structures (Siebert et al., 2017). These were obtained by imposing phenomenological constraints on an initially disconnected set of effective monomers, so we expect that a more defined knot spectrum could be obtained by using disentangled self-avoiding chains as the reference model.

### 3. CONCLUSIONS

To conclude, we have discussed experimental evidence and general physical mechanisms based on polymer theory that consistently point to an unusually low degree of entanglement expected in long eukaryotic chromosomes. Such property, which is arguably essential for the capability of chromosomes to reconfigure as needed at various stages of the cell cycle, appears important for genomic modeling too.

We argued that the structural modeling of long chromosomes can benefit, both for realism and computational efficiency, by starting off with disentangled self-avoiding chains, e.g., mitotic-like ones, because their plasticity makes it possible to accommodate a large number of phenomenological constraints in a physically-viable manner, i.e., without deformations involving intra- or inter-chain crossings.

The latter are, of course, possible in in vivo systems thanks to the action of topoisomerase enzymes. An important open question regards the extent to which these active mechanisms are involved in shaping the overall intra- and inter-chromosome architecture. This point, we believe, can be significantly advanced in future studies with a tight synergy of experiments and models (Goloborodko et al., 2016; Jost et al., 2017; Valdés et al., 2019).

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

The authors acknowledge support from the Italian Ministry of Education, MIUR. The authors would like to acknowledge networking support by the COST Action CA17139.


ii-mediated knotting of intracellular DNA. Nucleic Acids Res. 47, 6946–6955. doi: 10.1093/nar/gkz491


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Rosa, Di Stefano and Micheletti. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Kinetic Modeling of the Genetic Information Processes in a Minimal Cell

Zane R. Thornburg<sup>1</sup>, Marcelo C. R. Melo<sup>1,2</sup>, David Bianchi<sup>1</sup>, Troy A. Brier<sup>1</sup>, Cole Crotty<sup>1</sup>, Marian Breuer<sup>1,3</sup>, Hamilton O. Smith<sup>4</sup>, Clyde A. Hutchison III<sup>4</sup>, John I. Glass<sup>4</sup> and Zaida Luthey-Schulten<sup>1</sup>\*

*<sup>1</sup> Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL, United States, <sup>2</sup> Machine Biology Group, Department of Psychiatry, Microbiology, and Bioengineering, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States, <sup>3</sup> Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, Netherlands, <sup>4</sup> Synthetic Biology and Bioenergy Group, J. Craig Venter Institute, La Jolla, CA, United States*

JCVI-syn3A is a minimal bacterial cell with a 543 kbp genome consisting of 493 genes. For this slow growing minimal cell with a 105 min doubling time, we recently established the essential metabolism including the transport of required nutrients from the environment, the gene map, and genome-wide proteomics. Of the 452 protein-coding genes, 143 are assigned to metabolism and 212 are assigned to genetic information processing. Using genome-wide proteomics and experimentally measured kinetic parameters from the literature we present here kinetic models for the genetic information processes of DNA replication, replication initiation, transcription, and translation which are solved stochastically and averaged over 1,000 replicates/cells. The model predicts the time required for replication initiation and DNA replication to be 8 and 50 min on average respectively and the number of proteins and ribosomal components to be approximately doubled in a cell cycle. The model of genetic information processing when combined with the essential metabolic and cell growth networks will provide a powerful platform for studying the fundamental principles of life.

Keywords: minimal cells, stochastic simulations, kinetic parameters, DNA replication, transcription, translation, mRNA production, protein production

### 1. INTRODUCTION

JCVI-syn3A, a bacterial cell with a synthetic minimal genome of size 543 kbp and 493 genes, is an organism designed to have the fewest genes necessary for life and is therefore an ideal model organism for studying fundamental principles of life (Lachance et al., 2019). In Breuer et al. (2019), we published the flux balance analysis of the essential metabolism of JCVI-syn3A along with the gene map and the genome-wide data from essentiality and proteomics experiments. Although metabolism, including transport of nutrients into the cell, has been established, the reactions and kinetic models for genetic information processes in JCVI-syn3A are missing. The accompanying gene map in **Figure 1A** assigned all 452 protein coding genes to one of the four major functional classes: metabolism with transporters (143), genetic information processes (212), cellular processes such as cell division (6), and unclear functions (91). Accompanying the gene map is a map of the proteomics data detected for the 428 proteins in **Figure 1B**. The model presented here uses the proteomics data to guide the modeling of protein production.

### Edited by:

*Giulia Palermo, University of California, Riverside, United States*

### Reviewed by:

*Juan R. Perilla, University of Delaware, United States Ali Mohamad Farhat, University of Michigan, United States*

> \*Correspondence: *Zaida Luthey-Schulten zan@illinois.edu*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

> Received: *27 August 2019* Accepted: *07 November 2019* Published: *28 November 2019*

#### Citation:

*Thornburg ZR, Melo MCR, Bianchi D, Brier TA, Crotty C, Breuer M, Smith HO, Hutchison CA III, Glass JI and Luthey-Schulten Z (2019) Kinetic Modeling of the Genetic Information Processes in a Minimal Cell. Front. Mol. Biosci. 6:130. doi: 10.3389/fmolb.2019.00130*

ncbi.nlm.nih.gov/nuccore/CP016816.2 (Breuer et al., 2019).

In our previous work on ribosome biogenesis in Escherichia coli (Earnest et al., 2015, 2016), ribosome assembly was included along with DNA replication and transcription/translation of just the ribosomal proteins (rproteins). In this simplified model we focus on developing kinetic parameters that replicate the DNA, generate proteins comparable to the proteomics abundances, and produce sufficient numbers of rprotein and ribosomal RNA (rRNA) to generate the approximately 500–700 ribosomes estimated from the biomass equation in Breuer et al. (2019). Here we introduce the construction and results of our simplified genetic information processing model for a cell 400 nm in diameter. The kinetics for initiation of DNA replication are based on a mechanism derived from the JCVI-syn3A genomic sequence, crystal structures of the initiator protein DnaA complexed with DNA, and kinetic parameters from single molecule fluorescence resonance energy transfer (smFRET) experiments. Parameters for simplified kinetics describing DNA replication, transcription, mRNA degradation, translation, and protein degradation are derived from the literature and our previous studies on JCVI-syn3A (Breuer et al., 2019) and E. coli (Earnest et al., 2015, 2016). Within the cell cycle of 105 min, these processes duplicate the genome and generate and translate sufficient amounts of mRNA to approximately reproduce the proteomics data and the estimated number of ribosomes. All 452 protein coding genes and 35 genes for rRNAs and tRNAs in the genome of JCVI-syn3A are expressed. Three pseudogenes and three genes for small RNA are not expressed in this model.

### 2. METHODS

Each of the genetic information processing subsystems involves species that are low in population in the cell, for example one or two copies of a gene and 0–10 copies of a protein-coding mRNA. To capture the stochastic nature of genetic information processes, the kinetics were modeled with chemical master equation (CME) simulations and solved using the Gillespie algorithm as implemented in the software Lattice Microbes (Roberts et al., 2013; Hallock et al., 2014; Earnest et al., 2015, 2018) with the pyLM interface in a Python 3 Jupyter notebook. Due to the small size of JCVI-syn3A, 400 nm in diameter, we neglect the spatial location of species inside the cell in this simplified model, which allows us to stochastically model the kinetics as well-stirred using CME simulations. The results of stochastic simulations were averaged over 1,000 replicates/cells. Each replicate requires a run time of one second. The Jupyter notebooks are available on GitHub (https://github.com/zanert2/Thornburg\_FrontMolBiosci\_2019).
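For readers unfamiliar with the Gillespie algorithm, the following is a minimal, self-contained sketch of the direct method for a single well-stirred birth-death process (synthesis and first-order degradation of one mRNA species), averaged over replicates. It uses only the Python standard library and is not the Lattice Microbes or pyLM implementation; the function name `gillespie_birth_death` and the synthesis rate `K_MAKE` are hypothetical, while the 4 min half-life and 105 min cell cycle are the values quoted elsewhere in this article.

```python
import math
import random


def gillespie_birth_death(k_make, k_deg, t_end, n0=0, rng=None):
    """Direct-method SSA for 0 -> mRNA (k_make) and mRNA -> 0 (k_deg * n).

    Returns the copy number at time t_end.
    """
    if rng is None:
        rng = random.Random()
    t, n = 0.0, n0
    while True:
        a_make = k_make        # propensity of synthesis
        a_deg = k_deg * n      # propensity of degradation
        a_total = a_make + a_deg
        if a_total == 0.0:
            return n           # no reaction can fire any more
        t += rng.expovariate(a_total)  # exponential waiting time
        if t > t_end:
            return n
        # Pick which reaction fires, proportionally to its propensity.
        if rng.random() * a_total < a_make:
            n += 1
        else:
            n -= 1


K_MAKE = 0.02                    # 1/s, hypothetical synthesis rate
K_DEG = math.log(2) / (4 * 60)   # 1/s, from a 4 min half-life
T_END = 105 * 60                 # s, one cell cycle

rng = random.Random(0)
final_counts = [gillespie_birth_death(K_MAKE, K_DEG, T_END, rng=rng)
                for _ in range(200)]
mean_count = sum(final_counts) / len(final_counts)
print(f"mean copies: {mean_count:.2f}, expected K_MAKE/K_DEG: {K_MAKE / K_DEG:.2f}")
```

Individual replicates fluctuate by several copies, which is why quantities in a model of this kind are reported as averages over many replicates rather than from a single trajectory.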

### 2.1. Polymerization Model and Rate Forms

In our genetic information processing model, DNA replication, transcription, and translation are all reactions that involve an enzyme (DNAP, RNAP, or ribosome) catalyzing polymerization reactions based on a preexisting template polymer (the entire ssDNA, each unique gene on the ssDNA, or its corresponding mRNA). In the case of replication, the single template is the entire genomic sequence of 543 kbp. In the case of transcription, the templates are the 493 individual genes, each with a unique length and sequence. In the case of translation, the templates are the number of individual messengers for each of the proteins. We use a rate form based on Equation (33) from Hofmeyr et al. (2013) that was derived assuming polymerization from a single unique template where the enzyme is in excess and the concentration of free enzyme is constant. DNA replication, transcription, and translation all involve a situation in which the enzyme is in excess of unique templates. For DNA replication, there is a single start site, oriC, and 35 DNAP molecules in the proteomics data. In the case of transcription, there are 187 RNAP and if we consider any one gene as the template for the rate form, there are at most two copies of the gene at any point in the cell cycle. In translation, there are over 500 ribosomes available to translate the individual mRNAs which typically number <10. In each case, we assume a constant steady-state concentration of free enzymes in determining the kinetic rates, although the template concentrations will change over time. The general polymerization rate form can be written as

$$\nu\_{poly} = \frac{k\_{cat}[T]}{\left(1 + \frac{K\_0}{[E]}\right) \frac{K\_{D1}K\_{D2}}{[M]\_1[M]\_2} + \sum\_i \frac{n\_i K\_{Di}}{[M]\_i} + n\_{tot}} \tag{1}$$

which we modify for transcription and translation in the following sections to address that there is competition among unique templates of different lengths n<sub>tot</sub> in each process. For our experimental situation, the polymerization rate is dominated by k<sub>cat</sub>, n<sub>tot</sub>, and template concentrations. The variation in rates based on these assumptions is discussed further below in Equation (2). The general rate form considers a mechanism starting with enzyme E (DNAP, RNAP, or ribosome) binding to a polymer template T with binding constant K<sub>0</sub>. Once the enzyme and template have bound, the first two monomers (dNTP, NTP, or the charged aa-tRNA) M<sub>1</sub> and M<sub>2</sub> bind to the template/enzyme complex with association constants K<sub>D1</sub> and K<sub>D2</sub>. The monomer concentrations are determined by the pool sizes provided in Zhang and Ignatova (2009) and Breuer et al. (2019). A value of K<sub>D</sub> has been measured for a single elongation step of mRNA by RNAP, but not for DNAP or ribosomes (Larson et al., 2012). Values for K<sub>D</sub> were fitted to maximize the rate of each process assuming their respective pool sizes and other experimentally measured kinetic parameters. Our fitted value for RNAP agrees well with the experimentally determined value. Monomers of type i are then added to the growing polymer by the binding with their respective association constant K<sub>Di</sub> and we assume that they are the same for any one process. The growing polymer is elongated at a rate k<sub>cat</sub>. The resulting polymer (DNA, rRNA, mRNA, tRNA, or protein) of length n<sub>tot</sub> will consist of n<sub>i</sub> of each respective monomer type M<sub>i</sub> following the first two positions in the polymer.

In general, both the enzyme and template concentrations are functions of time. In evaluating the rate constant, the enzyme concentrations were held constant to the values derived from the proteomics data making the polymerization rate obey first order kinetics

$$\nu = k(n\_{tot}, k\_{cat}) [T] \tag{2}$$

where the rate constant is defined as

$$k(n\_{tot}, k\_{cat}) = C \times \frac{k\_{cat}}{\left(1 + \frac{K\_0}{[E]}\right) \frac{K\_{D1} K\_{D2}}{[M]\_1 [M]\_2} + \sum\_i \frac{n\_i K\_{Di}}{[M]\_i} + n\_{tot}} \tag{3}$$

in which C represents any modifications to the rates of transcription or translation. For the kinetic parameters, pool sizes, and low enzyme concentrations assumed in the kinetic model, the denominator is dominated by the third term, the length of the new polymer n<sub>tot</sub>. In analyzing the sensitivity of DNA replication, transcription, and translation to the concentration of each respective enzyme, we found that the rate constants k from Equation (3) deviated no more than 10<sup>−4</sup>% as the concentration of enzyme is doubled over the cell cycle. Our above approximations hold assuming the cell is in the exponential growth phase where nutrient and pool sizes are in a steady state. The approximations no longer hold in cases such as the transition from exponential to stationary growth. As nutrients in the environment become depleted, the rate of elongation steps in DNA replication, transcription, and translation will be slowed down due to a lack of monomers M<sub>i</sub>.
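To make the structure of Equation (3) concrete, the sketch below evaluates the rate constant term by term and checks numerically how weakly it depends on the enzyme concentration when n<sub>tot</sub> dominates the denominator. All numerical values (template composition, pool sizes, K<sub>D</sub>, K<sub>0</sub>, enzyme concentration) are hypothetical placeholders rather than the parameters of the model, and a single K<sub>D</sub> is used for all monomers, following the assumption stated in the text that these constants are the same within any one process.

```python
def polymerization_rate_constant(k_cat, n_tot, n_counts, pools, K_D, K_0,
                                 enzyme, first_two, C=1.0):
    """Evaluate k(n_tot, k_cat) of Equation (3).

    n_counts: monomer type -> n_i; pools: monomer type -> concentration [M]_i.
    All concentrations and constants must share the same units.
    """
    m1, m2 = first_two
    # First denominator term: enzyme binding and the first two monomers.
    initiation = (1.0 + K_0 / enzyme) * (K_D * K_D) / (pools[m1] * pools[m2])
    # Second term: binding of every subsequent monomer.
    elongation = sum(n * K_D / pools[m] for m, n in n_counts.items())
    # Third term: polymer length.
    return C * k_cat / (initiation + elongation + n_tot)


# Hypothetical 1,000 nt transcript and placeholder constants (molar units).
n_counts = {"A": 350, "U": 350, "C": 150, "G": 150}
pools = {m: 1.0e-3 for m in n_counts}
params = dict(k_cat=25.0, n_tot=1000, n_counts=n_counts, pools=pools,
              K_D=1.0e-6, K_0=1.0e-7, first_two=("A", "U"))

k_single = polymerization_rate_constant(enzyme=1.0e-6, **params)
k_double = polymerization_rate_constant(enzyme=2.0e-6, **params)

print(f"k = {k_single:.6f} 1/s, limit k_cat/n_tot = {25.0 / 1000:.6f} 1/s")
print(f"relative change on doubling enzyme: {abs(k_double - k_single) / k_single:.2e}")
```

With these placeholder values the enzyme concentration enters only through the first, very small, denominator term, so k stays close to k<sub>cat</sub>/n<sub>tot</sub>; monomer pools much smaller than assumed here would make the second term grow and slow elongation.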

### 2.2. Replication Initiation

Previous treatments of replication initiation have proposed a mechanism based on E. coli and B. subtilis that began with the initiator protein DnaA binding to four 9-bp signatures of the DNA near oriC, followed by accumulation of DnaA monomers around that location until a buildup of 20–30 monomers was reached (Atlas et al., 2008; Karr et al., 2012). Our model of DNA replication initiation is based on the genomic sequence of JCVI-syn3A in **Figure 2** and a mechanism derived from crystal structures of the multi-domain DnaA binding to ds- and ssDNA shown in **Figure 3**. In the genomic sequence structure, a strong DnaA binding signature (TTATCCACA) is located near the origin matching the whole 9-bp sequence with two neighboring signatures matching 7 out of 9 bp (Schaper and Messer, 1995; Weigel et al., 1997; Speck et al., 1999). These signatures lie next to an AT-rich region 93 bp in length.

DnaA domain IV [DnaA(IV)] binds most strongly to the sequence TTATCCACA. DnaA(IV) binds to the dsDNA signatures (Erzberger et al., 2006; Duderstadt et al., 2011). DnaA domain III [DnaA(III)] binds to AT-rich ssDNA in 3 nucleotide increments forming a helical, filament-like structure (Erzberger et al., 2006; Duderstadt et al., 2011). Our mechanism assumes that the binding of DnaA(IV) to the three neighboring dsDNA signatures near oriC opens up a small pocket of ssDNA in the neighboring AT-rich region. This mechanism is illustrated in **Figure 4A**. Once the dsDNA sites are occupied, DnaA(III) can start binding to the neighboring AT-rich region on the ssDNA. The DNA continues to be unwound until the AT-rich region is wrapped by the DnaA filament. Since DnaA(III) binds to ssDNA in 3 nt increments (Duderstadt et al., 2011; Cheng et al., 2014), the 93 bp AT-rich region shown in **Figure 2** produces a filament with 30 DnaA. After formation of the filament, replication can be initiated.

FIGURE 3 | Crystal structures of DnaA binding to *E. coli* DNA suggest a mechanism for initiation of replication: (A) PDB 1J1V; DnaA(IV) binds to a 9-bp signature on dsDNA. (B) PDB 3R8F; Four DnaA(III) bind to 3-nucleotide increments on ssDNA.

To capture the proposed mechanism, we begin with a reaction binding a DnaA to the high affinity binding signature near oriC on dsDNA, creating a bound site and the two low affinity free sites on either side of the high affinity site. The low affinity sites on dsDNA then react with one DnaA each, creating a bound site for each. The dsDNA binding rates use second order rate forms using the rate constants shown in **Table 1**. There is also a reaction in the model for DnaA binding to other high affinity sites around the chromosome. This is included since the filament length strongly depends on the number of free DnaA available. The kinetic model for the formation of the DnaA filament is based on an smFRET study on ssDNA (Cheng et al., 2014). The smFRET study in **Figure 4B** reports values for k<sub>on</sub> for addition of a DnaA molecule to the growing DnaA filament bound to ssDNA and k<sub>off</sub> for removal of a DnaA molecule from the filament as shown in **Figure 4C**. These kinetic parameters are presented in **Table 1** and were used for each independent binding and unbinding until a filament consisting of 30 DnaA has formed. Once the filament is formed and replication begins, the filament is assumed to be removed at the rate of the polymerization in DNA replication which models removal of DnaA by DNA helicase. The model is constructed so that only one replication initiation event occurs in a cell cycle.
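The filament-growth step described above can be pictured as a stochastic chain in which the filament gains one DnaA at a rate proportional to the number of free DnaA and loses one at a constant rate, until 30 are bound. The sketch below simulates only that step, in plain Python. The rate values, the initial number of free DnaA, and the function name `time_to_filament` are hypothetical; the dsDNA binding steps and the competing chromosomal sites of the full model are left out, and removal is assumed to act on one filament end at a time.

```python
import random


def time_to_filament(n_free, k_on, k_off, target=30, rng=None,
                     max_steps=1_000_000):
    """Return the time (s) to reach `target` bound DnaA, or None if unreachable.

    k_on is a per-molecule rate (1/s per free DnaA); k_off is in 1/s.
    """
    if rng is None:
        rng = random.Random()
    t, length = 0.0, 0
    for _ in range(max_steps):
        if length >= target:
            return t
        a_on = k_on * n_free                   # addition to the filament
        a_off = k_off if length > 0 else 0.0   # removal from the filament
        a_total = a_on + a_off
        if a_total == 0.0:
            return None                        # stuck: nothing can bind or leave
        t += rng.expovariate(a_total)
        if rng.random() * a_total < a_on:
            length, n_free = length + 1, n_free - 1
        else:
            length, n_free = length - 1, n_free + 1
    return None


# Hypothetical parameters for illustration only.
t_done = time_to_filament(n_free=100, k_on=0.01, k_off=0.5, rng=random.Random(1))
t_starved = time_to_filament(n_free=10, k_on=0.01, k_off=0.0, rng=random.Random(1))
print("time to 30-mer with 100 free DnaA:", t_done)
print("with only 10 free DnaA and no unbinding:", t_starved)
```

The second call illustrates why the availability of free DnaA matters: with fewer free molecules than filament positions, the target length is never reached in this toy version.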

## 2.3. Replication

The replisome, a complex containing proteins necessary for DNA replication including DNA helicase, DNAP, DNA primase, gyrase/topoisomerase, and the beta clamp, binds at oriC once the replication initiation event has occurred and then proceeds in both directions around the chromosome, creating the two replication forks as shown in **Figure 5**. Using smFRET experiments, the replisome has been observed to assemble in just a few seconds (Downey and McHenry, 2010; Cho et al., 2014). We do not model the assembly of the replisome and assume its assembly occurs during or before replication initiation. As the replisome proceeds along the chromosome, the original chromosome shown in green is unzipped and the two new chromosomes shown in red and blue are polymerized on the original ssDNA template. Both strands of ssDNA at the replication fork are treated the same with continuous polymerization, and Okazaki fragments are not modeled. The model assumes that once the replisomes reach the terminus, they fall off quickly and the two new chromosomes are instantaneously separated. The number of dATP, dTTP, dCTP, and dGTP monomers n<sub>i</sub> appearing in the rate form (Equation 1) are calculated from the A, T, C, and G content of the genome: 203606 A, 207816 T, 67238 C, and 64720 G. Since there are no metabolic reactions to produce deoxynucleotides or ATP for the reactions to occur, constant pools for each are assumed using the pool sizes from Breuer et al. (2019) presented in **Table 2**.

parameters for the binding of DnaA(III) to ssDNA were obtained from a smFRET study (Cheng et al., 2014) where the FRET signal depended on the number of DnaA bound. Fewer DnaA corresponded to compact ssDNA, resulting in a high FRET signal. Increasing the number of DnaA bound to ssDNA extends the filament, lowering the FRET signal. (C) Schematic of the binding kinetics of DnaA(III) to ssDNA forming a DnaA filament of length n. The *kon* and *koff* values correspond to the kinetics measured by smFRET.

Kinetic parameters for replication are given in **Table 3**. The elongation rate constant k<sub>cat</sub> (Xie et al., 2008) and the association constant for DNAP to DNA K<sub>0</sub> (Zhang et al., 2016) were obtained from the literature for E. coli. In order to make a second copy of the genome within the 105 min doubling time, the choice of K<sub>D</sub> was made in order to minimize the time to duplicate the DNA. Assuming the constant pool sizes and DNAP concentrations, the value of K<sub>D</sub> corresponds to the value where the length of the genome is the dominant term in the denominator of k in Equation (3).
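The condition that the genome length dominates the denominator can be checked with simple arithmetic on the base counts quoted for replication. The sketch below sums those counts and compares the monomer-binding term of Equation (3) with the genome length; the values of `K_D` and `POOLS` are hypothetical placeholders, since only their ratio matters for the comparison.

```python
# Base composition of the JCVI-syn3A genome as quoted in the text.
GENOME_COUNTS = {"A": 203606, "T": 207816, "C": 67238, "G": 64720}

n_tot = sum(GENOME_COUNTS.values())
gc_fraction = (GENOME_COUNTS["C"] + GENOME_COUNTS["G"]) / n_tot


def elongation_term(n_counts, K_D, pools):
    """Sum over monomer types of n_i * K_D / [M]_i, as in Equation (3)."""
    return sum(n * K_D / pools[base] for base, n in n_counts.items())


# Hypothetical constants, same units for K_D and the pools.
K_D = 1.0e-6
POOLS = {base: 1.0e-4 for base in GENOME_COUNTS}

ratio = elongation_term(GENOME_COUNTS, K_D, POOLS) / n_tot

print(f"genome length n_tot = {n_tot} nt, GC fraction = {gc_fraction:.3f}")
print(f"monomer-binding term / n_tot = {ratio:.3f}")
```

When all pools are equal, the ratio reduces to K<sub>D</sub>/[M], so the genome length dominates whenever K<sub>D</sub> is small compared with the deoxynucleotide pool concentrations.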

### 2.4. Transcription

To modify the general rate form for transcription, we incorporate two factors: the probability of an active RNAP selecting any gene P<sub>gene selection</sub> and the strength of the gene's promoter S<sub>promoter</sub>. The fraction of active RNAP as estimated by Bremer and Dennis (2008) for a cell with a ∼100 min doubling time implies that around 29 of the 187 RNAP are actively transcribing at any time. Of the actively transcribing RNAPs, Bremer and Dennis (2008) estimate that approximately 24% are involved in making stable RNA like rRNA. Since each rRNA operon only contains the 16S, 23S, and 5S rRNAs and no tRNAs, transcription of the two rRNA genes will require four RNAP. Therefore, the probability of any other gene being selected is P<sub>gene selection</sub> = 25/487 ≈ 0.05. We estimate that each rRNA operon is always being actively transcribed by two RNAP, and therefore has a probability of gene selection of 1.

TABLE 2 | Pool sizes from Breuer et al. (2019) and estimated from Zhang and Ignatova (2009) and Mackie (2013)\*.

The expression from Hofmeyr et al. (2013) did not include competition for multiple templates which is now captured with the probability of gene selection. This gives us a transcription rate

$$\nu\_{transcription} = P\_{gene\ selection} \times \nu\_{poly} \tag{4}$$

which we use for transcription of rRNA, tRNA, and ribosomal protein-coding genes.

The rate of transcribing a gene also depends on the strength of its promoter sequence (Jones et al., 2014); however, the precise promoter sequences and their strengths have not been measured for JCVI-syn3A. In a preliminary analysis of the sequences preceding each protein-coding gene, we found that, in general, a protein is more likely to have a higher proteomics value if the start codon is preceded by both a Shine-Dalgarno sequence and a promoter sequence TANAAT as characterized in Mycoplasma pneumoniae (Lloréns-Rico et al., 2015). Using this information, to incorporate a proxy for promoter strength, S<sub>promoter</sub>, into the kinetics, the transcription rate for each non-ribosomal protein coding gene is multiplied by the ratio of the gene's proteomics count to the average proteomics count of 180

$$\nu\_{m\text{RNA\\_transcription}} = \text{S}\_{\text{promoter}} \times P\_{\text{gene\\_selection}} \times \nu\_{\text{poly}} \tag{5}$$

Since some ribosomal proteins were not reported in the proteomics data, this factor is not used in the transcription rates of ribosomal protein coding genes.
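The two scaling factors in Equations (4) and (5) amount to simple arithmetic, sketched below. The reading that 25 is the 29 active RNAP minus the four assigned to the rRNA operons, and that 487 is the 452 protein-coding genes plus the 35 rRNA and tRNA genes, is an inference from the numbers quoted in the text; the function names and the example proteomics counts are hypothetical.

```python
ACTIVE_RNAP = 29             # actively transcribing RNAP (of 187 in total)
RNAP_ON_RRNA = 4             # two RNAP on each of the two rRNA operons
EXPRESSED_GENES = 452 + 35   # protein-coding genes plus rRNA/tRNA genes
MEAN_PROTEOMICS_COUNT = 180  # average proteomics count quoted in the text

# Probability that one of the remaining active RNAP selects a given gene.
p_gene_selection = (ACTIVE_RNAP - RNAP_ON_RRNA) / EXPRESSED_GENES


def promoter_strength(proteomics_count):
    """Proxy S_promoter: a gene's proteomics count relative to the average."""
    return proteomics_count / MEAN_PROTEOMICS_COUNT


def transcription_rate(v_poly, p_selection, s_promoter=1.0):
    """Equation (5); with s_promoter = 1 it reduces to Equation (4)."""
    return s_promoter * p_selection * v_poly


print(f"P_gene_selection = {p_gene_selection:.4f}")
for count in (18, 180, 1800):  # hypothetical proteomics counts
    s = promoter_strength(count)
    rate = transcription_rate(1.0, p_gene_selection, s)
    print(f"count {count}: S_promoter = {s:.1f}, rate / v_poly = {rate:.4f}")
```

Under this proxy a gene whose protein is ten times more abundant than average is transcribed ten times faster than an average gene of the same length and composition.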

The model expresses all 452 protein-coding genes and the genes for rRNA and tRNA. For each protein or RNA, the gene identifier from the NCBI entry (NCBI GenBank CP016816.2: https://www.ncbi.nlm.nih.gov/nuccore/CP016816.2; Breuer et al., 2019) is read and the corresponding sequence is used to determine the nucleotide stoichiometries for the formation and degradation reactions. RNA formation reactions

TABLE 3 | Parameters used in kinetics for replication, transcription, translation, mRNA degradation, and protein degradation.


use our modified polymerized, template-driven rate forms in Equations (4) and (5) and the degradation reactions of mRNA follow first order kinetics. The nucleotide stoichiometries are used to determine the monomer counts n<sup>i</sup> and total polymer length ntot in the rate form. Constant pools of nucleotides are assumed using the pool sizes from Breuer et al. (2019) presented in **Table 2**. For the transcription reactions, the enzyme is RNAP and the template is the total concentration of the gene in the cell as a function of time and includes the replication of DNA. This model, however, does not take into account the location of a gene on the genome during DNA elongation. The elongation rate constant kcat and the association constants K<sup>0</sup> and K<sup>D</sup> are listed in **Table 3**. Literature values of mRNA and tRNA elongation rates of 25 nt/s are used for kcat (Chen et al., 2015). A messenger half-life of 4 min is used for all mRNA degradation. The half-life of 1 min in Breuer et al. (2019) did not result in mRNA abundances that produced proteins quickly enough to double the number of proteins in the cell cycle. The 4 min half-life gives a total mRNA abundance in better agreement with the data published in Lynch and Marinov (2015). The experimentally observed rRNA operon elongation rate kcat of 90 nt/s (Ryals et al., 1982) was multiplied by two for both operons to model the effect of two RNAP simultaneously transcribing each operon. The association constant for association of RNAP to DNA K<sup>0</sup> was calculated according to Hofmeyr et al. (2013) using the concentrations of the free and actively transcribing RNAP (Bremer and Dennis, 2008) and concentration of the gene. The association constant for nucleotides binding to the RNAP/gene complex K<sup>D</sup> was fitted so that the rate of transcription was maximized by making transcript length the dominant term in the denominator of k in Equation (3). 
Our fitted value agrees with a measured experimental value of 0.14 mM (Larson et al., 2012). With no transcriptomic data available, each mRNA begins with a count of 1 and each tRNA is divided evenly at 190 each to have a total tRNA abundance of 3,750, a value scaled from E. coli based on differences in cell volume (Mackie, 2013).
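For orientation, the half-lives and elongation rates quoted above translate into rate constants as sketched below. The conversion k = ln 2 / t½ is the standard relation for first-order decay; the variable names are illustrative only.

```python
import math


def first_order_rate_constant(half_life_s: float) -> float:
    """Rate constant (1/s) of a first-order decay with the given half-life (s)."""
    return math.log(2) / half_life_s


# mRNA degradation: the 4 min half-life used here vs. the 1 min value
# of Breuer et al. (2019).
k_deg_4min = first_order_rate_constant(4 * 60)   # about 0.0029 1/s
k_deg_1min = first_order_rate_constant(1 * 60)   # four times faster

# rRNA operons: the measured 90 nt/s is doubled to mimic two RNAP
# transcribing each operon simultaneously.
rrna_kcat_nt_per_s = 2 * 90.0
```

The fourfold slower decay constant is what allows each messenger to be translated more times before it is degraded.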

### 2.5. Translation

Since the total number of mRNA is approximately of the same order as the number of ribosomes, the probability of any mRNA being translated is near unity. The only other modification of the translation rate expression is the factor Nribo in Equation (6), which allows more than one ribosome (a polysome) to bind to a long transcript.



*The cost of translation does not include charging of the tRNAs as those reactions are incorporated in the essential metabolism (Breuer et al., 2019).*

This factor is an integer calculated as the length of the transcript over an estimated ribosome spacing of 300 nt in E. coli (Brandt et al., 2009). If the value is calculated as <1, the value of Nribo is set to 1. The ribosome spacing was estimated using an observed approximate average of 4 ribosomes per polysome for an average transcript length of 1,200 nt.

$$\nu\_{\text{translation}} = N\_{\text{ribo}} \times \nu\_{\text{poly}} \tag{6}$$
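The polysome factor can be sketched as follows. The text states only that Nribo is an integer obtained from the transcript length and a 300 nt spacing, with a minimum of 1; truncation toward zero (integer division) is an assumption made here, and the function names are hypothetical.

```python
RIBOSOME_SPACING_NT = 300  # 1,200 nt average transcript / 4 ribosomes per polysome


def n_ribo(transcript_length_nt: int) -> int:
    """Number of ribosomes on a transcript, never fewer than one.

    Integer division (truncation) is assumed; the text does not say
    whether the ratio is truncated or rounded.
    """
    return max(1, transcript_length_nt // RIBOSOME_SPACING_NT)


def translation_rate(transcript_length_nt: int, v_poly: float) -> float:
    """Equation (6): N_ribo * v_poly."""
    return n_ribo(transcript_length_nt) * v_poly
```

Under this assumption, a 1,200 nt transcript carries four ribosomes, whereas a transcript as short as that of ribosomal protein L34 carries one.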

The model includes the translation and degradation of each protein made from each mRNA. The gene identifier from the NCBI entry also includes the amino acid sequence for protein-coding genes, which is used to determine the corresponding stoichiometries of tRNA charged with their corresponding amino acids (aa-tRNA) required to build the protein and the amino acid stoichiometries when the protein is degraded. For the translation reactions, the template in the polymerization rate form (Equation 1) is the associated mRNA. The model uses whole, intact ribosomes as the enzyme and does not model association of messengers to the 30S small subunit followed by association of the 50S large subunit. The elongation rate constant kcat and the association constants K<sup>0</sup> and K<sup>D</sup> are listed in **Table 3**. For E. coli, experimentally measured elongation rates range from 10 to 20 aa/sec (Bremer and Dennis, 2008); however, slower rates have been reported in other bacteria such as Mycobacterium bovis with an elongation rate of 2 aa/sec (Cox, 2004). A value of 5 aa/sec, within the estimated range of 2–10 aa/sec, was chosen so that the number of proteins was approximately doubled in a cell cycle. The association constant of the ribosome to the mRNA K<sup>0</sup> was estimated using the average fraction of actively translating ribosomes (Bremer and Dennis, 2008) and an average concentration of an mRNA to be one in the cell. The association constant for aa-tRNA binding to the ribosome/mRNA complex K<sup>D</sup> was fitted to maximize the rate of translation assuming constant aa-tRNA pool sizes and ribosome concentration. The value of K<sup>D</sup> was computed using the length of the shortest protein, ribosomal protein L34 (40 aa), in the equation for the rate constant k (Equation 3). A half-life of 25 h was used for

FIGURE 6 | (A) DnaA filament formation for four different replicates shown in different colors. The stochastic effects of the filamentation kinetics result in a wide range of times to form the filament from <5 to 50 min. (B) Probability distribution of replication initiation times when the thirtieth DnaA in the ssDNA filament binds. We predict the most probable time to form the filament to be approximately 5 min and the average time to be approximately 8 min shown with a dotted line. (C) Average of genome duplication over 1,000 replicates shows that on average the genome will be duplicated in 65 min of the 105 min cell cycle, leaving approximately 40 min for continued cell division. (D) The average abundance of DnaA not bound to DNA gets depleted by filament formation and replenished by translation and removal of the filament by DNA helicase.

protein degradation reactions (Maier et al., 2011). Degradation of the proteins is extremely slow, so the main source of dilution would be cell division after 105 min.
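To make "extremely slow" concrete, the fraction of a first-order decaying species lost over one cell cycle can be estimated from its half-life. This is a back-of-the-envelope sketch that ignores synthesis; the names are illustrative.

```python
def fraction_remaining(elapsed_min: float, half_life_min: float) -> float:
    """Fraction of an initial pool left after first-order decay."""
    return 0.5 ** (elapsed_min / half_life_min)


CELL_CYCLE_MIN = 105.0

# Proteins: 25 h half-life; mRNA: 4 min half-life.
protein_fraction_lost = 1.0 - fraction_remaining(CELL_CYCLE_MIN, 25 * 60)
mrna_fraction_lost = 1.0 - fraction_remaining(CELL_CYCLE_MIN, 4.0)
```

With a 25 h half-life only about 5% of an initial protein pool is degraded in 105 min, whereas an unreplenished mRNA pool would be lost essentially completely.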

### 2.6. ATP Energy Costs

Replication, transcription, translation, mRNA degradation, and protein degradation have associated ATP hydrolysis costs. Although the mechanism for ATP hydrolysis is not explicitly modeled, the costs are incorporated as additional time dependent reactions for each subsystem. For example, in DNA replication the DNA helicase is not explicitly modeled, but we assume that 1 ATP hydrolysis event per bp is required to unwind the dsDNA. The ATP cost of each reaction in each subsystem is determined by the length of the DNA/RNA/protein being formed or mRNA/protein being degraded (Russell and Cook, 1995; Lynch and Marinov, 2015). In transcription, we assume that the RNAP uses 1 ATP hydrolysis event per bp to unwind the dsDNA. The mRNA degradation reactions also assume that

FIGURE 7 | Abundances of mRNA and tRNA transcribed in a 105 min cell cycle. (A) A single replicate from the stochastic simulation of the mRNA abundance for glucose-6-phosphate isomerase shows fluctuations in the average integer abundance of messengers. Fluctuations arise from competing rates of formation, degradation, and replication. The average mRNA abundances of mRNA coding for (B) metabolic proteins, (C) genetic information processing, DnaA (orange), and cell division proteins, (D) ribosomal proteins, and (E) proteins of unclear function all have average abundances between zero and seven. (F) The total number of all messengers during a cell cycle averaged over 1,000 replicates shows that typically there are 300–450 messengers present in the cell at any time.

1 ATP hydrolysis event is required per nucleotide removed from the messenger. The translation reactions assume 2 ATP hydrolysis events per amino acid addition. These reactions use 2 instead of 4 ATP hydrolysis events since the amino acid charging of the tRNA is already included in the essential metabolic network (Breuer et al., 2019). The costs used are also shown in **Table 4**.
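The per-unit costs stated in this section can be collected into a small lookup, sketched below. Only the costs given explicitly in the text are included; the per-residue cost of protein degradation is listed in Table 4 and is deliberately left out here rather than guessed. The dictionary keys and function name are hypothetical.

```python
# ATP hydrolysis events per polymer unit, as stated in the text.
ATP_PER_UNIT = {
    "replication": 1,        # per bp of dsDNA unwound
    "transcription": 1,      # per bp unwound by RNAP
    "translation": 2,        # per amino acid added (tRNA charging excluded)
    "mrna_degradation": 1,   # per nucleotide removed from the messenger
}


def atp_cost(process: str, length: int) -> int:
    """ATP hydrolysis events for one reaction acting on a polymer of `length` units."""
    return ATP_PER_UNIT[process] * length


# Example: a hypothetical 1,200 nt gene encoding a 400 aa protein.
example_costs = {
    "transcription": atp_cost("transcription", 1200),
    "translation": atp_cost("translation", 400),
    "mrna_degradation": atp_cost("mrna_degradation", 1200),
}
```

Because every cost scales with polymer length, the total for a subsystem is the sum of these terms over all reaction events in the cell cycle.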

### 3. RESULTS

### 3.1. Replication Initiation and Replication

We found that DnaA(IV) requires <1 min to bind to all three dsDNA signatures. The stochastic trajectories of DnaA filament formation from four representative cells are shown in **Figure 6A**. The distribution of times to form the DnaA filament in **Figure 6B** is peaked at 5 min, but on average it takes 8 min for the DnaA filament to form on ssDNA as shown with a dotted line. Once the filament is 30 DnaA in length, replication begins and the DnaA filament is removed by the polymerization of DNA, resulting in the fast drop from 30 to 0 DnaA in the filament as seen in the trajectories in **Figure 6A**. It then takes another 50 min on average for replication to reach completion in **Figure 6C**. We predict replication initiation and replication are completed by 65 min, leaving another 40 min for the cell to divide in the 105 min cell cycle.

To illustrate the time-dependent variation in protein formation, the average abundance of free DnaA is shown in **Figure 6D**. Within the first minute we see a fast drop due to DnaA(IV) binding to high affinity dsDNA binding sites around the genome. The filament formation slowly removes DnaA from the free DnaA abundance until around 8 min when replication most frequently begins. DnaA is then replenished over several minutes due to removal of the filament by DNA helicase and translation of new DnaA.

### 3.2. Transcription

The mRNA production in a single cell exhibits fluctuations due to competing rates of formation and degradation. A representative trajectory of the mRNA production for glucose-6-phosphate isomerase over the 105 min simulation is shown in **Figure 7A**. The abundance of the messenger fluctuates from zero to two before DNA replication occurs and then from one to five once the gene has been duplicated. The time dependence of all mRNA over a cell cycle averaged over 1,000 replicates is shown in **Figures 7B–E**. The mRNA are divided into mRNA for metabolic proteins (**Figure 7B**), genetic information processing and cell division proteins (**Figure 7C**), ribosomal proteins (**Figure 7D**), and proteins of unclear function (**Figure 7E**). The resulting kinetics show each mRNA growing or depleting in population from the initial one copy until the effects of replication are fully manifested around 60 min. In the early phase, the increase or decrease of mRNA reflects the competition between mRNA decay and the length of the transcript and the strength of the gene's promoter. As the genome is duplicated, this equilibrium for each mRNA shifts once a second copy of the gene is present. As the position of the gene in the genome is not considered, the variations are proportional to the change in the DNA copy number over the cell cycle and not the nearness to oriC. The total number of mRNAs in **Figure 7F** varies from its initial value of 452 (one for each of the protein-coding genes) to an equilibrium value of approximately 425.
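The origin of such fluctuations can be illustrated with a generic birth-death process simulated with the Gillespie direct method. This is not the authors' Lattice Microbes model: gene duplication is omitted, the transcription rate `k_tx` is a hypothetical constant, and only the 4 min half-life is taken from the text.

```python
import math
import random


def simulate_mrna(k_tx, half_life_min, t_end_min, n0=1, seed=0):
    """Gillespie simulation of mRNA synthesis (rate k_tx) and first-order decay.

    Returns the event times (min) and the integer copy number after each event.
    """
    rng = random.Random(seed)
    k_deg = math.log(2) / half_life_min
    t, n = 0.0, n0
    times, counts = [0.0], [n0]
    while True:
        a_tx = k_tx              # propensity of making one transcript
        a_deg = k_deg * n        # propensity of degrading one transcript
        a_total = a_tx + a_deg
        t += rng.expovariate(a_total)   # waiting time to the next event
        if t >= t_end_min:
            break
        if rng.random() * a_total < a_tx:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts, t_end_min):
    """Time-weighted mean copy number over [0, t_end_min]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end_min
        total += n * (t_next - times[i])
    return total / t_end_min
```

For a single gene copy the long-time mean approaches k_tx / k_deg, and doubling the template number doubles it, which is the shift in equilibrium described above once a gene has been replicated.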

More than 500 of each rRNA were produced in a cell cycle shown in **Figure 8A**, reaching the number required to produce 500–700 ribosomes in the cell cycle estimated by Breuer et al. (2019). The number of each tRNA produced in **Figure 8B** reveals three groupings of tRNA production. The three groupings depend on the number of genes for each tRNA present in the genome. The groups consisting of more than one gene include 3 each of methionine and leucine tRNA genes making up the tRNA grouped between 500 and 600 tRNA and 2 each of threonine, tryptophan, lysine, arginine, and serine tRNA genes making up the tRNA grouped between 300 and 400 tRNA. Overall the model produces approximately 4,000 total tRNAs over a cell cycle, in close agreement with the initial estimate of 3,750 obtained from scaling the abundances in E. coli (Mackie, 2013).

non-ribosomal genetic information processing and cell division proteins (blue), and proteins of unclear function (gray). (C) A histogram of the scaled protein abundances at 105 min shows that the model doubles the abundances of most proteins with only a few outliers including mostly proteins of unclear function and thioredoxin, acyl carrier protein, transcription antitermination factor NusB, aspartyl/glutamyl-tRNA amidotransferase, and transporter ptsH. (D) A histogram of the number of ribosomal proteins generated shows that the model produces approximately 500 of most ribosomal proteins, enough to form the predicted 500 ribosomes. Some ribosomal proteins were produced in large excess including L34 above 4000, S21 near 3000, and L32, L35, S14, and L28 around 2000 each.

### 3.3. Translation

Since the protein half-life of 25 h is much longer than the mRNA half-life of 4 min, proteins will accumulate and only decay significantly by dilution through cell division. The goal of the model was to approximately reproduce the experimental proteomics distribution, double the abundance of each non-ribosomal protein, and produce 500–700 of each ribosomal protein. We compare our distribution of generated proteins over a cell cycle to the experimental proteomics in **Figure 9A**. We approximately reproduce most of the distribution with the greatest deviation being for proteins with fewer than 10 counts in the proteomics data. In the rest of our analysis of non-ribosomal proteins, we focus on proteins with experimental proteomics abundances >10. For further comparison, the number of each non-ribosomal protein generated over a cell cycle is compared to its proteomics value used to initialize the simulations (**Figure 9B**). From the histogram in **Figure 9C** we see that most non-ribosomal proteins double in number over a cell cycle with a few outliers, of which most are proteins of unclear function. The remaining outliers include thioredoxin, acyl carrier protein, transcription antitermination factor NusB, aspartyl/glutamyl-tRNA amidotransferase, and ptsH, all of which are short proteins around 100 amino acids in length or shorter. The histogram of ribosomal protein abundances generated by the model in **Figure 9D** reveals that the model produces 500 copies for the majority of the ribosomal proteins, while the shortest are being overproduced. Ribosomal proteins overproduced include L34 above 4,000, S21 near 3,000, and L32, L35, S14, and L28 above 2,000 each. Ribosomal proteins not generated to an abundance of at least 500 include L1, L3, S3, S5, S2, and L2.

### 3.4. ATP Energy Costs

The model was constructed to estimate the ATP hydrolysis requirements for the genetic information processes in the minimal cell using per bp, nt, or aa usage of ATP in DNA

TABLE 5 | ATP hydrolysis costs of the deterministic model for genetic information processes.


*The ATP cost for transcription reported here only includes the hydrolysis costs of the RNAP, it does not include ATP built into RNA sequences.*

elongation, transcription, translation, mRNA degradation, and protein degradation. The estimates of the ATP hydrolysis cost over a 105 min simulation are presented in **Table 5** as both the total number of ATP used and the corresponding concentration of ATP required for a 400 nm cell. The model predicts the total ATP hydrolysis cost over a cell cycle to be approximately 3,800 mM for JCVI-syn3A. This estimate does not suggest that 3,800 mM of ATP needs to be present in the cell, but provides an estimate for how quickly the metabolism will need to convert ADP into ATP. The most significant of the ATP hydrolysis costs in the genetic information processes comes from translation, requiring 2,900 mM, and the smallest of the costs is for DNA replication at 28 mM. The cost for translation will be higher once the genetic information processes are paired with the metabolism, as this cost did not account for the two ATP hydrolysis events to charge each tRNA, which are included in the essential metabolism (Breuer et al., 2019). The cost for transcription of 500 mM does not include the ATP built into RNA sequences; it only includes the ATP hydrolysis costs of the RNAP. The ATP requirements for mRNA degradation and protein degradation are predicted to be 290 and 90 mM, respectively. The cost for protein degradation is smaller due to the long protein half-life of 25 h relative to the 4 min half-life of messengers.
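The relation between molecule counts and the concentrations quoted above can be sketched as follows, assuming a spherical cell with a 400 nm diameter. The per-process values are those stated in the text; the function names are illustrative.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def cell_volume_liters(diameter_nm: float) -> float:
    """Volume of a spherical cell in liters."""
    radius_m = diameter_nm * 1e-9 / 2.0
    return 4.0 / 3.0 * math.pi * radius_m ** 3 * 1000.0  # m^3 -> L


def mM_to_count(conc_mM: float, diameter_nm: float = 400.0) -> float:
    """Number of molecules corresponding to a concentration in mM."""
    return conc_mM * 1e-3 * AVOGADRO * cell_volume_liters(diameter_nm)


# ATP hydrolysis costs per cell cycle quoted in the text (mM).
atp_cost_mM = {
    "translation": 2900,
    "transcription": 500,
    "mrna_degradation": 290,
    "protein_degradation": 90,
    "dna_replication": 28,
}
total_mM = sum(atp_cost_mM.values())       # 3808, i.e. about 3,800 mM
total_atp = mM_to_count(total_mM)          # on the order of 10^7 to 10^8 events
```

The individual contributions add up to about 3,800 mM, which under the spherical-cell assumption corresponds to several tens of millions of ATP hydrolysis events per cell cycle.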

### 4. DISCUSSION

Our detailed model for the initiation of DNA replication builds upon observations from crystal structures of the initiator protein DnaA bound to signatures on ds- and ssDNA found near the oriC and smFRET measurements of the DnaA filament formation on ssDNA. The time taken for DNA replication initiation is predicted to vary from <5 min up to 50 min. We predict a total time of 65 min on average for the formation of the second copy of the genome, which means at least one copy of the DNA can be generated in a cell cycle.

The average number of any mRNA is within the expected range from zero to ten as reported in E. coli (Milo and Phillips, 2015) and can be used as predictions for mRNA counts in JCVI-syn3A until transcriptomic data or smFISH experiments are available for validation. We predict that approximately 450 messengers will be present in the cell on average, agreeing with the extrapolated number for a 400 nm diameter cell from Lynch and Marinov (2015). In our previous treatments of replication and transcription of a given gene in E. coli (Peterson et al., 2015; Cole and Luthey-Schulten, 2017) we showed how the variation in DNA copy number and position of the gene in circular DNA can broaden the mRNA distribution. We are likely underestimating the distributions for genes close to oriC and overestimating the distributions for genes near the terminus. In the case of rRNA, a higher transcription rate generated a sufficient number of rRNA to form 500–700 ribosomes in a cell cycle. A higher transcription rate was justified from the greater promoter strength of the rRNA operon observed in E. coli and other bacteria (Maeda et al., 2015) as well as the presence of multiple RNAPs estimated to be reading the operon (Bremer and Dennis, 2008). While the model produces over 500 rRNAs, there is variation in the number of ribosomal proteins. For the majority of the ribosomal proteins, approximately 500 of each were generated. However, the long ribosomal proteins were not generated quickly enough and the shorter ribosomal proteins occurred in much higher numbers. This is likely due to no promoter strength being assigned to the transcription of genes coding for ribosomal proteins. In the case of non-ribosomal proteins where we assigned promoter strengths based on proteomics counts, our model, for the most part, approximately doubles the number of proteins over a cell cycle. 
Identification of the promoter sequences and operonal structures for genes in JCVI-syn3A would help assign variation in promoter strengths and transcription rates on the basis of genomic information rather than proteomics values.

The simplified kinetic models for the genetic information processing reactions in the minimal cell JCVI-syn3A neglected the explicit assembly of the protein complexes that replicate DNA (replisome), transcribe the genes, and translate the mRNA and instead focused on the "polymerization" reactions that replicated the DNA, transcribed the genes into mRNAs, and translated them into proteins and how they are coupled. In some cases, this neglect can be justified by assumed timescale separation of the processes, but in general more experimental measurements of the assembly reactions would help to establish to what degree the association of the complexes is captured in the kinetic parameters given in the literature for the fundamental processes of replication, transcription, and translation. As the next step, the results from the genetic information processes will first be connected to uptake reactions that transport nucleobases, nucleosides, and amino acids into the minimal cell. Coupling genetic information processes with the essential metabolism and cell growth should result in a complete whole cell kinetic model of JCVI-syn3A.

### DATA AVAILABILITY STATEMENT

The Jupyter notebooks containing the models in this study can be found at https://github.com/zanert2/Thornburg\_FrontMolBiosci\_2019.

### AUTHOR CONTRIBUTIONS

ZT and ZL-S: developed models for genetic information processes, data curation, writing—original draft, and writing reviewing and editing. MM: assistance in writing of Jupyter notebooks. DB and TB: advised Lattice Microbes interface for the stochastic model. HS: assisted in development of DNA replication initiation model. CC: constructed initial stochastic model of replication initiation. MB: data curation. CH and JG: reviewing.

### REFERENCES


### FUNDING

Partial support from NSF MCB 1818344 and 1840320, The Center for the Physics of Living Cells NSF PHY 1430124, NSF PHY 1505008, and NSF REU 1659598.

### ACKNOWLEDGMENTS

The authors thank Tyler Earnest for help with the gene map software.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Thornburg, Melo, Bianchi, Brier, Crotty, Breuer, Smith, Hutchison, Glass and Luthey-Schulten. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Accurate Biomolecular Simulations Account for Electronic Polarization

#### Josef Melcr <sup>1</sup> \* † and Jean-Philip Piquemal <sup>2,3,4</sup> \* †

<sup>1</sup> Groningen Biomolecular Sciences and Biotechnology Institute and the Zernike Institute for Advanced Materials, University of Groningen, Groningen, Netherlands, <sup>2</sup> Laboratoire de Chimie Théorique, Sorbonne Université, UMR7616 CNRS, Paris, France, <sup>3</sup> Institut Universitaire de France, Paris, France, <sup>4</sup> Department of Biomedical Engineering, The University of Texas at Austin, Austin, TX, United States

In this perspective, we discuss where and how accounting for electronic many-body polarization affects the accuracy of classical molecular dynamics simulations of biomolecules. While the effects of electronic polarization are highly pronounced for molecules with an opposite total charge, they are also non-negligible for interactions with overall neutral molecules. For instance, neglecting these effects in important biomolecules like amino acids and phospholipids affects the structure of proteins and membranes, with a large impact on the interpretation of experimental data as well as on building coarse-grained models. With the combined advances in theory, algorithms and computational power it is currently realistic to perform simulations with explicit polarizable dipoles on systems with relevant sizes and complexity. Alternatively, the effects of electronic polarization can also be included at zero additional computational cost compared to standard fixed-charge force fields using the electronic continuum correction, as was recently demonstrated for several classes of biomolecules.

Keywords: molecular dynamics simulations, electronic polarization, electronic continuum correction, biomolecules, phospholipids, amino acids, nucleic acids, ions

#### Edited by:

Valentina Tozzini, Nanosciences Institute, National Research Council, Italy

#### Reviewed by:

Sebastien Fiorucci, University of Nice Sophia Antipolis, France

Matteo Tiberti, Danish Cancer Society Research Centre (DCRC), Denmark

#### \*Correspondence:

Josef Melcr j.melcr@rug.nl

Jean-Philip Piquemal jean-philip.piquemal@sorbonne-université.fr

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 09 September 2019 Accepted: 20 November 2019 Published: 04 December 2019

#### Citation:

Melcr J and Piquemal J-P (2019) Accurate Biomolecular Simulations Account for Electronic Polarization. Front. Mol. Biosci. 6:143. doi: 10.3389/fmolb.2019.00143

In molecular dynamics simulations, the interactions between molecules are described with approximate potentials known as force fields that mimic the true Born-Oppenheimer energy hypersurface. Among these methods, pairwise additive potentials are very popular for modeling biomolecules such as proteins, lipids or nucleic acids (Ponder and Case, 2003; Lopes et al., 2015). The current standard force fields (Huang and MacKerell, 2013; Maier et al., 2015; Robertson et al., 2015), however, neglect important physical many-body effects such as the electronic polarization, charge transfer, or many-body dispersion (cited in decaying magnitude order) (Kleshchonok and Tkatchenko, 2018). Although such models have provided valuable insight into many phenomena from various fields including biology, chemistry, biophysics, or material sciences, there are several important cases in which accounting for polarizability is crucial.

### PITFALLS OF NON-POLARIZABLE FORCE FIELDS

The limited predictive accuracy of non-polarizable force fields led the molecular modeling community to develop new generation "polarizable" force fields (Gresh et al., 2007; Jorgensen, 2007; Stone, 2013; Shi et al., 2015; Piquemal and Jordan, 2017; Kleshchonok and Tkatchenko, 2018; Martinek et al., 2018; Melcr et al., 2018, 2019; Antila et al., 2019; Jing et al., 2019) able to include the missing physics with a special focus on the polarizability effects. Although such techniques are now widely used in fields studying highly charged ionic liquids (Bedrov et al., 2019), their application cannot be limited only to such extreme cases. For instance, neglecting the effects of the electronic polarizability in important biomolecules like amino acids, nucleic acids, and phospholipids affects the structure of proteins (Jiao et al., 2008; Shi et al., 2009; Duboué-Dijon et al., 2018a), DNA (Babin et al., 2006), and membranes (Harder et al., 2009; Catte et al., 2016; Melcr et al., 2018) having a large impact on interpreting experimental data (Hauser et al., 1977; Eisenberg et al., 1979; Kurland et al., 1979; Feigenson, 1986; Mattai et al., 1989; Roux and Bloom, 1990, 1991; Böckmann and Grubmüller, 2004; Lund et al., 2008; Vacha et al., 2009; Berkowitz et al., 2012; Melcrová et al., 2016; Javanainen et al., 2017; Magarkar et al., 2017) as well as building coarse grained models. Importantly, these results show that the electronic polarization yields non-negligible effects also at overall neutral molecules (Gresh et al., 2007; Melcr et al., 2018).

Secondary structure of proteins is to a large extent determined by an intricate network of hydrogen bonds. The description of hydrogen bonds in standard force fields, however, does not contain important contributions, e.g., from polarization and partially covalent character (Babin et al., 2006). It was demonstrated in many cases including structure of water (Dang, 1998), binding of ligands to proteins (Friesner, 2005; Jiao et al., 2008), and protein folding and unfolding (Morozov et al., 2006; Freddolino et al., 2010; Piana et al., 2011, 2014; Huang and MacKerell, 2014; Lemkul et al., 2016; Célerse et al., 2019) that polarizability contributes significantly to the accuracy of simulations of structures with hydrogen bonds. Also, salt bridging between amino acids is likely overestimated in strength when the effects of polarization are not included (Friesner, 2005; Vazdar et al., 2013; Debiec et al., 2014; Ahmed et al., 2018; Célerse et al., 2019; Mason et al., 2019). For instance, the interaction of acidic side chains of glutamate and aspartate with cations is overestimated in strength in classical non-polarizable force fields (Patel et al., 2009; Duboué-Dijon et al., 2018a), while treatment of polarizability in solvent relaxation affects salt bridge dissociation (Célerse et al., 2019). Taken together, the secondary and tertiary structural arrangements in the simulations of proteins are likely biased to certain preferred configurations due to the lack of polarizability depending on the chosen parametrization strategy (Freddolino et al., 2009, 2010; Piana et al., 2011, 2014).

Membrane proteins form a large part of the cellular proteome and are in direct contact with amphiphilic cellular membranes, which influence their structure and activity (Lee, 2004). Membranes themselves are crucial cell organelles which define the inner and outer cellular environments. They are predominantly composed of amphiphilic phospholipids, which self-assemble into stable bilayer structures (Harayama and Riezman, 2018). The force fields for phospholipids have been tuned to the level that the simulations of commonly used simplified model lipid membranes can reproduce a large variety of experimentally measured properties, phenomena and structural features including lipid self-diffusion, x-ray scattering patterns, bilayer thickness, area per phospholipid, and acyl chain order parameters (Pluhackova et al., 2016).

This could give the impression that the currently available non-polarizable lipid force fields provide comparable accuracy to the models with explicit polarization at a fraction of the computational cost. While the non-polarizable models yield accurate results in many cases (Lucas et al., 2012; Chowdhary et al., 2013a), simulation studies have revealed that such models gradually lose their predictive accuracy with increasing complexity beyond model systems used during their parametrization, e.g., when membranes are put into contact with buffers of physiological ionic strengths (Catte et al., 2016). For instance, improvements in the electrostatics of phospholipid membranes have a great impact on the membrane dipole potential, permeation of water through membranes, and viscosity of organic liquids (Harder et al., 2009; Venable et al., 2019). Moreover, the interactions between phospholipids and cations, especially divalent cations like calcium, are overestimated in the classical non-polarizable models (Catte et al., 2016; Melcr et al., 2018; Antila et al., 2019).

In general, the structure of divalent cation complexes, which are widespread in biosystems, is traditionally problematic in non-polarizable simulations (Kohagen et al., 2015). In contrast, simulations with explicit or implicit treatment of polarization yield comparable accuracy to DFT-based ab-initio calculations and neutron scattering experiments, as was demonstrated for biologically relevant divalent cations Ca<sup>2+</sup> and Mg<sup>2+</sup> (Piquemal et al., 2006b; Wu et al., 2010; Martinek et al., 2018). While accounting for the electronic polarization improves the predictive accuracy of simulations in general, it is not sufficient in some cases like zinc chloride ion pairing, where more complex physics beyond "mere" electronic polarization is at play (Gresh et al., 2005, 2007; Piquemal et al., 2007; Duboué-Dijon et al., 2018b).

### IMPLICIT TREATMENT OF ELECTRONIC POLARIZATION VIA ELECTRONIC CONTINUUM CORRECTION

The necessity of polarizability and screening in modeling lipid bilayers has been an issue from the very beginning of computational modeling of model membranes. The first pioneering works on phospholipid bilayers document the need to include polarizability and extra screening in the development of the first models, which was achieved at that time through an empirical scaling factor for the partial atomic charges of the phospholipids (Egberts et al., 1994). A similar strategy supported by continuum theory was used in the recent developments of phospholipid force fields, which implicitly account for the electronic polarization using the electronic continuum correction (ECC) (Leontyev and Stuchebrukhov, 2009, 2010a; Mason et al., 2012; Pegado et al., 2012; Pluhařová et al., 2013; Martinek et al., 2018). Despite the approximate treatment of the polarizability using ECC, such lipid force fields provide accurate interactions between phospholipid bilayers and cations in agreement with experiments (Melcr et al., 2018). In particular, in the case of the neutral phosphatidylcholine (PC), ECC improved the cation binding affinity for monovalent and divalent cations, reaching agreement with experiments (Melcr et al., 2018), while for negatively charged phosphatidylserine (PS) it also improved the overall structure of the phospholipid and the interactions with other lipids (Antila et al., 2019; Melcr et al., 2019).

Electronic continuum correction is a very efficient alternative to the otherwise computationally demanding explicit modeling of electronic polarization (Bedrov et al., 2019). The ECC method was shown to yield promising results for several polar organic solvents (Leontyev and Stuchebrukhov, 2010b, 2012; Lee and Park, 2011; Vazdar et al., 2013), and it proved to be both necessary and sufficient for an accurate description of the structure of several monovalent and divalent ions in aqueous solutions (Mason et al., 2012; Pegado et al., 2012; Pluhařová et al., 2013). To date, the array of force fields utilizing ECC has grown from a wide range of biologically relevant ions (Kohagen et al., 2014, 2015; Martinek et al., 2018) to protein moieties (Vazdar et al., 2013; Duboué-Dijon et al., 2018a; Mason et al., 2019) and whole phospholipid molecules (Melcr et al., 2018), making realistic simulations of, e.g., membrane proteins at physiological ionic conditions possible.

In ECC, all particles are assumed to have equal polarizabilities, and the electric field and electron density within each particle are assumed to be homogeneous (Leontyev and Stuchebrukhov, 2009). Such approximations simplify the calculation of the polarization to such an extent that it can be included in the interactions simply as a pre-determined charge-scaling factor (Leontyev and Stuchebrukhov, 2009), which is derived from the high-frequency dielectric constant of electrons, ε<sub>el</sub>, as 1/√ε<sub>el</sub> ≈ 0.75 for aqueous solutions. Importantly, ε<sub>el</sub> is close to 2 for a wide variety of biologically relevant environments, meaning that even interfaces like biological membranes do not give rise to large gradients. Despite the coarseness of the approximations, the effects of electronic polarization are described sufficiently well for a variety of biologically relevant molecules in a condensed phase (Duboué-Dijon et al., 2018a,b; Martinek et al., 2018; Melcr et al., 2018). Moreover, ECC accounts for the effects of electronic polarization at zero additional computational cost compared to standard fixed-charge force fields. Although a new generation of simulation codes performing large-scale simulations with explicit polarization models is starting to emerge (Lagardère et al., 2018), ECC offers the benefit of employing the widely adopted and already highly optimized codes for classical MD.
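The charge-scaling recipe described above can be illustrated with a minimal sketch. It assumes ε<sub>el</sub> ≈ 1.78 for water (the square of the optical refractive index, a value not stated in the text), and the function names and the example charges are hypothetical; which charges are actually scaled in a given ECC force field is a parametrization choice that this sketch does not address.

```python
import math


def ecc_scaling_factor(eps_el: float) -> float:
    """Return the ECC charge-scaling factor 1/sqrt(eps_el)."""
    if eps_el <= 0.0:
        raise ValueError("eps_el must be positive")
    return 1.0 / math.sqrt(eps_el)


def scale_charges(charges, eps_el):
    """Scale a list of charges (in units of e) by the ECC factor."""
    factor = ecc_scaling_factor(eps_el)
    return [q * factor for q in charges]


# Assumed value for water: eps_el ~ n^2 with refractive index n ~ 1.333.
EPS_EL_WATER = 1.78

factor = ecc_scaling_factor(EPS_EL_WATER)  # close to the 0.75 quoted in the text

# Hypothetical example: formal charges of one Ca2+ and two Cl- ions.
scaled = scale_charges([+2.0, -1.0, -1.0], EPS_EL_WATER)

# Any Coulomb pair interaction between two scaled charges is reduced by
# factor**2 = 1/eps_el relative to the unscaled charges.
pair_reduction = factor ** 2

print(f"scaling factor: {factor:.3f}")
print(f"scaled charges: {[round(q, 3) for q in scaled]}")
print(f"pair interaction reduced to {pair_reduction:.2f} of its unscaled value")
```

Because the factor is fixed before the simulation starts, the scaled charges enter a standard fixed-charge MD code unchanged, which is why the approach adds no computational cost.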

The common implementation of ECC via charge rescaling closely resembles an empirical scaling factor, which, obviously, reduces the interactions of charged molecules. However, from both the derived ECC theory (Leontyev and Stuchebrukhov, 2010a) and its applications, which also compare ECC to other methods (Pegado et al., 2012; Martinek et al., 2018), it is clear that the improvements pertinent to ECC can be attributed to the electronic polarization. For instance, interactions of sulfate anions were directly compared between simulations with ECC, the solvent shell model (Rick and Stuart, 2003), and ab-initio calculations (Pegado et al., 2012). This comparison revealed that ECC performed comparably well to the other methods at a fraction of the computational cost. Moreover, ECC was found to be preferable over the explicit solvent shell model for sulfate anions, as it was closer to the structures from ab-initio calculations (Pegado et al., 2012).

### CAPTURING EFFECTS BEYOND ELECTRONIC POLARIZATION

The accuracy of the implicit methods, including ECC, is limited and gradually becomes inadequate in cases that do not adhere to the assumed approximations. For instance, the complex electronic structure of Zn<sup>2+</sup> makes it difficult to capture the ion pairing of zinc chloride with ECC unless specific ad hoc interaction terms between the ions are introduced (Duboué-Dijon et al., 2018b). Hence, resorting to more accurate modeling strategies including explicit polarizable dipoles, or even effects beyond electronic polarization, becomes necessary in such cases.

The AMOEBA force field with explicit polarizable dipoles correctly reproduces the water structure around Zn<sup>2+</sup> in bulk solution and its free energy of hydration; however, it still does not capture the fine details of zinc chloride ion pairing. The reason is that Zn<sup>2+</sup> exhibits considerable charge transfer effects, prefiguring what happens with transition metals, where back-donation effects become important (Gresh et al., 2005, 2007; Piquemal et al., 2007). Simulations then need to utilize more complex polarizable force fields able to separately evaluate the different physical contributions. Indeed, short-range electrostatics in such systems is anything but classical, as it is strongly affected by quantum penetration effects in the overlap region (Piquemal et al., 2003, 2006a; Gresh et al., 2005, 2007; Wang et al., 2015). Moreover, many-body polarization interactions, which are usually cooperative (i.e., the total energy is larger than the sum of the purely additive contributions), do not behave in such a way here (Gresh et al., 2007, 2016; Zhang et al., 2012; Jing et al., 2018). Divalent metal cations in particular locally reverse the physical trends and exhibit net anticooperativity, as the total energy becomes smaller than the sum of individual contributions. For example, SIBFA (Sum of Interactions Between Fragments Ab initio computed) incorporates a many-body explicit charge transfer (Gresh et al., 2005, 2007; Piquemal et al., 2007) and a penetration correction for electrostatics (Piquemal et al., 2003; Narth et al., 2016), and is able to deal with such difficult systems.
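The difference between cooperative and anticooperative polarization can be made concrete with a deliberately simple toy model of isotropic point-polarizable sites. This is a sketch in reduced units (Coulomb constant set to 1), not SIBFA or AMOEBA, and it ignores charge transfer and penetration entirely; the function names and parameter values are hypothetical. In the first case a cation sits between two ligands, so the two induced dipoles point away from each other and partly depolarize one another. In the second case two sites lie head-to-tail along a uniform field and reinforce each other.

```python
def cation_with_two_ligands(q, alpha, r):
    """Cation of charge q at the origin, two polarizable sites at +r and -r.

    Returns (additive, total): the sum of the two single-ligand polarization
    energies and the self-consistent polarization energy of the full complex.
    """
    e0 = q / r**2                        # ion field at each site, pointing outward
    single = -0.5 * alpha * e0**2        # polarization energy of one ligand alone
    # Each outward dipole mu creates a field -mu / (4 r^3) at the other site,
    # so mu = alpha * (e0 - mu / (4 r^3)) has the closed-form solution below.
    mu = alpha * e0 / (1.0 + alpha / (4.0 * r**3))
    total = -mu * e0                     # -1/2 * sum_i mu_i * E0_i over two sites
    return 2.0 * single, total


def chain_in_uniform_field(e0, alpha, d):
    """Two polarizable sites a distance d apart along a uniform field e0."""
    coupling = 2.0 * alpha / d**3
    if coupling >= 1.0:
        raise ValueError("sites too close: polarization catastrophe")
    single = -0.5 * alpha * e0**2
    # Head-to-tail dipoles add a field +2 mu / d^3 at the neighbouring site.
    mu = alpha * e0 / (1.0 - coupling)
    total = -mu * e0
    return 2.0 * single, total


additive_ion, total_ion = cation_with_two_ligands(q=2.0, alpha=1.0, r=2.0)
additive_chain, total_chain = chain_in_uniform_field(e0=0.5, alpha=1.0, d=3.0)

# Anticooperative: the full complex is stabilized less than the additive sum.
print(f"cation + 2 ligands: additive {additive_ion:.4f}, total {total_ion:.4f}")
# Cooperative: the full system is stabilized more than the additive sum.
print(f"head-to-tail chain: additive {additive_chain:.4f}, total {total_chain:.4f}")
```

Only the sign of the dipole-dipole coupling differs between the two functions, which is enough to flip the many-body contribution from stabilizing to destabilizing.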

Such effects also exist with variable magnitude in biomolecular simulations, and resorting to more accurate methods employing physics even beyond explicit polarization will likely be required for predictive accuracy in many cases, e.g., metalloproteins, which should be interesting playgrounds for such modeling (Gresh et al., 2007, 2016; Zhang et al., 2012; Jing et al., 2018). Improving the description of the underlying physics is a general trend in current developments, and besides SIBFA, the AMOEBA force field is gradually evolving into the AMOEBA+ potential, which additionally includes such physical effects (Liu et al., 2019). Moreover, several other general polarizable potentials are emerging (Huang et al., 2017; Das et al., 2019; Rackers and Ponder, 2019), indicating the start of next-generation polarizable force field development (Piquemal et al., 2006a; Duke et al., 2014; Piquemal and Cisneros, 2016).

### ARE POLARIZABLE SIMULATIONS COMPUTATIONALLY TRACTABLE?

That being said, the question remains: can such advanced models realistically be applied to meaningfully large simulations of biologically relevant systems? Certainly yes. Although the use of polarizable models was hampered by their computational cost for years, the situation has improved dramatically. In terms of computational requirements, the approaches utilizing Drude particles (Lopes et al., 2013) traditionally appeared more feasible than explicit point dipole approaches (Lipparini et al., 2014; Lagardère et al., 2015), as their computational cost in standard high-performance codes was higher than that of non-polarizable force fields by a factor of 2–4, depending on implementation and reference settings (Jiang et al., 2011), while the explicit point dipole models were roughly twice as slow. However, Drude-type models cannot utilize long time steps because of their use of an extended Lagrangian, which in practice imposes a speed limit (Wang and Skeel, 2005; Albaugh and Head-Gordon, 2017). In contrast, explicit point dipole approaches can use advanced algorithms for solving polarization and for dynamical integration, leading to substantial speed-ups that bring their performance relative to standard non-polarizable simulations to the level of the Drude approach, even for higher-level multipolar electrostatics approaches such as AMOEBA (Lagardère et al., 2019). However, the numerous available point dipole force fields (AMOEBA, SIBFA, etc.) had in practice another handicap besides their computational cost: they were not available in high-performance production codes such as GROMACS or NAMD (Phillips et al., 2002; Van Der Spoel et al., 2005).

This situation has gradually changed in recent years. First, in connection with improved multi-timestep integration, the key mathematical problem of solving the point dipole equations using iterative methods was alleviated by new non-iterative approaches such as the Truncated Conjugate Gradient (TCG-1) (Aviat et al., 2017a,b), which allows for a fixed-cost evaluation of polarization. When coupled to an analytical evaluation of the gradient, such an approach fully preserves energy and, hence, allows for long time step simulations. Second, the availability of massively parallel MPI codes able to use supercomputers efficiently through 3D domain decomposition techniques, such as Tinker-HP (Lagardère et al., 2018; Jolly et al., 2019) [the high-performance engine of the Tinker molecular package (Rackers et al., 2018)], has shed the first rays of light at the end of the tunnel leading toward simulations of large, biologically relevant systems on sufficiently long timescales with explicit polarization. Moreover, GPU-accelerated implementations of AMOEBA in OpenMM (Huang et al., 2018) and Tinker-OpenMM (Harger et al., 2017) are available, whereas support for hybrid (multi)CPU-GPU setups is coming in Tinker-HP (O. Adjoua et al., personal communication).
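The idea of replacing an open-ended iterative solve by a fixed number of conjugate gradient steps can be sketched on a toy linear system. The example below is an illustration only: the 3×3 matrix stands in for the polarization matrix (inverse polarizabilities on the diagonal minus the dipole-dipole coupling), the right-hand side stands in for the permanent field, and all numbers and helper names are hypothetical. The published TCG approach additionally uses closed-form expressions and analytical gradients, which this generic truncated solver does not reproduce.

```python
def matvec(a, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def truncated_cg(a, b, x0, n_steps):
    """Run exactly n_steps conjugate gradient iterations on a x = b.

    The cost is fixed by n_steps rather than by a convergence threshold.
    """
    x = list(x0)
    r = [bi - ci for bi, ci in zip(b, matvec(a, x))]
    p = list(r)
    rr = dot(r, r)
    for _ in range(n_steps):
        if rr == 0.0:          # already exact; nothing left to correct
            break
        ap = matvec(a, p)
        step = rr / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def functional(a, b, x):
    """Quadratic functional 1/2 x.A.x - b.x that CG minimizes."""
    return 0.5 * dot(x, matvec(a, x)) - dot(b, x)


def residual_norm(a, b, x):
    r = [bi - ci for bi, ci in zip(b, matvec(a, x))]
    return dot(r, r) ** 0.5


# Hypothetical symmetric positive definite "polarization matrix" and field.
A = [[1.00, 0.10, 0.05],
     [0.10, 1.00, 0.10],
     [0.05, 0.10, 1.00]]
E0 = [1.0, 0.5, -0.2]

mu_guess = [e / A[i][i] for i, e in enumerate(E0)]   # uncoupled "direct field" guess
mu_one_step = truncated_cg(A, E0, mu_guess, n_steps=1)
mu_converged = truncated_cg(A, E0, mu_guess, n_steps=3)

for label, mu in [("guess", mu_guess), ("1 step", mu_one_step), ("3 steps", mu_converged)]:
    print(f"{label:8s} residual = {residual_norm(A, E0, mu):.2e}")
```

A single step already lowers the quadratic functional relative to the starting guess, and for this 3×3 example three steps reach the exact solution up to round-off.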

Overall, methodology has made key progress and will continue in this direction for all types of polarizable force fields as the accessible computer power quickly increases, thereby reducing the computational gap with additive potentials. Whereas specialized, highly accurate water potentials based on many-body expansions, such as MBPOL (Riera et al., 2019), are emerging and allow for a better understanding of fine physical effects in clusters and bulk water, the availability of general polarizable force fields such as AMOEBA, offering water (Ren and Ponder, 2003), ions, organochlorine compounds (Mu et al., 2014), proteins, and nucleic acids (Shi et al., 2013; Zhang et al., 2018), now enables sufficient sampling to achieve highly accurate and biologically meaningful simulations. The parametrization of Drude approaches is expanding as well (Lamoureux et al., 2003; Chowdhary et al., 2013a,b; Lopes et al., 2013). Moreover, accelerated sampling methods are starting to be applied to polarizable approaches as well (Célerse et al., 2019), offering improved simulation capabilities and access to accurate and fast evaluation of free energies of binding thanks to GPUs (Harger et al., 2017). Such capabilities make it possible to tackle hard systems, as in the case of the phosphate binding mode of the phosphate-binding protein, where free energy computations highlighted the critical effect of the buffer solution, ending a long-standing controversy (Qi et al., 2018).

### SUMMARY: BIOMOLECULAR SIMULATIONS OF THE FUTURE ARE POLARIZABLE

In summary, we have presented several important classes and case studies of biomolecules where including polarizability is an important factor for simulation accuracy. The cytosolic environment in cells is mostly composed of aqueous solutions of ions, for which polarizability is necessary for the accurate description of the solvation structure of ions, their pairing, and their interaction with other biomolecules (Piquemal et al., 2006b; Wu et al., 2010; Mason et al., 2012; Pegado et al., 2012; Pluhařová et al., 2013; Duboué-Dijon et al., 2018a,b; Martinek et al., 2018; Melcr et al., 2018). Polarizability is an important factor for accurate interactions between amino acids, notably salt bridges, whose strength is overestimated in current non-polarizable force fields (Friesner, 2005; Vazdar et al., 2013; Ahmed et al., 2018; Célerse et al., 2019; Mason et al., 2019). Moreover, polarizable force fields yield a better description of the hydrophobic effect and hydrogen bond networks in proteins, which to a large extent determine the dynamic structure and conformational changes of proteins (Dill et al., 1995; García-Moreno et al., 1997; Fitch et al., 2002; Morozov et al., 2006; Freddolino et al., 2010; Piana et al., 2011, 2014; Huang and MacKerell, 2014; Lemkul et al., 2016; Célerse et al., 2019; Venable et al., 2019). Polarizability is necessary for accurate structure and interactions of both neutral and charged phospholipids, which constitute a dominant part of cellular membranes (Harder et al., 2009; Catte et al., 2016; Melcr et al., 2018).

The representation of electronic polarization in classical MD simulations can vary widely, with Drude and induced point dipole approaches on one side and continuum approximations on the other (Cieplak et al., 2009; Lopes et al., 2009; Leontyev and Stuchebrukhov, 2011; Schröder, 2012; Baker, 2015; Shi et al., 2015; Lemkul et al., 2016; Bedrov et al., 2019; Jing et al., 2019). With the advances in computational power as well as in theory and algorithms, it is practically achievable to perform simulations with explicit polarizable dipoles on systems of relevant size and complexity (Qi et al., 2018; Bedrov et al., 2019; Lagardère et al., 2019; Loco et al., 2019). In particular, it is currently realistic to perform simulations with explicit polarization at time scales that are competitive with standard fixed-charge simulations (Lemkul et al., 2016; Lagardère et al., 2018; Célerse et al., 2019). Moreover, advanced polarizable potentials (e.g., SIBFA, AMOEBA+) including effects even beyond electronic polarization are being actively developed to tackle systems with complex structure like metalloproteins, kinases, or ribozymes (Gresh et al., 2007, 2016; Zhang et al., 2012; Jing et al., 2018, 2019; Das et al., 2019; Liu et al., 2019; Rackers and Ponder, 2019). Also, approximate implicit solutions like ECC, which circumvent the computational costs of explicit polarization, are gradually gaining popularity and provide a promising solution for a variety of applications in biomolecular simulations (Duboué-Dijon et al., 2018a,b; Martinek et al., 2018; Melcr et al., 2018, 2019; Mason et al., 2019). Finally, as fully variational polarizable embeddings are now possible in hybrid QM/MM molecular simulations (Loco et al., 2016, 2017, 2019), one can expect that hybrid explicit polarization/ECC simulations will be possible in the near future, offering a multilevel global treatment of polarization across very large, complex molecular systems.

Biomolecules in the real world cannot turn off their polarizability. Hence, molecular dynamics simulations that aim to give realistic, robust, and predictive results cannot afford to neglect this important contribution to the electrostatic interaction. Currently, polarizable force fields for a large variety of biomolecules and simulation codes implementing polarizability exist and are readily available to solve various biophysical problems (Wu et al., 2010; Chowdhary et al., 2013a; Lemkul et al., 2016; Duboué-Dijon et al., 2018a,b; Lagardère et al., 2018; Martinek et al., 2018; Melcr et al., 2018; Zhang et al., 2018; Bedrov et al., 2019; Célerse et al., 2019; Jing et al., 2019; Liu et al., 2019). We expect that the popularity of such approaches will grow and that they will become a common tool in biomolecular research in the near future.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/supplementary material.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 810367), project EMC2.

### ACKNOWLEDGMENTS

We thank the organizers of the CECAM workshop 2019 Multiscale Modeling from Macromolecules to Cell: Opportunities and Challenges of Biomolecular Simulations for their kind invitation to this special research topic.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Melcr and Piquemal. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Molecular Dynamics Simulations of a Chimeric Androgen Receptor Protein (SPARKI) Confirm the Importance of the Dimerization Domain on DNA Binding Specificity

### Edited by:

Giulia Palermo, University of California, Riverside, United States

#### Reviewed by:

Francesco Delfino, INSERM U1054 Centre de Biochimie Structurale de Montpellier, France; Stephen Daniel Levene, The University of Texas at Dallas, United States

#### \*Correspondence:

Mahdi Bagherpoor Helabad, mbagerpoor@zedat.fu-berlin.de; Petra Imhof, petra.imhof@fu-berlin.de

#### †Present address:

Petra Imhof, Department of Chemistry, Bioscience, and Environmental Engineering, University of Stavanger, Stavanger, Norway

### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 02 October 2019; Accepted: 10 January 2020; Published: 31 January 2020

#### Citation:

Bagherpoor Helabad M, Volkenandt S and Imhof P (2020) Molecular Dynamics Simulations of a Chimeric Androgen Receptor Protein (SPARKI) Confirm the Importance of the Dimerization Domain on DNA Binding Specificity. Front. Mol. Biosci. 7:4. doi: 10.3389/fmolb.2020.00004

#### Mahdi Bagherpoor Helabad\*, Senta Volkenandt and Petra Imhof\* †

Department of Physics, Freie Universität Berlin, Berlin, Germany

The DNA binding domains of Androgen/Glucocorticoid receptors (AR/GR), members of class I steroid receptors, bind as a homo-dimer to a cis-regulatory element. These response elements are arranged as an inverted repeat (IR) of the hexamer "AGAACA", separated by a 3 base pair spacer. DNA binding domains of the Androgen receptor, AR-DBDs, in addition, selectively recognize a direct-like repeat (DR) arrangement of this hexamer. A chimeric AR protein, termed SPARKI, in which the second zinc-binding motif of AR is swapped with that of GR, however, fails to recognize DR-like elements. By molecular dynamics simulations, we identify how the DNA binding domains of the wild type AR/GR, and also the chimeric SPARKI model, distinctly interact with both IR and DR response elements. AR binds more strongly to DR than GR binds to IR elements. A SPARKI model built from the structure of the AR (SPARKI-AR) shows significantly fewer hydrogen bond interactions in complex with a DR sequence than with an IR sequence. Moreover, a SPARKI model based on the structure of the GR (SPARKI-GR) shows a considerable distortion in its dimerization domain when complexed to a DR-DNA, whereas it remains in a stable conformation in a complex with an IR-DNA. The diminished interaction of SPARKI-AR with, and the instability of SPARKI-GR on, DR response elements agree with SPARKI's lack of affinity for these sequences. The more GR-like binding specificity of the chimeric SPARKI protein is further emphasized by both SPARKI models binding even more strongly to IR elements than observed for the DNA binding domain of the GR.

Keywords: androgen receptor, glucocorticoid receptor, response element, protein-DNA interaction, chimeric SPARKI protein

### 1. INTRODUCTION

Steroid receptors (SRs), a subfamily of nuclear receptors, are ligand-activated transcription factors that bind to a specific DNA target sequence in order to enhance or repress gene transcription (Evans, 1988; Corson, 2005; Bunce and Campbell, 2010).

Members of the SR family, i.e., Androgen receptor (AR), Glucocorticoid receptor (GR), Mineralocorticoid receptor (MR), and Progesterone receptor (PR), bind as a homo-dimer to consensus 15 base pair (bp) palindromic DNA sequences, termed classical response elements (CREs) (Ham et al., 1988). The DNA of CREs is organized as an inverted repeat (IR) of the hexamer "AGAACA", separated by a 3 bp DNA sequence called the spacer (Beato et al., 1995) (**Figures 1B,D**). Among the CREs, the first hexamer (HS1) elements are almost invariant and are therefore suggested to be high-affinity DNA sequences for receptor binding (La Baer and Yamamoto, 1994). The DNA binding domain (DBD) of the proteins, which includes about 70 amino acid residues, contains two vital subdomains, each identified by a zinc ion that is coordinated by four cysteine residues. The first subdomain includes an α-helix, termed H1, which is responsible for protein-DNA major groove interactions. The second subdomain holds a loop domain, termed Dim, which is responsible for protein-protein dimerization (Luisi et al., 1991; Kumar and Thompson, 1999) (see **Figures 1A,D**). A flexible loop, named the lever arm, connects these subdomains to each other (**Figure 1D**).

Steroid receptors show high structural conservation and share almost identical DNA response elements, allowing these response elements to be functionally substituted (Arora et al., 2013). For instance, a response element that corresponds to the androgen receptor might function for glucocorticoid receptor activation and vice versa. Recent studies have shown that AR and GR share about one third of their response binding sites (Zhang et al., 2018). Still, androgen response elements (AREs) are recognized only by AR and not by GR (Schoenmakers et al., 1999; Claessens et al., 2001; Moehren et al., 2008). The AREs are arranged as a direct-like repeat (DR) "TGTTCT" of the hexamer "AGAACA" (see **Figure 1B**) and are also separated by a 3 bp spacer (Haelens et al., 2003). In 2004, Shaffer et al. crystallized the only structure of AR(DBD) in complex with a DR response element, in which an unexpected head-to-head conformation was revealed (Shaffer et al., 2004). This structure of AR-DR indicates additional hydrogen-bond interactions of residue S580, which is not present in GR, in each monomer with its counterpart in the other monomer. These interactions have been discussed as a potential stabilization of the unexpected head-to-head arrangement in the AR-DR complex (Verrijdt et al., 2003; Shaffer et al., 2004).

Studies have shown that AR activity varies depending on the bound response elements, i.e., DR or IR (Geserick et al., 2003; Verrijdt et al., 2006). For instance, the R581D mutation in the dimerization domain of AR-DBD enhances AR's activity on CREs but has less effect on AREs. On the other hand, the A579T mutation shows reduced activity on AREs but not on CREs (Geserick et al., 2003). In contrast, mutations at points that differ between the AR and GR Dim, i.e., S580G and T585I, in the AR, and G478S and I483T, in the GR, do not show much effect on DNA binding affinity and activity of these receptors (Verrijdt et al., 2006). These mutation data indicate that less of the AR-DR binding specificity can be attributed to the Dim interface than suggested by the crystal structure. Also, it is shown that the changes in AR activity due to the loss of Dim interactions strongly depend on the engaged DNA response element (van Royen et al., 2012). Since the Dim region is too far (about 18 Å) from the DNA surface to build direct interaction, other parts of DBDs likely play a role in DNA binding specificity (Meijsing et al., 2009). In a recent study, Watson et al. showed that the lever arm conformation strictly depends on the spacer sequence. The lever arm has therefore been suggested as an allosteric modulator that not only connects the H1 to the Dim (see **Figure 1**), but also associates the DNA response sequence to its respective dimer partner (Watson et al., 2013). The activities of AR and GR are shown to also depend on this region (Meijsing et al., 2009; Helsen et al., 2012; Dalal et al., 2014). A recent study on the DNA-binding preferences of AR and GR has revealed that AR binding to DNA is more enthalpically energized, while GR binding is more entropy driven (Zhang et al., 2018).

FIGURE 1 | (A) Schematic overview of the DNA binding domain (DBD) sequences in the androgen receptor (AR) and glucocorticoid receptor (GR) protein with corresponding residue numbers above and below, respectively. The amino acids colored in dark red are those elements of the GR-DBD that differ from the AR-DBD sequence. The other amino acids are the same in the AR- and GR-DBD. The amino acids shown with green shadow are those elements in AR that are replaced with residues from GR in order to make SPARKI (Schauwaers et al., 2007). (B) DNA sequences for direct (DR) and inverted repeats (IR). The non-capital letters are the spacer base-pairs, colored in orange. (C) Schematic 3D structure of one monomer of SPARKI-DBD; regions colored in green and blue are those subdomains that are GR- and AR-like, respectively. (D) The 3D structure of the GR-DBD/DNA complex (pdb ID: 1R4R). A similar structure exists for the AR-DBD/DNA complex (pdb ID: 1R4I). The lever arm and dimerization domain (Dim) are shown in yellow and red, respectively. The spacer region of the DNA is colored in orange.

In 2007, an in vivo study done by Schauwaers et al. generated a chimeric receptor, termed SPARKI (SPecificity-affecting AR KnockIn), in which 12 amino acids of AR in its second zinc-binding domain were replaced by those of GR (**Figures 1A,C**) (Schauwaers et al., 2007). In vitro studies have shown that swapping this second zinc-binding motif between the AR and GR leads to the loss of affinity of this chimeric receptor for a DR-like motif (Schoenmakers et al., 1999; Moehren et al., 2008). Consistently, the in vivo experiment exhibited a reduced affinity of the SPARKI receptor for DR-like elements, whereas for IR-like elements it showed similar or even better binding affinity than AR (Schauwaers et al., 2007). The lack of the SPARKI system's ability to bind to DR-like response elements was also confirmed by a later in vivo study, done by Sahu et al. (2014). Interestingly, this study shows that for DR-like elements, which were selectively enriched by wild-type AR, there is a well-conserved 5′-hexamer (HS1, **Figure 1B**) but not a stringent 3′-hexamer (HS2) sequence conservation. In contrast, binding of both wild-type AR and SPARKI to IR-like elements requires a specific HS2 sequence (Sahu et al., 2014). Moreover, in vitro assays show the high affinity of AR and GR receptors for HS1, due to its highly conserved sequences (Verrijdt et al., 2000). It is speculated that due to the high affinity of the two subunits in the AR dimer, this receptor could bind to a more diverse HS2 than the GR could. For instance, it is shown that the thymine (T) next to guanine (G) in HS2 of the IR elements is a highly conserved base in the response elements of SRs. This specific T is not required for AR, allowing this receptor to bind to DR-like elements, which have an adenine (A) in that position (John et al., 2011; Sahu et al., 2011, 2014; Yin et al., 2012; Ballaré et al., 2013; Grøntved et al., 2013).
However, it is not yet clear how the high affinity of AR-DBD for DR-like response elements, which leads to strong interactions in the protein's dimerization interface, is influenced by (more diverse) HS2 elements. Moreover, the distinct binding of AR(DBD)-DR (or IR) and GR(DBD)-IR is still not well understood. SPARKI is an outstanding model that could explain the distinct regulation of AR-specific responses with respect to those which can be regulated by GR as well.

In this study, by employing all-atom molecular dynamics simulations, we investigate the factors that lead to a different binding of AR and GR receptors to DNA response elements. In this regard, we simulated six protein-DNA complexes consisting of the DNA binding domains of wild type AR and GR, bound to a DNA sequence with IR and DR, respectively, and SPARKI models (with both IR and DR elements) made by AR and GR mutation. Our MD simulations allowed us to determine the significant dynamics of these receptors' DBD-DNA interfaces. These results suggest a loss of affinity of the chimeric proteins, i.e., SPARKI, for DR sequences and a strong affinity for IR sequences. Furthermore, our data reveal that the "weaker" dimerization interface interactions in the IR complexes, compared to the AR-DR complex, allow those dimeric proteins to be properly accommodated on IR sequences.

### 2. MATERIALS AND METHODS

### 2.1. Structural Models

The atomic models of the DNA binding domains (DBD) of AR and GR complexed to their respective response elements were prepared using the crystallographic structures 1R4I and 1R4R, respectively. In order to achieve consistency with the AR(DBD)-DNA complex, the guanine in the spacer region of the GR(DBD)-DNA complex was mutated in silico to cytosine. The response elements in the two complexes are thus 5′-CC **AGAACA**tca**TGTTCT** GA-3′ (DR, for AR) and 5′-CC **AGAACA**tca**AGAACA** GA-3′ (IR, for GR), respectively. The residues listed in bold are the core response elements, comprising the two half sites, HS1 and HS2, respectively; the spacer is given in lowercase letters. We have constructed two atomic models of the SPARKI receptor, one based on the structure of the AR-DNA complex (1R4I) and one on the structure of the GR-DNA complex (1R4R). In the AR-based model, termed SpAR, residues in the second zinc-binding motif of AR that differ from GR (highlighted in green in **Figure 1A**) were replaced with the corresponding residues of the GR protein, as in the experimental mutation (Schauwaers et al., 2007). These residues are located at the dimerization interface (see **Figures 1A,C**). The second model, termed SpGR, is based on the GR protein, in which the residues of the first zinc-binding motif of GR that differ from AR, which are part of the DNA-binding interface, were mutated to those of AR. The resulting protein sequence in both SPARKI models, SpAR and SpGR, is thus identical; however, their initial structures differ, since they are based on two different crystal structures.

Both SPARKI models were furthermore modeled in complex with both DNA sequences, DR and IR. A total of six models, i.e., AR-DR, GR-IR, SpAR-DR, SpAR-IR, SpGR-DR, and SpGR-IR, were therefore simulated.

### 2.2. Molecular Dynamics Simulations

The systems were solvated with ∼23,000 water molecules in a cubic box of ∼90 × 90 × 90 Å<sup>3</sup>, and sodium ions were added to neutralize the systems. The CHARMM-27 force field (Brooks et al., 1983; MacKerell et al., 1998) and the TIP3P water model were used in the simulations (Mahoney and Jorgensen, 2000). Long-range electrostatic interactions were treated by the particle mesh Ewald method, using a switch function with a cutoff of 12–14 Å and periodic boundary conditions (Darden et al., 1993). The systems were energy minimized for 5,000 steps (conjugate gradient with an energy tolerance of 10<sup>−4</sup> kcal/mol), followed by a molecular dynamics (MD) simulation of 30 ps (time step of 1 fs) to heat the system by velocity scaling, with harmonic restraints on all heavy atoms (force constant 10 kcal·mol<sup>−1</sup>·Å<sup>−2</sup>). Then, 100 ps of MD relaxation in the NPT ensemble at the target temperature (300 K) with a time step of 1 fs were computed. Langevin dynamics with a damping factor of 1 ps<sup>−1</sup> was used for temperature control (Allen and Tildesley, 2017). The Nosé–Hoover Langevin pressure control, with a piston period of 200 fs and a damping time of 100 fs, was used to maintain the pressure at 1 bar (Martyna et al., 1994). After the equilibration phase, three 100 ns MD replicas (with different initial velocities) were carried out for each system (time step of 2 fs). From those, one run per system was chosen for a longer simulation, based on the calculated root mean-squared deviation (RMSD) (see **Figure S1**). These longer MD simulations were carried out for 900 ns for the SPARKI systems and for 500 ns for AR-DR and GR-IR, and were saved at 2 ps intervals. In all simulations, the terminal DNA base pairs were restrained (centered around 3 Å between the centers of mass of the respective bases) by a harmonic potential with a force constant of 20 kcal/mol in order to decrease edge effects. The MD simulations were run using version 2.10 of NAMD (Phillips et al., 2005).
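As a bookkeeping aid, the run lengths stated above translate into the following step and frame counts. The sketch only restates numbers given in the text; the helper names `n_steps` and `n_frames` are illustrative and not part of any simulation package.

```python
def n_steps(duration_ns: float, timestep_fs: float) -> int:
    """Number of integration steps for a run of the given length."""
    return round(duration_ns * 1_000_000 / timestep_fs)   # 1 ns = 1e6 fs


def n_frames(duration_ns: float, save_interval_ps: float) -> int:
    """Number of saved frames when writing one frame every save_interval_ps."""
    return round(duration_ns * 1_000 / save_interval_ps)  # 1 ns = 1e3 ps


# Values taken from the text
replica_steps = n_steps(100, 2.0)      # each 100 ns replica, 2 fs time step
sparki_frames = n_frames(900, 2.0)     # 900 ns SPARKI runs, saved every 2 ps
wildtype_frames = n_frames(500, 2.0)   # 500 ns AR-DR / GR-IR runs, saved every 2 ps
print(replica_steps, sparki_frames, wildtype_frames)
```

These frame counts are the values of N that enter the occupancy and averaging analyses when a full trajectory is used; analyses restricted to a 100 ns window use correspondingly fewer frames.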

### 2.3. Hydrogen Bond Analysis

Hydrogen bonds were analyzed based on geometric criteria, i.e., a maximal distance of 3.2 Å between donor and acceptor atoms and an angle formed by donor, hydrogen atom, and acceptor that deviates by at most 42° from linearity. This criterion was evaluated for each frame of the simulation trajectory, i.e., every 2 ps of simulation time. A hydrogen-bond probability is then obtained as the hydrogen bond occupancy H<sub>occ</sub> = n<sub>Hbond</sub>/N, i.e., the number of frames in which a hydrogen bond is formed, n<sub>Hbond</sub>, divided by the number of frames analyzed, N. Water-mediated hydrogen bonds between protein and DNA were identified as two hydrogen bonds formed simultaneously by a water molecule, one with the protein and another one with the DNA. The hydrogen bond analysis was carried out using VMD (Humphrey et al., 1996) and in-house scripts.
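The geometric criterion and the occupancy defined above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the VMD or in-house scripts used in this work; the function names and the array layout (one donor, hydrogen, acceptor triple, coordinates in Å) are assumptions.

```python
import numpy as np


def hbond_mask(donor, hydrogen, acceptor, max_dist=3.2, max_dev_deg=42.0):
    """Per-frame boolean mask for one donor-H...acceptor triple.

    donor, hydrogen, acceptor: arrays of shape (n_frames, 3), in Å.
    """
    d_da = np.linalg.norm(acceptor - donor, axis=1)       # donor-acceptor distance
    v_hd = donor - hydrogen
    v_ha = acceptor - hydrogen
    cos_angle = np.einsum("ij,ij->i", v_hd, v_ha) / (
        np.linalg.norm(v_hd, axis=1) * np.linalg.norm(v_ha, axis=1)
    )
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # D-H-A angle
    return (d_da <= max_dist) & (180.0 - angle <= max_dev_deg)


def occupancy(mask):
    """H_occ = n_Hbond / N for a per-frame boolean mask."""
    return float(np.count_nonzero(mask)) / mask.shape[0]


# Four synthetic frames: donor at the origin, hydrogen 1 Å along x
donor = np.zeros((4, 3))
hydrogen = np.tile([1.0, 0.0, 0.0], (4, 1))
acceptor = np.array([
    [2.9, 0.0, 0.0],   # linear and close: bonded
    [3.5, 0.0, 0.0],   # too far
    [1.0, 2.5, 0.0],   # close, but D-H-A angle is 90 degrees
    [2.8, 0.5, 0.0],   # slightly bent and close: bonded
])
mask = hbond_mask(donor, hydrogen, acceptor)
print(mask, occupancy(mask))
```

In practice the mask would be evaluated for every candidate donor-acceptor pair at the interface, and the occupancies collected per pair.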

### 2.4. Conformational Analysis

The median structure of each trajectory was determined as the snapshot that has minimum root mean-squared deviation (RMSD) from the averaged structure of the trajectory. The local DNA conformation was analyzed using Curves+, a program for analyzing the coarse-grained geometry of DNA (Lavery et al., 2009). The errors estimated for the DNA parameters are standard errors, which are calculated by a block averaging approach (Grossfield and Zuckerman, 2009).
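Both steps can be illustrated with a short NumPy sketch. It assumes a trajectory that has already been superimposed on a common reference and stored as an array of shape (frames, atoms, 3); the function names and the choice of equal-length contiguous blocks are assumptions of this example, not a description of the scripts used here.

```python
import numpy as np


def median_frame(coords):
    """Index of the frame with minimum RMSD from the trajectory average.

    coords: (n_frames, n_atoms, 3), assumed already aligned.
    """
    mean = coords.mean(axis=0)
    rmsd = np.sqrt(((coords - mean) ** 2).sum(axis=2).mean(axis=1))
    return int(np.argmin(rmsd))


def block_standard_error(series, n_blocks):
    """Standard error of the mean from n_blocks contiguous block averages."""
    series = np.asarray(series, dtype=float)
    usable = (series.size // n_blocks) * n_blocks     # drop a ragged tail
    block_means = series[:usable].reshape(n_blocks, -1).mean(axis=1)
    return float(block_means.std(ddof=1) / np.sqrt(n_blocks))


# Three synthetic frames of a two-atom system, shifted rigidly
base = np.zeros((2, 3))
coords = np.stack([base - 1.0, base + 0.1, base + 1.2])
print(median_frame(coords))                        # frame closest to the mean
print(block_standard_error([1.0, 1.0, 3.0, 3.0], n_blocks=2))
```

With block averaging, the block length is usually increased until the estimated error plateaus, so that the blocks can be treated as statistically independent.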

### 2.5. Linear Correlation Score Function

Correlations between all pairs of fluctuating atom positions were calculated as Pearson correlations. The Pearson correlation is defined by the normalized covariance matrix (Ichiye and Karplus, 1991):

$$r\_{ki} = \frac{cov(\mathbf{x}\_k, \mathbf{x}\_i)}{\sigma\_{\mathbf{x}\_k}\sigma\_{\mathbf{x}\_i}} \tag{1}$$

where **x**<sub>k</sub> and **x**<sub>i</sub> are the fluctuations of random variables k and i, respectively.

The correlation score function is a measure of the intensity of correlation for each variable k (here, the position of the Cα atoms of the protein residues), defined as (Ricci et al., 2016):

$$\text{CS}\_k = \frac{1}{N-1} \sum\_{i}^{N-1} r\_{ki} \tag{2}$$

Here, the correlation score function is normalized. In order to remove trivial and unimportant correlations, only pairs with r<sub>ki</sub> ≥ 0.4 were considered.
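Equations (1) and (2) can be sketched as follows. The example is a hedged illustration: it takes the covariance of two position vectors to be the mean dot product of their fluctuations, excludes the trivial k = i term from the sum, and sets pairs below the threshold to zero. These choices and the function names are assumptions of the sketch.

```python
import numpy as np


def pearson_matrix(coords):
    """Normalized covariance r_ki of atomic position fluctuations (Eq. 1).

    coords: (n_frames, n_atoms, 3), assumed already aligned.
    """
    fluct = coords - coords.mean(axis=0)
    # Covariance of two 3D positions, taken as the mean dot product (assumption)
    cov = np.einsum("fkc,fic->ki", fluct, fluct) / coords.shape[0]
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)


def correlation_score(r, threshold=0.4):
    """CS_k of Eq. 2, keeping only pairs with r_ki >= threshold."""
    n = r.shape[0]
    kept = np.where(r >= threshold, r, 0.0)
    np.fill_diagonal(kept, 0.0)            # exclude the trivial k == i term
    return kept.sum(axis=1) / (n - 1)


# Synthetic test: atoms 0 and 1 move together, atom 2 moves opposite to them
rng = np.random.default_rng(0)
d = rng.normal(size=(500, 3))
coords = np.stack([d, d + 5.0, -d], axis=1)   # shape (500, 3, 3)
cs = correlation_score(pearson_matrix(coords))
print(cs)
```

In this toy system atoms 0 and 1 each have one fully correlated partner out of two, while the anticorrelated atom 2 has none above the threshold.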

### 2.6. Entropy Estimation

The configurational entropy of the protein is estimated based on the mass weighted covariance matrix of atomic fluctuations via two well-established methods, one proposed by Schlitter (Schlitter, 1993) and another one by Andricioaei and Karplus (2001).

For the computation of the protein entropy, we used the fluctuations of the backbone Cα atoms. The last 300 ns of the simulations were considered for the analysis. The error bars are the standard deviations over three samples of each simulation trajectory, obtained with different time strides. All calculations were done with the grcarma software, a task-oriented interface for the analysis of MD trajectories (Koukos and Glykos, 2013).
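For orientation, the two estimators can be written down compactly. The sketch below implements the standard published formulas (Schlitter's determinant formula and the quasi-harmonic oscillator sum) in NumPy; it is not the grcarma implementation used in this work, and the function names, the SI unit handling, and the eigenvalue cutoff are assumptions.

```python
import numpy as np

KB = 1.380649e-23         # Boltzmann constant, J/K
HBAR = 1.054571817e-34    # reduced Planck constant, J*s
AMU = 1.66053906660e-27   # atomic mass unit, kg
ANGSTROM = 1.0e-10        # m
NA = 6.02214076e23        # Avogadro constant, 1/mol


def mass_weighted_covariance(coords, masses):
    """coords: (n_frames, n_atoms, 3) in Å, aligned; masses: (n_atoms,) in amu.

    Returns the mass-weighted covariance matrix in SI units (kg m^2).
    """
    x = coords.reshape(coords.shape[0], -1) * ANGSTROM
    x = x - x.mean(axis=0)
    xw = x * np.sqrt(np.repeat(masses * AMU, 3))
    return xw.T @ xw / x.shape[0]


def schlitter_entropy(cov_mw, temperature=300.0):
    """S = (k_B / 2) ln det[1 + (k_B T e^2 / hbar^2) sigma_mw], in J/(mol K)."""
    n = cov_mw.shape[0]
    arg = np.eye(n) + (KB * temperature * np.e**2 / HBAR**2) * cov_mw
    _, logdet = np.linalg.slogdet(arg)
    return 0.5 * KB * logdet * NA


def quasiharmonic_entropy(cov_mw, temperature=300.0, rel_cutoff=1e-12):
    """Sum of quantum harmonic-oscillator entropies, in J/(mol K)."""
    eig = np.linalg.eigvalsh(cov_mw)
    eig = eig[eig > rel_cutoff * eig.max()]        # drop (near-)zero modes
    omega = np.sqrt(KB * temperature / eig)        # quasi-harmonic frequencies
    a = HBAR * omega / (KB * temperature)
    return KB * NA * np.sum(a / np.expm1(a) - np.log1p(-np.exp(-a)))


# Synthetic fluctuations of five carbon-mass particles (not simulation data)
rng = np.random.default_rng(0)
coords = rng.normal(scale=0.5, size=(200, 5, 3))
masses = np.full(5, 12.011)
cov = mass_weighted_covariance(coords, masses)
s_schlitter = schlitter_entropy(cov)
s_qh = quasiharmonic_entropy(cov)
print(s_schlitter, s_qh)
```

Schlitter's expression is an upper bound on the quasi-harmonic value mode by mode, so the two estimates are expected to follow the same trends, with the Schlitter estimate slightly higher.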

### 3. RESULTS

The results are organized to first present a comparison of the overall structure of the complexes. This is followed by an analysis of the proteins, first, in terms of flexibility and an estimate of their entropies in the different complexes. Then, the protein-protein interactions between the two subdomains are investigated. Subsequently, the conformation of the two DNA sequences in the different complexes is analyzed. Finally, the hydrogen-bond interactions between the proteins and the DNA are reported.

### 3.1. Median Structure

In order to estimate the overall structural change of each complex during the simulation, the median structures representing the first 100 ns and the last 100 ns (of the total simulation time of 500 ns for AR-DR and GR-IR, and 900 ns for the SPARKI models) were aligned with respect to each other and compared. As can be seen in **Figure 2**, the lever arm is the most variable domain, whereas the initial and final conformations of the remainder of the systems are similar. Notable exceptions are monomer A, located at the first half-site, and the Dim interface of the SpGR-DR model, which exhibit a considerable distortion. In this model, a conformational change takes place not only in the lever arm but also in both zinc-binding subdomains, where the zinc ions, together with their coordinating ligands, change positions. Moreover, the Dim regions of the AR-DR system are slightly closer to each other than in the other models. The distances between different domains/subdomains of the protein-DNA complexes are listed in **Table S1**. As shown in this table, the distance between monomer A and monomer B in AR-DR (24.37 ± 0.31 Å) is shorter than that in GR-IR (25.08 ± 0.20 Å). The SpGR-DR system also exhibits a larger distance between the receptor's dimer interfaces, as well as between the respective zinc ions of the two subunits, than the other systems. The simulations of the SpAR-DR model, which represent the same system but were started from a different initial structure, in contrast, do not exhibit a distortion of the Dim interface. Accordingly, the distance between the two monomeric subunits in this model is shorter than in the SpGR-DR model.

### 3.2. Root Mean Square Fluctuations (RMSF)

**Figure 3** shows the per-residue root mean square fluctuations (RMSF) of the protein monomers for all the systems. As can be seen in this figure, the lever arm, corresponding to residues 571–576 (AR, SpAR)/469–474 (GR, SpGR), is the most strongly fluctuating region in all models. Comparison between monomer A and monomer B shows similar fluctuations of the protein residues in all systems, except for SpGR-DR. The IR complexes, though, exhibit higher flexibility than the DR complexes in the lever arm region, i.e., residues 469–474 or 571–576 in GR or AR numbering, respectively. SpGR-DR exhibits particularly high fluctuations of the protein residues, especially in monomer A, where they are higher than the fluctuations of monomer A in any of the other systems. Monomer B of SpGR-DR, however, shows larger fluctuations than the other systems only for the residues situated in the dimer interface, i.e., 576–581 (AR, SpAR)/474–479 (GR, SpGR). Of note, in the SpGR models, residues in the dimer interface are directly modeled, that is, without in silico mutation, from the crystal structure of the wild-type GR protein and may therefore represent a GR-like conformation.
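For reference, the per-residue RMSF is the root of the time-averaged squared displacement of each atom from its mean position. A minimal NumPy sketch, assuming an aligned trajectory with one representative atom per residue (for example Cα) and an illustrative function name:

```python
import numpy as np


def rmsf(coords):
    """Per-atom RMSF for coords of shape (n_frames, n_atoms, 3), aligned."""
    fluct = coords - coords.mean(axis=0)
    return np.sqrt((fluct ** 2).sum(axis=2).mean(axis=0))


# Two synthetic frames: atom 0 is fixed, atom 1 oscillates by 1 Å along x
coords = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
])
print(rmsf(coords))
```

The result depends on how the trajectory was superimposed, so RMSF profiles are only comparable between systems when the same alignment selection is used.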

### 3.3. Entropy Estimation

As can be seen from **Figure 4**, the estimated entropies of SpGR-DR and SpGR-IR are higher than those computed for SpAR-DR and SpAR-IR, respectively. This is the case for both entropy estimation methods. AR-DR and GR-IR exhibit rather similar entropy values, although the two proteins are in complex with different DNA sequences. Comparison of only the DR or only the IR complexes shows higher entropy values for the SPARKI models than for the respective wild-type complexes. Among the chimeric SPARKI models, SpAR does not exhibit a significant difference in entropy when complexed to the DR or the IR sequence, whereas SpGR shows a significantly higher entropy in the DR complex than in the IR complex.

TABLE 1 | Protein-protein hydrogen-bond interactions.


The star indicates that more than one hydrogen bond is formed simultaneously. AB and BA refer to the monomer A as donor and monomer B as acceptor and vice versa. Here, the hydrogen-bond interaction occupancies below 40% are considered as weak interactions and are therefore not listed.

### 3.4. Protein–Protein Hydrogen Bond Interactions

The hydrogen bond interactions between the protein subunits are listed in **Table 1**. Our results indicate that the dimer interface of the AR-DR system forms more strong hydrogen-bond interactions than those seen in the SPARKI systems and in GR-IR. In particular, the inter-subunit hydrogen bond S580A-S580B, which has been discussed to be crucial for tight dimerization of the AR-DR complex (Shaffer et al., 2004), is not present in the other systems. Furthermore, a strong interaction of R581-D583 can also be seen in AR-DR, but not in the other systems. Two interactions, L577-N593 (AR, SpAR)/L475-N491 (GR, SpGR) and A579-I585 (AR, SpAR)/A477-I483 (GR, SpGR), exist in all the systems, in both directions, that is, from monomer A to monomer B (AB) and vice versa (BA). However, in SpGR-DR, each of these interactions is formed in one direction only, indicating a weaker dimer interface interaction in SpGR-DR than in the other systems. Moreover, the dimer interfaces of the SpAR complexes exhibit stronger hydrogen-bond interactions than those of the SpGR models. An extra interaction, C578-R590, can be seen in the SpAR models that is not present in the SpGR models. This extra interaction is also observed in the AR-DR complex, upon which the SpAR-DR model has been built. The dimerization interface of the GR-IR model also exhibits two moderate, one-way (BA-side) hydrogen-bond interactions, C476-R488 and R479-D481, that are not present in the SpGR models.

### 3.5. Linear Correlation Score

In order to capture how the protein residues in each monomer are influenced by other residues of that monomer, the linear correlation score has been calculated for all the systems (linear correlation scores calculated for the first 100 ns and middle 100 ns of trajectories of the SPARKI systems are shown as **Supplementary Material**, see **Figure S13**). As can be seen in **Figure 5**, almost all the residues show a similar magnitude of correlation score in all the systems, except for SpGR-DR. This model exhibits considerably higher correlation score values, in both protein monomers, than any of the other models. This indicates that the fluctuating motion of each residue is highly dependent on the rest of the residues in that protein. Any local conformational change, as observed for the lever arm and the Dim of SpGR-DR, as visualized by the median structures (see above), does not only affect the neighboring residues but also distal domains of the protein and thus has a more global effect. Moreover, for SpGR-DR the correlation score increases during the simulation, corresponding to an increase in conformational change of the monomers in this model (see **Supplementary Figure S13**).

### 3.6. DNA Conformation

To study the impact of the DBD of the receptors on their respective DNA structure, the local geometrical parameters of DNA, i.e., inter- and intra-bp parameters (**Figures S5–S10**), major- and minor-groove widths (**Figure 6**), and helical axis bending (**Figure S4**) were calculated for the last 100 ns of the AR-DR and GR-IR trajectories. For the SPARKI systems, the changes of these parameters in the course of the simulations were also considered (**Figures S2, S3**) and are discussed in the **Supplementary Material**.

The DNA grooves of the IR complexes differ from those of DRs. Interestingly, these differences can not only be observed in the second hexamer, which is expected due to the different DNA sequence, but also in the spacer and in the first hexamer in the IR complexes (see **Figure 6**). For instance, the major groove at position C8, in the spacer region, is narrower in the IRs than in DRs. Also, a narrower major groove at positions C5-A6 (in HS1) can be observed in Sp(AR/GR)-IR compared to SpAR-DR or AR-DR. The DNA of both SPARKI-IR systems exhibits very similar conformations. This can be seen in almost all DNA parameters (see **Figures S5–S10**).

The DNA parameters in both SPARKI-IR complexes show some differences from the GR-IR parameters. The minor groove of Sp(AR/GR)-IR at positions between A4-T7 (in HS1) is narrower than that in the GR-IR (see **Figure 6**). Also, the DNA of the GR-IR complex shows higher bending than the DNA of the Sp(AR/GR)-IR complexes (**Figures S4B,D**). Since the DNA sequence is the same in all IR complexes, the observed differences in the DNA conformation can be attributed to the interaction with the different proteins.

In contrast to the two SPARKI-IR complexes, all DNA parameters of the SpAR-DR complex and the SpGR-DR complex represent conformations that are considerably different from AR-DR (see **Figure 6** and **Figures S4–S10**). SpAR-DR and SpGR-DR, moreover, show differences between some of their DNA parameters. For instance, in SpGR-DR the HS2 has a wider major and narrower minor groove and HS1 has a considerably wider minor groove than in SpAR-DR. Furthermore, the DNA helical axis bending is higher in SpGR-DR than in SpAR-DR (**Figure S4**). In the two SPARKI-DR models not only the DNA sequence is the same, but also the residues of the protein. The different DNA conformations may also be attributed to different interactions with the (same) proteins, representing different (metastable) binding modes due to different initial starting conformations.

In the SPARKI-IR systems, the first hexamer exhibits a narrower major groove than the second hexamer, whereas the opposite is observed for the SpAR-DR and AR-DR systems (see **Figure 6**). Interestingly, position T<sup>12</sup>, in the second hexamer, seems to have an important role in the IR complexes. For most IR complexes the dinucleotide G<sup>11</sup>T<sup>12</sup> shows an extreme value, which is not the case in the DNA parameters of the DRs, with G<sup>11</sup>A<sup>12</sup> at this position (see **Figures S5, S6, S9**). The intra-base-pair parameters at position G<sup>11</sup> also exhibit more extreme values in the IR complexes than in those with DR (**Figures S7, S8, S10**), which may be an effect of the neighboring residues being thymines at positions T<sup>10</sup> and T<sup>12</sup> in IRs, instead of adenine residues in DRs.

### 3.7. Protein-DNA Hydrogen-Bond Interactions

In order to analyze the interaction strengths, probabilities of direct and indirect (mediated by water molecules) hydrogen bonds between protein and DNA have been calculated. **Figures 7**–**9** show the hydrogen bond interactions of all studied systems, calculated from the last 100 ns of the simulations. For the SPARKI systems, the hydrogen bond interactions of the middle 100 ns (W2 interval) were also calculated (see **Figures S11, S12**). According to these figures, differences in protein-DNA interactions between W2 and W3 intervals in SpARs can be seen only in the first hexamer, HS1 (**Figure S11**), whereas for SpGRs such differences exist in both DNA hexamers (**Figure S12**).
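The water-mediated (indirect) hydrogen-bond probability, as defined in the Methods, counts a frame when a single water molecule is hydrogen-bonded to the protein and to the DNA at the same time. A hedged sketch of that bookkeeping, assuming the per-frame water contacts have already been evaluated with the geometric criterion and stored as boolean arrays (the array layout and function name are illustrative):

```python
import numpy as np


def water_bridge_occupancy(water_protein, water_dna):
    """Fraction of frames with at least one bridging water.

    water_protein, water_dna: boolean arrays of shape (n_frames, n_waters);
    an entry is True if that water is hydrogen-bonded to the protein residue
    (or to the DNA nucleotide) in that frame.
    """
    bridged = (water_protein & water_dna).any(axis=1)
    return float(bridged.mean())


# Four synthetic frames, two candidate water molecules
water_protein = np.array([[1, 0], [1, 0], [0, 1], [0, 0]], dtype=bool)
water_dna = np.array([[1, 0], [0, 1], [0, 1], [1, 1]], dtype=bool)
print(water_bridge_occupancy(water_protein, water_dna))
```

Requiring the same water index on both sides is what distinguishes a bridge from two unrelated water contacts occurring in the same frame, as in the second synthetic frame.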

For each DNA hexamer, i.e., HS1 and HS2, there are four sites whose hydrogen bond interactions with the protein are conserved among all the systems. These are s1A1, s1G2, s2G5, and s2T6 in HS1 and s1A10, s1G11, s2T15, and s2G14 in HS2. The guanine residues at positions s1G11 and s2G5 are the predominant residues that form strong, i.e., highly probable, hydrogen-bond interactions with the protein in all systems. In particular, residue R568 in helix H1 of the AR-DBD and residue R466 in helix H1 of the GR-DBD form base-specific hydrogen bonds with the guanine residues s1G11 and s2G5, respectively. Our results indicate that the AR-DR complex involves more hydrogen-bonded protein-DNA interactions than the GR-IR complex. Moreover, the hydrogen bonds of residues s1G2 and s2G14 with K563 and K567, respectively, and also those of residues s2A7 and s2T6 (in the spacer) with Y576, are stronger in the AR-DR complex than the corresponding hydrogen bonds in the GR-IR complex (see **Figure 7**).

Comparison of the hydrogen-bond patterns between the SpAR systems shows that the SpAR-IR complex has more strong and moderate hydrogen-bond interactions than the SpAR-DR complex. In particular, residues s1T10 and s2G5 are more strongly hydrogen-bonded in the SpAR-IR model than in the SpAR-DR complex (see **Figure 8**). The two SpGR systems show rather similar protein-DNA hydrogen-bond interactions (see **Figure 9**). However, comparing the hydrogen-bond interactions between SpAR-IR and SpGR-IR shows that SpAR-IR includes more and stronger hydrogen-bond interactions than SpGR-IR. In particular, for the SpAR-IR model more hydrogen bonds than in the SpGR-IR complex can be observed for each specific guanine residue, i.e., s1G11 and s2G5. One further residue, s1T10, forms stronger hydrogen-bonded interactions with the protein in the SpAR-IR than in the SpGR-IR complex. There is also a strong interaction at residue s2A7 of SpGR-IR which is not present in SpAR-IR. These differences in the protein-DNA interaction between the SpAR-IR and SpGR-IR complexes, that is, two models of the same system, may represent two slightly different binding modes, as a consequence of the different initial conformations used in the simulations.

On the other hand, our results show that both the Sp(AR/GR)-IR complexes exhibit stronger hydrogen-bond interactions than the GR-IR complex (compare residues s1G2 and s1G3, between Sp(AR/GR)-IR and GR-IR, residue s1T10 between SpAR-IR and GR-IR, and residue s2T6 and s2A7 between SpGR-IR and GR-IR). Furthermore, the AR-DR complex exhibits slightly stronger hydrogen-bond interactions than observed in the SpGR-DR but considerably stronger than observed in SpAR-DR. Interestingly, those interactions, present in AR-DR but not in SpAR-DR, are mostly formed with the HS1 and the spacer. Moreover, there are more water-mediated interactions in SpAR-IR than in SpGR-IR. Finally, the number of water-mediated hydrogen bond interactions in AR-DR is higher than in GR-IR.

### 4. DISCUSSION

All the protein-DNA complexes modeled in this work represent states in which the DNA is bound by the respective DBD. The interaction strengths within the complexes, as manifested by hydrogen bond interactions between protein and DNA as well as between the protein subunits, and the conformational flexibility, however, vary between the different systems.

Of all the protein-DNA systems, the AR-DR complex exhibits the strongest interactions between protein and the DNA via direct and water-mediated hydrogen bonds.

### 4.1. Protein–Protein Interactions

The complex which exhibits the strongest hydrogen bonds between the two protein monomers is AR-DR. In particular, the strong hydrogen-bonded interaction S580-S580, as suggested by the crystal structure (Shaffer et al., 2004), contributes to the stabilization of the dimerization interface. This interaction can also be regarded as facilitating the interaction of the neighboring R581 with D583. This is furthermore in agreement with the experimental suggestion that the strong dimer interface of AR-DR allows the AR-DBDs to bind to DNA in a head-to-head conformation (Shaffer et al., 2004; van Royen et al., 2012).

The mutations in the SPARKI systems, which transform an AR into the chimeric protein, are mainly located in one loop that constitutes the dimerization interface. The protein-protein interactions in all the SPARKI systems are weaker than in the AR-DR and comparable to (or even weaker than) those in the GR-IR system. This suggests that the dimerization interface of SPARKI is indeed GR-like, as would be expected based on its constituting sequence.

A significant conformational distortion can be seen in monomer A and the dimer interface of SpGR-DR that is not observed in SpGR-IR. In addition, the dimer interface of SpGR-DR has two hydrogen bonds fewer than that of SpGR-IR. The SpGR-DR model, moreover, exhibits the largest Zn-Zn distances and the largest distance between the loops of the dimerization interface of all the models investigated in this work. These findings suggest that in the SpGR model, accommodation of the DR sequence, with protein-DNA interactions comparable to those with an IR sequence, can be achieved only at the expense of a distortion of the dimerization interface.

The deformation of monomer A and the dimerization interface observed in the SpGR-DR model is not observed in the SpAR-DR model, that is, the complex that has been modeled from the crystal structure of AR-DR. We attribute this difference to the different starting points for the simulations, AR-DR and GR-IR, respectively. In the SpAR models, the residues which have been mutated in silico (second zinc-binding motif) are located at the dimerization interface, whereas in the SpGR models these residues (first zinc-binding motif) are part of the DNA-binding interface. Furthermore, in the SpGR-DR model the DNA sequence has been changed from IR to DR in silico.

In the SpAR-DR model, the monomers of SpAR are tightly bound in the AR-like starting conformation. The modified dimerization interface leads to a weaker protein-protein interaction, as manifested by the longer distance and fewer hydrogen bonds between the two subunits. The protein, on the other hand, does not "reach" the DNA as well as in the other models, as can be seen from SpAR-DR showing the longest, though not by much, protein-DNA distances of all the complexes. Moreover, the number of hydrogen bonds between protein and DNA is smaller than in the wild-type AR-DR, in particular in HS1, pointing toward a looser complex in the chimeric model. This is in agreement with, although it does not fully explain, the experimentally observed low affinity of SPARKI for DR elements (Schauwaers et al., 2007; Moehren et al., 2008; Sahu et al., 2014).

In the SpGR-DR model the dimerization interface is GR-like, that is, weak to start with. In addition, the protein is not properly oriented on the DR sequence. In the course of the simulation, the protein undergoes conformational changes in the dimerization interface, considerably weakening the protein-protein interactions. The distortion and the weakened interactions in the dimerization interface result in a reoriented monomer A and a deformed monomer B. This means that monomer B in SpGR does not manage to fully adjust onto the direct repeat on HS2 to form strong contacts. The observed conformational change in the Dim regions and monomer A may be regarded as an attempt by the system to make favorable contacts in other parts of the complex. Indeed, in the SpGR-DR model, more contacts, that is, hydrogen bonds between protein and DNA, are observed than in the SpAR-DR model. However, these contacts are with HS1. Strong interactions with only one hexamer and a distorted protein-protein interface suggest a low affinity, or a rather unstable SpGR-DR complex. The SpGR model is, by construction, a GR-like SPARKI. GR also lacks affinity for DR sequences, possibly because no stable complexes can be formed between GR and DR. A deformed conformation in the dimerization interface of SpGR-DR may thus point toward a loss of stability in the wild-type GR-DR complex.

Analysis of the DNA parameters around T12 exhibits extreme values in the neighboring G11 (intra bp) as well as extreme inter base pair parameters in the GT step that are not present in the GA step of the direct repeat. The affected G11 has strong interactions with the protein and is therefore an important residue for binding. This interplay may explain why T12 is essential for specific DNA recognition by GR (Sahu et al., 2014) as has been shown by in vivo experiments.

The sequence and conformation in the HS2, moreover, affect the spacer region. In this region, a narrower major groove has been observed for the IR sequence than for the DR sequence. Such a DNA conformation, though not quite a kink in the DNA spacer, requires the protein to "follow" the DNA conformation so as to form favorable contacts. This is achieved by a lever arm that is more flexible in the IR-bound systems, i.e., GR and SPARKI (see **Figure 3**), and the two protein subunits being slightly further apart, as manifested by longer monomer-monomer distances in GR-IR compared to AR-DR, while the distances of the protein subunits to their respective half site on the DNA are similar. Among the complexes with an IR sequence, both SPARKI models, SpAR-IR and SpGR-IR, reveal stronger protein-DNA interactions, especially with the HS1, than the other wild-type complex, GR-IR, in agreement with experiments that show similar or higher affinity of SPARKI systems for the IR elements or classical response element, i.e., CREs (Schauwaers et al., 2007).

The higher affinity of the SpAR/GR complexes to the IR sequence, compared to that of GR-IR, can thus be explained by the chimeric systems having both properties, the AR-like ability to strongly interact with the DNA and the GR-like "softness", that is weaker interactions, of the dimerization interface, that allows the protein to flexibly accommodate to the binding on the DNA. Qualitatively, the higher flexibility in the dimerization interface and lever arm region of the SPARKI-IR systems can be understood as entropically favorable. Indeed, the SPARKI models show a higher entropy than the wild-type complexes. Additionally, the stronger protein-DNA interactions can be understood as an increased enthalpic contribution. An increased binding affinity of SPARKI compared to GR can thus be attributed to favorable enthalpic and entropic contributions.

The AR-DR complex, in contrast, is more enthalpically stabilized by the contribution of both protein-protein and protein-DNA hydrogen-bond interactions. In the DR-DNA the minor groove is ∼1 Å narrower at the GA step than at the corresponding GT step in an inverted repeat DNA. This narrower minor groove is associated with the phosphate groups of the DNA backbone being closer to each other, thus providing a higher negative charge density. Electrostatic interactions of the positively charged Arg (and Lys at other positions) residues with the DNA are therefore strengthened, as manifested by the larger number of strong hydrogen bonds in the AR-DR system.

The protein-DNA complexes studied in this work are characterized by a competition between protein-protein interactions and protein-DNA interactions, that is, a stable dimerization interface vs. specific contacts to the DNA. A balance toward the former or the latter thus decides the specificity, or at least the preference, for direct or inverted repeat DNA, respectively.

### 5. CONCLUSION

Our simulations of the chimeric SPARKI protein, complexed to inverted and direct repeat sequences, reveal a higher affinity of this model protein for IR than for DR sequences. In fact, binding to a DR results in a loose complex, eventually even with a distorted protein conformation, a possible explanation for the experimentally observed weak affinity for such a sequence (Schauwaers et al., 2007; Moehren et al., 2008; Sahu et al., 2014).

Since AR, GR, and the SPARKI models can in principle all form the same contacts with specific residues of the DNA, IR or DR, the ability to accommodate the protein on the DNA is important for specificity. The required flexibility is observed in those systems with a "weaker" dimerization interface, that is, GR and the GR-like SPARKI, which can thus be considered to have a more entropy-driven specificity. The interactions in the dimerization interface and the protein-DNA interactions are balanced to allow proper accommodation of the protein on the DNA and formation of specific contacts, tuning the enthalpic contribution to specific complex formation. In this competition, the stability of the dimerization interface is important and to a large extent determines the preferred response element.

The starting point, that is, the crystal structure used for model building, still has an effect on the protein conformation in the complex, even after a rather long simulation time. SPARKI models initiated from the structure of the GR-IR complex are not capable of forming strong interactions in the dimerization domain. In contrast, SPARKI models started from an AR-DR complex structure maintain a rather stable dimerization interface, despite the mutation of some residues in this domain to those of GR. Still, this interface is weaker than in the wild-type AR-DR complex. Moreover, the chimeric SPARKI protein shows fewer interactions with DR than observed in AR-DR, rendering its specificity GR-like.

Altogether, this study reveals the importance of the dimerization domain for the distinct specificity of AR and GR, bound to DR and IR response elements, respectively.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study will not be made publicly available. Datasets are available on request.

### AUTHOR CONTRIBUTIONS

PI and MB designed the research. MB performed the research. MB and SV analyzed the data. MB and PI wrote the manuscript.

### ACKNOWLEDGMENTS

We thank the North-German Supercomputing Alliance (HLRN) for computational resources. Support by the IT team of the Physics Department at Freie Universität Berlin is gratefully acknowledged. SV is grateful for support by the IMPRS for Biology and Computation.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2020.00004/full#supplementary-material

### REFERENCES


and shared mechanisms in T47D breast cancer cells and primary leiomyoma cells. PLoS ONE 7:e29021. doi: 10.1371/journal.pone.0029021

Zhang, L., Martini, G. D., Rube, H. T., Kribelbauer, J. F., Rastogi, C., FitzPatrick, V. D., et al. (2018). SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome Res. 28, 111–121. doi: 10.1101/gr.222844.117

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Bagherpoor Helabad, Volkenandt and Imhof. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chromatin Compaction Multiscale Modeling: A Complex Synergy Between Theory, Simulation, and Experiment

Artemi Bendandi<sup>1,2</sup>, Silvia Dante<sup>2</sup>, Syeda Rehana Zia<sup>3</sup>, Alberto Diaspro<sup>1,2</sup> and Walter Rocchia<sup>4</sup>\*

<sup>1</sup> Physics Department, University of Genoa, Genoa, Italy, <sup>2</sup> Nanophysics & NIC@IIT, Istituto Italiano di Tecnologia, Genoa, Italy, <sup>3</sup> Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, Pakistan, <sup>4</sup> Concept Lab, Istituto Italiano di Tecnologia, Genoa, Italy

Understanding the mechanisms that trigger chromatin compaction, its patterns, and the factors they depend on, is a fundamental and still open question in Biology. Chromatin compacts and reinforces DNA, and it is a stable yet dynamic structure that keeps DNA accessible to proteins. In recent years, computational advances have provided larger amounts of data and have made large-scale simulations more viable. Experimental techniques for the extraction and reconstitution of chromatin fibers have improved, reinvigorating theoretical and experimental interest in the topic and stimulating debate on points previously considered as certainties regarding chromatin. A great assortment of approaches has emerged, from all-atom single-nucleosome or oligonucleosome simulations to various degrees of coarse graining, to polymer models, to fractal-like structures and purely topological models. Different fiber-start patterns have been studied in theory and experiment, as well as different linker DNA lengths. DNA is a highly charged macromolecule, making ionic and electrostatic interactions extremely important for chromatin topology and dynamics. Indeed, the repercussions of varying ionic concentration have been extensively examined at the computational level, using all-atom, coarse-grained, and continuum techniques. The presence of high-curvature AT-rich segments in DNA can cause conformational variations, attesting to the fact that the role of DNA is both structural and electrostatic. There have been some tentative attempts to describe the force fields governing chromatin conformational changes and the energy landscapes of these transitions, but the intricacy of the system has hampered reaching a consensus. The study of chromatin conformations is an intrinsically multiscale topic, influenced by a wide range of biological and physical interactions, spanning from the atomic to the chromosome level. Therefore, powerful modeling techniques and carefully planned experiments are required for an overview of the most relevant phenomena and interactions. The topic provides fertile ground for interdisciplinary studies featuring a synergy between theoretical and experimental scientists from different fields and the cross-validation of respective results, with a multiscale perspective. Here, we summarize some of the most representative approaches, and focus on the importance of electrostatics and solvation, often overlooked aspects of chromatin modeling.

Keywords: chromatin, nucleosome, coarse-grain modeling, electrostatics, solvation

#### Edited by:

Valentina Tozzini, Istituto Nanoscienze, Consiglio Nazionale delle Ricerche, Italy

#### Reviewed by:

Elodie Laine, Université Pierre et Marie Curie, France

Rosa Di Felice, Istituto Nanoscienze, Consiglio Nazionale delle Ricerche, Italy

> \*Correspondence: Walter Rocchia walter.rocchia@iit.it

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 09 September 2019; Accepted: 27 January 2020; Published: 25 February 2020

#### Citation:

Bendandi A, Dante S, Zia SR, Diaspro A and Rocchia W (2020) Chromatin Compaction Multiscale Modeling: A Complex Synergy Between Theory, Simulation, and Experiment. Front. Mol. Biosci. 7:15. doi: 10.3389/fmolb.2020.00015

## 1. INTRODUCTION

If one were to stretch out the DNA found inside a cell nucleus, one would end up with a fiber ∼2 m long. In order to fit inside the cell nucleus, which measures ∼6 µm in diameter, DNA needs to compact itself in a manner that permits efficient accessibility to DNA-binding proteins, while at the same time reinforcing the fiber. Compaction is achieved through the wrapping of DNA around certain proteins, the histones, forming the building blocks of the chromatin fiber, the nucleosomes. Nucleosomes are composed of a protein core, the histone octamer (consisting of two copies each of the H2A, H2B, H3, and H4 histones), and 147 base pairs (bp) of DNA wrapped around the core in 1.64 turns. Each histone of the octameric core has a highly disordered N-terminal portion, the histone tail, whereas the rest of the residues form alpha helices (Kalashnikova et al., 2013). Two more tails extend from the C-termini of the H2A histones, amounting to a total of ten unstructured dynamic domains (McGinty and Tan, 2014). Nucleosomes are connected to each other by linker DNA strands of varying length, and it has been calculated that the spooling of DNA around nucleosomes alone shortens DNA sevenfold (Iashina et al., 2017). Chromatin is a molecule that demands multiscale analysis, since changes as small as the absence of one DNA bp in nucleosomal or linker DNA can cause non-local changes in the topology of the fiber. Given the fundamental importance of chromatin organization for gene expression, the question of how the genome folds and compacts itself is one of the most fundamental in Biology.
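As a rough sense of scale for the numbers quoted above, the following back-of-the-envelope sketch converts base-pair counts into contour lengths. It assumes the canonical B-DNA rise of 0.34 nm per bp and, purely for illustration, a diploid genome of about 6 × 10<sup>9</sup> bp; neither value is taken from this review, and the function names are hypothetical.

```python
# Back-of-the-envelope length scales; all constants are illustrative assumptions.
RISE_NM_PER_BP = 0.34  # assumed canonical B-DNA rise per base pair


def contour_length_nm(n_bp):
    """Contour length (nm) of n_bp base pairs of straight B-DNA."""
    return n_bp * RISE_NM_PER_BP


def linear_compaction(contour_m, container_m):
    """Ratio between the stretched DNA length and the container size."""
    return contour_m / container_m


nucleosomal_nm = contour_length_nm(147)        # DNA wrapped in one nucleosome
genome_m = contour_length_nm(6e9) * 1e-9       # assumed diploid genome size
ratio = linear_compaction(2.0, 6e-6)           # 2 m of DNA vs. a 6 µm nucleus
print(f"{nucleosomal_nm:.1f} nm, {genome_m:.2f} m, ratio {ratio:.2e}")
```

Under these assumptions, the nucleosomal DNA alone is several times longer than the particle it wraps around, and the linear mismatch between the stretched genome and the nuclear diameter spans more than five orders of magnitude.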

The simultaneous advances in computational and experimental resources have not only led to significant milestones, but have also opened new possibilities in chromatin studies. Because of the intrinsic multiscale nature of chromatin, there is a plethora of computational and experimental approaches, which focus on structures as small as the single nucleosome and its dynamics, up to the entire genome of an organism. These models try to describe and predict experimental observables, such as different fiber-start patterns, as well as the effect of different linker DNA lengths on fiber topology. For chromatin modeling, especially at small and intermediate scales, approaches that rely on basic physical interactions for the description of electrostatics and solvation are of utmost importance. The other indisputably essential ingredient is the mechanical connection; for example, the presence of high-curvature AT-rich segments (A-tracts) in linker DNA is known to influence nucleosome interaction and alter chromatin folding (Buckwalter et al., 2017).

Overall, the study of chromatin is an intrinsically multiscale endeavor, since the effects of interactions spanning from atomic to chromosome-level are both physically and biologically important. Chromatin polymorphism is mostly driven by the delicate equilibrium of electrostatic interactions, solvation effects and mechanical constraints, such as steric exclusion and linker DNA length.

In this review, we provide a succinct synopsis of some of the existing modeling approaches for chromatin, focusing on the physics-based ones, and on those that allow integration with experimental biophysical and/or biological knowledge. This description will be paralleled with that of experimental techniques providing instrumental information for the validation and improvement of these models, paying particular attention to methods that only minimally perturb the observed system. We initially present a variety of chromatin models, starting from works studying single nucleosomes and oligonucleosome fibers, moving on to discuss coarse-grained models and finally fractal models. In the third section of this paper, we examine the fundamental importance of electrostatic interactions in chromatin, and their impact on fiber compaction and polymorphism. This brings us to an exploration of the often underrated role of solvation in chromatin compaction, in the fourth section of this review. Finally, we conclude our analysis with a discussion of experimental techniques that have been used in chromatin studies.

## 2. MULTISCALE MODELS

Chromatin models can be divided into two general categories, depending on the underlying initial assumptions and on the chosen building blocks: bottom-up and top-down models (Dans et al., 2016). The preferred approach depends on the level of detail of interest, the level of theory that one wants to adopt for the model and, inescapably, the computational capabilities at hand. Bottom-up models take the nucleosome and linker DNA crystal structures as a starting point (**Figure 1A**). The electrostatics and dynamics of these structures may be studied at the full atom level, and the derived results can be used to feed a coarse-grained model, which allows conclusions to be drawn for larger systems, such as oligonucleosomes or, sometimes, even larger structures (**Figure 1B**) (Savelyev et al., 2011; Fan et al., 2013; Collepardo-Guevara and Schlick, 2014; Izadi et al., 2016; Ghosh and Jost, 2018). The parameters used in these coarse-grained models depend on the properties of interest and on those observed by the accompanying experiments. In order to parameterize these types of models, data from all-atom structures and simulations are often used, making their results dependent on the resolution of the structures and the performance of the force fields used.

In top-down models, the behavior of the fiber is deduced from experimental observations and sequencing of large regions of chromatin, or even of the entire genome, from which a scheme of interactions is derived. Given the limitations in resolution and accuracy of experiments, top-down models cannot possess the same level of detail as bottom-up models. However, they provide a way to study global chromatin properties. These models may incorporate a multitude of, often ad hoc, coarse-grained descriptions to look into very specific chromatin features related to smaller scale structures, such as the kbp scale. Finally, in this category of models the use of notions from polymer physics is very common, representing chromatin as a polymer chain and its stages of compaction as phase transitions, and imposing constraints in the form of potentials (Giorgetti et al., 2014; Bianco et al., 2017).

Alternatively (Imakaev et al., 2015), chromatin models have been divided into categories based on whether they are built to match pre-existing data or emerge as representations of physical properties: data-driven models and ab initio models. Regarding data-driven models, some examples are given by approaches that try to generate chromosome structures based on Hi-C maps (Fudenberg et al., 2016), translating contact probability to distance. In these cases, however, one needs to bear in mind that Hi-C maps, and sequencing techniques in general, often give an average picture of the genome. Ab initio models, on the other hand, take properties that have been observed or even hypothesized about chromatin as a starting point, and aim to reproduce them through the application of constraints and potentials (Tompitak, 2017; Lequieu et al., 2019). The mathematical nature of these models can sometimes lead to a simplification of the biological factors at play.
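The translation of contact probability to distance mentioned above can be sketched as follows. This is a minimal illustration that assumes a power-law relation d = f<sup>−α</sup> between a normalized contact frequency f and a target distance d; the exponent `alpha`, the function name, and the toy matrix are assumptions made for the example and are not taken from the cited works.

```python
import math


def contacts_to_distances(freq, alpha=1.0):
    """Map a symmetric contact-frequency matrix to target distances.

    Assumes d_ij = f_ij ** (-alpha); pairs never seen in contact get an
    infinite target distance (i.e., no restraint is placed on them).
    """
    n = len(freq)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            f = freq[i][j]
            dist[i][j] = f ** (-alpha) if f > 0 else math.inf
    return dist


# Toy 3-locus map, normalized so that the most frequent contact equals 1.
toy = [[0.0, 1.0, 0.25],
       [1.0, 0.0, 0.0],
       [0.25, 0.0, 0.0]]
targets = contacts_to_distances(toy, alpha=0.5)
```

Because the input frequencies are population averages, the resulting distances describe an average picture rather than the conformation of any single cell, which is the caveat raised in the text.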

Here, bearing in mind these general classifications, which are consistent with model classifications in many fields, we propose an exploration of various models based on the final order of magnitude that they are able to study, ranging from mononucleosome studies up to works examining the entire genome. Examining different orders of magnitude of chromatin, we present approaches that make use of different assumptions and are based on different types of data, illustrating the multifaceted nature of the topic. An overview of different modeling paradigms based on the order of magnitude of interest is shown in **Figure 1**.

### 2.1. From the Single Nucleosome to Oligonucleosome Fibers

Nucleosomes have the ability to dissociate entirely into histones and DNA, upon unwrapping, and then reassemble (Kulaeva et al., 2012). The curvature of the DNA can either favor or disfavor histone-DNA contacts, and therefore the formation of nucleosomes (Szerlong and Hansen, 2011). Based on this premise, starting our analysis from the building blocks of chromatin, we encounter Partially Assembled Nucleosome States (PANS), which are interesting as they reveal the electrostatic and mechanical changes that occur when a nucleosome is forming or dissolving. Rychkov et al. (2017) analyzed three types of PANS (hexasomes, tetrasomes, and disomes) through Molecular Dynamics (MD) simulations, visualizing the structures with Atomic Force Microscopy (AFM) experiments. The nucleosome formation procedure was observed to occur as follows: the two H3 and H4 dimers bind to the DNA first, forming a tetrasome, followed by the sequential addition of H2A and H2B dimers. The results were compared to Small Angle X-ray Scattering (SAXS), Förster Resonance Energy Transfer (FRET), and AFM data. Nucleosome disassembly follows the reverse order, and both assembly and disassembly were seen to be associated with DNA supercoiling, as a way to regulate torsional stresses on the fiber (Bancaud et al., 2007).

Linker DNA length is extremely important for chromatin compaction, not only for mechanical but also for electrostatic reasons. Determining how linker DNA influences chromatin topology, and how its length and sequence can affect compaction, has been the subject of many studies and speculations. In the work of Buckwalter et al. (2017), for instance, the presence of so-called A-tracts, DNA segments where multiple A-T pairs are present in a row, and their influence on DNA rigidity, and therefore on chromatin fiber flexibility, are examined. It has been observed by comparison of Monte Carlo (MC) simulations and Electron Microscopy (EM) experiments on reconstituted oligonucleosome arrays that the presence of A-tracts causes DNA bending angles of up to 90°, and that these particular segments are often found in linker DNA (Cui and Zhurkin, 2009). The direction of bending of the linker DNA is also relevant for compaction: for example, when DNA bends inwards at the exit sites from the nucleosome core particle (NCP), the resulting structures are more compact compared to the opposite case, and give rise to zig-zag configurations and closer overall nucleosome proximity. It is evident that linker DNA length is of great importance when it comes to chromatin topology; however, its role is indirect: the parameters that really matter for packing are the DNA bending angles, which are influenced by linker DNA length through topological and persistence length constraints.

The presence of the linker histone H1 (or H5 in avian chromatin) is also key to compaction. This histone is not always present in nucleosomes, and its position can vary on or off the nucleosome dyad axis, the axis of symmetry of the nucleosome (Pachov et al., 2011). H1/H5 changes the orientation and flexibility of linker DNA, forming contacts with both entering and exiting strands. When two or more nucleosomes in sequence are bound to H1 histones, rigid structures termed DNA stems are formed, which present straighter linker DNA and a reduced separation angle between the entering and exiting DNA; the latter effect is more pronounced in chromatin configurations with long linkers (Collepardo-Guevara and Schlick, 2014). The increased rigidity of DNA caused by the formation of DNA stems is mitigated by the dynamic nature of H1/H5 binding and unbinding on nucleosomes (Collepardo-Guevara and Schlick, 2012).

Most all-atom and coarse-grained models dealing with chromatin simulations require the use of empirical force fields at some point, which affects the simulation results. Even though an extensive critical comparison of force fields and force field modifications for nucleic acids is beyond the scope of this review, we refer the reader to the works by Galindo-Murillo et al. (2016) and Dans et al. (2017).

### 2.2. Coarse-Grained Oligonucleosome Models

Different behaviors have been observed depending on the number of nucleosomes at the start of a fiber. Among the general categories of fiber models, the most prominent are the zigzag and solenoid models (Buckwalter et al., 2017). Zigzag models for chromatin propose what is commonly called a two-start fiber model (two nucleosomes at the start of the fiber), in which linker DNA crosses the main fiber axis. In two-start zigzag models, nucleosomes are stacked in the periphery of the fiber and linker DNA occupies the central space of the structure. Solenoid models, on the other hand, propose compaction through coiling of the linker DNA along the superhelical path. In these models, fibers are one-start, and nucleosomes create frontal contacts, with 6 to 8 nucleosomes per turn of the fiber. It is thought that both models coexist in fibers, along with straight linker DNA and bent linker DNA (Grigoryev et al., 2009; Schlick and Perišić, 2009). Contrary to the zigzag fiber, where the dominant interactions are n ± 2, in solenoid models they were found to be n ± 5 or n ± 6 (Robinson et al., 2006; Grigoryev et al., 2009), where n represents the position of the reference nucleosome.

Besides the number of nucleosomes at the start of the fiber, and taking into consideration the fact that linker DNA length is not always the same across the fiber, different nucleosome repeat lengths (NRL) produce different fiber configurations, and alter the propensity of a fiber to unfold. In Collepardo-Guevara and Schlick (2014), MC simulations were performed on coarse-grained oligonucleosome fibers (**Figure 1C**) to study these variations. A variety of structures was observed, leading to the perhaps unsurprising conclusion that structures with highly varying NRL were more compact than uniform structures, a direct consequence of fewer topological constraints. In relation to gene expression, the study also found that transcriptionally active cells presented shorter NRLs, while in inactive cells the opposite was observed (Gilbert et al., 2004). In the coarse-grained model, shorter NRL fibers arranged in ladder-like forms, while medium fibers arranged in zig-zags and longer NRLs resulted in heteromorphic structures (Grigoryev et al., 2009).

Nucleosomes bearing histone modifications, or even fewer histones than the canonical octamer (Winogradoff et al., 2015), have also been studied as a factor influencing chromatin compaction. In the study by Diesinger and Heermann (2009), a genome folding model was constructed using Monte Carlo (MC) simulations and introducing histone and nucleosome depletion. In a subsequent paper, the role of epigenetic modifications regarding nucleosome depletion was investigated, and MC data were compared to 5C and fluorescence in situ hybridization (FISH) (Diesinger et al., 2010). Even though full atom models are very instructive at the mononucleosome scale, in certain mesoscale chromatin models (Kulaeva et al., 2012), DNA base pairs are represented as rigid bodies, with parameters that account for orientation and displacement. Oftentimes, in more coarse-grained models, nucleosomes are treated as rigid bodies with concentrated charge and the dynamics of the histone tails are modeled as Gaussian distributions or as series of beads. Works such as Kepper et al. (2008) and Giorgetti et al. (2014) model chromatin as an inextensible chain of beads, whose distance depends on the spatial scale of the desired simulations.
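The idea of an inextensible chain of beads can be illustrated with the simplest polymer model of this kind, a freely jointed chain. The sketch below is not the model of any of the cited works: it ignores excluded volume, bending rigidity, and all interaction potentials, and the function names, bond length, and chain sizes are arbitrary choices for the example.

```python
import math
import random


def random_unit_vector(rng):
    """Uniformly distributed direction from normalized Gaussian components."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]


def freely_jointed_chain(n_bonds, bond_nm, rng):
    """Bead positions of a chain with fixed bond length and random bond directions."""
    beads = [[0.0, 0.0, 0.0]]
    for _ in range(n_bonds):
        u = random_unit_vector(rng)
        beads.append([p + bond_nm * c for p, c in zip(beads[-1], u)])
    return beads


def mean_square_end_to_end(n_chains, n_bonds, bond_nm, seed=0):
    """Average squared end-to-end distance over independently generated chains."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_chains):
        chain = freely_jointed_chain(n_bonds, bond_nm, rng)
        total += sum(c * c for c in chain[-1])
    return total / n_chains


r2 = mean_square_end_to_end(n_chains=2000, n_bonds=50, bond_nm=10.0)
print(f"<R^2> = {r2:.0f} nm^2 (ideal-chain expectation: {50 * 10.0 ** 2:.0f} nm^2)")
```

For such an ideal chain the mean squared end-to-end distance scales as the number of bonds times the squared bond length, which gives a baseline against which the effect of added constraints and potentials can be judged.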

Works such as Koslover and Spakowitz (2009) and Koslover et al. (2010) aim to optimize chromatin morphology through studying its dependence on linker DNA elasticity and length, introducing the role of inter-nucleosome core particle (NCP) interaction potentials in the packing of the fiber. Such works often use MC or Brownian dynamics simulations (Wedemann and Langowski, 2002; Langowski, 2006) and model electrostatic interactions based on potentials at various levels of sophistication. In Koslover et al. (2010), the chromatin fiber is constructed as a helical array by cyclically repeating a fundamental structure, defined as two nucleosomes and the linker DNA between them, in which nucleosomes are treated as rigid bodies and linker DNA as a series of beads. As we mentioned previously, histone modifications are also relevant factors for chromatin compaction, and are sometimes used as model parameters. An example is the work of MacPherson et al. (2018), a coarse-grained polymer MC model that uses methylation as a parameter to study chromatin dynamics and conformation statistics.

In the work of Schiessel et al. (2001), so-called two-angle models were developed, using linker DNA entry and exit angles and NCP twist angles, generating ensembles of minimum energy conformations through MC and analysing their dynamics through Brownian dynamics. NCP geometry itself becomes a parameter in several works (Kepper et al., 2008; Stehr et al., 2008; Kulaeva et al., 2012), in which internucleosomal interactions are specifically studied as triggers for compaction. When it comes to the representation of the NCP as a rigid body, shapes such as an oblate ellipsoid or an oblate spherocylinder are more accurate than simple spheres. In Kepper et al. (2008), a coarse-grained computer model was applied to a sample pool of 101 nucleosome arrays, using different chromatin models with and without the presence of linker DNA. By analysing energy landscapes, it was shown that nucleosome spacing is relevant to chromatin stability, with the highest destabilization occurring at a 2 bp shift. Energy variations were compared to values from chromatin stretching experiments (Cui and Bustamante, 2000). After surpassing the 2 bp energy barrier, nucleosome repositioning toward a new conformation, rather than returning to the original one, becomes more energetically favorable. Nucleosome orientation was also shown to be of importance, since, for example, it was observed that in cases where a nucleosome was oriented transversally it occupied more volume and caused its neighbors to be pushed further apart, hindering close packing.
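The geometric core of a two-angle description can be sketched in a few lines. The code below is a minimal illustration, not the implementation of Schiessel et al. (2001): nucleosomes are reduced to points joined by straight linkers of fixed length, `deflection` is taken here as the angle between consecutive linker directions, and `rotation` as the twist of the bending plane about the linker. These parameter names and conventions are assumptions made for the example, and no energies are evaluated.

```python
import math


def _rotate(v, axis, angle):
    """Rotate vector v about a unit axis by angle (Rodrigues' formula)."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, v))
    cross = (axis[1] * v[2] - axis[2] * v[1],
             axis[2] * v[0] - axis[0] * v[2],
             axis[0] * v[1] - axis[1] * v[0])
    return tuple(v[i] * c + cross[i] * s + axis[i] * dot * (1 - c) for i in range(3))


def two_angle_fiber(n_nucleosomes, linker_nm, deflection, rotation):
    """Nucleosome positions for fixed linker length and two angles (radians)."""
    pos = [(0.0, 0.0, 0.0)]
    t = (1.0, 0.0, 0.0)  # direction of the current linker
    n = (0.0, 0.0, 1.0)  # normal to the current bending plane (kept perpendicular to t)
    for i in range(n_nucleosomes - 1):
        if i > 0:
            n = _rotate(n, t, rotation)    # twist the bending plane about the linker
            t = _rotate(t, n, deflection)  # bend the linker within that plane
        last = pos[-1]
        pos.append(tuple(last[k] + linker_nm * t[k] for k in range(3)))
    return pos


# Arbitrary example values: 12 nucleosomes, 20 nm linkers.
fiber = two_angle_fiber(12, 20.0, math.radians(50), math.radians(110))
```

With a rotation of zero the chain stays in a plane and curls into an arc, with a rotation of π it becomes a planar zigzag, and intermediate values give three-dimensional helical arrangements, which is how two angles alone can span a family of fiber geometries.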

### 2.3. Topological and Fractal Models

During the past decade, great progress has been made in the study of chromatin organization due to the advent of Chromosome Conformation Capture (3C) technologies. The field was particularly revolutionized by Hi-C, which provides the interaction frequencies between loci of an entire genome. 3D reconstructions of genomic regions and even entire genomes are possible, using Hi-C data, through structural inference and statistical methods (Lesne et al., 2014; Varoquaux et al., 2014). There are two main categories of techniques to generate 3D structures from Hi-C contacts: ensemble approaches and consensus approaches. In the latter case, the Hi-C data are considered as a single ensemble, while in the former models different categories of structures are created from the data. It has been suggested recently that it might be possible to reconstruct the diploid 3D chromatin structures (Cauer et al., 2019).

It can be of interest to combine results from high-throughput techniques, such as Hi-C, with computer simulations. In Ohno et al. (2019), parallels were drawn between protein structure and chromatin. Through a combination of Hi-C data at nucleosomal resolution obtained at several cell phases and coarse-grained simulations, Ohno et al. observe two general secondary structure types in chromatin, which they call α-tetrahedron and β-rhombus, as an analogy to the α-helix and β-sheet structures in proteins, supporting the claim that fibers can alternate between these structures when nucleosome positioning changes. Information on nucleosome orientation was gleaned through analysis of the spatial proximity between DNA entry and exit points in individual nucleosomes across the genome and their 3D positioning. Solvation effects were not directly taken into account, as nucleosomes were modeled as space-filling objects, and linker DNA was also implicitly treated.

In the study of compaction and larger scale interactions within the chromatin fiber, for example for characterizing the Topologically Associating Domains (TADs), loop extrusion models are very significant. TADs are regions of the genome with enhanced contact frequency, identifiable on Hi-C maps as squares. During loop extrusion, Loop Extrusion Factors (LEFs), such as cohesin, interact with chromatin, inducing the formation of loops until they encounter a Border Element (BE), such as CTCF. It has been observed by Goloborodko et al. (2016) that macroscopic loop characteristics depend on the abundance of LEFs. Loop extrusion models provide explanations for experimental observations, such as the preferential orientation of CTCF, the enrichment of TAD boundaries in proteins with architectural functions, and TAD merging in LEF deletion experiments, and could provide insight on chromosome-level phenomena (Fudenberg et al., 2016). Polymer simulations are frequently used by loop extrusion models to make predictions and to validate analytical models. Loop formation has also been studied with mesoscopic models, where it was observed to depend on linker histone presence, ion concentration, and linker DNA length (Bascom et al., 2016). In addition to the "one-sided" loop extrusion mechanism described above, recent research indicates that "two-sided" loop extrusion might prove to be more robust in explaining experimental data (Banigan et al., 2019).
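The loop extrusion mechanism described above lends itself to a very small lattice sketch. The following toy model is an assumption-laden illustration rather than any published implementation: a single LEF is loaded on a one-dimensional lattice, its legs move outward one site per step, and a leg halts when it reaches a BE site or the lattice end. LEF unbinding, collisions between LEFs, and stochastic stepping are all omitted, and the function and parameter names are hypothetical.

```python
def extrude_loop(load_site, barriers, lattice_size, steps, two_sided=True):
    """Return the (left, right) anchor sites of the loop after `steps` steps.

    A leg stops moving once it sits on a barrier site or at the lattice edge.
    With two_sided=False only the right leg moves (one-sided extrusion).
    """
    left, right = load_site, load_site + 1
    blocked = set(barriers)
    for _ in range(steps):
        if two_sided and left > 0 and left not in blocked:
            left -= 1
        if right < lattice_size - 1 and right not in blocked:
            right += 1
    return left, right


# One LEF loaded at site 50 between two border elements at sites 20 and 80.
both = extrude_loop(50, barriers=[20, 80], lattice_size=100, steps=200)
one = extrude_loop(50, barriers=[20, 80], lattice_size=100, steps=200, two_sided=False)
print(both, one)
```

In this toy setting the two-sided LEF ends up bridging the two border elements, whereas the one-sided LEF leaves the stretch between its loading site and the left border unextruded, which gives an intuition for why the two mechanisms can differ in how well they reproduce domain boundaries.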

In the last decade, there has been growing interest in fractal models describing chromatin, and part of the chromatin modeling community, particularly emerging from polymer physics, has been focusing on the possibility that chromatin organizes itself as a fractal, especially since a similar state was proposed in the seminal paper on the Hi-C method by Lieberman-Aiden et al. (2009). In this work, a distinct case of the previously theorized globular equilibrium model was proposed for the Mbp scale: the fractal globule, otherwise called crumpled globule (Grosberg et al., 1988), a polymer conformation that enables maximally dense packing while preserving the ability to easily fold and unfold any genomic locus (Lieberman-Aiden et al., 2009; Mirny, 2011; Tamm et al., 2015) (**Figure 1D**). In such models, as in polymer models for chromatin in general, chromatin is considered as a flexible polymer fiber, and the notion of the single nucleosome is lost. Because of their large scope, these kinds of models can be relevant for large scale systems or even the entire genome.

Distinct chromosomal regions can be modeled as equilibrium globules, structures used to describe polymers in poor solvent conditions (Lieberman-Aiden et al., 2009). The chromatin fiber could assume a Peano curve conformation, which represents a continuous fractal trajectory that densely fills space without crossing itself. In fractal globules, compaction is achieved through the collapse of the globule, and it has been shown that the fractal globule has the ability to organize territorially, alluding to chromosome territories, i.e., distinct regions in the nucleus occupied by certain chromosomes (Tamm et al., 2015), in contrast with the previously proposed equilibrium globule, which does not present such organization. In the fractal globule, the number of interactions as a function of volume shows a linear correlation, which leads to the interdigitation of different regions in the globule with each other, allowing for extensive genomic cross talk (Mirny, 2011) (**Figure 1D**). This is particularly interesting for two main reasons: cross talk has been observed in simulations between the regions, and fractal globules unfold in an optimal way, which is relevant in the study of transcription.
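The properties attributed to a space-filling fractal trajectory can be checked on a simple example. The sketch below uses the two-dimensional Hilbert curve, a close relative of the Peano curve, as a stand-in; it is a planar caricature chosen for brevity, not a model of chromatin, and the function name is hypothetical. The index-to-coordinate conversion is the standard iterative one.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate/flip the quadrant so sub-curves join up
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


order = 3
path = [hilbert_d2xy(order, d) for d in range(4 ** order)]

visits_every_site_once = len(set(path)) == len(path) == (2 ** order) ** 2
steps_are_local = all(abs(x1 - x2) + abs(y1 - y2) == 1
                      for (x1, y1), (x2, y2) in zip(path, path[1:]))
first_quarter_is_compact = all(x < 4 and y < 4 for x, y in path[:16])
```

The three checks correspond to dense filling of space, a continuous path that never revisits a site, and the fact that a contiguous stretch of the chain occupies a compact spatial block, the latter being a loose analogue of the territorial organization discussed above.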

However, it needs to be noted that the fractal globule is a metastable state, unlike the equilibrium globule, and that its lifetime depends on topological constraints, which, in real cells, can be affected by enzymes and DNA-binding proteins. Fractal globules have been observed experimentally in Hi-C experiments (Lieberman-Aiden et al., 2009; Rao et al., 2014; Ghosh and Jost, 2018) and Small Angle Neutron Scattering (SANS) experiments (Ilatovskiy et al., 2012; Iashina et al., 2017). The relationship between the physical environment of a fractal chromatin fiber and transcription has been studied in several works, such as Almassalha et al. (2017), in which the analytical correspondence between changes in the fractal dimension of the chromatin fiber and increment of chromatin accessibility and compaction heterogeneity was studied. Furthermore, the authors speculated that differences in the transcription of a certain gene might be influenced by folding of neighboring genomic regions. The findings were supported by microscopy measurements on cancer cells.

Fractal globule models have been criticized based on the argument that self-similarity cannot be assessed in only a couple of orders of magnitude. However, researchers in the field, such as Bancaud et al. (2012), claim that, even though mathematical fractals are self-similar ad infinitum, physical fractals are only self-similar within certain orders of magnitude, typically 2 or 3, while chromatin architecture spans 4 or more orders of magnitude, and a common fractal architecture would connect all of them under a single topological theme, without the need for separate structures in each order of magnitude.

To conclude this section, we present a summary table (**Table 1**) of the models mentioned, categorized by the final order of magnitude that they treat (e.g., single nucleosomes, oligonucleosome arrays, entire genome). We include information on the computational methods used, and when available, the type of experimental data used for result validation.

## 3. ELECTROSTATIC INTERACTION IN CHROMATIN: AN INTRINSICALLY MULTISCALE PHENOMENON

At large scales in the chromatin fiber, structures are approximately electrostatically neutral, allowing for an average treatment of electrostatics and solvation in polymer models for chromatin. However, at the NCP and oligonucleosome scale, electrostatics and solvation become extremely important, due to the high charge of the DNA. The charges present on the DNA backbone are partly neutralized by the winding of DNA around the histone core, especially through the effect of the histone tails, and partly through counter-ions present in the nuclear environment. The modeling of internucleosomal interactions using reductionist analytical potentials, which omit the explicit role of histone tails, can cause secondary, but still relevant, electrostatic effects to be overlooked.

Considering the biological importance of different ionic types, Mg<sup>2+</sup> is particularly significant, as it has been found to promote nucleosome condensation and aggregation and could promote linker DNA bending, because in its presence interactions of first and third neighboring nucleosomes are boosted (Grigoryev et al., 2009). Tetravalent cations, on the other hand, require lower concentrations to induce compaction (Zinchenko et al., 2017). In Fan et al. (2013), systems of 1–10 nucleosome core particles (NCPs) were studied using a coarse-grained model in order to investigate the effects of monovalent, divalent, and trivalent cations on these structures, reproducing experimental data. It was observed that an increase in K<sup>+</sup> ions amplified the repulsive internucleosomal electrostatic interaction; increasing Mg<sup>2+</sup> concentration caused partial aggregation, and an increase in CoHex<sup>3+</sup> ions triggered a strong mutual internucleosomal attraction in 10 NCP systems, therefore showing that the aggregation of NCPs is different under the effect of different types and concentrations of counterions.

Multivalent ions and the effect of their distribution around NCPs on chromatin conformation were also studied in Gan and Schlick (2010), using a mean-field Poisson-Boltzmann Equation (PBE) approach, with an emphasis on shielding charges, which aggregate particularly around DNA and the exposed parts of the histone tails. The fact that a surface needs to be exposed to solvent in order for ions to bind to it makes ion-caused electrostatic screening (a change in the effective electric charge) and ion-chromatin interactions in general directly dependent on compaction. Calculations showed that the enhanced screening due to divalent ions might not only be because of their higher charge, but also because they form a denser layer of counterions around the NCP, and fluctuations in this layer are correlated to different fiber conformations. This makes it even more evident that the topology of compaction is a key determinant for chromatin-ion interaction. It was observed in these simulations that the shielding charge arising from both monovalent and divalent ions was linearly correlated with the ionic strength of the solution.
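To make the notions of ionic strength and screening concrete, the sketch below evaluates the textbook ionic strength, I = ½ Σ c<sub>i</sub> z<sub>i</sub><sup>2</sup>, and the corresponding Debye screening length from linearized (Debye-Hückel) theory. It is a simplified illustration and not the PBE treatment of the cited work: the relative permittivity of water (78.4) and the temperature (298.15 K) are assumed values, the salt concentrations are arbitrary, and linearized theory is known to be a poor approximation near highly charged DNA and for multivalent ions.

```python
import math

# Physical constants in SI units.
E_CHARGE = 1.602176634e-19   # elementary charge, C
K_B = 1.380649e-23           # Boltzmann constant, J/K
N_A = 6.02214076e23          # Avogadro constant, 1/mol
EPS_0 = 8.8541878128e-12     # vacuum permittivity, F/m


def ionic_strength(ions):
    """I = 1/2 * sum(c_i * z_i**2) for (molar concentration, valence) pairs."""
    return 0.5 * sum(c * z * z for c, z in ions)


def debye_length_nm(i_molar, eps_r=78.4, temp_k=298.15):
    """Debye screening length (nm) for an ionic strength given in mol/L."""
    per_m3 = i_molar * 1000.0 * N_A  # mol/L -> particles per cubic metre
    kappa_sq = 2.0 * E_CHARGE ** 2 * per_m3 / (EPS_0 * eps_r * K_B * temp_k)
    return 1e9 / math.sqrt(kappa_sq)


kcl = ionic_strength([(0.15, +1), (0.15, -1)])     # 150 mM KCl (arbitrary example)
mgcl2 = ionic_strength([(0.01, +2), (0.02, -1)])   # 10 mM MgCl2 (arbitrary example)
print(f"KCl:   I = {kcl:.3f} M, Debye length = {debye_length_nm(kcl):.2f} nm")
print(f"MgCl2: I = {mgcl2:.3f} M, Debye length = {debye_length_nm(mgcl2):.2f} nm")
```

Because the valence enters squared, a divalent salt contributes three times the ionic strength of a monovalent salt at the same molar concentration, and the screening length shrinks with the square root of the ionic strength.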

TABLE 1 | Computational and experimental works mentioned in this review (partial account), listed under the scale of interest.

Computational techniques and experimental data used are listed, when applicable.

In the study of structures as large and complex as chromatin, it has been proposed in Izadi et al. (2016) that implicit solvent Generalized Born (GB) simulations would be preferable to traditional fully explicit MD, in order to circumvent computational limitations. However, standard GB scales poorly with the number of solute atoms and, in this work, a multiscale atomistic GB model that incorporates improvements in the electrostatic calculations is presented, the accuracy of which was evaluated through point-by-point comparison with PBE calculations. Taking advantage of the natural hierarchical organization and charge distribution of chromatin, Izadi et al. used approximate point charges to calculate electrostatic interactions between distant points in a 40-nucleosome structure, containing ∼1 million atoms, focusing particularly on the behavior of the histone tails. They were able to reproduce experimental findings of the interaction of the H3 histone tail and the linker DNA. The GB approach proved the existence of viable alternatives that drastically reduce the cost of conformational sampling in very large structures.
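The poor scaling of standard GB comes from its double sum over all pairs of solute charges. As an illustration, the sketch below evaluates the polarization energy in the widely used functional form of Still and co-workers. It is not the multiscale model of Izadi et al.; the function name, the dielectric constants, and the example charges and Born radii are assumptions chosen for illustration.

```python
import math

COULOMB_KCAL = 332.0636  # Coulomb constant in kcal*Angstrom/(mol*e^2)


def gb_polarization_energy(coords, charges, born_radii, eps_in=1.0, eps_out=80.0):
    """Generalized Born polarization energy (kcal/mol), Still-type formula.

    coords: list of (x, y, z) in Angstrom; charges in units of e;
    born_radii: effective Born radii in Angstrom.
    The double loop over all pairs (including i == j) costs O(N^2).
    """
    prefactor = -0.5 * COULOMB_KCAL * (1.0 / eps_in - 1.0 / eps_out)
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(n):
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            rr = born_radii[i] * born_radii[j]
            # Interpolates between the Born limit (r = 0) and Coulomb (large r).
            f_gb = math.sqrt(r2 + rr * math.exp(-r2 / (4.0 * rr)))
            total += charges[i] * charges[j] / f_gb
    return prefactor * total


# Hypothetical two-charge system: an ion pair 4 Angstrom apart.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
energy = gb_polarization_energy(coords, charges=[+1.0, -1.0], born_radii=[2.0, 2.0])
print(f"GB polarization energy: {energy:.1f} kcal/mol")
```

For a single charge the expression reduces to the Born solvation energy of an ion. Hierarchical schemes such as the one described above reduce the quadratic cost by replacing distant groups of atoms with a few approximate point charges.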

One could not conclude a discourse on chromatin electrostatics without mentioning the effect of the histone tails, which have been found to promote stability of the linker histone on the NCP. In some models, histone tails are modeled as a series of beads with one positive charge per bead (Gan and Schlick, 2010; Fan et al., 2013; Korolev et al., 2014). It was seen by Shaytan et al. (2016) that certain histone tail configurations promote DNA bulging at entry and exit sites, possibly contributing to the formation of twist defects in the nucleosomal DNA. Twist defects are DNA deformations that allow for one more or less DNA bp in positions where DNA interacts closely with histones (Brandani et al., 2018). They are important, among other reasons, because their presence causes the formation of nucleosomes with 146 bp instead of the usual 147 (Pasi and Lavery, 2016), due to overwinding and stretching of the DNA (Davey et al., 2002). They also speculated that the presence of arginines and lysines might impose constraints on histone tail motion because of attractive electrostatic interactions. Contacts between DNA and histones were seen to be dominated by the histone tails, making up 60% of protein-DNA interactions in the nucleosome, rapidly wrapping around the DNA (in Shaytan et al., 2016, it was observed that they do so in the first 20 ns of the simulation).

In another study, the N-terminal of the H4 histone tail was observed to interact with the "acidic patch" present on the surface of adjacent nucleosomes, a small groove formed by eight residues, six belonging to H2A and the remaining to H2B, which constitutes a region of highly negative charge density on the nucleosome surface, serving as a hot-spot for DNA-binding proteins and histone tails (Kalashnikova et al., 2013; McGinty and Tan, 2014; Zhou et al., 2018). Throughout 1 µs-long MD simulations in Shaytan et al. (2016), the NCP is seen to be very stable in dynamics, in contrast to histone tails and linker DNA: large scale unwrapping or opening of NCP DNA were not observed, even when simulations were performed in 1M salt concentration, under which conditions they are known to occur (Wilhelm et al., 1978). This indicates that such phenomena might take place on longer time scales. Of particular interest are the histone H3 tails, which have been suggested by experiments (Kato et al., 2009) to form stable folded structures, and even to potentially compete with other DNA-binding proteins, affecting accessibility of epigenetically modified sites in the minor grooves.

It has already been mentioned that the presence of A-tracts can change the curvature of DNA, causing the minor grooves to be narrower than those in segments with lower curvature, and locally enhancing negative electrostatic potentials. In Rohs et al. (2009), PBE calculations were performed on DNA, showing that the electrostatic potential caused by the DNA backbone had intensity peaks inside the major and minor grooves. The position of these peaks correlates with the positions of arginine residues on the histone core. The previously observed binding preference for arginines over lysines in minor grooves, especially in narrower ones, was partly explained via a combination of electrostatic and desolvation effects. For the study of minor groove geometry, all the crystal structures of protein-DNA complexes containing at least one base atom–amino acid contact were analyzed. Analysis of nucleosomal DNA was based on the nucleosome structures available in the Protein Data Bank (PDB) at the time.

### 4. THE ROLE OF SOLVATION IN CHROMATIN COMPACTION

The role of the solvent in biomolecular interactions is known to be crucial. In part, this is because of solvent-mediated electrostatic effects, namely the screening by water molecules and by ions in solution. In addition, there is the so-called cavity formation phenomenon, which penalizes the occurrence of solvent-excluded regions. Chromatin spatial arrangement, due to NCP charge, size and porosity, is expected to be particularly affected by these phenomena, which must be accurately considered. It has already been described that the fundamental unit of chromatin, the nucleosome, forms through the complexation of the negatively charged DNA polymer with the positively charged histone protein octamer. At the molecular level, this process is governed by a number of interactions, such as hydrogen bonds, salt bridges, and water-mediated interactions occurring along the positively charged arginine anchors that intercalate deep inside the minor grooves of DNA facing the histone core (McGinty and Tan, 2014; Gebala et al., 2019). Regarding histone core-DNA electrostatic interactions, it is known that every nucleosome presents 14 non-covalent histone-DNA contacts, at the sites of arginine residues (Szerlong and Hansen, 2011).

Solvent exposure affects electrostatic interactions at the nucleosome level: compared to H3 and H4 histones, the two H2 variants are more solvent exposed, making them more accessible to chromatin-binding proteins as well (Izadi et al., 2016). Specific ion binding sites and their location on the nucleosome are also of particular interest, and they can be studied using electron density maps in combination with chemical information (Davey et al., 2002). It has been observed that sodium preferentially condenses around regions rich in solvent accessible acidic residues, especially in areas with two or more acidic residues in close proximity (Materese et al., 2009). It is also speculated that, in chromatin fibers exhibiting high compaction, internucleosomal electrostatic repulsion could be reduced in intensity because of an increased neutralization of the DNA backbone charge by the neighboring histone cores and counterion screening.

The idea that the nucleosome is an impermeable object has been proven erroneous (Materese et al., 2009); in this work, it was seen that mobile ions are able to reach the NCP inner core because of high levels of local solvation (more than 1,000 water molecules). This led to the conclusion that the local value of dielectric constant in the region facing the histone core is larger than expected. The authors also looked into the mobility of water molecules on the first hydration layer of the nucleosome and, as expected, found them to be less mobile than bulk water molecules. Through detailed visualization of structured water at the protein-DNA interface, they also found that water molecules not only contribute significantly to the stability of DNA binding but also adapt histone surfaces to conformational variations of DNA, facilitating nucleosome dynamics. All-atom electrostatics calculations were conducted and compared to PBE calculations, observing a slight inconsistency between the two. PBE predicts that the most significant contribution to DNA charge neutralization comes from the enhancement of the electric field and that it is a result of the tight wrapping of the DNA around the histone core. These results indicate that close condensation of ions around the nucleosome can significantly reduce the short range effect of the nucleosomal charge, having as a natural consequence the facilitation of chromatin close packing.

In another work concerning NCP solvation (Davey et al., 2002), the solvent-accessible surface area (SASA) of nucleosome crystals with 147 and 146 bp was investigated. NCPs with 147 bp were found to possess a SASA of ∼74 Å<sup>2</sup> , which is distributed mostly in the cavities within the histone octamer and in the space between it and the DNA. The primary hydration layer of the NCP was found to contain slightly more than 2,000 water molecules, the positions of which were found to largely correspond to the positions of A-tracts, especially in the vicinity of the minor groove. Water was shown to be important in the two main mechanisms of protein-DNA recognition: direct readout (nucleotide chemically specific bonds) and indirect readout (sequence-dependent conformational features of DNA recognized by sterically complementary protein contacts). Structures termed "spines of hydration" were also observed, in which water molecules bind regularly to adenine N3 and thymine O2 atoms (Kopka et al., 1983). Structural analyses have shown that the phosphate groups are the most strongly solvated components of the DNA (Egli et al., 1998; Schneider et al., 1998).

In order to illustrate the porosity of the nucleosome, described in particular in Materese et al. (2009), we have conducted a study on the nucleosome crystal structure [PDB code 1kx5 (Davey et al., 2002), **Figure 2A**] using NanoShaper interfaced with VMD (Decherchi and Rocchia, 2013; Decherchi et al., 2018), which provides the Surface to Volume Ratio (SVR) and the number of cavities and pockets. We measure an SVR of 0.387 Å<sup>−1</sup>, which reflects a quite high porosity (Shirota et al., 2008), and a number of cavities and pockets. In **Figure 2C**, we visualize the channel traversing the nucleosome core, which significantly impacts on NCP accessibility to water and ions. Our results are consistent with previous qualitative analyses mentioned in this section, and indeed indicate that the nucleosome is highly solvated and porous. We have also constructed an electrostatic map of the nucleosome, using the potential computed with the DelPhi Poisson-Boltzmann solver (Rocchia et al., 2001) and constructing the SASA of the nucleosome with NanoShaper (videos of the full 3D structure are found in the **Supplementary Material**), as seen in **Figure 2B**, where it is possible to clearly see, among other features, the position of the acidic patch [residues E56, E61, E64, D90, E91, E92 of H2A and E102, E110 of histone H2B (Kalashnikova et al., 2013)] and the highly charged histone tails, both key elements in chromatin compaction and in chromatin interaction with DNA-binding proteins. This analysis also showed a minor acidic region on the surface of histone H4.
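For orientation, the measured SVR can be compared with that of smooth solids. The sketch below uses the analytical surface-to-volume ratios of a sphere and of a closed cylinder. The cylinder dimensions (radius 50 Å, height 55 Å) are an assumed, rough envelope of the nucleosome core particle and are not values computed by NanoShaper.

```python
def sphere_svr(radius):
    """Surface-to-volume ratio of a sphere: 3 / r (units of 1/length)."""
    return 3.0 / radius


def cylinder_svr(radius, height):
    """Surface-to-volume ratio of a closed cylinder: 2 / r + 2 / h."""
    return 2.0 / radius + 2.0 / height


MEASURED_SVR = 0.387  # 1/Angstrom, value reported in the text for PDB 1kx5

# Assumed smooth envelope of the NCP (illustrative dimensions, in Angstrom).
smooth = cylinder_svr(radius=50.0, height=55.0)
# Radius of the sphere that would have the measured SVR.
equivalent_radius = 3.0 / MEASURED_SVR

print(f"Smooth cylinder SVR: {smooth:.3f} 1/A")
print(f"Measured / smooth:   {MEASURED_SVR / smooth:.1f}")
print(f"Sphere with the measured SVR has radius {equivalent_radius:.2f} A")
```

Under these assumptions the measured value is several times that of the smooth envelope, which is consistent with a strongly corrugated surface pierced by cavities and channels, although atomic-scale roughness also contributes to the ratio.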

### 5. EXPERIMENTAL STUDIES OF CHROMATIN: FROM THE NUCLEOSOME TO THE NUCLEUS

Throughout this review, we have highlighted the main manifestations of the multiscale nature of chromatin, and we have explored the multitude of factors affecting its compaction. The interplay between simulations and experiments is crucial to reach a deep understanding of this complex system, and has given rise to breakthroughs that would have been impossible without the combination of the two approaches. Experimental investigations of chromatin can be carried out at different scales, similarly to computational approaches. Having already mentioned some experimental results validating computational models, we have specifically looked into some of the experimental techniques used in both small and large scales, from the NCP up to entire nucleus.

Starting from the nucleosome, experiments have been carried out to determine its crystal structure, with continuing endeavors starting from Luger et al. (1997), in which a 2.8 Å resolution structure of the NCP was obtained via X-ray crystallography, using reconstituted nucleosomes. In Luger's work, many of the structural elements of the nucleosome were uncovered, such as the number of base pairs wrapped around the octamer, which were unknown despite the fact that the histone octamer structure had already been observed. The histone tails and their structural role have also been studied to a great extent in Widom (1997). Since then, further structures with 147 bp (Davey et al., 2002) and 146 bp (Tachiwana et al., 2010) have been observed. The more detailed study of sub-structures, such as the histone tails, and of site-specific interactions (van Emmerik and van Ingen, 2019) required the use of NMR (Davey et al., 2002). In recent years, there has been growing interest in the study of NCPs using Cryo-EM (**Figure 3A**). The sample preparation protocols involved in this technique make it an interesting alternative to X-ray crystallography for structural studies. Cryo-EM provided information on custom-made NCPs in studies relevant to DNA-binding protein-NCP interactions (Takizawa et al., 2018) and also on interactions of the NCP with components of the nuclear environment, such as the nuclear pore complex (Kobayashi et al., 2019). The orientation of NCPs has also been observed by Cryo-EM in a recent study, which reports that in the most common arrangement of a pair of NCPs they are placed in parallel, with facing histone octamers (Bilokapic et al., 2018).

X-ray crystallography provides structures with atomic resolution, which are key for atomic-level studies. However, this approach has some limitations: it fails to provide good information on the more mobile domains of the NCP, and it cannot be used for large oligonucleosomes [the largest structures that have been crystallized to date are tetranucleosomes (Schalch et al., 2005; Ekundayo et al., 2017)]. In order to circumvent these constraints, one can turn to scattering techniques. SAXS studies have looked into the issue of whether the histone tails protrude into the solvent surrounding the NCP or associate with DNA at physiological salt conditions. The histone tails are notoriously hard to resolve in crystallography because of their size and intrinsically mobile nature (Kato et al., 2009; Zhou et al., 2012; Gao et al., 2013). Using SAXS, however, it is possible to indirectly observe whether the histone tails are solvated or adherent to the DNA, by measuring changes in the overall structure size. Bertin et al. (2004) have applied SAXS to study histone tails as well, focusing on the structural details of internucleosomal interactions and the effects that histone tails have on them. SAXS has often been used in conjunction with other techniques to correlate structural with dynamical data. In Mauney et al. (2018), SAXS, FRET, and MD were used to dissect the sequence-dependent DNA unwrapping mechanism. Fluorescence Correlation Spectroscopy (FCS) has been used in a work by Fan et al. (2013) to estimate NCP stacking energy. In this combined experimental and theoretical work, model parameters were tuned based on comparison with single molecule FCS and SAXS data, which also showed that histone tails facilitate NCP stacking by acting as bridges between NCP surfaces. FCS data were also used by Norouzi and Zhurkin (2018) to tune the parameters of an MC model of nucleosome arrays under the influence of external forces.
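The overall structure size probed by SAXS is commonly summarized by the radius of gyration R_g, which can be estimated at small scattering vectors q from the Guinier approximation, ln I(q) ≈ ln I(0) − q²R_g²/3. The sketch below fits this relation by ordinary least squares. The data are synthetic and noise-free, and the radius of 43 Å is an arbitrary illustrative value, not a result from the cited studies.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def guinier_rg(q_values, intensities):
    """Radius of gyration from a Guinier fit of ln I versus q^2."""
    q2 = [q * q for q in q_values]
    ln_i = [math.log(i) for i in intensities]
    _, slope = linear_fit(q2, ln_i)
    return math.sqrt(-3.0 * slope)


# Synthetic intensities for an assumed Rg of 43 Angstrom.
true_rg = 43.0
q = [0.005 + 0.002 * k for k in range(10)]  # 1/Angstrom, q * Rg stays below 1.3
intensity = [100.0 * math.exp(-(qi * true_rg) ** 2 / 3.0) for qi in q]
print(f"Fitted Rg: {guinier_rg(q, intensity):.1f} A")
```

In this picture, a larger fitted radius under given salt conditions would be consistent with tails extending into the solvent, and a smaller one with tails collapsed onto the DNA.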

Moving on from NCPs to larger structures, nucleosome arrays are the next step; besides SAXS (Howell, 2016), AFM has also been used to study arrays of varying lengths (**Figure 3B**). The advantage of using this technique for chromatin is twofold: it allows many measurements to be taken, making it good for statistical purposes, and it allows for the study of electrostatic and related interactions, such as the effect of differences in ionic concentration. The importance of ionic interactions with chromatin has naturally gained the attention of the experimental community. Studies such as Gan and Schlick (2010) have shown that Mn<sup>2+</sup> ions bind to the major DNA groove near CG pairs. In Krzemien et al. (2017), AFM was used to measure the changes in chromatin topological conformations depending on salt levels in the environment (**Figure 3C**). Studying NCP arrays in varying salt concentration revealed that array compaction has a non-monotonic salt dependence. Increasing salt concentration induces partial screening of the charges of the DNA backbone, therefore reducing the electrostatic interactions between DNA and histones, directly impacting on compaction. The stability of mononucleosomes has also been investigated in correlation with salt concentration (Hazan et al., 2015): in low to intermediate salt regimes, some partially disassembled states were observed (as also studied computationally in Rychkov et al., 2017), in which H2A/H2B histone dimers partially dissociate from the NCP. Regarding the mechanical properties of chromatin, DNA stiffness was observed to be salt-dependent as well, in accordance with other experimental and computational studies (Rohs et al., 2009; Pasi et al., 2015, 2017; Pasi and Lavery, 2016); the persistence length was seen to increase at higher ionic concentrations.

Optical microscopy, a field traditionally tied to biological applications, is a natural candidate for chromatin studies, due to the advances in resolution obtained by super-resolution techniques, and to the fact that label-free optical microscopy methods have been on the rise for the past decade. Experiments using the single molecule super-resolution microscopy technique STORM (Ricci et al., 2015) have observed units of chromatin organization termed "clutches" by the authors, heterogeneous groups of various sizes. The size of the clutches has been speculated by Ricci et al. to be related to the pluripotency capacity of each cell, and the median number and nucleosome density in the nucleus was found to be cell-specific. From longer nucleosome arrays to chromatin fiber, other super-resolution techniques, such as Photoactivated Localization Microscopy (PALM), have been used to extrapolate chromatin topology in the nucleus from nucleosome dynamics. Label-free techniques are also used to study chromatin at the nuclear level, such as Circular Intensity Differential Scanning (CIDS) in Le Gratiet et al. (2018). In this work, it is shown that the main advantage of this polarimetric method compared to standard fluorescence microscopy is the capability to obtain specific contrast mechanisms due to the chiral organization of the DNA in a label-free approach without a priori knowledge of the sample. Indeed, it is shown that the stronger signal region corresponds to a more compacted DNA region, i.e., heterochromatin, while the weaker signal, such as for the nucleoli, corresponds to a lower compaction, i.e., a euchromatin region (**Figure 3D**).

Experimental validation has also been attempted for some of the most exotic theoretical models proposed for chromatin, namely those hypothesizing fractal globules. Fractal globules have been observed experimentally in Hi-C experiments (Lieberman-Aiden et al., 2009) and Small-Angle Neutron Scattering (SANS) experiments (Ilatovskiy et al., 2012; Iashina et al., 2017). The important question tackled by works on this topic is the way in which fractal states with stable long-lived properties are formed. SANS has been considered a good technique for experiments looking for fractal structures in the nucleus because of its extended spatial range, from ∼15 nm to 10 µm. The use of Cryo-Electron Tomography (Cryo-ET) has provided insight on the structure of mitotic chromosomes in fission yeast (Cai et al., 2018). SAXS and Cryo-EM have also been used in structural analysis of the fiber up to the chromosome level (**Figure 3A**) (Joti et al., 2012; Nishino et al., 2012; Maeshima et al., 2014, 2016).
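In Hi-C data, the fractal globule is usually distinguished from an equilibrium globule by how the contact probability P(s) decays with genomic separation s: approximately as s<sup>−1</sup> for the fractal globule and as s<sup>−3/2</sup> for the equilibrium globule. The sketch below recovers such an exponent from a log-log least-squares fit. The two data sets are synthetic and noise-free, and the range of separations is an arbitrary assumption made for illustration.

```python
import math


def power_law_exponent(xs, ys):
    """Least-squares slope of log(y) versus log(x): the exponent a in y ~ x**a."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


# Synthetic contact probabilities versus genomic separation s (in bp).
separations = [5e5 * 1.5 ** k for k in range(8)]
fractal_like = [s ** -1.0 for s in separations]      # P(s) ~ s^-1
equilibrium_like = [s ** -1.5 for s in separations]  # P(s) ~ s^-3/2

print(f"fractal-like exponent:     {power_law_exponent(separations, fractal_like):.2f}")
print(f"equilibrium-like exponent: {power_law_exponent(separations, equilibrium_like):.2f}")
```

The same log-log fit applies to scattering curves, where the slope of the intensity versus the scattering vector is related to the fractal dimension of the scattering object.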

## 6. CONCLUSIONS

Chromatin is an extremely complex system, the behavior of which is tuned both by mechanical and electrostatic factors, and by biological interactions. Simulations provide extremely useful insights, depending on the level of approximation used to represent the system, on the different mechanisms and factors influencing compaction. In this review we mention several computational works that used as inputs parameter sets acquired through experiments or evaluated their results by comparing them with preexisting experimental data. It is clear that combining simulations results with various experimental techniques, appropriate to the resolution of interest, can help shed light on the main determinants of chromatin compaction.

Electrostatics in chromatin encompasses an intricate combination of different mechanisms and the importance of its role in compaction and chromatin remodeling is paramount. The high negative charge of DNA is partially neutralized by the direct interaction of the latter with histones (including the effects of histone tails and the linker histone), but electrostatic stabilization of the chromatin fiber is achieved through a combination of this effect with long-range electrostatics and solvent screening. Simulations in which ionic interactions with chromatin at the NCP level are treated more accurately would be a great improvement to existing approaches. In addition, a more accurate representation of the nucleosome core is crucial when performing these analyses: solvation has proved to be a very important factor in nucleosome behavior, and neglecting it would hamper a correct understanding of chromatin compaction.

In summary, we have presented an overview of some, mostly theoretical and computational, approaches to the description of chromatin, from the nucleosomal to the cellular level, particularly focusing on the role of electrostatics and solvation as the driving mechanisms of chromatin conformational changes and equilibria. To complement this overview, we also presented some representative experimental approaches to study chromatin structure and dynamics, at both small and large scales.

### AUTHOR CONTRIBUTIONS

AB performed most of the reviewing and writing tasks. SD and AD organized and discussed the experimental part. SZ worked on the analysis of the role of solvation. WR and AB decided the organization of the manuscript and checked the consistency of the work. All authors reviewed and checked the manuscript.

### ACKNOWLEDGMENTS

We acknowledge PRACE for awarding us access to Marconi at CINECA, Italy.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2020.00015/full#supplementary-material

Supplementary Video 1 | View of the full 3D surface of the nucleosome core particle (PDB code 1KX5), coloured by electrostatic potential. Blue regions indicate positive surface potential, while red regions indicate negative potential.

Supplementary Video 2 | View of the full 3D Solvent Excluded Surface Area of the nucleosome core particle (PDB code 1KX5). Blue regions indicate the channel traversing the histone core, and an adjacent large cavity.

### REFERENCES

revealed by mesoscale modeling of oligonucleosomes. Nucleic Acids Res. 40, 8803–8817. doi: 10.1093/nar/gks600


nucleosome-nucleosome interactions under varying ionic conditions. PLoS ONE 8:e54228. doi: 10.1371/journal.pone.0054228


modulating higher order chromatin structure. J. R. Soc. Interface 10:20121022. doi: 10.1098/rsif.2012.1022


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Bendandi, Dante, Zia, Diaspro and Rocchia. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spontaneous Embedding of DNA Mismatches Within the RNA:DNA Hybrid of CRISPR-Cas9

Brandon P. Mitchell<sup>1†</sup>, Rohaine V. Hsu<sup>1†</sup>, Marco A. Medrano<sup>1</sup>, Nehemiah T. Zewde<sup>1</sup>, Yogesh B. Narkhede<sup>1</sup> and Giulia Palermo<sup>1,2</sup>\*

<sup>1</sup> Department of Bioengineering, University of California, Riverside, Riverside, CA, United States, <sup>2</sup> Department of Chemistry, University of California, Riverside, Riverside, CA, United States

CRISPR-Cas9 is the forefront technology for editing the genome. In this system, the Cas9 protein is programmed with guide RNAs to process DNA sequences that match the guide RNA forming an RNA:DNA hybrid structure. However, the binding of DNA sequences that do not fully match the guide RNA can limit the applicability of CRISPR-Cas9 for genome editing, resulting in the so-called off-target effects. Here, molecular dynamics is used to probe the effect of DNA base pair mismatches within the RNA:DNA hybrid in CRISPR-Cas9. Molecular simulations revealed that the presence of mismatched pairs in the DNA at distal sites with respect to the Protospacer Adjacent Motif (PAM) recognition sequence induces an extended opening of the RNA:DNA hybrid, leading to novel interactions established by the unwound nucleic acids and the protein counterpart. On the contrary, mismatched pairs upstream of the RNA:DNA hybrid are rapidly incorporated within the heteroduplex, with minor effect on the protein-nucleic acid interactions. As a result, mismatched pairs at PAM distal ends interfere with the activation of the catalytic HNH domain, while mismatches fully embedded in the RNA:DNA do not affect the HNH dynamics and enable its activation to cleave the DNA. These findings provide a mechanistic understanding to the intriguing experimental evidence that PAM distal mismatches hamper a proper function of HNH, explaining also why mismatches within the heteroduplex are much more tolerated. This constitutes a step forward in understanding off-target effects in CRISPR-Cas9, which encourages novel structure-based engineering efforts aimed at preventing the onset of off-target effects.

Keywords: CRISPR-Cas9, off-target effects, protein-nucleic acid interactions, molecular dynamics, RNA:DNA hybrid

#### Edited by:

Gennady Verkhivker, Chapman University, United States

#### Reviewed by:

Carlo Camilloni, University of Milan, Italy

Yinglong Miao, The University of Kansas, United States

\*Correspondence:

Giulia Palermo giulia.palermo@ucr.edu

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 20 November 2019

Accepted: 19 February 2020

Published: 17 March 2020

#### Citation:

Mitchell BP, Hsu RV, Medrano MA, Zewde NT, Narkhede YB and Palermo G (2020) Spontaneous Embedding of DNA Mismatches Within the RNA:DNA Hybrid of CRISPR-Cas9. Front. Mol. Biosci. 7:39. doi: 10.3389/fmolb.2020.00039

### INTRODUCTION

CRISPR (clustered regularly interspaced short palindromic repeats)-Cas9 is the core of a transformative genome editing technology that is innovating life science with cutting-edge impact in basic and applied biosciences (Doudna and Charpentier, 2014; Hsu et al., 2014). This technology is based on a protein/nucleic acid complex, composed of the endonuclease Cas9, which associates with guide RNAs to recognize and cleave complementary DNA sequences (**Figure 1**; Jinek et al., 2012). The Cas9 protein performs a site-specific recognition of the DNA, by binding a short sequence of 2–5 nucleotides, known as a Protospacer-Adjacent Motif (PAM), located within the DNA (Sternberg et al., 2014). Upon PAM binding, one strand of the DNA (the so-called target strand, TS) base-pairs with the guide RNA to form a 20 base-pair RNA:DNA hybrid structure, while the other DNA strand (the non-target strand, NTS) is displaced and subsequently accommodated in the protein.

The formation of a well-matched RNA:DNA hybrid is a fundamental step of the CRISPR-Cas9 function (Sternberg et al., 2015). Indeed, upon formation of the RNA:DNA hybrid, the catalytic HNH domain can change conformation from an inactive state (in which the catalysis is hampered, **Figure 1A**; Anders et al., 2014; Nishimasu et al., 2014) to a catalytically active conformation, which approaches the cleavage site on the TS (**Figure 1B**; Jiang et al., 2016). In spite of this fundamental requirement, the presence of DNA mismatches at specific positions of the RNA:DNA hybrid still enables the partial activation of the HNH domain (Fu et al., 2013; Hsu et al., 2013). This leads to off-target cleavages, which limit the applicability of CRISPR-Cas9 by introducing mutations at sites in the genome other than the desired target site. Several biophysical studies have investigated the effect of base pair mismatches within the RNA:DNA hybrid on the conformational dynamics of CRISPR-Cas9 (Singh et al., 2016; Chen et al., 2017; Dagdas et al., 2017; Yang et al., 2018). Single molecule and kinetics studies have revealed that the presence of 4 base pair mismatches at PAM distal ends can trap the catalytic HNH domain in an inactive conformation, also referred to as the "conformational checkpoint" (**Figure 1**, shown as a cartoon in panel A and as a 3D structure in panel B) (Dagdas et al., 2017). As a consequence, the cleavage of the TS is hampered, owing to the lack of the conformational changes that bring HNH into the immediate vicinity of the cleavage site. Conversely, up to 3 base pair mismatches at PAM distal ends still allow the repositioning of HNH, thereby resulting in off-target cleavages. These studies indicate that the occurrence of off-target cleavage is linked to the conformational states of HNH. In a recent computational study, we employed molecular dynamics (MD) simulations to investigate the factors affecting the HNH conformational dynamics prior to activation (Ricci et al., 2019).
Our study employed the Gaussian accelerated MD (GaMD) method (Miao et al., 2015) to broadly explore the conformational space of CRISPR-Cas9 in complex with an on-target DNA and in the presence of base pair mismatches. These simulations revealed that the presence of 4 base pair mismatches at PAM distal sites (i.e., at positions 17–20 of the RNA:DNA hybrid) induced an extended opening of the RNA:DNA hybrid, with formation of conserved interactions between the TS and the HNH domain. This effectively decreased the conformational mobility of the HNH domain. In contrast, up to 3 base pair mismatches (at positions 18–20) had a smaller conformational effect on the RNA:DNA hybrid and did not affect the conformational dynamics of HNH. These simulations thereby provided a theoretical rationale for the experimental evidence, describing the molecular interactions that "lock" HNH in the presence of 4 base pair mismatches at PAM distal ends (Chen et al., 2017; Dagdas et al., 2017; Yang et al., 2018).
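For reference, the boost used in GaMD has a simple harmonic form: when the system potential V falls below a threshold energy E, a term ΔV = (1/2)k(E − V)² is added, and no boost is applied otherwise (Miao et al., 2015). The sketch below only illustrates this functional form. The threshold and force constant are arbitrary illustrative numbers, whereas in actual GaMD simulations they are derived from potential energy statistics collected during the run.

```python
def gamd_boost(potential, threshold, k):
    """Harmonic GaMD boost: 0.5 * k * (E - V)**2 if V < E, else 0."""
    if potential < threshold:
        return 0.5 * k * (threshold - potential) ** 2
    return 0.0


def boosted_potential(potential, threshold, k):
    """Modified potential V* = V + dV sampled during the accelerated run."""
    return potential + gamd_boost(potential, threshold, k)


# Arbitrary illustrative values (kcal/mol): deeper minima receive larger boosts,
# which lowers the effective barriers between low-energy states.
E, K = 0.0, 0.05
for v in (-20.0, -10.0, -2.0, 5.0):
    dv = gamd_boost(v, E, K)
    print(f"V = {v:6.1f}  dV = {dv:6.2f}  V* = {boosted_potential(v, E, K):6.2f}")
```

Because the boost grows with the depth of the energy well, the modified surface is flatter than the original one while the ordering of the states is preserved for a sufficiently small force constant.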

However, mechanistic investigations of how DNA mismatches located upstream of the RNA:DNA heteroduplex affect the conformational dynamics of the hybrid structure and the HNH "conformational checkpoint" are absent. Knowledge of the conformational changes arising from base pair mismatches in the middle of the RNA:DNA hybrid is important to gain a deeper understanding of the molecular determinants of off-target binding, which consequently may offer insights for improving the specificity of CRISPR-Cas9. Moreover, understanding how base pair mismatches affect the RNA:DNA structure is important to characterize the dynamics of the heteroduplex itself. This is a key point considering the importance of RNA:DNA hybrids in a variety of biological processes, such as transcription, formation of Okazaki fragments and R-loop structures, as well as in eukaryotic chromosomes (Cheatham and Kollman, 1997; Rich, 2006; Shaw and Arya, 2008; Nadel et al., 2015; Palermo, 2019a; Terrazas et al., 2019).

In this research report, we extend our recent investigations to 4 additional model systems, which include base pair mismatches upstream of the RNA:DNA hybrid (**Figure 1**). The results have been analyzed in comparison with our recently published data (Ricci et al., 2019), thereby evaluating similarities and differences with base pair mismatches at PAM distal ends and with an on-target DNA. We show that while base pair mismatches at PAM distal sites induce an opening of the RNA:DNA hybrid, at upstream positions they are incorporated within the heteroduplex, with minor effect on the protein-nucleic acid interactions. Additionally, mismatches at PAM distal sites limit the mobility of HNH in the "conformational checkpoint" state and consequently affect its activation toward DNA cleavage. Conversely, mismatched pairs within the heteroduplex do not affect the dynamics of HNH, which can freely change conformation as needed to perform DNA cleavage.

### RESULTS AND DISCUSSION

To understand the effect of DNA mismatch pairs within the RNA:DNA hybrid on the conformational dynamics of CRISPR-Cas9 and on the HNH domain, we carried out molecular simulations. These investigations have been carried out in analogy to our recent study, which has investigated the effect of mismatch pairs at PAM distal ends (Ricci et al., 2019). In detail, molecular simulations have been performed on the X-ray structure of CRISPR-Cas9 capturing a "conformational checkpoint" state of the HNH domain (i.e., 4UN3.pdb) (Anders et al., 2014), thereby enabling us to understand if and how base pair mismatches could affect the dynamics of HNH prior to its activation. A GaMD method has been employed (Miao et al., 2015), adding a boost potential to the simulation that accelerates transitions between low-energy states (see section "Materials and Methods"). The method has been shown to enable broad sampling of the conformational space in large biomolecular systems (Miao and McCammon, 2016, 2018; Wang and Chan, 2017; Liao and Wang, 2018; Sibener et al., 2018), including CRISPR-Cas9 as apo form and in complex with nucleic acids (Palermo et al., 2017; Palermo, 2019b), or bound to off-target DNAs (Ricci et al., 2019). Recently, GaMD has been shown to sample long time scale motions in agreement with NMR relaxation experiments, showing that the method can efficiently capture the dynamics of large protein/nucleic acid complexes (East et al., 2020). A set of model systems has been built, introducing pairs of base pair mismatches ("mm") within the hybrid complex at positions 10 to 17 (i.e., mm@10–11, @12–13, @14–15, and @16–17, **Figure 1A**, bottom panel). The dynamics of these systems have been compared with the simulations of CRISPR-Cas9 binding to an on-target DNA and including 1 to 4 mismatches at PAM distal sites (i.e., mm@17–20, @18–20, @19–20, and @20), which we have recently published (Ricci et al., 2019). For each system, ∼1 µs of conformational sampling has been performed (see section "Materials and Methods"), as in our previous study and by employing the same simulation conditions, thereby enabling proper comparison.

## Dynamics of the RNA:DNA Hybrid in the Presence of DNA Mismatches

Molecular dynamics simulations of CRISPR-Cas9 bound to a fully matched RNA:DNA hybrid (i.e., on-target system) have revealed a stable Watson-Crick base pairing (**Figure 2A**, left panel), both at PAM distal ends and within the heteroduplex. Notably, transient openings at the end of a DNA duplex, or base flipping, are not unusual over long timescales in MD simulations, as shown by several research groups (Pérez et al., 2007, 2008; Mura and McCammon, 2008; Ricci et al., 2010; Ma et al., 2016). However, in the simulations of the on-target CRISPR-Cas9 system, the RNA:DNA hybrid maintains the Watson-Crick base pairing, stabilized by the protein framework, as observed in several conventional and GaMD simulations of this system (Palermo et al., 2016, 2017). Contrariwise, in the presence of base pair mismatches at PAM distal ends (i.e., at positions 16 to 20), we previously observed the opening of the RNA:DNA hybrid (central panel) (Ricci et al., 2019). Here, when we introduce DNA mismatches at the upstream positions (i.e., @10–11, @12–13, and @14–15), we detect that the RNA:DNA hybrid preserves its overall shape (right panel), similar to what is observed in the on-target system. In order to estimate the conformational changes of the RNA:DNA hybrid, we analyzed, in all simulated systems, the minor groove width from PAM distal ends up to the middle of the RNA:DNA hybrid (**Figure 2B**). As a result, we observe that the presence of base pair mismatches at PAM distal ends (i.e., mm@17 to 20) induced an increase of the minor groove width at positions 18–20, which corresponds to the hybrid opening. Notably, the hybrid opening is also observed when including mismatches at positions 16 and 17. This indicates that perturbations at position 17 (as in the mm@17–20 and mm@16–17 systems) lead to major distortions in the heteroduplex. Conversely, when introducing mismatches at positions 10–11, 12–13, and 14–15, the minor groove width of the RNA:DNA hybrid preserves the conformation of the on-target system.

To understand the effects of the base pair mismatches on the Watson-Crick base pairing, we have used a key geometrical descriptor of the base pair complementarity. We have selected the Propeller Twist parameter (**Figure 2C**), which describes the rotation of the two paired bases with respect to each other. Based on our previous study, this parameter enables us to properly characterize alterations in the base pairing along the RNA:DNA hybrid (Ricci et al., 2019). **Figure 2C** reports the distribution of the Propeller Twist angle along the dynamics for each base pair from PAM distal ends up to the middle of the RNA:DNA hybrid (i.e., from base pair bp20 to bp9). This analysis shows that the presence of base pair mismatches at positions 16 to 20 induces a remarkable loss of base pairing at PAM distal ends, as shown in the mm@20, mm@19–20, mm@18–20, mm@17–20, and mm@16–17 systems ("major distortion" in **Figure 2C**). On the contrary, the geometrical requirements for the base pairing reveal only a "minor distortion" for mismatches within the RNA:DNA hybrid (i.e., mm@10–11, mm@12–13, and mm@14–15). Notably, this local distortion is due to the loss of base pair interactions (mainly H-bonds), which is typical of mismatched DNA pairs. However, the analysis of the minor groove width (**Figure 2B**) shows that the hybrid preserves its overall shape when base pair mismatches are introduced in the middle of the structure. Hence, a combined analysis of the minor groove width and the base pair complementarity reveals that the presence of base pair mismatches within the hybrid does not influence the overall shape of the RNA:DNA hybrid, and that the mismatched base pairs remain embedded within the heteroduplex structure.

### Mobility of the HNH Domain in the Presence of DNA Mismatches

Our previous study has revealed that in the presence of 4 base pair mismatches at PAM distal ends, the DNA TS establishes conserved interactions with the HNH domain (Ricci et al., 2019). These interactions restrict the mobility of HNH and affect its conformational activation toward DNA cleavage, while also contributing to the widening of the RNA:DNA hybrid. Here, in order to assess the conformational mobility of HNH in the presence of base pair mismatches within the RNA:DNA hybrid, we performed Principal Component Analysis (PCA). This analysis enabled us to capture the essential degrees of freedom of the HNH domain (see section "Materials and Methods"). PCA has been carried out in comparison with the on-target system and with the system including 4 base pair mismatches at PAM distal ends (i.e., mm@17–20). **Figure 3A** reports the dynamics of the HNH domain along its first mode of motion (i.e., Principal Component 1, PC1), shown using arrows to indicate the direction and relative amplitude of the motions. The top panel shows a comparison between the system binding an on-target DNA and the system with 4 base pair mismatches at positions 17–20. In the mm@17–20 system, we observe that the unwound TS approaches the arrows corresponding to the HNH principal motion. A close-up view displays the interactions established by the DNA and the residues of the HNH domain. Notably, these interactions are stable along the dynamics, as discussed in our previous paper. The bottom panel reports the PCA for the simulated systems including base pair mismatches within the RNA:DNA hybrid. We observe that for base pair mismatches at positions 16–17, the TS displays an unwinding similar to that of the mm@17–20 system, with conserved interactions established with the HNH domain (close-up view). Indeed, the interaction between the nucleotide at position 17 and R904 is conserved in the two systems. This indicates that local distortions due to mismatched nucleobases at position 17, which is in close proximity to the HNH α-helices, can critically affect the dynamics of HNH. We note that the interaction established at position 17 involves the DNA backbone (rather than the nucleobases), which suggests that this interaction is not specific, but rather could be established also in the presence of different mismatched nucleobases. This hypothesis, however, warrants further investigations, which are currently ongoing in our lab as a follow-up of this study. On the contrary, base pair mismatches @10–11, @12–13, and @14–15 do not result in the approach of the TS to the HNH domain, resembling what is observed in the dynamics of the on-target system (top panel).

In order to characterize the conformational space sampled by the HNH domain, we plotted the first versus the second principal components (PC1 vs. PC2, **Figure 3B**). This analysis revealed that in the mm@17–20 system, HNH explores a narrower conformational space with respect to the remaining systems, indicating a diminished mobility. A narrow conformational space is also observed for the mm@16–17 system. As discussed above, in these two systems, the TS tightly interacts with the HNH domain, thereby limiting its conformational dynamics. In the systems including base pair mismatches within the RNA:DNA hybrid, the HNH domain explores a wider conformational space, similar to what is observed in the on-target system. This indicates that the dynamics of HNH is not significantly affected by base pair mismatches in the middle of the RNA:DNA hybrid.
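The PC1 vs. PC2 projection described above can be illustrated with a minimal sketch. It assumes the standard covariance-matrix formulation of PCA applied to pre-aligned Cartesian coordinates; the function name `pca_projection` and the synthetic trajectory are illustrative assumptions, not the authors' actual workflow or data.

```python
import numpy as np

def pca_projection(coords, n_components=2):
    """Project trajectory frames onto the leading principal components.

    coords: array of shape (n_frames, n_atoms, 3), assumed already
    superimposed on a common reference (alignment is not done here).
    """
    n_frames = coords.shape[0]
    x = coords.reshape(n_frames, -1)
    x = x - x.mean(axis=0)                      # remove the average structure
    cov = (x.T @ x) / (n_frames - 1)            # 3N x 3N covariance matrix
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    modes = evecs[:, order]                     # columns are PC1, PC2, ...
    return evals[order], modes, x @ modes

# Synthetic toy trajectory (hypothetical): 5 atoms, one dominant motion
# along the x coordinate of the first atom.
rng = np.random.default_rng(0)
traj = rng.normal(scale=0.1, size=(200, 5, 3))
traj[:, 0, 0] += rng.normal(scale=2.0, size=200)

variances, modes, proj = pca_projection(traj)
# proj[:, 0] and proj[:, 1] are the PC1 and PC2 coordinates of each frame;
# a narrower spread of these points indicates a more restricted domain.
```

In such a plot, a system whose frames cluster in a small region of the PC1/PC2 plane samples fewer conformations than one whose frames are widely scattered.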

To further characterize the mobility of the systems and to understand the relation between the dynamics of the nucleic acids and the HNH domain, we performed cross-correlation (CCij) analysis. This analysis enabled us to capture coupled motions between the protein Cα atoms and the TS phosphate atoms (details in section "Materials and Methods"). **Figure 4A** reports the CCij matrices computed between the residues of the HNH α-helices that are located in proximity of the hybrid, and the TS bases from position b20 (PAM distal ends) to position b9 (within the hybrid). Positive correlations (CCij > 0, magenta) indicate highly coupled motions in the same direction, whereas anti-correlated motions display negative correlations (CCij < 0, green). A cartoon of the system, highlighting the regions used to compute the cross-correlations, is shown in **Figure 4B**. For the sake of clarity, the HNH α-helices in proximity of the hybrid are indicated in red (residues 890–900, Helix–A), yellow (901–910, Helix–B) and orange (911–920, Helix–C).
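As an illustration of how such a matrix can be obtained, the sketch below assumes the commonly used normalized covariance of atomic displacements, CCij = ⟨Δr_i · Δr_j⟩ / (⟨Δr_i²⟩⟨Δr_j²⟩)^(1/2). The exact definition used in this study is given in its Supplementary Material, so the formula, the function name `cross_correlation`, and the toy coordinates are assumptions.

```python
import numpy as np

def cross_correlation(coords_a, coords_b):
    """Normalized cross-correlation between two atom selections.

    coords_a: (n_frames, n_a, 3), e.g. protein C-alpha atoms.
    coords_b: (n_frames, n_b, 3), e.g. TS phosphate atoms.
    Returns an (n_a, n_b) matrix with values between -1 and 1.
    """
    da = coords_a - coords_a.mean(axis=0)       # displacements from the mean
    db = coords_b - coords_b.mean(axis=0)
    n_frames = da.shape[0]
    cov = np.einsum("fik,fjk->ij", da, db) / n_frames
    norm_a = np.sqrt(np.einsum("fik,fik->i", da, da) / n_frames)
    norm_b = np.sqrt(np.einsum("fjk,fjk->j", db, db) / n_frames)
    return cov / np.outer(norm_a, norm_b)

# Toy data (hypothetical): the first "phosphate" follows the first C-alpha
# exactly, the second one moves in the opposite direction.
rng = np.random.default_rng(1)
ca = rng.normal(size=(500, 2, 3))
phosphates = np.stack([ca[:, 0] + 5.0, -ca[:, 0]], axis=1)

cc = cross_correlation(ca, phosphates)
# cc[0, 0] is fully correlated and cc[0, 1] fully anti-correlated, while the
# entries for the unrelated second C-alpha stay close to zero.
```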

As a result of this analysis, in the presence of mismatches at PAM distal ends (i.e., in the mm@17–20 system) and at positions 16–17 (mm@16–17 system), Helix–A and Helix–B are highly correlated with the TS bases from position 18 to 14 (as highlighted using a box in **Figure 4A**). This indicates that the dynamics of the HNH and of the TS mutually affect each other in the presence of mismatched pairs at PAM distal ends. Moreover, we note that in the presence of mismatches at PAM distal ends, the DNA TS mainly interacts with Helix–B (**Figure 3A**; see also Ricci et al., 2019), thereby affecting its conformational dynamics. Conversely, in the systems displaying base pair mismatches at upstream positions (mm@14 to 10), as well as in the on-target system, a weakening of the correlated motions can be seen. In these systems, no interactions are established between the TS and the HNH domain, as reflected by the diminished correlations between them. Overall, the cross-correlation analyses confirm that the presence of base pair mismatches at PAM distal ends affects the dynamics of HNH, while mismatches at upstream positions do not exert a relevant effect.

### CONCLUSION

Here, molecular simulations have been used to characterize the conformational dynamics of CRISPR-Cas9 in the presence of base pair mismatches within the RNA:DNA hybrid. The simulations have shown that the presence of base pair mismatches at PAM distal ends of the RNA:DNA hybrid (i.e., positions 20 to 17) induces an opening of the heteroduplex (Ricci et al., 2019). As a result, newly formed interactions between the DNA TS and the catalytic HNH domain have been shown to "trap" HNH in an inactive "conformational checkpoint" state, hampering its activation for cleavage. On the contrary, base pair mismatches at upstream positions (i.e., within the RNA:DNA hybrid, at positions 14 to 10) are incorporated within the heteroduplex, with minor effect on the protein-nucleic acid interactions. Indeed, the presence of DNA mismatches within the hybrid does not affect the mobility of HNH, which is similar to that of the on-target system (**Figure 3**). This suggests that mismatched base pairs within the RNA:DNA hybrid do not interfere with the process of HNH activation (**Figure 1A**), in which HNH changes configuration from its "conformational checkpoint" state to an activated form that is prone to cleave the DNA TS (**Figures 1A,B**). Notably, these results agree with existing experimental studies and offer a rationale for the observed outcomes. Indeed, the presence of DNA mismatches at PAM distal ends has been experimentally shown to trap HNH in a "conformational checkpoint" state, likely due to interactions established with the DNA TS, as previously suggested (Singh et al., 2016; Chen et al., 2017; Dagdas et al., 2017; Yang et al., 2018). However, mismatches in the middle of the hybrid are much better tolerated than at PAM distal ends, and lead to DNA cleavage. In light of this fact, our results indicate that mismatches at upstream positions (i.e., positions 14 to 10) preserve the overall structure of the RNA:DNA hybrid, without affecting the conformational dynamics of the catalytic HNH domain. As such, HNH can freely change conformation as needed to perform DNA cleavage (**Figures 1A,B**). Overall, this research report constitutes a step forward in understanding the effect of DNA mismatches within the RNA:DNA hybrid in CRISPR-Cas9, offering insightful information on off-target effects. This work also forms the basis for further investigation, to characterize the effect of DNA mismatches along the entire RNA:DNA hybrid and therefore to provide an atomic-level understanding also for DNA mismatches at PAM-proximal sites (i.e., positions 1 to 9). These studies are currently ongoing in our laboratory, as inspired by the current work, also taking into account different conformations of the HNH domain (**Figure 1A**) and diverse mismatched nucleobases. Finally, we note that understanding how mismatched pairs affect the heteroduplex structure is per se important to understand the function of RNA:DNA hybrids, which are critical in a variety of biological processes (Cheatham and Kollman, 1997; Rich, 2006; Shaw and Arya, 2008; Nadel et al., 2015; Palermo, 2019a; Terrazas et al., 2019).

In summary, this study provides an atomic-level understanding of the dynamic effects of DNA base pair mismatches within the RNA:DNA hybrid in CRISPR-Cas9. As a take-home message, the presence of mismatched pairs at distinct locations of the RNA:DNA hybrid produces different conformational effects, which affect the protein counterpart. Specifically, mismatched pairs at PAM distal ends interfere with the activation of the catalytic HNH domain, while mismatches fully embedded in the RNA:DNA hybrid do not affect the HNH dynamics and enable its activation to cleave the DNA. This provides a reasonable explanation of why off-target sequences holding mismatches at PAM distal ends are less likely to produce DNA cleavage in CRISPR-Cas9 than mismatched pairs within the heteroduplex, as experimentally observed (Singh et al., 2016; Chen et al., 2017; Dagdas et al., 2017; Yang et al., 2018). These findings contribute to understanding the mechanistic basis of off-target effects in CRISPR-Cas9 and encourage novel experimental studies aimed at designing more specific variants of the system that prevent the onset of off-target effects.

### MATERIALS AND METHODS

Structural models have been based on the X-ray structure of the Streptococcus pyogenes CRISPR-Cas9 complex (4UN3.pdb, 2.58 Å resolution) (Anders et al., 2014), which captures the inactivated state of the HNH domain (i.e., "conformational checkpoint") (Dagdas et al., 2017). MD simulations have been performed applying a well-established protocol for protein/nucleic acid complexes, which employs the Amber ff12SB force field, including the ff99bsc0 (Perez et al., 2007) corrections for DNA and the ff99bsc0+χOL3 (Banas et al., 2010; Zgarbova et al., 2011) corrections for RNA. To broadly explore the conformational space of CRISPR-Cas9, we employed a recent accelerated MD (aMD) simulation method (Miao et al., 2015). Specifically, we applied a Gaussian aMD (GaMD) method, which adds a harmonic boost potential to smoothen the potential energy surface, thereby decreasing energy barriers and accelerating transitions between the low-energy states (a complete description of the method is reported as a **Supplementary Material**).

The method has extended the use of aMD to large biomolecular systems, with applications of this method to G-protein coupled receptors (Miao and McCammon, 2016, 2018), the Mu opioid receptor (Wang and Chan, 2017; Liao and Wang, 2018), T-cell receptors (Sibener et al., 2018), and CRISPR-Cas9 (Palermo et al., 2017; Palermo, 2019b; Ricci et al., 2019).
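To make the idea of a harmonic boost concrete, the following minimal sketch uses the functional form given in the GaMD literature (Miao et al., 2015): a boost ΔV = ½ k (E − V)² is added whenever the potential energy V falls below a threshold E, and no boost is applied otherwise. The numerical values of `threshold` and `k` below are hypothetical and chosen only for illustration; in practice they are derived from potential energy statistics, as described in the Supplementary Material.

```python
import numpy as np

def gamd_boost(potential, threshold, k):
    """Harmonic boost: 0.5 * k * (E - V)^2 where V < E, zero elsewhere."""
    potential = np.asarray(potential, dtype=float)
    gap = threshold - potential
    return np.where(gap > 0.0, 0.5 * k * gap**2, 0.0)

# Hypothetical energies (arbitrary units): a deep minimum, a shallower
# minimum, the threshold itself, and a state above the threshold.
energies = np.array([-10.0, -5.0, 0.0, 3.0])
boost = gamd_boost(energies, threshold=0.0, k=0.05)
boosted = energies + boost

# The deepest minimum receives the largest boost, so the energy difference
# between it and the threshold shrinks (here from 10.0 to 7.5 units), which
# is what "smoothing" the potential energy surface means.
```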

### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

### AUTHOR CONTRIBUTIONS

BM and RH contributed equally. BM and RH performed the simulations, analyzed the data, and wrote the manuscript. MM, NZ, and YN analyzed the data and wrote the manuscript. GP conceived the original research, supervised research, and wrote the manuscript.

### FUNDING

This material is based upon work supported by the National Science Foundation under Grant No. CHE-1905374. The Extreme Science and Engineering Discovery Environment (XSEDE) provided computer time through the grant TG-MCB160059. This work was partially supported by the NIH grant R01 EY027440.

### ACKNOWLEDGMENTS

We thank Dr. Łukasz Nierzwicki and Dr. Pablo R. Arantes for useful discussions.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2020.00039/full#supplementary-material



**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer YM declared a past collaboration with one of the authors GP.

Copyright © 2020 Mitchell, Hsu, Medrano, Zewde, Narkhede and Palermo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Investigation of Voltage-Gated Sodium Channel β3 Subunit Dynamics

### William G. Glass, Anna L. Duncan and Philip C. Biggin\*

Structural Bioinformatics and Computational Biochemistry, Department of Biochemistry, University of Oxford, Oxford, United Kingdom

Voltage-gated sodium (Nav) channels form the basis for the initiation of the action potential in excitable cells by allowing sodium ions to pass through the cell membrane. The Na<sup>v</sup> channel α subunit is known to function both with and without associated β subunits. There is increasing evidence that these β subunits have multiple roles that include not only influencing the voltage-dependent gating but also the ability to alter the spatial distribution of the pore-forming α subunit. Recent structural data has shown possible ways in which β1 subunits may interact with the α subunit. However, the position of the β1 subunit would not be compatible with a previous trimer structure of the β3 subunit. Furthermore, little is currently known about the dynamic behavior of the β subunits both as individual monomers and as higher order oligomers. Here, we use multiscale molecular dynamics simulations to assess the dynamics of the β3, and the closely related, β1 subunit. These findings reveal the spatio-temporal dynamics of β subunits and should provide a useful framework for interpreting future low-resolution experiments such as atomic force microscopy.

### Edited by:

Alexandre M. J. J. Bonvin, Utrecht University, Netherlands

### Reviewed by:

Pavel Srb, Academy of Sciences of the Czech Republic (ASCR), Czechia Natalia Kulik, Academy of Sciences of the Czech Republic (ASCR), Czechia

> \*Correspondence: Philip C. Biggin philip.biggin@bioch.ox.ac.uk

### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 15 November 2019 Accepted: 19 February 2020 Published: 18 March 2020

### Citation:

Glass WG, Duncan AL and Biggin PC (2020) Computational Investigation of Voltage-Gated Sodium Channel β3 Subunit Dynamics. Front. Mol. Biosci. 7:40. doi: 10.3389/fmolb.2020.00040

Keywords: molecular dynamics, coarse-grain, epilepsy, lipid bilayer, multiscale

## INTRODUCTION

Voltage-gated sodium (Nav) channels are the initiators of action potentials in electrically excitable cells and are also implicated in many disease and pathological states including cardiac arrhythmia (Watanabe et al., 2008), epilepsy (Audenaert et al., 2003; van Gassen et al., 2009), neuropsychiatric disorders (Gargus, 2006), and chronic pain (Shah et al., 2000, 2001; Takahashi et al., 2003; Bouza and Isom, 2018). Na<sup>v</sup> channels are comprised of an α subunit that forms the central pore-conducting region and β subunits that perform various roles such as modulating the voltage sensitivity and regulating the trafficking of the channel. In humans, there are ten α and four β subunits (the β1 subunit gives rise to two isoforms, β1 and β1B) that are expressed in different tissue-specific combinations, thus giving precise regio-selective control of the Na<sup>v</sup> channel behavior. Additionally, β subunits also function independently as cell-adhesion molecules (CAMs) (Isom et al., 1995; Rougon and Hobert, 2003; Yu et al., 2003) and may play a role in Na<sup>v</sup> channel clustering at the nodes of Ranvier (Ratcliffe et al., 2001) to promote the propagation of the action potential.

As perhaps might be expected given its central role in sodium ion conduction, most attention has been paid to the α subunit. However, in vivo, the effects of the β subunits are increasingly recognized and may well offer alternative therapeutic routes in the long run (Hull and Isom, 2018). For example, the β1 subunit has been shown to stabilize the Na<sup>v</sup> 1.7 channel against mechanical stress (Körner et al., 2018) and has diverse roles with respect to its interactions with Na<sup>v</sup> channels (Edokobi and Isom, 2018). People with mutations in the β3 subunit (SCN3B gene) show cardiac conduction problems (Brackenbury and Isom, 2011), and in mice, deletion of SCN3B leads to cardiac arrhythmias (Hakim et al., 2008, 2010). The SCN3B gene has also been linked to Brugada syndrome (Okata et al., 2016).

Although the β1–4 isoforms all share a similar scaffold of an extracellular immunoglobulin (Ig)-like fold with a single transmembrane (TM) helix and unstructured intracellular domain, their binding to the α subunit differs. Both β2 and β4 bind covalently (Yu et al., 2003; Chen et al., 2012), via a disulfide bond, whilst β1 and β3 bind non-covalently. It has previously been shown that β3 subunits can trimerize via their Ig domains and can also induce higher order oligomerization of Na<sup>v</sup> channel α subunits (Namadurai et al., 2014). Increasingly, there is evidence to suggest that sodium channels may in fact operate in higher order complexes (Clatot et al., 2017) and can also form complexes with many other proteins involved in a variety of signaling pathways (Kanellopoulos et al., 2018).

The α subunit itself is constructed from four homologous domains (DI–DIV), each containing six TM helices that make up the voltage sensor domain (VSD, helices 1–4) and pore-forming domain (helices 5 and 6). The exact location where β subunits bind to the α subunit is uncertain. Additionally, the exact ratio of α:β subunits is not well characterized and may vary depending on tissue type and the cellular environment (Patino et al., 2011). Evidence from experimental fluorescence studies suggests that both the β3 and β1 subunits can bind to the α subunit and may alter the rate of fast inactivation through interaction with the VSDs of DIII and DIV, respectively (Zhu et al., 2017). Interestingly, recently released structures of Na<sup>v</sup> channels with β subunits bound all contain β subunit density in this region. The first of the eukaryotic structures with a β subunit bound was that of Na<sup>v</sup> 1.4 from electric eel by Yan et al. (2017), solved at a resolution of 4 Å. Here the fully resolved β1 subunit interacts via its transmembrane domain (TMD) and Ig domain with the VSD of DIII and extracellular loops of the α subunit, respectively. Shortly after, the nearly identical human structure of Na<sup>v</sup> 1.4 was solved by Pan et al. (2018) with β1 again bound to the VSD of DIII at an improved resolution of 3.2 Å. In another cryo-EM structure, this time with the human Na<sup>v</sup> 1.7 α subunit, not only could the position of β1 be resolved, but also the position of β2 and various toxin molecules (Shen et al., 2019). These structures offer an insight into not only the various states the α subunit occupies in its activation profile but also where β subunits may bind. In these structures the β1 subunit is bound to the VSD of DIII, usually on the periphery of the α subunit.
At this stage, it remains unclear as to whether the site of binding is consistent between α subunits or indeed whether the binding interactions for β1 will be the same for β3 (Zhu et al., 2017). Interestingly, it was recently reported that the human β1 subunit can also interact with the bacterial NaChBac channel (Molinarolo et al., 2018) although the mode of interaction was not discussed.

Despite the plethora of recent structural information, several aspects regarding the role of the β subunits remain unclear. What is the dynamic behavior of β subunit monomers? Do they oligomerize, and if so, how? How does the trimeric Ig domain structure of β3 relate to the position and orientation of the β subunits observed when in complex with the α subunits? To try and address these questions, we have used multiscale molecular dynamics simulations. We show that although β1 and β3 exhibit a relatively high sequence identity (51%), the behavior of the monomers is quite different, with β3 being more dynamic than β1. We attribute this to distinct residue–lipid contacts in the Ig domains of both subunits. We also demonstrate that the lipid composition is likely to have a key role in controlling the dynamical behavior.

### MATERIALS AND METHODS

## Homology Models

### β3 Monomer

The recent cryo-EM structure of the β1 subunit in the human Na<sup>v</sup> 1.4-β1 complex (Pan et al., 2018) (PDB: 6AGF) was used to construct a model of the human β3 subunit. Sequence alignment of the full-length human β3 sequence with the sequence of the β1 cryo-EM structure was performed using the MUSCLE web server (Edgar, 2004), giving a sequence identity of 51% (see **Supplementary Figure S1** for sequence alignment and domain annotation). A total of 200 models were created, with each model scored using Discrete Optimized Protein Energy (DOPE) in the Modeller software package (Webb and Sali, 2014). The 10 best models were ranked using Qualitative Model Energy Analysis (QMEAN) (Benkert et al., 2008), and the model with the highest QMEAN score was chosen as the final model.
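The two-stage selection described above (shortlist by DOPE, then pick by QMEAN) can be sketched as follows. This is only an illustration of the ranking logic: the scores are randomly generated placeholders, the helper `select_final_model` is hypothetical, and it assumes that "best" DOPE means the lowest (most negative) value, as is conventional for this energy-like score.

```python
import random

def select_final_model(models, n_keep=10):
    """Shortlist the n_keep lowest-DOPE models, then return the one with
    the highest QMEAN score together with the shortlist."""
    shortlist = sorted(models, key=lambda m: m["dope"])[:n_keep]
    best = max(shortlist, key=lambda m: m["qmean"])
    return best, shortlist

# Placeholder scores for 200 hypothetical models (not real data).
random.seed(0)
models = [
    {
        "name": f"model_{i:03d}",
        "dope": random.uniform(-30000.0, -25000.0),
        "qmean": random.uniform(0.4, 0.8),
    }
    for i in range(200)
]

final_model, shortlist = select_final_model(models)
print(final_model["name"], round(final_model["qmean"], 3))
```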

### β1 Monomer

The model used for β1 simulations was constructed directly from the Na<sup>v</sup> 1.4-β1 structure (Pan et al., 2018), since all residues had been resolved. All mutations in the Ig domain and linker (see **Table 1**) were performed in PyMol (DeLano, 2004).

### β3 Trimer

The crystal structure of the trimeric β3 subunit (Namadurai et al., 2014) (PDB: 4L1D), containing just the extracellular region, was used as a template to construct a model of the trimeric extracellular region of β3. A total of 200 models were created, with each model scored using DOPE in the Modeller software package (Webb and Sali, 2014). The 10 best models were ranked using QMEAN (Benkert et al., 2008), and the model with the highest QMEAN score was chosen as the final model.

### Full Length β3 Trimer

Over the course of the simulations of the β3 monomer model (see section "β3 Monomer") a large variety of conformations were visited. To construct the β3 trimer model, a frame from the first run was taken at a pitch angle of 44.7° with the long axis of the Ig domain approximately perpendicular to the plane of the membrane. This was overlaid with each chain of the β3 crystal structure containing just the extracellular domain (ECD) (see section "β3 Trimer"). After the model was constructed it was checked for steric clashes, of which there were none. All overlays were performed in PyMol (DeLano, 2004).

#### TABLE 1 | Summary of simulations.

### Molecular Dynamics (MD) Simulations

All atomistic simulations were performed using GROMACS 2018 (Abraham et al., 2015) with the AMBER ff99SB-ILDN force field (Lindorff-Larsen et al., 2010). Protein models constructed with a membrane were prepared using the InflateGRO (Kandt et al., 2007) methodology, with in-house scripts used for final adjustments. Equilibration of each system consisted of solvation using the TIP3P water model and neutralization using 150 mM NaCl, energy minimization using the steepest descent algorithm, and a short (1 ns) and a long (5 ns) equilibration whilst position restraining the Cα atoms with a force constant of 1000 kJ mol<sup>−1</sup>. All simulations were carried out in the NPT ensemble. The temperature and pressure were set to 300 K and 1 bar using the Nosé-Hoover thermostat (Nose, 1984; Hoover, 1985) and Parrinello-Rahman barostat (Parrinello and Rahman, 1981) with coupling constants of 0.8 and 5.0 ps, respectively.

All coarse-grained (CG) simulations were performed using GROMACS 2019 (Abraham et al., 2015) with the MARTINI (v2) force field (de Jong et al., 2013). Each CG protein was embedded in a membrane using the INSANE (Wassenaar et al., 2015) methodology. For each system, energy minimization was performed with the steepest descent algorithm. Equilibration steps consisted of solvation using the non-polarizable MARTINI water model and neutralization using 150 mM NaCl, followed by a short (20 ns) and a long (100 ns) equilibration whilst position restraining backbone atoms with a force constant of 1000 kJ mol<sup>−1</sup>. All simulations were carried out in the NPT ensemble at a temperature of 323 K and pressure of 1 bar. The V-rescale (Bussi et al., 2007) temperature and Berendsen pressure coupling (Berendsen et al., 1984) were used for short equilibrations with coupling constants of 1.0 and 8.0 ps, respectively. The V-rescale temperature coupling and Parrinello-Rahman pressure coupling were used for long equilibrations with coupling constants set to 4.0 ps and 8.0 ps, respectively. The 6 × 6 β3 grid was constructed by tiling, in the x and y directions, a unit cell of one membrane-embedded protein obtained after the previously mentioned equilibration steps.

All simulations performed are summarized in **Table 1**. All atomistic simulations were performed in a 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine (POPC) bilayer whilst all CG simulations were performed in a generalized mammalian plasma membrane (PM) composition from Koldsø et al. (2014), where the composition of the membrane is as follows:


Where POPE, 1-palmitoyl-2-oleoyl-glycero-3-phosphatidylethanolamine; Sph, sphingomyelin; GM3, monosialodihexosylganglioside; CHOL, cholesterol; POPS, 1-palmitoyl-2-oleoyl-glycero-3-phosphatidylserine; PIP<sub>2</sub>, 1-palmitoyl-2-oleoyl-glycero-3-phosphatidylinositol-4,5-bisphosphate.

### Ig Orientation Analysis

Assessment of the Ig domain's favored orientation was achieved by calculating the principal axes (PAs) at each frame and measuring the Tait-Bryan angles using the standard basis **e**<sub>x</sub>, **e**<sub>y</sub>, **e**<sub>z</sub> as a reference [where **e**<sub>x</sub> = (1, 0, 0), **e**<sub>y</sub> = (0, 1, 0), and **e**<sub>z</sub> = (0, 0, 1)]. To calculate the PAs, the center of mass was taken as the center of mass of the secondary structure elements contained within the Ig domain (i.e., the β-sheets and 3<sub>10</sub> helices). This was chosen to minimize any noise associated with flexible loop movement over the course of the simulation. The PAs **p**<sub>1</sub>, **p**<sub>2</sub>, and **p**<sub>3</sub> were obtained via the diagonalization of the moment of inertia tensor, **I**.

$$\mathbf{I} = \sum\_{i=1}^{N} m\_i \left[ (\mathbf{r\_i} \cdot \mathbf{r\_i}) \sum\_{\alpha=1}^{3} \mathbf{e\_{\alpha}} \otimes \mathbf{e\_{\alpha}} - \mathbf{r\_i} \otimes \mathbf{r\_i} \right] \tag{1}$$

$$
\Lambda = \mathbf{U}^T \mathbf{I} \mathbf{U} \tag{2}
$$

Where **U** = (**p**<sub>1</sub>, **p**<sub>2</sub>, **p**<sub>3</sub>) and Λ is a diagonal matrix of eigenvalues that correspond to the principal moments of inertia. At every frame, the first, second, and third principal axes were used to define a rotation matrix (based on the direction cosine matrix between each principal axis and the reference basis), from which the Tait-Bryan angles were computed. Using an intrinsic rotation formalism of ZYX, the yaw, pitch, and roll angles were defined. In this study we focus on the pitch angle in relation to the Ig domain.
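As an illustration of Equations (1) and (2) and of the angle extraction, the sketch below computes the inertia tensor, its principal axes, and the ZYX Tait-Bryan angles of a rotation matrix. It is not the authors' script (that is available at the repository cited in this section). The ordering of the axes by ascending principal moment, the right-handedness fix, and all function names are assumptions made here.

```python
import numpy as np

def inertia_tensor(coords, masses):
    """Moment of inertia tensor about the center of mass (Equation 1)."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    r = coords - np.average(coords, axis=0, weights=masses)
    r2 = np.einsum("ij,ij->i", r, r)                # r_i . r_i for every atom
    outer = np.einsum("i,ij,ik->jk", masses, r, r)  # sum_i m_i (r_i outer r_i)
    return np.sum(masses * r2) * np.eye(3) - outer

def principal_axes(coords, masses):
    """Principal moments (ascending) and U = (p1, p2, p3) as columns (Equation 2)."""
    moments, U = np.linalg.eigh(inertia_tensor(coords, masses))
    if np.linalg.det(U) < 0:
        # Eigenvector signs are arbitrary; flip one axis to keep a right-handed frame.
        U[:, 2] *= -1.0
    return moments, U

def rotation_zyx(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll); angles in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def tait_bryan_zyx(R):
    """Yaw, pitch, roll in degrees for a rotation matrix built as in rotation_zyx."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.degrees([yaw, pitch, roll])
```

With the standard basis as reference, the direction cosine matrix is simply **U**, so `tait_bryan_zyx(U)` gives the three angles for one frame. Because each eigenvector is only defined up to sign, a real analysis would also need a rule that keeps the axis directions consistent from frame to frame.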

All angle analyses were performed with in-house Python scripts, which are available at https://github.com/bigginlab/protein\_orientation.

### RESULTS

### Dynamics of β1 and β3 Interactions With the Membrane

The recent structures of the Na<sub>v</sub> α/β subunit complexes revealed that the β1 Ig domain adopts a conformation in which the long axis of the strands sits roughly parallel to the membrane surface (see **Figure 1A**). We noted at this point that if full-length β3 subunits adopted the same trimeric structure as observed for the isolated β3 Ig domain (**Figure 1B**; Namadurai et al., 2014), their interaction with the membrane would most likely require some substantial conformational rearrangement (**Figure 1C**). Thus, we investigated the dynamic behavior of full-length monomeric β1 and β3 in a POPC bilayer system using 25 replicas of 400 ns unbiased MD simulations.

We examined the behavior of the Ig domain, in terms of the "pitch" with respect to the Ig domain in the first frame of each simulation (see section "Ig Orientation Analysis" and **Figures 2A–F**). Perhaps surprisingly, the behavior of the Ig domains in terms of the pitch is very different for β1 compared to β3 despite a high sequence identity (see section "β3 Monomer"). A pitch angle of 0° corresponds to an orientation parallel to the membrane plane and typically bound to the membrane surface. For β1 simulations, the pitch remains tightly clustered around 0°, with only a few runs exhibiting significant sojourns into higher pitch angles. In contrast, for β3 there is a wide variety of pitch states visited when analyzing all the repeats, with a favored pitch angle centered around 30°. Individual runs (**Figure 2F**) also appear to show more dynamic movement of the Ig domain within runs.

Our β3 model was constructed from the recent β1 cryo-EM structure (see section "β3 Monomer") which, when bound to the α subunit, positions the TM helix approximately parallel to the bilayer normal. However, during simulations, both β1 and β3 TM helices adopt a significant tilt angle (**Figures 3A,B**), leading to a classic bell-shaped curve with a peak around 40° for β1 and 38° for β3. These are quite large tilt angles compared to many TM proteins (Bowie, 1997). Common to both β1 and β3 is a conserved glutamic acid residue (E177 and E176 in β1 and β3, respectively) that is located, somewhat surprisingly, within the lower part of the TM helix. Visual inspection of the trajectories suggested that it may play a role in maintaining the tilt angles. Analysis of the bilayer around this residue (**Figures 3C–E**) reveals that as the TM helix tilts there is a distortion of the membrane around E177/E176 in the lower leaflet where the carboxylic acid group of the side chain can interact with the positively charged NH<sub>3</sub><sup>+</sup> group of POPC.
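A TM helix tilt of the kind reported above can be estimated as the angle between a fitted helix axis and the bilayer normal. The sketch below is one plausible way to do this, not necessarily the definition shown in **Figure 3B**: it assumes the helix axis is the direction of largest variance of the Cα coordinates and that the bilayer normal is the z axis. The function name is hypothetical.

```python
import numpy as np

def helix_tilt(ca_coords, normal=(0.0, 0.0, 1.0)):
    """Tilt angle in degrees (0 to 90) between a helix axis and the bilayer normal.

    ca_coords : (n_residues, 3) array of C-alpha positions of the TM helix
    normal    : bilayer normal; the z axis is assumed by default
    """
    ca = np.asarray(ca_coords, dtype=float)
    centered = ca - ca.mean(axis=0)
    # The first right singular vector is the direction of largest variance,
    # used here as an approximation of the helix axis.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # The sign of the fitted axis is arbitrary, so use the absolute cosine.
    cos_angle = np.clip(abs(np.dot(axis, n)), 0.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))
```

Applying this to every frame and histogramming the result would give a distribution comparable in spirit to **Figure 3A**. For short or strongly kinked helices the largest-variance direction is a poorer proxy for the axis, which is one reason the definition should be checked against the original analysis.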

Further analysis of the contacts made between β subunits and the membrane (**Figure 4**) suggested that for β1 (**Figure 4A** and **Supplementary Figure S2**), the longest-lived interactions between the ECD and the membrane are, as perhaps might be expected, localized to polar residues and in particular, arginine and lysine residues. During analysis, a residue was considered to be in contact with the membrane surface if the center of mass of its side chain was within 5 Å of a phosphorus atom in the lipid headgroup. Protein – lipid contacts for each residue were calculated across all repeats and used to define the protein – lipid interaction density. The contacts seem to favor one "face" of the Ig domain, partially exposing the hydrophobic V27, V29, and P30 residues that are responsible for stabilizing the observed Ig domain trimerization away from the Ig body in β3. For the β1 TM region, the longest-lived interactions are localized toward the end of the helix and again feature lysine residues K183 and K184 as well as Y164 and Y182.
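The contact criterion described above (side-chain center of mass within 5 Å of a lipid phosphorus atom) can be written down directly. The sketch below is a minimal version under stated assumptions: it ignores periodic boundary conditions, and it summarizes contacts as the fraction of frames in contact, which is only one plausible normalization of the "interaction density" since the text does not define it. Function names are hypothetical.

```python
import numpy as np

def residue_in_contact(sidechain_coords, sidechain_masses, phosphorus_coords, cutoff=5.0):
    """True if the side-chain center of mass lies within `cutoff` (Å) of any phosphorus atom.

    Periodic images are not considered in this sketch.
    """
    com = np.average(np.asarray(sidechain_coords, dtype=float), axis=0,
                     weights=np.asarray(sidechain_masses, dtype=float))
    distances = np.linalg.norm(np.asarray(phosphorus_coords, dtype=float) - com, axis=1)
    return bool(np.any(distances <= cutoff))

def contact_occupancy(per_frame_flags):
    """Fraction of frames in which each residue is in contact.

    per_frame_flags : (n_frames, n_residues) array of 0/1 contact flags,
    pooled over all repeats (an assumed normalization, see the note above).
    """
    return np.mean(np.asarray(per_frame_flags, dtype=float), axis=0)
```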

In contrast, the Ig domain of β3 exhibits fewer regions of high contact, as expected from the analysis shown in **Figure 2** where this domain exhibits orientations that place it away from the membrane surface. The contacts made (**Figure 4B** and **Supplementary Figure S2**), when in close proximity to the membrane, are again dominated by lysine and arginine residues (K50, K98, and R144). The longest-lived interactions from the TMD region are located at the intracellular end where a cluster of positively charged residues (R182, K183, and K186) interact with the phosphate headgroups. Interestingly these residues exhibit strong local bending, possibly due to the preference for these residues to interact with the bilayer.

At this point the extent of non-conserved residues was analyzed in both the β1 and β3 sequences, with a particular focus on charged residue differences in the Ig and linker domains between both subunits (**Figure 4C**). There are a total of 25 residue differences that are summarized in **Table 2**. In order to investigate the likely contribution that charged residues make to the observed differences between Ig domain orientation a series of systems were constructed (see **Table 1** for simulation details).


Firstly, the residues present in the β1 subunit Ig domain were mutated to the corresponding residue in β3 if they differed in charge, referred to as β1 Igmut hereafter. A second system was also prepared with two mutations in the linker (K149E and K152E), in addition to ones applied in the Ig domain (β1 Igmut + linkermut). Finally, a system with only the linker mutated (β1 linkermut) was prepared to assess what impact the linker has on β1 Ig domain dynamics.

The effects of mutations in the domains of each β1 system (β1 Igmut, β1 Igmut + linkermut, and β1 linkermut) reveal distinct dynamics (**Figure 5**) and hint at the regions responsible for differences observed in pitch angles between WT β1 and β3. The charge swaps within the Ig domain of β1 Igmut cause a slight increase in pitch angle to approximately 10° with respect to WT β1 (**Figure 5A**). In β1 Igmut + linkermut, the addition of the K149E and K152E mutations in the linker drastically increases the sampled angles to values around 45°. Also present is another population close to WT β1 and β1 Igmut values, indicative of the Ig domain lying almost parallel to the membrane plane (**Figure 5B**). When applying only K149E and K152E in β1 linkermut, the pitch angles populate values close to 40° with a smaller population at 10°, reflecting an Ig domain pitch angle somewhere between membrane-bound and perpendicular to the membrane plane (**Figure 5C**). In addition to changes in Ig domain pitch with the K149E and K152E mutations, there is also a tendency for the linker to become more linear as well as distinct changes in the Ramachandran plots at D148, located at the "hinge" before the start of the Ig domain (**Supplementary Figure S3**).

### Full-Length β3 Trimeric Model Dynamics

Recently there have been several high-quality cryo-EM structures of full-length β subunits bound to the α subunit of Na<sub>v</sub> channels (Yan et al., 2017; Zhu et al., 2017; Pan et al., 2018; Shen et al., 2019). However, the trimeric crystal structure of β3 (Namadurai et al., 2014) lacks the TMD and its role, if any, in the observed β subunit clustering remains elusive. To investigate the role of the TMD in β3 – β3 interactions and vice-versa, a trimeric model was constructed using the ECD β3 trimer and the TMD of the β3 monomer (see section "Full length β3 Trimer"). A total of three repeats of 400 ns atomistic MD were performed. As expected, the extracellular trimeric structure remained intact and conformationally stable throughout the simulations (**Figures 6A,B**) and remained in an "upright" position on top of the membrane surface. The TM helices, on the other hand, were much more mobile and indeed dominate the overall Cα root mean squared deviation (RMSD) (**Figure 6B**). Visual inspection of the trajectories revealed that the TM helices exhibit considerable lateral movement with respect to each other and appear to adopt significant tilt compared to the starting conformations. Analysis of the helical tilt angles (**Figure 6C**) confirms the adoption of significant tilt but also reveals that the helices can adopt a range of different tilt angles with a significant proportion centered around ∼12° and another around ∼26°. Even though there are strong preferences for these particular tilt angles, each helix is still able to visit the whole range of tilt angles from 0° to just over 40°. Note that for monomeric β3 (**Figure 3A**) the distribution was a classic bell-shaped curve centered around 38°, suggesting that in the trimer the tilt angle is, as might be expected, restricted by the tethering to the ECD.

transmembrane domain. (A) Histogram of TMD tilt angles over 25 × 400 ns simulations of the β1 (red) and β3 (blue) subunits. (B) Schematic of the angle used to measure the tilt angle in the TMD, phosphorus atoms of the POPC bilayer are shown as orange spheres. (C) Histogram of minimum distances between E177 (β1)/E176 (β3) (center of mass of sidechain oxygens) and the nitrogen atom of the surrounding POPC headgroups over 25 × 400 ns. The shoulder at a distance of 7 Å reflects the initial starting coordinates. (D) Position in the membrane of the conserved glutamic acid residue (highlighted inside a red box) in the β3 subunit after 400 ns. (E) Closer look at E176 (β3) in (D) with two nearby POPC residues interacting with the terminal oxygen atoms of the residue.

We next analyzed the interaction of the protein with the lipid membrane. The interactions in the TM region (**Figure 6D** and **Supplementary Figure S4**) are very similar to those observed for the β3 monomers. There was also a significant amount of interaction between the bottom face of the ECD and the membrane (**Figure 6E** and **Supplementary Figure S4**), mediated mainly, though not exclusively, by positively charged residues.

### Clustering of β3 Subunits in a Realistic Membrane Model

Given the recent observation from atomic force microscopy (AFM) that β3 monomers could aggregate and form higher-order oligomers including dimers and trimers (Namadurai et al., 2014), we set up CG MD simulations to investigate how such oligomers might come together [see section "Molecular Dynamics (MD) Simulations"]. We set up a large membrane with a composition that replicated an endothelial cell (**Figure 7A**), inserted 36 copies of the β3 subunit model, and ran three independent simulations for 10 µs each. β3 subunits were indeed observed to form higher-order oligomers (**Figure 7B**). The size of the clusters was analyzed over the course of each run, and it was found that subunits tended to be present as monomers or dimers, with a significant population of higher-order clusters (**Figure 7C**).
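Cluster sizes of this kind are commonly obtained by treating protein copies as nodes of a graph and joining any two copies that are in contact. The sketch below does this with a small union-find structure. The contact rule (any pair of beads within a cutoff), the 6 Å default, and the neglect of periodic boundary conditions are assumptions made for illustration, since the text does not state the clustering criterion that was used.

```python
import numpy as np
from collections import Counter

def cluster_sizes(protein_coords, cutoff=6.0):
    """Sizes of protein clusters in one frame, largest first.

    protein_coords : list of (n_beads, 3) arrays, one per protein copy
    cutoff         : two copies are linked if any bead pair is within this distance
                     (hypothetical criterion; periodic images are ignored)
    """
    n = len(protein_coords)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            diff = protein_coords[i][:, None, :] - protein_coords[j][None, :, :]
            if np.min(np.linalg.norm(diff, axis=-1)) <= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters

    counts = Counter(find(i) for i in range(n))
    return sorted(counts.values(), reverse=True)
```

Running this on each frame and pooling the sizes over time gives a cluster-size distribution of the type summarized in **Figure 7C**. For 36 copies the all-pairs loop is cheap, though a neighbor search would scale better for larger systems.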

Long, fibril-like structures were formed in all repeats, with the Ig domains often making tip-tip interactions in a manner reminiscent of the interactions between the DIP and Dpr neuronal recognition proteins (Cosmanescu et al., 2018). Protein – protein contacts were measured over the three repeats. There are typically regions of high interaction on the last ∼10 residues in the TMD region of the β3 subunit as well as contacts present in the Ig domain. Regions of high contact include residues 128 – 135 that correspond to the FEAHRPFV loop, at the "tip" of the Ig domain, located between the F and G β strands (see **Supplementary Figure S1** for strand labeling) of the Ig domain (**Figure 7A** and **Supplementary Figure S5**). There are also regions of interaction in the loop region of residues 79 – 82 (NGHQ) and 89 – 92 (QGRL) between β strands C" and D that form one face of the Ig domain (**Figure 7A**). At the C-terminus of the TMD, residues M177, C180, Y181, K183, and V184 show regions of increased interaction between subunits. Further investigation of the Ig domain's orientation on the membrane surface revealed a variety of conformations that reflect the dynamics seen in atomistic simulations. A number of protein copies were present with the long axis of the Ig domain parallel to the membrane whilst another population showed the Ig domain pointing up and away from the membrane, similar to the orientation seen in the trimeric crystal structure (**Figure 7D**).

We also investigated protein – lipid contact sites. Interactions were counted using the headgroup bead of each lipid type and a cut-off value of 6.5 Å. It can be seen (**Figures 8A,B**) that there is a slight preference for one side (which we label Face 1) of the Ig domain to interact with the lipid membrane, most notably for GM3. The other side (Face 2) of the Ig domain retains interactions with GM3 but to a lesser extent than Face 1 (**Supplementary Figure S6**). The radial distribution function reflects the high levels of interaction with GM3 as well as with PIP<sub>2</sub> and cholesterol, where the latter two interact with the TMD.

### DISCUSSION

### β Subunit Monomers Exhibit Distinct Differences

Although similar in sequence and underlying fold, the behaviors of the β1 and β3 subunits in the membrane exhibit some striking differences. Our simulations suggest that the ECD of the β3 subunit is much more dynamic than that of β1. In contrast to the β1 subunit structure from Pan et al. (2018), the Ig domain of the β3 subunit samples several pitch states (see **Figure 2**), with only a few corresponding to the β1-like cryo-EM structure. Conversely, the β1 subunit simulations provide evidence for a more restricted Ig motion, with the long axis of the Ig domain parallel to the membrane plane in 60% of the simulations performed. This increased membrane interaction may go some way to explain why β1 has a decreased propensity to form higher order oligomers, since the Ig domain is restricted to lie close to the membrane surface. The interaction in the β1 cryo-EM structure (Pan et al., 2018) between the ECD and the top of the VSD of DIII involves the conserved C21 – C43 disulfide bond. The orientation of the ECD in this cryo-EM structure is quite similar to the orientation we observe for β1 monomers in the membrane and thus we hypothesize that a monomer moving from the membrane to interact with an α subunit would only require a small change in conformation. Clearly, electrostatic interaction between the Ig domain and membrane surface will contribute to the preferred Ig orientation that both β1 and β3 adopt. Charge swap mutations in the β1 subunit for those present in β3 support this (see **Table 2** and **Figure 5**). Mutations performed within the Ig domain have a small effect on Ig domain pitch; however, the addition of two mutations (K149E and K152E) in the linker causes β3 Ig domain dynamics to be partially recovered in β1. The linker's contribution to Ig dynamics is somewhat reduced when only the K149E and K152E mutations are present in β1, which suggests that, although the linker is important, the difference in dynamics between β1 and β3 may be a compound effect of both the Ig and linker domains.

### The Transmembrane Helix Undergoes a Large Tilt

In both the β1 and β3 models there is a slight shortening of the TMD as well as the large helical tilt in the membrane. In the cryo-EM structure (Pan et al., 2018), this region of the β1 subunit has the lowest resolution of around 4.2 Å and we hypothesize that the TMD region of the β subunit may indeed be flexible until any hydrophobic mismatch with the bilayer is optimized either by TM helix tilting or bending. The tilting in both β1 and β3 is facilitated in part by the presence of a glutamic acid (see **Figure 3**). This glutamate is highly conserved and is found in both β1 and β3 sequences. Given its unusual position, it has been argued that it is likely to have functional significance and indeed has been investigated within β1 (McCormick et al., 1999) and β3 (Namadurai et al., 2014; Salvage et al., 2019). It also seems that for a full-length model based upon the trimeric β3 crystal structure of the ECD to be adopted in the context of a lipid bilayer system, the TMD helices of our model must change their tilt with respect to the membrane normal.

### Behavior of the Trimer

A major difference between monomeric and trimeric β3 lipid interactions is in the Ig domain. As part of a trimer, the Ig domain is no longer able to sample large pitch states due to the favorable hydrophobic interactions in the N-terminus of each chain. As such, the lipid interaction of the Ig trimer is markedly reduced when compared to the monomer, with only a few charged residues (R65, E67, and K70) interacting with the membrane surface. Interestingly, if the β3 subunit were to interact with the VSDs of the pore-forming α subunit in a similar fashion to β1, there would need to be substantial rearrangement of the Ig domains. This raises the question of what role, if any, the TMD helices could play in α – β and/or β – β interactions. Lipid contact analysis in the TMD reveals that there is little difference between the trimeric and monomeric models. Close to the Ig domain, the restriction imposed via the stable Ig trimer reduces translational motion, whereas at the intracellular end the translational motion is much more dynamic with no clear preference for residue – residue interaction between chains. These results suggest that the TMD of the β3 subunit does not have an overall stabilizing effect on the β3 trimer and in fact may only be required for correct positioning within the membrane. This is in agreement with previous super-resolution microscopy data, where the density function estimated from the C-terminal mEos2-tagged β3 was consistent with a relatively unconstrained transmembrane helix/C terminus. This suggests that any trimerization events are likely to be controlled via the Ig domain. Additionally, when the helices of each β3 chain are in close proximity, the conserved E176 residue appears to be orientated away from the trimer center and preferentially interacts with the POPC membrane (data not shown). These observations are also supported by recent experimental work examining the role of E176 in β3 subunits, which also concluded that oligomerization was dependent on the extracellular domain but not E176 (Salvage et al., 2019).

for the whole trimer (blue), the TM domains (orange, residues F153 to E189 of each subunit), the ECD only (green). Pale background reflects one standard deviation. (C) Distribution of tilt angles for the three helices in the trimer. (D) Probability density colored from white to red to black mapped onto the structure to show the lipid-protein interactions. (E) shows the key residues of the ECD that form interactions with the membrane. In both (D,E) different protein monomers are indicated in superscript.

### β3 Subunit Clustering

It has been previously reported (Namadurai et al., 2014) via the use of AFM and Fluorescence Photoactivated Localization Microscopy (FPALM) that the β3 subunits can form higher order oligomers. In particular, the AFM suggested the presence of dimers and trimers, whilst the FPALM experiments suggested the presence of a trimer in live cells. In our large CG simulations, where we try to capture the complexity of a mammalian cell membrane, we do indeed observe the formation of oligomers. The interactions between individual β3 subunits tend to show an "end on" interaction, whereby the tip of one Ig domain interacts with the base of another to produce long, fibril-like oligomers with a small contribution from the C-terminal end of the TMD. This leads to a slightly different picture of how the β3 subunits may interact compared to that arrived at by Namadurai et al. (2014), who interpreted the formation of the trimers in the context of a crystal structure of the β3 Ig domain (which forms distinct trimers). The formation of similar, but full-length, trimers would mean that the Ig domains must frequently "lift off" the surface of the membrane (see **Figure 5A**) and then oligomerize predominantly through the exposed flat face of the Ig domain. Although we observe movements of the Ig domain in the atomistic simulations (see **Figure 2D**) that would be compatible with the formation of such a trimer, we observe such movements in the CG simulations only infrequently. Furthermore, such orientations (**Figure 6**) are too short-lived relative to the time required for oligomerization via the exposed faces of the β-sheets. A key difference to note here is that the atomistic simulations were performed in a POPC bilayer, whereas the CG simulations were performed in a bilayer of a more complex composition. Ideally, the use of a mixed lipid membrane at an atomistic level would reveal finer protein – lipid interaction details. However, studying large-scale clustering in this way would require computational resources beyond our current capability. Visual inspection of the CG simulations suggests that the interaction of the Ig domain with the GM3 headgroup appears to keep the Ig domain close to the surface of the membrane. On the face of it, this may appear at odds with the interpretation by Namadurai et al. (2014). However, there is no direct atomically detailed evidence of how the full-length β3 subunit may come together, and the work here presents an alternative possibility. Regardless, the results here suggest that spontaneous oligomerization of a full-length trimer, where the Ig domains adopt the crystal structure conformation, would likely be a very slow process if it does occur.

### CONCLUSION

In this work, we have explored the dynamics of the β1 and β3 subunit monomers with a lipid bilayer. The dynamics exhibited a remarkable and unexpected difference in behavior of the ECD, which we attribute to distinct binding patterns within the Ig domain. It will be interesting to investigate the influence of the non-conserved charged residues between both subunits in future experiments. A full-length model of a β3 subunit based on a trimeric structure of the Ig domain only, suggests that the TM helices do not interact particularly strongly. Finally, the CG simulations suggest that higher order oligomerization of monomers may be mediated by "end-on-end" interactions. These results should provide a useful framework on which to interpret low-resolution methods such as AFM that are examining the nature of oligomerization in ion channels. The existing agreement between experiment and simulation is encouraging.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study can be obtained via request to the corresponding author.

### REFERENCES


### AUTHOR CONTRIBUTIONS

WG performed and analyzed all simulations and co-wrote the manuscript. AD performed analysis and gave advice. PB conceived the work and co-wrote the manuscript.

### FUNDING

WG was supported by the EPSRC via the Theory and Modeling in the Chemical Sciences Doctoral Training Centre (EP/L015722/1). AD was supported by the BBSRC (BB/R00126X/1). This work was supported by ARCHER UK National Supercomputing Service (http://www.archer.ac.uk), provided by HECBioSim, the UK High End Computing Consortium for Biomolecular Simulation (hecbiosim.ac.uk), which is supported by the EPSRC (EP/L000253/1).

### ACKNOWLEDGMENTS

We thank Tony Jackson, Chris Huang, and Samantha Salvage for useful discussions. We also thank Rocco Meli, Aphroditi Zaki, and Irfan Alibay for mathematical discussions and technical support.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2020.00040/full#supplementary-material





**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Glass, Duncan and Biggin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# MiMiC: Multiscale Modeling in Computational Chemistry

Viacheslav Bolnykh<sup>1</sup>\*, Jógvan Magnus Haugaard Olsen<sup>2</sup>\*, Simone Meloni<sup>3</sup>, Martin P. Bircher<sup>4</sup>, Emiliano Ippoliti<sup>5</sup>, Paolo Carloni<sup>5,6</sup> and Ursula Rothlisberger<sup>1</sup>\*

<sup>1</sup> Laboratory of Computational Chemistry and Biochemistry, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, <sup>2</sup> Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, UiT the Arctic University of Norway, Tromsø, Norway, <sup>3</sup> Dipartimento di Scienze Chimiche e Farmaceutiche, Università degli Studi di Ferrara, Ferrara, Italy, <sup>4</sup> Computational and Soft Matter Physics, University of Vienna, Vienna, Austria, <sup>5</sup> Computational Biomedicine, Institute for Advanced Simulation (IAS-5) and Institute of Neuroscience and Medicine (INM-9), Molecular Neuroscience and Neuroimaging, Institute of Neuroscience and Medicine (JARA INM-11), Forschungszentrum Jülich, Jülich, Germany, <sup>6</sup> Department of Physics and Universitätsklinikum Aachen, RWTH Aachen University, Aachen, Germany

Keywords: molecular dynamics, QM/MM, DFT, HPC, multiscale simulations, computational chemistry

#### Edited by:

Giulia Palermo, University of California, Riverside, United States

#### Reviewed by:

Pablo Ricardo Arantes, University of California, Riverside, United States; Carme Rovira, University of Barcelona, Spain

#### \*Correspondence:

Viacheslav Bolnykh viacheslav.bolnykh@epfl.ch; Jógvan Magnus Haugaard Olsen jogvan.m.olsen@uit.no; Ursula Rothlisberger ursula.roethlisberger@epfl.ch

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

Received: 23 December 2019; Accepted: 02 March 2020; Published: 20 March 2020

#### Citation:

Bolnykh V, Olsen JMH, Meloni S, Bircher MP, Ippoliti E, Carloni P and Rothlisberger U (2020) MiMiC: Multiscale Modeling in Computational Chemistry. Front. Mol. Biosci. 7:45. doi: 10.3389/fmolb.2020.00045

### 1. INTRODUCTION

Hybrid quantum mechanics/molecular mechanics (QM/MM) approaches are commonly used methods for investigating a plethora of chemical, biochemical, and biophysical processes that require explicit treatment of the electronic degrees of freedom when the system is too big to be entirely treated by QM methods alone (Warshel and Levitt, 1976; Senn and Thiel, 2009; Adhireksan et al., 2014; Campomanes et al., 2014, 2015; Brunk and Rothlisberger, 2015; Genna et al., 2016; Li et al., 2017; Cupellini et al., 2018; Loco et al., 2018; Morzan et al., 2018). They are often the method of choice for computational investigations of systems with more than a few thousand atoms (which is commonly the case for biological systems). In QM/MM, the system is split into two parts: a smaller part that is treated at the QM level of theory, whereas the remainder is described at the MM level, which is a computationally more expedient description. In this way, local electronic effects can be captured with the accuracy of a first-principles method, while at the same time explicitly including the effects of the environment at a reasonable computational cost.

Current QM/MM implementations have roughly followed either of two strategies: (1) tight integration of QM and MM modules in a single software package or (2) loose coupling of separate QM and MM codes. Strategy (1) generally profits from computational efficiency due to the ability to pass data between the submodules directly (via function calls) but suffers from limited flexibility, since the available choice of methods is often restricted and extensions to different programs may require formidable programming efforts. In contrast, strategy (2), which is typically implemented by exchanging data between QM and MM codes via file input and output, enables high flexibility but penalizes efficiency because of increased communication overhead. However, with the field rapidly growing, new simulation paradigms and approaches might quickly emerge, clearly favoring strategy (2) over (1).

In the following, we show that flexibility does not necessarily come at the expense of a high computational (or communication) overhead by presenting the recently developed MiMiC framework (Bolnykh et al., 2019; Olsen et al., 2019) that combines the capability of performing fast and efficient multiscale molecular dynamics (MD) simulations with facile support for flexible extensions. These objectives are achieved by applying strategy (2) with an efficient method to exchange data among the coupled software packages. In practice, MiMiC implements a multiple program-multiple data (MPMD) paradigm through a message passing interface (MPI)-based communication library, which allows the entities collaborating within MiMiC to exchange data efficiently. Overall, MiMiC represents a highly modular and general multiscale simulation framework that enables the combination of multiple resolutions and methods for different parts of a system, while retaining high computational efficiency. Moreover, MiMiC was designed to have a flexible architecture enabling multiple resolutions, implementation of different types of coupling (e.g., QM/QM, QM/QM/MM, etc.), and straightforward incorporation of emerging and future methods and software packages in the field of computational chemistry. This flexibility is of utmost importance in the light of the rapid development of computational methods enabling researchers to tackle complex scientific problems with more and more degrees of freedom that require the incorporation of multiple space and time resolution scales on the one hand, and the rapid advent of new computational approaches on the other hand.

### 2. MIMIC ARCHITECTURE

### 2.1. Model

MiMiC implements a generalized version of the fully Hamiltonian electrostatic embedding scheme introduced in Laio et al. (2002). The key quantity is the electrostatic QM/MM coupling energy term:

$$E^{\rm QM/MM} = \sum_{i}^{N^{\rm MM}} q_i^{\rm MM} \int d\mathbf{r} \,\rho^{\rm QM}(\mathbf{r}) \frac{r_{c,i}^4 - |\mathbf{R}_i - \mathbf{r}|^4}{r_{c,i}^5 - |\mathbf{R}_i - \mathbf{r}|^5} \tag{1}$$

where $N^{\rm MM}$ is the total number of MM atoms, $q_i^{\rm MM}$ and $r_{c,i}$ are the partial charge and the covalent radius of the $i$-th MM atom, respectively, while $\mathbf{R}_i$ is its coordinate and $\rho^{\rm QM}(\mathbf{r})$ is the electron density at point $\mathbf{r}$. This form of the electrostatic QM/MM coupling term modifies the Coulomb interaction at short range, thus avoiding electron spill-out (Laio et al., 2002). It is worth remarking that the QM/MM term is responsible for the polarization of the electronic density due to MM atoms and, thus, models the effects of the environment on the properties of the chemically active subdomain.
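As an illustration of how the kernel in Equation (1) behaves, the following minimal sketch evaluates the modified Coulomb term and a simple quadrature of the coupling energy over a discretized density. The function names, the list-based data layout and the uniform volume element `dv` are assumptions made for this example and do not reflect the actual MiMiC implementation. The kernel tends to 1/r_c at zero distance, to 1/r at large distance, and has a removable singularity at r = r_c with limit 4/(5 r_c).

```python
import math

def damped_coulomb(r, rc):
    """Modified Coulomb kernel of Equation (1): (rc^4 - r^4) / (rc^5 - r^5)."""
    if math.isclose(r, rc, rel_tol=1e-9):
        # Removable singularity at r == rc; the analytic limit is 4 / (5 * rc).
        return 4.0 / (5.0 * rc)
    return (rc**4 - r**4) / (rc**5 - r**5)

def coupling_energy(mm_atoms, grid_points, dv):
    """Simple quadrature of Equation (1), with the sign convention as written there.

    mm_atoms:    list of (charge, covalent_radius, (x, y, z))
    grid_points: list of ((x, y, z), density) samples of the QM density
    dv:          volume element associated with each grid point (assumed uniform)
    """
    energy = 0.0
    for q, rc, pos in mm_atoms:
        for point, rho in grid_points:
            r = math.dist(pos, point)
            energy += q * rho * damped_coulomb(r, rc) * dv
    return energy
```

The double loop over MM atoms and mesh points makes the cost of a direct evaluation explicit: it scales with the product of the two counts.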

The straightforward implementation of such a term is rather costly to compute, in particular for systems with large MM regions. Therefore, a hierarchical electrostatic embedding approach (Laio et al., 2002) is used in order to mitigate the high computational cost of a direct evaluation. Within this hierarchical scheme the QM/MM electrostatic interactions are divided into two groups depending on the distance (commonly referred to as the cutoff distance) of MM atoms from the QM subsystem. In the vicinity of the QM part the interaction is computed using Equation (1), whereas more distant atoms are coupled via a multipole expansion of the electrostatic potential of the QM charge distribution. We have extended the original scheme with an open-ended multipole expansion allowing the user to choose the order at which the expansion is truncated. This allows (i) higher accuracy in the calculation of the electrostatic QM/MM interactions, at a negligibly higher computational cost and (ii) reduction of the cutoff distance, thus further lowering the computational cost (Olsen et al., 2019).
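The idea behind the hierarchical scheme can be sketched as follows. MM atoms are first sorted into a short-range and a long-range group by their distance from the QM region, and the potential felt by distant atoms is then approximated by a multipole expansion truncated at a user-chosen order. This is only a sketch under simplifying assumptions: the QM charge distribution is represented by a handful of point charges, the expansion stops at the quadrupole, and all function names are hypothetical.

```python
import math

def direct_potential(charges, point):
    """Exact Coulomb potential at `point` from point charges [(q, (x, y, z)), ...]."""
    return sum(q / math.dist(pos, point) for q, pos in charges)

def multipole_potential(charges, origin, point, order):
    """Potential at `point` from a multipole expansion about `origin`,
    truncated at `order` (0 = monopole, 1 = dipole, 2 = quadrupole)."""
    d = [p - o for p, o in zip(point, origin)]
    dist = math.sqrt(sum(c * c for c in d))
    n = [c / dist for c in d]  # unit vector towards the evaluation point
    phi = 0.0
    for q, pos in charges:
        x = [a - o for a, o in zip(pos, origin)]
        x2 = sum(c * c for c in x)
        xn = sum(a * b for a, b in zip(x, n))
        terms = [
            q / dist,                                   # monopole
            q * xn / dist**2,                           # dipole
            q * (3.0 * xn**2 - x2) / (2.0 * dist**3),   # quadrupole
        ]
        phi += sum(terms[: order + 1])
    return phi

def split_by_cutoff(mm_positions, qm_center, cutoff):
    """Indices of MM atoms inside (short-range) and outside (long-range) the cutoff."""
    near, far = [], []
    for idx, pos in enumerate(mm_positions):
        (near if math.dist(pos, qm_center) <= cutoff else far).append(idx)
    return near, far
```

For an evaluation point far from the charge distribution, each additional order reduces the truncation error, which is why a higher expansion order permits a shorter cutoff distance.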

An official release of MiMiC will be published under the open-source GPLv3+ license in 2020.

### 2.2. Implementation

MiMiC is a loosely-coupled MPMD multiscale simulation framework. Within this approach, both QM and MM codes run simultaneously with computational resources being allocated separately to either entity. Moreover, while enabling efficient communication, such an approach avoids tight integration of MiMiC into either code, which would incur a high implementation and maintenance effort. This enables the construction of a highly modular and efficient multiscale simulation framework capable of coupling virtually any set of simulation codes with the potential for extending it further to enable the support of alternative levels of theory such as a different QM method, coarse-grained approaches, or approaches based on artificial intelligence (Behler and Parrinello, 2007; Christensen et al., 2019; Singraber et al., 2019). In the present implementation, CPMD 4.3 (Hutter et al., 2018) computes the QM contributions, while GROMACS 2019 (Spoel et al., 2005; Abraham et al., 2015, 2019) computes the classical interactions within the MM subsystem as well as all bonded and Lennard-Jones interactions crossing the QM/MM interface. The electrostatic QM/MM interactions are computed by MiMiC. Finally, CPMD integrates the equations of motion.

The structure of a QM/MM implementation using the MiMiC framework is shown in **Figure 1A**. The use of a plane wave-based code to handle the QM subsystem ensures highly efficient scaling performance, while GROMACS guarantees expedient MM computations.

The workflow of a QM/MM MD simulation using MiMiC follows closely the workflow of a typical MD simulation in CPMD. At the beginning of each time step, MiMiC collects atomic coordinates from CPMD and dispatches them to GROMACS, which then computes MM forces and energies. While this is done, CPMD computes QM contributions and MiMiC computes the electrostatic QM/MM interaction terms. MiMiC adds up all force contributions and provides them to CPMD, which uses them to propagate atomic positions according to the selected ensemble and imposing the necessary constraints.
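The data flow of one time step can be summarized in a short sketch. The three force providers are plain Python callables standing in for GROMACS, CPMD and MiMiC, and a velocity Verlet update stands in for the integrator; these are assumptions made for illustration, not the interfaces of the actual programs. In MiMiC the contributions are computed concurrently by separate programs, whereas here they are evaluated one after another.

```python
def total_forces(positions, mm_forces, qm_forces, coupling_forces):
    """Sum the MM, QM and electrostatic QM/MM force contributions per coordinate."""
    contributions = [f(positions) for f in (mm_forces, qm_forces, coupling_forces)]
    return [sum(parts) for parts in zip(*contributions)]

def md_step(positions, velocities, masses, dt, force_fn):
    """One velocity Verlet step (stand-in integrator, chosen for this sketch)."""
    f_old = force_fn(positions)
    new_pos = [x + v * dt + 0.5 * f / m * dt * dt
               for x, v, f, m in zip(positions, velocities, f_old, masses)]
    f_new = force_fn(new_pos)
    new_vel = [v + 0.5 * (fo + fn) / m * dt
               for v, fo, fn, m in zip(velocities, f_old, f_new, masses)]
    return new_pos, new_vel
```

A caller would wrap the three providers once, for example `force_fn = lambda p: total_forces(p, mm, qm, coupling)`, and then call `md_step` in a loop.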

The calculation of the QM/MM interactions of Equation (1) can be parallelized by distributing MM atoms and points of the mesh discretizing the QM domain of integration. Extreme scalability is achieved parallelizing over both degrees of freedom through a multi-layered hybrid distributed- and shared-memory parallelization strategy. At the top layer, all MPI tasks are divided into groups, each receiving a subset of MM atoms. Then, at a lower level, the mesh discretizing the QM subspace is split into a set of 2D slabs along the X dimension. Each of the MPI tasks belonging to each group receives a subset of these slabs to compute the corresponding part of the integral in Equation (1) (and other analogous terms). Finally, at the lowest level, the shared-memory simultaneous multi-threading (SMT) approach (based on OpenMP) is employed in order to further extend the scalability limit. At this level, each of the slabs is divided into a set of 1D "pencils," which are then attributed to the threads associated with a particular MPI task.
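The three layers of this decomposition can be illustrated with simple index bookkeeping. The sketch below assigns MM atoms to groups, X slabs to the tasks of each group, and pencils (indexed here by Y) to the threads of each task. The function names and the even contiguous splitting are assumptions for illustration; the actual distribution used in MiMiC may differ.

```python
def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for k in range(n_parts):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def decompose(n_atoms, nx, ny, n_groups, tasks_per_group, threads_per_task):
    """Map (group, task, thread) to (MM atom indices, X slab indices, Y pencil indices)."""
    atom_sets = split_evenly(list(range(n_atoms)), n_groups)        # top layer
    slab_sets = split_evenly(list(range(nx)), tasks_per_group)      # middle layer
    pencil_sets = split_evenly(list(range(ny)), threads_per_task)   # lowest layer
    work = {}
    for g, atoms in enumerate(atom_sets):
        for t, slabs in enumerate(slab_sets):
            for th, pencils in enumerate(pencil_sets):
                work[(g, t, th)] = (atoms, slabs, pencils)
    return work
```

Every combination of MM atom and mesh pencil is owned by exactly one thread, so the partial integrals can be accumulated independently and reduced at the end.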

Using this multi-layered parallelization scheme, we have demonstrated efficient scalability using over ten thousand cores in a single QM/MM MD simulation while maintaining an overall parallel efficiency above 75% for a system containing a large Cl<sup>−</sup>/H<sup>+</sup> antiporter protein embedded in a lipid membrane bilayer (**Figure 1B**) solvated in water. In this system, 19 atoms out of a total of 150,925 atoms were treated at the QM level. The size of the whole system was 126.9 × 126.8 × 99.3 Å<sup>3</sup>, and the size of the cubic QM box was 17.7 × 17.7 × 17.7 Å<sup>3</sup>. We used a plane wave cutoff of 90 Ry, which corresponds to a real-space mesh with 240 points along each dimension. Benchmarks were performed using Troullier–Martins pseudopotentials (Troullier and Martins, 1991). The average wall time of a single MD time step is around 13 s (Bolnykh et al., 2019) when computationally demanding hybrid exchange–correlation functionals, such as B3LYP (Becke, 1988, 1993; Lee et al., 1988), are employed. This enables nanosecond-scale QM/MM MD simulations to be performed, which in turn allows one to obtain converged free energy calculations of biological systems if enough computational resources are available. Some representative scaling benchmark results are shown in **Figures 1C,D**. We expect similar extreme scalability for systems characterized by QM domains of similar size.
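The quoted figures can be put into perspective with some simple arithmetic. The sketch below derives the mesh spacing, the total number of mesh points and the number of MD steps per day from the values given above, and includes the usual definition of parallel efficiency relative to a reference run. The variable and function names are chosen for this example only.

```python
box_edge_angstrom = 17.7      # edge of the cubic QM box
mesh_points_per_dim = 240     # real-space mesh points along each dimension
seconds_per_step = 13.0       # average wall time per MD step

mesh_spacing = box_edge_angstrom / mesh_points_per_dim   # about 0.074 Angstrom
total_mesh_points = mesh_points_per_dim ** 3             # about 1.4e7 points
steps_per_day = 86400 / seconds_per_step                 # about 6.6e3 steps

def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Efficiency of a run on n cores relative to a reference run on n_ref cores."""
    return (t_ref * n_ref) / (t_n * n)
```

With roughly 1.4 × 10<sup>7</sup> mesh points per MM atom in the short-range region, the benefit of distributing both atoms and mesh points becomes apparent.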

### 3. CONCLUSION

We have given a short introduction to the recently developed MiMiC framework as a highly flexible and extremely powerful multiscale modeling software solution capable of delivering unprecedented levels of scaling performance. The efficiency of the framework is ensured by using a well-established and extensively validated electrostatic embedding scheme, while flexibility and modularity are achieved via an efficient loosely coupled MPMD architecture. Finally, extreme scalability is attained through a multi-layered parallelization strategy.

## AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

## FUNDING

UR acknowledges funding from the Swiss National Science Foundation via the NCCR MUST and individual grant No. 200020\_185092. This project/research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under Specific Grant Agreement 720270 (Human Brain Project SGA2). JO acknowledges financial support from the Research Council of Norway through its Centres of Excellence scheme (Project ID: 262695). MB acknowledges a postdoctoral fellowship from the Swiss National Science Foundation (project 184500). PC acknowledges funding by the Deutsche Forschungsgemeinschaft via FOR 2518 DynIon project P6 and by the Centre of Excellence for Computational Biomolecular Research, BioExcel CoE, funded by the European Union (contracts: H2020-INFRAEDI-02-2018-823830, H2020-EINFRA-2015-1-675728).

### ACKNOWLEDGMENTS

The authors thank Dr. Maria Gabriella Chiarello for providing the structure of the protein used in benchmarks.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Bolnykh, Olsen, Meloni, Bircher, Ippoliti, Carloni and Rothlisberger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Bioinformatics and Biosimulations as Toolbox for Peptides and Peptidomimetics Design: Where Are We?

Ilda D'Annessa<sup>1</sup>†, Francesco Saverio Di Leva<sup>2</sup>†, Anna La Teana<sup>3</sup>, Ettore Novellino<sup>2</sup>, Vittorio Limongelli<sup>2,4</sup>\* and Daniele Di Marino<sup>3</sup>\*

<sup>1</sup> Istituto di Chimica del Riconoscimento Molecolare, CNR, Milan, Italy, <sup>2</sup> Department of Pharmacy, University of Naples Federico II, Naples, Italy, <sup>3</sup> Department of Life and Environmental Sciences, New York-Marche Structural Biology Center (NY-MaSBiC), Polytechnic University of Marche, Ancona, Italy, <sup>4</sup> Faculty of Biomedical Sciences, Institute of Computational Science, Università della Svizzera Italiana (USI), Lugano, Switzerland

#### Edited by:

Alexandre M. J. J. Bonvin, Utrecht University, Netherlands

#### Reviewed by:

Martin Zacharias, Technical University of Munich, Germany; Christine Peter, University of Konstanz, Germany

#### \*Correspondence:

Vittorio Limongelli vittorio.limongelli@gmail.com Daniele Di Marino d.dimarino@univpm.it; daniele.dimarino@gmail.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences

> Received: 10 August 2019 Accepted: 25 March 2020 Published: 05 May 2020

#### Citation:

D'Annessa I, Di Leva FS, La Teana A, Novellino E, Limongelli V and Di Marino D (2020) Bioinformatics and Biosimulations as Toolbox for Peptides and Peptidomimetics Design: Where Are We? Front. Mol. Biosci. 7:66. doi: 10.3389/fmolb.2020.00066

Peptides and peptidomimetics are strongly re-emerging as amenable candidates in the development of therapeutic strategies against a plethora of pathologies. In particular, these molecules are extremely suitable to treat diseases in which a major role is played by protein–protein interactions (PPIs). Unlike small organic compounds, peptides display both a high degree of specificity, which avoids secondary off-target effects, and a relatively low degree of toxicity. Further advantages are provided by the possibility to easily conjugate peptides to functionalized nanoparticles, thereby improving their delivery and cellular uptake. In many cases, such molecules need to assume a specific three-dimensional conformation that resembles the bioactive one of the endogenous ligand. To this end, chemical modifications are introduced in the polypeptide chain to constrain it in a well-defined conformation and to improve its drug-like properties. In this context, a successful strategy for peptide/peptidomimetic design and optimization is to combine different computational approaches ranging from structural bioinformatics to atomistic simulations. Here, we review the computational tools for peptide design, highlighting their main features and differences, and discuss selected protocols, among the large number of methods available, used to assess and improve the stability of the functional folding of the peptides. Finally, we introduce the simulation techniques employed to predict the binding affinity of the designed peptides for their targets.

Keywords: peptides design, peptidomimetics, binding free-energy, protein–protein interaction, bioinformatics tools

### INTRODUCTION

Year by year, the use of theoretical approaches to study structural and dynamical features of macromolecules (Di Marino et al., 2014, 2015a; Orozco, 2014; D'Annessa et al., 2018, 2019a) is growing, thanks to the continuous improvement of methodologies and algorithms, as well as of high-performance computing facilities. Theoretical methodologies are gaining importance in many fields of science and now play a primary role in drug design. Indeed, hundreds of examples exist in which the use of computational techniques was
crucial to discover new molecules active against different diseases (Sliwoski et al., 2014; D'Annessa et al., 2019b). In the modern era, computer-aided drug design is successfully exploited not only to develop small molecules but also to guide the more challenging design of larger size compounds like peptides or peptide-like molecules (i.e., peptoids or peptidomimetics), which can retain the physicochemical features of bioactive proteins or polypeptide chains. One such feature is the conformational plasticity of peptides that allows them to interact with larger and more shallow surfaces compared to the typically cryptic binding pockets targeted by small molecules (Di Marino et al., 2015b; Vercelli et al., 2015; Di Leva et al., 2018). Therefore, peptides and peptidomimetics represent ideal candidates for targeting protein–protein interactions (PPIs). Indeed, PPIs have emerged as relevant drug targets since they are responsible for numerous cellular processes (Wanner et al., 2011; Otvos and Wade, 2014; Sun, 2016). Nonetheless, most PPIs were until recently considered "undruggable" by small compounds due to the involvement of large binding surfaces where the recognition is ruled by both the physicochemical properties and the shape of the interacting proteins (Bakail and Ochsenbein, 2016). Similar to protein-(small)ligand interactions PPIs are stabilized by noncovalent interactions, but with hydrophobic contacts, usually responsible for recognition and packaging, playing a primary role in stabilizing the complex (Tan et al., 2016). Moreover, upon the formation of macromolecular complexes new pockets can be formed at the interface between two or more proteins, and in some cases their targeting, aimed at stabilizing, instead of disrupting, the complex, can represent a clever therapeutic strategy to treat different diseases. 
Also in this case, however, small compounds are often not suitable for this purpose, while peptide-like molecules are particularly favored (Henninot et al., 2018; Lee et al., 2019). Furthermore, isolated peptides can compensate for the absence of the whole protein, as in the case of hormones, or can counteract the immune system in autoimmune diseases (Lau and Dunn, 2018). Moreover, peptides have peculiar characteristics that represent advantages in the field of drug development with respect to small molecules. For instance, they show a very low or null toxicity compared to synthetic compounds, being typically degraded in non-toxic metabolites, and are highly selective against a specific target, thus making their use particularly favored (Smith et al., 2019). Finally, many peptides can be easily conjugated either to nanoparticles for targeted delivery (Valcourt et al., 2018; Kalmouni et al., 2019) or to organic molecules working as biomarkers for diagnostic purposes (Wang and Hu, 2019).

In this perspective, much effort has been dedicated in the last decades to developing theoretical approaches for the design of therapeutic peptides/peptidomimetics, leading to a new branch of drug development known as computational peptidology (Zhou et al., 2013). These strategies gave birth to a leading industry producing nearly 20 new peptide-based clinical trials annually. At the time this review was written, more than 400 peptide drugs were under clinical development and over 60 were already approved for clinical use in the United States, Europe and Japan (Lee et al., 2019). Several designed peptides have shown great potential for the treatment of different types of cancers (Marqus et al., 2017; Zanella et al., 2019). Although these peptides are highly effective in cancer cell cultures, they still do not provide encouraging results in vivo (Marqus et al., 2017). This is because peptides may suffer from poor metabolic stability and membrane permeability, rapid proteolysis and unstable secondary structure (Zhang et al., 2018). To overcome such limitations, many strategies have been developed that rely on chemical modifications such as cyclization, N-methylation, stapling or the introduction of amide bond bioisosteres and non-natural amino acids. In addition, peptidomimetics can represent a valid alternative to target PPIs. Peptidomimetics are indeed organic molecules featuring physicochemical and structural properties resembling those of classical oligopeptides (Vagner et al., 2008; Zhang et al., 2018) but generally endowed with improved pharmacokinetic profiles.

The possibility to rationally design peptide-based molecules exploiting the structural characteristics of PPIs represents an enormous advantage to achieve the desired effect on the pathological process. The growing number of 3D structures available from X-ray diffraction and NMR has augmented our knowledge on protein–protein recognition and binding process, providing unprecedented insight into the proteins' structures in the apo form states and in protein–protein and protein–peptide complexes. This information is instrumental in the peptide design process. In this perspective, combining bioinformatics approaches with molecular simulations is a valuable strategy to obtain good drug-candidate peptides. Moreover, the increased accuracy in the calculation of binding free energy allows further characterizing the energetics of the molecular binding interaction, increasing the success rate of the design process (Torrie and Valleau, 1977; Di Marino et al., 2014, 2015b; Kilburg and Gallicchio, 2016). However, the field of peptides design and PPIs prediction/refinement is really extensive and the number of approaches developed for these purposes is constantly growing. Here we provide a concise report of selected computational protocols for peptides/peptidomimetic design, paying particular attention to the most widely employed bioinformatics tools and facilities and docking algorithms available to this end. We also introduce the simulations techniques used to validate protein–peptide complexes obtained by docking procedures and to predict the binding affinity of the designed peptides for their targets.

### PEPTIDE DESIGN AND DOCKING

Since PPIs emerged as druggable targets, much effort has been dedicated to developing algorithms and tools for peptide/peptidomimetic design. However, this is far from being a fully addressed issue and still poses many hurdles. Indeed, notwithstanding the increasing structural information available, the investigation of protein–peptide recognition is not an easy task to handle and shows several layers of complexity. For a full description of the process: (1) the three-dimensional structure of the investigated protein–protein complex should be available, in order to detect the protein region to use as a template for the design of peptides; (2) in the case the complex is not
available, the protein surface that has to be recognized by the PPI disruptor should be detected, or at least predicted, with high accuracy; (3) the structure of the target protein in its apo and holo states should be known, since the binding surface might change undergoing structural rearrangement upon protein or ligand binding; (4) since peptides are highly flexible entities, their conformational flexibility, stability in solution and the ability to achieve and maintain a well-defined active structure should be considered; and (5) finally, a putative structure of the designed peptide in complex with the target protein should be generated, typically by docking, in order to provide a possible mechanism of binding. However, achieving an accurate docking of conformationally flexible peptides to a target protein is a challenging task as discussed in the following sections.

To date, numerous bioinformatics tools for peptide design are available. These can be basically classified as ligand-based and target-based (**Figure 1**), even if in most cases the two approaches are combined. Ligand-based approaches can be further distinguished into sequence-based, conformation-based and property-based, with this last possibility still being the least explored.

Sequence-based approaches rely on the identification of conserved functional motifs, usually detected through multiple sequence alignment. These sequences are then modified to obtain a ranking of different candidates potentially able to interact with a specific target protein, usually blocking an interaction with another protein partner. This is the case of the PeptideMine webserver (Shameer et al., 2010).

Substantially different are the conformation-based approaches, which aim at building peptide structures and conformational ensembles that are further refined by investigation of structure-activity relationships. An example is PEP-FOLD, which exploits a Hidden Markov Model to derive a structural alphabet and to design stretches of "letters" that are assembled into 3D structures and then refined by Monte Carlo calculations (Thévenet et al., 2012).

Target-based strategies include knowledge-based and de novo design approaches. Knowledge-based methods use information from protein complexes, peptides and protein fragments (Vanhee et al., 2011). For instance, PiPred analyses protein complexes to find anchor residues and uses them to find the best peptides matching the target surface from databases of fragments (Oliva and Fernandez-Fuentes, 2015). PepComposer explores a pool of protein surfaces and delivers a set of backbone scaffolds that is able to target them. A subsequent Monte Carlo simulation refines the conformation of the newly designed peptides shown in the final peptide-protein complex (Obarska-Kosinska et al., 2016). Similarly, PepCrawler and its cognate PinaColada analyze protein complexes and derive candidate peptides that are subsequently randomly mutated in order to increase their affinity for the target. As a final result, the newly designed peptides are ranked according to the predicted binding affinity (Donsky and Wolfson, 2011; Zaidman and Wolfson, 2016).

De novo approaches endeavor to obtain peptides without any a priori structural knowledge. The pepsec tool, included in the Rosetta suite (Raman et al., 2009), provides peptide sequences and structures that are simultaneously optimized. The process is similar to the "anchor and grow" docking algorithms inasmuch as an anchor residue of the peptide is positioned on the protein surface and the chain is assembled starting from that point (King and Bradley, 2010). A significant advance was achieved with the implementation in Rosetta of a rotamer library that allows generating peptoid foldamers for the design of compounds with defined 3D structures thanks to the introduction of non-natural amino acids (Renfrew et al., 2014). Another example of de novo approaches is the VitAl algorithm, which identifies the binding site via a coarse-grained Gaussian Network model and generates the peptides by sequentially docking pairs of residues and determining the binding energies (Besray Unal et al., 2010).

The described methodologies, especially ligand-based strategies, can be supported by stand-alone protein–peptide docking programs, in order to identify or refine the binding poses of the designed peptides. Notably, these software can be also used to predict the interaction mode of known biologically active peptides with their target, thus guiding the design of novel PPI inhibitors. Nonetheless, protein–peptide docking programs can suffer from some inaccuracies, especially in the solvation and in the conformational sampling of the ligand backbone (Zhou et al., 2013). In the last decade, however, significant progress has been made to address these issues, achieving a satisfactory quality of predictions both by knowledge-based approaches among which HADDOCK and GalaxyPepDock represent some of the most accurate software (Trellet et al., 2013; Lee et al., 2015; Van Zundert et al., 2016), and ab initio programs, including the newest version of the Glide SP algorithm (Glide SP-peptide) and HPEPDOCK, which exploits a hierarchical algorithm to manage peptide flexibility through an ensemble of conformations generated (Antes, 2010; Tubert-Brohman et al., 2013; Li et al., 2014; Ben-Shimon and Niv, 2015; Kurcinski et al., 2015; Schindler et al., 2015; Alam et al., 2017; Zhou et al., 2018). In HADDOCK, experimental information on the targeted PPIs is exploited to drive the docking through the inclusion of interaction restraints during the calculations. The HADDOCK procedure for flexible protein–peptide docking is a multi-step process that combines different solvent models, conformational search and selection, and induced fit algorithms in a highly efficient protocol. The GalaxyPepDock protocol consists of a combination of similarity-based docking and energy-based optimization methods. Given a target protein and a peptide, the server performs a scan of experimentally determined PPIs structures database, in order to identify a proper PPI template. 
Subsequently, GalaxyPepDock builds a number of protein–peptide complexes that are further refined by energy-based methods to find the best structure interface. Conversely, Glide SP-peptide, pepATTRACT and Rosetta FlexPepDock operate without any a priori experimental information. In particular, Glide SP-peptide relies on a grid-based docking protocol, which takes advantage of advanced sampling algorithms during the search phase. The obtained poses can be further refined by post-processing calculations with physics-based implicit solvent MM-GBSA methods, then rescored and ranked by a custom scoring function. PepATTRACT combines coarse-grained ab initio docking with a subsequent atomistic refinement protocol. In particular, a fully blind procedure is followed, where the server examines the whole protein surface to find a putative binding site and simultaneously predicts the bound peptide conformation. Finally, FlexPepDock, which is implemented in the Rosetta suite, is able to provide high-resolution protein–peptide complexes starting from a set of coarse-grained models. These starting coarse-grained models are refined by Monte Carlo minimization, restricting the peptide's degrees of freedom and allowing flexibility of the side chains in the receptor's binding site.

### CONFORMATIONAL PEPTIDE PREDICTION

As reported above, bioinformatics tools show a good degree of accuracy in predicting peptides conformational plasticity, mainly through internal search algorithms that iteratively build different peptide backbone conformations, each one assigned with a specific binding score. However, severe approximations still reside in the docking sampling. For instance, many docking software treat the peptide backbone as rigid during the calculations making the a priori knowledge of its bioactive conformation necessary. In simplest cases, when the ligand assumes a unique, or at least a prevalent conformation in water, this can be straightforwardly computed based on experimental techniques such as proton NMR experiments. This strategy can be, for instance, applied to small cyclic peptides featuring a restricted backbone conformational space. However, in many cases peptides can assume several energetically equivalent states characterized by a rugged conformational free energy landscape. In such cases, it is advisable to support the peptide design with a reliable energy estimation of the different conformations assumed by the new peptide. To this end, atomistic simulations represent a valid tool. In particular, a number of efficient conformational searching methods have been developed or specifically adapted for this purpose. These include simulated annealing (Kirkpatrick et al., 1983; Wilson and Cui, 1990), distance geometry (Donné-Op Den Kelder, 1989), random search Monte Carlo (MC) (Chang et al., 1989; Weinberg and Wolfe, 1994), eigenvector-following (Cerjan and Miller, 1981; Simons et al., 1983), basin-hopping global optimization (Wales and Doye, 1997), discrete path sampling (Wales, 2002, 2004) and molecular dynamics (MD) based algorithms. Extensive reviews are available in literature on the application of simulated annealing (Bernardi et al., 2015) and distance geometry (Mucherino et al., 2013) to study peptides conformational sampling. 
For this reason, here we will mainly focus on the other approaches.

Among stochastic or random search approaches is the Monte Carlo Multiple Minimum (MCMM) method, commonly known as torsional sampling (Saunders et al., 1990), in which the peptide torsional bonds are randomly rotated through iterative Monte Carlo simulations, each followed by energy minimization, in order to identify local minima in the conformational potential energy surface (PES).
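The MCMM idea can be sketched on a toy problem. The example below uses a hypothetical torsional potential with three wells per torsion, a plain gradient descent minimizer and a random choice of the starting minimum at each step. These are simplifying assumptions made for illustration and do not reproduce the selection rules or force fields of production implementations.

```python
import math
import random

def torsion_energy(phis):
    """Toy torsional potential (hypothetical): threefold term plus weak neighbor coupling."""
    e = sum(1.0 + math.cos(3.0 * p) for p in phis)
    e += sum(0.3 * (1.0 - math.cos(a - b)) for a, b in zip(phis, phis[1:]))
    return e

def minimize(energy, x, step=0.02, iters=500, h=1e-5):
    """Gradient descent with central-difference gradients (stand-in local minimizer)."""
    x = list(x)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            xp, xm = x[:], x[:]
            xp[i] += h
            xm[i] -= h
            grad.append((energy(xp) - energy(xm)) / (2.0 * h))
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x

def _key(phis):
    """Identify a minimum by its torsions wrapped to [0, 2*pi) and rounded."""
    return tuple(round(p % (2.0 * math.pi), 2) for p in phis)

def mcmm(energy, n_torsions, n_steps, seed=0):
    """Random torsion rotations, each followed by minimization; returns sorted minima."""
    rng = random.Random(seed)
    start = minimize(energy, [1.0] * n_torsions)
    minima = {_key(start): (energy(start), start)}
    for _ in range(n_steps):
        _, current = rng.choice(sorted(minima.values()))
        trial = list(current)
        for i in rng.sample(range(n_torsions), rng.randint(1, n_torsions)):
            trial[i] += rng.uniform(-math.pi, math.pi)  # random torsional rotation
        new = minimize(energy, trial)
        minima.setdefault(_key(new), (energy(new), new))
    return sorted(minima.values())
```

Calling `mcmm(torsion_energy, n_torsions=3, n_steps=30)` returns a list of `(energy, torsions)` pairs ordered by energy, that is, the set of distinct local minima located during the search.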

An interesting example of an eigenvector-following method is the low mode conformational search (LMCS) (Kolossváry and Guida, 1996), in which local minima in the PES are found through movements along the "low energy eigenvectors," which are identified through a preliminary normal mode analysis, followed by energy minimization. The process is then iteratively repeated to find additional minima, eventually leading to the identification of a minimum energy path. In order to improve the performance of LMCS in global searches, a mixed MCMM/LMCS strategy has also been developed (Kolossváry and Guida, 1999) and successfully applied to the conformational sampling of macrocyclic compounds (Parish et al., 2002).

In basin-hopping global optimization (BHGO), the potential energy landscape is transformed into a series of "basins of attraction," which are explored through a hybrid random search-geometry optimization protocol (Li and Scheraga, 1987; Wales and Doye, 1997). In detail, random structural perturbations such as backbone Cartesian moves or rotations of amino acid side chains are initially applied to the biomolecule. After each perturbation, a geometry optimization cycle is performed to find the nearest local minimum, usually through the quasi-Newton L-BFGS (Limited-memory BFGS) minimization algorithm (Liu and Nocedal, 1989). The transition is finally either accepted or rejected based on a Metropolis criterion. The method allows crossing high barriers that separate the different energy basins, thus leading to the identification of the global minimum. Also, the thermodynamic properties of the system can be computed using the data set of local minima found during the search. Many variants of the technique have been developed to specifically address problems of biological interest, including peptides' conformational sampling. For instance, the efficiency of basin hopping can be improved by including experimental restraints (Carr et al., 2015) or by combining the method with other approaches, such as parallel-tempering (Strodel et al., 2010; Joseph and Wales, 2018). Connected to BHGO is the discrete path sampling approach. Here, a discrete path is defined as a connected sequence of minima and the intervening transition state(s) between them, which are appropriate for describing dynamical properties but can also be subjected to kinetic analysis (Wales, 2005). Discrete path sampling has been successfully used to explore the conformational energy landscape of both linear and cyclic peptides (Evans and Wales, 2004; Oakley and Johnston, 2013).
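The perturb, minimize and accept/reject cycle of basin hopping can be sketched in one dimension. The landscape below is a hypothetical toy function with its global minimum at x = 0, and a simple gradient descent replaces the L-BFGS minimizer mentioned above. All names and parameter values are assumptions made for this illustration.

```python
import math
import random

def energy(x):
    """Toy rugged 1D landscape (hypothetical) with its global minimum at x = 0."""
    return 0.05 * x * x - math.cos(2.0 * x)

def local_minimize(f, x, step=0.1, iters=300, h=1e-6):
    """Gradient descent with a central-difference gradient (stand-in for L-BFGS)."""
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2.0 * h)
        x -= step * grad
    return x

def basin_hopping(f, x0, n_steps=200, max_move=2.0, kT=0.5, seed=0):
    """Random perturbation, local minimization, then Metropolis acceptance."""
    rng = random.Random(seed)
    x = local_minimize(f, x0)
    e = f(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        trial = local_minimize(f, x + rng.uniform(-max_move, max_move))
        e_trial = f(trial)
        # Metropolis criterion applied to the energies of the two local minima
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / kT):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

Because the acceptance test compares minimized energies only, the barriers between basins do not enter the criterion, which is what allows the walker to move between wells that plain minimization could never connect.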

Molecular dynamics (MD) based techniques are widely used for peptide conformational sampling, either as stand-alone tools or in tandem with experiments. Indeed, it has been demonstrated that including NMR data, such as chemical shifts, interatomic distances or residual dipolar couplings (RDCs), as structural restraints in MD simulations can significantly improve the speed and efficiency of sampling algorithms. Ensemble or time-averaged MD represents a first example (Bonvin et al., 1994), followed by more recent advanced methodologies that integrate MD with experimental data. For instance, it was shown that, if geometrical restraints are applied to the system and averaged over simulation replicas, ensembles of conformations compatible with the maximum entropy principle are generated (Cavalli et al., 2013). This approach is known as replica-averaged restrained molecular dynamics and can offer a valid representation of the unknown Boltzmann distribution of a peptide conformational landscape (De Simone et al., 2011). MD simulations can also be coupled to Markov State Models (MSM) to predict the folding pathways and kinetics of polypeptides (Chodera and Noé, 2014; Husic and Pande, 2018). An efficient alternative strategy is to employ enhanced sampling methodologies, which allow investigating events that extend beyond the timescale limit of standard simulations. Important examples are umbrella sampling (US) (Torrie and Valleau, 1977) and metadynamics (MetaD) (Laio and Parrinello, 2002), which rely on the application of a bias to a set of user-defined reaction coordinates, specifically designed for the system under investigation and commonly referred to as collective variables (CVs). These methodologies can provide an accurate description of the free energy landscape underlying the process of interest. 
In particular, MetaD (Laio and Parrinello, 2002) in its well-tempered variant (Barducci et al., 2008) has been widely applied to conformational studies of both linear and cyclic peptides. For instance, Musco and coworkers employed MetaD to predict the bioactive conformation and the pharmacological behavior of cyclic penta- and hexapeptides designed as modulators of RGD-integrin receptors (Spitaleri et al., 2011; Simon et al., 2018). Remarkably, metadynamics can be combined with replica-exchange (RE) methods such as the parallel-tempering (PT) (Bussi et al., 2006) and bias-exchange (BE) (Piana and Laio, 2007) algorithms, in which n exchangeable replicas of the system are simulated at different temperatures or while biasing different sets of CVs, respectively. For instance, PT-MetaD was recently applied to predict the turn-helix conformation of a linear peptide reported as a selective ligand of the αvβ6 RGD-integrin, leading to new selective cyclopeptidic ligands with potential clinical applications (**Figure 2A**; Di Leva et al., 2018). Furthermore, the performance of metadynamics can be improved through the inclusion of experimental data, either in the user-defined CVs in a BE scheme (Granata et al., 2013) or as replica-averaged structural restraints. The latter approach is known as replica-averaged metadynamics (Camilloni et al., 2013) and is typically performed in the well-tempered ensemble (WTE), where the energy is used as a CV (Camilloni et al., 2013). As an alternative to CV-based techniques, other enhanced sampling methodologies such as accelerated MD (Hamelberg et al., 2004), replica exchange with solute tempering (REST) (Liu et al., 2005) and reservoir-REMD (R-REMD) (Okur et al., 2007; Roitberg et al., 2007) have been successfully used for the conformational sampling of peptides. In accelerated MD, the sampling is improved through the addition of a boost potential to the potential energy of the system (Hamelberg et al., 2004). 
This technique was shown to provide conformational ensembles for peptidic macrocycles that reproduce the available experimental structures well (Kamenik et al., 2018). In replica exchange with solute tempering, the contributions of the solute–solvent and solvent–solvent energies are scaled in order to strengthen solvent interactions at elevated temperatures. As a result, only the solute is simulated at different temperatures, as in traditional REMD, while the solvent is kept at the original temperature in all replicas. The exchange probabilities depend exclusively on the contribution from the solute atoms, which generally show broader energy distributions than the solvent. Accordingly, fewer replicas are needed to cover the desired temperature range than in standard REMD, thus saving computational time and resources (Liu et al., 2005). Finally, R-REMD is based on a classical PT scheme in which the highest-temperature replica is replaced by a structure reservoir that is pre-generated through standard MD simulations performed at the same temperature (Okur et al., 2007; Roitberg et al., 2007).
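As a reminder of why the width of the energy distributions controls the number of replicas, the following minimal sketch evaluates the standard Metropolis acceptance probability for swapping two replicas in temperature replica exchange. The function name and the example energies are assumptions made for illustration; variants such as REST change which energy terms enter this test, which is not reproduced here.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(e_i, e_j, t_i, t_j):
    """Acceptance probability for swapping the configurations of two replicas.

    e_i, e_j are the potential energies (kcal/mol) of the configurations
    currently held at temperatures t_i, t_j (K).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    # min(1, exp(delta)) written so that exp() is never given a positive argument.
    return 1.0 if delta >= 0.0 else math.exp(delta)


if __name__ == "__main__":
    # Hypothetical energies: the colder replica holds the lower-energy structure.
    print(exchange_probability(-110.0, -100.0, 300.0, 310.0))
```

The swap is always accepted when the colder replica holds the higher-energy configuration; otherwise the probability decays exponentially with the energy gap, so neighbouring replicas need overlapping energy distributions to exchange at a useful rate.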

### ESTIMATION OF THE PEPTIDES/PEPTIDOMIMETICS BINDING FREE-ENERGY

An accurate estimation of the protein–peptide binding affinity is important to guide key steps in the drug discovery pipeline, such as the hit-to-lead and lead optimization processes. This is, however, a challenging task to achieve with standard computational methodologies. For instance, docking algorithms can provide rapid qualitative information about the peptide binding modes but generally fail to estimate receptor affinities accurately because of the intrinsic approximations of the method. On the other hand, standard MD would require tens of microseconds of simulation to collect enough statistics to describe the full ligand binding process (Dror et al., 2011; Shan et al., 2011), a timescale that is rarely accessible with current protocols and resources (Salmaso and Moro, 2018). The timescale limitation of classical MD can be overcome by means of free-energy methods, which can be grouped into three main categories: endpoint, alchemical perturbation and physical pathway methods.

Endpoint methods, which include linear interaction energy (LIE) (Aqvist et al., 2002), molecular mechanics Poisson–Boltzmann surface area (MM-PBSA) (Srinivasan et al., 1998), and generalized Born surface area (MM-GBSA) (Kuhn and Kollman, 2000), compute the binding free energy by taking the difference between the absolute free energy of the ligand in the unbound and bound states, which are sampled separately. These methods, particularly MM-PBSA and MM-GBSA, offer a good balance between computational efficiency and accuracy, and can be successfully used to predict binding affinities and to identify or rescore the correct binding poses for protein–peptide systems (Weng et al., 2019). Interestingly, a dampened MM-PBSA scoring function was recently introduced in HADDOCK to further improve the predictiveness of the docking protocol and to estimate the protein–peptide binding affinity (Spiliotopoulos et al., 2016). Nevertheless, the large-scale application of endpoint approaches is partly limited by approximations in both the sampling and the energy calculation, which are mainly due to the use of implicit solvent models (Wang et al., 2019).

Alchemical methods are typically more rigorous and accurate, although they suffer from a higher computational cost. They include thermodynamic integration (TI) (Kirkwood, 1935), free-energy perturbation (FEP) (Kirkwood, 1935) and Bennett Acceptance Ratio (BAR) (Bennett, 1976; Shirts and Chodera, 2008). In these calculations, ligand and protein are gradually decoupled and the binding free energy is computed from a thermodynamic path connecting the bound and unbound states. At each step of the alchemical path, the sampling can be performed using either MC or MD simulations, with the latter approach being the most widely used. Frequently, a translational restraint potential is applied along the path to control the turning off of the molecular interactions between the ligand and the protein binding site. This reduces the configurational space to be sampled between the end-points, thus enhancing the efficiency of the free energy calculation. Alchemical transformations that employ translational restraints are generally referred to as the "double decoupling method" (DDM), while calculations in which no translational restraint is present are classified as the "double annihilation method" (DAM) (Deng and Roux, 2009).
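For thermodynamic integration, the free energy change along the alchemical path is the integral of the ensemble-averaged derivative of the potential energy with respect to the coupling parameter λ. A minimal sketch of the numerical step is given below, assuming the averages have already been collected at a set of λ values; the function name, the trapezoidal rule and the numbers are illustrative choices, not those of any specific package.

```python
def thermodynamic_integration(lambdas, mean_dudl):
    """Trapezoidal estimate of the integral of <dU/dlambda> over lambda.

    lambdas   : increasing coupling-parameter values, e.g. from 0 to 1
    mean_dudl : ensemble average of dU/dlambda at each lambda (kcal/mol)
    """
    if len(lambdas) != len(mean_dudl) or len(lambdas) < 2:
        raise ValueError("need matching lists with at least two lambda points")
    total = 0.0
    for k in range(1, len(lambdas)):
        width = lambdas[k] - lambdas[k - 1]
        total += 0.5 * width * (mean_dudl[k] + mean_dudl[k - 1])
    return total


if __name__ == "__main__":
    # Hypothetical averages that vary linearly with lambda.
    print(thermodynamic_integration([0.0, 0.5, 1.0], [2.0, 4.0, 6.0]))
```

In practice the spacing of the λ points matters most where the averaged derivative changes rapidly, which is one reason the end-points of the path often receive a denser grid.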

In physical pathway methods, which include steered molecular dynamics (SMD) (Izrailev et al., 1997) and US (Torrie and Valleau, 1977), the ligand and the receptor are physically separated along the binding pathway and the potential of mean force (PMF), and in turn the binding free energy, is then computed. In SMD, an external force with tuneable spring constant and velocity is applied to pull the ligand out of the binding site. The PMF is then obtained from the average of the irreversible work minus the dissipative work of the process, according to the Jarzynski non-equilibrium work theorem (Jarzynski, 1997a,b). Several independent SMD trajectories need to be carried out to provide a statistically significant calculation of the irreversible work and, accordingly, an accurate estimation of the PMF. In addition, optimizing the pulling force can reduce the dissipative part of the work, which eventually improves the convergence of the calculations. In US, an external harmonic bias potential is applied to a user-defined CV to physically drive the ligand from the bound state to the unbound state. The pathway is usually divided into n steps, commonly known as windows, in which standard MD calculations are performed in the presence of the harmonic potential. The change in free energy between adjacent windows can be computed from the collected MD trajectories using different methods, the most commonly used being the Weighted Histogram Analysis Method (WHAM) (Souaille and Roux, 2001).
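The Jarzynski relation mentioned above, ΔF = −kT ln⟨exp(−W/kT)⟩, can be evaluated from a set of work values with a few lines of code. The sketch below is illustrative only: the function name and the work values are assumptions, and the log-sum-exp shift is a standard numerical precaution rather than part of the theorem.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def jarzynski_free_energy(works, temperature=300.0):
    """Free energy difference (kcal/mol) from non-equilibrium work values.

    works holds the total work (kcal/mol) of each independent pulling run.
    """
    kt = K_B * temperature
    exponents = [-w / kt for w in works]
    shift = max(exponents)  # avoids overflow and underflow in exp()
    log_mean = shift + math.log(
        sum(math.exp(x - shift) for x in exponents) / len(exponents)
    )
    return -kt * log_mean


if __name__ == "__main__":
    # Hypothetical work values from three pulling trajectories.
    print(jarzynski_free_energy([4.0, 6.0, 8.0]))
```

The exponential average is dominated by the rare low-work trajectories, so the estimate never exceeds the mean work, and many independent runs are needed before it stabilizes.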

Numerous successful applications of both alchemical and pathway methods are reported in the literature. However, these methodologies can also suffer from limitations such as: (1) a use restricted to small ligands, for which relatively few conformations must be sampled, and (2) the need for a priori knowledge of the ligand binding mode, for alchemical transformation methods; (3) an incomplete sampling of the ligand solvated state (Limongelli et al., 2012); (4) an insufficient sampling of the ligand bound state(s) in the case of large conformational changes of the receptor; and (5) the presence of additional degrees of freedom important for the ligand binding/unbinding process that are neglected during the calculation (Limongelli et al., 2012; Limongelli, 2020). In addition, the binding free energy calculation typically converges slowly and may depend on the ligand size and charge, thus hampering the application of such methods to the study of peptide/peptidomimetic-protein interactions (Gumbart et al., 2013).

In an attempt to address these problems, many variants of these methodologies have been developed over the last decades. In the field of alchemical transformations, for instance, REMD-based approaches were introduced to increase the accuracy and the convergence rate of calculations. Among these is a mixed FEP/REMD strategy that relies on accelerated MD simulations performed in a Hamiltonian replica exchange MD (H-REMD) scheme, in which n replicas of the system with a modified Hamiltonian are run in parallel and are exchanged according to specific acceptance criteria (Sugita et al., 2000). The FEP/REMD approach allows the ligand to escape from kinetically trapped conformations, which usually affect the efficiency of standard FEP/MD calculations (Jiang and Roux, 2010). A more recent example is Modeling Employing Limited Data (MELD)-accelerated MD, in which experimentally derived constraints are applied in a combined temperature and H-REMD simulation framework (Morrone et al., 2017). Alternatively, a single decoupling method (SDM) was proposed, in which a single alchemical calculation is performed in an H-REMD scheme using, however, an implicit solvent model (Kilburg and Gallicchio, 2018). In its original formalism, SDM relied on US simulations performed in Hamiltonian replica exchange and combined with the WHAM method for the calculation of the binding free energy. This approach is known as the Binding Energy Distribution Analysis Method (BEDAM) and computes the binding constant through a Boltzmann-weighted integral of the probability distribution of the binding energy obtained in the canonical ensemble in which the ligand, while positioned in the binding site, is embedded in the solvent continuum and does not interact with receptor atoms (Gallicchio et al., 2010; Di Marino et al., 2015c).

As mentioned above, physical pathway methods are typically affected by an insufficient sampling of the ligand solvated state. A possible solution to this critical point was provided by the works of Roux and Henchman, who introduced a cylindrical restraint potential in US simulations to reduce the sampling space in the unbound state (Woo and Roux, 2005; Doudou et al., 2009). Following this example, geometrically restricted potentials were introduced in other enhanced sampling methodologies such as MetaD. A recent example is Funnel-Metadynamics (FM), in which a funnel-shaped restraint potential is applied to the system during the simulation to reduce the phase space explored by the ligand in the unbound state. This enhances the sampling of both the target binding site and the ligand solvated state, leading to a thorough characterization of the binding free-energy surface and an accurate calculation of the absolute protein-ligand binding free energy (Limongelli et al., 2013). So far, the method has been employed to study both ligand/protein and ligand/DNA systems (Troussicot et al., 2015; Moraca et al., 2017; Yuan et al., 2018; D'Annessa et al., 2019b), and it is also suitable for the investigation of peptide-protein binding processes.

### CONCLUDING REMARKS

Designing peptides able to interact with specific target proteins is only the first step toward the development of compounds that can be considered as drug candidates. Despite their great potential, as discussed at length above, some limitations to the use of peptides in clinical routines still exist, mainly due to their low stability in solution and poor permeability through cellular membranes and physiological barriers, such as the blood–brain barrier (BBB).

The introduction of modifications in the chemical structure that could stabilize a peptide in its bioactive conformation, increasing efficiency, represents the smartest strategy. This can be achieved by introducing non-natural side chains, D-amino acids, non-alpha-amino acids, peptide bond isosteres, staples and cyclization, which change peptides into peptoids or peptidomimetics (**Figure 2**; Vagner et al., 2008; Zhang et al., 2018). Typically, these modifications are designed either by adding chemical functional groups to a well-characterized active peptide or by using small molecules as building blocks that mimic the amino acid backbone, with the aim of reproducing the geometry of the secondary structure elements (SSEs) (i.e., α-helix and β-strand) of bioactive peptides (Vagner et al., 2008; Zhang et al., 2018). Indeed, SSEs play a key role in PPIs, and among them α-helices are the most commonly found at PPI interfaces. Peptidomimetics offer enhanced protection against peptidases, improved systemic delivery and cell penetration, high target specificity and a poor immune response, and they are already in use against different pathologies, such as cancer and diabetes (Vagner et al., 2008; Zhang et al., 2018). In this context, computational approaches such as MetaD (**Figure 2A**) and classical MD simulations (**Figure 2B**) proved to be valid tools to drive the conversion of peptides into more active peptoids/peptidomimetics, targeting the αvβ6 RGD-integrin in one case (Di Leva et al., 2018) and the eukaryotic translation initiation factor 4E (eIF4E) in the other (Lama et al., 2013, 2019).

As highlighted in this review, peptides and peptidomimetics can play a central role in pharmacological applications, with a potentially strong economic impact on the pharmaceutical industry. Indeed, the use of peptides/peptidomimetics for the treatment of very different pathologies, including some types of cancer, Alzheimer's disease, metabolic diseases and microbial infections, is now becoming a standard approach (Qvit et al., 2017; Mabonga and Kappo, 2019).

Furthermore, the implementation of "hybrid" approaches that combine theoretical and experimental techniques can appreciably assist drug design, making it possible, for instance, to overcome some issues related to the development of peptides that are mainly due to their nature and size.

We strongly believe that the improvement of computational peptidology techniques aimed at modifying and increasing the potential of these molecules to obtain multifunctional peptides, cell penetrating peptides and peptide drug conjugates, will help strengthen the efficacy and the applicability of peptides as therapeutics.

In conclusion, peptide design is an appealing but complex process that raises many challenges; a successful outcome requires a deep knowledge of the available approaches and of how to combine them to overcome their major drawbacks.

### AUTHOR CONTRIBUTIONS

This mini-review article was conceived by DD with contributions from all authors, under the supervision of DD and VL.

### FUNDING

VL thanks the support of the Swiss National Science Foundation (Project N. 200021\_163281), the Italian MIUR/PRIN 2017 (2017FJZZRC), and the Cost action CA15135 (Multi-target paradigm for innovative ligand identification in the drug discovery process MuTaLig).





**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 D'Annessa, Di Leva, La Teana, Novellino, Limongelli and Di Marino. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Molecular Basis of S100A1 Activation and Target Regulation Within Physiological Cytosolic Ca<sup>2+</sup> Levels

Bin Sun and Peter M. Kekenes-Huskey\*

*Department of Cell and Molecular Physiology, Loyola University Chicago, Maywood, IL, United States*

The S100A1 protein regulates cardiomyocyte function through its binding of calcium (Ca<sup>2+</sup>) and target proteins, including titin, SERCA, and RyR. S100A1 presents two Ca<sup>2+</sup> binding domains, a high-affinity canonical EF-hand (cEF) and a low-affinity pseudo EF-hand (pEF), that control S100A1 activation. For wild-type S100A1, both EF hands must be bound by Ca<sup>2+</sup> to form the open state necessary for target peptide binding, which requires unphysiologically high, sub-millimolar Ca<sup>2+</sup> levels. However, there is evidence that post-translational modifications at Cys85 may facilitate the formation of the open state at sub-saturating Ca<sup>2+</sup> concentrations. Hence, post-translational modifications of S100A1 could potentially increase the Ca<sup>2+</sup>-sensitivity of binding protein targets, and thereby modulate the corresponding signaling pathways. In this study, we examine the mechanism of S100A1 open-closed gating via molecular dynamics simulations to determine the extent to which Cys85 functionalization, namely via redox reactions, controls the relative population of open states at sub-saturating Ca<sup>2+</sup> and the capacity to bind peptides. We further characterize the protein's ability to bind a representative peptide target, TRTK12, and relate this propensity to published competition assay data. Our simulation results indicate that functionalization of Cys85 may stabilize the S100A1 open state at physiological, micromolar Ca<sup>2+</sup> levels. Our conclusions support growing evidence that S100A1 serves as a signaling hub linking Ca<sup>2+</sup> and redox signaling pathways.

Keywords: S100A1 protein, calcium affinity, post-translational modification (PTM), passive tension, molecular dynamics

#### Edited by:

*Giulia Palermo, University of California, Riverside, United States*

#### Reviewed by:

*Matteo Salvalaglio, University College London, United Kingdom*
*Matteo Tiberti, Danish Cancer Society Research Center (DCRC), Denmark*

#### \*Correspondence:

*Peter M. Kekenes-Huskey pkekeneshuskey@luc.edu*

#### Specialty section:

*This article was submitted to Biological Modeling and Simulation, a section of the journal Frontiers in Molecular Biosciences*

Received: *22 January 2020* Accepted: *06 April 2020* Published: *23 June 2020*

#### Citation:

*Sun B and Kekenes-Huskey PM (2020) Molecular Basis of S100A1 Activation and Target Regulation Within Physiological Cytosolic Ca<sup>2+</sup> Levels. Front. Mol. Biosci. 7:77. doi: 10.3389/fmolb.2020.00077*

### 1. INTRODUCTION

S100A1 is a Ca<sup>2+</sup> binding protein that is implicated in cardiac and neurological functions (Wright et al., 2009b). S100A1 regulates several targets including ryanodine receptor (RyR), sarcoplasmic/endoplasmic reticulum calcium ATPase (SERCA), phosphoglucomutase, tubulin, and tumor protein p53 (Landar et al., 1996; Santamaria-Kisiel et al., 2006; Wright et al., 2008; Duarte-Costa et al., 2014) in a Ca<sup>2+</sup>-dependent manner. S100A1 expressed in cardiac tissue is believed to manage contractile behavior either through modulating cytosolic Ca<sup>2+</sup> (Kraus et al., 2009), which triggers the initiation of contraction, or through modulating properties of the contractile fibers of the myofilament. For the latter, there is evidence that Ca<sup>2+</sup>-loaded S100A1 disrupts interactions between actins of the thin filament and titin (Gutierrez-Cruz et al., 2001; Granzier et al., 2010; Yamasaki et al., 2011). Specifically, there are several reports that the PEVK-rich regions of titin bind actin and thereby behave as a viscous brake during extension and contraction. Curiously, in vitro assays indicate S100A1 binds targets at Ca<sup>2+</sup> levels considerably above physiological Ca<sup>2+</sup> concentrations, which casts doubts on the ability of wild-type (WT) S100A1 to contribute to titin's management of contractile properties.

S100A1 belongs to the EF-hand calcium binding protein family, in which the Ca<sup>2+</sup> ions are bound to a helix-loop-helix motif (EF-hand). Two EF-hands exist in the S100A1 monomer, the canonical hand (cEF) and the pseudo hand (pEF). The cEF hand has a higher Ca<sup>2+</sup> affinity than the pEF hand, with dissociation constants of ∼27–50 µM for the former and ∼250–16,700 µM for the latter (Goch et al., 2005; Wright et al., 2005). Within cells, the dissociation constant for the pEF hand is orders of magnitude larger than the cellular Ca<sup>2+</sup> concentration, implying that the pEF hand does not significantly bind Ca<sup>2+</sup>. However, it has been shown that binding Ca<sup>2+</sup> at both the pEF and cEF hands is a prerequisite for S100A1's interactions with Ca<sup>2+</sup>-dependent targets (Nowakowski et al., 2013). Thus, it is of great importance to understand how S100A1 is activated under non-saturating (half-saturated) cellular Ca<sup>2+</sup> conditions.
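The argument that the pEF hand is essentially unoccupied in cells can be made concrete with a simple occupancy estimate. The sketch below assumes independent single-site binding without cooperativity, which is a simplification introduced here; it uses the lower bounds of the dissociation constants quoted above and illustrative Ca<sup>2+</sup> concentrations.

```python
def fraction_bound(ca_um, kd_um):
    """Single-site occupancy, [Ca] / ([Ca] + Kd), with both values in micromolar."""
    return ca_um / (ca_um + kd_um)


if __name__ == "__main__":
    kd_cef = 27.0    # lower bound quoted for the cEF hand (micromolar)
    kd_pef = 250.0   # lower bound quoted for the pEF hand (micromolar)
    for ca in (0.1, 1.0, 10.0):  # illustrative cytosolic concentrations (micromolar)
        print(ca, fraction_bound(ca, kd_cef), fraction_bound(ca, kd_pef))
```

Even with the most favorable dissociation constant, the pEF occupancy stays well below 1% at sub-micromolar Ca<sup>2+</sup> under these assumptions, which is consistent with the statement above.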

S100A1 is generally found as a homo-dimer when subject to conditions amenable to structure determination via x-ray crystallography (Melville et al., 2017) or nuclear magnetic resonance (NMR) spectroscopy (Wright et al., 2005; Nowakowski et al., 2011, 2013); its dimerization happens at picomolar monomer concentrations (Kraus et al., 2009). Similar to most members of the S100 class of Ca<sup>2+</sup> binding proteins, S100A1 activation proceeds through binding of two Ca<sup>2+</sup> ions, one each at the low-affinity pEF and high-affinity cEF hands. In its fully-saturated, Ca<sup>2+</sup>-bound (holo) state, S100A1 presents a hydrophobic patch between helices three and four (H3 and H4) that enables binding to regulatory domains of protein targets (Wright et al., 2005; Nowakowski et al., 2011), which may be accompanied by significant increases in solvent accessible surface area relative to the apo state (Chaturvedi et al., 2020). In the absence of Ca<sup>2+</sup> (apo state), the hydrophobic patch is concealed by closing of the H3/H4 hinge region (**Figure 1**) (Nowakowski et al., 2011). Our recent studies suggested that the half-saturated state of S100A1 (Scott and Kekenes-Huskey, 2016), which we characterize as the conformation with a single Ca<sup>2+</sup> bound at the cEF hand, is insufficient to maintain an exposed regulatory binding region. Since the pEF Ca<sup>2+</sup> affinity is reported to be in the sub-millimolar range (Goch et al., 2005), these findings suggested that S100A1 may be incapable of recognizing Ca<sup>2+</sup>-dependent targets (Scott and Kekenes-Huskey, 2016). Hence, native S100A1 is unlikely to dynamically regulate protein targets, such as those underlying passive tension, within physiological Ca<sup>2+</sup> ranges.

Interestingly, functionalization of Cys85 at the C-terminus of H4 has been shown to yield 10- and 10,000-fold increases in Ca<sup>2+</sup> affinity at the cEF and pEF hands, respectively (Goch et al., 2005). The increased Ca<sup>2+</sup> affinity was reported to be caused by favorable cooperativity between the binding events in the two EF-hands (Goch et al., 2005). The enhanced Ca<sup>2+</sup> affinity could therefore confer the ability to activate S100A1 at micromolar Ca<sup>2+</sup>. Indeed, an assay by Goch et al. (2005) indicated that the modified S100A1 protein presented its peptide-binding hydrophobic patch at physiological Ca<sup>2+</sup> levels, in contrast to the native protein.

Since S100A1 Cys85 is a known target for in vivo nitrosylation (Živković et al., 2012) and glutathionylation (Goch et al., 2005), we explored via molecular dynamics simulation whether mutation of the cysteine to bulkier side groups could promote S100A1 domain opening in its half-saturated state. Additionally, although atomistic-resolution structures of S100A1 complexed with an RyR regulatory peptide and with a 12-residue peptide, TRTK12, have been determined (Wright et al., 2008), neither the motif nor the molecular structure of an S100A1/PEVK complex has been established. Under the assumption that S100A1 binds PEVK in a conformation similar to other established Ca<sup>2+</sup>-dependent targets, namely TRTK12, and those exhibited among other S100 family proteins, we investigate the conditions necessary for the regulatory protein's activation under physiological Ca<sup>2+</sup> concentrations. We specifically examined S100A1's capacity to bind regulatory peptides in the WT and the C85 mutants. Furthermore, we demonstrate how changes in the Ca<sup>2+</sup>-sensitivity of S100A1 activation could control competition between PEVK/S100A1 and PEVK/actin binding using a multi-state macroscopic model.

### 2. MATERIALS AND METHODS

### 2.1. Molecular Dynamics Simulation

The starting structures are based on the NMR apo and holo structures of S100A1 [PDBs 2L0P (Nowakowski et al., 2011) and 2LP3 (Nowakowski et al., 2013), respectively]. The mutations of C85 to E85 and R85 were performed with the CHARMM-GUI utility (Jo et al., 2008). For the half-saturated state, the Ca<sup>2+</sup> at the pEF site was deleted. Each system was solvated in a TIP3P (Jorgensen et al., 1983) water box with a 20 Å margin, and KCl was added at 0.15 M to maintain a physiological ionic strength. The system was parameterized with the AMBER ff12SB force field (Case et al., 2012), with Ca<sup>2+</sup> parameters adapted from the Li-Merz work (Li et al., 2003). Each system was first subjected to energy minimization, followed by a heating stage during which the weak-coupling algorithm was used. After the system had equilibrated at 300 K, a 100 ns production MD simulation was performed using the PMEMD.CUDA module of the AMBER 14 package (Case et al., 2014). Clustering analysis was performed on this 100 ns production trajectory via CPPTRAJ (Roe and Cheatham, 2013) using a hierarchical agglomerative (bottom-up) approach. The representative structures of the 3 least populated clusters were used as starting structures for the next cycle of production runs (each run was about 400 to 700 ns long).

The cumulative sampling time for each case is around 2 µs for the apo and half-saturated states and about 0.6 µs for the fully-saturated state. The time step was 2 fs and the cutoff for nonbonded interactions was set to 10 Å. Throughout the MD simulations, the SHAKE algorithm (Ryckaert et al., 1977) was used to constrain the lengths of bonds involving hydrogen atoms. All simulations are summarized in **Table S1**. The CPPTRAJ (Roe and Cheatham, 2013) program from Amber was used to calculate the root mean squared fluctuations (RMSF), α-helical probability, contact map and H3/H4 inter-helix angle values. The RMSF values were calculated on backbone atoms (Cα, C, N, and O atoms). The α-helical probability for residues in the H4 C-terminus (residues 85-93) was calculated using the Define Secondary Structure of Proteins (DSSP) algorithm (Kabsch and Sander, 1983). Contact map data were calculated with a distance cutoff of 7 Å, and only residue pairs at least 6 residues apart in sequence (i and i + 6 or beyond) were considered.
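The contact criterion described above (a 7 Å cutoff and a minimum sequence separation of 6 residues) can be sketched as follows. This is not the CPPTRAJ implementation: it assumes one representative coordinate per residue (for example the Cα atom), which is a choice made here for illustration, and the function name and coordinates are hypothetical.

```python
import math


def contact_pairs(coords, cutoff=7.0, min_separation=6):
    """Return residue index pairs (i, j) that are in contact.

    coords holds one (x, y, z) tuple per residue, in angstroms. Only pairs
    with j - i >= min_separation and a distance <= cutoff are reported.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Hypothetical chain: seven residues on a line, an eighth folded back.
    chain = [(3.8 * k, 0.0, 0.0) for k in range(7)] + [(0.0, 5.0, 0.0)]
    print(contact_pairs(chain))
```

Excluding pairs close in sequence removes the trivial contacts within a helix turn, so the remaining map highlights inter-helix packing such as that between H3 and H4.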

### 2.2. Potential of Mean Force of TRTK12 Peptide Unbinding From S100A1

TRTK12 is a 12-residue peptide that was reported to bind to S100A1 in a Ca<sup>2+</sup>-dependent manner (Ivanenkov et al., 1995). To compute the PMFs for dissociation of the TRTK12 peptide from S100A1, we constructed the TRTK12-S100A1 complex structures using our simulated half- and fully-saturated S100A1 structures. Specifically, the most probable MD-sampled S100A1 structures (see **section S1.1** in **Supplementary Material** for details of determining these structures) were superimposed on the NMR structure of the TRTK12-S100A1 complex (PDB 2KBM). After placing the experimentally determined TRTK12 peptide structure into the hydrophobic pocket of the MD-simulated S100A1, we minimized the energy to eliminate potentially overlapping atoms. After the minimization, we equilibrated the system and ran a further 60 ns MD simulation of the TRTK12-S100A1 complex in the WT fully-saturated state to assess the binding of the TRTK12 peptide at the hydrophobic cleft (**Figure S3**). The reaction coordinate (RC) was defined as the distance between the center of mass (COM) of the peptide Cα atoms and the COM of the Cα atoms in the H3 C-terminus (residues K30-T39) and the H4 N-terminus (residues E73-A84). We note that we used NAMD in order to make use of its support for collective reaction coordinate variables (the protein COMs). This required parameterizing the system using the CHARMM36 force field (MacKerell et al., 1998, 2004). As such, all structures subjected to PMF calculations were compared using the CHARMM36 force field, whereas all other simulations used the AMBER ff14SB parameterization to ensure consistent comparisons. After obtaining the TRTK12-S100A1 complex structures, each system was solvated in a TIP3P water box with a 14 Å margin, and KCl was added at 0.15 M. The sampling was performed with the RC ranging from 9.5 to 33.5 Å, giving a total of 49 simulation windows spaced 0.5 Å apart. 
For each window, after minimization and equilibration, an 8 ns production MD simulation was performed in the NPT ensemble at 300 K. A harmonic potential with a force constant of 18 kcal/mol/Å<sup>2</sup> was applied at the center of each window during the simulations. Two loose angle constraints were introduced to prevent the peptide from sliding along the S100A1 surface. All PMF calculations were performed with NAMD 2.11 (Phillips et al., 2005). The PMFs along the reaction coordinate were constructed using the WHAM program (Grossfield). The PMF error was estimated using the Monte Carlo Bootstrap Error Analysis function in the WHAM program (with 30 MC trials). We also performed a PMF error analysis based on block averaging. Specifically, the variance of the RC in each umbrella sampling window was estimated via block averaging analysis with a block size of 200 (**Figures S11A,B**). The RC variance was then used to calculate the PMF error (**Figure S11C**) based on Equation (1) in Zhu and Hummer (2012). Since the magnitudes of the PMF errors from the MC trials and from block averaging are comparable, we show in the main text the PMF with the MC error analysis.
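The window layout quoted above can be checked with a short sketch: a reaction coordinate running from 9.5 to 33.5 Å in 0.5 Å steps gives 49 window centers when both end points are included. The function names are hypothetical, and the bias is written with the convention U = ½k(ξ − ξ<sub>0</sub>)<sup>2</sup>, which is an assumption; some simulation codes omit the factor ½ or rescale the coordinate by a width parameter.

```python
def window_centers(start=9.5, stop=33.5, width=0.5):
    """Evenly spaced umbrella-window centers (angstroms), end points included."""
    n_windows = int(round((stop - start) / width)) + 1
    return [start + k * width for k in range(n_windows)]


def harmonic_bias(rc, center, force_constant=18.0):
    """Bias energy (kcal/mol) at reaction-coordinate value rc (angstroms).

    Assumes U = 0.5 * k * (rc - center)**2 with k in kcal/mol/angstrom^2.
    """
    return 0.5 * force_constant * (rc - center) ** 2


if __name__ == "__main__":
    centers = window_centers()
    print(len(centers), centers[0], centers[-1])
    # Bias felt halfway between two neighbouring window centers.
    print(harmonic_bias(centers[0] + 0.25, centers[0]))
```

Under this convention the bias is only about 0.56 kcal/mol halfway between adjacent centers, which is comparable to kT at 300 K and suggests that neighbouring windows would overlap, as WHAM requires.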

### 2.3. Molecular Mechanics-Generalized Born and Surface Area Continuum Solvation (MM-GBSA) Calculation

The per residue energy contribution to the interaction energy between monomers of S100A1 was calculated via MMGBSA.

$$
\Delta G = \langle G_{\text{dimer}} \rangle - \langle G_{\text{monomer}} \rangle \tag{1}
$$

where ⟨G<sub>dimer</sub>⟩ and ⟨G<sub>monomer</sub>⟩ are the ensemble-averaged MM-GBSA free energies of the S100A1 dimer and monomer, respectively. The calculations were performed on a subset of the MD trajectory extracted at 2 ns intervals. The generated subtrajectories were used as input to MMPBSA.py in Amber16 to calculate the free energies of each part. The salt concentration was set to 0.15 M and the generalized Born model was selected with igb = 5. No quasi-harmonic entropy approximation was made during the calculation. The total ΔG was decomposed into per-residue contributions by setting idecomp = 2 during the calculation. We excluded the internal energies (bonded terms) from the final results, as these energies are nearly identical in the dimer and monomer. Thus, our final per-residue energy contribution contains three energy terms: electrostatic interactions (E<sub>EEL</sub>), van der Waals interactions (E<sub>vdW</sub>), and solvation energy (E<sub>solv</sub>).
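
As a minimal sketch of the bookkeeping in Equation (1), the Python snippet below averages per-frame energies and sums the three retained per-residue terms. The function names and all numerical values are hypothetical placeholders for illustration; they are not outputs of MMPBSA.py.

```python
import numpy as np

def mmgbsa_delta_g(g_dimer_frames, g_monomer_frames):
    """Equation (1): difference of ensemble-averaged free energies (kcal/mol)."""
    return float(np.mean(g_dimer_frames) - np.mean(g_monomer_frames))

def per_residue_total(e_eel, e_vdw, e_solv):
    """Sum of the three retained per-residue terms (bonded terms excluded)."""
    return np.asarray(e_eel) + np.asarray(e_vdw) + np.asarray(e_solv)

# Hypothetical per-frame energies (kcal/mol), for illustration only
g_dimer = [-105.0, -103.0, -104.0]
g_monomer = [-50.0, -52.0, -51.0]
print(mmgbsa_delta_g(g_dimer, g_monomer))
```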

## 3. RESULTS AND DISCUSSION

### 3.1. S100A1 Structure and Dynamics

The relatively weak Ca<sup>2+</sup> affinity of the S100A1 pEF hand (K<sub>d</sub> ∼250–16,700 µM; Goch et al., 2005; Wright et al., 2005) suggests that only the protein's cEF site is likely occupied by Ca<sup>2+</sup> at physiological intracellular Ca<sup>2+</sup> levels (∼0.1 µM; Clapham, 2007). A previous computational study of S100A1 utilized the WT structure with cysteine at position 85 (C85) (Scott and Kekenes-Huskey, 2016). That study suggested that half-saturated S100A1 tends to assume a semi-closed state that would preclude target binding. Since post-translational modifications (PTMs) at site C85 on helix 4 (H4) of S100A1 have been shown to increase its Ca<sup>2+</sup> affinity at both the cEF and pEF hands, with the latter exhibiting a four orders of magnitude increase (Goch et al., 2005), we sought to determine potential mechanisms for those PTM-induced changes.

In this study, we introduced glutamic acid (E) or arginine (R) at site C85 to approximate the effects of post-translational modifications, including glutathionylation, that modulate Ca<sup>2+</sup> affinity (Goch et al., 2005). We performed extensive all-atom molecular dynamics (MD) simulations on the WT and C85E/R S100A1 variants in the apo, half-saturated (only the cEF hand has Ca<sup>2+</sup> bound), and fully-saturated (both the cEF and pEF hands have Ca<sup>2+</sup> bound) states. In the present work, we use the terms "fully-saturated" and "holo" interchangeably. We also performed two extra simulations of S100A1, one in the apo state with W90 mutated to alanine (W90A) and one in the fully-saturated state with the target peptide bound, to probe the interaction between H4 and the remaining helices. All simulations, as well as the starting structures and the accumulated simulation lengths, are summarized in **Table S1**.

To determine whether glutamic acid (E) or arginine (R) sufficiently mimicked known PTMs at C85 of S100A1, we compared the MD-sampled C85E/R apo/holo structures with the PTM S100A1 structures deposited in the Protein Data Bank. Namely, structures have been reported for the WT (PDB 2L0P-apo, 2LP3-holo) and variants with PTMs at Cys85 that increase Ca<sup>2+</sup> affinity. The variants include C85-mercaptoethanol (PDB 2JPT; Zhukov et al., 2008), C85-S-nitrosylation (PDB 2LLT; Živković et al., 2012), and C85M (PDB 2LLS) in the apo state and C85-cysteine (PDB 2LP2; Nowakowski et al., 2013) in the holo state. As shown in **Figure 2A**, in the apo state, the MD-sampled C85E/R structures both have moderate structural deviations from β-mercaptoethanol-modified S100A1, as the RMSDs are around 4.0 Å. Interestingly, most of the structural difference in the C85E/R variants stems from the displacement of the H3 helix away from the H4 helix relative to the PTM structure. This suggests that the C85E/R variants tend to sample a slightly more open conformation. Additionally, the C85E/R variants present similar inter-subunit contacts to PTM S100A1, as their H1/H4 helices overlap closely with those of PTM S100A1. Next, we compared the MD-sampled C85E/R variants against homocystine-modified S100A1 in the holo state (**Figure 2B**). Similar to the apo state comparison, although the C85E/R variants have moderate structural deviations (RMSD values are 4.1 and 4.8 Å, respectively), the differences are primarily due to the H2-H3 linker and the H3 helix. Specifically, in the C85E/R variants, the H2-H3 linkers are less folded and are slightly more displaced from the H4 helix. Although the two C85 variants have structural differences, the Ca<sup>2+</sup> coordination patterns at the EF hands were identical to that of the homocystine-modified S100A1 structure (**Figure S1**).
As both the C85E/R variants and the PTM structures have similar inter-subunit contacts in the apo and holo states, the overall structural stability of S100A1 in the C85E/R variants and the PTM structures is comparable. These data therefore suggest that the site-directed variants we considered are reasonable approximations of the chemically modified proteins reported in the Protein Data Bank. We report RMSF values before and after Ca<sup>2+</sup> binding in **section S3.1** (**Supplementary Material**), which are consistent with previous studies.

### 3.2. Hydrophobic Pocket Opening Indicated by H3/H4 Inter-helix Angle

The fundamental physiological role of S100A1 is to bind downstream targets after chelating Ca<sup>2+</sup> ions. Ca<sup>2+</sup>-saturated S100A1 presents a hydrophobic patch between H3 and H4 that engages in target binding, similar to other Ca<sup>2+</sup>-binding proteins like calmodulin and troponin C. We investigated the ability of the C85 variants to maintain an open conformation of the hydrophobic patch that binds regulatory targets. In Scott and Kekenes-Huskey (2016) we utilized principal components analysis (PCA) to characterize the predominant conformational motions that distinguish the S100A1 apo from holo states. The largest mode was referred to as Principal Component 1 in that study and correlated with the opening and closing of the peptide-binding pocket formed between H3 and H4. We therefore report here the angle between H3 and H4 as an indicator of pocket opening in **Figure 3** for the WT and C85 mutants.

The average angle values are shown in **Figure 3B**. For the WT structures, the apo state is generally closed and the half-/fully-saturated states stay open, as the average angle values are close to the values measured from the apo (27.3°) and holo (54.3°) NMR structures, respectively. We also note that peptide binding in the fully-saturated state has a negligible effect on pocket opening, as indicated by the comparable angle values in the WT peptide-bound case and in the WT holo case. C85R is slightly more open than WT in the apo state, as both chains have ∼7° larger angles than WT, while C85E has angle values comparable to WT. C85R when half-saturated is asymmetric, with one chain more open and the other more closed. C85E has angle values comparable to WT in the half-saturated state. All cases maintain the open pocket in the fully-saturated state. As shown in **Figure 3C**, we selected the most probable simulated structures of the C85 variants (see **section S1.1** in **Supplementary Material** for details of determining these structures) and compared them with the NMR structure of S100A1 (PDB 2KBM) in which a 12-residue TRTK12 peptide was bound at the hydrophobic patch. In both the half- and fully-saturated states, the two C85 variants are able to accommodate the TRTK12 peptide, as they have no structural clashes with it. In other words, the variants can bind targets despite having only one bound Ca<sup>2+</sup>. While the two C85 variants have degrees of openness comparable to that of the WT structure, C85R in the half-saturated state has a more open hydrophobic patch relative to WT. We attribute this in part to the onset of H3/H4 closing reported for the WT half-saturated case in Scott and Kekenes-Huskey (2016). Overall, our simulations indicate that the two C85 variants were as good as, if not better than, WT at assuming a peptide-compatible configuration, which thereby could facilitate peptide binding. We quantify this facilitation in section 3.4 via potential of mean force calculations.
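
For readers who wish to reproduce an H3/H4 inter-helix angle, the following Python sketch shows one possible definition. It assumes, since the text does not specify the procedure, that each helix axis is taken as the first principal axis of its Cα coordinates; all function names are hypothetical, and the synthetic helices exist only to exercise the code.

```python
import numpy as np

def helix_axis(ca_coords):
    """Unit vector along a helix: first principal axis of its Cα atoms (an assumption)."""
    coords = np.asarray(ca_coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered)
    axis = vt[0]
    # Orient the axis from the first residue toward the last
    if np.dot(axis, coords[-1] - coords[0]) < 0:
        axis = -axis
    return axis

def interhelix_angle(ca_a, ca_b):
    """Angle in degrees between the axes of two helices."""
    cos_theta = np.clip(np.dot(helix_axis(ca_a), helix_axis(ca_b)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def ideal_helix(n_res=12, radius=2.3, rise=1.5, turn_deg=100.0):
    """Synthetic α-helix Cα trace along z (illustrative test data only)."""
    i = np.arange(n_res)
    phi = np.radians(turn_deg) * i
    return np.column_stack([radius * np.cos(phi), radius * np.sin(phi), rise * i])

def rotate_x(coords, angle_deg):
    """Rotate coordinates about the x axis by angle_deg."""
    t = np.radians(angle_deg)
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, np.cos(t), -np.sin(t)],
                    [0.0, np.sin(t), np.cos(t)]])
    return coords @ rot.T

h3 = ideal_helix()
h4 = rotate_x(h3, 54.3)  # tilt a copy by the holo NMR angle quoted in the text
print(interhelix_angle(h3, h4))
```

In practice the Cα coordinates would come from each trajectory frame, and the angle would be averaged per chain.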

### 3.3. H4 Terminal Helicity

The C-terminal region of H4 (residues C85 to S93) plays a vital role in the Ca<sup>2+</sup>-dependent activation of S100A1 targets (Landar, 1998), as experiments show that either the deletion of this terminal region or mutation of three aromatic amino acids in this region to alanine diminishes Ca<sup>2+</sup>-dependent activation of S100A1 targets. Further, the highly divergent C-terminal region of H4 among S100 family proteins has been suggested to account for the selectivity of target binding (Santamaria-Kisiel et al., 2006). For example, the three-fold difference in TRTK12 affinity between S100A1 and S100B was attributed in part to the differing residues in the H4 C-terminus (Wright et al., 2009a). Specifically, TRTK12 bound to S100A1 assumes a different orientation than when bound to S100B because of the different residues in the H4 C-terminus, resulting in less optimized hydrophobic interactions between S100A1 and the TRTK12 peptide.

Experimental data indicate that in the apo state, residues N87 to W90 in H4 are in equilibrium between helix and random coil configurations (Nowakowski et al., 2011, 2013). Upon Ca<sup>2+</sup> binding, H4 adopts a complete helix. This H4 helix extension in the holo state is thought to predispose the hydrophobic residues (i.e., C85 and F88) to interact with hydrophobic residues of the target peptide (Wright et al., 2009a); however, to our knowledge, its impact on Ca<sup>2+</sup> binding has not been investigated. Since H4 in S100A1 undergoes appreciable rearrangements upon Ca<sup>2+</sup> binding, we speculated that such interactions might counter the free energy gain upon binding Ca<sup>2+</sup>, which would reduce its apparent affinity relative to a system lacking H4 self-interactions. To investigate whether our C85 mutations similarly affect the α-helicity in the H4 C-terminus, we measured the α-helix probability of residues C85 to S93 in H4 (**Figure 4**). In the apo state, WT S100A1 residues F88 to N92 have considerably smaller α-helix probabilities than in the holo state. The reduced H4 helical content in the apo state may be caused by two contacts that hinder α-helix formation: (1) contact between the H4 C-terminus and the pEF loop from the other subunit (**Figure S8**) and (2) contact between the H4 C-terminus and the H2-H3 linker (**Figure S5**). These two contacts are attenuated in the holo state. We therefore mutated W90 to A90 in the apo state to disrupt the first contact and thereby permit the C-terminus of H4 to adopt a folded α-helix. However, the results show that W90A has α-helix probabilities comparable to the WT, implying that the first contact does not affect helix formation in the H4 C-terminus. Thus, it is likely that the contacts between H4 and the H2-H3 linker in the apo state hinder helix formation in the H4 C-terminus. Indeed, we show in **Figure S6** that in WT apo S100A1, F44 from the H2-H3 linker region maintains contacts with L81 and F88 from the H4 C-terminal region, as F44 was sandwiched by the two hydrophobic residues.
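
A per-residue α-helix probability of the kind plotted in **Figure 4** can be sketched as the fraction of frames in which a residue is assigned as helical. The Python example below assumes DSSP-style one-letter codes with "H" for α-helix; the text does not name the assignment tool, and the function name and the four example frames are hypothetical.

```python
def helix_probability(ss_frames, helix_codes=("H",)):
    """Per-residue fraction of frames in which the residue is assigned as α-helix."""
    if not ss_frames:
        raise ValueError("no frames supplied")
    n_res = len(ss_frames[0])
    counts = [0] * n_res
    for frame in ss_frames:
        if len(frame) != n_res:
            raise ValueError("all frames must cover the same residues")
        for idx, code in enumerate(frame):
            if code in helix_codes:
                counts[idx] += 1
    return [c / len(ss_frames) for c in counts]

# Hypothetical assignments for the nine residues C85-S93 over four frames
frames = ["HHHHHCCCC", "HHHHCCCCC", "HHHHHHHCC", "HHHCCCCCC"]
probs = helix_probability(frames)
print(probs)
```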

We had expected that our C85 variants would disrupt native H4 interactions and facilitate α-helix formation. For instance, in the β-mercaptoethanol-modified C85 apo-S100A1, NOE data show that residues C85 to F89 folded into an α-helix, possibly due to the hydrophobic interactions between the newly introduced β-mercaptoethanol and the aromatic residues F88/F89. However, we found that both the C85E and C85R variants have reduced α-helix probability in the apo state, with C85E presenting a larger degree of reduction. We speculate that the charged side chains of C85E/R prevent the favorable hydrophobic interactions within the C-terminus that are needed for α-helix formation. We anticipate that the large desolvation energy penalties of these solvent-exposed charged residues [as indicated by the solvent-accessible surface area (SASA) in **Figure S4**] hinder the formation of the α-helix.

In the half-saturated WT and C85E configurations, the H4 C-terminus is unfolded to similar degrees relative to the holo state, although the latter variant features one partially-folded helix. We believe this unfolding stems from significant contacts between H4 and the H2-H3 linker in the half-saturated state (**Figure S5**). Meanwhile, H4 in the C85R half-saturated state exhibits folded content comparable to the holo state, which we attribute to reduced interactions between H4 and the H2-H3 linker. In the fully-saturated state, all cases maintain high helical content in the H4 C-terminal region. Further, binding of the target peptide has a negligible effect on helical content; this suggests that the free energy gain of helix formation likely occurs during ion binding and not thereafter. Lastly, it is interesting to note (see **Figure 4D**) that the apo state C85R mutant presents energetically-unfavorable interactions between its monomers that disrupt H4/linker interactions and could thereby facilitate H4 formation.

### 3.4. Thermodynamics of TRTK12 Peptide Binding to S100A1

To determine if the C85 variants thermodynamically facilitate target binding to S100A1 at physiological Ca<sup>2+</sup> concentrations, under which only the cEF hand has Ca<sup>2+</sup> bound, we performed potential of mean force (PMF) calculations to characterize the free energy profile of TRTK12 peptide dissociation from half-saturated S100A1. Specifically, starting from the MD-simulated most probable half-saturated structures that are compatible with target-peptide binding, a TRTK12 peptide was placed at the hydrophobic patch. After energy minimization, the TRTK12 peptide was pulled away along the reaction coordinate (RC), defined as the distance between the COM of the peptide Cα atoms and the COM of the Cα atoms in the H3 C-terminus (residues K30-T39) and H4 N-terminus (residues E73-A84). The PMFs along the dissociation process are shown in **Figure 5**. To validate the accuracy of the PMF calculations, we first compared the PMF of TRTK12 peptide dissociation from WT fully-saturated S100A1 (dashed line in **Figure 5A**) with experiment. The experimental and calculated binding free energies were ΔG<sub>expt.</sub> ≈ −6.5 (Wright et al., 2009a) and ΔG<sub>calc.</sub> = −9.5 kcal/mol, respectively, which are in reasonable qualitative agreement.

For all half-saturated C85 variants, we found that the TRTK12 peptide exhibits a minimum in the free energy profile at RC ≈ 14 Å, similar to the WT. However, the binding free energies are considerably more favorable. In the half-saturated state, the two C85 variants have binding free energy values of −12.3 and −13.5 kcal/mol vs. −7.6 kcal/mol for the WT. This thermodynamic advantage in the C85 variants is likely due to stronger hydrophobic interactions between S100A1 and TRTK12 than in the WT. This is evidenced by the hydrogen bonding and contact map analyses in **Figure S12**. In the bound state, the numbers of hydrogen bonds between the TRTK12 peptide and S100A1 are comparable for the WT and C85 variants. However, the contacts between TRTK12 and the H2-H3 linker in the C85 variants outnumbered those of the WT. These data suggest that introducing glutamic acid or arginine at C85 increases peptide binding affinity. We additionally investigated the gating kinetics of the H3/H4 patch and found no significant difference between the WT and the C85 variants (see **section S3.2** in the **Supplementary Material**).

### 3.5. Relating S100A1's C85 Modifications to Physiological Function: Combining S100A1-Mediated Actin Passive Model With PMF Calculations

To exemplify the potential impact of improved peptide binding on S100A1's physiological function, we relate these functions to its capacity to bind the PEVK repeats in titin in the N2B isoform (Granzier et al., 2010) as a model system for S100A1 target regulation. The elastic PEVK domain of titin contains 70% proline, glutamate, valine, and lysine residues and exists in three conformational states: polyproline II (PPII) helix, β-turn, and random coil (Labeit and Kolmerer, 1995; Ma and Wang, 2003). The PEVK domain consists of a repeating motif of 28 residues on average with no long-range cooperativity between motifs (Gutierrez-Cruz et al., 2001). The extension of PEVK is believed to contribute to titin's elasticity. A competition assay demonstrated that isolated PEVK fragments washed into skinned myocyte preparations reduced passive tension over physiological sarcomere lengths (Yamasaki et al., 2011). Further, S100A1 was shown to reduce F-actin-bound I27-PEVK-I84 in a dose-dependent manner, with higher rates of reduction under conditions of elevated (0.1 mM) Ca<sup>2+</sup> (Yamasaki et al., 2011).

To relate mutation-induced changes in peptide binding to the potential regulation of a target, we propose a competitive S100A1/actin-binding scheme in **Figure 6A**. The scheme includes an equilibrium between actin and titin's PEVK segment, with S100A1 also able to interact with the PEVK segment. The former actin-titin interaction is proposed to delay filament sliding, while the presence of S100A1 disrupts the actin-titin interaction and modulates muscle contraction. This model is mechanistically consistent with trends reported for S100A1-dependent reductions of actin/PEVK (AP) binding at elevated Ca<sup>2+</sup> (Yamasaki et al., 2011). Namely, AP binding was shown to decrease as S100A1 increased, with greater efficacy demonstrated at 0.1 mM Ca<sup>2+</sup>. The governing equations are listed in **section S1.3** in **Supplementary Material**.

We first fitted the model to experimental data from Yamasaki et al. (2011) to obtain the dissociation constants of S100A1 to the actin/PEVK (AP) complex with and without Ca<sup>2+</sup>. As shown in **Figure 6A**, in the absence and presence of 0.1 mM Ca<sup>2+</sup>, the fitted dissociation constants of apo-S100A1, half-saturated S100A1, and fully-saturated S100A1 to AP are K<sub>d1</sub> = 0.52, K<sub>d2</sub> = 0.13, and K<sub>d2</sub>′ = 0.03 µM, respectively. Surprisingly, the binding free energies corresponding to K<sub>d2</sub> and K<sub>d2</sub>′ are −9.40 and −10.27 kcal/mol, respectively. These two values are close to the calculated binding free energy of −9.50 kcal/mol for the TRTK12 peptide binding to WT fully-saturated S100A1 (**Figure 5**). This agreement indicates that it is reasonable to use the TRTK12 peptide as a mimic of the PEVK fragment to study the binding affinity of PEVK to WT/mutant S100A1 systems. We anticipate that the similar binding arises from PEVK generally showing highly amphoteric charge distributions and modestly positive net charges (Forbes et al., 2005), similar to TRTK12 (Wafer et al., 2013).
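
The conversion from the fitted dissociation constants to binding free energies can be checked with ΔG = RT ln K<sub>d</sub>. The short Python sketch below assumes a 1 M standard state and a temperature of 298.15 K; the temperature is not stated in the text and was chosen because it approximately reproduces the quoted values. The function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def kd_to_delta_g(kd_molar, temperature=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L.

    ASSUMES a 1 M standard state; the default temperature is also an assumption.
    """
    return R_KCAL * temperature * math.log(kd_molar)

dg_half = kd_to_delta_g(0.13e-6)  # K_d2, half-saturated S100A1
dg_full = kd_to_delta_g(0.03e-6)  # K_d2', fully-saturated S100A1
print(round(dg_half, 2), round(dg_full, 2))
```

Under these assumptions the results fall within about 0.01 kcal/mol of the −9.40 and −10.27 kcal/mol reported above.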

In general, we have found that end-point methods for computing free energy differences, as well as more rigorous approaches including potentials of mean force, seem to fare well in rank-ordering cases by energy. However, for a variety of reasons, including but not limited to force field inaccuracies, difficulties in estimating entropic contributions, and finite sampling of protein configurations during limited molecular dynamics simulation times, we find that the simulation approaches for the systems we have considered are unable to accurately predict the absolute energy differences between cases that have been experimentally characterized. Hence, to map our PMF results (calculated K<sub>D</sub> values) to the experimentally-measured K<sub>D</sub> values, we introduce a scaling parameter λ. The value of λ was calculated by aligning the experimental and calculated (via PMF) dissociation constants for TRTK12 peptide binding to WT fully-saturated S100A1:

$$K_{D,\text{expt.}} = \lambda K_{D,\text{calc.}} \tag{2}$$

$$
\lambda = e^{(\Delta G_{\text{expt.}} - \Delta G_{\text{calc.}})/RT} = 157.436 \tag{3}
$$

where we used ΔG<sub>expt.</sub> = −6.5 and ΔG<sub>calc.</sub> = −9.5 kcal/mol, respectively. We then use the average of C85E and C85R as the binding free energy of PEVK to S100A1 after C85-glutathionylation (ΔG<sub>glu.</sub> = −12.9 kcal/mol). Compared to the WT, the dissociation constant of S100A1 with C85 glutathionylation would thus be reduced as:

$$K_{d,\text{glu.}} = \lambda K_d e^{(\Delta G_{\text{glu.}} - \Delta G_{\text{WT}})/RT} = 0.021 K_d \tag{4}$$

where ΔG<sub>WT</sub> = −7.6 kcal/mol, λ is defined in Equation (2), and K<sub>d</sub> refers to K<sub>d2</sub> and K<sub>d2</sub>′ in **Figure 6B**. When used in the scheme of **Figure 6A**, the C85-glutathionylated S100A1 protein more rapidly reduces the PEVK-actin interaction at increasing, physiological Ca<sup>2+</sup> concentrations, as shown in **Figure 6C**.
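
The arithmetic behind Equations (3) and (4) can be reproduced with a few lines of Python. This is a sketch under the assumption that T = 298.15 K (the text does not state the temperature used for RT), so the scaling factor differs slightly in the last digits from the 157.436 quoted above; the function names are hypothetical.

```python
import math

RT = 1.987204e-3 * 298.15  # kcal/mol; the temperature is an assumption

def scaling_lambda(dg_expt, dg_calc, rt=RT):
    """Equation (3): factor mapping calculated to experimental K_D."""
    return math.exp((dg_expt - dg_calc) / rt)

def kd_ratio_glu(dg_glu, dg_wt, lam, rt=RT):
    """Equation (4): K_d,glu. / K_d for the C85-glutathionylation surrogate."""
    return lam * math.exp((dg_glu - dg_wt) / rt)

lam = scaling_lambda(-6.5, -9.5)         # experimental vs. PMF value, WT holo
ratio = kd_ratio_glu(-12.9, -7.6, lam)   # mean of C85E/R vs. WT half-saturated
print(round(lam, 1), round(ratio, 3))
```

With these inputs the ratio rounds to 0.021, matching Equation (4).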

## 4. CONCLUSIONS

Previous studies established that post-translational modifications of C85 in H4 of S100A1 increase S100A1's Ca<sup>2+</sup> sensitivity of activation (Zhukov et al., 2008; Živković et al., 2012; Nowakowski et al., 2013). By using computational methods in this study, including molecular dynamics simulations and thermodynamic models of binding, we determined potential mechanisms governing how C85-modified S100A1 can bind Ca<sup>2+</sup>-dependent targets at sub-saturating Ca<sup>2+</sup>. Specifically, we used two variants (C85E/R) that have bulky side chains as steric surrogates of post-translational modifications at the C85 position in S100A1. Our data show that the C85E/R variants have structural effects similar to those of post-translational modifications in displacing the C-terminus of H3 from H4 in the apo state of S100A1. We further focused on structures bound with one equivalent of Ca<sup>2+</sup> (half-saturated) in the cEF domain, which are likely to predominate (Scott and Kekenes-Huskey, 2016) at the physiological Ca<sup>2+</sup> concentrations (100 to 1000 nM) found within cells (Berridge et al., 2000). We found for the C85E/R variants relative to the WT that (1) the mutations disrupt the half-saturated structures by increasing the solvent exposure of the target binding domain (the hydrophobic patch) found between H3 and H4, (2) the half-saturated variants bind TRTK12 more strongly than the WT, and (3) for the half-saturated configuration, the H4 C-terminus in the two variants has greater α-helical character than in the WT, consistent with the levels exhibited in the target-bound configuration. Ultimately, these findings suggest that cysteine-targeted post-translational modifications prime S100A1 for target regulation within physiological ranges of Ca<sup>2+</sup>. Importantly, the computational studies further support the notion that S100A1 toggles its Ca<sup>2+</sup>-dependent regulation of downstream targets in response to Cys modification, as is common in redox pathways such as those using glutathionylation (Zhukova et al., 2004).

A variety of studies implicate S100A1 in regulating proteins that mediate Ca<sup>2+</sup> signaling or alter their mechanical properties in response to Ca<sup>2+</sup>. S100A1's regulatory roles are most apparent in the heart, in which the protein is predominantly expressed (Kato and Kimura, 1985). In cardiac tissue, S100A1 has inotropic effects on Ca<sup>2+</sup> handling, that is, it helps increase the generation of contractile force (Kraus et al., 2009). This is accomplished through priming sarcoplasmic reticulum (SR) Ca<sup>2+</sup> concentration and release (Kettlewell et al., 2005), through interactions with targets including SERCA, RyR, the L-type Ca<sup>2+</sup> channel, and the sodium calcium exchanger (Rohde et al., 2010; Völkers et al., 2010; Duarte-Costa et al., 2014). While S100A1 appears to dually regulate RyR at both diastolic (∼100 nM; Berridge et al., 2000) and systolic (>1 µM; Berridge et al., 2000; Yamaguchi et al., 2011) Ca<sup>2+</sup> levels, the WT S100A1 likely acts on its Ca<sup>2+</sup>-dependent targets only at saturating Ca<sup>2+</sup> conditions (Nowakowski et al., 2013), under which both EF hands of the protein are bound with Ca<sup>2+</sup>. While it has been speculated that S100A1 could modulate target proteins at systolic Ca<sup>2+</sup> levels, such as in the case of S100A1/titin interactions facilitating myocyte contraction (Granzier et al., 2010; Yamasaki et al., 2011), we would expect that the low binding affinity of its pseudo EF hand (K<sub>D</sub> ∼250–16,700 µM) and its reduced ability to maintain an open, target-peptide compatible binding site (Scott and Kekenes-Huskey, 2016) would not be sufficient for significant regulation of the intended targets. This raises the question of how S100A1 modulates its targets in vivo, where cytosolic Ca<sup>2+</sup> concentrations are generally far below the protein's K<sub>D</sub> values for Ca<sup>2+</sup>.

Post-translational modifications of S100A1 likely explain this enigma. Intriguingly, Cys85 is a redox-sensitive residue presenting a variety of oxidizing functional groups (Nowakowski et al., 2013). Previous studies have indicated that S100A1 species modified with mercaptoethanol and glutathione generally increased the apparent Ca<sup>2+</sup> affinity relative to the WT by up to four orders of magnitude (Goch et al., 2005). The most apparent rationale for the enhanced Ca<sup>2+</sup> affinity from our simulations was that the mutations we considered disrupt the apo state H3/H4 folding, namely by compromising H4 interactions within a given monomer and with its opposing monomer in the dimeric state. Since those interactions are dramatically reduced in the Ca<sup>2+</sup>-bound state relative to the apo state, we speculate that their weakening in the apo state reduces the thermodynamic penalty they could impose upon Ca<sup>2+</sup> binding observed for the WT. Nonetheless, the half-saturated S100A1 state appeared to demonstrate thermodynamically favorable, albeit weaker, TRTK12 peptide binding, which suggests that the protein may have a modest ability to bind targets in its wild-type form. We therefore propose that post-translational modification of the H3/H4 interface may constitute a general mechanism for controlling Ca<sup>2+</sup>-dependent activation of protein/protein interactions in S100 families, given the prevalence of H3/H4 binding patches featured in protein-protein interactions (Zimmer et al., 2003).

If S100A1 demonstrates increased activity following post-translational modifications, when would such modifications be expected in vivo? It is reasonable to assume that S100A1 oxidation would be most significant during conditions of enhanced reactive oxygen species (ROS) signaling. ROS are particularly prevalent during metabolic stress, ischemic-reperfusion, and physiological ROS-based signaling (Jeong et al., 2012). In fact, glutathionylation of protein targets including S100A1 is a vital component of cardiovascular ROS signaling (Pastore and Piemonte, 2013). Physiological conditions including exercise, for instance, demonstrate significant cardiac RyR glutathionylation that enhances its activity to compensate for increased demand (Sánchez et al., 2008). Analogous modifications of the SERCA Ca<sup>2+</sup> pump promote enhanced Ca<sup>2+</sup> uptake and smooth muscle relaxation (Pastore and Piemonte, 2013). It is intriguing that reducing agents mitigate these effects (Volkers, unpublished from Völkers et al., 2010), which is expected if glutathionylations are prevalent. Since both RyR and SERCA are S100A1-dependent targets and S100A1 itself is subject to glutathionylation, this suggests that redox regulation of inotropy may be controlled both directly and indirectly by glutathione modifications. Hence, in physiological systems, ROS signaling, especially as mediated by glutathionylation, might prime inotropic effects relative to basal or reduced conditions (Nikolaienko et al., 2018) by augmenting S100A1 stimulation of its targets.

### 4.1. Limitations

Our study includes several limitations of note. First, in this study we have used glutamic acid or arginine substitutions to probe how introducing larger, polar amino acids into the redox-sensitive C85 site impacts S100A1 function and peptide binding. Our choice of these variants was based on observations in several S100A1 structures with either redox modifications or an amino acid substitution (C85M) that have been deposited in the Protein Data Bank. These structures exhibited more open-like character than the WT, which suggests that the opening behavior may be more sensitive to the size of the introduced side group than to its specific chemical properties. Nonetheless, simulations that include the specific functional group in question, such as the glutathione group investigated in Goch et al. (2005), are likely to provide finer detail on the mechanism of its effect on S100A1. We found that substitution of glutamic acid or arginine at the C85 position as a steric surrogate of post-translational modifications yielded half-saturated S100A1 structures that more closely overlapped with the peptide-bound (TRTK12) protein.

Second, we also note that to map the PMF-predicted relative energy differences between S100A1 states to their experimentally-measured absolute differences, we introduced a scaling parameter λ. In our experience, computational free energy methods perform reasonably well at rank-ordering system configurations according to their experimentally-reported values, but predictions of absolute differences in energy have been less successful. We anticipate that further improvements in force field parameterizations and sampling techniques could better align the relative energy predictions with absolute differences and thus obviate the scaling term we used here. Additionally, although the S100A1 open state appears to be necessary to bind target peptides as part of its regulatory function, it may be of interest to examine whether the half-saturated variants we considered significantly sample states resembling the apo (closed) configuration, as we demonstrated for the WT in Scott and Kekenes-Huskey (2016). This could be accomplished using the biased sampling technique described in that study, as well as other enhanced sampling techniques including accelerated MD (Hamelberg et al., 2004).

Lastly, in order to quantitatively link the effects of potential post-translational modifications of S100A1 to a physiological process, we examined its putative binding to the titin PEVK fragments discussed in Yamasaki et al. (2011). In that study, a competition assay was used as a proxy for measuring the protein's impact on passive tension in muscle fibrils (Yamasaki et al., 2011). Passive tension is described as the force generated when muscle cells are stretched beyond their resting length, independent of Ca<sup>2+</sup>. Actin/titin interactions have been suggested as an important mechanism for controlling myofilament passive tension (Granato et al., 2010). Competitive binding assays conducted by Yamasaki et al. (2011) demonstrated that S100A1 interferes with actin/titin interactions by competitively binding the titin PEVK domain. However, it is important to note that myriad factors contribute to passive tension, including tubulin, collagen, and nebulin/PEVK interactions (Gutierrez-Cruz et al., 2001), and that changes in passive tension can even be recapitulated without changes in titin stiffness by modulating bound myosin/actin populations (Campbell, 2009). Further, since titin/actin-dependent effects are more evident for the N2B isoform, while the N2BA isoform tends to predominate in humans (Granzier et al., 2010), the significance of S100A1 in modulating titin varies across species.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/**Supplementary Material**.

### AUTHOR CONTRIBUTIONS

BS and PK-H designed the simulations and wrote the manuscript. BS performed the simulations and analyzed the data. All authors contributed to the article and approved the submitted version.

### ACKNOWLEDGMENTS

Research reported in this publication was supported by the Maximizing Investigators' Research Award (MIRA) (R35) from the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under grant number R35GM124977. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by the National Science Foundation under grant ACI-1548562.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmolb.2020.00077/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Sun and Kekenes-Huskey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.