Structural Transition States Explored With Minimalist Coarse Grained Models: Applications to Calmodulin

Transitions between different conformational states are ubiquitous in proteins, being involved in signaling, catalysis, and other fundamental activities in cells. However, modeling those processes is extremely difficult, due to the need of efficiently exploring a vast conformational space in order to seek for the actual transition path for systems whose complexity is already high in the stable states. Here we report a strategy that simplifies this task attacking the complexity on several sides. We first apply a minimalist coarse-grained model to Calmodulin, based on an empirical force field with a partial structural bias, to explore the transition paths between the apo-closed state and the Ca-bound open state of the protein. We then select representative structures along the trajectory based on a structural clustering algorithm and build a cleaned-up trajectory with them. We finally compare this trajectory with that produced by the online tool MinActionPath, by minimizing the action integral using a harmonic network model, and with that obtained by the PROMPT morphing method, based on an optimal mass transportation-type approach including physical constraints. The comparison is performed both on the structural and energetic level, using the coarse-grained and the atomistic force fields upon reconstruction. Our analysis indicates that this method returns trajectories capable of exploring intermediate states with physical meaning, retaining a very low computational cost, which can allow systematic and extensive exploration of the multi-stable proteins transition pathways.


INTRODUCTION
Signaling is a core activity in cells. Most of the signaling processes are regulated by bi-(or multi-) stable proteins, which can undergo conformational transitions in response to changes in environmental conditions or stimuli of different origin (Grant et al., 2010). This class includes among others, G-proteins coupled receptors (Weis and Kobilka, 2008) such as Rhodopsins (Tavanti and Tozzini, 2014) and other transducers, e.g., Calmodulin (Wenfei et al., 2014), and a vast number of enzymes undergoing conformational changes during their activity, such as the HIV-1 protease (Tozzini et al., 2007). The structural variations are usually quite large, therefore atomistic molecular dynamics (MD) simulations might not be the most proper method to address them, because the slow transition kinetics requires simulations exceeding the currently reachable time and space scales. In addition, the atomistic representation with standard force fields (FF) is not warranty of accuracy for the strongly distorted and out of equilibrium transition states (Best and Hummer, 2009).
Strategies to overcome these difficulties involve different actions. On one side, adopting simplified low-resolution descriptions of the system such as coarse-grained (CG) models (Tozzini, 2005) reduces the computational cost and allows performing more efficient sampling of the conformational space. This advantage comes at the cost of increasing the empirical content of the FF, and consequently reducing predictive power and transferability. A compromise between accuracy and predictive power (Tozzini, 2010) is reached by including some a priori knowledge of the system, in different forms, such as, e.g., a (partial) bias (Tozzini and McCammon, 2005;Spampinato et al., 2014) toward reference structures. This appears a reasonable compromise especially in the case of the search of the path between two given structure, when the system must in any case be forced to have them as stable states.
On the other side, one can act by simplifying the sampling algorithm, e.g., using morphing related methods (Weiss and Levitt, 2009;Koshevoy et al., 2014;Tamazian et al., 2015) without relying on any specific FF. In particular PROMPT (Koshevoy et al., 2014;Tamazian et al., 2015) employs an approach based on the optimal mass transportation problem including physical constraints of geometric nature (Evans and Gangbo, 1999). Methods based on the action minimization of simplified FFs, such as MinActionPath (Franklin et al., 2007), can be thought as located between the two approaches. The combination of the different sampling methods with the different representations of the systems and its interaction has given rise in the last decades to a huge number of approaches, which has also posed the problems of their comparison and assessment (Seyler et al., 2015).
In this work, we first apply a minimalist CG model for proteins to the test case of Calmodulin, chosen because of its large conformational transition upon calcium binding. We perform molecular dynamics simulations in different conditions to sample the transition path. We then compare these results with those of the simplified path sampling methods.

The Coarse Grained Model
The coarse graining procedure we consider in this work is schematized in Figures 1A,B, reporting the atomistic representation of a protein chain and the minimalist CG (MCG) representation in which only the Cα atoms are present. The choice of Cα as the representative atom of the amino-acid bead allows uniquely representing the secondary structure by the internal variables α, θ (Tozzini et al., 2006). The interactions are described by an empirical FF, derived from an energy potential U with a form similar to the atomistic ones, separated in bonded and non-bonded interactions being the bond distances, angles, and dihedrals describing the local geometry of connected beads and r ij distances between non-bonded ones (see Figure 1B). The functional forms (reported in Table 1) are somewhat more complex than those used in atomistic FFs: while u b i are holonomic restrains, the u θ i and u φ i take forms accounting for the anharmonicity of the CG interactions; in addition, the parameters are chosen to account for the different geometrical stiffness of the secondary structures, assigning different values to helices and sheets (see Table 1) 1 . The non-bonded interactions occur between couples not already involved in a bond, bond angle or dihedral interaction and are separated in local and nonlocal part i>j u nb r ij = i,j|r ij <r cut u loc r ij + i,j|r ij >r cut u nl r ij (2) both represented by a Morse potential, with the local term retaining a bias toward a reference structure (see Table 1).
In this work the local/non-local separation is based on a geometric criterion: all the non-bonded couples whose distance is less than r cut = 8.5 Å in the reference structure are considered local, the others are considered non-local. The cutoff value used here was previously shown to include all the relevant H-bonds and other possible specific interactions such as disulfide or salt bridges (Trovato and Tozzini, 2012). The parameters of the Morse potential, were optimized in our previous works including a dependence on r 0 (distance in the reference structure) in order to reproduce stronger interaction in the H-bonding range and weaker ones in the hydrophobic range (Di Fenza et al., 2009) (see Table 1). Since here we are not interested in the accurate simulation of the inter-protein interactions, the non-local part is represented by a generic amino-acid independent potential reproducing an average level of hydrophobicity (Table 1), instead than with a complex matrix of amino-acid dependent potentials (Trovato et al., 2013).  B, respectively), and at different temperatures. Simulations were performed with DL_POLY [vs. 4.08 (Bush et al., 2006;Todorov et al., 2006;Boateng and Todorov, 2015)] and the input was generated with proprietary software. In order to extract a transition path from the trajectory, we first define the parameter σ based on the root mean square deviation (RMSD A/B ) of a configuration r = {x i ,y i ,z i } from the reference structures r A/B [after alignment (Humphrey et al., 1996)

Simulation Setup and Transition Path Extraction
σ ranges between 0 (in A) and 1 (in B), is a rough measure of the transition advancement. Clearly, structures with the same σ (r) can have different conformations, with different distances from A and B, accounted for by RMSD A (r) and RMSD B (r) separately, since the calculation of σ in practice operates a projection of the 2-dimensional path in the RSMD A /RSMD B plane onto a line connecting A and B. Therefore, the scatter plot RMSD B vs. RMSD A will also be considered to have more specific information on the transition path. σ is used to compare the properties of structures with similar transition advancement from the three different methods.
In order to identify a limited number of relevant points along the trajectory, we applied the principal path (PP) clustering algorithm (Ferrarotti et al., 2018) to the MD trajectories and extracted reduced trajectories, which retain the salient properties of the original ones. The PP algorithm is a regularized version of the k-means clustering algorithm (Arthur and Vassilvitskii, 2007), based on the evaluation of a cost functional composed of two parts: the sum of the squared distances of each point from its respective representative structure, and the sum of the squared distances between adjacent representative structures. The relative weight of the two components-the regularization parameter s-is obtained by the Bayesian evidence maximization. The cost functional can be interpreted as an energy, thus the 2 Alignment is performed by means of the built-in extension "RMSD Trajectory Tool" of the graphics software VMD.
Bayesian posterior probability function is set proportional to the exponential of its negative. The result of the clustering is a "cleaned-up trajectory" of representative structures, used to evaluate σ and energy profiles.
Energies were evaluated both with the CG FFs and at the atomistic level. To this aim, the atomistic structures were rebuilt from the MCG models using Pulchra (Rotkiewicz and Skolnick, 2008) without any local optimization, then explicitly hydrated and locally optimized using the OPLSe (Harder et al., 2016) FF with explicit solvent and the Polak-Ribiere conjugate gradient algorithm (Polak and Ribiere, 1969) keeping the backbone frozen during the minimization. The calculations were performed with Schrodinger 2018-2, MacroModel (2019).

PROMPT and MAP Path Search
The PP clustering trajectories are compared with the trajectories obtained from other transition analysis methods. The method MinActionPath (Franklin et al., 2007) (MAP) employs differential equations, obtained by minimizing an action functional including a very simplified potential term representing the protein as a network of harmonic interactions (the elastic network model, ENM) (Tirion, 1996). The equilibrium distances are taken from the reference structures, making the ENM the simplest completely biased model. The solutions to the pair of differential equation are merged by requiring continuity between them. The final result is a single trajectory connecting the two states, reproducing the energy profiles of the mono-stable ENMs near A or B, and with a continuous crossover region.
On the other hand, PROMPT (Tamazian et al., 2015) [PRotein cOnformational Motion PredicTion 3 ] connects states A and B avoiding relations to any specific FF, by using only structural information. The protein is represented at the CG level and each protein conformation is handled as a set of internal coordinates. The transition path is first guessed e.g., using linear interpolation between extremal configurations r A and r B . The "admissible motions" are defined, as those preserving all the bond lengths b J i and other physical constraints related to Bond angle An illustration of the statistics-based parameterization procedure is also reported in the plots. bond and dihedral angles (i is the index running along the internal coordinate, and J labels the configuration along the path, from A to B). The path connecting A and B is therefore found by minimizing a kinetic only action integral within the space of admissible motions factorized by rigid roto-translations. The infinite-dimensional variational problem is addressed by discretizing the path between A and B and solved by means of the gradient descent method. The admissible motions are searched by changing the internal free variables of the systems, i.e. {θ J i , φ J i } in MCG model; θ J i is treated by interpolation when possible. The detailed description and formal comparison of the three method is reported elsewhere (Delfino et al., in preparation). Energies along MAP and PROMPT trajectories were compared using both atomistic (upon rebuilding and side chain optimization as already explained) and MCG FFs.

Molecular Dynamics of the Open-Closed Transition of Calmodulin
Calmodulin (Cam) displays two very different conformations (Wenfei et al., 2014), depending on the environmental calcium concentration. The two extremal structures of Cam, i.e., closed (A) and open (B) (see Figure 1C), correspond to the apo and Ca 2+ -bound state, respectively. Because these are, de facto, distinct proteins, having different ligands, it is conceptually correct to use two distinct FFs and to perform LD simulation started from A using FF B to reproduce the A→B transition occurring upon Ca 2+ binding, and, vice-versa, using FF A for the B→A inverse transition occurring upon Ca 2+ release. A few data are available for the difference in Gibbs free Frontiers in Molecular Biosciences | www.frontiersin.org energy between the folded and denatured proteins ranging between G A ∼1.5-3.5 (Masino et al., 2000;Rabl et al., 2002) kcal/mole for the A state and G B ∼4.5-6.5 kcal/mole for the B state (Masino et al., 2000). Energy alignment is not straightforward, however, one might assume the denatured state as reference, and infer that B state is more stable than A of about 2-4 kcal/mole. The A-B transition was simulated with LD, in both senses, at 300 K (RT) and at 130 K (complete simulation data in the Supplementary Material). Figures 2A,B reports the energies along the LD simulations. In both cases the transitions are clearly visible in the evolution of σ, passing from 0 to 1 (A→B, green) or from 1 to 0 (B→A, red), though they occur at different times, depending on the simulation parameters and on the FF. In particular, the closed to open transition (green) occurs earlier and more directly, while the inverse open to close transition appears to explore an intermediate conformation with σ∼0.4-0.5 for tens of ns before reaching the final state. This is better seen in the RMSD scatter plots reported in panels c and d: the intermediate state, located in the upper right off diagonal part of the plot, persists also after the clustering procedure (joined dots in Figures 2C,D) and is present at high and low temperature, although in the low one it is pushed toward the diagonal. It corresponds to a compact globular conformation, favored over the completely open one by hydrophobicity, but in which the specific contacts of the closed conformation are not formed (see the inset in Figures 2C,D, red structures). In this work Cam is used only as an example, therefore exploring in detail its transition is out of our scopes. However, we remark that the presence of such mis-folded transition intermediates was previously documented (Wenfei et al., 2014). The intermediate is not visible in the A→B simulations (green), in which the system passes rapidly to B, not even in the PROMPT and MAP trajectories, lying near the diagonal line joining A and B in the RMSD plot. These, additionally, display distorted conformations in the intermediate σ regions. An inspection to the structures with σ∼0.5 (reported in purple and cyan in Figure 2C) shows distortions in the central helix and too contracted terminal regions in the PROMPT structure, and broken chain in the MAP structure.

Data Clustering and Comparison With PROMPT and MAP
While MAP and PROMT return transition paths made of a few points, the MD simulations explore a large portion of the conformational space returning thousands of conformations. Therefore, in order to compare the methods, we first performed a post-processing and clean-up of the MD trajectories to select a limited number of representative states along it. This can be done in several ways. Figure 3A reports a simple averaging procedure: the structures are first ordered according to their σ value (red and green dotted/dashed lines), so that A→B transition is read from left to right and B→A from right to left. Once again, the formation of an intermediate cluster at σ = 0.4-0.5 is clearly visible in the B→A simulations, beside the large cluster of A type structures and of B type structures in the A→B simulations, respectively. The structures are then grouped according to their σ value in a given number of regular σ intervals; the average energy evaluated in each interval is reported in the plot, for the A→B (green) and B→A (red) simulations at 300 and 130 K (dots with error bars). Interesting enough, transitions occur in all cases with a gain of ∼20 Kcal/mole (as measured from the starting state, i.e., in each case the opposite of the stable one), irrespective of the temperature and of the FF. As said, comparing the energies resulting from two different FFs is not straightforward. In this case, an inspection of Figures 2C,D shows that the simulation trajectories with FF A and FF B get particularly near in a region of the RMSD A -RMSD B plane corresponding to σ∼0.4, indicating that in that area structures belonging to different trajectories are similar. Aligning the energy values for that value of σ in the plot of Figure 3A generates a small shift leading to B structure more stable than A one of about 3-4 kcal, roughly corresponding to the experimental evaluation. The resulting "activated state structure" corresponds to the intermediate found in the B→A simulations, which turns out to be located ∼10 Kcal/mole above the A/B states. This "barrier" value seems rather independent on the simulation temperature, whose effect appears to be a rigid shift of the average energies.
While the described procedure gives reasonable values of the energies, representative structures along the trajectories are more properly selected via the PP algorithm. This returns a userdefined (20 in this case) number of elements, which are not elements belonging to the trajectories they represent, but rather elements optimizing the structure variance within the trajectory. As a consequence, the energy profiles obtained evaluating the FF A and FF B energies onto them (Figure 3B, solid lines and squared symbols) are rather regular and lie lower in energy with respect to parent trajectories, shown by lines connecting circle symbols (obtained selecting the nearest elements to the optimal ones, filled and empty dots connected by dotted and dashed lines). Remarkably, even after post processing, the main features of the simulation remain: the cluster located at σ∼0.4-0.5 is well-represented in FF A simulations, and is located about 10 Kcal/mole above with respect to A and B states.
The optimal element trajectories extracted from the low temperature runs are also reported in Figure 3C to be compared with the energies evaluated from the MAP and PROMPT trajectories using the MCG FFs. Even after a local optimization, the energies from MAP and PROMPT rapidly increase producing a very large energy barrier at intermediate σ values. An inspection of the structures (reported as insets in the plot) reveals that these arise from severe distortion of the backbone (especially for MAP) and/or steric clashes (both). In particular, the high energy of the intermediate from PROMPT seems to be due to steric clashes in one of the two ends of the protein (highlighted with a yellow circle in Figure 3C. Clearly, higher energies on the MAP/PROMPT paths evaluated with MCG FFs are expected, since the low energy path extracted with PP from simulations minimize the MCG Hamiltonian. Therefore, in order to clarify if this energy difference reflects a real larger stability of MCG derived conformations, we rebuilt the atomistic structure of the paths FIGURE 2 | Simulations results from Langevin dynamics at 300 K, γ = 8 ps −1 (A) and 130 K, γ = 2 ps −1 (B). Temperature (upper plots), total and potential energies (central plot) and σ are reported along the simulations from A to B (using FF B and starting from configuration A, green lines), and from B to A (using FF A and starting from configuration B, red lines). For the 300 K simulation also the running averages are reported for the potential energy as yellow and blue lines, respectively. (C,D) Scatter plot of the LD simulations (same color coding as previous) compared with MAP and PROMPT paths evaluation (color coding as in the legend of C). The connected dots are the representative elements of the PP clustering procedure. Sample configurations are reported in colors corresponding to the lines and their approximate location in the plots are indicated by arrows. evaluated with all methods and compared their energies evaluated with the atomistic FFs ( Figure 3D), after optimization of the side chain conformation keeping fixed the backbone structure. All methods give comparable energies for structures near A and B states, where in some cases PROMPT and MAP seems to work better than MCG models. However, the atomistic analysis confirms the strong instability of MAP derived structures, displaying unphysical backbone conformation, as shown by the reported Ramachandran plot (upper right inset of Figure 3D). The instabilities of the PROMPT profile are confirmed in the central σ∼0.2-0.8 region, although the Ramachandran plot (central inset) is regular even in there. In fact, in agreement with what found in the MCG model, the instability is not due to a wrong backbone conformation, but to steric clashes in the highlighted area (yellow circle), displaying two sheets whose relative conformation is too close and not correctly aligned. The complete set of structures and energy data is reported as Supplementary Material.

SUMMARY AND CONCLUSIONS
In this work we set up a simulation paradigm for finding the transition path of proteins undergoing large conformational transitions, which is a long-standing problem of biophysics. Proteins are modeled by a Cα based coarse-grained representation, while the transition path is explored via classical molecular dynamics simulations with FFs partially biased toward the reference structures. The selection of a representative trajectory among the huge number of configurations explored during molecular dynamics simulations is accomplished by means of the principal path clustering algorithm, which managed to single out trajectories close to those of minimum free energy, yet capable of exploring intermediate states, with a very low computational cost. The comparison with minimal action path and PROMPT can be summarized as follows: MAP returns structures which are reasonable in the near vicinity of the references states, but is unable FIGURE 3 | Simulation data analysis and comparison with PROMPT and MAP (A) Potential energy vs. σ along the simulations at 300 K (dotted lines) and at 130 K (dashed lines), with the FF A (red) and FF B (green) force fields (scales for FF A and FF B are shifted of 3 Kcal/mole to align the activated state as explained in the text. Both scales are reported on the left and right axis, in colors corresponding to the FF they refer to). Colored dot with error bars are averages over subsets of structures classified by σ intervals (errorbars correspond to standard deviations of data from average values). Representative closed (σ = 0) and open (σ = 1) structures are reported under the plot. (B) Potential energies vs. σ evaluated over the representative structures of the clusters outputted by PP procedure. Squares connected by solid lines: representatives optimized by the PP procedure (filled = from the 300 K simulation, empty = from the 130 K simulations, red with FF A , green with FF B ). Circles connected by dashed/dotted lines: same as previous, but evaluated over a trajectory of structures extracted from the simulations, the nearest to the optimal ones. (Same color and empty/filled code as for squares; shift of scales as in A). (C) Comparison of the 130 K "optimal" energies with energies of trajectories from MAP (cyan) and PROMPT (magenta) evaluated with FF A (dotted) and FF B (dashed). Representative structures of the activated states are reported in corresponding colors. Same scale shift as in (A); the vertical scales are broken to zoom over the low energies. (D) Potential energy evaluated with the atomistic FF over the same trajectories as in (C) (same color coding). Representative structures are reported in corresponding colors; the Ramachandran plot of the activated states of PROMPT and MAP are reported (yellow squared dots superimposed to the standard map in colors). Both in (C,D) the area with distorted sheets in the activated state of PROMPT is highlighted with a yellow circle.
to provide meaningful ones, even after local optimization, in the intermediate regions. This was somehow expected: in fact stronger post-processing methods, involving e.g., the generation of swarms of unbiased trajectories from the transition states were proposed to solve this problem (Pan et al., 2008). PROMPT returns in addition good backbone local conformations along the whole path, but does not guarantee that amino-acids separated along the chain do not get too near and cause steric clashes, which happens in fact, in the intermediate regions. The MCG simulations, guarantee physically sound structures along the whole path, and can explore also intermediates far from the reference structures, but needs appropriate post-processing and clustering techniques to extract a reaction path. We envision that a synergistic use of these methods might combine accuracy and efficiency in the path search. This possibility, and the application to a number of diverse proteins, are explored in a forthcoming paper (Delfino et al., in preparation).

DATA AVAILABILITY STATEMENT
All data used for this work are included in the paper or in Supplementary Material.

AUTHOR CONTRIBUTIONS
FD has produced data and performed analyses, and participated in writing. YP has produced data and analyses, designed work, contributed ideas, and participated in writing. ES and GT have designed the work, contributed ideas, and participated in writing. VT has contributed ideas, designed work and supervised it, and written the paper.

FUNDING
The work of ES has been supported by the RSF grant 19-71-30020, Applications of probabilistic artificial neural generative models to development of digital twin technology for non-linear stochastic systems (HSE University). The work of YP has been supported by the Russian Academic Excellence Project 5-100 of Sechenov Medical University.