A Force Balanced Fragmentation Method for ab Initio Molecular Dynamic Simulation of Protein

A force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method is proposed for ab initio molecular dynamic simulation of proteins. In this approach, the energy of the protein is computed by a linear combination of the QM energies of individual residues and molecular fragments that account for the two-body interaction of hydrogen bond between backbone peptides. The atomic forces on the caped H atoms were corrected to conserve the total force of the protein. Using this approach, ab initio molecular dynamic simulation of an Ace-(ALA)9-NME linear peptide showed the conservation of the total energy of the system throughout the simulation. Further a more robust 110 ps ab initio molecular dynamic simulation was performed for a protein with 56 residues and 862 atoms in explicit water. Compared with the classical force field, the ab initio molecular dynamic simulations gave better description of the geometry of peptide bonds. Although further development is still needed, the current approach is highly efficient, trivially parallel, and can be applied to ab initio molecular dynamic simulation study of large proteins.


INTRODUCTION
Molecular dynamic (MD) simulation plays an increasingly important role in the study of structural and dynamical properties of biomolecules at the atomic level (Karplus and Petsko, 1990;Cheatham and Kollman, 2000;Karplus and McCammon, 2002). With the ever-increasing power of computer hardware and the development of enhanced sampling methods, a significant advance in MD simulations with larger systems and longer simulation time have been achieved over the past decades (Shaw et al., 2010;Prinz et al., 2011). However, the accuracy and reliability of MD results are highly dependent on the accuracy of the force field employed in the simulation (Weiner et al., 1984(Weiner et al., , 1986Ponder and Case, 2003). Despite widely successful applications of the current force fields in bio-molecular simulations, these simplified, predefined pairwise force fields have serious drawbacks. The most widely known deficiency is that the atomic charges in most of these force fields are pre-fixed, and there is no explicit treatment of electrostatic polarization and charge transfer (Duan et al., 2010;Tong et al., 2010;Ji and Mei, 2014). In the past few decades, significant efforts have been devoted to the development of polarizable force fields. However, although great achievements have been made, the accuracy of these polarizable force fields still have a lot of room for improvement. In addition, many force fields have a bias toward the secondary structure of the protein. For example, the α-helical propensity of the AMBER03 force field is too high relative to experimental measurements, while that of AMBER99SB is arguably too low (Best et al., 2008).
Compared with classical force fields, QM calculation can provide much more accurate potential energy function for the studied system, and include all important quantum effects. The advantage or need of the so-called ab initio molecular dynamic (AIMD) simulation over classical force fields in the study of proteins have been reported by various researchers (Wei et al., 2001;Dal Peraro et al., 2005;Ufimtsev and Martinez, 2009;Ufimtsev et al., 2011;Isborn et al., 2013). In these AIMD calculations, the atomic forces of the studied protein were calculated by QM methods, normally on the DFT level, whereas the motion of the nuclei was handled by classical mechanics. However, QM calculation needs a large amount of computational cost, which means that it can only be used for proteins with relatively small size.
So far, considerable efforts have been made to extend the applicability of QM calculation to large systems. Among existing approaches, the fragment-based QM methods has attracted much attention (Gordon et al., 2012;Collins et al., 2014;Li et al., 2014;Pruitt et al., 2014;Ramabhadran and Raghavachari, 2014;Chung et al., 2015;Collins and Bettens, 2015;Raghavachari and Saha, 2015). These approaches based on the chemical locality of molecular system, which assumes that the local regions of a molecular system can only be influenced weakly by atoms that are far away from it (Xu et al., 1998;Fedorov et al., 2014;Gao et al., 2014;He et al., 2014;Pruitt et al., 2014). In this kind of methods, the studied system is divided into small subsystems (fragments); the properties of these fragments such as energy are calculated separately by QM method. Then the property of the whole system can be obtained by taking a proper combination of the properties of these individual fragments. The fragment-based QM method is attractive in several aspects, such as easy implementation of parallelization without extensively modifying the existing QM programs and can be combined with all levels of ab initio electronic structure theories. In our previous study (Liu et al., 2015), a fragment based approach is presented for AIMD simulation of protein. In this approach, the potential energy and atomic forces of the studied protein are calculated by a recently developed electrostatically embedded generalized molecular fractionation with conjugate caps (EE-GMFCC) method (Wang et al., 2013). This AIMD approach had been applied to MD simulation of a small benchmark protein Trpcage in both gas phase and in solution. Compared with AMBER force field, this method can give more stable protein structure in simulation, and capture quantum effects that are missing in standard classical MD simulations.
To further improve the accuracy and efficiency of the AIMD simulations for protein, in this work, we presented a force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method and checked its performance in the AIMD simulations for several systems. The paper is organized as follows. The next section provides a description of the FB-GMFCC approach. In section Result and Discussion, we performed AIMD simulations on two selected proteins to validate the new method, and finally, a brief summary will be given in section Conclusion.

THEORY AND METHOD
The FB-GMFCC method was developed based on the framework of molecular fractionation with conjugate caps (MFCC) approach (Zhang and Zhang, 2003). The computation procedure of FB-GMFCC can be roughly divided into two steps. Firstly, the given protein is cut into caped molecular fragments, including individual residues and residues that form backbone hydrogen bonds. Then the energy and atomic forces of each fragment are calculated by QM methods separately. Secondly, the AMOEBA polarizable force field (Ren and Ponder, 2002;Ponder et al., 2010;Ren et al., 2011;Wu et al., 2012) was employed to describe the long-range non-bonded interactions. Thus, the total energy of the protein system is obtained by a summation of quantum and classical components, Computational details of these energy components are describe below.

Calculation of E QM
To calculate the energy E QM , a given protein with N amino acids (defined as A 1 A 2 A 3 . . . A N ) is decomposed into N individual fragments by cutting through the peptide bonds (Figure 1). At every cut point, a pair of molecular caps were designed to saturate each fragment in order to preserve the local chemical environment. To minimize the computational cost, we simply use the amine and formyl group from the peptide bond as molecular caps, which are conjugate to each other (by forming a peptide bond) and are much smaller than that used in the EE-GMFCC approach. To avoid dangling bond, hydrogen atoms were added to terminate the molecular cap, the position of these extra H atoms are determined from the coordinates of the corresponding C α atoms.
Hydrogen bond is one of the most important structural elements of protein and the dominant factor that stabilizes the protein secondary structures. Many previous works demonstrated that the strength of hydrogen bond from simulations under non-polarizable force fields is underestimated due to the lack of polarization effect (Ji et al., 2008;Gao et al., 2012). In the FB-GMFCC method, the backbone hydrogen bond was considered by two-body QM calculation. To reduce computational cost, only the H-saturated peptide bond which contains the donor or the accepter (which actually is a formamide as shown in Figure 2) was kept in the two-body QM calculation, the position of the extra H atoms are also determined from the coordinates of the corresponding C α atoms. If the distance between donor H atom and acceptor O atom is <3.0 Å and the angle θ of N-H N -O is larger than 120 • , the 2-body correction will be considered. Thus, E QM can be expressed by the following formula: Where the i and j represent the index of ith and jth residues, respectively. If the formyl or amide group of residue A is included in a backbone hydrogen bond, A p represents the Hsaturated peptide bond which contains this group. The first term (2) represents the self-energy of fragment i (the ith residueA i capped with a left cap Cap * i−1 and a right cap Cap i+1 ). And it is clear that the self-energy of conjugate caps E(Cap * i Cap i+1 ) are double counted in first term of Equation (2) and it should be deducted.

Calculation of E MM
The E QM term includes the self-energy of individual residue and the two-body correction of the interaction energy between residues that form backbone hydrogen bonds. To obtain the total energy expression for proteins, the classical force field was introduced to represent the long-range non-bonded interactions. In our previous study, we found that the electrostatic polarization arising from the environment also plays a critical role for including the many body effect in fragmentation methods (Wang et al., 2013). To describe the electrostatic polarization effect FIGURE 2 | To better describe the backbone hydrogen bond, if the distance between the donor H atom and the acceptor O atom is <3.0 Å and the angle of N-H N -O is larger than 120 • , the 2-Body correction will be considered. To reduce computational cost, only the H-saturated peptide bond which contains the donor or the accepter was kept in the two-body QM calculation. efficiently, we employed the polarizable atomic multipole-based AMOEBA force field (Ren and Ponder, 2002;Ponder et al., 2010;Ren et al., 2011;Wu et al., 2012). The expression of E MM is as the following: Details about calculations of Van der Waals interactions, permanent and induced electrostatic energies of the AMOEBA force field could be found in Refs (Ren and Ponder, 2002;Ponder et al., 2010;Ren et al., 2011;Wu et al., 2012). For any two atoms that have not been calculated in the same QM zone, these nonneighboring interactions between them should be added to the total energy expression.

Balance the Force
To obtain atomic forces, we need to compute the derivative of FB-MFCC with respect to nuclear coordinates. For a given atom m, the atomic force can be expressed as following: It should be noted, however, that we employed extra hydrogen atoms to avoid dangling bonds (Figure 3) and their coordinates were determined from those of the corresponding C α atoms.
Because the forces on these extra hydrogen atoms in capped fragments E(Cap * i−1 A i Cap i+1 ) cannot be canceled exactly by subtracting those in the caps E(Cap * i Cap i+1 ), it will not exactly conserve the energy. In order to fix this problem, we balance the forces on the corresponding C α atoms by adding the differences of forces on these extra hydrogen atoms. For instance, the difference of the forces of the H atoms added to the carbonyl group of the residue i−1 (H atom in the left blue cycle of Figure 3) is Which is added to the force of the C α atom of residue i−1, This approach will balance the forces of the fragments and conserve the total energy of the system.

Performance of FB-GMFCC on Pure Proteins
To validate the FB-GMFCC method, we checked its performance for four protein systems and compared the calculated results with that calculated by conventional full-system QM calculations. An Ace-(ALA) 9 -NME linear peptide was constructed by the TLEAP software in the AMBER16 package, and three small proteins with different secondary structures are selected from the protein data bank (ID: 2I9M, 1LE1, and 2OED). Energy minimization (by using gradient descent and conjugate gradient algorithms with Amber ff14SB force field) was performed to remove bad contacts in these structures before QM calculations. Comparisons of job CPU times of FB-GMFCC and full system QM calculations were shown in Table 1. We can see that FB-GMFCC method is 4 or All calculations were performed with GAUSSIAN09 at M062X/6-31G * level. * Full-system QM calculation is not possible on our current machine due to large size of the system. 5 times faster than full QM calculation for a real protein with about 200 atoms. For the larger 2OED protein, the full-system QM calculation is not possible on our computer system due to its large size. It should be noted that the computational time in the present approach is essentially linear with the system size as shown in Table 1. Figure 4 shows the comparison of computed atomic forces with those from full-system QM calculation. Overall, the atomic forces are in good agreement with the full-system calculations except a few points. For example, there is a bad point in the calculated atomic force of 2I9M (Figure 4), which correspond to one H atom on the ε-amino group of LYS8. After carefully checking the structure, we found that a salt bridge is formed between this group and the side chain of GLU4. As a result, this salt bridge cannot be accurately described by the force field which is used in the present method to describe interaction between non-neighboring residues.

Ab Initio Molecular Dynamic in Gas Phase
To further check the performance of FB-GMFCC method, we performed an AIMD simulation for the linear peptide ACE-(ALA) 9 -NME in the NVE ensemble and gas phase. The simulation was performed by combining the FB-GMFCC and the TINKER program. Before the AIMD simulation, an energy minimization (by using gradient descent and conjugate gradient algorithms with Amber ff14 force field), a 400 ps heating simulation which heated the system from 0 to 300 K and a 5 ns equilibrium MD simulation (by using velocity verlet algorithm and the same force field) were performed. The AIMD simulation lasted 2 ps with 1fs time step and without any constraints. Another AIMD simulation without force balance was also performed as a reference. The total energy fluctuations in these two AIMD simulations were shown in Figure 5. As can be seen, the total energy in the AIMD simulation based on GMFCC without force balance is gradually increased in gas phase NVE ensemble, which means that the energy is not conserved if atomic forces of extra cap hydrogen atoms are not compensated to corresponding C α atoms. However, the total energy of FB-GMFCC can be maintained well and conserved at 91 kcal/mol. On average, the extra H atoms can import extra forces as large as 11 kcal/(mol * Å) to the system, which lead to additional works FIGURE 4 | Correlation between atomic forces of three proteins [ACE-(ALA) 9 -NME linear peptide, 1LE1, 2I9M] calculated by full-system QM calculation and that calculated by FB-MFCC. All the QM calculations were performed at the M062X/6-31G* level.
FIGURE 5 | Time evolution of total energies in AIMD simulations of a linear peptide ACE-(ALA) 9 -NME with FB-GMFCC method (blue) and GMFCC without force balance (red) in gas phase and NVE ensemble.
on the system. Thus it is necessary to balance the atomic forces on these extra H atoms in the AIMD simulation.

AIMD in Explicit Water
Since water plays an important role in protein structure and dynamics, the study of protein should be carried out in the solvent environment. The FB-GMFCC approach can also be used to perform AIMD simulations for proteins in explicit solvent environment. The total energy of protein-solvent system with FB-GMFCC can be expressed by the following formula.
To save the computation cost, the inter-and intra-interactions of water molecules and their interactions with proteins are described by classical force field (Amber ff14SB), as mechanical embedding in the QM/MM framework. We performed 110 ps AIMD simulation for the relatively larger protein (2OED, 56 residues, 862 atoms) in explicit water. The protein was solvated in a water ball consisting of 3084 TIP3P water molecules. Before AIMD simulation, energy minimizations were performed to remove bad contacts in the system, and a 25 ps heating simulation was performed to heat the system slowly to 300 K. A restraint of 50 kcal/mol was used on the backbone to avoid large unphysical structural change in heating process. Then the system underwent AIMD simulation with 1fs time step and without any constraints. The Langevin thermostat with the collision frequency 2.0 ps −1 was applied to control the temperature. In addition, there was a 20 kcal/mol half-harmonic restrain used on the boundary of water ball to avoid the escaping of water molecules.
This AIMD simulation was performed on a linux server cluster with 30 nodes and each node has dual Intel Xeon E5-2680v3 CPUs. To balance the computational cost and accuracy, the combination of BLYP functional and 6-31G * basis set was used in the calculation. It took 55 days to complete the simulation. The time evolution of temperature in the simulation was shown in Figure 6. We can see that the temperature is very stable in the trajectory. The backbone RMSD with respect to the X-ray structure is no larger than 1.5 Å, which means that the structure of 2OED protein are relatively stable during the110 ps AIMD simulation.
Recently, many researchers discovered that considerable deviations from planarity of peptide bond (ω = 180 • ) can be identified in atomic resolution X-ray structures, sometimes even exceeding 10-15 • (Wlodawer et al., 1984;MacArthur and Thornton, 1996;Ulmer et al., 2003). As the resolution of the structure (2OED) used in this work is very high (1.1 Å) and the coordinates of hydrogen atoms in this structure were further refined with NMR experiment (Ulmer et al., 2003), it is worth to compare the planarity of peptide bonds in the experimental structure and the AIMD trajectory. Six peptide bonds were selected from the experimental structure as their O i−1 -C i−1 -N i− H i N dihedral angles deviate the most from the peptide plane. The result can be found in Table 2 and Figure 7. We can see that four of the six peptide bonds still maintained their large deviations. The results calculated by FB-GMFCC generally agree well with the experiments, especially for VAL21, TYR3, and PHE52. For comparison, we also test the performance of MD with classical force field (Amber ff14SB) at the same conditions. Not surprisingly, the Amber ff14SB force field prefers planer peptide bonds, which was predefined. As based on QM calculation, the AIMD simulation generally describes the intraprotein interactions more accurately.

CONCLUSION
In this study, a force balanced generalized molecular fractionation with conjugate caps (FB-GMFCC) method was presented. In this approach, fragment-based energies of individual residues and interaction energies of residues that form backbone hydrogen bonds are calculated by quantum mechanics. Other non-bonded interactions are considered by the polarizable AMOEBA force field. The calculated atomic forces of this method showed good agreements with that calculated by the conventional full-system QM calculations. A key element of the FB-GMFCC method is that the atomic forces of capped H atoms are corrected to achieve the conservation of the total force of the studied system.
We also demonstrated the applicability of FB-GMFCC method for performing ab initio molecular dynamic (AIMD) simulations for proteins. The results of an Ace-(ALA) 9 -NME linear protein showed that only the balanced force can keep  The calculated values were averaged from the trajectories.  the conservation of the total energy during the simulation. An 110 ps AIMD simulation was also performed for a relatively large protein with 56 residues and 862 atoms in explicit water. Compared with the classical force field, the AIMD simulations gave better description about the geometry of peptide bonds. It should be note that the accuracy of the FB-MFCC method still have room to be improved. Further development of this method will focus on the consideration of strong short-range interactions such as salt bridges and hydrogen bond including side chains, relevant work is underway in our laboratory. These results have shown that the FB-GMFCC approach is potentially powerful and attractive for studying protein dynamics. As a fragment based approach, the FB-GMFCC method is linear-scaling and trivially parallelizable. With further development and improvment, this method will become more and more practical for AIMD simulation of larger proteins.

AUTHOR CONTRIBUTIONS
MX performed theoretical calculation and analysis of result. TZ developed the theory and contributed to the writing and discussion of the paper. JZ organized the project and contributed to the discussion and writing of the paper.