Understanding protein diffusion on force-induced stretched DNA conformation

DNA morphology is subjected to environmental conditions and is closely coupled with its function. For example, DNA experiences stretching forces during several biological processes, including transcription and genome transactions, that significantly alter its conformation from that of B-DNA. Indeed, a well-defined 1.5 times extended conformation of dsDNA, known as Σ-DNA, has been reported in DNA complexes with proteins such as Rad51 and RecA. A striking feature in Σ-DNA is that the nucleobases are partitioned into triplets of three locally stacked bases separated by an empty rise gap of ∼5 Å. The functional role of such a DNA base triplet was hypothesized to be coupled with the ease of recognition of DNA bases by DNA-binding proteins (DBPs) and the physical origin of three letters (codon/anti-codon) in the genetic code. However, the underlying mechanism of base-triplet formation and the ease of DNA base-pair recognition by DBPs remain elusive. To investigate, here, we study the diffusion of a protein on a force-induced stretched DNA using coarse-grained molecular dynamics simulations. Upon pulling at the 3′ end of DNA by constant forces, DNA exhibits a conformational transition from B-DNA to a ladder-like S-DNA conformation via Σ-DNA intermediate. The resulting stretched DNA conformations exhibit non-uniform base-pair clusters such as doublets, triplets, and quadruplets, of which triplets are energetically more stable than others. We find that protein favors the triplet formation compared to its unbound form while interacting non-specifically along DNA, and the relative population of it governs the ruggedness of the protein–DNA binding energy landscape and enhances the efficiency of DNA base recognition. Furthermore, we analyze the translocation mechanism of a DBP under different force regimes and underscore the significance of triplet formation in regulating the facilitated diffusion of protein on DNA. Our study, thus, provides a plausible framework for understanding the structure–function relationship between triplet formation and base recognition by a DBP and helps to understand gene regulation in complex regulatory processes.


PROTEIN MODEL
The protein is modeled through a coarse-grained (CG) representation where each protein residue is represented by one CG particle placed at the respective C α position Bhattacherjee et al. (2016). The protein maintains its folded structure through a native topology-based model in which a Lennard-Jones potential encourages the formation of native contacts found in the crystal structure, thus help to preserve the protein native fold during the simulations Clementi et al. (2000b). Such reduced model promise great advantages in studying folding of proteins Clementi et al. (2000a), protein-protein interactions Bhattacherjee and Wallin (2012) and multiple basin free energy landscape for large scale motion of proteins Okazaki et al. (2006). In addition, electrostatic interactions between positively charged (Arg, Lys) and negatively charged (Asp, Glu) amino acids are modeled through Debye-Hückel potential. Note that Debye-Hückel theory is inadequate to capture the effect of ion condensation and is valid only for dilute solutions. Despite several limitations, the Debye-Hückel potential has successfully been applied to understand many crucial aspects of nucleic acid biophysics Mondal and Bhattacherjee (2015); Bhattacherjee and Levy (2014a,b); Dey and Bhattacherjee (2018, 2019a,b, 2020; Mondal et al. (2022a,b).
The potential energy function for protein is designated as: The bond energy, E protein bond between two successive C α beads is given by, where K b = 100 kJ/mol/Å 2 , r i and r 0 i are the distances between i th and (i + 1) th C α beads in intermediate and folded structures of the protein respectively. The bend energy, E protein bend for any variation in angle is given by, where K θ = 20 kJ/mol/rad 2 , θ i and θ 0 i are the angles between a vector connecting i th and (i + 1) th C α atoms and a vector connecting (i + 1) th and (i + 2) th C α atoms in the intermediate and folded structures of the protein respectively. The potential energy function for torsional angle, E protein torsion due to rotation in the dihedral angles between four consecutive C α atoms connected by bonds is given as, where K ϕ 1 = 0.5 kJ/mol, K ϕ 2 = 1.0 kJ/mol, ϕ i and ϕ 0 i are the torsional angles between i th , (i + 1) th , (i + 2) th and (i + 3) th C α beads in the intermediate and folded structures of protein respectively. E protein N ative−Contacts is the conformational energy that favours the formation of the native contacts found in the folded structure and such structure-based potential (as originally proposed by Clementi et al. Clementi et al. (2000b)) is modeled by a Lennard-Jones potential given by, where ϵ ij = 4.18 kJ/mol. σ ij is the C α − C α distance between i − j pairs that are in contact with each other in the crystal structure and r ij is the same distance in the intermediate structures generated during the simulations.
All the non-bonded and non-native C α pairs are allowed to interact through short-range repulsive excluded volume interactions, E protein ev given by, where ϵ ij = 1.0 kJ/mol, r ij denotes the distance between i th and j th C α beads and σ ij is the sum of the radii of the interacting particles. The repulsion radius of the C α bead is set to 2.0Å.
The Debye-Hückel screened electrostatic potential energy function is given by, where q i and q j are the charges on i th and j th beads, r ij denotes the separation between them, λ D is the Debye screening length, ϵ 0 is the dielectric permittivity of the vacuum and ϵ(T, C) is the dielectric permittivity of the solution.
The dielectric permittivity ϵ(T, C) is a function of salt molarity C and temperature T and can be expressed as the product of their individual contributions, The Debye screening length can be written as where β = 1 k B T is the inverse thermal energy, N A is the Avogadro's number, e c is the elementary charge and I is the ionic strength of the solution.

DNA MODEL
In this study, we adopted 3SPN.2C coarse-grained model of DNA developed in de Pablo's group, where each nucleotide is described as three beads: one for the phosphate, one for the sugar and one for the base Hinckley et al. (2013);Freeman et al. (2014). The model accurately estimates the structural properties such as, major and minor groove widths which are in good aggrement with experimental results. The model successfully reproduces DNA persistence length and predicts DNA melting temperatures which are consistent with experimental results. Most importantly, the model has been successful to incorporate sequence-dependent curvature and sequence-dependent flexibility of DNA Freeman et al. (2014). These features make it a suitable candidate to study the DNA dynamics at the molecular level.
The potential energy function used to model the DNA is given by bonded is a combination of bond, angle and torsional potentials to preserve the DNA initial structure, given by where K b = 0.6 kJ/mol/Å 2 and r i , r 0 i are the instantaneous and equilibrium bond lengths for the i th DNA bond respectively; K θ represent the force constant for bending which is dependent on the sequence of DNA. The sequence dependent force constant for bending energy at each base-step is shown in supplementary Table S1 that introduce the sequence-dependent flexibility of DNA. θ 0 i and θ i are the equilibrium and instantaneous bend angles for bend i. K ϕ = 7.0 kJ/mol/rad 2 and σ ϕ,i , ϕ 0 i are the Gaussian well depth and equilibrium angle for the i th dihedral respectively. In this model, the dihedral forces only act on the backbone of the system, i.e., the dihedrals are formed by phosphate and sugar sites (S−P−S−P and P−S−P−S).
Additionally, a weak dihedral potential is applied to the DNA backbone which is of the form where K ϕ,periodic = 2.0 kJ/mol/rad 2 . This energy function will provide extra stability to the helix and prevent severe deviations from the equilibrium structures.
For non-bonded potential, we use excluded volume interactions, electrostatic interactions, and basepairing interactions as given by, The excluded volume interactions, E DN A ev between sites i and j are modeled through a purely repulsive Lennard-Jones potential, where ϵ r = 1.0 kJ/mol and r ij = σ ij is the average diameter of the interacting particles. The potential only acts between those sites that are not involved in any bonded interactions or base pair non-bonded interactions.
The electrostatic interactions E DN A elec between all charged phosphate atoms, which are not from neighbouring nucleotides, are modeled using Debye-Hückel potential energy function given in equation S7. The phosphates are assigned a negative charge of 0.6 instead of 1.0 in order to consider the counter-ion condensation.
The base-pairing interactions can be divided into three parts: base-stacking (E DN A bstk ), base-pair (E DN A bp ) and cross-stacking (E DN A cstk ) interactions. These three interactions rely on a Morse potential of the form which can be decomposed into a repulsive component and an attractive component Here ϵ ij denotes the well depth of attraction between sites i and j, α ij is used to control the range of attraction and r 0 ij is the equilibrium distance between interacting sites.
A modulating function f is also incorporated to the angle of interaction in the base-stacking, base-pairing and cross-stacking interactions, which is of the form where the modulating constant K manages the attractive cone width. With these definitions, we can fully describe the potential energy function for intra-strand base-stacking interactions as where K BS = 6.0, α BS = 2.0, ϵ ij gives the depth of the well of attraction between interacting sites and θ BS is the angle between the vector connecting sugar and base in the 5 ′ direction and the vector joining the two base atoms in the 3 ′ direction.
The potential energy function for base-pairing interactions is given by .0 and f modulates the decomposition of attractive and repulsive portions of the Morse potential. The base pairing interactions is modulated by f using θ 1 and θ 2 , where θ 1 is the angle between the vector joining the sense-strand sugar, base and the vector joining sense-strand base with its complementary base, while θ 2 is the same angle but on the opposite strand of DNA. The deviations from a reference dihedral angle is penalized by ∆ϕ 1 = ϕ 1 − ϕ 0 1 , where ϕ 1 is the dihedral between the sugar and base on the sense and anti-sense strands of DNA.
The potential energy function for cross stacking interactions is given by where K CS = 8.0, α CS = 4.0. The cross stacking interaction is modulated by θ 3 and θ CS , where θ 3 is the angle between the vectors connecting the sugars to the bases in a W−C base pair and θ CS is the vector connecting the sugar to the base, and the vector connecting the base on the anti-sense strand in the 5 ′ to the base in the 3 ′ direction on the sense strand.

CALCULATION OF RUGGEDNESS OF THE CHEMICAL POTENTIAL LANDSCAPE
In the present study, the ruggedness of the potential energy landscape is calculated directly from the simulation trajectories using the method primarily adopted by Putzel et al. Putzel et al. (2014)and is discussed briefly here. We partition the whole simulation box into cubic cells of dimension 50Å 3 . From our simulation trajectories, we first consider the position of the center of mass of recognition region of the diffusing protein and then calculate the probability (p cell i ) of occupying i th cell by the searching protein. When the diffusing protein moves from i th cell to j th cell, then the corresponding change in free energy is given by −k B T ln(p cell i /p cell j ). This also signifies the change in chemical potential between these cells. Consequently, the ruggedness of the chemical potential energy landscape can be calculated, in units of k B T , with the formula given by where N cells is the number of cubic cells. This calculation is performed at each stretching force and the result is presented in Figure 4A in the main text.

CALCULATION OF ELECTROSTATIC POTENTIAL ENERGY ALONG DNA CONTOUR FROM DELPHI PROGRAM
We calculate electrostatic potentials along the DNA contour using DelPhi program by solving nonlinear Poisson-Boltzmann Equation Li et al. (2012). Several input parameters are required for estimating electrostatic potential in and around the DNA. For instance, partial charges and atomic radii of each coarse-grained DNA beads are chosen from de Pablo's model Hinckley et al. (2013) for the calculation. The dielectric constants for the internal surface of the solvent molecules and for the exterior aqueous solution are assigned a standard value of ϵ indi = 2 and ϵ exdi = 80 respectively. Radius of the probe molecule (prbrad) is set to 1.4Å to define the solvent accessible surface area. The ionic strength is set to a physiological concentration of 140 mM. The scale for grid spacing is set to 2 grids/Å with Debye-Hückel boundary conditions and 90% of the lattice filled (perfil).
The electrostatic potential is calculated at a reference point i along the DNA. The reference point i is chosen to be the geometric midpoint between the phosphate atoms of nucleotide i in the 5 ′ -3 ′ direction and nucleotide i − 2 in the 3 ′ -5 ′ direction of the DNA strand. We choose two different DNA conformations at 5 × 10 8 and 8 × 10 8 MD steps from the same simulation trajectory at any given force and calculated the electrostatic potentials along these two conformations using the above procedure. Then we identified the base pairs that are involved in forming doublets and triplets along the DNA and extracted the values of the corresponding electrostatic potential energies separately from these two conformations. Finally, we take the average contribution of these two conformations for all the base doublets and triplets along the DNA and the corresponding electrostatic potential (EP, in units of k B T per charge) is presented in Figure 5C in the main text and in Figure S6.

3D DIFFUSION, SLIDING, HOPPING AND 1D DIFFUSION COEFFICIENT
We analyze all the simulation trajectories at each force to quantify the propensities of different modes of diffusion such as sliding, hopping and 3D diffusion using the method adopted in our previous work Mondal and Bhattacherjee (2015); Dey and Bhattacherjee (2018) and is described briefly here. If the distance between the centers of the recognition helix and its closest DNA base pair is more than 30Å, then the corresponding snapshots are characterized as fully unbound state where the protein diffuses three dimensionally in the bulk (3D diffusion). A snapshot is considered to be a sliding motion if it satisfy the following three conditions simultaneously: (i) centers of the protein's recognition region and the closest DNA base pair from it should be at a distance within 15Å, (ii) at least 70% of the recognition helix should stay inside the DNA major groove and (iii) the orientation angle of the protein relative to DNA must be less than 25 • . If any of these three conditions fails to satisfy and the protein is still close enough to the DNA surface, then we consider the corresponding event as hopping mode. Therefore, while scanning the DNA non-specifically, protein can exhibit either sliding or hopping motion along the DNA. This is a 1D diffusion along the DNA contour. We calculate 1D Diffusion coefficient (D 1 ) from the linear behavior of the mean square displacement of the center of mass of recognition helix of the protein diffusing along the stretched DNA through sliding and hopping dynamics only.

PROBABILITY CALCULATION OF THE FORMATION OF BASE PAIR CLUSTERS
To calculate the probability of forming different base pair clusters (i.e, doulets, triplets, quadruplets and so on), we first evaluated the distance between consecutive DNA bases, which usually represents the DNA rise per base pair. If the two consecutive nucleobases are separated by a rise gap of length 5Å , then it is identified as doublets. Similarly, triplets are identified if the three consecutive bases satisfy the above condition. Likewise, the formation of quadruplet, quintuplet, sextuplet, septuplet, octuplet, nonuplet and decuplet can easily be identified. By analysing a complete simulation trajectory, the total number (N k , k = 2, 3, 4, . . . , 10) of different base pair clusters can easily be counted throughout the simulation. Finally, the probability of forming doublets (p 2 ), triplets (p 3 ), quadruplets (p 4 ) and so on, can be calculated as Mondal, A. and Bhattacherjee, A. (2015). Searching target sites on DNA by proteins: Role of DNA dynamics under confinement. Nucleic Acids Res 43, 9176-9186 Mondal, A., Mishra, S. K., and Bhattacherjee, A. (2022a). Nucleosome breathing facilitates cooperative binding of pluripotency factors Sox2 and Oct4 to DNA. Biophys J Mondal, A., Sangeeta, and Bhattacherjee, A. (2022b). Torsional behaviour of supercoiled DNA regulates recognition of architectural protein Fis on minicircle DNA. Nucleic Acids Res 50, 6671-6686 Okazaki, K., Koga, N., Takada Tables   Table S1. Force constants (k θ ) for bending angle potential in 3SPN.2C DNA model.