Modeling DNA Opening in the Eukaryotic Transcription Initiation Complexes via Coarse-Grained Models

Recently, the molecular mechanisms of transcription initiation have been intensively studied. Especially, the cryo-electron microscopy revealed atomic structure details in key states in the eukaryotic transcription initiation. Yet, the dynamic processes of the promoter DNA opening in the pre-initiation complex remain obscured. In this study, based on the three cryo-electron microscopic yeast structures for the closed, open, and initially transcribing complexes, we performed multiscale molecular dynamics (MD) simulations to model structures and dynamic processes of DNA opening. Combining coarse-grained and all-atom MD simulations, we first obtained the atomic model for the DNA bubble in the open complexes. Then, in the MD simulation from the open to the initially transcribing complexes, we found a previously unidentified intermediate state which is formed by the bottleneck in the fork loop 1 of Pol II: The loop opening triggered the escape from the intermediate, serving as a gatekeeper of the promoter DNA opening. In the initially transcribing complex, the non-template DNA strand passes a groove made of the protrusion, the lobe, and the fork of Rpb2 subunit of Pol II, in which several positively charged and highly conserved residues exhibit key interactions to the non-template DNA strand. The back-mapped all-atom models provided further insights on atomistic interactions such as hydrogen bonding and can be used for future simulations.


INTRODUCTION
Transcription is fundamental to virtually all area of biology. In eukaryotic cells, RNA polymerase II (Pol II) transcribes all messenger RNAs, making it of central importance. The Pol II transcription initiation requires progressive assembly of several general transcription factors (TFs) and Pol II on the promotor DNA sequence, forming the pre-initiation complex (PIC). After the initial transcription of short RNAs, the transcription machinery escapes the promoter region converting its architecture for the transcription elongation. Much of the transcriptional regulations are related to these early stages of transcription and thus it is of utmost importance to understand the molecular mechanisms of the transcription initiation, which we focus in this study.
Overall processes in the Pol II transcription initiation have been characterized via decades of studies. The PIC consists of Pol II and six general TFs, TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH (Buratowski et al., 1989;Roeder, 1996;Grünberg and Hahn, 2013;Sainsbury et al., 2015). In addition, coactivators such as Mediator are involved in its regulation (Schilbach et al., 2017;Nozawa et al., 2017). The initiation process begins with the recognition of the promoter DNA sequence by TFIID. For the promoter sequences that contain the classic TATA box, the TATA-binding protein (TBP) in TFIID binds the TATA box DNA sequence, leading to ∼90-degree bend of the DNA. Then, TFIIA, TFIIB, Pol II-TFIIF complex assemble in this bent site. Further, TFIIE and TFIIH are recruited in order, to form the PIC with the bent duplex DNA (termed the closed complex, CC). In particular, PIC without TFIIH is called as core PIC (cPIC). Next, the promoter DNA melts into the template and non-template DNA strands, driven by the ATP-dependent translocase activity of TFIIH (termed the open complex, OC). The template DNA strand moves toward the active site of Pol II. The melted DNA region is called "DNA bubble", of which size is experimentally characterized as ∼6 bp in the OC state (Tomko et al., 2017). Subsequently, the DNA bubble expands to ∼13 bp (Tomko et al., 2017), which allows the template DNA strand reaching to the active site to begin the messenger RNA synthesis. The complex in which the initial transcription begins is called the initially transcribing complex (ITC). Notably, while the ATP-dependent translocase activity of TFIIH facilitates the promoter DNA opening (Compe and Egly, 2012;Fishburn et al., 2015), some promoter DNAs can open spontaneously without TFIIH (Plaschka et al., 2016).
Recently, the cryo-electron microscopy (cryo-EM) revealed near-atomic structures in key stages of the Pol II transcription initiation (Schilbach et al., 2017;Nozawa et al., 2017;Plaschka et al., 2016), which provided the model of DNA opening process in the PICs ( Figure 1A) (Plaschka et al., 2016). The model is based on the yeast CC, OC (Plaschka et al., 2016), and ITC structures (Plaschka et al., 2015), and the highly conserved human CC structure (He et al., 2013). However, the state transitions from CC to OC, and to ITC were not directly observed. Moreover, the modeled structures of OC and ITC by cryo-EM do not contain parts of DNA strands because of high flexibility in the DNA bubble. Thus, how the template and nontemplate DNA strands behave inside Pol II has not been fully understood. Complementarily, the DNA bubble size in OC and ITC states has been detected via optical and magnetic tweezer experiments (Tomko et al., 2017;Fazal et al., 2015). However, the structural details in the DNA bubble is currently missing.
Given such situations, molecular dynamics (MD) simulations can offer another complementary approach to address the structural dynamics of the Pol II transcription initiation since MD simulations can provide high-resolution spatiotemporal information (Chen et al., 2010;Hyeon and Thirumalai, 2011;Feig and Burton, 2010;Huang et al., 2010;Silva et al., 2014). However, since the DNA opening process involves rather largescale and slow movements of DNA within large complexes, conventional MD simulations with fully-atomic resolution (designated as the all-atom (AA) MD hereafter) cannot easily sample these structural dynamics. To circumvent this difficulty, one can alternatively use coarse-grained (CG) MD simulations, which can speed up the simulation by orders of magnitude at the cost of accuracy (Liwo et al., 2014;Takada et al., 2015;Kmiecik et al., 2016;Pak and Voth, 2018). Once comprehensively sampled by CG-MD, one can back-map the sampled CG structure models into AA models, followed by AA-MD simulations (Shimizu and Takada, 2018). A recent study employed such a protocol to gain comprehensive and high-resolution energy landscape in a bacterial RNA polymerase (Unarta et al., 2021).
In this study, using the cryo-EM yeast structures for the CC, OC, and ITC, we performed multiscale MD simulations to model structures and dynamic processes of DNA opening.
Combining CG-and AA-MD simulations, we first obtained the atomic model for the DNA bubble in the OC. Then, in the CG-MD simulation from the OC to the ITC, we found a previously unidentified intermediate state which is formed by the bottleneck in the fork loop 1 of Pol II: The loop opening triggered the escape from the intermediate, serving as a gatekeeper of the promoter DNA opening. In the ITC, the non-template DNA strand passes a groove made of the protrusion, the lobe, and the fork of Rpb2 subunit of Pol II, in which several positively charged and highly conserved residues exhibit key interaction to the non-template DNA strand.
We used the DNA sequence identical to that used in cryo-EM studies (Plaschka et al., 2016;Plaschka et al., 2015). The sequence was derived from the promoter sequence of HIS4 gene locus, from which 28 bp were deleted at the downstream of the TATA box.

Coarse-Grained MD Simulations
In this study, we applied the coarse-grained (CG) simulation model that has been developed previously and extensively applied to protein-DNA complex systems (Levy et al., 2007;Terakawa et al., 2012;Freeman et al., 2014;Zhang et al., 2016;Shimizu et al., 2016;Lequieu et al., 2017;Niina et al., 2017;Brandani et al., 2018;Tan and Takada, 2020). We used AICG2+ model for proteins, 3SPN.2 model for DNA (Li et al., 2014;Hinckley et al., 2013). Briefly, in AICG2+, each amino acid in proteins are represented by one CG particle placed at its Cα position and the structurebased contact potential biases its energy landscape towards the reference structure. In 3SPN.2 model, each nucleotide is modeled by three CG particles corresponding to the phosphate, the sugar, and the base. Orientation-dependent potentials for base-base interactions and others are designed to reproduce basic experimentally-characterized properties of duplex and, to some extent, single strands. Between proteins and DNA, we applied the structure-based contact potential for representing the specific interactions, as well as a general excluded volume term and the electrostatic interaction via the Debye-Huckel approximation (the monovalent salt concentration was set to 200 mM throughout this study). For the electrostatic interaction, we employed partial charges on the surface residues of proteins, which were optimized to reproduce the electrostatic potential around the protein obtained by the all-atom model via the RESPAC method (Terakawa and Takada, 2014). For time propagation, including the solvent effect implicitly, we employed a simple Langevin dynamics at the temperature 300 K. For all the CG-MD simulations, we used the software CafeMol 3.2 (Kenzaki et al., 2011).
The specific protein-DNA interaction, i.e., the structure-based contact potential is, as usual, expressed as ij∈contact ϵ go 5 r ij0 r ij 12 − 6 r ij0 r ij 10 where r ij is the distance between CG particles i and j, r ij0 is the corresponding distance at its reference structure, and ϵ go is a parameter that modulates the strength of the interaction, of which value was calibrated to be 1.2 kcal/mol, to maintain experimentally characterized contacts in the three states, CC, OC, and ITC (Plaschka et al., 2016;Plaschka et al., 2015). Specifically, we set the following four conditions to be satisfied: 1. In the CC state, the contact between DNA and the E-wing of TFIIE is maintained. 2. In the OC state, the template DNA strand can maintain the native contacts with a region close to the active site of Pol II. 3. In the OC and ITC states, an upstream side of DNA maintains its contact with N50, K51, and T52 of TFIIE. 4. In the OC and ITC states, the triple mutations N50E, K51E, and T52E lead to loss of the DNA-TFIIE contacts (Plaschka et al., 2016).
For the specific protein-DNA interaction, we collected protein-DNA contacts in the three complexes structures and used its union set for the structure-based contact potential. This union set includes the particular contacts satisfying above four conditions.

Trajectory Analysis
The state-to-state transitions were characterized by protein-DNA contacts that depend on the state. There are 92, 22, and 118 contacts between proteins and DNA in CC, OC, and ITC states, respectively. The contacts in CC are all specific to the CC state and are not shared with the other two states. Thus, all the 92 contacts are used to characterize the CC state. The contacts in OC are mostly a subset of the contacts in the ITC state (19 out of 22 included in the ITC contacts). We used all the 22 contacts to characterize the OC state. Of the 118 contacts in ITC states, 99 are unique to the ITC state and thus are used to characterize the ITC state. Once the sets of contacts are defined, we quantify the stateto-state transition by the fraction of contacts formed in each snapshot.
The size of the DNA bubble was estimated as the sum of the broken base pairs, which are defined by the distance between the CG particles of the base pairs larger than 6.2 Å: In preliminary CG-MD simulations of duplex DNA at the same solvent condition, the probability that the base-base distance is larger than 6.2 Å was 0.3%. In apparently melted DNA configurations, their base-base distances were almost surely larger than this threshold distance.

Back-Mapping to All-Atom Model and All-Atom MD Simulations
Following the previously developed protocol (Shimizu and Takada, 2018), we performed the back-mapping from our CG models to all-atom models. While we used the cryo-EM-based CC protein structures as the reference structures of all the three states in the dynamical modeling, we moved them back to the respective cryo-EM protein structures aiming at more accurate modeling of all-atom structures. For the intermediate state I 2 , we used the OC structure as the reference. For each state, we began with the CG-MD simulation at 300 K for 10 5 MD steps. Then, to reduce local fluctuations, we performed a quick annealing simulation, quenching the temperature from 300 to 1 K, followed by a 10 5 MD step simulation at 1 K. The final structure was put into the back-mapping toolset. For DNA, we applied the CG to AA reconstruction tool (Shimizu and Takada, 2018), whereas for proteins, we employed the PD2 ca2main (Moore et al., 2013) for backbone and SCWRL4 (Krivov et al., 2009) for sidechain reconstruction.
Once the all-atom model for the PIC were obtained, we performed all-atom MD simulations using the software GROMACS 2020.2 (Abraham et al., 2015) with the protein, DNA, and water force fields, ff14SB (Maier et al., 2015), and parmbsc1 (Ivani et al., 2016), and TIP3P (Jorgensen et al., 1983), respectively. We used the standard protocol: We set the box size of 182.2 × 232.1 × 186.4 Å 3 solvating with water molecules and 171 Na + ions to neutralize the system. After the energy minimization, we equilibrated the local system with NVT and then NPT ensembles (T 300 [K], the pressure 1 bar), followed by 10 ns MD simulations. We used the cutoff distance of 1 nm for the Coulomb interaction with the particle-mesh-Ewald for long range treatment.

Multiscale Modeling of Pre-Initiation Complexes
Our multiscale modeling begins with CG-MD simulations that connect the three states of the PIC; the CC, OC, and ITC. The constructed CG models were then back-mapped to AA models, which is followed by short-time MD simulations with the AA models.
We employ the CG model that has been extensively used to protein-DNA complexes (Levy et al., 2007;Terakawa et al., 2012;Freeman et al., 2014;Zhang et al., 2016;Shimizu et al., 2016;Lequieu et al., 2017;Niina et al., 2017;Brandani et al., 2018;Tan and Takada, 2020). In the CG model, each amino acid in proteins is represented as one particle located at the Cα atom position and each nucleotide in DNA is modeled by three particles each representing the phosphate, the sugar, and the base. The protein energy function AICG2+ contains the contact potentials that stabilizes the predefined reference (native) structure, i.e., the structure-based model (Li et al., 2012). The DNA energy function 3SPN.2 is empirically tuned to reproduce several experimental data such as the sequence-dependent melting temperature and bending modulus (Hinckley et al., 2013). The protein-DNA energy function consists of the generic terms; the short-range repulsion and the electrostatic interaction, and the specific interactions; structure-based contact potentials (See Materials and Methods for more details).
The simulation system consists of an 81-bp promoter DNA (the sequence shown in Figure 1B) and the protein complex that contains Pol II, TBP, TFIIA, TFIIB, TFIIE, and TFIIF (that is, this study deals with cPIC). TFIIH is not included because the DNA bubble can form without TFIIH (Plaschka et al., 2016;Alekseev et al., 2017) and because the structural information on ATPdependent conformational change in TFIIH is incomplete albeit some structures previously reported (Schilbach et al., 2017;Dienemann et al., 2018;Osman and Cramer, 2020).
The root-mean-square-differences (RMSD) of protein complexes were 0.85 Å between CC and OC, and 3.5 Å between OC and ITC, which are smaller than the resolutions reported in the cryo-EM analysis (8.8, 4.4, and 7.8 Å, for CC, OC, and ITC models, respectively). We note that we excluded DNA in the calculations of these RMSDs. In the CG-MD simulations, these modest-sized structure changes in proteins should appear via the interaction to DNA (and a short RNA in the case of ITC). Since CC contains the weakest protein-DNA interaction among the three complexes structure models, we took the protein structure of CC as a reference structure of the protein complex in the CG model throughout this study.

Modeling the DNA Bubble in the Open Complex
First, to obtain the OC model with the open DNA, we performed 40 independent CG-MD simulations of 5 × 10 6 MD steps, starting from the CC structure (Supplementary Figure S1). In any of the simulations, the promotor DNA did not melt spontaneously and most of the OC specific protein-DNA contact did not appear although the particular region of the promotor (−18∼ +7 relative to the transcription start site (TSS)) was distorted toward the cleft of Pol II (Supplementary Figure S1). This distortion in DNA was observed in a previous cryo-EM study, which indicates the prestage of DNA opening (Dienemann et al., 2018). A previous study shows that the DNA opening in the absence of TFIIH takes a very long time; the real-time observation of the formation of the DNA bubble shows that it takes a few seconds (Fazal et al., 2015). Therefore, it is reasonable that we did not observe spontaneous DNA opening in our CG-MD simulations that cannot cover second time scales.
Then, to enforce the prompt DNA opening, we modified the non-template DNA sequence to introduce the DNA mismatch of 15 bases into the promotor ( Figure 1B). The introduced mismatch is identical to that used for the cryo-EM structures of OC and ITC (Plaschka et al., 2016;Plaschka et al., 2015). Under this condition, we performed 40-independent CG-MD simulations of 2 × 10 7 MD steps (a representative trajectory is shown in Figure 2). Figure 2B, left panel depicts a representative time course of the fraction of protein-DNA contacts specific to CC (red) and to OC (green). In this trajectory, in the very initial phase, ∼80% of the CC-specific contacts were lost, whereas ∼40% of the OC-specific contacts were formed to reach an intermediate state, which we call "pre-OC" state ( Figure 2A, center). In the pre-OC state, most of the mismatched DNA region melted to form a bubble of ∼13 bp ( Figure 2C). This caused the +2 site of the template DNA strand to form new contacts with Pol II (Supplementary Movie S1). All the 40 trajectories paused at this pre-OC state (40/40 cases). In the trajectory in Figure 2B, left panel, the template DNA strand jumped further down to the active site at ∼ 0.9 × 10 6 MD steps, reaching to the OC-like state with the mismatch (Figure 2A, right, Figure 2B) (22/40 cases). In the 22 cases, we observed that the complex further moved towards the ITC state in five cases). The transition was driven by the new contact formation of the +1 site of the template DNA strand with Pol II (Supplementary Movie S1).
To obtain the OC structure model without the DNA mismatch, for obtained OC-like structures with the mismatch, we changed the DNA sequence back to the original sequence without mismatch, followed by 5 × 10 6 MD steps CG-MD simulations. A representative trajectory is depicted in the right panels of Figures 2B,C. While the overall positioning of DNA did not change ( Figure 2B, right), part of the melted DNA regained the base pairing during the trajectory ( Figure 2C, right). The observed bubble size fluctuated in time in the range of 6-10 bp, with the mean and the standard deviation 8.2 ± 1.7 bp. In the 22 cases, we did not see significant difference in the fraction of DNAprotein contacts and the DNA bubble size (Another trajectory shown in Supplementary Figure S2). Notably, we observed that the bubble size depends on the promoter sequence to some extent; with the promoter sequence used in the single-molecule magnetic tweezer experiment, our simulation resulted in the bubble size of 5.5 ± 1.4 bp, which is fairly compared with the experimental estimate, 6.1 ± 0.3 bp (Tomko et al., 2017).
To detect protein-DNA interactions in the OC state at atomic level, we modeled all-atom structures via back-mapping from the snapshots of a CG-MD trajectory with the DNA bubble sizes of 6 and 9 bp. The obtained all-atom models were further relaxed/ refined by 10 ns MD simulations with explicit water solvent (Figure 3). In the upstream of the DNA bubble, we find that the hydrogen bonds of the TFIIE E-wing residues K80 with the non-template DNA at −11 to −10 sites, and with the template DNA at −13 site, which are present in the CC state, are maintained ( Figure 3B). These interactions are suggested to facilitate the promoter opening and contribute to the efficiency of transcription initiation (Forget et al., 2004). Comparing the structures with 6 and 9 bp DNA bubbles, we find that the template DNA strand is rather similar each other while the non-template DNA strand is more mobile. The 6 bp in the downstream side (from −4 to +2 sites) were melted in both structures, while the 3 bp in the upstream side (from −7 to −5 sites) were formed/melted in the 6 bp/9 bp DNA bubbles.

Dynamical Modeling of the Transition from the Open Complex to the Initially Transcribing Complex
Next, we addressed dynamic process of the transition from the OC state to the ITC state. Starting from a final snapshot of the previous simulation that paused at the OC state for the promoter DNA without the DNA mismatch, we performed 140 independent CG-MD simulations of 2 × 10 7 MD steps ( Figure 4). In most trajectories (130/140 cases), the contact between the promoter DNA (from −16 to −9 sites) and the E-wing of TFIIE persisted for the whole simulation time, which clearly precluded the template DNA strand from accessing the active site. Only in 10 cases, we observed the disruption of this contact, which directly triggered the template DNA strand to move down towards the active site (Figure 4). 9 out of these 10 trajectories reached the ITC state.
In these successful trajectories, we found two intermediate states ( Figure 4A the second and the third models) before reaching the ITC state ( Figure 4A, bottom). In a representative trajectory (red in Figures 4B-D), the first transition occurred at ∼1 × 10 6 MD steps, after which about 35% of the ITC specific contacts were formed. In this intermediate state I 1 , ∼4 bp of the template DNA strand, (−2 ∼ +2 sites, relative to the TSS) approached the active site, while the contact between the DNA and the TFIIE E-wing is maintained ( Figure 4A the second structure). After a long waiting time, the DNA was detached from the TFIIE E-wing region (at 1.1 × 10 7 MD steps in the red trajectory), followed by the motion of the entire DNA bubble towards the active site ( Figure 4A the third structure). However, the template strand DNA in the upstream side of the DNA bubble, −13 ∼ −9 sites, collides with the fork loop 1 of Rpb2 subunit of Pol II ( Figure 4A, the third structure, and Supplementary Figure S4, left). This forms a metastable intermediate state I 2 . After some duration time at this intermediate state, the complex made the final transition to the ITC state (at ∼ 1.7 × 10 7 MD steps in the red trajectory). The other successful trajectories followed similar pathways.
To increase the samples of transitions to the ITC state, we performed 160 extra-simulations of 5 × 10 6 MD steps in which the contact between DNA and the TFIIE Ewing was weakened intentionally (see Materials and Methods; Supplementary Figure  S3). In this setup, we observed the successful transition to the ITC state for 102 out of 160 cases with the transition pathway unchanged. The rest of trajectories stayed at the intermediate state I 2 until the end of trajectories (58/160 cases).

Fork Loop 1 Serve as a Gatekeeper
The intermediate state I 2 appears because of the blockage by the fork loop 1, which led us to hypothesize that the fork loop 1 may serve as a gatekeeper. To monitor large-scale motions of the fork loop 1, we plotted in Figure 4D the time courses of the distance between the fork loop 1 and the B-linker of TFIIB, finding that the fork loop 1 exhibits intermittent large-scale fluctuation to open the gate (green dashed lines in Figure 4D). In the representative time course (red), the time of the transition from I 2 to ITC states in Figure 4B coincides with a large-scale opening. Looking into structure changes at the time, we found that the template DNA strand passed the fork loop 1 upon the loop opening, and moved toward the active site ( Figure 4A; Supplementary FigureS4, right). Notably, in any trajectory, the non-template DNA strand never passed the fork loop 1. Instead, the non-template DNA strand approached to the wall of Pol II. Therefore, after the passage of the template strand, the fork loop 1 is located inside the DNA bubble. This support the hypothesis that the fork loop 1 serve as a gatekeeper; it is only passed by the template, but not the non-template DNA strand. This role is supported by previous studies (Plaschka et al., 2016;Meyer et al., 2006;Meyer et al., 2009).
The fork loop 1 sequence is fairly well conserved from yeast to human ( Figure 5). In the human Pol II, it has been reported that a mutant that deletes two residues in the fork loop 1 (K458, A459 in human Pol II, which align with K471, A472 in yeast Pol II) abolishes the transcription in vitro (Jeronimo et al., 2004). This supports the crucial role of the fork loop 1. The mutation may

The Template and Non-Template DNA Strands in the Initially Transcribing Complex
To predict the placement of the template and non-template DNA strands and probe protein-DNA interactions in the ITC state, we constructed all-atom structure model via the back-mapping from snapshots in a CG-MD trajectory ( Figure 6A), which is followed by 10 ns AA-MD simulations. We note that the template DNA strand was anticipated to be in the wall from the cryo-EM study even though the cryo-EM structure model for the ITC state does not contain the segment of the template and non-template DNA strands (Plaschka et al., 2016). The constructed model in this work supports this prediction; the template DNA strand is indeed placed in the wall of Pol II (Supplementary Figure S4, right).
More specifically, our model suggests that the non-template DNA strand is localized at the protrusion, the lobe, and the fork of RPB2 subunit of Pol II ( Figures 6B,C). This placement of the non-template DNA strand is fairly close to that found in the yeast elongation complex structures solved by X-ray diffraction (Supplementary Figure S5) (Barnes et al., 2015). These regions form the groove with many basic amino acids ( Figure 6C). Along the groove the three basic residues, R241, R504, and K507 made specific interactions to the DNA at −1, +1, and +2 sites ( Figure 6C). These three residues are strictly conserved across broad range of eukaryotes ( Figure 6D).

DISCUSSION
Our computational modeling revealed that, after passing the OC state, the PIC passes two intermediate states before reaching the ITC, through which a small DNA bubble in the OC is expanded to complete the DNA bubble ready for RNA synthesis. One key gating state is I 2 , where the upstream part of the template DNA strand (−9 to −13 sites) interacts with the fork loop 1. The fluctuation of fork loop 1 was obligatory to escape from I 2 that leads to engaging the template DNA into the active center at ITC (Figures 4, 5). The non-template DNA strand did not pass the fork loop 1, suggesting that the fork loop 1 serves as the gatekeeper for the DNA bubble. Previous studies implicated two critical roles of fork loop 1. Based on the structural change in the fork loop 1 between the nucleic-acid free state and in the transcribing state (Meyer et al., 2006;Meyer et al., 2009) as well as the mutation assays (cite), the fork loop 1 was considered to play an important role in the transcription initiation. Alternatively, since the fork loop 1 is located at around the terminus of the DNA/RNA hybrid in the elongation complex, it may have important roles in separation of the product RNA from the template DNA strand. Our current simulation clearly supports the former functional role. The fork loop 1 forms a gate together with the rudder of RPB1 subunit in Pol II and serves as a gatekeeper for the engagement of the template DNA strand, but not the non-template DNA strand. The structural model obtained can be used to guess key residues as the gatekeeper, which can be examined by mutagenesis experiments. Furthermore, the shapes of the DNA transcription bubble of the intermediates I 1 and I 2 are different from those of OC and ITC. Especially, it will be interesting to investigate the structure change of the template strand (e.g., the opening of the DNA transcription bubble upstream) experimentally, for example, by the FRET technology.
Moreover, the current study found that non-template DNA strand in ITC is localized in the groove, formed by the protrusion, the lobe, and the fork of RPB2 subunit ( Figure 6). This pass is close to the non-template DNA strand pass in the yeast elongation complex (Barnes et al., 2015), suggesting its ubiquitous importance. However, to our knowledge, these interaction sites were not investigated by mutagenesis. Systematic mutation assays in these sites would clarify the roles of stabilizing the non-template DNA strand in the transcription process.
In this study, we only mentioned the formation of DNA transcription bubble and did not discuss the initial transcription proceed by RNA polymerase. The ITC modeled in this study is a state in which transcription has not yet occurred, followed by the scanning of the transcription-start-site, early RNA transcription, and the promotor escape. It has been proposed that the initial transcription proceeds in prokaryotes via the "scrunching" model (Kapanidis et al., 2006;Winkelman and Gourse, 2017), but it is unclear whether the same is true for eukaryotic transcription.
Obviously missing in the current work is the kinetic and energetic arguments on the very initial process of the DNA opening in the transition from the CC to the OC states. In this study, even with the use of CG-MD simulations, the DNA opening was too slow to be simulated directly in MD simulations for the native promoter sequence. Instead, we needed to introduce a mismatch sequence in the promoter region. This is clearly a limitation. To study kinetic and energetic aspects in this initial DNA opening without the mismatch sequence, we need some advanced sampling methods, such as the umbrella sampling, the Markov-state modeling (Husic and Pande, 2018), and the string method (Weinan et al., 2002). Alternatively, since the ATP-driven motor activity of TFIIH helicase is expected to accelerate the DNA opening, including this effect either explicitly or implicitly may enable to simulate the dynamic process of the bubble formation more directly. These developments are left for future studies.
Related to this, it has been known that Pol I and Pol III systems do not have TFIIH-like helicases (Vannini and Cramer, 2012;Paule and White, 2000;Han et al., 2017;Gouge et al., 2017), yet initiating the transcription efficiently via similar three states (Engel et al., 2017;Abascal-Palacios et al., 2018;Vorlander et al., 2018). Comparison of the transcription initiation in the three RNA polymerase systems can put forward comprehensive understanding of transcription initiation in Eukaryote.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
GS and ST contributed to conception and design of the study. GS performed simulations and analyzed data. GS and ST wrote the draft of the manuscript, contributed to manuscript revision, read, and approved the submitted version.