Experimental Characterization of the Hepatitis B Virus Capsid Dynamics by Solid-State NMR

Plasticity and dynamics are important aspects of protein function. Here we use solid-state NMR to experimentally characterize the dynamics of the 3.5 MDa hepatitis B virus (HBV) capsid, assembled from 240 copies of the Cp149 core protein. We measure both T1 and T1ρ relaxation times, which we use to establish detectors on the nanosecond and microsecond timescales. We compare our results to those from a 1 μs all-atom Molecular Dynamics (MD) simulation trajectory of the capsid. We show that, for the constituent residues, nanosecond dynamics are faithfully captured by the MD simulation. The calculated values can be used in good approximation for the residues not detected by NMR, as well as to extrapolate into the range between the nanosecond and microsecond dynamics, where NMR has a blind spot at the current state of technology. Slower motions on the microsecond timescale are difficult to characterize by all-atom MD simulations owing to computational expense, but are readily accessed by NMR. The two methods are thus complementary, and a combination thereof can reliably characterize motions covering correlation times up to a few microseconds.

In this section we detail the considerations used to construct the estimated error on the quantitative MD prediction of amplitudes of motion at different timescales, shown in the main text Fig. 3 (grey shaded area). We discuss in detail two contributions leading to uncertainty when extracting quantitative predictions of motions from the 1 μs MD trajectory, and illustrate how both errors grow significantly for timescales approaching the end of the trajectory, in particular for the detectors constructed from the experimental conditions over the correlation time range [10^-14, 10^-6] s. Performing the error investigation on a detector analysis that uses a set of seven almost non-overlapping detector sensitivities, spaced at intervals of roughly one order of magnitude (with mean correlation times ranging from 10^-11.6 s to 10^-5.0 s), allows us to monitor the growth of the error sources towards slower timescales and thereby to propose the position of the boundary between the green and grey zones of the MD prediction discussed in the main text. The whole analysis has been performed on chain A of the Cp149 capsid.
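The spacing of such a detector set can be illustrated with a short Python sketch (only the quoted centre values are taken from the text; the detector shapes themselves are not reproduced here):

```python
import numpy as np

# Seven detector centres spanning the quoted range 10^-11.6 s to 10^-5.0 s,
# spaced logarithmically; the actual detector sensitivities are broad
# functions around these centres and are not reproduced in this sketch.
centers = np.logspace(-11.6, -5.0, 7)      # mean correlation times in seconds
spacing = np.diff(np.log10(centers))       # separation between neighbours, in decades
# spacing is ~1.1 decades, i.e. "almost one per order of magnitude"
```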

S1.1 Detector responses for detectors approaching the end of the trajectory are less well defined (influence of the regularization parameter)
As sketched in the main text Fig. 3(a), amplitudes of motion at all timescales are extracted from the correlation function of the 1 μs MD trajectory by computing an inverse Laplace transform. This is a mathematically ill-posed problem, solved with the help of a regularization function consisting in a minimization of the second derivative of the distribution of motion Θ. The relative priority of minimizing the second derivative, compared to obtaining a good fit of the input correlation function, is determined by the regularization parameter λ. Increasing the regularization parameter will in general lead to a smoother/broader distribution of motion. If the regularization parameter is increased too much, the obtained distribution of motion will broaden to such an extent that it no longer provides a good fit of the input correlation function. This dependence of the total fit error on the regularization parameter translates into an L-shaped curve typical of regularization problems. The L-curve for the dependence of the total fit error on the regularization weight in chain A of Cp149 is shown in Fig. S16(a). In the zoom given in the inset of Fig. S16(a), it can be seen that up to around λ = 10^5 the total fit error increases by less than 50% of its initial value at λ = 0, while it starts growing substantially afterwards. Fig. S16(b) shows the expected broadening of the distribution of motion with increasing regularization parameter (λ ranging from 0 to 10^5). It has been shown by Smith et al. (2019) on HET-s(218-289) that this change in the form of the distribution of motion has in general only a minor impact on the extracted detector responses. Finding the distributions of motion is an ill-posed problem, but the step from the distributions of motion to the actual detector responses is such that the result of the two-step process is a rather well-defined problem.
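The regularized inverse Laplace transform described above can be illustrated with a minimal Python sketch on a synthetic two-exponential correlation function (the grids, timescales and λ values below are hypothetical, and a plain least-squares solver stands in for the actual detector-analysis software):

```python
import numpy as np

def ilt_second_derivative(t, C, tau, lam):
    """Fit C(t) ~ sum_j theta_j * exp(-t/tau_j) with a second-derivative
    smoothness penalty of weight lam (the regularization parameter)."""
    K = np.exp(-np.outer(t, 1.0 / tau))            # Laplace kernel
    D2 = np.diff(np.eye(len(tau)), n=2, axis=0)    # second-derivative operator
    A = np.vstack([K, np.sqrt(lam) * D2])          # stacked fit + penalty
    b = np.concatenate([C, np.zeros(len(tau) - 2)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)  # distribution of motion
    rms = np.sqrt(np.mean((K @ theta - C) ** 2))   # total fit error (L-curve axis)
    return theta, rms

# synthetic correlation function with two decay components
t = np.linspace(0.0, 1e-6, 200)
C = 0.7 * np.exp(-t / 1e-8) + 0.3 * np.exp(-t / 2e-7)
tau = np.logspace(-9, -6, 40)
rms_values = [ilt_second_derivative(t, C, tau, lam)[1] for lam in (0.0, 1e-4, 1.0)]
# the total fit error grows monotonically with lam
```

Scanning λ over several orders of magnitude and plotting the fit error against the smoothness of the solution traces out the characteristic L-shape.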
Indeed, the detector sensitivities are convolved with the distribution of motion and serve as a regularization function themselves, being much broader than the individual peaks in the distribution of motion (green, pink and blue areas in Fig. S16(b)). However, motions at long correlation times, approaching the end of the trajectory, can be indistinguishable from each other based on the time points available in the correlation function. This can lead to ill-defined detector responses, introducing a dependence of the detector responses on the regularization parameter. Thus, a variety of slower motions can be fitted equally well, because the trajectory is not long enough and does not contain sufficient points to define a clear peak position. Fig. S16(b) illustrates how the systematic broadening of the peaks in the distribution of motion upon an increase in regularization parameter can lead to additional overlap with a neighbouring detector and thereby alter the detector responses, as is particularly evident for E77. This is a possible source of error when describing slow motions. Note that this effect does not come from a bad fit of the input correlation function: all chosen values correspond to regularization weights for which the total RMS of the fit is small (see Fig. S16(a)). This effect was systematically investigated over a large range of timescales, using the seven detector sensitivities spaced at each order of magnitude shown in Fig. S17(a), together with distributions of motion obtained with regularization parameters between λ = 0 and λ = 10^5 (chosen such that the RMS increases by less than 50% of its initial value at λ = 0). We find that the resulting detector-response curves are almost indistinguishable from each other at all timescales. Note, however, that this finding is also influenced by the choice and number of detectors used for the analysis.
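The overlap argument can be sketched with hypothetical Gaussian stand-ins for two neighbouring detector sensitivities and one peak of the distribution of motion, all placed on the log10(τ/s) axis (the real sensitivities have different shapes; only the qualitative effect is reproduced):

```python
import numpy as np

z = np.linspace(-12, -4, 801)                    # log10(tau / s) grid

def gauss(center, width):
    return np.exp(-0.5 * ((z - center) / width) ** 2)

# hypothetical stand-ins: two broad neighbouring detector sensitivities
sens_a, sens_b = gauss(-9.0, 1.0), gauss(-6.0, 1.0)

def responses(theta):
    theta = theta / theta.sum()                  # normalized distribution of motion
    return (sens_a * theta).sum(), (sens_b * theta).sum()

r_sharp = responses(gauss(-7.0, 0.2))            # small lambda: sharp peak
r_broad = responses(gauss(-7.0, 0.8))            # large lambda: broadened peak
# broadening increases the overlap with the more distant detector (a)
# and decreases the response of the nearer one (b)
```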
The seven detectors given in Fig. S17(a) are specifically optimized over the full correlation time range [10^-14, 10^-4] s based on the available data (the 1 μs trajectory), such that the optimization generates broad detectors where there is uncertainty and narrow detectors where there is more confidence. The situation is different when performing the regularization-influence analysis specifically for the three detectors optimized from the experimental conditions and on the restricted correlation time range [10^-14, 10^-6] s (Fig. S18). Indeed, we now observe a significant increase in regularization error at the slower timescales, which is likely due to the narrowness of the slowest detector, restricted by the end of the MD trajectory.

S1.2 Motions approaching the longest correlation times are not equilibrated
The MD trajectory contains 60 copies of the Cp149 molecule. This allows us to calculate the correlation function and inverse Laplace transform for all 60 copies and to obtain the corresponding detector responses for the set of seven detector sensitivities spaced at each order of magnitude (shown in Fig. S17(a)). The site-specific standard deviations over the detector responses of all 60 copies could thus be calculated and compared between the different orders of magnitude. For better comparison, the squared standard deviation (variance) was normalized by the mean over all 60 copies (σ²/mean) in Fig. S19, accounting for the amplitude of each motion. The variance was divided by the mean, instead of the standard deviation directly, since under the assumption of underlying Poisson statistics the variance equals the expectation value of the mean. We observe, as expected, that the total error is smaller for fast motions and larger for slow motions approaching the end of the trajectory, due to the slower motions not being well equilibrated within the given trajectory. The only exception is the detector centred at 10^-5 s, which is however special because it describes motions beyond the actual trajectory length, which are therefore purely extrapolated from the 1 μs trajectory.
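The normalization used in Fig. S19 can be sketched as follows, with random numbers standing in for the detector responses of the 60 copies and the spread chosen to grow towards the slower detectors (all numerical values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the responses of the 60 copies: rows = copies,
# columns = detectors ordered fast -> slow, with growing spread
responses = rng.normal(loc=0.05, scale=[0.001, 0.005, 0.02], size=(60, 3))
# variance normalized by the mean over copies, per detector
var_over_mean = responses.var(axis=0, ddof=1) / responses.mean(axis=0)
# the normalized spread grows towards the slower detectors
```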

S1.3 Statistical confidence in detector responses at longer correlation times can be improved by averaging over 60 copies of the molecule
Using a 1 μs trajectory implies that the statistical significance of nanosecond predictions should be better than that of microsecond predictions, since the nanosecond motions are sampled many more times within the length of the 1 μs trajectory, which improves the extraction of these motions from the correlation function. One possibility to estimate and improve the statistical significance of predictions at all timescales would be to run a large number of molecular dynamics trajectories with different starting conditions and compare the obtained results. Due to the heavy computational demands for a large viral assembly, this cannot be applied to Cp149. However, since the published MD trajectory contains information on the trajectories of 60 copies of the Cp149 molecule, it is alternatively possible to increase the statistical significance of the predictions for slower motions by averaging over the distribution of detector responses generated by all 60 copies, which we eventually also used to define the estimated error on the MD prediction (grey shaded area in Fig. 3).
Before using such an approach, it is important to find out whether the motions of the different copies are correlated, as strong correlations between the copies could bias the result. In the following we discuss how to investigate the presence of such correlations and show that they do not play a significant role for Cp149, in particular for slower motions, supporting the validity of an averaging approach over the 60 molecular copies.
We identify correlations based on calculations of the 60 detector responses for the correlation functions obtained from the first and the second half of the trajectory (each extending over 500 ns). Two methods of error statistics are then applied to these 2 × 60 detector responses on different timescales. The first method leads to the correct variance of the statistics, while the second leads to a different value only if the copies are correlated.
We generalize the problem and consider 2 × N samplings x_{i,h} of a distribution (mean μ, standard deviation σ; in our case N = 60, with i = 1, ..., N indexing the copies and h = 1, 2 indexing the two halves of the trajectory). The two sets of N samplings are independent of each other (the 1st and 2nd half of the trajectory being independent), but the N samplings within each set may not be (there might be a correlation between copies, which is what we want to find out). Method 1 consists in calculating the mean over the copies, m_h = (1/N) Σ_i x_{i,h}, and then the variance over the 1st/2nd half. Since we assume the two halves to be independent, E[m_1 m_2] = E[m_1] E[m_2]. We thus obtain from this approach the correct variance σ² of the statistics, including any covariance between the copies (result of equation [1]).
In method 2, the variance is first taken over the 60 potentially correlated copies, and the result is then averaged over the uncorrelated pair of data (1st and 2nd half). The variance over the 60 copies is given by E[x_{i,1}²] − E[x_{i,1}]² (the expectation for the average over both halves [1] should be the same as the expectation for just one half). We can further assume that the sum over all covariances relative to some index i is the same regardless of the starting index. If the covariance (correlation) between copies is zero, the two methods yield the same result. However, if there is a positive/negative covariance between copies, the variance calculated by method 2 is smaller/larger than the one calculated with method 1. Fig. S20 displays the results of the error statistics applying method 1 (blue, correlation unbiased) and method 2 (green dashed, correlation biased) to the detector responses of the 60 copies for the seven detector sensitivities shown in Fig. S17(a). The results of both methods agree to a very good degree over most portions of the molecule. The only regions where the two methods show larger deviations are the loop formed by residues 44-50 and the last part of the molecule approaching the C-terminus (the N-terminus, on the other hand, does not show any large correlations). We observe a slight tendency towards an increased difference between the two methods for the slowest detector, hinting at an increased correlation between the motions. Overall, these results indicate that the MD error analysis can be improved by averaging over the copies without having to account for covariance between copies, avoiding the computationally demanding step of performing a multitude of different 1 μs MD trajectories with different starting conditions.

Fig. S20: Cp149 MD error statistics using method 1 (correlation unbiased) and method 2 (correlation biased) discussed in section S1.3, for the detector sensitivities in Fig. S17(a).
Inverse Laplace transforms and detector profile optimization have been performed over the correlation time interval τc = [10^-14, 10^-4] s.
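The behaviour of the two estimators of section S1.3 can be verified numerically with synthetic data in which the correlation between copies is tunable (N, μ, σ and the correlation coefficient rho below are hypothetical stand-ins, not values from the trajectory; both estimators are scaled to the variance of the copy average so that they are directly comparable):

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu, sigma = 60, 0.05, 0.01   # hypothetical copy statistics

def two_methods(rho, trials=2000):
    """Compare the two variance estimators for inter-copy correlation rho."""
    # copies share pairwise correlation rho; the two halves are independent
    cov = sigma**2 * ((1 - rho) * np.eye(N) + rho * np.ones((N, N)))
    v1 = v2 = 0.0
    for _ in range(trials):
        x = rng.multivariate_normal(np.full(N, mu), cov, size=2)  # 2 halves x N copies
        m = x.mean(axis=1)                      # method 1: mean over copies,
        v1 += 0.5 * (m[0] - m[1]) ** 2          # then variance over the two halves
        v2 += x.var(axis=1, ddof=1).mean() / N  # method 2: variance over copies, then
    return v1 / trials, v2 / trials             # average over halves (scaled by 1/N)

var_eq = two_methods(rho=0.0)    # uncorrelated copies: the two methods agree
var_corr = two_methods(rho=0.5)  # positive correlation: method 2 is smaller
```

Method 1 captures the covariance between copies in the variance of the copy average, while method 2 ignores it, which is exactly the difference exploited in Fig. S20.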

S1.4 Estimation of the total MD error at different timescales
The total estimated error on the quantitative MD prediction of amplitudes of motion at different timescales (grey shaded area in Fig. 3) was constructed by combining the two effects described above (S1.1 and S1.2).
The general procedure for the error estimation is sketched in Fig. S21(a). We start by extracting, for each of the 60 copies, a correlation function for the first and second half of the trajectory as well as for the full trajectory. Each correlation function is inverse Laplace transformed with 5 logarithmically spaced regularization parameters in the range [0, 10^5]. Note that this range corresponds to regularization parameters which lead to a good fit of the input correlation function and therefore represent valid solutions of the inverse-Laplace-transform problem (as shown in Fig. S16(a)). The result comprises 3 (1st half/2nd half/full) × 60 (copies) × 5 (λ) = 900 distributions of motion. We then extract detector responses for all distributions of motion and perform an average over all 60 copies for each regularization parameter and for the 1st/2nd half and the full trajectory. This leaves us with 3 × 5 = 15 averaged responses. We have shown in section S1.3 that it is possible to perform such an averaging over copies without having to account for strong correlations between copies. For each regularization parameter we define a grey shaded area, ranging from the minimum to the maximum of the averaged detector responses for the 1st/2nd half and the full trajectory. This represents the portion of the MD error that accounts for the non-equilibration of slower motions within the length of the trajectory (see section S1.2). The grey shaded area in Fig. S21 shows this portion of the MD error for λ = 0. As can be seen, the non-equilibration error grows systematically towards slower motions. The blue shaded area in Fig. S21 is created by taking the minimum and maximum values of the averaged detector responses over the grey shaded areas calculated for all regularization parameters λ.
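The construction of the two nested error bands can be sketched as follows, with random stand-ins for the copy-averaged detector responses of one detector over a few residues (all numerical values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
lams = [0.0, 1e2, 1e3, 1e4, 1e5]    # hypothetical regularization grid in [0, 10^5]
n_res = 8                           # hypothetical number of residues
# for each lambda: copy-averaged responses of (1st half, 2nd half, full trajectory)
avg = {lam: rng.normal(0.05, 0.005, size=(3, n_res)) for lam in lams}

# grey band: non-equilibration error, min/max over 1st/2nd/full at lambda = 0
grey_lo, grey_hi = avg[0.0].min(axis=0), avg[0.0].max(axis=0)
# blue band: total error, min/max over all lambda (always contains the grey band)
stacked = np.concatenate([avg[lam] for lam in lams])
blue_lo, blue_hi = stacked.min(axis=0), stacked.max(axis=0)
```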
For this reason, the grey shaded area for λ = 0 shown in Fig. S21 is always contained in the blue shaded area, and the difference between the blue and grey shaded areas accounts for the portion of the MD error that can be attributed to the instability of the detector responses with respect to the choice of the regularization parameter. As illustrated in section S1.1, this second error source remains, however, much smaller. Fig. S22 estimates the MD error according to the same procedure for the three detectors constructed from the experimental conditions and used for the analysis in the main text. As already seen in Fig. S18, due to slower motions being ill-defined based on the time points available in the trajectory, and due to the narrowness of the slowest detector caused by the end of the trajectory, the regularization error (blue shaded) is substantially larger for the slower motions approaching 1 μs, thereby becoming an important contribution to the total error.