DeepHEMNMA: ResNet-based hybrid analysis of continuous conformational heterogeneity in cryo-EM single particle images

Single-particle cryo-electron microscopy (cryo-EM) is a technique for biomolecular structure reconstruction from vitrified samples containing many copies of a biomolecular complex (known as single particles) at random unknown 3D orientations and positions. Cryo-EM allows reconstructing multiple conformations of the complexes from images of the same sample, which usually requires many rounds of 2D and 3D classifications to disentangle and interpret the combined conformational, orientational, and translational heterogeneity. The elucidation of different conformations is key to understanding the molecular mechanisms behind the biological functions of the complexes and to novel drug discovery. Continuous conformational heterogeneity, due to gradual conformational transitions giving rise to many intermediate conformational states of the complexes, is both an obstacle for high-resolution 3D reconstruction of the conformational states and an opportunity to obtain information about multiple coexisting conformational states at once. The HEMNMA method, specifically developed for analyzing continuous conformational heterogeneity in cryo-EM, determines the conformation, orientation, and position of the complex in each single particle image by image analysis using normal modes (the motion directions simulated for a given atomic structure or EM map), which in turn allows determining the full conformational space of the complex, but at a high computational cost. In this article, we present a new method, referred to as DeepHEMNMA, which speeds up HEMNMA by combining it with a residual neural network (ResNet) based deep learning approach. The performance of DeepHEMNMA is shown using synthetic and experimental single particle images.

Similarly, a 3 × 3 rotation matrix can be converted into a unit quaternion and a unit quaternion can be converted into a 3 × 3 rotation matrix, which forms the basis for converting quaternions back to Euler angles [49].
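The two conversions mentioned above can be illustrated with a minimal sketch. It assumes the scalar-first quaternion convention (w, x, y, z) and uses the textbook formulas, so it is not necessarily the convention or the implementation used in this work. The function names quat_to_matrix and matrix_to_quat are hypothetical.

```python
import math

def quat_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (nested lists).

    The quaternion is normalized first, so any non-zero input is accepted.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def matrix_to_quat(R):
    """Convert a 3x3 rotation matrix to a unit quaternion (w, x, y, z).

    The branch is chosen on the largest diagonal term to keep the division
    numerically safe. Note that q and -q describe the same rotation.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    if trace > 0:
        s = 2.0 * math.sqrt(trace + 1.0)
        return (0.25 * s,
                (R[2][1] - R[1][2]) / s,
                (R[0][2] - R[2][0]) / s,
                (R[1][0] - R[0][1]) / s)
    if R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])
        return ((R[2][1] - R[1][2]) / s,
                0.25 * s,
                (R[0][1] + R[1][0]) / s,
                (R[0][2] + R[2][0]) / s)
    if R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])
        return ((R[0][2] - R[2][0]) / s,
                (R[0][1] + R[1][0]) / s,
                0.25 * s,
                (R[1][2] + R[2][1]) / s)
    s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])
    return ((R[1][0] - R[0][1]) / s,
            (R[0][2] + R[2][0]) / s,
            (R[1][2] + R[2][1]) / s,
            0.25 * s)
```

A round trip such as matrix_to_quat(quat_to_matrix(q)) returns q up to an overall sign, which is expected because both signs encode the same orientation.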

SB. Comparison of the use of Euler angles and quaternions for the neural network training
Supplementary Table S1 shows the accuracy of the angular inference for the network trained using Euler angles or using quaternions. The results shown for the network using quaternions are also shown in the main text (Table 1). It can be noted that the angular errors are larger when using Euler angles than when using quaternions.
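As an illustration of how an angular error between two orientations can be measured, the sketch below computes the smallest rotation angle between two unit quaternions. This is a common choice and is given here only as an assumption. The exact distance used for Supplementary Table S1 is defined in the main text and may differ. The function name quaternion_angular_distance_deg is hypothetical.

```python
import math

def quaternion_angular_distance_deg(q1, q2):
    """Smallest rotation angle (degrees) between orientations q1 and q2.

    Quaternions are given as (w, x, y, z). The absolute value of the dot
    product makes q and -q equivalent, as they encode the same rotation.
    """
    n1 = math.sqrt(sum(c * c for c in q1))
    n2 = math.sqrt(sum(c * c for c in q2))
    dot = abs(sum(a * b for a, b in zip(q1, q2))) / (n1 * n2)
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))

def z_rotation_quaternion(angle_deg):
    """Unit quaternion for an in-plane rotation about the z axis."""
    half = math.radians(angle_deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))
```

For example, in-plane rotations of 359° and 1° differ by 358° when the angles are subtracted directly, whereas quaternion_angular_distance_deg gives 2° for the corresponding quaternions. This wrap-around effect is one generally known difficulty of angle-based parameterizations, and it is mentioned here only as background, not as the explanation established in this study.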

Supplementary Table S1
Mean and standard deviation (Std) of the distance between the inferred, ground-truth, and HEMNMA-estimated angles using a small test set of 2,000 images, after training with Euler angles or with quaternions using 14,055 images (image size: 128 × 128 pixels). The results for the use of quaternions are those shown in Table 1.

SC. Comparison of the network performance for different ResNet depths
SD. Influence of number of images, noise, CTF, in-plane rotations, and in-plane shifts on conformational learning and prediction
Supplementary Table S3 shows results of tests of the network sensitivity to the number of images used for training, noise, CTF, and in-plane rotations and shifts, when training the network to learn the conformational parameters (normal-mode amplitudes). In these tests, we trained the network with ground-truth values of parameters, to evaluate the accuracy of the network independently of HEMNMA (instead of training the network with HEMNMA-estimated parameters, which is done in the main text).
The images used in the tests shown in this section were synthesized using a similar procedure to the one described in the main text. They had uniformly-distributed random projection directions (as described in the main text). The in-plane rotations and shifts were zero in one case and uniformly randomly distributed in the other case (in the range described in the main text). The noise and the CTF were not applied in one case and applied in the other case (as described in the main text, using SNR=0.1 and -0.5 µm defocus). For these tests, we used a set of 10,000 images (size 256 × 256 pixels) and the same set after data augmentation to 20,000 images. The data augmentation was performed using the standard machine learning approach of making image copies by randomly rotating and shifting images from the original set. Each image from the set of 10,000 images was in-plane rotated using a random angle and in-plane shifted using random shifts in the range [-7,7] pixels (note that this shift range is slightly larger than the shift range used to synthesize the original images). In both cases, without and with data augmentation, we used 2,000 images for validation and 2,000 images for inference. The training was performed using the remaining 6,000 images from the set without data augmentation or using the remaining 16,000 images from the set with data augmentation. The images were not downscaled for the tests performed in this section.
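The augmentation and the data split described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: nearest-neighbor resampling with zero fill, a uniform random angle in [0°, 360°), and one augmented copy per original image. The interpolation and the software actually used are not specified in this section, and the function names rotate_shift, augment, and split_indices are hypothetical.

```python
import numpy as np

def rotate_shift(image, angle_deg, shift_xy):
    """In-plane rotate about the image center, then shift by (dx, dy) pixels.

    Uses inverse mapping with nearest-neighbor sampling; pixels that map
    outside the source image are set to zero.
    """
    n_rows, n_cols = image.shape
    cy, cx = (n_rows - 1) / 2.0, (n_cols - 1) / 2.0
    yy, xx = np.mgrid[0:n_rows, 0:n_cols].astype(float)
    # Undo the shift, then undo the rotation, to find each source pixel.
    x = xx - cx - shift_xy[0]
    y = yy - cy - shift_xy[1]
    a = np.deg2rad(angle_deg)
    src_x = np.cos(a) * x + np.sin(a) * y + cx
    src_y = -np.sin(a) * x + np.cos(a) * y + cy
    ix = np.rint(src_x).astype(int)
    iy = np.rint(src_y).astype(int)
    inside = (ix >= 0) & (ix < n_cols) & (iy >= 0) & (iy < n_rows)
    out = np.zeros_like(image)
    out[inside] = image[iy[inside], ix[inside]]
    return out

def augment(images, rng, max_shift=7.0):
    """Return the originals followed by one rotated and shifted copy of each."""
    copies = []
    for img in images:
        angle = rng.uniform(0.0, 360.0)
        shift = rng.uniform(-max_shift, max_shift, size=2)
        copies.append(rotate_shift(img, angle, shift))
    return np.concatenate([images, np.stack(copies)], axis=0)

def split_indices(n_images, rng, n_val=2000, n_test=2000):
    """Random split into training, validation, and inference index sets."""
    idx = rng.permutation(n_images)
    return idx[n_val + n_test:], idx[:n_val], idx[n_val:n_val + n_test]
```

With these defaults, split_indices(10000, rng) leaves 6,000 training images and split_indices(20000, rng) leaves 16,000, matching the set sizes given above.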
Supplementary Table S3 shows that the inference error is lower for the network trained with 16,000 images than for the network trained with 6,000 images. However, the further decrease in the inference error obtained with the network trained with 30,000 images was not significant enough to justify the large computational cost of the training (not shown here), so we performed all other experiments with synthetic AK data using at most 20,000 images.
Similar results to those shown in Supplementary Table S3 were obtained using images with the CTF defocus of -1 µm (and SNR=0.1), and slightly better results were obtained using images with SNR=0.3 (for both -0.5 µm and -1 µm defocus values). Examples of synthesized images with two SNR values and two defocus values are shown in Supplementary Figure S1, indicating that images with SNR=0.1 and the defocus of -0.5 µm have lower contrast and fewer CTF-induced oscillations near the particle edges, meaning that they hold higher-resolution structural information. In this article, we show results using images with SNR=0.1 and -0.5 µm defocus.
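To make the SNR values concrete, the sketch below adds zero-mean Gaussian noise to a noise-free image so that the ratio of signal variance to noise variance equals a requested SNR. This variance-ratio definition is an assumption made for illustration. The exact noise and CTF model used to synthesize the images is described in the main text and is not reproduced here. The function name add_noise_for_snr is hypothetical.

```python
import numpy as np

def add_noise_for_snr(image, snr, rng):
    """Add zero-mean Gaussian noise with variance var(image) / snr.

    Under the assumed definition SNR = var(signal) / var(noise), the
    returned image has the requested SNR in expectation.
    """
    noise_std = np.sqrt(image.var() / snr)
    return image + rng.normal(0.0, noise_std, size=image.shape)
```

With SNR=0.1, the noise variance is ten times the signal variance, which is why the particle is barely visible in individual images.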

Supplementary Table S3
Accuracy of normal-mode amplitudes inferred for 2,000 synthetic images (size: 256 × 256 pixels) with and without in-plane rotations, shifts, noise (SNR=0.1), and CTF (defocus -0.5 µm), after the network training with ground-truth normal-mode amplitudes (to evaluate the accuracy of the network independently of HEMNMA). The gray rows denote that the training dataset was obtained by data augmentation.

Table columns: Number of images for training | In-plane rotation

SE. Influence of image size on conformational learning and prediction
Supplementary Table S4 and Supplementary Figure S2 show accuracy of the inference of normal-mode amplitudes using the network trained with 14,055 synthetic images of 256 × 256 pixels and with these images downscaled to 128 × 128 pixels. The results obtained with the downscaled images are also shown in Table 1 and Figure 6 in the main text.
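A downscaling from 256 × 256 to 128 × 128 pixels can be sketched by averaging non-overlapping 2 × 2 pixel blocks. This is only one possible approach and is an assumption. The downscaling procedure actually used is not specified in this section. The function name downscale_by_2 is hypothetical.

```python
import numpy as np

def downscale_by_2(image):
    """Halve each image dimension by averaging non-overlapping 2x2 blocks."""
    n_rows, n_cols = image.shape
    if n_rows % 2 or n_cols % 2:
        raise ValueError("both image dimensions must be even")
    blocks = image.reshape(n_rows // 2, 2, n_cols // 2, 2)
    return blocks.mean(axis=(1, 3))
```

Halving the image size reduces the number of pixels per image by a factor of four, which lowers the memory and time needed for training.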

Table columns: Image size | Distance between inferred and ground-truth normal-mode amplitudes (mean over modes 7-9)

Supplementary Figure S2
Overlap between the inferred (red), ground-truth (black), and HEMNMA-estimated normal-mode amplitudes (blue) obtained using images of the size of 256 × 256 pixels (top row) and 128 × 128 pixels (bottom row). The results for the size of 128 × 128 pixels (bottom row) are also shown in Figure 6 but as a 3D scatter plot. Each point corresponds to an image and a molecular conformation inside it. Close points correspond to similar conformations and vice versa. See also Supplementary Table S4.

SF. FSC curves of the reconstructions in the inferred conformational space from synthetic images
Supplementary Figure S3 shows FSC curves of ten 3D reconstructions from ten regions of the conformational space shown in Figure 7. Each FSC was obtained with respect to the map simulated from the atomic model that is the centroid of the corresponding region used for the reconstruction. The reconstructed maps were neither filtered nor masked before calculating the FSC curves. The maps and the number of images used for each reconstruction are shown in Figure 7.
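For readers unfamiliar with the FSC, the following minimal NumPy sketch computes a Fourier shell correlation between two cubic maps and reads off the first shell that falls below a threshold such as 0.5 or 0.143. It uses the standard definition of the FSC with integer-radius shells and no interpolation. It is an illustration under these assumptions, not the software used to produce Supplementary Figure S3, and the function names fsc_curve and resolution_at are hypothetical.

```python
import numpy as np

def fsc_curve(vol1, vol2):
    """Fourier shell correlation between two cubic volumes of equal size.

    Returns (shells, fsc), where shells are integer frequency radii from
    0 to n // 2 and fsc is the normalized cross-correlation per shell.
    """
    n = vol1.shape[0]
    f1 = np.fft.fftn(vol1)
    f2 = np.fft.fftn(vol2)
    freq = np.fft.fftfreq(n) * n  # integer frequency indices
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int).ravel()
    n_shells = n // 2 + 1
    keep = shell < n_shells  # ignore the corners beyond Nyquist
    cross = np.bincount(shell[keep],
                        weights=(f1 * np.conj(f2)).real.ravel()[keep],
                        minlength=n_shells)
    p1 = np.bincount(shell[keep], weights=(np.abs(f1) ** 2).ravel()[keep],
                     minlength=n_shells)
    p2 = np.bincount(shell[keep], weights=(np.abs(f2) ** 2).ravel()[keep],
                     minlength=n_shells)
    denom = np.sqrt(p1 * p2)
    fsc = np.divide(cross, denom, out=np.zeros_like(cross), where=denom > 0)
    return np.arange(n_shells), fsc

def resolution_at(fsc, threshold, box_size, pixel_size):
    """Resolution (same unit as pixel_size) at the first shell below threshold.

    Returns None if the curve never drops below the threshold.
    """
    below = np.nonzero(fsc < threshold)[0]
    if below.size == 0 or below[0] == 0:
        return None
    return box_size * pixel_size / below[0]
```

Comparing a map with itself gives an FSC of 1 in every shell, so resolution_at returns None in that case. For a reconstruction compared with a simulated reference map, the curve typically decreases toward high spatial frequencies.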
Supplementary Figure S3
FSC curves of ten 3D reconstructions from the corresponding ten regions of the conformational space shown in Figure 7, with respect to the maps simulated from the atomic-model centroids of the regions used for the reconstruction. The intersections of the FSC curves with FSC=0.5 and FSC=0.143 are also shown.

Supplementary Tables S5-S7 show the wall-clock times needed for HEMNMA estimation, network training, and network inference using the synthetic data and 3 normal modes in the experiment shown in the main text. Note that the times in these tables are those of using one CPU core or one GPU card and should be multiplied by the number of CPU cores or GPU cards, respectively. Also, note that the time of HEMNMA is the time needed to estimate all parameters (normal-mode amplitudes, angles, and shifts), whereas the time of the network is the time needed for one type of parameters (normal-mode amplitudes, angles, or shifts) and should be multiplied by 3 for the 3 types of parameters.

SH. Conformational space of experimental cryo-EM data of yeast 80S ribosome-tRNA complexes (EMPIAR-10016)
Supplementary Figure S4 shows the 2D conformational space obtained for the EMPIAR-10016 dataset, by PCA of the normal-mode amplitudes inferred from 12,095 images. It also shows two selected groups of images in this space, which were used for the 3D reconstructions shown in Figure 8A (4,741 images) and Figure 8B (4,219 images). The groups of images were selected automatically using logical operators on the coordinates of the two principal axes, which excludes some points that are far away from the majority and some points that are in the middle of the point cloud (the region with the coordinates [-100,100] on the principal axis 1 is excluded to get a clearer difference between the two 3D reconstructions from the selected groups of images). Such image grouping was done to demonstrate the reconstruction of two different average conformations of the ribosome from this space and to compare these reconstructions with those obtained based on the EMPIAR-10016 FREALIGN classification (Figure 8).
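The projection and the grouping by logical operators described above can be sketched with NumPy as follows. The PCA is computed by singular value decomposition of the centered amplitude matrix (one row per image). The coordinate ranges in the usage note are placeholders: only the exclusion of [-100,100] on principal axis 1 is taken from the text, and all other bounds are hypothetical, as are the function names pca_2d and select_in_box.

```python
import numpy as np

def pca_2d(amplitudes):
    """Project normal-mode amplitudes (n_images x n_modes) onto the first
    two principal axes and return an (n_images x 2) coordinate array."""
    centered = amplitudes - amplitudes.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def select_in_box(coords, pc1_range, pc2_range):
    """Boolean mask of points lying inside both coordinate ranges.

    The ranges are combined with a logical AND, so points far from the
    majority or in an excluded central band are left out.
    """
    pc1, pc2 = coords[:, 0], coords[:, 1]
    return ((pc1 >= pc1_range[0]) & (pc1 <= pc1_range[1]) &
            (pc2 >= pc2_range[0]) & (pc2 <= pc2_range[1]))
```

For instance, two groups on either side of the excluded band could be obtained with select_in_box(coords, (-900, -100), (-300, 300)) and select_in_box(coords, (100, 900), (-300, 300)), where the bounds on principal axis 2 and the upper bound of the second group are illustrative values only. The images selected by each mask would then be passed to a separate 3D reconstruction.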
Supplementary Figure S5 shows the 2D conformational space obtained by PCA of a combined set of normal-mode amplitudes inferred from 12,095 images and normal-mode amplitudes estimated by HEMNMA from 10,000 images (the total number of images: 22,095 images). It also shows two selected groups of images in this space, which were used for the 3D reconstructions shown in Figure 8E (7,870 images) and Figure 8F (6,682 images). The merging of the inferred and HEMNMA-estimated normal-mode amplitudes was done to show the improvement of the 3D reconstructions with an increase in the number of images (in particular in the region where the additional tRNA is expected, Figure 8E).
Supplementary Figure S4
Two-dimensional conformational space for the EMPIAR-10016 dataset (cryo-EM single particle images of yeast 80S ribosome-tRNA complexes) obtained by principal component analysis of normal-mode amplitudes inferred from 12,095 images, with panels A and B showing two selected groups of images (yellow) used for the 3D reconstructions shown in Figure 8A (4,741 images) and Figure 8B (4,219 images), respectively. The groups of images were selected automatically using logical operators on the coordinates of the two principal axes (principal axis 1: [-900, -100]