CryoETGAN: Cryo-Electron Tomography Image Synthesis via Unpaired Image Translation

Cryo-electron tomography (Cryo-ET) has been regarded as a revolution in structural biology and can reveal molecular sociology. Its unprecedented quality enables it to visualize cellular organelles and macromolecular complexes at nanometer resolution in their native conformations. Motivated by developments in nanotechnology and machine learning, establishing machine learning approaches such as classification, detection, and averaging for Cryo-ET image analysis has inspired broad interest. Yet, deep learning-based methods for biomedical imaging typically require large labeled datasets for good results, which poses a great challenge due to the expense of obtaining and labeling training data. To deal with this problem, we propose a generative model to simulate Cryo-ET images efficiently and reliably: CryoETGAN. This cycle-consistent and Wasserstein generative adversarial network (GAN) is able to generate images with an appearance similar to the original experimental data. Quantitative and visual grading of the generated images shows that our method outperforms previous state-of-the-art simulation methods. Moreover, CryoETGAN is stable to train and capable of generating plausibly diverse image samples.


INTRODUCTION
Cryo-electron tomography (Cryo-ET) has emerged as a powerful 3D imaging tool with unprecedented quality in capturing the structural and spatial organization of macromolecules inside single cells. Analysis of macromolecules in a Cryo-ET image (i.e., a tomogram, usually of size 6,000 × 6,000 × 1,500 voxels) is done at the subtomogram level. A subtomogram is a small 3D cubic sub-image extracted from a tomogram that generally contains one macromolecule. Deep learning-based classification has been successfully applied to Cryo-ET subtomogram identification and has achieved high accuracy. Plenty of previous works have been devoted to separating the structurally highly heterogeneous macromolecules captured in Cryo-ET data into structurally homogeneous subgroups (Bartesaghi et al., 2008; Scheres et al., 2009; Alber, 2011, 2012; Chen et al., 2014; Bharat et al., 2015; Che et al., 2018). Nevertheless, the main bottleneck for these deep learning methods is the lack of training data. Since subtomogram datasets may be collected under different experimental conditions, directly applying the knowledge learned from one dataset to another degrades performance (e.g., classification accuracy) due to domain shift. Therefore, part of each dataset must be manually labeled in order to predict the rest of the data, which is a highly time-consuming process. To automate this process and reduce domain shift, training the network on realistically generated subtomogram datasets becomes an ideal approach: simulation can provide an unlimited number of training instances with pre-specified labels.
Conventional image simulation methods for Cryo-ET use atomic models from the Protein Data Bank (PDB) (Bernstein et al., 1977), with a specified resolution and voxel spacing together with low-pass filtering. Gaussian-distributed noise and a modulation transfer function (MTF) are applied to model realistic electron optical effects and match a specified signal-to-noise ratio (SNR). Random rotation and translation operations are performed to synthesize more samples. Yet, simulating realistic data presents challenges due to the high degree of structural complexity, irregular noise, and tomographic distortions. Neural networks trained on such simulated data perform poorly when tested on experimental data. By inferring from real image data, machine learning methods can potentially overcome common restrictions such as infeasible interactive use and substantial computational resources.
The recent explosion of the Generative Adversarial Networks (GANs) field has shown great success in tasks such as image synthesis and image-to-image translation (Yang et al., 2017; Schlemper et al., 2018; Seitzer et al., 2018; Wang et al., 2019, 2021; Guo et al., 2020; Yuan et al., 2020; Chen J. et al., 2021; Jiang et al., 2021; Li et al., 2021; Lv et al., 2021a,b,c). Recent advances have used GANs to formulate biomedical image simulation as an image-to-image translation task, arousing wide interest in the biomedical area (Bi et al., 2017; Calimeri et al., 2017; Nie et al., 2017; Wolterink et al., 2017; Zhao et al., 2017; Liu et al., 2021a,b). In most cases, 3D images do not have paired data; as a result, learning from unpaired data becomes crucial. The cycle-consistent generative adversarial network (Zhu et al., 2017) successfully performed unpaired image-to-image translation, requiring only two unpaired datasets while preserving semantics. In the same spirit, we formulate a framework called CryoETGAN to simulate subtomograms indistinguishable from real data, given structures from density maps, which show the electron density occupancy and distribution of a particle (Kaur et al., 2021). We conduct experiments to demonstrate the effectiveness of our method qualitatively and quantitatively. The generated datasets can serve as training datasets for future subtomogram studies.
We are the first to propose an image translation based simulation method for Cryo-ET 3D images. Although image translation has been used to simulate cryo-EM 2D images (Gupta et al., 2020b, 2021; Miolane et al., 2020), these methods are not directly comparable to ours, as 3D Cryo-ET and 2D Cryo-EM images capture different kinds of information. One prior work applying GANs in a related space is Gupta et al. (2020a), in which a GAN is trained to perform single-particle cryogenic electron microscopy (Cryo-EM) reconstruction given a large number of Cryo-EM images. We note this work differs in many aspects, including the task and the nature of the data. First, Gupta et al. (2020a) trains a generative simulator using many Cryo-EM images of a specific particle, not a general image-to-image translation model. In addition, 2D single-particle Cryo-EM images and 3D cryo-electron tomography (Cryo-ET) images are different media: single-particle Cryo-EM typically uses noisy images of many copies of a macromolecular structure, while Cryo-ET operates on a single cell sample (Marx, 2018). As noted in Marx (2018), Cryo-ET shines where it is not feasible to make "tens of thousands" of copies of a structure of interest, and has led to discoveries such as Basler et al. (2012). In essence, Gupta et al. (2020a) solves an important but distinct task in a related field.
Thus, our main contributions are as follows:
1. We propose the use of a GAN-based image translation method to augment the training datasets of Cryo-ET models using density maps.
2. We develop a GAN framework to robustly generate diverse Cryo-ET images from density maps, proposing several architectural modifications that incorporate priors on Cryo-ET data to stabilize training.
3. We demonstrate the effectiveness of these techniques on traditional metrics of generative model performance as well as on downstream classification performance.

MATERIALS AND METHODS
Our proposed framework for Cryo-ET image synthesis, CryoETGAN, is presented in Figure 1. In the following sections, we elaborate on CryoETGAN and its network architecture, starting with preliminary details.

Formulation
We first introduce our notation. Macromolecular complexes and cellular components, which can be extracted from tomograms of cells using template-free methods such as Difference of Gaussians, are densely packed in small 3D volumes of cubic shape (the 3D analog of a 2D image patch). These experimental subtomograms form one domain, denoted S. The other domain contains density maps simulated from protein structures using EMAN2 (Tang et al., 2007), an image processing package with a focus on single-particle reconstruction; these density maps form the domain denoted D. Our goal is to learn two mapping functions, G_ds : D → S and G_sd : S → D. The generators are guided by the discriminators to learn the mappings between subtomograms and density maps so as to preserve edges and details.
As shown in Figure 1, our CryoETGAN model has four main components: two generators, G_ds and G_sd, which capture the data distributions of the two domains, and two discriminators, D_A and D_B, which estimate the probability that a given sample comes from the experimental dataset rather than being generated. Discriminator D_A aims to distinguish experimental subtomograms from generated ones, and D_B aims to discriminate experimental density maps from generated ones. The two generators are trained to produce realistic data that fools the adversarially trained discriminators D_A and D_B. The training loss of CryoETGAN contains three types of terms: an adversarial loss for matching the distribution of generated data to the corresponding D or S domain; a cycle-consistency loss to ensure that images generated in the target domain can be mapped back to the source domain, enabling the mapping between the two domains; and a Wasserstein loss to prevent mode collapse.

FIGURE 1 | Overview of CryoETGAN: with adversarial loss, cycle-consistency loss, and Wasserstein loss, our method is capable of learning the mapping between domains S and D with unpaired data.

Adversarial Loss
Adversarial losses are applied in both mapping directions. Given the data distributions s ∼ p_data(s) and d ∼ p_data(d), the generators define the distributions of the generated samples G_ds(d) and G_sd(s). For the generator G_ds : D → S and its discriminator D_A, the objective is defined as:

L_adv(G_ds, D_A) = E_{s∼p_data(s)}[log D_A(s)] + E_{d∼p_data(d)}[log(1 − D_A(G_ds(d)))].

In this setting, we train the generators G_ds, G_sd and the discriminators D_A, D_B together. Without paired data, we conduct min-max training between the generators and discriminators. Ideally, the image G_ds(d) generated by G_ds will be visually similar to images in the S domain, while the discriminators distinguish generated images from real images. Similarly, the adversarial loss for the mapping function G_sd : S → D and its discriminator D_B is defined as:

L_adv(G_sd, D_B) = E_{d∼p_data(d)}[log D_B(d)] + E_{s∼p_data(s)}[log(1 − D_B(G_sd(s)))].
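As a minimal NumPy sketch (illustrative only, not the training implementation; the function and variable names are our own), the discriminator and non-saturating generator objectives for one direction can be computed from discriminator probabilities as:

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Adversarial objectives for one translation direction.

    d_real: discriminator probabilities D_A(s) on experimental subtomograms.
    d_fake: discriminator probabilities D_A(G_ds(d)) on generated samples.
    Returns (discriminator_loss, generator_loss); the generator uses the
    non-saturating form -log D_A(G_ds(d)).
    """
    disc_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    gen_loss = -np.mean(np.log(d_fake + eps))
    return float(disc_loss), float(gen_loss)
```

As the generator fools the discriminator better (d_fake closer to 1), its loss decreases, which is the gradient signal driving the min-max game.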

Cycle Consistency Loss
To further guarantee that the mapping functions map an input d_i to its desired output s_i, and s_i back to d_i, we follow Zhu et al. (2017) and use a cycle-consistency loss that forces the translation cycle to reconstruct the original image, i.e., d → G_ds(d) → G_sd[G_ds(d)] ≈ d. Similarly, for each image s from domain S, G_sd and G_ds should make the reconstructed image G_ds[G_sd(s)] identical to the input s. The cycle-consistency loss is written as:

L_cyc(G_ds, G_sd) = E_{d∼p_data(d)}[||G_sd(G_ds(d)) − d||_1] + E_{s∼p_data(s)}[||G_ds(G_sd(s)) − s||_1].
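A toy NumPy sketch of the cycle-consistency term (names are illustrative; the toy mappings below stand in for trained generators):

```python
import numpy as np

def cycle_consistency_loss(d_batch, s_batch, G_ds, G_sd):
    """L1 cycle loss: d -> G_ds(d) -> G_sd(G_ds(d)) should recover d,
    and s -> G_sd(s) -> G_ds(G_sd(s)) should recover s."""
    d_rec = G_sd(G_ds(d_batch))
    s_rec = G_ds(G_sd(s_batch))
    return float(np.mean(np.abs(d_rec - d_batch)) + np.mean(np.abs(s_rec - s_batch)))

# Toy mappings that are exact inverses of each other give zero cycle loss.
G_ds_toy = lambda x: x + 1.0
G_sd_toy = lambda x: x - 1.0
```

With trained generators the loss is not exactly zero, but minimizing it pushes the two mappings toward being inverses of each other.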

Wasserstein Loss
During preliminary testing, density maps were frequently mapped to the same pose and the same subtomogram appearance. Moreover, the standard discriminator loss uses cross-entropy and suffers from vanishing gradients. Instead of the Jensen-Shannon divergence, Wasserstein GAN (Arjovsky et al., 2017) adopts the Earth Mover's distance to measure the distance between real and generated samples:

W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x,y)∼γ}[||x − y||],

where, following the notation of Arjovsky et al. (2017), Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are P_r and P_g, and γ(x, y) indicates how much mass must be transported from x to y in order to transform the distribution P_r into P_g. In practice, this is accomplished by replacing the discriminator with a critic whose loss is the difference between its predictions on real and generated images (and the negated version for the generator), and by constraining the critic to be 1-Lipschitz. Inspired by Wasserstein GAN, we adopted the following improvements to deal with the mode collapse problem in adversarial training and to achieve more stable results.
• Clip the weights of the discriminator D.
• Use RMSProp instead of Adam.
• Use a lower learning rate; the rate in this paper is α = 0.0005.
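These three adjustments can be sketched in NumPy as follows (a hedged illustration with our own helper names, not the actual training loop):

```python
import numpy as np

def critic_loss(c_real, c_fake):
    """Wasserstein critic objective: the critic maximizes
    E[c(real)] - E[c(fake)], so its loss is the negation."""
    return -(np.mean(c_real) - np.mean(c_fake))

def clip_weights(weights, c=0.01):
    """Clip every weight tensor to [-c, c] to (crudely) keep the critic
    within a family of Lipschitz-bounded functions."""
    return [np.clip(w, -c, c) for w in weights]

def rmsprop_step(w, grad, cache, lr=5e-4, decay=0.9, eps=1e-8):
    """One RMSProp update (used instead of Adam, with lr = 0.0005)."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache
```

A training step would compute `critic_loss`, apply `rmsprop_step` to each parameter, and then call `clip_weights` on the critic parameters.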

Mode Collapse
Mode collapse refers to the scenario in which the generator produces similar data every time yet is still able to successfully fool the discriminator. We pass random noise vectors to the generator in order to deal with mode collapse. To learn the distribution over subtomograms, the generator builds a mapping function from density maps to subtomograms. Between the convolutional layers and the deconvolutional layers, we concatenate a noise vector to the feature map so that the generator can produce different patterns according to the noise. On the other side of the cycle translation, the second generator builds a mapping function from subtomograms to density maps.
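The noise-concatenation step can be sketched as follows (a NumPy illustration assuming channels-first 3D feature maps; the function name is ours):

```python
import numpy as np

def concat_noise(features, sigma=1.0, rng=None):
    """Concatenate a one-channel Gaussian noise volume to the bottleneck
    feature map (channels-first layout: [batch, channels, x, y, z]) so the
    generator can produce varied subtomograms from the same density map."""
    rng = np.random.default_rng() if rng is None else rng
    b, _, x, y, z = features.shape
    noise = sigma * rng.standard_normal((b, 1, x, y, z))
    return np.concatenate([features, noise], axis=1)
```

For example, a 128-channel bottleneck becomes a 129-channel input to the upsampling path, and resampling the noise yields a different generated subtomogram for the same density map.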

Full Objective
Given the formulations of the adversarial loss, cycle-consistency loss, and Wasserstein loss above, our full objective is:

L(G_ds, G_sd, D_A, D_B) = L_adv(G_ds, D_A) + L_adv(G_sd, D_B) + λ L_cyc(G_ds, G_sd),

where λ adjusts the importance of the cycle-consistency objective.
Solving the min-max optimization problem has long been known to be a challenging task. Previous work proposed carefully designed network architectures and objective functions to achieve good performance; we adopt the spectral normalization layer proposed by Miyato et al. (2018) to normalize weights, regulating the scale of feature responses and stabilizing the training process.

Architecture
Following the notation of the CycleGAN paper (Zhu et al., 2017), the generator architecture is c7s1-d32, d64, d128, R128, R128, R128, R128, R128, R128, u64, u32, c7s1-u1. The output after downsampling is concatenated along the channel dimension with a one-channel Gaussian noise volume of the same spatial shape, so the input to the u32 layer has 129 channels. Here dk denotes a 3 × 3 × 3 convolution with k filters and stride 2 followed by instance normalization and ReLU; uk denotes a fractionally strided (stride 1/2) convolution with k filters, instance normalization, and ReLU; and Rk is a residual block with k filters. The last convolutional layer uses tanh without instance normalization. The discriminator has the architecture C64, C128, C256, where Ck corresponds to a 4 × 4 × 4 convolution with stride 1 followed by instance normalization and a leaky ReLU with slope 0.2. Spectral normalization is applied to each convolutional layer of the discriminator.
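Assuming that c7s1 layers preserve spatial size while dk layers halve it (stride 2) and uk layers double it (stride 1/2), the spatial dimensions of a 40³ input can be traced through the generator with a small sketch (illustrative only):

```python
def generator_spatial_sizes(n=40):
    """Trace the spatial size of an n^3 subtomogram through the generator:
    c7s1 layers preserve size, dk layers halve it, uk layers double it,
    and the six residual blocks R128 keep it fixed at the bottleneck."""
    sizes = [n]          # after c7s1-d32 (size-preserving)
    for _ in range(2):   # d64, d128
        n //= 2
        sizes.append(n)
    # six residual blocks R128 keep the size unchanged at the bottleneck
    for _ in range(2):   # u64, u32
        n *= 2
        sizes.append(n)
    sizes.append(n)      # c7s1-u1 (size-preserving)
    return sizes
```

For the 40³ subtomograms of dataset S_e1 this gives sizes 40 → 20 → 10 at the bottleneck and back to 40 at the output, matching the input resolution.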

Experimental Datasets
We tested CryoETGAN on two experimental datasets, S_e1 and S_e2. Dataset S_e1 contains 1,600 subtomograms of size 40³ from four classes of macromolecules: proteasome (5MPA), ribosome (5T2C), TRiC (4V94), and membrane, with 400 images per class. For the density maps, we simulated 3D noise-free density maps with EMAN2 corresponding to the subtomogram classes. The protein structures are from the Protein Data Bank (Berman et al., 2000), a database of three-dimensional structural data for large biological molecules such as proteins and nucleic acids. Dataset S_e2 contains 2,800 subtomograms from seven classes of macromolecules, extracted from the Noble Single Particle Dataset collected by Noble et al. (2018) and available from EMPIAR, with 400 subtomograms per class. Each particle consists of 28³ voxels, with a voxel size of 0.94 nm; the SNR is 0.5 and the missing-wedge angle is 30°. For each tomogram in the original set, subtomograms of size 28³ were extracted using a Difference of Gaussians (DoG) particle-picking process (Pei et al., 2016) with parameters s1 = 7.0 and k = 1.1. About 20 macromolecules were manually picked and averaged to generate a structural template, which was aligned to all extracted subtomograms to produce cross-correlation scores. We applied the template-search approach described in Zeng et al. (2018) to select the top 1,000 subtomograms according to the cross-correlation scores, from which 400 subtomograms containing macromolecule structures were manually selected for each class. In our experiments, we used 2,000 subtomograms for training and the remaining 800 for testing.

Evaluation Metrics
We use several common GAN evaluation metrics (Borji, 2019) as quality criteria for the Cryo-ET data generated in our experiments, as shown in Figures 5, 6.

Inception Score (IS)
IS was originally proposed by Salimans et al. (2016) to quantitatively evaluate the quality of generated images (shown in Equation 6). The intuition behind the Inception Score is that a well-performing generator should produce samples whose individual class distributions have low entropy, while the class distribution across all generated samples has high entropy. In our experiments, we adopted CB3D (Che et al., 2018) in place of Inception V3 to calculate an IS equivalent for Cryo-ET.
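A compact NumPy sketch of the score, computed from the class posteriors of a pretrained classifier (the function name is ours):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), computed from the class
    posteriors p(y|x) (one row per generated sample) of a pretrained
    classifier -- CB3D here, in place of Inception V3."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Confident, class-diverse predictions give a score approaching the number of classes, while uniform (uninformative) predictions give a score of 1.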

Frechet Inception Distance (FID)
FID is widely used to measure the similarity between real and generated images. Unlike IS, FID (Heusel et al., 2017) compares the distance between two multivariate Gaussian distributions, as shown in Equation 7:

FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),

where X_r ∼ N(μ_r, Σ_r) and X_g ∼ N(μ_g, Σ_g) are the 4,096-dimensional activation inputs of the CB3D model's dense layer for real and generated data, respectively. Single-value metrics such as IS and FID evaluate the generative model as a whole, yet they are not perfect for diagnostic purposes (Naeem et al., 2020). Fidelity and diversity are usually considered a trade-off in the design of generative models, representing how realistic the generated data are and how well they capture the variation in the real data (Naeem et al., 2020). We use the precision and recall metrics proposed by Sajjadi et al. (2018) to measure these two characteristics, using the same notation as Naeem et al. (2020): B(x, r) is the ball around a point x with radius r, and NND_k(X_i) is the distance from X_i to its kth-nearest neighbor, where X_i are the real embedded samples and Y_j are the fake embedded samples:

Precision = (1/M) Σ_j 1[ Y_j ∈ ∪_i B(X_i, NND_k(X_i)) ],
Recall = (1/N) Σ_i 1[ X_i ∈ ∪_j B(Y_j, NND_k(Y_j)) ].

FIGURE 2 | 2D slice visualization of generated subtomograms (Top: S_e1; Middle and Bottom: S_e2). In general, we find CryoETGAN produces subtomograms qualitatively similar to the ground truth and is capable of producing various classes without mode collapse.
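Equation 7 can be sketched in NumPy as follows (an illustrative implementation using an eigendecomposition-based matrix square root; not the exact evaluation code):

```python
import numpy as np

def fid(feat_real, feat_fake):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    with Gaussians fitted to real and generated feature activations
    (here these would be CB3D dense-layer inputs)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    s_r = np.cov(feat_real, rowvar=False)
    s_g = np.cov(feat_fake, rowvar=False)
    # matrix square root of S_r @ S_g via eigendecomposition
    vals, vecs = np.linalg.eig(s_r @ s_g)
    sqrt_prod = (vecs * np.sqrt(np.abs(vals))) @ np.linalg.inv(vecs)
    diff = mu_r - mu_g
    return float(diff @ diff + np.real(np.trace(s_r + s_g - 2.0 * sqrt_prod)))
```

Identical feature distributions give FID ≈ 0, and the score grows with both mean shift and covariance mismatch between real and generated features.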

Density
Density and coverage are proposed by Naeem et al. (2020) as alternatives to precision and recall, respectively, that are more robust to outliers. Density considers not only whether a generated sample lies close to some real sample, but how many of the spheres around real samples contain it: it counts how many real-sample neighborhoods contain each fake sample.

FIGURE 3 | 2D slice visualization of real subtomogram samples from every class (Top: S_e1; Middle and Bottom: S_e2); the order of the subtomograms corresponds to the order in Table 1.

Coverage
Coverage evaluates recall in terms of the real manifold rather than the fake manifold. This penalizes sparse coverage of the real space, where a generator could otherwise score well on recall simply by placing a few examples in some part of the real space. Coverage builds the nearest-neighbor manifolds around the real samples instead of the fake samples, since the latter tend to contain more outliers.
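Both metrics can be sketched with brute-force nearest-neighbor computations in NumPy (illustrative helper names; a real evaluation would operate on embedded features):

```python
import numpy as np

def _knn_radius(X, k):
    """Distance from each point in X to its k-th nearest neighbor in X."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself

def density(real, fake, k=3):
    """Density (Naeem et al., 2020): for each fake sample, count how many
    real-sample neighborhoods B(X_i, NND_k(X_i)) contain it, averaged and
    normalized by k."""
    r = _knn_radius(real, k)
    d = np.linalg.norm(fake[:, None] - real[None, :], axis=-1)
    return float((d < r[None, :]).sum(axis=1).mean() / k)

def coverage(real, fake, k=3):
    """Coverage: fraction of real samples whose k-NN neighborhood
    contains at least one fake sample."""
    r = _knn_radius(real, k)
    d = np.linalg.norm(real[:, None] - fake[None, :], axis=-1)
    return float((d < r[:, None]).any(axis=1).mean())
```

Because density counts multiple neighborhood memberships rather than a binary hit, a single outlier real sample cannot inflate it the way it can inflate precision.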

Classification Accuracy
Deep neural networks are able to capture both global and local information from image data. We therefore use a state-of-the-art deep learning classification model for Cryo-ET data, CB3D (Che et al., 2018), to objectively quantify the subtomograms generated from density map data. We consider this a way to interpret the generative ability of our model. Ideally, one would have both high density and high coverage. We believe these metrics, alongside classification performance, are the most relevant indicators for this model, as one density map may correspond to numerous subtomograms.
Compared to the traditional method (Bernstein et al., 1977), which yields a test classification accuracy of 19.7% on a well-trained CB3D for S_e1 and 28.9% for S_e2, our method outperforms it by achieving classification accuracies of 76.4 and 67.3%, respectively.
We believe that the coverage result being much better than the recall result is a consequence of a few factors. First, the relatively small size of the real dataset means that the original recall metric penalizes the model for generating anything except exactly the correct test-set examples; using the real manifold, as in coverage, rather than the fake manifold, as in recall, is more forgiving. Since these metrics were not developed with small real datasets in mind, and the evaluation of precision and recall for generative models is an ongoing research topic, a better metric may yet be proposed, but this is outside the scope of our article. The evaluation results are shown in Table 2.

Uncertainty Estimation
Uncertainty estimation is a common approach to checking a generative model's performance. We build on Gal and Ghahramani (2016) and use Monte Carlo dropout as an implicit representation of the underlying subnetworks to obtain an uncertainty map. In detail, we apply dropout in the generator, sample 20 outputs from the same density map, and compute the per-pixel standard deviation, which we overlay as a pixel-wise uncertainty map of the model for a given input. We then compare the results with and without dropout. In this way, we can measure the generator's uncertainty at the pixel level. The uncertainty maps are shown in Figure 4.
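The procedure can be sketched as follows (a NumPy illustration with a toy stochastic generator standing in for the dropout-equipped CryoETGAN generator; all names are ours):

```python
import numpy as np

def mc_dropout_uncertainty(generator, density_map, n_samples=20, rng=None):
    """Monte Carlo dropout uncertainty map: run the stochastic generator
    n_samples times on the same density map and take the per-voxel
    standard deviation of the outputs."""
    rng = np.random.default_rng() if rng is None else rng
    outs = np.stack([generator(density_map, rng) for _ in range(n_samples)])
    return outs.std(axis=0)

def toy_generator(x, rng, p=0.5):
    """Toy stand-in: identity with a fresh random dropout mask per call."""
    mask = rng.random(x.shape) > p
    return x * mask / (1.0 - p)
```

Voxels where the sampled subnetworks disagree get a high standard deviation, so the resulting map highlights the regions the generator is least certain about.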

Analysis of Noise Standard Deviation
In Table 3, we compare CryoETGAN's performance under various standard deviations of noise during training. The performance of CryoETGAN improved substantially when we applied zero-mean Gaussian noise to the density maps, relative to training without noise. From Figures 5, 6, we can see improvements in Inception Score and faster convergence in Frechet Inception Distance. The results also show that the Wasserstein loss and spectral normalization significantly improved performance.

Analysis of Model and Loss Design
We further evaluated the effect of removing the Wasserstein loss and the spectral normalization, using the four-class S_e1 dataset. We find that without the Wasserstein loss there is a clear indication of mode collapse, and without the spectral norm there is a significant penalty on downstream performance. The ablation study results are shown in Table 4.

CONCLUSION
We proposed a machine learning-based method, CryoETGAN, to synthesize Cryo-ET images, enabling realistic simulation from protein density maps consistent with Cryo-ET data. Our generated images performed competitively when used for classification training, and this approach can increase the available training data for new Cryo-ET algorithms that depend on large data collections. The generated data provide a way to investigate new methods for object detection, segmentation, domain adaptation, and related tasks. Our approach can also be extended to other multimodal nanoparticle image synthesis, such as fluorescence, soft X-ray, or tomographic imaging of the nucleoplasmic reticulum and of apoptosis in mammalian cells, serving as a way to study images and resolve tasks limited by insufficient available data.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.