Applications and Challenges of Machine Learning to Enable Realistic Cellular Simulations

In this perspective, we examine three key aspects of an end-to-end pipeline for realistic cellular simulations: reconstruction and segmentation of cellular structures; generation of cellular structures; and mesh generation, simulation, and data analysis. We highlight some of the relevant prior work in these distinct but overlapping areas, with a particular emphasis on current use of machine learning technologies, as well as on future opportunities.


INTRODUCTION
Machine learning (ML) approaches, including both traditional and deep learning methods, are revolutionizing biology. Owing to major advances in experimental and computational methodologies, the amount of data available for training is rapidly increasing. The timely convergence of data availability, computational capability, and new algorithms is a boon for This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. biophysical modeling of subcellular and cellular scale processes such as biochemical signal transduction and mechanics [1]. To date, many simulations are performed using idealized geometries that allow for the use of commonly used techniques and software [2][3][4][5][6]. This is historically due to the lack of high-resolution structural data as well as the theoretical and computational challenges for simulations in realistic cellular shapes, due to the complexity of generating high-quality, high-resolution meshes for simulation, and the need to develop specialized fast numerical solvers that can be used with very large unstructured mesh representations of the physical domain.
As biophysical methods have improved, the complexity of our mathematical and computational models is steadily increasing [7][8][9]. A major frontier for physics-based study of cellular processes will be to simulate biological processes in realistic cell shapes derived from various structural determination modalities [10,11]. For biological problems ranging from cognition to cancer, it has long been understood that cell shapes are often correlated with mechanism [12][13][14][15][16]. Despite such clear correlations, there remain gaps in our understanding of how cellular ultrastructure contributes to cellular processes and the feedback between cellular structure and function. Challenges such as the diffraction limit of light and difficulties in manipulation of intracellular ultrastructure constrain the potential scope of what can be achieved experiments. Much like the partnership between biophysics and molecular dynamics simulations have enabled the modeling of invisible protein motions to shed insights on experimental observations, simulations of cellular processes can also aid in the validation and generation of hypothesis currently inaccessible by experimental methods. Recently, we and others have shown that, for example, cell shape and localization of proteins can impact cell signaling [4,5,12,17,18].
The major bottleneck for the widespread use of cell scale simulations with realistic geometries is not the availability of structural data. Indeed, there exist many threedimensional imaging modalities such as confocal microscopy, multiphoton microscopy, super-resolution fluorescence and electron tomography [19,20]. For example, automation of modalities such as Serial Block-Face Scanning Electron Microscopy is already enabling the production of data at rapid rates. The bottleneck lies in the fact that much of the data generated from these imaging modalities need to be manually curated before it can be used for physics-based simulations. This current status quo of manually processing and curating these datasets for simulations is a major obstacle to our progress. In order to bridge the gap between abundance of cellular ultrastructure data generated by 3D electron microscopy (EM) techniques and simulations in these realistic geometries, innovations in machine learning (ML) methods will be necessary to reduce the time it takes to go from structural datasets to initial models. There are already many similar efforts at the organ/ tissue and connectomics length scales [21][22][23]. In this work, we summarize the main steps necessary to construct simulations with realistic cellular geometries ( Figure 1) and highlight where innovation in ML efforts are needed and will have significant impacts. We further discuss some of the challenges and limitations in the existing methods, setting the stage for new innovations for ML in physics-based cellular simulations.

SOURCES OF ERROR IN IMAGING MODALITIES
Images generated by the various microscopy modalities must undergo pre-processing to correct for errors such as uneven illumination or background noise [25,26]. The choice of suitable algorithms for error correction depends on multiple factors, some of which are listed here-the length scale of the experiment being conducted, scalability and reproducibility of the experiment, optical resolution of the microscope, sensitivity of the detector, specificity of the staining procedure, imaging mode (2D, 3D, 3D time-series), imaging modality (fluorescence, EM, ET etc.,) and other imaging artifacts like electronic noise, lens astigmatism, mechanical tilting/vibration, sample temperature, and discontinuous staining [25][26][27][28]. These sources of error are an important consideration for model implementation further downstream [6].
Electron tomography (ET) remains one of the most popular methods of cell imaging for modeling purposes [29][30][31], as it retains the highest resolution of all the 3D cell imaging techniques [26] by reconstructing a 3D object from a series of 2D images collected at different tilt angles [32]. However, images from ET also have a low signal to noise ratio (SNR) and have anisotropic resolution (for example, 1 nm resolution in x, y and 10 nm resolution in z) [25]. This is partly because biological samples can withstand only a limited dose of electron beam radiation (SNR is proportional to the square root of the electron beam current) before the specimen is damaged [33]. Other sources of error such as misalignment of projections and missing wedges from an incomplete tilt angular range can significantly affect the quality of the reconstruction. To work with data such as these, image processing steps are required for high resolution 3D reconstruction [25,34]. Commonly used software packages for image processing such as IMOD [35] and TomoJ [36] use reconstruction algorithms such as Weighted Backprojection (WBP) and Simultaneous Iterative Reconstruction Technique (SIRT). While these have been very effective at reconstruction, sources of error can still accumulate, leading to further manual adjustment [37].

RECONSTRUCTION OF CELLULAR STRUCTURES
Given a noisy 3D reconstruction, how can we segment cellular structures of interest? One approach is to employ manual segmentation tools applied to 3D tomograms such as XVOXTRACE [32,38], and more generally, manual contouring, interpolation, and contour stacking ( Figure 1). The advantage of such methods is that the human eye performs exceptionally well at detecting objects in an image [27,39]. Consequently, semi-manual and manual segmentation are widely adopted, favoring accuracy over efficiency. However, such methods can be extremely tedious and not always reproducible. Alternatively, numerous semi-automatic segmentation algorithms such as interpolation, watershed, thresholding, and clustering are available as plugins in software packages like IMOD [35] and ImageJ [40] (Figure 1, classical). However, the accuracy of open platform algorithms is debatable [41] because of two main reasons-(i) Even with a "perfect" ET reconstruction (no tilt misalignment, no missing wedge, no noise), the application of filtering algorithms like Gaussian blur or non-linear anisotropic diffusion (NAD) [42] can cause artifacts that lead to misclassifications, rendering the image unsuitable for downstream quantitative simulations and analysis. (ii) Segmentation workflows are often designed for a specific structure and/or imaging modality, limiting their generalizability and applicability.
Annual cell segmentation challenges are evidence of the demand for automatic segmentation [43,44], with many of its past winners responding with ML-based programs [45,46]. Training labels for ML techniques requires a relatively small percentage (as small as 10%) of manually segmented labels, allowing for very large data sets to be processed significantly faster than previously implemented semi-automatic segmentation methods. The most successful teams utilized ML techniques such as random forest classifiers, support vector machines, or a combination of these to get segmentations comparable or often even better than their human counterparts [43][44][45][46] (Figure 1, machine learning). These techniques function by imputing several image features such as noise reduction, and texture and edge detection filters [47]. These filters are then used to train a classification algorithm in an interactive manner, achieving better classification accuracy at the cost of increased training time compared to the direct application of a filter. However, because the algorithm is interactive, it still requires manual input and both the training time and accuracy can depend on the user.
More recently, deep learning-based ML algorithms (Figure 1, deep learning), and more specifically, convolutional neural networks (CNNs) have surged in popularity due to the success of AlexNet in the ImageNet classification challenge [48]. CNNs are complex learnable non-linear functions that do not require the imputation of data-specific features. Indeed, CNNs learn the feature mapping directly from the image. The U-Net convolutional neural network architecture [46] further generalized deep learning, winning the ISBI neuronal structure segmentation challenge in 2015 with a quicker speed and with fewer training images. It functions by using the feature mapping imputed by a CNN to map the classification vector back into a segmented image. Such is the achievement of the U-Net that its variants are now the state-of-the-art in tasks like calling genetic variation from gene-sequencing data [49], brain tumor detection [50] and segmentation of medical image datasets [51]. However, such deep learning based methods have their own challenges. They require both quality and quantity of annotated training data, significant amount of training time, graphics processing unit computing, and can generalize poorly to a different dataset.
Both the difficulty and cost of generating annotated training data increases exponentially when dealing with Volumetric (3D) images compared with 2D, which are the desired inputs for biophysical simulations. Since the U-Net is a 2D architecture [46], it cannot be applied directly to 3D images without modifications. To this end, 3D U-net used sparsely annotated 2D slices to generate volumetric segmentations of brain tumors [52]. Similarly, VoxRestNet [53] introduced residual learning using ResNet [54], a deep residual network capable of training hundreds to thousands of layers without a performance drop, to a voxelwise representation of 3D magnetic resonance (MR) images of the brain, paving the way for scalable 3D segmentation.
Excitingly, such algorithms are being made openly accessible and easy-to-use. For example, iLastik [45,55] and Trainable Weka Segmentation [47] are both available as plugins in software packages like ImageJ. These tools provide an interactive platform for segmentation, employing supervised classification techniques like random forests as well as unsupervised clustering such as K-means [47]. Similarly, deep learning tools such as DeepCell [56] and U-Net [46,57] are also available in various bioimage software packages. Other stand-alone tools like the Allen Cell Structure Segmenter provide a lookup table of 20 segmentation workflows that feed into an iterative deep learning model [58]. Cloud compute based segmentation plugins like CDeep3M [59] leverage Amazon Web Services (AWS) images to provide an efficient and compute-scalable tool for both 2D and 3D biomedical images.
Generating well-organized and annotated training data continues to be the major challenge for most ML segmentation methods. Crowdsourced annotation tools like Amazon's Mechanical Turk can be useful in this context, but are still limited by the difficulty of training naive users on tracing specific structural images. Alternatively, many ML algorithms leverage transfer learning approaches using pre-trained networks such as VGGnet [60][61][62], AlexNet [48], and GoogleNet [63]. In fact, popular semantic segmentation and clustering networks like Fully Convolutional Networks (FCN) [64] and DECAF [65] are themselves implemented using transfer learning approaches. Such transfer learning can also be used to generalize models trained on biological data to a different cell type or experimental condition, significantly reducing the time for training and accompanying computing resources required. More recently, label-free approaches employing a U-net variant have been applied to predict cellular structure from unlabeled brightfield images [66,67]. These methods can serve as a platform for building low cost, scalable, and efficient segmentation of 3D cellular structure.

CELLULAR STRUCTURES
There are two main aspects involved in the development of comprehensive biophysical models- (1) what is the process being modeled? and (2) what is the geometry in which this process is being modeled? Answers to the first question are based on experimental observations and specific biology. Answering the latter is significantly more challenging because of the difficulties in-(i) obtaining accurate segmentations, (ii) discovering new structure from experiments, and (iii) simultaneously visualizing multiple structures. The use of synthetically generated geometries, which can probe different arrangements of organelles within cells could be relevant for generating biologically relevant hypotheses.
A subset of ML models, called generative models, deal with the task of generating new synthetic but realistic images that match the training set distribution. For our purposes, such methods are relevant in the context of generating (i) noise-free images, (ii) images representative of a different cell type, structure, or time-point, and (iii) unlikely images that represent the most unique shapes of the structure being imaged. For example, by capturing the unlikely and likely shapes in our dataset, we could generate sequences of synthetic images that transition from one shape to the next. These synthetic images can be used in biophysical simulations to generate biologically relevant hypotheses.
In recent years, there has been rapid progress in applying deep generative models to natural images, text, and even medical images. Popular classes of deep generative models like Variational Autoencoders [68], Generative Adversarial Networks [69], and Autoregressive models such as PixelRNN [70] and PixelCNN [71] have achieved state of the art performance on popular image datasets such as MNIST [72], CIFAR [73] and ImageNet [74]. Each class of models has numerous modified implementations. For example, GANs alone include models like deep convolutional GAN (DCGAN) [75], conditional GAN (cGAN) [76], StackGAN [77], InfoGAN [78], and Wasserstein GAN [79] to name a few. Each model has its own distinct set of advantages and disadvantages. GANs can produce photo-realistic images at the cost of tricky training and no dimensionality reduction. VAEs allow for both generation and inference, but their naive implementation results in less photo-realistic generative examples. Autoregressive models obtain the best log-likelihoods at the cost of poor dimensionality reduction. Importantly, all of these models are unsupervised, implying that they are not limited by manual annotation that is otherwise a common challenge to supervised learning approaches.
In cell biology, much of the work in building generative models of cellular structures has been associated with the open source CellOrganizer [80][81][82][83][84][85][86], which uses a Gaussian Mixture Model given reference frames like the cell and nuclear shape in order to predict organelle shape distribution. These models also have the option to be parametric (parameters such as number of objects), which reduces the complexity of the learning task, the training time and GPU computing resources required, while also allowing for exploration and analysis of the parameters and their effect on the spatial organization of cells. Aside from CellOrganizer, other recent efforts have begun to leverage deep generative models in cell biology. We now have models that can predict structure localization given cell and nuclear shape [87], extract functional relationships between fluorescently tagged proteins structures in cell images [88], learn cell features from cell morphological profiling experiments [89], and interpret gene expression levels from single-cell RNA sequencing data [90,91].
The challenge going forward will be how best to use generative modeling given the data in hand. This will depend on the question we want to ask of the data. For example, if we are modeling processes associated with cell and nuclear shape, spherical harmonics based generative models might be more appropriate than deep learning based methods [92]. If we are interested in inpainting a missing wedge from a tomogram using a generative model, then GANs might be more appropriate [93]. Generated images can also be used as a source of training data for segmentation and classification tasks [94]. Taken together, these choices will help develop efficient end-to-end pipelines for segmentation and shape generation, and provide a platform for running biophysical simulations. Already, CellOrganizer can export spatial instances to cell simulation engines such as MCell [95] and VirtualCell [96], allowing us to simulate chemical reactions in different spatial compartments. Similar pipelines for deep generative models will need to be implemented in order to fully realize their downstream interpretations.

ANALYSIS
ML is commonly applied to mesh segmentation and classification; examples include PointNet [97] (segments and classifies a point cloud), and MeshCNN [98] (segments and classifies edges in a mesh). However, although the term machine learning was not traditionally used to describe meshing techniques, in fact algorithms for mesh generation (cf. [99]), mesh improvement (such as mesh smoothing [100]), and mesh refinement [101][102][103][104] all fundamentally involve local (cf. [105]) and/or global (cf. [106]) optimization of an objective function (see Figure 2). Mesh point locations, and/or edge/face connectivity decisions are viewed as parameters that are determined (or learned) as part of an iterative algorithm that extremizes a local or global objective function (usually involving constraints as well) in an effort to generate, improve, or refine a given mesh. In addition, adaptive numerical methods for simulation of physical systems involving the solution of ordinary (ODE) and partial (PDE) differential equations are again an early example of the application of ML techniques in computational science, long before the terminology was widely used. A classic reference from the 1970's in the context of adaptive finite element methods is Babuška and Rheinboldt [108,109]; all modern approaches to adaptive numerical methods for ODE and PDE systems continue to follow the same general framework outlined in that work: (i) Solve the ODE/PDE on the current mesh; (ii) Estimate the error using a posteriori indicators; (iii) Refine the mesh using provably non-degenerate local refinement with closure; (iv) Go back to step (i) and repeat the iteration until a target quality measure is obtained (a standard approach is to approximately minimize a global error function, through the use of local error estimates). These types of adaptive algorithms are effectively machine learning the best possible choice (highest accuracy with least cost) of mesh and corresponding numerical discretization for the target ODE/PDE system. Recent work in the area is now moving toward a more explicit and sophisticated use of modern ML techniques (cf. [110,111]).
Given a high quality and high resolution mesh representation of a structural geometry (see Figure 2), we can begin to probe structure-function relationships through mathematical modeling [11,112]. However, a single realistic geometry is not enough since it only captures a snapshot in time of the cellular geometry. Furthermore, structural variability is a hallmark of cell biology [14,24]. Dimensionality reduction techniques like principal component analaysis (PCA) or Variational Autoencoders (VAEs) can help determine both the average and extreme representations of a distribution of shapes, providing a starting point for running simulations. Generative models (discussed in section 4) can then be used to populate a single cell with multiple learned distributions-for example, generate cell shapes from EM images overlaid with protein structure shapes learned from fluorescence images [10,80,87].
To facilitate population studies, it is important that structural datasets be made publicly available, as they commonly are in neuroscience [23,24,113]. This is following in the footsteps of -omics datasets such as genomics, proteomics and transcriptomics [114], which have traditionally been made public in large scale projects like the Cancer Genome Atlas [115], the Human Microbiome Project [116] and the ENCODE project consortium [117]. ML can then be used to identify structure-phenotype associations, in much the same way as genotype-phenotype relationships are predicted from -omics studies [118,119].
Importantly, by running simulations on distributions of realistic shapes, we can generate experimentally testable hypotheses. This is much harder in -omics datasets, where mechanistic insight is usually obtained via constraint based modeling [119]. Further, we can also explore the implications of assuming an idealistic geometry-a common assumption in hypothesis-driven modeling. For example, idealized geometries have been the starting point of many signaling models that explore spatio-temporal dynamics using deterministic reaction-diffusion formulations [4,5,17,18,[120][121][122][123] or using Brownian dynamics or other formulations [95,[124][125][126][127][128][129][130][131][132][133]. An excellent example of insights gained using these idealized geometries is in exploring how changing the diffusion distances can affect the dynamics of signaling molecules such as Ca 2+ [123]. Other physical systems that are commonly modeled and simulated include structural mechanics [2,7], fluid mechanics [112,134] and thermodynamics [135].
A major bottleneck in setting up accurate computational simulations of biophysical systems, idealistic or otherwise, revolve around the choice of constitutive equations, estimation of the free parameters such as reaction rate constants, diffusion coefficients, and material properties, and computational algorithms for solving the resulting governing equations numerically on these domains. While there is a large history of mathematical modeling in biology to set the stage for constitutive equations, estimation of free parameters remains a major challenge. Another major challenge for physically realistic models of signaling is knowing the location of the various molecules involved. Realistic geometries pose the additional challenge of requiring us to first understand the distribution of shapes, followed by analyzing simulation results across that distribution. Similar to how ML can be used in adaptive numerical methods to output a good mesh, ML can also be used for adaptive nonlinear data fitting to determine biophysical parameters with uncertainity estimates [136,137]. Incorporating domain knowledge such as stress-strain relationships [138] or statistical molecular dynamic states [139] into ML algorithms can also improve interpretability while closing the loop between ML frameworks and biophysical modeling.

PERSPECTIVES AND FUTURE DIRECTIONS
In this perspective, we have discussed three key aspects of a pipeline for realistic cellular simulations: (i) Reconstruction and segmentation of cellular structure; (ii) Generation of cellular structure; and (iii) Mesh generation, refinement and simulation. While these were discussed separately, neural networks like Pixel2Mesh demonstrate the feasibility of end-toend pipelines from a single black box [140]. Of course, black boxes are not interpretable, and recent ML frameworks using Visible Neural Networks [141] demonstrate the potential of incorporating extensive prior knowledge to create a fully interpretable neural network capable of highlighting functional changes to every neuron/subsystem upon perturbing the input. Other ML frameworks like SAUCIE use regularizations to enforce mechanistic intepretability in the hidden layers of an autoencoder neural network [142]. We anticipate that future endeavors will implement a fully interpretable and end-to-end pipeline for biophysical simulations.

ACKNOWLEDGMENTS
We would like to thank Prof. Pietro De Camilli and coworkers for sharing their datasets from Wu et al. [24]. We also thank Dr. Matthias Haberl, Mr. Evan Campbell, Profs. Brenda Bloodgood, and Mark Ellisman for helpful discussion and suggestions. MH thanks M. Nguyen for helpful comments and background on machine learning techniques. GJ thanks the Allen Institute for Cell Science founder, Paul G. Allen, for his vision, encouragement and support.

FUNDING
RV, MR, CL, and PR were supported in part by an AFOSR MURI award FA9550-18-1-0051 and ONR fund ONR N00014-17-1-2628. CL acknowledges support from a Hartwell Foundation Postdoctoral Fellowship. GJ was supported by the Allen Institute for Cell Science. MH was supported in part by NSF Awards DMS-FRG1934411 and DMS-CM1630366. CL and MH were supported in part by the National Institutes of Health under grant number P41-GM103426.

FIGURE 1 |.
An illustration of the complex pipeline needed to go from imaging data to a segmented mesh, with various opportunities for emerging techniques in machine learning shown throughout the pipeline. (Top row) EM images obtained from Wu et al. [24] of dendritic spines from mouse brain tissue. (Middle row) Manual tracing or contouring, interpolation, and stacking of contours is extremely time consuming, prone to error, and relies of human judgement. (Bottom row) On the other hand, development of training labels and different learning techniques can reduce both time and error, bridging the gap between biological data and simulations. Classical algorithms like Otsu's thresholding and watershed are widely used and convenient but prone to error. Traditional machine learning algorithms like Random Forest and Naive Bayes are less prone to error and easy to use but require manual painting/interaction. Deep learning algorithms are highly effective and require no manual interaction but are limited by large training sets and compute resources. The list of techniques described is representative only, and not exhaustive.