Application of Machine Learning Techniques to an Agent-Based Model of Pantoea

Agent-based modeling (ABM) is a powerful simulation technique which describes a complex dynamic system based on its interacting constituent entities. While the flexibility of ABM enables broad application, the complexity of real-world models demands intensive computing resources and computational time; however, a metamodel may be constructed to gain insight at less computational expense. Here, we developed a model in NetLogo to describe the growth of a microbial population consisting of Pantoea. We applied 13 parameters that defined the model and actively changed seven of the parameters to modulate the evolution of the population curve in response to these changes. We efficiently performed more than 3,000 simulations using a Python wrapper, NL4Py. Upon evaluation of the correlation between the active parameters and outputs by random forest regression, we found that the parameters which define the depth of medium and glucose concentration affect the population curves significantly. Subsequently, we constructed a metamodel, a dense neural network, to predict the simulation outputs from the active parameters and found that it achieves high prediction accuracy, reaching an R2 coefficient of determination value up to 0.92. Our approach of using a combination of ABM with random forest regression and neural network reduces the number of required ABM simulations. The simplified and refined metamodels may provide insights into the complex dynamic system before their transition to more sophisticated models that run on high-performance computing systems. The ultimate goal is to build a bridge between simulation and experiment, allowing model validation by comparing the simulated data to experimental data in microbiology.


INTRODUCTION
The Earth may be described as a microbial planet with microbes found in complex communities that perform a wide range of important functions, from microbial communities in the human gut, to communities in soil that affect the growth of plants (Leveau et al., 2018). Although, advances in genomics, metabolomics, and imaging have provided clues about the organization and composition of microbial communities, our ability to predict and manipulate the functions of microbial communities for desired outcomes is limited (Kreft et al., 2017). Complementing these experimental approaches, simulation techniques, such as agent-based modeling (ABM), are moving to the forefront as a powerful tool to unravel how interactions between microbes lead to emergent traits at the community level (Kreft et al., 1998(Kreft et al., , 2017Hellweger et al., 2008Hellweger et al., , 2016Leveau et al., 2018).
As with any simulation technique, the accuracy of an ABM is determined by comparing the simulation data to the experimental data, a process known as validation. In microbiology, the experimental data can exist in many forms, such as genomic sequences, metabolite profiles, growth curves of individual microbial strains, or images of microbial growth. Validating the ABM in a multi-scale manner is possible using pattern oriented modeling (POM), but it is also very expensive computationally (Grimm et al., 2005). To reduce the computational burden, it is customary to first perform a sensitivity analysis (SA) of the ABM (Ligmann-Zielinska, 2013;Ligmann-Zielinska et al., 2014;Thiele et al., 2014;ten Broeke et al., 2016). With SA, one can uncover which parameters of the model have the strongest effect on the output, and what correlations, if any, exist among them. Then, during the validation process, one can focus on varying the independent parameters only, thereby reducing the computational burden.
We have started building ABMs that are capable of reproducing experimental local and global information of microbial growth. Local information can be size and shape of the colonies, length of the boundaries between colonies, etc. Global information is usually represented by the population curves of the whole community, where growth is monitored over time. As part of this effort, we have focused on investigating the microbial interactions in the rhizosphere of poplar trees (Populus deltoides or Populus trichocarpa), a tree that shows promise as a bio-energy crop (Cregger et al., 2021). Pantoea sp. YR343 is a gammaproteobacterium that was isolated from the rhizosphere of P. deltoides and possesses several plant growth-promoting characteristics (Bible et al., 2016). In the work presented here, we developed the ABM to investigate the growth of Pantoea. This model is an improvement to the previous model made for investigating a population containing a wild-type and a mutant strain of Pseudomonas aeruginosa (Wilmoth et al., 2018).
The ABM developed is written in the language NetLogo (Wilensky, 1999). This language was designed for teaching complex phenomena to students and thus it is easy to learn. This has undoubtedly contributed to its popularity. Further, we have found that NetLogo's interface allows experimentalists to run simulations and explore different scenarios, facilitating the exchanges of ideas and further improvement of the model. Nonetheless, in general, NetLogo lacks capabilities for performing simulations on high-performance computing systems, although, recent works have shown that significant improvements in performance can sometimes be attained (Ayllón et al., 2016;Railsback et al., 2017). Despite this, NetLogo can be a powerful tool for prototyping ideas before implementing them in more sophisticated software, such as NUFEB (Li et al., 2019).
The model present here contains 13 parameters, thus making the SA and validation analysis computationally prohibitive. This limitation forced us to turn our attention to identifying ways to simplify our model that circumvent the need to carry out a large number of simulations (Pereda et al., 2017). To this end, we turned our attention to machine learning (ML), specifically deep learning (DL). ML and particularly DL have undergone significant advances in the last 10 years. Using application programming interfaces (APIs) such as Keras (Chollet, 2015) and scientific packages such as SciPy (Virtanen et al., 2020), it is relatively simple to write code to classify objects, perform regression, or make predictions. The Web is also full of blogs and courses that flatten the learning curve for these types of techniques. ML and DL have also been applied to microbiology, yet the confluence of these techniques with ABMs is, to our knowledge, not broadly used (Lee et al., 2020).
Here, the population curves of Pantoea produced by the NetLogo ABM are compared to the experiments. The simulated population curves are also analyzed using random forest regression to study the correlation between the parameters and outputs of the ABM and pinpoint the most important parameters of the model. Then, we build a fully dense neural network, which we show is capable of producing results statistically similar to that of the original ABM. The choice of these ML techniques is due to their consistent robustness and accuracy in prediction (Biau, 2012;Abiodun et al., 2018;Couronné et al., 2018). To improve the efficiency of our ABM simulations, we use NL4Py (Gunaratne and Garibay, 2018), a Python wrapper for NetLogo, which permits submitting thousands of simulations in a parallel fashion. We also note challenges when comparing simulated and experimental population curves, and we suggest how to circumvent this limitation in the Discussion section.

Agent-Based Modeling Method
The ABM was designed to reproduce the behavior of a Pantoea growing in a R2A-rich agar petri dish. This model uses many of the implementations in IBM-INDISIM (Ginovart et al., 2002), the ABM which has been successfully used to simulate microbial cultures in situations such as fermentation, multispecies composting, and yeast dynamics in aerobic media and denitrification processes (Ginovart and Cañadas, 2008;Prats et al., 2010;Portell et al., 2014;Banitz et al., 2015;Araujo Granda et al., 2016b). As in IBM-INDISIM, we use a thermodynamic approach to describe microbial metabolism, e.g., cellular maintenance and mass production, in the ABM for Pantoea. This approach is called the Thermodynamic Equivalent Electron model (TEEM), and it relies on a set of thermo-chemical reactions that account for the Gibbs free energy involved in the overall metabolism, catabolism, and anabolism (McCarty, 2007). We also assume that the growth is Monod driven with a maximum growth rate, μ max , ranging from 0 to 10. Two types of entity exist in our ABM: the bacterium and the square patches where it moves and grows. Each bacterium has its own unique identification number, biomass, metabolism, and reproduction parameters, as well as viability and location coordinates, and each one performs the following actions: nutrient uptake, cellular maintenance, biomass synthesis, product generation, and bipartition (reproduction). The time-dependent variables are calculated and updated in each time-step according to a time scale of 0.1-1.6 min per time-step. The square patches contain the R2A agar growth medium, and growth medium actions include changes in metabolite concentration and nutrient consumption on each patch. Carbon and nitrogen consumption, as well as their generation and accumulation, are controlled to ensure mass balance. Additionally, the model includes behavior actions that control the overlap of bacteria and the interaction after bipartition.
The empirical formula C 4.17 H 8 O 1.75 N (Araujo Granda et al., 2016a) was used to describe the elemental composition of Pantoea. The bacterium's dimensions were assumed to have length and width at 1.1 ± 0.5 μm and 0.55 ± 0.25 μm (mean ± SD), respectively (Kapetas et al., 2012). The individual mass was deduced from the volume and an assumed density of 1.1 g/cm 3 (Gras et al., 2011). A two-dimensional lattice grid was used to describe the environment where the bacteria grow. Each grid-cell represents a volume that can be tuned by changing the world dimensions, depth, and length. A volume was calculated using a portion of the wells in the experimental setup with dimensions of 605 μm × 605 μm × depth, where depth ranges from 10 to 50 μm. The initial concentration of glucose and ammonium were estimated from commercial R2A agar composition.
When doing simulations with ABMs, it is important to produce a large number of simulations. This is because ABMs are stochastic, and many simulations are needed to obtain reliable statistics. We used NL4Py to investigate 3,358 different initial configurations. NL4Py (Gunaratne and Garibay, 2018) is a Python library that facilitates the control and reporting of parallelizable NetLogo workspaces. This allowed us to interface ML algorithms implemented in the SciKit-Learn (Pedregosa et al., 2011) and Keras (Chollet, 2015) Python libraries with the NetLogo model and provide automated simulation deployment and evaluation. For each of the 3,358 initial configurations, we ran 4-6 repetitions to ensure statistical sampling, which yields around 18,000 simulations in total. In these simulations, we fixed the values of six parameters to values generally accepted by the literature or based on our experiments: efficiency (McCarty, 2007), energy_ maintenance_pa (Gras et al., 2011), ammonium (Wushensky et al., 2018), total-length-world based on the experiments described above, and lastly, max-time-viability_pa and rep_pa based on preliminary tests following a similar application by Font Marques and Ginovart (2016). The remaining seven parameters, namely pmax, microorganism, depth, umax_pa, diffusion-coefficient, glucose, and min/steptime, were randomly varied within a pre-defined range. A brief description of the 13 parameters and their values that define this ABM is listed in Table 1. The Overview-Design Concepts-Details (ODD; Grimm et al., 2010Grimm et al., , 2020, the NetLogo implementation, and the Python code for the simulations using NL4Py are available at https://github.com/ miguel-fc/ABM-Pantoea. The output of each simulation consisted of a population curve, which shows how much the population grew over the simulation time. As described by Wilmoth et al. (2018), we fit the growth part of each population curve to a logistic function. From this fitting, three parameters, namely the maximum (A), the slope (μ), and the lag-time (τ), were obtained, and they are used to identify each population curve. An example of a simulated image of bacterial growth at t = 0 h and t = 10 h, and a population curve are shown in Figures 1A It is important to explain why we chose fitting our population curves to a logistic function. Doing this fitting seems counterintuitive, given that by assuming from the onset that Pantoea grows in a Monod fashion, we already know the shape of the population curves even without running the simulations. When a Monod growth is assumed from the onset, the power of running ABM simulations relies not in determining the growth of a whole population, of course, but on determining the local growth, i.e., at the level of a few bacteria. At the local level, nutrient concentration and microbe-microbe interactions are bound to affect the growth in a different way that they do at a global population level. Determining these differences, though, requires analyzing the images of growth, and extracting metrics that correlate microbe-microbe proximity and/or nutrient concentration to growth; this is out of the scope of the present work. The purpose here is to find a method for refining and simplifying our model. It is for this reason that we decided to focus on the population level only, and assumed we knew nothing about its growth. In such a scenario, it is common to fit the population curves to different functions. We used here one of these functions (Zwietering et al., 1990) and used the parameters that define the curves as the outputs of our simulations. By studying the effect of the inputs of the ABM on these outputs, we are then able to understand our model better and suggest refinements.

Experimental Methods
Image-based growth curves were performed using a glass bottom 24-well black plate (CellVis, catalog P24-1.5H-N) with the BioTek Cytation 5 high content imaging plate reader. In order to visualize bacterial colony growth, we used R2A media with 1.5% agarose to make thin agar pads (300 μl spread across the well) in each well of the plate and allowed them to solidify. We next prepared a culture of Pantoea sp. YR343 expressing GFP by growing them overnight in R2A media (OD at 600 nm was approximately 1), then diluting the culture with fresh R2A media to 10 −4 cells/ml. From the diluted culture, we added 300 μl of cells to each well and incubated the plate at 28°C for 1 h. Next, we removed any excess media and allowed the plate to dry for 1 h at 28°C. After that, we placed the 24-well plate in the Cytation 5 and set imaging parameters to take images and measure fluorescence intensity every hour for a total of 18 h. Fluorescence intensity values were used to generate growth curves. See Figures 1D-F for an example of experimental results.
In experimental studies, we observed a lag phase that lasted approximately 6-8 h post inoculation, followed by a period of exponential growth that lasted 3-4 h, and ending in stationary phase. Parameters from these studies, such as media composition, volume of the well, and growth rates were used to inform the simulation studies.

Random Forest Regression
We used Random Forest Regression via the Scikit-Learn Python Library (Pedregosa et al., 2011) to understand the correlation between the seven input parameters and the three outputs. We first sorted the 3,358 population curves into four subsets based on their steptime values. The four steptime values used in the simulation were 0.1, 0.6, 1.1, and 1.6 min, which had 831, 837, 879, and 811 population curves, respectively. Next, for each subset, we randomly split the corresponding population curves into two datasets, training and test, in a 70/30 ratio. Each population curve in a steptime subset was associated with a unique set of the remaining six input parameters and average values of the three outputs. As the inputs are on different scales, we normalized each of their distributions to zero mean and unit variance to allow for efficient training.
For each output, we first constructed a Random Forest Regression model and optimized its hyperparameters by random search (Bergstra and Bengio, 2012) with 3-fold cross validation on the training dataset. The hyperparameters that were tuned included the number of decision trees in the forest, the maximum depth of the tree, the minimum number of samples required to split an internal node and to be at a leaf node, the number of features considered by each tree to split a node, and whether or not bootstrap samples when building trees. We then used each optimized model to predict the corresponding output of the test dataset and evaluated the models by their coefficients of determination, R 2 scores, between the predicted and simulation (true) outputs.

Neural Network
The methodology above leads to one Random Forest Regression model for each of the three outputs at a particular steptime value. With a feed-forward Neural Network, we aim to build one single comprehensive model that is not only capable of predicting the three outputs simultaneously but also invariant to steptime. Therefore, we considered all 3,358 population curves and randomly split them into training and test datasets in a 70/30 ratio. We built the network using the Keras Library (Chollet, 2015). The network was composed of two hidden layers with 30 and 15 nodes, respectively. The output layer had three nodes, one for each output. The total number of trainable parameters was 753. We used a ReLU activation function at each hidden layer, mean squared error as the loss function and the Adam optimizer with a learning rate of 0.01. As the inputs and outputs are on different scales, we normalized each to values between 0 and 1. We evaluated the network using 3-fold cross validation, each fold for 300 epochs and a batch size of 50. The mean absolute error (MAE) over the three folds was 0.018, with a SD of 0.001. We then re-trained the network on the entire training dataset and validated it on the remaining test dataset using the same set of hyperparameters found in the cross validation. The MAE was 0.019. To calculate the coefficients of determination, R 2 scores, of the three outputs, we applied the final network on the test dataset and transformed the predicted values back to the original scales.

RESULTS
Pantoea sp. YR343 is a robust colonizer of plant roots in the rhizosphere, where it competes with many other microbes for both space and resources. In order to better understand how competition for resources affects the spatial distribution of bacterial growth, we used microscopy to observe growth of bacterial colonies on an agar surface in order to develop the simulation tools described in this study. Experimental growth curves showed an initial lag time of approximately 8 h before entering into exponential phase, where Pantoea sp. YR343 forms small, mucoid colonies that grow rapidly, and at later timepoints, begin merging with other colonies before they enter stationary phase. Data obtained from the experimental growth curves were used to inform the parameters for our simulations.
To assess the accuracy of our ABM for Pantoea, we compared the populations curves from simulation and experiment; examples of these curves are shown in Figure 1. By fitting the growth curves to a logistic function, we extracted the three outputs: the maximum A, the slope μ, and the lag time τ. Both experiments and simulations have comparable μ, in which the averaged values are 5.12 ± 0.95 h −1 and 6.08 ± 0.78 h −1 (±SD), respectively. However, τ is shorter in simulations, with an average value of 2.62 ± 0.13 h, as compared to 8.29 ± 0.17 h in experiments. The difference in the experimental and simulated values for τ occurs because in the experiments, τ was measured by the time that the bacteria took to adapt to a new medium and started growing, while in simulations, τ was calculated from the minimum time and biomass needed for each bacterium to reproduce. There was no adaption in simulations, and the bacteria started growing right away. It is difficult to compare A due to the differences in measurements. In simulations, the measurement A is the number of bacteria, while in experiments, it is the fluorescence intensity; we elaborate more on this discrepancy in the Discussion section. Nonetheless, in general the model captures the experimental behavior of overall population growth reasonably well.
A comprehensive analysis of model accuracy would require thorough exploration of the seven-dimensional input phase space and is prohibitively expensive. Accordingly, we determine the relationship between input parameters and the model outputs by using random forest regression.
The R 2 scores and the Gini importance of the six input parameters for predicting average values of A (<A>), μ (<μ>), and τ (<τ>) for the four steptime test sets are presented in stacked bar charts (see Figure 2). We also calculated the permutation importance, and the results are similar to the Gini importance results reported herein. For the steptime of 0.1 min, the R 2 scores are 0.93 (<A>), 0.92 (<μ>), and 0.79 (<τ>). The three most important input parameters for predicting all three outputs are depth, glucose, and growth rate (umax_pa), which account for 82% of the total for <A>, 78% for <μ>, and 80% for <τ>. The importance ranking of these three inputs are the same for <A> and <μ>, in which depth is first, followed by glucose and umax_pa. Comparably, for <τ> glucose is first, followed by umax_pa and depth (Figure 2A). As the steptime increases, the R 2 scores for prediction of <A> and <μ> decrease and the score for <τ> increases (Figures 2B-D). This trend is most evident when comparing the R 2 scores from steptime of 1.1 min to those from steptime of 0.1 min. For steptime of 1.1 min, the R 2 scores drop to 0.78 and 0.80 for <A> and <μ>, respectively, while the score increases to 0.87 for <τ> ( Figure 2C). Interestingly, as the steptime increases, microorganism (micro) becomes one of the most important input parameters for predicting the three outputs. For example, for the steptime of 1.1 min, the three most important input parameters for predicting <A> are micro, depth, and glucose, which account for 95% of the total. For <μ>, depth, glucose, and micro contribute to 79%, and for <τ>, umax_pa, micro, and glucose account for 90%.
To understand better the relationship between the outputs and the corresponding important input parameters, we plotted on a heat map each output as a function of the two most important inputs (Figure 3). In general, <A> and <τ> depend on steptime: when steptime is low (i.e., steptime of 0.1 min), <A> is mostly below 300 and <τ> is below 2.5. However, when steptime is higher (i.e., steptime of 1.1 min), <A> can adopt values above 500 and <τ> above 10. With steptime of 0.1 min, <A> and <μ> are relatively low and independent of glucose when depth is less than 30. As depth increases, <A> and <μ> increase rapidly when glucose is above 25. <τ> is low regardless of glucose and umax_pa (Figures 3A-C). In comparison, with steptime of 1.1 min, <A> increases rapidly as depth increases when micro is low but <A> increases slowly when micro is high. <μ> follows a similar trend here as it does with steptime of 0.1 min. Overall <τ> is low except when umax_pa is 0 (Figures 3D-F).
The above results indicate that using Random Forest Regression with only three input parameters, it should be possible to predict the values of the three outputs with a R 2 score of at least 0.59 (i.e., with steptime of 1.6 min, the R 2 score of <A> 0.67 multiplied by 88% contributed by micro, depth, and glucose). Unfortunately, this approach requires using one Random Forest Regression model per output and per steptime. It is much more convenient to predict the three outputs with one single model for all four steptime values.
To enable predicting all three outputs with one model only, we constructed a three-layered feed-forward Neural Network. Figure 4 shows the performance of this network: it achieves R 2 scores of 0.89, 0.92, and 0.79 for predicting the values of <A>, <μ>, and <τ> of the test dataset, respectively. Taken together, we demonstrated the capability of determining values of the outputs <A>, <μ>, and <τ> from the input parameters using Random Forest Regression and neural network. Based on Random Forest Regression Models, we identified important inputs for each output prediction. Besides, from a trained neural network model, we further made predictions for multiple outputs without compromising the accuracy of the model. Our results demonstrate that a metamodel derived from machine learning techniques can facilitate efficient ABM of microorganisms.

DISCUSSION
Traditionally, bacterial growth curves have been generated by measuring the optical density of a liquid culture over time and plotting absorbance values over time. Unfortunately, much information about how these bacterial colonies are organized and how they compete for nutrients within a given space is lost with these techniques and therefore remained not well understood. The use of ABM to model bacterial growth in a given space can yield insights into how competition for resources determines, which microbes become dominant within a certain environment. This is because ABMs of microbiology produce local and global information of microbial growth. Local information can be used to determine how different species interact, for example, by tracking the size and shape of their corresponding colonies, their boundaries, etc. Global information can inform on how a particular microbial population grows as a whole. Additionally, the application of ML to the ABM simulation results can provide insights of the important parameters of the model and further build metamodels that allow reaching length and time scales unattainable by the ABM. However, a big question is how to ensure that the ABM and the resulting metamodel of a particular microbial system is accurate. The answer is to compare the patterns that the simulated and experimental system generate at different time and length scales (Grimm et al., 2005). Extracting the patterns from a microbial system can be done by analyzing the images with a variety of ML techniques, such as those in OpenCV (Bradski, 2000). DL techniques such as segmentation (Stringer et al., 2021) and variational autoencoders (Chen et al., 2020;Kalinin et al., 2021) can then be used to reduce the dimensionality in the system and uncover trends. Applying these types of ML and DL techniques is out of the scope of this paper, but we encourage the use of such techniques in future studies to improve the accuracy of ABMs, and metamodels thereof, in microbiology.
Here, we focus on gauging the accuracy of the ABM by reproducing global information of growth, e.g., the information contained in the experimental population curves. However, due to the manner in which the simulated and experimental population curves were obtained and what they measured, currently there is no appropriate method to directly compare the two results. Experimental growth curves were generated using microscopy as a tool for measuring bacterial growth on a surface in threedimensional space. Because this unique method does not allow us to easily quantify the number of cells present at various timepoints, we used measurement of fluorescence as a proxy for cell density. Factors that likely contribute to discrepancies between experimental curves and simulation curves include that the experimental curves were obtained by measuring the fluorescence intensity, whereas the simulated curves were generated by counting the number of bacteria directly. A direct comparison between the two requires converting the fluorescence intensity to the number of bacteria, which is likely prone to errors. Moreover, the experiments were performed in three dimensions, while only the colonies close to the surface were being imaged. In other words, bacteria growing outside the imaging plane were likely to be neglected, and thus precludes correct counting of bacteria number. Other issues, such as photobleaching of the fluorescence signal, might also affect the measurement accuracy. More effort is needed to build the bridge between simulation and experiment in microbiology, and we believe that this effort includes computer vision tools, as those in OpenCV and sophisticated DL techniques such as segmentation and VAEs. Using these techniques will allow us to obtain not only global but also local information of growth, which can then be used to gauge the accuracy of ABMs. The work we present here demonstrates a route to leverage ML and DL techniques for improved ABM development in microbiology.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.