MEMMAL: A tool for expanding large-scale mechanistic models with machine learned associations and big datasets

Computational models that can explain and predict complex sub-cellular, cellular, and tissue-level drug response mechanisms could speed drug discovery and prioritize patient-specific treatments (i.e., precision medicine). Some models are mechanistic with detailed equations describing known (or supposed) physicochemical processes, while some are statistical or machine learning-based approaches, that explain datasets but have no mechanistic or causal guarantees. These two types of modeling are rarely combined, missing the opportunity to explore possibly causal but data-driven new knowledge while explaining what is already known. Here, we explore combining machine learned associations with mechanistic models to develop computational models that could more fully represent cellular behavior. In this proposed MEMMAL (MEchanistic Modeling with MAchine Learning) framework, machine learning/statistical models built using omics datasets provide predictions for new interactions between genes and proteins where there is physicochemical uncertainty. These interactions are used as a basis for new reactions in mechanistic models. As a test case, we focused on incorporating novel IFNγ/PD-L1 related associations into a large-scale mechanistic model for cell proliferation and death to better recapitulate the recently released NIH LINCS Consortium MCF10A dataset and enable description of the cellular response to checkpoint inhibitor immunotherapies. This work is a template for combining big-data-inferred interactions with mechanistic models, which could be more broadly applicable for building multi-scale precision medicine and whole cell models.


Introduction
The molecular signaling mechanisms of cancer cells are highly heterogeneous, leading to treatment resistance and recurrence. Thus, the need for personalized interventions to block tumor growth is high. The traditional drug discovery pipeline consists of extensive trial-and-error experiments: testing thousands of chemicals, refining their structures for safety and toxicity, and conducting years of clinical trials. This burden might be reduced by understanding the underlying molecular mechanisms with the help of computational models (Yu et al., 2018; Saez-Rodriguez and Blüthgen, 2020).
Computational tools and models are becoming indispensable in medical research, where a cycle of experimentation and computation is used to learn about and test new hypotheses.
The models guide experimental hypothesis generation, and experimental observations enable fine-tuning of computational models to understand the biological phenomena. Owing to advances in wet-lab experimental techniques and tools, "Big Data" repositories become more prominent each year. These databases include genomics, proteomics, epigenomics, and clinical information (Barrett et al., 2012; Uhlen et al., 2015; Subramanian et al., 2017; Hoadley et al., 2018; Wishart et al., 2018; Nusinow et al., 2020). To understand the underlying biology, analysis of these big datasets should become more practical and go beyond context-dependent and scope-limited biological events. Building computational models that explain and predict such highly heterogeneous and complex cellular responses is no easy task. The popular mechanistic models are sets of detailed equations describing curated knowledge of what is happening within cells. Such models (Bouhaddou et al., 2018; Fröhlich et al., 2018; Münzner et al., 2019) are usually small in scale: tens of equations and tens to hundreds of model species (Figure 1). Another popular class is machine learning-based models, which are data-driven, descriptive, and mostly large-scale (genome-wide or exome-wide) (Malta et al., 2018; Wong and Yip, 2018; Yu et al., 2018; Yang et al., 2019). These models are generally described as black-box models because, although they perform well on precision/recall metrics, how they do so is opaque (Figure 1). So far in the literature, these two types of models have rarely been combined, missing the opportunity to generate new knowledge while explaining what is already known (Baker et al., 2018).
Here, we explore a combination of both methods to develop better models that more completely represent biological knowledge, and we introduce the MEMMAL (MEchanistic Modeling with MAchine Learning) framework. MEMMAL processes connections inferred via machine-learning pipelines (i.e., MOBILE (Erdem et al., 2022a)) as new interactions in mechanistic models (i.e., SPARCED (Erdem et al., 2022b)) to better recapitulate available datasets (i.e., the recently released MCF10A dataset (Gross et al., 2022)). The NIH LINCS Consortium and MCF10A Common Project recently released this dataset, which consists of multiple omics assay types on the breast epithelial MCF10A cell line. MOBILE is a new pipeline to integrate multi-omics datasets and identify context-specific interactions. SPARCED is one of the largest mechanistic models of mammalian cells and uses an open-source, human-interpretable, and easy-to-alter modeling format. Here we focused on incorporating novel IFNγ/PD-L1 related associations into the SPARCED model to enable description of the cellular response to checkpoint inhibitor immunotherapies. This work is a template for combining big-data, machine-learning-inferred interactions with mechanistic models, which could be more broadly applicable towards building multi-scale precision medicine and whole cell models.

Materials and methods
In this work, we use ligand-specific interactions between genes as new connections in a large-scale mechanistic model to study the effect of the newly added gene interactions on model responses. It is important to note that MEMMAL is agnostic to the specific tool used to nominate new associations and to the base mechanistic model used; those below are chosen simply as illustrations.

MOBILE
MOBILE is a recent tool for finding context-specific network features by integrating pairs of omics datasets (Erdem et al., 2022a). In short, statistical associations are calculated between pairs of chromatin accessibility regions, mRNA expression levels, and protein/phosphoprotein levels. Lasso (least absolute shrinkage and selection operator) regression models are run in replicate to select coefficients with high occurrence rates (Tibshirani, 1996; Erdem et al., 2016; Erdem et al., 2022a). The so-called Integrated Association Networks (IANs) are generated by combining the association networks inferred for RPPA (reverse phase protein array) + RNAseq and RNAseq + ATACseq data inputs. Finally, the IANs are coalesced into gene-level networks, with nodes representing the genes of the assay analytes and edges representing the inferred Lasso coefficients. From the MOBILE-generated IFNγ-specific IAN, a sub-network of connections between canonical interferon genes, PD-L1, and PD-1 is filtered to obtain a module with 297 nodes and 321 edges. Then, only the interactions with IRF1, PD-L1, PD-1, and STAT1 are retained as input for MEMMAL.
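The final filtering step can be illustrated with a minimal sketch. It assumes the gene-level network is available as a list of signed edges; the edge list, the placeholder gene names (GENE_A to GENE_D), the coefficient values, and the function names are hypothetical and are not taken from the MOBILE code. CD274 and PDCD1 are the gene symbols for PD-L1 and PD-1.

```python
# Hypothetical signed edge list: (gene_1, gene_2, Lasso coefficient).
# Placeholder names and values are for illustration only.
edges = [
    ("STAT1", "GENE_A", 0.40),
    ("GENE_B", "CD274", -0.20),
    ("GENE_C", "GENE_D", 0.10),
    ("IRF1", "GENE_A", 0.30),
]

# Genes whose interactions are retained as MEMMAL input.
ANCHOR_GENES = {"IRF1", "CD274", "PDCD1", "STAT1"}


def filter_edges(edge_list, anchors):
    """Keep only edges in which at least one endpoint is an anchor gene."""
    return [edge for edge in edge_list if edge[0] in anchors or edge[1] in anchors]


def unique_genes(edge_list):
    """Return the sorted list of unique genes appearing in the edge list."""
    return sorted({gene for g1, g2, _ in edge_list for gene in (g1, g2)})


kept_edges = filter_edges(edges, ANCHOR_GENES)
kept_genes = unique_genes(kept_edges)
```

The unique gene list obtained this way is the kind of input from which new model species would then be created.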

SPARCED
The starting mechanistic model used in this work is obtained from the SPARCED repository (github.com/birtwistlelab/SPARCED/tree/develop) (Erdem et al., 2022b). SPARCED is a recent framework for large-scale mechanistic modeling that enables model file creation using simple text files as input with minimal coding requirements. In short, a set of annotated text files is constructed to define model specifics. Then, Jupyter notebooks are used to process these files and create a model file in the community-standard Systems Biology Markup Language (SBML) format (Hucka et al., 2003; Keating et al., 2020). The software was first built to replicate one of the largest mammalian single-cell mechanistic models of proliferation and death signaling (Bouhaddou et al., 2018; Erdem et al., 2022b). Then, an expanded SPARCED model was created to include IFNγ signaling and SOCS1 crosstalk to growth pathways, and the new model was named SPARCED-IFNG-SOCS1 (Erdem et al., 2022b). This final model and its input files are used as the base model in this work and are modified further with the MOBILE-inferred set of new connections.

MEMMAL
Jupyter notebooks. The MEMMAL pipeline is composed of multiple Jupyter notebooks, defined below, with detailed steps given in Supplementary Table S1.

1. enlargeModel notebook: As the core of MEMMAL, this Jupyter notebook processes the list of connections inferred by the machine learning model and creates Species (genes, mRNAs, proteins, phosphoproteins), RateLaws (the reaction format and related parameters), and Gene Regulatory Interactions (defining transcriptional activators and repressors), and finds relevant new omics data from LINCS datasets. The input files for the SPARCED pipeline are then updated, followed by model compilation and simulation steps.
The pipeline starts by finding the unique list of genes from the MOBILE associations input. Then, for each unique gene added, we create species for the active gene, inactive gene, mRNA, and protein (and phosphoproteins if the gene has corresponding phosphoprotein measurements). The species initial conditions are updated using LINCS (Gross et al., 2022), MCF10A (Bouhaddou et al., 2018), or other literature datasets (Schwanhäusser et al., 2011). The experimental data in molecules per cell (mpc) are converted into nanomolar (nM) concentrations and the corresponding values are updated.
Next, first-order translation, transcription, and protein and mRNA degradation reactions are created and the rate laws are defined. The rate constants are set using literature data (Schwanhäusser et al., 2011) or set to the mean value of the corresponding reaction parameter values for existing genes in SPARCED.
The mRNA and protein degradation rate constants are set using literature half-life data: $k_{TCd} = \ln 2 / t_{1/2,\mathrm{mRNA}}$ and $k_{TLd} = \ln 2 / t_{1/2,\mathrm{protein}}$. Basal transcription rate constants are set using the equation $(k_{TCd} \cdot \mathrm{mRNA}_{\mathrm{count}}) \cdot (k_{Gin} + k_{Gac}) / (k_{Gac} \cdot \mathrm{Gene\ Copy\ Number})$, where $k_{Gin}$ and $k_{Gac}$ are the rates of gene inactivation and activation, respectively. The translation rate constants are set using the equation $\mathrm{protein}_{\mathrm{concentration}} \cdot k_{TLd} / \mathrm{mRNA}_{\mathrm{concentration}}$.
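The unit conversion and the rate-constant expressions described above can be written as a short sketch. The function names are hypothetical, and the compartment volume must be supplied by the user because it is not stated here. The half-life is assumed to be in the same time unit as the desired rate constant, and the logarithm is taken to be the natural logarithm, as is standard for first-order decay.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def mpc_to_nM(molecules_per_cell, volume_L):
    """Convert molecules per cell to nanomolar for a compartment volume in liters."""
    return molecules_per_cell / (AVOGADRO * volume_L) * 1e9


def degradation_rate(half_life):
    """First-order degradation rate constant from a half-life (kTCd or kTLd)."""
    return math.log(2) / half_life


def basal_transcription_rate(kTCd, mrna_count, kGin, kGac, gene_copy_number):
    """Basal transcription rate constant from mRNA turnover and gene switching rates."""
    return (kTCd * mrna_count) * (kGin + kGac) / (kGac * gene_copy_number)


def translation_rate(protein_conc, kTLd, mrna_conc):
    """Translation rate constant from protein and mRNA concentrations."""
    return protein_conc * kTLd / mrna_conc
```

For example, `degradation_rate(9.0)` gives the rate constant in 1/h for an mRNA with a 9 h half-life; multiply or divide by 3,600 as needed if the model uses seconds.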
Importantly, for this work we specify that all associations are gene regulatory mechanisms, and for each association, two transcriptional regulation connections are created: the protein species of gene1 activates/represses gene2 expression, and the protein of gene2 activates/represses gene1 expression. This choice, however, reflects the specific submodel of interest here being a gene regulatory subnetwork; future implementations would need to be considered case by case. These gene regulatory reactions are modeled as Hill equations, as defined for other gene regulatory reactions in SPARCED (Erdem et al., 2022b). The Hill equation parameters are: i) $n_A$, the Hill coefficient, set to 4 for all new reactions, and ii) $K_A$, the concentration for half-maximal transcriptional output effect, initially set to half of the transcriptionally regulating protein concentration. The values of these $K_A$ parameters are fitted later, as described below. Finally, the updated input files are written into text files for model creation and compilation.
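As an illustration of the default parameterization, the sketch below uses the textbook Hill activation and repression terms. This generic form is an assumption; the exact functional form used in SPARCED is defined in Erdem et al. (2022b) and may differ. The function names are hypothetical.

```python
def hill_activation(x, K_A, n_A=4):
    """Generic Hill activation term: fraction of maximal output at regulator level x."""
    return x**n_A / (K_A**n_A + x**n_A)


def hill_repression(x, K_A, n_A=4):
    """Generic Hill repression term: complement of the activation term."""
    return K_A**n_A / (K_A**n_A + x**n_A)


def default_K_A(regulator_protein_conc):
    """Initial guess for K_A: half of the regulating protein concentration."""
    return 0.5 * regulator_protein_conc
```

With these defaults, a regulator sitting at its initial concentration (twice $K_A$) gives an activation term of 16/17, about 0.94, so new activators start close to saturation before $K_A$ is fitted.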

2. createModel_o4a notebook: The Jupyter notebook that creates an integrated SBML version of SPARCED-type models (Erdem et al., 2022b). Creating the model file fully in SBML format provides extensive speed-up of simulations.
The input files newly updated by the enlargeModel notebook are used to create and compile the expanded model.

3. runModel notebook: This Jupyter notebook is used to simulate and explore multiple scenarios for the new model.

4. enlargeSBMLModel notebook: This Jupyter notebook contains an example of enlarging any SBML model using user-defined lists of species, reactions, and parameters. We provide an example that uses the lists of model elements created by the enlargeModel notebook to expand the SBML file of the IFNγ/JAK/STAT signaling pathway (Yamada et al., 2003).

5. testMEMMAL notebook: This Jupyter notebook contains commands to run MEMMAL from start to finish. It calls the first three notebooks and plots the figure panels.
Steps to obtain the list are given in Supplementary Figure S1.

4. RPPADataLINCS, RPPADataStdLINCS, and RPPADataStdLINCSfc: Median normalized RPPA data in log2 format. "Std" refers to the standard deviation of triplicate measurements. "fc" refers to fold-change with respect to time point zero.
Parameter fitting. The new parameter values were initially set using literature data or existing model parameters. We then estimated some of them in a semi-automated way. First, the basal transcription (mRNA production) rate constants of the new mRNA species (eight in total) were fitted one at a time, in the order in which the species were added to the model. If the mRNA level was not at steady state (degrading or accumulating) in simulations without ligand (growth factor or IFNγ) stimulation, the parameter value was estimated by varying it uniformly (15 points) within three orders of log10-magnitude of the default value. Then, the best-fit value that yielded a constant level was manually adjusted for a better fit if possible. Finally, that parameter value was kept constant and the next was explored. One of the mRNA degradation parameters (that of FAM83D) was also fitted similarly.
The values for the $K_A$ (half-maximal) concentrations of the newly added gene regulatory reactions were adjusted using the LINCS mRNA (ACSL5, BST2, CLIC2, FAM83D, HIST2H2AA3, and METAP2) and protein (IRF1 and PD-L1) time course data with EGF and EGF + IFNγ stimulation. The model, starting from an initial steady-state condition in the absence of growth factors (from above), is simulated for 48 h with EGF (1.5625 nM) or EGF (1.5625 nM) + IFNγ (1.1834 nM) treatment. The $K_A$ for each new gene regulatory interaction (27 total) is varied uniformly (15 points) within three orders of log10-magnitude of the default value (half the regulating protein species concentration), and both stimulation conditions are simulated. The sum-of-squared errors between simulation and data is evaluated for each value, and the value giving the minimum error is chosen. In some cases, the value with minimum error is manually adjusted between the originally sampled values to achieve a better fit. These fitted $K_A$ parameter values are reported in the runModel notebook.
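The one-parameter-at-a-time scan can be sketched as follows. This is not the authors' implementation. Several details are assumptions: the grid is taken to be log-uniform and to span three decades on each side of the default (one reading of "within three orders of log10-magnitude"), the sampling times are hypothetical, and `toy_simulate` is a stand-in for a full model simulation.

```python
import math


def log_grid(default, n_points=15, decades_each_side=3.0):
    """Log-uniform grid centered on the default value (assumed interpretation)."""
    low = math.log10(default) - decades_each_side
    step = 2.0 * decades_each_side / (n_points - 1)
    return [10 ** (low + i * step) for i in range(n_points)]


def sse(simulated, observed):
    """Sum of squared errors between simulated and observed values."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def toy_simulate(K, times):
    """Stand-in for a model run: a saturating fold-change curve controlled by K."""
    return [1.0 + t / (K + t) for t in times]


def fit_one(default, times, data, simulate):
    """Return the grid value that minimizes the SSE against the data."""
    return min(log_grid(default), key=lambda K: sse(simulate(K, times), data))


TIMES = [1.0, 4.0, 8.0, 24.0, 48.0]    # hypothetical sampling times (h)
GRID = log_grid(1.0)
TRUE_K = GRID[9]                        # synthetic "truth" placed on the grid
DATA = toy_simulate(TRUE_K, TIMES)      # synthetic data, not LINCS measurements
best_K = fit_one(1.0, TIMES, DATA, toy_simulate)
```

In the actual procedure, the error would be summed over both stimulation conditions, and the selected value could then be adjusted manually between grid points.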

Large-scale mechanistic models can become larger and more precise by expansion using machine learned relationships
There are only a handful of large-scale (hundreds of genes, thousands of species) mechanistic signaling pathway models in the literature (Fröhlich et al., 2018). Usually, such big models are constructed by bottom-up modeling or by semi-manual stitching of previously published models (Bouhaddou et al., 2018). Both approaches are time consuming, manually curated, and biased in which model components they include or exclude: genes, proteins, post-translational modifications, interactions, or even cellular compartments. Here, we tackle this "what-to-add" problem by using association networks inferred via data-driven machine learning algorithms.
The Mechanistic Modeling with Machine Learning (MEMMAL) tool presented here (Figure 2) consists of scripts to expand mechanistic models created using the SPARCED pipeline (Erdem et al., 2022b) with candidate connections generated by MOBILE, a recent pipeline for multi-omics data integration (Erdem et al., 2022a). However, other tools and models could be used in their place; these are used simply to demonstrate the approach. For now, the MEMMAL Jupyter notebooks process these new connection candidates to update SPARCED input files, taking advantage of their modular structure for model building (github.com/birtwistlelab/SPARCED/tree/develop). Here, we combine novel connections inferred via MOBILE with the large-scale SPARCED mechanistic model to add an immune-checkpoint-related sub-module to the existing pan-cancer model. This allows us to study the effects of the newly added gene products on the regulation of Interferon Regulatory Factor 1 (gene name IRF1) and Programmed Death Ligand 1 (PD-L1, gene name CD274) upon interferon-gamma (IFNγ, gene name IFNG) stimulation.

MOBILE pipeline integrated LINCS MCF10A multi-omics dataset to infer ligand-specific associations
The normal-like breast epithelial cell line MCF10A was recently profiled with multiple assay types under multiple ligand stimulation conditions (Gross et al., 2022). Using this newly released multi-omics dataset, our lab introduced the MOBILE pipeline for data integration and showed how ligand-specific associations can be inferred (Erdem et al., 2022a). One of the ligands included in the LINCS study that induced MCF10A growth inhibition was interferon-gamma (Gross et al., 2022). We previously analyzed the LINCS MCF10A dataset to find IFNγ-specific associations that nominate novel connections with the PD-L1 (gene name CD274) axis (Erdem et al., 2022a). IFNγ can induce transient expression of PD-L1, a transmembrane protein that binds to its receptor PD-1 on T-cells (Abiko et al., 2015; Thiem et al., 2019; Ju et al., 2020). This binding inhibits tumor clearance, and therapies targeting these proteins form a new class of anti-cancer drugs: the immune checkpoint inhibitors (Gong et al., 2018). However, inter- and intra-tumor variability of PD-L1 expression results in heterogeneous patient responses and makes response prediction a challenge (Wu et al., 2019). A more thorough understanding of the regulatory mechanism of PD-L1 expression could help inform new immunotherapeutic drugs or treatment options.
Applying MOBILE, we generated a data-driven IFNγ-specific integrated associations network, which had 297 nodes (genes) and 321 edges (connections) (Figure 2B and Supplementary Figure S1). We further filtered this network by looking for connections with STAT1 (the only gene overlapping with the mechanistic model). The final list of candidate connections had nine genes (ACSL5, BST2, CD274, CLIC2, FAM83D, HIST2H2AA3, IRF1, METAP2, and STAT1) and 14 connections. The list is imported into the SPARCED environment to start altering the existing mechanistic model structure (Figure 2B and Supplementary Figure S1).

SPARCED modeling makes mechanistic model expansions easy
SPARCED is a recent software package (Erdem et al., 2022b) and modeling framework for large-scale mechanistic modeling. It enables SBML model file creation using simple text files as input with minimal coding requirements. Jupyter notebooks (Kluyver et al., 2016) are used to process the input files and to create the model files. The software was first built to replicate the largest mammalian single-cell mechanistic model of proliferation and death signaling (Bouhaddou et al., 2018) and was then expanded to include a new sub-module of IFNγ signaling (Yamada et al., 2003). So, the starting mechanistic model in this work, SPARCED-IFNG-SOCS1, already includes an IFNγ submodule (Figure 3A, gray background), with a total of 149 genes, 1,302 species, and 3,584 ratelaws (Figure 3B).

MEMMAL incorporates MOBILE-inferred gene-level statistical associations into SPARCED as gene regulatory mechanisms
The list of candidate connections from the MOBILE pipeline is processed via the MEMMAL enlargeModel notebook to add rows to and update the SPARCED input files (Figure 2B). As a default SPARCED requirement, each gene node from the MOBILE list is used to create active gene, inactive gene, mRNA, and protein species, with the relevant basic reactions: gene switching, transcription, translation (Figure 3C, black arrows), mRNA degradation, and protein degradation. Importantly, the MOBILE-inferred connections are interpreted as transcriptional activator and repressor (TAR) reactions (Figure 3C) because they are obtained from mRNA-protein and chromatin region-mRNA dataset pairs. A logical way for a protein to affect another mRNA's expression level is transcriptional regulation. Additionally, a highly open chromatin region can permit transcription, which potentially yields higher mRNA expression and thus another gene regulatory connection. So, all candidate associations are treated as TARs in the current MEMMAL pipeline. In future work, users should decide how to handle such connections.
Negative-valued associations are treated as inhibitory, whereas positive-valued associations are added as activators (Figure 3C, gray and red arrows). Some of the transcriptional activators are labeled as "integrative links" because they connect existing SPARCED model genes with the new gene species (Figure 3C, red arrows). After all the input files are updated, the createModel_o4a Jupyter notebook is used to create and compile the new SBML model file (Figure 2B). The MEMMAL expansion of SPARCED via the MOBILE-inferred network resulted in the addition of eight genes, 16 species, 16 signaling reactions, and 27 transcriptional regulatory mechanisms (Figures 3A, B). With this addition, the SPARCED model now includes an IFNγ-PD-L1 submodule (Figure 3A, green background).
Following model expansion, we first verified that the model can recapitulate previous observations (Figure 3D). We show that inclusion of the new species and reactions did not alter the canonical STAT1-SOCS1 response to IFNγ stimulation. Previous studies have shown that, in response to IFNγ, STAT1 and SOCS1 show transient activation over several hours followed by damped oscillations before reaching a steady state slightly higher than baseline levels (Yamada et al., 2003). In the model, IFNγ treatment leads to transient STAT1 activation by inducing its phosphorylation, dimerization, and translocation to the nucleus (Figure 3D, top panel). The nuclear STAT1 dimer acts as an activating transcription factor for SOCS1 and induces SOCS1 mRNA production (Figure 3D, middle panel), which then causes SOCS1 protein levels to increase (Figure 3D, bottom panel). Moreover, as reported previously (Erdem et al., 2022b), IFNγ does not induce significant changes in MAPK signaling but leads to a slight decrease in the early AKT response (Supplementary Figure S2).
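The sign convention and the two-directional interpretation of each association can be sketched as follows. The record layout, the function name, and the placeholder gene names are hypothetical; the actual SPARCED input files have their own format.

```python
def associations_to_tars(associations):
    """Turn signed gene-gene associations into directed regulatory entries.

    Each association yields two entries (gene1 -> gene2 and gene2 -> gene1).
    A positive coefficient is read as activation, a negative one as repression.
    """
    tars = []
    for gene1, gene2, coefficient in associations:
        kind = "activator" if coefficient > 0 else "repressor"
        tars.append({"regulator": gene1, "target": gene2, "type": kind})
        tars.append({"regulator": gene2, "target": gene1, "type": kind})
    return tars


# Hypothetical input with placeholder gene names and made-up coefficients.
example = [("GENE_A", "GENE_B", 0.3), ("GENE_A", "GENE_C", -0.1)]
example_tars = associations_to_tars(example)
```

Entries that connect a gene already in the base model to a newly added gene would correspond to the integrative links described above.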

MEMMAL model offers exploration of the effect of novel connector genes on PD-L1 expression in response to IFNγ
Since the modified model passed these quality control checks, the next step was to fit the new unknown parameters to recapitulate experimental time-course data for the newly added genes (RNAseq: ACSL5, BST2, CLIC2, FAM83D, HIST2H2AA3, METAP2; RPPA: IRF1, PD-L1) (Figure 4A). These 27 + 16 (43 total) unknown parameters were the half-maximal concentrations for the Hill functions underlying the new gene regulatory reactions and the protein/mRNA degradation rate constants. The data show that IFNγ induces transcription of ACSL5, BST2, CLIC2, and HIST2H2AA3 and expression of both IRF1 and PD-L1, with no sustained induction of FAM83D and METAP2, and the fitted model captures these trends.
There are only two data points that the model could not capture: the 24-h time points of FAM83D and HIST2H2AA3 mRNA levels. However, the model recapitulates the increasing trend of mRNA_HIST2H2AA3 and fits the last time points for both species.
The runModel Jupyter notebook reports the final updated parameter values and contains scripts to compare simulation trajectories with LINCS data (Figure 4A).
After acceptable agreement was achieved between simulations and experimental mRNA and protein levels (Figure 4A), we simulated scenarios (Figure 4B) to explore the effects of the new genes on the IRF1 and PD-L1 responses. We wanted to nominate the new connections predicted to be most important in regulating PD-L1 expression. To do this, we compared wild-type simulations (new model with fitted parameters) to single-gene knock-out simulations (protein, gene, and mRNA levels set to zero) (Figures 4B, C).
Only BST2, FAM83D, and METAP2 knock-outs had observable effects on simulated PD-L1 and/or IRF1 dynamics (Figure 4C). Knocking out the other newly added genes (ACSL5, CLIC2, HIST2H2AA3) had no significant effects, so these results are not shown here. Perturbing BST2 caused a small decrease in initial PD-L1 levels, which later reach wild-type response levels (Figure 4C, top row). Perturbing FAM83D only slightly increased steady-state IRF1 levels (Figure 4C, middle row). Perturbing METAP2 caused a significant decrease in late IRF1 and PD-L1 responses (Figure 4C, bottom row). We summarized all these knock-out response observations with the candidate gene regulatory network in Figure 3C to show a functional network with possibly causal links only (Figure 4C). These results demonstrate that mechanistic models with machine learning-derived connections can nominate genes for follow-up experimental studies.
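A knock-out of this kind amounts to zeroing the initial levels of all species that belong to one gene before simulating. The sketch below assumes a hypothetical naming convention in which each species name ends with the gene symbol after an underscore (for example `mRNA_METAP2`); the `ag_` and `ig_` prefixes for active and inactive genes, and all numeric values, are placeholders rather than the names or values used in the model files.

```python
def knock_out(initial_conditions, gene):
    """Return a copy of the initial conditions with all species of `gene` set to zero.

    Assumes species names end with the gene symbol after the last underscore.
    """
    return {
        name: 0.0 if name.split("_")[-1] == gene else value
        for name, value in initial_conditions.items()
    }


def max_abs_difference(wild_type, perturbed):
    """Largest pointwise gap between two trajectories sampled at the same times."""
    return max(abs(w - p) for w, p in zip(wild_type, perturbed))


# Hypothetical initial conditions; names and values are placeholders.
wild_type_ic = {
    "ag_METAP2": 1.0,
    "ig_METAP2": 1.0,
    "mRNA_METAP2": 12.0,
    "METAP2": 350.0,
    "mRNA_BST2": 4.0,
    "BST2": 80.0,
}
metap2_ko_ic = knock_out(wild_type_ic, "METAP2")
```

Simulating the model from both sets of initial conditions and comparing a readout trajectory, for instance with `max_abs_difference`, gives one simple way to rank knock-outs by effect size.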

Discussion
Combining and synergizing machine learning with mechanistic modeling could bring clinically predictive computational models and personalized medicine closer to reality.
To that end, here we introduced a recipe to expand a large-scale mechanistic model with machine-learned connections between gene products. Because understanding PD-L1 regulation mechanisms would help us design better therapeutic interventions, we focused on exploring the IFNγ/PD-L1 axis. We used the LINCS MCF10A dataset and added the recently inferred (via the MOBILE pipeline) IFNγ/PD-L1 connections to the existing SPARCED mechanistic model. We were then able to study the effects of the new gene regulatory mechanisms. We showed that perturbing BST2, FAM83D, or METAP2 induces changes in PD-L1 and IRF1 dynamics.
MEMMAL could serve as an initial step towards combining mechanistic models with machine-learned potential connections by providing a rationale for such a merging protocol. The MEMMAL protocol first creates genes and gene products (mRNA and protein) if nodes in the MOBILE list are not already present in SPARCED. It then updates omics-level information for the new genes and adds the corresponding reactions. It also assigns transcriptional activators and repressors (based on the sign of the MOBILE association coefficient) and the related rate constant parameters. The updated SPARCED input files are then processed via modified default Jupyter notebooks to execute the desired simulations. In its current state, MEMMAL assumes an overlap (in genes) between the mechanistic model and the machine-learned associations. Although this is not a hard requirement, it makes logical sense because the effects of added interactions can then be explored via crosstalk mechanisms.
Although MEMMAL makes use of recent tools from our lab, the idea is applicable to other tools available in the literature. For instance, rule-based modeling software like BioNetGen (Harris et al., 2016) and PySB (Lopez et al., 2013) can also be used for mechanistic model creation and update if machine learning-predicted associations are converted into new rules. Another possible application is INDRA (Gyori et al., 2017), if the new connections are put into a suitable sentence format. Such options will be valuable for expanding the MEMMAL idea and its applications.
MEMMAL is agnostic to the approach or tool used to identify connections and to the base mechanistic model for expansion. MEMMAL can generate mechanistic ODE models by integrating connections inferred using MOBILE, databases, correlation studies (Lin et al., 2013; Min et al., 2021), kernel-based methods (Mariette and Villa-Vialaneix, 2018; Yang et al., 2018), other machine learning tools (Park et al., 2015; Zhang et al., 2018; Hulot et al., 2021), or direct experiments. For the base model, any mechanistic model that can be modified programmatically could be used. To facilitate the use of other models, we have provided a Jupyter notebook (enlargeSBMLModel) to expand any SBML model with MEMMAL-generated lists of new species, reactions, and parameters.
The MOBILE pipeline was used to infer ligand-specific and statistically robust association networks (Erdem et al., 2022a). Here we used a filtered list of connections for interferon-gamma signaling, and some of the genes among them, including BST2, CLIC2, and FAM83D, have already been shown to be associated with immunotherapeutic signatures (Wang et al., 2013; Walian et al., 2016; Xu et al., 2020; Zhou et al., 2020; Mei et al., 2021). In short, BST2 is part of an anti-CTLA4 response in melanoma (Mei et al., 2021) and CLIC2 is a favorable prognosis biomarker (Xu et al., 2020). FAM83D functions in cell growth regulation and is a prognostic marker for multiple cancer types (Wang et al., 2013; Walian et al., 2016).
In addition to such pieces of literature support, we can take a step further to explore their mechanistic functionalities by combining these genes and their predicted connections as new interactions in a computational model.
The investigation of the effects of the new genes (via knock-out simulations) was carried out after fitting the new reaction parameter values to match experimental time course data. The simple semi-automated fitting procedure in this work resulted in a set of parameter values, reported in the runModel notebook, but their identifiability is not guaranteed. Because the effects of single-gene knock-out simulations depend on these values, a more extensive parameter exploration would build confidence in the predictions of which genes are more important for PD-L1 regulation. Indeed, the AMICI package (Fröhlich et al., 2020) used by SPARCED enables users to do such high-level parameter estimation studies.
In conclusion, the MEMMAL pipeline provides a starting point for merging large-scale mechanistic models with big-data-based association networks. We used MEMMAL to test novel candidate interactions for their effect on regulating IRF1 and PD-L1 expression and found that METAP2 is a good candidate yet to be studied experimentally. We believe combining big data, machine learning, and mechanistic models is a valuable direction to unravel novel context-specific mechanisms.

FIGURE 1. Different computational modeling types of biological data possess a variety of pros and cons and provide an opportunity for model merging. The mechanistic models are mostly curated, usually small-scale, causal networks of signaling pathways. Machine learning models are data-driven, large-scale, and usually correlative associations. Combining these two modes of modeling provides an opportunity for creating larger scale data-informed models to generate novel hypotheses for experimental validation. The merged model would include curated lists of pathway genes (species) as well as genes with new connections inferred via machine learning models. The final model structure could represent a collection of overlapping genes (and gene products) and interactions present in both lists.

FIGURE 2. MEMMAL is a pipeline to merge mechanistic modeling with machine learning. (A) The MEMMAL pipeline combines mechanistic models created by SPARCED with association networks generated via the MOBILE pipeline. (B) The recipe for the MEMMAL pipeline starts by obtaining a set of connections not present in the candidate mechanistic model. Here, the novel gene-level connections list is inferred via the MOBILE tool and then filtered for overlap with SPARCED model genes. Next, this candidate network is imported into the SPARCED environment, where the MEMMAL enlargeModel Jupyter notebook processes the network file and updates SPARCED input files. The nodes (genes) of the IFNG/PD-L1 subnetwork are used to create new genes and species (mRNAs, proteins, phosphoproteins) for SPARCED. The new genes can get activated/inactivated as described in SPARCED. The expanded MEMMAL model is created and compiled by the default SPARCED model notebooks. The final step in MEMMAL is to run user-defined exploratory simulations to gain insights into the effects of the newly added connections.

FIGURE 3. MOBILE-inferred IFNγ/PD-L1 network nodes and connections are inserted into the SPARCED-IFNG model using MEMMAL. (A) The SPARCED network is enlarged to include a sub-network spanning innate immune response and PD-L1 regulation. (B) The final MEMMAL model has 157 genes, 1,318 species, 3,600 ratelaws, 60 TARs, and 3,885 parameters. (C) The reactions added into SPARCED include translation (black arrows). The connections from MOBILE are modeled as transcriptional activation and repression (TAR) reactions in MEMMAL (gray arrows). The TAR reactions linking existing SPARCED species with the newly added species are represented as integrative links (red arrows). (D) The final MEMMAL model recapitulates canonical transient STAT1 and SOCS1 activation in response to IFNγ stimulation in MCF10A cells (Erdem et al., 2022b). Normalized simulation trajectories of the activated nuclear STAT1 dimer (STAT1*Dn), SOCS1 mRNA (mRNA_SOCS1), and free SOCS1 protein (SOCS1) are shown (solid gray lines).

FIGURE 4. MEMMAL can replicate the previous SPARCED-IFNG model and offers new insights into IFNγ regulation of IRF1 and PD-L1 dynamics. (A) MEMMAL model parameters are fitted to recapitulate experimental data from LINCS RNAseq and RPPA assays. Fold-changes are shown for data (dots, crosses, and error bars, STD) and simulations (solid lines). Most mRNAs, as well as IRF1 and PD-L1 (gene name CD274), are induced by IFNγ. (B) Simulation scenarios to test the effects of newly added genes. The parameter-fitted MEMMAL model is simulated with the reported perturbation under IGF1 stimulation (basal growth condition) and then stimulated with additional EGF + IFNγ for 48 h. (C) Comparison of the complete gene knock-out perturbation scenario (dotted lines) to the wild-type condition (no perturbation, black lines) shows genes that induce IRF1 and PD-L1 changes. Among the newly added genes, METAP2 induces the greatest change: a complete recession of the IRF1 response and a decreased PD-L1 steady-state level. The network diagram (summary of Figure 3C) shows the connections among functional genes and STAT1, with non-functional edges faded out.