<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Food. Sci. Technol.</journal-id>
<journal-title>Frontiers in Food Science and Technology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Food. Sci. Technol.</abbrev-journal-title>
<issn pub-type="epub">2674-1121</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">996399</article-id>
<article-id pub-id-type="doi">10.3389/frfst.2022.996399</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Food Science and Technology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Knowledge-informed data-driven modeling for sparse identification of governing equations for microbial inactivation processes in food</article-title>
<alt-title alt-title-type="left-running-head">Zhang et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frfst.2022.996399">10.3389/frfst.2022.996399</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Steve</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1990490/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ahamed</surname>
<given-names>Firnaaz</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1522166/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Song</surname>
<given-names>Hyun-Seob</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/186938/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Department of Biological Systems Engineering</institution>, <institution>University of Nebraska&#x2013;Lincoln</institution>, <addr-line>Lincoln</addr-line>, <addr-line>NE</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Department of Food Science and Technology</institution>, <institution>Nebraska Food for Health Center</institution>, <institution>University of Nebraska&#x2013;Lincoln</institution>, <addr-line>Lincoln</addr-line>, <addr-line>NE</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1105424/overview">Muhammad Sajid Arshad</ext-link>, Government College University, Faisalabad, Pakistan</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1504978/overview">Jiajia Chen</ext-link>, The University of Tennessee, Knoxville, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/489202/overview">Qingli Dong</ext-link>, University of Shanghai for Science and Technology, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Hyun-Seob Song, <email>hsong5@unl.edu</email>
</corresp>
<fn fn-type="equal" id="fn1">
<label>
<sup>&#x2020;</sup>
</label>
<p>These authors have contributed equally to this work and share first authorship</p>
</fn>
<fn fn-type="other">
<p>This article was submitted to Food Safety and Quality Control, a section of the journal Frontiers in Food Science and Technology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>10</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>996399</elocation-id>
<history>
<date date-type="received">
<day>17</day>
<month>07</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>09</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Zhang, Ahamed and Song.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Zhang, Ahamed and Song</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Prevention of the growth of harmful microorganisms in food products is an important requirement for ensuring food safety and quality. Mathematical models that predict quantitative changes in microbial populations in food in response to variations in environmental conditions are useful tools in this regard. While equations for microbial inactivation have typically been formulated based on polynomial functions, empirical choice of the model order and terms not only results in over- or underfitting, but also makes it difficult to identify key factors governing the target variable. To address this issue, we present a data-driven modeling pipeline that enables 1) automatic discovery of model equations through parsimonious selection of relevant terms from a pre-built library and 2) subsequent evaluation of the impacts of individual terms on the model output. Through case studies using literature data, we evaluated the effectiveness of our pipeline in predicting the <italic>D</italic>-value (i.e., the time taken to reduce the microbial population to 10% of the initial level) as a function of multiple factors including temperature, pH, water activity, NaCl content, and phosphate level. In doing this, we determined basic functional forms of input and output variables based on their known relationships, e.g., by accounting for the Arrhenius dependence of the <italic>D</italic>-value on temperature. Incorporation of such theoretical knowledge into the pipeline improved model accuracy. Using the Akaike information criterion, we optimally determined hyperparameters that control a trade-off between model accuracy and sparsity. We found the literature models benchmarked in this study to be over- or under-determined and consequently proposed better structured and more accurate equations. The subsequent global sensitivity analysis allowed us to evaluate the context-dependent impacts of key factors on the <italic>D</italic>-value. The pipeline presented in this work is readily applicable to many other related nonlinear systems without being limited to microbial inactivation datasets.</p>
</abstract>
<kwd-group>
<kwd>food safety and security</kwd>
<kwd>data-driven modeling</kwd>
<kwd>microbial inactivation</kwd>
<kwd>global sensitivity analysis</kwd>
<kwd>information-theoretic criteria</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Food is vulnerable to contamination by pathogens and spoilers. Pathogens in contaminated food induce foodborne diseases, while spoilers deteriorate the quality of food by changing the biochemical properties of food materials (<xref ref-type="bibr" rid="B12">Lianou et al., 2016</xref>). The invasion of those harmful microorganisms can take place anytime throughout the lifecycle of food including production, processing, distribution, storage, and preservation (<xref ref-type="bibr" rid="B12">Lianou et al., 2016</xref>). Treating food under extreme conditions is known to render microbes inert, but this is not an ideal solution because of adverse effects on texture, taste, and flavor; denaturation of nutrients (e.g., vitamin A); and excessive energy demand (<xref ref-type="bibr" rid="B3">Amit et al., 2017</xref>). Because complete removal of pathogens and spoilers from food is therefore often infeasible, suppressing them to a safe, low level by refining treatment methods and conditions is essential for ensuring food safety and quality. Determining optimal conditions to control the growth of harmful microorganisms thus requires meeting multiple objectives that are often contradictory (<xref ref-type="bibr" rid="B13">Madoumier et al., 2019</xref>). While many alternative microbial inactivation technologies with mild processing conditions have emerged, such as high-pressure processing (<xref ref-type="bibr" rid="B17">Podolak et al., 2020</xref>), pulsed light inactivation (<xref ref-type="bibr" rid="B4">Art&#xed;guez et al., 2011</xref>), and various non-thermal methods (<xref ref-type="bibr" rid="B14">Ma&#xf1;as and Pag&#xe1;n, 2005</xref>), accurate evaluation of the relative influences of the associated process factors remains challenging due to the lack of a tractable and generalizable approach to analyze the process mechanics.</p>
<p>Mathematical models are indispensable tools for predicting and optimizing microbial inactivation processes in food. Accurate modeling of microbial growth or inactivation is a difficult task due to its complex dependence on numerous internal (e.g., water activity, pH, composition, and preservatives) and external (e.g., temperature and humidity) food conditions (<xref ref-type="bibr" rid="B2">Akkermans et al., 2020</xref>). Appropriate consideration of the functional relationships between microbial populations and such intrinsic and extrinsic parameters is critical for model performance. Microbial inactivation models are often built on fitted polynomial equations, while other forms such as Arrhenius or square root relationships have also been considered (<xref ref-type="bibr" rid="B21">Whiting, 1995</xref>; <xref ref-type="bibr" rid="B18">Ross and Dalgaard, 2003</xref>). Typical modeling efforts using polynomial equations have focused on determining optimal parameter values (i.e., coefficients of <italic>pre-chosen</italic> terms) through data fitting. However, this approach cannot ensure robust development of microbial inactivation models because an inadequate equation structure can lead to poor performance in data fitting and prediction due to intrinsic <italic>structural error</italic> that cannot be compensated for through parameter estimation (<xref ref-type="bibr" rid="B11">Kaplan, 2002</xref>). Moreover, empirical determination of governing terms often does not scale well with an increasing number of process variables, necessitating a more systematic, rational approach.</p>
<p>Sparse Identification of Nonlinear Dynamics (SINDy) (<xref ref-type="bibr" rid="B7">Brunton et al., 2016</xref>) is a promising approach that enables automatic discovery of model equations without having to assume model structure <italic>a priori</italic>, making it distinct from typical approaches that focus on estimating optimal values of the parameters through data fit in a pre-defined function. SINDy allows the use of a library of input variables (that potentially affect the output variables of interest) to identify the model structure by linear combinations of the terms in the library. Following the Occam&#x2019;s razor principle postulating that the simplest explanation generally tends to be the correct representation (<xref ref-type="bibr" rid="B5">Blumer et al., 1987</xref>; <xref ref-type="bibr" rid="B19">Song et al., 2013</xref>), SINDy promotes parsimony in model identification based on a minimal subset of terms.</p>
<p>In this work, we present a data-driven modeling pipeline utilizing SINDy for robust development of microbial inactivation models for application in food safety and quality. While the original goal of SINDy is to identify sparse models of nonlinear dynamical systems, we apply it to non-dynamical systems through appropriate reformulation (see Methods). For demonstration, we considered case studies of modeling the change in <italic>D</italic>-values (the time taken for a 90% reduction in microbial population) under variations of multiple factors including temperature, pH, water activity, NaCl content, and phosphate level. Built on SINDy, our modeling pipeline has three major additional features: 1) incorporation of theoretical knowledge on the relationships between basic input and output variables, e.g., by accounting for the temperature dependence of the <italic>D</italic>-value following the Arrhenius equation; 2) rational determination of hyperparameters (such as the polynomial order and the sparsity-controlling parameter) based on an information-theoretic metric for an optimal balance between model accuracy and sparsity; and 3) integration with global sensitivity analysis to evaluate the effects of key factors on model outputs. Our analysis showed that the benchmark models in the literature considered in this work are mostly over- or underfitted. Using our approach, we were therefore able to propose better structured models with improved accuracy and less complexity.</p>
</sec>
<sec id="s2">
<title>2 Materials and methods</title>
<p>Identification of model structure (i.e., functional forms of the relationship between input and output variables) is challenging as there are many possible solutions to formulate a specific model from a given dataset. In this section, we describe how systematic identification of model equations and key variables/terms governing microbial inactivation can be enabled by an advanced data-driven approach called SINDy (<xref ref-type="bibr" rid="B7">Brunton et al., 2016</xref>) in conjunction with global sensitivity analysis, respectively.</p>
<sec id="s2-1">
<title>2.1 Essence of sparse identification of nonlinear dynamics</title>
<p>SINDy was originally developed to discover governing equations for nonlinear dynamical systems; here, we reconfigure it for non-dynamical systems as follows:<disp-formula id="e1">
<mml:math id="m1">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf1">
<mml:math id="m2">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the vector of state variables, and <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denotes the nonlinear relationship between the input (<inline-formula id="inf3">
<mml:math id="m4">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) and output variables (<inline-formula id="inf4">
<mml:math id="m5">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>). SINDy approximates <inline-formula id="inf5">
<mml:math id="m6">
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> by a weighted linear combination of nonlinear terms, e.g., for the <inline-formula id="inf6">
<mml:math id="m7">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> output variable:<disp-formula id="e2">
<mml:math id="m8">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2248;</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>k</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>&#x200a;</mml:mtext>
<mml:msub>
<mml:mi>&#x3be;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <inline-formula id="inf7">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf8">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3be;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denote the <inline-formula id="inf9">
<mml:math id="m11">
<mml:mrow>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> term and its weight, respectively. The above equations can be represented in a more succinct form as matrices, i.e.,<disp-formula id="e3">
<mml:math id="m12">
<mml:mrow>
<mml:mi mathvariant="bold">Y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi mathvariant="bold">&#x39e;</mml:mi>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where <inline-formula id="inf10">
<mml:math id="m13">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mn mathvariant="bold">1</mml:mn>
</mml:msub>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mn mathvariant="bold">2</mml:mn>
</mml:msub>
<mml:mo>&#x22ef;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">m</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf11">
<mml:math id="m14">
<mml:mrow>
<mml:mi mathvariant="bold">Y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mn mathvariant="bold">1</mml:mn>
</mml:msub>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mn mathvariant="bold">2</mml:mn>
</mml:msub>
<mml:mo>&#x22ef;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf12">
<mml:math id="m15">
<mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is a library of candidate functions of <inline-formula id="inf13">
<mml:math id="m16">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and the matrix of weights <inline-formula id="inf14">
<mml:math id="m17">
<mml:mrow>
<mml:mi mathvariant="bold">&#x39e;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3be;</mml:mi>
<mml:mn mathvariant="italic">1</mml:mn>
</mml:msub>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3be;</mml:mi>
<mml:mn mathvariant="italic">2</mml:mn>
</mml:msub>
<mml:mo>&#x22ef;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3be;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. In SINDy, the library <inline-formula id="inf15">
<mml:math id="m18">
<mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is built by polynomial expansion of input variables <inline-formula id="inf16">
<mml:math id="m19">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, i.e., <inline-formula id="inf17">
<mml:math id="m20">
<mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mn mathvariant="bold">1</mml:mn>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msup>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mn mathvariant="italic">2</mml:mn>
</mml:msup>
<mml:mo>&#x22ef;</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi>d</mml:mi>
</mml:msup>
<mml:mo>&#x22ef;</mml:mo>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> where <inline-formula id="inf18">
<mml:math id="m21">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi>d</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> denotes a matrix with column vectors of all possible <inline-formula id="inf19">
<mml:math id="m22">
<mml:mrow>
<mml:msup>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>-degree monomials in the state variable <inline-formula id="inf20">
<mml:math id="m23">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
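<p>To make the structure of the library concrete, the following minimal Python sketch assembles a polynomial library of the kind described above using NumPy. The function name <monospace>polynomial_library</monospace>, the column labels, and the toy input values are hypothetical choices made for illustration only; they are not taken from the implementation used in this work.</p>
<preformat><![CDATA[
```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_library(X, degree):
    """Return (Theta, names): all monomials of the columns of X up to `degree`.

    X has shape (n_samples, n_inputs). The first column of Theta is the
    constant term, followed by all degree-1 monomials, degree-2 monomials, ...
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_inputs = X.shape
    columns = [np.ones(n_samples)]
    names = ["1"]
    for d in range(1, degree + 1):
        # Each index tuple (i, j, ...) defines the monomial x_i * x_j * ...
        for idx in combinations_with_replacement(range(n_inputs), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
            names.append("*".join(f"x{i}" for i in idx))
    return np.column_stack(columns), names


# Toy example with two inputs and three samples (illustrative values only).
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Theta_demo, names_demo = polynomial_library(X_demo, degree=2)
print(names_demo)
print(Theta_demo.shape)
```
]]></preformat>
<p>For <italic>m</italic> input variables and a maximum degree <italic>d</italic>, a library built this way contains C(<italic>m</italic> + <italic>d</italic>, <italic>d</italic>) columns, so the number of candidate terms grows quickly with both quantities. This growth is one reason a sparsity-promoting regression step is needed.</p>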
<p>SINDy seeks a parsimonious model composed of as few terms as possible without compromising model accuracy. Sparse regression methods such as Sequentially Thresholded Least Squares (STLS) and the Least Absolute Shrinkage and Selection Operator (LASSO) can be used in SINDy for this purpose (<xref ref-type="bibr" rid="B7">Brunton et al., 2016</xref>). In this work, we employ STLS, where <inline-formula id="inf21">
<mml:math id="m24">
<mml:mrow>
<mml:mi mathvariant="bold">&#x39e;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref> retains only the coefficients (weights) whose magnitudes exceed the prescribed threshold <inline-formula id="inf22">
<mml:math id="m25">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (otherwise, zero weights are assigned), such that only the terms in the library with significant influence on the outputs are included in the final model structure. Here, <inline-formula id="inf23">
<mml:math id="m26">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is known as the sparsity-promoting knob because the model sparsity increases with higher values of <inline-formula id="inf24">
<mml:math id="m27">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, while model accuracy may decrease.</p>
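<p>The thresholding procedure described above can be sketched in a few lines of Python. This is a minimal illustration for a single output variable, assuming a fixed number of threshold-and-refit passes; the function name <monospace>stls</monospace>, the iteration count, and the synthetic data are hypothetical and do not reproduce the implementation used in this work.</p>
<preformat><![CDATA[
```python
import numpy as np


def stls(Theta, y, lam, n_iter=10):
    """Sequentially thresholded least squares for a single output vector y."""
    Theta = np.asarray(Theta, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start from the ordinary least-squares solution (no thresholding yet).
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < lam   # terms judged insignificant
        xi[small] = 0.0            # assign them zero weight
        active = ~small
        if not active.any():
            break                  # every term was thresholded out
        # Refit using only the surviving library terms.
        xi[active] = np.linalg.lstsq(Theta[:, active], y, rcond=None)[0]
    return xi


# Synthetic check (illustrative values): y = 2 + 3*x0*x1 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 2))
Theta = np.column_stack([np.ones(100), x[:, 0], x[:, 1], x[:, 0] * x[:, 1]])
y = 2.0 + 3.0 * x[:, 0] * x[:, 1] + 0.01 * rng.standard_normal(100)
xi = stls(Theta, y, lam=0.2)
print(xi)
```
]]></preformat>
<p>In this synthetic example, the two linear terms are not part of the generating equation, so their fitted weights fall below the threshold and are set to zero, while the constant and interaction terms are retained and refit. Raising &#x3bb; removes more terms, at the possible cost of accuracy.</p>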
</sec>
<sec id="s2-2">
<title>2.2 Application of SINDy to microbial inactivation modeling</title>
<p>We use SINDy to formulate microbial inactivation as functions of various process variables including temperature (<inline-formula id="inf25">
<mml:math id="m28">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>), pH, water activity (<inline-formula id="inf26">
<mml:math id="m29">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), NaCl content (<inline-formula id="inf27">
<mml:math id="m30">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), and phosphate level (<inline-formula id="inf28">
<mml:math id="m31">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), which are all known to significantly influence microbial growth rate (<xref ref-type="bibr" rid="B10">Juneja et al., 1995</xref>; <xref ref-type="bibr" rid="B9">Cerf et al., 1996</xref>). We employ the <italic>D</italic>-value (i.e., the time for the microbial population to shrink to 10% of its initial level) as a standard measure of microbial inactivation and take it as the target variable to predict with SINDy. With a single target variable chosen, <xref ref-type="disp-formula" rid="e3">Eq. 3</xref> reduces to<disp-formula id="e4">
<mml:math id="m32">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi mathvariant="bold">&#x3be;</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>While SINDy offers the flexibility to pick any nonlinear terms for input and output variables, we determine their specific functional forms based on known mechanistic knowledge and characteristics of the system. Therefore, we used a vector of <inline-formula id="inf29">
<mml:math id="m33">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi mathvariant="bold">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> as <inline-formula id="inf30">
<mml:math id="m34">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (instead of a vector of <inline-formula id="inf31">
<mml:math id="m35">
<mml:mrow>
<mml:mi mathvariant="bold">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) and determined <inline-formula id="inf32">
<mml:math id="m36">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to be [<inline-formula id="inf33">
<mml:math id="m37">
<mml:mrow>
<mml:mn mathvariant="bold-italic">1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, <bold>pH</bold>, <inline-formula id="inf34">
<mml:math id="m38">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mi mathvariant="bold">w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf35">
<mml:math id="m39">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi mathvariant="bold">N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf36">
<mml:math id="m40">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>] (i.e., the use of <bold>1/T</bold>, instead of <inline-formula id="inf37">
<mml:math id="m41">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>). The rationale for our choice of functional forms of output and input variables is detailed in <xref ref-type="sec" rid="s3-1">Section 3.1</xref>.</p>
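<p>The effect of this variable transformation can be illustrated with a short numerical sketch. It assumes an Arrhenius-type dependence of the <italic>D</italic>-value on absolute temperature (in kelvin) and a base-10 logarithm; the activation energy, the offset, and the temperatures below are hypothetical values chosen only for illustration and are not taken from the datasets used in this work.</p>
<preformat><![CDATA[
```python
import numpy as np

R = 8.314          # gas constant, J/(mol K)
Ea_true = 3.0e5    # hypothetical activation energy, J/mol
T = np.array([328.15, 333.15, 338.15, 343.15])  # hypothetical temperatures, K

# Hypothetical Arrhenius-type D-values: ln D = c + Ea / (R T), with c = -100.
D = np.exp(-100.0 + Ea_true / (R * T))

# With y = log10(D) and x = 1/T the relationship is linear, so a first-order
# library [1, 1/T] is enough to represent the temperature dependence.
A = np.column_stack([np.ones_like(T), 1.0 / T])
intercept, slope = np.linalg.lstsq(A, np.log10(D), rcond=None)[0]

# The fitted slope equals Ea / (R ln 10) under the stated assumption.
Ea_fit = slope * R * np.log(10.0)
print(Ea_fit)
```
]]></preformat>
<p>Under this assumption, the straight-line fit in the transformed variables recovers the activation energy used to generate the data, whereas expressing the same dependence in terms of untransformed <italic>D</italic> and <italic>T</italic> would require higher-order polynomial terms.</p>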
</sec>
<sec id="s2-3">
<title>2.3 Tuning model sparsity and accuracy based on an information-theoretic criterion</title>
<p>We tune the order of combination of primitive process variables and the sparsity index, <inline-formula id="inf38">
<mml:math id="m42">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, in stages. We first determine the maximum order of combination with <inline-formula id="inf39">
<mml:math id="m43">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> (which will result in a non-parsimonious model), beyond which there are no significant improvements to model accuracy. Subsequently, by retaining the maximum polynomial order, we employ the maximum <inline-formula id="inf40">
<mml:math id="m44">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> that does not significantly compromise the model accuracy. To facilitate determining optimal polynomial order and <inline-formula id="inf41">
<mml:math id="m45">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> values for a balanced compromise between model accuracy and sparsity, we use an information-theoretic metric, the Akaike Information Criterion (AIC) (<xref ref-type="bibr" rid="B1">Akaike, 1998</xref>). Specifically, we use the second-order information criterion, which includes a correction term to alleviate the bias that may arise if the number of model parameters is large relative to the number of sample datapoints (<xref ref-type="bibr" rid="B8">Burnham and Anderson, 2002</xref>):<disp-formula id="e5">
<mml:math id="m46">
<mml:mrow>
<mml:mi mathvariant="normal">A</mml:mi>
<mml:mi mathvariant="normal">I</mml:mi>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mi>ln</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">M</mml:mi>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">E</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>K</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>K</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>K</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>where MSE denotes mean squared error, <inline-formula id="inf42">
<mml:math id="m47">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the number of sample datapoints, <inline-formula id="inf43">
<mml:math id="m48">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the number of model parameters, and the third term on the right-hand side is a bias correction that tends to zero when <inline-formula id="inf44">
<mml:math id="m49">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x226b;</mml:mo>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Generally, the model with the lowest AIC score is preferred, because AIC penalizes a model based on the relative balance between its error and complexity (<inline-formula id="inf45">
<mml:math id="m50">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>). The formulation above is often denoted as AICc in the literature. The methodical implementation of the general guidelines to develop microbial inactivation models is demonstrated in <xref ref-type="sec" rid="s3-2">Section 3.2</xref>.</p>
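<p>As a worked illustration of Eq. 5, the following Python sketch evaluates the criterion for two hypothetical candidate models. The function name <monospace>aicc</monospace> and all numerical values are assumptions made for this example; counting only the nonzero coefficients as the model parameters <italic>K</italic> is likewise an assumption of the sketch.</p>
<preformat><![CDATA[
```python
import numpy as np


def aicc(y_true, y_pred, n_params):
    """Second-order AIC of Eq. 5: n ln(MSE) + 2K + 2K(K + 1) / (n - K - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    k = int(n_params)
    if n - k - 1 <= 0:
        raise ValueError("the correction term requires n > K + 1")
    mse = np.mean((y_true - y_pred) ** 2)
    return n * np.log(mse) + 2 * k + 2 * k * (k + 1) / (n - k - 1)


# Toy comparison (illustrative numbers): a 2-term model versus a 4-term model
# that fits the same n = 12 datapoints only marginally better.
y_obs = np.arange(1.0, 13.0)
resid_sparse = 0.10 * np.array([1, -1] * 6)
resid_dense = 0.09 * np.array([1, -1] * 6)
score_sparse = aicc(y_obs, y_obs + resid_sparse, n_params=2)
score_dense = aicc(y_obs, y_obs + resid_dense, n_params=4)
print(score_sparse, score_dense)
```
]]></preformat>
<p>In this toy comparison the sparser model attains the lower score: the small reduction in error achieved by the denser model does not offset the penalty on its two additional parameters.</p>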
</sec>
<sec id="s2-4">
<title>2.4 Density-based global sensitivity analysis</title>
<p>We perform sensitivity analyses on our models as an alternative to the arduous task of assessing the relative effects of the process variables on microbial inactivation directly from highly distributed experimental data. Because the models are linear combinations of nonlinear terms and the datasets used in this work span a wide parameter space, the models are likely to exhibit stiff parameter dependencies. Therefore, we employ a density-based global sensitivity analysis approach called PAWN (<xref ref-type="bibr" rid="B16">Pianosi and Wagener, 2015</xref>) instead of a local sensitivity approach. In this approach, the absolute deviation is calculated between an unconditional cumulative distribution function of the model output, <inline-formula id="inf46">
<mml:math id="m51">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where all input variables in the model are randomly sampled simultaneously over the whole parameter space, and the conditional cumulative distribution functions, <inline-formula id="inf47">
<mml:math id="m52">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, which are constructed by randomly sampling all variables except the single model variable of interest, which is fixed at its <inline-formula id="inf48">
<mml:math id="m53">
<mml:mrow>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> nominal value, <inline-formula id="inf49">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. The sensitivity index for the <inline-formula id="inf50">
<mml:math id="m55">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th model variable, <inline-formula id="inf51">
<mml:math id="m56">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, is defined as the maximum of the absolute deviations obtained over a range of <inline-formula id="inf52">
<mml:math id="m57">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> nominal values:<disp-formula id="e6">
<mml:math id="m58">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mi>max</mml:mi>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:munder>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>;</mml:mo>
<mml:mtext>&#x2003;</mml:mtext>
<mml:mi>K</mml:mi>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mi>max</mml:mi>
<mml:mi>y</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>X</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<p>Here, <inline-formula id="inf53">
<mml:math id="m59">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the Kolmogorov&#x2013;Smirnov statistic, <inline-formula id="inf54">
<mml:math id="m60">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denotes the model-estimated <italic>D</italic>-value, and the variable <inline-formula id="inf55">
<mml:math id="m61">
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula id="inf56">
<mml:math id="m62">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> element of <inline-formula id="inf57">
<mml:math id="m63">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
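<p>The calculation in Eq. 6 can be sketched in a few lines of Python. This is a minimal illustration only: it assumes uniform sampling within user-supplied bounds and randomly drawn nominal values, and the names <monospace>pawn_index</monospace>, <monospace>ks_distance</monospace>, and <monospace>empirical_cdf</monospace> are hypothetical. It is not the PAWN implementation of Pianosi and Wagener that was adapted in this work.</p>
<preformat><![CDATA[
```python
import bisect
import random


def empirical_cdf(sorted_sample, y):
    """Fraction of sample values less than or equal to y."""
    return bisect.bisect_right(sorted_sample, y) / len(sorted_sample)


def ks_distance(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both CDFs are step functions, so the maximum is attained at a sample point.
    return max(abs(empirical_cdf(a, y) - empirical_cdf(b, y)) for y in a + b)


def pawn_index(model, bounds, i, n_uncond=500, n_cond=200, n_nominal=10, seed=0):
    """Estimate S_i as the maximum KS statistic over nominal values of input i."""
    rng = random.Random(seed)

    def draw():
        # Assumption: inputs are sampled uniformly and independently within bounds.
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    unconditional = [model(draw()) for _ in range(n_uncond)]
    ks_values = []
    for _ in range(n_nominal):
        nominal = rng.uniform(*bounds[i])   # k-th nominal value of input i
        conditional = []
        for _ in range(n_cond):
            x = draw()
            x[i] = nominal                  # fix input i, vary all other inputs
            conditional.append(model(x))
        ks_values.append(ks_distance(unconditional, conditional))
    return max(ks_values)
```
]]></preformat>
<p>For a toy model that depends only on its first input, the index of that input is expected to be much larger than the index of an input the model ignores, whose index reflects sampling noise only.</p>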
</sec>
<sec id="s2-5">
<title>2.5 Experimental datasets</title>
<p>The experimental datasets used for microbial inactivation modeling in this work are collated in <xref ref-type="table" rid="T1">Table 1</xref>. We chose datasets that are predominantly distinct in terms of microorganisms, media, process variables, and parameter space to demonstrate the tractability of our knowledge-informed data-driven pipeline for model development. We also found that the structure of the literature models developed from these datasets was under- or over-determined, rather than optimally determined. Consequently, the datasets and benchmark models we chose serve as an ideal testbed for evaluating the robustness of our approach.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Experimental datasets used in this study.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Data source</th>
<th rowspan="2" align="left">Microorganism</th>
<th rowspan="2" align="left">Media</th>
<th colspan="5" align="left">Process variables</th>
</tr>
<tr>
<th align="left">Temperature, <inline-formula id="inf58">
<mml:math id="m64">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (&#xb0;C)</th>
<th align="left">pH</th>
<th align="left">Water activity, <inline-formula id="inf59">
<mml:math id="m65">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
<th align="left">NaCl content, <inline-formula id="inf60">
<mml:math id="m66">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (%)</th>
<th align="left">Phosphate level, <inline-formula id="inf61">
<mml:math id="m67">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (%)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>
</td>
<td align="left">
<italic>Escherichia coli</italic>
</td>
<td align="left">n.a<xref ref-type="table-fn" rid="Tfn1">
<sup>a</sup>
</xref>
</td>
<td align="left">52.05&#x2013;63.10</td>
<td align="left">3.0&#x2013;9.0</td>
<td align="left">0.928&#x2013;0.995</td>
<td align="left">n.a</td>
<td align="left">n.a</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>
</td>
<td align="left">
<italic>Clostridium botulinum</italic>
</td>
<td align="left">Turkey</td>
<td align="left">70.00&#x2013;90.00</td>
<td align="left">5.0&#x2013;7.0</td>
<td align="left">n.a</td>
<td align="left">0&#x2013;3</td>
<td align="left">0&#x2013;2</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>
</td>
<td align="left">
<italic>Salmonella enteritidis</italic>
</td>
<td align="left">Almond kernels</td>
<td align="left">56.00&#x2013;80.00</td>
<td align="left">n.a</td>
<td align="left">0.601&#x2013;0.946</td>
<td align="left">n.a</td>
<td align="left">n.a</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn1">
<label>a</label>
<p>n.a., not available.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s2-6">
<title>2.6 Computational implementation</title>
<p>Numerical codes were developed using MATLAB<sup>&#xae;</sup> R2021a by adapting the prototype codes of SINDy provided in <xref ref-type="bibr" rid="B6">Brunton and Kutz (2019)</xref> and PAWN global sensitivity analysis given in <xref ref-type="bibr" rid="B16">Pianosi and Wagener (2015)</xref> and <xref ref-type="bibr" rid="B15">Pianosi et al. (2015)</xref>.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<sec id="s3-1">
<title>3.1 Development of a knowledge-informed data-driven modeling pipeline</title>
<p>Our data-driven modeling approach combines SINDy and global sensitivity analysis to identify model equations and key factors that govern microbial inactivation in food. As a main feature, users can define any functional forms of input variables (<inline-formula id="inf62">
<mml:math id="m68">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>, i.e., <inline-formula id="inf63">
<mml:math id="m69">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> <inline-formula id="inf64">
<mml:math id="m70">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf65">
<mml:math id="m71">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf66">
<mml:math id="m72">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), and output variables (i.e., <inline-formula id="inf67">
<mml:math id="m73">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>D</mml:mi>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>)). As explained below, we set <inline-formula id="inf68">
<mml:math id="m74">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>D</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (instead of <inline-formula id="inf69">
<mml:math id="m75">
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) and determined <inline-formula id="inf70">
<mml:math id="m76">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> based on the Arrhenius equation, while using first-order terms for the other input variables, i.e., <inline-formula id="inf71">
<mml:math id="m77">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> <inline-formula id="inf72">
<mml:math id="m78">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf73">
<mml:math id="m79">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf74">
<mml:math id="m80">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Subsequently, a library of input terms is generated through polynomial combinations of the basic input variables provided by the user. SINDy then identifies a sparse model by choosing the minimum number of input terms from the library that is required to represent the output variable with acceptable accuracy (<italic>cf.</italic> <xref ref-type="sec" rid="s2-1">Section 2.1</xref>). The resulting equations derived by SINDy take the form of a linear combination of nonlinear terms and therefore explicitly show the impacts of environmental variables on <italic>D</italic>-values. The impact of individual primitive input variables (not the combined terms) can be identified through PAWN global sensitivity analysis. Together, the two complementary tools identify the key model equations and factors that govern the <italic>D</italic>-value for a given pathogen or spoilage organism. The modeling workflow is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>. We term our approach knowledge-informed data-driven modeling because we incorporate known insights into system characteristics (such as the Arrhenius equation) as a key component in determining the basic forms of the input and output variables, as described in detail below.</p>
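<p>The library construction and sparse regression steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function names (<monospace>polynomial_library</monospace>, <monospace>least_squares</monospace>, <monospace>stls</monospace>) are hypothetical, the least-squares step solves the normal equations by Gaussian elimination (which is adequate only for small, well-conditioned problems), and coefficients are thresholded on their absolute value. It is not the MATLAB code used in this work.</p>
<preformat><![CDATA[
```python
from itertools import combinations_with_replacement


def polynomial_library(rows, order):
    """Return (terms, theta): all monomials of the inputs up to `order`."""
    n_vars = len(rows[0])
    terms = [()]  # the empty tuple stands for the constant term
    for degree in range(1, order + 1):
        terms.extend(combinations_with_replacement(range(n_vars), degree))
    theta = []
    for row in rows:
        features = []
        for term in terms:
            value = 1.0
            for j in term:
                value *= row[j]
            features.append(value)
        theta.append(features)
    return terms, theta


def least_squares(theta, y, active):
    """Solve the normal equations restricted to the active library columns."""
    m = len(active)
    # Augmented matrix [A^T A | A^T y]; a singular system raises ZeroDivisionError.
    aug = [[sum(r[p] * r[q] for r in theta) for q in active]
           + [sum(r[p] * t for r, t in zip(theta, y))] for p in active]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][m] / aug[r][r] for r in range(m)]


def stls(theta, y, lam):
    """Sequentially thresholded least squares: fit, drop small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coeffs = [0.0] * n_terms
    while active:  # the active set shrinks on every pass, so the loop terminates
        solution = least_squares(theta, y, active)
        kept = [idx for idx, c in zip(active, solution) if abs(c) >= lam]
        if kept == active:
            for idx, c in zip(active, solution):
                coeffs[idx] = c
            break
        active = kept
    return coeffs
```
]]></preformat>
<p>In this sketch, the rows passed to <monospace>polynomial_library</monospace> would hold the transformed inputs and <monospace>y</monospace> the transformed output. Setting <monospace>lam</monospace> to zero keeps every library term, which corresponds to a non-parsimonious model.</p>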
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Flowchart depicting our knowledge-informed data-driven model development pipeline. The weights assigned to the library columns, which are built from the experimental input variables cast into the functional forms of the candidate terms, determine which terms are included in the final model structure through Sequentially Thresholded Least Squares (STLS) regression (see the main text for details).</p>
</caption>
<graphic xlink:href="frfst-02-996399-g001.tif"/>
</fig>
<p>The development of a data-driven microbial inactivation model can be facilitated by known characteristics of the system. While a first-order representation is a typical choice for input and output variables, model performance can be improved by a more appropriate choice of their functional forms. For this purpose, we leverage mechanistic microbial growth models (<xref ref-type="bibr" rid="B21">Whiting, 1995</xref>) to inform our choice of functional forms for microbial inactivation dynamics as follows:<disp-formula id="e7">
<mml:math id="m81">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>where <inline-formula id="inf75">
<mml:math id="m82">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is population density and <inline-formula id="inf76">
<mml:math id="m83">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x3e;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the inactivation rate constant, which is given as a function of a vector of environmental variables (<inline-formula id="inf77">
<mml:math id="m84">
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>). If the environmental variables are held constant over time, the solution can be written in analytical form, with <italic>N</italic><sub>0</sub> denoting the initial population density, i.e.,<disp-formula id="e8">
<mml:math id="m85">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
<p>By definition of the <italic>D</italic>-value (the time required for a tenfold reduction of the population), the population density is <inline-formula id="inf78">
<mml:math id="m86">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0.1</mml:mn>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> when <inline-formula id="inf79">
<mml:math id="m87">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, i.e.,<disp-formula id="e9">
<mml:math id="m88">
<mml:mrow>
<mml:mn>0.1</mml:mn>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
</p>
<p>Therefore, the <italic>D</italic>-value is simply:<disp-formula id="e10">
<mml:math id="m89">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>ln</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0.1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
<p>Taking the logarithm of <xref ref-type="disp-formula" rid="e10">Eq. 10</xref> then yields:<disp-formula id="e11">
<mml:math id="m90">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>log</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>where <inline-formula id="inf80">
<mml:math id="m91">
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is a constant. Given that many prior Arrhenius-based models produce a reasonable fit to growth data by relating the growth rate to various environmental variables as <inline-formula id="inf81">
<mml:math id="m92">
<mml:mrow>
<mml:mi>ln</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="bibr" rid="B21">Whiting, 1995</xref>; <xref ref-type="bibr" rid="B18">Ross and Dalgaard, 2003</xref>), we similarly re-write <xref ref-type="disp-formula" rid="e11">Eq. 11</xref> as:<disp-formula id="e12">
<mml:math id="m93">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
</p>
<p>Consequently, the functional forms of the output and input variables supplied to SINDy are <inline-formula id="inf82">
<mml:math id="m94">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi mathvariant="bold">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf83">
<mml:math id="m95">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mi mathvariant="bold">H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mi mathvariant="bold">w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi mathvariant="bold">N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> in <inline-formula id="inf84">
<mml:math id="m96">
<mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, respectively, in reference to the generalized form in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>.</p>
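<p>The relationship between the rate constant and the <italic>D</italic>-value in Eqs 8&#x2013;11 can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the last function assumes that the logarithm in Eq. 11 is taken to base 10, in which case the constant <italic>C</italic> equals log<sub>10</sub>(ln 10).</p>
<preformat><![CDATA[
```python
import math


def d_value(k):
    """Decimal reduction time for first-order inactivation: D = -ln(0.1) / k."""
    if k <= 0:
        raise ValueError("the rate constant k must be positive")
    return -math.log(0.1) / k


def surviving_population(n0, k, t):
    """Analytical solution N = N0 * exp(-k * t) for constant conditions."""
    return n0 * math.exp(-k * t)


def log10_d(k):
    """log10 D = C - log10 k, with C = log10(ln 10) under the base-10 assumption."""
    return math.log10(math.log(10.0)) - math.log10(k)
```
]]></preformat>
<p>Evaluating <monospace>surviving_population</monospace> at <italic>t</italic> = <italic>D</italic> returns one tenth of the initial population for any positive rate constant, which is the defining property used in Eq. 9.</p>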
</sec>
<sec id="s3-2">
<title>3.2 Optimization of model complexity: Setting polynomial order and model sparsity</title>
<p>To substantiate our choice of functional forms for the input and output variables in the preceding section, we compare the model performance obtained with our approach (orange lines in <xref ref-type="fig" rid="F2">Figure 2</xref>) against a base case that uses non-logarithmic <italic>D</italic>-values and non-reciprocal temperature, with the other process variables unchanged (blue lines in <xref ref-type="fig" rid="F2">Figure 2</xref>). Here, complete non-parsimonious models are used to ensure a fair comparison without the influence of sparse regression. Our approach consistently performed better in terms of MSE, which was calculated on logarithmic <italic>D</italic>-values in both cases, across the different datasets and orders of polynomial combinations of input variables. For all the results that follow, our choice of functional forms (i.e., orange lines in <xref ref-type="fig" rid="F2">Figure 2</xref>) is adopted.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Comparison of model performance between different choices of functional forms for input and output variables using datasets from: <bold>(A)</bold> <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, <bold>(B)</bold> <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, and <bold>(C)</bold> <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>. The blue line represents a model with D (chosen as the target variable) and <inline-formula id="inf86">
<mml:math id="m98">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (chosen as input variables), whereas the orange line is our choice of functional forms by taking log D as the target variable and <inline-formula id="inf88">
<mml:math id="m100">
<mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as input variables. The non-parsimonious models (<inline-formula id="inf89">
<mml:math id="m101">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) were developed for increasing orders of polynomial combinations of input variables producing monomial terms with various degrees.</p>
</caption>
<graphic xlink:href="frfst-02-996399-g002.tif"/>
</fig>
<p>With <inline-formula id="inf90">
<mml:math id="m102">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> chosen as the target variable, we identify the optimal model structure that balances both accuracy and sparsity. We first determine the order of polynomial combination without accounting for model sparsity (i.e., with <inline-formula id="inf91">
<mml:math id="m103">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and subsequently choose the appropriate value of <inline-formula id="inf92">
<mml:math id="m104">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (now to promote sparsity). This two-step process is demonstrated using the dataset from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref> as a case study (<xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref>). In doing so, we used three criteria: AIC values, MSE, and the number of terms. The analysis based on the first criterion suggested choosing the third-order polynomial combination (<xref ref-type="fig" rid="F3">Figure 3A</xref>) and <inline-formula id="inf93">
<mml:math id="m105">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="fig" rid="F3">Figure 3B</xref>), where the AIC scores are minimal. In contrast, the order of polynomial combination cannot be determined clearly from MSE alone, because MSE keeps decreasing as the order increases (<xref ref-type="fig" rid="F4">Figure 4A</xref>), which highlights the utility of an information-theoretic criterion. While the third-order model with <inline-formula id="inf94">
<mml:math id="m106">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> may be a desirable choice from a rigorous statistical point of view, we found that the increase in MSE is not significant up to <inline-formula id="inf95">
<mml:math id="m107">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">1.36</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="fig" rid="F4">Figure 4B</xref>), at which the number of terms is reduced from 20 to 17 (<xref ref-type="fig" rid="F4">Figure 4C</xref>). MSE increased significantly when <inline-formula id="inf96">
<mml:math id="m108">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3e;</mml:mo>
<mml:mn mathvariant="italic">1.36</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> without significantly reducing the number of terms, leading us to choose the third-order model with <inline-formula id="inf97">
<mml:math id="m109">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">1.36</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
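<p>The two-step selection logic described above can be expressed compactly. The Python sketch below is a hedged illustration: the function names are hypothetical, and the relative tolerance <monospace>rel_tol</monospace> used to decide whether an MSE increase is "not significant" is an assumption introduced here, since the choice in the text was made by inspecting the curves rather than by a fixed numerical rule.</p>
<preformat><![CDATA[
```python
def select_order(aic_by_order):
    """Step 1: pick the polynomial order with the lowest AIC score (lambda = 0)."""
    return min(aic_by_order, key=aic_by_order.get)


def select_lambda(mse_by_lambda, rel_tol):
    """Step 2: pick the largest lambda whose MSE stays within rel_tol of the
    MSE of the non-parsimonious model (lambda = 0)."""
    baseline = mse_by_lambda[0.0]  # the sweep must include lambda = 0
    admissible = [lam for lam, mse in mse_by_lambda.items()
                  if mse <= baseline * (1.0 + rel_tol)]
    return max(admissible)
```
]]></preformat>
<p>Both functions take dictionaries of precomputed scores, so they can be combined with any fitting routine. A stricter tolerance favors accuracy, whereas a looser one favors sparsity.</p>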
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Stepwise tuning of model accuracy and sparsity through comparison of an information-theoretic metric (AIC) by first <bold>(A)</bold> fixing the order of polynomial combinations of input variables for the non-parsimonious model (<inline-formula id="inf98">
<mml:math id="m110">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), followed by <bold>(B)</bold> setting the sparsity index, <inline-formula id="inf99">
<mml:math id="m111">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, that balances model accuracy and the desired sparsity. Vertical dashed lines indicate the chosen model settings. Here, the model tuning for the dataset from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref> is shown as an example.</p>
</caption>
<graphic xlink:href="frfst-02-996399-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Depiction of our stepwise model tuning approach that minimizes model overfitting through the optimization of <bold>(A)</bold> the order of polynomial combinations of input variables for the non-parsimonious model (<inline-formula id="inf100">
<mml:math id="m112">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn mathvariant="italic">0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), and <bold>(B/C)</bold> the sparsity index, <inline-formula id="inf101">
<mml:math id="m113">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Vertical dashed lines indicate the chosen model settings. Here, the model tuning for the dataset from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref> is shown as an example.</p>
</caption>
<graphic xlink:href="frfst-02-996399-g004.tif"/>
</fig>
<p>We also applied this stepwise model construction approach to the datasets from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref> (<xref ref-type="sec" rid="s10">Supplementary Figures S1, S2</xref>) and <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref> (<xref ref-type="sec" rid="s10">Supplementary Figures S3, S4</xref>). The analysis of the dataset from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref> showed that the AIC has two local minima, at polynomial orders 2 and 4 (<xref ref-type="sec" rid="s10">Supplementary Figure S1A</xref>). After further checking the MSE and the number of terms, we chose the second-order model for better interpretability. Having fixed the polynomial order, we then determined the optimal value of <inline-formula id="inf102">
<mml:math id="m114">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to be 0.24 (<xref ref-type="sec" rid="s10">Supplementary Figure S1B</xref>). The changes in MSE and in the number of terms across polynomial orders and <inline-formula id="inf103">
<mml:math id="m115">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> values support our choice (<xref ref-type="sec" rid="s10">Supplementary Figure S2</xref>). Lastly, the analysis of the dataset from <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref> suggested the first-order model (<xref ref-type="sec" rid="s10">Supplementary Figures S3, S4</xref>), because any further increase in model complexity would result in severe overfitting given the limited number of data points (<inline-formula id="inf104">
<mml:math id="m116">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). In this case, we did not further reduce model complexity by fine-tuning <inline-formula id="inf105">
<mml:math id="m117">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The final model was therefore a simple equation with two input variables (<inline-formula id="inf106">
<mml:math id="m118">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf107">
<mml:math id="m119">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>).</p>
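<p>The two-step tuning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: it assumes that sparsification is done by sequentially thresholded least squares with threshold <italic>&#x3bb;</italic>, that the criterion is the small-sample corrected AIC in its least-squares form, and that the inputs have already been transformed (for example, to 1/<italic>T</italic>). All function names and the synthetic data are hypothetical.</p>
<preformat preformat-type="code">
```python
import itertools
import numpy as np

def poly_library(X, order):
    """All monomials of the columns of X up to the given total order (constant first)."""
    cols = [np.ones(X.shape[0])]
    for degree in range(1, order + 1):
        for combo in itertools.combinations_with_replacement(range(X.shape[1]), degree):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(cols)

def stlsq(Theta, y, lam, n_iter=10):
    """Sequentially thresholded least squares: drop terms with |coef| below lam, then refit."""
    coef = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(n_iter):
        keep = np.abs(coef) >= lam
        coef = np.zeros(Theta.shape[1])
        if not keep.any():
            break
        coef[keep] = np.linalg.lstsq(Theta[:, keep], y, rcond=None)[0]
    return coef

def aicc(n, k, mse):
    """Assumed criterion: least-squares AIC with small-sample correction."""
    return n * np.log(mse) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def score(X, y, order, lam):
    """AICc of the thresholded fit for one (order, lam) setting."""
    Theta = poly_library(X, order)
    coef = stlsq(Theta, y, lam)
    k = int(np.count_nonzero(coef))
    mse = float(np.mean((Theta @ coef - y) ** 2))
    return aicc(len(y), k, mse)

def stepwise_tune(X, y, orders, lams):
    """Step A: pick the order at lam = 0. Step B: pick lam at that order."""
    best_order = min(orders, key=lambda order: score(X, y, order, 0.0))
    best_lam = min(lams, key=lambda lam: score(X, y, best_order, lam))
    return best_order, best_lam

# Synthetic (hypothetical) data: y = 1 + 2*x0 - 3*x0*x1 + small noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] * X[:, 1] + 0.01 * rng.normal(size=200)
best_order, best_lam = stepwise_tune(X, y, orders=[1, 2, 3, 4], lams=[0.0, 0.1, 0.5])
sparse_coef = stlsq(poly_library(X, best_order), y, best_lam)
```
</preformat>
<p>Fixing the order first and only then scanning <italic>&#x3bb;</italic> keeps the search one-dimensional at each step, which mirrors panels (A) and (B) of the tuning figures.</p>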
</sec>
<sec id="s3-3">
<title>3.3 Data-driven identification of governing equations for enhanced accuracy and expandability</title>
<p>Following the guidelines outlined in the preceding section and in the Methods, we developed models for all experimental datasets considered in this work. The resulting model equations are summarized in <xref ref-type="table" rid="T2">Table 2</xref>. The individual models contain different numbers and degrees of monomial terms, which our data-driven model development pipeline identified as optimal for representing the output variable within the parameter space of the respective datasets. Identifying these optimal terms (especially the higher-order terms) would not be possible with previous approaches that rely on empirical choices of equation terms. Our stepwise approach to model design also minimizes model overfitting and the associated uncertainties.</p>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Governing model equations identified by leveraging our knowledge-informed data-driven modeling pipeline.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Data source</th>
<th align="left">Model equations identified from our pipeline</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>
</td>
<td align="left">
<inline-formula id="inf108">
<mml:math id="m120">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>305.14</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>442.49</mml:mn>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>50.61</mml:mn>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>7.16</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>3</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>334.52</mml:mn>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>108.52</mml:mn>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4.4985</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>4</mml:mn>
</mml:msup>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>155.43</mml:mn>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>8.49</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>5</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>528.54</mml:mn>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
<mml:mn>3</mml:mn>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>52.47</mml:mn>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>4.85</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>4</mml:mn>
</mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>472.09</mml:mn>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1.54</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>6</mml:mn>
</mml:msup>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1.42</mml:mn>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1.70</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>4</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1.50</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>7</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mn>3</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</inline-formula>(13)</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>
</td>
<td align="left">
<inline-formula id="inf109">
<mml:math id="m121">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>24.94</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>3.43</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>3</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>0.93</mml:mn>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>30.06</mml:mn>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>350.45</mml:mn>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1.10</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>5</mml:mn>
</mml:msup>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>64.03</mml:mn>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>585.46</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1.09</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>4</mml:mn>
</mml:msup>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>2.12</mml:mn>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>24.55</mml:mn>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>144.32</mml:mn>
<mml:msubsup>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>186.41</mml:mn>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>2.79</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mn>10</mml:mn>
<mml:mn>3</mml:mn>
</mml:msup>
<mml:msubsup>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</inline-formula>(14)</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>
</td>
<td align="left">
<inline-formula id="inf110">
<mml:math id="m122">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6.70</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4.57</mml:mn>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>672.69</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>(15)</td>
</tr>
</tbody>
</table>
</table-wrap>
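<p>As an illustration of how an identified equation can be used, the sketch below evaluates Eq. 15 from <xref ref-type="table" rid="T2">Table 2</xref>. The function name and the operating points are hypothetical and serve only to exercise the formula; the units of <italic>T</italic> and the base of the logarithm must follow the definitions given with the dataset and in the Methods, which are not restated here.</p>
<preformat preformat-type="code">
```python
def log_d_eq15(a_w: float, temperature: float) -> float:
    """Evaluate Eq. 15 of Table 2: log D = -6.70 - 4.57 a_w + 672.69 / T."""
    return -6.70 - 4.57 * a_w + 672.69 / temperature

# Hypothetical operating point, chosen only to exercise the function
example = log_d_eq15(a_w=0.6, temperature=70.0)

# Signs of the coefficients: log D falls when either a_w or T rises (for T > 0)
drier = log_d_eq15(a_w=0.5, temperature=70.0)
hotter = log_d_eq15(a_w=0.6, temperature=75.0)
```
</preformat>
<p>Because the equation is linear in the coefficients, the same pattern extends to Eqs 13 and 14 by adding the corresponding monomial terms.</p>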
<p>We compare the performance of our models with existing models from the literature in <xref ref-type="fig" rid="F5">Figure 5</xref>. Our models consistently perform better than the literature models across all datasets, particularly for the dataset from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, which considers additional process variables, i.e., <inline-formula id="inf111">
<mml:math id="m123">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf112">
<mml:math id="m124">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. It is plausible that <inline-formula id="inf113">
<mml:math id="m125">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf114">
<mml:math id="m126">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> have significant interaction effects with other process variables and are critical for characterizing microbial inactivation dynamics, which would explain the enhanced accuracy that accompanies their inclusion in the model. Quantitative performance measures of the models are tabulated in <xref ref-type="table" rid="T3">Table 3</xref>. In all cases, our models perform adequately with reasonable error measures, i.e., MSE <inline-formula id="inf115">
<mml:math id="m127">
<mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi mathvariant="normal">O</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mn mathvariant="italic">10</mml:mn>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn mathvariant="italic">2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and low AIC scores. In two of the cases (models for the datasets from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref> and <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>), our models achieved a roughly ten-fold or greater reduction in MSE with fewer functional terms in the model structures than the literature models. Moreover, we demonstrate the opportunity to reduce the MSE by more than two-fold for the dataset from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref> by considering a more complex model equation (a greater number of functional terms) without the risk of overfitting, as shown by its lower AIC score compared to the literature model. By carefully adopting the stepwise model tuning scheme described in <xref ref-type="sec" rid="s3-2">Section 3.2</xref>, we were able to tune the models to achieve better accuracy while minimizing the chances of overfitting compared to existing literature models.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Comparison of the performance of our knowledge-informed data-driven models (left panels) against literature models (right panels) across different datasets: <bold>(A)</bold> <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, <bold>(B)</bold> <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, and <bold>(C)</bold> <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>.</p>
</caption>
<graphic xlink:href="frfst-02-996399-g005.tif"/>
</fig>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Quantitative performance measures of models.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Data source</th>
<th rowspan="2" align="left">Maximum order of terms</th>
<th rowspan="2" align="left">Sparsity index, <italic>&#x3bb;</italic>
</th>
<th colspan="2" align="left">Number of terms</th>
<th colspan="2" align="left">Mean squared error</th>
<th colspan="2" align="left">AIC</th>
</tr>
<tr>
<th align="left">This work</th>
<th align="left">Literature model</th>
<th align="left">This work</th>
<th align="left">Literature model</th>
<th align="left">This work</th>
<th align="left">Literature model</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>
</td>
<td align="left">3</td>
<td align="left">1.36</td>
<td align="left">17</td>
<td align="left">5</td>
<td align="char" char=".">0.006</td>
<td align="char" char=".">0.014</td>
<td align="char" char=".">&#x2212;448.85</td>
<td align="char" char=".">&#x2212;392.24</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>
</td>
<td align="left">2</td>
<td align="left">0.24</td>
<td align="left">14</td>
<td align="left">15</td>
<td align="char" char=".">0.004</td>
<td align="char" char=".">0.071</td>
<td align="char" char=".">&#x2212;193.46</td>
<td align="char" char=".">&#x2212;65.96</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>
</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">3</td>
<td align="left">8</td>
<td align="char" char=".">0.06</td>
<td align="char" char=".">0.559</td>
<td align="char" char=".">&#x2212;36.96</td>
<td align="char" char=".">27.26</td>
</tr>
</tbody>
</table>
</table-wrap>
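<p>The entries of <xref ref-type="table" rid="T3">Table 3</xref> can be cross-checked with simple arithmetic. The sketch below computes the MSE ratios between the literature models and ours, and evaluates an assumed form of the criterion, AICc = <italic>n</italic> ln(MSE) + 2<italic>k</italic> + 2<italic>k</italic>(<italic>k</italic> + 1)/(<italic>n</italic> &#x2212; <italic>k</italic> &#x2212; 1), for the dataset whose size is stated in Section 3.2 (<italic>n</italic> = 16). The formula is our assumption here (the Methods define the criterion actually used); with the rounded MSEs it gives values close to the tabulated 27.26 and &#x2212;36.96. Variable names are hypothetical.</p>
<preformat preformat-type="code">
```python
import math

def aicc(n, k, mse):
    """Assumed form: least-squares AIC with small-sample correction."""
    return n * math.log(mse) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

# (MSE of this work, MSE of the literature model) as printed in Table 3
table3_mse = {
    "Cerf et al. (1996)": (0.006, 0.014),
    "Juneja et al. (1995)": (0.004, 0.071),
    "Villa-Rojas et al. (2013)": (0.06, 0.559),
}

# Ratio of literature MSE to our MSE; approximate because the table is rounded
fold_reduction = {name: lit / ours for name, (ours, lit) in table3_mse.items()}

# Villa-Rojas et al. (2013): n = 16 data points, k = number of terms from Table 3
aicc_this_work = aicc(n=16, k=3, mse=0.06)
aicc_literature = aicc(n=16, k=8, mse=0.559)
```
</preformat>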
</sec>
<sec id="s3-4">
<title>3.4 Integration of data-driven approach and sensitivity analysis for determining key governing process variables and model terms</title>
<p>While our model governing equations offer good representations of the experimental data, the individual effects of process variables remain elusive. In addition to data-driven modeling, we leverage global sensitivity analysis (<xref ref-type="bibr" rid="B16">Pianosi and Wagener, 2015</xref>) using the model-derived governing equations to identify key governing process variables and, where possible, reveal the interactions between them. Here, the sensitivities are evaluated over the entire parameter space encompassed by the respective datasets (<xref ref-type="table" rid="T1">Table 1</xref>). Our results show highly disparate sensitivity measures for the process variables across datasets (<xref ref-type="fig" rid="F6">Figure 6</xref>), suggesting that the relative effects of the process variables may be highly environment-dependent. For example, <inline-formula id="inf116">
<mml:math id="m128">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> exhibited high sensitivity in the model for data from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref> but registered lower sensitivity than other process variables in the model for data from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, although <inline-formula id="inf117">
<mml:math id="m129">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is often viewed as the primary process variable in most microbial inactivation experiments. Conversely, <inline-formula id="inf118">
<mml:math id="m130">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> consistently displayed relatively lower sensitivity measures than other process variables across the models. While condition-specificity is one plausible explanation for the contrasting sensitivity measures, it should also be noted that the global sensitivity analysis strictly represents the individual influence of process variables on the output and is indifferent to interaction effects between process variables. Therefore, there may be instances where the sensitivity measures of individual process variables are insignificant while considerable interaction effects with other process variables exist. For instance, <inline-formula id="inf119">
<mml:math id="m131">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and pH registered low sensitivity in the model for data from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref> with statistically insignificant influence on the output variable (<xref ref-type="fig" rid="F6">Figure 6</xref>), but the variables were repeatedly included in numerous terms in the governing equation identified through our knowledge-informed data-driven approach (<xref ref-type="table" rid="T2">Table 2</xref>). This is expected given that SINDy retains the terms most influential for representing the output data, irrespective of whether they capture individual or interaction effects.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Distribution of global sensitivities of process variables on output variable, i.e., <inline-formula id="inf120">
<mml:math id="m132">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, examined using models derived from different datasets: <bold>(A)</bold> <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, <bold>(B)</bold> <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, and <bold>(C)</bold> <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>. The symbol &#x2a; marks the model variables with significant influence on the output based on 95% confidence interval (<inline-formula id="inf121">
<mml:math id="m133">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>0.05</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>).</p>
</caption>
<graphic xlink:href="frfst-02-996399-g006.tif"/>
</fig>
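<p>For readers unfamiliar with distribution-based sensitivity measures, the following is a simplified sketch of a one-input-at-a-time index of this kind. It assumes that the cited method compares the cumulative distribution of the output with and without one input held fixed; it is not the implementation used in this work, it omits the significance test reported in the figure, and the toy model, bounds, sample sizes, and function names are all hypothetical.</p>
<preformat preformat-type="code">
```python
import numpy as np

def ks_distance(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def cdf_sensitivity(model, bounds, i, n_uncond=2000, n_cond=500, n_points=10, seed=0):
    """Median KS distance between the unconditional output distribution and
    the distributions obtained with input i fixed at n_points values."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    y_uncond = model(rng.uniform(lo, hi, size=(n_uncond, lo.size)))
    distances = []
    for value in np.linspace(lo[i], hi[i], n_points):
        X_cond = rng.uniform(lo, hi, size=(n_cond, lo.size))
        X_cond[:, i] = value  # fix input i, let the others vary
        distances.append(ks_distance(y_uncond, model(X_cond)))
    return float(np.median(distances))

def toy_model(X):
    """Hypothetical model in which input 0 dominates the output."""
    return 5.0 * X[:, 0] + 0.5 * X[:, 1]

bounds = [(0.0, 1.0), (0.0, 1.0)]
s0 = cdf_sensitivity(toy_model, bounds, i=0)
s1 = cdf_sensitivity(toy_model, bounds, i=1)
```
</preformat>
<p>In this toy case the index for the first input is much larger than for the second, reflecting its larger coefficient.</p>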
<p>To resolve this issue, we combine the data-driven approach with global sensitivity analysis, which enables elucidation of the factors influencing the process dynamics in terms of governing equations, key process variables, and critical monomial terms. To this end, we removed each term in the governing equations in turn, refitted the models, and recomputed the MSEs, as shown in <xref ref-type="fig" rid="F7">Figure 7</xref>; a considerable increase in MSE (compared to the original complete model, represented by horizontal dashed lines) indicates that the respective term is essential for representing the output variable. In the model for data from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, removal of terms containing <inline-formula id="inf122">
<mml:math id="m134">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf123">
<mml:math id="m135">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> results in a considerable increase in MSE even though the variables had relatively insignificant influence on the output in the global sensitivity analysis. This suggests that <inline-formula id="inf124">
<mml:math id="m136">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf125">
<mml:math id="m137">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> significantly influence the effects of <inline-formula id="inf126">
<mml:math id="m138">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> on microbial inactivation through strong interaction effects, but the reverse is not necessarily true. The finding here is reinforced by the outcome from SINDy, where the terms with <inline-formula id="inf127">
<mml:math id="m139">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf128">
<mml:math id="m140">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> were unequivocally retained in the governing equations despite diminished main effects (individual effects), as the variables are still relevant to represent the output through their influence on <inline-formula id="inf129">
<mml:math id="m141">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Therefore, users can, hypothetically, fix the <inline-formula id="inf130">
<mml:math id="m142">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf131">
<mml:math id="m143">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> at suitable fixed levels and tune only <inline-formula id="inf132">
<mml:math id="m144">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> as an alternative reduced-order experimental optimization.</p>
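<p>The term-removal procedure described above amounts to deleting one column of the library matrix at a time and refitting by least squares. A minimal sketch follows, assuming an ordinary least-squares refit on the retained terms; the function name and the synthetic two-variable example are hypothetical.</p>
<preformat preformat-type="code">
```python
import numpy as np

def mse_after_dropping_each_term(Theta, y):
    """Return the MSE of the full least-squares fit and the MSE of each refit
    obtained after deleting one library column (one model term) at a time."""
    def refit_mse(A):
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        return float(np.mean((A @ coef - y) ** 2))
    full = refit_mse(Theta)
    dropped = [refit_mse(np.delete(Theta, j, axis=1)) for j in range(Theta.shape[1])]
    return full, dropped

# Hypothetical two-variable example with terms [1, a, b, a*b]
rng = np.random.default_rng(1)
a, b = rng.uniform(0.0, 1.0, size=(2, 100))
Theta = np.column_stack([np.ones(100), a, b, a * b])
y = 2.0 + 0.01 * b + 3.0 * a * b + 0.01 * rng.normal(size=100)
full_mse, dropped_mse = mse_after_dropping_each_term(Theta, y)
```
</preformat>
<p>In this example, removing the interaction column raises the MSE sharply, whereas removing the weak linear term in <italic>b</italic> barely changes it, which is the pattern the comparison against the dashed baseline is meant to expose.</p>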
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Comparison of model performance with iterative removal of terms from the model equations generated from different datasets: <bold>(A)</bold> <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, <bold>(B)</bold> <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, and <bold>(C)</bold> <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref>. MSEs are recalculated after refitting the model with the removal of each term, whereby a large increase in MSEs indicates that the respective terms are highly critical to represent the output data. The MSEs of the original complete model are denoted by the horizontal dashed lines (cf. MSEs in <xref ref-type="table" rid="T3">Table 3</xref>).</p>
</caption>
<graphic xlink:href="frfst-02-996399-g007.tif"/>
</fig>
<p>Conversely, in the model for data from <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref>, removal of terms with <inline-formula id="inf133">
<mml:math id="m145">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf134">
<mml:math id="m146">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> does not significantly increase the MSE despite their elevated global sensitivity. Therefore, the effects of <inline-formula id="inf135">
<mml:math id="m147">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf136">
<mml:math id="m148">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> on microbial inactivation are possibly dependent on other process variables but they do not impose similar magnitudes of influence in return. Nevertheless, SINDy has retained the terms with <inline-formula id="inf137">
<mml:math id="m149">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf138">
<mml:math id="m150">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as their substantial individual effects are critical to represent the output when other process variables are invariant. For this system, a stepwise process optimization is likely to work best, in which <inline-formula id="inf139">
<mml:math id="m151">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf140">
<mml:math id="m152">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mi mathvariant="normal">H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are independently tuned first, followed by the optimization of <inline-formula id="inf141">
<mml:math id="m153">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf142">
<mml:math id="m154">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Lastly, the model for data from <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref> is fairly simple, and both process variables possibly impose equivalent individual and interaction effects. While the global sensitivity analysis and the reassessment of MSEs with iterative removal of terms offer additional insights that can aid process optimization, we do not recommend that users let these insights influence model selection, because they are model-derived, context-specific outcomes; the model should therefore be thoroughly optimized beforehand through the iterative feedback loop (<xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<p>Conventional modeling approaches that empirically ascertain model structures and parameter functional forms are forced approximations, as there exists an overwhelmingly large number of possible solutions. Using a systematic data-driven model development pipeline guided by knowledge-informed choices of parameter functional forms and methodical tuning of model sparsity and accuracy, we developed microbial inactivation models that outperformed existing models from the literature. Our approach not only identifies the most plausible model structure by leveraging domain knowledge of the system, but also elucidates the factors affecting the process dynamics through the combined use of data-driven modeling and global sensitivity analysis.</p>
<p>A sound choice of functional forms for the input and output variables, informed by mechanistic formulations (e.g., the Arrhenius equation in this work), can be integral to optimizing model structure and performance. The inclusion of the reciprocal functional form for temperature is fitting, as it mirrors the linear relationship between the logarithmic rate and the reciprocal of temperature in the linearized Arrhenius equation. We could not make such a consideration for other variables in the absence of any basis in the literature or of improvements to model fits with reciprocal terms. The choice of a logarithmic output variable is also apt for several reasons: 1) logarithmic <italic>D</italic>-values form linear relationships with temperature and other process variables that may assume the classical power-law form, which is realized through linear combinations of monomial terms of various degrees in SINDy, and 2) training the model on logarithmic <italic>D</italic>-values ensures that the resulting model is sensitive and operates optimally in the small <italic>D</italic>-value regions, which are especially critical for microbial inactivation dynamics. Although empirical formulations may bear some structural similarity to our models, we highlight that our choice of final terms to represent the output variable is guided by systematic sparse identification as described in <xref ref-type="sec" rid="s2">Section 2</xref>.</p>
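<p>The link between the Arrhenius form and the reciprocal-temperature term can be checked numerically. For first-order inactivation with rate constant <italic>k</italic> = <italic>A</italic> exp(&#x2212;<italic>E</italic>a/(<italic>RT</italic>)), the decimal reduction time is <italic>D</italic> = ln(10)/<italic>k</italic>, so log10 <italic>D</italic> is a constant plus <italic>E</italic>a/(<italic>R</italic> ln 10) times 1/<italic>T</italic>. The sketch below verifies this slope; the values of <italic>A</italic> and <italic>E</italic>a and the temperature range are hypothetical and chosen only for illustration.</p>
<preformat preformat-type="code">
```python
import numpy as np

R = 8.314                 # gas constant, J/(mol K)
A, Ea = 1.0e40, 3.0e5     # hypothetical pre-exponential factor (1/min) and activation energy (J/mol)
T = np.linspace(330.0, 350.0, 5)   # hypothetical absolute temperatures, K

k = A * np.exp(-Ea / (R * T))      # Arrhenius rate constant
log10_D = np.log10(np.log(10.0) / k)   # D = ln(10) / k for first-order kinetics

# A straight-line fit of log10 D against 1/T recovers the Arrhenius slope
slope, intercept = np.polyfit(1.0 / T, log10_D, 1)
expected_slope = Ea / (R * np.log(10.0))
```
</preformat>
<p>The fitted slope matches the expected value to numerical precision, which is why 1/<italic>T</italic>, rather than <italic>T</italic>, is the natural library variable when the output is a logarithmic <italic>D</italic>-value.</p>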
<p>The use of an information-theoretic criterion such as AIC guided us in determining optimal levels of model complexity, whereas the literature models were not appropriately structured. Our approach therefore allowed us to propose more accurate models with fewer terms, as demonstrated by the cases of <xref ref-type="bibr" rid="B10">Juneja et al. (1995)</xref> and <xref ref-type="bibr" rid="B20">Villa-Rojas et al. (2013)</xref> in <xref ref-type="table" rid="T3">Table 3</xref>. In contrast, for underfitted models such as the one from <xref ref-type="bibr" rid="B9">Cerf et al. (1996)</xref>, we showed how to further reduce the error by adding extra terms. The inclusion of higher-order terms (polynomial combinations of individual process variables) in the model is critical to account for mixed effects between the process variables. For example, a lower temperature is generally observed to increase the effects of pH, but the reported effects on water activity are contradictory (<xref ref-type="bibr" rid="B20">Villa-Rojas et al., 2013</xref>). Despite their potential importance, such mixed effects have been overlooked in microbial inactivation studies, except for the well-known interaction between temperature and pH. Even when combinatorial effects of multiple process variables are accounted for, the determination of their functional forms remains largely elusive. In contrast, our pipeline evaluates interaction effects through the higher-order terms, which are subsequently compared with individual effects through global sensitivity analysis to reveal complex associations between the process variables. Our approach is particularly useful for systems with many process variables, as all possible interactions are handled simultaneously in the library matrix, which would be inefficient in conventional approaches.
As highlighted here and above, therefore, our approach complements SINDy by providing additional guidance towards selection of basic functional forms for input and output variables and determination of the optimal level of model complexity, ensuring the robust performance across different cases.</p>
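<p>To make the model-selection step concrete, the following Python sketch (the study's own code is in MATLAB) ranks candidate models by AIC. It assumes the common least-squares form AIC = <italic>n</italic> ln(RSS/<italic>n</italic>) + 2<italic>k</italic>, which may differ from the exact definition used in <xref ref-type="sec" rid="s2">Section 2</xref>; the function names and the candidate values are hypothetical and chosen only for illustration.</p>
<preformat><![CDATA[
```python
import math


def aic_least_squares(rss, n_samples, n_terms):
    """AIC for a least-squares fit, assuming Gaussian residuals.

    Assumed form: n * ln(RSS / n) + 2 * k, where k counts the model terms.
    """
    return n_samples * math.log(rss / n_samples) + 2 * n_terms


def select_by_aic(candidates, n_samples):
    """Return the name of the lowest-AIC candidate and all AIC scores.

    `candidates` maps a model name to a (rss, n_terms) pair.
    """
    scores = {
        name: aic_least_squares(rss, n_samples, n_terms)
        for name, (rss, n_terms) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidates fitted to the same 20 observations.
candidates = {
    "underfitted": (8.0, 1),  # few terms, large residual
    "balanced": (2.0, 3),     # moderate terms, small residual
    "overfitted": (1.9, 6),   # many terms, marginally smaller residual
}
best, scores = select_by_aic(candidates, n_samples=20)
print(best, scores)
```
]]></preformat>
<p>With these illustrative numbers, the marginal reduction in residual achieved by the six-term model does not offset its larger penalty, so the three-term model is selected.</p>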
<p>While our knowledge-informed data-driven modeling pipeline worked well for all the datasets considered in this work, users may further tune the model design to their desired complexity, sparsity, and accuracy through the feedback loop in <xref ref-type="fig" rid="F1">Figure 1</xref>. For example, one may employ a larger <inline-formula id="inf143">
<mml:math id="m155">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> while retaining the order of polynomial combinations to produce a sparser model at the expense of model accuracy. Conversely, combining a lower order of polynomial combinations with a more lenient (smaller) <inline-formula id="inf144">
<mml:math id="m156">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> setting is also a viable alternative. While higher-order models may enhance accuracy, their inherent complexity hampers interpretation of the system dynamics. Hence, sparse lower-order models are preferable for most applications.</p>
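<p>The trade-off between the sparsification threshold and the polynomial order can be sketched in Python (the study's own code is in MATLAB). The sketch assumes the sequentially thresholded least-squares scheme commonly used with SINDy (<xref ref-type="bibr" rid="B7">Brunton et al., 2016</xref>); the function names, the synthetic data, and the threshold values are hypothetical and do not reproduce the datasets or settings of this study.</p>
<preformat><![CDATA[
```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_library(X, order):
    """Library matrix: a constant column plus all monomials up to `order`."""
    n_samples, n_vars = X.shape
    columns = [np.ones(n_samples)]
    for degree in range(1, order + 1):
        for idx in combinations_with_replacement(range(n_vars), degree):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def stlsq(theta, y, lam, n_iter=10):
    """Sequentially thresholded least squares.

    Coefficients with magnitude below `lam` are set to zero and the
    remaining ones are refitted, repeated for `n_iter` passes.
    """
    xi = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < lam
        xi[small] = 0.0
        keep = ~small
        if not keep.any():
            break
        xi[keep] = np.linalg.lstsq(theta[:, keep], y, rcond=None)[0]
    return xi


# Synthetic, noise-free data with two hypothetical process variables
# and a deliberately weak interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.05 * X[:, 0] * X[:, 1]

theta = polynomial_library(X, order=2)
for lam in (0.01, 0.1):
    xi = stlsq(theta, y, lam)
    rss = float(np.sum((theta @ xi - y) ** 2))
    print(f"lambda={lam}: {np.count_nonzero(xi)} terms, RSS={rss:.3e}")
```
]]></preformat>
<p>In this toy case the larger threshold removes the weak interaction term, giving a sparser model with a larger residual, which mirrors the sparsity-accuracy trade-off described above.</p>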
<p>Beyond fitting the data better than existing models, our approach provides a systematic and generalizable pipeline for high-throughput development of microbial growth and inactivation models applicable to various types of datasets. This capability should help food manufacturers and researchers assess the efficacy of their existing food production strategies and identify the conditions necessary for effective microbial inactivation in as yet unexamined foods. Further, our approach provides a basis for modeling new microbial inactivation technologies that move away from conventional processing conditions towards more intricate parameters such as pressure, light pulses and degree of exposure, and various non-thermal variables (<xref ref-type="bibr" rid="B14">Ma&#xf1;as and Pag&#xe1;n, 2005</xref>; <xref ref-type="bibr" rid="B4">Art&#xed;guez et al., 2011</xref>; <xref ref-type="bibr" rid="B17">Podolak et al., 2020</xref>). Moreover, our approach can be readily extended to develop primary (dynamic) inactivation models when adequate time-series microbial inactivation data become available to render the model reasonably identifiable. Lastly, the combined use of data-driven modeling and global sensitivity analysis in the pipeline is also useful for rational model-based optimization of operating conditions and design of control systems, not only for microbial inactivation processes but also for other non-linear systems across a wide range of applications.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s5">
<title>Data availability statement</title>
<p>The MATLAB code and literature data used in this study can be found in the GitHub repository at <ext-link ext-link-type="uri" xlink:href="https://github.com/hyunseobsong/sindy4inactivation">https://github.com/hyunseobsong/sindy4inactivation</ext-link>. Further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s6">
<title>Author contributions</title>
<p>H-SS conceptualized the data-driven modeling pipeline and designed the study. FA contributed to the formulation of the framework and performed global sensitivity analysis. SZ built SINDy models to compare with literature results. SZ and FA drafted the manuscript, which was edited by H-SS to its final version. All authors contributed to the interpretation and analysis of the results.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>This work was supported by Nebraska Tobacco Settlement Biomedical Research Enhancement Funds to H-SS.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frfst.2022.996399/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frfst.2022.996399/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Akaike</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>1998</year>). <article-title>Information theory and an extension of the maximum likelihood principle</article-title>, <fpage>199</fpage>&#x2013;<lpage>213</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4612-1694-0_15</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Akkermans</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Smet</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Valdramidis</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Van Impe</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Microbial inactivation models for thermal processes</article-title>. <source>Food Eng. Ser.</source>, <fpage>399</fpage>&#x2013;<lpage>420</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-42660-6_15</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Amit</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Uddin</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Rahman</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Islam</surname>
<given-names>S. M. R.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>M. S.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A review on mechanisms and commercial aspects of food preservation and processing</article-title>. <source>Agric. Food Secur.</source> <volume>6</volume>, <fpage>1</fpage>&#x2013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1186/s40066-017-0130-8</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Art&#xed;guez</surname>
<given-names>M. L.</given-names>
</name>
<name>
<surname>Lasagabaster</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mara&#xf1;&#xf3;n</surname>
<given-names>I. M. de</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Factors affecting microbial inactivation by Pulsed Light in a continuous flow-through unit for liquid products treatment</article-title>. <source>Procedia Food Sci.</source> <volume>1</volume>, <fpage>786</fpage>&#x2013;<lpage>791</lpage>. <pub-id pub-id-type="doi">10.1016/j.profoo.2011.09.119</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blumer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ehrenfeucht</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Warmuth</surname>
<given-names>M. K.</given-names>
</name>
</person-group> (<year>1987</year>). <article-title>Occam&#x2019;s razor</article-title>. <source>Inf. Process. Lett.</source> <volume>24</volume>, <fpage>377</fpage>&#x2013;<lpage>380</lpage>. <pub-id pub-id-type="doi">10.1016/0020-0190(87)90114-1</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Brunton</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Kutz</surname>
<given-names>J. N.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Data-Driven Sci. Eng.</source> <publisher-name>Cambridge University Press</publisher-name>, <publisher-loc>Cambridge, UK</publisher-loc>, <pub-id pub-id-type="doi">10.1017/9781108380690</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brunton</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Proctor</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Kutz</surname>
<given-names>J. N.</given-names>
</name>
<name>
<surname>Bialek</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Discovering governing equations from data by sparse identification of nonlinear dynamical systems</article-title>. <source>Proc. Natl. Acad. Sci. U. S. A.</source> <volume>113</volume>, <fpage>3932</fpage>&#x2013;<lpage>3937</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1517384113</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Burnham</surname>
<given-names>K. P.</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>D. R.</given-names>
</name>
</person-group> (<year>2002</year>). <source>Model selection and multimodel inference: A practical information-theoretic approach</source>. <publisher-name>Springer Science &#x26; Business Media</publisher-name>, <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>, <edition>2nd ed</edition>. <pub-id pub-id-type="doi">10.1016/j.ecolmodel.2003.11.004</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cerf</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Davey</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Sadoudi</surname>
<given-names>A. K.</given-names>
</name>
</person-group> (<year>1996</year>). <article-title>Thermal inactivation of bacteria - a new predictive model for the combined effect of three environmental factors: Temperature, pH and water activity</article-title>. <source>Food Res. Int.</source> <pub-id pub-id-type="doi">10.1016/0963-9969(96)00039-7</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Juneja</surname>
<given-names>V. K.</given-names>
</name>
<name>
<surname>Marmer</surname>
<given-names>B. S.</given-names>
</name>
<name>
<surname>Phillips</surname>
<given-names>J. G.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>A. J.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Influence of the intrinsic properties of food on thermal inactivation of spores of nonproteolytic Clostridium botulinum: Development of a predictive model</article-title>. <source>J. Food Saf.</source> <pub-id pub-id-type="doi">10.1111/j.1745-4565.1995.tb00145.x</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaplan</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>Structural equation modeling</article-title>. <source>Int. Encycl. Soc. Behav. Sci.</source>, <fpage>15215</fpage>&#x2013;<lpage>15222</lpage>. <pub-id pub-id-type="doi">10.1016/B0-08-043076-7/00776-2</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lianou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Panagou</surname>
<given-names>E. Z.</given-names>
</name>
<name>
<surname>Nychas</surname>
<given-names>G. J. E.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Microbiological spoilage of foods and beverages</source>. <publisher-name>Elsevier</publisher-name>, <publisher-loc>Amsterdam, Netherlands</publisher-loc>, <pub-id pub-id-type="doi">10.1016/B978-0-08-100435-7.00001-0</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Madoumier</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Trystram</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>S&#xe9;bastian</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Collignan</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Towards a holistic approach for multi-objective optimization of food processes: A critical review</article-title>. <source>Trends Food Sci. Technol.</source> <volume>86</volume>, <fpage>1</fpage>&#x2013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1016/j.tifs.2019.02.002</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma&#xf1;as</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Pag&#xe1;n</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Microbial inactivation by new technologies of food preservation</article-title>. <source>J. Appl. Microbiol.</source> <volume>98</volume>, <fpage>1387</fpage>&#x2013;<lpage>1399</lpage>. <pub-id pub-id-type="doi">10.1111/j.1365-2672.2005.02561.x</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pianosi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Sarrazin</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Wagener</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A Matlab toolbox for global sensitivity analysis</article-title>. <source>Environ. Model. Softw.</source> <volume>70</volume>, <fpage>80</fpage>&#x2013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1016/j.envsoft.2015.04.009</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pianosi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Wagener</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A simple and efficient method for global sensitivity analysis based on cumulative distribution functions</article-title>. <source>Environ. Model. Softw.</source> <volume>67</volume>. <pub-id pub-id-type="doi">10.1016/j.envsoft.2015.01.004</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Podolak</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Whitman</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Black</surname>
<given-names>D. G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Factors affecting microbial inactivation during high pressure processing in juices and beverages: A review</article-title>. <source>J. Food Prot.</source> <volume>83</volume>, <fpage>1561</fpage>&#x2013;<lpage>1575</lpage>. <pub-id pub-id-type="doi">10.4315/JFP-20-096</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ross</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Dalgaard</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Secondary models</article-title>. <source>Model. Microb. Responses Food</source>, <fpage>63</fpage>&#x2013;<lpage>150</lpage>. <pub-id pub-id-type="doi">10.1201/9780203503942.ch3</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>H.-S.</given-names>
</name>
<name>
<surname>DeVilbiss</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ramkrishna</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Modeling metabolic systems: The need for dynamics</article-title>. <source>Curr. Opin. Chem. Eng.</source> <volume>2</volume>, <fpage>373</fpage>&#x2013;<lpage>382</lpage>. <pub-id pub-id-type="doi">10.1016/j.coche.2013.08.004</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Villa-Rojas</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>D. H.</given-names>
</name>
<name>
<surname>Mah</surname>
<given-names>J. H.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Thermal inactivation of Salmonella Enteritidis PT 30 in almond kernels as influenced by water activity</article-title>. <source>J. Food Prot.</source> <pub-id pub-id-type="doi">10.4315/0362-028X.JFP-11-509</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Whiting</surname>
<given-names>R. C.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Microbial modeling in foods</article-title>. <source>Crit. Rev. Food Sci. Nutr.</source> <volume>35</volume>, <fpage>467</fpage>&#x2013;<lpage>494</lpage>. <pub-id pub-id-type="doi">10.1080/10408399509527711</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>