CANA: A Python Package for Quantifying Control and Canalization in Boolean Networks

Logical models offer a simple but powerful means to understand the complex dynamics of biochemical regulation, without the need to estimate kinetic parameters. However, even simple automata components can lead to collective dynamics that are computationally intractable when aggregated into networks. In previous work we demonstrated that automata network models of biochemical regulation are highly canalizing, whereby many variable states and their groupings are redundant (Marques-Pita and Rocha, 2013). The precise charting and measurement of such canalization simplifies these models, making even very large networks amenable to analysis. Moreover, canalization plays an important role in the control, robustness, modularity and criticality of Boolean network dynamics, especially those used to model biochemical regulation (Gates and Rocha, 2016; Gates et al., 2016; Manicka, 2017). Here we describe a new publicly-available Python package that provides the necessary tools to extract, measure, and visualize canalizing redundancy present in Boolean network models. It extracts the pathways most effective in controlling dynamics in these models, including their effective graph and dynamics canalizing map, as well as other tools to uncover minimum sets of control variables.


A TOOL TO STUDY REDUNDANCY AND CONTROL IN BOOLEAN NETWORKS
Mathematical and computational modeling of biological networks promises to uncover the fundamental principles of living systems in an integrative manner (Iyengar, 2009;Ideker and Nussinov, 2017). In particular, Boolean Networks (BN), a class of logical dynamical systems, provide an effective framework to capture the dynamics of interconnected biological systems without the need for detailed kinetic parameters (Bornholdt, 2005;Assmann and Albert, 2009). BN have been used to model and predict biochemical regulation in genetic networks (Li et al., 2004), cell signaling (Helikar et al., 2008), chemical reactions in metabolic networks (Chechik et al., 2008), anticancer drug response (Choi et al., 2017), action potentials in neural networks (Kurten, 1988), and many other dynamical systems involved in biomedical complexity (Albert and Othmer, 2003).
Two reasons contribute to the success of BN models: (i) the reduction of complex multivariate dynamics to a graph revealing the organization and constraints of the topology of interactions in biological systems, and (ii) a coarse-grained treatment of dynamics that facilitates predictions of limiting behavior and robustness (Bornholdt, 2008). However, more than understanding the organization of complex biological systems, we need to derive control strategies that allow us, for example, to intervene on a diseased cell , or revert a mature cell to a pluripotent state (Wang and Albert, 2011). Recently, several mathematical tools were developed to enhance our understanding of BN control by removing redundant pathways, identifying key dynamic modules (Marques-Pita and Rocha, 2013), and characterizing critical driver variables .
Here we present CANA 1 , a python package to study redundancy and control in BN models of biochemical dynamics (Correia et al., 2018). It provides a simple interface to access computational tools for three important aspects of BN analysis and prediction: 1. Dynamics. Python classes are included to enumerate all attractors and calculate the full state transition graph (STG) of BN, as described in section 2. 2. Canalization. The redundancy properties of automata functions have been characterized as a form of canalization (Kauffman, 1984), particularly when used to model dynamical interactions in models of genetic regulation and biochemical signaling (Kauffman et al., 2004;Reichhardt and Bassler, 2007;Marques-Pita and Rocha, 2013). At the level of individual Boolean transition functions (network nodes), canalization is observed when not all inputs are necessary to determine a state transition (see section 3 for formal definition). CANA can be used to calculate all measures of canalization that derive from removing dynamical redundancy via two-symbol schemata re-description (Marques-Pita and Rocha, 2013): effective connectivity, input redundancy, and input symmetry. At the network level, CANA also calculates the effective graph, a weighted and directed graph whose edge weights denote their effective contribution to node transitions, as well as the dynamics canalizing map, a parsimonious representation of the necessary and sufficient state transitions that define the entire dynamics of BN. All canalization measures and network representations are applicable to synchronous and asynchronous BN models, as described in section 3. 3. Control. From a subset of driver variables-nodes that act as the loci of control interventions-CANA computes the controlled state transition graph (CSTG), as well as the controlled attractor graph (CAG) capturing all controlled transitions between attractors possible via driver variable interventions . CANA also computes measures of controllability that depend on the CSTG and CAG: mean fraction of reachable configurations, mean fraction of controlled configurations, and mean fraction of reachable attractors, as described in section 4. Currently, control analysis in CANA is applicable only to synchronous BN models.
Here we demonstrate the full functionality of the CANA package using the BN model of floral organ development in the flowering plant Arabidopsis thaliana (Chaos et al., 2006). Additionally, we provide an interface between CANA and the Cell Collective (Helikar et al., 2012), allowing for an extensive analysis of control and canalization in complex biological systems. The CANA package fills a key void in the available library of computational software to analyze Boolean Network models. Existing software falls into two categories: either they are designed to reverse engineer BN models from biological experimental data, or they focus on simulating BN dynamics. Examples of the first category include the CellNetOptimizer which creates BN from high-throughput biochemical data (Terfve et al., 2012), and the Dynamic Deterministic Effects Propagation Networks (DDEPN) package which reconstructs signaling networks based from time-course experimental data (Bender et al., 2010). The second category of BN simulation packages is best exemplified by BooleanNet, a python package that simulates both synchronous and asynchronous dynamics (Albert et al., 2008), and PANET, a Cytoscape plugin that quantifies the robustness of BN models (Trinh et al., 2014). The Cell Collective, a collaborative platform and intuitive visual interface to share and build BN models, can also be used to simulate BN dynamics (Helikar et al., 2012). The CANA package expands the set of available tools of the second category, by providing Python classes to calculate measures and visualizations of canalization (dynamical redundancy) and control of BN models, as detailed below. CANA is designed as a toolbox for both computational and experimental system biologists. It enables the simplification of BN models and testing of network control algorithms, thus prioritizing biochemical variables more likely to be relevant for specific biological questions (e.g., genes controlling cell fate), and ideal candidates for knockout experiments.

BOOLEAN NETWORK REPRESENTATION AND DYNAMICS
A Boolean automaton is a binary variable, x ∈ {0, 1}, whose state is updated in discrete time-steps, t, according to a deterministic Boolean state-transition function of k inputs: x t+1 = f (x t 1 , ..., x t k ). The state-transition function, f : {0, 1} k → {0, 1}, is defined by a look-up (truth) table (LUT), F ≡ {f α : α = 1, ..., 2 k }, with one entry for each of the 2 k combinations of input states and a mapping to the automaton's next state (transition or output), x t+1 ( Figure 1A). In CANA, a Boolean automaton-a python class denoted BooleanNode-is instantiated from the list of transitions that define its LUT.
A Boolean Network is a graph B ≡ (X, C), where X is a set of N Boolean automata nodes x i ∈ X, i = 1, ..., N and C is a set of directed edges c ji ∈ C : x i , x j ∈ X that represent the interaction network, denoting that automaton x j is an input to automaton x i , as computed by F i . The set of inputs for automaton x i is denoted by X i = {x j ∈ X : c ji ∈ C}, and its cardinality, k i = |X i |, is the in-degree of node x i . At any given time t, B is in a specific configuration of automata states, x t = x t 1 , x t 2 , ..., x t N , where we use the terms state for individual automata (x t i ) and configuration (x t ) for the collection of states of all automata of In-degree (k), input redundancy (k r ), input symmetry (k s ), and effective connectivity (k e ) of TFL1 automaton. Values in parenthesis are the respective (relative) measures normalized by k, used for comparisons between automata with different number of inputs. (E) Canalizing Map (CM) of automaton TFL1, with its two possible states, TFL1 ∈ {0, 1}, shown as circles with red contour; white (black) fill color denotes state 0 (1). Input variables and their respective state are also shown as circles (s-units) with the same color criterion, and link to t-units shown as blue diamonds with corresponding threshold value inside; thus, TFL1 requires 3 input conditions (LFY = 0 ∧ EMF1 = 1 ∧ AP1 = 0) to turn on (TLF1 = 1), but only one (EMF1 = 0 ∨ AP1 = 1 ∨ LFY = 1) to turn off (TLF1 = 0); ∧ and ∨ denote the logical conjunction (and) and disjunction (or), respectively. Network rendering generated with Graphviz (Ellson et al., 2002). the BN at time t, i.e. the collective network state. The set of all possible network configurations is denoted by X ≡ {0, 1} N , where |X | = 2 N . The dynamics of B unfolds from an initial configuration, x 0 , by a synchronous, update policy in which all automata transition to the next state at the same time step, or an asynchronous update policy, in which automata update their next step in distinct time steps according to some update schedule (e.g. stochastically). The complete dynamical behavior of the system for all initial conditions is captured by the state-transition graph (STG), G ≡ STG(B) = (X , T ), where each node is a configuration x α ∈ X , and an edge T α,β ∈ T denotes that a BN in configuration x α at time t will be in configuration x β at time t +1. Under deterministic dynamics, only a single transition edge T α,β is allowed out of every configuration node x α . Configurations that repeat, such that x t+µ α = x t β , are known as attractors and differentiated as fixed-point attractors when µ = 1, and limit cycles when µ > 1, respectively. Because G is finite, it contains at least one attractor, as some configuration or limit cycle must repeat in time (Wuensche, 1998).
In CANA, a python class named BooleanNetwork represents a BN, and is instantiated from a dictionary containing the transition functions (LUT) of all its constituent automata nodes, or loaded from a file. We also provide several predefined example BN models that can be directly loaded: the Arabidopsis Thaliana gene regulatory network (GRN) of flowering patterns (Chaos et al., 2006), a simplified version of the segment polarity GRN of Drosophila melanogaster (Albert and Othmer, 2003), the Budding Yeast cell-cycle regulatory network (Li et al., 2004), and the BN motifs analyzed in . Beyond the aforementioned networks, our current release also incorporates all publicly available networks in the Cell Collective repository (Helikar et al., 2012). These were loaded from the Cell Collective API and converted into truth tables that can be read by CANA 2 . Our package has two built-in methods available to compute network dynamics: for relatively small BN (N < 30) the full state-space can be computed, whereas for larger BN, CANA uses a Boolean satisfiability (SAT-based) algorithm, capable of enumerating all attractors in a BN with thousands of variables (Dubrova and Teslenko, 2011).

CANALIZATION
Important insights about BN dynamics are gained by observing that not all inputs to an automaton are equally important for determining its state transitions, a concept known as canalization (Reichhardt and Bassler, 2007). Originally, the term was proposed by Waddington (1942) and subsequently refined to characterize the buffering of genetic and epigenetic perturbations leading to the stability of phenotypic traits (Siegal and Bergman, 2002;Masel and Maughan, 2007;ten Tusscher and Hogeweg, 2009). Understanding how canalization occurs in a given BN model allows us to uncover and remove redundancy present in the pathways that control its dynamics. In CANA, we follow Marques-Pita and Rocha (2013) by quantifying canalization through the logical redundancy present in automata. Specifically, we use the Quine-McCluskey Boolean minimization algorithm (Quine, 1955) to identify those inputs of an automaton which are redundant given the state of its other inputs, thus reducing its LUT to a set of prime implicants. The prime implicants are in turn combined to create wildcard schemata, F ′ ≡ {f ′ υ }, in which the wildcard or "Don't care" symbol, # (also represented graphically in gray) denotes an input whose state is redundant given the state of other necessary input states. In this process, the original LUT F ( Figure 1A) is redescribed by a more compressed set of schemata F ′ (Figure 1B). Every wildcard schema f ′ υ ∈ F ′ redescribes a subset of entries in the original LUT, denoted by ϒ υ ≡ {f α : f α f ′ υ } ⊆ F; means 'is redescribed by'. Finally, CANA also calculates the two-symbol schemata redescription, F ′′ ≡ {f ′′ θ }, whereby in addition to the wildcard symbol, a position-free symbol, •, further captures permutation redundancy (i.e., groupsymmetry): subsets of inputs whose states can permute without affecting the automaton's state ( Figure 1C). Every two-symbol schema f ′′ θ ∈ F ′′ redescribes a set θ ≡ {f α : f α f ′′ θ } ⊆ F of LUT entries of automaton x.
Several measures of canalization present in the LUT of an automaton are also defined in CANA, and can be accessed by function calls to both the BooleanNode and BooleanNetwork classes. Input redundancy, k r (x), measures the number of inputs that on average are not needed to compute the state of automaton x. This is measured by tallying the mean number of wildcard symbols present in the set of schemata F ′ (x) or F ′′ (x) that redescribe the LUT F(x) (Equation 1). Effective connectivity, k e , is a complementary measure of k r (x) yielding the number of inputs that are on average necessary to compute the automaton's state (Equation 1). Whereas k(x) is the number of inputs to automaton x present in the BN, k e (x) is the minimum number of such inputs that are on average necessary to determine the state of x-its effective connectivity or degree. Similarly, input symmetry, k s (x), is the mean number of inputs that can permute without effect on the state of x. It is measured by tallying the mean number of position-free symbols present in F ′′ (x) (Equation 1): where n # υ and n • θ are the number of inputs with a # or • in schema f ′ υ or f ′′ θ , respectively 3 . Figure 1D shows the values of these measures for the LUT of the TFL1 gene in the thaliana GRN model. Additional algorithmic details of the two forms of canalization, as well as their importance to study control, robustness, and modularity of BN models of biochemical regulation, are presented in Marques-Pita and Rocha (2013). Next we introduce new per-input measures of canalization as well as the effective graph, which CANA also computes.
Most automata contain redundancy of one or both of the two forms of canalization; only the two parity functions for any k have k r = 0 (e.g., the XOR function and its negation for k = 2), and even those can have k s > 0. Therefore, the original interaction graph of a BN tends to have much redundancy and does not capture how automata truly influence one another in a BN. To formalize this idea, the CANA package computes an effective graph, E ≡ (X, E), where X is as in section 2 and E is a set of weighted directed edges e ji ∈ [0, 1]∀x i , x j ∈ X denoting the effectiveness of automaton x j in determining the truth value of automaton x i , and computed via Equation 2. Specifically, we define per-input measures of canalization for redundancy, effectiveness, and symmetry, respectively: where (j #) υ is a logical condition that assumes the truth value 1(0) if input j is (not) a wildcard in schema f ′ υ , and similarly for (j •) θ for a position-free symbol in schema f ′′ θ ; avg is the average operator. Naturally, k r (x i ) = j r ji , k e (x i ) = j e ji , and k s (x i ) = j s ji .
The effective graph was shown to be important in predicting the controllability of BN . Furthermore, the mean k e of BN (the mean in-degree of the effective graph) is a better predictor of criticality than the in-degree of the original interaction graph (Manicka, 2017), improving the existing theory for predicting criticality in BN (Aldana, 2003). Those results suggest that Natural Selection can select for canalization, thereby enhancing the stability and controllability of networks with high connectivity, that would otherwise exist in the chaotic regime Manicka, 2017). As an example, the interaction and effective graphs of the Thaliana GRN BN model, as computed by CANA, are shown in Figures 2A,B, demonstrating that much redundancy exists in the original model. The most extreme case of redundancy occurs when an input from x j to automaton x i exists in the original interaction graph C, c ji = 1, but not in the effective graph E, e ji = 0, because  2). Some edges, originally in C, are completely removed in E (e.g., AG→AG, AP1→AG, and AP2→TFL1). Others, have very small effectiveness (e.g., AP1→PI and CLF→AG). (C) Dynamics Canalization Map (DCM) representing the entire logic of interactions after removal of redundancy. Original BN automata nodes appear twice in the DCM, once for each Boolean truth value and denoted as s-unit, white (0) or black (1) circles. When s-units are determined by another single s-unit, for simplicity and without loss of generality, they are connected with a beige directed edge-a simplification to avoid the rendering of a t-unit with a threshold of one. All other variable state determinations occur via t-units with larger threshold values. Red edges represent outputs from t-units to s-units: a state determination of the receiving s-unit, after the logical condition of the t-unit is met.All other (blue or green) edges denote inputs from s-units to t-units, that is, the sufficient conditions for a state determination. Blue edges denote group disjunction constraints, whereby conditions captured by s-units can merge because any one of the merging conditions is sufficient [e.g., (TFL = 0 ∨ EMF1 = 0) → LFY = 1]. Green edges denote independent and necessary conditions. Directed edges into s-units are denoted by arrows, while directed edges into t-units are denoted by small circles. Network rendering by Graphviz (Ellson et al., 2002).
it is fully redundant and does not affect the automaton's transition (see several such cases in Figures 2A,B).
The canalizing logic of an automaton provided by the schemata set F ′′ , can also be represented as a McCulloch and Pitts (1943) threshold network, named a Canalizing Map (CM) in Marques-Pita and Rocha (2013). Figure 1E depicts the CM for the TFL1 gene. It consists of two types of nodes: state units (s-unit, denoted by circles), which represent automata in one of the Boolean truth values (x i = 0, white, or x i = 1, black), and threshold units (t-unit, denoted by diamonds), which implement a numerical threshold condition on its inputs. When the CM of all automata of a BN are linked, we obtain the Dynamics Canalization Map (DCM), as shown in Figure 2C for the Thaliana GRN. Directed fibers connect nodes and propagate an activation pulse; fibers can merge and split, but each end-point always contributes one pulse to an s-unit. The DCM is a highly parsimonious representation of the dynamics of a BN. It contains only necessary information about how (canalizing) control signals determine network dynamics. It enables inferences about control, modularity and robustness to be made about the collective (macro-level) dynamics of BN (Marques-Pita and Rocha, 2013). Because it is assembled using solely the micro-level canalizing logic of individual automata, its computation scales linearly with the number of nodes of the network, and thus it can be computed for very large networks. The computational bottleneck can only be the number of inputs (k) to a particular automaton, since the Quine-McCluskey algorithm grows exponentially with the number of variables. Functions with a large number of variables have to be minimized with heuristic methods such as Espresso (Brayton et al., 1984). Because all measures of canalization, as well as the effective graph and the DCM, derive from removing dynamical redundancy at the level of individual automata, they are independent from the updating regime chosen for the network. In other words, the canalization analysis is applicable to synchronous and asynchronous BN models.

CONTROL
The discovery of control strategies in BN models is a central problem in systems biology; theoretical insights about controllability can enhance experimental turnover by focusing experimental interventions on genes and proteins more likely to result in the desired phenotype output. It is well-known that when the set of automata nodes X of a BN is large, enumeration of all configurations x ∈ X of its STG becomes difficult, making the controllability of deterministic BN an NP-hard problem (Akutsu et al., 2007). Thus control methodologies which leverage the interaction graph or remove the redundancy in canalizing automata are highly desirable, since they can greatly simplify BN complexity.
CANA contains Python functions designed to provide a testbed for the development of BN control strategies, and to investigate the interplay between canalization, control, and other dynamics properties. Specifically, we study the control exerted on the dynamics of a BN, B = (X, C), by a subset of driver variables D ⊆ X-a subset of automata nodes of B. Control interventions are realized by instantaneous bit-flip perturbations to the state of the variables in D (Willadsen and Wiles, 2007). This results in a controlled state transition graph, CSTG(B) ≡ G D ≡ (X , T ∪ T D ), which is an extension of the STG that captures all possible trajectories due to controlled interventions on D . The additional edges T D denote transitions from every configuration to a set of 2 |D| − 1 configurations in the STG, which are reachable given the bit-flip perturbations of the driver variables. A BN is controllable when every configuration is reachable from every other configuration in G D (Sontag, 1998), a condition equivalent to requiring that the CSTG G D be strongly connected. . Each large blue node A 1 , . . . , A 10 represents an attractor of the network dynamics. The BN configurations for steady-state attractors A 3 and A 5 are shown as interaction graphs with node variables colored white or black for states x i = 0 and x i = 1, respectively; driver variables are shown with a yellow contour.
CANA computes the CSTG of B given a driver set D, which in turn is used to calculate the mean fraction of reachable configurations, R D , and the mean fraction of controlled configurations, C D , : where, for each configuration x α , r(G D , x α ) is the fraction of reachable configurations, defined as the number of other configurations X β lying on all directed paths from x α , normalized by the total number of other configurations 2 N−1 . Similarly, the fraction of controlled configurations counts the number of new configurations that are reachable due to interventions to D, but were not originally reachable in the STG: . When a BN is fully controlled by D, R D = 1.0, but for partially controlled BNs R D ∈ [0.0, 1.0); note that C D ≤ R D .
In Systems Biology applications, typically only the attractors of BN are meaningful configurations, used to represent different cell types (Kauffman, 1969(Kauffman, , 1993Müller and Schuppert, 2011), diseased or normal conditions , and wildtype or mutant phenotypes (Albert and Othmer, 2003). In this context, a more relevant control measure is the extent to which driver variables can steer dynamics from attractor to attractor. To quantify such control, CANA computes the controlled attractor graph (CAG) of a BN B : C D = (A, Z D ). The nodes of this graph, A κ ∈ A, represent an attractor of B, and each edge z κγ ∈ Z D , denotes the existence of at least one path from attractor A κ to attractor A γ in the CSTG G D (Figure 3B). The mean fraction of reachable attractors is then given by where κ = 1 . . . |A| . Since this notion of control depends only on the enumeration of attractors, CANA can leverage a SAT-based bounded model algorithm to quantify the mean fraction of reachable attractors in a BN with thousands of variables (Dubrova and Teslenko, 2011). Figure 3A shows the values of R D and A D for various sizes of driver sets D in the Thaliana GRN. Finally, CANA also provides the functionality to approximate the minimal driver variable subset using two prominent network control methodologies: Structural Controlability (SC) (Lin, 1974;Liu et al., 2011) and Minimum Dominating Set (MDS) (Nacher, 2012;Nacher and Akutsu, 2013).

SUMMARY AND CONCLUSION
We presented a novel, open-source and publicly-available software platform that integrates the analytic methodology used to study canalization in automata network dynamics. This methodology can now be used by others to simplify large automata networks, especially those in models of biochemical regulation dynamics. In addition to the extraction and visualization of specific effective pathways that regulate key phenotypic outcomes in a sea of redundant interaction, CANA includes functionality to measure canalization, uncover control variables, and study dynamical modularity, robustness, and criticality. We hope that the consolidation of redundancy and control algorithms into one package encourages other researchers to build upon our work on canalization, thus adding additional algorithms to CANA.

DATA AVAILABILITY STATEMENT
The CANA python package and all datasets analyzed for this study can be found on Github at github.com/rionbr/CANA.

AUTHOR CONTRIBUTIONS
RC, AG, and XW contributed to the CANA package. LR developed the per-input measures of canalization and the effective graph formulation. RC, AG, and LR wrote the manuscript.

FUNDING
RC was supported by CAPES Foundation grant 18668127, Instituto Gulbenkian de Ciência (IGC), and Indiana University Precision Health to Population Health (P2P) Study. LR was partially funded by the National Institutes of Health, National Library of Medicine Program, grant 01LM011945-01, by a Fulbright Commission fellowship, and by NSF-NRT grant 1735095, Interdisciplinary Training in Complex Networks and Systems. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.