<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2020.535976</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Causal Learning From Predictive Modeling for Observational Data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ramanan</surname> <given-names>Nandini</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/911928/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Natarajan</surname> <given-names>Sriraam</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/399203/overview"/>
</contrib>
</contrib-group>
<aff><institution>Computer Science Department, University of Texas at Dallas</institution>, <addr-line>Dallas, TX</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Novi Quadrianto, University of Sussex, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Bowei Chen, University of Glasgow, United Kingdom; Parisa Kordjamshidi, Michigan State University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Nandini Ramanan <email>nandini.ramanan&#x00040;utdallas.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>10</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>3</volume>
<elocation-id>535976</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>02</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>08</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Ramanan and Natarajan.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Ramanan and Natarajan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>We consider the problem of learning structured causal models from observational data. In this work, we use causal Bayesian networks to represent causal relationships among model variables. To this effect, we explore the use of two types of independencies&#x02014;context-specific independence (CSI) and mutual independence (MI). We use CSI to identify the candidate set of causal relationships and then use MI to quantify their strengths and construct a causal model. We validate the learned models on benchmark networks and demonstrate their effectiveness compared to state-of-the-art causal Bayesian network learning algorithms for observational data.</p></abstract>
<kwd-group>
<kwd>causal models</kwd>
<kwd>probabilistic learning</kwd>
<kwd>learning from data</kwd>
<kwd>structured causal models</kwd>
<kwd>causal Bayesian networks</kwd>
</kwd-group>
<contract-num rid="cn001">FA9550-18-1-0462</contract-num>
<contract-sponsor id="cn001">Air Force Office of Scientific Research<named-content content-type="fundref-id">10.13039/100000181</named-content></contract-sponsor>
<counts>
<fig-count count="5"/>
<table-count count="1"/>
<equation-count count="1"/>
<ref-count count="66"/>
<page-count count="13"/>
<word-count count="10384"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Given the recent success of machine learning, specifically deep learning, in several applications (Goodfellow et al., <xref ref-type="bibr" rid="B19">2016</xref>), there is an increased interest in learning more explainable models including causal models.</p>
<p>Many researchers have attempted to develop methods to infer causality from observational data for several years (Pearl, <xref ref-type="bibr" rid="B43">1988b</xref>, <xref ref-type="bibr" rid="B44">2000</xref>; Neapolitan et al., <xref ref-type="bibr" rid="B39">2004</xref>). While there have been some notable contributions demonstrating the plausibility of learning causality from non-experimental data (Granger, <xref ref-type="bibr" rid="B20">1969</xref>; Sims, <xref ref-type="bibr" rid="B52">1972</xref>; Pearl, <xref ref-type="bibr" rid="B44">2000</xref>), learning structural causal models from observational data is still a challenge (Guo et al., <xref ref-type="bibr" rid="B21">2019</xref>). Recent advances in causal discovery have focused on learning Causal Bayesian Networks (CBNs). In this framework, causal relationships among variables are represented with a Directed Acyclic Graph (DAG) (Pearl, <xref ref-type="bibr" rid="B44">2000</xref>). The problem of learning a DAG from data is not computationally tractable in general, as the number of possible DAGs grows exponentially with the number of nodes. This computational complexity has prevented the adaptation and application of causal discovery approaches to high-dimensional datasets, with a few exceptions.</p>
<p>In this work, we consider the problem of full model learning of causal models from observational data. We are inspired by real-world tasks where only limited knowledge is potentially available and hence building a full causal model is not possible. Similarly, the data might be obtained before learning, making interventions particularly hard. In such cases, learning a probabilistic causal model from data is preferred. However, this is a hard task when the number of variables is large. This is the problem we tackle in this paper&#x02014;<italic>how can we scale causal learning to a moderate number of features?</italic></p>
<p>To this effect, we build upon the success of using two types of independencies for building causal models&#x02014;mutual independencies (MI) (Janzing et al., <xref ref-type="bibr" rid="B28">2015</xref>) and context-specific independence (CSI) (Tikka et al., <xref ref-type="bibr" rid="B58">2019</xref>). While MI can be used to quantify the strength of causal relationships, CSI has been used for causal identifiability. We employ both in the context of learning from data. We aim to learn a causal model by first learning probabilistic dependencies that can identify CSI. We then adopt a heuristic measure to remove and re-orient the edges of the probabilistic graphical model, employing MI and heuristics to guide the search. The net result, as we show empirically, is a causal model. This is particularly important, as scaling causal learning to large problems without interventions or bias is a significantly challenging task.</p>
<p>Specifically, we leverage the success of dependency networks (DN) (Heckerman et al., <xref ref-type="bibr" rid="B24">2000</xref>; Neville and Jensen, <xref ref-type="bibr" rid="B40">2007</xref>; Natarajan et al., <xref ref-type="bibr" rid="B38">2012</xref>) for learning with large data sets. Recall that a DN is a probabilistic graphical model that approximates the joint distribution using a product of conditionals. Hence, compared to a Bayesian network (BN), they are less interpretable and, more importantly, approximate. However, their key advantage is that, since they are products of conditionals, the conditionals can be learned in parallel, allowing them to scale to very large data sets.</p>
<p>To scale causal model learning, we first learn a DN. To do so, we learn a single (probabilistic) tree for every variable. We then use the mutual information measures employed in causal models to score the edges, removing weak edges and breaking any cycles in the DN. Contrary to popular intuition, we employ two levels of learning to uncover a causal model&#x02014;the first learns a DN using trees and the second learns a causal model employing heuristic measures. Our evaluations on two synthetic and one real benchmark causal data sets demonstrate the utility of such an approach. While we present quantitative metrics, qualitatively, the edges learned in this model uncover interesting findings. In addition, we compare the proposed approach to three other state-of-the-art causal learning methods applied to just the non-experimental data. Our results demonstrate that we recover most of the causal links on large problems in order-of-magnitude fewer operations than most causal approaches.</p>
<p>We make a few crucial contributions&#x02014;we present the first causal learning approach that leverages progress in probabilistic methods toward learning from data. We develop heuristics for breaking cycles and orienting edges based on causal modeling research. We learn a causal model on two synthetic and one real benchmark causal data sets and compare with the ground-truth networks to understand the robustness of our approach. We also demonstrate the efficacy and efficiency of the approach on standard benchmark data sets compared to other state-of-the-art constraint-based methods in the literature. Our proposed approach opens the door for a domain expert to interactively guide the causal model learner to a better model, thus allowing for a hybrid method for causal models.</p>
<p>The rest of the paper proceeds as follows: after reviewing the related work on BNs, followed by a discussion of notable work in constraint-based methods for learning CBNs, we provide the background on DN learning. Next, we present our algorithm and provide intuitions about its functionality. We then discuss the motivation for this work and describe the three benchmark data sets used to learn the joint causal model over the factors. Then we present the empirical evaluations on the two synthetic benchmark causal data sets and one real data set by comparing our algorithm with other commonly used causal learning approaches as well as the ground truth. Finally, we conclude by outlining potentially interesting future directions.</p></sec>
<sec id="s2">
<title>2. Background and Related Work</title>
<p>We first introduce Bayesian networks and dependency networks and certain concepts which build the foundation for innovations in CBN learning.</p>
<sec>
<title>2.1. Bayesian Network</title>
<p>A Bayesian network (BN) is a directed acyclic graph <italic>G</italic> &#x0003D; &#x02329;<bold>V</bold>, <bold>E</bold>&#x0232A; whose nodes <bold>V</bold> represent random variables and edges <bold>E</bold> represent the conditional influences among the variables. A BN encodes a factored joint representation as <italic>P</italic>(<bold>V</bold>) &#x0003D; &#x0220F;<sub><italic>i</italic></sub><italic>P</italic>(<italic>V</italic><sub><italic>i</italic></sub>&#x02223;<bold>Pa</bold>(<italic>V</italic><sub><italic>i</italic></sub>)), where <bold>Pa</bold>(<italic>V</italic><sub><italic>i</italic></sub>) is the parent set of the variable <italic>V</italic><sub><italic>i</italic></sub>. It is well-known that full model learning of a BN is computationally intensive, as it involves repeated probabilistic inference inside parameter estimation, which in turn is performed in each step of structure search (Chickering, <xref ref-type="bibr" rid="B5">1996</xref>). Therefore, much of the research has focused on approximate, local search algorithms that are broadly classified as constraint-based and score-based.</p>
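As a concrete illustration of the factored joint above, the following minimal Python sketch evaluates P(V) as a product of conditionals. The tiny two-node network (Rain &#x02192; WetGrass) and its CPT values are illustrative assumptions, not material from this article:

```python
# Minimal sketch of a BN factored joint P(V) = prod_i P(V_i | Pa(V_i)).
# The two-node network and its CPT numbers are purely illustrative.

# Parent sets and CPTs, keyed by parent-value assignments.
parents = {"Rain": (), "WetGrass": ("Rain",)}
cpt = {
    "Rain":     {(): {True: 0.2, False: 0.8}},
    "WetGrass": {(True,):  {True: 0.9, False: 0.1},
                 (False,): {True: 0.1, False: 0.9}},
}

def joint(assignment):
    """P(V) as a product of conditionals P(V_i | Pa(V_i))."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_vals][assignment[var]]
    return p

print(joint({"Rain": True, "WetGrass": True}))   # 0.2 * 0.9 = 0.18
```

Summing `joint` over all complete assignments recovers 1, confirming the factorization defines a valid distribution.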
<p>In constraint-based methods, we learn a BN which is consistent with conditional independencies inferred from data (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>). By contrast, score-based methods search through the space of structures, and find the structure with the highest score (Heckerman et al., <xref ref-type="bibr" rid="B25">1995</xref>; Friedman et al., <xref ref-type="bibr" rid="B15">1999</xref>). Hybrid learning approaches combine the advantages of both approaches; for example, using constraint-based techniques to estimate the network skeleton, and using score-based techniques to identify the set of edge orientations that best fit the data (Tsamardinos et al., <xref ref-type="bibr" rid="B61">2006</xref>).</p>
<p>Our work is inspired by and can be considered as extending constraint-based methods which have been discussed extensively in the context of causal structure discovery.</p></sec>
<sec>
<title>2.2. Constraint-Based Algorithms</title>
<p>Constraint-based methods for learning causal structure from just the observational data typically use tests for conditional independencies to identify the causal links that exist in the data.</p>
<p>The following three assumptions are employed to connect the underlying causal relations, which are not perceived directly, to observable probabilistic dependencies:
<list list-type="bullet">
<list-item><p>The <bold>Causal Markov Assumption</bold> states that every variable in a causal DAG <italic>G</italic><sub><italic>c</italic></sub> is (probabilistically) independent of all other variables if all its parents are observed.</p></list-item>
<list-item><p>The <bold>Faithfulness Assumption</bold> states that a causal DAG <italic>G</italic><sub><italic>c</italic></sub> and probability distribution <italic>P</italic> are faithful to one another iff the only conditional independencies in <italic>P</italic> are those entailed by the <italic>Causal Markov Condition</italic> on <italic>G</italic><sub><italic>c</italic></sub>.</p></list-item>
<list-item><p>The <bold>Causal Sufficiency Assumption</bold> states that there does not exist a common unobserved cause of two or more nodes in the domain (no hidden common causes).</p></list-item>
</list></p>
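Constraint-based learners operationalize these assumptions with conditional-independence tests on data. A minimal sketch: estimate the conditional mutual information I(X; Y | Z) from samples, where values near zero suggest X is independent of Y given Z. The noisy chain X &#x02192; Z &#x02192; Y below is a synthetic illustration, not data from this article:

```python
# Plug-in estimate of conditional mutual information I(X; Y | Z) from
# samples of a synthetic chain X -> Z -> Y (illustrative assumption).
import random
from collections import Counter
from math import log

random.seed(0)
samples = []
for _ in range(20000):
    x = random.random() < 0.5
    z = x if random.random() < 0.9 else (not x)   # Z copies X with noise
    y = z if random.random() < 0.9 else (not z)   # Y copies Z with noise
    samples.append((x, y, z))

def cond_mutual_info(samples):
    """I(X; Y | Z) in nats, estimated from empirical counts."""
    n = len(samples)
    xyz = Counter(samples)
    xz = Counter((x, z) for x, y, z in samples)
    yz = Counter((y, z) for x, y, z in samples)
    nz = Counter(z for x, y, z in samples)
    cmi = 0.0
    for (x, y, z), c in xyz.items():
        cmi += (c / n) * log(c * nz[z] / (xz[(x, z)] * yz[(y, z)]))
    return cmi

# Z screens off X from Y, so the estimate should be close to zero.
print(round(cond_mutual_info(samples), 4))
```

Because Z d-separates X and Y in the chain, the estimated I(X; Y | Z) is close to zero, which is exactly the kind of evidence a constraint-based method uses to drop the X&#x02013;Y edge.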
<p>The <italic>Causal Markov Assumption</italic> produces a set of (conditional and unconditional) probabilistic independencies from a causal graph, and the <italic>Faithfulness Assumption</italic> ensures that all of the probabilistic independencies in the distribution are entailed by the causal Markov condition. The three assumptions stated above together ensure that the causal DAG <italic>G</italic><sub><italic>c</italic></sub> meets the <italic>Minimality Condition</italic>. The minimality condition ensures that there exists no proper subgraph of the true causal DAG <italic>G</italic><sub><italic>c</italic></sub> that satisfies the causal Markov assumption and produces the same probability distribution (Zhang, <xref ref-type="bibr" rid="B66">2008</xref>).</p>
<p>Consequently, constraint-based methods for causal discovery are both sound and complete given perfect (noise-free) data (Spirtes and Glymour, <xref ref-type="bibr" rid="B54">1991</xref>; Zhang, <xref ref-type="bibr" rid="B66">2008</xref>; Colombo and Maathuis, <xref ref-type="bibr" rid="B9">2014</xref>). The well-known PC algorithm assumes no latent variables and learns a BN consistent with the conditional independencies inferred from data (Spirtes et al., <xref ref-type="bibr" rid="B55">1993</xref>; Margaritis and Thrun, <xref ref-type="bibr" rid="B35">2000</xref>). PC and a related algorithm, FCI (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>), take a global approach to causal discovery by learning a network to model the joint distribution; the FCI algorithm can, in addition, model latent confounders. However, they require searching over an exponential space of possible causal structures, which restricts their adaptation to high-dimensional data (Silander and Myllymaki, <xref ref-type="bibr" rid="B51">2012</xref>). Consequently, there are extensions of FCI, such as RFCI (Colombo et al., <xref ref-type="bibr" rid="B10">2012</xref>), that improve the efficiency at the cost of model quality.</p>
<p>The PC algorithm is heavily dependent on variable order, i.e., if the order of the variables changes during learning, the resultant causal Bayesian network could change. Stable-PC (Colombo and Maathuis, <xref ref-type="bibr" rid="B8">2012</xref>) is a modified version of the PC algorithm that queries all the neighbors of each node while computing CI tests and yields order-independent skeletons. This modified PC algorithm is efficient enough to handle large sets of variables, at the cost of not being provably sound and complete (Coumans et al., <xref ref-type="bibr" rid="B12">2017</xref>). To overcome the inefficiency of computing CI tests between all pairs of variables, algorithms that uncover only local causal relationships between a specific target node and its neighbors have been developed (Margaritis and Thrun, <xref ref-type="bibr" rid="B35">2000</xref>; Aliferis et al., <xref ref-type="bibr" rid="B1">2003</xref>; Ramsey et al., <xref ref-type="bibr" rid="B47">2017</xref>). A well-known work in this line of research is the Grow-Shrink (GS) algorithm (Margaritis and Thrun, <xref ref-type="bibr" rid="B35">2000</xref>). GS is based on the idea that the Markov blanket of a node contains all the nodes that carry information about it. Although the PC and GS algorithms have had a major impact in this area of research, GS is still exponential in the size of the Markov blanket.</p>
<p>Following the success of GS, several methods, such as IAMB (Tsamardinos et al., <xref ref-type="bibr" rid="B60">2003</xref>) and its variants (Yaramakala and Margaritis, <xref ref-type="bibr" rid="B64">2005</xref>), have been developed for the induction of CBNs by identifying the neighborhood of each node. Unlike PC and FCI, the well-known Greedy Equivalence Search (GES) algorithm (Meek, <xref ref-type="bibr" rid="B36">1995</xref>) begins with an empty graph and adds and removes edges iteratively. The GES algorithm falls broadly under the score-and-search procedures, which search over equivalence classes of DAGs and score them (Chickering, <xref ref-type="bibr" rid="B6">2002a</xref>,<xref ref-type="bibr" rid="B7">b</xref>). Although GES works well with a moderate number of nodes, the space of equivalence classes is exponential in the number of nodes (Gillispie and Perlman, <xref ref-type="bibr" rid="B17">2013</xref>). The Greedy Fast Causal Inference (GFCI) algorithm combines the benefits of GES (to learn the network) and FCI (to prune unnecessary edges as well as orient the edges) (Ogarrio et al., <xref ref-type="bibr" rid="B41">2016</xref>). Meanwhile, there has also been growing evidence demonstrating the possibility of discovering causal relationships by combining both experimental and observational data (Cooper and Yoo, <xref ref-type="bibr" rid="B11">2013</xref>; Hauser and B&#x000FC;hlmann, <xref ref-type="bibr" rid="B23">2015</xref>; Meinshausen et al., <xref ref-type="bibr" rid="B37">2016</xref>). Another notable direction involves learning from mixed data types (continuous and discrete variables) (Andrews et al., <xref ref-type="bibr" rid="B2">2018</xref>; Tsagris et al., <xref ref-type="bibr" rid="B59">2018</xref>). In principle, our approach can be naturally adapted to handle mixed variable types, as long as an appropriate conditional independence test is employed; we note this as a future direction.</p>
<p>Our approach can be seen as scaling such methods to large observational data by potentially identifying a cyclic dependency network that can then be transformed into a causal graph. As mentioned earlier, we move away from the data-driven independency tests and consider model-based independency tests which could allow us to scale to potentially large data sets. We hypothesize that learning such a dependency network is scalable thus reducing the complexity of causality search.</p></sec>
<sec>
<title>2.3. Dependency Networks</title>
<p>Dependency networks (DNs) (Heckerman et al., <xref ref-type="bibr" rid="B24">2000</xref>) are another class of directed models, similar to BNs except that the associated network structure need not be acyclic. That is, unlike a BN, a DN permits cycles. A DN encodes conditional independence constraints such that each node is independent of all other nodes given its parents (Heckerman et al., <xref ref-type="bibr" rid="B24">2000</xref>). Therefore, DNs approximate the joint distribution over the variables as a product of conditionals, thus allowing for cycles. These conditionals can be learned locally, resulting in significant efficiency gains over exact models, i.e., <bold>P</bold>(<bold>V</bold>) &#x0003D; &#x0220F;<sub><italic>V</italic>&#x02208;<bold>V</bold></sub><bold>P</bold>(<italic>V</italic>|<bold>Pa</bold>(<italic>V</italic>)), where <bold>Pa</bold>(<italic>V</italic>) indicates the parent set of the target variable <italic>V</italic>. Since DNs are approximate [unlike standard Bayes nets (BNs)], Gibbs sampling is typically used to recover the joint distribution; this approach is, however, very slow even in reasonably sized domains. In summary, learning DNs is scalable and efficient, especially for larger data sets, but BNs are preferable for inference, interpretation, discovery, and analysis. Recall that our goal is to discover causal relationships between variables. To develop an approach for this motivating application, we propose an algorithm for learning a BN from a DN that can scale to a large number of variables.</p></sec></sec>
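The key DN property above&#x02014;each conditional P(V | V \ V) can be estimated locally and independently&#x02014;can be sketched in a few lines. Here each conditional is a smoothed frequency table rather than the probability trees used later in this article, and the three binary variables are an illustrative assumption:

```python
# Sketch: learn each DN conditional P(V_i | all other variables)
# independently by counting with Laplace smoothing. The three-variable
# chain A -> B -> C is synthetic, illustrative data.
from collections import defaultdict
import random

random.seed(1)
names = ["A", "B", "C"]
data = []
for _ in range(5000):
    a = random.random() < 0.5
    b = a if random.random() < 0.8 else (not a)
    c = b if random.random() < 0.8 else (not b)
    data.append({"A": a, "B": b, "C": c})

def learn_conditional(target, data):
    """Estimate P(target = 1 | all other variables) by counting."""
    counts = defaultdict(lambda: [1, 1])       # Laplace prior
    for row in data:
        ctx = tuple(sorted((k, v) for k, v in row.items() if k != target))
        counts[ctx][row[target]] += 1
    return {ctx: c[1] / (c[0] + c[1]) for ctx, c in counts.items()}

# Each conditional is learned locally -- this loop is embarrassingly parallel.
dn = {v: learn_conditional(v, data) for v in names}
ctx = (("B", True), ("C", True))
print(round(dn["A"][ctx], 2))   # estimate of P(A=1 | B=1, C=1)
```

Because every conditional is fit in isolation, the loop over variables parallelizes trivially, which is exactly the scalability argument made above; the price is that the product of these conditionals only approximates a joint distribution.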
<sec id="s3">
<title>3. Exploiting Context-Specific Independencies for Learning Causal Models</title>
<p>Given the necessary background, we now present our learning algorithm for learning causal models from data. Our method is purely data-driven&#x02014;extending this work to exploit domain expertise is an important immediate future direction. However, it must be noted that incorporating human advice as inductive bias, search constraints and/or orientation knowledge is natural in our framework. In this work, we assume that only the data and (if available) some ordering over the variables as inductive bias is provided.</p>
<p>We use bold capital letters to denote sets (e.g., <bold>V</bold>) and plain capital letters to denote set members (e.g., <italic>V</italic><sub><italic>i</italic></sub> &#x02208; <bold>V</bold>). Using this convention, we denote the set of variables as <bold>V</bold>. The goal of our algorithm is to learn the joint distribution over all the variables (features and the target) that models causality. Given that there is no additional input, it is quite possible that the joint distribution that is purely learned from data may not result in a causal model, i.e., the learned network is a general Bayes net (BN) instead of a causal Bayes net (CBN). To evaluate this, we verify the learned model on a few benchmarks to demonstrate the efficacy of the approach. Beyond empirical evaluations, we provide some theoretical insights on why the learned model is causal. Before explaining the procedure, let us formally define the learning task.</p>
<p><bold>Given:</bold> Data, <inline-formula><mml:math id="M1"><mml:mstyle mathvariant="bold"><mml:mtext>D</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <italic>n</italic> is the number of variables, <italic>m</italic> is the number of examples, <bold>V</bold> is the set of variables,</p>
<p><bold>To Do:</bold> Learn a causal joint distribution, <italic>P</italic>(<bold>V</bold>), i.e., a causal BN &#x02329;<bold>V</bold>, <bold>E</bold>&#x0232A;, where <bold>E</bold> is the set of edges in the causal BN.</p>
<p>One of the challenges with standard BN learners, and certainly CBN learners, is that of scale. When the number of variables is large (as in the real benchmark data set), many structure learning algorithms do not scale well. Hence, we propose a hybrid approach that combines the salient features of both families of methods&#x02014;the ability of score-based methods to perform local search effectively and the ability of constraint-based methods to potentially identify causal models. More precisely, our algorithm performs three steps: learning a dependency network from data, detecting the cycles, and removing the edges whose endpoint variables are mutually independent. This process is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>. The overall intuition behind this approach is fairly simple: use a scalable algorithm to handle a large number of variables and learn a dense model quickly. Since this learned model could potentially (and in practice does) contain many cycles, we detect and remove edges based on mutual information. We then orient the edges ensuring acyclicity. Given that previous literature has demonstrated that an information-theoretic measure based on mutual information between two variables <italic>X</italic> and <italic>Y</italic> can be used as a reliable measure for quantifying the strength of an arc <italic>X</italic> &#x02192; <italic>Y</italic> (Solo, <xref ref-type="bibr" rid="B53">2008</xref>; Weichwald et al., <xref ref-type="bibr" rid="B62">2014</xref>; Janzing et al., <xref ref-type="bibr" rid="B28">2015</xref>), we use CSI and MI to establish the causal relationships.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Flow chart of the proposed framework. Given data <italic>D</italic> with <italic>V</italic> variables, a dependency network <italic>DN</italic> &#x02261; (<bold>V</bold>, <bold>E</bold>) is learned on the entire data, where each conditional is a decision tree of small depth. Recall that the resultant <italic>DN</italic> may have bidirected edges between nodes; all such bidirected edges in the <italic>DN</italic> (if any) are converted to undirected edges. For every pair of variables connected by an edge in the <italic>DN</italic>, a mutual information score is computed. We loop through all the cycles in the <italic>DN</italic>, identifying the shortest cycles first and removing the appropriate edges whose MI falls below the threshold &#x003B4;. Our framework also allows an expert to provide the predefined threshold &#x003B4;. The process is repeated until there are no more directed cycles. Finally, the undirected edges are oriented based on MI while preserving acyclicity.</p></caption>
<graphic xlink:href="fdata-03-535976-g0001.tif"/>
</fig>
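The cycle-breaking core of the pipeline can be sketched on a toy graph: start from a (cyclic) dependency-network skeleton with a precomputed mutual-information score per edge, then repeatedly delete the weakest edge on some directed cycle. The edge names and MI values below are illustrative assumptions, not output of our learner:

```python
# Toy sketch of MI-guided cycle removal: find a directed cycle (DFS),
# delete its lowest-MI edge, repeat until the graph is acyclic.

def find_cycle(edges):
    """Return one directed cycle as a list of edges, or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)

    def dfs(u, path, on_path, seen):
        on_path.add(u); seen.add(u); path.append(u)
        for w in adj.get(u, []):
            if w in on_path:                 # back edge -> cycle found
                i = path.index(w)
                cyc = path[i:] + [w]
                return list(zip(cyc, cyc[1:]))
            if w not in seen:
                found = dfs(w, path, on_path, seen)
                if found:
                    return found
        on_path.discard(u); path.pop()
        return None

    seen = set()
    for u in list(adj):
        if u not in seen:
            found = dfs(u, [], set(), seen)
            if found:
                return found
    return None

def break_cycles(edges, mi):
    """Repeatedly delete the lowest-MI edge on some directed cycle."""
    edges = set(edges)
    while (cyc := find_cycle(edges)) is not None:
        edges.discard(min(cyc, key=lambda e: mi[e]))
    return edges

mi = {("A", "B"): 0.9, ("B", "C"): 0.7, ("C", "A"): 0.1, ("C", "D"): 0.5}
dag = break_cycles(mi.keys(), mi)
print(sorted(dag))   # ("C","A") is the weakest edge on the A->B->C->A cycle
```

In the toy graph the cycle A&#x02192;B&#x02192;C&#x02192;A loses its weakest edge (C, A), leaving an acyclic skeleton; our full algorithm additionally prefers short cycles and uses the expert-adjustable threshold &#x003B4; described in the figure.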
<p>We now describe each of these steps in detail before presenting the high-level algorithm.</p>
<sec>
<title>3.1. Learning Context-Specific Independences</title>
<p>The first step of our learning algorithm is to learn distributions of the form <italic>P</italic>(<italic>V</italic><sub><italic>i</italic></sub>|<bold>V</bold>\<italic>V</italic><sub><italic>i</italic></sub>), i.e., a conditional for each variable given all the other variables in the data. To this effect, we employ the intuition that a structured representation of a conditional probability table (CPT), such as a tree, can be used inside probabilistic models to capture <italic>context-specific independence</italic> (CSI) (Boutilier et al., <xref ref-type="bibr" rid="B3">1996</xref>). Specifically, we learn a single probability tree for each variable <italic>V</italic><sub><italic>i</italic></sub> given all the other variables in the data. Tree CPDs can capture <italic>context-specific independence</italic> based on regularities in the CPTs of a node. A tree CPD for a variable is a rooted tree in which each interior node represents a test on a parent variable and each leaf holds the probability conditioned on the particular configuration along the path from the root to that leaf. The key idea here is that each tree can capture the CSI that exists between the variable&#x00027;s parents and the target variable conditioned on the values of some of the other parents. This is an important step, as <italic>it has been recently demonstrated that CSI can be used for identifying causal effects</italic> by Tikka et al. (<xref ref-type="bibr" rid="B58">2019</xref>). While their work derives the calculus for identifying the causal relationships, we go further by employing CSI in larger data sets. Further, our final learned network can be considered a special case of the structural causal model proposed by Tikka et al., where the structured representations (trees) are used to model the CSIs and the edges of the graphical model are aligned using information-theoretic measures.</p>
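A minimal sketch of such a tree CPD makes the CSI explicit: in the context A = 0, the target Y is independent of B (both values of B reach the same leaf), while for A = 1 the distribution splits on B. The variables and probabilities are illustrative assumptions:

```python
# Tree CPD encoding context-specific independence (CSI).
# Interior nodes test a variable; leaves hold P(Y = 1 | path context).
tree = ("A", {
    0: 0.3,                       # A=0: single leaf -> Y independent of B here
    1: ("B", {0: 0.2, 1: 0.9}),   # A=1: distribution depends on B
})

def p_y1(assignment, node=None):
    """Walk the tree CPD to the leaf matching the assignment."""
    node = tree if node is None else node
    if isinstance(node, float):
        return node
    var, children = node
    return p_y1(assignment, children[assignment[var]])

# Same probability for both values of B when A=0 (the CSI):
print(p_y1({"A": 0, "B": 0}), p_y1({"A": 0, "B": 1}))   # 0.3 0.3
```

A full tabular CPT over (A, B) would need four rows; the tree collapses the two A = 0 rows into one leaf, which is exactly the regularity our probability trees exploit.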
<p>To learn the CSI at every variable, we employ the notion of DNs. Recall that a DN is a (potentially cyclic) graphical model that approximates the joint distribution as a product of conditionals. To learn such a DN, we iterate through every variable and learn a (probabilistic) decision tree for each variable given all the other variables, i.e., the goal is to learn <italic>P</italic>(<italic>V</italic><sub><italic>i</italic></sub>|<bold>V</bold> \ <italic>V</italic><sub><italic>i</italic></sub>) for each <italic>i</italic>, where each conditional is modeled using a probabilistic tree. We observe that in this step, one could provide an important piece of domain knowledge&#x02014;<italic>an ordering between the variables</italic>. This variable ordering can be used to construct an expert-guided causal model that introduces CSIs satisfying the ordering constraints. As shown by Tikka et al. (<xref ref-type="bibr" rid="B58">2019</xref>), the conditional distributions induced using these CSIs can be effectively employed in identifying causal effects via do-calculus.</p>
<p>The advantage of this approach is that it learns the qualitative relationships (structure) and quantitative influences (parameters) simultaneously. The structure is simply the set of all the variables appearing in the tree, and the parameters are the distributions at the leaves, which can be reused in later stages. Another advantage is that the approach is easily parallelizable and scalable. Thus, our method can be viewed as one that could scale up the learning of causal models to large real-world data sets. A third advantage is that, being a separate step, this step can be integrated with other causal search methods, such as the one proposed by Tikka et al. Exploring these connections is an interesting future direction.</p>
<p>Let us denote the conditionals learned over all the variables (potentially given some order) as <italic>DN</italic>, the dependency network induced from the data. In most cases, this DN contains cycles, since these conditionals are learned independently of each other. This is both an advantage and a disadvantage: an advantage because the costly step of checking for acyclicity can be avoided during learning, and a disadvantage because the resulting model is approximate. Shorter cycles can result in larger approximations (Heckerman et al., <xref ref-type="bibr" rid="B24">2000</xref>). After learning this <italic>DN</italic>, we perform an additional step: whenever both <italic>X</italic> &#x02190; <italic>Y</italic> and <italic>X</italic> &#x02192; <italic>Y</italic> are present, we replace them with a single undirected edge between <italic>X</italic> and <italic>Y</italic>. This is similar to the PC algorithm (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>) in that strong correlations between two variables are treated as undirected and will be oriented in the final step of our algorithm. Next, we convert the DN to an intermediate CBN with potential undirected edges.</p></sec>
<sec>
<title>3.2. Detecting and Removing Cycles</title>
<p>To convert the DN to a CBN, the first step is to detect and remove cycles. A na&#x000EF;ve approach to deleting edges would be to search for an edge, remove it, and check for acyclicity and log-likelihood (Hulten et al., <xref ref-type="bibr" rid="B27">2003</xref>). The key limitation of this approach is that the resulting model is not necessarily causal: the use of log-likelihood does improve the training performance but does not guarantee causality. Hence, inspired by research on information-theoretic approaches to causality (Solo, <xref ref-type="bibr" rid="B53">2008</xref>; Weichwald et al., <xref ref-type="bibr" rid="B62">2014</xref>; Janzing et al., <xref ref-type="bibr" rid="B28">2015</xref>), we employ mutual information for identifying the edges.</p>
<p>For detecting cycles, several methods exist (Kahn, <xref ref-type="bibr" rid="B29">1962</xref>), including topological sorting; any of these would be compatible with our learning algorithm. For the purposes of our data sets, we employ depth-first search (DFS). One key aspect of our DFS is that we identify short cycles. Recall that a DN approximates a joint distribution as a product of conditionals.
<disp-formula id="E1"><mml:math id="M2"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02248;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x0220F;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>\</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
The theoretical analysis of the approximation is based on the inference algorithm, specifically Gibbs sampling, and on the size of the data. In simple terms, if the Gibbs sampler converges on a large data set, the approximation is quite effective (Heckerman et al., <xref ref-type="bibr" rid="B24">2000</xref>; Neville and Jensen, <xref ref-type="bibr" rid="B40">2007</xref>). In practice, we have previously observed that when the cycles are large, i.e., when the corresponding cliques in the undirected graph are large, the approximation is quite robust (Natarajan et al., <xref ref-type="bibr" rid="B38">2012</xref>; De Raedt et al., <xref ref-type="bibr" rid="B13">2016</xref>).</p>
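For illustration, a shortest directed cycle can be found with a breadth-first search from each node (BFS finds the shortest return path to the start node directly; the paper uses a DFS variant, so `shortest_cycle` below is only a sketch of the idea):

```python
from collections import deque

def shortest_cycle(nodes, edges):
    # BFS from each node over directed edges; the first path that
    # returns to its start is a shortest cycle through that node
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
    best = None
    for s in nodes:
        q = deque([(s, [s])])
        seen = {s}
        while q:
            u, path = q.popleft()
            for w in adj[u]:
                if w == s:
                    cyc = path + [s]
                    if best is None or len(cyc) < len(best):
                        best = cyc
                    q.clear()  # shortest cycle through s found
                    break
                if w not in seen:
                    seen.add(w)
                    q.append((w, path + [w]))
    return best
```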
<p>With this insight, in the first step of cycle detection, we identify the short cycles. The intuition is that short cycles lead to larger approximation errors, so removing them renders the product of conditionals closer to the true joint distribution. Once the shortest cycle is identified, the next step is identifying the edge to remove from this short cycle. For this purpose, we employ mutual information (MI). As a pre-processing step, we compute the MI between every pair of variables and sort the pairs by their MI. We consider MI instead of conditional MI because one of our key goals is efficiency: computing conditional MI requires conditioning on a large set of related variables in the DN, which entails both repeated computations and a large number of conditionals. Thus, we first detect the smallest directed cycle and then break it by removing the edges whose MI is smaller than a predefined threshold &#x003B4;. Unless a default value is provided by an expert as domain knowledge, we choose &#x003B4; to be the MI value with the largest difference to the previous value in the sorted list (the <italic>maximum adjacent difference</italic>). Large values of &#x003B4; result in a sparse graph, and lower values of &#x003B4; result in a dense graph. Once these edges are removed, the process continues: the next smallest cycle (if one exists) is detected, its low-MI edges are removed, and so on. <bold>Coupling CSI with MI between variables</bold> <italic>X</italic> <bold>and</bold> <italic>Y</italic> <bold>quantifies the strength of</bold> <italic>X</italic> &#x02192; <italic>Y</italic>.</p>
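The MI pre-processing and the maximum-adjacent-difference choice of &#x003B4; can be sketched as follows (discrete data as a list of dicts, as in the earlier sketches; `delta_threshold` is our illustrative name for the default rule):

```python
import math
from itertools import product

def mutual_info(rows, x, y):
    # empirical mutual information I(X; Y) in bits
    n = len(rows)
    mi = 0.0
    for vx, vy in product({r[x] for r in rows}, {r[y] for r in rows}):
        pxy = sum(1 for r in rows if r[x] == vx and r[y] == vy) / n
        px = sum(1 for r in rows if r[x] == vx) / n
        py = sum(1 for r in rows if r[y] == vy) / n
        if pxy > 0:
            mi += pxy * math.log2(pxy / (px * py))
    return mi

def delta_threshold(mis):
    # default delta: the MI value with the largest gap to its
    # predecessor in the descending-sorted list of MI values
    s = sorted(mis, reverse=True)
    return max((s[i - 1] - s[i], s[i]) for i in range(1, len(s)))[1]
```

For instance, with sorted MI values 0.9, 0.8, 0.2, 0.1, the largest adjacent gap is between 0.8 and 0.2, so &#x003B4; = 0.2 and the two weakest edges fall at or below the threshold.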
<p>To summarize, from the DN, we create an initial CBN by detecting cycles and removing edges with low dependencies. The last step is to orient the remaining undirected (in reality, bi-directed) edges and then learn the parameters of the resulting causal BN.</p></sec>
<sec>
<title>3.3. Edge Orientation and Parameter Learning</title>
<p>Once the directed cycles are detected and removed, we focus on the undirected edges (in reality, bi-directed edges). Inspired by the PC algorithm (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>), we orient these edges in the final step using two criteria&#x02014;MI and acyclicity: we remove the direction with the lowest MI, provided this does not result in a cycle. As mentioned earlier, this is similar to PC. After all the undirected edges have been oriented, the resulting CBN is our causal network skeleton.</p>
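One reading of this orientation criterion can be sketched as follows, assuming each bi-directed pair is resolved in ascending MI order by keeping whichever direction preserves acyclicity; `orient_bidirected` and `creates_cycle` are hypothetical helper names of ours, not from the algorithm listing:

```python
def creates_cycle(edges, u, v):
    # would adding u -> v close a directed cycle, i.e., does v reach u?
    stack, seen = [v], set()
    while stack:
        x = stack.pop()
        if x == u:
            return True
        if x in seen:
            continue
        seen.add(x)
        stack.extend(w for (y, w) in edges if y == x)
    return False

def orient_bidirected(edges, mi):
    # edges: set of directed (u, v); a bi-directed pair appears as
    # both (u, v) and (v, u).  mi: dict keyed by sorted variable pair.
    pairs = {frozenset(e) for e in edges if (e[1], e[0]) in edges}
    out = {e for e in edges if frozenset(e) not in pairs}
    for pair in sorted(pairs, key=lambda p: mi[tuple(sorted(p))]):
        u, v = sorted(pair)
        if not creates_cycle(out, u, v):
            out.add((u, v))
        else:
            # fall back to the other direction (assumed acyclic here)
            out.add((v, u))
    return out
```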
<p>We estimate the parameters of this CBN using standard MLE (Pearl, <xref ref-type="bibr" rid="B42">1988a</xref>). All our data sets are fully observed, and hence MLE suffices for learning the conditional distributions. For the parameters, we learn a decision tree locally and in parallel, using only the variables in the parent set of each node to capture its conditional distribution. Extending this to handle missing data is a significant undertaking, as missing data affect not merely the parameter learning but the structure search as well. Once the parameters are learned, we have the full causal BN learned from data.</p></sec>
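As a sketch of the parameter-learning step, the following computes a maximum-likelihood conditional table for a node given its learned parent set; a flat table is shown here for brevity, whereas a decision tree per conditional (as in the approach above) plays the same role more compactly:

```python
from collections import Counter, defaultdict

def mle_cpt(rows, var, parents):
    # maximum-likelihood estimate of P(var | parents) from fully
    # observed data: normalized counts per parent configuration
    parents = sorted(parents)
    counts = defaultdict(Counter)
    for r in rows:
        counts[tuple(r[p] for p in parents)][r[var]] += 1
    return {cfg: {val: c / sum(cnt.values()) for val, c in cnt.items()}
            for cfg, cnt in counts.items()}
```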
<sec>
<title>3.4. DN2CN Algorithm</title>
<p>Before we provide the algorithm, we present an example in <xref ref-type="fig" rid="F2">Figure 2</xref>. There are six variables &#x02329;<italic>A</italic>, &#x02026;, <italic>F</italic>&#x0232A;. First, a DN is learned, which contains cycles and bi-directed edges. Next, the smallest cycle &#x02329;<italic>A, B, C</italic>&#x0232A; is detected and the edge with the least MI, <italic>A</italic> &#x02192; <italic>C</italic>, is removed. Now there are no directed cycles in the CBN (in the general case, there could be more cycles that need to be removed). Note that there are two undirected edges: between <italic>B</italic> and <italic>D</italic>, and between <italic>E</italic> and <italic>F</italic>. First, the edge between <italic>D</italic> and <italic>B</italic> is oriented based on MI and the fact that this does not create a cycle. Finally, the edge between <italic>E</italic> and <italic>F</italic> is oriented to obtain the CBN. The parameters are then learned by fitting a decision tree for each conditional.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>First the DN is learned (notice the two bi-directed edges). All the bi-directed edges in the DN are converted to undirected edges (BD and EF). The shortest cycle <italic>A</italic> &#x02192; <italic>C</italic> &#x02192; <italic>B</italic> &#x02192; <italic>A</italic> is identified and the edge <italic>A</italic> &#x02192; <italic>C</italic> is removed based on MI. Since no more cycles exist, the undirected edges are considered next: <italic>E</italic> &#x02212; &#x02212; <italic>F</italic> becomes <italic>F</italic> &#x02192; <italic>E</italic> and then <italic>B</italic> &#x02212; &#x02212; <italic>D</italic> becomes <italic>D</italic> &#x02192; <italic>B</italic>. The resulting network is acyclic and exploits both CSI and MI to become a causal network.</p></caption>
<graphic xlink:href="fdata-03-535976-g0002.tif"/>
</fig>
<p>This approach is formally presented in Algorithm 1 and as a flow chart in <xref ref-type="fig" rid="F1">Figure 1</xref>. As can be seen in the algorithm, the first step is to learn the DN (line 4). The L<sc>earn</sc>P<sc>arent</sc>S<sc>et</sc> function in line 4 of Algorithm 2 learns a tree and collects the set of parents from that tree. It can optionally take an ordering among the variables provided by a domain expert (if any). The algorithm then computes the mutual information (MI) for all the edges. One could instead wait until the cycles are detected and compute the MI only for the edges involved, but for simplicity we compute it outside the cycle detection step. The algorithm then iteratively removes the least informative edges until no more cycles are present in the graph. We orient the undirected edges (if any) while ensuring acyclicity. Finally, the parameters are learned from the data.</p>
<table-wrap position="float">
<label>Algorithm 1</label>
<caption><p>DN2CN: dependency network to causal network.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr><td align="left" valign="top">1: &#x000A0;<bold>Given</bold>: Data <bold>D</bold>; Variables <bold>V</bold>; Ordering among variables (if any) <italic>O</italic>: &#x0003D; &#x02205;; Threshold &#x003B4;: &#x0003D; 0</td></tr>
<tr><td align="left" valign="top">2: &#x000A0;<bold>function</bold> DN2CN(<bold>D</bold>,<bold>V</bold>, <bold>O)</bold></td></tr>
<tr><td align="left" valign="top">3: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>E</bold>&#x02190;&#x02205; &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; Initialize edge set</td></tr>
<tr><td align="left" valign="top">4: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <sans-serif>DN</sans-serif> &#x02261; (<bold>V</bold>, <bold>E</bold>) &#x0003D; <bold>L<sc>earn</sc>DN</bold>(<bold>D</bold>, <bold>V</bold>, <italic>O</italic>)</td></tr>
<tr><td align="left" valign="top">5: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for all</bold> <sans-serif>edge</sans-serif> &#x02208; <bold>E do</bold></td></tr>
<tr><td align="left" valign="top">6: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <sans-serif>MI</sans-serif>[<sans-serif>edge</sans-serif>] &#x02190; C<sc>ompute</sc>M<sc>utual</sc>I<sc>nfo</sc>(<sans-serif>edge</sans-serif>)</td></tr>
<tr><td align="left" valign="top">7: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end for</bold></td></tr>
<tr><td align="left" valign="top">8: &#x000A0; <sans-serif>SortedMI</sans-serif>[<sans-serif>edge</sans-serif>] &#x02190; S<sc>orted</sc>(<sans-serif>edge</sans-serif>, <italic>reverse</italic> &#x0003D; <italic>True</italic>) &#x022B3; <bold>Sort in descending order</bold></td></tr>
<tr><td align="left" valign="top">9: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>if</bold> &#x003B4; &#x0003D; 0 <bold>then</bold></td></tr>
<tr><td align="left" valign="top">10: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; &#x003B4; = <sc>argmax</sc>_A<sc>bs</sc>D<sc>iff</sc>(<sans-serif>SortedMI</sans-serif>[<sans-serif>edge</sans-serif>]) &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>Max absolute diff of 2 contiguous elements in array SortedMI</bold></td></tr>
<tr><td align="left" valign="top">11: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end if</bold></td></tr>
<tr><td align="left" valign="top">12: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>C</bold> &#x02190; D<sc>etect</sc>C<sc>ycles</sc>(<sans-serif>DN</sans-serif>) &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>Using any sort</bold></td></tr>
<tr><td align="left" valign="top">13: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for all</bold> <sans-serif>cycle</sans-serif> &#x02208; <bold>C do</bold></td></tr>
<tr><td align="left" valign="top">14: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for all</bold> <italic>e</italic> &#x02208; <bold>cycle do</bold></td></tr>
<tr><td align="left" valign="top">15: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>if</bold> <sans-serif>SortedMI</sans-serif>[<italic>e</italic>] &#x02264; &#x003B4; <bold>then</bold></td></tr>
<tr><td align="left" valign="top">16: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>E</bold>&#x02190;<bold>E</bold>\<italic>e</italic> &#x022B3; <bold>Remove edges if exist in <sans-serif>DN</sans-serif></bold></td></tr>
<tr><td align="left" valign="top">17: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end if</bold></td></tr>
<tr><td align="left" valign="top">18: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end for</bold></td></tr>
<tr><td align="left" valign="top">19: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>C</bold>&#x02190;<bold>C</bold>\<sans-serif>cycle</sans-serif></td></tr>
<tr><td align="left" valign="top">20: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>Update cycles list after each iteration</bold></td></tr>
<tr><td align="left" valign="top">21: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>if C</bold> &#x0003D; &#x02205; <bold>then</bold> &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>No more cycles left</bold></td></tr>
<tr><td align="left" valign="top">22: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>break</bold></td></tr>
<tr><td align="left" valign="top">23: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end if</bold></td></tr>
<tr><td align="left" valign="top">24: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end for</bold></td></tr>
<tr><td align="left" valign="top">25: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <inline-formula><mml:math id="M3"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mtext>O</mml:mtext><mml:mstyle mathsize="small"><mml:mtext>RIENT</mml:mtext></mml:mstyle><mml:mtext>E</mml:mtext><mml:mstyle mathsize="small"><mml:mtext>DGES</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> &#x022B3; <bold>Introduce directions ensuring acyclicity as required</bold></td></tr>
<tr><td align="left" valign="top">26: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>return</bold> (<inline-formula><mml:math id="M4"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula><bold>)</bold></td></tr>
<tr><td align="left" valign="top">27: &#x000A0;<bold>end function</bold></td></tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float">
<label>Algorithm 2</label>
<caption><p>L<sc>earn</sc>DN: learn dependency network.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr><td align="left" valign="top">1: &#x000A0;<bold>function</bold> L<sc>earn</sc>DN(<bold>D</bold>, <bold>V</bold>, <bold>O)</bold></td></tr>
<tr><td align="left" valign="top">2: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>E</bold>&#x02190;&#x02205; &#x022B3; Initialize edge set</td></tr>
<tr><td align="left" valign="top">3: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for all</bold> <sans-serif>var</sans-serif> &#x02208; <bold>V do</bold></td></tr>
<tr><td align="left" valign="top">4: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>P</bold>(<sans-serif>var</sans-serif>) &#x02190; L<sc>earn</sc>P<sc>arent</sc>S<sc>et</sc>(<sans-serif>var</sans-serif>, {<bold>V</bold>\<sans-serif>var</sans-serif>}<sub><italic>O</italic></sub>, <bold>D</bold>) &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>Parent set</bold> {<bold>V</bold>\<sans-serif>var</sans-serif>} <bold>is constrained by O (if any)</bold></td></tr>
<tr><td align="left" valign="top">5: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for all</bold> <sans-serif>parent</sans-serif> &#x02208; <bold>P</bold>(<sans-serif>var</sans-serif>) <bold>do</bold></td></tr>
<tr><td align="left" valign="top">6: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>E</bold> &#x02190; <bold>E</bold> &#x0222A; {<sans-serif>parent</sans-serif> &#x02192; <sans-serif>var</sans-serif>}</td></tr>
<tr><td align="left" valign="top">7: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x022B3; <bold>Add new directed edge between parent and var</bold></td></tr>
<tr><td align="left" valign="top">8: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end for</bold></td></tr>
<tr><td align="left" valign="top">9: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>end for</bold></td></tr>
<tr><td align="left" valign="top">10: &#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>return</bold> (<bold>V</bold>, <bold>E</bold>)</td></tr>
<tr><td align="left" valign="top">11: &#x000A0;<bold>end function</bold></td></tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>3.4.1. Theoretical Analysis</title>
<p>A natural question to ask is&#x02014;<italic>what is the complexity of our approach?</italic> We present an initial analysis by adapting arguments from the literature [see, for instance, the original reducibility result (Karp, <xref ref-type="bibr" rid="B31">1972</xref>)]. We present our result by analyzing each component of the algorithm; tightening these bounds with appropriate heuristics is left for future work.</p>
<p>Let <italic>v</italic> be the number of vertices (features) and <italic>n</italic> the number of training examples. In Algorithm 1, while learning the <italic>DN</italic>, we learn a decision tree locally [line 4]. This requires <italic>O</italic>(<italic>n</italic><sup>2</sup><italic>d</italic>), where <italic>d</italic> is the depth of the tree (Su and Zhang, <xref ref-type="bibr" rid="B57">2006</xref>). While this can be reduced to <italic>O</italic>(<italic>n</italic> &#x000B7; <italic>d</italic>), doing so requires making independence assumptions among the variables. Our tree-growing procedure is fairly standard without much optimization. Hence the complexity of learning a full DN is <italic>O</italic>(<italic>v</italic> &#x000B7; <italic>n</italic><sup>2</sup><italic>d</italic>). However, the trees can be learned in parallel, reducing the complexity to <italic>O</italic>(<italic>n</italic><sup>2</sup><italic>d</italic>).</p>
<p>Cycle detection (line 12) has a complexity of <italic>O</italic>(<italic>v</italic>(<italic>v</italic> &#x0002B; <italic>e</italic>)), where <italic>v</italic> is the number of nodes and <italic>e</italic> is the number of edges in the network (<italic>e</italic> is asymptotically <italic>O</italic>(<italic>v</italic><sup>2</sup>)). A single cycle detection, running a DFS to search for the cycle, is thus <italic>O</italic>(<italic>v</italic><sup>2</sup>); doing this for all the variables results in <italic>O</italic>(<italic>v</italic><sup>3</sup>) for the entire cycle detection. Sorting the edges by MI requires <italic>O</italic>(<italic>v</italic><sup>2</sup><italic>log</italic>(<italic>v</italic>)). Edge orientation is <italic>O</italic>(<italic>v</italic><sup>2</sup>).</p>
<p>Thus the complexity of DN2CN is dominated by two terms&#x02014;<italic>O</italic>(<italic>v</italic><sup>3</sup>), the cube of the number of vertices, and <italic>O</italic>(<italic>n</italic><sup>2</sup><italic>d</italic>), the term that depends on the data. Since, typically, <italic>n</italic> &#x0003E; <italic>v</italic><sup>2</sup> is needed to learn a meaningful model, our final complexity is <italic>O</italic>(<italic>n</italic><sup>2</sup><italic>d</italic>). Optimizing the tree learner to lower this complexity and devising better cycle detection methods to reduce the cubic term can significantly improve the asymptotic bound. These are open research directions.</p></sec>
<sec>
<title>3.4.2. Discussion</title>
<p>The proposed approach has some salient advantages&#x02014;(1) One could parallelize the learning of the DN to scale it up to very large data sets. (2) The computation of the MI can also be parallelized. (3) Any traversal algorithm could be used to detect cycles in the graph for pruning. (4) There are two levels of independence used in this algorithm: (a) context-specific independence (CSI) to identify potentially independent influences. Inspired by the work of Tikka et al. (<xref ref-type="bibr" rid="B58">2019</xref>), we rely on the ability of CSI to model interventions; in the context of an intervention, any influences that would otherwise have a causal effect on the variable are removed. Learning a BN as a series of trees for every interacting variable facilitates modeling such CSI, and the network is thus able to represent interventions in sufficient detail to reason about conditional independence properties; (b) mutual information, which, when combined with expert domain knowledge, can potentially yield causal influences. (5) The algorithm also has two types of controls (similar to regularizations) to combat overfitting: the first is controlling the depth of the trees, and the second is selecting the number of edges to remove. (6) Finally, the use of both local search and constraint-based methods inside the algorithm enables it to learn effectively at scale.</p>
<p>Before presenting our empirical results, we briefly discuss the interpretability of the resulting network. DN2CN represents causal dependencies using BNs, which provide an intuitive visualization by modeling features as nodes and the statistical associations between the features as edges. This statistical interpretability is similar in spirit to traditional interpretability and allows us to answer questions such as &#x0201C;does BMI influence susceptibility to Covid?&#x0201D; Moreover, it has been argued that developing an effective CBN for practical applications requires expert knowledge when data collection is cumbersome (Fenton and Neil, <xref ref-type="bibr" rid="B14">2012</xref>). This applies to domains such as medicine, similar to our experimental evaluation; a typical characteristic of these domains is that they can be data-poor yet knowledge-rich due to several decades of research. Kahneman et al. showed that human beings tend to interpret events in terms of cause-effect relations (Kahneman et al., <xref ref-type="bibr" rid="B30">1982</xref>; Pennington and Hastie, <xref ref-type="bibr" rid="B45">1988</xref>). Also, causal models are easier for humans to construct, modify, and interpret (Henrion, <xref ref-type="bibr" rid="B26">1987</xref>; Pennington and Hastie, <xref ref-type="bibr" rid="B45">1988</xref>). Following these observations, our framework can incorporate both data-driven and human inputs, allowing it to learn a more robust hypothesis. Lipton explains that with interpretable models it becomes imperative to guarantee fairness (Lipton, <xref ref-type="bibr" rid="B34">2018</xref>). It must be noted that we can extend DN2CN&#x00027;s interactive framework and leverage the learnt Bayesian networks to assess bias as well as to compare multiple models in terms of their fairness and performance (Chiappa and Isaac, <xref ref-type="bibr" rid="B4">2018</xref>).
In summary, our framework can leverage interpretability as a tool to verify causal assumptions and relationships. We verify the above claims empirically on a real data set and two synthetic benchmark causal data sets in the next section.</p>
</sec></sec></sec>
<sec id="s4">
<title>4. Empirical Evaluation&#x02014;Domains</title>
<p>To assess the effectiveness of our method, we perform extensive evaluations on both synthetic and real benchmark causal data sets. For all our data sets, we have the underlying true causal graph, and we apply our method as well as the baseline approaches to reconstruct the causal network from the data. We first describe the data sets used before discussing the baselines.</p>
<sec>
<title>4.1. Benchmark1: LUCAS&#x02014;(LUng CAncer Simple Data Set)</title>
<p>The LUCAS (LUng CAncer Simple set) data set from the causality challenge (Guyon et al., <xref ref-type="bibr" rid="B22">2008</xref>) represents a synthetic medical diagnosis problem, where the task is to identify patients with lung cancer given a set of socioeconomic and clinical factors of putative causal relevance. The generative model is a Markov process, so the values of the child nodes are stochastically dependent on the values of the parent nodes. The data set consists of 2,000 observations. The ground truth consists of 12 binary variables, namely <italic>anxiety, peer pressure, day of birth, smoking, genetics, yellow finger, lung cancer, attention disorder, cough, fatigue, allergy, car accidents</italic>, and their causal relations. There are no missing values in the data set. As the data are generated artificially by a causal BN, the true nature of the underlying causal relationships is known; hence we use this benchmark data set to illustrate the effectiveness of our approach.</p></sec>
<sec>
<title>4.2. Benchmark2: Asia Data Set</title>
<p>The ASIA Network is an expert-designed causal network with logical links. This BN was originally presented by Lauritzen and Spiegelhalter (Lauritzen and Spiegelhalter, <xref ref-type="bibr" rid="B32">1988</xref>), who have specified reasonable transition properties for each variable given its parents. It is an eight node BN that describes the effect of visiting Asia and smoking behavior of an individual on the probability of contracting tuberculosis, cancer or bronchitis. The underlying structure expresses the known qualitative medical knowledge. Each node in the network represents a feature that relates to the patient&#x00027;s condition. The example is motivated as follows: &#x0201C;<italic>Shortness-of-breath (called dyspnea) may be due to tuberculosis, lung cancer or bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of tuberculosis, while smoking is known to be a risk factor for both lung cancer and bronchitis. The results of a single chest X-ray do not discriminate between lung cancer and tuberculosis, as neither does the presence or absence of dyspnea.&#x0201D;</italic> The data set contains 10,000 observations and eight binary variables whose values are 0 or 1. There are no missing values in the data set.</p></sec>
<sec>
<title>4.3. Causal Protein-Signaling Networks in Human T Cells Data Set</title>
<p>This data set, analyzed and published by Sachs et al. (<xref ref-type="bibr" rid="B48">2005</xref>), is a multivariate proteomics data set widely used in research on causal discovery methods. It is a biological data set with different proteins and phospholipids in human immune system cells. The data comprise simultaneous measurements of 11 phosphorylated proteins and phospholipids (PKC, PKA, P38, Jnk, Raf, Mek, Erk, Akt, Plcg, PIP2, PIP3) derived from thousands of individual primary immune system cells. In the data set we considered, there are (1) 1,800 observational data points subject only to general stimulatory cues, so that the protein signaling paths are active; (2) 600 interventional data points with specific stimulatory and inhibitory cues for each of the following four proteins: pmek, PIP2, Akt, PKA; and (3) 1,200 interventional data points with specific cues for PKA. Overall, the data set consists of 5,400 instances with no missing values. Each of the 11 variables is discretized into three bins (low, medium, and high). A network consisting of 18 well-established causal interactions between these molecules has been constructed, supported by biological experiments and the literature (Sachs et al., <xref ref-type="bibr" rid="B48">2005</xref>). This data set is a good fit for testing our proposed causal discovery method, as knowledge of the &#x0201C;ground truth&#x0201D; is available, which helps in verifying the results; the goal is to unearth the protein signaling network, originally modeled as a CBN.</p></sec></sec>
<sec id="s5">
<title>5. Experimental Results</title>
<p>In our experiments, we aim to answer the following questions explicitly:
<list list-type="simple">
<list-item><p><bold>Q1</bold>: Does the learned model identify influencing variables as in the &#x0201C;Ground truth&#x0201D; network?</p></list-item>
<list-item><p><bold>Q2</bold>: How does the resulting network produced by DN2CN compare to standard constraint based approaches qualitatively?</p></list-item>
<list-item><p><bold>Q3</bold>: How does the resulting network produced by DN2CN compare to standard constraint based approaches quantitatively?</p></list-item>
</list></p>
<p>Specifically, we consider two different types of experiments&#x02014;the first on evaluating <bold>goodness</bold> of the model on the synthetic benchmark data sets and the second on <bold>verifying</bold> if the approach can learn a good causal model on the real data set.</p>
<sec>
<title>5.1. Setup</title>
<p>In DN2CN, we used a tree depth of 2 for all the experiments. We set &#x003B4; as 0.015 for both LUCAS and Asia data sets and 0.25 for the real T cells data set.</p>
<p>We compare DN2CN to three well-known computational methods for causal discovery (Glymour et al., <xref ref-type="bibr" rid="B18">2019</xref>). Two of these are commonly employed constraint-based algorithms&#x02014;PC and Fast Causal Inference (FCI) (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>). The third is a score-based algorithm&#x02014;Fast Greedy Equivalence Search (FGES) (Ramsey et al., <xref ref-type="bibr" rid="B47">2017</xref>). It must be mentioned that PC, FCI, and FGES are widely applicable, as they handle various types of data distributions as well as causal relations, given reliable conditional independence testing methods. We strongly believe that these attributes make them strong as well as fair baselines for DN2CN, as suggested by Glymour et al. (<xref ref-type="bibr" rid="B18">2019</xref>).</p>
<p>We further discuss each of the baseline approaches and their corresponding experimental settings used, as follows:
<list list-type="bullet">
<list-item><p><italic>PC algorithm</italic> (denoted <bold>PC</bold>) (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>) starts with a fully connected undirected graph, tests all possible conditioning sets for every order of conditioning, and finally orients the edges. To keep the comparison fair, the test statistic we used for the PC algorithm is mutual information, with a type I error rate of &#x003B1; &#x0003D; 0.05.</p></list-item>
<list-item><p><italic>Fast Greedy Equivalence Search algorithm</italic> (denoted <bold>FGES</bold>) (Ramsey et al., <xref ref-type="bibr" rid="B47">2017</xref>) is an optimized and parallelized version of the Greedy Equivalence Search (GES) algorithm developed by Meek (Meek, <xref ref-type="bibr" rid="B36">1995</xref>). GES is a CBN learning algorithm that starts with an empty graph, heuristically performs a forward stepping search over the space of CBNs, and stops at the highest-scoring one. GES then performs a backward stepping search that iteratively removes edges until no single edge removal can increase the Bayesian score. We use the modified BIC (Bayesian information criterion) (Schwarz, <xref ref-type="bibr" rid="B49">1978</xref>) score rewritten as <inline-formula><mml:math id="M5"><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mi>I</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>B</mml:mi><mml:mo>:</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mo>;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>k</mml:mi><mml:mo class="qopname">log</mml:mo><mml:mo>|</mml:mo><mml:mi>D</mml:mi><mml:mo>|</mml:mo></mml:math></inline-formula>, where <italic>L</italic> is the likelihood, <italic>k</italic> the number of parameters, and |<italic>D</italic>| the sample size; higher BIC scores thus correspond to greater dependence.</p></list-item>
<list-item><p><italic>Fast Causal Inference algorithm</italic> (denoted <bold>FCI</bold>) (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>) is a constraint-based algorithm that learns an equivalence class of CBNs entailing the set of conditional independencies that hold in the data. FCI then orients the edges using the stored conditioning sets that earlier led to the removal of adjacencies. We use the same modified BIC score as with the other score-based baseline, i.e., the FGES algorithm.</p></list-item>
</list></p>
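<p>To make the scoring criterion above concrete, the modified BIC used with FGES and FCI can be sketched as a small function (a simplified illustration under our own naming, not TETRAD's internal code):</p>

```python
import math

def bic_score(log_likelihood, num_params, sample_size):
    """Modified BIC from the text: Score_BIC(B : D) = 2*L(D; theta_hat, B) - k*log|D|.

    log_likelihood -- L, the data log-likelihood under MLE parameters theta_hat
    num_params     -- k, the number of free parameters of network B
    sample_size    -- |D|, the number of training examples
    """
    return 2.0 * log_likelihood - num_params * math.log(sample_size)
```

<p>Because the penalty term <italic>k</italic> log|<italic>D</italic>| grows with model size, two structures fitting the data equally well are ranked by parsimony, while for a fixed structure a higher likelihood yields a higher score.</p>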
<p>For the PC algorithm we used the open-source <italic>stable-PC</italic> implementation in bnlearn (Scutari, <xref ref-type="bibr" rid="B50">2009</xref>), while FGES and FCI were run in TETRAD (Spirtes et al., <xref ref-type="bibr" rid="B56">2000</xref>), a reliable tool for causal exploration. Data set details, including the number of variables and the number of training examples, are presented in section 3.</p></sec>
<sec>
<title>5.2. Results</title>
<p>Recall that our goal is faithful modeling of the underlying data. In addition, we report the training log-likelihood of (1) the ground truth model and of the models learned using (2) the DN2CN algorithm, (3) the PC algorithm, (4) the FGES algorithm, and (5) the FCI algorithm. That is, our analysis is <italic>qualitative</italic> as well as <italic>quantitative</italic>.</p>
<p>To answer <bold>Q1 and Q2</bold>, consider the networks presented in <xref ref-type="fig" rid="F3">Figures 3A&#x02013;D</xref>&#x02013;<xref ref-type="fig" rid="F5">5A&#x02013;D</xref>, respectively. These are the networks learned by our approach DN2CN and by the baseline methods PC, FGES &#x00026; FCI, summarized together with the ground truth network. To evaluate the validity of the proposed approach, we compared the model arcs with those present in the ground truth. An arc is correct if and only if the same arc exists in the ground truth graph and its orientation matches the ground truth orientation; an arc is incorrect if it does not exist in the ground truth graph or if it exists but with the opposite orientation. Hence, to assess the effectiveness of DN2CN on all the data sets, motivated by Sachs et al. (<xref ref-type="bibr" rid="B48">2005</xref>), Gao and Ji (<xref ref-type="bibr" rid="B16">2015</xref>), and Yu et al. (<xref ref-type="bibr" rid="B65">2019</xref>), we summarize the arcs learned by our method as well as by PC, FGES, and FCI for each data set using the following metrics:</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The learned network for <bold>(A)</bold> Our Approach DN2CN, <bold>(B)</bold> PC algorithm, <bold>(C)</bold> Fast Greedy Equivalence Search algorithm (FGES), and <bold>(D)</bold> Fast Causal Inference algorithm (FCI) and the summary results on LUCAS data set (best viewed in color). Each node represents a feature and the arcs represent causal relationships, i.e., X &#x02192; Y represents that X is a cause of Y. As can be seen, our DN2CN and FGES achieved a 100% true positive rate with zero false positives and false negatives. PC and FCI missed two edges each; they also introduced spurious edges (incorrect edge orientations).</p></caption>
<graphic xlink:href="fdata-03-535976-g0003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The learned network for <bold>(A)</bold> Our Approach DN2CN, <bold>(B)</bold> PC algorithm, <bold>(C)</bold> Fast Greedy Equivalence Search algorithm (FGES), and <bold>(D)</bold> Fast Causal Inference algorithm (FCI) and the summary results on ASIA data set (best viewed in color). Each node represents a feature and the arcs represent causal relationships, i.e., X &#x02192; Y represents that X is a cause of Y. As can be seen, our DN2CN and FGES achieved a 100% true positive rate with zero false positives and false negatives. PC and FCI both missed two edges. In addition, PC introduced two spurious causal edges in the resultant network.</p></caption>
<graphic xlink:href="fdata-03-535976-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The learned network for <bold>(A)</bold> Our Approach DN2CN, <bold>(B)</bold> PC algorithm, <bold>(C)</bold> Fast Greedy Equivalence Search algorithm (FGES), and <bold>(D)</bold> Fast Causal Inference algorithm (FCI) and the summary results on T-Cell data set (best viewed in color). Each node represents a feature and the arcs represent causal relationships, i.e., X &#x02192; Y represents that X is a cause of Y. This is a challenging data set where DN2CN had missed one edge and introduced two spurious edges. PC, on the other hand, had significantly worse performance with 10 missed edges and four spurious ones.</p></caption>
<graphic xlink:href="fdata-03-535976-g0005.tif"/>
</fig>
<list list-type="bullet">
<list-item><p><italic>True Edge Rate</italic> is the fraction of the true connections in the ground truth network that our approach (or PC or FGES or FCI) captures correctly, i.e., true positives.</p></list-item>
<list-item><p><italic>False Edge Count</italic> is the number of connections that are not in the ground truth network but were nevertheless captured by our approach (or PC or FGES or FCI), i.e., false positives.</p></list-item>
<list-item><p><italic>Missed Edge Rate</italic> is the fraction of the true edges in the ground truth network missed by our approach (or PC or FGES or FCI), i.e., false negatives.</p></list-item>
</list>
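<p>These three metrics reduce to simple set operations on directed arc sets; a minimal sketch (our own illustrative code, with the hypothetical helper name <italic>edge_metrics</italic>) follows:</p>

```python
def edge_metrics(learned, truth):
    """Compare the directed arcs of a learned graph against the ground truth.

    An arc counts as correct only if it appears in the ground truth with the
    same orientation; a reversed arc is both a false edge and a missed edge.
    Returns (true_edge_rate, false_edge_count, missed_edge_rate).
    """
    learned, truth = set(learned), set(truth)
    true_edge_rate = len(learned & truth) / len(truth)   # true positives
    false_edge_count = len(learned - truth)              # false positives
    missed_edge_rate = len(truth - learned) / len(truth) # false negatives
    return true_edge_rate, false_edge_count, missed_edge_rate
```

<p>For example, a learned graph that recovers one of two true arcs and reverses the other scores a true edge rate of 0.5, one false edge, and a missed edge rate of 0.5.</p>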
<p>As can be observed, our algorithm DN2CN and the baseline algorithm FGES had a 100% true positive rate with zero false positive and false negative rates on both the LUCAS and ASIA data sets. However, the other baseline methods, PC and FCI, both missed two edges on the LUCAS as well as the ASIA data sets. In addition, the PC algorithm introduced spurious causal edges in both LUCAS and ASIA. This clearly establishes that our framework is indeed capable of retrieving the full causal model while learning only from the data.</p>
<p>For the real benchmark data set, i.e., the <italic>Causal Protein-Signaling Network in human T cells</italic>, the ground truth network and the reconstructions obtained by DN2CN, PC, FGES, and FCI are illustrated in <xref ref-type="fig" rid="F5">Figures 5A&#x02013;D</xref>, respectively. It can be observed that our approach DN2CN performs <bold>significantly better</bold> than all the baselines, i.e., PC, FGES, and FCI. DN2CN missed four edges and introduced four spurious edges, whereas the baseline algorithms PC, FGES, and FCI had significantly worse performance, with 13, 11, and 14 missed edges and 6, 15, and 8 spurious ones, respectively. On closer inspection of the unexpected edges in our acyclic causal model reconstruction, one can see that they actually explain the data quite well. In particular, the arcs PKC &#x021D2; PKA and Erk &#x021D2; Akt can be understood qualitatively in rat ventricular myocytes (Wilhelm et al., <xref ref-type="bibr" rid="B63">1997</xref>) and colon cancer cell lines (Lemaire et al., <xref ref-type="bibr" rid="B33">1997</xref>), respectively. We hypothesize that DN2CN missed the four causal relationships because they are all involved in cycles. As BNs are acyclic by definition, our inference missed these arcs, which is one of the caveats of this approach. Extending this work to dynamic causal Bayesian networks to handle feedback loops remains an interesting future research direction.</p>
<p><xref ref-type="table" rid="T1">Table 1</xref> presents quantitative comparisons between the different methods. In all our experiments, we present numbers in bold whenever they are better than all the other baselines on a data set. It must be mentioned that in some cases PC, FGES, and FCI did not yield a directed arc, and we chose a direction (ensuring acyclicity) to compute the overall joint log-likelihood on the training set. As can be seen from the table, the proposed DN2CN approach produces a network with significantly better joint log-likelihood on the training set than the baseline algorithms PC and FCI in all the domains. FGES has a better joint log-likelihood than DN2CN on the T-Cell data set. One key reason is that the network produced by FGES is considerably denser than the other models: FGES introduces 15 spurious causal edges, leading to an increased likelihood. It is well-known in the Bayes net learning literature that the denser the graph, the higher the training-set likelihood. As can be seen from the summary table in <xref ref-type="fig" rid="F5">Figure 5</xref>, the false edge count of FGES is significantly higher than that of the other methods; hence, its denser network yields a much higher training-set log-likelihood. This answers <bold>Q3</bold> affirmatively: DN2CN is more effective in modeling than standard causal discovery methods such as PC, FGES, and FCI.</p>
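<p>The claim that a denser graph can only raise the training-set likelihood is easy to verify directly: under maximum-likelihood CPT estimates, adding a parent never decreases the fitted log-likelihood. A small self-contained check (our own illustrative code, with a hypothetical <italic>joint_log_likelihood</italic> helper, not the evaluation script used in the paper) follows:</p>

```python
import math
from collections import Counter

def joint_log_likelihood(structure, rows):
    """Training-set joint log-likelihood of a discrete Bayesian network,
    with each CPT estimated by maximum likelihood from the same data.
    `structure` maps each variable to its list of parents; `rows` is a
    list of dicts mapping variable names to observed values."""
    total = 0.0
    for child, parents in structure.items():
        joint, marg = Counter(), Counter()
        for r in rows:
            key = tuple(r[p] for p in parents)
            joint[(key, r[child])] += 1   # counts for child given parent config
            marg[key] += 1                # counts for parent config alone
        total += sum(c * math.log(c / marg[key])
                     for (key, _), c in joint.items())
    return total
```

<p>Scoring a structure with an extra, even spurious, parent yields a training log-likelihood at least as high as the sparser structure's, which is why a raw likelihood comparison favors the denser FGES network.</p>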
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparison of the log-likelihood estimates of the CBNs learned using DN2CN and the baseline approaches, i.e., the PC algorithm, Fast Greedy Equivalence Search (FGES), and Fast Causal Inference (FCI), learned directly from data.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center" colspan="5"><bold>Methods</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>Data sets</bold></th>
<th valign="top" align="center"><bold>Ground truth</bold></th>
<th valign="top" align="center"><bold>DN2CN</bold></th>
<th valign="top" align="center"><bold>PC</bold></th>
<th valign="top" align="center"><bold>FGES</bold></th>
<th valign="top" align="center"><bold>FCI</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Lucas</td>
<td valign="top" align="center"><bold>&#x02212;12130.83</bold></td>
<td valign="top" align="center"><bold>&#x02212;12130.83</bold></td>
<td valign="top" align="center">&#x02212;12178.59</td>
<td valign="top" align="center"><bold>&#x02212;12130.83</bold></td>
<td valign="top" align="center">&#x02212;12161.49</td>
</tr>
<tr>
<td valign="top" align="left">Asia</td>
<td valign="top" align="center"><bold>&#x02212;22212.85</bold></td>
<td valign="top" align="center"><bold>&#x02212;22212.85</bold></td>
<td valign="top" align="center"><bold>&#x02212;22212.85</bold></td>
<td valign="top" align="center"><bold>&#x02212;22212.85</bold></td>
<td valign="top" align="center">&#x02212;23747.1</td>
</tr>
<tr>
<td valign="top" align="left">Sachs</td>
<td valign="top" align="center">&#x02212;38723.1</td>
<td valign="top" align="center">&#x02212;38081.29</td>
<td valign="top" align="center">&#x02212;41930.74</td>
<td valign="top" align="center"><bold>&#x02212;35782.43</bold></td>
<td valign="top" align="center">&#x02212;40822.13</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Numbers are presented in bold text whenever they are better than all the other baselines on a data set</italic>.</p>
</table-wrap-foot>
</table-wrap></sec></sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusions</title>
<p>We introduced a scalable causal learning algorithm that is capable of exploiting two types of independencies&#x02014;context-specific independence (CSI) and conditional independence (CI). To exploit CSI, we learn a single tree for each variable in the model; each tree can locally model and capture the CSI. Next, we orient and remove edges from this potentially cyclic model by computing the mutual information, which allows for capturing the CIs. The intuition is that these two independence notions have previously been explored separately in the context of causal learning, and combining them allows for learning a robust causal model. Our empirical evaluations on the standard data sets clearly demonstrate that the proposed DN2CN method does retrieve the true causal model in most of the domains. Most importantly, it does not introduce a denser model than necessary, even at the cost of some training likelihood. Thus, a natural regularization is achieved by controlling the depth of the trees and the orienting of edges, in contrast to other information-theoretic methods, such as BIC, that employ an explicit model-complexity penalty.</p>
<p>There are several possible directions for future work&#x02014;adapting and applying these models to real problems along the lines of our previous work (Ramanan and Natarajan, <xref ref-type="bibr" rid="B46">2019</xref>) is an important one. Developing the theoretical underpinnings connecting CSI and CI with causal models is the next immediate direction. Converting the CSI from our models to do-calculus and employing it in the context of learning from both observational and experimental data is another important problem. Finally, allowing rich domain knowledge and inductive bias to guide the learner to a better causal model is possibly the most interesting direction.</p></sec>
<sec sec-type="data-availability-statement" id="s7">
<title>Data Availability Statement</title>
<p>The datasets analyzed for this study can be found in the following repositories: LUCAS&#x02014;LUng CAncer Simple data set: <ext-link ext-link-type="uri" xlink:href="http://www.causality.inf.ethz.ch/data/LUCAS.html">http://www.causality.inf.ethz.ch/data/LUCAS.html</ext-link>; Asia data set: <ext-link ext-link-type="uri" xlink:href="http://www.bnlearn.com/bnrepository/">http://www.bnlearn.com/bnrepository/</ext-link>; Causal Protein-Signaling Networks in human T cells data set: <ext-link ext-link-type="uri" xlink:href="http://www.bnlearn.com/bnrepository/">http://www.bnlearn.com/bnrepository/</ext-link>.</p></sec>
<sec id="s8">
<title>Author Contributions</title>
<p>NR and SN contributed equally to the ideation and contributed nearly equally to the manuscript preparation. NR led the empirical evaluation. All authors contributed to the article and approved the submitted version.</p></sec>
<sec id="s9">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
</body>
<back>
<ack><p>The authors acknowledge the support of the members of the STARLING lab for the discussions. We thank the reviewers for their insightful comments, which significantly improved the paper.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aliferis</surname> <given-names>C. F.</given-names></name> <name><surname>Tsamardinos</surname> <given-names>I.</given-names></name> <name><surname>Statnikov</surname> <given-names>A.</given-names></name></person-group> (<year>2003</year>). <article-title>Hiton: a novel markov blanket algorithm for optimal variable selection</article-title>, in <source>AMIA Annual Symposium Proceedings</source>, <volume>Vol. 2003</volume> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>American Medical Informatics Association</publisher-name>), <fpage>21</fpage>. <pub-id pub-id-type="pmid">14728126</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Andrews</surname> <given-names>B.</given-names></name> <name><surname>Ramsey</surname> <given-names>J.</given-names></name> <name><surname>Cooper</surname> <given-names>G. F.</given-names></name></person-group> (<year>2018</year>). <article-title>Scoring bayesian networks of mixed variables</article-title>. <source>Int. J. Data Sci. Analyt</source>. <volume>6</volume>, <fpage>3</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1007/s41060-017-0085-7</pub-id><pub-id pub-id-type="pmid">30140730</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Boutilier</surname> <given-names>C.</given-names></name> <name><surname>Friedman</surname> <given-names>N.</given-names></name> <name><surname>Goldszmidt</surname> <given-names>M.</given-names></name> <name><surname>Koller</surname> <given-names>D.</given-names></name></person-group> (<year>1996</year>). <article-title>Context-specific independence in bayesian networks</article-title>, in <source>UAI</source> (<publisher-loc>Portland</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>115</fpage>&#x02013;<lpage>123</lpage>.</citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chiappa</surname> <given-names>S.</given-names></name> <name><surname>Isaac</surname> <given-names>W. S.</given-names></name></person-group> (<year>2018</year>). <article-title>A causal bayesian networks viewpoint on fairness</article-title>, in <source>IFIP International Summer School on Privacy and Identity Management</source> (<publisher-loc>Vienna</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x02013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-16744-8_1</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chickering</surname> <given-names>D.</given-names></name></person-group> (<year>1996</year>). <article-title>Learning bayesian networks is NP-complete</article-title>, in <source>Learning From Data</source> (<publisher-loc>Springer</publisher-loc>), <fpage>121</fpage>&#x02013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4612-2404-4_12</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Chickering</surname> <given-names>D. M.</given-names></name></person-group> (<year>2002a</year>). <article-title>Learning equivalence classes of bayesian-network structures</article-title>. <source>J. Mach. Learn. Res</source>. <volume>2</volume>, <fpage>445</fpage>&#x02013;<lpage>498</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.jmlr.org/papers/volume2/chickering02a/chickering02a.pdf">https://www.jmlr.org/papers/volume2/chickering02a/chickering02a.pdf</ext-link></citation></ref>
<ref id="B7">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Chickering</surname> <given-names>D. M.</given-names></name></person-group> (<year>2002b</year>). <article-title>Optimal structure identification with greedy search</article-title>. <source>J. Mach. Learn. Res</source>. <volume>3</volume>, <fpage>507</fpage>&#x02013;<lpage>554</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.jmlr.org/papers/volume3/chickering02b/chickering02b.pdf">https://www.jmlr.org/papers/volume3/chickering02b/chickering02b.pdf</ext-link></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombo</surname> <given-names>D.</given-names></name> <name><surname>Maathuis</surname> <given-names>M. H.</given-names></name></person-group> (<year>2012</year>). <article-title>A modification of the PC algorithm yielding order-independent skeletons</article-title>. <source>arXiv</source> 1211.3295.</citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombo</surname> <given-names>D.</given-names></name> <name><surname>Maathuis</surname> <given-names>M. H.</given-names></name></person-group> (<year>2014</year>). <article-title>Order-independent constraint-based causal structure learning</article-title>. <source>J. Mach. Learn. Res</source>. <volume>15</volume>, <fpage>3741</fpage>&#x02013;<lpage>3782</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombo</surname> <given-names>D.</given-names></name> <name><surname>Maathuis</surname> <given-names>M. H.</given-names></name> <name><surname>Kalisch</surname> <given-names>M.</given-names></name> <name><surname>Richardson</surname> <given-names>T. S.</given-names></name></person-group> (<year>2012</year>). <article-title>Learning high-dimensional directed acyclic graphs with latent and selection variables</article-title>. <source>Ann. Stat</source>. <volume>40</volume>, <fpage>294</fpage>&#x02013;<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1214/11-AOS940</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cooper</surname> <given-names>G. F.</given-names></name> <name><surname>Yoo</surname> <given-names>C.</given-names></name></person-group> (<year>2013</year>). <article-title>Causal discovery from a mixture of experimental and observational data</article-title>. <source>arXiv</source> 1301.6686.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Coumans</surname> <given-names>V.</given-names></name> <name><surname>Claassen</surname> <given-names>T.</given-names></name> <name><surname>Terwijn</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <source>Causal Discovery Algorithms and Real World Systems</source>. Masters thesis.</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>De Raedt</surname> <given-names>L.</given-names></name> <name><surname>Kersting</surname> <given-names>K.</given-names></name> <name><surname>Natarajan</surname> <given-names>S.</given-names></name> <name><surname>Poole</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <source>Statistical Relational Artificial Intelligence: Logic, Probability, and Computation</source>, <volume>Vol. 10</volume>, <publisher-name>Morgan &#x00026; Claypool</publisher-name>. p. <fpage>1</fpage>&#x02013;<lpage>189</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fenton</surname> <given-names>N.</given-names></name> <name><surname>Neil</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <source>Risk Assessment and Decision Analysis With Bayesian Networks</source>. (<publisher-loc>Boca Raton, FL</publisher-loc>: <publisher-name>CRC Press</publisher-name>), p.<fpage>524</fpage>.</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Friedman</surname> <given-names>N.</given-names></name> <name><surname>Nachman</surname> <given-names>I.</given-names></name> <name><surname>Pe&#x000E9;r</surname> <given-names>D.</given-names></name></person-group> (<year>1999</year>). <article-title>Learning bayesian network structure from massive datasets: the sparse candidate algorithm</article-title>, in <source>UAI</source> (<publisher-loc>Stockholm</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>206</fpage>&#x02013;<lpage>215</lpage>.</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>T.</given-names></name> <name><surname>Ji</surname> <given-names>Q.</given-names></name></person-group> (<year>2015</year>). <article-title>Local causal discovery of direct causes and effects</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Cortes</surname> <given-names>C.</given-names></name> <name><surname>Lawrence</surname> <given-names>N. D.</given-names></name> <name><surname>Lee</surname> <given-names>D. D.</given-names></name> <name><surname>Sugiyama</surname> <given-names>M.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>NeurIPS</publisher-name>), <fpage>2512</fpage>&#x02013;<lpage>2520</lpage>.</citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gillispie</surname> <given-names>S. B.</given-names></name> <name><surname>Perlman</surname> <given-names>M. D.</given-names></name></person-group> (<year>2013</year>). <article-title>Enumerating markov equivalence classes of acyclic digraph models</article-title>. <source>arXiv</source> 1301.2272.</citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Glymour</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>K.</given-names></name> <name><surname>Spirtes</surname> <given-names>P.</given-names></name></person-group> (<year>2019</year>). <article-title>Review of causal discovery methods based on graphical models</article-title>. <source>Front. Genet</source>. <volume>10</volume>:<fpage>524</fpage>. <pub-id pub-id-type="doi">10.3389/fgene.2019.00524</pub-id><pub-id pub-id-type="pmid">31214249</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Deep Learning. MIT Press</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://www.deeplearningbook.org">http://www.deeplearningbook.org</ext-link></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Granger</surname> <given-names>C. W.</given-names></name></person-group> (<year>1969</year>). <article-title>Investigating causal relations by econometric models and cross-spectral methods</article-title>. <source>Econometrica</source> <volume>37</volume>, <fpage>424</fpage>&#x02013;<lpage>438</lpage>. <pub-id pub-id-type="doi">10.2307/1912791</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>Y.</given-names></name> <name><surname>Ruan</surname> <given-names>Q.</given-names></name> <name><surname>Zhu</surname> <given-names>S.</given-names></name> <name><surname>Wei</surname> <given-names>Q.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Temperature rise associated with adiabatic shear band: causality clarified</article-title>. <source>Phys. Rev. Lett</source>. <volume>122</volume>:<fpage>015503</fpage>. <pub-id pub-id-type="doi">10.1103/PhysRevLett.122.015503</pub-id><pub-id pub-id-type="pmid">31012723</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Aliferis</surname> <given-names>C.</given-names></name> <name><surname>Cooper</surname> <given-names>G.</given-names></name> <name><surname>Elisseeff</surname> <given-names>A.</given-names></name> <name><surname>Pellet</surname> <given-names>J.-P.</given-names></name> <name><surname>Spirtes</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>Design and analysis of the causation and prediction challenge</article-title>, in <source>Causation and Prediction Challenge</source>, eds <person-group person-group-type="editor"><name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Aliferis</surname> <given-names>C. F.</given-names></name> <name><surname>Cooper</surname> <given-names>G. F.</given-names></name> <name><surname>Elisseeff</surname> <given-names>A.</given-names></name> <name><surname>Pellet</surname> <given-names>J.</given-names></name> <name><surname>Spirtes</surname> <given-names>P.</given-names></name> <name><surname>Statnikov</surname> <given-names>A. R.</given-names></name></person-group> (<publisher-loc>Hong Kong</publisher-loc>: <publisher-name>JMLR.org</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>33</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hauser</surname> <given-names>A.</given-names></name> <name><surname>B&#x000FC;hlmann</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>Jointly interventional and observational data: estimation of interventional markov equivalence classes of directed acyclic graphs</article-title>. <source>J. R. Stat. Soc. B Stat. Methodol</source>. <volume>77</volume>, <fpage>291</fpage>&#x02013;<lpage>318</lpage>. <pub-id pub-id-type="doi">10.1111/rssb.12071</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Heckerman</surname> <given-names>D.</given-names></name> <name><surname>Chickering</surname> <given-names>D.</given-names></name> <name><surname>Meek</surname> <given-names>C.</given-names></name> <name><surname>Rounthwaite</surname> <given-names>R.</given-names></name> <name><surname>Kadie</surname> <given-names>C.</given-names></name></person-group> (<year>2000</year>). <article-title>Dependency networks for inference, collaborative filtering, and data visualization</article-title>. <source>JMLR</source> <volume>1</volume>, <fpage>49</fpage>&#x02013;<lpage>75</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.jmlr.org/papers/volume1/heckerman00a/heckerman00a.pdf">https://www.jmlr.org/papers/volume1/heckerman00a/heckerman00a.pdf</ext-link></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heckerman</surname> <given-names>D.</given-names></name> <name><surname>Geiger</surname> <given-names>D.</given-names></name> <name><surname>Chickering</surname> <given-names>D.</given-names></name></person-group> (<year>1995</year>). <article-title>Learning Bayesian networks: the combination of knowledge and statistical data</article-title>. <source>MLJ</source> <volume>20</volume>, <fpage>197</fpage>&#x02013;<lpage>243</lpage>. <pub-id pub-id-type="doi">10.1007/BF00994016</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Henrion</surname> <given-names>M.</given-names></name></person-group> (<year>1987</year>). <article-title>Practical issues in constructing a Bayes&#x00027; belief network</article-title>, in <source>Proceedings of the Third Conference on Uncertainty in Artificial Intelligence</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>132</fpage>&#x02013;<lpage>139</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hulten</surname> <given-names>G.</given-names></name> <name><surname>Chickering</surname> <given-names>D.</given-names></name> <name><surname>Heckerman</surname> <given-names>D.</given-names></name></person-group> (<year>2003</year>). <article-title>Learning Bayesian networks from dependency networks: a preliminary study</article-title>, in <source>AISTATS</source> (<publisher-loc>Key West, FL</publisher-loc>).</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Janzing</surname> <given-names>D.</given-names></name> <name><surname>Steudel</surname> <given-names>B.</given-names></name> <name><surname>Shajarisales</surname> <given-names>N.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2015</year>). <article-title>Justifying information-geometric causal inference</article-title>, in <source>Measures of Complexity</source> (<publisher-name>Springer</publisher-name>), <fpage>253</fpage>&#x02013;<lpage>265</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-21852-6_18</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kahn</surname> <given-names>A. B.</given-names></name></person-group> (<year>1962</year>). <article-title>Topological sorting of large networks</article-title>. <source>Commun. ACM</source> <volume>5</volume>, <fpage>558</fpage>&#x02013;<lpage>562</lpage>. <pub-id pub-id-type="doi">10.1145/368996.369025</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kahneman</surname> <given-names>D.</given-names></name> <name><surname>Slovic</surname> <given-names>P.</given-names></name> <name><surname>Tversky</surname> <given-names>A.</given-names></name></person-group> (<year>1982</year>). <source>Judgment Under Uncertainty: Heuristics and Biases</source>. <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="pmid">17835457</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Karp</surname> <given-names>R. M.</given-names></name></person-group> (<year>1972</year>). <article-title>Reducibility among combinatorial problems</article-title>, in <source>Complexity of Computer Computations</source> (<publisher-name>Springer</publisher-name>), <fpage>85</fpage>&#x02013;<lpage>103</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4684-2001-2_9</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lauritzen</surname> <given-names>S. L.</given-names></name> <name><surname>Spiegelhalter</surname> <given-names>D. J.</given-names></name></person-group> (<year>1988</year>). <article-title>Local computations with probabilities on graphical structures and their application to expert systems</article-title>. <source>J. R. Stat. Soc. B Methodol</source>. <volume>50</volume>, <fpage>157</fpage>&#x02013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1111/j.2517-6161.1988.tb01721.x</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lemaire</surname> <given-names>P.</given-names></name> <name><surname>Wilhelm</surname> <given-names>K.</given-names></name> <name><surname>Curdt</surname> <given-names>W.</given-names></name> <name><surname>Sch&#x000FC;le</surname> <given-names>U.</given-names></name> <name><surname>Marsch</surname> <given-names>E.</given-names></name> <name><surname>Poland</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>1997</year>). <article-title>First results of the SUMER telescope and spectrometer on SOHO</article-title>, in <source>The First Results From SOHO</source> (<publisher-name>Springer</publisher-name>), <fpage>105</fpage>&#x02013;<lpage>121</lpage>. <pub-id pub-id-type="doi">10.1007/978-94-011-5236-5_6</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lipton</surname> <given-names>Z. C.</given-names></name></person-group> (<year>2018</year>). <article-title>The mythos of model interpretability</article-title>. <source>Queue</source> <volume>16</volume>, <fpage>31</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1145/3236386.3241340</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Margaritis</surname> <given-names>D.</given-names></name> <name><surname>Thrun</surname> <given-names>S.</given-names></name></person-group> (<year>2000</year>). <article-title>Bayesian network induction via local neighborhoods</article-title>, in <source>NIPS</source> (<publisher-loc>Denver, CO</publisher-loc>), <fpage>505</fpage>&#x02013;<lpage>511</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meek</surname> <given-names>C.</given-names></name></person-group> (<year>1995</year>). <article-title>Causal inference and causal explanation with background knowledge</article-title>, in <source>Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>403</fpage>&#x02013;<lpage>410</lpage>.</citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meinshausen</surname> <given-names>N.</given-names></name> <name><surname>Hauser</surname> <given-names>A.</given-names></name> <name><surname>Mooij</surname> <given-names>J. M.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name> <name><surname>Versteeg</surname> <given-names>P.</given-names></name> <name><surname>B&#x000FC;hlmann</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>Methods for causal inference from gene perturbation experiments and validation</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>113</volume>, <fpage>7361</fpage>&#x02013;<lpage>7368</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1510493113</pub-id><pub-id pub-id-type="pmid">27382150</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Natarajan</surname> <given-names>S.</given-names></name> <name><surname>Khot</surname> <given-names>T.</given-names></name> <name><surname>Kersting</surname> <given-names>K.</given-names></name> <name><surname>Gutmann</surname> <given-names>B.</given-names></name> <name><surname>Shavlik</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>). <article-title>Gradient-based boosting for statistical relational learning: the relational dependency network case</article-title>. <source>Mach. Learn</source>. <volume>86</volume>, <fpage>25</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-011-5244-9</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Neapolitan</surname> <given-names>R. E.</given-names></name></person-group> (<year>2004</year>). <source>Learning Bayesian Networks</source>, <volume>Vol. 38</volume>. <publisher-loc>Upper Saddle River, NJ</publisher-loc>: <publisher-name>Pearson Prentice Hall</publisher-name>.</citation></ref>
<ref id="B40">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Neville</surname> <given-names>J.</given-names></name> <name><surname>Jensen</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>Relational dependency networks</article-title>. <source>J. Mach. Learn. Res</source>. <volume>8</volume>, <fpage>653</fpage>&#x02013;<lpage>692</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.jmlr.org/papers/volume8/neville07a/neville07a.pdf">https://www.jmlr.org/papers/volume8/neville07a/neville07a.pdf</ext-link></citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ogarrio</surname> <given-names>J. M.</given-names></name> <name><surname>Spirtes</surname> <given-names>P.</given-names></name> <name><surname>Ramsey</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>A hybrid causal search algorithm for latent variable models</article-title>, in <source>Conference on Probabilistic Graphical Models</source> (<publisher-loc>Lugano</publisher-loc>), <fpage>368</fpage>&#x02013;<lpage>379</lpage>. <pub-id pub-id-type="pmid">28239434</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>1988a</year>). <source>Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Series in Representation and Reasoning</source>. <publisher-name>Morgan Kaufmann</publisher-name>.</citation></ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>1988b</year>). <source>Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference</source>. <publisher-name>Morgan Kaufmann</publisher-name>.</citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>2000</year>). <source>Causality: Models, Reasoning, and Inference</source>. <publisher-name>Cambridge University Press</publisher-name>.<pub-id pub-id-type="pmid">21977966</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pennington</surname> <given-names>N.</given-names></name> <name><surname>Hastie</surname> <given-names>R.</given-names></name></person-group> (<year>1988</year>). <article-title>Explanation-based decision making: effects of memory structure on judgment</article-title>. <source>J. Exp. Psychol. Learn. Mem. Cogn</source>. <volume>14</volume>:<fpage>521</fpage>. <pub-id pub-id-type="doi">10.1037/0278-7393.14.3.521</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ramanan</surname> <given-names>N.</given-names></name> <name><surname>Natarajan</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <source>Work-in-Progress: Ensemble Causal Learning for Modeling Post-partum Depression</source>. <publisher-loc>Palo Alto, CA</publisher-loc>.</citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramsey</surname> <given-names>J.</given-names></name> <name><surname>Glymour</surname> <given-names>M.</given-names></name> <name><surname>Sanchez-Romero</surname> <given-names>R.</given-names></name> <name><surname>Glymour</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images</article-title>. <source>Int. J. Data Sci. Analyt</source>. <volume>3</volume>, <fpage>121</fpage>&#x02013;<lpage>129</lpage>. <pub-id pub-id-type="doi">10.1007/s41060-016-0032-z</pub-id><pub-id pub-id-type="pmid">28393106</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sachs</surname> <given-names>K.</given-names></name> <name><surname>Perez</surname> <given-names>O.</given-names></name> <name><surname>Pe&#x00027;er</surname> <given-names>D.</given-names></name> <name><surname>Lauffenburger</surname> <given-names>D. A.</given-names></name> <name><surname>Nolan</surname> <given-names>G. P.</given-names></name></person-group> (<year>2005</year>). <article-title>Causal protein-signaling networks derived from multiparameter single-cell data</article-title>. <source>Science</source> <volume>308</volume>, <fpage>523</fpage>&#x02013;<lpage>529</lpage>. <pub-id pub-id-type="doi">10.1126/science.1105809</pub-id><pub-id pub-id-type="pmid">15845847</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schwarz</surname> <given-names>G.</given-names></name></person-group> (<year>1978</year>). <article-title>Estimating the dimension of a model</article-title>. <source>Ann. Stat</source>. <volume>6</volume>, <fpage>461</fpage>&#x02013;<lpage>464</lpage>. <pub-id pub-id-type="doi">10.1214/aos/1176344136</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Scutari</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>Learning Bayesian networks with the bnlearn R package</article-title>. <source>arXiv</source> 0908.3817.</citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Silander</surname> <given-names>T.</given-names></name> <name><surname>Myllymaki</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>A simple approach for finding the globally optimal Bayesian network structure</article-title>. <source>arXiv</source> 1206.6875.</citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sims</surname> <given-names>C. A.</given-names></name></person-group> (<year>1972</year>). <article-title>Money, income, and causality</article-title>. <source>Am. Econ. Rev</source>. <volume>62</volume>, <fpage>540</fpage>&#x02013;<lpage>552</lpage>.</citation></ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Solo</surname> <given-names>V.</given-names></name></person-group> (<year>2008</year>). <article-title>On causality and mutual information</article-title>, in <source>2008 47th IEEE Conference on Decision and Control</source> (<publisher-loc>Cancun</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4939</fpage>&#x02013;<lpage>4944</lpage>. <pub-id pub-id-type="doi">10.1109/CDC.2008.4738640</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Spirtes</surname> <given-names>P.</given-names></name> <name><surname>Glymour</surname> <given-names>C.</given-names></name></person-group> (<year>1991</year>). <article-title>An algorithm for fast recovery of sparse causal graphs</article-title>. <source>Soc. Sci. Comput. Rev</source>. <volume>9</volume>, <fpage>62</fpage>&#x02013;<lpage>72</lpage>. <pub-id pub-id-type="doi">10.1177/089443939100900106</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Spirtes</surname> <given-names>P.</given-names></name> <name><surname>Glymour</surname> <given-names>C.</given-names></name> <name><surname>Scheines</surname> <given-names>R.</given-names></name></person-group> (<year>1993</year>). <source>Causation, Prediction, and Search: Lecture Notes in Statistics</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Spirtes</surname> <given-names>P.</given-names></name> <name><surname>Glymour</surname> <given-names>C.</given-names></name> <name><surname>Scheines</surname> <given-names>R.</given-names></name></person-group> (<year>2000</year>). <source>Causation, Prediction, and Search</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation></ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Su</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name></person-group> (<year>2006</year>). <article-title>A fast decision tree learning algorithm</article-title>, in <source>Proceedings of the 21st National Conference on Artificial Intelligence&#x02013;Volume 1, AAAI&#x00027;06</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>500</fpage>&#x02013;<lpage>505</lpage>.</citation></ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tikka</surname> <given-names>S.</given-names></name> <name><surname>Hyttinen</surname> <given-names>A.</given-names></name> <name><surname>Karvanen</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Identifying causal effects via context-specific independence relations</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Wallach</surname> <given-names>H. M.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Beygelzimer</surname> <given-names>A.</given-names></name> <name><surname>d&#x00027;Alch&#x000E9;-Buc</surname> <given-names>F.</given-names></name> <name><surname>Fox</surname> <given-names>E. B.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>NeurIPS</publisher-name>), <fpage>2800</fpage>&#x02013;<lpage>2810</lpage>.</citation></ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsagris</surname> <given-names>M.</given-names></name> <name><surname>Borboudakis</surname> <given-names>G.</given-names></name> <name><surname>Lagani</surname> <given-names>V.</given-names></name> <name><surname>Tsamardinos</surname> <given-names>I.</given-names></name></person-group> (<year>2018</year>). <article-title>Constraint-based causal discovery with mixed data</article-title>. <source>Int. J. Data Sci. Analyt</source>. <volume>6</volume>, <fpage>19</fpage>&#x02013;<lpage>30</lpage>. <pub-id pub-id-type="doi">10.1007/s41060-018-0097-y</pub-id><pub-id pub-id-type="pmid">30957008</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tsamardinos</surname> <given-names>I.</given-names></name> <name><surname>Aliferis</surname> <given-names>C. F.</given-names></name> <name><surname>Statnikov</surname> <given-names>A. R.</given-names></name> <name><surname>Statnikov</surname> <given-names>E.</given-names></name></person-group> (<year>2003</year>). <article-title>Algorithms for large scale Markov blanket discovery</article-title>, in <source>FLAIRS Conference</source> (<publisher-loc>St. Augustine, FL</publisher-loc>), <volume>Vol. 2</volume>, <fpage>376</fpage>&#x02013;<lpage>380</lpage>.</citation></ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsamardinos</surname> <given-names>I.</given-names></name> <name><surname>Brown</surname> <given-names>L.</given-names></name> <name><surname>Aliferis</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>The max-min hill-climbing Bayesian network structure learning algorithm</article-title>. <source>MLJ</source> <volume>65</volume>, <fpage>31</fpage>&#x02013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-006-6889-7</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Weichwald</surname> <given-names>S.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <name><surname>Ball</surname> <given-names>T.</given-names></name> <name><surname>Grosse-Wentrup</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Causal and anti-causal learning in pattern recognition for neuroimaging</article-title>, in <source>4th International Workshop on Pattern Recognition in Neuroimaging (PRNI)</source> (<publisher-loc>T&#x000FC;bingen</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/PRNI.2014.6858551</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wilhelm</surname> <given-names>K.</given-names></name> <name><surname>Lemaire</surname> <given-names>P.</given-names></name> <name><surname>Curdt</surname> <given-names>W.</given-names></name> <name><surname>Sch&#x000FC;hle</surname> <given-names>U.</given-names></name> <name><surname>Marsch</surname> <given-names>E.</given-names></name> <name><surname>Poland</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>1997</year>). <article-title>First results of the SUMER telescope and spectrometer on SOHO</article-title>, in <source>The First Results From SOHO</source> (<publisher-name>Springer</publisher-name>), <fpage>75</fpage>&#x02013;<lpage>104</lpage>. <pub-id pub-id-type="doi">10.1007/978-94-011-5236-5_5</pub-id><pub-id pub-id-type="pmid">18259499</pub-id></citation></ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yaramakala</surname> <given-names>S.</given-names></name> <name><surname>Margaritis</surname> <given-names>D.</given-names></name></person-group> (<year>2005</year>). <article-title>Speculative Markov blanket discovery for optimal feature selection</article-title>, in <source>Fifth IEEE International Conference on Data Mining (ICDM&#x00027;05)</source> (<publisher-loc>Houston, TX</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4</fpage>. <pub-id pub-id-type="doi">10.1109/ICDM.2005.134</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Gao</surname> <given-names>T.</given-names></name> <name><surname>Yu</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>DAG-GNN: DAG structure learning with graph neural networks</article-title>. <source>arXiv</source> 1904.10098.</citation></ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>J.</given-names></name></person-group> (<year>2008</year>). <article-title>On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias</article-title>. <source>Artif. Intell</source>. <volume>172</volume>, <fpage>1873</fpage>&#x02013;<lpage>1896</lpage>. <pub-id pub-id-type="doi">10.1016/j.artint.2008.08.001</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> The authors gratefully acknowledge the support of AFOSR award FA9550-18-1-0462. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the AFOSR or the US government.</p>
</fn>
</fn-group>
</back>
</article>