Learning Subject-Specific Directed Acyclic Graphs With Mixed Effects Structural Equation Models From Observational Data

The identification of causal relationships between random variables from large-scale observational data using directed acyclic graphs (DAG) is highly challenging. We propose a new mixed-effects structural equation model (mSEM) framework to estimate subject-specific DAGs, where we represent joint distribution of random variables in the DAG as a set of structural causal equations with mixed effects. The directed edges between nodes depend on observed exogenous covariates on each of the individual and unobserved latent variables. The strength of the connection is decomposed into a fixed-effect term representing the average causal effect given the covariates and a random effect term representing the latent causal effect due to unobserved pathways. The advantage of such decomposition is to capture essential asymmetric structural information and heterogeneity between DAGs in order to allow for the identification of causal structure with observational data. In addition, by pooling information across subject-specific DAGs, we can identify causal structure with a high probability and estimate subject-specific networks with a high precision. We propose a penalized likelihood-based approach to handle multi-dimensionality of the DAG model. We propose a fast, iterative computational algorithm, DAG-MM, to estimate parameters in mSEM and achieve desirable sparsity by hard-thresholding the edges. We theoretically prove the identifiability of mSEM. Using simulations and an application to protein signaling data, we show substantially improved performances when compared to existing methods and consistent results with a network estimated from interventional data. Lastly, we identify gray matter atrophy networks in regions of brain from patients with Huntington's disease and corroborate our findings using white matter connectivity data collected from an independent study.

The identification of causal relationships between random variables from large-scale observational data using directed acyclic graphs (DAG) is highly challenging. We propose a new mixed-effects structural equation model (mSEM) framework to estimate subject-specific DAGs, where we represent joint distribution of random variables in the DAG as a set of structural causal equations with mixed effects. The directed edges between nodes depend on observed exogenous covariates on each of the individual and unobserved latent variables. The strength of the connection is decomposed into a fixed-effect term representing the average causal effect given the covariates and a random effect term representing the latent causal effect due to unobserved pathways. The advantage of such decomposition is to capture essential asymmetric structural information and heterogeneity between DAGs in order to allow for the identification of causal structure with observational data. In addition, by pooling information across subject-specific DAGs, we can identify causal structure with a high probability and estimate subject-specific networks with a high precision. We propose a penalized likelihood-based approach to handle multi-dimensionality of the DAG model. We propose a fast, iterative computational algorithm, DAG-MM, to estimate parameters in mSEM and achieve desirable sparsity by hard-thresholding the edges. We theoretically prove the identifiability of mSEM. Using simulations and an application to protein signaling data, we show substantially improved performances when compared to existing methods and consistent results with a network estimated from interventional data. Lastly, we identify gray matter atrophy networks in regions of brain from patients with Huntington's disease and corroborate our findings using white matter connectivity data collected from an independent study.

INTRODUCTION
Directed acyclic graphs (DAGs) are used to represent the causal mechanisms of a complex system of interacting components, such as biological cellular pathways (Sachs et al., 2005), gene regulatory networks (Ud-Dean et al., 2016), and brain connectivity networks (Friston, 2011). The ability to identify causal relations between variables in observational data is highly challenging. Specifically, given a set of centered random variables M = (M 1 , · · · , M p ) ′ , referred to as nodes, the causal relationship between these nodes in a DAG can be represented by a structural equation model (SEM) (Pearl, 2009): M j = f j (pa(j), ε j ), j = 1, · · · , p, where pa(j) is the set of parental nodes of M j , and ε j is a random variable representing unexplained variation. In many applications, M is assumed to follow a multivariate Gaussian distribution satisfying a linear SEM, where B = (θ jk ) is referred to as an adjacency matrix. Estimation of DAG structure (i.e., parental sets pa(j)) is non-deterministic polynomial-time hard (NP-hard) because the number of possible DAGs grows super-exponentially with the number of nodes (Robinson, 1971). Mainly two types of methods are proposed to tackle this challenge, namely, independencebased (e.g., Spirtes et al., 2000) and score-based (e.g., Heckerman et al., 1995) methods. The independence-based approaches calculate the partial correlation between any pair of nodes and perform statistical tests to assess the conditional dependence. A popular method is the PC algorithm (Spirtes et al., 2000), which has been proven to be uniformly consistent for estimating ultra high-dimensional, sparse DAGs (Kalisch and Bühlmann, 2007). The PC algorithm was modified as PC-stable to remove its dependence on node ordering (Colombo and Maathuis, 2014). A limitation of the PC algorithm is that it does not provide the proper level of multiple comparison correction and thus may lead to a large number of false positives in practice. To remedy this limitation, a hybrid, two-stage approach was proposed (PenPC, Ha et al., 2016) that first estimates a sparse skeleton based on penalized regression and then performs a modified PC-stable algorithm on the skeleton.
The score-based approach searches for the DAG using a pre-specified score criterion, such as Bayesian Information Criterion (BIC) or penalized likelihood function. As it is not computationally feasible to search through the space of all DAGs, a two-phase greedy equivalence search algorithm explores an equivalence class based on BIC by adding and deleting edges. With additional information on node ordering, the estimation of DAG is equivalent to neighborhood selection for which several penalized likelihood approaches have been developed (Shojaie and Michailidis, 2010;Yuan et al., 2012). More recently, attempts have been made to estimate a DAG without knowing the node ordering (Aragam and Zhou, 2015;Han et al., 2016). Other recent developments include leveraging asymmetric information between nodes (Shimizu et al., 2006;Luo and Zhao, 2011) or exploring the invariance property of causal relation using combined observational and interventional data (Meinshausen et al., 2016). Simulation studies suggest that independence-based methods perform adequately for identifying the skeleton of a DAG from observational data (Smith et al., 2011). However, these methods may perform worse for identifying the causal direction than some search-and-score methods that exploit the asymmetric distributional information (Smith et al., 2011).
All of the existing DAG estimation methods assume homogeneity of the causal effect of the underlying DAG model in (1) (i.e., θ jk is common across individuals in the population). However, there is a growing body of evidence suggesting that biological networks may depend on subjectspecific characteristics such as genomic markers (Brown et al., 2011;Bohlken et al., 2016;Langfelder et al., 2016). For mental disorders, individual differences in edge strength in comorbidity networks have been widely observed (Fleeson et al., 2010). Modeling heterogeneity of network effects may improve interpretability, biological relevance, and predictability. This area is much less explored with the exception of a few methods proposed to study subject-specific undirected graphical models. For example, a conditional Gaussian graphical model with covariate-adjusted mean but homogeneous precision matrix has been considered (Yin and Li, 2011;Cai et al., 2013). To characterize heterogeneous dependence structure between groups, Guo et al. (2015) jointly estimated graphical models that share common structure but also allowed for differences between networks. Recently, instead of modeling groups separately, Cheng et al. (2014) directly incorporated covariates into an Ising model in order to build a covariate-dependent undirected graph. A common assumption of these approaches is that the dependence between two nodes is fully explained by the observed exogenous covariates. Such an assumption may not be satisfied in many biological and clinical applications due to the presence of unexplained latent residual heterogeneity representing hidden pathways between nodes. Shimizu and Bollen (2014) proposed a Bayesian approach to estimate DAG by including non-Gaussian latent variables in a linear SEM, but does not estimate individualspecific graphs.
Our goal in this article is to develop a novel method and an efficient estimation procedure to study covariate-dependent DAGs with latent effect modification in multi-dimensional settings. Our method is based on mixed-effects SEM (mSEM) and penalized likelihood to obtain DAG structure and causal effects simultaneously. The covariates are treated as exogenous variables, and their joint distribution is not of interest. The key difference between mSEM and current approaches is that the causal effect, θ jk in Model (1), is random and varies across individuals. To capture variation of the manifestation of causal relationship among individuals, our model allows the magnitude of the edge strength to be heterogeneous across subjects, while keeping the direction of causal relationship to be homogeneous. The heterogeneous causal magnitude is modeled by both fixed effects that depend on observed covariates and random effects that capture unexplained heterogeneity.
We propose a two-stage approach to estimate mSEM, whereby the first stage performs neighborhood selection by maximizing a penalized likelihood to identify a sparse skeleton, and the second stage searches for the DAG by solving an approximate ℓ 0 -penalty problem via hard-thresholding within the identified skeleton, followed by an easily implemented DAG-checking procedure. We show theoretical proof of the identifiability (the graph is unique) of our model. Through extensive simulations and application to a well-known protein signaling study (Sachs et al., 2005), we show substantially improved performance in terms of robustness and accuracy when compared to existing methods, including PC (Spirtes et al., 2000) and penPC (Ha et al., 2016), and consistent performance when compared to analysis using interventional data. Lastly, we apply the proposed method to discover the causal dependence relationship among regions of brain atrophy from patients with Huntington's disease (HD) (Paulsen et al., 2014) and corroborate our findings in an independent study (McColgan et al., 2017).

METHODOLOGY
For the ith subject, let M i = (M i1 , M i2 , · · · , M ip ) ′ denote p random variables or nodes in a DAG. Let X i = (1, X i1 , X i2 , · · · , X iq ) ′ denote a q + 1-dimensional vector including a constant and q exogenous covariates that may modify the causal network among components in M i . We consider a mixed-effects model in which the causal effect depends on both fixed effects of observed variables X i and unobserved random effects {γ ijk }. For the jth node, the mSEM is given by: where β jk is the vector of fixed effects (including an intercept and effects associated with X i ), and γ ijk is the unexplained heterogeneity of causal effects beyond X i . We assume that γ ijk are independent and follow N(0, σ 2 jk ) and the independent error terms ε ij follow N(0, σ 2 ε j ). The SEM in (2) assumes that for each edge in the DAG, the causal effect is decomposed into a subject-specific fixed-effect term that depends on the exogenous covariates (i.e., β T jk X i ) and a subject-specific random-effect term that captures residual heterogeneity in causal effects due to other latent factors beyond X i (i.e., γ ijk ). When β jk = 0 and σ 2 jk = 0, the causal dependence between j and k is explained by unobserved latent factors but not X i . No causal effect between node j and k corresponds to β jk = 0 and σ 2 jk = 0. In this work, we assume that the ordering of causal dependence or the parental sets are unknown, and propose methods to simultaneously learn the ordering and structure of DAG and the parameters in the SEM. Previous literature has pointed out that qualitative capacity claims about causal effects are invariant across different populations of subjects, whereas the quantitative claims in SEM often are population-specific (e.g., Woodward, 2005, Chapter 7). Thus, we assume that the qualitative causal dependence (set of nodes and directed edges) is homogeneous among subjects while the magnitude of the edge strength varies across subjects. Presence of an edge from M ik to M ij is defined as β jk = 0 or σ 2 jk = 0; otherwise, there is no causal effect from M ik to M ij . Note that when the components of β jk associated with covariates X il are zero and σ 2 jk are zero, the subject-specific DAG model in (2) reduces to a homogeneous DAG model in (1). We express the model for M i given γ ijk in matrix form as where B(X i ) is a matrix of fixed effects with entry (j, k) as β T jk X i and the diagonal elements as zeros, Ŵ i is a matrix of random effects with entry (j, k) as γ ijk and the diagonal elements as zeros, and ε i = (ε i1 , ε i2 , · · · , ε ip ) ′ is a vector of error terms. Note that the joint distribution of M in Model (3) is non-Gaussian due to random effects in Ŵ i , where the asymmetric information on the distribution between nodes can facilitate inference on the causal network from the observational data.
To estimate a DAG, we use a likelihood-based approach. Given the random effects Ŵ i , the conditional likelihood function of M i is given by where Cov[ε i ] = E is a diagonal matrix of σ 2 ε j . The derivation of (4) is given in the online Supplementary Material Section S1.
To simplify presentation, we introduce the notation for the vectorized Ŵ i and define non-zero components of vectorized Ŵ i as γ i = {γ ijk : σ 2 jk > 0}. Then, Ŵ i can be expressed as a linear combination of components in γ i as Ŵ i = σ 2 jk >0 γ ijk H jk , where H jk is a single-entry matrix with one entry (j, k). Denote by Cov[γ i ] = diag{σ 2 12 , . . . , σ 2 jk , . . . , σ 2 pp−1 } = G the covariance matrix of γ i . The observed likelihood function is given by where p(γ i ) ∝ |G| −1/2 exp −γ T i G −1 γ i /2 . Under the DAG assumption of no directed cycle, B(X i ) + Ŵ i can always be transformed into an upper diagonal matrix after some unknown permutation of the rows and columns. Therefore, the determinant |I − B(X i ) − Ŵ i | in the likelihood function (4) is one. The integral in the likelihood (5) can be explicitly calculated up to a constant and the negative log-likelihood function is given by where the detailed derivations are given in the online Supplementary Material Section S1. Based on the objective function (6), the parameter estimation for each node in the likelihood is separable, leading to significant computational advantage. Note that for node j, the negative log-likelihood function (6) is equivalent to the objective function of a weighted least squares. Therefore, to obtain initial values, one can use weighted least squares to update {β jk : j = 1, · · · , p; k = 1, · · · , p} and use the Newton-Raphson algorithm to update {σ 2 jk : j = 1, · · · , p; k = 1, · · · , p} and {σ 2 ε j : j = 1, · · · , p} until convergence. The identifiability of parameters in the model is shown in Theorem 1 in section 2.3.

Initial Sparse Graph
With a large number of nodes, minimizing (6) would result in a full graph with all non-null estimates of β jk and σ 2 jk . Without any constraint on the estimates, the graph may potentially involve many false positive edges. To accommodate the large number of nodes, we propose to use a penalized likelihood to choose an initial sparse graph skeleton and search for the optimum of (6) within this reduced graph space. Based on model (2), the marginal expectation and variance of By the method of moments, we obtain initial estimates of the graph by minimizing the following objective for each j with j = 1, · · · , p. In order to obtain an initial sparse graph, ℓ 1 -norm penalty can be included to minimize the objective function and obtain initial estimates β jk , σ 2 jk , and σ 2 ε j : where R ij is R ij with β jk replaced by β jk , the parameter estimated from minimizing the first objective function of β at the current iteration. Here we use the same tuning parameter across nodes j = 1, · · · , p for illustration, although in practice nodespecific tuning parameter can be used at the price of increasing computational burden. In cases where the topology of the graph varies greatly across nodes, different tuning parameters can be used. Given a regularization path with varying λ 1 and λ 2 , we select the optimal λ * 1 and λ * 2 using the BIC criteria and apply the corresponding estimates as the initial skeleton. We set the edge (j, k) of the initial graph as null if β jk = 0 and σ 2 jk = 0.

Algorithms for Estimating DAG With Mixed-Effects Model (DAG-MM) and Justification
The initial graph, although asymptotically consistent (Meinshausen and Bühlmann, 2006), may not satisfy the DAG constraint due to that estimated β jk = 0 and β kj = 0 or σ jk = 0 and σ kj = 0. Define graph A (set of nodes, edges, and edge strength) as the set of non-null edges (j, k) : β jk 2 2 /q + σ 2 jk > 0 in the skeleton resulting from (7). Let θ A = β jk , σ 2 jk :(j, k) ∈ A; σ 2 ε j : j = 1, · · · , p be the parameters for graph A and n A be the number of non-zero edges of A. To obtain a sparse DAG, a direct approach is to constrain the number of edges in the graph by optimizing a regularized likelihood: where C is a tuning parameter controlling the number of edges in A. The constraints in (8) guarantee the estimated graph is a DAG and also perform edge selection. However, the optimization in (8) is NP-hard, because one needs to evaluate all possible graphs that satisfy the constraint n A < C. Furthermore, the computational challenge is elevated due to the acyclic constraint. Instead, we perform hard-thresholding to approximately solve the ℓ 0 -norm constrained optimization problem in (8). Specifically, after the estimates in θ A are obtained for a given graph skeleton A, we perform hard-threshold on the estimated edge weights by removing the edge with the smallest β jk 2 2 /q + σ 2 jk from A and then update the graph A. Given an updated graph A, we then start from the estimates obtained in the previous iteration and update the estimate θ A . This procedure continues until some criterion of optimality is met. In our implementation, we use BIC as the criterion to select the optimal graph.
The above procedure can be summarized into a DAG-MM algorithm (described in Algorithm 1). The tasks include identifying graph structure (set of nodes and edges), direction of edges, and edge strength. DAG-MM consists of three main steps: estimation of sparse skeleton and edge strength, edge orientation, and iterative DAG building. In the first step, each node's Markov blanket is identified by penalized likelihood and edge strength is obtained. In the second step, edge orientation is performed by removing directionalities with weak dependence (computed from fixed-effects parameters and variances of random effects). In the third step, an iterative procedure performs edge pruning using the norm of the edge connection strength and searches for the DAG that satisfies the acyclic constraint using a general and fast routine described in a DAG-Checking algorithm (described in Algorithm 2 in Supplementary Material).
Algorithm 1 is computationally efficient for several reasons: the sparse skeleton reduces the search space of DAGs; ranking by the magnitude of edge effects provides search paths in the DAG space; selection criteria BIC is only calculated when the log-likelihood (6) is the correct model (i.e., the acyclic constraint is satisfied); and the optimal graph is selected from candidate Algorithm 1: DAG With Mixed Model (DAG-MM) 1. Sparse skeleton: Estimate an initial sparse graph A I by solving the objective function (7). Obtain the estimates θ A I by minimizing (6) for A I .

Edge orientation: Initialize
DAGs. We observe empirically that the full graph shrinks to a DAG very fast in only a few iterations of the third step. For implementation, we have developed main routines in C++ codes with an R interface (R program available upon request).

Rationale of DAG-MM Algorithm and Theoretical Result
Essentially DAG-MM uses the likelihood function as the objective function in the optimization and thus belongs to the class of score-based approaches for estimating DAG. Similar to other score-based methods in this class (Heckerman et al., 1995), the search is performed locally at each iteration. The first step provides a sparse skeleton and consistent initial estimators of DAG edge strength through moment estimation, with the magnitude of estimated effects close to the truth parameter values. In the second step, the direction that maximizes the network edge strength is selected. The rationale is that the overall edge strength under the correct direction is greater than the strength under the incorrect one (which is close to null effect). In the third step, the DAG with the lowest BIC objective function is selected. Under the identifiability result in Theorem 1 shown below, the optima is uniquely identified, and the DAG-MM algorithm may converge in a local neighborhood of true parameters.
The proof of the theorem is in the Supplementary Material Section S3. The heterogeneity assumption implies that when there are multiple child nodes, their squared edge strengths from fixed effects are different across parental nodes. When there is a single child node, the edge strengths are different across subpopulations defined by covariates X.
In each simulation, we compared DAG-MM with the commonly used PC algorithm (Kalisch et al., 2012) and a two-step penalized version of the PC algorithm, penPC (Ha et al., 2016). We used the default settings in R-packages "pcalg" and "penPC" for these alternative methods (e.g., with α = 0.1). The edge selection performance was assessed by the number of true positive (TP) edges and false positive (FP) edges, taking into consideration the direction (i.e., an edge with a wrong direction will be counted as false). To evaluate the estimation of causal effects, we calculated the root sum squared (RSS) error of β jk , σ 2 jk , and σ 2 ε j , which is defined as RSS respectively. The simulations were repeated 100 times for each setting. Table 1 summarizes the number of TP and FP edge selections. The initial graph selection (i.e., performing steps 1 and 2 in Algorithm 1) correctly identified the true edges for all settings with TP edges very close to 12, but also selected many FP edges. Starting from the initial graph, the DAG-MM procedure can retain almost all the TP edges and also remove most FP edges, with a FP rate close to 0. Note that there are 9,900 edges in total when p = 100, and DAG-MM can still select the 12 true edges from a total of 9,900 edges (0.05%). With a small sample size of n = 200, the performance of DAG-MM remains to be satisfactory, except in Setting 2. Setting 2 is more difficult because all edges involve latent effects. DAG-MM selects about 40% of TP edges when n = 200 and selects almost all true edges when the sample size increases to n = 1, 000, without including FP edges. PC and penPC algorithms are designed for Setting 5 -constant effect without any covariates. As expected, they perform the best for Setting 5 but not other settings, and penPC selects fewer FP edges than PC algorithm due to an initial penalized regression step. However, for Setting 5, DAG-MM significantly outperforms the two PC algorithms in terms of fewer FP. Figure 1 visualizes the number of times (greater than one) that an edge is selected in the simulations. The visualization shows that DAG-MM performs satisfactorily and correctly identifies the true network structure in all settings. In contrast, penPC identifies many edges with incorrect direction and includes many more FP edges.
Next, we examined the estimation performance of the strength of the connection. Table 2 shows the RSS for parameters β, σ 2 , and σ 2 ε . Overall, RSS decreases to small values as sample size n increases. The increase in the number of features p affects the estimation of variance components σ 2 and σ 2 ε more than β. The results may suggest that for large p, including more samples improves the estimation performance of the individuallevel heterogeneity associated with γ ijk . The sample size required to obtain stable estimates depends on the true underlying graph (e.g., number of nodes, number of true edges, edge strength and its variability across subjects). In our simulation examples, for a network with 20 nodes and 12 true edges, a small sample size of n = 500 gives adequate performance under all five settings. With a larger graph with 100 nodes and 12 true edges, n = 1, 000 gives adequate performance.    n = 500 11.9 11.8 11.7 11.6 11.3 11.1 0.5 0.4 0.4 1.0 1.0 0.9 n = 1, 000 12.0 12.0 11.9 11.6 11.6 11.5 0.6 0.5 0.5   The computing time for DAG-MM is highly manageable. For example, in simulation Setting 5, the running time (averaged over 100 replicates) for simulated data with n = 1, 000 is 0.4 s for p = 20, 1.2 s for p = 50, and 4.4 s for p = 100, compared to 3.2, 16.8, and 66.5 s, respectively, for the penPC algorithm.

Protein Signaling Network
Our first application involved a study that examined the interaction between major mitogen-activated protein kinase (MAPK) pathways in human CD4+ T cells. Using intracellular multicolor flow cytometry, single-cell protein expression levels were measured for 11 proteins in the MAPK pathways in Sachs et al. (2005). Six experiments were performed using different stimuli, each targeting a different protein in the selected pathway (Sachs et al., 2005), and thus both interventional and unperturbed observational data were available for our application. Various data-driven methods were proposed to estimate the protein signaling networks, including Bayesian network analyses (Sachs et al., 2005;Mooij and Heskes, 2013) and ICP using combined interventional and observational data (Meinshausen et al., 2016), and results were compared with a consensus network in the literature (Mooij and Heskes, 2013;Meinshausen et al., 2016). The datasets are available from https://github.com/lawrennd/ rca/tree/master/matlab/PROTSIG_DATA. There were a total of 7392 available measurements in both perturbed and unperturbed  Table S1 in Meinshausen et al., 2016). ICP (Meinshausen et al., 2016)  settings in the Experiments 1-9, which were used by Sachs et al. (2005) and Meinshausen et al. (2016) in their main analyses. In our analyses, we applied DAG-MM to learn the causal signaling network using unperturbed, observational data only, including anti-CD3 + anti-CD28, anti-CD3/CD28 + ICAM-2, and anti-CD3/CD28 + LY294002. These experiments did not directly intervene on the activity of 11 proteins. The observational data consisted of 2,594 observations and were preprocessed using a standard arcsinh transformation for biological interpretability. DAG-MM with fixed effects only (DAG-MM1) and with mixed effects (DAG-MM2) were applied. Meinshausen et al. (2016) used a subset of 8 of the 9 environments and one of the environment is considered as observational data without external interventions. Each environment contains between 700 and 1,000 samples. There are 11 nodes in the signaling network. Sachs et al. (2005) identified 17 edges and Meinshausen et al. (2016) identified 7 edges by ICP and 13 FIGURE 2 | Estimated protein signaling network. Black: edges identified by DAG-MM2 and also reported previously (see summary Table S1 in Meinshausen et al., 2016); Blue: edges are identified by DAG-MM2 but not reported previously; Gray: edges previously reported edges but not identified by DAG-MM2. edges by hiddenICP. All these identified edges were summarized in the Supplementary Table S1 in Meinshausen et al. (2016). Our results were compared with those obtained using the PC algorithm as reported in Kalisch et al. (2012) and with ICP as reported in Meinshausen et al. (2016) for both interventional and observational data. Table 3 summarizes the number of selected edges by each method and whether these edges were also previously reported in the literature. Treating the edges previously identified and reported in Supplementary Table S1 of Meinshausen et al. (2016) as "gold standard, " DAG-MM2 reduces the number of FP edges to a greater extent than DAG-MM1. PC and ICP identified a similar number of true positive edges as DAG-MM2, but with a higher number of FP edges. In Figure 2, we show edges identified previously in Meinshausen et al. (2016) or by our methods (edges identified elsewhere, e.g., between AKT and RAF are not shown). We focus on comparing DAG-MM2 with ICP. The skeleton of DAG-MM2 and ICP is almost identical, with DAG-MM2 identifying one more edge, Plcg → PIP3. Two edges were in the reverse direction of those reported in literature, which might due to feedback loops that are expected to be present in this system (Meinshausen et al., 2016). The striking similarity of DAG-MM2 identified from observational data alone and ICP using interventional data suggests robustness and the ability of the former to infer causal relationships from observational data by including random effects.

Brain Gray Matter Atrophy Dependence Network
Our second application involved a study on atrophy networks in the brains of patients with HD. HD is a monogenic neurodegenerative disorder caused by an expansion of the CAG trinucleotide (≥36) in the huntingtin gene (O'Donovan, 1993). The hallmark of HD neuropathology is brain atrophy, in terms of gray matter loss within the striatum and white matter loss around the striatum . While evidence shows that selective brain regions undergo atrophy at different rates (Paulsen et al., 2014), it is unknown how these regional atrophies depend on one another and act together as disease progresses. In this application, we aimed to construct brain atrophy dependence networks using data collected from a large natural-history study of HD progression, PREDICT-HD (Paulsen et al., 2014), and we aimed to corroborate findings in an independent study, TRACK-ON (Klöppel et al., 2015). Subcortical gray matter loss of volume and gray matter cortical thinning were considered as measures of brain atrophy and hallmarks of HD. Thus, we examined dependencies between rates of volume loss and cortical thinning in different brain regions.
For the PREDICT-HD study, we included individuals who carried an expansion of the CAG trinucleotide in the hungtington gene and thus were at risk of HD but had not been diagnosed at baseline. Data consisted of 824 subjects with 68 cortical regions of interest (ROI) and 22 subcortical ROIs measured by structural magnetic resonance imaging (MRI). Longitudinal assessments were obtained from these subjects with a median follow-up period of 3.9 years. The details of MRI data segmentation, preprocessing, and study design are in Paulsen et al. (2014). A linear mixed-effects model with subject-specific random intercepts and random slopes was used to estimate the rate of volumetric change and the rate of cortical thickness change at each ROI for each subject. Rates of change at ROIs form the nodes in the brain atrophy dependence network. Because CAG repeats and age are two variables with substantial contribution to HD, a covariate based on the CAG-age product (CAPs score in Zhang et al., 2011) was created to indicate a subject's risk of receiving a diagnosis of HD (low, medium, and high risk). Baseline age was dichotomized into two groups (young vs. old) based on the median split. A total of seven covariates was included (high risk, medium risk, baseline age group, sex, and baseline clinical measures: total functioning capacity [TFC], total motor score [TMS], symbol digit modalities test [SDMT]).
Potentially there are 462 edges (involving 4,180 parameters) for the subcortical gray matter volumetric atrophy network and 4,556 potential edges (involving 41,072 parameters) for the cortical gray matter thickness network. The proposed DAG-MM identified 5 connections (Supplementary Material Section S4, Table S1) from the subcortical network (e.g., left thalamus to right accumbens, and right pallidum to left putamen), which suggests that most subcortical ROI atrophy rates do not depend on other ROIs. In contrast, a denser network was identified for the cortical thickness network, with 58 connections identified (Supplementary Material Section S4, Table S2), suggesting that cortical thinning acts in a more concerted fashion, consistent with the neuroimaging literature on cortical networks in HD (He et al., 2008). PenPC identified a very dense network for both subcortical volumes (92 edges) and cortical thickness networks (480 edges). Due to its non-sparseness and difficulty in interpretation, we omit results from PenPC and report DAG-MM in the subsequent presentation.
ROIs were further organized into modules related to HD pathology as in McColgan et al. (2017) for better interpretation. We present these results in Figure 3, where the modular-wise strength of the connection was computed as the total strength of connections within a module [summation of β jk between all pairs of connected nodes (j, k) in the same module] or between two modules (summation of β jk between all pairs of connected nodes (j, k) for j in one module and k in the other). Figure 3 shows that the two strongest connections in the average modular graph (with covariates fixed at the sample averages) are the interhemispheric links between the left and right temporal regions and between the left and right motor-occipital-parietal regions. For within-modular connection, the right side motor-occipitalparietal module has the strongest strength.
Using the estimated parameters (e.g., β jk ) from model (2) and (3), we also examined differences between the networks for high-risk group vs. low-risk group, and medium-risk vs. low-risk (other covariates fixed at the sample average). The differential edge strength of the graphs for subjects in two groups was computed based on β T jk X 1 − β T jk X 2 , where X 1 and X 2 were covariates for subjects in each group. For the highvs. low-risk group comparison (Figure 3), the largest difference is in the inter-hemispheric temporal regions. Most withinmodule and between-module connections show a loss of strength in the high-risk group. For example, a large loss of intramodular connections within the right motor-occipital-parietal, right temporal, left fronto-cingulate is seen. A loss of betweenmodule connections is observed between the left and right motor-occipital-parietal regions and between the left frontocingulate and left and right temporal regions. A minor gain of connection is seen within and between a few modules. A similar trend with a milder effect is present for most connections when contrasting medium-risk and low-risk groups. When comparing older adults with younger adults, most connections show a loss of strength in the older group (Figure 3). The largest loss in the intra-modular connections is in the right temporal region. A loss of between-module connections occurs between the left and right fronto-cingulate regions, between the left fronto-cingulate and left and right temporal regions, between the left fronto-cingulate and left motor-occipital-parietal regions, and between the right fronto-cingulate and right temporal regions.
In Supplementary Material Section S4, Figure S1, we show the node-wise DAGs and the difference of the estimated network between groups with different baseline risk of HD diagnosis. At the nodal level, we see a loss of connection in the highrisk group and older group in a large number of links. The connection with the largest difference is L.caudalmiddlefrontal ⇒ L.rostralmiddlefrontal (based on L2-norm). When effects are aggregated from nodes within modules, group differences are more apparent (Figure 3). The strength of connections between nodes is summarized in Supplementary Material Section S4, Tables S1, S2. Among all covariates, the three covariates with the largest effects aggregated across all connections (based on L 2 -norm) are high-low risk group contrast, medium-low risk contrast, and older-younger adult contrast. Substantial heterogeneity of connections due to latent factors not captured by covariates is observed for almost all links (represented by σ 2 in Supplementary Material Section S4, Tables S1, S2). We show the variation of the heterogeneous effects (standard deviation: σ jk ) of connections in Supplementary Material Section S4, Figure S2. The connection with the highest variation is L.caudalmiddlefrontal ⇒ L.posteriorcingulate. This analysis demonstrates substantial heterogeneity of the brain dependence networks among individuals.

A Validation Study Using Independent Samples
We sought to corroborate our estimated cortical gray matter network using white matter cortical connectivity network data obtained from an independent study, TRACK-ON (Klöppel et al., 2015;McColgan et al., 2017). TRACK-ON is a longitudinal study of premanifest HD, with 84 subjects and a median follow-up length of 1.89 years. White matter structural connection network was constructed from diffusion tensor imaging (DTI) technology, and connection strengths between pairs of nodes were computed by probabilistic tractography. A similar algorithm as PREDICT-HD was used to define regions of interest, and the same method that was used to partition nodes into HD pathology also informed modules (McColgan et al., 2017). Detailed information on the study design and data pre-processing can be found in McColgan et al. (2017). With longitudinal DTI measurements available, a linear mixed-effects model was used to compute the rate of change in connections between nodes and their p-values. Baseline connection, CAG, age, gender, motor score, SDMT, and TFC were included as covariates. Nodes were classified into modules by the same method as the structural MRI network. Intermodular connection was defined as present if at least c pairs (c = 1, 2) of nodes (each node resides in the module being considered) were connected after the false discovery rate (FDR) correction (q < 0.1). Presence of intra-modular connection was defined similarly based on the number of pairs of nodes connected (with q < 0.1) within a module. In total, 30 white matter atrophy connections were identified after the FDR correction.
Supplementary Material Section S4, Table S3 summarizes the module-wise white matter structural connectivity network estimated from the DTI technology. The average modular gray matter atrophy network and the white matter connection network both indicate a strong intra-modular connection in the right-side motor-occipital-parietal region and a strong interhemispheric connection in the left and right motoroccipital-parietal regions, whereas a weak connection (or no connection for the white matter network) was present in the left side of the same module. For some of the other four modules, the intra-modular connection strength for gray matter and white matter appears to be complementary: a stronger link in the former corresponds to a weaker link in the latter. For example, connections between the right temporal and right motor-occipital-parietal regions and between the left and right temporal regions show a moderate to strong dependence in the gray matter network, but are absent in the white matter network. The link between the right-fronto-cingulate region is strong in the white matter network, but weak in the gray matter network.  These observations might suggest a mechanism that constrains the total modular connections in the gray matter and white matter networks; thus, a strong connection in one correlates with a weak connection in the other. We evaluated the consistency of the gray matter cortical network (obtained by DAG-MM2 statistical modeling) with the white matter cortical structural connectivity network (directly measured by DTI technology). The overall operational characteristics of the gray matter network are reported in Table 4, treating the white matter network as the reference since white matter connections were directly measured by DTI. Due to a potential complementary effect on the total number of connections between and within modules, the number of connections in the gray matter and white matter networks is negatively correlated. Thus, we computed the sensitivity as P(L ≤ l|C ≥ c), where L denotes the number of links in the gray matter network, and C denotes the number of links in the white matter network. We fixed the connectivity threshold of the white matter network at c = 1 or c = 2, and we evaluated the overall consistency of the gray matter network across all levels of threshold l by computing the AUC across l. The AUC is 0.80 (95%CI: 0.61, 0.99) at c = 1 and 0.75 (95%CI: 0.48, 1.00) at c = 2. Using a higher threshold c increases sensitivity, but with a slightly decreased specificity and a slightly lower AUC. These results show that at the modular level, the gray matter cortical atrophy network estimated by DAG-MM has adequate consistency with the white matter structural connectivity network.
McColgan et al. (2017) reported a modular white matter network obtained by comparing connectivity in patients with HD and healthy controls and applying FDR adjustment ( Figure  2 in McColgan et al., 2017). When we compare our results to theirs, we see similarity, in terms of connections between the left and right temporal regions and between the left and right motor-occipital-parietal regions.

DISCUSSION
In this article, we propose a statistical framework for estimating DAGs with mixed effects in multi-dimensional settings, referred to as DAG-MM. The framework captures covariate-dependent causal effects, along with residual effect modification, by building a series of mSEMs. Our framework is a two-stage approach, which first obtains a sparse initial skeleton (undirected graph) and then searches for DAG through a solution path within the selected skeleton and an easily implemented DAG checking procedure. The DAG-MM method is computationally efficient and has shown satisfactory performance, especially for edge selection and orientation, in both simulation studies and realdata applications. The advantage arises when taking into account the covariate-dependent structure and residual heterogeneous effects through the use of random variables. Specifically, the joint distribution of the nodes in model (2) are non-Gaussian due to these random effects and their multiplicative form with the other nodes. This asymmetry in the joint distribution permits the identification of causal relationships from observational data, which we formally prove in Theorem 1. We note that the edge orientation is more accurate than PC and its derivatives, which assume a symmetric joint distribution. For computation, the regularized likelihood-based approach identifies a sparse skeleton in an efficient fashion.
In the analyses of brain atrophy dependence network in patients with HD, some modules of the gray matter network estimated from the PREDICT-HD study share similarity with the white matter connectivity network estimated from an independent study. For some other modules, the results suggest a complementary mechanism that constrains the total modularwise connections in gray and white matter networks. In the second application, the protein signaling network constructed from DAG-MM with observational data and invariance causal prediction (ICP) with interventional data (Meinshausen et al., 2016) is highly similar. The latter approach assumes causal relationships remain invariant under interventions that do not directly target a cause. This similarity suggests that the random effects in mSEM may serve a similar role as a random perturbation of the node distribution. Under the invariance assumption, the true causal effects are stable under such perturbation, and thus, DAG-MM generates similar results as ICP, but with only observational data.
The network structure among nodes can be further parameterized to incorporate prior information about the causal effects. For example, the knowledge on pathways in the gene regulatory network available in public databases or discoverable in published literature can be included by removing or adding the edge between nodes j and k or by restricting the edge direction from j to k. Model (3) can handle this structure by specifying some values of β jk or/and σ 2 jk as zero. In addition, exploring other methods of edge orientation including Bayesian model choice methods is worth future research. Another extension is to analyze temporal data M i with two time points t 0 and t 1 , where the desirable temporal ordering corresponds to removing all edges from M i (t 1 ) to M i (t 0 ) and modeling the effect from M i (t 0 ) to M i (t 1 ).
DAG-MM can be extended to handle multiple types of data, including neuroimaging, protein, and other biomarker measures of different scales, in a regression framework by choosing the appropriate regression for each data type. When the dimension of covariates X is high (e.g., large number of genomic measures), feature selection can be imposed on β in order to choose important covariates. Here, we use mSEMs to estimate network connectivity, but we did not differentiate the fixed effects from the random effects. Our main algorithm is a backward selection method and does not allow edge addition. To overcome this issue, one may start DAG-MM from multiple skeletons, which is an approach that provides a more stable edge selection. Other interesting extensions include direct modeling of a dynamic network among M(t) to allow for timevarying network structure and associate network connections with clinical outcomes. Lastly, the inference of a subjectspecific graph (e.g., p-values and confidence intervals) can be based on bootstrap (i.e., bootstrap data B times, and construct bootstrap confidence interval for DAG edge strength). However, a formal treatment of inference procedure is a topic worth future research.

AUTHOR CONTRIBUTIONS
YW and DZ designed and oversaw the study. XL and SX developed algorithm, implemented the study, and carried out the statistical analysis. PM, ST, and RS provided DTI data, discussed results, and gave the biological insights. All authors participated in writing the manuscript.

FUNDING
This research is support by U.S. NIH grants NS082062 and NS073671 and Wellcome Trust Grant 103437/Z/13/Z.