Edited by: Raya Khanin, Memorial Sloan-Kettering Cancer Center, USA

Reviewed by: Raya Khanin, Memorial Sloan-Kettering Cancer Center, USA; Erik Larsson, Memorial Sloan-Kettering Cancer Center, USA

*Correspondence: Frank Emmert-Streib, Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK. e-mail:

This article was submitted to Frontiers in Bioinformatics and Computational Biology, a specialty of Frontiers in Genetics.

This is an open-access article distributed under the terms of the

In this paper, we present a systematic and conceptual overview of methods for inferring gene regulatory networks from observational gene expression data. Further, we discuss two classic approaches to infer causal structures and compare them with contemporary methods by providing a conceptual categorization thereof. We complement the above by surveying global and local evaluation measures for assessing the performance of inference algorithms.

The purpose of this paper is to provide a systematic overview of methods used to estimate gene regulatory networks (GRN) from large-scale expression data. The inference of gene regulatory networks, which is sometimes also referred to as reverse engineering (Stolovitzky and Califano,

Because this field is expanding rapidly, this overview is inevitably incomplete. Instead of aiming to cover as many approaches as possible, we focus on conceptual clarity and on methods for observational expression data. That is, we review the statistical approaches from the literature that we consider most important and show that they can be categorized according to the assumptions they make about the dynamic behavior of the data and according to the conceptual strategies they employ. To facilitate the understanding of the latter point, we also present two seminal, by now classic, methods for the causal inference of networks and their theoretical foundations (Chow and Liu,

The inference of gene networks from high-throughput data is a complex and rapidly expanding area, triggered by the invention of measurement technologies. To provide a systematic discussion of the underlying principles, we limit this review to observational, steady-state gene expression data and consider only correlation- and mutual information-based inference methods, as visualized in Figure

The complexity of the network inference problem can be visualized with the help of Figure

We begin our presentation with some necessary preliminaries. Directed acyclic graphs (DAGs) are frequently employed to represent causal relations among variables (Wright,

Definition 1. Two nodes

If

Definition 2. A path

on

on

Here,

A systematic connection between independence relations among variables

Here the subscript

Definition 3.

Here minimal I-map means that if any edge in the DAG is deleted,

In our opinion, there are essentially two principal approaches in the literature that are relevant for our problem, which

The first algorithm we discuss proven to reconstruct a causal structure is the _{xy}_{xy}_{xy}_{P} emphasizing explicitly that this independence is with respect to an underlying distribution

The basic principle on which the IC algorithm is based is a connection between the d-separation relations of a DAG

The second algorithm proven to reconstruct a causal structure is from Chow and Liu (
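To make the idea behind the Chow and Liu approach concrete, the following minimal sketch builds a maximum-weight spanning tree from pairwise mutual information values, which is the core step of the Chow-Liu procedure. The function name `chow_liu_tree` and the toy matrix `mi` are illustrative assumptions, not taken from the original work; in practice the mutual information values would first have to be estimated from data.

```python
def chow_liu_tree(mi):
    """Return the edges of a maximum-weight spanning tree over a symmetric
    matrix of pairwise mutual information values (Kruskal's algorithm)."""
    n = len(mi)
    parent = list(range(n))  # union-find forest used to detect cycles

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All candidate edges (i < j), strongest mutual information first.
    candidates = sorted(
        ((mi[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    tree = []
    for _weight, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge does not close a cycle
            parent[root_i] = root_j
            tree.append((i, j))
    return sorted(tree)


# Hypothetical mutual information values for four variables.
mi = [
    [0.0, 0.9, 0.3, 0.1],
    [0.9, 0.0, 0.8, 0.2],
    [0.3, 0.8, 0.0, 0.7],
    [0.1, 0.2, 0.7, 0.0],
]
tree_edges = chow_liu_tree(mi)
print(tree_edges)
```

The resulting tree is undirected; choosing a root node and orienting the edges away from it would be a separate step that this sketch does not cover.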

In this section we present an overview and a categorization of methods that have been

There are two principally different ways to construct co-expression networks from microarray data. One approach follows a hard- and the other a soft-thresholding of correlation coefficients (Zhou et al., _{ij}_{ij}_{ij}_{ji}_{ij}_{ij}_{ji}

For the soft-thresholding two types of adjacency functions are frequently used (Zhang and Horvath,

and the power adjacency function,

Both types of adjacency functions lead to undirected but weighted networks. In order to choose the above parameters appropriately Zhang and Horvath (
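The thresholding schemes described above can be sketched as follows. This is a minimal illustration that assumes the commonly used forms of the adjacency functions: a hard threshold on the absolute correlation, a sigmoid function with parameters `alpha` and `tau0`, and a power function with exponent `beta`. The function names, the parameter values, and the toy correlation matrix are assumptions made for this example.

```python
import math


def hard_threshold(r, tau):
    """Unweighted adjacency: 1 if |r| reaches the threshold tau, else 0."""
    return 1 if abs(r) >= tau else 0


def sigmoid_adjacency(r, alpha, tau0):
    """Soft threshold: sigmoid of the absolute correlation."""
    return 1.0 / (1.0 + math.exp(-alpha * (abs(r) - tau0)))


def power_adjacency(r, beta):
    """Soft threshold: absolute correlation raised to the power beta."""
    return abs(r) ** beta


def adjacency_matrix(corr, func):
    """Apply an adjacency function to all off-diagonal correlations."""
    n = len(corr)
    return [
        [0 if i == j else func(corr[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical correlation matrix for three genes.
corr = [
    [1.0, 0.9, -0.5],
    [0.9, 1.0, 0.2],
    [-0.5, 0.2, 1.0],
]
hard = adjacency_matrix(corr, lambda r: hard_threshold(r, tau=0.8))
soft = adjacency_matrix(corr, lambda r: power_adjacency(r, beta=2))
print(hard)
print(soft)
```

Hard thresholding yields an unweighted network, whereas both soft variants keep a weight for every gene pair.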

We would like to remark that the purpose of constructing co-expression networks differs from that of all other methods discussed in this paper. Co-expression networks serve as a means to explore the functionality of genes on a systems level (Zhang and Horvath,

Asymmetric-N is an algorithm that takes into account the fact that biological networks contain hubs. It is a modified version of Symmetric-N (Agrawal,

Instead of using _{C}_{P}

Xiong et al. (

Xiong et al.’s (

Here

Partial correlations of low-order have been employed in de la Fuente et al. (_{ij}_{i}_{j}_{ij}_{i}_{j}_{ij}_{ji}_{ij}_{k}_{ij}|_{i}_{j}_{ij}_{ji}_{ij}_{ij}

GGM, also known as covariance selection model, concentration graph, or Markov random field (Dempster, ^{−1}, also called precision or concentration matrix. Network inference methods based on GGM make use of the relation,

connecting the partial correlation of full-order with the elements of Ω, _{ij}

From this conditional independence relation, the principle way to infer a network structure from GGM becomes apparent estimating a network in the following way. If ρ_{ij}_{ij}_{ji}
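As a small worked example of the relation between the precision matrix Ω and the full-order partial correlations, the sketch below uses the standard identity ρ_ij = −ω_ij / sqrt(ω_ii ω_jj). The covariance matrix is a hypothetical one for a chain X → Y → Z with unit-variance noise, and the fixed cut-off `0.1` merely stands in for the statistical test that an actual GGM method would use; both are assumptions of this illustration.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Full-order partial correlations from the precision matrix."""
    omega = invert(cov)
    n = len(cov)
    return [
        [
            1.0 if i == j
            else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical covariance matrix of a chain X -> Y -> Z.
cov = [
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
    [1.0, 2.0, 3.0],
]
rho = partial_correlations(cov)
# Illustrative cut-off in place of a proper significance test.
ggm_edges = [
    (i, j) for i in range(3) for j in range(i + 1, 3) if abs(rho[i][j]) > 0.1
]
print(ggm_edges)
```

Although X and Z are marginally correlated, their partial correlation given Y vanishes, so only the two edges of the chain remain.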

Several approaches have been made to infer gene regulatory networks based on GGM (Wille et al., ^{−1}, is estimated and in the statistical tests employed to define significance. The reason for these technical variants comes from a variety of problems. First, if the number of samples is smaller than the number of genes, which is typically the case for genomics data, the sample covariance matrix is not positive definite and, hence, not invertible. Another problem is caused by the small samples size problem (Schäfer and Strimmer,

Bayesian networks (BN) allow the identification of the DAG structure of a network. BN were among the first methods applied to expression data to infer GRN (Friedman et al.,

In Figure _{P}

In the following, we restrict our discussion to the working mechanisms of the procedures. However, we want to emphasize that the statistical estimators employed are also important for these methods (Beirlant et al.,

_{0}. The resulting network is constructed based on this threshold by including an edge between two genes in the respective adjacency matrix of the network, _{ij}_{ji}_{ij}_{0}, otherwise no edge is included between _{0} was found by randomization of the expression data set. From this randomization, mutual information values were re-calculated from which a reference distribution of mutual information values, resembling a null-distribution, was obtained. Based on this reference distribution the threshold _{0} was obtained by heuristic arguments. For this reason, mutual information values that are larger than _{0} are called relevant but cannot necessarily be called statistically significant.
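The thresholding step of a relevance network can be sketched as follows, assuming discretized expression profiles and a simple plug-in estimator of mutual information. The threshold `i0` is passed in directly here, whereas the text above obtains it from a randomization of the data; the function names, the use of "greater than or equal" in the comparison, and the toy profiles are assumptions of this illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def relevance_network(profiles, i0):
    """Include an undirected edge whenever the mutual information of a
    gene pair reaches the threshold i0."""
    edges = set()
    for g, h in combinations(sorted(profiles), 2):
        if mutual_information(profiles[g], profiles[h]) >= i0:
            edges.add((g, h))
    return edges


# Hypothetical binarized expression profiles over eight samples.
profiles = {
    "g1": [0, 0, 1, 1, 0, 0, 1, 1],
    "g2": [0, 0, 1, 1, 0, 0, 1, 1],
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],
}
rn_edges = relevance_network(profiles, i0=0.5)
print(rn_edges)
```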

_{0} allowing to declare mutual information values significant if _{ij}_{0}^{1}_{ij}_{ij}_{ji}_{0}) such that, for each triplet (_{ij}_{jk}_{ik}_{2} multiplied by a factor, i.e.,

Here 0 ≤ ε ≤ 1. The introduction of this step has been motivated by the so-called

implies that

To ensure that the application of equation (

ARACNE employs two parameters, I_{0} and ε. The cut-off parameter I_{0} is determined by a resampling method estimating the distribution of the null hypothesis corresponding to a vanishing mutual information. This allows to assign I_{0} is found in an unsupervised and ε in a supervised way of learning.
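The two-step procedure of ARACNE can be sketched as follows: first threshold the mutual information values at a cut-off, then apply the DPI to every fully connected triplet and drop its weakest edge if it lies below the smaller of the other two values, scaled by (1 − ε). The exact form of the comparison, the function name `aracne_prune`, and the toy values are assumptions of this sketch; estimating the cut-off by resampling is not shown.

```python
from itertools import combinations


def aracne_prune(mi, i0, eps):
    """Threshold a symmetric mutual information matrix at i0, then remove
    the weakest edge of each triangle according to the DPI with tolerance
    eps. Edges are only marked during the scan and removed at the end, so
    the result does not depend on the order of the triplets."""
    n = len(mi)
    edges = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] >= i0}
    to_remove = set()
    for i, j, k in combinations(range(n), 3):
        triangle = [(i, j), (i, k), (j, k)]
        if not all(edge in edges for edge in triangle):
            continue
        weakest, second, _ = sorted(triangle, key=lambda e: mi[e[0]][e[1]])
        if mi[weakest[0]][weakest[1]] < (1.0 - eps) * mi[second[0]][second[1]]:
            to_remove.add(weakest)
    return edges - to_remove


# Hypothetical mutual information values for a chain 0 - 1 - 2.
mi = [
    [0.0, 0.8, 0.4],
    [0.8, 0.0, 0.7],
    [0.4, 0.7, 0.0],
]
strict = aracne_prune(mi, i0=0.1, eps=0.0)
tolerant = aracne_prune(mi, i0=0.1, eps=0.5)
print(strict, tolerant)
```

With ε = 0 the indirect edge between genes 0 and 2 is removed, while a larger tolerance retains it.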

_{ij}_{ij}_{i}_{j}_{i}_{j}_{ik}_{jk}_{i}_{j}_{0} for each mutual information value correspondingly pair of genes, CLR estimates
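A minimal sketch of the CLR scoring step is given below, assuming the commonly described form in which each mutual information value is converted into two z-scores, one relative to the values of gene i and one relative to those of gene j, and the two are combined as the square root of the sum of their squares. Using the population standard deviation and clipping negative z-scores at zero are assumptions of this sketch, as are the function name and the toy matrix.

```python
import math
from statistics import mean, pstdev


def clr_scores(mi):
    """Combine gene-wise z-scores of mutual information values into a
    symmetric score matrix."""
    n = len(mi)
    stats = []
    for i in range(n):
        others = [mi[i][k] for k in range(n) if k != i]
        stats.append((mean(others), pstdev(others)))

    def z(i, j):
        # z-score of I_ij relative to the background of gene i.
        mu, sigma = stats[i]
        return max(0.0, (mi[i][j] - mu) / sigma) if sigma > 0 else 0.0

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
            scores[i][j] = scores[j][i] = value
    return scores


# Hypothetical mutual information matrix for four genes.
mi = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
scores = clr_scores(mi)
for row in scores:
    print([round(v, 3) for v in row])
```

Because each value is judged against the background of both genes involved, a moderately large mutual information value can still score highly if both genes otherwise have low values.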

is defined as the difference between the generalized two-way mutual information,

and the mutual information values for two random variables. A visualization of synergy is shown in Figure
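As a worked example, synergy can be computed for an XOR relation, where neither regulator alone carries information about the target but both together determine it. The sketch assumes the definition given above, i.e., the mutual information between the pair of regulators and the target minus the sum of the two individual mutual information values; the plug-in estimator and the toy data are assumptions of this illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def synergy(x1, x2, y):
    """I(X1, X2; Y) minus the sum of I(X1; Y) and I(X2; Y)."""
    joint = list(zip(x1, x2))
    return mutual_information(joint, y) - (
        mutual_information(x1, y) + mutual_information(x2, y)
    )


# Two hypothetical binary regulators observed in all four combinations.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y_xor = [a ^ b for a, b in zip(x1, x2)]  # target depends on both jointly
y_copy = list(x1)                        # target depends on x1 alone

syn_xor = synergy(x1, x2, y_xor)
syn_copy = synergy(x1, x2, y_copy)
print(syn_xor, syn_copy)
```

The XOR target gives a synergy of one bit, whereas the target that simply copies one regulator gives zero.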

which is the negative of the three-dimensional interaction information (Watkinson et al.,

This term is called ^{2}

In order to detect also non-cooperative effects, the measure actually used by SA-CLR is

It is interesting to note that

holds formally but is not identical to the (general) conditional mutual information

Whenever a gene, _{j}_{j}_{0}, then this gene is added to the set

We want to remark that the score _{j}

where the (auxiliary) random variable, _{j}_{j}_{j}_{j}_{j}_{j}

_{j}_{3} measure of the MI3 algorithm. Blue corresponds to positive and red to a negative values.

_{3}

which equals the difference of mutual informations between the target gene and the two regulators and the target gene with one of the regulators. Alternatively, _{3} can also be written as

Equation (_{3} has also a simple relation to synergy, as shown in equation (

In the right Figure _{3}. Here, the shown intersections result from the correspondence of _{1}, _{2}) to “d + e + f,” _{1}) to “d + e,” and _{2}) to “e + f.” Subtraction of these terms according to equation (_{1}, _{2}) is counted twice.

The MI3 algorithm learns gene regulatory networks in two steps: first, local regulatory networks consisting of only three genes (_{1}, and _{2}) are learned. Starting from a given gene, _{1}, _{2}). As a result, directed edges between _{1} and _{2} and _{3} value. Overall, the MI3 algorithm aims to learn the optimal two-parent causal model for each target variable in the form R1 → T and R2 → T.

_{i}_{j}_{k}_{ij}_{ji}_{i}_{j}_{k}_{k}

_{i}_{j}_{i}_{j}_{0}, then an edge is included, _{ij}_{ji}_{i}_{j}_{k}_{i}_{j}_{0}, _{i}_{k}_{0}, and _{j}_{k}_{0} the conditional mutual information for all combinations of _{i}_{j}_{k}_{i}_{j}_{k}_{i}_{j}_{0}, _{i}_{k}_{0}, and _{j}_{k}_{0} the conditional mutual information for all combinations of (_{i}_{j}_{k}

Mutual information in combination with conditional mutual information was also used in Zhao et al. (

In Figure

Because all methods presented above test many hypotheses, one needs to apply a multiple hypothesis correction method (Lehmann and Romano,
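As one possible illustration of such a correction, the sketch below implements the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The choice of this particular procedure, the level `alpha`, and the p-values are assumptions made for the example; the text above does not prescribe a specific method.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking the hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value lies below its own threshold.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per tested gene pair.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.074, 0.042, 0.06]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

Without correction, five of these eight p-values would fall below 0.05; after correction only two hypotheses are rejected.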

In order to assess the performance of inference methods, several measures have been suggested. In the following, we present three different types of such measures: (1) General statistical measures, (2) Ontology-based measures, (3) Network-based measures. An overview of these different measures is given in Figure

The most widely used statistical measures are based on,

obtained by comparison of an inferred (predicted) network with the true network underlying the data. From the measures listed above, three pair-wise combinations thereof are frequently used to assess the performance of an algorithm. The first measure is the area under the curve for the ^{3}

also called _{1} because it is a special form of,
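The basic quantities behind these measures can be computed directly from the edge sets of the true and the inferred network. The sketch below assumes undirected networks without self-loops and uses the standard definitions of precision, recall (true positive rate), false positive rate, and the F-score with weight `beta`; the function names and the toy networks are assumptions of this example.

```python
def confusion_counts(true_edges, predicted_edges, n_nodes):
    """Count TP, FP, TN, and FN over all unordered node pairs."""
    true_set = {frozenset(e) for e in true_edges}
    pred_set = {frozenset(e) for e in predicted_edges}
    n_pairs = n_nodes * (n_nodes - 1) // 2
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    tn = n_pairs - tp - fp - fn
    return tp, fp, tn, fn


def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical true and inferred networks on four genes.
true_edges = [(0, 1), (1, 2), (2, 3)]
predicted_edges = [(0, 1), (1, 2), (0, 3)]

tp, fp, tn, fn = confusion_counts(true_edges, predicted_edges, n_nodes=4)
precision = tp / (tp + fp)
recall = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)      # false positive rate
f1 = f_score(precision, recall)
print(tp, fp, tn, fn, precision, recall, fpr, f1)
```

Sweeping a decision threshold over the edge scores and recomputing these counts at each step would trace out the curves whose areas are used as summary measures.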

We want to emphasize that all measures presented above are general statistical measures used in statistics and data analysis. None of them is specific to the problem under consideration, namely, the inference of regulatory networks from expression data. In other words, none of these measures utilizes biological or network-specific information in any form. Further, each of these general statistical measures is a global error measure because it evaluates the network inference performance as a whole, represented by a scalar value. As a consequence, it is implicitly assumed that the inference process is homogeneous, i.e., each interaction should have about the same true positive rate, because otherwise it would be implausible to summarize the inference performance by just one value, e.g., a

In the following, we present extensions of general statistical measures for both types of information.

An evaluation strategy utilizing biological information to assess the performance of an inference method tries to quantify the

The third type of measures for assessing the performance of an inference algorithm considers the network structure explicitly. That means, in contrast to the general statistical measures and also the ontology-based measures, network-based measures can only be used if there is a network that underlies the problem.

A measure that makes explicit use of the network structures of the true (

the weighted sum of false-positive edges (first term) plus the false-negative edges (second term). This measure is not only asymmetric in its arguments but also gives different weights to type-1 (false positive) and type-2 (false negative) errors. For type-1 errors, in the true network _{G}

In contrast to all above measures, which were _{1}(_{E}_{1}(_{E}_{ij}_{ij}

which corresponds to ^{e}). Analogously, we can estimate the TNR.
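A local, edge-specific true positive rate can be estimated from an ensemble of inferred networks, for example one network per simulated or bootstrapped data set. The sketch below assumes that the rate of a true edge is simply the fraction of ensemble members in which that edge was inferred; the function name `edge_tpr` and the toy ensemble are assumptions of this illustration.

```python
def edge_tpr(true_edges, ensemble):
    """For each edge of the true network, return the fraction of inferred
    networks in the ensemble that contain it (edges are undirected)."""
    inferred = [{frozenset(e) for e in network} for network in ensemble]
    rates = {}
    for edge in true_edges:
        hits = sum(1 for network in inferred if frozenset(edge) in network)
        rates[tuple(sorted(edge))] = hits / len(ensemble)
    return rates


# Hypothetical true network and an ensemble of four inferred networks.
true_edges = [(0, 1), (1, 2)]
ensemble = [
    [(0, 1)],
    [(0, 1), (1, 2)],
    [(1, 0), (0, 2)],
    [(0, 1)],
]
rates = edge_tpr(true_edges, ensemble)
print(rates)
```

In this toy ensemble one edge is always recovered and the other only once in four networks, a difference that a single global measure would average out.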

In principle, any combination of true positive and true negative rates would result in a valid network-based measure consisting, e.g., of network motifs, subnetworks, or even only of individual edges. For such a measure representing a structural region within a network, it is then possible to estimate its reconstruction rate. To provide a concrete example, we give the reconstruction rate of a three-gene motif that is given by the chain in Figure

If biological information about the genes in the network is available, it can also be used to define appropriate measures. For example, one could obtain a reconstruction rate for all interactions that are connected with transcription factors or for particular biological pathways involving only certain genes as defined, e.g., via the gene ontology database. Further examples of such measures can be found in Altay and Emmert-Streib (

A question that is of practical relevance is which of the discussed inference methods, listed in Figure ^{2}) and for MRNET it is between ^{2}) and ^{3}) which is difficult to quantify exactly because of the iterative nature of the second step employed by MRNET. For the conducted simulations which involved networks in the order of ^{2}) genes this was tractable; however, for real biological data the number of genes can reach over 10,000, causing serious problems. In contrast, C3NET has been applied to a bootstrap ensemble from B cell lymphoma for 9,684 genes (de Matos Simoes et al., submitted). Second, due to the conservative character of C3NET, which is currently the most conservative method of all inference methods, the number of obtained interactions is easier to deal with than for other methods. For example, in Basso et al. (

We would like to remark that the methods studied in Altay and Emmert-Streib (^{1}) genes, and individual data sets rather than ensembles. For example, in Werhli et al. (

From the discussion of the preceding methods for the structural inference of regulatory networks, one could get the impression that their number is virtually unlimited, and likewise the number of principal ideas behind them. In our opinion, the first point is probably true, the latter is not. To see this more clearly, we would like to outline the general procedure underlying the development of each method. First, a principal idea or hypothesis is raised about a mechanism for the inference of regulatory networks and, second, a method is conceived that could accomplish this. Formally, the first part relates to the conceptual or qualitative framework of a method, whereas the latter refers to its quantitative realization, e.g., in the form of statistical estimators. In order to learn about two classic inference algorithms, embodying two different conceptual ideas, we presented the IC (inductive causation) algorithm (Pearl,

We would like to finish this review with a brief outlook on future directions. In the introduction, we mentioned that, due to the nature of gene expression data, which do not allow unique predictions to be derived about the underlying molecular interactions between gene products, the resulting gene regulatory networks represent a mixture of a transcriptional regulatory network and a protein interaction network. For example, in Altay and Emmert-Streib (

In order to obtain more refined predictions about the type of interactions and also to improve the inference performance of the methods, complementary information from other types of high-throughput data is needed. For example, data from ChIP-chip or ChIP-Seq experiments could be used to obtain information about the potential gene targets of transcription factors; similarly, proteomics data could be employed to reveal protein-protein interactions. Ideally, information from all three data types (ChIP-Seq, gene expression, and proteomics) should be integrated to infer a more detailed network with a clearer interpretation of the inferred interactions between the gene products. A few methods have already been pioneered for such an integration (Nariai et al.,

Another very important topic, aside from data-integrative methods, relates to the generation of the data themselves. Specifically, in this review we focused on observational data only; however, experimental data consisting of gene interventions or perturbations form a very fruitful source of information that could be systematically exploited (Fröhlich et al.,

This discussion emphasizes the need for a clear conceptual distinction between different methods and the information they are based on.

In this paper we presented a systematic overview of methods for inferring gene regulatory networks. Although this field is expanding rapidly, which makes it very difficult to obtain such an overview, we adopted two different perspectives that allowed us to categorize inference algorithms sensibly. The first perspective was based on the dynamical assumptions (linear vs. non-linear) that methods make about the underlying data. The second considered the methods through the lens of classic approaches which use either d-separation or the DPI (Chow and Liu,

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Frank Emmert-Streib would like to thank Simon Tavaré and Florian Markowetz for fruitful discussions and the Department for Employment and Learning through its “Strengthening the all-Island Research Base” initiative and the School of Medicine, Dentistry and Biomedical Sciences for financial support. Gökmen Altay is funded by Cancer Research UK.

^{1}The details of the randomization differ from those used by RN, which now allows statistical statements to be made.

^{2}We want to remark that

^{3}For example, the mutual information threshold _{0}, used in the CLR algorithm.