Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks

In recent years gene regulatory networks (GRNs) have attracted a lot of interest and many methods have been introduced for their statistical inference from gene expression data. However, despite their popularity, GRNs are widely misunderstood. For this reason, we provide in this paper a general discussion and perspective of gene regulatory networks. Specifically, we discuss their meaning, the consistency among different network inference methods, ensemble methods, the assessment of GRNs, the estimated number of existing GRNs and their usage in different application domains. Furthermore, we discuss open questions and necessary steps in order to utilize gene regulatory networks in a clinical context and for personalized medicine.


INTRODUCTION
About 15 years ago, inference of large-scale gene regulatory networks (GRNs) was made possible thanks to the availability of high-throughput gene expression data. Since then, many different methods have been developed (Liang et al., 1998; Friedman, 2004; Wille et al., 2004; Zhang et al., 2011) and used to enhance our understanding of diseases (Basso et al., 2005; Madhamshettiwar et al., 2012). However, despite their widespread usage in current biomedical research, there is still much confusion about the basic meaning of GRNs, ways of assessment, and possible application areas.
In this paper, we aim to clarify some of these problems and also provide a discussion of important next steps in order to bring gene regulatory networks closer to clinical and medical application. Furthermore, we add some recommendations we consider important to make GRNs more popular among biologists and clinicians, as they require a dedicated platform for accessing and analyzing inferred gene regulatory networks.

WHAT DO WE CALL NETWORKS INFERRED FROM GENE EXPRESSION DATA?
For reasons of clarity, we first define what we mean by a gene regulatory network.
Definition 1. We call a network that has been inferred from gene expression data a "gene regulatory network," briefly denoted as GRN.
From the above definition one can see that we assume a statistical perspective that places the data at the center of focus. Because gene expression data provide information only about the abundance of mRNAs rather than binding information, gene regulatory networks defined in the above sense provide information about regulatory interactions between regulators and their potential targets, gene-gene interactions, and potential protein-protein interactions (e.g., in a complex) (de Matos Simoes et al., 2013a). There are many examples where such networks have been studied (Margolin et al., 2006; Werhli et al., 2006; Meyer et al., 2008; Stolovitzky et al., 2009; Emmert-Streib et al., 2012); see Table 1 for a brief overview. So far there is no generally adopted parlance to name such inferred networks, but the term gene regulatory network (Hecker et al., 2009) is frequently used and will also be used in this paper.
For completeness, we note that there is a variety of conceptually different approaches to infer networks; we refer the reader to the review articles by Lee et al. (2009) and Markowetz et al. (2007) for a thorough discussion.

IS THERE JUST ONE "RIGHT" METHOD?
In recent years, many network inference methods have been introduced and many comparisons have been conducted (Akutsu et al., 1999; Margolin et al., 2006; Werhli et al., 2006; Meyer et al., 2008; Stolovitzky et al., 2009; Emmert-Streib et al., 2012). It appears that the results of such technical comparisons depend crucially on the studied conditions, including the type of data (simulated, real), size of the network, number of samples, amount of noise, experimental design (observational, experimental, interventional), type of the underlying interaction structure (scale-free, random, small-world), and error measure (global, local), among others. For this reason it is unlikely that there is one "right" method that fits all different biological, technical and experimental design conditions best.
However, if one asks less technical and more biological questions about the meaning of the inferred networks, i.e., by evaluating the biological consistency of inferred networks resulting from different network inference methods, there is supporting evidence that the differences might not be that large, as recently demonstrated for C3Net, BC3Net and Aracne (de Matos Simoes et al., 2013b). Hence, it is unlikely that there is just one method that outperforms all others for all conditions, but a number of methods result in an overlapping spectrum having the potential to infer similar biological information.

ENSEMBLE METHODS
A recent trend in the field of biological network inference is the use of ensemble methods (Zhang and Singer, 2010) to improve the stability and accuracy of the inferred networks. Ensemble methods have been popularized by Leo Breiman, as exemplified by random forest classifiers (Breiman, 2001), which have bagging (Breiman, 1996) at their heart. Briefly, the underlying idea is to (1) bootstrap a given data set, (2) apply the inference method to each bootstrap data set, and (3) aggregate the resulting networks.
Although ensemble approaches to network inference are computationally intensive, they have the clear advantage of being straightforward and efficient to implement on large computer clusters. Indeed, if one runs an ensemble of size B on a computer cluster with B nodes, the computation time for the whole ensemble is (about) the same as for just one method run on one desktop computer.
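To make the bagging idea concrete, the following is a minimal Python sketch: bootstrap the samples, infer a network from each bootstrap data set, and keep the edges that recur often enough. It is an illustration under stated assumptions, not one of the methods discussed in this paper. The base method (thresholding the absolute Pearson correlation), the function names `infer_network` and `bagged_network`, the parameters `threshold` and `min_fraction`, and the toy data are all hypothetical choices made for this example.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def infer_network(samples, genes, threshold=0.8):
    """Stand-in base method (an assumption): link two genes if |correlation| >= threshold."""
    edges = set()
    for i, j in combinations(range(len(genes)), 2):
        xi = [s[i] for s in samples]
        xj = [s[j] for s in samples]
        if abs(pearson(xi, xj)) >= threshold:
            edges.add((genes[i], genes[j]))
    return edges


def bagged_network(samples, genes, B=100, min_fraction=0.5, seed=0):
    """Bootstrap the samples B times, infer one network per bootstrap, aggregate by vote."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(B):
        # (1) bootstrap: resample the samples with replacement
        boot = [rng.choice(samples) for _ in samples]
        # (2) infer a network from this bootstrap data set
        for edge in infer_network(boot, genes):
            counts[edge] = counts.get(edge, 0) + 1
    # (3) aggregate: keep edges found in at least min_fraction of the B networks
    return {e for e, c in counts.items() if c / B >= min_fraction}


# Toy data (hypothetical): gene B closely follows gene A, gene C is independent.
data_rng = random.Random(1)
genes = ["A", "B", "C"]
samples = []
for _ in range(50):
    a = data_rng.gauss(0, 1)
    samples.append((a, 2 * a + data_rng.gauss(0, 0.1), data_rng.gauss(0, 1)))

network = bagged_network(samples, genes, B=50)
print(network)  # expected to contain only the A-B edge
```

The B iterations of the loop do not depend on each other, which is why such an ensemble can be distributed over B cluster nodes as described above.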

ASSESSING INFERRED NETWORKS
The assessment of inferred networks is an important and complicated topic. The reason for this is that networks are high-dimensional, structured objects that enable modeling of diverse aspects of biological systems. There are two main issues one has to face when assessing the quality of inferred biological networks: (I) the definition of a set of "true" interactions, referred to as the gold standard, and (II) the choice of statistical measures to quantitatively assess the quality of networks using this gold standard. The former issue is usually addressed by using known interactions from research articles (Mostafavi et al., 2008; Haibe-Kains et al., 2012) and structured biological databases such as KEGG (Kanehisa and Goto, 2000) or I2D (Brown and Jurisica, 2005). The main disadvantage of this approach is that, although the set of known interactions might be quite large, many of them might not be relevant to the biological conditions under investigation. For this reason, it is also important to note that the standardized reporting of such contextual information is crucial for comparing causal and correlative relationships between molecular entities meaningfully. Examples of endeavors that provide computer-processable languages for this purpose are BEL, PySB, and BCML (Slater, 2014).
As an alternative, several research groups performed multiple perturbations of the biological system under study (cancer cell lines, for instance) to measure their effects and subsequently validate their inferred networks (Frohlich et al., 2008; Olsen et al., 2014). This experimental design, although significantly more lengthy and costly, enables the validation of inferred interactions in conditions that are identical to, or closely mimic, those used for network inference. As an example, Olsen et al. knocked down 8 genes in the RAS signaling pathway in colorectal cancer cell lines to quantitatively assess the quality of gene interaction networks built from expression data of human colon tumors.
Given a set of known interactions, one can use traditional statistical error measures, such as the F-score or AUC-ROC (area under the receiver operating characteristic curve). These measures can be used to assess the quality of networks at the global level (for the network as a whole), at the edge level (for each individual edge), or at many intermediate levels (for instance, for network motifs); see Altay and Emmert-Streib (2010b); Emmert-Streib and Altay (2010). That means that even for generic statistical error measures there are many different levels that can be assessed. Furthermore, real biological data and simulated data can, and should, be used for the assessment of networks. Real biological data allow the biological relevance of inferred networks to be assessed, e.g., by using GO or KEGG, and simulated data enable a detailed analysis of any technical aspect. In general, one should routinely use a large variety of quality and error measures. Unfortunately, standards are currently not available and would need to be developed.
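As an illustration of a global-level assessment against a gold standard, the following Python sketch computes the F-score of a thresholded network and the AUC-ROC of its edge scores. It is a minimal example under stated assumptions: edges are undirected and stored as tuples in a fixed order, and the gold standard, the edge scores, and the cutoff of 0.5 are hypothetical toy values, not results from any study.

```python
def f_score(predicted, gold):
    """F-score (harmonic mean of precision and recall) of a predicted edge set."""
    tp = len(predicted & gold)  # true positives: predicted edges that are in the gold standard
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def auc_roc(scores, gold):
    """AUC-ROC: probability that a random true edge scores higher than a
    random false edge, with ties counted as one half."""
    pos = [s for e, s in scores.items() if e in gold]
    neg = [s for e, s in scores.items() if e not in gold]
    if not pos or not neg:
        raise ValueError("need at least one true and one false candidate edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical gold standard and edge confidence scores for four genes.
gold = {("A", "B"), ("B", "C")}
scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.6, ("B", "C"): 0.4,
    ("A", "D"): 0.2, ("B", "D"): 0.1, ("C", "D"): 0.3,
}

# Threshold the scores (cutoff 0.5 is an assumption) to obtain a network.
predicted = {e for e, s in scores.items() if s >= 0.5}

f = f_score(predicted, gold)   # precision 1/2, recall 1/2 -> 0.5
auc = auc_roc(scores, gold)    # 7 of 8 (true, false) pairs ranked correctly -> 0.875
print(f, auc)
```

The F-score depends on the chosen cutoff, whereas the AUC-ROC summarizes the ranking of all candidate edges; reporting both is one simple way to follow the recommendation to use more than one measure.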

HOW MANY GENE REGULATORY NETWORKS EXIST?
It is generally acknowledged that a phenotype is an emergent property of genotype-environment interactions. Specifically, a phenotype results from molecular and cellular activity patterns arising from genotype-environment interactions. This implies that each observable phenotype is associated with phenotype-specific gene networks, because without changing molecular interactions a phenotype cannot change; this concept is illustrated in Figure 1. In this figure, gene networks can be seen as a bottleneck between the genotype and the phenotype with respect to their coupling. That means every change at the genotype level that results in a change of the phenotype will also inevitably lead to a change in the gene network structure as mediator between both levels.
However, since gene networks refer to all possible types of molecular networks, including the transcriptional regulatory network, protein interaction network, metabolic network, gene regulatory network and interactions between these networks, it is less clear which of these networks, or all of them, are actually changed. Moreover, because a gene regulatory network can potentially represent many types of physical biochemical interactions among genes and gene products (de Matos Simoes et al., 2013a), it can be expected that gene regulatory networks are highly phenotype specific (Schadt, 2009; Emmert-Streib and Glazko, 2011). Establishing such relationships will therefore be a complex task, but also provides an opportunity to catalog phenotypes quantitatively. An example of the analysis of tissue-specific networks can be found in Guan et al. (2012), where 107 tissue-specific networks have been studied. Currently, the number of GRNs is difficult to estimate, but based on these preliminary results one can hypothesize that there are more than 200 different GRNs for human alone, because this corresponds approximately to the number of different cell types. However, pathological cells manifesting tumors also have their own characteristic networks, implying that there are probably thousands of different gene networks in human.

USAGE OF GENE REGULATORY NETWORKS
It is important to emphasize that the inference of gene regulatory networks is not the final result, but these networks are supposed to help in solving a number of different biological and biomedical problems.

CAUSAL MAP OF MOLECULAR INTERACTIONS
Perhaps the most frequently named use of gene regulatory networks is to serve as a "map" or a "blueprint" of molecular interactions. In this respect such a network can be used to derive novel biological hypotheses about molecular interactions, e.g., for the transcription regulation of genes, which can then be investigated in wet lab experiments by using, e.g., ChIP-chip and gene expression experiments (Bussemaker et al., 2001; Basso et al., 2005). In such a case GRNs represent causal biochemical interactions because the predicted links are supposed to correspond to actual physical binding events between molecules. It is important to note that the inference of such causal interactions between gene products is a challenging task, because it goes beyond the mere association between such entities, which would also include indirect relations/interactions involving intermediate gene products. However, despite the limitations of association networks, it is interesting to note that such networks also capture valuable biological information. An important aspect of this application is that the GRNs represent statistically significant predictions of molecular interactions obtained from large-scale data. Given the very large number of potential interactions between ∼20,000 genes in human and ∼6000 genes in yeast, the GRNs are of tremendous help in narrowing these numbers down to potential interactions for which statistical support is available. Overall, this enables more effective experiments by an adapted experimental design.
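To give a sense of the scale mentioned above, the following short Python sketch counts the potential pairwise interactions for the approximate gene numbers quoted in the text. It assumes undirected interactions between distinct genes, so the count is n(n-1)/2; directed interactions or self-regulation would give larger numbers. The function name `potential_interactions` is introduced here for illustration only.

```python
from math import comb


def potential_interactions(n_genes):
    """Number of unordered gene pairs, n * (n - 1) / 2 (undirected, no self-loops)."""
    return comb(n_genes, 2)


human = potential_interactions(20_000)  # 199,990,000 candidate pairs
yeast = potential_interactions(6_000)   # 17,997,000 candidate pairs
print(f"human: {human:,}  yeast: {yeast:,}")
```

Even under this conservative counting, the search space for human is on the order of 200 million gene pairs, which illustrates why statistically supported candidate interactions are helpful for experimental design.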

EXPERIMENTAL DESIGN AND PERTURBATION EXPERIMENTS
An underappreciated application of gene regulatory networks is to use them for guiding the design of new experiments. Specifically, many high-throughput experiments are screening experiments generating observational data. That means these experiments are not controlled by establishing conditions that enhance molecular target processes to improve their signal strength; they merely "observe" the state of the system as it is, without interventions or perturbations. A downside of such screening experiments is that the signal about, e.g., certain pathways may be too low to be inferable by statistical means. However, using prior knowledge about "partial" gene regulatory networks inferred from such observational data may allow one to overcome these obstacles systematically and help in designing perturbation or intervention experiments to stimulate the molecular system purposefully. That means that by identifying the parts of the molecular system that are not well detected, based on inferred networks, dedicated perturbations can be constructed to boost their active responses.

NETWORKS AS BIOMARKERS
In recent studies, it has been argued that (sub-)networks could also be used as biomarkers, e.g., for diagnostic, predictive or prognostic purposes (Chuang et al., 2007; Ben-Hamo and Efroni, 2011; Chen et al., 2011; Dehmer et al., 2013a). This is particularly plausible for a complex disorder like cancer, because the hallmarks of cancer are represented by pathways rather than individual genes (Hanahan and Weinberg, 2011) and the crucial aspect of pathways is that their constituent genes are actively interacting with each other. For this reason, network-based biomarkers can be seen as statistical measures that consider the interaction structure between individual genes explicitly. In contrast, biomarkers based on individual genes neglect these interactions completely. For further applications of network-based biomarkers see also Dehmer et al. (2013b).
In the near future, we expect to see similar applications for other types of complex disorders as well, because, despite their differences, all of them share a need for considering interaction changes. Unfortunately, developing network-based biomarkers is considerably more complex than using univariate and multivariate gene signatures. Also, quantitatively, it remains unclear to date what gain, if any, one should expect from this new type of biomarker (Staiger et al., 2012).

COMPARATIVE NETWORK ANALYSIS
As more and more gene regulatory networks from different physiological and disease conditions become available, it will be possible to statistically compare these networks (Dehmer and Emmert-Streib, 2007; Dehmer and Mehler, 2007). This will allow us to learn about interaction changes across different physiological or disease conditions and enrich our biological and biomedical understanding of such phenotypes (Ideker and Krogan, 2012; Islam et al., 2013). Besides classical comparative measures such as the graph edit distance or the Zelinka distance, topological indices could also be employed for such an analysis; see Dehmer et al. (2013c). It might be challenging to determine which similarity or distance measures are suitable for such a comparative network analysis, and different types of networks as well as different biological questions may require different approaches; see, for instance, Sharan and Ideker (2006); Przulj (2007); Mithani et al. (2011); Pache and Aloy (2012) for protein interaction networks or metabolic networks.
However, in order for this approach to succeed it will be necessary to establish databases, similar to sequence or protein structure databases, that provide free access to the inferred gene regulatory networks from different physiological and disease conditions. To this end, it may be necessary to form an international coalition, because the effort to establish such a database and interactive access interfaces is anticipated to be larger than that for sequence databases.
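As a simple illustration of comparing two inferred networks, the following Python sketch computes an edge-based edit distance and the Jaccard similarity of the edge sets. It assumes that genes in both networks are matched by name and that edges are undirected; under these assumptions, with unit costs for inserting or deleting an edge, the edit distance reduces to the number of edges present in exactly one network. This is a restricted special case, not the general graph edit distance or the Zelinka distance, and the two toy networks (labeled `healthy` and `disease`) are hypothetical.

```python
def edge_set(edges):
    """Normalize undirected edges so that (a, b) and (b, a) denote the same edge."""
    return {tuple(sorted(e)) for e in edges}


def edge_edit_distance(net1, net2):
    """Number of edge insertions and deletions needed to turn net1 into net2,
    assuming genes are matched by name (size of the symmetric difference)."""
    return len(edge_set(net1) ^ edge_set(net2))


def jaccard_similarity(net1, net2):
    """Shared edges divided by all edges occurring in either network."""
    a, b = edge_set(net1), edge_set(net2)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical networks for two conditions over the same four genes.
healthy = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
disease = [("g2", "g1"), ("g2", "g3"), ("g1", "g4")]

distance = edge_edit_distance(healthy, disease)    # g3-g4 lost, g1-g4 gained -> 2
similarity = jaccard_similarity(healthy, disease)  # 2 shared of 4 total edges -> 0.5
print(distance, similarity)
```

Such edge-level measures ignore higher-order structure, which is one reason why different biological questions may call for other distance measures or topological indices.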

NETWORK MEDICINE AND DRUG DESIGN
To establish a network medicine that is useful for clinicians, it will be necessary to integrate different types of gene networks with each other (Shapira et al., 2009; Barabási et al., 2011), because each network type carries information about particular molecular aspects. For example, whereas the transcriptional regulatory network contains only information about the controlling regulations of gene expression, protein interaction networks represent information about protein-protein complexes. Taken together, an integration of various important molecular interaction types results in a comprehensive overview of regulatory programs and organizational architectures. Also, information about temporal changes in the network structure is important to understand immune response, infection and differentiation processes (Rozenblatt-Rosen et al., 2012; Yosef et al., 2013).
The utilization of gene networks is also indispensable for a more efficient design of rational drugs (Ghosh and Basu, 2012; Fortney et al., 2013). For this reason, both subjects would profit tremendously if more large-scale gene expression data were available together with, e.g., survival data and drug-dose response information. This would allow the creation of, e.g., a connectivity map (Lamb et al., 2006) that is based on the similarity of molecular interaction networks rather than on the mere similarity of expression profiles. Overall, this would help us on our way to a more personalized medicine (Chan and Ginsburg, 2011), because condition-specific gene regulatory networks are closer to the phenotype than genetic or epigenetic markers; see Figure 1 for a visualization.

KNOWLEDGE PLATFORM FOR MEDICAL AND CLINICAL PRACTICE
It is important to emphasize that gene regulatory networks are not the final outcome of a biological or biomedical study, but an intermediate result. For this reason, interrogation platforms are needed that allow the downstream analysis of such networks. Specifically, aside from databases that store inferred GRNs, network analysis tools and visualization layouts are needed that allow an easy integration with biological and clinical information, e.g., in the form of GO and KEGG databases or clinical patient and general epidemiological data. Such a knowledge platform should hold cross-disease information similar to OMIM (Online Mendelian Inheritance in Man). Furthermore, it would be desirable for such a knowledge platform to have an intuitive interface that also allows non-technical experts to explore GRNs. For a practical realization, the tranSMART platform (Athey et al., 2013), for example, could be utilized. TranSMART is based on the open source i2b2 (informatics for integrating biology and the bedside) framework, sponsored by the NIH Roadmap NCBC, to provide clinicians with the tools to integrate clinical genomics with medical patient record data. An attractive feature of tranSMART is that it can be combined with Galaxy, an open, web-based user interface, which allows the connection to a variety of programming languages. That means a researcher without specific bioinformatics expertise can utilize R scripts, e.g., provided by Bioconductor, CRAN or individually developed packages, via a web-based graphical user interface for analyzing GRNs. Importantly, Galaxy also offers several mechanisms to ensure the reproducibility of research results.

CONCLUSION
In this paper, we discussed important aspects of gene regulatory networks inferred from gene expression data. Due to the multifaceted nature of GRNs, for which we gave some examples in this paper, a discussion about these networks cannot be one-dimensional because this would give a misleading impression of their meaning and potential usage. For this reason, we tried to provide a broad discussion touching upon a variety of different aspects to emphasize the intriguing depth offered by gene regulatory networks.
We think that neither the future of biology nor that of medicine is conceivable without gene networks in general, of which gene regulatory networks form an important subtype, because such networks can be seen as a practical embodiment of systems biology. However, in order to exploit and utilize such networks efficiently in molecular biology, cellular biology and the biomedical sciences, we need to establish comprehensive databases.

FUNDING
Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P26142). Matthias Dehmer also gratefully acknowledges funding from the Standortagentur Tirol (formerly Tiroler Zukunftsstiftung).