TECHNOLOGY REPORT article
CoExpNetViz: Comparative Co-Expression Networks Construction and Visualization Tool
- 1Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie, Ghent, Belgium
- 2Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- 3Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- 4Department of Plant Sciences and the Environment, Weizmann Institute of Science, Rehovot, Israel
- 5Genomics Research Institute, University of Pretoria, Pretoria, South Africa
Motivation: Comparative transcriptomics is a common approach in functional gene discovery efforts. It allows for finding conserved co-expression patterns between orthologous genes in closely related plant species, suggesting that these genes potentially share similar function and regulation. Several efficient co-expression-based tools have been commonly used in plant research, but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allows plant researchers to make use of their own unpublished gene expression data for performing a comparative co-expression analysis and generating multi-species co-expression networks.
Results: We introduce CoExpNetViz, a computational tool that takes as input a set of user-chosen query or “bait” genes and a minimum of one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using mutual information and Pearson correlation coefficients, (ii) non-bait (or target) genes are grouped based on cross-species orthology, and (iii) output files are generated and results can be visualized as network graphs in Cytoscape.
Availability: The CoExpNetViz tool is freely available both as a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/; implemented in C++) and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support Linux and Windows platforms.
A biological pathway is represented by a set of molecular entities (e.g., genes) that are involved in a given biological process and often interact with each other. Filling the gaps in our current knowledge with respect to biological pathways is a fundamental challenge. Although current insight into some biological pathways is substantial and useful for systems-level analyses, not all genes that participate in these pathways and affect their function are known, and even in extensively studied model plants such as Arabidopsis and rice, many genes still lack experimental functional annotation (Rhee and Mutwil, 2014). Furthermore, many other biological pathways still exhibit major information gaps, for instance those generating specialized metabolites (formerly known as secondary metabolites) in plants (Hansen et al., 2014; Tohge et al., 2014).
Advancements in computational approaches and robust statistical methods, along with the ever-increasing availability of transcriptomics data sets, provide an excellent platform for gene discovery in unresolved or partly known pathways. Given the premise that genes participating in the same biological process might possess more similar expression patterns than expected by chance, co-expression is one of the most widely used functional gene discovery methods to fill gaps in metabolic pathways. Moreover, co-expression analysis allows the transfer of information from model (e.g., Arabidopsis, tomato, rice, maize, etc.) to non-model plant species (Stuart et al., 2003; Usadel et al., 2009; Heyndrickx and Vandepoele, 2012; Tzfadia et al., 2012; Movahedi et al., 2012; Itkin et al., 2013; Amar et al., 2014; Rhee and Mutwil, 2014).
So far, most reports on functional gene discovery via co-expression analysis in plants described the use of transcriptome data for a single species. Finding conserved co-expression patterns between orthologs across related plant species can provide a highly relevant list of candidate genes that potentially share similar functions and act in the same pathways (Hirai et al., 2005; Obayashi et al., 2007; Usadel et al., 2009; Mutwil et al., 2010, 2011; Movahedi et al., 2012; Hansen et al., 2014; Tohge et al., 2014). An example of using such a strategy was recently described for tomato and potato, where comparative co-expression information was utilized for constructing a co-expression network, leading to the discovery of a metabolic gene cluster related to the steroidal glycoalkaloids (SGAs) pathway (Itkin et al., 2013).
Several co-expression-based tools have been commonly used in plant research, such as ATTED-II (Obayashi et al., 2007), PlaNet (Mutwil et al., 2011), GeneCAT (Mutwil et al., 2008), CORNET (De Bodt et al., 2012), ComPlEx (Netotea et al., 2014), PODC (Ohyanagi et al., 2015) and Expressolog (Patel et al., 2012). However, most of these pipelines are limited to data from model systems, which limits their utility. In addition, none of these pipelines allows plant researchers to make use of their own (custom and/or unpublished) gene expression data for performing a comparative co-expression analysis and generating multi-species co-expression networks. Moreover, downstream analysis is usually hindered by the inaccessibility of the output files and networks.
Here, we introduce a user-friendly tool, called CoExpNetViz, that allows biologists to use their own transcriptomics data generated in multiple species of their choice for cross-species co-expression analysis. CoExpNetViz can use any number of query or “bait” genes (from one or multiple species) that are known to be involved in the same biological process or pathway. CoExpNetViz can be used to search for new genes involved in a common process, or to find functional orthologs of the bait genes (Figure 1). The output includes files for visualizing the network in Cytoscape (Shannon et al., 2003), and correlation matrices for the given bait genes. Additionally, the user can apply network clustering (hub detection), GO enrichment, and network property analyses using other Cytoscape plug-ins.
Figure 1. The algorithm workflow of CoExpNetViz. (A) The user input is (i) a number of genes of interest (the query or “bait” genes), along with (ii) gene expression data and (iii) two cutoff values. In this example, three Arabidopsis thaliana (squares) and two Solanum lycopersicum (triangles) bait genes are chosen. (B) CoExpNetViz will (1) calculate a correlation matrix for every species individually and translate correlations above the positive cutoff value or below the negative cutoff value into edges in a network. Then (2) all non-bait genes are grouped into gene families and (3) the families are ordered into the same partition if they share links with the same set of bait genes. (C) After running the analysis, the output can be visualized in Cytoscape.
Finally, to illustrate the utility of CoExpNetViz, we describe a case study in which we recreated the comparative co-expression network in tomato (Solanum lycopersicum) and potato (Solanum tuberosum) for finding SGA-related genes, as performed earlier (manually) in Itkin et al. (2013).
The graphical interface for the comparative co-expression construction console is available as both a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/index.php; Figures 2A,B) and as a Cytoscape plug-in (see Appendix B; Figures B1–5, in Supplementary File 1).
Figure 2. Screen shots of the CoExpNetViz web interface. (A) The home page of CoExpNetViz, containing links to documentation for both developers of the Cytoscape app (Supplementary File 1, see Appendix A) and users (Supplementary File 1, see Appendix B), and a link for downloading the Cytoscape plugin. (B) The page for submitting a job to CoExpNetViz.
2. Materials and Methods
CoExpNetViz takes as input a set of “bait genes” from one or multiple species and one preprocessed expression data set per species. Normally, genes known to be involved in the same biological process are chosen as the baits. CoExpNetViz will then determine which gene families have co-expressed genes with these bait genes. The user specifies both negative and positive Pearson correlation coefficient cutoff values, which will be used to determine if two genes are co-expressed.
2.2. Correlation Calculation
Many algorithms for calculating co-expression exist. Here we use the Pearson correlation coefficient (PCC), which is the most popular measure, and a custom Python implementation of mutual information (Song et al., 2012) as measures of similarity between expression profiles. For each species, a correlation value is calculated for every bait gene x = (x1, x2, …, xn) and every other gene y = (y1, y2, …, yn). By default we consider genes to be co-expressed if their correlation falls below the 5th or above the 95th percentile of a sample distribution of expression correlations based on the similarity between expression profiles for 1,000 random genes (approximately 1,000*999*0.5 gene pairs). Bait genes that are not present in the species' data set are discarded. This step results in the generation of a correlation matrix of bait and target genes for each of the species analyzed (see steps B.1–3 in Figure 1).
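The correlation step can be illustrated with a minimal sketch, which is not the CoExpNetViz implementation. It assumes that a species' expression data is held as a dictionary mapping gene identifiers to equal-length lists of expression values; the function names (`pearson`, `percentile_cutoffs`, `bait_correlations`), the nearest-rank percentile rule, and the treatment of flat profiles are assumptions made for illustration. Mutual information is omitted.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def percentile_cutoffs(expression, sample_size=1000, lower=5, upper=95, seed=0):
    """Estimate (negative, positive) cutoffs from correlations among random genes."""
    rng = random.Random(seed)
    genes = sorted(expression)
    if len(genes) < 2:
        raise ValueError("at least two genes are needed to sample gene pairs")
    sample = rng.sample(genes, min(sample_size, len(genes)))
    values = sorted(
        pearson(expression[a], expression[b]) for a, b in combinations(sample, 2)
    )

    def pick(p):
        # Nearest-rank percentile (an assumption; other definitions exist).
        rank = max(1, math.ceil(p / 100 * len(values)))
        return values[rank - 1]

    return pick(lower), pick(upper)


def bait_correlations(expression, baits):
    """Correlate every bait found in the data set with every other gene."""
    present = [b for b in baits if b in expression]  # absent baits are discarded
    return {
        (bait, gene): pearson(expression[bait], expression[gene])
        for bait in present
        for gene in expression
        if gene != bait
    }
```

In this sketch the cutoffs are estimated once per species and then applied to the bait correlations of that species, so that each data set is judged against its own background distribution.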
2.3. Network Construction
After computing the correlation matrices, the positive and negative cut-off values (Vandepoele et al., 2009) are used to translate r-values into edges in a graph in which nodes represent genes and edges represent co-expression relationships.
An edge between two genes is retained only if the r-value is above the positive cutoff value or below the negative cutoff value (the latter indicating negative correlation). Finally, genes that have no edges are discarded (see step B.1 in Figure 1).
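As a hedged illustration of this thresholding step, the sketch below assumes that correlations are stored as a dictionary keyed by (bait, gene) pairs; the function name `build_network` and this data layout are hypothetical and are not taken from the CoExpNetViz source code.

```python
def build_network(correlations, negative_cutoff, positive_cutoff):
    """Translate r-values into edges; return the kept edges and connected nodes."""
    edges = {}
    for (bait, gene), r in correlations.items():
        # Retain the pair only if it passes the positive or the negative cutoff.
        if r > positive_cutoff or r < negative_cutoff:
            edges[(bait, gene)] = r
    # Genes without any retained edge do not appear in `edges`,
    # so collecting edge endpoints discards them implicitly.
    nodes = {node for pair in edges for node in pair}
    return edges, nodes
```

For example, with cutoffs of -0.8 and 0.8, a pair with r = 0.9 or r = -0.85 yields an edge, whereas a pair with r = 0.1 does not.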
2.4. Grouping Homologous Genes
For grouping target (i.e., co-expressed) genes into homologous families, we used (sub)gene families as available in PLAZA (Proost et al., 2015). These gene families are the result of clustering genes based on sequence similarity using the Markov clustering-based Tribe-MCL algorithm (Enright et al., 2002), followed by a post-processing algorithm to identify outliers. A gene is defined as an outlier if it shows sequence similarity to only a limited number of genes in the gene family. Subfamilies are then inferred from the Tribe-MCL families using the Ortho-MCL algorithm (Li et al., 2003). The PLAZA platform has separate databases for monocotyledonous and dicotyledonous organisms. Yet, there are 10 species that are present in both databases: some of these species were included to serve as a reference to link both databases, while others function as out-groups. To allow CoExpNetViz to work with datasets from monocots and dicots simultaneously, the overlapping species in PLAZA have been used to create merged families that contain both monocotyledonous and dicotyledonous species.
All target genes that were retained in the previous step (network construction) are next grouped into one node if they belong to the same gene family. These new nodes, termed here “family nodes,” contain only genes that were present in the gene expression data and that are co-expressed with at least one bait gene. Using these family nodes, a new graph is created in which the nodes are either family nodes or bait genes and edges represent co-expression relationships. An edge is drawn between a family node and a bait gene if at least one gene in the family node is co-expressed with that bait gene (see step B.2 in Figure 1).
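The collapse of target genes into family nodes can be sketched as follows. The sketch assumes a gene-to-family lookup (for instance, one derived from PLAZA gene families) supplied as a plain dictionary; the function name `build_family_graph` and the handling of genes that lack a family assignment are assumptions for illustration only.

```python
from collections import defaultdict


def build_family_graph(edges, gene_to_family, baits):
    """Collapse co-expressed target genes into family nodes linked to baits."""
    family_members = defaultdict(set)  # family node -> co-expressed member genes
    family_baits = defaultdict(set)    # family node -> baits it is linked to
    for bait, gene in edges:
        if gene in baits:
            continue  # only non-bait (target) genes are grouped into families
        # Assumption: a gene without a known family forms its own singleton node.
        family = gene_to_family.get(gene, gene)
        family_members[family].add(gene)
        # One co-expressed member is enough to link the family node to the bait.
        family_baits[family].add(bait)
    return dict(family_members), dict(family_baits)
```

Because edges from several species can be passed in together, a family node may end up linked to baits from different species, which is what makes the resulting graph comparative.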
2.5. Grouping Family Nodes Into Partitions
A partition is a set of family nodes in which every family node (but not necessarily every gene in that family) is co-expressed with the same set of baits. Partitions are computed from the previously obtained graph by grouping family nodes that share the same neighboring bait nodes (see step B.3 in Figure 1). For example, in Figure 1, the purple nodes are all co-expressed with At5g23190 and Solyc02g014730, while all the pink nodes are also co-expressed with At3g11430 and At5g41040 in addition to At5g23190 and Solyc02g014730.
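Partitioning amounts to grouping family nodes by their exact set of neighboring baits. A minimal sketch, assuming the family-to-baits mapping is available as a dictionary of sets (the function name `partition_family_nodes` is hypothetical):

```python
from collections import defaultdict


def partition_family_nodes(family_baits):
    """Group family nodes that are linked to exactly the same set of baits."""
    partitions = defaultdict(set)
    for family, baits in family_baits.items():
        # A frozenset is hashable, so the bait set itself can serve as the key.
        partitions[frozenset(baits)].add(family)
    return dict(partitions)
```

Two family nodes fall into the same partition only when their bait sets are identical; a family linked to a superset of another family's baits therefore forms a separate partition, as with the purple and pink nodes in Figure 1.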
2.6. The CoExpNetViz Output
The CoExpNetViz web interface creates network files, which can be readily imported into Cytoscape. If the plug-in is available, the network will be loaded automatically into Cytoscape. Family nodes are displayed as circles with specific colors, where every circle represents one partition. Bait genes are displayed in white, and when using the web interface, are grouped in one circle at the top left. An advantage of the plugin compared to the web interface is that the plugin provides an enhanced layout algorithm to place the bait genes of different species into different circles. These circles are then placed at equal distances around the partitions containing the family nodes (Figure 1C).
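The exact layout of the network files produced by CoExpNetViz is not described here. As a hedged illustration only, the sketch below writes bait-to-family links in Cytoscape's simple interaction format (SIF), in which each tab-separated line holds a source node, an interaction type, and a target node. The function name `write_sif` and the interaction label `coexpressed` are assumptions, not part of the tool.

```python
import io


def write_sif(family_baits, handle, interaction="coexpressed"):
    """Write one SIF line per bait-to-family link, in sorted order."""
    for family in sorted(family_baits):
        for bait in sorted(family_baits[family]):
            # SIF line: source node, interaction type, target node (tab-separated).
            handle.write(f"{bait}\t{interaction}\t{family}\n")


# Usage: write to an in-memory buffer; a file opened for writing works the same way.
buffer = io.StringIO()
write_sif({"F1": {"b1", "b2"}}, buffer)
```

Node attributes such as the partition of each family node would have to be supplied separately, for example as a tab-separated table imported alongside the network.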
2.7. Downstream Analysis
Once the co-expression network is created and visualized in Cytoscape, users can take advantage of the plethora of plug-ins available in Cytoscape, which allow them to quickly and conveniently analyze different properties of the co-expression network. Here, we mention only a few key plug-ins; for a full list of plug-ins available in Cytoscape, we refer the reader to the Cytoscape user manual.
BiNGO (Maere et al., 2005) is a Cytoscape plugin to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. BiNGO maps the predominant functional themes of a given gene set on the GO hierarchy, and outputs this mapping as a Cytoscape graph. Additionally, it supports a wide range of organisms. MCODE (Bader and Hogue, 2003) is another plugin, which finds clusters (highly interconnected regions) in a large network.
The CoExpNetViz Cytoscape tool is written mainly in Java (Perl/BioPerl and Python were also used for parsing files into the desired format; see Supplementary File 1 for detailed descriptions). The website was implemented in C++, Perl, MySQL and Apache, and supports all major browsers (tested on Linux and Windows systems). All source code and binaries are freely available to non-commercial users for download at http://bioinformatics.psb.ugent.be/webtools/coexpr/index.php. The CoExpNetViz Cytoscape plugin was written in Java/OpenJDK (http://openjdk.java.net). We used Maven for building, documenting, and organizing the plugin (http://maven.apache.org), and OSGi for integrating the CoExpNetViz tool into the Cytoscape core program (http://www.osgi.org). For debugging and version control we used Git and GitHub (http://git-scm.com and https://github.com).
A recent publication by Itkin et al. (2013) presented a comparative co-expression analysis to discover new genes that participate in the SGA biosynthesis pathway in species of the Solanaceae family. Itkin and colleagues conducted a co-expression analysis in two individual species, namely tomato (Solanum lycopersicum) and potato (Solanum tuberosum). Genes co-expressed with GAME1 and GAME4 were determined in tomato. Orthologs of GAME1 and GAME4 were then determined in potato using BLAST. Next, co-expression analysis was carried out with GAME1 and GAME4 in tomato (SlGAME1 and SlGAME4) and with GAME1 and GAME4 in potato (StGAME1 and StGAME4). Careful examination of the co-expression network (see inner circle of Figure 3 in Itkin et al., 2013), combined with genomic clustering information and experimental validation, led to the discovery of an operon-like cluster of genes involved in SGA biosynthesis. To illustrate the features of CoExpNetViz for generating and visualizing cross-species co-expression networks, we used the bait genes SlGAME1, StGAME1, SlGAME4, and StGAME4 (as in Itkin et al., 2013) as input for CoExpNetViz. By analyzing the co-expression network obtained (Figure 3), a number of known SGA-related genes could be retrieved. CoExpNetViz successfully identified three glycosyltransferases (GAME10a, GAME17, and GAME18), a delta(24)-sterol reductase-like protein (GAME19), a bHLH transcription factor (GAME20) and a sterol reductase (GAME23). In addition, recent work showed that more genes that are co-expressed with the GAME genes in potato and tomato (yellow inner circle; see Figure 3) are involved in SGA biosynthesis (Sawai et al., 2014). Interestingly, CoExpNetViz provided additional candidate genes co-expressed with the GAME bait genes when compared to the candidate genes found in the co-expression network generated by Itkin et al. (2013).
These additional candidate genes found by CoExpNetViz are likely the result of using the PLAZA gene families to determine orthologous genes, which accounts for many-to-many orthology mappings and therefore increases the number of candidate genes relevant to the pathway examined.
Figure 3. Steroidal glycoalkaloids comparative gene co-expression network (from Itkin et al., 2013) reanalyzed by CoExpNetViz. Edges connect co-expressed genes (nodes) exhibiting an r-value greater than 0.8 with the bait genes. The color of the nodes of the co-expressed genes corresponds to the bait with which they were found to be co-expressed. The light purple circle of nodes represents shared homologs of co-expressed genes for bait genes from tomato (SlGAME1 and SlGAME4) and potato (StSGT1 and StGAME4). CoExpNetViz successfully identified three glycosyltransferases (GAME10a, GAME17, and GAME18), a delta(24)-sterol reductase-like protein (GAME19), a bHLH transcription factor (GAME20) and a sterol reductase (GAME23).
The aim of CoExpNetViz is to identify genes that are co-expressed with as many of the query or bait genes as possible, preferably across multiple species. Being co-expressed with orthologous genes (across different species, rather than in just one species) makes the candidate genes more robust, as they reflect an evolutionarily conserved gene expression pattern. The approach used by CoExpNetViz to find such conserved co-expression relationships is to first find co-expressed genes within one species and then group these links across multiple species using the concept of homology. CoExpNetViz could be further developed to offer more correlation methods. In addition, we would like to make it possible to easily use species that are not in PLAZA by inferring homology from parsed BLAST outputs.
OT designed the project, analyzed the data and wrote the MS. TD developed the web tool. SD developed the Cytoscape plug-in. KV analyzed the data and wrote the MS. AA designed the project, analyzed the data and wrote the MS. YP analyzed the data and wrote the MS. All authors read, revised and approved the MS.
The work in the AA lab was supported by the European Research Council grant SAMIT (no. 204575). We thank the Tom and Sondra Rykof Family Foundation for supporting the AA lab activity. AA is the incumbent of the Peter J. Cohn Professorial Chair. KV and YP acknowledge the Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks” Project (no 01MR0310W) of Ghent University. YVdP also acknowledges support from the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739 “DOUBLE-UP.”
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Guest Associate Editor Aaron Fait declares that, despite having previously collaborated with the author Oren Tzfadia, the review process was handled objectively.
The Authors would like to thank Thomas Van Parys and Michiel Van Bel for technical assistance.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/article/10.3389/fpls.2015.01194
Supplementary File 1. CoExpNetViz user and development manuals.
Amar, D., Frades, I., Danek, A., Goldberg, T., Sharma, S. K., Hedley, P. E., et al. (2014). Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case. BMC Plant Biol. 14:329. doi: 10.1186/s12870-014-0329-9
De Bodt, S., Hollunder, J., Nelissen, H., Meulemeester, N., and Inzé, D. (2012). CORNET 2.0: integrating plant coexpression, protein-protein interactions, regulatory interactions, gene associations and functional annotations. New Phytol. 195, 707–720. doi: 10.1111/j.1469-8137.2012.04184.x
Hansen, B. O., Vaid, N., Musialak-Lange, M., Janowski, M., and Mutwil, M. (2014). Elucidating gene function and function evolution through comparison of co-expression networks of plants. Front. Plant Sci. 5:394. doi: 10.3389/fpls.2014.00394
Heyndrickx, K. S., and Vandepoele, K. (2012). Systematic identification of functional plant modules through the integration of complementary data sources. Plant Physiol. 159, 884–901. doi: 10.1104/pp.112.196725
Hirai, M. Y., Klein, M., Fujikawa, Y., Yano, M., Goodenowe, D. B., Yamazaki, Y., et al. (2005). Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. J. Biol. Chem. 280, 25590–25595. doi: 10.1074/jbc.M502332200
Itkin, M., Heinig, U., Tzfadia, O., Bhide, P. A., Shinde, B., Cardenas, P., et al. (2013). Biosynthesis of anti-nutritional glycoalkaloids in Solanaceous crops is mediated by clustered pathway genes. Science 341, 175–179. doi: 10.1126/science.1240230
Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21, 3448–3449. doi: 10.1093/bioinformatics/bti551
Mutwil, M., Usadel, B., Schütte, M., Loraine, A., Ebenhöh, O., and Persson, S. (2010). Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol. 152, 29–43. doi: 10.1104/pp.109.145318
Mutwil, M., Klie, S., Tohge, T., Giorgi, F. M., Wilkins, O., Campbell, M. M., et al. (2011). PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell 23, 895–910. doi: 10.1105/tpc.111.083667
Obayashi, T., Kinoshita, K., Nakai, K., Shibaoka, M., Hayashi, S., Saeki, M., et al. (2007). ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis. Nucleic Acids Res. 35, 863–869. doi: 10.1093/nar/gkl783
Ohyanagi, H., Takano, T., Terashima, S., Kobayashi, M., Kanno, M., Morimoto, K., et al. (2015). Plant omics data center: an integrated web repository for interspecies gene expression networks with NLP-based curation. Plant Cell Physiol. 56:e9. doi: 10.1093/pcp/pcu188
Patel, R. V., Nahal, H. K., Breit, R., and Provart, N. J. (2012). BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species. Plant J. 71, 1038–1050. doi: 10.1111/j.1365-313X.2012.05055.x
Proost, S., Van Bel, M., Vaneechoutte, D., Van de Peer, Y., Inzé, D., Mueller-Roeber, B., et al. (2015). PLAZA 3.0: an access point for plant comparative genomics. Nucleic Acids Res. 43, 974–978. doi: 10.1093/nar/gku986
Sawai, S., Ohyama, K., Yasumoto, S., Seki, H., Sakuma, T., Yamamoto, T., et al. (2014). Sterol side chain reductase 2 is a key enzyme in the biosynthesis of cholesterol, the common precursor of toxic steroidal glycoalkaloids in potato. Plant Cell 26, 3763–3774. doi: 10.1105/tpc.114.130096
Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., et al. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504. doi: 10.1101/gr.1239303
Song, L., Langfelder, P., and Horvath, S. (2012). Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics 13:328. doi: 10.1186/1471-2105-13-328
Tzfadia, O., Amar, D., Bradbury, L. M., Wurtzel, E. T., and Shamir, R. (2012). The MORPH algorithm: ranking candidate genes for membership in Arabidopsis and tomato pathways. Plant Cell 24, 4389–4406. doi: 10.1105/tpc.112.104513
Usadel, B., Obayashi, T., Mutwil, M., Giorgi, F. M., Bassel, G. W., Tanimoto, M., et al. (2009). Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ. 32, 1633–1651. doi: 10.1111/j.1365-3040.2009.02040.x
Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L., and Van de Peer, Y. (2009). Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol. 150, 535–546. doi: 10.1104/pp.109.136028
Keywords: co-expression, comparative genomics, networks, cytoscape, plants
Citation: Tzfadia O, Diels T, De Meyer S, Vandepoele K, Aharoni A and Van de Peer Y (2016) CoExpNetViz: Comparative Co-Expression Networks Construction and Visualization Tool. Front. Plant Sci. 6:1194. doi: 10.3389/fpls.2015.01194
Received: 26 August 2015; Accepted: 11 December 2015;
Published: 05 January 2016.
Edited by: Aaron Fait, Ben Gurion University of the Negev, Israel
Reviewed by: Torgeir R. Hvidsten, Norwegian University of Life Sciences, Norway
Chuang Ma, Northwest Agricultural and Forestry University, China
Copyright © 2016 Tzfadia, Diels, De Meyer, Vandepoele, Aharoni and Van de Peer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Oren Tzfadia, firstname.lastname@example.org