Edited by: Saliha Durmus, Gebze Technical University, Turkey
Reviewed by: Ikbal Agah Ince, Wageningen University and Research Centrum, Netherlands; Kazim Yalcin Arga, Marmara University, Turkey
*Correspondence: Jörg Linde, Department of Systems Biology and Bioinformatics, Leibniz-Institute for Natural Product Research and Infection Biology – Hans-Knoell-Institute, Beutenbergstr. 11a, 07745 Jena, Germany e-mail:
This article was submitted to Infectious Diseases, a section of the journal Frontiers in Microbiology.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Inference of inter-species gene regulatory networks based on gene expression data is an important computational method to predict pathogen-host interactions (PHIs). Both the experimental setup and the nature of PHIs exhibit certain characteristics. First, besides an environmental change, the battle between pathogen and host leads to a constantly changing environment and thus complex gene expression patterns. Second, there might be a delay until one of the organisms reacts. Third, toward later time points only one organism may survive leading to missing gene expression data of the other organism. Here, we account for PHI characteristics by extending NetGenerator, a network inference tool that predicts gene regulatory networks from gene expression time series data. We tested multiple modeling scenarios regarding the stimuli functions of the interaction network based on a benchmark example. We show that modeling perturbation of a PHI network by multiple stimuli better represents the underlying biological phenomena. Furthermore, we utilized the benchmark example to test the influence of missing data points on the inference performance. Our results suggest that PHI network inference with missing data is possible, but we recommend to provide complete time series data. Finally, we extended the NetGenerator tool to incorporate gene- and time point specific variances, because complex PHIs may lead to high variance in expression data. Sample variances are directly considered in the objective function of NetGenerator and indirectly by testing the robustness of interactions based on variance dependent disturbance of gene expression values. We evaluated the method of variance incorporation on dual RNA sequencing (RNA-Seq) data of
Organisms need to constantly adapt to environmental changes. On a molecular level, this is mediated by complex signaling cascades, which transmit the signal to cell nuclei. Transcription factors bind to their target genes, which consequently leads to a change in gene expression. This way, biological systems adapt to new environmental conditions.
In most cases underlying networks are unknown. This is especially interesting for interacting organisms, such as pathogens and host. Both the experimental setup and the nature of PHIs exhibit certain characteristics: (i) pathogen and host are in a battle leading to constantly changing conditions, (ii) a change in gene expression is triggered by new environmental conditions and the response of one organism might initiate faster or persist longer than the response of the other organism and (iii) two different organisms interact and eventually one survives which can lead to missing data time points.
The immune system of the host is permanently active to recognize and eliminate infectious microorganisms. As a first line of defense, components of the innate immune system such as the complement system, immune cells, and antimicrobial peptides recognize pathogen-associated molecular patterns (PAMPs). In contrast, pathogens developed many strategies to evade these mechanisms. They can shield microbe-associated cell surface proteins, mimic host surfaces or secrete proteases degrading host immune proteins (Zipfel et al.,
The transcriptome of pathogen and host can be measured by physical separation of pathogen and host cells before RNA extraction. This enables RNA extraction from pathogen and host at different time points. For example, Oosthuizen et al. (
Network inference is a systems biology approach which aims to reverse engineer underlying interaction networks based on gene expression data (Hecker et al.,
NetGenerator, a tool to infer small scale GRNs (Guthke et al.,
Hereafter, we discuss a variety of aspects for dual RNA-Seq data acquisition and processing. Furthermore, we describe the application of the extended NetGenerator version to infer an inter-species GRN based on dual RNA-Seq data. Even though we focus on the novel technique RNA-Seq, most parts of the described workflow can be applied to microarray data. We evaluate the impact of multiple input stimuli on the inference accuracy with NetGenerator based on a benchmark example. The extended NetGenerator version handles missing data values, which we demonstrate with the same benchmark example. We further extended the algorithm and its application to consider variances in replicated measurement data. This is directly embedded in the inference process and indirectly through a robustness analysis. We applied this method to a real dual RNA-Seq data set of murine dendritic cells infected with
RNA-Seq requires a certain amount of input RNA often in a microgram range, which is practically difficult to extract. Furthermore, mRNA should be enriched to avoid sequencing data being dominated by structural RNAs (Tariq et al.,
Furthermore, the pathogen-host cell ratio of the experimental setup, also known as multiplicity of infection (MOI), has to be considered. A high MOI results in more pathogenic RNA, but may also lead to a faster and stronger host response and less clinical relevance.
The number of reads required to achieve a good genome coverage in both species has to be estimated in advance. The number of reads needs to be calculated for the least abundant species based on the intended fold coverage, transcriptome size and read length. The total number of reads can be estimated through the ratio of the amount of extracted pathogen and host RNA.
Furthermore, sequencing parameters need to be set taking into account transcriptome sizes and how closely related studied species are. Number of reads, read length, strand-specificity and single / paired end sequencing have a great impact on the number of ambiguously mapped reads. For instance, Yazawa et al. (
Finally, data time points have to be determined. A change of the transcriptional program triggered by a stimulus is usually strong at the start of the response. Thus, in best case the organism adapts and the degree of transcriptional change decreases. The temporal onset and duration of transcriptional response of pathogen and host can be very different. To detect both responses, RNA extraction time points need to be chosen carefully. Small-scale experiments should be carried out in advance to determine good data time points.
Preprocessing and analysis of sequencing data and the selection of candidate genes is an important step in advance of network inference (Figure
Tools like featureCounts (Liao et al.,
Typically, hundreds of DEGs are found, of which a subset of candidate genes has to be selected. This number can be reduced, for instance by clustering gene expression kinetics (Bezdek,
We extended the heuristic network inference tool NetGenerator (see Data and Methods) and its application to predict PHI networks. NetGenerator requires logarithmic fold changes (logFCs) of gene expression time series data that can be obtained by various technologies, such as RNA-Seq or microarrays. Furthermore, the user of NetGenerator has to provide at least one input stimulus representing the external signal leading to a change in gene expression. Also, prior knowledge can be provided by the user to support the inference process (Figure
We generated a benchmark example to evaluate the influence of different stimuli and missing data on the inference performance (see Data and Methods). The benchmark comprised six data points of seven genes and two stimuli (Figure
Multiple stimuli trigger responses in both pathogen and host during infection, such as the mutual stimulation of pathogen and host. This can be translated into at least two stimuli—the host stimulating the pathogen and
First, only one constant stimulus (Test-1) set to a value of 1 was given. In a second test, an additional stimulus set to 0 until 30 min and set to 1 afterwards (Test-2) was given (Supplementary Table
We always observed noticeable larger F-measures given two stimuli in comparison to only one given stimulus. The difference in F-measure of Test-1 and Test-2 was up to 1.36 fold (Figure
It is conceivable that time series experiments of pathogen and host were carried out independently under comparable experimental conditions. In this case, it is possible to utilize the pathogen and host data sets to predict PHI networks. Thus, data time points might differ which leads to missing values at intermediate time points or at the end of the time series. In case of dual RNA-Seq, pathogen and host are collectively processed. This may lead to a reduced amount of sample RNA of either of the species resulting in missing gene expression data. This is a problem especially for later time points when one species may dye. We extended the NetGenerator algorithm to handle missing data values at intermediate time points (see Data and Methods). We evaluated the influence of missing data on the performance based on the benchmark example, prior knowledge data sets and two given stimuli as in Test-2 (Figure
We included data of one additional time point (165 min) for host genes, but additional data for pathogen genes were not given (Test-3). Thereby, we demonstrated the applicability of the extended NetGenerator version to data with missing values. We set the time point in such a way, that an additional data point covering the onset of the host response was provided and observed a noticeable increase of F-measure (Figure
NetGenerator requires complete data for the last time point. In case of missing measurements at the end of the time range for a subset of candidate genes, their values must be obtained in a different way and provided by the user. Here, we set the last time point to its preceding value (Test-4). We found slightly greater F-measures for Test-2 in comparison to Test-4 independent of the number of given prior knowledge. We observed a maximal difference between the F-measures between Test-2 and Test-4 (0.02) given four, six and eight prior knowledge interactions (Figure
Various differential expression analysis tools are available that calculate fold changes from multiple replicates. However, fold changes alone cannot reflect the degree of gene- and time point specific variances. This variance might be high especially regarding complex biological systems such as PHIs where cells from two species constantly interact and change the environment. However, biological variances can be considered in the network inference process to obtain robust predictions. For this purpose, we extended and applied NetGenerator to incorporate variances within the algorithm and in an outer robustness analysis. The extended NetGenerator algorithm was applied to one of the first published dual RNA-Seq data sets (Tierney et al.,
Variances from replicated measurements were incorporated in the objective function of NetGenerator and need to be provided by the user. We calculated variances of the dual RNA-Seq data set of Tierney et al. (
We predicted a GRN (Supplementary Figure
Furthermore, variances were considered in an outer robustness analysis which we carried out based on the data of Tierney et al. (
Tierney et al. (
In this study, we propose a workflow for dual RNA-Seq data acquisition, data processing and inter-species network inference. Furthermore, we describe how to handle a different temporal onset of transcriptional changes, missing data and how to integrate variances from replicated measurements based on the extended NetGenerator algorithm.
In a dual transcriptome data set we expect the onset of the pathogen and host transcriptional response at different time points. So far, several infection-related transcriptome studies of fungi were carried out. Transcriptome data was generated already at two to three time points within 60 min after infection (Linde et al.,
On the other hand, it takes some time until the host recognizes a pathogen. Moyes et al. (
However, we do not see a delayed transcriptional response of host DEGs in comparison to pathogen DEGs in the data set of Tierney et al. (
NetGenerator requires time series gene expression data, at least one stimulus function and optionally prior knowledge. LogFCs are passed to NetGenerator in form of a data matrix, where columns correspond to candidate genes and rows to measured time points.
PHIs are very complex systems, but available data is limited regarding the number of time points and replicates. Furthermore, transcriptome data do not provide any information about processes taking place as for instance on protein level and in the extracellular space. Therefore, it has to be considered that predicted PHIs are indirect, when they are interpreted.
A GRN can be understood as a biological system that adapts to external, environmental stimuli yielding changes in gene expression. NetGenerator can integrate multiple stimuli and requires one function per stimulus representing it.
Many biological processes can be interpreted as external stimuli triggering responses in both pathogen and host cells during infection. In a typical experimental setup the host is incubated with the pathogen stimulating both organisms. The host recognizes PAMPs on pathogen cell surfaces by pathogen recognition receptors (PRRs). This initiates an information flow through signaling cascades (Akira et al.,
We found that multiple stimuli functions improve network inference results significantly. Therefore, we recommended to provide two or more stimuli functions for inter-species network inference. One option to model the stimulus representing the influence of the host on the pathogen is a constant function. Therewith, the stimulus is active from time point zero onwards and models an early pathogen transcriptional response.
More options for stimuli functions are possible when real experiments are carried out. For example, the number of differentially expressed host and pathogen genes can be determined for every time point and translated into stimuli functions. This can be done by scaling the number of DEGs to a range from zero to one. Additional measurements, e.g., cytokine release or cell contacts, can also be used as a basis for stimuli functions. Of particular interest is the growth curve of the pathogen, which we recommend to measure and integrate in the stimuli functions. Nevertheless, many biological events trigger responses, of which not all can be integrated in the network inference.
Optionally, the user of NetGenerator can provide prior knowledge about interactions of candidate genes. This is strongly recommended to reduce the search space resulting from the large number of possible interactions (Hecker et al.,
Prior knowledge about interactions in GRNs originates from published results that were transferred to databases. PHI databases like PHISTO (Tekir et al.,
Host specific prior knowledge can be extracted manually from literature or automatically with text mining tools. Pathway Studio is a text mining tool specific for mammals (Nikitin et al.,
As well, organism specific websites exist for pathogens, e.g., Aspergillus Genome Database (Cerqueira et al.,
For both host and pathogen transcription factor binding motifs and binding sites can be obtained from databases, e.g., TRANSFAC (Matys et al.,
We extended the NetGenerator algorithm and its application to incorporate variances from replicated measurements in the inference method and in a robustness analysis. The output provides guidance for experimental validation of predicted interactions.
Inference methods should take into account the variance of replicates, because this additional information improves the parameter estimation. Under the assumption of independent Gaussian distributed noise the minimization of the objective function (Equation 4) corresponds to a Maximum Likelihood Estimator (MLE) (see e.g., Klipp et al.,
In previous publications a similar robustness analysis was carried out with the same standard deviation for each gene and time point set to a fixed value (Linde et al.,
We performed the robustness analysis for the data of Tierney et al. (
The application of the robustness analysis is beneficial in many ways. It provides a ranking of predicted interactions based on noise added to the data. This makes it easier to decide, which predicted interactions should be experimentally verified. Furthermore, NetGenerator is a heuristic algorithm, which means that not all possible solutions are tested. It is likely, that not the best solution is returned, but a good one. The robustness analysis generates many good solutions resulting in a consensus network. It also accounts for possible mutually contradictory predictions.
Network inference was carried out by the NetGenerator algorithm (see Guthke et al.,
The NetGenerator heuristics infers GRNs from time series gene expression data of multiple experiments and multiple stimulation. Expression data (logFCs), stimuli functions and prior knowledge (optionally) have to be provided by the user. Stimuli are factors that (directly or indirectly) cause changes in gene expression. It is assumed, that stimuli are not influenced by genes or their products, at least in the experimental setup. Nevertheless, stimuli values may evolve over time.
The inferred network model is described by a system of first order linear differential equations of the form
The change of gene expression
The second term evaluates the integration of prior knowledge, see (Weber et al.,
described the error between measured data
NetGenerator was extended to account for missing data values. Now, NetGenerator accepts missing values at intermediate time points provided by the user as “NA.” Internally, the time vector of the respective output is adjusted and interpolation is carried out based on existing measurement data. During inference, both simulation and objective function (Equation 4) can process that information of missing and replaced values.
The objective function
Therefore, the variances σ2 of the logFCs became additional input arguments to NetGenerator. Larger variances decrease the objective function value which effectively allows for a larger error between associated measured and simulated values in comparison to measurements of smaller variance.
Moreover, variances are considered in an outer robustness analysis by predicting GRNs based on disturbed logFCs. To simulate the measurement process, we sampled three replicates of Gaussian distributed logFCs (mean = measured logFC, σ = standard deviation of replicates) and determined their mean. This resulted in a noisy logFC for each candidate gene and time point used as input for extended NetGenerator. We repeated this process 500 times.
For better visualization of the robustness analysis results we introduced the bubble map (Figure
with
A big circle represents a frequently predicted interaction. Small or no circles represent rarely or no predicted interactions. Pie charts show the ratio of inferred activating (orange) and inhibiting (blue) interactions. Note, that the diagonal represents autoregulations. Exact robustness scores depending on how frequently an edge was predicted and corresponding objective function values of the predicted GRN are available as additional output (Supplementary Table
Both the extended version of the objective function and the robustness analysis require variances derived from data. The gene- and time point specific variance σ2
The respective standard deviations σ
We constructed a benchmark system composed of differential equations representing the logFC time series data of three pathogen genes, four host genes and two stimuli. The network topology included 21 directed, signed edges representing interactions. Common biological motifs like feed forward loops and feedback loops are integrated, too. Based on this topology we set up a system of differential equations and simulated this model with the R-package deSolve (Soetaert et al.,
As mentioned before, an additional input to guide network inference is prior knowledge. We generated a prior knowledge data set for the benchmark data by randomly sampling two interactions of the known network topology and repeated this 50 times. 50% of sampled prior knowledge is signed (activation or inhibition) and 50% is unspecific. Likewise, we generated prior knowledge data sets of four, six and eight interactions.
To evaluate predicted GRNs we computed statistical measures that compare the known topology to the predicted topology. Sensitivity (SE), specificity (SP), precision (PR) and F-measure (FM) are calculated as:
taking the number of true positives (TP), false positives not part of the known topology (FPn), false positives with wrong sign (FPs), true negatives (TN) and false negatives (FN) into account (Weber et al.,
We utilized one of the first dual RNA-Seq data sets published by Tierney et al. (
The predicted interactions of
The extended NetGenerator 2.3.-0 tool is available at
Conception and design of the investigation and work: all. Analyzing the properties of PHIs: Sylvie Schulze, Jörg Linde, and Reinhard Guthke. Implementation of NetGenerator and contribution to mathematical background: Sebastian G. Henkel and Dominik Driesch. Data processing, application of computational algorithm and evaluation of results: Sylvie Schulze, Sebastian G. Henkel, and Jörg Linde. Drafting the manuscript: Sylvie Schulze and Sebastian G. Henkel. Revising it critically for important intellectual content and final approval of the version to be published: all. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved: all.
Sylvie Schulze and Jörg Linde are supported by the Deutsche Forschungsgemeinschaft (DFG) CRC/Transregio 124 “Pathogenic fungi and their human host: Networks of interaction,” subproject B3 (Sylvie Schulze) and subproject INF (Jörg Linde). Sebastian G. Henkel and Dominik Driesch are supported within the Virtual Liver Network funded by the German Federal Ministry of Education and Research (BMBF, Fkz. 0315760).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Uwe Menzel, Steffen Priebe and Sebastian Vlaic for valuable discussions about data processing and modeling.
The Supplementary Material for this article can be found online at:
1