Bipartite networks represent causality better than simple networks: evidence, algorithms, and applications

A network, whose nodes are genes and whose directed edges represent positive or negative influences of a regulatory gene and its targets, is often used as a representation of causality. To infer a network, researchers often develop a machine learning model and then evaluate the model based on its match with experimentally verified “gold standard” edges. The desired result of such a model is a network that may extend the gold standard edges. Since networks are a form of visual representation, one can compare their utility with architectural or machine blueprints. Blueprints are clearly useful because they provide precise guidance to builders in construction. If the primary role of gene regulatory networks is to characterize causality, then such networks should be good tools of prediction because prediction is the actionable benefit of knowing causality. But are they? In this paper, we compare prediction quality based on “gold standard” regulatory edges from previous experimental work with non-linear models inferred from time series data across four different species. We show that the same non-linear machine learning models have better predictive performance, with improvements from 5.3% to 25.3% in terms of the reduction in the root mean square error (RMSE) compared with the same models based on the gold standard edges. Having established that networks fail to characterize causality properly, we suggest that causality research should focus on four goals: (i) predictive accuracy; (ii) a parsimonious enumeration of predictive regulatory genes for each target gene g; (iii) the identification of disjoint sets of predictive regulatory genes for each target g of roughly equal accuracy; and (iv) the construction of a bipartite network (whose node types are genes and models) representation of causality. We provide algorithms for all goals.

2022), while others like GENIE3 (Huynh-Thu et al., 2010), BiXGBoost (Zheng et al., 2019), OutPredict (Cirrone et al., 2020), and SCENIC (Van de Sande et al., 2020) choose non-linear regression models for the same goal.The main metric of these methods is conformance to some gold standard (GS) network.However, let us consider the actionable result of such research: to influence the behavior of an organism to make it more useful (e.g., more drought-tolerant crop or one with higher nutrient yield).
This paper asks the question "Are networks a good representation for actionable insights?"Because the edges are simple edges between pairs of regulatory and target genes, the network representation does not suggest any kind of synergy between the putative causal regulatory genes, e.g., transcription factors.So, the natural model choice for a given target gene g given the network is a linear model on the genes pointing to g.
Any causality model should be able to make reasonably accurate predictions.In Newtonian mechanics, e.g., a model involving mass and gravity will be able to predict the speed curve of a ball on an inclined plane.Predictive accuracy is not sufficient to establish causality.Some mysterious force might cause the ball to move in that way, but a causal model should be predictive.We consider predictive accuracy to be a necessary condition of a causal model.
We now consider several approaches to prediction.
1. Starting with all transcription factors (TFs) as possible causal features, both a non-linear random forest (RF)-style model M nonlin and a linear model M lin are tried.2. Based on the random forest model mentioned above, a minimal set of TFs that could produce similar regression results in a model M minimal are iteratively searched for.3. On the edges of a gold standard network for some target gene g, a non-linear model M nonlin or a linear model M lin is used.4. A random forest model that uses the same number of TFs for each target g is formed as known from the gold standard network.Those TFs are chosen according to their feature importance starting from M nonlin in approach 1 above.
As discussed later, (i) the non-linear models work better than the linear models and (ii) starting with all transcription factors and then shrinking that set based on model accuracy is better than using the gold standard network.
Our models are all based on time series in which we predict the mRNA expression level of the target gene based on the state of regulatory genes at the previous time point.This agrees with the biological intuition that the state of causal regulatory genes takes at least minutes to affect their targets.One implication of this approach to model building is that if a transcription factor T and a target gene g are correlated (e.g., they rise in the same time points and fall in the same time points), T will not be identified as causal.On the other hand, if T rising (respectively, falling) at one time point were associated with g rising (respectively, falling) at the next time point, then causality might be hypothesized.2 Materials and methods

Expression prediction setup
All the experiments carried out in this study focus on time series RNA sequencing (RNA-seq) data because gene regulation through transcription factors is a temporal causal process.Following this logic, we build regression models that predict the expression of each target gene based on the TF expression levels from a previous time point in the time series.Formally, suppose we are given time series RNA-seq data consisting of sequencing data from time points t 0 , t 1 , t 2 , . ..., t i , . . ., t n .We use RNA-seq data at time t i to predict the expression of a target gene at t i+1 (i ≥ 0, i < n).
In order to split the whole time series into training/testing sets for validating the prediction quality of different methods, we chose to always reserve the tail end of the time series for testing while using the preceding part in training.More specifically, if n < 5, then only t n will be used as the test sample with t n−1 being the input.If 5 ≤ n < 10, t n and t n−1 are reserved for testing, with t n−1 and t n−2 as model input.If n ≥ 10, then the final three in the series constitute the testing set.

RNA sequencing data used
Bulk time series RNA-seq data from four different species with varying experimental setups were used, totaling more than 100 data points for each species.The experimental sources for each species are given here.For the training/testing setups below, we train on a prefix of the time points and test on the remaining time points.
Thus, our study derives from 4 well-studied living organisms ranging from bacteria to humans.We sought datasets having time series RNA-seq data with relatively tight timing intervals (no greater than 4 h in all cases) and suitably long series (≥4 time points) to form training/testing splits.We collected as much public bulk RNA-seq data about the four species as possible given the above constraints.Because the data came from widely different species (from bacteria to human), we expect that our qualitative conclusions will be generalizable.For each species, we obtained a GS regulatory network from the sources given in Table 1.

Metrics
Because RNA-seq counts are strongly dependent on the amount of cellular material that is read, relative expression is a better metric to determine induction or repression than absolute expression.For that reason, we measure expression based on the z-score of the normalized RNA-seq counts in the form of transcripts per kilobase million (TPM): To compare the performance of each method, we measure how accurate the regression results were by checking the error of the prediction on the test set for each of the target genes.More specifically, the root mean square error (RMSE) of the model prediction in the test set for the expression of each target gene was compared across different regression models.Because every regression model was trained/fitted to make predictions on the same set of time series expression samples for each target gene in question, the performance metrics can be compared based on a paired test.For this purpose, we use a non-parametric paired test (Katari et al., 2021).

Methods compared
For the purpose of predicting the expression of each target gene on a future unseen time point, we fitted four different types of regression models, as described above: 1.An RF model that takes the expression of all TFs as input.

A ridge regression (linear regression with L2 regularization)
model that takes the expression of all TFs as input.3. A random forest model that takes only the expression of TFs known from the GS network for each particular target gene as input.4. A ridge regression model that takes only the expression of TFs known from the GS network for each particular target gene as input.
Next, we test how good the transcription factors from the GS network are compared to the same number of transcription factors derived from a non-linear model.For each target gene g, let k g be the number of transcription factors in the GS network that point to g.In addition to the tests above, we compare a random forest on those GS TFs against a random forest for g based on the top k g TFs found using method 1 above.The idea is to test the usefulness of GS edges for prediction.One may argue that GS edges are inferred using different methods-usually by modifying single regulatory genes and observing their effect-and, therefore, should not necessarily be useful for prediction but could still be useful if modifying a single gene is all that is possible for practical reasons.We do not contest their utility for such purposes.We do, however, aim to evaluate their predictive power in a synergistic setting (i.e., when potentially several regulatory genes can be simultaneously modified).Finally, using the method given in Section 3, we construct a minimal random forest model for each target gene g on the training set and view its result on the test set.We choose random forest because decision tree-based regression models have proven to be among the best methods in gene regulatory tasks (Huynh-Thu et al., 2010;Moerman et al., 2019;Zheng et al., (2019).We did not expand our model selection because the main focus of our work is not to find the best fitted machine learning model for the task but rather to demonstrate a novel approach to the representation of potential causality in gene regulation.

Algorithms to construct bipartite causality graphs
Having chosen prediction as the metric for causality, we now turn to the other three goals of our proposed framework: 1. Finding minimal sets of predictive TFs. 2. Finding disjoint minimal sets that have p-valueindistinguishable predictive accuracy.3. Creating a bipartite visual representation of causality.

Minimal sets of predictive transcription factors
Efron (2020) noted that disjoint sets of causal factors often have similar predictive accuracy.Inspired by this observation, we propose the following strategy.For a given target gene, a random forest predictor that takes the expression levels of all known TFs is fit to predict the expression of the target.Then, the number of TFs are iteratively cut in half based on their feature importance in the fitted model until a further reduction results in a statistically significantly (p-value < 0.05) worse-performing random forest.We refer to the final remaining set of TFs as the "minimal TF set per target."The pseudo-code for this feature selection process is shown in Algorithm 1.
The histograms given in Figure 1 show the distribution of the size of minimal TF sets yielded for each target gene for the four species we investigated.These size distributions show that most of the minimal sets consist of a rather small number of TFs.When compared with the distribution of GS network coverage for each target gene in

Finding disjoint sets of predictive transaction factors
After finding a minimal set of predictive TFs, our algorithm performs a new round of TF searches to discover disjoint sets of roughly equally predictive TFs.Algorithm 2 describes the process for finding all such disjoint sets of a given target gene.Similar to the minimal set search algorithm, we based our iterative search on the random forest regression that takes all available TFs U as input.Rather than stopping after a minimal set S1 is found, we test if using all remaining TFs (U − S1) could also produce a regression prediction as good as the baseline.If that is the case, we carry on a similar feature reduction process that ends with a new "minimal set" S2 from U − S1.This process then repeats with U − (S1 ∪ S2) and continues until the baseline performance cannot be beaten or there are no TFs left.For each target gene g, we define the collection of TABLE 2 Gold standard network edge coverage per target gene for different RNA-seq species compared with the size of minimal transcription factor set sizes per target gene, derived using random forest regression.Distributions are presented as the median followed by the interquartile range (IQR), which is the range between the 25th and 75th percentile of the data.The one exception is yeast, where the gold standard edges are much more numerous.As shown in the tables above, the minimal sets generally have better predictive power than the gold standard sets and roughly the same number of edges per target.Frontiers in Genetics frontiersin.org06 Shen et al. 10.3389/fgene.2024.1371607minimal sets discovered this way as the minimal disjoint sets of predictive transcription factors for g or MinDisjoints(g) for short.

Median
We then surveyed the distribution of how many MinDisjoints are found for each target gene across the four species.The histograms given in Figure 2 show that most of the target genes have a small number of disjoint sets of TFs associated with them, while some target genes have a large number of MinDisjoints.Our analysis did not yield biological mechanisms for predictability/causality, so we have no mechanistic explanation for how multiple disjoint sets of TFs might control the same target gene.However, the result was not wholly unexpected, given the well-known redundancy in biological systems.Algorithm 2. Disjoint sets of TFs: Finding minimal sets of disjoint TFs (MinDisjoints), where each minimal set has the same error as using all TFs.

Bipartite network representation
Networks have a pleasing visual representation, especially when focusing on one or a few target genes.However, what we showed is that the network itself is a poor basis for prediction.Now that we have constructed multiple disjoint sets of predictive TFs for each target gene g, we propose a bipartite representation for them.The bipartite representation for each target gene g consists of a model node m corresponding to each disjoint set d m from D(g).The TFs from d m in turn point to m.
Suppose that TFs A, B, and C through model M(A, B, C) provide good predictions regarding target gene g.Suppose further that TFs D, E, F, and H provide roughly equally good predictions on g.The classic gene regulatory approach would be a graph with arrows from A, B, C, D, E, F, and H all pointing to g.The bipartite approach would suggest instead to show a bipartite graph that would have A, B, and C point to a model node, which, in turn, points to g, and have D, E, F, and H point to a different model node, which also points to g.
To demonstrate this new representation, we picked one example for each species we studied, as shown in Figure 3. Here, we specifically highlighted one interesting scenario: a set of TFs were found to form one of the disjoint sets for more than one target gene.Such a relationship between two genes would not have been found in a simple network representation.The bipartite representation reveals group effects that would not otherwise be evident.

Comparison of approaches
Figure 4 shows the accuracy of the six different modeling approaches listed in Section 2.4.Basically, feeding expression information from all the TFs into a random forest ("RF with all TFs") yielded the best outcome.Relying solely on known GS edges ("RF with GS TFs") usually performed poorly, even compared to using the same number of TFs for each target gene from the random forest model ("RF with top TFs").
We note that linear models on all TFs are competitive and sometimes better than random forests on minimal TF sets for B. subtilis and mouse.Still, overall, given the same input information, random forests perform better than linear models, which is the main point of that comparison.
Tables 3-7 list the detailed pairwise non-parametric results comparing the performance of all possible pairs of models.The tables show that using all TFs in the regression yields the highest prediction accuracy.Finding a minimal set of the most important TFs yields almost the same accuracy as using all TFs.
A question to ask is what biological meaning disjoint sets of transcription factors could have for a given target gene g.Our computational analysis does not provide a biological meaning other than predictive ability.Experimentalists might take various disjoint sets of TFs and manipulate them to achieve some desired effect on a target gene.The choice of such sets may depend on the side effects such manipulation might have on other genes.This is a direction for future work.

Batch effects
While the z-score takes care of quantity bias in different tests, batch effects may cause predictions on batch A based on data from batch A to be superior to predictions on batch A from data on many batches.This is a limitation of any predictive model in biology.
To test this, we created our models based on multiple batches and tested them on the tail end of all those batches.We compared that approach with batch-by-batch predictions.Figure 5 shows that the same model trained on all batches of data achieves the same level or better predictive performance than when using batch X data on batch X tail, for each batch X.

Ensemble of disjoint sets of transcription factors
Another potential use of the identification of disjoint sets of predictive TFs stems from the fact that each disjoint set represents a regression model for the prediction of the target gene.For all the disjoint sets, we found a target gene g, and the regression model of each disjoint set can provide a prediction about the expression of the target gene, given the expression input of the TFs at the previous time point.As has been shown in many previous studies in both general machine learning and gene network inference (Dietterich, 2000); Marbach et al., 2012); Sagi and Rokach, 2018); Ganaie et al., 2022), an ensemble consisting of the arithmetic mean of these model predictions may lead to an overall better performing prediction.Inspired by those results, we compared the predictive performance of this ensemble of disjoint sets of TFs to that of all other RF-based Bipartite representation of causality.Light circular orange nodes represent non-linear models that take transcription factors (dark orange rectangles) as inputs and produce predictions on a single target gene (blue).Here, we show a particular case where disjoint sets of TFs can form high-quality prediction models for one target gene, and the same TF can be in models for several target genes.
regressions we discussed before, and the results are given in Table 8.In most cases, this ensemble yielded regression accuracies second only to the model that takes all TFs as input.
A complete list of all the minimal sets and disjoint sets of TFs for each target gene we surveyed in this study is given in Supplementary Table S1-S5.

An application: optimizing gene expression
Suppose our goal is to cause a gene g to be expressed at a certain level.We observed that the GS network, even when available, provides quite poor predictions.A better approach is to start with a good Root mean square error (RMSE) performance (lower is better) of different regression models compared across four species; error bars represent the standard error of each group.When we compare all the models for each of the tested target genes, a paired non-parametric test can be applied between each pair of models to see if the performance is statistically different.The best performing models that are statistically indistinguishable this way are marked with *. "RF with all TFs" is always one of the best performing models.The caption of Table 2 provides a glossary of terms.
predictive model for g on a small number of TFs T and then to determine values of the TFs in T that might lead to the desired expression level of g.This goal is supported by the three goals of our framework: to find a good model, reduce the number of TFs while preserving accuracy, and find possible alternative sets of TFs that also yield high prediction accuracy.Gene regulatory networks do not provide natural guidance for any goal like this.
Thus, the bipartite network approach provides an actionable approach to causality.At the same time, it provides (i) a visualization that shows alternative ways to manipulate a target gene and (ii) a simple ensemble approach to prediction.

Empirical findings
Our empirical findings are as follows: • We confirm previous observations (Pratapa et al., 2020); Zhao et al., 2021) that non-linear models generally yield better results (as measured by RMSE) than linear models.• Using all TFs yields better predictive results than using the TFs from GS edges.For each target gene g, there often exist several disjoint minimal sets (mostly of size eight or less) that yield predictive accuracy nearly as high as all TFs.• Using all batches of each species together for training yields results on the time series test tails of each batch that are as good as or better than using each batch on its own test tail.• For a given target gene g, forming a model consisting of the most influential k g TFs in a non-linear model (e.g., random forests) as measured on the training set, where k g is the number of TFs in the GS network that point to g, yields better prediction accuracy on the test set than using the same kind of model on the GS TFs.This superiority holds for all the species we tested from yeast with a Random forest regression performance differences measured across different data batches for different species.Here, all batches results are default and presented relists shown in this work, where the random forest model using all TFs was trained on training data from all batches and tested on all the tail testing parts from different batches.The same model was then trained and tested on individual batches for its respective high-variance target genes.Note that for Arabidopsis, one singular batch was used, so such comparison was unnecessary, and for B. subtilis, the first batch does not have highvariance target genes in its testing set, so the comparison was also omitted.
Frontiers in Genetics frontiersin.org13 Shen et al. 10.3389/fgene.2024.1371607mean value of 53 TFs for each target gene to B. subtilis with a mean value of 1.9 TFs for each target gene.

Conclusion
Based on our empirical findings, we suggest a framework for studying causality in gene regulation having three main features.
First, the figure of merit for causality should be predictive accuracy rather than conformance with "gold standard" edges.One reason is epistemic: any causal model should be predictive.Another reason is pragmatic: prediction is useful if we want to manipulate some property such as the expression of a target gene.
Second, the network representation of such causality should be a bipartite graph consisting of gene (including transcription factor) nodes and model nodes.Such graphs encode the synergy of multiple TFs in the model nodes.
Third, the bipartite representation may include many model nodes that point to the same target gene, where each model node has a disjoint set of TFs as input.A single TF plays a role in disjoint sets of several target genes.
In addition to suggesting a modified approach to causality research for transcriptional regulation, we assert that our framework is applicable beyond transcriptional causality.Our main future work is to apply this form of analysis to other multifactor causality domains.We welcome other researchers to try this approach and offer our software to help.
with F to predict target gene G 5: E all ← training Error of M all 6: F m ← MINIMALSET(G, F, E all ) 7: Add F m to D 8: F r ← F \ F m 9: flag ← True 10: while F r ≠ Ø and flag == True do 11: M r ← initialized regression model 12: Fit M r with F r to predict target gene G 13: E r ← training Error of M r 14: if E r > E all , with statistical significance then a given target gene G, D will be the set of disjoint sets of TFs G.

FIGURE 5
FIGURE 5 Our novel contribution is a framework, algorithms, and software for encoding possible causality in transcriptional settings into a bipartite directed graph.Our framework consists of the following workflow:1.A machine learning method M that predicts the behavior of a target gene g starting with all possible causal regulatory elements (transcription factors for genomic networks) as candidates is chosen.M may be statistical, a neural network, a forest, or a linear model.We do not advocate any particular model, although non-linear models generally have lower errors than linear ones.2. The set of possible causal regulatory elements for g is reduced to a smaller set S 1 that provides statistically equal (based on the p-value) accuracy still based on M. 3. Inspired by an observation of the statistician Efron (2020), a set D of mutually disjoint sets S 1 , S 2 . . .S n is found that all provide statistically the same accuracy as S 1 in predicting the expression of some target gene g, possibly by training a new model for each S i .D may contain S 1 alone or may contain many sets.4. A visual representation of these mutually disjoint subsets of regulatory elements is provided for each given target gene g.
The visualization consists of a bipartite graph in which each transcription factor of the disjoint subsets (S 1 , S 2 . . .S n ) of transcription factors feeds a model node whose output is the target gene g.

TABLE 1
Information on the gold standard (GS) networks used in this study.The number of target genes and transcription factors are for genes that are both present in the regulatory network and the RNA-seq data for each species, respectively.
FIGURE 1Size distribution of a minimal transcription factor (TF) set for each target gene.For every species, the majority of genes are best predicted by fewer than 8 transcription factors, i.e., 96.6% for yeast, 73.5% for B. subtilis, 98.7% for Arabidopsis, 71.9% for mice, and 97.7% for humans.
Table 2, we see that an accurate regression model constructed this way usually has fewer input transcription factors compared to the GS networks.
Algorithm 1. Minimal TF set per target: For each target gene G, initial set of TFs, and RMSE E, the set of necessary transcription factors are repeatedly reduced by half until the error increases significantly with respect to E. A minimal set of TFs for a given target gene G will be a call to this function MinimalSet(G, all TFs, 0).