Probabilistic edge inference of gene networks with markov random field-based bayesian learning

Current algorithms for gene regulatory network construction based on Gaussian graphical models focuses on the deterministic decision of whether an edge exists. Both the probabilistic inference of edge existence and the relative strength of edges are often overlooked, either because the computational algorithms cannot account for this uncertainty or because it is not straightforward in implementation. In this study, we combine the Bayesian Markov random field and the conditional autoregressive (CAR) model to tackle simultaneously these two tasks. The uncertainty of edge existence and the relative strength of edges can be measured and quantified based on a Bayesian model such as the CAR model and the spike-and-slab lasso prior. In addition, the strength of the edges can be utilized to prioritize the importance of the edges in a network graph. Simulations and a glioblastoma cancer study were carried out to assess the proposed model’s performance and to compare it with existing methods when a binary decision is of interest. The proposed approach shows stable performance and may provide novel structures with biological insights.


Field-based Bayesian Learning"
Supplementary section S1: Supporting information for simulation studies

Supplementary section S2: Supporting information for array-based gene expression data from the TCGA glioblastoma study.
The set of edges corresponding to the top 15% sample correlation and edges identified by SPACE were first derived and then the union of the two sets was taken, leading to a set of 99 edges and a sparsity around 0.3. Figure S1. Sample correlation matrix for the array-based gene expression data.     Numbers in the lower triangle represent the number of edges identified by the column method but not the row method.
For example, BMRF.P identifies 69 edges. Among them, 61 edges are identified by M&B as well, while 8 among the 69 are identified by BMRF.P but not by M&B. Table S5. List of selective interactions that were reported before. The symbol "" indicates an edge that was successfully identified by the corresponding column method, and the symbol "" indicates a failure to be identified. Following the same procedures in previous data analyses, only primary tumor tissues from male patients between 40 ~ 75 years were included in the analysis. In this application, we extracted the protein-protein interaction (PPI) network from the STRING database (https://string-db.org/) version 11.0b (https://version-11-0b.stringdb.org/) to select a gene set showing interaction with EGFR. EGFR, is commonly known as a frequent alteration and oncogenic molecule associated with GBM, and it has been the target clinical biomarker and prognostic factor in research (Heimberger et al., 2005;Eskilsson et al., 2018;Westphal et al., 2017). The aim is to estimate the regulatory relationship between genes in the PPI network around EGFR to examine the underlying biological mechanisms of GBM. The data for analysis consist of 30 genes and 83 primary tumor tissues.
All the implementations of BMRF.P are the same as before. The screening of possible interactions was first carried out by the data-driven procedures, where the union of the set containing lines of the top 15% sample correlation (the corresponding value is 0.41) and the set of lines identified by SPACE (57 edges) was derived. This results in a total of 80 edges and will be estimated by BMRF.P. Figure S6 shows the regulatory network constructed by BMRF.P. In this case, BMRF.P identified 55 edges with prob(edge) > 0.5, and the corresponding network structure is shown in Figure S6 (a). In particular, there are 41 interactions computed with edge probability higher than 0.7, and 20 edges with existing probability greater than 0.9.
The corresponding network structures constructed by the different probabilistic thresholds are presented in Figure S6 (b) and (c). By the strength of estimated existence probability, BMRF can prioritize the importance of identified edges so that we can highlight those highly evidential results in this analysis and construct the sub-network structure. Another merit of using probability to measure uncertainty is that the strength of probability is more interpretable and easy to implement. On the other hand, however, the magnitude of the tuning parameter of those lasso-based approaches, which can control the network sparsity, usually does not have a general criterion to interpret the exact value. The posterior distributions for the relative strength of those 20 edges with existence probability greater than 0.9 are illustrated by the violin plot in Figure S7. The posterior mean and 95% credible interval, which illustrate the relative partial correlation of these 20 edges, are summarized in Table S6. nodes. In other words, the hub nodes pointed out by the competing methods as well as by BMRF.P with a 0.5 threshold tells us that these nodes connect more edges than others, but the intensity of these edges can be weak. Instead, using a rigorous threshold, such as the probability greater than 0.9, the hub nodes identified by BMRF.P can be interpreted as not only connecting more edges than others but also the corresponding partial correlation intensity is vital. This finding again highlights the usefulness of determine target genes via probabilistic prioritization. Table S8 compares the similarity of findings between all methods. Consistent with the conclusion in simulation studies and the first application study, the findings from BMRF.P are usually more similar to M&B and SPACE, which showed better estimation performance against others in simulation studies. The Glasso and BM_BMA identified more and different edges than other methods, however, we found that these two methods tend to identify more false-positive signals in simulation studies.
We focus on those edges connected with the EGFR to gain some biologically meaningful findings. We first look at GAB1, which is also identified by BMRF.P as the hub node when adopting strict threshold. Much evidence shows that the GAB1 is involved in the signaling process of positive feedback activation to EGFR and further engages in the response of enhancing cell proliferation (Kapoor and DM O'Rourke, 2010;Azuaje et al., 2015). Another interaction with EGFR, the SPRY2, which has been shown to associate with a negative prognostic indicator for the survival of GBM patients, several papers have reported that the effect of SPRY2 knockdown on cell proliferation may drive the resistance to the EGFR inhibition and further associated with negative drug response (Walsh et al., 2015;Park et al., 2018;Day et al., 2020;TCGA 2008). These interactions can be served as potential targets for further research when focusing on EGFR-driven association to therapeutic targets of GBM. The regulatory network constructed by BMRF.P also identifies some important biomarkers. For example, the interaction between GAB1 and CD247, which the higher expression level of GAB1 may associate with shorter survival time for GMB patients (Liu et al., 2014) and the critical role of CD247 in T cell activation highly related to the overall survival for cancer patients (Wang et al., 2019) has been demonstrated, can also be a new biological insight for further GBM research.  Table S6. Details about the 20 edges (existence probability > 0.9).
(1) Gene1, Gene2: the two genes connected by the selected edge   Table S7. The top three hub nodes in each method's estimated network. In the first column, the value in the parentheses represents number of edges in the estimated network by each method. The value in the parentheses in columns 2 ~ 4 denotes the number of edges connected with hub genes in the network. BMRF.P with the different selected thresholds (0.5, 0.7, 0.9) are shown in the first three rows.