An Augmented High-Dimensional Graphical Lasso Method to Incorporate Prior Biological Knowledge for Global Network Learning

Biological networks are often inferred through Gaussian graphical models (GGMs) using gene or protein expression data only. GGMs identify conditional dependence by estimating a precision matrix between genes or proteins. However, conventional GGM approaches often ignore prior knowledge about protein-protein interactions (PPI). Recently, several groups have extended GGM to weighted graphical Lasso (wGlasso) and network-based gene set analysis (Netgsa) and have demonstrated the advantages of incorporating PPI information. However, these methods are either computationally intractable for large-scale data or disregard the weights in the PPI networks. To address these shortcomings, we extended the Netgsa approach and developed an augmented high-dimensional graphical Lasso (AhGlasso) method that incorporates edge weights from known PPI together with omics data for global network learning. In simulated large-scale data settings, this new method outperforms weighted graphical Lasso-based algorithms in computational time while achieving better or comparable prediction accuracy of node connections. AhGlasso runs approximately five times faster than the weighted Glasso methods when the graph size ranges from 1,000 to 3,000 with a fixed sample size (n = 300), and the runtime gap widens as the graph size increases. Using proteomic data from a study on chronic obstructive pulmonary disease, we demonstrate that AhGlasso improves protein network inference compared to the Netgsa approach by incorporating PPI information.

where $\psi_{ik} = \varepsilon_{i,-k}$ if $|\varepsilon_{i,-k}| < |\varepsilon_{k,-i}|$ and $\psi_{ik} = \varepsilon_{k,-i}$ otherwise. With the partial correlation coefficients, the network structure can be learned with the following Ψ algorithm proposed in a previous study (Liang et al., 2015):
Step 1, Correlation screening: Determine the reduced neighborhood for each variable X(i).
a) Conduct a multiple hypothesis test to identify the pairs of vertices for which the empirical correlation coefficient is significantly different from zero (the empirical correlation network).
b) For each variable X(i), identify its neighborhood in the empirical correlation network, and reduce the size of the neighborhood by removing the variables with lower correlation (in absolute value).
Step 2, Ψ-calculation: For each pair of vertices i and j, identify the separator $S_{ij}$ based on the reduced correlation network resulting from Step 1 and calculate $\Psi_{ij}$ by inverting the subsample covariance matrix.
Step 3, Ψ-screening: Conduct a multiple hypothesis test to identify the pairs of vertices for which $\Psi_{ij}$ is significantly different from zero. If $\Psi_{ij}$ is not significantly different from zero, the corresponding edge is set to zero to reduce dimensionality.
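The correlation screening in Step 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation or the procedure of Liang et al. (2015): the text does not specify the multiple hypothesis test, so the sketch assumes a Fisher z-test followed by Benjamini-Hochberg control, and the function names and the `max_neighbors` cap are hypothetical choices made only for illustration.

```python
import math
import numpy as np


def fisher_z_pvalues(data):
    """Two-sided p-values for H0: corr(i, j) = 0 (assumed Fisher z-test)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    iu = np.triu_indices(p, k=1)
    r = np.clip(corr[iu], -0.999999, 0.999999)  # keep arctanh finite
    z = np.arctanh(r) * math.sqrt(n - 3)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for a standard normal
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return corr, iu, pvals


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.nonzero(passed)[0].max()
        reject[order[:cutoff + 1]] = True
    return reject


def correlation_screening(data, alpha=0.05, max_neighbors=10):
    """Step 1 sketch: reduced neighborhood of each variable.

    (a) keep pairs whose empirical correlation is significantly nonzero;
    (b) within each neighborhood, keep only the `max_neighbors` variables
        with the largest absolute correlation (hypothetical cap).
    """
    corr, (rows, cols), pvals = fisher_z_pvalues(data)
    keep = benjamini_hochberg(pvals, alpha)
    p = corr.shape[0]
    adj = np.zeros((p, p), dtype=bool)
    adj[rows[keep], cols[keep]] = True
    adj |= adj.T  # symmetric empirical correlation network
    neighborhoods = {}
    for i in range(p):
        nbrs = np.nonzero(adj[i])[0]
        strongest = nbrs[np.argsort(-np.abs(corr[i, nbrs]))][:max_neighbors]
        neighborhoods[i] = sorted(int(j) for j in strongest)
    return neighborhoods
```

Calling `correlation_screening(data)` on an n-by-p expression matrix returns a dictionary mapping each variable index to its reduced neighborhood. Later steps would search for separators only within these neighborhoods, which is where the reduction in computation comes from.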
Similar to the huge R library, we adapted the correlation screening step of the Ψ-algorithm to AhGlasso to reduce the size of the potential neighborhoods and speed up the estimation.

[Figure S1: panels A and B; x-axis: n]

Figure S1. Method performance comparisons in a non-scale-free random network. The simulated random network graph included 500 (p) nodes. The overlap between prior information and the target true network is 88%. With the same true network and its corresponding covariance matrix (Σ_true), we created multivariate normal expression data of various sample sizes (n) for testing. We estimated the true network topology by using two weighted graphical LASSO methods (wGlasso 2015 and wGlasso 2017), Netgsa, and the proposed AhGlasso method. The λ was optimized for each method with its designated criterion, as shown in Table 1. The F1 score and MCC were calculated based on the estimated network and the true network. For each simulation setting, the simulations were repeated 5 times. The lines represent the mean scores for the simulated sample size and the error bars represent the standard error of the mean for each method. Of note, similar results were achieved in simulations with various p and n.

Figure S2. Method performance comparisons in a non-scale-free random network. The simulated random network graph included 500 (p) nodes. The overlap between prior information and the target true network varied as indicated. With the same true network and its corresponding covariance matrix (Σ_true), we created multivariate normal expression data for testing with n = 300. We estimated the true network topology by using two weighted graphical LASSO methods (wGlasso 2015 and wGlasso 2017), Netgsa, and the proposed AhGlasso method. The λ was optimized for each method with its designated criterion, as shown in Table 1. The F1 score and MCC were calculated based on the estimated network and the true network. For each simulation setting, the simulations were repeated 5 times. The lines represent the mean scores for the simulated sample size and the error bars represent the standard error of the mean for each method.

Expected, the expected number of proteins in a pathway if we randomly selected 40 proteins from 1212 background proteins; Adjusted P value, Benjamini-Hochberg adjusted P value to control the false discovery rate.

Table S2. GO enrichment of the top 40 hub proteins in the AhGlasso-estimated network.

GO.ID
Annotated, number of proteins in a pathway from the complete set of 1212 proteins; Significant, number of proteins in a pathway from the 40 hub proteins.

COPDGene Phase 3

Grant Support and Disclaimer
The project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health.

COPD Foundation Funding
COPDGene is also supported by the COPD Foundation through contributions made to an Industry Advisory Board that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion.