Structural Equation Modeling of In silico Perturbations

Gene expression is controlled by multiple regulators and their interactions. Data from genome-wide gene expression assays can be used to estimate molecular activities of regulators within a model organism and extrapolate them to biological processes in humans. This approach is valuable in studies to better understand complex human biological systems which may be involved in diseases and hence, have potential clinical relevance. In order to achieve this, it is necessary to infer gene interactions that are not directly observed (i.e. latent or hidden) by way of structural equation modeling (SEM) on the expression levels or activities of the downstream targets of regulator genes. Here we developed an R Shiny application, termed “Structural Equation Modeling of In silico Perturbations (SEMIPs)” to compute a two-sided t-statistic (T-score) from analysis of gene expression data, as a surrogate to gene activity in a given human specimen. SEMIPs can be used in either correlational studies between outcome variables of interest or subsequent model fitting on multiple variables. This application implements a 3-node SEM model that consists of two upstream regulators as input variables and one downstream reporter as an outcome variable to examine the significance of interactions among these variables. SEMIPs enables scientists to investigate gene interactions among three variables through computational and mathematical modeling (i.e. in silico). In a case study using SEMIPs, we have shown that putative direct downstream genes of the GATA Binding Protein 2 (GATA2) transcription factor are sufficient to infer its activities in silico for the conserved progesterone receptor (PGR)-GATA2-SRY-box transcription factor 17 (SOX17) genetic network in the human uterine endometrium.


Two-class bootstrap simulation
The primary objective of selecting SEM in our research and fundamental advantage of SEM is to allow researchers to derive the relationship between variables of interest when these variables are not directly measurable. In the proposed SEMIPs method, we tested the relationship via a 3-node SEM model among three variables in a complex genomic system. Each of these variables can either be a regulator that regulates a group of downstream genes or a readout of impact from some upstream regulator. The a two-sided t-statistic, naminglyT-scores are calculated based on the direction of these upstream/downstream signatures(genes), and then used in the SEM modeling.
When we have a group of signatures (genes) obtained from an experiment (i.e. a KEGG pathway analysis in our paper, data comes with the SEMIPs application at "/app_installation_dir /testData/bootstrap/"), we are interested in finding out whether a regulator (upstream/downstream) is associated with a factor (i.e. GATA2 in our example) in our SEM model. We chose to eliminate these group of signatures (genes) from the GATA2-related signatures. To provide an unbiased assessment of such analysis, we implemented a two-class bootstrap simulation method "elimination with replacement and elimination without replacement" (Figure 3).
In an elimination without replacement bootstrap analysis, it randomly eliminated the same number of signatures from this originate GATA2-related signatures, then re-calculate the T-score and re-evaluate the SEM model. In the paper, we suggest a 1,000 round of simulation to provide an empirical distribution for any non-parametric statistics test. On the other hand, in the elimination with replacement bootstrap analysis, after randomly eliminating the same number of signatures from this originate GATA2-related signatures, we replace the same number of "irrelevant" signatures back to the "shrunken" list. Then, we re-calculate the T-score and reevaluate the SEM model, we also suggest a 1000 round of simulation to provide an empirical distribution for any non-parametric statistics test.
The elimination without replacement simulation was used to test whether a regulator has any impact on our factor (i.e. GATA2) in term of function association; and the elimination with replacement simulation was used to rule out the possibility that the number of downstream signatures of a factor (i.e. GATA2) has any impact on its function. Both empirical distributions serve as the null hypothesis for the statistical testing.

Gene list preparation
The microarray gene expression data was analyzed using The Partek Genomics Suite 7.17 The published GATA2 occupancy information GEO accession: GSE40659 (Rubel et al. 2016) was first lifted from mm9 to mm10 genome assembly and then annotated by HOMER (Heinz et al. 2010) for the nearby genes. The obtained GATA2 ChIP-seq targets were mapped to the GATA2 signature from microarray data to identify the putative GATA2 direct downstream targets (GATA2 direct signature -Supplemental Table 1). The criteria used to selected GATA2 ChIP-seq targets was GATA2 binding at immediate promoter regions (+/-2kb of TSS).

The main steps to follow the use case example
Step 1. To get the T-score: Users can launch the App and import the 634 genes list (Supplemental Table 1) and HumanArray4Shiny comes with the App. By clicking the green "Go" button, the corresponding T score will then be calculated and can be download (shown in Supplemental Figure 1). We also provided this calculated T-score in Supplemental Table 2.
Step 2. To construct the dataset: Users need to open the _sampleDAT.txt under the "app_installation_dir/dataSEM/", i.e. /Users/li11/myGit/SEMIPs/dataSEM, append the new T -Score column from step 1 and name the header accordingly. We use "GATA2 Direct" in this use case. Please save the new file as "app_installation_dir/dataSEM/sampleDAT.txt".
Step 3. To run the SEM model: Users need to re-launch the app. Under the SEM tab, from the drop-down list select "GATA2 Direct", "PGR_act_FC13_P01", and "SOX17_lev" as show in  Table 2 (GATA2 direct gene list). The fitting statistics can be downloaded by clicking the "Download Results" button.