<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Epidemiol.</journal-id>
<journal-title>Frontiers in Epidemiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Epidemiol.</abbrev-journal-title>
<issn pub-type="epub">2674-1199</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fepid.2022.899655</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Epidemiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Causal discovery in high-dimensional, multicollinear datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Jia</surname> <given-names>Minxue</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1677524/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Yuan</surname> <given-names>Daniel Y.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1677513/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Lovelace</surname> <given-names>Tyler C.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1693087/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hu</surname> <given-names>Mengying</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1957404/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Benos</surname> <given-names>Panayiotis V.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/195027/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computational and Systems Biology, University of Pittsburgh School of Medicine</institution>, <addr-line>Pittsburgh, PA</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Joint Carnegie Mellon - University of Pittsburgh Computational Biology PhD Program</institution>, <addr-line>Pittsburgh, PA</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Epidemiology, University of Florida</institution>, <addr-line>Gainesville, FL</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Marco Piccininni, Charit&#x000E9; - Universit&#x000E4;tsmedizin Berlin, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jaeyoon Chung, Boston University, United States; Lan Zou, Sichuan University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Panayiotis V. Benos <email>pbenos&#x00040;ufl.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Research Methods and Advances in Epidemiology, a section of the journal Frontiers in Epidemiology</p></fn>
<fn fn-type="equal" id="fn002"><p>&#x02020;These authors have contributed equally to this work and share first authorship</p></fn></author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>09</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>899655</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>03</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>08</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Jia, Yuan, Lovelace, Hu and Benos.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Jia, Yuan, Lovelace, Hu and Benos</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>As the cost of high-throughput genomic sequencing technology declines, its application in clinical research becomes increasingly popular. The collected datasets often contain tens or hundreds of thousands of biological features that need to be mined to extract meaningful information. One area of particular interest is discovering underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and expanded to infer such relationships. However, these algorithms suffer from the curse of dimensionality and multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been demonstrated to successfully infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high dimensionality and collinearity problems, inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had COVID-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known COVID-19 related biological pathways.</p></abstract>
<kwd-group>
<kwd>causal discovery</kwd>
<kwd>dimensionality reduction</kwd>
<kwd>latent factors</kwd>
<kwd>collinearity</kwd>
<kwd>empirical bayes matrix factorization</kwd>
</kwd-group>
<contract-num rid="cn001">F31LM013966</contract-num>
<contract-num rid="cn001">R01HL127349</contract-num>
<contract-num rid="cn001">R01HL157879</contract-num>
<contract-num rid="cn001">R01HL159805</contract-num>
<contract-num rid="cn001">T32CA082084</contract-num>
<contract-sponsor id="cn001">National Heart, Lung, and Blood Institute<named-content content-type="fundref-id">10.13039/100000050</named-content></contract-sponsor>
<counts>
<fig-count count="10"/>
<table-count count="3"/>
<equation-count count="4"/>
<ref-count count="70"/>
<page-count count="16"/>
<word-count count="9054"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Technological advances have allowed the cost-effective collection of high-throughput data in unprecedented volume and rate. Such data can be used to uncover biological mechanisms of disease and develop predictors of disease status and progression (<xref ref-type="bibr" rid="B1">1</xref>, <xref ref-type="bibr" rid="B2">2</xref>), response to drugs (<xref ref-type="bibr" rid="B3">3</xref>, <xref ref-type="bibr" rid="B4">4</xref>) or define disease subtypes (<xref ref-type="bibr" rid="B5">5</xref>). However, such high-throughput datasets, with thousands to millions of features present two major problems to most analytical methods: high dimensionality, which creates a complexity problem, and high collinearity of the variables due to the underlying biological structure. For example, genes regulated by the same transcription factors (TFs) are usually co-expressed, genetic variants (single nucleotide polymorphisms&#x02014;SNPs) in close proximity co-segregate, and expression of microRNAs correlates with their target genes.</p>
<p>The analytical questions on such data are usually two-fold: (1) mechanistic and (2) predictive. In the first category, the objective is to understand the mechanism that causes a biological phenomenon or a clinical outcome. The focus here is on uncovering the complex relations between features in the dataset. For example, how one gene affects the expression of another gene, how smoking, age, or sex affects gene expression, or whether a SNP contributes to drug response or clinical outcomes. The second question is about our ability to predict an outcome. Causal learning methods are emerging as a flexible tool for addressing both these types of questions (<xref ref-type="bibr" rid="B6">6</xref>, <xref ref-type="bibr" rid="B7">7</xref>). For example, directed graphs learned from such observational data can be used to infer regulatory interactions between genes (<xref ref-type="bibr" rid="B5">5</xref>, <xref ref-type="bibr" rid="B8">8</xref>) and the Markov blanket of an outcome can be used to build an efficient predictor of it (<xref ref-type="bibr" rid="B9">9</xref>, <xref ref-type="bibr" rid="B10">10</xref>). However, causal learning methods also suffer from the curse of dimensionality and feature collinearity, which limit their applicability to high-throughput omics datasets.</p>
<p>To cope with these problems, supervised (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B12">12</xref>) and unsupervised (<xref ref-type="bibr" rid="B13">13</xref>) methods for inferring latent feature spaces have been developed and used for identifying regulatory modules (<xref ref-type="bibr" rid="B14">14</xref>, <xref ref-type="bibr" rid="B15">15</xref>) and for clustering or prediction (<xref ref-type="bibr" rid="B13">13</xref>, <xref ref-type="bibr" rid="B16">16</xref>). Although much of this work has been applied to gene expression, it has also seen success in identifying biologically meaningful signals in DNA methylation data (<xref ref-type="bibr" rid="B17">17</xref>). In causal modeling literature, there are algorithms that can learn causal graphs in the presence of latent confounders (<xref ref-type="bibr" rid="B18">18</xref>&#x02013;<xref ref-type="bibr" rid="B21">21</xref>), but they usually do not model the observed features as the result of those latent variables. However, it is expected that combining dimensionality reduction methods with causal learning methods will solve both problems that hinder causal discovery in modern biomedical datasets. Recently, a new method (CausER) (<xref ref-type="bibr" rid="B22">22</xref>) has been introduced, which first uses a form of soft clustering to infer latent factors that can explain all observed variables; and then uses causal discovery on the latent factors to build predictors of clinical outcomes.</p>
<p>Gene (mRNA) expression is regulated by proteins (transcription factors) that bind to DNA cis-regulatory modules of the genes. However, the protein levels of TFs are typically not measured in the same set up as the mRNA expression. In addition, the concordance between mRNA and protein levels can be as low as 25% (<xref ref-type="bibr" rid="B23">23</xref>, <xref ref-type="bibr" rid="B24">24</xref>), and other factors, like chromatin accessibility and the presence of enhancers, might affect expression of a given gene. <xref ref-type="fig" rid="F1">Figure 1A</xref> represents a simplified model on how accessible and active promoter and enhancer regions can regulate the transcriptional program of cells by binding to TFs. Promoters regulate the transcription of the downstream gene, while enhancers can loop over long-range genomic regions to interact with distal promoters. Thus, the expression of genes is mainly controlled by the accessibility of active enhancers and promoters to TF. We propose that this regulation model also suggests a possible solution.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>(A)</bold> Regulation of gene expression. <bold>(B)</bold> Causal Graph including relationship among observed features, latent factors, and target features.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0001.tif"/>
</fig>
<p>Suppose we have a system where the observed variables are independent conditioned on all latent features. If we can extract latent features from the raw features, we can build a causal model between latent features, where there will be no direct interactions within the raw variables. Any causal associations found among latent features can then examined further by looking at the raw variables that comprise them. Such is the gene expression model depicted in <xref ref-type="fig" rid="F1">Figure 1B</xref>. Gene-gene interactions do not happen at the mRNA level (the measured property) and there are multiple non-measured steps that are involved (translation, post-translational modifications, localization, etc.). These unmeasured events can be approximately represented by latent factors. In this scenario, the latent factors can capture TF activity (protein synthesis rate, post-translational modifications, localization, etc.) as well as genetic and epigenetic factors (such as chromosome structure, microRNAs, etc.), which also affect gene expression in a collective way (<xref ref-type="bibr" rid="B25">25</xref>&#x02013;<xref ref-type="bibr" rid="B27">27</xref>).</p>
<p>In our paper, we introduce a new framework that combines a recently published approach for non-orthogonal dimensionality reduction (empirical Bayes approach to matrix factorization&#x02014;EBMF) (<xref ref-type="bibr" rid="B28">28</xref>) with causal discovery to concurrently address both high dimensionality and collinearity issues. We demonstrate the utility of this framework by making useful predictions on two biomedical problems.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>Methods</title>
<sec>
<title>High-dimensional CausalMGM workflow</title>
<p>Biomarkers or features identified from high-dimensional biological data through regression or statistical tests are informative for prediction but the observed variables (mRNA expression) may not be directly (causally) linked to disease driver mechanisms or events. We propose a new framework that can identify latent factors causally linked to clinical features without use of prior knowledge. Our proposed workflow is composed of two parts. First, we learn the latent factors from the observed variables. Second, we learn the causal associations among latent factors and between latents and the target features (e.g., outcomes) (<xref ref-type="fig" rid="F1">Figure 1B</xref>). For the first part, we selected EBMF due to its ability to learn non-orthogonal latent factors that are identifiable up to scale transformations. Additionally, Wang et al. demonstrated the ability of EBMF to recover sparse, biologically meaningful factors in gene expression data (<xref ref-type="bibr" rid="B28">28</xref>). For the second part, we use our CausalMGM framework, which has been shown to have superior performance on mixed data types (<xref ref-type="bibr" rid="B2">2</xref>, <xref ref-type="bibr" rid="B29">29</xref>).</p>
</sec>
<sec>
<title>Empirical bayes matrix factorization</title>
<p>EBMF is a recently introduced non-orthogonal, unsupervised method for dimensionality reduction. The <italic>K</italic>-factor EBMF model is defined in Equation (1) where <italic>Y</italic> is the <italic>n</italic> &#x000D7; <italic>p</italic> data matrix, <bold>l</bold><sub><italic>k</italic></sub> is an <italic>n</italic>-vector, <bold>f</bold><sub><italic>k</italic></sub> is a <italic>p</italic>-vector, <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant='script'>G</mml:mi></mml:mrow></mml:math></inline-formula> are pre-specified families of distributions, and <italic>g</italic> are unknown &#x0201C;prior&#x0201D; distributions, <italic>E</italic> is an <italic>n</italic>-specified families of distributions, and <italic>p</italic> matrix of independent error terms, and <bold>&#x003C4;</bold> is an unknown <italic>n</italic>-specified families of distributions, and <italic>p</italic> matrix of precisions in some space <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant='script'>T</mml:mi></mml:mrow></mml:math></inline-formula>,</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M3"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mstyle><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mi>k</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:mi>E</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo>~</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi mathvariant='script'>G</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>&#x02026;</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mo>~</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>&#x02131;</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>f</mml:mi></mml:mstyle></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>~</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x003C4;</mml:mi></mml:mstyle><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>T</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Wang et al. (<xref ref-type="bibr" rid="B28">28</xref>) implemented two main algorithms for fitting the k-factor EBMF model: (1) greedy and (2) backfitting. The greedy approach starts by optimizing the first factor, then adding second factor and optimizing that, and so on one factor at a time. The backfitting approach uses the estimates of all factors to refine the estimate of one factor (<xref ref-type="bibr" rid="B30">30</xref>). Wang et al. also state that empirical Bayes approaches to matrix factorization can automatically select the number of factors <italic>K</italic> (<xref ref-type="bibr" rid="B31">31</xref>) because, if <italic>K</italic> is set sufficiently large, some loading/factor combinations will be optimized to 0. They also show that, in their EBMF model, each loading and factor is identifiable to a multiplicative constant (given that <inline-formula><mml:math id="M4"><mml:mrow><mml:mi mathvariant='script'>G</mml:mi></mml:mrow></mml:math></inline-formula> is a scale family). This can also be dealt with by normalizing factor estimates.</p>
</sec>
<sec>
<title>CausalMGM</title>
<sec>
<title>PC and FCI</title>
<p>One of the most popular constraint-based causal discovery algorithms is the PC algorithm (<xref ref-type="bibr" rid="B32">32</xref>). One of the main assumptions of the PC algorithm is that the ground truth graph can be represented by a Directed Acyclic Graph (DAG). The PC algorithm starts with a fully connected undirected graph. It then tests each pair of variables for independence conditioned on the empty set and removes all those edges found to be independent. Then, the procedure is repeated by testing for conditional independence of each edge given every single neighborhood variable, every pair of neighborhood variables, etc. until a predetermined size. After every conditionally independent edge has been identified and removed, the remaining edges are oriented. The orientation has the following steps. First, if node Z is adjacent to nodes X and Y, and X and Y are independent conditioned on a set that does not include Z, then this is a collider (X &#x02192; Z&#x02190;Y). Next, the remaining edges are oriented to avoid additional colliders. Finally, any edges that are not in a collider, nor would result in a collider, are oriented to avoid cycles.</p>
<p>In the original PC algorithm, edges would be removed as soon as a conditional independence is found between two nodes. However, this meant that the output graph would be dependent on the order in which the edges were tested. PC-stable (<xref ref-type="bibr" rid="B33">33</xref>) corrects for this by performing all edge removals concurrently at the end of each depth (subset size) test. PC&#x02212;Max further improves on this idea by picking the conditioning set with the highest <italic>p</italic>-value so that there are no ambiguities for directionality. In addition, PC&#x02212;Max is parallelized for scalability (<xref ref-type="bibr" rid="B34">34</xref>).</p>
<p>The Fast Causal Inference (FCI) algorithm (<xref ref-type="bibr" rid="B32">32</xref>) is a modification of PC that has been explicitly designed for causal discovery in the presence of latent confounders. The FCI algorithm is composed of three parts: (1) determine which edges to remove from the complete undirected graph (2) identify collider and orient edges (3) re-orient edges for ancestor relations. The output of FCI is defined as partial ancestral graph (PAG) including four types of edge:</p>
<list list-type="bullet">
<list-item><p><italic>A</italic>&#x02192;<italic>B</italic> if and only if A is an ancestor of B in the I-equivalence class</p></list-item>
<list-item><p><italic>A</italic>&#x02194;<italic>B</italic> if and only if A and B are not ancestors of each other in the I-equivalence class</p></list-item>
<list-item><p>A <italic>o</italic>&#x02212;&#x0003E; B if and only if B is not an ancestor of A in the I-equivalence class</p></list-item>
<list-item><p>A <italic>o</italic>&#x02212;<italic>o</italic> B places no restriction on ancestor relation.</p></list-item>
</list>
</sec>
<sec>
<title>FGES</title>
<p>The Fast Greedy Equivalence Search (FGES) algorithm (<xref ref-type="bibr" rid="B35">35</xref>) is one of the most popular score-based causal discovery algorithm, which infers causal structure through maximizing a likelihood score instead of conditional independence tests. Recently, FGES has been extented to mixed-type data <italic>via</italic> Degenerate Gaussian score (<xref ref-type="bibr" rid="B36">36</xref>) where the likelihood is calculated by modeling continuous random variable as Gaussian distributions, and each k-category discrete random variable as a latent (k &#x02013; 1) dimensional Gaussian distributions. The output of FGES is pattern, which contains directed edge (&#x02192;) representing direct causation and undirected edge (&#x02212;), where its causal direction cannot be determined.</p>
</sec>
<sec>
<title>Mixed graphical models</title>
<p>The main limitation of constraint-based causal algorithms (like PC&#x02212;Max) is that the runtime in dense or scale-free graphs is a high-order polynomial based on the maximum degree of the graph. One way to improve upon this is to start the search from a sparse undirected skeleton that includes all adjacencies in the true data generating DAG. This reduces a global high-order polynomial problem to a local lower-order polynomial problem. Mixed graphical model (MGM) is one strategy that allows for the efficient learning of the moralized graph (which is a superset of the true causal graph plus the edges for the shielded colliders) (<xref ref-type="bibr" rid="B37">37</xref>, <xref ref-type="bibr" rid="B38">38</xref>). MGM can learn a undirected graph skeleton and avoid a large number of false positive edges (<xref ref-type="bibr" rid="B2">2</xref>).</p>
</sec>
</sec>
</sec>
<sec id="s3">
<title>Experiments</title>
<sec>
<title>Comparison on synthetic datasets</title>
<sec>
<title>Synthetic datasets</title>
<p>Clinical datasets often contain a mixture of continuous and discrete variables. When multi-omics data is included, these datasets become high-dimensional and collinear. To simulate similar datasets, we used both Lee and Hastie (LH) (<xref ref-type="bibr" rid="B37">37</xref>) and Conditional Gaussian (CG) (<xref ref-type="bibr" rid="B39">39</xref>) models for data generation. Conditional Gaussian models generate data from a Gaussian mixture where each Gaussian component exists for a particular combination of discrete variables, while Lee &#x00026; Hastie models generate continuous variables from the joint distribution of the continuous variables conditioned on the discrete variables following a multivariate Gaussian with common covariance.</p>
<p>The TETRAD software (<xref ref-type="bibr" rid="B40">40</xref>, <xref ref-type="bibr" rid="B41">41</xref>) contains LH and CG simulation models. We generated 10,000 samples for 50 continuous variables (representing factors <italic>Z</italic>) and 25 categorical variables (analogous to clinical features) from a sparse true DAG with average graph degree 2&#x02013;4. We then generated 2,500 variables (analogous to gene expression data <italic>X</italic>) based on the 50 latent factors (<italic>Z</italic>) and loadings (<italic>A</italic>) with weights drawn from a Gaussian distribution using Equation (2). To model the structure of regulation of gene expression, only 2% of the values of the combined loading matrix with the greatest absolute value were kept, and others were set to 0. This strategy controls the simulation so that each latent factor (e.g., TF) regulates approximately 50 gene expression features directly.</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mi>Z</mml:mi><mml:mi>A</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>E</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><italic>X</italic> is the <italic>n</italic>&#x000D7;<italic>p</italic> gene expression data matrix, <inline-formula><mml:math id="M6"><mml:mi>Z</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant='script'>R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denotes K latent variables for n samples, <inline-formula><mml:math id="M7"><mml:mi>A</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant='script'>R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the Transcription Factor Regulation Matrix assigning p variables to K groups. <inline-formula><mml:math id="M8"><mml:mi>E</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant='script'>N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is an <italic>n</italic>&#x000D7;<italic>p</italic> matrix of independent error.</p>
<p>For the gene expression simulation, we randomly picked five 1,000-sample datasets (out of the 10,000 total simulated samples) from each of the LH and CG datasets. The reported results are averages across these 5 sub-datasets.</p>
</sec>
<sec>
<title>Benchmark methods</title>
<list list-type="bullet">
<list-item><p><bold>PCA</bold> We consider Principal component analysis (PCA) as a comparable baseline method to EBMF, since it is one of the most popular dimensionality reduction methods. We pick the first K principal components of the observed features by the eigenvalue ratio test (<xref ref-type="bibr" rid="B42">42</xref>), where K is estimated by:</p></list-item>
</list>
<disp-formula id="E3"><label>(3)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mover accent="false"><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo class="qopname">arg</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo class="qopname">&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mo class="qopname">&#x000AF;</mml:mo></mml:mover></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M10"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo></mml:math></inline-formula> are eigenvalues, and largest K cannot be greater than <italic>min</italic>(<italic>n, p</italic>)&#x02212;1.</p>
<list list-type="bullet">
<list-item><p><bold>FCI</bold> For causal structure in the presence of confounders, we use FCI, a popular approach that generates asymptotically correct results. Thus, we compare performance of FCI and our model on simulation datasets with 10 latent factors and 200 observed features.</p></list-item>
</list>
</sec>
<sec>
<title>Metrics</title>
<p>EBMF models were calculated using the <monospace>flashr</monospace> package available by the authors of (<xref ref-type="bibr" rid="B28">28</xref>). We then built the moralized undirected graph from the trained EBMF identified factors and the categorical features (clinical targets) using MGM with StEPS subsampling procedure (<xref ref-type="bibr" rid="B38">38</xref>) for optimal sparsity. The moralized graph is used as the initial graph for the PC&#x02212;Max algorithm with StARS (<xref ref-type="bibr" rid="B43">43</xref>), which produced a final stable causal graph on which we evaluated the original graph recovery. The causal discovery algorithms were performed using the <ext-link ext-link-type="uri" xlink:href="https://github.com/tyler-lovelace1/rCausalMGM"><monospace>rCausalMGM</monospace></ext-link> package.</p>
<p>To evaluate the recovery of latent factors by EBMF, we computed the mean correlation coefficient (MCC) between the known data-generating factors and the estimated latent factors learned through EBMF. To compute MCC, we calculated all pairs of correlation coefficients between source and recovered latent factors. Then, we permuted the correlation matrix to maximize diagonal sum through solving a linear sum assignment problem. Finally, we assigned each recovered latent factor to best-correlated source latent factor, and computed MCC based on the diagonal of the permuted correlation matrix.</p>
<p>To evaluate the true graph recovery, we computed adjacency precision (AP) &#x00026; recall (AR) and arrowhead (causal orientation) precision (AHP) &#x00026; recall (AHR). Adjacency precision and recall refer to the correctness of the edges in the graph estimated by the causal discovery algorithms regardless of the corresponding orientation(s). Arrowhead precision and recall are computed using the true positive, false positive, false negative, and true negative orientations defined in <xref ref-type="table" rid="T1">Table 1</xref> (<xref ref-type="bibr" rid="B29">29</xref>). This score is computed only on edges that appear in both the data generating graph and the estimated graph. Thus, this arrowhead score always is accompanied by adjacency score to separate the information gained from the adjacencies vs. arrowheads. The main idea of the score is to treat endpoint &#x0201C;&#x0003E;&#x0201D; as positive and endpoint &#x0201C;&#x02212;&#x0201D; as negative. If the true graph contains edge A &#x02192; B, and the estimated graph contains edge A &#x02192; B, this counts as one true positive in A and one true negative in B. The precision and recall of adjacency and arrowhead are defined in Equation (4).</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Adj Precision</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">correctly predicted adjacencies</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">predicted adjacencies</mml:mtext></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Adj Recall</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">correctly predicted adjacencies</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">true adjacencies</mml:mtext></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Arr Precision</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac><mml:mtext>&#x02003;</mml:mtext><mml:mtext class="textrm" mathvariant="normal">Arr Recall</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Causal orientation score.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Edge</bold></th>
<th valign="top" align="left"><bold>True: <italic>A</italic>&#x0002A;&#x02212;&#x0003E;<italic>B</italic></bold></th>
<th valign="top" align="left"><bold>True: <italic>A</italic>&#x0002A;&#x02212;&#x02212;<italic>B</italic></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Predicted: <italic>A</italic>&#x0002A;&#x02212;&#x0003E;<italic>B</italic></td>
<td valign="top" align="left">TP</td>
<td valign="top" align="left">FP</td>
</tr>
<tr>
<td valign="top" align="left">Predicted: <italic>A</italic>&#x0002A;&#x02212;&#x02212;<italic>B</italic></td>
<td valign="top" align="left">FN</td>
<td valign="top" align="left">TN</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec>
<title>Application to biomedical datasets</title>
<p>In order to test how well our high-dimensional CausalMGM performs in real-world biological applications, we applied the strategy to two clinically important gene expression datasets: (1) breast cancer and (2) SARS-CoV-2. For breast cancer, we use METABRIC (<xref ref-type="bibr" rid="B44">44</xref>), a microarray dataset with 1307 patients and 60 clinical features. For COVID-19, we use a recently published open-access RNAseq dataset (<xref ref-type="bibr" rid="B45">45</xref>) with 100 COVID-19 and 26 non-COVID-19 patients and 9 clinical features.</p>
<p>For both datasets, we used EBMF as the first step because the results on the simulated data showed that it outperformed PCA. We trained a greedy-backfitted EBMF model over all patient gene expression data to build the latent factor model. Then, we merged the known clinical features with the latent factors to train a causal network (using PC and FCI).</p>
<p>By definition, in a Bayesian network, all the information regarding a target is contained within its Markov Blanket. We used this concept to select important factors of target variables for further analysis.</p>
<p>The Mann-Whitney <italic>U</italic>-test was used to identify significant EBMF factors between groups. For Gene Set Enrichment Analysis (GSEA) (<xref ref-type="bibr" rid="B46">46</xref>) of factor loadings, the R function <monospace>gseGO</monospace> from the <monospace>clusterProfiler</monospace> package (<xref ref-type="bibr" rid="B47">47</xref>) was applied to the rank order of the EBMF factor loadings to identify Gene Ontology Biological Processes (<xref ref-type="bibr" rid="B48">48</xref>) enriched in a given factor loading. Enrichment dot plots for GSEA results were generated with the <monospace>enrichplot</monospace> package (<xref ref-type="bibr" rid="B49">49</xref>) in R.</p>
<p>For the METABRIC dataset, Kaplan-Meier estimates (<xref ref-type="bibr" rid="B50">50</xref>) and Cox Proportional Hazards model (<xref ref-type="bibr" rid="B51">51</xref>) of the disease-free survival curves are computed with the <monospace>survival</monospace> package (<xref ref-type="bibr" rid="B52">52</xref>). Optimal cutoff values for each factor in relation to disease-free survival were determined by a maximally selected rank statistic. The Kaplan-Meier estimates were then plotted with the <monospace>survminer</monospace> package (<xref ref-type="bibr" rid="B53">53</xref>).</p>
<p>For the SARS-CoV-2 dataset, the performance of predictive models for disease state and ICU admission were assessed by the area under the receiver operator characteristic (ROC) curve (AUC). These ROC curves were computed and plotted with the <monospace>pROC</monospace> package (<xref ref-type="bibr" rid="B54">54</xref>) using five-fold nested cross-validation. A baseline predictive model, constructed directly on gene expression data, was learned using logistic regression with elastic net regularization implemented in <monospace>glmnet</monospace> (<xref ref-type="bibr" rid="B55">55</xref>). Predictive models based on EBMF factor causal models were constructed by performing logistic regression on the Markov blanket of the target outcomes.</p>
</sec></sec>
<sec sec-type="results" id="s4">
<title>Results</title>
<sec>
<title>Performance on synthetic data</title>
<sec>
<title>Evaluation of latent factor recovery</title>
<p>To evaluate the ability of EBMF to recover the true latent factors in a dataset, we applied both the greedy-only and greedy &#x0002B; backfitting algorithms on synthetic datasets. As a baseline, we also applied PCA with the number of principal components selected by the eigenvalue ratio test. We consider latent factor recovery to be successful if the method correctly identifies the number of latent factors, and the MCC of the recovered factors and true data generating factors is high. The greedy &#x0002B; backfitting algorithm and PCA with the eigenvalue ratio test perform well in selecting the correct number of latent factors (<xref ref-type="fig" rid="F2">Figure 2A</xref>). However, the MCC of the learned latent factors and true data generating factors is much higher for EBMF compared to PCA (<xref ref-type="fig" rid="F2">Figure 2B</xref>) across all simulation conditions, demonstrating that EBMF can better recover the source factors when compared to PCA. This is a consequence of the orthogonality constraint in PCA, which hinders the recovery of interacting, dependent latent factors. EBMF does not have this constraint, and can more precisely recover the true source latent factors for CG simulated data (<xref ref-type="fig" rid="F2">Figure 2</xref> and <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1B</xref>). While the MCC for the LH simulated data is relatively lower (<xref ref-type="fig" rid="F2">Figure 2</xref> and <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 2B</xref>), indicating lower true factor recovery, it still outperforms PCA by a large margin.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Bar plots for <bold>(A)</bold> number of latent factors and <bold>(B)</bold> mean correlation coefficient (MCC); EBMF represents latent factors recovered by greedy algorithm. EBMF&#x0002B;BF represents latent factors recovered by greedy algorithm &#x0002B; Backfitting algorithm; PCA represent principal component analysis; HC/LC represents simulated data with high/low correlation; the line at 50 latent factors indicates the ground truth number of source latent factors.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0002.tif"/>
</fig></sec>
<sec>
<title>Evaluation of causal graph recovery</title>
<p>Next, we performed MGM with StEPS on the EBMF greedy &#x0002B; backfitting discovered latent factors to identify optimal lambda values for an undirected graph. Using the undirected network generated by MGM, we ran PC&#x02212;Max with StARS to determine an optimal alpha value for a stable graph. <xref ref-type="table" rid="T2">Table 2</xref> summarizes the adjacency and arrowhead precision and recall, and the results are split by edge types: CC are edges between continuous variables, DD are edges between discrete variables, and CD are edges between mixed (continuous and discrete) variables.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Graph recovery of simulated data (MGM-PC-MAX).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>EDGE</bold></th>
<th valign="top" align="center" colspan="4"><bold>CC</bold></th>
<th valign="top" align="center" colspan="4"><bold>CD</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Data</bold></td>
<td valign="top" align="center"><bold>TYPE</bold></td>
<td valign="top" align="center"><bold>AP</bold></td>
<td valign="top" align="center"><bold>AR</bold></td>
<td valign="top" align="center"><bold>AHP</bold></td>
<td valign="top" align="center"><bold>AHR</bold></td>
<td valign="top" align="center"><bold>AP</bold></td>
<td valign="top" align="center"><bold>AR</bold></td>
<td valign="top" align="center"><bold>AHP</bold></td>
<td valign="top" align="center"><bold>AHR</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">CG-LC</td>
<td valign="top" align="center">Source</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.60</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.33</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.26</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.47</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">EBMF&#x0002B;BF</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">0.14</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.18</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.29</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.18</td>
</tr>
<tr>
<td valign="top" align="left">CG-HC</td>
<td valign="top" align="center">Source</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.91</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.52</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.45</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">EBMF&#x0002B;BF</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.10</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.27</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">0.52</td>
<td valign="top" align="center">0.50</td>
<td valign="top" align="center">0.20</td>
</tr>
<tr>
<td valign="top" align="left">LH-LC</td>
<td valign="top" align="center">Source</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">0.60</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">EBMF&#x0002B;BF</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.24</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.51</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.54</td>
</tr>
<tr>
<td valign="top" align="left">LH-HC</td>
<td valign="top" align="center">Source</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.56</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.91</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.62</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">EBMF&#x0002B;BF</td>
<td valign="top" align="center">0.19</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.50</td>
<td valign="top" align="center">0.53</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.62</td>
<td valign="top" align="center">0.56</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Since we model the latent factors as continuous variables, we investigate the ability of our method to recover CC and CD edges. As a benchmark, we compare the performance of MGM-PC-Max applied to the EBMF recovered latent factors to MGM-PC-Max applied to the true, data generating source latent factors. This benchmark represents the best case causal discovery performance, in the scenario where the latent factors are perfectly reconstructed. On Conditional Gaussian simulated data, causal discovery on EBMF estimated latent factors performs similarly well to casual discovery on the source factors in terms of recovering interactions between the continuous latent factors and the categorical &#x0201C;clinical&#x0201D; features. EBMF struggled to learn interactions between latent factors, especially in the high correlation setting, but reasonably high adjacency precision indicate that most edges that are inferred by MGM-PC-Max on EBMF latent factors are true causal interactions. However, MGM&#x02212;PC&#x02212;Max has worse performance on EBMF recovered latent factors from Lee and Hastie simulated data. This is especially true for the high correlation case (<xref ref-type="fig" rid="F2">Figure 2B</xref>). This difficulty with recovering latent factors in Lee &#x00026; Hastie simulated data results in a failure to identify true causal interactions within the latent factors (<xref ref-type="table" rid="T2">Table 2</xref>). This performance difference is due to the higher internal covariance between continuous variables and discrete variables in Lee &#x00026; Hastie model data, making it more difficult to recover the true source latent factors. This indicates that strong associations between continuous and discrete variables can cause EBMF to struggle with identifying the true latent factors. However, when compared to PCA and FCI, EBMF was able to recovery of true latent factors at a much higher accuracy, which is critical for the subsequent causal discovery.</p>
<p>We also compare our approach with a baseline method, FCI, on smaller simulation datasets with 10 latent factors and 200 observed features. While FCI is not designed to recover the actual latent factors, it can infer whether two variables are confounded. We inferred the expected confounded edge adjacency matrix from the loading matrix (for source loading and EBMF loading), and we compared this to the confounding edges inferred by FCI. This allows us to compute adjacency recall &#x00026; precision for confounded edges from the FCI output PAG. For PAG, We count one double-arrow edge (&#x02194;), which indicates that two variables are confounded, as one True Positive (TP); one double circle edge (o-o), which indicates confounding but also possible direct cause-effect in either direction, as one third TP and two thirds False Positive (FP); and one <italic>o</italic> &#x02192; edge, which indicates confounding or possible cause-effect in one direction, as one half TP and one half FP. Overall, EBMF has a higher recall and relatively high precision. In comparison, FCI has a much lower recall and relatively lower precision (<xref ref-type="table" rid="T3">Table 3</xref>). This demonstrates that our model, in addition to recovering the values of the latent factors and feature loadings themselves, can identify confounded features more efficiently than FCI.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Adjacency recovery of loading matrix.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Data</bold></th>
<th valign="top" align="center"><bold>EBMF-AP</bold></th>
<th valign="top" align="center"><bold>EBMF-AR</bold></th>
<th valign="top" align="center"><bold>FCI-AP</bold></th>
<th valign="top" align="center"><bold>FCI-AR</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CG_LC</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.53</td>
<td valign="top" align="center">0.19</td>
</tr>
<tr>
<td valign="top" align="left">CG_HC</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.51</td>
<td valign="top" align="center">0.18</td>
</tr>
<tr>
<td valign="top" align="left">LH_LC</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">0.53</td>
<td valign="top" align="center">0.21</td>
</tr>
<tr>
<td valign="top" align="left">LH_HC</td>
<td valign="top" align="center">0.44</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">0.51</td>
<td valign="top" align="center">0.20</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec>
<title>Performance on biomedical disease data</title>
<sec>
<title>Application to METABRIC breast cancer dataset</title>
<p>After removing features and patients with missing data, the METABRIC dataset contained 1,221 patients with 17,268 genes and 14 clinical features. The clinical features included can be seen in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 1</xref>. From the 17,268 gene expression features, the greedy &#x0002B; backfit EBMF identified 328 latent factors. The resulting latent factors also have low pairwise correlations, indicating that EBMF has successfully reduced the multicollinearity of the dataset (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 3</xref>).</p>
<p>Using these 328 EBMF factors and 14 clinical features, we ran MGM to build an undirected skeleton graph, which was used as the initial graph for PC&#x02212;Max. The StARS subsampling procedure yielded the most stable graph that connected the latent factors and clinical features. A selected subset (first and second neighbors of clinical features) of the network can be seen in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Markov blankets of important clinical features from the causal network learned using MGM&#x02212;PC&#x02212;Max. Blue diamonds are factors learned from METABRIC gene expression using greedy backfitted EBMF. Green squares are clinical METABRIC features. For explanation of the clinical features, please see <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 1</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0003.tif"/>
</fig>
<p>We then evaluated this graph by examining the EBMF factors that are in the Markov blanket of the clinical features. The associated factors were used for gene set enrichment analysis in order to identify the enriched biological processes. The top 10 gene sets with ratios, adjusted <italic>p</italic>-value, and total weight can be seen in <xref ref-type="fig" rid="F4">Figure 4</xref>. Some factors had less than 10 processes associated with them with FDR adjusted <italic>p</italic> &#x0003C; 0.05, such as LF59.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Biological functions significantly associated with factors that are in the Markov blankets of tumor size: LF59, distant relapse: LF1 and LF10, ER status: LF2, PR status: LF11 and LF27.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0004.tif"/>
</fig>
<p>In the Markov blanket of ER status, we see one factor, LF2, which has positive values for ER&#x0002B; patients, and negative values for ER- patients (<italic>p</italic>-value &#x0003C; 2.2e-16, <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 4</xref>). LF2 is associated with immune response, interaction, and signaling pathways, which is matches existing literature about how ER activity can regulate immune signaling pathways and cytokine production (<xref ref-type="bibr" rid="B56">56</xref>). Previous literature has also demonstrated that the majority ER&#x0002B; patients have lower levels of tumor infiltrating lymphocytes (<xref ref-type="bibr" rid="B57">57</xref>) compared to triple-negative patients.</p>
<p>Two factors (LF11 and LF27) are directly linked to PR status, and both are associated with immune response. In particular, the genes comprising these factors belong to interferon signaling pathways based on GSEA analysis. Those genes are also associated with viral immune response. Recent work has shown that PR&#x0002B; tumors exhibit lower levels of phospho-STAT1, which in turn attenuates interferon-induced STAT1 signaling, which in turn may allow PR&#x0002B; tumors to escape immune surveillance (<xref ref-type="bibr" rid="B58">58</xref>).</p>
<p>In the Markov blanket of tumor size, there is one factor LF59 as well as distant relapse and lymph node status. LF59 is associated with protein targeting to the endoplasmic reticulum. Endoplasmic reticulum stress is known to be directly linked to tumor growth inhibition and apoptosis (<xref ref-type="bibr" rid="B59">59</xref>, <xref ref-type="bibr" rid="B60">60</xref>). It is also associated with cell growth and structure related gene sets, which are needed for tumor cell growth.</p>
<p>For distant relapse, we see factors LF1 and LF10. LF1 and LF10 are positive for survivors, and negative for early, middle, and late recurrence patients. LF1 is associated with RNA translation, processing, and quality control pathways. There is also an association with the biological pathways associated with viral immune system response. LF10 is associated with pathways connected RNA processing and quality control, cell death promoters, and translation initiation, elongation, and termination. In general, aberrant RNA quality control and protein translation plays an important role in cancer pathogenesis (<xref ref-type="bibr" rid="B61">61</xref>).</p>
<p>For each of these four continuous factors, we were able to determine the optimal cut-point to split the survival into high and low categories (<xref ref-type="bibr" rid="B53">53</xref>). Using these cutoffs, we were then able to build disease-free survival curves for each of the factors (<xref ref-type="fig" rid="F5">Figure 5</xref>). We trained a multivariate Cox Proportional Hazards model using these factors to determine their contribution to disease-free survival (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 2</xref>). From the figure, we see that factors LF1, LF6, and LF10 have significant <italic>p</italic>-values, while LF59 does not. We exclude LF59 from further analysis for this reason. Using the significant factors LF1, LF6, and LF10, and their respective cutoffs, we are able to generate eight survival curves to represent the eight possible high/low combinations of the factors (<xref ref-type="fig" rid="F5">Figure 5</xref>). These eight groups are able to stratify patients into significantly different disease-free survival curves, demonstrating that the recovered latent factors recover biologically relevant information about a key clinical outcome.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Disease free survival based on <bold>(A)</bold> individual factors and the <bold>(B)</bold> combinations of all factors.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0005.tif"/>
</fig></sec>
<sec>
<title>Application SARS-CoV-2 dataset</title>
<p>In the COVID-19 RNA-seq dataset (19,472 genes), the backfitted EBMF identified 40 latent factors. Using FCI with bootstrapping and an &#x003B1; &#x0003D; 0.05, we learned a stable (edge appearance &#x0003E;50%) causal network from the latent factors and clinical variables. The majority of the network (26/30 features, 28/32 edges) was within the Markov Blanket of the clinical features, and can be seen in <xref ref-type="fig" rid="F6">Figure 6</xref>. Two features of particular interest are disease state and ICU admittance (which we used as a proxy for disease severity). The Markov Blanket of disease state contains factors LF2, LF5, LF20, and LF36. The Markov Blanket of ICU admittance contains LF3 and LF23.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Combined Markov blankets of clinical features from causal network learned using FCI&#x02212;Max with bootstrapping, highest ensemble, and &#x003B1; &#x0003D; 0.05. Blue diamonds are factors learned from gene expression using greedy-backfitted EBMF. Green squares are clinical features. For explanation of the clinical features, please see <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 3</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0006.tif"/>
</fig>
<p>Next, since the dataset contained both COVID and non-COVID patients, we were able to split these estimated EBMF factor values (LF2, LF5, LF20, LF36) based on patient disease class. This allows us to compare the distribution of values between these two groups. The results can be seen in the violin plots in <xref ref-type="fig" rid="F7">Figure 7A</xref>. In the plots, we can see that the average derived value for COVID-19 patients in all four latent factors is positive, while the average derived value for non-COVID-19 negative, with significant <italic>p</italic>-values distinguishing the two groups. This indicates that the causal algorithm is finding factors directly linked to whether the patient has COVID-19. Additionally, this also means that genes that are positively linked to LF2 are overexpressed for COVID-19 patients, while genes that are negatively linked to LF2 are overexpressed in non-COVID-19 patients.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Distributions of EBMF factor values for features in the Markov blanket of <bold>(A)</bold> disease state (COVID-19 vs. non-COVID-19), <bold>(B)</bold> ICU admittance (yes vs. no).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0007.tif"/>
</fig>
<p>The same strategy can be used for patient ICU admittance (LF3, LF23) since it is also a binary category. The relative distributions of those who had to enter the ICU vs. those who didn&#x00027;t can be seen in <xref ref-type="fig" rid="F7">Figure 7B</xref>. We see that patients who enter the ICU have positive average factor values for LF3, and negative average factor values for LF23 (and vice versa for those who don&#x00027;t enter the ICU). This indicates that genes positively associated with LF3 will be positively linked to ICU admittance, while genes positively associated with LF23 will be negatively linked to ICU admittance.</p>
<p>Using pathway enrichment analysis on the loading gene weights that made up each factor, we were able to identify the biological mechanisms that each factor was associated with. The gene sets with ratios, adjusted <italic>p</italic>-value, and overall total weight can be seen in <xref ref-type="fig" rid="F8">Figure 8</xref>. In the figure, we see that the factors (LF2, LF5, LF20, LF36) associated with COVID-19 disease state have high enrichment for innate &#x00026; adaptive immune response, metabolism and autophagy, which matches expected response to viral infections and is supported by COVID-19 literature (<xref ref-type="bibr" rid="B62">62</xref>). For ICU admittance, we see enrichment in crucial pathways that are known to be closely associated with COVID-19 severity, such platelet activation and coagulation (<xref ref-type="bibr" rid="B63">63</xref>), and chemokine activity (<xref ref-type="bibr" rid="B64">64</xref>). We also see many other immune response pathways within the top 10 pathways within all 6 factors.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Biological functions significantly associated with Factors that are in the Markov blanket of disease state (LF2, LF5, LF20, LF36) and ICU Admittance (LF3, LF23).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0008.tif"/>
</fig>
<p>We also compare the individual factors to the list of biological functions significantly associated with the complete differentially expressed genes (DGEs) result. The UpSet plots in <xref ref-type="fig" rid="F9">Figure 9</xref> describe the overlap between the DGE gene sets and the factors&#x00027; gene sets. We can see individual factors capture subsets of the main DGE gene sets. For example, there are 62 significant (p &#x0003C; 0.05 in both) gene sets that are overlapping between the main DGE and LF36. Similarly, we see 210 significant overlapping gene sets between ICU DGE and LF3. The results also demonstrate that the factors contain novel information that the full dataset DGE results do not contain. For example, LF36 has 71 unique significant gene sets that are not shared with any other COVID-19 disease state associated factors or the complete DGE analysis. This indicates that the factors are capturing both general biological information about COVID-19 disease state and ICU admittance that is contained in the overall dataset, but is also able to capture detailed differences that full dataset analysis would otherwise miss.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Intersection between biological function gene sets that are significantly associated with the differentially expressed genes and LFs across <bold>(A)</bold> disease state and <bold>(B)</bold> ICU admittance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0009.tif"/>
</fig>
<p>The Markov Blanket of a clinical feature can also be used for feature selection for clinical prediction (<xref ref-type="bibr" rid="B9">9</xref>). Using five-fold nested cross validation, we were able to build general linear models to predict whether a patient had COVID, and whether a patient would enter the ICU. Note that, in order to avoid overfitting, the entire causal learning process &#x0002B; Markov Blanket feature selection needs to be done for each fold (nested cross-validation). The ROC plots are presented in <xref ref-type="fig" rid="F10">Figure 10</xref>. These results show that classifiers based on the EBMF factors are equally good as those using all differentially expressed genes.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>General linear model prediction ROC using factors contained within the Markov blanket for <bold>(A)</bold> disease state and <bold>(B)</bold> ICU admittance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fepid-02-899655-g0010.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>Discussion</title>
<p>In this paper, we have proposed a two-step system to build causal models from high-dimensional multicollinear datasets. In step-1, we use EBMF to derive latent factors (e.g., TFs) from the observed variables (e.g., Gene Expression). In step-2, we perform causal discovery on the latent factors and clinical features (e.g., Age, Response).</p>
<p>In simulation studies we show that both EBMF and PCA can identify an accurate number of latent factors. However, only the EBMF recovered latent factors are highly correlated with the (known) source latent factors. Additionally, the latent factors learned by EBMF are not strictly orthogonal, which enables them to be used for causal discovery. We also demonstrate that our method can recover causal interactions between latent factors and clinical covariates with a high degree of accuracy.</p>
<p>Application to breast cancer and SARS-CoV-2 gene expression data shows that EBMF is able to significantly reduce the dimensionality of these datasets. It is also able to reduce to the multicollinearity within the datasets. The factors in the Markov blanket of clinical features also contained significant biologically relevant pathway information that is supported by existing literature.</p>
<p>For breast cancer, factors related to ER and PR status are associated with known immune signaling pathways and immune cell activity. Tumor size is associated with cell growth and tumor growth inhibition protein targeting pathways. Distant relapse linked factors contain important RNA and protein quality control pathways. Using the aforementioned factors, we were able to stratify breast cancer patients into distinct survival groups, where the best prognosis groups having over 50% survival rate at 7,500 days, and the worst having less than 25% survival rate.</p>
<p>For SARS-CoV-2, there is clear significant difference between the COVID-19 and non-COVID-19 patient distributions for the four latent factors related to disease state. These factors contain immune pathway genes that match known viral immune responses. Similarly, there is a significant difference in factor value distributions between ICU and non-ICU patients for the two important related factors. These ICU factors matched known biological pathways associated with severe immune response.</p>
<p>We also used EBMF with causal discovery as feature selection method in order to do clinical prediction of patients with COVID-19, and the severity of the patient&#x00027;s case. We were able to predict with high accuracy both whether a patient has COVID-19 and whether they will enter the ICU. These results were achieved with linear models, which cannot fully capture the non-linear relationships that exist in biological datasets. Therefore, the accuracy could increase even further with non-parametric models (e.g., random forests). Interestingly, we were also able to find factors associated solely with sex and with diabetes status, which may be interesting starting points for future research.</p>
<p>Due to our structural assumptions, the performance of our framework may suffer if there are many interactions among observed features (RNA-RNA interactions). RNA-RNA interactions typically include non-coding RNA (ncRNA)-ncRNA or ncRNA-message RNA (mRNA). Direct interactions between mRNA-mRNA are rare because they usually have to go through protein products and post-translational signaling. Research shows that stable mRNA-mRNA interactions are sometimes detected <italic>in vivo</italic> (<xref ref-type="bibr" rid="B65">65</xref>), but very few mRNA-mRNA interactions are involved in gene regulation, such as hly mRNA binding prsA2 mRNA and protect it from degradation (<xref ref-type="bibr" rid="B66">66</xref>, <xref ref-type="bibr" rid="B67">67</xref>). The lack of regulatory function means that it will not have a large effect on the gene expression, so in gene expression datasets containing mRNA features only, mRNA-mRNA interactions can be ignored. When working with whole transcriptomics data, which can contain other RNA types (such as ncRNA), there may be more regulatory RNA-RNA interactions. In this case, we may overlook some possible dependencies, which is a limitation of our method.</p>
<p>The authors recognize that the model proposed is not a complete biological picture since it makes assumptions about the internal structure and relationships between features. However, based on our current understanding of gene expression regulation, we argue that it is a justified and reasonable approximation. Additionally, all causal discovery methods are at best approximations when it comes to analyzing gene expression data since the data are often averages of many cells (e.g., bulk RNAseq or microarray), have systematic bias (e.g., single cell RNAseq dropout), or have cyclic relationships, which violate (to some degree) the assumptions of most existing causal discovery algorithms. Despite these drawbacks, the success of probabilistic graphical models in analyzing biomedical data (<xref ref-type="bibr" rid="B68">68</xref>&#x02013;<xref ref-type="bibr" rid="B70">70</xref>) has shown that they can still be good approximations. Our own simulation and application results demonstrate the potential of EBMF CausalMGM for gene expression data.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The simulated data and code can be found this github link (<ext-link ext-link-type="uri" xlink:href="https://github.com/Lobbii/HD_CL_causal_discovery">https://github.com/Lobbii/HD_CL_causal_discovery</ext-link>). The Covid19 dataset for this study can be found in the GSE157103. The breast cancer dataset for this study can be found in cbioportal (<ext-link ext-link-type="uri" xlink:href="http://www.cbioportal.org/study/summary?id=brca_metabric">http://www.cbioportal.org/study/summary?id=brca_metabric</ext-link>). Further inquiries can be directed to corresponding authors.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>MJ: study design, performed simulation data analysis, interpreted simulation results, and wrote the paper. DYY: study design, performed biomedical data analysis, interpreted biomedical results, and wrote the paper. TCL: study design, performed biomedical data analysis, and helped with paper writing. MH: preliminary data analysis. PVB: study design, aided in interpreting the results, and edited the paper. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This work was supported by the National Institutes of Health (R01 HL157879, R01 HL159805, R01 AA028436, R01 DK130294, R01 HL127349 [PVB], T32 CA082084 [DYY], F31 LM013966 [TCL].</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>We want to thank Haiyi Mao and Mark Ebeid for helping read and edit the manuscript.</p>
</ack><sec sec-type="supplementary-material" id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fepid.2022.899655/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fepid.2022.899655/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fachal</surname> <given-names>L</given-names></name> <name><surname>Aschard</surname> <given-names>H</given-names></name> <name><surname>Beesley</surname> <given-names>J</given-names></name> <name><surname>Barnes</surname> <given-names>DR</given-names></name> <name><surname>Allen</surname> <given-names>J</given-names></name> <name><surname>Kar</surname> <given-names>S</given-names></name> <etal/></person-group>. <article-title>Fine-mapping of 150 breast cancer risk regions identifies 191 likely target genes</article-title>. <source>Nat Genet</source>. (<year>2020</year>) <volume>52</volume>:<fpage>56</fpage>&#x02013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1038/s41588-019-0537-1</pub-id><pub-id pub-id-type="pmid">31911677</pub-id></citation></ref>
<ref id="B2">
<label>2.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sedgewick</surname> <given-names>AJ</given-names></name> <name><surname>Buschur</surname> <given-names>K</given-names></name> <name><surname>Shi</surname> <given-names>I</given-names></name> <name><surname>Ramsey</surname> <given-names>JD</given-names></name> <name><surname>Raghu</surname> <given-names>VK</given-names></name> <name><surname>Manatakis</surname> <given-names>DV</given-names></name> <etal/></person-group>. <article-title>Mixed graphical models for integrative causal analysis with application to chronic lung disease diagnosis and prognosis</article-title>. <source>Bioinformatics</source>. (<year>2019</year>) <volume>35</volume>:<fpage>1204</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty769</pub-id><pub-id pub-id-type="pmid">30192904</pub-id></citation></ref>
<ref id="B3">
<label>3.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roushangar</surname> <given-names>R</given-names></name> <name><surname>Mias</surname> <given-names>GI</given-names></name></person-group>. <article-title>Multi-study reanalysis of 2,213 acute myeloid leukemia patients reveals age-and sex-dependent gene expression signatures</article-title>. <source>Sci Rep</source>. (<year>2019</year>) <volume>9</volume>:<fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-48872-0</pub-id><pub-id pub-id-type="pmid">31455838</pub-id></citation></ref>
<ref id="B4">
<label>4.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abecassis</surname> <given-names>I</given-names></name> <name><surname>Sedgewick</surname> <given-names>AJ</given-names></name> <name><surname>Romkes</surname> <given-names>M</given-names></name> <name><surname>Buch</surname> <given-names>S</given-names></name> <name><surname>Nukui</surname> <given-names>T</given-names></name> <name><surname>Kapetanaki</surname> <given-names>MG</given-names></name> <etal/></person-group>. <article-title>PARP1 rs1805407 increases sensitivity to PARP1 inhibitors in cancer cells suggesting an improved therapeutic strategy</article-title>. <source>Sci Rep</source>. (<year>2019</year>) <volume>9</volume>:<fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-39542-2</pub-id><pub-id pub-id-type="pmid">30824778</pub-id></citation></ref>
<ref id="B5">
<label>5.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buschur</surname> <given-names>KL</given-names></name> <name><surname>Chikina</surname> <given-names>M</given-names></name> <name><surname>Benos</surname> <given-names>PV</given-names></name></person-group>. <article-title>Causal network perturbations for instance-specific analysis of single cell and disease samples</article-title>. <source>Bioinformatics</source>. (<year>2019</year>) <volume>36</volume>:<fpage>2515</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz949</pub-id><pub-id pub-id-type="pmid">31873725</pub-id></citation></ref>
<ref id="B6">
<label>6.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Glymour</surname> <given-names>C</given-names></name> <name><surname>Zhang</surname> <given-names>K</given-names></name> <name><surname>Spirtes</surname> <given-names>P</given-names></name></person-group>. <article-title>Review of causal discovery methods based on graphical models</article-title>. <source>Front Genet</source>. (<year>2019</year>) <volume>10</volume>:<fpage>524</fpage>. <pub-id pub-id-type="doi">10.3389/fgene.2019.00524</pub-id><pub-id pub-id-type="pmid">31214249</pub-id></citation></ref>
<ref id="B7">
<label>7.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>K</given-names></name> <name><surname>Scholkopf</surname> <given-names>B</given-names></name> <name><surname>Spirtes</surname> <given-names>P</given-names></name> <name><surname>Glymour</surname> <given-names>C</given-names></name></person-group>. <article-title>Learning causality and causality-related learning: some recent progress</article-title>. <source>Natl Sci Rev</source>. (<year>2018</year>) <volume>5</volume>:<fpage>26</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1093/nsr/nwx137</pub-id><pub-id pub-id-type="pmid">30034911</pub-id></citation></ref>
<ref id="B8">
<label>8.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sachs</surname> <given-names>K</given-names></name> <name><surname>Perez</surname> <given-names>O</given-names></name> <name><surname>Pe&#x00027;er</surname> <given-names>D</given-names></name> <name><surname>Lauffenburger</surname> <given-names>DA</given-names></name> <name><surname>Nolan</surname> <given-names>GP</given-names></name></person-group>. <article-title>Causal protein-signaling networks derived from multiparameter single-cell data</article-title>. <source>Science</source>. (<year>2005</year>) <volume>308</volume>:<fpage>523</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1126/science.1105809</pub-id><pub-id pub-id-type="pmid">15845847</pub-id></citation></ref>
<ref id="B9">
<label>9.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aliferis</surname> <given-names>C</given-names></name> <name><surname>Tsamardinos</surname> <given-names>I</given-names></name> <name><surname>Statnikov</surname> <given-names>A</given-names></name></person-group>. <article-title>A novel Markov blanket algorithm for optimal variable selection</article-title>. In: <source>AMIA Annual Symposium Proceedings</source>. <publisher-loc>Washington, DC</publisher-loc> (<year>2003</year>). p. <fpage>21</fpage>&#x02013;<lpage>5</lpage>.<pub-id pub-id-type="pmid">14728126</pub-id></citation></ref>
<ref id="B10">
<label>10.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raghu</surname> <given-names>VK</given-names></name> <name><surname>Zhao</surname> <given-names>W</given-names></name> <name><surname>Pu</surname> <given-names>J</given-names></name> <name><surname>Leader</surname> <given-names>JK</given-names></name> <name><surname>Wang</surname> <given-names>R</given-names></name> <name><surname>Herman</surname> <given-names>J</given-names></name> <etal/></person-group>. <article-title>Feasibility of lung cancer prediction from low-dose CT scan and smoking factors using causal models</article-title>. <source>Thorax</source>. (<year>2019</year>) <volume>74</volume>:<fpage>643</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1136/thoraxjnl-2018-212638</pub-id><pub-id pub-id-type="pmid">30862725</pub-id></citation></ref>
<ref id="B11">
<label>11.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carvalho</surname> <given-names>CM</given-names></name> <name><surname>Chang</surname> <given-names>J</given-names></name> <name><surname>Lucas</surname> <given-names>JE</given-names></name> <name><surname>Nevins</surname> <given-names>JR</given-names></name> <name><surname>Wang</surname> <given-names>Q</given-names></name> <name><surname>West</surname> <given-names>M</given-names></name></person-group>. <article-title>High-dimensional sparse factor modeling: applications in gene expression genomics</article-title>. <source>J Am Stat Assoc</source>. (<year>2008</year>) <volume>103</volume>:<fpage>1438</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1198/016214508000000869</pub-id><pub-id pub-id-type="pmid">21218139</pub-id></citation></ref>
<ref id="B12">
<label>12.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lucas</surname> <given-names>JE</given-names></name> <name><surname>Kung</surname> <given-names>HN</given-names></name> <name><surname>Chi</surname> <given-names>JTA</given-names></name></person-group>. <article-title>Latent factor analysis to discover pathway-associated putative segmental aneuploidies in human cancers</article-title>. <source>PLoS Comput Biol</source>. (<year>2010</year>) <volume>6</volume>:<fpage>e1000920</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1000920</pub-id><pub-id pub-id-type="pmid">20824128</pub-id></citation></ref>
<ref id="B13">
<label>13.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Way</surname> <given-names>GP</given-names></name> <name><surname>Zietz</surname> <given-names>M</given-names></name> <name><surname>Rubinetti</surname> <given-names>V</given-names></name> <name><surname>Himmelstein</surname> <given-names>DS</given-names></name> <name><surname>Greene</surname> <given-names>CS</given-names></name></person-group>. <article-title>Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations</article-title>. <source>Genome Biol</source>. (<year>2020</year>) <volume>21</volume>:<fpage>1</fpage>&#x02013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1186/s13059-020-02021-3</pub-id><pub-id pub-id-type="pmid">32393369</pub-id></citation></ref>
<ref id="B14">
<label>14.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Engreitz</surname> <given-names>JM</given-names></name> <name><surname>Daigle</surname> <given-names>BJ</given-names> <suffix>Jr.</suffix></name> <name><surname>Marshall</surname> <given-names>JJ</given-names></name> <name><surname>Altman</surname> <given-names>RB</given-names></name></person-group>. <article-title>Independent component analysis: mining microarray data for fundamental human gene expression modules</article-title>. <source>J Biomed Inform</source>. (<year>2010</year>) <volume>43</volume>:<fpage>932</fpage>&#x02013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2010.07.001</pub-id><pub-id pub-id-type="pmid">20619355</pub-id></citation></ref>
<ref id="B15">
<label>15.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Z</given-names></name> <name><surname>Michailidis</surname> <given-names>G</given-names></name></person-group>. <article-title>A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data</article-title>. <source>Bioinformatics</source>. (<year>2016</year>) <volume>32</volume>:<fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv544</pub-id><pub-id pub-id-type="pmid">26377073</pub-id></citation></ref>
<ref id="B16">
<label>16.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Way</surname> <given-names>GP</given-names></name> <name><surname>Greene</surname> <given-names>CS</given-names></name></person-group>. <article-title>Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders</article-title>. In: <source>Pacific Symposium On Biocomputing 2018: Proceedings of the Pacific Symposium</source>. <publisher-loc>Kohala Coast, HI</publisher-loc>: <publisher-name>World Scientific</publisher-name> (<year>2018</year>). p. <fpage>80</fpage>&#x02013;<lpage>91</lpage>. <pub-id pub-id-type="doi">10.1142/9789813235533_0008</pub-id><pub-id pub-id-type="pmid">29218871</pub-id></citation></ref>
<ref id="B17">
<label>17.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z</given-names></name> <name><surname>Wang</surname> <given-names>Y</given-names></name></person-group>. <article-title>Extracting a biologically latent space of lung cancer epigenetics with variational autoencoders</article-title>. <source>BMC Bioinformatics</source>. (<year>2019</year>) <volume>20</volume>:<fpage>568</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-019-3130-9</pub-id><pub-id pub-id-type="pmid">31760935</pub-id></citation></ref>
<ref id="B18">
<label>18.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Spirtes</surname> <given-names>P</given-names></name> <name><surname>Meek</surname> <given-names>C</given-names></name> <name><surname>Richardson</surname> <given-names>T</given-names></name></person-group>. <article-title>Causal inference in the presence of latent variables and selection bias</article-title>. In: <source>Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence</source>. <publisher-loc>Montr&#x000E9;al, QC</publisher-loc> (<year>1995</year>). p. <fpage>499</fpage>&#x02013;<lpage>506</lpage>.<pub-id pub-id-type="pmid">33152602</pub-id></citation></ref>
<ref id="B19">
<label>19.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raghu</surname> <given-names>VK</given-names></name> <name><surname>Ramsey</surname> <given-names>JD</given-names></name> <name><surname>Morris</surname> <given-names>A</given-names></name> <name><surname>Manatakis</surname> <given-names>DV</given-names></name> <name><surname>Sprites</surname> <given-names>P</given-names></name> <name><surname>Chrysanthis</surname> <given-names>PK</given-names></name> <etal/></person-group>. <article-title>Comparison of strategies for scalable causal discovery of latent variable models from mixed data</article-title>. <source>Int J Data Sci Anal</source>. (<year>2018</year>) <volume>6</volume>:<fpage>33</fpage>&#x02013;<lpage>45</lpage>. <pub-id pub-id-type="doi">10.1007/s41060-018-0104-3</pub-id><pub-id pub-id-type="pmid">30148202</pub-id></citation></ref>
<ref id="B20">
<label>20.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Malinsky</surname> <given-names>D</given-names></name> <name><surname>Spirtes</surname> <given-names>P</given-names></name></person-group>. <article-title>Estimating bounds on causal effects in high-dimensional and possibly confounded systems</article-title>. <source>Int J Approxim Reason</source>. (<year>2017</year>) <volume>88</volume>:<fpage>371</fpage>&#x02013;<lpage>84</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijar.2017.06.005</pub-id><pub-id pub-id-type="pmid">29203954</pub-id></citation></ref>
<ref id="B21">
<label>21.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jabbari</surname> <given-names>F</given-names></name> <name><surname>Ramsey</surname> <given-names>J</given-names></name> <name><surname>Spirtes</surname> <given-names>PL</given-names></name> <name><surname>Cooper</surname> <given-names>GF</given-names></name></person-group>. <article-title>Discovery of causal models that contain latent variables through Bayesian scoring of independence constraints</article-title>. In: <source>Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD</source>. <publisher-loc>Skopje</publisher-loc> (<year>2017</year>). p. <fpage>142</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-71246-8_9</pub-id><pub-id pub-id-type="pmid">29520396</pub-id></citation></ref>
<ref id="B22">
<label>22.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>X</given-names></name> <name><surname>Lovelace</surname> <given-names>T</given-names></name> <name><surname>Bunea</surname> <given-names>F</given-names></name> <name><surname>Wegkamp</surname> <given-names>M</given-names></name> <name><surname>Singh</surname> <given-names>H</given-names></name> <name><surname>Benos</surname> <given-names>PV</given-names></name> <etal/></person-group>. <article-title>Essential regression - a generalizable framework for inferring causal latent factors from multi-omic human datasets</article-title>. <source>Patterns</source>. (<year>2022</year>) <volume>3</volume>:<fpage>100473</fpage>. <pub-id pub-id-type="doi">10.1016/j.patter.2022.100473</pub-id><pub-id pub-id-type="pmid">35607614</pub-id></citation></ref>
<ref id="B23">
<label>23.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pascal</surname> <given-names>LE</given-names></name> <name><surname>True</surname> <given-names>LD</given-names></name> <name><surname>Campbell</surname> <given-names>DS</given-names></name> <name><surname>Deutsch</surname> <given-names>EW</given-names></name> <name><surname>Risk</surname> <given-names>M</given-names></name> <name><surname>Coleman</surname> <given-names>IM</given-names></name> <etal/></person-group>. <article-title>Correlation of mrna and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate</article-title>. <source>BMC Genomics</source>. (<year>2008</year>) <volume>9</volume>:<fpage>246</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2164-9-246</pub-id><pub-id pub-id-type="pmid">18501003</pub-id></citation></ref>
<ref id="B24">
<label>24.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ibarra</surname> <given-names>IL</given-names></name> <name><surname>Hollmann</surname> <given-names>NM</given-names></name> <name><surname>Klaus</surname> <given-names>B</given-names></name> <name><surname>Augsten</surname> <given-names>S</given-names></name> <name><surname>Velten</surname> <given-names>B</given-names></name> <name><surname>Hennig</surname> <given-names>J</given-names></name> <etal/></person-group>. <article-title>Mechanistic insights into transcription factor cooperativity and its impact on protein-phenotype interactions</article-title>. <source>Nat Commun</source>. (<year>2020</year>) <volume>11</volume>:<fpage>124</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-019-13888-7</pub-id><pub-id pub-id-type="pmid">31913281</pub-id></citation></ref>
<ref id="B25">
<label>25.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Craig</surname> <given-names>NL</given-names></name></person-group>. <source>Molecular Biology: Principles of Genome Function</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name> (<year>2021</year>).</citation>
</ref>
<ref id="B26">
<label>26.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kato</surname> <given-names>M</given-names></name> <name><surname>Hata</surname> <given-names>N</given-names></name> <name><surname>Banerjee</surname> <given-names>N</given-names></name> <name><surname>Futcher</surname> <given-names>B</given-names></name> <name><surname>Zhang</surname> <given-names>MQ</given-names></name></person-group>. <article-title>Identifying combinatorial regulation of transcription factors and binding motifs</article-title>. <source>Genome Biol</source>. (<year>2004</year>) <volume>5</volume>:<fpage>R56</fpage>. <pub-id pub-id-type="doi">10.1186/gb-2004-5-8-r56</pub-id><pub-id pub-id-type="pmid">15287978</pub-id></citation></ref>
<ref id="B27">
<label>27.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vandel</surname> <given-names>J</given-names></name> <name><surname>Cassan</surname> <given-names>O</given-names></name> <name><surname>Lebre</surname> <given-names>S</given-names></name> <name><surname>Lecellier</surname> <given-names>CH</given-names></name> <name><surname>Brehelin</surname> <given-names>L</given-names></name></person-group>. <article-title>Probing transcription factor combinatorics in different promoter classes and in enhancers</article-title>. <source>BMC Genomics</source>. (<year>2019</year>) <volume>20</volume>:<fpage>103</fpage>. <pub-id pub-id-type="doi">10.1186/s12864-018-5408-0</pub-id><pub-id pub-id-type="pmid">30709337</pub-id></citation></ref>
<ref id="B28">
<label>28.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>W</given-names></name> <name><surname>Stephens</surname> <given-names>M</given-names></name></person-group>. <article-title>Empirical Bayes matrix factorization</article-title>. <source>J Mach Learn Res</source>. (<year>2021</year>) <volume>22</volume>:<fpage>1</fpage>&#x02013;<lpage>40</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1802.06931</pub-id></citation>
</ref>
<ref id="B29">
<label>29.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raghu</surname> <given-names>VK</given-names></name> <name><surname>Poon</surname> <given-names>A</given-names></name> <name><surname>Benos</surname> <given-names>PV</given-names></name></person-group>. <article-title>Evaluation of causal structure learning methods on mixed data types</article-title>. <source>Proc Mach Learn Res</source>. (<year>2018</year>) <volume>92</volume>:<fpage>48</fpage>.<pub-id pub-id-type="pmid">31080946</pub-id></citation></ref>
<ref id="B30">
<label>30.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname> <given-names>L</given-names></name> <name><surname>Friedman</surname> <given-names>JH</given-names></name></person-group>. <article-title>Estimating optimal transformations for multiple regression and correlation</article-title>. <source>J Am Stat Assoc</source>. (<year>1985</year>) <volume>80</volume>:<fpage>580</fpage>&#x02013;<lpage>98</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1985.10478157</pub-id></citation>
</ref>
<ref id="B31">
<label>31.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bishop</surname> <given-names>CM</given-names></name></person-group>. <article-title>Variational principal components</article-title>. In: <source>Ninth International Conference on Artificial Neural Networks</source>. <publisher-loc>Edinburgh</publisher-loc> (<year>1999</year>). p. <fpage>509</fpage>&#x02013;<lpage>14</lpage>.</citation>
</ref>
<ref id="B32">
<label>32.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Spirtes</surname> <given-names>P</given-names></name> <name><surname>Glymour</surname> <given-names>CN</given-names></name> <name><surname>Scheines</surname> <given-names>R</given-names></name></person-group>. <source>Causation, Prediction, and Search</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name> (<year>2000</year>). <pub-id pub-id-type="doi">10.7551/mitpress/1754.001.0001</pub-id></citation>
</ref>
<ref id="B33">
<label>33.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombo</surname> <given-names>D</given-names></name> <name><surname>Maathuis</surname> <given-names>MH</given-names></name></person-group>. <article-title>Order-independent constraint-based causal structure learning</article-title>. <source>J Mach Learn Res</source>. (<year>2014</year>) <volume>15</volume>:<fpage>3741</fpage>&#x02013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.5555/2627435.2750365</pub-id><pub-id pub-id-type="pmid">35327862</pub-id></citation></ref>
<ref id="B34">
<label>34.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramsey</surname> <given-names>J</given-names></name></person-group>. <article-title>Improving accuracy and scalability of the PC algorithm by maximizing P-value</article-title>. <source>arXiv [preprint] arXiv:</source>1610.00378. (<year>2016</year>). <pub-id pub-id-type="doi">10.48550/arXiv.1610.00378</pub-id></citation>
</ref>
<ref id="B35">
<label>35.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramsey</surname> <given-names>JD</given-names></name></person-group>. <article-title>Scaling up greedy equivalence search for continuous variables</article-title>. <source>arXiv [preprint] arXiv:</source>1507.07749. (<year>2015</year>). <pub-id pub-id-type="doi">10.48550/arXiv.1507.07749</pub-id></citation>
</ref>
<ref id="B36">
<label>36.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Andrews</surname> <given-names>B</given-names></name> <name><surname>Ramsey</surname> <given-names>J</given-names></name> <name><surname>Cooper</surname> <given-names>GF</given-names></name></person-group>. <article-title>Learning high-dimensional directed acyclic graphs with mixed data-types</article-title>. In: <source>Proceedings of Machine Learning Research. vol. 104 of Proceedings of Machine Learning Research</source>. <publisher-loc>Anchorage, AK</publisher-loc>: <publisher-name>PMLR</publisher-name> (<year>2019</year>). p. <fpage>4</fpage>&#x02013;<lpage>21</lpage>.<pub-id pub-id-type="pmid">31453569</pub-id></citation></ref>
<ref id="B37">
<label>37.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>JD</given-names></name> <name><surname>Hastie</surname> <given-names>TJ</given-names></name></person-group>. <article-title>Learning the structure of mixed graphical models</article-title>. <source>J Comput Graph Stat</source>. (<year>2015</year>) <volume>24</volume>:<fpage>230</fpage>&#x02013;<lpage>53</lpage>. <pub-id pub-id-type="doi">10.1080/10618600.2014.900500</pub-id><pub-id pub-id-type="pmid">26085782</pub-id></citation></ref>
<ref id="B38">
<label>38.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sedgewick</surname> <given-names>AJ</given-names></name> <name><surname>Shi</surname> <given-names>I</given-names></name> <name><surname>Donovan</surname> <given-names>RM</given-names></name> <name><surname>Benos</surname> <given-names>PV</given-names></name></person-group>. <article-title>Learning mixed graphical models with separate sparsity parameters and stability-based model selection</article-title>. <source>BMC Bioinformatics</source>. (<year>2016</year>) <volume>17</volume>:<fpage>175</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-016-1039-0</pub-id><pub-id pub-id-type="pmid">27294886</pub-id></citation></ref>
<ref id="B39">
<label>39.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Andrews</surname> <given-names>B</given-names></name> <name><surname>Ramsey</surname> <given-names>J</given-names></name> <name><surname>Cooper</surname> <given-names>GF</given-names></name></person-group>. <article-title>Scoring Bayesian networks of mixed variables</article-title>. In: <source>Proceedings of the 2017 ACM SIGKDD Workshop on Causal Discovery.</source> <publisher-loc>Halifax</publisher-loc> (<year>2017</year>).<pub-id pub-id-type="pmid">30140730</pub-id></citation></ref>
<ref id="B40">
<label>40.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scheines</surname> <given-names>R</given-names></name> <name><surname>Spirtes</surname> <given-names>P</given-names></name> <name><surname>Glymour</surname> <given-names>C</given-names></name> <name><surname>Meek</surname> <given-names>C</given-names></name> <name><surname>Richardson</surname> <given-names>T</given-names></name></person-group>. <article-title>The TETRAD project: constraint based aids to causal model specification</article-title>. <source>Multivariate Behav Res</source>. (<year>1998</year>) <volume>33</volume>:<fpage>65</fpage>&#x02013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1207/s15327906mbr3301_3</pub-id><pub-id pub-id-type="pmid">26771757</pub-id></citation></ref>
<ref id="B41">
<label>41.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramsey</surname> <given-names>JD</given-names></name> <name><surname>Zhang</surname> <given-names>K</given-names></name> <name><surname>Glymour</surname> <given-names>M</given-names></name> <name><surname>Romero</surname> <given-names>RS</given-names></name> <name><surname>Huang</surname> <given-names>B</given-names></name> <name><surname>Ebert-Uphoff</surname> <given-names>I</given-names></name> <etal/></person-group>. <article-title>TETRAD&#x02013;a toolbox for causal discovery</article-title>. In: <source>Proceedings of the 8th International Workshop in Climate Informatics.</source> Boulder, CO (<year>2018</year>).</citation>
</ref>
<ref id="B42">
<label>42.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahn</surname> <given-names>SC</given-names></name> <name><surname>Horenstein</surname> <given-names>AR</given-names></name></person-group>. <article-title>Eigenvalue ratio test for the number of factors</article-title>. <source>Econometrica</source>. (<year>2013</year>) <volume>81</volume>:<fpage>1203</fpage>&#x02013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.3982/ECTA8968</pub-id></citation>
</ref>
<ref id="B43">
<label>43.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Roeder</surname> <given-names>K</given-names></name> <name><surname>Wasserman</surname> <given-names>L</given-names></name></person-group>. <article-title>Stability approach to regularization selection (stars) for high dimensional graphical models</article-title>. In: <person-group person-group-type="editor"><name><surname>Lafferty</surname> <given-names>J</given-names></name> <name><surname>Williams</surname> <given-names>C</given-names></name> <name><surname>Shawe-Taylor</surname> <given-names>J</given-names></name> <name><surname>Zemel</surname> <given-names>R</given-names></name> <name><surname>Culotta</surname> <given-names>A</given-names></name></person-group> editors. <source>Advances in Neural Information Processing Systems</source>. <publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>Curran Associates, Inc</publisher-name>. (<year>2010</year>). p. <fpage>1432</fpage>&#x02013;<lpage>40</lpage>.<pub-id pub-id-type="pmid">25152607</pub-id></citation></ref>
<ref id="B44">
<label>44.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Curtis</surname> <given-names>C</given-names></name> <name><surname>Shah</surname> <given-names>SP</given-names></name> <name><surname>Chin</surname> <given-names>SF</given-names></name> <name><surname>Turashvili</surname> <given-names>G</given-names></name> <name><surname>Rueda</surname> <given-names>OM</given-names></name> <name><surname>Dunning</surname> <given-names>MJ</given-names></name> <etal/></person-group>. <article-title>The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups</article-title>. <source>Nature</source>. (<year>2012</year>) <volume>486</volume>:<fpage>346</fpage>&#x02013;<lpage>52</lpage>. <pub-id pub-id-type="doi">10.1038/nature10983</pub-id><pub-id pub-id-type="pmid">22522925</pub-id></citation></ref>
<ref id="B45">
<label>45.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Overmyer</surname> <given-names>KA</given-names></name> <name><surname>Shishkova</surname> <given-names>E</given-names></name> <name><surname>Miller</surname> <given-names>IJ</given-names></name> <name><surname>Balnis</surname> <given-names>J</given-names></name> <name><surname>Bernstein</surname> <given-names>MN</given-names></name> <name><surname>Peters-Clarke</surname> <given-names>TM</given-names></name> <etal/></person-group>. <article-title>Large-scale multi-omic analysis of COVID-19 severity</article-title>. <source>Cell Syst</source>. (<year>2021</year>) <volume>12</volume>:<fpage>23</fpage>&#x02013;<lpage>40</lpage>.e7. <pub-id pub-id-type="doi">10.1016/j.cels.2020.10.003</pub-id><pub-id pub-id-type="pmid">33096026</pub-id></citation></ref>
<ref id="B46">
<label>46.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Subramanian</surname> <given-names>A</given-names></name> <name><surname>Tamayo</surname> <given-names>P</given-names></name> <name><surname>Mootha</surname> <given-names>VK</given-names></name> <name><surname>Mukherjee</surname> <given-names>S</given-names></name> <name><surname>Ebert</surname> <given-names>BL</given-names></name> <name><surname>Gillette</surname> <given-names>MA</given-names></name> <etal/></person-group>. <article-title>Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles</article-title>. <source>Proc Natl Acad Sci USA</source>. (<year>2005</year>) <volume>102</volume>:<fpage>15545</fpage>&#x02013;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0506580102</pub-id><pub-id pub-id-type="pmid">16199517</pub-id></citation></ref>
<ref id="B47">
<label>47.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>T</given-names></name> <name><surname>Hu</surname> <given-names>E</given-names></name> <name><surname>Xu</surname> <given-names>S</given-names></name> <name><surname>Chen</surname> <given-names>M</given-names></name> <name><surname>Guo</surname> <given-names>P</given-names></name> <name><surname>Dai</surname> <given-names>Z</given-names></name> <etal/></person-group>. <article-title>clusterProfiler 4.0: a universal enrichment tool for interpreting omics data</article-title>. <source>Innovation</source>. (<year>2021</year>) <volume>2</volume>:<fpage>100141</fpage>. <pub-id pub-id-type="doi">10.1016/j.xinn.2021.100141</pub-id><pub-id pub-id-type="pmid">34557778</pub-id></citation></ref>
<ref id="B48">
<label>48.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ashburner</surname> <given-names>M</given-names></name> <name><surname>Ball</surname> <given-names>CA</given-names></name> <name><surname>Blake</surname> <given-names>JA</given-names></name> <name><surname>Botstein</surname> <given-names>D</given-names></name> <name><surname>Butler</surname> <given-names>H</given-names></name> <name><surname>Cherry</surname> <given-names>JM</given-names></name> <etal/></person-group>. <article-title>Gene ontology: tool for the unification of biology</article-title>. <source>Nat Genet</source>. (<year>2000</year>) <volume>25</volume>:<fpage>25</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/75556</pub-id><pub-id pub-id-type="pmid">10802651</pub-id></citation></ref>
<ref id="B49">
<label>49.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>G</given-names></name></person-group>. <source>Enrichplot: Visualization of Functional Enrichment Result</source>. R package version. (<year>2019</year>).</citation>
</ref>
<ref id="B50">
<label>50.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaplan</surname> <given-names>EL</given-names></name> <name><surname>Meier</surname> <given-names>P</given-names></name></person-group>. <article-title>Nonparametric estimation from incomplete observations</article-title>. <source>J Am Stat Assoc</source>. (<year>1958</year>) <volume>53</volume>:<fpage>457</fpage>&#x02013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1958.10501452</pub-id></citation>
</ref>
<ref id="B51">
<label>51.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cox</surname> <given-names>DR</given-names></name></person-group>. <article-title>Regression models and life-tables</article-title>. <source>J R Stat Soc Ser B Methodol</source>. (<year>1972</year>) <volume>34</volume>:<fpage>187</fpage>&#x02013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1111/j.2517-6161.1972.tb00899.x</pub-id></citation>
</ref>
<ref id="B52">
<label>52.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Therneau</surname> <given-names>TM</given-names></name></person-group>. <source>A Package for Survival Analysis in R</source>. R package version 3.2-13 (<year>2021</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=survival">https://CRAN.R-project.org/package=survival</ext-link></citation>
</ref>
<ref id="B53">
<label>53.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Kassambara</surname> <given-names>A</given-names></name> <name><surname>Kosinski</surname> <given-names>M</given-names></name> <name><surname>Biecek</surname> <given-names>P</given-names></name> <name><surname>Fabian</surname> <given-names>S</given-names></name></person-group>. <source>survminer: Drawing Survival Curves using &#x02018;ggplot2&#x00027;</source>. R package version 0.4.9 (<year>2021</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://cloud.r-project.org/web/packages/survminer/index.html">https://cloud.r-project.org/web/packages/survminer/index.html</ext-link></citation>
</ref>
<ref id="B54">
<label>54.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robin</surname> <given-names>X</given-names></name> <name><surname>Turck</surname> <given-names>N</given-names></name> <name><surname>Hainard</surname> <given-names>A</given-names></name> <name><surname>Tiberti</surname> <given-names>N</given-names></name> <name><surname>Lisacek</surname> <given-names>F</given-names></name> <name><surname>Sanchez</surname> <given-names>JC</given-names></name> <etal/></person-group>. <article-title>pROC: an open-source package for R and S&#x0002B; to analyze and compare ROC curves</article-title>. <source>BMC Bioinformatics</source>. (<year>2011</year>) <volume>12</volume>:<fpage>77</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2105-12-77</pub-id><pub-id pub-id-type="pmid">21414208</pub-id></citation></ref>
<ref id="B55">
<label>55.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Friedman</surname> <given-names>J</given-names></name> <name><surname>Hastie</surname> <given-names>T</given-names></name> <name><surname>Tibshirani</surname> <given-names>R</given-names></name></person-group>. <article-title>Regularization paths for generalized linear models via coordinate descent</article-title>. <source>J Stat Softw</source>. (<year>2010</year>) <volume>33</volume>:<fpage>1</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v033.i01</pub-id><pub-id pub-id-type="pmid">20808728</pub-id></citation></ref>
<ref id="B56">
<label>56.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kovats</surname> <given-names>S</given-names></name></person-group>. <article-title>Estrogen receptors regulate innate immune cells and signaling pathways</article-title>. <source>Cell Immunol</source>. (<year>2015</year>) <volume>294</volume>:<fpage>63</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1016/j.cellimm.2015.01.018</pub-id><pub-id pub-id-type="pmid">25682174</pub-id></citation></ref>
<ref id="B57">
<label>57.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loi</surname> <given-names>S</given-names></name></person-group>. <article-title>Tumor-infiltrating lymphocytes, breast cancer subtypes and therapeutic efficacy</article-title>. <source>Oncoimmunology</source>. (<year>2013</year>) <volume>2</volume>:<fpage>e24720</fpage>. <pub-id pub-id-type="doi">10.4161/onci.24720</pub-id><pub-id pub-id-type="pmid">24073365</pub-id></citation></ref>
<ref id="B58">
<label>58.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goodman</surname> <given-names>ML</given-names></name> <name><surname>Trinca</surname> <given-names>GM</given-names></name> <name><surname>Walter</surname> <given-names>KR</given-names></name> <name><surname>Papachristou</surname> <given-names>EK</given-names></name> <name><surname>D&#x00027;Santos</surname> <given-names>CS</given-names></name> <name><surname>Li</surname> <given-names>T</given-names></name> <etal/></person-group>. <article-title>Progesterone receptor attenuates STAT1-mediated IFN signaling in breast cancer</article-title>. <source>J Immunol</source>. (<year>2019</year>) <volume>202</volume>:<fpage>3076</fpage>&#x02013;<lpage>86</lpage>. <pub-id pub-id-type="doi">10.4049/jimmunol.1801152</pub-id><pub-id pub-id-type="pmid">30936295</pub-id></citation></ref>
<ref id="B59">
<label>59.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>ZL</given-names></name> <name><surname>Chen</surname> <given-names>RP</given-names></name> <name><surname>Zhou</surname> <given-names>XT</given-names></name> <name><surname>Zhan</surname> <given-names>HL</given-names></name> <name><surname>Hu</surname> <given-names>MM</given-names></name> <name><surname>Liu</surname> <given-names>B</given-names></name> <etal/></person-group>. <article-title>Long non-coding RNA MEG3 induces cell apoptosis in esophageal cancer through endoplasmic reticulum stress</article-title>. <source>Oncol Rep</source>. (<year>2017</year>) <volume>37</volume>:<fpage>3093</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.3892/or.2017.5568</pub-id><pub-id pub-id-type="pmid">28405686</pub-id></citation></ref>
<ref id="B60">
<label>60.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>RP</given-names></name> <name><surname>Huang</surname> <given-names>ZL</given-names></name> <name><surname>Liu</surname> <given-names>LX</given-names></name> <name><surname>Xiang</surname> <given-names>MQ</given-names></name> <name><surname>Li</surname> <given-names>GP</given-names></name> <name><surname>Feng</surname> <given-names>JL</given-names></name> <etal/></person-group>. <article-title>Involvement of endoplasmic reticulum stress and p53 in lncRNA MEG3-induced human hepatoma HepG2 cell apoptosis</article-title>. <source>Oncol Rep</source>. (<year>2016</year>) <volume>36</volume>:<fpage>1649</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.3892/or.2016.4919</pub-id><pub-id pub-id-type="pmid">27432655</pub-id></citation></ref>
<ref id="B61">
<label>61.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>P</given-names></name> <name><surname>Yang</surname> <given-names>F</given-names></name> <name><surname>Jin</surname> <given-names>H</given-names></name> <name><surname>Wang</surname> <given-names>X</given-names></name></person-group>. <article-title>The regulation of protein translation and its implications for cancer</article-title>. <source>Signal Trans Target Therapy</source>. (<year>2021</year>) <volume>6</volume>:<fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/s41392-020-00444-9</pub-id><pub-id pub-id-type="pmid">33597534</pub-id></citation></ref>
<ref id="B62">
<label>62.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gassen</surname> <given-names>NC</given-names></name> <name><surname>Papies</surname> <given-names>J</given-names></name> <name><surname>Bajaj</surname> <given-names>T</given-names></name> <name><surname>Emanuel</surname> <given-names>J</given-names></name> <name><surname>Dethloff</surname> <given-names>F</given-names></name> <name><surname>Chua</surname> <given-names>RL</given-names></name> <etal/></person-group>. <article-title>SARS-CoV-2-mediated dysregulation of metabolism and autophagy uncovers host-targeting antivirals</article-title>. <source>Nat Commun</source>. (<year>2021</year>) <volume>12</volume>:<fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-021-24007-w</pub-id><pub-id pub-id-type="pmid">34155207</pub-id></citation></ref>
<ref id="B63">
<label>63.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yatim</surname> <given-names>N</given-names></name> <name><surname>Boussier</surname> <given-names>J</given-names></name> <name><surname>Chocron</surname> <given-names>R</given-names></name> <name><surname>Hadjadj</surname> <given-names>J</given-names></name> <name><surname>Philippe</surname> <given-names>A</given-names></name> <name><surname>Gendron</surname> <given-names>N</given-names></name> <etal/></person-group>. <article-title>Platelet activation in critically ill COVID-19 patients</article-title>. <source>Ann Intensive Care</source>. (<year>2021</year>) <volume>11</volume>:<fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1186/s13613-021-00899-1</pub-id><pub-id pub-id-type="pmid">34273008</pub-id></citation></ref>
<ref id="B64">
<label>64.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perreau</surname> <given-names>M</given-names></name> <name><surname>Suffiotti</surname> <given-names>M</given-names></name> <name><surname>Marques-Vidal</surname> <given-names>P</given-names></name> <name><surname>Wiedemann</surname> <given-names>A</given-names></name> <name><surname>Levy</surname> <given-names>Y</given-names></name> <name><surname>Laou&#x000E9;nan</surname> <given-names>C</given-names></name> <etal/></person-group>. <article-title>The cytokines HGF and CXCL13 predict the severity and the mortality in COVID-19 patients</article-title>. <source>Nat Commun</source>. (<year>2021</year>) <volume>12</volume>:<fpage>1</fpage>&#x02013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-021-25191-5</pub-id><pub-id pub-id-type="pmid">34373466</pub-id></citation></ref>
<ref id="B65">
<label>65.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>E</given-names></name> <name><surname>Sterne-Weiler</surname> <given-names>T</given-names></name> <name><surname>O&#x00027;Hanlon</surname> <given-names>D</given-names></name> <name><surname>Blencowe</surname> <given-names>B</given-names></name></person-group>. <article-title>Global Mapping of Human RNA-RNA Interactions</article-title>. <source>Mol Cell</source>. (<year>2016</year>) <volume>62</volume>:<fpage>618</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1016/j.molcel.2016.04.030</pub-id><pub-id pub-id-type="pmid">27184080</pub-id></citation></ref>
<ref id="B66">
<label>66.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ignatov</surname> <given-names>D</given-names></name> <name><surname>Vaitkevicius</surname> <given-names>K</given-names></name> <name><surname>Durand</surname> <given-names>S</given-names></name> <name><surname>Cahoon</surname> <given-names>L</given-names></name> <name><surname>Sandberg</surname> <given-names>SS</given-names></name> <name><surname>Liu</surname> <given-names>X</given-names></name> <etal/></person-group>. <article-title>An mRNA-mRNA interaction couples expression of a virulence factor and its chaperone in <italic>Listeria monocytogenes</italic></article-title>. <source>Cell Rep</source>. (<year>2020</year>) <volume>30</volume>:<fpage>4027</fpage>&#x02013;<lpage>40</lpage>.e7. <pub-id pub-id-type="doi">10.1016/j.celrep.2020.03.006</pub-id><pub-id pub-id-type="pmid">32209466</pub-id></citation></ref>
<ref id="B67">
<label>67.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>N</given-names></name> <name><surname>Niu</surname> <given-names>G</given-names></name> <name><surname>Xie</surname> <given-names>Z</given-names></name> <name><surname>Chen</surname> <given-names>Z</given-names></name> <name><surname>Itzek</surname> <given-names>A</given-names></name> <name><surname>Kreth</surname> <given-names>J</given-names></name> <etal/></person-group>. <article-title>The <italic>Streptococcus mutans</italic> irvA gene encodes a trans-acting riboregulatory mRNA</article-title>. <source>Mol Cell</source>. (<year>2015</year>) <volume>57</volume>:<fpage>179</fpage>&#x02013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.1016/j.molcel.2014.11.003</pub-id><pub-id pub-id-type="pmid">25574948</pub-id></citation></ref>
<ref id="B68">
<label>68.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Friedman</surname> <given-names>N</given-names></name></person-group>. <article-title>Inferring cellular networks using probabilistic graphical models</article-title>. <source>Science</source>. (<year>2004</year>) <volume>303</volume>:<fpage>799</fpage>&#x02013;<lpage>805</lpage>. <pub-id pub-id-type="doi">10.1126/science.1094068</pub-id><pub-id pub-id-type="pmid">14764868</pub-id></citation></ref>
<ref id="B69">
<label>69.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dobra</surname> <given-names>A</given-names></name> <name><surname>Hans</surname> <given-names>C</given-names></name> <name><surname>Jones</surname> <given-names>B</given-names></name> <name><surname>Nevins</surname> <given-names>JR</given-names></name> <name><surname>Yao</surname> <given-names>G</given-names></name> <name><surname>West</surname> <given-names>M</given-names></name></person-group>. <article-title>Sparse graphical models for exploring gene expression data</article-title>. <source>J Multivariate Anal</source>. (<year>2004</year>) <volume>90</volume>:<fpage>196</fpage>&#x02013;<lpage>212</lpage>. <pub-id pub-id-type="doi">10.1016/j.jmva.2004.02.009</pub-id><pub-id pub-id-type="pmid">28487748</pub-id></citation></ref>
<ref id="B70">
<label>70.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huynh-Thu</surname> <given-names>VA</given-names></name> <name><surname>Irrthum</surname> <given-names>A</given-names></name> <name><surname>Wehenkel</surname> <given-names>L</given-names></name> <name><surname>Geurts</surname> <given-names>P</given-names></name></person-group>. <article-title>Inferring regulatory networks from expression data using tree-based methods</article-title>. <source>PLoS ONE</source>. (<year>2010</year>) <volume>5</volume>:<fpage>e12776</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0012776</pub-id><pub-id pub-id-type="pmid">20927193</pub-id></citation></ref>
</ref-list>
</back>
</article>