<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2019.00265</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>NoMAS: A Computational Approach to Find Mutated Subnetworks Associated With Survival in Genome-Wide Cancer Studies</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Altieri</surname> <given-names>Federico</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/671939/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hansen</surname> <given-names>Tommy V.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Vandin</surname> <given-names>Fabio</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/358580/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Information Engineering, University of Padova</institution>, <addr-line>Padova</addr-line>, <country>Italy</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Mathematics and Computer Science, University of Southern Denmark</institution>, <addr-line>Odense</addr-line>, <country>Denmark</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Marco Pellegrini, Italian National Research Council (CNR), Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Georges Nemer, American University of Beirut, Lebanon; Salvatore Alaimo, University of Catania, Italy</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Fabio Vandin <email>vandinfa&#x00040;dei.unipd.it</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Systems Biology, a section of the journal Frontiers in Genetics</p></fn></author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>04</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>10</volume>
<elocation-id>265</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>11</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>03</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2019 Altieri, Hansen and Vandin.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Altieri, Hansen and Vandin</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Next-generation sequencing technologies make it possible to measure somatic mutations in a large number of patients with the same cancer type: one of the main goals in their analysis is the identification of mutations associated with clinical parameters. The identification of such relationships is hindered by extensive genetic heterogeneity in tumors, with different genes mutated in different patients, due, in part, to the fact that genes and mutations act in the context of <italic>pathways</italic>: it is therefore crucial to study mutations in the context of interactions among genes. In this work we study the problem of identifying subnetworks of a large gene-gene interaction network with mutations associated with survival time. We formally define the associated computational problem by using a score for subnetworks based on the log-rank statistical test to compare the survival of two given populations. We propose a novel approach, based on a new algorithm, called <underline>N</underline>etwork <underline>o</underline>f <underline>M</underline>utations <underline>A</underline>ssociated with <underline>S</underline>urvival (NoMAS), to find subnetworks of a large interaction network whose mutations are associated with survival time. NoMAS is based on the color-coding technique, which has previously been employed in other applications to find the highest scoring subnetwork with high probability when the subnetwork score is additive. In our case the score is not additive, so our algorithm cannot identify the optimal solution with the same guarantees that hold for additive scores. Nonetheless, we prove that, under a reasonable model for mutations in cancer, NoMAS identifies the optimal solution with high probability. We also design a holdout approach to identify subnetworks significantly associated with survival time. We test NoMAS on simulated and cancer data, comparing it to approaches based on single gene tests and to various greedy approaches. 
We show that our method does indeed find the optimal solution and performs better than the other approaches. Moreover, on three cancer datasets our method identifies subnetworks with significant association to survival when none of the genes has significant association with survival when considered in isolation.</p></abstract>
<kwd-group>
<kwd>cancer genomics</kwd>
<kwd>survival analysis</kwd>
<kwd>network analysis</kwd>
<kwd>log-rank statistic</kwd>
<kwd>holdout approach</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="0"/>
<equation-count count="1"/>
<ref-count count="41"/>
<page-count count="12"/>
<word-count count="9442"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Recent advances in next-generation sequencing technologies have enabled the collection of sequence information from many genomes and exomes, with many large human and cancer genetic studies measuring mutations in all genes for a large number of patients of a specific disease (Cancer Genome Atlas Research Network, <xref ref-type="bibr" rid="B6">2013</xref>, <xref ref-type="bibr" rid="B7">2014</xref>; Cancer Genome Atlas Network, <xref ref-type="bibr" rid="B4">2015</xref>; Cancer Genome Atlas Research Network et al., <xref ref-type="bibr" rid="B8">2017</xref>; Raphael et al., <xref ref-type="bibr" rid="B28">2017</xref>). One of the main challenges in these studies is the interpretation of such mutations, in particular the identification of mutations that are clinically relevant. For example, in large cancer studies one is interested in finding somatic mutations that are associated with survival and that can be used for prognosis and therapeutic decisions. One of the main obstacles in finding mutations that are clinically relevant is the large number of mutations present in each cancer genome. Recent studies have shown that each cancer genome harbors hundreds or thousands of somatic mutations (Garraway and Lander, <xref ref-type="bibr" rid="B14">2013</xref>), with only a small number (e.g., &#x02264; 10) of <italic>driver</italic> mutations related to the disease, while the vast majority of mutations are <italic>passenger</italic>, random mutations that are accumulated during the process that leads to cancer but not related to the disease (Vogelstein et al., <xref ref-type="bibr" rid="B38">2013</xref>).</p>
<p>In recent years, several computational and statistical methods have been designed to identify driver mutations and distinguish them from passenger mutations, exploiting data from large cancer studies (Raphael et al., <xref ref-type="bibr" rid="B27">2014</xref>). Many of these methods analyze each gene in isolation, and use different single gene scores (e.g., mutation frequency, clustering of mutations) to identify significant genes (Dees et al., <xref ref-type="bibr" rid="B13">2012</xref>; Lawrence et al., <xref ref-type="bibr" rid="B20">2013</xref>; Tamborero et al., <xref ref-type="bibr" rid="B34">2013</xref>). While useful in finding driver genes, these methods suffer from the extensive <italic>heterogeneity</italic> of mutations in cancer, with different patients showing mutations in different cancer genes (Kandoth et al., <xref ref-type="bibr" rid="B18">2013</xref>). One of the reasons for such mutational heterogeneity is the fact that driver mutations do not target single genes but rather <italic>pathways</italic> (Vogelstein et al., <xref ref-type="bibr" rid="B38">2013</xref>), groups of interacting genes that perform different functions in the cell. Several methods have recently been proposed to identify significant groups of interacting genes in cancer (Vandin et al., <xref ref-type="bibr" rid="B37">2012b</xref>; Hofree et al., <xref ref-type="bibr" rid="B15">2013</xref>; Kim et al., <xref ref-type="bibr" rid="B19">2015</xref>; Leiserson et al., <xref ref-type="bibr" rid="B21">2015a</xref>,<xref ref-type="bibr" rid="B22">b</xref>; Shrestha et al., <xref ref-type="bibr" rid="B31">2017</xref>). Many of these methods integrate mutations with interactions from genome-scale interaction networks, without restricting the analysis to already known pathways, which would hinder the ability to discover new important groups of genes.</p>
<p>In addition to mutation data, large cancer studies often also collect clinical data, including survival information, regarding the patients. An important feature of survival data is that it often contains <italic>censored</italic> measurements (Kalbfleisch and Prentice, <xref ref-type="bibr" rid="B17">2002</xref>): in many studies a patient may be alive at the end of the study or may leave the study before it ends, therefore only a lower bound on the survival of the patient is known. Survival information is crucial in identifying mutations that have a clinical impact. However, the survival information is commonly used only <italic>after</italic> candidate genes or groups of genes have been identified using other methods, such as the ones described above, to evaluate the clinical significance of such genes or groups of genes (Cancer Genome Atlas Research Network, <xref ref-type="bibr" rid="B5">2011</xref>; Hofree et al., <xref ref-type="bibr" rid="B15">2013</xref>). Overall, there is a lack of methods that integrate mutations, interaction information, and survival data to directly identify groups of interacting genes associated with survival.</p>
<p>The field of survival analysis has produced an extensive literature on the analysis of survival data, in particular for the comparison of the survival of two given populations (sets of samples) (Kalbfleisch and Prentice, <xref ref-type="bibr" rid="B17">2002</xref>). The most commonly used test for this purpose is the log-rank test (Mantel, <xref ref-type="bibr" rid="B23">1966</xref>; Peto and Peto, <xref ref-type="bibr" rid="B26">1972</xref>). In genomic studies we are not given two populations, but a single set of samples, and are required to identify mutations that are associated with survival. The log-rank test can be used to this end to identify single genes associated with survival time by comparing the survival of the patients with a mutation in the gene with the survival of the patients with no mutation in the gene. The other commonly used test, the Cox Proportional-Hazards model (Kalbfleisch and Prentice, <xref ref-type="bibr" rid="B17">2002</xref>), is equivalent to the log-rank test when the association of a binary feature with survival is tested, as is the case in genomic studies. For a given group of genes, one can <italic>assess</italic> the association of mutations in the genes of the group with survival by comparing the survival of the patients having a mutation in at least one of the genes with the survival of the patients with no mutation in the genes. However, this approach cannot be used to <italic>discover</italic> sets of genes, since one would have to screen all possible subsets of genes and test their association with survival, and the number of subsets of genes to screen is enormous even considering only groups of genes interacting in a protein interaction network (e.g., there are &#x0003E;10<sup>15</sup> groups of 8 interacting genes in the HINT&#x0002B;HI2012 network; Leiserson et al., <xref ref-type="bibr" rid="B22">2015b</xref>).</p>
<p>In this paper we study the problem of finding sets of interacting genes with mutations associated with survival using data from large cancer sequencing studies and interaction information from a genome-scale interaction network. We focus on the widely used log-rank statistic as a measure of the association between mutations in a group of genes and survival. Our contribution has five parts: first, we formally define the problem of finding the set of <italic>k</italic> genes whose mutations show the maximum association with survival time by using the log-rank statistic as a score for a set of genes: we show that this problem is NP-hard. We show that the problem remains hard when the set of <italic>k</italic> genes is required to form a connected subnetwork in a large graph with at least one node of large degree (<italic>hub</italic>). Second, we propose an efficient algorithm, <underline>N</underline>etwork <underline>o</underline>f <underline>M</underline>utations <underline>A</underline>ssociated with <underline>S</underline>urvival (NoMAS), based on the color-coding technique, to identify subnetworks associated with survival time. Color-coding has been previously used to find high scoring graphs for bioinformatics applications (Dao et al., <xref ref-type="bibr" rid="B11">2011</xref>; Hormozdiari et al., <xref ref-type="bibr" rid="B16">2015</xref>) when the score for a subnetwork is <italic>set additive</italic> (i.e., the score of a subnetwork is the sum of the scores of the genes in the subnetwork). In our case the log-rank statistic is not set additive, and we prove that there is a family of instances for which our algorithm cannot identify the optimal solution. Nonetheless, we prove that, under a reasonable model for mutations in cancer, our algorithm identifies the optimal solution with high probability. Third, we test our algorithm on simulated data and on data from three large cancer studies from The Cancer Genome Atlas (TCGA). 
On simulated data, we show that our algorithm does find the optimal solution while being much more efficient than the exhaustive algorithm that screens all sets of genes. On cancer data, we show that our algorithm finds the optimal solution for all values of <italic>k</italic> for which the use of the exhaustive algorithm is feasible, and identifies better solutions (in terms of association with survival) than a greedy algorithm similar to the one used in Reimand and Bader (<xref ref-type="bibr" rid="B29">2013</xref>). Fourth, to strengthen the statistical reliability of NoMAS&#x00027;s results, we employ a holdout scheme, splitting the patient dataset into two parts, a <italic>training</italic> set and a <italic>holdout</italic> set. While the solutions of NoMAS are computed on the former, their statistical significance is assessed on the latter, thus providing a correction for the multiple hypothesis testing performed on the training set. Finally, we show that NoMAS identifies better solutions than those obtained using an additive score for a set of genes (i.e., the same gene score used in Vandin et al., <xref ref-type="bibr" rid="B35">2012a</xref>). For the cancer datasets, we show that our algorithm identifies novel groups of genes associated with survival in which no gene is associated with survival when considered in isolation. The work is organized as follows: in section 2 we provide the description of the model and NoMAS; section 3 presents the analysis of the algorithm (section 3.1), including the analysis under a reasonable model for mutations in cancer and analysis of our experiments on both simulated and real data (section 3.2); finally section 4 presents the discussion of our results. Details for our theoretical results are given in <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2. Materials and Methods</title>
<p>In this section we present the model we consider, our algorithm NoMAS, and the tests we have designed to assess the statistical significance of the results.</p>
<sec>
<title>2.1. Computational Problem</title>
<p>In survival analysis, we are given two populations (i.e., sets of samples) <italic>P</italic><sub>0</sub> and <italic>P</italic><sub>1</sub>, and for each sample <italic>i</italic> &#x02208; <italic>P</italic><sub>0</sub> &#x0222A; <italic>P</italic><sub>1</sub> we have its survival data: i) the survival time <italic>t</italic><sub><italic>i</italic></sub> and ii) the censoring information <italic>c</italic><sub><italic>i</italic></sub>, where <italic>c</italic><sub><italic>i</italic></sub> &#x0003D; 1 if <italic>t</italic><sub><italic>i</italic></sub> is the exact survival time for sample <italic>i</italic> (i.e., sample <italic>i</italic> is not censored), and <italic>c</italic><sub><italic>i</italic></sub> &#x0003D; 0 if <italic>t</italic><sub><italic>i</italic></sub> is a lower bound to the survival time for sample <italic>i</italic> (i.e., sample <italic>i</italic> is censored). Let <italic>m</italic><sub>0</sub> be the number of samples in <italic>P</italic><sub>0</sub>, <italic>m</italic><sub>1</sub> be the number of samples in <italic>P</italic><sub>1</sub>, and <italic>m</italic> &#x0003D; <italic>m</italic><sub>0</sub> &#x0002B; <italic>m</italic><sub>1</sub> be the total number of samples. Without loss of generality, the samples are {1, 2, &#x02026;, <italic>m</italic>}, the survival times are <italic>t</italic> &#x0003D; 1, 2, &#x02026;, <italic>m</italic>, with <italic>t</italic><sub><italic>i</italic></sub> &#x0003D; <italic>i</italic> (i.e., the samples are sorted by increasing values of survival), and we assume that there are no ties in survival times. 
The survival data is represented by two vectors <bold>c</bold> and <bold>x</bold>, with <italic>c</italic><sub><italic>i</italic></sub> representing the censoring information for sample <italic>i</italic>, and <italic>x</italic><sub><italic>i</italic></sub> representing the population information: <italic>x</italic><sub><italic>i</italic></sub> &#x0003D; 1 if sample <italic>i</italic> is in population <italic>P</italic><sub>1</sub>, and <italic>x</italic><sub><italic>i</italic></sub> &#x0003D; 0 otherwise. Given the survival data for two populations <italic>P</italic><sub>0</sub> and <italic>P</italic><sub>1</sub>, the significance of the difference in survival between <italic>P</italic><sub>0</sub> and <italic>P</italic><sub>1</sub> can be assessed by the widely used log-rank test (Mantel, <xref ref-type="bibr" rid="B23">1966</xref>; Peto and Peto, <xref ref-type="bibr" rid="B26">1972</xref>). The log-rank statistic is</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mstyle><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Under the (null) hypothesis of no difference in survival between <italic>P</italic><sub>0</sub> and <italic>P</italic><sub>1</sub>, the log-rank statistic asymptotically follows a normal distribution <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where the standard deviation<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> is given by: <inline-formula><mml:math id="M3"><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msqrt><mml:mo>.</mml:mo></mml:math></inline-formula> Thus the normalized log-rank statistic, defined as <inline-formula><mml:math id="M4"><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>, asymptotically follows a standard normal <inline-formula><mml:math id="M5"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> distribution, and the deviation of <inline-formula><mml:math id="M6"><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle 
mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula> from 0 is a measure of the difference in survival between <italic>P</italic><sub>0</sub> and <italic>P</italic><sub>1</sub>.</p>
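<p>As an illustration only, the log-rank statistic of Equation (1) and the standard deviation given above can be computed with the following minimal Python sketch. It is not the NoMAS implementation: the function names <monospace>log_rank_statistic</monospace> and <monospace>log_rank_sigma</monospace> are hypothetical, and the sketch assumes that <bold>x</bold> and <bold>c</bold> are 0/1 lists already sorted by increasing survival time, with no ties and at least two samples.</p>
<preformat>
```python
import math


def log_rank_statistic(x, c):
    """Log-rank statistic V(x, c) of Equation (1).

    Assumption: x and c are 0/1 lists of equal length m, with the samples
    sorted by increasing survival time and no ties.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    m = len(x)
    m1 = sum(x)          # number of samples in population P1
    seen_in_p1 = 0       # samples of P1 among samples 1, ..., j-1
    v = 0.0
    for j in range(1, m + 1):
        at_risk = m - j + 1                       # samples still at risk at time j
        expected = (m1 - seen_in_p1) / at_risk    # fraction of P1 among them
        v += c[j - 1] * (x[j - 1] - expected)     # censored samples contribute 0
        seen_in_p1 += x[j - 1]
    return v


def log_rank_sigma(x, c):
    """Standard deviation sigma(x, c) of V under the null hypothesis.

    Assumption: m is at least 2, otherwise the formula is undefined.
    """
    m = len(x)
    m1 = sum(x)
    m0 = m - m1
    uncensored = sum(c)
    weighted = sum(c[j - 1] / (m - j + 1) for j in range(1, m + 1))
    return math.sqrt(m0 * m1 / (m * (m - 1)) * (uncensored - weighted))


def normalized_log_rank(x, c):
    """Normalized statistic V / sigma (assumes sigma is nonzero)."""
    return log_rank_statistic(x, c) / log_rank_sigma(x, c)
```
</preformat>
<p>For example, with <bold>x</bold> &#x0003D; (1, 0, 1, 0) and <bold>c</bold> &#x0003D; (1, 1, 0, 1), only the first two samples contribute non-zero terms to Equation (1), giving <italic>V</italic> &#x0003D; 1/2 &#x02212; 1/3 &#x0003D; 1/6. The running counter avoids recomputing the inner sum of Equation (1) for every <italic>j</italic>, so the statistic is obtained in time linear in <italic>m</italic>.</p>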
<p>In genomic studies, we are given mutation data for a set <inline-formula><mml:math id="M7"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of <italic>n</italic> genes in a set <inline-formula><mml:math id="M8"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow></mml:math></inline-formula> of <italic>m</italic> samples, represented by a mutation matrix <italic>M</italic> with <italic>M</italic><sub><italic>i, j</italic></sub> &#x0003D; 1 if gene <italic>i</italic> is mutated in patient <italic>j</italic> and <italic>M</italic><sub><italic>i, j</italic></sub> &#x0003D; 0 otherwise. We are also given survival data (survival time and censoring information) for all the <italic>m</italic> samples. Given a set <inline-formula><mml:math id="M9"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02282;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of genes, we can assess the association of mutations in the set <inline-formula><mml:math id="M10"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> with survival by comparing the survival of the population <inline-formula><mml:math id="M11"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> of samples with a mutation in at least one gene of <inline-formula><mml:math id="M12"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> and the survival of the population <inline-formula><mml:math id="M13"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> of samples with no mutation in the genes of <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula>. That is, <inline-formula><mml:math id="M15"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M16"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
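<p>The two populations induced by a gene set can be derived directly from the mutation matrix. The following Python sketch is a hypothetical illustration, not code from NoMAS: it assumes that <italic>M</italic> is stored as a list of rows (one per gene, one column per patient), that genes and patients are identified by 0-based indices, and the helper names <monospace>mutation_indicator</monospace> and <monospace>split_populations</monospace> are introduced here only for the example.</p>
<preformat>
```python
def mutation_indicator(M, S):
    """Return a 0/1 list with entry j equal to 1 if patient j has a
    mutation in at least one gene of S, and 0 otherwise.

    Assumption: M[i][j] is 1 if gene i is mutated in patient j, else 0,
    and S is an iterable of (0-based) gene row indices.
    """
    genes = list(S)
    num_patients = len(M[0])
    return [1 if any(M[i][j] for i in genes) else 0
            for j in range(num_patients)]


def split_populations(M, S):
    """Return (P0, P1): patients with no mutation in S, and patients
    with a mutation in at least one gene of S."""
    indicator = mutation_indicator(M, S)
    p0 = [j for j, value in enumerate(indicator) if value == 0]
    p1 = [j for j, value in enumerate(indicator) if value == 1]
    return p0, p1
```
</preformat>
<p>The indicator list returned by <monospace>mutation_indicator</monospace> plays the role of the population vector <bold>x</bold> in Equation (1), so the association of a gene set with survival can be scored by passing it, together with the censoring vector <bold>c</bold>, to any routine that evaluates the log-rank statistic.</p>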
<p>Given the set <inline-formula><mml:math id="M17"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of all genes for which mutations have been measured, we are interested in finding the set <inline-formula><mml:math id="M18"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02282;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> with <inline-formula><mml:math id="M19"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> that has maximum association with survival by finding the set <inline-formula><mml:math id="M20"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> that maximizes the absolute value of the normalized log-rank statistic. Given a set <inline-formula><mml:math id="M21"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> of genes, let <inline-formula><mml:math id="M22"><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> be a 0&#x02212;1 vector, with <inline-formula><mml:math id="M23"><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> if at least one gene of <inline-formula><mml:math id="M24"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> is mutated in patient <italic>i</italic>, and <inline-formula><mml:math 
id="M25"><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> otherwise. The normalized log-rank statistic for the set <inline-formula><mml:math id="M26"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> is then <inline-formula><mml:math id="M27"><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>. 
Note that for a given set of patients the censoring information <bold>c</bold> is fixed, therefore we can consider the log-rank statistic as a function <inline-formula><mml:math id="M28"><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of <inline-formula><mml:math id="M29"><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> only. Analogously, we can rewrite <inline-formula><mml:math id="M30"><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M31"><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msqrt></mml:math></inline-formula> with <inline-formula><mml:math id="M32"><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mo>|</mml:mo></mml:math></inline-formula>, and <inline-formula><mml:math id="M33"><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msqrt></mml:math></inline-formula> does not depend on <bold>x</bold><sup><italic>S</italic></sup> and is fixed given <bold>c</bold>.</p>
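<p>As an illustration only, the quantities defined above can be sketched in Python. The function names (<monospace>indicator_vector</monospace>, <monospace>sigma_x</monospace>, <monospace>f_c</monospace>) and the data layout (a list of per-patient rows, with patients indexed by increasing survival time) are assumptions made for this sketch and are not taken from the NoMAS implementation.</p>
<preformat><![CDATA[
```python
import math


def indicator_vector(mutations, gene_set):
    """Build x^S: entry i is 1 if patient i has a mutation in any gene of gene_set.

    Assumed layout: mutations[i][j] > 0 means gene j is mutated in patient i.
    """
    return [1 if any(row[j] > 0 for j in gene_set) else 0 for row in mutations]


def sigma_x(x):
    """sigma(x^S) = sqrt(m1 * (m - m1)), where m1 is the number of ones in x."""
    m, m1 = len(x), sum(x)
    return math.sqrt(m1 * (m - m1))


def f_c(c):
    """f(c) from the text, with c[j-1] the censoring indicator of the j-th patient.

    Assumption: patients are indexed by increasing survival time.
    The two sums are merged term by term, which keeps every summand non-negative.
    """
    m = len(c)
    inner = sum(cj * (1.0 - 1.0 / (m - j + 1)) for j, cj in enumerate(c, start=1))
    return math.sqrt(inner / (m * (m - 1)))
```
]]></preformat>
<p>Since <monospace>f_c</monospace> depends only on the censoring vector, it can be computed once per dataset, while <monospace>sigma_x</monospace> must be recomputed for every candidate gene set.</p>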
<p>To identify the set of <italic>k</italic> genes most associated with survival, we can then consider the score <inline-formula><mml:math id="M34"><mml:mo>|</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>|</mml:mo></mml:math></inline-formula>. 
For ease of exposition, in what follows we consider the score <inline-formula><mml:math id="M35"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, corresponding to a one-tailed log-rank test for the identification of sets of genes with mutations associated with reduced survival; the identification of sets of genes with mutations associated with increased survival is done in an analogous way by maximizing the score <inline-formula><mml:math id="M36"><mml:mo>-</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. We define the following problem.</p>
<p><bold>The max</bold> <italic>k</italic><bold>-set log-rank problem:</bold> Given a set <inline-formula><mml:math id="M37"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of genes, an <italic>n</italic> &#x000D7; <italic>m</italic> mutation matrix <italic>M</italic>, and the survival information (time and censoring) for the <italic>m</italic> patients in <italic>M</italic>, find the set <inline-formula><mml:math id="M38"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02282;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of <italic>k</italic> genes maximizing <inline-formula><mml:math id="M39"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<p>We have the following.</p>
<p><bold>Theorem 1</bold>. <italic>The max <italic>k</italic>-set log-rank problem is NP-hard</italic>.</p>
<p>We now define the max connected <italic>k</italic>-set log-rank problem that is analogous to the max <italic>k</italic>-set log-rank problem but requires feasible solutions to be connected subnetworks of a given graph <italic>I</italic>, representing gene-gene interactions.</p>
<p><bold>The max connected</bold> <italic>k</italic><bold>-set log-rank problem:</bold> Given a set <inline-formula><mml:math id="M40"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> of genes, a graph <inline-formula><mml:math id="M41"><mml:mi>I</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>E</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> with <inline-formula><mml:math id="M42"><mml:mi>E</mml:mi><mml:mo>&#x02286;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula>, an <italic>n</italic> &#x000D7; <italic>m</italic> mutation matrix <italic>M</italic>, and the survival information (time and censoring) for the <italic>m</italic> patients in <italic>M</italic>, find the set <inline-formula><mml:math id="M43"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> of <italic>k</italic> genes maximizing <inline-formula><mml:math id="M44"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> with the constraint that the subnetwork induced by <inline-formula><mml:math id="M45"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> in <italic>I</italic> is connected.</p>
<p>If <italic>I</italic> is the complete graph, the max connected <italic>k</italic>-set log-rank problem is the same as the max <italic>k</italic>-set log-rank problem. Thus, the max connected <italic>k</italic>-set log-rank problem is NP-hard for a general graph. Moreover, we can prove that the problem remains NP-hard for a much more general class of graphs.</p>
<p><bold>Theorem 2</bold>. <italic>The max connected k-set log-rank problem on graphs with at least one node of degree</italic> <inline-formula><mml:math id="M1000"><mml:mrow><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>c</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, <italic>where c</italic> &#x0003E; 1 <italic>is constant, is NP-hard</italic>.</p>
</sec>
<sec>
<title>2.2. Algorithm NoMAS</title>
<p>We design a new algorithm, <underline>N</underline>etwork <underline>o</underline>f <underline>M</underline>utations <underline>A</underline>ssociated with <underline>S</underline>urvival (NoMAS)<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, to solve the max connected <italic>k</italic>-set log-rank problem. The algorithm is based on an adaptation of the color-coding technique (Alon et al., <xref ref-type="bibr" rid="B2">1994</xref>). Our algorithm is analogous to other color-coding based algorithms that have been used before to identify subnetworks associated with phenotypes in other applications where the score is additive (Dao et al., <xref ref-type="bibr" rid="B11">2011</xref>; Hormozdiari et al., <xref ref-type="bibr" rid="B16">2015</xref>).</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> provides an overview of NoMAS. The input to NoMAS is an undirected graph <italic>G</italic> &#x0003D; (<italic>V, E</italic>), an <italic>n</italic> &#x000D7; <italic>m</italic> mutation matrix <italic>M</italic>, and the survival information <bold>x</bold>, <bold>c</bold> for the <italic>m</italic> patients in <italic>M</italic>. NoMAS first identifies a subnetwork <inline-formula><mml:math id="M46"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> with high weight <inline-formula><mml:math id="M47"><mml:mfrac><mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>. To identify a subnetwork of high weight, the algorithm proceeds in iterations. 
In each iteration NoMAS colors <italic>G</italic> with <italic>k</italic> colors by assigning to each vertex <italic>v</italic> a color <inline-formula><mml:math id="M48"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> chosen uniformly at random. For a given coloring of <italic>G</italic>, a subnetwork <inline-formula><mml:math id="M49"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> is said to be <italic>colorful</italic> if all vertices in <inline-formula><mml:math id="M50"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> have distinct colors. The <italic>colorset</italic> of <inline-formula><mml:math id="M51"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> is the set of colors of the vertices in <inline-formula><mml:math id="M52"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula>. Note that the number of different colorsets (subsets of {1, &#x02026;, <italic>k</italic>}) is 2<sup><italic>k</italic></sup>. In each iteration the algorithm efficiently identifies high-scoring colorful subnetworks, and at the end the highest-scoring subnetwork among all iterations is reported.</p>
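<p>The coloring step and the notions of colorset and colorful subnetwork can be sketched as follows. This is a minimal illustration, not the NoMAS code; the function names and the representation of a coloring as a dictionary are assumptions of the sketch.</p>
<preformat><![CDATA[
```python
import random


def random_coloring(vertices, k, rng):
    """Assign each vertex a color in {1, ..., k} chosen uniformly at random."""
    return {v: rng.randint(1, k) for v in vertices}


def colorset(subnetwork, coloring):
    """Return the set of colors used by the vertices of the subnetwork."""
    return frozenset(coloring[v] for v in subnetwork)


def is_colorful(subnetwork, coloring):
    """A subnetwork is colorful if all of its vertices have distinct colors."""
    return len(colorset(subnetwork, coloring)) == len(subnetwork)
```
]]></preformat>
<p>For example, with the coloring <monospace>{"a": 1, "b": 2, "c": 1}</monospace>, the set {a, b} is colorful while {a, c} is not.</p>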
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Algorithm NoMAS. Given alteration data and survival information (time and censoring status) for a set of patients, NoMAS employs a color coding approach to identify subnetworks with mutations associated with survival time, i.e., with high log-rank statistic, and then assesses the statistical significance of the subnetworks using (i) permutation testing and (ii) a holdout approach.</p></caption>
<graphic xlink:href="fgene-10-00265-g0001.tif"/>
</fig>
<p>Consider a given coloring of <italic>G</italic>. Let <italic>W</italic> be a (2<sup><italic>k</italic></sup> &#x02212; 1) &#x000D7; |<italic>V</italic>| table with a row for each non-empty colorset and a column for each vertex in <italic>G</italic>. Entry <italic>W</italic>(<italic>T, u</italic>) stores the set of vertices of one connected colorful subnetwork that has colorset <italic>T</italic> and includes vertex <italic>u</italic>. Entries of <italic>W</italic> can be filled by dynamic programming. For colorsets of size 1, the corresponding rows in <italic>W</italic> are filled out trivially: <italic>W</italic>({&#x003B1;}, <italic>u</italic>) &#x0003D; {<italic>u</italic>} if <inline-formula><mml:math id="M53"><mml:mi>&#x003B1;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and <italic>W</italic>({&#x003B1;}, <italic>u</italic>) &#x0003D; &#x02205; otherwise.</p>
<p>For entry <italic>W</italic>(<italic>T, u</italic>) with |<italic>T</italic>| &#x02265; 2, NoMAS computes <italic>W</italic>(<italic>T, u</italic>) by combining a previously computed <italic>W</italic>(<italic>Q, u</italic>) for <italic>u</italic> with another previously computed <italic>W</italic>(<italic>R, v</italic>) where <italic>v</italic> is a neighbor of <italic>u</italic> in <italic>G</italic>, ensuring that the resulting subnetwork is connected and contains <italic>u</italic>. Colorfulness is ensured by selecting <italic>Q</italic> and <italic>R</italic> such that <italic>Q</italic> &#x02229; <italic>R</italic> &#x0003D; &#x02205; and <italic>Q</italic> &#x0222A; <italic>R</italic> &#x0003D; <italic>T</italic>, and in turn ensures that <italic>W</italic>(<italic>T, u</italic>) contains |<italic>T</italic>| distinct vertices. Note that for a given <italic>T</italic> the choice of <italic>Q</italic> uniquely defines <italic>R</italic>. Thus, for each neighbor <italic>v</italic> of <italic>u</italic> there are (at most) 2<sup>|<italic>T</italic>|&#x02212;1</sup> possible combinations. Let <inline-formula><mml:math id="M54"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> be the set of all colorful subnetworks that can be obtained by combining an entry <italic>W</italic>(<italic>Q, u</italic>) for <italic>u</italic> and an appropriate entry <italic>W</italic>(<italic>R, v</italic>) for a neighbor <italic>v</italic> of <italic>u</italic> so that <italic>Q</italic> &#x0222A; <italic>R</italic> &#x0003D; <italic>T, Q</italic> &#x02229; <italic>R</italic> &#x0003D; &#x02205;. 
That is: <inline-formula><mml:math id="M55"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x022C3;</mml:mo></mml:mrow><mml:mrow><mml:mtable class="subarray-c" rowspacing="0" columnalign="center"><mml:mtr><mml:mtd><mml:mi>v</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:mi>E</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>Q</mml:mi><mml:mo>&#x0222A;</mml:mo><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>Q</mml:mi><mml:mo>&#x02229;</mml:mo><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x02205;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:msub><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0222A;</mml:mo><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> (in the definition of <inline-formula><mml:math id="M56"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> we assume that the union with &#x02205; returns &#x02205;). <italic>W</italic>(<italic>T, u</italic>) stores the element of <inline-formula><mml:math id="M57"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> with largest value of our objective function, that is <inline-formula><mml:math id="M58"><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">argmax</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></inline-formula> At the end, the best solution is identified by finding the entry of <italic>W</italic> of maximum weight. 
Analogously, NoMAS identifies sets that minimize <inline-formula><mml:math id="M59"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (sets associated with increased survival) by maximizing the score <inline-formula><mml:math id="M60"><mml:mo>-</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. (See the Appendix for pseudocode and illustrations of how NoMAS works.)</p>
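<p>The dynamic program described above can be sketched in Python for a single coloring. This is a simplified illustration under stated assumptions, not the authors' implementation: the names <monospace>fill_table</monospace> and <monospace>best_entry</monospace> are hypothetical, the graph is an adjacency dictionary, <monospace>score</monospace> is any callable standing in for the weight of a vertex set, and empty entries are represented by <monospace>None</monospace>. For simplicity the sketch enumerates every proper non-empty split of the colorset and skips those whose entries are empty.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def fill_table(adj, coloring, k, score):
    """Fill W for one coloring.

    adj: dict mapping each vertex to an iterable of its neighbours.
    coloring: dict mapping each vertex to a color in {1, ..., k}.
    score: callable taking a frozenset of vertices (stand-in for the weight).
    Returns a dict W[(T, u)] holding a frozenset of vertices, or None if empty.
    """
    colors = range(1, k + 1)
    W = {}
    # Colorsets of size 1: {u} if the color matches, empty otherwise.
    for u in adj:
        for a in colors:
            W[(frozenset([a]), u)] = frozenset([u]) if coloring[u] == a else None
    # Larger colorsets, in order of increasing size.
    for size in range(2, k + 1):
        for T in map(frozenset, combinations(colors, size)):
            splits = [frozenset(Q) for r in range(1, size)
                      for Q in combinations(sorted(T), r)]
            for u in adj:
                best, best_score = None, None
                for v in adj[u]:
                    for Q in splits:
                        left, right = W[(Q, u)], W[(T - Q, v)]
                        if left is None or right is None:
                            continue  # union with the empty entry stays empty
                        candidate = left | right
                        s = score(candidate)
                        if best is None or s > best_score:
                            best, best_score = candidate, s
                W[(T, u)] = best
    return W


def best_entry(W, score):
    """Return the non-empty entry of W with the largest score, or None."""
    found = [S for S in W.values() if S is not None]
    return max(found, key=score) if found else None
```
]]></preformat>
<p>Because the left entry contains <italic>u</italic>, the right entry contains its neighbor <italic>v</italic>, and the two colorsets are disjoint, every stored candidate is connected and colorful. Sets associated with increased survival can be searched for by passing the negated score.</p>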
<sec>
<title>Parallelization</title>
<p>The computation of <italic>W</italic> is parallelized using <italic>N</italic> &#x02264; |<italic>V</italic>| processors. All entries of <italic>W</italic> are kept in shared memory, and each processor is assigned |<italic>V</italic>|/<italic>N</italic> distinct columns chosen uniformly at random. Entries of <italic>W</italic> are computed in order of increasing colorset size. We define the <italic>i</italic>-th <italic>colorset group</italic> as the set of all <inline-formula><mml:math id="M61"><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mtable><mml:mtr><mml:mtd><mml:mi>k</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>i</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> colorsets of size <italic>i</italic>. We exploit the fact that the rows within the <italic>i</italic>-th colorset group are computed by reading entries exclusively from rows belonging to colorset groups &#x0003C; <italic>i</italic>. When a processor has finished its entries in the rows of the <italic>i</italic>-th colorset group, it waits for the other processors to do the same. When the last processor completes the <italic>i</italic>-th colorset group, all <italic>N</italic> processors can safely begin to compute rows of colorset group <italic>i</italic>&#x0002B;1. In total, <italic>k</italic> synchronization steps are carried out, one for each colorset group.</p>
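<p>The synchronization scheme can be sketched with Python threads. The sketch only illustrates the column assignment and the per-group barrier; it makes no claim about how NoMAS is implemented or about speed-up. The names <monospace>partition_columns</monospace>, <monospace>fill_by_groups</monospace> and the callback <monospace>compute_entry</monospace> are hypothetical, and, unlike the shared-memory writes described in the text, each worker here returns its entries to be merged at the barrier.</p>
<preformat><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import random


def partition_columns(vertices, n_workers, rng):
    """Give each worker about |V|/N distinct columns, chosen uniformly at random."""
    shuffled = list(vertices)
    rng.shuffle(shuffled)
    return [shuffled[i::n_workers] for i in range(n_workers)]


def fill_by_groups(vertices, k, n_workers, compute_entry, rng):
    """Fill W one colorset group at a time.

    compute_entry(W, T, u) must read only entries whose colorset is smaller
    than T; those entries are complete before group |T| is started.
    """
    W = {}
    parts = partition_columns(vertices, n_workers, rng)

    def work(columns, group):
        # Each worker computes its own columns for every colorset in the group.
        return {(T, u): compute_entry(W, T, u) for T in group for u in columns}

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for size in range(1, k + 1):
            group = [frozenset(T) for T in combinations(range(1, k + 1), size)]
            futures = [pool.submit(work, cols, group) for cols in parts]
            # Barrier: wait for every worker before moving to the next group.
            for fut in futures:
                W.update(fut.result())
    return W
```
]]></preformat>
<p>The loop over <monospace>size</monospace> runs <italic>k</italic> times, matching the <italic>k</italic> synchronization steps mentioned above.</p>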
</sec>
</sec>
<sec>
<title>2.3. Statistical Significance</title>
<p>We designed two procedures to assess the statistical significance of the results found by NoMAS: the first is based on permutation testing, while the second uses a holdout approach.</p>
<sec>
<title>Permutation Testing</title>
<p>After identifying the best solution <inline-formula><mml:math id="M62"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> for the mutation matrix <italic>M</italic>, NoMAS can assess its statistical significance by (i) estimating the <italic>p</italic>-value <inline-formula><mml:math id="M63"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> for the log-rank statistic (using a Monte-Carlo estimate with 10<sup>8</sup> samples), and then (ii) using a permutation test in which <inline-formula><mml:math id="M64"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> is compared to the best solution <inline-formula><mml:math id="M65"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> for the mutation matrix <italic>M</italic><sup><italic>p</italic></sup> obtained by randomly permuting the rows of <italic>M</italic>. A total of 100 permutations are performed, and the permutation <italic>p</italic>-value is recorded as the fraction of permutations in which <inline-formula><mml:math id="M66"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. 
While the <italic>p</italic>-value from the log-rank test reflects the association between mutations in the subnetwork and survival, the permutation <italic>p</italic>-value assesses whether a subnetwork with association with survival at least as extreme as the one observed in the input data can be observed when the genes are placed randomly in the network. Note that we can identify multiple solutions by considering different entries of <italic>W</italic> (even if the same solution may appear in multiple entries of <italic>W</italic>), and we obtain a permutation <italic>p</italic>-value for the <italic>i</italic>-th top scoring solution by comparing its score with the score of the <italic>i</italic>-th top scoring solution in the permuted datasets.</p>
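<p>Given the scores obtained on the real and on the permuted datasets, the permutation <italic>p</italic>-values described above reduce to simple counts. The following sketch assumes the scores have already been computed; the function names are hypothetical.</p>
<preformat><![CDATA[
```python
def permutation_p_value(observed_score, permuted_scores):
    """Fraction of permutations whose score is at least the observed score."""
    hits = sum(1 for s in permuted_scores if s >= observed_score)
    return hits / len(permuted_scores)


def ranked_permutation_p_values(observed, permuted_runs):
    """Compare the i-th top observed score with the i-th top score of each run.

    observed: scores of the solutions found on the real data.
    permuted_runs: one list of solution scores per permuted dataset,
    each assumed to contain at least as many scores as `observed`.
    """
    obs = sorted(observed, reverse=True)
    runs = [sorted(run, reverse=True) for run in permuted_runs]
    return [permutation_p_value(s, [run[i] for run in runs])
            for i, s in enumerate(obs)]
```
]]></preformat>
<p>With 100 permutations, as used here, the smallest non-zero permutation <italic>p</italic>-value that can be reported is 0.01.</p>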
</sec>
<sec>
<title>Holdout Method</title>
<p>We designed a holdout method to strengthen the statistical robustness of the results produced by NoMAS. We split the dataset into two parts, called <italic>training</italic> and <italic>holdout</italic>, and then run NoMAS on the former, obtaining subnetworks with high weight. The <italic>p</italic>-value of these subnetworks is then computed on the holdout dataset using a Monte-Carlo estimate with 10<sup>8</sup> samples. In more detail, assuming that a set <italic>P</italic> of <italic>m</italic> patients is analyzed, let <italic>v</italic> be a parameter with value in (0, 1) that represents the proportion of data in the training set: we partition <italic>P</italic> into two parts, <italic>P</italic><sub><italic>t</italic></sub> and <italic>P</italic><sub><italic>h</italic></sub>, of sizes <italic>m</italic><sub><italic>t</italic></sub> &#x0003D; &#x0230A;<italic>mv</italic>&#x0230B; and <italic>m</italic><sub><italic>h</italic></sub> &#x0003D; <italic>m</italic> &#x02212; <italic>m</italic><sub><italic>t</italic></sub>, respectively. In order to preserve the survival distribution in both the training and the holdout set, the partition is performed over each of <italic>g</italic> temporal intervals of the same length, where <italic>g</italic> is a parameter provided by the user. The sets <italic>P</italic><sub><italic>t</italic></sub> and <italic>P</italic><sub><italic>h</italic></sub> are obtained as the union of the corresponding sets in each interval. Once we obtain the partition of <italic>P</italic> into <italic>P</italic><sub><italic>t</italic></sub> and <italic>P</italic><sub><italic>h</italic></sub>, NoMAS is executed over the population <italic>P</italic><sub><italic>t</italic></sub>, and the <italic>p</italic>-value of the solution found is computed over <italic>P</italic><sub><italic>h</italic></sub>.</p>
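<p>A possible way to realize the interval-wise split is sketched below. The text does not specify how patients are chosen within an interval or how rounding is reconciled across intervals, so the random choice within each interval and the per-interval floor used here are assumptions of the sketch; with them, the training set can be slightly smaller than &#x0230A;<italic>mv</italic>&#x0230B;. The function name <monospace>holdout_split</monospace> is hypothetical.</p>
<preformat><![CDATA[
```python
import math
import random


def holdout_split(times, v, g, rng):
    """Split patient indices into (training, holdout), interval by interval.

    times: survival time of each patient.
    v: proportion of each interval assigned to the training set, in (0, 1).
    g: number of equal-length temporal intervals.
    """
    lo, hi = min(times), max(times)
    width = (hi - lo) / g
    if width == 0:
        width = 1.0  # all times equal: every patient falls in the first interval
    bins = [[] for _ in range(g)]
    for i, t in enumerate(times):
        # The maximum time is placed in the last interval.
        bins[min(int((t - lo) / width), g - 1)].append(i)
    training, holdout = [], []
    for members in bins:
        rng.shuffle(members)
        cut = math.floor(len(members) * v)
        training.extend(members[:cut])
        holdout.extend(members[cut:])
    return sorted(training), sorted(holdout)
```
]]></preformat>
<p>NoMAS would then be run on the patients in the first list, and the <italic>p</italic>-value of the reported subnetwork estimated on the patients in the second list.</p>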
</sec>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Analysis of NoMAS</title>
<p>We consider the performance of NoMAS excluding the statistical significance testing. The log-rank statistic <inline-formula><mml:math id="M67"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is computed in time <italic>O</italic>(<italic>m</italic><sub>1</sub>) &#x02208; <italic>O</italic>(<italic>m</italic>). The total time complexity for computing a single entry <italic>W</italic>(<italic>T, u</italic>) is then bounded by <italic>O</italic>(<italic>m</italic>deg(<italic>u</italic>)2<sup>|<italic>T</italic>|&#x02212;1</sup>) &#x02208; <italic>O</italic>(<italic>m</italic>deg(<italic>u</italic>)2<sup><italic>k</italic></sup>), where <italic>deg</italic>(<italic>u</italic>) is the degree of <italic>u</italic> in <italic>G</italic>. Given a coloring of <italic>G</italic>, the computation of the entire table can thus be performed in time <inline-formula><mml:math id="M68"><mml:mi>O</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:munder><mml:mi>m</mml:mi><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">deg</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:mi>O</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>|</mml:mo><mml:mi>E</mml:mi><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. If <italic>L</italic> iterations are performed, then the complexity of the algorithm is <italic>O</italic>(<italic>Lm</italic>|<italic>E</italic>|4<sup><italic>k</italic></sup>).</p>
<p>Let <italic>OPT</italic> be the optimal solution. If the score <inline-formula><mml:math id="M69"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> were set additive, like the scores considered in previous applications of color-coding to optimization problems on graphs, then to discover <italic>OPT</italic> it would be sufficient that <italic>OPT</italic> be colorful, which happens with probability <italic>k</italic>!/<italic>k</italic><sup><italic>k</italic></sup> &#x02265; <italic>e</italic><sup>&#x02212;<italic>k</italic></sup> for each random coloring. Therefore, <italic>O</italic>(ln(1/&#x003B4;)<italic>e</italic><sup><italic>k</italic></sup>) iterations would be enough to ensure that the probability of <italic>OPT</italic> not being discovered is &#x02264; &#x003B4;, resulting in an overall time complexity of <italic>O</italic>(<italic>m</italic>ln(1/&#x003B4;)|<italic>E</italic>|(4<italic>e</italic>)<sup><italic>k</italic></sup>).</p>
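<p>As a rough numerical illustration of this bound, the following Python sketch computes the probability <italic>k</italic>!/<italic>k</italic><sup><italic>k</italic></sup> that a fixed set of <italic>k</italic> vertices is colorful, and the number of independent colorings needed to push the failure probability below &#x003B4;. The function names are illustrative and not part of NoMAS; the sketch assumes a set additive score, so that colorfulness alone would suffice, and 0 &#x0003C; &#x003B4; &#x0003C; 1.</p>
<preformat><![CDATA[
```python
import math


def colorful_probability(k: int) -> float:
    """Probability that k fixed vertices receive k distinct colors when each
    vertex is independently assigned one of k colors uniformly at random."""
    return math.factorial(k) / k ** k


def iterations_needed(k: int, delta: float) -> int:
    """Smallest L such that (1 - k!/k^k)^L <= delta."""
    p = colorful_probability(k)
    if p >= 1.0:  # k = 1: every coloring is colorful
        return 1
    return math.ceil(math.log(delta) / math.log(1.0 - p))


def iterations_bound(k: int, delta: float) -> int:
    """The looser closed-form count ln(1/delta) * e^k quoted in the text."""
    return math.ceil(math.log(1.0 / delta) * math.exp(k))


if __name__ == "__main__":
    for k in (3, 5, 8):
        print(k, colorful_probability(k),
              iterations_needed(k, 0.05), iterations_bound(k, 0.05))
```
]]></preformat>
<p>The exact count never exceeds the closed-form bound, because <italic>k</italic>!/<italic>k</italic><sup><italic>k</italic></sup> &#x02265; <italic>e</italic><sup>&#x02212;<italic>k</italic></sup>.</p>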
<p>However, our score <inline-formula><mml:math id="M70"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is not set additive [e.g., if two genes in <inline-formula><mml:math id="M71"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> have a mutation in the same patient the weight of the patient is considered only once in <inline-formula><mml:math id="M72"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>]. Therefore, while <italic>OPT</italic> being colorful is still a necessary condition for the algorithm to identify <italic>OPT</italic>, the colorfulness of <italic>OPT</italic> is not a sufficient condition. In fact, we have the following.</p>
<p><bold>Proposition 1</bold>. <italic>For every <italic>k</italic> &#x02265; 3 there is a family of instances of the max connected <italic>k</italic>-set log-rank problem and colorings for which <italic>OPT</italic> is not found by NoMAS when it is colorful</italic>.</p>
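<p>The lack of set additivity behind this statement can be seen on a toy instance. The Python sketch below uses hypothetical per-patient weights as stand-ins for each patient's contribution to the score (they are not the actual log-rank weights), and a hypothetical helper <italic>score</italic> that counts every mutated patient once.</p>
<preformat><![CDATA[
```python
def score(genes, mutations, weight):
    """Sum the weights of patients with at least one mutation in `genes`.

    Each patient is counted once, however many genes of the set are mutated.
    """
    mutated_patients = set()
    for gene in genes:
        mutated_patients |= mutations[gene]
    return sum(weight[p] for p in mutated_patients)


# Hypothetical toy data: patient p1 is mutated in both genes A and B.
weight = {"p1": 0.5, "p2": 0.25, "p3": -0.25}
mutations = {"A": {"p1", "p2"}, "B": {"p1", "p3"}}

w_a = score({"A"}, mutations, weight)        # p1 + p2
w_b = score({"B"}, mutations, weight)        # p1 + p3
w_ab = score({"A", "B"}, mutations, weight)  # p1 + p2 + p3, p1 counted once

# The gap between the sum of the parts and the whole is the weight of the
# shared patient p1.
gap = (w_a + w_b) - w_ab
```
]]></preformat>
<p>Because the score of a merged subnetwork is not the sum of the scores of its parts, keeping only the best partial subnetwork for each color set does not guarantee that the best merged subnetwork is retained.</p>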
<p>Even more, we prove that when mutations are placed arbitrarily then for every subnetwork <inline-formula><mml:math id="M73"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> and a given coloring of <inline-formula><mml:math id="M74"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula>, <italic>any</italic> color-coding algorithm that adds subnetworks of size <italic>k</italic> to <italic>W</italic> by merging neighboring subnetworks of size &#x0003C; <italic>k</italic> could be &#x0201C;fooled&#x0201D; to not add <inline-formula><mml:math id="M75"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> to <italic>W</italic> by simply adding 3 vertices to <italic>G</italic> and assigning them a specific color.</p>
<p><bold>Theorem 3</bold>. <italic>For any optimal colorful connected subnetwork</italic> <inline-formula><mml:math id="M76"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> <italic>of size k</italic> &#x02265; 3 <italic>and any color-coding algorithm</italic> <inline-formula><mml:math id="M77"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula> <italic>which obtains subnetworks with colorsets of cardinality i by combining 2 subnetworks with colorsets of cardinality &#x0003C; i, by adding 3 neighbors to</italic> <inline-formula><mml:math id="M78"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> <italic>we have that</italic> <inline-formula><mml:math id="M79"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula> <italic>may not discover S</italic>.</p>
<p>Intuitively, Proposition 1 and Theorem 3 show that if mutations are placed adversarially (and the optimal solution <italic>OPT</italic> has many neighbors), our algorithm may not identify <italic>OPT</italic>. However, we prove that our algorithm identifies the optimal solution under a generative model for mutations, that we deem the <italic>Planted Subnetwork Model</italic>. We consider <inline-formula><mml:math id="M80"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> as the unnormalized version of the log-rank statistic. In this model: i) there is a subnetwork <inline-formula><mml:math id="M81"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M82"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>, with <inline-formula><mml:math id="M83"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:mi>c</mml:mi><mml:mi>m</mml:mi></mml:math></inline-formula>, for a constant <italic>c</italic> &#x0003E; 0; ii) each gene <inline-formula><mml:math id="M84"><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is such that <inline-formula><mml:math id="M85"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>\</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, for a constant <italic>c</italic>&#x02032; &#x0003E; 0; iii) for each gene <inline-formula><mml:math id="M86"><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>: <italic>w</italic>({<italic>g</italic>}) &#x0003E; 0; iv) for each gene <inline-formula><mml:math id="M87"><mml:mi>&#x0011D;</mml:mi><mml:mo>&#x02209;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, &#x0011D; is mutated with probability <italic>p</italic><sub><italic>g</italic></sub> in each patient, independently of all other events (and of survival time and censoring status).</p>
<p>Intuitively: (i) above states that the subnetwork <inline-formula><mml:math id="M88"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> has mutations associated with survival; (ii) states that each gene <inline-formula><mml:math id="M89"><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> contributes to the association of mutations in <inline-formula><mml:math id="M90"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> with survival; (iii) states that each gene <inline-formula><mml:math id="M91"><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> should have the same association with survival (increased or decreased) as <inline-formula><mml:math id="M92"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>; and (iv) states that all mutations outside <inline-formula><mml:math id="M93"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> are independent of all other events (including survival time and censoring of patients).</p>
<p>We show that when enough samples are generated from the model above, our algorithm identifies the optimal solution with the same probability guarantee given by the color-coding technique for additive scores.</p>
<p><bold>Theorem 4</bold>. <italic>Let M be a mutation matrix corresponding to m samples from the Planted Subnetwork Model. If m</italic> &#x02208; &#x003A9;(<italic>k</italic><sup>4</sup>(<italic>k</italic> &#x0002B; &#x003B5;)ln <italic>n</italic>) <italic>for a given constant</italic> &#x003B5; &#x0003E; 0 <italic>and O</italic>(ln(1/&#x003B4;)<italic>e</italic><sup><italic>k</italic></sup>) <italic>color-coding iterations are performed, then our algorithm identifies the optimal solution</italic> <inline-formula><mml:math id="M94"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> <italic>to the max connected k-set log-rank with probability</italic> <inline-formula><mml:math id="M95"><mml:mo>&#x02265;</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B5;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>-</mml:mo><mml:mi>&#x003B4;</mml:mi></mml:math></inline-formula>.</p>
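<p>To get a feeling for the sample sizes involved in Theorem 4, the Python sketch below evaluates the expression inside the &#x003A9;(&#x000B7;) for given <italic>k</italic>, <italic>n</italic>, and &#x003B5;. The constant hidden by the asymptotic notation is not specified here, so the value is only an order-of-magnitude indication. The function name, the choice &#x003B5; &#x0003D; 1, and the reading of <italic>n</italic> as the number of genes (vertices of <italic>G</italic>) are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import math


def sample_size_scale(k: int, n: int, eps: float) -> float:
    """Evaluate k^4 * (k + eps) * ln(n), the growth term in Theorem 4.

    The constant factor hidden by the Omega notation is NOT included.
    """
    return k ** 4 * (k + eps) * math.log(n)


# n = 9859 is the number of vertices of the interaction graph used in the
# experiments; k = 5 matches the simulations; eps = 1 is an arbitrary choice.
scale = sample_size_scale(k=5, n=9859, eps=1.0)
print(f"k^4 (k + eps) ln n = {scale:.0f}")
```
]]></preformat>
<p>With these values the term is roughly 3.4 &#x000D7; 10<sup>4</sup>, well above the sample sizes of currently available studies.</p>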
</sec>
<sec>
<title>3.2. Experimental Results</title>
<p>We assessed the performance of NoMAS by using simulated and cancer data. We compared NoMAS to the exhaustive algorithm that identifies the subnetwork of <italic>k</italic> vertices with the highest score <italic>w</italic>(<italic>S</italic>) for the values of <italic>k</italic> for which we could run the exhaustive algorithm (we implemented a parallelized version of the algorithm described in Maxwell et al., <xref ref-type="bibr" rid="B24">2014</xref> to efficiently enumerate all connected subnetworks), to three variants of a greedy algorithm similar to the one from Reimand and Bader (<xref ref-type="bibr" rid="B29">2013</xref>), and to the use of a score given by the sum of single gene scores. Cancer data is obtained from The Cancer Genome Atlas (TCGA). In particular, we consider somatic mutations (single nucleotide variants and small indels) for 268 samples of glioblastoma multiforme (GBM), 315 samples of ovarian adenocarcinoma (OV) and 174 samples of lung squamous cell carcinoma (LUSC) for which survival data is available.</p>
<p>For all our experiments we used as interaction graph <italic>G</italic> the graph derived from the application of a diffusion process on the HINT&#x0002B;HI2012 network<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>, a combination of the HINT network (Das and Yu, <xref ref-type="bibr" rid="B12">2012</xref>) and the HI-2012 (Yu et al., <xref ref-type="bibr" rid="B40">2011</xref>) set of protein-protein interactions, previously used in Leiserson et al. (<xref ref-type="bibr" rid="B21">2015a</xref>). The details of the diffusion process are described in Leiserson et al. (<xref ref-type="bibr" rid="B21">2015a</xref>). In brief, for two genes <italic>g</italic><sub><italic>i</italic></sub>, <italic>g</italic><sub><italic>j</italic></sub> the diffusion process gives the amount of heat <italic>h</italic>(<italic>g</italic><sub><italic>i</italic></sub>, <italic>g</italic><sub><italic>j</italic></sub>) observed on <italic>g</italic><sub><italic>j</italic></sub> when <italic>g</italic><sub><italic>i</italic></sub> has one mutation, and the amount of heat <italic>h</italic>(<italic>g</italic><sub><italic>j</italic></sub>, <italic>g</italic><sub><italic>i</italic></sub>) observed on <italic>g</italic><sub><italic>i</italic></sub> when <italic>g</italic><sub><italic>j</italic></sub> has one mutation. The graph used for our analyses is obtained by retaining an edge between <italic>g</italic><sub><italic>i</italic></sub> and <italic>g</italic><sub><italic>j</italic></sub> if max{<italic>h</italic>(<italic>g</italic><sub><italic>i</italic></sub>, <italic>g</italic><sub><italic>j</italic></sub>), <italic>h</italic>(<italic>g</italic><sub><italic>j</italic></sub>, <italic>g</italic><sub><italic>i</italic></sub>)} &#x02265; 0.012. The resulting graph has 9,859 vertices and 42,480 edges, with the maximum degree of a node being 438. In all our experiments we removed mutations in genes mutated in &#x0003C; 3 samples. 
For cancer data, this resulted in 890 mutated genes removed in GBM, 780 in OV, and 2,915 in LUSC. All experiments were carried out on a machine with two Intel Xeon E5-2698 v3 CPUs (2.30 GHz), each with 16 physical cores, for a total of 64 virtual cores, and 16 banks of 32 GB DDR4 (2,133 MHz) memory modules, for a total of 512 GB of memory.</p>
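<p>The two preprocessing steps described above (thresholding the diffusion heat to obtain edges, and discarding rarely mutated genes) can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiments: the input layout (a dictionary of ordered gene pairs to heat values, and a dictionary of genes to sets of mutated samples), the function names, and the treatment of a missing reverse heat value as 0 are all assumptions.</p>
<preformat><![CDATA[
```python
def build_graph(heat, threshold=0.012):
    """Keep an undirected edge {gi, gj} if max(h(gi, gj), h(gj, gi)) >= threshold.

    `heat` maps ordered pairs (gi, gj) to h(gi, gj). A missing reverse entry
    is treated as 0.0 (an assumption of this sketch).
    """
    edges = set()
    for (gi, gj), h_ij in heat.items():
        if gi == gj:
            continue  # ignore self-loops
        h_ji = heat.get((gj, gi), 0.0)
        if max(h_ij, h_ji) >= threshold:
            edges.add(frozenset((gi, gj)))
    return edges


def filter_rare_genes(mutations, min_samples=3):
    """Drop genes mutated in fewer than `min_samples` samples."""
    return {g: s for g, s in mutations.items() if len(s) >= min_samples}


# Hypothetical toy input.
heat = {
    ("A", "B"): 0.02, ("B", "A"): 0.001,   # kept: one direction is hot enough
    ("A", "C"): 0.005, ("C", "A"): 0.011,  # dropped: both below the threshold
    ("B", "C"): 0.012,                     # kept: equal to the threshold
}
edges = build_graph(heat)
kept = filter_rare_genes({"A": {1, 2, 3}, "B": {1, 2}})
```
]]></preformat>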
<p>The remainder of this section is organized as follows: section 3.2.1 presents the results on simulated data, while section 3.2.2 presents the results on cancer data.</p>
<sec>
<title>3.2.1. Simulated Data</title>
<p>We assess the performance of NoMAS on simulated data generated under the Planted Subnetwork Model. The subnetwork <inline-formula><mml:math id="M96"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>&#x02282;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> associated with survival is generated by a random walk on the graph <italic>G</italic>. We model the association of <inline-formula><mml:math id="M97"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> with survival by mutating, with probability <italic>p</italic>, one gene of <inline-formula><mml:math id="M98"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> chosen uniformly at random in each of the <inline-formula><mml:math id="M99"><mml:mfrac><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> samples with lowest survival. All other genes in <inline-formula><mml:math id="M100"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> are mutated independently with probability 0.01 in all samples, to simulate passenger mutations (not associated with survival) in <inline-formula><mml:math id="M101"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> (Lawrence et al., <xref ref-type="bibr" rid="B20">2013</xref>). 
For genes in <inline-formula><mml:math id="M102"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow><mml:mo>\</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, we used the same mutation frequencies observed in the GBM study, and mutate each gene independently of all other events.</p>
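<p>The generation of mutations just described can be sketched in Python as follows. This is an illustrative reading of the procedure, not the simulation code used for the experiments. It assumes that samples are indexed 0, &#x02026;, <italic>m</italic> &#x02212; 1 in increasing order of survival (so the first <italic>m</italic>/4 indices are the samples with lowest survival), that the graph is connected with at least <italic>k</italic> vertices, and that survival times and censoring are generated elsewhere; all function and parameter names are hypothetical.</p>
<preformat><![CDATA[
```python
import random


def random_walk_subnetwork(adj, k, rng, max_steps=100000):
    """Collect k distinct vertices visited by a random walk on `adj`.

    `adj` maps each vertex to the set of its neighbors. The visited set is
    connected because every new vertex is adjacent to the previous one.
    """
    current = rng.choice(sorted(adj))
    visited = {current}
    steps = 0
    while len(visited) < k:
        if steps >= max_steps:
            raise RuntimeError("random walk did not reach k distinct vertices")
        current = rng.choice(sorted(adj[current]))
        visited.add(current)
        steps += 1
    return visited


def simulate_mutations(planted, other_freq, m, p, rng, passenger_rate=0.01):
    """Return a dict gene -> set of mutated sample indices.

    planted:     genes of the planted subnetwork D
    other_freq:  gene -> mutation frequency, for genes outside D
    Samples 0 .. m // 4 - 1 are taken to be those with lowest survival.
    """
    planted_genes = sorted(planted)
    mutations = {g: set() for g in planted_genes}
    mutations.update({g: set() for g in other_freq})
    for sample in range(m):
        driver = None
        if sample < m // 4 and rng.random() < p:
            # One gene of D, chosen uniformly, carries the associated mutation.
            driver = rng.choice(planted_genes)
            mutations[driver].add(sample)
        for g in planted_genes:
            # Passenger mutations in the other genes of D.
            if g != driver and rng.random() < passenger_rate:
                mutations[g].add(sample)
        for g, freq in other_freq.items():
            # Genes outside D mutate independently of everything else.
            if rng.random() < freq:
                mutations[g].add(sample)
    return mutations
```
]]></preformat>
<p>Passing a seeded <italic>random.Random</italic> instance as <italic>rng</italic> makes a simulated dataset reproducible.</p>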
<p>We fixed <italic>k</italic> &#x0003D; 5 and considered the values of <italic>p</italic> &#x02208; {0.5, 0.75, 0.85} and <italic>m</italic> &#x02208; {268, 500, 750, 1,000}. We kept the same ratio of censored observations as in GBM and chose the censored samples uniformly among all samples. For every pair (<italic>p, m</italic>) we performed 100 simulations, running NoMAS on the dataset with <italic>L</italic> &#x0003D; 256 color-coding iterations, and recorded whether NoMAS reported <inline-formula><mml:math id="M108"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> as the highest scoring subnetwork. Results are shown in <xref ref-type="fig" rid="F2">Figure 2A</xref>. For sample sizes similar to the currently available ones, NoMAS frequently reports <inline-formula><mml:math id="M109"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> as the highest scoring solution when there is a fairly strong association of <inline-formula><mml:math id="M110"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> with survival (<italic>p</italic> &#x02265; 0.85), but for <italic>m</italic> &#x0003D; 1,000 the highest scoring subnetwork reported by NoMAS is <inline-formula><mml:math id="M111"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> in &#x0003E; 80% of the cases even for <italic>p</italic> &#x0003D; 0.5. 
<xref ref-type="fig" rid="F2">Figure 2B</xref> shows that even when NoMAS does not report <inline-formula><mml:math id="M112"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> as the highest scoring solution, the solution reported by NoMAS contains mostly genes that are in <inline-formula><mml:math id="M113"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, even for current sample sizes (e.g., on average 74% of the genes in <inline-formula><mml:math id="M114"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> are reported by NoMAS for <italic>m</italic> &#x0003D; 268 and <italic>p</italic> &#x0003D; 0.85 even when <inline-formula><mml:math id="M115"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is not the highest scoring solution reported by NoMAS). Finally, we assessed whether <inline-formula><mml:math id="M116"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> would be among the highest scoring solutions in the table <italic>W</italic> computed by NoMAS: <xref ref-type="fig" rid="F2">Figure 2C</xref> shows that by considering the top-10 solutions in <italic>W</italic> the chances to identify <inline-formula><mml:math id="M117"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> increase substantially even for <italic>m</italic> &#x0003D; 268 and <italic>p</italic> &#x0003D; 0.5, with most configurations having &#x0003E; 0.8 probability of finding <inline-formula><mml:math id="M118"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> in the top-10 solutions reported by NoMAS. 
For a fixed <italic>p</italic> &#x0003D; 0.75 and for each value of <italic>m</italic> we assessed whether NoMAS identified the optimal solution even when it was not <inline-formula><mml:math id="M119"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> (an event not excluded in the Planted Subnetwork Model) and found that for <italic>m</italic> &#x02265; 500 NoMAS reported the optimal solution in 10 out of 10 cases (for <italic>m</italic> &#x0003D; 268 NoMAS identified the optimal solution 9 out of 10 times). These results show that NoMAS does indeed find the optimal solution in almost all cases even for sample sizes currently available (while the theoretical analysis of section 3.1 suggests that much larger sample sizes are required) and that it can be used to identify <inline-formula><mml:math id="M120"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> or the majority of it by considering the top-10 highest scoring solutions.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Results of NoMAS on simulated data from the Planted Subnetwork Model. One hundred datasets were generated for each pair (<italic>m</italic>,<italic>p</italic>), where <italic>m</italic> is the number of samples and for different probabilities <italic>p</italic> of mutations in the set <inline-formula><mml:math id="M103"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> of genes associated with survival. <bold>(A)</bold> Probability that <inline-formula><mml:math id="M104"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is reported as the highest scoring solution by NoMAS. <bold>(B)</bold> Ratio of genes from the set <inline-formula><mml:math id="M105"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> that are in the best solution when <inline-formula><mml:math id="M106"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is not the highest scoring solution by NoMAS. <bold>(C)</bold> Probability that <inline-formula><mml:math id="M107"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is among the top-10 solutions reported by NoMAS. All probabilities are estimated from the simulated datasets.</p></caption>
<graphic xlink:href="fgene-10-00265-g0002.tif"/>
</fig>
</sec>
<sec>
<title>3.2.2. Cancer Data</title>
<p>We assessed the performance of NoMAS on the GBM, OV, and LUSC datasets. We first assessed whether NoMAS identified the optimal solution by comparing the highest scoring solution reported by NoMAS with the one identified by using the exhaustive algorithm for <italic>k</italic> &#x0003D; 2, 3, 4, 5. In all cases we found that NoMAS does identify the optimal solution, while requiring much less running time than the exhaustive algorithm (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 2</xref>). For <italic>k</italic> &#x0003E; 5 we could not run the exhaustive algorithm, while the runtime of NoMAS is still reasonable. The runtime of NoMAS can be greatly improved by using the parallelization strategy described in section 2.2 (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 3</xref>). We therefore used NoMAS to find subnetworks of size <italic>k</italic> &#x0003D; 6 and <italic>k</italic> &#x0003D; 8. We also considered two modifications of NoMAS that solve some easy cases where NoMAS may not identify the highest scoring solution due to its subnetwork merging strategy (see Appendix for a description and pseudocode of the modifications). We ran both modifications on GBM, OV, and LUSC for <italic>k</italic> &#x0003D; 6, 8 (using the same colorings used by the original version of NoMAS): in all cases the modified versions of NoMAS did not report subnetworks with higher scores than the ones from the original version of NoMAS. We also note that the original version of NoMAS is significantly faster in practice than its two modifications (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 3</xref>) and, therefore, we used the original version of NoMAS in the remaining experiments.</p>
<p>We also compared NoMAS with three different greedy strategies for the max connected <italic>k</italic>-set log-rank problem. All three algorithms build solutions starting from each node <italic>u</italic> &#x02208; <italic>G</italic> and iteratively adding nodes to the current solution <inline-formula><mml:math id="M124"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula>; they differ in the way they enlarge the current subnetwork <inline-formula><mml:math id="M125"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> of size 1 &#x02264; <italic>i</italic> &#x0003C; <italic>k</italic> (see Appendix for a description of the three greedy strategies). We ran the three greedy algorithms on GBM, OV, and LUSC for <italic>k</italic> &#x0003D; 4, 5, 6, 8. For each dataset we compared the resulting subnetworks with the ones identified by NoMAS. Results are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. In almost all cases we found that NoMAS discovered subnetworks with higher score than the subnetworks found by using greedy strategies, even if in some cases there is a greedy strategy that identifies the same subnetworks for all values of <italic>k</italic>. The difference in score increases as <italic>k</italic> increases, showing the ability of NoMAS to discover better solutions for larger values of <italic>k</italic>, with the main expense being the running time of NoMAS as opposed to the greedy strategies (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 4</xref>). We also assessed whether the fact that greedy strategies discover lower scoring solutions than NoMAS has an impact on the estimate of the <italic>p</italic>-value in the permutational test. 
We considered the top-10 scoring solutions (corresponding to 10 different starting nodes <italic>u</italic> &#x02208; <italic>G</italic>) discovered by the best greedy strategy in the GBM dataset and computed the permutational <italic>p</italic>-value for each solution by generating 100 permuted datasets and analyzing them with either the (same) greedy strategy or NoMAS (with only 32 iterations on the permuted data). <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref> shows a comparison of the distribution of the <italic>p</italic>-values. As we can see, the greedy strategy incorrectly underestimates the permutational <italic>p</italic>-values for the solutions, because the greedy algorithm is not able to identify solutions with scores as high as those found by NoMAS in the permuted datasets. The use of the greedy algorithms would then lead to both (1) identifying solutions in real data with lower association to survival compared to NoMAS and (2) wrongly estimating their permutational <italic>p</italic>-value as more significant than it is.</p>
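<p>The mechanism behind this underestimation can be illustrated with a short Python sketch. It assumes the common estimator that counts the permuted datasets whose best score is at least the observed score, with a &#x0002B;1 correction in numerator and denominator; the exact estimator used in the experiments is not restated here, and the scores below are made-up toy values, not results.</p>
<preformat><![CDATA[
```python
def permutation_p_value(observed, permuted_scores):
    """Fraction of permuted datasets whose best score is >= the observed score.

    The +1 in numerator and denominator is a common convention that avoids
    reporting a p-value of exactly zero (an assumption of this sketch).
    """
    exceed = sum(1 for s in permuted_scores if s >= observed)
    return (exceed + 1) / (len(permuted_scores) + 1)


# Hypothetical best scores found on the SAME four permuted datasets by a
# thorough search and by a weaker (e.g., greedy) search.
observed_score = 3.0
thorough_search = [2.5, 3.2, 3.4, 2.9]
weaker_search = [2.0, 2.6, 2.8, 2.4]

p_thorough = permutation_p_value(observed_score, thorough_search)
p_weaker = permutation_p_value(observed_score, weaker_search)
```
]]></preformat>
<p>In this toy case the weaker search misses the high scoring solutions present in the permuted data, so the same observed score receives a smaller (more significant) <italic>p</italic>-value than with the thorough search.</p>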
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Comparison of the normalized log-rank statistic of the best solution reported by NoMAS, by greedy algorithms (see Appendix for the description), and by the algorithm that uses an additive scoring function <inline-formula><mml:math id="M121"><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (denoted by &#x0201C;additive&#x0201D; in the plots). To maintain readability we omit values above &#x02212;4.0 when considering mutations associated with increased survival. For each dataset, the results for the maximization of <inline-formula><mml:math id="M122"><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (top panel) and the maximization of <inline-formula><mml:math id="M123"><mml:mo>-</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (bottom panel) are shown separately. <bold>(A)</bold> Results for GBM dataset. <bold>(B)</bold> Results for OV dataset. <bold>(C)</bold> Results for LUSC dataset.</p></caption>
<graphic xlink:href="fgene-10-00265-g0003.tif"/>
</fig>
<p>Finally, we compared NoMAS with the use of an (additive) score that sums single gene scores (similar to the ones used in Vandin et al., <xref ref-type="bibr" rid="B35">2012a</xref>). For each gene <italic>g</italic> &#x02208; <italic>G</italic> we computed the <italic>p</italic>-value <italic>p</italic>(<italic>g</italic>) for the association of <italic>g</italic> with survival using the log-rank test and defined <inline-formula><mml:math id="M126"><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">log</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. We then partitioned the genes according to their association with increased survival or with decreased survival and modified our algorithm to look for high scoring solutions in a partition using score <inline-formula><mml:math id="M127"><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Results are in <xref ref-type="fig" rid="F3">Figure 3</xref>. We found that NoMAS outperforms the use of a single gene score, with a very large difference for certain values of parameters.</p>
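<p>The additive score defined above is simple to compute once the single gene <italic>p</italic>-values are available. The following Python sketch assumes they are given as a dictionary from gene to <italic>p</italic>(<italic>g</italic>); the function name and the toy <italic>p</italic>-values are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def additive_score(genes, p_value):
    """a(S) = sum over genes g in S of -log10 p(g)."""
    return sum(-math.log10(p_value[g]) for g in genes)


# Hypothetical single-gene log-rank p-values.
p_value = {"g1": 0.01, "g2": 0.1, "g3": 0.5}

a_12 = additive_score({"g1", "g2"}, p_value)
a_3 = additive_score({"g3"}, p_value)
a_123 = additive_score({"g1", "g2", "g3"}, p_value)
```
]]></preformat>
<p>Unlike the log-rank statistic of a gene set, this score of a union of disjoint gene sets is the sum of the scores of the parts, which is the property that standard color-coding relies on.</p>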
<p>We then used the holdout approach to identify significant subnetworks for GBM, LUSC, and OV, considering the top-10 highest scoring subnetworks found in the training set and computing their <italic>p</italic>-value in the holdout set. We tested all datasets using <italic>k</italic> &#x0003D; 3, 4, 8 and 256 iterations of the color-coding algorithm. As before, as pre-processing, genes mutated in &#x0003C; 3 samples were eliminated. NoMAS identified several subnetworks with significant association to survival. In GBM, for <italic>k</italic> &#x0003D; 8, NoMAS found the subnetwork including COL5A3, DCN, EGFR, IGF1R, LAMA2, MYLK, PIK3R1, and PIK3CA (<italic>p</italic> &#x02264; 0.05; <xref ref-type="fig" rid="F4">Figure 4</xref>). None of the genes is associated with survival when considered in isolation. DCN, EGFR, IGF1R, and PIK3R1 recur in various metabolic functions related to lipids and enzymes signaling and reception. These genes, together with PIK3CA, MYLK, and LAMA2, are involved in formation and maintenance of biological tissues, in cell movement and migration and cell protection organization. Moreover, EGFR, PIK3R1, and PIK3CA are well-known cancer genes. EGFR, IGF1R, LAMA2, MYLK, PIK3CA, and PIK3R1 are members of the focal adhesion pathway, whose dynamics are highly altered in cancer cells. In LUSC, NoMAS found the subnetwork including MAD1L1, USP15, and ZNF434 (<italic>p</italic> &#x02264; 0.03; <xref ref-type="fig" rid="F5">Figure 5</xref>). None of the genes is associated with survival when considered in isolation. USP15 stabilizes MDM2, a well-known cancer gene, to regulate cancer-cell survival and mediates antitumor T cell responses (Zou et al., <xref ref-type="bibr" rid="B41">2014</xref>), while increased expression of MAD1L1 is associated with poor prognosis in breast cancer (Sun et al., <xref ref-type="bibr" rid="B33">2013</xref>). 
In OV, NoMAS identified the subnetwork including EP300, NCOA3, NOTCH1, and NOTCH4 (<italic>p</italic> &#x02264; 0.1; <xref ref-type="fig" rid="F6">Figure 6</xref>). None of the genes is associated with survival when considered in isolation. These genes are part of a pathway related to RNA metabolic processes and have a role in the regulation of epidermis development and cell differentiation within its layers. All genes are also linked to the thyroid hormone signaling pathway, which is related to cell death and DNA damage in ovarian cancer (Shinderman-Maman et al., <xref ref-type="bibr" rid="B30">2017</xref>).</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Subnetworks identified by NoMAS on GBM data. Subnetwork <italic>S</italic> associated with survival in GBM, Kaplan&#x02013;Meier plot for the samples with mutations in <italic>S</italic> vs. samples with no mutation in <italic>S</italic>. The bottom panel shows the mutations in patients for the genes and the entire subnetwork (last row); patients with censored survival are in gray, other patients are in light blue; mutations in patients are shown in dark color.</p></caption>
<graphic xlink:href="fgene-10-00265-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Subnetworks identified by NoMAS on LUSC data. Subnetwork <italic>S</italic> associated with survival in LUSC, Kaplan&#x02013;Meier plot for the samples with mutations in <italic>S</italic> vs. samples with no mutation in <italic>S</italic>. The bottom panel shows the mutations in patients for the genes and the entire subnetwork (last row); patients with censored survival are in gray, other patients are in light blue; mutations in patients are shown in dark color.</p></caption>
<graphic xlink:href="fgene-10-00265-g0005.tif"/>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Subnetworks identified by NoMAS on OV data. Subnetwork <italic>S</italic> associated with survival in OV, Kaplan&#x02013;Meier plot for the samples with mutations in <italic>S</italic> vs. samples with no mutation in <italic>S</italic>. The bottom panel shows the mutations in patients for the genes and the entire subnetwork (last row); patients with censored survival are in gray, other patients are in light blue; mutations in patients are shown in dark color.</p></caption>
<graphic xlink:href="fgene-10-00265-g0006.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>In this work, we study the problem of identifying subnetworks of a large gene-gene interaction network that are associated with survival using mutations from large cancer genomic studies. Few methods have been proposed to identify groups of genes with mutations associated with survival in genomic studies. The work of Vandin et al. (<xref ref-type="bibr" rid="B35">2012a</xref>) combines mutations and survival data with interaction information using a diffusion process on graphs starting from gene scores derived from <italic>p</italic>-values of individual genes, but does not consider the problem of directly identifying groups of genes associated with survival. The work of Reimand and Bader (<xref ref-type="bibr" rid="B29">2013</xref>) combines mutation information and patient survival to identify subnetworks of a kinase-substrate interaction network associated with survival. It focuses only on phosphorylation-associated mutations, and the approach is based on a local search algorithm that builds a subnetwork by starting from one seed vertex and then greedily adding neighbors (at distance at most 2) from the seed, extending the approach used in different types of network analyses (Chuang et al., <xref ref-type="bibr" rid="B10">2007</xref>). A similar greedy approach is used by Wu and Stein (<xref ref-type="bibr" rid="B39">2012</xref>) to identify groups of genes significantly associated with survival in cancer from gene expression data. For gene expression studies, Chowdhury et al. (<xref ref-type="bibr" rid="B9">2011</xref>) propose an approach to enumerate dysregulated subnetworks in cancer based on an efficient search space pruning strategy, inspired by previous work on the identification of association rules in databases (Smyth and Goodman, <xref ref-type="bibr" rid="B32">1992</xref>). Patel et al. (<xref ref-type="bibr" rid="B25">2013</xref>) use the general approach described in Chowdhury et al. (<xref ref-type="bibr" rid="B9">2011</xref>) to identify subnetworks of genes with expression status associated with survival.</p>
<p>Color-coding is a probabilistic method that was originally described for finding simple paths, cycles and other small subnetworks of size <italic>k</italic> within a given network (Alon et al., <xref ref-type="bibr" rid="B2">1994</xref>). The core of the color-coding technique is the assignment of random colors to the vertices, which reduces the search space by restricting the subnetworks under consideration to <italic>colorful</italic> ones, those in which each vertex has a distinct color. Dynamic programming is employed to identify colorful subnetworks. The process is repeated until, with high probability, the desired subnetwork has been colorful at least once and has therefore been identified. When the dynamic programming algorithm is polynomial in <italic>n</italic> and the subnetworks being screened are of size <italic>k</italic>&#x02208;<italic>O</italic>(log<italic>n</italic>), the overall running time of the color-coding method also remains polynomial in <italic>n</italic>. Color-coding has been previously used to count or search for subgraphs of large interaction networks (Alon et al., <xref ref-type="bibr" rid="B1">2008</xref>; Bruckner et al., <xref ref-type="bibr" rid="B3">2010</xref>). Color-coding has also been used to identify groups of interacting genes in an interaction network that are associated with a phenotype of interest, but restricted to additive scores for sets of genes (i.e., the score of a set is the sum of the scores of the single genes); for example, Dao et al. (<xref ref-type="bibr" rid="B11">2011</xref>) use color-coding to find optimally discriminative subnetwork markers that predict response to chemotherapy from a large interaction network by defining a single gene score as &#x02212;log<sub>10</sub><italic>d</italic>(<italic>g</italic>), where <italic>d</italic>(<italic>g</italic>) is the discriminative score for gene <italic>g</italic> (i.e., a measure of the ability of <italic>g</italic> to discriminate two classes of patients); similarly, Hormozdiari et al. (<xref ref-type="bibr" rid="B16">2015</xref>) use color-coding to find groups of interacting genes with discriminative mutations in case-control studies, using as gene score the &#x02212;log<sub>10</sub> of the <italic>p</italic>-value from the binomial test of recurrence of mutations in the cases (while limiting the number of mutations in the controls).</p>
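<p>The role of repetition in the generic color-coding technique can be illustrated with a short Python sketch. It assumes that each vertex independently receives one of <italic>k</italic> colors uniformly at random, so that a fixed set of <italic>k</italic> vertices is colorful with probability <italic>k</italic>!/<italic>k</italic><sup><italic>k</italic></sup>. All function names are hypothetical, and the sketch describes the standard technique rather than the specific guarantees or the number of iterations used by NoMAS.</p>
<preformat>
```python
import math
import random


def colorful_probability(k):
    """Probability that k fixed vertices get k distinct colors out of k."""
    return math.factorial(k) / k**k


def iterations_needed(k, failure_prob):
    """Smallest number of independent colorings such that a fixed k-set is
    colorful at least once with probability at least 1 - failure_prob."""
    p = colorful_probability(k)
    if p >= 1.0:
        return 1
    return math.ceil(math.log(failure_prob) / math.log(1.0 - p))


def is_colorful(vertices, coloring):
    """True if all the given vertices have pairwise distinct colors."""
    colors = [coloring[v] for v in vertices]
    return len(set(colors)) == len(colors)


# One random coloring of a hypothetical 6-vertex network with k = 3 colors.
rng = random.Random(0)
coloring = {v: rng.randrange(3) for v in range(6)}
```
</preformat>
<p>Since <italic>k</italic>!/<italic>k</italic><sup><italic>k</italic></sup> decreases roughly as <italic>e</italic><sup>&#x02212;<italic>k</italic></sup>, the number of colorings required grows exponentially in <italic>k</italic>, which is why the technique is suited to small subnetworks.</p>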
<p>In this work we formally define the associated computational problem, which we call the max connected <italic>k</italic>-set log-rank problem, by using as score for a subnetwork the test statistic of the log-rank test, one of the most widely used statistical tests to assess the significance of the difference in survival between two populations. We prove that the max connected <italic>k</italic>-set log-rank problem is NP-hard in general, and is NP-hard even when restricted to graphs with at least one node of large degree. We develop a new algorithm, NoMAS, based on the color-coding technique, to efficiently identify high-scoring subnetworks associated with survival. We prove that even if our algorithm is not guaranteed to identify the optimal solution with the probability given by the color-coding technique (due to the non-additivity of our scoring function), it does identify the optimal solution with the same guarantees given by the color-coding technique when the data comes from a reasonable model for mutations and independently of the survival data. Using simulated data, we show that NoMAS is more efficient than the exhaustive algorithm while still identifying the optimal solution, and that our algorithm will identify subnetworks associated with survival given sample sizes that are larger than most currently available ones but still reasonable.</p>
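<p>For readers unfamiliar with the log-rank test, the following Python sketch computes the standardized two-sample log-rank statistic in its textbook form, with the hypergeometric variance at each event time. The function name, the sign convention (positive when the first group has more events than expected), and the toy data are assumptions made for illustration; the normalization may differ from the one used in the definition of our score.</p>
<preformat>
```python
import math


def logrank_statistic(times, events, in_group):
    """Standardized two-sample log-rank statistic (textbook form).

    times[i]    -- survival or censoring time of sample i
    events[i]   -- True if the event was observed, False if censored
    in_group[i] -- True if sample i belongs to the first group
    """
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for i in at_risk if in_group[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d1 = sum(1 for i in at_risk
                 if times[i] == t and events[i] and in_group[i])
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance > 0:
        return observed_minus_expected / math.sqrt(variance)
    return 0.0


# Toy data: four uncensored samples; the first two form the first group.
times = [1, 2, 3, 4]
events = [True, True, True, True]
mutated = [True, True, False, False]
stat = logrank_statistic(times, events, mutated)
```
</preformat>
<p>When the first group is defined as the samples with a mutation in at least one gene of a set, the statistic of the set is not the sum of the statistics of its genes, which is the non-additivity referred to above.</p>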
<p>We use cancer data from three cancer studies from TCGA to compare NoMAS to approaches based on single gene scores and to greedy methods similar to ones proposed in the literature for the identification of subnetworks associated with survival and for other problems on graphs. Our results show that NoMAS identifies subnetworks with stronger association to survival compared to other approaches, and allows the correct estimation of <italic>p</italic>-values using a permutation test. Moreover, in two datasets NoMAS identifies two subnetworks associated with survival containing genes previously reported to be important for prognosis in the same cancer type as well as novel genes, while no gene is significantly associated with survival when considered in isolation.</p>
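<p>A generic permutation test of the kind mentioned above can be sketched in Python as follows. This is an illustrative Monte Carlo procedure under stated assumptions, not a description of the exact permutation scheme used in our experiments: the function names are hypothetical, outcomes are shuffled among samples while group labels stay fixed, and the example statistic is a simple difference of means that ignores censoring.</p>
<preformat>
```python
import random


def permutation_p_value(statistic, labels, outcomes,
                        n_permutations=1000, seed=0):
    """Two-sided Monte Carlo permutation p-value for statistic(labels, outcomes)."""
    rng = random.Random(seed)
    observed = abs(statistic(labels, outcomes))
    shuffled = list(outcomes)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any association with the labels
        if abs(statistic(labels, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def mean_difference(labels, outcomes):
    """Toy statistic: mean outcome inside the group minus mean outside."""
    inside = [o for l, o in zip(labels, outcomes) if l]
    outside = [o for l, o in zip(labels, outcomes) if not l]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Hypothetical data: three samples in the group, three outside.
labels = [True, True, True, False, False, False]
outcomes = [1, 2, 3, 10, 11, 12]
p_value = permutation_p_value(mean_difference, labels, outcomes)
```
</preformat>
<p>In a survival setting the toy statistic would be replaced by a censoring-aware statistic such as the log-rank statistic.</p>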
<p>There are many directions in which this work can be extended. First, we only considered single nucleotide variants and indels in our analysis; we plan to extend our method to consider more complex variants (e.g., copy number aberrations and differential methylation) in the analysis. Second, we believe that our algorithm and its analysis could be extended to the identification of subnetworks associated with clinical parameters other than survival time and to case-control studies, but substantial modifications to the algorithm and to its analysis will be required. Third, this work considers the log-rank statistic as a measure of association with survival; another popular test in survival analysis is the use of Cox&#x00027;s regression model (Kalbfleisch and Prentice, <xref ref-type="bibr" rid="B17">2002</xref>). The two tests are identical in the case of two populations, therefore our algorithm identifies subnetworks with high score w.r.t. Cox&#x00027;s regression model as well. However, Cox&#x00027;s regression model allows for the correction for covariates (e.g., gender, age, etc.) in the analysis of survival data. A similar approach could be obtained by stratifying the patients in the log-rank test, but how to efficiently identify subnetworks, and in general combinations of genomic features, associated with survival while correcting for covariates remains a challenging open problem. Fourth, genomic regions other than genes (e.g., regulatory regions) or even other regulatory elements (e.g., microRNAs regulating the expression of driver genes) may be important for survival: the incorporation in our method of alterations in such regions and elements is an interesting direction for future research. Finally, in some studies the information regarding tumor (sub)clones and their mutations may be available: how to properly integrate such information in our analyses is a challenging direction for further investigation.</p>
</sec>
<sec id="s5">
<title>Author Contributions</title>
<p>FV conceived and designed the study. FA, TH, and FV designed the algorithms, performed the analyses, and wrote the manuscript. FA and TH wrote the software.</p>
<sec>
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s6">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2019.00265/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2019.00265/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alon</surname> <given-names>N.</given-names></name> <name><surname>Dao</surname> <given-names>P.</given-names></name> <name><surname>Hajirasouliha</surname> <given-names>I.</given-names></name> <name><surname>Hormozdiari</surname> <given-names>F.</given-names></name> <name><surname>Sahinalp</surname> <given-names>S. C.</given-names></name></person-group> (<year>2008</year>). <article-title>Biomolecular network motif counting and discovery by color coding</article-title>. <source>Bioinformatics</source> <volume>24</volume>, <fpage>i241</fpage>&#x02013;<lpage>i249</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btn163</pub-id><pub-id pub-id-type="pmid">18586721</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alon</surname> <given-names>N.</given-names></name> <name><surname>Yuster</surname> <given-names>R.</given-names></name> <name><surname>Zwick</surname> <given-names>U.</given-names></name></person-group> (<year>1994</year>). <article-title>Color-coding: a new method for finding simple paths, cycles and other small subgraphs within large graphs</article-title>, in <source>Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing</source> (<publisher-loc>New York, NY: ACM</publisher-loc>), <fpage>326</fpage>&#x02013;<lpage>335</lpage>. <pub-id pub-id-type="doi">10.1145/195058.195179</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bruckner</surname> <given-names>S.</given-names></name> <name><surname>H&#x000FC;ffner</surname> <given-names>F.</given-names></name> <name><surname>Karp</surname> <given-names>R. M.</given-names></name> <name><surname>Shamir</surname> <given-names>R.</given-names></name> <name><surname>Sharan</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>Topology-free querying of protein interaction networks</article-title>. <source>J. Comput. Biol.</source> <volume>17</volume>, <fpage>237</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1089/cmb.2009.0170</pub-id><pub-id pub-id-type="pmid">20377443</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Cancer Genome Atlas Network</collab></person-group> (<year>2015</year>). <article-title>Comprehensive genomic characterization of head and neck squamous cell carcinomas</article-title>. <source>Nature</source> <volume>517</volume>, <fpage>576</fpage>&#x02013;<lpage>582</lpage>. <pub-id pub-id-type="doi">10.1038/nature14129</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Cancer Genome Atlas Research Network</collab></person-group> (<year>2011</year>). <article-title>Integrated genomic analyses of ovarian carcinoma</article-title>. <source>Nature</source> <volume>474</volume>, <fpage>609</fpage>&#x02013;<lpage>615</lpage>. <pub-id pub-id-type="doi">10.1038/nature10166</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Cancer Genome Atlas Research Network</collab></person-group> (<year>2013</year>). <article-title>Comprehensive molecular characterization of clear cell renal cell carcinoma</article-title>. <source>Nature</source> <volume>499</volume>, <fpage>43</fpage>&#x02013;<lpage>49</lpage>. <pub-id pub-id-type="doi">10.1038/nature12222</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Cancer Genome Atlas Research Network</collab></person-group> (<year>2014</year>). <article-title>Integrated genomic characterization of papillary thyroid carcinoma</article-title>. <source>Cell</source> <volume>159</volume>, <fpage>676</fpage>&#x02013;<lpage>690</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2014.09.050</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Cancer Genome Atlas Research Network Analysis Working Group: Asan University, BC Cancer Agency, Brigham and Women&#x00027;s Hospital, Broad Institute, Brown University</collab></person-group>. (<year>2017</year>). <article-title>Integrated genomic characterization of oesophageal carcinoma</article-title>. <source>Nature</source> <volume>541</volume>:<fpage>169</fpage>. <pub-id pub-id-type="doi">10.1038/nature20805</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chowdhury</surname> <given-names>S. A.</given-names></name> <name><surname>Nibbe</surname> <given-names>R. K.</given-names></name> <name><surname>Chance</surname> <given-names>M. R.</given-names></name> <name><surname>Koyut&#x000FC;rk</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>Subnetwork state functions define dysregulated subnetworks in cancer</article-title>. <source>J. Comput. Biol.</source> <volume>18</volume>, <fpage>263</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1089/cmb.2010.0269</pub-id><pub-id pub-id-type="pmid">21385033</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chuang</surname> <given-names>H.-Y.</given-names></name> <name><surname>Lee</surname> <given-names>E.</given-names></name> <name><surname>Liu</surname> <given-names>Y.-T.</given-names></name> <name><surname>Lee</surname> <given-names>D.</given-names></name> <name><surname>Ideker</surname> <given-names>T.</given-names></name></person-group> (<year>2007</year>). <article-title>Network-based classification of breast cancer metastasis</article-title>. <source>Mol. Syst. Biol.</source> <volume>3</volume>:<fpage>140</fpage>. <pub-id pub-id-type="doi">10.1038/msb4100180</pub-id><pub-id pub-id-type="pmid">17940530</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dao</surname> <given-names>P.</given-names></name> <name><surname>Wang</surname> <given-names>K.</given-names></name> <name><surname>Collins</surname> <given-names>C.</given-names></name> <name><surname>Ester</surname> <given-names>M.</given-names></name> <name><surname>Lapuk</surname> <given-names>A.</given-names></name> <name><surname>Sahinalp</surname> <given-names>S. C.</given-names></name></person-group> (<year>2011</year>). <article-title>Optimally discriminative subnetwork markers predict response to chemotherapy</article-title>. <source>Bioinformatics</source> <volume>27</volume>, <fpage>i205</fpage>&#x02013;<lpage>i213</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btr245</pub-id><pub-id pub-id-type="pmid">21685072</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Das</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>H.</given-names></name></person-group> (<year>2012</year>). <article-title>Hint: High-quality protein interactomes and their applications in understanding human disease</article-title>. <source>BMC Syst. Biol.</source> <volume>6</volume>:<fpage>92</fpage>. <pub-id pub-id-type="doi">10.1186/1752-0509-6-92</pub-id><pub-id pub-id-type="pmid">22846459</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dees</surname> <given-names>N. D.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Kandoth</surname> <given-names>C.</given-names></name> <name><surname>Wendl</surname> <given-names>M. C.</given-names></name> <name><surname>Schierding</surname> <given-names>W.</given-names></name> <name><surname>Koboldt</surname> <given-names>D. C.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Music: identifying mutational significance in cancer genomes</article-title>. <source>Genome Res.</source> <volume>22</volume>, <fpage>1589</fpage>&#x02013;<lpage>1598</lpage>. <pub-id pub-id-type="doi">10.1101/gr.134635.111</pub-id><pub-id pub-id-type="pmid">22759861</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garraway</surname> <given-names>L. A.</given-names></name> <name><surname>Lander</surname> <given-names>E. S.</given-names></name></person-group> (<year>2013</year>). <article-title>Lessons from the cancer genome</article-title>. <source>Cell</source> <volume>153</volume>, <fpage>17</fpage>&#x02013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2013.03.002</pub-id><pub-id pub-id-type="pmid">23540688</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hofree</surname> <given-names>M.</given-names></name> <name><surname>Shen</surname> <given-names>J. P.</given-names></name> <name><surname>Carter</surname> <given-names>H.</given-names></name> <name><surname>Gross</surname> <given-names>A.</given-names></name> <name><surname>Ideker</surname> <given-names>T.</given-names></name></person-group> (<year>2013</year>). <article-title>Network-based stratification of tumor mutations</article-title>. <source>Nat. Methods</source> <volume>10</volume>, <fpage>1108</fpage>&#x02013;<lpage>1115</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.2651</pub-id><pub-id pub-id-type="pmid">24037242</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hormozdiari</surname> <given-names>F.</given-names></name> <name><surname>Penn</surname> <given-names>O.</given-names></name> <name><surname>Borenstein</surname> <given-names>E.</given-names></name> <name><surname>Eichler</surname> <given-names>E. E.</given-names></name></person-group> (<year>2015</year>). <article-title>The discovery of integrated gene networks for autism and related disorders</article-title>. <source>Genome Res.</source> <volume>25</volume>, <fpage>142</fpage>&#x02013;<lpage>154</lpage>. <pub-id pub-id-type="doi">10.1101/gr.178855.114</pub-id><pub-id pub-id-type="pmid">25378250</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kalbfleisch</surname> <given-names>J.</given-names></name> <name><surname>Prentice</surname> <given-names>R.</given-names></name></person-group> (<year>2002</year>). <source>The Statistical Analysis of Failure Time Data, 2nd Edn</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>Wiley-Interscience</publisher-name>.</citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kandoth</surname> <given-names>C.</given-names></name> <name><surname>McLellan</surname> <given-names>M. D.</given-names></name> <name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Ye</surname> <given-names>K.</given-names></name> <name><surname>Niu</surname> <given-names>B.</given-names></name> <name><surname>Lu</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Mutational landscape and significance across 12 major cancer types</article-title>. <source>Nature</source> <volume>502</volume>, <fpage>333</fpage>&#x02013;<lpage>339</lpage>. <pub-id pub-id-type="doi">10.1038/nature12634</pub-id><pub-id pub-id-type="pmid">24132290</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.-A.</given-names></name> <name><surname>Cho</surname> <given-names>D.-Y.</given-names></name> <name><surname>Dao</surname> <given-names>P.</given-names></name> <name><surname>Przytycka</surname> <given-names>T. M.</given-names></name></person-group> (<year>2015</year>). <article-title>Memcover: integrated analysis of mutual exclusivity and functional network reveals dysregulated pathways across multiple cancer types</article-title>. <source>Bioinformatics</source> <volume>31</volume>, <fpage>i284</fpage>&#x02013;<lpage>i292</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv247</pub-id><pub-id pub-id-type="pmid">26072494</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lawrence</surname> <given-names>M. S.</given-names></name> <name><surname>Stojanov</surname> <given-names>P.</given-names></name> <name><surname>Polak</surname> <given-names>P.</given-names></name> <name><surname>Kryukov</surname> <given-names>G. V.</given-names></name> <name><surname>Cibulskis</surname> <given-names>K.</given-names></name> <name><surname>Sivachenko</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Mutational heterogeneity in cancer and the search for new cancer-associated genes</article-title>. <source>Nature</source> <volume>499</volume>, <fpage>214</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1038/nature12213</pub-id><pub-id pub-id-type="pmid">23770567</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Leiserson</surname> <given-names>M. D. M.</given-names></name> <name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Wu</surname> <given-names>H.-T.</given-names></name> <name><surname>Dobson</surname> <given-names>J. R.</given-names></name> <name><surname>Eldridge</surname> <given-names>J. V.</given-names></name> <name><surname>Thomas</surname> <given-names>J. L.</given-names></name> <etal/></person-group>. (<year>2015a</year>). <article-title>Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes</article-title>. <source>Nat. Genet.</source> <volume>47</volume>, <fpage>106</fpage>&#x02013;<lpage>114</lpage>. <pub-id pub-id-type="doi">10.1038/ng.3168</pub-id><pub-id pub-id-type="pmid">25501392</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Leiserson</surname> <given-names>M. D. M.</given-names></name> <name><surname>Wu</surname> <given-names>H.-T.</given-names></name> <name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Raphael</surname> <given-names>B. J.</given-names></name></person-group> (<year>2015b</year>). <article-title>Comet: a statistical approach to identify combinations of mutually exclusive alterations in cancer</article-title>. <source>Genome Biol.</source> <volume>16</volume>:<fpage>160</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-015-0700-7</pub-id><pub-id pub-id-type="pmid">26253137</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mantel</surname> <given-names>N.</given-names></name></person-group> (<year>1966</year>). <article-title>Evaluation of survival data and two new rank order statistics arising in its consideration</article-title>. <source>Cancer Chemother. Rep.</source> <volume>50</volume>, <fpage>163</fpage>&#x02013;<lpage>170</lpage>. <pub-id pub-id-type="pmid">5910392</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maxwell</surname> <given-names>S.</given-names></name> <name><surname>Chance</surname> <given-names>M. R.</given-names></name> <name><surname>Koyut&#x000FC;rk</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Efficiently enumerating all connected induced subgraphs of a large molecular network</article-title>, in <source>Algorithms for Computational Biology</source>, eds <person-group person-group-type="editor"><name><surname>Dediu</surname> <given-names>A.-H.</given-names></name> <name><surname>Mart&#x000ED;n-Vide</surname> <given-names>C.</given-names></name> <name><surname>Truthe</surname> <given-names>B.</given-names></name></person-group> (<publisher-loc>Tarragona</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>171</fpage>&#x02013;<lpage>182</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-07953-0_14</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Patel</surname> <given-names>V. N.</given-names></name> <name><surname>Gokulrangan</surname> <given-names>G.</given-names></name> <name><surname>Chowdhury</surname> <given-names>S. A.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Sloan</surname> <given-names>A. E.</given-names></name> <name><surname>Koyut&#x000FC;rk</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Network signatures of survival in glioblastoma multiforme</article-title>. <source>PLoS Comput. Biol.</source> <volume>9</volume>:<fpage>e1003237</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003237</pub-id><pub-id pub-id-type="pmid">24068912</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peto</surname> <given-names>R.</given-names></name> <name><surname>Peto</surname> <given-names>J.</given-names></name></person-group> (<year>1972</year>). <article-title>Asymptotically efficient rank invariant test procedures</article-title>. <source>J. R. Stat. Soc. A</source> <volume>135</volume>, <fpage>185</fpage>&#x02013;<lpage>206</lpage>. <pub-id pub-id-type="doi">10.2307/2344317</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raphael</surname> <given-names>B. J.</given-names></name> <name><surname>Dobson</surname> <given-names>J. R.</given-names></name> <name><surname>Oesper</surname> <given-names>L.</given-names></name> <name><surname>Vandin</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine</article-title>. <source>Genome Med.</source> <volume>6</volume>:<fpage>5</fpage>. <pub-id pub-id-type="doi">10.1186/gm524</pub-id><pub-id pub-id-type="pmid">24479672</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raphael</surname> <given-names>B. J.</given-names></name> <name><surname>Hruban</surname> <given-names>R. H.</given-names></name> <name><surname>Aguirre</surname> <given-names>A. J.</given-names></name> <name><surname>Moffitt</surname> <given-names>R. A.</given-names></name> <name><surname>Yeh</surname> <given-names>J. J.</given-names></name> <name><surname>Stewart</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Integrated genomic characterization of pancreatic ductal adenocarcinoma</article-title>. <source>Cancer Cell</source> <volume>32</volume>, <fpage>185</fpage>&#x02013;<lpage>203</lpage>. <pub-id pub-id-type="doi">10.1016/j.ccell.2017.07.007</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reimand</surname> <given-names>J.</given-names></name> <name><surname>Bader</surname> <given-names>G. D.</given-names></name></person-group> (<year>2013</year>). <article-title>Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers</article-title>. <source>Mol. Syst. Biol.</source> <volume>9</volume>:<fpage>637</fpage>. <pub-id pub-id-type="doi">10.1038/msb.2012.68</pub-id><pub-id pub-id-type="pmid">23340843</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shinderman-Maman</surname> <given-names>E.</given-names></name> <name><surname>Cohen</surname> <given-names>K.</given-names></name> <name><surname>Moskovich</surname> <given-names>D.</given-names></name> <name><surname>Hercbergs</surname> <given-names>A.</given-names></name> <name><surname>Werner</surname> <given-names>H.</given-names></name> <name><surname>Davis</surname> <given-names>P. J.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Thyroid hormones derivatives reduce proliferation and induce cell death and DNA damage in ovarian cancer</article-title>. <source>Sci. Rep.</source> <volume>7</volume>:<fpage>16475</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-017-16593-x</pub-id><pub-id pub-id-type="pmid">29184090</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shrestha</surname> <given-names>R.</given-names></name> <name><surname>Hodzic</surname> <given-names>E.</given-names></name> <name><surname>Sauerwald</surname> <given-names>T.</given-names></name> <name><surname>Dao</surname> <given-names>P.</given-names></name> <name><surname>Wang</surname> <given-names>K.</given-names></name> <name><surname>Yeung</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>HIT&#x00027;nDRIVE: patient-specific multidriver gene prioritization for precision oncology</article-title>. <source>Genome Res.</source> <volume>27</volume>, <fpage>1573</fpage>&#x02013;<lpage>1588</lpage>. <pub-id pub-id-type="doi">10.1101/gr.221218.117</pub-id><pub-id pub-id-type="pmid">28768687</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smyth</surname> <given-names>P.</given-names></name> <name><surname>Goodman</surname> <given-names>R. M.</given-names></name></person-group> (<year>1992</year>). <article-title>An information theoretic approach to rule induction from databases</article-title>. <source>IEEE Trans. Knowledge Data Eng.</source> <volume>4</volume>, <fpage>301</fpage>&#x02013;<lpage>316</lpage>. <pub-id pub-id-type="doi">10.1109/69.149926</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Geng</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Increased expression of mitotic arrest deficient-like 1 (MAD1L1) is associated with poor prognosis and insensitive to Taxol treatment in breast cancer</article-title>. <source>Breast Cancer Res. Treat.</source> <volume>140</volume>, <fpage>323</fpage>&#x02013;<lpage>330</lpage>. <pub-id pub-id-type="doi">10.1007/s10549-013-2633-8</pub-id><pub-id pub-id-type="pmid">23860928</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tamborero</surname> <given-names>D.</given-names></name> <name><surname>Gonzalez-Perez</surname> <given-names>A.</given-names></name> <name><surname>Perez-Llamas</surname> <given-names>C.</given-names></name> <name><surname>Deu-Pons</surname> <given-names>J.</given-names></name> <name><surname>Kandoth</surname> <given-names>C.</given-names></name> <name><surname>Reimand</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Comprehensive identification of mutational cancer driver genes across 12 tumor types</article-title>. <source>Sci. Rep.</source> <volume>3</volume>:<fpage>2650</fpage>. <pub-id pub-id-type="doi">10.1038/srep02650</pub-id><pub-id pub-id-type="pmid">24084849</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Clay</surname> <given-names>P.</given-names></name> <name><surname>Upfal</surname> <given-names>E.</given-names></name> <name><surname>Raphael</surname> <given-names>B. J.</given-names></name></person-group> (<year>2012a</year>). <article-title>Discovery of mutated subnetworks associated with clinical data in cancer</article-title>. <source>Pac. Symp. Biocomput.</source> <volume>2012</volume>, <fpage>55</fpage>&#x02013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1142/9789814366496_0006</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Papoutsaki</surname> <given-names>A.</given-names></name> <name><surname>Raphael</surname> <given-names>B. J.</given-names></name> <name><surname>Upfal</surname> <given-names>E.</given-names></name></person-group> (<year>2015</year>). <article-title>Accurate computation of survival statistics in genome-wide studies</article-title>. <source>PLoS Comput. Biol.</source> <volume>11</volume>:<fpage>e1004071</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1004071</pub-id><pub-id pub-id-type="pmid">25950620</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vandin</surname> <given-names>F.</given-names></name> <name><surname>Upfal</surname> <given-names>E.</given-names></name> <name><surname>Raphael</surname> <given-names>B. J.</given-names></name></person-group> (<year>2012b</year>). <article-title><italic>De novo</italic> discovery of mutated driver pathways in cancer</article-title>. <source>Genome Res.</source> <volume>22</volume>, <fpage>375</fpage>&#x02013;<lpage>385</lpage>. <pub-id pub-id-type="doi">10.1101/gr.120477.111</pub-id><pub-id pub-id-type="pmid">21653252</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vogelstein</surname> <given-names>B.</given-names></name> <name><surname>Papadopoulos</surname> <given-names>N.</given-names></name> <name><surname>Velculescu</surname> <given-names>V. E.</given-names></name> <name><surname>Zhou</surname> <given-names>S.</given-names></name> <name><surname>Diaz</surname> <given-names>L. A.</given-names> <suffix>Jr.</suffix></name> <name><surname>Kinzler</surname> <given-names>K. W.</given-names></name></person-group> (<year>2013</year>). <article-title>Cancer genome landscapes</article-title>. <source>Science</source> <volume>339</volume>, <fpage>1546</fpage>&#x02013;<lpage>1558</lpage>. <pub-id pub-id-type="doi">10.1126/science.1235122</pub-id><pub-id pub-id-type="pmid">23539594</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>G.</given-names></name> <name><surname>Stein</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>A network module-based method for identifying cancer prognostic signatures</article-title>. <source>Genome Biol.</source> <volume>13</volume>:<fpage>R112</fpage>. <pub-id pub-id-type="doi">10.1186/gb-2012-13-12-r112</pub-id><pub-id pub-id-type="pmid">23228031</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>H.</given-names></name> <name><surname>Tardivo</surname> <given-names>L.</given-names></name> <name><surname>Tam</surname> <given-names>S.</given-names></name> <name><surname>Weiner</surname> <given-names>E.</given-names></name> <name><surname>Gebreab</surname> <given-names>F.</given-names></name> <name><surname>Fan</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Next-generation sequencing to generate interactome datasets</article-title>. <source>Nat. Methods</source> <volume>8</volume>, <fpage>478</fpage>&#x02013;<lpage>480</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.1597</pub-id><pub-id pub-id-type="pmid">21516116</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zou</surname> <given-names>Q.</given-names></name> <name><surname>Jin</surname> <given-names>J.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>H. S.</given-names></name> <name><surname>Romano</surname> <given-names>S.</given-names></name> <name><surname>Xiao</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>USP15 stabilizes MDM2 to mediate cancer-cell survival and inhibit antitumor T cell responses</article-title>. <source>Nat. Immunol.</source> <volume>15</volume>:<fpage>562</fpage>. <pub-id pub-id-type="doi">10.1038/ni.2885</pub-id><pub-id pub-id-type="pmid">24777531</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>In the literature, two different standard deviations (corresponding to two related but distinct null distributions, permutational and conditional) have been proposed for the normal approximation of the distribution of the log-rank statistic; we have previously shown (Vandin et al., <xref ref-type="bibr" rid="B36">2015</xref>) that the one we use here (corresponding to the permutational distribution) is more appropriate for genomic studies.</p></fn>
<fn id="fn0002"><p><sup>2</sup>The implementation of NoMAS is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VandinLab/NoMAS">https://github.com/VandinLab/NoMAS</ext-link>.</p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="http://compbio-research.cs.brown.edu/pancancer/hotnet2/">http://compbio-research.cs.brown.edu/pancancer/hotnet2/</ext-link></p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work is supported, in part, by the University of Padova under projects CPDA121378/12, SID 2017, and STARS: Algorithms for Inferential Data mining, and by NSF grant IIS-1247581. The results presented in this manuscript are in whole or part based upon data generated by the TCGA Research Network: <ext-link ext-link-type="uri" xlink:href="http://cancergenome.nih.gov/">http://cancergenome.nih.gov/</ext-link>. This paper was selected for oral presentation at RECOMB 2016 and an abstract is published in the conference proceedings.</p>
</fn>
</fn-group>
</back>
</article> 