Edited by: Marco Pellegrini, Italian National Research Council (CNR), Italy
Reviewed by: Georges Nemer, American University of Beirut, Lebanon; Salvatore Alaimo, University of Catania, Italy
This article was submitted to Systems Biology, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Next-generation sequencing technologies make it possible to measure somatic mutations in a large number of patients with the same cancer type; one of the main goals in their analysis is the identification of mutations associated with clinical parameters. The identification of such relationships is hindered by extensive genetic heterogeneity in tumors, with different genes mutated in different patients, due, in part, to the fact that genes and mutations act in the context of
Recent advances in next-generation sequencing technologies have enabled the collection of sequence information from many genomes and exomes, with many large human and cancer genetic studies measuring mutations in all genes for a large number of patients with a specific disease (Cancer Genome Atlas Research Network,
In recent years, several computational and statistical methods have been designed to identify driver mutations and distinguish them from passenger mutations, exploiting data from large cancer studies (Raphael et al.,
In addition to mutation data, large cancer studies often also collect clinical data about the patients, including survival information. An important feature of survival data is that it often contains
The field of survival analysis has produced an extensive literature on the analysis of survival data, in particular for the comparison of the survival of two given populations (sets of samples) (Kalbfleisch and Prentice,
In this paper we study the problem of finding sets of interacting genes with mutations associated with survival using data from large cancer sequencing studies and interaction information from a genome-scale interaction network. We focus on the widely used log-rank statistic as a measure of the association between mutations in a group of genes and survival. Our contribution is in five parts: first, we formally define the problem of finding the set of
In this section we present the model we consider, our algorithm NoMAS, and the tests we have designed to assess the statistical significance of the results.
In survival analysis, we are given two populations (i.e., sets of samples)
Under the (null) hypothesis of no difference in survival between
In genomic studies, we are given mutation data for a set
Given the set
To identify the set of
We have the following.
We now define the max connected
If
We design a new algorithm,
Algorithm NoMAS. Given alteration data and survival information (time and censoring status) for a set of patients, NoMAS employs a color coding approach to identify subnetworks with mutations associated with survival time, i.e., with high log-rank statistic, and then assesses the statistical significance of the subnetworks using (i) permutation testing and (ii) a holdout approach.
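The score that the description above refers to can be illustrated with a minimal sketch. It assumes the textbook form of the log-rank statistic (observed minus expected events in the mutated group, summed over event times) and a patient counted as mutated when at least one gene of the set is altered. The function names, the input layout, and the sign convention are assumptions made for illustration; the normalization used by NoMAS is not reproduced here.

```python
def mutated_in_set(mutations, genes):
    """Return 1 for each patient with a mutation in at least one gene of `genes`.

    `mutations` is a list with one set of mutated gene names per patient
    (a hypothetical input layout chosen for this sketch).
    """
    return [int(any(g in sample for g in genes)) for sample in mutations]


def log_rank_statistic(times, events, in_group):
    """Textbook log-rank statistic: observed minus expected events in the group.

    times[i]    : survival or censoring time of patient i
    events[i]   : 1 if the event was observed, 0 if the patient was censored
    in_group[i] : 1 if patient i belongs to the group of interest, else 0
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)            # patients still under observation
    at_risk_group = sum(in_group)   # of those, how many are in the group
    stat = 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = d_group = leaving = leaving_group = 0
        # Collect every patient whose time equals t (handles ties).
        while i < len(order) and times[order[i]] == t:
            idx = order[i]
            leaving += 1
            leaving_group += in_group[idx]
            if events[idx]:
                d += 1
                d_group += in_group[idx]
            i += 1
        if d > 0:
            # Expected events in the group under the null hypothesis.
            stat += d_group - d * at_risk_group / at_risk
        at_risk -= leaving
        at_risk_group -= leaving_group
    return stat
```

With this sign convention a positive value means the mutated group experiences events earlier than expected under the null hypothesis; a gene set would be scored by passing `mutated_in_set(mutations, genes)` as `in_group`.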
Consider a given coloring of
For entry
The computation of
We designed two procedures to assess the statistical significance of the results found by NoMAS: the first is based on permutation testing, while the second uses a holdout approach.
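As a rough illustration of the first procedure, the sketch below implements a generic Monte Carlo permutation test. It is not the exact scheme used by NoMAS, whose details are not given in this passage: it assumes that group labels are shuffled across patients and that the p-value is the fraction of permutations scoring at least as high as the observed data, with the usual add-one correction. The names `permutation_p_value` and `make_mean_gap`, and the toy score, are hypothetical.

```python
import random


def permutation_p_value(score, labels, n_perm=999, seed=0):
    """Estimate a p-value by shuffling `labels` and re-evaluating `score`.

    score  : callable mapping a list of 0/1 labels to a number
             (any association measure could be plugged in here)
    labels : observed 0/1 group labels, one per patient
    """
    rng = random.Random(seed)
    observed = score(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def make_mean_gap(values):
    """Toy score for the demo: absolute difference of the two group means."""
    def mean_gap(labels):
        ones = [v for v, l in zip(values, labels) if l]
        zeros = [v for v, l in zip(values, labels) if not l]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
    return mean_gap


# Usage: labels that separate low from high values give a small p-value.
gap = make_mean_gap(list(range(1, 13)))
p_separated = permutation_p_value(gap, [1] * 6 + [0] * 6)
```

Passing the score as a callable keeps the test independent of the statistic, so the same routine could wrap a survival score instead of the toy mean gap used here.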
After identifying the best solution
We designed a holdout method to strengthen the statistical robustness of the results produced by NoMAS. We split the dataset into two parts, called
We consider the performance of NoMAS excluding the statistical significance testing. The log-rank statistic
Let
However, our score
Even more, we prove that when mutations are placed arbitrarily then for every subnetwork
Intuitively, Proposition 1 and Theorem 3 show that if mutations are placed adversarially (and the optimal solution
Intuitively: (3.1) above states that the subnetwork
We show that when enough samples are generated from the model above, our algorithm identifies the optimal solution with the same probability guarantee given by the color-coding technique for additive scores.
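For reference, the probability guarantee usually quoted for the color-coding technique can be written out as follows. This is the standard argument rather than a restatement of the proof in this paper; it assumes that each vertex receives one of $k$ colors independently and uniformly at random, and the symbols $T$ (number of independent colorings) and $\delta$ (allowed failure probability) are notation introduced here.

```latex
\Pr\bigl[\text{a fixed set of } k \text{ vertices is colorful}\bigr]
  = \frac{k!}{k^{k}} \;\ge\; e^{-k},
\qquad
\Pr\bigl[\text{the set is colorful in none of } T \text{ colorings}\bigr]
  \;\le\; \bigl(1 - e^{-k}\bigr)^{T}
  \;\le\; e^{-T e^{-k}}
  \;\le\; \delta
\quad \text{whenever} \quad T \ge e^{k} \ln\frac{1}{\delta}.
```

In words, repeating the random coloring on the order of $e^{k}\ln(1/\delta)$ times makes the optimal set colorful at least once with probability at least $1-\delta$.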
We assessed the performance of NoMAS by using simulated and cancer data. We compared NoMAS to the exhaustive algorithm that identifies the subnetwork of
For all our experiments we used as interaction graph
The remainder of this section is organized as follows: section 3.2.1 presents the results on simulated data, while section 3.2.2 presents the results on cancer data.
We assess the performance of NoMAS on simulated data generated under the Planted subnetwork Model. The subnetwork
We fixed
Results of NoMAS on simulated data from the Planted Subnetwork Model. One hundred datasets were generated for each pair (
We assessed the performance of NoMAS on the GBM, OV, and LUSC datasets. We first assessed whether NoMAS identified the optimal solution by comparing the highest scoring solution reported by NoMAS with the one identified by using the exhaustive algorithm for
We also compared NoMAS with three different greedy strategies for the max connected
Comparison of the normalized log-rank statistic of the best solution reported by NoMAS, by greedy algorithms (see Appendix for the description), and by the algorithm that uses an additive scoring function
Finally, we compared NoMAS with the use of an (additive) score that sums single gene scores (similar to the ones used in Vandin et al. (
We then used the holdout approach to identify significant subnetworks for GBM, LUSC, and OV, considering the top-10 highest scoring subnetworks found in the training set and computing their
Subnetworks identified by NoMAS on GBM data. Subnetwork
Subnetworks identified by NoMAS on LUSC data. Subnetwork
Subnetworks identified by NoMAS on OV data. Subnetwork
In this work, we study the problem of identifying subnetworks of a large gene-gene interaction network that are associated with survival using mutations from large cancer genomic studies. Few methods have been proposed to identify groups of genes with mutations associated with survival in genomic studies. The work of Vandin et al. (
Color-coding is a probabilistic method that was originally described for finding simple paths, cycles and other small subnetworks of size
In this work we formally define the associated computational problem, which we call the max connected
We use cancer data from three cancer studies from TCGA to compare NoMAS to approaches based on single gene scores and to greedy methods similar to ones proposed in the literature for the identification of subnetworks associated with survival and for other problems on graphs. Our results show that NoMAS identifies subnetworks with stronger association to survival compared to other approaches, and allows the correct estimation of
There are many directions in which this work can be extended. First, we only considered single nucleotide variants and indels in our analysis; we plan to extend our method to consider more complex variants (e.g., copy number aberrations and differential methylation) in the analysis. Second, we believe that our algorithm and its analysis could be extended to the identification of subnetworks associated with clinical parameters other than survival time and to case-control studies, but substantial modifications to the algorithm and to its analysis will be required. Third, this work considers the log-rank statistic as a measure of association with survival; another popular test in survival analysis is the use of Cox's regression model (Kalbfleisch and Prentice,
FV conceived and designed the study. FA, TH, and FV designed the algorithms, performed the analyses, and wrote the manuscript. FA and TH wrote the software.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at:
1 In the literature two different standard deviations (corresponding to two related but different null distributions, permutational and conditional) have been proposed for the normal approximation of the distribution of the log-rank statistic; we have previously shown (Vandin et al.,
2 The implementation of NoMAS is available at