Cancer Markers Selection Using Network-Based Cox Regression: A Methodological and Computational Practice

International initiatives such as the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are collecting multiple datasets at different genome-scales with the aim of identifying novel cancer biomarkers and predicting survival of patients. To analyze such data, several statistical methods have been applied, among them Cox regression models. Although these models provide a good statistical framework to analyze omic data, there is still a lack of studies that illustrate advantages and drawbacks in integrating biological information and selecting groups of biomarkers. In fact, classical Cox regression algorithms focus on the selection of a single biomarker, without taking into account the strong correlation between genes. Even though network-based Cox regression algorithms overcome such drawbacks, such network-based approaches are less widely used within the life science community. In this article, we aim to provide a clear methodological framework on the use of such approaches in order to turn cancer research results into clinical applications. Therefore, we first discuss the rationale and the practical usage of three recently proposed network-based Cox regression algorithms (i.e., Net-Cox, AdaLnet, and fastcox). Then, we show how to combine existing biological knowledge and available data with such algorithms to identify networks of cancer biomarkers and to estimate survival of patients. Finally, we describe in detail a new permutation-based approach to better validate the significance of the selection in terms of cancer gene signatures and pathway/networks identification. We illustrate the proposed methodology by means of both simulations and real case studies. Overall, the aim of our work is two-fold. Firstly, to show how network-based Cox regression models can be used to integrate biological knowledge (e.g., multi-omics data) for the analysis of survival data. Secondly, to provide a clear methodological and computational approach for investigating cancers regulatory networks.

International initiatives such as the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are collecting multiple datasets at different genome-scales with the aim of identifying novel cancer biomarkers and predicting survival of patients. To analyze such data, several statistical methods have been applied, among them Cox regression models. Although these models provide a good statistical framework to analyze omic data, there is still a lack of studies that illustrate advantages and drawbacks in integrating biological information and selecting groups of biomarkers. In fact, classical Cox regression algorithms focus on the selection of a single biomarker, without taking into account the strong correlation between genes. Even though network-based Cox regression algorithms overcome such drawbacks, such network-based approaches are less widely used within the life science community. In this article, we aim to provide a clear methodological framework on the use of such approaches in order to turn cancer research results into clinical applications. Therefore, we first discuss the rationale and the practical usage of three recently proposed network-based Cox regression algorithms (i.e., Net-Cox, AdaLnet, and fastcox). Then, we show how to combine existing biological knowledge and available data with such algorithms to identify networks of cancer biomarkers and to estimate survival of patients. Finally, we describe in detail a new permutation-based approach to better validate the significance of the selection in terms of cancer gene signatures and pathway/networks identification. We illustrate the proposed methodology by means of both simulations and real case studies. Overall, the aim of our work is two-fold. Firstly, to show how network-based Cox regression models can be used to integrate biological knowledge (e.g., multi-omics data) for the analysis of survival data. Secondly, to provide a clear methodological and computational approach for investigating cancers regulatory networks.

INTRODUCTION
Recent developments in high-throughput technology have produced a huge amount of multiple and diverse genomescale data to deal with biological and clinical questions in cancer. For example, genomics, transcriptomics, and epigenomics information is nowadays publicly available for tens of different cancer cell lines from thousands of patients in The Cancer Genome Atlas (TCGA, http://cancergenome. nih.gov/). Mutations data over one million tumor samples are also reported in Cosmic (http://cancer.sanger.ac.uk/cosmic), the world's largest and most comprehensive resource for exploring the impact of somatic mutations. Other valuable databases include The Gene Expression Omnibus (GEO, http://www.ncbi. nlm.nih.gov/gds) among others. Such amount of data is likely to revolutionize genetics and biomedical cancer research, but a thorough integration of all these different types of information is necessary. Indeed, cancer is a "multi-factorial" disease caused by a combination of genetic, environmental, and lifestyle factors. Such factors play an important role in discovering prognostic and diagnostic cancer gene signatures opening a new way toward the so called "personalized medicine." The term refers to a new type of therapy that is essentially based on the features of each patient. For instance, the anticancer drug Cetuximab (Karapetis et al., 2008) inhibits cells proliferation by binding to the EGF receptor and, consequently, preventing activation of the downstream signaling pathway. However, it has been found that Cetuximab can work only if the K-RAS gene is not mutated. Another example is the anti-cancer drug Trastuzumab (Hudis, 2007), which is effective only in patients that highly express the human epidermal growth factor (HER2) at the cell surface, to which the antibody binds. These examples highlight the need of identifying stable and interpretable biomarkers able to predict patient survival and characterize a patient-personalized therapy. In addition, the knowledge of complex cancer processes and networks is important to optimize the use of technology within health care (Raghupathi and Raghupathi, 2014). By discovering associations within the data, big data analytics has the potential to improve care, save lives, and lower costs.
As a consequence, in the last years, there has been a growing interest in developing methods that integrate different genomescale data into regression models for survival data to create a comprehensive view of human biology and disease (Wang et al., 2014). A popular used approach for the integration of genomic and clinical information is the Cox proportional hazard model (Cox, 1972). The main goal of such method is investigating the connection between gene expression data and survival information to predict cancer survival, assess cancer outcomes, and identify new gene markers. However, since gene expression data are usually characterized by a number of covariates p much larger than the sample size n, the traditional Cox model cannot be applied. Hence, several penalized Cox regression methods have been developed to identify core pathways and biomarkers involved in cancer progression, e.g., the Cox model based on Lasso penalty (Tibshirani, 1996(Tibshirani, , 1997Gui and Li, 2005). Alternative penalized Cox regression models based on variable selection include the SCAD (Fan and Li, 2001), the adaptive Lasso (Zou, 2006), the elastic net model (Zou and Hastie, 2005;Simon et al., 2011a;Wu, 2012), and the Dantzig selector (Candes and Tao, 2007) among others. These methods are able to cope with the high-dimensionality of gene expression data, thus solving the "p ≫ n" issue (Engler and Li, 2009). All these penalized models are statistically efficient in high-dimensional regression, but they perform poorly on data with high collinearity. Moreover, no biological knowledge is taken into account. Indeed, they are simply based on statistical frameworks completely ignoring biological regulatory network, protein-protein interaction (PPI), signaling pathways, and well-known relationships among genes. In such models, the lack of biological information produces instability in predictors reducing the predictive ability of the models. Hence, in order to provide more reliable and biologically meaningful results, the inclusion of a-priori biological knowledge into the models is mandatory. To address this issue, new penalized Cox methods based on the integration of genomic information have been recently proposed (Zhang et al., 2013;Gong et al., 2014;Sun et al., 2014). In such models, the genomic information is encoded by a network whose graph structure identifies a given relation (edges) between genes (nodes). The resulting Laplacian matrix is then integrated as penalty in the Cox regression models. In particular, the network can represent the correlation between genes (Zhang et al., 2013), KEGG pathways identification (Sun et al., 2014), functional interaction network (Huttenhower et al., 2009), or PPI. These Cox models based on a-priori biological network are called "network-based Cox regression." The network-based Cox regression methods provide an efficient tool to perform Cox regression on high-dimensional data incorporating genes network information. In literature, there are some recent approaches that analyze different Cox methods. For instance, an accurate review of eight different methods that integrate network information into multi-variable Cox models is presented to study the risk prediction in breast cancer and the integrated Brier score is used as a performance measure (Fröhlich, 2014). However, the study performed enrichment analysis on the signatures genes selected by the compared models without showing any survival prediction analysis in terms of Kaplan-Meier curves. A network-based Cox regression model that explores gene-to-gene connections in multiple cancer datasets is also performed for maximizing the overall association of the sub-network with clinical outcomes (Martinez-Ledesma et al., 2015). A potential limitation of these conventional networks is that the edges only reflect the information of within-features or within-relations, and do not consider the association between features and outcomes, which may be useful in improving the predictive power. Therefore, an alternative network construction method for the outcomeguided gene-interaction network has to be introduced in order to improve the performance of survival analysis in network-based Cox regression (Jeong et al., 2015).
In this work, we present a methodological framework for the analysis of molecular and survival data through a crossvalidated approach of network-based Cox regression algorithms (Net-Cox, Adalnet, and fastcox, see Section Methods). The method starts from the analysis of raw data and, through a cross-validated penalty approach, it guides the reader to the interpretation of the final results. As shown in Figure 1, the general steps of our approach are the following: (i) defining the biological question and the experimental design using microarray data, then integrating a-priori biological information using functional map of the human genome such as HEFalMp (Huttenhower et al., 2009) and KEGG; (ii) performing biological screening of the data for selecting relevant features through crossvalidated penalization (Simon et al., 2011b); (iii) implementing network-based Cox regression models for the analysis of cancerrelated genes; (iv) evaluating survival models to predict cancer patient prognosis and exploring cancer associated pathways. The presented approach provides a new methodological framework for the study and the interpretation of regression methods through gene-network and pathways analyses and it can be easily adapted to incorporate other network-based Cox regression algorithms.
A preliminary study for the comparison of penalized Cox models was presented in Iuliano et al. (2014), where the analysis was limited to cancer survival prediction using top ranked genes. No simulation studies, extensive pathways analysis or validation of the data were performed in that study. On the contrary, this article presents a more accurate and complete analysis based on a cross-validated approach (Simon et al., 2011b), the overall workflow (see Figure 2) that includes both simulation studies and novel real cancer datasets (see Section Data Analysis). Simulated data have been used to perform a statistical comparison of the methods in terms of sensitivity, specificity, number of selected genes, false positive rates, and Matthews correlation coefficient in two simulation settings with different genetic effects. On the other hand, real datasets analysis was performed to assess the relevance of the selected genes in the training dataset and to test the survival prediction accuracy of each model. Cross-validated Kaplan-Meier curves for survival analysis and pathway analysis were also computed (see Section Results). The novelty of the current study consists in the integration of a cross-validated approach (Simon et al., 2011b) to obtain an accurate survival prediction even when the number of cases is relatively small for an effective sample splitting (see Figure 2). Cross-validation methods have been largely applied in Cox regression models to estimate prediction errors and for model parameters tuning (Vasselli et al., 2003;Molinaro et al., 2005;Simon et al., 2011b). Some of the most relevant crossvalidation approaches include leave-one-out cross-validation (LOOCV; Kearns and Ron, 1999), k-fold (Refaeilzadeh et al., 2009), and bootstrap algorithms (Kohavi, 1995). However, all these methods do not provide a good estimation if the data available are limited for an effective division in training and test sets. On the contrary, the cross-validation method used in our analysis (Simon et al., 2011b) is based on a re-sampling algorithm that allows an accurate prediction of the survival risk model regardless the data size. Therefore, in this work, we first present a novel statistical approach to infer pathway interaction networks from gene expression data that relies on a new mathematical FIGURE 1 | The pipeline of network-based Cox models approach for cancer survival analysis in four general steps. (1) Define the biological question and the experimental design and then, integrate a-priori biological information using functional map of the human genome; (2) perform biological screening of the data in order to select IN variables to use in the analysis; (3) implement network-based Cox regression models with the integration of a re-sampling method based on a cross-validated approach; (4) apply survival analysis to predict cancer patients and pathway analysis to explore groups of genes associated to the disease.
FIGURE 2 | Workflow of prognostic model building by using gene expression profile in cancer. The method starts from the analysis of raw data and, through a cross-validated penalty approach, it leads to the interpretation of the final results.
Step (1) includes the input data for the survival analysis: gene expression data, Frontiers in Physiology | www.frontiersin.org FIGURE 2 | Continued cancer-related genes, pathway information, and overall survival (OS) times.
Step (2) illustrates the novelty of the work based on a k-fold cross-validation Kaplan-Meier procedure by integrating network-regularized Cox models for selecting significant genes and pathways structures. The Prognostic Index (PI) has been used to divide the patients in high-risk and low-risk groups. Then, the union of these two groups is done to plot single cross-validated Kaplan-Meier curves and to calculate the p-value permutation test.
Step (3) shows the survival prediction to test how well the models generalize across independent cancer datasets.
concept (based on the biological screening and network-based Cox regression methods) for understanding pathways' activity and relationships. Second, we provide a methodological strategy to researchers for the use of network-based Cox regression models in order to turn cancer research results into clinical applications.

Network-Regularized Cox Regression Models
The Cox Proportional hazards model (Cox, 1972) is the most widely used model to describe the relationship between survival times and predictor covariates. Given a sample of n subjects, let T i and C i be the survival time and the censoring time, respectively, for subject i = 1, . . . , n. Let t i = min {T i , C i } be the observed survival time and δ i = I(T i ≤ C i ) the censoring indicator, where I(·) is the indicator function (i.e., δ i = 1 if the survival time is observed and δ i = 0 if the survival time is censored). We denote by X i = (X i1 , . . . , X ip ) ′ the regression vector of p-variables for the ith subject (i.e., the gene expression profile of the ith patient over p genes). The survival time T i and the censoring time C i are assumed to be conditionally independent given X i . Furthermore, the censoring mechanism is assumed to be non-informative. The observed data can be represented by the triplets {(t i , δ i , X i ) , i = 1, ..., n}. The Cox regression method assumes that the hazard function h(t|X i ), which is the risk of death at time t for the ith patient with gene expression profile X i , can be written as where h 0 (t) is the baseline hazard and β = (β 1 , . . . , β p ) ′ is the column vector of the regression parameters.
In the classical setting, the regression coefficients are estimated by maximizing the Cox's log-partial likelihood where t i is the survival time (observed or censored) for the ith patient, R(t i ) is the risk set at time t i (i.e., the set of all patients who still survived prior to time t i ). However, in the analysis of gene expression data, the number of genes p is usually larger than the sample size n and the standard Cox-model cannot be directly applied. To cope with the curse of dimensionality (p ≫ n), a variety of penalization approaches have been proposed for achieving good prediction performance and easy interpretation of the data. Although these regularization methods induce sparsity into the solution by shrinking some estimates to zero, the biological relationship of gene expression profiles is not taken into account. Hence, in order to integrate information from molecular interactions between genes, network-based constrained methods for highdimensional Cox regression have been introduced.
In this context, the regression coefficients are estimated by maximizing the penalized Cox's log-partial likelihood function where P λ (β) is a network-constrained penalty function on the coefficients β. Such penalty function describes the existing relationships among the covariates (genes) specified by a network G = (V, E, W) (weighted and undirected graph), where V = 1, . . . , p is the set of vertices (genes/covariates), an element (i, j) in the edge set E ⊂ V × V indicates a link between vertices i and j and W = (w ij ), (i, j) ∈ E is the set of weights associated with the edges. These weights are usually used to represent the relations between genes in terms of gene-gene interaction, KEGG pathway analysis or PPI. Hence, the network structure plays an important role since it incorporates prior gene regulatory information often ignored.
The three regularized network-based Cox regression models used in our study are presented below and differ in the form of the penalty function P λ (β).

Net-Cox method
Net-Cox regression (Zhang et al., 2013) is an extension of the L 2 -Cox model and uses the following penalty function where λ > 0 and α ∈ (0, 1] are two regularization parameters in the network constraint. and The penalty (3) consists of two terms: the first one is an L 2 -norm of β that regularizes the uncertainty in the network constraint; the second term is a network Laplacian penalty (β) that encourages smoothness among correlated gene in the network and encode prior knowledge from a network. Given a normalized graph weight matrix W, we assume that co-expressed (related) genes are assigned similar coefficients by defining the cost term (β) as reported in Equation (4). (β) can be also written as (β) = β ′ (I − W)β = β ′L β whereL is a positive semi-definite matrix derived from network information (weight matrix W) and I is an identity matrix. Hence, the objective function will result in a significant cost in the network if any pair of genes is connected by an high weight edge and the difference between their coefficients is large.
Note that to identify the signature genes classified by Net-Cox, which is a ridge regression based method, we create a consensus ranking of the relevant cancer genes.

AdaLnet Method
Adaptive Laplacian net (Sun et al., 2014) is a modified version of a network-constrained regularization procedure for fitting linear models and for variable selection Li, 2008, 2010) where the predictors are genomic data with graphical structures. AdaLnet is based on prior gene regulatory network information, represented by an undirected graph for the analysis of gene expression data and survival outcomes.
Denoting with d i = i:(i,j)∈E w ij the degree of vertex i, AdaLnet defines the normalized Laplacian matrix L = (l ij ) of the graph G by Note that L is positive semi definite. The network-constrained penalty in Equation (2) is given by with Equation (6) is composed by two penalty terms. The first one is an L 1 -penalty that induces a sparse solution, the second one is a quadratic Laplacian penalty (β) = β ′L β that imposes smoothness of the parameters β between neighboring vertices in the network. Note thatL = S ′ LS with S = diag(sign(β 1 ), . . . , sign(β p )) andβ = (β 1 , . . . ,β p ) is obtained from a preliminary regression analysis. The scaling of the coefficients β respect to the degree allows the genes with more connections (i.e., the hub genes) to have larger coefficients. Hence, small changes of expression levels of these genes can lead to large changes in the response. An advantage of using penalty (6) consists in representing the case when two neighboring variables have opposite regression coefficient signs, which is reasonable in network-based analysis of gene expression data. Indeed, when a transcription factor (TF) positively regulate gene i and negatively regulate gene j in a certain pathway, the corresponding coefficients will result with opposite sign.
Note that in Net-Cox and AdaLnet, λ is the parameter controlling the weight between the likelihood and the network constraint and α ∈ (0, 1] is the parameter weighting the network constraint.

Fastcox Method
The penalty function of fastcox (Yang and Zou, 2012) computes the solution paths of the elastic net penalized Cox's proportional hazards model (Wu, 2012). In this method the penalty function in Equation (2) is given by where the non-negative weights w allow a more flexible estimation. In particular, setting w j = 0 implies no shrinkage and the variable j will be always included in the final model. Default is 1 for all variables. α ∈ (0, 1] is the elastic net trade off. This regularization technique is a combination of the lasso and ridge penalty that produce a sparse model (given by the L 1 -penalty) with good prediction accuracy, while encouraging a grouping effect. It is worthy to note that this method does not include any gene network information. It has been used in our study to obtain pathways investigation and survival prediction from a relevant method that is simply based on statistical framework.

Tuning Parameters by Five-Fold Cross-Validation
For all the methods, we estimated the regularization parameters using cross-validation. Four-folds of data are used to build a model for validation on the fifth fold, cycling through each of the five-folds in turn. Then, the (λ,α) pair that minimizes the cross-validation log-partial likelihood (CVPL) are chosen as the optimal parameters. CVPL is defined as whereβ (−k) (·) is the estimate obtained from excluding the kth part of the data with a given pair of (λ, α), ℓ(·) is the Cox logpartial likelihood on all the sample and ℓ (−k) (·) is the log-partial likelihood when the kth fold is left out (van Houwelingen et al., 2006).

General Algorithm: A Re-Sampling Method for Survival Prediction
The prediction capabilities of a given method are usually evaluated using a training set to select the markers and a testing set to measure the goodness of the prediction. In several cases training and test sets are obtained splitting a given dataset in two parts. However, findings could be over optimistic depending on the specific split. To further understand the role of the network information in cross-validation and to overcome the drawbacks of investigating only one split, each network-based model was validated with the re-sampling procedure suggested by Simon et al. (2011b). This method is based on a cross-validated estimate of the survival distribution of the risk groups and provide a more efficient use of data than fixed sample splitting (see Figure 2). The steps of the re-sampling algorithm for survival prediction are presented below.
Procedure 1: k-fold Cross-validated Kaplan-Meier survival method 1. The full dataset D is partitioned into K approximately equal parts D 1 , . . . , D K . For each k = 1, . . . , K 2. Set T k = D − D k as the training set and D k as the testing set. 3. Perform network-based Cox regression on T k and select highrisk cancer genes G k . Denote the parameter estimate byβ T k . 4. Calculate the prognostic index (PI) for each patient i k in D k as where x i k is the vector of gene expression value associated to the i k -th patient into the k-fold. Each patient i k in D k is assigned into the high/low-risk group if its prognostic index PI D k i k is above (or below) a fixed threshold PI * ,T k defined adaptively on T k . 5. All the patients classified as low-and-high risk in any of the folds are grouped together and a single Kaplan-Meier curve is computed as the union of the risk groups defined in each fold. The set of predictive genes is selected as the union of G k , for k = 1, . . . , K. 6. Compute the log-rank χ 2 0 statistic under the null hypothesis that survival is independent of expression profile. 7. Calculate a permutation p-value as follows: compute the log-rank χ 2 b statistic using the crossvalidation procedure (1-6), (ii) compute the permutation p-value,p, aŝ For our analysis, the estimateβ T k in step 4 was computed by using five-fold cross-validation (i.e., K = 5) to select the optimal tuning parameter values (λ T k ,α T k ), that we used to fit the corresponding penalized function Pλ T k ,α T k (β T k ) on T k . In particular, we first set α to a sufficiently fine grid of values on [0, 1]. For each fixed α, λ was chosen from {10 −5 , 10 −4 , 10 −3 , 10 −2 , 10 −1 , 1} for Net-Cox, while it was set λ to a decreasing sequence of values λ max to λ min automatically chosen for AdaLnet and fastcox.
In step 5, we selected PI * ,T k as the optimal cut-off in terms of PI D k . By using the PI T k i k , it was possible to split the patients in two subgroups, i.e., high-risk and low-risk prognosis groups. Thus, the patient i k in T k was assigned to the high-risk (or low-risk) group if his prognostic index PI T k i k was above (or below) the quantile selected on a grid of given values that spans from 30 to 70%. The cut-off PI * ,T k was chosen in correspondence to the lowest p-value in a log rank test on this grid.
In step 7, we set M equal to 500.

Survival Analysis
Network-based Cox regression model was used to discover significant variables, i.e., genes, correlated with death risk.
Overall survival (OS) curves were estimated using the Cross Kaplan-Meier estimator and compared using the two-sided log-rank test as implemented in the R package survival. The statistical significance of the log-rank statistic related to the cross-validated Kaplan-Meier curves was obtained through a permutation distribution (Simon et al., 2011b) as described in the previous section. Permutation test was used to test the association between high-risk or low-risk groups and p < 0.05 were considered statistically significant. A simple scheme of the applied procedure for OS estimation is reported in Figure 2. Furthermore, we also validated the predictive performance of the three methods using independent dataset for training and testing. In this context, we used the largest dataset as training set to identify the gene expression signatures (see Figure 2, step 2). Then, the second independent dataset was considered as test set in order to analyze the survival prediction of the models. We used Kaplan-Meier survival curves and log-rank test to perform the analysis (see Figure 2, step 3).

Pathway Analysis
We performed pathway analysis based on KEGG database and on the Human Experimental/Functional Mapper (Huttenhower et al., 2009). In particular, we focused on a gene-gene interaction analysis developing gene-networks that describe the relations between genes in terms of KEGG pathways. Each node in the network represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node indicates how strong is the relationship between the gene and the disease under analysis (ovarian and breast cancer; Huttenhower et al., 2009). The p-value chosen within the interval [0, 0.1] represents the node color intensity. Red color, that is p = 0, means that there is a high significant gene-disease relation, while green color, that is p = 0.1, means that not exist a relevant gene-disease relation.
Gene networks have been computed by considering only the not isolated genes in the intersection between KEGG pathways and the set of genes selected by each method. Given a set of genes G and the set of all the KEGG pathways K, we defined a gene g as not isolated if G ∩ K {g}. Namely, g is not isolated if there is at least another gene g ′ ∈ G belonging to the same pathways of g.

Software
The methodological approach presented in Figure 2 has been implemented as an integrative R script that allows to run the different algorithms under the same R environment. Net-Cox, which is a Matlab toolbox (http://compbio.cs.umn.edu/Net-Cox/), AdaLnet, available as an R code and sent us upon request and fastcox, which is an R package (http://code.google.com/p/ fastcox/) were merged together by using R.matlab, https://cran. r-project.org/web/packages/R.matlab/index.html. The script also includes the implementation of the re-sampling permutation approach (Simon et al., 2011b) and the cross-validation method for parameters estimation. Both simulated and real data can be used to run the script which can be easily adapted for the integration of new Cox models.
For real data analysis, the microarray data were preprocessed using R packages available in Bioconductor. First, we selected from the initial dataset the genes that were more likely to be involved in cancer by using a functional map summarizing the most relevant interactions in the cancer area of interest (Huttenhower et al., 2009). Then, we used HEFaIMp tool (Huttenhower et al., 2009) to build the genes network and identify the weight of the edges between the selected genes. Finally, Net-Cox, AdaLnet, and fastcox were implemented integrating a cross-validation method for selecting the optimal tuning parameters λ and α and a re-sampling based procedure (Simon et al., 2011b), see Procedure 1.
The scripts are available upon request from the first two authors.

Simulation Scheme
We used the three methods in two different simulation settings (Wu and Wang, 2013;Sun et al., 2014) in order to investigate the performances and the properties of the three models and to facilitate the interpretation of results. We considered two scenarios that are likely to be encountered in genomic studies and we simulated gene expression data as network constrained. Both the two settings consist of 100 regulatory networks. Each regulatory network is composed by one transcription factor (TF) that regulates 10 genes resulting in a total of 1100 genes. Detailed settings are given below.

Scenario 1: Not-Overlapped Networks
The first setting simulates a scenario with not-overlapped networks, which means that the 100 regulatory networks are disjoint each other and each gene is linked to only one TF. Under this assumptions, the degree d i of each TF = 10 and d i = 1 for the regulated genes. The edges' weight w ij = 1 between the TFs and their regulated genes, w ij = 0 otherwise. The expression value of each TF was generated from a normal standard distribution. The expression values of the ten regulated genes were generated from a conditional normal distribution with positive correlation (ρ = 0.7) between the expression of five genes and the corresponding TF, and negative correlation (ρ = −0.7) for the remaining five genes. This simulates the activation or repression of each gene under the effect of the corresponding TF. The failure times were generated from the Cox model λ(t|X) = λ 0 (t) exp 88 j=1 β j X j which includes only s = 88 relevant genes (i.e., eight regulatory networks). The baseline hazard function λ 0 (t) was specified by a Weibull distribution with shape parameter 5 and scale parameter 2. Censoring times were generated from U(2, 15) with a censoring rate of about 30%. The sample size was fixed at n = 200 and the simulation were replicated 100 times. In this setting of not-overlapped genes, the coefficients β j , j=1, . . . , 44 were generated from the uniform distribution U(0.1, 1), while β j , j=45, . . . , 88 were generated from U(−1.5, −0.1).
For each of the settings above, we quantified the noise as the mean between the variance of each transcription factor (TF) and the variance of the 10 corresponding regulated genes.

Scenario 2: Overlapped Networks
The second setting simulates a scenario with overlapped networks, where four regulatory networks (i.e., 44 genes) are connected to the other four networks. This mimics the fact that some genes can belong to different pathways regulating different biological processes, as often observed in cancer. For the sake of simplicity, we assume that all the genes (including the TF) in the networks P 3 , P 4 , P 5 , and P 6 are connected to the genes in the remaining four network P 1 , P 2 , P 7 , and P 8 which are maintained disjointed and independent each other. The expression values of the TFs and the regulated genes were generated from a multivariate normal distribution with cov(X i , X j ) = 0.5 |i−j| . The coefficients β j , j = 1, . . . , 22, corresponding to P1 and P2, were generated from the uniform distribution U(0.1, 0.5), the coefficients corresponding to the 44 common genes β j , j = 23, . . . , 66 were generated from U(−0.1, 0.1) and the coefficients β j , j = 67, . . . , 88, corresponding to P 7 and P 8 , were generated from the uniform distribution U(−1, −0.5). Survival times were generated as reported in the first setting with the same censoring rate.

Statistical Measures
The performance of each method is summarized by four measures: sensitivity, specificity, number of genes selected, and the Matthews correlation coefficient (MCC). The sensitivity or true positive rate (TPR) and specificity or true negative rate (TNR) are given by where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. A test with high sensitivity (few false negative) has a low type II error rate, while a test with a high specificity (few false positive) has a low type I error rate. The number of genes selected refers to the genes identified as relevant by each method in the training set. The analysis of these genes gives information on prediction accuracy. The Matthews correlation coefficient (MCC) is defined as The MCC measure is an global measure of accuracy, and a larger MCC indicates a better performance.

Real Data Applications
We applied the three network methods on different real datasets containing large-scale microarray gene expression measurements from ovarian and breast cancer including survival information (see Table 1) in order to facilitate the detection of molecular biomarker and pathway analysis with clinical utility.  Network, 2011). It was obtained at the gene level (level 3) using the Affymetrix human U133A microarray from 578 samples. All patients were diagnosed with high-grade serous carcinoma and were in an advanced stage. We noted that such datasets are very similar in terms of type of patients, platforms, and cancer disease. Therefore, they can be also used for validation.

Breast Datasets
The breast cancer microarray datasets were downloaded from NCBI GEO database as raw .CEL files (Kao: GSE20685 and Desmedt: GSE7390). Gene expression profiling of the first dataset was conducted on fresh frozen breast cancer tissue collected from 327 patients diagnosed and treated between 1991 and 2004 at the Koo Foundation Sun-Yat-Sen Cancer Center. Hybridization targets were prepared from total RNA according to the Affymetrix U133 plus 2.0 platform. The second breast cancer dataset was chosen on gene expression profiling of frozen samples from 198 N-systemically untreated patients at the Bordet Institute. It was based on the Affymetrix U133 platform.

Preprocessing
All the raw files were processed and normalized by RMA package available in Bioconductor (Gentleman et al., 2004). Between arrays normalization was carried out by using the preprocessCore package available in Bioconductor (Gentleman et al., 2004). Survival data (OS, i.e., overall survival), censoring indicator and time to death, for each patients in every dataset were also given (Figure 2, step 1).

Cancer Genes and Related Functional Networks
Following our previous study (Iuliano et al., 2014), in order to better analyze real datasets, we first applied a biologically inspired size reduction of the dataset, then we built an a-priori network information for the type of cancer under investigation (see Figure 2, step 1). For a better focus on genes that are more likely to be relevant in cancer, we selected the high-risk cancer genes using the Human Experimental/Functional Mapper  (Huttenhower et al., 2009), which is based on a regularized Bayesian integration system. This mapper provides a p-value for each gene describing the significance of the relation between the gene and the disease of interest (breast and ovarian cancer, respectively). In our analysis, we selected only the genes with p < 0.05. A summary of the final number of the genes selected from each dataset is reported in Table 2. The network matrices used to test the network-based Cox models in our analysis were also derived from the Human Experimental/Functional Mapper which provides maps describing the genes functional activity and interaction networks in over 200 areas of human cellular biology with information from 30,000 genome-scale experiments. This functional network summarizes information from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders (Huttenhower et al., 2009). The edges of the network are weighted between [0, 1] and express the functional relation between two genes. Note that the functional linkage network includes more information than Human PPI, frequently used as the network prior knowledge. It is clear that taking into account such biological knowledge helps in identifying significant genes that are functionally related in order to obtain important results biologically interpretable. In order to adapt the gene network to the different methods, the final weight matrix was slightly different from method to method. In particular, since AdaLnet requires a weight matrix consisting of 0 and 1, each matrix element was set equal to 0 (or 1) if the weight value was below (or above) a fixed threshold equals to 0.5. On the other hand, Net-Cox uses the original weight matrix as obtained in the original paper (Huttenhower et al., 2009).

RESULTS
In our study, we analyzed three network-based Cox regression methods described in Section Methods both on simulated  Sensitivity, specificity, number of selected genes, false positive rates and MCC were averaged over the 100 replications. The table reports three consensus rankings for Net-Cox obtained selecting 44, 88, and 176 genes. For AdaLnet and fastcox, we show the results related to the general setting, and the statistical measures obtained when the number of selected genes is higher (or lower) that a fixed threshold (threshold was set equal to 100 for AdaLnet and equal to 10 for fastcox). Standard deviation is reported in brackets.Standard deviation is reported in brackets.
and real data. Here, the major interest is the association of genomic features with clinical outcomes under specific scenarios. Simulation studies were based on two different biological scenarios and were introduced to show the performance of the selected network methods. While, real data analysis was performed in order to provide a better understanding of the outcomes in terms of predictive/prognostic biomarkers and to demonstrate their validity and clinical utility. In particular, we first investigated the three methods in terms of survival prediction performances and then, a pathway analysis was carried out focusing on the relevance in cancer of the selected genes.
It is important to note that the goal of this study is not to provide a rank list of the analyzed methods, but to present a accurate study for the identification of new cancer related genes and core pathways in order to make available such information to biomedical community in the form of a comprehensive methodological procedure (see Figure 1).

Simulation Studies
We analyze the performance of the three analyzed methods in two simulation settings where the number of relevant genes is fixed a-priori to 88 genes. The first setting simulates a scenario with not overlapped pathways, which means that each gene in the network belongs to only one pathway (notoverlapped pathways). The second setting represents a more realistic scenario with a set of genes shared among different pathways (overlapped pathways). In both cases, a five-fold cross validation was conducted on the full dataset in order to select the tuning parameters (λ, α) and to obtain the coefficient estimates by using the three methods. The details of the simulation data are reported in Section Methods. The performance of each method is summarized by several statistical measures: sensitivity, specificity, number of selected genes, false positive rates, and Matthews correlation coefficient (MCC). Simulation results for both the models are reported in Tables 3, 4, respectively (standard deviation is reported in brackets). To analyze the signature genes identified by Net-Cox, which is a method based on ridge regression, we considered three different consensus rankings where the number of significant genes selected by the method was fixed to 44, 88, and 176 genes, respectively. The selected genes were classified in descending order according to the absolute value of the regression coefficients. On the other hand, to better highlight the variable selection performance of AdaLnet and fastcox, we split the 100 iterations in two groups based on the number of genes selected at each iteration. We fixed 100 genes as threshold for AdaLnet and 10 genes for fastcox, then we computed again the statistical measures based on the two groups.
In the not-overlapped setting, Net-Cox performed better than the other two methods as showed by the MCC, which provides an overall measure of accuracy. In particular, when considering 44 and 88 genes, the false positive rate in Net-Cox was 22.910 and 44.940, respectively, with MCC equals to 0.300 and 0.445. Sensitivity and specificity were, respectively, 0.240 and 0.977 in the first case, 0.489 and 0.956 in the second case study. When the number of selected genes was increased to 176, even if the false positive rate increased resulting in a lower specificity (0.890), the sensitivity reached its highest values producing the highest MCC (0.464).
Since the majority of the selected genes were irrelevant and both AdaLnet and fastcox resulted in sparse models, specificity was much higher than sensitivity and was comparable between the two variable selection methods. In particular, in the notoverlapped setting, AdaLnet selected in average 249.360 genes with a false positive rate equals to 210.330. Sensitivity and specificity were equal to 0.444 and 0.792 resulting in a MCC of 0.190. On the other hand, fastcox selected in average 42.62 genes with a false positive rate of 30.19. MCC was equal to 0.160 with sensitivity 0.141 and specificity 0.970.
AdaLnet had the best performance when the number of selected genes was below 100, while fastcox exhibit the best performance when the number of genes was above 10. This means that in the other cases the methods fail in the execution of the cross-validation (see Supplementary Image 1).
In the overlapped-pathways setting, Net-Cox obtained the highest MCC overall when considering 88 genes (MCC equals to 0.227) with a false positive rate equals to 62.620, sensitivity 0.288 and specificity 0.938. However, even if the specificity levels of the three consensus rankings were almost equal to the previous setting (specificity for 44, 88, and 176 genes equals to 0.970, 0938, and 0.860, respectively), in this setting Net-Cox sensitivity decreased resulting in lower MCC compared to the not-overlapped case (MCC for 44, 88, and 176 genes equals to 0.175, 0.227, and 0.182, respectively). AdaLnet and fastcox also reported lower MCCs compared to the not-overlapped setting (MCC equals to 0.166 in AdaLnet and 0.134 in fastcox). In particular, both AdaLnet and fastcox showed an higher specificity than before (0.879 and 0.974, respectively) but a lower sensitivity (0.262 and 0.098). Further analysis showed that AdaLnet had the highest MCC when the number of selected genes was below 100 (MCC 0.196), while fastcox had the highest MCC (0.158) when the number of selected genes was above 10, in accordance with the previous results (see Supplementary Image 2).

Real Data Analysis
In order to evaluate the performance of the three Cox models in terms of survival analysis, we used cross-validated Kaplan-Meier curves (Simon et al., 2011b) for overall survival (OS) both on ovarian and breast microarray studies (see Figure 2, step 2). Note that p-value was estimated within the same dataset but the crossvalidation approach is used to correct over optimistic conclusions due to the lack of independence between samples.
Moreover, since the ovarian datasets are comparable in terms of types of patients, platforms and cancer disease, Kaplan-Meier curves and two-side log-rank test were used to estimate the survival time and stratify the low-risk and high-risk groups on the independent test set (see Figure 2, step 3). Table 5 reports the number of genes selected by the three Cox regression methods for each OS and the optimal tuning parameter α. Interestingly, the optimal α was often equal to 0.5, indicating that there was a good balance between statistical constraints and network information. These results confirm that the network carries important information useful for improving survival analysis. Moreover, since Net-Cox is a method based on ridge regression, the genes are only shrunk and it is necessary to fix a threshold for selecting the most relevant cancer genes. Hence, within each fold, we ordered the genes according to the absolute value of the corresponding regression coefficients, then we considered the union of the top 50 genes selected in each fold. FIGURE 3 | Cross-validated Kaplan-Meier curves of the prognostic models on GSE26712 dataset. The patients are divided in high-risk and low-risk groups based on the pathways and genes selected by each methods for overall survival (OS). The survival probabilities of these two groups are compared using the log-rank test by using Net-Cox (A), AdaLnet (B), and fastcox (C).
In the following, we present the main results obtained. Figures 3, 4 show the cross-validated Kaplan-Meier curves for high-and-low risk groups patients selected in the ovarian datasets (Benome: GSE26712 and OV TCGA datasets, respectively).  Figure 3 shows that in the Bonome dataset the gap between the survival curves of the two risk groups in Net-Cox ( Figure 3A) and fastcox ( Figure 3C) is wider compared to AdaLnet (Figure 3B). In particular, in predicting survival probabilities, fastcox (permuted p < 0.05) seem to discriminate the risk groups better than Net-Cox and AdaLnet where the permuted p > 0.05. These findings confirm the results previously obtained in Iuliano et al. (2014), in relation to the survival curves for each method. This was mainly due to the cross-validation approach used in this analysis to overcome the sample splitting problem with too small dataset.

Results on the Ovarian Datasets
On the other hand, in the OV TCGA dataset (Figure 4), the survival curves for high-and-low risk patients are not significantly separated. In particular, fastcox is the only method with a FIGURE 6 | Cross-validated Kaplan-Meier curves of the prognostic models on GSE20685 dataset. The patients are divided in high-risk and low-risk groups based on the pathways and genes selected by each methods for overall survival (OS). The survival probabilities of these two groups are compared using the log-rank test by using Net-Cox (A), AdaLnet (B), and fastcox (C).
significant difference (permuted p < 0.05) in the OS between the high-and-low-risk groups.
Finally, to test the survival prediction across independent datasets, we used the ovarian OV TCGA dataset as training set, and the Benome dataset as the test set to predict the risk scores FIGURE 7 | Cross-validated Kaplan-Meier curves of the prognostic models on GSE7390 dataset. The patients are divided in high-risk and low-risk groups based on the pathways and genes selected by each methods for overall survival (OS). The survival probabilities of these two groups are compared using the log-rank test by using Net-Cox (A), AdaLnet (B), and fastcox (C).
of the patients (see Figure 2, step 3). Figure 5 shows the Kaplan-Meier curves for the two risk groups (high-and-low risk groups) in the Benome dataset obtained by Net-Cox (Figure 5A), AdaLnet (Figure 5B), and fastcox ( Figure 5C). All the three methods gave a significant p-value at the 5% significance level (log-rank test, p < 0.05). Figures 6, 7 show the cross-validated Kaplan-Meier curves for high-and-low risk groups patients selected in the breast datasets (Kao: GSE20685 and Desmedt: GSE7390, respectively). In the Kao dataset, the permuted p-value related to Figure 6A (Net- Cox) and Figure 6C (fastcox) was smaller than 0.05, which means the high-risk and low-risk groups were significantly separated and the selected pathways and genes were related to survival times. In Figure 6B (AdaLnet), a patient of the high-risk group fell in the low-risk group and the permuted p-value is not significant.

Results on Breast Datasets
We performed the same analysis for high-and-low risk patients in the Desmedt dataset. Also in this case, there was a significant difference in OS between the two risk groups as shown in Figure 7A (Net-Cox) and Figure 7C (fastcox) where the permuted p-value is smaller than 0.05. In Figure 7B (AdaLnet) the permuted p-value is not significant.

Identified Pathways
In this section, we present the results of the analysis in terms of KEGG pathways analysis based only on not-isolated genes (see section Methods for details). We report here only the networks related to AdaLnet and Net-Cox since all the networks related to fastcox have more than 100 node and 2000 edges and a clear visualization would not be possible. However, the lists of the genes selected by fastcox and the related pathways are reported in Supplementary  Figures 8, 9 show the gene-networks obtained for the Bonome dataset (GSE26712) built on the genes identified by Net-Cox and Adalnet, respectively. From the color of the nodes, we can infer that all the selected genes have a significant relation with ovarian cancer. Indeed, almost all the genes are close to red except for AKT3 which has a p-value correlation equal to 0.039. Indeed, AKT3 is usually involved in prostate and breast cancer (Nakatani et al., 1999). However, since it was selected both by Net-Cox and fastcox, a possible significant relation between AKT3 and ovarian cancer could be inferred as indeed confirmed by literature (Liby et al., 2012). In particular, AKT3 has a specific role in the genesis of ovarian cancer through modulation of G2-M phase transition (Cristiano et al., 2006). As showed in Figure 8, AKT3 is also involved in many cancer pathways, such as KEGG basal cell carcinoma, KEGG prostate cancer, and KEGG melangiogenesis. It is worthy to note that this gene was also selected in our previous study (Iuliano et al., 2014) by all the analyzed methods and it was also involved in the same cancer related pathways. These findings confirm the importance of AKT3 in ovarian cancer as confirmed indeed by literature (Cristiano et al., 2006).
FIGURE 8 | Gene-network of not isolated genes selected by Net-Cox in the Bonome ovarian dataset (GSE26712). Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and ovarian cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green.
FIGURE 9 | Gene-network of not isolated genes selected by Adalnet in the Bonome ovarian dataset (GSE26712). Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and ovarian cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green. Figures 10, 11 show the gene-networks obtained for the OV TCGA ovarian dataset built on the genes identified by Net-Cox and Adalnet, respectively. As already observed in the Bonome dataset analysis, all the selected genes in the OV TCGA dataset resulted strongly correlated with ovarian cancer. Indeed, almost all the genes are close to red. The only gene with a slightly different color is FZD3 which has a p-value of 0.049 and was selected by all the three methods. Hence, even if this gene has been mainly classified as gastric-cancer-related (Katoh, 2005), our results prove that it also has a relevant effect in ovarian cancer as confirmed by literature (Tapper et al., 2001). It is also important to note that other genes have been selected by all the three methods (i.e., GMPR, ENPP1, and APC). Such genes have been already classified as ovarian-related in cancer literature (Gayther et al., 1997;Kikuchi et al., 2007;Rikova et al., 2007), but, in our analysis, the pathways involved in such relation are also investigated. For example, while GMPR and ENPP1 interact simply through the KEGG purine metabolism pathway, the APC-FZD3 interaction involves three different pathways: KEGG basal carcinoma, KEGG pathways in cancer, and KEGG wnt signaling pathway.
It is worthy to note that some of the genes selected by the three methods (e.g., NPY, COL5A1, EGFR, and FBL1) have been already reported in literature (Zhang et al., 2013) where an analysis of subnetwork signatures in ovarian cancer based on Cox model is presented. Moreover, our approach selected new genes, such as AKT3 and RB1, which are also related to ovarian cancer (Flesken-Nikitin et al., 2003;Cristiano et al., 2006). These results show that our findings are consistent with the previous ones including, at the same time, other gene signatures. Figures 12, 13 report the gene-networks selected in the Kao dataset (GSE20685) by Net-Cox and Adalnet, respectively. FGFR2 and BCL2 were again selected in this dataset confirming the strong relevance of the two genes in breast cancer. Moreover, FIGURE 10 | Gene-network of not isolated genes selected by Net-Cox in the TCGA ovarian dataset. Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and ovarian cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green.
FIGURE 11 | Gene-network of not isolated genes selected by Adalnet in the TCGA ovarian dataset. Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and ovarian cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green.
BRCA2 (Wooster et al., 1995) was selected by Net-Cox and fastcox confirming the accuracy of our analysis. It is also worthy to note that in all the breast cancer gene-networks the KEGG prostate cancer is always recurrent. This is mainly due to the common biomarkers between the two diseases (Yang et al., 1998;Mattie et al., 2006) and through our analysis new common biomarkers can be identified.
In the Desmedt dataset (GSE7390), all the genes selected by Adalnet were isolated and no network was built in this case. A list of the genes selected is reported in Table 6. Figure 14 FIGURE 12 | Gene-network of not isolated genes selected by Net-Cox in the GSE20685 breast dataset. Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and breast cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green.
FIGURE 13 | Gene-network of not isolated genes selected by Adalnet in the GSE20685 breast dataset. Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and breast cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green.
reports the gene-network related to the genes selected by Net-Cox. All the selected genes show a strong relation with the disease, such as FGFR2 and BCL2, which were selected by both Net-Cox and fastcox and are involved in KEGG prostate cancer and in KEGG pathways in cancer. Both the genes are largely known as independent prognostic marker in breast cancer (Hunter et al., 2007;Thomadaki et al., 2007;Callagy et al., 2008). Both Net-Cox and fastcox selected UGT2B15, which has a breast-cancer-correlation p = 0.049. This gene has been usually involved in prostate cancer (Gsur et al., 2002), but recent works highlight its role also in breast cancer (Wegman et al., 2007).
In the analysis of the breast datasets, there was no overlap with our previous study (Iuliano et al., 2014). This was mainly due to the different datasets analyzed here potentially (different cancer subtype and different types of conditions) and to the more sophisticated procedures followed in this analysis. Indeed, in our previous work, we split the dataset in training and test set only FIGURE 14 | Gene-network of not isolated genes selected by Net-Cox in the GSE7390 breast dataset. Each node represents a gene and an edge between two nodes means that the two genes belongs to the same pathway. Different colors are used for different pathways. The color of each node represents the p-value of the interaction between the gene and breast cancer (Huttenhower et al., 2009). Genes with p > 0.10 are represented in green. The second column reports the breast-cancer correlation p-value of each gene accordingly (Huttenhower et al., 2009). All the selected genes resulted isolated and no network was built in this case.
once, while here we used a cross-validation procedure that is expected more robust results.

DISCUSSION AND CONCLUSIONS
A key issue in cancer survival analysis is uncovering the relation between gene expression profiles and cancer patients survival in order to identify biomarkers for disease diagnosis and treatment. In the last years, there has been a growing interest in methods that incorporate network information into classification algorithms for genes signature discovery. The main aims are to identify molecular biomarkers that reliably predict patient's response to therapy and to avoid ineffective treatment for reducing drug side-effects and associated costs. For this purpose, prognostic and diagnostic biomarker signatures need to be derived from omics data for various disease entities in order to offer useful methodological and practical strategy in research and clinical settings.
Here, we presented an extended methodological strategy for the analysis of gene signatures and survival prediction (see Figure 1). We integrated a new cross-validation method (Simon et al., 2011b) with the most recent network penalized Cox models (Yang and Zou, 2012;Zhang et al., 2013;Sun et al., 2014) to obtain an effective multi-splitting of the data and achieve an accurate survival prediction (see Figure 2). The analysis of the models was based both on simulated and real datasets in order to provide an accurate analysis in terms of statistical and biological investigation. Indeed, we showed that, given a number of variables not extremely high, all the analyzed methods were able to select the altered genes under different simulation settings. On the other hand, the analysis on real cancer datasets showed that through the integration of network information into Cox regression methods it is possible to identify cancer gene signatures with an accurate prognostic performance. Therefore, the contribution of this study is two-fold. Firstly, to obtain an integrative analysis of cancer genes networks and survival prediction. Secondly, to provide a computational and methodological framework for better investigating cancers regulatory networks and facilitating the management of patients in terms of prognosis, diagnosis and treatment.
The findings of this study have a number of important implications for future practice. Firstly, a practically appealing study based on a fast screening procedure (Fan and Lv, 2008;Fan et al., 2010) could be introduced in order to reduce the size of the feature space to a moderate scale. In fact, several types of screening procedures could be combined to integrate biological information into statistical screening analysis and provide more definitive understanding of the gene-regulatory networks. Secondly, the integration of clinical information and data from different omics (e.g., epigenomics or metabolomics) into the screening procedure could also provide a more accurate investigation and prevent the drawbacks of the current methods. Moreover, a more accurate biomarkers investigation could be performed using a number of high-quality binary PPIs available in literature (Rolland et al., 2014) where a proteome-scale map of the human binary interactome is compared to alternative network maps in order to give a deeper insight into genotype-phenotype relationships. Finally, it will be necessary to develop an user-friendly interface to turn this methodological framework into a practical tool.

AUTHOR CONTRIBUTIONS
AI and AO are joint first authors and both authors contributed equally. AI and AO prepared the computational codes and carried out all of the statistical analysis. CA, ID, and PL initiated and coordinated the work, guided the study design, supervised all data curation and analysis, and finalized all study conclusion. CA, ID, and PL are equal contributors. All the authors wrote, reviewed and approved the final manuscript.