Pathway Instability Is an Effective New Mutation-Based Type of Cancer Biomarkers

DNA mutations play a crucial role in cancer development and progression. Mutation profiles vary dramatically in different cancer types and between individual tumors. Mutations of several individual genes are known as reliable cancer biomarkers, although the number of such genes is tiny and does not enable differential diagnostics for most of the cancers. We report here a technique enabling dramatically increased efficiency of cancer biomarkers development using DNA mutations data. It includes a quantitative metric termed Pathway instability (PI) based on mutations enrichment of intracellular molecular pathways. This method was tested on 5,956 tumor mutation profiles of 15 cancer types from The Cancer Genome Atlas (TCGA) project. Totally, we screened 2,316,670 mutations in 19,872 genes and 1,748 molecular pathways. Our results demonstrated considerable advantage of pathway-based mutation biomarkers over individual gene mutation profiles, as reflected by more than two orders of magnitude greater numbers by high-quality [ROC area-under-curve (AUC)>0.75] biomarkers. For example, the number of such high-quality mutational biomarkers distinguishing between different cancer types was only six for the individual gene mutations, and already 660 for the pathway-based biomarkers. These results evidence that PI value can be used as a new generation of complex cancer biomarkers significantly outperforming the existing gene mutation biomarkers.


INTRODUCTION
Cancer is a multifactorial disease which is conditioned by alterations arising from biological, chemical, radiological impacts, as well as inherited. Tumor transformation is characterized by frequent accumulation of genetic mutations (1). The pivotal initiating role here belongs to DNA damage and genome instability (2,3). The resulting combinations of gene mutations driving cancer development vary dramatically among different cancers types and individual patients (4).
Recently, high throughput studies of cancer genomes were initiated to identify mutation enrichment specific for the different cancer types. For example, the large scale projects like Wellcome Trust Sanger Institute's Cancer Genome Project, the International Cancer Genome Consortium (ICGC), The Cancer Genome Atlas (TCGA) showed very high molecular heterogeneity of cancer, not only between different cancer types, but also among the individual tumors of the same type (5)(6)(7)(8). This allowed to considerably advance current understanding of carcinogenetic molecular mechanisms by documenting complete or near-complete landscapes of pathological somatic mutations including base substitutions and gene fusions. Many of the alterations revealed appeared promising for molecular cancer diagnostics in order to improve and personalize the treatment regimens (9,10).
Identification of informative and robust genetic markers of cancer is one of the major tasks of the contemporary biomedicine. Many reports have been published featuring cancer-specific mutations and gene fusions, as well as epigenetic alterations (11)(12)(13). Some of them are already widely used in clinical practice as the biomarkers, but the problem of finding new relevant and informative cancer markers with higher sensitivity and specificity is largely unsolved (10,14). Further accumulation of cancer type-and condition-specific biomarkers can be a key to a more effective, personalized treatment (15).
Despite recent success in high throughput analysis of molecular basis for cancer transformation, traditionally the focus is being made on the roles of the individual genes (16,17). However, this approach cannot always explain tumor development in a comprehensive way. Apparently, this is most probably due to the mode of gene functioning as the nodes of molecular pathways, where roles of individual genes are highly interconnected and frequently interchangeable (18).
Previously, the analysis of molecular pathways at the level of gene expression was successfully applied for cancer investigations (19)(20)(21). Several approaches for measuring molecular pathway activities were proposed for the expression data at both mRNA, protein and microRNA levels (22)(23)(24)(25)(26)(27). The extent of pathway activation, so called pathway activation strength (PAS), is a cumulative value aggregating relative expression levels of the enclosed gene products in relation with their functional roles in a pathway (28).
Interestingly, for most of the cancer types investigated the molecular pathways were shown to be stronger expression biomarkers of cancer than the individual genes (29). PAS was also proven to be more stable and experimental platformindependent metric than the individual gene expression patterns (30). This property appeared to be fundamental and linked with the ability to aggregate individual gene expression levels, thus decreasing experimental errors, as modeled in a recent investigation (25). PAS biomarkers are also used for predicting efficiencies of target cancer drugs in the ongoing clinical trials (24). However, to our knowledge, this type of quantitative pathway approach was never applied before for the mutation data, including human cancers.
Here, for the first time, we propose a new type of molecular biomarkers based on DNA mutation impacts on the molecular pathways. We introduced a quantitative metric termed Pathway instability (PI) proportionate to the relative number of mutated genes in a pathway and developed a specific bioinformatic algorithm for quantization thereof. Using high throughput gene mutation profiles, we identified affected molecular pathways that specifically define the major human cancer types. We took cancer somatic mutation data published in the TCGA project for 5,956 patients representing 15 different cancer types. Totally, we screened 2,316,670 mutations in 19,872 genes and 1,748 molecular pathways. The robustness of mutationbased molecular pathway approach dramatically exceeded that for the individual gene biomarkers. This trend was also reproducible when only truncating mutations were considered for PI calculations, thus confirming consistency of the new method. Finally, we provide a list of 660 novel robust cancer type-specific pathway mutation biomarkers.

DNA Mutation Data
The source DNA mutation dataset was obtained from the database of COSMIC (the Catalog Of Somatic Mutations In Cancer) project (31). We downloaded the verified somatic mutations data from COSMIC website, database version 76 (32). The initial dataset contained 6,651,236 mutation records for 20,528 genes from 19,434 individual tumor samples of 37 primary localizations. For statistical consistence, we took only those tumor localizations having at least 100 tumor samples profiled during The Cancer Genome Atlas (TCGA) project (33). The TCGA mutation profiles were selected because they represented the largest collection of uniformly treated biosamples profiled using the same deep sequencing platforms (34). Totally, we analyzed 2,316,670 mutations in 5,956 tumor genetic profiles corresponding to 15 primary localizations: breast, central nervous system, cervical, endometrium, ovaries, prostate, kidney, urinary tract, liver, hematopoietic and lymphoid tissue, stomach, large intestine, lung, thyroid, and skin (Supplementary Table 1). The database accession numbers of 5,956 samples are given in Supplementary Table 3.
In parallel, for the additional analysis we selected a fraction of gene truncating mutations that possibly lead to the loss of gene function. We meant truncating mutations as those having the following labels in COSMIC description: "Deletion-Frameshift, " "Insertion-Frameshift, " "Complex-frameshift, " "Substitution-Nonsense." Totally, we selected 161,760 truncating mutations in 5,297 tumor samples corresponding to the same 15 cancer types (Supplementary Table 2). The database accession numbers of 5,297 samples are given in Supplementary Table 4.

Molecular Pathways
The structures of 3,121 molecular pathways were taken from the following public databases: Reactome (35), NCI Pathway Interaction Database (36), Kyoto Encyclopedia of Genes and Genomes (37), HumanCyc (38), Biocarta (39), Qiagen (40). For all the pathways, the gene contents were extracted and cataloged. For further analyses, we pre-selected 1,748 molecular pathways each including at least 10 gene products.

Principal Component Analysis
The Principal component analysis (PCA) was performed with package made 4 in R.

RESULTS
In this study, we applied quantitative molecular pathway approach to human cancer DNA mutation data. We developed algorithm for quantization of mutational impact on molecular pathways and applied it for screening of 1,748 human pathways. The DNA mutation data was extracted from the Catalog Of Somatic Mutations In Cancer (COSMIC) database (32). For the reasons of statistical significance, we analyzed here only the tumor localizations having at least 100 complete exome sequencing-profiled tumor samples, totally fifteen primary localizations and 5,956 individual tumor specimens (Supplementary Table 1). For parallel analysis, we also selected a subset of 161,760 truncating mutations for 5,297 tumor samples of the same cancer types (Supplementary Table 2).

Pathway Instability (PI) Scoring
Pathway instability (PI) scoring for a molecular pathway depends on the mutation frequencies in the genes participating in this pathway. To assess mutation burden of the individual genes, we introduced Mutation rate (MR) value calculated according to the formula: where MR n is the Mutation rate of a gene n; N mut(n,g) is the total number of mutations identified for a gene n in a group of samples g; N samples (g) is the number of samples in a group g. However, the MR values strongly positively correlated with the lengths of gene coding DNA sequences (CDS)s, Spearman correlation was 0.798 for all mutations and 0.629 for truncating mutations, p < 2.2e-16 in both cases (Supplementary Images 1A,C), most probably because larger genes had higher probabilities to accumulate mutations.
To avoid the bias linked with the CDS lengths, we next introduced a Normalized mutation rate (nMR) value expressed by the formula: where nMR n is the Normalized mutation rate of a gene n; MR n is the Mutation rate of a gene n; Length CDS(n) is the length of CDS of a gene n in nucleotides. In contrast to the previous metric, nMR did not correlate with the size of CDS for the respective genes (Supplementary Images 1B,D; rank correlation 0.151 for all mutations and 0.024 for truncating mutations). nMR scores calculated using all mutations for each tumor sample are listed in Supplementary Data Sheet 1, nMR scores calculated based on truncating mutations only are listed in Supplementary Data Sheet 2.
We next calculated Pathway instability (PI) scores for every pathway to estimate their relative enrichments by cancer-specific mutations. Pathway instability is expressed by the formula: where PI p is Pathway instability score for a pathway p; nMR n is the Normalized mutation rate of gene n; PG p.n is pathwaygene indicator that equals to one if gene n belongs to pathway p, or equals to zero if gene n doesn't belong to pathway p; N ptotal number of gene products that belong to pathway p. Unlike previous metrics for pathway activation scoring based on gene expression data, the current equation for PI calculation doesn't have coefficients defining activator or repressor molecular roles of genes participating in a molecular pathway under investigation (28). This has been done because for most of mutations their functional roles (neutral, repressing or activating gene function) remain unclear. Pathway instability (PI) (calculation also doesn't utilize logarithmation of nMR scores during summation because the number of mutations in tumor is not lower than in the reference normal tissue.  (Figures 1A,B). The PCA revealed no significant differences for the different cancer types both at the gene-based and pathway-based levels, although pathway-based approach covered most of the variation by the first component, unlike gene-based level of data analysis, where dramatically lower proportion of variation was covered. Similar figure was seen for the fraction of truncating mutations, where the variation of first principal component did not exceed 1.85% for genes and 19.77% for pathways (Figures 1C,D). The most likely reason for lack of tumor type-specific clustering may be the redundancy of features at high sparsity of the mutation data.

Cancer Type-Specific Pathway Instability (PI) Mutation Signatures
However, many molecular pathways had characteristic PI scores that were clearly distinctive of the different tumor types, as shown by the high area under the ROC curve (AUC) values. The AUC value is the universal biomarker robustness characteristics depending on its sensitivity and specificity (42). It varies from 0.5 till 1 and positively correlates with the quality of a biomarker. The AUC discrimination threshold is typically 0.7 or 0.75. The parameters with greater AUC are considered good-quality biomarkers, and vice-versa (43). This statistical approach is also applicable to mutation research in human cancer (44)(45)(46). We performed the ROC AUC test in two ways: (i) for comparing every separately taken tumor type (localization) vs. all other tumors, and (ii) for all possible pairwise comparisons among the tumor types.
In parallel, the same AUC tests were performed also for every gene nMR characteristics of every sample. The tests were performed for all mutations and in parallel-for only truncating mutations (Supplementary Tables 5, 6, respectively). The data analysis pipeline is schematized on Figure 2. In this way, we could compare the biomarker potentials of the individual gene mutations (nMR) with the aggregated pathway-based mutation characteristics (PI).
Our analysis revealed a dramatic advantage of the pathway based (PI) compared to gene based (nMR) approach in finding good quality biomarkers in all types of the comparisons made. For example, for the analyzes when one tumor localization was compared against fourteen others, the total number of good quality (AUC>0.75) biomarkers was 660 for all mutations and 21 for truncating mutations for the pathways (PI), compared to only six for all mutations and one for truncating mutations for the individual genes (nMR). Similarly, for the pairwise comparisons we identified totally 32,594 good quality PI biomarkers vs. only 226 nMR biomarkers for all mutations (Figure 2A). For the truncating mutations, we discovered in pairwise comparisons 1,056 good quality PI biomarkers vs. only 24 nMR biomarkers ( Figure 2B). Provided that the initial number of potential pathway biomarkers (1748) was one order of magnitude lower than the number of gene biomarkers (19,872 for all mutations and 16,760 for truncating mutations), this further strengthens the advantage of a PI-based approach.

Cancer Type Specific Biomarkers
For the dataset of all mutations, the cancer type-specific six gene mutation biomarkers were APC for colorectal cancer, PTEN for endometrial cancer, BRAF for thyroid cancer and MUC16, DNAH5, TTN for cutaneous melanoma. These genes were previously linked with the respective cancer types in the literature (47)(48)(49)(50), but the overall number of six biomarkers may seem negligible provided they were obtained for fifteen comparisons (Figure 2A). In contrast, the pathway approach returned here as much as 660 reliable biomarkers representing 428 pathways. Different localizations had markedly different numbers of marker pathways (Figure 3A). Gene and pathway mutation biomarkers could be found for four and eight tumor localizations, respectively. In the case of truncating mutations, we found gene and pathway biomarkers for colorectal cancer only (Figures 3B,D).
For the first time, we provide here the list of tumor type-specific pathway biomarkers based on all mutation for eight localizations investigated: colorectal, kidney, non-small cell lung, prostate, thyroid cancers, hematological malignancies, cutaneous melanoma and uterine corpus endometrial carcinoma (Supplementary Table 7). The list of colorectal cancer specific pathway biomarkers obtained using truncating mutations is shown on Supplementary Table 8.
Despite the large number of good-quality pathway biomarkers (Figures 3A,C), all of them were applicable only for eight localizations of the fifteen totally investigated. Colorectal cancer and endometrial carcinoma had maximum number of pathwaybased biomarkers. It should be noted that characteristic tumor type-specific PI scores could be either higher or lower than the average values for other cancer types, thus resulting in "high" or "low" biomarkers ( Figure 3A).
Using total pool of mutations, we next identified molecular pathways that were frequently mutated in all the cancer types under study. To this end, we selected 1,145 pathways having AUC<0.7 in all tumor types (Supplementary Table 9) and intersected them with the list of top 10% pathways sorted according to the average PI values. The selected short list contained 18 pathways that were most frequently mutated in all cancer types ( Table 1).
On the other hand, we also screened for the pathways that were most informative as biomarkers (AUC>0.75) for the maximum number of cancer types. Top 25 most informative biomarker pathways are shown on Table 2.

Pairwise Comparison Biomarkers
The number of high quality biomarkers identified in pairwise comparisons can be characteristic of tumor mutational landscapes and their relative similarities. For example, when two cancer types under comparison have little or no specific biomarkers, this suggests small differences in their mutation profiles. In contrast, high number of biomarkers would mean more distinct mutation profiles. Based on the numbers of high quality ROC AUC biomarkers, a distance matrix can be created for all the tumor localizations under comparison, and a clustering dendrogram can be built. In this study, we used pairwise comparisons to analyze common features and clustering of 15 tumor types. The distance matrix was built separately for the gene (nMR) and the pathway (PI) mutation biomarkers ( Figure 4A and Supplementary Image 2A for all and truncating mutations, respectively).
For all mutations, in a series of pairwise comparisons we found that the number of biomarkers that distinguish between the two cancer types varies greatly depending on the localizations compared ( Figure 4A). The number of PI roughly two orders of magnitude per instance exceeded the number of nMR biomarkers ( Figure 4A).
There was also an overall correlation between these numbers (correlation 0.69, p-value = 3.677e-16), for example, all fourteen comparisons having no good biomarkers according to PI values, also had no nMR biomarkers there (Figure 4A). The numbers of biomarkers differed for PI up to 784 and for nMR-only up to 5 per instance. Provided obviously higher number of the effective biomarkers, the pathway-based approach can be regarded beneficial and much more informative than the genebased mutation analysis.  For the truncating mutations, we found that the number of respective pathway biomarkers was also significantly higher (up to 43 pathway biomarkers vs. only one gene biomarker), but they were identified in a smaller group of cancer types (Supplementary Image 2A). Nevertheless, the correlation between gene and pathway pairwise comparisons was 0.634, p-value = 3.62e-13.

Clustering Dendrograms of Cancer Localizations
Taking number of molecular pathways with significant AUC as a metric of the proximity between cancer types, we built clustering dendrograms for the 15 investigated cancer types. The dendrograms generated for the different cancer types differed considerably for the nMR and PI data. We focused on the results based on all mutations, because the dendrograms built using only truncating mutations were not biologically informative (Supplementary Images 2B,C).
First of all, the nMR-based tree had lower number of major clades (three vs. four for the PI data tree). Second, distances between the cancer localizations were more degenerated on the nMR tree (Figures 4B,C). These features may reflect the approximately two orders of magnitude lower numbers of biomarkers used to construct dendrograms identified for the nMR data. Clade compositions were largely similar between both types of tree, but the above considerations suggest in favor of using pathway-rather than gene-specific tree based on mutation data.
Interestingly, positions on the clades of all the dendrograms were not linked with the anatomical proximities of the respective localizations in human body. This suggests that accumulations of characteristic mutations in cancers are following complex mechanisms that are not yet completely understood.

DISCUSSION
Bioinformatic approaches based on measuring of molecular pathway activation were efficient in finding biomarkers using high throughput proteomics (25), mRNA (27,51), microRNA (26) and even transcription factor binding site data (52). Here, we applied molecular pathway scoring approach to large scale mutation data. For the first time, we developed reliable universal method of measuring mutation enrichment of molecular pathways. It should be noted that the idea of collapsing mutation data has been already reflected in several previous studies. For example, bioinformatic tool BioBin overcomes sparsity of data by combining mutations into bins at the levels of molecular pathways, protein families, evolutionary conversed regions and regulatory regions (53,54). An alternative approach for the same has been provided by the method Network regularized sparse non-negative TRI matrix factorization for PATHway identification using known molecular pathways and gene interaction networks (55). Unlike previous methods, our approach focuses on generation of universal parameters that objectively assess the mutation load of a molecular pathway.
Previous approaches evaluated mutation load only based on presence or absence of a mutation, not considering number of gene products-pathway participants and lengths of their DNA coding sequences, which hindered accurate comparison of pathways. These major problems were addressed by the current PI approach, which provides clear, simple and reliable universal measure of mutation burden of molecular pathways. We adopted this method for finding mutation biomarkers of cancers. On the example of 5,956 exome sequencing profiles of different cancer patients we showed at least two order of magnitude superior performance of the pathway instability scoring compared to the single gene-based approach.
In the current approach, we did not classify the effects of the different mutations on the pathway activities because only a minor fraction of the mutations identified has been experimentally characterized in terms of its impact on the protein and pathway functionalities. However, further accumulation of these data on a high throughput basis will make it possible to improve the PI calculation by adding the specific coefficients reflecting effect of every individual mutation on the respective protein. To assess stability of PI scoring, we also tested a version of this method considering only a minor fraction of truncating mutations that most likely led to the loss of gene function. Truncating mutations have demonstrated the same trends as the total fraction of mutations, thus confirming PI scoring robustness.
This method can be easily translated to comparisons of every sets of human exome or complete genome data. To this end, for every sample, mutations should be identified for the genes participating in the molecular pathways under investigation (1,748 pathways including 8,543 genes in this study). Pathway instability (PI) scores are then calculated showing relative mutation burden of each pathway in every biosample. These findings can be valuable per se for better understanding of the individual mechanisms of carcinogenesis. Furthermore, ROC AUC test can be next applied to the PI data to identify reliable biomarkers of the sample groups under comparison. All these procedures can be done by using publicly available bioinformatic tools, and the gene compositions of 1,748 molecular pathways required for PI calculation are available in the respective databases (35)(36)(37)(38)(39)(40).
Taken together, our data suggest that in addition to better understanding of fundamental tissue-specific mechanisms of carcinogenesis, molecular pathway approach can be beneficial in finding reliable tumor type-specific biomarkers for identification of tumor origin in the low-or nondifferentiated tumor histotypes. Pathway instability (PI) mutation data can be used as the additional criteria for differential diagnostics in oncology. Tumor relationship based on the pathway specific genetic signatures such as those shown on Figure 4C may help optimize design of the basket clinical trials. This may be also beneficial to help to predict common patterns in response to drugs and different treatment regimens.
Although a major focus of this study was made on the specific deviations in PI between the cancer types, we could also identify molecular pathways that were strongly mutated in all the cancer types, as previously predicted in the literature (19,21).
Finally, we provide the list of 660 marker molecular pathways that distinguish between the major human cancer types. This list can be helpful for better understanding molecular grounds of carcinogenesis and for further investigations in molecular oncology and drug development.

AUTHOR CONTRIBUTIONS
MZ developed algorithms, did mutation/Pathway instability analyses, planned the research, and wrote the manuscript. SR planned the research, extracted, and filtered cancer mutation data. MS completed the molecular pathway database. NB developed algorithms, did statistical analyses, and planned the research. AB developed algorithms, planned the research, and wrote the manuscript.

ACKNOWLEDGMENTS
We cordially thank Prof. Donald Geman (Johns Hopkins University, USA) for fruitful discussion and useful suggestions on the algorithm development and design of this study. The sheet "Pathway biomarkers" contains AUC values and AUC 95% confidence intervals for pathways having AUC > 0.75 in at least one of the comparisons. The sheet "Gene biomarkers" includes AUC values and AUC 95% confidence intervals for genes having AUC > 0.75 in at least one of the comparisons.

SUPPLEMENTARY MATERIAL
Supplementary Table 6 | AUC values for comparisons of every separately taken tumor type (localization) versus all other tumor types, calculated for truncating mutations. The sheet "Pathway biomarkers" contains AUC values and AUC 95% confidence intervals for pathways having AUC > 0.75 in at least one of the comparisons. The sheet "Gene biomarkers" includes AUC values and AUC 95% confidence intervals for genes having AUC > 0.75 in at least one of the comparisons. Table 7 | List of tumor type-specific good-quality pathway biomarkers for eight tumor types, obtained using all mutations data. Every sheet corresponds to the primary localization of tumors and contains pathway names, AUC values and their 95% confidence interval (CI), mark about PI values of this tumor type relative to others, average PI scores for all cancer types, average PI for this cancer type. TCGA abbreviations of tumor types were used: uterine corpus endometrial carcinoma -UCEC, acute myeloid leukemia -LAML, kidney renal papillary cell carcinoma -KIRP, kidney renal clear cell carcinoma -KIRC, colorectal cancer -COADREAD, lung adenocarcinoma -LUAD, lung squamous cell carcinoma -LUSC, prostate adenocarcinoma -PRAD, skin cutaneous melanoma -SKCM, thyroid carcinoma -THCA. The data about all mutations was used.  Supplementary Image 2 | (A) Data matrix of high quality (AUC > 0.75) biomarkers for pairwise comparisons between different cancer types based on truncating mutations only. The cancer types are abbreviated as follows: breast invasive carcinoma -BRCA, brain lower grade glioma -LGG, glioblastoma multiforme -GBM, cervical squamous cell carcinoma and endocervical adenocarcinoma -CESC, uterine corpus endometrial carcinoma -UCEC, acute myeloid leukemia -LAML, kidney renal papillary cell carcinoma -KIRP, kidney renal clear cell carcinoma -KIRC, colorectal cancer -COADREAD, liver cancer -LICA, liver hepatocellular carcinoma -LIHC, lung adenocarcinoma -LUAD, lung squamous cell carcinoma -LUSC, ovarian serous cystadenocarcinoma -OV, prostate adenocarcinoma -PRAD, skin cutaneous melanoma -SKCM, stomach adenocarcinoma -STAD, thyroid carcinoma -THCA, bladder urothelial carcinoma -BLCA. The lower triangle shows numbers of good biomarkers for pathway-based data (PI); the upper triangle -for individual gene-based mutation data (nMR). (B) Clustering dendrogram built for the fifteen cancer types based on mutation biomarker (PI) data for truncating mutations. Number of biomarkers was used as the distance metric. (C) Clustering dendrogram built for the above fifteen cancer types using mutation biomarker (nMR) data for truncating mutations. Number of biomarkers was used as the distance metric.