A Network-Centric Framework for the Evaluation of Mutual Exclusivity Tests on Cancer Drivers

One of the key concepts employed in cancer driver gene identification is that of mutual exclusivity (ME); a driver mutation is less likely to occur in case of an earlier mutation that has common functionality in the same molecular pathway. Several ME tests have been proposed recently, however the current protocols to evaluate ME tests have two main limitations. Firstly the evaluations are mostly with respect to simulated data and secondly the evaluation metrics lack a network-centric view. The latter is especially crucial as the notion of common functionality can be achieved through searching for interaction patterns in relevant networks. We propose a network-centric framework to evaluate the pairwise significances found by statistical ME tests. It has three main components. The first component consists of metrics employed in the network-centric ME evaluations. Such metrics are designed so that network knowledge and the reference set of known cancer genes are incorporated in ME evaluations under a careful definition of proper control groups. The other two components are designed as further mechanisms to avoid confounders inherent in ME detection on top of the network-centric view. To this end, our second objective is to dissect the side effects caused by mutation load artifacts where mutations driving tumor subtypes with low mutation load might be incorrectly diagnosed as mutually exclusive. Finally, as part of the third main component, the confounding issue stemming from the use of nonspecific interaction networks generated as combinations of interactions from different tissues is resolved through the creation and use of tissue-specific networks in the proposed framework. The data, the source code and useful scripts are available at: https://github.com/abu-compbio/NetCentric.


INTRODUCTION
Cancer is a disease caused mostly due to a gradual accumulation of somatic alterations that give rise to pathway dysregulation through alterations in copy number, DNA methylation, gene expression, and molecular function. An important challenge in cancer genomics is to distinguish driver mutations from passenger mutations. The former are those determined to be causal for cancer progression, whereas the latter are characterized as those not leading to any selective advantage. Several computational methods have been proposed for the identification of cancer driver genes or driver modules of genes by integrating mutations data with various other types of genetic data; see Dimitrakopoulos and Beerenwinkel. (2017), Zhang and Zhang. (2018), Bailey et al., 2018, Tokheim et al., 2016 for recent comprehensive evaluations and surveys on the topic.
A phenomenon observed frequently in the data pertaining to the alterations that the tumors acquire is mutual exclusivity (ME); a driver mutation is less likely to occur in case of an earlier mutation that has common functionality in the same molecular pathway (Thomas et al., 2007;Yeang et al., 2008;Leiserson et al., 2016;van de Haar et al., 2019). Therefore several driver gene or module identification approaches employ ME detection as part of their problem definitions and optimization goals (Babur et al., 2015;Ciriello et al., 2012;Leiserson et al., 2013;Kim et al., 2015;Ahmed et al., 2019;Baali et al., 2020). Such a central role in driver gene and module identification has led to the design of many different approaches for defining and computing mutual exclusivity. Some of these approaches are based on combinatorial definitions of mutual exclusivity (Vandin et al., 2012;Leiserson et al., 2013;Sarto Basso et al., 2019;Ahmed et al., 2019;Song et al., 2020;Baali et al., 2020). In most cases the combinatorial definitions are incorporated and tested within a driver gene or module identification framework, rather than as stand-alone ME tests. On the other hand, the vast majority of the ME detection approaches are based on statistical tests (Ciriello et al., 2012;Szczurek and Beerenwinkel, 2014;Leiserson et al., 2015;Constantinescu et al., 2015;Hua et al., 2016;Canisius et al., 2016;Leiserson et al., 2016;Kim et al., 2017;Liu et al., 2020;Zhang et al., 2020) and in most cases for such approaches the specific goal is to provide ME significance results. Therefore the focus of the proposed framework is the evaluation of the latter set of approaches consisting of the statistical ME tests.
Among such approaches, MEMo builds a graph based on gene similarities and extracts cliques from this graph. To determine whether each clique has significant mutual exclusivity, it then proposes a null model generated by randomly permuting the set of genomic events, while preserving the overall distribution of observed alterations across both genes and samples, and introduces a Markov Chain Monte Carlo (MCMC) permutation strategy based on random network generation models (Ciriello et al., 2012). Szczurek and Beerenwinkel. (2014) propose a probabilistic, generative model of mutual exclusivity, explicitly taking coverage, impurity, and error rates into account. Based on such a model, they provide a statistical test of mutual exclusivity by comparing its likelihood to the null model that assumes independent gene alterations. Mutex defines the alteration of two genes to be mutually exclusive if their overlap in samples is significantly less than expected by chance, where the statistical significance of the overlaps are calculated using a hypergeometric test with the assumption of a uniform alteration frequency among samples (Constantinescu et al., 2015). This may not always be the case as in many data sources there are hyper-mutated samples. The problem is resolved partially by simply excluding such samples from the analysis. CoMEt (Leiserson et al., 2015) on the other hand provides an exact statistical test for mutual exclusivity conditional on the observed frequency of each alteration with the goal of introducing less bias towards high frequency alterations. Based on this it provides a tail enumeration procedure to compute the exact test, as well as a binomial approximation. DISCOVER provides a statistical independence test that makes no assumption of identical gene alteration probabilities across tumors (Canisius et al., 2016). The alteration probabilities are estimated by solving a constrained optimization problem guaranteeing the probabilities are consistent with both the observed number of alterations per gene and the observed number of alterations per tumor. The tumorspecific gene alteration probabilities are then used to compute the probability of concurrent alterations which in turn are used to decide whether the number of tumors altered in both genes deviates from the expectation through an analytical test based on the Poissonbinomial distribution. WeXT provides a weighted exact test that conditions simultaneously on the number of samples with a mutation and the per-event, per-sample mutation probabilities (Leiserson et al., 2016). A recursive formulation to compute p-values for this weighted test exactly and a saddle-point approximation of the test are proposed. WeSMe provides a permutation-based test and an approximation of significance through a weighted sampling technique that enables further improvements in running time spent for sampling and a way to obtain a better precision without increasing the computational time significantly . Mina et al. propose the SELECT method which uses a weighted version of mutual information to identify significant mutual exclusivity or co-occurrence patterns where significance is estimated by comparing against patterns observed in random permutations of the data (Mina et al., 2017). Two recently suggested ME tests are FSME (Zhang et al., 2020) and MEScan (Liu et al., 2020). The former proposes a seed-and-extend strategy to alleviate the computational cost of a permutation-based test. The seed pairs are constructed by a combinatorial formulation incorporating both ME and the coverage of the pair. The seeds are then grown with new genes by employing an independence test. MESCan provides a test statistic that incorporates a patient and gene-specific background mutation rate in the calculation to adjust for the background noise, and that includes a gene-specific weight to down-weigh genes with high mutation rates. Such a statistic is then employed in an MCMC algorithm followed by a false discovery rate control.
We propose a network-centric framework to evaluate the pairwise significances found by statistical ME tests. It is important to make a distinction between the network-centric view of the current study and that of the previous studies employing both network data and the concept of ME (Ciriello et al., 2012;Leiserson et al., 2013;Kim et al., 2015;Ahmed et al., 2019;Baali et al., 2020). The latter are network-centric in the sense that the proposed ME tests are applied on interacting pairs or subnetworks as part of a more general goal of identifying cancer driver genes/modules. Thus due to the nature of the set objectives their evaluations focus on the success of output genes/ modules matching reference cancer-related drivers/pathways. The proposed study takes on an approach in the opposite direction; we assume the interaction network and the reference cancer-related drivers to be inputs to our framework which evaluates the success of various ME tests. The focus of the proposed framework is on pairwise significances since one of the major application areas where ME tests are commonly employed is knowledge-based cancer driver identification where pairwise ME significances are of major essence. In terms of the general objectives our work is most similar to that of Deng et al., 2017, where a framework for performance comparisons of statistical ME detection approaches is proposed and executed on six such tests. An important distinction is that the performance analysis of Deng et al. is based on experiments with simulated data and the framework does not suggest any mechanism to avoid confounders inherent in ME detection. One such confounder is due to the alterations specific to cancer subtypes (Deng et al., 2017;van de Haar et al., 2019). Alterations in different subtypes may be incorrectly diagnosed with ME, although the alterations are not due to any natural root causes of ME such as redundant functionality. Inspired by the observation that mutual exclusivity is enriched among physically interacting pairs of genes (Dao et al., 2017), our network-centric view aims to recognize such false positives by constructing reference sets based on known drivers gathered from neighborhoods of interaction networks. Furthermore, inspired by the mutation load confounding concept of van de Haar et al., 2019, we extend our network-centric framework to dissect side effects caused by mutation load artifacts; mutations that drive tumor subtypes with low mutation load might be incorrectly diagnosed as mutually exclusive. A possible drawback of the proposed network-centric evaluation framework would be due to the use of nonspecific interaction networks that are generated as combinations of interactions from different tissues and are thus suboptimal in resolving confounding issues of mutual exclusivity. In order to detect whether there exists such discrepancies or to limit their effect if they do, we therefore refine the networkcentric approach by designing further tests on tissue-specific networks (TSN) we construct based on gene co-expression.

METHODS
The overall network-centric ME evaluations framework has three main components. The first one consists of definitions of the metrics employed in the network-centric ME evaluations. Such metrics are designed so that network knowledge and the reference set of known cancer genes are incorporated in ME evaluations under a careful definition of proper control groups. The second component detects whether the use of the interactome information provides similar advantages in ME corrections of pairwise mutual exclusivity findings as the subtype-stratification idea suggested by van de Haar et al., 2019. Finally, the third component extends our framework to incorporate tissue-specific networks with the aim of reducing the possible side effects of using nonspecific interaction networks.

Metrics for the Network-Centric Mutual Exclusivity Evaluations
Assuming that cancer driver genes in the same pathway are more likely to show mutually exclusive mutation profiles, we utilize the interactome to devise a strategy for evaluating the ME methods and the effects of the interactome information on quantifying ME. Let G, C, T , S, p t , c denote respectively the input Protein-Protein Interaction (PPI) network, the employed cohort, the statistical ME test undergoing the network-centric ME evaluations, the golden standard reference gene set of known cancer drivers, the p-value threshold for significance, and the type of the control group to be employed. Let N S (g i ) denote the set of genes from S that are in the neighborhood of the node corresponding to gene g i in the PPI network G. For a gene g i ∈ S, corresponding to each neighbor g j ∈ N S (g i ), we randomly select a gene g r from a control group X c (g i ), and compute TP cur , FP cur , based on the − log-transformed p-values p i,j and p i,r as computed by the ME test T . Here p i,j denotes the significance of the mutual exclusivity of the pair g i , g j for g i ∈ S and g j ∈ N S (g i ), and p i,r denotes the significance of the mutual exclusivity of the pair g i , g r for a random gene g r from the control group. Based on the premise that cancer driver genes interacting in the PPI network are likely to exhibit ME, a pair g i , g j belongs to the set of True Positives if p i,j is significant and a pair g i , g r belongs to the set of False Positives if p i,r is significant.
To obtain robust results, the selection of the random genes from the control group is repeated robustness_iterations number of times, which is set to 100 in all the evaluations, except for those testing the robustness of the framework with respect to various parameter settings. Finally the medians of these 100 instances are summed over all genes g i ∈ S to provide the necessary statistics TP, FP. Thus precision, sensitivity, and the F1 scores are computed based on these statistics. Precision is calculated as |TP|/(|TP| + |FP|). Sensitivity is calculated with the formula |TP|/|P| where P corresponds to condition positives which are defined as the gene pairs g i , g j ∈ S where g i , g j interact in G.
We note that limiting our focus solely on these conventionally formed TP, FP classes may be misleading as each one considers the significance of p i,j and p i,r individually. A more detailed inspection with a simultaneous consideration of their values could prove more insightful in certain cases since they both involve a common gene g i . Towards this aim we introduce the strict versions of these conventional classes. More specifically TP strict consists of g i , g j pairs where p i,j is significant not only with respect to the given threshold but also as compared to the p-value of the control pair g i , g r . Similarly FP strict consists of the control pairs g i , g r , where p i,r is more significant than both the threshold value and p i,j . Based on these strict classes we can compute three metrics: precision strict , sensitivity strict , and F1 strict . Precision strict is defined as |TP strict |/(|TP strict | + |FP strict |) and sensitivity strict is defined as |TP strict |/|P|. Such a consideration is especially convenient in reducing any potential bias inherent in genes like TP53 which have large mutation frequencies almost exclusively in tumors with small numbers of mutations; both p i,j and p i,r are likely to be significant in such a scenario giving rise to vagueness in the conventional F1 score. A comparison of F1 strict values based on the two statistics simultaneous by their nature, TP strict and FP strict provides a more rigorous evaluation in such cases.
For the network-centric ME evaluations we employ two different definitions for the control groups. For the first one, the control group X 1 (g i ) consists of genes in S that do not interact with g i in the PPI network. For the second one, X 2 (g i ) consists of neighbors of g i in the PPI network that are not in S. In the latter case only the genes g i ∈ S for which the number of neighbors not in S is larger than or equal to the number of neighbors in S are taken into account.

Network-Centric Mutual Exclusivity Corrections in Relation to Mutation Load Association
Some statistical mutual exclusivity tests are based on the assumption that gene's alterations across tumors are identically distributed. Among the approaches considered in this study Fisher's Exact Test and MEGSA belong to this category. However, it has been observed that the number of alterations per tumor can vary quite considerably, even in tumors of the same type; colorectal tumors with microsatellite stability have a median of 66 non-synonymous mutations, but colorectal tumors with microsatellite instability have a median of 777 mutations (Vogelstein et al., 2013;Leiserson et al., 2016). It has been shown that under such settings the mutual exclusivity tests relying on identical alteration probabilities across tumors may lead to reduced sensitivity for mutual exclusivity analysis (Canisius et al., 2016). The effects of varying alteration probabilities on pairwise mutual exclusivity calculations have been formalized within the context of the socalled mutation load confounding (MLC) in a recent study by van de Haar et al., 2019. MLC is a correlation between the number of statistically significant mutual exclusivity findings and the mutation load association (MLA) of a gene. MLA of a gene is calculated by running a logistic regression where a gene's binary mutation status indicating whether the gene is mutated or not in a tumor is used as the only feature to predict the mutation load of that tumor. Mutation load is defined as the number of genes that are mutated in a tumor. Once the coefficient of the feature is obtained by fitting the logistic regression model, it is standardized by dividing by the standard error to make it comparable across the genes. This standardized coefficient value is defined as the MLA value. Note that negative MLA values correspond to higher mutation frequencies in tumors with low mutation loads, whereas positive values correspond to higher mutation frequencies in tumors with high mutation loads. Strong negative correlations between the MLA of a gene and the number of statistically significant pairwise mutual exclusivities have been observed, implicating the finding that the more negative a gene's MLA, the higher the number of other genes that show mutual exclusivity with that particular gene (van de Haar et al., 2019). However, such a negative correlation does not always imply true ME since a gene that exclusively shows large mutation frequency in tumors with low mutation loads, naturally has a better chance of forming mutually exclusive pairs with other genes. Thus extra sources of information are necessary to filter out the pairs with true ME relations among a set of statistically significant pairwise mutual exclusivities postulated by some exclusivity test. van de Haar et al., 2019 make use of the subtype information for such a purpose and show that MLC can be reduced by correcting via tumor subtype stratification. Such a correction greatly reduces the number of gene pairs reported to show mutual exclusivity, especially for pairs that include genes with low MLA. A major drawback is the absence of subtype information for many tumors. As part of our network-centric ME framework, we suggest that such a correction can be efficiently done with the interaction network data, rather than or better yet on top of the subtype information. For this purpose we calculate the correlation between the number of statistically significant pairwise ME findings and the MLA for two settings; one where pairwise mutual exclusivities are sought between a gene in S and all other genes in S, and the other where a gene in S is checked against only its PPI neighbors that are in S. The computations of the two settings are repeated with the subtype-stratified data as well, to see the added value of the network-centric ME corrections on top of the subtype-based corrections on statistically significant pairwise MEs.

Network-Centric Mutual Exclusivity Evaluations in Relation to Tissue-Specific Networks
Rather than using a common nonspecific network for all the cancer types, in this component of our evaluation framework we employ TSN based on the tissue in which the tumor develops. To construct the TSN for a particular tissue, we start with the original PPI network and remove the edges between the pairs of genes that are not co-expressed in the corresponding tissue. For this purpose, we download RNA-seq datasets from GTEX portal (GTEXConsortium, 2020). See Supplementary Table S49 for the total number of available samples for each tissue. To determine the co-expressed genes, we follow the procedure described in Luck et al., 2020. For each pair of genes that have an edge in the original PPI network, we identify the number of samples where both genes have Transcripts Per Kilobase Million (TPM) values ≥1. We then divide this number with the total number of samples where either gene has a TPM value ≥1. The resulting value is called the co-expression ratio. Gene pairs interacting in the original network are included in the TSN cor if the co-expression ratio is ≥ cor, for a given threshold cor.
In addition to applying the network-centric metrics introduced in Section 3.1 on the constructed TSNs, we also propose a more detailed evaluation in terms of ROC analysis based on tissue-specificity. For this purpose, we define the gene pairs with co-expression ratio value of 1 as tissue-specific gene pairs. Similarly, the gene pairs with co-expression ratio values ≤0.5 are called non-tissue-specific gene pairs. To test whether a specific ME test identifies stronger mutual exclusivities for the tissue-specific gene pairs in S, we rank the gene pairs in S in increasing order of p-values. To construct the control group, we rank the same number of random samples of gene pairs not in S with respect to the p-values making sure that the sizes of the positive (or negative) sets of gene pairs not in S are exactly the same as those that are found for the gene pairs in S. For both gene pairs in S and gene pairs not in S, the set of positives consists of the tissue-specific gene pairs, whereas non-tissue-specific gene pairs are labelled as negatives. We then compute the True Positive Rate (TPR) and the False Positive Rate (FPR) for each case. Note that for robustness considerations the control group

Input Data and Parameter Settings
The somatic mutation data from TCGA was preprocessed and provided by  (531). The preprocessing step involves the removal of all mutations with "variant_classification" of "Silent," "3'UTR," "Intron," "5'UTR," "RNA," "3'Flank" and "5'Flank" from the TCGA data. The input data is then further filtered by mutation frequency threshold, t, to include genes with > t mutations across the cohort. More specifically, with t 20 we include the genes that are mutated in more than 20 samples within the cancer type under study. Regarding subtypes, we download subtype information for BRCA from the cBioPortal Gao et al., 2013) and the CMS stratification for COADREAD from (Guinney et al., 2015). We use the COSMIC Cancer Gene Census database to compile the set of known cancer genes (Sondka et al., 2018). For the results presented in the main document we employ the IntAct PPI network as it is a comprehensive and wellcharacterized database (Orchard et al., 2014). As a preprocessing step, we remove duplicate edges and edges below the confidence threshold of 0.35 from the network. The final network contains 15,079 nodes and 103,520 edges. For the gene expression data employed in the construction of TSNs, we download RNA-Seq data from the Genotype-Tissue Expression (GTEx) portal (GTEXConsortium, 2020) (05-06-2017).
For the comparative evaluations of our network-centric framework described in the previous section, we choose six popular statistical mutual exclusivity methods: DISCOVER (Canisius et al., 2016), DISCOVER Strat (Canisius et al., 2016;van de Haar et al., 2019), Fisher's Exact Test, WeXT (Leiserson et al., 2016), MEMo (Ciriello et al., 2012) and MEGSA (Hua et al., 2016). Among these, MEMo and MEGSA are originally designed to output p-values for a set of genes with size > 2. For MEMo, we re-implement the first part of the algorithm where pairwise ME p-values are estimated. We use Q 100 and N 10, 000 as suggested by the original paper (Ciriello et al., 2012). For MEGSA, pairwise ME p-values are calculated by applying chi-square cumulative probability less than or equal to the value of the log likelihood calculated by the funestimate function. With regards to the parameter settings of our proposed framework, we employ the values of 5 and 20 for t. Table 1 and 2 show the results of evaluating the 6 ME detection methods on COADREAD data where t 20 and we use the data from 498 patients for which subtype information is available. We use X 1 and X 2 as the control group in Table 1 and 2, respectively. We first discuss the results of X 1 . We observe that DISCOVER Strat gives the highest precision and precision strict values. The ranking of the other methods from best to worst in terms of precision or precision strict is as follows: WeXT, DISCOVER, MEMo, MEGSA and Fisher's Exact Test. A comparison of the precision and precision strict values distinguishes two groups of ME methods; for DISCOVER, DISCOVER Strat, Fisher's Exact Test, and WexT the precision strict values are greater than or equal to the precision values, whereas the exact opposite is observed for MEGSA and MEMo. This suggests that the performance of the methods in the latter group gets worse when random control gene pair is considered simultaneously in the precision calculation, that is precision strict . Compared to the precision, we observe much larger differences among the sensitivity or the sensitivity strict values output by the employed methods. We can group the methods into two where the first group contains WeXT, MEMo and DISCOVER, and the second group contains the rest of the methods. The first group of methods give much larger sensitivity or sensitivity strict values than the second. For instance, the sensitivity value obtained with WeXT is an order of magnitude larger than that of Fisher's Exact Test. This also shows that the second group of methods are more conservative than the first group of methods. WeXT is the least conservative approach based on its high sensitivity value. Even though WexT predicts many significant p-values, it still has a competitive precision strict value which is slightly lower than the maximum observed value (0.725 vs 0.727). Accordingly, WeXT obtains the best F1 score and F1 strict score which is followed by MEMo and DISCOVER. The remaining three methods give much smaller F1 scores and they rank as follows from highest to lowest: MEGSA, DISCOVER Strat and Fisher's Exact Test. Comparing the conventional F1 score with the F1 strict score of each ME method, the largest difference is observed for MEMo indicating that the consideration of the random pair as a control affects its performance dramatically. Another interesting observation is the lower performance of DISCOVER Strat compared to DISCOVER which suggests that the use of subtype information is not useful for COADREAD. Table 2 shows the results where X 2 is used as the control group. Since X 2 (g i ) is defined as the non-CGC neighbors of g i in the PPI network, we can only consider the CGC genes that have more non-CGC neighbors than CGC neighbors. As such, the number of pairs included in this analysis is much smaller than that of Table 1 ( 107 vs 196). The ranking of the methods in Table 2 with respect to F1 score and sensitivity remain the same as Table 1. However, there are differences in the ranking with respect to other metrics. For instance, WeXT ranks best in terms of precision whereas the best ranking method in Table 1, DISCOVER Strat, ranks the fifth. Compared to Table 1, the precision values of all the methods are smaller in Table 2. We see the opposite trend for sensitivity values. These changes are in parallel with the increase in percent significant p-values output by the methods. For instance, the percentage of significant p-values output by DISCOVER is 12% in Table 1 and 18% in Table 2. We also observe differences between the conventional and the strict versions of the employed metrics. WeXT and DISCOVER have increased precision strict values compared to precision whereas we observe the opposite trend for the rest of the methods. Additionally, the ranking of the methods with respect to F1 score and F1 strict score is different. Namely, MEMo's ranking decreases from second highest to third highest when we switch from F1 score to F1 strict score. Accordingly, DISCOVER's ranking improves from third highest to second highest based on F1 score. This increases the confidence of DISCOVER results as F1 strict requires a stricter definition of true and false positives. Supplementary  Table S1 shows the results with X 1 control group and t 20 filtering for the other cancer types. A detailed discussion of these results are available in the Supplementary Material. Table 3 and 4 show the COADREAD results of t 5 setting with c X 1 and c X 2 , respectively. Using a lower value for t increases the number of gene pairs tested in our analysis. When we compare these results with the results we obtained when t 20, we observe few differences. Though the number of tested gene pairs is larger, the percentage of significant p-values obtained by the methods decreases. For instance, the percentage of significant p-values output by WeXT for COADREAD data decreases from 42 to 14% when t is changed from 20 to 5. This is likely related to the larger inclusion of low mutation frequency genes when t 5. An interesting observation for t 5 results is the decrease in DISCOVER Strat's performance. For COADREAD, DISCOVER Strat's precision and precision strict value is the highest for t 20 when X 1 is used as the control group. However, when t 5, we observe that it ranks after WeXT and DISCOVER in terms of precision/precision strict value. Similarly, for BRCA dataset, DISCOVER Strat ranks after WeXT for both control groups X 1 and X 2 (Supplementary Table S25B, Supplementary Table  S37B).

Mutual Exclusivity Evaluations Based on Defined Metrics
Lastly, we investigate the robustness of our results with respect to robustness_iterations value, the p-value significance threshold value, the reference gene set and the employed PPI network. The results together with a discussion of these results are available in

Mutual Exclusivity Evaluations Based on Corrections via Mutation Load Association
Having compared the ME tests with respect to our novel networkcentric evaluation framework, we now assess whether including network knowledge reduces the mutation load confounding  (Cheng et al., 2015) where, for each gene, mutual exclusivity tests are performed with all the other genes in the panel. Here, we first perform a similar analysis where we use the COSMIC CGC database (Forbes et al., 2017) to define the reference cancer gene set as it is more comprehensive and up to date. Figures 1A shows the MLA of the reference cancer genes vs the percent significant findings in mutual exclusivity tests performed with DISCOVER for the TCGA COADREAD cohort (498 tumors). We observe a strong negative correlation between MLA values and percent significant findings in mutual exclusivity tests (Pearson correlation -0.88, p-value 4.0e − 25) similar to van de Haar et al., 2019. In Figures 1B, we take into account the PPI information to calculate percent significant findings. Namely, for each CGC gene, we perform mutual exclusivity tests only with its PPI neighbors that are also in CGC. Note that CGC genes which do not have any CGC In Figures 1B, we observe a reduced correlation when network information is included (Pearson correlation -0.4, p-value 4.91e − 4). We also run DISCOVER Strat where stratification is based on CMS subtypes (Guinney et al., 2015). We plot these results in Figures 1C where we again ensure comparability with Figures  1D where both subtype and network information are considered. Comparing Figures 1A and Figures 1C, we verify the findings of van de Haar et al., although with less significance in correlation difference (Pearson correlation −0.73, p-value 2.8e − 13). It should be noted that the subtype stratification inherently causes an overall decrease in percent significant findings, not specific to genes with low MLA. On the contrary the idea of ME corrections through network incorporation, materialized in the comparison of Figures 1A and Figures 1B, inherently  Next, we utilize waterfall plots to compare the outputs of DISCOVER and DISCOVER-Strat to assess how MLA and subtype information can affect mutual exclusivity findings. Figures 2A shows two selected gene pairs that display significant mutual exclusivity based on both DISCOVER and DISCOVER-Strat estimations on TCGA COADREAD dataset. The mutual exclusivity between BRAF and NRAS, two members of the MAPK pathway, is well-known and has been detected in multiple cancer types including melanoma, myeloma and colorectal cancer (Samowitz et al., 2006;Roth et al., 2010;Popovici et al., 2012) BRAF is frequently mutated in patients from CMS1 subtype whereas NRAS shows almost no mutation across these patients. However, since BRAF and NRAS mutations are mutually exclusive across not only CMS1 subtype but also across the other subtypes, DISCOVER-Strat identifies this pair as significantly mutually exclusive. Similarly, SMAD3 and SMAD4 are two members of the TGF-β pathway and the mutual exclusivity between these two transcription factors is previously reported in colorectal cancer (Fleming et al., 2013). Mutations on SMAD3 and SMAD4 are distributed almost uniformly across the subtypes. As such, the mutual exclusivity between the mutations of these two genes is still significant when subtype information is incorporated. Figures 2B similarly shows two selected gene pairs that display significant mutual exclusivity based on DISCOVER but not based on DISCOVER Strat. For the first pair, we observe that NUP98 is mutated almost exclusively in patients from the CMS1 subtype which shows hypermutation due to microsatellite instability. On the other hand, there is a depletion of APC mutations among the patients from the CMS1 subtype which results in a low MLA value. As such, FIGURE 2 | Waterfall plots of the distribution of mutations for selected gene pairs. (A) Mutation distribution of two selected gene pairs (BRAF-NRAS and SMAD4-SMAD3) that are found to be significantly mutually exclusive based on both DISCOVER and DISCOVER-Strat estimations. (B) Mutation distribution of two selected gene pairs (APC-NUP98 and KRAS-PDE4DIP) that are found to be significantly mutually exclusive based on DISCOVER but not based on DISCOVER-Strat. Note that the set of samples included in each plot is determined by finding the set of patients that have a mutation in at least one of the listed genes. GenVisR R package is used to generate the waterfall plots (Skidmore et al., 2016). Subtype information is downloaded from (Guinney et al., 2015).
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 746495 DISCOVER Strat fails to detect a significant ME between these two genes since it explores ME within each subtype separately. A similar observation can also be made for the KRAS-PDE4DIP pair where the former has a low MLA and the latter has a high MLA. Supplementary Figure S2 compares the MLA of the reference cancer genes with the percent significant findings in mutual exclusivity tests for BRCA. Similar to the results that we obtain for COADREAD data, including network information reduces the correlation between MLA and ME detection rate ( Supplementary Figures S2B vs S2C). The magnitude of reduction is even more significant than what we observe for COADREAD data (Pearson correlation −0.93 vs −0.27). Interestingly, including subtype information results in a very slight decrease in correlation coefficient (−0.93 to −0.91) (Supplementary Figures S2B vs S2E) as opposed to what we observe for COADREAD. We observe that including subtype information on top of network information results in no decrease in correlation (Supplementary Figures S2C vs S2F). This difference in the effect of including subtype information for BRCA and COADREAD datasets could be related to the average tumor mutation load of subtypes. BRCA subtypes have comparable average TML values (Her2: 146, LumA:65, LumB: 71, Normal: 55) whereas the CMS1 subtype in COADREAD has a dramatically larger average TML value compared to the other subtypes of COADREAD (CMS1: 1387, CMS2:93, CMS3: 272, CMS4: 212) We repeat the same analysis with the other ME detection methods as well as for other cancer types when t is set to 20 (Supplementary Figures S1-S8). We observe that the percent significant finding values can vary remarkably across the tumor types. Compared to other cancer types, we observe smaller percent significant findings for LUSC (Supplementary Figures S5A,S5D, S5G). Similarly, very few pairs have percent significance value ≥ 20 when we consider network information in LUSC (Supplementary Figures  S5C,S5F,S5I). On the contrary, we observe many pairs with large percent significant values for CGC-CGC neighbors in UCEC data. This is particularly true for DISCOVER and WeXT results (Supplementary Figures S8C-S8L).
When we consider the correlation between MLA and percent significant values, we observe that adding network information decreases the correlation coefficient values for all cancer types and for all ME detection methods except for Fisher's Exact Test. Figures S5-S6 D vs F). Also, the correlation coefficient can not be computed for LUAD and STAD since Fisher's Exact Test gives a value of 0 for the percent significant findings of all considered genes ( Supplementary Figures S4D-S7F). Another interesting observation is the variance in magnitude of decrease in correlation values across different tumor types. In particular, we observe a smaller decrease in correlation values for LUAD compared to other cancer types. The analogous results are also available for t 5 setting (Supplementary Figures S9-S16). For all the cancer types, the correlation between MLA values and percent significant findings decreases and becomes non-significant for most cases.

Fisher's Exact Test results show an increased correlation with the addition of network information for LUSC and SKCM (Supplementary
We should also note that the majority of CGC genes have only one neighbor within the data setting of the cancer type under consideration. This leads to percentage significant findings of either 0 or 1 in many cases simply because these are the only possible values; for COADREAD see Figures 1B  Individual genes of interest are those that have increased percent significant findings when network neigborhood information is incorporated while at the same have significant number of CGC neighbors. More specifically, for the former constraint, we identify the CGC genes with at least 0.1 increase in percentage of significant findings value of WeXT, DISCOVER and MEMo when the network information is included as opposed to the scenario when it is not (e.g., for COADREAD, Figures 1A vs Figures 1B). We choose these 3 ME methods since they are top performers based on the defined metrics in Section 3.1. For STAD, SKCM and UCEC, since MEMo results are unavailable, we only consider WeXT and DISCOVER results. For the second constraint, we include the CGC genes with at least 3 CGC neighbors. For COADREAD, this selection procedure results in four genes: EP300, CREBBP, NCOA2 and NCOR2. Among these, EP300 is a well-known tumor suppressor in epithelial cancer types including COADREAD (Gayther et al., 2000). For BRCA, the only identified gene is PIK3R1. PIK3R1 is found to be significantly mutually exclusive with PIK3CA and SPEN based on both WeXT, DISCOVER and MEMo results. PIK3R1 and PIK3CA are members of the PI3K pathway and their mutual exclusivity has been previously established in the literature (Chen et al., 2018). For LUAD, PTPRB is the only identified gene and is found to be mutually exclusive with EGFR, a well-known oncogene in non-small cell lung cancer (Bethune et al., 2010). The set of identified genes for STAD are NCOA2, NCOR2 and CREBBP; all of which are found to be mutually exclusive with TP53. For SKCM, we identify ERBB4, RAC1, EP300 and ITK. ERBB4 is a well-known oncogene in skin cancer and found to be mutually exclusive with ERBB2 (Prickett et al., 2009;Nielsen et al., 2014). ERBB2 and ERBB4 indeed belong to the same family (i.e. ErbB family of receptor tyrosine kinases) and form a heterodimer receptor for Heparinbinding EGF-like growth factor (HB-EGF) (Iwamoto et al., 2017). RAC1 mutation P29S is an established driver in melanoma (Jiang et al., 2018). RAC1 is found to be mutually exclusive with MYH9, a tumor suppressor in melanoma (Singh et al., 2020). Lastly, ITK has been shown to be an oncogene in melanoma (Carson et al., 2015). For UCEC, we identify 33 genes in total. Among these, KIT and PTEN have established roles in UCEC cancer development (Chang et al., 2015;Wang et al., 2020). Moreover, PTEN is found to be strongly mutually exclusive with SPOP, whose mutations are also associated with endometrial cancer (Clark and Burleson, Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 746495 2020). Lastly, for BLCA and LUSC, no gene satisfies the abovementioned criteria. Overall these results suggest that the CGC genes that show increased ME with network incorporation as well as their mutually exclusive partner genes often have established roles in the development of the particular cancer type.

Mutual Exclusivity Evaluations Based on Corrections via Tissue-Specific Networks
We first provide our ME evaluations with respect to the metrics defined in Section 3.1 by replacing the non-specific networks with TSNs. We provide two types of comparisons; one where we compare TSN 0.5 with the original non-tissue specific Intact network and one where results of TSN 0.5 are compared against TSN 0 . We do the latter to avoid artifacts that may be introduced due to the fact that some genes in the original Intact network might be simply missing from even TSN 0 since they may be nonexistent in the GTEX database. For the BLCA dataset, comparing the F1 scores of the ME methods under TSN 0 and TSN 0.5 settings, we observe that the scores of all methods are higher for the latter network. The largest percent increase of 10% is observed for WeXT when the control group is X 1 . Similarly, the largest percent increase of 12% is observed for MEMo when the control group is X 2 . On the other hand, when we compare the scores of TSN 0 against the original network, the differences are negligible. The next largest difference between the F1 scores obtained under TSN 0.5 as compared to TSN 0 is observed in STAD where we see a 7% increase in DISCOVER's score for X 1 , and a 10% increase in WeXT's score for X 2 . For the rest of the cancer types under study, for LUSC and UCEC we observe slight increase in performances of all the ME methods comparing the metrics under TSN 0.5 against TSN 0 . For COADREAD, BRCA and SKCM we observe both increases and decreases in performances but the differences are almost negligible; see Supplementary  Tables S50-S81 for detailed results. Figure 3 compares the ROC curves of CGC gene pairs and non-CGC gene pairs for COADREAD data where mutual exclusivities are estimated with DISCOVER, DISCOVER Strat, Fisher's Exact Test, MEGSA, MEMo and WeXT with t 20. We observe that all the ME methods estimate stronger mutual exclusivities for tissue-specific CGC gene pairs compared to non-tissue-specific CGC gene pairs since AUROCs are greater than 0.5. Additionally, we observe much smaller AUROCs for the control group where we repeat the same analysis with non-CGC gene pairs. Analogous results are available for the other cancer types where both the positive and negative set contains at least 10 number of pairs when t is set to 20. (Supplementary Figures  S33-S35). We observe a similar result for SKCM where CGC pairs result in larger AUROCs compared to non-CGC pairs for all ME methods (Supplementary Figure S34). We observe a steep increase in the ROC curves plotted for MEGSA results. This is due to the utilized likelihood ratio test that results in a p-value of 0.5 when the likelihood values are equal to each other. For UCEC, we see a significant difference between the ROC curves of CGCpairs vs non-CGC pairs for Fisher's Exact Test and MEGSA; whereas the corresponding difference is negligible for DISCOVER and WeXT.

CASE STUDY
Apart from the defined network-centric ME evaluation framework, we discuss a case study where we assess whether mutual exclusivities estimated by the considered ME methods improve the performance of driver identification methods that utilize mutual exclusivity information. To this end, we compare the original version of MEXCOwalk with its alternatives where mutual exclusivity estimates are provided by the employed ME methods. Assuming that g i and g j genes are mutated in patient sets S i and S j , respectively; MEXCOWalk simply computes the mutual exclusivity between these two genes with the following formula: |S i ∪ S j |/(|S i | + |S j |). MEXCOwalk uses the estimated mutual exclusivity values as part of edge weights. As such, to utilize the p-values output by ME detection methods in MEXCOwalk, we first compute −log (p-value) and then convert the resulting values between 0 and 1. To this end, we replace all −log (p-value)'s larger than 10 with 1. We then find the maximum −log (p-value) less than 10 and divide all other −log (p-value)'s with this value. The reason why we set a threshold for finding the maximum is the large differences across the smallest p-values output by different ME methods. For instance, WeXT outputs a very large range of p-values and if we use the smallest p-value to scale, all other −log (p-value)s will be converted to values that are very close to 0. In the original MEXCOWalk study, a threshold of 0.7 is applied to ME values such that all values ≤0.7 are clamped to 0. This conversion is equivalent to removing those edges from the network since the edge weights include a multiplicative term for ME values. We find that the removal of these edges correspond to a 0.035 percent reduction in graph density. For the current analysis, we determine the threshold value for each ME detection method to achieve the same percent density reduction in the graph. Figure 4 shows the number of recovered CGC genes for fixed output gene sizes from 100 to

DISCUSSION
It is important to investigate whether the employment of an interaction network within our ME evaluation framework causes any ascertainment bias in the findings and to elaborate on how any such potential bias is mediated within the framework. It is established that known cancer genes have larger number of interactions compared to other genes in the network (Hou and Ma, 2014a). This implies a potential bias that needs to be resolved in cancer driver gene identification methods employing interaction network data. Such a bias is less of a problem for the current study, since our aim is not to identify novel cancer driver genes but to utilize the interaction network and known cancer genes to form a ground truth of mutually exclusive interactions for evaluating existing ME methods. On the contrary, the fact that most known cancer genes have wellcharacterized interactions in the network provides a benefit for our work as it supports the confidence of our true positive examples. Additionally, our framework makes use of not only genes from the reference set S but also genes not in S to create random controls. Nevertheless, the fact that some known cancer genes have significantly larger number of interactions compared to other known cancer genes could lead to a bias. For instance, for our analysis of the COADREAD data (t 20, S CGC), there are 74 CGC genes among which five CGC genes have more than ten CGC neighbors whereas 41 have exactly one CGC neighbor. This could lead to a bias as CGC genes with large number of CGC neighbors contribute to the aggregate statistics and metrics much more than those CGC genes with small number of CGC neighbors. To mediate this bias, our framework includes additional results where all the statistics and the traditional measures such as the F1 score are calculated in a degree-normalized way for each gene and the gene-level results are then aggregated by taking an average across the genes. These results are available in the Supplementary Document; Supplementary Tables S12, S24, S36, S48. To summarize, the degree-normalized results are in agreement with those of the previous settings in almost all the cases in terms of ranking based on F1 score.
Another important point worth emphasizing is that apart from the aggregate statistics provided in the previous sections as part of the metrics for the network-centric ME evaluations, our proposed framework also provides analogous statistics at the gene-level as well. Such statistics may in fact be of more interest to cancer biologists than the aggregate statistics in certain cases. Several interesting observations can be made through an inspection of these gene-level evaluations, especially for the settings where the conventionally defined F1 score fails in quantifying ME. Genes with low MLA comprise an example setting, where TP53 is a leading member. Consider the case of TP53 in COADREAD evaluations for instance. With respect to the degree-normalized setting, the values of precision, sensitivity, precision strict and sensitivity strict for WeXT are respectively 0.5, 1, 0.25, 0.25 which gives rise to an F1 score of 0.66 and F1 strict score of 0.25. On the other hand, MEMo provides the same precision, sensitivity and F1 scores as WeXT whereas its precision strict , sensitivity strict and F1 strict scores are all 0. To summarize, although the inspection of the F1 scores does not provide a distinction between the two results, an inspection of the F1 strict scores establishes that MEMo is worse than WeXT in this setting. We note that the advantages of inspections based on the strict definitions of the metrics rather than the conventional ones are also apparent in the aggregate analysis as well. In addition to the COADREAD evaluations shown in Table 2, BRCA also contains an example instance where the conventional and the strict versions of the metrics provide different conclusions; see Supplementary Table S7B. In terms of the F1 scores, DISCOVER Strat ranks fourth, whereas comparing F1 strict scores it ranks the second. Also, overall we observe that MEMo's performance gets severely affected when the strict versions of the metrics are employed.
Next, our robustness analysis results reveal some suggestions for potential users of our framework. We recommend using a p-value threshold smaller than 0.1 but larger than 0.05 as lower threshold values are too stringent and lead to too few predicted positives. Regarding robustness_iterations, we tested values both smaller than and higher than the default value of 100 for COADREAD evaluations: 5, 50, 100, 300 and 500. We repeated each experiment 20 times and calculated the standard deviation of the obtained set of F1 and F1 strict scores. For the majority of the cases, we observe a large decrease in the standard deviation values when robustness_iterations is increased from 5 to 50. (Supplementary Table S82). This analysis suggests that the robustness_iterations should be set to a at least 50. Lastly, we observe that different PPI networks can lead to large differences in both the F1/F1 strict scores and the ranking of the methods. As such, exploring different PPI sources would be beneficial.
To assess whether our findings extend to other datasets other than TCGA, we repeat our evaluations on somatic mutation data of 402 colon cancer patients within the Pancancer Analysis of Whole Genomes (PCAWG) study (Campbell, 2020).
Supplementary Tables S83-S86 shows the ME evaluations with respect to the metrics defined in section 4.2. We observe an overall decrease in F1 strict scores of the methods. Compared to analogous results in TCGA data, WexT still performs the best in terms of F1 strict score whereas the second best performing method The majority of the somatic mutations observed in cancer genomics are passenger mutations. In the evaluations provided in the Results section we employ a simple filtering strategy where we remove silent mutations and mutations on non-coding regions of the genes. Additionally, we also assess the effects of employing a more elaborate mutation filtering procedure. To this end, we download the predictions of the Muiños et al. study on COREAD type (Muinos et al., 2021). This includes the classification of all possible mutations on 12 genes as driver or passenger mutations. Accordingly, we filter out the proposed passenger mutations from our mutation data and repeat all of our relevant analyses. We observe that the ranking of the methods according to the metrics proposed in section 4.2 remain the same where WExT, MEMO, and DISCOVER Strat show reduced F1 strict scores, and DISCOVER and Fisher's Exact Test show higher F1 strict scores (Supplementary Tables S87-S90). The TSN results and the MLA analysis results are also similar to our original results (Supplementary Figures S39-S40). Muiños et al. provides classifications of mutations on a subset of genes which have training data larger than a certain size. If such classifications become available for a larger set of genes in the future we can provide a better assessment regarding the filtering procedures employing these classifications.
Mutated genes in cancer prevalently exhibit a long tail phenomenon where few genes are mutated in many patients and large number of genes are mutated in few patients. To check whether assessing the mutual exclusivity of gene pairs with very different mutation frequencies bias the evaluations of the compared ME methods, we repeat our analyses after filtering out the genes with mutation frequencies < 5% and > 30%. The results after this filtering step are available in Supplementary Tables S91-S93. We observe that the ranking of the methods remain the same where we see a significant increase in Precision/ Precision strict values and a slight decrease in Sensitivity/ Sensitivity strict values. When we look at the Precision strict values in more detail, we observe that the FP strict values drop dramatically when we apply the filtering. This suggests that the control gene pairs that include genes with very low or very high mutation frequencies can have more significant p-values as compared to the p-values obtained for the corresponding CGC-CGC pair.
We also evaluate a more general ME detection method SELECT, which investigates both types of relationships among pairs, co-occurence and ME simultaneously. SELECT outputs ME associated scores to only a subset of the input gene pairs. Thus one strategy for comparing the results of SELECT against other methods is to focus only on such subsets. The relevant results where we use this strategy are available in Supplementary  Tables S94-S97. We report evaluations on two subsets of TCGA COADREAD dataset: 1) the set of CGC-CGC pairs where SELECT results are available, 2) the set of CGC-CGC pairs where SELECT's version which uses subtype information (i.e., SELECT subtype ) are available. For the former, we observe that SELECT and SELECT subtype rank the fourth after WExT, MEMO, and DISCOVER. For the latter evaluation, SELECT subtype performs better than SELECT although both of them still rank the fourth among the other ME methods. Another strategy to fix this problem is to assign the worst ASC score to such pairs without specific ASC scores in the ME direction. We employ this approach as well and observe that it gives no significant difference in the comparisons.
Lastly, it is important to mention certain limitations of the proposed framework. Our framework is based on the presupposition that ME is likely to occur between interacting known cancer genes. Although rare, there may exist two different types of exceptions to this assumption; ME can be observed between non-interacting known drivers and the relationship between an interacting pair of known drivers can be that of co-occurrence rather than that of ME. These constitute respectively the false negative and the false positive events in our framework. An example instance of the former is the mutually exclusive mutations of APC and RNF43 observed in colorectal cancer [Mina et al., 2017] and example instance of the latter is the co-occurrence of CCNE1 and TP53 alterations [Zhang et al., 2014]. Both of these patterns are currently ignored by our framework and incorporation of mechanisms to dissect each such pattern to increase the performance of true ME detection is an important future step. Another limitation of the current framework is that it requires the availability of wholegenome or whole-exome sequencing data.

CONCLUSION
We propose a network-centric framework to evaluate pairwise mutual exclusivity findings reported by different ME algorithms. The first component of our framework consists of useful definitions of statistics employed in the network-centric ME evaluations. We observe that for the majority of the cancer types under study WeXT outperforms the other methods in terms of F1 score measured with respect to appropriately defined control groups. In half of the cancer types DISCOVER and in the other half MEMo perform as the second best methods. When comparing different cancer types we observe that BRCA and COADREAD are among the top two types leading to maximum F1 scores with at least one of the ME methods providing a score greater than 0.5. We note that DISCOVER Strat is only applicable in two cancer types among a total of eight since these are the only cancer types with well-defined subtypes. Furthermore, among these two cancer types, DISCOVER Strat outperforms original DISCOVER algorithm in BRCA, whereas it is the second worst method after Fisher's Exact Test in COADREAD. This is noteworthy since van de Haar et al. propose subtype stratification as employed by DISCOVER Strat as a way to emphasize true mutual exclusivity by reducing mutation load confounding (van de Haar et al., 2019). We also observe that Fisher's exact test and MEGSA are more conservative compared to DISCOVER and WeXT, where from the latter group, WeXT outputs notably larger number of significant p-values. The second component of our framework evaluates ME tests by comparing two types of measures obtained with and without network information. First measure is with respect to the percent significant findings of mutually exclusive gene pairs, whereas the second is based on MLC values. In most of the cancer types and for most of the genes we observe an increase with respect to the former whereas a decrease with respect to the latter measure. Finally, we repeat the same analysis by considering TSNs in the network-centric framework. Considerable improvements achieved due to the use of TSNs as opposed tissue nonspecific interaction network are only observed for BLCA and STAD datasets. A more detailed analysis in terms of comparing ROCs of CGC gene pairs and non-CGC gene pairs on cancer types with considerable number of tissue-specific gene pairs indicate the advantages of employing tissue specificity in detecting mutual exclusivity in COADREAD, SKCM, and UCEC. Finally we extend out network-centric evaluation framework to assess whether including network knowledge reduces the mutation load confounding problem.
As noted earlier the proposed framework is intended for the network-centric evaluations of mutual exclusivities of pairs of genes rather than groups of genes. Such a choice stems form the fact that the mutual exclusivities are commonly made use of in driver gene/module identification algorithms which mostly employ pairwise mutual exclusivities. Furthermore the extensive evaluation settings proposed, the number of ME methods under study and their own computational requirements, and the potentially exponential computational complexity inherent in handling groups of genes limits the scope of the current study to evaluations of pairwise ME scorings. Nonetheless most statistical ME methods are capable of providing ME results for groups of genes as well. Regarding the ME tests considered in this study, the main ME test provided by DISCOVER is based on a pairwise test definition but it also extends the definition for possible use in quantifying the ME of a group of genes, although the experiments involving the latter are based only on simulation data. The remaining tests MEGSA, MEMo, and WeXT are all ME tests specifically designed for groups of genes. An important direction for future work is to design a suitable extension of the proposed network-centric framework to evaluate the results of ME tests on groups of genes. Design choices relevant for such an extension would involve an appropriate and computationally efficient definition of the reference groups of genes analogous to a pair of interacting genes from the set S in the current setting and the definitions of control groups analogous to X 1 and X 2 . Another future direction is to apply our network-centric framework on heterogeneous biological networks incorporating biological pathway information with PPI network data. Such incorporations have been successfully applied in other bioinformatics domains such as cancer driver identification (Hou and Ma, 2014b;Dinstag and Shamir, 2020).

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.