Gene Set Enrichment Analysis of Interaction Networks Weighted by Node Centrality

Gene set enrichment analysis (GSEA) is a powerful tool to associate a disease phenotype to a group of genes/proteins. GSEA attributes a specific weight to each gene/protein in the input list that depends on a metric of choice, which is usually represented by quantitative expression data. However, expression data are not always available. Here, GSEA based on betweenness centrality of a protein–protein interaction (PPI) network is described and applied to two cases, where an expression metric is missing. First, personalized PPI networks were generated from genes displaying alterations (assessed by array comparative genomic hybridization and whole exome sequencing) in four probands bearing a 16p13.11 microdeletion in common and several other point variants. Patients showed disease phenotypes linked to neurodevelopment. All networks were assembled around a cluster of first interactors of altered genes with high betweenness centrality. All four clusters included genes known to be involved in neurodevelopmental disorders with different centrality. Moreover, the GSEA results pointed out to the evidence of “cell cycle” among enriched pathways. Second, a large interaction network obtained by merging proteomics studies on three neurodegenerative disorders was analyzed from the topological point of view. We observed that most central proteins are often linked to Parkinson’s disease. The selection of these proteins improved the specificity of GSEA, with “Metabolism of amino acids and derivatives” and “Cellular response to stress or external stimuli” as top-ranked enriched pathways. In conclusion, betweenness centrality revealed to be a suitable metric for GSEA. Thus, centrality-based GSEA represents an opportunity for precision medicine and network medicine.


INTRODUCTION
High-throughput data consist in a wide amount of information obtained as the output of last-generation technologies. Nowadays, data from omics approaches are obtained to systematically explore human biology at a cellular or molecular level. This leads to a significant advantage in the study of a biological system in its complexity (Tebani et al., 2016). On the other hand, an analytical method capable of reading, prioritizing, and interpreting such large set of information is needed. The integration of the medical/biological language and the mathematical/computational language in a cross-disciplinary approach represents a challenge (Barabási, 2007). For this reason, new theories and new algorithms have now been generated, and a strong support of bioinformatics tools becomes necessary (Al-Haggar et al., 2013).
Gene set enrichment analysis (GSEA) is a powerful tool for the interpretation of high-throughput expression studies such as mass spectrometry-based proteomics or Next-Generation Sequencing, in order to identify insights into biological processes or pathways underlying a given phenotype. Thanks to the acquisition of an expression profile, the list of differentially expressed genes is ranked in terms of a metric associated to the observed expression change. The ranked gene list is compared to a gene set, i.e., a list of genes known to be associated with a certain biological process, gene ontology (GO), molecular function or pathway. The metric is needed to calculate the enrichment score (ES) that indicates the degree by which a gene set is overrepresented at the extremes of the ranked list (Subramanian et al., 2005).
In the Big Data era, meaningful models and impactful results may only be achieved at the network complexity level. In particular, genomics and quantitative approaches to networkbased analysis are combined to advance the frontiers of network medicine (Sonawane et al., 2019). Indeed, the multifactorial approach of network medicine is based on the multiplicity of factors that can alter a system to identify functional connections that link the clinical phenotype to multiple factors (Piñero et al., 2016).
Usually, protein networks are built starting from a list of query proteins and by retrieving the protein-protein interaction (PPI) information from a suitable database. The resulting network may be further enriched by looking for first interactors that might join isolated or distant nodes. The rationale of this approach is that proteins that were identified as phenotype-correlated by chance are likely excluded from the network, whereas proteins that for several reasons were not detected as phenotype-correlated are now reconnected to the network (Fasano et al., 2016). A standard functional analysis of the resulting network usually consists in an over-representation analysis based on the Fisher's exact test or a GSEA (Monti et al., 2019).
The Graph Theory provides tools to analyze the connectivity structure and the topology of a network. Node centrality and edge betweenness are among the most useful parameters to identify hotspots in a network. The importance of a node is the extent of its involvement in a network, and this can be measured in multiple ways. Common measures include degree, closeness, betweenness and eigenvector centralities (Ashtiani et al., 2018). The latter is at the core of the PageRank and the hypertext-induced topics search (HITS) algorithms for ranking networks (Kleinberg, 1999). The betweenness centrality of a given node takes into account the number of geodesics, or shortest paths, between any couple of nodes that include the considered node over the total number of geodesics that connect any couple of nodes of the graph. When nodes are in distinct components, the geodesic pathway is not defined. Otherwise, the betweenness centrality ranges from 0 to 1. On the other hand, the closeness centrality is inversely proportional to the sum of the shortest paths between the given node and any other node in the network.
Several algorithms have been proposed to combine pathway enrichment analysis with network topological information and their performance has been recently reviewed (Ma et al., 2019). Some of them take into account interconnectivity or adjacency information, whereas other consider centrality measures such as degree or betweenness centrality. Nevertheless, all topology-based pathway enrichment analysis methods require an expression metric that indicate whether a gene is differentially expressed in a group comparison.
Since whole genome sequencing (WGS) and whole exome sequencing (WES) are useful genomics tools to identify genes showing alterations (e.g., single nucleotide variants -SNVs, Indels, larger duplications or deletions), the results generated by these techniques can be used to generate personalized networks for each individual (Suwinski et al., 2019). More generally, altered genes can be considered as constituting the input list for generating a PPI network that reflects a disease network model in a single individual, thus allowing for the comparison between personalized networks of patients with the same phenotype and a distinct genetic background.
Here, we propose a GSEA approach to analyze PPI networks based on betweenness centrality values as ranking metrics. The proposed analysis was applied to two case paradigms. In the first case, the network was obtained from genomics data, where quantitative expression metrics are not available. In the second case, the network represents the result of a secondary data analysis, where proteomics data are scraped from the literature and quantitative information is not always available.

Definition of Gene Signatures From Genomics Analysis
The patients and their parents came from the cohort of medical genetics clinics of the ASST Sette Laghi showing clinical phenotypes related to neurodevelopmental disorders. A written informed consent to perform array-based comparative genomic hybridization (aCGH) and WES, and a consent for use of anonymous data for research were provided by the parents and relatives of the patients. The consent model and procedure were approved by the Institutional Review Board (ASST Sette Laghi Cod MOD09 IOS01SSDGM). Neurocognitive tests were based on WISC-IV (Wechsler Intelligence Scale for Children).
Language skills were assessed with Boston and TROG-2 tests. No bias selection of gender, family history or age was performed. To identify submicroscopic chromosomal rearrangements, the aCGH technology was performed after DNA extraction from peripheral blood cells (QIAmp DNA blood Maxi Kit, Qiagen, Hilden, Germany). The four patients were selected on the basis of the presence of 16p13.11 microdeletion.
For exome results, genes involved in neurodevelopmental disorders and genes related to rare diseases were considered according to literature. Among these, genes with variants in hemizygosity, homozygosity, compound heterozygosity and simple heterozygosity inherited from the healthy non-carrier parent of the 16p13.11 deletion were selected.
Array-based comparative genomic hybridization was performed using CytoSure ISCA V2 4x180K platform with a backbone resolution of 1 probe/25 Kb and 1 probe/19 Kb in critical regions, human genome reference hg19/GRCh37 and sex matched normal human DNA pool (Kreatech, Amsterdam, Holland) as control. InnoScan 710 Microarray Scanner (Carbonne, France) and Mapix (Innopsys, Carbonne, France) were used to detect and analyze fluorescence levels. Results were interpreted using Cytosure Interpret Software (Oxford Gene Technology, Begbroke, Oxfordshire, United Kingdom).
Whole exome sequencing was performed on probands and parents DNA using the Twist Human Core Exome Kit (Twist Bioscience, San Francisco, CA, United States), according to the manufacturer's protocol and sequenced on the Illumina NovaSeq 6000 platform. The BaseSpace pipeline (Illumina, San Diego, CA, United States) and the TGex software (LifeMap Sciences, Alameda, CA, United States) were used for the variant calling and annotation, respectively. Sequencing data were aligned to the hg19/GRCh37 human reference genome. Variants with a lower coverage than 10×, quality score (GQ) <15, and gnomAD minor allele frequency (MAF) >5% have been excluded.
RRN3 PHEX RRN3 a Lists include altered genes from both aCGH and WES results.

Protein-Protein Interaction Networks From Genomics Analysis
Cytoscape 3.7.2 1 was used to generate networks (Su et al., 2014). The public database IMEx 2 was queried through Cytoscape using the PSICQUIC standard (the Proteomics Standard Initiative Common QUery InterfaCe). Each list of altered genes, obtained by aCGH and WES as described above, was used to generate a network encompassing all first interactors of the gene products. The network was filtered for taxonomy ID 9606 (Homo sapiens) to remove homology inferences. A first interactors list was generated and used to query the IMEx database again, thus including second interactors too. All self loops and duplicated edges were removed. Node degrees were calculated using the NetworkAnalyzer plugin. Isolated (degree = 0) and terminal (degree = 1) nodes were removed to reduce the network size, with the only exception for isolated/terminal nodes included in the input list of altered genes.

Protein-Protein Interaction Network From Secondary Data Analysis
Proteins reported to be associated to Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS) and Alzheimer's disease (AD) were extracted from a secondary data analysis by Monti et al. (2018). Using a pubmed scraping procedure, authors obtained lists of proteins reported to be quantitatively altered in the three neurodegenerative disorders considered and built a network that encompasses all selected proteins and their common interactors using PPI spider (Antonov et al., 2009). The network was then exported as a.xgmml file.

Topological Analysis in Cytoscape
Betweenness centrality was calculated for each node in networks by using the NetworkAnalyzer utility in the Cytoscape Environment. Both size and border width of all nodes were mapped onto their betweenness centrality values. The betweenness centrality of node n i is given by where N is the total number of nodes, σ jk is the number of geodesics between nodes n j and n k , and σ jk (n i ) is the number of geodesic pathways containing node n i . Networks were reduced in size by selecting nodes having betweenness centrality ≥10 −4 for display clarity only. For each network, betweenness centralities were transformed with a Hill function in order to enhance the weight of nodes with higher values, as follows hereafter: Gene Set Enrichment Analysis Using R Gene set enrichment analysis was performed using the ReactomePA package (Yu and He, 2016) in the R environment for statistical analysis (R Core Team, 2017). Normalized enrichment score (NES), p-value, and false discovery rate (FDR) for all variables and signatures were obtained in the R environment. The database for biological processes/pathways Reactome was used for the analysis (Fabregat et al., 2017). Hill-transformed betweenness centralities were used as a suitable ranked list metric instead of those usually considered in expression studies. The list was then resampled by randomly assigning observed values to gene IDs and the analysis was performed again to identify pathways in common with those arising from the analysis of values that were obtained from the topological analysis of the networks. The R code is reported in the Supplementary Material.

Over-Representation Analysis
Over-representation analysis was performed for comparison using the WEB-based GEne SeT AnaLysis Toolkit 3 (Liao et al., 2019) using Reactome as the functional database and "human genome, protein coding" as the reference set.

Personalized Signatures From aCGH and WES Analysis
Four patients were referred to genetic investigations for diagnostic purposes and counseling for developmental disorders ranging from learning delay to intellectual disability, with or without associated congenital malformations. By aCGH screening, a microdeletion on 16p13.11 was identified. The deletion on 16p13.11 was either inherited from the healthy mother (two patients) or from the healthy father (one patient), or it was a de novo unbalance (one patient). Several mutations inherited from both healthy parents were found in the four patients by WES analysis. By merging aCGH and WES results, lists of altered genes were generated and used as personalized signatures for each patient ( Table 1).
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) are listed hereafter. For aCGH data: DECIPHER 4 350680 (Patient 1), 414066 (Patient 2), 414078 (Patient 3) and 318359 (Patient 4). For WES data: European Variation Archive Project: PRJEB41629, Analyses: ERZ1687005.

Genomics Networks Without Expression Metrics
The four lists of altered genes were used to generate personalized networks. Supplementary Figure 1 shows an overview of the four networks encompassing first and second interactors after filtering out isolated and terminal nodes (i.e., nodes with degree = 0 and degree = 1, respectively) and nodes with betweenness centrality ≤10 −4 , except for genes in the signature lists. Signature genes display different betweenness centrality values in the four personalized models with three orders of magnitude among different genes ( Table 2). None of the interaction networks has nodes with high values. Genes MYH11 and NDE1 are present in all four networks with relatively high centralities, whereas ANK3, 3 www.webgestalt.org 4 https://decipher.sanger.ac.uk/ CHRNA7, DYRK1A, KIF1A, SHANK1, and TSC2, although they are present in single networks, also have high centrality. On the other hand, several signature genes have negligible centrality, although they are present in all four networks (e.g., ABCC1, ABCC6, FOPNL, MARF1, NOMO3, and RRN3). Figure 1 shows the 10 most central nodes in each personalized network together with interacting signature genes. In all cases, top 10 nodes are among the first interactors of signature genes, with six out of 10 genes in common among patients (i.e., DISC1, ESR1, GOLGA2, GRB2, JUN, and HSCB). No signature genes displayed centrality values comparable to first interactors represented in Figure 1. An over-representation analysis of these clusters, including top 10 genes and interacting signature genes, did not lead to any significantly over-represented pathway.
In order to enrich the functional analysis with nodes with higher centrality, original betweenness centrality values were transformed with a Hill function. Figure 2 shows ranked centrality values before and after the transformation. In this way, nodes with low centrality have a negligible contribution to the ES, whereas nodes with high centrality represent leading edges. As an example, patient 1 network had 8,101 nodes. Among them, 579 nodes had a score (i.e., betweenness centrality after Hill transformation) greater than 0.2, 318 greater than 0.5, and 192 greater than 0.8. Therefore, this transformation dramatically reduced the dimensionality of the dataset by selecting less than 10% of the original network genes. As a comparison, we performed the same transformation using closeness centrality. As shown in Supplementary Figure 2, closeness centrality values are distributed in a narrow range even after Hill transformation.
Gene set enrichment analysis was performed for the four networks using transformed betweenness centrality values as the scoring metric, with Reactome as gene set database. Figure 3 shows enriched categories as a dot plot, with dot size proportional to the number of overlapping genes. All patients displayed the enrichment of one or more pathways related to cell cycle and to Rho GTPases in a different ranking in terms  of statistically significant gene ratio, i.e., the ratio between the number of genes related to each pathway and the total number of genes in the network. These pathways were not identified after resampling of scores obtained from betweenness centrality values (Supplementary Figure 3). Moreover, classic over-representation analysis of the whole network without ranking measures led to a different set of enriched pathways. Table 3 summarizes enriched pathways found by classic overrepresentation analysis of patient 1 network as a whole and by selecting subsets of nodes by their ranking (8,101, 579, 318, and 192 genes, respectively) (see Supplementary Materials).

Consensus Networks From Secondary Data Analysis
The centrality-based GSEA approach was then applied on a consensus network already available in the literature. In a previous paper by Monti et al. (2018), authors retrieved fulltext proteomics original articles focused on the development of three neurodegenerative diseases and extracted three input lists including 928 proteins for AD, 1,155 proteins for PD, and 387 proteins for ALS. The input lists were used to generate a consensus network representing interactions among proteins in the lists (Monti et al., 2018). We performed the topological analysis of the consensus network, whose results are shown in Figure 4A, where node size and border width are proportional to betweenness centrality. This network encompasses 483 nodes and 659 edges. Noteworthy, among the 95 nodes with the highest betweenness centrality (i.e., the top 20% central nodes), 78 are linked to PD, either in common or not with other diseases (Figure 4B). Most central genes are TIMP2 and EIF3A, both associated to PD only, and GABARAPL2, associated to PD and AD. However, these three genes did not lead to any significant association with PD pathogenesis, when taken alone. All nodes related to PD were then extracted from the whole network for further topological analysis and GSEA ( Figure 4C). Gene set enrichment analysis was performed for the whole network ( Figure 4A) and for the PD network ( Figure 4C) using Hill-transformed betweenness centrality values as the scoring metrics, with Reactome as gene set reference database. Figure 5 shows enriched categories as a dot plot, with dot size proportional to the number of overlapping genes. For the whole network, enriched pathways mainly refer to catabolic pathways. Immune system appeared to be significantly enriched; however, it was also found after resampling of centrality metrics (Supplementary Figure 4). On the other hand, most significantly enriched pathways for the PD network were "Metabolism of amino acids and derivatives" and "Cellular response to stress or external stimuli, " that were not observed in the analysis of the resampled dataset.  Monti et al., 2018). Node size and border width are proportional to betweenness centrality, whereas edge width is proportional to edge betweenness.

DISCUSSION
Gene set enrichment analysis is usually performed to associate a disease phenotype to a group of genes by using quantitative expression data. Unlike over-representation analysis, GSEA has the advantage of considering the role of each gene by taking into account its correlation to the clinical phenotype or any other categorical variable (treatment, stage, time, and other experimental conditions) (Subramanian et al., 2005). Nevertheless, a quantitative metric is not always available in a dataset. Here, GSEA based on betweenness centrality of a PPI network is described and applied to two cases, where an expression metric is missing.
Actually, building a network starting from a list of entities such as genes or proteins is based on publicly available evidence of interaction, and the topology of the network accounts for the specific role each protein plays in the network itself. The degree of each node (i.e., the number of interactions with other nodes) is clearly a measure of how much this node is integrated in the network. We decided to consider the betweenness centrality as a suitable metric to rank network nodes in terms of the role they play in the disease represented by the network. Among other centrality measures, betweenness centrality discriminates centrality of nodes over a range that spans orders of magnitudes, unlike closeness centrality values that are distributed in a narrow range and therefore are slightly affected by the resampling procedure. Signature genes of the first case analyzed here showed very different centrality values in a range from 10 −6 to 10 −3 . Worthy of note, several genes showing high centrality values are involved in neurodevelopmental disorders. Indeed, some of them (ANK3, BCKDK, CHRNA7, DYRK1A, MECP2, SHANK1, and TSC2) are included in the database SFARI (Banerjee-Basu and Packer, 2010) with high scores, supporting that the gene contributes to the disorder pathogenesis or it is a strong candidate.
The personalized networks are assembled around a cluster of very central genes, whose composition is partially overlapping. Interestingly, the top 10 clusters never include signature genes, and the weight of every member of the clusters may change among patients. As an example, the leucine-rich repeat kinase 2 (LRRK2) is present in all four networks. However, its centrality is low in patient 1 (C B = 7.56 × 10 −3 ), whereas it has the top value in the other three patients. Recently, a role for LRRK2 in cognitive development has been proposed, thus linking LRRK2 to intellectual disability and autism (Labonne et al., 2020). Another gene present with very high centrality in all four networks, ESR1, has been related to the pathogenesis of autism and autism related symptoms (Wang et al., 2016).
We used the tools included in the ReactomePA package (Yu and He, 2016) to associate a topological parameter to each gene in the dataset in place of commonly used expression data. However, we observed that betweenness centrality values show a steep decrease within the first nodes in the ranked list. To select a sub-population of central nodes, betweenness centrality values were transformed with a Hill function that assigns high scores to less than 10% nodes. In this way, a drastic change in the weight associated to each gene is expected. Using Hill-transformed betweenness centrality values, GSEA of the four personalized networks yielded a different ranking of enriched pathways or even different pathways. As an example, the pathway "Cell Cycle, Mitotic" is enriched in all four patient networks. However, its ranking differs in the four dot plots. From the pathogenetic point of view, all patients share a neurodevelopmental impairment that is brought to evidence by the present analysis (e.g., "Mitotic G1-G1/S phase" pathway). It is known from literature that neurodevelopmental disorders are related to alterations of the cell cycle, and a central role is played by genes involved in neurogenesis regulation and cell adhesion processes (Pramparo et al., 2015). Moreover, our evidence supports the conclusions obtained so far on NDE1, the candidate gene for the susceptibility to the neurocognitive disorder given by the deletion on 16p13.11. Using a bottom-up approach, NDE1 and NDEL1 (its paralogue) were deeply characterized (i.e., biochemistry, PPIs, role in neuronal differentiation and cortical development, pathology caused by their alterations) (Bradshaw and Hayashi, 2017). In particular, the role of NDE1 and NDEL1 in cell cycle progression, neurite outgrowth and development and neuronal migration has been highlighted, thus supporting our results.
As a second application of the proposed approach, we analyzed a PPI network previously published by our group (Monti et al., 2018). This network encompasses proteins that were reported to quantitatively change in three neurodegenerative diseases by independent articles. Two of them are movement disorders (PD and ALS) and one is a cognitive disorder (AD). The original aim of the work was to distinguish between proteins that are specifically associated to a single neurodegenerative disease and proteins generically linked to neurodegeneration. Topological analysis of the network adds substance to this purpose. We observed that most central proteins are often linked to PD, either standalone or associated to other disorders. Indeed, the most evident colors in Figure 4B are pink (PD), red (PD and AD) and black (PD, AD and ALS). To gain more evidence to pathways associated to most central nodes, we extracted a subnetwork to represent all proteins that were altered in PD. This selection, together with the topological analysis, allowed us to improve the specificity of GSEA.
The use of a topological parameter to perform GSEA in the absence of any quantitative expression metric was never done before. This application of GSEA has been performed on two datasets, where expression metrics were not available. Although we decided to consider betweenness centrality as a ranking metric, further analyses could be performed by identifying other candidate topological parameters and using these data to obtain new pathway networks and new information.

CONCLUSION
The centrality-based GSEA procedure is another powerful tool to investigate personalized networks or disease networks, in order to unveil hidden information and to specifically plan experimental investigations. This analysis can be applied to a wide variety of networks, also in combination with clustering tools. Since biological networks are usually too large and are characterized by high modularity, topological descriptors could assist the identification of functional modules (Sharma et al., 2017;Choobdar et al., 2019). Altogether, the functional analysis of networks by weighting nodes in terms of their centrality could also provide a valuable tool to explore pathogenetic mechanisms and to precisely identify sensitive targets for drug development or repositioning. In this view, centrality-based GSEA represents an opportunity for precision medicine and network medicine.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by ASST Sette Laghi Ethical Committee. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
AZ performed the cytogenetic analysis and bioinformatics analysis and wrote the manuscript. ML and TA revised the GSEA results and the manuscript. PG performed the cytogenetic analysis and contributed to the interpretation of whole exome sequencing results. DC and AN performed the whole exome sequencing analysis. RC enrolled the patients, carried out medical genetics visits, supervised genetics analysis, and revised the manuscript. MF supervised the project, wrote the code, and wrote the manuscript. All authors approved the final version of the manuscript.

FUNDING
This study was supported by the Volunteer Association "la gemma rara."