A Comparative Study of Cluster Detection Algorithms in Protein–Protein Interaction for Drug Target Discovery and Drug Repurposing

The interactions between drugs and their target proteins induce altered expression of genes involved in complex intracellular networks. The properties of these functional network modules are critical for the identification of drug targets, for drug repurposing, and for understanding the underlying mode of action of the drug. The topological modules generated by a computational approach are defined as functional clusters. However, the functions inferred for these topological modules extracted from a large-scale molecular interaction network, such as a protein–protein interaction (PPI) network, could differ depending on different cluster detection algorithms. Moreover, the dynamic gene expression profiles among tissues or cell types causes differential functional interaction patterns between the molecular components. Thus, the connections in the PPI network should be modified by the transcriptomic landscape of specific cell lines before producing topological clusters. Here, we systematically investigated the clusters of a cell-based PPI network by using four cluster detection algorithms. We subsequently compared the performance of these algorithms for target gene prediction, which integrates gene perturbation data with the cell-based PPI network using two drug target prioritization methods, shortest path and diffusion correlation. In addition, we validated the proportion of perturbed genes in clusters by finding candidate anti-breast cancer drugs and confirming our predictions using literature evidence and cases in the ClinicalTrials.gov. Our results indicate that the Walktrap (CW) clustering algorithm achieved the best performance overall in our comparative study.


INTRODUCTION
Drugs physically bind specific target proteins and activate downstream effectors to ultimately change the gene expression profiles of tumor cells, which are highly modular in the context of molecular interaction networks (Hartwell et al., 1999;Berger and Iyengar, 2009;Sah et al., 2014). Investigation of the modular properties of interactomes, such as protein-protein interaction (PPI) networks, can facilitate further discovery of the underlying molecular interaction mechanisms that drive cell response under specific conditions, such as drug treatment (Isik et al., 2015). Previous studies have used interaction networks to predict gene function, identify novel disease-related genes and to understand the overlapping association across disease phenotypes (Sharan et al., 2007;Lee et al., 2008;Menche et al., 2015). Recently, computational approaches have been used to build topological clusters as functional modules in PPI networks. For example, Spirin and Mirny identified modules in the PPI network through three methods and subsequently demonstrated the association between topological clusters and functional modules (Spirin and Mirny, 2003). Additionally, Kenley and Cho proposed a graph entropy algorithm to identify functional clusters from PPI networks (Kenley and Cho, 2011). These efforts have led to more effective modeling of PPIs and the drug targeting thereof with respect to specific diseases.
In the last decades, cancer cell lines have been widely used as models for understanding cancer biology and cellular response to anticancer drugs (Goodspeed et al., 2016). These cell lines have not only been comprehensively profiled at the molecular level, but they have also been used in large pharmacogenomic studies. The Connectivity Map (CMap) contains to date more than 7,000 expression profiles in five cancer cell lines (MCF7 and ssMCF7, human breast adenocarcinoma cell line; HL-60, human promyelocytic leukemia cell line; PC-3, human prostate cancer cell line; SK-MEL-5, melanoma cell line), screened for transcriptional responses induced by 1,309 small molecule compounds at varying concentrations from 6,100 microarray experiments conducted using the Affymetrix HT_HG_U133A array with 22,283 probesets (Lamb et al., 2006). Previous studies have also reported that the CMap data can be used to link transcriptional biomarkers to known mechanisms of drug action, such as the thioridazine inhibition of the phosphatidylinositol-3 -kinase (PI3K)/AKT pathway in ovarian cancer cells (Bae et al., 2011), allowing the identification of drug target proteins and facilitating the process of drug repurposing (Babcock et al., 2013;Iskar et al., 2013;Wu et al., 2013;Musa et al., 2017). In this study, we extensively used the transcriptomic and pharmacogenomic data pertaining to the cell lines found in the CMap to further understand the topological clustering in PPI networks from a comparative computational approach. We focused our interest on the MCF7 cell line, as it is the most commonly used cell line in human breast carcinoma, established in 1973 at the Michigan Cancer Foundation (Holliday and Speirs, 2011). Due to the hormone sensitivity found in MCF7 through the expression of the estrogen receptor (ER), this cell line has been reported as an ideal model to study hormone response in breast cancer (Levenson and Jordan, 1997).
Over the past few years, systems biology has made significant progress in addressing fundamental biological questions by making use of PPIs, leading to practical applications in drug target identification and drug discovery (Iskar et al., 2013;Wu et al., 2013;Ivliev et al., 2016). The STRING database (version 10.0) provides critical assessment and integrates all possible direct and indirect PPIs for more than 2,000 species, including 19,247 proteins with 8,548,002 interactions for Homo sapiens (Szklarczyk et al., 2015). Furthermore, various drug targeting measures have been developed, including a method called local radiality (LR) by Isik et al. (2015), integrating perturbed gene expression with PPI network information to prioritize drug target identification through different essential protein detection algorithms. The STRING database assigns a confidence score to each predicted protein-protein association calculated based on several sources, including published literature, experimental interaction data, and data concerning co-regulation of genes. However, despite these ongoing efforts that investigate cellular PPIs, due to the varying gene expression profiles in different cell lines, proteins exhibit dynamic behavior in interaction that current cell-agnostic PPI assignment methods do not fully recapitulate (Holliday and Speirs, 2011;Liang et al., 2014). As such, accuracy is poor for functional cluster prediction in individual cancer cell lines using existing clustering algorithms, and there remains a lack of convergence between the algorithms due to their diverse module detection theories and methods (Liu et al., 2017). Inspired by this, we developed a cell-based PPI network using the MCF7 cell line as an alternative to current cell-agnostic models. In this study, we compared the properties of functional clusters elucidated from a cell-based PPI network in the MCF-7 cell line produced by four module detection algorithms. We subsequently extract drug-induced functionally perturbed genes from the big clusters (defined as clusters with a size of greater than or equal to 10) detected by the algorithms and integrate them with MCF7 cell-based network information to improve the prioritization of target genes. Finally, we illustrate the potential association between perturbed genes and clusters in the MCF7 cell-based PPI network through investigations in drug repurposing. Our results highlight the validity of this comparative approach to identify novel anti-breast cancer drugs, which were further validated using literature and data from ClinicalTrials.gov (Figure 1). Furthermore, our results indicate that the Walktrap (CW) algorithm yields the best performance for detecting functional clusters in the PPI network.

Establishment of Cell-Based PPI Network Using MCF7
We downloaded the human PPIs from the STRING database (version 10) (Szklarczyk et al., 2015) to serve as the unfiltered PPIs for our cell-based PPI network. The ENSP IDs of the proteins found in STRING human PPI network were matched to their corresponding ENSG gene IDs using the R package biomaRt (version 2.34.2) (Badalà et al., 2009). To create the cell-based PPI network, we used as filtering criteria the gene expression data of the MCF7 cell line obtained from the Broad-Novartis Cancer Cell Line Encyclopedia (CCLE). The CCLE project provides public access to genomic data ( Supplementary  Table S1), as well as the analysis and visualization thereof for approximately 1,000 cancer cell lines (Barretina et al., 2012;Smirnov et al., 2018), with MCF7 being one of them. The gene expression data of the MCF7 cell line in CCLE were obtained using the R package PharmacoGx (version 1.12.0) (Smirnov et al., 2016), which contains the pharmacological profiles of several hundred cell lines generated by CCLE, the Genentech Cell Line Screening Initiative (gCSI) (Klijn et al., 2014), the Genomics of Drug Sensitivity in Cancer (GDSC) (Garnett et al., 2012) and the GRAY dataset, generated in Dr. Joe Gray's lab at the Oregon Health and Science University (Daemen et al., 2013). To calculate the expression value of genes found in the MCF7 cell line, we converted the number of fragments per kilobase per million (FPKM) mapped reads units were converted to log2(FPKM+1) as the expression value of genes. We defined expressed genes is defined as those with a log2(FPKM+1) value of greater than or equal to 1 [log2(FPKM+1) ≥ 1] (Gonzalez-Porta et al., 2013). Using this definition, the MCF7 cell line expresses 13,096 genes. We then developed the cell-based PPI network subsetting the human PPI network to only include these expressed genes MCF7.
To limit non-existent and false-positive interactions between proteins in the MCF7 cell-based PPI network, we further filtered this network based on PPI confidence scores. The STRING repository has a confidence score for each PPI according to sources including existing literature, as well as co-expression and experimental data. A PPI has a high confidence score if there is supporting evidence for the interaction from a wide range of these sources. We selected PPIs with a confidence score of greater than 800 to be included in our final MCF7 cell-based network, which was also used in all downstream analysis. The MCF cellbased PPI network contains 7,904 protein members projecting 2,13,422 interactions, which represent the nodes and edges of the network, respectively.

Functional Cluster Identification Algorithms
To date, a number of different criteria have been proposed for defining clusters in networks (Gao et al., 2009). A cluster is defined as a group of nodes in a network that are densely interconnected to each other, while sparsely connected to other the nodes of the same network. In this study, four widely used clustering algorithms were evaluated, compared, and categorized based on the methods applied for cluster identification (Orman et al., 2012) (Supplementary Table S2): These algorithms were selected due to their short runtimes determined in preliminary tests using the R package igraph (version 1.2.1) (Csardi and Nepusz, 2006).
The leading eigenvector (CLE) algorithm detects densely connected clusters in a network graph by calculating the leading non-negative eigenvector of the modularity matrix of the graph (Newman, 2006). Pons and Latapy have proposed a module detection hierarchical structure algorithm called the Walktrap (CW) for module detection, which is a hierarchical structure algorithm. They argued that short random walks tend to stay in the same cluster (Pons and Latapy, 2005). In the label propagation (CLP) algorithm, each node is initialized with a unique numeric label and chooses the label that is dominated by its neighbors during an iterative process. The CLP algorithm tends to gather densely connected nodes with the same label that comprise a cluster (Raghavan et al., 2007). Lastly, we also obtained topological gene clusters by implementing the infomap (CI) algorithm, proposed by Rosvall and Bergstrom (2008), to decompose the cell-based PPI network into clusters by employing random walks to analyze the information flow through a network. The igraph R package igraph (version 1.2.1 number) (Csardi and Nepusz, 2006) was used to identify clusters in the cell-based PPI network using the four cluster detection algorithms with on default arguments. Afterwards, we extracted the big clusters produced by the four algorithms, defined as those with a size of greater than or equal to 10 members. We also defined small clusters as those with less than 10 members. The big clusters detected by each algorithm were considered as functional clusters and subsequently used for target gene prediction and drug repurposing.
To elucidate the effect of network size for the clustering by CI and CW, we calculated the ratio of proteins in the big clusters to the total number of proteins in both the cell-agnostic PPI and the MCF7 cell-based PPI networks. The formula for this calculation is shown below: where P b represents number of proteins in the big clusters, and P t stands for the total number of proteins in each PPI network.

Evaluation of Biological Processes
The biological processes in Gene Ontology (GO), which provide functions of genes and gene products determined by biological process (BP), molecular function (MF), and cellular component (CC), were downloaded from Molecular Signatures Database 1 . We enriched connected protein members of each big cluster in the cell-based PPI network to the GO terms to annotate protein function as well as to confirm the mechanisms of action of candidate anti-breast cancer drugs through hypergeometric tests using the R package piano (version 1.18.1) (Väremo et al., 2013). The biological process GO terms with false discovery rate (FDR) ≤ 0.05 were considered.

Drug Perturbation Signatures in Cancer Cell Lines
To curate drug perturbation data, we first accessed the normalized gene expression in the MCF7 cell line from CMap via PharmacoGx (version 1.12.0) (Smirnov et al., 2016). We used the drugPerturbationSig function to identify genes whose expression 1 http://software.broadinstitute.org/gsea/msigdb/index.jsp is differentially expressed upon drug treatment, creating a signature representing gene expression changes induced by each drug. The algorithm of this function uses a linear regression model for the effect of drug concentration on gene expression in cell lines, while adding a term controlling for the batch effect in the CMap dataset: where G stands for gene expression, C i indicates the concentration of a given compound, T denotes the type of cell line, D represents the duration of the experiment, and B represents the experimental batch, while βs are the regression coefficients. The significance of the association between a drug and a gene was estimated by the statistical significance of βi, which was calculated using an F-test to determine the improvement in fit after inclusion of the term. The function returns four values including the estimated coefficient for concentration, the t-statistic, the p-value and the FDR associated with that coefficient in a 3D array with genes and drugs. The t-statistic, carries information regarding the direction (up or down) of regulation of a given gene to identify perturbation in microarray experiments (Guo et al., 2006;Jeffery et al., 2006). The t-statistic returned by the lm function in R was calculated by following equation: where βi is the regression coefficient of the sample regression line, and SE is the standard error of the slope. After preprocessing, genes whose p-value was less than 0.01 were considered perturbed and their absolute t-statistic values were used as differential expression data. In CMap, the MFC7 cell line was treated with 1,255 drugs. Using the drugPerturbationSig function, we discovered in this cell line 11,833 distinct perturbed genes whose expression profiles were subsequently analyzed downstream this study.

Integration of Gene Perturbation Data With Cell-Based PPI Network
We argue that perturbed genes participate in one of the big clusters generated by the clustering algorithms, and will be highly related to the mechanism of action of the drug that induced said perturbation. To integrate gene perturbation data into the MCF7 cell-based PPI network, we further filtered the network so that the perturbed genes were included only if they participate in the big clusters of each algorithm.

Drug Target Prioritization in MCF7 Cell-Based PPI Network With Gene Perturbation Data
To date, several methods have been used to combine gene perturbation data with network information to find the target genes of compounds. Laenen et al. (2013) proposed a local measure called correlation diffusion (CD) that stands for a random walk-based algorithm to predict drug target genes. They first filtered out all connectivity correlation coefficients based on a certain t-threshold and normalized the coefficient value for each retaining correlation of its corresponding gene. Afterwards, they only considered single genes and their directly connected perturbed neighbors in the filtered network to compute the score for target gene prioritization. In contrast, a global measure using a shortest path (SP) approach combines the entire network topology propriety and perturbation data to calculate the gene score (Isik et al., 2015). The hypothesis for SP is that perturbed genes should have the shortest path to the target genes in the network. For this study, we used CD and SP to prioritize drug targets produced by the four clustering algorithms to compare the accuracy of each. We first revised the CD and SP methods to prioritize drug targets in the MCF7 cell-based network (CCD and CSP, respectively) with the perturbed genes, and the outcome of the evaluation of the four algorithms by the two methods are assigned new acronyms (CCD_CLE, CCD_CLP, CCD_CW, and CCD_CI; CSP_CLE, CSP_CLP, CSP_CW, and CSP_CI, respectively). Here, we show the revised formulations as follows for the two algorithms: For CCD: where P ij is the confidence value score of genes i and j expressed in the MCF7 cell-based PPI network. Normalizing the confidence scores based on the remaining connections per protein, the elements of a normalized interaction matrix M can be defined as formula 4. Multiplication of the T pr with this matrix results in a ranking score for each gene: where T pr is a t-statistic value of each perturbed gene. The perturbed genes in CCD_CLE, CCD_CLP, CCD_CW, and CCD_CI are filtered as those found in the big clusters of each clustering algorithm. The ranking score of each gene in CCD is computed by integrating the MCF-7 cell-based PPI network with all Prs perturbed genes. The control group (CD_C) combine the PPI network (confidence score > 800) with all Prs perturbed genes induced by each drug. For CSP: where P ij is the confidence score of genes i and j expressed in the MCF7 cell-based PPI network. Pr is defined as a perturbed gene of MCF-7 to a drug, and it belongs to a big cluster as determined by the four algorithms. Lastly, we sort genes based on the score in increasing order for the CSP algorithm. Prioritized genes in CSP_CLE, CSP_CLP, CSP_CW, and CSP_CI were produced by the shortest path algorithm based on the functional perturbed gene and confidence score in MCF-7 cell-based PPI network. Gene's scores in CSP group is calculate by shortest path between gene in MCF-7 cell-based PPI network and all perturbed gene induced by drug. We also computed the distance between genes in PPI network only filtered by confidence scores and all perturbed genes as control (SP_C).

Drug Target Identification and Selection of Candidate Drugs
The target genes of the drugs used to treat the cell lines in the CMap database were downloaded from Drugbank 2 (Wishart et al., 2008), a database that comprehensively combines information about drug targets with that of drug action and that has been widely used for drug target discovery, drug design, and drug interaction prediction. We used Drugbank to determine target genes of drugs in the MCF7 cell-based PPI network. In the Drugbank database, 3,291 proteins are marked as the targets of approximately 4,900 drugs, 60% of which are including Food and Drug Administration (FDA)-approved drugs and tagged 10% are drugs under investigation (labeled as "experimental drugs"). The gene symbols were obtained by matching the UniProt identifier of the target genes in the drug target identifiers file 3 . The UniProt database contains structure and function information of completely sequenced proteins, annotated with unique and stable identifiers (Consortium, 2017). After the mapping and filtering steps, 298 drugs with 270 target proteins were identified for the treated MCF-7 cell lines in CMap. This information was subsequently used to evaluate the performance of drug target prediction.

Evaluation of Performance of Drug Prioritization
We evaluated the performance of the two target prioritization algorithms using precision-recall curves (PRC). The true positive (TP), true negative (TN), false positive (FP), and FN (false negative (FN) predictions were defined as previously described (Laenen et al., 2013). We divided the predictions into true and false sets based on each cut-off. TPs are all correctly predicted known targets above or equal to the rank cut-off. FPs are all proteins ranked above the cut-off, which are not in the known target set. FNs are known drug targets that are ranked below the cut-off and all remaining proteins are defined as TNs. The recall indicator is defined as whereas the precision indicator is defined as The two indicators were calculated using all possible thresholds from 1 to 13,184 (indicating the number of nodes in the MCF7 cell-based PPI network filtered by confidence scores) for the ranking list of drug targets. We made a PRC for the precision and recall of different rank cut-offs and calculated an area under the curve (AUC) value. Simple expression ranking can be set up as a baseline approach for performance assessment of the CCD and CSP network-based drug target ranking methods.

Characteristic of Drugs in Clusters Across Four Detection Algorithms
We calculated the proportion of drug-perturbed genes in each big cluster across the four detection algorithms. The characteristic of a drug is evaluated by equation (10): Where N Ci denotes the number of perturbed genes within the ith big cluster, and N P is the total number of drug-perturbed genes.

Mechanism of Action of Candidate Breast Anticancer Drugs
For each drug, we extracted the perturbed genes of the drug within the biggest clusters to analyze function enrichments using hypergeometric tests. The biological process GO terms with an FDR of less than ≤ 0.05 were considered and overlapping biological processes were identified among candidate anti-breast cancer drugs. We generated the relationship of GO terms by Supek et al. (2011), which summarizes and visualizes long lists of GO terms. The Cytoscape tool (version 3.6.1) was used for GO term visualization.

Clusters From the MCF7 Cell PPI Network
To investigate differences in clustering structure produced by the four clustering algorithms, CLE, CLP, CW, and CI, we applied them to the MCF-7 cell-based PPI network, consisting of 7,904 proteins and 213,422 interactions. The four algorithms displayed a diversity of topological module sizes. We defined big clusters as those with a size of greater than or equal to 10 members and small clusters as those less than 10 members. The number of small clusters was larger than the number of big clusters in all algorithms (Figure 2A), which comprise a smaller proportion of the total number of clusters (15%, CW; 27.6%, CLE; 29%, CLP and; 0.3%, CI) ( Supplementary Table S3), consistent with a previous report that compared the modular structure of human cell-agnostic PPI networks generated by seven community detection methods (Liu et al., 2017). The CW algorithm identified the largest number of big clusters (n = 55), as well as 341 small clusters. The CI algorithm detected 14 big clusters and 2,897 small clusters. Moreover, we calculated the ratio of the number of proteins in the clusters to the total number of proteins for all algorithms; the number of nodes was subsequently compared between (≥10 members) and small clusters (2 ≤ members < 10). Our results show that although the number of big clusters did not represent the majority of all clusters, the number of nodes included in big clusters of each algorithm, except CI, takes up the largest portion of the proteins in MCF7 cell-based PPI network. The distribution of cluster sizes is a crucial feature in determining the cluster structure in the four algorithms. It seems to follow a power-law distribution indicating that their sizes are heterogeneous, with many small clusters and only a few large ones ( Figure 2B). For CLE, each cluster with a distinct size was detected only once, whereas the size distribution of the clusters detected by the other three algorithms had longer tails (Supplementary Figure S1).
To validate the agreement amongst the big clusters produced by the four algorithms, we computed the Jaccard index of protein members in all the big clusters between paired algorithms. We observed that most protein members in the big clusters identified by CW, CLP, and CLE are shared, as determined by high Jaccard indices ( Figure 3A). We argue that the reason for this finding is that PPI networks naturally contain an underlying cluster structure and these three algorithms are hierarchical clustering methods. The CI algorithm is an exception because it identifies clusters based on coding theory and does not follow a hierarchical approach (Peel, 2010;Yang et al., 2016). Furthermore, clusters identified by CI require minimal bandwidth to represent some random walks in the network. We found that all four algorithms agree on 418 proteins in the big clusters. The CLP, CW, and CLE algorithms also share 6,487 proteins in their respective big clusters ( Figure 3B). The big clusters generated by CLE contain a greater number of protein members (7,821 proteins) than other methods. On the other hand, 428 of 7,904 proteins are in the big clusters generated by CI. To elucidate the potential reason of cluster structure difference produced by CI, we further compared the number of protein members in the big clusters produced by CI with CW in the unfiltered cellbased PPI network (containing all connections) and filtered cell-based PPI network (confidence score > 800). CI is a compression-based approach, finding the clustering structure that provides the shortest description length for a random walk on the graph. Compared with the unfiltered cell-based PPI network, more proteins are grouped into big clusters in the filtered cell-based PPI network (Supplementary Figure S4A). Supplementary Figure S4B shows that the number of proteins in the filtered cell-based PPI network (confidence score > 800) plays an important role for clustering using the CI algorithm; more protein members can share the fixed connections to increase the median degree of each network (Supplementary Table S5), thus getting a greater number of nodes for big clusters. Thus, the CI algorithm considers most protein members with less degrees in the network as unimportant details and filters them out during clustering. Orman et al. (2011) also found that CI algorithm can produce a large number of small clusters from generated networks. Unlike the CI algorithm, the number of proteins and the number of connections has less impact for cluster size by the CW algorithm (Supplementary Figure S5).

Go Term Analysis for Clusters
Clusters have been considered as function modules in the past decades (Qin and Gao, 2010). However, different clustering  The overlap of protein members that are clustered in big cluster among algorithms. All proteins in big clusters of CW and CI are shared with big clusters produced by CLE and CLP, while CLE and CLP contain its specific proteins in the big clusters. methods generate different protein clusters with different sizes and protein members, potentially confounding the function annotation for each cluster. For each algorithm, we searched for big clusters with shared biological processes, and we found that these big clusters were distinct when enriched to GO terms. This suggests that all proteins in the big clusters are essential not only in cancer cell growth but also participate in diverse biological processes, which are identified by different cluster detection approaches. In the heatmap shown in Figure 4, the rows represent the top 20 significant biological processes and the columns represent the clusters from the MCF7 cell-based PPI network. We found significantly distinct clusters in all of the four algorithms and highlighted the most significant biological processes in the biggest cluster in each of the four algorithms. For example, in the CW heatmap (Figure 4), the annotation FIGURE 4 | Functional analysis of big size clusters. This heatmap displays the top 20 significant associations between clusters and biological processes across four clusters detection algorithms. Color denotes enrichment of a given cluster with a biological process, hypergeometric log p-value after Benjamin-Hochberg adjustment. Cluster height reflects the number of interrelated processes associated with the given cluster. Cluster weight reflects clusters that shared this significant function. For major cluster based on each cluster detection algorithm, key biological themes are marked.
"Transcription" denotes the biggest cluster determined by this algorithm and is associated with the regulation of transcription through the RNA polymerase II promoter. In addition, Supplementary Table S4 shows the adjusted p-values of these processes associated with the clusters. Approximately 3,000 biological processes of the big clusters are shared among CW, CLP, and CLE (Supplementary Figure S2). Furthermore, 51 of the 55 big clusters detected by CW contain biological processes that map to GO terms (Supplementary Figure S3), while a smaller number of big clusters in CLP, CLE, and CI (25, 12, and 5) is seen containing such processes. These results indicate that CLE, CLP, and CI can identify more complex functional clusters than CW.
Cluster Improved Shortest Path Algorithm Performance Isik et al. (2015) integrated perturbed genes from drug-treated cell lines and human PPI networks to identify drug target genes. However, the expression profiles of genes in different cell lines are different and may influence the interaction between proteins . To improve the performance of drug target identification, we first selected genes exhibiting drug-induced perturbation signatures from the big clusters produced by the four algorithms and considered these genes as functionally perturbed genes (FPGs) of the drugs. Then, we focused on integrating the FPGs of each of the 298s mapped to Drugbank with the MCF-7 cell-based PPI network using our revised CSP and CCD algorithms to combine perturbed genes with MCF-7 cell-based PPI networks to prioritize target genes and calculate the AUC to compare the performance of several conditions. We prioritized 270 target genes for the 298 drugs using the ranks produced by the two algorithms, which were subsequently sorted to predict the possible targets of a given drug. Figure 5A shows the number of perturbed genes of the drugs in the network after several different filtering conditions. We observed 11,715 perturbed genes for the 298 drugs in MCF-7 before the application of any PPI networks. After integration of the human cell-agnostic PPI network from STRING, 8,987 perturbed genes were found, while 5,989 of these genes remained once the human PPI network was further filtered to control for the MCF7 cell line (i.e., the MCF7 cell-based network). We found that 5,321 functionally perturbed genes appeared in both MCF-7 cellbased PPI network and big clusters produced by CW algorithm. We found that removing non-existing interactions in the network reduced the number of FPs while elucidating the shortest distance from the target genes to the perturbed genes associated with individual drugs. We validate our comparative and integrative approach as we found that using perturbed genes found only in functional clusters for calculating the shortest distance proves more effective than using all perturbed genes. For both CD and SP, we integrated the cell-agnostic PPI network (with confidence score > 800) with all druginduced perturbed genes, as opposed to FPGs, as control groups (SP_C and CD_C) ( Figure 5B and Supplementary Figure S6). We observed in the controls that the data of the target genes and their corresponding drugs produced by CD are less than that generated by SP (Table 1). We found that the cell-based PPI network improved the performance of drug target ranking by first evaluating all perturbed genes without the integration of clusters (Figures 5C,D). The AUC value as determined by CSP for the cell-based group (CSP, 0.30) was higher than that of the cell-agnostic control (SP_C, 0.29). The CCD method found similar results: the AUC value of the cell-based group (CCD, 0.09) was also higher than that of the control group (CD_C, 0.06). For the CSP approach, perturbed proteins found in functional clusters (i.e., proteins translated from FPGs) can further improve target gene ranking; this outcome was not seen for CCD (Figures 5C,D). For instance, CSP_CLE, CSP_CLP, and CSP_CW have a higher AUC value compared to CSP (0.38 AUC). Figure 5D shows the results are similar for the AUC value across CLE, CLP, CW, and, CCD. The shape of the baseline curve is significantly different from all other curves. Collectively, these results indicated that network-based approaches demonstrate improved performance for drug target ranking, although the participation of perturbed proteins in functional clusters contribute less to drug target prioritization calculated by CCD. For the cell-based PPI network, the top 200 target genes rank in CLE, CLP, and CW, including 66 target genes shared between the three algorithms and corresponding to 127 drugs with significantly lower than CSP (Wilcoxon signed rank test, p-value = 0.04604, 0.04989, and 0.0347, respectively) and SP_C (Wilcoxon signed rank test, p-value = 0.000978, 0.001027, and 0.0006714, respectively) ( Figure 5E). Top 200 target

Drug Repurposing in Breast Cancer Based on the Cluster
Comparing Potential Anti-cancer Drugs Identified by Cluster Detection Algorithms Understanding the association between drugs and diseases at the molecular level is critical to unveil disease mechanisms and treatments (Zhao and Li, 2012). Drugs interact with targets and off-targets, inducing downstream pathway activity causing perturbations in the cellular transcriptome (Isik et al., 2015). The perturbed genes reflect the cellular response after drug treatment. If the perturbed genes of drugs that bind different target genes participate in the same functional clusters, these drugs could share a similar mode of action. Different algorithms generate different protein clusters with different sizes and protein members, potentially influencing functional annotation of each cluster. The hypothesis is that the fraction of perturbed genes of a drug in the clusters can reflect the properties of the drug.
In this study, we used gene perturbation data integrated with clusters in the MCF7 cell-based network to find promising anti-breast cancer drugs. For each big cluster, we calculated the fraction of perturbed genes by each of the 298 drugs out of the total number of genes in that cluster, resulting in a matrix that illustrates the effect of each drug on each big cluster ( Figure 1C). We then calculated the Pearson correlation of the perturbed gene proportions in the big clusters for the 298 drugs and compared it between these drugs and FDAapproved breast anticancer drugs, such as tamoxifen. Tamoxifen, an ER-targeting prodrug, is the most commonly administered chemotherapeutic drug in breast cancer patients. It is also the drug reported to induce the most gene perturbation in breast cancer cells. Figure 6 illustrates the distribution of the Pearson's correlation between tamoxifen and the perturbed gene fractions. We found a strong correlation (r > 0.5) across CW, CLE, CLP, and CI. The reason for this result is that the bigger clusters of each algorithm usually include the most perturbed genes of drugs. We further predicted the top nine drugs from the drug ranking list for the each of the four algorithms as our candidate anti-breast cancer drugs, sorted based by the Pearson correlation coefficient. We demonstrated our predictions of each algorithm using evidence in the Clinical trials 4 and literature ( Table 3). All top nine drugs predicted by CW demonstrated ability to inhibit cell growth or induce apoptosis in breast cancer cells, while the other three algorithms predicted fewer candidate anti-breast cancer drugs that fulfilled this aim. We found fulvestrant, geldanamycin, loperamide, and ouabain (an endogenous hormone) to be the common promising candidate anti-cancer drugs identified by both CW and CLP. Fulvestrant, a known ER antagonist, is recommended as a treatment option for ER-positive breast cancer (Lamb et al., 2006). Geldanamycin, a heat shock protein 90 inhibitor, has been evaluated for the cancer treatment in a phase III clinical trial (Shipp et al., 2011). Loperamide, a peripheral opiate agonist that induces a dosedependent apoptosis and G2/M arrest in human cancer cell lines (Wen et al., 2012;Regan et al., 2014), is typically used as an antidiarrheal and was also identified as a promising anticancer drug by CI, in addition to CW and CLP. Furthermore, Raloxifene (identified by CLE), Fulvestrant (identified by CW and CLP), Geldanamycin (identified by CW and CLP), Tanespimycin  Table S6). Supplementary Table S7 shows that four candidate drugs associated with paclitaxel detected by CW have been used on breast cancer patients in the Clinicaltrials database and two others drugs were proved in vitro. Therefore, integrating the clusters produced by the Walktrap algorithm with druginduced differential gene expression yields the best performance for drug repurposing.

Common Biological Processes of Candidate Breast Anti-cancer Drugs
We argue that perturbed genes that participate big functional clusters are important for studying the mode of action of drugs. To better understand this association for the candidate anticancer drugs detected by the top-performing algorithm, we extracted the perturbed genes of tamoxifen, fulvestrant, geldanamycin, and clomifene within the biggest cluster of CW to analyze their functions using a hypergeometric test. We obtained the overlapping biological processes shared amongst the four drugs (Figure 7). We found that these drugs, despite their differences in chemical structure and target genes, share 33 common biological processes, 3 of which are directly involved in cell growth and development, including regulation of cell proliferation, cell death, and cell cycle. Moreover, the biological pathways associated with these drugs also shared the GO term for estrogen response (GO: 0043627).
FIGURE 7 | The common biological process of perturbed genes in the major cluster. The chemical structures of tamoxifen, fulvestrant, geldanamycin, and clomifene are on the right. The brown circle denotes the functional processes related to cell growth and development.

DISCUSSION
The interactions among molecular components link biological functions, which are typically involved in functionally related clusters for their activities (Barabási and Oltvai, 2004;Liu et al., 2017). It has been reported that genes of similar disease phenotypes have a significantly higher number of interactions and highly connect with each other than with random genes (Goh and Choi, 2012), and the topological modules usually define disease-associated functional clusters (Ruan et al., 2006). However, Liu et al. (2017) pointed out the functional diversity of clusters in human PPI networks and that cluster detection approaches should be used with caution. Meanwhile, most studies classify the modules without considering the expression profile of specific cell lines or tissue types. In this study, we compared the clusters in the MCF-7 cell-based PPI network produced by four cluster detection algorithms, leading eigenvector (CLE), Walktrap [CWlabel propagation (CLP), and Infomap (CI)]. We observed that the big clusters (size ≥ 10) comprised a smaller fraction of total clusters and the big clusters generated by CLP, CLE, and CW contained most protein members, except CI. The distribution of node degrees in the network is a crucial factor for finding clusters using the CI algorithm. Therefore, we argue that compression-based clustering approaches should be used with caution for the identification of function clusters.
Drugs usually target multiple gene products and have the potential to be used for the treatment of several diseases. Some gene products may lead to serious side-effects and others may introduce novel applications to guide drug repurposing. Moreover, tumors are highly multifactorial at the molecular level, involving interactions in gene function networks (Zickenrott et al., 2017). The Connectivity Map (CMap) is a collection of gene expression profiles of five cancer cell lines before and after treatment with 1,300 small molecules (Lamb et al., 2006). In previous studies, integration of gene perturbation data in CMap with human PPI networks revealed drug targets and key pathways (Isik et al., 2015). However, most of the current network-based approaches for identification of essential proteins neglect significant differences in expression profiles among tissue types or even cell lines. We found that using a cell-based PPI network for target gene identification reduced the number of false positives for determining the shortest path between target and perturbed genes and improved drug target prediction compared with the cell-agnostic human PPI network. In this study, we improved target gene prediction by combining a cell-based PPI network with gene perturbation involved in big clusters.
Shortest path and correlation diffusion are two target gene prioritization algorithms. The SP measure assumes that deregulated genes might be close to drug targets based on network topology. SP identifies more diverse targets than CD, which agrees with previous conclusions (Isik et al., 2015). Moreover, we observed that integrating functional cluster-related perturbed genes with the MCF-7 cell-based network by CDhas little influence on the target gene prediction. The CD ranking, a random walk-based measure (Isik et al., 2015), sorts the proteins based on connectivity correlation in the network (Laenen et al., 2013). Cluster detection algorithms are also used as connection factor to identify functional clusters. Thus, the perturbed genes within the small clusters did not change the target gene ranking based on the this method. Furthermore, we calculated the fraction of drug-perturbed genes in each cluster as a drug-specific pattern to reflect cell response after drug treatment. We compared these patterns across tamoxifen, raloxifene, and paclitaxel, by computing the Pearson correlation to find promising anti-breast cancer candidate drugs. We observed that these drugs are highly similar to tamoxifen as predicted by CW, CLP, CLE, and CI. We selected the top nine drugs from each similarity ranking list as promising anti-breast cancer drugs and validated them through published literature and clinical trial cases. Based on the validation result, we found that the CW cluster detection algorithm showed the best ability to extract functional clusters from the network. To better understand the potential application of these drugs in breast cancer treatment, we performed a GO term analysis by extracting deregulated genes for four known anti-breast cancer drugs from the major cluster and obtained the overlapping biological processes among these drugs. The perturbed genes of these drugs not only appeared in big clusters detected from our cell-based PPI network, but also participated in common biological processes, such as cell cycle, cell death, and estrogen response pathways. We also used brinzolamide-induced gene perturbation data, which weakly correlates with tamoxifen, as a negative control, only three perturbed genes were found to overlap with gene members in the main cluster and no significant GO term enrichment was seen.
In summary, we gained insights into the structure of interactions by cluster detection algorithms and implemented the properties of clusters in the identification of target proteins and drug repurposing. We provide a new computational pipeline for the identification of drugs. However, the therapeutic potential of these agents requires further investigation.

AUTHOR CONTRIBUTIONS
JM, BH-K, and PD conceived and designed the study. JM and LG developed the methodology. JM, LG, JW, and XM analyzed and interpreted the data (e.g., statistical analysis, biostatistics, and computational analysis). JM, JW, BH-K, and XM wrote, reviewed, and/or revised the manuscript. BH-K and PD supervised the study.