DPPN-SVM: Computational Identification of Mis-Localized Proteins in Cancers by Integrating Differential Gene Expressions With Dynamic Protein-Protein Interaction Networks

Eukaryotic cells contain numerous components known as subcellular compartments or subcellular organelles. Proteins must be sorted to the proper subcellular compartments to carry out their molecular functions. Mis-localized proteins are related to various cancers. Identifying mis-localized proteins is important for understanding the pathology of cancers and for developing therapies. However, experimental methods for determining protein subcellular locations are costly and time-consuming. We set out to identify cancer-related mis-localized proteins in three different cancers using computational approaches. By integrating gene expression profiles and dynamic protein-protein interaction networks, we established DPPN-SVM (Dynamic Protein-Protein Network with Support Vector Machine), a predictive model using the SVM classifier with diffusion kernels. With this predictive model, we identified a number of mis-localized proteins. Because the dynamic protein-protein network has not been considered in existing works, our model is capable of identifying more mis-localized proteins than existing studies. As far as we know, this is the first study to incorporate a dynamic protein-protein interaction network in identifying mis-localized proteins in cancers.


INTRODUCTION
The eukaryotic cell is the most basic structural and functional unit of eukaryotic organisms. Every cell contains numerous more basic components named subcellular compartments or subcellular organelles (Reece, 2015). According to the presence or absence of membranes, these subcellular organelles can be divided into two categories: membrane-bounded subcellular compartments and non-membrane-bounded subcellular structures (Perez-Ordonez et al., 2006). Membrane-bounded subcellular compartments are surrounded by a single or double lipid-layer membrane, such as mitochondria, the nucleus and chloroplasts (in photosynthetic organisms). Non-membrane-bounded subcellular structures, for example ribosomes, the cytoskeleton and centrioles, have no membrane.
Proteins, which are translated in the cytosol or on the rough ER (Endoplasmic Reticulum), must be transported to the proper compartments during or after translation to perform their biological functions (Mitra et al., 2006; Nyathi et al., 2013; Johnson et al., 2013). This process is known as protein sorting (Alberts et al., 2002). The subcellular organelles where a protein performs its biological functions are called the subcellular localization of the protein. A protein may have one or more subcellular localizations (Cheng et al., 2017). Under complex disease conditions, some proteins may be sorted to incorrect subcellular locations, which results in abnormal intracellular behavior (Lee et al., 2008). For example, Zellweger syndrome is a rare congenital disorder characterized by the reduction or absence of functional peroxisomes in the cells of an individual (Brul et al., 1988). A study showed that many diseases, such as Swyer syndrome, speech-language disorder, Alzheimer's disease, kidney stones and Diamond-Blackfan anemia, were all associated with mis-localized proteins (Hung and Link, 2011). Therefore, tracking altered subcellular locations under different cellular conditions is important for understanding the pathology of complex diseases such as cancers.
With the help of automatic image processing and understanding technology, the first comprehensive human protein localization map was established (Uhlen et al., 2010; Thul et al., 2017). However, the experimental methods used to establish such a map are still costly and time-consuming (Horwitz and Johnson, 2017), which makes it difficult to build localization maps under different cellular conditions, such as disease conditions, drug perturbations and environmental stress conditions. Therefore, computational prediction approaches are still in demand for analyzing altered protein subcellular locations under different conditions. During the last twenty years, hundreds of studies have predicted protein subcellular locations using various types of information at various levels of cellular structure in various species (Chou and Shen, 2006; Briesemeister et al., 2010; Mooney et al., 2011; Zhou et al., 2017; Cheng et al., 2017). For example, many studies predict protein subcellular locations from protein sequences and sequence-related information (Chou and Shen, 2007; Du et al., 2011; Du and Xu, 2013). Most of these works rely on machine learning algorithms (Chou, 2011). Unfortunately, almost all existing studies predict the subcellular locations of a given protein in only one condition (Liu and Hu, 2016). This is because almost all of them use only static information as input data. For example, most existing methods extract informative features from the primary sequences of proteins, without taking mutations and SNPs into consideration. As another example, some existing methods make use of gene ontology annotations and the functional domain composition of proteins (Zhou et al., 2017). There is still no information that distinguishes different cellular conditions that can be extracted from gene ontology annotations or functional domain compositions.
Several existing methods are designed to find altered protein subcellular locations under different cellular conditions. PROLocalizer makes use of sequence mutations to detect mis-localized proteins in diseases (Vihinen, 2009, 2011). Lee et al. integrated protein sequences, PPI (Protein-Protein Interaction) networks, and gene expression profiles to predict mis-localized proteins in glioma (Lee et al., 2008). Liu and Hu improved Lee's method to predict mis-localized proteins in several types of cancers (Liu and Hu, 2016).
In these existing works, the information that distinguishes different cellular conditions comes from two sources: mutations and SNPs on the one hand, and differential gene expression on the other. Although gene mutation and SNP information is useful, it is not easy to utilize in sequence-based features. In contrast, many gene expression datasets have been deposited in the NCBI GEO (Gene Expression Omnibus) database (Barrett et al., 2013), and these have proved useful when combined with protein-protein interaction networks (Ideker and Krogan, 2012). Therefore, combining gene expression profiles with the PPI network is a feasible way to explore mis-localized proteins in cancers, as well as in other kinds of complex diseases.
Although state-of-the-art methods that apply gene expression profiles and PPI networks to predict mis-localized proteins have achieved success in several specific types of cancers, these methods share two common issues.
First, all state-of-the-art methods used identical PPI network structures in both the disease and non-disease conditions. This results from the lack of PPI network data for specific disease conditions. However, if a protein is mis-localized in the disease condition, its interacting proteins must change, as the physical distances between the mis-localized protein and the other proteins change. Therefore, the topological structure of the PPI network in the disease condition cannot be identical to that in the non-disease condition.
Second, as the topological structure of the PPI network should change in the disease condition, the difference in topological structure should itself be utilized to predict mis-localized proteins.
In this work, we tried to solve the above two issues by building a model named DPPN-SVM (Dynamic Protein-Protein Network with Support Vector Machine). We modified the PPI network of the non-disease condition according to the changes of co-expression scores in the disease condition, to establish an adjusted PPI network for the disease condition. We applied the ECC (edge clustering coefficient), which has already been applied in predicting essential proteins and protein subcellular locations (Wang et al., 2012; Du and Wang, 2014), to extract the PPI network structure information. By training SVM classifiers with diffusion kernels (Kondor and Lafferty, 2002) on the PPI network, we can predict protein subcellular locations under different cellular conditions. We developed a mis-localization score, which describes how likely a protein is to move to or leave from a specific subcellular location in a specific cellular condition. We hope this work may provide a better way of predicting mis-localized proteins in various types of cancers.

PPI Network Construction
We downloaded our PPI data from the BioGRID database, version 3.5.179 (Oughtred et al., 2019). To construct a high-quality working dataset, we screened the raw PPI data strictly using the following criteria. (1) Only interactions between two human proteins were kept. (2) Interactions between two identical proteins were discarded, as interactions of this kind do not provide useful information for protein subcellular localization. (3) Duplicate interaction records were reduced to unique interactions. (4) Only physical interactions were kept, and all other types of interactions were removed. This is because a physical interaction implies that the two interactors are physically very close, which contributes to the prediction of protein subcellular locations. To achieve this, we kept only those interaction records with interaction type MI:0915 (physical association) or MI:0407 (direct interaction). After all of the above filtering procedures, we obtained 341088 interactions involving 23810 proteins.
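The four screening criteria can be sketched as a single filtering pass. The record layout below (the keys `id_a`, `id_b`, `taxid_a`, `taxid_b` and `mi_type`) is a hypothetical simplification and not the actual BioGRID file schema; treating an interaction as an unordered pair is also an assumption of this sketch.

```python
HUMAN_TAXID = 9606  # NCBI taxonomy ID for Homo sapiens
PHYSICAL_TYPES = {"MI:0915", "MI:0407"}  # physical association, direct interaction


def filter_interactions(records):
    """Apply criteria (1)-(4) to an iterable of interaction records.

    Each record is a dict with the hypothetical keys
    id_a, id_b, taxid_a, taxid_b and mi_type.
    Returns a set of unordered protein-ID pairs.
    """
    kept = set()
    for rec in records:
        # (1) both interactors must be human proteins
        if rec["taxid_a"] != HUMAN_TAXID or rec["taxid_b"] != HUMAN_TAXID:
            continue
        # (2) discard interactions of a protein with itself
        if rec["id_a"] == rec["id_b"]:
            continue
        # (4) keep physical interactions only
        if rec["mi_type"] not in PHYSICAL_TYPES:
            continue
        # (3) storing unordered pairs in a set collapses duplicate records
        kept.add(frozenset((rec["id_a"], rec["id_b"])))
    return kept


records = [
    {"id_a": "P1", "id_b": "P2", "taxid_a": 9606, "taxid_b": 9606, "mi_type": "MI:0915"},
    {"id_a": "P2", "id_b": "P1", "taxid_a": 9606, "taxid_b": 9606, "mi_type": "MI:0407"},   # duplicate pair
    {"id_a": "P3", "id_b": "P3", "taxid_a": 9606, "taxid_b": 9606, "mi_type": "MI:0915"},   # self-interaction
    {"id_a": "P1", "id_b": "P4", "taxid_a": 9606, "taxid_b": 10090, "mi_type": "MI:0915"},  # non-human partner
    {"id_a": "P1", "id_b": "P5", "taxid_a": 9606, "taxid_b": 9606, "mi_type": "MI:0000"},   # placeholder non-physical type
]
ppi = filter_interactions(records)
```

Only the pair P1-P2 survives in this toy input; the other records each violate one criterion.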

Subcellular Localization Annotations
We obtained reviewed human protein records from the UniProt database (UniProt Consortium, 2019), which include 20432 proteins. We employed the online ID mapping function of the UniProt database to convert the BioGRID protein ID of every node in the PPI network to a UniProt ID. In our PPI network, 16319 proteins can be mapped uniquely between the UniProt database and the BioGRID database. Although this covers only about 68% of the nodes in the PPI network, the number of interactions between these mapped proteins is 301366, which covers over 88% of all interactions.
After the mapping procedure, we transferred the GO (Gene Ontology) annotations in the cellular component ontology from the UniProt records to the BioGRID proteins. We chose 12 subcellular locations. When the GO annotations were transferred from the UniProt records to the BioGRID proteins, we transferred only those GO terms with experimental evidence. This was achieved by choosing only those terms with evidence code IDA (Inferred from Direct Assay) or HDA (Inferred from High Throughput Direct Assay). In total, 6461 BioGRID proteins were experimentally annotated with at least one of the 12 subcellular locations.
Among the 6461 annotated BioGRID proteins, there were 4112 proteins with only one subcellular location, 1731 proteins with two locations, 503 proteins with three locations, 98 proteins with four locations, 15 proteins with five locations and 2 proteins with six locations. The average multiplicity degree of the dataset was 1.48. The breakdown of the dataset for different location multiplicity is illustrated in Figure 1A.
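The multiplicity degree can be checked directly from the counts above. The short calculation below is only an arithmetic check; the variable names are ours.

```python
# Number of proteins (values) annotated with a given number of
# subcellular locations (keys), as reported in the text.
location_counts = {1: 4112, 2: 1731, 3: 503, 4: 98, 5: 15, 6: 2}

n_proteins = sum(location_counts.values())
n_annotations = sum(k * n for k, n in location_counts.items())
multiplicity = n_annotations / n_proteins

print(n_proteins, n_annotations, round(multiplicity, 2))  # 6461 9562 1.48
```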

Virtual Locative Proteins
Since one protein may have more than one subcellular location, it is necessary to introduce the concept of virtual locative proteins (Chou and Shen, 2006). From a machine learning point of view, computational prediction of multiple subcellular locations for a single protein is a multi-label classification problem. Therefore, it should be converted to a single-label classification problem before it can be handled by traditional machine learning algorithms.
Every protein with κ (κ > 1) subcellular locations was split into κ virtual locative proteins. Each of the κ virtual locative proteins has one and only one of the κ subcellular locations. For example, if a protein $p_i$ has two subcellular locations $l_1$ and $l_2$, we split the protein $p_i$ into two different virtual proteins, located at $l_1$ and $l_2$, respectively.
The virtual locative proteins inherited the properties of the original real proteins, including all PPI connections and gene expression profiles. Since the virtual locative proteins have different subcellular locations, we assumed that there is no PPI between virtual locative proteins generated from the same real protein.
The original 6461 proteins with experimentally annotated subcellular locations were split into 9562 virtual locative proteins, corresponding to a multiplicity degree of 1.48. Consequently, the number of proteins mapped between UniProt and BioGRID increased from 16319 to 19420, which is about 120% of the original. The number of PPIs in the network increased from 301366 to 601693, which is about 200% of the original. Figure 1B gives the breakdown of the dataset in terms of virtual locative proteins in different subcellular locations.
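The splitting step can be sketched as follows. This is a minimal illustration, assuming that a virtual locative protein is represented as a `(protein, location)` tuple and that unannotated proteins stay as single nodes; both the representation and the function name are ours, not part of the original implementation.

```python
def split_into_locative(locations, edges):
    """Split multi-location proteins into virtual locative proteins.

    locations: dict protein -> list of subcellular locations (annotated proteins only)
    edges: iterable of (protein_a, protein_b) pairs without self-interactions
    Returns (virtual_locations, virtual_edges). Virtual nodes are
    (protein, location) tuples for annotated proteins, plain IDs otherwise.
    """
    def copies(protein):
        locs = locations.get(protein)
        return [(protein, loc) for loc in locs] if locs else [protein]

    virtual_locations = {
        (p, loc): loc for p, locs in locations.items() for loc in locs
    }
    virtual_edges = set()
    for a, b in edges:
        # every copy inherits all interactions of the original protein
        for va in copies(a):
            for vb in copies(b):
                virtual_edges.add(frozenset((va, vb)))
    # copies of the same protein are never linked, because the input
    # network contains no self-interactions
    return virtual_locations, virtual_edges


locations = {"A": ["nucleus", "cytosol"], "B": ["nucleus"]}
edges = [("A", "B"), ("B", "C")]
virtual_locations, virtual_edges = split_into_locative(locations, edges)
```

In this toy network, protein A becomes two virtual proteins, so the single A-B interaction becomes two edges while B-C stays a single edge. This duplication is why the edge count of the network grows faster than the node count.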

Edge Clustering Coefficients
The edge clustering coefficient was originally developed for analyzing social networks (Radicchi et al., 2004). It has since been introduced in identifying essential proteins (Wang et al., 2012), as well as in predicting protein subcellular locations (Du and Wang, 2014). In particular, ECC has proved to be an indicator of whether two interacting proteins tend to share subcellular locations (Du and Wang, 2014). For a pair of interacting proteins, denoted as the u-th and the v-th proteins, the ECC can be defined as follows:
$$\eta_{u,v} = \frac{z_{u,v}}{\min(d_u - 1,\ d_v - 1)} \tag{1}$$
where $\eta_{u,v}$ is the ECC between the u-th and the v-th proteins, $z_{u,v}$ the number of triangles that involve the edge between the u-th and the v-th proteins, and $d_u$ and $d_v$ the degrees of the u-th and the v-th proteins, respectively. The denominator in Eq (1) represents the maximum possible number of triangles that may involve the u-th and the v-th proteins. We set $\eta_{u,v} = 0$ when the denominator is zero.
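As a minimal sketch of the ECC computation, the function below counts the triangles through each edge as the number of common neighbors of its two endpoints. It assumes that the denominator is min(d_u − 1, d_v − 1), the form used in the cited ECC literature; the function name and the edge-list input format are ours.

```python
def edge_clustering_coefficients(edges):
    """Return {frozenset({u, v}): ECC} for an undirected simple graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    ecc = {}
    for u, v in edges:
        # triangles through edge (u, v) = common neighbors of u and v
        z = len(neighbors[u] & neighbors[v])
        # assumed denominator: largest possible number of such triangles
        denom = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        ecc[frozenset((u, v))] = z / denom if denom > 0 else 0.0
    return ecc


# a triangle a-b-c with a pendant vertex d attached to c
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
ecc = edge_clustering_coefficients(edges)
```

In this example every edge of the triangle gets an ECC of 1, while the pendant edge c-d gets 0 because vertex d has degree 1 and the denominator vanishes.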

Diffusion Kernel Matrix
In order to apply machine learning techniques to graph-like structures, the diffusion kernel was proposed to capture the long-range relationships between vertices induced by the local structure of a graph (Kondor and Lafferty, 2002). Diffusion kernels provide a means to incorporate all neighbors of proteins in the network (Lee et al., 2006).
Let G be a simple graph. Its Laplacian matrix can be defined as:
$$L = D - A \tag{2}$$
where A is the adjacency matrix of the graph and D the degree matrix. The matrix D can be defined as:
$$D_{i,j} = \begin{cases} d_i & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
where $d_i$ is the degree of the i-th vertex in the graph. The diffusion kernel matrix K(τ) is given by:
$$K(\tau) = \exp(-\tau L) \tag{4}$$
where τ is a constant parameter and exp() the matrix exponential function. It can be easily shown that K(τ) is a valid kernel function.
Frontiers in Genetics | www.frontiersin.org
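A minimal numerical sketch of the diffusion kernel is given below. It assumes the convention L = D − A with K(τ) = exp(−τL), uses NumPy, and exploits the symmetry of L to compute the matrix exponential through an eigendecomposition. The function name is ours, and the approach is only practical for networks small enough for a dense eigendecomposition.

```python
import numpy as np


def diffusion_kernel(adjacency, tau):
    """Compute exp(-tau * L) with L = D - A for a symmetric adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))  # degree matrix (row sums of A)
    L = D - A
    # L is symmetric, so exp(-tau * L) = V diag(exp(-tau * w)) V^T
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-tau * w)) @ V.T


# path graph 0 - 1 - 2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
K = diffusion_kernel(A, tau=1.1)
```

Because every eigenvalue of K(τ) has the form exp(−τw) > 0, the resulting matrix is symmetric positive definite, which is what makes it usable as a precomputed SVM kernel.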

Co-expression Network Construction
Three cancer-related gene expression profile datasets were obtained from the NCBI GEO database. These datasets are from studies on acute myeloid leukemia, breast cancer and hepatitis carcinoma, respectively. The datasets include GSE9476 (myeloid leukemia, 25 cases and 38 controls), GSE27567 (breast cancer, 51 cases and 31 controls) and GSE121248 (hepatitis carcinoma, 70 cases and 37 controls). All gene expression datasets were generated on Affymetrix platforms (Dalma-Weiszhausz et al., 2006). We used the "simpleaffy" package in Bioconductor to perform quality control (Wilson and Miller, 2005). For each dataset, the following filtering steps were carried out. (1) Samples with scale factors larger than 3 were removed. (2) Samples with 3′ to 5′ ratios for β-actin less than 3 were kept. (3) Samples with 3′ to 5′ ratios for GAPDH (Glyceraldehyde 3-phosphate dehydrogenase) less than 1.25 were kept. We also checked the RLE (relative log expression) and NUSE (normalized unscaled standard errors) of the samples. Samples whose RLE or NUSE values differed significantly from those of the other samples were removed. The case and control samples in each dataset were grouped separately. The MAS5 algorithm (Pepper et al., 2007) was applied to generate expression values for every sample. We applied the Affymetrix templates and annotation packages in Bioconductor to map the gene expression values to UniProt proteins. In case of a many-to-one mapping, we used the mean value as the final expression value of the protein.
Let $x_{i,u}$ be the expression value of the u-th protein in the i-th sample, and n the number of samples in a group. We define the sample-wise centered expression vector $X_u$ as follows:
$$X_u = (x_{1,u} - \bar{x}_u,\ x_{2,u} - \bar{x}_u,\ \ldots,\ x_{n,u} - \bar{x}_u)^T \tag{5}$$
where T is the transpose operator for matrices, and
$$\bar{x}_u = \frac{1}{n}\sum_{i=1}^{n} x_{i,u} \tag{6}$$
We now define the pair-wise PCC (Pearson Correlation Coefficient) between the u-th protein and the v-th protein as follows:
$$\rho_{u,v} = \frac{X_u^T X_v}{\lVert X_u \rVert \, \lVert X_v \rVert} \tag{7}$$
where $\rho_{u,v}$ is the PCC between the u-th and the v-th proteins. The PCC was used to quantify the extent to which two proteins are coherent in terms of gene expression. The PCC was calculated as above for every pair of proteins, regardless of whether the two proteins have a physical interaction.
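The PCC between two proteins can be sketched in a few lines from the centered expression vectors. The function names are ours, and returning 0 for a protein with constant expression (zero norm) is an assumption of this sketch, since the text does not address that case.

```python
import math


def centered(values):
    """Subtract the sample mean from every expression value."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def pcc(expr_u, expr_v):
    """Pearson correlation between two expression profiles over the same samples."""
    xu, xv = centered(expr_u), centered(expr_v)
    dot = sum(a * b for a, b in zip(xu, xv))
    norm = math.sqrt(sum(a * a for a in xu)) * math.sqrt(sum(b * b for b in xv))
    # assumption: a constant profile carries no co-expression signal
    return dot / norm if norm > 0 else 0.0


rho_same = pcc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # perfectly co-expressed
rho_opposite = pcc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
rho_flat = pcc([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])      # constant profile
```

In the method, the same computation is run once on the control samples and once on the case samples, giving two PCC values per protein pair.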

Disease-Related Mis-Localized Protein Identification
Given a specific disease status θ, we term the case sample set $\theta_1$ and the control sample set $\theta_0$.
We can compute the PCC for all pairs of proteins as in Eq (7), using only the samples in $\theta_0$. The PCC between the u-th and the v-th proteins in the non-disease state is denoted as $\rho_{u,v}(\theta_0)$. Similarly, we can compute the ECC for each interaction as in Eq (1). The ECC between the u-th and the v-th proteins in the non-disease state is denoted as $\eta_{u,v}(\theta_0)$.
Let $A(\theta_0)$ be the adjacency matrix of the PPI network in the non-disease state. Its entry for the u-th and the v-th proteins is nonzero only if the two proteins are interacting, and is 0 otherwise.
The Laplacian matrix in the non-disease state can be defined as:
$$L(\theta_0) = D(\theta_0) - A(\theta_0) \tag{9}$$
where $D(\theta_0)$ is the degree matrix computed using Eq (3).
With $L(\theta_0)$, we can create the diffusion kernel matrix $K(\tau, \theta_0)$ using Eq (4). This kernel matrix is used in an SVM model to predict protein subcellular locations in the non-disease state. Since we took the multi-label scenario into consideration, we employed the libSVM package (Chang and Lin, 2011) to derive the probability that each locative protein localizes to each subcellular location.
Let $p_{u,k}(\theta_0)$ be the probability score that the u-th protein localizes to the k-th subcellular location. The libSVM package ensures that $\sum_{k=1}^{m} p_{u,k}(\theta_0) = 1$, where m is the number of all possible subcellular locations. Due to the imbalanced dataset, the ranges of $p_{u,k}(\theta_0)$ vary considerably across subcellular locations. Therefore, we defined an adjusted probability score, $q_{u,k}(\theta_0)$, for the u-th protein and the k-th subcellular location. With the above definitions, the u-th protein is predicted to localize to the k-th subcellular location if the condition in Eq (13) is satisfied, where α is a real-valued parameter between 0 and 1. The subcellular locations predicted for the u-th protein using Eq (13) are denoted as a set $S_u(\theta_0)$.
For the disease state, all of the above computations can be performed on $\theta_1$. However, to amplify the differences between the disease and non-disease states, we altered the topology of the PPI network before all computations in the disease state. This differs from all existing works on predicting mis-localized proteins in diseases.
For the u-th protein and the v-th protein, we first compute the PCC in $\theta_1$, denoted as $\rho_{u,v}(\theta_1)$. We define the disease status difference of PCC, $h_{u,v}$, from $\rho_{u,v}(\theta_1)$ and $\rho_{u,v}(\theta_0)$. We then define two threshold parameters, $t_-$ and $t_+$, from $\bar{h}$, the average value of all $h_{u,v}$, and σ, the standard deviation of all $h_{u,v}$. If the u-th protein and the v-th protein interact in the non-disease state, the interaction is removed if $h_{u,v} < t_-$. Similarly, if the u-th protein and the v-th protein do not interact in the non-disease state, an interaction between them is established if $h_{u,v} > t_+$.
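The rewiring rule can be sketched as below. Two points are assumptions of this sketch rather than statements of the original method: the PCC difference is taken as the disease-state PCC minus the non-disease PCC, and the thresholds are taken as the mean of all differences minus or plus `k` standard deviations, where the multiplier `k` is a hypothetical parameter.

```python
from itertools import combinations
from statistics import mean, pstdev


def rewire(proteins, edges, rho0, rho1, k=1.0):
    """Return the edge set of the disease-state network.

    edges: set of frozenset pairs interacting in the non-disease state
    rho0, rho1: dict frozenset pair -> PCC in control / case samples
    k: hypothetical multiplier, t- = mean - k * sd and t+ = mean + k * sd
    """
    pairs = [frozenset(p) for p in combinations(proteins, 2)]
    h = {p: rho1[p] - rho0[p] for p in pairs}  # assumed form of the difference
    values = list(h.values())
    h_bar, sd = mean(values), pstdev(values)
    t_minus, t_plus = h_bar - k * sd, h_bar + k * sd

    new_edges = set()
    for p in pairs:
        if p in edges:
            if h[p] >= t_minus:   # an existing edge is removed only if h < t-
                new_edges.add(p)
        elif h[p] > t_plus:       # a new edge is established if h > t+
            new_edges.add(p)
    return new_edges


proteins = ["a", "b", "c", "d"]
all_pairs = [frozenset(p) for p in combinations(proteins, 2)]
ab, ac, cd = frozenset("ab"), frozenset("ac"), frozenset("cd")
rho0 = dict.fromkeys(all_pairs, 0.0)
rho0[ab], rho0[cd] = 0.9, 0.5
rho1 = dict.fromkeys(all_pairs, 0.0)
rho1[ac], rho1[cd] = 0.9, 0.5
disease_edges = rewire(proteins, {ab, cd}, rho0, rho1)
```

In this toy example the a-b interaction loses its co-expression in the case samples and is removed, the a-c pair gains co-expression and is connected, and the c-d interaction is unchanged.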
After altering the topology of the PPI network as above, we compute $S_u(\theta_1)$ according to Eq (8) to Eq (13), using the updated PPI network and the gene expression samples in $\theta_1$. It should be noted that $\eta_{u,v}(\theta_1)$ may differ from $\eta_{u,v}(\theta_0)$, as the topology of the PPI network is altered in the disease state.
By comparing $S_u(\theta_1)$ and $S_u(\theta_0)$, we can identify whether the subcellular locations of the u-th protein were altered in the disease state. However, this comparison cannot quantify how likely a protein is to be mis-localized in the disease state. Therefore, we developed a measure to quantify mis-localization, which we termed the mis-localization score.
For each disease, we compute the differences of the adjusted probability scores between the disease and non-disease states. The mis-localization score of the u-th protein in the k-th subcellular location of disease θ can be defined as follows:
$$\phi_{u,k}(\theta) = \frac{q_{u,k}(\theta_1) - q_{u,k}(\theta_0)}{q_{u,k}(\theta_0)} \tag{17}$$
The score $\phi_{u,k}(\theta)$ indicates the extent to which the u-th protein would move to or move from the k-th subcellular location. For each protein, we define the following two boundaries:
$$\sup[\phi_u(\theta)] = \max_{k} \phi_{u,k}(\theta) \tag{18}$$
$$\inf[\phi_u(\theta)] = \min_{k} \phi_{u,k}(\theta) \tag{19}$$
We sorted the proteins according to $\sup[\phi_u(\theta)]$ and $\inf[\phi_u(\theta)]$ in descending and ascending order, respectively. The top-ranked proteins within a fixed proportion of the entire list are considered mis-localized proteins. The proportion is fixed at 0.1% in this work.
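The scoring and ranking steps can be sketched as follows. The sketch assumes that the mis-localization score is the relative change of the adjusted probability score, which is consistent with the percentage scores and the "Inf" entries reported in the results; the function names and the handling of a zero score in both states are ours. It does not implement the exclusion of −100% scores from the ranking.

```python
import math


def mislocalization_scores(q0, q1):
    """Relative change of the adjusted score for each location of one protein.

    q0, q1: dict location -> adjusted score in the non-disease / disease state.
    A zero non-disease score with a positive disease score yields +inf.
    """
    scores = {}
    for loc, base in q0.items():
        diff = q1[loc] - base
        if base > 0:
            scores[loc] = diff / base
        else:
            scores[loc] = math.inf if diff > 0 else 0.0  # assumption for 0 -> 0
    return scores


def rank_proteins(all_scores, proportion=0.001):
    """Return (moving_to, moving_from): top proteins by largest and smallest score."""
    sup = {p: max(s.values()) for p, s in all_scores.items()}
    inf = {p: min(s.values()) for p, s in all_scores.items()}
    n_top = max(1, round(len(all_scores) * proportion))
    moving_to = sorted(sup, key=sup.get, reverse=True)[:n_top]
    moving_from = sorted(inf, key=inf.get)[:n_top]
    return moving_to, moving_from


# rounded localization scores of one protein in two locations (illustrative input)
q0 = {"ER": 0.083, "nucleus": 0.226}
q1 = {"ER": 0.633, "nucleus": 0.036}
phi = mislocalization_scores(q0, q1)

all_scores = {"P1": phi, "P2": {"ER": 0.10, "nucleus": -0.05}}
moving_to, moving_from = rank_proteins(all_scores, proportion=0.5)
```

With rounded inputs such as these, the computed relative changes only approximate the percentages obtained from unrounded scores.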

Performance Evaluation Methods
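In this study, we used 10-fold cross-validation to evaluate the prediction performance of our method in the non-disease state. Four statistics, including aiming (AIM), coverage (CVR), multi-label accuracy (mlACC) and absolute-true rate (ATR), were applied to measure the prediction performance (Jiao and Du, 2016). These statistics are defined as follows:
$$\mathrm{AIM} = \frac{1}{b}\sum_{u=1}^{b} \frac{|S_u(\theta_0) \cap S_u|}{|S_u(\theta_0)|}$$
$$\mathrm{CVR} = \frac{1}{b}\sum_{u=1}^{b} \frac{|S_u(\theta_0) \cap S_u|}{|S_u|}$$
$$\mathrm{mlACC} = \frac{1}{b}\sum_{u=1}^{b} \frac{|S_u(\theta_0) \cap S_u|}{|S_u(\theta_0) \cup S_u|}$$
$$\mathrm{ATR} = \frac{1}{b}\sum_{u=1}^{b} \Delta(S_u(\theta_0), S_u)$$
where $S_u(\theta_0)$ is the set of predicted subcellular locations of the u-th protein in the non-disease state, $S_u$ the set of experimental subcellular locations, b the number of proteins, | · | the cardinality operator of set theory, and
$$\Delta(S_u(\theta_0), S_u) = \begin{cases} 1 & \text{if } S_u(\theta_0) = S_u \\ 0 & \text{otherwise} \end{cases}$$
Since we have introduced virtual locative proteins in our work, we also applied single-label performance measures. Five statistics, including sensitivity (Sen), specificity (Spe), virtual-locative accuracy (vlAcc), positive-predictive value (PPV) and the Matthews correlation coefficient (MCC), were applied in our work. These statistics are computed in terms of virtual locative proteins.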
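The performance measures can be sketched as below. The multi-label part follows the set-based form of aiming, coverage, accuracy and absolute-true rate; the single-label part uses the conventional confusion-matrix definitions, which is an assumption here, since the text does not spell them out. The function names and the counts `tp`, `tn`, `fp` and `fn` are ours.

```python
import math


def multilabel_metrics(predicted, observed):
    """predicted, observed: lists of location sets, one pair per protein."""
    b = len(predicted)
    aim = cvr = acc = atr = 0.0
    for pred, true in zip(predicted, observed):
        common = len(pred & true)
        union = len(pred | true)
        aim += common / len(pred) if pred else 0.0
        cvr += common / len(true) if true else 0.0
        acc += common / union if union else 0.0
        atr += 1.0 if pred == true else 0.0
    return {"AIM": aim / b, "CVR": cvr / b, "mlACC": acc / b, "ATR": atr / b}


def single_label_metrics(tp, tn, fp, fn):
    """Conventional confusion-matrix statistics; assumes nonzero denominators."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "vlAcc": (tp + tn) / (tp + tn + fp + fn),
        "PPV": tp / (tp + fp),
        "MCC": (tp * tn - fp * fn) / mcc_den,
    }


predicted = [{"nucleus"}, {"ER", "Golgi"}]
observed = [{"nucleus", "cytosol"}, {"ER", "Golgi"}]
ml = multilabel_metrics(predicted, observed)
sl = single_label_metrics(tp=5, tn=85, fp=5, fn=5)
```

In the example, the first protein is predicted with one of its two locations, so it counts fully toward aiming but only half toward coverage and accuracy, and not at all toward the absolute-true rate.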

Parameter Calibrations
We used a grid search strategy to find the combination of the parameters τ and α that optimizes the 10-fold cross-validation performance in the non-disease state. The parameter τ in the diffusion kernel was searched from 0.1 to 2.0 with a step of 0.1. The parameter α in Eq (13) was searched from 0.1 to 0.3 with a step of 0.1. Supplementary Figure 1 shows the global MCC score under different parameters. We chose the parameter values τ = 1.1 and α = 0.3 in this work.
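The grid described above contains 20 values of τ and 3 values of α, that is, 60 parameter combinations. A minimal sketch of the search loop is given below; `evaluate` is a hypothetical callable standing in for a full 10-fold cross-validation run that returns the global MCC, and the toy objective is purely illustrative.

```python
def grid_search(evaluate):
    """Return (best_score, tau, alpha) over the grid described in the text.

    evaluate: hypothetical callable (tau, alpha) -> cross-validated global MCC.
    """
    taus = [round(0.1 * i, 1) for i in range(1, 21)]   # 0.1, 0.2, ..., 2.0
    alphas = [round(0.1 * i, 1) for i in range(1, 4)]  # 0.1, 0.2, 0.3
    return max((evaluate(t, a), t, a) for t in taus for a in alphas)


# toy objective with its maximum at tau = 1.1 and alpha = 0.3 (illustration only)
def toy_objective(tau, alpha):
    return -((tau - 1.1) ** 2) - (alpha - 0.3) ** 2


best = grid_search(toy_objective)
```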

Prediction Performance Analysis in the Non-disease State
We used 10-fold cross-validation to evaluate the prediction performance in the non-disease state. It should be noted that our method is designed to find alterations of protein subcellular locations, rather than the exact subcellular locations in the non-disease state. Therefore, we chose to compare our method with Liu and Hu's method (Liu and Hu, 2016). Since we applied the virtual locative protein concept in our work, while Liu and Hu employed the top-k accuracy performance measure, it is difficult to perform an exact apples-to-apples comparison. However, we managed to compare the global sensitivity of our work with the top-1 accuracy of Liu and Hu's work. As our performance value was obtained using 10-fold cross-validation, this gives some advantage to Liu and Hu's work. Our global sensitivity is 0.556, while the top-1 accuracy of Liu and Hu's work is 0.364. Although neither value is high by the standards of general protein subcellular location prediction, our method achieved a comparable or slightly higher performance. The other global performance measures in terms of virtual locative proteins are a specificity of 0.899, a PPV of 0.437, an accuracy of 0.857 and an MCC of 0.412.
For a further performance assessment, we chose to compare the multi-label performance of our method with that of Hum-mPLoc 3.0 (Zhou et al., 2017), which was developed using gene ontology information. Since our method does not rely on gene ontology annotations, which have proved to give superior performance in predicting protein subcellular locations, it should be noted that Hum-mPLoc 3.0 has an intrinsic performance advantage.
Since Hum-mPLoc 3.0 does not use the same subcellular location annotations as our method, we chose to compare the overlapping locations. To achieve a reasonably fair comparison, we composed a testing dataset of 3842 proteins, each of which has at least one overlapping subcellular location. This testing dataset was fed into Hum-mPLoc 3.0 and into our method in the non-disease state. The overall multi-label performances are compared in Table 1. It can be seen that our method has better performance in terms of aiming, coverage, accuracy and absolute-true rate. This is an expected result, as our method incorporates PPI information and gene expression profiles.

Discovery of Potentially Mis-Localized Proteins in Cancers
We applied our method to three different types of cancers, including leukemia, breast cancer and hepatitis carcinoma. Table 2 gives a list of representative mis-localized proteins in these cancer cells. For each disease, we listed the top six proteins (0.1% of the entire list) that are most likely to mis-localize to an abnormal location, and the top six proteins that are most likely to mis-localize from their normal locations. The corresponding location, the mis-localization score and the score rank can also be found in Table 2.

Leukemia
In acute myeloid leukemia, we used 25 cases and 38 controls. Our prediction showed that the protein SETBP1 mis-localized to the ER in cancer cells, as its localization score in the ER increased from 0.083 to 0.633 with a mis-localization score of +658.04%, while its localization score in the nucleus dropped from 0.226 to 0.036 with a mis-localization score of −83.94%. A recent study has suggested a direct involvement of SETBP1 in leukemia development (Oakley et al., 2012). We predicted that EI24 mis-localized from the ER in cancer cells, as its localization score dropped from 0.94 to 0.468 with a mis-localization score of −50.22%. Zhao et al. (2005) found that EI24/PIG8 was an ER-localized Bcl2-binding protein, which was highly mutated in aggressive breast cancers.

Breast Cancer
For breast cancer, we used 51 cases and 31 controls. We predicted that the protein B7H1 mis-localized from plasma membrane and to nucleus, as its localization score in plasma membrane dropped from 0.243 to 0.105 with a mis-localization score of −56.76%, while its localization score in nucleolus increased from 0.023 to 0.092 with a mis-localization score of +290.65%. This is consistent with the record in the literature (Wang and Li, 2014). Our method also reported that the protein VEGFR3 mis-localized from plasma membrane and to cell nucleus, as its localization score in cell nucleus increased from 0.047 to 0.106 (with a mis-localization score of +125.16%). This is also consistent with the record in the literature (Wang and Li, 2014). The protein IFNgR2 was annotated with the locations ER and Golgi apparatus in the UniProt database. We predicted that IFNgR2 mis-localized from plasma membrane to mitochondria in cancer cells, as its localization score in plasma membrane dropped from 0.214 to 0.078 (with a mis-localization score of −63.74%), while its localization score in mitochondria increased from 0.136 to 0.388 (with a mis-localization score of +184.5%). It was reported that IFNgR2 molecules can be mainly detected in mitochondria in cancer cells (Ngo et al., 2012).
Notes to Table 2. (a) The mis-localization score is marked after the altered location. The "+" prefix indicates a new subcellular location in the disease state. The "−" prefix indicates that a non-disease subcellular location is lost in the disease state. "Inf" indicates a positive infinity value, which is produced by a zero original localization probability. (b) The score ranks are sorted using the boundary values in Eq (18) and Eq (19). Mis-localization scores with a value of −100% do not participate in the ranking, as such a value does not necessarily indicate a complete loss of a subcellular location, but may just reflect a bias of the available data.

Hepatitis Carcinoma
For hepatitis carcinoma, we used 70 cases and 37 controls. The protein S100A11 was reported to have very weak nuclear expression in adenocarcinomas (Rehman et al., 2004), while our method reported that it mis-localized to peroxisome, as its localization score in peroxisome increased from 0.014 to 1.0 (with a mis-localization score of +6868.17%). We predicted that FOXP mis-localized to peroxisome, as its localization score in peroxisome increased from 0.003 to 0.019 (with a mis-localization score of 612.39%). It has been reported that FOXP would lose its nuclear localization in cancers (Hung and Link, 2011). ABCA1 was reported to mis-localize from plasma membrane to lysosome in cancers (Hung and Link, 2011). Our method reported the same result, as its localization score in plasma membrane dropped from 0.162 to 0.004 (with a mis-localization score of −99.28%), and its localization score in lysosome increased from 0.012 to 0.972 (with a mis-localization score of +8115.45%).

Potential Results Validation
Using our method, we identified some proteins that may mis-localize from or to a specific location. Some of them have been verified by existing studies, but most of the predicted proteins have not been verified. Due to our limited resources, we could not perform experimental validation, which may be considered as future work. It should also be noted that there is still no database of mis-localized proteins, and the information regarding mis-localized proteins is still scattered across many publications. Since mis-localized proteins are of great significance in revealing the mechanisms of diseases, establishing a database that summarizes and stores the relevant discoveries would be valuable and impactful, and we also consider it as future work on this research topic.

CONCLUSION
Computational prediction of protein subcellular locations has been studied for over twenty years. However, computational detection of disease-related mis-localized proteins has rarely been discussed. By integrating gene expression profiles and protein-protein interaction networks, we developed a computational approach, DPPN-SVM, to detect mis-localized proteins in various cancers. The results indicated that our method can successfully identify cancer-related or mis-localized proteins that have been reported in the literature.
Compared with existing studies, our method not only provides a comparable or better prediction performance in the non-disease state, but also further amplifies the differentially expressed gene information by introducing the dynamic PPI network and SVM classifiers with diffusion kernels. The prediction results of our method provide candidate proteins as spatial cancer markers, while the method itself offers a new way to explore the spatial distribution of proteins within a cell.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/brown-2/mis_localization.

AUTHOR CONTRIBUTIONS
G-PL collected the data, processed the data, implemented the algorithm, performed most of the experiments, and analyzed the results. P-FD designed and directed the study, proposed the algorithm, analyzed the results, and wrote the manuscript. Z-AS and H-YL performed part of the experiments. TL analyzed the results and participated in writing the manuscript. All authors contributed to the article and approved the submitted version.