Identifying Disease Related Genes by Network Representation and Convolutional Neural Network

The identification of disease related genes plays an essential role in bioinformatics. To achieve this, many powerful machine learning methods have been proposed from various computational aspects, such as biological network analysis, classification, regression, and deep learning. Among them, deep learning based methods have been highly successful in identifying disease related genes, offering higher accuracy and efficiency. However, these methods rarely handle two issues well: (1) the multifunctionality of many genes; and (2) the scale-free property of biological networks. To overcome these, we propose a novel network representation method that transfers individual vertices, together with their surrounding topological structures, into image-like datasets. It takes each node-induced sub-network as the object to be represented and adds its environmental characteristics to generate a low-dimensional space as its representation. These image-like datasets can be applied directly in a Convolutional Neural Network-based method for identifying cancer-related genes. Numerical experiments show that the proposed method achieves an AUC of 0.9256 on a single network and 0.9452 on multiple networks, which outperforms many existing methods.


INTRODUCTION
With the rapid development of high-throughput biological experiments and the wide application of bioinformatics (Guingab-Cagmat et al., 2013), the identification of genes related to human diseases has become increasingly important for understanding the mechanisms of disease pathogenesis. Many biological networks (Raval and Ray, 2013) have been used to identify disease related genes, such as genetic interaction networks (Boucher and Jenna, 2013), protein-protein interaction networks (Seebacher and Gavin, 2011), and gene interaction networks (Robert, 2012). Ramsahai et al. (2017) use gene interaction networks to improve the identification of cancer driver genes. Gevaert et al. (2014) use module network interaction of multi-omics data to identify ovarian cancer driver genes.
To identify disease related genes from network data, many powerful machine learning methods have been proposed from various computational aspects, such as decision trees (She et al., 2010), support vector machines (Choi et al., 2011), and naive Bayes (Yousef et al., 2007). Meanwhile, deep learning methods have also been highly successful in identifying disease related genes because of their higher accuracy and efficiency. However, most deep learning methods do not consider the multifunctionality of many genes or the scale-free characteristics of biological networks, which limits their ability to identify disease related genes. The multifunctionality of genes and the scale-free characteristics of biological networks are shown in Figures 1A,B.
To be more specific, disease related genes tend to be multifunctional: several genes may work together to cause a disease, or one gene may be related to multiple diseases. Many graph neural network based representation methods do not consider the multifunctionality of nodes and also suffer from the limitation of node degree, especially in scale-free networks (Zhou et al., 2018). Many biological networks, such as PPI networks, pathway co-occurrence networks, gene co-expression networks, and DNA co-methylation networks, are not inappropriate with each other, and they often fail to exploit the power of graph neural networks. These networks do not accurately explain the similarity between genes and their feature vectors; that is, their edges only indicate that two genes are related to each other and do not show the global similarity between neighbors. In addition, many genes have no attribute data associated with a disease, whereas the known multifunctional disease related genes contain more information, which leads to differences in the dimension of the resulting feature vectors.

FIGURE 1 | The overall idea of the proposed method. (A) The multi-function of genes, which shows the genes' multiple tags. (B) The scale-free characteristics of biological networks. Two cancer related genes, WT1 and KIT, are taken as examples. (C) The representations of four genes produced by our proposed network representation method. Each gene is surrounded by the selected genes and their tags. All of them come together like an "ecosystem." (D) The image-like datasets that correspond to the representations of the genes.
Most biological networks have the scale-free property (Boccaletti et al., 2010), which means that hubs receive more information from their neighbors, while non-hub nodes receive less. The two are not equivalent. We therefore need to redefine "neighbor" based on similarity rather than on the original adjacency relationship, so that every node in a scale-free network receives information of roughly the same scale. After the node sequence selection and the neighborhood graph construction, we regard the node-induced sub-network as the represented object and regularize it so that every node receives information of roughly the same scale from its own node-induced sub-network.
At the same time, the function of a gene is actually the function of the gene's product, that is, the protein for a coding gene and the RNA for a non-coding gene (Gamermann et al., 2019). Genes are also expressed selectively, at a certain time and in a certain space. All of these factors contribute to the multifunctionality of many genes. We therefore need to give each gene specific environmental information to distinguish its functions with respect to the disease class being identified. For one gene, this specific environment is reflected in two aspects: (1) which disease classes the gene's neighbors belong to; and (2) how the gene's neighbors affect the gene.
In this study, we take the node-induced sub-network as the represented object and regularize it to overcome the limitation of node degree, and we add the environmental characteristics of each node's neighbors to address the multifunctionality of genes, as shown in Figure 1C. We then find a low-dimensional network space for a network, so that topological networks are transferred into image-like datasets that can be fed directly to a convolutional neural network for identifying cancer-related genes, as shown in Figure 1D.

Data Sources
Seven biological networks are employed in this study, comprising four PPI networks, one pathway co-occurrence network, one gene co-expression network, and one DNA methylation dataset. The PPI networks are collected from HPRD (Release 9), BioGrid (3.4.143), IntAct (4.2.3.2), and InWeb_IM (2016_09_12). The first three are binary PPI networks, while the last one is a weighted PPI network. The pathway dataset is downloaded from MSigDB (c2.all.v5.2). The expression profiles are obtained from ArrayExpress (E-TABM-305). The DNA methylation dataset is collected from GEO (GSE36064). We selected the node entries that appear in at least six datasets, which resulted in 9189 identical vertices after blurring the differences between proteins and genes (Chen et al., 2017).
The known gene-disease associations are obtained from Goh et al. (2007) and the OMIM dataset; 1285 of the associated genes overlap with the 9189 genes above. There are 22 classes of diseases, such as cancer, bone, ear-nose-throat, and hematological. Since only genes related to the cancer class exhibit dense connections in the human disease gene network, we take the cancer class as an example in this study and evaluate our proposed method on identifying cancer-related genes.

Low-Dimensional Network Space
A graph $G = \langle V, E \rangle$ is commonly used to represent a network, where $V$ is the vertex set and $E$ is the edge set (Cohen and Havlin, 2010). The space of the adjacency matrix $A$ of $G$ is called an $n$-dimensional network space, where $n$ is the number of nodes. Network representation aims to learn a low-dimensional vector space for a network, in contrast with the $n$-dimensional space (Cui et al., 2017). We therefore need to choose an $m$-dimensional network space as the low-rank space, where $m \ll n$. That is, for every node, we need to choose $m - 1$ nodes as its neighbors and add environmental characteristics through the relevant information of these neighbors.

Embedding
FIGURE 2 | An example of a node's representation, where the $n$-dimensional network space is reduced to the $m$-dimensional network space.

A one-to-one mapping Ŵ from the $n$-dimensional network space to the $m$-dimensional network space is established, as illustrated in Figure 2. A sub-network with $m$ nodes is obtained after embedding nodes into the $m$-dimensional graph space as follows.
Firstly, we select the node sequence by similarity. We take each row of the adjacency matrix as the vector representation of a vertex, e.g., $a_i = (a_{i,1}, a_{i,2}, \cdots, a_{i,n})$ is the vector representation of the node $v_i$. A similarity $S_{i,j}$ between node $v_i$ and node $v_j$ is defined on these vectors. The larger $S_{i,j}$ is, the more consistently node $v_i$ and node $v_j$ influence the other vertices, i.e., node $v_i$ is similar to node $v_j$ in this network, which means they may have similar biological functions or take part in similar cellular processes. We then rearrange the genes to facilitate the selection of $m - 1$ neighbors. An agglomerative hierarchical clustering algorithm is employed to cluster the vertices in the network, and a sequence of leaf vertices $v_1', v_2', \cdots, v_n'$ is obtained from the clustering tree. Vertices with higher similarity are closer in this sequence, while those with lower similarity are farther apart.
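The node sequence selection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity measure is taken to be cosine similarity between adjacency rows, and the clustering uses average linkage, neither of which is specified in the text. The function names `cosine_similarity`, `similarity_matrix`, and `leaf_sequence` and the toy network are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two adjacency rows (an assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(adj):
    """Pairwise similarity S[i][j] between the row vectors of the adjacency matrix."""
    n = len(adj)
    return [[cosine_similarity(adj[i], adj[j]) for j in range(n)] for i in range(n)]

def leaf_sequence(sim):
    """Average-linkage agglomerative clustering; returns the leaf order of the tree."""
    clusters = [[i] for i in range(len(sim))]  # each cluster keeps its leaves in order
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
                avg = sum(sim[i][j] for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]  # merging concatenates the leaf orders
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return clusters[0]

# Toy network: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
order = leaf_sequence(similarity_matrix(adj))
print(order)
```

The quadratic search over cluster pairs is only suitable for small examples; for a network with thousands of vertices, a library routine for hierarchical clustering would be the practical choice.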
Secondly, we construct the neighborhood network. Given a vertex $v_i'$, a neighborhood field of $2k + 1$ vertices is obtained by taking $v_i'$ as the center of a receptive field with radius $k$, where $m < k < n$. From this field, $m - 1$ vertices are then selected according to their similarity to the center.

Thirdly, we normalize the network. The selected vertices are embedded into an $m$-dimensional network space, and a sub-network with $m$ vertices is obtained as the representation of $v_i'$. The diagram is shown in Figure 3.
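The neighborhood construction can be illustrated with a short Python sketch. It assumes that the m − 1 selected vertices are simply those with the highest similarity to the center inside the receptive field, and that the field is truncated at the ends of the leaf sequence; the text does not state either rule explicitly. The function `select_neighborhood` and the toy similarity values are hypothetical.

```python
def select_neighborhood(order, sim, center_pos, k, m):
    """Return the center vertex plus its m - 1 most similar vertices within radius k."""
    center = order[center_pos]
    lo = max(0, center_pos - k)                # field is truncated at the sequence ends
    hi = min(len(order), center_pos + k + 1)
    window = [v for v in order[lo:hi] if v != center]
    # Rank candidates in the receptive field by similarity to the center (assumed rule).
    ranked = sorted(window, key=lambda v: sim[center][v], reverse=True)
    chosen = set(ranked[: m - 1])
    # Keep the leaf-sequence order so the sub-network follows the clustering tree.
    return [v for v in order[lo:hi] if v == center or v in chosen]

# Toy data: 7 vertices already in leaf order, similarity decaying with distance,
# plus two hand-set high similarities to make the selection non-trivial.
n = 7
order = list(range(n))
sim = [[1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
sim[3][6] = sim[6][3] = 0.9
sim[3][1] = sim[1][3] = 0.7

nodes = select_neighborhood(order, sim, center_pos=3, k=4, m=3)
print(nodes)
```

Here m = 3 and k = 4 satisfy m < k < n, so the receptive field always contains more candidates than are needed.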

Transferring
The adjacency matrix can then be rearranged according to this leaf sequence. By doing this, an $m \times m$ sub-adjacency matrix $A_{v_i'}$ of the above sub-network of vertex $v_i'$ is obtained. The sub-adjacency matrix $A_{v_i'}$ fully preserves the surrounding topological structure of vertex $v_i'$, which reflects how the surrounding vertices affect vertex $v_i'$.
Moreover, each vertex may also belong to a disease class according to the gene-disease associations. We choose the disease class information of the neighbors of vertex $v_i'$ as its environmental characteristics. A classification matrix $C_{v_i'}$ is generated by taking the disease class information of its $m - 1$ neighbors in the sub-network as the diagonal elements.
Considering the multiple tags caused by the multifunctionality of genes, we need to process the disease class information of the neighbors of vertex $v_i'$. First, for the disease class to be identified, $t_{id}$, and the vertex $v_i'$ with tag set $T(v_i') = \{t_1, t_2, \cdots, t_j\}$, we define a tag $t_{center}$ of the vertex $v_i'$. Second, for the tag $t_{center}$ and a vertex $v_{i_k}'$ with tag set $T(v_{i_k}') = \{t_1, t_2, \cdots, t_j\}$, we define the value $c_{i_k}$ in terms of $d(t_i, t_j)$, the centroid linkage of $t_i$ and $t_j$.

Then an image-like two-dimensional matrix $E_{v_i'}$ is obtained by adding $A_{v_i'}$ and $C_{v_i'}$ together. This image-like two-dimensional matrix of the sub-network is the network representation of vertex $v_i'$, which preserves rich structural information from the sub-network's weighted adjacency matrix and important network properties from the classification matrix.
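As a rough illustration of the transferring step, the following Python sketch builds the image-like matrix as the sum of the sub-adjacency matrix and a diagonal classification matrix. The class values are passed in as given numbers because their exact definition is not reproduced here; leaving the center's diagonal entry at zero is likewise an assumption. The names `sub_adjacency`, `image_like_matrix`, and `class_values` are hypothetical.

```python
def sub_adjacency(adj, nodes):
    """m x m sub-adjacency matrix restricted to the selected vertices, in their order."""
    return [[adj[u][v] for v in nodes] for u in nodes]

def image_like_matrix(adj, nodes, class_values):
    """E = A + C, where C is diagonal and holds one class value per selected vertex."""
    e = sub_adjacency(adj, nodes)
    for idx, v in enumerate(nodes):
        # Vertices without a class value (e.g. the center) contribute 0 (assumption).
        e[idx][idx] += class_values.get(v, 0.0)
    return e

# Toy weighted network with 4 vertices; vertex 1 is the center, 0 and 3 its neighbors.
adj = [
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.5, 0.3],
    [0.1, 0.5, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.0],
]
nodes = [0, 1, 3]
class_values = {0: 1.0, 3: 0.5}  # illustrative values only

E = image_like_matrix(adj, nodes, class_values)
for row in E:
    print(row)
```

Because the adjacency matrix of a simple network has a zero diagonal, the topological information (off-diagonal) and the class information (diagonal) do not overwrite each other in the sum.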

EXPERIMENTS AND RESULTS

The Comparison of Traditional Machine Learning Methods and Convolutional Neural Network
Considering the single weighted PPI network (InWeb_IM), we do a binary classification of cancer-related genes as an example. The bold values indicate the best prediction accuracy in that column.
A total of 1,285 genes are known to be related to the 22 disease classes. Of these, the 178 cancer-related genes are taken as positive samples, and the same number of negative samples are randomly selected from the other disease classes.
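The balanced sampling described here can be sketched as follows. This is a minimal example under assumptions: genes that also carry the positive label are excluded from the negative pool, and a fixed random seed is used for reproducibility; the text specifies neither. The function `balanced_samples` and the gene identifiers are hypothetical placeholders.

```python
import random

def balanced_samples(positive_genes, other_genes, seed=0):
    """Pair every positive gene with one randomly drawn negative gene."""
    rng = random.Random(seed)
    positives = sorted(set(positive_genes))
    # Exclude multi-tag genes that are also positive from the negative pool (assumption).
    candidates = sorted(set(other_genes) - set(positives))
    # random.sample raises ValueError if there are fewer candidates than positives.
    negatives = rng.sample(candidates, len(positives))
    return [(g, 1) for g in positives] + [(g, 0) for g in negatives]

cancer = ["G1", "G2", "G3"]                    # placeholder gene identifiers
others = ["G2", "G4", "G5", "G6", "G7", "G8"]  # G2 is tagged with both classes
samples = balanced_samples(cancer, others)
print(samples)
```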
Since there are 22 disease classes, one of which is unclassified, we choose a 21-dimensional network space as the low-rank $m$-dimensional network space. After applying the network representation method, the genes of the network are embedded into the 21-dimensional network space, and the resulting 21-dimensional sub-networks are transferred into image-like 21 × 21 matrices.
We flatten the image-like 21 × 21 matrix into a 441-dimensional vector and perform binary classification of cancer-related genes using traditional machine learning methods, such as Decision Trees, Support Vector Machines, and Naive Bayes. At the same time, we feed the image-like 21 × 21 matrix directly to a Convolutional Neural Network to perform the same binary classification. Results are shown in [...] method. The SVM(Gau) method has roughly the same ability to identify positive and negative samples.

Frontiers in Cell and Developmental Biology | www.frontiersin.org
For the Convolutional Neural Network, performance is better than that of the five traditional machine learning methods according to the F1-measure and the AUC values. The classification results show that this classifier has roughly the same ability to identify positive and negative samples, without favoring either. Table 1 shows the results of comparing the Convolutional Neural Network method with the five traditional machine learning methods.

The Comparison of Different Networks by Convolutional Neural Network
To verify that our network representation method is applicable to different types of networks, we consider all seven biological networks, which include four PPI networks, one pathway co-occurrence network, one gene co-expression network, and one DNA methylation dataset. We again perform a binary classification of cancer-related genes by Convolutional Neural Network as an example. Results are shown in Figure 6.
Among the four PPI networks (NT_HPRD, NT_BioGrid, NT_IntAct, and NT_InWeb), the weighted PPI network (NT_InWeb) performs best. This is in line with our expectations, because a weighted PPI network contains more detailed information than a binary PPI network. Among all seven networks, the weighted PPI network (NT_InWeb) gives the best performance in the binary classification of cancer-related genes; the other networks also perform well, except for the DNA methylation network (NT_meth).
Through this comparison, we find that a network with more information is more conducive to the identification of cancer-related genes. We therefore merge the information of the seven networks to identify cancer-related genes. We choose the node-induced sub-network obtained from the weighted PPI network (NT_InWeb) and evenly add the image-like 21 × 21 matrices that correspond to this node-induced sub-network in the other six networks. We then perform a binary classification of cancer-related genes by Convolutional Neural Network and compare the result with that of the single weighted PPI network (NT_InWeb). Results are shown in Figure 7.
Although there is little difference in AUC between the two, the AUC of the seven merged networks is slightly higher than that of the single network. In addition, the best threshold, i.e., the one with the least classification error, is lower for the seven merged networks than for the single network. Table 2 shows the results of comparing the single network (NT_InWeb) with the seven merged networks.
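For readers unfamiliar with the two quantities compared here, the following Python sketch shows one common way to compute them from predicted scores. It is an illustration with made-up labels and scores, not the evaluation code used in the study; the functions `auc_score` and `best_threshold` are hypothetical, and candidate thresholds are restricted to the observed scores as a simplifying assumption.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(labels, scores):
    """Threshold (among the observed scores) with the fewest misclassifications."""
    best_t, best_err = None, None
    for t in sorted(set(scores)):
        # A sample is predicted positive when its score is at least the threshold.
        err = sum((s >= t) != (y == 1) for y, s in zip(labels, scores))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err / len(labels)

# Made-up example: three positive and three negative samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = auc_score(labels, scores)
threshold, error_rate = best_threshold(labels, scores)
print(auc, threshold, error_rate)
```

When several thresholds give the same minimal error, this sketch returns the lowest one.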

The Comparison With Previous Network Representation Method
To verify that our network representation method is better than a general network representation method without environmental characteristics, we compare the two on the single weighted PPI network (InWeb_IM). After the node sequence selection by similarity, we choose $v_i', v_{i+1}', \cdots, v_{i+20}'$ as the neighborhood of the node $v_i'$ and rearrange the 21 × 21 adjacency matrix according to this neighborhood sequence as the representation of node $v_i'$. We again perform a binary classification of cancer-related genes by Convolutional Neural Network as an example. Results are shown in Figure 8. Our novel network representation method clearly achieves a higher AUC value than the general network representation method without environmental characteristics, and its best threshold is also lower.
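The baseline representation used in this comparison can be sketched as taking a block of consecutive vertices from the leaf sequence and extracting their sub-adjacency matrix, with no class information added. The function `baseline_representation` and the toy data are hypothetical, and raising an error when fewer than m vertices remain near the end of the sequence is an assumption, since the text does not say how that case is handled.

```python
def baseline_representation(adj, order, i, m=21):
    """Sub-adjacency matrix of the m consecutive leaf vertices starting at position i."""
    nodes = order[i:i + m]
    if len(nodes) < m:
        # Boundary handling is not described in the text; fail loudly here (assumption).
        raise ValueError("fewer than m vertices remain after position i")
    return [[adj[u][v] for v in nodes] for u in nodes]

# Toy network: a 4-cycle 0-1-2-3-0, with an arbitrary leaf sequence.
adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
order = [2, 0, 3, 1]

block = baseline_representation(adj, order, i=1, m=3)
for row in block:
    print(row)
```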

CONCLUSIONS
In this paper, we have proposed a novel network representation method that finds a low-dimensional network space for a network by transferring topological networks into image-like datasets, which can be fed directly to a Convolutional Neural Network. Compared with traditional machine learning methods, our network representation method allows network data to be processed directly by a Convolutional Neural Network for identifying disease related genes, and it achieves a very high AUC value in the binary classification of cancer-related genes.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.