NMFNA: A Non-negative Matrix Factorization Network Analysis Method for Identifying Modules and Characteristic Genes of Pancreatic Cancer

Pancreatic cancer (PC) is a highly fatal disease, yet its causes remain unclear. Comprehensive analysis of different types of PC genetic data plays a crucial role in understanding its pathogenic mechanisms. Currently, non-negative matrix factorization (NMF)-based methods are widely used for genetic data analysis. Nevertheless, it is a challenge for them to integrate and decompose different types of genetic data simultaneously. In this paper, a non-NMF network analysis method, NMFNA, is proposed, which introduces a graph-regularized constraint to the NMF, for identifying modules and characteristic genes from two-type PC data of methylation (ME) and copy number variation (CNV). Firstly, three PC networks, i.e., ME network, CNV network, and ME–CNV network, are constructed using the Pearson correlation coefficient (PCC). Then, modules are detected from these three PC networks effectively due to the introduced graph-regularized constraint, which is the highlight of the NMFNA. Finally, both gene ontology (GO) and pathway enrichment analyses are performed, and characteristic genes are detected by the multimeasure score, to deeply understand biological functions of PC core modules. Experimental results demonstrated that the NMFNA facilitates the integration and decomposition of two types of PC data simultaneously and can further serve as an alternative method for detecting modules and characteristic genes from multiple genetic data of complex diseases.


INTRODUCTION
Pancreatic cancer (PC) is a highly fatal disease of the digestive system and it is becoming an increasingly common cause of cancer mortality, yet its pathogenic mechanisms remain unclear (Mizrahi et al., 2020). Therefore, comprehensively analyzing multiple types of PC genetic data to understand its pathogenic mechanisms has become a hot topic and many studies have been conducted. For instance, Wu et al. (2011) applied the lasso penalized Cox regression to transcriptome data to identify genes that are directly related to PC survival. Yang et al. (2013) identified thousands of differentially expressed genes of PC and then six genes were predicted to be involved in PC development. Gong et al. (2014) integrated pathway information into PC survival analysis and applied the doubly regularized Cox regression model to microarray data to identify both PC-related genes and pathways. Kwon et al. (2015) used the support vector machine to evaluate the diagnostic performance of PC biomarkers based on miRNA and mRNA expression data. Tao et al. (2016) performed a comprehensive search of electronic literature sources to evaluate the association between K-ras gene mutations and PC survival. Li et al. (2018) identified two hub genes of PC from the integrated microarray data and then validated them in RNA-sequencing data by k-nearest neighbor and random forest algorithms. These studies provided several underlying biomarkers and can help cancer researchers design new strategies for the early detection and diagnosis of PC (Gong et al., 2014).
Currently, non-negative matrix factorization (NMF)-based methods are widely used for genetic data analysis. For example, Mishra and Guda (2016) applied the NMF to genome-scale methylome analysis of PC data and detected three distinct molecular subtypes. Wang et al. (2013) proposed the maximum correntropy criterion-based NMF (NMF-MCC) method for cancer clustering on gene expression (GE) data. Zhao et al. (2018) used the NMF bi-clustering method to identify subtypes of pancreatic ductal adenocarcinoma, which is the most common type of PC. Xiao et al. (2018) proposed the graphregularized NMF to discover potential associations between miRNAs and diseases. These methods show that the NMF is a powerful tool for genetic data analysis. Nevertheless, it is a challenge for them to integrate and decompose different types of genetic data simultaneously. Zhang et al. (2012) adopted the joint NMF (jNMF) method to address this challenge, which projects multiple types of genomic data onto a common coordinate system, and applied the jNMF to the methylation (ME), GE, and miRNA expression data of ovarian cancer to identify cancer-related multidimensional modules. Yang and Michailidis (2016) introduced the integrative NMF (iNMF) to analyze multimodal data, which includes a sparsity option for jointly decomposing heterogeneous data, and also evaluated the iNMF on ME, GE, and miRNA expression data of ovarian cancer. These integrated NMF methods can reveal pathogenic mechanisms that would have been overlooked with only a single type of data, and uncover associations between different layers of cellular activities (Zhang et al., 2012).
However, most of these NMF-based methods only consider individual genetic effects and ignore interaction effects among different features. It has been widely accepted that interaction effects could unveil a large portion of unexplained pathogenic mechanisms of cancers . For capturing these interaction effects, several NMF-based network analysis methods have been proposed due to network facilitating presenting interactions between features. Liu et al. (2014) developed a network-assisted co-clustering algorithm for the identification of cancer subtypes, which first assigns weights to genes based on their impact in the network, and then utilizes the non-negative matrix trifactorization (TriNMF) to cluster cancer patients (Ding et al., 2006). Chen and Zhang adopted the NMF framework in a network manner (NetNMF) to integrate pairwise genomic data for identifying two-level modular patterns and the relationships among these modules (Chen and Zhang, 2018). Elyanow et al. (2020) proposed the netNMF-sc method to cluster cells based on prior knowledge of gene-gene interactions. Nevertheless, the netNMF-sc ignored interaction effects among different features and used the decomposed submatrix to construct the network, which might weaken the internal connection between nodes in the network. Gao et al. (2019) proposed the integrated graph-regularized NMF (iGMFNA) model for clustering and network analysis of cancers, which decomposes the integrated data into submatrices for constructing networks. Zheng et al. (2019) used the NMF to integrate ME and copy number variation (CNV) networks for identifying prognostic biomarkers in ovarian cancer. These NMF-based network analysis methods provide new insights into the pathogenic mechanisms of cancers, especially their interaction effects.
Inspired by both integration and network-assisted strategies of the NMF, in this paper, we presented a NMF network analysis method, NMFNA for short, based on graph-regularized constraint, to identify modules and characteristic genes from integrated ME and CNV data of PC. Firstly, the Pearson correlation coefficient (PCC) is employed to construct three PC networks, i.e., ME network, CNV network, and ME-CNV network. Then, these networks are further integrated and decomposed simultaneously to identify modules effectively due to the introduced graph-regularized constraint, which is the highlight of the NMFNA. Finally, both gene ontology (GO) and pathway enrichment analyses are performed, and characteristic genes are detected by the multimeasure score, to deeply understand biological functions of PC core modules. Experimental results demonstrated that the NMFNA facilitates the integration and decomposition of two types of PC data simultaneously and can further serve as an alternative method for detecting modules and characteristic genes from multiple genetic data of complex diseases.

Non-negative Matrix Factorization Methods
The NMF (Lee and Seung, 1999) and its variants have been increasingly applied to identify modules in biological networks (Chen and Zhang, 2018;Wang et al., 2018;Gao et al., 2019). For a biological network X m×n , the NMF can decompose it into two non-negative matrices U m×k and V k×n , such that X ≈ V, where k < min (m, n). The Euclidean distance between X and its approximation matrix UV is applied to minimize the factorization error, which can be written as, where ||·|| 2 F is the Frobenius norm of a matrix. Since it is difficult to find a global minimal solution by optimizing the convex and non-linear objective function, the NMF adopts the multiplicative iterative update algorithm to approximate U and V, In addition to the two-factor NMF, the three-factor NMF also plays an important role in matrix factorization, which constrains scales of U and V by an extra factor S, i.e., X ≈ USV. This factored matrix S not only provides an additional degree of freedom to make the approximation tight, but also indicates associations between identified modules (Chen and Zhang, 2018). A three-factor NMF variant TriNMF (Ding et al., 2006) is defined as, which minimizes the objective function by, Particularly, if X is the symmetric similarity matrix, it could be decomposed into USU T . For pairwise biological networks with the same samples but two types of features, X m 1 ×n 1 and X m 2 ×n 2 , combining the idea of two-factor and three-factor NMF, the NetNMF (Chen and Zhang, 2018) is defined as, min G1,G2,S11,S22 where R m 1 ×m 1 11 and R m 2 ×m 2 22 are the symmetric feature similarity matrices of X 1 and X 2 , respectively, that is, their respective co-expression networks; R m 1 ×m 2 12 is the two-type feature similarity matrix (co-expression network) between X 1 and X 2 ; G m 1 ×k 1 and G m 2 ×k 2 are the non-negative factored matrices used for identifying modules in their respective networks; S k×k 11 and S k×k 22 are also symmetric matrices whose diagonal elements can be used for measuring associations between identified modules; k is the user prespecified dimension parameter; α and β are used to balance three terms of the objective function and default settings are m 1 m 2 and m 1 m 2 2 , respectively (Chen and Zhang, 2018). The NetNMF minimizes the objective function by,

Non-negative Matrix Factorization Network Analysis Method
In order to further improve the capability of identifying modules and capturing interaction effects, we proposed the NMFNA method by introducing the graph-regularized constraint into the NetNMF, which can preserve the inherent geometrical structure of input networks (Deng et al., 2011). For demonstrating its effectiveness, in the study, we applied the NMFNA to twotype PC data of ME and CNV to identify modules and characteristic genes. In fact, the NMFNA is universally useful and can be applied to any type of genetic data in various complex diseases. The overall workflow of the NMFNA for identifying modules and characteristic genes by integrating ME and CNV data of PC is shown in Figure 1. It is seen that the NMFNA mainly has three stages. In the first stage, three co-expression networks are constructed from ME and CNV data of PC. In the second stage, these three networks are applied to the objective function to identify modules. In the third stage, both GO and pathway enrichment analyses are performed, and characteristic genes are detected, to deeply understand biological functions of PC core modules. Among them, the objective function, which introduces the graph-regularized constraint, is the highlight of the NMFNA.
The graph-regularized constraint indicates the inherent geometrical structure of the input networks. In other words, the graph-regularized constraint ensures that interactive features in the Euclidean space are also close to each other in the low-dimensional space, which is defined as, where Z is the sparse weight matrix established using the geometrical information of X (Xiao et al., 2018), and Zij represents the similarity between gene v i and v j . Tr (·) represents the trace of a matrix, and G is the factored matrix of the biological network X by the NMF. Define a diagonal matrix D whose elements are column sums of matrix Z, that is, D ii = i Z ij , and L is the graph Laplacian matrix defined as: FIGURE 1 | The overall workflow of the non-negative matrix factorization network analysis (NMFNA) for identifying modules and characteristic genes by integrating methylation (ME) and copy number variation (CNV) data of pancreatic cancer (PC).
Based on the NetNMF and the graph-regularized constraint, the objective function of the NMFNA is defined as, where R 11 and R 22 are the ME and CNV co-expression networks, R 12 is the ME-CNV co-expression network, λ is the tuning parameter to adjust the closeness between interactive features, and other notation meanings and parameter settings are the same as those in the NetNMF.
The multiplicative iterative update algorithm is adopted here to minimize the objective function of the NMFNA. Suppose ψ 1 , ψ 2 , ψ 3 , and ψ 4 are matrices of Lagrange multipliers that, respectively, constrain S 11 ≥ 0, S 22 ≥ 0, G 1 ≥ 0, and G 2 ≥ 0, the Lagrange function f of the NMFNA is, Hence, partial derivatives of f with respect to S 11 , S 22 , G 1 , and G 2 are, According to Karush-Kuhn-Tucher conditions (Dreves et al., 2011), i.e., ψ 1 S 11 = 0, ψ 2 S 22 = 0, ψ 3 G 1 = 0, and ψ 4 G 2 = 0, iterative formulas can be written as, Frontiers in Genetics | www.frontiersin.org Two types of modules, namely, ME modules and CNV modules, can be identified from R 11 and R 22 guided by G 1 and G 2 , respectively. In particular, G 1 and G 2 are first z-score normalized; then for each column (1, · · · , k) of them, those genes whose corresponding weights are greater than or equal to the threshold are considered as a cluster; finally, according to these clusters, subnetworks of R 11 and R 22 can be captured as ME modules and CNV modules. Here, the threshold is set to be 2 according to the reference (Chen and Zhang, 2018). In addition, similar to previous studies Zhao et al., 2020), modules with the most genes are known as core modules.
To identify characteristic genes from core modules, which may play an important role in deeply understanding the biological functions of modules, we employ the multimeasure score (MS) to numerically quantify the importance of each gene, which is defined as, where DC (v), BC (v), and EC (v) are the degree centrality, betweenness centrality, and eccentricity centrality of gene v in R 11 and R 22 , which are networks filtered from R 11 and R 22 , respectively, with edge weights higher than a given threshold. Betweenness centrality and eccentricity centrality focus on the global feature of a gene in the network, while degree centrality focuses on the local feature of a gene in the network ; hence, the MS combines both global and local features of a gene.

Data and Parameter Settings
Two types of PC data, i.e., ME data and CNV data, are downloaded from the TCGA database 1 . These two data have the same samples (176 tumor samples and 4 normal samples) but different features: 21,031 methylations in ME data and 23,627 CNVs in CNV data. Based on these PC data, three co-expression networks, i.e., the ME network R 21,031×21,031 11 , the CNV network R 23,627×23,627

22
, and the ME-CNV network R 21,031×23,627 12 , are constructed using the PCC. Besides, to further prove the experimental results of two types of PC data, we also analyze the GEO datasets of PC. Four profile datasets, i.e., GSE62452, GSE15471, GSE16515, and GSE28735, of PC are 1 https://cancergenome.nih.gov/ downloaded from the GEO database 2 for this study. Details of these four datasets are shown in Table 1.
In the NMFNA, four parameters should be set, which is, tuning parameters λ 1 and λ 2 , the dimension parameter k, and the iteration number. λ reflects the degree of imposed graph-regularized constraint. A large one focuses on reaching consensus across views, while a small one cannot tolerate matrix factorization error (Liu et al., 2013). Since these different items have no distinction of importance, and the dimension reduction parameter k has a greater impact compared with parameter λ, considering the convenience of comparison, both λ 1 and λ 2 are set to be the same value. We run the NMFNA with different λ values ranging from 0 to 0.1 to select the proper one based on the measure of total module similarity (Wang et al., 2018), which is defined as, where M x represents members in module x. According to experiment results (Figure 2A), λ 1 and λ 2 are set to be 0.03. The dimension parameter k is determined by the singular value decomposition method (Qiao, 2015), and its first inflection point, i.e., 6,834, is selected to k. In order to reduce the decomposition error, a large iteration number is used. In the study, we set it to 200 since the decomposition error here has already reached a relatively stable state (Figure 2B).

GO and Pathway Enrichment Analyses
To demonstrate the validity of the NMFNA, we compared it with the NMF, TriNMF, and NetNMF by performing GO and pathway enrichment analyses on their respective identified core modules. The GO enrichment analysis was carried out by an online tool, DAVID Bioinformatics Resources 3 (Dennis et al., 2003). The pathway enrichment analysis was conducted using the KOBAS v3.0 web server 4 (Kanehisa et al., 2016), in which, the KEGG pathway, BioCyc, PANTHER, and Reactome databases were used. The numbers of GO terms and pathways (p-value < 0.05) obtained from enrichment analyses of ME and CNV core modules identified by the compared methods are shown in Figure 3. It is seen that either ME or CNV core modules identified by the NMFNA have more GO terms and pathways than those identified by other methods, implying that modules identified by the NMFNA might contain more

FIGURE 3 | Numbers of GO terms and pathways obtained from enrichment analyses of ME and CNV core modules that identified by compared methods. (A)
Represents the details of the ME core module and (B) represents the details of the CNV core module.
biological information related to understanding the pathogenic mechanisms of PC. The numbers of enriched genes in GO terms obtained from the enrichment analyses of ME and CNV core modules identified by the NMFNA are shown in Figure 4. It is seen that for the ME core module, genes are mainly enriched in GO:0005515, GO:0005737, GO:0005829, GO:0070062, GO:0005654, GO:0016020, GO:0005615, and GO:0005739, corresponding to protein binding, cytoplasm, cytosol, extracellular exosome, nucleoplasm, membrane, extracellular space, and mitochondrion; for the CNV core module, genes are mainly enriched in GO:0005515, GO:0006351, GO:0003676, and GO:0005789, corresponding to protein binding, DNAtemplated, DNA binding, and endoplasmic reticulum. Details of these significant GO terms are listed in Table 2. Several studies have confirmed that these GO terms contribute to the development of PC cells. The protein binding (GO:0005515) is the most significantly enriched GO term among molecular functions in both ME and CNV core modules. As one of the specific binding protein, IGF binding protein-1 has been confirmed to inhibit the activity of insulin-like growth factor I, which has growth-promoting effects on PC cells (Wolpin et al., 2007). The cytoplasm (GO:0005737) plays an important role in the development of PC by regulating the expression of carbonic anhydrase IX (Juhász et al., 2003). As a member of the cadherin/catenin family, p120(ctn) has been found in the cytosol (GO:0005829) of PC cells (Mayerle et al., 2003), which is correlated to the degree of tumor dedifferentiation. The fractional volume of the extracellular space (GO:0005615) in the PC tissue has been reported to be statistically larger than that in the normal tissue (Yao et al., 2012). The novel mitochondrion interfering compound NPC-26 may effectively inhibit the growth of PC cells by destroying the mitochondria (GO:0005739) (Dong et al., 2016). Genes involved in the DNA-templated (GO:0006351) have been clinically used for treating lung cancer (Lu et al., 2016), which might be speculated to affect other cancers by their pan-cancer co-regulation mechanisms. The nicotine can induce the inhibitor of the DNA binding (GO:0003676) and has been confirmed as an established risk factor for PC (Treviño et al., 2011). The endoplasmic reticulum (GO:0005789) has been identified as the key target in PC, which shows its potential for antitumor drug development (Gajate et al., 2012). Other three GO terms, including extracellular exosome (GO:0070062), nucleoplasm (GO:0005654), and membrane (GO:0016020) also have been reported to associate with PC (Sakai et al., 2019;Zhou et al., 2019).
Among all pathways obtained from enrichment analyses of ME and CNV core modules identified by the compared methods (Figure 3), we recorded common pathways that appear in at least three out of four methods in Table 3.  It is seen that in terms of p-values, as well as p-values corrected by the false discovery rate (FDR), namely adj.P, the NMFNA performs best among all compared methods, implying that pathways enriched in core modules identified by the NMFNA are more significant than those enriched in core modules identified by other compared methods. To further analyze these core modules, the top 10 pathways according to their adj.P enriched in ME and CNV core modules identified by the NMFNA are listed in Figure 5, in which, the node size and color represent the number of genes enriched in the pathway and the significance of the pathway, respectively. Specifically, two pathways, i.e., transport of small molecules and arachidonic acid metabolism, have already been reported to be associated with PC. The former can filter downregulated differentially expressed genes of PC, while the latter can promote the progress of PC adj.P is the p-value corrected by the FDR. " \" represents that the pathway cannot be obtained from enrichment analyses of ME and CNV core modules that were identified by the corresponding method.  (Biswas et al., 2014;Long et al., 2016). To aid early diagnosis, the metabolism pathway can be used individually or in combination to differentiate people with and without PC. The metabolism of xenobiotics by cytochrome P450 pathway has been considered as an important pathway associated with the progression of cancer (Hu and Chen, 2012). Besides, three other metabolism-related pathways, namely, lipid metabolism and autophagy, glutamine-regulatory enzymes, and Akt/c-Myc pathway (Dando et al., 2013;Blum and Kloog, 2014), have been identified to directly affect the growth of PC cells. The small molecule metabolic process also has been found as the enriched pathway for the biological process of PC (Hu et al., 2017). The identification of immune system-related regulation pathways has been reported to provide several new insights for PC treatment and prognosis (Yang and Michailidis, 2016).

Analysis of Characteristic Genes
In order to further demonstrate the validity of the NMFNA and deeply understand biological functions of core modules, we detect and analyze characteristic genes from these core modules. First, ME and CNV core modules identified by four compared methods are filtered by removing weak edges with their PCC less than or equal to 0.8. Second, based on these filtered networks, the MS of each gene in the corresponding module is calculated, which can be considered as its contribution to interactions among genes in the module. Third, genes in each core module are sorted in descending order according to the MS, and the top ones are viewed as characteristic genes. In the study, the top 10, 30, and 50 characteristic genes in each core module are, respectively, selected and sent to GeneCards 5 to measure their relevance scores, which represent association strengths between corresponding genes and PC. After that, for each compared method, relevance scores of characteristic genes in both ME and CNV core modules are summated together and recorded in Table 4. It is seen that in all scenarios, scores of the NMFNA are significantly larger than the scores of other compared methods, which implies that characteristic genes detected by the NMFNA are more relevant  to PC and are more likely to reveal the pathogenic mechanisms of PC. In addition, for each compared method, PC-related genes in the top 10 characteristic genes of both ME and CNV core modules are retrieved from GeneCards and listed in Table 5. It is seen that the NMFNA hits eight genes and is the winner of all compared methods. We then analyzed in detail the biological functions of these PC-related genes. The TNKS1BP1 has been reported to regulate cancer cell invasion, which might further affect the progression of PC (Mayerle et al., 2003). The TK1 level is upregulated 4-fold in the mice PC specimen (Yao et al., 2012); therefore, we naturally speculate that it might also play a potential role in the human PC. The variation of the KCNJ1 has been claimed to be associated with diabetes (Farook et al., 2002), which is a closely related disease to PC and is generally thought of as the important risk factor of PC. As an independent prognostic biomarker, the SDC1 has been confirmed to be upregulated in PC (Juuti et al., 2005). Since the SDC1 is an important paralog of the SDC3, we infer that the SDC3 might be related to PC. The NR3C2 has been identified as the target of miR-135b-5p, which promotes migration and invasion of PC cells (Zhang et al., 2017). Though the TTC21A, USF1, and MAN1C1 have been marked as PC-related genes in GeneCards, and they are indeed associated with several PC complicating diseases, including hyperlipidemia and alcohol-induced mental disorder, there are few supporting literature studies.
Besides, the experiments of the GEO datasets are also performed to further verify the effectiveness of modules and characteristic genes identified by the NMFNA method. Firstly, we calculated the numbers of common genes of the four datasets GSE62452, GSE15471, GSE16515, and GSE28735. As shown in Supplementary Figure 1, there are 17,327 common genes, which are more likely to be related to PC. Secondly, Table 6 lists the number of genes in the modules obtained by different methods that overlap with these common genes. It can be seen from Table 6 that genes of the core modules identified by the NMFNA method contain the largest number of common genes, which indicates that these core modules have been verified to be related to PC in different databases.

CONCLUSION
Pancreatic cancer is a disease with a poor prognosis, in which malignant cells originate in the pancreatic tissue. To understand its pathogenic mechanisms, in this study, based on NMF and graph-regularized constraint, we presented NMFNA to identify modules and characteristic genes from integrated ME and CNV data of PC. First, the ME network, CNV network, and ME-CNV network are constructed by the PCC. Then, these networks are further integrated and decomposed simultaneously to identify modules effectively due to the introduced graph-regularized constraint, which is the highlight of the NMFNA. Finally, both GO and pathway enrichment analyses are performed, and characteristic genes are detected by the multimeasure score, to deeply understand the biological functions of PC core modules. Compared with the NMF, TriNMF, and NetNMF, the NMFNA identified more PC-related GO terms, pathways, and characteristic genes in core modules, demonstrating that the NMFNA facilitates the integration and decomposition of two types of PC data simultaneously and can further serve as an alternative method for detecting modules and characteristic genes from multiple genetic data of complex diseases.
The NMFNA has several advantages. First, it performs well in integrating and decomposing different types of genetic data simultaneously. Second, introducing the graph-regularized constraint into the NMFNA eases the heterogeneity of multiple networks, which is beneficial to detect core modules. Third, the NMFNA can not only consider individual genetic effects but also capture interaction effects among different features contributing to the development of PC. Nevertheless, it still has some limitations. For instance, associations between ME modules and CNV modules, i.e., S 11 , S 22 , are not deeply analyzed in theory and experiment; it only supports the integration and decomposition of two types of genetic data and fails three or higher types. These limitations inspire us to continue working in the future.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s. J-XL analyzed the experiment results and wrote the manuscript. All authors read and approved the final manuscript.

FUNDING
This work was supported by the National Natural Science Foundation of China (61972226, 61902216, 61902430, and 61872220) and the China Postdoctoral Science Foundation (2018M642635). The funder played no role in the design of the study and collection, analysis, and interpretation of data and in the writing of the manuscript.