CBP-JMF: An Improved Joint Matrix Tri-Factorization Method for Characterizing Complex Biological Processes of Diseases

Multi-omics molecules regulate complex biological processes (CBPs), which reflect the activities of various molecules in living organisms. Meanwhile, the applications to represent disease subtypes and cell types have created an urgent need for sample grouping and associated CBP-inferring tools. In this paper, we present CBP-JMF, a practical tool primarily for discovering CBPs, which underlie sample groups as disease subtypes in applications. Differently from existing methods, CBP-JMF is based on a joint non-negative matrix tri-factorization framework and is implemented in Python. As a pragmatic application, we apply CBP-JMF to identify CBPs for four subtypes of breast cancer. The result shows significant overlapping between genes extracted from CBPs and known subtype pathways. We verify the effectiveness of our tool in detecting CBPs that interpret subtypes of disease.


INTRODUCTION
Complex biological processes (CBPs) are the coordinated effect of multiple molecules, which result in some functional pathways and the vital processes occurring in living organisms. In addition, the vast amounts of multi-omics data, such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, can be integrated to understand systems biology accurately (Suravajhala et al., 2016). Hasin et al. (2017) pointed out that a deeper and better understanding of important biological processes and modules can be obtained through multi-omics studies. However, practical tools are still missing to integrate diverse multi-omics data at different biological levels and reveal the CBPs and other problems like the causes of diseases.
Non-negative matrix factorization (NMF) (Lee and Seung, 1999) is a powerful tool for dimension reduction and feature extraction. It has been increasingly applied to diverse fields, including bioinformatics (e.g., high-dimensional genomic data analysis). For example, Brunet et al. (2004) applied NMF and consensus clustering to the gene expression data of leukemia to discover metagenes and molecular patterns. Xi et al. (2018) detected driver genes from pan-cancer data based on another matrix decomposition framework called matrix tri-factorization. Up to now, several variants of NMF have been proposed, including tri-factorization NMF (Ding et al., 2006), graph-regularized NMF (Cai et al., 2011), joint NMF (Zhang et al., 2012), iNMF (Yang and Michailidis, 2016), etc. (more details are in Supplementary Note 1 of the Supplementary Materials). In 2012, jNMF (Zhang et al., 2012) was proposed to identify multiomics modules by integrating cancer's DNA methylation data, gene expression data, and miRNA expression data. Chen and Zhang (2018) applied joint matrix tri-factorization to discover two-level modular organization from matched genes and miRNA expression data, gene expression data, and drug response data.
Omics data across the same samples contain signal values from expression counts, methylation levels, and protein concentrations, which control biological systems, resulting in so-called multi-dimensional genomic (MG) data. The natural representation of these diverse MG data is a series of matrices with measured values in rows and individual samples in columns. Recently, there are integrative analysis tools based on NMF technique that reveal low-dimensional structure patterns. The low-dimensional structure patterns reflect CBPs and sample groups while preserving as much information as possible from high-dimensional MG data (Stein-O'Brien et al., 2018).
In general, most particular matrix factorization techniques are being developed to enhance their applicability to specific biological problems. Meanwhile, the applications to represent disease subtypes (Biton et al., 2014) and cell types (Fan et al., 2016) have created an urgent need for sample grouping and associated CBP-inferring tools. Moreover, cancer and other complex diseases are heterogeneous, i.e., there are various subgroups for a cancer or a complex disease. The study of the heterogeneity of cancer and complex diseases will help us understand the disease further and provide better opportunities to disease treatment (Xi et al., 2020). To address this issue, we extend traditional jNMF and develop CBP-JMF, an improved joint matrix tri-factorization framework for characterizing CBPs that represent sample groups, and implement a Python package. This package takes labeled samples as the prior information and integrates MG data (e.g., copy number variation, gene expression, microRNA expression, and/or molecule interaction network) to identify the underlying CBPs which characterize the specific functional properties of each group. CBP-JMF can be used to mark unlabeled samples with groups of known labels. For ease of use, CBP-JMF can recommend reasonable parameter settings for users. CBPs found by CBP-JMF are connected network markers, and they are distinguished between sample groups. These markers usually have specific biological functions and play important roles in phenotypes. As an example, CBPs for subtypes of breast cancer are obtained by CBP-JMF, but they may not have been collected in any reference database yet.
The rest of this paper is organized as follows. Section "Framework of CBP-JMF" deals with the problem formulation of CBP-JMF and the implementation of it. Then, Section "Results" exemplifies our approach by applying CBP-JMF to identify CBPs for different subtypes of breast cancers and compares the results of classifying unlabeled samples with CBP-JMF and its several variants. Finally, Section "Discussion" discusses our results and lists our expectations of our method and the limitations of it. Section "Conclusions" highlights our method.

Problem Definition
Given a non-negative matrix X ∈ R m×n , it can be factorized into three non-negative matrix factors based on matrix trifactorization: X ≈ USV, where U ∈ R m×k , S ∈ R k×k , and V ∈ R k×n . Factored matrix S cannot only absorb scale difference between U and V but also indicates relationships between the identified k modules.
In CBP-JMF, given a MG dataset composed of P omics, it can be presented by multiple matrices X (1) , X (2) , ..., X (P) , as illustrated in Figure 1. For each matrix, the rows indicate molecules like genes, and the columns indicate samples; the values in it are related to the meaning of omics. If X (p) (p ∈ [1, P]) is a matrix of gene expression data, X (p) ij represents the expression value of the gene in the i-th row on the j-th sample. Basically, each non-negative matrix X (p) ∈ R m×n , p = 1, 2, ..., P is factorized into three non-negative matrix factors based on matrix tri-factorization: X (p) ≈ U (p) S (p) V, where molecular coefficient matrix (MCM) U (p) ∈ R m×k and sample basis matrix (SBM) V ∈ R k×n are the pattern indicator matrices of k CBPs and k sample groups, respectively. Scale absorbing matrix (SAM) S (p) ∈ R k×k explores the relationships between them. Furthermore, MCM describes the structure pattern between molecules (e.g., genes), SBM indicates the structure pattern between samples, and SAM absorbs the difference of scales between MCM and SBM (Figure 1). Each column of the MCM infers a latent feature associated with a CBP, and the continuous values in it represent the relative contribution of each molecule in the CBP. Meanwhile, each row of the SBM describes the relative contributions of the samples to a latent feature. The sample groups can be detected by comparing the relative weights in each row of the SBM.

Objective Function of CBP-JMF
Considering that different datasets may play different roles in data integration, we adopted a method that can learn the weights of different input data through a weighted joint tri-NMF:

where
= π (1) , π (2) , ..., π (P) . CBP-JMF differentiates the importance of datasets by the weight constraint 2 , and π (p) will get a weight to represent the contribution of data X (p) to objective function after optimization. If X (p) contributes to the optimization of cost function, then it will be given a higher weight π (p) , or if X (p) contains lots of noises which hinder the optimization of objective function, it will be given a lower weight π (p) . In addition, V can be divided into labeled V L and unlabeled V UL parts according to the labeled samples and unlabeled samples. In order to learn the correlation between labeled samples, we use a graph Laplacian to represent the distance of labeled sample in latent space (Guan et al., 2015). We use Equations (2) and (3) to denote the distance between labeled samples from the same class and different class in the learned latent space, respectively, where N L is the number of labeled samples in V, and W a (W affinity ) and W p (W penalty ) are the weighted adjacency matrices (see Supplementary Note 2 in SM) corresponding to intragroup and inter-group samples respectively. L a (L affinity ) and L p (L penalty ) are the Laplacian matrix of W a and W p , respectively, In machine learning, people try to make samples from the same class near each other in the learned latent space and samples from different class far from each other. This principle can be written as Combining weighted joint tri-NMF and the constraints of correlation between labeled samples mentioned above, we give the formulation of the optimization objective function of CBP-JMF as follows (Figure 1): Parameters β and ω represent the importance of the graph Laplacian regularization and weight constraint 2 . In total, each X (p) is factorized into individual molecular matrix U (p) and scale matrix S (p) and a common sample matrix V. We allowed all matrices to share the same sample matrix V for finding common factors in MG data. There is only a part of samples labeled (subtype or subpopulation or subgroup is known as prior information); we incorporate this prior information with graph Laplacian. We can also learn the weights of different input data to conclude the roles that different data matrices play in CBP-JMF.

Optimization and Update Rules of CBP-JMF
To solve the problem of factorization X ≈ USV, we firstly randomly initialize the solution of U, S, and V and then apply iterative multiplicative updates as the optimization Algorithm 1 | The CBP-JMF algorithm.

Input:
P data matrices X (1) , X (2) , ..., X (P) , parameters β ω Output: P basis matrices U (1) , U (2) , ..., U (P) , P relation matrices S (1) , S (2) , ..., S (P) , factor matrices V, weight vector = π (1) , π (2) , ..., π (P) 1: Begin Firstly, we fix V and S and update U; then, we can get the Lagrange function and let be the Lagrange multiplier for the constraints U The partial derivatives of L U (P) with U is: Based on the KKT conditions ij U ij = 0, we can get the following update rules: Similarly, we can get the update rules for W, V L , and V UL : As for updating of π, when U,V, and S are fixed, minimization of O(π) is a convex optimization, and we use convex optimization toolbox to update π.

CBPs Obtained From CBP-JMF
Values in each column of U (p) represent the relative contribution of each molecule in each module, and values in each row of V represent the degree of each sample involved in each module.
According to the rules of matrix multiplication, the i-th column of basis matrix U (p) , p = 1, 2, ..., P corresponds to the i-th row of coefficient matrix V, so there is a one-to-one correspondence between subtype and multi-omics module discovered from the columns of U (p) matrix. Firstly, we need to know the relationship between k modules and subtypes by counting each subtype's value in each module from V (p) matrix (see Supplementary Note 3 in Supplementary Material).
To select features associated with each module, CBP-JMF calculates the z-scores of each molecule for each column vector of U (p) as z = ( be the j-th column of U (p) and infer a latent feature associated with j-th CBP. The continuous value u (p) ij represents the relative contribution of molecule i in the jth CBP. u (p) ij can be regarded as x i , and the length of u (p) j can be regarded as n in Equation (12). CBP-JMF calculates a z-score for each value in u (p) j and obtains CBP's members through a given cutoff (z-score >2 in our tests). Then, they are mapped to a built-in molecule interaction network (see "Section 'Results"') to extract their connected components as the final CBP.

RESULTS
We applied CBP-JMF to BRCA with multi-omics data. The reason we chose BRCA as example is that breast cancer is a heterogeneous complex disease, and it is the most commonly occurring cancer. BRCA is also a type of cancer that can be divided into smaller groups based on certain characteristics of the cancer cells. Distinct complex biological processes represent different subtypes. Characterizing the processes can provide us comprehensive insights into the mechanisms of how multiple FIGURE 2 | Complex biological processes of luminal B and basal-like subtype. We mapped the genes and miRNAs obtained from luminal B's module and basal-like's module to an integrated gene regulation network. The network was obtained through integrating three databases including Reactom, Kyoto Encyclopedia of Genes and Genomes, and Nci-PID Pathway Interaction Database. The interactions between genes and miRNAs were obtained from miRTarBase. The size of the node is proportional to the size of the degree. The thickness of the edges indicates the strength of the regulatory relationship expressed by the Pearson correlation coefficient between microRNA and gene.

Data
Firstly, we downloaded the Gene Expression (GE) data, miRNA expression (ME) data, and copy number variation (CNV) data across the same set of 738 breast cancer samples from UCSC Xena (Goldman et al., 2018). Secondly, we obtained the sample label information which is classified by PAM50 from The Cancer Genome Atlas Network (Koboldt et al., 2012). Among 738 samples, there are 522 breast cancer samples with labels, including 231 luminal A, 127 luminal B, 98 triple negative/basallike, 58 HER2-enriched, and eight normal-like. Thirdly, we filtered out some samples, in which more than 90% of the genes have an expression value of zero. For genes and miRNAs, we filtered the genes and miRNAs with an expression value of zero in more than 20% of the samples. Fourthly, we did differential expression analysis for genes using edgeR package (Robinson et al., 2009) in R with P-value < 0.01 and |log(fold change)|> 0.5 to filter out genes which are not associated with breast cancer. Fifthly, we imputed missing miRNA data using knnimpute package in MATLAB. About the CNV data, the GISTIC2 (Mermel et al., 2011) thresholded the estimated values of CNV to −2, −1, 0, 1, and 2, which represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, or high-level number amplification. Finally, we obtained the GE data X (1) ∈ R 2913×725 and ME data X (2) ∈ R 516×725 . Among 725 samples, 179 samples are marked with subtype labels (80 luminal A, 38 luminal B, 39 basal-like, 22 HER2-enriched) and shared between GE, ME, and CNV datasets. Furthermore, we calculated the Pearson correlation of 179 labeled samples using CNV data to construct W a ∈ R 179×179 , W p ∈ R 179×179 , and their Laplacian matrices to form the graph Laplacian regularization tr V L L a V L T − tr V L L p V L T .

Complex Biological Processes for Breast Cancer Subtypes
In our example, we set parameters k = 4, β=10, and ω=100, 000. Other parameters and more details can be found in Supplementary Note 2 of Supplementary Material. As a result, we obtained unique matrices U (1) ∈ R 2913×4 , U (2) ∈ R 516×4 , S (1) ∈ R 4×4 , and S (2) ∈ R 4×4 and a common matrix V ∈ R 4×725 . To get heterogeneous CBPs (Supplementary Table 1), directed regulatory pathways containing miRNAs and genes, which correspond to each cancer subtype we put subtype-specific multi-omics modules obtained from matrix U (p) , p = 1, 2 onto an integrated gene regulation network from Reactome (Croft et al., 2014), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000), and Nci-PID pathway (Schaefer et al., 2009). Then, we add directed regulatory edges from miRNA to the gene supported by miRTarBase (Chou et al., 2018). Finally, we extracted the maximum connected component of the regulation network and showed the discovered characteristic CBPs underlying luminal B and basal-like subtypes in Figure 2.
To explore whether the genes in the CBPs of luminal B and basal-like subtype have significant biological importance or not, we performed an enrichment analysis with all 124 genes from Figure 2 across six datasets. The datasets are from OMIM (Hamosh et al., 2005), CGC (Futreal et al., 2004), virhostome, kinome (Manning et al., 2002), drug target (Wishart et al., 2008), KEGG pathway of BRCA (Kanehisa and Goto, 2000). Genes associated with breast cancer or breast tissue in the six datasets are selected as the set of enrichment analysis. Genes extracted through CBP-JMF have significant overlapping with known datasets (Table 1). Furthermore, for each subtype's CBP, functional enrichment analysis (Supplementary Figure 4) shows that four CBPs are mainly enriched in known biological processes and pathways associated with breast cancer, such as cell cycle and various signaling pathways (including p53 signaling pathway and estrogen pathway). However, each CBP also has its specific biological processes and path. This may explain differences between subtypes. As a demonstration, we take the CBPs of luminal B and basal-like as example. Based on the study of the subtypes of BRCA, luminal B is mainly driven by the estrogen/ER pathway (Zhang et al., 2014). In our discovered CBPs, we found several CBPs containing genes like ERBB2, ERBB3, and ESR1 that are related to the estrogen/ER pathway. Besides that, through literature review, miRNAs in luminal B's CBP can regulate the estrogen/ER pathway, such as miR-34a, miR-125b, miR-200b, and so on (Figure 3, Table 2). In addition, basal-like subtype is mainly driven by the deregulation of various signaling pathways including Notch, MAPK, and wnt/β-catenin signaling pathway (King et al., 2012). In our discovered CBPs, we found genes involved in the above-mentioned pathways, such as MAPKAPK2, CDC25B, PLK1, and so on. Besides that, we also found that miRNAs in CBPs of basal-like, such as miR-221 and miR-210, may regulate the genes above in basal-like subtype (Figure 3, Table 3). In summary, subtype-specific biological processes can be identified by CBP-JMF, and CBP-JMF can help users discover potential biological targets.
Meanwhile, to classify unlabeled samples into subtypes, CBP-JMF returned predicted labels for unlabeled samples (Supplementary Note 4 in Supplementary Material). Figure 4 shows the Kaplan-Meier (KM) survival analysis using survival package (Therneau, 2015) on unlabeled samples based on their clinical data in TCGA. We compared our results with other NMF methods (Supplementary Note 4 of Supplementary Material) and found that CBP-JMF achieves more accurate subtype classification results. Unlabeled samples are classified by using GE data and ME data. Figure 4 indicates that the survival analysis for unlabeled samples has the most significant Cox (Lin and Zelterman, 2002) p-value 0.031 and similar survival curves like the labeled samples. This proves that the CBP-JMF framework is useful for cancer subtyping, as the framework incorporates integration of multi-omics data and samples' prior information.

DISCUSSION
Understanding CBPs is vital to help us further understand the development of disease and intervene in the disease. NMF is an effective tool for dimension reduction and data mining in high-throughput genomic data. In this paper, we proposed CBP-JMF, an improved method of multi-view data analysis. It is designed for heterogeneous biological data based on NMF. Moreover, we created an easy-touse package in Python. CBP-JMF analyzes multi-dimensional genomic data across the same samples integrally. Our method can discover CBPs that underlie sample groups and classify unlabeled samples through learning the relationship between labeled samples.
We tested this framework on the gene expression data and miRNA expression data of BRCA. CBP-JMF discovered subtype-specific biological processes and classified unlabeled samples into four subtypes. We did survival analysis and function analysis, and the results showed that CBP-JMF has great performance. Furthermore, CBP-JMF is a weighted joint tri-NMF framework in essence. We expect that it can be applied to vast fields including disease subtypes, cell types, and population stratification. Meanwhile, we expect that CBP-JMF can be used to identify hub genes or predict the association between genes or non-coding mRNA and diseases by integrating a variety of data. Though CBP-JMF is efficient to uncover CBPs by integrating multi-omics data, CBP-JMF must integrate different multi-omics data that have the same samples. This weakness limits the use of more types of data and integrates more information to obtain more significant results.

CONCLUSIONS
In this article, we develop CBP-JMF, a matrix tri-factorization and weighted joint integration tool, for detecting CBPs, which characterize prior disease subtypes and cell groups in Python. We improve its usability by estimating the parameters, such as determining the number of features through consensus clustering. CBP-JMF always gives reference values of all parameters. In applications, CBP-JMF characterizes the CBPs of four subtypes of BRCA based on gene and miRNA expression data from TCGA, and we find the significantly different functional pathways that characterized luminal B and basallike subtypes.

DATA AVAILABILITY STATEMENT
The datasets presented in this study are publicly available and the addresses for finding them are listed within the article. Prediction results and a reference implementation of CBP-JMF in Python are available at: https://github.com/wangbingbo2019/CBP-JMF.

AUTHOR CONTRIBUTIONS
BW, YWu, and XM conceived and designed the experiments. YWu and MX performed the experiments. XM, RD, CZ, LY, XG, and LG analyzed the data. BW, YWu, XM, and YWa proofread the paper. All authors contributed to the article and approved the submitted version.