Microbiome Data Analysis by Symmetric Non-negative Matrix Factorization With Local and Global Regularization

A network is an efficient tool to organize complicated data. The Laplacian graph has attracted more and more attention for its good properties and has been applied to many tasks including clustering, feature selection, and so on. Recently, studies have indicated that though the Laplacian graph can capture the global information of data, it lacks the power to capture fine-grained structure inherent in network. In contrast, a Vicus matrix can make full use of local topological information from the data. Given this consideration, in this paper we simultaneously introduce Laplacian and Vicus graphs into a symmetric non-negative matrix factorization framework (LVSNMF) to seek and exploit the global and local structure patterns that inherent in the original data. Extensive experiments are conducted on three real datasets (cancer, cell populations, and microbiome data). The experimental results show the proposed LVSNMF algorithm significantly outperforms other competing algorithms, suggesting its potential in biological data analysis.


INTRODUCTION
With the development of high-throughput metagenomic sequencing and 16S sequencing technologies, more and more biological data have been accumulated. Generally, these sequences have very complicated characteristics, making discoveries and identification of latent relations among samples very daunting. In order to reach a good understanding of the roles that the microbiome plays in the health and disease states of humans, many plans, including the Human Microbiome Plan (HMP) (Turnbaugh et al., 2007), integrative Human Microbiome Plan (iHMP) (The Integrative Hmp (iHMP) Research Network Consortium, 2019), and the Metagenomics of Human Intestinal Tract (MetaHIT) (Qin et al., 2010), have been launched. These actions pave the way for researchers to further explore the complex relationships residing in microbiome data. Arumugam et al. classified the microbiome into different enterotypes and pointed out the significance of a functional analysis to reveal the interactions among microorganisms (Arumugam et al., 2011;Siezen and Kleerebezem, 2011). Clustering approaches and similarity measurements were applied to elucidate the influence that various factors impose on the identification of enterotypes (Koren et al., 2013). Jiang et al. (2012) proposed a new approach based on nonnegative matrix factorization (NMF) to identify the structure and functions of complex microbial communities across environmental samples. Wu et al. (2016) developed a stable NMF method, staNMF, and obtained a novel and concise representation of spatial gene expression patterns. As a clustering technology, NMF has attracted a lot of attention in terms of its good data representation capabilities. In NMF, samples or features can be viewed as the linear additive combination of basis vectors. Meanwhile, the membership label of each sample can be assigned by the corresponding coefficient matrix. When the data have a linear structure, NMF usually achieves better performance. However, in the real world, data points are generally embedded in a non-linear manifold; thus, adopting other ways (such as graphs) to describe latent relationships among data points is a better choice (Kuang et al., 2012). Symmetric non-negative matrix factorization (SNMF) takes a similarity matrix as input and outputs a cluster indicator. In this process, a similarity matrix can be obtained in many ways, such as through a Gaussian kernel, polynomial kernel, linear kernel, and so on. Kuang et al. (2012Kuang et al. ( , 2015 designed an effective SNMF algorithm to model the complex relationships contained in non-linear data that outperforms many NMF-based approaches. Ma et al. (2016a) developed HSNMF, a method that combined SNMF with a second-order graph to explore microbiome data.
Recently, the graph regularization framework has been successfully applied in many fields including bioinformatics, image processing, and text mining, achieving good performance. Cai et al. (2010) proposed the GNMF algorithm to reveal the potential patterns of several datasets. Specifically, GNMF used Laplacian regularization (Lr) to encode the intrinsic geometrical structure presented in the original data. Subsequently, a number of variants based on Lr were generated (He et al., 2015;Ma et al., 2016bMa et al., , 2017. Although these methods obtained some interesting findings, they may ignore some important aspects. For example, a traditional Laplacian graph captures the global structure of a data matrix, which is insufficient in biology research, where local topologies need to be sought and utilized effectively (Wang et al., 2017). Moreover, recently emerging methods developed to capture local topological information have been shown to obviously outperform global algorithms (Roweis and Saul, 2000;Wu and Schölkopf, 2007;Jiang and Hu, 2014). These methods aim to reconstruct each data point using its local neighbors and are shown to be robust and insensitive to outliers. Vicus, an alternative local spectral matrix, can effectively capture the local geometrical information from neighboring nodes to model biological interactions, and it shares many similar properties with Laplacian matrices; for example, both matrices are symmetric and positive semidefinite (psd), and both have non-negative, real-valued eigenvalues (Wang et al., 2017). Compared with Laplacian graphs, Vicus graphs, which are constructed via local subnetworks, are more robust with respect to noise and can lessen the influence of outliers to some extent. In this paper, we first used similarity graphs to establish the complicated relationships in samples; second, given these graphs, we constructed Laplacian and Vicus spectral matrices; finally, we integrated the Vicus (and/or Laplacian) matrix into an SNMF framework to conduct downstream analysis, such as clustering, visualization, and so on.
In view of the above considerations, in this paper we introduce Laplacian and Vicus matrices (Wang et al., 2017) to simultaneously model the global and local structure connections residing within the data and compare their performance with the methods only based on Laplacian or Vicus graphs on several real datasets. Our experiments include tumor classification, microbiome samples identification and so on. The experimental results show that the proposed algorithm outperforms other baseline and competing approaches, which demonstrates its efficiency and effectiveness in microbiome data analysis. Figure 1 gives an illustrative example.
The contribution of this work lies in the fact that (1) an effective clustering algorithm has been proposed and can be easily expanded to other applications and (2) to our knowledge, this is the first attempt to integrate global and local structure information into an SNMF framework to conduct microbiome data analysis. The rest of this paper is organized as follows: in the next section a brief statement of SNMF is given. Then Lr, the formulation of Vicus, and the proposed algorithm are also provided. In section "Results and Discussion", extensive experiments are conducted, and the experimental results and comparisons analysis are presented. Section "Conclusion" summarizes the conclusions and further research plans.

Symmetric Non-negative Matrix Factorization
In SNMF, given an n × n symmetric matrix A and a reduced rank k, SNMF seeks to find the best factorization so that A = HH T , where H can be viewed as the cluster indicator. The objective of SNMF can be formalized as follows: + , and • denotes the c of a matrix. Compared with NMF, SNMF concerns only the factorized similarity matrix A and doesn't consider whether the structure of the data is linear or non-linear. Once A is given, SNMF conducts factorization similar to that of NMF. Therefore, SNMF is more suitable to modeling unknown data. For a matrix A, A ij , the ij-th element of A, denotes the similarity score between the ith and the jth data points. Similarity metrics can take many forms. One common way is to use the Gaussian kernel function to construct a weighted similarity matrix: where, Xi denotes the ith data point (sample), σ i is the Euclidean distance between Xi and its d-th neighborhood. We set d = 7 as suggested in the literature of Zelnik-Manor and Perona (2005). It is noted that the diagonal elements of similarity W are set to be zeros to eliminate self-similarity. Next, we only retain those edges linking nodes with their p nearest neighbors N (i). Thus, the weighted matrix W derived from Equation 2 can be rewritten as: Where N (i) is the set of neighbors of the ith node. As suggested in [20], W can be transformed into the normalized form: here, Dii = n j=1 w ij denotes the degree matrix. Finally, by multiplication update rules SNMF can obtain the locally optimal solution:

Laplacian Graph
Given P, a weighted matrix, let L be the Laplacian matrix; L can be defined as: where D is a degree matrix and Dii = n j=1 dij . The normalized cut version of L can be formalized as: where I denotes the identity matrix. Note that the general Laplacian matrix (Equation 6) is used in the complete experiments, and the normalized version of L is used to be input for spectral methods. In the process of constructing a weighted graph P, many similarity functions can be used, such as the inner product function, kernel function, and so on. In our experiments, the Gaussian kernel function (Equation 2) is employed to establish P.

Vicus Graph
As described in Wang et al. (2017), Vicus graphs have the same properties as Laplacian graphs. For example, both matrices are symmetric and positive semidefinite, and the eigenvector corresponding to the smallest eigenvalue 0 of both matrices is the constant vector 1. Compared with the Laplacian graph, the Vicus graph can capture the local structure within the data. In this subsection, we demonstrate the process of constructing a Vicus graph in detail. Note that the Vicus matrix is constructed based on a sample-sample similarity matrix, which can be computed via any similarity function, such as the Gaussian kernel, the cosine kernel functions, and so on. In this paper, we used the Gaussian kernel for computing the similarities between any two samples. The differences between the Laplacian and the Vicus graphs lies in the fact that the former describes the global structure information inherent in the data whereas the latter captures well the fine-grained topological information present in the biological network. The intuition is that we can use the local connection information from neighboring nodes to make the Vicus matrix more robust with respect to noise. Thus, it helps to lessen the influence of outliers. Let {x 1 , x 2 , · · · , x n } be a set of data points. Then v i corresponding to x i denotes the ith vertex in a weighted network P, and N (i) represents x i 's neighbors, not including x i . Here, the neighborhood size of all nodes is consistent ( N i = k, i = 1, 2, · · · , n).
The main assumption behind Vicus is that the cluster label of the ith data point can be inferred from its nearest neighbors N (i). First, a subnetwork Pi = (V i , Ei) is extracted such that Vi =N (i) x i and Ei represents the edges connecting all points inherent in Vi. Using the label diffusion algorithm (Zhou et al., 2004), a virtual label indicator vector c k v i can be reconstructed as: Where α ∈ (01) is a constant. In all our experiments, α is set 0.9 as suggested in Wang et al. (2017). C is the number of clusters. q k v i is the scaled cluster indicator of Pi.Si denoting the normalized transition matrix, i.e., Si (u, t) = Pi (u, t) K+1 l=1 Pi u, l . c k v i is a vector of K + 1 elements; here q k i = c k v i [K + 1] is the estimate of how likely it is that node i belongs to the kth cluster. The goal is maximal concordance between q k i and q k i . Let β i ∈ R K+1 be the ith row of the matrix (1 − α) (I − α Si) −1 , representing label propagation at its terminal state. We set q k i = β i q k v i . Thus, q k i can be approximated by Where β i [1 : K] is the first K elements of β i and β i [K + 1] denotes the (K + 1)th element in β i . Next, we use matrix B to represent the linear relationship q k ≈ B q k , k = 1, 2, · · · , C: In order to minimize the difference between q k and q k , we can adopt a simple objective as listed below: Here, Tr(•) denotes the trace of a matrix. Setting V = (I − B) T (I − B), we thus obtain the Vicus matrix. Similarly to the traditional spectral clustering formulation, by performing eigen-decomposition of V we can obtain clustering results. In this paper, we use V as constraint term to preserve the local manifold structure of the data. Note that in order to better reveal the complicated relationships among microbiome samples, we simultaneously introduce the Laplacian and Vicus spectral matrices into the objective function of SNMF. The details are given below, in section "Symmetric Non-negative Matrix Factorization Based on Laplacian and Vicus Regularization."

Symmetric Non-negative Matrix Factorization Based on Laplacian and Vicus Regularization
Based on analysis above, we combine Laplacian and Vicus matrices into the Symmetric Non-negative Factorization framework to explore the global and local structure inherent in the data. The proposed algorithm, namely LVSNMF, takes full advantage of the global and local consistency of the data to model complex relationships between different samples. The final objective can be defined simply as: (12) Where α is a regularization parameter and used to balance the trade-off between matrix reconstruction errors and spatial structure preservation. The second term in Equation 12 simultaneously takes into account the global information (Laplacian) and local structure (Vicus) of the data.
To minimize the objective of LVSNMF, we use multiplicative update rules to solve the optimal problem. The updating formula of H can be obtained as follows: Here, V = V + − V − . LVSNMF integrates the global and local similarity information inherent in the data, and it can therefore obtain better performance than just using the Laplacian graph or other methods based on local diffusion information from neighboring nodes.

Evaluation Metrics
The proposed LVSNMF and competing algorithms are evaluated by comparing the generated labels of all samples with the ground truth contained in the datasets. Two common cluster metrics, accuracy (AC) and normalized mutual information (NMI), are used to evaluate the performance of the proposed LVSNMF algorithm. Generally, the higher AC and NMI values achieved, the better the clustering quality is. More detailed information on these two metrics can be found in Xu et al. (2003).

Datasets
Three datasets are used in our experiments. The first one is a cancer dataset from TCGA, the second is a pollen dataset, and the last is a human microbiome dataset from HMP 1 . The important statistics of these three datasets are summarized in Table 1.
Lung cancer: This is a benchmark dataset including five cancer subtypes. It can be downloaded from https://jundongl.github.io/ scikit-feature/datasets.html.
Pollen: these data consist of 11 cell populations, containing neural cells and blood cells. These data are obtained directly from Pollen et al. (2014).
Human microbiome data: these data consist of 637 samples drawn from seven body sites: one gut (stool), one vagina (posterior fornix), one nasal (anterior nares), one skin site (retroauricular crease), and three oral sites (supragingival plaque, tongue dorsum, and buccal mucosa). Each sample consists of 710 microorganisms. The relative abundance of each at species level wase estimated by MetaphlAn. All the data can be downloaded from the HMP website.

Experimental Results
In this section, we conduct extensive experiments on these three datasets. The experimental results are shown in Table 2. From this table, we can see that LVSNMF outperforms the second best algorithm at 0.49/1.93% points in terms of AC/NMI on the Lung dataset, 3.22/2.17% points on the Pollen dataset, and the 2.09/0.61% points on HMP dataset.
The results in Table 2 are obtained when p = 12 and k = 15 across all three datasets. Here, p denotes the number of neighbors in constructing similarity matrix A, k is the number of local neighbors in constructing a Vicus matrix. For other values, LVSNMF still outperforms these competing algorithms in  most cases. Note that in our experiments NNDSVD (Boutsidis and Gallopoulos, 2008) is utilized to enhance the initiation of SNMF-based algorithms, which result in rapid reduction of reconstruction errors.

Parameter Analysis
In the proposed LVSNMF method, there is one essential parameter: the regularization parameter α. In fact, Laplacian and Vicus matrices should have different weight parameters. In this study, we set them to be equal for convenience. Parameter α reflects the extent to which we want to exert punishment for violating the manifold consistency hypothesis.
In our experiments, the values of α are set to be in the range (0.001 0.005 0.01 0.05 0.1 0.5 1 10). Table 2 reports the best performance of all algorithms on three datasets.  In order to intuitively observe performance of LVSNMF, we draw curve of LVSNMF vs.α on HMP dataset. Figure 2 gives the performance curve of LVSNMF as α increases.

Comparison and Discussion
In order to further demonstrate the effectiveness of LVSNMF, we compare the performance of LVSNMF with three other algorithms on the HMP dataset when α varies in the range (0.001 0.005 0.01 0.05 0.1 0.5 1). As shown in Figure 1, in the interval (0.001, 0.1), LVSNMF achieves consistently good performance as α increases. When α, LVSNMF obtains the best performance (90.58%/85.56% in terms of AC/NMI), and performs slightly better than the second best algorithm. One possible reason is that samples from HMP are noiseless and the Laplacian has successfully captured sufficient structure information such that the performance of LVSNMF might not be obvious.
On the other hand, on the pollen dataset the proposed LVSNMF algorithm significantly outperforms other competing algorithms (as Table 2 shows). This suggests that LVSNMF has the ability to seek and find the latent structure inherent in the data.
Vicus has an important hyperparameter, the number of neighbors K. To validate the influence of K on the performance of LVSNMF, we also conduct additional experiments on the HMP dataset. Figure 3 shows that the performance of LVSNMF varies with K (α = 0.01).
As shown in Figure 3, two algorithms based on Vicus regularization have consistently good performance as K varies within the interval (10, 30), especially when K equals 15; then they obtain the best performance. This suggests that the proposed LVSNMF is robust with respect to the number of neighbors K. Unlike the SNMF + Vicus algorithm, LVSNMF also takes into account the global information of the data; therefore, it can achieve the better performance in most cases.
In summary, the proposed LVSNMF method can adequately capture the global and local structure of the data. The experimental results on three real datasets demonstrate its efficiency and effectiveness.

CONCLUSION
In this paper, we propose a novel approach, called LVSNMF, to conduct microbiome data analysis. In LVSNMF, the global and local structure similarities are encoded in Laplacian and Vicus matrices, respectively. Extensive experiments are executed on three datasets, and the results show that the proposed LVSNMF algorithm significantly outperforms other baseline or state-of-the-art methods, which demonstrate its efficiency and effectiveness on microbiome data analysis.
Although LVSNMF achieves good performance, it concerns only sample clustering. In the real world, the relationship among microbes is often subtle and complicated. Therefore, modeling microbial interactions is important to dissect the mechanism behind diseases related to the microbiome. In the future, we will develop new methods to construct microbial interaction networks and seek appropriate ways to describe the functional or genetic similarities among microbes, such as phylogenetic trees, metabolic abilities, and so on.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/chonghua-1983/LVSNMF.

AUTHOR CONTRIBUTIONS
JZ and YM wrote the manuscript, designed the algorithms, and conducted all the experiments. YM developed the concept for the structure and content of the manuscript. LL critically revised the final manuscript. All authors reviewed and approved the final version of the manuscript.