Edited by: Daniel Potts, Technische Universität Chemnitz, Germany
Reviewed by: Junhong Lin, École Polytechnique Fédérale de Lausanne, Switzerland; Uwe Schwerdtfeger, Technische Universität Chemnitz, Germany
This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
This paper proposes a method for estimating the cluster matrix in the Gaussian mixture framework via Semi-Definite Programming. Theoretical error bounds are provided, and a (nonlinear) low dimensional embedding of the data is deduced from the cluster matrix estimate. The method and its analysis are inspired by the work of Guédon and Vershynin on community detection in the stochastic block model. The adaptation is nontrivial since the model is different and new Gaussian concentration arguments are needed. Our second contribution is a new Bregman-ADMM type algorithm for solving the semi-definite program and computing the embedding. This results in an efficient and scalable algorithm taking only the pairwise distances as input. The performance of the method is illustrated via Monte Carlo experiments and comparisons with other embeddings from the literature.
Low dimensional embedding is key to many modern data analysis tasks. Data are better understood after choosing good coordinates, i.e., an embedding, and extracting the main features. Based on a compressed description, the data can then be projected, visualized, or clustered more reliably and efficiently. The goal of the present paper is to present an efficient technique for joint embedding and clustering, based on pairwise affinity analysis and reliable convex optimization.
Combining the goals of reducing dimensionality and clustering in a principled manner is challenging and novel, but also draws on ideas from spectral clustering [
The mapping of a 3D cluster using Diffusion Maps from the Matlab package drtoolbox
Apart from these previous works, standard clustering techniques usually start from already embedded data, as obtained after, e.g., PCA processing. In other words, embedding and clustering are often treated as completely separate tasks. Based on embedded data, mainstream clustering techniques are non-parametric (as
Our starting point in this attempt at finding appropriate embeddings for clustering is the method by Guedon and Vershynin [
From a technical perspective, our contribution is three-fold.
First, we generalize the Guédon/Vershynin approach to deal with the Gaussian Cluster Model (GCM) and show that the cluster matrix in the GCM can also be estimated by solving an SDP. To do so, we use as input an affinity matrix that depends only on the pairwise distances between observations. In contrast to the adjacency matrix arising in the stochastic block model (SBM), our affinity matrix from the GCM has non-independent entries, which makes the analysis nontrivial.
Our second contribution is to demonstrate in practice that the estimated cluster matrix yields a natural associated embedding. Indeed, much as in spectral clustering, the eigenvectors of the estimated cluster matrix provide a meaningful embedding. In contrast to standard embedding methods such as PCA, Laplacian eigenmaps, Maximum Variance Unfolding, t-SNE, etc., the embedding does not try to preserve pairwise distances but rather to estimate the cluster matrix. The intuition for using the cluster matrix is supported by Remark 1.6 in Guédon and Vershynin [
Our third contribution is to propose a new scalable algorithm for solving the main Semi-Definite Programming problem at the heart of Guédon and Vershynin [
The paper is organized as follows. The SDP approach for estimating the cluster matrix, the associated embedding and the main theoretical results are presented in section 2. The proofs are postponed to section 2 in the
The mathematical framework is the following. We assume that we observe a data set
with
The clustering problem aims at recovering the clusters
It entirely determines the clusters and, up to a reordering of the points, it is a block-diagonal matrix with a block of ones for each cluster.
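The block structure described above can be checked numerically. The following is a minimal sketch, assuming the cluster matrix has entry 1 when two points share a cluster and 0 otherwise; the function name `cluster_matrix` and the toy labels are hypothetical and chosen only for illustration. It also verifies the standard linear-algebra fact that the nonzero eigenvalues of such a matrix are the cluster sizes.

```python
import numpy as np

def cluster_matrix(labels):
    """Return the n x n 0/1 matrix whose (i, j) entry is 1 iff labels[i] == labels[j]."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Hypothetical toy labels: three clusters of sizes 3, 2 and 1, in scrambled order.
labels = np.array([0, 1, 0, 2, 1, 0])
Z = cluster_matrix(labels)

# Reordering the points by label makes the block-diagonal structure visible.
order = np.argsort(labels, kind="stable")
Z_sorted = Z[np.ix_(order, order)]

# The nonzero eigenvalues are the cluster sizes, so the rank equals the number of clusters.
eigenvalues = np.linalg.eigvalsh(Z)  # returned in ascending order
nonzero = eigenvalues[eigenvalues > 1e-9]
```

Because the toy cluster sizes are all different, each nonzero eigenvalue is simple, and its eigenvector is the normalized indicator vector of one cluster.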
Note that the Gaussian Cluster Model differs slightly from the usual Gaussian mixture model, where the data set consists of independent observations from the Gaussian mixture
As proved in Bandeira et al. [
We will define in the next section an estimate
and we deduce that
the rank of
the nonzero eigenvalues of
We assume in the sequel that the cluster sizes are all different so that all non-zero eigenvalues have multiplicity one. The clusters can hence be recovered from the eigenstructure of the matrix
The estimate
We now turn to the estimation
Based on the data set
where ||·||2 denotes the Euclidean norm on ℝ
with
Before stating the Semi-Definite Program, we introduce some matrix notations. The usual scalar product between matrices
With these notations, we define
with
Here λ0 ∈ ℕ is the number of non-zero edges in the true cluster matrix and Guedon and Vershynin state in Guedon and Vershynin [
The heuristic justifying that
Lemma 2.1.
with
The intuition behind condition (8) is that the average distance (or, more precisely, the average affinity) between two points within the same cluster is smaller than the average distance between two points from different clusters. This corresponds to the intuitive notion of clusters. Note that a similar condition appears in Guédon and Vershynin [
The SDP (5) appears as an approximation of the SDP (9) since the affinity matrix
□
Our main result is a non asymptotic upper bound for the probability that
□
Theorem 2.2 has a simple consequence in terms of estimation error rate. After computing
The following corollary provides a simple bound for the asymptotic error.
Corollary 2.3.
In the case when the cluster means are pairwise different and fixed while the cluster variances converge to 0, i.e., σ → 0, it is easily seen that the right hand side of the above inequality behaves as
While our proof of Theorem 2.2 follows the ideas from Guédon and Vershynin [
Proposition 2.4.
□
Theorem 2.2 assumes that λ0 is known. It is worth noting that λ0 corresponds to the number of edges in the cluster graph and that we can derive from the proof of Theorem 2.2 how the algorithm behaves when the cluster sizes are unknown, i.e., when the unknown parameter λ0 is replaced with a different value λ. The intuition is given in Remark 1.6 in Guédon and Vershynin [
In order to check condition (8), explicit formulas for the mean affinity matrix are useful. The next proposition solves the case of the Gaussian affinity function.
Proposition 2.5.
□
As an interesting consequence of Proposition 2.5, when the variance matrices from the Gaussian Cluster Model (1) are all equal and isotropic, that is
and
Condition (8) is therefore satisfied (whatever the choice of
In all the experiments, the parameter
The hyper-parameter λ was chosen so as to minimize the mean squared error between the estimated cluster matrix and the empirical affinity matrix.
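The selection rule above can be sketched as a simple grid search. This is an illustration under stated assumptions, not the authors' implementation: the Gaussian affinity is assumed to take the form exp(-(d/h0)^2) with a hypothetical bandwidth `h0`, the function names are invented for the example, and `toy_solver` is only a stand-in that lets the sketch run without an SDP solver. In practice `solve` would be the routine returning the SDP estimate for a given λ.

```python
import numpy as np

def gaussian_affinity(X, h0):
    """Affinity exp(-(d_ij / h0)**2) built from pairwise Euclidean distances d_ij (assumed form)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / h0**2)

def select_lambda(A, candidates, solve):
    """Return (best_lambda, best_Z) minimizing the mean squared error between solve(A, lam) and A."""
    best = None
    for lam in candidates:
        Z_hat = solve(A, lam)
        mse = float(np.mean((Z_hat - A) ** 2))
        if best is None or mse < best[0]:
            best = (mse, lam, Z_hat)
    return best[1], best[2]

def toy_solver(A, lam):
    """Stand-in for the SDP solver (NOT the SDP): rescale A so that its entries sum to lam."""
    return A * (lam / A.sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))  # hypothetical toy data
A = gaussian_affinity(X, h0=2.0)
total = float(A.sum())
candidates = [0.5 * total, total, 2.0 * total]
best_lam, Z_hat = select_lambda(A, candidates, toy_solver)
```

Passing the solver as a callable keeps the selection loop independent of how the semi-definite program is actually solved.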
As in spectral clustering, the components of the most significant eigenvectors, i.e., the eigenvectors associated with the largest eigenvalues, are the coordinates of the embedded data. Given these embedded data, as advised in Vu [
Simulations have been conducted to assess the quality of the proposed embedding. In this subsection, we used the Matlab package drtoolbox
Original affinity matrix vs. Guédon-Vershynin cluster matrix.
The affinity matrix obtained after embedding using different methods from the Matlab package drtoolbox
In this section, we present some simulation experiments assessing the performance of the Guédon-Vershynin embedding for Gaussian Cluster Models.
Our experiments were performed on problems with sample sizes 100, 200, …, 1,000 and with 2, 5, and 10 clusters. The dimension of the Gaussian Mixture Model was set to 100. For each experiment, we performed 100 Monte Carlo repetitions. All the results in this section show the average over the Monte Carlo experiments. Our Gaussian Cluster Model was built as follows: for a model with
The value of λ was selected so as to minimize the Frobenius distance between Ẑ and
Estimation error
Estimation error
The goal of the present paper was to propose an analysis of Guédon and Vershynin's Semi-Definite Programming approach to the estimation of the cluster matrix and to show how this matrix can be used to produce an embedding for preconditioning standard clustering procedures. The procedure is suitable for very high dimensional data because it is based on pairwise distances only. Moreover, increasing the dimension improves the robustness of the procedure whenever the Law of Large Numbers applies along dimensions: the affinity matrix is then forced to converge to a deterministic limit, which makes the estimator less sensitive to its low dimensional fluctuations.
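The concentration effect mentioned above can be illustrated with a small simulation. This is a sketch under simplifying assumptions (a single isotropic Gaussian cluster, with hypothetical function and parameter names), not a reproduction of the experiments in this paper. For such a cluster, the squared distance between two points is a scaled chi-square variable with as many degrees of freedom as the dimension, so its standard deviation divided by its mean is sqrt(2/dim).

```python
import numpy as np

def distance_spread(dim, n_points=200, sigma=1.0, seed=0):
    """Std/mean ratio of squared pairwise distances within one isotropic Gaussian cluster."""
    rng = np.random.default_rng(seed)
    X = sigma * rng.standard_normal((n_points, dim))
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    upper = sq_dists[np.triu_indices(n_points, k=1)]  # keep each distinct pair once
    return float(upper.std() / upper.mean())

# The relative spread shrinks as the dimension grows, roughly like sqrt(2 / dim).
spreads = {dim: distance_spread(dim) for dim in (10, 100, 1000)}
```

Since the affinity is a function of the pairwise distance alone, a smaller relative spread of distances means affinity entries that fluctuate less around their mean.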
Another feature of the method is that it may apply to a large number of mixture types, even when the components' densities are not log-concave, as do many embeddings when applied to data concentrated on complicated manifolds. Further studies will be performed in this exciting direction.
Future work is also needed to prove that the proposed embedding is efficient when combined with various clustering techniques. One of the main reasons why this should be a difficult problem is that the approximation bound proved in the present paper is not so easy to leverage for controlling the perturbation of the eigenspaces of
SC and CD contributed the theoretical analysis and the proofs. SC and AF contributed the code and simulation experiments for the initial version of this work. SC contributed the new algorithm and the corresponding simulation experiments.
AF was employed by company DigitalSurf. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The results presented in this paper have appeared previously as Chapter 3 of the third author's Ph.D. thesis [
The Supplementary Material for this article can be found online at:
1Extending our study to the setting of Gaussian Mixture Models using the relationship with the Gaussian Cluster Model based on this conditioning is a somewhat tedious but not difficult task.