ORIGINAL RESEARCH article

Front. Phys., 22 April 2025

Sec. Quantum Engineering and Technology

Volume 13 - 2025 | https://doi.org/10.3389/fphy.2025.1544623

This article is part of the Research Topic "Advancing Quantum Computation: Optimizing Algorithms and Error Mitigation in NISQ Devices".

Hamiltonian formulations of centroid-based clustering

  • 1Department of Statistics and Data Science, Yonsei University, Seoul, Republic of Korea
  • 2Department of Applied Statistics, Yonsei University, Seoul, Republic of Korea

Clustering is a fundamental task in data science that aims to group data based on their similarities. However, defining similarity is often ambiguous, making it challenging to determine the most appropriate objective function for a given dataset. Traditional clustering methods, such as the k-means algorithm and weighted maximum k-cut, focus on specific objectives—typically relying on average or pairwise characteristics of the data—leading to performance that is highly data-dependent. Moreover, incorporating practical constraints into clustering objectives is not straightforward, and these problems are known to be NP-hard. In this study, we formulate the clustering problem as a search for the ground state of a Hamiltonian, providing greater flexibility in defining clustering objectives and incorporating constraints. This approach enables the application of various quantum simulation techniques, including both circuit-based quantum computation and quantum annealing, thereby opening a path toward quantum advantage in solving clustering problems. We propose various Hamiltonians to accommodate different clustering objectives, including the ability to combine multiple objectives and incorporate constraints. We evaluate the clustering performance through numerical simulations and implementations on the D-Wave quantum annealer. The results demonstrate the broad applicability of our approach to a variety of clustering problems on current quantum devices. Furthermore, we find that Hamiltonians designed for specific clustering objectives and constraints impose different requirements for qubit connectivity, indicating that certain clustering tasks are better suited to specific quantum hardware. Our experimental results highlight this by identifying the Hamiltonian that optimally utilizes the physical qubits available in the D-Wave System.

1 Introduction

Quantum machine learning (QML) offers new possibilities and approaches to address various challenges in data science, pushing the boundaries of existing methods. Among its potential applications, clustering is a widely used technique in numerous domains of pattern recognition and data mining, such as image recognition, social network analysis, customer segmentation, and anomaly detection [1–8]. In addition, clustering has found increasing applications in drug discovery, aiding in the selection of potential leads, mapping protein binding sites, and designing targeted therapies [9–11].

Despite its broad applicability and importance, clustering encounters several challenges from an optimization perspective [12–14]. A primary issue is the ambiguity in defining the objective function for clustering. As there is no ground truth, it is often unclear which criteria should be used to group the target dataset, requiring the analyst to make subjective decisions about what constitutes similarity. A common approach involves using distance measures to quantify similarity. However, this approach still requires determining whether to rely on local information, such as the pairwise distance between individual data points, or global information, such as the distance between a data point and the centroid of a cluster. When using local information, clustering can be formulated as combinatorial optimization and approached by solving the maximum k-cut problem, which corresponds to maximizing dissimilarity between clusters. On the other hand, an example of using global information is the k-means clustering algorithm, which is equivalent to minimizing the variance within clusters by focusing on the distance to centroids. Yet, the challenge remains in how to effectively incorporate both or potentially other objectives for improved clustering. Another critical challenge is that even if the analyst decides to use either local or global information as described in constructing the objective function, finding the global solution is intractable. This intractability arises because the cardinality of the feasible set (i.e., the number of clustering configurations) grows exponentially with the number of data points and due to the non-convexity of the optimization landscape. As a result, in practice, polynomial-time approximate algorithms are employed to obtain good local solutions. This highlights the need for developing more efficient optimization algorithms that can either improve solution quality, reduce runtime, or accomplish both. In addition, existing polynomial-time algorithms often require a random initial cluster assignment, and both the quality of the solution and the convergence speed can be highly sensitive to this choice of initialization.

To address these challenges, we develop a unified framework that incorporates multiple data characteristics—such as local and global information—into the optimization objective, with the flexibility to assign arbitrary weights to specify their relative importance in clustering. Our approach begins by decomposing the problem of finding k clusters into hierarchical clustering, where each level of the hierarchy consists of binary clustering. We then introduce a method to incorporate centroids as variables within the objective function of a combinatorial optimization problem. This formulation enables various centroid-based binary clustering models, such as those that account for intercluster distance, intracluster distance, or both (see Figure 1), to be cast as combinatorial optimization problems. Furthermore, the objective function with centroid variables can be linearly combined with that of the weighted max-cut problem into a single, unified objective. Solving an unconstrained combinatorial optimization problem with binary variables can be mapped to the problem of finding the ground state (i.e., the eigenstate with the lowest eigenvalue) of a spin Hamiltonian [15]. This mapping offers a crucial benefit: it enables the problem to be solved on a quantum computer using quantum simulation techniques such as those based on quantum phase estimation and amplitude amplification [16–20], the variational quantum eigensolver [21–23], the quantum approximate optimization algorithm (QAOA) [24], quantum annealing [25], or quantum-inspired algorithms [26]. Notably, QAOA and quantum annealing do not require a random initial cluster assignment, as their initial quantum state is a uniform superposition of all computational basis states. This means that the algorithms begin with all possible clustering configurations, each assigned equal weight. Consequently, these approaches are free from the sensitivity to initial conditions, unlike classical polynomial-time algorithms. Moreover, formulating clustering as a combinatorial optimization problem is advantageous when incorporating constraints, as constraints can also be formulated as combinatorial optimization problems and included as penalty terms in the objective function.

Figure 1. Illustration of the intracluster distance and the intercluster distance. Unlike supervised learning, unsupervised learning cannot utilize a loss function with exact labels. In a clustering approach, the loss function can be created based on how well the hypothesis separates data points into their appropriate groups. We customized and combined intracluster distance, which measures how tightly data points are clustered together within a cluster, and intercluster distance, which measures how far apart different cluster centers are, as weighted criteria in the QUBO formulation. By optimizing this loss function, we were able to solve the clustering problem.

We benchmark the effectiveness of our approach and the proposed Hamiltonian formulations (i.e., the combinatorial optimization problems) using several datasets: Iris, Wine, a subset of MNIST, and a synthetic Gaussian overlapping dataset. To evaluate performance, we employed the Silhouette Score (SS) [27] and the Rand Index (RI) [28] as metrics, conducting comparative analysis with the k-means algorithm [29] and the weighted max cut. Initially, we assessed the performance of each Hamiltonian by searching for its exact solutions using a brute-force algorithm, in order to establish a benchmark for comparing theoretical predictions and practical outcomes. We then empirically tested the efficacy of our Hamiltonian formulations using simulated annealing and quantum annealing on the D-Wave Advantage System 6.4 [30]. Furthermore, we expanded our investigation to constrained clustering, incorporating Must-Link (ML), Cannot-Link (CL), and cluster size constraints. By doing so, we highlight the advantages of our centroid-based method, focusing on its ability to manage complex data structures and accommodate real-world data constraints. The following sections will elaborate on our methodological framework, the tailored Hamiltonian designs for clustering objectives, and the results of our comparative analysis, emphasizing the adaptability of our approach.

2 Related work

In this section, we review prior research efforts that framed centroid-based clustering as a combinatorial optimization problem. Several studies have approached this with a particular focus on Quadratic Unconstrained Binary Optimization (QUBO) formulations. Ref. [31] and Ref. [32] explored the representation of cluster centroids in QUBO under the assumption of equal cluster sizes, typically in the context of k-means clustering. Ref. [33] extended this work by introducing a QUBO formulation of the k-medoids approach, which differs from k-means clustering by selecting k representative data points (medoids) as cluster centers instead of calculating centroids based on the mean of the data points.

These works illustrate that centroid-based clustering algorithms, like k-means and k-medoids, can be formulated as combinatorial optimization problems, specifically QUBO, albeit under restricted conditions. A recurring assumption in these studies is the uniform distribution of data, which undesirably constrains clusters to be of approximately equal size. Moreover, the use of synthetic data in experiments raises concerns about the generalizability of these methods to real-world data. While Ref. [34] proposed an iterative fractional cost approach to address the issue of uneven data distributions, their solution significantly increases computational complexity due to the need for hyperparameter tuning and iterative recalculations.

In contrast to previous approaches, our method does not require predefined cluster sizes or rely on computationally intensive iterative processes. It directly incorporates the number of data points in each cluster as a variable in the objective function, eliminating the assumption of fixed cluster sizes. This enables greater flexibility and adaptability to a wider range of data distributions.

3 Methodology

We begin by discussing the process of mapping clustering problems to combinatorial optimization problems. To represent the assignment of $N$ data points into two clusters, we use a binary variable $z \in \{-1,+1\}^N$. Each element $z_i$ indicates the cluster assignment of the $i$th data point $x_i$, where $x_i \in \mathbb{R}^d$ is the representation of the data in the feature space. This feature space representation is obtained after applying any necessary pre-processing techniques, such as Principal Component Analysis (PCA), normalization, or standardization.

In the Hamiltonian formulation, we introduce $\bar{z}_i = (1+z_i)/2 \in \{0,1\}$ to denote the computational basis state of the $i$th qubit, representing the cluster assignment. The variables $z_i$ and $\bar{z}_i$ correspond to the eigenvalues and the eigenstates of the Pauli $Z$ operator, respectively: $Z|0\rangle = +|0\rangle$ and $Z|1\rangle = -|1\rangle$. In general, the objective function subject to minimization in QUBO problems can be expressed as

$f(z) = a_0 + \sum_{i<j} a_{ij} z_i z_j + \sum_{i=1}^{N} a_i z_i.$ (1)

In the context of clustering, $a_0$ is a constant independent of $z$, $a_{ij}$ represents the relationship between data points $x_i$ and $x_j$, and $a_i$ reflects characteristics of each individual data point. The combinatorial optimization problem can be mapped to finding the smallest eigenvalue and the corresponding eigenvector of the Hamiltonian for a finite-dimensional quantum system. The corresponding Hamiltonian is obtained by replacing $z_i$ with the Pauli $Z$ operator and 1 with the identity operator acting on the $i$th qubit:

$H = \sum_{z \in \{-1,+1\}^N} f(z)\, |\bar{z}\rangle\langle\bar{z}| = a_0 I + \sum_{i<j} a_{ij} Z_i Z_j + \sum_{i=1}^{N} a_i Z_i,$ (2)

where $\bar{z} \in \{0,1\}^N$ is obtained by mapping every element of $z$ as $(1+z_i)/2$, and $I$ is the identity matrix.
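
To make this mapping concrete, the following sketch (ours, not from the article; NumPy is assumed) builds the matrix of Equation 2 from coefficient dictionaries `a_pair` and `a_lin` via Kronecker products of Pauli $Z$ operators, and checks that its diagonal reproduces $f(z)$ of Equation 1 when the bit $b_i$ of a computational basis state is identified with the spin value $z_i = 1 - 2b_i$ (a sign convention chosen here for the check).

```python
import numpy as np
from itertools import product

Z = np.diag([1.0, -1.0])   # Pauli Z
I2 = np.eye(2)

def z_term(indices, n):
    """Kronecker product with Z on the qubits in `indices` and identity elsewhere."""
    op = np.array([[1.0]])
    for q in range(n):
        op = np.kron(op, Z if q in indices else I2)
    return op

def build_hamiltonian(a0, a_pair, a_lin, n):
    """H = a0*I + sum_{i<j} a_ij Z_i Z_j + sum_i a_i Z_i  (Equation 2)."""
    H = a0 * np.eye(2 ** n)
    for (i, j), aij in a_pair.items():
        H += aij * z_term({i, j}, n)
    for i, ai in a_lin.items():
        H += ai * z_term({i}, n)
    return H

# Sanity check: the diagonal of H reproduces f(z) of Equation 1, identifying the
# bit b_i of a computational basis state with the spin value z_i = 1 - 2*b_i.
n = 3
a_pair = {(0, 1): 0.5, (1, 2): -1.0}
a_lin = {0: 0.2}
H = build_hamiltonian(0.0, a_pair, a_lin, n)
for bits in product([0, 1], repeat=n):
    z = [1 - 2 * b for b in bits]
    f = sum(a * z[i] * z[j] for (i, j), a in a_pair.items())
    f += sum(a * z[i] for i, a in a_lin.items())
    idx = int("".join(map(str, bits)), 2)   # row index of this basis state
    assert np.isclose(H[idx, idx], f)
```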

Existing QUBO-based clustering algorithms typically rely solely on pairwise distances between data points. In this case, $a_0 = 0$ and $a_i = 0$ for all $i$, leading to an optimization problem of the form

$\min_{z \in \{-1,+1\}^N} \sum_{i<j}^{N} a_{ij} z_i z_j,$ (3)

where $a_{ij} \geq 0$ represents the dissimilarity measure between the $i$th and $j$th data points. For instance, $a_{ij} = \|x_i - x_j\|_2$. Equivalently, this problem can be formulated as finding the ground state (i.e., the lowest-energy state) of the Hamiltonian,

$H = \sum_{i<j} a_{ij} Z_i Z_j.$ (4)

This optimization problem is also known as the weighted max-cut problem on a graph. However, this formulation of clustering neglects global information, such as centroids, in the optimization process. This limitation is primarily due to the computational complexities and challenges involved in representing centroids within the combinatorial optimization framework, unless there is prior knowledge that the dataset is evenly distributed among clusters [32, 35, 36] as noted in Section 2.
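
For illustration, a minimal sketch (ours) of this special case, using the pairwise Euclidean distance as the dissimilarity $a_{ij}$; any other nonnegative dissimilarity, such as the squared distance, can be substituted.

```python
import numpy as np

def maxcut_coefficients(X):
    """Upper-triangular dissimilarities a_ij (i < j) for the weighted max-cut
    Hamiltonian of Equation 4, here the pairwise Euclidean distance."""
    n = len(X)
    return {(i, j): float(np.linalg.norm(X[i] - X[j]))
            for i in range(n) for j in range(i + 1, n)}

def maxcut_objective(a_pair, z):
    """Objective of Equation 3 for an assignment z in {-1,+1}^N."""
    return sum(a * z[i] * z[j] for (i, j), a in a_pair.items())
```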

To incorporate the centroid information into the combinatorial optimization (e.g., QUBO) framework, we introduce the variables $N_+$ and $N_-$, which represent the number of data points assigned to $+1$ and $-1$, respectively, and add these variables into the objective function. The number of data points assigned to each cluster can be computed as

$N_{\pm} = \sum_{i=1}^{N} \frac{1 \pm z_i}{2}.$

These variables serve as the building blocks for constructing the desired objective function, along with any necessary constraints. The centroids of the two clusters can then be expressed as

$\mu_{\pm} = \frac{1}{N_{\pm}} \sum_{i=1}^{N} x_i \frac{1 \pm z_i}{2}.$

Moreover, for a given dataset $x = \{x_i\}_{i=1}^{N}$, we define the distance function $l : \mathbb{R}^d \times \{-1,+1\}^N \times \{-1,+1\} \to \mathbb{R}_{\geq 0}$ as

$l(\mu, z, s) = \sum_{i=1}^{N} \|x_i - \mu\|_2^2 \, \frac{1 + s z_i}{2}.$

Here, $(1 + s z_i)/2$ acts as an indicator function, taking the value 1 if $z_i = s$ (i.e., if $x_i \in C(s)$, where $C(s)$ denotes the cluster labeled by $s$), and 0 otherwise. Thus, this function computes the total distance between the centroid $\mu$ and the data points in the cluster labeled by $s$. For instance, $l(\mu_{\pm}, z, \pm 1)$ measures the total distance between the centroid $\mu_{\pm}$ and the data points grouped in the cluster labeled $\pm 1$ (i.e., the total intracluster distance). On the other hand, $l(\mu_{\pm}, z, \mp 1)$ calculates the total distance between the centroid of the $\pm 1$ cluster and the data points labeled $\mp 1$ (i.e., the total intercluster distance).
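
These quantities are straightforward to evaluate classically for a candidate assignment, which is useful for checking the Hamiltonians below against brute-force or annealing results. A small sketch (ours, in NumPy; the function names are our own):

```python
import numpy as np

def cluster_sizes(z):
    """N_+ and N_- for an assignment z in {-1,+1}^N."""
    z = np.asarray(z, dtype=float)
    return float(np.sum((1 + z) / 2)), float(np.sum((1 - z) / 2))

def centroid(X, z, s):
    """Centroid mu_s of the cluster labeled s (s = +1 or -1)."""
    w = (1 + s * np.asarray(z, dtype=float)) / 2   # membership indicator for cluster s
    return (w[:, None] * X).sum(axis=0) / w.sum()

def l(X, mu, z, s):
    """Total squared distance between mu and the points in the cluster labeled s."""
    w = (1 + s * np.asarray(z, dtype=float)) / 2
    return float(np.sum(w * np.sum((X - mu) ** 2, axis=1)))
```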

To construct a clustering objective function that incorporates centroid information, one can take a linear combination of the distance functions $l(\mu, z, s)$. However, this approach poses two challenges when applying the Hamiltonian approach to solve the optimization problem. First, it is non-trivial to map the binary variables within the $1/N_{\pm}$ term into the Hamiltonian framework. Second, these denominators can cause numerical instability if all (or nearly all) data points are assigned to one of the clusters. To address these issues, we multiply the objective function by suitable powers of $N_{\pm}$ to eliminate the denominators and prevent numerical instability. As detailed in Supplementary Appendix 1, the terms involving $N_{\pm}$ in the denominators appear with powers of either 1 or 2 when the functions are combined for optimization. Accordingly, we apply the necessary multiplicative factors to cancel out these terms while minimizing deviations from the original clustering objectives. By linearly combining these modified functions, optimization problems that focus on minimizing intracluster distances, maximizing intercluster distances, or both can be transformed into a Hamiltonian problem. In the following, we present three specific examples of such centroid-based objective functions. The corresponding Hamiltonians can be derived using a procedure similar to that explained in Equations 1–4, by replacing the scalar 1 with a $2^N$-dimensional identity matrix and the binary variables $z_i$ with the Pauli $Z_i$ operators acting on the $i$th qubit. Products of the binary variables (e.g., $z_i z_j$ or $z_i z_j z_k$) correspond to tensor products of the Pauli operators (e.g., $Z_i Z_j$ or $Z_i Z_j Z_k$).

3.1 Intracluster distance

We start by setting up the optimization problem aimed at minimizing intracluster distances. This is achieved by linearly combining $l(\mu_+, z, +1)$ and $l(\mu_-, z, -1)$ and scaling the result by an appropriate multiplicative factor, as shown below:

$\min_{z \in \{-1,+1\}^N} N_+^2 N_-^2 \left[ l(\mu_+, z, +1) + l(\mu_-, z, -1) \right].$ (5)

This formulation encourages clusters to concentrate around their centroids by minimizing intracluster variance, which is conceptually equivalent to the objective of the well-known k-means algorithm.

However, the minimum of the objective function in Equation 5 can be achieved by setting either $N_+$ or $N_-$ to zero, leading to a trivial solution that does not represent useful clustering. To prevent this, we multiply $l(\mu_+, z, +1)$ by $N_+^2$ and $l(\mu_-, z, -1)$ by $N_-^2$, focusing on the respective clusters. The problem can then be reformulated as:

$\min_{z \in \{-1,+1\}^N} N_+^2\, l(\mu_+, z, +1) + N_-^2\, l(\mu_-, z, -1).$ (6)

Notably, in each intracluster distance term, either $N_+$ or $N_-$ appears only with a power of 1 (see Supplementary Appendix 1). Thus, multiplying each term by a linear factor $N_{\pm}$ suffices to eliminate the denominator. However, using higher-order factors, such as the quadratic term $N_{\pm}^2$, not only removes the denominator but also reflects the influence of cluster sizes into the optimization process. To analyze how different powers of $N_{\pm}$ influence the clustering results, we conducted simulations using both $N_{\pm}^2$ and $N_{\pm}$. The results obtained by scaling with $N_{\pm}^2$ are labeled as Intra, whereas those obtained by scaling with $N_{\pm}$ are labeled as Intra*.
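
To compare the two scalings numerically, the objective of Equation 6 can be evaluated directly for candidate assignments. The sketch below (ours) exposes the exponent as a parameter, with `power=2` corresponding to Intra and `power=1` to Intra*:

```python
import numpy as np

def intra_objective(X, z, power=2):
    """Equation 6 objective: N_+^p * l(mu_+, z, +1) + N_-^p * l(mu_-, z, -1).
    power=2 gives the Intra variant, power=1 the Intra* variant."""
    z = np.asarray(z, dtype=float)
    total = 0.0
    for s in (+1, -1):
        w = (1 + s * z) / 2                      # membership indicator for cluster s
        n_s = w.sum()
        if n_s == 0:
            continue                             # an empty cluster contributes nothing
        mu_s = (w[:, None] * X).sum(axis=0) / n_s
        total += (n_s ** power) * np.sum(w * np.sum((X - mu_s) ** 2, axis=1))
    return total
```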

3.2 Intercluster distance

To achieve well-separated clusters, it is beneficial to consider intercluster distance, which aims to maximize the separation between different clusters. While minimizing intracluster distance enhances cohesion within each cluster, it may introduce ambiguity near adjacent clusters, especially when boundaries are unclear. By focusing on intercluster separation, we can better distinguish data points near ambiguous or overlapping boundaries, thereby improving the overall clustering performance. Following a similar approach to that used for intracluster distances, the objective function is constructed by linearly combining $l(\mu_-, z, +1)$ and $l(\mu_+, z, -1)$, with both terms multiplied by $N_+^2 N_-^2$. Since each intercluster distance term includes either $1/N_+^2$ or $1/N_-^2$ (see Supplementary Appendix 1), multiplying the entire linear combination by $N_+^2 N_-^2$ is necessary to cancel these denominators. The resulting optimization problem for intercluster distance is then defined as follows:

$\min_{z \in \{-1,+1\}^N} -N_+^2 N_-^2 \left[ l(\mu_-, z, +1) + l(\mu_+, z, -1) \right].$ (7)

In this formulation, we maximize the squared distance of each data point to the centroid of the opposite cluster, encouraging the data points to be as far as possible from the other cluster. We observe that this approach enhances clustering performance, particularly in cases where cluster boundaries are not clearly defined (see Section 4).

3.3 Combining intra and intercluster distances

Now, we can integrate both intracluster and intercluster distances within a unified framework. By simultaneously optimizing these distances, we aim to strengthen the compactness within clusters while enhancing the separation between different clusters. This can be achieved by linearly combining Equation 5 and Equation 7, with the multiplicative factor $N_+^2 N_-^2$, which removes the denominators in both the intracluster and intercluster distance terms. The resulting optimization problem is

$\min_{z \in \{-1,+1\}^N} N_+^2 N_-^2 \left[ l(\mu_+, z, +1) + l(\mu_-, z, -1) - l(\mu_-, z, +1) - l(\mu_+, z, -1) \right].$ (8)

The optimization aims to assign each data point $x_i$ to a cluster label $z_i \in \{-1,+1\}$ such that the overall intracluster distances are minimized while the intercluster distances are maximized. Specifically, the function promotes tight clustering by minimizing the distances between data points and the centroid of their assigned cluster. At the same time, it enhances separation by maximizing the distances between data points and the centroid of the opposite cluster. By optimizing over all possible assignments of $z_i$, we seek a clustering configuration where data points are closely grouped around their respective centroids and well-separated from the other cluster.

By rearranging Equation 8 (see Supplementary Appendix 1.3), the combined objective function can also be expressed as:

$\min_{z \in \{-1,+1\}^N} -N\, N_+^2 N_-^2\, \|\mu_+ - \mu_-\|_2^2.$

This expression reveals that optimizing the combined intracluster and intercluster distances is equivalent to maximizing the squared distance between the cluster centroids. Therefore, by optimizing the combined objective function in Equation 8, we inherently maximize the separation between the centroids of the two clusters.
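
This identity is easy to verify numerically. The sketch below (ours, in NumPy) evaluates Equation 8 for a small random dataset and a fixed assignment and checks it against the closed-form expression $-N N_+^2 N_-^2 \|\mu_+ - \mu_-\|_2^2$:

```python
import numpy as np

def cluster_stats(X, z):
    """Cluster sizes and centroids for an assignment z in {-1,+1}^N."""
    z = np.asarray(z, dtype=float)
    w_p, w_m = (1 + z) / 2, (1 - z) / 2
    n_p, n_m = w_p.sum(), w_m.sum()
    mu_p = (w_p[:, None] * X).sum(axis=0) / n_p
    mu_m = (w_m[:, None] * X).sum(axis=0) / n_m
    return w_p, w_m, n_p, n_m, mu_p, mu_m

def combined_objective(X, z):
    """Equation 8: N_+^2 N_-^2 [l(mu_+,z,+1) + l(mu_-,z,-1) - l(mu_-,z,+1) - l(mu_+,z,-1)]."""
    w_p, w_m, n_p, n_m, mu_p, mu_m = cluster_stats(X, z)
    l = lambda mu, w: np.sum(w * np.sum((X - mu) ** 2, axis=1))
    return n_p**2 * n_m**2 * (l(mu_p, w_p) + l(mu_m, w_m) - l(mu_m, w_p) - l(mu_p, w_m))

# Check: Equation 8 equals -N * N_+^2 * N_-^2 * ||mu_+ - mu_-||_2^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
z = np.array([1, 1, -1, 1, -1, -1, 1, -1])
_, _, n_p, n_m, mu_p, mu_m = cluster_stats(X, z)
closed_form = -len(X) * n_p**2 * n_m**2 * np.sum((mu_p - mu_m) ** 2)
assert np.isclose(combined_objective(X, z), closed_form)
```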

3.4 Constrained clustering

In practice, analysts often need to perform clustering under constraints, which are dictated by task requirements or the available information. These constraints ensure that clustering not only groups data effectively but also adheres to the underlying structure and expert knowledge specific to the domain. According to Ref. [37], these constraints typically fall into three main categories: labeling constraints, cluster constraints and comparison constraints.

Labeling constraints are based on preassigned labels from domain knowledge, guiding the clustering algorithm to ensure that labeled objects are assigned to the correct groups. Cluster constraints focus on the characteristics of the clusters, such as the desired number of clusters or restrictions on cluster size or density. Comparison constraints include Must-Link (ML) and Cannot-Link (CL) relations, which specify whether certain objects should or should not be placed in the same cluster based on their inherent relationships. This approach allows users to specify relationships between data points even in the absence of class labels. In our framework, these constraints can be incorporated by augmenting the objective function with penalty terms that increase the objective value when the constraints are violated.

To implement labeling constraints, we modify the objective function to penalize incorrect cluster assignments. If the $i$th data point is labeled as $+1$ (corresponding to $z_i = +1$), we add a term $-\lambda_p z_i$ with $\lambda_p > 0$ to the objective function. Similarly, if the data point is labeled $-1$ (corresponding to $z_i = -1$), we add $+\lambda_p z_i$. This ensures that labeled points are assigned to the correct cluster, minimizing the penalty function when the cluster assignment matches the provided labels. For cluster constraints, if the goal is to ensure that a specific number of data points are assigned to each cluster, we can modify the objective function as

$f(z) + \lambda_p \left( C - \sum_{i=1}^{N} z_i \right)^2.$ (9)

Here, $C$ represents the desired difference in the number of data points between two clusters, and $\lambda_p > 0$ is the hyperparameter controlling this aspect. Since $z_i \in \{-1,+1\}$, the term $\sum_{i=1}^{N} z_i$ evaluates the difference in the number of data points between the two clusters for a given cluster assignment. Consequently, the second term in Equation 9 becomes zero only when the constraint is satisfied, while the objective value increases quadratically with deviation from the desired cluster sizes.
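
As an illustration of how the labeling and cluster-size penalties enter the coefficient form of Equation 1, the sketch below (ours) represents the linear terms $a_i$ and quadratic terms $a_{ij}$ as Python dictionaries; the expansion of the cardinality penalty uses $z_i^2 = 1$.

```python
def add_label_penalty(a_lin, i, label, lam):
    """Labeling constraint: add -lam*z_i if point i is labeled +1, +lam*z_i if labeled -1."""
    a_lin[i] = a_lin.get(i, 0.0) - lam * label
    return a_lin

def add_cardinality_penalty(a_pair, a_lin, n, C, lam):
    """Cluster-size constraint of Equation 9: add lam * (C - sum_i z_i)^2.
    With z_i^2 = 1, this expands to lam * [C^2 + n - 2*C*sum_i z_i + 2*sum_{i<j} z_i z_j]."""
    for i in range(n):
        a_lin[i] = a_lin.get(i, 0.0) - 2.0 * lam * C
        for j in range(i + 1, n):
            a_pair[(i, j)] = a_pair.get((i, j), 0.0) + 2.0 * lam
    return lam * (C**2 + n)   # constant offset; it does not affect the minimizer
```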

Comparison constraints, such as Must-Link (ML) and Cannot-Link (CL), can also be incorporated. Using the penalty term described in Ref. [38], where $Q_{ij} = +1$ for Must-Link and $Q_{ij} = -1$ for Cannot-Link, we modify the objective function as

$f(z) - \lambda_p \sum_{i<j} Q_{ij} z_i z_j.$

The second term ensures that the objective value increases when Must-Link or Cannot-Link constraints are violated, thereby seeking a solution that satisfies these pairwise relationships.
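
A corresponding sketch (ours, using the same coefficient-dictionary convention as above) adds the pairwise penalty $-\lambda_p Q_{ij} z_i z_j$ for each constrained pair:

```python
def add_link_penalties(a_pair, links, lam):
    """Comparison constraints: for each pair (i, j) with i < j, links[(i, j)] is +1
    for Must-Link and -1 for Cannot-Link; add -lam * Q_ij * z_i * z_j."""
    for (i, j), q in links.items():
        a_pair[(i, j)] = a_pair.get((i, j), 0.0) - lam * q
    return a_pair
```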

Note that the constraints are incorporated via the penalty method, where the hyperparameter $\lambda_p$ controls the strength of constraint enforcement. Choosing an appropriate value for $\lambda_p$ is crucial, as excessively large values may overly restrict the optimization, while very small values may fail to enforce constraints effectively. Common approaches for selecting $\lambda_p$ include grid search, random search [39], and adaptive methods [40].

3.5 k-clustering

Algorithm 1. Hamiltonian k-clustering.

Building upon the work of Ref. [35], we briefly discuss a k-clustering method inspired by hierarchical clustering techniques. To formalize this approach, we present the Hamiltonian k-clustering algorithm, shown in Algorithm 1. This method iteratively performs binary clustering, eliminating the need for one-hot encoding for each cluster and avoiding complex constraint penalty terms, such as those ensuring that each data point belongs to only one cluster. Consequently, this approach simplifies the clustering process, reduces the problem size, and enhances scalability, making it more suitable for the current capabilities of quantum annealers. Furthermore, this method can provide hierarchical insights into the data structure by unveiling nested cluster relationships. Thus, Hamiltonian clustering can be extended beyond binary clustering to general clustering problems. In the following section, we present experimental results that validate the effectiveness of our proposed method.
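
Because Algorithm 1 appears only as a figure here, the sketch below (ours) illustrates one plausible reading of the hierarchical scheme: repeatedly apply a binary Hamiltonian clustering routine to split an existing cluster until $k$ clusters remain. The choice to always split the largest cluster and the helper name `binary_cluster` (a user-supplied solver, e.g., annealing over one of the Hamiltonians of Section 3) are our assumptions, not details taken from the article.

```python
import numpy as np

def hamiltonian_k_clustering(X, k, binary_cluster):
    """Hierarchical k-clustering by repeated binary splits.
    `binary_cluster(X_sub)` must return an assignment z in {-1,+1}^m for m points."""
    clusters = [np.arange(len(X))]                 # start with one cluster holding all points
    while len(clusters) < k:
        # split the largest remaining cluster (our assumption about the split order)
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        z = np.asarray(binary_cluster(X[members]))
        clusters.append(members[z == 1])
        clusters.append(members[z == -1])
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels
```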

4 Experiments

To assess the effectiveness of our array of customized Hamiltonians, we conducted experimental analyses using the Silhouette Score (SS) and the Rand Index (RI) as primary performance metrics, with comparisons to the k-means algorithm and the weighted MaxCut. The Silhouette Score evaluates cohesion within clusters and separation between clusters, while the Rand Index measures agreement with the ground truth by calculating the true positives and true negatives in the clustering results. In addition to these primary metrics, we examined other aspects of clustering performance, such as the distances between cluster centroids, intracluster distances (the sum of distances within clusters), and intercluster distances (the sum of distances between clusters). These additional results are summarized in the tables in Supplementary Appendix 2.
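
Both metrics are available in scikit-learn; the sketch below (ours, assuming `silhouette_score` and `rand_score` from `sklearn.metrics`) shows one way to compute them and is not a statement about the authors' exact evaluation code.

```python
from sklearn.metrics import silhouette_score, rand_score

def evaluate_clustering(X, labels_pred, labels_true):
    """Silhouette Score (cohesion vs. separation) and Rand Index (agreement with ground truth)."""
    return {"SS": float(silhouette_score(X, labels_pred)),
            "RI": float(rand_score(labels_true, labels_pred))}
```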

4.1 Exact solutions

To establish a baseline for evaluating the performance of our proposed Hamiltonian methods, we employed a brute-force algorithm to exhaustively search the solution space on small datasets. Although this approach is computationally expensive and infeasible for large datasets, it allows us to find exact solutions and precisely evaluate the performance of our proposed methods. For this reason, we selected the Iris and Wine datasets for our experiments due to their widespread usage as standard benchmarks in clustering and classification tasks, as well as their suitability for exhaustive search given their size. The Iris dataset consists of 150 samples with four features categorized into three classes, while the Wine dataset contains 178 samples with thirteen features also categorized into three classes. To focus on binary clustering, we excluded the Setosa class (50 samples) from the Iris dataset and class 1 (59 samples) from the Wine dataset. For each dataset, we randomly sampled 16 data points and repeated the experiment 150 times. We applied normalization to scale the features of the Iris dataset to a range between 0 and 1. For the Wine dataset, we applied standard scaling to transform the features to have zero mean and unit variance.
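
With 16 sampled points, the search space contains $2^{16} = 65{,}536$ assignments, which is small enough to enumerate. A brute-force sketch (ours; the `objective` argument could be any of the objective functions of Section 3):

```python
import numpy as np
from itertools import product

def brute_force_minimum(X, objective):
    """Enumerate all z in {-1,+1}^N and return the assignment minimizing objective(X, z).
    Feasible only for small N (here N = 16, i.e., 65,536 assignments)."""
    best_z, best_val = None, np.inf
    for bits in product([-1, 1], repeat=len(X)):
        z = np.array(bits)
        if abs(int(z.sum())) == len(X):     # skip the two single-cluster assignments
            continue
        val = objective(X, z)
        if val < best_val:
            best_z, best_val = z, val
    return best_z, best_val
```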

Figure 2 summarizes the performance of different methods on the Iris and Wine datasets. For the Silhouette Score, the k-means algorithm achieves the highest score on the Iris dataset. The Intra-Inter combined method and Intra* method follow closely in second and third place. Notably, the Intra-Inter combined method surpasses the k-means algorithm on the Wine dataset. For the Rand Index, one of our Hamiltonian methods outperforms the k-means algorithm on both datasets. The Intra* method achieves the highest Rand Index on the Iris dataset, whereas the Inter method outperforms the Intra* method on the Wine dataset. By combining Intra and Inter methods, we achieved balanced performance across both datasets. In all cases, at least one of our Hamiltonian methods outperforms the weighted MaxCut, highlighting the benefit of incorporating centroid information into the clustering process.

Figure 2. The heatmap illustrates exact search results, showing the performance of Hamiltonian methods on the Iris and Wine datasets. The values represent the means of performance metrics (RI: Rand Index, SS: Silhouette Score), with darker shades indicating higher rank (better performance). White text indicates the best results for each evaluation metric.

4.2 Simulated annealing

Although the brute-force algorithm guarantees exact solutions, its high computational complexity restricts its application to small datasets. To validate the scalability of our method, we employ the simulated annealing algorithm. This approach enables testing on larger datasets, including not only the Iris and Wine datasets but also a Gaussian-distributed synthetic dataset and the 0–1 MNIST dataset. The synthetic dataset follows Gaussian distributions with overlapping ranges (see Supplementary Figure S1 in Supplementary Appendix 2), and the 0–1 MNIST dataset contains handwritten images of digits 0 and 1 in a 28×28 pixel format. We selected 100 samples from the Iris dataset (excluding the Setosa class) and 119 samples from the Wine dataset (excluding class 1). For the Gaussian-distributed synthetic dataset, 150 samples were used, and 175 samples were chosen from the 0–1 MNIST dataset. All experiments were conducted using an identical annealing schedule, ensuring that each experiment was allocated the same computational time budget.
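
As a sketch of how such an experiment can be set up with off-the-shelf tooling (we assume the `dwave-neal` package; the number of reads is an illustrative placeholder, not the schedule used in the article):

```python
import neal  # simulated annealing sampler from the dwave-neal package (assumed installed)

def anneal_ising(a_lin, a_pair, num_reads=200):
    """Sample low-energy states of the Ising problem with linear terms a_lin ({i: a_i})
    and quadratic terms a_pair ({(i, j): a_ij}), returning the best assignment found."""
    sampler = neal.SimulatedAnnealingSampler()
    sampleset = sampler.sample_ising(a_lin, a_pair, num_reads=num_reads)
    best = sampleset.first.sample                  # lowest-energy sample
    z = [best[i] for i in sorted(best)]
    return z, sampleset.first.energy
```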

Figure 3 presents the performance of different methods across these datasets. For the synthetic dataset, the k-means algorithm achieved the highest Silhouette Score compared to other methods. However, both the Inter and Intra-Inter combined methods achieved the highest Rand Index. This indicates that they handle overlapping data more effectively than other methods. Furthermore, simulations using actual datasets revealed noteworthy results. Although the k-means algorithm achieved marginally higher Silhouette Scores, our Hamiltonian methods consistently yielded high Rand Index values while maintaining comparable Silhouette Scores. This enhancement in the Rand Index suggests that our methods not only optimize intracluster cohesion and intercluster separation but also produce cluster assignments that more accurately reflect the true underlying classes. In particular, for the 0–1 MNIST dataset, the Intra-Inter combined method demonstrated excellent performance, indicating that Hamiltonian-based clustering can be effectively applied to image recognition tasks.

Figure 3. The heatmap illustrates simulated annealing results, showing the performance of Hamiltonian methods on the Gaussian synthetic, Iris, Wine and MNIST 0–1 datasets. The values represent the means of performance metrics (RI: Rand Index, SS: Silhouette Score), with darker shades indicating higher rank (better performance). White text highlights the best results for each evaluation metric.

4.3 Quantum annealing

To verify that our method can operate on a current quantum device, we performed quantum annealing using the D-Wave Advantage System 6.4, which employs the Pegasus topology (see Figure 4B). Our clustering problem inherently involves a fully connected (complete) graph, as depicted in Figure 4A. This connectivity poses challenges for current quantum devices, which often have limited qubit connectivity. We utilized the clique sampler from the D-Wave Ocean SDK [42], which is designed to optimally embed fully connected problems onto the hardware. In graph theory, a clique is a subset of vertices in which every pair of distinct vertices is connected by an edge, forming a complete subgraph. The term clique size refers to the number of vertices in such a fully connected subgraph. Notably, the maximum clique size for the D-Wave Advantage System 6.4 is 175, meaning that it can embed fully connected problems involving up to 175 logical qubits (representing data points). This capability allowed us to process the entire Iris and Wine datasets—each containing fewer than 175 data points—in a single trial. However, the 0–1 MNIST dataset exceeds the limited qubit connectivity of the system, necessitating the random selection of subsets of 175 data points. To ensure statistical robustness, we repeated this sampling process ten times. Our intercluster method demands additional qubits beyond those representing the data points due to the inclusion of higher-order terms (see Supplementary Equation S21 in Supplementary Appendix 1.2) and hence the slack variables necessary for formulating it as a Binary Quadratic Model (BQM). This extra qubit requirement exceeds the hardware’s maximum clique size when handling larger datasets, leading us to exclude the intercluster method from our quantum annealing experiments given the current hardware constraints.
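
A minimal sketch of this workflow with the Ocean SDK (ours; it assumes configured D-Wave Leap access, and the read count is illustrative):

```python
import dimod
from dwave.system import DWaveCliqueSampler   # clique embedding for fully connected problems

def anneal_on_qpu(a_lin, a_pair, num_reads=200):
    """Build a spin-valued BQM from the Ising coefficients and sample it on the QPU,
    letting the clique sampler handle the embedding onto the Pegasus topology."""
    bqm = dimod.BinaryQuadraticModel(a_lin, a_pair, 0.0, dimod.SPIN)
    sampler = DWaveCliqueSampler()
    sampleset = sampler.sample(bqm, num_reads=num_reads)
    return sampleset.first.sample, sampleset.first.energy
```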

Figure 4. (A) A visualization of a complete graph with 5 vertices, denoted as K5, where each vertex (representing a data point) is connected to every other vertex. (B) The embedding of K5 onto the D-Wave Pegasus topology [41], showing how the fully connected graph is mapped onto the limited qubit connectivity of the quantum hardware. The gray background highlights unused qubits, while colored nodes and edges show the embedded qubits and their connections. The purple node is embedded onto two separate physical qubits, which are linked by a purple edge, representing a chain. This chain ensures that the two qubits act together as a single logical qubit during the quantum annealing process.

In this analysis, the Intra-Inter combined method achieved higher Rand Index values while maintaining similar Silhouette Scores compared to the k-means algorithm for the Gaussian synthetic, Iris, and 0–1 MNIST datasets, as shown in Table 1. These results indicate a notable advancement in quantum-enhanced clustering. In contrast, the Intra, Intra*, and weighted MaxCut methods encountered the logical qubit embedding issue known as chain breaks, leading to random solutions. Figure 5 illustrates this phenomenon by showing the chain break fractions for four Hamiltonians, highlighting the stability of the Intra-Inter combined method on the 0–1 MNIST dataset. In the process of embedding a problem into the D-Wave Systems, multiple physical qubits are used to represent a single logical qubit. These physical qubits are connected in a chain, as illustrated by the purple edge in Figure 4B. A chain break occurs when these physical qubits fail to maintain the same state after the annealing process. This misalignment leads to unreliable solutions. Several studies [43–47] have investigated how chain breaks affect the accuracy of quantum annealing results, highlighting the need for effective embedding strategies and adjustments of chain strength values to minimize such occurrences. Notably, the Intra-Inter combined method performed efficiently on the current QPU without the need for additional system parameter tuning or embedding strategies, such as adjusting annealing schedules or chain strengths.

Table 1. Comparison of quantum annealing results: Silhouette Score and Rand Index between Intra-Inter combined method and k-means algorithm for the Gaussian synthetic, Iris, Wine and 0–1 MNIST datasets.

Figure 5. Chain break fraction for four different Hamiltonians using the D-Wave Advantage System 6.4. We conducted 200 samplings on a single quantum machine instruction on a QPU, setting the annealing time to the maximum possible duration of 2000 μs. The system employed 5,612 physical qubits. Using the 0–1 MNIST dataset, we observed chain breaks by averaging the results of 200 samplings. Remarkably, the Intra-Inter combined method did not experience any chain breaks, demonstrating superior stability. This stability significantly influenced the annealing results, resulting in consistently high Rand Index values. When evaluating the Rand Index, we calculated the Rand Index of the minimum energy among the 200 samples, further affirming the robustness and effectiveness of the Intra-Inter combined method.

4.4 Constrained clustering

We performed constrained clustering on the Iris and Wine datasets using Must-Link (ML) and Cannot-Link (CL) constraints, implemented through simulated annealing. Labels of randomly selected data points were revealed according to specified proportions from the entire dataset. Based on these revealed labels, we generated constraints specifying whether pairs of data points should be grouped together (ML) or separated (CL). The proportion of data points with revealed labels ranged from 0% to 100% in 10% increments, resulting in 11 distinct levels. A 0% ratio reflects a standard clustering scenario without any constraint information, whereas a 100% ratio indicates that all labels are fully known. For each ratio, we conducted 50 trials using different random samples to calculate average performance metrics. This choice balances statistical robustness and computational efficiency.

Figure 6 presents the outcomes of constrained clustering with ML and CL constraints. In both datasets, the Rand Index initially declined when only 10%–30% of label information was provided, but progressively improved as more information became available. In the case of the Wine dataset, the Intra-Inter combined method quickly converged to the true labels. In contrast, the Inter method showed no consistent trend and exhibited considerable variability. This suggests that the solution values are sensitive to the hyperparameter $\lambda_p$, which can influence the prioritization of constraints or clustering methods.

Figure 6. The plots display the Rand Index as a function of the percentage of known labels for the Iris (left) and Wine (right) datasets, under constrained clustering with Must-Link (ML) and Cannot-Link (CL) constraints. The proportion of revealed labels ranges from 0% to 100% in 10% increments. The performance improves as more label information is revealed, with both datasets showing a recovery in performance after an initial decline between 10% and 30%. The variability in the Inter method suggests sensitivity to the hyperparameter $\lambda_p$.

Subsequently, we implemented cardinality constraints to assign a specific number of data points to each cluster on the Iris and Wine datasets using simulated annealing. For each experiment, we selected a total of 50 data points from the Iris dataset and 48 data points from the Wine dataset, drawn from labels 1 and 2 according to predetermined ratios. The proportions of data points from label 1 and label 2 varied as follows: (10%, 90%), (20%, 80%), (30%, 70%), (40%, 60%), (50%, 50%), (60%, 40%), (70%, 30%), (80%, 20%), and (90%, 10%). This means we started with 10% of the data points from label 1 and 90% from label 2, gradually adjusting the proportions until we reached 90% from label 1 and 10% from label 2.

Figure 7 illustrates the performance of our methods with cardinality constraints. The Intra* and Intra-Inter combined methods achieved high Rand Index values across different label ratios for both datasets, closely matching the desired cardinality values. On the Iris dataset, the Intra, weighted MaxCut and Inter methods exhibited less satisfactory performance with balanced data points (50% from each label). On the Wine dataset, the Intra and weighted MaxCut methods also underperformed with balanced data points. Nevertheless, the results of the Intra* and Intra-Inter combined methods indicate that it is possible to perform clustering on imbalanced data and adjust cluster sizes according to user specifications, effectively addressing real-world clustering problems.

Figure 7. The top plots show the Rand Index across various methods with cardinality constraints applied to the Iris (left) and Wine (right) datasets. The bottom plots depict the difference between the given cardinality (C) and the experimental results (C*). The horizontal axis represents the proportions of data points from label 1 and label 2, ranging from (10%, 90%) to (90%, 10%). Notice that the Inter method underperformed on the Iris dataset. In contrast, the Intra* and Intra-Inter combined methods maintained a high Rand Index while closely achieving the desired cardinality.

5 Conclusions and discussion

In this work, we formulated the clustering problem as finding the ground state of a Hamiltonian and developed methods to integrate centroid information directly into the objective function. We defined a distance function l(μ,z,s) that encompasses intracluster distance, intercluster distance, and a combination of both. By incorporating the number of data points in each cluster as a variable within the objective function, we eliminated the need for fixed cluster size assumptions. We also extended our method to constrained clustering, enabling domain experts to embed prior knowledge into the clustering process. Our experimental results demonstrated that at least one of our proposed Hamiltonians outperforms the weighted MaxCut across multiple datasets, including both synthetic and real-world examples. This underscores the importance of incorporating centroid information in clustering algorithms. Notably, the Intra-Inter combined method exhibited balanced performance across most datasets.

The significance of our research lies in developing a flexible and unified clustering strategy capable of addressing complex clustering challenges. By enabling the integration of various clustering objectives, our approach effectively manages data points clustered around their mean and handles overlapping clusters, as evidenced by our experimental results. Application to real datasets further highlights the practical utility of the Hamiltonian formulation. A key benefit of our method is its compatibility with quantum simulation techniques. In particular, our quantum annealing experiments on the D-Wave system showed that the Intra-Inter combined method operates effectively on a current quantum device, demonstrating its applicability to real-world problems.

Potential directions for future research include expanding the Hamiltonian-based clustering framework to develop data-driven, automated methods for determining the optimal number of clusters. The intracluster distance formulation in our Hamiltonian approach can be leveraged to enforce density constraints. Additionally, refining the Intra-Inter combined method by applying dynamic weights to the linear combination of distances could allow adaptation to context-specific requirements. Extending this Hamiltonian formulation to address more complex clustering scenarios, such as time series and high-dimensional datasets, also presents a promising direction. Exploring alternative quantum hardware platforms, such as Rydberg atom arrays [48–53], could further expand the utility of our approach. While large-scale all-to-all qubit connectivity is not naturally available, ongoing advances in high-fidelity control [54], highly tunable interactions [55], and long coherence times [56, 57] open a path for scaling our method to larger problem instances. These advancements have the potential to enable new applications in fields like finance, drug discovery, and social network analysis.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

MS: Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing–original draft, Writing–review and editing. DP: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by Korea Research Institute for Defense Technology Planning and Advancement grant funded by Defense Acquisition Program Administration (DAPA) (KRIT-CT-23-031). This work was also supported by Institute of Information and communications Technology Planning and evaluation (IITP) grant funded by the Korea government (No. 2019-0-00003, Research and Development of Core technologies for Programming, Running, Implementing and Validating of Fault-Tolerant Quantum Computing System), the Yonsei University Research Fund of 2024 (2024-22-0147), the National Research Foundation of Korea (2023M3K5A1094813 and RS-2023-NR119931), and the KIST Institutional Program (2E32941-24-008). Access to the D-Wave system was supported by the 'Quantum Information Science R&D Ecosystem Creation' through the National Research Foundation of Korea (NRF) funded by the Korean government (Ministry of Science and ICT) (2020M3H3A1110365).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphy.2025.1544623/full#supplementary-material

References

1. Coleman GB, Andrews HC. Image segmentation by clustering. Proc IEEE (1979) 67:773–85. doi:10.1109/proc.1979.11327

CrossRef Full Text | Google Scholar

2. Baraldi A, Blonda P. A survey of fuzzy clustering algorithms for pattern recognition. i. IEEE Trans Syst Man, Cybernetics, B (Cybernetics) (1999) 29:778–85. doi:10.1109/3477.809032

PubMed Abstract | CrossRef Full Text | Google Scholar

3. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv (Csur) (1999) 31:264–323. doi:10.1145/331499.331504

CrossRef Full Text | Google Scholar

4. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinformatics (2004) 1:24–45. doi:10.1109/tcbb.2004.2

PubMed Abstract | CrossRef Full Text | Google Scholar

5. Wu J, Lin Z. Research on customer segmentation model by clustering. In: Proceedings of the 7th international conference on Electronic commerce China, August 15 - 17, 2005, (2005). p. 316–8.

Google Scholar

6. Handcock MS, Raftery AE, Tantrum JM. Model-based clustering for social networks. J R Stat Soc Ser A (Statistics Society) (2007) 170:301–54. doi:10.1111/j.1467-985x.2007.00471.x

CrossRef Full Text | Google Scholar

7. Agrawal S, Agrawal J. Survey on anomaly detection using data mining techniques. Proced Computer Sci (2015) 60:708–13. doi:10.1016/j.procs.2015.08.220

CrossRef Full Text | Google Scholar

8. Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, et al. A review of clustering techniques and developments. Neurocomputing (2017) 267:664–81. doi:10.1016/j.neucom.2017.06.053

CrossRef Full Text | Google Scholar

9. Voicu A, Duteanu N, Voicu M, Vlad D, Dumitrascu V. The rcdk and cluster r packages applied to drug candidate selection. J Cheminformatics (2020) 12:3–8. doi:10.1186/s13321-019-0405-0

PubMed Abstract | CrossRef Full Text | Google Scholar

10. Dara S, Dhamercherla S, Jadav SS, Babu CM, Ahsan MJ. Machine learning in drug discovery: a review. Artif Intelligence Rev (2022) 55:1947–99. doi:10.1007/s10462-021-10058-4

PubMed Abstract | CrossRef Full Text | Google Scholar

11. Mak KK, Wong YH, Pichika MR. Artificial intelligence in drug discovery and development. Drug Discov Eval Saf Pharmacokinetic Assays (2023) 1–38. doi:10.1007/978-3-030-73317-9_92-1

CrossRef Full Text | Google Scholar

12. Jain AK. Data clustering: 50 years beyond k-means. Pattern recognition Lett (2010) 31:651–66. doi:10.1016/j.patrec.2009.09.011

CrossRef Full Text | Google Scholar

13. Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci (2015) 2:165–93. doi:10.1007/s40745-015-0040-1

CrossRef Full Text | Google Scholar

14. Ezugwu AE, Ikotun AM, Oyelade OO, Abualigah L, Agushaka JO, Eke CI, et al. A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng Appl Artif Intelligence (2022) 110:104743. doi:10.1016/j.engappai.2022.104743

CrossRef Full Text | Google Scholar

15. Lucas A. Ising formulations of many np problems. Front Phys (2014) 2. doi:10.3389/fphy.2014.00005

CrossRef Full Text | Google Scholar

16. Dong Y, Lin L, Tong Y. Ground state preparation and energy estimation on early fault-tolerant quantum computers via quantum eigenvalue transformation of unitary matrices. arXiv preprint arXiv:2204.05955 (2022) 3:040305. doi:10.1103/prxquantum.3.040305

CrossRef Full Text | Google Scholar

17. Poulin D, Wocjan P. Preparing ground states of quantum many-body systems on a quantum computer. Phys Rev Lett (2009) 102:130503. doi:10.1103/physrevlett.102.130503

PubMed Abstract | CrossRef Full Text | Google Scholar

18. Ge Y, Tura J, Cirac JI. Faster ground state preparation and high-precision ground energy estimation with fewer qubits. J Math Phys (2019) 60:022202. doi:10.1063/1.5027484

CrossRef Full Text | Google Scholar

19. Lin L, Tong Y. Near-optimal ground state preparation. Quantum (2020) 4:372. doi:10.22331/q-2020-12-14-372

CrossRef Full Text | Google Scholar

20. Zeng P, Sun J, Yuan X. Universal quantum algorithmic cooling on a quantum computer. arXiv preprint arXiv:2109.15304 (2023).

Google Scholar

21. Peruzzo A, McClean J, Shadbolt P, Yung MH, Zhou XQ, Love PJ, et al. A variational eigenvalue solver on a photonic quantum processor. Nat Commun (2014) 5:4213. doi:10.1038/ncomms5213

PubMed Abstract | CrossRef Full Text | Google Scholar

22. McClean JR, Romero J, Babbush R, Aspuru-Guzik A. The theory of variational hybrid quantum-classical algorithms. New J Phys (2016) 18:023023. doi:10.1088/1367-2630/18/2/023023

CrossRef Full Text | Google Scholar

23. Cerezo M, Arrasmith A, Babbush R, Benjamin SC, Endo S, Fujii K, et al. Variational quantum algorithms. Nat Rev Phys (2021) 3:625–44. doi:10.1038/s42254-021-00348-9

CrossRef Full Text | Google Scholar

24. Farhi E, Goldstone J, Gutmann S. A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028 (2014).

Google Scholar

25. Johnson MW, Amin MH, Gildert S, Lanting T, Hamze F, Dickson N, et al. Quantum annealing with manufactured spins. Nature (2011) 473:194–8. doi:10.1038/nature10012

PubMed Abstract | CrossRef Full Text | Google Scholar

26. Jiang M, Shan K, He C, Li C. Efficient combinatorial optimization by quantum-inspired parallel annealing in analogue memristor crossbar. Nat Commun (2023) 14:5927. doi:10.1038/s41467-023-41647-2

PubMed Abstract | CrossRef Full Text | Google Scholar

27. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math (1987) 20:53–65. doi:10.1016/0377-0427(87)90125-7

CrossRef Full Text | Google Scholar

28. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc (1971) 66:846–50. doi:10.1080/01621459.1971.10482356

CrossRef Full Text | Google Scholar

29. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J machine Learn Res (2011) 12:2825–30. doi:10.48550/arXiv.1201.0490

CrossRef Full Text | Google Scholar

30. D-Wave Systems Inc. QPU-specific physical properties: Advantage system 6.4. Burnaby, BC, Canada (2024).

Google Scholar

31. Bauckhage C, Brito E, Cvejoski K, Ojeda C, Sifa R, Wrobel S. Ising models for binary clustering via adiabatic quantum computing. In: Energy minimization methods in computer vision and pattern recognition: 11th international conference, EMMCVPR 2017, venice, Italy, october 30–november 1, 2017, revised selected papers 11. Springer (2018). p. 3–17.

Google Scholar

32. Arthur D, Date P. Balanced k-means clustering on an adiabatic quantum computer. Quan Inf Process (2021) 20:294–30. doi:10.1007/s11128-021-03240-8

CrossRef Full Text | Google Scholar

33. Bauckhage C, Piatkowski N, Sifa R, Hecker D, Wrobel S. “A QUBO formulation of the k-medoids problem,” in Lernen, Wissen, Daten, Analysen, Berlin, Germany, CEUR Workshop Proceedings. Aachen, Germany: LWDA (2019) 2454:54–63. Available at: https://ceur-ws.org/Vol-2454/paper_39.pdf

Google Scholar

34. Matsumoto N, Hamakawa Y, Tatsumura K, Kudo K. Distance-based clustering using qubo formulations. Scientific Rep (2022) 12:2669. doi:10.1038/s41598-022-06559-z

PubMed Abstract | CrossRef Full Text | Google Scholar

35. Kumar V, Bass G, Tomlin C, Dulny J. Quantum annealing for combinatorial clustering. Quan Inf Process (2018) 17:39. doi:10.1007/s11128-017-1809-2

CrossRef Full Text | Google Scholar

36. Date P, Arthur D, Pusey-Nazzaro L. Qubo formulations for training machine learning models. Scientific Rep (2021) 11:10029. doi:10.1038/s41598-021-89461-4

PubMed Abstract | CrossRef Full Text | Google Scholar

37. Gançarski P, Dao TBH, Crémilleux B, Forestier G, Lampert T. Constrained clustering: current and new trends. In: A guided tour of artificial intelligence research: volume II: AI algorithms (2020). p. 447–84.

Google Scholar

38. Wang X, Davidson I. Flexible constrained spectral clustering. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (2010). p. 563–72.

Google Scholar

39. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res (2012) 13:281–305. doi:10.5555/2188385.2188395

CrossRef Full Text | Google Scholar

40. Snoek J, Larochelle H, Adams RP. Practical bayesian optimization of machine learning algorithms. In: F Pereira, C Burges, L Bottou, and K Weinberger, editors. Advances in neural information processing systems, 25. Red Hook, New York, United States: Curran Associates, Inc (2012).

Google Scholar

41. Boothby K, Bunyk P, Raymond J, Roy A. Next-generation topology of d-wave quantum processors (2020).

Google Scholar

42. Boothby T, King AD, Roy A. Fast clique minor generation in Chimera qubit connectivity graphs. Quant Inf Proc (2016) 15:495–508. doi:10.1007/s11128-015-1150-6

CrossRef Full Text | Google Scholar

43. Hamerly R, Inagaki T, McMahon PL, Venturelli D, Marandi A, Onodera T, et al. Experimental investigation of performance differences between coherent ising machines and a quantum annealer. Sci Adv (2019) 5:eaau0823. doi:10.1126/sciadv.aau0823

PubMed Abstract | CrossRef Full Text | Google Scholar

44. Grant E, Humble TS. Benchmarking embedded chain breaking in quantum annealing. Quan Sci Technology (2022) 7:025029. doi:10.1088/2058-9565/ac26d2

CrossRef Full Text | Google Scholar

45. Le TV, Nguyen MV, Nguyen TN, Dinh TN, Djordjevic I, Zhang ZL. Benchmarking chain strength: an optimal approach for quantum annealing, 1. IEEE international conference on quantum computing and engineering (QCE). IEEE (2023). p. 397–406. doi:10.1109/qce57702.2023.00052

CrossRef Full Text | Google Scholar

46. Pelofske E. Comparing three generations of d-wave quantum annealers for minor embedded combinatorial optimization problems. Quantum Sci Technol (2023) 10 (2):025025. doi:10.1088/2058-9565/adb029

CrossRef Full Text | Google Scholar

47. Gilbert V, Louise S. Quantum annealers chain strengths: a simple heuristic to set them all. In: L Franco, C de Mulatier, M Paszynski, VV Krzhizhanovskaya, JJ Dongarra, and PMA Sloot, editors. Computational science – ICCS 2024. Cham: Springer Nature Switzerland (2024). p. 292–306.

CrossRef Full Text | Google Scholar

48. Qiu X, Zoller P, Li X. Programmable quantum annealing architectures with ising quantum wires. PRX Quan (2020) 1:020311. doi:10.1103/PRXQuantum.1.020311

CrossRef Full Text | Google Scholar

49. Kim M, Kim K, Hwang J, Moon EG, Ahn J. Rydberg quantum wires for maximum independent set problems. Nat Phys (2022) 18:755–9. doi:10.1038/s41567-022-01629-5

CrossRef Full Text | Google Scholar

50. Ebadi S, Keesling A, Cain M, Wang TT, Levine H, Bluvstein D, et al. Quantum optimization of maximum independent set using rydberg atom arrays. Science (2022) 376:1209–15. doi:10.1126/science.abo6587

PubMed Abstract | CrossRef Full Text | Google Scholar

51. Lanthaler M, Dlaska C, Ender K, Lechner W. Rydberg-blockade-based parity quantum optimization. Phys Rev Lett (2023) 130:220601. doi:10.1103/PhysRevLett.130.220601

PubMed Abstract | CrossRef Full Text | Google Scholar

52. Nguyen MT, Liu JG, Wurtz J, Lukin MD, Wang ST, Pichler H. Quantum optimization with arbitrary connectivity using rydberg atom arrays. PRX Quan (2023) 4:010316. doi:10.1103/PRXQuantum.4.010316

CrossRef Full Text | Google Scholar

53. Wurtz J, Bylinskii A, Braverman B, Amato-Grill J, Cantu SH, Huber F, et al. Aquila: quera’s 256-qubit neutral-atom quantum computer (2023).

Google Scholar

54. Evered SJ, Bluvstein D, Kalinowski M, Ebadi S, Manovitz T, Zhou H, et al. High-fidelity parallel entangling gates on a neutral-atom quantum computer. Nature (2023) 622:268–72. doi:10.1038/s41586-023-06481-y

PubMed Abstract | CrossRef Full Text | Google Scholar

55. Shi XF. Quantum logic and entanglement by neutral rydberg atoms: methods and fidelity. Quan Sci Technology (2022) 7:023002. doi:10.1088/2058-9565/ac18b8

CrossRef Full Text | Google Scholar

56. Barnes K, Battaglino P, Bloom BJ, Cassella K, Coxe R, Crisosto N, et al. Assembly and coherent control of a register of nuclear spin qubits. Nat Commun (2022) 13:2779. doi:10.1038/s41467-022-29977-z

PubMed Abstract | CrossRef Full Text | Google Scholar

57. Wintersperger K, Dommert F, Ehmer T, Hoursanov A, Klepsch J, Mauerer W, et al. Neutral atom quantum computing hardware: performance and end-user perspective. EPJ Quan Technology (2023) 10:32. doi:10.1140/epjqt/s40507-023-00190-1

CrossRef Full Text | Google Scholar

Keywords: clustering, quantum machine learning, quantum computing, combinatorial optimization, quantum algorithms, unsupervised learning

Citation: Seong M and Park DK (2025) Hamiltonian formulations of centroid-based clustering. Front. Phys. 13:1544623. doi: 10.3389/fphy.2025.1544623

Received: 13 December 2024; Accepted: 28 February 2025;
Published: 22 April 2025.

Edited by:

Jaewoo Joo, University of Portsmouth, United Kingdom

Reviewed by:

Saravana Prakash Thirumuruganandham, SIT Health, Ecuador
Xiao-Feng Shi, Hainan University, China
Robson Christie, University of Portsmouth, United Kingdom

Copyright © 2025 Seong and Park. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Daniel Kyungdeock Park, dkd.park@yonsei.ac.kr
