qCLUE: a quantum clustering algorithm for multi-dimensional datasets

Gopalakrishnan, Dhruv; Dellantonio, Luca; Di Pilato, Antonio; Redjeb, Wahid; Pantaleo, Felice; Mosca, Michele

doi:10.3389/frqst.2024.1462004

ORIGINAL RESEARCH article

Front. Quantum Sci. Technol., 11 October 2024

Sec. Quantum Computing and Simulation

Volume 3 - 2024 | https://doi.org/10.3389/frqst.2024.1462004

Dhruv Gopalakrishnan^1,2,3*

Luca Dellantonio^1,4,5*

Antonio Di Pilato⁶*

Wahid Redjeb^6,7*

Felice Pantaleo⁶*

Michele Mosca^1,3,4,8*

¹Institute for Quantum Computing, University of Waterloo, Waterloo, ON, Canada
²Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
³Perimeter Institute of Theoretical Physics, Waterloo, ON, Canada
⁴Department of Physics and Astronomy, University of Waterloo, Waterloo, ON, Canada
⁵Department of Physics and Astronomy, University of Exeter, Exeter, United Kingdom
⁶CERN, Geneva, Switzerland
⁷RWTH Aachen University Physikalisches Institut III A, Aachen, Germany
⁸Department of Combinatorics and Optimization, University of Waterloo, Waterloo, ON, Canada

Clustering algorithms are at the basis of several technological applications, and are fueling the development of rapidly evolving fields such as machine learning. In the recent past, however, it has become apparent that they face challenges stemming from datasets that span more spatial dimensions. In fact, the best-performing clustering algorithms scale linearly in the number of points, but quadratically with respect to the local density of points. In this work, we introduce qCLUE, a quantum clustering algorithm that scales linearly in both the number of points and their density. qCLUE is inspired by CLUE, an algorithm developed to address the challenging time and memory budgets of Event Reconstruction (ER) in future High-Energy Physics experiments. As such, qCLUE marries decades of development with the quadratic speedup provided by quantum computers. We numerically test qCLUE in several scenarios, demonstrating its effectiveness and proving it to be a promising route to handle complex data analysis tasks – especially in high-dimensional datasets with high densities of points.

1 Introduction

Clustering is a data analysis technique that is crucial in several fields, owing to its ability to uncover hidden patterns and structures within large datasets (Gopalakrishnan et al., 2024). It is essential for simplifying complex data, improving data organization, and enhancing decision-making processes (Oyelade et al., 2019; Gu and Hübschmann, 2022; Caruso et al., 2018; Wu et al., 2021). For instance, clustering has been applied in marketing (Huang et al., 2007; Punj and Stewart, 1983), where it helps segment customers for targeted advertising (Wu et al., 2009), and in biology, for classifying genes and identifying protein interactions (Dutta et al., 2020; Au et al., 2005; Wang et al., 2010; Asur et al., 2007). In the realm of computer science and artificial intelligence, it is invaluable for speech recognition (Kishore Kumar et al., 2018; Chang et al., 2017), image segmentation (Coleman and Andrews, 1979), as well as for recommendation systems (Shepitsen et al., 2008; Schickel-Zuber and Faltings, 2007) used for personalizing user content. Finally, clustering techniques are pivotal for Event Reconstruction (ER), where data points that originated from the same “event” are to be grouped together. In High-Energy Physics, for instance, clustering algorithms are used to reconstruct the trajectories of subatomic particles in collider experiments. High volumes of data are expected at the endcap High Granularity CALorimeter (HGCAL) (Didier and Austin, 2017), which is currently being built for the CMS detector at the High Luminosity Large Hadron Collider (HL-LHC). These volumes must be tackled by new generations of clustering algorithms such as CLUE (Rovere et al., 2020). The discovery of the Higgs boson (Aad et al., 2012), which was recognized with the Nobel Prize in 2013, was made possible by such algorithms.

ER enables the interpretation of data obtained from particle collision events, including those occurring at the Large Hadron Collider (LHC) at CERN. Several clustering algorithms like DBScan, K-Means, and Hierarchical Clustering among others (Amaro et al., 2023; Dalitz et al., 2019; Rodenko et al., 2019) can be employed for ER. Our work is based on CERN’s CLUstering of Energy (CLUE) algorithm (Rovere et al., 2020; CMS Collaboration, 2022), which is adopted by the CMS collaboration (Hayrapetyan et al., 2023; Hayrapetyan et al., 2024; Tumasyan et al., 2023). It is designed for the future HGCAL detector due to the limitations of the currently employed algorithms. Despite these limitations, such algorithms are already at the basis of several discoveries, such as the doubly charged tetraquark (Aaij et al., 2023), the study of rare B meson decays to two muons (Tumasyan et al., 2023) and the observation of four-top quark production in proton-proton collisions (Hayrapetyan et al., 2023).

The efficiency of clustering algorithms, as illustrated by the CLUE algorithm (Rovere et al., 2020), is crucial for handling large datasets. Initially designed for two-dimensional datasets, CLUE reduces the search complexity from $O (n^{2})$ to $O (m n)$ through the use of local density and a tiling procedure, where $n$ $(m)$ represents the (average) number of points (per tile).

In the context of CLUE, where the datasets in question are limited to two dimensions, $m$ is small, making this approach to ER particularly effective. However, as the dimensionality of the dataset is incremented, the value of $m$ generally increases exponentially. This is highlighted by Figure 1A, where for a $d$ -dimensional lattice with $a$ points per edge, $m$ follows the relation $m = a^{d}$ . This is a serious challenge to CLUE and classical clustering algorithms in general.

Figure 1

Figure 1. Scaling of point density and complexities of classical and quantum algorithms for the unstructured search problem with dimension $d$ . In (A), different $d$ -dimensional lattices for $d = 1,2,3$ and $a = 3$ points per edge. In (B), best-known classical (solid lines) and quantum (dashed lines) algorithmic scaling for the Unstructured Search Problem (Lov, 1996) applied to square $d$ -dimensional lattices with the values of $a$ reported in the plot. Classically, the cost $O (m)$ reflects the need to iterate through all the $m$ points to find the desired one. Grover achieves the same in $O (\sqrt{m})$ steps, providing a quadratic advantage. This advantage increases with the density of points in the considered dataset, which grows exponentially with respect to the dimension $d$ according to $m = a^{d}$ .

A first step towards extending CLUE to more dimensions is 3D-CLUE (Rovere et al., 2020; Brondolin, 2022). In that approach, data points from different detector layers are first projected onto a single $d = 2$ surface, where clustering is then performed. However, this projection from the original $d = 3$ dataset to a $d = 2$ surface comes at the cost of a slower algorithm since $m$ becomes effectively larger. The solid lines in Figure 1B show the increase in average points per tile in $d$ -dimensional datasets made of the lattices in panel (A). While the improved performance of 3D-CLUE in ER tasks (Rovere et al., 2020; Brondolin, 2022) justifies the increased computational overhead, extending this enhancement to higher dimensions and larger datasets is challenging. Finding practical approaches to deal with datasets where $d$ is large is therefore extremely important, not only for ER tasks, but also in other fields such as gene analysis in bioinformatics (Karim et al., 2020) and market segmentation in business (Zhou et al., 2020).

Quantum computers provide a route to mitigate the complexity blow-up arising from higher-dimensional datasets. Wei et al. (2020) address the task of jet clustering in High-Energy Physics, while Kerenidis and Landman (2021) target spectral clustering, which itself uses the efficient quantum analogue of $k$ -means clustering (Kerenidis et al., 2019). Gong et al. (2024a) and Zhou et al. (2021) provide $k$ -Nearest-Neighbors based approaches for image classification, a common machine learning task. Other approaches include quantum $k$ -medians clustering (Aïmeur et al., 2007) and a quantum algorithm for density peak clustering (Duarte et al., 2023). Gong et al. (2022), Gong et al. (2024b) and Gong C. et al. (2024) also present interesting quantum solutions to a wide range of common machine learning tasks.

In this work we develop qCLUE, a CLUE-inspired quantum algorithm. Similarly to other quantum algorithms (Nicotra et al., 2023; Tüysüz et al., 2020), qCLUE leverages the advantage provided by Grover Search (Lov, 1996). A comparison of classical and quantum (Grover) runtimes is presented in Figure 1B, where the solid [dashed] lines refer to the classical $O (m)$ [quantum $O (\sqrt{m})$ ] scaling. As can be seen, the complexity advantage that Grover search provides can be substantial, particularly for large values of $d$ or $a$ .
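To make the comparison in Figure 1B concrete, the following minimal sketch tabulates the lattice relation $m = a^{d}$ together with a textbook estimate of the number of Grover iterations. The helper names are hypothetical, and the iteration count $\lfloor (π / 4) \sqrt{m} \rfloor$ is the standard single-solution estimate, assumed here for illustration rather than taken from the figure.

```python
import math


def points_per_tile(a: int, d: int) -> int:
    """Points in a d-dimensional lattice with a points per edge: m = a**d."""
    return a ** d


def grover_iterations(m: int, marked: int = 1) -> int:
    """Textbook estimate floor(pi/4 * sqrt(m / marked)), at least 1.

    Illustrative helper (an assumption of this sketch, not part of qCLUE).
    """
    return max(1, math.floor(math.pi / 4 * math.sqrt(m / marked)))


if __name__ == "__main__":
    a = 3  # points per edge, as in Figure 1A
    for d in (1, 2, 3, 4):
        m = points_per_tile(a, d)
        print(f"d={d}: classical ~{m} checks, Grover ~{grover_iterations(m)} iterations")
```

The gap between the two columns widens as $d$ grows, which is the regime where the quadratic advantage matters most.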

Overall, we find that qCLUE performs well in a wide range of scenarios. With ER-inspired datasets as a specific example, we demonstrate that clusters are correctly reconstructed in typical experimental settings. Similar to other quantum approaches to clustering that rely on Grover Search (Aïmeur et al., 2007; Pires et al., 2021; Magano et al., 2022), qCLUE showcases a quadratic speedup compared to classical algorithms. Magano et al. (2022) is especially interesting as it provides a detailed computational complexity analysis of a related problem within ER. Specifically, this approach tackles a subsequent task compared to qCLUE, namely the creation of so-called tracksters from hits (CMS Collaboration, 2022). It also demonstrates that the quantum algorithm has a quadratic advantage over the classical one in physically relevant scenarios. We mention here the significance of variational solutions (Zlokapa et al., 2021; Tüysüz et al., 2021) to the ER problem but note that these do not have predictable runtimes or error bound guarantees.

The specific advantages of qCLUE are its CLUE-inspired approach to cluster reconstruction (which has proven to be extremely successful (CMS Collaboration, 2022; Hayrapetyan et al., 2023; Tumasyan et al., 2023; CMS Collaboration, 2024)), and its consequent seamless integration with the classical framework currently employed by the CMS collaboration (Rovere et al., 2020; Brondolin, 2022; CMS Collaboration, 2023).

This paper is structured as follows. In Section 2, we describe our algorithm qCLUE. Specifically, we provide a general overview of its subroutines – namely the Compute Local Density, Find Nearest Higher, and the Find Seeds, Outliers and Assign Clusters steps. We describe the results of our simulated version of qCLUE on a classical computer in Section 3. In more detail, we explain the scoring metrics we use to quantify our results, and describe qCLUE performance when the dataset is subject to noise and different clusters overlap. Conclusions and outlook are finally presented in Section 4.

2 qCLUE

qCLUE is a quantum adaptation of CERN’s CLUE and 3D-CLUE algorithms (Rovere et al., 2020; Brondolin, 2022). It is specifically developed for ER, yet it is suitable for any (high-dimensional) dataset. The main advantage of qCLUE stems from employing Grover’s algorithm, which provides a quadratic speedup for the Unstructured Search Problem (Lov, 1996). While qCLUE is designed to work in arbitrary dimensions, for clarity we restrict ourselves to $d = 2$ . This simplifies the following discussions and allows us to simulate qCLUE with meaningful datasets on a classical computer. Generalizations to higher dimensions can be done following the steps outlined below. Furthermore, to provide a better connection with CLUE and 3D-CLUE, we employ a similar notation.

In Section 2.1, we offer an overview of the algorithm and its different subroutines. Section 2.2 is dedicated to the first subroutine of qCLUE, namely, calculating the Local Density. We then explain how to determine the Nearest Highers $(N_{j})$ in Section 2.3. Finally, Section 2.4 covers the determination of Seeds and Outliers and delves into the conclusive Cluster Assignment subroutine, where the points in the dataset are effectively hierarchically clustered.

2.1 Overview and setting

As for CLUE and 3D-CLUE (Rovere et al., 2020; Brondolin, 2022), we consider a dataset with spatial coordinates and a weight for every point. Similar datasets can also be found in medical image analysis and segmentation (Qaqish et al., 2017; Ng et al., 2006), in the analysis of asteroid reflectance spectra and hyperspectral astronomical imagery in astrophysics (Galluccio et al., 2008; Gaffey, 2010; Gao et al., 2021) and in gene analysis in bioinformatics (Karim et al., 2020; Oyelade et al., 2016).

In $d = 2$ dimensions, the spatial coordinates $X_{j}$ for point $j$ are $X_{j} = [x_{j, 1}, x_{j, 2}]$ , which generalize straightforwardly to larger values of $d$ . Both CLUE and qCLUE first perform tiling over the dataset to reduce the search and therefore enhance the efficiency of the algorithm. Tiling is the process of partitioning the dataset into a grid of rectangular tiles $□_{k}$ , where $k$ is the tile index (see Figure 2). Therefore, our input dataset comprises point and tile indices $j$ and $k$ , respectively, the coordinates $X_{j}$ , and a parameter $E_{j}$ associated with each point. Following CLUE’s notation, we call $E_{j}$ the weight, yet this should be considered as a label that can be employed to improve the clustering quality for any given dataset. The tiling procedure of qCLUE and CLUE enables searching only over Search Spaces $S$ , marked by the tiles in blue in Figure 2A, as opposed to the full dataset. In the case of CLUE, this allowed for an improvement in scaling from $O (n^{2})$ to $O (m n)$ . The scaling of qCLUE is investigated below.
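As an illustration of the tiling idea, the sketch below bins points into square tiles and collects the points of every tile touched by a window around a base point. It is a minimal classical sketch under stated assumptions: the function names and the `tile_size` parameter are hypothetical, tiles are taken to be squares, and the search space is approximated by the tiles overlapping the square window of half-width `radius`, which contains the consideration circle.

```python
import math
from collections import defaultdict


def build_tiles(points, tile_size):
    """Map each tile index (kx, ky) to the list of point indices j it contains."""
    tiles = defaultdict(list)
    for j, (x, y) in enumerate(points):
        tiles[(math.floor(x / tile_size), math.floor(y / tile_size))].append(j)
    return tiles


def search_space(tiles, center, radius, tile_size):
    """Point indices in all tiles overlapping the square window around center.

    Assumption of this sketch: the window [c - radius, c + radius] per axis
    stands in for the consideration circle of the same radius.
    """
    cx, cy = center
    kx_lo = math.floor((cx - radius) / tile_size)
    kx_hi = math.floor((cx + radius) / tile_size)
    ky_lo = math.floor((cy - radius) / tile_size)
    ky_hi = math.floor((cy + radius) / tile_size)
    found = []
    for kx in range(kx_lo, kx_hi + 1):
        for ky in range(ky_lo, ky_hi + 1):
            found.extend(tiles.get((kx, ky), []))
    return found


if __name__ == "__main__":
    pts = [(0.5, 0.5), (1.5, 0.5), (3.5, 3.5)]
    tiles = build_tiles(pts, tile_size=1.0)
    print(sorted(search_space(tiles, pts[0], radius=0.6, tile_size=1.0)))
```

Only the points returned by `search_space` need to be examined for a given base point, which is the origin of the $O (m n)$ scaling quoted above.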

Figure 2

Figure 2. Pictorial representation of the main subroutines of qCLUE. In (A), the Local Density computation subroutine is represented. The consideration circle of radius $d_{c}$ (light blue) centered at the base point $j$ (black) contains all points (green) that satisfy $d_{i, j} \leq d_{c}$ . This consideration circle intersects 2 tiles $□_{k}$ (indexed by tile index $k$ ), highlighted in blue, that form the search space $S$ . As per Equation 2, the Local Density computation step determines the set of green points from all points in the search space (green and grey) and then computes the local density. In (B), we pictorially present the Find Nearest Higher $(N_{j})$ subroutine. The consideration circle (green) around base point $j$ (black) has radius $d_{m}$ . This consideration circle, containing the green points as well as the Nearest Higher $N_{j}$ (pink), intersects the 4 tiles highlighted in green, which form the search space $S$ . In (C), we describe the Find Seeds, Outliers and Assign Clusters subroutines. The seeds (red) and outliers (blue) are determined via Grover search on the dataset. In this specific example there are two clusters in the dataset whose non-seed points are in orange and purple, respectively. Followers (see main text) in these clusters are connected by dashed arrows. The Cluster Assignment subroutine is shown to be working on the orange cluster where the cluster $C$ currently consists of the seed (red, dashed border) and the first of its followers (orange, dotted border). Followers are being found within the Dynamic Search Space (DSS, light red box with solid red border). The DSS is formed as the set of tiles $□_{k}$ covered partially or fully by the minimum bounding box of the square windows that contains all the search spaces $S$ of the points within $C$ .

In this work, we employ a qRAM, an essential building block for quantum computers, to store and access data. Following Giovannetti et al. (2008), we therefore assume that we can efficiently prepare the state

\sum_{j} |j⟩ \overset{qRAM}{\to} \sum_{j} |j⟩ |D_{j}⟩, (1)

where $D_{j}$ is the data associated with a given index $j$ , e.g., the $j^{th}$ point in the database. As explained in Giovannetti et al. (2008), the cost of preparing the dataset for qRAM is $O (n)$ , which has to be done once. Subsequent accesses cost $O (\log n)$ . This makes this step more efficient than the other subroutines within qCLUE. For convenience, here, in Equation 1, and throughout this paper we do not explicitly write the normalization factors of quantum states.

The qCLUE algorithm consists of the following steps:

2.1.1 Local density

The first step is to calculate the local density $ρ_{j}$ of all points $j$ [e.g., black point in Figure 2A] that is defined by

ρ_{j} = E_{j} + \frac{1}{2} \sum_{d_{i, j} < d_{c}} E_{i} (2)

and it is indicative of the weight in a neighborhood of point $j$ . As can be seen from Equation 2 and Figure 2A, $ρ_{j}$ is a weighted sum over the weights $E_{i}$ of all points $i$ whose distance $d_{i, j} = \sqrt{\sum_{α = 1}^{d} {(x_{i, α} - x_{j, α})}^{2}}$ from the base point $j$ is within a user-specified critical radius $d_{c}$ that characterizes the consideration circle for the Local Density computation subroutine (light blue circle in the figure). As such, $E_{i}$ is the weight of the $i^{th}$ point which is $d_{i, j}$ away from point $j$ . The choice of weight factor $1 / 2$ for the neighbouring weights $E_{i}$ in the definition of $ρ_{j}$ in Equation 2 is empirically found to yield better performances for CLUE (Rovere et al., 2020).
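For reference, Equation 2 can be evaluated classically as in the minimal sketch below. The function name and the optional `candidates` argument (the indices in the search space) are hypothetical. The sketch assumes that the base point contributes its full weight $E_{j}$ while every other point within $d_{c}$ contributes half of its weight, so the base point is excluded from the sum.

```python
import math


def local_density(j, points, weights, d_c, candidates=None):
    """rho_j = E_j + 0.5 * sum of E_i over points i != j with d_ij < d_c."""
    if candidates is None:
        candidates = range(len(points))  # fall back to scanning the full dataset
    rho = weights[j]
    for i in candidates:
        if i != j and math.dist(points[i], points[j]) < d_c:
            rho += 0.5 * weights[i]
    return rho


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    E = [2.0, 4.0, 10.0]
    print([local_density(j, pts, E, d_c=1.5) for j in range(len(pts))])
```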

2.1.2 Find nearest higher

After calculating the local densities, we determine the Nearest Highers. The Nearest Higher $N_{j}$ of a point $j$ is the point nearest to $j$ with a higher local density $ρ_{N_{j}} > ρ_{j}$ . As better explained in Section 2.4, the Nearest Highers are used to hierarchically cluster points together in the Cluster Assignment process at the end of qCLUE. In Figure 2B, the Nearest Higher $N_{j}$ of the base point $j$ (black point) is the pink point.
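A brute-force classical version of this definition is sketched below. The function name is hypothetical, and the optional cut-off `d_max` is an assumption of this sketch that restricts the search to a finite neighbourhood of the base point.

```python
import math


def nearest_higher(j, points, rho, d_max=math.inf):
    """Return (index, distance) of the closest point with rho_i > rho_j.

    Returns (None, inf) when no such point lies within d_max.
    """
    best, best_dist = None, math.inf
    for i in range(len(points)):
        if i == j or rho[i] <= rho[j]:
            continue  # only points with strictly higher local density qualify
        dist = math.dist(points[i], points[j])
        if dist < d_max and dist < best_dist:
            best, best_dist = i, dist
    return best, best_dist


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    rho = [1.0, 2.0, 3.0]
    print([nearest_higher(j, pts, rho) for j in range(len(pts))])
```

The point with the largest local density has no Nearest Higher, which the sketch signals with `None` and an infinite distance.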

2.1.3 Find seeds, outliers and assign clusters

As schematically represented in Figure 2C, seeds (red points) are the points whose distance $d_{j, N_{j}}$ from their Nearest Higher $N_{j}$ and whose local density $ρ_{j}$ are lower bounded by user defined thresholds. Outliers (blue points) are the points whose distance from Nearest Higher is similarly lower bounded but whose Local Density has an upper threshold. As such a point $j$ is

\text{a seed if } d_{N_{j}, j} > d_{c} \text{ and } ρ_{j} > \tilde{ρ}, (3a)

\text{an outlier if } d_{N_{j}, j} > δ d_{c} \text{ and } ρ_{j} < \tilde{ρ}. (3b)

Here, $δ$ is the Outlier Delta Factor, which scales $d_{c}$ to set the minimum distance from the Nearest Higher required for a point to be classified as an outlier. Furthermore, $\tilde{ρ}$ is the critical density threshold – the lowest local density a point can have to be classified as a seed, and the upper bound on the local density of outliers. Both $δ$ and $\tilde{ρ}$ are user-specified and can be varied to enhance the quality of the output. Seeds are generally located in areas of high weight density, and will be employed as starting points to build clusters. Outliers are points that are likely to be noise in the dataset and are therefore discarded.
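The two conditions in Equations 3a and 3b translate directly into a small classifier, sketched here under stated assumptions: the function and argument names are hypothetical, a point without a Nearest Higher is passed with an infinite distance, and any point that is neither a seed nor an outlier is labelled a follower.

```python
import math


def classify(rho_j, dist_to_nh, d_c, delta, rho_crit):
    """Label a point as 'seed', 'outlier' or 'follower' (Equations 3a, 3b)."""
    if dist_to_nh > d_c and rho_j > rho_crit:
        return "seed"
    if dist_to_nh > delta * d_c and rho_j < rho_crit:
        return "outlier"
    return "follower"


if __name__ == "__main__":
    d_c, delta, rho_crit = 1.0, 2.0, 3.0
    print(classify(5.0, math.inf, d_c, delta, rho_crit))  # isolated, dense point
    print(classify(1.0, 2.5, d_c, delta, rho_crit))       # isolated, sparse point
    print(classify(1.0, 0.5, d_c, delta, rho_crit))       # close to a denser point
```

The case $ρ_{j} = \tilde{ρ}$ is covered by neither strict inequality; in this sketch such a point falls through to the follower label.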

Once seeds and outliers are determined, the clusters are constructed. From the seeds, we iteratively combine “followers.” If point $N_{j}$ is the Nearest Higher of point $j$ , then point $j$ is termed as $N_{j}$ ’s follower. The follower of a point is most likely generated by the same process as the point itself (in the context of ER, by the same particle), and as such shall be included in the same cluster. In Figure 2C, the orange and purple points form two different clusters, and the followers of the points in the purple one are indicated by arrows.

2.2 Local density computation

In this section, we describe the subroutine (schematically represented in Figure 3) that computes the Local Density $ρ_{j}$ of the point $j$ , as defined in Equation 2. To perform the computation, all points $i$ whose distance $d_{i, j}$ from point $j$ is smaller than the threshold $d_{c}$ need to be determined from the search space $S$ . This search space is the smallest set of tiles $□_{k}$ required to cover the $d_{i, j} < d_{c}$ consideration circle. In Figure 2A, $S$ is highlighted in light blue.

Figure 3

Figure 3. Algorithm flow for Local Density computation and for Assigning Clusters. The quantum state is initialized in the green “Initialize” box. For Local Density Computation (Cluster Assignment), it comprises all points in the search space $S$ (in the DSS). The “Grover” (light blue) block performs $U_{ψ}$ and $U_{P}$ in succession $O (\sqrt{m})$ times, and returns all points satisfying the required condition. The inset considers the case of Local Density computation where the condition is $d_{i, j} < d_{c}$ . For the cluster assignment step, we check if points in the DSS are followers of the points in the cluster $C$ (see Section 2.4). The output of the Grover subroutine is then measured to yield an index that is checked for validity in the grey “Valid?” diamond. If the point satisfies the chosen condition, the $Y$ branch is executed. Within the “Update” (light blue) step this point is then removed from either $S$ or the DSS and stored to be returned in the “Return” orange box. Once all points are found, the “Valid?” condition triggers the $N$ branch to terminate the algorithm. Depending on the chosen subroutine, the returned indices are employed to compute the Local Density from Equation 2, or to construct $C$ .

We shall refer to $S$ as the local dataset that, as explained above, can be efficiently prepared with the qRAM (Giovannetti et al., 2008). To do so, we only require determining the tiles $□_{k}$ that are in the search space, which can be done efficiently classically (Rovere et al., 2020). The initial state of this subroutine, after being prepared via the qRAM, is therefore

\sum_{k \in S} \sum_{i \in □_{k}} |i⟩ \overset{qRAM}{\to} \sum_{k \in S} \sum_{i \in □_{k}} |i⟩ |X_{i}, E_{i}⟩, (4)

where the index $i$ is unique for each point in $S$ , and $i \in □_{k}$ runs over all indices within tile $k$ [either of the light blue squares in Figure 2A]. Ancillary qubits, omitted for clarity in Equation 4, are employed within the Grover search (for more information, see Supplementary Appendix SA).

At this stage, we must find the points $i$ [green dots in Figure 2A] that are within a radius of $d_{c}$ from the base point $j$ [black point in Figure 2A]. As shown in Figure 3, we perform Grover Search (Brassard et al., 2002) to prepare

\sum_{i} |i⟩ |X_{i}, E_{i}⟩ \overset{Grover}{\to} \sum_{d_{i, j} < d_{c}} |i⟩ |X_{i}, E_{i}⟩ . (5)

Here, the first register of the Grover output contains all points characterized by indices $i$ such that $d_{i, j} < d_{c}$ . As shown in the inset of the figure, the Grover Search consists of $O (\sqrt{m})$ repetitions (where $m$ is the number of points in $S$ ) of the $U_{ψ}$ and $U_{P}$ operators. $U_{P}$ is the diffusion operator and $U_{ψ}$ is the unitary associated with the oracle of Grover Search (Lov, 1996). Further details regarding Grover Search and the unitaries we use for our algorithm can be found in Supplementary Appendix SA.
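The effect of repeating the oracle and diffusion steps can be checked with a small classical state-vector simulation. This is a generic textbook Grover iteration written with plain Python lists, offered as a sketch under stated assumptions: it ignores the data and ancilla registers, the function name is hypothetical, and it is not the circuit of Supplementary Appendix SA.

```python
import math


def grover_success_probability(m, marked, iterations):
    """Probability of measuring a marked index after the given number of iterations."""
    amp = [1.0 / math.sqrt(m)] * m  # uniform superposition over m indices
    for _ in range(iterations):
        # Oracle step: flip the sign of every marked amplitude.
        amp = [-a if i in marked else a for i, a in enumerate(amp)]
        # Diffusion step: reflect all amplitudes about their mean.
        mean = sum(amp) / m
        amp = [2.0 * mean - a for a in amp]
    return sum(amp[i] ** 2 for i in marked)


if __name__ == "__main__":
    m, marked = 16, {3}
    for k in range(5):
        print(k, round(grover_success_probability(m, marked, k), 4))
```

For $m = 16$ and a single marked index, the success probability peaks after three iterations, consistent with the $O (\sqrt{m})$ repetition count.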

When the algorithm is run, measurement either yields a point that satisfies this distance condition, or (if there are no valid indices left) an index that does not satisfy this condition. This is verified by the grey “Valid?” diamond in Figure 3. The branched logic following this block ensures that the algorithm loops until all the required points are returned by the algorithm in the “Return” block.

Once we have obtained all indices $i$ of points satisfying the distance condition $(d_{i, j} < d_{c})$ , we perform the summation in Equation 2. This is computed and stored in the original dataset for each point. The database is now updated using qRAM with local density values for all points where the $j^{th}$ point in the database has the corresponding computed local density $ρ_{j}$ .

The scaling of the subroutine that determines the local density of a single point is given by the number of points in the blue consideration circle in Figure 2A such that $d_{i, j} < d_{c}$ . If we say this number is $p$ , $O (p)$ Grover runs are required, each costing $O (\sqrt{m})$ . This is therefore an $O (p \sqrt{m})$ algorithm as opposed to the $O (m)$ classical iterative algorithm for the Unstructured Search Problem.
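The loop of Figure 3 and its $O (p)$ run count can be emulated classically as in the sketch below. The helper `grover_sample` is a hypothetical, idealized stand-in for one Grover run followed by a measurement: it returns a valid index whenever one exists and an arbitrary index otherwise, and so ignores the finite failure probability of a real run.

```python
import random


def grover_sample(candidates, predicate, rng):
    """Idealized stand-in for one Grover run plus measurement (assumption)."""
    valid = [i for i in candidates if predicate(i)]
    return rng.choice(valid) if valid else rng.choice(candidates)


def find_all(candidates, predicate, rng=None):
    """Return (found indices, number of Grover runs) following the flow of Figure 3."""
    rng = rng or random.Random(0)
    remaining = list(candidates)
    found, runs = [], 0
    while remaining:
        runs += 1
        i = grover_sample(remaining, predicate, rng)
        if not predicate(i):  # "Valid?" -> N branch: nothing left to find
            break
        found.append(i)       # "Update": store the index ...
        remaining.remove(i)   # ... and remove it from the search space
    return found, runs


if __name__ == "__main__":
    found, runs = find_all(range(10), lambda i: i % 3 == 0)
    print(sorted(found), runs)
```

With $p$ valid points, the loop performs $p$ successful runs plus at most one terminating run, in line with the $O (p)$ count above.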

As a final remark, we highlight that it is in principle possible to design a unitary that computes the Local Density directly and stores the output in a quantum register. This unitary would remove the requirement of finding individually the indices $i$ such that $d_{i, j} < d_{c}$ , thus removing the overhead of $p$ in $O (p \sqrt{m})$ . However, designing this circuit is non-trivial and its depth may be large. This is therefore left for future investigations.

2.3 Find nearest higher

Here, we describe qCLUE’s subroutine for finding the Nearest Highers $(N_{j})$ introduced in Section 2.1. As a reminder, $N_{j}$ is the nearest point to the base point $j$ whose local density $ρ_{N_{j}}$ is more than the local density $ρ_{j}$ of the base point, see Equation 3a.

Similar to the initialization carried out for the Local Density Computation step, we use qRAM to initialize the quantum state

\sum_{k \in S} \sum_{i \in □_{k}} |i⟩ \overset{qRAM}{\to} \sum_{k \in S} \sum_{i \in □_{k}} |i⟩ |X_{i}⟩ |ρ_{i}⟩ . (6)

Here, the indices $i$ are within the tiles $□_{k}$ , as in Equation 4, and $S$ is the considered search space, schematically represented by the light green box in Figure 2B. This search space is determined from $d_{m}$ rather than $d_{c}$ , where $d_{m}$ is a user-defined threshold that we set to $δ d_{c}$ . Note that the weight $E_{i}$ , employed for determining the densities $ρ_{i}$ in Section 2.2, is no longer required from here on.

To find the Nearest Higher, we use a Grover-Enhanced Binary Search (GEBS) where each search step is enhanced by Grover’s algorithm (Equation 5). The output of every Grover run,

\sum_{\substack{d_{L} < d_{i, j} < d_{t}, \\ ρ_{i} > ρ_{j}}} |i⟩ |X_{i}, ρ_{i}⟩, (7)

is a superposition over all points $i$ whose distance $d_{i, j}$ from the base point $j$ lies between the thresholds $d_{L}$ and $d_{t}$ . Furthermore, their local density $ρ_{i}$ should be higher than that of the base $ρ_{j}$ . At each step, $d_{L}$ and $d_{t}$ are updated based on whether a point satisfying the conditions in the grey diamond of Figure 4A is found. Ancilla registers are used here as detailed in Supplementary Appendix SA.

Figure 4

Figure 4. (A) Diagrammatic representation of the algorithm. GEBS determines successive candidates for the “Nearest Higher” until the proper one is found. The quantum state in Equation 6 is prepared in the “Initialize” step (green box). Grover Search (larger diamond) is then performed to find the points satisfying $d_{L} < d_{i, j} < d_{t}, ρ_{i} > ρ_{j}$ . If this condition is satisfied (“ $Y$ ” branch), $d_{t}$ is updated and Grover is run again. If not (“ $N$ ” branch), control flows to the “?” diamond. The branch $A$ is entered if the “?” condition is being checked for the first time or if branch $B$ was just run. Branch $B$ is entered if branch $Y$ was just run. (B) The algorithm’s working is shown step-by-step (numbers at the bottom) for the search space $S$ in the inset in the top right corner. The points are mapped to a line where the height represents the distance $d_{i, j}$ from the base point $j$ (black dot at the bottom). The grey (orange) points are outside (inside) the green consideration circle with radius $d_{m}$ [see also Figure 2B]. At each step of GEBS, the thresholds $d_{L}$ and $d_{t}$ are updated according to the logic in panel (A). The dot with the red border indicates the current candidate for $N_{j}$ ; when filled (empty) it is (not) found by Grover Search at that step. The yellow point is the Nearest Higher $N_{j}$ that is found at the end of GEBS.

To better understand the algorithm, we provide a step-by-step walkthrough of the example in Figure 4B. The search space $S$ is schematically represented by the inset on the right-hand side, where each dot represents a point with a size that is proportional to its local density. The consideration circle (light green, dotted border) highlights all points within a radius $d_{m} = δ d_{c}$ . In this work, we set the outlier delta factor $δ$ to 2. The consideration circle in the inset corresponds to $d_{L} = 0$ and $d_{t} = d_{m}$ , shown in step (I). In the main panel, the vertical lines refer to the steps (I–VI) of GEBS that are reported below, and schematically represent the distances of all points (coloured dots) from the base point $j$ (black dot at the bottom).

GEBS starts with the higher threshold set as $d_{t} = d_{m}$ and the lower threshold $d_{L} = 0$ as shown in vertical line (I) of Figure 4B. Following the probabilistic nature of quantum mechanics, assume that the point with a red border indexed $i$ is found after measuring the output of the Grover Search in Equation 7. This triggers the updates in the $Y$ branch in the diagram of Figure 4A, such that we assign $N_{j} = i$ and update $d_{t} \mapsto (d_{i, j} + d_{L}) / 2$ . The point indexed $i$ is then removed from the search space, as can be seen in (II). Now, since no point satisfies the conditions in the diamond of the flow diagram [see (II)] and $d_{t}$ was just set to $(d_{i, j} + d_{L}) / 2$ , the $B$ branch is carried out. This updates the thresholds $d_{t}$ and $d_{L}$ for the next iteration of the algorithm, see (III).

Now, assume that the new point with a red border is found [step (III)]. Updates in the $Y$ branch of Figure 4A are carried out again with a new index $i$ and the search region is reduced to contain a single point. In the next step (IV), that point (yellow) is found and, for the third and last time, the nearest higher and the thresholds are updated according to the $Y$ branch. Next, since no point is found in (V), qCLUE executes the updates in the $B$ branch of the diagram. In the last iteration (VI), no points satisfy the desired conditions. The parameter $d_{t}$ was just set to $d_{t - 1}$ , i.e., the subroutine just ran $B$ , which means that the $A$ branch is now executed and $N_{j}$ is returned.

The runtime complexity of the GEBS procedure, with $m$ points in the search space $S$ , is $O (α \sqrt{m})$ as opposed to $O (m)$ classically. The $α$ term is due to the binary search procedure and depends on the size of the quantum register used to encode the distance. Specifically, for a chosen precision $2^{- Δ}$ used for the positions of the points in the datasets, $α = Δ$ .
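The logarithmic number of search steps can be illustrated with a simplified classical emulation of a Grover-enhanced binary search. This sketch is not the exact $A$ / $B$ / $Y$ branch logic of Figure 4A: it uses half-open distance windows, resolves distances only down to $d_{m} 2^{- Δ}$ , and replaces each Grover run by the hypothetical helper `grover_find`, which returns an arbitrary valid index if one exists.

```python
import math
import random


def grover_find(points, rho, j, lo, hi, rng):
    """Idealized stand-in for a Grover run: some i with lo <= d_ij < hi and rho_i > rho_j."""
    valid = [i for i in range(len(points))
             if rho[i] > rho[j] and lo <= math.dist(points[i], points[j]) < hi]
    return rng.choice(valid) if valid else None


def gebs_nearest_higher(points, rho, j, d_m, precision_bits=20, seed=0):
    """Return (nearest higher index or None, number of binary-search steps)."""
    rng = random.Random(seed)
    eps = d_m * 2.0 ** (-precision_bits)  # distance resolution, 2**-Delta in units of d_m
    cand = grover_find(points, rho, j, 0.0, d_m, rng)
    if cand is None:
        return None, 0                    # no higher-density point within d_m
    lo, hi = 0.0, math.dist(points[cand], points[j])
    steps = 0
    while hi - lo > eps:
        steps += 1
        mid = (lo + hi) / 2.0
        i = grover_find(points, rho, j, lo, mid, rng)
        if i is None:
            lo = mid                      # nothing closer than mid: raise the lower threshold
        else:
            cand, hi = i, math.dist(points[i], points[j])  # closer candidate found
    return cand, steps


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0), (0.5, 0.0)]
    rho = [1.0, 2.0, 3.0, 4.0, 0.5]
    print(gebs_nearest_higher(pts, rho, j=0, d_m=4.0))
```

Every step at least halves the window $[d_{L}, d_{t})$ , so the number of steps is bounded by the number of precision bits, mirroring the factor $α = Δ$ in the runtime.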

2.4 Find seeds, outliers, and assign clusters

Once the Nearest Highers $N_{j}$ are determined for all points $j$ in the dataset, Seeds and Outliers are found via another Grover Search over all points in the dataset. As per the definition in Equation 3a, Seeds [red points in Figure 2C] are the points with highest local density within a neighbourhood. Outliers [blue points in Figure 2C] are mathematically described by Equation 3b, are most likely noise, and therefore do not belong to any cluster.

Similar to the previous subroutines, the quantum registers for these procedures are initialized via qRAM. Seeds and outliers are then determined based on the corresponding conditions via Grover Search. Two quantum registers, the first marking whether a point is an outlier and the second to store the seed number – which is also the cluster number – are added to the quantum database.

The final subroutine of qCLUE is the assignment of points to clusters. At this stage, outliers have been removed from the input dataset, as they have been already identified. The algorithm flow is the same as that of the Local Density step in Figure 3. For a chosen seed $s$ , we define $C$ to be the set containing the indices of all points determined to be in the associated cluster at the end of this subroutine. To assign points to $C$ , we follow a procedure similar to that of the Local Density step in Figure 3. In the “Initialize” step, $C$ is initialized to ${s}$ and the quantum registers are initialized via qRAM to the state.

\sum_{i \in \mathrm{DSS}} |i⟩ \overset{\mathrm{qRAM}}{\to} \sum_{i} |i⟩ |V_{i}⟩, (8a)

|V_{i}⟩ = |X_{i}, ρ_{i}, d_{N_{i}, i}, X_{N_{i}}⟩ . (8b)

In the “Grover” block, we search over a superposition of points in the dataset, which we call the Dynamic Search Space (DSS), created by qRAM as shown in Equations 8a, 8b. The DSS differs from the search space $S$ in the Local Density step in that it is dynamic: it depends on the points in $C$ , which are updated at each iteration. In Figure 2C, for instance, the red seed and the orange point, both with black borders, are the elements of the current $C$ . To find the DSS, a square window of edge $2 d_{m}$ is first opened for every point in $C$ (in the figure, the squares with the same border style as the corresponding points). A rectangular region (red box) is then obtained by finding the axis-aligned minimum bounding box for these windows. The set of tiles $□_{k}$ covered partially or fully by this minimum bounding box is the DSS. For example, in Figure 2C, it comprises the 9 tiles highlighted in light red.
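
The construction of the DSS can be sketched classically as follows. This is a minimal illustration in two dimensions, assuming square tiles of edge `tile_size` aligned to the origin and indexed by integer pairs; the function name and these conventions are hypothetical and are not taken from the qCLUE implementation.

```python
import math


def dynamic_search_space(points_in_c, d_m, tile_size):
    """Return the set of (ix, iy) tile indices covered by the bounding box.

    points_in_c : iterable of (x, y) coordinates of the points currently in C
    d_m         : half-edge of the square window opened around each point
    tile_size   : edge length of a tile (assumed square, aligned to the origin)
    """
    xs = [p[0] for p in points_in_c]
    ys = [p[1] for p in points_in_c]
    # Each window spans [x - d_m, x + d_m] per axis, so the minimum bounding
    # box of all windows is the bounding box of the points padded by d_m.
    x_lo, x_hi = min(xs) - d_m, max(xs) + d_m
    y_lo, y_hi = min(ys) - d_m, max(ys) + d_m
    # Tiles touched partially or fully by the box. A box edge lying exactly on
    # a tile boundary counts the neighbouring tile as covered (conservative).
    ix_lo, ix_hi = math.floor(x_lo / tile_size), math.floor(x_hi / tile_size)
    iy_lo, iy_hi = math.floor(y_lo / tile_size), math.floor(y_hi / tile_size)
    return {(ix, iy)
            for ix in range(ix_lo, ix_hi + 1)
            for iy in range(iy_lo, iy_hi + 1)}


if __name__ == "__main__":
    tiles = dynamic_search_space([(15.0, 15.0)], d_m=10.0, tile_size=10.0)
    print(sorted(tiles))
```

Because the box only grows when a newly added point lies near its border, the DSS can be recomputed cheaply after each update of $C$ .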

With a similar procedure as for the Local Density subroutine, the “Grover” block now systematically identifies all followers of all points within set $C$ . Here, in the “Update” step in Figure 3, as the point found by the “Grover” block has passed the “Valid” condition, it is appended to $C$ . Once no more points are found, the “Return” block yields $C$ , following the same flow as the Local Density computation subroutine.

The complexity of the Cluster Assignment step is similar to that of the Local Density Computation subroutine. The quantum advantage stems from the quadratic speedup provided by the Grover algorithm, which allows followers to be determined faster than in CLUE. If there are $f$ points in a cluster $C$ and $m$ points in the corresponding DSS, the classical complexity of the Cluster Assignment step is $O (m)$ , while the quantum algorithm has a runtime of $O (f \sqrt{m})$ .
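
The iteration structure of the Cluster Assignment step can be emulated classically as in the sketch below. It is only an illustration: the inner scan stands in for one Grover search, the restriction of the search to the DSS is omitted, and the “Valid” condition is reduced to the assumption that a follower is a point whose nearest higher already belongs to $C$ . The names `assign_cluster` and `nearest_higher` are hypothetical.

```python
def assign_cluster(seed, nearest_higher):
    """Grow the cluster C of a seed by repeatedly adding followers.

    seed           : index of the seed s
    nearest_higher : nearest_higher[j] is the index N_j of the nearest higher
                     of point j, or None if j has none (e.g. j is a seed)
    """
    cluster = [seed]      # "Initialize": C = {s}
    members = {seed}
    while True:
        # Stand-in for one "Grover" call: find any point outside C whose
        # nearest higher is already in C.
        found = next(
            (j for j, nh in enumerate(nearest_higher)
             if j not in members and nh in members),
            None,
        )
        if found is None:
            return cluster          # "Return": no more followers
        cluster.append(found)       # "Update": append the new follower to C
        members.add(found)


if __name__ == "__main__":
    nh = [None, 0, 1, None, 3, 0]
    print(assign_cluster(0, nh), assign_cluster(3, nh))
```

Each successful pass of the loop adds one point, so a cluster of $f$ points requires on the order of $f$ searches; with each search costing $O (\sqrt{m})$ on a quantum computer, this reproduces the $O (f \sqrt{m})$ scaling quoted above.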

3 Results

In this section, we test qCLUE in multiple scenarios, each designed to investigate its performance for different settings. In Section 3.1, we introduce the scoring metrics used for our analysis. In Section 3.2, we describe the performance of the algorithm applied on a single cluster in a uniform noisy environment. In Section 3.3, we study the performance on overlapping clusters. Finally, in Section 3.4, we study the performance of qCLUE on non-centroidal clusters with and without a weight profile.

3.1 Scoring metrics: homogeneity and completeness scores

It is more important to correctly classify high-weight points such as seeds than low-weight points such as outliers. Since we would like our metric to be cognizant of this, we use modified, weight-aware versions (Jekaterina, 2023) of the Homogeneity $(F_{H})$ and Completeness $(F_{C})$ scores (Rosenberg and Hirschberg, 2007). These metrics are defined in terms of the predicted cluster labels $C_{p}$ obtained from qCLUE, and the true cluster labels $C_{t}$ of the generated dataset. $F_{H}$ and $F_{C}$ are based on the weight-aware (Jekaterina, 2023) mutual information $I (C_{p} : C_{t})$ , the Shannon entropies $H (C_{p})$ and $H (C_{t})$ , and the joint Shannon entropy $H (C_{p}, C_{t})$ (Nielsen and Chuang, 2010):

F_{H} = \frac{I (C_{p} : C_{t})}{H (C_{t})} and F_{C} = \frac{I (C_{p} : C_{t})}{H (C_{p})}, (9a)

H (C_{p}) = - \sum_{a} \frac{E_{a}}{E} \log_{2} \frac{E_{a}}{E}, (9b)

H (C_{t}) = - \sum_{b} \frac{E_{b}}{E} \log_{2} \frac{E_{b}}{E}, (9c)

H (C_{p}, C_{t}) = - \sum_{a} \sum_{b} \frac{E_{a, b}}{E} \log_{2} \frac{E_{a, b}}{E}, (9d)

I (C_{p} : C_{t}) = H (C_{p}) + H (C_{t}) - H (C_{p}, C_{t}) . (9e)

As discussed in Jekaterina (2023), $E_{a}$ is the weight aggregated over all points that qCLUE classifies into cluster $a$ . $E_{b}$ is the weight aggregated over all points in cluster $b$ in the true dataset. $E_{a, b}$ is the sum of weights of all points in cluster $b$ in the true dataset that are also assigned to cluster $a$ by qCLUE. $E$ is the accumulated weight of all points in the dataset. We remark that for unit weights, Equations 9a–9e reduce to the more common form presented in Rosenberg and Hirschberg (2007).
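
Equations 9a–9e translate directly into a few lines of NumPy. The sketch below is an illustrative implementation, not the code used for the results in this paper; the function names are hypothetical, and returning 1 when the entropy in the denominator vanishes is an assumed convention (it matches the common unweighted definition).

```python
import numpy as np


def weighted_entropy(weight_sums, total):
    """Shannon entropy (base 2) of the distribution weight_sums / total."""
    p = weight_sums[weight_sums > 0] / total   # skip empty cells: 0 log 0 = 0
    return float(-(p * np.log2(p)).sum())


def weighted_homogeneity_completeness(pred, true, weights):
    """Return (F_H, F_C) following Equations 9a-9e.

    pred, true : predicted and true cluster labels of each point
    weights    : weight of each point (unit weights give the standard scores)
    """
    pred = np.asarray(pred).ravel()
    true = np.asarray(true).ravel()
    w = np.asarray(weights, dtype=float).ravel()
    total = w.sum()                                    # E
    _, a = np.unique(pred, return_inverse=True)
    _, b = np.unique(true, return_inverse=True)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), w)                        # E_{a,b}
    h_p = weighted_entropy(joint.sum(axis=1), total)   # H(C_p), from E_a
    h_t = weighted_entropy(joint.sum(axis=0), total)   # H(C_t), from E_b
    h_pt = weighted_entropy(joint.ravel(), total)      # H(C_p, C_t)
    mi = h_p + h_t - h_pt                              # I(C_p : C_t)
    f_h = mi / h_t if h_t > 0 else 1.0   # assumed convention for H(C_t) = 0
    f_c = mi / h_p if h_p > 0 else 1.0   # assumed convention for H(C_p) = 0
    return f_h, f_c


if __name__ == "__main__":
    # One true cluster split into two predicted clusters, one left intact.
    print(weighted_homogeneity_completeness(
        pred=[0, 0, 1, 1, 2, 2], true=[0, 0, 0, 0, 1, 1], weights=np.ones(6)))
```

In the usage example a true cluster is split in two by the prediction: the homogeneity stays at one while the completeness drops below one, which is the behaviour exploited in Section 3.4.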

qCLUE applied to an input dataset yields homogeneity $F_{H} = 1$ if all of the predicted clusters only contain data points that are members of a single true cluster. On the other hand, $F_{C} = 1$ is obtained if all the data points that are members of a given true cluster are elements of the same reconstructed cluster. Therefore, these metrics are suited to different scenarios. The impacts of noise and cluster overlap investigated in Sections 3.2 and 3.3 are better captured by $F_{H}$ . Indeed, if qCLUE incorrectly classifies noise points into predicted clusters, $F_{C}$ is unaffected. On the other hand, $F_{C}$ is employed when studying non-centroidal clusters in Section 3.4, since $F_{H} = 1$ even if one true cluster is divided by qCLUE into many sub-clusters.

3.2 Noise

Here, we study the performance of qCLUE for a single cluster in a noisy environment. We vary the number $N_{N}$ of noise points sampled from a uniform distribution over a square region of fixed size. A cluster of $N_{C}$ points with coordinates $X_{j} = [x_{j, 1}, x_{j, 2}]$ is sampled from the multivariate Gaussian distribution

\mathrm{pdf} (X_{j}) = \frac{e^{- \frac{1}{2} {(X_{j} - μ)}^{T} Σ^{- 1} (X_{j} - μ)}}{{(2 π)}^{\frac{n}{2}} | Σ |^{\frac{1}{2}}}, (10)

where $μ = {[μ_{x_{1}}, μ_{x_{2}}]}^{T}$ is the mean of the distribution (set to ${[0,0]}^{T}$ in our case) and $Σ$ the covariance matrix. Here, we choose $Σ = σ I$ , with $I$ being the identity matrix and $σ$ a positive real number.

Examples of the generated clusters (in orange) and noise (in blue) are given in Figures 5A, B for $N_{N} / N_{C} = 0.33$ at $σ = 32$ and $N_{N} / N_{C} = 1$ at $σ = 10$ , respectively. The weight assigned to each point $X_{j}$ in the cluster is given by $A \times \mathrm{pdf} (X_{j})$ [see Equation 10] with $A = 5 \times 10^{2}$ . The weight of each noise point is randomly sampled between zero and one. This choice resembles the typical scenarios in ER tasks for which CLUE (Rovere et al., 2020) was designed.
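
A dataset of this kind can be generated with the following sketch. It is an illustration under stated assumptions rather than the script used for Figure 5: the square is assumed to be centered on the origin with side 500, $Σ = σ I$ is taken literally so that $σ$ acts as a variance (if $σ$ is meant as a standard deviation, the covariance would be $σ^{2} I$ instead), and the function name and label convention are hypothetical.

```python
import numpy as np


def make_noisy_cluster(n_cluster=750, n_noise=250, sigma=32.0,
                       side=500.0, amplitude=500.0, seed=0):
    """Return (points, weights, labels) for one Gaussian cluster plus noise.

    labels: 0 for cluster points, -1 for noise points (assumed convention).
    """
    rng = np.random.default_rng(seed)

    # Cluster: 2D Gaussian with mean [0, 0] and covariance sigma * I.
    cluster = rng.multivariate_normal(mean=[0.0, 0.0],
                                      cov=sigma * np.eye(2),
                                      size=n_cluster)
    # Equation 10 for n = 2 and Sigma = sigma * I:
    # |Sigma|^(1/2) = sigma and Sigma^(-1) = I / sigma.
    sq_norm = np.sum(cluster**2, axis=1)
    pdf = np.exp(-0.5 * sq_norm / sigma) / (2.0 * np.pi * sigma)
    cluster_weights = amplitude * pdf          # weight = A * pdf(X_j)

    # Noise: uniform positions in the square, uniform weights in [0, 1).
    noise = rng.uniform(-side / 2.0, side / 2.0, size=(n_noise, 2))
    noise_weights = rng.uniform(0.0, 1.0, size=n_noise)

    points = np.vstack([cluster, noise])
    weights = np.concatenate([cluster_weights, noise_weights])
    labels = np.concatenate([np.zeros(n_cluster, dtype=int),
                             -np.ones(n_noise, dtype=int)])
    return points, weights, labels


if __name__ == "__main__":
    pts, w, lab = make_noisy_cluster()
    print(pts.shape, w.shape, (lab == -1).sum() / (lab == 0).sum())
```

With the default arguments the ratio $N_{N} / N_{C}$ is approximately 0.33, the setting shown in Figure 5A.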

Figure 5. Numerical results from qCLUE simulated on a classical machine. (A–C) qCLUE’s performance in noisy environments. The dataset generated for these experiments and visualized in panels (A, B) consists of a cluster of $N_{C} = 750$ points sampled from the Gaussian distribution in Equation 10 and $N_{N}$ noise points sampled from a uniform distribution over a square of size 500. The weight of noise points is sampled uniformly between zero and one, while each cluster point is assigned a weight that is the probability of being sampled multiplied by a factor $A = 500$ . (A, B) Computed clusters at $N_{N} / N_{C} = 0.33, σ = 32$ , and $N_{N} / N_{C} = 1, σ = 10$ , respectively. In (C), $F_{H}$ is plotted against $N_{N} / N_{C}$ for the $σ$ in the legend. (D–F) Performance for overlapping clusters. In (D), $F_{H}$ vs. $r / σ$ is shown for $σ = 30$ and different ratios $N_{1} / N_{2}$ . Here, $r$ is the distance between the centers of two clusters with $N_{1} = 500$ and $N_{2}$ points, and we assign to each point a weight that is equal to its sampling probability in Equation 10. (E, F) Computed clusters at $r / σ = 2.0, N_{1} / N_{2} = 1$ , and $r / σ = 2.67, N_{1} / N_{2} = 2$ , respectively. The shadowed regions in (C, D) represent the standard deviations of $F_{H}$ over 30 iterations. (G–J) Performance over non-centroidal clusters of 500 points each, generated with scikit-learn (Pedregosa et al., 2018). In (G, H) the points’ weight profile is uniform, while in (I, J) it is varied linearly with respect to the distance such that each cluster has a single, most energetic point (see Section 3.4). For all experiments, $d_{c}$ was set to 20 and $\tilde{ρ}$ was set to 25. (A–F) use the weight-aware metric in Equations 9a–9e, while in (G–J), since the weight profile is assigned by the user and is not part of the dataset itself, in the scoring process we set all points to have the same weight.

In Figure 5C, we show the variation of the homogeneity score $F_{H}$ with respect to the ratio $N_{N} / N_{C}$ . We employ the values of $σ$ reported in the legend, associated with different colors in the plot. As can be seen, the clustering performance decreases as either $N_{N} / N_{C}$ or $σ$ increases. When these parameters are small, the typical distance between cluster points is much smaller than that between noise points, and $F_{H}$ approaches unity. With a higher chance of labeling noise points as within the cluster, however, $F_{H}$ is lowered. As such, the degradation of $F_{H}$ tracks the probability of a noise point being in the cluster region, which increases with both $σ$ and $N_{N} / N_{C}$ .

3.3 Overlap

Here, we consider the case of two circular clusters with $N_{1}$ and $N_{2}$ points respectively, each sampled from the multivariate Gaussian distribution in Equation 10 and with $Σ = σ I$ . The weight profile is determined by $\mathrm{pdf} (X_{j})$ for coordinates $X_{j}$ . The centers $μ_{1}$ and $μ_{2}$ (two instances of $μ$ ) are chosen to be $(r / 2,0)$ and $(- r / 2,0)$ , respectively, such that the distance between the cluster centers is $r$ .

In Figure 5D, we study the variation of the homogeneity score $F_{H}$ as a function of $r / σ$ for several values of $N_{1} / N_{2}$ . The computed clusters for $r / σ = 2$ at $N_{1} / N_{2} = 1$ and $r / σ = 2.67$ at $N_{1} / N_{2} = 2$ are shown in panels (E) and (F), respectively, to showcase the typical scenarios considered here.

For all $N_{1} / N_{2}$ , $F_{H}$ is zero for low $r / σ$ (high overlap). There is then a region where $F_{H}$ increases with $r / σ$ and then saturates at unity for high $r / σ$ (little to no overlap). When the two clusters are too close, i.e., $r / σ ≪ 1$ , they are in fact indistinguishable and qCLUE labels all points together. Increasing the ratio $r / σ$ makes the clusters move away from each other and thus qCLUE can discern them. This behavior can be observed in Figures 5E, F. Importantly, large values of $F_{H}$ are already attained when the clusters still have a significant overlap. In this scenario, employing the weight labels and the weight density considerably contributes to accurate cluster assignment. In fact, the nearest higher points are more likely to connect the points near or on the decision boundary with the more energetic core, thus separating the clusters better.

The performance of qCLUE is also affected by the ratio $N_{1} / N_{2}$ . When one cluster contains more points than the other, it is more likely to “capture” points from the smaller one. The resulting loss in homogeneity score $F_{H}$ for low $r / σ$ ratios is evident from Figure 5D, where it can be seen that clusters of similar sizes are better distinguished from each other.

3.4 Non-centroidal clusters

Finally, we study the performance of qCLUE on non-centroidal clusters. For this purpose, we use the Moons and Circles datasets in Figures 5G–J, generated using scikit-learn (Pedregosa et al., 2018). Two settings are considered: one where a uniform weight profile is applied over the points [panels (G, H)] and one where a linear gradient weight profile is employed [panels (I, J)].

In the latter case, we assign the highest value of the weight for each cluster to a single point and lower the weights of all other points proportionally to their $x_{2}$ coordinate. In the case of the moon dataset, $E = x_{2}$ for the upper moon (so the top point of the upper moon has the maximum weight in the cluster) and $E = 60 - x_{2}$ for the lower moon (so the bottom point has the highest weight in the cluster). For the circles, $E = | x_{2} - 200 | / 10$ for the inner circle and $E = | x_{2} + 100 | / 5$ for the outer one.

Since these datasets are noiseless and well separated, $F_{H}$ is always one and we employ $F_{C}$ to characterize the performance of qCLUE. In Figures 5G, H the weight profile is uniform, so several points satisfy the seed condition. Therefore, qCLUE groups each circle into several clusters, such that we obtain limited values for $F_{C}$ . On the contrary, assigning a weight profile [Figures 5I, J] results in fewer seeds that are better recognized by qCLUE, and the completeness score $F_{C}$ is considerably enhanced.

4 Conclusion and outlook

We introduced qCLUE, a novel quantum clustering algorithm designed to address the computational challenges associated with high-dimensional datasets. qCLUE’s significance lies in its potential to efficiently cluster data by effectively leveraging quantum computing, mitigating the escalating computational complexity encountered by classical algorithms upon increasing dimensionality of datasets. The algorithm’s ability to navigate high-dimensional spaces is particularly promising on datasets with high point density, where local searches become too demanding for classical computers. Therefore, qCLUE will be beneficial in multiple scenarios, ranging from quantum-enhanced machine learning (Haug et al., 2023; Zeguendry et al., 2023) to complex data analysis tasks (Sinayskiy et al., 2015).

According to our numerical results, qCLUE works well and its performance is significantly enhanced when a weight profile is assigned. Specifically, we study qCLUE in noisy environments, on overlapping clusters, and on non-centroidal datasets that are commonly used to benchmark clustering algorithms (Fujita, 2021; Tiwari et al., 2020). In scenarios that are typically encountered in ER tasks, qCLUE correctly reconstructs the true clusters to a high level of accuracy as it matches the performance of CERN’s CLUE on a given dataset. On the other hand, a weight profile can significantly boost qCLUE performance as we have seen in the case of non-centroidal clusters. Our numerical results, backed up by the well-studied CLUE and by the quadratic speedup stemming from Grover search, make qCLUE a promising candidate for addressing high-dimensional clustering problems (Wei et al., 2020; Kerenidis and Landman, 2021; Duarte et al., 2023).

As a first outlook, we identify the implementation of qCLUE on NISQ hardware (Celi et al., 2020; Labuhn et al., 2016; Bernien et al., 2017; Lanyon et al., 2011; Arute et al., 2019; Córcoles et al., 2015; Debnath et al., 2016). This requires a comprehensive consideration of real device constraints. Aspects such as circuit optimization (Nash et al., 2020), and the impact of noise will be critical and must be carefully addressed. Second, it is possible to improve the scaling of qCLUE by devising a unitary that mitigates the need for repeating Grover’s algorithm for each point satisfying the search condition and thereby eliminating the factors of $p$ , $α$ , and $f$ in the scaling of the subroutines outlined in Sections 2.2–2.4 respectively. We finally note that it is worth investigating variations of qCLUE that improve the quality of clustering in different scenarios. For instance, one can devise more sophisticated criteria for the Nearest Higher or Local Density computation steps. Performance on a given dataset can also be improved by performing exhaustive hyperparameter searches or via hyperparameter optimization algorithms (Wu et al., 2019).

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://zenodo.org/records/12655189.

Author contributions

DG: Writing–original draft, Writing–review and editing. LD: Writing–original draft, Writing–review and editing. AD: Writing–original draft, Writing–review and editing. WR: Writing–original draft, Writing–review and editing. FP: Writing–original draft, Writing–review and editing. MM: Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. CERN Quantum Initiative. Wolfgang Gentner Programme of the German Federal Ministry of Education and Research (grant no. 13E18CHA). EPSRC quantum career development grant EP/W028301/1. NTT PHI Lab. Government of Canada through Innovation, Science and Economic Development Canada (ISED). Province of Ontario through the Ministry of Colleges and Universities.

Acknowledgments

We thank the CERN Quantum Initiative, Fabio Fracas for creating the fertile ground for starting this project and Andrew J. Jena as well as Priyanka Mukhopadhyay for theoretical support. WR acknowledges the Wolfgang Gentner Programme of the German Federal Ministry of Education and Research (grant no. 13E18CHA). LD acknowledges the EPSRC quantum career development grant EP/W028301/1. DG and MM acknowledge the NTT PHI Lab for funding. Research at IQC is further supported by the Government of Canada through Innovation, Science and Economic Development Canada (ISED). Research at Perimeter Institute is supported in part by the Government of Canada through ISED and by the Province of Ontario through the Ministry of Colleges and Universities.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frqst.2024.1462004/full#supplementary-material

References

Aad, G., Abajyan, T., Abbott, B., Abdallah, J., Abdel Khalek, S., Abdelalim, A., et al. (2012). Observation of a new particle in the search for the standard model higgs boson with the ATLAS detector at the LHC. Phys. Lett. B 716 (1), 1–29. doi:10.1016/j.physletb.2012.08.020