CLUE: A Fast Parallel Clustering Algorithm for High Granularity Calorimeters in High-Energy Physics

One of the challenges of high granularity calorimeters, such as that to be built to cover the endcap region in the CMS Phase-2 Upgrade for HL-LHC, is that the large number of channels causes a surge in the computing load when clustering numerous digitized energy deposits (hits) in the reconstruction stage. In this article, we propose a fast and fully parallelizable density-based clustering algorithm, optimized for high-occupancy scenarios, where the number of clusters is much larger than the average number of hits in a cluster. The algorithm uses a grid spatial index for fast querying of neighbors and its timing scales linearly with the number of hits within the range considered. We also show a comparison of the performance on CPU and GPU implementations, demonstrating the power of algorithmic parallelization in the coming era of heterogeneous computing in high-energy physics.


Introduction
Calorimeters with high lateral and longitudinal readout granularity, capable of providing a fine-grained image of electromagnetic and hadronic showers, have been suggested for future high energy physics experiments [1]. The silicon sensor readout cells of the CMS endcap calorimeter (HGCAL) [2] for HL-LHC [3] have an area of about 1 cm^2. When a particle showers, the deposited energy is collected by the sensors on the layers which the shower traverses. The purpose of the clustering algorithm, when applied to shower reconstruction, is to group together individual energy deposits (hits) originating from a particle shower. Due to the high lateral granularity, the number of hits per layer is large, and it is computationally advantageous to collect hits into 2D clusters layer-by-layer [4] and then associate these 2D clusters, representing energy blobs, across different layers [2].
However, a computational challenge emerges as a consequence of the large data scale and the limited time budget: the reconstruction of each event is tightly constrained by a millisecond-level execution time.
This constraint requires the clustering algorithm to be highly efficient while maintaining a low computational complexity. Furthermore, a linear scalability is strongly desired in order to avoid bottlenecking the performance of the entire event reconstruction. Finally, it is highly preferable to have a fully-parallelizable clustering algorithm to take advantage of the trend of heterogeneous computing with hardware accelerators, such as graphics processing units (GPUs), achieving a higher event throughput and a better energy efficiency.
The input to the clustering algorithm is a set of n hits, whose number varies from a few thousand to a few million, depending on the longitudinal and transverse granularity of the calorimeter as well as on the number of particles entering the detector. The output is a set of k clusters, whose number is usually one or two orders of magnitude smaller than n and in principle depends on both the number of incoming particles and the number of layers. Assuming that the lateral granularity of the sensors is constant and finite, the average number of hits per cluster (m = n/k) is also constant and finite. For example, in the CMS HGCAL, m is of the order of 10. This leads to the relation among the number of hits n, the number of clusters k, and the average number of hits per cluster m: n > k ≫ m.
Most well-known algorithms do not simultaneously satisfy the requirements of linear scalability and easy parallelization for applications like clustering hits in high granularity calorimeters, which are characterized by low dimensionality and n > k ≫ m. It is therefore important to investigate new, fast and parallelizable clustering algorithms, together with an optimized accompanying spatial index that can be conveniently constructed and queried in parallel.
In this paper, we describe CLUE (CLUstering of Energy), a novel parallel density-based clustering algorithm. Its development was inspired by the work described in ref. [5]. In Section 2, we describe the CLUE algorithm and its accompanying spatial index. In Section 3, some details of the GPU implementation are discussed. Finally, in Section 4 we present CLUE's ability to handle non-spherical cluster shapes and to reject noise, followed by its computational performance when executed on CPU and GPU with synthetic data mimicking hits in high granularity calorimeters.

Clustering Algorithm
Clustering data is one of the most challenging tasks in several scientific domains. The definition of a cluster is itself not trivial, as it strongly depends on the context. Many clustering methods have been developed based on a variety of induction principles [6]. Currently popular clustering algorithms include (but are not limited to) partitioning, hierarchical and density-based approaches [6,7]. Partitioning approaches, such as k-means [8], compose clusters by optimizing a distance-based dissimilarity function. However, in the application to high granularity calorimeters, partitioning approaches are prohibitive because the number of clusters k is not known a priori. Hierarchical methods build clusters by constructing a dendrogram through a recursion of splitting or merging. However, hierarchical methods do not scale well because each decision to merge or split needs to scan over many objects or clusters [7]. Therefore, they are not suitable for our application. Density-based methods, such as DBSCAN [9], OPTICS [10] and Clustering by Fast Search and Find of Density Peaks (CFSFDP) [5], group points by detecting continuous high-density regions. They are capable of discovering clusters of arbitrary shapes and are efficient for large spatial databases. If a spatial index is used, their computational complexity is O(n log n) [7]. However, one potential weakness of the currently well-known density-based algorithms is that they intrinsically include serial processes which are hard to parallelize: DBSCAN has to iteratively visit all points within an enclosure of density-connectedness before working on the next cluster [9]; OPTICS needs to sequentially add points to an ordered list to obtain a dendrogram of reachability distances [10]; CFSFDP needs to sequentially assign points to clusters in order of decreasing density [5].
In the application to high granularity calorimeters, as discussed in Section 1, linear scalability and full parallelizability are essential to handle a huge dataset efficiently by means of heterogeneous computing.
In order to satisfy these requirements, we propose a fast and fully-parallelizable density-based clustering algorithm, CLUE. For each point, CLUE computes a local density ρ and a separation δ, where δ is the distance to the nearest point with higher density (the "nearest-higher"); this definition is slightly adapted from that in CFSFDP in order to take advantage of the spatial index.
Then cluster seeds and outliers are identified based on thresholds on ρ and δ. Differing from the cluster assignment in CFSFDP, which sorts points by density and adds them to clusters in order of decreasing density, CLUE first builds a list of followers for each point by registering each point as a follower of its nearest-higher. It then expands clusters by passing cluster indices from the seeds to their followers iteratively. Since such expansions of clusters are fully independent of each other, this not only avoids the costly density sorting of CFSFDP, but also enables a k-way parallelization. Unlike the noise identification in CFSFDP, CLUE rejects noise by identifying outliers and their iteratively descendant followers, as discussed in Section 4.1.
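As an illustration, the follower-list construction and seed-driven expansion described above can be sketched as follows. This is a minimal serial sketch, not the authors' implementation; the arrays of nearest-higher indices and seed/outlier flags are assumed to have been produced by the earlier steps.

```cpp
#include <vector>
#include <stack>

// Sketch: given each point's precomputed nearest-higher index (-1 if none)
// and seed/outlier flags, build the follower lists and expand cluster ids
// from the seeds, with no density sort.
std::vector<int> expandClusters(const std::vector<int>& nearestHigher,
                                const std::vector<bool>& isSeed,
                                const std::vector<bool>& isOutlier) {
    const int n = static_cast<int>(nearestHigher.size());
    std::vector<std::vector<int>> followers(n);
    for (int i = 0; i < n; ++i)
        // Points that are neither seeds nor outliers register as followers
        // of their nearest-higher.
        if (!isSeed[i] && !isOutlier[i] && nearestHigher[i] >= 0)
            followers[nearestHigher[i]].push_back(i);

    std::vector<int> clusterId(n, -1);  // -1: noise / unassigned
    int nextCluster = 0;
    for (int s = 0; s < n; ++s) {
        if (!isSeed[s]) continue;
        // Each seed starts an independent expansion: its cluster id is
        // passed down the chains of followers (k-way parallelizable).
        clusterId[s] = nextCluster++;
        std::stack<int> todo;
        todo.push(s);
        while (!todo.empty()) {
            int p = todo.top(); todo.pop();
            for (int f : followers[p]) { clusterId[f] = clusterId[p]; todo.push(f); }
        }
    }
    return clusterId;
}
```

Note that an outlier never registers as a follower, so neither it nor any of its descendant followers can receive a cluster id.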

Spatial index with fixed-grid
Neighbourhood query, which retrieves nearby points within a given distance, is one of the most frequent operations in density-based clustering algorithms. CLUE uses a spatial index to access and query spatial data points efficiently. Given that the physical layout of the sensor cells is a multi-layer tessellation, it is intuitive to index the data with a fixed-grid, which divides the space into fixed rectangular bins [11,12]. Compared with data-driven structures such as the KD-Tree [13] and the R-Tree [14], the space partition of a fixed-grid is independent of any particular distribution of the data points [15] and thus can be explicitly predefined before loading the data points. In addition, both construction and query with a fixed-grid are computationally simple and can be easily parallelized. Therefore, CLUE uses a fixed-grid as its spatial index for efficient neighbourhood queries.
For each layer of the calorimeter, a fixed-grid spatial index is constructed by registering the indices of the 2D points into the square bins of the grid according to their 2D coordinates. To query the neighbourhood N_d(i) of a point i, i.e. the set of points within a distance d of i, CLUE considers the square window of bins Ω_d(i) around i, where Ω_d(i) is guaranteed to include all neighbours within a distance d from the point i. Namely, N_d(i) ⊆ Ω_d(i).
Without any spatial index, the query of N d (i) requires a sequential scan over all points. In contrast, with the grid spatial index, CLUE only needs to loop over the points in Ω d (i) to acquire N d (i). Given that d is small and the maximum granularity of points is constant, the complexity of querying N d (i) with a fixed-grid is O(1).
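A minimal sketch of such a fixed-grid index for a single layer follows; the names and layout are illustrative, not the authors' code, and the bins are assumed square with a fixed size.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of a fixed-grid spatial index: bin edges are fixed before loading
// data, so registration and queries are trivially parallelizable over points.
struct FixedGrid {
    float xMin, yMin, binSize;
    int nBinsX, nBinsY;
    std::vector<std::vector<int>> bins;  // point indices per square bin

    FixedGrid(float xMin_, float yMin_, float binSize_, int nx, int ny)
        : xMin(xMin_), yMin(yMin_), binSize(binSize_), nBinsX(nx), nBinsY(ny),
          bins(nx * ny) {}

    int binIndex(float x, float y) const {
        int bx = std::min(nBinsX - 1, std::max(0, int((x - xMin) / binSize)));
        int by = std::min(nBinsY - 1, std::max(0, int((y - yMin) / binSize)));
        return by * nBinsX + bx;
    }

    void fill(const std::vector<float>& xs, const std::vector<float>& ys) {
        for (int i = 0; i < (int)xs.size(); ++i)
            bins[binIndex(xs[i], ys[i])].push_back(i);
    }

    // N_d(i): loop only over the bins overlapped by the square window of
    // half-width d around (x, y); this superset Omega_d(i) contains N_d(i).
    std::vector<int> neighbors(const std::vector<float>& xs,
                               const std::vector<float>& ys,
                               float x, float y, float d) const {
        std::vector<int> out;
        int bxLo = std::max(0, int((x - d - xMin) / binSize));
        int bxHi = std::min(nBinsX - 1, int((x + d - xMin) / binSize));
        int byLo = std::max(0, int((y - d - yMin) / binSize));
        int byHi = std::min(nBinsY - 1, int((y + d - yMin) / binSize));
        for (int by = byLo; by <= byHi; ++by)
            for (int bx = bxLo; bx <= bxHi; ++bx)
                for (int j : bins[by * nBinsX + bx]) {
                    float dx = xs[j] - x, dy = ys[j] - y;
                    if (std::sqrt(dx * dx + dy * dy) < d) out.push_back(j);
                }
        return out;
    }
};
```

Since d is small and the grid occupancy per bin is bounded, each query touches a constant number of bins and points, giving the O(1) query cost quoted above.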

Clustering procedure of CLUE
The local density ρ of a point i in CLUE is defined as

ρ_i = Σ_{j ∈ N_dc(i)} χ(d_ij) w_j,

where w_j is the weight of point j and χ(d_ij) is a convolution kernel, which can be optimized according to the specific application. Obvious kernel options include flat, Gaussian and exponential functions.
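The density calculation can be sketched as follows; this illustration uses a brute-force neighbour search for brevity, whereas CLUE would restrict the inner loop to Ω_dc(i) via the fixed-grid index, and all names here are assumptions.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Sketch of the local-density step: rho_i is the kernel-weighted sum of the
// weights of all points within d_c of point i.
std::vector<float> localDensity(const std::vector<float>& xs,
                                const std::vector<float>& ys,
                                const std::vector<float>& ws,
                                float dc,
                                const std::function<float(float)>& chi) {
    const int n = static_cast<int>(xs.size());
    std::vector<float> rho(n, 0.f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float dij = std::hypot(xs[j] - xs[i], ys[j] - ys[i]);
            if (dij < dc) rho[i] += chi(dij) * ws[j];  // includes j == i
        }
    return rho;
}
```

Each ρ_i depends only on the input points, so this loop is n-way parallelizable with one thread per point.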
The nearest-higher of a point i and the distance δ_i to it (the separation) in CLUE are defined as

nh_i = argmin_{j : ρ_j > ρ_i} d_ij,   δ_i = d_{i,nh_i},

with δ_i set to infinity for a point with no higher-density point in its search window. Seeds and outliers are then decided by thresholds: a point with high density (ρ > ρ_c) and large separation (δ > δ_c) is promoted to a cluster seed, while a point with low density (ρ < ρ_c) and large separation (δ > δ_o) is demoted to an outlier. The lists of followers are built by registering each point which is neither a seed nor an outlier to the follower list of its nearest-higher. The cluster indices, each associating a follower with a particular seed, are passed down from the seeds through their chains of followers iteratively.
Outliers and their descendant followers are guaranteed not to receive any cluster index from the seeds, which provides noise rejection as shown in Fig. 4. The calculations of ρ and δ and the decisions on seeds and outliers all support n-way parallelization, while the expansion of clusters can be done with k-way parallelization. Pseudocode of CLUE is included in Appendix A.
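The separation step and the threshold decisions can be sketched as below. This is a serial, brute-force illustration: in CLUE the nearest-higher search is restricted to a window around each point through the spatial index, and the parameter names are assumptions.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Sketch of the separation step: for each point, find the nearest point with
// strictly higher density ("nearest-higher") and record the distance delta;
// points with no higher-density point get delta = infinity.
struct Separation { std::vector<float> delta; std::vector<int> nearestHigher; };

Separation computeSeparation(const std::vector<float>& xs,
                             const std::vector<float>& ys,
                             const std::vector<float>& rho) {
    const int n = static_cast<int>(xs.size());
    Separation s{std::vector<float>(n, std::numeric_limits<float>::infinity()),
                 std::vector<int>(n, -1)};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (rho[j] <= rho[i]) continue;  // only higher-density points
            float d = std::hypot(xs[j] - xs[i], ys[j] - ys[i]);
            if (d < s.delta[i]) { s.delta[i] = d; s.nearestHigher[i] = j; }
        }
    return s;
}

// Decision sketch: high density and large separation -> seed;
// low density and large separation -> outlier; otherwise a follower.
enum class Kind { Seed, Outlier, Follower };
Kind classify(float rho, float delta, float rhoC, float deltaC, float deltaO) {
    if (rho >= rhoC && delta > deltaC) return Kind::Seed;
    if (rho < rhoC && delta > deltaO) return Kind::Outlier;
    return Kind::Follower;
}
```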

GPU Implementation
To parallelize CLUE on the GPU, one GPU thread is assigned to each point, for a total of n threads, to construct the spatial index, calculate ρ and δ, promote (demote) seeds (outliers) and register points to the corresponding follower lists of their nearest-highers. Next, one thread is assigned to each seed, for a total of k threads, to expand clusters iteratively along the chains of followers. The block size of all kernels, which in practice does not have a significant impact on the speed performance, is set to 1024: in the test in Table 2, changing the block size from 1024 to 256 on the GPU leads to only about a 0.14 ms decrease in the sum of the kernel execution times. The details of the parallelism of each kernel are listed in Table 1.
Since the results of each CLUE step are required in the following steps, it is necessary to guarantee that all the threads are synchronized before moving to the next stage. Therefore, each CLUE step is implemented as a separate kernel. To optimize the performance of accessing the GPU global memory with coalescing, the points on all layers are stored as a single structure-of-arrays (SoA), including the information of their layer numbers, 2D coordinates and weights. Thus the points on all layers are input into the kernels in one shot.
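A CPU-side sketch of such an SoA container follows; the field names are assumptions for illustration, not the actual data structure.

```cpp
#include <cstddef>
#include <vector>

// Illustrative structure-of-arrays (SoA) layout for the points, as opposed
// to an array-of-structs: threads processing consecutive points then read
// consecutive memory addresses, which enables coalesced global-memory
// access on the GPU.
struct PointsSoA {
    std::vector<float> x, y;    // 2D coordinates on the layer
    std::vector<float> weight;  // hit weight (e.g. energy)
    std::vector<int>   layer;   // layer number, so all layers go in one shot

    void push_back(float x_, float y_, float w_, int l_) {
        x.push_back(x_); y.push_back(y_);
        weight.push_back(w_); layer.push_back(l_);
    }
    std::size_t size() const { return x.size(); }
};
```

With an array-of-structs, consecutive threads would instead read addresses separated by the struct stride, wasting memory bandwidth.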
When parallelizing CLUE on GPU, thread conflicts to access and modify the same memory address in global memory could happen in the following three cases: i) multiple points need to register to the same bin simultaneously; ii) multiple points need to register to the list of seeds simultaneously; iii) multiple points need to register as followers to the same point simultaneously.
Therefore, atomic operations are necessary to avoid the race conditions among threads in the global memory. During an atomic operation, a thread is granted with an exclusive access to read from and write to a memory location which is inaccessible to other concurrent threads until the atomic operation finishes.
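The effect of such atomic registration can be illustrated on the CPU with std::atomic, the analogue of CUDA's atomicAdd; this is a hedged sketch, not the actual kernels.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sketch of race-free registration: each thread reserves a unique slot in a
// shared list by atomically incrementing a counter, so concurrent "push"
// operations never collide. The reservation itself is serialized, which is
// the microscopic serialization discussed in the text.
std::vector<int> concurrentRegister(int nPoints, int nThreads) {
    std::vector<int> list(nPoints, -1);
    std::atomic<int> count{0};
    auto work = [&](int first, int last) {
        for (int p = first; p < last; ++p) {
            int slot = count.fetch_add(1);  // exclusive slot for this point
            list[slot] = p;
        }
    };
    std::vector<std::thread> pool;
    int chunk = (nPoints + nThreads - 1) / nThreads;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back(work, t * chunk, std::min(nPoints, (t + 1) * chunk));
    for (auto& th : pool) th.join();
    return list;  // every point exactly once, in nondeterministic order
}
```

The same pattern applies to all three conflict cases above: registering points to a bin, to the list of seeds, or to a follower list.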
This inevitably leads to some microscopic serialization among the threads in race.

Figure 3: Examples of CLUE clustering on synthetic datasets. Each sample includes 1000 2D points with the same weight, generated from given distributions and including uniform noise points. The colour of the points represents their cluster ids. Black points represent outliers detached from any cluster. The links between pairs of points illustrate the relationship between nearest-higher and follower. The red stars highlight the cluster seeds.
Clustering results and noise rejection

We demonstrate the clustering results of CLUE on a set of synthetic datasets, shown in Fig. 3. Each example has 1000 2D points and includes spatially uniform noise points. The datasets in Fig. 3 (a) and (c) are taken from the scikit-learn package [16].
In the induction principle of density-based clustering, the confidence in assigning a low-density point to a cluster is established by maintaining the continuity of the cluster. Low-density points with large separation should be deprived of association to any cluster.
CFSFDP uses a rather costly technique to detach unreliable assignments from clusters, which calculates a border region for each cluster and defines core and halo points in each cluster [5].
In contrast, CLUE achieves this using cuts on δ_o and ρ_c while expanding a cluster, as described in Section 2. The example in Fig. 4 shows how cutting at different separation values helps to demote outliers. Figure 4 (a) represents the decision plot in the ρ-δ plane.
Points with density below ρ_c = 80, shown on the left side of the vertical blue line, can be demoted to outliers if their δ is larger than a threshold. Once a point is demoted to an outlier, all its descendant followers are prevented from attaching to any cluster. While keeping ρ_c = 80 fixed, the effect of using three different values of δ_o (10, 20, 60), shown as orange dashed lines in Fig. 4 (a), has been investigated. The corresponding results are shown in Fig. 4 (b), (c) and (d), respectively.

Execution time and scaling
We tested the computational performance of CLUE using a synthetic dataset that resembles high-occupancy events in high granularity calorimeters operated at the HL-LHC. The dataset represents a calorimeter with 100 sensor layers. A fixed number of points on each layer are assigned a unit weight in such a way that the density represents circular clusters of energy whose magnitude decreases radially from the centre of the cluster according to a Gaussian distribution with the standard deviation, σ, set to 3 cm. 5% of the points represent noise, distributed uniformly over the layers. When clustering with CLUE, the bin size is set to 5 cm, comparable with the width of the clusters, and the algorithm parameters are set to d_c = 3 cm, δ_o = δ_c = 5 cm, ρ_c = 8. To test CLUE's linear scaling, the number of points on each layer is incremented from 1000 to 10000 in 10 equal steps. A total of 100 layers are input to CLUE simultaneously, which simulates the proposed CMS HGCAL design [2].

Figure 5: (Upper) Execution times of CLUE on the single-threaded CPU, the multi-threaded CPU with TBB and the GPU scale linearly with the number of input points, ranging from 10^5 to 10^6 in total. The execution time on the single-threaded CPU is shown as blue circle dots and on the multi-threaded CPU with TBB (10 threads) as blue square dots, while the time on the GPU is shown as green circle dots. The stacked bars represent the decomposition of the execution time: the green and blue narrower bars are the latency of data traffic between host memory and device memory; the wider bars represent the time of the essential CLUE steps (building tiles, calculating density, calculating separation, promoting seeds, assigning clusters); the light grey narrower bars labelled "other" are the difference between the total execution time and the sum of the major CLUE steps (and of the major CUDA API calls in the GPU case). (Lower) Compared with the single-threaded CPU, the speed-up factors of the GPU range from 48 to 112, while the speed-up factors of the multi-threaded CPU with TBB range from 1.2 to 2.0, less than the number of concurrent threads on the CPU because of the atomic pushes to the data containers discussed in Section 3. Table 2 shows the details of the decomposition of the execution time in the case of 10^4 points per layer.
Therefore the total number of points in the test ranges from 10^5 to 10^6. The linear scaling of the execution time is validated in Fig. 5. The single-threaded version of the CLUE algorithm on CPU has been implemented in C++, while the GPU version has been implemented in C with CUDA [17]. The multi-threaded version of CLUE on CPU uses the Threading Building Blocks (TBB) library [18] and has been implemented using the Abstraction Library for Parallel Kernel Acceleration (Alpaka) [19]. In Fig. 5 (upper), the scaling of CLUE is linear, consistent with the expectation.
The execution time on the single-threaded CPU, the multi-threaded CPU with TBB and the GPU increases linearly with the total number of points. The stacked bars represent the decomposition of the execution time. In the decomposition, the latency of data transfer between host and device, shown as blue and green narrower bars, is unique to the GPU implementation, while the five CLUE steps are common to all three implementations.
Compared with the single-threaded CPU, when building the spatial index and deciding seeds, shown as red and pink bars, the multi-threaded CPU using TBB does not give a notable speed-up, due to the implementation of atomic operations in Alpaka [19] discussed in Section 3, while the GPU prominently outperforms it thanks to its larger parallelization scale. For the GPU, the seed-promotion kernel, in which serialization exists due to the atomic appending of points to the list of seeds, does not affect the total execution time significantly compared with the other sub-processes. In the two most computing-intense steps, calculating density and separation, there are no thread conflicts or unavoidable atomic operations; therefore, both the multi-threaded CPU using TBB and the GPU provide a significant speed-up. The details of the decomposition of the execution time in the case of 10^4 points per layer are listed in Table 2. As shown in Table 2, the speed-up factors of the multi-threaded CPU using TBB fall below 1 in the sub-steps of building the spatial index, promoting seeds and registering followers, where atomic operations occur and bottleneck the overall speed-up factor.

Conclusion
The clustering algorithm is an important part of the shower reconstruction of high granularity calorimeters, identifying hot regions of energy deposits. It is required to be computationally linear in the data scale n, independent of prior knowledge of the number of clusters k, and conveniently parallelizable when n > k ≫ m ≡ n/k in 2D. However, most of the well-known algorithms do not simultaneously support linear scalability and easy parallelization. CLUE is proposed to efficiently perform clustering tasks in low-dimension spaces with n > k ≫ m, including (and beyond) the applications in high granularity calorimeters. The clustering time scales linearly with the number of input hits in the range of multiplicity that is relevant for, e.g., the high granularity calorimeter of the CMS experiment at CERN. We evaluated the performance of CLUE on synthetic data and demonstrated its capability of handling non-spherical cluster shapes with adjustable noise rejection. Furthermore, the studies suggest that CLUE on GPU outperforms the single-threaded CPU by more than an order of magnitude for data scales ranging from n = 10^5 to 10^6.