The High-Luminosity upgrade of the Large Hadron Collider (LHC) will see the accelerator reach an instantaneous luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will far exceed the expected increase in processing power of conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centers are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we describe a heterogeneous implementation of the pixel track and vertex reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated in the CMS reconstruction software, CMSSW. The speed-up achieved by leveraging GPUs allows more complex algorithms to be executed, obtaining better physics output and a higher throughput.
The High-Luminosity upgrade of the LHC (HL-LHC) will substantially increase the instantaneous luminosity delivered to the experiments. When the HL-LHC is operational, it will reach a luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. To fully exploit the higher luminosity, the CMS experiment will increase its full readout rate from 100 to 750 kHz.
This far exceeds the expected increase in processing power of conventional CPUs, demanding alternative solutions.
A promising approach to mitigate this problem is heterogeneous computing: combining conventional CPUs with accelerators such as GPUs, and assigning each task to the most suitable architecture.
In order to investigate the feasibility of a heterogeneous approach in a typical High Energy Physics experiment, the authors developed a novel pixel track and vertex reconstruction chain within the official CMS reconstruction software, CMSSW.
The results shown in this article are based on the Open Data and the data formats released by the CMS collaboration under its open access policy.
The development of a heterogeneous reconstruction faces several fundamental challenges: a different programming paradigm must be adopted; the experimental reconstruction framework and its scheduling must accommodate heterogeneous processing; the heterogeneous algorithms should achieve the same or better physics performance and processing throughput as their CPU counterparts; and it must be possible to run and validate the code on conventional machines, without any dedicated resources.
The remainder of this article describes the CMSSW framework, the parallel track and vertex reconstruction algorithms, and their physics and computing performance.
The backbone of the CMS data processing software, CMSSW, is a rather generic framework that processes independent chunks of data called events.
The data are processed by modules that communicate via a type-safe C++ container called the event (or the luminosity block or run for the larger units). An analyzer can only read data products, while a producer can read and write new data products, and a filter can, in addition, decide whether the processing of a given event should be stopped. Data products become immutable (or, more precisely, const) once they have been stored in the event, so that they can be read safely by other modules.
During the Long Shutdown 1 and Run 2 of the LHC, the CMSSW framework gained multi-threading capabilities, allowing several events to be processed concurrently.
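To illustrate the producer pattern described above, the following is a minimal, framework-agnostic C++ sketch; the `Event` and `ExampleProducer` classes and the product labels are invented for illustration and do not reproduce the actual CMSSW interfaces.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, much simplified stand-in for the framework's per-event data
// container; the real CMSSW classes differ in names and details.
class Event {
public:
  const std::vector<float>& get(const std::string& label) const { return *products_.at(label); }
  void put(const std::string& label, std::unique_ptr<std::vector<float>> p) {
    products_[label] = std::move(p);  // from now on the product is only read, never modified
  }
private:
  std::map<std::string, std::unique_ptr<std::vector<float>>> products_;
};

// A "producer": reads existing data products and adds new ones to the event.
class ExampleProducer {
public:
  void produce(Event& event) const {
    const auto& input = event.get("rawCharges");          // read-only input product
    auto output = std::make_unique<std::vector<float>>();
    output->reserve(input.size());
    for (float q : input) output->push_back(1.02f * q);   // placeholder "calibration"
    event.put("calibratedCharges", std::move(output));    // publish the new, immutable product
  }
};

int main() {
  Event event;
  event.put("rawCharges", std::make_unique<std::vector<float>>(std::vector<float>{1.f, 2.f, 3.f}));
  ExampleProducer{}.produce(event);
  return event.get("calibratedCharges").size() == 3 ? 0 : 1;
}
```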
The reconstruction of the trajectories of charged particles recorded in the silicon pixel and silicon strip detectors is one of the most important components in the interpretation of the detector information of a proton-proton collision. It provides a precise measurement of the momentum of charged particles (muons, electrons and charged hadrons), the identification of the interaction points of the proton-proton collisions (primary vertices) and of the decay points of particles with significant lifetimes (secondary vertices).
Precise track reconstruction becomes more challenging at higher pileup, as the number of vertices and the number of tracks increase, making the pattern recognition and the classification of hits produced by the same charged particle a harder combinatorial problem. To mitigate the complexity of the problem, the authors developed parallel algorithms that perform the track reconstruction on GPUs, starting from the "raw data" of the CMS Pixel detector, as described later in this section. The steps performed during the track and vertex reconstruction are illustrated in the figure below.
Steps involved in the track and vertex reconstruction, starting from the pixel "raw data".
The data structures (structure of arrays, SoA) used by the parallel algorithms are optimized for coalesced memory access on the GPU and differ substantially from the ones used by the standard reconstruction in CMS (legacy data formats). The data transfer between CPU and GPU and the conversion between legacy and optimized formats are very time-consuming operations. For this reason the authors decided to design the reconstruction chain so that the intermediate steps stay on the GPU, transferring and converting the results only when explicitly requested by a consumer.
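As a schematic illustration of the difference between a legacy array-of-structures layout and an SoA layout (the field names and sizes below are placeholders, not the actual CMS data formats), consecutive GPU threads reading the x coordinate from an SoA access adjacent memory addresses, so the loads can be coalesced into few memory transactions:

```cpp
// Legacy-style array of structures (AoS): the fields of one hit are contiguous,
// but the x coordinates of consecutive hits are strided in memory.
struct HitAoS {
  float x, y, z;
  int detectorId;
};
// HitAoS hitsAoS[N];  -> thread i reads hitsAoS[i].x with stride sizeof(HitAoS)

// Structure of arrays (SoA): each field lives in its own contiguous array, so
// thread i reading x[i] and thread i+1 reading x[i+1] touch adjacent addresses.
constexpr int kMaxHits = 1 << 18;
struct HitsSoA {
  float x[kMaxHits];
  float y[kMaxHits];
  float z[kMaxHits];
  int detectorId[kMaxHits];
  int nHits;
};

// Example CUDA kernel operating on the SoA: one thread per hit, coalesced access.
__global__ void shiftX(HitsSoA* hits, float dx) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < hits->nHits) hits->x[i] += dx;
}
```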
The CMS "Phase 1" Pixel detector consists of four concentric barrel layers and three forward disks on each end, providing four-hit coverage for charged particles within its acceptance.
Longitudinal sketch of the pixel detector geometry.
The analog signals generated by charged particles traversing the pixel detectors are digitized by the read-out electronics and packed to minimize the data rate. The first step of the track reconstruction is thus the decoding of this raw data.
During this phase, the digitized information is unpacked and interpreted to create digis, i.e. the position and charge of every pixel above threshold.
Neighboring digis are grouped together to form clusters.
Finally, the shape of the clusters and the charge of the digis are used to determine the position of each hit and its uncertainty.
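A minimal CPU sketch of the clustering idea described above, grouping digis that are adjacent in row or column with a simple flood fill; this is only an illustration with simplified data structures, not the actual parallel implementation used in CMSSW.

```cpp
#include <cstdlib>
#include <vector>

// Simplified digi: one pixel above threshold on a given module.
struct Digi {
  int row, col;
  float charge;
};

// Group the digis of one module into clusters of neighboring pixels
// (8-connectivity) with a flood fill. Returns, for each digi, the id
// of the cluster it belongs to.
std::vector<int> clusterize(const std::vector<Digi>& digis) {
  std::vector<int> clusterId(digis.size(), -1);
  int nClusters = 0;
  for (size_t seed = 0; seed < digis.size(); ++seed) {
    if (clusterId[seed] >= 0) continue;          // already assigned to a cluster
    clusterId[seed] = nClusters;
    std::vector<size_t> stack{seed};
    while (!stack.empty()) {                     // grow the cluster from the seed
      size_t i = stack.back();
      stack.pop_back();
      for (size_t j = 0; j < digis.size(); ++j) {
        if (clusterId[j] >= 0) continue;
        if (std::abs(digis[i].row - digis[j].row) <= 1 &&
            std::abs(digis[i].col - digis[j].col) <= 1) {
          clusterId[j] = nClusters;              // neighboring digi: same cluster
          stack.push_back(j);
        }
      }
    }
    ++nClusters;
  }
  return clusterId;
}
```

The position of the resulting hit can then be estimated, for example, from the charge-weighted average of the positions of the pixels in the cluster.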
Hits are then linked together to form track candidates (ntuplets) through the following steps:
- creation of doublets;
- connection of doublets;
- identification of root doublets;
- Depth-First Search (DFS) from each root doublet.
The doublets are created by connecting hits belonging to adjacent pairs of pixel detector layers, illustrated by the solid arrows in the figure below.
Combinations of pixel layers that can create doublets directly (solid arrow), or by skipping a layer to account for geometrical acceptance (dashed arrow).
Various selection criteria are applied to reduce the combinatorics. The criteria with the strongest impact on timing and physics performance are summarized in the table below.
Windows opened in the transverse and longitudinal planes. The outer hit is colored in red, the inner hits in blue.
Hits within each layer are arranged in a tiled data structure along the azimuthal (ϕ) direction for optimal performance. The search for compatible hit pairs is performed in parallel by different threads, each starting from a different outer hit. The pairs of inner and outer hits that satisfy the alignment criteria and have compatible cluster sizes along the beam (z) direction are retained as doublets.
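The following sketch illustrates the idea of the ϕ-binned doublet search described above, using the 128-bin granularity quoted in the table below; the hit structure, the cut values and the serial loop over outer hits are simplifications for readability (in the GPU implementation each outer hit is handled by a different thread), not the actual code.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Simplified hit: global position plus precomputed azimuthal angle.
struct Hit {
  float x, y, z, phi;
};

constexpr int kPhiBins = 128;                       // granularity of the PhiHist binning
constexpr float kPi = 3.14159265f;

inline int phiBin(float phi) {
  // map phi in [-pi, pi] to a bin index in [0, kPhiBins)
  int b = static_cast<int>((phi + kPi) / (2.f * kPi) * kPhiBins);
  return ((b % kPhiBins) + kPhiBins) % kPhiBins;
}

// Build doublets between an inner and an outer layer: inner hits are first
// histogrammed in phi, then each outer hit only inspects neighboring bins.
std::vector<std::pair<int, int>> makeDoublets(const std::vector<Hit>& inner,
                                              const std::vector<Hit>& outer,
                                              float maxDeltaPhi, float maxDeltaZ) {
  std::vector<std::vector<int>> bins(kPhiBins);
  for (int i = 0; i < (int)inner.size(); ++i) bins[phiBin(inner[i].phi)].push_back(i);

  std::vector<std::pair<int, int>> doublets;        // (inner index, outer index)
  for (int o = 0; o < (int)outer.size(); ++o) {     // GPU version: one thread per outer hit
    int b = phiBin(outer[o].phi);
    for (int db = -1; db <= 1; ++db) {              // only the outer hit's bin and its neighbors
      for (int i : bins[(b + db + kPhiBins) % kPhiBins]) {
        float dphi = std::remainder(inner[i].phi - outer[o].phi, 2.f * kPi);
        if (std::fabs(dphi) > maxDeltaPhi) continue;                   // PhiW-like window
        if (std::fabs(inner[i].z - outer[o].z) > maxDeltaZ) continue;  // ZW-like window
        doublets.emplace_back(i, o);
      }
    }
  }
  return doublets;
}
```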
Description of the cuts applied during the reconstruction of doublets.
Cut | Description |
---|---|
PhiHist | Binned phi window between inner and outer hit using a 128 bin histogram |
PhiW | PhiHist + tuned phi window between inner and outer hit |
ZW | Window in z for the inner hit |
ZIP | Cut on the impact parameter along the beam axis |
PT | Cut on the curvature assuming zero transverse impact parameter (TIP), equivalent to a cut on the TIP for high-pT tracks |
CSZ | Cut on the cluster size compatibility |
Average number of doublets and tracks, and fraction of doublets not connected, for the different combinations of cuts.

Cuts | Doublets | | Tracks | Not connected
---|---|---|---|---
PhiHist | 1,268,193 | 23,254 | 1,256 | 0.966 |
PhiHist + ZW | 866,316 | 18,301 | 1,266 | 0.966 |
PhiHist + ZW + ZIP | 269,410 | 11,235 | 1,265 | 0.926 |
PhiW + ZW | 594,739 | 13,403 | 1,212 | 0.958 |
PhiW + ZW + ZIP | 185,642 | 8,327 | 1,214 | 0.919 |
PhiW + ZW + ZIP + CSZ | 129,307 | 6,060 | 1,087 | 0.915 |
PhiW + ZW + ZIP + PT | 164,567 | 7,273 | 1,141 | 0.921 |
PhiW + ZW + ZIP + PT + CSZ | 115,248 | 5,270 | 999 | 0.918 |
Time spent in the creation of doublets and in the three components of the ntuplet building (connection, DFS and cleaning), for the different combinations of cuts.

Cuts | Doublets | Connect | DFS | Clean
---|---|---|---|---
PhiHist | 6,123 | 15,127 | 1,690 | 1,976 |
PhiHist + ZW | 950 | 6,582 | 778 | 538 |
PhiHist + ZW + ZIP | 310 | 488 | 354 | 237 |
PhiW + ZW | 552 | 2,995 | 549 | 377 |
PhiW + ZW + ZIP | 271 | 265 | 274 | 183 |
PhiW + ZW + ZIP + CSZ | 291 | 187 | 216 | 154 |
PhiW + ZW + ZIP + PT | 259 | 156 | 246 | 125 |
PhiW + ZW + ZIP + PT + CSZ | 280 | 108 | 192 | 114 |
The doublets that share a common hit are tested for compatibility to form a triplet. The compatibility requires that the three hits are aligned, within tolerances, both in the r-z plane and in the transverse plane.
All compatible doublets form a directed acyclic graph. All the doublets whose inner hit lies on one of the innermost layers are identified as root doublets; a Depth-First Search starting from each root doublet collects the track candidates (ntuplets).
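The following is a schematic sketch of this traversal: doublets are the nodes of the graph, edges connect compatible doublets sharing a hit, and a Depth-First Search from each root doublet enumerates the candidate ntuplets. The data structures and the recursive formulation are simplifications chosen for readability, not the actual CUDA implementation.

```cpp
#include <vector>

struct Doublet {
  int innerHit, outerHit;
  std::vector<int> next;   // indices of compatible doublets whose inner hit is this doublet's outer hit
  bool isRoot = false;     // true if the inner hit lies on one of the innermost layers
};

// Depth-first search from a root doublet; every path with at least
// minDoublets doublets (e.g. 3 for quadruplets) is stored as an ntuplet.
void dfs(const std::vector<Doublet>& cells, int current, std::vector<int>& path,
         int minDoublets, std::vector<std::vector<int>>& ntuplets) {
  path.push_back(current);
  if (cells[current].next.empty()) {
    if ((int)path.size() >= minDoublets) ntuplets.push_back(path);  // reached a leaf: store the candidate
  } else {
    for (int n : cells[current].next) dfs(cells, n, path, minDoublets, ntuplets);
  }
  path.pop_back();
}

std::vector<std::vector<int>> findNtuplets(const std::vector<Doublet>& cells, int minDoublets) {
  std::vector<std::vector<int>> ntuplets;
  std::vector<int> path;
  for (int i = 0; i < (int)cells.size(); ++i)
    if (cells[i].isRoot) dfs(cells, i, path, minDoublets, ntuplets);  // one DFS per root doublet
  return ntuplets;
}
```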
Full hit coverage in the instrumented pseudorapidity range is implemented in modern pixel detectors via partially overlapping sensitive layers. This also mitigates the impact of possible localized hit inefficiencies. With this design, though, requiring at most one hit per layer can lead to several nearly identical track candidates that share most of their hits.
A typical
Furthermore, among all the tracks that share a hit-doublet only the ones with the largest number of hits are retained.
The "Phase 1" upgraded pixel detector has one more barrel layer and one additional disk on each side with respect to the previous detector. The possibility of using four (or more) hits from distinct layers opens new opportunities for the pixel track fitting: it becomes possible not only to obtain a better statistical estimate of the track parameters, but also to take effects such as multiple scattering into account in the fit.
The pixel track reconstruction developed by the authors includes a multiple-scattering-aware fit, the Broken Line fit, which consists of three steps:
- a fast pre-fit in the transverse plane gives an estimate of the track momentum, used to compute the multiple scattering contribution;
- a line fit in the s-z plane;
- a circle fit in the transverse plane.
The fits are performed in parallel over all the ntuplets found in the previous step.
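As an illustration of the line-fit step in the s-z plane, the following is a standard weighted least-squares straight-line fit; it is a generic textbook formulation, not the actual Broken Line implementation, which in addition propagates the multiple-scattering correlations between the measurements.

```cpp
#include <cmath>
#include <vector>

struct LineFitResult {
  float slope, intercept;       // z = slope * s + intercept
  float slopeErr, interceptErr;
};

// Weighted least-squares fit of z versus s, with per-point uncertainties sigmaZ.
LineFitResult fitLineSZ(const std::vector<float>& s, const std::vector<float>& z,
                        const std::vector<float>& sigmaZ) {
  float sw = 0, sws = 0, swz = 0, swss = 0, swsz = 0;
  for (size_t i = 0; i < s.size(); ++i) {
    float w = 1.f / (sigmaZ[i] * sigmaZ[i]);     // inverse-variance weights
    sw += w;  sws += w * s[i];  swz += w * z[i];
    swss += w * s[i] * s[i];  swsz += w * s[i] * z[i];
  }
  float det = sw * swss - sws * sws;             // determinant of the 2x2 normal equations
  LineFitResult r;
  r.slope = (sw * swsz - sws * swz) / det;
  r.intercept = (swss * swz - sws * swsz) / det;
  r.slopeErr = std::sqrt(sw / det);
  r.interceptErr = std::sqrt(swss / det);
  return r;
}
```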
Tracks that share a hit-doublet are considered "ambiguous" and only the one with the best fit quality (χ²) is retained.
The fitted pixel tracks are subsequently used to form pixel vertices. Vertices are searched for as clusters in the z coordinates of the tracks at their point of closest approach to the beam line.
This algorithm is easily parallelizable and, in one dimension as in this case, requires no iterations. As shown below, this algorithm is definitely more efficient than, and has a resolution comparable to, the "gap" algorithm used so far at the CMS High Level Trigger.
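A minimal one-dimensional sketch of a non-iterative, density-based clustering of the track z coordinates, shown only to illustrate how such a clustering can be expressed with independent per-track operations; the actual algorithm and its parameters may differ.

```cpp
#include <cmath>
#include <vector>

// Assign each track (represented by its z at the point of closest approach to
// the beam line) to a vertex candidate: each track is linked to its nearest
// "denser" neighbour within eps, and tracks with no denser neighbour become seeds.
std::vector<int> clusterVerticesZ(const std::vector<float>& z, float eps) {
  const int n = (int)z.size();
  std::vector<int> density(n, 0), parent(n);
  for (int i = 0; i < n; ++i)                       // local density: neighbours within eps
    for (int j = 0; j < n; ++j)
      if (std::fabs(z[i] - z[j]) < eps) ++density[i];

  for (int i = 0; i < n; ++i) {                     // link to the nearest denser neighbour
    parent[i] = i;                                  // default: the track is its own seed
    float best = eps;
    for (int j = 0; j < n; ++j) {
      float d = std::fabs(z[i] - z[j]);
      bool denser = density[j] > density[i] || (density[j] == density[i] && j < i);
      if (denser && d < best) { best = d; parent[i] = j; }
    }
  }

  std::vector<int> clusterId(n);
  for (int i = 0; i < n; ++i) {                     // follow the links up to the seed
    int j = i;
    while (parent[j] != j) j = parent[j];
    clusterId[i] = j;                               // use the seed index as the vertex id
  }
  return clusterId;
}
```

Both the density and the link computations are independent per track, which is why the clustering maps naturally onto one GPU thread per track and needs no iterative refinement.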
The position of each vertex along the beam line and its uncertainty are computed from the weighted average of the z coordinates of its associated tracks.
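Assuming, for illustration, inverse-variance weights (the exact weighting scheme is not spelled out here), the vertex position and its uncertainty along the beam line take the usual weighted-average form:

```latex
z_{\mathrm{vtx}} = \frac{\sum_i z_i / \sigma_{z_i}^2}{\sum_i 1 / \sigma_{z_i}^2},
\qquad
\sigma_{z_{\mathrm{vtx}}} = \left( \sum_i \frac{1}{\sigma_{z_i}^2} \right)^{-1/2}
```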
Finally, the vertices are sorted by the sum of the squared transverse momenta (pT²) of their associated tracks.
In this section the performance of the Patatrack reconstruction is evaluated and compared to the track reconstruction based on pixel quadruplets that CMS used for data taking in 2018, referred to in the following as CMS-2018.
The performance studies have been performed using 20,000 simulated tt̄ events.
The efficiency is defined as the fraction of simulated tracks that are associated with at least one reconstructed track.
A reconstructed pixel track is associated with a simulated track if all the hits that it contains come from that simulated track. The efficiency is computed only for tracks coming from the hard interaction and not for those from the pileup. The CPU and GPU versions of the Patatrack workflow produce the same physics results, as shown in the figure below.
Comparison of the pixel track reconstruction efficiency of the CPU and GPU versions of the Patatrack pixel reconstruction for simulated tt̄ events.
The efficiency is noticeably improved by the Patatrack quadruplets workflow with respect to CMS-2018, as shown in the figure below.
Pixel track reconstruction efficiency for simulated tt̄ events.
The fake rate is defined as the fraction of all the reconstructed tracks coming from a reconstructed primary vertex that are not uniquely associated with a simulated track. In the case of a fake track, the set of hits used to reconstruct the track does not come from a single simulated track. The fake rate of the different workflows is shown in the figure below.
Pixel track reconstruction fake rate for simulated tt̄ events.
If one simulated track is matched to more than one reconstructed track, the latter are defined as duplicates. The introduction of the cleaning steps described above reduces the duplicate rate; the comparison among the workflows is shown in the figure below.
Pixel track reconstruction duplicate rate for simulated tt̄ events.
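To make the three definitions above concrete, the following sketch computes efficiency, fake rate and duplicate rate from a hypothetical association between reconstructed and simulated tracks; the matching criterion and the selections on the simulated tracks described in the text are assumed to have been applied already.

```cpp
#include <map>
#include <vector>

struct TrackingMetrics {
  double efficiency;     // matched simulated tracks / all (selected) simulated tracks
  double fakeRate;       // reconstructed tracks without a simulated match / all reconstructed tracks
  double duplicateRate;  // extra reconstructed tracks matched to an already-matched simulated track
};

// recoToSim[i] is the index of the simulated track matched to reconstructed track i,
// or -1 if the reconstructed track is fake.
TrackingMetrics computeMetrics(const std::vector<int>& recoToSim, int nSimTracks) {
  std::map<int, int> matchesPerSim;
  int nFake = 0, nDuplicate = 0;
  for (int sim : recoToSim) {
    if (sim < 0) { ++nFake; continue; }
    if (++matchesPerSim[sim] > 1) ++nDuplicate;   // second, third, ... match to the same simulated track
  }
  TrackingMetrics m;
  m.efficiency = nSimTracks > 0 ? double(matchesPerSim.size()) / nSimTracks : 0.;
  m.fakeRate = recoToSim.empty() ? 0. : double(nFake) / recoToSim.size();
  m.duplicateRate = recoToSim.empty() ? 0. : double(nDuplicate) / recoToSim.size();
  return m;
}
```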
For historical reasons the CMS-2018 pixel reconstruction does not perform a fit on the
The resolution of the estimation of the track parameters by the different workflows is shown in the figures below.
Pixel track pT resolution for simulated tt̄ events.
Pixel track transverse impact parameter resolution for simulated tt̄ events.
The CMS-2018 pixel tracking behaves better in the longitudinal plane than it does in the transverse plane. However, the Broken Line fit still improves the estimate of the longitudinal impact parameter, as shown in the figure below.
Pixel track longitudinal impact parameter resolution for simulated tt̄ events.
The number of reconstructed vertices, together with the capability to separate two close-by vertices, has been measured to estimate the performance of the vertexing algorithm. The latter capability is quantified by the vertex merge rate, i.e. the probability of reconstructing two different simulated vertices as a single one.
Pixel vertex reconstruction efficiency and merge rate for simulated tt̄ events.
The hardware and software configuration used to carry out the computing performance measurements is:
- a dual-socket Intel Xeon Gold 6130;
- a single NVIDIA T4 GPU;
- NVIDIA CUDA 11 with the Multi-Process Service (MPS);
- CMSSW 11_1_2_Patatrack.
A CMSSW reconstruction sequence that runs only the pixel reconstruction modules described in the previous sections has been used for the benchmark.
In a data streaming application the measurement of the throughput, i.e. the number of reconstructed events per unit time, is a more representative metric than the measurement of the latency. The benchmark runs eight independent CMSSW jobs, each reconstructing eight events in parallel with eight CPU threads. The throughput of the CMS-2018 reconstruction has been compared to that of the Patatrack quadruplets and triplets workflows. The test includes the GPU and the CPU versions of the Patatrack workflows. The Patatrack workflows run with three different configurations:
- no copy: the results are left in the GPU memory (or, for the CPU version, in the SoA format), without being copied or converted;
- copy, no conversion: the results are copied from the GPU to the host memory, but kept in the SoA format;
- conversion: the results are copied to the host memory and converted to the legacy CMS data formats.
These configurations are useful to understand the impact of optimizing a potential consumer of the GPU results, either so that it runs on the GPU within the same reconstruction sequence or so that it can consume the GPU-friendly data structures, with respect to interfacing the Patatrack workflows to the existing framework without any further optimization.
The results of the benchmark are shown in the table below.
Throughput of the Patatrack triplets and quadruplets workflows when executed on GPU and CPU, compared to the CMS-2018 reconstruction. The benchmark is configured to reconstruct 64 events in parallel. Three different configurations have been compared: no copy, copy without conversion, and conversion to the legacy data formats, as described in the text.
Throughput in events/s; the numbers in parentheses are the speed-up relative to CMS-2018.

Configuration | Triplets CPU | Triplets GPU | Quadruplets CPU | Quadruplets GPU | CMS-2018
---|---|---|---|---|---
No copy | 611 (1.28) | 870 (1.83) | 892 (1.87) | 1,386 (2.91) | 476 (1.00) |
Copy, no conv | — | 867 (1.82) | — | 1,372 (2.88) | — |
Conversion | 585 (1.23) | 861 (1.81) | 855 (1.80) | 1,352 (2.84) | — |
The future runs of the Large Hadron Collider (LHC) at CERN will pose significant challenges to the event reconstruction software, due to the increase in both event rate and complexity. For track reconstruction algorithms, the number of combinations that have to be tested does not scale linearly with the number of simultaneous proton collisions.
The work described in this article presents innovative ways to solve the problem of tracking in a pixel detector such as the CMS one, by making use of heterogeneous computing systems in a production-like data-taking environment, while being integrated in the CMS software framework, CMSSW. The assessment of the physics and timing performance of the Patatrack reconstruction demonstrated that it can improve the physics performance while being significantly faster than the existing implementation. The possibility of configuring the Patatrack workflow to run on CPUs, or to transfer and convert its results to the legacy CMS data formats, allows the workflow to be run and validated on conventional machines, without any dedicated resources.
This work lays the foundations for the development of heterogeneous algorithms in HEP, both from the algorithmic and from the framework-scheduling points of view. Other parts of the reconstruction, e.g. the calorimeters or Particle Flow, will be able to benefit from a similar redesign of algorithms and data structures in order to run efficiently on GPUs.
The ability to run on other accelerators with performance-portable code is also being explored, to ease the maintainability and testability of a single source.
The source code used to perform the studies in this article can be found at
FP, MR, and VI contributed to the development of the algorithms. AB and MK contributed to the framework development and to the integration of the reconstruction in the CMS software.
This manuscript has been partially authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank our colleagues of the CMS collaboration for publishing high quality simulated data under the open access policy. We also would like to thank the Patatrack students and alumni R. Ribatti and G. Tomaselli for their hard work and dedication. We thank the CERN openlab for providing a platform for discussion on heterogeneous computing and for facilitating knowledge transfer and support between Patatrack and industrial partners.