Fitting a collider in a quantum computer: tackling the challenges of quantum machine learning for big datasets

Current quantum systems have significant limitations affecting the processing of large datasets with high dimensionality, typical of high energy physics. In the present paper, feature and data prototype selection techniques were studied to tackle this challenge. A grid search was performed and quantum machine learning models were trained and benchmarked against classical shallow machine learning methods, trained on both the reduced and the complete datasets. The performance of the quantum algorithms was found to be comparable to the classical ones, even when using large datasets. Sequential Backward Selection and Principal Component Analysis techniques were used for feature selection and, while the former can produce better quantum machine learning models in specific cases, it is more unstable. Additionally, we show that such variability in the results is caused by the use of discrete variables, highlighting the suitability of Principal Component Analysis transformed data for quantum machine learning applications in the high energy physics context.


Introduction
The Standard Model of Particle Physics (SM) provides a remarkable description of the fundamental constituents of matter and their interactions, being in excellent agreement with the collider data accumulated so far. Nonetheless, there are still important open questions, unaddressed by the SM, such as gravity, dark matter, dark energy, or the matter-antimatter asymmetry in the universe [1], motivating a comprehensive search program for new physics phenomena beyond the SM (BSM) at the Large Hadron Collider (LHC) at CERN.
The search for BSM phenomena at colliders poses specific challenges in data processing and analysis, given the extremely large datasets involved and the low signal to background ratios expected. In this context, the analysis of the collision data obtained by the LHC experiments often relies on machine learning (ML), a field in computer science that can harness large amounts of data to train generalizable algorithms for a variety of applications [2,3], such as classification tasks. These techniques have shown an outstanding ability to find correlations in high-dimensional parameter spaces to discriminate between potential signal and background processes. They are known to scale with data, and usually rely on a large number of learnable parameters to achieve their remarkable performance.
In order to train these large models, classical machine learning (CML) takes advantage of hardware accelerators, such as graphics processing units (GPUs), for efficient, parallel, and fast matrix multiplications. On the other hand, a new class of hardware is becoming available, with the advent of noisy intermediate-scale quantum (NISQ) computing devices. This accelerated the development of new quantum algorithms targeted at exploiting the capacity and feasibility of this new technology for ML applications.
Quantum machine learning (QML) is an emerging research field aiming to use quantum circuits to tackle ML tasks. One of the motivations for using this new technology in high energy physics (HEP) relates to the intrinsic properties of quantum computations, namely representing the data in a Hilbert space where the data can be in a superposition of states or in entangled states. This can allow additional information to be explored in data analysis and, eventually, contribute to a better classification of HEP events, in particular in the context of the search for BSM phenomena. Recently, this new technology has been applied to various HEP problems [4], namely event reconstruction [5][6][7][8][9][10][11], classification tasks [12][13][14][15][16][17][18][19][20], data generation [21][22][23][24][25] and anomaly detection problems [26][27][28][29].
Despite the promising potential of quantum computation, NISQ processors have important limitations, such as the qubit quality (i.e. the accuracy with which it is possible to execute quantum gates), the qubit lifetime and the limited depth of quantum circuits, since for large circuits the noise overwhelms the signal [30,31]. This necessarily limits the complexity of the circuits and the size of the datasets used to train them.
In this paper we perform a systematic comparison of the performance of QML and shallow CML algorithms in HEP. The choice to focus on shallow methods rather than state-of-the-art architectures based on deep neural networks is intended to provide a fair comparison between methodologies, since neural networks are known to require large datasets (both in terms of sample size and dimension) to achieve good performance, something that is not feasible with current quantum computers. By choosing CML algorithms suited for smaller datasets, we add to the ongoing discussion regarding potential advantages of quantum computing by comparing QML and CML on the same footing.
The use of QML algorithms in this context is studied by targeting a common binary classification task in HEP: classifying a BSM signal against SM background. A benchmark BSM signal leading to the Zt final state is considered, in events with multiple leptons and b-tagged jets, which can be used to achieve a reasonable signal to background ratio. Variational quantum classifiers (VQC) are trained and optimized via a grid search. The use of reduced data is explored, considering both the number of features and the number of events, via different strategies: ranking of features, data transformations aiming for a richer reduced set of features, use of random samples, and choice of representative data samples.

Quantum Machine Learning
The QML algorithms are implemented using a quantum circuit, i.e. a collection of quantum gates applied to an n-qubit quantum state, followed by a measurement (or multiple measurements) that represents the output of the circuit. In order to implement a learning algorithm, the quantum circuit can be given tunable parameters, which are learned by evaluating the measurement outcome with a loss function.
QML is effectively an extension of CML techniques to the Hilbert space, where instead of representing data as vectors in a high-dimensional real space, we encode it in state vectors of a Hilbert space. A QML algorithm, such as a quantum neural network, can be implemented using the quantum equivalent of a perceptron, one of the building blocks of CML. A problem arises from the realization that the activation functions used in CML cannot be expressed using a linear operation, which is inherently required by the quantum evolution of a state. Ideas have been proposed to imitate an activation function in the quantum space [32,33] but, in the current paper, only variational quantum classifiers [34,35] are used for binary classification.
A VQC is a parameterized quantum circuit, a circuit type containing adjustable gates with tunable parameters. The gates form a universal set of quantum gates; in the current study, rotation gates [$R_X(w)$, $R_Y(w)$, $R_Z(w)$] and CNOT gates are used.
The VQC pipeline considered has the following components:
• Data embedding: the numerical vector $X$ representing the classical information is converted to the quantum space with the preparation of an initial quantum state, $|\psi_X\rangle$, which represents a HEP event.
• Model circuit: a unitary transformation $U(w)$, parameterized by a set of free parameters $w$, is applied to the initial quantum state $|\psi_X\rangle$. This produces the final state $|\psi'_X\rangle = U(w)|\psi_X\rangle$.
• Measurement: a measurement of an observable is performed on one of the qubits of the state $|\psi'_X\rangle$, which gives the prediction of the model for the task at hand.
The training of a VQC aims to find the best set of parameters $w$ to match the event labels to the prediction.
Throughout this work, the PennyLane package [36] was used as a basis for the hybrid quantum-classical machine learning applications. Leveraging PennyLane's default.qubit quantum simulator, a straightforward tool for quantum circuit simulations, we trained and assessed the performance of various QML algorithms. Subsequently, the performance of the algorithms trained on IBM's quantum computers was gauged by integrating PennyLane with IBM's quantum computing framework, Qiskit [37].

Data Embedding
Before passing the data through the VQC, the preparation of the initial quantum state $|\psi_X\rangle$ is required. This is called data embedding, and there are a number of proposals to perform this step [38]. Among the different possible embeddings, amplitude embedding was tested against angle embedding. The preliminary results have shown that angle embedding leads to a better performance than amplitude embedding, as previously reported in a different context [17]. In this paper angle embedding was, therefore, the adopted choice. Further studies on possible embeddings are left for future work.
For an N-dimensional vector of classical information, $X = (x_1, x_2, ..., x_N)$, the state entering the VQC is defined via a state preparation circuit applied to the initial state $|0\rangle^{\otimes N}$. The information contained in $X$ is embedded as angles: these are the values used in rotation gates applied to each qubit, thus requiring N qubits for embedding N features from the original dataset.
In the current study, the embedding is done using rotations around the x-axis on the Bloch sphere, thus defining the quantum state embedded with the classical information as:
$$ |\psi_X\rangle = \bigotimes_{i=1}^{N} R_X(x_i)\,|0\rangle , $$
where $R_X(x) = e^{-i x \hat{\sigma}_x}$ and $\hat{\sigma}_x$ is a Pauli operator. In this embedding each of the considered features of the original dataset is required to be bounded within $[-\pi, \pi]$.
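The requirement that every feature lies in $[-\pi, \pi]$ can be met with a rescaling step before the embedding. The sketch below assumes a min-max rescaling learned on the training split only; the text does not state which scaling was actually applied, and the toy arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))  # toy stand-in for 5 selected features
X_test = rng.normal(size=(50, 5))

# Learn the per-feature range on the training split only (assumption) and
# map it to [-pi, pi]; clip=True keeps unseen values inside that interval.
scaler = MinMaxScaler(feature_range=(-np.pi, np.pi), clip=True)
angles_train = scaler.fit_transform(X_train)
angles_test = scaler.transform(X_test)
```

Each row of `angles_train` can then be used directly as the set of rotation angles, one per qubit.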

Model Circuit
The model circuit is the key component of the VQC and includes the learnable set of parameters. It is defined by a parameterized unitary circuit $U(w)$, with $w$ being the set of tunable parameters, which evolves a quantum state embedded with classical information, $|\psi_X\rangle$, into the final state $|\psi'_X\rangle$. Analogously to the architecture of a classical neural network, the model circuit is formed by layers. Each layer is composed of a set of rotation gates applied to each qubit in the system, followed by a set of CNOT gates.
A rotation gate, R, is applied to a single qubit and rotates its state. It has 3 learnable parameters, $\phi$, $\theta$ and $\omega$, which enable the gate to rotate any arbitrary state to any location on the Bloch sphere.
Since all the learnable parameters of the VQC are contained inside the rotation gates, and each gate has 3 parameters, the shape of the weight vector is $w \in \mathbb{R}^{n \times l \times 3}$, where n is the number of qubits of the current system and l is the number of layers in the network. As mentioned in the previous section, n depends on the number of features in the data and l is a hyper-parameter (HP) to be tuned.
After rotating the qubits' state, a collection of CNOT gates is applied to entangle the qubits. The CNOT gate is a 2-qubit gate with no learnable parameters. It flips the state of the so-called target qubit, based on the value of the control qubit, and it is usually written with two arguments: CNOT(control qubit, target qubit). Given the number of qubits, the CNOT arrangement is implemented as detailed in Algorithm 1.

Measurement
The output of the model is obtained by measuring the expectation value of the Pauli $\hat{\sigma}_z$ operator on one of the qubits of the final state $|\psi'_X\rangle$. An example of the implementation of a VQC is represented in Figure 1.
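The three components (embedding, model circuit, measurement) can be illustrated with a small statevector simulation. This is a dependency-free NumPy sketch, not the PennyLane implementation used in the study. Several choices are assumptions: the half-angle convention $e^{-i\theta\sigma/2}$ for rotations, the decomposition of the general rotation as $R_Z(\omega)R_Y(\theta)R_Z(\phi)$, and a ring of CNOT gates standing in for the arrangement of Algorithm 1. All function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
PAULI_X = np.array([[0, 1], [1, 0]], dtype=complex)
PAULI_Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def rotation(pauli, angle):
    """exp(-i * angle / 2 * pauli); the half-angle convention is an assumption."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * pauli


def apply_gate(state, gate, wire):
    """Apply a single-qubit gate to one axis of the n-qubit state tensor."""
    state = np.tensordot(gate, state, axes=([1], [wire]))
    return np.moveaxis(state, 0, wire)


def apply_cnot(state, control, target):
    """Flip the target qubit on the part of the state where control = 1."""
    new_state = state.copy()
    index = [slice(None)] * state.ndim
    index[control] = 1
    axis = target if target < control else target - 1
    new_state[tuple(index)] = np.flip(state[tuple(index)], axis=axis)
    return new_state


def vqc_output(x, weights):
    """<sigma_z> on qubit 0 for features x and weights of shape (n, l, 3)."""
    n_qubits, n_layers, _ = weights.shape
    state = np.zeros((2,) * n_qubits, dtype=complex)
    state[(0,) * n_qubits] = 1.0  # start from |0...0>
    for q in range(n_qubits):  # angle embedding with x-rotations
        state = apply_gate(state, rotation(PAULI_X, x[q]), q)
    for layer in range(n_layers):
        for q in range(n_qubits):  # trainable rotations, 3 angles per qubit
            phi, theta, omega = weights[q, layer]
            gate = (rotation(PAULI_Z, omega) @ rotation(PAULI_Y, theta)
                    @ rotation(PAULI_Z, phi))
            state = apply_gate(state, gate, q)
        if n_qubits > 1:  # assumed ring of CNOTs, not Algorithm 1 itself
            for q in range(n_qubits):
                state = apply_cnot(state, q, (q + 1) % n_qubits)
    probs = np.abs(state) ** 2
    return float(probs[0].sum() - probs[1].sum())


rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=3)      # 3 features -> 3 qubits
w = rng.normal(scale=0.01, size=(3, 2, 3))  # n = 3 qubits, l = 2 layers
print(vqc_output(x, w))
```

The output lies in $[-1, 1]$ and can be compared with the event label through the chosen cost function.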

Classical Machine Learning Methods
Shallow CML methods are used to provide a baseline comparison to the QML models. The specific methods chosen for the comparison are Logistic Regression (LR) and Support Vector Machines (SVM), with these algorithms being trained with the same data as the QML algorithms.
All the classical methods were implemented using the scikit-learn [39] library and, if not specified otherwise, the default parameters were used.

Logistic Regression
Logistic Regression is one of the simplest ML models and can be formulated as one of the basic building blocks of a neural network: a single-layer perceptron. The goal is to find the best set of weights $w$ that fit the data $x$:
$$ \hat{y} = \sigma(w \cdot x + b) , $$
where $\hat{y}$ is the probability of an event to belong to class 1, $w$ and $b$ are learnable parameters, and $\sigma$ is the sigmoid function.
The learning process is guided by minimizing the loss function, which in our case is the binary cross-entropy:
$$ \mathcal{L} = -\left\langle y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right\rangle_x , $$
where $y$ is the binary label of whether the event is of the class signal or not, and $\langle \cdot \rangle_x$ is the expectation value over the training data, obtained using the event weights corresponding to each signal and background process.
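As an illustration of the weighted binary cross-entropy described above, the following NumPy sketch evaluates the loss of a logistic model with per-event weights. The toy data, the weight values and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_bce(w, b, X, y, event_weights, eps=1e-12):
    """Event-weighted binary cross-entropy of a logistic model."""
    y_hat = np.clip(sigmoid(X @ w + b), eps, 1.0 - eps)  # avoid log(0)
    per_event = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # The expectation over the data becomes an event-weighted average.
    return np.average(per_event, weights=event_weights)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy features
y = (rng.uniform(size=100) < 0.13).astype(int)   # toy labels, ~13% signal
event_weights = rng.uniform(0.5, 1.5, size=100)  # toy Monte Carlo weights

loss_at_zero = weighted_bce(np.zeros(3), 0.0, X, y, event_weights)
```

With all parameters set to zero the model predicts 0.5 for every event, so the loss equals $\log 2$ regardless of the weights, which is a convenient sanity check.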

Support Vector Machine
An SVM classifier is trained by finding the hyperplane that best separates two classes of data in the hyperspace of features. It does so by using support vectors, which are the data points from the two classes closest to the hyperplane, influencing the position and orientation of the hyperplane.
The loss function of an SVM revolves around the goal of maximizing the margin, i.e. the distance between the hyperplane and the nearest data point from either class. In other words, the goal is to find the hyperplane with the greatest possible margin between itself and any point within the training set, giving a greater chance of new data being classified correctly.
Just like Logistic Regression, the base SVM classifier can only learn a linear decision boundary. However, classification problems are rarely simple enough to be separable using a hyperplane, thus usually requiring a non-linear separation. An SVM can achieve this by transforming the data using a non-linear function, named kernel, after which the data can be split by a hyperplane. For this implementation, the radial basis function (RBF) was used as kernel. This endows the SVM with a non-linear mapping in which the two classes are better separated by a hyperplane.
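The effect of the kernel can be seen on a toy problem that is not linearly separable. The sketch below uses scikit-learn with default parameters, as in the text, but the concentric-circles dataset is purely illustrative and unrelated to the HEP data.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

lr = LogisticRegression().fit(X_train, y_train)  # linear boundary
svm = SVC(kernel="rbf").fit(X_train, y_train)    # non-linear via RBF kernel

auc_lr = roc_auc_score(y_test, lr.decision_function(X_test))
auc_svm = roc_auc_score(y_test, svm.decision_function(X_test))
print(f"LR AUC = {auc_lr:.3f}, SVM (RBF) AUC = {auc_svm:.3f}")
```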

Dataset
The dataset used in this work [40] comprises simulated events of pp collisions at 13 TeV, in final states with 2 leptons, at least 1 b-jet, at least 1 large-R jet and a large scalar sum of the transverse momentum ($p_T$) of all reconstructed particles in the event ($H_T > 500$ GeV). Such basic selection corresponds to a topology commonly used in different searches for BSM events at the LHC [41]. The dominant SM background for this topology, $Zb\bar{b}$, and the BSM signal corresponding to $t\bar{t}$ production with one of the top-quarks decaying via a flavour changing neutral current decay $t \to qZ$ ($q = c, u$) [42], were considered. Such signal was chosen given its kinematic similarity to the background, thus providing a good benchmark for the present study.
Both samples were generated with MADGRAPH5 2.6.5 [43] and PYTHIA 8.2 [44], and the detector was simulated using DELPHES 3 [45] with the default CMS card. Jets were clustered using the anti-$k_t$ algorithm [46], implemented via FASTJET [47], with R-parameters of 0.5 and 0.8 (the latter for the large-R jets).
The following features were used for training of both the classical and quantum machine learning algorithms:
• ($\eta$, $\phi$, $p_T$, $m$, b-tag) of the 5 leading jets, ordered by decreasing $p_T$, with b-tag being a Boolean variable indicating if the jet is identified as originating from a b-quark by the b-tagging algorithm emulated by DELPHES;
• ($\eta$, $\phi$, $p_T$, $m$) of the leading large-R jet;
• N-subjettiness of the leading large-R jet, $\tau_n$ with $n = 1, ..., 5$ [48];
• ($\eta$, $\phi$, $p_T$) of the 2 leading leptons (electrons or muons);
• transverse momentum ($E_T$) and $\phi$ of the missing transverse energy;
• multiplicity of jets, large-R jets, electrons and muons.
The proportion of signal and background events was kept the same as in the original simulated data during training, being 13% and 87% respectively. Additionally, the Monte Carlo weights, corresponding to the theoretical prediction for each process at a target luminosity of 150 fb$^{-1}$, were taken into account in the evaluation of all the considered metrics and loss functions.

Feature Selection
As described in the previous section, a total of 47 features are available for training. Considering the type of data embedding chosen, 47 qubits would be needed to train a VQC using all the dataset features. Such a number of qubits is impractical given the currently available quantum computers and thus it is not feasible to train a VQC using all the features in our dataset. For the purposes of the current study, quantum computers with only 5 qubits were considered and two methods for feature selection were implemented: principal component analysis (PCA) and sequential feature selection (SFS).
A relative comparison of the best 5 features is shown in Table 1, while the best performance obtained with state-of-the-art CML methods without any restriction on features or data points can be seen in Figure 2.

Sequential Feature Selection
SFS algorithms are a widely used family of greedy search algorithms for automatically selecting the subset of features that is most relevant to the problem. This is achieved by removing or adding one feature at a time, based on the classifier performance, until a feature subset of the desired size, k, is reached.
There are different variations of SFS algorithms but, for the current paper, the Sequential Backward Selection (SBS) algorithm was chosen. This algorithm starts with the full set of features (n = 47) and, at each iteration, it generates all possible feature subsets of size n − 1 and trains an ML model for each one of the subsets. The performance is subsequently evaluated and the feature that is absent from the subset with the highest performance metric is removed. This process is iterated until the feature subset contains k features.
This technique was used to find subsets of 1 to 5 features. The ML model assisting the SBS was a boosted decision tree (BDT) with a maximum number of estimators set at 100 and a learning rate of $1 \times 10^{-5}$. The considered loss function was the logistic regression objective for binary classification and the AUC score was used as evaluation metric. The BDT was implemented using xgboost [49] and the SBS algorithm using mlxtend [50]. The features selected with the SBS algorithm for the different values of k are shown in Table 2, and the AUC score of each one of these features is shown in Table 3, ordered by descending AUC value.
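The backward elimination loop described above can be sketched with scikit-learn. This is a stand-in, not the xgboost and mlxtend setup used in the study: `SequentialFeatureSelector` and `GradientBoostingClassifier` replace those libraries, and the synthetic dataset, the number of estimators and the 3-fold cross-validation are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Toy dataset with 8 features, of which 3 carry information.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

bdt = GradientBoostingClassifier(n_estimators=20, random_state=0)
k = 3  # desired size of the final feature subset

# direction="backward": start from all features and drop, at each step,
# the feature whose removal gives the best cross-validated AUC.
sbs = SequentialFeatureSelector(bdt, n_features_to_select=k,
                                direction="backward",
                                scoring="roc_auc", cv=3)
sbs.fit(X, y)
selected = sbs.get_support(indices=True)
print("selected feature indices:", selected)
```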

Principal Component Analysis
The PCA transforms a highly correlated, high-dimensional dataset into a new one with reduced dimensionality and uncorrelated features, by rotating the dataset in the direction of the eigenvectors of the dataset covariance matrix. In the present paper, the PCA was performed only to remove the correlation between the features, maintaining the same dimensionality as the original data. The PCA transformation was learned from the training dataset and then applied to all datasets. When training a VQC for a specific number of features, the PCA components were ranked by AUC score and selected from the highest to the lowest. This is done by introducing a priority queue, i.e. if training a model using two features is desired, the 2 top-ranked PCA components are selected. The scikit-learn PCA implementation was used and the 5 best components obtained are shown in Table 4.
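A minimal sketch of this procedure with scikit-learn is given below. How the AUC of a single component is computed is not specified in the text; here the component value itself is used as the discriminant and the AUC is taken as max(AUC, 1 − AUC), which is an assumption, as are the toy dataset and the variable names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

# Keep all components: the PCA only decorrelates the features.
pca = PCA()
Z_train = pca.fit_transform(X_train)  # learned on the training split
Z_test = pca.transform(X_test)        # same rotation applied elsewhere


def component_auc(z, labels):
    """Sign-independent AUC of a single component (assumption)."""
    auc = roc_auc_score(labels, z)
    return max(auc, 1.0 - auc)


aucs = np.array([component_auc(Z_train[:, j], y_train)
                 for j in range(Z_train.shape[1])])
ranking = np.argsort(-aucs)  # best component first

k = 2  # number of features (qubits) wanted
Z_train_k = Z_train[:, ranking[:k]]
Z_test_k = Z_test[:, ranking[:k]]
```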

Dataset Size Reduction
The present paper addresses the use of reduced datasets to overcome the limitations of NISQ processors while minimizing the loss of information and thus avoiding a performance loss of the QML algorithms in the HEP context. The primary method used for this purpose in the current study is KMeans, where the k most representative points, i.e. a set of centroids, are obtained from the original dataset. Although these centroids are the most representative data points, they are not necessarily contained in the original dataset and, consequently, a resampling process is required, in which points of the original dataset are used to define a representative point (centrus) for each centroid.
The performance of the proposed dataset reduction method is studied by training a logistic regression model with the original dataset and comparing the results with those obtained when KMeans and randomly undersampled datasets are used.

KMeans Algorithm
KMeans is a clustering algorithm that iteratively tries to separate data into independent groups [51]. This separation is done using Lloyd's algorithm [52], based on the minimal variability of samples within each cluster. The KMeans algorithm requires the specification of the desired number of clusters (k) a priori. The following steps were used:
1. Initialization of the centroids: using the scikit-learn implementation, it is possible to do it in two different ways, random and k-means++ [53]:
• Random: k random samples of the original dataset are chosen.
• K-means++: k samples of the original dataset are chosen based on a probabilistic approach, leading to the centroids being initialized far away from each other.
Given enough iterations, the algorithm always converges, although convergence to the global minimum is not guaranteed. The k-means++ initialization helps to address this issue. Furthermore, for both initializations, the algorithm, by default, runs several times with different centroid seeds, with the best result being the output.
2. Assignment: each data point $x_{i:}$ is assigned to a cluster $c_{k'}$ in such a way that the inertia is minimized:
$$ k' = \arg\min_{k} \sum_{j=0}^{F-1} \left( x_{ij} - \mu_{kj} \right)^2 , $$
where F is the dimensionality, i.e. the number of features, $\mu_k$ is the centroid of the cluster $c_k$ and j stands for the (j + 1)-th feature.
3. Update of the centroids' position: the new centroids are the mean positions of each cluster, i.e.
$$ \mu_k = \frac{1}{n_k} \sum_{x_{i:} \in c_k} x_{i:} , $$
with $n_k$ being the number of samples assigned to $c_k$. It should be noted that if $n_{k'} = 0$ the centroid $\mu_{k'}$ does not change.
4. Iteration: steps 2 and 3 are repeated until the maximum number of iterations is reached or until the result converges, i.e. the centroids do not change.
The KMeans algorithm was used separately for the signal and background samples, with the corresponding weights being used.
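The steps above map onto the scikit-learn `KMeans` class, which accepts per-sample weights in `fit`. The sketch below runs it separately on toy signal and background samples; the data, the weights, the number of clusters and the helper name `class_centroids` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_sig = rng.normal(loc=1.0, size=(500, 2))   # toy signal events
w_sig = rng.uniform(0.5, 1.5, size=500)      # toy event weights
X_bkg = rng.normal(loc=-1.0, size=(800, 2))  # toy background events
w_bkg = rng.uniform(0.5, 1.5, size=800)


def class_centroids(X, weights, k, seed=0):
    """Weighted KMeans centroids of one class (signal or background)."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                max_iter=300, random_state=seed)
    km.fit(X, sample_weight=weights)
    return km.cluster_centers_


k = 50  # same number of centroids for each class
centroids_sig = class_centroids(X_sig, w_sig, k)
centroids_bkg = class_centroids(X_bkg, w_bkg, k)
```

Here `n_init=10` makes the algorithm restart from 10 different centroid seeds and keep the run with the lowest inertia.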

Dataset Resampling
As previously mentioned, although centroids are the most representative points, they are not necessarily contained in the original dataset. Hence, it was chosen to consider 10 neighbors of each centroid to determine each centrus, i.e. the 10 nearest points of the original dataset.
The position of each centrus was determined using the weighted mean of the position of the neighbors,
$$ W = \frac{\sum_{i=0}^{9} w_i \, x_{i:}}{\sum_{i=0}^{9} w_i} , $$
where $W$ is the mean position, $x_{i:}$ is the (i + 1)-th nearest point and $w_i$ the weight of the sample. The sample weight of each centrus was calculated based on the number of samples of the same label (i.e. signal or background) in the original dataset, with $w_i$ being the weight of the (i + 1)-th centrus and n the number of samples in the original dataset with the same label as this centrus.
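The neighbor-based step can be sketched as follows, using scikit-learn's `NearestNeighbors` to find the 10 closest original events to each centroid and averaging their positions with the event weights. Only the position is computed; the centrus sample weight is left out because its formula is not reproduced here. The toy data and the name `centrus_positions` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def centrus_positions(X, weights, centroids, n_neighbors=10):
    """Weighted mean position of the nearest original events per centroid."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    _, idx = nn.kneighbors(centroids)  # shape (k, n_neighbors)
    neigh_x = X[idx]                   # shape (k, n_neighbors, F)
    neigh_w = weights[idx]             # shape (k, n_neighbors)
    weighted_sum = (neigh_w[..., None] * neigh_x).sum(axis=1)
    return weighted_sum / neigh_w.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # toy events of a single class
w = rng.uniform(0.5, 1.5, size=500)  # toy event weights
centroids = KMeans(n_clusters=20, n_init=10, random_state=0).fit(
    X, sample_weight=w).cluster_centers_

centrus = centrus_positions(X, w, centroids)
```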

Quantum and Classical Machine Learning Training
The training of the QML algorithms used in the current paper requires the use of optimizers. Two different ones were considered: Adam [54] and the tree-structured Parzen estimator sampler (TPE) [55,56].
The Adam optimizer uses an extension of stochastic gradient descent, leveraging techniques such as adaptive moment estimation, and is extensively used in optimization problems, namely in the context of machine learning. Nonetheless, since there is no reason to expect, a priori, that it will work equally well in the context of QML, where specific challenges are expected, the TPE optimizer was also tested.
The TPE is a Bayesian optimization algorithm first developed for HP tuning in the context of machine learning. In the current study, it is used to optimize VQC weights in a way very similar to what is typically done for HP tuning. TPE is implemented using Optuna [57], a library focused on HP optimization for machine learning models. TPE works by choosing a parameter candidate that maximizes the likelihood ratio between a Gaussian Mixture Model (GMM) fitted to the set of parameters associated with the best objective values and another GMM fitted to the remaining parameter values. In the context of HEP, TPE has also been used to explore parameter spaces of BSM models [58].
Different machine learning methods were optimized, namely a LR, a SVM and a VQC. The corresponding training was done for the set of HP summarized in Table 5, where the scanned values are also listed. For each set of HP, 5 models were trained on 5 different subsets of the initial dataset (random sampling). For both optimizers, the cost function used is the squared error, with the individual Monte Carlo samples being properly weighted. During the training of VQCs, inference was performed on the validation dataset every 5 epochs, the AUC was computed, and only the best-performing model according to this metric was kept.
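The repeated-sampling scheme (5 models per HP set, each trained on a different random subset) can be sketched as below. A logistic regression and the number of features stand in for the full HP grid of Table 5, and the toy dataset, the subset size of 500 and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_pool, y_pool = X[:2000], y[:2000]  # events available for training
X_val, y_val = X[2000:], y[2000:]    # held-out events for the AUC

rng = np.random.default_rng(0)
results = {}
for n_features in (1, 2, 3, 4, 5):   # one scanned hyper-parameter
    aucs = []
    for _ in range(5):               # 5 different random subsets
        idx = rng.choice(len(X_pool), size=500, replace=False)
        clf = LogisticRegression().fit(X_pool[idx, :n_features], y_pool[idx])
        scores = clf.decision_function(X_val[:, :n_features])
        aucs.append(roc_auc_score(y_val, scores))
    # Mean and standard deviation over the 5 runs give point and error bar.
    results[n_features] = (float(np.mean(aucs)), float(np.std(aucs)))
```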

Adam Implementation Details
The training starts with the initialization of the weight vector. This is done randomly with an order of magnitude of $10^{-2}$, and is followed by training iterations until a maximum number of epochs is reached. At each iteration, the model is evaluated on the training dataset, the cost function is calculated and the model parameters are updated via the Adam optimizer. A summary of the Adam-optimized VQC training is shown in Algorithm 2.
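The training loop can be illustrated on a deliberately small toy model. The sketch below is not Algorithm 2: it uses a one-parameter model, $\cos(x + \theta)$, which corresponds to the $\langle\sigma_z\rangle$ of a single qubit under a half-angle rotation convention (an assumption), labels mapped to $\{-1, +1\}$ (also an assumption), and hand-written Adam update rules with commonly used default constants.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)


def make_split(n):
    """Toy events: one feature, binary label, per-event weight."""
    y = rng.integers(0, 2, size=n)
    x = np.where(y == 1, rng.normal(0.5, 0.3, n), rng.normal(-1.0, 0.3, n))
    return x, y, rng.uniform(0.5, 1.5, size=n)


x_tr, y_tr, w_tr = make_split(400)
x_val, y_val, w_val = make_split(200)


def model(x, theta):
    return np.cos(x + theta)  # toy single-qubit <sigma_z>


def cost_and_grad(theta, x, y, w):
    """Event-weighted squared error and its derivative w.r.t. theta."""
    resid = model(x, theta) - (2 * y - 1)  # labels mapped to -1/+1
    cost = np.average(resid ** 2, weights=w)
    grad = np.average(2 * resid * (-np.sin(x + theta)), weights=w)
    return cost, grad


theta = 0.01 * rng.normal()  # random initialization of order 1e-2
m, v = 0.0, 0.0
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
initial_cost, _ = cost_and_grad(theta, x_tr, y_tr, w_tr)
best_auc, best_theta = -np.inf, theta

for epoch in range(1, 101):
    _, grad = cost_and_grad(theta, x_tr, y_tr, w_tr)
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** epoch)         # bias corrections
    v_hat = v / (1 - beta2 ** epoch)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    if epoch % 5 == 0:                       # validate every 5 epochs
        auc = roc_auc_score(y_val, model(x_val, theta), sample_weight=w_val)
        if auc > best_auc:                   # keep the best model only
            best_auc, best_theta = auc, theta

final_cost, _ = cost_and_grad(theta, x_tr, y_tr, w_tr)
```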

TPE Implementation Details
We use the Optuna implementation of the TPE sampler. Being a Bayesian optimization algorithm, TPE works very differently from Adam, which is a gradient descent algorithm. In TPE, for every training iteration, each parameter is replaced by a new value obtained by sampling from a Gaussian Mixture Model of good points, which is then used to compute the loss function. At each epoch, the algorithm computes new values for the model parameters. With the value of the loss function of the suggested parameters, TPE updates its internal Gaussian Mixture Models of good and bad points, which allows it to learn what good suggestions are as more parameter values are sampled. Since TPE is a Bayesian algorithm, it does not need to compute derivatives of the loss function, as Adam does, which might allow for a lighter workload when running trainings on quantum computers.
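In the standard formulation of TPE, quoted here as general background rather than as a detail specific to this study, the observed loss values $y$ are split at a threshold $y^{*}$ (a chosen quantile of the losses seen so far) and two densities are fitted to the corresponding parameter values. The next candidate maximizes the ratio of the two densities:

```latex
p(w \mid y) =
\begin{cases}
  \ell(w), & y < y^{*} \quad \text{(good points)} \\
  g(w),    & y \ge y^{*} \quad \text{(bad points)}
\end{cases}
\qquad
w_{\text{next}} = \arg\max_{w} \frac{\ell(w)}{g(w)}
```

Here $\ell$ and $g$ play the role of the Gaussian Mixture Models of good and bad points mentioned above, and the symbol names are chosen for illustration.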

Feature Reduction
The results indicate that QML circuits trained with SBS data are generally unstable and very susceptible to fluctuations in the randomly sampled data, as can be seen in Figure 3. In contrast, using PCA-transformed data produces significantly more stable results.
The performance of both optimizers, Adam and TPE, is usually saturated with only 2 layers. This effect is most noticeable when the number of features is greater than or equal to 3. When considering only the PCA-obtained results, the two optimizers are compatible for most of the configurations tested. Exceptions occur when using a high number of features (≥ 4) and only one layer, where TPE outperforms Adam, and when using a high number of features (≥ 4) and more than one layer, where the opposite happens and Adam outperforms TPE.
The shallow ML methods trained on the same data as the VQCs are shown in Figure 4. The AUC scores obtained in this case are more stable for both the PCA and SBS datasets. The performance in both cases is saturated when using 2 features and the models trained with SBS data outperform the PCA-trained models, contrary to what was observed for the QML case. It should also be noted that the SVM outperforms LR in all cases except when only one feature is used, which is not surprising since SVMs are more sophisticated classifiers.
For the best set of HP, VQCs trained using TPE and Adam have performed similarly to the shallow ML methods (cf. Figure 5 and Figure 14, respectively). It was also observed that there are no cases where QML outperforms any of the shallow methods tested. The TPE optimizer regime produced the best performance for QML, achieving an AUC score of 0.841 ± 0.051, as shown in Figure 5.
The reduction algorithms studied come with an additional computational cost compared to using the original dataset directly. In particular, the SBS algorithm added an overhead of 1 hour for running the XGBoost algorithm and selecting the features with more classification power. On the other hand, the PCA algorithm took less than a minute to complete. However, since these algorithms only need to be run once, before the training, and given that the grid search for the VQC, SVM and LR algorithms took over 200 CPU hours on a dual-Intel(R) Xeon(R) Gold 6348 machine, in the end the computational cost of the classical reduction algorithms is negligible.

VQC's Robustness to Discrete Features
In the previous section it was noted that there was a significant variability in the final score of QML models, especially when training with SBS data. In fact, VQCs, being variational algorithms, are highly susceptible to small fluctuations in the data, which can have a correspondingly significant impact on the computed AUC. Additionally, numerical instabilities caused by computational floating point accuracy were observed during the validation step, leading to considerable fluctuations in the computed AUC in this regime.
To further investigate this behavior, which was not observed at the same level on the PCA-trained circuits, we looked at the AUC distributions produced by QML models as a function of the number of features. As shown in Figure 6, it is clear that the instability in SBS results occurs when more than two features are used. The biggest difference in the AUC mean is found for 4 features, where the value for SBS is 0.471 ± 0.129 and for PCA is 0.719 ± 0.096. The smallest difference is found for 1 feature, where the value for SBS is 0.814 ± 0.035 and for PCA is 0.724 ± 0.037.
Additionally, we produced visualisations of the decision regions of the models trained using both feature selection methods. We focused on runs that used two features, as this is where the problem originated. Figure 7 and Figure 8 show the decision regions obtained with each model for one representative run, illustrating the sensitivity of each boundary to variations in the data, for SBS or PCA. The SBS features used are listed in Table 2, where the second feature, the number of muons in the event, is a discrete variable. While LR and SVMs are robust in the presence of discrete variables, such variables may pose a challenge for continuous learning algorithms such as VQCs. It is therefore possible that the variability observed when using different sub-samples of SBS data could be attributed to the use of this discrete variable. To investigate this, we conducted the SBS feature selection once again, this time excluding all discrete variables, yielding Table 6. The VQC circuits were once again trained using this modified list of inputs in a limited study of 2 features only, as illustrated in Figure 9.
Figure caption (fragment): ... 5, with the two-feature restriction. The error bar represents the standard deviation associated with each data point, since each point is the average of 5 different random samplings from the data.
Using the discrete-free SBS version to train the VQC led to significantly better AUC scores, outperforming PCA-trained QML models with an average AUC score of around 0.85, although still with larger variability than that of the PCA-trained VQCs. This is a notable departure from our previous observations in subsection 8.1, where including discrete features in the SBS feature selection methodology resulted in erratic performance with no instance of outperforming PCA (except in cases where only one continuous feature was used). Therefore, we found that excluding discrete variables during feature selection led to better performance for VQC circuits in a limited study of 2 features, compared to when discrete variables were included. This indicates that the choice of input features is crucial for achieving high accuracy in quantum machine learning, and future studies should consider the impact of discrete variables on VQC performance. The findings may inform future choices in selecting input features for VQC circuits to optimize model performance.

Implementation of KMeans
The performance of the KMeans algorithm was tested initially by training LR models with 10 reduced datasets and different numbers of features k (k ∈ {1, 2, 3, 4, 5}) selected with the SBS algorithm. The KMeans algorithm considers the sample weight and, in order to have an equal number of signal and background centroids, it was applied separately to the signal and background data. Since state-of-the-art quantum computing requires small datasets, the data reduction studies were done for datasets with 100, 500, 1000, and 5000 data points and the numbers of features previously mentioned.
Two configurations were studied: in the first, the framework presented in subsection 6.1 was applied to both the training and test datasets; in the second, it was applied only to the training datasets (with test datasets obtained through random undersampling). The mean AUC score and respective standard deviation found using KMeans for both train and test datasets are summarised in Figure 10. The results obtained using the KMeans algorithm for the training dataset and random undersampling for the test signal and background samples are presented in Figure 11. To provide a benchmark for comparison with the performance of the reduced datasets, an LR model was trained on the full original dataset, with results shown in both Figures. It can be seen in Figure 10 and Figure 11 that using the KMeans algorithm to reduce the training dataset results in AUC scores that are compatible with the performance obtained using the full original dataset.
This study shows that, although KMeans is a more sophisticated data reduction algorithm than random undersampling, no significant deterioration of the performance is observed when using it in the HEP case under consideration. This suggests that the dataset composed of prototypes is a good representation of the whole dataset in the small dataset regime explored in this work.
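A minimal sketch of the per-class prototype reduction described above is given below. It assumes scikit-learn's `KMeans`, whose `fit` method accepts a `sample_weight` argument; the function name `kmeans_prototypes`, the toy data and the uniform toy weights are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_prototypes(X, y, weights, n_points, seed=0):
    """Reduce (X, y) to n_points centroids, half signal and half background."""
    per_class = n_points // 2
    prototypes, labels = [], []
    for cls in (0, 1):  # assumption: 0 = background, 1 = signal
        mask = y == cls
        km = KMeans(n_clusters=per_class, n_init=10, random_state=seed)
        # Event weights enter the centroid computation through sample_weight.
        km.fit(X[mask], sample_weight=weights[mask])
        prototypes.append(km.cluster_centers_)
        labels.append(np.full(per_class, cls))
    return np.vstack(prototypes), np.concatenate(labels)


# Toy stand-in for a weighted simulated sample with two features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(2000, 2))
w = np.ones(2000)

X_red, y_red = kmeans_prototypes(X, y, w, n_points=100)
print(X_red.shape, np.bincount(y_red))
```

Running the clustering separately on each class guarantees the balanced number of signal and background centroids mentioned in the text, at the cost of two independent KMeans fits.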

Application to QML
The QML, SVM, and LR models were trained using KMeans reduced datasets as well as random undersampling, for different dataset sizes. In this comparison, the HP for the VQC are the ones previously found to be the best, i.e. one feature chosen with the SBS method and five VQC layers for the architecture. The metric used to compare all models is the AUC score averaged over five different runs.
For all cases, the test and validation sets were reduced using random undersampling, hence, for each dataset size there are one train, five validation and five test datasets.The choice to keep random sampling for the test dataset, rather than KMeans reduction, is to ensure that our methodology represents the test samples as close to the original dataset as possible, ensuring that sophisticated resampling techniques do not significantly modify the data.
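The evaluation protocol above (one training set, several randomly undersampled test sets, mean and standard deviation of the AUC) can be sketched as follows. This is an illustration under stated assumptions: balanced undersampling per class, a logistic regression standing in for any of the models, and toy Gaussian data; the helper `random_undersample` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def random_undersample(X, y, n_points, rng):
    """Draw n_points events without replacement, half from each class (assumption)."""
    per_class = n_points // 2
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=per_class, replace=False)
        for cls in (0, 1)
    ])
    return X[idx], y[idx]


rng = np.random.default_rng(42)

# Toy data: two Gaussian blobs standing in for background (0) and signal (1).
y_full = rng.integers(0, 2, size=4000)
X_full = rng.normal(loc=y_full[:, None] * 1.0, scale=1.0, size=(4000, 2))
X_pool, y_pool = X_full[:2000], y_full[:2000]   # source of the training set
X_hold, y_hold = X_full[2000:], y_full[2000:]   # source of the test sets

X_train, y_train = random_undersample(X_pool, y_pool, 100, rng)
model = LogisticRegression().fit(X_train, y_train)

# Five independent randomly undersampled test sets for the same trained model.
aucs = []
for _ in range(5):
    X_test, y_test = random_undersample(X_hold, y_hold, 100, rng)
    aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

mean_auc, std_auc = float(np.mean(aucs)), float(np.std(aucs))
print(f"AUC = {mean_auc:.3f} +/- {std_auc:.3f}")
```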
The obtained results are shown in Figure 12. It can be seen that the performance for the KMeans reduced dataset is compatible with the one obtained using the dataset reduced through random undersampling, for QML and CML models. Furthermore, the performance achieved by the simulated VQCs is identical within the statistical uncertainties to the performance of the SVM and LR, in agreement with what was observed in subsection 8.1.
Nonetheless, it should be emphasized that the model trained with random undersampling needs to be trained several times to achieve these average scores, as many times as the number of reduced datasets used. On the other hand, the models using the KMeans reduced dataset need to be trained only once. This can be relevant in the context of quantum computers, where access is often subject to long queues and thus the number of accesses can be a limiting factor. While the KMeans reduction technique brought an overall increase in time of around 1%, this change is negligible taking into account the reduction in the number of accesses.

Real Quantum Computers Results
Until this point, only simulated quantum environments were used. In order to test the performance on real quantum computers, and thus validate the simulation results, the PennyLane framework was used as the integration layer with Qiskit, which works effectively as a direct API to the quantum computers provided by IBM.
In this study only the best-performing HP set was used, i.e. the TPE-trained VQC. This VQC was implemented and inference on its test set was run on six different quantum systems with identical architectures, all freely available. Evaluating our model on multiple identical quantum systems allows us to get an idea of the scale of the associated systematic uncertainty via the variability of the observed results. Since the implemented circuits are small, no error mitigation techniques were implemented. IBM's transpiler optimization level was set to 3 [37] and, for each event, the final expectation value was computed by averaging 20k shots on the quantum computer. The obtained results, shown in Figure 13, are compatible with the simulated ones (Figure 5).
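To give a feeling for the statistical precision of an expectation value averaged over 20k shots, the sketch below simulates the measurement of a single qubit in the computational basis with NumPy. It is an idealised illustration that ignores hardware noise; the probability `p0 = 0.7` and both function names are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np


def estimate_expectation_z(p0, shots, rng):
    """Estimate <Z> = P(0) - P(1) from a finite number of shots."""
    n0 = rng.binomial(shots, p0)        # number of shots yielding outcome |0>
    return (2 * n0 - shots) / shots


def shot_noise_std(expval, shots):
    """Standard error of the mean of +/-1 outcomes with true mean expval."""
    return np.sqrt((1.0 - expval**2) / shots)


rng = np.random.default_rng(7)
p0 = 0.7                   # assumed probability of measuring |0>
exact = 2 * p0 - 1         # exact expectation value <Z>
shots = 20_000

estimate = estimate_expectation_z(p0, shots, rng)
sigma = shot_noise_std(exact, shots)
print(f"exact <Z> = {exact:.3f}, estimate = {estimate:.4f}, shot-noise std = {sigma:.4f}")
```

Under these assumptions the shot-noise contribution per event is below 0.01, so differences between devices larger than this scale would point to hardware effects rather than to sampling statistics.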

Conclusion
In this paper, we assessed the feasibility of using large datasets in QML applications by exploring data reduction techniques. To allow for a fair comparison between CML and QML models, we opted to use shallow classical methods as opposed to deep methods, which require large datasets that are not viable given the limitations of current quantum computers. Our results indicate that there is comparable performance between CML and QML when tested in the same small dataset regime.
To achieve this, our study first compared feature selection techniques, showing that while SBS can produce the best-performing QML model, it generally yielded worse and more unstable results than PCA. Additionally, we found that this instability was caused by the use of discrete variables in VQCs, highlighting the suitability of PCA-transformed data for QML applications in the HEP context, where discrete variables are commonly used to describe collider events.
Our grid search over different HP combinations of VQCs run in simulation provided no evidence of quantum advantage in our study. We confirmed the results by running the best-performing configuration on real-world quantum systems, obtaining compatible performances and therefore validating our conclusions. We compared the performance of TPE and Adam optimizers in QML and found that TPE achieves competitive results. Being a gradient-free optimiser, TPE offers the advantage that it can lead to faster training with a smaller memory usage when compared to Adam, which in principle can further facilitate the application of QML in current quantum computers.
We then explored data reduction techniques, finding that reducing the dataset size with the KMeans algorithm produces results that are similar to those obtained from random undersampling. This finding is significant in that it means that the model can achieve similar performance with fewer accesses to a quantum computer during training, which is a considerable bottleneck in current QML.
In conclusion, while our study found no evidence of quantum advantage in the current state of QML within the context of large HEP datasets, the performance of QML models was comparable to that of classical machine learning models when restricted to small dataset regimes. Our findings suggest that using dataset reduction techniques enables us to use large datasets more efficiently to train VQCs, facilitating the usage of current quantum computers with the large datasets often found in HEP.

Figure 1: An example circuit for the VQC architecture used. It is comprised of 2 layers and 3 features as input. The three main stages of a QML model can be seen: embedding of the data, passing the data through the model circuit, and the measurement of the outcome.

Figure 2: Obtained ROC curve and respective AUC score on the test dataset when training a Boosted Decision Tree, implemented with xgboost [49], using the full set of features and data points. The classifier has the same configuration as the one described in subsection 5.1.

Figure 3: Plot grid representing the results for both Adam and TPE-trained VQCs. Each data point represents the AUC score on the test dataset of a different set of HP, as listed in Table 5. The error bar represents the standard deviation associated with each data point since each point is the average of 5 different random samplings from the data.

Figure 5: ROC of the best HP set, using the average AUC score of the TPE-trained QML model as a metric, and the corresponding ROCs of the shallow methods for the same data. The HP for this run are SBS as the feature selection method, 100 data points, 1 feature, and 5 VQC layers. The different colours indicate the different random samplings of the data.

Figure 6: Distribution of the AUC values obtained for the QML model as a function of the number of features used in training, evaluated on the test dataset, for SBS and PCA inputs.

Figure 7: Decision regions of the 3 different architectures in a run where large variability of results for the QML SBS-trained model was observed. This case uses SBS data, Adam as an optimizer, 100 data points for training and 2 layers for the circuit.

Figure 8: Decision regions of the 3 different architectures in a run where large variability of results for the QML PCA-trained model was observed. This case uses PCA data, Adam as an optimizer, 100 data points for training and 1 layer for the circuit.

Figure 9: Plot grid representing the results for both Adam and TPE-trained VQCs. Each data point represents the AUC score on the test dataset of a different set of HP, as listed in Table 5, with the two features restriction. The error bar represents the standard deviation associated with each data point since each point is the average of 5 different random samplings from the data.

Figure 10: Average AUC score and corresponding standard deviation, represented as uncertainty bands, for different numbers of clusters as a function of the number of features. The training and testing datasets were reduced using the KMeans algorithm. In each case, 10 different reduced test datasets were used.

Figure 11: Average AUC score and corresponding standard deviation, represented as uncertainty bands, for different numbers of clusters as a function of the number of features. The training dataset was reduced using the KMeans algorithm. In each case, 10 different randomly undersampled test datasets were used.

Figure 12: Comparison between the QML, SVM, and LR models when trained with the TPE and the best set of HP for different dataset sizes for both random undersampling (regular) and KMeans reduced datasets.

Figure 13: Final ROC curve of the best-performing model when inference is run on the test dataset in 6 different IBM systems. The average AUC scores and the corresponding standard deviations are also shown. The colours in each subplot stand for different runs of the same circuits on the same quantum computer.

n being the number of qubits.

Table 2: List of the features selected by the SBS algorithm for k = 1, ..., 5.

Table 3: Features selected by the SBS algorithm and their respective AUC score on the training dataset.

Table 4: Top 5 PCA components obtained with the training dataset, ranked by their AUC.

Table 5: List of scanned hyperparameters. The LR parameter is used only for the VQC optimized by Adam while the number of layers is only used by the VQCs.

Table 6: Features selected by the SBS algorithm and their respective AUC score on the training dataset with all the discrete features removed.