IRC-Safe Graph Autoencoder for Unsupervised Anomaly Detection

Anomaly detection through employing machine learning techniques has emerged as a novel powerful tool in the search for new physics beyond the Standard Model. Historically similar to the development of jet observables, theoretical consistency has not always assumed a central role in the fast development of algorithms and neural network architectures. In this work, we construct an infrared and collinear safe autoencoder based on graph neural networks by employing energy-weighted message passing. We demonstrate that whilst this approach has theoretically favorable properties, it also exhibits formidable sensitivity to non-QCD structures.


INTRODUCTION
New physics searches at the high-energy frontier of the Large Hadron Collider (LHC) have so far not resulted in any significant deviation of experimental results from the Standard Model (SM) expectation. However, with a growing dataset of these high energy measurements, the pressure mounts for theoretically motivated scenarios of beyond the SM (BSM) physics which have been devised to tackle known shortcomings of the SM. So what are the ways out of this juxtaposition of experimental agreement with the SM and its failure to describe established physics at small and large distances?
On the one hand, there is an increasing emphasis on theoretically as-model-independent-aspossible approaches based on effective field theory (EFT) (Weinberg, 1979). EFT navigates QFT correlations away from the SM prediction in any possible direction given the SM symmetry and particle content, thus avoiding UV model biases. Alas, such an approach poses its own challenges: looking for deviations from the SM expectation along these lines involves many ad-hoc interactions. Concrete models will typically only source a subset of relevant interactions, e.g., (Englert et al., 2019;Das Bakshi et al., 2021;Bakshi et al., 2022). There has been great progress to facilitate matching calculations (Carmona et al., 2021), however, depending on the new physics scenario, this can create a significant overhead that must be included in the parameter fitting procedure itself (Freitas et al., 2016;Englert et al., 2020).
On the other hand, we can look for phenomenological deviations from specific SM signatures directly in collider results without any new physics bias. Under the assumption that collider data can be modeled sufficiently adequately, we can employ the SM expectation to identify regions where measurements do not follow the SM expectation. This anomaly detection has emerged as a powerful tool to look for any hidden signature of new physics in the data. Recently, a range of state-of-the-art methods for anomaly detection (Collins et al., 2018(Collins et al., , 2021Aaboud et al., 2019;Blance et al., 2019;De Simone and Jacques, 2019;Blance and Spannowsky, 2020;Cheng et al., 2020;Hajer et al., 2020;Nachman, 2020;Nachman and Shih, 2020;Araz and Spannowsky, 2021;Atkinson et al., 2021b;Hallin et al., 2021;Canelli et al., 2022) using deep learning have been designed.
Theoretical consistency when confronting collider data with theoretical expectations is pivotal. The formulation of infrared and collinear (IRC)-safe observables is necessary to guarantee the comparability of experimental measurements and theoretical predictions to all orders in perturbation theory employing the Kinoshita-Lee-Nauenberg (KLN) theorem (Kinoshita, 1962;Lee and Nauenberg, 1964) and collinear factorization (Collins et al., 1989) of parton distributions. Any sensitivity enhancement observed by algorithms that inadvertently employ IRC-unsafe information will be critically assessed in subsequent studies, with a potentially meaningless theoretical outcome. This is a tedious task all too familiar from the use of IRC-unsafe jet clustering algorithms (e.g., iterative cone algorithms) by the CDF and D0 experiments (Kilgore and Giele, 1997), which was later only partially addressed with the midpoint algorithm during Tevatron Run II, until fully IRC-safe algorithms (Catani et al., 1993;Cacciari et al., 2008) were established as the only theoretically meaningful community consensus.
It seems prudent to avoid mistakes of the past: in this paper, we devise an IRC-safe Graph Neural Network (GNN) autoencoder algorithm, employing an Energy-Weighted Message-Passing Network (EMPN) (Konar et al., 2022) for unsupervised anomaly detection. While the IRC-safe loss function is the primary observable of our autoencoder, we also study the latent space (graph) representation structure as a motivating tool for new physics discrimination (Atkinson et al., 2021b;Dillon et al., 2021) and highlight the relations to known and more "traditional" IRCsafe observables. This paper is organized as follows: in Section 2, we outline our EMPN approach and detail our IRC-safe graph construction before we introduce the IRC-safe autoencoder architecture and simulation framework in Section 3. Section 4 is devoted to the discussion of the sensitivity performance of the autoencoder; we also highlight the correlation of sensitivity with more traditional jet-based observables. We conclude in Section 5.

A BRIEF OUTLINE OF ENERGY-WEIGHTED MESSAGE PASSING ALGORITHM
This section presents a brief overview of the IRC safe Energyweighted Message passing algorithm (Konar et al., 2022). It generalizes Energy Flow Networks (Komiske et al., 2019;Dolan and Ore, 2021), an IRC safe feature extraction on point clouds, by learning relational information between two elements (nodes) by constructing a graph out of the point cloud. This procedure is similar to message-passing networks like the Dynamic Graph Convolutional Neural Network (DGCNN) (Wang et al., 2019) that extract local features beyond the global feature extraction via point-cloud-based architectures such as deep-sets (Zaheer et al., 2017) and PointNet (Charles et al., 2017;Qi et al., 2017). The algorithm consists of two necessary ingredients: • An IRC safe prescription for constructing graphs which guarantees that the graph is invariant under soft and collinear splittings; • An energy-weighted summed aggregation of messages (and node features after the final message-passing layer) taking the directional (unit vectors or angles) inputsp i andp j of the nodes connected by an edge (j, i) at the initial layer.
In the following, we discuss these two elements separately.

IRC Safe Graph Construction
The inductive biases that a message-passing algorithm imposes on its input data are highly dependent on the graph structure. For instance, the neighborhood sets (the set of particles a node is connected with) determine the local connectivity of the nodes. Thus, the graph construction algorithm from a point cloud gives a strong indication that a graph neural network is the best avenue to pursue. The IRC safety of a message-passing algorithm also depends on the graph construction, and we highlight such an IRC safe graph construction algorithm in this section. Let S = {p 1 , p 2 , p 3 , ....., p N } be the set of four-vectors of the particles within a jet, while S ′ = {p 1 , p 2 , ...., p N+1 } is the same set in the presence of an additional splitting. The collinear limit is when the emitted particles r and s with the angular separation rs tending to zero, while the soft limit refers to the case when one of the particle's energy tends to zero. These four-vectors can be written as for hadron colliders, with the separation in the η − φ plane between two particles i and j, defined as R ij = η 2 ij + φ 2 ij denoting the quantity analogous to ij . Since we will be taking directed edges, the neighborhood set of a node i will be the set of all nodes with incoming connections to i. For all particles i in S or S ′ , a graph construction algorithm will construct neighborhood sets N [i] and N ′ [i], respectively. We will use a "closed" neighborhood with i ∈ N [i] instead of an "open" neighborhood i / ∈ N (i), since the second choice will always be IRC unsafe when the node i splits. To illustrate this, we show the radius graph with R 0 in the (η, φ) plane in Figure 1, where the node q undergoes a splitting. The black arrows highlight the connections of the radius graph. Figure 1 also demonstrates a nearest neighborhood connection as an example of an IRC unsafe graph construction.
To formalize the graph construction algorithm in terms of the four-vectors of the particles, we define a decision function D(p i , p j ) and a threshold function T(p i , p j ), such that any particle j with four-vector p j will be assigned to the neighborhood of particle i with four-vector p i if D(p i , p j ) is less-than or equal-to T(p i , p j ). This can be summarized as (2) FIGURE 1 | Representation of radius graph with R 0 in the (η, φ) plane undergoing a QCD splitting. The black arrows correspond to the connections of a radius graph, while the red arrows highlight the 3-nearest neighbors connections. One can see that the radius neighborhoods have the same total energy, which is not the case for those obtained by the nearest neighbors method, leading to an IRC-unsafe construction.
Since we are interested in the soft and collinear limits, constructing an IRC safe graph requires putting conditions on these functions in the respective kinematical configurations. The required condition on these functions for a "parent" splitting q → r + s when the "daughters" r, s become collinear is where the second condition arises since the nodes q, r or s can also be the node whose neighborhood is being determined. The only requirement in the IR limit for a daughter particle is that all the particles in the set N [i] are also present in N ′ [i], with the only potential addition of a soft particle. This is guaranteed by the form of Equation (2), since both functions depend only on the four-vector of the two nodes of interest 1 . The conditions (c.f. Equation 3) are satisfied in the collinear limit rs → 0 if employing the definitions Equation (1). Therefore, graphs formed by connecting particles within a constant radius R 0 in the η − φ plane are IRC safe when the decision and threshold functions take the form Note that these choices of D, T yield closed neighborhoods without additional requirements. We will use these graphs in the remainder of this paper; the neighborhood of a particle of such a radius graph is shown in Figure 1.

Energy-Weighted Message Passing
We detail the IRC safe message passing operation in this section. Before doing so, we summarize the general definition of message passing operation in the following steps. The first step, the message-passing stage, involves calculating the messages for all edges present in the graph. The message function, parameterized as a multilayer perceptron shared for all edges, takes the node features of the two nodes connected by an edge and evaluates the message. Since the message function does not need to be symmetric for the two node features, a direction convention is necessary for the second phase. In our convention, the message originates from all nodes in the neighborhood N [i] and flows toward the particle i. The second step, the node-readout stage, updates the node features of each node in the graph as a permutation-invariant function of all incoming messages. IRC safety of the updated node features after a messagepassing operation is crucially dependent on the nature of the node readout. A readout based on the maximum or minimum value of the node features depends on a single node feature in the neighborhood, and a soft or collinear splitting of this particular node would render the updated node feature IRCunsafe. This is ultimately related to identifying a specific node in the neighborhood as special 2 , which impedes KLN cancellations. A mean readout, on the other hand, explicitly depends on the cardinality of the neighborhood sets N [i] which is not a welldefined QCD quantity either since there can be an arbitrary but finite amount of resolvable emissions in the enhanced collinear or soft regions of phase space. Thus we use a summed readout, which will inclusively take all the particles in the neighborhood into account and will not explicitly depend on their size.
An IRC safe graph construction algorithm ensures two things: the equality of the sum of energy (transverse energy in the case of hadron colliders) of all particles in either neighborhood sets and the presence of both collinear daughters in N ′ [i] if the parent is present in N [i]. Defining a scope-dependent energy weight-factors analogous to z i as with K denoting the set of particles in the particular readout operation, any message passing of the form with h (0) i =p i and h (l) i denoting the updated node-features after l message-passing operations satisfies IRC safety; in the infrared limit, it is straightforward to see that any soft particle with z r → 0 ⇒ ω (N [i]) r → 0 for any node i. The splitting q → r + s for IRC-safe graphs therefore yields (7) In the collinear limit withp q =p r =p s we haveˆ (0) Combining this with Equation (7), we obtain (for l = 0) When evaluating Equation (6) for the neighborhood of a node i, the terms on the RHS and LHS of this expression are the only ones which will not be common between N [i] and N ′ [i], due to the IRC safe graph construction. The same expression is followed when i = q on the left, and i = r or i = s on the right, since {r, s} ⊂ N ′ [s] and {r, s} ⊂ N ′ [r], with all three neighborhoods (including N [q]) containing the same particles except for q, r, and s. Therefore, from Equation (6) r not necessarily zero. The presence of the node features of the daughter particles, even in the soft or collinear limit, impedes an IRC safe examination of the full jet graph unless observables are specifically designed to be insensitive to their presence in the IRC limit. The procedures to take care of these additional nodes are explained in the following sections, which are different for supervised and unsupervised methods. Since the above derivation used the collinearity of q, r, and s, for IRC safe neighborhoods, for the same neighborhoods and any successive application of an energy-weighted message passing of the form Equation (6), we have h (l) q = h (l) r = h (l) s for any l.

IRC-SAFE GRAPH AUTOENCODER
In a supervised machine learning scenario, the IRC-safe graph readout acting on the node features of the final message-passing operation gives an IRC-safe graph representation, and one loses the graph's structure. The graph representation, a fixed-length vector obtained after applying a permutation invariant function on the node features for any variable-length graph, feeds into the downstream network. Therefore, training a classifier on the loss function defined with the downstream network's output proceeds without any complications from the presence of additional soft or collinear nodes. On the other hand, a graph autoencoder similar to the one proposed in Atkinson et al. (2021a) preserves the graph structure until the output. Therefore, the autoencoder's output graph will have additional nodes in the soft and collinear limits in the case of extra emissions. Since the observable of interest for anomaly detection with an autoencoder is the loss function, we need to ensure its IRC safety. In this section, we first devise an IRC safe loss function and give details of the network architecture and training.

An IRC-Safe Loss Function
The definition of the loss function involves input which changes with a soft or collinear splitting. Therefore, the loss which is normally used as an observable in anomaly detection, needs to be IRC-safe. A simple IRC-safe loss function for a jet with constituent set G is of the form The barred quantities are the output of the network, while the unbarred quantities are the inputs to the network. The function d(p i ,p i ) ≥ d 0 denotes a well-behaved metric (one-to-one) between the input and the output space, with d(p i ,p i ) = d 0 . We now show that this is indeed an IRC safe choice: Any soft particle s, will not contribute to the sum since z s → 0, and hence it is IR safe. For the splitting q → r + s we have L S = ... + z q d(p q ,p q ) + ... L S ′ = ... + z r d(p r ,p r ) + z s d(p s ,p s ) + ... . Since, by construction, a GNN's node output after L total message-passing operations h (L) i =p i , is a function of the input four-vectors {p 1 , p 2 , p 3 , ....p N }, in general, they can have a very complicated dependence on all the input node features. However, due to the IRC safety of the EMPN, we have In the collinear limit withp q =p r =p s ⇒p q =p r =p s , we therefore have (since z q = z r + z s ), z q d(p q ,p q ) = z r d(p r ,p r ) + z s d(p s ,p s ) ⇒ L S ′ − L S = 0 , (10) i.e., collinear safety. In the following analysis of the EMPN autoencoder we will use mean-squared error between the input and output node features for d(p i ,p i ).

Jet Graph Definition
To demonstrate the performance of the described algorithm, we use the publicly available top-tagging dataset of Butter et al. (2018) and Kasieczka et al. (2019). The dataset contains a training, validation and testing set of 600k, 200k, and 200k QCD jets, respectively. The training and validation are done only with the background QCD samples since the total cross-section of their production would be orders of magnitude higher than most probable signals. Although the dataset has the same number of top jets for each of the three analysis stages, we use the 200k top jets of the test dataset as a benchmark signal scenario. These jets are simulated using Pythia8 (Sjostrand et al., 2008;Sjöstrand et al., 2015) and passed through Delphes3 (de Favereau et al., 2014) for the detector simulations using the default ATLAS parameter card. Jets are clustered from particle flow (Eflow) constituents with a distance parameter R = 0.8 using the antik t algorithm (Cacciari et al., 2008). The transverse momentum of the jets is in the range p T ∈ [550, 650] GeV.
Using the constituents of these jets, we construct the radius graphs which serve as the input to the IRC safe graph network. To construct the jet radius graph, we first calculate the interparticle distance R ij in the (η, φ) plane. Next, we define a set of all the particles i as the neighborhood set N [i] such that R ij ≤ R 0 , where R 0 is an external tunable parameter. Each node is associated with three node features where η i , φ i , R i are calculated with respect to the jet axis. For the network analysis, we choose R 0 = 0.3. Since the dependence of the classification power on R 0 for the supervised case was found to be mild (Konar et al., 2022), with the AUC values changing in the third decimal value for different values of R 0 between 0.1 and 0.5, we restrict ourselves to a single value in the intermediate range.
The final node vectors contain information about the L-hop neighborhood with an effective radius of R 0 × L. On the other hand, the primary region of activity for the one-prong QCD jets used to train the network lies in a relatively small central region of the total jet of radius R = 0.8. Therefore, the features learnt by the autoencoder would be weakly dependent on R 0 , once the effective radius covers a significant portion of the central region.

Network Architecture and Training
Now that we have described the construction of the jet graphs, we discuss the details of the network architecture and training in this section. Follow from Figure 2 where we sketch a schematic diagram of an IRC safe graph-autoencoder. The encoder consists of three edge convolution operations with output dimensions of 128, 64 and 2, which is the dimension of the latent representation. Since we take three-dimensional node features, we restrict ourselves to a 2-dimensional latent space (g 1 , g 2 ) to induce an information bottleneck 3 . The decoder also has three edge convolution operations, with the first two dimensions mirroring the encoder network dimensions (excluding the latent dimension). Finally, the last edge convolution operation maps the 128-dimensional node vectors at the penultimate message passing the layer to a three-dimensional space to reconstruct the input node features.
We takeˆ (l) at each message-passing layer to be a multilayer perceptron (MLP). For an edge convolution operation, we have for two node features h (l) i and h (l) i connected by an edge in Equation (6), Therefore the input vector to the MLP has twice the nodefeature's dimensions, since the direct sum h (l) , is a concatenation of the two vector quantities of equal dimensions. The dimension of the MLP's output is the same as the output dimension of the message passing operations and has a linear activation. We fix the MLP to have two hidden layers with ReLU activation and the same number of nodes as the output dimension. The network is implemented using the Pytorch-Geometric (Fey and Lenssen, 2019) package. Note that we have not performed any hyperparameter scan as part of this present, proof-ofconcept study. We train the network for fifty epochs with a learning rate of 0.001 using the Adam (Kingma and Ba, 2014) optimiser. The training and validation losses are compared after each epoch to ensure that there is no overfitting or a premature termination of training. The epoch with minimum validation loss is used to infer the anomaly detection on the test dataset.

ANOMALY DETECTION PERFORMANCE AND RESULTS
We now discuss the performance of the designed IRC safe loss function in detecting anomalous jets when the network is trained only on the QCD background. We choose boosted top jets from the aforementioned public dataset as our benchmark. In Figure 3 (left), we show the distribution of the loss function for the QCD and top jets (our inputs are the node features given in Equation 11). As can be seen, the distributions of the loss function values for the QCD and top jets are significantly different, highlighting the capability of the architecture to detect anomalous jets in an IRC-safe way. The Receiver-Operator-Characteristic (ROC) curve and the Area Under the Curve (AUC) of 0.902 shown in Figure 3 (right) confirm the good separation shown in the loss distribution, rivaling convolutional autoencoders (Heimel et al., 2019;Roy and Vijay, 2019;Farina et al., 2020;Finke et al., 2021) which also have AUCs close to such values [up to 0.93 (Heimel et al., 2019) and 0.91 (Finke et al., 2021)] on the same dataset. Although we did not perform a hyperparameter scan for this study, we observed a decrease in performance for a one-dimensional latent space.
Top jets possess a different and hard kinematical structure that is typically not present in QCD jets. The ability to look  into the soft and collinear splittings from the QCD shower evolution in an IRC safe way enables the network to access such information and the hard radiation pattern in a theoretically meaningful way. Modifications of the soft and collinear radiation patterns that are seeded by novel hard scales (see, e.g., Englert et al., 2011;Spannowsky, 2011, 2014;Gerwick et al., 2012a,b;Prestel and Spannowsky, 2019 for a more traditional jet-based approach to this) are therefore consistently included in the anomaly detection performance. Therefore, when such non-QCD structures are present, the anomaly detection performance is considerably improved.
In light of these results, it is worthwhile to compare our autoencoder results to phenomenological aspects of QCD in jet substructure analyses. From the point of view of soft and collinear features, Energy Correlation Functions (ECF) (Larkoski et al., 2013) are particularly relevant for such a comparison as we will motivate below. Furthermore, given that our autoencoder condenses the QCD information into the latent space in an IRC-safe way, it is interesting to see how it correlates with ECF observables. To this end, we define where h a are the latent node features. Similar to the graph readout in a classification scenario (Konar et al., 2022), this is an IRC safe representation of the jet. The distribution of the individual components of the two-dimensional graph representation are shown in Figure 4. The good performance of the autoencoder is reflected in the good separation in the latent space. The two latent space directions are, however, completely anti-correlated; see Figure 5 (they are also highly correlated with the loss). Thus, restricting ourselves to any of these three variables would be sufficient for the anomaly detection problem studied in this work.
FIGURE 4 | The distribution of each dimension of the two-dimensional latent spaces obtained after an IRC safe graph readout given in Equation (12). The loss would most likely be a better choice when one focuses on anomaly detection capabilities since it condenses the information of the two-dimensional latent space into a single quantity. On the other hand, any latent feature would be more suitable for applications demanding lower execution times, since in this case only the encoder needs to be evaluated during inference.
Moving on to the relation of the learned information with ECFs, we first define the ECFs as (13) Focussing specifically on the case N = 2, we obtain where I J is the index set of the constituent set J, and z i = p i T /( k∈I J p k T ). We can rewrite the expression as Therefore, the quantity H i can be regarded as a scalar node feature obtained from the message function R β ij = R β ij (p i ,p j ), with a weighted (by z j ) summed readout, while the sum over i to get the ECF is similar to a graph readout operation on all the nodes (or constituents) of the jet. Although the graph structure in the current case is the 2-combinatorial graph, such an analogy suggests that the features extracted by the EMPN are closely connected to ECFs.
This expectation is analyzed in more detail in Figure 5, where we show the correlation of different order ECFs with each dimension g i of the latent graph readout and the loss function. There is a strong correlation between the 2-point ECFs and the network outputs, which decreases when considering the 4-point ECFs. This difference illustrates the close relation of the message passing architecture to the 2-point ECFs. The latent dimensions show a higher correlation for β = 1 than β = 2, while the opposite holds for the loss function. This may be due to the ReLU activation, which is essentially a linear function for all positive arguments, while the loss function's higher correlation to the quadratic ECFs may be due to the usage of the mean-squared error as d(p i ,p i ).

CONCLUSIONS
Infrared and collinear safety is not a luxury but an essential requirement to guarantee the theoretical consistency of particle physics collider data interpretations. The emerging and fast-developing area of anomaly detection should therefore incorporate IRC safety when analyzing data at the LHC where QCD activity plays a dominant role. New heavy physics significantly deviates from QCD phenomenology, predominantly characterized by soft and collinear emissions. Reflecting the QCD expectation adequately helps isolate anomalies further; the ability to meaningfully interpolate into the soft and collinear regime is crucial for extending the reach of such techniques to lower scales. Despite this, IRC safety has not played an essential role in the implementation of anomaly detection. In this paper, we have placed IRC safety at the heart of anomaly detection for the first time by constructing a graph neural network autoencoder that employs Energy-Weighted Message-Passing, which gives rise to an IRC-safe architecture (Konar et al., 2022).
Graph neural networks are well-adapted approaches for isolating tell-tale correlations of final states (Atkinson et al., 2021a;Dreyer and Qu, 2021) and we find that our algorithm shows a high anomaly detection capability whilst having theoretically appealing properties. We have demonstrated this by injecting top jets as an anomaly and finding excellent discriminating sensitivity. While this partly results from the direct presence of a novel hard scale in the jet's substructure, additional sensitivity is accessed from a different soft and collinear shower pattern that accompanies the hard scale. To highlight this relation to well-studied observables in QCD phenomenology, we have shown a strong relation of the information encoded in our autoencoder's latent space with energy correlation functions. This motivates extending anomaly detection analyses using our framework to new physics scenarios of lighter BSM degrees of freedom, which we leave for future work.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://zenodo.org/record/2603256.

AUTHOR CONTRIBUTIONS
OA, AB, and VN implemented the code and dataset used in the analysis. AB, CE, and VN wrote the first draft of the manuscript. All authors contributed to the conception and design of building the IRC safe graph autoencoder and contributed to manuscript revision, read, and approved the submitted version.