Improving Variational Autoencoders for New Physics Detection at the LHC With Normalizing Flows

We investigate how to improve new physics detection strategies exploiting variational autoencoders and normalizing flows for anomaly detection at the Large Hadron Collider. As a working example, we consider the DarkMachines challenge dataset. We show how different design choices (e.g., event representations, anomaly score definitions, network architectures) affect the result on specific benchmark new physics models. Once a baseline is established, we discuss how to improve the anomaly detection accuracy by exploiting normalizing flow layers in the latent space of the variational autoencoder.


INTRODUCTION
Most searches for new physics at the CERN Large Hadron Collider (LHC) target specific experimental signatures. The underlying assumption of a specific new physics model could enter at various stages in the search design, e.g., when reducing the data rate from 40 million to 1,000 collision events per second in real time (Aad et al., 2020; Sirunyan et al., 2020; Trocino, 2014), when designing the event selection, or when running the final hypothesis testing. When searching for pre-established and theoretically well-motivated particles (e.g., the Higgs boson), this strategy is extremely successful because the underlying assumption can be exploited to maximize the search sensitivity. On the other hand, the lack of a predefined target might turn this strength into a limitation.
To compensate for this potential problem, model-independent searches are also carried out at hadron colliders (Aaboud et al., 2019; Aaltonen et al., 2009; Aaron et al., 2009; CMS-PAS-EXO-14-016, 2017; D0 Collaboration, 2012). These searches consist of an extensive set of comparisons between the data distribution and the expectation derived from Monte Carlo simulation. Many comparisons are carried out in parallel for multiple physics-motivated features while applying different event selections. However, when searching for new physics among many channels, the "global" significance of observing a particular discrepancy must take into account the probability of observing such a discrepancy anywhere. This so-called look-elsewhere effect can be quantified in terms of a trial factor (Gross and Vitells, 2010). While the large trial factor typically reduces the statistical power of this strategy in terms of significance, model-independent searches are valuable tools to identify possible regions of interest and provide data-driven motivations for traditional, more targeted searches to be performed on future data.
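To give a feeling for the size of the look-elsewhere effect, the following minimal Python sketch computes the global p-value and the trial factor under the simplifying assumption of N statistically independent comparisons. This is an illustration only: the function names are hypothetical, and the treatment in Gross and Vitells (2010) addresses the more general case of correlated tests.

```python
def global_p_value(local_p: float, n_trials: int) -> float:
    """Probability that at least one of n independent tests fluctuates
    at least as much as the observed local p-value."""
    return 1.0 - (1.0 - local_p) ** n_trials


def trial_factor(local_p: float, n_trials: int) -> float:
    """Ratio of the global to the local p-value."""
    return global_p_value(local_p, n_trials) / local_p


if __name__ == "__main__":
    local_p = 1e-3
    for n in (1, 10, 100):
        print(n, global_p_value(local_p, n), trial_factor(local_p, n))
```

Under this assumption the trial factor grows almost linearly with the number of comparisons as long as the local p-value is small, which is why scanning many channels dilutes the significance of any single discrepancy.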
As part of our contribution to the DarkMachines challenge, we investigated the use of a particle-based variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) and the possibility of enhancing its anomaly detection capability by using normalizing flows (NFs) (Papamakarios et al., 2021) in the latent space to optimize the choice of the latent-space prior. In this paper, we document those studies and expand that effort, investigating the impact of specific architecture choices (event representation, network architecture, usage of expert features, and definition of the anomaly score). This study is an update of our contribution to the DarkMachines challenge (Aarrestad et al., 2021), and it benefits from the lessons learned during that challenge. Taking inspiration from solutions presented by other groups in the challenge (e.g., Caron et al., 2021; Ostdiek, 2021), we evaluate the impact of some of their findings on our specific setup. In some cases (but not always), these solutions translate into an improved performance, quantified using the same metrics presented in Aarrestad et al. (2021). In this way, we establish an improved baseline model, on top of which we evaluate the impact of the normalizing flow layers in the latent space.

DATA SAMPLES AND EVENT REPRESENTATION
This study is based on the datasets released on the Zenodo platform (DarkMachines Community, 2020) in relation to the Dark Machines Anomaly Score Challenge (Aarrestad et al., 2021). They consist of a set of processes predicted in the standard model (SM) of particle physics, mixed according to their production cross section in proton-proton collisions at 13 TeV center-of-mass energy, and a set of benchmark signal samples. The datasets contain labels identifying the process that generated each event. Labels are ignored during training and used only to evaluate performance metrics.
For each sample, four datasets are provided, corresponding to four different event selections (called channels; Aarrestad et al., 2021). The selections are defined in terms of $p_T$, the magnitude of a particle's transverse momentum; $H_T$, the scalar sum of the jet $p_T$ in the event; and $\vec{p}_T^{\,\mathrm{miss}}$, the vector equal and opposite to the vector sum of the transverse momenta of the reconstructed particles in the event, whose magnitude is $p_T^{\mathrm{miss}}$. More details are provided in Aarrestad et al. (2021). The input consists of the momenta of all the reconstructed physics objects in the event (jets, b jets, electrons e, muons µ, and photons), ordered by decreasing $p_T$. Each list of objects is zero-padded to force each event into a fixed-length matrix with the same order: up to 15 jets, and up to 4 each of b jets, µ±, e±, and photons. We pre-process the input by applying the scikit-learn (Pedregosa et al., 2011) standard scaling and arranging the list of objects into a matrix of 39 particles times four momentum features ($E$, $p_T$, $\eta$, $\phi$), where $E$ is the particle energy. For e, µ, and photons, the energy is computed assuming zero mass. For jets, the measured jet mass is used. The input matrix is interpreted as an image or an unordered point cloud, depending on the underlying VAE architecture.
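The zero-padded event matrix can be illustrated with a short Python sketch. The slot layout below is an assumption chosen so that the per-type maxima add up to the 39 rows quoted in the text (positively and negatively charged leptons are given separate blocks); the names `MAX_OBJECTS`, `massless_energy`, and `event_to_matrix` are hypothetical and do not come from the released code. The standard scaling step is omitted for brevity.

```python
import math

# Assumed slot layout: 15 + 6 * 4 = 39 rows in total.
MAX_OBJECTS = {"jet": 15, "bjet": 4, "mu+": 4, "mu-": 4,
               "e+": 4, "e-": 4, "photon": 4}
N_FEATURES = 4  # (E, pT, eta, phi)


def massless_energy(pt: float, eta: float) -> float:
    """Energy of a massless particle: E = |p| = pT * cosh(eta)."""
    return pt * math.cosh(eta)


def event_to_matrix(event: dict) -> list:
    """Build a fixed-size matrix from a dict mapping each object type
    to a list of (E, pT, eta, phi) tuples."""
    rows = []
    for obj_type, max_n in MAX_OBJECTS.items():
        # Order by decreasing pT (feature index 1) and drop any overflow.
        objs = sorted(event.get(obj_type, []),
                      key=lambda o: o[1], reverse=True)[:max_n]
        rows.extend(list(o) for o in objs)
        # Zero-pad the unused slots of this object type.
        rows.extend([0.0] * N_FEATURES for _ in range(max_n - len(objs)))
    return rows


event = {
    "jet": [(120.0, 100.0, 0.5, 1.0), (260.0, 250.0, -0.2, 2.0)],
    "mu-": [(massless_energy(40.0, 1.1), 40.0, 1.1, -0.3)],
}
matrix = event_to_matrix(event)
print(len(matrix), len(matrix[0]))
```

Keeping a fixed block of rows per object type means that the row position implicitly encodes the particle type, which is what allows the same matrix to be read either as an image or as a point cloud.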
The training and validation dataset consists of background events from the SM mixture. The available dataset size is detailed in Table 1 for each of the channels. The background test samples are combined with the benchmark signal samples listed in Table 2 to form the labeled test dataset on which performance is evaluated.

TRAINING SETUP AND EVALUATION METRICS
Variational autoencoders (Kingma and Welling, 2014; Rezende et al., 2014; Kingma and Welling, 2019) are a class of likelihood-based generative models that maximize the likelihood of the training data according to the generative model, $\prod_{x \in \mathrm{data}} p_\theta(x)$, for the set of observed variables $x$ in the training data. To achieve this in a tractable way, the generative model is augmented by the introduction of a set of latent variables $z$, such that the marginal distribution over the observed variables, $p_\theta(x)$, is given by $p_\theta(x) = \int p_\theta(x|z)\, q_\theta(z)\, dz$. In this way, $q_\theta(z)$ can be a relatively simple distribution, such as a Gaussian, while maintaining high expressivity for the marginal distribution $p_\theta(x)$ as an infinite mixture of simple distributions controlled by $z$. Besides being used as generative models, VAEs have been shown to be effective as anomaly detection algorithms (An and Cho, 2015).
In this work, the VAE models are trained on the training and validation datasets by minimizing the loss function of Eq. (1), a β-weighted combination of two terms. The first, $L_C$, is a reconstruction loss, chosen to be the $L_1$-type permutation-invariant Chamfer loss (Barrow et al., 1977) of Eq. (2), similar to the $L_2$-type Chamfer distance used in Fan et al. (2017) and Zhang et al. (2020). The second, $D_{KL}$, is the Kullback-Leibler divergence term usually employed to force the data distribution in the latent space to a multidimensional Gaussian with unitary covariance matrix (Rezende and Mohamed, 2015). The parameter β controls the relative importance of the two terms (Higgins et al., 2017).
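The structure of this loss can be sketched in a few lines of plain Python. Two assumptions are made here and should be checked against the released code: the Chamfer term is taken to be the symmetric sum of nearest-neighbor $L_1$ distances (without any normalization by the number of particles), and the two terms are combined as $(1-\beta) L_C + \beta D_{KL}$, a form chosen because it reproduces the β cases discussed in the anomaly score study (β = 1 removes the reconstruction term, β = 0.5 weights the two terms equally). The function names are hypothetical.

```python
def l1_distance(a, b):
    """L1 distance between two particles given as feature tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chamfer_l1(inputs, outputs):
    """Symmetric L1 Chamfer distance between two sets of particles.

    Each input particle is matched to its nearest output particle and
    vice versa, so the result does not depend on the particle ordering.
    """
    forward = sum(min(l1_distance(x, y) for y in outputs) for x in inputs)
    backward = sum(min(l1_distance(x, y) for x in inputs) for y in outputs)
    return forward + backward


def total_loss(l_c: float, d_kl: float, beta: float) -> float:
    """Assumed convex combination of reconstruction and KL terms."""
    return (1.0 - beta) * l_c + beta * d_kl


a = [(1.0, 0.0), (0.0, 1.0)]
b = [(0.0, 1.0), (1.0, 0.5)]
print(chamfer_l1(a, b), total_loss(2.0, 3.0, 0.5))
```

Because each sum runs over nearest neighbors rather than over matched indices, shuffling the particles in either set leaves the value unchanged, which is the permutation invariance referred to in the text.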
All of our models are optimized using the Adam minimizer (Kingma and Ba, 2015). A learning rate of $10^{-4}$ is applied along with a brute-force early stopping strategy used on an ad hoc basis. A batch size of 32 is chosen to train the models. All models are implemented with the PyTorch (Paszke et al., 2019) deep learning framework and are hosted on GitHub (Jawahar and Pierini, 2021).
We train and test all our models on the WPI Turing Research Cluster, using 8 CPU nodes and 1 GPU node (NVIDIA Tesla V100 or Tesla P100).
At inference time, $L_C$ is used as an anomaly detection score, to quantify the distance between the input and the output. By applying a lower-bound threshold on $L_C$, we identify every event with an $L_C$ value larger than the threshold as an anomaly. By comparing this prediction to the ground truth, we can assess the performance of the VAE on specific signal benchmark models.
To evaluate model performance, we follow the same strategy and code used in Aarrestad et al. (2021) to enable comparison with other models tested on this dataset. As explained in Aarrestad et al. (2021), we extract four main performance parameters from the receiver operating characteristic (ROC) curves based on the chosen anomaly metric for each model, namely the area under the curve (AUC) and the true positive rate (also known as the signal efficiency $\epsilon_S$) at three different, fixed values of the false positive rate (also known as the background efficiency $\epsilon_B$). We then combine these scores from all models on all available signal regions across all channels of the dataset to form box-and-whisker plots, using six different combination and comparison strategies: the highest mean score method, highest median score method, average rank method, top scorer method, top-5 scorer method, and highest minimum scorer method. A box is drawn spanning the inner half (50% quantile centered at the median) of the data, as shown in Fig. 1. A line through the box marks the median. Whiskers extend from the box to the maximum and minimum, unless these are further away from the edge of the box than 1.5 box lengths. The outlier points are shown as circles.
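The four figures of merit can be illustrated with a small, self-contained Python sketch. This is not the evaluation code of Aarrestad et al. (2021), which should be used for any comparison with published results; the function names and the toy scores are hypothetical. Events are flagged as anomalous when their score is at or above the threshold.

```python
def roc_points(bkg_scores, sig_scores):
    """Return (background efficiency, signal efficiency) pairs obtained by
    lowering the anomaly-score threshold through every observed value."""
    thresholds = sorted(set(bkg_scores) | set(sig_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        eps_b = sum(s >= t for s in bkg_scores) / len(bkg_scores)
        eps_s = sum(s >= t for s in sig_scores) / len(sig_scores)
        points.append((eps_b, eps_s))
    return points


def auc(points):
    """Area under the ROC curve from the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def signal_eff_at(points, target_eps_b):
    """Largest signal efficiency reachable without exceeding target_eps_b."""
    return max(eps_s for eps_b, eps_s in points if eps_b <= target_eps_b)


bkg = [0.1, 0.2, 0.3, 0.4]   # toy anomaly scores, background
sig = [0.35, 0.8]            # toy anomaly scores, signal
points = roc_points(bkg, sig)
print(auc(points), signal_eff_at(points, 0.0))
```

With only a handful of signal events the achievable efficiencies are quantized in steps of one over the sample size, which is the origin of the step-like ROC curves seen for the smallest benchmark samples.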
For Fig. 1 and the other figures, the representative ranking denoted by the legend corresponds to the performance based on the highest mean score method, unless mentioned otherwise. However, to choose the best model for each experiment described in this paper, we consider all six comparison methods to arrive at a consensus. The code to perform these comparisons and to generate the corresponding plots is available in Aarrestad et al. (2021).

BASELINE VAE MODEL
The main goal of this study is to evaluate the impact of normalizing flow layers in the latent space on the anomaly detection capability of a reference VAE model. This and the following sections describe how this reference model is built, starting from the VAE based on convolutional layers (Conv-VAE) presented in Aarrestad et al. (2021) and modifying its architecture based on some of the lessons learned during the DarkMachines challenge.
The encoder of the initial Conv-VAE consists of three convolutional layers, with 32, 16, and 8 kernels of size (3, 4), (5, 1), and (7, 1), respectively. For all layers, the stride is set to 1 and zero padding to "same". The output of the convolutional layers is flattened and passed to two fully connected neural network (FCN) layers that output the mean and variance for the latent space. The cardinality of the latent space is fixed to 15. The decoder mirrors the encoder architecture, returning an output of the same size as the input.
In order to define the reference model, the architecture of the starting model is modified in different ways, each time evaluating the impact of a given choice on the test dataset. Several possibilities are considered: how to embed the event in the two-dimensional (2D) array (see section 4.1); how to interpret the array, e.g., as an image or a graph (see section 4.2); whether to extend the event representation beyond the particle momenta, adding domain-specific high-level features as an additional input (see section 4.3); and which anomaly score to use (see section 4.4). We study various options for each of these points, following this order. Doing so, we establish a candidate model, which replaces the initial model. On this new model, we evaluate the benefit of using normalizing flow layers in the latent space (see section 5) to improve the anomaly detection accuracy.

Data representation
By their nature, events consist of a variable number of objects. To some extent, this conflicts with most neural network architectures, which assume a fixed-size input. As a baseline, we adopt the simplest solution, i.e., to zero-pad all events to standardized event sizes for all available samples. To get a better idea of how padding affects results, we study performance across alternative input encodings. We consider two main types of encodings, listed as AllObj and TrdObj in Fig. 1. The former considers the entire event, which implies allowing for a large enough padding that every object in every event across the entire dataset is taken into account. The latter cuts down the padding and the input sequence by considering only up to four leading jets and three objects each of the other types per event.
When using the truncated sequence, the model loses information regarding the number of objects of each type per event, which is implicitly learned when the whole sequence is considered. To compensate for this loss, one can explicitly add this information by passing a second input to the model, consisting of a vector containing the multiplicities of each object type. This input is concatenated to the flattened output of the convolutional layers in the encoder before being passed to the fully connected layers. For the sake of comparison, we also do the same for the AllObj case (labeled as "+Mult" in Fig. 1).
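The truncated encoding and the multiplicity input can be sketched as follows. The list of object types (with separate entries for the two lepton charges) and all function names are assumptions made for this illustration; only the truncation limits (four jets, three objects of every other type) come from the text.

```python
OBJECT_TYPES = ["jet", "bjet", "mu+", "mu-", "e+", "e-", "photon"]
# TrdObj limits: four leading jets, three leading objects of the other types.
TRD_MAX = {t: (4 if t == "jet" else 3) for t in OBJECT_TYPES}


def multiplicity_vector(event: dict) -> list:
    """Number of reconstructed objects of each type, before truncation."""
    return [len(event.get(t, [])) for t in OBJECT_TYPES]


def truncate_event(event: dict) -> dict:
    """Keep only the leading objects of each type.

    Objects are (E, pT, eta, phi) tuples, so pT is feature index 1.
    """
    return {
        t: sorted(event.get(t, []), key=lambda o: o[1], reverse=True)[:TRD_MAX[t]]
        for t in OBJECT_TYPES
    }


def encoder_fc_input(flattened_conv_output, event: dict) -> list:
    """Concatenate the flattened convolutional output with the multiplicities
    (the "+Mult" variants) before the fully connected layers."""
    return list(flattened_conv_output) + [float(n) for n in multiplicity_vector(event)]


event = {"jet": [(pt + 5.0, pt, 0.0, 0.0) for pt in (10.0, 20.0, 30.0, 40.0, 50.0, 60.0)]}
truncated = truncate_event(event)
print([jet[1] for jet in truncated["jet"]], multiplicity_vector(event))
```

The multiplicities are computed on the untruncated event, so the information discarded by the truncation is restored as a compact seven-component vector.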
The results in Fig. 1 show that the truncated sequence does worse than the full sequence. We also see little improvement in performance with the addition of multiplicity information per event, both in the AUC and in the performance at lower background efficiencies. As a result, we keep the input encoding that considers the complete sequence per event.

VAE architecture
The convolutional architecture used for the baseline VAE is not the only option to handle the input considered in this study. The ensemble of reconstructed particles in an event can be represented as a point cloud. Doing so, we can process it with a graph neural network. The main advantage of this choice lies in the permutation invariance of the graph processing, which matches that of the loss in Eq. (2) and complies with the unordered nature of the input list of particles. Graph-based architectures have already been shown to perform better with sparse, non-Euclidean data representations in general (Bronstein et al., 2017; Zhou et al., 2020) and in particle physics in particular (Shlomi et al., 2020; Duarte and Vlimant, 2020).

Each event is represented as a graph, which is passed to the graph convolutional network (GCN) layers of the encoder, defined as (Kipf and Welling, 2017):
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)}\right),$$
where $H^{(l)}$ is the input to the $(l+1)$th GCN layer, with $H^{(0)} = X$, where $X$ represents the node feature matrix. $H^{(l+1)}$ is the layer output and $\tilde{A} = A + I$, where $A$ is the adjacency matrix of the graph and $I$ is the identity matrix, which adds a self-connection to each node. $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ defines the degree matrix used in the normalized-adjacency message passing, $W^{(l)}$ is the layer weight matrix, and $\sigma(\cdot)$ is a suitable nonlinear activation function. The output of the last GCN layer is flattened and passed to an FCN layer, which populates the latent space. The encoder has three GCN layers that scale the 5 node features to 32, 16, and 2, respectively, followed by a single FCN layer that generates a 15-dimensional latent space. The decoder has a symmetrically inverted structure: the sampled point is first upscaled through an FCN layer, and the resulting output is reshaped and passed to GCN layers that reconstruct the node features.
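The propagation rule of Kipf and Welling (2017) can be made concrete with a dependency-free Python sketch operating on nested lists. The choice of a ReLU activation, the fully connected toy graph, and the function names are assumptions made for the example; the actual models are implemented in PyTorch.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights, activation=lambda v: max(v, 0.0)):
    """One layer H' = act(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-connections: A_tilde = A + I.
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Node degrees of A_tilde, i.e. the diagonal of D_tilde.
    degree = [sum(row) for row in a_tilde]
    # Symmetric normalization D^-1/2 A_tilde D^-1/2.
    a_hat = [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(matmul(a_hat, features), weights)
    return [[activation(v) for v in row] for row in out]


# Toy example: three fully connected nodes with two features each.
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output = gcn_layer(adjacency, features, identity)
print(output)
```

For a fully connected graph with identity weights, every node ends up with the average of all node features, which shows how a single layer already mixes information across the whole event independently of the node ordering.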
Considering all comparison metrics along with the representative results shown in Fig. 2, graph architectures exhibit a definitive improvement in performance compared to the Conv-VAE. The improvement is seen not only in the AUC metric but, more significantly, in the $\epsilon_S$ at low $\epsilon_B$. Because of this, the GCN-VAE is used as the reference architecture in the rest of this section and in section 5.

Physics-motivated high-level features
We also experiment with adding physics-motivated high-level features as explicit inputs to the model, similar to what was done with object multiplicities in section 4.1. Doing so, we intend to check if domain knowledge helps in improving the anomaly detection capability. We pass event information such as the missing transverse momentum in the event ($p_T^{\mathrm{miss}}$), the scalar sum of the jet $p_T$ ($H_T$), and $m_{\mathrm{Eff}} = H_T + p_T^{\mathrm{miss}}$ to the model by concatenating these with the output of the convolutional layers of the encoder. The concatenated output is then passed to the fully connected layers in the encoder to form the latent space. After the point sampled from the latent space passes through the fully connected layers of the decoder, the reconstructed $p_T^{\mathrm{miss}}$, $H_T$, and $m_{\mathrm{Eff}}$ are extracted, and the rest of the layer output is reshaped and passed to the subsequent layers of the decoder.

Figure 3 shows that adding high-level features brings no definitive improvement in performance, leading us to conclude that the baseline model, with its marginally smaller number of trainable parameters, is a good choice.

Anomaly scores
While so far the Chamfer loss has been used as the anomaly score, this is not the only possibility. We consider two alternative metrics: the $D_{KL}$ term in Eq. (1) and the latent-space score $R_z$ (Aarrestad et al., 2021), computed from µ and σ, the mean and RMS returned by the encoder, with a sum over the index $i$ that runs across the latent-space dimensions.
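Both latent-space scores can be written in a few lines of Python. The $D_{KL}$ expression is the standard closed form for a diagonal Gaussian with respect to a unit Gaussian. The expression used for $R_z$, $\sum_i (\mu_i / \sigma_i)^2$, is an assumption made for this sketch and should be checked against the definition in Aarrestad et al. (2021); the function names are hypothetical.

```python
import math


def kl_to_unit_gaussian(mu, sigma):
    """KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return 0.5 * sum(m * m + s * s - 2.0 * math.log(s) - 1.0
                     for m, s in zip(mu, sigma))


def r_z(mu, sigma):
    """Assumed latent-space score: squared means in units of the encoder RMS."""
    return sum((m / s) ** 2 for m, s in zip(mu, sigma))


# An event encoded exactly onto the prior scores zero with both metrics,
# while an event encoded far from the origin scores high.
print(kl_to_unit_gaussian([0.0, 0.0], [1.0, 1.0]), r_z([0.0, 0.0], [1.0, 1.0]))
print(kl_to_unit_gaussian([3.0, -2.0], [0.5, 0.5]), r_z([3.0, -2.0], [0.5, 0.5]))
```

Both quantities depend only on the encoder output, so they can be evaluated without running the decoder, unlike the Chamfer reconstruction score.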
The use of different anomaly scores requires a tuning of the β hyperparameter. Since β determines the relative importance of the $D_{KL}$ and Chamfer loss terms in the loss, the use of one or the other as anomaly score is certainly related to the choice of the optimal β value. Similarly, the use of $R_z$ (i.e., anomaly detection in the latent space) might not be optimal when using a β value that was tuned to emphasize the reconstruction accuracy (i.e., the minimization of the Chamfer term in the loss). On the other hand, the study in Aarrestad et al. (2021) shows that an excessive tuning of the hyperparameters negatively affects the generalization of performance beyond the available dataset.
In order to address this point, we compare three weights for the β term. The first case (β = 1) corresponds to training the VAE without the contribution of the reconstruction loss. In the second case (β = 0.5), the two contributions are equally weighted. The final case (β = $10^{-6}$) corresponds to suppressing the $D_{KL}$ term to a negligible level.
Figure 4 shows that all three anomaly scores underperform in the β = $10^{-6}$ case. The best performing models overall are the β = 1 and β = 0.5 cases. Comparing across the three different anomaly scores, we see that the β = 1 model with the $D_{KL}$ and $R_z$ metrics, as well as the β = 0.5 model with the reconstruction metric, perform the best. All three cases also show very similar performance across all comparison metrics and methods, implying that any of these model and anomaly score combinations is equally suitable. We also find that the β = 1 $D_{KL}$ score and the β = 0.5 reconstruction score show a similar correlation pattern on signal and background. As a result, we expect that only a limited improvement would be obtained by combining the two, which spares us the cost of introducing a new hyperparameter (the relative weight of the two terms) whose optimal value would be signal-specific, as in the case of Caron et al. (2021).

Baseline discrimination
As a result of the tests presented so far, the baseline VAE model is established as a GCN-VAE taking as input the whole set of reconstructed physics objects but no domain-specific high-level features. The Chamfer loss function is used as the anomaly score. The GCN-VAE is trained and tested only with data available within a given channel, and the dataset sizes per channel are described in Table 1. Figure 5 shows the ROC curves for the baseline VAE model on benchmark signals in the four channels. It is evident that we suffer from a shortage of events for some signal models at very low $\epsilon_B$. We still show ROC curves down to $\epsilon_B = 10^{-4}$ to allow one to compare our results to those in Aarrestad et al. (2021), where this range was chosen. We see an overall improvement in $\epsilon_S$ at very low $\epsilon_B$ for the GCN-VAE compared to our Conv-VAE submission in Aarrestad et al. (2021).

NORMALIZING FLOWS
With the GCN-VAE serving as the baseline, we investigate how the use of NFs (Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013) impacts the anomaly detection performance. Normalizing flow layers are inserted between the Gaussian sampling and the decoder. They provide additional complexity to learn better posterior distributions (Rezende and Mohamed, 2015) by morphing the multivariate prior of the latent space to a more suitable, learned function.
In other words, we use the NF layers to handle the fact that a VAE converging to a good output-to-input matching does not necessarily correspond to a configuration with a Gaussian prior in the latent space, $p(z) = G(z)$. Enforcing such a prior typically results in a degradation of the output-to-input matching. With NFs, we learn a generic prior $p(z)$ as $f(G(z))$, where $f$ is the transformation function learned by the NF layers. This is different from the way NFs are traditionally used in VAE training, i.e., to improve the convergence of $f(z)$ to $G(z)$ with a stronger evidence lower bound (ELBO) condition. Because of this, we do not modify the $D_{KL}$ term in the loss, as done in Rezende and Mohamed (2015). The results obtained following this more traditional training procedure are described in the supplementary material. With that procedure, we observe a worse $\epsilon_S$ for the same $\epsilon_B$. This is expected because the ELBO improvement with NFs was introduced in Tomczak and Welling (2017) as a way to improve the VAE generative properties, and it does not imply a better anomaly detection capability.
A NF can be generalized as any invertible, diffeomorphic transformation that can be applied to a given distribution to produce tractable distributions (Papamakarios et al., 2021; Kobyzev et al., 2020). In order to be compatible with variational inference, it is desirable for the transformations to have an efficient mechanism for computing the determinant of the Jacobian, while being invertible (Rezende and Mohamed, 2015). The NFs are trained sequentially, together with the baseline VAE model.
We utilize five major families of flow models:
• Planar flows (PFs) are invertible transformations whose Jacobian determinant can be computed rather efficiently, making them suitable candidates for variational inference (Rezende and Mohamed, 2015). PF transformations are defined as:
$$z' = z + u\, h(w^T z + b),$$
where $u, w \in \mathbb{R}^D$, $b \in \mathbb{R}$, and $h$ is a suitable smooth activation function.
• Sylvester normalizing flows (SNFs) (Berg et al., 2018) build on the planar flow formulation and extend it to be analogous to a multilayer perceptron with one hidden layer of $M$ units and a residual connection:
$$z' = z + A\, h(B z + b),$$
where $A \in \mathbb{R}^{D \times M}$, $B \in \mathbb{R}^{M \times D}$, and $b \in \mathbb{R}^M$. Computing the Jacobian determinant for such a formulation is made more efficient by utilizing the Sylvester determinant identity (Berg et al., 2018). Depending on the way $A$ and $B$ are parametrized, we get different types of SNFs. In this paper we use orthogonal, Householder, and triangular SNFs, as described in Berg et al. (2018).
• Inverse autoregressive flows (IAFs) (Kingma et al., 2016) are computationally efficient normalizing flows based on autoregressive models. Autoregressive transformations are invertible, making them suitable candidates for our case. However, computing the transformation requires multiple sequential steps (Berg et al., 2018). The inverse transformation, on the other hand, leads to certain simplifications, as described in Berg et al. (2018), allowing more efficient parallel computing and thereby making it a more desirable transformation for our case. We use the IAF formulation described in those references, which allows one to stack multiple transformations to achieve more flexibility in producing target distributions.
• Convolutional normalizing flows (ConvolutionalFlows) (Zheng et al., 2018) are an extension of single-hidden-unit planar flows (Kingma et al., 2016) to the case of multiple hidden units, further enhanced by replacing the fully connected network operation with a one-dimensional (1D) convolution, to achieve bijectivity. They are defined by the following transformation:
$$z' = z + u \odot h(\mathrm{conv}(z, w)),$$
where $w \in \mathbb{R}^k$ is the parameter of the 1D convolution filter with a $k$-sized kernel, $u \in \mathbb{R}^D$ is a learned scaling vector, $h$ is a monotonic nonlinear activation function, and $\odot$ denotes pointwise multiplication.
• Autoregressive neural spline flows (NSFARs) (Durkan et al., 2019) are similar to IAFs, where affine transforms are replaced by monotonic rational-quadratic spline transforms, as described in Durkan et al. (2019). They resemble a traditional feed-forward neural network architecture, alternating between linear transformations and elementwise nonlinearities, while retaining an exact, analytic inverse.
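As an illustration of the simplest of these families, the planar flow of Rezende and Mohamed (2015) and its Jacobian determinant can be sketched in plain Python. The choice $h = \tanh$ and the function name are assumptions made for the example; the determinant follows from the matrix determinant lemma, $\det J = 1 + h'(w^T z + b)\, u^T w$.

```python
import math


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b).

    Returns the transformed point and log|det J|, where
    det J = 1 + h'(w.z + b) * (u.w).
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    det = 1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(det))


# A stack of flows is applied sequentially, summing the log-determinants.
z, total_log_det = [0.3, -1.2], 0.0
for u, w, b in [([0.5, 0.1], [2.0, -0.3], 0.1), ([0.2, 0.4], [0.7, 0.9], -0.5)]:
    z, log_det = planar_flow(z, u, w, b)
    total_log_det += log_det
print(z, total_log_det)
```

For $h = \tanh$, a sufficient condition for invertibility given in Rezende and Mohamed (2015) is $w^T u \geq -1$; the toy parameters above satisfy it, while a trainable implementation has to enforce it explicitly.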
The hyperparameters for each normalizing flow architecture are chosen arbitrarily to avoid overtuning on the available dataset, as learned from Aarrestad et al. (2021). The planar flow model consists of a stack of six flows, each made of three dense layers with 90 neurons each. SNFs are defined by stacking six flows with 8 orthogonal, Householder, and triangular transformations for each of the respective types of SNF. IAFs are constructed with four masked autoencoder for distribution estimation (MADE) (Germain et al., 2015) layers, as described in Kingma et al. (2016), each containing 330 neurons. ConvolutionalFlows include four flow layers with kernel size $k = 7$ and kernel dilation applied as described in Zheng et al. (2018). NSFARs are defined by stacking four flow layers, each with $K = 64$ bins and eight hidden features.
Figure 6 shows the results of all GCN-VAE models combined with all the different types of flows described in section 5. Based on results from all data channels combined through all six strategies mentioned in section 3, and considering the variance across trainings from different random seeds (see supplementary material), it is evident that using normalizing flows improves not only the AUC metric but also the signal efficiencies at low background efficiencies. We find that the Householder variant of SNFs produces the best improvement with respect to the baseline GCN-VAE model. The exercise was also repeated with a Conv-VAE model and similar trends were observed. There, the normalizing flows showed a larger improvement over the baseline Conv-VAE than for the GCN-VAE model, but the overall results are less accurate than those of the GCN-VAE with normalizing flows.
Figure 7 shows the ROC curves for the best presented model, GCN-VAE_HouseholderSNF, across all available signal samples in all data channels. For some of the samples, the small dataset size translates into a discontinuous curve and larger uncertainties.

CONCLUSIONS
We constructed a graph-based anomaly detection model to identify new physics events in the DarkMachines challenge dataset. Inspired by the outcome of this challenge, specific model design choices (data representation, use of physics-motivated high-level features, and anomaly score definition) were further optimized in order to maximize anomaly detection performance. As is the case for many other deep learning applications to particle physics data, we observed that the graph architecture better captures the point-cloud nature of this data, resulting in an enhanced performance.
On top of this baseline, we investigated the impact of using a stack of normalizing flows in the latent space of the variational autoencoder (VAE), between the Gaussian sampling and the decoding, in order to improve the accuracy of the prior learning process by morphing the Gaussian prior to a more suitable function, learned during the training.

IMPACT OF NORMALIZING FLOWS ON THE LATENT SPACE PRIOR
To understand exactly how non-Gaussian the latent distributions become after passing through the normalizing flow layers, we attempt to visualize our 15-dimensional latent space via two methods. First, we create 2D histograms from multiple randomly chosen pairs of dimensions. Fig. S1 shows one such distribution between dimensions 5 and 9. We also make an approximate visualization by first performing a principal component analysis (PCA) to express the latent space in two dimensions, and then plotting the resulting 2D histogram, as shown in Fig. S2.
We also show a comparison of the latent space obtained from the trainings with the two different loss functions, as described in the previous section of the supplementary material. We see that our loss function results in a more complex, non-Gaussian distribution compared to the modified ELBO loss, and this is desirable to improve anomaly detection performance using VAEs. It is important to note that using our loss function may not necessarily correspond to better reconstruction of the input for the trained class (background samples) or better generation, but it rather contributes to a larger separation between the trained class and the non-trained class (signal samples) by making it harder for the decoder to reconstruct signal samples. As a result, the anomaly identification performance increases regardless of whether the reconstruction of the background samples improves or not.

Figure 1. Anomaly detection performance for the Conv-VAE with different inputs (see text for more details): all physics objects in the event (AllObj); truncated input object list (TrdObj); all objects and array of object multiplicity (AllObj+Mult); truncated input object list and array of object multiplicity (TrdObj+Mult).

Figure 2. Comparison of the GCN-VAE and Conv-VAE performances, in terms of the benchmark figures of merit adopted in the paper.

Figure 3. Comparison of the GCN-VAE performance with and without high-level features added as a separate input.

Figure 4. Comparison of anomaly detection performance from different anomaly score definitions, applied to the GCN-VAE.

Figure 5. ROC curves for the baseline GCN-VAE model in channel 1 (top left), channel 2a (top right), channel 2b (bottom left), and channel 3 (bottom right), computed from the $\epsilon_S$ and $\epsilon_B$ values obtained on the background sample and the benchmark signal samples. Most of the ROC curves are not smooth, due to the small dataset size for some of the channels.

Figure 6. Comparison of anomaly detection performance for GCN-VAE models with different normalizing flow architectures in the latent space.

Figure S1. Latent space visualization by making histograms across arbitrarily chosen dimensions 5 and 9, before (left) and after normalizing flow transformations (right) with our loss function (top) and the modified ELBO loss (bottom).

Figure S2. Latent space visualization after 2D PCA, before (left) and after normalizing flow transformations (right) with our loss function (top) and the modified ELBO loss (bottom).

Table 1. Summary of the available dataset size.

Table 2. Benchmark signal processes contributing to the signal dataset in each channel. The process code, adopted in this study, is taken from Aarrestad et al. (2021).
To reach a configuration with a Gaussian prior in the latent space (e.g., when training a VAE as a generative model), one typically uses a β-VAE with an increased weighting of the $D_{KL}$ regularizer. This typically results in a degradation of the output-to-input matching.